[opensuse] Processing large wads of data with SuSE
Hi Folks,
I've got, at this point, a rather ill-defined requirement that I thought I'd run by you. I thought about posting on the off-topic list, but this is a real world task that I hope SuSE is up to.
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
The number of elements for a fully populated array is huge (2.5e+13). There are ways to reduce the number of elements, maybe by the sources being in a smaller patch than the receivers. It might also be possible to use a polar grid about each source coordinate with perhaps 50 radials. It's thought that the total array size could be pared to 10-TB or less.
Do any of you have any thoughts about how we could set this up? A database? A file structure using HDF5 or NetCDF? The programming language of choice is FORTRAN, but anything that makes sense could be used.
Do you think this kind of problem could be hosted on anything affordable by a small company (~$30,000)?
I know it's rather an ill-defined problem at this point, but I'd guess that HDF5 would make the most sense.
Thanks in advance,
Lew
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
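For readers following the numbers, here is a minimal sketch of the arithmetic behind the 2.5e+13 figure. The grid dimensions come from the post; the 8-byte record (one 4-byte amplitude plus one 4-byte delay) is purely an assumption for illustration, since the post says each pair would really hold a vector of values.

```python
# Grid dimensions as stated in the post.
NX, NY, NZ = 500, 500, 20

points = NX * NY * NZ      # distinct (X, Y, Z) locations in the volume
pairs = points * points    # every ordered (source, receiver) pair

# Hypothetical payload per pair: one 4-byte amplitude and one 4-byte delay.
# The real record would be a vector, so treat this as a lower bound.
BYTES_PER_PAIR = 8
full_bytes = pairs * BYTES_PER_PAIR

TARGET_BYTES = 10 * 10**12             # the 10 TB goal, in decimal terabytes
budget_per_pair = TARGET_BYTES / pairs  # bytes available if no pair is dropped

print(f"grid points:   {points:,}")
print(f"ordered pairs: {pairs:.1e}")
print(f"full table:    {full_bytes / 1e12:.0f} TB at {BYTES_PER_PAIR} B/pair")
print(f"10 TB allows:  {budget_per_pair:.1f} B/pair over all pairs")
```

Under that assumed record size, the fully populated table comes to about 200 TB, and a 10 TB budget leaves less than one byte per pair, which is why the reductions mentioned in the post (smaller source patch, polar grid) would be needed.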
Lew Wolfgang wrote:
Hi Folks,
I've got, at this point, a rather ill-defined requirement that I thought I'd run by you. I thought about posting on the off-topic list, but this is a real world task that I hope SuSE is up to.
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
Lew,
For an ill-defined requirement, an equally ill-defined guess at a solution.
First, what information is held at each array address? My first thought was, "if nothing more than coordinates are held in the array, why store the information at all?" Sound loss through water seems like nothing more than a 3-dimensional space problem with decrease in signal amplitude resulting from propagation through a known medium being the answer. If that were the case, there would be no storage requirement at all. The response would be based on a pure computation of acoustic decay at or between points A and B with the source at point C.
Second ill-defined guess would be, "is it really necessary to maintain a data array with that 'fine-grained' a data set, or can we have fewer points within the array and then interpolate between the known points on a computational basis to provide the needed answers?" Perhaps cutting down by several orders of magnitude the size of the number behind the E.
From a code standpoint, if you do need an array that large, then you will need whatever you write it in to access some type of "directory structure" to access the data with little or no computational overhead. A pure transactional database approach seems computationally expensive for accessing data that isn't changing. Hopefully FORTRAN has improved many times over in its data interfacing and has something more to offer than "common-block" access that was always a hindrance.
Both FORTRAN and C are computationally efficient with a number of good open source math libraries readily available. However, how you access that much data in an efficient manner seems like it would drive the choice of what you write it in.
What are you doing anyway? Russian Submarine Tracking??
--
David C. Rankin, J.D., P.E.
Rankin Law Firm, PLLC 510 Ochiltree Street Nacogdoches, Texas 75961 Telephone: (936) 715-9333 Facsimile: (936) 715-9339 www.rankinlawfirm.com
David C. Rankin wrote:
For an ill-defined requirement, an equally ill-defined guess at a solution. First, what information is held at each array address? My first thought was, "if nothing more than coordinates are held in the array, why store the information at all?" Sound loss through water seems like nothing more than a 3-dimensional space problem with decrease in signal amplitude resulting from propagation through a known medium being the answer. If that were the case, there would be no storage requirement at all. The response would be based on a pure computation of acoustic decay at or between points A and B with the source at point C.
Hi David,
Thanks for the reply, sorry I'm late in replying! I've been really busy these days.
The data stored at each element describes the transmission loss between the two points. Sound propagation through a body of water is anything but uniform. Temperature differences give you density differences, also affected by salinity in ocean water, and you end up with a very complicated sound field. I'm fairly sure the data will be calculated in advance based on models.
Second ill-defined guess would be, "is it really necessary to maintain a data array with that 'fine-grained' a data set, or can we have fewer points within the array and then interpolate between the known points on a computational basis to provide the needed answers?" Perhaps cutting down by several orders of magnitude the size of the number behind the E.
I think the researchers are already doing this, good observation!
From a code standpoint, if you do need an array that large, then you will need whatever you write it in to access some type of "directory structure" to access the data with little or no computational overhead. A pure transactional database approach seems computationally expensive for accessing data that isn't changing. Hopefully FORTRAN has improved many times over in its data interfacing and has something more to offer than "common-block" access that was always a hindrance.
Agree on the database overhead. There are apparently HPC projects that specialize in this kind of problem. I think they come as libraries that would be called by FORTRAN code. The HDF5 project is an example of this kind of library, but I have zero experience with them.
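To make the "little or no computational overhead" idea concrete, here is a minimal sketch of direct-offset lookup in a flat binary file, written in Python rather than FORTRAN for brevity. It is not HDF5 or NetCDF, and it is not anyone's actual design: the record layout (two 4-byte floats), the toy grid size, and the helper names `point_index`, `record_offset`, and `read_pair` are all assumptions made up for illustration.

```python
import os
import struct
import tempfile

# Toy grid so the demo file stays tiny; the thread's grid is 500 x 500 x 20.
NX, NY, NZ = 4, 4, 2
N_POINTS = NX * NY * NZ

# Hypothetical fixed-size record: amplitude and delay as two 4-byte floats.
RECORD = struct.Struct("<ff")

def point_index(x, y, z):
    """Flatten an (x, y, z) grid coordinate into a single integer."""
    return (x * NY + y) * NZ + z

def record_offset(src, rcv):
    """Byte offset of the record for a (source, receiver) pair."""
    return (point_index(*src) * N_POINTS + point_index(*rcv)) * RECORD.size

def read_pair(fh, src, rcv):
    """Seek straight to one record and unpack it; no search, no index."""
    fh.seek(record_offset(src, rcv))
    return RECORD.unpack(fh.read(RECORD.size))

# Build a demo file with made-up values so the lookup can be checked:
# each record stores (source index, receiver index) as floats.
path = os.path.join(tempfile.mkdtemp(), "tl_demo.bin")
with open(path, "wb") as fh:
    for s in range(N_POINTS):
        for r in range(N_POINTS):
            fh.write(RECORD.pack(float(s), float(r)))

with open(path, "rb") as fh:
    amplitude, delay = read_pair(fh, (1, 2, 0), (3, 0, 1))
print(amplitude, delay)
```

The point of the sketch is that a read costs one multiplication chain and one seek, with no query planning or transaction machinery. A library such as HDF5 would layer chunking, metadata, and portability on top of the same basic idea of computed addresses.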
Both FORTRAN and C are computationally efficient with a number of good open source math libraries readily available. However, how you access that much data in an efficient manner seems like it would drive the choice of what you write it in.
What are you doing anyway? Russian Submarine Tracking??
Har! I actually laughed out loud when I read this! Actually no, this has nothing to do with submarines. I'm not sure if the principal investigator wants me to tell more at this point, I'll ask.
Regards,
Lew
Lew Wolfgang wrote:
David C. Rankin wrote:
For an ill-defined requirement, an equally ill-defined guess at a solution. First, what information is held at each array address? My first thought was, "if nothing more than coordinates are held in the array, why store the information at all?" Sound loss through water seems like nothing more than a 3-dimensional space problem with decrease in signal amplitude resulting from propagation through a known medium being the answer. If that were the case, there would be no storage requirement at all. The response would be based on a pure computation of acoustic decay at or between points A and B with the source at point C.
Hi David,
Thanks for the reply, sorry I'm late in replying! I've been really busy these days.
The data stored at each element describes the transmission loss between the two points. Sound propagation through a body of water is anything but uniform. Temperature differences give you density differences, also affected by salinity in ocean water, and you end up with a very complicated sound field. I'm fairly sure the data will be calculated in advance based on models.
Initially, this would rather suggest that you are dealing with the (graphical?) representation of the data rather than the calculation of the value of that data. In this case the issue with the array is moot, as the code really need only concern itself with translation of the data into the relevant visual format. You do not really need access to all of the data at any one time and the only thing that needs to be retained is the visual representation (which may be in the representational device).
This explanation also suggests that some further processing is required, and therefore rather suggests that the size of the array is the least of the issues involved. It looks as if the values for each cell need to be calculated not only from the status of the cell itself but from the status of adjoining cells. If the calculation involves iteration towards a stable solution then there is an awful lot of brute force computation involved.
I think the researchers are already doing this, good observation!
From a code standpoint, if you do need an array that large, then you will need whatever you write it in to access some type of "directory structure" to access the data with little or no computational overhead. A pure transactional database approach seems computationally expensive for accessing data that isn't changing. Hopefully FORTRAN has improved many times over in its data interfacing and has something more to offer than "common-block" access that was always a hindrance.
Some sideways thinking here may be appropriate. You seem to be talking about a number of cells which not only have values but relationships with other cells. With traditional arrays of values this might be rather difficult to handle as a representational structure. What you are really talking about is an array of cellular objects.
Now I am not very sure that FORTRAN, even in its modern form, is really up to representing this effectively. C will do if each object is a collection of values; C++ (if you are prepared for the culture shock) or some other compiled OOP language will do if each cell not only has values but also functional relationships with other cells. All these languages are computationally effective, but C and C++ are more flexible about data representation and I/O than FORTRAN. (But this does not stop one using FORTRAN for the math and C/C++ for the data management and I/O.) Even if you are only dealing with representation of data, an OOP related approach has potential advantages (if only for modelling the programming problem).
Agree on the database overhead. There are apparently HPC projects that specialize in this kind of problem. I think they come as libraries that would be called by FORTRAN code. The HDF5 project is an example of this kind of library, but I have zero experience with them.
Because of the potential computational intensity of this process some form of HPC solution is obviously required. Whether SuSE is an appropriate tool would depend on how well it supports the HPC solution selected.
--
I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone.
Bjarne Stroustrup
Lew Wolfgang wrote:
Hi Folks,
I've got, at this point, a rather ill-defined requirement that I thought I'd run by you. I thought about posting on the off-topic list, but this is a real world task that I hope SuSE is up to.
SUSE won't have a problem with handling it - there are distros better suited for this type of computation, but that usually means they've got the tools pre-selected etc.
The number of elements for a fully populated array is huge (2.5e+13). There are ways to reduce the number of elements, maybe by the sources being in a smaller patch than the receivers. It might also be possible to use a polar grid about each source coordinate with perhaps 50 radials. It's thought that the total array size could be pared to 10-TB or less.
You've neglected to mention how fast or how often you need to process this amount of data. Storing it, on disk or in memory, is not a problem - 10 x 1 TB SATA drives and/or 640 nodes with 16 GB of memory each.
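As a quick check of those two hardware figures against the 10 TB target, here is a small sketch. It assumes decimal units (1 TB = 10^12 bytes) and ignores filesystem, operating system, and redundancy overhead, none of which are specified in the thread.

```python
TB = 10**12  # decimal terabyte, as drive vendors count it
GB = 10**9

target = 10 * TB              # pared-down data set size from the thread
disk_capacity = 10 * 1 * TB   # ten 1 TB SATA drives
ram_capacity = 640 * 16 * GB  # 640 nodes with 16 GB each

print(f"disk: {disk_capacity / TB:.2f} TB")
print(f"RAM:  {ram_capacity / TB:.2f} TB")
print("disk holds target:", disk_capacity >= target)
print("RAM holds target: ", ram_capacity >= target)
```

Both options just reach the target on paper (10.00 TB and 10.24 TB), so in practice either would likely need some extra headroom.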
Do you think this kind of problem could be hosted on anything affordable by a small company (~$30,000)?
I think quite possibly, yes. USD30K will buy you a decent amount of hardware, certainly plenty of space for holding your data. Here's an article about someone building an HPC cluster: http://jessen.ch/ammonite/ It's a bit old, so you'll certainly get more for your money today.
I know it's rather an ill-defined problem at this point, but I'd guess that HDF5 would make the most sense.
I think perhaps your question is better asked on the beowulf list, where problems such as these are dealt with regularly. That list has some very knowledgeable people.
http://www.beowulf.org/mailman/listinfo/beowulf
/Per Jessen, Zürich
On Monday 2008-08-04 at 20:09 -0700, Lew Wolfgang wrote:
I thought I'd run by you. I thought about posting on the off-topic list, but this is a real world task that I hope SuSE is up to.
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
I would instead try to calculate interpolation functions and store those. Chances are you will need to interpolate later, anyway, so why not do it beforehand?
However, I don't know how to create a 3-dimensional interpolation function. Maybe packages like matlab or equivalent do that easily.
Just an idea.
--
Cheers, Carlos E. R.
On Tuesday 05 August 2008 13:27:02 Carlos E. R. wrote:
On Monday 2008-08-04 at 20:09 -0700, Lew Wolfgang wrote:
I thought I'd run by you. I thought about posting on the off-topic list, but this is a real world task that I hope SuSE is up to.
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
I would instead try to calculate interpolation functions and store those. Chances are you will need to interpolate later, anyway, so why not do it beforehand?
However, I don't know how to create a 3-dimensional interpolation function. Maybe packages like matlab or equivalent do that easily.
Just an idea.
-- Cheers, Carlos E. R.
For data processing, a free replacement for Matlab is IT++. Try their list too for suggestions.
--
Bogdan Cristea software engineer Sytron Technologies Overseas www.sytron.ro
Carlos E. R. wrote:
I would instead try to calculate interpolation functions and store those. Chances are you will need to interpolate later, anyway, so why not do it beforehand?
However, I don't know how to create a 3-dimensional interpolation function. Maybe packages like matlab or equivalent do that easily.
Just an idea.
Hi Carlos,
Everyone in this small company has their own Matlab license, so that's certainly a possibility. I've been trying to get them interested in SciPy, with some success.
Regards,
Lew
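On the question of how a 3-dimensional interpolation function might look, one common option is trilinear interpolation on a regular grid. The sketch below is plain Python with no dependencies and uses a made-up toy grid. It interpolates between stored coarse grid values, which is closer to the "fewer points plus interpolation" suggestion earlier in the thread than to storing fitted functions. The function name `trilinear` and the nested-list layout are assumptions for illustration. For real work, SciPy's `scipy.interpolate.RegularGridInterpolator` covers the same regular-grid case.

```python
from math import floor

def trilinear(grid, x, y, z):
    """Interpolate a value at fractional index (x, y, z) in a nested-list grid.

    grid[i][j][k] holds the value at integer grid point (i, j, k).
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Lower corner of the enclosing cell, clamped so the upper corner exists.
    i = min(max(int(floor(x)), 0), nx - 2)
    j = min(max(int(floor(y)), 0), ny - 2)
    k = min(max(int(floor(z)), 0), nz - 2)
    # Fractional position inside the cell (in [0, 1] for in-range points).
    fx, fy, fz = x - i, y - j, z - k

    def lerp(a, b, t):
        return a + (b - a) * t

    # Collapse the cell one axis at a time: x, then y, then z.
    c00 = lerp(grid[i][j][k], grid[i + 1][j][k], fx)
    c10 = lerp(grid[i][j + 1][k], grid[i + 1][j + 1][k], fx)
    c01 = lerp(grid[i][j][k + 1], grid[i + 1][j][k + 1], fx)
    c11 = lerp(grid[i][j + 1][k + 1], grid[i + 1][j + 1][k + 1], fx)
    c0 = lerp(c00, c10, fy)
    c1 = lerp(c01, c11, fy)
    return lerp(c0, c1, fz)

# Toy 3x3x3 grid filled from a linear function, which trilinear
# interpolation reproduces exactly between grid points.
def f(i, j, k):
    return 2.0 * i + 3.0 * j - 1.0 * k

coarse = [[[f(i, j, k) for k in range(3)] for j in range(3)] for i in range(3)]
value = trilinear(coarse, 0.5, 1.25, 1.75)
print(value)
```

A real transmission-loss field is far from linear, so the accuracy of any such scheme would depend on how coarse the stored grid is relative to the structure of the sound field.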
Lew Wolfgang wrote:
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
The number of elements for a fully populated array is huge (2.5e+13). There are ways to reduce the number of elements, maybe by the sources being in a smaller patch than the receivers. It might also be possible to use a polar grid about each source coordinate with perhaps 50 radials. It's thought that the total array size could be pared to 10-TB or less.
It feels like there's probably a better way to represent this problem that will reduce the storage. Perhaps by trading computation for storage. But on the little information presented, that's just a wild-a**ed guess.
Cheers, Dave
On Tuesday 05 August 2008 01:03:24 am Dave Howorth wrote:
Lew Wolfgang wrote:
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
The number of elements for a fully populated array is huge (2.5e+13). There are ways to reduce the number of elements, maybe by the sources being in a smaller patch than the receivers. It might also be possible to use a polar grid about each source coordinate with perhaps 50 radials. It's thought that the total array size could be pared to 10-TB or less.
It feels like there's probably a better way to represent this problem that will reduce the storage. Perhaps by trading computation for storage. But on the little information presented, that's just a wild-a**ed guess.
Cheers, Dave
It is very interesting that so much specific information is posted. Do we all have to be killed now?
more seriously though, the first impression one gets is that the physics of the problem need to be thought out a bit more. once that is done, i can't think why one of numerous existing fea solutions can't handle the problem, especially solutions with a lot of substructuring built in. yes, e13 is a large number, but if the problem really is of that size, then there are very few public forums where useful conversation can be carried out. this is not a -doze/-nix/-nucs issue, it's an issue that mr. evil would definitely pay *one million* to get his hands on, mini me might even go higher...
d.
----- Original Message -----
From:
On Tuesday 05 August 2008 01:03:24 am Dave Howorth wrote:
Lew Wolfgang wrote:
The problem takes a volume of water that can be represented as a three-axis array. The X and Y directions have a modulus of 500, the Z 20. We then need to access this volume by specifying any two X,Y,Z points to extract data that represents acoustic transmission loss between the specified points. The data returned would be a vector of sound amplitudes and time delays. An array of frequency vs transmission loss might also be required for each point.
The number of elements for a fully populated array is huge (2.5e+13). There are ways to reduce the number of elements, maybe by the sources being in a smaller patch than the receivers. It might also be possible to use a polar grid about each source coordinate with perhaps 50 radials. It's thought that the total array size could be pared to 10-TB or less.
It is very interesting that so much specific information is posted. Do we all have to be killed now?
Since I was just today in the car listening to an article in audible.com's version of Scientific American... My vote is modeling sound in widely varying ocean conditions to see if it's really possible that navy sonar or other man-made noise from generic shipping is really killing whales?
--
Brian K. White brian@aljex.com http://www.myspace.com/KEYofR
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro BBx Linux SCO FreeBSD #callahans Satriani Filk!
kanenas@hawaii.rr.com wrote:
It is very interesting that so much specific information is posted. Do we all have to be killed now? more seriously though, the first impression one gets is that the physics of the problem need to be thought out a bit more. once that is done, i can't think why one of numerous existing fea solutions can't handle the problem, especially solutions with a lot of substructuring built in. yes, e13 is a large number, but if the problem really is of that size, then there are very few public forums where useful conversation can be carried out. this is not a -doze/-nix/-nucs issue, it's an issue that mr. evil would definitely pay *one million* to get his hands on, mini me might even go higher...
Har! That's what the PI's interested in! Do you have Mini Me's phone number?
Regards,
Lew
participants (9)
- Bogdan Cristea
- Brian K. White
- Carlos E. R.
- Dave Howorth
- David C. Rankin
- G T Smith
- kanenas@hawaii.rr.com
- Lew Wolfgang
- Per Jessen