by plumber » Mon Sep 26, 2011 3:23 pm
For what it's worth, here is how I organize the USNOB-1 data set in SciDB:
CREATE EMPTY ARRAY Objects <
Proper_Motion_RA : int32,
Error_in_Motion_RA : int32,
Proper_Motion_DECL : int32,
Error_in_Motion_DECL : int32,
Obs_Epoch : double,
B_mag : double,
B_mag_flag : int32,
R_mag : double,
R_mag_flag : int32,
B_mag2 : double,
B_mag2_flag : int32,
R_mag2 : double,
R_mag2_flag : int32 >
[ RA(right_ascention)=36000000,144000,8400, DECL(declination)=18000000,72000,4200];
The tricks are to:
1. Use the right_ascention (we know! we know! ticket logged) and declination user-defined types. These are intended to map between the double precision value of the USNOB data, and the index values used for the Array.
2. Loading this data is a bit tricky, as it comes to us in a .CSV file. At the moment (we're not written a good loader because we're trying to get the underlying mechanics right) you need to go through a rather (too) elaborate process to get your data into this format. The following script illustrates how to load the first 60,000 objects in the catalog (we do this as part of testing).
time iquery -aq "
CREATE EMPTY ARRAY Raw_Objects <
RA : right_ascention,
DECL : declination,
Proper_Motion_RA : int32,
Error_in_Motion_RA : int32,
Proper_Motion_DECL : int32,
Error_in_Motion_DECL : int32,
Obs_Epoch : double,
B_mag : double,
B_mag_flag : int32,
R_mag : double,
R_mag_flag : int32,
B_mag2 : double,
B_mag2_flag : int32,
R_mag2 : double,
R_mag2_flag : int32 >
[ LN=0:59999,60000,0 ]"
#
# 3.3 Load the data using the csv2scidb utility.
#
rm -rf /tmp/Load_File.pipe
mkfifo /tmp/Load_File.pipe
csv2scidb -c 60000 < /home/plumber/Devel/POCs/poc/LSST/Small_LSST.csv > /tmp/Load_File.pipe &
time iquery -naq "load (Raw_Objects, '/tmp/Load_File.pipe')"
#
# NOTE: There's a problem. The existing redimension_store() can't target an
# array with overlapping chunks. So, we need a two stage process. First,
# we need to get the data to a chunk model that doesn't have overlaps.
# Then second, we use reshape() to get to a chunk format with overlaps.
#
# Q4: Create the intermediate
#
# 4.1 Hygiene.
time iquery -aq "remove ( ObjectsInter )"
#
# 4.2 Create the array to hold the intermediate result, which we
# can use to reshape() into the eventual target.
time iquery -aq "
CREATE EMPTY ARRAY ObjectsInter <
Proper_Motion_RA : int32,
Error_in_Motion_RA : int32,
Proper_Motion_DECL : int32,
Error_in_Motion_DECL : int32,
Obs_Epoch : double,
B_mag : double,
B_mag_flag : int32,
R_mag : double,
R_mag_flag : int32,
B_mag2 : double,
B_mag2_flag : int32,
R_mag2 : double,
R_mag2_flag : int32 >
[ RA(right_ascention)=36000000,144000,0, DECL(declination)=18000000,72000,0]"
#
# 4.3 redimension_store()
#
time iquery -naq "redimension_store ( Raw_Objects, ObjectsInter )"
...
# Q5: Now, populate the Objects array. We need the target
# Objects array because it has overlaps needed to
# make the window() queries work.
#
# 5.1 Create the new target Objects array.
time iquery -anq "store ( repart ( ObjectsInter, Objects ), Objects )"
3. At this point, you can do the following kinds of queries:
time iquery -aq "join (
join (
project ( min (Objects, B_mag), B_mag_min),
project ( max (Objects, B_mag), B_mag_max)
),
join (
project ( min (Objects, R_mag), R_mag_min),
project ( max (Objects, R_mag), R_mag_max)
)
)"
time iquery -o sparse -aq "sum( regrid ( Objects, 10, 10, count(*)), count)"