Yes, one array cell corresponds roughly to one SQL tuple. One key SciDB difference is that we separate attributes from dimensions and treat them very differently: lookup by dimension, filtering on dimensions, and joins on dimensions are all optimized and should perform quite well.
The chunk serves as an important unit of processing. Namely:
- the chunk is a unit of storage; we write one chunk at a time and read one chunk at a time
- the chunk is a unit of network transfer; we send data between nodes one chunk at a time
- the chunk is often a unit of processing; many operators consume and produce data one chunk at a time
Also observe that, for an array with n dimensions, chunking ensures that cells that are close to each other in n-dimensional space are likely to be stored in the same chunk - that is, "close to each other" on disk.
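A minimal sketch of that locality property, in Python rather than SciDB (the function name and interface here are hypothetical, purely for illustration): with regular chunking, a cell's chunk is found by integer-dividing each coordinate by the chunk interval along that dimension, so nearby cells map to the same chunk.

```python
# Hypothetical sketch (not SciDB code): how regular chunking maps
# n-dimensional cell coordinates to a chunk coordinate, so that
# nearby cells land in the same chunk on disk.

def chunk_coord(cell, chunk_intervals):
    """Map a cell's coordinates to the coordinates of its chunk."""
    return tuple(c // ci for c, ci in zip(cell, chunk_intervals))

# A 2-D array chunked 1000 x 1000 along its two dimensions:
intervals = (1000, 1000)
print(chunk_coord((5, 7), intervals))      # -> (0, 0)
print(chunk_coord((999, 42), intervals))   # -> (0, 0): same chunk as above
print(chunk_coord((1000, 42), intervals))  # -> (1, 0): crosses a chunk boundary
```

Cells (5, 7) and (999, 42) land in the same chunk, so a query touching a small n-dimensional region usually touches only a handful of chunks.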
What's a good chunk size for your 1-D array with 31 million cells?
We need to know two things:
- how much data is in each cell (what are the attributes)?
- how sparse is the array? Are there empty cells, and how often do they occur?
Given these two pieces of information, you should set the chunk size so that each chunk occupies several megabytes - around 4-8MB - small enough to process comfortably in memory, big enough to amortize disk reads and network transfers.
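As a back-of-the-envelope sketch (the helper below is hypothetical, not part of SciDB), you can work backwards from the target chunk size: divide the target byte count by the average bytes actually stored per logical cell, which is bytes-per-cell times the fill fraction for a sparse array.

```python
# Hypothetical helper (not a SciDB API): pick a chunk interval for a
# 1-D array so the average chunk holds roughly target_mb megabytes,
# given the bytes per cell and the fraction of cells that are occupied.

def suggest_chunk_interval(bytes_per_cell, fill_fraction, target_mb=6):
    target_bytes = target_mb * 1024 * 1024
    # Each logical cell contributes bytes_per_cell * fill_fraction bytes
    # on average, so divide to get the number of cells per chunk.
    return int(target_bytes / (bytes_per_cell * fill_fraction))

# 16 bytes per cell, fully dense, aiming at the middle of the 4-8MB range:
print(suggest_chunk_interval(16, 1.0))  # -> 393216 cells per chunk
# Same cell size, but only half the cells occupied: interval doubles.
print(suggest_chunk_interval(16, 0.5))  # -> 786432 cells per chunk
```

For the 31-million-cell example, a dense array of 16-byte cells at ~393K cells per chunk would split into roughly 80 chunks, each around 6MB.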
In the future we are considering some improvements including:
- automatic chunk size (system does it for you)
- separate units for processing versus storage versus network