How to set an appropriate CHUNK value when using "create array"?


Postby hlfwm » Wed Jun 29, 2011 11:49 pm

How do I set an appropriate CHUNK value when creating a 1-D (one-dimensional) array that has 31 million records?

Why did SciDB invent the concept of a CHUNK?

Does one cell correspond to one record in an RDBMS?

Thanks very much!
hlfwm
 

Re: How to set an appropriate CHUNK value when using "create array"

Postby apoliakov » Thu Jun 30, 2011 12:27 am

Hello,

Yes, one array cell corresponds roughly to one SQL tuple. One key difference in SciDB is that we separate attributes from dimensions and treat them very differently. Lookups by dimension, filters on dimensions and joins on dimensions are optimized and should perform quite well.

The chunk is the basic unit that the system works with. Namely:
- the chunk is a unit of storage; we write one chunk at a time and read one chunk at a time
- the chunk is a unit of network transfer; we send data between nodes one chunk at a time
- the chunk is sometimes a unit of processing; some operators process one chunk of data at a time

Also observe that, when the array has n dimensions, chunks make sure that cells that are close to each other in n-dimensional space are likely to be stored in the same chunk - or "close to each other" on disk.
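To illustrate the locality point, here is a minimal Python sketch, assuming the array is cut into a regular grid with one fixed chunk length per dimension. The function name and parameters are hypothetical and are not part of SciDB; chunk overlap is ignored.

```python
def chunk_coords(cell, lows, chunk_lengths):
    """Return the grid position of the chunk that holds `cell`.

    cell          -- cell coordinates, one integer per dimension
    lows          -- lower bound of each dimension
    chunk_lengths -- number of cells per chunk along each dimension
    Assumes regular grid chunking; this is an illustration, not SciDB code.
    """
    return tuple((c - lo) // n for c, lo, n in zip(cell, lows, chunk_lengths))


# Two neighbouring cells in a 2-D array with 4x4 chunks share a chunk,
# while a cell across the chunk boundary lands in the next one.
same_a = chunk_coords((5, 6), (0, 0), (4, 4))
same_b = chunk_coords((6, 7), (0, 0), (4, 4))
other = chunk_coords((8, 7), (0, 0), (4, 4))
```

Under this assumption, cells (5, 6) and (6, 7) both map to chunk (1, 1), so they would be read and written together, while (8, 7) maps to chunk (2, 1).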

What's a good chunk size for your 1-D array with 31 million cells?
We need to know two things:
- how much data is in each cell (what are the attributes)?
- how sparse is the array? are there empty cells and how often do they occur?
Given these two pieces of data, you should set a chunk size such that a chunk is several megabytes in size - around 4-8MB - small enough to fit in the CPU cache, big enough to optimize disk reads and network transfer.
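As a rough worked example of that rule of thumb, the Python sketch below estimates how many cells to put in each chunk. The 16 bytes per cell, the 6 MiB target (a midpoint of the 4-8MB range) and the function names are assumptions for illustration only; the estimate ignores compression and any per-chunk overhead.

```python
import math


def chunk_length(bytes_per_cell, density=1.0, target_bytes=6 * 1024 * 1024):
    """Estimate cells per chunk so the non-empty cells total about target_bytes.

    bytes_per_cell -- combined size of all attributes in one cell (assumed)
    density        -- fraction of cells that are non-empty, in (0, 1]
    target_bytes   -- desired chunk size; 6 MiB is an assumed midpoint of 4-8MB
    """
    if bytes_per_cell <= 0:
        raise ValueError("bytes_per_cell must be positive")
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    filled_cells = target_bytes / bytes_per_cell
    # A sparse array needs more logical cells per chunk to hold the same data.
    return max(1, round(filled_cells / density))


def chunk_count(total_cells, length):
    """Number of chunks needed to cover a 1-D array of total_cells cells."""
    return math.ceil(total_cells / length)


dense_length = chunk_length(16)                 # 393216 cells per chunk
sparse_length = chunk_length(16, density=0.5)   # 786432 cells per chunk
dense_chunks = chunk_count(31_000_000, dense_length)  # 79 chunks
```

With these assumed numbers, a dense 31-million-cell array would use a chunk length of about 393,000 cells and end up with 79 chunks. If half the cells were empty, the chunk length would double to keep each chunk at roughly the same size in bytes.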

We are considering some future improvements, including:
- automatic chunk size (system does it for you)
- separate units for processing versus storage versus network
apoliakov
Site Admin
 

Re: How to set an appropriate CHUNK value when using "create array"

Postby hlfwm » Thu Jun 30, 2011 3:24 am

Thank you, apoliakov. With your help I can try this now. When will automatic chunk sizing be available in SciDB?



hlfwm
 

