History of SciDB

Inspired by the needs of LSST — the Large Synoptic Survey Telescope — representatives from the scientific, industrial, and computer science communities dealing with extremely large databases met at the 1st Extremely Large Databases (XLDB) workshop in October 2007. These sessions highlighted the rise of large-scale analytics, the convergence of analytic needs between science and industry, and frustration with the slow progress of database systems for scientists, much of it due to repeated re-invention. Despite the highly successful use of commercial systems such as Microsoft SQL Server for the Sloan Digital Sky Survey, comments made by the participants at this workshop made it very clear that the database software community had failed to build the kinds of systems scientists need to manage the massive data sets they accumulate and manipulate, and that the time had come to build a system from scratch, tailored specifically to their needs and leveraging state-of-the-art hardware and software architectures.

LSST is not unique in its need for next-generation database management systems; the XLDB attendees confirmed that scientific and commercial organizations alike are growing increasingly data-intensive. Analyses of terabytes and even petabytes of data are becoming routine. Operating efficiently at this scale requires rethinking systems from the ground up, yet building new systems from scratch for each new peta-scale project is grossly inefficient and unnecessary.

The consensus from the XLDB meeting was that the academic and commercial scientific communities should join forces to build a next-generation open source database management system for data-intensive scientific analytics. The most obvious and compelling use case for such a system is LSST.

The evolving situation in scientific database management is not unlike what is happening in the Internet search business, where tools such as MapReduce have been developed to analyze massive datasets as an alternative to relational database systems. The emergence of these new tools is an indication that traditional RDBMS architectures (which SciDB leaders such as Mike Stonebraker, PhD (MIT) and Dave DeWitt, PhD (Wisconsin) helped invent) are no longer adequate to satisfy the requirements of new analytical applications.

The database research community has repeatedly reached out to the science community in the past. During the mid-1990s, Microsoft's Jim Gray, along with his close friend and colleague Mike Stonebraker, tried to work with NASA on the EOSDIS project to design a database-centric architecture for storing and managing environmental data sets obtained from satellites. NASA ignored their recommendations and built a system around CORBA instead; needless to say, it was a total failure. Around the same time, SLAC and CERN both elected to use Objectivity/DB, a DBMS based on a persistent C++ data model. After three years of using the system, the scientists decided to switch to a home-grown system co-developed with CERN. The main reasons for abandoning Objectivity/DB included its closed source code, its schema inflexibility, uncertainty about the long-term viability of the ODBMS market, and the architectural scalability limits present in Objectivity/DB at that time.

Based on the success of projects such as the Sloan Digital Sky Survey, the science community now recognizes the value of using a 3rd-party DBMS instead of a custom data management system. The science community has also come to appreciate the power that well-established query languages provide for manipulating large datasets and for opening those datasets to large numbers of "self-service" users. Why, then, has there not been a rush by other science projects to embrace and adopt 3rd-party relational database systems? With many open source relational systems available (e.g. PostgreSQL and MySQL), the cost of licenses is not the impediment. The real issue is that the relational data model is simply not a good fit for most science data sets. While Jim Gray was able to get Microsoft SQL Server to work well for Sloan, no other science group has had a Turing Award winner as its chief programmer and DBA. In addition, from the mid-1980s until very recently, the focus of the entire relational database industry has been on executing short transactions very efficiently and in a highly scalable fashion.

In the pursuit of faster and cheaper transactions, the relational vendors (except for Teradata) have largely ignored the needs of the analytical database market, relying in part on caching mechanisms (materialized views, bitmap indexes, OLAP cubes, etc.) to provide incremental benefits on top of their OLTP engines. In the process they have created a maintenance and management nightmare for end users, who must now maintain two or three levels of caching on top of dealing with the fundamental challenges of large-scale data management. They have also most definitely ignored the needs of the scientific market — a market Mike Stonebraker likes to characterize as a "zero billion dollar/year market" — despite the fact that the end users in this market are working on the most important problems facing society today, such as global warming, climate change, and global healthcare initiatives.

In March 2008 at Asilomar, a subset of the participants of the original XLDB conference organized a follow-up meeting that brought together a representative group of science and database experts. The primary focus of this meeting was to determine whether the requirements of the different scientific domains (and some large-scale commercial applications) were similar enough to justify building a database system tailored to the needs of the scientific community. The answer was clearly yes. The requirements, elaborated in detail in the workshop report, can be summarized as follows:

  1. A data model based on multidimensional arrays, not sets of tuples (a short sketch of the distinction follows this list)
  2. A storage model based on versions and not update in place
  3. Built-in support for provenance (lineage), workflows, and uncertainty
  4. Scalability to hundreds of petabytes and thousands of nodes, with a high degree of fault tolerance
  5. Support for "external" data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
  6. Open source in order to foster a community of contributors and to ensure that data is never "locked up" — a critical requirement for scientists.
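
To make the first requirement concrete, here is a minimal sketch, written in Python with NumPy purely for illustration (it is not SciDB code, and the grid values and variable names are invented), contrasting the relational "set of tuples" view of a small 2-D measurement grid with the multidimensional-array view:

    # Illustrative sketch only: in an array data model the cell coordinates are
    # implicit in the data's position, whereas a relational "set of tuples"
    # must carry the coordinates explicitly in every row.
    import numpy as np

    # Relational view: each cell of a 2-D measurement grid is an (x, y, value) tuple.
    tuples = [(0, 0, 11.2), (0, 1, 11.4),
              (1, 0, 11.9), (1, 1, 12.1)]

    # Array view: the same grid stored by position; slicing and windowed
    # aggregates follow the data's natural shape directly.
    grid = np.array([[11.2, 11.4],
                     [11.9, 12.1]])

    print(grid[0:2, 0:2].mean())   # window aggregate expressed structurally on the array

    # The tuple view must reconstruct the same neighborhood by filtering coordinates.
    window = [v for x, y, v in tuples if x < 2 and y < 2]
    print(sum(window) / len(window))

Both computations return the same mean; the difference is that the array form makes the data's spatial structure part of the data model rather than something each query has to rebuild.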