Open Source NG Databases (mailing list summary)

There are plenty of new databases coming out, aiming to tackle the massively scalable domain that Google’s BigTable pioneered. On the Radar mailing list, Jesse pointed out Cassandra (Facebook’s offering) and Mike Loukides countered with Hypertable, asking “We’re sort of being overrun with BigTable-style databases; I wonder what’s going to win?”. (Artur observed, “Cassandra is less like BigTable and more like a distributed column store with autocreating and searching in column namespace, but lacks a lot of indexing needed for BigTable.”)

Jesse replied it’d be the one that’s easiest for developers to use quickly, and I expanded that to:

  • language and platform integration (e.g., Ruby, Rails, Django) so it can be used in the language you use to get stuff done
  • higher abstractions available (as ActiveRecord is to databases, the higher abstraction would be to BigTable) to make it map closer to the problems you have, and to make you more productive with it (nobody disputes that machine code is very powerful but nobody wants to be debugging race conditions via hex dumps)
  • straightforward deployment (either buy time on an S3-like cloud, or it’s no more hassle than MySQL to deploy)
  • a killer app for PR purposes

Mike queried my integrations and abstractions items, observing that CouchDB has only an HTTP REST interface—you can write to it in bash using curl if you want. Jesse said that’s why they’re using it for Chef. I was the parade-rainer, though:

Doesn’t that just mean that it’s equally inconvenient for many languages? I mean, nobody slaps HTTP into their code as an API. Surely they write an own-language wrapper for the REST stuff so they’re not constantly arsing around with encoding parameters, decoding responses, all that business that’s not the business you’re in.

Artur rebutted with:

use LWP::Simple qw(get);
use Whatever::JSON;

$data = json_parse(get("http://couchdb/query"));

Simpler than DBI for sure.
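For what it’s worth, the same pattern stays short in Python too. This sketch invents a `CouchClient` wrapper (the class name and endpoint shapes are mine, not any real library’s) to show what “an own-language wrapper for the REST stuff” amounts to:

```python
import json
import urllib.parse
import urllib.request

class CouchClient:
    """Minimal wrapper over a CouchDB-style HTTP interface.
    The class name and method shapes are invented for illustration."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _url(self, db, doc_id, **params):
        # All the URL escaping and query-string encoding lives here,
        # so callers never touch it.
        path = "%s/%s/%s" % (self.base_url,
                             urllib.parse.quote(db),
                             urllib.parse.quote(doc_id))
        if params:
            path += "?" + urllib.parse.urlencode(params)
        return path

    def get(self, db, doc_id, **params):
        # Fetch a document and decode the JSON response into a dict.
        with urllib.request.urlopen(self._url(db, doc_id, **params)) as resp:
            return json.load(resp)
```

All the parameter encoding and response decoding sits in one place, which is exactly the arsing-around the wrapper exists to hide.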

I’m curious: do you use these new databases? What do you like about the one you use? What don’t you like? What do you think will be necessary for one to break out into the mainstream?

  • http://blog.42signals.com Prashant Kumar

    CouchDB is an interesting idea, but performance is a serious issue. Hopefully it catches up.
    -prashant

  • http://blog.42signals.com Prashant Kumar

    Here is a link to a CouchDB vs. MySQL performance comparison:
    http://metalelf0dev.blogspot.com/2008/09/mysql-couchdb-performance-comparison.html
    -prashant

  • http://www.mymeemz.com Alex Tolley

    I’m using Amazon’s SimpleDB.

    Pros: Simple to use, scalable, “simpler” schema than relational.
    Cons: No many-to-many joins. Not suited to transactions. Querying is very limited.

    Like many programmers, I don’t want to think about the details of a database, any more than I want to think about how exactly a file is stored on a hard disk. I just want to put data in with the simplest structure that solves the problem, and then easily retrieve that data with simple queries.

    I think you might be asking the wrong people about which would break out into the mainstream. A database that can be used easily by non-programmers is what is needed. In my experience, most non-programmers are comfortable with a big Excel spreadsheet as their database because it is easy to view all the data (2D), data access is fairly intuitive (rows & cells), and building “queries” using filters and other worksheet functions is intuitive and incremental (adding new columns for partial work).

    A massive Excel spreadsheet engine, hosted on the web, with good tools to view and manipulate data would be the breakout database IMO.

  • http://blog.botfu.com Kevin Marshall

    I haven’t started using any of the options yet (still able to get/do what I want with good old Oracle, PostgreSQL, MySQL, and SQL Server), but I’ve also been actively watching the space and sort of waiting to see what’s going to win out…the one that’s been piquing my interest the most lately is MongoDB from 10gen ( http://www.10gen.com/ )…but I think it’s a slightly different take than the ones you’ve mentioned.

    To me, the winner will be whoever can make it not only scalable and easy to use, but can also show a performance improvement on querying massive collections over that of an Oracle, for example…I think it’s pretty darn easy to store massive amounts of data in the cloud and pull back out specific bits of that information so long as you know just what and where to look…but it’s still not so easy to do ad-hoc type things against those large datasets, and especially not in a performance-friendly way.

  • Jan Lehnardt

    @Prashant No worries, we’ll catch up :-)

    Note, however, that CouchDB is not about maximum speed for single queries but about handling stunning numbers of concurrent queries gracefully.

  • http://nicklothian.com Nick L

    See also http://project-voldemort.com/ (LinkedIn’s distributed key-value storage system).

    But to answer your question – perhaps making a keystore a drop-in replacement for BerkeleyDB-style databases?

  • http://jeremy.zawodny.com/blog/ Jeremy Zawodny

    I’ve been looking at this stuff myself recently, Nat. Your requirements make good sense–they’re a lot of what got our current generation of open source databases (MySQL and PostgreSQL) to where they are.

    I think that a key element is going to be making it very, very easy to add servers and scale. MySQL bolted replication on back in the 3.23.xx days (really 4.0 when you consider the implementation). That was part of the puzzle, but having to get something like MySQL Proxy (or worse) set up is painful on the front end–especially if the app wasn’t well thought out in advance.

    I’m hoping to find something we can use for one piece of Craigslist that is very much like CouchDB. In fact, it may be CouchDB. I just haven’t worked with it enough to see.

    Even more interesting would be a combination of systems that work well together: CouchDB (or similar) for schema-free storage of whole documents (yay for fewer disk seeks and no joins?), Sphinx or Solr for full-text search, and some sort of distributed memory-based (but durable) table for lookups (which cluster does record X live on?).

    Anyway, I’m rambling. This is a very interesting area and will be for quite some time…
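The “which cluster does record X live on?” lookup Jeremy mentions is typically answered with consistent hashing. Here is a minimal illustrative sketch in Python (the `HashRing` class is invented for this example, not taken from any of the systems discussed):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring mapping record IDs to cluster names.
    An illustrative sketch, not any particular product's implementation."""

    def __init__(self, clusters, points_per_cluster=64):
        # Each cluster contributes many points on the ring, which keeps
        # key distribution even and limits remapping on membership changes.
        ring = []
        for cluster in clusters:
            for i in range(points_per_cluster):
                ring.append((self._hash("%s:%d" % (cluster, i)), cluster))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._clusters = [c for _, c in ring]

    @staticmethod
    def _hash(key):
        # Stable hash: md5 gives identical placement across processes.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, record_id):
        # The first ring point clockwise from the record's hash owns it.
        idx = bisect.bisect_right(self._hashes, self._hash(record_id))
        return self._clusters[idx % len(self._clusters)]
```

Because each cluster owns many points on the ring, adding or removing a cluster remaps only the keys that fell on its points, rather than reshuffling everything.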

  • deepblue

    The slow performance of CouchDB in building indexes (views) seems to me to be due to the use of JavaScript for this functionality… I’m not sure which VM is used for it (TraceMonkey?) but either way it can’t beat MySQL’s (C?) index-building code… still, the fact that such a dynamic language is used for this is brilliant, since complete beginners can code up their own views.

    The benefits of CouchDB make up for this, in my opinion, as it’s significantly easier to scale and dead simple to “upgrade the schema” (since there isn’t one)… also, as mentioned, it’s a young project; there’s plenty of stuff in there that can be improved… sure, it might not perform as well on a single server, but it more than makes up for it on 500 (with MySQL that would be so much work).
    It would be interesting to see if that performance would significantly improve with Google’s V8 JavaScript JIT compiler, the one used in Chrome, since it translates JS directly into machine code before execution (I’m not sure if TraceMonkey does the same or if it’s mostly interpreted)…

    Either way, the situation is going to improve significantly for CouchDB in the short term, I would imagine.

  • Jan Lehnardt

    @deepblue Careful there with the assumptions :) The JS itself is not slow. Our Map and Reduce functions tend to be small, and execution speed has not shown up in benchmarking. What did show up was converting Erlang terms to JSON for SpiderMonkey to use and then parsing the JSON into JS objects again. There are optimizations on their way to drop these conversions to the C level, and we’ve seen quite promising speedups with the current in-development code.

    We’ll also add the ability to write M/R functions in pure Erlang to avoid the conversion altogether. That said, the current implementation was fast enough for a recently specced performance-critical application.

    @Jeremy terrific news!

  • http://www.qubesystems.com Greg Quinn

    There seems to be a lot of thinking about these data stores and examining their possibilities.

    However, who ACTUALLY uses them, and what do they use them for?

  • http://www.dancingbison.com Vasudev Ram

    Informative article, thanks. So were the comments, including the mentions about some other NG DBs. Must take a look at some of them.

    @Greg Quinn: Good question. A partial answer could be that some people/orgs don’t disclose their use of these DBs, since it could be a competitive advantage :-) as is sometimes said about Python, and probably about some other technologies too.

    - Vasudev

  • Jan Lehnardt

    @Greg We keep track of things at http://wiki.apache.org/couchdb/CouchDB_in_the_wild

  • dingo

    Prophet seems like a good alternative to CouchDB.

  • Jan Lehnardt

    @dingo Prophet is really cool tech, and they have data migration nailed; I hope we can borrow some of their work in the future (the Prophet authors are cool with that, I asked). But it is fundamentally a different beast. If you are looking for a low-traffic replacement for CouchDB’s replication where you move data between different schemas, Prophet is your deal. CouchDB’s goals are a little different.

  • http://groovie.org/ Ben Bangert

    The recently launched PylonsHQ site uses CouchDB for storage of all documentation, comments, pastebin entries, snippets, and pretty much anything else I need to store.

    I wrote more details about it here, and the site code is open-source as well:
    http://pylonshq.com/articles/archives/2009/1/new_pylonshq_site_launches

    Some of the things I’ve liked about working with CouchDB:

    - As Artur mentions, the API is drop-dead simple. I can actually read the connector code! (Contrast that with the obscure DB-API code that most people use in their dynamic language of choice.)

    - No schema migrations. This is awesome for rapid development; I just add new fields as I need them.

    - It’s lightweight. My CouchDB instance seems to hover around 8MB of RAM even under load. Views are dang fast (I am using the latest svn, though, which apparently has some performance optimizations that 0.8 didn’t).

    - The distributed replication is very handy. I can sync production data back to my local dev instance, and on occasion sync specific docs back to production from my dev or staging CouchDB instances.

    Things I don’t like:

    - It needs a better way to retrieve ‘related’ docs by doc ID in a single query. Lotus Notes (which CouchDB is heavily inspired by) has this; I think it’s slated for a future CouchDB release.

    - Being able to order on values built in a map/reduce would be awesome.

    Overall, I’m really digging CouchDB. As I mentioned, development is lightning fast, and performance has been just great in my experience.
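For readers who haven’t met CouchDB-style views: conceptually, a view is built by running a map function over every document and optionally reducing the emitted values per key. This toy Python sketch mimics the idea (the `build_view` helper and the document fields are invented for illustration; CouchDB’s real views are JavaScript functions run by the server):

```python
from collections import defaultdict

def build_view(docs, map_fn, reduce_fn=None):
    """Toy CouchDB-style view: map_fn emits (key, value) pairs per
    document; reduce_fn optionally folds the values for each key."""
    rows = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            rows[key].append(value)
    if reduce_fn is None:
        # Views come back ordered by key.
        return {k: rows[k] for k in sorted(rows)}
    return {k: reduce_fn(rows[k]) for k in sorted(rows)}

# Example: count comments per article (document shape is hypothetical).
docs = [
    {"type": "comment", "article": "ng-databases"},
    {"type": "comment", "article": "ng-databases"},
    {"type": "comment", "article": "couchdb-perf"},
]

def by_article(doc):
    # Emit one (article, 1) pair per comment document.
    if doc.get("type") == "comment":
        yield doc["article"], 1

view = build_view(docs, by_article, reduce_fn=sum)
# → {'couchdb-perf': 1, 'ng-databases': 2}
```

Ben’s wish to order on the values built in a map/reduce amounts to sorting this result by value rather than by key, which the key-ordered storage above doesn’t give you for free.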

  • http://krow.net/ Brian Aker

    Hi!

    I would encourage you to look further afield than just CouchDB and Hypertable. There really does seem to be a rethinking right now of what databases should look like (and let’s face it, no one ever loved SQL).

    Years ago I traded some posts back and forth with Tim about the future of databases. I believe the traditional relational databases will become metadata storage systems for small bits of data, while databases like CouchDB and similar will take on the problem of storing more self-contained, object-like data. There is no reason why we can’t mix more technologies together (à la the Unix way).

    As far as interfaces go… I go back and forth on wanting to do just plain HTTP as transport. I suspect for Drizzle we will offer both a binary protocol and a REST interface for retrieving data.

    Cheers,
    -Brian

    BTW, take a look at the analytic engines that are just starting to appear; data warehousing is under heavy evolution right now.

  • http://voodoowarez.com rektide

    Distributed Hash Table systems being accessed via REST+JSON. Persevere isn’t a backing store, but I think it defines the key set of technologies the ultimate DHT in the sky will derive from, in that it codifies the common-sense REST/JSON metaobject principles we need to build systems loosely coupled enough to build this great DHT in the sky. It sounds kind of tangential to the direct “db” question asked, but in terms of hitting the language/platform and higher-abstraction sweet spots, I think once the distributed cornerstones fall into place the very notion of “the database” will become at least antiquated if not simply laughable, in the same way that Clojure or 9P don’t particularly care where the data is; just mount or reference resources.

  • http://taint.org/ Justin Mason

    @Alex Tolley:

    ‘A massive Excel spreadsheet engine, hosted on the web, with good tools to view and manipulate data would be the breakout database IMO.’

    http://dabbledb.com/ sounds most like what you’re after there. I don’t think that kind of “mainstream” is what Nat means, though ;)

  • http://mysql-log-filter.googlecode.com/ René Leonhardt

    Has anyone tried Tokyo Cabinet yet?
    Storing 1 million records in 1-2 seconds sounds so impressive :)
    http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/