NoSQL conference coming to Boston

On March 11 Boston will join several other cities that have hosted
conferences on the movement broadly known as NoSQL. Cassandra, CouchDB, HBase, HypergraphDB,
Hypertable, Memcached, MongoDB,
Neo4j, Riak, SimpleDB, Voldemort, and
probably other projects as well will be represented at the one-day affair.

It’s generally understood that characterizing a movement by what it’s
not is awkward, and it’s hard to find an elevator speech to
encompass all the topics of NoSQL Boston. Are these tools for “big
data” problems? Usually, but sometimes even small web sites can find
them useful. Are the tools meant for processing streams such as log
files? Sometimes, but they can be useful for other text and data
processing as well. And do they reject relational principles? Well, so
you’d think–but different ones reject different principles, so even
there it’s hard to find commonality. (I compared them to relational
databases in a blog last year.)

The interviews I had with various project leaders for this article
turned up a recurring usage pattern for NoSQL. I was seeking
particular domains or types of data where the tools would be useful,
but couldn’t see much commonality. What connects the users is that
they carry out web-related data crunching, searching, and other Web
2.0 related work. I think these companies use NoSQL tools because
they’re the companies who understand leading-edge technologies and are
willing to take risks in those areas. As the field gets better known,
usage will spread.

I had a talk last week with conference organizer Eliot Horowitz, who
is the founder and CTO of 10gen, the company that makes MongoDB. He
let me know that the conference plans to bypass the head-scratching
and launch into practical applications. The day will contain a coding
session and a schema design session along with keynotes.

The resilience of open source

One question that intrigues me is why all the offerings in the NoSQL
area are open source. Some have commercial add-ons, but the core
technology is provided as free software. The few proprietary products
and services in the market (such as Citrusleaf) get far less
attention. Reasons seem to include:

  • The market is currently too small. Just as most computing innovations
    start off in research settings, this one is being explored by people
    looking for solutions to their own problems, more than ways to extract
    a profit. Numerous in-house projects exist in this space that are not
    free software (Google’s Map/Reduce and BigTable, for instance, and
    Amazon’s SimpleDB and Dynamo) but they aren’t commercialized either.

  • Experimentation is moving too fast. Most of the projects are just a
    couple years old, and are rapidly adding features.

  • The ROI is hard to calculate. Horowitz says, “People won’t pay for
    anything they don’t really understand yet.” (Nevertheless, 10gen and
    other companies are commercializing the open source offerings.)

  • Whatever problem an organization is trying to solve, each NoSQL
    offering tends to be a piece of the solution. It has to be tuned for and
    integrated into the organization’s architecture, and combined with
    pieces from other places.

The projects in this conference therefore demonstrate the innovative
power of free software. CouchDB and Cassandra are particularly
interesting in this regard because they are community efforts more
than corporate efforts. Both are Apache top-level projects. (Cassandra
was just moved from the incubator to a top-level project on February
17.) CouchDB committer J. Chris Anderson tells me that the Apache
community process ensures a wide range of voices are heard, leading to
(of course) occasional public wrangling but a superior outcome.

The BBC and (according to Anderson) SXSW are among the users of
CouchDB; CouchDB has been integrated into Ubuntu; Mozilla Messaging is
basing Raindrop (their next-generation messaging platform) on CouchDB;
and even mobile handset manufacturers are looking at it. (O’Reilly
Media also uses CouchDB.)

I also talked to Alan Hoffman of Cloudant, which offers a CouchDB cloud
service that fills in some of the gaps left by bare CouchDB
(consistent hashing, partitioning, quorum, etc.). Although a couple
companies offer commercial support, no single company takes
responsibility for CouchDB. Its community is highly
distributed. Anderson listed 10 Apache committers working for 8
different companies, and nearly 40 other people who contribute
patches. Support takes place on mailing lists (roughly one thousand
messages a month) and IRC channels.

Jonathan Ellis, project chair of Cassandra, calls it an “open source
success story” because it went from a state of near petrification to
vibrant regrowth through open sourcing. Facebook invented it and
brought it to a state where it satisfied their needs. They made it
open source and moved it into the Apache Incubator in 2008 but declared
that they would not be doing further development. It could easily have
receded into obscurity.

Ellis says that he was hired at Rackspace and asked to find a
distributed data store that was fast and scaled easily; he decided on
Cassandra. Soon after he became a public and enthusiastic advocate,
Digg and Twitter joined Rackspace as users and developers. Having
multiple QA teams test each release–particularly in very different
environments–helps quality immensely. Ellis finds that Eric Raymond’s
“many eyes” characterization of open source bug fixing applies.

Although Cassandra is found mostly as a backing store for web sites
with a lot of users, Ellis thinks it would meet the needs of many
academic and commercial sites, and looks forward to someone offering a
cloud service based on it.

Justin Sheehy, CTO of Basho, maker of
the Riak data store, told me they can confirm the typical advantages
cited for open source. Developers at potential customer sites can try
out the software without going through a bureaucratic procurement
process, and then become internal advocates who function much more
effectively than outside salespeople.

He also says that companies such as Basho offer the best of both
worlds to tentative customers. The backing of a corporation means that
professional services and added tools are available to go along with
the product those customers buy. But because the source is open and
has a community around it, those customers can feel secure that
development and support will continue regardless of the fate of the
originating company. 10gen, of course, plays a similar role for
MongoDB, and Anderson’s company Couchio offers support for CouchDB.
For projects that are not closely
associated with the backing of one company, the Apache Foundation’s
sponsorship helps to ensure continuity.

What are the fault lines in the NoSQL landscape?

Naturally, the projects I’ve mentioned in this blog borrow ideas from
each other and show tiny variations on common solutions regarding such
things as B-tree storage, replication, solutions to locality of
reference, etc. Experience will eventually lead to a shake-out and a
convergence among surviving projects. In the meantime, how can you
get your head around them?

We’ll pause here for a word from our sponsors, letting you know that
O’Reilly has published books on CouchDB and Hadoop and is
developing one about MongoDB.

Horowitz offers an initial subdivision of projects based on data model
(document, key-value, or tabular), a theme he explored in another
interview.
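To make those three data models a little more concrete, here is a minimal sketch in plain Python. It talks to no database at all: the sample record, the `user:u1` key format, the column-family names, and the helper functions are all invented for illustration, and real products differ in the details.

```python
import json

# Document model: one self-contained, nested record.
document = {
    "_id": "u1",
    "name": "Ada",
    "emails": ["ada@example.com"],
    "address": {"city": "Boston"},
}

# Key-value model: an opaque value looked up by a single key.
# The store knows nothing about the value's internal structure.
key_value = {"user:u1": json.dumps(document)}

# Tabular model: row key -> column family -> column -> value.
tabular = {
    "u1": {
        "profile": {"name": "Ada", "city": "Boston"},
        "emails": {"0": "ada@example.com"},
    }
}


def city_from_document(doc):
    # The application navigates the nested structure directly.
    return doc["address"]["city"]


def city_from_key_value(store, key):
    # The application must fetch the whole value and decode it itself.
    return json.loads(store[key])["address"]["city"]


def city_from_tabular(table, row_key):
    # The application addresses a single column within a column family.
    return table[row_key]["profile"]["city"]
```

All three helpers return the same city; what changes is how much of the record's structure the store itself can see, which in turn shapes what kinds of queries it can answer for you.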

Roger Magoulas, a research director with O’Reilly, further subdivides
projects into those that crunch large data sets in a batch
manner–such as Hadoop–and those that retrieve views of data to
fulfill visitor search requests on web pages or similar tasks. He goes
on to say that you can compare them on the basis of particular
features, such as automatic replication, auto-sharding or
partitioning, and in-memory caches.

The most comprehensive attempts I’ve seen to make sense of this gangly
crew of projects from a feature standpoint come in a blog by Ellis
and one by Vineet Gupta. (Gupta’s blog is labeled “Part 1” and I’d love to
see more parts.) But Sheehy says the various features of the offerings
interact too strongly and have too many subtle variations to fit into
an easy taxonomy. “Many people try to classify the projects, everyone
does it differently, and nobody gets it quite right.”

Community features

So who uses these things? To take Horowitz’s MongoDB again as an
example, many web sites gravitate toward it because the document
structure makes some things–adding fields to rows, mapping objects to
fields–easier than a relational database does. A few scientific sites
also use MongoDB.
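As a rough illustration of why web developers find this convenient, the sketch below maps an object to a document and adds a field to one record without touching any other. It is plain Python with a list standing in for a collection; `Post`, `save`, and `find` are hypothetical names and not part of any MongoDB driver.

```python
from dataclasses import dataclass, asdict, field


@dataclass
class Post:
    title: str
    tags: list = field(default_factory=list)


# A plain list stands in for a document collection (assumption: no real database).
collection = []


def save(coll, obj, **extra_fields):
    """Map an object to a document and store it; extra fields need no schema change."""
    doc = asdict(obj)
    doc.update(extra_fields)
    coll.append(doc)
    return doc


def find(coll, **criteria):
    """Return documents whose fields match every criterion; missing fields never match."""
    return [d for d in coll if all(d.get(k) == v for k, v in criteria.items())]


save(collection, Post("Hello"))
# Only this document gets a "location" field; the first one is left untouched.
save(collection, Post("NoSQL Boston", ["conference"]), location="Boston")
```

In a relational database, the second `save` would typically require adding a column to the table first; here the two records simply have different shapes.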

Riak also has a large following among web sites and startups, but
their customers also include media companies, ad networks, SMS
gateways, analytics firms, and many other types of organizations.

Magoulas finds that an organization’s bent is determined by the
background and expertise of its developers. Programmers with lots of
traditional relational database experience tend to be wary of the
recent upstarts, a position reinforced by legacy investments in tools
that depend on their relational database and are sometimes very
expensive.

On the other hand, web programmers look for tools that conform more
closely to the data structures and programming techniques they’re used
to, and can actually be “flummoxed” by relational database logic or
abstraction layers on top of the databases. These programmers may
think it intuitive to do the kinds of filtering and sorting that seem
like reinventing the wheel to a traditional RDBMS programmer.
Anderson likes to quote Jacob Kaplan-Moss, the creator of Django, as
saying, “Django may be built for the Web, but CouchDB is built of the
Web. I’ve never seen software that so completely embraces the
philosophies behind HTTP.”

10gen’s consultation with MongoDB users includes asking for votes on
new features. They also see a great deal of code contributions in the
driver layer and adapters (sessions, logging, etc.) but not much in
the core. Sheehy said the same is true of Riak: although contributions
to the core are rare, outsiders develop half the client libraries and
many of the tools.

Rapid change is part of life for NoSQL developers. Anderson says of
CouchDB, “The ancillary APIs have been evolving rapidly in preparation
for our 1.0 release, which should come out in the next few months and
won’t differ much from today’s trunk. The new APIs include
authentication, authorization, details of Map/Reduce, and functions
for transforming and serving JSON documents as other datatypes such as
HTML or CSV.” Horowitz stressed that MongoDB will roll out a lot of
new features over the upcoming year.

One hundred people have signed up for NoSQL Boston so far, and more
than 150 are expected. I’ll be there to take it in and try to reduce
it to some high-level insights for this blog.

  • Bradford

    Cool article, and a very good overview of the “State of Affairs.” I’m actually hosting the scalability panel at NoSQL Live (Boston).

    You mentioned “Whatever problem an organization is trying to solve, each NoSQL offering tends to be piece of the solution.”– that’s exactly one of the problems our startup, Drawn To Scale, has solved.

    We make all these problems “go away”. We make it easy and scalable for companies to process, query, serve, store, and search their data in real-time. And it’s seamlessly scalable. All companies have to do is put data in an API, we handle the rest :)

  • Andy Oram

    It’s fair to mention your service, Bradford–and it actually complements the blog well because I was asking who could offer a cloud service for the kinds of work these tools are doing–but I think potential clients are going to want to know a lot more about the service before they fill out the form on your site. I saw a question on your blog, and if I had you in my own blog I would have asked you a lot more questions. Knowing how big some sites’ data sets get, for instance–how much effort goes into managing dedicated servers for data–I’m curious how you’re confident you can store everybody’s data.

  • Emil Eifrem

    Good overview. One of the things a lot of people seem to forget is that NOSQL is not only about scaling to size, but also scaling to complexity. While petabytes of data is a worthy goal, I believe the more common use case (outside of a few giants like Facebook and Google) is to cope with complex data. That’s where a graph database like Neo4j shines.

    -EE

  • Emil Eifrem

    Hmm, I messed up that link. Here’s the proper link about scaling to complexity.

    -EE

  • Adam Crabtree

    I’m currently building a new social networking site (ya, ya, ya…) and while my architecture will tap into new technologies like Node.js, Sammy.js, (big on JavaScript), etc… I’ve considered using solutions like CouchDB and MongoDB as they seem to have heavy followings within the “Web 2.0” communities.

    That being said, I still don’t understand how the NoSQL solutions will help solve heavily interrelated data issues such as user information and recommendations like events. I see them as fast solutions for some form of front-loading for real-time and Comet, but they don’t seem much use to me beyond that?? Maybe I’m having trouble breaking out of the RDBMS paradigm, but I’m trying hard and I just don’t see these as viable for REAL web apps with complex interrelated data.

    How would a social network like Facebook utilize a NoSQL solution when so much of their data must be heavily joined and interrelated?

  • Borislav Iordanov

    Adam,

    Look at HyperGraphDB (http://www.kobrix.com/hgdb.jsp), it’s precisely this kind of problem that it was built to solve. It’s an embedded OO hypergraph db (think db4o + general hypergraphs).

    Boris

  • Adam Saltiel

    Adam,
    I’m interested, how would you use Node.js with a graph db? How would Node.js interface with the db, is there an existing means in Node.js api?
    I think you are wrong about not being suitable for real web apps with complicated interrelated data. I understand that graph dbs specifically are for such data and that a problem with RDBMSs is that they do not have flexible schema, therefore flexible relationships between those schema. I don’t think this is too hard to understand in that the db access layer in an app must model the underlying db schema, data manipulation is done with objects that are determined by the schema. In this situation it is particularly difficult to join the data in ways outside of that dictated by the schema. Think of the difficulty of extracting hierarchical data from an RDBMS. These issues are explained well in Beautiful Data, O’Reilly, 2009, esp. in chapter 20, as well as the background to various of the NoSQL projects.