With IBM’s acquisition of Netezza, it seems big data is also big business. Companies are using their data assets to target their products and services with increasing precision. And there’s more and more data to chew on: hardly a website goes by without a Like, Check In, or Retweet button on it.
It’s not just the marketers who are throwing petabytes of information at problems. Scientists, intelligence analysts, governments, meteorologists, air traffic controllers, architects, civil engineers: nearly every industry or profession is touched by the era of big data. Add the fact that the democratization of IT has made everyone a (sort of) data expert, familiar with searches and queries, and we’re seeing a huge burst of interest in big data.
Giving enterprises big data they can digest
Netezza sprinkled an appliance philosophy over a complex suite of technologies, making it easier for enterprises to get started. But the real reason for IBM’s offer was that the company reset the price/performance equation for enterprise data analysis. This was the result of three changes:
- The company put storage next to computation on very fast, custom-built systems. This addressed one of the big bottlenecks of processing: instead of distinct storage, database, and application tiers, Netezza’s systems made the computation and the storage work in concert, and broke data up across many storage devices so it could be hoovered out at astonishing rates.
- The company’s systems made it easy to do things in parallel. Frameworks like Hadoop make it possible to split up a task into many subcomponents and farm the work out to hundreds, even thousands, of computers, getting results far more quickly. But parallelism has a problem: many enterprise data systems rely on database models that link information across several tables using JOINs. They need everything locked down in order to query the system, which doesn’t fit well with the idea of doing many things in parallel. Netezza’s custom FPGA hardware acted like a traffic cop, splitting up analysis across the entire system while avoiding roadblocks.
- Finally, Netezza’s technology presented familiar interfaces like ODBC that worked with existing enterprise applications. That meant its products accelerated what was already in place, rather than requiring a forklift upgrade.
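The split-and-farm-out idea behind frameworks like Hadoop can be sketched in a few lines. This is a minimal illustration using Python's standard-library process pool, not Hadoop itself: a word-count job is split into chunks, the "map" step runs on each chunk in parallel, and a "reduce" step merges the partial results. The data and function names are illustrative only.

```python
# Minimal sketch of the split/map/reduce pattern behind frameworks
# like Hadoop, using Python's standard multiprocessing pool.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """The 'map' step: count words in one slice of the data."""
    return Counter(chunk.split())

def parallel_word_count(documents, workers=4):
    # Farm the chunks out to worker processes in parallel.
    with Pool(workers) as pool:
        partial_counts = pool.map(count_words, documents)
    # The 'reduce' step: merge the partial counts into one total.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    docs = ["big data is big business", "big data appliances"]
    print(parallel_word_count(docs)["big"])  # 3
```

The pattern works precisely because each chunk can be counted independently, with no locks or cross-table JOINs; merging happens only once, at the end.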
There are a huge number of innovations in big data (and in what some call, somewhat inaccurately, the NoSQL movement). These include large-object stores like Amazon’s S3, and document and key-value stores like CouchDB, MongoDB, Basho’s Riak, and so on. We’re already moving beyond first-generation big data systems: Facebook has largely moved on from Cassandra, the storage system it originally built, and Google’s Caffeine has replaced its older, batch-oriented indexing with a more real-time map of the web. It’s hard to keep track of it all. And it’s harder still for enterprises to digest.
Big data appliances are the new mainframes
Peel open a big data appliance, and you’ll find an array of commodity, off-the-shelf (COTS) processors on blades, a very fast network backplane that’s good at virtualization, some custom load-sharing technology, and storage. That’s what’s inside Acadia, the Cisco/VMWare/EMC marriage; it’s inside Oracle’s newly announced Exalogic cloud-in-a-box; and it’s what Netezza makes. It resembles a legacy mainframe: elastic, shared, highly parallel, and very fast.
There’s a reason that the distributed, COTS data center is contracting into these high-performance appliances. A paper by the late Jim Gray of Microsoft argued that, compared to the cost of moving bytes around, everything else is free. The same economics apply to data processing. It’s why Amazon’s S3 large-object store, not its EC2 compute service, is core to the company’s strategy: your computation goes to where your data is, not the other way around.
Clouds level the playing field
In the past, companies with deep enough pockets to afford data systems had a competitive advantage, because they could crunch numbers better. Analytics isn’t just about storing a lot of information; after all, these days, storage is practically free. It’s also about indexing structured data so that it can be queried and processed quickly. In a data warehouse world, that means creating data cubes that are indexed along several dimensions.
Imagine, for example, that you’re tasked with analyzing some customer shopping data. You might index it by product, by store, and by date. You could quickly find out how sales went by any of those three dimensions. That would be a three-dimensional data cube. But if you wanted to find out about sales by color, and hadn’t indexed the data along that dimension, it would take a long time to calculate.
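A toy version of that cube can make the trade-off concrete. This is a sketch, not any warehouse's actual implementation, with hypothetical sales records and illustrative field names: totals along the three indexed dimensions are pre-aggregated, so looking up "sales by store" is a dictionary read, while the un-indexed color dimension forces a full scan of the records.

```python
from collections import defaultdict

# Hypothetical sales records; field names and values are illustrative.
sales = [
    {"product": "shirt", "store": "NYC", "date": "2010-09-01", "color": "red",  "amount": 20},
    {"product": "shirt", "store": "SFO", "date": "2010-09-01", "color": "blue", "amount": 25},
    {"product": "hat",   "store": "NYC", "date": "2010-09-02", "color": "red",  "amount": 10},
]

def build_cube(records, dimensions):
    """Pre-aggregate totals along each indexed dimension."""
    cube = {dim: defaultdict(float) for dim in dimensions}
    for r in records:
        for dim in dimensions:
            cube[dim][r[dim]] += r["amount"]
    return cube

cube = build_cube(sales, ["product", "store", "date"])

# Indexed dimension: answering "sales by store" is a cheap lookup.
print(cube["store"]["NYC"])  # 30.0

# "Sales by color" was never indexed, so it costs a scan of every record.
print(sum(r["amount"] for r in sales if r["color"] == "red"))  # 30
```

At warehouse scale the scan is the expensive part; the cloud's contribution, as the next paragraph argues, is making it cheap to pre-build as many of these indices as you need.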
With clouds, the infrastructure is cheap, and you only pay for what you need. So if you’re building a data warehouse, you can have as many indices as you need. The same thing applies to dozens of other industries where infrastructure was a barrier to entry: architectural engineering, financial modeling, risk assessment and insurance, genomics, and so on. The availability of on-demand, elastic compute capacity delivered as a utility has torn down the barriers to entry.
There are versions of Netezza available on cloud platforms, but with much of the data they analyze considered proprietary or constrained by privacy legislation, companies want to keep it within their four walls, and appliances let them do that. These big data appliances unlock data and massive parallelism for enterprise customers, letting them crunch data quickly, make better business decisions, and react faster.
Correction 9/22/10: “Cisco/HP/EMC” was changed to “Cisco/VMWare/EMC” under the “Big data appliances are the new mainframes” section.