ENTRIES TAGGED "Hadoop"

It’s getting easier to build Big Data applications

Analytic engines on top of Hadoop simplify the creation of interesting, low-cost, scalable applications

Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that led to Hadoop dominating batch processing is permeating other types of analytics.

Another collection of open source, Hadoop-compatible analytic engines, the Berkeley Data Analytics Stack (BDAS), is being built just across the San Francisco Bay. Starting with a batch-processing framework that’s faster than MapReduce (Spark), it now includes interactive SQL (Shark), and real-time analytics (Spark Streaming). Sometime this summer, frameworks for machine-learning (MLbase) and graph analytics (GraphX) will be released. A cluster manager (Mesos) and an in-memory file system (Tachyon) allow users of other analytic frameworks to leverage the BDAS platform. (The Python data community is looking at Tachyon closely.)

Read more…

Comment |

Tracking the progress of large-scale Query Engines

A new, open source benchmark can be used to track performance improvements over time

As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into the cloud are beginning to explore Amazon Redshift1, Google BigQuery, and Qubole.

A variety of analytic engines2 built for Hadoop are allowing companies to bring its low-cost, scale-out architecture to a wider audience. In particular, companies are rediscovering that SQL makes data accessible to lots of users, and many prefer3 not having to move data to a separate (MPP) cluster. There are many new tools that seek to provide an interactive SQL interface to Hadoop, including Cloudera’s Impala, Shark, Hadapt, CitusDB, Pivotal-HD, PolyBase4, and SQL-H.

An open source benchmark from UC Berkeley’s Amplab
A benchmark for tracking the progress5 of scalable query engines has just been released. It’s a worthy first effort, and its creators hope to grow the list of tools to include other open source (Drill, Stinger) and commercial6 systems. As these query engines mature and features get added, data from this benchmark can provide a quick synopsis of performance improvements over time.

The initial release includes Redshift, Hive, Impala, and Shark (Hive, Impala, Shark were configured to run on AWS). Hive 0.10 and the most recent versions7 of Impala and Shark were used (Hive 0.11 was released in mid-May and has not yet been included). Data came from Intel’s Hadoop Benchmark Suite and CommonCrawl. In the case of Hive/Impala/Shark, data was stored in compressed SequenceFile format using CDH 4.2.0.

Initial Findings
At least for the queries included in the benchmark, Redshift is about 2-3x faster than Shark/on-disk, and 0.3-2x faster than Shark/in-memory. Given that it’s built on top of a general purpose engine (Spark), it’s encouraging that Shark’s performance is within range of MPP8 databases (such as Redshift) that are highly optimized for interactive SQL queries. With new frameworks like Shark and Impala providing speedups comparable to those observed in MPP databases, organizations now have the option of using a single system (Hadoop/Spark) instead of two (Hadoop/Spark + MPP database).

Let’s look at some of the results in detail:

Read more…

Comment: 1 |

Data Science Tools: Fast, easy to use, and scalable

Tools slowly democratize many data science tasks

Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference.

Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.

Read more…

Comments: 2 |

Strata Week: Hortonworks brings Hadoop to Windows

Hortonworks' Data Platform for Windows, Intel's Hadoop distribution, invasive smartphone surveillance, and data-driven "House of Cards."

Windows gets Hadoop, Intel launches Hadoop distribution

Hadoop-logoHortonworks released a beta version of its Hortonworks Data Platform for Windows this week. In the press release, the company highlights the mission is to “expand the reach of Apache Hadoop across the enterprise” and notes that the “100% open source Hortonworks Data Platform is the industry’s first and only Apache Hadoop distribution for both Windows and Linux.”

Barb Darrow notes at GigaOm that there’s likely no better way to bring big data to the masses than via Microsoft Excel. Darrow reports that Hortonworks’ VP of corporate strategy Shawn Connolly told her that “[t]he combination should make it easier to integrate data from SQL Server and Hadoop and to funnel all that into Excel for charting and pivoting and all the tasks Excel is good at,” stressing that the same Apache Hadoop distribution will run on both Windows and Linux. Connolly also noted to Darrow that “an analogous Hortonworks Data Platform for Windows Azure is still in the works.”

Read more…

Comment: 1 |

Five big data predictions for 2013

Diversity and manageability are big data watchwords for the next 12 months.

Here are some of the key big data themes I expect to dominate 2013, and of course will be covering in Strata.

Emergence of a big data architecture

Leadenhall Building skyscraper Under Construction by Martin Pettitt, on FlickrThe coming year will mark the graduation for many big data pilot projects, as they are put into production. With that comes an understanding of the practical architectures that work. These architectures will identify:

  • best of breed tools for different purposes, for instance, Storm for streaming data acquisition
  • appropriate roles for relational databases, Hadoop, NoSQL stores and in-memory databases
  • how to combine existing data warehouses and analytical databases with Hadoop

Of course, these architectures will be in constant evolution as big data tooling matures and experience is gained.

In parallel, I expect to see increasing understanding of where big data responsibility sits within a company’s org chart. Big data is fundamentally a business problem, and some of the biggest challenges in taking advantage of it lie in the changes required to cross organizational silos and reform decision making.

One to watch: it’s hard to move data, so look for a starring architectural role for HDFS for the foreseeable future. Read more…

Comment: 1 |

The future of MongoDB

Steve Francia on alternatives to Hadoop and what lies ahead for MongoDB.

Steve Francia (@ spf13) is an O’Reilly author and chief evangelist at 10gen.

Steve and I sat down during the Strata + Hadoop World conference in New York last month to talk about what he’s most excited about nowadays.  He focused on alternatives to Hadoop, what we can expect to see next from MongoDB, and the future of big data.

Highlights from the conversation include:

  • Discover alternatives to Hadoop. [Discussed 18 seconds in].
  • The new features in MongoDB 2.2. [Discussed at the 1:23 mark].
  • How being an open source company helps 10gen connect with its users. [Discussed at the 3:09 mark].
  • Long-term goals for MongoDB. [Discussed at the 5:10 mark].
  • New technologies are enabling all of us to participate in big data. [Discussed at the 7:05 mark].

You can view the entire interview in the following video.

Read more…

Comment |

Shark: Real-time queries and analytics for big data

Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning

Hadoop’s strength is in batch processing, MapReduce isn’t particularly suited for interactive/adhoc queries. Real-time1 SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.

Having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system2 that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL. Read more…

Comment |

Strata Week: AWS and Rackspace, compared

How Amazon Web Services and Rackspace measure up; IBM's Watson goes to school; Google researches data; and what will we call really, really big data?

Here are a few stories from the data space that caught my attention this week.

Rackspace vs Amazon

Rackspace LogoAs Rackspace continues to ramp up its services to compete with Amazon Web Services (AWS) — this week, announcing a partnership with Hortonworks to develop a cloud-based enterprise-ready Hadoop platform to compete against Amazon’s Elastic MapReduce — Derrick Harris at GigaOm compared apples to apples.

John Engates, CTO of Rackspace, told Harris the most fundamental difference between the two services is the level of control given to the customer. Harris writes that Rackspace’s new Hadoop services aims to give the customer “granular control over how their systems are configured and how their jobs run,” providing “the experience of owning a Hadoop cluster without actually owning any of the hardware.” Engates pointed out, “It’s not MapReduce as a service; it’s more Hadoop as a service.”

Harris also points out that Rackspace is considering making moves into NoSQL and looks at AWS’ DynamoDB service. He notes that Amazon and Rackspace aren’t the only players on any of these fields, pointing to the likes of Microsoft’s HDInsight, IBM’s BigInsights, Qubole, Infochimps, MongoDB, Cassandra and CouchDB-based services.

In related news, Rackspace announced its new Cloud Networks feature this week that allows customers to design their own networks on Rackspace’s Cloud Servers. In an interview with Jack McCarthy at CRN, Engates explained the background:

“When we went from dedicated physical networks to our public cloud, we lost the ability to segment these networks. We used to have a vLAN. As we moved to OpenStack, we wanted to give our customers the ability to enable segmented networks in the cloud. Cloud Networks gives customers a degree of control over how they build networks in the cloud, whether it’s building networks application servers or for Web servers or databases.”

Engates also points out the networks are software-defined, “so customers can program their network on the fly.” You can read more about the new feature on the Rackspace blog.

Read more…

Comment |

Strata Week: Real-time Hadoop

Cloudera ventures into real-time queries with Impala, data centers are the new landfill, and Jesper Andersen looks at the relationship between art and data.

Here are a few stories from the data space that caught my attention this week.

Cloudera’s Impala takes Hadoop queries into real-time

Cloudera ventured into real-time Hadoop querying this week, opening up its Impala software platform. As Derrick Harris reports at GigaOm, Impala — an SQL query engine — doesn’t rely on MapReduce, making it faster than tools such as Hive. Cloudera estimates its queries run 10 times faster than Hive, and Charles Zedlewski, Cloudera’s cloud VP of products, told Harris that “small queries can run in less than a second.”

Harris notes that Zedlewski pointed out that Impala wasn’t designed to replace business intelligence (BI) tools, and that “Cloudera isn’t interested in selling BI or other analytic applications.” Rather, Impala serves as the execution engine, still relying on software from Cloudera partners — Zedlewski told Harris, “We’re sticking to our knitting as a platform vendor.”

Joab Jackson at PC World reports that “[e]ventually, Impala will be the basis of a Cloudera commercial offering, called the Cloudera Enterprise RTQ (Real-Time Query), though the company has not specified a release date.”

Impala has plenty of competition on this playing field, which Harris also covers, and he notes the significance of all the recent Hadoop innovation:

“I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.”

You can read more from Harris’ piece here and Jackson’s piece here. Wired also has an interesting piece on Impala, covering the Google F1 database upon which it is based and the Googler Cloudera hired away to help build it.

(Cloudera CEO Mike Olson discussed Impala, Hadoop and the importance of real-time at this week’s Strata Conference + Hadoop World.)

Read more…

Comment |

Strata Week: Data mining for votes

Candidates are data mining behind the scenes, data mining gets a PR campaign, Google faces privacy policy issues, and Hadoop and BI.

Here are a few stories from the data space that caught my attention this week.

Presidential candidates are mining your data

Data is playing an unprecedented role in the US presidential election this year. The two presidential campaigns have access to personal voter data “at a scale never before imagined,” reports Charles Duhigg at the New York Times. The candidate camps are using personal data in polling calls, accessing such details as “whether voters may have visited pornography Web sites, have homes in foreclosure, are more prone to drink Michelob Ultra than Corona or have gay friends or enjoy expensive vacations,” Duhigg writes. He reports that both campaigns emphasized they were committed to protecting voter privacy, but notes:

“Officials for both campaigns acknowledge that many of their consultants and vendors draw data from an array of sources — including some the campaigns themselves have not fully scrutinized.”

A Romney campaign official told Duhigg: “You don’t want your analytical efforts to be obvious because voters get creeped out. A lot of what we’re doing is behind the scenes.”

The “behind the scenes” may be enough in itself to creep people out. These sorts of situations are starting to tarnish the image of the consumer data-mining industry, and a Manhattan trade group, the Direct Marketing Association, is launching a public relations campaign — the “Data-Driven Marketing Institute” — to smooth things over before government regulators get involved. Natasha Singer reports at the New York Times:

“According to a statement, the trade group intends to promote such targeted marketing to lawmakers and the public ‘with the goal of preventing needless regulation or enforcement that could severely hamper consumer marketing and stifle innovation’ as well as ‘tamping down unfavorable media attention.’ As part of the campaign, the group plans to finance academic research into the industry’s economic impact, said Linda A. Woolley, the acting chief executive of the Direct Marketing Association.”

One of the biggest issues, Singer notes, is that people want control over their data. Chuck Teller, founder of Catalog Choice, told Singer that in a recent survey conducted by his company, 67% of people responded that they wanted to see the data collected about them by data brokers and 78% said they wanted the ability to opt out of the sale and distribution of that data.

Read more…

Comment |