Hadoop: What it is, how it works, and what it can do

Cloudera CEO Mike Olson on Hadoop's architecture and its data applications.

Hadoop gets a lot of buzz these days in database and content management circles, but many people in the industry still don’t really know what it is or how it can best be applied.

Cloudera CEO and Strata speaker Mike Olson, whose company offers an enterprise distribution of Hadoop and contributes to the project, discusses Hadoop’s background and its applications in the following interview.

Where did Hadoop come from?

Mike Olson: The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.

What problems can Hadoop solve?

Mike Olson: The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally intensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.

How is Hadoop architected?

Mike Olson: Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies of each piece are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
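
To make that storage model concrete, here is a minimal sketch, not taken from the interview, that loads a local file into HDFS through Hadoop’s Java FileSystem API. The file paths and the replication setting of three are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings (fs.defaultFS and friends) from core-site.xml / hdfs-site.xml.
    Configuration conf = new Configuration();
    // Ask HDFS to keep three copies of every block of the files we create (a common default).
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into the cluster. HDFS busts it into blocks and spreads
    // the block replicas across different servers; the paths here are made up.
    Path local = new Path("/tmp/transactions.csv");
    Path remote = new Path("/data/transactions.csv");
    fs.copyFromLocalFile(local, remote);

    // The client never tracks which machines hold the blocks; the namenode does.
    FileStatus status = fs.getFileStatus(remote);
    System.out.println("Stored " + status.getLen() + " bytes with replication factor "
        + status.getReplication());
  }
}

The same copy can be done from a shell with hadoop fs -put; either way, the client never chooses which servers end up holding the blocks.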

In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
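
As a concrete illustration of that pattern, here is the canonical word-count job written against Hadoop’s Java MapReduce API. It is a sketch for orientation rather than anything discussed in the interview; the input and output paths come from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each server tokenizes its local slice of the input and emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: the counts for each word are gathered from all mappers and summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map phase runs next to the data on each server and emits (word, 1) pairs; the framework groups those pairs by word, and the reduce phase sums them into a single result set.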

Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out. And the reason you’re able to ask complicated computational questions is because you’ve got all of these processors, working in parallel, harnessed together.

At this point, do companies need to develop their own Hadoop applications?

Mike Olson: It’s fair to say that a current Hadoop adopter must be more sophisticated than a relational database adopter. There are not that many “shrink wrapped” applications today that you can get right out of the box and run on your Hadoop cluster. It’s similar to the early ’80s when Ingres and IBM were selling their database engines and people often had to write applications locally to operate on the data.

That said, you can develop applications in a lot of different languages that run on the Hadoop framework. The developer tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools so that they’re able to talk to data stored in a Hadoop cluster using Hadoop APIs. There are specialist vendors that are up and coming, and there are also a couple of general-purpose query tools: Hive, a SQL dialect that lets you interact with data stored on a Hadoop cluster, and Pig, a language developed by Yahoo that allows for data flow and data transformation operations on a Hadoop cluster.
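
For a flavor of what that SQL route looks like, here is a hedged sketch that runs a query against a Hive table over JDBC. The host name, port, credentials, and the weblogs table are placeholders, and the HiveServer2 driver shown assumes a reasonably recent Hive release:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTopPages {
  public static void main(String[] args) throws Exception {
    // Hive's JDBC driver (HiveServer2); the host, port, and credentials below are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this SQL into MapReduce jobs that run across the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

Under the covers the query becomes ordinary MapReduce work, so an analyst can stay in a familiar language while the cluster does the parallel heavy lifting.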

Hadoop’s deployment is a bit tricky at this stage, but the vendors are moving quickly to create applications that solve these problems. I expect to see more of the shrink-wrapped apps appearing over the next couple of years.

Where do you stand in the SQL vs NoSQL debate?

Mike Olson: I’m a deep believer in relational databases and in SQL. I think the language is awesome and the products are incredible.

I hate the term “NoSQL.” It was invented to create cachet around a bunch of different projects, each of which has different properties and behaves in different ways. The real question is, what problems are you solving? That’s what matters to users.

Comments

  • http://www.karmasphere.com John Murphy

    Karmasphere is giving developers and analysts the power to mine and explore Big Data on Hadoop.

    Analysts:
    Write and Prototype SQL
    Profile and Diagnose Queries
    Generate Dataset and Visualize

    Developers:
    Develop, Test, and Debug MapReduce
    Profile and optimize
    Deploy to Any Hadoop

    karmasphere.com

  • Mike S.

    Excellent Q&A – this is the first time I’ve looked at the technology enough to understand why it’s useful. Had to overcome a natural disinclination because of the name and logo. Would be embarrassed to list “Hadoop” and the elephant logo on my resume.

  • http://helpful-tips-for-seo.blogspot.com/ Peter Paul

    Hi James Turner,
    I have been searching a lot for information about Hadoop, and this post has exactly what I wanted, explained in an easily understood way. Thanks for sharing it.

  • Hkapil30

    This is a precise, concise, and easy-to-understand article about Hadoop. Thanks a lot!

  • Bandlave

    Hi,
    I have around 1.6 years of IT experience as a Teradata DBA and I am very interested in learning Hadoop. Are there any prerequisites before starting Hadoop?

    • Sir Fidel

      Teradata has its own version of Hadoop called Teradata Aster, or Aster SQL-H.

  • Nitin Mishra

    This article is a nutshell of what Hadoop is and how it works.
    Altogether, it is the best.

    • prasanth

      Sir, I want the Hadoop chunking program in Java.

  • lvprasad tumma

    Hi, currently I am working with BI tools like SSIS and SSRS, and I am very interested in learning Hadoop. Are there any prerequisites before starting Hadoop?

  • darshan

    This article is really helpful for understanding Hadoop.

  • Deepak Thapar

    Can we use Hadoop in a core banking system, so that if the network to the central database goes down, customer-related info can be retrieved from a database stored locally in the branch? This database may have only account and balance information.

  • BN Singh

    This is really concise and conceptual information on Hadoop. Very helpful!

  • http://twitter.com/evway EVway

    I wonder if something like Hadoop or BigQuery could be built on the domain CloudQuery.com. That’s the perfect name for a bigger query. Zoho owns cloud-query.com; maybe they are listening.

  • gopi

    Sir, I want a placement in Hadoop.

  • sud

    I am getting started, so which software do I need to install to get started with Hadoop?

  • amit

    Thanks a lot for this Hadoop intro. Perfect for me to start with…

  • Tam

    Nice interview. It has exactly what is needed, no more and no less. Love it.

  • yogesh

    Nice! Now I clearly understand exactly how Hadoop works. Thanks a lot!

  • lalith kumar

    Do we have any front-end applications for Hadoop that can connect to HDFS?

  • Jim Porter

    Looks like the emerging solution for optimizing the use of Hadoop/MapReduce is SequenceL. Exciting progress is being made.

  • vimal jena

    What basic knowledge should I have to learn Hadoop effectively? I mean, is it necessary to have knowledge of Java, .NET, etc.?

  • Shakeel

    Hi,

    Where can I get training for Hadoop in Pune?

  • Sunil

    Before learning Hadoop, are there any specific languages we must already know, like Java, .NET, or jQuery? Can you please guide me on what would make learning Hadoop easy and quick?

  • Asker

    So, is Hadoop a database system like Oracle?