1. Fun Fact: Hadoop was created by Doug Cutting and Mike Cafarella in 2005
It is named after Hadoop, a toy elephant that belonged to Doug’s son
2. Hadoop is a distributed data management and processing system that runs on clusters of hardware (servers)
Unlike the database systems we have studied (such as MongoDB), Hadoop supports general-purpose processing, not just database operations.
3. Hadoop is open source (maintained by the Apache Software Foundation)
4. Hadoop is the foundation of many Big Data frameworks
Hadoop Users
Amazon
Google
Expedia
JP Morgan
Facebook
Yahoo
eBay
History and Evolution
What is Hadoop Used for?
Searching
Log Processing
Recommendation Systems
Data Analytics
Video and Image Analysis
Data Storage/Retention
• Structured/Unstructured/Semi-Structured
Machine Learning Models
Hadoop Distributions
Amazon Web Services
Apache Bigtop
Cascading
Cloudera
Cloudspace
Datameer
Data Mine Lab
Datasalt
DataStax
DataTorrent
Ndisco
Debian
Emblocsoft
Hortonworks
HStreaming
IBM
Impetus
Jaspersoft
Karmasphere
Apache Mahout
Nutch
And many others
Hadoop Distributed File System (HDFS)
Data is divided into fixed-size blocks (128 MB by default in Hadoop 2 and later; the last block of a file may be smaller).
Each block is replicated (default replication factor: 3)
Each replica is stored on a different DataNode (some of which may be in different racks)
The NameNode keeps the metadata recording which blocks are on which DataNodes
The NameNode is the primary node
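The block layout described above can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the names `split_into_blocks`, `place_replicas`, and the DataNode labels `dn1` to `dn4` are made up for illustration, the block size is a few bytes so the output is readable, and replicas are placed round-robin, whereas real HDFS uses a rack-aware placement policy.

```python
from typing import Dict, List

BLOCK_SIZE = 4      # bytes; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 3     # default replication factor from the notes


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the data into equal-size blocks; only the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Build a toy NameNode block map: block index -> DataNodes holding a replica.

    Assumption: simple round-robin placement, so the replicas of one block
    always land on distinct DataNodes. Rack awareness is not modelled.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n = len(datanodes)
    return {b: [datanodes[(b + r) % n] for r in range(replication)]
            for b in range(num_blocks)}


blocks = split_into_blocks(b"hello hadoop")
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for index, nodes in block_map.items():
    print(index, blocks[index], nodes)
```

The dictionary returned by `place_replicas` plays the role of the NameNode's metadata: it holds no file data, only the record of where each block lives.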
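For a write, the client asks the NameNode which DataNodes (secondary/worker nodes) should hold the block, then sends the data to the first of them
That DataNode writes the block and then forwards the data to the DataNodes that hold the other replicas.
A read request can be served by any of the replicas; the NameNode tells the client where they are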
Unlike Cassandra, HDFS does not run consistency checks across replicas on reads
Instead, a checksum (an integrity check, not a digital signature) is stored with each block, and the data returned by a read is verified against it.
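The write and read path can be illustrated with a small in-memory sketch. Everything here is hypothetical: the `DataNode` class and `read_verified` function are invented for the example, `zlib.crc32` stands in for whatever checksum the cluster actually uses, and the sketch assumes the checksum is kept next to each block replica and verified by the reader.

```python
import zlib
from typing import Dict, List, Tuple


class DataNode:
    """Toy DataNode: keeps each block together with a checksum of its data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, Tuple[bytes, int]] = {}

    def write(self, block_id: str, data: bytes, pipeline: List["DataNode"]) -> None:
        # Store the block locally, then forward it to the next replica holder.
        self.blocks[block_id] = (data, zlib.crc32(data))
        if pipeline:
            pipeline[0].write(block_id, data, pipeline[1:])


def read_verified(block_id: str, replicas: List[DataNode]) -> bytes:
    """Return the block from the first replica whose checksum still matches."""
    for node in replicas:
        stored = node.blocks.get(block_id)
        if stored is None:
            continue  # this node does not hold the block
        data, checksum = stored
        if zlib.crc32(data) == checksum:
            return data
    raise IOError("no healthy replica found for " + block_id)


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
dn1.write("blk_1", b"hello", [dn2, dn3])    # dn1 forwards to dn2, then dn3

# Simulate silent corruption on dn1: the data changes, the stored checksum does not.
dn1.blocks["blk_1"] = (b"jello", dn1.blocks["blk_1"][1])

result = read_verified("blk_1", [dn1, dn2, dn3])
print(result)
```

Because the replica on `dn1` fails its checksum, the read falls through to `dn2`. No comparison between replicas is needed, which is the contrast with Cassandra-style consistency checks.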
Comments