Tuesday, March 31, 2020

Large-scale data management with Apache Hadoop

Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.  Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.  Rather than relying on high-end hardware, these clusters derive their resiliency from the software's ability to detect and handle failures at the application layer.




Hadoop is a scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).

Core Hadoop has two components:

  • Hadoop Distributed File System (HDFS):  self-healing, high-bandwidth clustered storage
  • MapReduce:  distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction

In a way, Hadoop is the opposite of virtualization.  A virtualization product such as VMware takes one big physical server and divides it into many small virtual servers, whereas Hadoop takes a large number of servers and consolidates them so that they appear as one big system for storing and processing data.


Salient features of Hadoop
  • Solution for Big Data
                      * Deals with complexities of high volume, velocity and variety of data
  • Set of Open Source Projects
  • Transforms commodity hardware into a service that:
                      * stores petabytes of data reliably
                      * allows huge distributed computations
  • Key Attributes
                      * Redundant and reliable (no data loss)
                      * Extremely powerful
                      * Batch processing centric
                      * Easy to program distributed applications
                      * Runs on commodity hardware

Hadoop is:
  • Reliable
                      * Data is typically held on multiple DataNodes
                      * Tasks that fail are redone
  • Scalable
                      * Same program runs on 1, 1000 or 5000 machines
                      * Scales linearly
  • Simple APIs
  • Very powerful
                      * Can process petabytes of data in parallel
                      * Parallel processing allows massive amounts of data to be handled in a timely manner


What is HDFS ?

HDFS --- Hadoop Distributed File System
  • Highly fault-tolerant
  • High throughput
  • Suitable for applications with large data sets
  • Streaming access to file system data
  • Can be built out of commodity hardware
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
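To get a feel for HDFS, here is a minimal round trip through the file system shell; the paths and file names are purely illustrative:

      hadoop fs -mkdir /user/hadoop/logs                   # create a directory in HDFS
      hadoop fs -put access.log /user/hadoop/logs          # copy a local file into HDFS
      hadoop fs -ls /user/hadoop/logs                      # list the directory contents
      hadoop fs -cat /user/hadoop/logs/access.log          # stream the file back to the console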


Areas where HDFS is not a good fit today:
  • Low latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications

Main components of HDFS

NameNode:
  • Master of the system
  • Maintains and manages the blocks which are present on the Data Nodes

DataNodes:
  • Slaves which are deployed on each machine and provide the actual storage
  • Responsible for serving read and write requests for the clients


The MapReduce server on a typical machine is called a TaskTracker
The HDFS server on a typical machine is called a DataNode
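As a quick sanity check on a running cluster, you can ask the master daemons which slaves they currently see; the exact output varies between releases:

      hadoop dfsadmin -report              # capacity figures plus the list of live and dead DataNodes
      hadoop job -list-active-trackers     # TaskTrackers currently known to the JobTracker (on 1.x releases)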



Commonly used Hadoop commands

User commands

hadoop archive --->   Creates a hadoop archive
      -archiveName <name>      Name of the archive to be created
      src                                       Filesystem pathnames which work as usual with regular expressions
      dest                                     Destination directory which would contain the archive
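For example, packing two directories into one archive might look like this (all names and paths are illustrative; depending on the release you may also need a -p <parent path> option):

      hadoop archive -archiveName logs.har /user/hadoop/dir1 /user/hadoop/dir2 /user/hadoop/archives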

hadoop distcp --->      Copy file or directories recursively
      srcurl                                  Source URL
      desturl                                Destination URL
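A common use is copying a directory tree between two clusters; the host names and ports below are hypothetical:

      hadoop distcp hdfs://namenode1:8020/user/hadoop/input hdfs://namenode2:8020/user/hadoop/input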

hadoop dfs      --->      Runs a generic filesystem user client

hadoop fsck    --->      Runs a HDFS filesystem checking utility
      path                                    Start checking from this path
      -move                                 Move corrupted files to /lost+found
      -delete                                Delete corrupted files
      -openforwrite                    Print out files opened for write
      -files                                   Print out files being checked
      -blocks                               Print out block report
      -locations                           Print out locations for every block
      -racks                                 Print out network topology for datanode locations
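For instance, to check the entire file system and print block locations (a read-only operation):

      hadoop fsck / -files -blocks -locations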

hadoop fetchdt --->     Gets Delegation Token from a NameNode
    filename                                        Filename to store the token into
    --webservice <https-address>      Use HTTP protocol instead of RPC

hadoop jar  --->           Runs a jar file.  Users can bundle their MapReduce code in a jar file and execute it using this command
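For example, running the word-count example that ships with Hadoop; the jar name and the input/output paths vary by installation:

      hadoop jar hadoop-examples.jar wordcount /user/hadoop/input /user/hadoop/wordcount-output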

hadoop job --->           Command to interact with MapReduce jobs
      -submit <job-file>
      -status <job-id>
      -counter <job-id> <group-name> <counter-name>
      -kill <job-id>
      -events <job-id> <from-event-#> <#-of-events>
      -history [all] <jobOutputDir>
      -list [all]
      -kill-task <task-id>
      -fail-task <task-id>
      -set-priority <job-id> <priority>
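A typical sequence, with a made-up job id for illustration:

      hadoop job -list                            # show the jobs currently running
      hadoop job -status job_202003310001_0007    # check completion percentage and counters
      hadoop job -kill job_202003310001_0007      # kill the job if it misbehaves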

hadoop pipes  --->       Runs a pipes job
      -conf path
      -jobconf <key=value, key=value, .....>
      -input path
      -output path
      -jar <jar-file>
      -inputformat <class>
      -map <class>
      -partitioner <class>
      -reduce <class>
      -writer <class>
      -program <executable>
      -reduces <number>
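A pipes job usually points Hadoop at a C++ executable that has already been copied into HDFS; all paths below are illustrative:

      hadoop pipes -input /user/hadoop/input -output /user/hadoop/pipes-output -program /user/hadoop/bin/wordcount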

hadoop queue  --->      Command to interact and view JobQueue information
      -list
      -info <job-queue-name> [showJobs]
      -showacls

hadoop version  --->   Prints the version

hadoop CLASSNAME  --->  Runs the class named CLASSNAME

hadoop classpath  ---> Prints the class path needed to get the Hadoop jar and the required libraries

Administration commands

hadoop balancer --->   Runs a cluster balancing utility.  An administrator can simply press Ctrl-C to stop the rebalancing process
      -threshold <threshold>
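For example, to rebalance until every DataNode is within 5% of the average cluster utilisation:

      hadoop balancer -threshold 5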

hadoop daemonlog ---> Get/Set the log level for each daemon
      -getlevel <host:port> <name>
      -setlevel <host:port> <name> <level>
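For instance, to read the current log level of the NameNode class from a NameNode whose web UI runs on the default port (the host name is hypothetical):

      hadoop daemonlog -getlevel namenode.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode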

hadoop datanode  ---> Runs a HDFS datanode
      -rollback                   Rolls back the datanode to the previous version.  This should be used after stopping the datanode and distributing the old hadoop version.

hadoop dfsadmin  --->  Runs a HDFS dfsadmin client
      -safemode enter / leave / get / wait
      -refreshNodes
      -finalizeUpgrade
      -upgradeProgress status / details / force
      -metasave <filename>
      -setQuota <quota> <dirname>...<dirname>
      -clrQuota <dirname>...<dirname>
      -restoreFailedStorage true / false / check
      -help [cmd]
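Some commonly used invocations (the quota value and directory are illustrative):

      hadoop dfsadmin -safemode get                        # is the NameNode currently in safe mode?
      hadoop dfsadmin -refreshNodes                        # re-read the include/exclude host files
      hadoop dfsadmin -setQuota 10000 /user/hadoop/data    # limit the number of names under a directory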

hadoop mradmin --->   Runs MR admin client
      -refreshQueueAcls

hadoop jobtracker
      -dumpConfiguration

hadoop namenode
      -format
      -upgrade
      -rollback
      -finalize
      -importCheckpoint
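On a brand-new cluster the file system is formatted once, before the daemons are started for the first time; note that this erases any existing HDFS metadata:

      hadoop namenode -format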

hadoop secondarynamenode
      -checkpoint [-force]
      -geteditsize

hadoop tasktracker

Startup scripts

start-dfs.sh      ===>   Starts the Hadoop DFS daemons, the namenode and datanodes.  Use this before you run start-mapred.sh
stop-dfs.sh        ===>   Stops the Hadoop DFS daemons
start-mapred.sh ==>   Starts the Hadoop MapReduce daemons, the jobtracker and tasktracker
stop-mapred.sh ==>    Stops the Hadoop MapReduce daemons
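A typical start and stop sequence, run from the master node by the user that owns the Hadoop installation, is therefore:

      bin/start-dfs.sh        # bring up the NameNode and DataNodes first
      bin/start-mapred.sh     # then the JobTracker and TaskTrackers
      bin/stop-mapred.sh      # shut down MapReduce first
      bin/stop-dfs.sh         # then shut down HDFS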

Configuration files

hadoop-env.sh  ==>     This file contains some environment variable settings used by Hadoop.  You can use these to influence some aspects of Hadoop daemon behaviour, such as where the log files are stored, the maximum amount of heap used, and so on.  The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java 1.5 installation used by Hadoop (see the example after this list).
slaves                ===>   This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run.
hadoop-default.xml =>  This file contains generic default settings for Hadoop daemons and MapReduce jobs.  DO NOT MODIFY THIS FILE.
mapred-default.xml =>  This file contains site specific settings for the Hadoop MapReduce daemons and jobs.  This file is empty by default.  Putting configuration properties in this file will override MapReduce settings in the hadoop-default.xml file.  Use this file to customize the behaviour of MapReduce on your site.
hadoop-site.xml  ===>   This file contains site specific settings for all Hadoop daemons and MapReduce jobs.  This file is empty by default.  Settings in this file override those in hadoop-default.xml and mapred-default.xml files.  This file should contain settings that must be respected by all servers and clients in a Hadoop installation, for instance, the location of the namenode and the jobtracker.
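As an illustration of hadoop-env.sh, the lines below set the Java installation, the daemon heap size and the log directory; the values are only examples and must match your own machines:

      export JAVA_HOME=/usr/lib/jvm/java          # path to the Java installation used by Hadoop
      export HADOOP_HEAPSIZE=2000                 # maximum heap for the daemons, in MB
      export HADOOP_LOG_DIR=/var/log/hadoop       # where the daemon log files are written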




