Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Hadoop is a scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two components:
- Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
- MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
In a way, Hadoop is the opposite of virtualization. A virtualization product such as VMware takes one big physical server and divides it into many small virtual servers. Hadoop does the reverse: it takes a large number of servers and presents their combined storage and processing capacity as one big resource.
Salient features of Hadoop
- Solution for Big Data
* Deals with the complexities of high volume, velocity and variety of data
- Set of Open Source Projects
- Transforms commodity hardware into a service that:
* allows huge distributed computations
- Key Attributes
* Extremely powerful
* Batch processing centric
* Easy to program distributed applications
* Runs on commodity hardware
Hadoop is:
- Reliable
* Tasks that fail are redone
- Scalable
* Scales linearly
- Simple APIs
- Very powerful
* Processing in parallel allows for the timely processing of massive amounts of data
What is HDFS?
HDFS --- Hadoop Distributed File System
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Areas where HDFS is not a good fit today:
- Low latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications
Main components of HDFS
NameNode:
- Master of the system
- Maintains and manages the blocks which are present on the Data Nodes
DataNodes:
- Slaves which are deployed on each machine and provide the actual storage
- Responsible for serving read and write requests for the clients
The MapReduce server on a typical machine is called a TaskTracker
The HDFS server on a typical machine is called a DataNode
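To see these two roles in action, here is a minimal sketch (the file and directory names are made up for illustration): a local file is copied into HDFS, then the NameNode is asked for the file's block report, and each reported location is a DataNode that actually stores a replica. The individual commands are covered in the reference below.
# Copy a local log file into HDFS (hypothetical paths)
hadoop dfs -put /tmp/access.log /user/demo/access.log
# Ask for the file's blocks and the DataNodes holding them
hadoop fsck /user/demo/access.log -files -blocks -locations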
Commonly used Hadoop commands
User commands
hadoop archive ---> Creates a hadoop archive
-archiveName <name> Name of the archive to be created
src Filesystem pathnames which work as usual with regular expressions
dest Destination directory which would contain the archive
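A minimal sketch of creating an archive, with placeholder source and destination directories (depending on your Hadoop version you may also need a -p <parent> option):
# Bundle everything under /user/demo/logs into one .har archive
hadoop archive -archiveName logs.har /user/demo/logs /user/demo/archives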
hadoop distcp ---> Copy file or directories recursively
srcurl Source URL
desturl Destination URL
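For example, to copy a directory between two clusters (the NameNode hostnames and port below are placeholders):
# Recursively copy /user/demo/input from one cluster to another
hadoop distcp hdfs://nn1.example.com:8020/user/demo/input hdfs://nn2.example.com:8020/user/demo/input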
hadoop dfs ---> Runs a generic filesystem user client
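Typical invocations look like ordinary file-system operations (the paths are only examples):
hadoop dfs -mkdir /user/demo              # create a directory in HDFS
hadoop dfs -put localfile.txt /user/demo  # upload a local file
hadoop dfs -ls /user/demo                 # list the directory
hadoop dfs -cat /user/demo/localfile.txt  # print a file's contents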
hadoop fsck ---> Runs a HDFS filesystem checking utility
path Start checking from this path
-move Move corrupted files to /lost+found
-delete Delete corrupted files
-openforwrite Print out files opened for write
-files Print out files being checked
-blocks Print out block report
-locations Print out locations for every block
-racks Print out network topology for datanode locations
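As an example, the following checks an entire directory tree (a hypothetical path) and then moves any corrupted files it finds to /lost+found:
hadoop fsck /user/demo -files -blocks
hadoop fsck /user/demo -move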
hadoop fetchdt ---> Gets Delegation Token from a NameNode
filename Filename to store the token into
--webservice <https-address> Use HTTP protocol instead of RPC
hadoop jar ---> Runs a jar file. Users can bundle their MapReduce code in a jar file and execute it using this command
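For example, running the wordcount program that ships with Hadoop (the exact jar file name depends on your Hadoop version, and the input/output paths are placeholders):
hadoop jar hadoop-examples.jar wordcount /user/demo/input /user/demo/output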
hadoop job ---> Command to interact with MapReduce jobs
-submit <job-file>
-status <job-id>
-counter <job-id> <group-name> <counter-name>
-kill <job-id>
-events <job-id> <from-event-#> <#-of-events>
-history [all] <jobOutputDir>
-list [all]
-kill-task <task-id>
-fail-task <task-id>
-set-priority <job-id> <priority>
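A few typical interactions, assuming a hypothetical job id:
hadoop job -list                                      # show running jobs
hadoop job -status job_201401011200_0001              # progress and counters for one job
hadoop job -set-priority job_201401011200_0001 HIGH   # raise its priority
hadoop job -kill job_201401011200_0001                # kill it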
hadoop pipes ---> Runs a pipes job
-conf path
-jobconf <key=value>, <key=value>, ...
-input path
-output path
-jar <jar-file>
-inputformat <class>
-map <class>
-partitioner <class>
-reduce <class>
-writer <class>
-program <executable>
-reduces <number>
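A rough sketch of submitting a C++ executable through pipes; the configuration file, HDFS directories and executable name below are all placeholders:
hadoop pipes -conf wordcount-pipes.xml \
  -input /user/demo/input -output /user/demo/output \
  -program bin/wordcount-pipes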
hadoop queue ---> Command to interact and view JobQueue information
-list
-info <job-queue-name> [showJobs]
-showacls
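For instance, to list the configured queues and then inspect the stock "default" queue along with its jobs:
hadoop queue -list
hadoop queue -info default -showJobs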
hadoop version ---> Prints the version
hadoop CLASSNAME ---> Runs the class named CLASSNAME
hadoop classpath ---> Prints the class path needed to get the Hadoop jar and the required libraries
Administration commands
hadoop balancer ---> Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process
-threshold <threshold>
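For instance, to rebalance until each datanode's disk usage is within 5% of the cluster average (the default threshold is 10%):
hadoop balancer -threshold 5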
hadoop daemonlog ---> Get/Set the log level for each daemon
-getlevel <host:port> <name>
-setlevel <host:port> <name> <level>
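For example, to check and then raise the NameNode's log level on a hypothetical host (50070 is the default NameNode web UI port in Hadoop 1.x):
hadoop daemonlog -getlevel nn.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop daemonlog -setlevel nn.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG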
hadoop datanode ---> Runs an HDFS datanode
-rollback Rolls back the datanode to the previous version. This should be used after stopping the datanode and distributing the old hadoop version.
hadoop dfsadmin ---> Runs a HDFS dfsadmin client
-safemode enter / leave / get / wait
-refreshNodes
-finalizeUpgrade
-upgradeProgress status / details / force
-metasave <filename>
-setQuota <quota> <dirname>...<dirname>
-clrQuota <dirname>...<dirname>
-restoreFailedStorage true / false / check
-help [cmd]
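A few typical uses (the directory below is only an example):
hadoop dfsadmin -safemode get                 # is the namenode in safe mode?
hadoop dfsadmin -refreshNodes                 # re-read the allowed/excluded datanode lists
hadoop dfsadmin -setQuota 100000 /user/demo   # cap the number of names under /user/demo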
hadoop mradmin ---> Runs MR admin client
-refreshQueueAcls
hadoop jobtracker
-dumpConfiguration
hadoop namenode
-format
-upgrade
-rollback
-finalize
-importCheckpoint
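Note that -format erases all existing HDFS metadata, so it is normally run only once, when setting up a brand-new cluster:
hadoop namenode -format   # initialise a new filesystem (destroys any existing metadata)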
hadoop secondarynamenode
-checkpoint [-force]
-geteditsize
hadoop tasktracker
Startup scripts
start-dfs.sh ===> Starts the Hadoop DFS daemons, the namenode and datanodes. Use this before you run start-mapred.sh
stop-dfs.sh ===> Stops the Hadoop DFS daemons
start-mapred.sh ==> Starts the Hadoop MapReduce daemons, the jobtracker and tasktracker
stop-mapred.sh ==> Stops the Hadoop MapReduce daemons
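These scripts live in the bin directory of the Hadoop installation and are usually run on the master node. A typical bring-up and shutdown sequence:
# Bring the cluster up: HDFS first, then MapReduce
start-dfs.sh
start-mapred.sh
# Bring it down in the reverse order
stop-mapred.sh
stop-dfs.sh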
Configuration files
hadoop-env.sh ==> This file contains some environment variable settings used by Hadoop. You can use these to influence some aspects of Hadoop daemon behaviour, such as where the log files are stored, the maximum amount of heap used, and so on. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
slaves ===> This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run.
hadoop-default.xml => This file contains generic default settings for Hadoop daemons and MapReduce jobs. DO NOT MODIFY THIS FILE.
mapred-default.xml => This file contains site specific settings for the Hadoop MapReduce daemons and jobs. This file is empty by default. Putting configuration properties in this file will override MapReduce settings in the hadoop-default.xml file. Use this file to customize the behaviour of MapReduce on your site.
hadoop-site.xml ===> This file contains site specific settings for all Hadoop daemons and MapReduce jobs. This file is empty by default. Settings in this file override those in hadoop-default.xml and mapred-default.xml files. This file should contain settings that must be respected by all servers and clients in a Hadoop installation, for instance, the location of the namenode and the jobtracker.
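As a small illustration, these are the two settings people most often touch in hadoop-env.sh; the Java path and heap size below are only examples and will differ on your machines:
# hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java    # path to the Java installation (example path)
export HADOOP_HEAPSIZE=2000           # daemon heap size in MB (default is 1000)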