Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Hadoop is a scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two components:
- Hadoop Distributed File System: self-healing, high-bandwidth clustered storage
- MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
In a way, Hadoop is the opposite of virtualization. A virtualization product such as VMware takes one big physical server and divides it into many small virtual servers. Hadoop does the reverse: it takes a large number of servers and presents their combined storage and processing capacity as one big resource.
Salient features of Hadoop
- Solution for Big Data
* Deals with the complexities of high volume, velocity and variety of data
- Set of Open Source Projects
- Transforms commodity hardware into a service that:
* allows huge distributed computations
- Key Attributes
* Extremely powerful
* Batch processing centric
* Easy to program distributed applications
* Runs on commodity hardware
Hadoop is:
- Reliable
* Tasks that fail are redone
- Scalable
* Scales linearly
- Simple APIs
- Very powerful
* Processing in parallel allows for the timely processing of massive amounts of data
What is HDFS?
HDFS --- Hadoop Distributed File System
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Areas where HDFS is not a good fit today:
- Low latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications
Main components of HDFS
NameNode:
- Master of the system
- Maintains and manages the blocks which are present on the Data Nodes
DataNodes:
- Slaves which are deployed on each machine and provide the actual storage
- Responsible for serving read and write requests for the clients
The MapReduce server on a typical machine is called a TaskTracker
The HDFS server on a typical machine is called a DataNode
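To see these two roles in action, here is a minimal sketch (the file and directory names are made up for illustration): a local file is copied into HDFS, then the NameNode is asked for the file's block report, and each reported location is a DataNode that actually stores a replica. The individual commands are covered in the reference below.
# Copy a local log file into HDFS (hypothetical paths)
hadoop dfs -put /tmp/access.log /user/demo/access.log
# Ask for the file's blocks and the DataNodes holding them
hadoop fsck /user/demo/access.log -files -blocks -locations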
Commonly used Hadoop commands
User commands
hadoop archive ---> Creates a hadoop archive
-archiveName <name> Name of the archive to be created
src Filesystem pathnames which work as usual with regular expressions
dest Destination directory which would contain the archive
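A minimal sketch of creating an archive, with placeholder source and destination directories (depending on your Hadoop version you may also need a -p <parent> option):
# Bundle everything under /user/demo/logs into one .har archive
hadoop archive -archiveName logs.har /user/demo/logs /user/demo/archives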
hadoop distcp ---> Copy file or directories recursively
srcurl Source URL
desturl Destination URL
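For example, to copy a directory between two clusters (the NameNode hostnames and port below are placeholders):
# Recursively copy /user/demo/input from one cluster to another
hadoop distcp hdfs://nn1.example.com:8020/user/demo/input hdfs://nn2.example.com:8020/user/demo/input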
hadoop dfs ---> Runs a generic filesystem user client
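Typical invocations look like ordinary file-system operations (the paths are only examples):
hadoop dfs -mkdir /user/demo              # create a directory in HDFS
hadoop dfs -put localfile.txt /user/demo  # upload a local file
hadoop dfs -ls /user/demo                 # list the directory
hadoop dfs -cat /user/demo/localfile.txt  # print a file's contents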
hadoop fsck ---> Runs a HDFS filesystem checking utility
path Start checking from this path
-move Move corrupted files to /lost+found
-delete Delete corrupted files
-openforwrite Print out files opened for write
-files Print out files being checked
-blocks Print out block report
-locations Print out locations for every block
-racks Print out network topology for datanode locations
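As an example, the following checks an entire directory tree (a hypothetical path) and then moves any corrupted files it finds to /lost+found:
hadoop fsck /user/demo -files -blocks
hadoop fsck /user/demo -move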
hadoop fetchdt ---> Gets Delegation Token from a NameNode
filename Filename to store the token into
--webservice <https-address> Use HTTP protocol instead of RPC
hadoop jar ---> Runs a jar file. Users can bundle their MapReduce code in a jar file and execute it using this command
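For example, running the wordcount program that ships with Hadoop (the exact jar file name depends on your Hadoop version, and the input/output paths are placeholders):
hadoop jar hadoop-examples.jar wordcount /user/demo/input /user/demo/output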
hadoop job ---> Command to interact with MapReduce jobs
-submit <job-file>
-status <job-id>
-counter <job-id> <group-name> <counter-name>
-kill <job-id>
-events <job-id> <from-event-#> <#-of-events>
-history [all] <jobOutputDir>
-list [all]
-kill-task <task-id>
-fail-task <task-id>
-set-priority <job-id> <priority>
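A few typical interactions, assuming a hypothetical job id:
hadoop job -list                                      # show running jobs
hadoop job -status job_201401011200_0001              # progress and counters for one job
hadoop job -set-priority job_201401011200_0001 HIGH   # raise its priority
hadoop job -kill job_201401011200_0001                # kill it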
hadoop pipes ---> Runs a pipes job
-conf path
-jobconf <key=value>, <key=value>, ...
-input path
-output path
-jar <jar-file>
-inputformat <class>
-map <class>
-partitioner <class>
-reduce <class>
-writer <class>
-program <executable>
-reduces <number>
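A rough sketch of submitting a C++ executable through pipes; the configuration file, HDFS directories and executable name below are all placeholders:
hadoop pipes -conf wordcount-pipes.xml \
  -input /user/demo/input -output /user/demo/output \
  -program bin/wordcount-pipes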
hadoop queue ---> Command to interact and view JobQueue information
-list
-info <job-queue-name> [showJobs]
-showacls
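For instance, to list the configured queues and then inspect the stock "default" queue along with its jobs:
hadoop queue -list
hadoop queue -info default -showJobs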
hadoop version ---> Prints the version
hadoop CLASSNAME ---> Runs the class named CLASSNAME
hadoop classpath ---> Prints the class path needed to get the Hadoop jar and the required libraries
Administration commands
hadoop balancer ---> Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process
-threshold <threshold>
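For instance, to rebalance until each datanode's disk usage is within 5% of the cluster average (the default threshold is 10%):
hadoop balancer -threshold 5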
hadoop daemonlog ---> Get/Set the log level for each daemon
-getlevel <host:port> <name>
-setlevel <host:port> <name> <level>
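For example, to check and then raise the NameNode's log level on a hypothetical host (50070 is the default NameNode web UI port in Hadoop 1.x):
hadoop daemonlog -getlevel nn.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop daemonlog -setlevel nn.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG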
hadoop datanode ---> Runs an HDFS datanode
-rollback Rolls back the datanode to the previous version. This should be used after stopping the datanode and distributing the old hadoop version.
hadoop dfsadmin ---> Runs a HDFS dfsadmin client
-safemode enter / leave / get / wait
-refreshNodes
-finalizeUpgrade
-upgradeProgress status / details / force
-metasave <filename>
-setQuota <quota> <dirname>...<dirname>
-clrQuota <dirname>...<dirname>
-restoreFailedStorage true / false / check
-help [cmd]
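A few typical uses (the directory below is only an example):
hadoop dfsadmin -safemode get                 # is the namenode in safe mode?
hadoop dfsadmin -refreshNodes                 # re-read the allowed/excluded datanode lists
hadoop dfsadmin -setQuota 100000 /user/demo   # cap the number of names under /user/demo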
hadoop mradmin ---> Runs MR admin client
-refreshQueueAcls
hadoop jobtracker
-dumpConfiguration
hadoop namenode
-format
-upgrade
-rollback
-finalize
-importCheckpoint
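Note that -format erases all existing HDFS metadata, so it is normally run only once, when setting up a brand-new cluster:
hadoop namenode -format   # initialise a new filesystem (destroys any existing metadata)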
hadoop secondarynamenode
-checkpoint [-force]
-geteditsize
hadoop tasktracker
Startup scripts
start-dfs.sh ===> Starts the Hadoop DFS daemons, the namenode and datanodes. Use this before you run start-mapred.sh
stop-dfs.sh ===> Stops the Hadoop DFS daemons
start-mapred.sh ==> Starts the Hadoop MapReduce daemons, the jobtracker and tasktracker
stop-mapred.sh ==> Stops the Hadoop MapReduce daemons
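These scripts live in the bin directory of the Hadoop installation and are usually run on the master node. A typical bring-up and shutdown sequence:
# Bring the cluster up: HDFS first, then MapReduce
start-dfs.sh
start-mapred.sh
# Bring it down in the reverse order
stop-mapred.sh
stop-dfs.sh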
Configuration files
hadoop-env.sh ==> This file contains some environment variable settings used by Hadoop. You can use these to influence some aspects of Hadoop daemon behaviour, such as where the log files are stored, the maximum amount of heap used, and so on. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
slaves ===> This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run.
hadoop-default.xml => This file contains generic default settings for Hadoop daemons and MapReduce jobs. DO NOT MODIFY THIS FILE.
mapred-default.xml => This file contains site specific settings for the Hadoop MapReduce daemons and jobs. This file is empty by default. Putting configuration properties in this file will override MapReduce settings in the hadoop-default.xml file. Use this file to customize the behaviour of MapReduce on your site.
hadoop-site.xml ===> This file contains site specific settings for all Hadoop daemons and MapReduce jobs. This file is empty by default. Settings in this file override those in hadoop-default.xml and mapred-default.xml files. This file should contain settings that must be respected by all servers and clients in a Hadoop installation, for instance, the location of the namenode and the jobtracker.
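As a small illustration, these are the two settings people most often touch in hadoop-env.sh; the Java path and heap size below are only examples and will differ on your machines:
# hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java    # path to the Java installation (example path)
export HADOOP_HEAPSIZE=2000           # daemon heap size in MB (default is 1000)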