.. _hadoop:

Apache\* Hadoop\*
#################

This tutorial explains the process of installing, configuring, and
running Apache Hadoop on |CL-ATTR|.

.. contents::
   :local:
   :depth: 1

Description
***********

For this tutorial, you will install Hadoop on a single machine
running both the master and slave daemons.

The Apache Hadoop software library is a framework for distributed processing
of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of
machines, with each machine offering local computation and storage.

Prerequisites
*************

* :ref:`bare-metal-install-desktop`

* In |CL|, run :command:`swupd update`

Install Apache Hadoop
*********************

Apache Hadoop is included in the :command:`big-data-basic` bundle. To install
the framework, enter the following command:

.. code-block:: bash

   sudo swupd bundle-add big-data-basic

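Optionally, you can verify the installation before moving on. This is a
minimal check that assumes the :command:`swupd` and :command:`hadoop`
commands used elsewhere in this tutorial are available; the exact version
string depends on the Hadoop release shipped with your |CL| version.

.. code-block:: bash

   # Optional: confirm the bundle is installed and Hadoop is usable.
   sudo swupd bundle-list | grep big-data-basic
   hadoop version
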
Configure Apache Hadoop
***********************

#. To create the configuration directory, enter the following command:

   .. code-block:: bash

      sudo mkdir /etc/hadoop

#. Copy the defaults from :file:`/usr/share/defaults/hadoop` to
   :file:`/etc/hadoop` with the following command:

   .. code-block:: bash

      sudo cp /usr/share/defaults/hadoop/* /etc/hadoop

   .. note:: Since |CL| is a stateless system, never modify the
      files under the :file:`/usr/share/defaults` directory. The software
      updater will overwrite those files.

Once all the configuration files are in :file:`/etc/hadoop`, edit them to
fit your needs. The `NameNode` server is the master server that manages the
namespace of the file system and regulates client access to files. The first
file to edit, :file:`/etc/hadoop/core-site.xml`, informs the Hadoop daemon
where `NameNode` is running. In this tutorial, `NameNode` runs on
`localhost`.

#. Open the :file:`/etc/hadoop/core-site.xml` file using any editor and modify
   the file as follows:

   .. code-block:: xml

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:9000</value>
        </property>
      </configuration>

#. Edit the :file:`/etc/hadoop/hdfs-site.xml` file. This file configures the
   :abbr:`HDFS (Hadoop Distributed File System)` daemons. This configuration
   includes the list of permitted and excluded data nodes, as well as the
   block size. For this example, set the block replication factor to 1
   instead of the default of 3 as follows:

   .. code-block:: xml
      :emphasize-lines: 6

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
        <property>
          <name>dfs.permissions.enabled</name>
          <value>false</value>
        </property>
      </configuration>

#. Edit the :file:`/etc/hadoop/mapred-site.xml` file. This file configures
   all daemons related to `MapReduce`: `JobTracker` and `TaskTrackers`. With
   `MapReduce`, Hadoop can process large amounts of data across multiple
   systems. In our example, we set
   :abbr:`YARN (Yet Another Resource Negotiator)` as our runtime framework
   for executing `MapReduce` jobs as follows:

   .. code-block:: xml
      :emphasize-lines: 5,6

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>

#. Edit the :file:`/etc/hadoop/yarn-site.xml` file. This file configures all
   daemons related to `YARN`: `ResourceManager` and `NodeManager`. In our
   example, we implement the `mapreduce_shuffle` service, which is the
   default, as follows:

   .. code-block:: xml
      :emphasize-lines: 4,5,8,9

      <?xml version="1.0"?>
      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
      </configuration>

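After editing the files, you can optionally confirm that each one is still
well-formed XML before starting the daemons. The loop below is a minimal
sketch that assumes :command:`python3` is available on your system; any
XML-aware tool works just as well.

.. code-block:: bash

   # Optional: parse each edited configuration file and report syntax errors.
   for f in /etc/hadoop/core-site.xml /etc/hadoop/hdfs-site.xml \
            /etc/hadoop/mapred-site.xml /etc/hadoop/yarn-site.xml; do
       python3 -c "import sys, xml.dom.minidom as m; m.parse(sys.argv[1])" "$f" \
           && echo "$f: OK"
   done
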
Configure your SSH key
**********************

#. Create an SSH key. If you already have one, skip this step.

   .. code-block:: bash

      sudo ssh-keygen -t rsa

#. Append the public key to your authorized keys file.

   .. code-block:: bash

      sudo cat /root/.ssh/id_rsa.pub | sudo tee -a /root/.ssh/authorized_keys

#. Log in to localhost. If no password prompt appears, you are ready to
   run the Hadoop daemons.

   .. code-block:: bash

      sudo ssh localhost

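If you prefer a non-interactive check, the following sketch uses the standard
OpenSSH `BatchMode` option, which makes :command:`ssh` fail with an error
instead of prompting for a password when key-based authentication is not
working.

.. code-block:: bash

   # Optional: fails, rather than prompting, if key-based login is not set up.
   sudo ssh -o BatchMode=yes localhost echo "key-based login works"
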
Run the Hadoop daemons
**********************

With all the configuration files properly edited, you are ready to start the
daemons.

When you format the `NameNode` server, it erases the metadata related to the
data nodes. Thus, all the information on the data nodes is lost, and the
nodes can be reused for new data.

#. Format the `NameNode` server with the following command:

   .. code-block:: bash

      sudo hdfs namenode -format

#. Start the DFS daemons, `NameNode` and `DataNodes`, with the following
   command:

   .. code-block:: bash

      sudo start-dfs.sh

   The console output should be similar to:

   .. code-block:: console

      Starting namenodes on [localhost]
      The authenticity of host 'localhost (::1)' can't be established.
      ECDSA key fingerprint is
      SHA256:97e+7TnomsS9W7GjFPjzY75HGBp+f1y6sA+ZFcOPIPU.
      Are you sure you want to continue connecting (yes/no)?

#. Enter `yes` to continue.

#. Start the `YARN` daemons, `ResourceManager` and `NodeManager`, with the
   following command:

   .. code-block:: bash

      sudo start-yarn.sh

#. Ensure everything is running as expected with the following command:

   .. code-block:: bash

      sudo jps

   The console output should be similar to:

   .. code-block:: console

      22674 DataNode
      26228 Jps
      22533 NameNode
      23046 ResourceManager
      22854 SecondaryNameNode
      23150 NodeManager

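As an additional, optional check, you can ask HDFS for a cluster report with
the standard :command:`hdfs dfsadmin` administration command. On this
single-node setup, the report should list one live data node.

.. code-block:: bash

   # Optional: show HDFS capacity and the number of live data nodes.
   sudo hdfs dfsadmin -report
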
Run the MapReduce wordcount example
***********************************

#. Create the input directory.

   .. code-block:: bash

      sudo hdfs dfs -mkdir -p /user/root/input

#. Copy a file from the local file system to HDFS. Replace
   :file:`local-file` with the path to a text file on your system.

   .. code-block:: bash

      sudo hdfs dfs -copyFromLocal local-file /user/root/input

#. Run the `wordcount` example.

   .. code-block:: bash

      sudo hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount input output

#. Read the output file :file:`part-r-00000`. This file contains the number
   of times each word appears in the input file.

   .. code-block:: bash

      sudo hdfs dfs -cat /user/root/output/part-r-00000

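If the output file name differs on your system, you can list the contents of
the output directory first; the reducer output files follow the
`part-r-<number>` naming pattern.

.. code-block:: bash

   # Optional: list the files the wordcount job wrote, including _SUCCESS.
   sudo hdfs dfs -ls /user/root/output
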
**Congratulations!**

You have successfully installed and set up a single-node Hadoop cluster.
Additionally, you ran a simple wordcount example.

Your single-node Hadoop cluster is up and running!

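When you are finished, you can stop the daemons with the companion scripts to
the start scripts used above, assuming they are available on your path in the
same way.

.. code-block:: bash

   # Stop the YARN and DFS daemons started earlier in this tutorial.
   sudo stop-yarn.sh
   sudo stop-dfs.sh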