<!DOCTYPE html>
<html lang="en" data-content_root="../">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Apache* Hadoop* &#8212; Documentation for Clear Linux* project</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=fa44fd50" />
<link rel="stylesheet" type="text/css" href="../_static/bizstyle.css?v=5283bb3d" />
<link rel="stylesheet" type="text/css" href="../_static/copybutton.css?v=76b2166b" />
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=9bcbadda"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/clipboard.min.js?v=a7894cd8"></script>
<script src="../_static/copybutton.js?v=a56c686a"></script>
<script src="../_static/bizstyle.js"></script>
<link rel="canonical" href="https://clearlinux.github.io/clear-linux-documentation/tutorials/apache-hadoop.html" />
<link rel="icon" href="../_static/favicon.ico"/>
<link rel="author" title="About these documents" href="../about.html" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Broadcom* Drivers" href="broadcom.html" />
<link rel="prev" title="Tutorials" href="index.html" />
<!--[if lt IE 9]>
<script src="_static/css3-mediaqueries.js"></script>
<![endif]-->
</head><body>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="broadcom.html" title="Broadcom* Drivers"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="index.html" title="Tutorials"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="../index.html">Documentation for Clear Linux* project</a> &#187;</li>
<li class="nav-item nav-item-1"><a href="index.html" accesskey="U">Tutorials</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Apache* Hadoop*</a></li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="apache-hadoop">
<span id="hadoop"></span><h1>Apache* Hadoop*<a class="headerlink" href="#apache-hadoop" title="Link to this heading"></a></h1>
<p>This tutorial explains the process of installing, configuring, and
running Apache Hadoop on Clear Linux* OS.</p>
<nav class="contents local" id="contents">
<ul class="simple">
<li><p><a class="reference internal" href="#description" id="id1">Description</a></p></li>
<li><p><a class="reference internal" href="#prerequisites" id="id2">Prerequisites</a></p></li>
<li><p><a class="reference internal" href="#install-apache-hadoop" id="id3">Install Apache Hadoop</a></p></li>
<li><p><a class="reference internal" href="#configure-apache-hadoop" id="id4">Configure Apache Hadoop</a></p></li>
<li><p><a class="reference internal" href="#configure-your-ssh-key" id="id5">Configure your SSH key</a></p></li>
<li><p><a class="reference internal" href="#run-the-hadoop-daemons" id="id6">Run the Hadoop daemons</a></p></li>
<li><p><a class="reference internal" href="#run-the-mapreduce-wordcount-example" id="id7">Run the MapReduce wordcount example</a></p></li>
</ul>
</nav>
<section id="description">
<h2><a class="toc-backref" href="#id1" role="doc-backlink">Description</a><a class="headerlink" href="#description" title="Link to this heading"></a></h2>
<p>For this tutorial, you will install Hadoop on a single machine
that runs both the master and slave daemons.</p>
<p>The Apache Hadoop software library is a framework for distributed processing
of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of
machines, with each machine offering local computation and storage.</p>
</section>
<section id="prerequisites">
<h2><a class="toc-backref" href="#id2" role="doc-backlink">Prerequisites</a><a class="headerlink" href="#prerequisites" title="Link to this heading"></a></h2>
<ul class="simple">
<li><p><a class="reference internal" href="../get-started/bare-metal-install-desktop.html#bare-metal-install-desktop"><span class="std std-ref">Install Clear Linux* OS from the live desktop</span></a></p></li>
<li><p>In Clear Linux OS, run <strong class="command">swupd update</strong></p></li>
</ul>
</section>
<section id="install-apache-hadoop">
<h2><a class="toc-backref" href="#id3" role="doc-backlink">Install Apache Hadoop</a><a class="headerlink" href="#install-apache-hadoop" title="Link to this heading"></a></h2>
<p>Apache Hadoop is included in the <strong class="command">big-data-basic</strong> bundle. To install
the framework, enter the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>big-data-basic
</pre></div>
</div>
</section>
<section id="configure-apache-hadoop">
<h2><a class="toc-backref" href="#id4" role="doc-backlink">Configure Apache Hadoop</a><a class="headerlink" href="#configure-apache-hadoop" title="Link to this heading"></a></h2>
<ol class="arabic">
<li><p>To create the configuration directory, enter the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>mkdir<span class="w"> </span>/etc/hadoop
</pre></div>
</div>
</li>
<li><p>Copy the defaults from <code class="file docutils literal notranslate"><span class="pre">/usr/share/defaults/hadoop</span></code> to
<code class="file docutils literal notranslate"><span class="pre">/etc/hadoop</span></code> with the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>cp<span class="w"> </span>/usr/share/defaults/hadoop/*<span class="w"> </span>/etc/hadoop
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Since Clear Linux OS is a stateless system, never modify the
files under the <code class="file docutils literal notranslate"><span class="pre">/usr/share/defaults</span></code> directory. The software
updater will overwrite those files.</p>
</div>
<p>Once all the configuration files are in <code class="file docutils literal notranslate"><span class="pre">/etc/hadoop</span></code>, edit them to
fit your needs. The <cite>NameNode</cite> server is the master server that manages the
file system namespace and regulates client access to files.
The first file to edit, <code class="file docutils literal notranslate"><span class="pre">/etc/hadoop/core-site.xml</span></code>, tells the
Hadoop daemons where <cite>NameNode</cite> is running. In this tutorial, <cite>NameNode</cite> runs
on <cite>localhost</cite>.</p>
</li>
<li><p>Open the <code class="file docutils literal notranslate"><span class="pre">/etc/hadoop/core-site.xml</span></code> file using any editor and modify
the file as follows:</p>
<div class="highlight-xml notranslate"><div class="highlight"><pre><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span>
<span class="cp">&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
<span class="nt">&lt;configuration&gt;</span>
<span class="nt">&lt;property&gt;</span>
<span class="nt">&lt;name&gt;</span>fs.default.name<span class="nt">&lt;/name&gt;</span>
<span class="nt">&lt;value&gt;</span>hdfs://localhost:9000<span class="nt">&lt;/value&gt;</span>
<span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;/configuration&gt;</span>
</pre></div>
</div>
</li>
<li><p>Edit the <code class="file docutils literal notranslate"><span class="pre">/etc/hadoop/hdfs-site.xml</span></code> file. This file configures the
<abbr title="Hadoop Distributed File System">HDFS</abbr> daemons. Its settings
include the lists of permitted and excluded data nodes and the block
size. For this example, set the block replication factor to 1
instead of the default of 3, as follows:</p>
<div class="highlight-xml notranslate"><div class="highlight"><pre><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span>
<span class="cp">&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
<span class="nt">&lt;configuration&gt;</span>
<span class="nt">&lt;property&gt;</span>
<span class="nt">&lt;name&gt;</span>dfs.replication<span class="nt">&lt;/name&gt;</span>
<span class="hll"><span class="nt">&lt;value&gt;</span>1<span class="nt">&lt;/value&gt;</span>
</span><span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;property&gt;</span>
<span class="nt">&lt;name&gt;</span>dfs.permission<span class="nt">&lt;/name&gt;</span>
<span class="nt">&lt;value&gt;</span>false<span class="nt">&lt;/value&gt;</span>
<span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;/configuration&gt;</span>
</pre></div>
</div>
</li>
<li><p>Edit the <code class="file docutils literal notranslate"><span class="pre">/etc/hadoop/mapred-site.xml</span></code> file. This file configures
all daemons related to <cite>MapReduce</cite>: <cite>JobTracker</cite> and <cite>TaskTrackers</cite>. With
<cite>MapReduce</cite>, Hadoop can process large amounts of data across multiple systems. In
our example, we set <abbr title="Yet Another Resource Negotiator">YARN</abbr> as the
runtime framework for executing <cite>MapReduce</cite> jobs as follows:</p>
<div class="highlight-xml notranslate"><div class="highlight"><pre><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span>
<span class="cp">&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;</span>
<span class="nt">&lt;configuration&gt;</span>
<span class="nt">&lt;property&gt;</span>
<span class="hll"><span class="nt">&lt;name&gt;</span>mapreduce.framework.name<span class="nt">&lt;/name&gt;</span>
</span><span class="hll"><span class="nt">&lt;value&gt;</span>yarn<span class="nt">&lt;/value&gt;</span>
</span><span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;/configuration&gt;</span>
</pre></div>
</div>
</li>
<li><p>Edit the <code class="file docutils literal notranslate"><span class="pre">/etc/hadoop/yarn-site.xml</span></code> file. This file configures all
daemons related to <cite>YARN</cite>: <cite>ResourceManager</cite> and <cite>NodeManager</cite>. In our
example, we enable the <cite>mapreduce_shuffle</cite> service, which is the
default, as follows:</p>
<div class="highlight-xml notranslate"><div class="highlight"><pre><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot;?&gt;</span>
<span class="nt">&lt;configuration&gt;</span>
<span class="nt">&lt;property&gt;</span>
<span class="hll"><span class="nt">&lt;name&gt;</span>yarn.nodemanager.aux-services<span class="nt">&lt;/name&gt;</span>
</span><span class="hll"><span class="nt">&lt;value&gt;</span>mapreduce_shuffle<span class="nt">&lt;/value&gt;</span>
</span><span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;property&gt;</span>
<span class="hll"><span class="nt">&lt;name&gt;</span>yarn.nodemanager.auxservices.mapreduce.shuffle.class<span class="nt">&lt;/name&gt;</span>
</span><span class="hll"><span class="nt">&lt;value&gt;</span>org.apache.hadoop.mapred.ShuffleHandler<span class="nt">&lt;/value&gt;</span>
</span><span class="nt">&lt;/property&gt;</span>
<span class="nt">&lt;/configuration&gt;</span>
</pre></div>
</div>
</li>
</ol>
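<p>After editing the files, it can help to read a setting back and confirm that
the value is what you expect. The following is a minimal sketch, not part of
Hadoop: the <cite>get_property</cite> function is a hypothetical helper, and it
assumes that each name and value element sits on its own line, as in the
listings above.</p>

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop tool): print the value that follows a
# given property name in a Hadoop-style *-site.xml file.
# Assumption: <name> and <value> elements are each on their own line.
get_property() {
    file=$1
    name=$2
    awk -v key="$name" '
        # Remember that the requested property name was seen.
        index($0, "<name>" key "</name>") { found = 1; next }
        # The first <value> line after it holds the setting.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}
```

<p>For example, <cite>get_property /etc/hadoop/hdfs-site.xml dfs.replication</cite>
should print 1 if the file was edited as shown in this section. The helper
prints nothing when the property is absent.</p>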
</section>
<section id="configure-your-ssh-key">
<h2><a class="toc-backref" href="#id5" role="doc-backlink">Configure your SSH key</a><a class="headerlink" href="#configure-your-ssh-key" title="Link to this heading"></a></h2>
<ol class="arabic">
<li><p>Create an SSH key. If you already have one, skip this step.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>ssh-keygen<span class="w"> </span>-t<span class="w"> </span>rsa
</pre></div>
</div>
</li>
<li><p>Copy the key to your authorized keys.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>cat<span class="w"> </span>/root/.ssh/id_rsa.pub<span class="w"> </span><span class="p">|</span><span class="w"> </span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/root/.ssh/authorized_keys
</pre></div>
</div>
</li>
<li><p>Log in to localhost. If no password prompt appears, you are ready to
run the Hadoop daemons.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>ssh<span class="w"> </span>localhost
</pre></div>
</div>
</li>
</ol>
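<p>The login test above can also be run without any interactive prompt. The
following is a minimal sketch under stated assumptions: the function name and
the <cite>SSH_CMD</cite> variable are hypothetical and exist only so that the
check can be exercised without a real SSH server. <cite>BatchMode</cite> and
<cite>ConnectTimeout</cite> are standard OpenSSH client options.</p>

```shell
#!/bin/sh
# Hypothetical check: succeed only if key-based login works with no prompt.
# SSH_CMD is an assumed override for testing; it defaults to the real ssh.
can_login_without_password() {
    host=${1:-localhost}
    # BatchMode=yes makes ssh fail instead of asking for a password or
    # for confirmation of an unknown host key.
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

<p>Because this tutorial stores the key under <code class="file docutils literal notranslate"><span class="pre">/root/.ssh</span></code>, run the
function from a root shell. A non-zero exit status means that a password
prompt or a host key confirmation would still appear.</p>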
</section>
<section id="run-the-hadoop-daemons">
<h2><a class="toc-backref" href="#id6" role="doc-backlink">Run the Hadoop daemons</a><a class="headerlink" href="#run-the-hadoop-daemons" title="Link to this heading"></a></h2>
<p>With all the configuration files properly edited, you are ready to start the
daemons.</p>
<p>Formatting the <cite>NameNode</cite> server erases the metadata that it keeps about the
data nodes. As a result, all the information on the data nodes is lost and the nodes
can be reused for new data.</p>
<ol class="arabic">
<li><p>Format the <cite>NameNode</cite> server with the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>hdfs<span class="w"> </span>namenode<span class="w"> </span>-format
</pre></div>
</div>
</li>
<li><p>Start the HDFS daemons (<cite>NameNode</cite> and <cite>DataNodes</cite>) with the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>start-dfs.sh
</pre></div>
</div>
<p>The console output should be similar to:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">Starting namenodes on [localhost]</span>
<span class="go">The authenticity of host &#39;localhost (::1)&#39; can&#39;t be established.</span>
<span class="go">ECDSA key fingerprint is</span>
<span class="go">SHA256:97e+7TnomsS9W7GjFPjzY75HGBp+f1y6sA+ZFcOPIPU.</span>
<span class="go">Are you sure you want to continue connecting (yes/no)?</span>
</pre></div>
</div>
</li>
<li><p>Enter <cite>yes</cite> to continue.</p></li>
<li><p>Start the <cite>YARN</cite> daemons <cite>ResourceManager</cite> and <cite>NodeManager</cite> with the
following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>start-yarn.sh
</pre></div>
</div>
</li>
<li><p>Ensure everything is running as expected with the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>jps
</pre></div>
</div>
<p>The console output should be similar to:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">22674 DataNode</span>
<span class="go">26228 Jps</span>
<span class="go">22533 NameNode</span>
<span class="go">23046 ResourceManager</span>
<span class="go">22854 SecondaryNameNode</span>
<span class="go">23150 NodeManager</span>
</pre></div>
</div>
</li>
</ol>
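<p>Comparing the <cite>jps</cite> listing against the expected daemons by eye is
easy to get wrong. The following is a minimal sketch of an automated check;
the <cite>missing_daemons</cite> function is a hypothetical helper, and it
assumes the two-column output format shown above (process ID, then name).</p>

```shell
#!/bin/sh
# Hypothetical helper: read `jps` output on standard input and print the
# name of every expected daemon that does not appear in it.
missing_daemons() {
    output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare the second column exactly, so that "NameNode" is not
        # satisfied by a "SecondaryNameNode" line.
        printf '%s\n' "$output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            printf '%s\n' "$daemon"
    done
}
```

<p>A possible invocation is <cite>sudo jps | missing_daemons</cite>. Empty output
means that all five daemons from the listing above are running.</p>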
</section>
<section id="run-the-mapreduce-wordcount-example">
<h2><a class="toc-backref" href="#id7" role="doc-backlink">Run the MapReduce wordcount example</a><a class="headerlink" href="#run-the-mapreduce-wordcount-example" title="Link to this heading"></a></h2>
<ol class="arabic">
<li><p>Create the input directory.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>hdfs<span class="w"> </span>dfs<span class="w"> </span>-mkdir<span class="w"> </span>-p<span class="w"> </span>/user/root/input
</pre></div>
</div>
</li>
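<li><p>Copy a file from the local file system to HDFS. Replace <code class="file docutils literal notranslate"><span class="pre">local-file</span></code> with the path to a text file.</p>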
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>hdfs<span class="w"> </span>dfs<span class="w"> </span>-copyFromLocal<span class="w"> </span>local-file<span class="w"> </span>/user/root/input
</pre></div>
</div>
</li>
<li><p>Run the <cite>wordcount</cite> example.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>hadoop<span class="w"> </span>jar<span class="w"> </span>/usr/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar<span class="w"> </span>wordcount<span class="w"> </span>input<span class="w"> </span>output
</pre></div>
</div>
</li>
<li><p>Read the output file “part-r-00000”. This file contains the number of times
each word appears in the file.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>hdfs<span class="w"> </span>dfs<span class="w"> </span>-cat<span class="w"> </span>/user/root/output/part-r-00000
</pre></div>
</div>
</li>
</ol>
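<p>To sanity-check the job output, you can compute the same counts locally for
a small input file. The following is a minimal sketch, assuming that the
wordcount example splits its input on whitespace only; the
<cite>local_wordcount</cite> function is a hypothetical helper and is not part
of Hadoop.</p>

```shell
#!/bin/sh
# Hypothetical local cross-check: count whitespace-separated words in a
# file and print one line per word: the word, a tab, then its count.
local_wordcount() {
    # Put every word on its own line, tally identical lines with awk,
    # then sort by word in byte order.
    tr -s '[:space:]' '\n' < "$1" |
        awk 'NF { count[$0]++ }
             END { for (word in count) printf "%s\t%d\n", word, count[word] }' |
        LC_ALL=C sort
}
```

<p>Running <cite>local_wordcount local-file</cite> on the file that you copied to
HDFS should give lines that you can compare with the contents of
part-r-00000.</p>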
<p><strong>Congratulations!</strong></p>
<p>You have successfully installed and set up a single-node Hadoop cluster.
Additionally, you ran a simple wordcount example.</p>
<p>Your single-node Hadoop cluster is up and running!</p>
</section>
</section>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="Main">
<div class="sphinxsidebarwrapper">
<p class="logo"><a href="../index.html">
<img class="logo" src="../_static/clearlinux.png" alt="Logo of Clear Linux* Project Docs"/>
</a></p>
<div>
<h3><a href="../index.html">Table of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">Apache* Hadoop*</a><ul>
<li><a class="reference internal" href="#description">Description</a></li>
<li><a class="reference internal" href="#prerequisites">Prerequisites</a></li>
<li><a class="reference internal" href="#install-apache-hadoop">Install Apache Hadoop</a></li>
<li><a class="reference internal" href="#configure-apache-hadoop">Configure Apache Hadoop</a></li>
<li><a class="reference internal" href="#configure-your-ssh-key">Configure your SSH key</a></li>
<li><a class="reference internal" href="#run-the-hadoop-daemons">Run the Hadoop daemons</a></li>
<li><a class="reference internal" href="#run-the-mapreduce-wordcount-example">Run the MapReduce wordcount example</a></li>
</ul>
</li>
</ul>
</div>
<div>
<h4>Previous topic</h4>
<p class="topless"><a href="index.html"
title="previous chapter">Tutorials</a></p>
</div>
<div>
<h4>Next topic</h4>
<p class="topless"><a href="broadcom.html"
title="next chapter">Broadcom* Drivers</a></p>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="../_sources/tutorials/apache-hadoop.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<search id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="../search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
<input type="submit" value="Go" />
</form>
</div>
</search>
<script>document.getElementById('searchbox').style.display = "block"</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="broadcom.html" title="Broadcom* Drivers"
>next</a> |</li>
<li class="right" >
<a href="index.html" title="Tutorials"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="../index.html">Documentation for Clear Linux* project</a> &#187;</li>
<li class="nav-item nav-item-1"><a href="index.html" >Tutorials</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Apache* Hadoop*</a></li>
</ul>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2022 Intel Corporation. All Rights Reserved.
Last updated on Nov 04, 2024.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 8.1.3.
</div>
</body>
</html>