.. _spark:

Apache\* Spark\*
################

This tutorial describes how to install, configure, and run Apache Spark on
|CL-ATTR|, with a single machine running both the master daemon and a worker
daemon.

.. contents::
   :local:
   :depth: 1

Description
***********

Apache Spark is a fast, general-purpose cluster computing system with
the following features:

* Provides high-level APIs in Java\*, Scala\*, Python\*, and R\*.
* Includes an optimized engine that supports general execution graphs.
* Supports high-level tools including Spark SQL, MLlib, GraphX, and Spark
  Streaming.

Prerequisites
*************

* |CL| installed on your host system.

  For detailed instructions on installing |CL| on a bare metal system, visit
  the :ref:`bare metal installation guide <bare-metal-install-desktop>`.

* Before installing any new packages, update |CL| with the following command:

  .. code-block:: bash

     sudo swupd update
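
Optionally, you can check which version of |CL| is installed before and after
updating. This quick check is not part of the original steps and assumes the
:command:`swupd info` subcommand is available:

.. code-block:: bash

   swupd info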

Install Apache Spark
********************

Apache Spark is included in the :command:`big-data-basic` bundle. To install
the framework, run the following command:

.. code-block:: bash

   sudo swupd bundle-add big-data-basic
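
If you want to confirm that the bundle was added, listing the installed
bundles is one way to check; this sketch assumes the
:command:`swupd bundle-list` subcommand is available:

.. code-block:: bash

   swupd bundle-list | grep big-data-basic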

Configure Apache Spark
**********************

#. Create the configuration directory:

   .. code-block:: bash

      sudo mkdir /etc/spark

#. Copy the default templates from :file:`/usr/share/defaults/spark` to
   :file:`/etc/spark`:

   .. code-block:: bash

      sudo cp /usr/share/defaults/spark/* /etc/spark

   .. note:: Since |CL| is a stateless system, you should never modify the
      files under the :file:`/usr/share/defaults` directory. The software
      updater overwrites those files.

#. Copy the template files shown below to create custom configuration files:

   .. code-block:: bash

      sudo cp /etc/spark/spark-defaults.conf.template /etc/spark/spark-defaults.conf
      sudo cp /etc/spark/spark-env.sh.template /etc/spark/spark-env.sh
      sudo cp /etc/spark/log4j.properties.template /etc/spark/log4j.properties

#. Edit the :file:`/etc/spark/spark-env.sh` file and add the
   :envvar:`SPARK_MASTER_HOST` variable. Replace the example address below
   with your host's IP address, which you can view with the
   :command:`hostname -I` command.

   .. code-block:: bash

      SPARK_MASTER_HOST="10.300.200.100"

   .. note:: This optional step enables the master's web user interface,
      which you use to view cluster information later in this tutorial.

#. Edit the :file:`/etc/spark/spark-defaults.conf` file and update the
   :envvar:`spark.master` variable with the `SPARK_MASTER_HOST` address and
   port `7077`. A scripted alternative to this edit and the previous one is
   sketched after this list.

   .. code-block:: bash

      spark.master spark://10.300.200.100:7077
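
If you prefer to script the last two edits rather than make them by hand, a
minimal sketch such as the following could work. It is not part of the
original steps and assumes that the first address reported by
:command:`hostname -I` is the one you want the master to use:

.. code-block:: bash

   # Capture the host's first IP address and append both settings.
   MASTER_IP=$(hostname -I | awk '{print $1}')
   echo "SPARK_MASTER_HOST=\"${MASTER_IP}\"" | sudo tee -a /etc/spark/spark-env.sh
   echo "spark.master spark://${MASTER_IP}:7077" | sudo tee -a /etc/spark/spark-defaults.conf

Note that these commands append the settings to the end of each file rather
than editing existing lines in place.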

Start the master server and a worker daemon
*******************************************

#. Start the master server:

   .. code-block:: bash

      sudo /usr/share/apache-spark/sbin/start-master.sh

#. Start one worker daemon and connect it to the master using the
   :envvar:`spark.master` variable defined earlier:

   .. code-block:: bash

      sudo /usr/share/apache-spark/sbin/start-slave.sh spark://10.300.200.100:7077

#. Open a web browser and view the worker daemon information using
   the master's IP address and port `8080`:

   .. code-block:: bash

      http://10.300.200.100:8080
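
If you want to verify from the command line that the master is listening
before opening the browser, checking its ports is one option. This sketch
assumes the :command:`ss` utility from iproute2 is installed:

.. code-block:: bash

   # The master listens on 7077 for workers and on 8080 for the web UI.
   ss -ltn | grep -E ':(7077|8080)'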

Run the Spark wordcount example
*******************************

#. Run the wordcount example using a file on your local host and output the
   results to a new file with the following command. If you need a sample
   input file first, see the sketch after this list.

   .. code-block:: bash

      sudo spark-submit /usr/share/apache-spark/examples/src/main/python/wordcount.py ~/Documents/example_file > ~/Documents/results

#. Open a web browser and view the application information using
   the master's IP address and port `8080`:

   .. code-block:: bash

      http://10.300.200.100:8080

#. View the results of the wordcount application in the
   :file:`~/Documents/results` file.
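
If you do not already have a text file to run the example against, a quick
way to create one and inspect the output might look like the following. The
file names match the example above; the sample text is arbitrary:

.. code-block:: bash

   # Create a small sample input file for the wordcount example.
   printf 'one fish two fish red fish blue fish\n' > ~/Documents/example_file

   # After running the spark-submit command above, view the counted words.
   cat ~/Documents/results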

**Congratulations!**

You have successfully installed and set up a standalone Apache Spark cluster,
and run a simple wordcount example.