<!DOCTYPE html>
<html lang="en" data-content_root="../">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>HPC Cluster &#8212; Documentation for Clear Linux* project</title>
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=fa44fd50" />
<link rel="stylesheet" type="text/css" href="../_static/bizstyle.css?v=5283bb3d" />
<link rel="stylesheet" type="text/css" href="../_static/copybutton.css?v=76b2166b" />
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
<script src="../_static/doctools.js?v=9bcbadda"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/clipboard.min.js?v=a7894cd8"></script>
<script src="../_static/copybutton.js?v=a56c686a"></script>
<script src="../_static/bizstyle.js"></script>
<link rel="canonical" href="https://clearlinux.github.io/clear-linux-documentation/tutorials/hpc.html" />
<link rel="icon" href="../_static/favicon.ico"/>
<link rel="author" title="About these documents" href="../about.html" />
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Kata Containers*" href="kata.html" />
<link rel="prev" title="Function Multi-Versioning" href="fmv.html" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<!--[if lt IE 9]>
<script src="_static/css3-mediaqueries.js"></script>
<![endif]-->
</head><body>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="kata.html" title="Kata Containers*"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="fmv.html" title="Function Multi-Versioning"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="../index.html">Documentation for Clear Linux* project</a> &#187;</li>
<li class="nav-item nav-item-1"><a href="index.html" accesskey="U">Tutorials</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">HPC Cluster</a></li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="hpc-cluster">
<span id="hpc"></span><h1>HPC Cluster<a class="headerlink" href="#hpc-cluster" title="Link to this heading"></a></h1>
<p>This tutorial demonstrates how to set up a simple <abbr title="High Performance Computing">HPC</abbr> cluster using <a class="reference external" href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager">Slurm</a>, <a class="reference external" href="https://dun.github.io/munge/">MUNGE</a>, and
<a class="reference external" href="https://linux.die.net/man/1/pdsh">pdsh</a>. For this tutorial, the cluster consists of a controller node
and four worker nodes, as shown in Figure 1. For the sake of simplicity,
each node resides on a separate host, and the hostnames are hpc-controller,
hpc-worker1, hpc-worker2, hpc-worker3, and hpc-worker4.</p>
<figure class="dropshadow align-default" id="id1">
<img alt="Simple HPC cluster" src="../_images/hpc-01.png" />
<figcaption>
<p><span class="caption-text">Figure 1: Simple HPC cluster</span><a class="headerlink" href="#id1" title="Link to this image"></a></p>
</figcaption>
</figure>
<p>The configuration is intentionally kept simple, notably omitting
cgroups and accounting. These and many other configuration
options can be added later.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This tutorial assumes you start with a new installation of Clear Linux OS for all
nodes.</p>
</div>
<section id="prerequisites">
<h2>Prerequisites<a class="headerlink" href="#prerequisites" title="Link to this heading"></a></h2>
<ul class="simple">
<li><p>Knowledge and experience with HPC</p></li>
<li><p>Familiarity with Slurm, MUNGE, and pdsh</p></li>
<li><p>All nodes have synchronized clocks (typically by NTP); a quick check is shown below</p></li>
</ul>
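<p>One quick way to check the clock-synchronization prerequisite on each node is
systemd’s <strong class="command">timedatectl</strong> command; look for the clock-synchronization and NTP
fields in its output (the exact field names depend on the systemd version):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>timedatectl status
</pre></div>
</div>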
</section>
<section id="set-up-controller-node">
<h2>Set up controller node<a class="headerlink" href="#set-up-controller-node" title="Link to this heading"></a></h2>
<p>In this step, install the cluster tools, configure and enable the MUNGE service,
and enable the Slurm controller service.</p>
<ol class="arabic">
<li><p>Install Clear Linux OS on the controller node, add a user with administrator
privileges, and set its hostname to <cite>hpc-controller</cite>.</p></li>
<li><p>Boot it up and log in.</p></li>
<li><p>Update Clear Linux OS to the latest version.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>update
</pre></div>
</div>
</li>
<li><p>Set the date and time to synchronize with an NTP server.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>timedatectl<span class="w"> </span>set-ntp<span class="w"> </span><span class="nb">true</span>
</pre></div>
</div>
</li>
<li><p>Install the <cite>cluster-tools</cite> bundle.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>cluster-tools
</pre></div>
</div>
</li>
<li><p>Create a MUNGE key and start the MUNGE service.</p>
<ol class="loweralpha">
<li><p>Create the MUNGE key.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>mkdir<span class="w"> </span>/etc/munge
dd<span class="w"> </span><span class="k">if</span><span class="o">=</span>/dev/urandom<span class="w"> </span><span class="nv">bs</span><span class="o">=</span><span class="m">1</span><span class="w"> </span><span class="nv">count</span><span class="o">=</span><span class="m">1024</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/etc/munge/munge.key
</pre></div>
</div>
</li>
<li><p>Set the ownership to <cite>munge</cite> and set the correct access permissions.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>chown<span class="w"> </span>munge:<span class="w"> </span>/etc/munge/munge.key
sudo<span class="w"> </span>chmod<span class="w"> </span><span class="m">400</span><span class="w"> </span>/etc/munge/munge.key
</pre></div>
</div>
</li>
<li><p>Start the MUNGE service and set it to start automatically on boot.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>munge<span class="w"> </span>--now
</pre></div>
</div>
</li>
</ol>
</li>
<li><p>Test MUNGE.</p>
<ol class="loweralpha">
<li><p>Create a MUNGE credential.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>munge<span class="w"> </span>-n
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">MUNGE:AwQFAAC8QZHhL/+Fqhalhi+ZJBD5LavtMa8RMles1aPq7yuIZq3LtMmrB7KQZcQjG0qkFmoIIvixaCACFe1stLmF4VIg4Bg/7tilxteXHS940cuZ/TxpIuqC6fUH8zLgUZUPwJ4=:</span>
</pre></div>
</div>
</li>
<li><p>Validate a MUNGE credential.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>munge<span class="w"> </span>-n<span class="w"> </span><span class="p">|</span><span class="w"> </span>unmunge<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>STATUS
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">STATUS: Success (0)</span>
</pre></div>
</div>
</li>
</ol>
</li>
<li><p>Start the Slurm controller service and enable it to start automatically
on boot.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>slurmctld<span class="w"> </span>--now
</pre></div>
</div>
</li>
</ol>
</section>
<section id="set-up-worker-nodes">
<h2>Set up worker nodes<a class="headerlink" href="#set-up-worker-nodes" title="Link to this heading"></a></h2>
<p>For each worker node, perform these steps:</p>
<ol class="arabic">
<li><p>Install Clear Linux OS on the worker node, add a user with administrator privileges,
and set its hostname to <cite>hpc-worker</cite> plus its number, for example hpc-worker1,
hpc-worker2, and so on.</p>
<p>Ensure the username is the same as the one on the controller node. This
simplifies the password-less SSH access setup required for
pdsh in the next section.</p>
</li>
<li><p>Boot it up and log in.</p></li>
<li><p>Update Clear Linux OS to the latest version.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>update
</pre></div>
</div>
</li>
<li><p>Set the date and time to synchronize with an NTP server.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>timedatectl<span class="w"> </span>set-ntp<span class="w"> </span><span class="nb">true</span>
</pre></div>
</div>
</li>
<li><p>Install the <cite>cluster-tools</cite> bundle.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>cluster-tools
</pre></div>
</div>
</li>
</ol>
</section>
<section id="set-up-password-less-ssh-access-and-pdsh-on-all-nodes">
<h2>Set up password-less SSH access and pdsh on all nodes<a class="headerlink" href="#set-up-password-less-ssh-access-and-pdsh-on-all-nodes" title="Link to this heading"></a></h2>
<p>To efficiently manage a cluster, it is useful to have a tool
that can issue the same command to multiple nodes at once.
That tool is <abbr title="parallel distributed shell">pdsh</abbr>, which is included
in the <cite>cluster-tools</cite> bundle. pdsh is built with Slurm support, so it can
address hosts as defined in the Slurm partitions. pdsh relies on password-less
SSH access to work properly. There are two ways to set up
password-less SSH authentication: key-based or host-based. This tutorial uses
the latter approach: the controller authenticates a user, and
all worker nodes trust that authentication rather than asking the user to
enter a password again.</p>
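<p>As a sketch of what such a command fan-out looks like once the password-less
access configured below is in place (the host list here assumes the worker
hostnames from Figure 1; <cite>-w</cite> takes an explicit host list, while <cite>-P</cite>, used
later in this tutorial, takes a Slurm partition name):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh -w hpc-worker[1-4] uptime
</pre></div>
</div>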
<ol class="arabic">
<li><p>Configure the controller node.</p>
<ol class="loweralpha">
<li><p>Log into the controller node.</p></li>
<li><p>Configure the SSH client for host-based authentication.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/etc/ssh/ssh_config<span class="w"> </span><span class="s">&lt;&lt; EOF</span>
<span class="s">HostbasedAuthentication yes</span>
<span class="s">EnableSSHKeysign yes</span>
<span class="s">EOF</span>
</pre></div>
</div>
</li>
<li><p>Restart the SSH service.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>sshd
</pre></div>
</div>
</li>
</ol>
</li>
<li><p>Configure each worker node.</p>
<ol class="loweralpha">
<li><p>Configure the SSH service for host-based authentication.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/etc/ssh/sshd_config<span class="w"> </span><span class="s">&lt;&lt; EOF</span>
<span class="s">HostbasedAuthentication yes</span>
<span class="s">IgnoreRhosts no</span>
<span class="s">UseDNS yes</span>
<span class="s">EOF</span>
</pre></div>
</div>
</li>
<li><p>Create the <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code> file and add the controllers
<abbr title="fully qualified domain name">FQDN</abbr>. This tells the worker
node to accept connection from the controller.</p>
<p>For example:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-controller.my-domain.com</span>
</pre></div>
</div>
</li>
<li><p>Set its permissions so that only root can access it.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>chmod<span class="w"> </span><span class="m">600</span><span class="w"> </span>/etc/hosts.equiv
</pre></div>
</div>
</li>
<li><p>Add the controller’s FQDN to <code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>. This allows
host-based authentication for the root account so that
actions requiring sudo privileges can be performed.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>cp<span class="w"> </span>-v<span class="w"> </span>/etc/hosts.equiv<span class="w"> </span>/root/.shosts
</pre></div>
</div>
</li>
<li><p>Using the controller’s FQDN in <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>, scan for its
RSA public key and copy it to <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code>.
Verify the scanned RSA public key matches the controller’s
<code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_host_rsa_key.pub</span></code> file.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>ssh-keyscan<span class="w"> </span>-t<span class="w"> </span>rsa<span class="w"> </span>-f<span class="w"> </span>/etc/hosts.equiv<span class="w"> </span>&gt;<span class="w"> </span>~/ssh_known_hosts
sudo<span class="w"> </span>cp<span class="w"> </span>-v<span class="w"> </span>~/ssh_known_hosts<span class="w"> </span>/etc/ssh
rm<span class="w"> </span>~/ssh_known_hosts
</pre></div>
</div>
</li>
<li><p>Restart the SSH service.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>sshd
</pre></div>
</div>
</li>
</ol>
</li>
<li><p>On the controller node, SSH into each worker node without having to enter
a password. On the first connection to each host, you’ll be asked to
add the unknown host to the <code class="file docutils literal notranslate"><span class="pre">$HOME/.ssh/known_hosts</span></code> file. Accept
the request. This will make future SSH connections to each host
non-interactive.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ssh<span class="w"> </span>&lt;worker-node&gt;
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Setting up host-based authentication in
<abbr title="Cloud Service Provider">CSP</abbr> environments such as Microsoft Azure
and Amazon AWS may require some tweaking of the worker nodes’ SSH
configurations due to the CSP’s virtual network setup. In general,
cloud VMs have both a public and a private DNS name. When SSHing from the
controller to a worker node, the SSH client may send the controller’s
private DNS name, usually something with “internal” in the name,
as the <cite>chost</cite> instead of its public FQDN, which is what the worker nodes’
<code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>, <code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>, and
<code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code> files expect. If the above configuration
does not work on a cloud VM, meaning you’re asked to enter a password when
SSHing from the controller to a worker node, here are
some suggestions for debugging the problem:</p>
<ol class="arabic simple">
<li><p>On the controller, try to identify the chost data sent by the SSH
client using <strong class="command">ssh -vvv &lt;worker-node&gt;</strong>. Look for <cite>chost</cite>
in the debug log. If the chost value differs from the controller’s
FQDN listed in the worker nodes’ <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>,
<code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>, and <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code> files,
then that is likely the cause of the problem. In some cases, chost
data may not be shown. If so, it’s safe to assume that the SSH client
is using the controller’s private DNS name as the chost. Proceed to
steps 2 and 3 below to fix the problem.</p></li>
<li><p>Get the controller’s private DNS name either from the step above or
from your system administrator.</p></li>
<li><p>On the worker node, make these changes:</p>
<ol class="arabic simple">
<li><p>Change the controller’s FQDN in <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>,
<code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>, and <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code>
to its private DNS name.</p></li>
<li><p>Restart the SSH service on the worker node.</p></li>
<li><p>Retest the connection from the controller node to the worker node.
If that still doesn’t work, try the SSH directive
<cite>HostbasedUsesNameFromPacketOnly yes</cite>, which tells the SSH service
to accept the supplied host name as is and not try to resolve it.
Also, set the directive <cite>UseDNS</cite> to <cite>no</cite> to disable host name lookup
(see the example fragment after this list).</p></li>
</ol>
</li>
</ol>
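<p>For reference, a minimal sketch of how these two directives would appear in
the worker node’s <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/sshd_config</span></code>; apply them only if the
debugging steps above point to a chost mismatch:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">HostbasedUsesNameFromPacketOnly yes</span>
<span class="go">UseDNS no</span>
</pre></div>
</div>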
</div>
</li>
<li><p>Verify you can issue a simple command over SSH without typing a password.</p>
<ol class="loweralpha">
<li><p>Issue the <strong class="command">hostname</strong> command.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ssh<span class="w"> </span>&lt;worker-node&gt;<span class="w"> </span>hostname
</pre></div>
</div>
</li>
<li><p>Issue the <strong class="command">hostname</strong> command with <strong class="command">sudo</strong>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ssh<span class="w"> </span>&lt;worker-node&gt;<span class="w"> </span>sudo<span class="w"> </span>hostname
</pre></div>
</div>
</li>
</ol>
<p>In both cases, you should get a response with the worker node’s hostname.
If the <cite>sudo</cite> version requires additional permission, grant the user
the <cite>NOPASSWD</cite> privilege. For example:</p>
<ol class="arabic">
<li><p>Edit the sudoers file.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>visudo
</pre></div>
</div>
</li>
<li><p>Add the following:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">&lt;user&gt; ALL=(ALL) NOPASSWD: ALL</span>
</pre></div>
</div>
</li>
</ol>
</li>
</ol>
</section>
<section id="create-slurm-conf-configuration-file">
<h2>Create <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file<a class="headerlink" href="#create-slurm-conf-configuration-file" title="Link to this heading"></a></h2>
<p>On the controller, create a new <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file
that contains general settings, each node’s hardware resource information,
the grouping of nodes into partitions, and scheduling settings for
each partition. This file will be copied to all worker nodes in the cluster.</p>
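<p>For orientation, a minimal sketch of the entries this tutorial adds to
<code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> is shown below; the values are illustrative placeholders,
and the steps that follow derive the real ones:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">ControlMachine=hpc-controller</span>
<span class="go">NodeName=hpc-worker1 CPUs=1 RealMemory=648</span>
<span class="go">PartitionName=workers Nodes=hpc-worker1 Default=YES MaxTime=INFINITE State=UP</span>
</pre></div>
</div>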
<ol class="arabic">
<li><p>Create a base <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>mkdir<span class="w"> </span>-p<span class="w"> </span>/etc/slurm
sudo<span class="w"> </span>cp<span class="w"> </span>-v<span class="w"> </span>/usr/share/defaults/slurm/slurm.conf<span class="w"> </span>/etc/slurm
</pre></div>
</div>
</li>
<li><p>Add the controller information.</p>
<ol class="loweralpha">
<li><p><strong class="command">sudoedit</strong> the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file. Set the <cite>ControlMachine</cite>
value to the controller’s resolvable hostname.</p>
<p>For example:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">ControlMachine=hpc-controller</span>
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Assuming the controller’s FQDN is resolvable, specifying the
controller’s IP address with the <cite>ControlAddr</cite> key is optional.
However, it may be helpful to add it.</p>
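<p>For example, if the controller’s address were 192.0.2.10 (an
illustrative address), the entry would look like this:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">ControlAddr=192.0.2.10</span>
</pre></div>
</div>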
</div>
</li>
<li><p>Save and exit.</p></li>
</ol>
</li>
<li><p>Add the worker nodes information.</p>
<ol class="loweralpha">
<li><p>Create a file containing a list of the worker nodes.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>&gt;<span class="w"> </span>worker-nodes-list<span class="w"> </span><span class="s">&lt;&lt; EOF</span>
<span class="s">hpc-worker1</span>
<span class="s">hpc-worker2</span>
<span class="s">hpc-worker3</span>
<span class="s">hpc-worker4</span>
<span class="s">EOF</span>
</pre></div>
</div>
</li>
<li><p>Using pdsh, get the hardware configuration of each node.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh<span class="w"> </span>-w<span class="w"> </span>^worker-nodes-list<span class="w"> </span>slurmd<span class="w"> </span>-C
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker4: NodeName=hpc-worker4 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1915</span>
<span class="go">hpc-worker4: UpTime=0-01:23:28</span>
<span class="go">hpc-worker3: NodeName=hpc-worker3 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1663</span>
<span class="go">hpc-worker3: UpTime=0-01:33:41</span>
<span class="go">hpc-worker2: NodeName=hpc-worker2 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=721</span>
<span class="go">hpc-worker2: UpTime=0-01:34:56</span>
<span class="go">hpc-worker1: NodeName=hpc-worker1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=721</span>
<span class="go">hpc-worker1: UpTime=0-01:39:21</span>
</pre></div>
</div>
</li>
<li><p><strong class="command">sudoedit</strong> the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file. Append each worker node
information, but without the <cite>UpTime</cite>, under the <cite>COMPUTE NODES</cite> section.</p>
<div class="admonition tip">
<p class="admonition-title">Tip</p>
<p>It is strongly recommended to set the <cite>RealMemory</cite> value for each
worker node slightly below (say, 90% of) what was reported by
<strong class="command">slurmd -C</strong>. If other processes use some memory and the
node’s available memory falls below the value stated in the configuration
file, Slurm marks the node as unavailable. For example, hpc-worker4
reported <cite>RealMemory=1915</cite> above, so the configuration below uses
<cite>RealMemory=1723</cite>.</p>
</div>
<p>Here’s an example with four worker nodes:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="gp">#</span>
<span class="gp"># </span>COMPUTE<span class="w"> </span>NODES<span class="w"> </span><span class="o">(</span>mode<span class="w"> </span>detailed<span class="w"> </span>values<span class="w"> </span>reported<span class="w"> </span>by<span class="w"> </span><span class="s2">&quot;slurmd -C&quot;</span><span class="w"> </span>on<span class="w"> </span>each<span class="w"> </span>node<span class="o">)</span>
<span class="go">NodeName=hpc-worker1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=648</span>
<span class="go">NodeName=hpc-worker2 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=648</span>
<span class="go">NodeName=hpc-worker3 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1497</span>
<span class="go">NodeName=hpc-worker4 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1723</span>
</pre></div>
</div>
</li>
<li><p>Create partitions.</p>
<p>A Slurm partition is a named grouping of worker nodes.
Give each partition a name and decide which worker node(s) belong to
it.</p>
<p>For example:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">PartitionName=workers Nodes=hpc-worker1, hpc-worker2, hpc-worker3, hpc-worker4 Default=YES MaxTime=INFINITE State=UP</span>
<span class="go">PartitionName=debug Nodes=hpc-worker1, hpc-worker3 MaxTime=INFINITE State=UP</span>
</pre></div>
</div>
</li>
<li><p>Save and exit.</p></li>
<li><p>Set the ownership of the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file to <cite>slurm</cite>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>chown<span class="w"> </span>slurm:<span class="w"> </span>/etc/slurm/slurm.conf
</pre></div>
</div>
</li>
</ol>
</li>
<li><p>On the controller node, restart the Slurm controller service.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>slurmctld
</pre></div>
</div>
</li>
<li><p>Verify the Slurm controller service restarted without any issues before
proceeding.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>status<span class="w"> </span>slurmctld
</pre></div>
</div>
</li>
</ol>
</section>
<section id="copy-munge-key-and-slurm-conf-to-all-worker-nodes">
<h2>Copy MUNGE key and <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> to all worker nodes<a class="headerlink" href="#copy-munge-key-and-slurm-conf-to-all-worker-nodes" title="Link to this heading"></a></h2>
<p>On the controller node, use pdsh, in conjunction with the nodes
defined in <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code>, to copy that file and the MUNGE key to
all worker nodes.</p>
<ol class="arabic">
<li><p>On the controller node, copy the MUNGE key to all worker nodes and start the
MUNGE service.</p>
<ol class="loweralpha">
<li><p>Create the <code class="file docutils literal notranslate"><span class="pre">/etc/munge/</span></code> directory on each node.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>mkdir<span class="w"> </span>/etc/munge
</pre></div>
</div>
</li>
<li><p>Copy the MUNGE key over.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdcp<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>/etc/munge/munge.key<span class="w"> </span>/etc/munge
</pre></div>
</div>
</li>
<li><p>Set the ownership of the <code class="file docutils literal notranslate"><span class="pre">munge.key</span></code> file to <cite>munge</cite>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>chown<span class="w"> </span>munge:<span class="w"> </span>/etc/munge/munge.key
</pre></div>
</div>
</li>
<li><p>Start the MUNGE service and set it to start automatically on boot.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>munge<span class="w"> </span>--now
</pre></div>
</div>
</li>
<li><p>Verify the MUNGE service is running.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span><span class="s2">&quot;systemctl status munge | grep Active&quot;</span>
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker3: Active: active (running) since Wed 2020-04-15 19:47:58 UTC; 55s ago</span>
<span class="go">hpc-worker4: Active: active (running) since Wed 2020-04-15 19:47:58 UTC; 55s ago</span>
<span class="go">hpc-worker2: Active: active (running) since Wed 2020-04-15 19:47:59 UTC; 54s ago</span>
<span class="go">hpc-worker1: Active: active (running) since Wed 2020-04-15 19:47:59 UTC; 54s ago</span>
</pre></div>
</div>
</li>
</ol>
</li>
<li><p>On the controller node, copy the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file to all
worker nodes and start the slurmd service on them.</p>
<ol class="loweralpha">
<li><p>Create the <code class="file docutils literal notranslate"><span class="pre">/etc/slurm/</span></code> directory on each worker node.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>mkdir<span class="w"> </span>/etc/slurm
</pre></div>
</div>
</li>
<li><p>Copy the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file over.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdcp<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>/etc/slurm/slurm.conf<span class="w"> </span>/etc/slurm
</pre></div>
</div>
</li>
<li><p>Set the ownership of the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file to <cite>slurm</cite>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>chown<span class="w"> </span>slurm:<span class="w"> </span>/etc/slurm/slurm.conf
</pre></div>
</div>
</li>
<li><p>Start the slurmd service and set it to start automatically on boot.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>slurmd<span class="w"> </span>--now
</pre></div>
</div>
</li>
<li><p>Verify the slurmd service is running.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>systemctl<span class="w"> </span>status<span class="w"> </span>slurmd<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>Active
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker3: Active: active (running) since Wed 2020-04-15 19:39:22 UTC; 1min 17s ago</span>
<span class="go">hpc-worker4: Active: active (running) since Wed 2020-04-15 19:39:22 UTC; 1min 17s ago</span>
<span class="go">hpc-worker2: Active: active (running) since Wed 2020-04-15 19:39:23 UTC; 1min 17s ago</span>
<span class="go">hpc-worker1: Active: active (running) since Wed 2020-04-15 19:39:23 UTC; 1min 17s ago</span>
</pre></div>
</div>
</li>
</ol>
</li>
</ol>
</section>
<section id="verify-controller-can-run-jobs-on-all-nodes">
<h2>Verify controller can run jobs on all nodes<a class="headerlink" href="#verify-controller-can-run-jobs-on-all-nodes" title="Link to this heading"></a></h2>
<ol class="arabic">
<li><p>Check the state of the worker nodes.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sinfo
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">PARTITION AVAIL TIMELIMIT NODES STATE NODELIST</span>
<span class="go">workers* up infinite 4 idle hpc-worker[1-4]</span>
<span class="go">debug up infinite 2 idle hpc-worker[1,3]</span>
</pre></div>
</div>
<div class="admonition tip">
<p class="admonition-title">Tip</p>
<p>If the nodes are in a “down” state, put them in the “idle” state.</p>
<p>For example:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>scontrol<span class="w"> </span>update<span class="w"> </span><span class="nv">nodename</span><span class="o">=</span>hpc-worker<span class="o">[</span><span class="m">1</span>-4<span class="o">]</span><span class="w"> </span><span class="nv">state</span><span class="o">=</span>idle<span class="w"> </span><span class="nv">reason</span><span class="o">=</span><span class="s2">&quot;&quot;</span>
</pre></div>
</div>
<p>Additional <a class="reference external" href="https://slurm.schedmd.com/troubleshoot.html">Slurm troubleshooting tips</a>.</p>
</div>
</li>
<li><p>Finally, verify Slurm can run jobs on all four worker nodes by issuing
a simple <strong class="command">hostname</strong> command.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>srun<span class="w"> </span>-N4<span class="w"> </span>-p<span class="w"> </span>workers<span class="w"> </span>hostname
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker4</span>
<span class="go">hpc-worker3</span>
<span class="go">hpc-worker1</span>
<span class="go">hpc-worker2</span>
</pre></div>
</div>
</li>
</ol>
</section>
<section id="create-and-run-example-scripts">
<h2>Create and run example scripts<a class="headerlink" href="#create-and-run-example-scripts" title="Link to this heading"></a></h2>
<section id="example-1-return-the-hostname-of-each-worker-and-output-to-show-hostnames-out">
<h3>Example 1: Return the hostname of each worker and output to <code class="file docutils literal notranslate"><span class="pre">show-hostnames.out</span></code><a class="headerlink" href="#example-1-return-the-hostname-of-each-worker-and-output-to-show-hostnames-out" title="Link to this heading"></a></h3>
<ol class="arabic">
<li><p>On the controller node, create the Slurm <code class="file docutils literal notranslate"><span class="pre">show-hostnames.sh</span></code> script.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>&gt;<span class="w"> </span>show-hostnames.sh<span class="w"> </span><span class="s">&lt;&lt; EOF</span>
<span class="s">#!/bin/bash</span>
<span class="s">#</span>
<span class="s">#SBATCH --job-name=show-hostnames</span>
<span class="s">#SBATCH --output=show-hostnames.out</span>
<span class="s">#</span>
<span class="s">#SBATCH --ntasks=4</span>
<span class="s">#SBATCH --time=10:00</span>
<span class="s">#SBATCH --mem-per-cpu=100</span>
<span class="s">#SBATCH --ntasks-per-node=1</span>
<span class="s">srun hostname</span>
<span class="s">EOF</span>
</pre></div>
</div>
</li>
<li><p>Execute the script.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sbatch<span class="w"> </span>show-hostnames.sh
</pre></div>
</div>
<p>The output file will appear on the first node of the partition used. Because no
partition was explicitly specified, this is the default partition. The job can
be monitored with <strong class="command">squeue</strong>; see the example after these steps.</p>
</li>
<li><p>View the result.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh<span class="w"> </span>-w<span class="w"> </span>hpc-worker1<span class="w"> </span><span class="s2">&quot;cat show-hostnames.out&quot;</span>
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker1: hpc-worker3</span>
<span class="go">hpc-worker1: hpc-worker4</span>
<span class="go">hpc-worker1: hpc-worker1</span>
<span class="go">hpc-worker1: hpc-worker2</span>
</pre></div>
</div>
</li>
</ol>
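<p>While the job from step 2 is queued or running, its state can be checked from
the controller with Slurm’s <strong class="command">squeue</strong> command (a standard Slurm client;
an empty list means the job has already completed):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>squeue
</pre></div>
</div>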
</section>
<section id="example-2-an-mpi-hello-world-program">
<h3>Example 2: An MPI “Hello, World!” program<a class="headerlink" href="#example-2-an-mpi-hello-world-program" title="Link to this heading"></a></h3>
<ol class="arabic">
<li><p>On the controller node, create the <code class="file docutils literal notranslate"><span class="pre">mpi-helloworld.c</span></code> program.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="w"> </span>cat<span class="w"> </span>&gt;<span class="w"> </span>mpi-helloworld.c<span class="w"> </span><span class="s">&lt;&lt; EOF</span>
<span class="s"> #include &lt;stdio.h&gt;</span>
<span class="s"> #include &lt;unistd.h&gt;</span>
<span class="s"> #include &lt;mpi.h&gt;</span>
<span class="s"> int main(int argc, char** argv)</span>
<span class="s"> {</span>
<span class="s"> // Init the MPI environment</span>
<span class="s"> MPI_Init(NULL, NULL);</span>
<span class="s"> // Get the number of processes</span>
<span class="s"> int world_size;</span>
<span class="s"> MPI_Comm_size(MPI_COMM_WORLD, &amp;world_size);</span>
<span class="s"> // Get the rank of the process</span>
<span class="s"> int world_rank;</span>
<span class="s"> MPI_Comm_rank(MPI_COMM_WORLD, &amp;world_rank);</span>
<span class="s"> // Get the name of the processor</span>
<span class="s"> char processor_name[MPI_MAX_PROCESSOR_NAME];</span>
<span class="s"> int name_len;</span>
<span class="s"> MPI_Get_processor_name(processor_name, &amp;name_len);</span>
<span class="s"> // Print a hello world message</span>
<span class="s"> printf(&quot;Hello, World! from from processor %s, rank %d out of %d processors\n&quot;, processor_name, world_rank, world_size);</span>
<span class="s"> // Finalize the MPI environment</span>
<span class="s"> MPI_Finalize();</span>
<span class="s">}</span>
<span class="s">EOF</span>
</pre></div>
</div>
</li>
<li><p>Add the <cite>c-basic</cite> and <cite>devpkg-openmpi</cite> bundles, which are needed to compile
it.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>c-basic<span class="w"> </span>devpkg-openmpi
</pre></div>
</div>
</li>
<li><p>Compile it.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>mpicc<span class="w"> </span>-o<span class="w"> </span>mpi-helloworld<span class="w"> </span>mpi-helloworld.c
</pre></div>
</div>
</li>
<li><p>Copy the binary to all worker nodes.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdcp<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>./mpi-helloworld<span class="w"> </span><span class="nv">$HOME</span>
</pre></div>
</div>
</li>
<li><p>Create a Slurm batch script to run it.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>&gt;<span class="w"> </span>mpi-helloworld.sh<span class="w"> </span><span class="s">&lt;&lt; EOF</span>
<span class="s">#!/bin/sh</span>
<span class="s">#SBATCH -o mpi-helloworld.out</span>
<span class="s">#SBATCH --nodes=4</span>
<span class="s">#SBATCH --ntasks-per-node=1</span>
<span class="s">srun ./mpi-helloworld</span>
<span class="s">EOF</span>
</pre></div>
</div>
</li>
<li><p>Run the batch script.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sbatch<span class="w"> </span>mpi-helloworld.sh
</pre></div>
</div>
</li>
<li><p>View the results on the first worker node in the partition.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh<span class="w"> </span>-w<span class="w"> </span>hpc-worker1<span class="w"> </span><span class="s2">&quot;cat mpi-helloworld.out&quot;</span>
</pre></div>
</div>
<p>Example output:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">Hello, World! from from processor hpc-worker3, rank 2 out of 4 processors</span>
<span class="go">Hello, World! from from processor hpc-worker4, rank 3 out of 4 processors</span>
<span class="go">Hello, World! from from processor hpc-worker1, rank 0 out of 4 processors</span>
<span class="go">Hello, World! from from processor hpc-worker2, rank 1 out of 4 processors</span>
</pre></div>
</div>
</li>
</ol>
</section>
</section>
</section>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="Main">
<div class="sphinxsidebarwrapper">
<p class="logo"><a href="../index.html">
<img class="logo" src="../_static/clearlinux.png" alt="Logo of Clear Linux* Project Docs"/>
</a></p>
<div>
<h3><a href="../index.html">Table of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">HPC Cluster</a><ul>
<li><a class="reference internal" href="#prerequisites">Prerequisites</a></li>
<li><a class="reference internal" href="#set-up-controller-node">Set up controller node</a></li>
<li><a class="reference internal" href="#set-up-worker-nodes">Set up worker nodes</a></li>
<li><a class="reference internal" href="#set-up-password-less-ssh-access-and-pdsh-on-all-nodes">Set up password-less SSH access and pdsh on all nodes</a></li>
<li><a class="reference internal" href="#create-slurm-conf-configuration-file">Create <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file</a></li>
<li><a class="reference internal" href="#copy-munge-key-and-slurm-conf-to-all-worker-nodes">Copy MUNGE key and <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> to all worker nodes</a></li>
<li><a class="reference internal" href="#verify-controller-can-run-jobs-on-all-nodes">Verify controller can run jobs on all nodes</a></li>
<li><a class="reference internal" href="#create-and-run-example-scripts">Create and run example scripts</a><ul>
<li><a class="reference internal" href="#example-1-return-the-hostname-of-each-worker-and-output-to-show-hostnames-out">Example 1: Return the hostname of each worker and output to <code class="file docutils literal notranslate"><span class="pre">show-hostnames.out</span></code></a></li>
<li><a class="reference internal" href="#example-2-an-mpi-hello-world-program">Example 2: An MPI “Hello, World!” program</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<div>
<h4>Previous topic</h4>
<p class="topless"><a href="fmv.html"
title="previous chapter">Function Multi-Versioning</a></p>
</div>
<div>
<h4>Next topic</h4>
<p class="topless"><a href="kata.html"
title="next chapter">Kata Containers*</a></p>
</div>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="../_sources/tutorials/hpc.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<search id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="../search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
<input type="submit" value="Go" />
</form>
</div>
</search>
<script>document.getElementById('searchbox').style.display = "block"</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="Related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="kata.html" title="Kata Containers*"
>next</a> |</li>
<li class="right" >
<a href="fmv.html" title="Function Multi-Versioning"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="../index.html">Documentation for Clear Linux* project</a> &#187;</li>
<li class="nav-item nav-item-1"><a href="index.html" >Tutorials</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">HPC Cluster</a></li>
</ul>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2022 Intel Corporation. All Rights Reserved.
Last updated on Nov 04, 2024.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 8.1.3.
</div>
</body>
</html>