<!DOCTYPE html>
|
||
|
||
<html lang="en" data-content_root="../">
|
||
<head>
|
||
<meta charset="utf-8" />
|
||
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
|
||
|
||
<title>HPC Cluster — Documentation for Clear Linux* project</title>
|
||
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=fa44fd50" />
|
||
<link rel="stylesheet" type="text/css" href="../_static/bizstyle.css?v=5283bb3d" />
|
||
<link rel="stylesheet" type="text/css" href="../_static/copybutton.css?v=76b2166b" />
|
||
|
||
<script src="../_static/documentation_options.js?v=5929fcd5"></script>
|
||
<script src="../_static/doctools.js?v=9bcbadda"></script>
|
||
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
|
||
<script src="../_static/clipboard.min.js?v=a7894cd8"></script>
|
||
<script src="../_static/copybutton.js?v=a56c686a"></script>
|
||
<script src="../_static/bizstyle.js"></script>
|
||
<link rel="canonical" href="https://clearlinux.github.io/clear-linux-documentation/tutorials/hpc.html" />
|
||
<link rel="icon" href="../_static/favicon.ico"/>
|
||
<link rel="author" title="About these documents" href="../about.html" />
|
||
<link rel="index" title="Index" href="../genindex.html" />
|
||
<link rel="search" title="Search" href="../search.html" />
|
||
<link rel="next" title="Kata Containers*" href="kata.html" />
|
||
<link rel="prev" title="Function Multi-Versioning" href="fmv.html" />
|
||
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
|
||
<!--[if lt IE 9]>
|
||
<script src="_static/css3-mediaqueries.js"></script>
|
||
<![endif]-->
|
||
</head><body>
|
||
<div class="related" role="navigation" aria-label="Related">
|
||
<h3>Navigation</h3>
|
||
<ul>
|
||
<li class="right" style="margin-right: 10px">
|
||
<a href="../genindex.html" title="General Index"
|
||
accesskey="I">index</a></li>
|
||
<li class="right" >
|
||
<a href="kata.html" title="Kata Containers*"
|
||
accesskey="N">next</a> |</li>
|
||
<li class="right" >
|
||
<a href="fmv.html" title="Function Multi-Versioning"
|
||
accesskey="P">previous</a> |</li>
|
||
<li class="nav-item nav-item-0"><a href="../index.html">Documentation for Clear Linux* project</a> »</li>
|
||
<li class="nav-item nav-item-1"><a href="index.html" accesskey="U">Tutorials</a> »</li>
|
||
<li class="nav-item nav-item-this"><a href="">HPC Cluster</a></li>
|
||
</ul>
|
||
</div>
|
||
|
||
<div class="document">
|
||
<div class="documentwrapper">
|
||
<div class="bodywrapper">
|
||
<div class="body" role="main">
|
||
|
||
<section id="hpc-cluster">
|
||
<span id="hpc"></span><h1>HPC Cluster<a class="headerlink" href="#hpc-cluster" title="Link to this heading">¶</a></h1>
|
||
<p>This tutorial demonstrates how to set up a simple <abbr title="High Performance Computing">HPC</abbr> cluster using <a class="reference external" href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager">Slurm</a>, <a class="reference external" href="https://dun.github.io/munge/">MUNGE</a>, and
|
||
<a class="reference external" href="https://linux.die.net/man/1/pdsh">pdsh</a>. For this tutorial, this cluster consists of a controller node
|
||
and four worker nodes, as shown in Figure 1. For the sake of simplicity,
|
||
each node resides on a separate host, and the hostnames are hpc-controller,
|
||
hpc-worker1, hpc-worker2, hpc-worker3, and hpc-worker4.</p>
|
||
<figure class="dropshadow align-default" id="id1">
|
||
<img alt="Simple HPC cluster" src="../_images/hpc-01.png" />
|
||
<figcaption>
|
||
<p><span class="caption-text">Figure 1: Simple HPC cluster</span><a class="headerlink" href="#id1" title="Link to this image">¶</a></p>
|
||
</figcaption>
|
||
</figure>
|
||
<p>The configuration is intentionally kept simple, notably avoiding setting
|
||
up cgroups and accounting. These and many more additional configuration
|
||
options can be added later.</p>
|
||
<div class="admonition note">
|
||
<p class="admonition-title">Note</p>
|
||
<p>This tutorial assumes you start with a new installation of Clear Linux OS for all
|
||
nodes.</p>
|
||
</div>
|
||
<section id="prerequisites">
|
||
<h2>Prerequisites<a class="headerlink" href="#prerequisites" title="Link to this heading">¶</a></h2>
|
||
<ul class="simple">
|
||
<li><p>Knowledge and experience with HPC</p></li>
|
||
<li><p>Familiarity with Slurm, MUNGE, and pdsh</p></li>
|
||
<li><p>All nodes have synchronized clocks (typically by NTP)</p></li>
|
||
</ul>
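<p>The nodes must also be able to resolve each other’s hostnames. If DNS does not
already provide this, one option is to add entries to <code class="file docutils literal notranslate"><span class="pre">/etc/hosts</span></code> on every node;
the sketch below uses hypothetical addresses that you would replace with your own:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span># /etc/hosts (example addresses)
192.168.1.10   hpc-controller   hpc-controller.my-domain.com
192.168.1.11   hpc-worker1
192.168.1.12   hpc-worker2
192.168.1.13   hpc-worker3
192.168.1.14   hpc-worker4
</pre></div>
</div>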
|
||
</section>
|
||
<section id="set-up-controller-node">
|
||
<h2>Set up controller node<a class="headerlink" href="#set-up-controller-node" title="Link to this heading">¶</a></h2>
|
||
<p>In this step, install the cluster tools, configure and enable the MUNGE service,
|
||
and enable the Slurm controller service.</p>
|
||
<ol class="arabic">
|
||
<li><p>Install Clear Linux OS on the controller node, add a user with administrator
|
||
privilege, and set its hostname to <cite>hpc-controller</cite>.</p></li>
|
||
<li><p>Boot it up and log in.</p></li>
|
||
<li><p>Update Clear Linux OS to the latest version.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>update
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Set the date and time to synchronize with an NTP server.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>timedatectl<span class="w"> </span>set-ntp<span class="w"> </span><span class="nb">true</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Install the <cite>cluster-tools</cite> bundle.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>cluster-tools
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Create a MUNGE key and start the MUNGE service.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Create the MUNGE key.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>mkdir<span class="w"> </span>/etc/munge
|
||
dd<span class="w"> </span><span class="k">if</span><span class="o">=</span>/dev/urandom<span class="w"> </span><span class="nv">bs</span><span class="o">=</span><span class="m">1</span><span class="w"> </span><span class="nv">count</span><span class="o">=</span><span class="m">1024</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/etc/munge/munge.key
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Set the ownership to <cite>munge</cite> and set the correct access permissions.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>chown<span class="w"> </span>munge:<span class="w"> </span>/etc/munge/munge.key
|
||
sudo<span class="w"> </span>chmod<span class="w"> </span><span class="m">400</span><span class="w"> </span>/etc/munge/munge.key
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Start the MUNGE service and set it to start automatically on boot.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>munge<span class="w"> </span>--now
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
<li><p>Test MUNGE.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Create a MUNGE credential.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>munge<span class="w"> </span>-n
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">MUNGE:AwQFAAC8QZHhL/+Fqhalhi+ZJBD5LavtMa8RMles1aPq7yuIZq3LtMmrB7KQZcQjG0qkFmoIIvixaCACFe1stLmF4VIg4Bg/7tilxteXHS940cuZ/TxpIuqC6fUH8zLgUZUPwJ4=:</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Validate a MUNGE credential.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>munge<span class="w"> </span>-n<span class="w"> </span><span class="p">|</span><span class="w"> </span>unmunge<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>STATUS
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">STATUS: Success (0)</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
<li><p>Start the Slurm controller service and enable it to start automatically
|
||
on boot.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>slurmctld<span class="w"> </span>--now
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="set-up-worker-nodes">
|
||
<h2>Set up worker nodes<a class="headerlink" href="#set-up-worker-nodes" title="Link to this heading">¶</a></h2>
|
||
<p>For each worker node, perform these steps:</p>
|
||
<ol class="arabic">
|
||
<li><p>Install Clear Linux OS on the worker node, add a user with administrator privilege,
|
||
and set its hostname to <cite>hpc-worker</cite> plus its number, i.e. hpc-worker1,
|
||
hpc-worker2, etc.</p>
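<p>For example, one way to set the hostname on the first worker node, assuming
<strong class="command">hostnamectl</strong> is used:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># example: repeat on each worker node with its own name
sudo hostnamectl set-hostname hpc-worker1
</pre></div>
</div>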
|
||
<p>Ensure the username is the same as the one on the controller node. This
simplifies the password-less SSH access setup, which pdsh requires, in the
next section.</p>
|
||
</li>
|
||
<li><p>Boot it up and log in.</p></li>
|
||
<li><p>Update Clear Linux OS to the latest version.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>update
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Set the date and time to synchronize with an NTP server.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>timedatectl<span class="w"> </span>set-ntp<span class="w"> </span><span class="nb">true</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Install the <cite>cluster-tools</cite> bundle.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>cluster-tools
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="set-up-password-less-ssh-access-and-pdsh-on-all-nodes">
|
||
<h2>Set up password-less SSH access and pdsh on all nodes<a class="headerlink" href="#set-up-password-less-ssh-access-and-pdsh-on-all-nodes" title="Link to this heading">¶</a></h2>
|
||
<p>To efficiently manage a cluster, it is useful to have a tool
|
||
that allows issuing the same command to multiple nodes at once.
|
||
That tool is <abbr title="parallel distributed shell">pdsh</abbr>, which is included
with the <cite>cluster-tools</cite> bundle. pdsh is built with Slurm support, so it can
access hosts as defined in the Slurm partitions. pdsh relies on password-less
SSH access to work properly. There are two ways to set up
password-less SSH authentication: key-based or host-based. In this tutorial,
the latter approach is used: the controller authenticates a user, and
|
||
all worker nodes will trust that authentication and not ask the user to
|
||
enter a password again.</p>
|
||
<ol class="arabic">
|
||
<li><p>Configure the controller node.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Log into the controller node.</p></li>
|
||
<li><p>Configure the SSH service for host-based authentication.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/etc/ssh/ssh_config<span class="w"> </span><span class="s"><< EOF</span>
|
||
<span class="s">HostbasedAuthentication yes</span>
|
||
<span class="s">EnableSSHKeysign yes</span>
|
||
<span class="s">EOF</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Restart the SSH service.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>sshd
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
<li><p>Configure each worker node.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Configure SSH service for host-based authentication.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>tee<span class="w"> </span>-a<span class="w"> </span>/etc/ssh/sshd_config<span class="w"> </span><span class="s"><< EOF</span>
|
||
<span class="s">HostbasedAuthentication yes</span>
|
||
<span class="s">IgnoreRhosts no</span>
|
||
<span class="s">UseDNS yes</span>
|
||
<span class="s">EOF</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Create the <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code> file and add the controller’s
|
||
<abbr title="fully qualified domain name">FQDN</abbr>. This tells the worker
|
||
node to accept connections from the controller.</p>
|
||
<p>For example:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-controller.my-domain.com</span>
|
||
</pre></div>
|
||
</div>
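<p>One way to create the file, assuming the example FQDN above:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>echo "hpc-controller.my-domain.com" | sudo tee /etc/hosts.equiv
</pre></div>
</div>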
|
||
</li>
|
||
<li><p>Set its permission to root access only.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>chmod<span class="w"> </span><span class="m">600</span><span class="w"> </span>/etc/hosts.equiv
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Add the controller’s FQDN to <code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>. This allows
|
||
host-based authentication for the root account so that
|
||
actions requiring sudo privileges can be performed.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>cp<span class="w"> </span>-v<span class="w"> </span>/etc/hosts.equiv<span class="w"> </span>/root/.shosts
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Using the controller’s FQDN in <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>, scan for its
|
||
RSA public key and copy it to <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code>.
|
||
Verify the scanned RSA public key matches the controller’s
|
||
<code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_rsa_key.pub</span></code> file.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>ssh-keyscan<span class="w"> </span>-t<span class="w"> </span>rsa<span class="w"> </span>-f<span class="w"> </span>/etc/hosts.equiv<span class="w"> </span>><span class="w"> </span>~/ssh_known_hosts
|
||
sudo<span class="w"> </span>cp<span class="w"> </span>-v<span class="w"> </span>~/ssh_known_hosts<span class="w"> </span>/etc/ssh
|
||
rm<span class="w"> </span>~/ssh_known_hosts
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Restart the SSH service.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>sshd
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
<li><p>On the controller node, SSH into each worker node without having to enter
|
||
a password. On the first connection to each host, you’ll be asked to
add the unknown host to the <code class="file docutils literal notranslate"><span class="pre">$HOME/.ssh/known_hosts</span></code> file. Accept
the request. This will make future SSH connections to each host
non-interactive.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ssh<span class="w"> </span><worker-node>
|
||
</pre></div>
|
||
</div>
|
||
<div class="admonition note">
|
||
<p class="admonition-title">Note</p>
|
||
<p>Setting up host-based authentication on
|
||
<abbr title="Cloud Service Provider">CSP</abbr> environments such as Microsoft Azure
|
||
and Amazon AWS may require some tweaking on the worker nodes’ SSH
|
||
configurations due to the CSP’s virtual network setup. In general,
|
||
cloud VMs have a public and private DNS name. When SSH’ing from the
|
||
controller to a worker node, the SSH client may send the controller’s
|
||
private DNS name, usually something with “internal” in the name,
|
||
as the <cite>chost</cite> instead of its public FQDN as expected in the worker node’s
|
||
<code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>, <code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>, and
|
||
<code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code> files. If the above configurations
|
||
do not work on a cloud VM, meaning you’re asked to enter a password when
SSH’ing from the controller to a worker node, here are
some suggestions for debugging the problem:</p>
|
||
<ol class="arabic simple">
|
||
<li><p>On the controller, try to identify the chost data sent by the SSH
|
||
client using <strong class="command">ssh -vvv <worker-node></strong>. Look for <cite>chost</cite>
|
||
in the debug log. If the chost value is different than the controller’s
|
||
FQDN listed in worker node’s <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>,
|
||
<code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>, and <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code> files,
|
||
then that is likely the cause of the problem. In some cases, chost
|
||
data may not be shown. If so, it’s safe to assume that the SSH client
|
||
is using the controller’s private DNS name as the chost. Proceed to
|
||
steps 2 and 3 below to fix the problem.</p></li>
|
||
<li><p>Get the controller’s private DNS name either by the above step or by
|
||
getting it from your system administrator.</p></li>
|
||
<li><p>On the worker node, make these changes:</p>
|
||
<ol class="arabic simple">
|
||
<li><p>Change the controller’s FQDN in <code class="file docutils literal notranslate"><span class="pre">/etc/hosts.equiv</span></code>,
|
||
<code class="file docutils literal notranslate"><span class="pre">/root/.shosts</span></code>, and <code class="file docutils literal notranslate"><span class="pre">/etc/ssh/ssh_known_hosts</span></code>
|
||
to its private DNS name.</p></li>
|
||
<li><p>Restart the SSH service on the worker node.</p></li>
|
||
<li><p>Retest the connection from the controller node to the worker node.
If that still doesn’t work, try the SSH directive
<cite>HostbasedUsesNameFromPacketOnly yes</cite>, which tells the SSH service
to accept the supplied host name as is and not try to resolve it.
Also, set the directive <cite>UseDNS</cite> to <cite>no</cite> to disable host name
lookup, as sketched at the end of this note.</p></li>
|
||
</ol>
|
||
</li>
|
||
</ol>
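<p>A sketch of the relevant directives in the worker node’s
<code class="file docutils literal notranslate"><span class="pre">/etc/ssh/sshd_config</span></code> after these changes, assuming both adjustments are
needed (this replaces the <cite>UseDNS yes</cite> entry added earlier rather than duplicating it):</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span># /etc/ssh/sshd_config on the worker node (sketch)
HostbasedUsesNameFromPacketOnly yes
UseDNS no
</pre></div>
</div>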
|
||
</div>
|
||
</li>
|
||
<li><p>Verify you can issue a simple command over SSH without typing a password.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Issue the <strong class="command">hostname</strong> command.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ssh<span class="w"> </span><worker-node><span class="w"> </span>hostname
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Issue the <strong class="command">hostname</strong> command with <strong class="command">sudo</strong>.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>ssh<span class="w"> </span><worker-node><span class="w"> </span>sudo<span class="w"> </span>hostname
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
<p>In both cases, you should get a response with the worker node’s hostname.
|
||
If the <cite>sudo</cite> version still prompts for a password, grant the user the
<cite>NOPASSWD</cite> privilege. For example:</p>
|
||
<ol class="arabic">
|
||
<li><p>Edit the sudoers file.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>visudo
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Add the following:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go"><user> ALL=(ALL) NOPASSWD: ALL</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="create-slurm-conf-configuration-file">
|
||
<h2>Create <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file<a class="headerlink" href="#create-slurm-conf-configuration-file" title="Link to this heading">¶</a></h2>
|
||
<p>On the controller, create a new <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file
|
||
that contains general settings, each node’s hardware resource information,
|
||
grouping of nodes into different partitions, and scheduling settings for
|
||
each partition. This file will be copied to all worker nodes in the cluster.</p>
|
||
<ol class="arabic">
|
||
<li><p>Create a base <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>mkdir<span class="w"> </span>-p<span class="w"> </span>/etc/slurm
|
||
sudo<span class="w"> </span>cp<span class="w"> </span>-v<span class="w"> </span>/usr/share/defaults/slurm/slurm.conf<span class="w"> </span>/etc/slurm
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Add the controller information.</p>
|
||
<ol class="loweralpha">
|
||
<li><p><strong class="command">sudoedit</strong> the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file. Set the <cite>ControlMachine</cite>
|
||
value to the controller’s resolvable hostname.</p>
|
||
<p>For example:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">ControlMachine=hpc-controller</span>
|
||
</pre></div>
|
||
</div>
|
||
<div class="admonition note">
|
||
<p class="admonition-title">Note</p>
|
||
<p>Assuming the controller’s FQDN is resolvable, specifying the
|
||
controller’s IP address with the <cite>ControlAddr</cite> key is optional.
|
||
However, it may be helpful to add it.</p>
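<p>For example, with a hypothetical controller address:</p>
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span># 192.168.1.10 is a placeholder address
ControlMachine=hpc-controller
ControlAddr=192.168.1.10
</pre></div>
</div>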
|
||
</div>
|
||
</li>
|
||
<li><p>Save and exit.</p></li>
|
||
</ol>
|
||
</li>
|
||
<li><p>Add the worker nodes information.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Create a file containing a list of the worker nodes.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>><span class="w"> </span>worker-nodes-list<span class="w"> </span><span class="s"><< EOF</span>
|
||
<span class="s">hpc-worker1</span>
|
||
<span class="s">hpc-worker2</span>
|
||
<span class="s">hpc-worker3</span>
|
||
<span class="s">hpc-worker4</span>
|
||
<span class="s">EOF</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Using pdsh, get the hardware configuration of each node.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh<span class="w"> </span>-w<span class="w"> </span>^worker-nodes-list<span class="w"> </span>slurmd<span class="w"> </span>-C
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker4: NodeName=hpc-worker4 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1915</span>
|
||
<span class="go">hpc-worker4: UpTime=0-01:23:28</span>
|
||
<span class="go">hpc-worker3: NodeName=hpc-worker3 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1663</span>
|
||
<span class="go">hpc-worker3: UpTime=0-01:33:41</span>
|
||
<span class="go">hpc-worker2: NodeName=hpc-worker2 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=721</span>
|
||
<span class="go">hpc-worker2: UpTime=0-01:34:56</span>
|
||
<span class="go">hpc-worker1: NodeName=hpc-worker1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=721</span>
|
||
<span class="go">hpc-worker1: UpTime=0-01:39:21</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p><strong class="command">sudoedit</strong> the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file. Append each worker node
|
||
information, but without the <cite>UpTime</cite>, under the <cite>COMPUTE NODES</cite> section.</p>
|
||
<div class="admonition tip">
|
||
<p class="admonition-title">Tip</p>
|
||
<p>It is strongly recommended to set the <cite>RealMemory</cite> value for each
worker node slightly below (say, to 90% of) what was reported by
<strong class="command">slurmd -C</strong>,
in case some memory gets used by other processes. Otherwise, Slurm would
mark a node as unavailable when its memory resource falls below the value
stated in the configuration file. For example, 90% of the 721 reported for
hpc-worker1 is roughly 648, as used in the example below.</p>
|
||
</div>
|
||
<p>Here’s an example with four worker nodes:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="gp">#</span>
|
||
<span class="gp"># </span>COMPUTE<span class="w"> </span>NODES<span class="w"> </span><span class="o">(</span>mode<span class="w"> </span>detailed<span class="w"> </span>values<span class="w"> </span>reported<span class="w"> </span>by<span class="w"> </span><span class="s2">"slurmd -C"</span><span class="w"> </span>on<span class="w"> </span>each<span class="w"> </span>node<span class="o">)</span>
|
||
<span class="go">NodeName=hpc-worker1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=648</span>
|
||
<span class="go">NodeName=hpc-worker2 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=648</span>
|
||
<span class="go">NodeName=hpc-worker3 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1497</span>
|
||
<span class="go">NodeName=hpc-worker4 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=1723</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Create partitions.</p>
|
||
<p>A Slurm partition is essentially a grouping of worker nodes.
|
||
Give each partition a name and decide which worker node(s) belong to
|
||
it.</p>
|
||
<p>For example:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">PartitionName=workers Nodes=hpc-worker1, hpc-worker2, hpc-worker3, hpc-worker4 Default=YES MaxTime=INFINITE State=UP</span>
|
||
<span class="go">PartitionName=debug Nodes=hpc-worker1, hpc-worker3 MaxTime=INFINITE State=UP</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Save and exit.</p></li>
|
||
<li><p>Set the ownership of the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file to <cite>slurm</cite>.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>chown<span class="w"> </span>slurm:<span class="w"> </span>/etc/slurm/slurm.conf
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
<li><p>On the controller node, restart the Slurm controller service.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>slurmctld
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Verify the Slurm controller service restarted without any issues before
|
||
proceeding.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>systemctl<span class="w"> </span>status<span class="w"> </span>slurmctld
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="copy-munge-key-and-slurm-conf-to-all-worker-nodes">
|
||
<h2>Copy MUNGE key and <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> to all worker nodes<a class="headerlink" href="#copy-munge-key-and-slurm-conf-to-all-worker-nodes" title="Link to this heading">¶</a></h2>
|
||
<p>On the controller node, use pdsh, together with the partitions
defined in <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code>, to copy that file and the MUNGE key to
all worker nodes.</p>
|
||
<ol class="arabic">
|
||
<li><p>On the controller node, copy the MUNGE key to all worker nodes and start the
|
||
MUNGE service.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Create the <code class="file docutils literal notranslate"><span class="pre">/etc/munge/</span></code> directory on each node.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>mkdir<span class="w"> </span>/etc/munge
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Copy the MUNGE key over.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdcp<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>/etc/munge/munge.key<span class="w"> </span>/etc/munge
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Set the ownership of the <code class="file docutils literal notranslate"><span class="pre">munge.key</span></code> file to <cite>munge</cite>.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>chown<span class="w"> </span>munge:<span class="w"> </span>/etc/munge/munge.key
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Start the MUNGE service and set it to start automatically on boot.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>munge<span class="w"> </span>--now
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Verify the MUNGE service is running.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span><span class="s2">"systemctl status munge | grep Active"</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker3: Active: active (running) since Wed 2020-04-15 19:47:58 UTC; 55s ago</span>
|
||
<span class="go">hpc-worker4: Active: active (running) since Wed 2020-04-15 19:47:58 UTC; 55s ago</span>
|
||
<span class="go">hpc-worker2: Active: active (running) since Wed 2020-04-15 19:47:59 UTC; 54s ago</span>
|
||
<span class="go">hpc-worker1: Active: active (running) since Wed 2020-04-15 19:47:59 UTC; 54s ago</span>
|
||
</pre></div>
|
||
</div>
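<p>Optionally, confirm that a credential created on the controller can be
decoded on a worker node (a quick cross-host MUNGE check; the worker name is
an example):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># run from the controller; hpc-worker1 is an example
munge -n | ssh hpc-worker1 unmunge
</pre></div>
</div>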
|
||
</li>
|
||
</ol>
|
||
|
||
</li>
|
||
<li><p>On the controller node, copy the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file to all
|
||
worker nodes and start the slurmd service on them.</p>
|
||
<ol class="loweralpha">
|
||
<li><p>Create the <code class="file docutils literal notranslate"><span class="pre">/etc/slurm/</span></code> directory on each worker node.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>mkdir<span class="w"> </span>/etc/slurm
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Copy the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file over.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdcp<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>/etc/slurm/slurm.conf<span class="w"> </span>/etc/slurm
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Set the ownership of the <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> file to <cite>slurm</cite>.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>chown<span class="w"> </span>slurm:<span class="w"> </span>/etc/slurm/slurm.conf
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Start the slurmd service and set it to start automatically on boot.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>slurmd<span class="w"> </span>--now
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Verify the slurmd service is running.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>pdsh<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>systemctl<span class="w"> </span>status<span class="w"> </span>slurmd<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>Active
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker3: Active: active (running) since Wed 2020-04-15 19:39:22 UTC; 1min 17s ago</span>
|
||
<span class="go">hpc-worker4: Active: active (running) since Wed 2020-04-15 19:39:22 UTC; 1min 17s ago</span>
|
||
<span class="go">hpc-worker2: Active: active (running) since Wed 2020-04-15 19:39:23 UTC; 1min 17s ago</span>
|
||
<span class="go">hpc-worker1: Active: active (running) since Wed 2020-04-15 19:39:23 UTC; 1min 17s ago</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="verify-controller-can-run-jobs-on-all-nodes">
|
||
<h2>Verify controller can run jobs on all nodes<a class="headerlink" href="#verify-controller-can-run-jobs-on-all-nodes" title="Link to this heading">¶</a></h2>
|
||
<ol class="arabic">
|
||
<li><p>Check the state of the worker nodes.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sinfo
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">PARTITION AVAIL TIMELIMIT NODES STATE NODELIST</span>
|
||
<span class="go">workers* up infinite 4 idle hpc-worker[1-4]</span>
|
||
<span class="go">debug up infinite 2 idle hpc-worker[1,3]</span>
|
||
</pre></div>
|
||
</div>
|
||
<div class="admonition tip">
|
||
<p class="admonition-title">Tip</p>
|
||
<p>If the nodes are in a “down” state, put them in the “idle” state.</p>
|
||
<p>For example:</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>scontrol<span class="w"> </span>update<span class="w"> </span><span class="nv">nodename</span><span class="o">=</span>hpc-worker<span class="o">[</span><span class="m">1</span>-4<span class="o">]</span><span class="w"> </span><span class="nv">state</span><span class="o">=</span>idle<span class="w"> </span><span class="nv">reason</span><span class="o">=</span><span class="s2">""</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>Additional <a class="reference external" href="https://slurm.schedmd.com/troubleshoot.html">Slurm troubleshooting tips</a>.</p>
|
||
</div>
|
||
</li>
|
||
<li><p>Finally, verify Slurm can run jobs on all four worker nodes by issuing
|
||
a simple <strong class="command">hostname</strong> command.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>srun<span class="w"> </span>-N4<span class="w"> </span>-p<span class="w"> </span>workers<span class="w"> </span>hostname
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker4</span>
|
||
<span class="go">hpc-worker3</span>
|
||
<span class="go">hpc-worker1</span>
|
||
<span class="go">hpc-worker2</span>
|
||
</pre></div>
|
||
</div>
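<p>To target a specific partition instead of the default, pass its name with
<cite>-p</cite>; for example, using the <cite>debug</cite> partition defined earlier:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>srun -N2 -p debug hostname
</pre></div>
</div>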
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="create-and-run-example-scripts">
|
||
<h2>Create and run example scripts<a class="headerlink" href="#create-and-run-example-scripts" title="Link to this heading">¶</a></h2>
|
||
<section id="example-1-return-the-hostname-of-each-worker-and-output-to-show-hostnames-out">
|
||
<h3>Example 1: Return the hostname of each worker and output to <code class="file docutils literal notranslate"><span class="pre">show-hostnames.out</span></code><a class="headerlink" href="#example-1-return-the-hostname-of-each-worker-and-output-to-show-hostnames-out" title="Link to this heading">¶</a></h3>
|
||
<ol class="arabic">
|
||
<li><p>On the controller node, create the Slurm <code class="file docutils literal notranslate"><span class="pre">show-hostnames.sh</span></code> script.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>><span class="w"> </span>show-hostnames.sh<span class="w"> </span><span class="s"><< EOF</span>
|
||
<span class="s">#!/bin/bash</span>
|
||
<span class="s">#</span>
|
||
<span class="s">#SBATCH --job-name=show-hostnames</span>
|
||
<span class="s">#SBATCH --output=show-hostnames.out</span>
|
||
<span class="s">#</span>
|
||
<span class="s">#SBATCH --ntasks=4</span>
|
||
<span class="s">#SBATCH --time=10:00</span>
|
||
<span class="s">#SBATCH --mem-per-cpu=100</span>
|
||
<span class="s">#SBATCH --ntasks-per-node=1</span>
|
||
|
||
<span class="s">srun hostname</span>
|
||
<span class="s">EOF</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Execute the script.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sbatch<span class="w"> </span>show-hostnames.sh
|
||
</pre></div>
|
||
</div>
|
||
<p>The output file will appear on the first node of the partition used. Since no
partition was explicitly specified, this is the default <cite>workers</cite> partition.</p>
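<p>While the job is pending or running, you can check its state from the
controller with <strong class="command">squeue</strong>:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>squeue
</pre></div>
</div>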
|
||
</li>
|
||
<li><p>View the result.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh<span class="w"> </span>-w<span class="w"> </span>hpc-worker1<span class="w"> </span><span class="s2">"cat show-hostnames.out"</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">hpc-worker1: hpc-worker3</span>
|
||
<span class="go">hpc-worker1: hpc-worker4</span>
|
||
<span class="go">hpc-worker1: hpc-worker1</span>
|
||
<span class="go">hpc-worker1: hpc-worker2</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
<section id="example-2-an-mpi-hello-world-program">
|
||
<h3>Example 2: An MPI “Hello, World!” program<a class="headerlink" href="#example-2-an-mpi-hello-world-program" title="Link to this heading">¶</a></h3>
|
||
<ol class="arabic">
|
||
<li><p>On the controller node, create the <code class="file docutils literal notranslate"><span class="pre">mpi-helloworld.c</span></code> program.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="w"> </span>cat<span class="w"> </span>><span class="w"> </span>mpi-helloworld.c<span class="w"> </span><span class="s"><< EOF</span>
|
||
<span class="s"> #include <stdio.h></span>
|
||
<span class="s"> #include <unistd.h></span>
|
||
<span class="s"> #include <mpi.h></span>
|
||
|
||
<span class="s"> int main(int argc, char** argv)</span>
|
||
<span class="s"> {</span>
|
||
<span class="s"> // Init the MPI environment</span>
|
||
<span class="s"> MPI_Init(NULL, NULL);</span>
|
||
|
||
<span class="s"> // Get the number of processes</span>
|
||
<span class="s"> int world_size;</span>
|
||
<span class="s"> MPI_Comm_size(MPI_COMM_WORLD, &world_size);</span>
|
||
|
||
<span class="s"> // Get the rank of the process</span>
|
||
<span class="s"> int world_rank;</span>
|
||
<span class="s"> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);</span>
|
||
|
||
<span class="s"> // Get the name of the processor</span>
|
||
<span class="s"> char processor_name[MPI_MAX_PROCESSOR_NAME];</span>
|
||
<span class="s"> int name_len;</span>
|
||
<span class="s"> MPI_Get_processor_name(processor_name, &name_len);</span>
|
||
|
||
<span class="s"> // Print a hello world message</span>
|
||
<span class="s"> printf("Hello, World! from from processor %s, rank %d out of %d processors\n", processor_name, world_rank, world_size);</span>
|
||
|
||
<span class="s"> // Finalize the MPI environment</span>
|
||
<span class="s"> MPI_Finalize();</span>
|
||
<span class="s">}</span>
|
||
<span class="s">EOF</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Add the <cite>c-basic</cite> and <cite>devpkg-openmpi</cite> bundles, which are needed to compile
|
||
it.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sudo<span class="w"> </span>swupd<span class="w"> </span>bundle-add<span class="w"> </span>c-basic<span class="w"> </span>devpkg-openmpi
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Compile it.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>mpicc<span class="w"> </span>-o<span class="w"> </span>mpi-helloworld<span class="w"> </span>mpi-helloworld.c
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Copy the binary to all worker nodes.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdcp<span class="w"> </span>-P<span class="w"> </span>workers<span class="w"> </span>./mpi-helloworld<span class="w"> </span><span class="nv">$HOME</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Create a Slurm batch script to run it.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>><span class="w"> </span>mpi-helloworld.sh<span class="w"> </span><span class="s"><< EOF</span>
|
||
<span class="s">#!/bin/sh</span>
|
||
<span class="s">#SBATCH -o mpi-helloworld.out</span>
|
||
<span class="s">#SBATCH --nodes=4</span>
|
||
<span class="s">#SBATCH --ntasks-per-node=1</span>
|
||
|
||
<span class="s">srun ./mpi-helloworld</span>
|
||
<span class="s">EOF</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>Run the batch script.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>sbatch<span class="w"> </span>mpi-helloworld.sh
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
<li><p>View the results on the first worker node in the partition.</p>
|
||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pdsh<span class="w"> </span>-w<span class="w"> </span>hpc-worker1<span class="w"> </span><span class="s2">"cat mpi-helloworld.out"</span>
|
||
</pre></div>
|
||
</div>
|
||
<p>Example output:</p>
|
||
<div class="highlight-console notranslate"><div class="highlight"><pre><span></span><span class="go">Hello, World! from from processor hpc-worker3, rank 2 out of 4 processors</span>
|
||
<span class="go">Hello, World! from from processor hpc-worker4, rank 3 out of 4 processors</span>
|
||
<span class="go">Hello, World! from from processor hpc-worker1, rank 0 out of 4 processors</span>
|
||
<span class="go">Hello, World! from from processor hpc-worker2, rank 1 out of 4 processors</span>
|
||
</pre></div>
|
||
</div>
|
||
</li>
|
||
</ol>
|
||
</section>
|
||
</section>
|
||
</section>
|
||
|
||
|
||
<div class="clearer"></div>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<div class="sphinxsidebar" role="navigation" aria-label="Main">
|
||
<div class="sphinxsidebarwrapper">
|
||
<p class="logo"><a href="../index.html">
|
||
<img class="logo" src="../_static/clearlinux.png" alt="Logo of Clear Linux* Project Docs"/>
|
||
</a></p>
|
||
<div>
|
||
<h3><a href="../index.html">Table of Contents</a></h3>
|
||
<ul>
|
||
<li><a class="reference internal" href="#">HPC Cluster</a><ul>
|
||
<li><a class="reference internal" href="#prerequisites">Prerequisites</a></li>
|
||
<li><a class="reference internal" href="#set-up-controller-node">Set up controller node</a></li>
|
||
<li><a class="reference internal" href="#set-up-worker-nodes">Set up worker nodes</a></li>
|
||
<li><a class="reference internal" href="#set-up-password-less-ssh-access-and-pdsh-on-all-nodes">Set up password-less SSH access and pdsh on all nodes</a></li>
|
||
<li><a class="reference internal" href="#create-slurm-conf-configuration-file">Create <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> configuration file</a></li>
|
||
<li><a class="reference internal" href="#copy-munge-key-and-slurm-conf-to-all-worker-nodes">Copy MUNGE key and <code class="file docutils literal notranslate"><span class="pre">slurm.conf</span></code> to all worker nodes</a></li>
|
||
<li><a class="reference internal" href="#verify-controller-can-run-jobs-on-all-nodes">Verify controller can run jobs on all nodes</a></li>
|
||
<li><a class="reference internal" href="#create-and-run-example-scripts">Create and run example scripts</a><ul>
|
||
<li><a class="reference internal" href="#example-1-return-the-hostname-of-each-worker-and-output-to-show-hostnames-out">Example 1: Return the hostname of each worker and output to <code class="file docutils literal notranslate"><span class="pre">show-hostnames.out</span></code></a></li>
|
||
<li><a class="reference internal" href="#example-2-an-mpi-hello-world-program">Example 2: An MPI “Hello, World!” program</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
|
||
</div>
|
||
<div>
|
||
<h4>Previous topic</h4>
|
||
<p class="topless"><a href="fmv.html"
|
||
title="previous chapter">Function Multi-Versioning</a></p>
|
||
</div>
|
||
<div>
|
||
<h4>Next topic</h4>
|
||
<p class="topless"><a href="kata.html"
|
||
title="next chapter">Kata Containers*</a></p>
|
||
</div>
|
||
<div role="note" aria-label="source link">
|
||
<h3>This Page</h3>
|
||
<ul class="this-page-menu">
|
||
<li><a href="../_sources/tutorials/hpc.rst.txt"
|
||
rel="nofollow">Show Source</a></li>
|
||
</ul>
|
||
</div>
|
||
<search id="searchbox" style="display: none" role="search">
|
||
<h3 id="searchlabel">Quick search</h3>
|
||
<div class="searchformwrapper">
|
||
<form class="search" action="../search.html" method="get">
|
||
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
|
||
<input type="submit" value="Go" />
|
||
</form>
|
||
</div>
|
||
</search>
|
||
<script>document.getElementById('searchbox').style.display = "block"</script>
|
||
</div>
|
||
</div>
|
||
<div class="clearer"></div>
|
||
</div>
|
||
<div class="related" role="navigation" aria-label="Related">
|
||
<h3>Navigation</h3>
|
||
<ul>
|
||
<li class="right" style="margin-right: 10px">
|
||
<a href="../genindex.html" title="General Index"
|
||
>index</a></li>
|
||
<li class="right" >
|
||
<a href="kata.html" title="Kata Containers*"
|
||
>next</a> |</li>
|
||
<li class="right" >
|
||
<a href="fmv.html" title="Function Multi-Versioning"
|
||
>previous</a> |</li>
|
||
<li class="nav-item nav-item-0"><a href="../index.html">Documentation for Clear Linux* project</a> »</li>
|
||
<li class="nav-item nav-item-1"><a href="index.html" >Tutorials</a> »</li>
|
||
<li class="nav-item nav-item-this"><a href="">HPC Cluster</a></li>
|
||
</ul>
|
||
</div>
|
||
<div class="footer" role="contentinfo">
|
||
© Copyright 2022 Intel Corporation. All Rights Reserved.
|
||
Last updated on Nov 04, 2024.
|
||
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 8.1.3.
|
||
</div>
|
||
</body>
|
||
</html> |