chore: reorganized use-case structure

David Allen 2025-07-19 14:37:51 -06:00
parent 6ad4cfb189
commit 9c11e3883a
Signed by: towk
GPG key ID: 793B2924A49B3A3F
10 changed files with 209 additions and 33 deletions


@ -2,6 +2,15 @@ After going through the [tutorial](https://github.com/OpenCHAMI/tutorial-2025),
At this point, we can use what we have learned so far in the OpenCHAMI tutorial to customize our nodes in various ways, such as changing how we serve images, deriving new images, and updating our cloud-init config. This section explores some of the use cases you may want to pursue to adapt OpenCHAMI to your own needs.
Some of these use cases include:
1. [Adding SLURM and MPI to the Compute Node](Adding%20SLURM%20and%20MPI%20to%20the%20Compute%20Node.md)
2. [Serving the Root Filesystem with NFS](Serving%20the%20Root%20Filesystem%20with%20NFS%20(import-image.sh).md)
3. [Enable WireGuard Security for the `cloud-init-server`](Enable%20WireGuard%20Security%20for%20the%20`cloud-init-server`.md)
4. [Using Image Layers to Customize Boot Image with a Common Base](Using%20Image%20Layers%20to%20Customize%20Boot%20Image%20with%20a%20Common%20Base.md)
5. [Using `kexec` to Reboot Nodes for an Upgrade or Specialized Kernel](Using%20`kexec`%20to%20Reboot%20Nodes%20For%20an%20Upgrade%20or%20Specialized%20Kernel.md)
6. [Discovering Nodes Dynamically with Redfish](Discovering%20Nodes%20Dynamically%20with%20Redfish.md)
## Adding SLURM and MPI to the Compute Node
After getting our nodes to boot using our compute images, let's try running a test MPI job. We need to install and configure both SLURM and MPI to do so. We can do this in at least two ways:
@ -10,11 +19,101 @@ After getting our nodes to boot using our compute images, let's try running a te
### Building Into the Image
We can use the `image-builder` tool to include the SLURM and OpenMPI packages directly in the image. Since we're building a new image for our compute node, we'll base our new image on the compute image definition from the tutorial.
You should already have a directory at `/opt/workdir/images`. Confirm that you already have a base compute image with `s3cmd ls`.
```bash
# TODO: put the output of `s3cmd ls` here with the compute-base image
```
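Assuming the `boot-images` bucket and `compute/base/` prefix used in the image definition below, the listing can be scoped like this:
```bash
s3cmd ls s3://boot-images/compute/base/
```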
If you do not have the image, go back to [this step](https://github.com/OpenCHAMI/tutorial-2025/blob/main/Phase%202/Readme.md#243-configure-the-base-compute-image) in the tutorial, build the image, and push it to S3. Once you have done that, proceed to the next step.
Now, create a new file at `/opt/workdir/images/compute-slurm-rocky9.yaml` and copy in the contents below.
```yaml
options:
  layer_type: 'base'
  name: 'compute-slurm'
  publish_tags:
    - 'rocky9'
  pkg_manager: 'dnf'
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'

  # Publish SquashFS image to local S3
  publish_s3: 'http://demo.openchami.cluster:9000'
  s3_prefix: 'compute/base/'
  s3_bucket: 'boot-images'

  # Publish OCI image to container registry
  #
  # This is the only way to be able to re-use this image as
  # a parent for another image layer.
  publish_registry: 'demo.openchami.cluster:5000/demo'
  registry_opts_push:
    - '--tls-verify=false'

repos:
  - alias: 'Epel9'
    url: 'https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/'
    gpg: 'https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9'

packages:
  - slurm
  - openmpi
```
Notice that the only changes to the new image definition are `options.name` and `packages`. Since we're basing this image on another image, we only need to list the packages we want to add on top of the parent. We can now build the image and push it to S3.
```bash
podman run --rm --device /dev/fuse --network host \
  -e S3_ACCESS=admin -e S3_SECRET=admin123 \
  -v /opt/workdir/images/compute-slurm-rocky9.yaml:/home/builder/config.yaml \
  ghcr.io/openchami/image-build:latest \
  image-build --config config.yaml --log-level DEBUG
```
Wait until the build finishes, then check the S3 bucket with `s3cmd ls` again to confirm the new image is there. Next, add a new boot script at `/opt/workdir/boot/boot-compute-slurm.yaml`, which we will use to boot our compute nodes.
```yaml
kernel: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/vmlinuz-5.14.0-570.21.1.el9_6.x86_64'
initrd: 'http://172.16.0.254:9000/boot-images/efi-images/compute/debug/initramfs-5.14.0-570.21.1.el9_6.x86_64.img'
params: 'nomodeset ro root=live:http://172.16.0.254:9000/boot-images/compute/debug/rocky9.6-compute-slurm-rocky9 ip=dhcp overlayroot=tmpfs overlayroot_cfgdisk=disabled apparmor=0 selinux=0 console=ttyS0,115200 ip6=off cloud-init=enabled ds=nocloud-net;s=http://172.16.0.254:8081/cloud-init'
macs:
- 52:54:00:be:ef:01
- 52:54:00:be:ef:02
- 52:54:00:be:ef:03
- 52:54:00:be:ef:04
- 52:54:00:be:ef:05
```
Set the boot parameters and confirm that they were applied correctly.
```bash
ochami bss boot params set -f yaml -d @/opt/workdir/boot/boot-compute-slurm.yaml
ochami bss boot params get -F yaml
```
Finally, boot the compute node.
```bash
sudo virt-install \
--name compute1 \
--memory 4096 \
--vcpus 1 \
--disk none \
--pxe \
--os-variant centos-stream9 \
--network network=openchami-net,model=virtio,mac=52:54:00:be:ef:01 \
--graphics none \
--console pty,target_type=serial \
--boot network,hd \
--boot loader=/usr/share/OVMF/OVMF_CODE.secboot.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/OVMF/OVMF_VARS.fd,loader_secure=no \
--virt-type kvm
```
Your compute node should start up with iPXE output. If your node does not boot, check the [troubleshooting](Troubleshooting.md) section for common issues.
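If you detach from the node's serial console and want to reattach later, you can reconnect through libvirt (the domain name matches the `--name` passed to `virt-install` above):
```bash
sudo virsh console compute1
```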
### Installing via Cloud-Init
Alternatively, we can install the necessary SLURM and MPI packages after booting by adding the packages to our cloud-init config and using the `cmds` section for configuration.
Let's start by making changes to the cloud-init config file at `/opt/workdir/cloud-init/computes.yaml` that we used previously. Note that we are using pre-built RPMs to install SLURM and OpenMPI from the Rocky 9 repos.
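As a rough sketch, the additions to `computes.yaml` might look like the following; the surrounding structure comes from the tutorial's config, and the setup command shown here is illustrative only:
```yaml
# Illustrative additions only -- merge into the existing computes.yaml structure.
packages:
  - slurm
  - openmpi
cmds:
  # Hypothetical setup step; a real config also needs slurm.conf, munge keys, etc.
  - 'systemctl enable --now slurmd'
```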
@ -141,20 +240,4 @@ If you saw the output above, you should now be able to inspect the output of the
# TODO: add output of MPI job (should be something like hello.o and/or hello.e)
```
And that's it! You have successfully launched an MPI job with SLURM from an OpenCHAMI deployed system.
## Serving the Root Filesystem with NFS (import-image.sh)
For this tutorial, we served images via HTTP using a local S3 bucket (MinIO) and OCI registry. We could instead serve our images using NFS by setting up and running an NFS server on the head node, including NFS tools in our base image, and configuring our nodes to work with NFS.
## Enable WireGuard Security for the `cloud-init-server`
## Using Image Layers to Customize Boot Image with a Common Base
Often, we want to allocate nodes for different purposes using different images. Let's take the base image that we created before and create another Kubernetes layer called `kubernetes-worker` on top of it. We would then need to modify the boot script to use this new Kubernetes image and update cloud-init to set up the nodes.
## Using `kexec` to Reboot Nodes for an Upgrade or Specialized Kernel
## Discovering Nodes Dynamically with Redfish
In this tutorial, we used static discovery to populate our inventory in SMD instead of dynamically discovering nodes on our network. Static discovery is good when we know the MAC address, IP address, xname, and NID of our nodes beforehand, and it guarantees deterministic behavior. However, if we don't know these properties or if we want to update our inventory state, we can use `magellan` to scan, collect, and populate SMD with these properties.


@ -0,0 +1,41 @@
In the tutorial, we used static discovery to populate our inventory in SMD instead of dynamically discovering nodes on our network. Static discovery is good when we know the MAC address, IP address, and/or node ID of our nodes beforehand, and it guarantees deterministic behavior. However, sometimes we might not know these properties, or we may want to check the current state of our hardware, say after a failure. In these scenarios, we can probe our hardware dynamically using the scanning feature from `magellan` and then update the state of our inventory.
For this demonstration, we have two prerequisites:
1. Emulate board management controllers (BMCs) with running Redfish services
2. Have a running instance of SMD or a full deployment of OpenCHAMI
The `magellan` repository includes an emulator that we can use for quick-and-dirty testing. This is useful if we want to try out the capabilities of the tool without having to put too much time and effort into setting up an environment. However, we want to use multiple BMCs to show how `magellan` can distinguish between Redfish and non-Redfish services.
TODO: Add content setting up multiple emulated BMCs with Redfish services (the quickstart in the deployment-recipes has this already).
### Performing a Scan
A scan sends requests to all devices on the network specified with the `--subnet` flag. If a device responds, it is added to a cache database that we'll need for the next section.
Let's do a scan and see what we can find on our network. We should be able to find all of our emulated BMCs without having to worry too much about any other services.
```bash
magellan scan --subnet 172.16.0.0/24
```
This command should not produce any output if it runs successfully. By default, the cache is stored at `/tmp/$USER/magellan/assets.db` as a small SQLite 3 database. If we want to store it somewhere else, we can specify a path with the `--cache` flag.
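For example, to keep the cache alongside the rest of our working files (the path here is arbitrary):
```bash
magellan scan --subnet 172.16.0.0/24 --cache /opt/workdir/magellan/assets.db
```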
We can see which BMCs were found with the `list` command.
```bash
magellan list
```
You should see the emulated BMCs.
```bash
# TODO: add list of emulated BMCs from `magellan list` output
```
Now that we know the IP addresses of the BMCs, we can use the `collect` command to pull inventory data.
### Collecting Hardware Inventory
When collecting inventory, `magellan` walks the Redfish endpoints recorded in the scan cache, gathers hardware information from each BMC, and can forward that data to SMD to populate the inventory.
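A minimal sketch of a collect invocation follows. The credential and SMD-host flags shown are assumptions (as is the SMD URL), so check `magellan collect --help` for the exact names your version supports:
```bash
# Hypothetical flags and URL -- verify against `magellan collect --help`.
magellan collect \
  --username admin \
  --password password \
  --host http://demo.openchami.cluster:27779
```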
### Updating the Hardware Inventory


@ -0,0 +1 @@
For this tutorial, we served images via HTTP using a local S3 bucket (MinIO) and OCI registry. We could instead serve our images using NFS by setting up and running an NFS server on the head node, including NFS tools in our base image, and configuring our nodes to work with NFS.
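As a rough sketch of the server side, assuming the head node IP from the tutorial (172.16.0.254) and a hypothetical export path of `/exports/compute` containing an unpacked image tree:
```bash
# On the head node (the export path is hypothetical)
sudo mkdir -p /exports/compute
echo '/exports/compute 172.16.0.0/24(ro,no_root_squash,no_subtree_check)' | sudo tee -a /etc/exports
sudo systemctl enable --now nfs-server
sudo exportfs -ra
```
The boot parameters would then point `root=` at the export using dracut's NFS syntax, e.g. `root=nfs:172.16.0.254:/exports/compute,ro` in place of the `root=live:http://...` parameter, which also requires the image to include the `nfs` dracut module and NFS client tools.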


@ -0,0 +1 @@
Often, we want to allocate nodes for different purposes using different images. Let's take the base image that we created before and create another Kubernetes layer called `kubernetes-worker` on top of it. We would then need to modify the boot script to use this new Kubernetes image and update cloud-init to set up the nodes.
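Following the pattern of the `compute-slurm` definition from the SLURM use case, a derived layer definition might look something like this; the package set and the Kubernetes repo entry are illustrative placeholders, and the S3 prefix is an assumption:
```yaml
options:
  layer_type: 'base'
  name: 'kubernetes-worker'
  publish_tags:
    - 'rocky9'
  pkg_manager: 'dnf'
  parent: 'demo.openchami.cluster:5000/demo/rocky-base:9'
  registry_opts_pull:
    - '--tls-verify=false'
  publish_s3: 'http://demo.openchami.cluster:9000'
  s3_prefix: 'compute/k8s/'
  s3_bucket: 'boot-images'

# Placeholder repo -- substitute the real upstream Kubernetes repo URL and GPG key.
repos:
  - alias: 'Kubernetes'
    url: 'https://pkgs.k8s.io/...'
    gpg: 'https://pkgs.k8s.io/...'

packages:
  - containerd
  - kubelet
  - kubeadm
```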