- Introduction
- Accessing the Cluster
- Containerized Jobs Workflow
- Job Submission in SLURM
- Job Monitoring and Debugging
- Performance Optimization
- Shared Storage & File Systems
- Best Practices & Troubleshooting
- Frequently Asked Questions
Introduction
The Athena GPU Cluster consists of 9 compute nodes, each equipped with 8 Nvidia A100 GPUs, interconnected via a high-speed InfiniBand fabric. The cluster is optimized for deep learning, AI research, and scientific simulations requiring accelerated computation.
Accessing the Cluster
SSH to Login Node
After successfully registering for HPC services and receiving a confirmation email, connect to the Athena login node using SSH from a Technion network or VPN:
ssh username@dgx-master.technion.ac.il
Login nodes are protected behind the Technion firewall; therefore, you must be connected to the Technion network (or VPN) before SSHing to the cluster.
The login node is the entry point to the Athena Cluster, where users can manage jobs, edit scripts, and transfer files. It is not meant for running computations – jobs must be submitted to SLURM to execute on compute nodes. Running heavy tasks on the login node can disrupt other users. Currently, dgx-master is the main login node for Athena, but in the near future multiple servers may act as the frontend of the cluster.
- Host Name: dgx-master.technion.ac.il
- Port Number: 22
- Username: Your Technion username (same as your email’s)
- Password: Your Technion account password
SSH using MobaXterm Windows client
Windows users should download two programs to access the cluster conveniently.
MobaXterm – An SSH client for remote connections.
WinSCP – An SCP client for remote file transfers.
After installing MobaXterm, open the application and select “Session,” then choose “SSH.” In the session settings, enter “dgx-master.technion.ac.il” under “Remote Host”. Check the “Specify username” box and input your Technion username (same as your email address). Set Port to 22 and proceed to connect.
After clicking OK, an Athena Cluster session will open. You will see a Linux terminal on the right and a file navigator on the left. The terminal allows you to execute commands, while the navigator helps with file transfers and management. However, all computations on Athena must be executed via the terminal using SLURM, as described in the following sections.
WinSCP provides a more convenient way to transfer files between your computer and your Athena home directory. To get started, open WinSCP and create a new session. In the connection settings, enter the same details used for SSH: host name dgx-master.technion.ac.il, port 22, your Technion username, and your Technion password.
Once logged in, you can easily transfer files by dragging and dropping them between your local machine and the cluster.
Shared Storage
When you log in to dgx-master, a bash login shell is opened and awaits your commands. Each Athena user is granted 300GB of storage in their HOME directory (~/) and 2TB in the shared group (project) directory (~/work). Both directories are shared across all compute nodes.
To upload your data to the cluster storage, use scp or WinSCP for secure file transfers.
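For example, a minimal upload of a local directory to the shared project directory might look like this (local and remote paths are placeholders):
scp -r ./my_dataset username@dgx-master.technion.ac.il:~/work/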
Containerized Jobs Workflow
The supported method of submitting jobs to Athena is based on containers.
All computations on the Athena GPU Cluster are executed within Enroot containers, which encapsulate applications, code, and dependencies into a self-contained environment. This ensures compatibility and reproducibility, allowing jobs to run on any compute node without dependency conflicts.
The standard workflow involves:
1. Find & Download a Container Image
A variety of optimized containers are available in the NGC Catalog, including TensorFlow and PyTorch images for AI applications. If no suitable container is found, a minimal Ubuntu image can serve as a base. To download a container, first, retrieve its pull tag from the catalog.
Then, execute the following command on the login node:
enroot import docker://nvcr.io#nvidia/tensorflow:22.02-tf2-py3
To specify the output path and image file name:
enroot import -o my_tensorflow_22.02.sqsh docker://nvcr.io#nvidia/tensorflow:22.02-tf2-py3
For a basic container, such as Ubuntu, use:
enroot import docker://ubuntu
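After an import finishes, the image is written as a .sqsh file in the current directory (or the path given with -o); a quick way to confirm (the file name depends on the image and tag you imported):
ls -lh *.sqsh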
2. Evaluate the Image in an Interactive Session
Once a container is imported, it remains immutable by default. To customize it, such as installing additional packages or modifying configurations, you need to run it interactively, apply the changes, and save the modified version. To open an interactive session within the container and allow modifications, use:
srun --pty -p mig --qos=mig_2H_2G_1J --time=00:30:00 --gpus=1 --container-image="path/to/container.sqsh" --container-save="path/to/save/container.sqsh" /bin/bash
This command:
- Launches an interactive shell inside the container with "--pty".
- Asks the scheduler to assign resources in the "mig" partition, where users can evaluate their images and where a hard limit on the job's running time (walltime) applies.
- The QOS mig_2H_2G_1J applies the specific configuration required to start using the mig partition.
- Requests the resources for half an hour (--time=00:30:00), out of the two-hour limit applied by the QOS.
- With --gpus=1 we ask to allocate a single GPU to the job.
- With --container-image we specify the path of the image we want to use.
- Applies changes in-memory and, once the session ends, saves the modifications to a new .sqsh file specified by the "--container-save" argument.
- Finally, /bin/bash is executed inside the container, starting an interactive bash session.
Suppose you need additional Python packages in your image. Once inside the container shell, install packages as needed:
pip install mypackage
apt update && apt install -y somepackage
As long as you specified a --container-save path, the modified image, including the packages you installed, will be written to that path.
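To verify that the saved image actually contains the new packages, a quick check is to relaunch it on the mig partition and import the package (a sketch; the package name and paths are the placeholders used above):
srun -p mig --qos=mig_2H_2G_1J --time=00:10:00 --gpus=1 --container-image="path/to/save/container.sqsh" python -c "import mypackage"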
3. Running a Container
To submit a job to Athena, you must launch the container in interactive or batch mode using SLURM.
Running a Container in Batch Mode
For production workloads, submit batch jobs rather than interactive sessions. Below is an example batch script:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --ntasks=1
#SBATCH --gpus=4
#SBATCH --cpus-per-task=32
#SBATCH --qos=normal
#SBATCH --time=12:00:00
srun --container-image=/home/user/container.sqsh --container-mounts=/home/user/data:/mnt/data python train.py
Save this script as my_job.sh and submit it using:
sbatch my_job.sh
Key Differences from Interactive Mode:
- Batch mode is non-interactive, meaning jobs run in the background.
- SLURM handles scheduling, ensuring your job executes when resources are available.
- Logs are saved to a slurm-<job_id>.out file instead of being displayed in real time.
Tip: Use squeue -u $USER to check your job status and see if it's running or pending.
Job Submission in SLURM
Selecting the Right QOS for Your Jobs
The Athena Cluster provides multiple Quality of Service (QOS) options to optimize GPU allocation and job scheduling. The QOS you select determines job runtime, GPU limits, and job concurrency.
Each QOS name follows a structured format indicating its constraints:
<Partition>_<WallTime>_<Max GPUs>_<Max Running Jobs>
For example, mig_2H_2G_1J means:
- "mig" partition (for testing/evaluation).
- 2 Hours of maximum runtime (2H).
- 2 GPUs maximum (2G).
- 1 running job at a time (1J).
QOS Options for the mig Partition (Testing & Evaluation)
QOS Name | Max WallTime | Max GPUs | Max Running Jobs | Max Pending Jobs |
---|---|---|---|---|
mig_2H_2G_1J | 2 Hours | 2 GPUs (16 shards) | 1 | 1 |
Tip: Use “mig” QOS if you are testing jobs, debugging scripts, or evaluating container setups before submitting to production.
QOS Options for the work Partition (Main Workloads)
QOS Name | Max WallTime | Max GPUs | Max Running Jobs | Max Pending Jobs |
---|---|---|---|---|
work_1H_1G_4J | 1 Hour | 1 GPU (8 shards) | 4 | 12 |
work_1H_2G_4J | 1 Hour | 2 GPUs (16 shards) | 4 | 12 |
work_24H_8G_4J | 24 Hours | 8 GPUs (64 shards) | 1 | 12 |
work_24H_16G_4J | 24 Hours | 16 GPUs (128 shards) | 1 | 12 |
Tip: Use “work” QOS if you are running production workloads or large training jobs.
How to Choose the Right QOS
- How long will my job run? Short jobs → 1H QOS; long jobs → 24H QOS.
- How many GPUs do I need? Small-scale → 1G or 2G; large-scale → 8G or 16G.
- Do I need multiple jobs running at the same time? Parallel jobs → a QOS with higher job limits (e.g., 4J, 10J).
- Is this a test or production workload? Debugging/testing → mig QOS; actual production jobs → work QOS.
Tip: Specifying --time when submitting a job helps the SLURM backfill scheduler optimize resource usage.
Jobs with shorter walltimes may be scheduled sooner if they fit within available gaps in the queue, allowing SLURM to efficiently utilize idle resources while waiting for larger jobs to start.
To increase the chance of earlier execution, set the shortest time limit necessary for your job.
srun --qos=work_1H_2G_4J --gpus=2 --time=00:30:00 --pty bash
Understanding SLURM Batch Scripts
A SLURM batch script is a Bash script that includes SLURM directives to define resource requests and execution parameters. These directives begin with #SBATCH and guide the job scheduler in allocating the necessary compute resources.
Essential SLURM Directives
Directive | Description | Example |
---|---|---|
--job-name | Assigns a name to the job. | #SBATCH --job-name=my_job |
--output | Specifies an output file for logs. | #SBATCH --output=job_%j.out |
--error | Specifies an error file for logs. | #SBATCH --error=job_%j.err |
--ntasks | Number of tasks (1 for a single-node job). | #SBATCH --ntasks=1 |
--cpus-per-task | Number of CPU cores per task. | #SBATCH --cpus-per-task=16 |
--gpus | Number of GPUs requested. | #SBATCH --gpus=4 |
--qos | Quality of Service (QOS) selection. | #SBATCH --qos=work_24H_8G_4J |
--partition | Defines the partition to use. | #SBATCH --partition=work |
--time | Maximum job runtime. | #SBATCH --time=12:00:00 |
Writing a SLURM Job Script
A sample SLURM batch script for launching a deep learning training job:
#!/bin/bash
#SBATCH --job-name=my_training_job
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus=4
#SBATCH --qos=work_24H_8G_4J
#SBATCH --partition=work
#SBATCH --time=12:00:00

# Load necessary modules
module load cuda/12.2
module load pytorch/2.1

# Run the job inside a container
srun --container-image=/home/user/container.sqsh --container-mounts=/home/user/data:/mnt/data python train.py
Explanation of Key Directives:
- Resources: Requests 4 GPUs and 32 CPU cores for computation.
- QOS & Partition: Uses the work_24H_8G_4J QOS in the work partition.
- Environment Setup: Loads CUDA and PyTorch modules before running the training script.
- Container Execution: Runs inside an Enroot container with external data mounted.
Submitting a Job
Once the batch script is ready, submit the job using:
sbatch my_job_script.sh
After submission, SLURM assigns a Job ID, which can be used to track the job’s status.
Example Output:
Submitted batch job 123456
Monitoring Job Progress
To check the status of your submitted jobs:
1. List all your jobs:
squeue -u $USER
2. View job details:
scontrol show job <job_id>
3. Monitor real-time job logs:
tail -f job_<job_id>.out
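If SLURM accounting is enabled on the cluster (an assumption, not stated in this guide), sacct can also summarize a job after it has left the queue:
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode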
Canceling a Job
If needed, cancel a job using:
scancel <job_id>
Example:
scancel 123456
This immediately stops the job and releases the allocated resources.
Job Monitoring and Debugging
1. Checking Job Status
squeue -u $USER
2. Debugging Failures
Check SLURM output and error logs:
cat slurm-<job_id>.out
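For long logs, a simple pattern match can surface common failure messages quickly (adjust the keywords to your framework; the job ID is a placeholder):
grep -inE "error|traceback|out of memory" slurm-<job_id>.out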
Performance Optimization
1. Optimizing GPU Utilization
Monitor GPU usage in real-time:
watch -n 1 nvidia-smi
2. InfiniBand Tuning
Use efficient communication libraries like NCCL for multi-GPU jobs.
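As an illustrative sketch (the image path and training script are placeholders), NCCL's standard NCCL_DEBUG environment variable can be enabled to confirm in the job log that the InfiniBand fabric is used for inter-GPU communication:
export NCCL_DEBUG=INFO   # print NCCL initialization and transport details to the job log
srun --gpus=8 --container-image=/home/user/container.sqsh torchrun --nproc_per_node=8 train.py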
Shared Storage & File Systems
Athena provides two main storage locations:
- HOME: Persistent storage for user files.
- WORK: High-speed scratch space for temporary jobs.
Best Practices & Troubleshooting
1. Avoid Common SLURM Errors
Always check resource requests to avoid jobs pending indefinitely.
2. Handling File System Quotas
du -sh ~
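The same check applies to the shared project directory, whose 2TB quota is described in the Shared Storage section above:
du -sh ~/work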
Frequently Asked Questions
docker pull vs. enroot import – how do I convert the command?
Most image repositories will advise you to copy and paste a pull command in the Docker (Hub) form:
docker pull nvcr.io/nvidia/tensorflow:22.02-tf2-py3
but to download the image with Enroot, as a squash file on our filesystem, we need to convert the "pull" command into an "import" command as follows:
- Replace the "/" character after the registry host ("nvcr.io") with the "#" character.
- Add "docker://" before "nvcr.io".
- Replace "docker pull" with "enroot import".
enroot import docker://nvcr.io#nvidia/tensorflow:22.02-tf2-py3
Why Does Athena Have Granular QOS Configurations?
- Efficient Resource Allocation: Different jobs have different needs, from quick debugging to long training runs.
- Preventing Job Starvation: Long-running jobs should not block smaller jobs from starting.
- Maximizing GPU Utilization: Fair allocation of compute power among users.
- Encouraging Testing Before Production Runs: Users can evaluate container images and test jobs in the mig partition before using full resources.
How do I change the TMP directory for imported Enroot layers?
Enroot downloads container layers to the Enroot TMPDIR directory, which is configured by default as "/tmp". The size of "/tmp" is limited and may fill up if multiple imports have been made and not yet cleaned up. If you encounter this scenario, modify the TMPDIR environment variable:
export TMPDIR="/home/
before importing an Enroot image on the login node.
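For illustration only, with a hypothetical directory name (any location with enough free space within your storage quota works):
mkdir -p ~/enroot_tmp          # hypothetical directory, not a cluster convention
export TMPDIR=~/enroot_tmp
enroot import docker://ubuntu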
Why is my SLURM job stuck in the queue?
Check squeue and ensure sufficient resources are available; the pending reason appears in the NODELIST(REASON) column.
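For a fuller picture of a specific pending job (the job ID is a placeholder):
scontrol show job <job_id> | grep -Ei "Reason|QOS|TimeLimit"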
How do I transfer large files?
Use rsync or WinSCP for efficient transfers.
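For example, to sync a local dataset into the shared project directory (paths are placeholders; rsync can resume interrupted transfers):
rsync -avP ./my_dataset/ username@dgx-master.technion.ac.il:~/work/my_dataset/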