- Introduction
- Accessing the Cluster
- Containerized Jobs Workflow
- Job Submission in SLURM
- Job Monitoring and Debugging
- Performance Optimization
- Shared Storage & File Systems
- Best Practices & Troubleshooting
- Frequently Asked Questions
Introduction
The Athena GPU Cluster consists of 9 compute nodes, each equipped with 8 Nvidia A100 GPUs, interconnected via a high-speed InfiniBand fabric. The cluster is optimized for deep learning, AI research, and scientific simulations requiring accelerated computation.
Accessing the Cluster
SSH to Login Node
After successfully registering for HPC services and receiving a confirmation email, connect to the Athena login node using SSH from a Technion network or VPN:
ssh username@dgx-master.technion.ac.il
Login nodes are protected behind the Technion firewall; therefore, you must be connected to the Technion network (or VPN) before SSHing to the cluster.
The login node is the entry point to the Athena Cluster, where users can manage jobs, edit scripts, and transfer files. It is not meant for running computations – jobs must be submitted to SLURM to execute on compute nodes. Running heavy tasks on the login node can disrupt other users. Currently, dgx-master is the main login node for Athena, but in the near future multiple servers may act as the frontend of the cluster.
- Host Name: dgx-master.technion.ac.il
- Port Number: 22
- Username: Your Technion username (same as your email’s)
- Password: Your Technion account password
SSH using MobaXterm Windows client
Windows users should download two programs to access the cluster conveniently.
MobaXterm – An SSH client for remote connections.
WinSCP – An SCP client for remote file transfers.
After installing MobaXterm, open the application and select “Session,” then choose “SSH.” In the session settings, enter “dgx-master.technion.ac.il” under “Remote Host”. Check the “Specify username” box and input your Technion username (same as your email address). Set Port to 22 and proceed to connect.
After clicking OK, an Athena Cluster session will open. You will see a Linux terminal on the right and a file navigator on the left. The terminal allows you to execute commands, while the navigator helps with file transfers and management. However, all computations on Athena must be executed via the terminal using SLURM, as described in the following sections.
WinSCP provides a more convenient way to transfer files between your computer and your Athena home directory. To get started, open WinSCP and create a new session. In the connection settings, enter the same details used for SSH: host name dgx-master.technion.ac.il, port 22, your Technion username, and your Technion password.
Once logged in, you can easily transfer files by dragging and dropping them between your local machine and the cluster.
Shared Storage
When you log in to dgx-master, a bash login shell is opened and awaits your commands. Each Athena user is granted 300GB of storage in their HOME directory (~/) and 2TB in the shared group (project) directory (~/work). Both directories are shared across all compute nodes.
To upload your data to the cluster storage, use scp or WinSCP for secure file transfers.
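For example, a minimal upload of a local directory to the shared project directory might look like this (local and remote paths are placeholders):
scp -r ./my_dataset username@dgx-master.technion.ac.il:~/work/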
Containerized Jobs Workflow
The supported method of submitting jobs to Athena is based on containers.
All computations on the Athena GPU Cluster are executed within Enroot containers, which encapsulate applications, code, and dependencies into a self-contained environment. This ensures compatibility and reproducibility, allowing jobs to run on any compute node without dependency conflicts.
The standard workflow involves:
1. Find & Download a Container Image
A variety of optimized containers are available in the NGC Catalog, including TensorFlow and PyTorch images for AI applications. If no suitable container is found, a minimal Ubuntu image can serve as a base. To download a container, first, retrieve its pull tag from the catalog.
Then, execute the following command on the login node:
enroot import docker://nvcr.io#nvidia/tensorflow:22.02-tf2-py3
To specify the output path and image file name:
enroot import -o my_tensorflow_22.02.sqsh docker://nvcr.io#nvidia/tensorflow:22.02-tf2-py3
For a basic container, such as Ubuntu, use:
enroot import docker://ubuntu
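After an import finishes, the image is written as a .sqsh file in the current directory (or the path given with -o); a quick way to confirm (the file name depends on the image and tag you imported):
ls -lh *.sqsh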
2. Evaluate the Image in an Interactive Session
Once a container is imported, it remains immutable by default. To customize it, such as installing additional packages or modifying configurations, you need to run it interactively, apply the changes, and save the modified version. To open an interactive session within the container and allow modifications, use:
srun --pty -p mig --qos=mig_2H_2G_1J --time=00:30:00 --gpus=1 --container-image="path/to/container.sqsh" --container-save="path/to/save/container.sqsh" /bin/bash
This command:
- Launches an interactive shell inside the container with "--pty".
- Asks the scheduler to assign resources in the "mig" partition, where users can evaluate their images and where a hard limit on the job's running time (walltime) applies.
- The QOS mig_2H_2G_1J applies the specific configuration required to start using the mig partition.
- Requests the resources for half an hour (--time=00:30:00), out of the two-hour limit applied by the QOS.
- With --gpus=1 we ask to allocate a single GPU to the job.
- With --container-image we specify the path of the image we want to use.
- Applies changes in-memory and, once the session ends, saves the modifications to a new .sqsh file specified by the "--container-save" argument.
- Finally, /bin/bash is executed inside the container, starting an interactive bash session.
Suppose you need additional Python packages in your image. Once inside the container shell, install packages as needed:
pip install mypackage
apt update && apt install -y somepackage
As long as you specified a --container-save path, the modified image, including the packages you installed, will be written to that path.
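To verify that the saved image actually contains the new packages, a quick check is to relaunch it on the mig partition and import the package (a sketch; the package name and paths are the placeholders used above):
srun -p mig --qos=mig_2H_2G_1J --time=00:10:00 --gpus=1 --container-image="path/to/save/container.sqsh" python -c "import mypackage"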
3. Running a Container
To submit a job to Athena, you must launch the container in interactive or batch mode using SLURM.
Running a Container in Batch Mode
For production workloads, submit batch jobs rather than interactive sessions. Below is an example batch script:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --ntasks=1
#SBATCH --gpus=4
#SBATCH --cpus-per-task=32
#SBATCH --qos=normal
#SBATCH --time=12:00:00
srun --container-image=/home/user/container.sqsh --container-mounts=/home/user/data:/mnt/data python train.py
Save this script as my_job.sh and submit it using:
sbatch my_job.sh
Key Differences from Interactive Mode:
- Batch mode is non-interactive, meaning jobs run in the background.
- SLURM handles scheduling, ensuring your job executes when resources are available.
- Logs are saved to a slurm-<job_id>.out file instead of being displayed in real time.
Tip: Use squeue -u $USER to check your job status and see if it's running or pending.
Job Submission in SLURM
Selecting the Right QOS for Your Jobs
The Athena Cluster provides multiple Quality of Service (QOS) options to optimize GPU allocation and job scheduling. The QOS you select determines job runtime, GPU limits, and job concurrency.
Each QOS name follows a structured format indicating its constraints:
<Partition>_<WallTime>_<Max GPUs>_<Max Running Jobs>
For example, mig_2H_2G_1J means:
- "mig" partition (for testing/evaluation).
- 2 Hours of maximum runtime (2H).
- 2 GPUs maximum (2G).
- 1 running job at a time (1J).
QOS Options for the mig Partition (Testing & Evaluation)
QOS Name | Max WallTime | Max GPUs | Max Running Jobs | Max Pending Jobs |
---|---|---|---|---|
mig_2H_2G_1J | 2 Hours | 2 GPUs (16 shards) | 1 | 1 |
Tip: Use “mig” QOS if you are testing jobs, debugging scripts, or evaluating container setups before submitting to production.
QOS Options for the work Partition (Main Workloads)
QOS Name | Max WallTime | Max GPUs | Max Running Jobs | Max Pending Jobs |
---|---|---|---|---|
work_1H_1G_4J | 1 Hour | 1 GPU (8 shards) | 4 | 12 |
work_1H_2G_4J | 1 Hour | 2 GPUs (16 shards) | 4 | 12 |
work_24H_8G_4J | 24 Hours | 8 GPUs (64 shards) | 1 | 12 |
work_24H_16G_4J | 24 Hours | 16 GPUs (128 shards) | 1 | 12 |
Tip: Use “work” QOS if you are running production workloads or large training jobs.
How to Choose the Right QOS
- How long will my job run? Short jobs → 1H QOS; long jobs → 24H QOS.
- How many GPUs do I need? Small-scale → 1G or 2G; large-scale → 8G or 16G.
- Do I need multiple jobs running at the same time? Parallel jobs → a QOS with higher job limits (e.g., 4J, 10J).
- Is this a test or production workload? Debugging/testing → mig QOS; actual production jobs → work QOS.
Tip: Specifying --time when submitting a job helps the SLURM backfill scheduler optimize resource usage.
Jobs with shorter walltimes may be scheduled sooner if they fit within available gaps in the queue, allowing SLURM to efficiently utilize idle resources while waiting for larger jobs to start.
To increase the chance of earlier execution, set the shortest time limit necessary for your job.
srun --qos=work_1H_2G_4J --gpus=2 --time=00:30:00 --pty bash
Understanding SLURM Batch Scripts
A SLURM batch script is a Bash script that includes SLURM directives to define resource requests and execution parameters. These directives begin with #SBATCH and guide the job scheduler in allocating the necessary compute resources.
Essential SLURM Directives
Directive | Description | Example |
---|---|---|
--job-name | Assigns a name to the job. | #SBATCH --job-name=my_job |
--output | Specifies an output file for logs. | #SBATCH --output=job_%j.out |
--error | Specifies an error file for logs. | #SBATCH --error=job_%j.err |
--ntasks | Number of tasks (1 for a single-node job). | #SBATCH --ntasks=1 |
--cpus-per-task | Number of CPU cores per task. | #SBATCH --cpus-per-task=16 |
--gpus | Number of GPUs requested. | #SBATCH --gpus=4 |
--qos | Quality of Service (QOS) selection. | #SBATCH --qos=work_24H_8G_4J |
--partition | Defines the partition to use. | #SBATCH --partition=work |
--time | Maximum job runtime. | #SBATCH --time=12:00:00 |
Writing a SLURM Job Script
A sample SLURM batch script for launching a deep learning training job:
#!/bin/bash
#SBATCH --job-name=my_training_job
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus=4
#SBATCH --qos=work_24H_8G_4J
#SBATCH --partition=work
#SBATCH --time=12:00:00

# Load necessary modules
module load cuda/12.2
module load pytorch/2.1

# Run the job inside a container
srun --container-image=/home/user/container.sqsh --container-mounts=/home/user/data:/mnt/data python train.py
Explanation of Key Directives:
- Resources: Requests 4 GPUs and 32 CPU cores for computation.
- QOS & Partition: Uses the work_24H_8G_4J QOS in the work partition.
- Environment Setup: Loads CUDA and PyTorch modules before running the training script.
- Container Execution: Runs inside an Enroot container with external data mounted.
Submitting a Job
Once the batch script is ready, submit the job using:
sbatch my_job_script.sh
After submission, SLURM assigns a Job ID, which can be used to track the job’s status.
Example Output:
Submitted batch job 123456
Monitoring Job Progress
To check the status of your submitted jobs:
1. List all your jobs:
squeue -u $USER
2. View job details:
scontrol show job <job_id>
3. Monitor real-time job logs:
tail -f job_<job_id>.out
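If SLURM accounting is enabled on the cluster (an assumption, not stated in this guide), sacct can also summarize a job after it has left the queue:
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode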
Canceling a Job
If needed, cancel a job using:
scancel <job_id>
Example:
scancel 123456
This immediately stops the job and releases the allocated resources.
Job Monitoring and Debugging
1. Checking Job Status
squeue -u $USER
2. Debugging Failures
Check SLURM output and error logs:
cat slurm-<job_id>.out
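For long logs, a simple pattern match can surface common failure messages quickly (adjust the keywords to your framework; the job ID is a placeholder):
grep -inE "error|traceback|out of memory" slurm-<job_id>.out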
Performance Optimization
1. Optimizing GPU Utilization
Monitor GPU usage in real-time:
watch -n 1 nvidia-smi
2. InfiniBand Tuning
Use efficient communication libraries like NCCL for multi-GPU jobs.
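As an illustrative sketch (the image path and training script are placeholders), NCCL's standard NCCL_DEBUG environment variable can be enabled to confirm in the job log that the InfiniBand fabric is used for inter-GPU communication:
export NCCL_DEBUG=INFO   # print NCCL initialization and transport details to the job log
srun --gpus=8 --container-image=/home/user/container.sqsh torchrun --nproc_per_node=8 train.py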
Shared Storage & File Systems
Athena provides two main storage locations:
- HOME: Persistent storage for user files.
- WORK: High-speed scratch space for temporary jobs.
Best Practices & Troubleshooting
1. Avoid Common SLURM Errors
Always check resource requests to avoid jobs pending indefinitely.
2. Handling File System Quotas
du -sh ~
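The same check applies to the shared project directory, whose 2TB quota is described in the Shared Storage section above:
du -sh ~/work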
Frequently Asked Questions
docker pull vs. enroot import – how do I convert the command?
Most image repositories will advise you to copy and paste a pull command in the Docker (Hub) form:
docker pull nvcr.io/nvidia/tensorflow:22.02-tf2-py3
but to download the image with Enroot, as a squash file on our filesystem, we need to convert the "pull" command into an "import" command as follows:
- Replace the "/" character after the registry host ("nvcr.io") with the "#" character.
- Add "docker://" before "nvcr.io".
- Replace "docker pull" with "enroot import".
enroot import docker://nvcr.io#nvidia/tensorflow:22.02-tf2-py3
Why Does Athena Have Granular QOS Configurations?
- Efficient Resource Allocation: Different jobs have different needs, from quick debugging to long training runs.
- Preventing Job Starvation: Long-running jobs should not block smaller jobs from starting.
- Maximizing GPU Utilization: Fair allocation of compute power among users.
- Encouraging Testing Before Production Runs: Users can evaluate container images and test jobs in the mig partition before using full resources.
How do I change the TMP directory for imported Enroot layers?
Enroot downloads container layers to the Enroot TMPDIR directory, which is configured by default as "/tmp". The size of "/tmp" is limited and may fill up if multiple imports have been made and not yet cleaned up. If you encounter this scenario, modify the TMPDIR environment variable:
export TMPDIR="/home/
before importing an Enroot image on the login node.
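For illustration only, with a hypothetical directory name (any location with enough free space within your storage quota works):
mkdir -p ~/enroot_tmp          # hypothetical directory, not a cluster convention
export TMPDIR=~/enroot_tmp
enroot import docker://ubuntu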
Why is my SLURM job stuck in the queue?
Check squeue and ensure sufficient resources are available; the pending reason appears in the NODELIST(REASON) column.
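For a fuller picture of a specific pending job (the job ID is a placeholder):
scontrol show job <job_id> | grep -Ei "Reason|QOS|TimeLimit"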
How do I transfer large files?
Use rsync or WinSCP for efficient transfers.
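For example, to sync a local dataset into the shared project directory (paths are placeholders; rsync can resume interrupted transfers):
rsync -avP ./my_dataset/ username@dgx-master.technion.ac.il:~/work/my_dataset/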