Division of Computing and Information Systems


Athena GPU Cluster User Guide

Athena (GPU) Cluster – SLURM QuickStart

Athena is a shared Technion GPU cluster managed by SLURM. You request resources (GPUs, CPUs, memory, time),
and SLURM decides when and where your job runs based on your chosen partition, QoS, and runtime estimate.

Most important practical consequence:
shorter jobs requesting fewer resources tend to start earlier.
Always provide a realistic --time estimate.

On this page

  • Access & first login
  • Submission workflow
  • Step 1: Choose a partition
  • Check availability (tsinfo / tsqr)
  • Step 2: Choose a QoS
  • Step 3: Submit your workload (srun / sbatch)
  • Containers & Apptainer
  • Storage & quotas
  • OnDemand (Odin)
  • Community forum (Teams)
  • FAQ

Access & first login

1) New user registration (one-time)

All new users must register through:

https://reg-hpc.technion.ac.il/

  1. Under Affiliations, select the appropriate affiliation.
  2. Under Budget Owner, add <Your Lab PI>.

2) SSH to Athena

ssh <USER>@athena.technion.ac.il

Authenticate with your Technion password.

Optional: passwordless login
# On your local machine
ssh-keygen -t ed25519

# Copy the public key to Athena
ssh-copy-id <USER>@athena.technion.ac.il
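
Optionally, add a host alias to ~/.ssh/config so you don't have to type the full hostname and username on every login (the alias name `athena` and the key path are illustrative; adjust to your setup):

```
Host athena
    HostName athena.technion.ac.il
    User <USER>
    IdentityFile ~/.ssh/id_ed25519
```

After this, `ssh athena` is enough.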

Submission workflow

A SLURM submission is typically defined by:

  • Partition: which pool of nodes your workload may run on
  • QoS: limits + priority within that partition
  • Resources: GPUs / CPUs / memory / time

Workflow: pick a partition → pick an allowed QoS → request resources → run srun (interactive) or sbatch (batch).


Step 1: Choose a partition

What is a partition?

In Athena (and SLURM in general), a partition is a logical group of compute nodes that share the same
hardware type, access policy, and scheduling rules. You can think of a partition as a queue with specific machines behind it.

Partition types (Public / Shared / Contributor)

  • Public partitions (*-public) – serve all authorized researchers, contributors and non-contributors alike. Stable: jobs are not preempted; typically stricter limits on time/GPUs.
  • Shared partitions (*-shared) – combine Technion-owned resources with contributor resources. Jobs can be preempted at any time; use checkpointing and auto-resume.
  • Contributor partitions (<name>) – dedicated to a contributor lab’s resources. Preferred access for that contributor; can preempt public users on shared resources (where applicable).

Shared partitions: assume preemption. Enable checkpointing and resume automatically from the last checkpoint.

Examples of partition names

  • a100-public – public partition backed by A100 GPU nodes
  • l40s-shared – L40S GPUs in a shared partition
  • h200-shared – H200 GPUs in a shared partition
  • l40s-benisty – contributor partition (dedicated to a specific lab)

Check availability (tsinfo / tsqr)

Overall availability: interpret tsinfo output
tsinfo
tsinfo <partition_name>

CPU notation

SLURM reports CPU state as:

  • A – Allocated (in use)
  • I – Idle
  • T – Total
CPUS(A/I/T): 36 / 220 / 256

This means 36 CPUs are currently in use and 220 are idle out of 256 total.

GPU notation

  • gpu:nvidia_a100:2 (IDX:0,2) → GPUs 0 and 2 are allocated
  • :0 → no GPUs in use
Queue status
tsqr running
tsqr pending
tsqr NODELIST
tsqr -u <username>

Step 2: Choose a QoS

QoS (Quality of Service) controls: max runtime, max GPUs, scheduling priority, and preemption behavior.

Inspect QoS options
tsqos
tsqos <qos_name>

Example QoS tiers

QoS      Max runtime  Max GPUs  Priority   Typical use
2h_2g    2 hours      2         High       Short development workloads
4h_0g    4 hours      0         High       Short CPU-only workloads
12h_4g   12 hours     4         Medium     Default shared usage
24h_4g   24 hours     4         Lower      Long public workloads
contrib  7 days       No limit  Very high  Contributor workloads (group/server)

Key rules to internalize

  • Higher priority QoS runs first.
  • Shorter runtime = higher priority. Estimate runtime as closely as possible.
  • contrib can preempt shared workloads; preempted shared workloads get a 10-minute grace period before being re-queued.
  • TRESPU limits the total resources per user.
  • MAXJOBSPU limits the number of concurrent jobs per user.
Resource allocation rule of thumb

Request CPU and memory proportional to GPU usage.

  • Node has 8 GPUs → requesting 1 GPU ≈ request ~1/8 of the node’s CPUs and RAM.
  • Adjust based on observed GPU utilization and your workload’s CPU/RAM needs.
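
As a sketch of the arithmetic, using hypothetical node specs (check the real numbers with tsinfo <partition_name>):

```shell
# Hypothetical node: 8 GPUs, 128 CPUs, 1024 GB RAM (verify with tsinfo).
NODE_GPUS=8; NODE_CPUS=128; NODE_MEM_G=1024
GPUS_REQ=2                                    # GPUs you intend to request

# Proportional share: request ~GPUS_REQ/NODE_GPUS of the CPUs and RAM.
printf -- '--cpus-per-task=%d\n' $(( NODE_CPUS * GPUS_REQ / NODE_GPUS ))   # 32
printf -- '--mem=%dG\n' $(( NODE_MEM_G * GPUS_REQ / NODE_GPUS ))           # 256G
```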

Step 3: Submit your workload (srun / sbatch)

Guideline: use srun for interactive development and short experiments;
use sbatch for long-running or production workloads.

Interactive jobs (development only)

Interactive jobs are intended for debugging, environment setup, and short experiments.
They should not be used for long-running workloads.

Example: interactive GPU shell (Apptainer)
export APPTAINER_BIND=$PRJ_WORKSPACE,$HOME

srun --partition=h200-shared \
     --qos=2h_2g \
     --time=1:30:00 \
     --gres=gpu:1 \
     --ntasks=1 \
     --cpus-per-task=16 \
     --mem=40G \
     --pty \
     --container="/apps/apptainer/images/pytorch/2.7/torch2.7-tf2.20-cu128.sif" \
     /bin/bash

Batch jobs (recommended for long workloads)

Example: batch submission script (sbatch)
#!/bin/bash
#
# SBATCH Submission Script: Example GPU Workload
# Run with: sbatch example.sbatch
#

# --- Job Configuration ---
#SBATCH --job-name=gpu_workload
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1

# --- Resource Allocation ---
#SBATCH --partition=h200-dds
#SBATCH --qos=contrib
#SBATCH --gres=gpu:nvidia_h200:1
#SBATCH --cpus-per-gpu=12
#SBATCH --time=12:00:00

# --- Container ---
#SBATCH --container=/apps/apptainer/images/pytorch/2.7/torch2.7-tf2.20-cu128.sif

set -e

echo "================================================================"
echo "Timestamp            : $(date)"
echo "Host                 : $(hostname)"
echo "SLURM Job ID         : $SLURM_JOB_ID"
echo "Apptainer Container  : $APPTAINER_CONTAINER"
echo "APPTAINER_BIND       : $APPTAINER_BIND"
echo "GPU(s) Allocated     : $CUDA_VISIBLE_DEVICES"
echo "================================================================"

nvidia-smi

# Your command here:
# cd /home/<user>/work/YourProject/
# python run.py

To run on a public partition, choose a *-public partition and an allowed QoS (e.g., 12h_4g), and verify availability with tsinfo.
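
For comparison, a minimal public-partition header might look like this (a sketch: a100-public and 12h_4g are the example names from the tables above, so confirm with tsinfo and tsqos that they are available to your account):

```shell
#!/bin/bash
#SBATCH --job-name=public_workload
#SBATCH --partition=a100-public   # a *-public partition: stable, not preempted
#SBATCH --qos=12h_4g              # an allowed public QoS tier
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=12
#SBATCH --time=08:00:00           # realistic estimate: shorter jobs start sooner
#SBATCH --output=slurm-%j.out

nvidia-smi
```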

Monitor and manage jobs
# Show my jobs (formatted)
tsqr -u $USER

# Job details
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

Containers & Apptainer

Athena uses Apptainer (not Docker).

Default mounts (when using --container)

  • Current working directory
  • /scratch

Additional mounts (recommended)

export APPTAINER_BIND=$PRJ_WORKSPACE,$HOME

Think of APPTAINER_BIND as the list of host paths mounted inside the container. The PRJ_WORKSPACE variable holds the path of your “work” directory.

Installing packages in a read-only container (persistent overlays)

New to Persistent Overlays?
A Persistent Overlay is a writable layer that sits on top of a read-only .sif container image,
letting you install packages and save changes across runs – without modifying the shared base image.

Read more about Apptainer Persistent Overlays →

Best practice: use $TMPDIR (local NVMe storage; faster than network storage).

cd $TMPDIR
cp /apps/apptainer/images/pytorch/2.7/torch2.7-tf2.20-cu128.sif .

# Create overlay (size in MB)
apptainer overlay create --fakeroot --sparse --size 102400 torch2.7-tf2.20-cu128.sif

# Inspect partition layers
apptainer sif list torch2.7-tf2.20-cu128.sif

# Delete overlay if needed (example: delete partition index 5)
apptainer sif del 5 torch2.7-tf2.20-cu128.sif

# Enter writable container
apptainer exec \
  --writable \
  --fakeroot \
  --userns \
  --nvccli \
  torch2.7-tf2.20-cu128.sif \
  /bin/bash

Storage & quotas

Type             Path               Quota                                          How to check
Private storage  /home/<USER>       300 GB                                         quota -s
Shared storage   /home/<USER>/work  2 TB (unless additional storage is purchased)  quota-g
About the Shared storage path

/home/<USER>/work = $PRJ_WORKSPACE = /rg/<group_prj>/<USER>
This directory resides inside your group’s shared project space (group_prj)
and is accessible to all members of the same group. Be mindful of what you store there.

Need more shared storage?

Additional shared storage quota can be purchased by your group’s Research PI through the official
Technion CIS Store.

Data transfer (scp / rsync)
# Copy a file to Athena
scp myfile.txt <USER>@athena.technion.ac.il:~/

# Sync a folder to Athena (recommended for large folders)
rsync -avP /local/data/ <USER>@athena.technion.ac.il:/path/on/athena/
Performance note
  • $TMPDIR is local NVMe (fast).
  • Much faster than network-mounted storage.
  • Ideal for temporary data, container modification, and intermediate outputs.

No backup and limited space. Do not use Athena for long-term storage.


OnDemand (Odin)

Odin (odin.technion.ac.il)
is Athena’s browser-based interface powered by OpenOnDemand — no SSH or local software required.
Users can launch interactive applications and submit jobs directly from the browser.
  • JupyterLab – notebook environment for PyTorch and TensorFlow workloads on Athena’s GPU nodes
  • RELION – Cryo-EM structure determination with a guided graphical interface
  • MATLAB – full MATLAB desktop environment running on the cluster
Login: use the local part of your Technion email address as the username
(e.g. jsmith from [email protected]), and your regular Technion account password.

Community forum (Teams)

Athena GPU Microsoft Teams channel:
Athena (GPU) Cluster


FAQ

Job Submission & Scheduling

My job is pending — what should I check?

Verify your --partition and --qos are correct and compatible. Request fewer resources where possible. Set a shorter and more realistic --time. Check cluster availability with tsinfo. Use tsqr to inspect the queue — the output includes the reason your job is pending (e.g. Resources, Priority, QOSMaxJobsPerUserLimit).

How do I know which QoS is allowed for a given partition?

QoS availability is determined by your account’s association tree. Run the following to see which QoS tiers are associated with your account:

tsassoc | grep $USER
Can I run multiple jobs simultaneously?

Yes, subject to your QoS limits. MAXJOBSPU in tsqos output shows the maximum concurrent jobs per QoS. If you have reached the limit under one QoS, submitting under a different QoS has its own independent counter — allowing you to effectively run more jobs in parallel.

How do I estimate how long my job will take?

Run a short test first with a small subset of your data. Use that runtime to extrapolate. Always add a small buffer — but avoid over-estimating, as shorter --time values receive higher scheduling priority.
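
For example, a back-of-the-envelope extrapolation (the numbers are hypothetical):

```shell
# A test run on 5% of the data took 18 minutes; add a 15% safety buffer.
TEST_MIN=18; TEST_PCT=5; BUFFER_PCT=15
FULL_MIN=$(( TEST_MIN * 100 / TEST_PCT ))               # projected full runtime
TOTAL_MIN=$(( FULL_MIN + FULL_MIN * BUFFER_PCT / 100 )) # with buffer: 414 min
printf -- '--time=%02d:%02d:00\n' $(( TOTAL_MIN / 60 )) $(( TOTAL_MIN % 60 ))
# prints --time=06:54:00
```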

What happens if my job exceeds its requested time limit?

SLURM kills it immediately with no grace period. Always request enough time, and use checkpointing for long workloads so progress is not lost.

Preemption & Resilience

How do I ensure my shared-partition workload survives preemption?

Enable checkpointing in your training code and save progress periodically. Shared workloads receive a 10-minute grace period before being re-queued — use it to save a final checkpoint on signal.

How do I detect incoming preemption before the grace period ends?

SLURM sends a SIGTERM signal to your job at the start of the grace period. Trap this signal in your script to trigger an immediate checkpoint save, giving you the full 10 minutes to write it cleanly:

trap 'echo "Preemption signal received — saving checkpoint"; python save_checkpoint.py; exit 0' SIGTERM
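
The same pattern as a self-contained script you can test locally (the checkpoint file name is illustrative; here the script sends the signal to itself to simulate preemption):

```shell
#!/bin/bash
# Trap TERM (what SLURM sends at the start of the grace period),
# save a checkpoint, then let the main loop exit cleanly.
PREEMPTED=0
save_checkpoint() {
    echo "checkpoint saved at $(date)" > checkpoint.txt
    PREEMPTED=1
}
trap save_checkpoint TERM

( sleep 1; kill -TERM $$ ) &       # simulate preemption after 1 second
while [ "$PREEMPTED" -eq 0 ]; do   # stand-in for the training loop
    sleep 0.2
done
echo "exiting cleanly after checkpoint"
```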
My job was preempted — will it restart automatically?

It will be re-queued automatically, but it will restart from the beginning unless your code resumes from a checkpoint. Implementing checkpoint/resume logic is strongly recommended for any workload running on shared partitions.

Resources

How do I know how much memory or how many CPUs to request?

A good starting rule: if a node has 8 GPUs, requesting 1 GPU entitles you to roughly 1/8 of its CPUs and RAM. Run tsinfo <partition_name> to see total node resources, then divide proportionally by the number of GPUs you request. Adjust based on your workload’s observed utilization.

How do I check my current resource usage and limits?

Run scontrol show job <job_id> for the full resource allocation of a running job, including actual CPU, memory, and GPU assignments. For your QoS-level limits (max GPUs, max runtime, max concurrent jobs), run tsqos.

Containers & Storage

How do I confirm I have GPU access inside my container?

Run nvidia-smi inside the container. You can also verify PyTorch sees the GPU:

python -c "import torch; print(torch.cuda.get_device_name(0))"
Can I use Docker images on Athena?

Docker is not available on Athena. Two alternatives exist. Apptainer, which can pull and convert Docker images directly, is the recommended approach:

apptainer pull docker://<image>

Podman is also available but functionally limited: its image storage defaults to $TMPDIR, a temporary per-job directory that is wiped when the job ends.

How do I install Python packages inside a container?

Use a persistent overlay — it adds a writable layer on top of the read-only container image, preserving installed packages across runs. See the overlay instructions in the Containers section above.

I can’t find my files — what should I check?

Check your APPTAINER_BIND variable. Your $HOME and $PRJ_WORKSPACE directories must be explicitly bound before the container can see them. Set the following before running srun or include it in your sbatch script:

export APPTAINER_BIND=$PRJ_WORKSPACE,$HOME
What is $TMPDIR and when should I use it?

$TMPDIR is a per-job local NVMe directory on the compute node — significantly faster than network-mounted storage for intensive I/O. Use it for temporary files, container overlays, and intermediate outputs. Its contents are deleted when the job ends. Note that Athena’s current shared storage is NFS-based, which handles general file access well but is not optimized for the parallel I/O demands of HPC workloads — local NVMe is the practical workaround for I/O-heavy jobs for the time being.
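
The resulting staging pattern looks like this (a sketch; mktemp directories stand in for the real $PRJ_WORKSPACE and $TMPDIR paths on Athena):

```shell
# Stage input to fast local disk, compute there, copy results back.
WORKSPACE=$(mktemp -d)   # stands in for $PRJ_WORKSPACE (network storage)
SCRATCH=$(mktemp -d)     # stands in for $TMPDIR (per-job local NVMe)

mkdir -p "$WORKSPACE/dataset"
echo "sample record" > "$WORKSPACE/dataset/input.txt"

cp -r "$WORKSPACE/dataset" "$SCRATCH/"    # 1) stage input onto local NVMe
mkdir -p "$SCRATCH/results"               # 2) compute against the local copy
tr '[:lower:]' '[:upper:]' < "$SCRATCH/dataset/input.txt" > "$SCRATCH/results/output.txt"
cp -r "$SCRATCH/results" "$WORKSPACE/"    # 3) copy outputs back before the job ends

cat "$WORKSPACE/results/output.txt"       # prints: SAMPLE RECORD
```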

My job wrote output files but I can’t find them — where do they go?

By default, SLURM writes output to the directory from which you submitted the job, named slurm-<job_id>.out and slurm-<job_id>.err. Use #SBATCH --output and #SBATCH --error in your script to set explicit paths.

Odin

Can I run long workloads from a JupyterLab session on Odin?

Yes — Odin is actually the preferred place for interactive long-running sessions. Unlike direct srun sessions on the command line, Odin manages the session lifecycle for you, making it more suitable for extended interactive work. For fully unattended workloads, sbatch remains the right choice.

The Technion thanks Omer Shubi (DDS) and Daniel Zur (Med) for authoring the original QuickStart on which this page is based.

© All rights reserved to Division of Computing and Information Systems