Division of Computing and Information Systems


Athena GPU Cluster User Guide

Athena (GPU) Cluster – SLURM QuickStart

Athena is a shared Technion GPU cluster managed by SLURM. You request resources (GPUs, CPUs, memory, time),
and SLURM decides when and where your job runs based on your chosen partition, QoS, and runtime estimate.

Most important practical consequence:
shorter jobs requesting fewer resources tend to start earlier.
Always provide a realistic --time estimate.

On this page

  • Access & first login
  • Submission workflow
  • Step 1: Choose a partition
  • Check availability (tsinfo / tsqr)
  • Step 2: Choose a QoS
  • Step 3: Submit your workload (srun / sbatch)
  • Containers & Apptainer
  • Storage & quotas
  • OnDemand (Odin)
  • Community forum (Teams)
  • FAQ

Access & first login

1) New user registration (one-time)

All new users must register through:

https://reg-hpc.technion.ac.il/

  1. Under Affiliations, select the appropriate affiliation.
  2. Under Budget Owner, add <Your Lab PI>.

2) SSH to Athena

ssh <USER>@athena.technion.ac.il

Authenticate with your Technion password.

Optional: passwordless login
# On your local machine
ssh-keygen -t ed25519

# Copy the public key to Athena
ssh-copy-id <USER>@athena.technion.ac.il
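
Optionally, add a host alias to ~/.ssh/config so you don't have to type the full hostname and username on every login (the alias name `athena` and the key path are illustrative; adjust to your setup):

```
Host athena
    HostName athena.technion.ac.il
    User <USER>
    IdentityFile ~/.ssh/id_ed25519
```

After this, `ssh athena` is enough.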

Submission workflow

A SLURM submission is typically defined by:

  • Partition: which pool of nodes your workload may run on
  • QoS: limits + priority within that partition
  • Resources: GPUs / CPUs / memory / time

Workflow: pick a partition → pick an allowed QoS → request resources → run srun (interactive) or sbatch (batch).


Step 1: Choose a partition

What is a partition?

In Athena (and SLURM in general), a partition is a logical group of compute nodes that share the same
hardware type, access policy, and scheduling rules. You can think of a partition as a queue with specific machines behind it.

Partition types (Public / Shared / Contributor)

  • Public partitions (*-public) – serve all authorized researchers, contributors and non-contributors alike. Stable: jobs are not preempted; typically stricter limits on time/GPUs.
  • Shared partitions (*-shared) – combine Technion-owned resources with contributor resources. Jobs can be preempted at any time; use checkpointing and auto-resume.
  • Contributor partitions (<name>) – dedicated to a contributor lab’s resources. Preferred access for that contributor; can preempt public users on shared resources (where applicable).

Shared partitions: assume preemption. Enable checkpointing and resume automatically from the last checkpoint.

Examples of partition names

  • a100-public – public partition backed by A100 GPU nodes
  • l40s-shared – L40S GPUs in a shared partition
  • h200-shared – H200 GPUs in a shared partition
  • l40s-benisty – contributor partition (dedicated to a specific lab)

Check availability (tsinfo / tsqr)

Overall availability: interpret tsinfo output
tsinfo
tsinfo <partition_name>

CPU notation

SLURM reports CPU state as:

  • A – Allocated (in use)
  • I – Idle
  • T – Total
CPUS(A/I/T): 36 / 220 / 256

This means 36 CPUs are currently in use and 220 are idle out of 256 total.

GPU notation

  • gpu:nvidia_a100:2 (IDX:0,2) → GPUs 0 and 2 are allocated
  • :0 → no GPUs in use
Queue status
tsqr running
tsqr pending
tsqr NODELIST
tsqr -u <username>

Step 2: Choose a QoS

QoS (Quality of Service) controls: max runtime, max GPUs, scheduling priority, and preemption behavior.

Inspect QoS options
tsqos
tsqos <qos_name>

Example QoS tiers

QoS      Max runtime  Max GPUs  Priority   Typical use
2h_2g    2 hours      2         High       Short development workloads
4h_0g    4 hours      0         High       Short CPU-only workloads
12h_4g   12 hours     4         Medium     Default shared usage
24h_4g   24 hours     4         Lower      Long public workloads
contrib  7 days       No limit  Very high  Contributor workloads (group/server)

Key rules to internalize

  • Higher priority QoS runs first.
  • Shorter runtime = higher priority. Estimate runtime as closely as possible.
  • contrib can preempt shared workloads; preempted shared workloads get a 10-minute grace period before being re-queued.
  • TRESPU limits the total resources per user.
  • MAXJOBSPU limits the number of concurrent jobs per user.
Resource allocation rule of thumb

Request CPU and memory proportional to GPU usage.

  • Node has 8 GPUs → requesting 1 GPU ≈ request ~1/8 of the node’s CPUs and RAM.
  • Adjust based on observed GPU utilization and your workload’s CPU/RAM needs.
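
As a sketch of the arithmetic, using hypothetical node specs (check the real numbers with tsinfo <partition_name>):

```shell
# Hypothetical node: 8 GPUs, 128 CPUs, 1024 GB RAM (verify with tsinfo).
NODE_GPUS=8; NODE_CPUS=128; NODE_MEM_G=1024
GPUS_REQ=2                                    # GPUs you intend to request

# Proportional share: request ~GPUS_REQ/NODE_GPUS of the CPUs and RAM.
printf -- '--cpus-per-task=%d\n' $(( NODE_CPUS * GPUS_REQ / NODE_GPUS ))   # 32
printf -- '--mem=%dG\n' $(( NODE_MEM_G * GPUS_REQ / NODE_GPUS ))           # 256G
```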

Step 3: Submit your workload (srun / sbatch)

Guideline: use srun for interactive development and short experiments;
use sbatch for long-running or production workloads.

Interactive jobs (development only)

Interactive jobs are intended for debugging, environment setup, and short experiments.
They should not be used for long-running workloads.

Example: interactive GPU shell (Apptainer)
export APPTAINER_BIND=$PRJ_WORKSPACE,$HOME

srun --partition=h200-shared \
     --qos=2h_2g \
     --time=1:30:00 \
     --gres=gpu:1 \
     --ntasks=1 \
     --cpus-per-task=16 \
     --mem=40G \
     --pty \
     --container="/apps/apptainer/images/pytorch/2.7/torch2.7-tf2.20-cu128.sif" \
     /bin/bash

Batch jobs (recommended for long workloads)

Example: batch submission script (sbatch)
#!/bin/bash
#
# SBATCH Submission Script: Example GPU Workload
# Run with: sbatch example.sbatch
#

# --- Job Configuration ---
#SBATCH --job-name=gpu_workload
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1

# --- Resource Allocation ---
#SBATCH --partition=h200-dds
#SBATCH --qos=contrib
#SBATCH --gres=gpu:nvidia_h200:1
#SBATCH --cpus-per-gpu=12
#SBATCH --time=12:00:00

# --- Container ---
#SBATCH --container=/apps/apptainer/images/pytorch/2.7/torch2.7-tf2.20-cu128.sif

set -e

echo "================================================================"
echo "Timestamp            : $(date)"
echo "Host                 : $(hostname)"
echo "SLURM Job ID         : $SLURM_JOB_ID"
echo "Apptainer Container  : $APPTAINER_CONTAINER"
echo "APPTAINER_BIND       : $APPTAINER_BIND"
echo "GPU(s) Allocated     : $CUDA_VISIBLE_DEVICES"
echo "================================================================"

nvidia-smi

# Your command here:
# cd /home/<user>/work/YourProject/
# python run.py

To run on a public partition, choose a *-public partition and an allowed QoS (e.g., 12h_4g), and verify availability with tsinfo.
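
For comparison, a minimal public-partition header might look like this (a sketch: a100-public and 12h_4g are the example names from the tables above, so confirm with tsinfo and tsqos that they are available to your account):

```shell
#!/bin/bash
#SBATCH --job-name=public_workload
#SBATCH --partition=a100-public   # a *-public partition: stable, not preempted
#SBATCH --qos=12h_4g              # an allowed public QoS tier
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=12
#SBATCH --time=08:00:00           # realistic estimate: shorter jobs start sooner
#SBATCH --output=slurm-%j.out

nvidia-smi
```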

Monitor and manage jobs
# Show my jobs (formatted)
tsqr -u $USER

# Job details
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

Containers & Apptainer

Athena uses Apptainer (not Docker).

Default mounts (when using --container)

  • Current working directory
  • /scratch

Additional mounts (recommended)

export APPTAINER_BIND=$PRJ_WORKSPACE,$HOME

Think of APPTAINER_BIND as the list of host paths mounted inside the container. The PRJ_WORKSPACE variable holds the path of your “work” directory.

Installing packages in a read-only container (persistent overlays)

New to Persistent Overlays?
A Persistent Overlay is a writable layer that sits on top of a read-only .sif container image,
letting you install packages and save changes across runs – without modifying the shared base image.

Read more about Apptainer Persistent Overlays →

Best practice: use $TMPDIR (local NVMe storage; faster than network storage).

cd $TMPDIR
cp /apps/apptainer/images/pytorch/2.7/torch2.7-tf2.20-cu128.sif .

# Create overlay (size in MB)
apptainer overlay create --fakeroot --sparse --size 102400 torch2.7-tf2.20-cu128.sif

# Inspect partition layers
apptainer sif list torch2.7-tf2.20-cu128.sif

# Delete overlay if needed (example: delete partition index 5)
apptainer sif del 5 torch2.7-tf2.20-cu128.sif

# Enter writable container
apptainer exec \
  --writable \
  --fakeroot \
  --userns \
  --nvccli \
  torch2.7-tf2.20-cu128.sif \
  /bin/bash

Storage & quotas

Type             Path               Quota                                          How to check
Private storage  /home/<USER>       300 GB                                         quota -s
Shared storage   /home/<USER>/work  2 TB (unless additional storage is purchased)  quota-g
About the Shared storage path

/home/<USER>/work = $PRJ_WORKSPACE = /rg/<group_prj>/<USER>
This directory resides inside your group’s shared project space (group_prj)
and is accessible to all members of the same group. Be mindful of what you store there.

Need more shared storage?

Additional shared storage quota can be purchased by your group’s Research PI through the official
Technion CIS Store.

Data transfer (scp / rsync)
# Copy a file to Athena
scp myfile.txt <USER>@athena.technion.ac.il:~/

# Sync a folder to Athena (recommended for large folders)
rsync -avP /local/data/ <USER>@athena.technion.ac.il:/path/on/athena/
Performance note
  • $TMPDIR is local NVMe (fast).
  • Much faster than network-mounted storage.
  • Ideal for temporary data, container modification, and intermediate outputs.

No backup and limited space. Do not use Athena for long-term storage.


OnDemand (Odin)

Odin (odin.technion.ac.il)
is Athena’s browser-based interface powered by OpenOnDemand — no SSH or local software required.
Users can launch interactive applications and submit jobs directly from the browser.
  • JupyterLab – notebook environment for PyTorch and TensorFlow workloads on Athena’s GPU nodes
  • RELION – Cryo-EM structure determination with a guided graphical interface
  • MATLAB – full MATLAB desktop environment running on the cluster
Login: use the local part of your Technion email address as the username
(e.g. jsmith from [email protected]), and your regular Technion account password.

Community forum (Teams)

Athena GPU Microsoft Teams channel:
Athena (GPU) Cluster


FAQ

Job Submission & Scheduling

My job is pending — what should I check?

Verify your --partition and --qos are correct and compatible. Request fewer resources where possible. Set a shorter and more realistic --time. Check cluster availability with tsinfo. Use tsqr to inspect the queue — the output includes the reason your job is pending (e.g. Resources, Priority, QOSMaxJobsPerUserLimit).

How do I know which QoS is allowed for a given partition?

QoS availability is determined by your account’s association tree. Run the following to see which QoS tiers are associated with your account:

tsassoc | grep $USER
Can I run multiple jobs simultaneously?

Yes, subject to your QoS limits. MAXJOBSPU in tsqos output shows the maximum concurrent jobs per QoS. If you have reached the limit under one QoS, submitting under a different QoS has its own independent counter — allowing you to effectively run more jobs in parallel.

How do I estimate how long my job will take?

Run a short test first with a small subset of your data. Use that runtime to extrapolate. Always add a small buffer — but avoid over-estimating, as shorter --time values receive higher scheduling priority.
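
For example, a back-of-the-envelope extrapolation (the numbers are hypothetical):

```shell
# A test run on 5% of the data took 18 minutes; add a 15% safety buffer.
TEST_MIN=18; TEST_PCT=5; BUFFER_PCT=15
FULL_MIN=$(( TEST_MIN * 100 / TEST_PCT ))               # projected full runtime
TOTAL_MIN=$(( FULL_MIN + FULL_MIN * BUFFER_PCT / 100 )) # with buffer: 414 min
printf -- '--time=%02d:%02d:00\n' $(( TOTAL_MIN / 60 )) $(( TOTAL_MIN % 60 ))
# prints --time=06:54:00
```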

What happens if my job exceeds its requested time limit?

SLURM kills it immediately with no grace period. Always request enough time, and use checkpointing for long workloads so progress is not lost.

Preemption & Resilience

How do I ensure my shared-partition workload survives preemption?

Enable checkpointing in your training code and save progress periodically. Shared workloads receive a 10-minute grace period before being re-queued — use it to save a final checkpoint on signal.

How do I detect incoming preemption before the grace period ends?

SLURM sends a SIGTERM signal to your job at the start of the grace period. Trap this signal in your script to trigger an immediate checkpoint save, giving you the full 10 minutes to write it cleanly:

trap 'echo "Preemption signal received — saving checkpoint"; python save_checkpoint.py; exit 0' SIGTERM
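
The same pattern as a self-contained script you can test locally (the checkpoint file name is illustrative; here the script sends the signal to itself to simulate preemption):

```shell
#!/bin/bash
# Trap TERM (what SLURM sends at the start of the grace period),
# save a checkpoint, then let the main loop exit cleanly.
PREEMPTED=0
save_checkpoint() {
    echo "checkpoint saved at $(date)" > checkpoint.txt
    PREEMPTED=1
}
trap save_checkpoint TERM

( sleep 1; kill -TERM $$ ) &       # simulate preemption after 1 second
while [ "$PREEMPTED" -eq 0 ]; do   # stand-in for the training loop
    sleep 0.2
done
echo "exiting cleanly after checkpoint"
```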
My job was preempted — will it restart automatically?

It will be re-queued automatically, but it will restart from the beginning unless your code resumes from a checkpoint. Implementing checkpoint/resume logic is strongly recommended for any workload running on shared partitions.

Resources

How do I know how much memory or how many CPUs to request?

A good starting rule: if a node has 8 GPUs, requesting 1 GPU entitles you to roughly 1/8 of its CPUs and RAM. Run tsinfo <partition_name> to see total node resources, then divide proportionally by the number of GPUs you request. Adjust based on your workload’s observed utilization.

How do I check my current resource usage and limits?

Run scontrol show job <job_id> for the full resource allocation of a running job, including actual CPU, memory, and GPU assignments. For your QoS-level limits (max GPUs, max runtime, max concurrent jobs), run tsqos.

Containers & Storage

How do I confirm I have GPU access inside my container?

Run nvidia-smi inside the container. You can also verify PyTorch sees the GPU:

python -c "import torch; print(torch.cuda.get_device_name(0))"
Can I use Docker images on Athena?

Docker is not available on Athena. Two alternatives exist. Apptainer, which can pull and convert Docker images directly, is the recommended approach:

apptainer pull docker://<image>

Podman is also available but functionally limited: its image storage defaults to $TMPDIR, a temporary per-job directory that is wiped when the job ends.

How do I install Python packages inside a container?

Use a persistent overlay — it adds a writable layer on top of the read-only container image, preserving installed packages across runs. See the overlay instructions in the Containers section above.

I can’t find my files — what should I check?

Check your APPTAINER_BIND variable. Your $HOME and $PRJ_WORKSPACE directories must be explicitly bound before the container can see them. Set the following before running srun or include it in your sbatch script:

export APPTAINER_BIND=$PRJ_WORKSPACE,$HOME
What is $TMPDIR and when should I use it?

$TMPDIR is a per-job local NVMe directory on the compute node — significantly faster than network-mounted storage for intensive I/O. Use it for temporary files, container overlays, and intermediate outputs. Its contents are deleted when the job ends. Note that Athena’s current shared storage is NFS-based, which handles general file access well but is not optimized for the parallel I/O demands of HPC workloads — local NVMe is the practical workaround for I/O-heavy jobs for the time being.
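
The resulting staging pattern looks like this (a sketch; mktemp directories stand in for the real $PRJ_WORKSPACE and $TMPDIR paths on Athena):

```shell
# Stage input to fast local disk, compute there, copy results back.
WORKSPACE=$(mktemp -d)   # stands in for $PRJ_WORKSPACE (network storage)
SCRATCH=$(mktemp -d)     # stands in for $TMPDIR (per-job local NVMe)

mkdir -p "$WORKSPACE/dataset"
echo "sample record" > "$WORKSPACE/dataset/input.txt"

cp -r "$WORKSPACE/dataset" "$SCRATCH/"    # 1) stage input onto local NVMe
mkdir -p "$SCRATCH/results"               # 2) compute against the local copy
tr '[:lower:]' '[:upper:]' < "$SCRATCH/dataset/input.txt" > "$SCRATCH/results/output.txt"
cp -r "$SCRATCH/results" "$WORKSPACE/"    # 3) copy outputs back before the job ends

cat "$WORKSPACE/results/output.txt"       # prints: SAMPLE RECORD
```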

My job wrote output files but I can’t find them — where do they go?

By default, SLURM writes output to the directory from which you submitted the job, named slurm-<job_id>.out and slurm-<job_id>.err. Use #SBATCH --output and #SBATCH --error in your script to set explicit paths.

Odin

Can I run long workloads from a JupyterLab session on Odin?

Yes — Odin is actually the preferred place for interactive long-running sessions. Unlike direct srun sessions on the command line, Odin manages the session lifecycle for you, making it more suitable for extended interactive work. For fully unattended workloads, sbatch remains the right choice.

The Technion thanks Omer Shubi (DDS) and Daniel Zur (Med) for authoring the original QuickStart on which this page is based.

© All rights reserved to Division of Computing and Information Systems