Definition and Primary Roles
Definition: PBS is a distributed workload management system. It manages and monitors the computational workload on a set of computers.
Queuing: Users submit tasks or “jobs” to the resource management system where they are queued up until the system is ready to run them.
Scheduling: The process of selecting which jobs to run, when, and where, according to a predetermined policy. It aims to balance competing needs and goals on the system(s) to maximize efficient use of resources.
Monitoring: Tracking and reserving system resources, enforcing usage policy. This includes both software enforcement of usage limits and user or administrator monitoring of scheduling policies
Submitting jobs to PBS: qsub command
The qsub command is used to submit a batch job to PBS. Submitting a PBS job specifies a task, requests resources, and sets job attributes, all of which can be defined in an executable script file. The qsub syntax recommended on ZEUS:
> qsub [options] scriptfile
PBS script files (PBS shell scripts, described below) should be created in the user’s directory.
To obtain detailed information about qsub options, please use the command:
> man qsub
Job Identifier (JOB_ID): Upon successful submission of a batch job, PBS returns a job identifier in the following format:
> sequence_number.server_name
> 12345.zeus
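The identifier format above can be taken apart with ordinary shell parameter expansion. The following is a minimal sketch, not part of the Zeus documentation; the variable names are illustrative, and a fixed value (the example from the text) stands in for the output of qsub so that the sketch runs on any machine.

```shell
#!/bin/sh
# Split a PBS job identifier of the form sequence_number.server_name.
# On a real system the identifier would normally be captured from qsub, e.g.
#   JOB_ID=$(qsub scriptfile)
# Here the example value from the text is used instead (assumption for the demo).
JOB_ID="12345.zeus"

SEQ_NUM=${JOB_ID%%.*}   # everything before the first dot
SERVER=${JOB_ID#*.}     # everything after the first dot

echo "sequence number: $SEQ_NUM"
echo "server name:     $SERVER"
```

Keeping the identifier in a variable makes it easy to pass to qstat, qdel or tracejob later.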
The PBS shell script sections
Shell specification: #!/bin/sh
PBS directives: used to request resources or set attributes. A directive begins with the default string “#PBS”.
Tasks (programs or commands)
– environment definitions
– I/O specifications
– executable specifications
NB! Other lines starting with # are comments.
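To illustrate the difference between directives and ordinary comments, the sketch below writes a tiny demonstration script to a temporary file and prints only its "#PBS" lines. The function name list_pbs_directives and the demo script contents are hypothetical. The simple grep is also only an approximation: as far as I know, qsub stops reading directives at the first executable line, which this sketch does not model.

```shell
#!/bin/sh
# Print only the PBS directive lines of a script (lines starting with "#PBS").
list_pbs_directives() {
    grep '^#PBS' "$1"
}

# Build a small demonstration script in a temporary file (illustrative content).
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/sh
#PBS -N demo_job
#PBS -l walltime=01:00:00
# an ordinary comment: not a directive
echo "task section"
EOF

DIRECTIVES=$(list_pbs_directives "$demo_script")
echo "$DIRECTIVES"
rm -f "$demo_script"
```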
Zeus Public Queues
PBS script example for multicore user code
#!/bin/sh
#PBS -N job_name
#PBS -q queue_name
#PBS -m abe
#PBS -M user@technion.ac.il
# N: number of CPU cores, P: memory in GB (replace both with numbers)
#PBS -l select=1:ncpus=N:mem=Pgb
#PBS -l walltime=24:00:00
PBS_O_WORKDIR=$HOME/mydir
cd $PBS_O_WORKDIR
./program.exe < input.file > output.file 2>&1
You can use the PBS script generator here
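For readers who prefer to build scripts from the command line, here is a rough sketch of a small generator function. It is not the Zeus script generator mentioned above; the function name make_pbs_script, its argument order and the queue name my_queue are all made up for illustration. It assumes the common PBS Professional form in which cores and memory are requested in a single select statement.

```shell
#!/bin/sh
# make_pbs_script NAME QUEUE NCPUS MEM WALLTIME COMMAND
# Writes a PBS job script to standard output (hypothetical helper).
make_pbs_script() {
    cat <<EOF
#!/bin/sh
#PBS -N $1
#PBS -q $2
#PBS -l select=1:ncpus=$3:mem=$4
#PBS -l walltime=$5
cd \$PBS_O_WORKDIR
$6
EOF
}

# Example: 4 cores, 8 GB of memory, 2 hours; "my_queue" is a placeholder.
SCRIPT_TEXT=$(make_pbs_script demo_job my_queue 4 8gb 02:00:00 \
    './program.exe < input.file > output.file 2>&1')
echo "$SCRIPT_TEXT"
```

Redirecting the output to a file (for example make_pbs_script ... > demo_job.sh) gives a script that could then be passed to qsub, after checking the queue name and limits against the local policy.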
Checking the job/queue status: qstat command
The qstat command is used to request the status of batch jobs, queues, or servers.
Detailed information: > man qstat
qstat output structure (see on Zeus)
Useful commands
> qstat -a                  jobs of all users in all queues (default)
> qstat -1n                 all jobs in the system with node names
> qstat -1nu username       all of the user’s jobs with node names
> qstat -f JOB_ID           extended output for the job
> qstat -Q                  list of all queues in the system
> qstat -Qf queue_name      extended queue details
> qstat -1Gn queue_name     all jobs in the queue with node names
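The commands above can be wrapped in small shell functions for everyday use. This is a sketch under assumptions: the function names my_jobs and job_state are hypothetical, and job_state relies on qstat -f printing a line of the form "job_state = R", as PBS Professional does. Nothing is called at the top level, so the file can be sourced on a machine without PBS.

```shell
#!/bin/sh
# my_jobs [USER]: list a user's jobs with node names (defaults to current user).
my_jobs() {
    qstat -1nu "${1:-$(id -un)}"
}

# job_state JOB_ID: print only the one-letter state of a job (e.g. Q, R, E).
# Assumes "qstat -f" output contains a line such as "    job_state = R".
job_state() {
    qstat -f "$1" | awk -F' = ' '$1 ~ /job_state/ { print $2 }'
}

# Usage on a PBS system (not executed here):
#   my_jobs
#   job_state 12345.zeus
```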
Removing job from a queue: qdel command
The qdel command is used to delete queued or running jobs. The job’s running processes are killed. A PBS job may be deleted by its owner or by the administrator.
Detailed information: > man qdel
Useful commands
> qdel JOB_ID               deletes the job from a queue
> qdel -W force JOB_ID      force-deletes the job
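The two commands above can be combined into one helper that tries a normal delete first and forces deletion only if that fails. This is a sketch, with cancel_job being a hypothetical name. Note that qdel can also fail for other reasons (for example an unknown job identifier), in which case forcing will not help.

```shell
#!/bin/sh
# cancel_job JOB_ID: try a normal delete, fall back to a forced delete.
cancel_job() {
    job_id=$1
    if qdel "$job_id"; then
        echo "deleted $job_id"
    else
        echo "normal delete failed, forcing deletion of $job_id" >&2
        qdel -W force "$job_id"
    fi
}

# Usage on a PBS system (not executed here):
#   cancel_job 12345.zeus
```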
Checking job results and troubleshooting
Save the JOB_ID for further inspection
Check error and output files:
job_name.eJOB_ID; job_name.oJOB_ID
Inspect the job’s details (also for jobs up to N days old): > tracejob [-n N] JOB_ID
A job in the E (exiting) state still occupies resources; it will be deleted
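The names of the output and error files can be derived from the job name and the job identifier. The sketch below assumes that the numeric sequence number (the part of JOB_ID before the dot) is what appears in the file name, which is the PBS Professional default as far as I know; the function names are hypothetical.

```shell
#!/bin/sh
# job_output_files JOB_NAME JOB_ID: print the expected output and error file names.
job_output_files() {
    name=$1
    seq_num=${2%%.*}            # "12345.zeus" -> "12345"
    echo "${name}.o${seq_num}"
    echo "${name}.e${seq_num}"
}

# show_job_output JOB_NAME JOB_ID: print the tail of each file if it exists.
show_job_output() {
    for f in $(job_output_files "$1" "$2"); do
        if [ -f "$f" ]; then
            echo "==> $f <=="
            tail -n 20 "$f"
        else
            echo "$f not found in $(pwd)"
        fi
    done
}

# Demonstration with the example identifier from the text.
job_output_files demo_job 12345.zeus
```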
Running an interactive batch job (debugging): > qsub -I pbs_script
The job is sent to an execution node, the PBS directives are applied, and the job then waits for the user’s commands
Checking the job on an execution node: > ssh node_name
> hostname
> top               inside top, press u and enter a user name to show that user’s processes; press 1 to show per-CPU usage
> kill -9 PID       kill the job’s process on the node
> ls -rtl /gtmp     check error, output and other files under the user’s ownership
Output can be copied from the node to the home directory
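The node checks above can also be run non-interactively from the login node. The following is a sketch only: node_procs and fetch_from_node are hypothetical helper names, ps is used as a non-interactive substitute for top, and it assumes that ssh and scp access to the execution node is permitted for the user.

```shell
#!/bin/sh
# node_procs NODE USER: list a user's processes on an execution node.
node_procs() {
    ssh "$1" ps -u "$2" -o pid,pcpu,pmem,etime,args
}

# fetch_from_node NODE REMOTE_PATH: copy a file from the node to the home directory.
fetch_from_node() {
    scp "$1:$2" "$HOME/"
}

# Usage on the cluster (not executed here; names are placeholders):
#   node_procs node_name username
#   fetch_from_node node_name /gtmp/output.file
```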