Introduction
The Orthus cluster uses the Slurm (Simple Linux Utility for Resource Management) job scheduler to let users run and manage batch compute jobs.
When you ssh to the Orthus cluster via ssh orthus.cir.irb.hr, you log in to a login node. When you submit a batch job with sbatch, Slurm schedules it to run on the compute nodes. The login node has the same software installed as the compute nodes.
File system
The file systems available on Orthus are a combination of local (per-node) file systems and mounted (NFS) storage locations. The table below gives an overview of the available file systems:
| Mount point | Accessibility | Description |
|---|---|---|
| /home | Login + Compute | NFS user home folders, visible on the frontend and compute nodes |
| /apps | Login + Compute | NFS shared folder in which software packages are installed |
| /storage | Login + Compute | NFS shared folder for sharing data |
| /scratch | Compute | Local, fast (SSD) storage on each compute node |
Storage folder
The /storage folder is used for storing and sharing large amounts of data between users. For each registered project, a shared folder is created in which the members of the project can share data.
Scratch folder
The /scratch folder resides on a fast (SSD) disk attached to each compute node. Its main purpose is to serve as a working folder for the jobs active on that node. Users can also store their temporary data here.
==CAUTION== All the data in the /scratch folder will be deleted once the job is finished!
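A typical pattern inside a batch script (batch scripts are described in the next section) is therefore to stage input data into /scratch at the start of the job, work there, and copy results back before the job ends. A minimal sketch, with placeholder paths under /storage and a placeholder program name:
# Create a private working directory on the node-local SSD
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp /storage/my_project/input.dat "$WORKDIR"   # placeholder input path
cd "$WORKDIR"
./my_program input.dat                        # placeholder program
# Copy results back before the job finishes -- /scratch is wiped afterwards
cp results.dat /storage/my_project/
rm -rf "$WORKDIR"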
Running jobs
Jobs are submitted and run via Slurm and have to be described with a batch script. At the top of each script, before the ordinary shell commands, special Slurm directives describe the resources the job requires.
To submit a job, run:
sbatch <name of the batch script>
The sbatch command returns a job ID, which can be used later for monitoring the status of the job:
Submitted batch job <JobID>
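When submitting from a shell script, the job ID can be captured directly with sbatch's --parsable option, which prints only the ID:
jobid=$(sbatch --parsable my_job.sh)
echo "Submitted job $jobid"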
Check the status of submitted jobs:
squeue
which will list all jobs in the queue. To check only your jobs:
squeue -u $USER
To see detailed information about a specific job:
scontrol show job <JobID>
The possible states of jobs include: PD (pending), R (running), CG (completing), CD (completed), F (failed), CA (cancelled).
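For example, to list only your pending and running jobs:
squeue -u $USER -t PD,R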
Job description
Slurm batch scripts are standard shell scripts with special directives that describe the job requirements. The header of each script contains the Slurm parameters, followed by normal commands to execute user applications.
The structure of a job batch script (file: my_job.sh):
#!/bin/bash
#SBATCH --<parameter1>=<value1>
#SBATCH --<parameter2>=<value2>
<command1>
<command2>
Here’s a basic example batch script:
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=test_job_%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
echo "Job started at $(date)"
echo "Running on host: $(hostname)"
echo "CPU info:"
lscpu | grep "Model name"
echo "Memory info:"
free -h
echo "Sleeping for 30 seconds..."
sleep 30
echo "Job completed at $(date)"
The job is submitted to the cluster for execution by running:
sbatch my_job.sh
The basic Slurm parameters for describing jobs are:
--job-name=<name> # Name of the job
--output=<filename> # File for standard output
--error=<filename> # File for standard error
--time=<time> # Maximum runtime (e.g., 1:30:00 for 1.5 hours)
--nodes=<count> # Number of nodes required
--ntasks=<count> # Number of tasks (processes)
--ntasks-per-node=<count> # Tasks per node
--cpus-per-task=<count> # CPU cores per task
--mem=<size> # Memory per node (e.g., 4G, 1000M)
--mem-per-cpu=<size> # Memory per CPU core
--partition=all # Partition name (use "all")
--gres=gpu:<count> # GPU resources
Slurm environment variables
Inside a batch script you can use Slurm environment variables. Some of the most commonly used are listed below; a short usage sketch follows the list:
$SLURM_JOB_ID # Unique job identifier
$SLURM_JOB_NAME # Job name
$SLURM_SUBMIT_DIR # Directory from which job was submitted
$SLURM_JOB_NODELIST # List of nodes assigned to job
$SLURM_NTASKS # Number of tasks
$SLURM_CPUS_PER_TASK # CPU cores per task
$SLURM_PROCID # Process rank
$SLURM_LOCALID # Local task ID on node
$SLURM_ARRAY_JOB_ID # Job array ID
$SLURM_ARRAY_TASK_ID # Task ID within job array
$SLURM_TMPDIR # Temporary directory (typically /scratch)
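A brief sketch of how these variables are typically used inside a batch script:
# Return to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on $SLURM_JOB_NODELIST with $SLURM_NTASKS task(s)"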
Types of jobs
Serial jobs
An example of a simple script that starts a serial job (requiring only one CPU core) and prints some system information:
#!/bin/bash
#SBATCH --job-name=example-serial
#SBATCH --output=example-serial.out
#SBATCH --error=example-serial.err
#SBATCH --partition=all
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
echo "Starting serial job at $(date)"
echo "Running on node: $(hostname)"
echo "Working directory: $(pwd)"
# Your application commands here
./my_serial_program
Interactive jobs
To start an interactive job, use the salloc command together with srun and the --pty flag. This gives you direct shell access on a compute node:
salloc srun --pty bash
For example, to run an interactive job that requires 4 CPU cores for 2 hours:
salloc --partition=all --ntasks=4 --time=02:00:00 srun --pty bash
It is also possible to ssh directly to a compute node (e.g. ssh compute01), but this is only allowed while you have an active job running on that node.
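Interactive jobs can request the same resources as batch jobs; for example, assuming the same --gres syntax as for batch GPU jobs (described below), an interactive session with one GPU:
salloc --partition=all --ntasks=1 --gres=gpu:1 --time=01:00:00 srun --pty bash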
Array jobs
Slurm supports submitting the same job multiple times with different parameters, called a job array. Each job inside the array is called a task and has its own unique identifier.
At submission time, specify the array range using the --array parameter:
--array=<start>-<end>:<step>
The task identifier is stored in the environment variable $SLURM_ARRAY_TASK_ID. Tasks can be either serial or parallel jobs.
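Some illustrative range specifications (the optional %<limit> suffix caps how many tasks run simultaneously):
--array=0-15        # 16 tasks: 0, 1, ..., 15
--array=1-9:2       # tasks 1, 3, 5, 7, 9 (step of 2)
--array=1-100%10    # 100 tasks, at most 10 running at once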
Usage examples
An example script that starts 10 serial jobs, each processing a different input file:
#!/bin/bash
#SBATCH --job-name=job_array_serial
#SBATCH --output=output/job_%A_%a.out
#SBATCH --error=output/job_%A_%a.err
#SBATCH --partition=all
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --array=1-10
echo "Processing task $SLURM_ARRAY_TASK_ID"
./myexec inputFile.$SLURM_ARRAY_TASK_ID
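If the inputs are not numbered consecutively, a common alternative is to list the file names in a text file and let each task pick one line. A sketch assuming a file filelist.txt with one input path per line:
# Select the line matching this task's ID
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
./myexec "$INPUT"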
An example script starting 10 parallel jobs:
#!/bin/bash
#SBATCH --job-name=job_array_parallel
#SBATCH --output=output/job_%A_%a.out
#SBATCH --error=output/job_%A_%a.err
#SBATCH --partition=all
#SBATCH --ntasks=4
#SBATCH --time=02:00:00
#SBATCH --array=1-10
echo "Running parallel task $SLURM_ARRAY_TASK_ID"
srun ./myexec inputFile.$SLURM_ARRAY_TASK_ID
Parallel jobs
To start parallel jobs, specify the number of tasks and optionally the number of nodes. An example script requiring 12 compute cores and running a simple Hello World MPI program:
#!/bin/bash
#SBATCH --job-name=example-mpi
#SBATCH --output=example-mpi.out
#SBATCH --error=example-mpi.err
#SBATCH --partition=all
#SBATCH --ntasks=12
#SBATCH --time=01:00:00
# Load required modules
module load openmpi
echo "Starting MPI job with $SLURM_NTASKS tasks"
echo "Nodes allocated: $SLURM_JOB_NODELIST"
srun ./hello_world.exe
The script loads an MPI implementation via the module system before launching the program with srun.
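The program itself must be built against the same MPI implementation, e.g. on the login node (assuming the source file is hello_world.c):
module load openmpi
mpicc -O2 hello_world.c -o hello_world.exe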
MPI with OpenMP threads
When running hybrid parallel applications that combine MPI and OpenMP (thread) parallelism, you need to configure the resource allocation and thread binding properly to achieve good performance. In the example below, 4 MPI tasks each run 12 OpenMP threads, for a total of 4 × 12 = 48 cores on a single node.
#!/bin/bash
#SBATCH --job-name=test-mpi-openmp
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --partition=all
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=12
#SBATCH --time=02:00:00
# Set OpenMP environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# Load required modules
module load openmpi
echo "Running hybrid MPI+OpenMP job"
echo "MPI tasks: $SLURM_NTASKS"
echo "OpenMP threads per task: $OMP_NUM_THREADS"
srun --cpu-bind=cores ./my-hybrid-application
GPU jobs
GPU jobs require special resource allocation using the --gres flag. An example GPU job script that requires 1 CPU core and 2 GPU devices:
#!/bin/bash
#SBATCH --job-name=test-gpu
#SBATCH --output=test_gpu_%j.out
#SBATCH --error=test_gpu_%j.err
#SBATCH --partition=all
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00
echo "Job started at $(date)"
echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"
echo "SLURM_JOB_GPUS = $SLURM_JOB_GPUS"
# Display GPU information
nvidia-smi
# Load CUDA if needed
module load cuda
# Run your GPU application
./my_gpu_program
Monitoring and management of jobs
Host information
To print information about nodes in the cluster:
sinfo # Show partition and node state information
sinfo -N # Show node-oriented format
scontrol show nodes # Detailed node information
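The sinfo output format can be customized with -o; for example, to list each node's CPU count, memory, and generic resources (GPUs appear in the %G field):
sinfo -N -o "%N %c %m %G"   # node, CPUs, memory (MB), GRES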
Job management
Jobs can be managed after submission using various Slurm commands:
Cancel a job:
scancel <JobID>
Cancel all your jobs:
scancel -u $USER
Hold/suspend a job (prevent it from running):
scontrol hold <JobID>
Release a held job:
scontrol release <JobID>
Suspend a running job:
scontrol suspend <JobID>
Resume a suspended job:
scontrol resume <JobID>
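Pending jobs can also be modified in place with scontrol update (users can typically lower, but not raise, their own limits); for example, to change a queued job's time limit:
scontrol update JobId=<JobID> TimeLimit=01:00:00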
Get statistics of finished jobs
To get information about finished jobs, use the sacct command:
sacct # Show accounting information for your recent jobs
sacct -j <JobID> # Detailed info for specific job
sacct -u <username> # Jobs for specific user
For a quick efficiency report of a completed job:
seff <JobID> # Shows CPU and memory efficiency
Common sacct usage examples:
# Detailed job information with custom format
sacct -j <JobID> --format=JobID,JobName,State,ExitCode,CPUTime,MaxRSS
# Jobs from a specific time period
sacct --starttime=2024-01-01 --endtime=2024-01-31
# Your jobs from today
sacct -u $USER --starttime=today
The sacct command provides extensive information about completed jobs, including runtime, memory usage, CPU efficiency, and exit codes, which is useful for optimizing future job submissions.