Introduction to batch jobs
A batch job refers to a task or a series of tasks that can be executed without user intervention. These jobs are submitted to a job scheduler, which manages resources and executes them when the required resources (such as CPUs, memory, etc.) become available. Unity uses Slurm, a popular open-source job scheduler used in many supercomputing clusters and high-performance computing (HPC) setups.
sbatch is a Slurm command that is used to submit batch jobs. sbatch is a non-blocking command: it returns as soon as the job has been submitted rather than waiting for the job to run. If the resources requested in the batch job are unavailable, the job is placed in a queue and starts once resources become available.
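As a minimal sketch of what non-blocking means in practice: sbatch prints a confirmation line of the form "Submitted batch job ID" and returns right away, so a shell session can record the job ID and carry on. The helper name extract_job_id and the file name my_job.sh are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job ID out of sbatch's
# confirmation line, e.g. "Submitted batch job 123456".
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# my_job.sh is a placeholder name for your own batch script.
# The guard keeps this sketch from doing anything off-cluster.
if command -v sbatch >/dev/null 2>&1 && [ -f my_job.sh ]; then
    reply=$(sbatch my_job.sh)   # returns right away, even if the job has to queue
    job_id=$(extract_job_id "$reply")
    echo "Job $job_id was accepted; the prompt is free while it waits or runs."
fi
```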
The following sections will guide you through how to:
- Create and submit a batch job
- Check the status of your job while it’s pending or running
- Receive emails about your job status
Create and submit a batch job
There are two parts to submitting a batch job:
- You need to create a batch script, which is a separate file that contains all of the parameters for your job and the commands you want to run.
- You need to use the sbatch command to submit the batch job you created.
The following steps will guide you through how to create and submit a batch job in more detail.
Create a batch script file in your preferred location.
In the first line of the batch script, write the line #!/bin/bash, or whichever interpreter you need. If you are unsure of which interpreter to use, use #!/bin/bash.
After the #!/bin/bash line, specify your #SBATCH parameters. These parameters specify important information about your batch job, such as the number of cores per task or the amount of memory you are requesting.
The following example is a simple batch script that contains common sbatch parameters.
#!/bin/bash
#SBATCH -c 4  # Number of Cores per Task
#SBATCH --mem=8192  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 01:00:00  # Job time limit
#SBATCH -o slurm-%j.out  # %j = job ID
module load cuda/10.1.243
/modules/apps/cuda/10.1.243/samples/bin/x86_64/linux/release/deviceQuery
Note that these lines are contained within the batch script file. Any parameters specified on the command line when submitting your job will override those in the file.
As defined by its #SBATCH parameters, the deviceQuery example script requests four CPU cores, 8192 MB of memory, and one GPU in the gpu partition, with a one-hour time limit. The last two lines load the required CUDA module and run the deviceQuery sample program, which queries the available GPUs and, because only one GPU is allocated, prints only one device to the specified output file. Feel free to remove or modify any of the parameters in the script to suit your needs. Slurm also provides a wide variety of additional parameters for use with sbatch.
To submit your batch job, use the command sbatch BATCH_SCRIPT. Be sure to replace BATCH_SCRIPT with the file name of your batch script.
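The steps above can be sketched end to end as follows. The file name my_job.sh, the resource values, and the helper name submit are placeholders chosen for illustration. The bash -n syntax check is an optional extra, not something Slurm requires. The helper also illustrates that options placed before the script name on the command line override the matching #SBATCH lines in the file.

```shell
#!/bin/bash
# Write a small batch script (my_job.sh is a placeholder name).
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH -c 1                 # Number of cores per task
#SBATCH --mem=1024           # Requested memory (MB)
#SBATCH -t 00:05:00          # Job time limit
#SBATCH -o slurm-%j.out      # %j = job ID
hostname
EOF

# Optional: check for shell syntax errors before submitting.
bash -n my_job.sh && echo "my_job.sh: syntax OK"

# Hypothetical helper: options given before the script name override the
# matching #SBATCH lines. DRY_RUN=1 prints the command instead of running it.
submit() {
    local script=$1; shift
    local cmd=(sbatch "$@" "$script")
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Show the command that would raise the time limit for this one submission.
DRY_RUN=1 submit my_job.sh -t 00:10:00
# On a login node, submit for real with:  submit my_job.sh
```

Keeping the defaults in the file and passing one-off changes on the command line avoids editing the script for every run.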
Check the status of your job while it’s pending or running
To check the status of all your jobs while they are pending or running, use the squeue --me command.
Alternatively, to see the status of a specific job at any time, use the command sacct -j YOUR_JOB_ID. Be sure to replace YOUR_JOB_ID with the actual job ID you received when you submitted your job.
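For a more compact view of a single job, sacct accepts a --format list of fields. The sketch below wraps that in a function; the name job_summary is hypothetical, and the fields shown are only one reasonable selection.

```shell
#!/bin/bash
# Hypothetical wrapper: summarize one job's accounting record.
job_summary() {
    sacct -j "$1" --format=JobID,JobName,Partition,State,Elapsed,ExitCode
}

# List your own pending and running jobs (skipped where Slurm is not installed).
if command -v squeue >/dev/null 2>&1; then
    squeue --me
fi

# Replace YOUR_JOB_ID with a real job ID before running, for example:
#   job_summary YOUR_JOB_ID
```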
Receive emails about your job status
To receive emails based on the status of your job, use the --mail-type argument. Common mail types are BEGIN, END, FAIL, INVALID_DEPEND, and REQUEUE. For more information on which mail type makes the most sense for you, see Slurm’s sbatch page, which not only covers --mail-type but also contains a full guide on sbatch.
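Several mail types can be combined in one comma-separated --mail-type value. The batch script below is a sketch that asks for an email when the job ends or fails. The address is a placeholder passed through Slurm's --mail-user option, and the workload is a stand-in for your own commands.

```shell
#!/bin/bash
#SBATCH -t 00:05:00                  # Job time limit
#SBATCH --mail-type=END,FAIL         # Email when the job ends or fails
#SBATCH --mail-user=you@example.com  # Placeholder address; replace with your own

# Placeholder workload; replace with your own commands.
node=$(uname -n)
echo "Ran on $node"
```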
To check that the email feature works for you with either salloc or sbatch, use the following code samples.
In your terminal:
salloc --mail-type=BEGIN /bin/true
Or, within your batch script:
#!/bin/bash
#SBATCH --mail-type=BEGIN
/bin/true
The BEGIN mail type sends you an email once your job begins.
To send these emails to a specific address, use the --mail-user argument.
Receive a time limit email to prevent a loss of work
Your job will be terminated as soon as it reaches its time limit, regardless of how close it was to finishing its task. Without checkpointing, those CPU hours would be lost, and you would have to schedule the job all over again.
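One possible way to add checkpointing to a batch script is sketched below, under several assumptions: sbatch's --signal option asks Slurm to signal the batch shell shortly before the time limit, and a trap handler saves progress when that signal arrives. The signal (USR1), the 300-second lead time, the file name checkpoint.txt, and the loop standing in for real work are all illustrative. A real application needs its own save and restore logic.

```shell
#!/bin/bash
#SBATCH -t 01:00:00
#SBATCH --signal=B:USR1@300   # Signal the batch shell about 300 s before the limit

step=0

# Save current progress; a real job would write its own state here.
save_checkpoint() {
    echo "$step" > checkpoint.txt
    echo "checkpoint written at step $step"
}
trap save_checkpoint USR1

# Resume from an earlier checkpoint if one exists.
[ -f checkpoint.txt ] && step=$(cat checkpoint.txt)

# Stand-in for real work. bash runs the trap handler only after the
# current foreground command finishes, so each unit of work is kept short.
while [ "$step" -lt 3 ]; do
    sleep 1
    step=$((step + 1))
done
echo "finished at step $step"
```

Resubmitting the same script after a timeout would then continue from the saved step instead of starting over.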
Another way to prevent losing your work is to check on your job’s output as it approaches its time limit. To receive an email when your job is approaching its time limit, use the --mail-type=TIME_LIMIT_80 argument.
With the --mail-type=TIME_LIMIT_80 argument, Slurm emails you if 80% of the time limit has passed and your job is still running. Then, you can check on the job’s output and determine if it will finish in time. If you do not think your job will finish in time, email us at hpc@umass.edu or ask on the Community Slack and we can extend your job’s time limit.
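To make the 80% threshold concrete, the following sketch converts a time limit in HH:MM:SS form to seconds and reports when the 80% mark is reached. The function names are hypothetical, and only the HH:MM:SS form is handled, not the other time formats Slurm accepts.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds. The 10# prefix forces base 10 so that
# values such as "08" are not rejected as invalid octal numbers.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Seconds after job start at which 80% of the limit has elapsed.
warn_after() {
    echo $(( $(to_seconds "$1") * 80 / 100 ))
}

limit="01:00:00"   # same value as -t 01:00:00 in a batch script
echo "With -t $limit, the 80% mark is $(( $(warn_after "$limit") / 60 )) minutes after the job starts."
```

For a one-hour limit this works out to 48 minutes, which leaves 12 minutes to inspect the output and ask for an extension if needed.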
Check job progress
To see the status of all your jobs while they are pending or running, use the squeue --me command. This command shows the state of your jobs (for example, pending or running), job ID, partition, username, and more.
Alternatively, to see the status of a certain job at any time, use the command sacct -j YOUR_JOB_ID.
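If a script needs to wait for a job, one option is to poll squeue until the job is no longer in an active state. This is a sketch: the function names, the 30-second default interval, and the list of states treated as active are assumptions. The squeue flags used are -h (no header), -j (job ID), and -o %T (print only the state).

```shell
#!/bin/bash
# Print the job's current state (e.g. PENDING, RUNNING), or nothing if
# squeue no longer knows about the job.
job_state() {
    squeue -h -j "$1" -o %T 2>/dev/null
}

# Assumed (partial) list of states in which a job is still in progress.
is_active() {
    case "$1" in
        PENDING|CONFIGURING|RUNNING|COMPLETING|SUSPENDED) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll until the job is no longer active.
wait_for_job() {
    local job_id=$1 interval=${2:-30} state
    while state=$(job_state "$job_id"); is_active "$state"; do
        echo "job $job_id: $state"
        sleep "$interval"
    done
    echo "job $job_id is no longer pending or running; see: sacct -j $job_id"
}

# Replace YOUR_JOB_ID with a real job ID before running, for example:
#   wait_for_job YOUR_JOB_ID
```

A long polling interval keeps the load on the scheduler low; for most uses the email notifications described earlier are the lighter-weight choice.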
For an in-depth guide on monitoring batch jobs, see Monitor a batch job.