Temporary working directory#

Overview#

This script uses a temporary directory on the compute node to avoid costly read/write operations over the network.

In particular it does the following:

  1. Copy files to a temporary directory on the compute node (which, in this case, is automatically provided by the queuing system).

  2. Run the job (while reading and writing locally on the compute node).

  3. Copy all files back to the directory from which the job is submitted (on the head node).

Only steps 1 and 3 use the network; during the computation all file operations are local to the compute node. This often yields a considerable efficiency gain.
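
The three steps above can be sketched in plain bash. The paths here are illustrative stand-ins created with `mktemp`; in the real job script below they come from `${TMPDIR}` and `${SLURM_SUBMIT_DIR}`:

```shell
# a minimal sketch of the copy-compute-copy pattern, not the actual job script
submitdir="$(mktemp -d)"                 # stand-in for the submit directory
echo "input" > "${submitdir}/input.txt"  # pretend this is the simulation input
workdir="$(mktemp -d)"                   # stand-in for node-local scratch

cp -pr "${submitdir}"/. "${workdir}"     # 1. copy input to the node (network)
( cd "${workdir}" && echo "hello" > output.log )  # 2. compute locally
cp -pr "${workdir}"/. "${submitdir}"     # 3. copy results back (network)
rm -rf "${workdir}"                      # leave the node clean
```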

File structure#

This example assumes a file structure where (almost) everything that a simulation needs, and all its output, are located in a single directory (which may have arbitrary sub-directories). For example:

/home/user/...
| - simulation
  | - job.slurm
  | - ... (code, input)

It is vital that you submit from this directory, so that only the files relevant to this simulation are copied:

$ cd /home/user/.../simulation
$ sbatch job.slurm

Note

Gsub can be called from anywhere, while still guaranteeing this behaviour.

When the job terminates (because the simulation finished, or because the queuing system killed it), all files in the simulation directory on the compute node are copied back to the directory in the home folder (on the head node). Afterwards the folder looks like:

/home/user/...
| - simulation
  | - job.slurm
  | - ... (code, input)
  | - ... (output)

Job script#

[source: job.slurm]

#!/bin/bash
#SBATCH --job-name tempdir
#SBATCH --out job.slurm.out
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --partition debug

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the name of the temporary working directory, physically on the compute-node
workdir="${TMPDIR}"

# get submit directory
# (every file/folder below this directory is copied to the compute node)
submitdir="${SLURM_SUBMIT_DIR}"

# 1. Transfer to node [DO NOT CHANGE]
# ===================================

# create/empty the temporary directory on the compute node
if [ ! -d "${workdir}" ]; then
  mkdir -p "${workdir}"
else
  rm -rf "${workdir}"/*
fi

# change current directory to the location of the sbatch command
# ("submitdir" is somewhere in the home directory on the head node)
cd "${submitdir}"
# copy all files/folders in "submitdir" to "workdir"
# ("workdir" == temporary directory on the compute node)
cp -prf * "${workdir}"
# change directory to the temporary directory on the compute-node
cd "${workdir}"

# 3. Function to transfer back to the head node [DO NOT CHANGE]
# =============================================================

# define clean-up function
function clean_up {
  # - remove the log file on the compute node, to avoid overwriting the one created by SLURM
  rm -f job.slurm.out
  # - delete temporary files from the compute-node, before copying
  # rm -r ...
  # - change directory to the location of the sbatch command (on the head node)
  cd "${submitdir}"
  # - copy everything from the temporary directory on the compute-node
  cp -prf "${workdir}"/* .
  # - erase the temporary directory from the compute-node
  rm -rf "${workdir}"/*
  rm -rf "${workdir}"
  # - exit the script
  exit
}

# call "clean_up" function when this script exits, it is run even if SLURM cancels the job
trap 'clean_up' EXIT

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# simplest example in the world, sleep a bit to allow a bit of monitoring
echo "hello world" > "test.log"
sleep 10

Note

To facilitate writing job scripts, this script can be generated using the GooseSLURM Python module: [source: writeJob.py].

Code explained#

Language selection#

#!/bin/bash

Resource allocation#

Definition of the language that this script is written in (BASH in this case).

#SBATCH --job-name tempdir
#SBATCH --out job.slurm.out
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --partition debug

Allocation of resources. These lines are interpreted by the sbatch command, but are ordinary comments when the script runs on the compute node.

Directory selection#

# I. Define directory names [DO NOT CHANGE]
# =========================================

# get the name of the temporary working directory, physically on the compute-node
workdir="${TMPDIR}"

# get submit directory
# (every file/folder below this directory is copied to the compute node)
submitdir="${SLURM_SUBMIT_DIR}"

Definition of:

  • workdir: the temporary directory on the compute node, here taken from the ${TMPDIR} provided by SLURM. Reading from and writing to the workdir is local on the compute node, and does not involve the cluster’s internal network.

  • submitdir: the directory from which the sbatch command is run. It is assumed that this is the simulation directory (/home/user/.../simulation above). All files/directories in this folder are copied back and forth. Be sure to submit from the correct directory.
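
On clusters where `${TMPDIR}` is not set (whether it is depends on your cluster's configuration), a common pattern is to fall back to a job-specific directory under `/tmp`. This assumes `/tmp` is node-local, which you should verify for your cluster:

```shell
# fall back to /tmp/<job-id> when TMPDIR is unset or empty
# (assumption: /tmp is a node-local file system on your cluster)
workdir="${TMPDIR:-/tmp/${SLURM_JOB_ID}}"
```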

Copy to the compute-node#

# 1. Transfer to node [DO NOT CHANGE]
# ===================================

# create/empty the temporary directory on the compute node
if [ ! -d "${workdir}" ]; then
  mkdir -p "${workdir}"
else
  rm -rf "${workdir}"/*
fi

# change current directory to the location of the sbatch command
# ("submitdir" is somewhere in the home directory on the head node)
cd "${submitdir}"
# copy all files/folders in "submitdir" to "workdir"
# ("workdir" == temporary directory on the compute node)
cp -prf * "${workdir}"
# change directory to the temporary directory on the compute-node
cd "${workdir}"

This code:

  1. Creates or empties the temporary directory on the compute node (workdir).

  2. Copies all files/directories in submitdir to the temporary directory on the compute node (over the cluster’s internal network).

Copy back to the head-node (when the job finishes)#

# 3. Function to transfer back to the head node [DO NOT CHANGE]
# =============================================================

# define clean-up function
function clean_up {
  # - remove the log file on the compute node, to avoid overwriting the one created by SLURM
  rm -f job.slurm.out
  # - delete temporary files from the compute-node, before copying
  # rm -r ...
  # - change directory to the location of the sbatch command (on the head node)
  cd "${submitdir}"
  # - copy everything from the temporary directory on the compute-node
  cp -prf "${workdir}"/* .
  # - erase the temporary directory from the compute-node
  rm -rf "${workdir}"/*
  rm -rf "${workdir}"
  # - exit the script
  exit
}

# call "clean_up" function when this script exits, it is run even if SLURM cancels the job
trap 'clean_up' EXIT

Define a function that is run when the job ends (whether it exits normally or is terminated, voluntarily or not, by the queuing system). This function copies everything (including all generated results) back to submitdir, which again involves the cluster’s internal network. Note that this may overwrite files in submitdir.
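
A minimal, self-contained illustration of the EXIT trap mechanism (run here in a subshell, independent of SLURM):

```shell
# commands registered with "trap ... EXIT" run when the (sub)shell exits,
# after the last regular command has finished
out="$(
  trap 'echo "clean_up ran"' EXIT
  echo "job ran"
)"
```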

If temporary files are created that are no longer needed (for example build files, executables, debug output, ...), it is wise to delete them by uncommenting and modifying the `# rm -r ...` line in the clean_up function. This way these files are removed before the results are copied over the network.
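
For example, if the simulation produces build artifacts that need not be copied back, the commented line could become something like the following (the names `build/` and `*.o` are purely illustrative):

```shell
# - delete temporary files from the compute-node, before copying
rm -rf build/        # hypothetical build directory
rm -f -- *.o         # hypothetical object files; "-f" ignores missing files
```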

Actual job#

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================

# simplest example in the world, sleep a bit to allow a bit of monitoring
echo "hello world" > "test.log"
sleep 10

Here you can do whatever you want. Remember that all read and write operations in the current directory (i.e. paths like ./somepath) are local to the compute node, which is as efficient as reading and writing gets. Avoid doing anything here that involves the home folder, as that is a network mount.
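
A slightly more realistic sketch of the "Execute" section. The commands are illustrative; compressing large logs before copy-back is optional and assumes `gzip` is installed:

```shell
# all paths are relative, so every read/write stays on the node-local disk
echo "step 1 done" >  progress.log
echo "step 2 done" >> progress.log
gzip -f progress.log        # compress large output before it is copied back

# avoid absolute paths into the home folder here:
# every access to /home/... goes over the network mount
```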