Using MCMCtree in QMUL's Apocrita
In this tutorial we will run a simple MCMCtree analysis at QMUL’s Apocrita high-performance computer cluster. QMUL students and staff can request an Apocrita account here. Apocrita uses job scheduling software for fair allocation of computational resources to users of the cluster. To submit a computational analysis to Apocrita, you must first write a script with instructions for you analyses. This script (known as a job) is submitted to the scheduler’s job queue. When a computing node with enough resources becomes available, your job is released from the queue and allocated to the node, where it is allowed to run. There is extensive documentation at Apocrita’s home site on how to use the cluster and the scheduler.
This tutorial assumes you are familiar with the command line, SSH, MCMCtree (i.e, you have already done the MCMCtree tutorials), and basic shell scripting with BASH.
We will follow these steps:
- Download and compile MCMCtree in Apocrita.
- Prepare data and control files for our MCMCtree analysis.
- Submit the MCMCtree job to Apocrita’s job queueing system.
- Check the MCMC results once the job has finished.
1. Download and compile MCMCtree in the cluster
First, let’s login onto the cluster. Please replace xyz123
below with your QMUL user name.
ssh xyz123@login.hpc.qmul.ac.uk
Downloading and compiling software are resource intesive tasks. Thus, we will first log onto the queue, so that a suitable node is allocated to us. In that way, we will not interfere with the work of other users of Apocrita. In the command line type:
qlogin
The command will wait until a node becomes available, and then it will log you onto an interactive job.
When working in the cluster it is very important to keep our files well organised. We will create three directories, bin
, software
and projects
. The bin
directory will contain scripts and executable files we need to use often; software
will contain programs for major analyses (such as MCMCtree, RAxML, or MrBayes), and projects
will contain our analyses, i.e., it is the place where we will run our HPC jobs.
Make the three directories with the mkdir
command:
mkdir bin software projects
You can verify the directories have been created using ls
:
ls
We will now download and install MCMCtree. First we go into the software
directory and then we use git
to clone the MCMCtree software:
cd software
git clone https://github.com/dosreislab/mcmctree.git
You now have downloaded dos Reis’ lab modified version of MCMCtree. The modifications have been made to simplify certain HPC tasks (such as running unlimited checkpoint cycles, for computational heavy analyses, such as trees with too many species). Let’s go into the src
directory of mcmctree and compile the executable.
cd mcmctree/src
The src
directory contains the source files with the code that makes up MCMCtree. MCMCtree is written in C, an efficient computer language that needs to be compiled into machine code. We will use the Intel compiler to generate fast, effcient executables that are fine tuned to QMUL’s cluster architecture. The instructions for the compiler are inside the Makefile
file. Using your favorite text editor, open Makefile
(for example, you can try module load nano; nano Makefile
). Edit Makefile
so that it looks like this:
PRGS = mcmctree
CC = icc # cc, gcc, cl # if running on an intel cluster, check if you can compile with the intel compiler, icc
#CFLAGS = -O3
CFLAGS = -fast # this flag is for the intel compiler
LIBS = -lm # -lM
all : $(PRGS)
mcmctree : mcmctree.c tools.c treesub.c treespace.c paml.h
$(CC) $(CFLAGS) -o $@ mcmctree.c tools.c $(LIBS)
$(CC) $(CFLAGS) -o infinitesites -D INFINITESITES mcmctree.c tools.c $(LIBS)
Three lines have changed. The second line is changed to CC = icc
, which tells the system to use the Intel compiler. Line five has a comment character #
added at the beginning. Line six is changed to CFLAGS = -fast
, which tells the Intel compiler to generate fast code. Go back to the command line and ask the system to list the Intel compilers available:
module avail intel
As of Aug 2024, the command above lists the following compilers:
intel/2017.1 intel/2018.1 intel/2020.4 intelmpi/2017.1 intelmpi/2018.1 intelmpi/2020.4
intel/2017.3 intel/2018.3 intel/2022.2 intelmpi/2017.3 intelmpi/2018.3 intelmpi/2022.2
The mpi compilers are used to generate MPI parallelisable code. MCMCtree does not use MPI so we don’t need to use these. The latest compiler available in my list is intel/2022.2
. Load the compiler with:
module load intel/2022.2
You can use a later version of the compiler if available. Now we can compile MCMCtree:
make
The make
program reads the instructions in Makefile
, and calls the icc
compiler as appropriate. Let’s print the contents of the src
directory to our screen, which now contains the mcmctree
executable:
ls
This should produce the following output:
Makefile infinitesites mcmctree mcmctree.c paml.h tools.c treespace.c treesub.c
That is, two new executable files have been created, mcmctree
and infinitesites
. The latter is a special version of MCMCtree used for asymptotic analysis assuming an infinite amount of data.
Tip: To get out of the interactive queue started with the qlogin
command, simply type exit
in the command line.
Tip: You can find the manual for a command using man
. For example, try man qlogin
or man ls
to learn more about the qlogin
and ls
commands.
Tip: Sometimes an ssh session to the HPC cluster becomes unresponsive. For example, you went away from your computer and when you came back you found you could not type anything into the terminal. If that happens, simply press the follwing keys in sequence: Enter
~
.
. That will terminate the shell process. You will then need to login again. You can try it now.
2. Prepare data and control files for our MCMCtree analysis
We will estimate divergence times on a four-species hominid dataset including human, Neanderthal, chimp and gorilla. First, we create the appropriate project directory.
cd ~/projects/
mkdir hominid
The alignment, tree file, and control file we need for our analyses are in the mcmctree git directory, so we copy them (it is not good pratice to carry out analyses within git directories, as these should be used for software development and testing only).
cd hominid
cp ~/software/mcmctree/data/ape4s/* .
ls
You should see the following files:
ape4s.phy ape4s.tree mcmctree.ctl
We will run four MCMC chains in parallel. So we need to create four directories:
mkdir `seq 1 4`
The seq
command simply generates sequences of numbers. In this case, 1, 2, 3, and 4, which are used by the mkdir
command as the names of the directories to be generated.
We now need to link the alignment, tree and control files inside the directories that will run each independent MCMC chain:
for d in `seq 1 4`
do
cd $d
ln -s ../mcmctree.ctl .
ln -s ../ape4s.phy .
ln -s ../ape4s.tree .
cd ..
done
The sequence of commands above loops among the four directories, goes inside each one, creates soft links to the control, alignment and tree files, and then leaves each directory before moving to the next one. Soft links are useful: instead of copying each file inside each directory, we simply put a link that tells the system where the true file is located. Soft links are efficient because they use very little disk space, and they are synchronised to the main file. If you edit the tree file (for example, you change a fossil calibration), all the soft links are updated automatically.
You can check the commands above worked correctly by listing the contents of one the directories. For example:
ls -l 1
Which should generate an output similar to this one:
lrwxrwxrwx 1 xyz123 qmul 12 Aug 18 19:31 ape4s.phy -> ../ape4s.phy
lrwxrwxrwx 1 xyz123 qmul 13 Aug 18 19:31 ape4s.tree -> ../ape4s.tree
lrwxrwxrwx 1 xyz123 qmul 15 Aug 18 19:31 mcmctree.ctl -> ../mcmctree.ctl
Note the arrows indicating the files are soft links.
3. Submit the MCMCtree job to Apocrita’s job queueing system
Now we prepare our job submission script. First we create the job file inside the bin
directory:
cd ~/bin
touch mcmctree-job.sh
Now using your favorite text editor, open the file, which will be empty, and add the following lines:
#! /bin/bash # use the BASH shell
#$ -cwd # change to current working directory
#$ -V # export environment to job
#$ -j y # merge standard out and standard error
#$ -l h_rt=01:00:00 # set time limit
#$ -l h_vmem=100M # amount of memory needed
FMCMC=mcmc.txt
cd $SGE_TASK_ID
# Check if mcmc.txt file exists, if it does, abort
# mission, otherwise run mcmctree
if [ -f "$FMCMC" ]; then
echo "ERROR: $FMCMC exists, nothing done."
else
# This is MCMCtree 4.9j compiled with icc -fast and with NS = 1500
date >START
~/software/mcmctree/src/mcmctree mcmctree.ctl
date >DONE
fi
The first six lines have general instructions for the scheduler, such as the shell type (BASH), and the amount of memory and time needed to run the jobs. Here we request 1 hour of time and 100 Mb of memory. It is important to get these numbers right, as the scheduler uses them to decide on the priority of your job in the queue. A job that needs 10 days to run may spend longer in the queue than a job that requires just one hour.
Line FMCMC=mcmc.txt
creates a variable with the name of the output of our MCMC chain. Next line cd $SGE_TASK_ID
tells the script to change to the directory defined in SGE_TASK_ID
. This variable will contain the directories 1 to 4 (see below).
The next block, starting with if [ -f "$FMCMC" ]; then
checks whether the mcmc.txt
file exists. If it does, the job is aborted. This is done for safety. A common mistake is to submit a job in the wrong directory. If that happened, then you would have two MCMCtree runs interfering with each other. When designing submission scripts, it is important to think about safety features to make your analyses robust to errors.
The next block, starting with else
calls the mcmctree
executable that we just compiled. The date
command prints the current date and time to the screen. Here we catch the output of date
and we redirect it to files called START
and DONE
. In this way, we record when our mcmctree
analysis started, and when it finished.
Now we submit our script. Because we need to carry out four analyses with identical setting of parameters, we will submit an array job. In the command line type:
cd ~/projects/hominid
qsub -t 1-4 ~/bin/mcmctree-job.sh
The qsub
command submits our script to the queue. The -t
flag activates the array job option, and tells the scheduler we are running four jobs with numbers 1 to 4. Internally, the scheduler will set the SGE_TASK_ID
variable sequentially to 1, 2, 3, and 4. Thus, line cd $SGE_TASK_ID
in our submission script will have the effect of changing to the appropriate directory as set by the scheduler.
Our array job is now in the queue. To check our job status type
qstat
This will list the array job. Initally, the job is queued, indicated with qw
, in the output of qstat
:
3803592 0.00000 mcmctree-j xyz123 qw 08/18/2024 19:39:33 1 1-4:1
The job will eventually be split into the array, and you will see four jobs with running status r
. These jobs will run very quickly, so you may not have chance to see them split in the qstat
output.
4. Check the MCMC results once the job has finished
Congratulations You have submitted your firt job to the cluster! Once the jobs have finished, you can check the results. You will see four files have been created:
mcmctree-job.sh.o3803592.1 mcmctree-job.sh.o3803592.2 mcmctree-job.sh.o3803592.3 mcmctree-job.sh.o3803592.4
The long number, 3803592
, is the ID number the scheduler assigned to the job, which will look different in your case. The last number is the array number. These files contain the output that MCMCtree would usually print to the screen. If anything went wrong with your analysis, examining these files is a good place to start, as MCMCtree may print errors and issues there.
The next step is to check your MCMC chains converged. If you want, you can transfer the result files to your local computer to check them with Tracer, R, or any other useful software. The command rsync
(for remote synchronisation) is useful for this. It is available in Linux and MacOS. In Windows, you can use rsync
within Windows’ Linux Subsystem. Alternatively, you can use many of the GUI SSH options available.
Within the cluster, you can do a quick check by inspecting the output files (mcmc.txt
, out
and FigTree.tre
) generated by MCMCtree. For example,
cd 1/
less out
will let you examine the out
file. The less
command is very useful to examine text files as it allows you to search for specific strings of text and move quickly to specific parts of the document.
Now that you have finished your HPC tutorial, take the time to go through Apocrita’s documentation on the web, where many job example scripts are available. Also, learning BASH (shell scripting) as well as the commands awk
, sed
and grep
will improve your skills in preparing scripts and processing text output. The vim
text editor is also extremely powerful. There are many tutorials on the web available for all these commands. It is worth spending a day learning each one of them. Other simple, yet useful, commands include rev
, head
, tail
, cat
and paste
. You can learn more about them with their manual pages (e.g., man cat
). Enjoy!
Last Updated 19th Aug 2024.