Estimating the marginal likelihood of a relaxed-clock model with MCMCTree
MCMCTree now implements MCMC sampling from power-posterior distributions. This allows estimation of the marginal likelihood of a model for Bayesian model selection, that is, for the calculation of Bayes factors or posterior model probabilities. In MCMCTree, this allows selection of the relaxed-clock model for inference of species divergence times using molecular data (dos Reis et al. 2017).
To calculate the marginal likelihood of a model, one must take samples from the so-called power-posterior, which is proportional to the prior times the likelihood to the power of b, with 0 ≦ b ≦ 1. When b = 0, the power posterior reduces to the prior, and when b = 1, it reduces to the normal posterior distribution. Thus, by selecting n values of b between 0 and 1, one can sample likelihood values from the power posterior in a path from the prior to the posterior. The sampled likelihoods are then used to estimate the marginal likelihood either by thermodynamic integration (a.k.a. path sampling) or by the stepping stones method. Applications of both methods are extensive in the phylogenetics literature (Lartillot and Philippe 2006, Lepage et al. 2007, Xie et al. 2011). A review of Bayesian model selection is given in Yang (2014).
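In symbols, the paragraph above can be sketched as follows. The notation is introduced here for illustration only and is not taken from MCMCTree's documentation: θ denotes the model parameters, D the data, z_b the normalising constant of the power posterior, and N the number of MCMC samples taken at each b value. The sketch assumes a proper prior, so that z_0 = 1, and a grid 0 = b_1 < b_2 < … < b_{n+1} = 1, with θ_{k,i} the i-th sample from the power posterior at b_k.

```latex
\begin{align}
p_b(\theta \mid D) &= \frac{p(\theta)\, p(D \mid \theta)^{b}}{z_b},
\qquad z_b = \int p(\theta)\, p(D \mid \theta)^{b}\, d\theta,
\qquad 0 \le b \le 1, \\
\log z_1 &= \int_0^1 \mathrm{E}_{b}\!\left[\log p(D \mid \theta)\right] db
\qquad \text{(thermodynamic integration)}, \\
z_1 &= \prod_{k=1}^{n} \frac{z_{b_{k+1}}}{z_{b_k}},
\qquad
\frac{z_{b_{k+1}}}{z_{b_k}} \approx \frac{1}{N} \sum_{i=1}^{N}
p(D \mid \theta_{k,i})^{\,b_{k+1} - b_k}
\qquad \text{(stepping stones)}.
\end{align}
```

Here z_1 is the marginal likelihood, and E_b denotes the expectation over the power posterior at b. In both methods only the sampled log-likelihood values are needed, which is why MCMCTree records them for each b.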
Tutorial
This tutorial introduces the user to marginal likelihood calculation in MCMCTree to select a relaxed-clock model. MCMCTree (v4.9f at the time of writing) implements three clock models: the geometric Brownian motion (GBM) model, the independent log-normal (ILN) model, and the strict clock (CLK) model (Rannala and Yang, 2007). I have written an R package, mcmc3r (available on GitHub), which helps the user in selecting appropriate b values, preparing the corresponding MCMCTree control files, and parsing MCMCTree’s output to calculate the marginal likelihood. This tutorial assumes the user has basic knowledge of MCMCTree and Bayesian divergence time estimation, and a basic understanding of Bayes factors and marginal likelihood theory. It also assumes you have basic knowledge of R, and have the devtools and coda R packages installed. The tutorial has been tested on MacOS, but it should work on other systems (e.g. Linux or Windows), although some tweaking may be necessary.
You can download MCMCTree, which is part of the PAML phylogenetic analysis package, from Ziheng Yang’s website. You should place the mcmctree executable in your system’s search path as explained on the website. The mcmc3r package can be installed in R by typing:
devtools::install_github("dosreislab/mcmc3r")
The general procedure to calculate Bayes factors with MCMCTree is as follows:
1. Select the sequence alignment and phylogenetic tree to be analysed.
2. Prepare a template mcmctree.ctl file with values for the appropriate relaxed-clock model, priors, alignment and tree files.
3. Use mcmc3r to select n appropriate b values according to the marginal likelihood calculation method of choice (stepping stones or thermodynamic integration) and prepare n directories with corresponding mcmctree.ctl files.
4. Run MCMCTree n times, to sample from the n power posteriors.
5. Use mcmc3r to parse MCMCTree’s output and calculate the marginal likelihood for the chosen relaxed-clock model.
6. Repeat steps 2 to 5 for other clock models as necessary.
7. Calculate Bayes factors and posterior model probabilities.
1. Alignment and tree
The data to be analysed are the 15,899-nucleotide alignment of the mitochondrial genomes of four ape species (human, Neanderthal, chimp and gorilla). The alignment ape4s.phy and tree ape4s.tree, as well as the mcmctree.ctl template, are available within the misc/ directory in the R package. Make a directory called ape4s/ and copy the alignment and tree files into it.
Using a text editor you can look into the alignment file. The alignment, which is compressed into site patterns, is shown below:
4 86 P
Ggor AAAAAAAAAA AAAAAAAAAA AAAAAAACCC CCCCCCCCCC CCCCCCCCCC CCCGGGGGGG GGGGGGGGTT TTTTTTTTTT TTTTTT
Hnea AAAAAAACCC CCCCGGGGGG TTTTTTTAAA AAAACCCCCC CCGGGGTTTT TTTAAAACCC CGGGGGTTAA AACCCCCCGG GTTTTT
Hsap AAAAGGTACC CTTTAAGGGG AACTTTTAAA AGGTACCCCG TTAGGGCCCT TTTAAGGCCC CAAGGGTTAA AACCCGTTGG GCCTTT
Ptro ACGTAGAAAC TACTAGACGT ATTACGTACG TACTCACGTA CTAACGACTA CGTAGAGACG TAGACGATAC GTACTCCTAG TCTACT
4423 10 136 13 27 1 1 1 12 34 8 1 2 1 28
5 131 1 59 1 1 1 1 9 5 1 12 28 13 2
5 1 1 1 1 11 4028 3 261 1 18 5 1 1 1
4 1 16 11 2 233 1 220 166 28 3 5 3 1 1
4 10 3 51 2 1793 1 3 32 2 5 3 1 368 169
1 7 6 2 2 1 7 14 6 120 3284
The first line gives the number of species (4), the number of site patterns (86) and a ‘P’ indicating it is a compressed alignment. The next block shows the four species and corresponding nucleotide sequences. The last block shows the number of times each site pattern is seen in the alignment. The sum of these numbers is 15,899, the alignment length. See PAML’s manual for alignment formats.
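The statement that the pattern counts add up to the alignment length is easy to check. The short R sketch below copies the counts from the last block of the alignment file shown above; the variable names are chosen here for illustration.

```r
# Site-pattern counts copied from the last block of ape4s.phy shown above
counts <- c(
  4423, 10, 136, 13, 27, 1, 1, 1, 12, 34, 8, 1, 2, 1, 28,
  5, 131, 1, 59, 1, 1, 1, 1, 9, 5, 1, 12, 28, 13, 2,
  5, 1, 1, 1, 1, 11, 4028, 3, 261, 1, 18, 5, 1, 1, 1,
  4, 1, 16, 11, 2, 233, 1, 220, 166, 28, 3, 5, 3, 1, 1,
  4, 10, 3, 51, 2, 1793, 1, 3, 32, 2, 5, 3, 1, 368, 169,
  1, 7, 6, 2, 2, 1, 7, 14, 6, 120, 3284
)

n.patterns <- length(counts)  # number of site patterns: 86
n.sites    <- sum(counts)     # alignment length: 15899
print(c(patterns = n.patterns, sites = n.sites))
```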
The tree file is:
4 1
(((Hsap, Hnea), Ptro), Ggor)'B(0.999,1.001)';
The first line indicates the number of species (4), and the number of trees in the file (1), then the tree in Newick format is given. Because our interest is to select the relaxed-clock model and not to estimate absolute divergence times, we will fix the age of the root to one. In MCMCTree this is done by labelling the root with B(0.999,1.001), which tells MCMCTree that the age of the root is constrained to be between 0.999 and 1.001. See MCMCTree’s manual for calibration formats.
2. Preparing the MCMCTree template
The first clock model that we will test is the strict clock (CLK). Create a directory ape4s/clk/ and copy the misc/mcmctree.ctl file into it. The MCMCTree template file is shown below:
seed = -1
seqfile = ../../ape4s.phy
treefile = ../../ape4s.tree
outfile = out
ndata = 1 * number of partitions
usedata = 1 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
clock = 1 * 1: global clock; 2: independent rates; 3: correlated rates
model = 4 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = .5 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 2 .2 * gamma prior for kappa
alpha_gamma = 2 4 * gamma prior for alpha
rgene_gamma = 2 20 * gamma prior for mean rates for genes
sigma2_gamma = 1 10 * gamma prior for sigma^2 (for clock=2 or 3)
print = 1
burnin = 4000
sampfreq = 6
nsample = 20000
For a detailed explanation of all the options in the file please refer to MCMCTree’s manual. Note that the clock model is set to clock = 1, which is the model we will test first.
3. Selecting the b values with mcmc3r
Open a terminal window, change into the clk/ directory and start R. Make sure that clk/ is R’s current working directory. We will select 8 b points to estimate the marginal likelihood of CLK using the stepping stones method. In R, type:
b = mcmc3r::make.beta(n=8, a=5, method="step-stones")
The 8 b values range from 0 to 0.5129. The constant a controls the distribution of b between 0 and 1. Large a values produce b values clustered close to zero, which is desirable for large sequence alignments.
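The values reported above are consistent with the rule b_k = ((k − 1)/n)^a for k = 1, …, n: with n = 8 and a = 5 the largest point is (7/8)^5 = 0.512908935546875. The sketch below assumes that rule in order to show how a shapes the grid; beta.sketch is a hypothetical helper written for this illustration and is not necessarily how make.beta is implemented.

```r
# Hypothetical helper, NOT part of mcmc3r.
# Assumed rule: b_k = ((k - 1) / n)^a for k = 1, ..., n
beta.sketch <- function(n, a) ((seq_len(n) - 1) / n)^a

b8 <- beta.sketch(n = 8, a = 5)
print(signif(b8, 4))

# A larger a pushes every interior point closer to zero
b8.large.a <- beta.sketch(n = 8, a = 10)
print(signif(b8.large.a, 4))
```

Most of the change in the expected log-likelihood typically happens near b = 0, so concentrating points there is the usual motivation for a skewed grid.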
We now construct 8 directories, each containing a modification of the mcmctree.ctl template. In R type:
mcmc3r::make.bfctlf(b, ctlf="mcmctree.ctl", betaf="beta.txt")
Here ctlf specifies the template control file, and betaf is the name of a file that will contain the selected b values. Open a new terminal window and look at the contents of clk/. You will see the 8 new directories created together with the beta.txt file. Each directory contains the mcmctree.ctl file with an additional line. For example, the last line of 8/mcmctree.ctl is
BayesFactorBeta = 0.512908935546875
which tells MCMCTree to sample from the power posterior with b = 0.5129... Note that MCMCTree currently cannot sample log-likelihoods using b = 0, and so b = 10^–300 (a tiny number) is used instead (see 1/mcmctree.ctl).
4. Run MCMCTree
Now run MCMCTree within each one of the 8 directories created in the previous step. The following Bash command does the trick on a Mac:
for d in `seq 1 1 8`; do cd $d; mcmctree >/dev/null & cd ..; done
It takes about 40s for the 8 MCMCTree runs to finish on a 2.8 GHz Intel Core i7 machine with four processors. For analysis of larger alignments, and in particular with richer taxon sampling, computation time will be substantially longer. In such cases it may be desirable to prepare a customised script and submit the MCMCTree jobs to a high-throughput computer cluster.
5. Parse MCMCTree’s output with mcmc3r
Go back to the terminal where you are running R. Type:
clk <- mcmc3r::stepping.stones()
clk$logml; clk$se
# $logml
# [1] -32185.72
# $se
# [1] 0.03516095
The stepping.stones() function will read the b values in beta.txt and will read the log-likelihood values sampled by MCMCTree within each directory. It will then compute the log-marginal likelihood and its standard error.
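To make the computation concrete, here is a minimal sketch of the stepping stones estimator as described by Xie et al. (2011), written from the general formula rather than from the package source. The function stepping.stones.sketch and its arguments are hypothetical; it assumes the b values start at 0 and that the last stone runs from the largest b up to 1.

```r
# Hypothetical helper, not the mcmc3r implementation.
# lnL: list of numeric vectors; lnL[[k]] holds the log-likelihoods sampled
#      from the power posterior with b[k]
# b:   increasing b values starting at 0; the last stone ends at b = 1
stepping.stones.sketch <- function(lnL, b) {
  stopifnot(length(lnL) == length(b))
  db <- diff(c(b, 1))                  # stone widths b[k+1] - b[k]
  log.ratios <- vapply(seq_along(b), function(k) {
    x <- db[k] * lnL[[k]]
    m <- max(x)                        # log-sum-exp trick avoids underflow
    m + log(mean(exp(x - m)))          # log of the mean of L^(b[k+1] - b[k])
  }, numeric(1))
  sum(log.ratios)                      # log-marginal likelihood estimate
}

# Toy check: if every sampled log-likelihood is -10, the estimate must be -10
b <- ((0:7) / 8)^5
lnL <- lapply(b, function(bk) rep(-10, 100))
toy.logml <- stepping.stones.sketch(lnL, b)
print(toy.logml)
```

Working on the log scale matters here: log-likelihoods of about –32,000, as in this tutorial, cannot be exponentiated directly without underflow.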
The log-marginal likelihood estimate for CLK is –32,185.72 with a standard error (S.E.) of 0.035. Note that your values may be slightly different due to the stochastic nature of the MCMC algorithm. The S.E. can be used to construct a 95% confidence interval for the estimate: –32,185.72 ± 2×0.035. Ideally, you want the S.E. to be much smaller than the log-marginal likelihood difference between the models being tested. You may reduce the S.E. by increasing nsample or sampfreq in the mcmctree.ctl template file. Note that to reduce the S.E. by half, you need to increase nsample by a factor of four.
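The arithmetic in this paragraph can be written out in a few lines of R. The numbers are those reported above; the scaling of the S.E. with the square root of the sample size is the usual Monte Carlo rule of thumb and is assumed here rather than derived.

```r
logml <- -32185.72   # log-marginal likelihood estimate for CLK (from above)
se    <- 0.035       # its standard error (from above)

# Approximate 95% confidence interval: estimate +/- 2 S.E.
ci <- logml + c(-2, 2) * se
print(ci, digits = 9)

# Assumed rule of thumb: S.E. is proportional to 1 / sqrt(nsample),
# so a four-fold increase in nsample halves the S.E.
se.after.4x <- se / sqrt(4)
print(se.after.4x)
```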
6. Repeat for the ILN and GBM models
Go back to ape4s/ and create directories called iln/ and gbm/. Copy the mcmctree.ctl template into each directory, and modify the templates appropriately. For the ILN model, you must set clock = 2 (independent rates) in the template, and for the GBM model, it must be set to clock = 3 (correlated rates). Repeat steps 3 to 5 for both models. In R:
# This assumes you are currently in clk/
setwd("../iln")
# prepare templates for ILN:
mcmc3r::make.bfctlf(b, ctlf="mcmctree.ctl", betaf="beta.txt")
setwd("../gbm")
# prepare templates for GBM:
mcmc3r::make.bfctlf(b, ctlf="mcmctree.ctl", betaf="beta.txt")
In the terminal:
# This assumes you are currently in clk/
cd ../iln; for d in `seq 1 1 8`; do cd $d; mcmctree >/dev/null & cd ..; done
cd ../gbm; for d in `seq 1 1 8`; do cd $d; mcmctree >/dev/null & cd ..; done
Once the MCMCTree jobs have finished, return to your R session:
setwd("../iln"); iln <- mcmc3r::stepping.stones()
setwd("../gbm"); gbm <- mcmc3r::stepping.stones()
The estimated log-marginal likelihoods and S.E.’s for the three models are:
Model | log-marginal likelihood | S.E. |
---|---|---|
CLK | –32,185.72 | ± 0.035 |
ILN | –32,186.69 | ± 0.045 |
GBM | –32,186.20 | ± 0.060 |
7. Calculate Bayes factors and posterior model probabilities
Now we can calculate the Bayes factors and posterior model probabilities easily with R:
# log-marginal likelihoods for CLK, ILN and GBM:
mlnl <- c(clk$logml, iln$logml, gbm$logml)
# mlnl: -32185.72, -32186.69, -32186.20
# Bayes factors
( BF <- exp(mlnl - max(mlnl)) )
# [1] 1.0000000 0.3790830 0.6187834
# Posterior model probabilities
( Pr <- BF / sum(BF) )
# [1] 0.5005340 0.1897439 0.3097221
# or alternatively:
mcmc3r::bayes.factors(clk, iln, gbm)
The posterior probabilities are calculated assuming equal prior model probabilities. The CLK model has the highest log-marginal likelihood, and thus the highest posterior probability (Pr = 0.50), followed by GBM (Pr = 0.31), with ILN being the worst performing model (Pr = 0.19). This result should not be surprising. Human, Neanderthal, chimp and gorilla are all very closely related, and the strict clock is usually not rejected in comparisons of such closely related species. Indeed, a likelihood-ratio test fails to reject the strict clock in these data (see Box 2 in dos Reis et al. 2016, where the data are analysed).
Update – Feb 2020: Function bayes.factors now performs a parametric bootstrap of posterior probabilities, so you should see an element called $pr.ci with the confidence intervals for the posterior probabilities.
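One way such a parametric bootstrap could work is sketched below. This is an assumption made for illustration, not a description of the package internals: each log-marginal likelihood is redrawn from a normal distribution centred on its estimate with standard deviation equal to its S.E., the posterior probabilities are recomputed on each replicate, and the 2.5% and 97.5% quantiles are reported. The names post.prob, sims and R are hypothetical.

```r
set.seed(1)  # only so that the sketch is reproducible

# Estimates and S.E.'s reported above for the three clock models
logml <- c(CLK = -32185.72, ILN = -32186.69, GBM = -32186.20)
se    <- c(CLK = 0.035,     ILN = 0.045,     GBM = 0.060)

# Posterior model probabilities under equal prior probabilities
post.prob <- function(x) { bf <- exp(x - max(x)); bf / sum(bf) }

R <- 10000
# Each column is one replicate: log-marginal likelihoods redrawn from
# Normal(logml, se), assumed independent across models
sims <- replicate(R, post.prob(rnorm(length(logml), mean = logml, sd = se)))
rownames(sims) <- names(logml)

pr.ci <- t(apply(sims, 1, quantile, probs = c(0.025, 0.975)))
print(round(pr.ci, 2))
```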
Update – Dec 2020: For large datasets, MCMC sampling of power posteriors can be quite time consuming. In such cases, a good strategy may be to run shorter MCMC chains with more frequent sampling, for example using nsample = 10000 and sampfreq = 2 in the control file. To compensate for the shorter chains, one may then use many more beta points (for example 32 or 64). However, in such cases, the delta approximation used to calculate the S.E. (see Xie et al. 2011) may not work well due to the smaller sample size. Instead, we can use the stationary block bootstrap method of Politis and Romano (1994) to calculate the S.E. New functions are now provided in the package to do this. In R:
r <- 100 # we will use 100 bootstrap replicates
setwd("../clk")
mcmc3r::block.boot(r) # generate bootstrap replicates
# (for each beta value, replicates are stored in files lnL1.txt to lnL100.txt,
# with lnL0.txt containing the original log-likelihood values)
clk.boot <- mcmc3r::stepping.stones.boot(r) # calculate logml on the replicates
# repeat for the iln and gbm models
setwd("../iln")
mcmc3r::block.boot(r); iln.boot <- mcmc3r::stepping.stones.boot(r)
setwd("../gbm")
mcmc3r::block.boot(r); gbm.boot <- mcmc3r::stepping.stones.boot(r)
# You can now look at the S.E.'s calculated using the bootstrap samples
clk.boot$se; iln.boot$se; gbm.boot$se
# calculate bayes factors and posterior model probabilities
mcmc3r::bayes.factors(clk.boot, iln.boot, gbm.boot)
Function block.boot uses the MCMC sample of power likelihoods to generate new pseudo-MCMC samples of likelihoods. This is done by extracting, with replacement, blocks of consecutive likelihood values within each power posterior MCMC. The block sizes are random and have a geometric distribution. The sampled blocks are then stitched together to form the pseudo-MCMC sample. This is necessary because the MCMC is a stationary autocorrelated time series and the standard bootstrap method cannot be used. Politis and Romano (1994) show the method works well to approximate the distribution of statistics of stationary time series. The stepping.stones.boot function goes through the r=100 bootstrap replicates and calculates a marginal likelihood on each replicate. These in turn are used to obtain the S.E. and a 95% C.I. for the original log-marginal likelihood estimate.
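The resampling scheme described above can be sketched in a few lines. The function below is a hypothetical illustration of the stationary bootstrap of Politis and Romano (1994), not the code used by block.boot; the argument p (one over the mean block length) and the circular wrap-around at the end of the series are choices made for this sketch.

```r
# Hypothetical helper, not the mcmc3r implementation.
# x: autocorrelated series (e.g. log-likelihoods sampled by the MCMC)
# p: probability of starting a new block at each step, so block lengths
#    are geometric with mean 1 / p
stationary.boot.sketch <- function(x, p = 0.1) {
  n <- length(x)
  idx <- integer(n)
  idx[1] <- sample.int(n, 1)
  for (i in seq_len(n)[-1]) {
    if (runif(1) < p) {
      idx[i] <- sample.int(n, 1)       # start a new block at a random position
    } else {
      idx[i] <- idx[i - 1] %% n + 1    # continue the block, wrapping around
    }
  }
  x[idx]
}

set.seed(1)
lnL <- cumsum(rnorm(200))   # toy autocorrelated series standing in for MCMC output
pseudo <- stationary.boot.sketch(lnL, p = 1 / 20)
```

Because consecutive values are kept together within blocks, the pseudo-sample retains much of the autocorrelation of the original chain, which an ordinary bootstrap of individual values would destroy.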
In our example here, the S.E.’s from the block boot method are quite close to those from the delta method, and the 95% CI’s on the posterior model probabilities, Pr(M|D), are virtually identical:
M | –log L | S.E. (delta) | S.E. (boot) | Pr(M|D) | 95% CI (delta) | 95% CI (boot) |
---|---|---|---|---|---|---|
CLK | 32,185.72 | ± 0.035 | ± 0.034 | 0.500 | (0.47, 0.53) | (0.47, 0.53) |
ILN | 32,186.69 | ± 0.045 | ± 0.043 | 0.190 | (0.17, 0.21) | (0.17, 0.21) |
GBM | 32,186.20 | ± 0.060 | ± 0.068 | 0.310 | (0.28, 0.34) | (0.28, 0.34) |
For short MCMC chains the delta method’s S.E. estimates will deteriorate rapidly, but those from the boot method will be more reliable.
Thermodynamic integration with Gaussian quadrature
You can repeat steps 3 to 7 using the thermodynamic integration method. Make sure you create a new set of clk/, iln/ and gbm/ directories to run the analyses. In step 3, generate the b values and directories using R with:
b = mcmc3r::make.beta(n=32, method="gauss-quad")
mcmc3r::make.bfctlf(b, ctlf="mcmctree.ctl", betaf="beta.txt")
This will select b values using the n-point Gauss-Legendre quadrature rule (see Rannala and Yang, 2017, for details) and prepare the necessary mcmctree.ctl files. Note that we are using n = 32 points. Then continue with step 4 to run MCMCTree. In step 5, you again use R to parse MCMCTree’s output for the CLK model, but this time you use a different function:
clk <- mcmc3r::gauss.quad()
In the thermodynamic integration method, the log-marginal likelihood is the integral of the path formed by the mean log-likelihoods sampled as a function of the b value used (that is, the area above the path, between the path and zero). You can plot this easily in R:
plot(clk$b, clk$mean.logl, pch=19, col=rgb(0,0,0,alpha=0.3), xaxs="i",
xlim=c(0,1), xlab="b", ylab="mean logL", main="CLK model")
lines(clk$b, clk$mean.logl)
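For readers who want to see the quadrature itself, the sketch below computes Gauss-Legendre nodes and weights on [0, 1] with the Golub-Welsch eigenvalue method and applies them to a curve of mean log-likelihoods. Both functions are hypothetical helpers written for this illustration; they are not the implementation behind make.beta or gauss.quad.

```r
# Hypothetical helpers, not the mcmc3r implementation.
# Nodes and weights of the n-point Gauss-Legendre rule rescaled to [0, 1],
# computed with the Golub-Welsch eigenvalue method
gauss.legendre01 <- function(n) {
  J <- matrix(0, n, n)                   # Jacobi matrix for Legendre polynomials
  if (n > 1) {
    k <- seq_len(n - 1)
    off <- k / sqrt(4 * k^2 - 1)
    J[cbind(k, k + 1)] <- off
    J[cbind(k + 1, k)] <- off
  }
  e <- eigen(J, symmetric = TRUE)
  o <- order(e$values)
  list(b = (e$values[o] + 1) / 2,        # nodes mapped from [-1, 1] to [0, 1]
       w = e$vectors[1, o]^2)            # weights 2 * v1^2, halved by the mapping
}

# Thermodynamic integration: weighted sum of the mean log-likelihoods
ti.sketch <- function(mean.logl, w) sum(w * mean.logl)

# Toy check: the 4-point rule integrates b^3 over [0, 1] exactly (1/4)
gl <- gauss.legendre01(4)
toy <- ti.sketch(gl$b^3, gl$w)
print(toy)
```

With n = 1 the rule reduces to a single node at b = 0.5 with weight 1, so the whole integral is approximated by one mean log-likelihood; this makes it easy to see why very small n can give a badly biased estimate.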
You can now repeat step 6 to calculate the marginal likelihoods for the ILN and GBM models, and then repeat step 7 to obtain the posterior model probabilities. The log-marginal likelihood estimates and S.E.’s are:
Model | log-marginal likelihood | S.E. |
---|---|---|
CLK | –32,185.66 | ± 0.023 |
ILN | –32,186.61 | ± 0.036 |
GBM | –32,188.17 | ± 0.055 |
The log-marginal likelihood estimates here are very close to those obtained under the stepping stones method. However, note we used n = 32 points to converge to the same result as with stepping stones. Thus, the stepping stones method appears more efficient. Note the S.E. only gives you an idea of the precision, not the accuracy, of the estimate. It is possible to obtain very precise estimates of the marginal likelihood (the S.E. is very small) that are biased (i.e. the estimate is far from the true value). This is due to the discretisation bias in the calculation of the ‘thermodynamic’ integral (see Lartillot and Philippe 2006, and Xie et al. 2011). This will occur especially if n is small. Try n = 1 (which performs very poorly).
Other applications of Bayes factors in MCMCTree
The strategy used here to select for a relaxed-clock model can also be used to select for the tree topology or the substitution model.
Say you have 3 competing tree topologies, and you want to calculate the posterior probability of each. You can prepare 3 Newick files with the different topologies, and create 3 directories, in which you will run the three separate marginal likelihood calculations. In this case you would prepare 3 mcmctree.ctl templates, with the same parameters for all analyses, except for the treefile variable, which you would edit to point to the appropriate tree topology. The rest of the procedure is then exactly the same as when selecting the relaxed-clock model.
A similar approach can be used to select a substitution model. Say you want to compare HKY85 vs. HKY85+Gamma. In this case you would have two mcmctree.ctl templates, differing only in the alpha variable in the template (alpha = 0 for no gamma model, and, say, alpha = 0.5 to activate the gamma model). The rest follows as above.
Finally, the mcmc3r package can also be used to prepare bpp.ctl files to calculate Bayes factors and model probabilities for species delimitation with BPP (Rannala and Yang, 2017). The procedure for this is essentially the same as the one used here with MCMCTree.
References
- dos Reis et al. (2016) Bayesian molecular clock dating of species divergences in the genomics era. Nature Reviews Genetics, 17: 71–80.
- dos Reis et al. (2017) Using phylogenomic data to explore the effects of relaxed clocks and calibration strategies on divergence time estimation: Primates as a test case. bioRxiv.
- Lartillot and Philippe (2006) Computing Bayes factors using thermodynamic integration. Systematic Biology, 55: 195–207.
- Lepage et al. (2007) A general comparison of relaxed molecular clock models. Molecular Biology and Evolution, 24: 2669–2680.
- Politis and Romano (1994) The stationary bootstrap. Journal of the American Statistical Association, 89: 1303–1313.
- Rannala and Yang (2007) Inferring speciation times under an episodic molecular clock. Systematic Biology, 56: 453–466.
- Rannala and Yang (2017) Efficient Bayesian species tree inference under the multispecies coalescent. Systematic Biology, 66: 823–842.
- Xie et al. (2011) Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology, 60: 150–160.
- Yang (2014) Molecular Evolution: A Statistical Approach. Oxford University Press.