bppr: Bayes factors using stepping stones in BPP
In this tutorial we will calculate Bayes factors to compare two species-tree models: (M1) human and chimpanzee are more closely related to each other than to gorilla and orangutang, and (M2) chimpanzee and gorilla are the most closely related species. For a long time, it was thought that chimp and gorilla were the most closely related. This view was challenged when the first molecular sequences from the apes were analysed.
This tutorial assumes that: (i) you have BPP 4 and bbpr
installed and working in your system, (ii) you are familiar with the multi-species coalescent and Bayesian model selection theory, and (iii) you have basic knowledge of R and the command line in your operating system. An introductory tutorial for BPP is given in Yang (2015). Overviews of the multi-species coalescent and Bayesian model selection are given in chapters 9 and 7 of Yang (2014).
The power posterior is the posterior with the likelihood elevated to the power of b. In the stepping-stones method, one chooses n values of b between 0 and 1, and then one runs n independent MCMC chains corresponding to each b value. Once the runs have completed, one collects the sampled likelihoods for each run, and the marginal likelihood of the model is then calculated by the stepping-stones algorithm. The procedure is repeated for the K models being tested, and the marginal likelihoods are then used to calculate Bayes factors and posterior model probabilities. With BPP and bppr
, the general procedure involves 7 steps:
- Select the data to be analysed.
- Prepare a template
bpp.ctl
control file with the model to be tested, prior and parameters. - Use
bppr
to select n appropriate values for b. - Run BPP n times to sample from the n power posteriors.
- Use
bppr
to collect the results of the runs and calculate the marginal likelihood of the model. - Repeat 2-5 for other models as necessary.
- Calculate Bayes factors and posterior model probabilities.
1. Select the data to be analysed
We will analyse 50 loci from the 4 ape genomes (human, chimp, gorilla and orang). Directory bppr/misc
contains the alignment neutral.25loci.4sp.txt
, the myImap.txt
file (which maps individuals to species), and a template bpp.ctl
file. Create a directory called ape4s/
in a suitable place in your system, and copy the alignment, Imap, and control files into it. You can view the alignment with a text editor. The first two loci look like:
4 126
human^1 CTG-CCCCCACCTCCTCCAGCCCCC-AGGGTTGGACC-AGAAAGCCTTGGCTGCCTCTGAACAGCAGGGATTGTCTGGCCA-GGGGATGCTTG-AGGGACAGAGAA-CCCAGCCTGGAGGGTGCAA
chimp^2 CTG-CCCCCACCTCCTCCAGCCCTC-AGGGTCGGACC-AGAAAGCCTTGGCTGCCTCTGAACAGCAGGGATTGTCTGGCCA-GGGGATGCTTG-AGGGACAGAGAA-CCCAGCCTGGAGGGTGCAA
gorilla^3 CTG-CCCCCA-CTCCTCCAGCCCCC-AGGGTCGGACC-AGAAAGCCTTGGCTGCCTCTGAACAGCAGGGACTGTCTGGCCA-GGGGATGCTTG-AGGGACAGAGAA-CCCAGCCTGGAGGGTGCAA
orang^4 CTGCCCCCCACCTC--CCAGCCCCCTGGGGTCGGACCAAAAAAGCCCTGGCTGCCTCTGAACAGCAGTGACTGTCTGGCCAGGGGGGTGCTTGAAGGGACAGAGAA-CCCAGCCTGGAGGGTGCAA
4 404
human^1 GTTTCGCTGGTGGCCATGCCATTCTGCA-TCCCCAGGGGCAGCATTAGAATTCCAGCTGCTCCTCACCGGCACTTGCTATTGTTAAACTTGTTGTTCTTTGTTTATAATTTAGCTGTTCTTATA-GGTGTGT-AGTGGTGTCCCATTGTGGTTTTAATTTACATTTTCTTCATCTCATGTTTAATGATGTTGAGCGTCTTTTCATGT-GCGAATTTCTCATCCACGTACGTATCTTCTCTGGTGAAGCATCTGTTCA-GTCTTTTGCCC------ATTTTTTTTTTTTTT--TTGAGATGGAGTTTCATTCACTCTCATTGCCCAGGCTGGAGTGCAATGGTGCAGTCTCAGCTCACTGCAAACTCCGCCTCCCAGGTTCAAGTGATTCTCCTGCCTCAGCCTC
chimp^2 ATTTCGCTGGTGGCCATGCCATTCTGCA-TCCCCAGGGGCAGCATTAGAATTCCAGCTGCTCCTCACCGGCACTTGCTATTGTTAAACTTGTTGTTCTTTGTTTATAATTTAGCTGTTCTTATA-GGTGTGT-AGTGTTGTCCCATTGTGGTTTTAATTTACATTTTCTTCATCTCATGTTTAATGATGTTGAGCGTCTTTTCACGT-GCGAATTTCTCATCC----ACGTATCTTCTCTGGTGAAGCATCTGTTCA-GTCTTTTGCCC------ATTTTTTTTTTTTTTTTTTGAGATGGAGTTTCATTCACTCTCATTGCCCAGGCTGGAGTGCAATGGTGCAGTCTCAGCTCACTGCAAACTCTGCCTCCCAAGTTCAAGCGATTCTCCTGCCTCAGCCTC
gorilla^3 GTTTCGCTGGTGGCCATGCCATTCTGCA-TCCCCAGGGGCAGCATTAGAATTCCAGC-GCTCCTCACCGGCACCTGCTATTGTTAAACTTTTTGTTCTTTGTTTATAATTTAGCTGTTCTTATA-GGTGTGC-AGTGGCGTCCCATTGTGGTTTTAATTTACATTTTCTTCATCTCATGTTTAATGATGTTGAGCGTCTTTTCATGT-GCGACCTTCTCATGC----ACGTATCTTCTCTGGTGAAGCATCTGTTCG-GTCTTTTGCCC------ATTTTTTTTTTTTTTTTTTGAGATGGAGTTTCATTCACTCTCATTGCCCAGGCTGGAGTGCAATGGTGCGGTCTCAGCTCACTGCAAACTCCGCCTCCCAGGTTCAAGGGATTCTCCTGCCTCAGCCTC
orang^4 GTTTCGCTGGTGGCCATGCCATTCTGCATTCCCCAGGGGCAGCATTAGAGTTCCAGCTGTTCCTCACCGGCACTTGCTATTGTTAAACTTTTTGTTCTGTGTTTATAATTTAGCTGTTCTTATAGGGTGTGTAAGTGGTGTCCCATTGTGGTTTTAATTTACATTTTCTTCATCTCATG-TTAATGATGTCGAGCATCTTTTCATGTGGCTAATTTCTCATCC----ACATATCTTCTCTGGTGAAGCATCTCCTCATGTCTTTTGCCC------ATTTTTTTTTTTTTT-----AGATGCAGTTTCATTCACTCTCATTGCCCAGGCTAGAGTGCAATGGTGCGGTCTCGGCTCACTGCAAA------CTCCCGGGTTCAAGCGATTCTCCTGCCTCAGCCTC
The alignment file contains the first 25 neutral loci analysed by Burgess and Yang (2008). Please refer to BPP’s manual for alignment and Imap files formats. The myImap.txt
file, although necessary, is not important for this tutorial and we will not delve into it.
2. Prepare the BBP control file template
The template control file bpp.ctl
looks like this:
* This is a BPP 4.0 control file !!!
seed = -1
seqfile = ../../neutral.25loci.4sp.txt
Imapfile = ../../myImap.txt
outfile = out.txt
mcmcfile = mcmc.txt
speciesdelimitation = 0 * fixed species tree
speciesmodelprior = 1 * 0: uniform labeled histories; 1:uniform rooted trees; 2:user probs
species&tree = 4 human chimp gorilla orang
1 1 1 1
(((human, chimp), gorilla), orang);
usedata = 1 * 0: no data (prior); 1:seq like
nloci = 25 * number of data sets in seqfile
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
thetaprior = 3 0.008 * 2 500 # invgamma(a, b) for theta
tauprior = 3 0.036 * 4 219 1 # invgamma(a, b) for root tau & Dirichlet(a) for other tau's
locusrate = 0 # (0: No variation, 1: estimate, 2: from file) & a_Dirichlet (if 1)
heredity = 0 # (0: No variation, 1: estimate, 2: from file) & a_gamma b_gamma (if 1)
finetune = 1: .012 .003 .0001 .00005 .004 .01 .01 # auto (0 or 1): finetune for GBtj, GBspr, theta, tau, mix, locusrate, seqerr
print = 1 0 0 0 * MCMC samples, locusrate, heredityscalars Genetrees
burnin = 2000
sampfreq = 10
nsample = 10000
For the full details of control file specification, please refer to BPP’s manual. The first four variables, seqfile
, Imapfile
, outfile
, and mcmcfile
give the names of the various input and output files for the analysis. Note that speciesdelimitation = 0
, which means we will be performing our analysis on a fixed species tree. Here, BPP will only estimate the relative node ages (the tau’s) for the ape phylogeny. Now note the specification of species&tree
:
(((human, chimp), gorilla), orang);
which correspond to model M1 (human and chimp as the most closely related species). We will calculate the marginal likelihood of this model first. Later, we will modify this file with a different tree model.
Note that this is a BPP 4 control file. BPP 3.4 and 4.0 now use the inverse gamma distribution to specify the prior for the tau’s and theta’s (the nucloetide diversity parameters, theta = 4Nu). Here we use thetaprior = 3 0.008
which corresponds to an inverse gamma with parameters 3 and 0.008. In the old gamma prior specification, to achieve the same prior mean, we would have used thetaprior = 2 500
. Note that the gamma and inverse gamma distribution although related, are actually quite different. If you accidentally use the old parameters values for the gamma as the input parameters for the inverse gamma you will most likely get nonsensical results.
3. Select b values
Create directory ape4s/M1
and copy the bpp.ctl
file into it. Change into the ape4s/M1
directory and start R. In R type:
b = bppr::make.beta(n=8, a=4, method="step-stones")
This will select 8 b values for the stepping stones algorithm. Constant a
controls the distribution of b between 0 and 1. Large values of a
lead to b points clustered closer to zero (i.e. close to the prior), which is good for large alignments.
In R type:
bppr::make.bfctlf(b, ctlf="bpp.ctl", betaf="beta.txt")
The side effect of this function is the creation of n directories, each containing a modified bpp.ctl
file, which will tell BPP to sample from the appropriate power posterior. For example, M1/8/bpp.ctl
contains the following last line:
BayesFactorBeta = 0.586181640625
which corresponds to b = 0.586 … .
4. Run BPP
Use a separate terminal window to run BPP n = 8 times, inside each directory. In the Mac terminal I use:
# You must be in ape4s/M1/
for d in `seq 1 1 8`; do cd $d; bpp -cfile bpp.ctl &>/dev/null & cd ..; done
This should run in less than a minute in a modern computer. For large datasets, a BPP analysis can take many hours, days, or even weeks. In such cases it is best to submit each BPP run as a job in a high-performance computer cluster.
5. Collect BPP results and calculate marginal likelihood
From R (and making sure you are in the ape4s/M1
directory) type:
M1 <- bppr::stepping.stones()
M1
# $logml
# [1] -23362.79
# $se
# [1] 0.1660856
# $mean.logl
# [1] -23597.38 -23600.81 -23521.69 -23448.81 -23409.58 -23390.04 -23374.12
# [8] -23348.48
# $b
# [1] 1.000000e-300 2.441406e-04 3.906250e-03 1.977539e-02 6.250000e-02
# [6] 1.525879e-01 3.164062e-01 5.861816e-01
This calculates the log-marginal likelihood, M1$logml
, by the stepping stones algorithm. The other parameters are the standard error of the log-marginal likelihood estimate, M1$se
, the mean of the log-likelihoods sampled at each b point, M1$mean.logl
, and the b points, M1$b
. Your values may look slightly different given the random nature of MCMC. Don’t worry if you see warning messages about the standard error. These warnings simply mean you may need to run the MCMC longer to obtain more precise estimates of the errors.
It is useful to do a diagnostic plot of the mean log-likelihoods vs. the b points.
plot(M1$mean.logl ~ M1$b, pch=19, ty="b", xlab="b", ylab="mean log-likelihood")
You should obtain a smooth curve increasing from b = 0 towards the right. Any wiggliness in the plot means there may have been convergence problems in the MCMC, or simply that the MCMCs did not run long enough. If you see anything strange with the plot, or if the standard error is too large, you can increase nsample
and/or samplefreq
to improve the estimates.
The 95% confidence interval of the log-marginal likelihood is
M1$logml + 2 * c(-M1$se, M1$se)
# [1] -23363.13 -23362.46
6. Repeat 2-5 for the M2 model
Create directory ape4s/M2
and copy the bpp.ctl
file into it. Edit the file so that the tree looks like
(((gorilla, chimp), human), orang);
That is, we have swapped the positions of gorilla and human so that gorilla is now more closely related to the chimp.
Now repeat steps 3-5:
(3) Select b values. In R:
setwd("../M2") # This assumes you are in ape4s/M1
b = bppr::make.beta(n=8, a=4, method="step-stones")
bppr::make.bfctlf(b, ctlf="bpp.ctl", betaf="beta.txt")
(4) Now run BPP 8 times. In my Mac BASH terminal I do:
# Make sure you are in ape4s/M2
for d in `seq 1 1 8`; do cd $d; bpp -cfile bpp.ctl &>/dev/null & cd ..; done
(5) Collect BPP results and calculate marginal likelihood for M2. In R:
# Make sure R workding directory is now M2
M2 <- bppr::stepping.stones()
M2
# $logml
# [1] -23365.93
# $se
# [1] 0.2044337
# $mean.logl
# [1] -23662.16 -23656.17 -23555.31 -23465.32 -23418.52 -23396.46 -23376.87
# [8] -23348.82
# $b
# [1] 1.000000e-300 2.441406e-04 3.906250e-03 1.977539e-02 6.250000e-02
# [6] 1.525879e-01 3.164062e-01 5.861816e-01
You can see that M1 has a larger marginal likelihood (-23,362.79) than M2 (-23,365.93).
7. Calculate Bayes Factors and posterior model probabilities
We can now convert the marginal likelihoods for M1 and M2 into Bayes factors and the posterior model probabilities. In R:
bppr::bayes.factors(M1, M2)
# $bf
# [1] 1.00000000 0.04335463
# $pr
# [1] 0.95844689 0.04155311
The Bayes factor of M2 over M1 is 0.043 (the first, 1.0, is the Bayes factors of the reference model with itself, i.e. M1 over M1). The relative posterior probability for model M1 is 0.958, while for M2 it is 0.042. These are relative posterior probabilities as we did not test all the possible species-tree models (there are 15 possible rooted trees for 4 species). Thus, we have statistically significant evidence in favour of M1 vs the M2. Because the standard errors for the log-marginal likelihoods are relatively high, your probabilities may show noticeable differences.
If the posterior probabilities for two models are close to 0 or 1, an error of 1 in the log BF may be ok (in other words, we should not care much whether the posterior is 0.999 or 0.9999). If they are moderate, we should like the posterior probabilities to be accurate at the 1% level (that is 0.69 is ok when the correct value is 0.70). In this case the log BF should be accurate at the 0.01 level. Because exp(x) = 1 + x when x is close to 0, the error in log BF becomes the relative error in BF or posterior probabilities (thanks to Ziheng Yang for this note).
[ Update – Feb 2020: Function bayes.factors
now performs parametric bootstrap of posterior probabilities, so you should see an element called $pr.ci
with the confidence intervals for the posterior probabilities. ]
Exercise: The standard errors for the estimated log-marginal likelihoods are 0.17 and 0.20 for models M1 and M2 respectively. These values are rather high. Modify the bpp.ctl
file and increase samplefreq = 40
and nsample = 20000
. Run the analysis again and re-calculate the log-marginal likelihoods and standard errors. Roughly, you may want twice the sum of the standard errors to be much smaller than the difference between the log-likelihoods for the two models. How much smaller are the standard errors for the new analysis? Are the calculated posterior probabilities different?
Exercise: In the example above we compared two speciation models. You can test for a third model, M3, in which human and chimp are assummed to be the same species. Create a directory called ape4s/M3
and copy the bpp.ctl
file over. Edit the file, and any other input BPP files as appropriate so that BPP considers human and chimp to be two individuals from the same species. You may want to do the tutorial in Yang (2015) if you are not familiar with species delimitation in BPP.
Exercise (harder): bppr
also implements thermodynamic integration by Gaussian quadrature (Rannala and Yang, 2017). Repeat the Bayes factors calculations using this technique (check the help files for functions gauss.quad
and make.beta
in the package).
Bibliography
-
Burgess, R. and Z. Yang. 2008 Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Molecular Biology and Evolution, 25: 1979-1994.
-
Flouri T, Xiyun J, Rannala B, Yang Z. 2018. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Mol Biol Evol.
-
Rannala, B., and Z. Yang. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Systematic Biology, 66: 823-842.
-
Xie et al. (2011) Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology, 60: 150–160.
-
Yang (2014) Molecular Evolution: A Statistical Approach. Oxford University Press.
-
Yang, Z. 2015. The BPP program for species tree estimation and species delimitation. Current Zoology, 61: 854-865.