Genome-wide association studies (GWASs) are often confounded by population stratification and structure. Linear mixed models (LMMs) are a powerful class of methods for uncovering genetic effects, while controlling for such confounding. LMMs include random effects for a genetic similarity matrix, and they assume that a true genetic similarity matrix is known. However, uncertainty about the phylogenetic structure of a study population may degrade the quality of LMM results. This may happen in bacterial studies in which the number of samples or loci is small, or in studies with low-quality genotyping. In this study, we develop methods for linear mixed models in which the genetic similarity matrix is unknown and is derived from Markov chain Monte Carlo estimates of the phylogeny. We apply our model to a GWAS of multidrug resistance in tuberculosis, and illustrate our methods on simulated data.



Information & Authors


Published In

Journal of Computational Biology
Volume 30, Issue Number 2, February 2023
Pages: 189–203
PubMed: 36374242


Published online: 7 February 2023
Published in print: February 2023
Published ahead of print: 14 November 2022






School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China.
Shufei Ge*
Institute of Mathematical Sciences, ShanghaiTech University, Shanghai, China.
Benjamin Sobkowiak
Department of Mathematics, Simon Fraser University, Burnaby, Canada.
Liangliang Wang
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada.
Louis Grandjean
Department of Infectious Diseases, University College London, London, United Kingdom.
Caroline Colijn
Department of Mathematics, Simon Fraser University, Burnaby, Canada.
Lloyd T. Elliott
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada.


*These authors contributed equally to this study.
Address correspondence to: Dr. Lloyd T. Elliott, Department of Statistics and Actuarial Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada.

Authors' Contributions

S.W. and S.G. contributed to methodology, formal analysis, software, writing—original draft, and writing—review and editing. B.S. contributed to writing—original draft and writing—review and editing. L.G. contributed to data curation. L.W. contributed to writing—review and editing. C.C. contributed to conceptualization and writing—review and editing. L.T.E. contributed to conceptualization, methodology, writing—original draft and writing—review and editing.

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

This research is supported by the National Natural Science Funds of China (Grant No. 12101333), the Natural Science Funds of Tianjin (Grant No. 21JCQNJC00050), the Shanghai Science and Technology Program (Grant No. 21010502500), the startup fund of ShanghaiTech University, and NSERC (Grant Nos. RGPIN/05484-2019, RGPIN/2019-06131, and DGECR/00118-2019). L.T.E. is supported by a Michael Smith Health Research BC Scholar Award.
