Research Article
Published Online: 20 May 2021

NetMix: A Network-Structured Mixture Model for Reduced-Bias Estimation of Altered Subnetworks

Publication: Journal of Computational Biology
Volume 28, Issue Number 5


A classic problem in computational biology is the identification of altered subnetworks: subnetworks of an interaction network that contain genes/proteins that are differentially expressed, highly mutated, or otherwise aberrant compared with other genes/proteins. Numerous methods have been developed to solve this problem under various assumptions, but the statistical properties of these methods are often unknown. For example, some widely used methods are reported to output very large subnetworks that are difficult to interpret biologically. In this work, we formulate the identification of altered subnetworks as the problem of estimating the parameters of a class of probability distributions that we call the Altered Subset Distribution (ASD). We derive a connection between a popular method, jActiveModules, and the maximum likelihood estimator (MLE) of the ASD. We show that the MLE is statistically biased, explaining the large subnetworks output by jActiveModules. Based on these insights, we introduce NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the ASD. We demonstrate that NetMix outperforms existing methods in identifying altered subnetworks on both simulated and real data, including the identification of differentially expressed genes from both microarray and RNA-seq experiments and the identification of cancer driver genes in somatic mutation data.
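The mixture-model idea in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes that gene scores are unit-variance normal, with mean 0 for background genes and an unknown mean `mu` for altered genes, and it fits only the two mixture parameters (the altered fraction `alpha` and `mu`) by expectation-maximization. The function names, starting values, and simulated data sizes are hypothetical, and the network-constrained step that selects a subnetwork is omitted.

```python
import math
import random


def normal_pdf(x, mean):
    """Density at x of a normal distribution with the given mean and variance 1."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def responsibilities(scores, alpha, mu):
    """Posterior probability that each score came from the altered component."""
    out = []
    for x in scores:
        altered = alpha * normal_pdf(x, mu)
        background = (1.0 - alpha) * normal_pdf(x, 0.0)
        out.append(altered / (altered + background))
    return out


def fit_mixture(scores, n_iter=500):
    """EM for the mixture alpha * N(mu, 1) + (1 - alpha) * N(0, 1).

    The background component is held fixed at N(0, 1) (an assumption of
    this sketch); only alpha and mu are estimated. Returns (alpha, mu).
    """
    alpha, mu = 0.5, 2.0  # arbitrary starting values (assumption)
    for _ in range(n_iter):
        resp = responsibilities(scores, alpha, mu)  # E-step
        total = sum(resp)
        alpha = total / len(scores)  # M-step: mixing weight
        mu = sum(r * x for r, x in zip(resp, scores)) / total  # M-step: mean
    return alpha, mu


# Simulated scores (illustrative values, not taken from the article).
rng = random.Random(0)
n_genes, n_altered, true_mu = 2000, 200, 3.0
scores = [rng.gauss(true_mu, 1.0) for _ in range(n_altered)]
scores += [rng.gauss(0.0, 1.0) for _ in range(n_genes - n_altered)]

alpha_hat, mu_hat = fit_mixture(scores)
size_hat = round(alpha_hat * n_genes)  # estimated number of altered genes
print(f"estimated altered fraction: {alpha_hat:.3f}")
print(f"estimated altered mean:     {mu_hat:.3f}")
print(f"estimated subset size:      {size_hat} (simulated with {n_altered})")
```

In this sketch the subset size is estimated from the fitted mixing weight, rather than by maximizing a score over candidate subsets. That difference is one way to read the abstract's contrast between the biased MLE and the mixture-based estimate.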






Published In

Journal of Computational Biology
Volume 28, Issue Number 5, May 2021
Pages: 469–484
PubMed: 33400606


Published in print: May 2021
Published ahead of print: 5 January 2021




Data Availability

NetMix is available online at



Matthew A. Reyna*
Department of Biomedical Informatics, Emory University, Atlanta, Georgia, USA.
Uthsav Chitra*
Department of Computer Science, Princeton University, Princeton, New Jersey, USA.
Rebecca Elyanow
Department of Computer Science, Princeton University, Princeton, New Jersey, USA.
Department of Computer Science, Brown University, Providence, Rhode Island, USA.
Benjamin J. Raphael
Department of Computer Science, Princeton University, Princeton, New Jersey, USA.


*These authors contributed equally.
Address correspondence to: Prof. Benjamin J. Raphael, Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540, USA

Author Disclosure Statement

B.J.R. is a cofounder of, and consultant to, Medley Genomics.

Funding Information

M.A.R. was supported in part by the National Cancer Institute of the NIH (Cancer Target Discovery and Development Network grant U01CA217875). B.J.R. was supported by US National Institutes of Health (NIH) grants R01HG007069 and U24CA211000.
