Research Article
Published Online: 30 April 2015

PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences

Publication: Journal of Computational Biology
Volume 22, Issue Number 5


We introduce PASTA, a new multiple sequence alignment algorithm. Given a guide tree, PASTA uses a new technique to produce an alignment, which enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate: slightly better than SATé trees, and substantially better than trees from other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.




Information & Authors


Published In

Journal of Computational Biology
Volume 22, Issue Number 5, May 2015
Pages: 377 - 386
PubMed: 25549288


Published in print: May 2015
Published online: 30 April 2015
Published ahead of print: 30 December 2014






Siavash Mirarab
Department of Computer Science, University of Texas at Austin, Austin, Texas.
Nam Nguyen
Department of Computer Science, University of Texas at Austin, Austin, Texas.
Sheng Guo
Genomics and Computational Biology Graduate Group, University of Pennsylvania, Philadelphia, Pennsylvania.
Li-San Wang
Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania.
Junhyong Kim
Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania.
Tandy Warnow
Department of Computer Science, University of Texas at Austin, Austin, Texas.
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.


Address correspondence to:
Prof. Tandy Warnow
Department of Computer Science
University of Illinois at Urbana-Champaign
201 North Goodwin Avenue
Urbana, IL 61801
E-mail: [email protected]

Author Disclosure Statement

No competing financial interests exist.
