The Adaptive Evolution Database

TAED is a database of phylogenetically indexed gene families. It contains multiple sequence alignments from MAFFT [1], maximum likelihood phylogenetic trees from PhyML [2], bootstrap values for each node, dN/dS ratios for each lineage from the free ratios model in PAML [3], and a label for each node as a speciation or duplication event, derived from gene tree/species tree reconciliation using SoftParsMap [4]. The phylogenetic indexing enables simultaneous viewing of lineages with high dN/dS that occurred along the same species tree branches. Resources from the Protein Data Bank (PDB) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [5] have been incorporated into the TAED analysis to detect substitutions along each branch within the phylogenetic tree and to assess selection within pathways.

The database can be entered through the species tree, where each species tree branch links to the underlying gene families: either all gene families in which that species tree lineage is represented, or the subset with dN/dS > 1. It can also be entered by searching for genes of interest or through an alphabetical list of gene family annotations. Ultimately, it is useful for identifying candidate genes to answer questions such as "What makes this species unique?" or "Which genes show signals of diversification along which lineages?"
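The species tree entry point described above can be illustrated with a minimal Python sketch. It is not TAED's implementation: the record layout (family, species tree branch, dN/dS), the function name `families_by_branch`, and the example family and branch names are all hypothetical, chosen only to show how one index can serve both the "all families" view and the dN/dS > 1 subset.

```python
from collections import defaultdict

def families_by_branch(records, min_dnds=None):
    """Index gene families by species tree branch.

    records: iterable of (family, branch, dnds) tuples (hypothetical layout).
    min_dnds: if given, keep only lineages whose dN/dS exceeds this value.
    """
    index = defaultdict(set)
    for family, branch, dnds in records:
        if min_dnds is None or dnds > min_dnds:
            index[branch].add(family)
    # Sort family names so the output is stable and easy to read.
    return {branch: sorted(fams) for branch, fams in index.items()}

# Made-up example records, not real TAED data.
records = [
    ("fam1", "Mammalia", 1.8),
    ("fam2", "Mammalia", 0.3),
    ("fam1", "Aves", 0.9),
    ("fam3", "Aves", 2.4),
]

all_families = families_by_branch(records)             # every represented family
selected = families_by_branch(records, min_dnds=1.0)   # subset with dN/dS > 1
```

With these example records, the unfiltered index lists two families per branch, while the filtered index keeps only the family whose lineage on that branch has dN/dS above 1.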

Some Methodological Details of the Analysis

A total of 143,806 protein families have been created and analyzed. The families were built from an all-against-all BLAST search of all proteins within Chordata, whose hits represent candidate homologous relationships. A point accepted mutation (PAM) distance was then calculated for each BLAST hit as a measure of the evolutionary distance between the two sequences. Sequence pairs with a PAM distance of 120 or less were retained, and families were formed from them by single-linkage clustering. The families were then curated and refined by examining phylogenetic tree distances and topologies and by pairwise analysis of the proteins within each family to control for alignment quality.

To identify possible instances of positive selection, each protein family has been examined in a phylogenetic context. For each family, a multiple sequence alignment was created with MAFFT and a phylogenetic tree was constructed using PhyML (for families of more than 500 taxa, neighbor-joining trees were generated with RapidNJ [6]). Computational resources from the Liberles Group and the Mt. Moran computing center at the University of Wyoming are being used to assist in tree construction.
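The family-formation step described above (a PAM distance cutoff of 120 followed by single-linkage clustering) can be sketched in Python with a union-find structure. This is an illustrative sketch under stated assumptions, not the TAED pipeline: the input format (a list of pairwise hits with precomputed PAM distances), the function name, and the choice to keep unlinked sequences as singleton families are all assumptions.

```python
from collections import defaultdict

PAM_THRESHOLD = 120  # cutoff stated in the text

def single_linkage_families(pairs, threshold=PAM_THRESHOLD):
    """Single-linkage clustering: two sequences share a family if a chain of
    pairwise hits, each with PAM distance <= threshold, connects them.

    pairs: iterable of (seq_a, seq_b, pam_distance) tuples (assumed format).
    """
    parent = {}

    def find(x):
        # Return the representative of x's family, with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, pam in pairs:
        root_a, root_b = find(a), find(b)  # registers both sequences
        if pam <= threshold and root_a != root_b:
            parent[root_b] = root_a        # merge the two families

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted(sorted(members) for members in families.values())

# Made-up pairwise hits, not real TAED data.
hits = [
    ("A", "B", 35.0),
    ("B", "C", 110.0),   # links C to A via B, although A-C was never compared
    ("C", "D", 150.0),   # above the cutoff: D stays separate
    ("E", "F", 120.0),   # exactly at the cutoff: retained
]
families = single_linkage_families(hits)
```

Because single-linkage clustering only requires one sufficiently close pair to join two groups, distant sequences can end up in the same family through intermediates, which is one reason the subsequent curation step matters.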

TreeThrasher is the recommended tool for visualizing the larger chordate trees. It enables visualization of the overarching species tree as well as individual gene trees. More information on the installation and use of TreeThrasher can be found at the TreeThrasher Executable Download and in the TreeThrasher Installation Guide and Manual. Individual protein family trees may be viewed without installing any software, either as a traditional phylogeny or with the new OneZoom (Rosindell and Harmon 2012) visualizer, provided you have a modern browser (Chrome works best).

Summary statistics over TAED
  • Number of proteins in the database: 3,185,986
  • Number of protein families completed to date (November 15, 2016): 143,806
  • Average protein family size: 33 taxa
  • Total number of species represented: 3,214
  • Total number of families found with dN/dS ratios greater than 1: 23,970



Literature
  1. Katoh, K.; Kuma, K.; Toh, H.; Miyata, T., MAFFT version 5: Improvement of accuracy in multiple sequence alignment. Nucleic Acids Research 2005, 33, 511-518.
  2. Guindon, S.; Dufayard, J. F.; Lefort, V.; Anisimova, M.; Hordijk, W.; Gascuel, O., New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology 2010, 59, (3), 307-321.
  3. Yang, Z., PAML 4: Phylogenetic analysis by Maximum Likelihood. Mol. Biol. Evol. 2007, 24, (8), 1586-1591.
  4. Berglund, A. C.; Steffansson, P.; Betts, M. J.; Liberles, D. A., Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J. Mol. Evol. 2006, 63, (2), 240-250.
  5. Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 1999, 27, (1), 29-34.
  6. Simonsen, M.; Mailund, T.; Pedersen, C. N. S., Rapid Neighbor-Joining. In: Algorithms in Bioinformatics: 8th International Workshop. 2008, 113-122.

Flat files from the release described in Roth et al. (2005), Nucleic Acids Research 33:D495-D497 can be downloaded from: https://liberles.cst.temple.edu/public/TAED_Flat_Files/index.html.


The database described in Liberles et al. (2001) Genome Biology 2(8):research0028 can be accessed from: http://www.sbc.su.se/~liberles/TAED3.0/2index.html.

