What phylogenetic reconstruction methods are around and which of them requiere explicit model selection?
A convenient way to classify phylogeny inference methods is based on two criteria: i) the type of data they use to reconstruct the tree(s) (i.e. distance matrices vs. discrete characters) and ii ) the reconstruction strategy (algorithmic vs. optimality criterium), as summarized in the following table.
- Table 1. Classification of phylogeny reconstruction methods based on the type of data and reconstruction strategy used.
DATA TYPE distance matrix discrete characters Algorithmic UPGMA, NJ Optimality
ME, FM MP, ML, BI
- FM: Fitch-Margoliash's least-squares method; MP: maximum parsimony; ML: maximum likelihood; BI: Bayesian inference
Another way to classify phylogenetic methods is based on wether or not they demand an explicit substitution model. With the single exception of MP, all other methods listed in Table 1 require the selection by the user of one of the different substitution models available in the particular program used for tree reconstruction.
What are DNA substitution models ?
The GTR family of parametric nucleotide substitution models.
In the context of molecular phylogenetics, models are used to make predictions about the substitution process in molecular sequences (calculating their probabilities) along the branches of a tree or phylogeny. More explicitly, a sequence substitution model describes in probabilistic terms the process (Markov process) by which a sequence of characters (nucleotides or aminoacids) changes into another set of homologous (i.e. aligned) character states over time. There are basically two different strategies to develop a substitution model: an empricial and a parametric approach (Lio and Goldman, 1998). Empirical models have been developed sucessfully for protein sequences. Substitution matrices such as BLOSUM, PAM, WAG, JTT ... are examples of empirical substitution matrices developed by statistically analyzing the frequencies of observed substitutions in sets of alignments of conserved protein domains with varying degrees of evolutionary divergence. Parametric substitution models have been developed mainly for nucleotide sequences. In this case, the models are based on what are thought to be key parameters that govern the rates of substitution in homologous DNA sequences. There are two fundamental classes of parameters to be considered in this context: i) frequency and ii) rate parameters. The most widely and generally useful nucleotide substitution models are those from the General Time-Reversible (GTR) family. Within this family we find all the standard nucleotide substitution models (JC69, K80 or K2P, F81, HKY85, TN93, GTR ) implemented in widely used phylogenetics software packages such as MEGA3, PAUP*, PHYLIP or PHYML. These are only a minor fraction of the 203 possible substitution models within the GTR family, the vast majority of which are not named and are simply distinguished from each other by its number and type of parameters (i.e. level or degree of parameterization), as specified in the corresponding rate matrix. That is, the simplest (parameter poorest) of all models within the GTR family is that of Jukes and Cantor (JC69), developed in 1969, which has a single rate parameter (alpha) and therefore asumes that all possible substitutions take place at a single rate and that nucleotide frequencies are equal (i.e. pA = pC = pG = pT = 0.25). It is therefore a very restricted or constrained model, that will be only of very limited use for phylogenetic analyses of empirical sequence data. The most general, parameter-rich model is the GTR model, which has 6 relative rates (taking the rate of G <-> T = 1 and making all other 5 possible substiturion rates relative to the G-T transversion) and 4 relative nucleotide frequencies (fixing one of them equal to 1 and making the three remaining frequencies relative to it). Figure 1 makes summarizes the previous paragraph.
- Figure 1. Cartoon showing the relationships between frequency and rate parameters in some well-knon DNA
- substitution models of the GTR family.
Fig. 2. Different forms of the gamma distribution are controlled by a single parameter alpha (the "shape parameter").
It should be noted at this point that all models typically used to infer phylogenies from DNA sequences represent a special case of the GTR model. This means that imposing constrictions on the parameter values of the GTR model leads to a speacial case of the general model. A particular model is said to be nested within a more complex one only if constraining parameter values of the later yields the former. For example, the JC69 model is a special case of the K2P model, whereas the K2P model is a speciall case of the TN93 model, which in turn is a constrained version of the most general and parameter rich GTR mode. But the F81 and K2P are not nested because fixing parameter values of either one does not yield the other model. This is important to know, because one of the most popular tests used to evaluate the relative goodness of fit of two competing models uses the so-called likelihood ratio test (LRT) which can only be used to compare nested models. The following figure shows a hierachy of 56 phylogenetic models (Posada and Crandall, 1998). Note that not all of the possible models in the GTR family are depicted.
The figure above was taken from the original publication by Posada and Crandall, 1998.
What are DNA substitution models good for?
Fig. 3. Transition and transversion saturation plot for third codon positions of a set of 55 rhizobial glnII sequences
Fig. 3 clearly shows how slowly evolving tv substitutions accumulate almost linearly with increasing genetic distance between pairs of sequences, whereas the much faster evolving transitional substitutions don't, reaching a saturation plateau somewhere around a F84 distance of 0.5. This aparent slow-down in the accumulation of substitutions when increasingly distant sequences are compared is due to the correspondingly greater saturation level of the highly variable third codon positions in the protein-coding sequences analyzed in this study (55 rhizobial glnII sequences). At an F84 distance of about 0.5 all variable sites have become saturated and consequently lost all their genuine phylogenetic signal. Nucleotide substitution models are designed to make corrections in the distance estimates to minimize the effects of multiple substitutions at variable sequence sites.
How to select a suitable nucleotide substitution model for a multiple sequence alignment?
There are several strategies and approaches used in phylogenetics to select best-fit substitution models. Probably the most popular is to use the Likelihood Ratio Test (LRT) in a maximum likelihood framework to evaluate the statistical significance of the increase in fit of alternative substitution models to the data as their number and types of parameters increaes. It is important to stress that the LRT can only be applied in pairwise comparisons of nested models. The LRT test statistic is calculated as follows: where L1 is the global maximum likelihood estimate for the alternative hypothesis (parameter richer model) and L0 is the global maximum likelihood estimate for the null hypothesis (simpler model). Delta is always positive, as complexer models always fit the data better. However, this increased fit is not always significant and may only provide a better explanation for stochastic variation in the data. When the two compared models are nested Î” approximately follows a X2 distribution with q degrees of fredom, where q = difference in the number of free parameters of both models. The following example shows how hierarchical LRTs progressing from the simplest JC69 up to the HKY85 model successively reject the simpler models in favour of the alternative ones. The lnL score of the TN93 model is only barely better than that of the simpler HKY85 model. The LRT indicates that this increase in lnL is not significant, reason why the selected model is that of HKY85.
After selecting the HKY58 base model, it should be evaluated if accomodating among-site rate variation and or a proportion of invariant sites significantly improves the fit of the model to the data. This is shown below.
It is important to know that there are other means of evaluating the statistical significance of the relative fit of alternative models to a particular dataset such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). As a matter of fact, the AIC or BIC should be used as the model selection criteria in molecular phylogenetics rather than the LRT, as they permit the simultaneous evaluation of all competing models, be they nested or not (Posada and Buckley, 2004).
David Posada developed a very popular tool called ModelTest (REF) during his stay in Keith Crandall's lab to automatically select best-approaching substitution models. ModelTest selects the best-fit model from among 56 competing models by performing LRT, AICs and BIC calculations based on the likelihood scores computed by PAUP* for a given nucleotide alignment and a starting phylogenetic tree (the default is a JC69-NJ tree). It returns the name of the best-fit model selected by the different criteria as well as a PAUP* block with the best-fit parameter values for the winning model. ModelTest is also implemented as a web server, the ModelTest server.
If you don't have a copy of PAUP* to calculate the lnL scores for each of the 56 models evaluated by ModelTest don't worry. You can download jModelTest, which used PhyML for the likelihood, or you can download ModelGenerator, which can also select protein substitution matrices if you happen to work with protein alignments. Protocol 1 shows the use of ModelGenerator and MultiPhyl from the NUI Maynoothon.
If you understand Spanish, you can download my tutorial on using Modeltest3.7 from the command line on PDF, as well as a full course on molecular phylogenetics.
Protocol 1. Finding the best-fit substitution model for a DNA or protein multiple sequence alignment and estimating a maximum likelihood phylogeny under the selected model using the ModelGenerator and MultiPhyl web server.
MultiPhyl is the first high-throughput implementation of a distributed phylogenetics platform that is capable of using the idle computational resources of many heteogeneous non-dedicated machines to form a phylogenetics supercomputer. MultiPhyl allows a user to upload hundreds or thousands of amino acid or nucleotide alignments simultaneously and perform computationally intensive tasks such as model selection, tree searching, and bootstrapping of each of the alignments. The program implements a set of 88 amino acid models and 56 nucleotide ML models and a variety of statistical methods for choosing between alternative models using code from ModelGenerator. Notice however, that jobs received via this service will receive a lower priority on our system than simultaneous analyses being carried out by the members of the NUI Maynooth lab and therefore may take longer to execute. Results are emailed back to the user when the analysis is finished. User documentation for MultiPhyl Online is available from here.
Note: If you wish to analyse more than one alignment - please upload a single zip file containing all your alignments. All alignments must be in either Fasta or Phylip format and taxa names should not contain any spaces, commas, semi-colons, or brackets.
1. Navigate to the ModelGenerator+MultiPhyl site and upload your sequences in FASTA or Phylip format
Execute and the results will be obtained by email.
You may also want to visit my page on bioinformatics resources and tutorials, which may contain further material of your interest.