Model fitting in phylogenetics
- a short and simple introduction by Pablo Vinuesa

A convenient way to classify phylogeny inference methods is based on two criteria: i) the type of data  they use to reconstruct the tree(s) (i.e. distance matrices vs. discrete characters) and  ii ) the reconstruction strategy (algorithmic vs. optimality criterium), as summarized in the following table.

  Table 1. Classification of phylogeny reconstruction methods based on the type of data and reconstruction strategy used.
distance matrix discrete characters
Algorithmic UPGMA, NJ
UPGMA: unweighted pair-group method using arithmetic means; NJ: neighbor-joining; ME: minimum evolution; 
FM: Fitch-Margoliash's least-squares method; MP: maximum parsimony; ML: maximum likelihood; BI: Bayesian inference

Another way to classify phylogenetic methods is based on wether or not they demand an explicit substitution model. With the single exception of MP, all other methods listed in Table 1 require the selection by the user of one of the different substitution models available in the particular program used for tree reconstruction.


The GTR family of parametric nucleotide substitution models.

Models in science represent an abstraction of complex natural processes in order to make them mathematically tractable and hence useful to make reasonable predictions (extrapolations) about the outcome of the studied process or system under different scenarios. This is done in quantitative terms by devising a predictive formula.
In the context of molecular phylogenetics, models are used to make predictions about the substitution process in molecular sequences (calculating their probabilities) along the branches of a tree or phylogeny. More explicitly, a sequence substitution model describes in probabilistic terms the process (Markov process) by which a sequence of characters (nucleotides or  aminoacids) changes into another set of homologous (i.e. aligned) character states over time. There are basically two different strategies to develop a substitution model: an empricial and a parametric approach (Lio and Goldman, 1998). Empirical models have been developed sucessfully for protein sequences. Substitution matrices such as BLOSUM, PAM, WAG, JTT ... are examples of empirical substitution matrices developed by statistically analyzing the frequencies of observed substitutions in sets of alignments of conserved protein domains with varying degrees of evolutionary divergence. Parametric substitution models have been developed mainly for nucleotide sequences.  In this case, the models are based on what are thought to be key parameters that govern the rates of substitution in homologous DNA sequences. There are two fundamental classes of parameters to be considered in this context: i) frequency  and iirate parameters. The most widely and generally useful nucleotide substitution models are those from the General Time-Reversible (GTR) family. Within this family we find all the standard nucleotide substitution models (JC69, K80 or K2P, F81, HKY85, TN93, GTR )  implemented in widely used phylogenetics software packages such as MEGA3, PAUP*, PHYLIP or PHYML. These are only a minor fraction of the 203 possible substitution models within the GTR family, the vast majority of which are not named and are simply distinguished from each other by its number and type of parameters (i.e. level or degree of parameterization), as specified in the corresponding rate matrix. That is, the simplest (parameter poorest) of all models within the GTR family is that of Jukes and Cantor (JC69), developed in 1969, which has a single rate parameter (alpha) and therefore asumes that all possible substitutions take place at a single rate and that nucleotide frequencies are equal (i.e. pA = pC = pG  = pT = 0.25). It is therefore a very restricted or constrained model, that will be only of very limited use for phylogenetic analyses of empirical sequence data. The most general, parameter-rich model is the GTR model, which has 6 relative rates (taking the rate of G <-> T = 1 and making all other 5 possible substiturion rates relative to the G-T transversion) and 4 relative nucleotide frequencies (fixing one of them equal to 1 and making the three remaining frequencies relative to it). Figure 1 makes summarizes the previous paragraph.

Rate and frequency parameters of nucleotide substitution models
Figure 1. Cartoon showing the relationships between frequency and rate parameters in some well-knon DNA 
               substitution models of the GTR family.
In addition to the rate and frequency parameters mentioned above, most modern phylogeny reconstruction programs allow the user to add significantly more realism to the model chosen by including additional parameters that model substitution rate heterogeneity over alignment sites. The standard Markovian process assumes that the rates of evolution are equally distributed and independent over sites, which is a very unrealistic assumption. A discontinuous gamma distribution is most frequently used to model this rate heterogeneity over sites in phylogenetics. Prof. Ziheng Yang pioneered its use and popularization in phylogenetics, showing the tremendous impact it has on phylogeny estimation and in molecular evolution (Yang, 1996). A single paramter (alpha or shape parameter) controls the form of the gamma distribution (Fig. 2), making it very convenient in a phylogenetic context. When alpha has a value < 1 there is strong among-site variation present in the data set. The higher the value of alpha, the lower the heterogeneity (Fig. 2). Sometimes a proportion of invariant sites (I) is also used to model rate heterogeneity over sites. The most complex model in the GTR family would therefore be a GTR+I+G model, with 10 free parameters.

Shapes of the gamma distribution with varying values of alpha, the shape parameter
Fig. 2. Different forms of the gamma distribution are controlled by a single parameter alpha (the "shape parameter").

It should be noted at this point that all models typically used to infer phylogenies from DNA sequences represent a special case of the GTR model. This means that imposing constrictions on the parameter values of the GTR model leads to a speacial case of the general model.  A particular model is said to be nested within a more complex one only if constraining parameter values of the later yields the former. For example, the JC69 model is a special case of the K2P model, whereas the K2P model is a speciall case of the TN93 model, which in turn is a constrained version of the most general and parameter rich GTR mode. But  the F81 and K2P are not nested because fixing parameter values of either one does not yield the other model. This is important to know, because one of the most popular tests used to evaluate the relative goodness of fit of two competing models uses the so-called likelihood ratio test (LRT) which can only be used to compare nested models. The following figure shows a hierachy of 56 phylogenetic models (Posada and Crandall, 1998). Note that not all of the possible models in the GTR family are depicted.

Hierarchical hypothesis testing in ModelTest

The figure above was taken from the original publication by Posada and Crandall, 1998.

Models of sequence evolution are used to make corrections on the estimates of genetic or evolutionary distances. The more diverged a pair of lineages (sequences) is, the more likely it is that they will have accumulated multiple substitutions at their fast-evolving sites, which results in the accumulation of a stochastic signal in the sequences (homoplasies). This phenomenon is very well documented and can be graphically depicted using so-called saturation plots such as the one shown below.

Ti and Tv saturation plots

Fig. 3. Transition and transversion saturation plot for third codon positions of a set of 55 rhizobial glnII sequences

Fig. 3 clearly shows how slowly evolving tv substitutions accumulate almost linearly with increasing genetic distance between pairs of sequences, whereas the much faster evolving transitional substitutions don't, reaching a saturation plateau somewhere around a F84 distance of 0.5.  This aparent slow-down in the accumulation of substitutions when increasingly distant sequences are compared is due to the correspondingly greater saturation level of the highly variable third codon positions in the protein-coding sequences analyzed in this study (55 rhizobial glnII sequences). At  an F84 distance of about  0.5 all variable sites have become saturated and consequently lost all their genuine phylogenetic signal. Nucleotide substitution models are designed to make corrections in the distance estimates to minimize the effects of multiple substitutions at variable sequence sites.


In order to make the best possible estimates of evolutionary distances among sequences or lineages on a tree, and consequently get the most realistic branch lengths and other parameter estimates, including bootstrap support values, we need to select the so-called best-fit substitution model. All models are wrong, as they are simplifications of reality, but some models are useful and the serious phylogeneticist should make an effort to identify the model that best fits a particular data set. Obviously, if the data are close to, or completely saturated, there is nothing a model can do to filter out the non-phylogenetic or homoplasious signal and the dataset should be discarded.

There are several strategies and approaches used in phylogenetics to select best-fit substitution models. Probably the most popular is to use the Likelihood Ratio Test (LRT) in a maximum likelihood framework to evaluate the statistical significance of the increase in fit of alternative substitution models to the data as their number and types of parameters increaes. It is important to stress that the LRT can only be applied in pairwise comparisons of nested models. The LRT test statistic is calculated as follows:  where L1 is the global maximum likelihood estimate for the alternative hypothesis (parameter richer model) and L0 is the global maximum likelihood estimate for the null hypothesis (simpler model). Delta is always positive, as complexer models always fit the data better. However, this increased fit is not always significant and may only provide a better explanation for stochastic variation in the data. When the two compared models are nested Î” approximately follows a X2 distribution with q degrees of fredom, where q = difference in the number of free parameters of both models. The following example shows how hierarchical LRTs progressing from the simplest JC69 up to the HKY85 model successively reject the simpler models in favour of the alternative ones. The lnL score of the TN93 model is only barely better than that of the simpler HKY85 model. The LRT indicates that this increase in lnL is not significant, reason why the selected model is that of HKY85.

A LRT example case

After selecting the HKY58 base model, it should be evaluated if accomodating among-site rate variation and or a proportion of invariant sites significantly improves the fit of the model to the data. This is shown below.

LRT to evaluate among-site rate variation

It is important to know that there are other means of evaluating the statistical significance of the relative fit of alternative models to a particular dataset such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). As a matter of fact, the AIC or BIC should be used as the model selection criteria in molecular phylogenetics rather than the LRT, as they permit the simultaneous evaluation of all competing models, be they nested or not (Posada and Buckley, 2004).

David Posada developed a very popular tool called ModelTest (REF) during his stay in Keith Crandall's lab to automatically select best-approaching substitution models. ModelTest selects the best-fit model from among 56 competing models by performing LRT, AICs and BIC calculations based on the likelihood scores computed by PAUP* for a given nucleotide alignment and a starting phylogenetic tree (the default is a JC69-NJ tree). It returns the name of the best-fit model selected by the different criteria as well as a PAUP* block with the best-fit parameter values for the winning model. ModelTest is also implemented as a web server, the ModelTest server.

If you don't have a copy of PAUP* to calculate the lnL scores for each of the 56 models evaluated by ModelTest don't worry. You can download jModelTest, which used PhyML for the likelihood, or you can download ModelGenerator, which can also select protein substitution matrices if you happen to work with protein alignments. Protocol 1 shows the use of ModelGenerator and MultiPhyl from the NUI Maynoothon.

If you understand Spanish, you can download my tutorial on using Modeltest3.7 from the command line on PDF, as well as a full course on molecular phylogenetics.


Protocol 1. Finding the best-fit substitution model for a DNA or protein multiple sequence alignment and estimating a maximum likelihood phylogeny under the selected model using the ModelGenerator and MultiPhyl web server.

MultiPhyl is the first high-throughput implementation of a distributed phylogenetics platform that is capable of using the idle computational resources of many heteogeneous non-dedicated machines to form a phylogenetics supercomputer. MultiPhyl allows a user to upload hundreds or thousands of amino acid or nucleotide alignments simultaneously and perform computationally intensive tasks such as model selection, tree searching, and bootstrapping of each of the alignments. The program implements a set of 88 amino acid models and 56 nucleotide ML models and a variety of statistical methods for choosing between alternative models using code from ModelGenerator. Notice however, that jobs received via this service will receive a lower priority on our system than simultaneous analyses being carried out by the members of the NUI Maynooth lab and therefore may take longer to execute. Results are emailed back to the user when the analysis is finished. User documentation for MultiPhyl Online is available from here.

Note: If you wish to analyse more than one alignment - please upload a single zip file containing all your alignments. All alignments must be in either Fasta or Phylip format and taxa names should not contain any spaces, commas, semi-colons, or brackets.

1. Navigate to the ModelGenerator+MultiPhyl site and upload your sequences in FASTA or Phylip format

ModelGenerator and MultiPhyl web server input page

Execute and the results will be obtained by email.

Good luck!