General Methods & Model Overview
Core Engine & Discrete Character Models
HyPhy is a general-purpose computational engine designed to define, fit, and simulate sequence evolution under any continuous-time, discrete-state Markov model. While it is widely used for selection analyses, its core design is entirely agnostic to the state space. Researchers can specify an arbitrary set of character states, formulate a rate transition matrix, and calculate likelihoods on a phylogenetic tree.
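As a rough illustration of what "define a rate matrix and calculate likelihoods" means for an arbitrary state space, the sketch below computes transition probabilities P(t) = exp(Qt) for a made-up two-state model and evaluates the likelihood of a two-leaf tree. It is plain Python, not HyPhy code; the function names, the rates, and the branch lengths are all hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q * t) with a Taylor series plus repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    scaled = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds (Q * scale)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        # undo the scaling: exp(A) = (exp(A / 2^s))^(2^s)
        result = mat_mul(result, result)
    return result

def cherry_likelihood(q, freqs, x, y, t1, t2):
    """Likelihood of observing states x and y at two leaves that share a root."""
    p1 = transition_probabilities(q, t1)
    p2 = transition_probabilities(q, t2)
    return sum(freqs[r] * p1[r][x] * p2[r][y] for r in range(len(q)))

# Hypothetical two-state model: 0 -> 1 at rate_01, 1 -> 0 at rate_10.
rate_01, rate_10 = 0.3, 0.7
Q = [[-rate_01, rate_01], [rate_10, -rate_10]]
FREQS = [rate_10 / (rate_01 + rate_10), rate_01 / (rate_01 + rate_10)]  # stationary distribution
P = transition_probabilities(Q, 0.5)
```

Nothing in these functions depends on the number of states, so the same code would accept a 4-, 20-, or 61-state matrix. A production engine would use a numerically robust matrix exponential and a full pruning algorithm over the tree.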
Built-in State Spaces
HyPhy includes native support and optimized libraries for several common biological character types:
- Nucleotides (4 states): Full support for all 203 reversible substitution models (e.g. JC69, HKY85, GTR) as well as non-reversible models.
- Amino Acids / Proteins (20 states): Built-in empirical substitution models (such as JTT, WAG, LG, Dayhoff, MtREV, etc.) with support for user-defined rate matrices.
- Codons (61 or 64 states): Standard and customized evolutionary codon models (including Muse-Gaut 94 (MG94) derivatives, Goldman-Yang models, and selection testing methods).
- Di-nucleotides (16 states): Models designed for analyzing double-nucleotide substitution patterns.
- Binary / Morphological (2 states): Models for restriction sites, presence/absence data (0/1), or binary morphological traits.
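The state counts and the figure of 203 reversible nucleotide models in the list above can be checked by enumeration. The 203 models correspond to the ways of grouping the six nucleotide exchangeability rates (AC, AG, AT, CG, CT, GT) into classes that share a value, which is the Bell number B(6). The sketch below is illustrative Python rather than HyPhy code; it writes each grouping as a six-character string in which equal digits mean equal rates, and all names in it are hypothetical.

```python
from itertools import product

def restricted_growth_strings(n):
    """Yield every partition of n items as a string of class labels (0, 01, 012, ...)."""
    def extend(prefix, max_label):
        if len(prefix) == n:
            yield "".join(map(str, prefix))
            return
        # a new item may join any existing class or open exactly one new class
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))
    yield from extend([], -1)

RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]
MODELS = list(restricted_growth_strings(len(RATES)))

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only
ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
SENSE_CODONS = [c for c in ALL_CODONS if c not in STOP_CODONS]
DINUCLEOTIDES = ["".join(d) for d in product("ACGT", repeat=2)]

print(len(MODELS), len(ALL_CODONS), len(SENSE_CODONS), len(DINUCLEOTIDES))
```

In this encoding `000000` gives all six rates one shared value (the JC69 pattern), `010010` separates transitions from transversions (the HKY85 pattern), and `012345` lets every rate differ (GTR). Other genetic codes have different stop codons, so the number of sense codons is not always 61.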
Custom Characters & Arbitrary States
Beyond the built-in state spaces, the HyPhy Batch Language (HBL) allows users to construct custom characters of arbitrary size. Examples include codon-pair models, multi-state morphological characters, copy-number profiles, and structural state spaces.
MG94xREV Framework
All methods used to infer selection from coding-sequence data rely, to some extent, on the MG94xREV codon model, a generalized extension of the MG94 model that allows for a full GTR mutation rate matrix. The MG94xREV transition matrix $Q$ (also known as the instantaneous rate matrix) has entries $q_{ij}$, the rate of substitution from codon $i$ to codon $j$, given by:

$$
q_{ij} = \begin{cases}
\alpha \, \theta_{ij} \, \pi_j, & \delta(i,j) = 1,\ AA(i) = AA(j) \\
\beta \, \theta_{ij} \, \pi_j, & \delta(i,j) = 1,\ AA(i) \neq AA(j) \\
0, & \delta(i,j) > 1 \\
-\sum_{k \neq i} q_{ik}, & i = j
\end{cases}
$$

Parameters in this matrix include the following:
- The function $\delta(i,j)$ counts the number of nucleotide differences between codons $i$ and $j$; for example, $\delta(AAA, AAT) = 1$ and $\delta(AAA, ATT) = 2$. Like most other codon models, the MG94xREV model considers only single-nucleotide codon substitutions to be instantaneous.
- $AA(i)$ refers to the amino acid encoded by codon $i$.
- $\alpha$ represents the synonymous substitution rate dS, and $\beta$ represents the nonsynonymous substitution rate dN. Hence, $dN/dS = \beta/\alpha$. We refer to the ratio $\beta/\alpha$ as simply $\omega$.
- Together, the mutation model ("REV" component of the MG94xREV model) is described by two parameter sets: $\Theta$, composed of values $\theta_{ij}$, and $\Pi$, composed of values $\pi_j$. The $\theta_{ij}$ values are the nucleotide mutational biases for the single nucleotide change that turns codon $i$ into codon $j$, and the $\pi_j$ values are the equilibrium frequencies of the target nucleotide of that change.
- Not explicitly seen in this model are the equilibrium codon frequencies. These frequencies are estimated using nine positional nucleotide frequencies for the target nucleotides in each codon substitution. Specifically, HyPhy employs the CF3x4 frequency estimator, a corrected version of the common F3x4 estimator (introduced in Goldman and Yang 1994) which accounts for biases in nucleotide composition induced by stop codons.
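The parameter definitions above can be made concrete with a small sketch that assembles an MG94xREV-style rate matrix over the 61 sense codons of the standard genetic code. This is illustrative Python, not HyPhy's implementation: the function name `mg94xrev`, the argument layout, and every numeric value are assumptions, and the target-nucleotide frequencies are taken to be position-specific, in line with the nine positional frequencies mentioned above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" marks a stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [codon for codon, aa in CODE.items() if aa != "*"]
INDEX = {codon: k for k, codon in enumerate(SENSE)}

def mg94xrev(alpha, beta, theta, pi):
    """Build a rate matrix over sense codons.

    alpha, beta: synonymous and nonsynonymous rates (dS and dN).
    theta: dict mapping a sorted nucleotide pair such as "AG" to its mutational bias.
    pi: list of three dicts, the nucleotide frequencies at codon positions 1-3.
    """
    n = len(SENSE)
    q = [[0.0] * n for _ in range(n)]
    for i, source in enumerate(SENSE):
        for j, target in enumerate(SENSE):
            diffs = [p for p in range(3) if source[p] != target[p]]
            if len(diffs) != 1:
                continue  # identical codons or multi-nucleotide changes keep rate 0
            p = diffs[0]
            rate = alpha if CODE[source] == CODE[target] else beta
            pair = "".join(sorted((source[p], target[p])))
            q[i][j] = rate * theta[pair] * pi[p][target[p]]
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q

# Hypothetical parameter values, chosen only to exercise the function.
THETA = {"AC": 0.5, "AG": 1.0, "AT": 0.4, "CG": 0.6, "CT": 1.2, "GT": 0.3}
PI = [
    {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
]
Q = mg94xrev(alpha=1.0, beta=0.5, theta=THETA, pi=PI)
```

For instance, `Q[INDEX["AAA"]][INDEX["AAG"]]` is a synonymous change (Lys to Lys) and uses the synonymous rate, `Q[INDEX["AAA"]][INDEX["AAC"]]` is nonsynonymous (Lys to Asn) and uses the nonsynonymous rate, and `Q[INDEX["AAA"]][INDEX["CCG"]]` is zero because three nucleotides differ. The sketch does not compute equilibrium codon frequencies, so the CF3x4 correction is not shown.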
Most methods will perform a global MG94xREV fit to optimize branch length and nucleotide substitution parameters before proceeding to hypothesis testing. Several methods (FEL, FUBAR, and MEME) additionally pre-fit a GTR nucleotide model to the data, using the estimated parameters as starting values for the global MG94xREV fit, as a computational speed-up. Resulting branch length and nucleotide substitution parameters are subsequently used as initial parameter values during model fitting for hypothesis testing.
Synonymous Rate Variation
A key component of HyPhy methods is the inclusion of synonymous rate variation: dS is allowed to vary across sites and/or branches, depending on the specific method. This paper provides a detailed analysis demonstrating why incorporating synonymous rate variation into positive selection inference is likely beneficial. Importantly, this treatment of synonymous rate variation stands in contrast to methods implemented in, for example, PAML, where dS is constrained to equal 1.
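A small numerical illustration of why this matters, using made-up per-site rates: three sites share the same nonsynonymous rate but differ in their synonymous rate. The sketch is plain arithmetic in Python, not an estimation procedure, and the names and values are hypothetical.

```python
def omega(alpha, beta):
    """dN/dS for one site, given its synonymous (alpha) and nonsynonymous (beta) rates."""
    return beta / alpha

# Hypothetical (alpha, beta) pairs for three sites.
SITE_RATES = [(0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]

# dS free to vary across sites: the ratio differs from site to site.
with_variation = [omega(alpha, beta) for alpha, beta in SITE_RATES]

# dS constrained to 1 at every site: the three sites become indistinguishable.
fixed_ds = [omega(1.0, beta) for _, beta in SITE_RATES]

print(with_variation)
print(fixed_ds)
```

With the synonymous rate allowed to vary, the ratios are 2.0, 1.0, and 0.5. With it pinned to 1, every site reports 1.0, so differences that originate in the synonymous rate are lost from the ratio.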