# PRIME



### PRoperty Informed Models of Evolution

Protein evolution models are typically based on estimates of amino acid exchangeabilities, e.g. as quantified in the BLOSUM substitution matrices, which are still commonly used today. These models derive their power from the fact that radical substitutions – involving amino acids with very different physico-chemical properties – are generally rare, while conservative substitutions – involving similar amino acids – are more common. However, more recent studies have shown that amino acid exchangeabilities vary across organisms and across genes, reflecting the fact that the set of relevant physico-chemical properties changes from case to case, so that the same substitution may sometimes be radical (having a large effect on protein structure and/or function) and sometimes conservative (having little effect on structure or function). The same kind of variation can be expected from site to site within a protein: for instance, amino acids with different hydrophobicity may be unexchangeable at sites where the protein fold is sensitive to hydrophobicity but exchangeable at sites where it is not. PRIME models were designed to take account of this variation.

PRIME builds on the same conceptual framework as FEL[1] and MEME[2], but allows the non-synonymous substitution rate β to depend not only on the site in question (as in FEL and MEME), but also on which residues are being exchanged (e.g. the rate for I-V may differ from the rate for K-R).

#### Model details

In PRIME, the non-synonymous rate of replacing a codon encoding amino acid x with a codon encoding amino acid y is parameterized as $\beta_{xy, x \neq y} = f^{(s)}\left(\overrightarrow{d(x,y)}\right)$. Here, $\overrightarrow{d(x,y)}$ is a vector of property-specific distance measures, each of which indicates the degree of dissimilarity of amino acids x and y with respect to a particular (possibly composite) property or set of properties. The function $f^{(s)}$ maps the distances to an exchangeability, with the superscript (s) making it explicit that the exchangeability of x and y depends on the relative importance of the various amino acid properties at site s.

We assume that the properties do not interact. If they are composite measures that have been constructed in such a way as to remove dependencies (as is attempted by standard dimensionality reduction techniques such as principal components analysis), this assumption should be reasonable. We calculate the property-specific distance $d_i(x,y) = |x_i - y_i|$ between amino acids x and y for each property i, where $x_i$ and $y_i$ are the values of property i for x and y, and model the contribution of each of these distances to $\beta_{xy}$ as independent. The exchangeability function $f^{(s)}$ is then a site-specific function of D property-specific distances.

##### Example

Consider the first two of the five composite properties from Atchley et al.[3]: the first measures bipolarity, while the second relates to the propensity of amino acids to be in various secondary structure configurations. The numerical distances between alanine (A) and cysteine (C) are 0.752 (property 1) and 1.767 (property 2). The exchangeability of A and C is a function $f^{(s)}([0.752, 1.767])$ which depends on both distances and on the relative importance of properties 1 and 2 at site s.
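The distance calculation in this example can be sketched in a few lines of Python. This is only an illustration: the factor scores for A and C below are assumed values (as commonly tabulated for the Atchley factors, and chosen because they reproduce the two distances quoted above), and the names `ATCHLEY_12` and `property_distances` are hypothetical, not part of PRIME or HyPhy.

```python
# Assumed Atchley factor scores (factors 1 and 2 only) for alanine and cysteine.
# These values are an assumption of this sketch; they reproduce the distances
# 0.752 and 1.767 quoted in the text.
ATCHLEY_12 = {
    "A": (-0.591, -1.302),
    "C": (-1.343, 0.465),
}


def property_distances(x, y, table=ATCHLEY_12):
    """Return the per-property distances d_i(x, y) = |x_i - y_i|."""
    return [abs(xi - yi) for xi, yi in zip(table[x], table[y])]


d_AC = property_distances("A", "C")
print([round(d, 3) for d in d_AC])
```

The resulting vector is what the exchangeability function takes as input for the A-C pair; the distance is symmetric, so `property_distances("C", "A")` gives the same result.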

##### Exchangeability function

Because we are modeling the contribution due to each property as independent, the exchangeability function is a product of property-specific contributions. Under purifying selection, each of these contributions should be a monotonically decreasing function of the corresponding distance; the most natural parameterization is an exponential decline of exchangeability as properties become more dissimilar. This yields the exponential independent model: $f^{(s)}\left(\overrightarrow{d(x,y)}\right) = r^{(s)}\prod_{i=1}^{D} \left[e^{-\alpha_i^{(s)} d_i(x,y)}\right] = r^{(s)} \exp \left[ -\sum_{i=1}^D \alpha_i^{(s)} |x_i-y_i| \right].$ Here, $r^{(s)}$ is the site-specific synonymous substitution rate, and the site-specific parameters $\alpha_i^{(s)}$ represent the importance of property i: when $\alpha_i^{(s)}=0$, selection is neutral with respect to that property, while positive values of $\alpha_i^{(s)}$ cause the property to be conserved. Small positive values of $\alpha_i^{(s)}$ mean that conservative changes are tolerated but radical changes are not; as $\alpha_i^{(s)}$ increases, purifying selection starts affecting even conservative changes.

If substitutions that are radical with respect to property i are accelerated relative to substitutions that are conservative with respect to i, the exponential parameterization still applies, and fitting the model to a site under positive selection will result in $\alpha_i^{(s)}<0$.
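As a minimal sketch of the exponential independent model, the function below evaluates $r^{(s)} \exp\left[-\sum_i \alpha_i^{(s)} d_i(x,y)\right]$ for a given distance vector. The function name `exchangeability` and all numerical values of `r` and `alphas` are hypothetical and chosen only for illustration; the distances are the A-C values from the example above.

```python
import math


def exchangeability(r, alphas, distances):
    """Exponential independent model: r * exp(-sum_i alpha_i * d_i).

    r         -- site-specific rate r^(s)
    alphas    -- site-specific property weights alpha_i^(s)
    distances -- property-specific distances d_i(x, y)
    """
    if len(alphas) != len(distances):
        raise ValueError("alphas and distances must have the same length")
    return r * math.exp(-sum(a * d for a, d in zip(alphas, distances)))


d_AC = [0.752, 1.767]  # A-C distances for Atchley properties 1 and 2 (from the text)

# Hypothetical parameter values, for illustration only.
neutral = exchangeability(1.0, [0.0, 0.0], d_AC)      # both properties neutral
conserved = exchangeability(1.0, [0.5, 2.0], d_AC)    # both properties conserved
accelerated = exchangeability(1.0, [0.0, -0.5], d_AC) # property 2 weight is negative

print(neutral, conserved, accelerated)
```

With all weights at zero the function returns `r` unchanged; positive weights push the exchangeability below `r`, and a negative weight pushes it above `r`, which matches the interpretation of the sign of $\alpha_i^{(s)}$ given in the text.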

##### Sets of amino acid properties

PRIME currently supports two predefined sets of 5 amino-acid properties: the five empirically measured properties used by Conant et al. [4] and the five composite properties proposed by Atchley et al. [3]. The latter were obtained by applying a dimensionality reduction technique based on factor analysis to a large set of 494 empirically measured attributes.

| Property | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Conant-Stadler[4] | Chemical Composition | Polarity | Volume | Iso-electric point | Hydropathy |
| Atchley et al.[3] | Polarity index | Secondary structure factor | Volume | Refractivity/Heat Capacity | Charge/Iso-electric point |

#### Fitting and testing

• We first fit a simple codon model to the alignment to estimate nucleotide substitution biases and alignment-wide branch lengths; these parameters are held constant at those estimates in all subsequent analyses (also see FUBAR).
• For each site, we fit the 6-parameter exponential independent model described above (the full model, with parameters $r^{(s)}$ (synonymous rate, relative to the alignment average) and $\alpha^{(s)}_1,\ldots,\alpha^{(s)}_5$ (five property weights)) and five null models, each of which constrains one of the $\alpha^{(s)}_i$ to 0 and thereby tests whether there is evidence that change in this property is important at site s.
• Each individual null model is compared to the full model by a likelihood ratio test, using the $\chi^2_1$ distribution to assess significance. Previous applications of FEL in similar modeling contexts suggest that this test statistic is appropriate, albeit somewhat conservative for small datasets (e.g. [1]). Because multiple tests are performed on the same data (a single site), we employ the Holm-Bonferroni procedure to control the family-wise false positive rate at a site. We also report q-values computed with the false discovery rate procedure of Benjamini and Hochberg.
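The testing step above can be sketched as follows, assuming the full-model and null-model log-likelihoods for a site are already available. This is not PRIME's implementation: the function names and the log-likelihood values in the usage lines are hypothetical, and applying the Benjamini-Hochberg function to the five tests of a single site is only for illustration, since the text does not specify the set of tests over which q-values are computed. The p-value uses the identity that, for one degree of freedom, the $\chi^2$ tail probability equals `erfc(sqrt(stat / 2))`.

```python
import math


def lrt_pvalue(loglik_full, loglik_null):
    """P-value of a likelihood ratio test against chi-square with 1 df."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def holm_bonferroni(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        candidate = min(1.0, pvalues[idx] * m / (rank + 1))
        running_min = min(running_min, candidate)  # enforce monotonicity
        qvalues[idx] = running_min
    return qvalues


# Hypothetical log-likelihoods for one site: full model and five null models.
loglik_full = -120.0
loglik_nulls = [-123.5, -120.1, -122.0, -120.0, -125.2]

pvals = [lrt_pvalue(loglik_full, ll0) for ll0 in loglik_nulls]
print(holm_bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Using only the standard library keeps the sketch dependency-free; a statistics package would normally supply the $\chi^2$ tail probability for general degrees of freedom.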

### References

1. Sergei L. Kosakovsky Pond and Simon D. W. Frost. Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol (May 2005) 22(5): 1208-1222, first published online February 9, 2005. doi:10.1093/molbev/msi105
2. Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, et al. (2012) Detecting Individual Sites Subject to Episodic Diversifying Selection. PLoS Genet 8(7): e1002764. doi:10.1371/journal.pgen.1002764
3. William R. Atchley, Jieping Zhao, Andrew D. Fernandes, and Tanja Drüke. Solving the protein sequence metric problem. PNAS 2005 102 (18) 6395-6400; published ahead of print April 25, 2005. http://dx.doi.org/doi:10.1073/pnas.0408677102
4. Gavin C. Conant, Günter P. Wagner, Peter F. Stadler. Modeling amino acid substitution patterns in orthologous and paralogous genes. Molecular Phylogenetics and Evolution, Volume 42, Issue 2, February 2007, Pages 298–307. http://dx.doi.org/10.1016/j.ympev.2006.07.006