Workshop:Vancouver2011:SelectionAnalyses

From HyPhy Wiki
Jump to: navigation, search

Contents

Screening an alignment for evidence of diversifying selection in HyPhy (GUI)

Running SLAC

In this exercise we will screen an alignment of 19 West Nile virus sequences (NS3 gene) for evidence of diversifying positive selection using the SLAC method implemented in HyPhy.

  1. Download File:WN NS3.fas to your computer saving it into a new directory (name it WNNS3)
  2. Lauch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Positive Selection to expand the list further, choose QuickSelectionDetection.bf. This file implements a wide variety of analyses, each having multiple options, and can be quite intimidating! For detailed discussion of many of the options, please take a look at Section 2 of The Positive Selection in HyPhy book chapter. In the following listing of input options to the analysis, we are going to use the shorthand Option : Value
  1. Choose Genetic Code : Universal
  2. New/Restore : New Analysis
  3. Please specify a codon data file : Use the navigation box to find WN_NS3.fas
  4. Model Options : Custom
  5. Please enter a 6 character model designation (e.g:010010 defines HKY85):012345
  6. A tree was found in the data file...: Y (type into the lower box of the console window and press Enter)
  7. Save nucleotide model fit to : Use the navigation box to save the file to the same directory as WN_NS3.fas, but name it WN_NS3.HKY85.fit
  8. dN/dS bias parameter options : Estimate dN/dS only
  9. Ancestor counting options : Single Ancestor Counting
  10. SLAC Options : Full tree
  11. Treatment of Ambiguities : Averaged
  12. Test Statistic : Approximate
  13. ... HYPHY will do some work and print some text to the console window ...
  14. Significance level for a site to be classified as positively/negatively selected? : 0.1 (type into the lower box of the console window and press Enter)
  15. ... More text to the screen, including the list of sites under positive and negative selection ...
  16. Output Options : Chart
  17. Rate Class Estimator : Skip

At the end of the execution there will be several windows displayed, and there should be a single positively selected site


******* FOUND 1 POSITIVELY SELECTED SITES ********
+--------------+--------------+--------------+--------------+
| Index        | Site Index   | dN-dS        | p-value      | 
+--------------+--------------+--------------+--------------+
|            1 |   249.000000 |     3.304463 |     0.087525 |
+--------------+--------------+--------------+--------------+

SLAC screen shot

Take a look at the row for site 249 in the results chart: this is the site reported positively. The method inferred that a total of 7 substitutions took place at that site, with 0 (Observed S Changes) of them being synonymous and 7 (Observed NS Changes) being non-synonymous. How does this observed synonymous proportion (0, Observed S. Prop.) compare to the expected proportion under neutrality? This neutral expectation is computed by averaging the imputed proportions of synonymous and non-synonymous 'sites' along the tree. These 'site' counts, which can be more accurately thought of as the proportion of random one-nucleotide substitutions that are expected to be synonymous (E[S Sites] = 0.8817) and non-synonymous (E[NS Sites] = 2.002). Their ratio (P{S} = 0.293884) is the proportion of substitutions expected to be synonymous under neutral evolution. For site 249, this means that about 2/7 substitutions would be synonymous, and the probability of getting 0/7 synonymous substitutions under the binomial distribution with the probability parameter P{S} = 0.293884 is (1-P{S})7 = 0.087525. This number is the one reported in the column P{S leq. observed}, and is the p-value for having as many or fewer synonymous substitutions at a site (positive selection), with the analogous quantity for negative selection reported in the column P{S geq. observed}. dS and dN are simply S/E[S Sites] and NS/ E[NS Sites], dN-dS is their difference, and Scaled dN-dS is dN-dS normalized by the total length of the tree.



Running REL

In this exercise we will screen an alignment of 19 West Nile virus sequences (NS3 gene) for evidence of diversifying positive selection using a Random Effects Approach (REL) method implemented in HyPhy.

  1. We will be using the File:WN NS3.fas from the previous example.
  2. Lauch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Codon Selection Analyses to expand the list further, choose BivariateCodonRateAnalysis.bf. This file implements the general bivariate discrete distribution model, which permits both synonymous and non-synonymous rates to vary between classes of sites in an alignment (Evolutionary fingerprinting of genes). The number of classes is determined automatically.
  1. Datafile : Use the navigation box to find WN_NS3.fas
  2. Choose Genetic Code : Universal
  3. Branch Lengths : Nucleotide Model
  4. Model string : 012345
  5. Amino Acid Class Model : Default
  6. Please enter a 6 character model designation (e.g:010010 defines HKY85): 012345 (type into the lower box of the console window and press Enter)
  7. ... HyPhy will now start fitting the model. It will start with one class of sites, determine the best values for dS and dN rates given that assumption, then move on two two rate classes etc. The process terminates when no improvement in model fit (measured by c-AIC) can be obtained by adding a further rate class ...

Here's the output for 3 rates

Model fit summary

Log likelihood:    -6454.04398
Parameters    :             56
AIC           :    13020.08796
c-AIC         :    13020.63341
Class 1
	dS    =      0.940
	dN    =      0.006
	dN/dS =      0.007
	Prob  =      0.835
Class 2
	dS    =      0.000
	dN    =      2.556
	dN/dS =        inf
	Prob  =      0.002
Class 3
	dS    =      1.314
	dN    =      0.091
	dN/dS =      0.069
	Prob  =      0.164
[FINISHED FITTING A MODEL WITH 3 RATES]
[CURRENT c-AIC = 13020.6. BEST c-AIC SO FAR = 13031.3]

The analysis will write a number of files to the same directory as the original alignment file and in the end instruct you to execute a processing module on a particular file (which requires the GUI) to further interpret the results:

[BEST-FITTING MODEL HAS 3 RATES]
[USE CodonBivariateRateProcessor.bf ON /Users/sergei/Desktop/Vancouver2011/WN_NS3.fas.fit.1.2.3 TO PROCESS THE RESULTS]

Processing REL Results

Return to the Standard Analyses menu, click on the arrow next to Codon Selection Analyses to expand the list further (if needed), choose CodonBivariateRateProcessor.bf.

  1. Choose Genetic Code : Universal
  2. Choose A Model Fit: : Use the navigation box to find WN_NS3.fas.fit.1.2.3 (this is is NEXUS file which stores the entire HyPhy model fit)
  3. The program will run for a few seconds and open a number of window.
  4. Find the window named Posteriors. It contains the information about different values for dN-dS and what posterior probability (empirical Bayes) each site has of belonging to (in this case) one of three rate classes.
  5. Click on the arrow next to the dN_minus_dS variable to display its values. One of the values should be positive, indicating that sites belonging to this rate class have dN>dS. Click on the row with this value (it should become highlighted).
  6. Go to the Categories menu and choose the Event Posteriors option.
  7. A new window (Event [number] Info For dN_minus_dS) will open. It contains automatically computed priors, posteriors and empirical Bayes factors that each site belongs to the rate class with dN>dS.
  8. Confirm that site 249 is the only site with a high posterior (>.9) probability and Empirical Bayes Factor (>100) for dN>dS. You can draw a simple chart in the same window (select Bar Chart from the Type pulldown and the appropriate column from the Y dropdown).

Using Branch-Site REL to look for episodic selection

In this exercise we will screen an alignment of 6 HIV-1 (partial gp120) sequences for evidence of diversifying episodic selection using the BranchSiteREL method implemented in HyPhy.

  1. Download File:HIV gp120.nex to your computer saving it into a new directory (name it HIVtransmission)
  2. Lauch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Positive Selection to expand the list further, choose BranchSiteREL.bf. This file implements a very recent method for finding evidence of episodic diversifying selection (aka branch-site done right) from a sequence alignment.
  1. Choose Genetic Code : Universal
  2. Data file : Use the navigation box to find HIV_gp120.nex
  3. A tree was found in the data file...: Y (type into the lower box of the console window and press Enter)
  4. Save analysis results to : Use the navigation box to save the file to the same directory as HIV_gp120.nex, but name it HIV_gp120.csv
  5. ... HYPHY will go to work and print some text to the console window ...
    1. A sequence of increasingly more complex models (phases) is fitted to the data
    2. In Phase 2, the model which permits unrestricted combinations of selective regimes across sites and branches is fitted
    3. Following that, all branches with some proportion of sites having dN>dS are tested for evidence of positive selection using a series of LRT.
    4. Note that this phase also prints out the actual selective profile of each branch, i.e. the proportions of sites in each (dS,dN) category at this branch, e.g. the transmission branch has 93.5 percent of sites at omega = 1, and the rest at a very high omega (>100):
Node: mixtureTree.Node5
	Class 1: omega = 0.542641 weight = 0
	Class 2: omega = 1 weight = 0.934506
	Class 3: omega = 111.945 weight = 0.065494

At the end of the analysis, two branches should identified as being under influence of episodic diversifying selection (one of them is the transmission branch).

Summary of branches under episodic selection:
	RCP3 p = 5.29641e-06
	Node5 p = 0.0116331

Episodic selection tree summary

File:HIV gp120.txt

Using DEPS to look for directional selection

In this exercise we will use a Directional Evolution in Protein Sequences (DEPS) analysis to identify sites subject to directional positive selection in an alignment of 68 protein hemagglutinin sequences from H3N2 Influenza A virus.

  1. Download the alignment file to your computer saving it into a new directory (name it DEPS)
  2. Lauch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Positive Selection to expand the list further, choose DirectionalREL.bf. This file implements a method to examine protein sequences for evidence of directional evolution, using a rooted tree (DEPS manuscript).
  1. Reload/New : Reload (the initial model fitting take about 10-15 minutes, so the file you've downloaded contains the Flu H5N1 model pre-fitted to this alignment)
  2. 'Load an existing fit : Use the navigation box to find H3N2_HA.nex
  3. Fix Branch: Unknown root"
  4. ... HYPHY will go to work and print some text to the console window.
    1. It will try every possible amino-acid residue and compute a p-value that some proportion of sites in the alignment are evolving directionally towards it. Each of these models will be written to a fit file, e.g H3N2_NA.nex.A etc
    2. For residues with significant p-values (after a multiple test correction), individual sites which may be evolving under directional selection are identified.

For instance, text block below indicates that 2.3% of the sites show strong bias towards T (p = 0.001). Rates to T are 1112.128 times faster than background, leading to 62.358% frequency increase of residue T over the length of the tree.

[PHASE 17.1] Model biased for T
[PHASE 17.2] Finished with the model biased for T. Log-L = -3259.811

        Bias term           = 1112.128
        proportion          =    0.023
        Exp freq increase   =   62.358%
        p-value    =    0.001

The following are the two sites found to be evolving directionally towards T

Residues (and p-values) for which there is evidence of directional selection

        T : 0.000895177

The list of sites which show evidence of directional selection (Bayes Factor > 20)
together with the target residues and inferred substitution counts

Site      151 (max BF = 135.956)
 Preferred residues: T
 Substitution counts:
        G->K:    1/K->G:    0
        K->T:    1/T->K:    0
Site      171 (max BF = 205920)
 Preferred residues: T
 Substitution counts:
        H->T:    1/T->H:    0
        H->Y:    0/Y->H:    1
        T->Y:    1/Y->T:    0

Open H3N2_HA.nex in the HyPhy alignment viewer and look at the character frequencies at the sites identified as directionally evolving (the root and the presumed ancestral residue are the same as in sequence V01103_1).

Screening an alignment for evidence of recombination

Using Single Breakpoint Recombination (SBP)

In this exercise we will examine an alignment of 9 HIV-1 pol sequences for evidence of recombination using SBP.

  1. Download the alignment file to your computer saving it into a new directory (name it SBP)
  2. Lauch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Recombination to expand the list further (if needed), choose SingleBreakpointRecomb.bf. The analysis examines every possible location for a single recombination breakpoint (i.e. every variable site in the alignment) and computes the goodness of fit of the model that allows recombination (i.e. has two trees, left and right of the breakpoint) versus the universal null (single tree).
  1. Data type : Nucleotide
  2. Locate a nucleotide data file : Use the navigation box to find HIV_pol_BC_08.fas
  3. KH Testing : Skip (we'll do it later)
  4. Choose one of the standard models : HKY85
  5. Model options : Global
  6. Save analysis results to : Use the navigation box to save the file to the same directory as HIV_pol_BC_08.fas, but name it SBP.txt
  7. ... HYPHY will do some work and print some text to the console window ...

For each possible breakpoint, the program will fit the 2-tree model and report its goodness of fit according to three information criteria: AIC, small sample AIC (AICc) - the default, and BIC. AIC is the least conservative and BIC is by far the most.

A typical output line may look like this, where AIC and AICc both support locating a breakpoint, but BIC does not.

Breakpoint at position    260. dAIC =       63.21 dAICc =      61.54 dBIC =    -110.89

At the end of the run, a summary is presented:

AIC

Best supported breakpoint is located at position 260
AIC = 5155.5 : an improvement of 63.2106 AIC points

AIC-c

Best supported breakpoint is located at position 260
AIC = 5157.78 : an improvement of 61.5429 AIC points

BIC

There seems to be NO recombination in this alignment

The analysis will generate two tree files (one for each tree left to the breakpoint, and one for each tree right to the breakpoint), written to the same directory as the main result file (SBP.txt in this case), called SBP.txt.trees1, and SBP.txt.trees2.

Next, we will perform a standard topological incongruence test, using the GARDProcessor.bf file in the Recombination section of standard analyses.

  1. Please load a nucleotide data file : Use the navigation box to find HIV_pol_BC_08.fas
  2. Please load a GA partition analysis result file: : Use the navigation box to find SBP.txt_cAIC.splits
  3. ... HYPHY will do some work and print some text to the console window ...

HyPhy will perform a variety of tests to determine which if the breakpoint can be attributed to topological incongruence (i.e. recombination and not rate variation or heterotachy), and report that in this case, the breakpoint at position 261 is likely due to recombination according to the Kishino-Hasegawa test

Breakpoint | LHS Raw p | LHS adjusted p | RHS Raw p | RHS adjusted p 
       261 |   0.00050 |        0.00100 |   0.00540 |        0.01080

At p = 0.01 there are 0 significant breakpoints
At p = 0.05 there are 1 significant breakpoints
At p = 0.1 there are 1 significant breakpoints

Screening for selection in the presence of recombination

Using a multi-partition fixed effects likelihood (FEL)

In this exercise we will examine an alignment of 13 HIV-1 glycoprotein sequences from Cache Valley Fever virus, with and without correcting for recombination.

  1. Download File:CVV G.fas and File:CVV G GARD.nex to your computer saving it into a new directory (name it FEL)
  2. Lauch HyPhy, select Analysis:Standard Analyses from the console window menu.
  3. In the ensuing dialog box listing analysis types, click on the arrow next to Selection/Recombination to expand the list further (if needed), choose QuickSelectionDetectionMF.bf. The analysis allows SLAC and FEL to be run on a dataset which is partitioned into multiple non-recombinant fragments (e.g. by GARD), i.e. using the [PARRIS approach http://bioinformatics.oxfordjournals.org/content/22/20/2493.abstract]
  1. Choose Genetic Code : Universal
  2. New/Restore : New Analysis
  3. Model Options : Custom
  4. Please enter a 6 character model designation (e.g:010010 defines HKY85):012345 (type into the lower box of the console window and press Enter)
  5. How many datafiles are to be analyzed (>=1):?: 1 (type into the lower box of the console window and press Enter)
  6. Please specify a codon data file : Use the navigation box to find CVV_G_GARD.nex (this file contains the alignment and the five partitions inferred to be non-recombinant by GARD).
  7. Save nucleotide model fit to : Use the navigation box to save it to CVV_G_GARD.REV
  8. dN/dS bias parameter options : Estimate dN/dS only
  9. Which method? : FEL
  10. ... HYPHY will do some work and print some text to the console window ...
  11. Significance level for Likelihood Ratio Tests (between 0 and 1)? : 0.1 (type into the lower box of the console window and press Enter)
  12. Branch option : All
  13. ... HYPHY will do some work and print some text to the console window ...
  14. Save results to : Save site-by-site LRT results toCVV_G_GARD.csv

A typical output line may look like this:

Site  195 dN/dS =     inf dN =  4.9848 dS =  0.0000 dS(=dN)  2.3353 Full Log(L) = -14.5463 LRT=  3.9208 p-value =  0.04769 *P

Here, codon 195, has the maximum likelihood synonymous rate (dS) inferred at 0, and the non-syn rate (dN) - at 4.9848 (their ratio is infinite). The log-likelihood of the site with these parameters is -14.5463. The null model which forces dN=dS infers the value at 2.3353. The likelihood ratio test for non-neutral evolution has the test statistic of 3.9208 and the p-value of 0.04769 (which is significant at the specified level). The site is called positively selected (*P), because the test is significant and dN>dS.

Now repeat the analysis on the same file but not corrected for recombination (File:CVV G.fas). Make sure not to overwrite any of the result files. If you have access to a plotting program (e.g. Excel or R), load the two resulting .csv files and plot the p-values against each other. Do you think that there is an effect depending on whether or not we correct for the possible confounding caused by recombination?

GARD or no GARD?

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox