# Datamonkey.org tutorial

Note that if the cluster is busy, you may want to do the exercises out of order, because some of them may take a long time to complete.

For this practical, we will be using the http://www.datamonkey.org public webserver and working through the corresponding tutorial (http://www.datamonkey.org/help/tutorial.pdf). Note that the tutorial is a bit out of date (and some of the results you obtain with the current server are slightly different due to methodology and code changes), but you should be able to follow it.

Please begin by reviewing chapters 1 and 2 of the tutorial.

Lecture notes can be found in File:Selection.pdf.

## Diversifying selection at individual sites

### An empirical SLAC, FEL and REL analysis

Carry out the exercise in Chapter 3 using the referenced Influenza example.

• Confirm that section 3.2.2 returns HKY85 as the best nucleotide model for the alignment
• Perform the SLAC analysis and replicate the results in Figure 4, including plots.
• Carefully work through section 3.5, replicate Figure 5, and see if you can convince yourself how the p-value for positive selection is obtained
• Carry out FEL and REL analyses in sections 3.6 and 3.7 (Figure 6 should be different from the one in the tutorial with more sites reported under selection). Note that you may have to wait a few minutes for the job to finish, depending on how busy the cluster is.
• Follow through to Section 3.8 to appreciate how different methods to analyze the same data can lead to different results, and that their agreement is sometimes good evidence that selection has really taken place and is not a statistical artifact.
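
One way to reason about the p-value in section 3.5 is as a binomial tail probability: under neutrality, each substitution at a codon is nonsynonymous with a probability equal to the proportion of nonsynonymous sites, and the p-value for positive selection is the chance of seeing at least the observed number of nonsynonymous changes. The sketch below makes simplifying assumptions: it uses integer substitution counts, whereas the counts inferred by SLAC can be fractional, so the numbers will not match Datamonkey exactly. All function names and the example counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def slac_like_pvalue(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites):
    """Binomial-tail p-value for an excess of nonsynonymous substitutions.

    nonsyn_subs / syn_subs: inferred substitution counts at the codon
    (integers in this sketch).
    nonsyn_sites / syn_sites: expected numbers of nonsynonymous and
    synonymous sites at the codon.
    """
    # Probability that a random substitution is nonsynonymous under neutrality.
    p_nonsyn = nonsyn_sites / (nonsyn_sites + syn_sites)
    total_subs = nonsyn_subs + syn_subs
    return binomial_upper_tail(nonsyn_subs, total_subs, p_nonsyn)


# Hypothetical codon: 8 nonsynonymous and 0 synonymous substitutions,
# with 2.5 nonsynonymous and 0.5 synonymous sites.
example_p = slac_like_pvalue(nonsyn_subs=8, syn_subs=0, nonsyn_sites=2.5, syn_sites=0.5)
print(f"p-value for positive selection: {example_p:.4f}")
```

The p-value for negative selection can be reasoned about in the same way, using the lower tail (the chance of seeing at most the observed number of nonsynonymous changes).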

### FEL on simulated data

Analyze the alignment in File:Neutral.nex using the best-fitting nucleotide model and the user tree with FEL. This alignment was simulated under neutrality, i.e. each site has ω = 1. Tabulate the number of sites inferred to be under positive and negative selection for different p-values (use the Retabulate button on the FEL results page). Does this number behave as expected? What does the fact that some sites are falsely inferred as positively/negatively selected tell us about the reliability of statistical inference on molecular data?
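
As a rough guide to what "behaving as expected" means here: if the test were perfectly calibrated, p-values at neutral sites would be uniformly distributed, so about a fraction p of the sites would be called selected (positively or negatively) at threshold p. The sketch below illustrates this with simulated uniform p-values rather than real FEL output; the number of sites (300) and the thresholds are hypothetical, and a real test may be somewhat conservative or liberal.

```python
import random


def expected_false_positives(n_sites, alpha):
    """Expected number of null sites with p <= alpha if p-values are uniform."""
    return n_sites * alpha


def tabulate(p_values, thresholds):
    """Count how many p-values fall at or below each threshold."""
    return {t: sum(p <= t for p in p_values) for t in thresholds}


rng = random.Random(1)  # fixed seed so the illustration is repeatable
n_sites = 300           # hypothetical alignment length in codons
null_p_values = [rng.random() for _ in range(n_sites)]

thresholds = [0.01, 0.05, 0.1, 0.2]
counts = tabulate(null_p_values, thresholds)
for t in thresholds:
    expected = expected_false_positives(n_sites, t)
    print(f"p <= {t}: observed {counts[t]}, expected about {expected:.1f}")
```

Comparing your retabulated FEL counts with `n_sites * p` at each threshold is one way to judge whether the false positive rate is close to nominal.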

### Power to detect selection

Analyze the Drosophila alcohol dehydrogenase (adh) alignment File:Adh.nex (using the best-fitting nucleotide model and the user tree) with SLAC, FEL and MEME. This file was analyzed by many authors, most notably in [Yang et al 2000], where the authors found no evidence of positive selection and stated:

We note that previous studies (e.g., Hudson et al. 1987) have suggested the operation of balancing selection at one particular amino acid site at the adh locus in Drosophila. Our LRTs, while highlighting the extreme variation in selective pressure among sites, do not suggest existence of sites under diversifying selection. This may be due to the lack of power of our models to detect balancing selection.

Use the Integrative Selection tool on the File Status page (see Section 3.8 in the tutorial) with a p-value of 0.1 for FEL and a p-value of 0.05 for MEME to find three types of sites:
1. A site which both FEL and MEME find to be under positive selection (reaching significance).
2. A site which MEME finds to be under positive selection with p < 0.05, and which FEL also infers to be positively selected, but with p > 0.1.
3. A site which MEME finds to be under positive selection with p < 0.05, while FEL finds it to be under negative selection with p < 0.2.

Explore the mutational patterns for each type of site (as was done on page 90 of the lecture notes) by clicking on the links in the Additional Information column on the Integrative Analysis page, and reason about why some mutational patterns may be easier to detect with MEME than with FEL.
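
If you prefer to sort sites into the three types programmatically, the rules above can be sketched as a small function. This is only an illustration: the argument names are hypothetical and do not correspond to Datamonkey column headers, and the sign of `fel_dn_minus_ds` is assumed to indicate whether FEL estimates the site as positively (greater than 0) or negatively (less than 0) selected.

```python
def classify_site(meme_p, fel_p, fel_dn_minus_ds):
    """Return 1, 2 or 3 for the three site types, or None if none applies.

    meme_p: MEME p-value for episodic positive selection at the site.
    fel_p: FEL p-value at the site.
    fel_dn_minus_ds: FEL estimate of dN - dS (sign gives the direction).
    """
    if meme_p >= 0.05:
        return None  # MEME does not call the site at p < 0.05
    if fel_dn_minus_ds > 0:
        # FEL agrees on the direction; type depends on FEL significance.
        return 1 if fel_p <= 0.1 else 2
    if fel_dn_minus_ds < 0 and fel_p < 0.2:
        return 3  # MEME positive, FEL negative
    return None


# Hypothetical sites: (MEME p, FEL p, FEL dN - dS)
toy_sites = {
    "A": (0.01, 0.03, 1.2),
    "B": (0.02, 0.40, 0.3),
    "C": (0.04, 0.15, -0.8),
    "D": (0.30, 0.02, 1.5),
}
for name, values in toy_sites.items():
    print(name, classify_site(*values))
```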

## Diversifying selection along lineages

Read and do the exercise in Section 3.9 of the Datamonkey tutorial to see how computational methods can look at which branches in the tree are subject to selection. Reanalyze the same alignment with the Branch-Site REL method and compare the results of the two methods.

Note that the GA-Branch approach could be more powerful because it pools branches together (assuming several are under selection), while Branch-site REL could be more powerful because it permits ω to vary from site to site, i.e. only some of the sites need to evolve with ω>1, unlike GA-Branch, where on average all sites have ω>1.

## Directional selection

Access the amino-acid (FASTA) translation of the flu file from the first empirical analysis of the tutorial from the job summary page (see Figure: Amino-acid translation), then download and save it to a file (flu.aa). Upload this amino-acid file to Datamonkey to start a separate analysis for directional selection. First, run the protein model selection tool (which model does it select?). Second, run the FADE (a faster version of DEPS) analysis to identify which sites in this alignment are subject to directional positive selection. Make sure to select the appropriate Protein Substitution Model and root the tree on one of the 1997 sequences (you can tell the year from the sequence name). This analysis takes a few minutes to run. Are there any sites under directional selection? Repeat the analysis with a larger file from http://www.hyphy.org/pubs/MED263/data/Human_H1.fas

## The effect of recombination on selection detection

For this exercise we will use an alignment of the Cache Valley Fever virus glycoprotein sequences File:CVV G.fas.

1. Run FEL and PARRIS on this alignment (best model, user tree). Download the .csv table of site-wise p-values reported by FEL and record the global p-value returned by PARRIS.
2. Screen the same alignment for recombination using GARD (this may take a bit of time).
3. Repeat Step 1, but using the GARD-inferred trees for the analysis.
4. Compare the site-by-site FEL p-values and the global PARRIS p-value between the two runs (with and without the GARD-inferred trees).
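
The site-by-site comparison in Step 4 can be done by eye or in a spreadsheet, but a short script is one alternative. The sketch below assumes each FEL .csv has one row per codon with a site column and a p-value column; the column names `Codon` and `p-value` are placeholders, so check the header of the file you actually downloaded. The inline tables are made-up toy values, not real results.

```python
import csv
import io


def read_p_values(csv_text, site_col="Codon", p_col="p-value"):
    """Parse CSV text into {site: p-value}. Column names are assumptions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row[site_col]): float(row[p_col]) for row in reader}


def changed_calls(before, after, alpha=0.1):
    """List sites whose significance at `alpha` differs between two runs."""
    changed = []
    for site in sorted(before.keys() & after.keys()):
        was_significant = before[site] <= alpha
        is_significant = after[site] <= alpha
        if was_significant != is_significant:
            changed.append((site, before[site], after[site]))
    return changed


# Toy stand-ins for the two downloaded tables (values are invented).
single_tree_csv = "Codon,p-value\n1,0.04\n2,0.50\n3,0.09\n"
gard_trees_csv = "Codon,p-value\n1,0.03\n2,0.45\n3,0.30\n"

single_tree = read_p_values(single_tree_csv)
gard_trees = read_p_values(gard_trees_csv)
for site, p_before, p_after in changed_calls(single_tree, gard_trees):
    print(f"site {site}: p = {p_before} (single tree) vs p = {p_after} (GARD trees)")
```

To use it on real output, read each downloaded file into a string (for example with `open(path).read()`) and pass the correct column names to `read_p_values`.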