HIV454

From HyPhy Wiki

Overview

This pipeline was developed to analyze high-throughput sequencing data from the Roche 454 platform. The pipeline comprises a quality filtering utility written in C and several HyPhy batch files. The pipeline will run on all *nix operating systems with HyPhy installed. All results are added to a SQLite database contained in the results folder.

Installation


Download, extract and install the tarball as follows.

$ tar -zxvf 454.tar.gz
$ cd ./454/source/c
$ gcc 454_filtering.c -o 454_filter -lm


This will create the quality filtering executable for Stage 0 of the pipeline. The remaining stages are all implemented within HyPhy. See this page for information on installing HyPhy on your system. In all cases, we assume that the HYPHYMP and/or HYPHYMPI executables are located in /opt/hyphy/HYPHY.

Stage 0 (Quality filter)


Overview: The 454 quality filter removes reads of low quality (as determined by their PHRED scores).

Prerequisites: 454 .fna file containing the sequence reads; 454 *.qna file containing the PHRED quality scores.

Usage: The general syntax is as follows:

$ 454_filter <fasta> <quality> <min phred score> <min run length> <filtering mode>


<fasta> is the fasta sequence read file
<quality> is the PHRED quality scores file
<min phred score> is the minimum PHRED score for inclusion.
Suggested default is 20, which corresponds to a 0.01 error rate (1 expected error in 100 bases)
<min run length> is the minimum length of consecutive nucleotides with PHRED score >= min phred score
Suggested default is 100-200 nucleotides.
<filtering mode> is the means by which reads are filtered
0 truncate reads to only sites meeting the min phred score and keep homopolymers
1 split reads containing sites with low quality scores into multiple fragments and keep homopolymers
2 truncate reads to only sites meeting the min phred score and remove homopolymers
3 split reads containing sites with low quality scores into multiple fragments and remove homopolymers
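
The parameters above can be illustrated with a minimal Python sketch. This is not the C implementation shipped in the tarball. It assumes the standard PHRED definition, P = 10^(-Q/10), and it assumes that "truncate" keeps only the first qualifying run of a read while "split" keeps every qualifying run as a separate fragment. The function names are hypothetical, and homopolymer handling is not modelled.

```python
def phred_to_error(q):
    """Convert a PHRED score to an error probability (standard definition)."""
    return 10 ** (-q / 10)


def high_quality_runs(seq, quals, min_phred, min_run):
    """Return every maximal run of bases with PHRED >= min_phred
    that is at least min_run bases long."""
    runs, start = [], None
    for i, q in enumerate(quals):
        if q >= min_phred:
            if start is None:
                start = i          # a new high-quality run begins here
        elif start is not None:
            if i - start >= min_run:
                runs.append(seq[start:i])
            start = None
    # close a run that extends to the end of the read
    if start is not None and len(seq) - start >= min_run:
        runs.append(seq[start:])
    return runs


def filter_read(seq, quals, min_phred=20, min_run=100, split=False):
    """ASSUMPTION: 'truncate' keeps the first qualifying run only,
    'split' keeps all qualifying runs as separate fragments."""
    runs = high_quality_runs(seq, quals, min_phred, min_run)
    return runs if split else runs[:1]
```

For example, phred_to_error(20) gives 0.01, i.e. one expected error per 100 bases, and phred_to_error(30) gives 0.001.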


Output: Sequence reads passing the quality tests are printed to standard output (stdout), and summary statistics to standard error (stderr). Both default to the screen, but can be redirected. For instance, the following redirects stdout (1) to a file named reads and stderr (2) to a file named summary:

$ 454_filter <fasta> <quality> <min phred score> <min run length> <filtering mode> 1>reads 2>summary


Stage 1-7 launcher


Overview: All stages of 454 sequence analysis described below can be launched using a single batch file.

Prerequisites: A quality filtered read file from Stage 0.

Usage: The 454_launcher.bf batch file will launch all 454 analyses. For single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_launcher.bf


or on multi-processor machines with a working installation of MPI:

$ mpirun -np <np> /opt/hyphy/HYPHY/HYPHYMPI BASEPATH=/opt/hyphy/HYPHY/ ./source/hyphy/454_launcher.bf


Output: All results are added to tables in the corresponding SQLite database file. Additional output files for each stage are described below.
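
Because every stage reads from and writes to the SQLite database, it can help to check which tables are present before running a stage on its own. The following is a minimal sketch using Python's standard sqlite3 module. The function names and the database path in the usage note are hypothetical; the table names are the ones listed in the stage prerequisites.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


def missing_tables(db_path, required):
    """Return the required table names that are absent from the database."""
    present = set(list_tables(db_path))
    return [t for t in required if t not in present]
```

A call such as missing_tables("results.db", ["SEQUENCES", "SETTINGS", "NUC_ALIGNMENT"]) returns an empty list when the Stage 1 tables needed by Stages 2 and 3 are all present. Note that sqlite3.connect creates an empty database if the path does not exist, so check the path first.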

Stage 1 (Alignment)


Overview: The alignment batch file (454_alignment.bf) first performs an amino acid alignment between a chosen reference sequence and each of the reads. Only alignments that exceed an alignment score threshold are retained, where the threshold is 5 x the alignment score expected from a read of equal length and identical base composition. The next alignment step tries to rescue reads that failed the amino acid alignment by performing pairwise nucleotide alignments to the consensus of the reads that passed it. A read is included in this second step if its pairwise per-nucleotide alignment score exceeds the median of that from all reads included in the amino acid alignment step.
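
The two-step selection rule can be sketched in Python as follows. This is only an illustration of the thresholds described above, not the HyPhy batch file: the alignment scores themselves are assumed to have been computed elsewhere, and the function and argument names are hypothetical.

```python
from statistics import median


def select_reads(aa_score, expected_score, nuc_score_per_nt, factor=5):
    """Two-step read selection.

    aa_score[r]         : amino acid alignment score of read r
    expected_score[r]   : score expected for a read of equal length and
                          identical base composition (computed elsewhere)
    nuc_score_per_nt[r] : per-nucleotide score of read r aligned to the
                          consensus of the reads that passed step 1
    Returns (reads accepted in step 1, reads rescued in step 2).
    """
    # Step 1: keep reads whose score exceeds factor x the expected score.
    step1 = {r for r in aa_score if aa_score[r] > factor * expected_score[r]}
    if not step1:
        return step1, set()
    # Step 2: rescue failed reads that beat the median per-nucleotide
    # score of the reads accepted in step 1.
    cutoff = median(nuc_score_per_nt[r] for r in step1)
    step2 = {r for r in aa_score
             if r not in step1 and nuc_score_per_nt[r] > cutoff}
    return step1, step2
```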

Prerequisites: A quality filtered read file from Stage 0.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows on single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454.bf


or on multi-processor machines with a working installation of MPI:

$ mpirun -np <np> /opt/hyphy/HYPHY/HYPHYMPI BASEPATH=/opt/hyphy/HYPHY/ ./source/hyphy/454.bf


Output: All results are added to tables in the corresponding SQLite database file. In addition, reads that do not pass the alignment are written to *_uds.genename.remaining.fas. These are either low quality reads or, in the case of multiplexed 454 runs, reads from other gene regions. The *_uds.genename.remaining.fas file can subsequently be run through Stage 1 with the next reference gene.

Stage 2 (Summary statistics)


Overview: This batch file reports summary statistics on read length, depth and frequencies of minority variants.
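
As an illustration of what these statistics mean, the following Python sketch computes per-position coverage and the majority base frequency from aligned reads. The input representation (a start position plus an aligned sequence, with '-' for gaps) and the function name are assumptions made for this example; the pipeline itself reads from the SQLite tables.

```python
from collections import Counter


def column_summary(reads):
    """reads: list of (start, sequence) pairs, where start is the 0-based
    reference position of the first aligned base and '-' marks a gap.
    Returns {position: (coverage, majority_base, majority_frequency)}."""
    columns = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            if base != "-":   # gaps do not contribute to coverage
                columns.setdefault(start + offset, Counter())[base] += 1
    summary = {}
    for pos in sorted(columns):
        counts = columns[pos]
        coverage = sum(counts.values())
        base, n = counts.most_common(1)[0]
        summary[pos] = (coverage, base, n / coverage)
    return summary
```

The combined frequency of minority variants at a position is then 1 minus the majority frequency.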

Prerequisites: A SQLite database file from Stage 1 with tables SEQUENCES, SETTINGS, NUC_ALIGNMENT.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_reporter.bf


Output: All results are added to tables in the corresponding SQLite database file. Coverage and majority frequency plots are also generated in postscript format as *_coverage.ps and *_majority.ps.

Stage 3 (Diversity analysis)


Overview: The sliding window analysis batch file estimates maximum sequence divergence in sliding windows which meet the minimum coverage criteria. Phylogenies are also estimated within sliding windows, and bootstrap resampling is applied to the sliding window with at least 4 variants and maximum sequence divergence. The latter is useful for the estimation of dual/multi infection, although the power to recover well-supported trees is reduced since reads are typically short (<200bp).
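
The sliding window scan can be sketched in Python as below. Several details are assumptions made for illustration: coverage is taken to be the number of reads spanning the whole window, divergence is measured as the simple proportion of differing sites (the pipeline may use a model-based distance), and the function names, window width and step are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def max_divergence_windows(reads, ref_len, width, step, min_coverage):
    """reads: list of (start, sequence) pairs aligned to the reference.
    Returns {(from, to): max pairwise p-distance} for windows covered
    by at least min_coverage reads that span the whole window."""
    result = {}
    for lo in range(0, ref_len - width + 1, step):
        hi = lo + width
        # keep only reads that span the entire window, cut to the window
        pieces = [seq[lo - start:hi - start] for start, seq in reads
                  if start <= lo and start + len(seq) >= hi]
        if len(pieces) < min_coverage:
            continue
        result[(lo, hi)] = max(
            (p_distance(a, b) for a, b in combinations(pieces, 2)),
            default=0.0)
    return result
```

The window with the largest value in the returned dictionary corresponds to the window of maximum divergence described above.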

Prerequisites: A SQLite database file from Stage 1 with tables SEQUENCES, SETTINGS, NUC_ALIGNMENT.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows on single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_sliding_window.wbf


or on multi-processor machines with a working installation of MPI:

$ mpirun -np <np> /opt/hyphy/HYPHY/HYPHYMPI BASEPATH=/opt/hyphy/HYPHY/ ./source/hyphy/454_sliding_window.wbf


Output: All results are added to tables in the corresponding SQLite database file. In addition, the following files are generated for each sliding window.

*_from_to.fas: unaligned reads in window from-to
*_from_to.fas.nuc: alignment of reads in window from-to
*_from_to.fas.tree: newick tree file with Neighbor-Joining estimated tree of reads in window from-to
*_from_to.fas.sim: bootstrap replicate in window from-to. Note that the serial HYPHYMP version will overwrite this file for every replicate, whereas the parallel HYPHYMPI version will create a new file for every replicate. These files can be deleted once the analysis is complete.
*_from_to.fas.sim.tree: newick tree file for bootstrapped dataset
*max_from_to.fas: alignment of reads in the window (from-to) with maximum divergence
*max_from_to.fas.ps: postscript Neighbor-Joining tree estimated from reads in the window with maximum divergence
*max_from_to.fas.tree: newick tree file with Neighbor-Joining estimated tree of reads in window with maximum divergence



Stage 4 (Mutation rate estimation)


Overview: The number of mutation rate classes is estimated using a binomial mixture model. Briefly, we fit a model with a single rate class and estimate the mutation rate from a binomial distribution with the number of successes equal to the number of observed mutations at a site, and the number of trials equal to the observed coverage at a site. Additional rate classes are added using a mixture of binomial models until model fit (evaluated using AIC) is no longer improved. The parameters of the binomial mixture model (i.e. rates and their respective proportions) are estimated using maximum likelihood.
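
The model selection loop can be sketched in Python as follows. This is a simplified illustration, not the HyPhy implementation: the likelihood is maximised with a plain expectation-maximisation (EM) routine chosen for this example, the starting values are spread evenly over the observed frequencies, and the function names are hypothetical. Each site is given as a (mutations, coverage) pair.

```python
import math


def fit_mixture(counts, n_classes, iters=200):
    """Fit a binomial mixture by EM. counts: list of (mutations, coverage).
    Returns (log_likelihood, rates, weights); the binomial coefficient is
    omitted because it is constant across models."""
    freqs = [k / n for k, n in counts]
    lo, hi = min(freqs), max(freqs)
    if n_classes == 1:
        rates = [sum(k for k, _ in counts) / sum(n for _, n in counts)]
    else:  # spread the starting rates evenly over the observed range
        rates = [lo + (hi - lo) * j / (n_classes - 1) for j in range(n_classes)]
    weights = [1.0 / n_classes] * n_classes
    for _ in range(iters):
        log_lik, resp = 0.0, []
        for k, n in counts:  # E-step: class responsibilities per site
            terms = []
            for w, p in zip(weights, rates):
                p = min(max(p, 1e-12), 1 - 1e-12)
                terms.append(math.log(max(w, 1e-300))
                             + k * math.log(p) + (n - k) * math.log(1 - p))
            top = max(terms)
            total = sum(math.exp(t - top) for t in terms)
            log_lik += top + math.log(total)
            resp.append([math.exp(t - top) / total for t in terms])
        for j in range(n_classes):  # M-step: update weights and rates
            r_sum = sum(r[j] for r in resp)
            trials = sum(r[j] * n for r, (_, n) in zip(resp, counts))
            weights[j] = r_sum / len(counts)
            if trials > 0:
                rates[j] = sum(r[j] * k for r, (k, _) in zip(resp, counts)) / trials
    return log_lik, rates, weights


def select_rate_classes(counts, max_classes=5):
    """Add rate classes until AIC stops improving."""
    best = None
    for c in range(1, max_classes + 1):
        log_lik, rates, weights = fit_mixture(counts, c)
        aic = 2 * (2 * c - 1) - 2 * log_lik  # c rates + (c - 1) free weights
        if best is not None and aic >= best[0]:
            break
        best = (aic, rates, weights)
    return best
```

select_rate_classes returns the AIC, rates and class proportions of the last model that improved the fit. EM can converge to a local optimum, so a production analysis would normally try several starting points.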

Prerequisites: A SQLite database file from Stage 1 with tables SEQUENCES, SETTINGS, NUC_ALIGNMENT, AA_ALIGNMENT.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows on single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_variants.bf


Once this is complete, a second batch file is used to assign sites to the estimated rate classes:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_rateClass_NEB.bf


Output: All results are added to tables in the corresponding SQLite database file.

Stage 5 (Selection analysis)


Overview: Selection at sites is evaluated using all pairwise comparisons between reads. We estimate the ratio of observed non-synonymous to synonymous substitutions (weighted by the number of pairwise comparisons) and compare this to that expected given the observed codon frequencies and the genetic code.

Prerequisites: A SQLite database file from Stage 1 with tables SEQUENCES, SETTINGS, NUC_ALIGNMENT, AA_ALIGNMENT.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows on single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_FEL.bf


Output: All results are added to tables in the corresponding SQLite database file.

Stage 6 (Drug resistant mutation analysis)


Overview: For each drug resistant site we estimate the mutation rank (i.e. the rank of the mutation rate with respect to all other sites) and calculate the median mutation rank of all drug resistant sites. The probability (P) that the median mutation rank at drug resistant sites is greater than that of an equivalent-sized sample of non-drug resistant sites is evaluated with permutations (n=1000). These data can be used to determine whether mutation properties at drug resistant sites are unique. Furthermore, we classify drug resistant sites into mutation rate classes using the same methods described in the mutation rate class estimation procedure. Here we can evaluate the posterior probability that a drug resistant site falls within a particular mutation rate class.
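
The permutation step can be sketched in Python as follows. The ranking convention (rank 1 is the lowest rate), the handling of ties, the fixed random seed and the function name are assumptions made for this example. The sketch also requires at least as many non-drug resistant sites as drug resistant sites.

```python
import random
from statistics import median


def median_rank_test(rates, dr_sites, n_perm=1000, seed=0):
    """rates: {site: estimated mutation rate}; dr_sites: drug resistant sites.
    Returns (median rank of DR sites, fraction of random non-DR samples
    whose median rank is below it)."""
    # ASSUMPTION: rank 1 is the lowest rate; ties keep their input order.
    ordered = sorted(rates, key=rates.get)
    rank = {site: i + 1 for i, site in enumerate(ordered)}
    dr = set(dr_sites)
    dr_median = median(rank[s] for s in dr)
    other_ranks = [rank[s] for s in rates if s not in dr]
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # draw an equivalent-sized sample of non-DR sites
        sample = rng.sample(other_ranks, len(dr))
        if dr_median > median(sample):
            hits += 1
    return dr_median, hits / n_perm
```

A returned fraction close to 1 indicates that drug resistant sites rank consistently higher than comparable samples of other sites, and a fraction close to 0 indicates the opposite.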

Prerequisites: A SQLite database file from Stage 1 with tables SEQUENCES, SETTINGS, NUC_ALIGNMENT, AA_ALIGNMENT.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows on single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_MDR_variants.bf


Once this is complete, a second batch file is used to assign drug resistant sites to the estimated rate classes:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_rateClass_NEB.bf


Output: All results are added to tables in the corresponding SQLite database file.

Stage 7 (Identification of drug resistant compensatory mutations)


Overview: This batch file screens reads for the occurrence of both drug resistant and compensatory mutation sites. A Fisher's exact test is performed to determine whether drug resistant mutations and compensatory mutations occur together on the same read more frequently than expected by chance.
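
For reference, a one-sided Fisher's exact test on a 2x2 table of read counts can be written in a few lines of Python (3.8 or later, for math.comb). This is a generic sketch of the test, not the batch file's code. The choice of a one-sided alternative and the argument names are assumptions for this example.

```python
from math import comb


def fisher_exact_greater(both, dr_only, comp_only, neither):
    """One-sided Fisher's exact test on the 2x2 table of read counts

                      compensatory   no compensatory
        resistant         both           dr_only
        not resistant   comp_only        neither

    Returns P(at least `both` co-occurrences | fixed margins)."""
    row1 = both + dr_only            # reads with the resistance mutation
    col1 = both + comp_only          # reads with the compensatory mutation
    total = row1 + comp_only + neither
    denom = comb(total, col1)
    p = 0
    for x in range(both, min(row1, col1) + 1):
        # hypergeometric count of tables with x co-occurrences
        p += comb(row1, x) * comb(total - row1, col1 - x)
    return p / denom
```

For the table (3, 1, 1, 3) the function returns 17/70, about 0.243, so that level of co-occurrence would not be considered significant.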

Prerequisites: A SQLite database file from Stage 1 with tables SEQUENCES, BASE_FREQUENCIES.

Usage: The batch file is included in the 454_launcher.bf script, but it can also be run independently as follows on single/multi-core machines:

$ /opt/hyphy/HYPHY/HYPHYMP BASEPATH=/opt/hyphy/HYPHY ./source/hyphy/454_compensatoryMutations.bf


Output: All results are added to tables in the corresponding SQLite database file.
