Oral Presentation 9th GeneMappers Conference 2012

Web-based analysis pipeline for generation and assessment of SNP calls and genotyping from long‑range amplicon sequence (#10)

Alexander Shaw 1, Warren Kaplan 2, Yash Tiwari 1, Peter R Schofield 1,3, Janice M Fullerton 1,3
  1. Neuroscience Research Australia, Sydney, NSW, Australia
  2. Peter Wills Bioinformatics Centre, Garvan Institute of Medical Research, Sydney, NSW, Australia
  3. School of Medical Sciences, The University of New South Wales, Sydney, NSW, Australia

The advent of large-scale massively parallel sequencing projects has been accompanied by the development of sophisticated analysis tools. Although publicly available, most of these tools are difficult to set up and use, which leaves them largely inaccessible to those without significant bioinformatics expertise. The result is an often costly ‘usability gap’ between the growing potential of massively parallel sequencing data to accurately characterise genetic variation and the ability of researchers to perform such analyses [1], especially in laboratories undertaking small-scale projects without extensive bioinformatics support. We demonstrate an effective solution to the ‘usability gap’ by constructing a complete analysis pipeline for amplicon-based targeted resequencing of a cohort of individuals. The pipeline incorporates best-practice, cross-platform analysis tools and can be easily used and modified by any researcher to analyse their own data. It operates within the Galaxy web-based genomic analysis platform [3], so users can perform an entire analysis from a web browser: uploading raw sequencing reads (.sff files), running the pipeline remotely, and downloading the results, namely SNP calls and phased haplotypes (.vcf files). The pipeline follows best-practice variant detection recommendations using the Genome Analysis Toolkit (GATK) [2] and implements additional QC steps for PCR amplicon sequencing data. We demonstrate its effectiveness by analysing Roche 454-generated long-range amplicon sequencing of a 96 kb gene region in 48 individuals. We assess the accuracy of our approach by examining concordance with genotype data for the same DNA samples generated using the Illumina GoldenGate and 660W BeadChip platforms, and we compare SNPs called in our analysis with those reported in the 1000 Genomes Project.
Our pipeline will aid researchers in efficiently applying best‑practice tools to accurately characterise genetic variation within their genomic region of interest.
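
The concordance assessment mentioned above can be illustrated with a minimal sketch. It is not the pipeline's own code. It assumes genotypes have already been extracted into dictionaries keyed by (sample, site), that alleles from both platforms are reported on the same strand, and that missing calls are stored as None. The function name, the sample and site labels, and the toy genotypes are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None marks a missing call (assumed convention)
Key = Tuple[str, str]                 # (sample ID, site ID), hypothetical labels


def genotype_concordance(
    seq_calls: Dict[Key, Genotype],
    array_calls: Dict[Key, Genotype],
) -> Tuple[int, int, float]:
    """Return (matching, compared, rate) over calls present in both sets."""
    compared = 0
    matching = 0
    for key, seq_gt in seq_calls.items():
        array_gt = array_calls.get(key)
        # Skip sites that are absent or uncalled on either platform.
        if seq_gt is None or array_gt is None:
            continue
        compared += 1
        # Genotypes are unordered allele pairs, so A/G equals G/A.
        if sorted(seq_gt) == sorted(array_gt):
            matching += 1
    rate = matching / compared if compared else float("nan")
    return matching, compared, rate


# Invented toy data, for illustration only.
toy_seq = {
    ("S1", "site1"): ("A", "G"),
    ("S1", "site2"): ("C", "C"),
    ("S2", "site1"): ("A", "A"),
    ("S2", "site2"): None,
}
toy_array = {
    ("S1", "site1"): ("G", "A"),
    ("S1", "site2"): ("C", "T"),
    ("S2", "site1"): ("A", "A"),
    ("S2", "site2"): ("C", "C"),
}

matching, compared, rate = genotype_concordance(toy_seq, toy_array)
print(f"{matching} of {compared} shared genotype calls agree ({rate:.1%})")
```

In practice the two call sets would first need to be harmonised (matching site identifiers, strand, and reference build) before a comparison of this kind is meaningful.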

  1. McPherson, J.D. Next-generation gap. Nature Methods 6, 2-5 (2009).
  2. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 1297-1303 (2010).
  3. Goecks, J., Nekrutenko, A. & Taylor, J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11, R86 (2010).