GenomeGems: evaluation of genetic variability from deep sequencing data
© Ben-Zvi et al.; licensee BioMed Central Ltd. 2012
Received: 14 December 2011
Accepted: 24 May 2012
Published: 2 July 2012
Detection of disease-causing mutations using Deep Sequencing technologies poses great challenges. In particular, organizing the large number of sequences generated so that potentially biologically relevant mutations are easily identified is a difficult task. Yet only a limited number of accessible, automatic tools exist for this task.
We developed GenomeGems to fill this gap by enabling the user to view and compare Single Nucleotide Polymorphisms (SNPs) from multiple datasets and to load the data onto the UCSC Genome Browser for an expanded and familiar visualization. As such, via automatic, clear and accessible presentation of processed Deep Sequencing data, our tool aims to facilitate ranking of genomic SNP calling. GenomeGems runs on a local Personal Computer (PC) and is freely available at http://www.tau.ac.il/~nshomron/GenomeGems.
GenomeGems enables researchers to identify potential disease-causing SNPs in an efficient manner. This enables rapid turnover of information and leads to further experimental SNP validation. The tool allows the user to compare and visualize SNPs from multiple experiments and to easily load SNP data onto the UCSC Genome browser for further detailed information.
Keywords: Deep sequencing, Next generation sequencing, Software, Genetic analysis, Data interpretation, Variance calling
Comparison of some of the currently available tools for data interpretation and analysis
Data file formats
Win, Linux, and OS X
Multiple sequence alignments and data typically associated with alignments
Win, Linux, Mac OS X
Next-generation sequencing data
ACE format (commonly used by genome assembly programs), READS, EGI, MAP
An AJAX based web viewer. Requires a standard web browser
Illumina genome analyzer sequencing data
1. Both CIGAR and new sequencing technology alignment data
2. SAM/BAM format of SAM tools
Next-generation sequencing data
SAM format- enables an easy conversion of various input file formats, including PSL, MAQ, Bowtie, SOAP, ZOOM
Windows, OS X, Linux, Solaris
ACE, AFG, MAQ and SOAP assembly formats. Also 454 and Solexa data
Next-generation sequencing data, analyzed by MAQ, Variant SNP Classifier, and SNVMix, in a pre-determined ‘.txt’ format
Especially MAQ, but also Variant SNP Classifier and SNVMix
Reads ‘.txt’ file format with columns separated by tabs
A comparison of the Visualization Capabilities and Data Integration of the different tools currently available with those of GenomeGems
Three distinct display modes:
1. Very low resolution - a histogram.
2. Intermediate resolution - a ‘Wiggly Plot’.
3. Very high resolution - the user may view the sequence data directly.
Compact with zooming capability.
Genome features (exon, intron, etc.), Polymorphism data (e.g. SNP), 454 flowgram trace, Illumina four color raw signals.
Pinpoint view of: base quality, technology-specific sequence trace, read ID and strand.
1. A resolution from the level of a whole chromosome to the level of individual bases.
2. There are options to view genome coverage, GC content, and annotations to the reference sequence.
LookSeq can visualize read alignments and some basic properties as separate “tracks”:
1. Sequence annotation
3. GC contents
This information is taken from the alignment databases as well as some auxiliary files.
The short read image can be zoomed to any resolution, from a whole chromosome to individual bases at any desired level.
Also displays auxiliary information: read ID, location, base quality, read length and orientation.
The main display provides a view of a single contig at a time, with reads aligned against their consensus sequence.
Five separate analysis methods are available:
1. Data Table - displays the data supplied by the user and analyzes the percentage of mutant reads, in spreadsheet format, enabling analysis within the tool in addition to fast export to Excel.
2. Sample Comparison - displays a bar graph presenting the frequency of each SNP in the investigated samples, according to a threshold value.
3. SNP-View - displays a table containing the numbers of samples that include each SNP in a specific chromosome.
4. Translation of the input file into a PgSNP file format for a later visualization in the UCSC, as a UCSC Custom Track.
5. Additional Information - suggests additional external links for further investigation and annotation of specific SNPs and of the impact of amino acid changes on human proteins.
GenomeGems integrates well with the UCSC Genome Browser, for the purpose of visualization of SNPs, in addition to the analysis and visualization in the actual tool.
UCSC custom tracks supply additional data calculated by UCSC such as: context of the SNP – CDS or intron, and the properties of the changed amino acid – polarity, acidity and hydropathy.
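The Data Table and Sample Comparison analyses listed above can be illustrated with a minimal Python sketch. It is not GenomeGems' MATLAB implementation: the record layout (sample, chromosome, position, total reads, mutant reads), the function names and the 30% threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical records: (sample, chromosome, position, total_reads, mutant_reads).
# This column layout is an assumption, not GenomeGems' actual input format.
records = [
    ("S1", "chr1", 1000, 40, 20),
    ("S2", "chr1", 1000, 50, 5),
    ("S3", "chr1", 1000, 30, 27),
    ("S1", "chr1", 2500, 20, 2),
    ("S2", "chr1", 2500, 25, 15),
]


def mutant_percentage(total_reads, mutant_reads):
    """Percentage of reads that carry the variant allele."""
    if total_reads == 0:
        return 0.0
    return 100.0 * mutant_reads / total_reads


def samples_above_threshold(rows, threshold_pct):
    """Map each SNP (chromosome, position) to the samples whose
    mutant-read percentage meets or exceeds the threshold."""
    hits = defaultdict(list)
    for sample, chrom, pos, total, mutant in rows:
        if mutant_percentage(total, mutant) >= threshold_pct:
            hits[(chrom, pos)].append(sample)
    return dict(hits)


result = samples_above_threshold(records, threshold_pct=30.0)
for snp, samples in sorted(result.items()):
    # The number of samples per SNP is what a bar graph would display.
    print(snp, len(samples), samples)
```

Raising or lowering `threshold_pct` changes which samples are counted for each SNP, which is the role the threshold value plays in the Sample Comparison view.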
Several tools exist to facilitate the data interpretation stage, each focusing on a different aspect of the analysis. EagleView, for example, is compatible with a variety of operating systems and supports visualization of genome assemblies derived from Deep Sequencing. However, this software, freely available on the internet, is not suitable for the most up-to-date sequencing technologies (such as ABI/SOLiD or Helicos). LookSeq, an AJAX-based web viewer, was developed to visualize the multiple layers of information in large data sets of aligned sequence reads produced by Deep Sequencing, and enables the user to view this information at different levels of resolution. This tool uses Illumina Genome Analyzer/HiSeq 2000 data as input, though it lacks the ability to visualize large sequenced regions, such as an entire human chromosome, due to significant memory demands. MagicViewer, a freely available, platform-independent application, provides annotation facilities for Single Nucleotide Polymorphisms (SNPs) but does not extend them to Insertion-Deletions (Indels). In addition, it lacks features for comparing multiple samples. ABC, a Java-based viewer for exploring data associated with alignments, displays quantitative data (such as sequence similarity) and annotation data (such as the location of genes and repeats) simultaneously. ABC does not function as a genome-wide browser, but is suitable for comparative sequence analysis. Finally, Tablet displays the data as highly packed views, allowing instant navigation to any region of interest. Although compatible with a variety of operating systems, Tablet requires a large amount of memory and therefore has limited use on a Personal Computer (PC).
Our tool, termed GenomeGems, was developed to provide a systematic means of reducing inconsistency in selecting which genetic variants or mutations should be further investigated. We developed a unique interface that combines analysis and visualization (via the widely used UCSC Genome Browser), leading to prioritization of data generated by Deep Sequencing runs. One way to facilitate variance calling from genetic sequences is to put them in context with other sequenced samples. Therefore, one of GenomeGems’ strong features is its ability to compare, analyze and visualize a large number of samples simultaneously. The interpreted information is presented in tables and graphs on a PC workstation and is directly linked to both Microsoft Excel and the UCSC Genome Browser. While some tasks carried out by GenomeGems can be achieved by other standalone tools, such as the ‘R package’ or, partially, Microsoft Excel, GenomeGems is a suite of applications that makes a combination of tasks accessible to end users without a computational background. The tool aims to facilitate genomic research via multiple-processing and accessible presentation of Deep Sequencing data for variance calling, assisting rapid turnover of information and leading to further experimental mutation detection. Since SNPs are the most prevalent genetic modification among individuals, GenomeGems currently focuses on these variations.
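To make the idea of putting variants in context with other sequenced samples concrete, the following Python sketch counts how many samples carry each SNP and separates SNPs shared by every sample from those seen in only one. The data structure, sample names and positions are hypothetical, and this is an illustration of the general idea rather than the tool's own code.

```python
from collections import Counter

# Hypothetical SNP calls per sample; each SNP is keyed as
# (chromosome, position, variant_allele). All values are invented.
calls = {
    "S1": {("chr2", 1000, "T"), ("chr2", 2000, "A")},
    "S2": {("chr2", 1000, "T")},
    "S3": {("chr2", 1000, "T"), ("chr2", 3000, "G")},
}


def sample_counts(calls_by_sample):
    """Count the number of samples in which each SNP was called."""
    counts = Counter()
    for snps in calls_by_sample.values():
        counts.update(snps)
    return counts


counts = sample_counts(calls)

# SNPs present in every sample versus SNPs private to a single sample.
shared_by_all = {snp for snp, n in counts.items() if n == len(calls)}
private = {snp for snp, n in counts.items() if n == 1}

print("shared by all samples:", sorted(shared_by_all))
print("private to one sample:", sorted(private))
```

A per-SNP sample count of this kind is one simple way to rank candidates when many samples are loaded together.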
Main user interface
The selected files, each with a specified sample number, chromosome number, status (novel or clinically associated) and location, appear as a list in the ‘Select Files’ panel (marked as B). This list must include all of the files required for the later analysis. At any stage the user may return to the main user interface to add more files for analysis. The ‘Analysis’ panel (marked as C) contains the different functions available for analysis. At the moment, the tool offers five analysis options: Data Table, Compare Samples, SNP View, Generate PgSNP, and Additional Information. Because the tool is built in a modular form that allows further expansion, additional forms of analysis will be added to this panel in the future.
Input file format
PGSNP file format in the UCSC Genome Browser
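As a rough illustration of the PgSNP layout, the Python sketch below formats one SNP as a tab-separated pgSnp record (chromosome, 0-based start, end, alleles separated by "/", allele count, per-allele read counts, per-allele quality scores) and wraps records in a custom track header. The field order follows the UCSC pgSnp format description; the function names, the default of zero for missing scores and the `hg19` database value are assumptions made for this example, not GenomeGems' own conversion routine.

```python
def pgsnp_line(chrom, pos_1based, alleles, read_counts, scores=None):
    """Format a single SNP as one tab-separated pgSnp record."""
    if len(alleles) != len(read_counts):
        raise ValueError("one read count is required per allele")
    if scores is None:
        # Assumption: use 0 when no per-allele quality score is available.
        scores = [0] * len(alleles)
    if len(scores) != len(alleles):
        raise ValueError("one score is required per allele")
    start = pos_1based - 1  # UCSC start coordinates are 0-based
    fields = [
        chrom,
        str(start),
        str(pos_1based),
        "/".join(alleles),
        str(len(alleles)),
        ",".join(str(c) for c in read_counts),
        ",".join(str(s) for s in scores),
    ]
    return "\t".join(fields)


def pgsnp_track(name, lines, db="hg19"):
    """Prepend a custom track header line to a list of pgSnp records."""
    header = f'track type=pgSnp visibility=3 db={db} name="{name}"'
    return "\n".join([header, *lines]) + "\n"


line = pgsnp_line("chr21", 31812008, ["T", "G"], [21, 70], [90, 70])
print(pgsnp_track("sample1", [line]), end="")
```

The resulting text can be pasted into, or uploaded through, the UCSC Genome Browser custom track form.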
Custom tracks in the UCSC
GenomeGems enables researchers to identify potential disease-causing SNPs in an efficient manner. GenomeGems’ main advantages are its: (i) ability to integrate data from several Deep Sequencing runs on a standard PC; (ii) integration with the UCSC Genome Browser and Microsoft Excel; (iii) applicability to any Deep Sequencing data (given the correct input file format); (iv) power to compare and analyze a large number of samples. These advantages allow: (i) reducing variability in selecting which mutations should be further investigated; (ii) facilitating genomic research via clear and accessible presentation of processed Deep Sequencing data; (iii) assisting rapid turnover of information and a quick lead to further experimental mutation detection.
GenomeGems facilitates genomic research
Behind the implementation of GenomeGems lies our main objective of facilitating genomic research by processing Deep Sequencing data in a comprehensive and accessible fashion. This enables rapid turnover of information and leads to further experimental SNP validation. The tool allows the user to compare and visualize SNPs from multiple experiments and to easily load SNP data onto the UCSC Genome browser for further detailed information.
At the moment, the tool does not support data files containing indels. An extension of the tool will include indel analysis and an algorithm for determining whether an indel causes the appearance of a nonsense mutation in the sequence analyzed.
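One simple way such an indel check could work is sketched below in Python. It is a hypothetical illustration, not the authors' planned algorithm: it assumes the input is a coding sequence on the forward strand, in frame from its first base and ending with its natural stop codon, and it ignores splicing. The function names are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_start(cds):
    """Return the 0-based nucleotide index of the first in-frame stop codon,
    or None if the reading frame contains no stop codon."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


def apply_indel(cds, offset, deleted=0, inserted=""):
    """Remove `deleted` bases at 0-based `offset` and insert `inserted` there."""
    return cds[:offset] + inserted + cds[offset + deleted:]


def introduces_premature_stop(cds, offset, deleted=0, inserted=""):
    """True if the mutated sequence has an in-frame stop codon that ends
    before the end of the sequence (assumes `cds` ends with its stop codon)."""
    mutated = apply_indel(cds, offset, deleted, inserted)
    stop = first_stop_start(mutated)
    return stop is not None and stop + 3 < len(mutated)


cds = "ATGGCATTAAGGTGA"  # codons: ATG GCA TTA AGG TGA

# A 1-base deletion shifts the frame and exposes an early TAA codon.
print(introduces_premature_stop(cds, offset=3, deleted=1))
# An in-frame 3-base insertion keeps the original stop as the last codon.
print(introduces_premature_stop(cds, offset=3, inserted="GCT"))
```

A frameshift that removes the stop codon altogether returns False here; detecting stop-loss events would need a separate check.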
Full Genome Analysis
At the moment GenomeGems enables analysis of a single chromosome specified by the user. In the next version of GenomeGems we intend to enable full genome analysis and full genome comparison between samples.
Additional Visualization Capabilities
The current version of GenomeGems enables SNP visualization by means of UCSC’s Custom Tracks. In subsequent versions a convenient visualization within the application and without the need to connect to the Internet will be included.
Further mutation Analysis
The current version of GenomeGems lacks an independent feature for predicting the impact of amino acid substitutions (caused by SNPs) on the structure and function of human proteins. Instead, external free tools providing this information are suggested. In subsequent versions this feature will be included as an integrated function of GenomeGems.
Availability of the software and system requirements
Project Name: GenomeGems.
Project Home Page: http://www.tau.ac.il/~nshomron/GenomeGems.
Operating System: Microsoft Windows.
Programming Language: MATLAB 2009.
Other Requirements: installation of an ActiveX Control and “MCR Ver 7.10” on the users' workstations.
SNPs: Single Nucleotide Polymorphisms
MAQ: Mapping and Assembly with Quality
UCSC: University of California Santa Cruz
NCBI: National Center for Biotechnology Information
SNVs: Small Nucleotide Variants
PgSNP: Personal Genome SNP
SOAP: Short Oligonucleotide Analysis Package
ACE: Archive Compression Extension
AFG: Auxiliary File Generator
EGI: Embedded Gateway Interface
CIGAR: Compact Idiosyncratic Gapped Alignment Report.
We thank Prof Karen Avraham, Dr Lilach Friedman, Dr Zippi Brownstein, Dr Barak Markus, Dr Nitzan Kol and Ofer Iaskov for fruitful discussions on software development. We thank Dr Tamir Tuller for commenting on the manuscript.
The Shomron laboratory is supported by the National Institutes of Health (NIDCD) R01DC011835; Chief Scientist Office, Ministry of Health, Israel; Israel Cancer Association; Wolfson Family Charitable Fund; I-CORE Program of the Planning and Budgeting Committee, The Israel Science Foundation (grant number 41/11).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.