GenBank and PubMed: How connected are they?
© Sarkar et al; licensee BioMed Central Ltd. 2009
Received: 11 August 2008
Accepted: 09 June 2009
Published: 09 June 2009
GenBank® is a public repository of all publicly available molecular sequence data from a range of sources. In addition to relevant metadata (e.g., sequence description, source organism and taxonomy), publication information is recorded in the GenBank data file. The identification of literature associated with a given molecular sequence may be an essential first step in developing research hypotheses. Although many of the publications associated with GenBank records may not be indexed in complementary literature databases (e.g., PubMed), GenBank records associated with literature indexed in Medline are identifiable because they contain PubMed identifiers (PMIDs).
An analysis of 87,116,501 GenBank sequence files reveals that 42% are associated with a publication or patent. Of these, 71% are associated with PMIDs and can therefore be linked to a citation record in the PubMed database. The remaining 29% of publication-associated GenBank entries either do not have PMIDs or cite a publication that is not currently indexed by PubMed. We also identify the journal titles that are linked through citations in the GenBank files to the largest number of sequences.
Our analysis suggests that GenBank contains molecular sequences from a range of disciplines beyond biomedicine, the initial scope of PubMed. The findings thus suggest opportunities to develop mechanisms for integrating biological knowledge beyond the biomedical field.
Overview of GenBank
The US Congress established the National Center for Biotechnology Information (NCBI) in 1988 to develop bioinformatics approaches to support the progress of biomedical research. A major component of NCBI's mission is to provide access to a variety of databases and software for the scientific and medical communities. GenBank, an archive of all publicly available primary sequence data, is one of these databases. Sequence data are submitted to GenBank by individual scientists and by large genome sequencing centers. GenBank, the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL) in Europe, and the DNA Databank of Japan (DDBJ) together form the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC archives and makes publicly available more than 80 million individual molecular sequences, including mRNA sequences, genomic survey sequences and ribosomal RNA gene clusters. Data are exchanged daily among the INSDC partners (GenBank, EMBL, and DDBJ) to maintain the consistency and completeness of the molecular sequence data contributed and used by the scientific community.
Accuracy and completeness of GenBank
In addition to the actual molecular sequence, each GenBank entry includes a set of associated metadata that provide information about the sequence. These metadata elements include a description of the sequence, the scientific name of the organism, bibliographic information citing relevant publications, the taxonomy of the source organism, and a features table providing information about coding regions, translated protein sequence, repeat sequences and many other relevant characteristics of the sequence. The importance and accuracy of GenBank annotations have been the topic of recent discussions. Community-based curation of GenBank annotations has been suggested as a tractable means to keep pace with high-throughput genome sequencing initiatives [3, 4]. Currently, NCBI performs some quality control but does not curate the data; only the original submitter can modify a molecular sequence record in GenBank, including its annotation.
Many, if not all, biology and biomedical journals require that authors submit manuscript-associated molecular sequence data to public repositories such as GenBank. Recent studies have explored author compliance with such requirements: Noor et al. report that 3–20% of biomedical journal articles do not have the GenBank accession numbers required by journal policies. In many cases, the required molecular sequence data are not available even six months after publication. Conversely, many molecular sequences are submitted by scientists who have no plans to formally publish the research; in these cases the GenBank record contains no citation information.
Link between GenBank sequences and publications
The identification of literature associated with a given molecular sequence may be an essential first step in developing research hypotheses. Thus, the connection of GenBank records to peer-reviewed, published literature is an essential component of contemporary biomedical research. Many of the publications associated with GenBank records may not be indexed in complementary literature databases (e.g., PubMed). GenBank records associated with literature indexed in Medline are identifiable because they contain PubMed identifiers (PMIDs). From the perspective of molecular sequence data, access to the associated publication can be essential for addressing questions of quality or methodology. For example, quality issues are of paramount importance when combining molecular sequence data for identification, population and evolutionary studies [6–8]. Having an associated publication linked to a given GenBank record enables one to confirm the methodology used to acquire the molecular sequences (e.g., specimen handling or PCR primers used) as well as the context in which the sequence was studied (e.g., field collection or laboratory extraction). A link to published literature can also be a means to explore the data behind a proposed gene function or to identify other related experiments (e.g., gene expression studies).
To analyze the connections between GenBank and published literature, a full GenBank archive (release 164) was downloaded in flat-file format from the NCBI at the National Library of Medicine in March 2008. The downloaded flat files were then parsed to extract 70 metadata types associated with each GenBank record. Annotation values for each of the 70 metadata types were then loaded into a MySQL database. Citation information was gathered from the JOURNAL GenBank Feature Table field; PubMed identifiers were extracted from the PMID GenBank Feature Table field. A citations table was thus created that contained the GenBank Identifier, full citation, journal name, and PMID (when available). A series of MySQL queries was then crafted to calculate the statistics reported.
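The extraction step can be illustrated with a minimal sketch. It is not the authors' parser. It assumes the standard GenBank flat-file layout, in which a keyword occupies the first 12 columns of a line, records end with `//`, and citation details appear as JOURNAL and PUBMED lines within each REFERENCE block. The function name `iter_citations`, the sample record, its accession and its PMID are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical record for illustration only (accession and PMID are made up).
SAMPLE = """\
LOCUS       AB000001     120 bp    mRNA    linear   PRI 01-JAN-2008
DEFINITION  Hypothetical example record.
ACCESSION   AB000001
REFERENCE   1  (bases 1 to 120)
  AUTHORS   Doe,J.
  TITLE     An example title
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26),
            16899-16903 (2002)
   PUBMED   12345678
REFERENCE   2  (bases 1 to 120)
  AUTHORS   Doe,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-JAN-2008) Example Institute
FEATURES             Location/Qualifiers
     source          1..120
ORIGIN
        1 acgtacgtac
//
"""


def iter_citations(lines: Iterable[str]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per REFERENCE block: accession, journal text, PMID or None."""
    accession = None
    current = None   # citation currently being assembled
    last_key = None  # most recent keyword, used for continuation lines
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):            # end of record
            if current is not None:
                yield current
            accession, current, last_key = None, None, None
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if not key:                          # continuation of the previous field
            if current is not None and last_key == "JOURNAL" and value:
                current["journal"] += " " + value
            continue
        last_key = key
        if key == "ACCESSION":
            accession = value.split()[0] if value else None
        elif key == "REFERENCE":             # a new citation starts
            if current is not None:
                yield current
            current = {"accession": accession, "journal": "", "pmid": None}
        elif key == "JOURNAL" and current is not None:
            current["journal"] = value
        elif key == "PUBMED" and current is not None:
            current["pmid"] = value
        elif not line.startswith(" ") and current is not None:
            yield current                    # next top-level section closes the block
            current = None


citations = list(iter_citations(SAMPLE.splitlines()))
```

Each yielded dictionary corresponds to one row of a citations table like the one described above. Whether "Direct Submission" references such as the second one in the sample should count as citations is a separate decision that this sketch does not make.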
Journals linked with largest number of GenBank sequences
Table 1. Journals Indexed in PubMed Linked to Most Sequences (Including Genome Sequencing Projects)
Columns: Citations in GenBank Sequence Records; Percent of sequences associated with a citation; Number of distinct journal articles; Number of sequences
Journals listed: The Proceedings of the National Academy of Sciences; Plant Molecular Biology
Table 2. Journals Indexed in PubMed Linked to Most Sequences
Columns: Number of sequences
Journals listed: Journal of Biological Chemistry; The Proceedings of the National Academy of Sciences; Nucleic Acids Research; Journal of Bacteriology; Applied Environmental Microbiology; Biochem. Biophys. Research Communications; Journal of Virology; Int Journal of Systematic Evol Microbiol
Table 3. Journals Not Indexed in PubMed Linked to Most Sequences (Including Genome Sequencing Projects)
Columns: Citations in GenBank Sequence Records; Percent of sequences associated with a citation; Average number of sequences per article; Number of distinct journal articles
Journals listed: Genetics and Molecular Biology; [...] Pathology; Journal of Phycology; Molecular Ecology Resources (formerly Molecular Ecology Notes); Integrative and Comparative Biology; Phycologia; Plant Molecular [...]
Of the journals not indexed in PubMed, Genetics and Molecular Biology is cited in 237,603 GenBank sequences, or 0.65% of the sequences that are associated with a citation. The sequences associated with journals not indexed by PubMed represent a significant source of sequence data (over 1.7 million sequence records). Table 3 shows the journals not indexed in PubMed that have the highest number of GenBank sequence citations. The top journal in this ranked list, Genetics and Molecular Biology (ISSN 1415–4757; formerly Brazilian Journal of Genetics), is published by Sociedade Brasileira de Genética. None of the more than 230,000 GenBank records that cite this journal have direct Web links to the published articles, even though the journal's abstracts are electronically accessible through indices such as ISI Web of Science and Biological Abstracts.
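As a rough cross-check, the 0.65% figure can be reproduced from the totals reported in the abstract. The sketch below assumes the rounded percentages (42% and 71%) are close enough to the underlying counts for this purpose, so its outputs are approximate. The function name `citation_breakdown` is hypothetical.

```python
def citation_breakdown(total_records: int, cited_share: float, pmid_share: float) -> dict:
    """Approximate record counts implied by the reported (rounded) percentages."""
    cited = total_records * cited_share   # records citing a publication or patent
    with_pmid = cited * pmid_share        # cited records that carry a PMID
    return {
        "cited": cited,
        "with_pmid": with_pmid,
        "without_pmid": cited - with_pmid,
    }


counts = citation_breakdown(87_116_501, 0.42, 0.71)

# Share of citation-bearing records that cite Genetics and Molecular Biology.
gmb_share = 237_603 / counts["cited"]
gmb_percent = round(gmb_share * 100, 2)
```

Under these assumptions, about 36.6 million records carry a citation, roughly 10.6 million of them without a PMID, and the 237,603 records citing Genetics and Molecular Biology come to about 0.65% of the citation-bearing records, consistent with the figure stated above.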
Links to cited publication in GenBank record
GenBank, a resource maintained at the US National Library of Medicine (NLM), currently provides Web links only to those journals that are indexed in PubMed (which is also maintained at the NLM). The European Molecular Biology Laboratory (EMBL), which shares molecular sequence records with GenBank and the DNA Databank of Japan as part of the International Nucleotide Sequence Database Collaboration, uses Digital Object Identifiers (DOI®; http://www.doi.org/) for citations that do not have PMIDs. EMBL also includes links to other complementary citation databases, such as the AGRICOLA database (http://agricola.nal.usda.gov/help/aboutagricola.html). For example, we manually compared a sampling of GenBank sequence records associated with publications in Molecular Plant Pathology with the equivalent records in the EMBL database; the EMBL records contained direct links to the cited article at the publisher site (via the DOI) and to the abstract in AGRICOLA. However, further exploration of EMBL records suggests that DOIs are not incorporated uniformly. For example, we found that articles from Molecular Ecology Resources and Systematic Botany are not associated with DOIs in EMBL records, even though both journals make use of DOIs.
GenBank records that lack PMIDs
A closer examination of GenBank records that lack PMIDs but cite journals indexed in PubMed reveals some interesting trends. Four journals with a large proportion of GenBank citations lacking PMIDs point to significant gaps in the electronic linkage between molecular data and the corresponding literature. PLoS Biology, for example, is cited in ~6.5 million GenBank entries. However, more than 40% of these records are missing PMIDs (which we identified through manual searches in PubMed). In some instances a PMID was available in the corresponding EMBL record, indicating that some important parts of the sequence record may not be exchanged as part of the INSDC relationship. Of note, nearly 70% of these sequences are associated with a single article; overall, only 13 distinct articles are cited in these 3.8 million entries. This suggests that authors of manuscripts associated with molecular sequence data need to be diligent in updating their submissions so that the community may benefit from the electronic linking of molecular sequences to relevant literature. In other instances, there was no clear pattern among the GenBank entries that had citation information without PMIDs. For example, GenBank entries associated with Molecular Ecology are missing PMIDs for nearly 20% of the citations. On checking a number of these entries manually, we discovered that several have PMIDs listed in the corresponding EMBL record.
In a small number of cases (606 records), the PMID was the only citation information associated with the GenBank record (i.e., no additional citation information was available in the record). Examination of the citation information retrieved from PubMed using these PMIDs shows that nearly half of the missing citations were published in Plant Biology (Stuttgart, Germany; ISSN 1435–8603). There is no apparent explanation as to why the full citations are missing from the GenBank records.
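The per-journal share of citations lacking PMIDs is the kind of statistic a single grouped query over a citations table can produce. The sketch below is illustrative only: it uses an in-memory SQLite database as a stand-in for the MySQL database described in the methods, and the table name, column names and sample rows are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[int, str, Optional[str]]  # (GenBank identifier, journal name, PMID or None)


def missing_pmid_share(rows: Iterable[Row]) -> Dict[str, float]:
    """Return, for each journal, the fraction of citation rows that have no PMID."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE citations (gi INTEGER, journal TEXT, pmid TEXT)")
    con.executemany("INSERT INTO citations VALUES (?, ?, ?)", rows)
    query = """
        SELECT journal,
               SUM(CASE WHEN pmid IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
        FROM citations
        GROUP BY journal
    """
    shares = dict(con.execute(query).fetchall())
    con.close()
    return shares


# Made-up rows: one of two "Journal A" citations and one of four
# "Journal B" citations lack a PMID.
example_rows = [
    (1, "Journal A", None),
    (2, "Journal A", "111"),
    (3, "Journal B", "222"),
    (4, "Journal B", "333"),
    (5, "Journal B", "444"),
    (6, "Journal B", None),
]
shares = missing_pmid_share(example_rows)
```

Journals with a high share would then be candidates for the kind of manual checking against PubMed and EMBL described above.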
Linking molecular sequences and scientific literature
The continually increasing size of molecular sequence and other scientific databases underscores the importance of linking information across relevant resources. Much of the molecular and literature data currently available in electronic resources can already be connected using existing technologies. GenBank and PubMed are two key resources that are well linked (although with some gaps). Still, a large portion of the citation data in GenBank (~30% of GenBank records associated with some citation information) is not indexed as part of PubMed. A number of sequences are not associated with PMIDs and are therefore not readily linkable to other spheres of knowledge (i.e., those contained in the relevant literature). Where PMIDs are not available, DOIs may be an alternative solution. For example, the presence of DOIs in the EMBL database enables access to literature beyond the scope of PubMed (e.g., biodiversity or ecology journals). Discussion with members of the NCBI staff indicates that DOIs are available in the raw GenBank record (ASN.1 format); however, they are not presented on the standard GenBank Website or in the flat-file download of GenBank records (Pers. Comm. – Scott Federhen & Mark Cavanaugh/NCBI/April 4, 2008).
It is an illuminating exercise to compare the presentation of the same sequence in all three databases. Consider, for example, accession BC002701 (Homo sapiens ATM interactor, mRNA). The GenBank record at NCBI http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=33877038 has the PubMed ID linked to the abstract for the cited PNAS article. The EMBL record http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EMBL:BC002701]+-newId has both a DOI and a PubMed ID, each linked to the corresponding Web resource, allowing easy access to the article abstract and to the full text, which PNAS makes freely available for this article. The DDBJ record displays the PubMed ID, but it is not linked to the abstract in PubMed.
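The fallback from PMID to DOI can be sketched as a small helper that builds a Web link from whichever identifier a record carries. This is an illustration rather than a description of how any INSDC partner generates links. The function name `citation_link` is hypothetical, and the URL patterns are assumptions based on the present-day PubMed address scheme and the public DOI resolver at doi.org.

```python
from typing import Optional
from urllib.parse import quote


def citation_link(pmid: Optional[str] = None, doi: Optional[str] = None) -> Optional[str]:
    """Prefer a PubMed link; fall back to the DOI resolver; return None if neither exists."""
    if pmid:
        # Assumed present-day PubMed URL pattern.
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
    if doi:
        # Percent-encode characters that are unsafe in a URL path, keeping "/".
        return "https://doi.org/" + quote(doi, safe="/")
    return None


# DOI taken from the GenBank reference in this article's reference list.
link_from_doi = citation_link(doi="10.1093/nar/gkn723")
# Hypothetical PMID, used only to show the preferred branch.
link_from_pmid = citation_link(pmid="12345678", doi="10.1093/nar/gkn723")
no_link = citation_link()
```

A record with neither identifier yields no link, which corresponds to the citations that currently cannot be connected electronically to the literature.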
An additional problem in connecting relevant literature to gene sequence data arises when considering whole genome sequencing projects. It is common for the large number of sequences derived from such projects to be linked to a single article that has little or no information related to the particular sequence. When sequences that are linked to 'sequence-rich' articles and are therefore probably part of large sequencing projects are excluded, only 6% of GenBank sequences are linked to articles that we postulate will offer pertinent and extensive information about the sequence.
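One way to approximate this exclusion is to count how many sequences cite each article and to keep only sequences that cite at least one article below some cutoff. The sketch below makes that concrete. The cutoff that defines a 'sequence-rich' article is not stated here, so `max_seqs_per_article` is a hypothetical parameter, as are the function name and the sample data.

```python
from collections import Counter
from typing import Hashable, Iterable, Set, Tuple


def sequences_outside_rich_articles(
    citations: Iterable[Tuple[Hashable, Hashable]],
    max_seqs_per_article: int,
) -> Set[Hashable]:
    """Return sequence IDs that cite at least one article that is not 'sequence-rich'.

    citations: (sequence_id, article_key) pairs, one per citation.
    max_seqs_per_article: hypothetical cutoff; an article cited by more
        sequences than this is treated as 'sequence-rich'.
    """
    pairs = list(citations)
    seqs_per_article = Counter(article for _, article in pairs)
    return {
        seq
        for seq, article in pairs
        if seqs_per_article[article] <= max_seqs_per_article
    }


# Made-up example: article "A" is cited by three sequences, article "B" by one.
example = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s4", "B")]
kept = sequences_outside_rich_articles(example, max_seqs_per_article=2)
```

Dividing the number of retained sequences by the total number of records would give a figure comparable to the 6% reported above, although the exact value depends on the cutoff chosen.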
As we aspire to a truly connected universe of knowledge, where machines are able to communicate and even infer new correlations, it will be increasingly essential to have accurate and complete linkages across relevant resources. GenBank and its INSDC partners (EMBL and DDBJ) are not only archival stores of molecular sequence data but can also be considered starting points for future studies. As GenBank continues to grow beyond a predominantly biomedical resource and is incorporated into non-biomedical research inquiries, it will be necessary to consider means of linking to additional electronic indices of non-biomedical biological literature.
The authors wish to thank Sheldon Kotzin (NLM), Lou Knecht (NLM), Scott Federhen (NCBI/NLM) and Mark Cavanaugh (NCBI/NLM) for their valuable insights and explanations pertaining to the results discussed herein. INS and HM are funded in part by a research grant from the Ellison Medical Foundation and National Library of Medicine award R01LM009725 to INS. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Ellison Medical Foundation, the National Library of Medicine, or the National Institutes of Health.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2009, 37 (Database issue): D26-31. 10.1093/nar/gkn723.
- Bidartondo MI: Preserving accuracy in GenBank. Science. 2008, 319 (5870): 1616. 10.1126/science.319.5870.1616a.
- Pennisi E: DNA data. Proposal to 'Wikify' GenBank meets stiff resistance. Science. 2008, 319 (5870): 1598-1599. 10.1126/science.319.5870.1598.
- Salzberg SL: Genome re-annotation: a wiki solution? Genome Biol. 2007, 8 (1): 102.
- Noor MA, Zimmerman KJ, Teeter KC: Data sharing: how much doesn't get submitted to GenBank? PLoS Biol. 2006, 4 (7): e228. 10.1371/journal.pbio.0040228.
- Harris D: Can You Bank on GenBank. Trends Ecol Evol. 2003, 18 (7): 317-319. 10.1016/S0169-5347(03)00150-2.
- Ryberg M, Nilsson RH, Kristiansson E, Topel M, Jacobsson S, Larsson E: Mining metadata from unidentified ITS sequences in GenBank: a case study in Inocybe (Basidiomycota). BMC Evol Biol. 2008, 8: 50. 10.1186/1471-2148-8-50.
- Valkiunas G, Atkinson CT, Bensch S, Sehgal RN, Ricklefs RE: Parasite misidentifications in GenBank: how to minimize their number? Trends Parasitol. 2008, 24 (6): 247-248. 10.1016/j.pt.2008.03.004.
- Ouellette F: Users must help to keep public databases correct. Nature. 2001, 409 (6819): 452. 10.1038/35054237.
- Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G: Structural and functional diversity of the microbial kinome. PLoS Biol. 2007, 5 (3): e17. 10.1371/journal.pbio.0050017.
- Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, et al: The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007, 5 (3): e77. 10.1371/journal.pbio.0050077.
- Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, et al: The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol. 2007, 5 (3): e16. 10.1371/journal.pbio.0050016.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.