hPDB – Haskell library for processing atomic biomolecular structures in protein data bank format
© Gajda; licensee BioMed Central Ltd. 2013
Received: 6 June 2013
Accepted: 21 November 2013
Published: 23 November 2013
The Protein Data Bank file format is used for the majority of biomolecular data available today. Haskell is a lazy functional language that enjoys a high-level class-based type system, a growing collection of useful libraries and a reputation for efficiency.
I present a fast library for processing biomolecular data in the Protein Data Bank format. I present benchmarks indicating that this library is faster than other frequently used Protein Data Bank parsing programs. The proposed library also features a convenient iterator mechanism, and a simple API modeled after BioPython.
I set a new standard for convenience and efficiency of Protein Data Bank processing in a Haskell library, and release it to open source.
Keywords: Structural biology; Protein DataBank file format; Parallel parser; Parser efficiency; Column-based parsing
The Protein Data Bank (PDB) is a widely used data repository of atomic resolution, three-dimensional protein and nucleic acid structures. The rapid growth of structural data enables key endeavors to bring knowledge of genomes to the structure and function of large biomolecules. In addition to sequence searches and genome assemblies, efficient and reliable structural data processing is one of the most important and common structural bioinformatics tasks.
Haskell is a modern, lazy, pure functional language [4, 5] that enjoys fluid syntax and clarity comparable to Python, as well as an efficient compiler that often generates code approaching the speeds of industry standard languages such as C or C++.
The library is a comprehensive solution for the parsing, rapid processing and writing of PDB files. I introduce the library by providing examples and describing the underlying data structures (endnote a), and finally, I present an evaluation of its efficiency.
Simple use example
Here, I extract a list of models from a Bio.PDB.Structure.Structure object (endnote c) using V.toList . models (endnote d), and repackage each model as a separate structure. These structures are then written using PDB.write.
A simple PDB.parse action returns a structure, to which I apply splitModels and zip the list of results with model numbers. These numbers are used to generate the names of output files, which are then written using the PDB.write IO action. ByteString is used rather than [Char] within the library for everything except file paths (FilePath), due to efficiency considerations (endnote e).
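The model-splitting step described above can be sketched without the library itself. The block below is a minimal illustration under stated assumptions: Model and Structure are simplified stand-ins rather than hPDB's definitions, lists replace Data.Vector, and outputNames together with its file naming scheme is hypothetical. Parsing and writing (PDB.parse, PDB.write) are deliberately left out.

```haskell
-- Simplified stand-ins (assumptions): the real hPDB types carry many more
-- fields and store their children in Data.Vector rather than in lists.
newtype Model = Model { modelId :: Int } deriving (Eq, Show)

newtype Structure = Structure { models :: [Model] } deriving (Eq, Show)

-- Repackage every model of a structure as a structure of its own.
splitModels :: Structure -> [Structure]
splitModels aStructure = map mkStructure (models aStructure)
  where
    mkStructure aModel = aStructure { models = [aModel] }

-- Pair each single-model structure with an output file name built from its
-- position in the list (1, 2, ...). The naming scheme is hypothetical.
outputNames :: FilePath -> Structure -> [(FilePath, Structure)]
outputNames prefix aStructure =
  zip [prefix ++ show n ++ ".pdb" | n <- [1 :: Int ..]] (splitModels aStructure)
```

In a complete program, each pair produced by outputNames would be handed to the writing action; only the pure part is shown here so that the sketch stays independent of the library.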
Data structure describing molecules
The hierarchy of types comprises: Structure, which contains information about a whole PDB entry; Model, which holds a single model of the molecule; Chain, describing a single polymer chain; Residue, for a single monomer (amino acid residue or nucleic acid base) within the polymer; and Atom, for a single atom location.
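As an illustration of this nesting, the following sketch declares one record per level. It is not the library's definition: the field names and field sets are assumptions chosen for brevity, lists stand in for Data.Vector, String stands in for ByteString, and the coordinates in the example value are made up.

```haskell
-- Hypothetical, simplified stand-ins for the five levels of the hierarchy.
data Atom = Atom
  { atName :: String                   -- atom name, e.g. "CA"
  , coord  :: (Double, Double, Double) -- Cartesian position
  } deriving (Eq, Show)

data Residue = Residue
  { resName :: String                  -- three-letter residue code
  , resSeq  :: Int                     -- residue sequence number
  , atoms   :: [Atom]
  } deriving (Eq, Show)

data Chain = Chain { chainId :: Char, residues :: [Residue] } deriving (Eq, Show)

data Model = Model { modelId :: Int, chains :: [Chain] } deriving (Eq, Show)

newtype Structure = Structure { models :: [Model] } deriving (Eq, Show)

-- A tiny one-residue structure with made-up coordinates.
example :: Structure
example =
  Structure
    [ Model 1
        [ Chain 'A'
            [ Residue "GLY" 1 [Atom "N" (0, 0, 0), Atom "CA" (1.46, 0, 0)] ]
        ]
    ]

-- Walking the hierarchy from the top reaches every atom location.
allAtoms :: Structure -> [Atom]
allAtoms = concatMap atoms . concatMap residues . concatMap chains . models
```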
Names of these types correspond to the names used by the PDB file format definition. Those atoms which may have multiple locations within the model are described by several records, and those residues that have alternative mutants are also described by different records, in accord with the current practice of the PDB.
Iterating with Iterable
The itmap method allows mutation of any of the objects of a given type b contained within type a. To compute a statistic over all contained objects, itfoldr or itfoldl can be used.
Note that itlength is the only method using a dummy first argument to indicate the type of contained object to be counted. As all other methods use a function argument, automatic type inference finds the proper class instance without requiring a type declaration, as shown in the examples below (endnote f).
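The interface described in this section can be approximated by a two-parameter type class. The sketch below is a reduced illustration, not the declaration from the library: the method set is limited to the four methods named above, the container types are simplified stand-ins, and countAtoms and upperNames are hypothetical helpers.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

import Data.Char (toUpper)

-- Simplified stand-ins (assumptions), not the library's types.
newtype Atom = Atom { atName :: String } deriving (Eq, Show)
data Residue = Residue { resSeq :: Int, atoms :: [Atom] } deriving (Eq, Show)
newtype Chain = Chain { residues :: [Residue] } deriving (Eq, Show)

-- A container of type a holds objects of type b.
class Iterable a b where
  itmap    :: (b -> b) -> a -> a
  itfoldr  :: (b -> c -> c) -> c -> a -> c
  itfoldl  :: (c -> b -> c) -> c -> a -> c
  itlength :: b -> a -> Int -- the first argument is never inspected

instance Iterable Residue Atom where
  itmap f r   = r { atoms = map f (atoms r) }
  itfoldr f e = foldr f e . atoms
  itfoldl f e = foldl f e . atoms
  itlength _  = length . atoms

instance Iterable Chain Residue where
  itmap f c   = c { residues = map f (residues c) }
  itfoldr f e = foldr f e . residues
  itfoldl f e = foldl f e . residues
  itlength _  = length . residues

-- Indirect containment, obtained by composing the two instances above.
instance Iterable Chain Atom where
  itmap f     = itmap (itmap f :: Residue -> Residue)
  itfoldr f e = itfoldr (\r acc -> itfoldr f acc (r :: Residue)) e
  itfoldl f e = itfoldl (\acc r -> itfoldl f acc (r :: Residue)) e
  itlength d  = itfoldr (\r n -> itlength d (r :: Residue) + n) 0

-- Counting needs a dummy value whose only role is to select the type.
countAtoms :: Chain -> Int
countAtoms = itlength (undefined :: Atom)

-- Here the instance is selected by the type of the function argument.
upperNames :: Chain -> Chain
upperNames = itmap (\a -> a { atName = map toUpper (atName a) })
```

The same chain can thus be counted at two levels: itlength (undefined :: Residue) counts residues, while countAtoms counts atoms across all residues.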
Structure analysis example
In the following examples I skip the command line interface, assuming that all functions input a parsed Structure object.
The most convenient interface for a complex cascade of container types within a PDB structure is composition based on fold and map analogs.
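To make the idea of composing fold and map analogs concrete, the sketch below nests one fold per container level and uses the result to compute an unweighted geometric centre. All names are hypothetical, the types are simplified stand-ins with the model level omitted, and plain list functions replace the library's own iteration methods.

```haskell
import Data.List (foldl')

type Coord = (Double, Double, Double)

-- Simplified stand-ins (assumptions); the model level is omitted for brevity.
newtype Atom = Atom { coord :: Coord } deriving (Eq, Show)
newtype Residue = Residue { atoms :: [Atom] } deriving (Eq, Show)
newtype Chain = Chain { residues :: [Residue] } deriving (Eq, Show)
newtype Structure = Structure { chains :: [Chain] } deriving (Eq, Show)

-- A left fold over every atom, built by nesting one fold per level.
foldAtoms :: (c -> Atom -> c) -> c -> Structure -> c
foldAtoms f z = foldl' overChain z . chains
  where
    overChain acc   = foldl' overResidue acc . residues
    overResidue acc = foldl' f acc . atoms

-- The map analog: rebuild each level around the transformed atoms.
mapAtoms :: (Atom -> Atom) -> Structure -> Structure
mapAtoms f (Structure cs) =
  Structure [Chain [Residue (map f as) | Residue as <- rs] | Chain rs <- cs]

-- Unweighted geometric centre; Nothing for a structure without atoms.
centroid :: Structure -> Maybe Coord
centroid s
  | n == 0    = Nothing
  | otherwise = Just (sx / k, sy / k, sz / k)
  where
    (n, sx, sy, sz) = foldAtoms step (0 :: Int, 0, 0, 0) s
    step (m, x, y, z) (Atom (ax, ay, az)) = (m + 1, x + ax, y + ay, z + az)
    k = fromIntegral n

-- Translate the structure so that its centroid moves to the origin.
recenter :: Structure -> Structure
recenter s = case centroid s of
  Nothing           -> s
  Just (cx, cy, cz) -> mapAtoms (\(Atom (x, y, z)) -> Atom (x - cx, y - cy, z - cz)) s
```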
The simple itmap and itfoldl' methods are not sufficient to perform a complex stateful operation such as renumbering residues starting from 1.
The automatically inferred type allows this function to act not only on the entire Structure but also on any single Model that contains the Chain objects.
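A stateful traversal of this kind can be sketched with mapAccumL from the base library, which threads a counter through a list while rebuilding it. This is an illustration of the technique, not the library's implementation: the types are simplified stand-ins, the function names are hypothetical, and restarting the numbering in every chain is an assumption.

```haskell
import Data.List (mapAccumL)

-- Simplified stand-ins (assumptions), not the library's types.
data Residue = Residue { resName :: String, resSeq :: Int } deriving (Eq, Show)
newtype Chain = Chain { residues :: [Residue] } deriving (Eq, Show)
newtype Model = Model { chains :: [Chain] } deriving (Eq, Show)

-- Thread a counter through the residues of one chain, starting from 1.
renumberChain :: Chain -> Chain
renumberChain (Chain rs) = Chain (snd (mapAccumL step 1 rs))
  where
    step n r = (n + 1, r { resSeq = n })

-- Apply the renumbering to each chain of a model; the counter restarts
-- at 1 for every chain (an assumption of this sketch).
renumberModel :: Model -> Model
renumberModel (Model cs) = Model (map renumberChain cs)
```

A plain map cannot express this operation because the new number of each residue depends on how many residues precede it, which is exactly the state that mapAccumL carries along.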
hPDB’s speed and ease of use have allowed for rapid implementation, with minimal code, of typical functions such as: orienting a structure so that the longest diameter corresponds to the Y axis and the second longest cross-sectional dimension corresponds to the X axis (CanonicalAxes in the hPDB-examples package); normalizing PDB files for use by applications that are restrictive with respect to file format (CleanPDB); and examining the sequence of the main polymer chain or geometric parameters of small-angle scattering shape reconstructions (Rg example).
Results and discussion
For the benchmark, hPDB was compiled in single-threaded and multi-threaded mode by GHC v7.6.2.
Total CPU time in seconds
Total allocated memory in megabytes
Haskell memory is reported for the current heap, in addition to the target space for the copying garbage collector.
Completion time after parsing in seconds
The benchmarks were measured on a quad-core Intel® Core™ i7 2600 processor running at 3.4 GHz (endnote j), with 16 GB of 1333 MHz memory and a SAMSUNG 470 Series solid-state disk. The system was running 64-bit Ubuntu 13.04 with the standard Linux kernel package 3.8.0-31.
While hPDB may be expected to stand out in runtime comparisons to the bytecode-based dynamic language libraries BioRuby and BioPython, surprisingly, serial hPDB is faster than other parsers in compiled languages, with the exception of PyMOL. The parallel version of the hPDB parser may be the fastest PDB parser on machines with at least 4 independent processing cores.
It was noted that memory use, even with a necessary overhead (2×) of Haskell’s copying garbage collector, compared favorably with memory used by other libraries.
Parsing the entire PDB archive (as of January 6th 2013, compressed, 16 GB) takes approximately 14.5 minutes using 4 cores in parallel, with total CPU and I/O time reported to be 50 minutes. No crashes are reported, but 8k lines (mostly metadata) are reported as erroneous (endnote k) because they are inconsistent with a strict interpretation of the PDB format.
Benchmarks show that in this specific application, the mildly optimized Haskell parser may provide speeds competitive with compiled languages such as Java and even lower level explicitly allocated languages such as C. Memory usage is also lower than that of any other aforementioned library.
There is another Haskell library parsing PDB files on Hackage called PDBtools, but it was not able to fully parse any of the example files because it does not handle errors in the read routine.
I have shown clear uses of a nice high-level interface for the analysis and modification of molecule descriptions encoded in the PDB file format.
While there are many similar parsers written in other languages, this is the first one I am aware of in Haskell that parses the entire coordinate contents of the PDB repository. It is also efficient in both runtime and memory use, and is thus a preferable choice for sophisticated, high volume analyses.
While future work on analysis API extensions would likely further improve utility of this library, I believe that it is ready for production use, as indicated by the many code examples.
I conclude that in this specific application, Haskell has both ease of use and abstraction of high-level dynamic languages, along with a speed competitive with lower level explicit-allocation languages such as C.
Availability and requirements
Source code is available as Additional files 1, 2 and 3 attached to the manuscript or from the GitHub repository https://github.com/mgajda/hPDB, and is released on Hackage as hPDB. It has been tested with several GHC versions including 7.0.3, 7.2.2, 7.4.2, and the recently released 7.6.2. It has few dependencies, and all are available from Hackage.
Project name: hPDB
Project home page: http://hackage.haskell.org/package/hPDB
Source repositories: http://github.com/mgajda/hPDB, http://github.com/mgajda/hPDB-examples, http://github.com/mgajda/iterable
Operating system(s): Platform independent
Programming language: Haskell
Libraries: Haskell Platform, AC-Vector
Other requirements: GHC ≥ 7.0
License: BSD
a While this article contains only one figure showing the most important types for the API, two additional diagrams elucidating the library’s internal structure are available in the Additional files 4 and 5.
b The command line interface for this function may be found in examples/SplitModels.hs in the hPDB-examples package.
c Names defined in the hPDB package are emphasized in bold font for ease of reading. Other modules are the standard collection interface Data.Vector from the vector package, the 3D vector interface Data.Vector.V3 from the AC-Vector package, and Data.ByteString.Char8 from the bytestring package.
d Note the use of Data.Vector for space efficient storage of data.
e Most records in the PDB file format are ASCII-only; therefore, Unicode encoding is not necessary. As non-ASCII characters can only occur in comments and metadata, they may be decoded after parsing.
f Type parameter b in declaration for itlength is a dummy type argument to specify the contained object types to be counted.
g This declaration is less polymorphic than the actual itmap type, as demonstrated in the following section about Iterable class description.
h Extended examples are present in the CleanPDB.hs example attached to the library.
i Indicating termination of polymer chain, rather than an atom.
j With overclocking switched off.
k It is known that, after six different official releases of file format descriptions and many data remediation efforts, there is a small amount of data that does not entirely conform to the PDB archive format.
API: Application Programming Interface (function and data declarations)
ASCII: American Standard Code for Information Interchange (7-bit text encoding)
CPU: Central Processing Unit (processor)
PDB: Protein DataBank (repository of biomolecular structural data)
GHC: Glasgow Haskell Compiler
GWDG: Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen – Göttingen Society for Scientific Data Processing.
The author thanks his former PhD supervisor for fostering his interest in the field of Bioinformatics and for his insight into the deficiencies of many currently available bioinformatics tools. All diagrams were created with Graphviz, an open source graph layout and drawing tool. The author thanks American Journal Experts for proofreading the manuscript.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2011, 39 (suppl 1): D32-D37. [http://nar.oxfordjournals.org/content/39/suppl_1/D32.abstract]
- Tramontano A: Introduction to Bioinformatics. 2006, [http://www.amazon.com/Introduction-Bioinformatics-Chapman-Mathematical-Computational/dp/1584885696]
- Peyton Jones S (Ed): The Haskell 98 Language and Libraries: The Revised Report. J Funct Program. 2003, 13: 0-255. [http://www.haskell.org/definition/]
- Haskell 2010 language report. 2010, [http://www.haskell.org/onlinereport/haskell2010/]
- Van Rossum G: Scripting the web with python. World Wide Web J. 1997, 2 (2): 97-120. [http://dl.acm.org/citation.cfm?id=275062.275072]
- International Standards Organization: ISO/IEC 9899:2011 Information technology – Programming languages – C. 2011, [http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=57853]
- International Standards Organization: ISO/IEC 14882:2011 Information technology – Programming languages – C++. 2012, [http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=50372]
- Jones MP: Functional programming with overloading and higher-order polymorphism. Advanced Functional Programming, First International Spring School on Advanced Functional Programming Techniques-Tutorial Text. 1995, London, UK: Springer-Verlag, 97-136. [http://dx.doi.org/10.1093\2Fbioinformatics\2Fbtn397]
- Callaway J, Cummings M, Deroski B, Esposito P, Forman A, Langdon P, Libeson M, McCarthy J, Sikora J, Xue D, Abola E, Bernstein F, Manning N, Shea R, Stampf D, Sussman J: PDB File format – contents guide version 3.30. The Worldwide Protein Data Bank. 2012, [http://www.wwpdb.org/docs.html]
- Holland RCG, Down TA, Pocock MR, Prlic A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer ML, Schreiber MJ: BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008, 24 (18): 2096-2097. 10.1093/bioinformatics/btn397. [http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics24.html#HollandDPPHJFDYHS08]
- Goto N, Prins P, Nakao M, Bonnal R, Aerts J, Katayama T: BioRuby: bioinformatics software for the Ruby programming language. Bioinformatics. 2010, 26 (20): 2617-2619. 10.1093/bioinformatics/btq475. [http://bioinformatics.oxfordjournals.org/content/26/20/2617.abstract]
- Hamelryck T, Manderick B: PDB file parser and structure class implemented in Python. Bioinformatics. 2003, 19 (17): 2308-2310. 10.1093/bioinformatics/btg299. [http://bioinformatics.oxfordjournals.org/content/19/17/2308.abstract]
- Sayle RA, Milner-White JE: RASMOL: biomolecular graphics for all. Trends Biochem Sci. 1995, 20 (9): 374-376. 10.1016/S0968-0004(00)89080-5. [http://dx.doi.org/10.1016/S0968-0004(00)89080-5]
- Schrödinger LLC: The PyMOL molecular graphics system, version 1.3r1. 2010, [http://www.pymol.org/citing]
- Jmol: an open-source Java viewer for chemical structures in 3D. [http://www.jmol.org/]
- Marlow S, Harris T, James RP, Peyton Jones S: Parallel generational-copying garbage collection with a block-structured heap. Proceedings of the 7th international symposium on Memory management, ISMM ’08. 2008, New York, NY, USA: ACM, 11-20. [http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/]
- Jones I, Jones SP, Marlow S, Wallace M, Patterson R: The Haskell Cabal – a common architecture for building applications and tools. 2005, [http://www.haskell.org/cabal/proposal/index.html]
- Ellson J, Gansner ER, Koutsofios E, North SC, Woodhull G: Graphviz and Dynagraph – static and dynamic graph drawing tools. GRAPH DRAWING SOFTWARE. 2003, Heidelberg: Springer-Verlag, 127-148.
- Sheard T, Jones SP: Template meta-programming for Haskell. SIGPLAN Not. 2002, 37 (12): 60-75. 10.1145/636517.636528. [http://research.microsoft.com/en-us/um/people/simonpj/papers/meta-haskell/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.