BetaSearch: a new method for querying βresidue motifs
 Hui Kian Ho^{1, 2}Email author,
 Graeme Gange^{1},
 Michael J Kuiper^{3} and
 Kotagiri Ramamohanarao^{1}
https://doi.org/10.1186/175605005391
© Ho et al.; licensee BioMed Central Ltd. 2012
Received: 10 May 2012
Accepted: 15 June 2012
Published: 30 July 2012
Abstract
Background
Searching for structural motifs across known protein structures can be useful for identifying unrelated proteins with similar function and characterising secondary structures such as βsheets. This is infeasible using conventional sequence alignment because linear protein sequences do not contain spatial information. βresidue motifs are βsheet substructures that can be represented as graphs and queried using existing graph indexing methods, however, these approaches are designed for general graphs that do not incorporate the inherent structural constraints of βsheets and require computationallyexpensive filtering and verification procedures. 3D substructure search methods, on the other hand, allow βresidue motifs to be queried in a threedimensional context but at significant computational costs.
Findings
We developed a new method for querying βresidue motifs, called BetaSearch, which leverages the natural planar constraints of βsheets by indexing them as 2D matrices, thus avoiding much of the computational complexities involved with structural and graph querying. BetaSearch exhibits faster filtering, verification, and overall query time than existing graph indexing approaches whilst producing comparable index sizes. Compared to 3D substructure search methods, BetaSearch achieves 33 and 240 times speedups over indexbased and pairwise alignmentbased approaches, respectively. Furthermore, we have presented casestudies to demonstrate its capability of motif matching in sequentially dissimilar proteins and described a method for using BetaSearch to predict βstrand pairing.
Conclusions
We have demonstrated that BetaSearch is a fast method for querying substructure motifs. The improvements in speed over existing approaches make it useful for efficiently performing highvolume exploratory querying of possible protein substructural motifs or conformations. BetaSearch was used to identify a nearly identical βresidue motif between an entirely synthetic (Top7) and a naturallyoccurring protein (CharcotLeyden crystal protein), as well as identifying structural similarities between biotinbinding domains of avidin, streptavidin and the lipocalin gamma subunit of human C8.
Background
Many characteristic βresidue motifs are observed in the Protein Data Bank (PDB)[10]. For example, the βsheets in leucinerich repeat (LRR) domains contain consecutive adjacent interstrand pairs of buried leucines stericallypacked alongside their bridgepartners, contributing to structural stability[11]. Other βresidue motifs appear as a combination of inter and intrastrand residue neighbours, as is the case for the TCT motif of certain antifreeze proteins[12] and glutamic acid/lysine motifs[13]. The conserved biotinbinding site in streptavidin [PDB:1STP] contains a βresidue motif of five inwarddirecting residues of a βbarrel: S88, T90, W92, W108, and L110. Identification of these βresidue motifs can be used to search for other proteins with similar structural elements or function with low sequence identities.
Searching for structural motifs, βresidue or otherwise, in the PDB using linear approaches such as sequence alignment is a difficult, if not impossible task because interresidue interactions can occur across secondary structures that are sporadically located throughout a protein sequence. Furthermore, pairwise residue interactions are not accounted for in conventional multiplesequence alignment tools such as BLAST[14] and CLUSTALW[15], given the onedimensional nature of sequences.
The conventional approach to motif querying, involves the use of protein substructure search methods that structurally align the 3D atomic coordinates of a query with known protein structures. These methods provide a structural context to each query hit and generally produce approximate matches in the form of a ranked list of hits but may take hours to perform few queries due to their reliance on structural alignment algorithms[16].
Alternatively, protein structures can be represented as graphs and queried for motifs using graph indexing approaches[17]. Unlike 3D substructure searches, these methods perform exact matching by querying only the discrete edge, node, and label features of graphs rather than by 3D similarity between continuous coordinates. The query matching algorithms used by existing graph indexing methods are based on solutions to the subgraph isomorphism problem, described briefly as follows:
A graph G=(V,E) is defined by a set of vertices v∈V and a set of edges e∈E where each edge represents a connection between a pair of vertices$\left({v}_{i},{v}_{j}\right)\in V$. A graph is undirected if its edges are unordered pairs and directed otherwise. The degree of a node is the number of edges it has to other nodes.
If G_{1}and G_{2}are graphs defined as${G}_{1}=\left({V}_{1},{E}_{1}\right)$ and${G}_{2}=\left({V}_{2},{E}_{2}\right)$, then G_{1}is a subgraph of G_{2}if${V}_{1}\subseteq {V}_{2}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\wedge \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{E}_{1}\subseteq \phantom{\rule{0.3em}{0ex}}{E}_{2}$. An isomorphism between G_{1}and G_{2}is a bijection f:V_{1}→V_{2}such that$\forall \left({v}_{i},{v}_{j}\right)\in {E}_{1}\phantom{\rule{2.77695pt}{0ex}}\iff \phantom{\rule{2.77695pt}{0ex}}\left(f\left({v}_{i}\right),f\left({v}_{j}\right)\right)\in {E}_{2}$.
A graph G_{1}is subgraph isomorphic to G_{2}if G_{3}is a subgraph of G_{2}and there exists an isomorphism between G_{3}and G_{1}.
Graph representations of proteins[18] and βsheets[17] have been previously described in which nodes represent residues and edges represent interresidue interactions such as peptide or hydrogen bonds. For simplicity, we define a β graph to be a graph representation of a βsheet in which each node is labelled with a residue name, solid edges represent peptide bonds, and dotted edges represent a bridgepartner relationship between adjacent interstrand residue pairs (Figure1B). βresidue motifs are considered to be connected subgraphs of βgraphs whose nodes are labelled with amino acids.
The planarity of βgraphs allows for a compact twodimensional representation. We define a βmatrix to be a projection (or “flattening”) of a βgraph onto a 2D matrix of amino acid characters. The residues in the same βstrand are located in the same row and residues connected by bridge edges lie in the same column (Figure1C). βresidue motifs are then considered to be submatrices of βmatrices (Figure1D).
 1.
Index construction: The features of each graph in a database are obtained, each representing a graph characteristic. A data structure, usually an inverted index, is then constructed in which each feature is associated with the set of their originating graphs.
 2.
Filtering: The features of the query graph are obtained. An initial set of coarsegrained candidates containing these features are retrieved from the index (or indices). It is possible these candidates do not contain any query matches.
 3.
Verification: Each candidate is checked for a subgraph that exactly matches the query. This is performed using a subgraph test in most methods.
Graph indexing methods are loosely classified into three categories: pathbased, subgraphbased, and treebased.
and verification of each candidate is performed using the VF2 algorithm[20].
Indexing times and disk sizes
Dataset size  

Method  1,000  2,000  4,000  8,000  16,000 
GraphGrepSX,l_{ p }=4  27s  57s  1m 44s  3m 29s  7m 43s 
2Mb  4Mb  9Mb  18Mb  37Mb  
GraphGrepSX,l_{ p }=10  49s  1m 44s  3m 31s  7m 09s  14m 50s 
75 Mb  142 Mb  267 Mb  496 Mb  925 Mb  
SING,l_{ p }=4  34s  1m 11s  2m 17s  4m 29s  9m 12s 
10 Mb  21 Mb  42 Mb  84 Mb  169 Mb  
SING,l_{ p }=10  4m 05s  8m 25s  16m 26s  32m 16s  1h 04m 39s 
329 Mb  636 Mb  1187 Mb  2243 Mb  4279 Mb  
BetaSearch  9s  19s  56s  2m 03s  4m 04s 
10 Mb  20 Mb  40 Mb  81 Mb  164 Mb 
Subgraphbased methods (gIndex[28], FGIndex[29], and GDIndex[30]) use subgraphs as features and retain more topological information about graphs than paths due to their more complex structures. Index construction then requires time exponential to the number of nodes in each graph which also produces larger indices than those of pathbased methods. These problems can be alleviated to a degree by indexing only the most frequent subgraphs[28].
Treebased methods (TreePi[31], TreePi+δ[32], and CTree[33]) use subtrees as features and are purported to provide an ideal compromise between the small indices of pathbased methods and the specificity of subgraphbased methods. Algorithmic operations on trees are generally more asymptotically efficient than those on graphs, in particular, subtree isomorphism can be tested in polynomial time[34]. However, previous results showed that certain pathbased methods are still an order of magnitude faster in query time than existing treebased methods[27].
GCoding[35] generates numeric representations of graphs using encodings of their adjacency matrices and cannot be classified into any of the above groups. These representations allow efficient filtering without computationally expensive graph traversals or feature enumeration. A specialised subgraph isomorphism test is used for verification[35]. While these encodings provide a compact index, expensive eigenvalue calculations are required to compute them[27].
Each of these methods can be applied to a wide variety of problems because they were designed for general graphs (i.e. graphs with an unrestricted degree and/or node count) that do not make use of the inherent structural constraints of βsheets. For example, each βresidue has at most four neighbours: the preceding and following peptidebonded residues and one bridgepartner located on each of the two adjacent hydrogenbonded βstrands.
The problem of protein 3D substructure searching involves searching a database of protein structures for structures that contain substructures similar to a query structure and remains a significant problem in structure biology. These substructures may be relevant to biological processes such as binding sites, enzymatic function, or may be representative of a particular fold family[16, 36]. Current methods for substructure searching are based on the comparison of threedimensional coordinates between structures and use computationally complex structural alignment algorithms. Methods such as Dali[37], DaliLite[38], and SHEBA[39] align protein structures at the residue level, that is, they find onetoone residue alignments between pairs of proteins; methods such as QPTableauSearch[40] and SATableauSearch[16] align proteins at the level of secondary structure elements (SSEs) and therefore lack residuelevel specificity but are generally faster[16]. A common drawback of these methods is that exhaustive pairwise comparisons are required between the query and the each protein structure in a database. This naive approach often leads to redundant comparisons between highly similar structures or structures with no obvious match, ultimately resulting in queries requiring hours or even days to complete[16].
Recently, LabelHash[36, 41] was developed primarily for the 3D substructure matching of small motifs, commonly between 4 and 15 residues. This method is unique among structural search methods in general since it uses a precomputed index to vastly accelerate querying in a manner similar to those of graph indexing approaches. Indeed, the results in this paper show that LabelHash yields a considerable performance boost in compute time over a conventional pairwise structural alignment approach (see Results and discussion).
In this paper we describe BetaSearch, a method that allows fast querying of βresidue motifs in large datasets of protein structures. Our method leverages the natural planar constraints of βsheets by indexing them as 2D matrices, known as βmatrices. This approach avoids the geometric, topological, and computational complexities usually involved in 3D substructure or graph querying. Furthermore, by using βsheet representations independent of a 3D coordinate system, BetaSearch identifies matching βresidue motifs in structurally and sequentially dissimilar proteins.
Results and discussion
We have compared the performance of BetaSearch against stateoftheart graph indexing and 3D substructure search methods separately. The results of three case studies are also presented, which provide biologicallyrelevant contexts in which BetaSearch could be used.
Comparisons with graph indexing methods
We compared BetaSearch against SING and GraphGrepSX. SING was shown to outperform existing methods in terms of query time on standard datasets of chemical compounds, protein transcription networks, proteininteraction networks, and synthetic graphs[27].
The elapsed indexing, total query, filtering, and verification times were averaged over five repetitions. SING and GraphGrepSX were run using l_{ p }=4 and l_{ p }=10, where l_{ p } denotes the path length. These values were chosen by the authors of each method in their own comparisons[23, 27]. We were unable to use larger l_{ p } values or datasets of more than 16,000 βsheets due to the memory consumption of the SING and GraphGrepSX implementations. We therefore only reported results for datasets up to and including N = 16,000. Accuracies of each method were not measured since each query matches at least one βsheet and any nonmatching βsheet is excluded at the filtering and verification stages of each method.
Indexing
The elapsed times and disk space required for index construction are shown in Table1.
BetaSearch recorded the fastest indexing times with a 1.9 times speedup over the next fastest method (GraphGrepSX,l_{ p }=4) for the N = 16,000 dataset. The size of the BetaSearch indices were similar to those of SING,l_{ p }=4 since trimers have an effective path length of l_{ p }=3 and both use hash tables.
The l_{ p }=10 variants of SING and GraphGrepSX were slower than their l_{ p }=4 variants due to the increase in the number of features generated in the former case. This observation was consistent with those of general graphs[23].
GraphGrepSX,l_{ p }=4 generated the smallest indices by a considerable margin through the use of a suffix trees to store its indices. However, the results obtained in the following sections show that the reduction in index disk space came at a significant cost to the querying time.
Furthermore, the BetaSearch index is limited only by the size of the hard disk on which it is stored, whereas the implementations of SING and GraphGrepSX used in this study were memorylimited, requiring the entire index to be loaded into memory in order for queries to be performed.
Overall query times
Overall query times (graph indexing comparisons)
Dataset size  

Method  1,000  2,000  4,000  8,000  16,000 
7205  14,516  29,101  58,795  117,415  
GraphGrepSX,l_{ p }=4  2m 10s  9m 13s  44m 16s  3h 19m 37s  15h 03m 54s 
GraphGrepSX,l_{ p }=10  1m 18s  6m 12s  24m 55s  1h 50m 52s  8h 02m 31s 
SING,l_{ p }=4  50s  3m 27s  12m 16s  52m 41s  3h 53m 45s 
SING,l_{ p }=10  20s  1m 19s  5m 16s  23m 06s  1h 37m 49s 
BetaSearch  2s  7s  33s  2m 14s  8m 19s 
The tradeoff between the index disk size and querying times within the SING and GraphGrepSX variants can be seen in these results where the l_{ p }=10 variants required four to five times as much disk space but were at least twice as fast as the l_{ p }=4 variants.
Filtering
Filtering times and precisions (graph indexing comparisons)
Dataset size  

Method  1,000  2,000  4,000  8,000  16,000 
GraphGrepSX,l_{ p }=4  9s  40s  3m 04s  13m 30s  56m 27s 
0.38  0.34  0.23  0.08  0.03  
GraphGrepSX,l_{ p }=10  10s  43s  3m 03s  13m 14s  55m 05s 
0.92  0.97  0.97  0.34  0.20  
SING,l_{ p }=4  25s  1m 42s  6m 38s  28m 23s  2h 01m 40s 
0.39  0.39  0.39  0.39  0.39  
SING,l_{ p }=10  13s  54s  3m 51s  16m 54s  1h 10m 32s 
0.98  0.99  1.00  1.00  1.00  
BetaSearch  2s  7s  32s  2m 13s  8m 16s 
1.00  1.00  1.00  1.00  1.00 
In contrast to their indexing performances, the l_{ p }=10 variants of SING and GraphGrepSX generally outperformed their l_{ p } = 4 variants. A larger l_{ p }value has more specificity and therefore results in fewer numbers of filtered candidates than a small l_{ p } value, reducing the verification load. The BetaSearch and l_{ p } = 10 precision values were consistently near 1.0 for all datasets and query sizes. The precision of the l_{ p } = 4 variants were considerably lower than the l_{ p } = 10 variants due to the aforementioned specificity limitations of smaller path lengths.
Verification
We measured the mean verification time as the total verification time for a dataset divided by the total number of filtered candidates for a dataset. The mean verification times were less than a second due to the relatively small query graphs involved in this study. The speedups of each method were measured using the GraphGrepSX,l_{ p }=4 times as the baseline and are shown in Figure2.
BetaSearch consistently recorded the fastest verifications across all query sizes and datasets, this is because the BetaSearch verification algorithm runs in quadratic time whereas the VF2 algorithm employed by SING and GraphGrepSX was designed for general graphs and has a potential nonpolynomial time complexity[21]. The largest speedup by BetaSearch was achieved for queries with two edges, since these queries equated to individual trimers, there was no need for candidates to be verified.
Comparisons with 3D substructure search methods
We have compared BetaSearch with LabelHash and SHEBA since they each perform residuelevel matching. SHEBA was shown to be amongst the most accurate substructure search methods in recent work[16], however, LabelHash has yet to be evaluated against other methods. LabelHash and SHEBA were run using default search parameters. Comparisons with DaliLite were unable to be performed due to the majority of our queries and βsheets not meeting the minimum number of residues required by DaliLite. DaliLite was shown to have accuracies comparable to SHEBA but with considerably longer compute times[16].
LabelHash at p^{ ′ }≥0.999 clearly outperforms SHEBA on all query sizes, however, neither method performed particularly well on queries of 10 residues or less with the worst F_{1}scores observed for queries of 4–5 residues, which have the largest number of hits amongst all query sizes (see Additional file1: Figure S1A). Although, once the queries reach sizes of 25 residues, LabelHash maintained F_{1} scores of at least 0.9 since the number of possible hits closely approaches the number queries (see Additional file1: Figure S1B).
We measured the CPU times of each method and computed the speedups over SHEBA. The wallclock times were also measured but were omitted since they were analogous to the CPU times. The CPU times of each method for the ASTRAL95 query set were measured as follows:

SHEBA – 239 h 25 m

LabelHash – 33 h 17 m

BetaSearch – 0 h 59 m
Figure3B shows the speedups at each query size. BetaSearch achieved total speedups of 240 times over SHEBA and 33 times over LabelHash. The largest speedups of BetaSearch were obtained for queries of 4–15 residues, which are the sizes of commonly studied motifs[41]. The improved performance of BetaSearch and LabelHash over SHEBA can be attributed to their use of indices which removes the need to perform exhaustive pairwise comparisons for each query against the dataset. This naive approach to substructural searching can lead to query sets taking days to complete[16, 40].
Case Studies
βresidue motifs can contribute both to the structural and functional features of a protein. For researchers who study protein structure, BetaSearch can be a useful tool for surveying particular βsheet configurations across known protein structures. The frequency of a particular motif may give an indication of its relative stability as a βsheet structural element. Researchers who study functional aspects of βsheets can use BetaSearch to identify similar motifs in unrelated proteins, as we demonstrate with the biotinbinding pockets of avidin and streptavidin. BetaSearch is fast with a simple, intuitive search query context that allows the researcher to efficiently make comparisons against known βsheets.
A typical BetaSearch workflow involves the researcher (i) inspecting a protein structure for a βsheet of interest, (ii) identifying a specific βresidue motif, and (iii) manually entering the amino acids in the corresponding βmatrix into BetaSearch. Alternatively, this workflow can be automated, allowing BetaSearch to be used in a data mining or knowledgediscovery capacity which potentially allows interesting relationships between specific amino acid configurations and protein structures or functions, which would not be intuitively revealed by manual trialanderror querying.
To demonstrate the capabilities and potential usecases of BetaSearch we present the results of three case studies. These were drawn from realworld examples and illustrate the role βresidue motifs play in the structure and function of proteins. We also provide the matches from comparative queries using BLAST to demonstrate the difference in matches between a conventional sequencebased homology search and BetaSearch (see Additional files2,3, and4).
Case Study 1  Synthetic motifs in the Top7 protein
A structural alignment of Top7 and CLC around the matching regions of the query motif (Figure4B) shows remarkably, that the βstrand topology and sidechain directions are nearly identical. This does not suggest a homology between the two proteins because Top7 is entirely synthetic. However, our findings demonstrate that the RosettaDesign[44] approach used to engineer Top7 had inadvertently reproduced a known stable βresidue motif ab initio. The CLC protein was not found in a BLAST query of Top7 chain A (see Additional file2).
Case Study 2  Biotinbinding domains
We have characterised these biotinbinding sites as a minimal, βresidue motif (Figure5C). This putative motif is evolutionary conserved between the avidins and has not yet been shown to be recurrent in evolutionarily distant proteins. Using this query, BetaSearch not only identifies the structures of avidin and streptavidin, but also xenavidin—a biotinbinding protein from Xenopus tropicalis (frog). A number of seemingly unrelated proteins were also matched including uncharacterised proteins from Roseovarius nubinhibens [PDB:3BVC] and Oceanicola granulosus [PDB:2RG4]; and the human complement protein C8 gamma [PDB:1IW2]. The complete set of matching βsheets is listed in Additional file4.
Inspection of the uncharacterised proteins reveal a similar arrangement of residues to the knownbiotin binding proteins but with less room for the ligand to bind. More tantalising is the match with the gamma subunit of human C8 which is a crucial component of the cytolytic membrane attack complex (MAC)[45]. This subunit has a characteristic lipocalin fold with a distinctive binding pocket similar to the avidins, however, the ligand target of C8 gamma remains unknown[45]. Based on the spatial similarities of this binding pocket with the biotinbinding sites of avidin, one may suggest that these proteins could have an affinity for biotin or a biotinlike compounds.
These results demonstrate that a relatively small βresidue motif query can be matched in unrelated proteins. This capability can be particularly useful in characterising proteins of unknown function by similarities in βresidue motifs to those of known function.
Case Study 3  βstrand pairing prediction
One of the unsolved problems of tertiary structure prediction is the ability to predict the pairs of βstrands which are hydrogen bonded, and therefore adjacent, in a βsheet[46]. Information about adjacent βstrands can be used to determine the overall topology of a βsheet. A number of βsheet topology prediction algorithms exist that are based on wellknown machine learning methods[46–49].
The predicted strand pairing scores for the 1A09 chain A β sheet (Case Study 3)
Pair  Parallel score  Antiparallel score  Total score 

^{∗}4,5  4.86  4.99  9.85 
^{∗}2,3  4.59  4.60  9.19 
^{∗}3,4  4.32  4.31  8.64 
^{∗}1,2  4.09  4.09  8.18 
3,5  3.12  3.09  6.21 
2,4  2.98  2.97  5.96 
1,3  2.95  2.97  5.93 
1,4  2.72  2.72  5.44 
2,5  2.33  2.33  4.66 
Conclusion
We have described a method for indexing and querying βresidue motifs, called BetaSearch, that is at least an order of magnitude faster than stateoftheart graph indexing methods. These speedups are achieved by indexing βsheets as 2D matrices of amino acids known as βmatrices. This representation leverages the inherent planar structural constraints of βsheets, thereby avoiding much of the computational complexity involved in querying and indexing 3D or graph representations of protein structures. BetaSearch is therefore able to achieve quadratictime querying. Filtering precisions were close to 1.0 for all datasets and query sizes, resulting in near minimal verification time.
When compared with existing 3D substructure search methods, BetaSearch achieves a 240 times speedup over the baseline (SHEBA) and a 33 times speedup over the next fastest method (LabelHash). The demonstrated efficiency of BetaSearch lends itself well to the rapid exploration of probable motif or βsheet conformations in a matter of minutes, rather than days or weeks with 3Dbased methods. Furthermore, the ability of BetaSearch to perform exact matching ensures that correct hits are not missed.
Our three case studies demonstrated the utility of BetaSearch in biological contexts. We discovered that the synthetic Top7 protein shares an identical βresidue motif with a known naturallyoccurring protein—the CharcotLeyden crystal. A small query derived from the biotinbinding motif of avidins easily identified unrelated biotinbinding proteins and is suggestive of biotinbinding in others including the gamma subunit of the human C8 complement protein. BetaSearch, with its ability to identify functional similarity from unrelated proteins can potentially help characterise the proteins in the PDB with unknown function. We also demonstrated how BetaSearch could be used to predict strand pairing in βsheets, which could help reduce the search space of more complex supersecondary or tertiary structure prediction tasks. Although our work has focused on substructural motifs in βsheets, our algorithm can be modified to perform querying of any substructural motif involving pairwise interactions, such as the wellcharacterised hydrogenbond pairings in helices and turns. Indeed, this is an avenue of development we are currently exploring.
It is our intention for BetaSearch to be used by protein researchers to supplement conventional sequence and structural search methods. For example, the efficiency of the BetaSearch filtering and verification algorithms introduces the possibility for their use as a rapid “firstpass” filter to improve the querying performance of other methods. Such an application would be nontrivial to develop but could potentially reduce conventional structural query times from hours to minutes.
Findings
The pseudocode for each of the algorithms described in this section is provided in the Supplementary Materials.
Trimers
A trimer is a path of three amino acids in a βmatrix configured in the shape of an ‘L’ (an Ltrimer), vertically in the same column (a Vtrimer), or horizontally in the same row (an Htrimer). Trimers are the features by which βmatrices are indexed in BetaSearch. An example of the trimer extraction process is shown in Additional file1: Figure S2.
A trimer t has a number of attributes that encode its configuration and location within a βmatrix:

t.seq: a three letter string of residues spanned by the trimer where$\begin{array}{l}\phantom{\rule{2em}{0ex}}\text{if}\phantom{\rule{2.77695pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}=\text{\u201cabc\u201d}\\ \phantom{\rule{0.3em}{0ex}}\text{then}\phantom{\rule{2.77695pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]\to \text{\u201ca\u201d},\phantom{\rule{0.3em}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[1\right]\to \text{\u201cb\u201d},\phantom{\rule{0.3em}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\to \text{\u201cc\u201d}.\end{array}$(4)

t.class: an integer representing the class of the trimer, defined as$\begin{array}{l}\mathrm{t.}\mathrm{\text{class}}=\left\{\begin{array}{l}1\phantom{\rule{1em}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is an Ltrimer}\\ 3\phantom{\rule{1em}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is a Vtrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]\ne \mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 5\phantom{\rule{1em}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is an Htrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]\ne \mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 15\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is a Vtrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]=\mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 31\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is an Htrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]=\mathrm{t.}\mathrm{\text{seq}}\left[2\right].\end{array}\right.\end{array}$(5)

t.id: a (t.class,t.seq) tuple.

t.orient: an integer value such that t.orient$\in \left\{0,1,2,3\right.\}$. These values describe the possible orientations of a trimer and were chosen to allow the calculation of x and yaxis trimer reflections using the bitwiseXOR (‘⊕’) operator. Orientation reflections are calculated as$\begin{array}{c}{t}^{\prime}.\mathrm{\text{orient}}=\left\{\begin{array}{cc}\mathrm{t.}\mathrm{\text{orient}}\oplus 1& \phantom{\rule{2.77695pt}{0ex}}\text{if reflected in the yaxis}\\ \mathrm{t.}\mathrm{\text{orient}}\oplus 2& \phantom{\rule{2.77695pt}{0ex}}\text{if reflected in the xaxis}\end{array}\right.\end{array}$(6)
where t’ is the reflection of t. Additional file1: Figure S2 shows how trimer orientations are determined for each trimer class.
t.eqorients: an integer that encodes the equivalent orientations of t.orient, defined as$\begin{array}{l}\mathit{\text{t.}}\mathrm{\text{eqorients}}=\left\{\begin{array}{l}{2}^{\mathit{\text{t.}}\mathrm{\text{orient}}}\phantom{\rule{2.36043pt}{0ex}}\phantom{\rule{2.36043pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=1\\ 3\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{2.36043pt}{0ex}}\phantom{\rule{9pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=3\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]<\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\\ 12\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{6pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=3\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]>\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\\ 5\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{2.36043pt}{0ex}}\phantom{\rule{9pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=5\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]<\mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 10\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{6pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=5\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]>\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\\ 15\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{6pt}{0ex}}\text{otherwise,}\end{array}\right.\end{array}$(7)such that$\begin{array}{l}\mathit{\text{t.}}\mathrm{\text{eqorients}}\phantom{\rule{2.36043pt}{0ex}}\text{\&}\phantom{\rule{2.36043pt}{0ex}}i=\left\{\begin{array}{l}1\phantom{\rule{5pt}{0ex}}\text{if orientation}\phantom{\rule{2.36043pt}{0ex}}i\phantom{\rule{2.36043pt}{0ex}}\text{is equivalent to}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{orient}}\\ 0\phantom{\rule{5pt}{0ex}}\text{otherwise,}\end{array}\right.\end{array}$(8)where ‘<’ and ‘>’ are the lexicographic lessthan and greaterthan operators; and ‘&’ is the bitwiseAND operator. The equivalent orientations of a trimer are encoded using bitmasks such that orientation i is equivalent to t.ORIENT if the i^{th}bit of t.class is set to 1. The t.class value for each type of trimer is an encoding of the minimum t.class value for the class.

t.row: the row coordinate of t.seq[1] within its βmatrix.

t.col: the column coordinate of t.seq[1] within its βmatrix.

t.coord: a (t.row,t.col) tuple.

t.span1, t.span2: the row (t.rowspan) or column spans (t.colspan) of a trimer, depending on the trimer class. Each span is an ordered tuple (i,j) where i is the coordinate of t.seq[1] and j is the coordinate of either t.seq[0] or t.seq[2], depending on the trimer type. Ltrimers have one row span and one column span, Vtrimers have two row spans, and Htrimers have two column spans. Examples of the spans for each trimer are shown in Additional file1: Figure S3.
Index construction
BetaSearch uses three indices:$\mathcal{D}$,$\mathcal{R}$, and$\mathcal{C}$.

$\mathcal{D}$ is an inverted index that maps each trimer id to the set of βmatrices in which they are contained, defined as$\mathcal{D}\left[\mathrm{id}\right]\mapsto \{b\in B:\mathrm{id}\in \mathrm{b.}\mathrm{\text{trimerids}}\}$(9)
where B is the set of βmatrices in the dataset.

$\mathcal{R}$ maps a compound key${\kappa}_{\mathcal{R}}$ to a trimer t, defined by$\begin{array}{l}\phantom{\rule{40pt}{0ex}}\mathcal{R}\left[{\kappa}_{\mathcal{R}}\right]\mapsto t\\ \phantom{\rule{10pt}{0ex}}\text{where}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\kappa}_{\mathcal{R}}=\left(\mathrm{t.}\mathrm{\text{matrixid}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{id}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{eqorients}},\phantom{\rule{0.3em}{0ex}}\right.\\ \phantom{\rule{60pt}{0ex}}\left(\right)close=")">\mathrm{t.}\mathrm{\text{coord}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{rowspan}}\end{array}$(10)
such that class∉{3,15}.

$\mathcal{C}$ maps a compound key${\kappa}_{\mathcal{C}}$ to a trimer t, defined by$\begin{array}{l}\phantom{\rule{40pt}{0ex}}\mathcal{C}\left[{\kappa}_{\mathcal{C}}\right]\mapsto t\\ \phantom{\rule{10pt}{0ex}}\text{where}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\kappa}_{\mathcal{C}}=\left(\mathrm{t.}\mathrm{\text{matrixid}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{id}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{eqorients}},\phantom{\rule{0.3em}{0ex}}\right.\\ \phantom{\rule{60pt}{0ex}}\left(\right)close=")">\mathit{\text{t.}}\mathrm{\text{coord}},\phantom{\rule{0.3em}{0ex}}\mathit{\text{t.}}\mathrm{\text{colspan}}\end{array}$(11)
such that t.class∉{5,31}.
Ltrimers are indexed in$\mathcal{R}$ and$\mathcal{C}$; whereas Htrimers are indexed only in$\mathcal{C}$ because they do not contain any row spans, conversely, Vtrimers are indexed only in$\mathcal{R}$ because they do not contain any column spans. The BuildIndices procedure in Additional file1: Algorithm S1 describes the index construction algorithm.
Time complexity
Each entry in a βmatrix is the intersection of at most six trimers: four Ltrimers (one in each of the four orientations), one Vtrimer, and one Htrimer. BuildIndices runs in O(6mn) time where m is the maximum number of residues in a βmatrix and n is the number of βmatrices.
Filtering
 1.
Enumerate the query trimers and storing them in Q.trimers.
 2.
Store the corresponding trimer IDs in Q.trimerids.
Partially matching candidates are obtained by pruning the βmatrices in$\mathcal{D}$ using two filters. The FirstFilter procedure, described in Additional file1: Algorithm S2, prunes the βmatrices in$\mathcal{D}$ that do not contain the entire set of trimer IDs in Q.trimerids.
βmatrices are indexed in a single arbitrary orientation, therefore a procedure for comparing a query against all orientations of a candidate is required. The naive approach enumerates and stores the x and yaxes reflections of each βmatrix, effectively tripling the index size. Alternatively, the x and yaxes reflections of the query are enumerated and compared with a candidate, effectively tripling the filtering time.
The Congruent procedure defined in Additional file1: Algorithm S3 implements Equation 5 using bitwise operations that enable constanttime set unions and intersections.
Time complexity
The FirstFilter algorithm runs in$O\left(\left{p}_{min}\right\xb7\right.$Q. trimerids) time where$\left{p}_{min}\right$ is the cardinality of the smallest postings set and Q. trimerids is the number of unique trimer ids in the query. A postings set is a set in the index of βmatrices containing a particular query trimer. The SecondFilter algorithm runs in$O\left(\left\mathrm{Q.}\mathrm{\text{trimerids}}\right\xb7\left{C}_{1}\right\right)$ time where C_{1} is the number of candidates returned by FirstFilter.
Verification
Most graph indexing methods use the VF2[20] or Ullmann[19] algorithms for candidate verification. Other methods use algorithms optimised for the data structures of their features and indices.
Constraining βgraphs to the simpler structures of βmatrices has enabled us to develop a quadratic time candidate verification algorithm that does not rely on subgraph isomorphism tests.
A graph G of the query is constructed, in which each vertex is a trimer q∈Q. trimers and each edge e=$\left({q}_{\mathrm{src}},{q}_{\mathrm{des}}\right)$ indicates a span overlap (i.e. t.colspan or t.rowspan) between adjacent query trimers q_{src} and q_{des}. The algorithm to construct a query graph is implemented in the MakeQueryGraph procedure described in Additional file1: Algorithm S4.
The keys by which trimers are indexed in$\mathcal{R}$ and$\mathcal{C}$ contain their locations and geometric configurations within a candidate, allowing MatchPairs to be tested in constanttime. MatchPairs is implemented using the procedures defined in Additional file1: Algorithm S8. The VerifyCandidate procedure in Additional file1: Algorithm S6 describes our algorithm for verifying a single candidate.
Time complexity
The MakeQueryGraph procedure runs in$O\left(\left\mathrm{Q.}\mathrm{\text{trimers}}\right\right)$ time where Q. trimers is the number of trimers in the query. The VerifyCandidate procedure runs in$O\left({\left\mathrm{Q.}\mathrm{\text{trimers}}\right}^{2}\right)$ time, which is called by the Verify procedure in Additional file1: Algorithm S7 to verify each filtered candidate in C_{2}. Therefore, the overall time complexity of our candidate verification algorithm is$O\left(\left\mathrm{Q.}\mathrm{\text{trimers}}\right+\left{C}_{2}\right\xb7{\left\mathrm{Q.}\mathrm{\text{trimers}}\right}^{2}\right)$.
βstrand pairing prediction
 1.
Let $\mathcal{I}$ be the index generated from the ASTRAL95 dataset.
 2.
Let S ^{ P }and S ^{ A }be the strand pairing score matrices for the parallel and antiparallel strand pairs, respectively.
 3.
Let M be a matrix where M _{ ij }contains the number of 4mers between strands i and j. We define a 4mer as a single occurrence of twoconsecutive bridge pairings. For example, the βmatrix $\left(\begin{array}{ccc}\mathrm{A}& \mathrm{B}& \mathrm{C}\\ \mathrm{D}& \mathrm{E}& \mathrm{F}\end{array}\right)$ has the 4mers – $\left(\begin{array}{cc}\mathrm{A}& \mathrm{B}\\ \mathrm{D}& \mathrm{E}\end{array}\right)$ and $\left(\begin{array}{cc}\mathrm{B}& \mathrm{C}\\ \mathrm{E}& \mathrm{F}\end{array}\right)$.
 4.For each strand pair (i,j):
 (a)For each alignment a of j on i:
 (i)For each 4mer m of a:
 (A)
Let h _{ P }and h _{ A }be the number of hits returned from querying $\mathcal{I}$ for the parallel and antiparallel orientations of m, respectively.Set ${S}_{\mathit{\text{ij}}}^{P}\leftarrow {S}_{\mathit{\text{ij}}}^{P}+{h}_{P}$Set ${S}_{\mathit{\text{ij}}}^{A}\leftarrow {S}_{\mathit{\text{ij}}}^{A}+{h}_{A}$Set ${M}_{\mathit{\text{ij}}}\leftarrow {M}_{\mathit{\text{ij}}}+1$
 (A)
 (i)
 (b)
Let $\mathrm{seq}\text{\_}\mathrm{sep}\left(i,j\right)$ be the number of residues separating strands i and j.Set ${S}_{\mathit{\text{ij}}}^{P}\leftarrow \frac{{S}_{\mathit{\text{ij}}}^{P}}{{M}_{\mathit{\text{ij}}}}log\left[\mathrm{seq}\text{\_}\mathrm{sep}\left(i,j\right)\right]$Set ${S}_{\mathit{\text{ij}}}^{A}\leftarrow \frac{{S}_{\mathit{\text{ij}}}^{A}}{{M}_{\mathit{\text{ij}}}}log\left[\text{seq\_sep}\left(i,j\right)\right]$
 (a)
Evaluation
The graph indexing methods were evaluated according to their filtering, verification, and overall query times. The sizes of the indices generated by each method were measured using the POSIX “stat” command. The filtering precision was calculated as the total number of hits divided by the total number of filtered candidates for all queries of a given dataset size.
is the harmonic mean of the precision and recall values. It provides a convenient way of evaluating the query precision and recall as a single value. The β parameter allows emphasis to be placed on precision or recall depending on the query performance goals. We used the F_{1} score in our 3D substructure search comparisons as a measure of query accuracy.
Datasets
For the graph indexing comparisons, the January 3, 2011 (PDB2011) snapshot of the PDB was used to generate a dataset of 209,127 βsheets. A number of PDB files were excluded due to discrepancies in their content[50]. βsheets exhibiting poor planarity, such as those with significantly pronounced twisting or curvature, were also excluded.
The DSSP[51] algorithm was used to assign secondary structures to each PDB file. Residues with a secondary structural assignment of “E” or “b” were considered to form part of a βstrand. DSSP also assigns bridge partner relationships between residues on adjacent βstrands, which were used to determine the bridge edges in βgraphs.
We used ProOrigami[52] to generate βgraphs and a topological sort was used to generate the βmatrix from the peptide and bridge edges of each βgraph.
Subsets containing 1,000; 2,000; 4,000; 8,000; and 16,000 βsheets were randomly selected from the dataset. These sizes were used in previous benchmarks[23, 27]. GraphGrepSX and SING were unable to be run on datasets of 32,000 or more due to the memory consumption of their respective implementations, we therefore restricted the sizes of our datasets accordingly.
Each βgraph was preprocessed by inserting “dummy” nodes in place of a labelled edge. Each dummy node was labelled with either a “b” to denote a bridge edge or a “z” to denote a peptide edge in order to avoid conflict with the labels of residue nodes. Preprocessing was required because edgelabelled graphs were not supported by SING or GraphGrepSX.
Queries were generated in the same manner as in previous benchmarks[23]. A query was created from each βgraph by randomly selecting a root node and performing a random breadthfirst traversal until the query obtained the degrees d such that 2≤d≤10. Queries were not generated from βgraphs with insufficient edges. Each query contained a single wildcard node that matched any amino acid. To enable wildcard matching on all methods, each query was repeated by replacing the wildcard node with each of the 20 amino acids. The βmatrices for each query were generated for use by BetaSearch. The total numbers of queries generated for each query size are shown in Additional file1: Table S1.
For the case studies, we generated the required indices from all the βmatrices in the PDB2011 dataset. Each βmatrix was assigned a sheet identifier of the format “Â¡PDB ID¿Â¡chain ID¿_SHEET_Â¡number¿”. For example, the sheet ID of the first βmatrix in chain A of ubiquitin [PDB:1UBQ] is “1UBQA_SHEET_000”.
For the 3D substructure search comparisons, the ASTRAL SCOP 1.75A 95% sequence identity nonredundant dataset[53] of protein structures (ASTRAL95) were used. βsheets were extracted and filtered as per the graph indexing datasets and a total of 29,341 βsheets were obtained. A subset of 26,669 βsheets containing between 4 and 50 residues, inclusive, were used as queries. The βmatrices corresponding to each βsheet were generated and used with BetaSearch.
Implementation
The graph indexing comparisons were performed on a 2.66 Ghz Intel Nehalem 8core processor with 48 GB of main memory running CentOS. The source code to SING and GraphGrephSX were provided by the authors of their respective publications. All methods were implemented in C++, compiled using g++ version 4.3, and depend on the Boost C++ version 1.42.0 libraries[54]. BetaSearch additionally requires Redis[55] version 2.0.4 and the official Redis C headers[56].
The indices in SING and GraphGrepSX were implemented as modified C++ STL “std::map” containers in the memory spaces of their respective processes. In contrast, BetaSearch stores its indices using Redis hash tables that are stored in (diskbased) virtual memory and operates external to BetaSearch as a concurrent process. It is therefore subject to interprocess communication overhead during filtering and indexing, which are included in our experimental timings.
The 3D substructure search comparisons were performed on the same platform with 8 GB of main memory. BetaSearch was reimplemented in Python 2.7 using the Whoosh Python Search Library[57]. LabelHash 1.0.2[58] and SHEBA 3.1.1[59] were downloaded and used. The LabelHash index was built from the PDB coordinates of all the ASTRAL95 βsheets, which were extracted from their original structures using ProDy[60]. The BetaSearch index was built from the βmatrix representations of each βsheet.
Structural renderings and alignments
We used PyMOL[61] to generate 3D renderings and structural alignments of proteins.
Availability and requirements

Project name: BetaSearch

Project homepage: http://www.csse.unimelb.edu.au/∼hohkhkh1/betasearch

Operating system(s): Ubuntu Linux 11.10+ (http://www.ubuntu.com)

Programming language(s): Python

Other requirements: A complete listing of Python module dependencies is provided on the project homepage.

License: None

Any restrictions to use by nonacademics: BetaSearch can be used freeofcharge by nonacademics, provided appropriate citation and credit is given to the authors of this publication.
Funding
This work was supported by a Victorian Life Sciences Computation Initiative (VLSCI) [grant number VR0127] on its Peak Computing Facility at the University of Melbourne, an initiative by the Victorian Government. HH is supported by a NICTA PhD scholarship. NICTA (National ICT Australia) is funded by the Australian Government’s Department of Communications; Information Technology and the Arts; Australian Research Council through Backing Australia’s Ability; ICT Centre of Excellence programs.
Declarations
Acknowledgements
We would like to thank:
The reviewers for their feedback in improving the quality of this article.
R. Di Natale, R. Giugno, V. Bonnici, and D. Shasha for their assistance and providing the GraphGrepSX and SING source code.
M. Moll for his assistance with the LabelHash software.
A. Stivala for his assistance with the ProOrigami source code.
The VLSCI systems support staff for their assistance with our HPC technical requests.
Authors’ Affiliations
References
 Kessel A, BenTal N: Introduction to proteins: structure, function, and motion. 2010, CRC Press, LondonView ArticleGoogle Scholar
 Zaremba SM, Gregoret LM: Contextdependence of amino acid residue pairing in antiparallel βsheets. J Mol Biol. 1999, 291: 463479. 10.1006/jmbi.1999.2961.PubMedView ArticleGoogle Scholar
 Parisien M, Major F: Ranking the factors that contribute to protein βsheet folding. Proteins. 2007, 68: 824829. 10.1002/prot.21475.PubMedView ArticleGoogle Scholar
 Wathen B, Jia Z: Folding by numbers: primary sequence statistics and their use in studying protein folding. Int J Mol Sci. 2009, 10: 15671589. 10.3390/ijms10041567.PubMedPubMed CentralView ArticleGoogle Scholar
 Hubbard TJP: Use of βstrand interaction pseudopotentials in protein structure prediction and modelling. Proceedings of the 27th Hawaii International Conference on System Sciences. 1994, 336344.Google Scholar
 Zhu H, Braun W: Sequence specificity, statistical potentials, and threedimensional structure prediction with selfcorrecting distance geometry calculations of βsheet formation in proteins. Prot Sci. 1999, 8: 326342.View ArticleGoogle Scholar
 Steward RE, Thornton JM: Prediction of strand pairing in antiparallel and parallel βSheets using information theory. Proteins. 2002, 48: 178191. 10.1002/prot.10152.PubMedView ArticleGoogle Scholar
 Rajgaria R, Wei Y, Floudas CA: Contact prediction for beta and alphabeta proteins using integer linear optimization and its impact on the first principles 3D structure prediction method ASTROFOLD. Proteins. 2010, 78: 18251846.PubMedPubMed CentralGoogle Scholar
 Bork P, Koonin E: Protein sequence motifs. Curr Opin Struct Biol. 1996, 6: 366376. 10.1016/S0959440X(96)800571.PubMedView ArticleGoogle Scholar
 Berman HM, Westbrook J, Fend Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235242. 10.1093/nar/28.1.235.PubMedPubMed CentralView ArticleGoogle Scholar
 Bella J, Hindle KL, McEwan PA, Lovell SC: The leucinerich repeat structure. Cell Mol Life Sci. 2008, 65: 23072333. 10.1007/s0001800880190.PubMedView ArticleGoogle Scholar
 Liou YC, Tocilij A, Davies PL, Jia Z: Mimicry of ice structure by surface hydroxyls and water of a βhelix antifreeze protein. Nature. 2000, 406: 322324. 10.1038/35018604.PubMedView ArticleGoogle Scholar
 Makabe K, McElheny D, Tereshko V, Hilyard A, Gawlak G, Yan S, Koide A, Koide S: Atomic structures of peptide selfassembly mimics. Proc Natl Acad Sci USA. 2006, 103: 1775317758. 10.1073/pnas.0606690103.PubMedPubMed CentralView ArticleGoogle Scholar
 Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 33893402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
 Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: ClustalW and ClustalX version 2. Bioinformatics. 2007, 23: 29472948. 10.1093/bioinformatics/btm404.PubMedView ArticleGoogle Scholar
 Stivala A, Wirth A, Stuckey PJ: Fast and accurate protein substructure searching with simulated annealing and GPUs. BMC Bioinformatics. 2010, 11:Google Scholar
 Parisien M: Les feullets beta dans les protéines. Annotation, comparaison et construction. Master’s thesis. 2005, Université de MontréalGoogle Scholar
 Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S: Network analysis of protein structures identifies functional residues. J Mol Biol. 2004, 344: 11351146. 10.1016/j.jmb.2004.10.055.PubMedView ArticleGoogle Scholar
 Ullmann JR: An algorithm for subgraph isomorphism. J ACM. 1976, 23: 3142. 10.1145/321921.321925.View ArticleGoogle Scholar
 Cordella LP, Foggia P, Sansone C, Vento M: A (sub)graph isomorphism algorithm for matching large graphs. IEEE T Pattern Anal. 2004, 10: 13671372.View ArticleGoogle Scholar
 Zampelli S: A constraint programming approach to subgraph isomorphism. PhD thesis. 2008, Université catholique de LouvainGoogle Scholar
 Cook SA: The complexity of theoremproving procedures. Proceedings of the 3rd ACM Symposium on Theory of Computing. 1971, 151158.Google Scholar
 Bonnici V, Ferro A, Giugno R, Pulvirenti A, Shasha D: Enhancing graph database indexing by suffix tree structure. Pattern Recognition in Bioinformatics, Volume 6282 of Lecture Notes in Computer Science. 2010, Springer, 195203.Google Scholar
 Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval. 2008, Cambridge University Press, CambridgeView ArticleGoogle Scholar
 Giugno R, Shasha D: GraphGrep: a fast and universal method for querying graphs. Proceedings of the 16th International Conference on Pattern Recognition, 2002, Volume 2. 2002, 112115.Google Scholar
 Ferro A, Giugno R, Mongiovi M, Pulvirenti A, Skripin D, Shasha D: GraphFind: enhancing graph searching by low support data mining. BMC Bioinformatics. 2008, 9:Google Scholar
 Di Natale R, Ferro A, Giugno R, Mongiovi M, Pulvirenti A, Shasha D: SING: subgraph search in nonhomogeneous graphs. BMC Bioinformatics. 2010, 11:Google Scholar
 Yan X, Yu PS, Han J: Graph indexing based on discrimintative frequent structure analysis. ACM T Database Syst. 2005, 30 (4): 960993. 10.1145/1114244.1114248.View ArticleGoogle Scholar
 Cheng J, Ke Y, Ng W, Lu A: FGIndex: towards verificationfree query processing on graph databases. Proceedings of the 2007 ACM SIGMOD International Conference on the Management of Data. 2007, 857872.View ArticleGoogle Scholar
 Williams DW, Huan J, Wang W: Graph database indexing using structured graph decomposition. IEEE 23rd International Conference on Data Engineering, 2007. 2007, 976985.View ArticleGoogle Scholar
 Zhang S, Hu M, Yang J: TreePi: a novel graph indexing method. IEEE 23rd International Conference on Data Engineering, 2007. 2007, 966975.View ArticleGoogle Scholar
 Zhao P, Yu JX, Yu PS: Graph indexing: tree + delta <= graph. Proceedings of the VLDB Endowment. 2007, 938949.Google Scholar
 He H, Singh AK: Closuretree: an index structure for graph queries. Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). 2006, 3838.Google Scholar
 Shamir R, Tsur D: Faster subtree isomorphism. Proceedings of the 5th Israel Symposium on the Theory of Computing Systems. 1997, 267280.Google Scholar
 Zou L, Chen L, Yu JX, Lu Y: A novel spectral coding in a large graph database. Proceedings of the 11th International Conference on Extending Database Technology (EDBT’08). 2008, 181192.Google Scholar
 Moll M, Bryant DH, Kavraki LE: The LabelHash server and tools for substructurebased functional annotation. Bioinformatics. 2011, 27:Google Scholar
 Holm L, Sander C: Mapping the protein universe. Science. 1996, 273: 595602. 10.1126/science.273.5275.595.PubMedView ArticleGoogle Scholar
 Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics. 2000, 16: 566567. 10.1093/bioinformatics/16.6.566.PubMedView ArticleGoogle Scholar
 Jung J, Lee B: Protein structure alignment using environmental profiles. Prot Sci. 2000, 13: 535543.Google Scholar
 Stivala A, Wirth A, Stuckey PJ: Tableaubased protein substructure search using quadratic programming. BMC Bioinformatics. 2009, 10:Google Scholar
 Moll M, Bryant DH, Kavraki LE: The LabelHash algorithm for substructure matching. BMC Bioinformatics. 2010, 11:Google Scholar
 Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D: Design of a novel globular protein fold with atomiclevel accuracy. Science. 2003, 302: 13641368. 10.1126/science.1089427.PubMedView ArticleGoogle Scholar
 Havranek JJ: Specificity in computational protein design. J Biol Chem. 2010, 285: 3109531099. 10.1074/jbc.R110.157685.PubMedPubMed CentralView ArticleGoogle Scholar
 Liu Y, Kuhlman B: RosettaDesign server for protein design. Nucleic Acids Res. 2006, 34: W235W238. 10.1093/nar/gkl163.PubMedPubMed CentralView ArticleGoogle Scholar
 Rosado CJ, Kondos S, Bull TE, Kuiper MJ, Law RHP, Buckle AM, Voskoboinik I, Bird PI, Trapani JA, Whisstock JC, Dunstone MA: The MACPF/CDC family of poreforming toxins. Cell Microbiol. 2008, 10: 17651774. 10.1111/j.14625822.2008.01191.x.PubMedPubMed CentralView ArticleGoogle Scholar
 Brown WM, Martin S, Chabarek JP, Strauss C, Faulon JL: Prediction of βstrand packing interactions using the signature product. J Mol Model. 2006, 12: 355361. 10.1007/s0089400500524.PubMedView ArticleGoogle Scholar
 Cheng J, Baldi P: Threestage prediction of protein βsheets by neural networks, alignments and graph algorithms. Bioinformatics. 2005, 21: 7584. 10.1093/bioinformatics/bti1004.View ArticleGoogle Scholar
 Jeong JK, Berman P, Przytycka TM: Bringing folding pathways into strand pairing prediction. Lecture Notes in Computer Science, Volume 4645. 2007, Springer, 3848.Google Scholar
 Aydin Z, Altunbasak Y, Erdogan H: Bayesian models and algorithms for protein βsheet prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 8. 2011, Springer, 395409.Google Scholar
 Schierz AC, Soldatova LN, King RD: Overhauling the PDB. Nat Biotechnol. 2007, 25: 437442. 10.1038/nbt0407437.PubMedView ArticleGoogle Scholar
 Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features. Biopolymers. 1983, 22: 25772637. 10.1002/bip.360221211.PubMedView ArticleGoogle Scholar
 Stivala AD, Wybrow M, Wirth A, Whisstock JC, Stuckey PJ: Automatic generation of protein structure cartoons with Proorigami. Bioinformatics. 2011, 27: 33153316. 10.1093/bioinformatics/btr575.PubMedView ArticleGoogle Scholar
 Brenner M, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. 2000, 28: 254256.Google Scholar
 Boost v1.42.0. [http://www.boost.org/users/history/version_1_42_0],
 Redis. [http://www.redis.io]Google Scholar
 hiredis. [https://github.com/antirez/hiredis],
 Whoosh Python search library. [https://bitbucket.org/mchaput/whoosh/wiki/Home],
 LabelHash 1.0.2. [http://labelhash.kavrakilab.org/downloads/python27/LabelHash1.0.2Linux64.tar.gz],
 SHEBA 3.1.1. [https://ccrod.cancer.gov/confluence/download/attachments/63341259/sheba3.1.1.tar.gz],
 Bakan A, Meireles LM, Bahar I: ProDy: protein dynamics inferred from theory and experiments. Bioinformatics. 2011, 27: 15751577. 10.1093/bioinformatics/btr168.PubMedPubMed CentralView ArticleGoogle Scholar
 Schrödinger LLC: The PyMOL Molecular Graphics System, Version 1.3r1. 2010Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Comments
View archived comments (1)