 Technical Note
 Open Access
 Published:
BetaSearch: a new method for querying βresidue motifs
BMC Research Notesvolume 5, Article number: 391 (2012)
Abstract
Background
Searching for structural motifs across known protein structures can be useful for identifying unrelated proteins with similar function and characterising secondary structures such as βsheets. This is infeasible using conventional sequence alignment because linear protein sequences do not contain spatial information. βresidue motifs are βsheet substructures that can be represented as graphs and queried using existing graph indexing methods, however, these approaches are designed for general graphs that do not incorporate the inherent structural constraints of βsheets and require computationallyexpensive filtering and verification procedures. 3D substructure search methods, on the other hand, allow βresidue motifs to be queried in a threedimensional context but at significant computational costs.
Findings
We developed a new method for querying βresidue motifs, called BetaSearch, which leverages the natural planar constraints of βsheets by indexing them as 2D matrices, thus avoiding much of the computational complexities involved with structural and graph querying. BetaSearch exhibits faster filtering, verification, and overall query time than existing graph indexing approaches whilst producing comparable index sizes. Compared to 3D substructure search methods, BetaSearch achieves 33 and 240 times speedups over indexbased and pairwise alignmentbased approaches, respectively. Furthermore, we have presented casestudies to demonstrate its capability of motif matching in sequentially dissimilar proteins and described a method for using BetaSearch to predict βstrand pairing.
Conclusions
We have demonstrated that BetaSearch is a fast method for querying substructure motifs. The improvements in speed over existing approaches make it useful for efficiently performing highvolume exploratory querying of possible protein substructural motifs or conformations. BetaSearch was used to identify a nearly identical βresidue motif between an entirely synthetic (Top7) and a naturallyoccurring protein (CharcotLeyden crystal protein), as well as identifying structural similarities between biotinbinding domains of avidin, streptavidin and the lipocalin gamma subunit of human C8.
Background
The βsheet is a common secondary structure element that plays important functional and structural roles in proteins, for example, the ligandbinding pockets of biotinbinding proteins and the structure of the commonlyoccurring TIMbarrel fold[1]. These processes are often mediated by interactions between adjacent pairs of residues across βstrands. These include the disulphide, ionic, and hydrogen bonds; and hydrophobic packing interactions frequently involved in maintaining the structural stability of a protein or in enzymatic active sites[1]. The influence of pairwise interactions within βsheets and their tertiary structures have been studied experimentally[2] and statistically[3, 4], the results of which have been used to predict βsheet topology[5–7] and tertiary structure[8]. These studies have provided insights into the folding mechanisms of βsheets although it remains an open problem[4]. Examining interresidue interactions at the single pairwise level however, provides only a limited view of a larger interaction network within a βsheet. We refer to these clusters of interacting residues as β residue motifs, which are contiguous subsets of βsheet residues connected by peptide and/or hydrogen bonds (as shown in Figure1D). Unlike sequence motifs, βresidue motifs encode information about both the peptide and bridgepartners of each residue. For the purposes of this study, we consider βresidue motifs to be provisional, since they may also exist via a general conservation between homologs rather than as independent functional units, as is the case for motifs in the traditional sense[9].
Many characteristic βresidue motifs are observed in the Protein Data Bank (PDB)[10]. For example, the βsheets in leucinerich repeat (LRR) domains contain consecutive adjacent interstrand pairs of buried leucines stericallypacked alongside their bridgepartners, contributing to structural stability[11]. Other βresidue motifs appear as a combination of inter and intrastrand residue neighbours, as is the case for the TCT motif of certain antifreeze proteins[12] and glutamic acid/lysine motifs[13]. The conserved biotinbinding site in streptavidin [PDB:1STP] contains a βresidue motif of five inwarddirecting residues of a βbarrel: S88, T90, W92, W108, and L110. Identification of these βresidue motifs can be used to search for other proteins with similar structural elements or function with low sequence identities.
Searching for structural motifs, βresidue or otherwise, in the PDB using linear approaches such as sequence alignment is a difficult, if not impossible task because interresidue interactions can occur across secondary structures that are sporadically located throughout a protein sequence. Furthermore, pairwise residue interactions are not accounted for in conventional multiplesequence alignment tools such as BLAST[14] and CLUSTALW[15], given the onedimensional nature of sequences.
The conventional approach to motif querying, involves the use of protein substructure search methods that structurally align the 3D atomic coordinates of a query with known protein structures. These methods provide a structural context to each query hit and generally produce approximate matches in the form of a ranked list of hits but may take hours to perform few queries due to their reliance on structural alignment algorithms[16].
Alternatively, protein structures can be represented as graphs and queried for motifs using graph indexing approaches[17]. Unlike 3D substructure searches, these methods perform exact matching by querying only the discrete edge, node, and label features of graphs rather than by 3D similarity between continuous coordinates. The query matching algorithms used by existing graph indexing methods are based on solutions to the subgraph isomorphism problem, described briefly as follows:
A graph G=(V,E) is defined by a set of vertices v∈V and a set of edges e∈E where each edge represents a connection between a pair of vertices$\left({v}_{i},{v}_{j}\right)\in V$. A graph is undirected if its edges are unordered pairs and directed otherwise. The degree of a node is the number of edges it has to other nodes.
If G_{1}and G_{2}are graphs defined as${G}_{1}=\left({V}_{1},{E}_{1}\right)$ and${G}_{2}=\left({V}_{2},{E}_{2}\right)$, then G_{1}is a subgraph of G_{2}if${V}_{1}\subseteq {V}_{2}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\wedge \phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{E}_{1}\subseteq \phantom{\rule{0.3em}{0ex}}{E}_{2}$. An isomorphism between G_{1}and G_{2}is a bijection f:V_{1}→V_{2}such that$\forall \left({v}_{i},{v}_{j}\right)\in {E}_{1}\phantom{\rule{2.77695pt}{0ex}}\iff \phantom{\rule{2.77695pt}{0ex}}\left(f\left({v}_{i}\right),f\left({v}_{j}\right)\right)\in {E}_{2}$.
A graph G_{1}is subgraph isomorphic to G_{2}if G_{3}is a subgraph of G_{2}and there exists an isomorphism between G_{3}and G_{1}.
Graph representations of proteins[18] and βsheets[17] have been previously described in which nodes represent residues and edges represent interresidue interactions such as peptide or hydrogen bonds. For simplicity, we define a β graph to be a graph representation of a βsheet in which each node is labelled with a residue name, solid edges represent peptide bonds, and dotted edges represent a bridgepartner relationship between adjacent interstrand residue pairs (Figure1B). βresidue motifs are considered to be connected subgraphs of βgraphs whose nodes are labelled with amino acids.
The planarity of βgraphs allows for a compact twodimensional representation. We define a βmatrix to be a projection (or “flattening”) of a βgraph onto a 2D matrix of amino acid characters. The residues in the same βstrand are located in the same row and residues connected by bridge edges lie in the same column (Figure1C). βresidue motifs are then considered to be submatrices of βmatrices (Figure1D).
Algorithms for detecting subgraph isomorphisms have been described for general graphs[19, 20] that run in time factorial to the number of vertices in a query[21] and cannot be solved in polynomial time, as it is proven to be NPcomplete for general graphs[22]. Naively performing a subgraph test on every graph in a database is therefore computationallyexpensive[23]. Consequently, graph indexing methods were developed to simplify this problem. A potentially large number of nonmatching graphs can be pruned from the search by using indexing techniques analogous to those in conventional search engines[24]. Graphs can be indexed by various features using disk or memorybased indices. These approaches usually consist of three stages:

1.
Index construction: The features of each graph in a database are obtained, each representing a graph characteristic. A data structure, usually an inverted index, is then constructed in which each feature is associated with the set of their originating graphs.

2.
Filtering: The features of the query graph are obtained. An initial set of coarsegrained candidates containing these features are retrieved from the index (or indices). It is possible these candidates do not contain any query matches.

3.
Verification: Each candidate is checked for a subgraph that exactly matches the query. This is performed using a subgraph test in most methods.
Graph indexing methods are loosely classified into three categories: pathbased, subgraphbased, and treebased.
Pathbased methods (GraphGrep[25], GraphFind[26], GraphGrepSX[23], and SING[27]) index graphs using paths as features. These methods construct an inverted index I that maps each path p to a set of their originating graphs
where p is a sequence of connected vertices in a graph G
where l_{ p } is the maximum path length. The filtering process returns the set of candidate graphs C containing all the paths of a query graph Q
and verification of each candidate is performed using the VF2 algorithm[20].
Enumerating all paths up to and including length l_{ p }produces large feature sets and consequently, large indices. GraphFind avoids these problems by pruning redundant features using data mining techniques similar to those of gIndex[28]. GraphGrepSX exploits feature redundancy by implementing its index as a suffix tree where each string is a path sequence. The suffix tree was shown to be more spaceefficient than the hash tables used by other pathbased methods[23]. Our empirical results corroborate these findings (Table1). SING uses a second filtering stage that prunes candidates by using path locality information. For example, a path p in a candidate must be surrounded by the same paths as in the query. This improvement in filtering comes at the cost of maintaining an auxiliary hash table of locality information.
Subgraphbased methods (gIndex[28], FGIndex[29], and GDIndex[30]) use subgraphs as features and retain more topological information about graphs than paths due to their more complex structures. Index construction then requires time exponential to the number of nodes in each graph which also produces larger indices than those of pathbased methods. These problems can be alleviated to a degree by indexing only the most frequent subgraphs[28].
Treebased methods (TreePi[31], TreePi+δ[32], and CTree[33]) use subtrees as features and are purported to provide an ideal compromise between the small indices of pathbased methods and the specificity of subgraphbased methods. Algorithmic operations on trees are generally more asymptotically efficient than those on graphs, in particular, subtree isomorphism can be tested in polynomial time[34]. However, previous results showed that certain pathbased methods are still an order of magnitude faster in query time than existing treebased methods[27].
GCoding[35] generates numeric representations of graphs using encodings of their adjacency matrices and cannot be classified into any of the above groups. These representations allow efficient filtering without computationally expensive graph traversals or feature enumeration. A specialised subgraph isomorphism test is used for verification[35]. While these encodings provide a compact index, expensive eigenvalue calculations are required to compute them[27].
Each of these methods can be applied to a wide variety of problems because they were designed for general graphs (i.e. graphs with an unrestricted degree and/or node count) that do not make use of the inherent structural constraints of βsheets. For example, each βresidue has at most four neighbours: the preceding and following peptidebonded residues and one bridgepartner located on each of the two adjacent hydrogenbonded βstrands.
The problem of protein 3D substructure searching involves searching a database of protein structures for structures that contain substructures similar to a query structure and remains a significant problem in structure biology. These substructures may be relevant to biological processes such as binding sites, enzymatic function, or may be representative of a particular fold family[16, 36]. Current methods for substructure searching are based on the comparison of threedimensional coordinates between structures and use computationally complex structural alignment algorithms. Methods such as Dali[37], DaliLite[38], and SHEBA[39] align protein structures at the residue level, that is, they find onetoone residue alignments between pairs of proteins; methods such as QPTableauSearch[40] and SATableauSearch[16] align proteins at the level of secondary structure elements (SSEs) and therefore lack residuelevel specificity but are generally faster[16]. A common drawback of these methods is that exhaustive pairwise comparisons are required between the query and the each protein structure in a database. This naive approach often leads to redundant comparisons between highly similar structures or structures with no obvious match, ultimately resulting in queries requiring hours or even days to complete[16].
Recently, LabelHash[36, 41] was developed primarily for the 3D substructure matching of small motifs, commonly between 4 and 15 residues. This method is unique among structural search methods in general since it uses a precomputed index to vastly accelerate querying in a manner similar to those of graph indexing approaches. Indeed, the results in this paper show that LabelHash yields a considerable performance boost in compute time over a conventional pairwise structural alignment approach (see Results and discussion).
In this paper we describe BetaSearch, a method that allows fast querying of βresidue motifs in large datasets of protein structures. Our method leverages the natural planar constraints of βsheets by indexing them as 2D matrices, known as βmatrices. This approach avoids the geometric, topological, and computational complexities usually involved in 3D substructure or graph querying. Furthermore, by using βsheet representations independent of a 3D coordinate system, BetaSearch identifies matching βresidue motifs in structurally and sequentially dissimilar proteins.
Results and discussion
We have compared the performance of BetaSearch against stateoftheart graph indexing and 3D substructure search methods separately. The results of three case studies are also presented, which provide biologicallyrelevant contexts in which BetaSearch could be used.
Comparisons with graph indexing methods
We compared BetaSearch against SING and GraphGrepSX. SING was shown to outperform existing methods in terms of query time on standard datasets of chemical compounds, protein transcription networks, proteininteraction networks, and synthetic graphs[27].
The elapsed indexing, total query, filtering, and verification times were averaged over five repetitions. SING and GraphGrepSX were run using l_{ p }=4 and l_{ p }=10, where l_{ p } denotes the path length. These values were chosen by the authors of each method in their own comparisons[23, 27]. We were unable to use larger l_{ p } values or datasets of more than 16,000 βsheets due to the memory consumption of the SING and GraphGrepSX implementations. We therefore only reported results for datasets up to and including N = 16,000. Accuracies of each method were not measured since each query matches at least one βsheet and any nonmatching βsheet is excluded at the filtering and verification stages of each method.
Indexing
The elapsed times and disk space required for index construction are shown in Table1.
BetaSearch recorded the fastest indexing times with a 1.9 times speedup over the next fastest method (GraphGrepSX,l_{ p }=4) for the N = 16,000 dataset. The size of the BetaSearch indices were similar to those of SING,l_{ p }=4 since trimers have an effective path length of l_{ p }=3 and both use hash tables.
The l_{ p }=10 variants of SING and GraphGrepSX were slower than their l_{ p }=4 variants due to the increase in the number of features generated in the former case. This observation was consistent with those of general graphs[23].
GraphGrepSX,l_{ p }=4 generated the smallest indices by a considerable margin through the use of a suffix trees to store its indices. However, the results obtained in the following sections show that the reduction in index disk space came at a significant cost to the querying time.
Furthermore, the BetaSearch index is limited only by the size of the hard disk on which it is stored, whereas the implementations of SING and GraphGrepSX used in this study were memorylimited, requiring the entire index to be loaded into memory in order for queries to be performed.
Overall query times
The query time for a single query was calculated as the sum of its filtering and verification times. The time required to perform all the queries on a dataset was measured as the sum of its individual query times, shown in Table2. These results show that BetaSearch consistently recorded the fastest querying times for all datasets by at least an order of magnitude over the next fastest method (SING,l_{ p }=10) and a 109 times speedup over the baseline (GraphGrepSX,l_{ p }=4) for the N = 16,000 dataset.
The tradeoff between the index disk size and querying times within the SING and GraphGrepSX variants can be seen in these results where the l_{ p }=10 variants required four to five times as much disk space but were at least twice as fast as the l_{ p }=4 variants.
The overall query time speedups for all query sizes were measured using the GraphGrepSX,l_{ p }=4 as the baseline, shown in Figure2. Only the speedups for the N = 2,000 and 16,000 datasets were shown for the purposes of brevity. The speedups of each method generally tapers down after queries of approximately six to seven edges. This is an expected observation because larger βsheet subgraph queries are more specific than smaller ones, resulting in fewer possible candidates and therefore a reduced filtering and verification load.
Filtering
The filtering time was calculated as the time required to perform filtering for all the queries of a given dataset. The precision was calculated as the total number of actual query matches divided by the total number of filtered candidates for all the queries of a given dataset. The filtering results are shown in Table3.
In contrast to their indexing performances, the l_{ p }=10 variants of SING and GraphGrepSX generally outperformed their l_{ p } = 4 variants. A larger l_{ p }value has more specificity and therefore results in fewer numbers of filtered candidates than a small l_{ p } value, reducing the verification load. The BetaSearch and l_{ p } = 10 precision values were consistently near 1.0 for all datasets and query sizes. The precision of the l_{ p } = 4 variants were considerably lower than the l_{ p } = 10 variants due to the aforementioned specificity limitations of smaller path lengths.
Verification
We measured the mean verification time as the total verification time for a dataset divided by the total number of filtered candidates for a dataset. The mean verification times were less than a second due to the relatively small query graphs involved in this study. The speedups of each method were measured using the GraphGrepSX,l_{ p }=4 times as the baseline and are shown in Figure2.
BetaSearch consistently recorded the fastest verifications across all query sizes and datasets, this is because the BetaSearch verification algorithm runs in quadratic time whereas the VF2 algorithm employed by SING and GraphGrepSX was designed for general graphs and has a potential nonpolynomial time complexity[21]. The largest speedup by BetaSearch was achieved for queries with two edges, since these queries equated to individual trimers, there was no need for candidates to be verified.
Comparisons with 3D substructure search methods
We have compared BetaSearch with LabelHash and SHEBA since they each perform residuelevel matching. SHEBA was shown to be amongst the most accurate substructure search methods in recent work[16], however, LabelHash has yet to be evaluated against other methods. LabelHash and SHEBA were run using default search parameters. Comparisons with DaliLite were unable to be performed due to the majority of our queries and βsheets not meeting the minimum number of residues required by DaliLite. DaliLite was shown to have accuracies comparable to SHEBA but with considerably longer compute times[16].
Figure3A shows the F_{1}scores computed across all query sizes for each method. Exact matches for each method were considered to be those with p^{′}=1 for LabelHash and m=1 for SHEBA. We also computed F_{1}at p^{′}≥0.999, however, the F_{1} at m≥0.999 was identical to that of m=1 so we instead computed F_{1}at m≥0.95. BetaSearch, by virtue of inherent exact matching, produces unranked hits and consequently produces an F_{1} score of 1.0 for the entire query set.
LabelHash at p^{′}≥0.999 clearly outperforms SHEBA on all query sizes, however, neither method performed particularly well on queries of 10 residues or less with the worst F_{1}scores observed for queries of 4–5 residues, which have the largest number of hits amongst all query sizes (see Additional file1: Figure S1A). Although, once the queries reach sizes of 25 residues, LabelHash maintained F_{1} scores of at least 0.9 since the number of possible hits closely approaches the number queries (see Additional file1: Figure S1B).
We measured the CPU times of each method and computed the speedups over SHEBA. The wallclock times were also measured but were omitted since they were analogous to the CPU times. The CPU times of each method for the ASTRAL95 query set were measured as follows:

SHEBA – 239 h 25 m

LabelHash – 33 h 17 m

BetaSearch – 0 h 59 m
Figure3B shows the speedups at each query size. BetaSearch achieved total speedups of 240 times over SHEBA and 33 times over LabelHash. The largest speedups of BetaSearch were obtained for queries of 4–15 residues, which are the sizes of commonly studied motifs[41]. The improved performance of BetaSearch and LabelHash over SHEBA can be attributed to their use of indices which removes the need to perform exhaustive pairwise comparisons for each query against the dataset. This naive approach to substructural searching can lead to query sets taking days to complete[16, 40].
Case Studies
βresidue motifs can contribute both to the structural and functional features of a protein. For researchers who study protein structure, BetaSearch can be a useful tool for surveying particular βsheet configurations across known protein structures. The frequency of a particular motif may give an indication of its relative stability as a βsheet structural element. Researchers who study functional aspects of βsheets can use BetaSearch to identify similar motifs in unrelated proteins, as we demonstrate with the biotinbinding pockets of avidin and streptavidin. BetaSearch is fast with a simple, intuitive search query context that allows the researcher to efficiently make comparisons against known βsheets.
A typical BetaSearch workflow involves the researcher (i) inspecting a protein structure for a βsheet of interest, (ii) identifying a specific βresidue motif, and (iii) manually entering the amino acids in the corresponding βmatrix into BetaSearch. Alternatively, this workflow can be automated, allowing BetaSearch to be used in a data mining or knowledgediscovery capacity which potentially allows interesting relationships between specific amino acid configurations and protein structures or functions, which would not be intuitively revealed by manual trialanderror querying.
To demonstrate the capabilities and potential usecases of BetaSearch we present the results of three case studies. These were drawn from realworld examples and illustrate the role βresidue motifs play in the structure and function of proteins. We also provide the matches from comparative queries using BLAST to demonstrate the difference in matches between a conventional sequencebased homology search and BetaSearch (see Additional files2,3, and4).
Case Study 1  Synthetic motifs in the Top7 protein
Top7 [PDB:1QYS] is the only engineered protein (nonhypothetical) not to be derived from the sequence or structure of any other protein[42, 43]. Most notably, it adopts a unique fold that has yet to be observed in nature. Its structure consists of an amphipathic βsheet and two αhelices. Inspection of the βsheet revealed a repeating βresidue motif (Figure4A). Using this as a query, we wanted to discover known protein structures that possessed this putative synthetic motif. BetaSearch was used to query the PDB2011 dataset, which revealed matches only in structures of the CharcotLeyden crystal (CLC) protein [PDB:1G86,1HDK,1LCL,1QKQ].
A structural alignment of Top7 and CLC around the matching regions of the query motif (Figure4B) shows remarkably, that the βstrand topology and sidechain directions are nearly identical. This does not suggest a homology between the two proteins because Top7 is entirely synthetic. However, our findings demonstrate that the RosettaDesign[44] approach used to engineer Top7 had inadvertently reproduced a known stable βresidue motif ab initio. The CLC protein was not found in a BLAST query of Top7 chain A (see Additional file2).
Case Study 2  Biotinbinding domains
Streptavidin [PDB:1STP] and avidin [PDB:1VYO] are structurally and functionally similar homologous proteins that bind strongly to biotin despite having a sequence identity of less than 35%. Both proteins consist of eight antiparallel βstrands that fold into a βbarrel, inside of which forms a highly conserved biotinbinding site. The βresidue motifs that line this highly specific site are shown in Figures5A and5B. When the residues on the nonbinding face of the βsheet are ignored, the two motifs are differentiated by only a single residue:${\mathrm{W}}_{92}^{1\mathrm{STP}}\iff {\mathrm{F}}_{79}^{1\mathrm{VYO}}$. The results from the corresponding BLAST query are shown in Additional file3.
We have characterised these biotinbinding sites as a minimal, βresidue motif (Figure5C). This putative motif is evolutionary conserved between the avidins and has not yet been shown to be recurrent in evolutionarily distant proteins. Using this query, BetaSearch not only identifies the structures of avidin and streptavidin, but also xenavidin—a biotinbinding protein from Xenopus tropicalis (frog). A number of seemingly unrelated proteins were also matched including uncharacterised proteins from Roseovarius nubinhibens [PDB:3BVC] and Oceanicola granulosus [PDB:2RG4]; and the human complement protein C8 gamma [PDB:1IW2]. The complete set of matching βsheets is listed in Additional file4.
Inspection of the uncharacterised proteins reveal a similar arrangement of residues to the knownbiotin binding proteins but with less room for the ligand to bind. More tantalising is the match with the gamma subunit of human C8 which is a crucial component of the cytolytic membrane attack complex (MAC)[45]. This subunit has a characteristic lipocalin fold with a distinctive binding pocket similar to the avidins, however, the ligand target of C8 gamma remains unknown[45]. Based on the spatial similarities of this binding pocket with the biotinbinding sites of avidin, one may suggest that these proteins could have an affinity for biotin or a biotinlike compounds.
These results demonstrate that a relatively small βresidue motif query can be matched in unrelated proteins. This capability can be particularly useful in characterising proteins of unknown function by similarities in βresidue motifs to those of known function.
Case Study 3  βstrand pairing prediction
One of the unsolved problems of tertiary structure prediction is the ability to predict the pairs of βstrands which are hydrogen bonded, and therefore adjacent, in a βsheet[46]. Information about adjacent βstrands can be used to determine the overall topology of a βsheet. A number of βsheet topology prediction algorithms exist that are based on wellknown machine learning methods[46–49].
We used BetaSearch to predict the βstrand pairings of the fivestranded βsheet found in chain A of csrc tyrosine kinase [PDB: 1A09]. This βsheet contains five strands, is nonbarreled, nonbifurcated, and therefore has four native strand pairings. A score for each possible strand pair was computed as a function of the number of hits, in a BetaSearch index, obtained for each interstrand 4mer query. The top four pairs ranked by pairing scores were considered as predictions. Strand pairing scores were computed for parallel and antiparallel orientations, as shown in Table4. These results demonstrated that each of the native strand pairs were correctly predicted. The procedures used to perform these predictions are described in the Methods section. Our mechanisms for strand pair scoring are by no means a definitive solution to βsheet topology prediction. They can, however, be used in existing algorithms such as BetaPro[47] which require preliminary βstrand or βresidue pairing scores in order for predictions to be made. A large scale evaluation of our BetaSearchbased prediction method is the topic of future work.
Conclusion
We have described a method for indexing and querying βresidue motifs, called BetaSearch, that is at least an order of magnitude faster than stateoftheart graph indexing methods. These speedups are achieved by indexing βsheets as 2D matrices of amino acids known as βmatrices. This representation leverages the inherent planar structural constraints of βsheets, thereby avoiding much of the computational complexity involved in querying and indexing 3D or graph representations of protein structures. BetaSearch is therefore able to achieve quadratictime querying. Filtering precisions were close to 1.0 for all datasets and query sizes, resulting in near minimal verification time.
When compared with existing 3D substructure search methods, BetaSearch achieves a 240 times speedup over the baseline (SHEBA) and a 33 times speedup over the next fastest method (LabelHash). The demonstrated efficiency of BetaSearch lends itself well to the rapid exploration of probable motif or βsheet conformations in a matter of minutes, rather than days or weeks with 3Dbased methods. Furthermore, the ability of BetaSearch to perform exact matching ensures that correct hits are not missed.
Our three case studies demonstrated the utility of BetaSearch in biological contexts. We discovered that the synthetic Top7 protein shares an identical βresidue motif with a known naturallyoccurring protein—the CharcotLeyden crystal. A small query derived from the biotinbinding motif of avidins easily identified unrelated biotinbinding proteins and is suggestive of biotinbinding in others including the gamma subunit of the human C8 complement protein. BetaSearch, with its ability to identify functional similarity from unrelated proteins can potentially help characterise the proteins in the PDB with unknown function. We also demonstrated how BetaSearch could be used to predict strand pairing in βsheets, which could help reduce the search space of more complex supersecondary or tertiary structure prediction tasks. Although our work has focused on substructural motifs in βsheets, our algorithm can be modified to perform querying of any substructural motif involving pairwise interactions, such as the wellcharacterised hydrogenbond pairings in helices and turns. Indeed, this is an avenue of development we are currently exploring.
It is our intention for BetaSearch to be used by protein researchers to supplement conventional sequence and structural search methods. For example, the efficiency of the BetaSearch filtering and verification algorithms introduces the possibility for their use as a rapid “firstpass” filter to improve the querying performance of other methods. Such an application would be nontrivial to develop but could potentially reduce conventional structural query times from hours to minutes.
Findings
The pseudocode for each of the algorithms described in this section is provided in the Supplementary Materials.
Trimers
A trimer is a path of three amino acids in a βmatrix configured in the shape of an ‘L’ (an Ltrimer), vertically in the same column (a Vtrimer), or horizontally in the same row (an Htrimer). Trimers are the features by which βmatrices are indexed in BetaSearch. An example of the trimer extraction process is shown in Additional file1: Figure S2.
A trimer t has a number of attributes that encode its configuration and location within a βmatrix:

t.seq: a three letter string of residues spanned by the trimer where
$$\begin{array}{l}\phantom{\rule{2em}{0ex}}\text{if}\phantom{\rule{2.77695pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}=\text{\u201cabc\u201d}\\ \phantom{\rule{0.3em}{0ex}}\text{then}\phantom{\rule{2.77695pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]\to \text{\u201ca\u201d},\phantom{\rule{0.3em}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[1\right]\to \text{\u201cb\u201d},\phantom{\rule{0.3em}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\to \text{\u201cc\u201d}.\end{array}$$(4) 
t.class: an integer representing the class of the trimer, defined as
$$\begin{array}{l}\mathrm{t.}\mathrm{\text{class}}=\left\{\begin{array}{l}1\phantom{\rule{1em}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is an Ltrimer}\\ 3\phantom{\rule{1em}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is a Vtrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]\ne \mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 5\phantom{\rule{1em}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is an Htrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]\ne \mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 15\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is a Vtrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]=\mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 31\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\phantom{\rule{2.56865pt}{0ex}}\text{if}\phantom{\rule{2.56865pt}{0ex}}t\phantom{\rule{2.56865pt}{0ex}}\text{is an Htrimer and}\phantom{\rule{2.56865pt}{0ex}}\mathrm{t.}\mathrm{\text{seq}}\left[0\right]=\mathrm{t.}\mathrm{\text{seq}}\left[2\right].\end{array}\right.\end{array}$$(5) 
t.id: a (t.class,t.seq) tuple.

t.orient: an integer value such that t.orient$\in \left\{0,1,2,3\right.\}$. These values describe the possible orientations of a trimer and were chosen to allow the calculation of x and yaxis trimer reflections using the bitwiseXOR (‘⊕’) operator. Orientation reflections are calculated as
$$\begin{array}{c}{t}^{\prime}.\mathrm{\text{orient}}=\left\{\begin{array}{cc}\mathrm{t.}\mathrm{\text{orient}}\oplus 1& \phantom{\rule{2.77695pt}{0ex}}\text{if reflected in the yaxis}\\ \mathrm{t.}\mathrm{\text{orient}}\oplus 2& \phantom{\rule{2.77695pt}{0ex}}\text{if reflected in the xaxis}\end{array}\right.\end{array}$$(6)where t’ is the reflection of t. Additional file1: Figure S2 shows how trimer orientations are determined for each trimer class.
t.eqorients: an integer that encodes the equivalent orientations of t.orient, defined as
$$\begin{array}{l}\mathit{\text{t.}}\mathrm{\text{eqorients}}=\left\{\begin{array}{l}{2}^{\mathit{\text{t.}}\mathrm{\text{orient}}}\phantom{\rule{2.36043pt}{0ex}}\phantom{\rule{2.36043pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=1\\ 3\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{2.36043pt}{0ex}}\phantom{\rule{9pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=3\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]<\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\\ 12\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{6pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=3\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]>\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\\ 5\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{2.36043pt}{0ex}}\phantom{\rule{9pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=5\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]<\mathrm{t.}\mathrm{\text{seq}}\left[2\right]\\ 10\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{6pt}{0ex}}\text{if}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{class}}=5\phantom{\rule{2.36043pt}{0ex}}\text{and}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{seq}}\left[0\right]>\mathit{\text{t.}}\mathrm{\text{seq}}\left[2\right]\\ 15\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{6pt}{0ex}}\text{otherwise,}\end{array}\right.\end{array}$$(7)such that
$$\begin{array}{l}\mathit{\text{t.}}\mathrm{\text{eqorients}}\phantom{\rule{2.36043pt}{0ex}}\text{\&}\phantom{\rule{2.36043pt}{0ex}}i=\left\{\begin{array}{l}1\phantom{\rule{5pt}{0ex}}\text{if orientation}\phantom{\rule{2.36043pt}{0ex}}i\phantom{\rule{2.36043pt}{0ex}}\text{is equivalent to}\phantom{\rule{2.36043pt}{0ex}}\mathit{\text{t.}}\mathrm{\text{orient}}\\ 0\phantom{\rule{5pt}{0ex}}\text{otherwise,}\end{array}\right.\end{array}$$(8)where ‘<’ and ‘>’ are the lexicographic lessthan and greaterthan operators; and ‘&’ is the bitwiseAND operator. The equivalent orientations of a trimer are encoded using bitmasks such that orientation i is equivalent to t.ORIENT if the i^{th}bit of t.class is set to 1. The t.class value for each type of trimer is an encoding of the minimum t.class value for the class.

t.row: the row coordinate of t.seq[1] within its βmatrix.

t.col: the column coordinate of t.seq[1] within its βmatrix.

t.coord: a (t.row,t.col) tuple.

t.span1, t.span2: the row (t.rowspan) or column spans (t.colspan) of a trimer, depending on the trimer class. Each span is an ordered tuple (i,j) where i is the coordinate of t.seq[1] and j is the coordinate of either t.seq[0] or t.seq[2], depending on the trimer type. Ltrimers have one row span and one column span, Vtrimers have two row spans, and Htrimers have two column spans. Examples of the spans for each trimer are shown in Additional file1: Figure S3.
Index construction
BetaSearch uses three indices:$\mathcal{D}$,$\mathcal{R}$, and$\mathcal{C}$.

$\mathcal{D}$ is an inverted index that maps each trimer id to the set of βmatrices in which they are contained, defined as
$$\mathcal{D}\left[\mathrm{id}\right]\mapsto \{b\in B:\mathrm{id}\in \mathrm{b.}\mathrm{\text{trimerids}}\}$$(9)where B is the set of βmatrices in the dataset.

$\mathcal{R}$ maps a compound key${\kappa}_{\mathcal{R}}$ to a trimer t, defined by
$$\begin{array}{l}\phantom{\rule{40pt}{0ex}}\mathcal{R}\left[{\kappa}_{\mathcal{R}}\right]\mapsto t\\ \phantom{\rule{10pt}{0ex}}\text{where}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\kappa}_{\mathcal{R}}=\left(\mathrm{t.}\mathrm{\text{matrixid}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{id}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{eqorients}},\phantom{\rule{0.3em}{0ex}}\right.\\ \phantom{\rule{60pt}{0ex}}\left(\right)close=")">\mathrm{t.}\mathrm{\text{coord}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{rowspan}}\end{array}$$(10)such that class∉{3,15}.

$\mathcal{C}$ maps a compound key${\kappa}_{\mathcal{C}}$ to a trimer t, defined by
$$\begin{array}{l}\phantom{\rule{40pt}{0ex}}\mathcal{C}\left[{\kappa}_{\mathcal{C}}\right]\mapsto t\\ \phantom{\rule{10pt}{0ex}}\text{where}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\kappa}_{\mathcal{C}}=\left(\mathrm{t.}\mathrm{\text{matrixid}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{id}},\phantom{\rule{0.3em}{0ex}}\mathrm{t.}\mathrm{\text{eqorients}},\phantom{\rule{0.3em}{0ex}}\right.\\ \phantom{\rule{60pt}{0ex}}\left(\right)close=")">\mathit{\text{t.}}\mathrm{\text{coord}},\phantom{\rule{0.3em}{0ex}}\mathit{\text{t.}}\mathrm{\text{colspan}}\end{array}$$(11)
such that t.class∉{5,31}.
Ltrimers are indexed in$\mathcal{R}$ and$\mathcal{C}$; whereas Htrimers are indexed only in$\mathcal{C}$ because they do not contain any row spans, conversely, Vtrimers are indexed only in$\mathcal{R}$ because they do not contain any column spans. The BuildIndices procedure in Additional file1: Algorithm S1 describes the index construction algorithm.
Time complexity
Each entry in a βmatrix is the intersection of at most six trimers: four Ltrimers (one in each of the four orientations), one Vtrimer, and one Htrimer. BuildIndices runs in O(6mn) time where m is the maximum number of residues in a βmatrix and n is the number of βmatrices.
Filtering
A query is a wellformed βmatrix Q. Preprocessing Q requires the following steps:

1.
Enumerate the query trimers and storing them in Q.trimers.

2.
Store the corresponding trimer IDs in Q.trimerids.
Partially matching candidates are obtained by pruning the βmatrices in$\mathcal{D}$ using two filters. The FirstFilter procedure, described in Additional file1: Algorithm S2, prunes the βmatrices in$\mathcal{D}$ that do not contain the entire set of trimer IDs in Q.trimerids.
βmatrices are indexed in a single arbitrary orientation, therefore a procedure for comparing a query against all orientations of a candidate is required. The naive approach enumerates and stores the x and yaxes reflections of each βmatrix, effectively tripling the index size. Alternatively, the x and yaxes reflections of the query are enumerated and compared with a candidate, effectively tripling the filtering time.
We have developed an algorithm that prunes invalid candidates without enumerating reflections of the candidate or the query. The algorithm is implemented as the SecondFilter procedure described in Additional file1: Algorithm S3, where only the candidates congruent to Q are retained from C_{1}. A query Q is congruent to a candidate c if
The Congruent procedure defined in Additional file1: Algorithm S3 implements Equation 5 using bitwise operations that enable constanttime set unions and intersections.
Time complexity
The FirstFilter algorithm runs in$O\left(\left{p}_{min}\right\xb7\right.$Q. trimerids) time where$\left{p}_{min}\right$ is the cardinality of the smallest postings set and Q. trimerids is the number of unique trimer ids in the query. A postings set is a set in the index of βmatrices containing a particular query trimer. The SecondFilter algorithm runs in$O\left(\left\mathrm{Q.}\mathrm{\text{trimerids}}\right\xb7\left{C}_{1}\right\right)$ time where C_{1} is the number of candidates returned by FirstFilter.
Verification
Most graph indexing methods use the VF2[20] or Ullmann[19] algorithms for candidate verification. Other methods use algorithms optimised for the data structures of their features and indices.
Constraining βgraphs to the simpler structures of βmatrices has enabled us to develop a quadratic time candidate verification algorithm that does not rely on subgraph isomorphism tests.
A graph G of the query is constructed, in which each vertex is a trimer q∈Q. trimers and each edge e=$\left({q}_{\mathrm{src}},{q}_{\mathrm{des}}\right)$ indicates a span overlap (i.e. t.colspan or t.rowspan) between adjacent query trimers q_{src} and q_{des}. The algorithm to construct a query graph is implemented in the MakeQueryGraph procedure described in Additional file1: Algorithm S4.
The remainder of the verification algorithm attempts to find a subgraph of a candidate c that matches G by matching each pair$\left({q}_{\mathrm{src}},{q}_{\mathrm{des}}\right)$ to a pair$\left({t}_{\mathrm{src}},{t}_{\mathrm{des}}\right)\in \mathrm{c.}\mathrm{\text{trimers}}$, where a match between pairs is defined by
and a match between a query Q and a candidate c occurs if
The keys by which trimers are indexed in$\mathcal{R}$ and$\mathcal{C}$ contain their locations and geometric configurations within a candidate, allowing MatchPairs to be tested in constanttime. MatchPairs is implemented using the procedures defined in Additional file1: Algorithm S8. The VerifyCandidate procedure in Additional file1: Algorithm S6 describes our algorithm for verifying a single candidate.
Time complexity
The MakeQueryGraph procedure runs in$O\left(\left\mathrm{Q.}\mathrm{\text{trimers}}\right\right)$ time where Q. trimers is the number of trimers in the query. The VerifyCandidate procedure runs in$O\left({\left\mathrm{Q.}\mathrm{\text{trimers}}\right}^{2}\right)$ time, which is called by the Verify procedure in Additional file1: Algorithm S7 to verify each filtered candidate in C_{2}. Therefore, the overall time complexity of our candidate verification algorithm is$O\left(\left\mathrm{Q.}\mathrm{\text{trimers}}\right+\left{C}_{2}\right\xb7{\left\mathrm{Q.}\mathrm{\text{trimers}}\right}^{2}\right)$.
βstrand pairing prediction
For Case Study 3, we constructed a BetaSearch index from the ASTRAL95 dataset, as per the 3D substructure search comparisons. The secondary structures for the βsheet were using DSSP. Each βstrand sequence was delineated as contiguous substrings of “E” secondary structure assignments. Scores for each possible βstrand pairing (i,j) were computed as follows:

1.
Let $\mathcal{I}$ be the index generated from the ASTRAL95 dataset.

2.
Let S ^{P}and S ^{A}be the strand pairing score matrices for the parallel and antiparallel strand pairs, respectively.

3.
Let M be a matrix where M _{ ij }contains the number of 4mers between strands i and j. We define a 4mer as a single occurrence of twoconsecutive bridge pairings. For example, the βmatrix $\left(\begin{array}{ccc}\mathrm{A}& \mathrm{B}& \mathrm{C}\\ \mathrm{D}& \mathrm{E}& \mathrm{F}\end{array}\right)$ has the 4mers – $\left(\begin{array}{cc}\mathrm{A}& \mathrm{B}\\ \mathrm{D}& \mathrm{E}\end{array}\right)$ and $\left(\begin{array}{cc}\mathrm{B}& \mathrm{C}\\ \mathrm{E}& \mathrm{F}\end{array}\right)$.

4.
For each strand pair (i,j):

(a)
For each alignment a of j on i:

(i)
For each 4mer m of a:

(A)
Let h _{ P }and h _{ A }be the number of hits returned from querying $\mathcal{I}$ for the parallel and antiparallel orientations of m, respectively.Set ${S}_{\mathit{\text{ij}}}^{P}\leftarrow {S}_{\mathit{\text{ij}}}^{P}+{h}_{P}$Set ${S}_{\mathit{\text{ij}}}^{A}\leftarrow {S}_{\mathit{\text{ij}}}^{A}+{h}_{A}$Set ${M}_{\mathit{\text{ij}}}\leftarrow {M}_{\mathit{\text{ij}}}+1$

(A)

(i)

(b)
Let $\mathrm{seq}\text{\_}\mathrm{sep}\left(i,j\right)$ be the number of residues separating strands i and j.Set ${S}_{\mathit{\text{ij}}}^{P}\leftarrow \frac{{S}_{\mathit{\text{ij}}}^{P}}{{M}_{\mathit{\text{ij}}}}log\left[\mathrm{seq}\text{\_}\mathrm{sep}\left(i,j\right)\right]$Set ${S}_{\mathit{\text{ij}}}^{A}\leftarrow \frac{{S}_{\mathit{\text{ij}}}^{A}}{{M}_{\mathit{\text{ij}}}}log\left[\text{seq\_sep}\left(i,j\right)\right]$

(a)
The total score for a strand pair was defined as
Evaluation
The graph indexing methods were evaluated according to their filtering, verification, and overall query times. The sizes of the indices generated by each method were measured using the POSIX “stat” command. The filtering precision was calculated as the total number of hits divided by the total number of filtered candidates for all queries of a given dataset size.
A “hit” for a query against a βsheet occurs when it contains (or is) an exact match of the query structure. Unlike BetaSearch, LabelHash and SHEBA perform approximate matching and return a list of hits ranked by a structural similarity score. For LabelHash, this is a statisticallydetermined pvalue where a low value indicates a close match. SHEBA ranks hits according to the mvalue which is the number of aligned residues between a query and a result, divided by the number of residues in the larger of the two structures. For simplicity, we denote the LabelHash score for a hit as p^{′}=1−p. These scores need to be thresholded in order to obtain exact matches so that all hits with scores below a given threshold are ignored and those above the threshold are assigned the same rank. Once a suitable hit score threshold is chosen, the querying accuracy of each method can be computed by counting a hit as a true positive (TP) if the query is exactly matched within a βsheet, a false positive (FP) if it is not, or a false negative (FN) if a correct βsheet hit is absent from the list of results. We can then calculate the recall, as
which denotes the proportion of exactlymatching βsheets out of all correct βsheet hits. and the precision, as
which denotes the proportion of correct βsheet hits in a list of hits. These two measures are commonly used to evaluate conventional document retrieval systems[24], as well as protein structural search methods[16]. The Fscore, defined as
is the harmonic mean of the precision and recall values. It provides a convenient way of evaluating the query precision and recall as a single value. The β parameter allows emphasis to be placed on precision or recall depending on the query performance goals. We used the F_{1} score in our 3D substructure search comparisons as a measure of query accuracy.
Datasets
For the graph indexing comparisons, the January 3, 2011 (PDB2011) snapshot of the PDB was used to generate a dataset of 209,127 βsheets. A number of PDB files were excluded due to discrepancies in their content[50]. βsheets exhibiting poor planarity, such as those with significantly pronounced twisting or curvature, were also excluded.
The DSSP[51] algorithm was used to assign secondary structures to each PDB file. Residues with a secondary structural assignment of “E” or “b” were considered to form part of a βstrand. DSSP also assigns bridge partner relationships between residues on adjacent βstrands, which were used to determine the bridge edges in βgraphs.
We used ProOrigami[52] to generate βgraphs and a topological sort was used to generate the βmatrix from the peptide and bridge edges of each βgraph.
Subsets containing 1,000; 2,000; 4,000; 8,000; and 16,000 βsheets were randomly selected from the dataset. These sizes were used in previous benchmarks[23, 27]. GraphGrepSX and SING were unable to be run on datasets of 32,000 or more due to the memory consumption of their respective implementations, we therefore restricted the sizes of our datasets accordingly.
Each βgraph was preprocessed by inserting “dummy” nodes in place of a labelled edge. Each dummy node was labelled with either a “b” to denote a bridge edge or a “z” to denote a peptide edge in order to avoid conflict with the labels of residue nodes. Preprocessing was required because edgelabelled graphs were not supported by SING or GraphGrepSX.
Queries were generated in the same manner as in previous benchmarks[23]. A query was created from each βgraph by randomly selecting a root node and performing a random breadthfirst traversal until the query obtained the degrees d such that 2≤d≤10. Queries were not generated from βgraphs with insufficient edges. Each query contained a single wildcard node that matched any amino acid. To enable wildcard matching on all methods, each query was repeated by replacing the wildcard node with each of the 20 amino acids. The βmatrices for each query were generated for use by BetaSearch. The total numbers of queries generated for each query size are shown in Additional file1: Table S1.
For the case studies, we generated the required indices from all the βmatrices in the PDB2011 dataset. Each βmatrix was assigned a sheet identifier of the format “Â¡PDB ID¿Â¡chain ID¿_SHEET_Â¡number¿”. For example, the sheet ID of the first βmatrix in chain A of ubiquitin [PDB:1UBQ] is “1UBQA_SHEET_000”.
For the 3D substructure search comparisons, the ASTRAL SCOP 1.75A 95% sequence identity nonredundant dataset[53] of protein structures (ASTRAL95) were used. βsheets were extracted and filtered as per the graph indexing datasets and a total of 29,341 βsheets were obtained. A subset of 26,669 βsheets containing between 4 and 50 residues, inclusive, were used as queries. The βmatrices corresponding to each βsheet were generated and used with BetaSearch.
Implementation
The graph indexing comparisons were performed on a 2.66 Ghz Intel Nehalem 8core processor with 48 GB of main memory running CentOS. The source code to SING and GraphGrephSX were provided by the authors of their respective publications. All methods were implemented in C++, compiled using g++ version 4.3, and depend on the Boost C++ version 1.42.0 libraries[54]. BetaSearch additionally requires Redis[55] version 2.0.4 and the official Redis C headers[56].
The indices in SING and GraphGrepSX were implemented as modified C++ STL “std::map” containers in the memory spaces of their respective processes. In contrast, BetaSearch stores its indices using Redis hash tables that are stored in (diskbased) virtual memory and operates external to BetaSearch as a concurrent process. It is therefore subject to interprocess communication overhead during filtering and indexing, which are included in our experimental timings.
The 3D substructure search comparisons were performed on the same platform with 8 GB of main memory. BetaSearch was reimplemented in Python 2.7 using the Whoosh Python Search Library[57]. LabelHash 1.0.2[58] and SHEBA 3.1.1[59] were downloaded and used. The LabelHash index was built from the PDB coordinates of all the ASTRAL95 βsheets, which were extracted from their original structures using ProDy[60]. The BetaSearch index was built from the βmatrix representations of each βsheet.
Structural renderings and alignments
We used PyMOL[61] to generate 3D renderings and structural alignments of proteins.
Availability and requirements

Project name: BetaSearch

Project homepage: http://www.csse.unimelb.edu.au/∼hohkhkh1/betasearch

Operating system(s): Ubuntu Linux 11.10+ (http://www.ubuntu.com)

Programming language(s): Python

Other requirements: A complete listing of Python module dependencies is provided on the project homepage.

License: None

Any restrictions to use by nonacademics: BetaSearch can be used freeofcharge by nonacademics, provided appropriate citation and credit is given to the authors of this publication.
Funding
This work was supported by a Victorian Life Sciences Computation Initiative (VLSCI) [grant number VR0127] on its Peak Computing Facility at the University of Melbourne, an initiative by the Victorian Government. HH is supported by a NICTA PhD scholarship. NICTA (National ICT Australia) is funded by the Australian Government’s Department of Communications; Information Technology and the Arts; Australian Research Council through Backing Australia’s Ability; ICT Centre of Excellence programs.
References
 1.
Kessel A, BenTal N: Introduction to proteins: structure, function, and motion. 2010, CRC Press, London
 2.
Zaremba SM, Gregoret LM: Contextdependence of amino acid residue pairing in antiparallel βsheets. J Mol Biol. 1999, 291: 463479. 10.1006/jmbi.1999.2961.
 3.
Parisien M, Major F: Ranking the factors that contribute to protein βsheet folding. Proteins. 2007, 68: 824829. 10.1002/prot.21475.
 4.
Wathen B, Jia Z: Folding by numbers: primary sequence statistics and their use in studying protein folding. Int J Mol Sci. 2009, 10: 15671589. 10.3390/ijms10041567.
 5.
Hubbard TJP: Use of βstrand interaction pseudopotentials in protein structure prediction and modelling. Proceedings of the 27th Hawaii International Conference on System Sciences. 1994, 336344.
 6.
Zhu H, Braun W: Sequence specificity, statistical potentials, and threedimensional structure prediction with selfcorrecting distance geometry calculations of βsheet formation in proteins. Prot Sci. 1999, 8: 326342.
 7.
Steward RE, Thornton JM: Prediction of strand pairing in antiparallel and parallel βSheets using information theory. Proteins. 2002, 48: 178191. 10.1002/prot.10152.
 8.
Rajgaria R, Wei Y, Floudas CA: Contact prediction for beta and alphabeta proteins using integer linear optimization and its impact on the first principles 3D structure prediction method ASTROFOLD. Proteins. 2010, 78: 18251846.
 9.
Bork P, Koonin E: Protein sequence motifs. Curr Opin Struct Biol. 1996, 6: 366376. 10.1016/S0959440X(96)800571.
 10.
Berman HM, Westbrook J, Fend Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235242. 10.1093/nar/28.1.235.
 11.
Bella J, Hindle KL, McEwan PA, Lovell SC: The leucinerich repeat structure. Cell Mol Life Sci. 2008, 65: 23072333. 10.1007/s0001800880190.
 12.
Liou YC, Tocilij A, Davies PL, Jia Z: Mimicry of ice structure by surface hydroxyls and water of a βhelix antifreeze protein. Nature. 2000, 406: 322324. 10.1038/35018604.
 13.
Makabe K, McElheny D, Tereshko V, Hilyard A, Gawlak G, Yan S, Koide A, Koide S: Atomic structures of peptide selfassembly mimics. Proc Natl Acad Sci USA. 2006, 103: 1775317758. 10.1073/pnas.0606690103.
 14.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 33893402. 10.1093/nar/25.17.3389.
 15.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: ClustalW and ClustalX version 2. Bioinformatics. 2007, 23: 29472948. 10.1093/bioinformatics/btm404.
 16.
Stivala A, Wirth A, Stuckey PJ: Fast and accurate protein substructure searching with simulated annealing and GPUs. BMC Bioinformatics. 2010, 11:
 17.
Parisien M: Les feullets beta dans les protéines. Annotation, comparaison et construction. Master’s thesis. 2005, Université de Montréal
 18.
Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S: Network analysis of protein structures identifies functional residues. J Mol Biol. 2004, 344: 11351146. 10.1016/j.jmb.2004.10.055.
 19.
Ullmann JR: An algorithm for subgraph isomorphism. J ACM. 1976, 23: 3142. 10.1145/321921.321925.
 20.
Cordella LP, Foggia P, Sansone C, Vento M: A (sub)graph isomorphism algorithm for matching large graphs. IEEE T Pattern Anal. 2004, 10: 13671372.
 21.
Zampelli S: A constraint programming approach to subgraph isomorphism. PhD thesis. 2008, Université catholique de Louvain
 22.
Cook SA: The complexity of theoremproving procedures. Proceedings of the 3rd ACM Symposium on Theory of Computing. 1971, 151158.
 23.
Bonnici V, Ferro A, Giugno R, Pulvirenti A, Shasha D: Enhancing graph database indexing by suffix tree structure. Pattern Recognition in Bioinformatics, Volume 6282 of Lecture Notes in Computer Science. 2010, Springer, 195203.
 24.
Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval. 2008, Cambridge University Press, Cambridge
 25.
Giugno R, Shasha D: GraphGrep: a fast and universal method for querying graphs. Proceedings of the 16th International Conference on Pattern Recognition, 2002, Volume 2. 2002, 112115.
 26.
Ferro A, Giugno R, Mongiovi M, Pulvirenti A, Skripin D, Shasha D: GraphFind: enhancing graph searching by low support data mining. BMC Bioinformatics. 2008, 9:
 27.
Di Natale R, Ferro A, Giugno R, Mongiovi M, Pulvirenti A, Shasha D: SING: subgraph search in nonhomogeneous graphs. BMC Bioinformatics. 2010, 11:
 28.
Yan X, Yu PS, Han J: Graph indexing based on discrimintative frequent structure analysis. ACM T Database Syst. 2005, 30 (4): 960993. 10.1145/1114244.1114248.
 29.
Cheng J, Ke Y, Ng W, Lu A: FGIndex: towards verificationfree query processing on graph databases. Proceedings of the 2007 ACM SIGMOD International Conference on the Management of Data. 2007, 857872.
 30.
Williams DW, Huan J, Wang W: Graph database indexing using structured graph decomposition. IEEE 23rd International Conference on Data Engineering, 2007. 2007, 976985.
 31.
Zhang S, Hu M, Yang J: TreePi: a novel graph indexing method. IEEE 23rd International Conference on Data Engineering, 2007. 2007, 966975.
 32.
Zhao P, Yu JX, Yu PS: Graph indexing: tree + delta <= graph. Proceedings of the VLDB Endowment. 2007, 938949.
 33.
He H, Singh AK: Closuretree: an index structure for graph queries. Proceedings of the 22nd International Conference on Data Engineering (ICDE’06). 2006, 3838.
 34.
Shamir R, Tsur D: Faster subtree isomorphism. Proceedings of the 5th Israel Symposium on the Theory of Computing Systems. 1997, 267280.
 35.
Zou L, Chen L, Yu JX, Lu Y: A novel spectral coding in a large graph database. Proceedings of the 11th International Conference on Extending Database Technology (EDBT’08). 2008, 181192.
 36.
Moll M, Bryant DH, Kavraki LE: The LabelHash server and tools for substructurebased functional annotation. Bioinformatics. 2011, 27:
 37.
Holm L, Sander C: Mapping the protein universe. Science. 1996, 273: 595602. 10.1126/science.273.5275.595.
 38.
Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics. 2000, 16: 566567. 10.1093/bioinformatics/16.6.566.
 39.
Jung J, Lee B: Protein structure alignment using environmental profiles. Prot Sci. 2000, 13: 535543.
 40.
Stivala A, Wirth A, Stuckey PJ: Tableaubased protein substructure search using quadratic programming. BMC Bioinformatics. 2009, 10:
 41.
Moll M, Bryant DH, Kavraki LE: The LabelHash algorithm for substructure matching. BMC Bioinformatics. 2010, 11:
 42.
Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D: Design of a novel globular protein fold with atomiclevel accuracy. Science. 2003, 302: 13641368. 10.1126/science.1089427.
 43.
Havranek JJ: Specificity in computational protein design. J Biol Chem. 2010, 285: 3109531099. 10.1074/jbc.R110.157685.
 44.
Liu Y, Kuhlman B: RosettaDesign server for protein design. Nucleic Acids Res. 2006, 34: W235W238. 10.1093/nar/gkl163.
 45.
Rosado CJ, Kondos S, Bull TE, Kuiper MJ, Law RHP, Buckle AM, Voskoboinik I, Bird PI, Trapani JA, Whisstock JC, Dunstone MA: The MACPF/CDC family of poreforming toxins. Cell Microbiol. 2008, 10: 17651774. 10.1111/j.14625822.2008.01191.x.
 46.
Brown WM, Martin S, Chabarek JP, Strauss C, Faulon JL: Prediction of βstrand packing interactions using the signature product. J Mol Model. 2006, 12: 355361. 10.1007/s0089400500524.
 47.
Cheng J, Baldi P: Threestage prediction of protein βsheets by neural networks, alignments and graph algorithms. Bioinformatics. 2005, 21: 7584. 10.1093/bioinformatics/bti1004.
 48.
Jeong JK, Berman P, Przytycka TM: Bringing folding pathways into strand pairing prediction. Lecture Notes in Computer Science, Volume 4645. 2007, Springer, 3848.
 49.
Aydin Z, Altunbasak Y, Erdogan H: Bayesian models and algorithms for protein βsheet prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 8. 2011, Springer, 395409.
 50.
Schierz AC, Soldatova LN, King RD: Overhauling the PDB. Nat Biotechnol. 2007, 25: 437442. 10.1038/nbt0407437.
 51.
Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features. Biopolymers. 1983, 22: 25772637. 10.1002/bip.360221211.
 52.
Stivala AD, Wybrow M, Wirth A, Whisstock JC, Stuckey PJ: Automatic generation of protein structure cartoons with Proorigami. Bioinformatics. 2011, 27: 33153316. 10.1093/bioinformatics/btr575.
 53.
Brenner M, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. 2000, 28: 254256.
 54.
Boost v1.42.0. [http://www.boost.org/users/history/version_1_42_0],
 55.
Redis. [http://www.redis.io]
 56.
hiredis. [https://github.com/antirez/hiredis],
 57.
Whoosh Python search library. [https://bitbucket.org/mchaput/whoosh/wiki/Home],
 58.
LabelHash 1.0.2. [http://labelhash.kavrakilab.org/downloads/python27/LabelHash1.0.2Linux64.tar.gz],
 59.
SHEBA 3.1.1. [https://ccrod.cancer.gov/confluence/download/attachments/63341259/sheba3.1.1.tar.gz],
 60.
Bakan A, Meireles LM, Bahar I: ProDy: protein dynamics inferred from theory and experiments. Bioinformatics. 2011, 27: 15751577. 10.1093/bioinformatics/btr168.
 61.
Schrödinger LLC: The PyMOL Molecular Graphics System, Version 1.3r1. 2010
Acknowledgements
We would like to thank:
The reviewers for their feedback in improving the quality of this article.
R. Di Natale, R. Giugno, V. Bonnici, and D. Shasha for their assistance and providing the GraphGrepSX and SING source code.
M. Moll for his assistance with the LabelHash software.
A. Stivala for his assistance with the ProOrigami source code.
The VLSCI systems support staff for their assistance with our HPC technical requests.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HH, GG, and KR contributed to the design of the algorithms. HH implemented the algorithms and evaluation software, performed the experimental analyses, and prepared the manuscript and figures. MJK contributed to the evaluation of the case studies. All authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Protein Data Bank
 Query Time
 Suffix Tree
 Query Graph
 Query Size
Comments
View archived comments (1)