BetaSearch: a new method for querying β-residue motifs

Ho, Hui Kian; Gange, Graeme; Kuiper, Michael J; Ramamohanarao, Kotagiri

doi:10.1186/1756-0500-5-391

Technical Note
Open access
Published: 30 July 2012

BMC Research Notes volume 5, Article number: 391 (2012)

Abstract

Background

Searching for structural motifs across known protein structures can be useful for identifying unrelated proteins with similar function and characterising secondary structures such as β-sheets. This is infeasible using conventional sequence alignment because linear protein sequences do not contain spatial information. β-residue motifs are β-sheet substructures that can be represented as graphs and queried using existing graph indexing methods. However, these approaches are designed for general graphs, do not incorporate the inherent structural constraints of β-sheets, and require computationally expensive filtering and verification procedures. 3D substructure search methods, on the other hand, allow β-residue motifs to be queried in a three-dimensional context but at significant computational cost.

Findings

We developed a new method for querying β-residue motifs, called BetaSearch, which leverages the natural planar constraints of β-sheets by indexing them as 2D matrices, thus avoiding much of the computational complexity involved in structural and graph querying. BetaSearch exhibits faster filtering, verification, and overall query times than existing graph indexing approaches whilst producing comparable index sizes. Compared to 3D substructure search methods, BetaSearch achieves speedups of 33 and 240 times over index-based and pairwise alignment-based approaches, respectively. Furthermore, we present case studies to demonstrate its capability for motif matching in sequentially dissimilar proteins and describe a method for using BetaSearch to predict β-strand pairing.

Conclusions

We have demonstrated that BetaSearch is a fast method for querying substructure motifs. The improvements in speed over existing approaches make it useful for efficiently performing high-volume exploratory querying of possible protein substructural motifs or conformations. BetaSearch was used to identify a nearly identical β-residue motif shared by an entirely synthetic protein (Top7) and a naturally occurring protein (Charcot-Leyden crystal protein), and to identify structural similarities between the biotin-binding domains of avidin and streptavidin and the lipocalin gamma subunit of human C8.

Background

The β-sheet is a common secondary structure element that plays important functional and structural roles in proteins, for example, in the ligand-binding pockets of biotin-binding proteins and in the structure of the commonly occurring TIM-barrel fold[1]. These roles are often mediated by interactions between adjacent pairs of residues across β-strands. These include disulphide, ionic, and hydrogen bonds, as well as the hydrophobic packing interactions frequently involved in maintaining the structural stability of a protein or in enzymatic active sites[1]. The influence of pairwise interactions within β-sheets and their tertiary structures has been studied experimentally[2] and statistically[3, 4], the results of which have been used to predict β-sheet topology[5–7] and tertiary structure[8]. These studies have provided insights into the folding mechanisms of β-sheets, although β-sheet folding remains an open problem[4]. Examining interresidue interactions at the single pairwise level, however, provides only a limited view of a larger interaction network within a β-sheet. We refer to these clusters of interacting residues as β-residue motifs, which are contiguous subsets of β-sheet residues connected by peptide and/or hydrogen bonds (as shown in Figure 1D). Unlike sequence motifs, β-residue motifs encode information about both the peptide and bridge-partners of each residue. For the purposes of this study, we consider β-residue motifs to be provisional, since they may also exist via a general conservation between homologs rather than as independent functional units, as is the case for motifs in the traditional sense[9].

Many characteristic β-residue motifs are observed in the Protein Data Bank (PDB)[10]. For example, the β-sheets in leucine-rich repeat (LRR) domains contain consecutive adjacent interstrand pairs of buried leucines sterically-packed alongside their bridge-partners, contributing to structural stability[11]. Other β-residue motifs appear as a combination of inter- and intrastrand residue neighbours, as is the case for the TCT motif of certain antifreeze proteins[12] and glutamic acid/lysine motifs[13]. The conserved biotin-binding site in streptavidin [PDB:1STP] contains a β-residue motif of five inward-directing residues of a β-barrel: S88, T90, W92, W108, and L110. Identification of these β-residue motifs can be used to search for other proteins with similar structural elements or function with low sequence identities.

Searching for structural motifs, β-residue or otherwise, in the PDB using linear approaches such as sequence alignment is a difficult, if not impossible, task because interresidue interactions can occur across secondary structures that are sporadically located throughout a protein sequence. Furthermore, pairwise residue interactions are not accounted for in conventional sequence alignment tools such as BLAST[14] and CLUSTALW[15], given the one-dimensional nature of sequences.

The conventional approach to motif querying involves the use of protein substructure search methods that structurally align the 3D atomic coordinates of a query with known protein structures. These methods provide a structural context to each query hit and generally produce approximate matches in the form of a ranked list of hits, but they may take hours to perform even a few queries due to their reliance on structural alignment algorithms[16].

Alternatively, protein structures can be represented as graphs and queried for motifs using graph indexing approaches[17]. Unlike 3D substructure searches, these methods perform exact matching by querying only the discrete edge, node, and label features of graphs rather than by 3D similarity between continuous coordinates. The query matching algorithms used by existing graph indexing methods are based on solutions to the subgraph isomorphism problem, described briefly as follows:

A graph G=(V,E) is defined by a set of vertices v∈V and a set of edges e∈E, where each edge represents a connection between a pair of vertices $v_{i}, v_{j} \in V$. A graph is undirected if its edges are unordered pairs and directed otherwise. The degree of a node is the number of edges it has to other nodes.

If G₁ and G₂ are graphs defined as $G_{1} = (V_{1}, E_{1})$ and $G_{2} = (V_{2}, E_{2})$, then G₁ is a subgraph of G₂ if $V_{1} \subseteq V_{2} \land E_{1} \subseteq E_{2}$. An isomorphism between G₁ and G₂ is a bijection f:V₁→V₂ such that $\forall v_{i}, v_{j} \in V_{1} : (v_{i}, v_{j}) \in E_{1} \Leftrightarrow (f (v_{i}), f (v_{j})) \in E_{2}$.

A graph G₁ is subgraph isomorphic to G₂ if there exists a subgraph G₃ of G₂ and an isomorphism between G₃ and G₁.
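To make the definition concrete, the following is a minimal brute-force sketch of a subgraph isomorphism test for small undirected graphs. It is not the VF2 algorithm or any method from the cited works; the function name and the node/edge-list representation are assumptions made for illustration. It tries every injective mapping of query vertices onto target vertices, which is why its running time grows factorially with the query size.

```python
from itertools import permutations


def is_subgraph_isomorphic(g1_nodes, g1_edges, g2_nodes, g2_edges):
    """Return True if undirected graph G1 is isomorphic to some subgraph of G2.

    Hypothetical helper: graphs are given as a list of nodes and a list of
    2-tuples (edges). Node labels are ignored in this sketch.
    """
    g1_nodes = list(g1_nodes)
    target_edges = {frozenset(edge) for edge in g2_edges}
    # Every injective mapping f: V1 -> V2 is one candidate embedding.
    for image in permutations(g2_nodes, len(g1_nodes)):
        f = dict(zip(g1_nodes, image))
        # G3 is the image of G1 under f; it is a subgraph of G2 exactly
        # when every mapped edge is also an edge of G2.
        if all(frozenset((f[a], f[b])) in target_edges for a, b in g1_edges):
            return True
    return False


triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
square = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
path3 = (["x", "y", "z"], [("x", "y"), ("y", "z")])

print(is_subgraph_isomorphic(*path3, *square))     # True
print(is_subgraph_isomorphic(*triangle, *square))  # False
```

A three-vertex path embeds in the four-cycle, whereas a triangle does not, because the four-cycle contains no three mutually adjacent vertices.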

Graph representations of proteins[18] and β-sheets[17] have been previously described in which nodes represent residues and edges represent inter-residue interactions such as peptide or hydrogen bonds. For simplicity, we define a β-graph to be a graph representation of a β-sheet in which each node is labelled with a residue name, solid edges represent peptide bonds, and dotted edges represent a bridge-partner relationship between adjacent interstrand residue pairs (Figure 1B). β-residue motifs are considered to be connected subgraphs of β-graphs whose nodes are labelled with amino acids.

The planarity of β-graphs allows for a compact two-dimensional representation. We define a β-matrix to be a projection (or “flattening”) of a β-graph onto a 2D matrix of amino acid characters. The residues in the same β-strand are located in the same row and residues connected by bridge edges lie in the same column (Figure 1C). β-residue motifs are then considered to be submatrices of β-matrices (Figure 1D).
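As an illustration of the submatrix view, the sketch below stores a β-matrix as a list of equal-length strings and scans it naively for a query motif. This is not the BetaSearch algorithm. The `.` padding character, the treatment of blank query cells as wildcards, the function name, and the example residues are all assumptions made for this sketch.

```python
def find_motif(beta_matrix, query, blank="."):
    """Return the (row, column) offsets at which `query` occurs in `beta_matrix`.

    Hypothetical representation: each row is one beta-strand, each column
    holds bridge partners, and `blank` marks a cell with no residue.
    Blank cells in the query are treated as "don't care" positions.
    """
    rows, cols = len(beta_matrix), len(beta_matrix[0])
    q_rows, q_cols = len(query), len(query[0])
    hits = []
    for r in range(rows - q_rows + 1):
        for c in range(cols - q_cols + 1):
            if all(
                query[i][j] == blank or query[i][j] == beta_matrix[r + i][c + j]
                for i in range(q_rows)
                for j in range(q_cols)
            ):
                hits.append((r, c))
    return hits


# Invented three-stranded sheet; the residues are not taken from any PDB entry.
sheet = [
    ".KVEL.",
    "TLYVAG",
    ".SWTV.",
]

print(find_motif(sheet, ["LY", "SW"]))  # [(1, 1)]
print(find_motif(sheet, ["L.", ".W"]))  # [(1, 1)]
```

The scan costs time proportional to the product of the matrix and query sizes for a single sheet, so a database-wide search would still need an index to avoid scanning every sheet.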

Algorithms for detecting subgraph isomorphisms have been described for general graphs[19, 20]; in the worst case, they run in time factorial in the number of vertices in a query[21]. No polynomial-time algorithm is known, as the problem is NP-complete for general graphs[22]. Naively performing a subgraph test on every graph in a database is therefore computationally expensive[23]. Consequently, graph indexing methods were developed to simplify this problem. A potentially large number of non-matching graphs can be pruned from the search by using indexing techniques analogous to those in conventional search engines[24]. Graphs can be indexed by various features using disk- or memory-based indices. These approaches usually consist of three stages:

1. Index construction: The features of each graph in a database are obtained, each representing a graph characteristic. A data structure, usually an inverted index, is then constructed in which each feature is associated with the set of graphs in which it occurs.
2. Filtering: The features of the query graph are obtained. An initial set of coarse-grained candidates containing these features is retrieved from the index (or indices). Some of these candidates may not contain any query matches.
3. Verification: Each candidate is checked for a subgraph that exactly matches the query. This is performed using a subgraph test in most methods.

Graph indexing methods are loosely classified into three categories: path-based, subgraph-based, and tree-based.

Path-based methods (GraphGrep[25], GraphFind[26], GraphGrepSX[23], and SING[27]) index graphs using paths as features. These methods construct an inverted index I that maps each path p to the set of graphs in which it occurs

$$I : p \mapsto \{G : p \in G\} \tag{1}$$

where p is a sequence of connected vertices in a graph G

$$\begin{aligned} p & = (v_{i}, v_{i + 1}, \dots, v_{i + k - 1}) \\ \forall k & : 1 \leq k \leq l_{p} \end{aligned} \tag{2}$$

where l_p is the maximum path length. The filtering process returns the set of candidate graphs C containing all the paths of a query graph Q

$$C = \bigcap_{p \in \mathrm{paths}(Q)} I(p) \tag{3}$$

and verification of each candidate is performed using the VF2 algorithm[20].
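The indexing and filtering steps in Equations (1) to (3) can be sketched as follows. This is a simplified illustration rather than the implementation of GraphGrep, GraphGrepSX, or SING. The graph representation (a label dictionary plus an adjacency dictionary), the choice to index label sequences of simple paths, and all function names are assumptions.

```python
from collections import defaultdict


def label_paths(labels, adj, max_len):
    """Label sequences of all simple paths with 1..max_len vertices (Eq. 2)."""
    found = set()

    def extend(path):
        found.add(tuple(labels[v] for v in path))
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for v in labels:
        extend([v])
    return found


def build_index(graphs, max_len):
    """Inverted index mapping each path to the graphs containing it (Eq. 1)."""
    index = defaultdict(set)
    for graph_id, (labels, adj) in graphs.items():
        for p in label_paths(labels, adj, max_len):
            index[p].add(graph_id)
    return index


def filter_candidates(index, query, max_len):
    """Intersect the posting sets of every query path (Eq. 3)."""
    labels, adj = query
    postings = [index.get(p, set()) for p in label_paths(labels, adj, max_len)]
    return set.intersection(*postings) if postings else set()


chain = {0: [1], 1: [0, 2], 2: [1]}  # three vertices in a line
graphs = {
    "g1": ({0: "A", 1: "B", 2: "C"}, chain),  # A-B-C
    "g2": ({0: "A", 1: "C", 2: "B"}, chain),  # A-C-B
}
index = build_index(graphs, max_len=3)

query_ab = ({0: "A", 1: "B"}, {0: [1], 1: [0]})
query_bc = ({0: "B", 1: "C"}, {0: [1], 1: [0]})
print(filter_candidates(index, query_ab, max_len=3))          # {'g1'}
print(sorted(filter_candidates(index, query_bc, max_len=3)))  # ['g1', 'g2']
```

The returned set contains only candidates: each one still has to be verified with a subgraph test, because sharing all paths with the query does not by itself guarantee a match.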

Enumerating all paths up to and including length l_p produces large feature sets and, consequently, large indices. GraphFind avoids these problems by pruning redundant features using data mining techniques similar to those of gIndex[28]. GraphGrepSX exploits feature redundancy by implementing its index as a suffix tree where each string is a path sequence. The suffix tree was shown to be more space-efficient than the hash tables used by other path-based methods[23]. Our empirical results corroborate these findings (Table 1). SING uses a second filtering stage that prunes candidates by using path locality information. For example, a path p in a candidate must be surrounded by the same paths as in the query. This improvement in filtering comes at the cost of maintaining an auxiliary hash table of locality information.

Table 1 Indexing times and disk sizes


Subgraph-based methods (gIndex[28], FG-Index[29], and GDIndex[30]) use subgraphs as features and retain more topological information about graphs than paths due to their more complex structures. Index construction then requires time exponential in the number of nodes in each graph and also produces larger indices than those of path-based methods. These problems can be alleviated to a degree by indexing only the most frequent subgraphs[28].

Tree-based methods (TreePi[31], TreePi+δ[32], and CTree[33]) use subtrees as features and are purported to provide an ideal compromise between the small indices of path-based methods and the specificity of subgraph-based methods. Algorithmic operations on trees are generally more asymptotically efficient than those on graphs; in particular, subtree isomorphism can be tested in polynomial time[34]. However, previous results showed that certain path-based methods are still an order of magnitude faster in query time than existing tree-based methods[27].

GCoding[35] generates numeric representations of graphs using encodings of their adjacency matrices and cannot be classified into any of the above groups. These representations allow efficient filtering without computationally expensive graph traversals or feature enumeration. A specialised subgraph isomorphism test is used for verification[35]. While these encodings provide a compact index, expensive eigenvalue calculations are required to compute them[27].

Each of these methods can be applied to a wide variety of problems because they were designed for general graphs (i.e. graphs with an unrestricted degree and/or node count) that do not make use of the inherent structural constraints of β-sheets. For example, each β-residue has at most four neighbours: the preceding and following peptide-bonded residues and one bridge-partner located on each of the two adjacent hydrogen-bonded β-strands.

The problem of protein 3D substructure searching involves searching a database of protein structures for structures that contain substructures similar to a query structure, and it remains a significant problem in structural biology. These substructures may be relevant to biological processes such as binding sites, enzymatic function, or may be representative of a particular fold family[16, 36]. Current methods for substructure searching are based on the comparison of three-dimensional coordinates between structures and use computationally complex structural alignment algorithms. Methods such as Dali[37], DaliLite[38], and SHEBA[39] align protein structures at the residue level, that is, they find one-to-one residue alignments between pairs of proteins; methods such as QPTableauSearch[40] and SATableauSearch[16] align proteins at the level of secondary structure elements (SSEs) and therefore lack residue-level specificity but are generally faster[16]. A common drawback of these methods is that exhaustive pairwise comparisons are required between the query and each protein structure in a database. This naive approach often leads to redundant comparisons between highly similar structures or structures with no obvious match, ultimately resulting in queries requiring hours or even days to complete[16].

Recently, LabelHash[36, 41] was developed primarily for the 3D substructure matching of small motifs, commonly between 4 and 15 residues. This method is unique among structural search methods in general since it uses a pre-computed index to vastly accelerate querying in a manner similar to those of graph indexing approaches. Indeed, the results in this paper show that LabelHash yields a considerable performance boost in compute time over a conventional pairwise structural alignment approach (see Results and discussion).

In this paper we describe BetaSearch, a method that allows fast querying of β-residue motifs in large datasets of protein structures. Our method leverages the natural planar constraints of β-sheets by indexing them as 2D matrices, known as β-matrices. This approach avoids the geometric, topological, and computational complexities usually involved in 3D substructure or graph querying. Furthermore, by using β-sheet representations independent of a 3D coordinate system, BetaSearch identifies matching β-residue motifs in structurally and sequentially dissimilar proteins.

Results and discussion

We have compared the performance of BetaSearch against state-of-the-art graph indexing and 3D substructure search methods separately. The results of three case studies are also presented, which provide biologically-relevant contexts in which BetaSearch could be used.

Comparisons with graph indexing methods

We compared BetaSearch against SING and GraphGrepSX. SING was shown to outperform existing methods in terms of query time on standard datasets of chemical compounds, protein transcription networks, protein-interaction networks, and synthetic graphs[27].

The elapsed indexing, total query, filtering, and verification times were averaged over five repetitions. SING and GraphGrepSX were run using l_p=4 and l_p=10, where l_p denotes the path length. These values were chosen by the authors of each method in their own comparisons[23, 27]. We were unable to use larger l_p values or datasets of more than 16,000 β-sheets due to the memory consumption of the SING and GraphGrepSX implementations. We therefore only reported results for datasets up to and including N = 16,000. Accuracies of each method were not measured since each query matches at least one β-sheet and any non-matching β-sheet is excluded at the filtering and verification stages of each method.

Indexing

The elapsed times and disk space required for index construction are shown in Table 1.

BetaSearch recorded the fastest indexing times, with a 1.9 times speed-up over the next fastest method (GraphGrepSX, l_p=4) for the N = 16,000 dataset. The sizes of the BetaSearch indices were similar to those of SING, l_p=4, since trimers have an effective path length of l_p=3 and both use hash tables.

The l_p=10 variants of SING and GraphGrepSX were slower than their l_p=4 variants due to the increase in the number of features generated in the former case. This observation was consistent with those reported for general graphs[23].

GraphGrepSX, l_p=4 generated the smallest indices by a considerable margin through the use of a suffix tree to store its indices. However, the results obtained in the following sections show that the reduction in index disk space came at a significant cost to the querying time.

Furthermore, the BetaSearch index is limited only by the size of the hard disk on which it is stored, whereas the implementations of SING and GraphGrepSX used in this study were memory-limited, requiring the entire index to be loaded into memory in order for queries to be performed.

Overall query times

The query time for a single query was calculated as the sum of its filtering and verification times. The time required to perform all the queries on a dataset was measured as the sum of its individual query times, shown in Table 2. These results show that BetaSearch consistently recorded the fastest querying times for all datasets, by at least an order of magnitude over the next fastest method (SING, l_p=10) and with a 109 times speed-up over the baseline (GraphGrepSX, l_p=4) for the N = 16,000 dataset.

Table 2 Overall query times (graph indexing comparisons)


The trade-off between the index disk size and querying times within the SING and GraphGrepSX variants can be seen in these results where the l_p=10 variants required four to five times as much disk space but were at least twice as fast as the l_p=4 variants.

The overall query time speedups for all query sizes were measured using GraphGrepSX, l_p=4 as the baseline and are shown in Figure 2. Only the speedups for the N = 2,000 and 16,000 datasets are shown for brevity. The speedups of each method generally taper off after queries of approximately six to seven edges. This is expected because larger β-sheet subgraph queries are more specific than smaller ones, resulting in fewer possible candidates and therefore a reduced filtering and verification load.

Filtering

The filtering time was calculated as the time required to perform filtering for all the queries of a given dataset. The precision was calculated as the total number of actual query matches divided by the total number of filtered candidates for all the queries of a given dataset. The filtering results are shown in Table 3.
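Written as a formula, with symbols that are shorthand introduced here rather than notation from the original text: let Q be the set of queries run against a dataset, $M_{q}$ the set of true matches for query q, and $C_{q}$ the set of candidates returned by filtering for q. Then

$$\text{precision} = \frac{\sum_{q \in Q} |M_{q}|}{\sum_{q \in Q} |C_{q}|}$$

As a purely illustrative example with invented numbers, if filtering returned 200 candidates in total and 50 of them were true matches, the precision would be 50/200 = 0.25. Assuming filtering never discards a true match, a precision of 1.0 means every candidate passed to verification was a genuine hit.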

Table 3 Filtering times and precisions (graph indexing comparisons)


In contrast to their indexing performances, the l_p=10 variants of SING and GraphGrepSX generally outperformed their l_p=4 variants. A larger l_p value has more specificity and therefore results in fewer filtered candidates than a small l_p value, reducing the verification load. The BetaSearch and l_p=10 precision values were consistently near 1.0 for all datasets and query sizes. The precisions of the l_p=4 variants were considerably lower than those of the l_p=10 variants due to the aforementioned specificity limitations of smaller path lengths.

Verification

We measured the mean verification time as the total verification time for a dataset divided by the total number of filtered candidates for a dataset. The mean verification times were less than a second due to the relatively small query graphs involved in this study. The speedups of each method were measured using the GraphGrepSX, l_p=4 times as the baseline and are shown in Figure 2.

BetaSearch consistently recorded the fastest verifications across all query sizes and datasets. This is because the BetaSearch verification algorithm runs in quadratic time, whereas the VF2 algorithm employed by SING and GraphGrepSX was designed for general graphs and has a potentially non-polynomial time complexity[21]. The largest speed-up by BetaSearch was achieved for queries with two edges: since these queries equate to individual trimers, there was no need for candidates to be verified.

Comparisons with 3D substructure search methods

We compared BetaSearch with LabelHash and SHEBA since they each perform residue-level matching. SHEBA was shown to be amongst the most accurate substructure search methods in recent work[16]; however, LabelHash has yet to be evaluated against other methods. LabelHash and SHEBA were run using default search parameters. Comparisons with DaliLite could not be performed because the majority of our queries and β-sheets did not meet the minimum number of residues required by DaliLite. DaliLite was shown to have accuracies comparable to SHEBA but with considerably longer compute times[16].

Figure 3A shows the F₁ scores computed across all query sizes for each method. Exact matches for each method were considered to be those with p′=1 for LabelHash and m=1 for SHEBA. We also computed F₁ at p′≥0.999; however, the F₁ at m≥0.999 was identical to that at m=1, so we instead computed F₁ at m≥0.95. BetaSearch, by virtue of its inherent exact matching, produces unranked hits and consequently an F₁ score of 1.0 for the entire query set.

LabelHash at p′≥0.999 clearly outperforms SHEBA on all query sizes; however, neither method performed particularly well on queries of 10 residues or fewer, with the worst F₁ scores observed for queries of 4–5 residues, which have the largest number of hits amongst all query sizes (see Additional file 1: Figure S1A). Once the queries reach sizes of 25 residues, however, LabelHash maintained F₁ scores of at least 0.9, since the number of possible hits closely approaches the number of queries (see Additional file 1: Figure S1B).

We measured the CPU times of each method and computed the speedups over SHEBA. The wallclock times were also measured but were omitted since they were analogous to the CPU times. The CPU times of each method for the ASTRAL95 query set were measured as follows:

SHEBA – 239 h 25 m
LabelHash – 33 h 17 m
BetaSearch – 0 h 59 m

Figure 3B shows the speedups at each query size. BetaSearch achieved total speedups of 240 times over SHEBA and 33 times over LabelHash. The largest speedups of BetaSearch were obtained for queries of 4–15 residues, which are the sizes of commonly studied motifs[41]. The improved performance of BetaSearch and LabelHash over SHEBA can be attributed to their use of indices, which removes the need to perform exhaustive pairwise comparisons for each query against the dataset. This naive approach to substructural searching can lead to query sets taking days to complete[16, 40].
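As a rough check of the arithmetic, the total speedups can be recomputed from the CPU times listed above. The sketch below uses only those minute-rounded totals, and the helper names are our own. The resulting ratios are close to, but not exactly, the reported figures; the small differences presumably stem from rounding in the listed times, which is an assumption on our part.

```python
def minutes(hours, mins):
    """Convert an 'h m' duration to minutes."""
    return hours * 60 + mins


# CPU times for the ASTRAL95 query set, as listed in the text.
cpu_minutes = {
    "SHEBA": minutes(239, 25),
    "LabelHash": minutes(33, 17),
    "BetaSearch": minutes(0, 59),
}


def speedup(baseline, method):
    """How many times faster `method` is than `baseline`."""
    return cpu_minutes[baseline] / cpu_minutes[method]


print(round(speedup("SHEBA", "BetaSearch")))      # 243
print(round(speedup("LabelHash", "BetaSearch")))  # 34
print(round(speedup("SHEBA", "LabelHash"), 1))    # 7.2
```

The last ratio shows that LabelHash, the other index-based method, is itself roughly seven times faster than SHEBA on these totals.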

Case Studies

β-residue motifs can contribute both to the structural and functional features of a protein. For researchers who study protein structure, BetaSearch can be a useful tool for surveying particular β-sheet configurations across known protein structures. The frequency of a particular motif may give an indication of its relative stability as a β-sheet structural element. Researchers who study functional aspects of β-sheets can use BetaSearch to identify similar motifs in unrelated proteins, as we demonstrate with the biotin-binding pockets of avidin and streptavidin. BetaSearch is fast with a simple, intuitive search query context that allows the researcher to efficiently make comparisons against known β-sheets.

A typical BetaSearch workflow involves the researcher (i) inspecting a protein structure for a β-sheet of interest, (ii) identifying a specific β-residue motif, and (iii) manually entering the amino acids in the corresponding β-matrix into BetaSearch. Alternatively, this workflow can be automated, allowing BetaSearch to be used in a data mining or knowledge-discovery capacity. This could potentially reveal interesting relationships between specific amino acid configurations and protein structures or functions that would not be found intuitively by manual trial-and-error querying.

To demonstrate the capabilities and potential use-cases of BetaSearch we present the results of three case studies. These were drawn from real-world examples and illustrate the role β-residue motifs play in the structure and function of proteins. We also provide the matches from comparative queries using BLAST to demonstrate the difference in matches between a conventional sequence-based homology search and BetaSearch (see Additional files 2, 3, and 4).

Case Study 1 - Synthetic motifs in the Top7 protein

Top7 [PDB:1QYS] is the only engineered protein (non-hypothetical) not to be derived from the sequence or structure of any other protein[42, 43]. Most notably, it adopts a unique fold that has yet to be observed in nature. Its structure consists of an amphipathic β-sheet and two α-helices. Inspection of the β-sheet revealed a repeating β-residue motif (Figure 4A). Using this as a query, we wanted to discover known protein structures that possessed this putative synthetic motif. BetaSearch was used to query the PDB2011 dataset, which revealed matches only in structures of the Charcot-Leyden crystal (CLC) protein [PDB:1G86, 1HDK, 1LCL, 1QKQ].

A structural alignment of Top7 and CLC around the matching regions of the query motif (Figure 4B) shows, remarkably, that the β-strand topology and sidechain directions are nearly identical. This does not suggest a homology between the two proteins because Top7 is entirely synthetic. However, our findings demonstrate that the RosettaDesign[44] approach used to engineer Top7 had inadvertently reproduced a known stable β-residue motif ab initio. The CLC protein was not found in a BLAST query of Top7 chain A (see Additional file 2).

Case Study 2 - Biotin-binding domains

Streptavidin [PDB:1STP] and avidin [PDB:1VYO] are structurally and functionally similar homologous proteins that bind strongly to biotin despite having a sequence identity of less than 35%. Both proteins consist of eight antiparallel β-strands that fold into a β-barrel, inside of which forms a highly conserved biotin-binding site. The β-residue motifs that line this highly specific site are shown in Figures 5A and 5B. When the residues on the non-binding face of the β-sheet are ignored, the two motifs are differentiated by only a single residue: $W_{92}^{1 STP} \Leftrightarrow F_{79}^{1 VYO}$. The results from the corresponding BLAST query are shown in Additional file 3.

We have characterised these biotin-binding sites as a minimal β-residue motif (Figure 5C). This putative motif is evolutionarily conserved between the avidins and has not yet been shown to be recurrent in evolutionarily distant proteins. Using this query, BetaSearch not only identifies the structures of avidin and streptavidin, but also xenavidin, a biotin-binding protein from Xenopus tropicalis (frog). A number of seemingly unrelated proteins were also matched, including uncharacterised proteins from Roseovarius nubinhibens [PDB:3BVC] and Oceanicola granulosus [PDB:2RG4], and the human complement protein C8 gamma [PDB:1IW2]. The complete set of matching β-sheets is listed in Additional file 4.

Inspection of the uncharacterised proteins reveals a similar arrangement of residues to the known biotin-binding proteins but with less room for the ligand to bind. More tantalising is the match with the gamma subunit of human C8, which is a crucial component of the cytolytic membrane attack complex (MAC)[45]. This subunit has a characteristic lipocalin fold with a distinctive binding pocket similar to the avidins; however, the ligand target of C8 gamma remains unknown[45]. Based on the spatial similarities of this binding pocket with the biotin-binding sites of avidin, one may suggest that these proteins could have an affinity for biotin or biotin-like compounds.

These results demonstrate that a relatively small β-residue motif query can be matched in unrelated proteins. This capability can be particularly useful in characterising proteins of unknown function by similarities in β-residue motifs to those of known function.

Case Study 3 - β-strand pairing prediction

One of the unsolved problems of tertiary structure prediction is the ability to predict the pairs of β-strands which are hydrogen bonded, and therefore adjacent, in a β-sheet[46]. Information about adjacent β-strands can be used to determine the overall topology of a β-sheet. A number of β-sheet topology prediction algorithms exist that are based on well-known machine learning methods[46–49].

We used BetaSearch to predict the β-strand pairings of the five-stranded β-sheet found in chain A of c-src tyrosine kinase [PDB:1A09]. This β-sheet is non-barreled and non-bifurcated, and therefore has four native strand pairings. A score for each possible strand pair was computed as a function of the number of hits, in a BetaSearch index, obtained for each interstrand 4-mer query. The top four pairs ranked by pairing scores were considered as predictions. Strand pairing scores were computed for parallel and antiparallel orientations, as shown in Table 4. These results demonstrated that all of the native strand pairs were correctly predicted. The procedures used to perform these predictions are described in the Methods section. Our mechanisms for strand pair scoring are by no means a definitive solution to β-sheet topology prediction. They can, however, be used in existing algorithms such as BetaPro[47], which require preliminary β-strand or β-residue pairing scores in order for predictions to be made. A large-scale evaluation of our BetaSearch-based prediction method is the topic of future work.

Table 4 The predicted strand pairing scores for the 1A09 chain A β-sheet (Case Study 3)


Conclusion

We have described a method for indexing and querying β-residue motifs, called BetaSearch, that is at least an order of magnitude faster than state-of-the-art graph indexing methods. These speedups are achieved by indexing β-sheets as 2D matrices of amino acids known as β-matrices. This representation leverages the inherent planar structural constraints of β-sheets, thereby avoiding much of the computational complexity involved in querying and indexing 3D or graph representations of protein structures. BetaSearch is therefore able to achieve quadratic-time querying. Filtering precisions were close to 1.0 for all datasets and query sizes, resulting in near minimal verification time.

When compared with existing 3D substructure search methods, BetaSearch achieves a 240 times speedup over the baseline (SHEBA) and a 33 times speedup over the next fastest method (LabelHash). The demonstrated efficiency of BetaSearch lends itself well to the rapid exploration of probable motif or β-sheet conformations in a matter of minutes, rather than days or weeks with 3D-based methods. Furthermore, the ability of BetaSearch to perform exact matching ensures that correct hits are not missed.

Our three case studies demonstrated the utility of BetaSearch in biological contexts. We discovered that the synthetic Top7 protein shares an identical β-residue motif with a known naturally-occurring protein, the Charcot-Leyden crystal. A small query derived from the biotin-binding motif of avidins easily identified unrelated biotin-binding proteins and suggests biotin-binding in others, including the gamma subunit of the human C8 complement protein. BetaSearch, with its ability to identify functional similarity among unrelated proteins, can potentially help characterise the proteins in the PDB with unknown function. We also demonstrated how BetaSearch could be used to predict strand pairing in β-sheets, which could help reduce the search space of more complex supersecondary or tertiary structure prediction tasks. Although our work has focused on substructural motifs in β-sheets, our algorithm can be modified to query any substructural motif involving pairwise interactions, such as the well-characterised hydrogen-bond pairings in helices and turns. Indeed, this is an avenue of development we are currently exploring.

It is our intention for BetaSearch to be used by protein researchers to supplement conventional sequence and structural search methods. For example, the efficiency of the BetaSearch filtering and verification algorithms introduces the possibility for their use as a rapid “first-pass” filter to improve the querying performance of other methods. Such an application would be non-trivial to develop but could potentially reduce conventional structural query times from hours to minutes.

Methods

The pseudocode for each of the algorithms described in this section is provided in the Supplementary Materials.

Trimers

A trimer is a path of three amino acids in a β-matrix configured in the shape of an ‘L’ (an L-trimer), vertically in the same column (a V-trimer), or horizontally in the same row (an H-trimer). Trimers are the features by which β-matrices are indexed in BetaSearch. An example of the trimer extraction process is shown in Additional file 1: Figure S2.

A trimer t has a number of attributes that encode its configuration and location within a β-matrix:

t.seq: a three letter string of residues spanned by the trimer where
$\text{if } t.\text{seq} = \text{“abc”} \text{ then } t.\text{seq}[0] \to \text{“a”},\ t.\text{seq}[1] \to \text{“b”},\ t.\text{seq}[2] \to \text{“c”}.$
(4)
t.class: an integer representing the class of the trimer, defined as
$t.\text{class} = \begin{cases} 1 & \text{if } t \text{ is an L-trimer} \\ 3 & \text{if } t \text{ is a V-trimer and } t.\text{seq}[0] \neq t.\text{seq}[2] \\ 5 & \text{if } t \text{ is an H-trimer and } t.\text{seq}[0] \neq t.\text{seq}[2] \\ 15 & \text{if } t \text{ is a V-trimer and } t.\text{seq}[0] = t.\text{seq}[2] \\ 31 & \text{if } t \text{ is an H-trimer and } t.\text{seq}[0] = t.\text{seq}[2]. \end{cases}$
(5)
t.id: a (t.class,t.seq) tuple.
t.orient: an integer value such that t.orient $\in \{0, 1, 2, 3\}$. These values describe the possible orientations of a trimer and were chosen to allow the calculation of x- and y-axis trimer reflections using the bitwise-XOR (‘⊕’) operator. Orientation reflections are calculated as
$t'.\text{orient} = \begin{cases} t.\text{orient} \oplus 1 & \text{if reflected in the y-axis} \\ t.\text{orient} \oplus 2 & \text{if reflected in the x-axis} \end{cases}$
(6)

where t′ is the reflection of t. Additional file 1: Figure S2 shows how trimer orientations are determined for each trimer class.

t.eq-orients: an integer that encodes the equivalent orientations of t.orient, defined as
$t.\text{eq-orients} = \begin{cases} 2^{t.\text{orient}} & \text{if } t.\text{class} = 1 \\ 3 & \text{if } t.\text{class} = 3 \text{ and } t.\text{seq}[0] < t.\text{seq}[2] \\ 12 & \text{if } t.\text{class} = 3 \text{ and } t.\text{seq}[0] > t.\text{seq}[2] \\ 5 & \text{if } t.\text{class} = 5 \text{ and } t.\text{seq}[0] < t.\text{seq}[2] \\ 10 & \text{if } t.\text{class} = 5 \text{ and } t.\text{seq}[0] > t.\text{seq}[2] \\ 15 & \text{otherwise,} \end{cases}$
(7)

such that
$t.\text{eq-orients} \mathbin{\&} i = \begin{cases} 1 & \text{if orientation } i \text{ is equivalent to } t.\text{orient} \\ 0 & \text{otherwise,} \end{cases}$
(8)

where ‘<’ and ‘>’ are the lexicographic less-than and greater-than operators, and ‘&’ is the bitwise-AND operator. The equivalent orientations of a trimer are encoded using bitmasks such that orientation i is equivalent to t.orient if the i-th bit of t.eq-orients is set to 1. The t.class value for each type of trimer is an encoding of the minimum t.eq-orients value for the class.
t.row: the row coordinate of t.seq[1] within its β-matrix.
t.col: the column coordinate of t.seq[1] within its β-matrix.
t.coord: a (t.row,t.col) tuple.
t.span1, t.span2: the row (t.row-span) or column spans (t.col-span) of a trimer, depending on the trimer class. Each span is an ordered tuple (i,j) where i is the coordinate of t.seq[1] and j is the coordinate of either t.seq[0] or t.seq[2], depending on the trimer type. L-trimers have one row span and one column span, V-trimers have two row spans, and H-trimers have two column spans. Examples of the spans for each trimer are shown in Additional file 1: Figure S3.
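
The class, reflection, and equivalent-orientation rules of Equations 5 to 8 can be illustrated with a minimal Python sketch. The function names and the 'L'/'V'/'H' shape labels are hypothetical and introduced only for this illustration; the bit test follows the textual description, in which bit i of t.eq-orients flags orientation i.

```python
REFLECT_Y = 1  # XOR mask for a reflection in the y-axis (Equation 6)
REFLECT_X = 2  # XOR mask for a reflection in the x-axis (Equation 6)


def trimer_class(kind: str, seq: str) -> int:
    """Equation 5. `kind` is 'L', 'V' or 'H' (a hypothetical shape label)."""
    if kind == "L":
        return 1
    same_ends = seq[0] == seq[2]
    if kind == "V":
        return 15 if same_ends else 3
    if kind == "H":
        return 31 if same_ends else 5
    raise ValueError(f"unknown trimer kind: {kind!r}")


def eq_orients(cls: int, seq: str, orient: int) -> int:
    """Equation 7: bitmask of the orientations equivalent to `orient`."""
    if cls == 1:
        return 1 << orient          # 2 ** orient: an L-trimer matches only itself
    if cls == 3:
        return 3 if seq[0] < seq[2] else 12
    if cls == 5:
        return 5 if seq[0] < seq[2] else 10
    return 15                       # classes 15 and 31: every orientation


def reflect(orient: int, axis: str) -> int:
    """Equation 6: reflect an orientation in the 'x' or 'y' axis."""
    if axis not in ("x", "y"):
        raise ValueError("axis must be 'x' or 'y'")
    return orient ^ (REFLECT_Y if axis == "y" else REFLECT_X)


def is_equivalent(mask: int, i: int) -> bool:
    """True if bit i of the eq-orients mask is set (Equation 8, read bitwise)."""
    return bool((mask >> i) & 1)
```

Because each reflection is a single XOR, applying the same reflection twice returns the original orientation, and the masks 3 and 12 (or 5 and 10) partition the four orientations into the two pairs that a y-axis (or x-axis) reflection swaps.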

Index construction

BetaSearch uses three indices: $D$, $R$, and $C$.

$D$ is an inverted index that maps each trimer id to the set of β-matrices in which it is contained, defined as
$D[\text{id}] \mapsto \{ b \in B : \text{id} \in b.\text{trimer-ids} \}$
(9)

where B is the set of β-matrices in the dataset.
$R$ maps a compound key $\kappa_{R}$ to a trimer t, defined by
$\begin{aligned} & R[\kappa_{R}] \mapsto t \\ & \text{where } \kappa_{R} = (t.\text{matrix-id},\ t.\text{id},\ t.\text{eq-orients},\ t.\text{coord},\ t.\text{row-span}) \end{aligned}$
(10)

such that t.class ∉ {3,15}.
$C$ maps a compound key $\kappa_{C}$ to a trimer t, defined by
$\begin{aligned} & C[\kappa_{C}] \mapsto t \\ & \text{where } \kappa_{C} = (t.\text{matrix-id},\ t.\text{id},\ t.\text{eq-orients},\ t.\text{coord},\ t.\text{col-span}) \end{aligned}$
(11)

such that t.class∉{5,31}.

L-trimers are indexed in both $R$ and $C$. H-trimers are indexed only in $C$ because they do not contain any row spans; conversely, V-trimers are indexed only in $R$ because they do not contain any column spans. The Build-Indices procedure in Additional file 1: Algorithm S1 describes the index construction algorithm.
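
As a rough illustration of how the three indices relate, the following Python sketch builds $D$, $R$, and $C$ from trimers that have already been extracted. It is not the Build-Indices procedure of Algorithm S1: the `Trimer` record, its field names, and the choice of one index entry per span are assumptions made for this example. Routing is driven purely by which spans a trimer carries, following the description above.

```python
from collections import defaultdict, namedtuple

# Hypothetical record; the fields mirror the trimer attributes listed earlier.
# row_spans / col_spans are tuples of (i, j) spans and may be empty.
Trimer = namedtuple(
    "Trimer", "matrix_id id eq_orients coord row_spans col_spans"
)


def build_indices(trimers):
    """Return (D, R, C) for an iterable of pre-extracted trimers."""
    D = defaultdict(set)  # trimer id -> ids of the beta-matrices containing it
    R = {}                # compound key ending in a row span -> trimer
    C = {}                # compound key ending in a column span -> trimer
    for t in trimers:
        D[t.id].add(t.matrix_id)
        prefix = (t.matrix_id, t.id, t.eq_orients, t.coord)
        for span in t.row_spans:   # trimers without row spans never reach R
            R[prefix + (span,)] = t
        for span in t.col_spans:   # trimers without column spans never reach C
            C[prefix + (span,)] = t
    return D, R, C


# Toy input: one L-trimer (one span of each type), one trimer with two row
# spans, and one trimer with two column spans. All values are made up.
example = [
    Trimer("m1", (1, "ABC"), 1, (1, 1), ((1, 0),), ((1, 2),)),
    Trimer("m1", (3, "ABC"), 3, (1, 1), ((1, 0), (1, 2)), ()),
    Trimer("m2", (5, "ABC"), 5, (0, 1), (), ((1, 0), (1, 2))),
]
D, R, C = build_indices(example)
```

A dictionary keyed on the full tuple stands in for whatever keyed store an actual implementation would use; the point is only that $D$ answers "which β-matrices contain this trimer id?" while $R$ and $C$ locate individual trimers by their spans.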

Time complexity

Each entry in a β-matrix is the intersection of at most six trimers: four L-trimers (one in each of the four orientations), one V-trimer, and one H-trimer. Build-Indices runs in O(6mn) time where m is the maximum number of residues in a β-matrix and n is the number of β-matrices.
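
The bound of six trimers per entry can be checked with a small Python sketch that enumerates the trimers whose middle residue sits at a given cell. Several details here are assumptions for illustration only: the β-matrix is modelled as a list of strings with '-' marking an empty cell, "intersection" is read as "centred on the entry", and the residue order within an L-trimer (vertical neighbour first) is arbitrary. Orientations are omitted.

```python
from itertools import product

EMPTY = "-"  # hypothetical marker for an unoccupied cell


def trimers_centred_at(matrix, r, c):
    """Return (kind, seq) for each trimer whose middle residue is at (r, c)."""

    def residue(i, j):
        # Residue at (i, j), or None when out of bounds or unoccupied.
        if 0 <= i < len(matrix) and 0 <= j < len(matrix[i]):
            ch = matrix[i][j]
            return None if ch == EMPTY else ch
        return None

    mid = residue(r, c)
    if mid is None:
        return []
    up, down = residue(r - 1, c), residue(r + 1, c)
    left, right = residue(r, c - 1), residue(r, c + 1)

    found = []
    # L-trimers: one vertical and one horizontal neighbour (at most 2 x 2 = 4).
    for v, h in product((up, down), (left, right)):
        if v is not None and h is not None:
            found.append(("L", v + mid + h))
    if up is not None and down is not None:     # at most one V-trimer
        found.append(("V", up + mid + down))
    if left is not None and right is not None:  # at most one H-trimer
        found.append(("H", left + mid + right))
    return found


def extract_all(matrix):
    """Visit every cell once; each visit yields at most six trimers."""
    out = []
    for r, row in enumerate(matrix):
        for c in range(len(row)):
            out.extend(trimers_centred_at(matrix, r, c))
    return out
```

Since each cell is visited once and contributes a bounded number of trimers, the work per β-matrix grows linearly with its number of residues, which is consistent with the O(6mn) bound over n matrices.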