BetaSearch: a new method for querying β-residue motifs

Background
Searching for structural motifs across known protein structures can be useful for identifying unrelated proteins with similar function and for characterising secondary structures such as β-sheets. This is infeasible using conventional sequence alignment because linear protein sequences do not contain spatial information. β-residue motifs are β-sheet substructures that can be represented as graphs and queried using existing graph indexing methods; however, these approaches are designed for general graphs, do not incorporate the inherent structural constraints of β-sheets, and require computationally expensive filtering and verification procedures. 3D substructure search methods, on the other hand, allow β-residue motifs to be queried in a three-dimensional context but at significant computational cost.

Findings
We developed a new method for querying β-residue motifs, called BetaSearch, which leverages the natural planar constraints of β-sheets by indexing them as 2D matrices, thus avoiding much of the computational complexity involved in structural and graph querying. BetaSearch exhibits faster filtering, verification, and overall query times than existing graph indexing approaches whilst producing comparable index sizes. Compared to 3D substructure search methods, BetaSearch achieves speedups of 33 and 240 times over index-based and pairwise alignment-based approaches, respectively. Furthermore, we present case studies to demonstrate its capability of matching motifs in sequentially dissimilar proteins and describe a method for using BetaSearch to predict β-strand pairing.

Conclusions
We have demonstrated that BetaSearch is a fast method for querying substructure motifs. The improvements in speed over existing approaches make it useful for efficiently performing high-volume exploratory querying of possible protein substructural motifs or conformations.
BetaSearch was used to identify a nearly identical β-residue motif shared by an entirely synthetic protein (Top7) and a naturally occurring protein (Charcot-Leyden crystal protein), and to identify structural similarities between the biotin-binding domains of avidin, streptavidin and the lipocalin gamma subunit of human C8.


3D substructure search queries
Figure 1 shows histograms of the number of queries and hits for each query size (every two query sizes are binned together), generated from the ASTRAL95 dataset.

Exploratory querying of Top7
We translated each amino acid in our entire dataset of 209,127 β-matrices (from the PDB2011 dataset) to a reduced alphabet according to its hydrophobic ("h") or hydrophilic ("p") properties. We used a conventional amino acid grouping scheme [1] (SM- Table 2) to determine the amino acid translations. We subsequently formulated an "amphipathic" query. This query returned 259 matching β-matrices from 116 protein structures, among which the Top7 protein [PDB:1QYS] was found with a sheet ID of 1QYSA SHEET 000.
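The reduced-alphabet translation described above can be sketched in a few lines of Python. This is only an illustration: the set of residues treated as hydrophobic is an assumption made for the example (the grouping actually used is the one cited as [1], SM- Table 2), and the function names and the list-of-rows matrix layout are hypothetical.

```python
# ASSUMPTION: an illustrative hydrophobic set; the grouping actually used
# in the paper is the one referenced as [1] (SM- Table 2) and may differ.
HYDROPHOBIC = frozenset("AVLIMFWC")


def to_hp(residues):
    """Translate one-letter amino acid codes to 'h' (hydrophobic) or 'p' (hydrophilic)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in residues.upper())


def translate_matrix(matrix):
    """Translate a matrix given as a list of rows; None marks an empty cell.

    The list-of-rows layout is a hypothetical stand-in for a beta-matrix.
    """
    return [
        [None if cell is None else to_hp(cell) for cell in row]
        for row in matrix
    ]
```

A query over the reduced alphabet can then be matched against translated matrices in the same way as a query over the full amino acid alphabet.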
The Top7 protein is of particular interest to protein researchers because it was the first (and is currently the only) engineered protein not derived from the structure or sequence of any other known protein [2]. The protein was designed ab initio using the RosettaDesign suite [3], then experimentally expressed and crystallised. The X-ray structure exhibited a remarkable 1.2 Å similarity to the in silico model.

Figure 1: Histograms of the number of queries and hits for each query size in the ASTRAL95 dataset.

The spans of symmetric trimers are interchangeable such that t.span1 = t.span2.

(a) For each residue r in b and for each t in T, where T is the set of the trimers originating from r.

L-trimers
Return D, R, and C.
Algorithm 2: First-Filter returns the set of β-matrices that contain all the trimer IDs in the query.
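The filtering idea in this caption, keeping only the β-matrices that contain every trimer ID in the query, can be illustrated with an inverted index and set intersection. This is a minimal sketch under stated assumptions, not the paper's implementation: the index layout, the function names, and the use of plain strings as trimer and matrix IDs are all hypothetical.

```python
def build_trimer_index(matrices):
    """Map each trimer ID to the set of matrix IDs that contain it.

    `matrices` is a hypothetical dict: matrix ID -> iterable of trimer IDs.
    """
    index = {}
    for matrix_id, trimer_ids in matrices.items():
        for trimer_id in trimer_ids:
            index.setdefault(trimer_id, set()).add(matrix_id)
    return index


def first_filter(index, query_trimer_ids):
    """Return the matrix IDs that contain every trimer ID in the query."""
    query = set(query_trimer_ids)
    if not query:
        return set()
    # Intersect the smallest posting sets first so intermediate results stay small.
    postings = sorted((index.get(t, set()) for t in query), key=len)
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= posting
        if not candidates:
            break  # no matrix can match once the intersection is empty
    return candidates
```

Candidates that survive this step would still need verification, because containing all the trimer IDs does not guarantee that the trimers are arranged as in the query.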
2. For each trimer q in Q.trimers and span s in q.spans, where s is converted to an unordered span.
5. Return G.
2. For each pair of adjacent trimers (q_src, q_des) in a breadth-first traversal of G beginning at q_root:
(a) The orientation of t_des is orient ← Rel-Orient ⊕ t_src.orient.
(b) The overlapping span numbers between t_src and t_des are (s_src, s_des) ← Overlap-Span-Nums(q_src, q_des).
(c) The overlap type between t_src and t_des is ol-type ← Overlap-Type(q_src, q_des).
(d) The span in t_src that overlaps with t_des is span_src.
(e) Determine the following compound key values for t_des:
    span_des: the span in t_des that overlaps with t_src
    coord: the row or column coordinate of t_des
    I: the row or column span index in which to find t_des
where each value is calculated as
    (span_des, coord, I) = … if ol-type = Peptide; (span_src^{-1}, t_src.col, C) otherwise.
(f) The span index compound key for t_des is key ← (c.sheet-id, q_des.id, q_des.equiv-orients, coord, span_des).
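The breadth-first traversal over pairs of adjacent trimers in step 2 can be sketched as below. This is a hedged illustration only: the adjacency-mapping representation of the query graph G and the function name are assumptions, and the per-pair work of steps (a) to (f) is left out.

```python
from collections import deque


def bfs_adjacent_pairs(graph, root):
    """Yield (src, des) pairs in breadth-first order, reaching each node once.

    `graph` is a hypothetical adjacency mapping: trimer ID -> list of neighbours.
    """
    visited = {root}
    queue = deque([root])
    while queue:
        src = queue.popleft()
        for des in graph.get(src, ()):
            if des not in visited:
                visited.add(des)
                queue.append(des)
                # The per-pair steps (a) to (f) would be applied here.
                yield src, des
```

Because every destination trimer is reached from a source that has already been placed, its position can be derived relative to that source, which is what the compound key in step (f) relies on.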