Structure Comparison Algorithm
Our method uses predominantly the goodness of geometric superposition as
similarity measurement and considers the minimum unit of a match
between two protein structures is a pair of residues. Protein structure
is represented, in structural comparison, by its traces of alpha carbons.
Thus, in our view, protein was an object composed of connected points in a
3-dimensional space. To find the most similar part between two proteins
thereby is equivalent to obtain a list of matched pairs of cartesian points.
Insertion, deletion, and sequence-order-dependent problems will
not exist in our method.
Our residue-based algorithm can be described in two processes. In first
process, a set of superpositions based on local similarity between two
proteins were obtained through an efficient exhausting searches. In second
process, found local similarities were then used as a starting point in an
iterative procedure to locate the best global similarity between two proteins.
In overall, our method is a fast, reliable protein structure comparison tool.
Since proteins are connected chains,
we can superimpose every three consecutive alpha carbon atoms of one protein onto
every three consecutive alpha carbon atoms of the other protein in order to find all
possible local similarities between proteins. Thus,
comparing two proteins said both with N residues, we have to perform N2 operations
of comparison for exaustively searching good local matches.
With Geometry Hashing algorithm, we are able to accomplish the same task described above
by performing only N instead of N2 operations.
The Geometry Hash algorithm is composed of two procedures, namely the "hash table
construction" and "voting" procedure, respectively.
Hash Table Construction
In construction of a hash table,
an orthogonal reference frame (Rx, Ry, Rz) was calculated for every
three consecutive alpha carbons in a protein A as following:
Rx = Ci - Cj
Vt = Ck - Cj
Rz = Rx X Vt
Ry = Rx X Rz
where Ci, Cj, and Ck are the coordinates of three
consecutive alpha carbon atoms.
Local structural environment of every frame was then implicitly registered
in the table according their transformed structural invariants which were
computed by projecting onto the orthogonal reference frame. Usually, all residues
within a ball of 15 Angstrom were included to represent
a local structure. For each entry in the table with address directly calculated
from a structural invariants, both the frame identifier (Fi) and alpha carbon atom
identifier (Ai) were stored. Since different reference frame might produces same
structural invariants, we are very likely to have multiple entries in a single
address. For example, if we include the central alpha carbom atom of
the reference frame in the table, obviously all of them will occupy at the same
address.
Voting Process
In voting procedure, we repeat the same calculations for another protein B
to obtain a address for every environment residue of every reference frame.
However, instead of registering it into table we retrieve all entries at that
address and update the voting records. For every frame pair (Fi,Fj),
we keep the vote count Ni,j and a list of matched alpha carbon atom pairs,
(Ai,Bj)n, n=1,Ni,j.
For any frame pair in which vote count exceeds a preset threshold, it is called
a seed match. At this stage, we are able to detect local
similarity between two proteins by geometric superposition criteria but no actual
superimposition calculation was performed. A translational vector and
rotational matrix for every seed match was calculated by a superposition procedure via
its associated list of matched alpha carbon atom pairs.
If a protein was compared with itself, we are going to obtain N-2 exactly the same
seed matches. This imply similar seed matches are needed to be clustered and coalesced.
After similar seed matches were grouped, a set of potential matches emerged and
passed to the extending process.
Extending Process
This is an iterative process. After one protein (or interface) was superimposed onto
another protein, a new matched pair list was reassigned, in which the distance between
a matched pair must be shorter than a preset threshold (3.5 Angstrom was used in this study).
When one residue from a protein can be matched to many residues in another protein by
distance criterion, the selection was always made in favor of some selected similarity
score. Here a simple procedure was devised to select a match with improved
connectivity score in protein structural comparison.
When the similarity score converges within a threshold as compared with previous
iteration, an optimal extended alignment was achieved and an extending process
ends, otherwise iteration continues.
Parameters used in the clustering
The first column gives the iteration cycle. There are six successive
clustering cycles (A through F). The second column gives the number of
interface clusters at the beginning and end of the iteration. For example,
during iteration A, there were initially 21,686 interfaces (in the first
cycle, this number is equal to the number of two-chain interfaces in the
PDB). Using the similarities of the structures and sequences (with the
parameters listed in columns 3-5) the number decreased to 16,446. The
connectivity score takes into account the residue connectivity in the
polypeptide chains. The score favors a match with consecutive residues. At
the end of the sixth (final) cycle, we obtained 3,799 distinct clusters.
After this cycle, members of each cluster had at least 0.5 connectivity
score. There was no amino acid similarity constraint and the maximal size
difference between interfaces was 50 residues.