The rationale
According to the building block folding model, a native protein conformation is the outcome of a combinatorial assembly of a set of contiguous fragments (building blocks). The building blocks exist in an ensemble of conformations. During the combinatorial assembly process, the building block conformations selected for binding might be those with low population times, differing from the conformations observed for the corresponding peptides in solution. However, owing to the mutual stabilization of the building blocks in the assembly process, low stability stand-alone conformations may produce more stable hydrophobic folding units, as compared to the assembly of the more highly populated conformers. Nevertheless, while such lower stability, lower population times building block conformers may be selected, most of the associating building blocks are the stable, high-population times conformers. During the folding process these conformers are preserved, and are observed in the final, ``static'' native 3-D structure.
It is reasonable to assume that a stability measurement of an isolated contiguous fragment in a particular conformation will reflect the population time of the conformation during the folding process. The larger the measurement, the larger the time that the fragment spends in this conformation. Hence, to locate building blocks from a native protein 3-D structure, we need to find a set of non-overlapping contiguous fragments that have the highest conformational stability among all other possible candidate combinations. In practice, a few residue-overlap between the fragments is permitted.
For a polypeptide chain with a size of
residues, and with a size limit of a building
block set to
residues, the total number of candidate fragments is
=
(
-
+ 1), where
runs the summation from
to
.
A contiguous fragment along a sequence can be specified by two independent variables: size
and position. The position of a fragment can be assigned by either the starting residue or
the residue at the center (as done here). We call this two-dimensional coordinate system
for all contiguous fragments in a given protein structure a ``fragment map''. Note that both
coordinates in a fragment map use a common unit (in residue), although one refers to the size
of the fragment and the other to its position. Now, if a fragment map is associated with a
scoring function that can reflect the conformational stability of each fragment, regardless of
its size, then the local minima of the map are the locations of the building blocks of a given
protein.
The algorithm
The scoring function
The scoring function developed in this study is based on a previous scoring function, that
has been successfully applied to locate hydrophobic folding units (7).
The HFU scoring function has four ingredients: compactness, hydrophobicity, degree of
isolatedness, and number of segments. By definition, a building block has only one segment.
Therefore, only the first three ingredients are used in the current scoring function for
locating building blocks. The scoring function we have designed is fragment-size-independent.
The new function is expressed as a linear combination of the three measurements, with each
quantity calculated as the deviation from the averaged value of known protein structures.
The new building block scoring function, Score
, is in this form:
Where, Z, H, and I are respectively, the compactness, the hydrophobicity, and the degree of
isolatedness of a candidate fragment. (A brief description of their definitions is given in
Methods). The corresponding arithmetic average, X
, and standard deviation, X
,
are determined from a non-redundant dataset of 930 representative single-chain proteins.
Average and standard deviation with superscript 1 are calculated with respect to fragment size;
Average and standard deviation with superscript 2 are calculated as a function of the fraction
of the fragment size to the whole protein. The plots of these twelve statistical values (six
averages and six corresponding standard deviations) are depicted in Figure 1.
The size-independent stability form of the function assumes that fragments of different sizes
have equal averaged conformational stability. However, the linearity of the hydrophobicity (H)
and of the isolatedness (I) in the region of fragment size ;SPMgt; 150 residues, suggest that true
fragmental stability should reflect this trend. Therefore, instead of using statistical
values, the H
and I
in the scoring function are calculated from a fitted straight
line, to reflect the relative size-dependent stability. Figure 2 shows the average scores and
their standard deviations for all fragments in the 930 representative chains as a function of
fragment size, following the modification due to this, relative stability, concern.
The cutting procedure
Step 1: Locating a basket of building blocks (relatively stable contiguous fragments)
For a given protein 3-D structure, all fragment candidates are assigned a stability score calculated by the above building block scoring function. To collect a basket of building blocks, we locate all local minima on the fragment map. We define a local minimum in the simplest way: A local minimum must be the highest value in a defined local region. A local region can be specified in an absolute or relative way. In order to locate compact substructures in a fragment map, Zehfus (18) has defined an absolute local quasi-circle region, with a distance of 4 residues from a candidate fragment. However, we find that a relative local region is more suitable for locating building blocks. We adopt the Zehfus' quasi-circle definition as the area of a local region in our algorithm. However, the radius of the circle is a variable which is 7.5% of the size of a candidate fragment. Thus, for example, the local region of a 100-residue candidate fragment with its termini at least 7 residues away from the termini of the chain, will contain a total of 197 fragments in which either their size or position are within 7 residues of the candidate. To locate all of the building blocks, every fragment candidate is examined to see if it is the highest scoring within its local region as defined above. If a candidate is the highest scoring one, we register it in the basket of building blocks. In the case of actin (19) with 373 residues, 78 local minima with score ;SPMgt; -5.0 have been located out of a total 64,620 fragments. Figure 3 shows a plot of the score surface of the fragment map.
Step 2: A recursive top-down splitting process
Currently it is generally agreed that the folding process does not follow a single pathway. In constructing an anatomy tree, our goals are two-fold: First, an anatomy tree straightforwardly yields the most likely folding pathway(s); And second, we wish to identify the set of the most likely building blocks which, via a process of combinatorial assembly, form the final, native protein conformation. Hence, in our scheme, the anatomy of the protein structure is organized as a tree which grows upside down, with the starting node at the top. The root node is the first, starting node, and it constitutes the entire native protein. Each node represents a contiguous fragment. A node can sprout multiple branches to create child nodes. These are generated via a multi-splicing procedure described below. If a new node does not produce a child, it is an end node. The level of a node is determined by counting the number of steps which are needed to trace back to the root node. The entire tree growth process stops when no new children nodes can be generated. At the end, the collection of end nodes is the set of the most likely building blocks. And, the tree organization itself depicts the most likely folding pathway. Our success in generating such anatomy trees, further indicates that the ``building block'' folding model leads to a hierarchical folding process.
The multi-splicing procedure:
Unlike other top-down binary splicing algorithms (12,13,20), our recursive top-down multi-splicing procedure sets no limit on the number of branches at any level. Starting with a node fragment and a basket of building blocks created above, the search for multiple cutting is very simple: We search the basket for a set of fragments that constitute the entire node fragment. Then, among all combinatorially assembled candidates, the average score of the best two fragments (if there are more than two fragments) is used to rank the dissections for the node fragment. The search algorithm follows several rules: First, a short overlap between building blocks is allowed, with the overlapping segment not exceeding 7 residues; Second, if an unassigned segment is less than 15 residues, it is left unassigned. Otherwise, the segment is assigned to be a low-score building block, not listed in the original building block basket. A short unassigned segment may be a short linker between two large units. A long low-score fragment may be a long fragment linker or a building block that has opened-up; Third, except for the root node, a node can not have only one branch-child node. Fourth, a node is considered as an end node if we can not find two building blocks with scores above a threshold value.
Assembly of hydrophobic folding units:
Furthermore, in addition to the above, at each branching level, we locate the HFUs by combinatorially assembling the collection of building blocks. This procedure is straightforwardly implied from the building block folding model (3). Preliminary results indicate a significant improvement compared to our previous HFU cutting algorithm (7). The detailed procedure will be reported elsewhere (Tsai & Nussinov, unpublished).