Initial Cluster Analysis

Abstract We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha$$ \end{document}, these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.


INTRODUCTION
I n the study of proteins, a variety of methods derive, from analyzing a multiple alignment of related sequences or a corresponding phylogenetic tree, predictions that a specific set of residues within a particular protein are functionally important (Lichtarge et al., 1996;Mihalek et al., 2004;Fischer et al., 2008;Sankararaman and Sjölander, 2008;Janda et al., 2012;Neuwald and Altschul, 2016). When a threedimensional structure is available for that protein, the residues so identified may appear to form one or more spatial clusters perhaps indicative of active sites or other functional modules. How can we identify the set of residues that best constitute such a cluster, and determine whether the cluster is statistically significant or can be explained completely by chance?
Pioneering work on this question was done by Karlin and Zhu (1996), who suggested that orders for the amino acid residues in a protein other than their arrangement along the protein's backbone could be studied. Specifically, given a three-dimensional structure, one could reorder protein residues by their distance from a specific starting residue or from a specific point in space; by their nearest distance to a progressively growing residue set; by their distance to the center of mass of such a set; or by any other well-defined criterion. Karlin and Zhu (1996) proposed seeking clusters of particular residue types within these 1 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland. reordered sequences. We use the first of these reordering methods here, and elsewhere describe several others of utility in various biological contexts (Neuwald, Aravind and Altschul, submitted).
Our general approach diverges from Karlin and Zhu's in three specific ways. First, we focus on clusters of residues that are ''distinguished'' by a computer analysis of multiple alignments, rather than on residues having specific physical-chemical properties such as charge Zhu and Karlin, 1996). This has no formal effect on our approach, only a motivational one. Second, we seek the most surprising initial cluster along a sequence. In other words, we seek to define the best division of a sequence into an initial segment with a high density of distinguished residues, and a terminal segment with a low density. Finally, we assume we are given a fixed number of distinguished residues along a sequence, in contrast to Karlin's approach, which assumes a fixed independent probability for a residue being distinguished. In other words, we derive statistics based on sampling without as opposed to with replacement.
Below we use the Minimum Description Length (MDL) principle (Grünwald, 2007) to formulate and analyze our problem in simple abstract terms, and then illustrate its application to a specific protein sequence and structure.

FORMALIZATION USING THE MDL PRINCIPLE
Assume we are given a sequence of length L consisting of D 1s and (L -D) 0s. We wish to compare the hypothesis H 1 that the 1s cluster near the start of the sequence with the null hypothesis H 0 that the 1s and 0s occur randomly.
The MDL principle defines a theory h as a probability distribution P h over the space of all possible sets of data. The description length (in bits) of a data set S given a theory h is then defined as DL(Sjh) = log 2 [P h (S)]. (Throughout this article, we assume logs to be base 2; when natural logarithms are needed, we use the notation ln.) A model M is a parameterized set of theories, and the description length of S given M is defined as DL(SjM) = min h2M DL(Sjh). The MDL principle asserts that given multiple models to explain S, one should prefer the model M that minimizes DL(SjM) + COMP(M). Here, the description length or complexity COMP(M) of a model M is the log of the number of effectively independent theories M contains. A central element of MDL theory is the formal definition of COMP(M), and its calculation for specific models.
2.1. The description length of S given H 0 , and the complexity of H 0 For our purposes, we take H 0 to be the model consisting of a single Bernoulli trial theory for generating S, with the probability of a 1 taken as P = D=L, and the probability of a 0 taken as Q = 1 -P. We then have which is L times the entropy of the Bernoulli trial. Because H 0 contains only one theory, its complexity is COMP(H 0 ) = log (1) = 0. Note that, more formally, we could treat P as a parameter estimated from the data, in which case the complexity of the resulting single-parameter model would be approximately 1 2 log (pL=2) (Grünwald, 2007). However, because we will assume D to be fixed in our model H 1 as well, where its indeterminacy would add a similar complexity, we may avoid the complication of treating D as variable when we wish simply to compare H 0 to H 1 .

The description length of S given H 1
The hypothesis H 1 may be understood as a single-parameter model, whose parameter x describes the location of a cut at a discrete point from 1 to L -1 along the sequence S, thereby dividing it into an initial segment S 1 of length x, and a terminal segment S 2 of length y = Lx. If S 1 contains D 1 1s, and S 2 contains D 2 = (D -D 1 ) 1s, assume at first that S 1 is generated by Bernoulli trials with maximum-likelihood probability P 1 = D 1 =x for a 1, and S 2 is generated by Bernoulli trials with probability P 2 = D 2 =y for a 1. Naively, given a particular fixed value for our parameter x, the probability for the data S would seem to be However, this equation estimates P 1 (and therefore implicitly P 2 ) by maximum-likelihood from the data S, so K x is not a probability distribution over the space of all possible data. To avoid adding a second 122 ALTSCHUL AND NEUWALD parameter and attendant complexity to the model H 1 , we may define the probability P x for S as its normalized maximum likelihood (NML) (Grünwald, 2007). This is simply where Z is the sum of K x (S) over all length-L sequences having D 1s. Note that using counting formulas, one may calculate Z using at most D + 1 terms, corresponding to D 1 equal to 0 to D. The description length of the data S under H 1 is then DL(SjH 1 ) =log [max x P x (S)].

The complexity of H 1
We still need to calculate the complexity of the model H 1 , which is the log of the effective number of independent theories it contains. Intuitively, the problem is that while P x is a probability distribution for any fixed x, we allow x to take on values from 1 to L -1, each of which will yield a P x (S). When we select the maximum, P X , it again becomes a likelihood, and we need to discount P X (S) for the multiple trials implicit in the L -1 possible choices for x. The Bonferroni correction, which simply multiplies P X (S) by L -1, is much too conservative, so we seek the number I of effectively independent values of x. For one-parameter models parameterized by a continuous x, this is given by where J x is the Fisher information associated with parameter x (Grünwald, 2007). However, because our x is discrete, we will study a continuous analog to our problem to approximate J x and I. Specifically, we take x as continuous, with domain (0‚ L), and the number of 1s observed in the initial and terminal segments to be Poisson distributed. The NML probability for observing D 1 1s in the initial segment [and therefore D 2 = (D -D 1 ) 1s in the terminal segment] is then given by where k 1 and k 2 are the Poisson parameters associated, respectively, with the initial and terminal segments, and Z is a normalization constant. Our assumption is that there are different probability densities L 1 and L 2 for the occurrences of 1s in the initial and terminal segments, so that when we vary x we can write for constant L 1 and L 2 . Combining Eqs. (5) and (6), we can view f (D 1 ) as a function of x, and derive the Fisher information for the parameter x (Grünwald, 2007) from where E[:] denotes the expected value. We begin by observing that where C is a constant, so that Our primary use for J X is to calculate the number I of effectively independent values of x. Substituting J x from Eq. (10) into Eq. (4), but returning to discrete x to obtain a sum, we get It is simple enough to calculate the sum. However, since the sum's terms approach 1=x and 1=y for small x and small y, respectively, for large L the sum itself should approach 2 ln (L) + C, or alternatively 2 ln (KL), for some constants C and K. Computation shows that even for quite small L the sum may be very closely approximated by 2 ln (1:024L), yielding In the above analysis, we have made no distinction between cuts for which the initial segment has a greater density of 1s than the terminal segment, and those for which it has a lesser density. For the applications we envision, however, the latter case has no ready biological interpretation, so we wish to limit H 1 to the former case. Over the space of sequences, this discards half of all cuts H 1 allows, so when we impose this restriction we can effectively write COMP(H 1 ) = log (I=2).

Flattening Jeffreys' priors
One subtlety is that the approach we have taken above is essentially equivalent to a Bayesian approach, in which Jeffreys' priors are assigned to parameter x ( Jeffreys, 1946;Grünwald, 2007). Jeffreys' priors are proportional to the square root of the Fisher information, or in our case to ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1=x 2 + 1=y 2 p . Inspection shows this function to be U-shaped, diverging as x or y approaches 0, and with a minimum at x = L=2. However, there is no biological reason to expect the location of real cut-points to follow such a prior. Indeed, as we will illustrate below, the approach so far described has a tendency to trim or extend biologically significant clusters unduly. For most applications it would be much more appropriate to specify flat or uniform priors for x in place of Jeffreys'. We may attempt to do this by maximizing not P x (S) of Eq. (3), but rather This evidently favors intermediate values of x at the expense of extreme values and is a move toward rendering the density of independent theories flat in x. (Although this approach does not have rigorous theoretical support, we will show below, by random simulation, that it is reasonably effective.) Following through, we define the ''generalized description length of the data'' as When this is done, we conjecture that the corresponding formula for the ''generalized number of independent theories'' is given by As before, because we restrict H 1 to initial segments with a greater density of 1s than terminal segments, we get a generalized complexity COMP Ã (H 1 ) = log (I Ã =2). We will refer to this somewhat heuristic approach as using ''Flattened'' as opposed to Jeffreys' priors, and explore its potential advantages below.

RANDOM SIMULATION
The MDL principle says that we should prefer H 1 to H 0 when DL(SjH 1 ) + COMP(H 1 ) <DL(SjH 0 ) + COMP(H 0 ). Treating each hypothesis as equally likely a priori, we may view the difference D between the two sides of this inequality as a log-odds ratio, and use the logistic function e D 1 + e D to convert this into a p-value; see p. 37 of Durbin et al. (1998). For any S for which H 1 is preferred, there is a particular X that optimizes Eq. (3), and a corresponding number D 1 of 1s in the initial segment. To elucidate the effect of Jeffreys' priors, we generated 10 6 random sequences of length L = 601 (thus allowing X to range from 1 to 124 ALTSCHUL AND NEUWALD 600), each having D = 75 1s, and recorded the optimal X for all sequences for which H 1 was preferred to H 0 . We present in Panel A of Figure 1 a histogram (light bars) of the percentage of these Xs falling into each of 20 bins (i.e., 1-30, 31-60, etc.). Due to the shape of Jeffreys' priors, bins with X near the extremes are strongly preferred. The asymmetry arises from our requirement that the initial segment have a greater density of 1s than the terminal. Applying Flattened rather than Jeffreys' priors to the identical sequences yields the alternative histogram (black bars). Although not flat, this histogram diverges, on average, much less from a uniform 5% in each bin (indicated by the dotted horizontal line) than does the Jeffreys' histogram.
In general, the choice of particular Bayesian priors becomes irrelevant for strong data, but can have a noticeable effect when data are weaker. For illustrative purposes, we consider artificially skewed sequences, also with L = 601 and D = 75, but with 35 1s placed randomly within an initial segment of length 200, and 40 1s placed randomly within a terminal segment of length 401. Generating 10 6 such sequences, we used alternatively Jeffreys' and Flattened priors to find optimal Xs and D 1 s; panels B and C of Figure 1 show the distributions of observed Xs and D 1 s, respectively. Both Jeffreys' and Flattened priors are most likely to return an X within bin 7 (i.e., from 181 to 210) (Fig. 1B), which contains the ''true'' X = 200, as well as to return the ''true'' D 1 = 35 (Fig. 1C). However, Jeffreys' priors have a marked tendency to trim or extend the ''true'' result. For 13.3% of sequences it returns an X 60 and for 3.4% an X ! 541, compared to 2.1% and 0.7% for Flattened priors. Relatedly, for 7.5% of sequences it returns D 1 5 and for 4.6% D 1 ! 71, compared to 0.2% and 1.5% for Flattened priors. In addition, for this example Jeffreys' priors return, on average, somewhat less significant results: 62.5% of sequences with a p-value 0:1, as opposed to 86.4%, and 8.6% with a p-value 0:01, as opposed to 11.4%. For ''true'' cuts quite close to the sequence boundaries, however, Jeffreys' priors should have an advantage.
Given that Eq. (4) for I is an approximation, that we have made a further approximation in passing to a continuous analog of our problem to calculate J x , and that Eq. (15) for I Ã is conjectural, it is worth testing the accuracy of our analysis by random simulation. For L = 250 and L = 500, and for each of D = 12, D = 25, and D = 50, we generated 10 8 random sequences, and maximized alternatively P x ( Jeffreys' priors) and R x (Flattened priors). In Figure 2 we plot, for P from 1 to 10 -7 , the observed proportion P Ã of sequences with nominal p-value P. In these log-log graphs, theory is represented by the dotted lines with slope 1. The calculated p values tend to be somewhat conservative, that is, larger than the experimental ones. However, for L = 500, P Ã differs from P, within statistical error, by 33% when P 0:001, and for L = 250 by 39%.
The conservative nature of our calculated p-values can be partially understood by noticing that the Poisson model we use to derive the Fisher information, and thus to calculate I, allows an arbitrary number

126
ALTSCHUL AND NEUWALD of the D 1s to occur before any cut, whereas in fact at most X 1s may occur before the cut X, and symmetrically at most Y 1s may occur after such a cut. This aspect of our approximation should on average yield a greater error the larger D is with respect to L, as we observe in the examples here, as well as in others not shown. A simple heuristic correction that renders the calculated P on average in closer agreement with P Ã is to multiply either I or I Ã of Eqs. (4) and (15) by 1r, where r % D=L is the proportion of (D 1 ‚ X) pairs allowed by the Poisson model but excluded by the discrete considerations that D 1 X and D 2 Y. The resulting corrected values of P remain on average conservative but, in all cases considered, now differ from experiment by 28% when P 0:001. We use this correction when calculating p values in the following section.

APPLICATION
To apply our theory to a particular protein we require, (a) a criterion for ''distinguishing'' particular amino acid residues, which we then represent as 1s, with all others represented as 0s, and (b) a criterion for ordering these residues, which in general does not correspond to the order imposed by the protein's backbone. Typically we distinguish residues based on a prediction of their importance or relevance, which may arise from the analysis of a multiple alignment, and we order residues based on some definition of structural distance, possibly from a fixed point in space. We may allow a particular, selected residue to provide such a point, in which case this ''index'' residue is omitted from the sequence.
As an illustration, we here analyze the P-loop GTPase domain (Hall, 2000) of the elongation factor eEF1A. GTPases constitute a large, functionally diverse protein superfamily (Leipe et al., 2002), for which hundreds of thousands of sequences are currently available. We aligned 127,418 GTPase domain sequences that were < 95% identical and partitioned these sequences into subgroups, based on the particular patterns of residues characteristic of each subgroup (Neuwald, 2011(Neuwald, , 2014. eEF1A was assigned to a set of translation initiation and elongation factors (TIEFs), and we used this subgroup's pattern of characteristic residues to distinguish D = 20 of eEF1A's L = 158 residues. Next, we used the structure of yeast eEF1A bound to the guanine nucleotide exchange factor (GEF) eEF1Ba (pdb: 1g7c) (Andersen et al., 2001) to order eEF1A's residues, as follows. GEFs catalyze the exchange of guanosine diphosphate (GDP) bound to the GTPase domain, a process for which disruption of the GTPase Mg + + binding site is believed to play a critical role. A lysine residue (K205) in eEF1Ba appears to play a critical role in nucleotide exchange: K205 inserts into the Mg + + binding site of eEF1A and is lethal when mutated. For this reason, we ordered eEF1A's residues by their distance from K205 of eEF1Ba.
Applying our approach with Flattened priors to the resulting sequence of 0s and 1s yields a cut at X = 88 with D 1 = 20, that is, with all 20 distinguished residues in the initial segment; this division is highly significant, with p-value 5:4 · 10 -6 . If we distinguish instead the D = 20 residue positions that are most FIG. 3. Initial cluster analysis of residues within the yeast elongation factor eEF1A GTPase domain bound to the nucleotide exchange factor eEF1Ba (pdb_id: 1g7c). Color scheme: eEF1A GTPase domain, green; eEF1A switch I and II regions, brown; eEF1A domains II and III, gray; eEF1Ba, marine blue; GMP, cyan; side chains of TIEF-specific and GTPase-conserved residues included in initial clusters, red and yellow, respectively (two side chains common to both clusters are colored red). K205 of eEF1Ba and G70 of eEF1A, which were used as (alternative) focal points, are indicated. GMP, guanosine-5 0monophosphate; TIEF, translation initiation and elongation factor. distinctive of all P-loop GTPases, we find an optimal cut with X = 66 and D 1 = 17, and p-value 1:1 · 10 -6 . Jeffreys' priors yield the same cuts in both cases, although with different but still highly significant p values.
In Figure 3, we show these clusters within the eEF1A-eEF1Ba crystal structure. The 20 TIEF-specific residues and 17 GTPase-specific residues surround the bound eEF1Ba and map to the P-loop and switch I and II regions, which undergo functionally relevant conformational changes (Hall, 2000) associated with both sensing GTP versus GDP and with nucleotide exchange. Both K205 and the bound Mg + + ion are positioned near both the active site (generally conserved in all GTPases) and the switch I and II regions, which sense and transmit to other cellular components whether the domain is bound to GDP or to GTP. These facts suggest why residue sets characteristic for TIEF and for GTPase sequences should cluster near K205 of eEF1Ba.
To illustrate how Flattened and Jeffreys' priors can yield different results, we performed the same TIEFspecific analysis, except with residues ordered by their distance from G70 of eEF1A rather than from K205 of eEF1Ba, shifting the focus of analysis to the switch I region. (The index residue G70 is accordingly ignored, yielding D = 19 and L = 157.) Jeffreys' priors return a cut with X = 7, D 1 = 7, and p = 2:4 · 10 -7 , whereas Flattened priors return a cut with X = 60, D 1 = 18, and p = 3:3 · 10 -7 ; see Figure 4. In this case, Jeffreys' priors favor a cut near the start of the sequence, but Flattened prior eliminates this bias and, notably, identifies nearly the same cluster as before. Both cuts and their associated clusters have biologically relevant interpretations, although with somewhat different focuses.

CONCLUSION
Given a string of 0s and 1s, we have developed two methods to divide it into initial and terminal segments, with high and low concentrations of 1s, respectively, and to assess whether these segments differ significantly in their concentrations of 1s. The first method finds the maximum-likelihood division, but has a tendency to cut the sequence near its ends, which arises from the implicit use of Jeffreys' priors for the cut location. The second method seeks to counteract this tendency by flattening the implicit priors. C code implementing these methods is available from the authors on request. We have illustrated these methods by using them to identify statistically significant spatial clusters among residues that distinguish translational initiation and elongation factor GTPases from other P-loop GTPases. We will describe extensive applications elsewhere (Neuwald, Aravind and Altschul, submitted). Potential applications of our approach extend beyond protein sequence/structural analysis. One such application, for instance, would be to look for a significant association between a particular human microbiome-associated disease and RNA-Seq expression levels of a candidate microbial gene postulated to be responsible for the disease symptoms. In this case, both symptomatic and asymptomatic subjects could be ordered from highest to lowest levels of microbial gene expression; significant clustering of symptomatic subjects near the start point would indicate an association. Additional biomedical applications may similarly be devised.