Application of Subspace Clustering in DNA Sequence Analysis

Abstract Identification and clustering of orthologous genes plays an important role in developing evolutionary models such as validating convergent and divergent phylogeny and predicting functional proteins in newly sequenced species of unverified nucleotide protein mappings. Here, we introduce an application of subspace clustering as applied to orthologous gene sequences and discuss the initial results. The working hypothesis is based upon the concept that genetic changes between nucleotide sequences coding for proteins among selected species and groups may lie within a union of subspaces for clusters of the orthologous groups. Estimates for the subspace dimensions were computed for a small population sample. A series of experiments was performed to cluster randomly selected sequences. The experimental design allows for both false positives and false negatives, and estimates for the statistical significance are provided. The clustering results are consistent with the main hypothesis. A simple random mutation binary tree model is used to simulate speciation events that show the interdependence of the subspace rank versus time and mutation rates. The simple mutation model is found to be largely consistent with the observed subspace clustering singular value results. Our study indicates that the subspace clustering method may be applied in orthology analysis.


INTRODUCTION
I dentification of orthologous relationships among protein and nucleotide sequences is of wide interest for biological sequence analysis because such information provides insight into the molecular function and evolutionary history of these sequences. To date, many mathematical techniques have been utilized intensively and extensively in biological sequence analysis to perform diverse functions, ranging from sequence alignment to sequence clustering (e.g., survey Andreopoulos et al, 2009).
From a computational perspective, the two common methods of clustering of orthologous genes are three-way triangle relative reciprocal similarity and graph clustering used in NCBI clusters of orthologous groups (COG) and orthoMCL works (respectively in Wolf et al., 2012, andLi et al., 2003). Both of the aforementioned methods are well represented in the literature as used in orthologous protein classifications. The development of additional mathematical tools and theory for analysis of related sequences is an area of burgeoning interest. Examples may be found in the literature, such as in Kim and Lee (2006) and Viswanath and Madabhushi (2012).
We propose a new approach based upon a hypothesis that under evolutionary constraint, descendant sequences from a common ancestor share a mathematical subspace, as opposed to ambient space. In particular, we hypothesize that member nucleotide or peptide sequences lie within low dimensions within the ambient space of all sequences. Therefore, the subspace dimensions are a reflection of mutation and time elapsed since the speciation or duplication within the cluster.
In order to investigate the hypothesis, several sets of experiments and estimates of the p-values for the the null-hypothesis are computed. The nucleotide sequences data were taken from the reverse mapping provided by NUCOCOG, that is, the nucleotide sequences-based COG database of Meereis and Kaufmann (2008), relating the works of Wolf and Makarova (2012) at the NCBI. A simple interface was written to extract the raw nucleotide sequences for a given cluster identification number. The subspace clustering was initially performed on the nucleotide sequences. However, the subspace clustering was found to produce similar results with the forward mappings to amino acids forming proteins. The accuracy was found to be slightly less than the nucleotide clustering, presumably due to the degeneracy of the forward mapping, which may discard retained inheritance elements present in the nucleotides.

METHODS
The following section contains an overview of the methods used to estimate the dimensional rank of the subspaces of COG sequences and the principle angles distribution. The population sample data used in this study come primarily from the first M s = 200 orthologous clusters of the NUCOCOG database reverse mapping to the NIH NCBI COG database.

Subspace considerations
In this work, it is preferred to obtain a reasonable estimate of the dimensions of COG subspaces at hand, since many of the common methods assume constant equal subspace dimensions in the models. The fundamental aspects considered toward estimating the rank of the subspace included review and analysis of the principal angles and singular values distributions.
2.1.1. Rank estimation of a single orthologous group. Let a set of unaligned nucleotide sequences form a matrix W j 2 C m j · M for the j-th experiment comprising m j sequences containing, at most, M base pairs of a prospective orthologous cluster that is to be analyzed. To estimate the effective rank of W j , the modal selection algorithm of Yan and Pollefeys (2006) was initially used to estimate the rank r as follows: where r j is the j-th singular value and j is a suitable constant. The foregoing method was repeated to estimate the distribution of ranks across a sample of the first M s = 200 orthologous clusters from the NUCOCOG database.
2.1.2. Distribution of principle angles. The principle angles between orthologous subspaces is defined the smallest principal angle between two subspaces S i and S j , denoted by the matrix element h i,j , is defined as The angular distribution was estimated by computing a subspace basis estimate for each COG and then computing the principal angles from the above equation.
where A k 2 C m k · M , M is the adjusted length of the nucleotide or protein sequence, r is the rank approximation of the singular value decomposition (SVD), m k is the number of sequence in the COG of interest, and { denotes the Hermitian adjoint. The subspace basis is constructed as the first-most r column vectors U m k · r 2 C m k · r , and the singular value matrix S 2 C r · r where the k-th subspace basis is chosen as S k = V y 2 C r · M . The subspace dimension is seen to be dim(S k ) = r k . In this work, the subspace basis is computed by the LSA algorithm of Yan and Pollefeys (2006).

Sequence alignment.
Prior to performing the subspace analysis, the randomly selected cluster nucleotide sequences are aligned to ensure best common basis representation. This alignment corresponds to a relative energy minimization argument and is also a best effort attempt to allow for mutations, which may include insertions and/or deletions. The multiple sequence alignment used in this study was the MAFFT code in Katoh and Standley (2013) and Chenna et al. (2003), compiled and optimized for use on the Intel i7 processor under 3.12.x baseline Linux kernel. The goal is to perform the alignment for each W j matrix as described below.

Probability of error estimates
In order to estimate the statistical significance of the experimental outcomes, an approximation is employed to compute the probability of the experimental data and classification outcomes. Consider the probability of randomly classifying k-groups into a set fm 1 ‚ . . . ‚ m k g where P k i = 1 m i = N j items 8 j 2 1‚ . . . ‚ 30 ½ experiments. Let the known membership identification classification cluster vector be defined as c = c 1 ‚ . . . ‚ c N ½ . The probability of randomly selecting the unique set of cluster memberships c 1 ‚ . . . ‚ c N j among k-groups from a total of N j sequences is needed for the j-th random experiment.
Let the random vector S j = X 1 ‚ . . . ‚ X N j È É be defined as the set of random variables for the gene sequence memberships of the j-th experiment, and the group index element c i be defined as a mapping of the known group id such that c i 2 [1‚ . . . ‚ k]. The classification outcome probability is a binary event, where the probability of random success is p s = 1/k and the probability of failure is 1p s = k -1 k . To estimate the p-value tail due to one or more errors in classification results, the estimate for probability of n j -classification errors (order is not important here) and N jn j correct for total of N j sequences of k-cluster groups becomes the binomial as follows: where probability of success is p s = 1/k. It is noted that for the experiments below, the error in the tail about the outcome is estimated as the probability of n j -errors or less. The missrate for the experiments was computed using the LSA code distribution from the online resource of Johns Hopkins University Vision, Dynamics, and Learning Lab.

Subspace dimensions in the model tree
In order to investigate the relationship of the subspace dimensions to the phylogenetic history, a simple randomized binary evolutionary tree was developed. Since the orthologous trees are inferred (predicted), the analysis here uses simulations of a binary randomized evolutionary tree, where each speciation branch is a random outcome of an M-component random vector v i 2 Z M 4 , where each component is comprised of a binary random variable product (probability of mutation) with a uniformly randomly selected nucleotide of Z 4 (mutation).
2.3.1. Simple random binary mutation tree model. In order to investigate the vector properties and singular values, a mathematical construction is used for the k-th nucleotide sequence vector s k in Z M 4 for Z 4 = f0‚ 1‚ 2‚ 3g to represent a particular sequence in the above model balanced binary random mutation tree:
where v 1 = s CA is the common ancestor at the root of the tree; v m = s ms parent(m) is the m-th column vector of matrix V, which represents the mutation difference vector from the parent s parent(m) to child s m , where the parent(m) function returns the parent branch pointer of the m-th child; and C m,k are scalar coefficients. An example of a simple binary model tree is shown in Figure 1. It should be noted that in reality, phylogenetic trees are generally not perfect, binary, or symmetric, due to incomplete data, missing evolutionary paths, differing branch lengths, or other reasons.
If each species branch is comprised of the sum of all speciation and mutation vectors in the path from the common ancestor, it is then possible to simulate and solve for the singular value decomposition (SVD) or eigenvalue decomposition (EVD) of the random binary tree of depth d having 2 d + 1 -1 = N s species vectors in Z M 4 ring for all N s nucleotide sequences of M components. For example, compute 2 (2 + 1) -1 = 7 speciation vectors, which also include the root common ancestor in counting. Let v 1 = s ca be the common ancestor, v 2 = s 2v 1 be the speciation mutation vector v 2 from s ca to the path leading to species s 2 , and so on. By inspection of Figure 1, the resulting set of speciation nucleotide random vectors comprising the columns of matrix S 2 R M · N s may be written in terms of an upper triangular matrix form as follows: where . . ‚ 7] are column vectors, and matrix C 2 R M · N s contains the selective inheritance coefficients C m,k of the column vectors v m , which comprise the species column vectors of matrix S. Mathematically, each random component of the mutation vector is represented by a product of a Bernoulli mutation event B k‚ j~fb (K; and a uniformly distributed nucleotide

SUBSPACE CLUSTERING IN DNA SEQUENCE ANALYSIS 943
random outcome X k,j˛{ 0,1,2,3}. Nucleotides are represented as mathematical rings as needed. Formally, let the random mutation vector be defined as It should be noted that by the above definitions, To simplify the analysis, it is assumed that the statistical expected magnitude of the speciation vector distance of mutation is directly proportional to the elapsed time period between the speciation node of the tree from the parent to child. In addition, a simple constant rate c /~m utations BPÁgeneration of mutation is assumed, relative to a constant molecular clock rate. Mathematically, the aforementioned assumptions allow estimation of the expected number of mutations or differences from child to parent sequence as where s m 2 Z M 4 is the m-th species orthologous vector, g m is the number of generations per unit time, M is the total number of base-pairs for the sequence, t m is the elapsed time from parent to child, and p b is the effective probability of mutation change or difference associated with the binary random speciation event over the period t m ; the p = 0 norm is employed here. Dimensional analysis of the above shows the temporal rate of mutations per speciation event to be where BP is the number of base-pairs in the gene sequence of interest. In the simulated results for the binary tree of this study, the mutation vectors are generated from binomial distribution where the initial probability of mutation is set to p b = 0.1 for a mutation rate of 10 percent of the orthologous sequence and increased in 10 percent increments to p b = 0.5.
2.3.2. Stochastic phylogenetic tree model. In order to investigate the secondary hypothesis, a simplified statistical binary tree whose branches are comprised of random variables (RV) for the elapsed time was developed. Although not used here, a more extensive model may allow (1) each speciation event to be an m-fold generalization of the binary tree, which may also be an empty branch, and (2) the lengths of the speciation branches would have unequal lengths with random variable outcomes.
Using the foregoing simple mathematical model and expectation results, it is noted that the expectation of any pair of vector dot products from two randomly created binary trees A, B is statistically estimated as follows: In the above simple binary mutation tree treatment, the IID random variable vectors components s j have zero mean and are independent. If the root common ancestor sequence vectors are randomly generated with similar IID variates, then statistically, the species vectors of tree A are uncorrelated with the vectors of tree B. Two general outcomes are noted as follows: case I orthologous genes of differing common ancestors lie within independent subspaces if and only if the common ancestor vectors are uncorrelated. The trees do not intersect in expectation. caseII recent descendants of common ancestors may still retain nonvanishing similarity from inherited common ancestor vector components. The trees intersect.

Experimental approach
In order to test the subspace mutation tree relationship hypothesis, we seek to cluster sequences from the known COG database. Toward this end, two randomized experiments were conducted to test the primary hypothesis using the COG database and sets of thirty trials each containing unknown mixtures of sequences belonging to K˛{2, 3} orthologous groups. It should be noted that the experiments permit both false positives and false negatives. Also, the two and three groups correspond to two and three subspaces, which conceptually compare to two and three motion classification tests presented in Tron and Vidal (2007).
The approach is as follows: (1) Estimate the subspace dimensions of the first M s x200 clusters from the NCBI COG by computing the singular value decomposition (SVD) in Akritas et al. (2006) of j-th cluster and computing the rank r j using Equation ( where the rows of W j contain the nucleotide sequence set for the j-th experiment. (c) Randomize the order of rows of W j again using uniform random distribution. (4) Align W j matrix by nucleotide sequences (rows), with preference given to global alignment algorithms using codes such as MAFFT.

Detailed algorithm
Samples were drawn at random uniformly from the first M s = 200 COG's of the nucleotide groups of the NUCOCOG reverse mapping provided by reference database Meereis and Kaufmann (2008) and Meiler et al. (2012) as used in implementation of the methods in Algorithm (1). The preliminary results reported here are in agreement with the three-way mutual bidirectional best hits (BBH) genomic nearest match forming the basis of a protein cluster triangle. The general method is shown in Algorithm (1)

RESULTS
In this section, the average subspace dimensions are obtained from the rank estimates of the first M = 200 clusters of our sample in this study. Representative results of the subspace clustering algorithm are presented. The singular value decomposition of simulations and rank estimates for the simple binary model tree have distributions consistent with the subspace hypothesis.

Estimation of parameters
The estimated rank for the first M = 200 cluster samples is shown in Figure 2a, where the rank estimation method is further illustrated as shown in Figure 2b. Figure 2c shows the subspace angle distribution of same aforementioned cluster samples.

Experimental objectives
The experimental data sequence analysis methods used here are designed to evaluate the subspace algorithm hypothesis for COGs of proteins or DNA nucleotide sequences with presumed known truth from the NIH COG database. The objectives include consideration of membership, similarity, principle angles, ensemble distribution, and/or other aspects.
It is important that the experimental design allows for both false positives and false negatives, since a primary objective is to estimate the error rate and the statistical significance of the experimental results.

WALLACE ET AL.
Plots of the p-values for each experiment are to be computed and considered relative to the null hypothesis. The primary concern is to evaluate the stated hypothesis and related prediction: descendants of common ancestor orthologs diverge over elapsed time within subspaces. If the method demonstrates feasibility, the knowledge may be utilized within the context of future sequence analysis tools or methods. As such, the analysis herein is not a replacement for the existing COG method at this time.

Computational considerations
The data matrix of interest is a three-dimensional tensor of the form and size 1000-4000 (nucleotide base pairs) · 50 -100 (orthologous genus species taxon of interest) · 2 -200 (orthologous protein groups).
Several observations regarding the challenges of existing methods in COG are included below.
(1) Direct application of linear algebra is complicated by mutation events, such as insertion and deletion. However, preliminary results suggest potential for a hybrid model using subspace techniques combined with traditional methods published and demonstrated in the NIH COG series of Makarova et al. (2007).
(2) Real-world data sets are often incomplete, contain noise, and have other outliers due to errors and uncertainties and potential false positives and false negatives. These aspects generally complicate the process of modeling and inference, which may be significant. The experiments discussed in this study do not contain known nonmember outlying sequences. Nonmember outliers may be addressed in a future effort.

Experiments
Randomized simulations were performed using two and three clusters for R = 30 experiments each. A typical example for a single experiment for M = 3 is shown in Figure 3a. The average error for two orthologous gene clusters was estimated to be about 5%. The average error for three orthologous gene clusters was about 13%. Typical experimental simulation classification error results are shown in Figures 3b and c.

Statistical significance
Preliminary review of the estimated experimental errors and probabilities suggests that the orthologous sequences are indeed members of a common subspace, since the observed correct classification outcomes (with shown error rates) would occur due to random selection on average with probability estimate p r 50:001 in the case of 30 experiments using 2 orthologous clusters. The case for random successful classification (with shown error rates) of 3 orthologous clusters in 30 experiments has an average probability estimate of p r 50:001, but with two-fold increased success errors. The latter result suggests successful selection of three COGs at random is unlikely.

DISCUSSION
In our small orthologous population sample study, we observe that the common ancestors are sufficiently uncorrelated as to significantly preserve the inherited vector components that allow the LSA (Zappella et al., 2011) algorithm to perform the subspace classification of descendants relative to their common ancestors. A larger study would be needed to ascertain the limiting degree of applicability of the observed distinctiveness of the common ancestor orthologous sequences.

Supplemental results for simple model tree
This section includes results from simulation and analysis of the simple binary model tree. The characteristics of rank, tree depth, and number of nodes are consistent with the main hypothesis and are included here for reference. It should be noted that operations such as dot products and projections are generally performed after alignment. Also, the ' 2 or ' 0 distances have been used in the literature (Kim and Lee, 2006;Viswanath and Madabhushi, 2012). 4.1.1. Idealized model tree. The simple randomized binary mutation tree was generated for mutation rates p b ranging from 0.1 (10%) to 0.5 (50%), and the singular values were computed. The simple model simulation results are shown in Figure 4a. The distribution of singular values is sharply concentrated for 0.1 mutation rate but spreads out significantly for the higher mutation rate. These results are consistent with increasing effective rank versus mutation rate. The cumulative relative sums over the singular values of the same results are shown in Figure 4b. The effective rank may be estimated by the 90% area under the curve, which gives ranks ranging from circa 8 to 18, with the larger ranks correlating with larger mutation rates.

Singular values versus binary tree depth.
Simulations using the random binary tree evolution model were run over a range of depths. For each complete balanced random tree, the rank was estimated using Equation (2.1), as shown in Figures 2a and 5a. The dependence of estimated rank versus binary tree depth is noted in Figure 5b  observed to be dependent upon similarities among the sequences. In particular, it is observed that the mutation rate product with the number of sequence replication is proportional to the estimated rank dimensions of the binary tree. This is consistent, since a tree with very high rate of mutations would produce species that were all significantly different, and the full rank of the sequence matrix would be required for the subspace representation. In addition, a perfect binary tree shows increasing rank versus depth, as illustrated in the simulations presented. The subspace nature of orthologous groups in general is believed to have a connection within the molecular aspects of divergent evolutionary processes. In particular, the simple mathematical model binary tree simulations indicate descendants from two uncorrelated unique common ancestors are statistically unlikely to intersect, and therefore, the subspaces of binary trees are not expected to intersect. It is important to note that statistically, the mathematical model allows exceptions and variance about the mean, which is consistent with exceptions that are expected in nature and biology. However, the reader should be mindful that the binary tree is generally too simple for most orthologous sequence relationships, and is primarily used here as a means for analysis. 4.1.5. Model tree simulations. Based upon the results in Figure 1, the estimated dimensional rank of the subspace S j follows a relation with the height of the tree, h, as anticipated, which is observed to be a monotonically increasing function over the data range. The subspace dimension for each orthologous group could be the estimated value as determined empirically using Equation 4.1.
In addition, the simulation data are consistent with the following observation: rank estimates for the perfect complete binary tree are dependent upon the rate of mutation and the elapsed time, which give rise to the length of the branches. The larger the rate of mutations per generation for a given unit of time, the more quickly descendants diverge from the common ancestor, and the more significant the members of the branches become in the singular value decomposition, thereby increasing the rank approximation value [Eq. (2.1)]. In summary, the observed results support the hypothesis relating the mutation events in the phylogenic tree to the subspace constraint.

CONCLUSIONS
The experimental results are consistent with the hypothesis that similar genes of various related organisms lie within relatively small subspaces. Randomized experimental simulations were performed using two and three clusters for R = 30 experiments each. The average p-values of each series were found to be p/0.001 and support the subspace hypothesis. As a consequence, the results suggest that multiple orthologous genes comprise a union of subspace representation. It is expected that other genes of an organism's genome do not lie closer to the orthologous cluster subspace than does the COG member. However, the degree of uniqueness should be verified independently. Future efforts should improve methods of alignment and selection of focal regions. For example, it is expected that use of critical regions, such as maximum entropy or centroids of the sequences, would provide faster performance with regard to alignment and subspace segmentation. Also, early testing using GPCA showed promise for improved performance, but further investigation is needed. It is envisioned that the use of the subspace clustering algorithms and concepts may provide an additional means (in addition to NIH COG and OrthoMCL) for theoretical and applied classification of related sequences.

APPENDIX
A brief overview and discussion of subspace clustering and related theory is presented below with references from the literature.