Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.


Introduction
The FM-index, which is a compressed subsequence index based on Burrows Wheeler transform (BWT), is the primary data structure majority of short read aligners -including Bowtie [19], BWA [13] and SOAP2 [21].These aligners build a FM-index based data structure of sequences from a given genomic database, and then use the index to perform queries that find approximate matches of sequences to the database.And while these methods can easily index one or a few human genomes, they do not scale well to indexing the databases of thousands of genomes.This is problematic in analysis of the data produced by consortium projects, which routinely have several thousand genomes.
In this paper, we address this need by introducing and implementing an algorithm for efficiently constructing the FM-index.This will allow for the FM-index construction to scale to larger sets of genomes.To understand the challenge and solution behind our method, it helps to examine the two principal components of the FM-index: first a rank data structure over the BWT of the string that enables us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given pattern (and, thus, compute how many such occurrences there are); second, a sample of the SA that, when used with the rank data structure, allows us access the SA (so we can list those starting positions).Searching with an FM-index can be summarized as follows: starting with the empty suffix, for each proper suffix of the given pattern we use rank queries at the ends of the BWT interval containing the characters immediately preceding occurrences of that suffix in the string, to compute the interval containing the characters immediately preceding occurrences of the suffix of length 1 greater; when we have the interval containing the characters immediately preceding occurrences of the whole pattern, we use a SA sample to list the contexts of the corresponding interval in the SA, which are the locations of those occurrences.
Although it is possible to use a compressed implementation of the rank data structure that does not become much slower or larger even for thousands of genomes, the same cannot be said for the SA sample.The product of the size and the access time must be at least linear in the length of the string for the standard SA sample.This implies that the FM-index will become much slower and/or much larger as the number of genomes in the databases grows significantly.This bottleneck has forced researchers to consider variations of FM-indexes adapted for massive genomic datasets, such as Valenzuela et al.'s pan-genomic index [33] or Garrison et al.'s variation graphs [7].Some of these proposals use elements of the FM-index, but all deviate in substantial ways from the description above.Not only does this mean they lack the FM-index's long and successful track record, it also means they usually do not give us the BWT intervals for all the suffixes as we search (whose lengths are the suffixes' frequencies, and thus a tightening sequence of upper bounds on the whole pattern's frequency), nor even the final interval in the suffix array (which is an important input in other string processing tasks).
Recently, Gagie, Navarro and Prezza [11] proposed a different approach to SA sampling, that takes space proportional to that of the compressed rank data structure while still allowing reasonable access times.While their result yields a potentially practical FM-index on massive databases, it does not directly lead to a solution since the problem of how to efficiently construct the BWT and SA sample remained open.In a direction toward to fully realizing the theoretical result of Gagie et al. [11], Boucher et al. [2] showed how to build the BWT of large genomic databases efficiently.We refer to this construction as prefix-free parsing.It takes as input string S, and in one-pass generates a dictionary and a parse of S with the property that the BWT can be constructed from dictionary and parse using workspace proportional to their total size and O(|S|) time.Yet, the resulting index of Boucher et al. [2] has no SA sample, and therefore, only supports counting and not locating.This makes this index not directly applicable to many bioinformatic applications, such as sequence alignment.
Our contributions.In this paper, we present a solution for building the FM-index1 for very large datasets by showing that we can build the BWT and Gagie et al.'s SA sample together in roughly the same time and memory needed to construct the BWT alone.We note that this algorithm is also based on prefix-free parsing.Thus, we begin by describing how to construct the BWT from the prefix-free parse, and then show that it can be modified to build the SA sample in addition to the BWT in roughly the same time and space.We implement this approach, and refer to the resulting implementation as bigbwt.We compare it to state-of-the-art methods for constructing the SA sample, and demonstrate that bigbwt currently the fastest and most space-efficient method for constructing the SA sample on large genomic databases.
Next, we demonstrate the applicability of our method to short read alignment.In particular, we compare the memory and time needed by our method to build an index for collections of chromosome 19 with that of Bowtie.Through these experiments, we show that Bowtie was unable to build indexes for our largest collections (500 or more) because it exhausted memory, whereas our method was able to build indexes up to 1,000 chromosome 19s (and likely beyond).At 250 chromosome 19 sequences, the our method required only about 2% of the time and 6% the peak memory of Bowtie's.Lastly, we demonstrate that it is possible to index collections of whole human genome assemblies with sub-linear scaling as the size of the collection grows.Related work.The development of methods for building and the FM-index on large datasets is closely related to the development short-read aligners for pan-genomics -an area where there is growing interest [27,5,12].Here, we briefly describe some previous approaches to this problem and detail its connection to the work in this paper.We note that majority of pan-genomic aligners requiring building the FM-index for a population of genomes and thus, can increase proficiency by using the methods described in this paper.
GenomeMapper [27], the method of Danek et al. [5], and GCSA [29] represent the genomes in a population as a graph, and then reduce the alignment problem to finding a path within the graph.Hence, these methods require all possible paths to be identified, which is exponential in the worst case.Some of these methods -such as the GCSA -use the FM-index to store and query the graph and could capitalize on our approach by building the index in the manner described here.Another set of approaches [24,8,12,32] consider the reference pan-genome as the concatenation of individual genomes and exploits redundancy by using a compressed index.The hybrid index [8] operates on a Lempel-Ziv compression of the reference pan-genome.An input parameter M sets the maximum length of reads that can be aligned; the parameter M has a large impact on the final size of the index.For this reason, the hybrid index is suitable for short-read alignment only, although there have been recent heuristic modifications to allow longer alignments [9].In contrast, the r-index, of which we provide an implementation in this work, has no such length limitation.The most recent implementation of the hybrid index is CHIC [33].Although CHIC has support for counting multiple occurrences of a pattern within a genomic database, it is an expensive operation, namely O( log log n), where is the number of occurrences in the databases and n is the length of the database.However, the r-index is capable of counting all occurrences of a pattern of length m in O(m) time up to polylog factors.There are a number of other approaches building off the hybrid index or similar ideas [5,34]; for an extended discussion, we refer the reader to the survey of Gagie and Puglisi [12].
Finally, a third set of approaches [14,23] attempts to encode variants within a single reference genome.BWBBLE by Huang et al. [14] follows this by supplementing the alphabet to indicate if multiple variants occur at a single location.This approach does not support counting of the number of variants matching a specific alignment; also, it suffers from memory blow-up when larger structural variations occur.

BWT and FM indexes
Consider a string S of length n from a totally ordered alphabet Σ, such that the last character of S is lexicographically less than any other character in S. Let F be the list of S's characters sorted lexicographically by the suffixes starting at those characters, and let L be the list of S's characters sorted lexicographically by the suffixes starting immediately after those characters.The list L is termed the Burrows-Wheeler Transform [3] of S and denoted BWT.
and S[j] have the same relative order in both lists; otherwise, their relative order in F is the same as their lexicographic order.This means that if S[i] is in position p in L then, assuming arrays are indexed from 0 and ≺ denotes lexicographic precedence, in F it is in position The mapping i → j i is termed the LF mapping.Finally, notice that the last character in S always appears first in L. By repeated application of the LF mapping, we can invert the BWT, that is, recover S from L. Formally, the suffix array SA of the string S is an array such that entry i is the starting position in S of the ith largest suffix in lexicographical order.The above definition of the BWT is equivalent to the following: The BWT was introduced as an aid to data compression: it moves characters followed by similar contexts together and thus makes many strings encountered in practice locally homogeneous and easily compressible.Ferragina and Manzini [10] showed how the BWT may be used for indexing a string S: given a pattern P of length m < n, find the number and location of all occurrences of P within S. If we know the range BWT(S)[i..j] occupied by characters immediately preceding occurrences of a pattern Q in S, then we can compute the range BWT(S)[i ..j ] occupied by characters immediately preceding occurrences of cQ in S, for any character c ∈ Σ, since Notice j − i + 1 is the number of occurrences of cQ in S. The essential components of an FMindex for S are, first, an array storing |{h : S[h] ≺ c}| for each character c and, second, a rank data structure for BWT that quickly tells us how often any given character occurs up to any given position2 .To be able to locate the occurrences of patterns in S (in addition to just counting them), the FM-index uses a sampled3 suffix array of S and a bit vector indicating the positions in BWT of the characters preceding the sampled suffixes.

Prefix-free parsing
Next, we give an overview of prefix-free parsing, which produces a dictionary D and a parse P by sliding a window of fixed width through the input string S. We refer the reader to Boucher et al. [2] for the formal proofs and Section 3.1 for the algorithmic details.A rolling hash function identifies when substrings are parsed into elements of a dictionary, which is a set of substrings of S. Intuitively, for a repetitive string, the same dictionary phrases will be encountered frequently.
We now formally define the dictionary D and parse P. Given a string4 S of length n, window size w ∈ N and modulus p ∈ N, we construct the dictionary D of substrings of S and the parse P as follows.We let f be a hash function on strings of length w, and let T be the sequence of substrings form a parsing of S in which each pair of consecutive strings S[s i , s i+1 +w−1] and S[s i+1 , s i+2 +w−1] overlaps by exactly w characters.We define that is, D consists of the set of the unique substrings s of S such that |s| > w and the first and last w characters of s form consecutive elements in T .If S has many repetitions we expect that |D| k.With a little abuse of notation we define the parsing P as the sequence of lexicographic ranks of substrings in i=1 .The parse P indicates how S may be reconstructed using elements of D. The dictionary D and parse P may be constructed in one pass over S in O (n + |D| log |D|) time if the hash function f can be computed in constant time.

r-index locating
Policriti and Prezza [26] showed that if we have stored SA[k] for each value k such that BWT[k] is the beginning or end of a run (i.e., a maximal non-empty unary substring) in BWT, and we know both the range BWT[i..j] occupied by characters immediately preceding occurrences of a pattern Q in S and the starting position of one of those occurrences of Q, then when we compute the range BWT[i ..j ] occupied by characters immediately preceding occurrences of cQ in S, we can also compute the starting position of one of those occurrences of cQ.Bannai et al [1] then showed that even if we have stored only SA[k] for each value k such that BWT[k] is the beginning of a run, then as long as we know SA[i] we can compute SA[i ].
Gagie, Navarro and Prezza [11] showed that if we have stored in a predecessor data structure , where pred(•) is a query to the predecessor data structure.Combined with Bannai et al.'s result, this means that while finding the range BWT[i..j] occupied by characters immediately preceding occurrences of a pattern Q, we can also find SA[i] and then report Gagie et al. gave the name r-index to the index resulting from combining a rank data structure over the run-length compressed BWT with their SA sample, and Bannai et al. used the same name for their index.Since our index is an implementation of theirs, we keep this name; on the other hand, we do not apply it to indexes based on run-length compressed BWTs that have standard SA samples or no SA samples at all.

Methods
Here, we describe our algorithm for building the SA or the sampled SA from the prefix free parse of a input string S, which is used to build the r-index.We first review the algorithm from [2] for building the BWT of S from the prefix free parse.Next, we show how to modify this construction to compute the SA or the sampled SA along with the BWT.

Construction of BWT from prefix-free parse
We assume we are given a prefix-free parse of S[1..n] with window size w consisting of a dictionary D and a parse P. We represent the dictionary as a string D where t i 's are the dictionary phrases in lexicographic order and # is a unique separator.We assume we have computed the SA of D, denoted by SA D [1.. ] in the following, and the suffix array of P, denoted BWT P , and the array Occ [1, d] such that Occ[i] stores the number of occurrences in the parse of the dictionary phrase t i .These preliminary computations take O(|D| + |P|) time.
By the properties of the prefix-free parsing, each suffix of S is prefixed by exactly one suffix α of a dictionary phrase t j with |α| > w.We call α the representative prefix of the suffix S[i..n].From the uniqueness of the representative prefix we can partition S's suffix array SA [1..n] into k ranges By construction, any suffix D[i.. ] of the dictionary D is also prefixed by the suffix of a dictionary phrase.For j = 1, . . ., , let β j denote the longest prefix of D[SA D [j].. ] which is the suffix of a phrase (i.e.D[SA D [j] + |β j |] = #).By construction the strings β j 's are lexicographically sorted Clearly, if we compute β 1 , . . ., β and discard those such that |β j | ≤ w, the remaining β j 's will coincide with the representative prefixes α i 's.Since both β j 's and α i 's are lexicographically sorted, this procedure will generate the representative prefixes in the order α 1 , α 2 , . . ., α k .We note that more than one β j can be equal to some α i since different dictionary phrases can have the same suffix.
We scan SA D [1.. ], compute β 1 , . . .β and use these strings to find the representative prefixes.As soon as we generate an α i we compute and output the portion BWT[b i , e i ] corresponding to the range [b i , e i ] associated to α i .To implement the above strategy, assume there are exactly k entries in SA D [1.. ] prefixed by α i .This means that there are k distinct dictionary phrases t i 1 , t i 2 , . . ., t i k that end with α i .Hence, the range [b i , e i ] contains z i = e i − b i + 1 = k h=1 Occ[i h ] elements.To compute BWT[b i , e i ] we need to: 1) find the symbol immediately preceding each occurrence of α i in S, and 2) find the lexicographic ordering of S's suffixes prefixed by α i .We consider the latter problem first.
Computing the lexicographic ordering of suffixes.For j = 1, . . ., z i consider the j-th occurrence of α i in S and let i j denote the position in the parsing of S of the phrase ending with the j-th occurrence of α i .In other words, P[i j ] is a dictionary phrase ending with α i and i By the properties of BWT P the lexicographic ordering of S's suffixes prefixed by α i coincides with the ordering of the symbols P[i j ] in BWT P .In other words, P[i j ] precedes P[i h ] in BWT P if and only if S's suffix prefixed by the j-th occurrence of α i is lexicographically smaller than S's suffix prefixed by the h-th occurrence of α i .
We could determine the desired lexicographic ordering by scanning BWT P and noticing which entries coincide with one of the dictionary phrases t i 1 , . . ., t i k that end with α i but this would clearly be inefficient.Instead, for each dictionary phrase t i we maintain an array IL i of length Occ[i] containing the indexes j such that BWT P [j] = i.These sorts of "inverted lists" are computed at the beginning of the algorithm and replace the BWT P which can be discarded.
Finding the symbol preceding α i .Given a representative prefix α i from SA D we retrieve the indexes i 1 , . . ., i k of the dictionary phrases t i 1 , . . ., t i k that end with α i .Then, we retrieve the inverted lists IL i 1 , . . .IL i k and we merge them obtaining the list of the z i positions y 1 < y 2 < • • • < y z i such that BWT P [y j ] is a dictionary phrase ending with α i .Such list implicitly provides the lexicographic order of S's suffixes starting with α i .
To compute the BWT we need to retrieve the symbols preceding such occurrences of α i .If α i is not a dictionary phrase, then α i is a proper suffix of the phrases t i 1 , . . ., t i k and the symbols preceding α i in S are those preceding α i in t i 1 , . . ., t i k that we can retrieve from D[1.. ] and SA D [1.. ].If α i coincides with a dictionary phrase t j , then it cannot be a suffix of another phrase.Hence, the symbols preceding α i in S are those preceding t j in S that we store at the beginning of the algorithm in an auxiliary array PR j along with the inverted list IL j .

Construction of SA and SA sample along with the BWT
We now show how to modify the above algorithm so that, along with BWT, it computes the full SA of S or the sampled SA consisting of the values SA[s 1 ], . . ., SA[s r ] and SA[e 1 ], . . ., SA[e r ], where r is the number of maximal non-empty runs in BWT and s i and e i are the starting and ending positions in BWT of the i-th such run, respectively.Note that if we compute the sampled SA the actual output will consist of r start-run pairs s i , SA[s i ] and r end-run pairs e i , SA[e i ] since the SA values alone are not enough for the construction of the r-index.
We solve both problems using the following strategy.Simultaneously to each entry BWT[j], we compute the corresponding entry SA[j].Then, if we need the sampled SA, we compare BWT[j − 1] and BWT[j] and if they differ, we output the pair j − 1, SA[j − 1] among the end-runs and the pair j, SA[j] among the start-runs.To compute the SA entries, we only need d additional arrays EP 1 , . . .EP d (one for each dictionary phrase), where and EP i [j] contains the ending position in S of the dictionary phrase which is in position IL i [j] of BWT P .
Recall that in the above algorithm for each occurrence of a representative prefix α i , we compute the indexes i 1 , . . ., i k of the dictionary phrases t i 1 , . . ., t i k that end with α i .Then, we use the lists IL i 1 , . . ., IL i k to retrieve the positions of all the occurrences of t i 1 , . . ., t i k in BWT P , thus establishing the relative lexicographic order of the occurrences of the dictionary phrases ending with α i .To compute the corresponding SA entries, we need the starting position in S of each occurrence of α i .Since the ending position in S of the phrase with relative lexicographic rank IL Hence, along with each BWT entry we obtain the corresponding SA entry which is saved to the output file if the full SA is needed, or further processed as described above if we need the sampled SA.

Time and memory usage for SA and SA sample construction
We compare the running time and memory usage of bigbwt5 with the following methods, which represent the current state-of-the-art.
bwt2sa Once the BWT has been computed, the SA or SA sample may be computed by applying the LF mapping to invert the BWT and the application of Eq. 1.Therefore, as a baseline, we use bigbwt to construct the BWT only, as in Boucher et al. [2]; next, we load the BWT as a Huffman-compressed string with access, rank, and select support to compute the LF mapping.We step backwards through the BWT and compute the entries of the SA in non-consecutive order.Finally, these entries are sorted in external memory to produce the SA or SA sample.This method may be parallelized when the input consists of multiple strings by stepping backwards from the end of each string in parallel.pSAscan A second baseline is to compute the SA directly from the input; for this computation, we use the external-memory algorithm pSAscan [17], with available memory set to the memory required by bigbwt on the specific input; with the ratio of memory to input size obtained from bigbwt, pSAscan is the current state-of-the-art method to compute the SA.Once pSAscan has computed the full SA, the SA sample may be constructed by loading the input text T into memory, streaming the SA from the disk, and the application of Eq. 1 to detect run boundaries.
We denote this method of computing the SA sample by pSAscan+.
We compared the performance of all the methods on two datasets: (1) Salmonella genomes obtained from GenomeTrakr [31]; and (2) chromosome 19 haplotypes derived from the 1000 Genomes Project phase 3 data [4].The Salmonella strains were downloaded from NCBI (NCBI BioProject PRJNA183844) and preprocessed by assembling each individual sample with IDBA-UD [25] and counting k-mers (k=32) using KMC [6].We modified IDBA by setting kMaxShortSequence to 1024 per public advice from the author to accommodate the longer paired end reads that modern sequencers produce.We sorted the full set of samples by the size of their k-mer counts and selected 1,000 samples about the median.This avoids exceptionally short assemblies, which may be due to low read coverage, and exceptionally long assemblies which may be due to contamination.
Next, we downloaded and preprocessed a collection of chromosome 19 haplotypes from 1000 Genomes Project.Chromosome 19 is 58 million base pairs in length and makes up around 1.9% of the total human genome sequence.Each sequence was derived by using the bcftools consensus tool to combine the haplotype-specific (maternal or paternal) variant calls for an individual in the 1KG project with the chr19 sequence in the GRCH37 human reference, producing a FASTA record per sequence.All DNA characters besides A, C, G, T and N were removed from the sequences before construction.
We performed all experiments in this section on a machine with Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz and 324 GB RAM.We measured running time and peak memory footprint using /usr/bin/time -v, with peak memory footprint captured by the Maximum resident set size (kbytes) field and running time by the User Time and System Time field.
We witnessed that the running time of each method to construct the full SA is shown in Figs.1(a) -1(c).On both the Salmonella and chr19 datasets, bigbwt ran the fastest, often by more than an order of magnitude.In Fig. 1(d), we show the peak memory usage of bigbwt as a function of input size.Empirically, the peak memory usage was sublinear in input size, especially on the chr19 data, which exhibited a high degree of repetition.Despite the higher diversity of the Salmonella genomes, bigbwt remained space-efficient and the fastest method for construction of the full SA.Furthermore, we found qualitatively similar results for construction of the SA sample, shown in Fig. 2. Similar to the results on full SA construction, bigbwt outperformed both baseline methods and exhibited sublinear memory scaling on both types of databases.

Application to many human genome sequences
We studied how the r-index scales to repetitive texts consisting of many similar genomic sequences.Since an ultimate goal is to improve read alignment, we benchmark against Bowtie (version 1.2.2) [19] .We ran Bowtie with the -v 0 and --norc options; -v 0 disables approximate matching, while --norc causes Bowtie (like r-index) to perform the locate query with respect to the query sequence only and not its reverse complement.

Indexing chromosome 19s
We performed our experiments on collections of one or more versions of chromosome 19.These versions were obtained from 1000 Genomes Project haplotypes in the manner described in the previous section.We used 10 collections of chromosome 19 haplotypes, containing 1, 2, 10, 30, 50, 100, 250, 500, and 1000 sequences, respectively.Each collection is a superset of the previous.Again, all DNA characters besides A, C, G, T and N were removed from the sequences before construction.All experiments in this section were ran on a Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz machine with 512GB memory.We measured running time and peak memory footprint as described in the previous section.
First we constructed r-index and Bowtie indexes on successively larger chromosome 19 collections (Figure 3(a), 3(b)).The r-index's peak memory is substantially smaller than Bowtie's for larger collections, and the gap grows with the collection size.At 250 chr19s, the r-index procedure takes about 2% of the time and 6% the peak memory of Bowtie's procedure.Bowtie fails to construct collections of more than 250 sequences due to memory exhaustion.
Next, we compared the disk footprint of the index files produced by Bowtie and r-index (Figure 3(c)).The r-index currently stores only the forward strand of the sequence, while the Bowtie index stores both the forward sequence and its reverse as needed by its double-indexing heuristic [19].Since the heuristic is relevant only for approximate matching, we omit the reverse sequence in these size comparisons.We also omit the 2-bit encoding of the original text (in the *.3.ebwt and *.4.ebwt files) as these too are used only for approximate matching.Specifically, the Bowtie index size was calculated by adding the sizes of the forward *.1.ebwtand *.2.ebwt files, which contain the BWT, SA sample, and auxiliary data structures for the forward sequence.The size of the r-index increased more slowly than Bowtie's, though the r-index was larger for the smallest collections.This is because, unlike Bowtie which samples a constant fraction of the SA elements (every 32nd by default), the density of the r-index SA sample depends on the ratio n/r.When the collection is small, n/r is small and more SA samples must be stored per base.At 250 sequences, the r-index index takes 6% the space of the Bowtie index.

Indexing whole human genomes
Lastly, we used r-index to index many human genomes at once.We repeated our measurements for successively larger collections of (concatenated) genomes.Thus, we first evaluated a series of haplotypes extracted from the 1000 Genomes Project [4] phase 3 callset (1KG).These collections ranged from 1 up to 10 genomes.As the first genome, we selected the GRCh37 reference itself.For the remaining 9, we used bcftools consensus to insert SNVs and other variants called by the 1000 Genomes Project for a single haplotype into the GRCh37 reference.
Second, we evaluated a series of whole-human genome assemblies from 6 different long-read assembly projects ("LRA").We selected GRCh37 reference as the first genome, so that the first data point would coincide with that of the previous series.We then added long-read assemblies from a Chinese genome assembly project [28], a Korean genome assembly project [16] a project to assemble the well-studied NA12878 individual [15], a hydatidiform mole (known as CHM1) assembly project [30] and the Celera human genome project [20].Compared to the series with only 1000 Genomes Project individuals, this series allowed us to measure scaling while capturing a wider range of genetic variation between humans.This is important since de novo human assembly projects regularly produce assemblies that differ from the human genome reference by megabases of sequence (12 megabases in the case of the Chinese assembly [28]), likely due to prevalent but hard-to-profile large-scale structural variation.Such variation was not comprehensively profiled in the 1000 Genomes Project, which relied on short reads.
The 1KG and LRA series were evaluated twice, once on the forward genome sequences and once on both the forward and reverse-complement sequences.This accounts for the fact that different de novo assemblies make different decisions about how to orient contigs.The r-index method achieves compression only with respect to the forward-oriented versions of the sequences indexed.That is, if two contigs are reverse complements of each other but otherwise identical, r-index achieves less compression than if their orientations matched.A more practical approach would be to index both forward and reverse-complement sequences, as Bowtie 2 [18] and BWA [22] do.
We measured the peak memory footprint when indexing these collections (Figure 4).We ran these experiments on an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz system with 256GB memory.Memory footprints for LRA grew more quickly than those for 1KG.This was expected due to the 36,671 34,420 10.63 9.28 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 50000 100000 150000 200000 0 20000 40000 60000 Total Length of Collection (MB) Indexing Peak Mem.(MB) q q 1KG LRA forward + reverse complement forward−only Fig. 4: Peak index-building memory for r-index when indexing successively large collections of 1000-Genomes individuals (1KG) and long-read whole-genome assemblies (LRA).
greater genetic diversity captured in the assemblies.This may also be due in part to the presence of sequencing errors in the long-read assembles; long-read technologies are more prone to indel errors than short-read technologies, for examples, and some may survive in the assemblies.Also as expected, memory footprints for the LRA series that included both forward and reverse complement sequences grew more slowly than when just the forward sequence was included.This is due to sequences that differ only (or primarily) in their orientation between assemblies.All series exhibit sublinear trends, highlighting the efficacy of r-index compression even when indexing genetically diverse whole-genome assemblies.Indexing the forward and reverse complement strands of 10 1KG individuals took about 6 hours and 20 minutes and the final index size was 36GB.
We also measured lengths and n/r ratios for each collection of whole genomes (Table 1).Consistent with the memory-scaling results, we see that the n/r ratios are somewhat lower for the LRA series than for the 1KG series, likely due to greater genetic diversity in the assemblies.

Conclusions and Future Work
We give an algorithm for building the SA and SA sample from the prefix-free parse of an input string S, which fully completes the practical challenge of building the index proposed by Gagie et al. [11].This leads to a mechanism for building a complete index of large databases -which is the linchpin in developing practical means for pan-genomics short read alignment.In fact, we apply our method for indexing partial and whole human genomes, and show that it better than Bowtie with respect to both memory and time.This allows for an index to be constructed for large collections of chromosome 19s (500 or more); a task that is out of reach of Bowtie -as exceeded our limit of 512 GB of memory.
Even though this work opens up doors to indexing large collections of genomes, it also highlights problems that warrant further investigation.For example, there still remains a significant amount of work in adapting the index to work well on large sets of sequence reads.This problem not only requires the construction of the r-index but also an efficient means to update the index as new datasets become available.Moreover, there is interest in supporting more sophisticated queries than just pattern matching, which would allow for more complex searches of large databases.

Fig. 1 :
Fig. 1: Runtime and peak memory usage for construction of full SA.

Fig. 2 :
Fig. 2: Runtime and peak memory usage for construction of SA sample.

Fig. 3 :
Fig.3: Scalability of r-index and bowtie indexes against chr19 haplotype collection size and total sequence length (megabases) with respect to index construction time (seconds) (a), index construction peak memory (megabytes) (b), index disk space (megabytes) (c), and locate time (seconds) of 100,000 100bp queries (d).

Table 1 :
Sequence length and n/r statistic with respect to number of whole genomes for the first 6 collections in the 1000 Genomes (1KG) and long-read assembly (LRA) series.