ALPHLARD-NT: Bayesian Method for Human Leukocyte Antigen Genotyping and Mutation Calling through Simultaneous Analysis of Normal and Tumor Whole-Genome Sequence Data

Abstract Human leukocyte antigen (HLA) genes provide useful information on the relationship between cancer and the immune system. Despite the ease of obtaining these data through next-generation sequencing methods, interpretation of these relationships remains challenging owing to the complexity of HLA genes. To resolve this issue, we developed a Bayesian method, ALPHLARD-NT, to identify HLA germline and somatic mutations as well as HLA genotypes from whole-exome sequencing (WES) and whole-genome sequencing (WGS) data. ALPHLARD-NT showed 99.2% accuracy for WGS-based HLA genotyping and detected five HLA somatic mutations in 25 colon cancer cases. In addition, ALPHLARD-NT identified 88 HLA somatic mutations, including recurrent mutations and a novel HLA-B type, from WES data of 343 colon adenocarcinoma cases. These results demonstrate the potential of ALPHLARD-NT for conducting an accurate analysis of HLA genes even from low-coverage data sets. This method can become an essential tool for comprehensive analyses of HLA genes from WES and WGS data, helping to advance understanding of immune regulation in cancer as well as providing guidance for novel immunotherapy strategies.


INTRODUCTION
H uman leukocyte antigen (HLA) genes are essential components of the immune system, which present peptides to immune cells to facilitate recognition of nonself antigens. HLA genes must be 1 highly polymorphic to effectively carry out this function, with many types or alleles recognized, resulting in high individual variation in immune responses. Therefore, HLA genotyping, in which the specific pair of HLA types is identified for each HLA locus, is essential to understand the immune system. Recently, the interaction between cancer and the immune system has attracted attention (Grivennikov et al., 2010;Schreiber et al., 2011;Kreiter et al., 2015;Rooney et al., 2015;Marty et al., 2017), and somatic mutations in HLA genes have been shown to accumulate in specific cancer types (The Cancer Genome Atlas Research Network, 2014;Testoni et al., 2015;The Cancer Genome Atlas Network, 2015;Giannakis et al., 2016;McGranahan et al., 2017). Therefore, HLA genotyping can further help to understand the link between cancer and immunity, which would benefit personalized medicine.
There are several approaches currently available for HLA genotyping. Conventional approaches use polymerase chain reaction-based methods with sequence-specific oligonucleotides (Saiki et al., 1986), sequencespecific primers (Olerup and Zetterquist, 1992), and sequence-based typing (Santamaria et al., 1992); however, these methods are time consuming and labor intensive, and can only provide information on targeted HLA genes. New methods for HLA genotyping have been developed more recently with advances in molecular techniques, including whole-exome sequencing (WES), whole-genome sequencing (WGS), and RNA sequencing (Boegel et al., 2012;Warren et al., 2012;Kim and Pourmand 2013;Liu et al., 2013;Bai et al., 2014;Szolek et al., 2014;Nariai et al., 2015;Shukla et al., 2015;Dilthey et al., 2016;Xie et al., 2017;Hayashi et al., 2018;Lee and Kingsford, 2018). With these methods, information of both somatic mutations and HLA genotypes can be obtained from the entire sequence, which can facilitate investigations on the relationship between cancer and the immune system. In particular, methods that can specifically call germline or somatic mutations in HLA genes Hayashi et al., 2018;Lee and Kingsford, 2018) are valuable, since these mutations have potential to change immune responses, including tumor immune escape. However, the low coverage of WGS data makes it challenging to detect HLA germline and somatic mutations.
Previously, we developed a Bayesian model, called ALPHLARD (Hayashi et al., 2018), which identifies HLA genotypes and germline mutations from WGS data. ALPHLARD can also call HLA somatic mutations by comparing HLA sequences determined from normal and tumor samples. However, the specificity of the HLA somatic mutation calling is insufficient because ALPHLARD conducts the analyses of normal and tumor samples independently. To resolve this issue, we extended ALPHLARD to construct a new model named ALPHLARD-NT for accurately identifying both HLA germline and somatic mutations as well as HLA genotypes from WGS data. ALPHLARD-NT was validated from WES and WGS data sets from 343 and 25 colon cancer samples, respectively, which demonstrated its good performance in HLA genotyping, along with the ability to call HLA germline and somatic mutations, even from low-coverage data.

Human leukocyte antigen reference data
We used the IPD-IMGT/HLA Database (Robinson et al., 2015) as HLA reference sequences in our method. Since the database provides incomplete sequences for most HLA types, we replaced the unknown bases with those of the most similar HLA type. To this end, similarity was determined by measuring the hamming distance in multiple sequence alignments (MSAs) across HLA types obtained from the IPD-IMGT/HLA Database. We used the Allele Frequency Net Database (González-Galarza et al., 2015) for prior information on HLA type frequencies.

Human leukocyte antigen read filtering and realignment
Filtering of HLA reads must be carefully performed for various reasons. First, it is insufficient to use only a human genome reference such as GRCh37 or GRCh38 owing to the high polymorphism of HLA genes. Therefore, a specific HLA database is required, such as the IPD-IMGT/HLA Database. Second, HLA genes and pseudogenes are paralogs and are, therefore, quite similar. Hence, when performing HLA genotyping, it is essential to distinguish reads from an HLA gene of interest from those of other HLA genes and pseudogenes.
In our HLA genotyping pipeline, a BAM file whose reference is the human genome is used as input data. First, sequence reads in the BAM file are filtered by extracting the HLA region, which is defined by chr6: 28,477,448,354 for GRCh37 and chr6:28,510,480,577 for GRCh38, and covers the HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes. Next, the extracted reads are mapped 924 HAYASHI ET AL.
to all HLA reference sequences using BWA-MEM (version 0.7.17) with the option to obtain information on all identified alignments. Each read is classified based on whether or not the HLA genes produced the read, and if so, which specific gene was involved. This classification is made using alignment scores, which we call HLA read scores (HR scores), and are calculated as follows. Let x i be the i th read pair that consists of two single reads x i‚ 0 and x i‚ 1 . In the case of single-end sequence data, x i consists of one read, x i‚ 0 . In addition, t k is defined as the k th HLA type. If the read x i‚ j is unmapped to the HLA type t k , then the HR score s i‚ j‚ k for x i‚ j and t k is -1. Otherwise,x i‚ j‚ k andt i‚ j‚ k are the aligned sequences of x i‚ j and t k , whilex i‚ j‚ k‚ n andt i‚ j‚ k‚ n are the n th bases or gaps ofx i‚ j‚ k andt i‚ j‚ k , respectively. Moreover, the mismatch probabilitỹ q i‚ j‚ k‚ n ofx i‚ j‚ k‚ n andt i‚ j‚ k‚ n can be calculated bỹ q i‚ j‚ k‚ n = 10 -b i‚ j‚ k‚ n 10 ‚ whereb i‚ j‚ k‚ n is the Phred base quality ofx i‚ j‚ k‚ n . Using the aforementioned definitions, the HR score s i‚ j‚ k is given by i‚ j‚ k‚ n = 0 (ifx i‚ j‚ k‚ n ‚t i‚ j‚ k‚ n 2 B andx i‚ j‚ k‚ n =t i‚ j‚ k‚ n ) log (q i‚ j‚ k‚ n 3 ) (ifx i‚ j‚ k‚ n ‚t i‚ j‚ k‚ n 2 B andx i‚ j‚ k‚ n 6 ¼t i‚ j‚ k‚ n ) a (d‚ o) (ifx i‚ j‚ k‚ n =andx i‚ j‚ k‚ n - (ift i‚ j‚ k‚ n =andt i‚ j‚ k‚ n -1 = -) a (N) (ifx i‚ j‚ k‚ n = N andt i‚ j‚ k‚ n 2 B (N) orx i‚ j‚ k‚ n 2 B (N) andt i‚ j‚ k‚ n = N) Here, B = fA‚ C‚ G‚ Tg and B (N) = fA‚ C‚ G‚ T‚ Ng. s (r) i‚ j‚ k‚ n is a reward for the length of the read, and a (r) is a positive hyperparameter for one base. By contrast, s (p) i‚ j‚ k‚ n is a penalty for mismatches between the read and the HLA type, and a (d‚ o) , a (d‚ e) , a (i‚ o) , a (i‚ e) , and a (N) are negative hyperparameters for deletion opening, deletion extension, insertion opening, insertion extension, and an unknown base N in the read or the HLA type, respectively.
Then, for each read pair x i and each HLA locus l, the score s Ã i‚ l is defined by where T l is a set of HLA types of the HLA locus l. When x i is a paired-end read, it is used for genotyping the HLA locus l if the following two criteria are satisfied: Here, h (p‚ s) is a hyperparameter of a threshold for the maximum HR score of the locus and h (p‚ d) is a hyperparameter of a threshold for the difference between the maximum HR scores of the locus and other loci. However, if x i is a single-ended read, different thresholds are used; in other words, x i is used for genotyping the HLA locus l if The former criterion is necessary to collect reads that are likely to be produced by the locus, whereas the latter criterion is needed to exclude reads that might be produced by other loci.
Next, all of the read pairs that satisfy the conditions are realigned to the MSAs of the HLA types of the HLA locus l. Realignment of the read x i‚ j is performed using the best HLA type whose index is given by k Ã = arg max k:t k 2T l s i‚ j‚ k ‚ and the realigned readx i‚ j is obtained by aligning x i‚ j to the MSAt k Ã of the HLA type t k Ã to match the alignment (x i‚ j‚ k Ã ‚t i‚ j‚ k Ã ). This is done by simply translating the positions of bases and gaps int i‚ j‚ k Ã into those int k Ã .

Bayesian model for human leukocyte antigen analysis
We applied a Bayesian model for HLA genotyping and HLA somatic mutation detection, with basically the same structure as our previous method (Hayashi et al., 2018) except for some additional parameters. Figure 1 shows the graphical model. Hereafter, we suppose that the sequence reads are paired-ended for simplicity, and the model for single-ended sequence reads is the same except that the reads are unpaired.
Input data of the model include both the normal and tumor realigned reads. Let i‚ 1 ) be the i th tumor realigned read pair, where n and t indicate parameters for the normal and tumor sample, respectively. For each s 2 fn‚ tg, we define x (s) i‚ j‚ n as the n th base of x (s) i‚ j , and q (s) i‚ j‚ n as the mismatch probability of x (s) i‚ j‚ n . Note that the first position of each realigned read is not the beginning of the read but rather that of the MSAs, and x (s) i‚ j‚ n and q (s) i‚ j‚ n are undefined if the n th position is not covered by the read. We define r (s) i‚ j as a set of positions covered by the read x (s) i‚ j and r (s) i as (r (s) i‚ 0 ‚ r (s) i‚ 1 ). We denote HLA types of the sample by R (r) 1 and R (r) 2 , normal HLA sequences by S (n‚ r) 1 and S (n‚ r) 2 , and tumor HLA sequences by S (t‚ r) 1 and S (t‚ r) 2 . Here, the sequences of R (r) 1 and R (r) 2 are the MSAs of the HLA types. S (n‚ r) 1 and S (n‚ r) 2 are used to consider germline variants in R (r) 1 and R (r) 2 , and S (t‚ r) 1 and S (t‚ r) 2 are used to reflect somatic mutations. We also introduce decoy HLA types is a hyperparameter of the number of the decoy parameters. These parameters are essential to make a robust inference, because their presence can reduce the influence of misclassified reads at the previous filtering step that were actually produced by other HLA genes or pseudogenes. For convenience, we sometimes , respectively. In addition, in some cases, ). Similar to the notation for read pairs, R m‚ n and S m‚ n are defined as the n th base of R m and S m , respectively.
Next, let I (n) i and I (t) i be parameters that indicate the specific HLA sequence that produced x (n) i and x (t) i , respectively. In other words, I (s) i = m means that x (s) i was produced by S m . Note that I (n) i 2 f1‚ . . . ‚ (d) + 2g because tumor HLA sequences cannot produce normal sequence reads, and that I (t) i 2 f1‚ . . . ‚ 2 (d) + 4g because the tumor sample might also contain normal cells.
. F m is a positive real parameter that expresses the likelihood that a read is produced by S (n) m and S (t) m . G is also a positive real parameter and expresses the ratio of normal cells contained in the tumor sample. V m is a tuple (V m‚ 1 ‚ . . . ‚ V m‚ N ), where N is the length of MSAs and V m‚ n is a parameter of 0 or 1, which indicates whether S (n) m‚ n and S (t) m‚ n are valid, as described in more detail hereunder.
The posterior probability of the parameters is given by . The likelihoods of sequence read pairs are given by p(x i‚ j‚ n jS m‚ n 2 B) (if x i‚ j‚ n 2 B and x i‚ j‚ n 6 ¼ S m‚ n ) p(x i‚ j‚ n jS m‚ n = N) Here, p (e‚ d) , p (e‚ i) , and p (e‚ N) are hyperparameters of probabilities of a deletion error, insertion error, and N in a sequence read, respectively.
The prior probability of tumor HLA sequences is given by Here, p (s‚ s) , p (s‚ d) , p (s‚ i) , and p (s‚ N) are hyperparameters of probabilities of a somatic substitution, somatic deletion, somatic insertion, and N in a tumor HLA sequence, respectively.
The prior probability of normal HLA sequences is given by where 928 HAYASHI ET AL.
p(S (n‚ r) m‚ n jR (r) m‚ n 2 B‚ R (r) m‚ n is original)

=
(1 -p (g‚ r‚ o‚ N) )(1 -p (g‚ r‚ o‚ d) )(1 -p (g‚ r‚ o‚ s) ) (if S (n‚ r) m‚ n = R (r) m‚ n ) (1 -p (g‚ r‚ o‚ N) )(1 -p (g‚ r‚ o‚ d) ) p (g‚ r‚ o‚ s) 3 (if S (n‚ r) m‚ n 2 B and S (n‚ r) m‚ n 6 ¼ R (r) m‚ n ) (1 -p (g‚ r‚ o‚ N) )p (g‚ r‚ o‚ d) p(S (n‚ r) m‚ n jR (r) m‚ n = N‚ R (r) m‚ n is imputed) Here, p (g‚ r‚ o‚ s) , p (g‚ r‚ o‚ d) , p (g‚ r‚ o‚ i) , and p (g‚ r‚ o‚ N) are hyperparameters of probabilities of a germline substitution, germline deletion, germline insertion, and N, respectively, in a nondecoy normal HLA sequence at the position where the reference is an original base. The other hyperparameters are also defined in a similar way. The probabilities for an imputed reference base should be larger than those for an original base to reduce the influence of misimputation. In addition, the probabilities for a decoy normal HLA sequence should also be larger than those for a nondecoy normal HLA sequence to achieve robustness against misclassified reads.
The prior probability of HLA types is given by m ) / 1: Here, p t is a prior probability of the HLA type t, which was calculated using the Allele Frequency Net Database.
The prior probability of normal indicator variables is given by This formula means that the read cannot be produced by an HLA sequence without a valid position covered by the read, which is controlled by V. Similarly, the prior probability of tumor indicator variables is given by Note that I (t) i 2 M (n) indicates that the read was derived from a normal cell, and I (t) i 2 M (t) indicates that the read was derived from a tumor cell. Furthermore, matched normal-tumor HLA sequences S (n) m and S (t) m share V m and F m . The prior probability of F is given by Here, LN is a log-normal distribution, l (f‚ r) and (r (f‚ r) ) 2 are hyperparameters of the mean and variance for the nondecoy parameters, and l (f‚ d) and (r (f‚ d) ) 2 are hyperparameters of the mean and variance for the decoy parameters. l (f‚ d) should be smaller than l (f‚ r) because sequence reads mapped to decoy HLA sequences should be removed at the filtering step.
The prior probability of G is given by where l (g) and (r (g) ) 2 are hyperparameters of the mean and variance for normal contamination. The prior probability of V is given by Here, p (v‚ o) and p (v‚ e) are hyperparameters of probabilities of a validity flag opening and a validity flag extension, respectively. Note that V (r) m‚ n must always be 1.

Markov chain Monte Carlo-based parameter sampling
The parameters are sampled from the Bayesian model using Markov chain Monte Carlo. Gibbs sampling is primarily used to sample all parameters except for F m and V m .
A candidate parameter, F Ã m , is first sampled using the Metropolis-Hastings algorithm whose proposal distribution is given by where (r (f‚ p) m ) 2 is a hyperparameter of the variance of the proposal distribution. The acceptance ratio r Ã is calculated by . A candidate parameter, V Ã m , is sampled using the Metropolis-Hastings algorithm whose proposal distribution is analogous to the Wolff algorithm (Wolff, 1989), which is used for sampling of the Ising model. V Ã m is generated by Algorithm 1. Then, I (n)Ã and I (t)Ã are also sampled using Gibbs sampling given V Ã m . The acceptance ratio r Ã is calculated by We set 1 -p (v‚ o) and p (v‚ e) to p , respectively, so that the acceptance ratio can be calculated by

Efficient sampling from multimodal posteriors
In addition to the standard sampling approaches mentioned earlier, we applied some additional elaborate sampling schemes to prevent the parameters from becoming stuck in a local optimum. One such scheme swaps parts of the nondecoy and decoy HLA sequences. First, a nondecoy index m 2 f1‚ 2g, decoy index m 0 2 f3‚ Á Á Á ‚ (d) + 2g, and interval i such that 8n 2 i; V m 0 ‚ n = 1 are sampled uniformly. Next, S (n) m‚ n and S (n) m 0 ‚ n , and S (t) m‚ n and S (t) m 0 ‚ n are swapped for all n 2 i. Finally, R Ã m , R Ã m 0 , I (n)Ã , and I (t)Ã are sampled using Gibbs sampling given S (n)Ã and S (t)Ã , which are the normal and tumor HLA sequences after swapping. Consequently, the acceptance ratio r Ã is given by This sampling method helps to determine which HLA sequences should be decoys. in order. Then, the acceptance ratio r Ã is given by : This sampling functions in a similar way to blocked Gibbs sampling of R m , S (n) m , and S (t) m . This blocked Gibbs sampling requires substantial computation time because S (n) m and S (t) m must be integrated out for each HLA type. By contrast, our scheme requires much less time because S (n) m and S (t) m are integrated out only for R m and R Ã m . Other strategies were further used to obtain better parameters. First, reference sequences are periodically copied to HLA sequences. Second, sequence reads are assigned to decoy sequences if there are mismatches between the sequence reads and the reference sequences. These approaches help to reduce the incidence of false-positive mutations and retain only the mutations that seem true. The multistart method is also used to obtain better initial parameters. Moreover, parallel tempering is used to move parameters from mode to mode.

Human leukocyte antigen analysis from sampled parameters
HLA analysis is conducted based on the sampled parameters. HLA genotyping is performed by counting the number of sampled HLA types, and germline or somatic mutations are identified by finding different bases between HLA types and normal HLA sequences, or between normal and tumor HLA sequences, respectively.

Human leukocyte antigen genotyping from whole-genome sequencing data
We first evaluated the accuracy of this method for HLA genotyping from a WGS data set. For comparison, we applied ALPHLARD-NT, ALPHLARD (Hayashi et al., 2018), and POLYSOLVER  to WGS data of 25 colon cancer samples, which were used by Hayashi et al. (2018). The performance comparison is summarized in Table 1. Overall, ALPHLARD-NT outperformed POLY-SOLVER at all resolutions for all HLA loci. ALPHLARD-NT also achieved slightly higher accuracy than ALPHLARD because ALPHLARD-NT can use information from both normal and tumor samples, whereas ALPHLARD can only use information from normal samples.

Detection of human leukocyte antigen mutations from whole-genome sequencing data
We also searched for HLA class I somatic mutations among the WGS data from the 25 colon cancer samples using ALPHLARD-NT, POLYSOLVER, and EBCall (Shiraishi et al., 2013), which is a standard mutation caller. ALPHLARD-NT called one substitution, two insertions, and two deletions, all of which were verified by the TruSight HLA Sequencing Panels (Weimer et al., 2016). All four indels called are known to lead to the loss of function of the HLA alleles, and might contribute to immune escape. However, POLYSOLVER and EBCall detected no and one mutation, respectively, which was likely due to the low coverage of the data set.
3.3. Detection of human leukocyte antigen mutations from whole-exome sequencing data Next, we applied ALPHLARD-NT, POLYSOLVER, and EBCall to a WES data set of 343 colon adenocarcinoma cases from The Cancer Genome Atlas (TCGA). Figure 2 shows the Venn diagrams of the identified HLA class I somatic mutations with each method. This figure demonstrates the high sensitivity of ALPHLARD-NT (88 mutations) compared with POLYSOLVER (60 mutations) and EBCall (80 mutations), which is especially remarkable for insertions. ALPHLARD-NT detected seven insertions at the beginning of exon 4 of HLA class I genes, which is a known hotspot of indels , whereas POLYSOLVER and EBCall identified no and three insertions at this hotspot, respectively. ALPHLARD-NT also identified 12 deletions at the same position. These recurrent frameshift indels seemed to be positively selected for immune escape caused by loss of function of the HLA alleles.
In addition, ALPHLARD-NT detected a novel HLA-B allele whose exon sequence is the same as HLA-B*35:08:01 except that the 25th base is C rather than G, which changes the 9th amino acid from V to L. The protein produced by the new allele is also novel and not registered in the IPD-IMGT/HLA Database, indicating that the allele defines a new HLA type name at the second field.

CONCLUSION
In this article, we have presented a new Bayesian method, ALPHLARD-NT, which identifies HLA germline and somatic mutations as well as HLA genotypes. Comparison of the performance of ALPHLARD-NT clearly demonstrated its higher accuracy than existing methods for WGS-based HLA genotyping. ALPHLARD-NT also detected HLA somatic mutations from both WES and WGS data. In general, HLA mutation calling is difficult mainly due to the similarity of HLA genes and pseudogenes. We dealt with this problem by applying sophisticated filtering criteria and using decoy-related parameters that reduced the influence of misclassified reads at the filtering step. Although these approaches work well for HLA class I mutation calling, identification of HLA class II mutations remains a challenge, since databases tend to be relatively incomplete for identifying class II genes and pseudogenes compared with class I genes.
With the continuous accumulation of large amounts of WES and WGS data, HLA mutation calling from these data sets is a fundamental step in cancer immunogenomics. Thus, we expect that our method will be an essential tool for comprehensive analyses of HLA genes from WES and WGS data.