A Short-Term Assessment of Nascent HIV-1 Transmission Clusters Among Newly Diagnosed Individuals Using Envelope Sequence-Based Phylogenetic Analyses

Abstract
The identification of transmission clusters (TCs) of HIV-1 using phylogenetic analyses can provide insights into viral transmission networks and help improve prevention strategies. We compared a partial HIV-1 envelope fragment of 1,070 bp with its loop 3 (V3, 108 bp) to determine their utility in inferring HIV-1 transmission clustering. Serum samples from recently (n = 106) and chronically (n = 156) HIV-1-infected patients with confirmed status were sequenced. HIV-1 envelope nucleotide-based phylogenetic analyses were used to infer HIV-1 TCs, which were constructed using ClusterPickerGUI_1.2.3 with a pairwise genetic distance threshold of ≤10%. Logistic regression analyses were used to examine the demographic factors likely to be associated with HIV-1 clustering. Ninety-eight distinct consensus envelope sequences were subjected to phylogenetic analyses. Using the partial envelope fragment sequence, 42 sequences were grouped into 15 distinct small TCs, whereas the V3 loop reproduced 10 of these clusters. The agreement between the partial envelope and the V3 loop fragments was moderate and statistically significant, with a Cohen's kappa (κ) coefficient of 0.59, p < .00001. Age below the cohort mean (<38.8 years) and HIV-1 subtype B were the two factors significantly associated with HIV-1 transmission clustering in the cohort, odds ratio (OR) = 0.25, 95% confidence interval (CI, 0.04–0.66), p = .002 and OR = 0.17, 95% CI (0.10–0.61), p = .011, respectively. The present study confirms that a partial fragment of the HIV-1 envelope sequence is a better predictor of transmission clustering than the V3 loop alone. However, the loop 3 segment may be useful for screening purposes and may be more amenable to integration in surveillance programs.


Introduction
In 2015, a total of 609 HIV cases were reported to the Public Health Laboratory in Quebec (LSPQ), including 299 new diagnoses. 1 To better characterize transmission patterns, 2,3 we evaluated molecular epidemiology methods using phylogenetic analyses of HIV genome sequences obtained from infected individuals. Pairwise genetic distances (PWDs) derived from env, gag, and pol nucleotide sequences have been used to assess the relationship between sequences and thereby study HIV transmission dynamics. 4,5 Combining sequence-based clustering with risk factors and sociodemographic, clinical, and geographical parameters may help identify patterns of transmission in the HIV-infected population. [6][7][8][9][10][11] The env gene, encoding glycoproteins gp120 and gp41, is the most variable region of the HIV-1 genome. [12][13][14] The gp120 region contains five hypervariable regions (V1 to V5) and five conserved regions (C1-C5). [15][16][17][18] The gp41 region consists of three domains: the ectodomain (ECD), the transmembrane domain, and the long cytoplasmic domain. 19 Each subregion and each domain of env plays a key role in HIV pathogenesis. 16,18,20 In the literature, pol gene-derived sequences are the most commonly used for phylogenetic analysis to infer HIV-1 transmission clustering. 4 These sequences are readily available in clinical laboratories performing drug resistance testing in most countries. However, little is known about the potential of HIV-1 envelope-derived sequences for this application. 21,22 It has been shown that within-host env genetic diversity is significantly associated with HIV-1 Fiebig stage and disease progression. 23,24 HIV-1 evolution in a new recipient may retain some ancestral features of the infecting donor's virus, 25 reflected in the genetic distance between the donor and recipient. 21,26 Such characteristics may be used to establish closely linked transmission networks (clusters) between HIV-infected individuals.
The first objective of this study was to assess the potential of HIV env-derived sequences (1,070 bp) to determine HIV-1 transmission clustering. The second objective was to investigate whether shorter portions of the env gene, such as the env V3 loop-derived sequences (108 bp), could be a valuable tool to achieve the same end while reducing the technical, time, and cost constraints 27,28 associated with partial but long-length env sequencing. The third objective was to identify risk factors associated with HIV transmission clustering.

Patients and specimens
Serum samples that were reactive in Quebec diagnostic laboratories using a screening HIV 1/2 immunoassay (EIA) were submitted to the LSPQ for confirmation by Western blot (WB) and/or p24 EIA. Positive p24 antigen samples were de facto classified as recent infections. Samples from individuals newly confirmed as HIV-1 positive by WB were then submitted to a recent infection testing algorithm (RITA) based on antibody avidity. The latter combines a Centers for Disease Control and Prevention (CDC)-modified Bio-Rad Avidity Assay and a Sedia-LAg-Avidity assay. 29 Recent infections were defined as samples that were positive for HIV-1 p24 antigen, or positive for HIV-1 antibodies by WB and classified as "recent" by RITA (≤136 days of infection). Established infections were defined as samples positive by WB and classified as long-standing by RITA (>136 days of infection).
Based on these criteria, we selected 262 newly diagnosed HIV samples that included recent (n = 106) and long-standing (n = 156) infections collected in 2015. The risk factors and clinical and epidemiological parameter data reported for each sample were extracted from the HIV provincial surveillance program.
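The classification rule described above can be sketched as a small function. This is a minimal illustration only: the `Sample` record, its field names, and the "unclassified" fallback are assumptions introduced here and are not part of the laboratory workflow described in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical record of confirmatory results for one serum sample."""
    p24_positive: bool        # HIV-1 p24 antigen EIA result
    wb_positive: bool         # Western blot result
    rita_call: Optional[str]  # "recent" or "long-standing"; None if RITA was not run


def classify_infection(sample: Sample) -> str:
    """Apply the recent/established definitions given in the text."""
    # A positive p24 antigen result is treated as a recent infection outright.
    if sample.p24_positive:
        return "recent"
    # WB-confirmed samples are split by the RITA call (136-day cutoff in the text).
    if sample.wb_positive and sample.rita_call == "recent":
        return "recent"
    if sample.wb_positive and sample.rita_call == "long-standing":
        return "established"
    # Assumption: anything else is left unclassified rather than guessed.
    return "unclassified"


samples = [
    Sample(p24_positive=True, wb_positive=False, rita_call=None),
    Sample(p24_positive=False, wb_positive=True, rita_call="recent"),
    Sample(p24_positive=False, wb_positive=True, rita_call="long-standing"),
]
labels = [classify_infection(s) for s in samples]
print(labels)
```

The order of the checks mirrors the text: the p24 result is examined first, and the RITA call is only consulted for WB-confirmed samples.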

Sequence data processing, phylogenetic analysis, and HIV transmission cluster reconstruction
All consensus env nucleic acid sequences were aligned with the ClustalW method in Molecular Evolutionary Genetics Analysis (MEGA7) software (www.megasoftware.net). 35 All aligned sequences were then submitted to MAFFT multiple sequence alignment software version 7 36 to verify the reliability of the alignments. The human immunodeficiency virus type 1 K03455.1 (HXB2) env nucleotide (nt) sequence (positions 6225 to 8795) was included in the alignment to serve as a reference. Matching gap-free sequences of equal length (HXB2 nt positions 6831 to 7900; 1,070 bp) were selected for phylogenetic analyses. The env partial fragment analyzed included the gp120-C2 to C5 subregions (HXB2 nt positions 6813 to 7757) and the gp41 partial ECD (HXB2 nt positions 7758 to 7915). The HIV-1 env gp120-V3 loop sequences (HXB2 genome nt positions 7110 to 7217; 108 bp) were also analyzed separately. Phylogenetic trees were constructed in MEGA7 using the maximum likelihood algorithm, and their reliability was estimated from 1,000 bootstrap replicates. Transmission clusters (TCs) were evaluated among sequences that grouped around common proximal nodes with ≥99% bootstrap support, as supported by a previous study. 11 In the present study, in the absence of a gold standard for HIV-1 envelope sequence-based clustering, we considered a PWD of ≤10% as the threshold, as suggested by Novitsky et al. 21 HIV-1 transmission clustering was defined as two or more HIV-1-infected individuals whose genomic sequences were grouped on branches under the genetic distance threshold in the phylogenetic tree. 4 The interpretation of the extent of the relevant transmission clustering depends on the genetic distance threshold, the time over which the cluster was established, the number of sequences in the data set, the sampling density, the node bootstrap support, and the HIV-1 genomic region concerned. 4,21 We considered a high PWD (≤10%) to define clustering for the HIV-1 envelope because it presents the highest variability compared with gag and pol (PWD ≤1.5%). The extent of the HIV TC was classified as unique (1 member), small (2-4 members), or large (5-60 members) in the transmission chain. 6 We used ClusterPickerGUI_1.2.3 (http://hiv.bio.ed.ac.uk/software.html) to construct cluster trees 37 and FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/) to view them. We used Dendroscope version 3.5.10 to build a tanglegram of connected taxa between rooted phylogenetic trees and networks. 38,39 The NCBI subtyping tool 40 was used to determine HIV viral subtypes, which were confirmed by the REGA HIV-1 subtyping tool version 3. 41
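The role of the pairwise distance threshold and of the cluster size classes can be illustrated with a short sketch. It is a deliberate simplification and not the procedure used in the study: it groups aligned sequences by single linkage on the raw proportion of differing sites (p-distance), whereas the study identified clusters on a maximum likelihood tree with bootstrap support using Cluster Picker. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignore columns where either sequence has a gap
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def cluster_by_threshold(seqs: Dict[str, str], threshold: float = 0.10) -> List[Set[str]]:
    """Single-linkage grouping: link any pair whose p-distance is <= threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


def size_class(n_members: int) -> str:
    """Size classes quoted in the text: unique (1), small (2-4), large (5 or more)."""
    if n_members == 1:
        return "unique"
    return "small" if n_members <= 4 else "large"


# Toy aligned sequences (hypothetical, 10 sites each).
toy = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",  # differs from s1 at 1 of 10 sites
    "s3": "TTGTACCTAG",  # differs from s1 and s2 at 4 of 10 sites
}
clusters = cluster_by_threshold(toy, threshold=0.10)
for members in clusters:
    print(sorted(members), size_class(len(members)))
```

Raising or lowering `threshold` in this toy example changes which sequences are linked, which is the sensitivity to the cutoff that the text describes.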

Statistical analyses
We used bivariate and multivariate analyses to determine the independence of associations between exposure variables (epidemiologic and clinical factors) and the outcome (clustering). Categorical variables were compared using a chi-square or Fisher's exact test, and continuous variables were compared using a two-sample Student's t-test or Wilcoxon rank sum test. All variables with p < .25 in the bivariate analysis were included in the multivariate analysis using logistic regression. At this stage, p < .05 42,43 using the Wald test was considered statistically significant, and the selected variables were included in the final regression model. Statistical analyses were performed in SPSS version 24 and STATA version 14. The agreement between env gp120-V3 sequence-based clustering and clustering determined with the partial env fragment, taken as the gold standard for correctly identifying individuals in a cluster tree, was assessed using Cohen's kappa coefficient. The estimated parameters were sensitivity, specificity, positive predictive value, and negative predictive value. Stata IC version 14 was used as the statistical package, and a p value below .05 was considered statistically significant.
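The agreement statistics named above can all be derived from a single 2 x 2 table that cross-classifies each sequence as clustered or not clustered by the test method (V3) and by the gold standard (partial env). The sketch below shows the arithmetic; the function names are hypothetical and the counts are illustrative placeholders, not the study's data.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, PPV, and NPV from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # clustered by both / clustered by gold standard
        "specificity": tn / (tn + fp),  # unclustered by both / unclustered by gold standard
        "ppv": tp / (tp + fp),          # clustered by both / clustered by test
        "npv": tn / (tn + fn),          # unclustered by both / unclustered by test
    }


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa: agreement beyond what is expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals of the table.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected)


# Illustrative counts only (not taken from the study).
table = dict(tp=20, fp=5, fn=10, tn=65)
metrics = diagnostic_metrics(**table)
kappa = cohen_kappa(**table)
print({k: round(v, 3) for k, v in metrics.items()}, round(kappa, 3))
```

Kappa is lower than the raw proportion of agreement because a large share of agreement on unclustered sequences is expected by chance alone.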

Ethics statement
All work was conducted in accordance with the Declaration of Helsinki in terms of informed consent. All samples were anonymized before we accessed them for this study. No nominal information was used for analysis or data management. Ethical approval was given and renewed yearly by our Institutional Review Board (IRB): the "Comité d'éthique et de la recherche des Centres hospitaliers affiliés à l'Université de Montréal".

Results
Amplification and DNA sequencing of HIV-1 env were attempted on 262 specimens and were successful for 39% of them (n = 102) (Fig. 1). This moderate success rate may reflect multiple factors affecting specimen quality, such as the long-term storage of serum, the viral RNA extraction procedures, the enzymes used in PCR amplification, and the viral loads of infected individuals (VL < 20,000 copies/mL). In addition, amplification of the HIV-1 envelope is known to be challenging, depending on the length of the sequence to be amplified. 27,28 Of the included HIV-1 env sequences (n = 102), 47% (n = 48) and 35% (n = 36) were classified as chronic and recent HIV infections, respectively, using the RITA (26), whereas 18% (n = 18) were found to be recent as confirmed by EIA-p24 testing (Fig. 1). The demographic characteristics are described in Table 1. Briefly, 34% of individuals were asymptomatic, whereas 13% presented acute HIV infection symptoms, and 11% had reached the AIDS stage (Table 1). The mean CD4 count in the total population was 375 cells/mm3, range 150]. The mean HIV-1 viral load in the total population was 4.94 log10 copies/mL, range [1.60-8.20], and the mean age was 38.8 years, range . Of the 102 individuals, 79.4% were infected with HIV-1 subtype B and 20.6% with HIV-1 non-B subtypes. Age and baseline CD4+ T cell counts differed significantly between recent and chronic infections using the Wilcoxon rank sum test. The mean age of recently infected individuals was 34.15, range , and for chronic infections, it was 44.03, range , p = .0002. The mean baseline CD4+ T cell count was 521, range 150] in recent infections and 186, range  in chronic infections.

FIG. 1.
Flowchart of HIV-1 envelope sequences used in this study. The figure presents the total number of specimens sampled (n = 262) and the final sequences obtained after amplification, sequencing, and sequence analysis processes (n = 102). EIA, enzyme immunoassay; RITA, recent infection testing algorithm.

HIV-1 partial envelope fragment-length sequence-based clustering using phylogenetic analysis
Of the 102 sequences analyzed, 4 were excluded because of insufficient env length. The remaining 98 sequences, of satisfactory and similar lengths, were included in the phylogenetic analysis. Of these 98 sequences, 57.14% (n = 56) did not form clusters, whereas 42.86% (n = 42) were part of 15 small clusters ranging from two to five individuals using the partial env fragment length (1,070 bp) at a distance cutoff of 0.1 (Fig. 2). HIV-1 subtype B envelope sequences accounted for 93.3% of the clusters identified (n = 14/15), labeled Clust1 to Clust15, with the exception of Clust4, which consisted of the HXB2 and BAL reference sequences. The non-B HIV-1 subtypes, which represented 7.14% and 6.7% (n = 1/15) and were labeled Clust16, are shown in Figure 2.

Agreement between the HIV-1 partial envelope fragment and the V3 loop-derived sequences as independent tools to perform HIV-1 transmission clustering
The env-V3 sequence-based clustering reproduced 10 of the 15 clusters previously identified with the env 1,070 bp sequence length. An additional cluster (C6), identified only with env-V3 loop sequence-based clustering, is shown in Figure 4.
The sensitivity, specificity, positive predictive value, and negative predictive value of env gp120-V3 in determining TCs were 70%, 98.6%, 95.5%, and 88.8%, respectively, compared with the partial env fragment length (1,070 bp)-derived sequences as the gold standard (Table 2). The concordance between the gold standard and env gp120 V3 cluster determination was assessed by Cohen's kappa coefficient and showed moderate, statistically significant agreement, κ = 0.59, p < .00001 (Table 2).

Agreement between the partial HIV-1 envelope fragment and the gp120 C2V3C3 region-derived sequences as independent tools for identifying HIV-1 TCs
We also compared the partial env fragment length (1,070 bp) with the gp120 C2V3C3 regions (HXB2 nt positions 6813 to 7376; 564 bp) in an attempt to improve on the moderate sensitivity of V3 sequence-based cluster identification. Surprisingly, the gp120 C2V3C3 regions identified only 6 of the 15 clusters (40%) observed using the partial env fragment sequence at a PWD of ≤10%. This was unexpectedly lower than the proportion of clusters identified by V3 loop-derived sequences alone (66.66%) against the same gold standard.
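The fragment lengths quoted for each region follow directly from the inclusive, 1-based HXB2 coordinates given in the text (length = end - start + 1). The sketch below checks that arithmetic and shows how such coordinates map onto a zero-based string slice; the helper names and the toy reference string are hypothetical.

```python
# HXB2 coordinates (1-based, inclusive) as quoted in the text.
REGIONS = {
    "partial env fragment": (6831, 7900),
    "gp120 V3 loop": (7110, 7217),
    "gp120 C2V3C3": (6813, 7376),
}


def region_length(start: int, end: int) -> int:
    """Length in bp of an inclusive 1-based coordinate range."""
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + 1


def extract_region(reference: str, start: int, end: int) -> str:
    """Slice a reference sequence using inclusive 1-based coordinates."""
    return reference[start - 1:end]


lengths = {name: region_length(*coords) for name, coords in REGIONS.items()}
for name, bp in lengths.items():
    print(f"{name}: {bp} bp")

# Toy reference (hypothetical) to show the 1-based to 0-based conversion.
toy_fragment = extract_region("ACGTACGTAC", 3, 6)
print(toy_fragment)
```

Applied to an ungapped HXB2 reference, `extract_region` would return the same spans; in a gapped alignment the column indices shift, so coordinates have to be mapped through the reference row first.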

Description of factors associated with HIV transmission clustering
The recently HIV-1-infected subjects defined by RITA 29 represented 53.4% of the TCs, followed by acutely infected individuals (EIA-p24+) at 23.3% and chronically infected individuals at 23.3% (Fig. 1 and Table 1). HIV-1 subtypes B and D comprised 93% and 7% of clusters, respectively. The distribution of the HIV-1 TCs by epidemiologic, clinical, and risk factors of acquisition is presented in Table 3.

Discussion
HIV-1 envelope sequence inference was used to reveal HIV-1 transmission clustering (networks) among individuals newly diagnosed in 2015 in Quebec, Canada. The first analyses used a partial env fragment length (1,070 bp) and identified 15 small TCs, including 42 HIV-infected individuals. Using only the env gp120 V3 sequences (108 bp), we identified 11 small clusters, thus reproducing 66.7% (10/15) of the clusters previously detected with the partial env fragment sequence length, which was considered the gold standard. We used the envelope loop 3 (V3) fragment as a comparison because it is the most conserved of the HIV-1 env hypervariable regions, and its sequences are frequently used to predict coreceptor tropism [44][45][46][47][48][49][50] and may be available in clinics for intention-to-treat analysis using entry inhibitors.
The agreement between these two approaches was moderate (Cohen's kappa coefficient (κ) = 0.59). We tested different cutoff values for V3 sequences (1% to 20%) against the partial env fragment sequences, but they did not improve the agreement between the approaches. This observation underscores and confirms that the length of the HIV-1 genome sequences analyzed influences TC determination. 26 Therefore, using only env V3 loop sequences may underestimate HIV-1 TCs.
Although we included two constant regions of gp120 (C2, C3) with the V3 loop in the hope of increasing the sensitivity of cluster detection relative to the HIV-1 partial env fragment, the sequences derived from the gp120_C2-V3-C3 regions unexpectedly identified only 40% of the clusters, fewer than the sequences derived from the V3 loop alone (66.66%). Thus, using a larger fragment comprising the two constant regions and the V3 loop decreased rather than increased sensitivity. It may be that not all of these regions are well conserved at the nucleotide level. The nucleotide sequence diversity introduced by the two constant regions (C2 and C3) may have raised the genetic distance between individuals to >10-15%, thereby reducing the sensitivity of the estimated number of clusters. We believe that increasing the length of the sequence up to 524 bp, including only constant regions (C2, C3 with V3), may not increase the sensitivity of cluster estimates compared with the partial env fragment length. However, the small env sequence data set used in this study does not allow firm conclusions. Further studies including a large env sequence data set and combinations of different env segments can help confirm the present result.

FIG. 2.
The cohort-derived clusters are shown in blue for subtype B (Clust1 to Clust15, except Clust4) and orange for subtype D (Clust16). The control sequence names are shown in red (Clust4). Sequences that did not form a cluster are shown in black. Each depicted cluster satisfied ≥99% support on bootstrap resampling of 1,000 replicates and a pairwise distance of 10%. 21

[...] sequence-based clustering. The linked clusters and sequences between the two trees are connected by a gray line.
We repeated the phylogenetic analyses of the gp120 C2V3C3 clustering and reached the same conclusions. In future work, we will consider a larger sequence data set and may evaluate different segments of the envelope sequence.
However, for screening in public health surveillance, V3 loop sequencing may be useful because it is routinely performed in clinical laboratories to determine virus tropism and guide the use of CCR5 inhibitors in treatment. A recent study has shown that the V3 loop is one of the most predictable segments of the HIV-1 envelope and that its sequences contribute to estimating the recency of an infection. 30 Hence, it may also be useful for real-time HIV-1 TC detection as a complement to existing methods.
Previous studies have already demonstrated that near full-length genome sequences (9,719 bp) of HIV-1 are the best tool for estimating the extent of HIV clusters. 26,51 However, the methodological and cost constraints associated with full-genome sequencing preclude its use in routine clinical epidemiology studies. The estimate of HIV TCs (size and extent) depends on the selected gene and sequence length. 21 It may also be biased downward (underestimation) or upward (overestimation) by the following: the cutoff value defined as genetic distance, 21 the type of sequencing (Sanger or NGS), the number of sequences in the data sets, the number of variable sites or the period covered by the study, 4,26 the timely availability of sequence data, 52 and the molecular phylogenetic methods used to define transmission chains. 4,53,54 This study covered a 1-year period (2015) and may therefore underestimate transmission events occurring outside of this period. If sequencing is performed on a regular basis, it may help track the early founders of nascent clusters before the growth that will be identified in subsequent periods, and help adapt prevention strategies for at-risk populations. According to the CDC guide of June 2018, the identification of closely linked sequences over a short period of time may indicate that transmission is occurring rapidly within a group. 55 The PWD cutoff value used for this study was ≤10%, as proposed by Novitsky et al. 21 We adopted this cutoff because there is no gold standard for HIV envelope sequence-based clustering, the study by Novitsky et al. 21 is the most recent, and its method of defining clusters (PWD) was better suited to the present study than others. Using a small cutoff value may underestimate cluster sizes, whereas using a large cutoff value may overestimate them. Therefore, the cutoff value has to be well defined, 5,8,21,56,57 as noted by Novitsky et al. 21 In particular, given the specificity of the different HIV-1 subtypes and CRFs, it may also be important to determine the best cutoff value when using sequences derived from the full-length HIV-1 envelope (gp160) or its specific subregions or domains (gp120, gp41) for cluster estimates, and to establish the correlation between these tools. Further studies using large sequence data sets are also encouraged to evaluate the performance of the near full-length HIV-1 genome and its three regions (gag, pol, and env), adjusted by subtype, as independent tools for determining HIV-1 TCs.
Earlier studies conducted by Brenner et al. 8,58,59 in Québec over longer periods and using a large sequence data set of the HIV-1 pol region (n = 1,277) identified 30 large TCs (20+/cluster) occurring over 13 years (2002 to 2015). 59 Lubelchek et al. also conducted a similar study in Chicago using a large data set of pol gene sequences and identified a single large TC among a total of 26 clusters observed. 11 Many HIV-1 TC studies have used pol gene sequences, as these sequences are often available from clinical laboratories performing drug resistance testing (53). These findings contrast with those of the present study, in which no large cluster was identified. The difference may be due to the limited env sequence data used in this study (n = 102) and the restricted period (1 year). 8,11,58,59 Nevertheless, for this 1-year period, we were able to identify and highlight the existence of HIV-1 clustering among newly HIV-1-diagnosed patients. The present study aimed to track nascent or forming clusters in real time to help adapt early prevention strategies that may limit the formation of large clusters.
Bivariate analysis showed that newly HIV-1-infected individuals were significantly more likely to be part of HIV-1 transmission clusters than chronically infected individuals, OR = 2.20, p = .04, using Fisher's exact test. This observation is in agreement with results presented by Brenner et al., 59 Dennis et al., 52,60 and Miller et al. 61 The results may be explained by the extreme fitness of the early founder viruses that successfully establish infections [62][63][64] and by the higher viral load during acute infection.
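For readers less familiar with the bivariate statistics reported here, the sketch below shows how an odds ratio and a two-sided Fisher's exact p value are obtained from a 2 x 2 table (rows: recent or chronic infection; columns: clustered or not clustered). The function names are hypothetical and the counts are a small textbook-style example, not the study's data.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Illustrative counts only (not taken from the study).
a, b, c, d = 3, 1, 1, 3
or_value = odds_ratio(a, b, c, d)
p_value = fisher_exact_two_sided(a, b, c, d)
print(or_value, round(p_value, 4))
```

An odds ratio above 1 means the outcome is more frequent in the first row than in the second; the exact test is preferred over chi-square when cell counts are small.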
Multivariate analyses identified two factors that were likely associated with TCs: age below the cohort mean (<38.8 years) and HIV-1 subtype B. Individuals younger than 38.8 years were more likely to form TCs than those older than 38.8 years [...]

Table 2.
The table presents the performance of env gp120-V3 sequence-based clustering (108 bp) using the partial HIV-1 envelope fragment length (1,070 bp) as the gold standard. The sensitivity and specificity were 61.9% (45.6-76.4) and 95% (86.1-99), respectively. In addition, the positive predictive value and negative predictive value were 89.7% (72.6-97.8) and 78.1% (66.9-86.9), respectively. Furthermore, significant and moderate agreement was documented using Cohen's kappa coefficient (κ = 0.59, p < .00001). CI, confidence interval; n, number of clusters identified by env gp120 V3 loop-derived sequences; N, number of clusters identified by the gold standard.

The objective of the present study was to evaluate tools to detect, as early as possible, newly HIV-infected individuals who have the potential to transmit and sustain HIV epidemics. Newly HIV-infected individuals are generally unaware of their infection status 66,67 and therefore can contribute to the spread of infection. 61,68 The high viral loads observed during acute infection [69][70][71] also enhance HIV transmission from newly HIV-infected individuals. HIV-1 viral load (>10,000 copies/mL) and CD4 counts >350 are generally biological factors significantly associated with HIV-1 transmission clustering. 57 Our study did not find a significant association of viral load or CD4 counts with clustering. The difference may be due to the fact that only 23% of [...] real-time follow-up of transmission networks regarding the early clusters of transmission established in a single year from newly HIV-infected individuals.
The short-term assessment of early cluster building may help improve quick responses to prevent HIV transmission by identifying the nascent or forming clusters and populations at high risk.

Conclusion
Our results confirm the presence of small HIV TCs among HIV-1-infected individuals in Quebec. Infection with HIV subtype B and age younger than 38.8 years were the two factors significantly associated with HIV transmission clustering. The HIV-1 partial env fragment length-derived sequences were able to detect clusters of transmission contributing to the persistence of the HIV epidemic in Quebec. Although less sensitive than sequencing of the partial env fragment, the V3-derived sequences were able to identify HIV TCs with moderate agreement. The latter tool (V3 sequence) may be useful for short-term assessment of nascent HIV-1 transmission clustering, in support of existing methods, for screening purposes.