HIV-1 Transmission Clustering and Phylodynamics Highlight the Important Role of Young Men Who Have Sex with Men

Abstract More persons living with HIV reside in the Southern United States than in any other region, yet little is known about HIV molecular epidemiology in the South. We used cluster and phylodynamic analyses to evaluate HIV transmission patterns in middle Tennessee. We performed cross-sectional analyses of HIV-1 pol sequences and clinical data collected from 2001 to 2015 among persons attending the Vanderbilt Comprehensive Care Clinic. Transmission clusters were identified using maximum likelihood phylogenetics and patristic distance differences. Demographic, risk behavior, and clinical factors were assessed evaluating “active” clusters (clusters including sequences sampled 2011–2015) and associations estimated with logistic regression. Transmission risk ratios for men who have sex with men (MSM) were estimated with phylodynamic models. Among 2915 persons (96% subtype-B sequences), 963 (33%) were members of 292 clusters (distance ≤1.5%, size range 2–39). Most clusters (62%, n = 690 persons) were active, either being newly identified (n = 80) or showing expansion on existing clusters (n = 101). Correlates of active clustering among persons with sequences collected during 2011–2015 included MSM risk and ≤30 years of age. Active clusters were significantly more concentrated in MSM and younger persons than historical clusters. Young MSM (YMSM) (≤26.4 years) had high estimated transmission risk [risk ratio = 4.04 (2.85–5.65) relative to older MSM] and were much more likely to transmit to YMSM. In this Tennessee cohort, transmission clusters over time were more concentrated by MSM and younger age, with high transmission risk among and between YMSM, highlighting the importance of interventions among this group. Detecting active clusters could help direct interventions to disrupt ongoing transmission chains.


Introduction
A better understanding of HIV transmission dynamics is needed in areas of the Southeastern United States where HIV incidence has not significantly declined. Southern U.S. states are an epicenter of the national epidemic, ac-counting for nearly 44% of people living with HIV, but only one third of the U.S. population. 1 The region also faces significant demographic disparities with disproportionate HIV burden among racial/ethnic minorities, in nonurban areas, and among men who have sex with men (MSM). 1 In addition, increased rates of opiate injection drug use in the region, especially in rural areas, are of concern for rapid dissemination of HIV such as occurred in Scott County, Indiana in 2014. 2 In Tennessee, new HIV diagnoses remain largely concentrated among MSM. 3 However, over 43% of counties in Tennessee are at high vulnerability for rapid HIV or HCV dissemination based on composite indicators (i.e., drug overdoses and low income levels) associated with acute HCV infection. 4,5 Tennessee is one of three states, including West Virginia and Kentucky, with the largest concentration of such vulnerable counties in the United States. 4 Thus, continued HIV surveillance to direct prevention activities among these high-risk groups is needed.
Molecular epidemiological approaches can help delineate local HIV transmission dynamics, monitor trends in the epidemic, and may assist in the design of targeted interventions. HIV sequences can be used to reconstruct HIV transmission networks or clusters 6 and reveal links between subepidemics overlapping in geography, time, and social/sexual interaction. Such links are difficult to detect through traditional surveillance such as partner contact tracing, which is resource intensive. With the timely identification of genetic links between subepidemics, targeted interventions, such as increased resource allocation for prevention services employed toward demographic subgroups (geographic or populations), may more effectively curb HIV incidence. However, the success of such interventions hinges on the ability to detect and intervene on new and growing transmission clusters.
Little is known about HIV molecular epidemiology in Tennessee. In this study, our objective was to characterize HIV transmission patterns in a cohort in middle Tennessee through molecular cluster and phylodynamic analyses. We evaluated the degree of local HIV transmission, diversity, and transmission risk behaviors among persons in HIV care in the region through assessment of cluster growth and associations with membership in active clusters that were newly recognized or expanded based on previous clusters. Furthermore, we applied phylodynamic modeling to estimate trends in transmission among MSM.

Study population
We conducted a retrospective cohort study of persons living with HIV (PLWH) attending the Vanderbilt Comprehensive Care Clinic (VCCC) in Nashville, Tennessee. The catchment area includes metropolitan Nashville and surrounding rural communities and represents an estimated 80% of HIVpositive persons in middle Tennessee who are in active care. The VCCC maintains a clinical and laboratory database with routine abstraction and validation from electronic medical records. Persons were included in this study if they were ‡18 years of age at clinic entry, had ‡1 HIV-1 pol sequence sampled from 2001 to 2015, and had ‡2 HIV primary care visits at VCCC within 12 months. The Institutional Review Board at Vanderbilt University approved this study (IRB No. 161368).

Demographic and clinical variables
We evaluated demographic variables including sex, age and year of diagnosis, race/ethnicity, and country of origin. The area of residence was estimated from three-digit zip codes and distributed in three categories: metropolitan Nashville (code 372), middle Tennessee (surrounding metropolitan Nashville, codes 370 and 371), and other (all other codes). HIV risk behaviors were categorized as MSM, heterosexual, people with history of injection drug use (PWID), and other/unknown. PWID also reporting MSM or heterosexual risk were categorized as PWID. These risk behaviors are patient reported during clinical care and are abstracted from medical records. Clinical data included CD4+ lymphocyte count and HIV-1 RNA viral loads collected £60 days of the earliest available sequence and antiretroviral therapy start dates to identify sequences sampled pre-antiretroviral exposure. Pre-therapy genotypes became routine after 2006. 7

HIV-1 sequences
Full-length protease and partial reverse transcriptase HIV-1 pol sequences were abstracted from drug resistance genotypes performed by LabCorp (Laboratory Corporation of America, Burlington, NC) or the Vanderbilt University Medical Center (all genotypes sampled after October 2010). Most sequences were generated with the ViroSeq HIV-1 Genotyping system. Sequences were aligned using MUS-CLE 8 and edited manually in Bioedit. 9 Gapped positions were stripped and the final sequence length was 1497 bases. HIV-1 subtypes and recombinants were identified with the Context-based Modeling for Expeditious Typing (COMET HIV-1) tool. 10

Cluster analyses
A maximum-likelihood (ML) phylogenetic tree was constructed in FastTree v.2.1.4 11 with the general time reversible model of nucleotide substitution using the earliest available sequence from each individual. Statistical support of clades was assessed with local support values (Shimodaira-Hasegawa-like test) in FastTree. Pairwise patristic distances (total path length separating tips in the phylogeny) were calculated using a custom Python script 12 (https://github .com/ArtPoon/bioinfo/blob/master/graphmaker.py). Putative clusters were defined as pairwise patristic distances of £1.5% divergence between ‡2 individuals. Of note, sequences linked in clusters do not necessarily represent direct transmission between persons, as unsampled persons may be involved in transmission chain, nor can directionality be inferred. Clusters within the phylogenetic trees were manually examined to ensure sequences from clusters grouped together with high branch support. Cluster diagrams were visualized using Graphviz v.2.38. The distribution of minimum pairwise patristic distances between individuals by various demographic factors were visualized using Sina-Plots. 13

Statistical analyses
We defined potentially ''active'' clusters as clusters that included sequences from the most recent sampling timeframe (2011)(2012)(2013)(2014)(2015). These clusters could be either newly identified (include only sequences from 2011 to 2015) or represent new growth on an existing cluster (including sequences from 2001 to 2010). Descriptive statistics and bivariate associations were used to assess differences in cluster membership using the chi-squared test for categorical variables and Kruskal-Wallis test for continuous variables. Independent predictors of membership in active clusters were assessed with logistic regression models. A multivariable model was based on inclusion of all variables with p < .2 in bivariable analyses. Data were analyzed in Stata 13.0 (StataCorp, College Station, TX).

Phylodynamic analyses
More recently developed phylodynamic methods can estimate transmission patterns, while accounting for unsampled individuals and variation in sampling times, as well as adjusting for stage of infection at time of sampling, although such methods are more computationally demanding than clustering.
We estimated transmission among MSM using structured population genetic models fitted to dated phylogenies of HIV-1 subtype B 14 with approaches that have previously been validated by simulations, 15 and that have been applied to other HIV sequence data. 16 These methods have also recently been employed among MSM in the United Kingdom. 17 The models were designed to allow for changing epidemic size and incidence through time, different transmission rates among young versus older MSM, different transmission rates over the course of HIV infection, and different probabilities of transmission between age groups. The model also accounted for importation of HIV lineages into Tennessee from the global HIV reservoir. A compartmental mathematical model was developed in terms of ordinary differential equations, which describe changing epidemic size and incidence among young and older MSM. We defined young MSM (YMSM) as those with an age in the bottom quintile of age distribution at the time of sequencing (threshold 26.4 years of age). A full specification of the mathematical model and diagram representing model structure is provided in the Supplementary Data and Supplementary Table S1 (Supplementary Data are available online at www.liebertpub.com/aid).
A time-calibrated phylogeny was produced by combining the VCCC subtype B sequences with the 2016 reference web alignment of HIV-1 pol sequences in the Los Alamos HIV Sequence Database (LANL) (at least 1,000 base pairs long spanning the VCCC alignment, selecting the closest reference to each VCCC sequence). Exact matches (to reduce the possibility that sequences were from the same person) and redundant sequences were removed, yielding 3,283 total sequences (including 486 references); these references were used to estimate the rate of introduction of HIV from outside the sampling area. Including close matches from LANL allows us to distinguish which parts of the phylogeny correspond to virus evolution taking place within Tennessee. An ML tree was reconstructed assuming an SRD06 model of substitution 18 using IQTREE v.1.5.3. 19 Tips were dropped if the sampling year was missing. The treedater algorithm 20 was used to obtain a time-calibrated phylogeny, assuming a strict molecular clock and enforcing temporal constraints on the node ordering. As sampling times were only available as years, treedater also estimated the sampling dates within each year. As fitting structured coalescent models to trees with over 2000 taxa was not computationally feasible, the phylogeny was divided into four nonoverlapping clades with times to the most recent common ancestor in the early 1980s for downstream analyses.
The mathematical model was converted to a structured coalescent model and fitted to dated phylogenies and associated metadata (age at the time of sampling and CD4+) using the phydynR R package (https://github.com/emvolzphylodynamics/phydynR). The model was fitted using a Bayesian MCMC algorithm implemented in R. The main parameters estimated were the relative risk of transmission among YMSM, the probability that a YMSM will transmit to another YMSM, and the probability that an older MSM will transmit to a YMSM. In addition, we estimated the relative risk of transmission during early HIV infection defined by CD4+ count >500 cells/lL and the rate of importation of HIV lineages from outside Tennessee. A full specification of parameters and a complete description of the model fitting procedure are provided in the Supplementary Data.
The distribution of minimum pairwise patristic distances differed between several demographic factors ( Fig. 1). Persons <30 years of age at first sequence had lower median minimum distances compared to older persons (0.014 substitutions per site vs. 0.036 for those ‡30 years of age; p < .001). Minimum distances were also lower for persons with subtype B compared to those with non-B subtypes, MSM risk compared to women or men reporting heterosexual risk, and black race compared to whites and Latinos, and for those diagnosed during 2005-2015 (all p < .001).

Characteristics of transmission clusters
In the cluster analysis, 33% (n = 963) of persons were identified in a cluster based on <1.5% patristic distance (Table 1). There were 292 clusters (size ranged 2-39 persons); 64% were in pairs (n = 2), 32% included 3-9 persons, and 4% were ‡10 persons. Most cluster members (n = 690, 71.7%) were in 181 active clusters (containing at least one sequence sampled during 2011-2015); the remaining 111 clusters were designated historical clusters (n = 273 persons, 28.3%) (The entire network is available in Supplementary  Fig. S1). Of 181 active clusters, 101 (55.8%) were growing clusters and 80 (44.2%) were newly recognized clusters (included only sequences sampled during 2011-2015). In bivariable analyses, cluster members differed by persons who were not identified in any cluster by multiple characteristics ( Cluster composition was assessed by majority risk behavior, race group, youth (defined as majority of cluster members <30 years of age), and area of residence. While most clusters were majority MSM (52.7%; 154/292), active clusters were significantly more likely to be majority MSM (59.7%; p < .01). Only four clusters were composed of majority PWID (involving total 30 persons); two were historical (both comprised three men) and two were active clusters (cluster sizes 4 and 20 persons, including >25% women). Among racial composition, most clusters were either majority black (40.8%) or white (39.7%). There was no significant change in composition between active and historical clusters with regard to race; 40.3% active clusters were majority black and 39.8% were majority white ( p = .9 trend). Active clusters were significantly more concentrated by  Fig. 2). These large clusters included 187 persons, most were male (87.7%) and 10/12 were majority MSM. Only one large cluster was mostly PWID (Cluster ID 15) and one was majority heterosexual (Cluster ID 47, notably, including 15 women). Of the 12 clusters composed of non-B subtypes, 11 were male-female pairs [subtype C (8), CRF02_ AG (2) and CRF01_AE (1)] and one pair were MSM (CRF07_BC).

Correlates of clustering
Among the subset of 1,027 persons with sequence sampled during 2011-2015, we evaluated correlates to membership in active clusters. In total, 400 persons (39%) newly sampled in this later timeframe were in 181 active clusters. In the multivariable analysis, MSM risk, age <30 years of age [odds ratio (OR) = 2.48; 95% confidence interval (CI) 1.85-3.31], higher CD4+ cell counts, and higher log 10 HIV viral loads were associated with cluster membership ( Table 3). Persons of black race, residing outside middle Tennessee, or with sequences sampled in calendar years that are more recent were less likely to be in these active clusters.  Phylodynamic modeling of HIV-1 subtype B phylogenies confirmed the importance of YMSM, who are inferred to transmit infection at a higher rate and are more likely to transmit to one another than expected by chance. We estimated the transmission risk ratio for YMSM (£26.4 years) relative to older MSM to be 4.04 (95% credible interval: 2.85-5.62). YMSM were much more likely to transmit to one another than other MSM (Fig. 3). The probability that a transmission from a YMSM went to another YMSM was 53% (95% CI 39%-70%), despite YMSM comprising only 20% of the sampled population. The probability that a transmission originating in older MSM went to YMSM was 10% (95% CI  . Each bar represents the proportion of transmissions, which flow from old to each age group. Color images available online at www.liebertpub.com/aid 7%-15%). We inferred a small net-flow of transmissions from young to old, which lacked statistical significance, in contrast to recent clustering results showing a net flow of transmissions from older to younger MSM. 21 The ratio of transmissions from young to old versus old to young was 1.7 (95% CI 0.89-2.57). Comparatively, the clustering analysis also showed possible association of YMSM to YMSM transmission. Of 323 YMSM, 201 (62%) were identified in any cluster. Of these, 72% (145/201) were in a cluster with another YMSM ( p < .001). The phylodynamic model also included a high CD4 compartment (CD4 > 500) to assess risk during early infection. We found only a modest increased risk, which lacked significance [1.27 (95% CI 0.86-1.76)].
We also estimated the rate that lineages flow into Tennessee due to migration of PLWH as well as transmission to Tennessee residents from outside of the state. This rate was estimated to be 0.197 per lineage per year, which implies that the average age of an HIV clade (the direct ancestor of a virus lineage) circulating in Tennessee is *5 years. Although it was not the primary aim of phylodynamic modeling, the fitted model provides an estimate of the number of new HIV-1 subtype B infections through time. We estimate that in 2015, there were 475 (95% CI 261-678) new infections, of the same order as the number of reported new HIV cases among MSM (n = 433 per year from 2010 to 2015) combined with those with unreported exposure (n = 101 in 2015). 22 Because our coalescent method accounts for incomplete sampling, our results imply that the sampling fraction (proportion of PLWH with a sequence included in the analysis) among MSM was large because the estimated number of new infections is similar to recent estimated number of diagnoses among MSM.

Discussion
We investigated HIV transmission patterns among persons who received care in middle Tennessee through genetic cluster and phylodynamic analyses with sampling extending 2001-2015, identifying over one third of the study population linked in closely related clusters. Over 60% of clusters were potentially active, including sequences collected between 2011 and 2015, representing new cluster detection or expansion on existing clusters. Younger age and MSM risk were associated with membership in these active clusters, which were also increasingly concentrated among young persons. Integration of phylodynamic modeling confirmed the importance of YMSM, as we estimate a high probability of transmission between YMSM by transmission rate ratios. While we found few clusters concentrated by PWID, many involved MSM and residents outside Nashville, indicating potential bridging between risk groups and regions. Continued surveillance among these high-risk groups with prospective cluster analysis would allow timely detection of new or expanding clusters that could be targeted for intensified prevention.
Genetic cluster and phylodynamic analyses provide information on HIV transmission that would be difficult to ascertain from traditional epidemiological survey data, especially in populations where infection is endemic. Clusters, when integrated with demographic data, provide insight into groups contributing disproportionately to ongoing, localized transmission. Phylodynamics enable more refined estimation of transmission risk, including between groups and locations. Our study is the first characterization of HIV transmission clusters in Tennessee and one of the largest among U.S. studies relying on data from a single clinical cohort. While Southeastern states are at the epicenter of the U.S. HIV epidemic, 1 few areas in the region investigated local HIV transmission dynamics using molecular epidemiology. [23][24][25] Studies based on national molecular HIV surveillance [26][27][28] also lack significant representation from southeastern Appalachian states, including Tennessee. These states are areas of concern for increased HIV risk due to concentration of the opioid epidemic. 4 In Tennessee, HIV transmission remains largely concentrated among MSM 3 as shown in our study.
We found that younger age is associated with cluster membership and that clusters are becoming increasingly concentrated by younger persons. While this association can be driven by that fact that persons are younger at diagnosis and more likely to be recently infected, our findings correspond with the phylodynamic analysis, which adjusts for stage of infection at the time of sampling. While our analyses unsurprisingly show more clustering among YMSM (following trends in surveillance data), our phylodynamic methods infer patterns of sources and recipients of infection at a population level. In contrast to other reports, which implicated transmission links among disparate age groups suggesting partnerships with older men, 21,28,29 we estimate a high transmission risk between YMSM. Such findings have implications for prevention by uncovering core groups where interventions could potentially have greater impact. 17,30 Mixing of different age groups in transmission linkages varies by race/ethnicity; young black and Latino MSM were less likely to have older transmission partners in national molecular surveillance, suggesting these groups are acquiring HIV from other YMSM. 28 Other U.S. studies also found younger age as a determinant of clustering, including in North Carolina, 23 Chicago, 31,32 and Seattle, 21 and in national molecular surveillance. 26 Our results harmonize with recent estimates reporting an increase in incidence among MSM aged 25-34 years of age from 2008 to 2014. 33 A greater proportion of younger persons are estimated to be unaware of their HIV infection (51%, 13-24 years of age, vs. 13%, among PLWH). 34 In addition, adolescent MSM report more high-risk behaviors compared to heterosexual youth. 35 We found evidence suggesting assortative mixing by race among transmission clusters. Most clusters were composed of mostly black (40.8%) or white (39.7%) persons, the most prevalent race groups in the study. These findings are congruent with several transmission network studies, including in Mississippi 25 and North Carolina. 23 Highly assortative mixing by race/ethnicity was found in U.S. surveillance data, particularly among blacks 26 and black MSM. 28 Altogether, these finding suggest that most transmissions occur within race groups, and support tailored interventions addressing risks faced by minority groups.
Further investigation into the geographical extent of HIV transmission networks is important, particularly with regions experiencing increased numbers of PWID. While new HIV diagnoses in Tennessee are largely reported in urban counties and among MSM, 3 many of these areas are in close proximity to rural counties predicted to be at high risk for an HIV and/or HCV outbreak. 4 Given that HIV outbreaks among PWID can rapidly occur, 2 understanding existing network structure between regions can be useful to target prevention resources. Our study is limited by lack of granular geographic data as we analyzed three-digit zip codes, which cover wider-than-optimal geographical areas. Despite this, we found that many clusters contained a person residing outside middle Tennessee, which provides some insight into geographic bridging between regions. Similarly, in Chicago, nearly 72% of clusters of MSM involved several regions. 32 Among clusters of black YMSM in Mississippi, 52% included residents involving multiple regions. 25 Further research incorporating more refined geographical and partner contact data for PWID can help better define geographic correlates to transmission.
In addition to clustering-based analyses, which can be readily applied to large sequence datasets, we also conducted a more computationally intensive phylodynamic analysis using structured coalescent models. A major limitation of genetic clustering analyses is that the connections comprising clusters do not have an unambiguous interpretation-for instance, they do not represent transmission events. Phylodynamic models have the advantage of being more robust to sampling bias, as well as providing more mechanistic interpretations, in terms of within-and between-group transmission rates. The results of the phylodynamic analyses were largely concordant with the clustering-based analyses, reinforcing the role of YMSM in the epidemic in middle Tennessee. Phylodynamic analyses also demonstrated that valuable results on transmission risk can be extrapolated from sequence data, despite frequent imports of HIV into this region. Transmission rates between major metropolitan areas could be estimated using similar methodology. In this study, we selected background samples from LANL to provide a diverse background to identify linkages that were likely imported into Tennessee as opposed to circulating endogenously.
Our findings should be interpreted in light of the inherent limitations of HIV phylogenetic analyses. In this study, samplings bias was possible because sequences were only from persons engaged in clinical care, who had a genotype available, which gives an incomplete view of transmission networks. In addition, directionality and direct linkage of transmission cannot be ascertained because additional unsampled parties may be involved in the transmission chain; however, phylodynamic analysis did not require inference of directionality or directness. While information on timing of infection can be used to assist inference of directionality, only weekly informative data in the form of CD4+ counts were available. Furthermore, we did not have infection or seroconversion dates and clusters were defined by sampling dates, which limits confirmation of our conclusions. Most individuals are identified relatively late in infection, and comparison of the VCCC sequences with publicly available data using a phylodynamic model highlighted the high rate of introduction of infections from outside the sampling area.
Combined with epidemiologic surveillance data, phylodynamic and transmission network analyses have the potential to inform HIV prevention. While HIV sequence data are increasingly available and being used to identify HIV genetic clusters, most of these studies are done retrospectively 6 and lack phylodynamic models. Prospective analyses are needed to test whether identification of putative subepidemics can help guide surveillance and prevention efforts. 36