Mixing patterns of HIV transmission among men who have sex with men in the United Kingdom

Background Near 60% of new HIV infections in the United Kingdom are estimated to occur in men who have sex with men (MSM). Patterns of mixing between different risk groups of MSM have been suggested to spread the HIV epidemics through age-disassortative partnerships and to contribute to ethnic disparities in infection rates. Understanding these mixing patterns in transmission can help to determine which groups are at a greater risk and guide prevention. Methods We analyzed combined epidemiologic data and viral sequences from MSM diagnosed with HIV as of mid-2015 at the national level. We applied a phylodynamic source attribution model to infer patterns of transmission between groups of patients by age, ethnicity and region. Results From pair probabilities of transmission between 19 847 MSM patients, we found that potential transmitters of HIV subtype B were on average 5 months older than recipients. We also found a moderate overall assortativity of transmission by ethnic group and a stronger assortativity by region. Conclusions Our findings suggest that there is only a modest net flow of transmissions from older to young MSM in subtype B epidemics and that young MSM, both for Black or White groups, are more likely to be infected by one another than expected in a sexual network with random mixing.


Introduction
Men who have sex with men (MSM) account for 40% of new diagnoses in Europe [1], and in the United Kingdom (UK), nearly sixty percent of new HIV infections are estimated to occur in MSM, although there is a recent sign of decline in transmission particularly recorded in London [2]. It has 5 been estimated that the largest contribution to transmission in the UK is attributable to young HIV positive MSM (in this case less than 35) [3]. More generally, since the early work from Morris et al. [4], it has been suggested that young MSM having sex with older partners represent a significant driver of the epidemic in North America [5]. This disassortative age mixing pattern has also been studied in interaction with mixing in ethnicity [6,7] or to illuminate disparity in HIV prevalence by ethnicity, which is a common feature of UK and US epidemics [8,9].
Studies based on partnership data have shown that having an older partner increases the risk of infection [10,11]. This result is also supported by 15 analyses of transmission networks based on phylogenetic clustering [12,13]. But clustering of genetically similar viruses is influenced by time since infection when patients are sampled, which is confounded by patients' age as well as CD4 and clinical stage of infection. Phylogenetic clustering is also confounded by the fraction of infected persons sampled, which makes direct 20 inference of transmission patterns difficult using genetic clustering [14][15][16].
When clustering is observed between older and younger MSM, it is suggestive of a flow from old to young because of higher prevalence in older MSM [12]. But, the direction of putative transmission events cannot be resolved by pairwise genetic distance alone, and it is not possible to estimate 25 flows of transmission between age groups based on clustering observations. In this study, we applied a phylogenetic source attribution (SA) method that infers the probability of potential transmission (infector probability) between pairs of patients among approximately 20 000 MSM diagnosed in the UK with available genetic sequences [17]. Source attribution methods 30 based on consensus pol-sequence data can not be used to infer transmission pairs with high confidence, but can provide useful insights when studied in aggregate over thousands of putative transmission pairs. In general, direction of transmission can not be inferred from consensus HIV sequence data, but in combination with clinical stage of infection at time of sequencing, 35 directionality can be inferred probabilistically in some cases, as when for example a patient with chronic infection is linked to a patient with early infection. By combining phylogenetic analysis of consensus-pol sequences with stage of infection data and independent estimates of incidence and prevalence in the population, we are able to quantify potentially imbalanced transmission patterns between different risk groups including different age groups.
We used HIV-1 sequences routinely collected for drug resistance testing, patient-level data informative of the time since infection to account for biased sampling, and epidemiological estimates of background prevalence and 45 incidence to account for potentially unsampled individuals that could be the sources of infection. In this study, by using estimates of transmission pair probabilities, our objective was to reveal patterns of transmission in men who have sex with men according to age, ethnicity, and geography. In particular, we searched for evidence of source-sink relationships in transmission 50 patterns between age groups and examined the hypothesis that there is a net flow of transmissions from old to young MSM. Midlands and East of England; North of England; Northern Ireland, Scotland and Wales. In analyses of assortativity, unknown category was treated as missing data.

Sequence processing
Partial HIV-1 pol sequences were sampled from 1997 to July 2015 with 80 a majority obtained after 2009. Sequences were aligned with MAFFT version 7 after adding 1780 international sequences from the Los Alamos HIV sequence database (http://www.hiv.lanl.gov/). The international sequences were added to infer importation of viral lineages, and were selected in a BLAST search that identified for each of the UK sequences the sequence with highest similarity (using bitscore) in the Los Alamos database. Drug resistance mutation sites as listed in the 2015 update from the IAS-USA were stripped from the alignment. Subtypes were determined with REGA version 3 and all further analyses were stratified by subtype.

Phylogenetic analysis 90
Phylogenetic trees were constructed with ExaML by maximum likelihood based inference with a gamma distribution model for rate heterogeneity among sites [20]. One hundred bootstrap replicates of each tree were computed to account for phylogenetic uncertainty. Trees were rooted at outgroup sequences that were added to each subtype alignment prior to tree 95 reconstruction.
We calculated root-to-tip distance and regressed distance by time from MRCA to sample. By iterations of Grubb's algorithm, we identified on overall 0.3% sequences as outliers in terms of divergence time and evolutionary rate. We applied least-square dating algorithm [21] on rooted trees 100 and sampling times to estimate the substitution rate and dates of ancestral nodes.
We analyzed separately the 4 main subtypes to account for different evolutionary rates. Fitch algorithm was used to reconstruct ancestral host status and determine distinct clades of virus transmitted in the UK. The 105 dated subtype B phylogeny comprised 18,484 taxa and for computational reasons was split into subtrees of <1000 sequences for further analyses.

Probabilistic source attribution
We applied a phylogenetic source attribution method that uses a populationgenetic model to derive probabilities that a given individual (donor) is the 110 source of infection for another individual (recipient) in the sample. These probabilities, termed infector probabilities, account for the epidemiological and sampling processes by incorporating into their calculation the timescaled phylogeny, patient data on stage of infection, and population-level data on occurrence of infection [17]. The method was evaluated in a previ-115 ous simulation study [16].
For population-level epidemic statistics, we used updated incidence estimates of CD4-based back-calculation method for MSM population and prevalence estimates of Bayesian multiparameter synthesis of surveillance data, as reported by Public Health England in 2017 [2]. To account for 120 uncertainty in those input parameters, we randomly drew 5 pair values of incidence and prevalence per bootstrap replicates (2000 in total) from normal distributions inferred from the credible intervals of those estimates. Incidence and prevalence were assumed to be proportional across subtypes.
The source attribution method uses a continuous time Markov chain 125 model to reconstruct the likely state of a lineage at the time of transmission given the CD4-stage of infection at time of sampling. The definition of stages of infection and progression rates were based on Cori et al. [19], as described in our previous analysis [16]. The method assumes that each infected patient corresponds to a single lineage of virus, ignoring multiple infections, and that 130 internal nodes in the phylogeny correspond to a transmission event between hosts. To limit calculations to non-negligible pairing, only coalescent events within a limit of 20 years prior to sequence sampling were incorporated to compute infector probabilities.

Statistical procedures 135
Infector probabilities W ij for each donor-recipient pair were averaged over all bootstrap replicates. To compare the mean age of donors and recipients we used a two-tailed paired weighted t-test, with pair-level infector probabilities as weights.
To characterize transmission patterns by patients' covariates, we first and recipients relative to random allocation. The matrix E has elements E kl = k M kl ⊗ l M kl / k l M kl , and represents the expected values in the absence of preferential mixing [22]. Matrix E allows the calculation of Newman's assortativity coefficient r = (Tr(M ) − Tr(E))/(1 − Tr(E)). The coefficient ranges from -1 to 1, where r = 0 when there is no assortative 155 mixing, r = 1 when there is perfect assortativity (every link connects individuals of the same type), and some negative value −1 ≤ r < 0 for a perfectly disassortative network (the lower bound depending on number of categories and density of subgraphs in each category). In all matrix-type figures, we represent transmission going from donors in columns to recipient 160 in rows.

Characteristics of the study population
The demographic and geographic composition of the 19,847 HIV-1 partial pol sequences from treatment naive patients diagnosed in the UK is 165 described in table 1. Most gay and bisexual men diagnosed in the UK were infected with subtype B (93%). Therefore the patterns of transmission inferred from reconstructed phylogeny of subtype B sequences are largely dominating that of all MSM patients. Patients infected with non-B subtype were on average sampled later (median year of 2008 for subtype B, 170 2009 for subtypes A1 and C, and 2011 for CRF02AG) and were on average younger (median age of 35 for subtype B, 34 for subtypes A and C, and 32 for CRF02AG).
In terms of ethnicity, the majority (84%) of patients were white persons. Patients infected with C or CRF02AG were more commonly of non-white ethnicity: Black-African for 11% and 16% and from other non-white ethnicity for 19% and 26% respectively.
In terms of geography, half of subtype B and 71% of subtype CRF02AG sequences were sampled in Greater London. Apart from London, subtype A was especially prevalent in North of England (27%).

Infector probabilities
Across 100 bootstrap tree replicates for each subtype, we computed infector probabilities for on average 554 514 potential transmission pairs of 14 603 patients (cf. table 2). Given the n by n matrix of probabilities that a patient i transmitted to a patient j, the sum i W ij represents the probabil-185 ity that the infector of j is in the sample. This quantity, denoted 'in-degree' in table 2, indicates that on average 36.6% (95%CI[35.2 -38.0]) of potential donors are included in our sampled population. Our estimates of in-degrees were moderately influenced by the variation in inputs of background incidence and prevalence, with lower incidence (or higher prevalence) increasing 190 average in-degrees as the probability of an unsampled intermediary transmitter is decreased (cf. figure S1). Table 3 shows the mean difference in age between donors and recipients, weighted by infector probabilities. A significant difference is only detectable 195 for subtype B, donors being on average 5 months older than recipients. For subtype B, most transmission pairs in our sample involved individuals less than 30 years old (figure 1m). The largest proportion (46%) of infection acquired by young individuals was attributable to individuals in the same age category (figure 1r). And a strong assortativity in transmission mixing 200 is seen in this youngest age category, indicating that young MSM are preferentially infected by young MSM. This preferential mixing is also seen in among individuals over 44 years. The overall assortativity coefficient was moderate with r = 0.16. Similar transmission patterns between age groups were observed for subtypes A and C (table S2). However transmission of 205 subtype CRF02AG was characterized by a strong assortativity mostly in the oldest age category but more intergenerational mixing between other categories(figure S2a). Despite the lack of significant difference in average age of donors relative to recipient shown previously for subtype CRF02AG, the most probable infector for individuals from intermediate age quartiles 210 (30-36 and 37-43) was younger (less than 30) (figure S2r).

Transmission by ethnicity
The vast majority (85%) of MSM infected with subtype B viruses were of white ethnicity. We estimated that 82% of all transmissions in our sample occurred between white individuals, and that recipients of all ethnicities 215 had a majority of white donors. The probability of having been infected by a white individual was 92% for whites, 77% for Indian/Pakistani or Bengladeshi, 75% for other Asians, 55% for Black Africans and 54% for Blacks Caribbeans. Conversely, a majority of transmission originating from donors of any ethnic group was estimated to affect white recipients. Fig-220 ure 2a shows the level of assortativity in transmission of subtype B viruses between ethnic groups. Inter-ethnic transmission (cumulated pair probabilities outside the diagonal) represented 17% on overall and 58% when excluding the white category. Overall assortativity was moderate (r = 0.17) but a preferential mixing was especially observed within and between all black 225 ethnic groups and within the South Asian group.
We estimated the probability of transmission of subtype B viruses between young (<30) and older MSM (30+) either from white or black ethnicity (figure 3). The relative excess of transmission within age categories observed previously is observed for both white and black ethnicities, and 230 overall assortativity by age was similar (r = 0.25 for white and 0.28 for black). However, for a given older MSM the probability of transmitting to a young MSM was higher in black (39%) than in white ethnic group (22%).

Transmission by geographical region
Analyses of transmission by region show the largest level of assorta-235 tivity, indicating a overall strong spatial structure of the epidemics (see figure 2b). Assortativity coefficients were 0.56 for subtype B and 0.49 for subtype CRF02AG. For those two subtypes, figure 4 shows the probability for a donor in a given region to transmit to a recipient of each respective region. Scotland and Wales (70%) to infect recipients in London than individuals within the same region.

Discussion
The objective of this study was to describe patterns of HIV transmission between age, ethnicity and geographical categories in the United Kingdom.
We used a phylodynamic inference based on sequences collected among diagnosed MSM which accounts for incomplete sampling and stage of infection at sampling time. By modelling an epidemic process that is compatible with the evolution of transmitted viruses and epidemiological surveillance data, we characterized past transmission events among nearly 15 000 MSM 255 patients at the national level.
Pair probabilities averaged over phylogenies and aggregated by age groups indicated a modest overall net flow of transmission from older to young MSM. This result is compatible with other studies reporting co-clustering of young and older patients [12, 13] as we do not observe pure assortative 260 mixing, with probable transmission occurring in both directions across age groups. But our results indicate that on average, flow from old to young is mostly compensated by the transmission from young to old (cf. figure 1). And when the flow is imbalanced, as for transmission of subtype B viruses, the difference is small. 265 We observed an overall preferential mixing in transmission by age with greater assortativity both in the youngest and oldest age groups and more random mixing in intermediate age groups. Understanding age mixing patterns in transmission can help to determine which groups are at a greater risk and potentially guide public health interventions [23]. Our findings 270 confirm that young MSM infect one another more than expected by random mixing which supports the idea that prevention benefit could be enhanced by focusing on this small group [24]. This result also corroborates the observation of recent clusters of young MSM sustaining the epidemic in the Netherlands [25]. 275 We showed an overall preferential pairing by ethnicity in conjunction with an important mixing between white men and men from each other ethnicity. It can be explained by the overwhelming proportion of white men in the population. But in non-white groups, more than a half of transmission was inter-ethnic, revealing that a substantial amount of transmission has 280 occurred between ethnic groups among MSM. A similar pattern for sexual partnership between ethnic groups was reported in Britain [8]. Although we found a relatively higher assortativity among black MSM in general and a non negligible mixing between black ethnic groups from different origins (African, Caribbean and other), HIV transmission seems less assortative 285 among black MSM in the UK than it is in the USA [26].
We assessed whether intergenerational transmission was different in white and black MSM and found a similar level of age assortativity in both groups. Therefore as others in the US context [7] we did not find support in our findings to explain a disparity in HIV prevalence by age mixing [5, 6].

290
Finally, we found a strong geographical structure for the epidemics among MSM, with region of diagnosis as the variable associated with the highest level of assortativity. However, region of diagnosis can be different than the region of residency or of actual transmission which is not known, which may lead to an underestimation of the true level of geographical structure.
Several potential limitations of our study relates to the assumptions of the phylogenetic inference and source attribution method. First, as stated in methods section, the source attribution method neglects some effects of within-host evolution which can cause discordance between phylogenies and transmission trees [27]. This approximation is reasonable if within-host evo-300 lution generates coalescence time considerably shorter than between hosts at the population level. Secondly, we incorporated crude estimates of incidence and prevalence in the inference of infector probabilities. These were assumed constant over the period and proportional across subtypes. However variation of these inputs within credible limits had limited impact on 305 average infector probabilities (figure S1). Third, the direction in transmission was derived from CD4 count and RITA result data that were partially complete.
Nevertheless, our analysis aimed to improve the use of phylogenetic information relative to genetic clustering in two ways. First, by providing a 310 rough measure of transmission probability which unlike linkage into clusters can indicate a directionality and gives more weight to pairs with higher credibility. Notably, output matrices and patterns between groups would be symmetrical if based on clustering. Secondly, by correcting for biases stemming from incomplete sampling of the infected host population. Lastly, 315 the source attribution method was fast to compute and scaled easily to phylogenies based on many thousands of sequences.
Future directions for this work include applying the analysis to the heterosexual population, where phylogenetic information could contribute to assess age disparity in mixing across gender. [28,29]. Another direction 320 would be to use methods exploiting next-generation sequencing, that account for within-host evolution and enhance resolution in identifying transmission [27,30].
In conclusion, this study has leveraged available patients data and viral sequences to provide evidence of assortativity in HIV transmission by age, ethnicity and geography. Understanding these patterns of transmission is important to modelling the impact of intervention strategies.  Year of sampling    Figure 1: Patterns of transmission of HIV subtype B by age in quartiles. The four graphics depict transmission from donor categories in column to recipient categories in row (from x-axis to y-axis). Axes labels represent the upper bound of quartiles of age. M: Each cell represents the proportion of overall transmissions from one category to an other, with higher proportion in lighter shade, i.e. the highest amount (14%) of transmissions involved donors and recipients both aged less than 30. R: Each row represents the probability distribution for a given age category of recipients of having been infected by donors by age, i.e. 25% of recipients less than 30 years old were infected by donor aged 30-36. D: Each column represents the probability distribution for a given age category of donors of having transmitted to recipients by age, i.e. 28% of donors aged 37-43 infected recipients aged 30-36. A: The assortativity matrix indicates that, relative to random mixing more transmissions occurred within the same age category, particularly for the oldest and youngest. Assortativity coefficient r = 0.16.  Supplementary material for the article "Mixing patterns of HIV transmission among men who have sex with men in the United Kingdom" Figure S1: Influence of variation in incidence and prevalence inputs on mean indegree. Incidence (number of new infections per year) and prevalence (number of persons living with HIV) were drawn independently 5 times per each of 100 phylogenetic trees for subtype B transmission. Under the source attribution model, the in-degree is the sum of infector probabilities incoming to an individual. It also represents the probability that its infector is in the sample. The mean in-degree tends to increase as prevalence is increased or as incidence is lowered, as it decreases the probability of an unsampled source case. Both variations have a limited impact on the average in-degree estimates as indicated by correlation coefficient R.   Table S2: 95% confidence intervals for proportion of transmission by age quartiles. M, R and D columns refer to the matrices described in section 2.5 and represented in figure 1. Column labels for donors and row labels for recipients represent the upper bound of quartiles of age.