# Scale-Free Spanning Trees and Their Application in Genomic Epidemiology

Publication: Journal of Computational Biology

Volume 28, Issue Number 10

## Abstract

**We study the algorithmic problem of finding the most “scale-free-like” spanning tree of a connected graph. This problem is motivated by the fundamental problem of genomic epidemiology: given viral genomes sampled from infected individuals, reconstruct the transmission network (“who infected whom”). We use two possible objective functions for this problem and introduce the corresponding algorithmic problems termed**

*m*

**-SF (-scale free) and**

*s*

**-SF Spanning Tree problems. We prove that those problems are APX- and NP-hard, respectively, even in the classes of cubic and bipartite graphs. We propose two integer linear programming (ILP) formulations for the**

*s*

**-SF Spanning Tree problem, and experimentally assess its performance using simulated and experimental data. In particular, we demonstrate that the ILP-based approach allows for accurate reconstruction of transmission histories of several hepatitis C outbreaks.**

## 1. Introduction

Viral outbreaks continue to be major causes of morbidity and mortality. The ongoing pandemic of the coronavirus SARS-CoV-2 (Huang et al., 2020) is a vivid example, but long-standing epidemics of HIV, hepatitis B virus, and hepatitis C virus (HCV) are hardly less damaging (Kilmarx, 2009; Hajarizadeh et al., 2013). Viral epidemics are complex processes defined by evolutionary dynamics of pathogens and social dynamics of susceptible populations (e.g., individual behaviors, social interactions, and mobility patterns).

Recent advances in sequencing technologies invigorated the field of genomic epidemiology (Armstrong et al., 2019; Knyazev et al., 2020) that aims to use viral genomic data to understand the epidemiological dynamics of pathogens. The fundamental algorithmic problem of genomic epidemiology could be formulated as follows:

•

Given viral genomes sampled from

*n*infected individuals, infer a transmission network indicating who of them infected whom (Knyazev et al., 2020). If each individual is supposed to be infected only once, then a transmission network is a tree called a*transmission tree*.This problem has been approached by a variety of methods (Jombart et al., 2011, 2014; Sledzieski et al., 2019; Wertheim et al., 2014; Campo et al., 2016; De Maio et al., 2016; Klinkenberg et al., 2017; Skums et al., 2018). One family of methods is based on the so-called network approach. It is particularly popular among researchers of HIV and HCV and has been adopted as a standard methodology for outbreak investigations carried out by the CDC (Wertheim et al., 2014; Campo et al., 2016; Campbell et al., 2017; Kosakovsky Pond et al., 2018; Ramachandran et al., 2018; Ragonnet-Cronin et al., 2019). This approach usually consists of two stages. First, a weighted

*relatedness graph G*is constructed. Its vertices represent infected hosts, and edges connect the hosts whose viral populations are close to each other according to a selected population genetics measure. Often_{R}*G*itself supplies enough information for epidemiologists and provides a_{R}*fast and scalable alternative*to phylogenetic trees when applied to next-generation sequencing (NGS) data (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019). However, usually it contains many edges that do not represent actual transmissions. Thus, at the second stage, the transmission tree is inferred as the spanning tree of*G*._{R}Under the maximum parsimony criterion, the most likely transmission network is a minimum spanning tree of

*G*(Jombart et al., 2011). However, experiments demonstrated that this approach is not accurate (Jombart et al., 2014). Furthermore, genomic data alone often do not allow to resolve ambiguities in transmission tree inference, and incorporation of additional evidence is necessary (Jombart et al., 2014; Villandre et al., 2016; Jha et al., 2017). Such evidence usually comes in the form of epidemiological information, such as sample collection times and exposure intervals. However, HIV, HCV, and many other infections tend to be initially asymptomatic, and consequently, sampling times may not accurately reflect the infection times. In addition, in outbreaks with high transmission rates (e.g., HIV/HCV among injection drug users), susceptible hosts are almost constantly exposed to the virus, which makes exposure intervals useless. Another important drawback of many existing methods is their implicit assumption that transmission tree edges are independent. In reality, it is not the case, as, for example, certain hosts (so-called superspreaders) infect more people than an average person (Galvani and May, 2005)._{R}Skums et al. (2018) proposed an alternative approach. It is known that for viruses, whose transmissions are associated with behavioral risk factors, their transmission trees have properties of so-called scale-free graphs (Leigh Brown et al., 2011; Wertheim et al., 2014). Those graphs have specific features, including power-law degree distribution, small diameter, and the presence of high-degree vertices (hubs). This observation gives rise to the following informally defined algorithmic problem (

*scale-free spanning tree problem*): find the most “scale-free-like” spanning tree*T*of the graph*G*. In addition, constraints on the weight of_{R}*T*could be imposed. This approach was the basis of the Bayesian framework and the Markov Chain Monte Carlo algorithm for the transmission network inference described by Skums et al. (2018) and implemented as a tool called QUENTIN. Although QUENTIN is efficient in practice, it is a heuristic, and the questions about computational complexity and possibility of the exact solution of the problem were left open.In this article, we present the first detailed study of the scale-free spanning tree problem. Our major contributions are as follows.

(1)

We propose two rigorous formulations of the scale-free spanning tree problem further referred to as

*m*-SF Spanning Tree and*s*-SF Spanning Tree problems. They are based on two related objective functions and, to the best of our knowledge, have not been previously studied.(2)

We establish the computational complexity of both problems by demonstrating that they are NP-hard or APX-hard, even when restricted to cubic graphs and bipartite graphs.

(3)

We propose two integer linear programming (ILP) formulations for the problems, and perform computational experiments to assess their performance using simulated data. Then we apply an ILP approach to real genomic data from several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Skums et al., 2018) and demonstrate that it allows for accurate inference of transmission trees.

*2. Preliminaries*

### 2.1. Problem formulations

We consider only finite undirected simple graphs and use standard graph-theoretic terminologies, see, for example, Chartrand et al. (2016). Let $G=\left(V,E\right)$ be a connected graph. For a vertex $x\in V\left(G\right)$, the

*neighborhood*${N}_{G}\left(x\right)$ of*x*is the set of all vertices that are adjacent to*x*in*G*. The*degree*of*x*is defined as ${deg}_{G}x=\left|{N}_{G}\left(x\right)\right|$. Several definitions of scale-free graphs of different degrees of mathematical rigor are known in a literature. We utilize the rigorous combinatorial characterization that has been introduced by Li et al. (2005) using the so-called*s*-metric of a graph. This graph invariant is defined as follows:$$s\left(G\right)={\sum}_{uv\in E\left(G\right)}{deg}_{G}u{deg}_{G}v.$$

(1)

The same parameter is known in mathematical chemistry as

*second Zagreb index*(Das and Gutman, 2004; Borovicanin et al., 2017). Li et al. (2005) demonstrated that a higher*s*-metric indicates with high probability the presence of most of the expected properties of scale-free graphs. The intuition behind these results is that in a graph with a high*s*-metric, a large number of edges should be incident to high-degree vertices, thus forcing them to resemble preferential attachment graphs—a standard Barabási and Albert (1999) model for scale-free networks. Therefore, another mathematical chemistry parameter called the*first Zagreb index*(Borovicanin et al., 2017) or*m-metric*also can serve as a measure of “scale-freeness” of a graph:Thus, we can formulate

*m*-SF Spanning Tree and*s*-SF Spanning Tree problems: given a connected graph*G*, find the spanning tree*T*of*G*such that $m\left(T\right)$ (respectively, $s\left(T\right)$) is maximal. The respective maximum values of $m\left(T\right)$ and $s\left(T\right)$ are called*first*and*second SF-dimensions*of*G*and denoted by ${\tau}_{1}\left(G\right)$ and ${\tau}_{2}\left(G\right)$. By ${T}^{sopt}$ and ${T}^{mopt}$, we denote an*s-optimal tree*and an*m-optimal tree*of*G*, respectively.A somehow related problem has been studied by Kincaid et al. (2016): find a spanning subgraph with

*prescribed vertex degrees*such that its*s*-metric is maximum. This problem is polynomially solvable in general, but becomes NP-hard, when the output spanning subgraph is required to be connected.### 2.2. Mathematical preliminaries

#### 2.2.1. Subgraph counting

Here we establish the characterizations for the

*m*-metric and*s*-metric in terms of numbers of small subgraphs in a graph. This technique is used to establish complexity results in Section 3 and ILP formulations in Section 4.#### 2.2.2. Neighbor switching

This is a tree rearrangement technique that is used for obtaining structural and complexity results. Let

*T*be a tree and $\left(u,v\right)$ be a pair of distinct vertices $u,v\in V\left(T\right)$, where ${deg}_{T}u=p\ge 2$ and ${deg}_{T}v=t\ge 2$. We denote the unique $u-v$ path in*T*by ${P}_{T}\left(u,v\right)$, and neighbors of*u*and*v*laying on ${P}_{T}\left(u,v\right)$ by ${u}^{+}$ and ${v}^{-}$, respectively. In case*u*and*v*are not adjacent, the neighbor of ${u}^{+}$ distinct from*u*and laying on ${P}_{T}\left(u,v\right)$ is denoted by ${u}^{++}$. Let $A={N}_{T}\left(u\right)\setminus \left\{{u}^{+}\right\}=\left\{{a}_{1},\dots ,{a}_{p-1}\right\}$, and let the set ${N}_{T}\left(v\right)\setminus \left\{{v}^{-}\right\}$ be partitioned into two subsets $B=\left\{{b}_{1},\dots ,{b}_{q}\right\}$ and $C=\left\{{c}_{1},\dots ,{c}_{r}\right\}$, where $B\ne $. Furthermore, let ${deg}_{T}{u}^{+}=\alpha $ and ${deg}_{T}{v}^{-}=\beta $. Define numbers*D*,_{A}*D*, and_{B}*D*as follows:_{C}$${D}_{A}={\sum}_{i=1}^{p-1}{deg}_{T}{a}_{i},{D}_{B}={\sum}_{j=1}^{q}{deg}_{T}{b}_{j},{D}_{C}={\sum}_{k=1}^{r}{deg}_{T}{c}_{k}.$$

(3)

Given the pair $\left(u,v\right)$, the

*neighbor switch*${S}_{v\to u}^{B}$ is a transformation producing a new tree $\stackrel{\u0303}{T}$ from*T*by replacing the edges $v{b}_{1},\dots ,v{b}_{q}$ with new edges $u{b}_{1},\dots ,u{b}_{q}$ (Fig. 1). This operation changes only degrees of the vertices*u*and*v*, namely ${deg}_{\stackrel{\u0303}{T}}u=p+q$, ${deg}_{\stackrel{\u0303}{T}}v=r+1$.If $B={N}_{T}\left(v\right)\setminus \left\{{v}^{-}\right\}$, then the neighbor switch produces a tree $\stackrel{\u0303}{T}$ with

*v*being a leaf. In this case ${S}_{v\to u}^{B}$ is a*total neighbor switch*. For our goals it suffices to prove the following corollary.For further results we need weaker modifications of Lemmas 2 and 4 for the case ${deg}_{T}u=p\ge 1$ (and therefore ${D}_{A}\ge 0$). Recall ${deg}_{T}v=t\ge 2$ since we still require at least one vertex to switch.

### 2.3. Bounds in terms of the maximum degree

There exist bounds for both SF-dimensions of a graph in terms of its order only (de Caen, 1998; Das, 2003; Das and Gutman, 2004). However, they are not particularly efficient, when used as ILP cuts. Here we provide the adjusted upper bounds that turned out to be more useful for that purpose. Let $\Delta \left(G\right)$ denote the maximum vertex degree of

*G*and ${S}_{m,k}$ denote a*double star*, that is, a tree obtained from two disjoint stars ${K}_{1,m}$ and ${K}_{1,k}$ with*m*and*k*leaves, respectively, by adding an edge joining their central vertices.## 3. Hardness Results

In this section, we study the computational complexity of both the

*m*-SF and the*s*-SF Spanning Tree problem. The following known fact is used:**Claim 11.**

*If*$\Delta \left(T\right)\le 3$

*and*$n\ge 4$

*are even, then*$s\left(T\right)\le 6n-15$

*. The equality holds if and only if T has no vertices of degree*2.

Next, we consider bipartite graphs.

Any spanning tree

*T*of*G*containing all edges of $\left(r:A\right)$ has $m+3n$ edges, $3n\left(m-1\right)$ paths of length three (each of the $3n$ edges of the tree connecting*A*and*B*induces exactly $m-1$ such paths), and $m\left(m-1\right)\u22152+3n$ paths of length two that are not formed by a pair of edges between*A*and*B*. There are $3{\delta}_{4}+{\delta}_{3}$ remaining paths of length two, where ${\delta}_{i}$ is the number of vertices in*A*that have degree*i*in the tree. Indeed, a vertex $v\in A$ with $j\in \left\{0,1,2,3\right\}$ neighbors from*B*in the tree contributes no such path in case of $j\in \left\{0,1\right\}$, one such path in case of $j=2$, and three such paths in case of $j=3$. Thus by Proposition 1$$m\left(T\right)={m}^{2}+m+12n+6{\delta}_{4}+2{\delta}_{3},s\left(T\right)={m}^{2}+3mn+6n+6{\delta}_{4}+2{\delta}_{3}.$$

Since $\left|B\right|=3n$, we have $3{\delta}_{4}+2{\delta}_{3}\le 3n$ and $6{\delta}_{4}+2{\delta}_{3}\le 6{\delta}_{4}+4{\delta}_{3}\le 6n$. Hence, $6{\delta}_{4}+2{\delta}_{3}\le 6n$ with equality holding if and only if ${\delta}_{3}=0$ and ${\delta}_{4}=n$.

A perfect 3-DM ${\mathcal{M}}^{\ast}=\left\{{M}_{1},\dots ,{M}_{n}\right\}$ induces the spanning tree ${T}_{{\mathcal{M}}^{\ast}}$ that contains all edges from $\left(r:A\right)$ and edges $ax,ay,az$ for each $a=\left\{x,y,z\right\}\in {\mathcal{M}}^{\ast}$. For this tree we have ${\delta}_{4}=n$ and

Conversely, every spanning tree

*T*that contains all edges from $\left(r:A\right)$ and $m\left(T\right)={t}_{1}\left(n,m\right)$ or $s\left(T\right)={t}_{2}\left(n,m\right)$ (and thus ${\delta}_{4}=n$) arises from a perfect 3-DM.By Lemma 13, the graph

*G*satisfies ${\tau}_{1}\left(G\right)\ge {t}_{1}\left(n,m\right)$ (resp., ${\tau}_{2}\left(G\right)\ge {t}_{2}\left(n,m\right)$) if and only if there is a spanning tree*T*of*G*that contains all edges from $\left(r:A\right)$ and whose*m*-metric (resp.,*s*-metric) is equal to ${t}_{1}\left(n,m\right)$ (resp., ${t}_{2}\left(n,m\right))$. The latter is true if and only if*Q*has a perfect 3-DM. □## 4. ILP Formulations

Here we describe two ILP models for the

*s*-SF Spanning Tree problem (for the*m*-SF Spanning Tree problem the approach is similar). For a given spanning tree*T*of a graph $G=\left(V,E\right)$ of order*n*, consider the indicator variables $({x}_{e})e\in E$:$${x}_{e}=\left\{\begin{array}{c}\begin{array}{c}1,e\in E\left(T\right);\\ 0,otherwise.\end{array}\end{array}\right.$$

(13)

Using Proposition 1, we can represent $s\left(T\right)$ as

$$s\left(T\right)={\sum}_{\left\{{e}_{i},{e}_{j},{e}_{k}\right\}\in {\Gamma}_{3}\left(G\right)}{x}_{{e}_{i}}{x}_{{e}_{j}}{x}_{{e}_{k}}+2{\sum}_{\left\{{e}_{i},{e}_{j}\right\}\in {\Gamma}_{2}\left(G\right)}{x}_{{e}_{i}}{x}_{{e}_{j}}+{\sum}_{e\in E\left(G\right)}{x}_{e},$$

(14)

where ${\Gamma}_{i}\left(G\right)$ denotes the set of all paths of length

*i*in*G*. To linearize (14), we introduce Boolean variables ${y}_{ijk}$ and ${y}_{ij}$ and the following constraints:$$\begin{array}{c}{y}_{ijk}\le {x}_{{e}_{i}},{y}_{ij}\le {x}_{{e}_{i}},\\ {y}_{ijk}\le {x}_{{e}_{j}},{y}_{ij}\le {x}_{{e}_{j}},\\ {y}_{ijk}\le {x}_{{e}_{k}},{y}_{ij}\ge {x}_{{e}_{i}}+{x}_{{e}_{j}}-1,\\ {y}_{ijk}\ge {x}_{{e}_{i}}+{x}_{{e}_{j}}+{x}_{{e}_{k}}-2,\end{array}$$

(15)

for every $\left\{{e}_{i},{e}_{j},{e}_{k}\right\}\in {\Gamma}_{3}\left(G\right)$ and $\left\{{e}_{i},{e}_{j}\right\}\in {\Gamma}_{2}\left(G\right)$, which are equivalent to ${y}_{ijk}={x}_{{e}_{i}}{x}_{{e}_{j}}{x}_{{e}_{k}}$ and ${y}_{ij}={x}_{{e}_{i}}{x}_{{e}_{j}}$. Thus the objective function (14) can be rewritten as

$$s\left(T\right)={\sum}_{\left\{{e}_{i},{e}_{j},{e}_{k}\right\}\in {\Gamma}_{3}\left(G\right)}{y}_{ijk}+2{\sum}_{\left\{{e}_{i},{e}_{j}\right\}\in {\Gamma}_{2}\left(G\right)}{y}_{ij}+{\sum}_{e\in E\left(G\right)}{x}_{e}.$$

(16)

We use two types of constraints to describe the spanning trees. The first type is the extended formulation of Martin (1991), which uses auxiliary variables

$${z}_{\left(v,w\right)}^{r},{z}_{\left(w,v\right)}^{r}\ge 0foreveryr\in V\left(G\right),vw\in E\left(G\right),$$

(17)

where ${z}_{\left(v,r\right)}^{r}=0$ for every $r\in V\left(G\right)$ and $vr\in E\left(G\right)$. A 0/1-vector

*x*describes a spanning tree of*G*if and only if these variables satisfy the constraints$$\begin{array}{cc}{x}_{vw}-{z}_{\left(v,w\right)}^{r}-{z}_{\left(w,v\right)}^{r}& =0,r\in V\left(G\right),vw\in E\left(G\right),\\ {\sum}_{vw\in E\left(G\right)}{z}_{\left(v,w\right)}^{r}=1,r,w\in V\left(G\right),r\ne w,\\ {\sum}_{vr\in E\left(G\right)}{z}_{\left(v,r\right)}^{r}=0,r\in V\left(G\right).\end{array}$$

(18)

The second type exploits the Miller–Tucker–Zemlin (MTZ) constraints (Miller et al., 1960). We introduce the auxiliary variables

$$\begin{array}{c}{z}_{\left(v,w\right)},{z}_{\left(w,v\right)}\in \left\{0,1\right\}foreveryvw\in E\left(G\right),\\ {t}_{v}\in \left[0,n-1\right]foreveryv\in V\left(G\right),\end{array}$$

(19)

and constraints

$$\begin{array}{c}{x}_{vw}-{z}_{\left(v,w\right)}-{z}_{\left(w,v\right)}=0,vw\in E\left(G\right),\\ {\sum}_{vw\in E\left(G\right)}{z}_{\left(v,w\right)}=1,w\in V\left(G\right)\setminus \left\{r\right\},\\ {\sum}_{vr\in E\left(G\right)}{z}_{\left(v,r\right)}=0,\\ {t}_{v}-{t}_{w}+n{z}_{\left(v,w\right)}\le n-1,v,w\in V\left(G\right),vw\in E\left(G\right),\end{array}$$

(20)

where $r\in V\left(G\right)$ is some fixed vertex. Finally we add the additional constraint

$$s\left(T\right)={\sum}_{\left\{{e}_{i},{e}_{j},{e}_{k}\right\}\in {\Gamma}_{3}\left(G\right)}{y}_{ijk}+2{\sum}_{\left\{{e}_{i},{e}_{j}\right\}\in {\Gamma}_{2}\left(G\right)}{y}_{ij}+{\sum}_{e\in E\left(G\right)}{x}_{e}\le n\left(n-\Delta \left(G\right)-1\right)+{\Delta}^{2}\left(G\right),$$

(21)

defined by Theorem 7, which turns out to significantly improve the algorithm running times. Maximization of the objective (16) subject to the constraints (15), (18), (21) is further referred to as Martin formulation, while maximization of Equation (16) subject to Equations (15), (20), (21) as MTZ formulation.

## 5. Experimental Results

In this section, we investigate the practical aspects of scale-free spanning tree problems by conducting computational experiments for various simulated and experimental data sets to evaluate the performance of the ILP models. All computations below were performed on a standard laptop with 2.0 GHz dual core processor and 16 GB of RAM, and ILP problems were solved using Gurobi 8.1.

### 5.1. Synthetic data

#### 5.1.1. Synthetic graphs

We used graphs from the following synthetic data sets:

*Erdős-Rényi graphs*constructed by adding each possible edge uniformly and independently with the probability $p=4.25\u2215n$. The number of nodes

*n*in our experiments varied from 10 to 40 (corresponding to the sizes of HCV outbreaks analyzed later).

$n\times m$

*grid graphs*(Cartesian products of paths*P*and_{n}*P*) with $n,m=4,\dots ,7$._{m}*Scale-free graphs*of two types generated using NetworkX library (Hagberg et al., 2008): those based on the classical Barabási and Albert (1999) model and those constructed with NetworkX default parameters. The latter graphs are usually denser.

For all synthetic data sets except for grid graphs, we generated 10 graphs per node number. Figures 3 and 4 illustrate the running times of the ILP solver on both the MTZ formulation and the Martin formulation compared with the published tool QUENTIN (Skums et al., 2018) runtimes for all four simulated graph classes.

^{1}The results demonstrate that for those graph classes, the ILP algorithms in average perform much better than in the worst case and are able to produce optimal results in a reasonable amount of time. Moreover, for considered graph sizes, they outperform QUENTIN. For Erdős–Rényi graphs and grids (Fig. 3), which are characterized by relatively large sets of feasible solutions, the Martin formulation was superior to MTZ and QUENTIN, while for Barabási–Albert scale-free graphs (Fig. 4a), the MTZ formulation was leading to the faster algorithm. In general, the ILP approach allows to solve the problem within minutes or few hours for small-to-medium-sized problems (up to several dozens of vertices) on Erdős–Rényi graphs and grids, and for medium-sized problems (several hundred vertices) on scale-free graphs.#### 5.1.2. Simulated outbreaks

We simulated outbreaks over scale-free Barabási–Albert contact networks of $n=10-30$ nodes using the following model. The infection spreads over each network according to the susceptible infected (SI) model (Newman, 2010) with the transmission rate $\rho =1{0}^{-2}$. Each infected individual is assumed to carry a viral sequence of length $m=13200$, and at each transmission event, the source's sequence is transmitted to the recipient. Sequence evolution is described by a skyline model with the piecewise constant decreasing mutation rate, that is, viral sequences mutate at the basic rate of $\mu =1{0}^{-5}$ changes/position/time unit, and the mutation rate is decreasing by $30\%$ every $\tau =100$ time units. This model captures the decrease of the speed of intrahost evolution as the infection progresses from an acute to a persistent stage (De Maio et al., 2016; Icer et al., 2020).

For each simulated outbreak, we compared the performance of the ILP algorithm for the Martin formulation, with the standard approach based on the phylogenetic trait inference (Sagulenko et al., 2018). First, we constructed a maximum likelihood phylogeny using MEGA (Kumar et al., 2018). Each patient was encoded by a discrete trait, and the marginal likelihood ancestral traits were reconstructed using the Felsenstein pruning algorithm (Felsenstein, 2004) with the pairwise between-trait transition rates equal to $\rho $. Inferred transmission links then correspond to trait changes along the phylogeny branches. The genetic relatedness network

*G*used as an input for the ILP was constructed using a threshold-based approach suggested by Kosakovsky Pond et al. (2018). A pair of vertices of_{R}*G*are adjacent, if the Hamming distance between the corresponding sequences does not exceed a threshold_{R}*t*that was estimated as the minimal integer such that the graph*G*is connected. The obtained graph was further sparsed out by applying the same procedure to each of its biconnected component._{R}The results of algorithms' comparison are shown in Figure 5. We measured algorithm accuracy by the proportion of correctly inferred transmission links and transmission ancestries (i.e., pairs, ancestor/descendant).

*s*-SF-based ILP clearly outperformed the phylogenetic approach: the average transmission link detection accuracy was $82.44\%$ for the former and $72.61\%$ for the latter, while the average transmission ancestry detection accuracies were $97.48\%$ and $73.96\%$, respectively.### 5.2. Data from hepatitis C outbreaks

We applied the concept of scale-free spanning trees to the graphs arising from the benchmark data set consisting of several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Glebova et al., 2017; Skums et al., 2018). This data set comprises HCV quasispecies populations sampled from 81 infected individuals involved in 10 viral outbreaks. Each population consists of RNA sequences of HCV hypervariable region 1 (HVR1) of length 264 bp. Transmission histories of the outbreaks (“who infected whom”) are known as a result of epidemiological investigations. In this case, we are dealing with intrahost viral populations rather than single sequences, and therefore, we compared the proposed approach with QUENTIN, which has been specifically designed to handle such data (Skums et al., 2018).

For each outbreak, the genetic relatedness network

*G*was constructed using the threshold-based approach suggested by Campo et al. (2016). The vertices of_{R}*G*are adjacent, if the_{R}*minimal*Hamming distance between the sets of sequences sampled from these patients does not exceed the threshold*t*. The threshold value was estimated as described in Subsection 5.1.2. Next, the ILP algorithm for the Martin formulation has been applied to*G*. For all outbreaks, the ILP problem has been solved to optimality._{R}We tested the accuracy of inference of transmission links and identification of the superspreaders (the sources of majority of infections). The results are reported in Table 1. The superspreaders correspond to vertices of highest degrees in

*s*-optimal and*m*-optimal trees for 9 out of 10 outbreaks. It should be noted that all algorithms incorrectly identified a superspreader for the same outbreak. It is the only outbreak where the virus was transmitted via a nonsocial interaction (namely, through blood transfusions), while all other outbreaks were associated with unsafe injection practices or sexual contacts. For those outbreaks, both ILP approaches correctly recovered $92\%$ of transmission links and all ancestor/descendant pairs, thus outperforming QUENTIN.Methods | Evaluation metric | ||
---|---|---|---|

(A) | (B) | (C) | |

QUENTIN | 0.9 | 0.78 | 0.98 |

s-SF | 0.9 | 0.92 | 1.0 |

m-SF | 0.9 | 0.92 | 1.0 |

(A) Superspreader inference accuracy, (B) accuracy of transmission link inference, and (C) accuracy of transmission ancestry inference.

## 6. Discussion

In genomic epidemiology, reconstruction of viral transmission histories from genomic data is fundamental for the investigation of outbreaks and understanding of epidemic spread. Genomic analysis has become one of the major tools for the investigation of outbreaks and surveillance of transmission dynamics (Armstrong et al., 2019; Knyazev et al., 2020). Naturally, graphs are the primary models used in such studies (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019). In many settings, graph-based methods have been shown to be more efficient to ascertain transmission links compared with methods based on binary phylogenies (Wertheim et al., 2014), as phylogenetic clades are not easily resolvable into transmission clusters and pairs (Lewis et al., 2008; Hughes et al., 2009; Kouyos et al., 2010), while the statistical support for a clade does not necessarily indicate the statistical support for a relationship between individual genomes inside a clade (Volz et al., 2012; Wertheim et al., 2014). However, in many cases, transmission links cannot be inferred using the genomic data alone (Jombart et al., 2014; Villandre et al., 2016). It leads to the need to introduce additional constraints on the reconstructed transmission networks or utilize more complicated objectives.

As a result, the associated algorithmic problems become harder. In this article, we studied one such problem—scale-free spanning tree problem—that arises in epidemiological studies of viruses whose spread is highly influenced by social networks of contacts between susceptible individuals. This includes HIV, HCV, and other pathogens transmitted through sexual contact or needle sharing. We demonstrated that this problem in its two possible algorithmic formulations is NP-hard, even if restricted to relatively simple graph classes. However, it admits an ILP formulation allowing to efficiently solve the problem for small-to-medium networks. It is often enough for the vast majority of outbreaks of HIV and HCV that involve dozens of infected individuals.

However, some outbreaks involve hundreds or even thousands of hosts, and in such cases, more scalable algorithmic solutions are needed. Thus, an important open problem is to establish whether constant or logarithmic approximation exists for the

*m*-SF Spanning Tree and*s*-SF Spanning Tree problems. In this context, it would be interesting to explore the relationships between scale-free spanning tree problems and max-leaf spanning tree problems. The latter is a well-studied combinatorial problem (Griggs et al., 1989; Galbiati et al., 1994), which seems to be the closest to our problem. Indeed, both problems aim to find a “star-like” spanning tree; furthermore, several reduction schemes for the proof of NP-completeness used by us exploit this relationship. Importantly, Lu and Ravi (1998) and Reich (2016) showed that the max-leaf spanning tree problem is approximable within a constant factor. Although the problems are far from being equivalent, it may seem reasonable for future studies to try to adopt algorithmic machinery developed for the max-leaf spanning tree problem to the scale-free spanning tree problem.## Footnote

^{1}

Running times for MTZ formulation on grids and Martin formulation on Barabási–Albert scale-free graphs are plotted only for smaller

*n*, since for large values they are significantly higher than for the other formulation. In particular, Martin formulation on Barabási–Albert scale-free graphs works $\sim $ 150 seconds for 1000 vertices, $\sim $ 480 seconds for 1500 vertices, and exceeds 1800 seconds for 2000 and more vertices. QUENTIN running times are not plotted in Figure 4, since they already exceed timeout of 3600 seconds for 50 vertices for both scale-free graphs.## References

Armstrong, G.L., MacCannell, D.R., Taylor, J., et al. 2019. Pathogen genomics in public health.

*N. Engl. J. Med*. 381, 2569–2580.Barabási, A.-L., and Albert, R. 1999. Emergence of scaling in random networks.

*Science*286, 509–512.Bonsma, P. 2012. Max-leaves spanning tree is APX-hard for cubic graphs.

*J. Discrete Algorithms*. 12, 14–23.Borovicanin, B., Das, K.C., Furtula, B., et al. 2017. Bounds for Zagreb indices.

*MATCH Commun. Math. Comput. Chem*. 78, 17–100.Campbell, E.M., Jia, H., Shankar, A., et al. 2017. Detailed transmission network analysis of a large opiate-driven outbreak of HIV infection in the United States.

*J. Infect. Dis*. 216, 1053–1062.Campo, D.S., Xia, G.-L., Dimitrova, Z., et al. 2016. Accurate genetic detection of hepatitis C virus transmissions in outbreak settings.

*J. Infect. Dis*. 213, 957–965.Chartrand, G., Lesniak, L., and Zhang, P. 2016.

*Graphs & Digraphs*. CRC Press, Taylor & Francis Group, Boca Raton, FL.Das, K.C. 2003. Sharp bounds for the sum of the squares of the degrees of a graph.

*Kragujev. J. Math*. 25, 19–41.Das, K.C., and Gutman, I. 2004. Some properties of the second Zagreb index.

*MATCH Commun. Math. Comput. Chem*. 52, 103–112.de Caen, D. 1998. An upper bound on the sum of squares of degrees in a graph.

*Discrete Math*. 185, 245–248.De Maio, N., Wu, C.-H., and Wilson, D.J. 2016. SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent.

*PLoS Comput*.*Biol*. 12, e1005130.Felsenstein, J. 2004.

*Inferring phylogenies*. Sinauer Associates, Sunderland, MA.Galbiati, G., Maffioli, F., and Morzenti, A. 1994. A short note on the approximability of the maximum leaves spanning tree problem.

*Inform. Process. Lett*. 52, 45–49.Galvani, A.P., and May, R.M. 2005. Dimensions of superspreading.

*Nature*438, 293–295.Garey, M.R., and Johnson, D.S. 1979.

*Computers and Intractability: A Guide to the Theory of NP-Completeness*. W.H. Freeman & Co., New York, NY.Glebova, O., Knyazev, S., Melnyk, A., et al. 2017. Inference of genetic relatedness between viral quasispecies from sequencing data.

*BMC Genomics*. 18, 918.Griggs, J.R., Kleitman, D.J., and Shastri, A. 1989. Spanning trees with many leaves in cubic graphs.

*J Graph Theory*13, 669–695.Hagberg, A., Swart, P., and Chult, D. 2008. Exploring network structure, dynamics, and function using NetworkX.

*In Proceedings of the 7th Python in Science Conference*. Online publication, Pasadena, California; pp. 11–16.Hajarizadeh, B., Grebely, J., and Dore, G.J. 2013. Epidemiology and natural history of HCV infection.

*Nat. Rev. Gastroenterol. Hepatol*. 10, 553–562.Huang, C., Wang, Y., Li, X., et al. 2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.

*Lancet*395, 497–506.Hughes, G.J., Fearnhill, E., Dunn, D., et al. 2009. Molecular phylodynamics of the heterosexual HIV epidemic in the United Kingdom.

*PLoS Pathogens*5, e1000590.Icer Baykal, P., Lara, J., Khudyakov, Y., et al. 2020. Quantitative differences between intra-host HCV populations from persons with recently established and persistent infections.

*Virus Evol*. 7, veaa103.Jha, D., Skums, P., Zelikovsky, A., et al. 2017. Modeling the spread of HIV and HCV infections based on identification and characterization of high-risk communities using social media, 425–430.

*In International Symposium on Bioinformatics Research and Applications*. Springer, Cham.Jombart, T., Cori, A., Didelot, X., et al. 2014. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data.

*PLoS Comput*.*Biol*. 10, e1003457.Jombart, T., Eggo, R., Dodd, P., et al. 2011. Reconstructing disease outbreaks from genetic data: A graph approach.

*Heredity*106, 383–390.Kilmarx, P.H. 2009. Global epidemiology of HIV.

*Curr. Opin. HIV AIDS*. 4, 240–246.Kincaid, R.K., Kunkler, S.J., Lamar, M.D., et al. 2016. Algorithms and complexity results for finding graphs with extremal Randić index.

*Networks*67, 338–347.Kleitman, D.J., and West, D.B. 1991. Spanning trees with many leaves.

*SIAM J Discrete Math*. 4, 99–106.Klinkenberg, D., Backer, J.A., Didelot, X., et al. 2017. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks.

*PLoS Comput*.*Biol*. 13, e1005495.Knyazev, S., Hughes, L., Skums, P., et al. 2020. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era.

*Brief. Bioinformatics*22, 96–108.Kosakovsky Pond, S.L., Weaver, S., Leigh Brown, A.J., et al. 2018. HIV-TRACE (TRansmission Cluster Engine): A tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens.

*Mol. Biol. Evol*. 35, 1812–1819.Kouyos, R.D., Von Wyl, V., Yerly, S., et al. 2010. Molecular epidemiology reveals long-term changes in HIV type 1 subtype B transmission in Switzerland.

*J. Infect. Dis*. 201, 1488–1497.Kumar, S., Stecher, G., Li, M., et al. 2018. MEGA X: Molecular evolutionary genetics analysis across computing platforms.

*Mol. Biol. Evol*. 35, 1547–1549.Leigh Brown, A.J., Lycett, S.J., Weinert, L., et al. 2011. Transmission network parameters estimated from HIV sequences for a nationwide epidemic.

*J. Infect. Dis*. 204, 1463–1469.Lemke, P. 1988. The maximum leaf spanning tree problem for cubic graphs is NP-complete. IMA Preprint Series No. 428.

Lewis, F., Hughes, G.J., Rambaut, A., et al. 2008. Episodic sexual transmission of HIV revealed by molecular phylodynamics.

*PLoS Med*. 5, e50.Li, L., Alderson, D., Doyle, J. C., et al. 2005. Towards a theory of scale-free graphs: Definition, properties, and implications.

*Internet Math*. 2, 431–523.Lu, H.I., and Ravi, R. 1998. Approximating maximum leaf spanning trees in almost linear time.

*J. Algorithms*. 29, 132–141.Martin, R.K. 1991. Using separation algorithms to generate mixed integer model reformulations.

*Oper. Res. Lett*. 10, 119–128.Miller, C.E., Tucker, A.W., and Zemlin, R.A. 1960. Integer programming formulation of traveling salesman problems.

*J. ACM*7, 326–329.Newman, M. 2010.

*Networks*. Oxford University Press, New York, NY.Papadimitriou, C., and Yannakakis, M. 1991. Optimization, approximation, and complexity classes.

*J. Comput. Syst. Sci*. 43, 425–440.Ragonnet-Cronin, M., Hu, Y.W., Morris, S.R., et al. 2019. HIV transmission networks among transgender women in Los Angeles County, CA, USA: A phylogenetic analysis of surveillance data.

*Lancet HIV*. 6, e164–e172.Ramachandran, S., Thai, H., Forbi, J.C., et al. 2018. A large HCV transmission network enabled a fast-growing HIV outbreak in rural Indiana, 2015.

*EBioMedicine*37, 374–381.Reich, A. 2016. Complexity of the maximum leaf spanning tree problem on planar and regular graphs.

*Theoret.Comput. Sci*. 626, 134–143.Sagulenko, P., Puller, V., and Neher, R.A. 2018. TreeTime: Maximum-likelihood phylodynamic analysis.

*Virus Evol*. 4, vex042.Skums, P., Zelikovsky, A., Singh, R., et al. 2018. QUENTIN: Reconstruction of disease transmissions from viral quasispecies genomic data.

*Bioinformatics*34, 163–170.Sledzieski S., Zhang C., Mandoiu I., et al. 2019. TreeFix-TP: Phylogenetic error-correction for infectious disease transmission network inference.

*bioRxiv*. 1:813931.Villandre, L., Stephens, D.A., Labbe, A., et al. 2016. Assessment of overlap of phylogenetic transmission clusters and communities in simple sexual contact networks: Applications to HIV-1.

*PLoS One*11, e0148459.Volz, E.M., Koopman, J.S., Ward, M.J., et al. 2012. Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection.

*PLoS Comput.Biol*. 8, e1002552.Wertheim, J.O., Leigh Brown, A.J., Hepler, N.L., et al. 2014. The global transmission network of HIV-1.

*J. Infect. Dis*. 209, 304–313.## Information & Authors

### Information

#### Published In

Journal of Computational Biology

Volume 28 • Issue Number 10 • October 2021

Pages: 945 - 960

PubMed: 34491104

#### Copyright

© Yury Orlovich, et al., 2021. Published by Mary Ann Liebert, Inc.

##### Open Access

This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

#### History

**Published online**: 13 October 2021

**Published in print**: October 2021

**Published ahead of print**: 1 September 2021

#### Topics

### Authors

#### Author Disclosure Statement

The authors declare they have no conflicting financial interests.

#### Funding Information

Y.O. was partially supported by the BRFFR grant (Project F20UKA-005). The work of V.K. and K.K. was supported by the German National Science Foundation via DFG-Research Training Group 2297 (Mathematical Complexity Reduction—MathCoRe). P.S. was supported by the National Institutes of Health grant 1R01EB025022 and by the National Science Foundation grant 2047828.

## Metrics & Citations

### Metrics

### Citations

## Export Citation

#### Export citation

Select the format you want to export the citations of this publication.

## View Options

### View options

#### PDF/EPUB

View PDF/ePub### Get Access

## Access content

To read the fulltext, please use one of the options below to sign in or purchase access.## ⚠ Society Access

If you are a member of a society that has access to this content please log in via your society website and then return to this publication.