Back in 2016, researchers published an analysis of retaliatory gun attacks in Chicago. To study the spread of violence in the city, they had reconstructed linked ‘cascades’ of attacks, in which one shooting had led to another, which had led to another etc. They found the average cascade size was 2.7 shootings.
So, how contagious was gun violence in Chicago? Or, to put it another way, suppose we had reconstructed chains of retaliatory gun violence in another city, and found an average cascade size of 5.4. How much more contagious is gun violence in that city compared to Chicago?
It might be tempting to look at 5.4 vs 2.7 and say “twice as contagious”. But the problem is that cascade size doesn’t scale directly with the amount of transmitted violence at the individual level. When it comes to understanding contagion, we need to dig a bit deeper.
Contagious cascades
Suppose that, on average, each shooting leads to R follow up attacks. We can think of R as the ‘reproduction number’ for violence, which tells us the extent of onward transmission we’d expect per attack.
If a city has a shooting, we’d therefore expect this to lead to R follow up shootings, and in turn these R shootings would lead to R × R follow ups. If we continue the logic, counting the initial shooting as 1, we can write down the expected cascade size:
$$\text{expected cascade size} = 1 + R + R^2 + R^3 + \dots$$
If R is below 1 (i.e. each shooting on average leads to fewer than one follow up attack) there’s a neat bit of maths, the sum of a geometric series, that allows us to simplify this long equation:
$$1 + R + R^2 + R^3 + \dots = \frac{1}{1 - R}$$
If we rearrange this equation, we can therefore estimate R from the average cascade size:
$$R = 1 - \frac{1}{\text{average cascade size}}$$
Hence a city with an average cascade size of 2.7 shootings has R = 1 − 1/2.7 = 0.63, whereas a city with an average cascade size of 5.4 has R = 1 − 1/5.4 = 0.81. In other words, gun violence in the second city is about 1.3 times as contagious at the individual level (0.81/0.63 ≈ 1.3), not twice as contagious.
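For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. The function name `reproduction_number` is just an illustrative label, and the formula only holds under the assumption that R is below 1.

```python
def reproduction_number(mean_cascade_size: float) -> float:
    """Estimate R from the average cascade size, assuming R < 1."""
    if mean_cascade_size < 1:
        # A cascade always contains at least the initial event.
        raise ValueError("average cascade size must be at least 1")
    return 1 - 1 / mean_cascade_size


chicago = reproduction_number(2.7)
other_city = reproduction_number(5.4)

# Cascade sizes differ by a factor of 2, but R differs by much less.
print(round(chicago, 2), round(other_city, 2), round(other_city / chicago, 1))
# prints: 0.63 0.81 1.3
```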
E-mails and epidemics
We can apply the same method to other forms of contagion. Suppose you send a marketing e-mail to 1000 people, and it ends up reaching 2700 by word of mouth. You then make some tweaks to the content and send this second batch to another 1000 people. This time it reaches 5400 by word of mouth.
How much better is the second e-mail at spreading? As you may have spotted, the average e-mail cascade size in the first batch was 2.7 (i.e. people reached per e-mail sent), and 5.4 in the second. So, as we've already seen, this suggests the second e-mail content is 1.3x more contagious.
This is the approach that Duncan Watts, Jonah Peretti (of Buzzfeed fame), and Michael Frumin used to analyse viral marketing campaigns in the mid-2000s. For example, one petition was e-mailed to 22,582 people and ended up reaching 54,172. Another was sent to 7,064 and reached 30,608. If we convert these into estimates of R, the first campaign has R=0.58 and the second R=0.77. Although the headline numbers of people reached per e-mail sent were very different, the actual difference in individual-level transmission was much smaller.
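As a rough sketch, the same conversion can be applied to the two petition campaigns quoted above. The labels "first petition" and "second petition" and the helper `campaign_r` are illustrative names of mine, not anything from the original analysis, and the formula again assumes R is below 1.

```python
# (people e-mailed directly, total people reached), as quoted above
campaigns = {
    "first petition": (22_582, 54_172),
    "second petition": (7_064, 30_608),
}


def campaign_r(seeded: int, reached: int) -> float:
    """R = 1 - 1 / (reached / seeded), which only makes sense when R < 1."""
    return 1 - seeded / reached


for name, (seeded, reached) in campaigns.items():
    reach_per_email = reached / seeded
    print(f"{name}: reach per e-mail = {reach_per_email:.2f}, "
          f"R = {campaign_r(seeded, reached):.2f}")
# prints:
# first petition: reach per e-mail = 2.40, R = 0.58
# second petition: reach per e-mail = 4.33, R = 0.77
```

The reach per e-mail differs by a factor of about 1.8 between the two campaigns, while the estimated R differs by a factor of about 1.3.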
Thinking in terms of transmission – rather than just raw outcomes – is useful because it means we can quantify the effort required to get a given outcome. Suppose we have an existing campaign with R=0.85. If we can get R up to 0.9 (i.e. a 6% increase in contagiousness), we’d expect to see a 50% increase in the average total reach as a result.
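The sensitivity of reach to small changes in R can be checked with a short sketch that runs the formula in the other direction. The function name `expected_cascade_size` is an illustrative choice, and it assumes the same below-1 regime as before.

```python
def expected_cascade_size(r: float) -> float:
    """Expected total reach per initial send, 1 / (1 - R), for 0 <= R < 1."""
    if not 0 <= r < 1:
        raise ValueError("the formula only applies when 0 <= R < 1")
    return 1 / (1 - r)


before = expected_cascade_size(0.85)  # roughly 6.7 people per e-mail sent
after = expected_cascade_size(0.90)   # roughly 10 people per e-mail sent

print(f"increase in R: {0.90 / 0.85 - 1:.0%}")
print(f"increase in reach: {after / before - 1:.0%}")
# prints:
# increase in R: 6%
# increase in reach: 50%
```

Because the denominator 1 − R shrinks towards zero as R approaches 1, small gains in transmission matter most for campaigns that are already close to that threshold.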
In other situations, we aren’t interested in increasing transmission; we want to try and reduce it. A common question in analysis of infectious disease epidemics is how much effect control measures have had. But the problem is that we can't just look directly at the tally of cases over time. If we introduce control measures, we’re not changing the number of current cases. We’re changing transmission, which will change the number of future cases.
To illustrate how misleading it can be to compare the number of infections with the level of control, suppose we have an epidemic that begins with an R of 1.6 for 40 days. At this point there is an increase in control intensity, which cuts R to 1. Then, after another 40 days, another increase in control intensity cuts R to 0.6. Here is what the resulting plots look like for control intensity, infections and R over time:
Now suppose multiple countries introduce the same measures, but at different times - some almost straight away, and some much later. So all the curves grow at the same initial rate, but flatten off at different points:
If we look at the correlation between daily control intensity and the number of daily infections using data across all the above countries, we get the following relationship:
In other words, there seems to be very little correlation between control and infection levels. This is because countries that introduced control earlier have lower infections throughout, whereas countries that introduced control later have higher infection levels. Even though the impact of control on individual-level transmission in each country is identical, there isn’t a linear relationship between infections and control.
If, however, we look at the correlation between R and control, we see a strong negative correlation: the more control there is, the lower the value of R. Which is reassuring, because this is exactly the assumption we put into the model, so our analysis of the resulting dynamics should be able to recover the correct conclusion.
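To make the thought experiment concrete, here is a rough simulation sketch of the multi-country scenario. Several details are my own assumptions rather than anything specified above: control intensity is coded as 0, 1 or 2; the generation time is fixed at 5 days, so the daily growth rate is ln(R) divided by 5; each country starts with one daily infection; and eight hypothetical countries take their first control step on days 5, 15, ..., 75. The exact correlation values will depend on these choices, so the code prints them rather than asserting particular numbers.

```python
import numpy as np

GENERATION_TIME = 5.0                 # days; an assumed value
R_LEVELS = np.array([1.6, 1.0, 0.6])  # R at control intensity 0, 1 and 2
N_DAYS = 120


def simulate_country(first_step_day: int, n_days: int = N_DAYS):
    """Return daily control intensity, R and infections for one country."""
    days = np.arange(n_days)
    # Control steps up on first_step_day and again 40 days later.
    control = np.where(days < first_step_day, 0,
                       np.where(days < first_step_day + 40, 1, 2))
    r_values = R_LEVELS[control]
    # Daily exponential growth rate implied by R and a fixed generation time.
    growth = np.log(r_values) / GENERATION_TIME
    # Infections on day t reflect the growth accumulated over earlier days.
    log_infections = np.concatenate(([0.0], np.cumsum(growth[:-1])))
    return control, r_values, np.exp(log_infections)


# Same measures in every country, introduced at different times.
results = [simulate_country(start) for start in range(5, 80, 10)]
control = np.concatenate([c for c, _, _ in results])
r_values = np.concatenate([r for _, r, _ in results])
infections = np.concatenate([i for _, _, i in results])

corr_infections = np.corrcoef(control, infections)[0, 1]
corr_r = np.corrcoef(control, r_values)[0, 1]
print(f"correlation of control with daily infections: {corr_infections:.2f}")
print(f"correlation of control with R: {corr_r:.2f}")
```

Under these assumptions the relationship between control and R is strongly negative by construction, while the relationship between control and raw infections is muddied by the large differences in epidemic size between early-acting and late-acting countries.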
Unfortunately, comparing the intensity of control measures with the number of infections (or cases or hospitalisations etc.) is a common mistake in many published COVID papers. Often these papers get attention because they find no correlation between control measures and cases. But, based on the above, they would probably have found something very different if they’d calculated the reproduction number rather than just relying on raw case numbers.
So, whether we’re looking at the spread of violence, viral content or viruses – if we want to quantify contagion, we need to look beyond the raw data and focus on the magnitude of transmission.
I don't know whether this idea is tangential or parallel, but we have a similar problem when we use the R number alone to describe contagiousness. A pathogen with an R number of 2 is far more contagious than another with an R number of 3, if the generation time of the former is a week, but that of the latter is a year. The timescale of the effect is essential when describing contagiousness, but I'm not sure I heard anyone in the media ever quoting generation times. Perhaps it wasn't so relevant for Covid if its generation time was stable, but it is necessary if you want to convert the R number into a growth rate, which I think is a more useful piece of information.
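The comparison in this comment can be sketched numerically. A caveat: the conversion below assumes every generation takes exactly the same fixed time, which gives a daily growth rate of ln(R) divided by the generation time. Real generation times vary, and the conversion then depends on their distribution. The function names are illustrative.

```python
import math


def growth_rate(r_number: float, generation_time_days: float) -> float:
    """Daily exponential growth rate, assuming a fixed generation time."""
    return math.log(r_number) / generation_time_days


def doubling_time(r_number: float, generation_time_days: float) -> float:
    """Days for infections to double; only meaningful when R > 1."""
    return math.log(2) / growth_rate(r_number, generation_time_days)


# R = 2 with a one-week generation time vs R = 3 with a one-year one.
print(f"{doubling_time(2, 7):.0f} days")    # prints: 7 days
print(f"{doubling_time(3, 365):.0f} days")  # prints: 230 days
```

So under this simplification, the pathogen with the lower R doubles roughly every week, while the one with the higher R takes well over half a year to double.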