Endemic infection can shape epidemic exposure: using breakthroughs in statistical ecology to better understand co-infection patterns

Throughout our lives, we are exposed to and infected by a diverse community of pathogens, from viruses and bacteria to parasitic worms. In humans, the combination of pathogens you are infected by matters, as these organisms can interact with each other in remarkable ways that can alter the outcome of an infection. For example, people co-infected by HIV (human immunodeficiency virus) and tuberculosis (TB – a disease caused by Mycobacterium bacteria) experience heightened symptoms of each pathogen and are at a much higher risk of dying compared to people infected by just one of these pathogens. HIV interferes with the immune system, which not only allows TB to grow faster but also increases the chances of that individual transmitting the bacteria. This is an example of a positive or ‘facilitative’ interaction between pathogens, in ecological speak. In contrast, pathogens can also compete (a negative interaction), and in some cases this can protect us from disease. For example, co-infection by certain parasitic worms can actually be protective against malaria (see Nacher, 2011 below). Further, we know that interactions between pathogens can depend on the order of infection (see Hoverman et al., 2013 for more on this). But how do we test for these specific interactions, particularly in wildlife? Humans and wildlife are exposed to and infected by a diverse range of organisms; how could we work out which ones to test? It is unfeasible to test every combination in the lab, and even then, how would we know which combinations actually occur in the wild?

In this paper, we harnessed recent advances in ecological statistics and network theory to quantify associations between pathogens in a wild population of lions in the Serengeti in Tanzania. We call them associations because we can’t be 100% sure that they represent real interactions between pathogens (you’d need lab experiments for that, which are difficult to do for wildlife). Based on over 10 years of exposure and infection data covering a wide variety of pathogens that infect lions, we were able to establish which pathogens were positively or negatively associated with others. As these lions have often been monitored since birth, we were able to deduce the likely order of infection or exposure and work out whether a pathogen that a lion was exposed to early in life could influence which pathogens it was exposed to as an adult. These statistical methods are also useful because they can start to untangle whether these associations are simply due to environmental factors (i.e. a lion got co-infected by two pathogens because of a shared ecological preference of those pathogens) rather than a potential biological mechanism.

The associations we found using these methods were often surprising but reflected what has been established in human lab-based studies, which is promising. For example, we found a strong negative association between Rift Valley fever (RVF – a mosquito-borne virus that infects lions as well as cattle and sheep, sometimes leading to devastating economic losses) and the felid equivalent of HIV (FIV). FIV infects nearly 100% of lions as cubs, whereas RVF infection is more likely to occur later in life. Interestingly, RVF has similar molecular machinery to a group of viruses that are known to inhibit the growth of HIV, so it is possible that the same mechanism exists in lions as well. Similarly, we found a strong negative association between feline coronavirus (in the virus family that causes severe acute respiratory syndrome, or SARS, in humans) and one subtype of FIV. Coronaviruses are considered possible candidate vaccines for HIV, so again laboratory work from human medicine provided some support for our findings.

We didn’t just find negative associations either; we also detected a strong positive association between tick-borne Babesia protozoans and canine distemper virus (CDV). This co-infection pattern has been identified previously and is likely the underlying factor that caused this lion population to crash by over 33% in the 1990s. Lions may be able to withstand a CDV epidemic in isolation, but when it is combined with Babesia in a co-infection, this can lead to serious population declines for this species (see Munson et al., 2008 for more details). Our study showed that it didn’t matter which species of Babesia was involved either: all of the species we included had these strong positive associations with CDV.

We can’t prove conclusively that these pathogens actually interact within a lion based on these statistical methods alone. However, we can provide a valuable ‘shortlist’ of possible interactions occurring in a wild population that can then be tested using cell-level experiments in a lab – we obviously don’t want to test these hypotheses on lions themselves. Given how common interactions between pathogens are, and their potentially positive or negative outcomes for the host, our approach coupled with lab work can provide important insights for understanding pathogen dynamics in wild populations.

Nacher (2011): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3192711/
Hoverman et al. (2013): https://www.ncbi.nlm.nih.gov/pubmed/23754306?dopt=Abstract
Munson et al. (2008): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002545

A link to the paper here: https://onlinelibrary.wiley.com/doi/full/10.1111/ele.13250

NEON & insights from my first ESA

This year I was lucky enough to be awarded a NEON–ESA early career scholar award to help fund my first trip to ESA. I’ve been to large ecology conferences before, but I was particularly excited to expand my understanding of NEON (the National Science Foundation’s National Ecological Observatory Network), meet some great ecologists and learn some new analytical tools. I arrived in steamy New Orleans after 35 hours of travelling from Tasmania, still recovering from the jet lag.

I was thrust into it at 8 am Sunday morning with a workshop on how to use generalized joint attribute modelling (GJAM) with Jim Clark. The flexibility of this tool and the robust way it deals with messy community data make it something I want to use on the microbiome data I’ve got coming in. For those interested, the vignette is super useful too: https://cran.r-project.org/web/packages/gjam/vignettes/gjamVignette.html.
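For a flavour of the interface, here is a minimal sketch based on my reading of that vignette (gjamSimData simulates a small community data set so the model call can be seen end to end; treat the argument values as illustrative, not recommendations):

# simulate a community data set ('CA' = continuous abundance)
library(gjam)
sim <- gjamSimData(n = 500, S = 10, typeNames = 'CA')
# short chains just to demonstrate the call; use far more iterations in practice
ml <- list(ng = 1000, burnin = 100, typeNames = sim$typeNames)
out <- gjam(sim$formula, xdata = sim$xdata, ydata = sim$ydata, modelList = ml)
# then explore chains, parameters and predictions with gjamPlot()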

Immediately following the GJAM workshop, we started a NEON-focussed workshop on how to access and use NEON data. I was super impressed with just how integrated NEON is with R and how well documented the data are. I felt like you could get to know a particular location and precisely what data were collected there. From a disease ecology perspective, it is really exciting to have disease/microbiome data matched with extensive environmental data. The opportunities to ask continental-scale questions with fine-resolution data are enormous. It was great to continue the discussion at a restaurant afterwards – NEON people are my type of people! Monday was another NEON-orientated day, where we got to see what people have been doing with NEON data. I also got to meet Mike Kaspari, which was great – I’ve been admiring his work for years.

The rest of my time at ESA was a haze of presenting my work on puma disease dynamics and going to as many disease ecology talks as possible. Two (and sometimes three) parallel disease ecology sessions were pretty neat. Our NSF puma project also had quite a few people presenting – it was great to see all of this population genomic/disease ecology work coming together. Overall, it was a huge week, but one that I hope will lead to exciting future collaborations!

Time-series modelling for ecologists

Recently I have been working on massive long-term group–group networks for both the Serengeti lions and the Yellowstone wolves. We have tracked territory size, average pack/pride size, and the number (and strength) of between-pack/pride contacts every year from 1971 until today. Basically, it is a series of time series in which we want to know which one is dependent on which. Not being particularly familiar with time-series analysis, I didn’t know where to start.

After doing a heap of reading, I decided that vector autoregression was the way to go. Vector autoregression (VAR) models are stochastic process models that capture linear dependencies among multiple time series. Mostly used for economic forecasting, the method seems pretty robust and quite straightforward to implement in the R package ‘vars’. However, finding out all of the steps/assumptions required to run the model was tricky, so here is my adapted code to fill the gap:

—————————————————————–
rm(list = ls())
library("vars")

#-------------------------------------------------------------------#
############ import data ############
#-------------------------------------------------------------------#

data1 <- read.csv("Data.csv", header = TRUE)
str(data1)

#-------------------------------------------------------------------#
############ detrend with regression ############
#-------------------------------------------------------------------#

# 'series1' stands in for one of your time-series columns
m1 <- lm(series1 ~ 0 + Year, data = data1)  # lm with no intercept
summary(m1)
m1resid <- residuals(m1)

# ... repeat the detrending for each remaining series ...

# combine the residual series into a data frame again
# (bind all of them together, e.g. cbind(m1resid, m2resid))
dataResid <- cbind(m1resid)

#-------------------------------------------------------------------#
############ Vector Autoregression ############
#-------------------------------------------------------------------#

# make a ts object - frequency is how many obs per year (1 = annual)
ts.obj <- ts(dataResid, frequency = 1, start = 1997, end = 2016); str(ts.obj)

# test for the most appropriate lag for your data
# (e.g., does a 2-year lag best predict the next year's connectivity?)
VARselect(ts.obj, lag.max = 3, type = "const")$selection

# 'p' below is the lag order to fit
varLag1 <- VAR(ts.obj, p = 1, type = "const")

# test for serial correlation in the residuals
# (Portmanteau test; has to be 'insignificant' at alpha 0.05 to trust the results)
serial.test(varLag1, lags.pt = 10, type = "PT.asymptotic")

arch.test(varLag1)  # test for heteroskedasticity; error terms are fine if p > 0.05

roots(varLag1)  # moduli have to be under 1 to trust model results

# extensive list of summary results
summary(varLag1)

# links nicely to the forecast package to predict the future
library("forecast")
fcstL1 <- forecast(varLag1)
plot(fcstL1, xlab = "Year")
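Since the whole point was to ask which series depends on which, it is also worth knowing about the causality() function in ‘vars’ once the diagnostics pass. A minimal sketch, using the ‘m1resid’ column name from the code above:

# test whether one series 'Granger-causes' the rest, i.e. whether its past
# values improve prediction of the other series
causality(varLag1, cause = "m1resid")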

PhyloPic – great resource for animal silhouettes.

Adding animal silhouettes to figures seems to be increasingly on trend in ecology. I have no empirical evidence to back up this claim, but it seems like every article in a high-impact journal has at least one figure that incorporates silhouettes of species. I too am guilty of adding them – I find them a useful visual tool, but in the past, I’ve had to create them myself using Photoshop. No more! PhyloPic (http://www.phylopic.org/about/) provides an easy-to-search collection of reusable silhouette images of organisms, from beetles to dinosaurs.
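If you work in R, the rphylopic package is one route for pulling these silhouettes straight into a figure. A minimal sketch, assuming the get_uuid()/add_phylopic() interface of recent package versions (the lion lookup and plot are just for illustration):

# look up a lion silhouette by species name, then drop it onto a ggplot
library(rphylopic)
library(ggplot2)
uuid <- get_uuid(name = "Panthera leo")
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  add_phylopic(uuid = uuid, x = 4, y = 30, ysize = 4)  # position/size in data units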

Resources like this are truly great!

Integrating networks and phylogenies

Considering the broad similarities between networks and phylogenies, it is amazing that they have, until recently, been treated as very separate approaches. In the world of epidemiology, transmission trees have been gaining momentum over the last 5 years (see the excellent review by Hall et al.: https://www.ncbi.nlm.nih.gov/pubmed/27217184), as they turn phylogenies into something that more-or-less equates to transmission. Now it appears that ecologists are doing the same thing, with this really interesting paper just out in Methods in Ecology and Evolution (see link below). The package attached to Schliep et al. looks really cool, and I can imagine it will be of use to a broad array of disciplines. I’m looking forward to trying it out myself…

Here is the link: http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12760/full
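If I have read the paper correctly, the methods live in Klaus Schliep’s phangorn package; here is a rough sketch of the consensus-network idea under that assumption (the data set and bootstrap settings are just for illustration):

# summarise conflict among bootstrap trees as a network rather than one tree
library(phangorn)
data(Laurasiatherian)
set.seed(1)
trees <- bootstrap.phyDat(Laurasiatherian,
                          FUN = function(x) NJ(dist.hamming(x)), bs = 50)
cnet <- consensusNet(trees, prob = 0.2)  # keep splits found in >20% of trees
plot(cnet)  # incompatible splits show up as boxes rather than single edges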

FIV in the Serengeti lions: our paper is out now in JAE.

Pathogen subtype really does matter – different subtypes of FIV (feline HIV) get around the Serengeti lions in remarkably different ways. This was the general conclusion of our paper just out in the Journal of Animal Ecology (see link below). After many years’ work, I’m thrilled that this paper is out. It hopefully highlights some of the ways in which cool community phylogenetic methods (coupled with phylodynamic approaches) can help us understand disease transmission in a wild population.

Here is a link: http://onlinelibrary.wiley.com/doi/10.1111/1365-2656.12751/full


Misunderstanding the Microbiome: misuse of community ecology tools to understand microbial communities

Understanding how the vast collection of organisms within us (the ‘microbiome’) is linked to human (and ecosystem) health is one of the most exciting scientific topics today. It really does have the possibility of improving our lives considerably, though it is often over-hyped (see the link below). However, I’ve recently been reading quite a few microbiome papers (it was our journal club’s topic of the month) and have been struck by the poor study design and lack of understanding of the statistical methodology. Talking to colleagues in the microbiome field, these problems may be more widespread and could be hindering our progress in understanding this important component of the ecosystem within us.

Of course, microbiome research is simply microbial community ecology, but the way some microbiome practitioners use and report community ecology statistics is problematic and sometimes outright deceptive. This includes people publishing in the highest-profile scientific journals. I won’t pick on any particular paper, but here are a few general observations (sorry for the technical detail).

  1. Effect sizes are often not reported or visualized using ordination techniques. A significant P value is given, but how do you know how biologically relevant the effect is? My guess is that effect sizes are small, as is often the case with free-living communities.
  2. Little detail is given about how the particular test was performed. A typical example: “We did a PERMANOVA to test for XX”. Setting aside the fact that PERMANOVA has some general issues (see the Warton et al. paper below), no information is given about the test itself – e.g., was it a two-way crossed design? Did they use Type III sums of squares? Did they test for homogeneity of multivariate dispersion using PERMDISP or similar (see the code sketch after this list)? That is literally one of the only assumptions of the test, yet I haven’t read a single microbiome paper that has checked it; if they haven’t, we can’t trust the results. Have they read the original paper by Marti Anderson? Some cite it, at least…
  3. I haven’t found any PCA or PCoA plot with the % of variance explained. This is annoying – the axes shown may only explain a small amount of the variance in the community, so the pretty little clusters of points may be rather artificial.
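None of this is hard to do. Here is a minimal sketch in R using the vegan package and its built-in dune data set (the data are just for illustration; swap in your own community matrix and metadata):

# PERMANOVA with the design stated explicitly; the output reports R2 (effect size)
library(vegan)
data(dune, dune.env)
adonis2(dune ~ Management, data = dune.env, permutations = 999)
# PERMDISP: check the multivariate-dispersion assumption before trusting PERMANOVA
disp <- betadisper(vegdist(dune), dune.env$Management)
permutest(disp)  # want p > 0.05 (homogeneous dispersions)
# PCoA with % variance explained, so the axes can be labelled honestly
pcoa <- cmdscale(vegdist(dune), k = 2, eig = TRUE)
round(100 * pcoa$eig[1:2] / sum(pcoa$eig[pcoa$eig > 0]), 1)  # % for axes 1-2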

I’ll stop ranting. These issues really impair interpretation of the results and make the science difficult to replicate. It makes you ask: how do these papers get through the gates? I’m guessing that a significant proportion of authors, reviewers and editors have little experience in community biostatistics and don’t really understand what the tests are doing. They are relying on analytical pipelines such as QIIME that claim to produce ‘publication quality graphics and statistics’ and not thinking much more about it. More microbiome researchers need to go beyond these pipelines and keep up to date with community methods more broadly. The quality of the research will clearly improve.

Links:

Microbiome over-hype: http://www.nature.com/news/microbiology-microbiome-science-needs-a-healthy-dose-of-scepticism-1.15730

Warton et al.: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00127.x/full

Marti Anderson’s paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1442-9993.2001.01070.pp.x/full

‘Amusing’ reviewer comments

Having a thick skin and the ability to shrug off harsh and sometimes personal criticism is an often unrecognized trait of a scientist. You put your work out there to the world and get feedback from often-anonymous peers (though this is slowly changing). The system usually works pretty well, and 99% of the time it makes the paper better. When the comments are highly critical, you go through a mini five stages of grief, but you always come around and the paper gets better. I’ve definitely had my fair share of critical feedback, but one of my recent favorites was a reviewer suggesting that my literature review “hadn’t gone beyond the literature”… (?) However, none have come close to the comments that this author received:

https://telliamedrevisited.wordpress.com/2016/05/25/a-blast-from-the-past/

There are so many good lines, but this one is the best: “This paper has merit and no errors, but I do not like it …”

Pleasing that it still got published in the journal anyway!

Guide to reducing data dimensions

In a world where collecting enormous amounts of complex, collinear data is increasingly the norm, techniques that reduce data dimensions to something usable in statistical models are essential. However, in ecology at least, the means of doing this are unclear and the information out there is confusing. Earlier this year, Meng et al. provided a nice overview of what’s out there (see link below), specifically for reducing omics data sets, but it is equally relevant for ecologists. One weakness of the paper is that it offers only a small amount of practical advice, particularly on how to interpret the resulting dimension-reduced data. Overall, though, it is an excellent guide, and I aim to add a bit of extra practical advice on the dimension-reduction techniques that I use.

Anyway, before going forward – what do we mean by dimension reduction? Paraphrasing from Meng et al.: dimension reduction is the mapping of data to a lower-dimensional space such that redundant variance in the data is reduced, allowing for a lower-dimensional representation (say 2–3 dimensions, but sometimes many more) without significant loss of information. My philosophy is to use the data in raw form wherever possible, but where this is problematic due to collinearity etc., and where machine-learning algorithms such as random forests are not appropriate (e.g., your response variable is a matrix…), this is my brief practical guide to three common techniques:

PCA: dependable principal components analysis – excellent if you have lots of correlated continuous predictor variables with few missing values and zeros. There are a large number of PCA-based analyses that may be useful (e.g., penalized PCA for feature selection; see Meng et al.), but I’ve never used them. Choosing the right number of PCs is subjective and is a problem for lots of these techniques – an arbitrary cutoff of keeping the PCs that account for ~70% of the original variation seems reasonable. However, if your PCs only explain a small amount of variation, you have a problem, as fitting a large number of PCs to a regression model is usually not an option (and if PC25 is a significant predictor, what does that even mean biologically?). Furthermore, if there are non-linear trends, this technique won’t be useful.
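A minimal sketch in base R (prcomp and the built-in iris measurements stand in for your own predictors):

# PCA on scaled predictors; summary() reports the proportion of variance per PC
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)  # check 'Cumulative Proportion' and keep PCs up to ~70%
pca$rotation  # loadings: map each PC back to the original variables
head(pca$x)   # the PC scores you would carry into a regression model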

PCoA: principal co-ordinate analysis, or classical scaling, is similar to PCA but is used on distance matrices. It has the same limitations as PCA.

NMDS: non-metric multidimensional scaling is a much more flexible method that copes much better with non-linear trends in the data. This method tries to best preserve the distances between objects (using a ranking system), rather than finding the axes that best represent variation in the data, as PCA and PCoA do. This means that NMDS also captures variation in few dimensions (often 2–3), though it is important to assess NMDS fit via its ‘stress’ (values below 0.1 are usually OK). There is debate over how useful the axis scores are (see here: https://stat.ethz.ch/pipermail/r-sig-ecology/2016-January/005246.html), as they are rank-based and axis 1 doesn’t necessarily explain the largest amount of variation, and so forth, as is the case with PCA/PCoA. However, I still think this is a useful strategy (see the Beale 2006 link below).
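Another minimal sketch, this time with vegan’s metaMDS and its built-in dune data:

# NMDS on Bray-Curtis distances; metaMDS restarts from random configurations
library(vegan)
data(dune)
nmds <- metaMDS(dune, distance = "bray", k = 2, trymax = 50)
nmds$stress       # stress below ~0.1 is usually acceptable
stressplot(nmds)  # Shepard plot: how well ordination distances preserve ranks
scores(nmds, display = "sites")  # axis scores, to be interpreted with caution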

I stress (no pun intended!) that the biggest problem with these techniques is the interpretation of the new variables. Knowing the raw data inside and out, and how they are mapped onto the new latent variables, is important. For example, high loadings on PC1 might reflect high soil moisture and high pH; if you don’t know this, interpreting regression coefficients in a meaningful way is going to be impossible. It also leads to annoying statements like ‘PCoA 10 was the most significant predictor’ without any further biological reasoning about what the axis actually represents. Tread with caution, and data dimension reduction can be a really useful thing to do.

Meng et al. (2016): http://bib.oxfordjournals.org/content/early/2016/03/10/bib.bbv108.full
Beale (2006): http://link.springer.com/article/10.1007/s00442-006-0551-8