Making the most out of machine learning models

Statistical machine learning is becoming much more commonplace in ecology and evolution. As a rough estimate, I’d say 15% of the posters and talks I’ve seen at ESA/Evolution conferences recently have featured some form of machine learning approach. This is not surprising given the power of these methods to form robust predictive models from increasingly common big, complex datasets. However, there is still lots of confusion about what machine learning approaches are and how they can be applied and interpreted. This is particularly the case in the sub-discipline I’m most familiar with: disease ecology. I feel the confusion mostly stems from a lack of experience with the methods, but also from the assumption that these methods are just black boxes for big data and don’t have a probabilistic basis. These assumptions are demonstrably false, and recent advances in computer science have revolutionized the power and interpretability of these models. In particular, these methods allow us to find new insights in, and tackle new questions with, complex and messy disease ecology datasets.

This is what encouraged our team to create and test a machine learning pipeline that incorporates the latest advances in computer science to understand disease risk in populations. The associated paper just came out in the Journal of Animal Ecology here: which I’m very excited about! This pipeline is really flexible and could be applied to any other classification or regression problem in the same way. In essence, it brings together code from the R packages caret and iml, as well as some preprocessing packages such as missForest, in a user-friendly pipeline. Where we identified gaps, we put in our own functions to make the process as smooth as possible.

We tested the pipeline by examining disease risk in the Serengeti lions and found that our models can not only provide powerful predictions, but also unique insights into the mechanistic drivers of disease in this system.


This is just the tip of the iceberg in terms of how this pipeline could be used, though, and given the state of the machine learning world it is likely to be outdated soon. But hopefully, at least, it provides a basis for ecologists to build and compare their own robust machine learning models and interpret them in powerful new ways.
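To give a flavour of what "interpreting" a model can mean, here is a minimal sketch of permutation feature importance, one of the model-agnostic ideas that packages such as iml make available. It is written in plain Python rather than R and is not the pipeline from the paper. The data, the feature names (age, rainfall) and toy_model are all hypothetical stand-ins for a real fitted model.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy when the values of one feature are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        # Shuffle one column across animals, leaving the other columns intact.
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled = [dict(row, **{feature: v}) for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_repeats


# Hypothetical toy data: one row per animal, label 1 = infected.
rows = [
    {"age": 1, "rainfall": 300},
    {"age": 2, "rainfall": 650},
    {"age": 3, "rainfall": 420},
    {"age": 4, "rainfall": 710},
    {"age": 5, "rainfall": 380},
    {"age": 6, "rainfall": 590},
    {"age": 7, "rainfall": 450},
    {"age": 8, "rainfall": 620},
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def toy_model(row):
    """Stand-in for a fitted classifier: predicts infection for animals older than 4."""
    return 1 if row["age"] > 4 else 0


age_importance = permutation_importance(toy_model, rows, labels, "age")
rainfall_importance = permutation_importance(toy_model, rows, labels, "rainfall")
```

Because the toy model only ever looks at age, shuffling rainfall leaves its accuracy untouched, while shuffling age makes it worse. With a real model the same comparison suggests which predictors the model actually relies on.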

Links to packages:

caret (

iml  (

missForest (

The official Molecular Ecology Blog is live!

My apologies for the lack of blog articles in the last couple of months. I’ve been busy with the social media team and editorial board getting an official blog for Molecular Ecology and Molecular Ecology Resources off the ground. The aims of the blog are to highlight some of the fantastic papers published by both journals and to provide ‘behind the paper’ insights as well as useful updates from the journals.

It has been a monster effort by lots of great people, and we are really excited to get this out there. Here is the link:


Endemic infection can shape epidemic exposure: using breakthroughs in statistical ecology to better understand co-infection patterns

Throughout our lives, we are exposed to and infected by a diverse community of pathogens, from viruses and bacteria to parasitic worms. In humans, the combination of pathogens you are infected by matters, as these organisms can interact with each other in remarkable ways that can alter the outcome of an infection. For example, people co-infected by HIV (human immunodeficiency virus) and tuberculosis (TB, a disease caused by Mycobacterium bacteria) experience heightened symptoms of each pathogen and are at a much higher risk of dying compared to people infected by just one of these pathogens. HIV interferes with the immune system, which not only allows TB to grow faster but also increases the chances of that individual transmitting the bacteria. This is an example of a positive or ‘facilitative interaction’ between pathogens in ecological speak. In contrast, pathogens can compete as well (a negative interaction), and in some cases this can protect us from disease. For example, co-infection with certain parasitic worms can actually be protective against malaria (see Nacher, 2011 below). Further, we know it is possible that interactions between pathogens can depend on the order of infection (see Hoverman et al. for more on this). But how do we test for these specific interactions, particularly in wildlife? Humans and wildlife are exposed to and infected by a diverse range of organisms; how could we work out which ones to test? It is unfeasible to test every combination in the lab and, even then, how would we know which combinations actually occur in the wild?

In this paper, we harnessed recent advances in ecological statistics and network theory to quantify associations between pathogens in a wild population of lions in the Serengeti in Tanzania. We label them associations as we can’t be 100% sure that they actually represent real interactions between pathogens (you’d need to do lab experiments for that, which are difficult to do for wildlife). Based on over 10 years of exposure and infection data for a wide variety of pathogens that infect lions, we were able to establish which pathogens were positively or negatively associated with others. As we have been monitoring these lions, often since birth, we were able to deduce the likely order of infection or exposure and work out if a pathogen that a lion was exposed to early in life could impact which pathogens they were exposed to as adults. These statistical methods are also useful as they can start to untangle whether these associations could simply be due to environmental factors (i.e. the lion got co-infected by two pathogens because both pathogens share an ecological preference) rather than a potential biological mechanism.
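For readers who want a feel for what a positive or negative association between two pathogens looks like numerically, here is a minimal sketch using the phi coefficient on presence/absence data. This is a much simpler measure than the methods used in the paper: it ignores environmental covariates and the order of infection. The exposure records below are made up for illustration, and the pathogen names are placeholders.

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length lists of 0/1 exposure records."""
    if len(a) != len(b):
        raise ValueError("both pathogens need one record per animal")
    # Counts for the 2x2 table of co-occurrence.
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One pathogen is present (or absent) in every animal: undefined.
        raise ValueError("association is undefined without variation in both pathogens")
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical records for eight animals (1 = exposed, 0 = not exposed).
pathogen_a = [1, 1, 1, 1, 0, 0, 0, 0]
pathogen_b = [0, 0, 0, 1, 1, 1, 1, 0]

association = phi_coefficient(pathogen_a, pathogen_b)
```

Values near -1 mean the two pathogens rarely turn up in the same animal, values near +1 mean they usually turn up together, and values near 0 mean no clear pattern. On its own, a number like this cannot distinguish a biological interaction from shared environmental drivers.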

The associations we found using these methods were often surprising but reflected what has been established in human lab-based studies, which is promising. For example, we found a strong negative association between Rift Valley Fever (RVF, a mosquito-borne virus that infects lions as well as cattle and sheep, sometimes leading to devastating economic loss) and the felid equivalent of HIV, feline immunodeficiency virus (FIV). FIV infects nearly 100% of lions as cubs, whereas RVF infection is more likely to occur later in life. Interestingly, RVF has similar molecular machinery to a group of viruses that are known to inhibit the growth of HIV, so it is possible that the same mechanism exists for lions as well. Similarly, we found a strong negative association between feline coronavirus (in the virus family that causes severe acute respiratory syndrome, or SARS, in humans) and one type of FIV. Coronaviruses are considered possible candidate vaccines for HIV, so again laboratory work from human medicine provided some support for our findings.

We didn’t just find negative associations either; we also detected a strong positive association between the tick-borne Babesia protozoans and canine distemper virus (CDV). This co-infection pattern has been identified previously and is likely the underlying factor that caused this lion population to crash by over 33% in the 1990s. Lions may be able to withstand a CDV epidemic in isolation, but when CDV is combined with Babesia in a co-infection, this can lead to serious population declines for this species (see Munson et al. for some more details). Our study shows that it didn’t matter which species of Babesia was involved either: all of the species we included had these strong positive associations with CDV.

We can’t prove conclusively that these pathogens actually interact within a lion based on these statistical methods alone. However, we can provide a valuable ‘shortlist’ of possible interactions that occur in a wild population that can be tested using cell-level experiments in a lab; we obviously don’t want to actually test these hypotheses out on lions themselves. Given how common interactions between pathogens are, and that their outcomes for the host can be positive or negative, our approach coupled with lab work can provide important insights into pathogen dynamics in wild populations.

Nacher (2011):
Hoverman et al. (2013):
Munson et al. (2008):

A link to the paper here:

Results are in!

If you were wondering what the results were from our survey asking about journal solicitations from preprint servers the link is below.

Basically, even though our sample was biased towards people reading blogs and/or Twitter, it seems like there is reasonable support for journals to solicit papers from preprint servers. Unsurprisingly, this was particularly true for early career folks.

Molecular tools and community ecology: Great special issue in Molecular Ecology

How can we use molecular tools to better understand community dynamics? This is but one of the questions that the recent special issue in Molecular Ecology delves into. This issue focuses on ecological networks where species are the ‘nodes’ and ‘edges’ represent interactions between species. What I particularly like about this collection of papers is the breadth of taxa, from aquatic to terrestrial, as well as the breadth of interactions captured, from predator-prey to host-symbiont. Most of these communities are hard to observe in nature (e.g. the organisms are small or nocturnal), so molecular tools are really the only option.
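If the nodes-and-edges idea is new to you, here is a minimal sketch of how such a network can be stored and summarized. The species and interactions are invented for illustration (imagine each edge being inferred from prey DNA detected in a predator’s gut or faeces), and the function names are my own rather than anything from the special issue.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected network: each species (node) maps to the set of species it interacts with."""
    adjacency = defaultdict(set)
    for species_a, species_b in edges:
        adjacency[species_a].add(species_b)
        adjacency[species_b].add(species_a)
    return dict(adjacency)


def degree(network):
    """Number of interaction partners for each species."""
    return {species: len(partners) for species, partners in network.items()}


# Hypothetical predator-prey interactions, one tuple per edge.
edges = [
    ("spider", "fly"),
    ("spider", "moth"),
    ("bat", "moth"),
    ("bat", "beetle"),
]

network = build_network(edges)
partners_per_species = degree(network)
```

Here the moth is eaten by both predators, so it has two partners, while the fly and the beetle have one each. Simple summaries like this are the starting point for the more involved network analyses.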

Lots to learn from this interesting set of papers!

Here is the link:

Science and social media

Recently I was introduced to the world of making scientific content for social media in a fun workshop. As part of this workshop, I was introduced to Lumen 5. I’ve always tried to communicate my research to the public through social media, and Lumen 5 makes doing this really achievable. This website enables you to quickly generate a high-quality video ideal for sharing on Facebook etc. I hadn’t realized how important it is for social media videos to have text over the video, so that the public can read your story while seeing the images. Often your video gets viewed on public transport (or other places where sound is a no-no), so having a video that can communicate without sound is important. The media library used to help construct these videos is free of any copyright issues, which is nice.

See my first attempt here:

Using epidemiology to understand patterns of big cat attacks

Our paper about using epidemiological techniques to better understand big cat attacks in Tanzania and India is now out in the Journal of Applied Ecology:

This is one of the most rewarding papers I have been involved with (and my first last author paper!). Attacks on humans represent not only a public health concern but also a major conservation challenge for these species. We really wanted to know (i) if patterns of attacks were species-specific, and (ii) in what landscapes clusters of attacks were found. To address these questions, we were able to assemble long-term man-eating attack data for leopards, lions and tigers from two continents, and we used spatio-temporal models to look for clusters of attacks in space and time. We found that lion clusters were larger, involved more human fatalities and occurred over longer periods of time compared to those of leopards and tigers. This possibly indicates that, as lions form social groups called prides, the idea of eating humans may be ‘transmitted’ amongst pride mates, making attacks last longer. Most lions get killed post-attack, so it is not the same individual committing all of the attacks. These attack clusters did not happen randomly in the landscape either, with residential woodlands being particularly high risk for attack clusters. Tree loss was also important for lion attacks, with attacks more common in areas with recent forest loss.


We hope that approaches such as this one can be used to better manage and understand attacks not just by these species, but by others as well. We used SaTScan to do the analysis and provide some easy-to-follow instructions that allow users to conduct this type of analysis themselves. Plus, SaTScan is freely available, which is always good.

Big thanks to Craig Packer for getting the data together and making this happen.

Statistical Network Models

Recently I had the pleasure of hanging out with Matt Silk from the University of Exeter:

It was great chatting about badgers, puma and network models over a few beers. His work summarizing statistical network models in Methods in Ecology and Evolution is particularly useful (see below).

After reading this you’ll know what ERGMs, TERGMs, REMs and SAOMs are and how they can answer network/disease questions. The only potentially useful addition I can see is generalized dissimilarity models (GDMs) as a robust way to test for covariate effects on network structure (as I did in my JAE 2017 paper). Anyway, this paper is certainly a good starting point for entering the world of statistical network analysis.


Methods in Ecology and Evolution goes microbial

Methods in Ecology and Evolution has put together a really exciting special issue on microbiome methods:

I’ve read a few of these papers already and there are certainly some really useful ideas and methods here. At a glance, Creer et al.’s field ecologist’s guide to microbial ecology seems particularly useful; I’m looking forward to reading it in more depth!