Statistical machine learning is becoming much more commonplace in ecology and evolution. As a rough estimate, I’d say 15% of the posters and talks I’ve seen at ESA/Evolution conferences recently have featured some form of machine learning approach. This is not surprising given the power of these methods to build robust predictive models from increasingly common large, complex datasets. However, there is still a lot of confusion about what machine learning approaches are and how they can be applied and interpreted. This is particularly the case in the sub-discipline I’m most familiar with: disease ecology. I feel the confusion mostly stems from a lack of experience with the methods, but also from the assumption that these methods are just black boxes for big data and don’t have a probabilistic basis. These assumptions are demonstrably false, and recent advances in computer science have revolutionized the power and interpretability of these models. In particular, these methods allow us to find new insights in, and tackle new questions about, complex and messy disease ecology datasets.
This is what encouraged our team to create and test a machine learning pipeline that incorporates the latest advances in computer science to understand disease risk in populations. The associated paper, which I’m very excited about, just came out in the Journal of Animal Ecology: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.13076. The pipeline is really flexible and could be applied to any other classification or regression problem in the same way. In essence, it brings together code from the R packages caret and iml, as well as preprocessing packages such as missForest, in one user-friendly workflow. Where we identified gaps, we wrote our own functions to make the process as smooth as possible.
We tested the pipeline by examining disease risk in the Serengeti lions and found that our models not only predict well, but also provide unique insights into the mechanistic drivers of disease in this system.
This is just the tip of the iceberg of how this pipeline could be used, though, and given how fast the machine learning world moves, it is likely to be outdated soon. But hopefully, at least, it provides a basis for ecologists to build and compare their own robust machine learning models and interpret them in powerful new ways.
Links to packages: