Misunderstanding the Microbiome: misuse of community ecology tools to understand microbial communities

Understanding how the vast collection of organisms within us (the ‘microbiome’) is linked to human (and ecosystem) health is one of the most exciting scientific topics today. It really does have the potential to improve our lives considerably, though it is often over-hyped (see the link below). However, I’ve recently been reading quite a few microbiome papers (it was our journal club’s topic of the month) and have been struck by the poor study design and lack of understanding of the statistical methodology. From talking to colleagues in the microbiome field, these problems may be more widespread and could be hindering our progress in understanding this important component of the ecosystem within us.

Of course microbiome research is simply microbial community ecology, but the way some microbiome practitioners use and report community ecology statistics is problematic and sometimes outright deceptive. This includes people publishing in the highest-ranked scientific journals. I won’t pick on any particular paper, but here are a few general observations (sorry for the technical detail).

  1. Effect sizes are often neither reported nor visualized using ordination techniques. The authors have a significant P value, but how do you know how biologically relevant the difference is? My guess is that the effects are small, as is often the case with free-living communities.
  2. Little detail is given about how the particular test is performed. Usual example: “We did a PERMANOVA to test for XX”. Leaving aside the fact that PERMANOVA has some general issues (see the Warton et al paper below), no information is given about the test itself, e.g., was it a two-way crossed design, did they use Type III sums of squares, etc.? Did they test for multivariate dispersion using PERMDISP or similar? Homogeneity of dispersion is literally one of the only assumptions of the test, but I haven’t read any microbiome paper that has checked it. If they haven’t, we can’t trust the results. Have they read the original paper by Marti Anderson? Some cite it at least….
  3. I haven’t found any PCA or PCoA plot with the % of variance explained on the axes. This is annoying: the axes shown may only explain a small amount of the variance in the community, so the pretty little clusters of points shown may be fairly artificial.
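To make the three points above concrete, here is a rough sketch in R of what I would like to see reported. Everything in it is an assumption for illustration: the community table is simulated, the names `counts`, `group` and `permanova_r2` are made up, and the hand-rolled one-way PERMANOVA is only there to show where the R² effect size comes from. In practice I would expect people to use `adonis2` and `betadisper` from the vegan package, which the last few lines call if vegan happens to be installed.

```r
# Simulated community table (hypothetical): 20 samples x 8 taxa, two groups
set.seed(1)
group  <- factor(rep(c("control", "treated"), each = 10))
counts <- matrix(rpois(20 * 8, lambda = 10), nrow = 20)
treated <- group == "treated"
counts[treated, 1:3] <- counts[treated, 1:3] + 8L   # shift three taxa

# Bray-Curtis dissimilarity in base R: sum|x_i - x_j| / sum(x_i + x_j)
bray <- as.dist(as.matrix(dist(counts, method = "manhattan")) /
                outer(rowSums(counts), rowSums(counts), "+"))

# Point 3: PCoA with % variance on the axes
# (negative eigenvalues can occur with Bray-Curtis, so only positive ones
#  are used in the denominator here; that choice is an assumption)
pcoa    <- cmdscale(bray, k = 2, eig = TRUE)
pos_eig <- pcoa$eig[pcoa$eig > 0]
pct_var <- 100 * pcoa$eig[1:2] / sum(pos_eig)
plot(pcoa$points, col = as.integer(group), pch = 19,
     xlab = sprintf("PCoA 1 (%.1f%%)", pct_var[1]),
     ylab = sprintf("PCoA 2 (%.1f%%)", pct_var[2]))

# Point 1: effect size. One-way PERMANOVA R^2 = 1 - SS_within / SS_total,
# with sums of squares computed from the squared distances
permanova_r2 <- function(d, g) {
  d2 <- as.matrix(d)^2
  ss_total  <- sum(d2[lower.tri(d2)]) / nrow(d2)
  ss_within <- sum(sapply(levels(g), function(lv) {
    idx <- which(g == lv)
    sub <- d2[idx, idx]
    sum(sub[lower.tri(sub)]) / length(idx)
  }))
  1 - ss_within / ss_total
}
obs_r2  <- permanova_r2(bray, group)
perm_r2 <- replicate(999, permanova_r2(bray, sample(group)))
p_value <- (sum(perm_r2 >= obs_r2) + 1) / (999 + 1)
cat(sprintf("R2 = %.3f, permutation P = %.3f\n", obs_r2, p_value))

# Point 2: say exactly which test was run, and check dispersion
if (requireNamespace("vegan", quietly = TRUE)) {
  print(vegan::adonis2(bray ~ group, data = data.frame(group = group),
                       permutations = 999))
  print(vegan::permutest(vegan::betadisper(bray, group), permutations = 999))
}
```

Reporting the R² next to the P value, plus the dispersion test, is usually enough for a reader to judge whether a "significant" difference is also a meaningful one.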

I’ll stop ranting. These issues really impair interpretation of the results and make the science difficult to replicate. It makes you ask “how do these papers get through the gates?” I’m guessing that a significant proportion of authors, reviewers and editors have little experience in community biostats and don’t really understand what the tests are doing. They are relying on analytical pipelines such as QIIME that claim to offer ‘publication quality graphics and statistics’ and not thinking much more about it. More microbiome researchers need to go beyond these pipelines and keep up-to-date with community methods more broadly. The quality of the research would clearly improve.


Microbiome over-hype: http://www.nature.com/news/microbiology-microbiome-science-needs-a-healthy-dose-of-scepticism-1.15730

Warton et al: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00127.x/full

Marti Anderson’s paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1442-9993.2001.01070.pp.x/full

‘Amusing’ reviewers comments

Having a thick skin and the ability to shrug off harsh and sometimes personal criticism is an often unrecognized trait of a scientist. You put your work out there to the world and get feedback from often anonymous peers (but this is changing slowly). The system usually works pretty well and 99% of the time makes the paper better. When the comments are highly critical, you go through a mini five stages of grief, but you always come around and the paper gets better. I’ve definitely had my fair share of critical feedback, but one of my recent favorites was a reviewer suggesting that my literature review “hadn’t gone beyond the literature”….(?) However, none have come close to the comments that this author received:


There are so many good lines but this one is the best: “This paper has merit and no errors, but I do not like it …”

Pleasing that it still got published in the journal anyway!

Guide to reducing data dimensions

In a world where collecting enormous amounts of complex and co-linear data is increasingly the norm, techniques that reduce data dimensions to something that can be used in statistical models are essential. However, in ecology at least, the means of doing this are unclear and the info out there is confusing. Earlier this year Meng et al provided a nice overview of what’s out there (see link below), specifically for reducing omics data sets, but it is equally relevant for ecologists. One weakness of the paper is that they provided only small amounts of practical advice, particularly on how to interpret the resultant dimension-reduced data. Overall though this is an excellent guide, and I aim to give a bit of extra practical advice on dimension reduction using the techniques that I use.

Anyway, before going forward: what do we mean by dimension reduction? Paraphrasing from Meng et al, dimension reduction is the mapping of data to a lower dimensional space such that redundant variance in the data is reduced, allowing for a lower-dimensional representation (say 2-3 dimensions, but sometimes many more) without significant loss of information. My philosophy is to try to use the data in a raw form wherever possible. Where this is problematic due to problems with co-linearity etc., and where machine learning algorithms such as Random Forests are not appropriate (e.g., your response variable is a matrix….), dimension reduction is the way to go, so this is my brief practical guide to three common techniques:

PCA: dependable principal components analysis, excellent if you have lots of correlated continuous predictor variables with few missing values and 0’s. There are a large number of PCA-based analyses that may be useful (e.g., penalized PCA for feature selection, see Meng et al), but I’ve never used them. Choosing the right number of PCs is subjective and is a problem for lots of these techniques; an arbitrary cutoff of selecting the PCs that together account for ~70% of the original variation seems reasonable. However, if your PCs each explain only a small amount of variation you have a problem, as fitting a large number of PCs to a regression model is usually not an option (and if PC25 is a significant predictor what does that even mean biologically?). Furthermore, if there are non-linear trends this technique won’t be useful.
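As a rough illustration of the ~70% rule of thumb, here is a small base R sketch. The data are simulated (six made-up predictors driven by two hypothetical gradients), and the names `predictors`, `cum_var` and `n_keep` are mine, so treat it as one way of doing this under those assumptions rather than the way.

```r
# Six correlated predictors generated from two hypothetical gradients
set.seed(42)
n  <- 100
g1 <- rnorm(n)
g2 <- rnorm(n)
predictors <- data.frame(
  x1 = g1 + rnorm(n, sd = 0.3),
  x2 = g1 + rnorm(n, sd = 0.3),
  x3 = g1 + rnorm(n, sd = 0.3),
  x4 = g2 + rnorm(n, sd = 0.3),
  x5 = g2 + rnorm(n, sd = 0.3),
  x6 = g2 + rnorm(n, sd = 0.3)
)

# Centre and scale so variables on different units contribute equally
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Proportion of variance per PC and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Smallest number of PCs whose cumulative variance reaches the 70% cutoff
n_keep <- which(cum_var >= 0.70)[1]

# Scores on the retained PCs, ready to be used as predictors in a model
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
```

If `n_keep` comes out large relative to the number of original variables, that is the warning sign described above: the PCs are not compressing much.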

PCoA: principal co-ordinate analysis, or classical scaling, is similar to PCA but is used on distance matrices. It has the same limitations as PCA.

NMDS: Non-metric multidimensional scaling is a much more flexible method that can cope much better with non-linear trends in the data. This method tries to best preserve distances between objects (using a ranking system), rather than finding the axes that best represent variation in the data as PCA and PCoA do. This means that NMDS can often capture the variation in a few dimensions (often 2-3), though it is important to assess NMDS fit by assessing ‘stress’ (values below 0.1 are usually OK). There is debate about how useful these axis scores are (see here: https://stat.ethz.ch/pipermail/r-sig-ecology/2016-January/005246.html), as they are rank-based, so axis 1 doesn’t explain the largest amount of variation and so forth, as is the case with PCA/PCoA. However, I still think this is a useful strategy (see the Beale 2006 link below).
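Here is a small sketch of checking stress, under a few assumptions. I use `isoMDS` from the MASS package (which ships with most R installations) purely to keep the example self-contained; many ecologists would reach for `metaMDS` in vegan instead. The abundance matrix is simulated and Euclidean distances are used only for simplicity. Note that `isoMDS` reports stress as a percentage, so I divide by 100 to compare against the 0.1 guideline.

```r
library(MASS)

# Simulated abundance matrix (hypothetical): 15 sites x 6 taxa
set.seed(7)
abund <- matrix(rlnorm(15 * 6, meanlog = 2, sdlog = 0.5), nrow = 15)
d <- dist(abund)   # Euclidean for simplicity; swap in your own dissimilarity

# Two-dimensional NMDS, started from the classical scaling solution
nmds <- isoMDS(d, k = 2, trace = FALSE)

# isoMDS returns stress in percent; convert to the 0-1 scale
stress <- nmds$stress / 100
cat(sprintf("Stress = %.3f\n", stress))
if (stress > 0.1) {
  cat("Stress is above 0.1: consider more dimensions before using the axes\n")
}

# How well do ordination distances preserve the rank order of the originals?
rank_fit <- cor(as.vector(d), as.vector(dist(nmds$points)), method = "spearman")
cat(sprintf("Spearman rank correlation = %.3f\n", rank_fit))
```

Random data like these have no real low-dimensional structure, so don't be surprised if the stress is poor; the point is only where to find the number and what scale it is on.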

I stress (no pun intended!) that the biggest problem with these techniques is the interpretation of the new variables. Knowing the raw data inside and out and how they are mapped onto the new latent variables is important. For example, you might find that high scores on PC1 reflect high soil moisture and high pH. If you don’t know this, interpreting regression coefficients in a meaningful way is going to be impossible. It also leads to annoying statements like ‘PCoA 10 was the most significant predictor’ without any further biological reasoning for what the axis actually represents. Tread with caution and data dimension reduction can be a really useful thing to do.
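To show what I mean by knowing how the raw data map onto the new variables, here is a toy example in base R. All of it is invented for illustration: the soil variables, the `richness` response and the effect sizes are simulated, and flipping the sign of PC1 is just a convenience (the sign of a principal component is arbitrary).

```r
# Simulated soil data (hypothetical): moisture and pH share a gradient,
# canopy cover is unrelated
set.seed(3)
n <- 80
wet_alkaline <- rnorm(n)
soil <- data.frame(
  moisture = wet_alkaline + rnorm(n, sd = 0.4),
  ph       = wet_alkaline + rnorm(n, sd = 0.4),
  canopy   = rnorm(n)
)
richness <- 10 + 2 * wet_alkaline + rnorm(n)

pca <- prcomp(soil, center = TRUE, scale. = TRUE)

# PC signs are arbitrary: orient PC1 so that moisture loads positively
if (pca$rotation["moisture", "PC1"] < 0) {
  pca$rotation[, "PC1"] <- -pca$rotation[, "PC1"]
  pca$x[, "PC1"]        <- -pca$x[, "PC1"]
}

# Loadings: which raw variables drive each axis?
print(round(pca$rotation, 2))

# Regress the response on PC1 and read the slope in terms of the loadings
pc1 <- pca$x[, "PC1"]
fit <- lm(richness ~ pc1)
slope <- unname(coef(fit)["pc1"])
cat(sprintf("Slope on PC1 = %.2f\n", slope))
```

With the loadings table in hand, a positive slope on PC1 can be written up as "richness increases towards wetter, more alkaline soils", which is a statement a reader can actually evaluate.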

Meng et al: http://bib.oxfordjournals.org/content/early/2016/03/10/bib.bbv108.full

Beale 2006: http://link.springer.com/article/10.1007/s00442-006-0551-8

Useful microbial community ecology resource

GUSTA ME (GUide to STatistical Analysis in Microbial Ecology) is a website packed full of useful information on commonly employed community ecology (plus other) techniques in a microbial setting. It probably needs updating, as lots of new modelling-based approaches have appeared in the last couple of years. Still broadly useful though, as the descriptions of the techniques are spot on.

Here is the link: http://mb3is.megx.net/gustame/home

I have not read the associated paper yet, but I’m sure it could be useful too.

A guide to generating geophylogenies

Evolution is fundamentally a spatio-temporal process, but how do you visualize it on a landscape? Geophylogenies are one elegant way to do just this, but as I’ve found out they can be a little tricky to implement, particularly if you have a particular base map in mind.

I had a lovely base map of southern California with the national land cover % impervious surface overlaid but quickly ran into an issue. Unfortunately (for those folks that use ArcGIS) it looks like the ArcGIS-based GeoPhyloBuilder doesn’t install properly anymore; if anyone has a solution let me know. So the challenge then was to get this base map out of Arc and into R or GenGIS to construct the geophylogeny.

I saved the map as a geoTIFF and tried GenGIS first. Importing the tiff file into GenGIS led to some weird coloration and it doesn’t quite look right even after playing around with the inbuilt features. I like this program though and it can be excellent if the basemap you want is from a different source.

Update: saving as a .jpeg with the world file added does a fine job.

Fig.1 : A fine example of what GenGIS can do.

Anyway, R can also do geophylogenies and I could get R to import the geoTIFF properly with the following code:

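# basemap exported from ArcGIS as a GeoTIFF; needs the raster package
library(raster)
b <- brick("urbanSoCal.tif")
plotRGB(b)  # assumption: draw the basemap first (as an RGB image) so the tree can be added on top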

Then using Liam Revell’s brilliant phytools package:

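library(phytools)  # also attaches ape, which provides read.nexus
tree <- read.nexus("TargetTree97")
# long and lat are assumed to be numeric vectors of tip coordinates, named by tip label
phylomorphospace(tree, cbind(long, lat), colors = setNames("red", 1), node.by.map = TRUE, add = TRUE, label = "horizontal")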

This took me a bit of mucking around, so hopefully this makes someone else’s application of this cool tool a bit easier, particularly if you are having problems in ArcGIS/GenGIS.

Some links: GenGIS: http://kiwi.cs.dal.ca/GenGIS/Main_Page

Phytools: http://blog.phytools.org/2014/07/new-user-controls-in-phylotomap.html

Mixed effect models and SEMs

I’m increasingly convinced of the value of structural equation models (SEMs) in ecology, particularly if random effects and phylogenetic relationships can be incorporated. That is exactly what has been achieved in the new R package ‘piecewiseSEM’ by Jonathan Lefcheck, recently published in Methods in Ecology and Evolution. Before this, if your model variables were not normally distributed or independent, this was a big problem for SEMs. This limited the use of this type of modelling in ecology, as ecological data, for the most part, readily violate these assumptions. This package extends SEMs to include mixed effect models, all sorts of other data distributions (Poisson etc.) and phylogenetic generalized least-squares (PGLS). The package looks relatively easy to implement. There are limitations, of course, such as not being able to account for bidirectional relationships, but as Jon suggests, future work will rectify these. Nonetheless this is a really nice extension of the SEM idea, which I’m sure will lead to greater uptake of this useful modelling approach.
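For anyone wondering what "piecewise" means in practice, here is a hand-rolled base R sketch of the underlying idea as I understand it: fit each path as its own model, then test the independence claims the path diagram implies (Shipley's directed separation test, summarised with Fisher's C). This does not call the package at all, the data and variable names are simulated and hypothetical, and plain `lm` stands in for the mixed or generalized models you would normally swap in.

```r
# Simulated causal chain (hypothetical): rainfall -> plant biomass -> herbivores
set.seed(11)
n <- 200
rainfall      <- rnorm(n)
plant_biomass <- 0.8 * rainfall + rnorm(n)
herbivores    <- 0.6 * plant_biomass + rnorm(n)
dat <- data.frame(rainfall, plant_biomass, herbivores)

# Each path in the diagram is fitted as a separate model (the "pieces")
m_biomass <- lm(plant_biomass ~ rainfall, data = dat)
m_herbiv  <- lm(herbivores ~ plant_biomass, data = dat)
print(coef(m_biomass))
print(coef(m_herbiv))

# The diagram has one missing arrow (rainfall -> herbivores), which implies
# one independence claim: herbivores is independent of rainfall given biomass.
# Test it by adding rainfall to the herbivore model.
claim   <- lm(herbivores ~ plant_biomass + rainfall, data = dat)
p_claim <- summary(claim)$coefficients["rainfall", "Pr(>|t|)"]

# Fisher's C combines the claim P values: C = -2 * sum(log(p)),
# compared to a chi-square distribution with 2k degrees of freedom
fisher_c <- -2 * sum(log(p_claim))
df_c     <- 2 * length(p_claim)
p_model  <- pchisq(fisher_c, df = df_c, lower.tail = FALSE)
cat(sprintf("Fisher's C = %.2f on %d df, P = %.3f\n", fisher_c, df_c, p_model))
```

A large P here means the data are consistent with the missing arrow, i.e. the proposed causal structure is not rejected. Because each piece is just a regression, you can see why mixed effects, Poisson errors or PGLS slot in so naturally.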

Here is the link: http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12512/full

Ethical reflections

Here is a link to an excellent article focussing on ethics in ecology and ‘practising what we preach’. If we can’t act in an environmentally sensitive way, how can we expect others? 

Belinda Christie: Increasing the success of conservation outcomes

I paraphrase a discussion with a colleague this morning: We fly tens of thousands of kilometres and drive tens of thousands of kilometres in Big Ugly highly inefficient Diesel Landcruisers and then tell people they should reduce their carbon footprint. We travel the world to help answer questions, but we don’t understand the ecology of our backyard.

In other news I just got the first paper from my PhD accepted in Ecological Applications. I will give a summary of the results soon!