Submitting manuscripts: Excellent advice from Mike Kaspari

Here is some excellent advice about submitting manuscripts from Mike Kaspari:

What journal gets the first peek at your manuscript? Results from a year of ruminating.

I don’t agree with point 3 about shopping a manuscript around – I think some manuscripts are strengthened post peer review. Nonetheless, sage advice. His benchmarks are great – currently I’m spending 100% of my time revising manuscripts and it is driving me insane… 30% is an excellent goal.

Random forests: identifying package conflicts in R

I just lost a morning of my life to a strange R problem. As a reader of my blog you may know my love for machine learning and Gradient Forests – it turns out that if you also load the univariate version (‘randomForest’), your beautiful gradient forest is no more – just a barren wasteland remains. Excuse the terrible metaphor; basically there is some conflict between the two packages that makes gradientForest produce the horrifying error: “The response has five or fewer unique values.  Are you sure you want to do regression?”. This was made worse by the fact that Gradient Forests was working perfectly yesterday (i.e. I hadn’t loaded randomForest). Today I found an inconsistency in the data and made some reasonable-sized changes, got distracted and ran randomForest on another piece of data, then came back to analyse my modified data from yesterday and bang – I got the above error, topped off with a second one: “The gradient forest is empty”.

The horror – was it the changes I made to the data? Did I modify the code and forget (I thought my book-keeping was pretty good…)? What was going on? I then ran the same analysis on the old data set and got the same error (phew), and eventually, by process of elimination, worked out that it was a package conflict. I wonder how many collective hours people spend diagnosing problems like this in science? Millions, I suspect. Anyway, I guess I learnt something (?) and will check for this type of issue more often.
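In case it helps anyone else, here is roughly how I’d now check for this kind of masking problem – a hedged sketch only, since the exact objects being masked will depend on your installed package versions:

```r
# Which packages are attached, and in what masking order?
search()

# Which objects are defined in more than one attached package?
conflicts(detail = TRUE)

# If randomForest is attached, detach it before using gradientForest
# (gradientForest relies on its own modified forest machinery, so having
# randomForest attached alongside it appears to cause the conflict):
if ("package:randomForest" %in% search()) {
  detach("package:randomForest", unload = TRUE)
}
library(gradientForest)
```

Restarting R with a clean session before fitting the gradient forest is the blunter but safer version of the same fix.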

Gradient Forests: http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf

An interesting read about what reviewers want

The Journal of Animal Ecology blog recently put out an interesting post about what reviewers want (see the link below). It is particularly interesting that so many respondents to the survey (74%) thought a major shake-up was needed – I couldn’t agree more. I also found the idea of training peer reviewers interesting; arguably it should be mandatory. No surprises that people reviewing for high-ranking journals are more likely to accept manuscripts and spend more time on them. I also find it strange that scientists find the idea of being paid to review articles weird – why should publishers simply profit off the authors and the readers without giving any of it back to the community? Unfortunately, I guess the consequence of paying reviewers would likely be increased publication costs, which would be annoying.

https://journalofanimalecology.wordpress.com/2016/09/22/what-do-reviewers-want/

How many types of statistical analysis approaches do you use regularly?

Whilst deciphering a really cool R package called GDM (see below), I got thinking: how many different statistical approaches and techniques have I read about, deciphered and applied in the last two years? What’s a usual number of techniques for people to use reasonably regularly? My list currently sits at approximately 30 – but I am a postdoc who spends basically all of my time analysing data from a diverse range of systems with an equally wide variety of data types, so perhaps that’s normal?

The first place I looked was my R package list, and I quickly realized that there were quite a few. I excluded ‘bread and butter’ GLM-type analyses and their Bayesian equivalents (e.g., ANOVA and GLMMs) and basic ordination techniques (e.g., PCoA, NMDS). I also haven’t included techniques for calculating the various aspects of diversity, or sequence alignment algorithms, as the list would just keep going. Because I deal with species distribution data, distances and (dis)similarities quite often, there is an obvious bias towards distance-based techniques (see below), with a mixture of spatial, epidemiological and phylogenetic approaches.

I’m too lazy to add citations and descriptions for each one – but they are all easy to find on Google, or email me if you are interested. If there is anything else I should know about and use to answer disease/phylogenetic community ecology questions, please make suggestions.

In no particular order:

Permutation-based ANOVA (PERMANOVA), permutation-based tests for homogeneity of dispersion (PERMDISP), canonical analysis of principal coordinates (CAP) analyses, dbMEMs, generalized dissimilarity modelling (GDM), distance-based linear modelling (distLM), multiple regression on distance matrices (MRM), network-based linear models (netLM), Gradient Forests, Random Forests, cluster analysis, SYNCSA analysis, fourth-corner analysis, RLQ tests, Mantel tests, Moran’s I tests (phylogenetic and spatial), phylogenetic GLMMs, everything in the R package Picante, eco-phylogenetic regression (pez), dynamic assembly model of colonization, local extinction and speciation (DAMOCLES), dynamical assembly of islands by speciation, immigration and extinction (DAISIE), all sorts of ancestral state reconstruction approaches, numerous Bayesian evolutionary analysis sampling trees (BEAST) methods, numerous phytools methods, environmental raster and phylogenetically informed movement (SERAPHIM), SaTScan, Circuitscape, point-time pattern analysis, kriging, epitools risk analysis.

Link to GDM: https://cran.r-project.org/web/packages/gdm/vignettes/gdmVignette.pdf

 

Useful Bayesian explanations

I’ve been going over Bayesian analysis principles recently and trying to work out ways to explain them clearly (and teach them).

Here are my top 5 links that do a reasonable job of explaining Bayesian principles (in no particular order):

  1. Count Bayesie with Lego: https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego
  2. Julia Galef has some good explanations: https://www.youtube.com/watch?v=BrK7X_XlGB8
  3. Aaron Ellison’s paper on Bayesian inference in ecology: http://www.uvm.edu/~bbeckage/Teaching/DataAnalysis/AssignedPapers/Ellison_2004.pdf
  4. The Etz-Files on Bayes’ theorem.
  5. On a lighter note you can’t beat xkcd:

Frequentists vs. Bayesians
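While putting teaching material together, I keep coming back to the classic diagnostic-test example of Bayes’ theorem. Here it is as a quick R sketch – the numbers are invented purely for illustration:

```r
# Bayes' theorem: P(disease | positive) = P(pos | dis) * P(dis) / P(pos)
prior       <- 0.01   # P(disease): 1% of the population infected
sensitivity <- 0.99   # P(positive | disease)
false_pos   <- 0.05   # P(positive | no disease)

# Total probability of a positive test (the evidence):
evidence  <- sensitivity * prior + false_pos * (1 - prior)

# Posterior probability of disease given a positive test:
posterior <- sensitivity * prior / evidence
posterior   # ~0.17 -- a positive test is far from a sure thing
```

The counter-intuitive result (a 99%-sensitive test yielding only ~17% posterior probability) is exactly why the prior matters, and it makes a nice teaching hook.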

Coupling BEAST with eco-phylogenetics: Tales from Glasgow

Can community-level phylogenetics be used effectively with population genetic approaches to better understand infectious disease dynamics? This was one of the questions that came up on my recent trip to the University of Glasgow. It was great fun hanging out with the fine folk from the IBAHCM (Institute of Biodiversity, Animal Health and Comparative Medicine), with numerous discussions about life, the world and all sorts of disease ecology topics. The purpose of the trip was a research exchange with Roman Biek and his lab to become more familiar with BEAST and associated phylodynamic tools. Naturally it got me thinking about how to synthesize these tools with community phylogenetics – particularly for understanding transmission dynamics. Basically, BEAST provides excellent spatial/temporal estimations of disease spread, but is not as good at linking phylogenetic information to multiple interacting landscape and host variables. Those are my conclusions for now, anyway – BEAST can do GLMs apparently, but I’ve heard the interpretation can be difficult. Stay tuned for my review on the topic, which is nearly ready to submit somewhere.

On a more applied note – if you are a BEAST user, or interested in becoming one, here is a link to a useful set of tutorials: https://github.com/beast-dev/beast-mcmc/tree/master/doc/tutorial. I can also highly recommend the R package ‘Seraphim’ for post-BEAST spatial analysis of pathogen dynamics: http://evolve.zoo.ox.ac.uk/Evolve/Seraphim.html – though installing it in R is a little tricky (this will be the topic of a future blog post).

Most useful blog comments section…

I was just reading a few older posts from Dynamic Ecology and stumbled into this post on the use and abuse of AIC (https://dynamicecology.wordpress.com/2015/05/21/why-aic-appeals-to-ecologists-lowest-instincts/).

After reading Brian McGill’s really cool article, I started to read the comments section (as the footnotes suggested I should). Wow – this is one way to get an AIC education! Probably the longest but most insightful comments section I’ve ever read – and absolutely worth the time. Lots of great ecology minds providing a really interesting debate. I agree with Mark Brewer that the comments section and the original article should be turned into a Methods paper.

New favorite method combination: GF and GDM

Have you ever had that feeling of jubilation (or some similar feeling) when you read a paper that perfectly synthesizes ideas you’ve been thinking about for a long time? That’s exactly the emotion I experienced when reading the 2014 paper by Matthew Fitzpatrick and Stephen Keller that uses generalized dissimilarity modelling (GDM) and gradient forests (GF) to link genomic data to environmental gradients. How had I not seen this previously?

Just a qualifier: I haven’t used these methods yet, but I can see the gap that these analyses fill. I have always considered random forests an intuitive way to understand the ecological drivers of species distributions, and lamented the fact that I didn’t know how to do the same with genetic data. I also had the delusion that one day I would get around to thinking about the idea in more detail and putting something together. No need now, I guess! Their solution is quite simple and, from what I can tell, consists of basically just performing GF on a SNP data set. I think the next step with this type of approach is to think about how to incorporate evolutionary models (e.g., Brownian motion).

GDM also looks like an interesting technique that I’d seen previously but never really grappled with. There is also a phylogenetic GDM which could be useful too (Rosauer et al. 2014, Ecography). I really like their idea of using distance-based MEMs (basically a version of spatial PCoA) to incorporate spatial autocorrelation into the GF model. I’ve done something similar with other modelling techniques, so it’s nice to see this method used in this context. Anyway – in short, I’m looking forward to testing and thinking more about these new and potentially really useful analysis approaches.
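For anyone wanting to try the dbMEM idea, here is a minimal sketch of how I’d generate the spatial eigenvectors with the adespatial package – the coordinates are fake, and in a real analysis you would feed the resulting columns into GF/GDM as extra predictors:

```r
library(adespatial)

# Fake site coordinates purely for illustration:
set.seed(1)
coords <- data.frame(x = runif(30), y = runif(30))

# Distance-based Moran's eigenvector maps: by default dbmem() keeps the
# eigenvectors modelling positive spatial autocorrelation.
mems <- dbmem(coords)

# Each column is a spatial eigenvector; use them as candidate predictors
# alongside the environmental variables in the GF or GDM model.
head(as.data.frame(mems))
```

The nice property is that broad-scale eigenvectors come first, so you can also use the leading MEMs on their own as a coarse spatial trend.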

Here is the link: http://onlinelibrary.wiley.com/doi/10.1111/ele.12376/full

 

Model-based approaches and eco-phylogenetics

Ever since I was a bright-eyed, naive undergrad, I’ve been indoctrinated into the distance-metric-coupled-with-randomization-test community statistics paradigm of Legendre and Anderson (among others). That was all I knew, and I thought I could apply this plethora of techniques reasonably well. I knew model-based techniques for analysing community data sets existed, but largely ignored them. That was until I met Will Pearse (http://willpearse.com/), and my metric/randomization world has slowly been slipping since.

As David Warton suggests (see the link to mvabund), metric/randomization techniques were great shortcuts, but these shortcuts are no longer necessary given increases in computing power. I think he slightly overstates the mean–variance problem, as most people who deal with community abundance data apply some type of transformation. Nonetheless, model-based approaches such as mvabund and the eco-phylogenetic PGLMM have numerous advantages in terms of power and flexibility (see Wang et al. 2012 and Ives and Helmus 2011). Interestingly, considering the advantages offered, neither package is commonly used (particularly PGLMM) – 39 citations for PGLMM and just over 100 for mvabund, compared to the thousands (I’m guessing) that have used metric/randomization approaches. I wonder when this will change? They require only marginally more effort, and there are plenty of well-written guides to ease the skeptical community ecologist into them (see Will’s below). Maybe, for most people, it takes some convincing to move away from the techniques you learnt during your PhD…

I’m not saying that metric/randomisation methods will become redundant, by the way. In fact, currently, if you are interested in which landscape/biotic factors shape phylogenetic patterns, like I am, I don’t know of any set of methods that works better. If I’m missing something – let me know! Also, as Ives and Helmus suggest, metric/randomisation methods will still be important, at a minimum, for exploring complex community data sets prior to fitting a PGLMM or other similar model.
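To show how little extra effort mvabund actually demands, here is a minimal sketch using the spider data set that ships with the package: it fits a separate negative-binomial GLM to every species simultaneously, then runs a resampling-based test of the whole community response.

```r
library(mvabund)

# Hunting spider abundances at 28 sites, plus environmental variables:
data(spider)
abund <- mvabund(spider$abund)
env   <- as.data.frame(spider$x)

# One GLM per species, sharing a common model formula -- no distance
# matrix, no transformation, mean-variance handled by the NB family:
fit <- manyglm(abund ~ soil.dry, data = env, family = "negative.binomial")

# Multivariate (community-level) test via resampling:
anova(fit)
```

Compared with a PERMANOVA on a Bray–Curtis matrix, the model-based version lets you inspect residuals per species (`plot(fit)`), which is exactly the flexibility Warton argues for.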

Links:

Mvabund: https://www.youtube.com/watch?v=KnPkH6d89l4

Wang et al, 2012: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2012.00190.x/full

Ives and Helmus 2011: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2012.00190.x/full

Will’s PGLMM guide: https://cran.r-project.org/web/packages/pez/vignettes/pez-pglmm-overview.pdf

Useful microbial community ecology resource

GUSTA ME (GUide to STatistical Analysis in Microbial Ecology) is a website packed full of useful information on commonly employed community ecology (plus other) techniques in a microbial setting. It probably needs updating, as lots of new model-based approaches have appeared in the last couple of years. Still, it is broadly useful, as the descriptions of the techniques are spot on.

Here is the link: http://mb3is.megx.net/gustame/home

I have not read the associated paper yet – but I’m sure it could be useful too.