Excellent series of blog articles about data science

I just found this excellent series of articles by John Mount. The explanations are clear and intuitive, and the posts include really useful R code to recreate the figures. A must-read if you are getting into data science.

 

http://www.win-vector.com/blog/2015/09/willyourmodelworkpart2/

Random forests: identifying package conflicts in R

I just lost a morning of my life to a strange R problem. As a reader of my blog you may know of my love for machine learning and Gradient Forests. It turns out that if you also have the univariate version (‘Random Forests’) loaded, your beautiful Gradient Forest is no more and just a barren wasteland remains. Excuse the terrible metaphor. Basically, there is a conflict between the two packages that makes Gradient Forests produce the horrifying error: “The response has five or fewer unique values.  Are you sure you want to do regression?”.

This was made worse by the fact that Gradient Forests was working perfectly yesterday (i.e. I hadn’t loaded Random Forests). Today I found an inconsistency in the data and made some reasonably sized changes, got distracted and ran Random Forests on another piece of data, then came back to analyse my modified data from yesterday and bang, I got the above error, topped off by a second one: “The gradient forest is empty”.

The horror! Was it the changes I made to the data? Did I modify the code and forget (I thought my bookkeeping was pretty good…)? What’s going on? I then ran the same analysis on the old data set and got the same error (phew), and eventually, by process of elimination, worked out that it was a package conflict. I wonder how many collective hours people spend diagnosing problems like this in science? Millions, I suspect. Anyway, I guess I learnt something (?) and will check for this type of issue more often.
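For anyone chasing a similar problem, here is a minimal base-R sketch of how to check for this kind of masking. It assumes the clash comes from two attached packages defining a function with the same name; the name "randomForest" in the comments is my assumption about which function is involved, and who_defines() is just a helper made up for this example.

```r
# Report where on the search path a function name is defined.
# The first entry is the version R calls when you type the bare name.
who_defines <- function(name) find(name, mode = "function")

# All names that are defined in more than one place on the search path.
masked <- conflicts(detail = TRUE)

# Demonstration with a base-R function; in the situation described above
# you would pass "randomForest" instead (assumed function name).
who_defines("sd")

# Possible fixes once a clash is confirmed (not run here):
# detach("package:randomForest", unload = TRUE)  # drop the package you are not using
# or call the intended version explicitly with pkg::fun()
```

Running who_defines() straight after loading each package would have shown me which version of the function was in front.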

Gradient Forests: http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf

Model-based approaches and eco-phylogenetics.

Ever since I was a bright-eyed, naive undergrad, I’ve been indoctrinated into the community statistics paradigm of Legendre and Anderson (among others): distance metrics coupled with randomization tests. That was all I knew, and I thought I could apply this plethora of techniques reasonably well. I knew model-based techniques for analysing community data sets existed, but largely ignored them. That was until I met Will Pearse (http://willpearse.com/), and my metric/randomization world has been slowly slipping ever since.

As David Warton suggests (see the mvabund link below), metric/randomization techniques were great shortcuts, but these shortcuts are no longer necessary given increases in computing power. I think he slightly overstates the mean-variance problem, as most people who deal with community abundance data apply some type of transformation. Nonetheless, model-based approaches such as mvabund and the eco-phylogenetic PGLMM have numerous advantages in terms of power and flexibility (see Wang et al. 2012 and Ives and Helmus 2011). Interestingly, considering the advantages on offer, neither approach (particularly PGLMM) is commonly used: PGLMM has 39 citations and mvabund just over 100, compared to the thousands (I’m guessing) of papers that have used metric/randomization approaches. I wonder when this will change? They require only marginally more effort to use, and there are plenty of well-written guides to ease the skeptical community ecologist into them (see Will’s below). Maybe, for most people, it takes some convincing to move away from the techniques you learnt during your PhD…
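To make the mean-variance issue concrete, here is a small base-R simulation. It is only a sketch: the 20 species, the negative binomial distribution and the dispersion value (size = 1) are all assumptions I picked for illustration, not anything taken from the mvabund papers.

```r
# Simulate counts for 20 hypothetical species whose mean abundance spans
# 1 to 100, using a negative binomial with an assumed dispersion (size = 1).
set.seed(1)
mu <- exp(seq(log(1), log(100), length.out = 20))
counts <- sapply(mu, function(m) rnbinom(200, size = 1, mu = m))

# Per-species mean and variance on the raw scale.
m_raw <- colMeans(counts)
v_raw <- apply(counts, 2, var)

# Slope of log(variance) on log(mean): 1 would be Poisson-like,
# values above 1 mean the variance grows faster than the mean.
slope_raw <- unname(coef(lm(log(v_raw) ~ log(m_raw)))[2])
slope_raw

# Repeat after a log(x + 1) transformation to see how much of the
# relationship remains.
logged <- log1p(counts)
m_log <- colMeans(logged)
v_log <- apply(logged, 2, var)
round(cbind(mean = m_log, variance = v_log), 2)
```

Comparing the raw and transformed summaries for your own data is a quick way to judge whether a transformation has dealt with the problem or whether a model-based approach is worth the extra effort.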

I’m not saying that metric/randomisation methods will become redundant, by the way. In fact, if you are interested in which landscape/biotic factors shape phylogenetic patterns, like I am, I don’t currently know of any set of methods that works better. If I’m missing something, let me know! Also, as Ives and Helmus suggest, metric/randomisation methods will still be important, at a minimum, for exploring complex community data sets prior to running PGLMM or similar methods.

Links:

Mvabund: https://www.youtube.com/watch?v=KnPkH6d89l4

Wang et al, 2012: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2012.00190.x/full

Ives and Helmus 2011: http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2012.00190.x/full

Will’s PGLMM guide: https://cran.r-project.org/web/packages/pez/vignettes/pez-pglmm-overview.pdf

A guide to generating geophylogenies

Evolution is fundamentally a spatio-temporal process, but how do you visualize it on a landscape? Geophylogenies are one elegant way to do just this but, as I’ve found out, they can be a little tricky to implement, particularly if you have a specific base map in mind.

I had a lovely base map of southern California with the national land cover % impervious surface layer overlaid, but quickly ran into an issue. Unfortunately (for those folks that use ArcGIS), it looks like the ArcGIS-based GeoPhyloBuilder doesn’t install properly anymore. If anyone has a solution, let me know. So the challenge was to get this base map out of Arc and into R or GenGIS to construct the geophylogeny.

I saved the map as a GeoTIFF and tried GenGIS first. Importing the TIFF file into GenGIS led to some weird coloration, and it didn’t look quite right even after playing around with the inbuilt features. I like this program though, and it can be excellent if the base map you want is from a different source.

Update: saving as a .jpeg with the world file added does a fine job.

Fig. 1: A fine example of what GenGIS can do.

Anyway, R can also do geophylogenies and I could get R to import the geoTIFF properly with the following code:

# Base map exported from ArcGIS as a GeoTIFF; needs the raster package
library(raster)
b <- brick("urbanSoCal.tif")
plotRGB(b)
# …

Then using Liam Revell’s brilliant phytools package:

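library(phytools)  # also attaches ape, which provides read.nexus()
tree <- read.nexus("TargetTree97")
# long and lat are assumed to be numeric vectors of tip coordinates, named by tip label
phylomorphospace(tree, cbind(long, lat), colors = setNames("red", 1), node.by.map = TRUE, add = TRUE, label = "horizontal")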

This took me a bit of mucking around, so hopefully this makes someone else’s use of this cool tool a bit easier, particularly if you are having problems in ArcGIS/GenGIS.

Some links:

GenGIS: http://kiwi.cs.dal.ca/GenGIS/Main_Page

Phytools: http://blog.phytools.org/2014/07/new-user-controls-in-phylotomap.html

Calculating effect size for linear mixed models

Papers that just report the AICs of statistical models bug me. AIC (Akaike information criterion) and similar metrics (BIC etc.) are great ways to compare statistical models, but are pretty much meaningless when reported by themselves, particularly when just one model is presented! Yes, I’m sure the authors selected the most parsimonious model with the lowest AIC, but they provide no detail about how much variation the model actually explains. You can have a ‘low’ AIC, for example, but the model may only explain a small fraction of the variation in the data.
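A quick simulated example of the point, using only base R. The effect size (0.3), sample size and seed are arbitrary choices of mine, and I use plain lm() rather than a mixed model to keep it short.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)   # weak true effect, lots of noise

m0 <- lm(y ~ 1)   # intercept-only model
m1 <- lm(y ~ x)   # model with the predictor

# AIC comparison: the model with x is expected to come out ahead
aic <- AIC(m0, m1)
aic

# ...yet the 'best' model can still leave most of the variation unexplained
r2 <- summary(m1)$r.squared
r2
```

Reporting only the AIC table here would hide how little of the variation the selected model accounts for.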

Calculating effect size for mixed models is difficult (see the article linked below), and until recently there was little consensus among statisticians about how to do it. For these reasons many biologists simply avoided calculating or reporting R2 for mixed models. This must drive people trying to do meta-analyses crazy! Now there is no excuse, with plenty of R packages available to help calculate pseudo-R2. One R package I like is MuMIn (multi-model inference) by Kamil Bartoń: https://cran.r-project.org/web/packages/MuMIn/MuMIn.pdf. It is easy to use and seems to be a reasonably robust way to calculate pseudo-R2 for mixed models fitted in lme4. It also has lots of other very useful tools, such as AICc calculation. Hopefully in the future we will see fewer papers getting through review without reporting this really useful statistic.
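As a rough illustration of what this looks like in practice, the sketch below fits a random-intercept model to the sleepstudy data that ships with lme4 and asks MuMIn for the pseudo-R2. The data set and model formula are just my choice of example; it assumes both lme4 and MuMIn are installed.

```r
library(lme4)    # lmer() and the sleepstudy example data
library(MuMIn)   # r.squaredGLMM() and AICc()

# Random-intercept model: reaction time over days, grouped by subject.
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# R2m: marginal R2 (variance explained by the fixed effects alone)
# R2c: conditional R2 (fixed plus random effects)
r2 <- r.squaredGLMM(fit)
r2

# Small-sample corrected AIC, also provided by MuMIn
AICc(fit)
```

Reporting both R2m and R2c alongside the AIC tells the reader how much the fixed effects explain on their own and how much is soaked up by the grouping structure.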

Nakagawa and Schielzeth (2012): http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210x.2012.00261.x/full

Quick link dump for a Friday: A whirlwind guide to mixed modelling

If you want a brief, practical intro to mixed modelling in R, I can highly recommend this link: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html

This link combined with the excellent book Mixed Effects Models and Extensions in Ecology With R by Zuur and colleagues can greatly help navigate the world of mixed effects models using R.