For a long time I wanted to keep my life simple and just use one social networking site (Facebook), but my colleague Luis Escobar finally turned me to the dark side and convinced me Twitter was a good idea. It took 6 months for me to get around to it, but now you can find me at @.
My first NSF DEB pre-proposal (or any ‘big’ grant for that matter) is submitted … hooray! It’s nice to regain head-space to think about something else, for a while at least. Even as a co-PI on a pre-proposal, the process was a tad stressful. To tell you the truth though, I actually enjoyed it. Maybe because the thinking was in the future tense rather than the past (i.e. I was thinking about future research rather than analyzing and writing about great data of the past)? Partly perhaps, but I also enjoyed the fact that Meggan and I worked together nicely, and with people across the world, to create a 5-page document that sold what we think, at least, is a cool and novel idea. I read it and want to actually do it – I hope the reviewers/panel agree!
If you think about it logically though, the process looks absurd: putting so much time and effort into something with an 8% chance of success is clearly insane (see the NSF blog below for trends). And this success rate is pre-Trump! I thought things were bad in Australia, but this actually makes the Australian Research Council (ARC) equivalent grants (Discovery or DECRA) seem like a ‘good’ bet, with success rates of around 17% (see below). I wonder where the cutoff is? At what success rate will researchers stop bothering to submit anything? Or is even 1% success worth the effort considering the reward? This situation is clearly stressful for faculty, but for postdocs like myself who rely on this type of funding to ‘make a name’ and to get a gig (read: a tenure-track position or another postdoc) it’s nearly too much. Nonetheless, I somehow push it to the back of my brain and continue to do what I enjoy doing (and hope is of some use to society). Should we move to NZ, Canada, Europe or Asia? Any perspective on these countries/continents would be great.
Even if we don’t get funded, which is highly likely, we can no doubt use these ideas in other grants. Fingers crossed of course! Either way, it has been an excellent learning experience and I’ve had fun helping craft the pre-proposal. There are excellent resources out there that have helped enormously and that I feel are valuable for grant writing in general (the NSF DEB blog and Mike Kaspari’s blog below, for example). Hopefully, one day things will get better and less of the collective grant-writing effort will be wasted.
In a world where collecting enormous amounts of complex and co-linear data is increasingly the norm, techniques that reduce data dimensions to something that can be used in statistical models are essential. However, in ecology at least, the means of doing this are unclear and the info out there is confusing. Earlier this year Meng et al provided a nice overview of what’s out there (see link below), written specifically for reducing omics data sets but equally relevant for ecologists. One weakness of the paper is that it provides only a small amount of practical advice, particularly on how to interpret the resulting dimension-reduced data. Overall though this is an excellent guide, and I aim to give a bit of extra practical advice on dimension reduction using the techniques that I use.
Anyway, before going forward – what do we mean by dimension reduction? Paraphrasing from Meng et al: dimension reduction is the mapping of data to a lower-dimensional space such that redundant variance in the data is reduced, allowing for a lower-dimensional representation (say 2-3 dimensions, but sometimes many more) without significant loss of information. My philosophy is to use the data in a raw form wherever possible. Where this is problematic due to co-linearity etc., and where machine learning algorithms such as Random Forests are not appropriate (e.g., your response variable is a matrix…), dimension reduction is the way to go. This is my brief practical guide to three common techniques:
PCA: dependable principal components analysis – excellent if you have lots of correlated continuous predictor variables with few missing values and 0’s. There are a large number of PCA-based analyses that may be useful (e.g., penalized PCA for feature selection, see Meng et al), but I’ve never used them. Choosing the right number of PCs is subjective and is a problem for lots of these techniques – an arbitrary cutoff of selecting the PCs that together account for ~70% of the original variation seems reasonable. However, if each of your PCs only explains a small amount of variation you have a problem, as fitting a large number of PCs to a regression model is usually not an option (and if PC25 is a significant predictor, what does that even mean biologically?). Furthermore, if there are non-linear trends this technique won’t be useful.
PCoA: Principal co-ordinate analysis, or classical scaling, is similar to PCA but is used on distance (dissimilarity) matrices. It has the same limitations as PCA.
NMDS: Non-metric multidimensional scaling is a much more flexible method that copes much better with non-linear trends in the data. This method tries to best preserve the distances between objects (using a ranking system), rather than finding the axes that best represent variation in the data as PCA and PCoA do. This means that NMDS often captures the variation in a few dimensions (usually 2-3), though it is important to assess NMDS fit using ‘stress’ (values below 0.1 are usually OK). There is debate about how useful these axis scores are (see here: https://stat.ethz.ch/pipermail/r-sig-ecology/2016-January/005246.html), as they are rank-based, so axis 1 doesn’t explain the largest amount of variation and so forth, as is the case with PCA/PCoA. However, I still think this is a useful strategy (see the Beale 2006 link below).
I stress (no pun intended!) that the biggest problem with these techniques is interpreting the new variables. Knowing the raw data inside and out, and how they are mapped onto the new latent variables, is important. For example, high loadings on PC1 may reflect high soil moisture and high pH. If you don’t know this, interpreting regression coefficients in a meaningful way is going to be impossible. It also leads to annoying statements like ‘PCoA 10 was the most significant predictor’ without any further biological reasoning for what the axis actually represents. Tread with caution and data dimension reduction can be a really useful thing to do.
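To make the ~70% rule of thumb for PCA mentioned above a little more concrete, here is a minimal Python sketch (I normally do this in R, so treat it as an illustration only). It assumes you already have the eigenvalues (variances) of your principal components from whatever PCA routine you use; the function name and the example eigenvalues are hypothetical, not from a real data set.

```python
def n_components_for_variance(eigenvalues, threshold=0.70):
    """Return (k, cumulative) where k is the smallest number of leading PCs
    whose cumulative share of the total variance reaches `threshold`, and
    cumulative is the list of cumulative shares for every PC."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    # PCs are ordered from largest to smallest variance.
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = []
    running = 0.0
    for value in ordered:
        running += value
        cumulative.append(running / total)
    for k, share in enumerate(cumulative, start=1):
        # Small tolerance guards against floating-point rounding.
        if share >= threshold - 1e-12:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues for seven PCs (made up for illustration).
example_eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25]
n_keep, cumulative_share = n_components_for_variance(example_eigenvalues)
print(n_keep, [round(share, 3) for share in cumulative_share])
```

With these made-up numbers the first three PCs cross the 70% line. The cutoff itself is still arbitrary, so it is worth checking how sensitive your downstream model is to keeping one PC more or fewer.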
Meng et al: http://bib.oxfordjournals.org/content/early/2016/03/10/bib.bbv108.full
Beale 2006: http://link.springer.com/article/10.1007/s00442-006-0551-8
Functional redundancy has always been a problematic buzzword in ecology. People (including myself) liked to use it and intuitively got it, though there was such a variety of techniques and approaches to calculate it that comparing across studies was impossible. I’ve previously used de Bello et al’s 2007 functional redundancy measure, but a broader framework was lacking.
Carlo Ricotta and colleagues have started to fill that gap with a reasonably coherent framework in the latest issue of Methods in Ecology and Evolution (see link below). It is a ripper of an issue by the way – I could write several more posts if I had the time. One noticeable thing missing is appropriate null models to test whether there is more redundancy in a community than expected by chance. Unless I’ve missed something, does anyone want to write a short paper?
Here is some excellent advice about submitting manuscripts from Mike Kaspari:
I don’t agree with point 3 about shopping a manuscript around – I think some manuscripts are strengthened post peer review. Nonetheless, sage advice. His benchmarks are great – currently I’m spending 100% of my time revising manuscripts and it is driving me insane… 30% is an excellent goal.
I just lost a morning of my life dealing with a strange R problem. As a reader of my blog you may know my love for machine learning and Gradient Forests – turns out that if you have the uni-variate version installed (‘Random Forests’) your beautiful Gradient Forest is no longer – just a barren wasteland remains. Excuse the terrible metaphor; basically there is some weird conflict between the two that makes Gradient Forests produce the horrifying error: “The response has five or fewer unique values. Are you sure you want to do regression?”. This was made worse by the fact that Gradient Forests was working perfectly yesterday (i.e. I hadn’t loaded Random Forests). Today I found an inconsistency in the data and made some reasonably sized changes, got distracted and ran Random Forests on another piece of data, then came back to analyse my modified data from yesterday and bang, I got the above error, topped off by a second error: “The gradient forest is empty”.
The horror – was it the changes to the data I made? Did I modify the code and forget (I thought my bookkeeping was pretty good…)? What’s going on? I then did the same analysis on the old data set and got the same error (phew), and eventually, by process of elimination, worked out that it was a package conflict. I wonder how many collective hours people spend diagnosing problems like this in science? Millions, I suspect. Anyway, I guess I learnt something (?) and will check for this type of issue more frequently.
Gradient Forests: http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf
Animal Ecology recently put out an interesting post about what reviewers want (see the link below). It was particularly interesting that so many respondents to the survey thought a major shake-up was needed (74%) – I couldn’t agree more. I also found training to be a peer reviewer an interesting idea, and think it should be mandatory. No surprises that people reviewing for high-ranking journals are more likely to accept review requests and spend more time on them. I also find it strange that scientists find the idea of being paid to review articles weird – why should the companies simply get to profit off the authors and the readers without giving any of it back to the community? Unfortunately, I guess that paying reviewers is likely to lead to increased publication costs, which would be annoying.
Whilst deciphering a really cool R package called GDM (see below), I was thinking about how many different statistical approaches and techniques I have read about, deciphered and applied in the last 2 years. What’s a usual number of techniques for people to use reasonably regularly? My list is at approximately 30 currently – but I am a postdoc who spends basically all of my time analyzing data from a diverse range of systems with an equally wide variety of data types, so perhaps that’s normal?
The first place I started looking was in my R package list, and I quickly realized that there were quite a few. I excluded ‘bread and butter’ GLM-type analyses and their Bayesian equivalents (e.g., ANOVA & GLMMs) and basic ordination techniques (e.g., PCoA, NMDS). I also haven’t included techniques to calculate the various aspects of diversity or sequence alignment algorithms, as the list would just keep going. As I deal with species distribution data, distances and (dis)similarities quite often, there is obviously a trend towards distance-based techniques (see below), with a mixture of spatial, epi and phylogenetic approaches.
I’m too lazy to add citations and descriptions for each one – but they are all easy to find on Google, or email me if you are interested. If there is anything else that I should know and use to answer disease/phylogenetic community ecology type questions, please make suggestions.
In no particular order:
Permutation-based ANOVA (PERMANOVA), permutation-based tests for homogeneity of dispersion (PERMDISP), canonical analysis of principal coordinates (CAP), dbMEMs, generalized dissimilarity modelling (GDM), distance-based linear modelling (distLM), multiple matrix regression (MRM), network-based linear models (netLM), Gradient Forests, Random Forests, cluster analysis, SYNCSA analysis, fourth corner analysis, RLQ tests, Mantel tests, Moran’s I tests (phylogenetic and spatial), phylogenetic GLMMs, everything in the R package Picante, ecophylogenetic regression (Pez), dynamic assembly model of colonization, local extinction and speciation (DAMOCLES), dynamical assembly of islands by speciation, immigration and extinction (DAISIE), all sorts of ancestral state reconstruction approaches, numerous Bayesian evolutionary analysis sampling trees (BEAST) methods, numerous phytools methods, studying environmental rasters and phylogenetically informed movements (SERAPHIM), SaTScan, Circuitscape, point-time pattern analysis, Kriging, epitools risk analysis.
Link to GDM: https://cran.r-project.org/web/packages/gdm/vignettes/gdmVignette.pdf
I’ve been going over Bayesian analysis principles recently and trying to work out ways to explain them clearly (and teach them).
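As one way of explaining the core idea, here is a minimal Python sketch of Bayes’ theorem applied to a diagnostic test. The prevalence, sensitivity and false positive rate are made-up numbers for illustration, and the function name is hypothetical.

```python
def posterior_probability(prior, likelihood, false_positive_rate):
    """Bayes' theorem for a binary hypothesis:
    P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0:
        raise ValueError("the evidence has zero probability under both hypotheses")
    return likelihood * prior / evidence


# Made-up numbers: 5% of animals are infected, the test detects 90% of
# infections, and 10% of uninfected animals test positive anyway.
p_infected_given_positive = posterior_probability(
    prior=0.05, likelihood=0.90, false_positive_rate=0.10
)
print(round(p_infected_given_positive, 3))
```

Under these assumptions a positive test only lifts the probability of infection from 5% to roughly 32%, which is the kind of counter-intuitive result that makes the prior worth talking about when teaching.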
Here are my top 5 links that do a reasonable job of explaining Bayesian principles (in no particular order):
- Count Bayesie with Lego: https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego
- Julia Galef has some good explanations: https://www.youtube.com/watch?v=BrK7X_XlGB8
- Aaron Ellison’s paper on Bayesian inference in ecology: http://www.uvm.edu/~bbeckage/Teaching/DataAnalysis/AssignedPapers/Ellison_2004.pdf
- The ETZ files Bayes theorem.
- On a lighter note you can’t beat xkcd:
Can community-level phylogenetics be used effectively with population genetic approaches to better understand infectious disease dynamics? This was one of the questions that came up on my recent trip to the University of Glasgow. It was great fun hanging out with the fine folk from the IBAHCM (Institute of Biodiversity Animal Health and Comparative Medicine), with numerous discussions about life, the world and all sorts of disease ecology topics. The purpose of the trip was a research exchange with Roman Biek and his lab to become more familiar with BEAST and associated phylodynamic tools. Naturally it got me thinking about how to synthesize these tools with community phylogenetics – particularly in understanding transmission dynamics. Basically, BEAST provides excellent spatial/temporal estimates of disease spread, but is not as good at linking phylogenetic information to multiple interacting landscape and host variables. Those are my conclusions for now anyway – BEAST can do GLMs apparently, but I’ve heard the interpretation can be difficult. Stay tuned for my review on the topic, which is nearly ready to submit somewhere.
On a more applied note – if you are a BEAST user or interested in becoming one here is a link to a useful set of tutorials: https://github.com/beast-dev/beast-mcmc/tree/master/doc/tutorial. Also I can highly recommend the R package ‘Seraphim’ for post-BEAST spatial analysis of pathogen dynamics: http://evolve.zoo.ox.ac.uk/Evolve/Seraphim.html – though installing in R is a little tricky (this will be a topic of a future blog post).