Sunday, July 21, 2013

Now, now...share nicely!

In lab meeting a couple of years ago, we discussed whether making government-funded ecology research publicly available would actually benefit science. The general consensus was that while the public should have the right to access research its tax dollars have paid for, making data open would not really benefit them or science. My labmates argued that there is already too much data and too few people with the knowledge necessary to make meaning from the data. Furthermore, they argued, the frequency with which some grants require data to be made publicly available would require researchers to take time away from science during peak field season in order to enter and upload the data. And I followed along.

However, attitudes are shifting. Recently there has been a flurry of papers and blog posts on open data and what it means for ecology. For example, in a really nice article in Frontiers in Ecology and the Environment, Hampton and colleagues argue that if ecologists are to survive, they must both share and use shared data. Yet in a survey, the authors found that fewer than half of the papers produced using NSF funds had also published some or all of the data used to write the paper. As another incentive to "open" data, the authors argue that there are instances, such as when rapid responses to environmental crises are needed, when open data is used more extensively than what they refer to as "dark data". Thus worries about data overload and lack of relevance appear to be unfounded; the government needs bang for its buck, not tree-hugging.

Joern Fischer, a professor at Leuphana University, responded to this paper on his blog, stating that while he believes sharing is a nice idea, in practice there is no shortage of data, and allowing people who are not intimately familiar with the sites where the data was collected to use it is dangerous. Ecology is apparently a touchy-feely science which cannot be reduced to data points that can be used to look for larger global patterns, a point which the Hampton paper also brings up.

But I would argue that 1. getting too intimate with your site is dangerous (you start seeing patterns which aren't there, so you MAKE them there when you do statistical analyses), and 2. we really just need more complete metadata, including many pictures of research sites throughout the seasons. For example, there have been fires in various plots at the Boston Area Climate Experiment, and they have been logged in the online shared lab notebook. However, to my knowledge, this information is only accessible to people working at the site. "Hidden" metadata like this must be made available to anyone reading papers and using the associated data to complete a meta-analysis of climate warming effects themselves. 

Another point that Joern brings up is that field ecologists will do the hard work collecting data and have to publish in smaller, regional, less-prestigious journals while the modelers sit at their desks, distant from the field, and compile all this data into articles the top journals are begging for. I have a number of gripes with this statement. First, if you are doing ecology to get publicity, you are in the wrong field. That applies to all desk-, lab-, and field-bound types. Second, this separation between writers and doers is ancient: how many techs do biomedical labs have, and yet PIs write the paper with no input from the technicians about what funky things happened along the way? Third, having gone from an almost exclusively field-based position to an almost exclusively computer-based one, I would do anything to be spending my summer outside looking at nature's pixels; working at a computer is not some lazy-ass bliss. Nothing is. Fourth, most ecological data collection can be done by minimally trained volunteers (Earthwatch actually requires that projects it funds use volunteer data collectors extensively); I reckon the future of ecology will be a PI with some model or question they want to ask, going to public data, identifying a hole, and involving the public to collect that data, and possibly analyze it. It seems like a grant writer's dream given the current funding requirements.

So what are we really worried about? The idea of more work? Being responsible for a broader array of literature? Isn't it our job to understand the world? Ecologists don't write grants which say "I want to understand exactly what happens in the four 6 m × 6 m plots I will be studying", but rather "I will design a study using four 6 m × 6 m plots superficially representative of the broader environment with the hope of understanding patterns and processes in ecology which can be extended to larger spatial scales".

But to scale up in this day and age, we have a responsibility not just to conjecture, but to actually test those conjectures. If nobody is asking the same question (or if it has been asked, but the data has been analyzed inappropriately), and we only have published results to go on, how will we do this? We can ask people for their raw data, but emailing busy professors who have to dig up datasets not necessarily formatted for sharing is a time-consuming process.

It's time to go beyond the costs of taking the time now to put your data in a clear format for others (and you a few years down the line) to access, and to think long-term. That is not to say that I think all data should be analyzed blindly without respect to site intricacies; we don't know what factors are important in ecological data, and how they may differ with time and space. However, looking over larger landscapes allows us to examine broader patterns and identify best practices for land management in the absence of finer resolution data, and if the metadata we have does not predict responses of interest at a broader scale, we have a reason to apply for more funding to do field work and ask why. 

For a field so obsessed with statistics, such aversion to testing the effect of increasing sample size seems ridiculous. 

For a more positive spin on open data, Chris Lortie of York University has posted a pre-print on the role of open data in meta-analyses.
