Here is the third and last installment in my themed set of posts. In part 1, I reported on how my sabbatical visit to New Zealand grew into a three-way collaboration between labs in New Zealand, the US, and Canada. In part 2, I described a conceptual framework that the Powell Center working group on soil carbon came up with to guide our thinking about the next generation of models for soil carbon cycling. Here, I’ll describe how data, code, and results come together when we generate results from thermal analysis experiments.

Like many other analytical methods, the thermal analysis techniques we apply to soil organic matter research are relatively easy to execute. That is, running a sample is easy. The challenge comes after running the sample, in getting from raw data to an interpretable result. There are a number of steps to get from raw data to a result, most of which now involve code. This summer, I’ve been moving, slowly but surely, toward making the process as open and reproducible as possible. Having read Bond-Lamberty et al. (2016), Wilson et al. (submitted) and Wilson et al. (2014), among others, I’m beginning to see the path forward.

But first, a brief history lesson. When we first started performing analytical thermal analysis, we stored our data in the usual chaotic way. At least each project got its own folder, but all data management was done by hand using proprietary software and spreadsheets. We then hired someone for six months to write a huge amount of code to engineer a PostgreSQL- and R-based database system for data storage and manipulation that could, in theory, be accessed publicly on a server. Unfortunately, that system was so complex that when it was “broken” by a change in the instrument software’s file format, it was deemed obsolete. We have since moved to smaller chunks of code, using R scripts for most of the data manipulation tasks, but a few steps still rely on proprietary/commercial software.

The raw data from analytical thermal analysis consist of six data streams (time, temperature, mass, heat flux, CO2 evolved, and H2O evolved) from two different instruments that don’t “speak” to each other. Each of these data streams consists of 5,400 to 20,000 data points. They generate several qualitative thermograms (plots as a function of temperature), but the real power of the analyses comes when they can be quantified and statistically compared to each other – and this requires code to convert the raw data to “tidy” data, then to merged, final data, and finally to results in the form of summary indices and statistical comparisons.
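To make the merge step concrete, here is a minimal, self-contained sketch in R. It uses simulated streams in place of real instrument exports, and all of the column names and the example summary index are illustrative, not our actual scripts:

```r
# Simulated stand-ins for the two instrument exports; in practice these would
# come from something like read.csv("sample01_tga.csv") etc.
time_s <- seq(0, 5400)                                 # one reading per second
temp_C <- seq(25, 700, length.out = length(time_s))    # linear temperature ramp

# Stream 1 (thermal analyzer): time, temperature, mass
tga <- data.frame(time_s, temp_C,
                  mass_mg = seq(20, 12, length.out = length(time_s)))

# Stream 2 (gas analyzer): time, CO2 evolved (toy peak centered at 400 C)
egas <- data.frame(time_s,
                   co2_ppm = 400 + 100 * exp(-(temp_C - 400)^2 / 1e4))

# Merge the two streams into one tidy table keyed on elapsed time
tidy <- merge(tga, egas, by = "time_s")

# Example summary index: the temperature at which CO2 evolution peaks
t_max <- tidy$temp_C[which.max(tidy$co2_ppm)]
```

In the real workflow the two instruments report on different clocks and sampling intervals, so the join is more involved than a one-line `merge`, but the shape of the task — align streams on time, then summarize — is the same.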

Aside from a growing consensus about best practices, the impetus for moving to more open and reproducible data management is that I recently purchased new instrumentation that will substantially increase our sample throughput. The historical accumulation of data, and the growing demands on the method, present a substantial challenge for data management. The goal now is to be able to post raw data and allow our collaborators to perform their own data manipulations according to a sequence of reproducible steps that we recommend or prescribe. This will require a major change in how we organize our own data and the development of rigorously tested, reproducible means of data manipulation, and it will place new demands on our collaborators for how they handle their data. The current vision is to use a repository system (e.g., GitHub) for storing data (both raw and tidy), metadata, and scripts, and for version control of the data and scripts. We would post raw data to the repository for access by a collaborator, and provide the documentation and scripts as a recommended means of converting raw data into usable data and, ultimately, results.
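One way such a repository might be laid out (the names here are illustrative, not an actual repository of ours):

```
thermal-analysis-data/
├── data-raw/      # instrument exports, exactly as produced, never edited
├── data-tidy/     # merged, cleaned tables written by the scripts
├── scripts/       # R scripts, one per processing step, run in numbered order
├── metadata/      # sample descriptions, instrument settings, units
└── README.md      # recommended sequence of steps from raw data to results
```

Keeping the raw exports untouched and regenerating everything downstream from scripts is what makes the chain reproducible: a collaborator can delete `data-tidy/`, rerun the scripts, and recover the same results.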