A brilliant colleague of mine just published some of our work today. As a middle author, I haven't read through it in a few months, so I thought I would live-toot a read-through to tell y'all why it is so awesome.

I'll be reading through this as an academic, which means it's not going to be a front-to-back read. If there are interesting tidbits from the 5+ years I've been involved, I'll sprinkle them in. 1/n

Coastal carbon data and tools to support blue carbon research and policy. Holmquist et al. 2023 https://onlinelibrary.wiley.com/doi/10.1111/gcb.17098 #SoilBGC #BlueCarbon #SciLit #DataPaper 2/n
This is a bit of an odd duck as a #DataPaper. Traditional research articles showcase some new development in the field and connect it to a reproducible line of evidence. #DataPapers, on the other hand, are relatively new and focus on the data collection itself: the data is the main development. The promise here is that the data is broad enough and robust enough to be of general interest to other researchers... let's dig in. 3/n
"Coastal carbon" or 'blue carbon' sits at the interface of the land and sea. The dark, squishy soils of this in-between place hold a truly incredible amount of carbon that is often overlooked when we count soil carbon stocks. Coastal ecosystems occupy a highly dynamic interface and are strongly affected by sea level rise and human communities. To understand coasts we need to understand their soils. 4/n
When reading a research article, I always start with the abstract. This paper makes the case that while there are a lot of meta-analyses of coastal soil carbon stocks out there, they all reinvent the wheel by creating their own one-off data standards. While this paper also proposes yet another data standard, it documents that standard and provides #rstats tools to support the database. 5/n
Talking about the soil #carbon database here: it's big. There are over 6700 #soil profiles across marshes, mangroves, tidal freshwater forests, and seagrasses. That might not sound like a lot if you come from data science, but those are over 6700 holes that had to be dug and cores sliced up for analysis in the lab. This represents an insane amount of labor across many different groups. 6/n

This next bit in the abstract is interesting. You can break down those 6700 #soil profiles into different use cases, including soil stock assessment (n=4815), #carbon burial rates (n=533), and dynamic soil formation models (n=326). This is a slightly more nuanced "garbage in, garbage out" argument.

It's not that the data isn't valuable, it's that data is fit-for-purpose. 7/n

There was more carbon (generally) in tidal freshwater forests than seagrasses. And we need deeper cores and better spatial coverage outside the US.

Neither result here is particularly groundbreaking but you wouldn't expect that from a data paper. 8/n

Abstract is wrapped. Now onto the figures. First up is a big old relational database diagram. You have your standard site-study-core-'depth series' hierarchy that echoes work done in other databases.

I'm particularly pleased with the naming conventions here for the columns. Unlike single-study databases, there is huge variation in methods and units here. Methods and units are treated as primary data objects instead of #metadata. 9/n
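To make that concrete, here's a tiny sketch of what "units as data" buys you. The column names and values are my own invention for illustration, not the paper's actual schema:

```python
# Hypothetical depth-series row: units and methods are ordinary columns,
# so conversion code can read them instead of guessing.
row = {
    "core_id": "core_001",                  # made-up identifier
    "dry_bulk_density": 0.21,
    "dry_bulk_density_unit": "g cm-3",      # the unit travels with the value
    "carbon_method": "elemental analyzer",  # would link to a methods table
}

def bulk_density_si(row):
    """Convert dry bulk density to kg/m^3 using the unit column."""
    factors = {"g cm-3": 1000.0}  # 1 g/cm^3 = 1000 kg/m^3
    return row["dry_bulk_density"] * factors[row["dry_bulk_density_unit"]]
```

The win: an unrecognized unit becomes a loud KeyError instead of a silently wrong number.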

First table is an attribute table for the site descriptions. While the data geek in me wishes for a hierarchical grid system, we stick with a good old bounding box for lat/lon in WGS84. I've got to admit, I do like that there is no 'precise' lat/lon option here. This forces an uncertainty on geolocation, which I have struggled with in other databases. 10/n

Spoke too soon. The next table in the paper describes the core attributes, which *does* have lat/lon (although they nod to position accuracy and methods). Elevation relative to sea level is often a critical control on these systems, so location precision matters a bit more here than in upland soils.

Observation time is broken up into 3 columns: year, month, day! Nice! Sorry ISO8601, no one actually uses you. Date strings are such a pain to work with. 11/n

On to the next table, "depth series". Huh, this is the first time I haven't heard this called 'layers', but ok. Oh interesting, depth is broken into sampling increment and representative interval.

Lots of different isotopes identified here (137Cs, 210Pb, 214Pb, 226Ra, 214Bi, 14C, 13C, 7Be), plus age 'markers' and the modeled age. The biogeochemistry, by comparison, only has bulk density and carbon fraction. No nutrients or texture. You can see fit-for-purpose here. 12/n

The next table is slightly less standard for soil databases: a methods table.

Interesting that sieve size isn't even mentioned in the bulk density specs but drying temperature is.

I see a future project here constructing an ontology from this table at some point. I imagine this level of detail is missing from most of the observations, but just having the framing here is neat. 13/n

Study table... pretty standard bibliography. Hopefully there is guidance to cite *all* contributing studies for any data used in reanalysis. Academics need their carrots to contribute to these larger databases!

Next table is the controlled vocabulary. Clearly I've been in this too long; it's a thing of beauty. The different coring methods documented here alone are glorious. I'm really curious how many observations have this level of documentation. 14/n

Backing up to the species and habitat table: I've got to admit a certain satisfaction seeing macro-ecology get a super simplified treatment (just three data columns). Most of the time it's the soils that get oversimplified.

Impact table similarly only has one data column. So much heavy lifting done by this one little data column.

Both tables here will likely need future development but are a good example of fit-for-purpose design. 15/n

Back to the figures, the workflow diagram: it might appear standard at first read, but most soil databases are manually compiled, so introducing the idea of a 'hook script' is HUGE for this domain. 16/n
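For the unfamiliar: a hook script is a small, versioned translator from one contributor's quirky format into the shared schema, so ingestion is re-runnable instead of a one-off manual edit. A minimal sketch, where the source column names and the percent-to-fraction rule are my illustration, not the paper's actual code:

```python
# Map one (hypothetical) source dataset's column names onto standard ones.
COLUMN_MAP = {"BD_gcc": "dry_bulk_density", "OC_pct": "fraction_carbon"}

def hook(source_rows):
    """Reshape a contributor's rows into the standard depth-series table."""
    out = []
    for src in source_rows:
        # Rename known columns, pass unknown ones through untouched.
        std = {COLUMN_MAP.get(k, k): v for k, v in src.items()}
        if "fraction_carbon" in std:
            std["fraction_carbon"] /= 100.0  # this source reported percent
        out.append(std)
    return out
```

Because the translation lives in code, re-running ingestion after a schema tweak is cheap, which is exactly what a living database needs.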

Figure 3 is the obligatory data dashboard screenshot, which is, almost certainly, already out of date. My first 'real' research project was GUI dev for stats software; respect to the UI/UX folks.

Moving on to representative coverage... apparently carbon burial is only interesting to marsh folks; mangrove and seagrass coverage for this data type is horrid. 17/n

Figure 4 looks at data types and I'll be a monkey's aunt. Damned if bulk density isn't the most common measurement. Bulk density is the ratio of dry mass to volume of a soil sample and is often inferred from organic carbon (and left out of data sets). There are almost twice as many bulk density data points as carbon fraction points in this database. This says something quite interesting about the marsh soil folks; not sure what yet, but interesting. 18/n
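For anyone keeping score at home, bulk density matters because it's one of the two numbers (along with carbon fraction) you multiply to turn a core slice into a carbon stock. A back-of-envelope sketch with made-up numbers, not values from the paper:

```python
def carbon_stock_g_cm2(bulk_density_g_cm3, fraction_carbon, depth_cm):
    """Carbon per unit area (g C / cm^2) for one depth increment:
    (g soil / cm^3) * (g C / g soil) * cm = g C / cm^2."""
    return bulk_density_g_cm3 * fraction_carbon * depth_cm

# e.g. a 10 cm marsh increment at 0.3 g/cm^3 and 10% carbon (illustrative):
stock = carbon_stock_g_cm2(0.3, 0.10, 10.0)
```

Miss either input and the stock estimate falls apart, which is why seeing bulk density dominate the counts is so striking.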
More fit-for-purpose visuals in Figure 5 (might be a little redundant compared to some of the tables). Figure 6 hammers home the spatial coverage issues (surprise <not>: data coverage is focused on the Global North). Table 10 and Figure 7 break down soil organic carbon measures by habitat; no Gaussian distributions here! 19/n

Figures and tables done! Moving on to introduction...

If we manage coastal soils 'right' we could draw carbon down into the soils (maybe). But to make these decisions we need data. There have been several prior data synthesis efforts going back 20 years to analyze existing data, but none of them were developed to be a living database. And to build one, we *need* researchers to publish their data. 20/n

#FAIRData requires data models that preserve as much detail as is practical, balancing transparency with simplicity. Data provenance needs to move beyond documenting where the data comes from to include academic attribution (i.e. citation counts) for the original data contributors, rewarding researchers who share their data.

While this study does have priorities, no study was rejected for methodological reasons. Instead, 'utility' scores were assigned. 21/n

Data models were generally wide rather than long. (Detailed data descriptions are in the tables described above in the thread.) Contributing data sets were assigned individual DOIs. 'Hook' scripts were used to reshape the source data into the data tables described here. Data sets then went through individualized QAQC. Secondary data classifications (e.g. habitat) were then inferred from the original data and layered onto the aggregated dataset. 22/n
@ktoddbrown ahem, big ISO8601 fanboi here for current data handling and days in the City swerving Y2K! B^>
@DamonHD I've seen too many date strings mangled by well-intentioned software to fully support it as a robust data format, however. Just use N different objects to interact with the lowly humans.