A question for all the people working in #massspec #massspectrometry #teammassspec #proteomics, especially in platforms: how do you store and manage all your raw / mzML files? Anything beyond chucking files onto a big enough disk? Folks in other fields of #bioinformatics, do you have a management system for all the raw data from your sequencers and other machines?

https://hachyderm.io/@Elendol/109427080898974176

Elendol (@[email protected])

If someone knows, they are likely to be here. I am looking for a software solution (or pointers to build one) able to organise and manage BIG data: not billions of small text records, but tens of thousands of ~1 GB binary files. Something better than a basic file system, so maybe a database for the metadata (and something to extract that metadata). We want to keep the raw files as the source of truth but also track files converted to friendlier (lossy) formats. #programming #devops #dataOps

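The "database for the metadata, raw files as source of truth" idea above can be sketched with nothing but the standard library. The schema and field names here are illustrative assumptions, not an existing tool:

```python
# Sketch: a minimal SQLite catalog over a directory of raw files.
# Schema and field names are assumptions for illustration.
import hashlib
import sqlite3
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_file (
    path      TEXT PRIMARY KEY,   -- source of truth on disk
    size      INTEGER,
    sha256    TEXT,               -- integrity check for the raw file
    converted TEXT                -- path of the derived (lossy) mzML, if any
);
"""

def sha256_of(path: Path, chunk=1 << 20) -> str:
    """Stream the file in chunks so ~1 GB binaries never sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def register(db: sqlite3.Connection, path: Path, converted=None):
    """Record (or refresh) one raw file and its optional converted sibling."""
    db.execute(
        "INSERT OR REPLACE INTO raw_file VALUES (?, ?, ?, ?)",
        (str(path), path.stat().st_size, sha256_of(path), converted),
    )
    db.commit()
```

Pointing a loop of `register` calls at a directory glob is then enough to bootstrap the catalog; the query side is plain SQL.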
@Elendol @makingions for me, it's even worse. Our lab server doesn't have a big enough disk :(

@zhenboli @Elendol

Yep, in industry the need is typically to keep data on servers for regulatory/IP reasons, and in core facilities/group labs etc. it becomes a challenge to manage data (like @zhenboli says), with approaches ranging from external HDDs to AWS.

@UCDProteomics captured thoughts on Twitter about data wrangling e.g. https://twitter.com/UCDProteomics/status/1482104659858771970

@mingxunwang is the main developer of MassQL, which enables querying of raw mass spectrometry data. https://mwang87.github.io/MassQueryLanguage_Documentation/
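For flavour, a MassQL query looks roughly like the following (adapted from the canonical example in the documentation linked above; check the docs for exact syntax). It selects MS2 scans containing a product ion near m/z 85.0282:

```
QUERY scaninfo(MS2DATA) WHERE MS2PROD=85.02820
```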

Brett Phinney on Twitter

“We finally return for 2022 “Proteomics old time radio hour” with @neely615. Thursday, Jan 20 at 11:30 AM PST in @clubhouse. Join us! https://t.co/DEPvEBm2fh”


@makingions @zhenboli @UCDProteomics @mingxunwang thank you.

I need to give it a listen.

Yes, very useful to query raw files. My issue now is how do I query 10k raw files 😁 (well, the issue is more: how to do it well).

@Elendol @makingions @zhenboli @UCDProteomics

So we regularly query 150K files with MassQL. Happy to chat about that!
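Querying 150K files mostly comes down to fanning a per-file query out over a pool. A minimal sketch, where the hypothetical `matches` predicate stands in for whatever per-file query you actually run (MassQL or otherwise):

```python
# Sketch: fan one query out over many files and keep only the hits.
from concurrent.futures import ThreadPoolExecutor

def matches(path):
    # Hypothetical per-file predicate; swap in a real query (e.g. MassQL).
    return path.endswith(".mzML")

def query_many(paths, max_workers=8):
    """Run the per-file query in parallel, preserving input order of hits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hits = list(pool.map(matches, paths))
    return [p for p, hit in zip(paths, hits) if hit]
```

Threads suit I/O-bound scans of files on disk; a process pool (or a cluster scheduler) is the same shape when the per-file query is CPU-heavy.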

@mingxunwang I'm mostly interested in how you have all your 150K files stored: locally, in the cloud, distributed locally? How do you keep track of metadata (recompute on demand, or store it in a file/database)? Do you keep some metadata in file names? Do you track raw files and their mzML conversions together? I am trying to build a higher abstraction above the file system while avoiding big, complex frameworks that may be overkill and would be difficult to maintain.
@Elendol So in general I agree with you: having it all local in a file system is sometimes not super ideal, but that's the way it works now. However, we're working on systems to have it more available (in computable forms), since reading mzML files is honestly the slowest part of any compute right now. As to the metadata, there are several things we use: for just knowing what exists, we have automated dataset caches so we know the basics about everything, and also crowd-sourced metadata in ReDU.
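The "automated dataset cache" idea, reduced to its core, is recompute-on-demand metadata behind a cheap staleness check, so a cache over 150K files only re-reads files that actually changed. A sketch under that assumption (field names are hypothetical):

```python
# Sketch: metadata cache keyed by a cheap (size, mtime) stamp, so the
# expensive raw-file parse only reruns when the file changed.
from pathlib import Path

def file_stamp(path: Path):
    """Cheap proxy for 'file unchanged': size plus mtime in nanoseconds."""
    st = path.stat()
    return [st.st_size, st.st_mtime_ns]

def cached_metadata(path: Path, cache: dict, extract):
    """Return cached metadata unless the file's stamp changed."""
    key = str(path)
    stamp = file_stamp(path)
    entry = cache.get(key)
    if entry and entry["stamp"] == stamp:
        return entry["meta"]
    meta = extract(path)          # the expensive part (parse the raw file)
    cache[key] = {"stamp": stamp, "meta": meta}
    return meta
```

The `cache` dict serializes trivially to JSON or a SQLite table, which is what makes the stored-vs-recomputed question mostly a persistence detail.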
@Elendol As for conversions, in MassIVE we had built things to autoconvert and it's reasonably successful, but I'm trying to really consolidate in my new lab. I really feel you about the concern of where to put the data: should it all be on a single file system or in the cloud, and who pays? In my new lab, we're going down the route of all-flash systems so that we can recompute on tons of data without even thinking about IOPS.
@Elendol I hope reasonably soon we can have data replicated in multiple places, so you can compute on the raw data wherever you are, in the cloud or on our systems. It's not a super easy thing, but it's not terrible either.

@mingxunwang long term I think we will store all the raw data on AWS Glacier, just for peace of mind. Flash storage is a cool idea though.

So far we don't convert to mzML because we haven't really needed it, not systematically at least. I am trying to organise all of that. This conversion will be a massive pain point, a quite annoying bit of the process (especially for all the older files; for the new ones it will be done as they come in).
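Working through a conversion backlog like this is mostly "find raw files with no mzML sibling yet and queue them". A sketch, assuming ProteoWizard's msconvert CLI for the conversion itself (check the flags against your install):

```python
# Sketch: select unconverted raw files and build msconvert commands for them.
from pathlib import Path

def needs_conversion(raw_dir):
    """Yield raw files whose .mzML counterpart does not exist yet."""
    for raw in sorted(Path(raw_dir).glob("*.raw")):
        if not raw.with_suffix(".mzML").exists():
            yield raw

def msconvert_cmd(raw, out_dir):
    """Command list for one conversion; run it with subprocess.run(cmd,
    check=True) on a host (or container) that has msconvert available."""
    return ["msconvert", str(raw), "--mzML", "-o", str(out_dir)]
```

Because the selection is idempotent (it keys off the mzML's existence), the same loop handles both the old-file backlog and new files as they arrive.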

@mingxunwang The format itself of the raw data (at least for Thermo, the only one I’m familiar with) doesn’t lend itself to distributed analyses. Maybe we can slice the binary file or index it to extract and convert data only for some spectra without converting the whole thing. Well we will see what we can do. I may get my ambitions scaled down by my colleagues.
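The "index it to extract only some spectra" idea is workable on the mzML side at least: indexed mzML files already carry per-spectrum byte offsets, and a minimal version of that index can be rebuilt by streaming the file once. A sketch on a toy file:

```python
# Sketch: build a byte-offset index of <spectrum ...> elements so one
# spectrum can be seeked to and sliced out without reading the whole file.
# (Real indexed mzML stores exactly these offsets in its <indexList>.)

def build_index(path):
    """One streaming pass: byte offset of each <spectrum ...> start tag."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            if b"<spectrum " in line:
                offsets.append(pos + line.index(b"<spectrum "))
            pos += len(line)
    return offsets

def read_spectrum(path, offsets, i):
    """Seek straight to spectrum i and return its element bytes."""
    end = offsets[i + 1] if i + 1 < len(offsets) else None
    with open(path, "rb") as f:
        f.seek(offsets[i])
        chunk = f.read() if end is None else f.read(end - offsets[i])
    return chunk.split(b"</spectrum>")[0] + b"</spectrum>"
```

Vendor raw formats would need a per-format reader for the same trick, which is exactly why they resist this kind of slicing.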
@Elendol Yeah, good luck haha, the distribution is hard in the native format. With regards to conversion, if you're running on any system with containers, Nextflow + MSConvert is kind of amazing and can distribute. That's one way I want to convert all the data. Side note: converting/summarizing all data, if it's running 24/7, can actually churn through more data than you think. We ran through 400 TB of proteomics/metabolomics data in a few weeks; not terrible.
@Elendol One final thing that I've been having a lot of fun with is converting the mass spec data into columnar data formats like Parquet or Arrow. That plus some out-of-core compute strategies really makes it super fast to access a ton of data, especially when paired with the flash ZFS arrays we've put together. If you're a company, totally affordable!
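The columnar idea can be shown with the standard library alone (Parquet/Arrow add compression, memory mapping, and a real type system on top): store each peak field in its own contiguous array, so a scan over one column never touches the others. A toy sketch:

```python
# Sketch: struct-of-arrays layout for (m/z, intensity) peak lists, the core
# trick behind columnar formats like Parquet and Arrow.
from array import array

class PeakColumns:
    """Each field lives in its own contiguous double array."""
    def __init__(self):
        self.mz = array("d")
        self.intensity = array("d")

    def append(self, mz, intensity):
        self.mz.append(mz)
        self.intensity.append(intensity)

    def max_intensity(self):
        # Scans only the intensity column; the m/z bytes are never read.
        return max(self.intensity)
```

With row-oriented storage (one record per spectrum peak), the same scan would drag every field through the cache; column-at-a-time access is what makes "query a ton of data" fast.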
@mingxunwang yes, actually that's a good idea. I already have most of my downstream analysis data in Arrow files; might as well get the MS data there too. We have tons of data, but not that much. Well, I have more convincing to do now 😅