Hardening macOS part 7: The Human Surface

You can harden the kernel and encrypt every byte, but the problem usually sits between the chair and the keyboard. In this chapter, I explore the hidden risks of metadata, the trap of social logins, and why your passphrase-less SSH keys are a standing invitation for a breach.

Read the full post here:
https://bytearchitect.io/macos-security/Hardening-macOS-part7-The-Human-Surface-and-Metadata-Risks/

#macOS #InfoSec #CyberSecurity #Metadata #SysAdmin #Privacy

Data citations are already in Crossref #metadata, they've just been difficult to locate among millions of references. Our new data citation API endpoint (beta) makes these connections easier to find and track. Learn more: https://doi.org/10.64000/rzbn5-wjy58

Are your books languishing in KDP ghost categories? Don't despair! Optimising your metadata is key. Think specific keywords and categories. What are readers *actually* searching for? Drill down into the niche! #KDP #Metadata #BookMarketing #SelfPublishing #SelfPub


🚀 The HMC Project Call 2026 is now open!

Working on challenges in #metadata, #FAIRdata, or research data infrastructure? This is your opportunity to turn your idea into a funded, interdisciplinary project within the Helmholtz community.

💡 Interdisciplinary projects welcome
🤝 Collaboration encouraged

Deadline: 6 July 2026

Learn more: https://helmholtz-metadaten.de/hmc-project-call-2026

#HMCproject

@HelmholtzImaging @HelmholtzOpenScienceOffice @helmholtz

Maastricht University Library is looking for a metadata specialist – IP | Vakblad voor informatieprofessionals

Welcome to Maastricht University! Do you have a sharp eye for structure and standards? And do you enjoy actively contributing to the future-proof information services of the University Library? […]

IP

Teams from the Crossref #Metadata Sprint 2026 in São Paulo present their projects live in English, Español & Português. 22 April · 15:00 UTC - 🔗 https://crossref.zoom.us/webinar/register/5017756028731/WN_V8OxcNeMRQ-CSLZXUlgWYQ

Navigating the European aquatic eDNA landscape: Opportunities for metadata standardisation and data mobilisation

#eDNA #standardization #Europe #metadata

https://mbmg.pensoft.net/article/173612/


Environmental DNA (eDNA) has emerged as a transformative tool for monitoring aquatic biodiversity, offering a non-invasive and highly sensitive approach to detecting organisms across diverse ecosystems. However, its effective downstream application across Europe in environmental management is hindered by inconsistencies in data standardisation, metadata reporting, and accessibility. This perspective comprehensively evaluates current data repositories, data submission workflows, and standardisation efforts within the European aquatic eDNA landscape. By employing a multi-method approach, including an inventory of eDNA databases, a metadata assessment, a stakeholder questionnaire, and a generative Artificial Intelligence (AI)-driven analysis of scientific literature, our findings reveal substantial variability in metadata reporting practices, with several areas misaligned with Findable, Accessible, Interoperable, and Reusable (FAIR) principles. While some repositories demonstrate strong data curation and accessibility, others lack essential metadata descriptors, limiting interoperability. We identify critical gaps in metadata submission, particularly concerning sampling methods and wet lab workflows, which heavily impact data reusability. The use of generative AI in this study further enabled large-scale identification of recurring reporting weaknesses, highlighting structural challenges that extend beyond individual studies. Addressing these gaps and leveraging advanced computational approaches through international standards and harmonised guidelines represents a clear way forward, as articulated in the recent “Making eDNA FAIR” paper by Takahashi et al. (2025), which is based on the use of Darwin Core (DwC) and Genomics Standards Consortium (GSC) MIxS standards, as well as Global Biodiversity Information Facility (GBIF)’s “Publishing DNA-derived data through biodiversity data platforms” guidelines. 
Furthermore, additional complementary principles strengthen this framework. The Collective benefit, Authority to control, Responsibility, Ethics (CARE) principles emphasise Indigenous data governance and responsible sample stewardship, while the Transparency, Responsibility, User focus, Sustainability, Technology (TRUST) principles provide criteria for repository reliability and long-term digital preservation. Together, the combined application of FAIR, CARE, and TRUST principles provides a structured foundation for ensuring robust, interoperable, and ethically managed eDNA data that support aquatic biodiversity research, management, and conservation across Europe.

Metabarcoding and Metagenomics

Integrating Paperless-NGX with my own PDF Renamer

You may remember that a couple of years ago I wrote a post about trying to find a better workflow to manage the waterfall of PDF documents that my wife and I receive on a regular basis: bank and credit card statements, electricity (and now gas), water, and phone service bills, invoices, and so on.

This became particularly important as we prepared to buy the house we now live in, and as we ended up moving out of London to a temporary flat, collecting a large number of contracts, agreements, final bills, first bills, and so on along the way. And that's to say nothing of having to produce years' worth of bank statements to confirm the provenance of funds.

I had already written and published my pdfrenamer tool, which parses the various documents we receive to figure out which service they come from, who the account holders are, and so on. To this day, this tool makes zero use of “AI” — I don’t actually think there’s a sensible way to mix computer-vision-based detection with the current deterministic design, so I’m not going to try.

What I did do, instead, was to build an integration between Paperless NGX (which, I apologize, I will likely improperly call Paperless a few times in this post and in the future anyway), which I’m using as my document management system, and the tool above, which I called, with zero creativity, flameeyes-paperless-automation. It started as a way to run the same renaming processes on an already classified and archived document, as well as to re-run them on all the stored documents, so that once I fixed a renamer the fixes would propagate to the existing documents. It has grown a bit since then.

Please note that none of the tools I’m writing around this count as either Software Engineering or creativity — I’m releasing them under the most obviously permissive license I could find, and I’ll be upfront that I’ve been experimenting with LLMs as CASE tools in both repositories. They are tools purpose-designed for my use case specifically — if they happen to match yours, great! If not, please don’t complain; I will take pull requests for features as long as they don’t affect my workflow.

This turned out to be a learning journey in more ways than one. While the renamer itself was originally developed for, and run on, Windows, I wanted to run the automation closer to the Paperless server, to avoid fetching every PDF twice over the network (NAS to Paperless server to desktop/laptop). Eventually this all became even more important once I upgraded the NAS: Paperless is now running as a virtual machine on the same hardware, using NFS over a (separate) virtio network to avoid hitting the physical network layer: while it’s far from zero-copy (the PDF is being passed between multiple virtual networks), it never leaves the actual host it’s running on.

Doing this work also allowed me to take a more thoughtful approach to the way pdfrenamer provides the additional details in a schematized format, so that I can streamline the document types in Paperless (this being a first-class feature of Paperless’s schema) and avoid similar, but not identical, names. It also allowed me to think through a few other details I can extract from PDF files easily, namely the account and document numbers (invoice number, bill number, etc.) This is useful and important because I found that our former mobile provider (O2 UK) would issue multiple bills with the same dates and the same account holder, since we had three lines with them (past tense, because they pulled a bad one, and since our new house is not covered by their network anyway, we just migrated to alternative providers — I’ll write that up later.)
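The kind of deterministic extraction involved can be sketched as a couple of regular expressions run over the text pulled out of a PDF. The patterns and field labels below are illustrative assumptions, not the actual pdfrenamer matchers, which are written per provider:

```python
import re

# Illustrative patterns only — real bills vary per provider, and a per-provider
# matcher would pin down the exact label wording instead of guessing like this.
ACCOUNT_RE = re.compile(r"Account\s+(?:number|no\.?):?\s*([A-Z0-9-]+)", re.IGNORECASE)
INVOICE_RE = re.compile(r"Invoice\s+(?:number|no\.?):?\s*([A-Z0-9-]+)", re.IGNORECASE)

def extract_ids(text: str) -> dict:
    """Pull account and invoice numbers out of extracted PDF text, if present."""
    account = ACCOUNT_RE.search(text)
    invoice = INVOICE_RE.search(text)
    return {
        "account": account.group(1) if account else None,
        "invoice": invoice.group(1) if invoice else None,
    }
```

Nothing clever: the point of the deterministic design is precisely that a matched pattern either fires or it doesn’t, with no probabilistic guessing in between.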

Note, though, that this is something I ended up having to extract myself. Just like I complained seventeen years ago, there is still no provider that embeds structured metadata in their PDFs. A few providers (particularly Octopus) at least identify the software used for creation in a way that is conducive to recognizing the documents, but there is no account- or document-level metadata provided. If you’re lucky, you can rely on the generation date matching the document date, but unfortunately even that is not a given, as sometimes the documents are generated on the fly. Unless they use iText.

Thankfully, Paperless thought this out well — you can add custom named fields as part of the schema, which are then indexed, so you can search by them. This means that instead of using a boatload of tags, like I was doing before, to distinguish whether a document related to me, my mother, or my wife (or a combination thereof), I can now search using the Account Holder field (and I have saved searches for it).

To avoid having to re-run the identification on hundreds of documents – yes, I’m a data hoarder; it’s a side effect of both having run my own business and having had to search through years of my parents’ paperwork back in Italy over time – I implemented a checkpointing feature, which meant I could simply set the script to run once an hour to process any new document added to the storage. That worked particularly well once I moved everything onto the NAS unit, as the virtual network is a lot faster than the physical gigabit network I had before.

This takes a moment to explain: why would I want the identification to run automatically, when I could run it every time I add documents to the storage? Well, I wanted a little bit more automation: since at least a few of the bills I receive monthly (Octopus Energy, AWS, Hetzner) arrive as email attachments, and Paperless can fetch documents attached to email messages via IMAP, I set up a few email aliases that forward to a self-hosted IMAP server (in addition to my personal address, or in a couple of cases the address shared with my wife), using Mailu.

Because I’m not quite happy with self-hosted products — and in particular I dealt with email servers long enough to know I don’t want to run more than the minimum I need — the IMAP server is not even accessible over the Internet: it’s behind Tailscale. To be honest, a lot of what I do nowadays is behind Tailscale plus authentication, maybe because I’m paranoid, and maybe because it’s so easy that the additional security layer doesn’t cause too much grief. It still does sometimes, but the upside outweighs the downside.

What this means is that, without me having to do anything, every month a number of documents just come into existence on my Paperless instance — and once a week they get synced over to my Google Drive, at least for the time being. Accessing Paperless on the go is still annoying at times, even with Tailscale working fine, so for now I’m keeping a copy of the processed, renamed files on Google Drive, managed by TrueNAS directly.

But what about the times I go and download all the various bank and credit card statements myself and drop them in the Paperless inbox folder (also on the NAS)? Well, it turns out Paperless has a few integration options. While it’s supposedly possible to run a specific script at the time a document is ingested, that didn’t feel particularly practical. Instead, you can set up a Workflow that calls a webhook (i.e. makes a GET or POST request to a specific URL — I still don’t understand why we ended up giving a name to this concept, but, I guess) every time a new document is ingested.

So the automation tool has an optional web server now — which I’m running on Docker on the same machine as Paperless. Whenever a new document is ingested (either through email, or the drop folder), it gets called, and then it fetches the actual PDF from Paperless to see if it can identify it through the usual deterministic extraction — as long as the document is not obviously a scan.
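On the receiving end, the web server only has to pull a document ID out of the webhook body before fetching the PDF from Paperless. As a sketch, assuming a {"doc_id": …} payload — an assumption, since Paperless-NGX lets you template the webhook body, so it has to be configured to match:

```python
def document_id_from_webhook(payload: dict):
    """Extract the newly ingested document's ID from a webhook payload.

    The {"doc_id": ...} shape is an assumed configuration, not a fixed
    Paperless-NGX format: the workflow's webhook body is user-defined.
    Returns the ID, or None if the payload doesn't carry a usable one.
    """
    doc_id = payload.get("doc_id")
    if isinstance(doc_id, int) and not isinstance(doc_id, bool) and doc_id > 0:
        return doc_id
    return None
```

With a valid ID in hand, the service can then fetch the document over the Paperless REST API and hand it to the deterministic extraction, rejecting malformed payloads early instead of failing mid-pipeline.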

That’s another important point. I currently have primarily three ways to add documents to Paperless: receiving them by email, dropping them on the inbox SMB share (write-only on the network), or… scanning them. Unfortunately, either the software I used for the longest time (PaperPort) or the Brother drivers have started fighting with Windows 11, and I couldn’t get my ADS-1100W scanner to scan through the app — previously, I would scan the document through my computer and immediately drop it onto the SMB share. Nowadays, I choose on the scanner whether I want a black-and-white or colour scan, and let it drop the result… into the FTP upload folder.

Yes, FTP, classic, unencrypted FTP. You have no idea how annoying it was to find a way to set this up in TrueNAS in such a way that the Brother scanner could write to it — the alternative would have been allowing SMB1 connections just for the drop folder, and I didn’t feel like doing that. It’s a working solution for the time being, but I would be lying if I didn’t say I’d love to find myself an AN335W which should have support for modern protocols and a lot more presets than the two I currently can select from. Maybe this year or next.

For those documents, the deterministic extraction is impossible, so I made sure the service would first identify if the document is a scanned document through the creator software metadata, and in that case not bother trying to process the file at all, instead putting it into the pile of scanned documents I go through every so often to sort through.
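A sketch of that gate: classify a document as a scan from the PDF creator metadata (with pypdf, the string would come from PdfReader(path).metadata.creator). The list of scanner software names is an assumption to adjust for your own hardware:

```python
# Assumed creator strings — check what your own scanner actually writes
# into the PDF /Creator field and adjust this tuple accordingly.
SCANNER_CREATORS = ("Brother", "ScanSnap", "NAPS2")

def is_scanned(creator) -> bool:
    """Heuristic gate: treat a document as a scan when the PDF creator
    metadata names known scanner software, so deterministic text
    extraction is skipped and the file goes to the manual-sorting pile."""
    if not creator:
        return False
    return any(creator.startswith(prefix) for prefix in SCANNER_CREATORS)
```

A missing creator string is treated as not-a-scan here, so the extraction still gets a chance to run; the opposite default would silently shunt every metadata-less bill into the manual pile.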

And that’s pretty much the state I’m at right now — Paperless NGX has proven itself to be more than a decent document management system. While it does have some issues here and there, particularly its dependency on Ghostscript, which makes it unable to process HMRC’s self-assessment statements (why? I don’t know!), it has plenty of features for organization (including a great integration with Tesseract OCR, which I believe includes unpaper, ironically) and a good set of extension points through its API and Workflows. Had I had this available when I wrote my old scan2pdf tool, I would 100% have wanted to integrate with it.

What does the future hold for my integrations? Almost certainly some Computer Vision model for document classification. While Paperless NGX attempts to extract document dates and learn document types, correspondents, and tags, these features appear to be rudimentary and based on the extracted OCR text — in my experience it’s very rare that they match. But I’m fairly sure that a modern Computer Vision approach (which would now be labelled “AI”, even though it’s not an LLM and quite unrelated to one) could be directed at extracting more reliable information.

The questions would be: how much refinement would that need, and would I be able to implement it myself? I can tell you already that for the latter, the answer is “no” — at least not without an “AI” (sigh) assistant, as even just the amount of theory needed to understand how that works is beyond my current skills, and I have enough things to work on and worry about that I wouldn’t be able to learn this. So this will likely be one of those tasks I’ll throw at Claude Code or something along those lines, and see how far it takes me — if it gets me something usable, yay me; if not, well, I’m no worse off than I was before trying (setting aside the subscription money, which, simply put, I’m writing down as a “cost of doing business” — or more precisely, a cost of wanting to have a career.)

#Archiving #Metadata #Paperless #PaperlessOffice #PDF
Documents Management (Searching For Bigfoot, Again)

Context for the title: a few years ago, I wrote a post about the paperless office, quoting Jim…

Flameeyes's Weblog
Looking forward to collaborating on #software #metadata with #ConnOSS https://connoss-project.github.io/ at this year's @de_rse collab workshop in Göttingen: https://events.hifis.net/event/3249/
ConnOSS