I'm interested in opinions... but I think I know what I should do.

I produce files that have a 64-bit ID generated by an STM32's RNG. This seems to do a reasonable job at being random (no collisions yet, in ~5k files), but I don't fully trust it, and 64-bit isn't that big. [it's likely that future hardware will have CSRNG, but the file IDs will probably remain 64-bit].
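To put a number on "64-bit isn't that big": a quick birthday-bound estimate, assuming the IDs really are uniform random (a sketch, not a measurement of the STM32's RNG):

```python
import math

def collision_probability(n: int, bits: int = 64) -> float:
    """Approximate P(at least one collision) among n uniform random
    IDs, using the standard birthday approximation:
    1 - exp(-n*(n-1) / (2 * 2**bits))."""
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))

print(collision_probability(5_000))      # ~6.8e-13 at the current ~5k files
print(collision_probability(5_000_000))  # ~6.8e-7 if the corpus grows 1000x
```

So at 5k files the odds of a collision are negligible if (and only if) the RNG is actually good; the risk grows quadratically with the file count.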

When handling this data, it's sometimes split across multiple files, which share that 64-bit ID... this and some other heuristics (like timestamps) allow you to confidently associate these parts into a whole.

For some years, the files have been uploaded, processed, poked and prodded - but they reside in a filesystem structure, and that was it. Accessible to me via utilities, but not accessible to others.

Until now, the files have been "uniquely" identified and referred to by that 64-bit ID. And the uniqueness has persisted.

Now that we're building a Web UI for better accessibility, the details are being brought into a database, which has a basic unique 'id' column.

How do I refer to these files going forward?... by the hopefully-unique 64-bit ID, or by the actually-unique database ID?

I like the 64-bit ID, because it's the "source of truth", and it's familiar... but I'm not confident (enough) that they'll remain unique over time.

I like the database ID, because it's guaranteed to be unique within the system... but I don't necessarily want to depend on lookups via the database.

I've considered adding a small metadata file that identifies the file's database ID and its other parts, so lookup can remain a "filesystem only" activity with either ID.

I've also considered using a standard 128-bit UUID that is generated and stored on the filesystem and in the database.

I don't know why this decision is being so problematic for me. 🫣

- Use the 64-bit hopefully-unique ID: 20%
- Use the database's actually-unique ID: 20%
- Use a 128-bit UUID: 50%
- Something else: 10%

Poll ended.
@attie just hash the file and use it as the id?

@katachora ... I quite like that idea!

There's more context here: the files are formed of 4 KiB chunks, and I use the SHA256 of the file's first chunk to store it in the filesystem (it's all we're guaranteed to have at first... files are uploaded progressively).

I'm much more confident that there won't be a collision here, due to what's in the files.
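For illustration, the first-chunk digest described above might look like this (the 4 KiB chunk size is from the post; the function name is hypothetical):

```python
import hashlib

CHUNK_SIZE = 4096  # files are formed of 4 KiB chunks


def first_chunk_digest(path: str) -> str:
    """SHA-256 of the file's first 4 KiB chunk - computable as soon as
    the first chunk of a progressive upload has landed, even before the
    rest of the file exists."""
    with open(path, "rb") as f:
        chunk = f.read(CHUNK_SIZE)
    return hashlib.sha256(chunk).hexdigest()
```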

@attie 32 random bits plus 32 bit timestamp?
@attie What are the consequences of a collision? If it happens once or twice per decade, would it cause major problems?
@mossmann You'd potentially reference the wrong file, or a file could "go missing" ... depending on implementation, it might be anywhere from "silent" through "confused" up to potentially "not prevent injury" (or worse, and difficult to judge when this becomes "cause" rather than "not prevent")...
@attie Sounds like that is sufficient justification to make a change away from the "hopefully unique" ID. I like the content hash suggestion or a UUID.

@mossmann Agreed!

I'm thinking I'll go for UUID - generated on insertion into the database, guaranteed unique by a database constraint, and then also stored next to the file in the filesystem, which provides consistency if the database content needs to be regenerated (i.e: read from file if it exists, or generate if it doesn't)
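A minimal sketch of that read-or-generate scheme, assuming a sidecar file next to the data (the ".uuid" naming is my invention, not a stated convention):

```python
import uuid
from pathlib import Path


def stable_uuid_for(data_path: Path) -> uuid.UUID:
    """Return the UUID stored next to the data file, generating and
    persisting one if it doesn't exist yet. This lets the database be
    regenerated from the filesystem without IDs changing."""
    sidecar = data_path.parent / (data_path.name + ".uuid")
    if sidecar.exists():
        return uuid.UUID(sidecar.read_text().strip())
    new_id = uuid.uuid4()
    sidecar.write_text(str(new_id) + "\n")
    return new_id
```

Repeated calls return the same UUID, so the sidecar - not the database row - is the source of truth.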

People are also somewhat familiar with UUIDs, whereas a "reasonable" hash (e.g. SHA256) is less familiar and larger.

@attie Is a database's unique ID stable if you change databases?
@penguin42 This is exactly one of my concerns!... I'm starting to think that the database becomes an integral part of this system though.
@attie Do you have access to any monotonically increasing values like timestamps or counters? If so, devoting some of the bits to one could mitigate collision risks. (It could be fairly coarse-grained.)
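The timestamp-plus-random layout suggested above could look something like this (the 32/32 split is illustrative; coarser timestamps would also work):

```python
import secrets
import time


def composite_id() -> int:
    """64-bit ID: high 32 bits are a coarse Unix timestamp (seconds),
    low 32 bits are CSPRNG output. A collision now requires two IDs
    generated in the same second AND a 32-bit random clash."""
    ts = int(time.time()) & 0xFFFFFFFF
    return (ts << 32) | secrets.randbits(32)
```

The trade-off is that the random portion shrinks to 32 bits, so this only helps if ID generation is spread out in time rather than bursty.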