Thought-provoking post on refactoring data rather than code https://jreyesr.github.io/posts/data-refactorings/ - does anyone know of a Fowler-style catalog of data refactorings? (Note: not looking for filter, select, group by, and other operations on data, but the equivalent of "extract method" or "introduce base class".)
Refactoring data as if it were code: a case for extending refactoring to static data - jreyesr's blog

In this article we extend the concept of refactorings, as used on source code, to data stored, for example, in YAML or JSON files. As an example, we extend the typical Rename Variable action to renaming a Kubernetes resource. We explore several scenarios that seem like data-focused variants of code refactorings, and why they may be useful. Then, we review some ways in which those data refactorings may be implemented, which tools could support them, and what the user experience might be like.

Note: several people have suggested that this is database normalization, but that's only part of it: some of it (I think) is closer to common data-cleaning steps, such as normalizing the names of enumeration elements. That's not really a DB normal form (I think?).
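To give a concrete flavor of the kind of refactoring the post describes, here is a minimal sketch of a data-side "Rename" (a simpler cousin of the post's Kubernetes example): renaming a field everywhere in a nested structure. The helper name `rename_key` and the sample document are illustrative, not from the post; plain dicts/lists stand in for parsed JSON or YAML.

```python
def rename_key(node, old, new):
    """Recursively rename a mapping key everywhere in a nested structure,
    returning a new structure and leaving the original untouched."""
    if isinstance(node, dict):
        return {(new if k == old else k): rename_key(v, old, new)
                for k, v in node.items()}
    if isinstance(node, list):
        return [rename_key(item, old, new) for item in node]
    return node  # scalars pass through unchanged

# Example: rename the "replicas" field across a (made-up) resource document.
doc = {"metadata": {"name": "web"}, "spec": {"replicas": 3}}
renamed = rename_key(doc, "replicas", "instances")
```

A real tool would also have to update references to the renamed field elsewhere (labels, selectors, other files), which is exactly what makes this a refactoring rather than a search-and-replace.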
@gvwilson there are a bunch of tools that automate checking certain kinds of data against a schema, and for longer-lived datasets I have defined invariants that are supposed to hold at all times. That kind of approach works pretty nicely in the same vein as TDD, but for data.
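The "invariants that hold at all times" idea above can be sketched with plain assertions; the records, field names, and rules here are made up for illustration, not from any particular tool.

```python
# Toy dataset: each record must have a unique id, a known status,
# and a score in [0, 1].
records = [
    {"id": 1, "status": "active", "score": 0.9},
    {"id": 2, "status": "retired", "score": 0.4},
]

ALLOWED_STATUSES = {"active", "retired"}

def check_invariants(rows):
    """Fail loudly if any invariant is violated -- a test suite over data."""
    ids = [r["id"] for r in rows]
    assert len(ids) == len(set(ids)), "ids must be unique"
    for r in rows:
        assert r["status"] in ALLOWED_STATUSES, f"bad status: {r['status']!r}"
        assert 0.0 <= r["score"] <= 1.0, f"score out of range: {r['score']}"

check_invariants(records)  # run at load time, in CI, or after every edit
```

Running such checks before and after a data refactoring is what makes the refactoring safe, much as a test suite does for an "extract method".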

@gvwilson I think I would call that database normalization.

https://en.m.wikipedia.org/wiki/Database_normalization

Database normalization - Wikipedia

@gvwilson Interesting post! It seems to be mostly focused on OOP / mapping formats, however.

I'm not aware of a catalog of data refactorings (besides DB de-/normalization), but something I've used in my heterogeneous pipelines is a data + representations approach. Since a given data structure is appropriate for certain types of processing, the aim was to expose different representations of the same underlying data depending on what's consuming it (e.g. tabular for relational, mappings for non-relational, algebraic (matrices and tensors) for numerical computing, small files for certain specialized workflows, etc.). So basically conversions, but a bit more principled.

Some representations are more redundant than others, so from the post's POV, I suppose a 'base class' would be a repr that is more compressed? Or with semantic structure? From this, other repr methods would generate a given format+structure (e.g. YAML) needed by a consumer. On the semantic side, I suppose one could look at all the operations specific to RDF, OWL, or other semantic web formalisms.

@gvwilson perhaps Scott Ambler’s and Pramod Sadalage’s Refactoring Databases would be in this area https://martinfowler.com/books/refactoringDatabases.html
Refactoring Databases

@gvwilson It seems like these might include the normal forms, the common performance-related denormalizations, and the various DDD value-object considerations.
@gvwilson it’s worth having a look at how the Oracle RDBMS implements “editions” (edition-based redefinition), a feature that allows a database server to expose a single schema in different versions to different client applications, decoupling the cadence of schema evolution from application evolution. The same technique can be built from views, triggers, etc. in databases like Postgres that don’t have this feature.
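The views-as-compatibility-layer technique mentioned above can be sketched with sqlite3 (chosen only for portability; the table and column names are made up, and with Postgres you would additionally use INSTEAD OF triggers to make the view writable):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# New schema version: the column was renamed to full_name.
con.execute("CREATE TABLE users_v2 (id INTEGER, full_name TEXT)")
con.execute("INSERT INTO users_v2 VALUES (1, 'Ada')")

# Compatibility view: old applications keep querying `users` with `name`.
con.execute("CREATE VIEW users AS SELECT id, full_name AS name FROM users_v2")

rows = con.execute("SELECT name FROM users WHERE id = 1").fetchall()
```

Old and new clients can now run side by side against the same database while the schema migrates underneath them, which is exactly the decoupling the editions feature provides natively.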
@gvwilson this, plus automated migration tools like Flyway and Liquibase, makes it easier (not easy, for sure) to refactor a database without breaking the applications that depend on it during migrations.