“To train the model, I assembled a dataset of 71k distinct fonts.”

Hmm, I wonder if those fonts were all open-licensed...

https://serce.me/posts/02-10-2023-hey-computer-make-me-a-font

Hey, Computer, Make Me a Font

This is a story of my journey learning to build generative ML models from scratch and teaching a computer to create fonts in the process.

@arrowtype My immediate thought too. And I fear things about both answers 😅
@jasonsantamaria ha, yes, same.
Publish dataset? · Issue #1 · SerCeMan/fontogen

Hello, is there info about the dataset used for training the model? It's pretty important for some use cases (like publishing games on steam) to be sure that all training data is licensed appropria...

GitHub
@klim @arrowtype @jasonsantamaria “I suggest we avoid discussing this specific conversation branch any further”. good grief.

@typographica @klim @jasonsantamaria “the discussions can easily become heated without leading to a productive outcome.”

i.e. “the discussions aren’t going to help with large-scale IP theft and laundering”

@arrowtype @typographica @klim @jasonsantamaria Oh good, I had a feeling *I* was the asshole. I mean, I still could be, but at least I'm not alone.
@simoncozens @typographica @klim @jasonsantamaria no, definitely, thank you for being articulate in defense of being thoughtful at the outset of such a project.
@arrowtype @simoncozens @typographica @klim @jasonsantamaria Yes thanks for speaking up. We just posted a response to the GitHub discussion also. https://github.com/SerCeMan/fontogen/issues/1#issuecomment-1747486006
@frerejonestype 🔥 Thank you for contributing to the discussion! I hadn’t even realized quite how thoroughly pirated the data was.
@arrowtype We figured there would be some heat from this, but from our perspective, it’s really important that as an industry we’re clear with developers what the legal boundaries are with using our copyrighted material in data sets, just as other creative industries are trying to do. Particularly with people who throw any consideration of this to the wind.
@frerejonestype yes, 100%. The best time to defend against AI infringement is before it gets deployed. As I understand it, once a GPT model gets trained, it is impossible to retroactively remove input data.
@arrowtype @frerejonestype One stolen font is piracy, thousands are a dataset.
@klim @arrowtype @frerejonestype nicely put! It seems to be the LLM motto, and it works for pictures and books too.
@klim @arrowtype @frerejonestype Stealing from one person is theft, stealing from everyone is the Third Reich