Moriel Schottlender


Principal Software Engineer @ MediaWiki Services Group, @wikimediafoundation, the non-profit that operates #Wikipedia.

Header image by Subhashish Panigrahi, CC BY-SA 3.0, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:W_for_Wikipedia-Wikipedia_buttons.jpg

Wikimedia Profile: https://meta.wikimedia.org/wiki/User:MSchottlender-WMF
Pronouns: She/her
This talk will present a case study of utilizing Domain-Driven Design methodologies to address the challenges of evolving Wikipedia's underlying system architecture. Wikipedia is a 23-year-old open-source monolith that serves billions of reads and millions of simultaneous writes, with a unique combination of dynamic user-generated content and workflows. With Moriel Schottlender: https://buff.ly/4a7HDjy
DDD Europe 2024 - Moriel Schottlender

Extremely excited to be at #dddeu 2024!

Come see me talk about Wikipedia and MediaWiki's Architecture and how we are looking to the future evolution of the system!

https://2024.dddeurope.com/program/evolving-wikipedia-a-case-study-in-applying-domain-driven-design-in-a-challenging-system/


Pretty decent explainer for how LLMs work, especially for folks not familiar with word vector techniques. https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
A jargon-free explanation of how AI large language models work

Want to really understand large language models? Here’s a gentle primer.

Ars Technica

@emergent There are actually a few reasons why database cells may need to be size-restricted (some have to do with the properties of the MySQL database, or with indexing) **but** in this specific case, I think there's a more significant issue:

At the scale #wikipedia operates, making changes to the DB schema can be a **very, very expensive operation**, especially when it touches a table as large as the edit/revision table.

You can read about this here:

https://wikitech.wikimedia.org/wiki/Schema_changes#Dangers_of_schema_changes


@smallsees

Yeah, this is one reason why #i18n of strings should ALWAYS come with a context of where that string is displayed.

When VisualEditor had just launched, we had an issue with the mobile version's "Publish" button. It was tiny in some languages, and huge/wide in others.

We had to rewrite our translation documentation to request that translators take into account a mobile view, and try to pick shorter words...

Supporting many languages is not easy (but is rewarding!)

@Shepard I added a correction to the post, though don't have enough char count space (IRONICALLY) to fully explain it :)

thanks!

@Shepard Okay, I see what you mean (I think)

This would mean that the other combination emoji are counting pairs + separator(byte?) + pairs +... etc which is still 8 or 11 "places"/counts.

... but it's not actual bytes, even though it effectively would count the same if it was bytes.

That's a fair point! Thanks!

@emergent

That's a good point, but that only works for things that are *actually* artificially limited.

For example, when you save an edit on Wikipedia, you add an "edit summary" -- that space is limited in size in the database table.

Considering our enormous data sizes, we cannot just allow arbitrary sizes for it.

We have to find ways to balance the need for a limit WHILE allowing for equity for non-Latin characters.
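
To make the equity problem concrete, here is a rough JavaScript sketch of how a byte-based limit treats scripts differently. It is not MediaWiki's actual code: the helper `truncateToBytes` and the 12-byte budget are made up for illustration, and it assumes the text is stored as UTF-8 and that the runtime provides `TextEncoder`.

```javascript
// Hypothetical helper (not MediaWiki code): keep whole code points
// until the UTF-8 byte budget is used up.
function truncateToBytes(str, maxBytes) {
  const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8
  let used = 0;
  let out = '';
  for (const ch of str) { // for...of walks code points, so surrogate pairs stay intact
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

// Same 12-byte budget, very different amounts of text:
const latin = truncateToBytes('Fixed a typo in the lead', 12);
const cyrillic = truncateToBytes('Исправлена опечатка', 12);

console.log(latin, [...latin].length);       // Fixed a typo 12
console.log(cyrillic, [...cyrillic].length); // Исправ 6
```

Under these assumptions the Latin summary keeps 12 characters and the Cyrillic one keeps 6, because each Cyrillic letter takes 2 bytes in UTF-8.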

@Shepard it might use UTF-16, but checking the ().length on each gives me
- '👩'.length = 2
- '👩‍👩‍👧'.length = 8
- '👩‍👩‍👧‍👦'.length = 11

... Which looks like it's byte counts.

Breaking 👩‍👩‍👧‍👦 down, it makes sense, too; the emoji is made of

- 1F469 WOMAN = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F469 WOMAN = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F467 GIRL = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F466 BOY = 2b

Which is overall = 11 bytes.

What am I missing?

(edit - see https://r12a.github.io/uniview/)
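
For anyone who wants to reproduce the breakdown, here is a minimal JavaScript sketch. It assumes a runtime with `TextEncoder` (modern browsers, Node.js), and the helper name `describeCodePoints` is made up for illustration. It suggests that `.length` is reporting UTF-16 code units rather than bytes: the per-code-point numbers match the 2/1 pattern in the list, while the UTF-8 byte count comes out larger.

```javascript
// WOMAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY, written with escapes
// so the source file's own encoding cannot get in the way.
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

// Hypothetical helper: one row per code point.
function describeCodePoints(str) {
  const encoder = new TextEncoder(); // encodes to UTF-8
  return Array.from(str).map((ch) => ({ // Array.from splits by code point
    codePoint: ch.codePointAt(0).toString(16).toUpperCase(),
    utf16Units: ch.length,                 // what .length counts
    utf8Bytes: encoder.encode(ch).length,  // actual bytes in UTF-8
  }));
}

const rows = describeCodePoints(family);
const totalUtf16 = rows.reduce((sum, r) => sum + r.utf16Units, 0);
const totalUtf8 = rows.reduce((sum, r) => sum + r.utf8Bytes, 0);

console.log(rows);
console.log(family.length, totalUtf16, totalUtf8); // 11 11 25
```

So the 11 is 7 code points spread over 11 UTF-16 code units; encoded as UTF-8 the same emoji would take 25 bytes (4 per person, 3 per joiner).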


@lambdatotoro it does, but you have also stumbled onto another point here - the inconsistency between platforms!

I posted my thread from the web. On Mastodon web, this emoji 👩‍👩‍👧‍👧 only reduced my char count by 1.

I'm answering you from Tusky, on Android, where 👩‍👩‍👧‍👧 is reducing the limit counter by 11.

Mastodon web counts this as a character.
Tusky on mobile counts this as 11 bytes.

I mean... They're both kinda right...
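
How either client actually implements its counter is not something this thread establishes, but as a sketch, 1 and 11 are exactly what you get from counting grapheme clusters versus UTF-16 code units. The JavaScript below assumes a runtime with `Intl.Segmenter`; both function names are made up for illustration.

```javascript
// WOMAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + GIRL
const twoGirlsFamily = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}';

// Hypothetical counter A: user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str)).length;
}

// Hypothetical counter B: UTF-16 code units, i.e. plain .length.
function countUtf16Units(str) {
  return str.length;
}

console.log(countGraphemes(twoGirlsFamily));  // 1
console.log(countUtf16Units(twoGirlsFamily)); // 11
```

Both answers are defensible; they just measure different things, which is why the same post can fit on one client and not on another.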