Moriel Schottlender


Principal Software Engineer @ MediaWiki Services Group, @wikimediafoundation, the non-profit that operates #Wikipedia.

Header image by Subhashish Panigrahi, CC BY-SA 3.0, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:W_for_Wikipedia-Wikipedia_buttons.jpg

Wikimedia Profile: https://meta.wikimedia.org/wiki/User:MSchottlender-WMF
Pronouns: She/her
This talk will present a case study of utilizing Domain-Driven Design methodologies to address the challenges of evolving Wikipedia's underlying system architecture. Wikipedia is a 23-year-old open-source monolith that serves billions of reads and millions of simultaneous writes, with a unique combination of dynamic user-generated content and workflows. With Moriel Schottlender: https://buff.ly/4a7HDjy
DDD Europe 2024 - Moriel Schottlender

Extremely excited to be at #dddeu 2024!

Come see me talk about Wikipedia and MediaWiki's Architecture and how we are looking to the future evolution of the system!

https://2024.dddeurope.com/program/evolving-wikipedia-a-case-study-in-applying-domain-driven-design-in-a-challenging-system/

Pretty decent explainer for how LLMs work, especially for folks not familiar with word vector techniques. https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
A jargon-free explanation of how AI large language models work

Want to really understand large language models? Here’s a gentle primer.

Ars Technica

🧵11/

So there you have it – character and word counters are a totally underestimated challenge in computing and on the web.

If you care about #inclusive products, you should really understand the impact of the choices you make counting those pesky characters…

/fin

🧵10/

On top of all of that, we’re also expanding our understanding of the impact of character and word counters when it comes to multilingual products.

The Wikimedia research team is doing some super cool research into multilingual readability scores that touches on how to figure out word counts in different languages.

(see https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research/Background_Research )

🧵9/

Another issue is user-facing. Char counters are inconsistent – so they can be very confusing to users, especially those writing in non-Latin scripts.

Most users don’t understand what a “byte” is when they type into an input.

Showing users a counter that may “jump” as they type is unhelpful and distracting.
So, we try not to.

VisualEditor doesn’t show the byte counter until it has to – when the text is approaching the limit. This means that most users at least don’t get distracted by this problem.
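
The "only show the counter near the limit" idea can be sketched roughly like this. This is a minimal illustration in Python, not VisualEditor's actual code; the function name `counter_label`, the `show_within` threshold, and its default of 20 bytes are all assumptions made up for the example.

```python
from typing import Optional


def counter_label(text: str, max_bytes: int, show_within: int = 20) -> Optional[str]:
    """Return the remaining-byte count to display, or None while far from the limit.

    Hypothetical helper: the names and the threshold are illustrative only.
    """
    remaining = max_bytes - len(text.encode("utf-8"))
    if remaining > show_within:
        return None  # plenty of room left: keep the counter hidden
    return str(remaining)  # close to (or past) the limit: show what is left


print(counter_label("hi", 255))        # far from the limit, nothing to show
print(counter_label("a" * 240, 255))   # 15 bytes left, counter appears
```

Hiding the counter until it matters means most users never see a number that "jumps" by two or three per keystroke.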

🧵8/

One relatively recent thing we did (in 2018) was replacing our `truncate` method with two others:

- truncateForDatabase - which counts bytes
- truncateForVisual - which counts characters

Technical contributors are encouraged to use `truncateForVisual` whenever possible to promote equity, and to use `truncateForDatabase` only when we absolutely have to cut information by byte size.

(see https://phabricator.wikimedia.org/T197492)

⚓ T197492 Deprecate and remove Language::truncate()
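
To make the bytes-versus-characters difference concrete, here is a small Python sketch. It is not MediaWiki's PHP implementation; `truncate_to_bytes` and `truncate_to_chars` are hypothetical helpers, and "characters" here means Unicode code points, which is still not the same thing as what a user sees as one character.

```python
def truncate_to_bytes(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a partial multi-byte sequence left at the cut point
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def truncate_to_chars(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars Unicode code points."""
    return text[:max_chars]


hello = "hello"                      # 5 characters, 5 bytes in UTF-8
shalom = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew "shalom": 4 characters, 8 bytes in UTF-8

print(truncate_to_bytes(hello, 5))   # fits entirely
print(truncate_to_bytes(shalom, 5))  # only the first 2 letters survive
print(truncate_to_chars(shalom, 5))  # all 4 letters survive
```

With the same limit of 5, the byte-based cut keeps the whole English word but only half of the Hebrew one, which is the equity problem the character-based variant avoids.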

🧵7/

So how do we deal with that at #Wikipedia?

The biggest thing is that WE CARE about this problem.
This might sound self-serving, but hear me out:

#Wikipedia’s mission is to enable anyone in the world to participate in the sum of all knowledge.

We support 400+ languages with billions of visits a month and an average of 345 edits per minute…

❗Other products’ “edge cases” are, very often, our use cases.❗

🧵6/

👩 is 1 emoji
👩‍👩‍👧 is 3 emojis joined into 1
👩‍👩‍👧‍👦 is 4 emojis joined into 1
The combined ones are connected with zero-width joiner (ZWJ) characters.

Mastodon counts each of the above as 1 character.
Twitter counts each of the above as 2 characters.

MDN and #javascript:
👩: 2 chars
👩‍👩‍👧: 8 chars
👩‍👩‍👧‍👦: 11 chars
…It seems to be counting the bytes, not the characters.

(Correction: JS counts UTF-16 code units, not code points or bytes. Each code unit is 16 bits, and characters outside the Basic Multilingual Plane, like most emoji, take two of them; see: https://en.wikipedia.org/wiki/UTF-16#Description)

… Joy!

(see https://r12a.github.io/uniview/)
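
The numbers above can be reproduced with a short Python sketch that counts the same string three ways. The helper name `count_units` is made up for this example; the UTF-16 figure is the one that matches what JavaScript's `length` reports.

```python
def count_units(text: str) -> dict:
    """Count one string as code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),                           # Python's len() counts code points
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per UTF-16 code unit
        "utf8_bytes": len(text.encode("utf-8")),
    }


ZWJ = "\u200d"  # zero-width joiner
WOMAN, GIRL, BOY = "\U0001F469", "\U0001F467", "\U0001F466"

samples = {
    "woman": WOMAN,
    "family of three": ZWJ.join([WOMAN, WOMAN, GIRL]),
    "family of four": ZWJ.join([WOMAN, WOMAN, GIRL, BOY]),
}

for name, text in samples.items():
    print(name, count_units(text))
# woman           -> 1 code point,  2 UTF-16 units,  4 UTF-8 bytes
# family of three -> 5 code points, 8 UTF-16 units, 18 UTF-8 bytes
# family of four  -> 7 code points, 11 UTF-16 units, 25 UTF-8 bytes
```

None of these three counts equals 1 for the family emoji, because none of them counts what a reader perceives as a single character (a grapheme cluster); that needs Unicode text segmentation, which Python's standard library does not provide.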

🧵5/

There’s a lot more to say about this, but I want to quickly touch on emoji.

Emoji completely break character counters in really fun ways.

A basic emoji 🙂 is counted almost everywhere as two characters. (but not Mastodon!)

But there are also "combined" emojis: those that look like 1 emoji but are actually several joined together with zero-width joiner characters, like this one: 👪

Mastodon considers it 1 char.
MDN and Twitter consider it 2 chars.

But wait, there’s more…