Dissecting the GPT-5 tokenizer: why Google needs 1 token where OpenAI needs 2
Results from an analysis of the GPT-5 tokenizer's 200,000 tokens: why Google needs 1 token where OpenAI needs 2, and the structural reason ChatGPT so often gets URLs wrong.
Morphemes vs. BPE: how linguistics speeds up language-model training
GPT-5.x splits the word "paratrooper" into par, atro, oper: three meaningless syllables. Your brain sees para- (beside), troop (a unit), -er (agent). The tokenizer sees nothing. BPE, the gold standard of tokenization since 2016, cuts text by frequency, not by meaning. And all the major models (GPT, Claude, Gemini, LLaMA) use exactly that. Several research groups tested what happens if you cut words along morphemes: roots, prefixes, suffixes. The results: +25% on LAMBADA, convergence twice as fast, and a model at 200k training steps catching up to GPT-2 Large, which is 6 times larger. The article walks through three approaches (MorphBPE, MorphPiece, Unigram + morphology), concrete numbers, the limitations (which the authors prefer to keep out of the headlines), and links so you can try it yourself.
https://habr.com/ru/articles/993768/
#BPE #tokenization #morphemes #language_models #NLP #linguistics #GPT #LLaMA #transformers
The National Council on Privatisation (NCP), on Thursday, approved the request by the Bureau of Public Enterprise (BPE) to follow through on the engagement with Transcorp Power Consortium.
https://dmarketforces.com/ncp-to-regularise-sale-of-power-plant-to-transcorp-consortium/
Building a BPE tokenizer from scratch: optimization and experiments! 🚀 The author sped up training 50x, made inference 3.7x faster (Rust), and experimented with GPT-2 pre-training using the custom tokenizer. Source code, notes, and a detailed readme are all on GitHub!
#BPE #Tokenizer #MachineLearning #NLP #Vietnamese #Programming #AI #NaturalLanguageProcessing
https://www.reddit.com/r/LocalLLaMA/comments/1o18yl8/building_a_bpe_tokenizer_from_scratch/
⚠️ Vulnerability Report
=======================
🎯 AI
Executive summary: New analysis highlights that emojis and uncommon
Unicode byte sequences can cause brittle behavior in large language
models by producing unexpected tokenization outputs under Byte-Pair
Encoding (BPE) or similar tokenizers. This is an operational security
concern for any pipeline that accepts user text and relies on
deterministic token boundaries.
Technical details:
• Tokenizers relying on BPE or byte-level vocabularies split input
into subword units; multi-byte Unicode characters (for example emoji
or combined sequences) may be tokenized as rare or out-of-vocabulary
byte patterns.
• Rare or unseen byte sequences can create token fragmentation (many
short tokens) or produce tokens that map to semantically different
vectors, altering model context and generation.
• Edge cases include surrogate pairs, zero-width joiners, skin-tone
modifiers, and compound emoji sequences that change byte alignment.
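The byte-alignment point above can be sketched with the stdlib alone. This is a hypothetical byte-level view, not any specific production tokenizer: it shows how one compound emoji glyph expands into many UTF-8 bytes that a byte-level vocabulary must stitch back together.

```python
# How a compound emoji looks at the byte level (Python stdlib only).
# "Family" ZWJ sequence: woman + ZWJ + woman + ZWJ + girl
# renders as 1 glyph but is 5 code points.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"

print(len(family))       # code points: 5
raw = family.encode("utf-8")
print(len(raw))          # UTF-8 bytes: 18
# A byte-level vocabulary starts from these 18 bytes; if the merged
# sequence is rare in training data, the model sees many short tokens
# instead of one stable unit.
print(list(raw)[:4])     # bytes of the first code point: [240, 159, 145, 169]
```

Dropping or reordering any of those bytes (for example by stripping the zero-width joiner) changes the token boundaries for the whole sequence.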
Analysis and impact:
• Downstream effects include unintended prompt truncation, semantic
drift, and increased susceptibility to adversarial inputs that
leverage token boundary manipulation.
• Attackers can craft inputs that force models into degraded contexts,
leak system prompts through context misalignment, or trigger unsafe
completions by exploiting tokenization mismatches.
Detection:
• Monitor token length distributions against character lengths to
detect anomalies where the token count balloons relative to the
character count.
• Instrument preprocessing logs to capture unusual byte-sequence
frequencies and new tokens entering the embedding table.
• Use synthetic test suites that include emoji variants, combining
characters, and long multi-byte sequences.
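The token/character-ratio check above can be sketched as follows. The toy tokenizer (ASCII words kept whole, everything else falling back to one token per UTF-8 byte) stands in for a real byte-level BPE, and the 1.5 threshold is an illustrative assumption, not a recommended production value.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: ASCII words stay whole; anything else falls
    back to one token per UTF-8 byte, mimicking byte-level BPE on
    unseen sequences."""
    tokens: list[str] = []
    for word in text.split():
        if word.isascii():
            tokens.append(word)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens

def looks_anomalous(text: str, max_ratio: float = 1.5) -> bool:
    """Flag inputs whose token count balloons relative to character count."""
    if not text:
        return False
    return len(toy_tokenize(text)) / len(text) > max_ratio

print(looks_anomalous("hello world"))   # ordinary prose: False
# Three "family" ZWJ sequences: 15 code points, 54 byte tokens.
print(looks_anomalous("\U0001F469\u200D\U0001F469\u200D\U0001F467" * 3))  # True
```

In a real pipeline the same ratio would be computed from the production tokenizer's output and logged per request, so that a drift in the distribution is visible before it affects generation.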
Mitigation:
• Implement Unicode normalization (NFC/NFKC) in preprocessing and
strip or canonicalize zero-width joiners where appropriate.
• Expand tokenizer training data with diverse emoji and multi-byte
sequences, or use byte-level tokenizers robust to unseen sequences.
• Add input sanitation layers that flag or constrain user-supplied
content with high token/character ratios and apply rate limits or
transformation policies.
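The normalization step can be sketched with the stdlib `unicodedata` module. ZWJ stripping is shown unconditionally here for brevity; in practice you would keep joiners inside emoji sequences you intend to support.

```python
import unicodedata

ZWJ = "\u200D"  # zero-width joiner

def canonicalize(text: str, strip_zwj: bool = True) -> str:
    """Apply NFC normalization and optionally drop zero-width joiners
    before the text reaches the tokenizer."""
    text = unicodedata.normalize("NFC", text)
    if strip_zwj:
        text = text.replace(ZWJ, "")
    return text

# "e" + combining acute accent collapses to the single precomposed "é":
decomposed = "e\u0301"
print(len(decomposed), len(canonicalize(decomposed)))  # 2 1
```

NFKC would fold compatibility characters (full-width forms, ligatures) as well, which is stricter but can change the visible text; NFC is the safer default when the output is shown back to users.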
References / notes:
• This is a tokenizer-level robustness issue rather than a single
CVE-class vulnerability; mitigations focus on preprocessing, tokenizer
coverage, and monitoring.
🔹 #llm_security #tokenization #BPE #unicode #adversarial_ml
🔗 Source: https://infosecwriteups.com/the-emoji-that-broke-the-ai-into-27-pieces-a6ab1e1c551b
The Bureau of Public Enterprises (BPE) has announced plans to list two electricity Distribution Companies (DisCos) and one Generation Company (GenCo) on the Nigerian Stock Exchange through an Initial Public Offering (IPO).
https://dmarketforces.com/bpe-to-list-discos-genco-on-stock-market/
Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
https://ndingwall.github.io/blog/tokenization
#HackerNews #Tokenization #LanguageModeling #BPE #Unigram #NLP
Tokenizers used by the best-performing language models (Bert, GPT-2, etc.) poorly reflect the morphology of English text. I had hoped to use some quarantine time to design one that more closely aligns to relationships between wordforms. But Kaj Bostrom and Greg Durrett beat me to it and so this blog post materialized instead. I add some additional motivation, evaluate both methods against ‘gold standard’ tokenizations, and speculate about what might come next.