Dissecting the GPT-5 tokenizer: why Google needs 1 token where OpenAI needs 2
Results from an analysis of the GPT-5 tokenizer's 200,000 tokens: why Google needs 1 token where OpenAI needs 2, and the structural reason ChatGPT so often gets URLs wrong.
Morphemes vs. BPE: how linguistics speeds up language-model training
GPT-5.x splits the word "paratrooper" into par, atro, oper: three meaningless syllables. Your brain sees para- (beside), troop (a unit), -er (agent). The tokenizer sees nothing. BPE, the gold standard of tokenization since 2016, cuts text by frequency, not by meaning. And all the major models (GPT, Claude, Gemini, LLaMA) use exactly that. Several research groups tested what happens if you cut words along morphemes: roots, prefixes, suffixes. The results: +25% on LAMBADA, convergence twice as fast, and a model at 200k training steps catching up to GPT-2 Large, which is 6 times larger. The article walks through three approaches (MorphBPE, MorphPiece, Unigram + morphology), concrete numbers, the limitations (which the authors prefer to keep out of the headlines), and links so you can try it yourself.
https://habr.com/ru/articles/993768/
#BPE #tokenization #morphemes #language_models #NLP #linguistics #GPT #LLaMA #transformers
The National Council on Privatisation (NCP), on Thursday, approved the request by the Bureau of Public Enterprise (BPE) to follow through on the engagement with Transcorp Power Consortium.
https://dmarketforces.com/ncp-to-regularise-sale-of-power-plant-to-transcorp-consortium/
Building a BPE tokenizer from scratch: optimization and experiments! 🚀 The author sped up training 50x, made inference 3.7x faster (Rust), and experimented with GPT-2 pre-training using the custom tokenizer. Source code, notes, and a detailed readme are all on GitHub!
#BPE #Tokenizer #MachineLearning #NLP #Vietnamese #Programming #AI #NaturalLanguageProcessing
https://www.reddit.com/r/LocalLLaMA/comments/1o18yl8/building_a_bpe_tokenizer_from_scratch/
⚠️ Vulnerability Report
=======================
🎯 AI
Executive summary: New analysis highlights that emojis and uncommon
Unicode byte sequences can cause brittle behavior in large language
models by producing unexpected tokenization outputs under Byte-Pair
Encoding (BPE) or similar tokenizers. This is an operational security
concern for any pipeline that accepts user text and relies on
deterministic token boundaries.
Technical details:
• Tokenizers relying on BPE or byte-level vocabularies split input
into subword units; multi-byte Unicode characters (for example emoji
or combined sequences) may be tokenized as rare or out-of-vocabulary
byte patterns.
• Rare or unseen byte sequences can create token fragmentation (many
short tokens) or produce tokens that map to semantically different
vectors, altering model context and generation.
• Edge cases include surrogate pairs, zero-width joiners, skin-tone
modifiers, and compound emoji sequences that change byte alignment.
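The byte-alignment point above can be sketched with the stdlib alone. This is a hypothetical byte-level view, not any specific production tokenizer: it shows how one compound emoji glyph expands into many UTF-8 bytes that a byte-level vocabulary must stitch back together.

```python
# How a compound emoji looks at the byte level (Python stdlib only).
# "Family" ZWJ sequence: woman + ZWJ + woman + ZWJ + girl
# renders as 1 glyph but is 5 code points.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"

print(len(family))       # code points: 5
raw = family.encode("utf-8")
print(len(raw))          # UTF-8 bytes: 18
# A byte-level vocabulary starts from these 18 bytes; if the merged
# sequence is rare in training data, the model sees many short tokens
# instead of one stable unit.
print(list(raw)[:4])     # bytes of the first code point: [240, 159, 145, 169]
```

Dropping or reordering any of those bytes (for example by stripping the zero-width joiner) changes the token boundaries for the whole sequence.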
Analysis and impact:
• Downstream effects include unintended prompt truncation, semantic
drift, and increased susceptibility to adversarial inputs that
leverage token boundary manipulation.
• Attackers can craft inputs that force models into degraded contexts,
leak system prompts through context misalignment, or trigger unsafe
completions by exploiting tokenization mismatches.
Detection:
• Monitor token length distributions against character lengths to
detect anomalies where the token count balloons relative to the
character count.
• Instrument preprocessing logs to capture unusual byte-sequence
frequencies and new tokens entering the embedding table.
• Use synthetic test suites that include emoji variants, combining
characters, and long multi-byte sequences.
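The token/character-ratio check above can be sketched as follows. The toy tokenizer (ASCII words kept whole, everything else falling back to one token per UTF-8 byte) stands in for a real byte-level BPE, and the 1.5 threshold is an illustrative assumption, not a recommended production value.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: ASCII words stay whole; anything else falls
    back to one token per UTF-8 byte, mimicking byte-level BPE on
    unseen sequences."""
    tokens: list[str] = []
    for word in text.split():
        if word.isascii():
            tokens.append(word)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens

def looks_anomalous(text: str, max_ratio: float = 1.5) -> bool:
    """Flag inputs whose token count balloons relative to character count."""
    if not text:
        return False
    return len(toy_tokenize(text)) / len(text) > max_ratio

print(looks_anomalous("hello world"))   # ordinary prose: False
# Three "family" ZWJ sequences: 15 code points, 54 byte tokens.
print(looks_anomalous("\U0001F469\u200D\U0001F469\u200D\U0001F467" * 3))  # True
```

In a real pipeline the same ratio would be computed from the production tokenizer's output and logged per request, so that a drift in the distribution is visible before it affects generation.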
Mitigation:
• Implement Unicode normalization (NFC/NFKC) in preprocessing and
strip or canonicalize zero-width joiners where appropriate.
• Expand tokenizer training data with diverse emoji and multi-byte
sequences, or use byte-level tokenizers robust to unseen sequences.
• Add input sanitation layers that flag or constrain user-supplied
content with high token/character ratios and apply rate limits or
transformation policies.
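The normalization step can be sketched with the stdlib `unicodedata` module. ZWJ stripping is shown unconditionally here for brevity; in practice you would keep joiners inside emoji sequences you intend to support.

```python
import unicodedata

ZWJ = "\u200D"  # zero-width joiner

def canonicalize(text: str, strip_zwj: bool = True) -> str:
    """Apply NFC normalization and optionally drop zero-width joiners
    before the text reaches the tokenizer."""
    text = unicodedata.normalize("NFC", text)
    if strip_zwj:
        text = text.replace(ZWJ, "")
    return text

# "e" + combining acute accent collapses to the single precomposed "é":
decomposed = "e\u0301"
print(len(decomposed), len(canonicalize(decomposed)))  # 2 1
```

NFKC would fold compatibility characters (full-width forms, ligatures) as well, which is stricter but can change the visible text; NFC is the safer default when the output is shown back to users.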
References / notes:
• This is a tokenizer-level robustness issue rather than a single
CVE-class vulnerability; mitigations focus on preprocessing, tokenizer
coverage, and monitoring.
🔹 #llm_security #tokenization #BPE #unicode #adversarial_ml
🔗 Source: https://infosecwriteups.com/the-emoji-that-broke-the-ai-into-27-pieces-a6ab1e1c551b
The Bureau of Public Enterprises (BPE) has announced plans to list two electricity Distribution Companies (DisCos) and one Generation Company (GenCo) on the Nigerian Stock Exchange through an Initial Public Offering (IPO).
https://dmarketforces.com/bpe-to-list-discos-genco-on-stock-market/
Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
https://ndingwall.github.io/blog/tokenization
#HackerNews #Tokenization #LanguageModeling #BPE #Unigram #NLP
Tokenizers used by the best-performing language models (Bert, GPT-2, etc.) poorly reflect the morphology of English text. I had hoped to use some quarantine time to design one that more closely aligns to relationships between wordforms. But Kaj Bostrom and Greg Durrett beat me to it and so this blog post materialized instead. I add some additional motivation, evaluate both methods against ‘gold standard’ tokenizations, and speculate about what might come next.