# Preparing a sentence dataset from Wikipedia
I'm excited to announce two new resources for natural language processing researchers and developers:

* wikisentences - A Rust-based tool for extracting sentence datasets from Wikipedia dumps in any language
* ml-wiki-sentences - A dataset of 2.25 million Malayalam sentences extracted from Wikipedia, prepared using the above tool and now available on HuggingFace

## The wikisentences tool

The wikisentences project provides a complete pipeline for creating sentence datasets from Wikipedia content.

### Core technology

* wiki-html-text-extractor (Rust) - uses tree-sitter-html to parse article HTML and extract clean plain text
* sentencex (Rust) - handles accurate sentence segmentation across languages. See my recent article about this library.

### Four-stage pipeline

1. Download enterprise HTML dumps from Wikimedia. (There are no recent HTML dumps of Wikipedia, except this one-year-old dump.)
2. Convert the JSON dumps to Parquet format (id, name, url, language, html)
3. Extract plain text from the HTML (id, url, name, text)
4. Segment the text into sentences (id, url, name, sentence, sentence_index)

Each stage is handled by a separate Python script, with the heavy lifting done by efficient Rust binaries. The pipeline is designed to be memory-efficient, streaming data between stages without writing intermediate files to disk.
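The streaming design can be sketched in Python: stages are chained with OS pipes, so each one reads the previous stage's stdout and nothing is written to disk between them. This is a minimal illustration, not the actual wikisentences scripts; the demo uses small Python one-liners as stand-ins for the Rust binaries.

```python
import subprocess
import sys

def stream_pipeline(commands, data: bytes) -> bytes:
    """Chain commands with OS pipes so each stage streams into the next
    and no intermediate file is written to disk."""
    procs = []
    for i, cmd in enumerate(commands):
        stdin = subprocess.PIPE if i == 0 else procs[-1].stdout
        procs.append(subprocess.Popen(cmd, stdin=stdin, stdout=subprocess.PIPE))
    # Close the parent's copies of the intermediate read ends; the
    # downstream children have already inherited them.
    for p in procs[:-1]:
        p.stdout.close()
    # Feed the input to the first stage and close it so the chain can finish.
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    out = procs[-1].stdout.read()
    for p in procs:
        p.wait()
    return out

# Demo with portable stand-ins for the Rust binaries:
# "extract" uppercases the text, "segment" puts each word on its own line.
extract = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]
segment = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().replace(' ', chr(10)))"]
print(stream_pipeline([extract, segment], b"one two three").decode())
```

In the real pipeline the commands would be the extractor and segmenter binaries, and the data volume (a full Wikipedia dump) is exactly why streaming beats materializing each stage's output.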
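To make the final stage's schema concrete, here is a sketch of how one article's text becomes rows of (id, url, name, sentence, sentence_index). A naive regex splitter stands in for the sentencex segmenter, and all identifiers and values are illustrative.

```python
import re

def to_sentence_rows(article_id, url, name, text):
    """Build rows in the final stage's output schema:
    (id, url, name, sentence, sentence_index).
    The regex splitter below is a crude stand-in for sentencex."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        {"id": article_id, "url": url, "name": name,
         "sentence": s, "sentence_index": i}
        for i, s in enumerate(sentences)
    ]

# Hypothetical article record:
rows = to_sentence_rows(42, "https://example.org/wiki/Article", "Article",
                        "First sentence. Second one! Third?")
```

Keeping the article id, url, and name on every row lets each sentence in the released dataset be traced back to its source article.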