Yesterday, we published an English – Kabyle Parallel Corpus.

It contains 130 883 aligned sentence pairs extracted from our own contributions and those of the community in the Tatoeba database.

The corpus is aligned pairwise by sentence ID (en-kab).
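To illustrate what alignment by sentence ID means, here is a minimal sketch. It assumes the tab-separated layout of the public Tatoeba exports (a sentences file with `id`, `lang`, `text` and a links file with `sentence_id`, `translation_id`) and the language codes `eng` and `kab`. The function name `align_pairs` is hypothetical, and this is not necessarily how kabyle-nlp-toolkit does it.

```python
def align_pairs(sentences_tsv, links_tsv, src="eng", tgt="kab"):
    """Return sorted (src_id, tgt_id, src_text, tgt_text) tuples.

    Assumed input layout (Tatoeba-style exports, tab-separated):
      sentences: id <TAB> lang <TAB> text
      links:     sentence_id <TAB> translation_id
    """
    # Index only the sentences in the two languages we care about.
    by_id = {}
    for line in sentences_tsv.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip malformed or empty lines
        sid, lang, text = parts[0], parts[1], parts[2]
        if lang in (src, tgt):
            by_id[int(sid)] = (lang, text)

    # Links are listed in both directions; keeping only src -> tgt
    # avoids emitting every pair twice.
    pairs = set()
    for line in links_tsv.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        a, b = int(parts[0]), int(parts[1])
        if a in by_id and b in by_id:
            if by_id[a][0] == src and by_id[b][0] == tgt:
                pairs.add((a, b, by_id[a][1], by_id[b][1]))
    return sorted(pairs)
```

Each output row keeps both sentence IDs, so a pair can always be traced back to its source sentences.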

By number of sentences, Kabyle ranks 5th on Tatoeba, with 772 002 submitted sentences (as of September 14th, 2025).

The dataset will be updated from time to time on Hugging Face:

HF dataset: https://huggingface.co/datasets/Imsidag-community/english-kabyle-parallel

#dataset #kabyle #taqbaylit


We used a basic fixer and normalizer to correct some words and sentences.

The fixer/normalizer corrects some errors and maps some character sequences to the standardized Kabyle alphabet published by CLDR/Unicode.
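As a rough idea of what such a normalizer can look like, here is a minimal sketch. The mapping table is purely illustrative and an assumption on our side here: it replaces Greek look-alike letters with the Latin ɛ and ɣ, and composes base letters with combining marks via NFC. The actual rules in the toolkit may differ, and `normalize_kabyle` is a hypothetical name.

```python
import unicodedata

# Illustrative (hypothetical) look-alike table: Greek letters that are
# sometimes typed instead of the Latin letters used for Kabyle.
LOOKALIKES = str.maketrans({
    "\u03b5": "\u025b",  # Greek small epsilon  -> Latin small open e (ɛ)
    "\u03b3": "\u0263",  # Greek small gamma    -> Latin small gamma (ɣ)
    "\u0393": "\u0194",  # Greek capital gamma  -> Latin capital gamma (Ɣ)
})

def normalize_kabyle(text):
    """Compose combining sequences (NFC), then replace look-alike letters."""
    # NFC turns e.g. "d" + U+0323 (combining dot below) into the single
    # precomposed character U+1E0D (ḍ).
    text = unicodedata.normalize("NFC", text)
    return text.translate(LOOKALIKES)
```

Applying the function twice gives the same result as applying it once, which makes it safe to re-run on an already cleaned corpus.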

The tool used to download and align the corpus is kabyle-nlp-toolkit: https://github.com/BoFFire/kabyle-nlp-toolkit


Expect more parallel `en-kab` corpora thanks to the `translation-memory-tools` project built by @softcatala with

We customized it locally for the Kabyle language, and it works: we are able to build a Kabyle translation memory from all the translations we have submitted over the years while translating FLOSS software and projects.

Link: https://github.com/Softcatala/translation-memory-tools
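For readers unfamiliar with translation memories: they are commonly exchanged as TMX files. The sketch below shows how a list of (English, Kabyle) string pairs could be written as a small TMX 1.4 document with the Python standard library. It does not describe the internals of `translation-memory-tools`; the function name `build_tmx` and the header values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the "xml:lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src="en", tgt="kab"):
    """Build a minimal TMX 1.4 document from (source, target) string pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values below are placeholders chosen for this example.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "po",
        "adminlang": "en",
        "srclang": src,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # Skip untranslated or empty entries.
        if not source.strip() or not target.strip():
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src, source), (tgt, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Calling `build_tmx([("Hello", "Azul")])` returns the document as a string, which can then be written to a `.tmx` file.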
