Macedonian Corpus 🇲🇰

This repository contains scripts, notebooks, and documentation to process, clean, and manage the Macedonian Corpus Raw.


🌟 Key Highlights

  • 📚 First consolidated Macedonian Corpus for NLP research.
  • 📊 Includes 3 versions of the corpus:
    • Raw: 37.6 GB, 3.53 billion words.
    • Cleaned: 35.5 GB, 3.31 billion words (filtered for quality).
    • Cleaned + Deduplicated: 16.78 GB, 1.47 billion words (high-quality, minimal redundancy).
  • 🚀 Enables pretraining/fine-tuning LLMs, machine translation, and linguistic analysis.
  • 🛠️ Built with state-of-the-art filtering and deduplication techniques.

Raw

| Origin | Size (GB) | Words (B) | Percentage |
|---|---|---|---|
| HPLT | 15.85 | 1.49 | 42.21% |
| HuggingFace (fineweb-2) | 14.21 | 1.33 | 37.66% |
| CLARIN (MaCoCu-mk 2.0) | 5.20 | 0.49 | 13.92% |
| Wikipedia | 0.78 | 0.07 | 1.96% |
| Other (MMORE) | 1.48 | 0.14 | 4.07% |
| Common Voice | 0.02 | 0.0018 | 0.05% |
| SETimes Corpus | 0.06 | 0.0044 | 0.13% |
| Total | 37.60 | 3.53 | 100.00% |

Cleaned

| Origin | Size (GB) | Words (B) | Percentage |
|---|---|---|---|
| HPLT | 15.51 | 1.45 | 43.72% |
| HuggingFace (fineweb-2) | 14.13 | 1.31 | 39.62% |
| CLARIN (MaCoCu-mk 2.0) | 5.14 | 0.48 | 14.57% |
| Wikipedia | 0.64 | 0.06 | 1.78% |
| Other (MMORE) | 0.04 | 0.004 | 0.12% |
| Common Voice | 0.02 | 0.002 | 0.05% |
| SETimes Corpus | 0.06 | 0.004 | 0.13% |
| Total | 35.54 | 3.31 | 100.00% |

Cleaned + Deduplicated

| Origin | Size (GB) | Words (B) | Percentage |
|---|---|---|---|
| HuggingFace (fineweb-2) | 7.85 | 0.73 | 49.55% |
| HPLT | 5.80 | 0.54 | 36.87% |
| CLARIN (MaCoCu-mk 2.0) | 1.94 | 0.18 | 12.39% |
| Wikipedia | 0.13 | 0.01 | 0.83% |
| Other (MMORE) | 0.04 | 0.004 | 0.25% |
| Common Voice | 0.02 | 0.002 | 0.12% |
| Total | 16.78 | 1.47 | 100.00% |
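
For quick experiments, any of the three versions can be loaded with the Hugging Face datasets library. The sketch below is a minimal, hedged example: the dataset ID "LVSTCK/macedonian-corpus-raw" and the record layout are assumptions based on this README, so check the dataset card on HuggingFace for the exact IDs and schema before running.

```python
# Minimal sketch: streaming the raw corpus from the Hugging Face Hub.
# The dataset ID and field names are assumptions -- verify them on the dataset card.
from datasets import load_dataset

# Streaming avoids downloading the full ~37.6 GB at once.
dataset = load_dataset("LVSTCK/macedonian-corpus-raw", split="train", streaming=True)

for i, example in enumerate(dataset):
    print(example)  # each record is expected to contain the document text and its source
    if i == 2:
        break
```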

📚 Dataset Sources

The corpus is built by collecting and processing data from the following sources:

| Source | Notes | Origin |
|---|---|---|
| UKIM | Books and dissertations from various topics | UKIM Digital Library, UKIM Repository |
| Wikipedia (MK) | Macedonian Wikipedia dump | Wikipedia |
| MANU | Various publications from MANU | MANU |
| HuggingFace (fineweb-2) | Macedonian subset of FineWeb-2 (mkd_Cyrl) | Hugging Face |
| Common Voice (MK) | Macedonian sentences from the Common Voice dataset | Common Voice |
| CLARIN MaCoCu-mk 2.0 | Web-crawled Macedonian texts | CLARIN |
| UKLO | Resources from UKLO | UKLO |
| UGD | Resources from UGD | UGD |
| SETimes Corpus (MK-EN) | Macedonian-English parallel corpus (only MK sentences used) | SETimes |
| HPLT (MK) | Macedonian subset of HPLT | HPLT |
| Institute of Macedonian Language | Resources from the Institute of Macedonian Language "Krste Misirkov" | IMJ |
| Official PE Gazette of North Macedonia | Official Gazette of North Macedonia | slvesnik |

📋 Overview

This folder contains the primary scripts for downloading, filtering, and preparing the clean version of the corpus (input: raw corpus, output: cleaned corpus).

  • 🧹 filter.py

    • Purpose: Produces a cleaned version of the dataset (filtering process inspired by fineweb-2); a sketch of what such a pipeline can look like appears after this list.
    • Features:
      • C4-like filtering (removing irrelevant lines, low-quality text, and placeholder content).
      • Gopher-like filtering (handling incomplete or overly repetitive documents).
      • High-confidence language detection for Macedonian text.
      • Sentence deduplication to avoid redundancy.
      • Personally Identifiable Information (PII) filtering.
  • 📥 download.py

    • Purpose: Downloads the raw dataset (macedonian-corpus-raw) from its source. It is advisable to split the dataset into chunks (using split_data/) for efficient multiprocessing.
  • 🔀 split_data/ (optional)

    • Purpose: Contains split chunks of the downloaded corpus to exploit multiprocessing during filtering.
  • 🧪 test_language_model.py

    • Purpose: Evaluates the outputs of language detection models.
    • Usage: Useful for testing the language filtering logic.
  • 👥 dedup/minhash.py

    • Purpose: Runs MinHash deduplication on the cleaned dataset to produce the deduplicated version (see "How to Reproduce" below); MinHashConfig can be adjusted as needed.

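For orientation, here is a minimal sketch of what a datatrove-based pipeline along the lines of filter.py might look like. The paths, filter choices, and parameters are illustrative assumptions, not the repository's actual configuration; filter.py is the authoritative implementation.

```python
# Illustrative sketch of a fineweb-2-style filtering pipeline using datatrove.
# Paths and parameters are placeholders, not the repository's actual settings.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import (
    C4QualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

pipeline = [
    JsonlReader("data/macedonian-corpus-raw/"),      # assumed input location
    LanguageFilter(languages=["mk"]),                # keep high-confidence Macedonian text
    GopherRepetitionFilter(),                        # drop overly repetitive documents
    GopherQualityFilter(),                           # drop incomplete / low-quality documents
    C4QualityFilter(),                               # C4-style line-level cleanup
    # sentence deduplication and PII filtering would follow here in the real pipeline
    JsonlWriter("data/macedonian-corpus-cleaned/"),  # assumed output location
]

if __name__ == "__main__":
    LocalPipelineExecutor(pipeline=pipeline, tasks=4, workers=4).run()
```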

This folder contains notebooks used to process and unify various Macedonian text sources into the raw dataset.

  • 📄 common_voice.ipynb

    • Purpose: Extracts Macedonian sentences from the Common Voice dataset for inclusion in the corpus.

  • 🔗 consolidate_data.ipynb

    • Purpose: Unifies all data sources into a single raw dataset (macedonian-corpus-raw).
    • Usage: Combines text data from multiple sources (e.g., Common Voice, scraped PDFs, web crawls). For reference, see the dataset description in the HuggingFace repository.
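
As a rough illustration of the consolidation step, the sketch below appends records from a new source to the unified JSONL. The field names ("text" plus the "source" field mentioned elsewhere in this README) are assumptions; consult the dataset card for the authoritative schema.

```python
# Minimal sketch: appending records from a new source to the unified raw JSONL.
# Field names are assumptions based on this README -- verify the real schema first.
import json

new_records = [
    {"text": "Пример на македонски текст.", "source": "MMORE"},  # "Example of Macedonian text."
]

with open("macedonian-corpus-raw.jsonl", "a", encoding="utf-8") as f:
    for record in new_records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```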

This folder contains scripts for data collection through web scraping.

  • 🖋️ scrape_pdfs.py
    • Purpose: Scrapes text from PDFs.
    • Usage: The extracted content is processed through MMORE and included in both the raw and cleaned datasets, with the field 'source' set to 'MMORE'.
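
In this repository the heavy lifting is done by MMORE; purely as an illustration of the basic extraction step, the sketch below pulls raw text from a local PDF with the pypdf library (the file name is a placeholder).

```python
# Illustrative only: plain PDF text extraction with pypdf.
# The actual pipeline uses scrape_pdfs.py together with MMORE.
from pypdf import PdfReader

reader = PdfReader("example.pdf")  # placeholder file name
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```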

🗃️ Macedonian Corpus - Cleaned Version

This repository contributes to the creation of the Macedonian Corpus, which aims to address the scarcity of high-quality Macedonian text data in NLP. The cleaned dataset applies heuristic filters and deduplication to ensure the quality of the text. (NOTE: you have to download the data yourself; the links can be found in the HuggingFace repo under Data Sources.)


How to Reproduce

  1. Scrape Additional Data:

    • Use scrape_pdfs.py to collect additional text data from PDFs.
    • Use your own data, such as local PDFs, DOCX, PPTX, TXT, spreadsheets, audio, and video files, to enrich the dataset.
  2. Extract Textual Data:

    • Use MMORE to extract textual data from the collected files.
  3. Process Additional Data:

    • Modify and run consolidate_data.ipynb to unify all data sources.
    • Since macedonian-corpus-raw is already unified, you can simply append the newly collected data to the JSONL.
  4. Download the Dataset:

    • If you don't have it locally, run download.py to retrieve the raw dataset.
  5. Filter the Dataset:

    • Execute filter.py to produce the cleaned version of the dataset. Optionally, use split_data/ for multiprocessing when handling large files. NOTE: Significant computational resources might be needed for this step, depending on the number of workers and tasks chosen.
    • You can modify the filtering according to your own needs (e.g., swap sentence deduplication for MinHash deduplication). For more information, see datatrove.
  6. Run MinHash Deduplication:

    • Use your cleaned version of the dataset (or download it from HuggingFace) and run minhash.py to reproduce the deduplicated version of the dataset (MinHashConfig can be changed according to your needs). A configuration sketch follows this list.
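
A minimal sketch of the MinHash configuration and the first deduplication stage is shown below. Parameter values and paths are illustrative, and datatrove's API has changed across versions, so treat this as an assumption rather than a copy of the repository's minhash.py.

```python
# Illustrative sketch of the first MinHash deduplication stage with datatrove.
# Values and paths are placeholders; minhash.py is the authoritative implementation.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import MinhashConfig
from datatrove.pipeline.readers import JsonlReader

config = MinhashConfig(num_buckets=14, hashes_per_bucket=8)  # illustrative values

# Stage 1 of 4: compute per-document MinHash signatures. The later stages
# (MinhashDedupBuckets, MinhashDedupCluster, MinhashDedupFilter) bucket the
# signatures, cluster duplicates, and filter them out.
stage1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/macedonian-corpus-cleaned/"),  # assumed input location
        MinhashDedupSignature(output_folder="minhash/signatures", config=config),
    ],
    tasks=4,
)

if __name__ == "__main__":
    stage1.run()
```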

🤝 How to Contribute?

You can contribute to the Macedonian corpus by:

  1. Digitize Books and Materials:

    • Contribute by digitizing books, documents, and other materials that are legally in the public domain. These digitized materials can be used to expand the datasets.
    • Ensure that the materials you contribute comply with copyright laws and are explicitly permitted for public use.
  2. Expand Data Collection:

    • Share other forms of Macedonian-language text data, such as articles, essays, or transcripts, that can legally be used for training or evaluating language models.
  3. Encourage Institutional Participation:

    • We hope this initiative inspires institutions in Macedonia, such as libraries, universities, and research centers, to take part in the digitization of Macedonian-language materials.
    • The availability of such materials will enable the development of specialized software tailored to the needs of Macedonian speakers and researchers.

📬 Contact

For inquiries, feedback, or contributions, please feel free to reach out to the core team:

🎉 Special Thanks

A big thank you to the following individuals: