    Repositories list

    • HPLT-WP4

      Public
      Information and pipelines on WP4: language models training
      Python
      Creative Commons Zero v1.0 Universal
      Updated Jan 26, 2025
    • Internet archive downloader
      Jupyter Notebook
      Updated Jan 25, 2025
    • Data Analytics Tool
      Python
      Updated Jan 23, 2025
    • Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
      HTML
      Updated Jan 23, 2025
    • Jupyter Notebook
      Updated Jan 22, 2025
    • Scripts for running Bitextor jobs
      Shell
      Updated Jan 20, 2025
    • OpusPocus

      Public
      Marian machine translation training pipeline for thousands of models
      Python
      Updated Jan 17, 2025
    • OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
      Python
      Updated Jan 16, 2025
    • Curriculum training
      Python
      MIT License
      Updated Jan 13, 2025
    • Shell
      Updated Dec 30, 2024
    • Shell
      Updated Dec 19, 2024
    • Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
      Rust
      GNU General Public License v3.0
      Updated Nov 4, 2024
    • Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
      Python
      Updated Nov 2, 2024
    • Python port of Moses tokenizer, truecaser and normalizer
      Python
      MIT License
      Updated May 26, 2024
    • tf/idf-based document aligner from Bitextor
      C++
      Apache License 2.0
      Updated Mar 19, 2024
    • PHP
      MIT License
      Updated Mar 9, 2024
    • This contains the configuration and scripts for HPLT MT model releases.
      Python
      Updated Mar 6, 2024
    • OpusFilter - Parallel corpus processing toolkit
      Python
      MIT License
      Updated Jan 3, 2024
    • clianer

      Public
      A lightweight command-line frontend to OpusCleaner
      Python
      MIT License
      Updated Nov 27, 2023
    • Makeshift interface for managing Paracrawl processing and exploring its outputs
      HTML
      Updated Oct 10, 2023
    • Updated Feb 7, 2023