Roadmap for Balochi NLP

This is a non-exhaustive list of areas where Balochi NLP needs work:

Short-term mini-projects

A custom-trained tokenizer (@strickvl working on this)
A stopword list (plus other basics, such as lists of characters and punctuation with their Unicode code points)
A tool for converting text between the different scripts used to write the language
Dialect classifier
NER (named entity recognition) models
Good quality dataset(s) that are openly available for all to use
OCR support for Balochi texts (a computer vision task, but one that would probably help build datasets; it is highly likely we can benefit from work done for Arabic and Persian)
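
As a starting point for the character and code point lists mentioned above, here is a minimal sketch in Python that builds a character inventory from a piece of text using only the standard library. The function name `character_inventory` and the sample string are assumptions made for illustration; a real inventory would be built from a representative Balochi corpus.

```python
import unicodedata
from collections import Counter


def character_inventory(text):
    """Return (character, code point, Unicode name, count) rows.

    Hypothetical helper for illustration: whitespace is skipped and rows
    are sorted by code point.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    rows = []
    for ch, n in sorted(counts.items()):
        code_point = f"U+{ord(ch):04X}"
        # Fall back to "UNKNOWN" for characters without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append((ch, code_point, name, n))
    return rows


# Illustrative Arabic-script sample, written with escapes so the exact
# code points are unambiguous. Replace with real corpus text.
sample = "\u0628\u0644\u0648\u0686\u06cc"

for ch, code_point, name, n in character_inventory(sample):
    print(code_point, name, n)
```

Running the same function over a larger corpus gives a quick view of which letters, diacritics and punctuation marks actually occur, which is useful input for both a tokenizer and a script conversion tool.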

Medium- to long-term goals / projects

Embeddings
Benchmarks
Text-to-Speech (TTS) models (for generating audio)
Speech-to-Text (STT) models (for transcribing audio)
Language models (of various architectures)

Potential partner organisations

Support could possibly come from leading organisations in the space. Importantly, both of the following have a strong track record of encouraging and supporting work on low-resource languages:

Hugging Face
Explosion (makers of spaCy)
