This is a non-exhaustive list of things that still need work for Balochi NLP:
- A custom-trained tokenizer (@strickvl working on this)
- A stopword list (and other basics, such as lists of characters and punctuation with their associated Unicode code points)
- A tool for converting text between the different scripts used to write the language
- Dialect classifier
- NER (named entity recognition) models
- Good quality dataset(s) that are openly available for all to use
- OCR support for Balochi texts (this sits in the computer vision domain, but it would probably help build datasets, and it is highly likely we can benefit from work already done for Arabic and Persian)
- Embeddings
- Benchmarks
- Text-to-Speech (TTS) models (for generating audio)
- Speech-to-Text (STT) models (for transcribing audio)
- Language models (of various architectures)
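
As a rough illustration of the "lists of characters/punctuation and their associated Unicode code points" item above, here is a minimal sketch that builds a character inventory from raw text using only Python's standard `unicodedata` module. The function names `char_inventory` and `is_punctuation` and the sample string are hypothetical, chosen just for this example; a real inventory would need to be built from a proper corpus and reviewed by Balochi speakers.

```python
import unicodedata


def char_inventory(text):
    """Return (char, code point, Unicode name, category) rows for each
    distinct non-whitespace character in `text`, sorted by code point."""
    rows = []
    for ch in sorted(set(text)):
        if ch.isspace():
            continue
        rows.append(
            (
                ch,
                f"U+{ord(ch):04X}",
                unicodedata.name(ch, "UNKNOWN"),
                unicodedata.category(ch),
            )
        )
    return rows


def is_punctuation(ch):
    """True if the Unicode general category of `ch` is a punctuation class (P*)."""
    return unicodedata.category(ch).startswith("P")


if __name__ == "__main__":
    # Arbitrary sample mixing Arabic-script and Latin-script text (illustration only).
    sample = "بلوچی، Balochi!"
    for ch, code_point, name, category in char_inventory(sample):
        print(ch, code_point, name, category, sep="\t")
```

Running this over a larger corpus would give a first draft of the character and punctuation lists, which could then be checked by hand for characters that are stray noise or that belong to other languages.
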
Support could possibly come from leading organisations in the space. Importantly, both of the following have a strong track record of encouraging and supporting work on low-resource languages:
- Hugging Face
- Explosion (makers of spaCy)