commit 83c97bf (1 parent: c2b5bfa)
Showing 2 changed files with 5 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,7 @@ | |
layout: post | ||
type: post | ||
title: "The missing pieces in Filipino NLP in the age of LLMs" | ||
date: 2024-12-21 | ||
date: 2024-12-17 | ||
category: notebook | ||
comments: true | ||
author: "LJ MIRANDA" | ||
|
@@ -144,18 +144,17 @@ I believe it is important for us, the Filipino research community, to have a say | |
Many indigenous and endangered languages fall into this category due to their limited number of speakers and dedicated NLP researchers. | ||
Tagalog occupies an interesting middle ground: while we have a large speaker population and presumably extensive written content, there remains a scarcity of readily available datasets for downstream NLP tasks. | ||
|
||
One of my favorite papers this year, [*The Zeno's Paradox of Low-Resource Languages*](https://arxiv.org/pdf/2410.20817), helped clarify these definitions by examining how we define "low-resource" across different axes: Artifacts, Resources, Socio-Political factors, and Agency. | ||
One of my favorite papers this year, [_The Zeno's Paradox of Low-Resource Languages_](https://arxiv.org/pdf/2410.20817), helped clarify these definitions by examining how we define "low-resource" across different axes: Artifacts, Resources, Socio-Political factors, and Agency. | ||
Although Tagalog has millions of speakers (**↑ Resources**), it still lacks high-quality data for several core NLP and language modelling tasks (**↓ Artifacts**), and there remains significant room for growth in our participation in developing these language technologies (**· Agency**). | ||
I appreciate this framework because it provides multiple dimensions for measuring a language's low-resource status, eliminating the need to debate or bikeshed new definitions. | ||
|
||
I maintain that Philippine languages remain low-resource across several dimensions. | ||
Even Tagalog, our majority language, still lacks the necessary tools and datasets to produce robust NLP pipelines. | ||
I maintain that Philippine languages remain low-resource across several dimensions. | ||
Even Tagalog, our majority language, still lacks the necessary tools and datasets to produce robust NLP pipelines. | ||
I believe the three research directions I described above can both increase the number of artifacts available for building language technologies and enhance our agency as a research community. | ||
I admit that I haven't done enough for Filipino NLP this year[^2] and this blog post serves not just as a research statement but **also a commitment to improve my involvement in this language.** | ||
I have some ideas (the ideas in this blog post are just a small part of it), so if you want to help out, [feel free to reach out](mailto:[email protected])! | ||
|
||
### Footnotes | ||
|
||
[^1]: The Cebuano Wikipedia is the second-largest Wikipedia in terms of number of articles. Although this appears impressive, its size is due to an article-generating bot called [Lsjbot](https://en.wikipedia.org/wiki/Lsjbot) rather than a dedicated group of Wikipedia volunteers. Unfortunately, the articles in Cebuano Wikipedia are unnatural and do not reflect how the language is actually used by native speakers. | ||
|
||
[^2]: This year we published [SEACrowd](https://aclanthology.org/2024.emnlp-main.296/), [Universal NER](https://aclanthology.org/2024.naacl-long.243/), and the [largest Tagalog UD Treebank](https://huggingface.co/collections/UD-Filipino/universal-dependencies-for-tagalog-67573d625baa5036fd59b317), but most of these efforts started back in 2023. | ||
[^2]: This year we published [SEACrowd](https://aclanthology.org/2024.emnlp-main.296/), [Universal NER](https://aclanthology.org/2024.naacl-long.243/), and the [largest Tagalog UD Treebank](https://huggingface.co/collections/UD-Filipino/universal-dependencies-for-tagalog-67573d625baa5036fd59b317), but most of these efforts started back in 2023. |