add optional tagger arg to compute_readability
joshdavham committed Sep 27, 2024
1 parent a4d324c commit 46c9799
Showing 4 changed files with 107 additions and 61 deletions.
24 changes: 20 additions & 4 deletions README.md
@@ -49,7 +49,7 @@ print(score) # 5.596333333333334

Note that this readability calculator is specifically for <u>non-native speakers</u> learning to read Japanese. This is not to be confused with something like grade level or other readability scores meant for native speakers.

-### Equation
+### Model

```
readability = {mean number of words per sentence} * -0.056
@@ -62,8 +62,24 @@ readability = {mean number of words per sentence} * -0.056

*\* "kango" (漢語) means Japanese word of Chinese origin while "wago" (和語) means native Japanese word.*

---

#### Note on model consistency

The readability scores produced by this Python package tend to differ slightly from the scores produced on the official [jreadability website](https://jreadability.net/sys/en). This is likely due to a version difference in UniDic between the two implementations: this package uses UniDic 2.1.2 while the website uses UniDic 2.2.0. This issue will hopefully be resolved in the future.

## Batch processing

jreadability uses [fugashi](https://github.com/polm/fugashi)'s tagger under the hood and initializes a new tagger every time `compute_readability` is invoked. If you are processing a large number of texts, it is recommended to initialize the tagger once yourself and pass it as an argument to each subsequent `compute_readability` call.

```python
from fugashi import Tagger
from jreadability import compute_readability

texts = [...]

tagger = Tagger()  # initialize the tagger once and reuse it

for text in texts:
    score = compute_readability(text, tagger) # fast :D
    #score = compute_readability(text) # slow :'(
    ...
```
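
The effect of reusing a tagger can be illustrated without installing fugashi. The sketch below is only an illustration under stated assumptions: `CountingTagger` and `score_text` are hypothetical stand-ins (not part of jreadability or fugashi) that mirror the optional-tagger signature and count how many times a tagger is constructed.

```python
from typing import List, Optional


class CountingTagger:
    """Hypothetical stand-in for an expensive-to-build parser such as fugashi's Tagger."""

    instances_created = 0  # class-wide counter of constructor calls

    def __init__(self) -> None:
        CountingTagger.instances_created += 1

    def __call__(self, text: str) -> List[str]:
        # Toy tokenization; a real tagger would run morphological analysis.
        return text.split()


def score_text(text: str, tagger: Optional[CountingTagger] = None) -> float:
    """Toy scorer mirroring the optional-tagger signature (not the real formula)."""
    if tagger is None:
        # No tagger supplied: build one just for this call.
        tagger = CountingTagger()
    return float(len(tagger(text)))


texts = ["a b c", "d e", "f"]

# Without a shared tagger, every call constructs its own.
CountingTagger.instances_created = 0
scores_slow = [score_text(t) for t in texts]
created_without_sharing = CountingTagger.instances_created

# With a shared tagger, construction happens exactly once.
CountingTagger.instances_created = 0
shared = CountingTagger()
scores_fast = [score_text(t, shared) for t in texts]
created_with_sharing = CountingTagger.instances_created
```

Both loops produce the same scores; only the number of tagger constructions differs, which is where the time goes when the constructor is expensive.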
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "jreadability"
-version = "1.0.1"
+version = "1.1.0"
description = "Calculate readability scores for Japanese texts."
readme = "README.md"
authors = [{ name = "Joshua Hamilton", email = "[email protected]" }]
12 changes: 7 additions & 5 deletions src/jreadability/jreadability.py
@@ -5,23 +5,25 @@
There are no other public functions, classes or variables.
"""

-import fugashi
-from typing import List
+from fugashi import Tagger
+from typing import List, Optional
 from fugashi.fugashi import UnidicNode

-def compute_readability(text: str) -> float:
+def compute_readability(text: str, tagger: Optional[Tagger]=None) -> float:
     """
     Computes the readability of a Japanese text.
     Args:
         text (str): The text to be scored.
+        tagger (Optional[Tagger]): The fugashi parser used to parse the text.
     Returns:
         float: A float representing the readability score of the text.
     """

-    # initialize mecab parser
-    tagger = fugashi.Tagger()
+    if tagger is None:
+        # initialize mecab parser
+        tagger = Tagger()

     doc = tagger(text)
