add optional tagger arg to compute_readability

joshdavham · Sep 27, 2024 · 46c9799
1 parent a4d324c
Showing 4 changed files with 107 additions and 61 deletions.
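
The change in this commit follows a common pattern: an expensive helper object becomes an optional argument that is only constructed when the caller does not supply one. The sketch below illustrates that pattern in isolation. `FakeTagger` and `compute_score` are hypothetical stand-ins invented for this example (they are not part of fugashi or jreadability), and the score is a placeholder rather than the readability formula.

```python
from typing import List, Optional


class FakeTagger:
    """Hypothetical stand-in for fugashi.Tagger; counts how often it is built."""

    instances_created = 0

    def __init__(self) -> None:
        FakeTagger.instances_created += 1

    def __call__(self, text: str) -> List[str]:
        # Placeholder tokenization: whitespace split, not morphological analysis.
        return text.split()


def compute_score(text: str, tagger: Optional[FakeTagger] = None) -> float:
    """Placeholder scorer that mirrors the new signature, not the real formula."""
    if tagger is None:
        # No tagger supplied: build one for this call only.
        tagger = FakeTagger()
    return float(len(tagger(text)))


texts = ["a b c", "d e", "f"]

# Without a shared tagger: one construction per call.
scores_slow = [compute_score(t) for t in texts]
built_without_sharing = FakeTagger.instances_created

# With a shared tagger: a single construction for the whole batch.
FakeTagger.instances_created = 0
shared = FakeTagger()
scores_fast = [compute_score(t, shared) for t in texts]
built_with_sharing = FakeTagger.instances_created
```

In this sketch both loops produce the same scores; only the number of tagger constructions differs, which is where the batch-processing saving would come from if constructing the real tagger is the expensive step.
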
diff --git a/README.md b/README.md
@@ -49,7 +49,7 @@ print(score) # 5.596333333333334
 
 Note that this readability calculator is specifically for <u>non-native speakers</u> learning to read Japanese. This is not to be confused with something like grade level or other readability scores meant for native speakers.
 
-### Equation
+### Model
 
 ```
 readability = {mean number of words per sentence} * -0.056
@@ -62,8 +62,24 @@ readability = {mean number of words per sentence} * -0.056
 
 *\* "kango" (漢語) means Japanese word of Chinese origin while "wago" (和語) means native Japanese word.*
 
----
-
 #### Note on model consistency
 
-The readability scores produced by this python package tend to differ slightly from the scores produced on the official [jreadability website](https://jreadability.net/sys/en). This is likely due to the version difference in UniDic between these two implementations as this package uses UniDic 2.1.2 while theirs uses UniDic 2.2.0. This issue will hopefully be resolved in the future.
+The readability scores produced by this python package tend to differ slightly from the scores produced on the official [jreadability website](https://jreadability.net/sys/en). This is likely due to the version difference in UniDic between these two implementations as this package uses UniDic 2.1.2 while theirs uses UniDic 2.2.0. This issue will hopefully be resolved in the future.
+
+## Batch processing
+
+jreadability makes use of [fugashi](https://github.com/polm/fugashi)'s tagger under the hood and initializes a new tagger every time `compute_readability` is invoked without one. If you are processing a large number of texts, it is recommended to initialize the tagger once yourself, then pass it as an argument to each subsequent `compute_readability` call.
+
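+```python
+from fugashi import Tagger
+from jreadability import compute_readability
+
+texts = [...]
+
+tagger = Tagger()  # create the tagger once
+
+for text in texts:
+    score = compute_readability(text, tagger)  # fast: reuses the tagger
+    # score = compute_readability(text)  # slow: builds a new tagger per call
+    ...
+```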
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "jreadability"
-version = "1.0.1"
+version = "1.1.0"
 description = "Calculate readability scores for Japanese texts."
 readme = "README.md"
 authors = [{ name = "Joshua Hamilton", email = "[email protected]" }]

diff --git a/src/jreadability/jreadability.py b/src/jreadability/jreadability.py
@@ -5,23 +5,25 @@
 There are no other public functions, classes or variables.
 """
 
-import fugashi
-from typing import List
+from fugashi import Tagger
+from typing import List, Optional
 from fugashi.fugashi import UnidicNode
 
-def compute_readability(text: str) -> float:
+def compute_readability(text: str, tagger: Optional[Tagger] = None) -> float:
     """
     Computes the readability of a Japanese text.
 
     Args:
         text (str): The text to be scored.
+        tagger (Optional[Tagger]): The fugashi tagger used to parse the text. If None, a new Tagger is created.
 
     Returns:
         float: A float representing the readability score of the text.
     """
 
-    # initialize mecab parser
-    tagger = fugashi.Tagger()
+    if tagger is None:
+        # initialize mecab parser
+        tagger = Tagger()
 
     doc = tagger(text)