Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: add some developer documentation
1 parent dfb96a6 commit 6e2d86e
Showing 5 changed files with 182 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Parameters | ||
|
||
[Processes](processes.md), [Readers](readers.md), and [Writers](writers.md) can all be parameterized. To add a parameter to any of these classes, simply define a class attribute of type `Parameter`: | ||
|
||
```python | ||
class SomeWriter(Writer): | ||
    warn_on_invalid = Parameter( | |
        type=bool, default=False, | |
        help="issue warnings on feature values that don't match the spec", | |
    ) | |
``` | ||
|
||
This parameter can then be passed either from a TOML file: | ||
|
||
```toml | ||
[export] | ||
# ... | ||
warn_on_invalid = true | ||
``` | ||
|
||
Or from a Python script: | ||
|
||
```python | ||
rebabel_format.run_command( | ||
    'export', #... | |
    warn_on_invalid=True, | |
) | ||
``` | ||
|
||
In either case, methods on `SomeWriter` can simply refer to `self.warn_on_invalid`, which from their perspective will be a boolean. | ||
|
||
## `Parameter` | ||
|
||
`Parameter` objects have the following attributes: | ||
|
||
- `required`: whether to raise an error if this parameter is omitted; defaults to `True` | ||
- `default`: the value of this parameter if not specified by the user; if this is not `None`, then `required` will be set to `False` | ||
- `type`: the type that a provided value must be (according to `isinstance`) | ||
- `help`: the documentation string for this parameter | ||
|
||
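The interaction between `required`, `default`, and `type` can be illustrated with a small stand-in. This is only a sketch of the rules listed above, not reBabel's implementation: the class `ParameterSketch`, its `resolve` method, and the exception types it raises are hypothetical.

```python
class ParameterSketch:
    """Hypothetical stand-in mirroring the documented attributes."""

    def __init__(self, required=True, default=None, type=None, help=''):
        self.default = default
        self.type = type
        self.help = help
        # A default other than None makes the parameter optional.
        self.required = required and default is None

    def resolve(self, name, provided):
        """Look up `name` in the user-supplied dict and validate it."""
        if name not in provided:
            if self.required:
                # Assumption: the real library may raise a different error.
                raise ValueError(f'missing required parameter {name!r}')
            return self.default
        value = provided[name]
        if self.type is not None and not isinstance(value, self.type):
            raise TypeError(f'{name!r} must be of type {self.type.__name__}')
        return value


warn = ParameterSketch(type=bool, default=False)
print(warn.required)
print(warn.resolve('warn_on_invalid', {}))
print(warn.resolve('warn_on_invalid', {'warn_on_invalid': True}))
```

Because the default here is `False` rather than `None`, the sketch treats the parameter as optional and falls back to the default when the user supplies nothing.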
## `FeatureParameter` | ||
|
||
The input to this parameter type should specify a tier name and feature name. This can be done as a string (`"tier:feature"`), as a dictionary (`{"tier": tier, "feature": feature}`), or as an iterable of length 2 (`[tier, feature]`). It will be normalized to a tuple (`(tier, feature)`). | ||
|
||
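The three accepted input shapes and the resulting tuple can be sketched as follows. The function `normalize_feature` is a hypothetical illustration, not the library's code, and splitting the string form on the first colon is an assumption.

```python
def normalize_feature(value):
    """Normalize a tier/feature specification to a (tier, feature) tuple."""
    if isinstance(value, str):
        # Assumption: split on the first colon only.
        tier, sep, feature = value.partition(':')
        if not sep or not tier or not feature:
            raise ValueError(f'expected "tier:feature", got {value!r}')
        return (tier, feature)
    if isinstance(value, dict):
        return (value['tier'], value['feature'])
    # Any other iterable must contain exactly two items.
    items = tuple(value)
    if len(items) != 2:
        raise ValueError('expected exactly two items: tier and feature')
    return items


print(normalize_feature('something:lemma'))
print(normalize_feature({'tier': 'something', 'feature': 'lemma'}))
print(normalize_feature(['something', 'lemma']))
```

All three calls produce the same tuple, which is the point of normalizing: code that consumes the parameter only has to handle one shape.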
## `QueryParameter` | ||
|
||
Input to this parameter should be a dictionary which is checked to ensure that it is a valid query. | ||
|
||
## `UsernameParameter` | ||
|
||
A string parameter which is optional by default. If not provided, it will be set to the value of the environment variable `$USER`. | ||
|
||
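The fallback described here amounts to an environment lookup. A minimal sketch, assuming a hypothetical helper named `resolve_username`:

```python
import os


def resolve_username(provided=None):
    """Return the given username, or fall back to $USER."""
    if provided is not None:
        return provided
    # os.environ.get returns None if USER is not set in the environment.
    return os.environ.get('USER')


print(resolve_username('someone'))
```

Note that `USER` is not guaranteed to be set on every platform, in which case this sketch returns `None`.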
## `DBParameter` | ||
|
||
The input to this parameter is expected to be a path to a database. The value will be either `None` or an instance of `RBBLFile`. The parameter `db` of this type is inherited from `Process`, so referring to `DBParameter` directly is rarely necessary. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Plugins | ||
|
||
Plugins to reBabel are loaded using [entry point specifiers](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata) in the Python package metadata. | ||
|
||
For example, if importing `my_process` would import a subclass of `Process`, then the following should be added to `setup.cfg`: | ||
|
||
```cfg | ||
[options.entry_points] | ||
rebabel.processes = | ||
    my_process = my_process | |
``` | ||
|
||
The other recognized entry points are `rebabel.readers` for `Reader` subclasses, `rebabel.writers` for `Writer` subclasses, and `rebabel.converters` for modules which contain both. |
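For reference, entry points declared this way can be listed with the standard library's `importlib.metadata`. The sketch below shows the general mechanism only; whether reBabel loads plugins with exactly this code is not stated here, and `find_plugins` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def find_plugins(group):
    """Return {entry point name: loaded object} for one entry point group."""
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a dict keyed by group name.
        selected = eps.get(group, [])
    # load() imports the module named on the right-hand side of the entry.
    return {ep.name: ep.load() for ep in selected}


print(sorted(find_plugins('rebabel.processes')))
```

With the `setup.cfg` above installed, the result would contain a `my_process` key whose value is the imported `my_process` module.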
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Processes | ||
|
||
Creating a new process can be done by subclassing `Process`. | ||
|
||
```python | ||
class SomeProcess(Process): | ||
    name = 'do_stuff' | |
|
||
    amount = Parameter(type=int, default=1, help='amount of stuff to do') | |
|
||
    def run(self): | |
        for i in range(self.amount): | |
            print('doing stuff to', self.db.path) | |
``` | ||
|
||
To be invokable, a process must have a `name` attribute. The main action of the process occurs in the `run` method, which takes no arguments. | ||
|
||
The [parameters](parameters.md) that the process expects are specified by adding attributes of type `Parameter`. When the class is instantiated, these are converted into the appropriate values. Subclasses of `Process` also inherit a required parameter named `db` which will have an `RBBLFile` as a value. | |
|
||
Processes can be defined either in the `processes` directory of the reBabel project or in [plugins](plugins.md) which declare a `rebabel.processes` entry point. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# Readers | ||
|
||
Readers import data into a reBabel database. | ||
|
||
```python | ||
class SomeReader(Reader): | ||
    identifier = 'something' | |
|
||
    version = Parameter(type=int, default=3, help='schema version to read') | |
|
||
    def read_file(self, fin): | |
        for line_number, line in enumerate(fin): | |
            self.set_type(line_number, 'line') | |
            self.set_feature(line_number, 'something', 'line', 'str', | |
                             line.strip()) | |
        self.finish_block() | |
``` | ||
|
||
Readers should have an attribute named `identifier` in order to be invokable. | ||
|
||
Any [parameters](parameters.md) that need to be specified should also be specified as class attributes. | ||
|
||
The action of a reader is split across the methods `open_file(path)`, `read_file(file)`, and `close_file(file)`. By default, `open_file` opens a file in text mode and `close_file` calls `.close()` on it, but both can be overridden (see the subclasses below for common cases). | |
|
||
Within `read_file`, units are created when information about them is specified, which is done with the following methods: | ||
|
||
- `set_type(name, type)`: specify the unit type of `name`; if the type of a unit is not specified, a `ReaderError` will be raised | ||
- `set_parent(child_name, parent_name)`: set the primary parent of a given unit | ||
- `add_relation(child_name, parent_name)`: set a non-primary parent of a given unit | ||
- `set_feature(name, tier, feature, type, value)`: set `tier:feature` to `value` for unit `name`, creating the feature with type `type`, if necessary | ||
- `finish_block(keep_uids=False)`: indicates that a segment of data is complete and should be committed to the database | ||
  - by default, the list of names accumulated by the other methods will be cleared; this can be prevented by setting `keep_uids=True`, which is useful for cases where the input has globally unique IDs, is very large, and has relations spanning the file | |
|
||
Unit names are purely internal to the `Reader` instance and can be of any hashable type (`int`, `str`, `tuple`, etc.). They will be converted to database IDs when `finish_block` is called. | |
|
||
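The bookkeeping described above (hashable names, conversion to IDs at `finish_block`, and the effect of `keep_uids`) can be pictured with a toy stand-in. This is a sketch under assumptions, not the `Reader` implementation: the class `BlockSketch`, its `committed` list standing in for the database, the sequential IDs, and the use of `ValueError` instead of `ReaderError` are all hypothetical.

```python
class BlockSketch:
    """Hypothetical stand-in for the name bookkeeping of a reader."""

    def __init__(self):
        self.types = {}      # unit name -> unit type, for the current block
        self.parents = {}    # child name -> primary parent name
        self.ids = {}        # unit name -> ID assigned at commit time
        self.committed = []  # (id, type, parent id) rows; stands in for the DB
        self.next_id = 1

    def set_type(self, name, type):
        self.types[name] = type

    def set_parent(self, child_name, parent_name):
        self.parents[child_name] = parent_name

    def finish_block(self, keep_uids=False):
        # Every unit mentioned in a relation needs a type or an existing ID.
        mentioned = set(self.parents) | set(self.parents.values())
        for name in mentioned:
            if name not in self.types and name not in self.ids:
                raise ValueError(f'no type given for unit {name!r}')
        # Convert names to IDs, then "commit" one row per unit.
        for name in self.types:
            if name not in self.ids:
                self.ids[name] = self.next_id
                self.next_id += 1
        for name, unit_type in self.types.items():
            parent = self.parents.get(name)
            self.committed.append(
                (self.ids[name], unit_type, self.ids.get(parent)))
        self.types.clear()
        self.parents.clear()
        if not keep_uids:
            self.ids.clear()


block = BlockSketch()
block.set_type(('s', 1), 'sentence')
block.set_type(('w', 1), 'word')
block.set_parent(('w', 1), ('s', 1))
block.finish_block(keep_uids=True)
block.set_type(('w', 2), 'word')
block.set_parent(('w', 2), ('s', 1))  # refers to a unit from the earlier block
block.finish_block()
print(block.committed)
```

Without `keep_uids=True` on the first call, the second block could not refer to `('s', 1)`, because the name would already have been forgotten.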
## `XMLReader` | ||
|
||
This subclass parses the input file using [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) and passes an `Element` object to `read_file`. | ||
|
||
## `JSONReader` | ||
|
||
This subclass parses the input file as JSON and passes a dictionary to `read_file`. | ||
|
||
## `LineReader` | ||
|
||
This subclass is specialized for text files where linebreaks are meaningful. It is roughly equivalent to the following: | ||
|
||
```python | ||
for line in file: | ||
    if self.is_boundary(line): | |
        self.end() | |
        self.reset() | |
    self.process_line(line) | |
``` | ||
|
||
- `is_boundary(line)`: should return `True` if `line` is the end of a group of lines or the beginning of a new one; by default it checks if the line is blank | ||
- `process_line(line)`: perform any processing needed on the text of the line | ||
- `end()`: hook to operate at the end of a block; calls `finish_block()` | ||
- `reset()`: set up any needed variables for a new block |
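To see how the four hooks cooperate, here is a self-contained toy version of that loop. `LineGrouper` is a hypothetical stand-in, not the real `LineReader`: its `end()` stores the group in a list instead of calling `finish_block()`, and the final flush after the last line is an assumption.

```python
class LineGrouper:
    """Hypothetical stand-in for the LineReader control flow."""

    def __init__(self):
        self.blocks = []
        self.reset()

    def is_boundary(self, line):
        # Default behaviour described above: a blank line separates blocks.
        return not line.strip()

    def process_line(self, line):
        # Collect non-blank lines; the boundary line itself is skipped.
        text = line.strip()
        if text:
            self.current.append(text)

    def end(self):
        # The real hook calls finish_block(); here we just store the group.
        if self.current:
            self.blocks.append(self.current)

    def reset(self):
        self.current = []

    def run(self, lines):
        for line in lines:
            if self.is_boundary(line):
                self.end()
                self.reset()
            self.process_line(line)
        # Assumption: flush whatever remains after the last line.
        self.end()
        self.reset()
        return self.blocks


print(LineGrouper().run(['a\n', 'b\n', '\n', 'c\n']))
```

A subclass for a different format would typically override `is_boundary` and `process_line` and leave the loop itself alone.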
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# Writers | ||
|
||
Writers define a query to extract relevant units and then write them to an output file. | ||
|
||
```python | ||
class SomeWriter(Writer): | ||
    identifier = 'something' | |
|
||
    query = { | |
        'S': {'type': 'sentence'}, | |
        'W': {'type': 'word', 'parent': 'S'}, | |
    } | |
    query_order = ['S', 'W'] | |
|
||
    indent = Parameter(type=str, default='\t') | |
|
||
    def write(self, fout): | |
        s_feat = self.table.add_features('S', ['something:text'])[0] | |
|
||
        w_feat_names = ['something:lemma', 'something:pos'] | |
        w_feat_ids = self.table.add_features('W', w_feat_names) | |
        w_lemma = w_feat_ids[0] | |
        w_pos = w_feat_ids[1] | |
|
||
        current_sentence = None | |
        for units, features in self.table.results(): | |
            if units['S'] != current_sentence: | |
                fout.write(str(features[units['S']].get(s_feat, '')) + '\n') | |
                current_sentence = units['S'] | |
            fout.write(self.indent) | |
            fout.write(str(features[units['W']].get(w_lemma, ''))) | |
            fout.write(' ') | |
            fout.write(str(features[units['W']].get(w_pos, ''))) | |
            fout.write('\n') |
``` |
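
The loop in `write` can be tried without a database by feeding it hand-made rows. The sketch below assumes, based only on how the example indexes them, that each result is a pair of `units` (query name to unit ID) and `features` (unit ID to a mapping from feature ID to value). The feature IDs and sample data are made up, and `write_rows` is a hypothetical helper.

```python
import io

# Made-up feature IDs standing in for the values add_features() returns.
S_TEXT, W_LEMMA, W_POS = 10, 11, 12

# Hand-made rows in the shape assumed for self.table.results().
rows = [
    ({'S': 1, 'W': 2},
     {1: {S_TEXT: 'Dogs bark.'}, 2: {W_LEMMA: 'dog', W_POS: 'NOUN'}}),
    ({'S': 1, 'W': 3},
     {1: {S_TEXT: 'Dogs bark.'}, 3: {W_LEMMA: 'bark', W_POS: 'VERB'}}),
    ({'S': 4, 'W': 5},
     {4: {S_TEXT: 'Hi.'}, 5: {W_LEMMA: 'hi'}}),
]


def write_rows(rows, fout, indent='\t'):
    """Write each sentence once, followed by its indented words."""
    current_sentence = None
    for units, features in rows:
        if units['S'] != current_sentence:
            # First row of a new sentence: emit the sentence text.
            fout.write(str(features[units['S']].get(S_TEXT, '')) + '\n')
            current_sentence = units['S']
        lemma = features[units['W']].get(W_LEMMA, '')
        pos = features[units['W']].get(W_POS, '')
        fout.write(f'{indent}{lemma} {pos}\n')


out = io.StringIO()
write_rows(rows, out)
print(out.getvalue(), end='')
```

Each sentence line is written only when the `S` unit changes, so rows belonging to the same sentence must arrive consecutively, which is presumably what `query_order` is for.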