docs: add some developer documentation
mr-martian committed Sep 18, 2024
1 parent dfb96a6 commit 6e2d86e
Showing 5 changed files with 182 additions and 0 deletions.
55 changes: 55 additions & 0 deletions docs/parameters.md
@@ -0,0 +1,55 @@
# Parameters

[Processes](processes.md), [Readers](readers.md), and [Writers](writers.md) can all be parameterized. To add a parameter to any of these classes, simply define a class attribute of type `Parameter`:

```python
class SomeWriter(Writer):
    warn_on_invalid = Parameter(
        type=bool, default=False,
        help="issue warnings on feature values that don't match the spec",
    )
```

This parameter can then be passed either from a TOML file:

```toml
[export]
# ...
warn_on_invalid = true
```

Or from a Python script:

```python
rebabel_format.run_command(
    'export',  # ...
    warn_on_invalid=True,
)
```

In either case, methods on `SomeWriter` can simply refer to `self.warn_on_invalid`, which from their perspective will be a boolean.

## `Parameter`

`Parameter` objects have the following attributes:

- `required`: whether to raise an error if this parameter is omitted; defaults to `True`
- `default`: the value of this parameter if not specified by the user; if this is not `None`, then `required` will be set to `False`
- `type`: the type that a provided value must be (according to `isinstance`)
- `help`: the documentation string for this parameter
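
How `required`, `default`, and `type` interact can be sketched as follows. This is a simplified stand-in and not reBabel's actual implementation: the class name `SketchParameter` and its `process` method are hypothetical, and only the behaviour described in the list above is modelled.

```python
class SketchParameter:
    """Hypothetical stand-in mimicking the attributes listed above."""

    def __init__(self, required=True, default=None, type=None, help=''):
        self.default = default
        # A default other than None makes the parameter optional.
        self.required = required and default is None
        self.type = type
        self.help = help

    def process(self, name, value=None):
        """Return the value to use, or raise if it is missing or mistyped."""
        if value is None:
            if self.required:
                raise ValueError(f'missing required parameter {name!r}')
            return self.default
        if self.type is not None and not isinstance(value, self.type):
            raise TypeError(f'{name!r} must be of type {self.type.__name__}')
        return value


warn = SketchParameter(type=bool, default=False)
print(warn.required)
print(warn.process('warn_on_invalid'))
print(warn.process('warn_on_invalid', True))
```

Because `default=False` is not `None`, `warn` ends up optional even though `required` was left at its default of `True`.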

## `FeatureParameter`

The input to this parameter type should specify a tier name and feature name. This can be done as a string (`"tier:feature"`), as a dictionary (`{"tier": tier, "feature": feature}`), or as an iterable of length 2 (`[tier, feature]`). It will be normalized to a tuple (`(tier, feature)`).
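
The three accepted input shapes and the normalized result can be illustrated with a small helper. The function `normalize_feature` is hypothetical and only approximates the behaviour described above; in particular, splitting the string form on the first colon is an assumption.

```python
def normalize_feature(value):
    """Normalize a feature specifier to a (tier, feature) tuple."""
    if isinstance(value, str):
        # Assumption: split "tier:feature" on the first colon.
        tier, sep, feature = value.partition(':')
        if not sep:
            raise ValueError(f'expected "tier:feature", got {value!r}')
        return (tier, feature)
    if isinstance(value, dict):
        return (value['tier'], value['feature'])
    # Any other iterable must contain exactly two items.
    pair = tuple(value)
    if len(pair) != 2:
        raise ValueError(f'expected 2 items, got {len(pair)}')
    return pair


print(normalize_feature('UD:lemma'))
print(normalize_feature({'tier': 'UD', 'feature': 'lemma'}))
print(normalize_feature(['UD', 'lemma']))
```

All three calls yield the same tuple, which is what code inside the class would see.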

## `QueryParameter`

Input to this parameter should be a dictionary which is checked to ensure that it is a valid query.

## `UsernameParameter`

A string parameter which is optional by default. If not provided, it will be set to the value of the environment variable `$USER`.
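
The fallback can be sketched with the standard library alone. The helper name `resolve_username` is hypothetical, and what reBabel does when `$USER` is unset is not specified here, so this sketch simply returns `None` in that case.

```python
import os


def resolve_username(value=None):
    """Return the given username, falling back to $USER when omitted."""
    if value is not None:
        return value
    # os.environ.get returns None if USER is not set in the environment.
    return os.environ.get('USER')


print(resolve_username('alice'))
print(resolve_username())
```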

## `DBParameter`

The input to this parameter is expected to be a path to a database. The value will be either `None` or an instance of `RBBLFile`. Because a `db` parameter of this type is inherited from `Process`, referring to `DBParameter` directly is rarely necessary.
13 changes: 13 additions & 0 deletions docs/plugins.md
@@ -0,0 +1,13 @@
# Plugins

Plugins to reBabel are loaded using [entry point specifiers](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata) in the Python package metadata.

For example, if importing `my_process` would import a subclass of `Process`, then the following should be added to `setup.cfg`:

```cfg
[options.entry_points]
rebabel.processes =
    my_process = my_process
```

The other recognized entry points are `rebabel.readers` for `Reader` subclasses, `rebabel.writers` for `Writer` subclasses, and `rebabel.converters` for modules which contain both.
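
To check which plugins are visible in the current environment, the entry point groups can be inspected with the standard library's `importlib.metadata`. This is a diagnostic sketch rather than a description of how reBabel itself loads plugins; the helper name `list_plugin_names` is hypothetical.

```python
from importlib.metadata import entry_points


def list_plugin_names(group):
    """Return the names of installed entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10+ exposes a selectable interface.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


for group in ('rebabel.processes', 'rebabel.readers',
              'rebabel.writers', 'rebabel.converters'):
    print(group, list_plugin_names(group))
```

A group with no installed plugins yields an empty list.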
20 changes: 20 additions & 0 deletions docs/processes.md
@@ -0,0 +1,20 @@
# Processes

Creating a new process can be done by subclassing `Process`.

```python
class SomeProcess(Process):
    name = 'do_stuff'

    amount = Parameter(type=int, default=1, help='amount of stuff to do')

    def run(self):
        for i in range(self.amount):
            print('doing stuff to', self.db.path)
```

To be invokable, a process must have a `name` attribute. The main action of the process occurs in the `run` method, which takes no arguments.

The [parameters](parameters.md) that the process expects are specified by adding class attributes of type `Parameter`. When the class is instantiated, these are converted into the appropriate values. Subclasses of `Process` also inherit a required parameter named `db`, whose value will be an `RBBLFile`.

Processes can be defined either in the `processes` directory of the reBabel project or in [plugins](plugins.md) which declare a `rebabel.processes` entry point.
59 changes: 59 additions & 0 deletions docs/readers.md
@@ -0,0 +1,59 @@
# Readers

Readers import data into a reBabel database.

```python
class SomeReader(Reader):
    identifier = 'something'

    version = Parameter(type=int, default=3, help='schema version to read')

    def read_file(self, fin):
        for line_number, line in enumerate(fin):
            self.set_type(line_number, 'line')
            self.set_feature(line_number, 'something', 'line', 'str',
                             line.strip())
        self.finish_block()
```

Readers should have an attribute named `identifier` in order to be invokable.

Any [parameters](parameters.md) that need to be specified should also be specified as class attributes.

The action of a reader is split across the methods `open_file(path)`, `read_file(file)`, and `close_file(file)`. By default, `open_file` opens the file in text mode and `close_file` calls `.close()` on it, but both can be overridden. The subclasses described below cover common cases.

Within `read_file`, units are created when information about them is specified, which is done with the following methods:

- `set_type(name, type)`: specify the unit type of `name`; if the type of a unit is not specified, a `ReaderError` will be raised
- `set_parent(child_name, parent_name)`: set the primary parent of a given unit
- `add_relation(child_name, parent_name)`: set a non-primary parent of a given unit
- `set_feature(name, tier, feature, type, value)`: set `tier:feature` to `value` for unit `name`, creating the feature with type `type`, if necessary
- `finish_block(keep_uids=False)`: indicates that a segment of data is complete and should be committed to the database
  - by default, the list of names accumulated by the other methods will be cleared; this can be prevented by setting `keep_uids=True`, which is useful for cases where the input has globally unique IDs, is very large, and has relations spanning the file

Unit names are purely internal to the `Reader` instance and can be of any hashable type (`int`, `str`, `tuple`, etc). They will be converted to database IDs when `finish_block` is called.
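
The relationship between unit names, `finish_block`, and `keep_uids` can be illustrated with an in-memory stand-in. Everything here is hypothetical (`SketchReader`, `SketchReaderError`, the sequential IDs, and the `committed` list standing in for the database); it is a toy model of the behaviour described above, not the real `Reader`.

```python
from itertools import count


class SketchReaderError(Exception):
    """Hypothetical stand-in for ReaderError."""


class SketchReader:
    def __init__(self):
        self._next_id = count(1)  # pretend database ID generator
        self.uids = {}            # unit name -> pretend database ID
        self.types = {}           # pending unit name -> unit type
        self.features = {}        # pending unit name -> {(tier, feature): value}
        self.committed = []       # rows "written" to the pretend database

    def set_type(self, name, type):
        self.types[name] = type

    def set_feature(self, name, tier, feature, type, value):
        self.features.setdefault(name, {})[(tier, feature)] = value

    def finish_block(self, keep_uids=False):
        names = list(self.types)
        names += [n for n in self.features if n not in self.types]
        for name in names:
            if name not in self.uids:
                if name not in self.types:
                    raise SketchReaderError(f'unit {name!r} has no type')
                # Names become IDs only at this point.
                self.uids[name] = next(self._next_id)
            self.committed.append((self.uids[name], self.features.get(name, {})))
        self.types.clear()
        self.features.clear()
        if not keep_uids:
            self.uids.clear()


reader = SketchReader()
reader.set_type(('sentence', 1), 'sentence')  # a tuple works as a unit name
reader.set_type(0, 'word')                    # so does an int
reader.set_feature(0, 'something', 'lemma', 'str', 'dog')
reader.finish_block(keep_uids=True)

# With keep_uids=True, a later block can still refer to unit 0 by name.
reader.set_feature(0, 'something', 'pos', 'str', 'NOUN')
reader.finish_block()
print(reader.committed)
```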

## `XMLReader`

This subclass parses the input file using [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) and passes an `Element` object to `read_file`.

## `JSONReader`

This subclass parses the input file as JSON and passes a dictionary to `read_file`.

## `LineReader`

This subclass is specialized for text files where linebreaks are meaningful. It is roughly equivalent to the following:

```python
for line in file:
    if self.is_boundary(line):
        self.end()
        self.reset()
    self.process_line(line)
```

- `is_boundary(line)`: should return `True` if `line` is the end of a group of lines or the beginning of a new one; by default it checks if the line is blank
- `process_line(line)`: perform any processing needed on the text of the line
- `end()`: hook to operate at the end of a block; calls `finish_block()`
- `reset()`: set up any needed variables for a new block
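
The way these four hooks cooperate can be tried out with a self-contained stand-in that does not depend on reBabel. `SketchLineReader` is hypothetical: it collects blocks into a list instead of calling `finish_block()`, and the final `end()` call after the loop is an assumption made so the last block is not lost.

```python
import io


class SketchLineReader:
    """Hypothetical stand-in that groups lines separated by blank lines."""

    def __init__(self):
        self.blocks = []
        self.reset()

    def is_boundary(self, line):
        # Mirrors the documented default: a blank line is a boundary.
        return not line.strip()

    def process_line(self, line):
        if line.strip():
            self.current.append(line.strip())

    def end(self):
        # The real hook calls finish_block(); here we just store the block.
        if self.current:
            self.blocks.append(self.current)

    def reset(self):
        self.current = []

    def read_file(self, fin):
        for line in fin:
            if self.is_boundary(line):
                self.end()
                self.reset()
            self.process_line(line)
        self.end()  # assumption: flush the final block at end of input


reader = SketchLineReader()
reader.read_file(io.StringIO('the\ndog\n\nbarks\n'))
print(reader.blocks)
```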
35 changes: 35 additions & 0 deletions docs/writers.md
@@ -0,0 +1,35 @@
# Writers

Writers define a query to extract relevant units and then write them to an output file.

```python
class SomeWriter(Writer):
    identifier = 'something'

    query = {
        'S': {'type': 'sentence'},
        'W': {'type': 'word', 'parent': 'S'},
    }
    query_order = ['S', 'W']

    indent = Parameter(type=str, default='\t')

    def write(self, fout):
        s_feat = self.table.add_features('S', ['something:text'])[0]

        w_feat_names = ['something:lemma', 'something:pos']
        w_feat_ids = self.table.add_features('W', w_feat_names)
        w_lemma = w_feat_ids[0]
        w_pos = w_feat_ids[1]

        current_sentence = None
        for units, features in self.table.results():
            if units['S'] != current_sentence:
                fout.write(str(features[units['S']].get(s_feat, '')) + '\n')
                current_sentence = units['S']
            fout.write(self.indent)
            fout.write(str(features[units['W']].get(w_lemma, '')))
            fout.write(' ')
            fout.write(str(features[units['W']].get(w_pos, '')))
            fout.write('\n')
```
