Support external annotations files to allow selective loading and avoid memory issues #21

Open
ngawangtrinley opened this issue Feb 20, 2024 · 24 comments

@ngawangtrinley

ngawangtrinley commented Feb 20, 2024

We're working on PechaData, a multilingual Buddhist corpus project in collaboration with bdrc.io and pecha.org. As a format, Stam is a dream for our project, and we're starting to build our project on top of it with a mechanism to update annotation coordinates when the base text is updated.

However, our dataset includes many large texts (>10 MB .txt files) with multiple annotation layers that are often larger than the text file itself, and we are concerned about performance if all annotations have to be loaded into memory even when we only need a couple of annotation sets. (For example, we have a file with 15 annotation sets, including POS tags and dependencies, but we only need the text and the annotations for the table of contents.)

Have you considered externalizing annotations in separate files like the .ann files of BrAT or do you have another solution to load annotations selectively? We thought about patching Stam to find a solution but we would much prefer a solution coming from the creators.

Thanks a lot for your work!

@proycon
Collaborator

proycon commented Feb 20, 2024

Hi @ngawangtrinley !

Thanks for your interest in STAM! I'm glad you find the model interesting and useful for your project, so I'd love to support you in this. It's precisely real use cases like yours that have to drive STAM development forward (we're a young software project).

When using STAM JSON there is already a way to externalize things over multiple stand-off files, as the resource texts themselves as well as the annotation datasets can be kept in separate files. So you can have a plain text file for each of your large texts, and an independent annotation data set file for each of the 15 annotation sets. In the STAM JSON serialisation of the annotation store, these are then referenced via the @include mechanism. This is explained here: https://github.com/annotation/stam/blob/master/README.md#multiple-files-and-the-include-statement .
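
As a rough illustration of the layout described above, the sketch below builds an annotation store document whose text resource and datasets are only `@include` references. The key names mirror the STAM JSON quoted later in this thread; the IDs and file names (`example-store`, `text1.txt`, `toc`, `pos`) are invented for the example, and the referenced files are not created here.

```python
import json


def include_entry(stam_type: str, item_id: str, filename: str) -> dict:
    """Build a stand-off reference: the actual content lives in `filename`."""
    return {"@type": stam_type, "@id": item_id, "@include": filename}


# Hypothetical store: one large text, two independent annotation datasets.
store = {
    "@type": "AnnotationStore",
    "@id": "example-store",
    "resources": [include_entry("TextResource", "text1", "text1.txt")],
    "annotationsets": [
        include_entry("AnnotationDataSet", name, f"{name}.dataset.stam.json")
        for name in ("toc", "pos")
    ],
    "annotations": [],
}

store_json = json.dumps(store, indent=2)
print(store_json)
```

The store file itself stays small; each dataset and the text can be versioned and loaded as separate files.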

The current option to have the actual annotations split over multiple files is to write to multiple annotation store files (which may or may not reference the same resources and datasets via the @include mechanism, it's fine if these are shared between stores).

I have not yet implemented a mechanism to conveniently split an existing store into multiple stores, but I can implement this fairly easily. The reverse is already in place: you can load (merge) multiple annotation store files into a single annotation store at run time. Note that you should never work with multiple annotation stores in memory at run time, but you can load from multiple files into one, effectively holding only the subset you need in memory. The caveat is that reserialisation to these split files still needs to be implemented.
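
To make the "load several files into one store" idea concrete, here is a plain-Python model of it operating on dictionaries shaped like STAM JSON. It is not STAM's implementation: `merge_stores` is a hypothetical helper, and the assumption that shared resources and datasets are kept once (keyed by `@id`) is made only for this sketch.

```python
def merge_stores(stores: list[dict]) -> dict:
    """Merge store dictionaries into one, keeping shared items only once."""
    merged = {
        "@type": "AnnotationStore",
        "@id": "merged",
        "resources": [],
        "annotationsets": [],
        "annotations": [],
    }
    seen = {"resources": set(), "annotationsets": set()}
    for store in stores:
        for section in ("resources", "annotationsets"):
            for item in store.get(section, []):
                if item["@id"] not in seen[section]:
                    seen[section].add(item["@id"])
                    merged[section].append(item)
        merged["annotations"].extend(store.get("annotations", []))
    return merged


# Two hypothetical store files that reference the same text resource.
text = {"@type": "TextResource", "@id": "text1", "@include": "text1.txt"}
toc_store = {
    "@type": "AnnotationStore",
    "@id": "toc-store",
    "resources": [text],
    "annotationsets": [
        {"@type": "AnnotationDataSet", "@id": "toc", "@include": "toc.dataset.stam.json"}
    ],
    "annotations": [{"@type": "Annotation", "@id": "toc-1"}],
}
pos_store = {
    "@type": "AnnotationStore",
    "@id": "pos-store",
    "resources": [text],
    "annotationsets": [
        {"@type": "AnnotationDataSet", "@id": "pos", "@include": "pos.dataset.stam.json"}
    ],
    "annotations": [{"@type": "Annotation", "@id": "pos-1"}],
}

merged = merge_stores([toc_store, pos_store])
print(len(merged["resources"]), len(merged["annotationsets"]), len(merged["annotations"]))
```

Loading only `toc_store` would leave the POS layer out of memory entirely, which is the selective loading asked about in the issue.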

I hope this answers your question, let me know if you encounter any problems.

@proycon proycon self-assigned this Feb 20, 2024
@proycon proycon added the question Further information is requested label Feb 20, 2024
proycon added a commit to annotation/stam-python that referenced this issue Feb 21, 2024
This was not exposed yet in the Python API, but only in the Rust API,
slightly related to annotation/stam#21
@ngawangtrinley
Author

The @include mechanism is perfect. I was going to ask about other cases but it seems that this mechanism will solve them too. We'll test it next week and get back to you. :)

I'm pretty sure we will need the easy way to split stores as part of our annotation update setup but we're not there yet. We will let you know if it's not yet implemented when we reach that point.

More next week...

@tenzin3

tenzin3 commented Jul 9, 2024

@proycon, I would like to ask how to store an AnnotationDataset separately in a JSON file. We would also like not to include the base file when storing the different AnnotationDatasets separately.

I referred to the documentation linked here; all of our code is in Python.

proycon added a commit to annotation/stam-python that referenced this issue Jul 10, 2024
This wasn't clearly propagated to the Python binding yet.

Ref: annotation/stam#21
@proycon
Collaborator

proycon commented Jul 10, 2024

@tenzin3 Here's an example of how to store data set and text resource in separate files when constructing data from scratch with Python:

from stam import AnnotationStore, Offset, Selector

# use_include: keep resources and datasets in separate files, referenced via @include
store = AnnotationStore(id="test", config={"use_include": True})
resource = store.add_resource(id="testres", filename="test.txt")
dataset = store.add_dataset(id="testdataset", filename="testdataset.dataset.stam.json")
dataset.add_key("pos")
data = dataset.add_data("pos", "noun", "D1")
# annotate characters 6 to 11 of the resource with pos=noun
store.annotate(
    id="A1",
    target=Selector.textselector(resource, Offset.simple(6, 11)),
    data=data,
)
store.set_filename("test.store.stam.json")
store.save()

The use_include configuration parameter is important here, when constructing a store from scratch. If you have a TextResource or AnnotationDataSet object, you can also call set_filename() on it rather than do it via add_resource()/add_dataset().

Note that you need stam 0.8.3 for this (just released), so you may need to do a pip install -U stam first.
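
The example above refers to `test.txt`, which presumably has to exist on disk and contain at least 11 characters for `Offset.simple(6,11)` to be valid. A minimal way to prepare such a file with the standard library is sketched below; the sample text is made up for illustration.

```python
from pathlib import Path

text = "Hello world"  # sample content, chosen only for illustration
Path("test.txt").write_text(text, encoding="utf-8")

# Assuming offsets are 0-based with an exclusive end, like Python slices,
# the span (6, 11) covers the second word.
selected = text[6:11]
print(selected)
```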

I hope this answers your question.

@tenzin3

tenzin3 commented Jul 11, 2024

@proycon Thank you for your response, that is what we needed in our project.

@ngawangtrinley
Author

ngawangtrinley commented Jul 11, 2024

@proycon some of our text resources are quite big so we've been thinking about splitting them in multiple chunks. Is it possible to do cross-file annotations? For instance a chapter annotation for a chapter spanning 3 text resources.

@proycon
Collaborator

proycon commented Jul 11, 2024 via email

@tenzin3

tenzin3 commented Jul 12, 2024

@proycon, I wanted to ask how you stored the AnnotationDataset ("testdataset.dataset.stam.json") separately, as I see you load it in the line below:

dataset = store.add_dataset(id="testdataset", filename="testdataset.dataset.stam.json")

Currently I am getting an error when loading the file 'annotation_store.json':

stam.PyStamError: [StamError] DeserializationError: Deserialization failed: [StamError] DeserializationError: Deserialization failed: Expected type AnnotationDataSet, got AnnotationStore at line 2 column 29

annotation_store.json:

{
  "@type": "AnnotationStore",
  "@id": "IC17A18B7",
  "resources": [],
  "annotationsets": [
    {
      "@type": "AnnotationDataSet",
      "@id": "root_commentary_6a0",
      "@include": "Root_Segment-e8a.json"
    }
  ],
  "annotations": []
}

Root_Segment-e8a.json:

{
  "@type": "AnnotationStore",
  "@id": "IC17A18B7",
  "resources": [
    {
      "@type": "TextResource",
      "@id": "453",
      "@include": "453.txt"
    }
  ],
  "annotationsets": [
    {
      "@type": "AnnotationDataSet",
      "@id": "root_commentary_6a0",
      "keys": [
        {
          "@type": "DataKey",
          "@id": "Structure Type"
        }
      ],
      "data": [
        {
          "@type": "AnnotationData",
          "@id": "7942390e945f4726b626631c5703e9ad",
          "key": "Structure Type",
          "value": {
            "@type": "String",
            "value": "Root_Segment"
          }
        }
      ]
    }
  ],

@proycon
Collaborator

proycon commented Jul 12, 2024

@tenzin3 Can you show me the code where you created this Root_Segment-e8a.json? It indeed seems to be an AnnotationStore rather than an AnnotationDataSet, hence the error.
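
The mismatch behind that error can be reproduced without STAM at all. The sketch below is a hypothetical pre-flight check (`check_includes` is not part of any library) that compares the `@type` declared for each `@include` entry with the `@type` found in the included JSON file, using trimmed-down copies of the two files from the previous comment.

```python
import json
import tempfile
from pathlib import Path


def check_includes(store_path: Path) -> list[str]:
    """Return a message for every @include whose target has a different @type."""
    store = json.loads(store_path.read_text(encoding="utf-8"))
    problems = []
    for section in ("resources", "annotationsets"):
        for entry in store.get(section, []):
            target = entry.get("@include")
            if target is None or not target.endswith(".json"):
                continue  # inline content or plain-text resource: nothing to compare
            included = json.loads((store_path.parent / target).read_text(encoding="utf-8"))
            if included.get("@type") != entry["@type"]:
                problems.append(
                    f"{target}: expected {entry['@type']}, got {included.get('@type')}"
                )
    return problems


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The included file was saved as a whole store, not as a dataset.
    (root / "Root_Segment-e8a.json").write_text(
        json.dumps({"@type": "AnnotationStore", "@id": "IC17A18B7"}), encoding="utf-8"
    )
    (root / "annotation_store.json").write_text(
        json.dumps(
            {
                "@type": "AnnotationStore",
                "@id": "IC17A18B7",
                "resources": [],
                "annotationsets": [
                    {
                        "@type": "AnnotationDataSet",
                        "@id": "root_commentary_6a0",
                        "@include": "Root_Segment-e8a.json",
                    }
                ],
                "annotations": [],
            }
        ),
        encoding="utf-8",
    )
    problems = check_includes(root / "annotation_store.json")

print(problems)
```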

@tenzin3

tenzin3 commented Jul 12, 2024

@proycon, yes, the file Root_Segment-e8a.json is indeed an AnnotationStore. The following snippet shows how I created it:

new_ann_store = AnnotationStore(id="IC17A18B7")
ann_dataset = new_ann_store.add_dataset(id="root_commentary_6a0")

After this, I save the annotation store as Root_Segment-e8a.json.

I have not yet succeeded in saving an AnnotationDataset separately in a JSON file.

@proycon
Collaborator

proycon commented Jul 12, 2024

Try this, in line with the example I gave earlier:

new_ann_store = AnnotationStore(id="IC17A18B7", config={ "use_include": True })
ann_dataset = new_ann_store.add_dataset(id="root_commentary_6a0", filename="Root_Segment-e8a.json")
new_ann_store.set_filename("annotation_store.json")
new_ann_store.save()

That should give you an annotation_store.json as the AnnotationStore, with Root_Segment-e8a.json being the AnnotationDataSet. That is what you want, if I interpreted things correctly?

@tenzin3

tenzin3 commented Jul 12, 2024

@proycon, yes, I tried it and it saves the AnnotationDataset separately as JSON. However, I would like to save the AnnotationDataset to a JSON file only after adding the annotations to the dataset.

Our plan is to store each separate AnnotationDataset with its annotations in individual JSON files. We will then have a main file, annotation_store.json, which will load only the AnnotationDataset that we need at the moment.

Would it be possible to save the AnnotationDataset object itself to a JSON file? Any guidance you can provide would be greatly appreciated.

@proycon
Collaborator

proycon commented Jul 12, 2024 via email

@ngawangtrinley
Author

ngawangtrinley commented Jul 15, 2024

Thank you for the clarification. Here are a couple of follow-ups:

  • How can we merge two stores? The Python API doesn't seem to have this feature yet.
  • When merging AnnotationStore1, AnnotationStore2 and AnnotationStore3, each of which references TextResource1 via an @include statement, does TextResource1 get loaded 3 times (once with each store)?

@proycon
Collaborator

proycon commented Jul 15, 2024 via email

@tenzin3

tenzin3 commented Jul 19, 2024

@proycon, thank you for the explanation.

One of our main goals is to store translation pairs, such as Tibetan and English translation pairs. We want to store Tibetan and English annotations separately and also store the mapping annotations that link these languages in a separate file.

Although we have successfully created mapping annotations using a composite selector, the stam module allows us to store Tibetan and English text annotations together in the store. Is there a way to store only the mapping annotations in an annotation store?

We want to keep the annotation files separate for the following reasons:

  1. Keep annotation JSON files lightweight.
  2. Easier updates: If we need to modify the Tibetan or English sentence annotations, we don't have to update the alignment file.
  3. Future annotations: As more annotations, like Named Entity Recognition (NER), are added to the Tibetan sentence file, it will prevent the file from becoming too large.

As the picture below shows, when we need annotations for the Tibetan files, we want to load only those, and the same goes for the English files. When we need the Tibetan-English translation mapping, we want to load the alignment annotation file together with the Tibetan and English sentence annotation files.

image

@tenzin3

tenzin3 commented Jul 22, 2024

In a team discussion with @eroux, an idea was proposed to keep the translation alignment file separate from STAM. Instead, we could use our own abstraction for the alignment file and integrate other annotations into STAM. What are your thoughts on this approach?

@proycon
Collaborator

proycon commented Jul 22, 2024

Thanks for including the nice picture, that makes it a lot easier for me to
understand your use case! I understand the three objectives you listed and
agree that those are good principles.

  1. Keep annotation JSON files lightweight

One little sidenote about this first one though; JSON in general isn't a very
lightweight format of course, and STAM JSON is fairly verbose and not optimised
for file size or anything (it should compress fairly well though).
Optimisation does happen inside the STAM library as soon as you read things
into memory. Since you're listing GitHub as a platform, also be aware it will
warn when a file gets over 50 MB and even block files over 100 MB (unless you
use Git LFS).
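
Both remarks can be checked quickly with the standard library. The sketch below serialises some made-up `AnnotationData` entries (shaped like the ones quoted earlier in this thread), compares raw and gzip-compressed sizes, and classifies a size against the two thresholds mentioned above. Treating the thresholds as MiB is an assumption, and `size_status` is a hypothetical helper.

```python
import gzip
import json

WARN_BYTES = 50 * 1024 * 1024    # warning threshold, assumed to be 50 MiB
BLOCK_BYTES = 100 * 1024 * 1024  # blocking threshold, assumed to be 100 MiB


def size_status(num_bytes: int) -> str:
    """Classify a file size against the thresholds above."""
    if num_bytes > BLOCK_BYTES:
        return "blocked"
    if num_bytes > WARN_BYTES:
        return "warning"
    return "ok"


# Made-up, highly repetitive data, as annotation data tends to be.
data = [
    {
        "@type": "AnnotationData",
        "@id": f"D{i}",
        "key": "pos",
        "value": {"@type": "String", "value": "noun"},
    }
    for i in range(1000)
]
raw = json.dumps({"data": data}, indent=2).encode("utf-8")
packed = gzip.compress(raw)

print(len(raw), len(packed), size_status(len(raw)))
```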

Now onto the actual issue: Keeping the alignments separated from the rest in
the way you describe is currently a bit problematic, because in
english_tibetan_alignment.json (repo 3) you're proposing to refer to
annotations using annotation selectors, but those annotations are not in the
same annotation store. So that store can not be loaded without first loading the other
two. That's currently not allowed by the model: An annotation store can depend
on stand-off text resources and stand-off annotation datasets, but it can not
depend on another store. Each must be independently loadable.
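
A small check makes this restriction visible on the JSON level. The selector layout used here (an `AnnotationSelector` with an `annotation` field, a `CompositeSelector` with a `selectors` list) is assumed for the sketch and should be verified against the STAM specification; the IDs are invented and `missing_annotation_targets` is a hypothetical helper.

```python
def missing_annotation_targets(store: dict) -> set[str]:
    """Return IDs of annotations that are targeted but not defined in this store."""
    known = {annotation["@id"] for annotation in store.get("annotations", [])}
    missing: set[str] = set()

    def visit(selector: dict) -> None:
        if selector.get("@type") == "AnnotationSelector":
            if selector["annotation"] not in known:
                missing.add(selector["annotation"])
        for sub in selector.get("selectors", []):
            visit(sub)

    for annotation in store.get("annotations", []):
        visit(annotation["target"])
    return missing


# Hypothetical alignment store (repo 3) pointing at sentences kept in repos 1 and 2.
alignment_store = {
    "@type": "AnnotationStore",
    "@id": "english_tibetan_alignment",
    "annotations": [
        {
            "@type": "Annotation",
            "@id": "align-1",
            "target": {
                "@type": "CompositeSelector",
                "selectors": [
                    {"@type": "AnnotationSelector", "annotation": "bo-sent-1"},
                    {"@type": "AnnotationSelector", "annotation": "en-sent-1"},
                ],
            },
        }
    ],
}

missing = missing_annotation_targets(alignment_store)
print(sorted(missing))
```

A non-empty result means the file cannot stand on its own, which is exactly the situation described above.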

A possible solution is to include tibetan_text.txt and english_text.txt in
english_tibetan_alignment.json (stand-off) and then refer to the text spans directly from
the composite selector (using text selectors underneath). This would still keep the files small.

If you then load everything from 1, 2 and 3 into one store you still have easy
means to relate the sentence annotations and the alignments (because they
reference the same text). The question is if this is sufficiently flexible with
regard to your point 2 (easier updates), as the target spans are duplicated
now. From a STAM perspective, this duplication is not unnatural though, it's a
perfectly fine way to model things to let various annotations refer
directly to the same text span rather than to an abstraction layer on top of that (like
word or sentence annotations for instance).
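
The point about relating annotations through shared text spans can be sketched in plain Python. This is a conceptual model only, not the STAM API: `TextSpan`, the offsets, and the annotation IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSpan:
    """A span of text in a resource: begin inclusive, end exclusive."""
    resource: str
    begin: int
    end: int


# Sentence annotations from repos 1 and 2, keyed by the span they target.
sentences = {
    TextSpan("tibetan_text.txt", 0, 12): "bo-sent-1",
    TextSpan("english_text.txt", 0, 20): "en-sent-1",
}

# Alignments from repo 3 target the text spans directly (span pairs).
alignments = [
    (TextSpan("tibetan_text.txt", 0, 12), TextSpan("english_text.txt", 0, 20)),
]


def aligned_sentence_ids(alignments, sentences):
    """Relate each alignment to the sentence annotations on the same spans."""
    return [(sentences.get(src), sentences.get(tgt)) for src, tgt in alignments]


pairs = aligned_sentence_ids(alignments, sentences)
print(pairs)
```

The duplicated offsets are the cost of this design: if the text changes, both the sentence file and the alignment file need their spans updated.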

In a team discussion with @eroux, an idea was proposed to keep the
translation alignment file separate from STAM. Instead, we could use our own
abstraction for the alignment file and integrate other annotations into STAM.
What are your thoughts on this approach?

Though of course possible, I'd be more inclined (and a bit biased probably) to
let STAM handle the alignments too, then you can reap the maximum benefits from
the existing library, otherwise you need to implement your own logic on top of
things.

Another possible solution, more on my side of things, is to reconsider if the
restriction you're running into now is something that can be solved elegantly
by adapting/expanding the STAM model. In other words, perhaps STAM needs to
allow annotation stores with dependencies on others. I do think you have a fair
use case here, with the three principles you outlined and the way you want to
model it. This is something that needs to be thought through and would take
time to implement, of course.

What do you think? I hope this gives some options to resolve your question.

@tenzin3

tenzin3 commented Jul 22, 2024

@proycon , First, I would like to thank you for your deep thoughts on all our doubts and ideas. Yes, the STAM JSON is indeed very verbose, but we are keen on using STAM due to its diverse features regarding annotations and its speed optimization, being built on Rust.

A possible solution is to include tibetan_text.txt and english_text.txt in
english_tibetan_alignment.json (stand-off) and then refer to the text spans directly from
the composite selector (using text selectors underneath).

We did think of this solution before, and as you mentioned, it would lead to duplication of resource files and annotations, which we are very much trying to avoid.

Another possible solution, more on my side of things, is to reconsider if the
restriction you're running into now is something that can be solved elegantly
by adapting/expanding the STAM model. In other words, perhaps STAM needs to
allow annotation stores with dependencies on others. I do think you have a fair
use case here, with the three principles you outlined and the way you want to
model it. This is something that needs to be thought through and would take
time to implement, of course.

That would be very helpful, especially since we are working with many Tibetan religious texts and their translations. If the STAM model allowed dependencies on other stores, we could keep the Tibetan-English alignment file separate. It could also help with higher-order annotations such as sentence and word annotations: if a user needs sentence annotations, we could load only the sentence annotation file, and if they need word annotations, we could load both files. We have this kind of plan for the future, so STAM adapting to these needs would be a great help.

If possible, we want to integrate all our data in the STAM format and benefit from its potential.

@proycon
Collaborator

proycon commented Jul 22, 2024

Yes, I agree that extending STAM here would probably be the best way to go forward. I do have to carefully consider all the ramifications and then do the implementation, so that will take some time.

I'm already thinking out loud here: The main challenge is the fact that different annotations in a single store (because in memory it is always a single one you work with at any time, in this case you'd load 1+2+3 otherwise you can't make the alignments) would need to be serialized to different STAM JSON files. This relates to the split functionality I already implemented this month, but does go further than it currently does and places some extra demand on things, like keeping track of what annotations come from what store files. It may be the most sensible solution and provide a lot of extra flexibility.

I'll open a new separate issue for it and link it from here.

@tenzin3

tenzin3 commented Jul 23, 2024

@proycon , Thank you for considering this solution and for your willingness to extend STAM to meet our needs.
We understand that implementing these changes will take time and careful thought, and we are grateful for your dedication to finding the best way forward. We eagerly anticipate your update in STAM.

@ngawangtrinley
Author

@proycon thanks a lot for considering this change to the model. Annotation selectors pointing to other stores are a must for applications dealing with literature. It looks like most religious corpora more or less follow the Bible's verse/chapter/book model, in which most annotations target verses.

Allowing stores to link to other stores will make it possible to work at different levels of abstraction. Use cases dealing only with verses and chapters shouldn't need to deal with character offsets. Most importantly, we don't want 100 copies of the verse offsets when we have 100 stores dealing with verses and higher levels of abstraction.

The reason we split things into 3 GitHub repos is to facilitate the offset update process whenever either text resource is updated. If we keep a copy of the text resource in repo 3, we end up with two versions of the text resource and versioning becomes very complex to handle.

What we have in mind is to require that annotations using text selectors always be located with the text resource (annotations at higher levels of abstraction, such as those in repo 3, can live elsewhere).

This is just to ensure that each time we change the text resource and trigger an offset update, it is done for all relevant stores at the same time.
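
A rule like this could be enforced with a small check before committing. The sketch below is one possible reading of it, with "located with the text resource" interpreted as "in the same directory", which is an assumption; the `TextSelector` and `selectors` field names are likewise assumed, and both helper functions and all paths are hypothetical.

```python
from pathlib import PurePosixPath


def uses_text_selector(selector: dict) -> bool:
    """True if the selector, or any selector nested in it, targets text directly."""
    if selector.get("@type") == "TextSelector":
        return True
    return any(uses_text_selector(sub) for sub in selector.get("selectors", []))


def violates_colocation(store_path: str, resource_path: str, store: dict) -> bool:
    """True if a store with text selectors is not in the resource's directory."""
    same_place = PurePosixPath(store_path).parent == PurePosixPath(resource_path).parent
    has_text_selector = any(
        uses_text_selector(annotation["target"])
        for annotation in store.get("annotations", [])
    )
    return has_text_selector and not same_place


sentence_store = {
    "annotations": [
        {"@id": "bo-sent-1", "target": {"@type": "TextSelector", "resource": "tibetan_text"}}
    ]
}

ok = violates_colocation("repo1/sentences.json", "repo1/tibetan_text.txt", sentence_store)
bad = violates_colocation("repo3/sentences.json", "repo1/tibetan_text.txt", sentence_store)
print(ok, bad)
```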

@tenzin3

tenzin3 commented Oct 10, 2024

You can have multiple annotation stores reference the same annotationdataset(s) and textresource(s) as long as you're careful to never work with multiple stores in memory or on disk at the same time.

@proycon, can you elaborate on why we can't load multiple annotation stores in memory at the same time?

english_words = AnnotationStore(...)
french_words = AnnotationStore(...)

# datasets and resources added, set_filename() called on each store (omitted)

for word in words:
    if is_english_word(word):
        english_words.annotate(...)
    elif is_french_word(word):
        french_words.annotate(...)

english_words.save()
french_words.save()

Would writing code like the above cause any problems? For our project, we ideally want to hold all the annotations in memory first (there will be more than one annotation store) and then, after validation, write them all out at once.

@proycon
Collaborator

proycon commented Oct 10, 2024 via email

3 participants