Avoid processing info in item IDs #1189

Draft
wants to merge 3 commits into
base: dev

Conversation

TomAugspurger

This proposes a change to the Item ID best practices, based on some experiences and conversations with folks like @gadomski.

In my experience, many upstream data providers (USGS / Landsat and MODIS, Copernicus / Sentinel) include some kind of "processing timestamp" in their IDs. They'll occasionally reprocess assets, leading to new upstream IDs with the same "acquisition" timestamp but a new "processing" timestamp (what happens to the old assets varies, but I don't think that matters for this discussion).

It's fundamentally ambiguous whether a reprocessed item is the "same" as an existing item. But I think the best recommendation is that the new, reprocessed item / assets should replace the old item / assets. That satisfies the common case of "Give me the item at this datetime over this area". If the processing datetime is included in the item ID, then a provider would have to either:

  1. Delete the old item, breaking anything linking directly to it
  2. Keep both the old and new items, causing "duplicate" items with the same spatio-temporal footprint (differing only by processing stuff).

Between the versioning and processing extensions, STAC has all the building blocks to handle this elegantly. So this PR updates the recommendation to use those instead of stuffing a processing timestamp in the item ID.
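As a rough sketch of what this could look like for Landsat-style IDs: the code below assumes the Collection 2 product identifier layout, where the fifth underscore-separated field is the processing date, and moves that field out of the item ID into a `processing:datetime` property. `split_product_id` and `item_fields` are hypothetical helper names, and the property name should be checked against the processing extension schema before relying on it.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple


def split_product_id(product_id: str) -> Tuple[str, str]:
    """Split a Landsat-style product ID into (stable item ID, processing datetime)."""
    parts = product_id.split("_")
    if len(parts) != 7:
        raise ValueError(f"unexpected product ID layout: {product_id!r}")
    # Assumption: field 5 (index 4) is the processing date as YYYYMMDD.
    processed = datetime.strptime(parts[4], "%Y%m%d").replace(tzinfo=timezone.utc)
    # Drop the processing date so reprocessing does not change the item ID.
    item_id = "_".join(parts[:4] + parts[5:])
    return item_id, processed.strftime("%Y-%m-%dT%H:%M:%SZ")


def item_fields(product_id: str) -> Dict[str, object]:
    """Build the ID and processing property for a (partial) STAC Item."""
    item_id, processed = split_product_id(product_id)
    return {"id": item_id, "properties": {"processing:datetime": processed}}
```

With this split, two upstream products that differ only in processing date map to the same item ID, and the processing date remains available as a property.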

@pieschker

The processing information (dates/versions) is also in the native metadata for the observation. This has been a contested subject for a while. This could be a good solution.

Review thread on best-practices.md (outdated, resolved)
Co-authored-by: Pete Gadomski
@m-mohr m-mohr self-requested a review October 18, 2022 12:35
@TomAugspurger
Author

CI is passing now.

@emmanuelmathot emmanuelmathot self-requested a review October 19, 2022 06:25
Collaborator

@emmanuelmathot emmanuelmathot left a comment

This is good input for the ID best practices. LGTM.

Collaborator

@m-mohr m-mohr left a comment

I don't fully agree with this one. If you use the version extension, then you will need the processing timestamp in the ID as you'll need two distinct Items which you can link between. Storing them under the same ID would conflict with the unique ID constraint. I think this should be made more clear in the description and the provided solution with using the version extension and the same IDs doesn't work with the unique ID best practice.

@emmanuelmathot
Collaborator

emmanuelmathot commented Oct 19, 2022

> If you use the version extension, then you will need the processing timestamp in the ID as you'll need two distinct Items which you can link between. Storing them under the same ID would conflict with the unique ID constraint. I think this should be made more clear in the description and the provided solution with using the version extension and the same IDs doesn't work with the unique ID best practice.

It all depends on the catalog implementation. With a static catalog, you can still use the same ID but with a different path that includes the version. The reference in the collection will be the "latest" version with the unique ID. Then in the item, you link to the previous version, still with the same ID but at a different path that includes the version. In a STAC API, this is even simpler using the version API extension.
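A minimal sketch of that static-catalog layout, under stated assumptions: the path scheme in `archived_href` is purely hypothetical (nothing in the spec prescribes it), both helper names are made up for illustration, and only the `version` field and the `predecessor-version` rel type come from the version extension.

```python
import copy
from typing import Any, Dict

Item = Dict[str, Any]


def archived_href(collection_href: str, item: Item) -> str:
    """Build the href where a superseded version of an item would be kept."""
    # Hypothetical layout: <collection>/<item-id>/versions/<version>/<item-id>.json
    version = item["properties"]["version"]
    return f"{collection_href}/{item['id']}/versions/{version}/{item['id']}.json"


def publish_new_version(current: Item, updated: Item, archive_href: str) -> Item:
    """Return a copy of `updated` that links back to the archived `current`."""
    if current["id"] != updated["id"]:
        raise ValueError("both versions are expected to share one item ID")
    latest = copy.deepcopy(updated)  # leave the caller's item untouched
    latest.setdefault("links", []).append(
        {"rel": "predecessor-version", "href": archive_href, "type": "application/geo+json"}
    )
    return latest
```

The collection would keep pointing at the single "latest" file, while the older file lives at the versioned path and is reachable only through the link.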

In my understanding, the main concept is that an item is always unique regardless of its version.

@m-mohr
Collaborator

m-mohr commented Oct 19, 2022

Yes, but the spec says:

> It is important that an Item identifier is unique within a Collection, and that the Collection identifier in turn is unique globally. Then the two can be combined to give a globally unique identifier. Items are strongly recommended to have Collections, and not having one makes it more difficult to be used in the wider STAC ecosystem. If an Item does not have a Collection, then the Item identifier should be unique within its root Catalog or root Collection.

and the best practice adds:

> One of the key properties is the ID. [...] they just need to be sure it is globally unique, so may need a prefix.

That's what is written in the spec, just a paragraph above the addition. Read next to it, the addition is contradictory and confusing. So this should be better explained, and either the proposed solution with the version extension should be clarified or the uniqueness constraint needs to be weakened.
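To make the conflict concrete, here is a small sketch of the uniqueness rule as quoted. Joining the collection ID and item ID with `/` is an assumption made for illustration (the spec only says the two can be combined), and `global_id` and `duplicate_ids` are hypothetical helpers.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

Item = Dict[str, Any]


def global_id(item: Item) -> str:
    """Combine collection ID and item ID into one identifier."""
    # Assumption: "/" as the separator; the spec does not prescribe one.
    return f"{item['collection']}/{item['id']}"


def duplicate_ids(items: Iterable[Item]) -> List[str]:
    """Return combined identifiers that occur more than once."""
    counts = Counter(global_id(item) for item in items)
    return sorted(gid for gid, n in counts.items() if n > 1)
```

Two versions of an item stored under the same ID in the same collection would both map to the same combined identifier, so a check like this flags them.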

@emmanuelmathot
Collaborator

What is proposed does not contradict the principle of ID uniqueness. You can manage multiple versions of the same STAC Item with a unique ID but two different files. In a collection or globally, there is still a unique STAC Item. Then it is up to the implementation to manage which version to get according to the link or the API.

On the other hand, this is certainly not what space agencies do in practice. Most of them, including NASA and ESA with Landsat, the Sentinels, and many others, include the processing ID, date, or archive version in the filename.

@emmanuelmathot emmanuelmathot self-requested a review October 19, 2022 09:55
@TomAugspurger
Author

TomAugspurger commented Oct 19, 2022

> If you use the version extension, then you will need the processing timestamp in the ID as you'll need two distinct Items which you can link between.

Oh, I may have misunderstood the version extension. I thought you had two items with the same ID: item-a (version 0) and item-a (version 1). And then the latest version would be available at /collections/<collection>/item-a and would include a link to the old version at (e.g.) /versioned/<collection>/item-a/ (I don't know exactly what the path would be).


I guess going back to the thing that originally motivated this: Say you have some software that generates a level-2 product from level-1 data (like sen2cor). If I run that at 8:00 and again at 9:00, the actual data assets should be byte-for-byte identical. And while the filenames might differ because they have a processing time, I'd argue that the STAC ID should not include the processing time.
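One way to test the "byte-for-byte identical" expectation is to compare checksums of the assets from the two runs. This is only a sketch: it takes asset contents as in-memory `bytes` keyed by asset name to stay self-contained (real assets would be read from files or object storage), and `digest` and `same_assets` are hypothetical helper names.

```python
import hashlib
from typing import Dict


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an asset's bytes."""
    return hashlib.sha256(data).hexdigest()


def same_assets(run_a: Dict[str, bytes], run_b: Dict[str, bytes]) -> bool:
    """True when both runs produced the same asset keys with identical bytes."""
    if run_a.keys() != run_b.keys():
        return False
    return all(digest(run_a[key]) == digest(run_b[key]) for key in run_a)
```

If the check passes, the two runs differ only in processing time, which supports keeping one item ID for both.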

It's a bit more complicated when talking about changes to the actual processing software rather than just different processing times. In that case the outputs might not be byte-for-byte identical and so you could argue that item-a (processed with version 1) is distinct from item-a (processed with version 2). But as a best practice I'd say we probably want people using the latest and so the item ID should not include that information.

As a user, I (probably) want the "latest" (best) version of the assets for a particular spatio-temporal footprint. I (probably) don't want to have to think about choosing between multiple items with the same spatio-temporal footprint. And for the less typical case where you do want the "old" version, we have the version extension.

@gadomski
Collaborator

> As a user, I (probably) want the "latest" (best) version of the assets for a particular spatio-temporal footprint. I (probably) don't want to have to think about choosing between multiple items with the same spatio-temporal footprint.

I agree with this as a motivating principle, and think that this could eventually be hardened into a Best Practice, i.e.: "Within a single collection, it is considered best practice to only have one non-deprecated item with a given spatio-temporal footprint." (see what I did there w/ the non-deprecated thing? More on that later)
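A rough sketch of how such a rule could be checked, assuming items are plain dictionaries and that "same footprint" means identical collection, `datetime`, and geometry (a deliberate simplification; real footprints may need tolerance-based comparison). `footprint_key` and `footprint_conflicts` are hypothetical helpers; `deprecated` is the field from the version extension.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def footprint_key(item: Item) -> Tuple[Any, Any, str]:
    """Key an item by collection, datetime, and exact geometry."""
    return (
        item.get("collection"),
        item["properties"].get("datetime"),
        json.dumps(item.get("geometry"), sort_keys=True),
    )


def footprint_conflicts(items: Iterable[Item]) -> List[List[str]]:
    """Return groups of non-deprecated item IDs that share one footprint."""
    groups: Dict[Tuple[Any, Any, str], List[str]] = defaultdict(list)
    for item in items:
        if item["properties"].get("deprecated", False):
            continue  # deprecated items do not count against the rule
        groups[footprint_key(item)].append(item["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

An empty result means the collection satisfies the proposed practice; each returned group is a set of competing "current" items for one footprint.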

@m-mohr is correct that the unique-ID constraint forces us to include some sort of version information (whether it's processing datetime, an incrementing integer, a hash, whatever) in the item ID if we want to support item versions within a single collection. Which leads to three possible solutions (as I see it):

  1. Remove all version information from the ID and only support "latest" versions in the collection
  2. Modify the spec to allow non-unique IDs (whoa)
  3. Include some sort of version information in the item ID

I think 1 is fine, but I think we can do better. My proposal:

  1. Include version information in the item ID
  2. Use the deprecated field in the version extension to mark all non-latest items as deprecated=True
  3. Update the tooling to ignore deprecated items by default
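The first two steps of this proposal could be sketched as follows, assuming items are plain dictionaries and that the caller already knows where each item is published. `supersede` is a hypothetical helper; the `deprecated` field and the `successor-version`, `latest-version`, and `predecessor-version` rel types are the ones defined by the version extension.

```python
import copy
from typing import Any, Dict, Tuple

Item = Dict[str, Any]


def supersede(old: Item, old_href: str, new: Item, new_href: str) -> Tuple[Item, Item]:
    """Return copies of (old, new) where old is deprecated and both are cross-linked."""
    old_out, new_out = copy.deepcopy(old), copy.deepcopy(new)

    # The non-latest item is flagged and points forward to its replacement.
    old_out["properties"]["deprecated"] = True
    old_out.setdefault("links", []).extend(
        [
            {"rel": "successor-version", "href": new_href},
            {"rel": "latest-version", "href": new_href},
        ]
    )

    # The latest item stays non-deprecated and points back to what it replaced.
    new_out["properties"]["deprecated"] = False
    new_out.setdefault("links", []).append(
        {"rel": "predecessor-version", "href": old_href}
    )
    return old_out, new_out
```

With more than two versions, the `latest-version` links on all older items would also need refreshing each time; that bookkeeping is left out here.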

Real-world example

Currently, the USGS has The Worst solution to the problem at hand (at least for Landsat). They:

  1. Include processing datetime in the item ID
  2. Remove old assets after reprocessing, but
  3. Keep old items after reprocessing

This leads to duplicate items for a given spatio-temporal footprint, where all but the latest items have 404 asset hrefs.

Under scenario 1 above (no processing datetime in item IDs), the USGS would remove processing datetime from item IDs, and the re-processed items would have updated (presumably, more correct) assets. This is a good thing -- new searches will fetch only a single item per footprint, and that item will have "the best" data. So scenario 1 works. However, if the USGS wanted (in the future) to implement the version extension in its entirety to provide processing provenance, they couldn't -- only one item for a given spatio-temporal footprint could exist in the collection. Additionally, any "frozen" items or feature collections (e.g. part of a publication) would have their assets changing, possibly in significant ways, without the knowledge of the user.

Scenario 3 (use deprecated) requires a bit more ecosystem work, but allows us to support the version extension while still providing the best user experience (search for a thing, get one item per footprint).
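The tooling side of this (ignore deprecated items by default) might look like the sketch below. It assumes items are plain dictionaries; `visible_items` and its `include_deprecated` flag are hypothetical and do not correspond to an existing library API.

```python
from typing import Any, Dict, Iterable, List

Item = Dict[str, Any]


def visible_items(items: Iterable[Item], include_deprecated: bool = False) -> List[Item]:
    """Filter out deprecated items unless the caller explicitly asks for them."""
    if include_deprecated:
        return list(items)
    return [
        item
        for item in items
        # A missing `deprecated` field is treated as "not deprecated".
        if not item.get("properties", {}).get("deprecated", False)
    ]
```

A default of `include_deprecated=False` gives the "one item per footprint" experience, while provenance-minded users can opt in to the full history.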

cc @matthewhanson, @pjhartzell, @ircwaves, and @arthurelmes (who joined me in a chat about this topic this week)

@TomAugspurger
Author

Thanks Pete, your proposal sounds pretty solid. I think there are some details to work out (does iterating the items in a collection include deprecated items?) but it sounds workable. It solves my main issue with processing information in the IDs today and has the advantage of not silently changing the assets referenced by an item ID (at least I think that's an advantage... I suppose it's not always clear).

@m-mohr
Collaborator

m-mohr commented Oct 24, 2022

Dev call:

  • Change uniqueness constraint for Items to be: id + version should be unique per collection
  • Move version extension fields and rel types to common metadata in v1.1?!
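A sketch of what the relaxed constraint from the first bullet could mean in practice, assuming items are plain dictionaries and that an item without a `version` field is treated as version `""` (an assumption made here, not something decided on the call). `version_key` and `duplicate_versions` are hypothetical helpers.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]
Key = Tuple[str, str, str]


def version_key(item: Item) -> Key:
    """Key an item by collection, ID, and version."""
    # Assumption: items without a `version` field are treated as version "".
    return (item["collection"], item["id"], item["properties"].get("version", ""))


def duplicate_versions(items: Iterable[Item]) -> List[Key]:
    """Return (collection, id, version) keys that occur more than once."""
    counts = Counter(version_key(item) for item in items)
    return sorted(key for key, n in counts.items() if n > 1)
```

Under this rule, two items that share an ID but carry different versions are allowed, and only a repeated ID and version pair within a collection is reported.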

@pieschker

Is there an update on the PR?

@m-mohr m-mohr added this to the 1.1 milestone Jun 27, 2024
@m-mohr m-mohr marked this pull request as draft July 30, 2024 14:35
@m-mohr m-mohr modified the milestones: 1.1, 1.2 Aug 2, 2024