Attached detached lite #390

Open
wants to merge 27 commits into
base: main
Changes from 20 commits
ad56acc
Initial pass at clarifying Attached/Detached in which I changed Attac…
ptsefton Dec 31, 2024
9ecc6ae
Further tidying up after merge
ptsefton Dec 31, 2024
2d04c59
Fix conflicts
ptsefton Dec 31, 2024
c3d4da1
Fixed wording
ptsefton Dec 31, 2024
d3e950d
Found some formatting issues and typos
ptsefton Dec 31, 2024
9aa0928
Further tweaks to pseudo code
ptsefton Dec 31, 2024
90bcaeb
Removed conflict
ptsefton Jan 1, 2025
dd86e94
More tidying up of algorithm for File @id
ptsefton Jan 1, 2025
bb82e7e
ADded new Detached RO-Crate Metadata File
ptsefton Jan 1, 2025
1ef66de
Tightened up rules about detached File Data Entities
ptsefton Jan 8, 2025
963eba9
Merge branch 'ResearchObject:main' into main
ptsefton Jan 8, 2025
2119467
Reverted to attached/detatched a few tweaks
ptsefton Jan 9, 2025
02a2426
Added new general note
ptsefton Jan 9, 2025
d556f84
Added new general note
ptsefton Jan 9, 2025
2c08726
Futher edits to note on 1.2
ptsefton Jan 9, 2025
bcc30d7
Update root-data-entity.md
ptsefton Jan 9, 2025
75da32c
Removing some of the extra stuff I put in and simplifying the attache…
ptsefton Jan 9, 2025
700c1eb
Cleaning up
ptsefton Jan 9, 2025
b241c33
Typos
ptsefton Jan 9, 2025
c56c8d8
Draft simpler treatment of Attached / Detached
ptsefton Jan 10, 2025
5f6f202
Update docs/_specification/1.2-DRAFT/data-entities.md
ptsefton Jan 10, 2025
da2db55
Update docs/_specification/1.2-DRAFT/data-entities.md
ptsefton Jan 10, 2025
5090e85
Update docs/_specification/1.2-DRAFT/data-entities.md
ptsefton Jan 10, 2025
78731d9
Update docs/_specification/1.2-DRAFT/data-entities.md
ptsefton Jan 10, 2025
d4f5878
Update docs/_specification/1.2-DRAFT/structure.md
ptsefton Jan 10, 2025
9958603
Update docs/_specification/1.2-DRAFT/structure.md
ptsefton Jan 10, 2025
0812d95
Update docs/_specification/1.2-DRAFT/structure.md
ptsefton Jan 10, 2025
148 changes: 78 additions & 70 deletions docs/_specification/1.2-DRAFT/data-entities.md
@@ -34,25 +34,25 @@ parent: RO-Crate 1.2-DRAFT

The primary purpose for RO-Crate is to gather and describe a set of _Data entities_ in the form of:

* Files
* Files which are datastreams available on the local file system or over the web
* Directories
* Web resources


The data entities can be further described by referencing [contextual entities](contextual-entities) such as persons, organizations and publications.

## Referencing files and folders from the Root Data Entity

Where files and folders are represented as _Data Entities_ in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the [Root Data Entity](root-data-entity) using the [hasPart] property. Directory hierarchies MAY be represented with nested [Dataset] _Data Entities_, or the Root Data Entity MAY refer to files anywhere in the hierarchy using [hasPart].
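
As a rough illustration of the linking rule above, the following Python sketch walks `hasPart` from the Root Data Entity and reports data entities that are never reached. The function name `unlinked_data_entities`, the default root `@id` of `./`, and the choice to treat only `File` and `Dataset` typed entities as data entities are assumptions made for this example, not part of the specification.

```python
def unlinked_data_entities(graph, root_id="./"):
    """Return the @id of every File/Dataset entity not reachable via hasPart.

    Assumes `graph` is the list under "@graph" in the RO-Crate JSON-LD and
    that the Root Data Entity has the @id given by `root_id`.
    """
    by_id = {entity["@id"]: entity for entity in graph}
    seen, stack = set(), [root_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parts = by_id.get(current, {}).get("hasPart", [])
        if isinstance(parts, dict):  # a single reference rather than a list
            parts = [parts]
        stack.extend(part["@id"] for part in parts)

    def is_data_entity(entity):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        return "File" in types or "Dataset" in types

    return [e["@id"] for e in graph
            if is_data_entity(e) and e["@id"] not in seen]
```

A crate in which a file is linked only through a nested `Dataset` passes this check, because the traversal follows `hasPart` indirectly as well as directly.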

_Data Entities_ representing files MUST have `"File"` as a value for `@type`. `File` is an RO-Crate alias for <http://schema.org/MediaObject>. The term _File_ includes:
- _Attached_ resources where `@id` is a URI (path) relative to the _RO-Crate Root_ which MUST resolve to a file.
- _Detached_ "downloadable" resources where `@id` is an absolute URI which resolves to a single datastream that can be downloaded and saved as a file. _Detached_ Files SHOULD NOT reference intermediate resources such as splash-pages, search services or web-based viewer applications.
_Data Entities_ representing files MUST have `"File"` as a value for `@type`. `File` is an RO-Crate alias for <http://schema.org/MediaObject>. The term _File_ includes:
- _Attached_ resources, which are available locally, and
- _Detached_ "downloadable" resources, which can be downloaded and saved as a file.

_Data Entities_ representing directories MUST have `Dataset` as a value for `@type`. The term _directory_ here includes HTTP file listings where `@id` is an absolute URI, however "external," _Detached_ directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate). It follows that the _RO-Crate Root_ is itself a data entity.
The rules for the `@id` property of Files are set out below.

_Data Entities_ can also be other types, for instance an online database. These SHOULD have a `@type` of [CreativeWork] (or one of its subtypes) and typically have a `@id` which is an absolute URI.
_Data Entities_ representing directories MUST have `Dataset` as a value for `@type`. The term _directory_ here includes HTTP file listings where `@id` is an absolute URI, however "external," _Detached_ directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate). It follows that the _RO-Crate Root_ is itself a data entity.

In all cases, `@type` MAY be an array in order to also specify a more specific type, e.g. `"@type": ["File", "ComputationalWorkflow"]`
In all cases, `@type` MAY be an array to also specify a more specific type, e.g. `"@type": ["File", "ComputationalWorkflow"]`

There is no requirement to represent _every_ file and folder in an RO-Crate as _Data Entities_ in the _RO-Crate JSON-LD_. Reasons for not describing files would include that the files:
- are described in some other way, for example a manifest or another package management system,
@@ -63,9 +63,64 @@ There is no requirement to represent _every_ file and folder in an RO-Crate as _
In any of the above cases where files are not described, a directory containing a set of files _MAY_ be described using a `Dataset` _Data Entity_ that encapsulates the files with a `description` property that explains the contents. If the RO-Crate file structure is flat, or files are not grouped together, a `description` property on the _Root Data Entity_ may be used, or a `Dataset` with a local reference beginning with `#` (e.g. to describe a certain type of file which occurs throughout the crate). This approach is recommended for RO-Crates which are to be deposited in a long-term archive.


## Core Metadata for Data Entities


### Encoding file paths

Note that all `@id` [identifiers must be valid URI references](appendix/jsonld#describing-entities-in-json-ld), care must be taken to express any relative paths using `/` separator, correct casing, and escape special characters like space (`%20`) and percent (`%25`), for instance a _File Data Entity_ from the Windows path `Results and Diagrams\almost-50%.png` becomes `"@id": "Results%20and%20Diagrams/almost-50%25.png"` in the _RO-Crate JSON-LD_.

In this document the term _URI_ includes international *IRI*s; the _RO-Crate Metadata Document_ is always UTF-8 and international characters in identifiers SHOULD be written using native UTF-8 characters (*IRI*s), however traditional URL encoding of Unicode characters with `%` MAY appear in `@id` strings. Example: `"@id": "面试.mp4"` is preferred over the equivalent `"@id": "%E9%9D%A2%E8%AF%95.mp4"`
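
The encoding rules above can be sketched in Python as follows. This is a minimal illustration, assuming the input is a relative Windows path with no drive letter; the helper name `windows_path_to_id` is made up for this example. ASCII characters that are not URI-safe are percent-encoded, while non-ASCII characters are left as native UTF-8, which is the form the text above prefers.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def _escape_segment(segment: str) -> str:
    # Percent-encode unsafe ASCII characters (space, %, ...) one at a time,
    # but keep non-ASCII characters as-is so the result is an IRI.
    return "".join(ch if ord(ch) > 127 else quote(ch, safe="")
                   for ch in segment)


def windows_path_to_id(path: str) -> str:
    """Convert a relative Windows path into a relative URI usable as @id.

    Assumption: `path` is relative (no drive letter or leading backslash).
    """
    parts = PureWindowsPath(path).parts
    return "/".join(_escape_segment(part) for part in parts)
```

Applied to the path from the example above, `windows_path_to_id(r"Results and Diagrams\almost-50%.png")` gives `Results%20and%20Diagrams/almost-50%25.png`. Casing is left untouched, so the caller still has to supply the path exactly as it is stored.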


### File Data Entity {#file-data-entity}

A [File] _Data Entity_ MUST have the following properties:

* `@type`: MUST be `File`, or an array where `File` is one of the values.
* `@id`: MUST be a relative or absolute URI.

A [File] MAY also have a `contentUrl` property which links to an online copy of the file.

Further constraints on the `@id` are dependent on whether the [File] entity is being considered as part of an _Attached RO-Crate Package_ or _Detached RO-Crate Package_.

If an `@id` is a relative URI then it is treated as a `filePath`, which is calculated by appending the `@id` to the `RO-Crate Root`.

Both `@id` and `contentUrl` may be used in a variety of combinations:

1. For an _Attached RO-Crate Package_:
   * If a `contentUrl` is present, `@id` MUST be a valid relative URI reference and `contentUrl` must be an absolute URI. In this case a file may or may not be present at `filePath`. If it is not present, then the `contentUrl` property is an indication that the File content may be sourced from that URL.
   * If no `contentUrl` is present, `@id` MUST be one of either:
     a. A relative URI, indicating that a file MUST be present at `filePath` when validating a package.
     b. An absolute URI, indicating that the entity is a [Web-based Data Entity](#web-based-data-entity).

2. For a _Detached RO-Crate Package_, all [File] Data Entities are [Web-based Data Entities](#web-based-data-entity):
   * If a `contentUrl` is present, `@id` MUST be a valid relative URI reference and `contentUrl` must be an absolute URI. The presence of the `contentUrl` property is an indication that the File content may be sourced from that URL, and if the _Detached RO-Crate Package_ were to be converted to an _Attached RO-Crate Package_ the `@id` indicates the `filePath` to use for saving a local copy of the [File].
   * If there is no `contentUrl`, `@id` MUST be an absolute URI.
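
The combinations listed above can be expressed as a small check. The Python sketch below is only an illustration of this draft wording: the function name `check_file_id` and its `attached` flag are invented for the example, `contentUrl` values are assumed to be plain strings, and the check does not look at the file system, so the requirement that a file be present at `filePath` is not covered.

```python
from urllib.parse import urlparse


def is_absolute_uri(value: str) -> bool:
    # A URI reference with a scheme (https:, ftp:, ...) is treated as absolute.
    return bool(urlparse(value).scheme)


def check_file_id(entity: dict, attached: bool) -> list:
    """Return a list of problems with the @id / contentUrl combination."""
    problems = []
    file_id = entity.get("@id", "")
    urls = entity.get("contentUrl", [])
    if isinstance(urls, str):  # a single value rather than a list
        urls = [urls]

    if urls:
        # Same rule for Attached and Detached packages.
        if is_absolute_uri(file_id):
            problems.append("@id must be relative when contentUrl is present")
        for url in urls:
            if not is_absolute_uri(url):
                problems.append(f"contentUrl is not an absolute URI: {url}")
    elif not attached and not is_absolute_uri(file_id):
        # Detached package without contentUrl: the @id itself must resolve.
        problems.append("@id must be an absolute URI in a detached package "
                        "when no contentUrl is given")
    return problems
```

An empty list means the entity fits one of the permitted combinations; in an Attached package without `contentUrl`, both relative and absolute `@id` values are accepted here, matching cases a and b.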




Additionally, `File` entities SHOULD have:

* [name] giving a human readable name (not necessarily the filename)
* [description] giving a longer description, e.g. the role of this file within this crate
* [encodingFormat] indicating the IANA [media type] as a string (e.g. `"text/plain"`) and/or a reference to a [file format](#adding-detailed-descriptions-of-encodings) contextual entity.
* [conformsTo] to a contextual entity of type [Profile] that indicates a [profile](profiles) of the encoding format, if applicable
* [contentSize] with the size of the file in bytes

RO-Crate's `File` is an alias for schema.org type [MediaObject], any of its properties MAY also be used (adding contextual entities as needed). [Files on the web](#embedded-data-entities-that-are-also-on-the-web) SHOULD also use `identifier`, `url`, `subjectOf`, and/or `mainEntityOfPage`.


{.note}
> It is up to implementers to decide whether to offer some form of URL "link checker" service for [Web-based Data Entities](#web-based-data-entity) in both Attached and Detached RO-Crate Packages. If `contentUrl` has more than one value then a checker service SHOULD try each provided value until one resolves and returns the correct [contentSize].
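
One possible shape for such a checker is sketched below in Python. It is an assumption-laden illustration rather than a prescribed design: the names `http_content_length` and `first_working_url` are invented, the size is read from the `Content-Length` header of a `HEAD` request (which some servers omit), and the fetch function is passed in as a parameter so that the logic can be exercised without network access.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length reported for `url`, or None on any failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
            return int(length) if length is not None else None
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs and non-numeric Content-Length values.
        return None


def first_working_url(
    urls: Iterable[str],
    expected_size: int,
    fetch_size: Callable[[str], Optional[int]] = http_content_length,
) -> Optional[str]:
    """Try each contentUrl in order; return the first whose size matches."""
    for url in urls:
        if fetch_size(url) == expected_size:
            return url
    return None
```

A caller would pass the entity's `contentUrl` values and its `contentSize` converted to an integer. A URL that resolves but reports no size is treated as unconfirmed in this sketch.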



### Example linking to a file and folders

### _Attached RO-Crate Package_

Example linking to a file and folders

```
<RO-Crate root>/
@@ -123,7 +178,7 @@ An example _RO-Crate JSON-LD_ for the above would be as follows:

If the dataset contained a large number of `*.ai` files which were spread throughout the crate structure and which did not have `File Data Entities` then an approach to describing them would be:

```
```json
{
"@id": "./",
"@type": [
@@ -234,43 +289,25 @@ Some generic file formats like `application/json` may be specialized using a _pr
The [Metadata Descriptor](root-data-entity#ro-crate-metadata-descriptor) `ro-crate-metadata.json` is not a data entity, but is described with `conformsTo` to an _implicit contextual entity_ for the RO-Crate specification, a profile of [JSON-LD](appendix/jsonld). RO-Crates themselves can be specialized using [Profile Crates](profiles), specified with `conformsTo` on the root data entity.


## Core Metadata for Data Entities


### Encoding file paths

Note that all `@id` [identifiers must be valid URI references](appendix/jsonld#describing-entities-in-json-ld), care must be taken to express any relative paths using `/` separator, correct casing, and escape special characters like space (`%20`) and percent (`%25`), for instance a _File Data Entity_ from the Windows path `Results and Diagrams\almost-50%.png` becomes `"@id": "Results%20and%20Diagrams/almost-50%25.png"` in the _RO-Crate JSON-LD_.

In this document the term _URI_ includes international *IRI*s; the _RO-Crate Metadata Document_ is always UTF-8 and international characters in identifiers SHOULD be written using native UTF-8 characters (*IRI*s), however traditional URL encoding of Unicode characters with `%` MAY appear in `@id` strings. Example: `"@id": "面试.mp4"` is preferred over the equivalent `"@id": "%E9%9D%A2%E8%AF%95.mp4"`


### File Data Entity

A [File] _Data Entity_ MUST have the following properties:

* `@type`: MUST be `File`, or an array where `File` is one of the values.
* `@id`: MUST be either a _URI Path_ relative to the _RO-Crate root_ which MUST resolve to a file that is present in the _RO-Crate Root_, or an absolute URI.

Additionally, `File` entities SHOULD have:

* [name] giving a human readable name (not necessarily the filename)
* [description] giving a longer description, e.g. the role of this file within this crate
* [encodingFormat] indicating the IANA [media type] as a string (e.g. `"text/plain"`) and/or a reference to a [file format](#adding-detailed-descriptions-of-encodings) contextual entity.
* [conformsTo] to a contextual entity of type [Profile] that indicates a [profile](profiles) of the encoding format, if applicable
* [contentSize] with the size of the file in bytes

RO-Crate's `File` is an alias for schema.org type [MediaObject], any of its properties MAY also be used (adding contextual entities as needed). [Files on the web](#embedded-data-entities-that-are-also-on-the-web) SHOULD also use `identifier`, `url`, `subjectOf`, and/or `mainEntityOfPage`.

### Directory File Entity

A [Dataset] (directory) _Data Entity_ MUST have the following properties:

* `@type` MUST be `Dataset` or an array where `Dataset` is one of the values.
* `@id` MUST be either:
* a _URI Path_ relative to the _RO Crate root_ which MUST resolve to a directory that is present in the _RO-Crate Root_. The id SHOULD end with `/`.
* a _URI Path_. The id SHOULD end with `/`.
* an absolute URI
* a local reference beginning with `#`

For an _Attached RO-Crate Package_:
* The `@id` MUST be a relative path that resolves to a directory that is present in the _RO-Crate Root_.

Contributor

A Dataset in an attached crate can also be web-based, with an absolute URI as its @id: this is already allowed in RO-Crate 1.1, I don't think we should change that. The requirement that the directory be present is also absent in 1.1: adding it would basically force validators to perform a check on the file system for every Dataset in the crate, which could be quite expensive.

Contributor Author

OK - I don't think I looked at this on this edit -- what should it say?


For a _Detached RO-Crate Package_:
* If the `@id` is a _URI Path it MAY be used to create a directory and MAY resolve to a service which returns a list of files

Contributor

Suggested change
* If the `@id` is a _URI Path it MAY be used to create a directory and MAY resolve to a service which returns a list of files
* If the `@id` is a _URI Path_ it MAY be used to create a directory and MAY resolve to a service which returns a list of files

* If the `@id` is a URL then it SHOULD resolve to a service which returns a list of files

Contributor

What's the difference between a "URI Path" (previous bullet point) and a "URL"? "a service which returns a list of files" is also not well defined


Additionally, `Dataset` entities SHOULD have:

* [name] giving a human readable name (not necessarily the directory name)
@@ -283,7 +320,7 @@ Any of the properties of schema.org [Dataset] MAY additionally be used (adding c

## Web-based Data Entities

While one use-case of RO-Crates is to describe _files_ contained within the _RO-Crate Root_ directory, RO-Crates can also gather resources from the web identified by _absolute URIs_ instead of relative _URI paths_, i.e. Web-based data entities.


Using Web-based data entities can be important particularly where a file can't be included in the _RO-Crate Root_ because of licensing concerns, large data sizes, privacy, or where it is desirable to link to the latest online version.

@@ -331,6 +368,7 @@ Example of an RO-Crate including a _File Data Entity_ external to the _RO-Crate
}
```


Additional care SHOULD be taken to improve persistence and long-term preservation of web resources included
in an RO-Crate, as they can be more difficult to archive or move along with the _RO-Crate Root_, and
may change intentionally or unintentionally, leaving the RO-Crate with incomplete or outdated information.
@@ -373,7 +411,9 @@ These MAY be included for File Data Entities as additional metadata, regardless
* [subjectOf] to a [CreativeWork] (or [WebPage]) that mentions this file or its content (but also other resources)
* [mainEntityOfPage] to a [CreativeWork] (or [WebPage]) that primarily describes this file (or its content)

Note that if a local file is intended to be packaged within an _Attached RO-Crate_, the `@id` property MUST be a _URI Path_ relative to the _RO Crate Root_, for example `survey-responses-2019.csv` as in the example below, where the content URL points to a download endpoint as a string.
If a [contentUrl] is present, then in an _Attached RO-Crate Package_ the file MAY be omitted from the package, for example if it is very large or of peripheral interest. Core files should NOT be omitted.

Contributor

What is a "core file"?


Note that if a local file is intended to be packaged within an _Attached RO-Crate Package_, the `@id` property MUST be a _URI Path_ relative to the _RO Crate root_, for example `survey-responses-2019.csv` as in the example below, where the content URL points to a download endpoint as a string.

```json
{
@@ -523,38 +563,6 @@ Similarly, the _RO-Crate Root_ entity (or a reference to another RO-Crate as a `
In all cases, consumers should be aware that a `DataDownload` is a snapshot that may not reflect the current state of the `Dataset` or RO-Crate.


#### Retrieving an RO-Crate

To resolve a reference to an RO-Crate, but where `subjectOf` or `distribution` is unknown (e.g. an RO-Crate is cited from a journal article), the below approach is recommended to retrieve its [RO-Crate Metadata Document](structure#ro-crate-metadata-document-ro-crate-metadatajson):

1. Assuming the URI is a permalink, after following HTTP redirects without content negotiation, try [Signposting] to look for `Link` headers that reference `Link rel="describedby"` for an _RO-Crate Metadata Document_, or `Link rel="item"` for a distribution archive -- in either case prefer a link with `profile="https://w3id.org/ro/crate"` declared. For example, signposting for `https://doi.org/10.48546/workflowhub.workflow.120.5` leads to the archive `https://workflowhub.eu/workflows/120/ro_crate?version=5` as:

```
curl --location --head https://doi.org/10.48546/workflowhub.workflow.120.5

HTTP/2 302
Location: https://workflowhub.eu/workflows/120?version=5

HTTP/2 200
Content-Type: text/html; charset=UTF-8
Link: <https://workflowhub.eu/workflows/120/ro_crate?version=5> ;
rel="item" ; type="application/zip" ;
profile="https://w3id.org/ro/crate"
```
2. [HTTP Content-negotiation] for the [RO-Crate media type](appendix/jsonld#ro-crate-json-ld-media-type), for example:

Requesting `https://w3id.org/workflowhub/workflow-ro-crate/1.0` with HTTP header
`Accept: application/ld+json;profile=https://w3id.org/ro/crate` redirects to the _RO-Crate Metadata file_
`https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json`

3. The above approaches may fail or return a HTML page, e.g. for content-delivery networks that do not support content-negotiation.
4. An optional heuristic fallback is to try resolving the path `./ro-crate-metadata.json` from the _resolved_ URI (after permalink redirects). For example:
If permalink `https://w3id.org/workflowhub/workflow-ro-crate/1.0` redirects to `https://about.workflowhub.eu/Workflow-RO-Crate/1.0/index.html` (a HTML page), then
try retrieving `https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json`.
5. If the retrieved resource is a ZIP file (`Content-Type: application/zip`), then extract `ro-crate-metadata.json`, or, if the archive root only contains a single folder (e.g. `folder1/`), extract `folder1/ro-crate-metadata.json`
6. If the retrieved resource is a [BagIt archive](appendix/implementation-notes#combining-with-other-packaging-schemes), e.g. containing a single folder `folder1` with `folder1/bagit.txt`, then extract and verify BagIt checksums before returning the bag's `data/ro-crate-metadata.json`
7. If the returned/extracted document is valid JSON-LD and has a [root data entity](root-data-entity#finding-the-root-data-entity), this is the RO-Crate Metadata File.
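
Step 5 of the procedure above (locating `ro-crate-metadata.json` inside a retrieved ZIP archive) could look roughly like the following Python sketch. The function name `read_metadata_from_zip` is hypothetical, the archive is assumed to be already downloaded into memory as bytes, and BagIt handling (step 6) and the root data entity check (step 7) are left out.

```python
import io
import json
import zipfile


def read_metadata_from_zip(data: bytes) -> dict:
    """Extract and parse ro-crate-metadata.json from a ZIP archive.

    Looks in the archive root first; if the root holds only a single
    folder, looks for the metadata file inside that folder instead.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        if "ro-crate-metadata.json" in names:
            member = "ro-crate-metadata.json"
        else:
            # Collect the distinct top-level names in the archive.
            top_level = {name.split("/", 1)[0] for name in names}
            if len(top_level) != 1:
                raise ValueError("no ro-crate-metadata.json in archive root")
            member = f"{next(iter(top_level))}/ro-crate-metadata.json"
            if member not in names:
                raise ValueError(f"{member} not found in archive")
        return json.loads(archive.read(member).decode("utf-8"))
```

The returned dictionary would still need to be checked for a root data entity before being treated as an RO-Crate Metadata File.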

{% include callout.html type="tip" content="Some PID providers such as DataCite may respond to content-negotiation and provide their own JSON-LD, which do not describe an RO-Crate (the `profile=` was ignored). The use of Signposting allows the repository to explicitly provide the RO-Crate." %}

{% include references.liquid %}
2 changes: 2 additions & 0 deletions docs/_specification/1.2-DRAFT/introduction.md
@@ -23,6 +23,8 @@ parent: RO-Crate 1.2-DRAFT
limitations under the License.
-->



# Introduction

This document specifies a method, known as _RO-Crate_ (Research Object Crate), of aggregating and describing data for distribution, re-use, publishing, preservation and archiving. RO-Crates aggregate data into a Dataset, and may describe any resource including files, URI-addressable resources, or use other addressing schemes to locate digital or physical data. Describing resources includes technical metadata such as file sizes and types as well as contextual information including how and where datasets and files were created, how they were collated and collected, who was involved in the process, what equipment and software was used, who funded the work, how to cite it, and crucially, how it may be reused, and by whom.