Merge pull request #278 from ptsefton/clarify-data-enties
Tightening up wording around data entities -- we had not actual said …
stain authored Apr 25, 2024
2 parents c543d91 + 7aa4c98 commit 26920b9
Showing 1 changed file with 57 additions and 10 deletions.
67 changes: 57 additions & 10 deletions docs/1.2-DRAFT/data-entities.md
@@ -42,16 +42,26 @@ The data entities can be further described by referencing [contextual entities](

Where files and folders are represented as _Data Entities_ in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the [Root Data Entity](root-data-entity.md) using the [hasPart] property. Directory hierarchies MAY be represented with nested [Dataset] _Data Entities_, or the Root Dataset MAY refer to files anywhere in the hierarchy using [hasPart].
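
The linking requirement above can be checked mechanically. The following is a minimal sketch, not part of the specification: it assumes the `@graph` array has already been loaded as a Python list, that the Root Data Entity has the `@id` `./`, and that `hasPart` values are `{"@id": ...}` references. The helper names `as_list` and `unlinked_data_entities` are hypothetical.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be absent, a single item, or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def unlinked_data_entities(graph, root_id="./"):
    """Return @ids of File/Dataset entities not reachable from the root via hasPart."""
    by_id = {entity["@id"]: entity for entity in graph}
    reachable, stack = set(), [root_id]
    while stack:
        current = stack.pop()
        if current in reachable:
            continue
        reachable.add(current)
        # Follow hasPart links, directly or through nested Datasets.
        for part in as_list(by_id.get(current, {}).get("hasPart")):
            stack.append(part["@id"])
    data_types = {"File", "Dataset"}
    return sorted(
        entity["@id"]
        for entity in graph
        if data_types & set(as_list(entity.get("@type")))
        and entity["@id"] not in reachable
    )


# Hypothetical graph: "orphan.txt" is described but never linked with hasPart.
graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "cp7glop.ai"}, {"@id": "lots_of_little_files/"}]},
    {"@id": "cp7glop.ai", "@type": "File"},
    {"@id": "lots_of_little_files/", "@type": "Dataset",
     "hasPart": {"@id": "lots_of_little_files/a.csv"}},
    {"@id": "lots_of_little_files/a.csv", "@type": "File"},
    {"@id": "orphan.txt", "@type": ["File", "ComputationalWorkflow"]},
]
print(unlinked_data_entities(graph))  # -> ['orphan.txt']
```

In a real crate the root `@id` would be looked up through the Metadata Descriptor rather than assumed to be `./`.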

_Data Entities_ representing files MUST have `"File"` as a value for `@type`. `File` is an RO-Crate alias for <http://schema.org/MediaObject>. The term _File_ here is liberal, and includes "downloadable" resources where `@id` is an absolute URI.
_Data Entities_ representing files MUST have `"File"` as a value for `@type`. `File` is an RO-Crate alias for <http://schema.org/MediaObject>. The term _File_ includes:
- _Attached_ resources where `@id` is a URI (path) relative to the _RO-Crate Root_, which MUST resolve to a file.
- _Detached_ "downloadable" resources where `@id` is an absolute URI that resolves to a single datastream which can be downloaded and saved as a file. _Detached_ Files SHOULD NOT reference intermediate resources such as splash pages, search services or web-based viewer applications.
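
As an illustration of the distinction, a consumer might tell the two kinds of `File` apart by checking whether the `@id` carries a URI scheme. This is a rough sketch under that assumption; `classify_file_id` is a hypothetical helper name, and it does not verify that the target actually resolves.

```python
from urllib.parse import urlsplit


def classify_file_id(entity_id):
    """Return 'detached' for an absolute URI and 'attached' for a relative path."""
    # An absolute URI always starts with a scheme such as "https:".
    if urlsplit(entity_id).scheme:
        return "detached"
    return "attached"


print(classify_file_id("cp7glop.ai"))
print(classify_file_id("https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf"))
```

A relative path whose first segment contains `:` would be misread as having a scheme; valid relative URI references avoid this by writing such paths with a leading `./`.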

_Data Entities_ representing directories MUST be of `"@type": "Dataset"`. The term _directory_ here includes HTTP file listings where `@id` is an absolute URI, however "external" directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate). It follows that the _RO-Crate Root_ is itself a data entity.
_Data Entities_ representing directories MUST have `Dataset` as a value for `@type`. The term _directory_ here includes HTTP file listings where `@id` is an absolute URI; however, "external", _Detached_ directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate). It follows that the _RO-Crate Root_ is itself a data entity.

_Data Entities_ can also be other types, for instance an online database. These SHOULD be a `@type` of [CreativeWork] (or one of its subtypes) and typically have a `@id` which is an absolute URI.

In all cases, `@type` MAY be an array in order to also specify a more specific type, e.g. `"@type": ["File", "ComputationalWorkflow"]`

{: .tip }
> There is no requirement to represent _every_ file and folder in an RO-Crate as Data Entities in the RO-Crate JSON-LD.
There is no requirement to represent _every_ file and folder in an RO-Crate as Data Entities in the RO-Crate JSON-LD. Reasons for not describing files would include that the files:
- are described in some other way, for example a manifest or another package management system,
- are supporting files for a software application,
- have metadata embedded in their filenames or paths which can be explained once,
- have a purpose that is unknown to the crate author, but they need to be preserved as part of an archive.

In any of the above cases where files are not described, a directory containing a set of files _MAY_ be described using a `Dataset` _Data Entity_ that encapsulates the files, with a `description` property that explains the contents. If the RO-Crate file structure is flat, or files are not grouped together, a `description` property on the _Root Data Entity_ may be used, or a `Dataset` with a local reference beginning with `#` (e.g. to describe a certain type of file which occurs throughout the crate). This approach is recommended for RO-Crates which are to be deposited in a long-term archive.




### Example linking to a file and folders

@@ -103,12 +113,37 @@ An example _RO-Crate JSON-LD_ for the above would be as follows:
"@id": "lots_of_little_files/",
"@type": "Dataset",
"name": "Too many files",
"description": "This directory contains many small files, that we're not going to describe in detail."
"description": "This directory contains many small files - the name of the file is a date in YYYY-MM-DD.csv, each file contains daily temperature readings, sampled hourly for the Glop Pot cave."
}
]
}
```

If the dataset contained a large number of `*.ai` files which were spread throughout the crate structure and which did not have _File Data Entities_, then an approach to describing them would be:

```
{
"@id": "./",
"@type": [
"Dataset"
],
"hasPart": [
{
"@id": "#ai-files"
}
]
},
{
"@id": "#ai-files",
"@type": "Dataset",
"name": ".ai Files",
"description": "This dataset contains some files with the extension '.ai' which despite their extension have an encoding format of 'application/pdf'. These have yet to be catalogued."
}
```

### Adding detailed descriptions of encodings

The above example provides a media type for the file `cp7glop.ai` - which is
@@ -120,7 +155,7 @@ identifier to a _Contextual Entity_ with `@type` array containing [WebPage] and
{
"@id": "cp7glop.ai",
"@type": "File",
"name": "Diagram showing trend to increase",
"name": "Glop Plot map",
"contentSize": "383766",
"description": "Illustrator file for Glop Pot",
"encodingFormat": ["application/pdf", {"@id": "https://www.nationalarchives.gov.uk/PRONOM/fmt/19"}]
@@ -200,20 +235,20 @@ The [Metadata Descriptor](root-data-entity.md#ro-crate-metadata-file-descriptor)

## Core Metadata for Data Entities

The table below outlines the properties that Data Entities, when present, MUST have to be minimally valid.

### Encoding file paths

Note that all `@id` [identifiers must be valid URI references](appendix/jsonld.md#describing-entities-in-json-ld), care must be taken to express any relative paths using `/` separator, correct casing, and escape special characters like space (`%20`) and percent (`%25`), for instance a _File Data Entity_ from the Windows path `Results and Diagrams\almost-50%.png` becomes `"@id": "Results%20and%20Diagrams/almost-50%25.png"` in the _RO-Crate JSON-LD_.

In this document the term _URI_ includes international *IRI*s; the _RO-Crate Metadata Document_ is always UTF-8 and international characters in identifiers SHOULD be written using native UTF-8 characters (*IRI*s), however traditional URL encoding of Unicode characters with `%` MAY appear in `@id` strings. Example: `"@id": "面试.mp4"` is preferred over the equivalent `"@id": "%E9%9D%A2%E8%AF%95.mp4"`


### File Data Entity

A [File] _Data Entity_ MUST have the following properties:

* `@type`: MUST be `File`, or an array where `File` is one of the values.
* `@id` MUST be either a _URI Path_ relative to the _RO Crate root_, or an absolute URI.
* `@id` MUST be either a _URI Path_ relative to the _RO-Crate Root_ which MUST resolve to a file that is present in the _RO-Crate Root_, or an absolute URI.

Additionally, `File` entities SHOULD have:

@@ -230,7 +265,10 @@ RO-Crate's `File` is an alias for schema.org type [MediaObject], any of its prop
A [Dataset] (directory) _Data Entity_ MUST have the following properties:

* `@type` MUST be `Dataset` or an array where `Dataset` is one of the values.
* `@id` MUST be either a _URI Path_ relative to the _RO Crate root_, or an absolute URI. The id SHOULD end with `/`
* `@id` MUST be one of:
  - a _URI Path_ relative to the _RO-Crate Root_ which MUST resolve to a directory that is present in the _RO-Crate Root_. The id SHOULD end with `/`.
  - an absolute URI.
  - a local reference beginning with `#`.
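
The three permitted forms of a `Dataset` `@id` could be checked along the following lines. This is a sketch only, assuming the crate has been unpacked to a local directory; `check_dataset_id` and its `crate_root` parameter are hypothetical names, and the function does not guard against paths that escape the crate root via `..`.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def check_dataset_id(entity_id, crate_root):
    """Return a list of human-readable problems with a Dataset @id (empty if none)."""
    if entity_id.startswith("#"):
        return []  # local reference: nothing to resolve on disk
    if urlsplit(entity_id).scheme:
        return []  # absolute URI: not resolved against the crate root
    problems = []
    # Relative URI paths are percent-encoded, so decode before touching the disk.
    target = Path(crate_root) / unquote(entity_id)
    if not target.is_dir():
        problems.append(f"{entity_id!r} does not resolve to a directory in the crate root")
    if not entity_id.endswith("/"):
        problems.append(f"{entity_id!r} should end with '/'")
    return problems
```

For example, `check_dataset_id("#ai-files", "/path/to/crate")` returns an empty list without touching the filesystem, while a relative id is checked against the directory tree under the given (hypothetical) crate path.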

Additionally, `Dataset` entities SHOULD have:

@@ -240,6 +278,8 @@ Additionally, `Dataset` entities SHOULD have:

Any of the properties of schema.org [Dataset] MAY additionally be used (adding contextual entities as needed). [Directories on the web](#directories-on-the-web-dataset-distributions) SHOULD also provide `distribution`.



## Web-based Data Entities

While one use-case of RO-Crates is to describe _files_ contained within the _RO-Crate root_ directory, RO-Crates can also gather resources from the web identified by _absolute URIs_ instead of relative _URI paths_, i.e. Web-based data entities.
@@ -294,7 +334,7 @@ Additional care SHOULD be taken to improve persistence and long-term preservatio
in an RO-Crate as they can be more difficult to archive or move along with the _RO-Crate root_, and
may change intentionally or unintentionally leaving the RO-Crate with incomplete or outdated information.

File Data Entries with an `@id` URI outside the _RO-Crate Root_ SHOULD at the time of RO-Crate creation be directly downloadable by a simple retrieval (e.g. HTTP GET), permitting redirections and HTTP/HTTPS authentication. For instance, in the example above, <https://zenodo.org/record/3541888> and <https://doi.org/10.5281/zenodo.3541888> cannot be used as `@id` above as retrieving these URLs give a HTML landing page rather than the desired PDF as indicated by `encodingFormat`.
File Data Entities with an `@id` URI outside the _RO-Crate Root_ SHOULD at the time of RO-Crate creation be directly downloadable by a simple non-interactive retrieval (e.g. HTTP GET) of a single data stream, permitting redirections and HTTP/HTTPS authentication. For instance, in the example above, <https://zenodo.org/record/3541888> and <https://doi.org/10.5281/zenodo.3541888> cannot be used as `@id` above, as retrieving these URLs gives an HTML landing page rather than the desired PDF as indicated by `encodingFormat`.

As files on the web may change, the timestamp property [sdDatePublished] SHOULD be included to indicate when the absolute URL was accessed, and derived metadata like [encodingFormat] and [contentSize] were considered to be representative:

@@ -309,6 +349,13 @@ As files on the web may change, the timestamp property [sdDatePublished] SHOULD
}
```
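
A crate-building script might record the access time when it gathers metadata for a web-based file. The sketch below is one possible way to do so and is not taken from any RO-Crate library; the function name `web_file_entity` and its parameters are hypothetical, and `contentSize` is written as a string to match the examples in this document.

```python
from datetime import datetime, timezone


def web_file_entity(url, encoding_format, content_size, accessed=None):
    """Build a File entity for an absolute URL, stamped with the access time."""
    # Default to "now" in UTC when the caller does not supply an access time.
    accessed = accessed or datetime.now(timezone.utc)
    return {
        "@id": url,
        "@type": "File",
        "encodingFormat": encoding_format,
        "contentSize": str(content_size),
        "sdDatePublished": accessed.isoformat(timespec="seconds"),
    }
```

Passing an explicit `accessed` value keeps the output reproducible, which is useful when the metadata is regenerated later without re-fetching the resource.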

### Encoding file paths

Because all `@id` [identifiers must be valid URI references](appendix/jsonld.md#describing-entities-in-json-ld), care must be taken to express any relative paths using the `/` separator and correct casing, and to escape special characters like space (`%20`) and percent (`%25`). For instance, a _File Data Entity_ from the Windows path `Results and Diagrams\almost-50%.png` becomes `"@id": "Results%20and%20Diagrams/almost-50%25.png"` in the _RO-Crate JSON-LD_.

In this document the term _URI_ includes international *IRI*s; the _RO-Crate Metadata File_ is always UTF-8 and international characters in identifiers SHOULD be written using native UTF-8 characters (*IRI*s), however traditional URL encoding of Unicode characters with `%` MAY appear in `@id` strings. Example: `"@id": "面试.mp4"` is preferred over the equivalent `"@id": "%E9%9D%A2%E8%AF%95.mp4"`.
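
The encoding rules above can be sketched in a few lines of Python. This is an illustrative helper rather than a reference implementation: `windows_path_to_id` is a hypothetical name, it is intended for file paths only (a trailing separator on a directory path is not preserved), and it leaves non-ASCII characters as native UTF-8 in line with the IRI preference.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def windows_path_to_id(windows_path):
    """Convert a relative Windows file path into a relative URI path for @id."""
    # Switch "\" separators to "/" without needing to run on Windows.
    posix = PureWindowsPath(windows_path).as_posix()
    # Percent-encode ASCII characters that need it; keep non-ASCII as-is (IRI style).
    return "".join(ch if ord(ch) > 127 else quote(ch, safe="/") for ch in posix)


print(windows_path_to_id(r"Results and Diagrams\almost-50%.png"))
# -> Results%20and%20Diagrams/almost-50%25.png
print(windows_path_to_id("面试.mp4"))
# -> 面试.mp4
```

Encoding character by character is a deliberate choice here, since calling `quote` on the whole string would also percent-encode the non-ASCII characters.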


### Embedded data entities that are also on the web

File Data Entities may already have a corresponding web presence, for instance a landing page that describes the file, including persistent identifiers (e.g. DOI) resolving to an intermediate HTML page instead of the downloadable file directly.