signposting-tutorial

Overview

  • Name: Tutorial on adding Signposting to HTML in Web pages
  • Description: This tutorial shows how to add Signposting to GitHub pages. It uses a simple GitHub page hosted in the docs/ folder to create a sample project page, i.e., as learners could do with their own GitHub projects. As an example, it uses the dataset corresponding to the released project TREC-doc-2-doc-relevance, a web-based interface to add document-to-document relevance assessments to pairs of documents retrieved from TREC 2005 Genomics Track.
  • Keywords: Signposting, GitHub pages

Questions

  • How can I add Signposting to GitHub pages?
  • How can I include external metadata in my signposting?
  • How do I decide which metadata to include in signposting?

Learning outcomes

  • Describe how Signposting can be embedded in GitHub pages
  • Understand the Signposting limitations of static content-delivery networks
  • Know the different metadata formats and their Signposting profiles

Requirements

  • Familiarity with using GitHub
  • Basic knowledge of how to use GitHub Pages
  • Brief understanding of Signposting (introduction slides)
  • Familiarity with HTML
  • Knowledge of developer tools on a browser

Time estimation: 30 minutes

Level: Beginner / Introductory

Published: 2024-02-25

Latest modification: 2024-02-25

License: CC-BY 4.0

Version: 0.1.0

Learning experience

Agenda

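In this tutorial we will cover:

  • Prerequisites
  • Creating this GitHub Page
  • Overview of the repository
  • Challenge of machine actionability
  • Adding FAIR Signposting
  • Repository listing
  • Linksets
  • Try it out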

Prerequisites

To follow this tutorial, you should already have experience with using GitHub, and be signed in to your GitHub account. See Learn the basics of GitHub as a refresher.

Creating this GitHub Page

Task: Let's start by forking this repository for your own purposes. Once forked, go to Settings.

Screenshot of GitHub, highlighting the Settings link

Task: You will need to enable Pages on your forked repository, and select Deploy from a branch using Branch: main and Folder: docs/. Then Save the changes.

As this repo does have a gh-pages branch, it will use it. If no such branch existed, GitHub would ask you to use the main branch to start the gh-pages one.

Screenshot of GitHub Pages setting

In a matter of minutes, your site will be live. The pages corresponding to the examples used in this tutorial are available at https://stain.github.io/signposting-tutorial/ and your own pages should appear at the same address with stain replaced by your GitHub username.

Screenshot: Your pages are live

Do not forget to check out a local copy of your fork so you can make changes -- alternatively you may use the GitHub editor.

Overview of the repository

This repository emulates a basic HTML-based institutional repository, with a single dataset entry corresponding to a Zenodo entry. Layout:

The remaining dataset and metadata downloads are in this case shown as deeplinks to Zenodo to indicate that Signposting is not tied to a particular domain.

This tutorial is deployed using GitHub Pages as described above. For simplicity it uses static HTML files based on a Bootstrap v5 starter template -- applying Signposting to a real repository deployment may require editing its HTML templates, which is currently out of scope for this tutorial.

Challenge of machine actionability

Look at HTML page https://stain.github.io/signposting-tutorial/7338056/ (or equivalent for your username) and open the HTML code in docs/7338056/index.html. This is a somewhat typical landing page for a Web-based data repository. We will imagine that the persistent identifier (DOI) has redirected to this page, as is the case for the original https://doi.org/10.5281/zenodo.7338056

Screenshot of landing page, showing metadata, download links etc

We see that the landing page is quite useful for humans, including an abstract; metadata including title, author, keywords; and a big download button. There are some export formats listed at the end for metadata formats like Bibtex.

The tutorial bioschemas-ghpages-markup-tutorial highlights how this kind of metadata can be made machine-readable in a FAIR format -- which for completeness is included in the <script> tag at the end of the HTML. Bioschemas is however just one of the many ways that FAIR metadata can be provided (as shown in this example).

However, a machine (for example, a pre-programmed script) that accesses the given persistent identifier, and does not already know this particular repository implementation or Bioschemas, is not immediately able to answer the most basic FAIR questions:

  • What is the persistent identifier?
  • What is the type of the resource described?
  • Where can it download the data (if any), and in which format(s)?
  • What is the license and authorship of the data?
  • What other metadata formats are available? What conventions do they follow?

The goal of Signposting is to reduce the heuristics that such clients would otherwise need (e.g. text mining or content negotiation) by giving explicit typed links that facilitate navigation. Note that this is different from semantics, as the main goal is to give the client further waypoints rather than meaning.

Adding FAIR Signposting

In this tutorial we will implement FAIR Signposting at level 1, which provides:

  • Author(s)
  • Persistent identifiers
  • Metadata
  • Download/archive
  • License
  • Type

A landing page points out using link relations cite-as, author, item, type, describedby
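As a rough illustration of the list above, the following sketch reports which of these link relations a page does not yet use. The function name and the list-of-dictionaries representation of links are assumptions made for this example, and the check is not a full FAIR Signposting conformance test.

```python
# Link relations named in this tutorial for FAIR Signposting level 1.
LEVEL1_RELATIONS = ["cite-as", "type", "author", "license", "item", "describedby"]


def missing_relations(links):
    """Return the level 1 relations that no link in `links` uses."""
    present = {link["rel"] for link in links}
    return [rel for rel in LEVEL1_RELATIONS if rel not in present]


# Hypothetical state of a page after only the first two steps of the tutorial.
links = [
    {"rel": "cite-as", "href": "https://doi.org/10.5281/zenodo.7338056"},
    {"rel": "type", "href": "https://schema.org/Dataset"},
]
print(missing_relations(links))
```

Not every relation applies to every resource (for example, an author without a URI cannot be signposted), so an entry in the returned list is a prompt to check, not necessarily an error.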

Where to add Signposting?

Signposting can be added in three ways:

  1. In the HTTP GET / HEAD response, using the Link header
  2. In an HTML landing page, within the HTML <head> section using the <link> element
  3. In a dedicated linkset JSON or text file, linked to using any of the above

As this tutorial is neutral to deployment, and GitHub Pages does not permit control over HTTP headers, we will primarily work with option 2.
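Although option 1 is not available on GitHub Pages, it may help to see what the equivalent HTTP header would look like on a server that you control. The sketch below serialises a few links into a Link header value. The function name and the dictionary layout are assumptions for illustration, and only the href, rel and type attributes are handled.

```python
def format_link_header(links):
    """Serialise links as the value of an HTTP Link header."""
    parts = []
    for link in links:
        part = f'<{link["href"]}>; rel="{link["rel"]}"'
        if "type" in link:
            part += f'; type="{link["type"]}"'
        parts.append(part)
    return ", ".join(parts)


links = [
    {"href": "https://doi.org/10.5281/zenodo.7338056", "rel": "cite-as"},
    {"href": "https://schema.org/Dataset", "rel": "type"},
]
print("Link: " + format_link_header(links))
```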

Task: To add the HTML links to your forked repository, open docs/7338056/index.html and click either the Edit in place button or the more powerful Open with github.dev option.

Screenshot: Edit file: Edit in place. Open with... github.dev

If you don't see these options, make sure you are on your fork of the repository.

Towards the top of the file, you will find two tags we will expand:

Screenshot from code editor, showing HTML tags

<!-- Bootstrap CSS -->
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/css/bootstrap.min.css"
    rel="stylesheet" integrity="sha384-EVSTQN3/..." crossorigin="anonymous">

<!-- Copy and modify below line to add Signposting -->    
<link href="" rel="self" />

The first line shows how we are using the existing HTML mechanism for linking: rel=stylesheet tells the browser to apply the styling from the linked Bootstrap theme.

The second line is a template which we'll copy and modify in the instructions below. In the end you may delete this example line, as rel=self is not needed in HTML documents.

Make sure you add the new links within the <head> .... </head> section, as recommended by FAIR Signposting. To simplify life for clients, it is NOT RECOMMENDED to add <link> to the <body> content.
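To see why keeping links in <head> simplifies life for clients, here is a minimal sketch of a client that only looks there, using Python's standard html.parser module. The class name and the sample HTML are assumptions for this example; a real client such as the signposting package does considerably more.

```python
from html.parser import HTMLParser


class HeadLinkParser(HTMLParser):
    """Collect the attributes of <link> elements that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            self.links.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


html = """<html><head>
<link href="style.css" rel="stylesheet">
<link href="https://doi.org/10.5281/zenodo.7338056" rel="cite-as" />
</head><body><link href="ignored" rel="item"></body></html>"""

parser = HeadLinkParser()
parser.feed(html)
# Drop the stylesheet link, which is not signposting.
signposts = [link for link in parser.links if link.get("rel") != "stylesheet"]
print(signposts)
```

The <link> placed in <body> is never seen by this client, which is the kind of silent omission the recommendation is meant to avoid.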

Adding a persistent identifier

Literature: The FAIR Guiding Principles include:

F1. (meta)data are assigned a globally unique and persistent identifier
...
F3. metadata clearly and explicitly include the identifier of the data it describes

Persistent identifiers are expressed in Signposting using rel="cite-as" (RFC8574) -- this allows a landing page to say which persistent identifier will redirect to the page.

The original entry for this dataset has the DOI 10.5281/zenodo.7338056 -- however, DOIs as untyped strings are not good targets, as every Signposting link target has to be a valid URI -- typically starting with http:// or https:// followed by a domain name for the corresponding Web server. For DOIs we will therefore use the https://doi.org/ resolver -- to convert the DOI to a URI, simply add this as a prefix to become: https://doi.org/10.5281/zenodo.7338056

Task: Modify docs/7338056/index.html so that it includes the signposting for the DOI 10.5281/zenodo.7338056:

<link href="https://doi.org/10.5281/zenodo.7338056" rel="cite-as" />
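The prefixing step can be sketched in a few lines of Python. The function name is hypothetical, and the handling of a leading "doi:" is an added convenience that the tutorial itself does not require; DOIs containing characters that are not URL-safe would additionally need percent-encoding, which this sketch does not attempt.

```python
def doi_to_uri(doi):
    """Turn a bare DOI (optionally prefixed with 'doi:') into an https://doi.org/ URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return "https://doi.org/" + doi


print(doi_to_uri("10.5281/zenodo.7338056"))
```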

Using a w3id persistent identifier

Note however that the purpose of cite-as is not to give any odd scholarly citation, but a persistent identifier that leads back to this place. In this idealized example we have duplicated a Zenodo entry, however their DOI https://doi.org/10.5281/zenodo.7338056 will of course still redirect to their landing page https://zenodo.org/records/7338056, and we have no power to modify their HTML template. Instead, we can use a w3id persistent identifier of the form https://w3id.org/signposting-tutorial/{user}.{number}

Add the signposting below, adjusted to reflect your username, and use it as the cite-as instead:

<link href="https://w3id.org/signposting-tutorial/stain.7338056" rel="cite-as" />

Literature: If you manage a repository, you likely already assign persistent identifiers that can be used with cite-as -- if not, consider these resources:

Specifying the resource type

FAIR Signposting requires a type to classify the scholarly object, in our case a dataset provided as a CSV file.

Literature: Browse the Schema.org hierarchy to expand CreativeWork and find the type Dataset (other common types may be ScholarlyArticle, ImageObject, SoftwareSourceCode)

Task: To specify Dataset as a type, use:

<link href="https://schema.org/Dataset" rel="type" />

Note: This schema.org identifier is subtly different from the JSON-LD usage in Bioschemas, whose @context maps Dataset to http://schema.org/Dataset etc. As Signposting is navigational and not semantic, we here prefer the https:// variant.

Now, the resource we are providing the signposting from is not technically speaking the dataset, but a landing page about the downloadable dataset. Therefore Signposting recommends also adding:

<link href="https://schema.org/AboutPage" rel="type" />    

This may be a good time to try it out using a signposting client to verify your changes to index.html.

Specifying authors

If each author of the resource has a persistent identifier (e.g. ORCID), or another user page within the repository, we can list them using the author link relation.

Task: Add for each of the authors listed in the HTML their ORCID identifier using rel="author":

    <link href="https://orcid.org/0000-0003-2978-8922" rel="author" />

Note that if the author does not have a page but only a name, you can't provide a link or a persistent identifier, and so there is nothing to signpost to. Remember that the purpose here is navigation; full semantics are left to the metadata, which we'll cover later.
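When copying ORCID identifiers by hand, a typo is easy to miss. ORCID iDs end in a check character (ISO 7064 MOD 11-2), so a small script can catch most mistakes before the link is published. The sketch below is an optional aid with hypothetical function names, not part of Signposting itself.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def orcid_to_uri(orcid):
    """Return the https://orcid.org/ URI for an ORCID iD, or None if it looks invalid."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return None
    if orcid_check_digit(digits[:15]) != digits[15].upper():
        return None
    return "https://orcid.org/" + orcid


print(orcid_to_uri("0000-0003-2978-8922"))
```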

Specifying license

In many cases, a repository entry has an open access or open source license. In this case it is very valuable to provide the license signposting, in order to indicate to clients what they are permitted to do with the download.

Task: In our first attempt, let's specify the Creative Commons CC-BY 4.0 license by using the URI as provided in the HTML:

<link href="https://creativecommons.org/licenses/by/4.0/" rel="license" />

Needless to say, there are many possible licenses, each of which may have many identifiers. So while this link may be useful for humans, for machine actionability it is preferable to also use a known persistent identifier for the license.

Literature: The SPDX License List is such a well known set of license identifiers. Identify the line for "Creative Commons Attribution 4.0 International". Remember signposting can't go to untyped identifiers like CC-BY-4.0 but needs a URI. Luckily SPDX provides such URIs e.g. https://spdx.org/licenses/CC-BY-4.0 (although, for unexplained reasons, their list links to .html variants).

Task: Modify the above license to use the SPDX persistent identifier:

<link href="https://spdx.org/licenses/CC-BY-4.0" rel="license" />

In other cases there is no single license, or the license is only embedded within the dataset. In this case you should not include a license as you don't have a single resource to link to.

Tip: Make sure you use the US spelling of the link relation license!

Specifying content downloads

Returning to the FAIR Principles we also find:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

If we accept that many persistent identifiers go to an HTML landing page, rather than directly to the downloadable data (which would then hide the metadata), A1 must be enabled for machines through an indirection. In Signposting this is done using the item link relation.

From the existing HTML we find the CSV file as a Download link.

Task: Add the signposting for the download:

<link 
    href="https://zenodo.org/records/7338056/files/Fleiss%20Kappa%20for%20document-to-document%20relevant%20assessment.csv?download=1"
    rel="item"
    type="text/csv" />

Note that although type is optional, it is strongly recommended for downloads, especially if the server is unable to return a correct Content-Type.
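The href above contains %20 because spaces in the file name must be percent-encoded to form a valid URI. If you build such links from file names, a sketch like the following can do the encoding. The function name is hypothetical, and the base URL and ?download=1 suffix simply mirror the Zenodo link used in this example.

```python
from urllib.parse import quote


def download_href(base, filename):
    """Percent-encode a file name and append it to a base URL ending in '/'."""
    return base + quote(filename) + "?download=1"


href = download_href(
    "https://zenodo.org/records/7338056/files/",
    "Fleiss Kappa for document-to-document relevant assessment.csv",
)
print(href)
```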

Literature: See the IANA media types or PRONOM to find known file formats.

It is possible to have additional downloads. For instance, Zenodo entries can have multiple uploads for a single DOI/landing page. In this tutorial repository, we have included fleiss.tsv as an example of an additional resource, converted from the CSV to the Tab-Separated Values format.

Task: Add another download for our converted TSV file:

<link 
    href="fleiss.tsv"
    rel="item"
    type="text/tab-separated-values" />

Note: There is no indication in the outgoing links that these are alternatives of the same resource (the underlying table). This would have to be done using rel=alternate at the HTTP header level for each of the files; however, this kind of semantics is not required by Signposting. Likewise, recording the provenance of the conversion is the role of the metadata.

Listing metadata resources

A very important motivation for FAIR Signposting is to make machine-readable metadata easier to find. In particular, clients should not need to content-negotiate or know in advance exactly which formats are available. In some cases metadata is also available externally, as exemplified by this repository, which links back to Zenodo.

In Signposting, metadata resources are listed as describedby. Metadata is considered separate from the data (the item downloads) if it can be considered to primarily describe the data.

Note: A dataset may just happen to be written in a semantic format like JSON-LD (e.g. the dataset is an ontology), in which case it should still be listed under item, not as describedby.

Task: Add a link relation for each of the metadata formats linked from the HTML, e.g.

<link
    href="https://zenodo.org/records/7338056/export/json-ld" 
    rel="describedby" />

Literature: For further reading, see:

Specifying metadata format and profiles

Signposting recommends attributes for typed links, which are particularly important for metadata, which is available in many different formats.

Literature: The Bibliographic Metadata Formats listed for Signposting and the FAIR Signposting level 1 entry for describedby list common media types like application/vnd.datacite.datacite+xml

Task: Augment the list of metadata resources to list the specific media types, e.g.:

<link 
    href="https://zenodo.org/records/7338056/export/bibtex"
    rel="describedby"
    type="application/x-bibtex"
  />

Finally, some metadata use generic formats, like application/xml (XML) or application/ld+json (JSON-LD) -- the client will need to know particular namespaces or vocabularies to understand them. There may also be multiple metadata resources using the same format, but different models or variants. The concept of profile is intended for such disambiguation, and it is specified as an attribute of the link. The profile is identified by a URI.

Task: To distinguish the two JSON-LD formats in this example, add their specific profile:

<link
    href="https://zenodo.org/records/7338056/export/json-ld" 
    rel="describedby" 
    type="application/ld+json"
    profile="http://schema.org/"/>
<link
    href="bioschemas.jsonld" 
    rel="describedby"
    type="application/ld+json"
    profile="https://bioschemas.org/profiles/Dataset/1.1-DRAFT"
/>         

Task: For the Dublin Core export, add profile="http://purl.org/dc/elements/1.1/" as suggested by the Bibliographic Metadata Formats table:

<link 
    href="https://zenodo.org/records/7338056/export/dublincore" 
    rel="describedby" 
    type="application/xml"
    profile="http://purl.org/dc/elements/1.1/"/>
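From the client's side, type and profile together let a script pick the metadata it understands without downloading every candidate. The following sketch assumes the links have already been extracted into dictionaries; the function name and that data layout are illustrative assumptions.

```python
def find_metadata(links, media_type, profile=None):
    """Return hrefs of describedby links matching a media type and optional profile."""
    return [
        link["href"]
        for link in links
        if link.get("rel") == "describedby"
        and link.get("type") == media_type
        and (profile is None or link.get("profile") == profile)
    ]


links = [
    {"href": "https://zenodo.org/records/7338056/export/json-ld",
     "rel": "describedby", "type": "application/ld+json",
     "profile": "http://schema.org/"},
    {"href": "bioschemas.jsonld",
     "rel": "describedby", "type": "application/ld+json",
     "profile": "https://bioschemas.org/profiles/Dataset/1.1-DRAFT"},
    {"href": "https://zenodo.org/records/7338056/export/dublincore",
     "rel": "describedby", "type": "application/xml",
     "profile": "http://purl.org/dc/elements/1.1/"},
]

# The media type alone is ambiguous: two JSON-LD resources match.
print(find_metadata(links, "application/ld+json"))
# Adding the profile narrows the choice to a single resource.
print(find_metadata(links, "application/ld+json",
                    "https://bioschemas.org/profiles/Dataset/1.1-DRAFT"))
```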

Linking back to the collection

While not required, it is good practice to link to the collection(s) the resource is from. In this case we don't have a persistent identifier for the collection.

Task: Add the parent using the collection link relation:

<link
    href="/signposting-tutorial/"
    rel="collection">
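Several of the links in this tutorial, such as fleiss.tsv, bioschemas.jsonld and /signposting-tutorial/, are relative. Clients resolve them against the URL of the page they were found on, which can be sketched with urljoin from Python's standard library. The page URL below is the example deployment; substitute your own username.

```python
from urllib.parse import urljoin

# URL of the landing page on which the relative links appear.
page = "https://stain.github.io/signposting-tutorial/7338056/"


def resolve(href):
    """Resolve a possibly relative href against the landing page URL."""
    return urljoin(page, href)


for href in ["fleiss.tsv", "bioschemas.jsonld", "/signposting-tutorial/"]:
    print(href, "->", resolve(href))
```

Note that the trailing slash in the page URL matters: without it, relative file names would resolve against the parent folder instead.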

Repository listing

Finally, now let's consider the root index.html. The listing of datasets is not machine-readable.

A corresponding PID https://w3id.org/signposting-tutorial/USER should redirect to your dataset listing.

Task: Modify docs/index.html to add cite-as to reflect this persistent identifier:

<link
  href="https://w3id.org/signposting-tutorial/stain" 
  rel="cite-as" />

To indicate that this is a listing of datasets, use a type like https://schema.org/DataCatalog (for Dataset items), or https://schema.org/Collection (for any other types of items).

Task: Modify docs/index.html to add a type:

Note: We don't include AboutPage here, because the HTML listing is the collection.

Task: Add the dataset using the link relation item, but specify the type as text/html (items are landing pages):

<link href="7338056/" 
  rel="item" 
  type="text/html" />          

Linksets

For repositories there are likely to be very many entries. It is therefore NOT RECOMMENDED to list them in the HTTP headers, and they should not be listed in the HTML either.

Task: Remove the item from before, and replace it with a linkset indirection:

<link href="linkset.json" 
  rel="linkset" 
  type="application/linkset+json" />

A linkset is a mechanism to move links to a separate document. The document uses anchor to identify which resource each set of links belongs to, so a single linkset can be shared by multiple resources, each of which links to it using the linkset relation.

Task: Inspect the docs/linkset.json for an example linkset in JSON.

Task: Augment the linkset.json file to include "type": "text/html" for the dataset 7338056. Remember to place the commas (,) correctly when editing the JSON.

{
    "linkset": [
      {
        "anchor": "https://stain.github.io/signposting-tutorial/",
        "cite-as": [
          {
            "href": "https://w3id.org/signposting-tutorial/stain"
          }
        ],        
        "item": [
          {
            "href": "https://stain.github.io/signposting-tutorial/7338056/",
            "type": "text/html"
          }
        ]
      }
    ]
  }

Literature: RFC9264 specifies the application/linkset+json format, along with a text-based alternative application/linkset
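To illustrate how a client might read such a document, the sketch below parses the JSON linkset shown above and lists the links for one anchor. The function name is an assumption for this example, and only the href and type target attributes are read.

```python
import json

LINKSET = """
{
  "linkset": [
    {
      "anchor": "https://stain.github.io/signposting-tutorial/",
      "cite-as": [
        {"href": "https://w3id.org/signposting-tutorial/stain"}
      ],
      "item": [
        {"href": "https://stain.github.io/signposting-tutorial/7338056/",
         "type": "text/html"}
      ]
    }
  ]
}
"""


def links_for(linkset_doc, anchor):
    """Return (rel, href, type) tuples for the link context with the given anchor."""
    found = []
    for context in linkset_doc["linkset"]:
        if context.get("anchor") != anchor:
            continue
        for rel, targets in context.items():
            if rel == "anchor":
                continue  # the anchor names the context, it is not a link relation
            for target in targets:
                found.append((rel, target["href"], target.get("type")))
    return found


doc = json.loads(LINKSET)
for link in links_for(doc, "https://stain.github.io/signposting-tutorial/"):
    print(link)
```

json.loads will also raise an error on misplaced commas, which makes it a quick way to check your edited linkset.json before pushing.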

Try it out

In order to try out Signposting we will try two alternative Signposting clients.

Task: Ensure you have committed and pushed your code to GitHub, allowing the page to rebuild. Visit the Actions tab in the GitHub repository to ensure the build succeeded as before.

After visiting your page, e.g. https://USER.github.io/signposting-tutorial/7338056/ , you may use Inspect Element in the browser to check that the <link> elements have been added correctly -- however, your browser will by default not do any further validation.

Command line / Python

If you are comfortable using the command line, and have Python 3.7 or later installed, then install the signposting Python package:

pip install signposting

Verify the tool is installed on the PATH:

(base) stain@xena:~/src/signposting-tutorial$ signposting --help
usage: signposting [-h] [--http] [--html] [--linkset] [-D] [-c] url [url ...]

positional arguments:
  url                URL(s) to discover signposting for

optional arguments:
  -h, --help         show this help message and exit
  --http             Find signposting in HTTP Link headers
  --html             Find signposting in <link> HTML elements if media-type matches
  --linkset          Find signposting in RFC9264 JSON or text linksets if media-type matches. When used with --recurse without specifying --http or --html, use those signposts to recurse, but
                     only report from linksets
  -D, --distinct     Report each signposting method (--http, --html and --linkset) separately
  -c, --any-context  Include signposts any contexts/anchors, not just resolved URI

Now try it on your GitHub deployment:

(base) stain@xena:~/src/signposting-tutorial$ signposting https://stain.github.io/signposting-tutorial/7338056/
Signposting for https://stain.github.io/signposting-tutorial/7338056/ 
CiteAs: <https://doi.org/10.5281/zenodo.7338056>
Type: <https://schema.org/AboutPage>
      <https://schema.org/Dataset>
Collection: <https://stain.github.io/signposting-tutorial/>
License: <https://spdx.org/licenses/CC-BY-4.0>
Author: <https://orcid.org/0000-0003-2978-8922>
        <https://orcid.org/0009-0004-1529-0095>
        <https://orcid.org/0000-0003-3986-0510>
        <https://orcid.org/0000-0002-1018-0370>
DescribedBy: <https://stain.github.io/signposting-tutorial/7338056/bioschemas.jsonld> application/ld+json
             <https://zenodo.org/records/7338056/export/dublincore> application/xml
             <https://zenodo.org/records/7338056/export/bibtex> application/x-bibtex
             <https://zenodo.org/records/7338056/export/json-ld> application/ld+json
             <https://zenodo.org/records/7338056/export/datacite-xml> application/vnd.datacite.datacite+xml
Item: <https://zenodo.org/records/7338056/files/Fleiss%20Kappa%20for%20document-to-document%20relevant%20assessment.csv?download=1> text/csv

If you have made a mistake, this library is likely to skip the particular signposting, or give a warning.

The Signposting Python library can also be used programmatically from other Python programs. See Signposting adopters for a complete list of software and repositories working with Signposting.

Signposting in browser

An experimental browser plugin for the Chrome browser (and its derivatives Chromium, Edge etc.) is available as Signposting Sniffing. Click Add to Chrome to enable this plugin. Note that although the plugin has access to inspect every web page, it should not be doing any external requests.

When Signposting is detected in a page, it will be presented as an overlay. After installing the plugin, re-visit your dataset page in that browser.

Screenshot of Chrome browser with Signposting Sniffing plugin, showing detected signposting

Acknowledgements

This tutorial is based on bioschemas-ghpages-markup-tutorial, bioschemas-github-markup-example and Adding schema.org to a GitHub Pages site.

LJC has received funding from the German Research Foundation (DFG) via the grant for NFDI4DataScience No. 460234259.

We use free SVG icons from Font Awesome.