This repository has been archived by the owner on Apr 27, 2023. It is now read-only.

RFC: JATS support #21

Open
thewilkybarkid opened this issue Jan 28, 2019 · 9 comments
Labels
rfc A request for comments

Comments

@thewilkybarkid
Contributor

Problem

Libero's data model is planned to support schemas like JATS (see #11), but developing solely based on Libero's native schema is slow (as it doesn't exist yet). We have to model all the possibilities, which is not amenable to rushing (especially as we don't want to version it). For example, libero/publisher#5 would require a lot of schema work.

Users like eLife will need non-JATS content, but IJM only have scholarly content and have already investigated converting their archive to JATS.

Suggestion

  • Commit to supporting JATS now and prioritise it.
  • Continue to build up Libero's schema, based on the JATS support that is implemented (ie as non-blocking, possibly follow-up, work).

Concerns

  • What JATS to support. JATS4R has been making progress on recommendations, but they aren't comprehensive and might still be too open. DAR seems too strict.
  • How to support multiple versions of JATS (and flavours?), including the rumoured 2.0.
  • How to handle assets. eLife XML just references a TIF (without an actual URI), whereas we'd want a IIIF endpoint.
  • Complexity of supporting multiple schemas across all services. (This is an existing concern, but doing it now does bring it to the forefront.)
@thewilkybarkid thewilkybarkid added the rfc A request for comments label Jan 28, 2019
@thewilkybarkid
Contributor Author

Related Browser work at libero/browser#30 and schema changes at libero/schemas#14.

@giorgiosironi
Member

What JATS to support

Another way to see this issue is: when there is a discrepancy between two services (client/server, upstream/downstream like content-store and search), what is the source of truth to decide? JATS validity according to its DTD/RelaxNG/XSD is by definition too wide, so it seems we need to "build or buy" a schema anyway to validate inputs.

How to handle assets. eLife XML just references a TIF (without an actual URI), whereas we'd want a IIIF endpoint.

I'd assume the JATS used as the original input is not identical to the JATS served by the API, so there can be processing steps that substitute in URLs. The IIIF format is a strong dependency though, as it evolves more frequently than JATS (unstable should depend on stable rather than the other way around) and is not ubiquitous.
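A processing step of that kind could look roughly like the sketch below. It is only an illustration: the IIIF server address `iiif.example.org`, the `article_id/filename` identifier scheme, and the function name `substitute_asset_urls` are assumptions, not anything Libero or eLife defines. It rewrites bare file names in `<graphic xlink:href>` to absolute IIIF base URIs and leaves hrefs that are already absolute untouched.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

# Hypothetical IIIF image server; not a real Libero or eLife endpoint.
IIIF_BASE = "https://iiif.example.org/images"


def substitute_asset_urls(jats_xml: str, article_id: str) -> str:
    """Replace bare file names in <graphic xlink:href> with IIIF base URIs."""
    root = ET.fromstring(jats_xml)
    href = f"{{{XLINK}}}href"
    for graphic in root.iter("graphic"):
        name = graphic.get(href)
        if name is None or "://" in name:
            continue  # no reference, or already an absolute URI
        # IIIF identifiers must have "/" percent-encoded (assumed id scheme).
        identifier = quote(f"{article_id}/{name}", safe="")
        graphic.set(href, f"{IIIF_BASE}/{identifier}")
    return ET.tostring(root, encoding="unicode")


source = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<body><fig id="fig1"><graphic xlink:href="elife-00001-fig1.tif"/></fig></body>'
    "</article>"
)
result = substitute_asset_urls(source, "00001")
```

Keeping the substitution in one step at ingestion means the stored input stays plain JATS, and only the served document depends on IIIF.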

Complexity of supporting multiple schemas across all services

If we have a top-level Libero element acting as a wrapper, this should be dealt with by the same tooling that would validate Libero documents in general. It relies on integrating, for example, the RelaxNG JATS definitions into the schemas so that they fit with the Libero RelaxNG ones that are there now.

@thewilkybarkid
Contributor Author

I'd assume the JATS used as the original input is not identical to the JATS served by the API, so there can be processing steps that substitute in URLs.

Agreed. (Not restricted to JATS either, as all assets will probably get moved around etc.)

If we have a top-level Libero element acting as a wrapper, this should be dealt with by the same tooling that would validate Libero documents in general. It relies on integrating, for example, the RelaxNG JATS definitions into the schemas so that they fit with the Libero RelaxNG ones that are there now.

Embedding is quite simple. But

Complexity of supporting multiple schemas across all services

was meant to refer to actually using the data, eg Browser being able to convert different types of XML into consistent HTML, Search being able to index different types of XML.

@giorgiosironi
Member

eg Browser being able to convert different types of XML into consistent HTML, Search being able to index different types of XML.

Examples:

  • browser would have to read information (e.g. the title) from multiple possible places
  • search would have to index different kinds of services
  • issues (if it exists) could use mostly metadata rather than using the body of the article.
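To make the first example concrete, here is a minimal sketch of a service reading the title from a different place per format. The JATS path is the standard `front/article-meta/title-group/article-title`; the Libero namespace `http://libero.pub` and the `front`/`title` element names are assumptions made for illustration and should be checked against libero/schemas.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for Libero's native schema; verify against libero/schemas.
LIBERO_NS = "http://libero.pub"

# One lookup path per supported root element.
TITLE_PATHS = {
    "article": "front/article-meta/title-group/article-title",  # JATS
    f"{{{LIBERO_NS}}}front": f"{{{LIBERO_NS}}}title",  # Libero (assumed shape)
}


def extract_title(xml: str) -> str:
    """Return the plain-text title of a document in any supported format."""
    root = ET.fromstring(xml)
    path = TITLE_PATHS.get(root.tag)
    if path is None:
        raise ValueError(f"unsupported root element: {root.tag}")
    node = root.find(path)
    if node is None:
        raise ValueError("document has no title")
    # itertext() flattens inline markup such as <italic>.
    return "".join(node.itertext()).strip()
```

Every format added means another entry in a table like `TITLE_PATHS`, in every service that reads the content, which is the complexity being discussed.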

I guess some very common information that can be used in other services, like id and title or authorLine, would be extracted into the Libero wrapper elements; services that only need to list or link to an article would be better off, while services that make use of the content necessarily have to support multiple formats. In practice in Continuum we had:

  • lax: the article store
  • elife-metrics: collecting views for a certain id
  • journal-cms: indexing articles to attach cover images
  • search: indexing all the content
  • observer: reporting; indexes all sorts of data, e.g. produces an RSS feed

Since the listing-based services would return ids only, they wouldn't necessarily need to know about JATS.
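A rough sketch of that split, under stated assumptions: the wrapper namespace `http://example.org/wrapper` and the `item`, `id`, `title` and `content` element names are hypothetical and are not Libero's actual schema. The JATS document is embedded unchanged, and a listing-style service reads only the wrapper fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper namespace and element names, for illustration only.
WRAPPER_NS = "http://example.org/wrapper"
ET.register_namespace("w", WRAPPER_NS)


def wrap_jats(jats_xml: str) -> ET.Element:
    """Copy id and title out of a JATS article into wrapper elements."""
    article = ET.fromstring(jats_xml)
    meta = article.find("front/article-meta")
    title = None if meta is None else meta.find("title-group/article-title")
    if meta is None or title is None:
        raise ValueError("JATS document lacks article metadata or a title")
    item = ET.Element(f"{{{WRAPPER_NS}}}item")
    ET.SubElement(item, f"{{{WRAPPER_NS}}}id").text = meta.findtext("article-id")
    ET.SubElement(item, f"{{{WRAPPER_NS}}}title").text = "".join(title.itertext())
    # The original JATS is embedded as-is for content-aware services.
    ET.SubElement(item, f"{{{WRAPPER_NS}}}content").append(article)
    return item


def list_entry(item: ET.Element) -> tuple:
    """What a listing service needs: no knowledge of JATS required."""
    return (
        item.findtext(f"{{{WRAPPER_NS}}}id"),
        item.findtext(f"{{{WRAPPER_NS}}}title"),
    )
```

Only `wrap_jats` knows where JATS keeps its metadata; `list_entry` would work the same for any other embedded format.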

@GiancarloFusiello

Sorry if this is very basic, but I'm trying to understand what the definition of Libero's data model is.

@thewilkybarkid
Contributor Author

@GiancarloFusiello, essentially what's in https://github.com/libero/schemas. Rather than being one big schema, it's broken down into the core (ie the required part, which is as small as possible), then a whole load of extensions that you can enable (so the opposite of JATS, which is one massive schema that you have to cut down to the parts that you want). Currently there's only 1 extension (italic text) alongside the required parts (eg content item).

The walking skeleton has this in more detail: a bunch of schemas for different publishers comprised of a set of extensions, but with some customisations. So your schema for your content, sharing where possible but not blocked from doing anything.

@giorgiosironi
Member

For example, libero/publisher#5 would require a lot of schema work.

This is what makes me inclined to 👍 this RFC: we can build the current features now with a borrowed schema (some version of JATS) and introduce a different (Libero) schema when we know more about the complexity of putting all the services together.

@de-code

de-code commented Jan 30, 2019

  • What JATS to support. JATS4R has been making progress on recommendations, but they aren't comprehensive and might still be too open. DAR seems too strict.

Could the default be DAR? It seems DAR will need to be extended to cater for missing use cases. Or do you mean it's too strict in what the structure should look like? If it's the latter, and the JATS is served by the API, could we do an up-front transformation step, like Giorgio suggested, to make it DAR?

There may already be existing efforts to normalise JATS. Patrice from GROBID has, for example, created Pub2TEI (the output here is obviously TEI, which we could also convert back to JATS if we wanted to, but that might make the pipeline more complicated).

Using TEI altogether could be another option. It may not cater for IJM, although with a translation tool like Pub2TEI it might?

*None of the above is meant to favour one standard over the other.

Having the option to use an existing "standard" seems to make sense to me.

@Melissa37

Could the default be DAR? It seems DAR will need to be extended to cater for missing use cases. Or do you mean it's too strict in what the structure should look like? If it's the latter, and the JATS is served by the API, could we do an up-front transformation step, like Giorgio suggested, to make it DAR?

DAR is very strict and is being developed for a tool for editing and so decisions are made based on getting one product ready for use. Because of the decisions being made for it, it could likely alienate 50% of publishers because their XML decisions would not work in it. Examples being how authors and affiliations are linked to each other.

JATS seems like a good standard to work with, as most publishers who create full text are familiar with it, and the learning curve of a new standard may be off-putting.

IMO :-)


5 participants