Releases: pangaea-data-publisher/fuji
v3.4.0
The main changes are: support for DQV as an additional output format for both metrics and FAIR assessment results; a more verbose standard JSON output, which now includes information about the origin of harvested metadata (format, source, method); and newly implemented methods to verify the presence of temporal and spatial coverage metadata, which support the assessment of community-specific metrics from the earth and environmental sciences. In detail, the following changes may affect future F-UJI test results:
- Support for standardised output of metrics and FAIR assessment results as DQV. The F-UJI API now supports RDF output (ttl, jsonld etc.), which returns DQV RDF. The default output is still the F-UJI custom JSON.
- Added some ontology and metadata standard namespaces such as FHIR and geoDCAT.
- Improved DDI mapping and parsing, e.g. improved file type and size detection for distributions (may improve FsF-R1-01MD)
- Improved ISO GCMG mapping and parsing of file size and type (may improve FsF-R1-01MD)
- Data objects which are offered via services (streaming) are now supported for DCAT, schema.org and ISO 19xxx, see: #513. This includes a verification sub-test (FsF-R1-01MD-2-c) which checks whether a service endpoint is given and protocol information is specified in the metadata
- Added a browser-like user agent to mimic browsers in case web scraping detection methods hinder access (HTTP 405)
- Replaced the simple JMESPath-based JSON-LD parsing with RDF parsing
- Improved schema.org handling, e.g. license mapping/parsing now supports CreativeWork licenses in schema.org (may improve FsF-R1.1-01)
- Swagger output format JSON is changed so it now also includes the harvested metadata as well as metadata sources and formats (similar to the harvest method)
- Improved RDF handling for complicated graphs: F-UJI now tries to detect the main entity instead of picking Dataset classes from a graph which actually describes something else.
- Added a warning in case the resource type is not indicated or differs from ‘Dataset’ so users may decide if F-UJI is appropriate for the test.
- Improved schema.org and RO-Crate handling: FsF-R1-01MD and FsF-F3-01M now also consider MediaObjects which are indicated as hasPart of a Dataset
- New metadata properties are parsed to support community-specific tests (geo, env): spatial coverage and temporal coverage in DCAT, schema.org, DC, DDI, EML, ISO etc.
- New tests implemented for env/earth science metrics which verify the presence of spatial or temporal coverage info
- New prototype YAML file for a first potential env/earth community metric
- Some pseudo namespaces which are included in some LOV collections are now excluded from the LOV list since they are identifiers: "orcid.org", "doi.org", "ror.org", "zenodo.org", "isni.org", "github.com", "arxiv.org". This may result in lower scores in FsF-I2-01M
- Due to a parsing bug, empty property values or null/None values were sometimes stringified to “None” or “null” and scored as valid values. These are no longer regarded as valid values, so some scores might be lower in 3.4.0.
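The effect of the last fix can be illustrated with a minimal sketch. This is not F-UJI's actual code: the helper names `clean_value` and `has_valid_value` and the exact set of placeholder strings are assumptions chosen for illustration.

```python
from typing import Any, Optional

# Placeholder strings treated as "no value"; this exact set is an assumption.
EMPTY_MARKERS = {"", "none", "null"}


def clean_value(value: Any) -> Optional[str]:
    """Return a stripped string, or None if the value carries no information."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in EMPTY_MARKERS:
        # A stringified None/null must not count as real metadata.
        return None
    return text


def has_valid_value(metadata: dict, key: str) -> bool:
    """True only if the property exists and holds a real value."""
    return clean_value(metadata.get(key)) is not None
```

With a check of this kind, a record such as `{"size": "None"}` no longer counts as providing a file size, which is why some scores may drop.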
v3.2.0
Changes from 3.1.0 to 3.2.0
- Integration of FAIR testing for software, for more details see the following pull request:
- Improved DCAT handling, now avoids overwriting existing license and access rights info; fixed incorrect handling of distribution info (bytesize type)
- Re3data metadata lookup is now always performed; previously it was only done when no service endpoint was given.
- Improved RDFa handling: image tags are now ignored
- Upgraded connexion to v3; Python 3.11
- Improved XML handling / scheme recognition e.g. for DDI formats
- Improved handling of non HTML “landing content” for DOIs see: #492
- Improved handling of CC licenses, previously these were not always correctly recognized as valid
v3.1.0
The main change in this release is the data_harvester behavior, which now uses threads to download data objects/files. This allows more data files to be included in the assessment. In detail, F-UJI now tries to analyse up to 5 files per mime-type (as listed in the metadata).
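A threaded download with a per-mime-type limit could look roughly like the following sketch. It is not taken from the F-UJI code base: the function names, the worker count and the injected `fetch` callable are assumptions, and only the limit of five files per mime type comes from the note above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

MAX_FILES_PER_TYPE = 5  # limit stated in the release note


def select_files(listed: Iterable[Tuple[str, str]],
                 limit: int = MAX_FILES_PER_TYPE) -> List[str]:
    """Keep at most `limit` URLs per mime type, preserving metadata order."""
    counts: Dict[str, int] = defaultdict(int)
    selected = []
    for url, mime_type in listed:
        if counts[mime_type] < limit:
            counts[mime_type] += 1
            selected.append(url)
    return selected


def download_all(urls: List[str], fetch: Callable[[str], bytes],
                 workers: int = 4) -> Dict[str, bytes]:
    """Fetch all URLs concurrently; `fetch` is supplied by the caller."""
    if not urls:
        return {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `urls`.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library.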
Some other changes to note:
All: Incorrect handling of some landing pages, which caused the evaluator to stop, has been fixed.
R1.1: Licenses packed as lists are now unpacked and correctly identified
I3: In some cases scores for I3 are improved due to the inclusion of schema.org/citation as scanned relation property
R1: Incorrect handling of file sizes given or interpreted as strings like 'None', which were accepted as valid content, caused incorrect (too high) scoring of R1. The score might be lower but is correct now in these cases.
R1: Improved handling of mime types including e.g. charset info (text/plain; charset=US-ASCII) may result in higher score for R1 (FsF-R1-01MD-3)
R1: Improved parsing of content length byte units may improve the scoring.
F2: Improved handling of RDF graphs containing DC or schema.org terms to describe the content may improve findability and other scores
R1.3: F-UJI now uses threads to download more data objects (up to five files/links per claimed content type) which improves its capability to evaluate data content
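The mime type handling mentioned for R1 can be sketched as follows. This is an illustration only, assuming that parameters such as `charset` are simply dropped before comparison; the function names are hypothetical and F-UJI's own implementation may differ.

```python
def normalize_mime_type(content_type: str) -> str:
    """Return the bare, lower-cased media type without parameters."""
    # "text/plain; charset=US-ASCII" -> "text/plain"
    return content_type.split(";", 1)[0].strip().lower()


def mime_types_match(claimed: str, served: str) -> bool:
    """Compare the type claimed in metadata with the one served, ignoring parameters."""
    return normalize_mime_type(claimed) == normalize_mime_type(served)
```

For example, a metadata claim of `text/plain` would then match a served `Content-Type` of `text/plain; charset=US-ASCII`.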
v3.0.0
This new release allows configuration of metric YAML which also affects how tests are performed. More documentation about this will be published soon in the README.
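As a rough illustration of what a configurable metric definition implies, the sketch below runs only the tests that appear in a configuration and takes their scores from it. The dictionary layout, the scores and the test functions are all hypothetical stand-ins; the real F-UJI YAML schema may use different keys.

```python
from typing import Callable, Dict

# Hypothetical, simplified stand-in for a parsed metric YAML file.
# Scores and maturity levels here are made-up example values.
metric_config = {
    "FsF-F1-01M": {"score": 1, "maturity": 3},
    "FsF-R1-01MD": {"score": 4, "maturity": 2},
}

# Hypothetical test functions; each returns True when the check passes.
available_tests: Dict[str, Callable[[dict], bool]] = {
    "FsF-F1-01M": lambda md: bool(md.get("identifier")),
    "FsF-R1-01MD": lambda md: bool(md.get("file_type")),
    "FsF-I2-01M": lambda md: bool(md.get("namespaces")),
}


def run_configured_tests(metadata: dict, config: dict) -> Dict[str, int]:
    """Run only the tests listed in `config` and score them from it."""
    results = {}
    for metric_id, settings in config.items():
        test = available_tests.get(metric_id)
        if test is None:
            continue  # listed in the config but not implemented
        results[metric_id] = settings["score"] if test(metadata) else 0
    return results
```

In this sketch `FsF-I2-01M` is implemented but never run, because it is absent from the configuration.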
Some changes of F-UJI's behaviour have to be mentioned:
The role of the YAML metric definition file is more important now. It also allows defining individual scores and maturity levels, which are no longer hardcoded.
Metrics and tests which are not listed in the YAML files are not performed/assessed; this allows metrics and tests to be switched on/off so that community-specific metrics can be defined in dedicated YAML files.
F-UJI is now able to use different metrics: the REST API now has an additional parameter ‘metric_version’ by which the YAML file can be defined (default metrics_v0.5.yaml)
F-UJI > 2.3.0 has more tests implemented, which allows metrics and tests to be defined in specific YAML files that are more compatible with RDA and The Evaluator:
- FsF-F1-01DD unique identifier of data
- FsF-F1-02DD persistent identifier of data
- FsF-F1-01M which will replace FsF-F1-01D unique identifier of metadata
- FsF-F1-02M which will replace FsF-F1-02D persistent identifier of metadata
- FsF-F3-02M (metadata include identifier of dataset)
- FsF-F4-01M-2 which tests if OAI-PMH, SPARQL or CSW is used to offer metadata
F-UJI no longer uses the first data object for F3, A1, R1 and R1.3 but the first data object which is accessible (HTTP 200)
Fixed a bug which caused wrong scores for R1 because FsF-R1-01MD-3 was sometimes ignoring matching file sizes and types.
F-UJI now also recognizes resource types for R1 if given as URI e.g. schema.org/Dataset
Fixed a bug due to which, in 2.2.5, signposting links to JSON-LD files were incorrectly accepted as a valid search engine support mechanism.
Fixed a bug which accepted stringified ‘None’ as an entry for file type and size and caused wrong scores for R1
Improved license recognition
Improved JSON-LD handling
F-UJI truncates very large data files prior to testing, which caused R1 test FsF-R1-01MD-3 (Data content matches file type and size specified in metadata) to incorrectly compare the expected file size with the truncated size. Now F-UJI compares the expected size with the size given in the HTTP header (if given) to perform this test for truncated files.
Prior to version 2.3.0 F-UJI was correctly detecting valid domain agnostic metadata standards in R1.3 (FsF-R1.3-01M-3) but did not assign any score for this. This bug was fixed for F-UJI >=2.3.0
Prior to version 3.0.0, F-UJI accepted content negotiation in addition to HTML embedding and microdata as a search engine friendly way to offer metadata in FsF-F4-01M (Metadata is offered in such a way that it can be retrieved programmatically). Additionally, F-UJI did not verify the metadata standard and content offered via RDFa/microdata. Now F-UJI exclusively expects schema.org, DC or DCAT as search engine friendly metadata formats offered via HTML embedding and microdata/RDFa. It no longer considers empty RDFa content as it did before.
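The size comparison for truncated files described for FsF-R1-01MD-3 can be sketched as below. This is a simplified illustration, not F-UJI's code: the function names are hypothetical, the comparison is assumed to be an exact match, and the headers are assumed to be a plain mapping keyed by `Content-Length`.

```python
from typing import Mapping, Optional


def observed_size(downloaded: bytes, headers: Mapping[str, str],
                  truncated: bool) -> Optional[int]:
    """Pick the size to compare against the size claimed in the metadata."""
    if not truncated:
        return len(downloaded)
    # For truncated downloads the byte count is meaningless,
    # so fall back to the size announced in the HTTP header.
    raw = headers.get("Content-Length")
    if raw is None or not raw.strip().isdigit():
        return None  # no usable header, so the size cannot be verified
    return int(raw)


def size_matches(expected: int, downloaded: bytes,
                 headers: Mapping[str, str], truncated: bool) -> bool:
    """True if the claimed size equals the observed (or announced) size."""
    size = observed_size(downloaded, headers, truncated)
    return size is not None and size == expected
```

Real HTTP libraries usually expose headers through a case-insensitive mapping, which this sketch does not attempt to reproduce.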
v2.2.5
v2.0.2
Full Changelog: v1.4.9...v.2.0.2
This release is the first which is based on the completely restructured metadata_harvesting class. All metadata and PID collecting methods have moved there from fair_check. This allows easier testing and also using the harvester for other purposes.
v.1.4.9
Includes 1.7.9b
This will be the last version which uses metric 0.4
Improvements:
- Improved signposting handling: better recognition in HTML as well as header; now focusses on metadata and identifier related links and ignores e.g. ORCID author links.
- Improved JSON-LD handling, now tries to identify dataset (preferred) or creative work metadata in case several JSON-LD snippets are given (e.g. one for Webpage and another one for Dataset)
- More mime types now recognized
- Content negotiation now adds a preferred type, e.g. the one found in typed links
- Namespace recognition now case insensitive
- Improved Dublin Core parsing, now case insensitive
- Improved XML mime type recognition
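The JSON-LD improvement listed above, preferring Dataset metadata when several snippets are present, can be sketched as follows. The function names and the prefix stripping are assumptions for illustration, and subtypes of CreativeWork are not handled here.

```python
from typing import List, Optional

# Preference order described in the release note: Dataset first, then CreativeWork.
PREFERRED_TYPES = ["Dataset", "CreativeWork"]


def types_of(snippet: dict) -> List[str]:
    """Return the @type values of a JSON-LD snippet as a list of local names."""
    raw = snippet.get("@type", [])
    values = raw if isinstance(raw, list) else [raw]
    # Strip a namespace prefix such as "https://schema.org/" or "schema:".
    return [str(v).rsplit("/", 1)[-1].rsplit(":", 1)[-1] for v in values]


def pick_main_snippet(snippets: List[dict]) -> Optional[dict]:
    """Return the first snippet of the most preferred type, or None."""
    for wanted in PREFERRED_TYPES:
        for snippet in snippets:
            if wanted in types_of(snippet):
                return snippet
    return None
```

Given one snippet typed `WebPage` and another typed `Dataset`, this selection returns the `Dataset` snippet regardless of their order on the page.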