---
output: html_document
---

# Introduction {-}

Over the last decade, the supply of socio-economic data available to researchers and policy makers has increased considerably, along with advances in the tools and methods available to exploit these data. This provides the research community and development practitioners with unprecedented opportunities to increase the use and value of existing data.
:::quote
Data that were initially collected with one intention can be reused for a completely different purpose. (…) Because the potential of data to serve a productive use is essentially limitless, enabling the reuse and repurposing of data is critical if data are to lead to better lives. ([World Bank, World Development Report 2021](https://www.worldbank.org/en/publication/wdr2021))
:::

But data can be challenging to find, access, and use, resulting in many valuable datasets remaining underutilized. Data repositories and libraries, and the data catalogs they maintain, play a crucial role in making data more discoverable, visible, and usable. But many of these catalogs are built on sub-optimal standards and technological solutions, resulting in limited findability and visibility of their assets. To address these shortcomings, a better marketplace for data is needed.

A better marketplace for data can be developed on the model of large e-commerce platforms, which are designed to effectively and efficiently serve both buyers and sellers. In a marketplace for data, the "buyers" are the data users, and the "sellers" are the organizations who own or curate datasets and seek to make them available to users -- preferably free of charge to maximize the use of data. Data platforms must be optimized to provide data users with convenient ways of identifying, locating, and acquiring data (which requires the implementation of a user-friendly search and recommendation system), and to provide data owners with a trusted mechanism to make their datasets visible and discoverable and to share them in a cost-effective, convenient, and safe manner.

Achieving such objectives requires detailed and structured metadata that properly describe the data products. Indeed, search algorithms and recommender systems exploit metadata, not data. Metadata are essential to the credibility, discoverability, visibility, and usability of the data. Adopting metadata standards and schemas is a practical and efficient solution to achieve completeness and quality of the metadata. This Guide presents a set of recommended standards and schemas covering multiple types of data, along with guidance for their implementation. The data types covered include microdata, statistical tables, indicators and time series, geographic datasets, text, images, video recordings, and programs and scripts.

Chapter 1 of the Guide outlines the challenges associated with finding and using data. Chapter 2 describes the essential features of a modern data catalog, and Chapter 3 explains how rich and structured metadata, compliant with the metadata standards and schemas we describe in the Guide, can enable advanced search algorithms and recommender systems. Finally, Chapters 4 to 13 present the recommended standards and schemas, along with examples of their use.

This Guide was produced by the Office of the World Bank Chief Statistician as a reference guide for World Bank staff and for partners involved in the curation and dissemination of data related to social and economic development. The standards and schemas it describes are used by the World Bank in its data management and dissemination systems, and for the development of systems and tools for the acquisition, documentation, cataloguing, and dissemination of data. Among these tools are a specialized **Metadata Editor**, designed to facilitate the documentation of datasets in compliance with the recommended standards and schemas, and a cataloguing application ("NADA"). Both applications are openly available.
---
output: html_document
---

# (PART) RATIONALE AND OBJECTIVES {-}

# The challenge of finding, assessing, accessing, and using data {#chapter01}

In the realm of data sharing policies adopted by numerous national and international organizations, a common challenge arises for researchers and other data users: the practicality of finding, accessing, and using data. Navigating through an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and effectively analyzing them. This challenge is characterized by issues such as insufficient metadata, limitations of data discovery systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the following sections, we delve into these challenges.

## Finding and assessing data

Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks, often referred to as *tribal knowledge*, to locate and obtain the data they require. This may lead to the use of *convenient* data that may not be the most relevant. Others may encounter datasets of interest in academic publications, although tracing these can be challenging due to the inconsistent or non-standardized citation of datasets. However, most data users use general search engines or turn to specialized data catalogs to discover relevant data resources.

Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, such as a query for "population of India in 2023," yield instant informative responses (though not always from the most authoritative source). Even less direct queries, like "indicators of malnutrition in Yemen," return adequate responses, as the engine can "understand" concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and the underweight population. Additionally, generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although it is not without the risk of errors and inaccuracies. However, these search engines may not be optimized to identify the most relevant data when the user's requirements cannot be expressed in the form of a straightforward query. For instance, internet search engines might offer limited assistance to a researcher seeking "satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition."

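The distinction between lexical and semantic matching can be sketched in a few lines of Python. The hand-built concept map below is a hypothetical stand-in for the learned semantic models that production engines use:

```python
# A purely lexical engine only matches literal query terms; a semantic
# layer expands the query with related concepts before matching.

# Hypothetical, hand-curated concept map (a stand-in for a learned model).
CONCEPT_MAP = {
    "malnutrition": {"stunting", "wasting", "underweight"},
}

def lexical_match(query_terms, doc_terms):
    # True only if the document shares a literal term with the query.
    return bool(set(query_terms) & set(doc_terms))

def semantic_match(query_terms, doc_terms):
    # Expand each query term with related concepts, then match lexically.
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= CONCEPT_MAP.get(term, set())
    return bool(expanded & set(doc_terms))

doc = ["prevalence", "of", "stunting", "among", "children"]
print(lexical_match(["malnutrition"], doc))   # False: no literal overlap
print(semantic_match(["malnutrition"], doc))  # True: expands to "stunting"
```

The same principle underlies the "indicators of malnutrition in Yemen" example above: the engine associates the query concept with the anthropometric indicators actually named in the metadata.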
While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking pertinent data. Nonetheless, the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal search indexes and algorithms. With the rapid advancements in AI-based solutions, many of which are available as open-source software, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems.

The solution to improve data discoverability involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within specialized data catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines index and use to identify and locate data of interest.

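Because search engines index metadata rather than the data itself, even a minimal inverted index built over catalog metadata shows why metadata completeness drives discoverability. The catalog entries below are hypothetical:

```python
from collections import defaultdict

# Sketch: an inverted index over dataset *metadata*. A dataset is only
# findable through the terms its metadata actually contains.
catalog = {
    "DS001": "Albania Living Standards Measurement Survey 2012 household microdata",
    "DS002": "Yemen anthropometric indicators stunting wasting children",
}

# Map each metadata token to the set of dataset IDs that contain it.
index = defaultdict(set)
for dataset_id, metadata_text in catalog.items():
    for token in metadata_text.lower().split():
        index[token].add(dataset_id)

def search(query):
    # Return datasets whose metadata contains every query term.
    hits = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(search("stunting yemen"))  # {'DS002'}
```

A dataset described by a bare title would be invisible to most queries; richer, structured metadata multiplies the terms under which it can be found.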
Metadata is the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access to both relevant datasets and the metadata essential for evaluating the data's suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should allocate resources and time exclusively to obtain data that is known to be of high quality and relevance. Evaluating a dataset's fitness for a specific purpose necessitates different metadata elements for various data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is minimal, the dataset would not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time.

## Accessing data

Accessing data is a multifaceted challenge that encompasses legal, ethical, and practical considerations. To ensure that data access is lawful, ethical, efficient, and enables relevant and responsible use of the data, data providers and users must adhere to specific principles and practices:

- Data providers must ensure that they possess the legal rights to share the data and define clear usage rights for data users.
- Data users must understand how they can use the data, whether for research, commercial purposes, or other applications, and they must strictly adhere to the terms of use.
- Data access must comply with data privacy laws and ethical standards. Sensitive or personally identifiable information must be handled with care to protect individuals' privacy.
- Data providers must furnish comprehensive metadata that provides context and a full understanding of the data. Metadata should include details about the data's provenance, encompassing its history, transformations, and processing steps. Understanding how the data was created and modified is essential for accurate and responsible analysis.
- Data should be available in user-friendly formats compatible with common data analysis tools, such as CSV, JSON, or Excel.
- Data should be accessible through various means, accommodating users' preferences and capacities. This may involve offering downloadable files, providing access through web-based tools, and supporting data streaming.
- APIs are essential for enabling programmable data access, allowing researchers to retrieve and manipulate data programmatically for integration into their research workflows and applications.

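As a sketch of what programmatic access looks like in practice, the snippet below builds a request URL following the URL pattern of the World Bank Indicators API (v2) and parses a response of that API's general shape. The payload is hard-coded with illustrative values so the example runs offline:

```python
import json

def indicator_url(country, indicator):
    # Request URL in the style of the World Bank Indicators API (v2).
    return (f"https://api.worldbank.org/v2/country/{country}"
            f"/indicator/{indicator}?format=json")

# Hard-coded sample in that API's two-element shape: the first element
# holds paging metadata, the second the observations. Values illustrative.
sample_response = json.loads("""
[{"page": 1, "total": 2},
 [{"date": "2023", "value": 1428627663},
  {"date": "2022", "value": 1417173173}]]
""")

def to_series(response):
    # Turn the list of observations into a {year: value} mapping.
    return {obs["date"]: obs["value"] for obs in response[1]}

print(indicator_url("IND", "SP.POP.TOTL"))
print(to_series(sample_response))
```

In a real workflow the URL would be fetched with an HTTP client; the point is that a documented, stable API lets researchers script retrieval instead of downloading files by hand.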
Data users in developing countries often encounter additional challenges in accessing data, including:

- Lack of resources: Researchers in developing countries may lack the financial resources to purchase data or access data stored in expensive cloud-based repositories.
- Lack of infrastructure: Researchers in developing countries may lack access to the high-speed internet and computing resources required for working with large datasets.
- Lack of expertise: Researchers in developing countries may lack the expertise to work with complex data formats and utilize data analysis tools.

These specific challenges should be considered when developing data dissemination systems.

## Using data

The challenge for data users extends beyond discovering data to obtaining all the necessary information for a comprehensive understanding of the data and for responsible and appropriate use. A single indicator label, such as "unemployment rate (%)," can obscure significant variations by country, source, and time. The international recommendations for the definition and calculation of the "unemployment rate" have evolved over time, and not all countries employ the same data collection instrument (e.g., labor force surveys) to gather the underlying data. Detailed metadata should always accompany data on online data dissemination platforms. This association should be close; relevant metadata should ideally be no more than one click away from the data. This is particularly crucial when a platform publishes data from multiple sources that are not fully harmonized.

:::quote
The scope and meaning of labor statistics, in general, are determined by their source and methodology, which holds true for the unemployment rate. To interpret the data accurately, it is crucial to understand what the data convey, how they were collected and constructed, and to have information on the relevant metadata. The design and characteristics of the data source, typically a labor force survey or a similar household survey for the unemployment rate, especially in terms of definitions and concepts used, geographical and age coverage, and reference periods, have significant implications for the resulting data. Taking these aspects into account is essential when analyzing the statistics. Additionally, it is crucial to seek information on any methodological changes and breaks in series to assess their impact on trend analysis and to keep in mind methodological differences across countries when conducting cross-country studies. (From Quick guide on interpreting the unemployment rate, International Labour Office – Geneva: ILO, 2019, ISBN: 978-92-2-133323-4 (web pdf))
:::

Whenever possible, reproducible or replicable scripts used with the data, along with the analytical output of these scripts, should be published alongside the data. These scripts can be highly valuable to researchers who wish to expand the scope of previous data analysis or reuse parts of the code, and to students who can learn from reading and replicating the work of experienced analysts. To enhance data usability, we have developed a specific metadata schema for documenting research projects and scripts.

## A FAIR solution

To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, search engines, and increased accessibility. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469).

It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To ensure data is findable, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more accessible. Moreover, multiple modes of data access should be available to enhance accessibility, while data should be made interoperable to promote data sharing and reusability. Detailed metadata, including fitness-for-purpose assessments, should be displayed alongside scripts and permanent availability options, such as a DOI, to encourage reuse.
---
output: html_document
---

# The features of a modern data dissemination platform {#chapter02}

In the introductory section of this Guide, we proposed that a data dissemination platform should be modeled after highly successful e-commerce platforms. These platforms are designed to optimally satisfy the requirements and expectations of both buyers (in our context, the data users) and sellers (in our context, the data providers who make their datasets accessible through a data catalog). In this chapter, we outline the crucial features that a modern online data catalog should incorporate to adhere to this model and effectively cater to the diverse needs and expectations of its users.

Our objective is to provide recommendations for developing data catalogs that encompass lexical search and semantic search, filtering, advanced search functionality, interactive user interfaces, and the capability to operate as a data recommender system. To define these features, we approach the topic from three distinct perspectives: the viewpoint of data users, who represent a highly diverse community with varying needs, preferences, expectations, and capabilities; the standpoint of data suppliers, who either publish their data or delegate the task to a data library; and the perspective of catalog administrators, responsible for curating and disseminating data in a responsible, effective, and efficient manner while optimizing both user and supplier satisfaction.

The creation of a contemporary data dissemination platform is a collaborative endeavor, engaging data curators, user experience (UX) experts, designers, search engineers, and subject matter specialists with a profound understanding of both the data and the users' requirements and preferences. The development process should also include the active participation of the users themselves, allowing them to provide feedback that directly influences the system's design.

## Features for data users

In order to cultivate a favorable user experience, online data catalogs must offer an intuitive and efficient interface, allowing users to effortlessly access the most pertinent datasets. To meet user expectations effectively, one should emphasize simplicity, predictability, relevance, speed, and reliability. Integrating these principles into the design of data catalogs can deliver a seamless and user-friendly experience, akin to the convenience and ease provided by well-known internet search engines and e-commerce platforms. This, in turn, streamlines the process of discovering and obtaining the necessary data, making it quick and hassle-free for users.

### Simple search interface

The default option to search for data in a specialized catalog should be a single search box, following the model of general search engines. The objective of the search algorithm should then be to "understand" the user's query as accurately as possible, potentially by parsing and enhancing the query, and returning the most relevant results ranked in order of importance.
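A minimal sketch of that query-understanding step might look as follows; the stop-word list and synonym table are illustrative placeholders for what a production system would derive from query logs or language models:

```python
import re

# Sketch of query "understanding" behind a single search box:
# normalize the raw text, drop stop words, and expand with related terms.
STOP_WORDS = {"the", "of", "in", "for", "a"}          # illustrative
SYNONYMS = {"poverty": ["welfare", "living standards"]}  # illustrative

def parse_query(raw):
    # Lowercase and tokenize on alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    # Drop stop words, keeping original term order.
    terms = [t for t in tokens if t not in STOP_WORDS]
    # Append synonym expansions so the index can match related metadata.
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(parse_query("The poverty of India in 2023"))
# ['poverty', 'india', '2023', 'welfare', 'living standards']
```

The expanded term list is then handed to the search index; ranking the matches in order of relevance is a separate step.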
The use of structured data described in section 1.6.2 requires mapping the relevant elements of some of the metadata standards and schemas described in the Guide to the schema.org standard. We provide here a suggested selection and mapping for the core set of elements (we do not attempt to map all possible elements that are common to our schemas and schema.org).
| schema.org/dataset | DDI CodeBook | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |
Example:

```html
<html>
<head>
  <script type="application/ld+json">
  {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Albania Living Standards Measurement Survey 2012 (LSMS 2012)",
    "description": "The Living Standards Measurement Survey (LSMS) is a multi-purpose household survey conducted to measure living conditions and poverty situation, and to help policymakers in monitoring and developing social programs. LSMS has been carried out in Albania in the context of continuing monitoring of poverty and the creation of policy evaluation system in the framework of the National Strategy for Development and Integration (previously the National Strategy for Economic and Social Development). The first Albania LSMS was conducted in 2002, followed by 2003, 2004, 2005, 2008 and 2012 surveys. In 2012, 6,671 households participated in the survey.",
    "url": "https://microdata.worldbank.org/index.php/catalog/1970",
    "identifier": ["ALB_2012_LSMS_v01_M_v01_A_PUF"],
    "keywords": [
      "demographic characteristics",
      "education",
      "communication",
      "labor",
      "employment",
      "non-farm business",
      "migration",
      "remittances",
      "subjective poverty",
      "health",
      "fertility",
      "non-food expenditures",
      "dwelling",
      "utilities",
      "durable goods",
      "daily food consumption"
    ],
    "license": "",
    "isAccessibleForFree": true,
    "creator": [
      {
        "@type": "Organization",
        "url": "http://www.instat.gov.al/en/",
        "name": "Institute of Statistics of Albania",
        "contactPoint": {
          "@type": "ContactPoint",
          "email": "info@instat.gov.al"
        }
      },
      {
        "@type": "Organization",
        "url": "https://www.worldbank.org/",
        "name": "World Bank",
        "contactPoint": {
          "@type": "ContactPoint",
          "contactType": "LSMS technical support",
          "email": "lsms@worldbank.org"
        }
      }
    ],
    "funder": {
      "@type": "Organization",
      "name": "World Bank"
    },
    "includedInDataCatalog": {
      "@type": "DataCatalog",
      "name": "World Bank Microdata Library",
      "url": "https://microdata.worldbank.org/index.php/home"
    },
    "distribution": [
      {
        "@type": "DataDownload",
        "encodingFormat": "SPSS Windows (.sav)",
        "contentUrl": "http://www.instat.gov.al/en/figures/micro-data/"
      }
    ],
    "temporalCoverage": "2012",
    "spatialCoverage": {
      "@type": "Place",
      "name": "Albania"
    }
  }
  </script>
</head>
<body>
</body>
</html>
```
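Records like the one above can be generated programmatically from catalog metadata. The sketch below builds a minimal schema.org Dataset object from a few hypothetical input field names; a real implementation would apply the full element mapping given in the tables in this annex:

```python
import json

def to_schema_org(meta):
    # Build a minimal schema.org/Dataset JSON-LD record from catalog
    # metadata. Input keys ("title", "idno", ...) are hypothetical; a
    # real mapping would read the corresponding DDI Codebook elements.
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": meta["title"],
        "identifier": [meta["idno"]],
        "keywords": meta.get("keywords", []),
        "temporalCoverage": meta.get("year", ""),
    }

record = to_schema_org({
    "title": "Albania Living Standards Measurement Survey 2012",
    "idno": "ALB_2012_LSMS_v01_M_v01_A_PUF",
    "keywords": ["education", "labor"],
    "year": "2012",
})
print(json.dumps(record, indent=2))
```

The serialized object would then be embedded in the catalog page inside a `<script type="application/ld+json">` tag, as in the example above.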
| schema.org/dataset | ISO 19139 | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:
| schema.org/dataset | INDICATOR schema | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:
| schema.org/dataset | TABLES schema | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:
The complete list of elements available in schema.org to document an image object is available at https://schema.org/ImageObject. We only show in the table below a selection of the ones we consider the most relevant and frequently available. Images can be documented either using the IPTC-based schema, or the Dublin Core (DCMI)-based schema.
| schema.org/ImageObject | IMAGE schema (IPTC) | Recommendation |
|---|---|---|
| name | | |
| abstract | | |
| creator | | |
| provider | | |
| sourceOrganization | | |
| dateCreated | | |
| keywords | | |
| contentLocation | | |
| contentReferenceTime | | |
| copyrightHolder | | |
| copyrightNotice | | |
| copyrightYear | | |
| creditText | | |
| isAccessibleForFree | | |
| license | | |
| acquireLicensePage | | |
| contentUrl | | |
| schema.org/ImageObject | IMAGE schema (DCMI) | Recommendation |
|---|---|---|
| name | | |
| abstract | | |
| creator | | |
| provider | | |
| sourceOrganization | | |
| dateCreated | | |
| keywords | | |
| contentLocation | | |
| contentReferenceTime | | |
| copyrightHolder | | |
| copyrightNotice | | |
| copyrightYear | | |
| creditText | | |
| isAccessibleForFree | | |
| license | | |
| acquireLicensePage | | |
| contentUrl | | |
Example:

| JSON Schema | DDI/XML CodeBook 2.5 | Title |
|---|---|---|
| doc_desc | docDscr | |
| doc_desc/title | docDscr/citation/titlStmt/titl | Document title |
| doc_desc/idno | docDscr/citation/titlStmt/IDNo | Unique ID number for the document |
| doc_desc/producers | docDscr/citation/prodStmt/producer | Producers |
| - name | . | Name |
| - abbr | - abbr | Abbreviation |
| - affiliation | - affiliation | Affiliation |
| - role | - role | Role |
| doc_desc/prod_date | docDscr/citation/prodStmt/prodDate | Date of Production |
| doc_desc/version_statement | docDscr/citation/verStmt | Version Statement |
| doc_desc/version_statement/version | docDscr/citation/verStmt/version | Version |
| doc_desc/version_statement/version_date | docDscr/citation/verStmt/version/@date | Version Date |
| doc_desc/version_statement/version_resp | docDscr/citation/verStmt/verResp | Version Responsibility Statement |
| doc_desc/version_statement/version_notes | docDscr/citation/verStmt/notes | Version Notes |
| study_desc | stdyDscr | |
| study_desc/title_statement | stdyDscr/citation/titlStmt | |
| study_desc/title_statement/idno | stdyDscr/citation/titlStmt/IDNo | Unique user defined ID |
| study_desc/title_statement/identifiers | | Other identifiers |
| - type | | Identifier type |
| - identifier | | Identifier |
| study_desc/title_statement/title | stdyDscr/citation/titlStmt/titl | Survey title |
| study_desc/title_statement/sub_title | stdyDscr/citation/titlStmt/subTitl | Survey subtitle |
| study_desc/title_statement/alternate_title | stdyDscr/citation/titlStmt/altTitl | Abbreviation or Acronym |
| study_desc/title_statement/translated_title | stdyDscr/citation/titlStmt/parTitl | Translated Title |
| study_desc/authoring_entity | stdyDscr/citation/rspStmt/AuthEnty | Authoring entity/Primary investigators |
| - name | . | Agency Name |
| - affiliation | - affiliation | Affiliation |
| study_desc/oth_id | stdyDscr/citation/rspStmt/othId | Other Identifications/Acknowledgments |
| - name | . | Name |
| - role | - role | Role |
| - affiliation | - affiliation | Affiliation |
| study_desc/production_statement | stdyDscr/citation/prodStmt | Production Statement |
| study_desc/production_statement/producers | stdyDscr/citation/prodStmt/producer | Producers |
| - name | . | Name |
| - abbr | - abbr | Abbreviation |
| - affiliation | - affiliation | Affiliation |
| - role | - role | Role |
| study_desc/production_statement/copyright | stdyDscr/citation/prodStmt/copyright | Copyright |
| study_desc/production_statement/prod_date | stdyDscr/citation/prodStmt/prodDate | Production Date |
| study_desc/production_statement/prod_place | stdyDscr/citation/prodStmt/prodPlac | Production Place |
| study_desc/production_statement/funding_agencies | stdyDscr/citation/prodStmt/fundAg | Funding Agency/Sponsor |
| - name | . | Funding Agency/Sponsor |
| - abbr | - abbr | Abbreviation |
| - grant | - stdyDscr/citation/prodStmt/fundAg | Grant Number |
| - role | - role | Role |
| study_desc/distribution_statement | stdyDscr/citation/distStmt | Distribution Statement |
| study_desc/distribution_statement/distributors | stdyDscr/citation/distStmt/distrbtr | Distributor |
| - name | . | Organization name |
| - abbr | - abbr | Abbreviation |
| - affiliation | - affiliation | Affiliation |
| - uri | - uri | URI |
| study_desc/distribution_statement/contact | stdyDscr/citation/distStmt/contact | Contact |
| - name | . | Name |
| - affiliation | - affiliation | Affiliation |
| - uri | - uri | URI |
| study_desc/distribution_statement/depositor | stdyDscr/citation/distStmt/depositr | Depositor |
| - name | . | Name |
| - abbr | - abbr | Abbreviation |
| - affiliation | - affiliation | Affiliation |
| - uri | | URI |
| study_desc/distribution_statement/deposit_date | stdyDscr/citation/distStmt/depDate | Date of Deposit |
| study_desc/distribution_statement/distribution_date | stdyDscr/citation/distStmt/distDate | Date of Distribution |
| study_desc/series_statement | stdyDscr/citation/serStmt | Series Statement |
| study_desc/series_statement/series_name | stdyDscr/citation/serStmt/serName | Series Name |
| study_desc/series_statement/series_info | stdyDscr/citation/serStmt/serInfo | Series Information |
| study_desc/version_statement | stdyDscr/citation/verStmt | Version Statement |
| study_desc/version_statement/version | stdyDscr/citation/verStmt/version | Version |
| study_desc/version_statement/version_date | stdyDscr/citation/verStmt/version/@date | Version Date |
| study_desc/version_statement/version_resp | stdyDscr/citation/verStmt/verResp | Version Responsibility Statement |
| study_desc/version_statement/version_notes | stdyDscr/citation/verStmt/notes | Version Notes |
| study_desc/bib_citation | stdyDscr/citation/biblCit | Bibliographic Citation |
| study_desc/bib_citation_format | stdyDscr/citation/biblCit/@format | Bibliographic Citation Format |
| study_desc/holdings | stdyDscr/citation/holdings | Holdings Information |
| - name | . | Name |
| - location | - location | Location |
| - callno | - callno | Callno |
| - uri | - uri | URI |
| study_desc/study_notes | stdyDscr/citation/notes | Study notes |
| study_desc/study_authorization | stdyDscr/studyAuthorization | Study Authorization |
| study_desc/study_authorization/date | stdyDscr/studyAuthorization/@date | Authorization Date |
| study_desc/study_authorization/agency | stdyDscr/studyAuthorization/authorizingAgency | Authorizing Agency |
| - name | . | Funding Agency/Sponsor |
| - affiliation | - affiliation | Affiliation |
| - abbr | - abbr | Abbreviation |
| study_desc/study_authorization/authorization_statement | stdyDscr/studyAuthorization/authorizationStatement | Authorization Statement |
| study_desc/study_info | stdyDscr/stdyInfo | Study Scope |
| study_desc/study_info/study_budget | stdyDscr/stdyInfo/studyBudget | Study Budget |
study_desc/study_info/keywords | +stdyDscr/stdyInfo/subject/keyword | ++ |
- keyword | +. | +Keyword | +
- vocab | +- vocab | +Vocabulary | +
- uri | +- vocabURI | +uri | +
study_desc/study_info/topics | +stdyDscr/stdyInfo/subject/topcClas | +Topic Classification | +
- topic | +. | +Topic | +
- vocab | +- vocab | +Vocab | +
- uri | +- vocabURI | +URI | +
study_desc/study_info/abstract | +stdyDscr/stdyInfo/abstract | +Abstract | +
study_desc/study_info/time_periods | +stdyDscr/stdyInfo/sumDscr/timePrd | +Time periods (YYYY/MM/DD) | +
- start | ++ | Start date | +
- end | ++ | End date | +
- cycle | ++ | Cycle | +
study_desc/study_info/coll_dates | +stdyDscr/stdyInfo/sumDscr/collDate | +Dates of Data Collection (YYYY/MM/DD) | +
- start | ++ | Start date | +
- end | ++ | End date | +
- cycle | ++ | Cycle | +
study_desc/study_info/nation | +stdyDscr/stdyInfo/sumDscr/nation | +Country | +
- name | +. | +Name | +
- abbreviation | +- abbr | +Country code | +
study_desc/study_info/bbox | +stdyDscr/sumDscr/geoBndBox | +Geographic bounding box | +
- west | +- westBL | +West | +
- east | +- eastBL | +East | +
- south | +- southBL | +South | +
- north | +- northBL | +North | +
study_desc/study_info/bound_poly | +stdyDscr/sumDscr/boundPoly/polygon/point | +Geographic Bounding Polygon | +
- lat | +gringLat | +Latitude | +
- lon | +gringLon | +longitude | +
study_desc/study_info/geog_coverage | +stdyDscr/stdyInfo/sumDscr/geogCover | +Geographic Coverage | +
study_desc/study_info/geog_coverage_notes | +stdyDscr/sumDscr/geogCover/txt | +Geographic Coverage notes | +
study_desc/study_info/geog_unit | +stdyDscr/stdyInfo/sumDscr/geogUnit | +Geographic Unit | +
study_desc/study_info/analysis_unit | +stdyDscr/stdyInfo/sumDscr/anlyUnit | +Unit of Analysis | +
study_desc/study_info/universe | +stdyDscr/stdyInfo/sumDscr/universe | +Universe | +
study_desc/study_info/data_kind | +stdyDscr/stdyInfo/sumDscr/dataKind | +Kind of Data | +
study_desc/study_info/notes | +stdyDscr/stdyInfo/notes | +Study notes | +
study_desc/study_info/quality_statement | +stdyDscr/stdyInfo/qualityStatement | +Quality Statement | +
study_desc/study_info/quality_statement/compliance_description | +stdyDscr/stdyInfo/qualityStatement/standardsCompliance/complianceDescription | +Standard compliance description | +
study_desc/study_info/quality_statement/standards | +stdyDscr/stdyInfo/qualityStatement/standardsCompliance/standard | +Standards | +
- name | +standardName | +Name | +
- producer | +producer * | +Producer | +
study_desc/study_info/quality_statement/other_quality_statement | +stdyDscr/stdyInfo/qualityStatement/otherQualityStatement | +Other quality statement | +
study_desc/study_info/ex_post_evaluation | +stdyDscr/stdyInfo/exPostEvaluation | +Ex-Post Evaluation | +
study_desc/study_info/ex_post_evaluation/completion_date | +stdyDscr/stdyInfo/exPostEvaluation/@completionDate | +Evaluation completion date | +
study_desc/study_info/ex_post_evaluation/type | +stdyDscr/stdyInfo/@type | +Evaluation type | +
study_desc/study_info/ex_post_evaluation/evaluator | +stdyDscr/stdyInfo/exPostEvaluation/evaluator | +Evaluators | +
- name | +. | +Funding Agency/Sponsor | +
- affiliation | +- affiliation | +Affiliation | +
- abbr | +- abbr | +Abbreviation | +
- role | +- role | +Role | +
study_desc/study_info/ex_post_evaluation/evaluation_process | +stdyDscr/stdyInfo/exPostEvaluation/evaluationProcess | +Evaluation process | +
study_desc/study_info/ex_post_evaluation/outcomes | +stdyDscr/stdyInfo/exPostEvaluation/outcomes | +Outcomes | +
study_desc/study_development | +stdyDscr/studyDevelopment | +Study Development | +
study_desc/study_development/development_activity | +stdyDscr/studyDevelopment/developmentActivity | +Development activity | +
- activity_type | +. | +Development activity type | +
- activity_description | +- description | +Development activity description | +
- participants | +- participants | +Participants | +
- resources | +- resources | +Development activity resources | +
- outcome | +- outcome | +Development Activity Outcome | +
study_desc/method | +stdyDscr/method | +Methodology and Processing | +
study_desc/method/data_collection | +stdyDscr/method/dataColl | +Data Collection | +
study_desc/method/data_collection/time_method | +stdyDscr/method/dataColl/timeMeth | +Time Method | +
study_desc/method/data_collection/data_collectors | +stdyDscr/method/dataColl/dataCollector | +Data Collectors | +
- name | +. | +Name | +
- affiliation | ++ | Affiliation | +
- abbr | ++ | Abbreviation | +
- role | ++ | Role | +
study_desc/method/data_collection/collector_training | +stdyDscr/method/dataColl/collectorTraining | +Collector training | +
- type | +@type | +Training type | +
- training | +. | +Training | +
study_desc/method/data_collection/frequency | +stdyDscr/method/dataColl/frequenc | +Frequency of Data Collection | +
study_desc/method/data_collection/sampling_procedure | +stdyDscr/method/dataColl/sampProc | +Sampling Procedure | +
study_desc/method/data_collection/sample_frame | +stdyDscr/method/dataColl/sampleFrame | +Sample Frame | +
study_desc/method/data_collection/sample_frame/name | +stdyDscr/method/dataColl/sampleFrame/sampleFrameName | +Sample frame name | +
study_desc/method/data_collection/sample_frame/valid_period | +stdyDscr/method/dataColl/sampleFrame/validPeriod | +Valid periods (YYYY/MM/DD) | +
- event | ++ | Event | +
- date | ++ | Date | +
study_desc/method/data_collection/sample_frame/custodian | +stdyDscr/method/dataColl/sampleFrame/custodian | +Custodian | +
study_desc/method/data_collection/sample_frame/universe | +stdyDscr/method/dataColl/sampleFrame/universe | +Universe | +
study_desc/method/data_collection/sample_frame/frame_unit | +stdyDscr/method/dataColl/sampleFrame/frameUnit | +Frame unit | +
study_desc/method/data_collection/sample_frame/frame_unit/is_primary | +stdyDscr/method/dataColl/sampleFrame/frameUnit/@isPrimary | +Is Primary | +
study_desc/method/data_collection/sample_frame/frame_unit/unit_type | +stdyDscr/method/dataColl/sampleFrame/frameUnit/unitType | +Unit Type | +
study_desc/method/data_collection/sample_frame/frame_unit/num_of_units | +stdyDscr/method/dataColl/sampleFrame/frameUnit/@numberOfUnits | +Number of units | +
study_desc/method/data_collection/sample_frame/reference_period | +stdyDscr/method/dataColl/sampleFrame/referencePeriod | +Reference periods (YYYY/MM/DD) | +
- event | ++ | Event | +
- date | ++ | Date | +
study_desc/method/data_collection/sample_frame/update_procedure | +stdyDscr/method/dataColl/sampleFrame/updateProcedure | +Update procedure | +
study_desc/method/data_collection/sampling_deviation | +stdyDscr/method/dataColl/deviat | +Deviations from the Sample Design | +
study_desc/method/data_collection/coll_mode | +stdyDscr/method/dataColl/collMode | +Mode of data collection | +
study_desc/method/data_collection/research_instrument | +stdyDscr/method/dataColl/resInstru | +Type of Research Instrument | +
study_desc/method/data_collection/instru_development | +stdyDscr/method/dataColl/instrumentDevelopment | +Instrument development | +
study_desc/method/data_collection/instru_development_type | +stdyDscr/method/dataColl/instrumentDevelopment/@type | +Instrument development type | +
study_desc/method/data_collection/sources | +stdyDscr/method/dataColl/sources | +Sources | +
- name | ++ | Source name | +
- origin | ++ | Origin of Source | +
- characteristics | ++ | Characteristics of Source Noted | +
study_desc/method/data_collection/coll_situation | +stdyDscr/method/dataColl/collSitu | +Characteristics of Data Collection Situation - Notes on data collection | +
study_desc/method/data_collection/act_min | +stdyDscr/method/dataColl/actMin | +Supervision | +
study_desc/method/data_collection/control_operations | +stdyDscr/method/dataColl/ConOps | +Control Operations | +
study_desc/method/data_collection/weight | +stdyDscr/method/dataColl/weight | +Weighting | +
study_desc/method/data_collection/cleaning_operations | +stdyDscr/method/dataColl/cleanOps | +Cleaning Operations | +
study_desc/method/method_notes | +stdyDscr/method/notes | +Methodology notes | +
study_desc/method/analysis_info | +stdyDscr/method/anlyInfo | +Data Appraisal | +
study_desc/method/analysis_info/response_rate | +stdyDscr/method/anlyInfo/respRate | +Response Rate | +
study_desc/method/analysis_info/sampling_error_estimates | +stdyDscr/method/anlyInfo/EstSmpErr | +Estimates of Sampling Error | +
study_desc/method/analysis_info/data_appraisal | +stdyDscr/method/anlyInfo/dataAppr | +Data Appraisal | +
study_desc/method/study_class | +stdyDscr/method/stdyClas | +Class of the Study | +
study_desc/method/data_processing | +stdyDscr/method/dataProcessing | +Data Processing | +
- type | ++ | Data processing type | +
- description | ++ | Data processing description | +
study_desc/method/coding_instructions | +stdyDscr/method/codingInstructions | +Coding Instructions | +
- related_processes | ++ | Related processes | +
- type | ++ | Coding instructions type | +
- txt | ++ | Coding instructions text | +
- command | ++ | Command | +
- formal_language | ++ | Identify the language of the command code | +
study_desc/data_access | +stdyDscr/dataAccs/setAvail/dataAccs | ++ |
study_desc/data_access/dataset_availability | +stdyDscr/dataAccs/setAvail | +Data Set Availability | +
study_desc/data_access/dataset_availability/access_place | +stdyDscr/dataAccs/setAvail/accsPlac | +Location of Data Collection | +
study_desc/data_access/dataset_availability/access_place_url | +stdyDscr/dataAccs/setAvail/accsPlac/@URI | +URL for Location of Data Collection | +
study_desc/data_access/dataset_availability/original_archive | +stdyDscr/dataAccs/setAvail/origArch | +Archive where study is originally stored | +
study_desc/data_access/dataset_availability/status | +stdyDscr/dataAccs/setAvail/avlStatus | +Availability Status | +
study_desc/data_access/dataset_availability/coll_size | +stdyDscr/dataAccs/setAvail/collSize | +Extent of Collection | +
study_desc/data_access/dataset_availability/complete | +stdyDscr/dataAccs/setAvail/complete | +Completeness of Study Stored | +
study_desc/data_access/dataset_availability/file_quantity | +stdyDscr/dataAccs/setAvail/fileQnty | +Number of Files | +
study_desc/data_access/dataset_availability/notes | +stdyDscr/dataAccs/setAvail/notes | +Notes | +
study_desc/data_access/dataset_use | +stdyDscr/dataAccs/useStmt | +Data Set Availability | +
study_desc/data_access/dataset_use/conf_dec | +stdyDscr/dataAccs/useStmt/confDec | +Confidentiality Declaration | +
- txt | +. | +Confidentiality declaration text | +
- required | +- required | +Is signing of a confidentiality declaration required? | +
- form_url | +- URI | +Confidentiality declaration form URL | +
- form_id | +- formNo | +Form ID | +
study_desc/data_access/dataset_use/spec_perm | +stdyDscr/dataAccs/useStmt/specPerm | +Special Permissions | +
- txt | ++ | Special permissions description | +
- required | +- required | +Indicate if special permissions are required to access a resource | +
- form_url | +- URI | +Form URL | +
- form_id | +- formNo | +Form ID | +
study_desc/data_access/dataset_use/restrictions | +stdyDscr/dataAccs/useStmt/restrctn | +Restrictions | +
study_desc/data_access/dataset_use/contact | +stdyDscr/dataAccs/useStmt/contact | +Contact | +
- name | +. | +Name | +
- affiliation | +- affiliation | +Affiliation | +
- uri | +- URI | +URI | +
study_desc/data_access/dataset_use/cit_req | +stdyDscr/dataAccs/useStmt/citReq | +Citation requirement | +
study_desc/data_access/dataset_use/deposit_req | +stdyDscr/dataAccs/useStmt/deposReq | +Deposit requirement | +
study_desc/data_access/dataset_use/conditions | +stdyDscr/dataAccs/useStmt/conditions | +Conditions | +
study_desc/data_access/dataset_use/disclaimer | +stdyDscr/dataAccs/useStmt/disclaimer | +Disclaimer | +
study_desc/data_access/notes | +stdyDscr/dataAccs/setAvail/notes | +Notes | +
data_files | ++ | + |
variables | ++ | + |
variable_groups | ++ | Variable groups | +
Despite the data sharing policies adopted by numerous national and international organizations, researchers and other data users face a common, practical challenge: finding, accessing, and using data. Navigating an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and understanding pertinent datasets, and analyzing them effectively. The difficulty stems from issues such as insufficient metadata, the limitations of data discovery systems, and the low visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and to maximize the utility of collected data. The following sections examine these challenges.
Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks (sometimes referred to as tribal knowledge) to locate and obtain the data they require, which may lead them to use convenient rather than the most relevant data. Others encounter datasets of interest in academic publications, which can be difficult to trace because datasets are often cited inconsistently or without a standard format. Most data users, however, rely on general search engines or specialized data catalogs to discover relevant data resources.
Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, such as a query for “population of India in 2023,” yield instant informative responses (though not always from the most authoritative source). Even less direct queries, like “indicators of malnutrition in Yemen,” return adequate responses, as the engine can “understand” concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and the underweight population. Additionally, generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although it is not without the risk of errors and inaccuracies. However, these search engines may not be optimized to identify the most relevant data when the user’s requirements cannot be expressed in the form of a straightforward query. For instance, internet search engines might offer limited assistance to a researcher seeking “satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition.”
While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking pertinent data. Nonetheless, the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal search indexes and algorithms. With the rapid advancements in AI-based solutions, many of which are available as open-source software, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems.
Improving data discoverability involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within those catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines index and use to identify and locate data of interest.
Metadata is the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access to both relevant datasets and the metadata essential for evaluating the data’s suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should allocate resources and time exclusively to obtain data that is known to be of high quality and relevance. Evaluating a dataset’s fitness for a specific purpose necessitates different metadata elements for various data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is too small, the dataset would not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time.
Accessing data is a multifaceted challenge that encompasses legal, ethical, and practical considerations. To ensure that data access is lawful, ethical, efficient, and enables relevant and responsible use of the data, data providers and users must adhere to specific principles and practices:
Data users in developing countries often encounter additional challenges in accessing data, including:
The challenge for data users extends beyond discovering data to obtaining all the necessary information for a comprehensive understanding of the data and for responsible and appropriate use. A single indicator label, such as “unemployment rate (%),” can obscure significant variations by country, source, and time. The international recommendations for the definition and calculation of the “unemployment rate” have evolved over time, and not all countries employ the same data collection instrument (e.g., labor force surveys) to gather the underlying data. Detailed metadata should always accompany data on online data dissemination platforms. This association should be close; relevant metadata should ideally be no more than one click away from the data. This is particularly crucial when a platform publishes data from multiple sources that are not fully harmonized.
The scope and meaning of labor statistics, in general, are determined by their source and methodology, which holds true for the unemployment rate. To interpret the data accurately, it is crucial to understand what the data convey, how they were collected and constructed, and to have information on the relevant metadata. The design and characteristics of the data source, typically a labor force survey or a similar household survey for the unemployment rate, especially in terms of definitions and concepts used, geographical and age coverage, and reference periods, have significant implications for the resulting data. Taking these aspects into account is essential when analyzing the statistics. Additionally, it is crucial to seek information on any methodological changes and breaks in series to assess their impact on trend analysis and to keep in mind methodological differences across countries when conducting cross-country studies. (From *Quick Guide on Interpreting the Unemployment Rate*, International Labour Office, Geneva: ILO, 2019, ISBN 978-92-2-133323-4 (web PDF).)
Whenever possible, reproducible or replicable scripts used with the data, along with the analytical output of these scripts, should be published alongside the data. These scripts can be highly valuable to researchers who wish to expand the scope of previous data analysis or reuse parts of the code, and to students who can learn from reading and replicating the work of experienced analysts. To enhance data usability, we have developed a specific metadata schema for documenting research projects and scripts.
To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, search engines, and increased accessibility. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469).
It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To ensure data is findable, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more accessible. Moreover, multiple modes of data access should be available to enhance accessibility, while data should be made interoperable to promote data sharing and reusability. Detailed metadata, including fitness-for-purpose assessments, should be displayed alongside scripts and permanent availability options, such as a DOI, to encourage reuse.
In the introductory section of this Guide, we proposed that a data dissemination platform should be modeled after highly successful e-commerce platforms. These platforms are designed to optimally satisfy the requirements and expectations of both buyers (in our context, the data users) and sellers (in our context, the data providers who make their datasets accessible through a data catalog). In this chapter, we outline the crucial features that a modern online data catalog should incorporate to adhere to this model and effectively cater to the diverse needs and expectations of its users.
Our objective is to provide recommendations for developing data catalogs that encompass lexical search and semantic search, filtering, advanced search functionality, interactive user interfaces, and the capability to operate as a data recommender system. To define these features, we approach the topic from three distinct perspectives: the viewpoint of data users, who represent a highly diverse community with varying needs, preferences, expectations, and capabilities; the standpoint of data suppliers, who either publish their data or delegate the task to a data library; and the perspective of catalog administrators, responsible for curating and disseminating data in a responsible, effective, and efficient manner while optimizing both user and supplier satisfaction.
The creation of a contemporary data dissemination platform is a collaborative endeavor, engaging data curators, user experience (UX) experts, designers, search engineers, and subject matter specialists with a profound understanding of both the data and the users’ requirements and preferences. This development process should also include the active participation of the users themselves, allowing them to provide feedback that directly influences the system’s design.
To cultivate a favorable user experience, online data catalogs must offer an intuitive and efficient interface that allows users to effortlessly access the most pertinent datasets. To meet user expectations effectively, designers should emphasize simplicity, predictability, relevance, speed, and reliability. Integrating these principles into the design of data catalogs can deliver a seamless and user-friendly experience, akin to the convenience and ease provided by well-known internet search engines and e-commerce platforms. This, in turn, makes discovering and obtaining the necessary data quick and hassle-free for users.
The default option to search for data in a specialized catalog should be a single search box, following the model of general search engines. The objective of the search algorithm should then be to “understand” the user’s query as accurately as possible, potentially by parsing and enhancing the query, and returning the most relevant results ranked in order of importance.
However, not all users can be expected to provide ideal queries. The search engine must tolerate spelling mistakes to provide a seamless user experience. Query auto-completion and spell checking are independent of the metadata being searched and can be enabled using indexing tools such as Solr or ElasticSearch. Additionally, after processing a user query, the application can suggest related keywords. This can be implemented using a graph of related words generated by natural language processing (NLP) models; access to an API is necessary to implement keyword suggestions based on such graphs. The example below shows a related words graph for the term “climate change” as returned by an NLP model.
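As an illustration of how such suggestions can be computed, the sketch below ranks candidate terms by cosine similarity between embedding vectors. The vectors and vocabulary are toy values invented for the example; in practice they would come from a trained NLP model served through an API.

```python
from math import sqrt

# Toy embedding vectors; in a real system these come from an NLP model
# (e.g., word2vec or a transformer encoder) exposed through an API.
EMBEDDINGS = {
    "climate change":   [0.90, 0.80, 0.10],
    "global warming":   [0.88, 0.82, 0.15],
    "greenhouse gases": [0.70, 0.90, 0.20],
    "unemployment":     [0.10, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def related_terms(term, k=2):
    """Return the k terms most similar to `term`, for 'related keywords' hints."""
    query = EMBEDDINGS[term]
    others = ((t, cosine(query, v)) for t, v in EMBEDDINGS.items() if t != term)
    return [t for t, _ in sorted(others, key=lambda p: p[1], reverse=True)[:k]]

print(related_terms("climate change"))  # most similar toy terms first
```

The same ranking logic applies whether the vectors are three-dimensional toys, as here, or the high-dimensional embeddings a real model produces.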
A search interface could retrieve such information via API and display it as follows:
Some users will simply want to browse a catalog, and this should be made easy. The use of cards is recommended; for images, a mosaic view can be provided, and for microdata, a variable-level view.
The catalog must provide a list of the most recent additions and a history of additions and updates. For each entry, information must be available on the date the entry was first added to the catalog and when it was last updated. When a dataset is replaced with a new version, the versioning must be clear.
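A minimal sketch of the record-keeping this implies is shown below; the field names are hypothetical and no particular catalog software is assumed.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VersionEvent:
    """One entry in a dataset's version history."""
    version: str
    date: str          # ISO 8601 date, e.g. "2022-01-10"
    note: str = ""     # e.g. what changed in this version

@dataclass
class CatalogEntry:
    dataset_id: str
    date_added: str    # when the entry first appeared in the catalog
    history: List[VersionEvent] = field(default_factory=list)

    @property
    def last_updated(self) -> str:
        # Most recent version event, or the original addition date.
        return self.history[-1].date if self.history else self.date_added

entry = CatalogEntry("SRV-2020-001", "2021-03-15")
entry.history.append(VersionEvent("v1.1", "2022-01-10", "Corrected sampling weights"))
print(entry.last_updated)  # "2022-01-10"
```

Exposing such records through the catalog API makes the "recent additions" list and per-entry version history trivial to render.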
It is also useful to provide users with an option to build a more advanced search, targeted at specific metadata elements and using boolean operators. Advanced searches are enabled by structured metadata, i.e., by the use of metadata standards and schemas. The advanced search should be available both through a user interface and through a query syntax option. The interface could be as follows:
A search engine with semantic search capability should be able to process short or long queries, even accepting a document (a PDF or a TXT file) as a query. The search engine will then first analyze the semantic content of the document, convert it into an embedding vector, and identify the closest resources available in the catalog.
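The sketch below illustrates the ranking step. To stay self-contained it represents texts as bag-of-words vectors rather than model embeddings, so it captures only the mechanics (vectorize the query, compare by cosine similarity, rank); the catalog entries and their metadata are invented.

```python
from collections import Counter
from math import sqrt

# Invented catalog entries with short metadata texts. A production system
# would embed these with a language model, not with word counts.
CATALOG = {
    "LFS 2020":     "labour force survey employment unemployment wages",
    "DHS 2019":     "health survey child nutrition stunting wasting anthropometry",
    "Night lights": "satellite imagery raster luminosity poverty",
}

def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    if dot == 0:
        return 0.0
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

def semantic_rank(query_doc):
    """Rank catalog entries by similarity to the query document."""
    q = vectorize(query_doc)
    scored = [(title, cosine(q, vectorize(text))) for title, text in CATALOG.items()]
    return [t for t, s in sorted(scored, key=lambda p: p[1], reverse=True) if s > 0]

print(semantic_rank("indicators of child malnutrition: stunting and wasting"))
```

With real embeddings, a query mentioning "malnutrition" would also match metadata that only mentions "nutrition" or "anthropometry"; the bag-of-words stand-in matches literal tokens only.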
Data catalogs receive numerous queries that are related to a particular geography. Analysis of millions of queries from the World Bank (WB) and International Monetary Fund (IMF) data catalogs revealed that a significant percentage of queries consist of a single country name. For data catalogs that cover multiple countries, creating a “Country page” can provide a quick overview of the most recent and popular datasets of different types, which many users may find helpful.
However, geography is not limited to countries alone. Many users may be interested in sub-national data or in geographic areas that do not correspond to administrative areas, such as a watershed or an ocean. Especially when a data catalog contains geographic datasets, it is recommended to provide specialized search tools. Most metadata standards allow the use of bounding boxes to specify geographic coverage, which could be used to develop a “search” tool that lets a user draw a box on a map. But this option is imperfect: a bounding box is a crude representation of coverage, since the box around an irregularly shaped area (an archipelago, a coastline, a river basin) includes large regions the data do not actually cover, so box-against-box matching returns many datasets that are not truly relevant to the area the user drew.
Example from data.gov (https://catalog.data.gov/dataset/?metadata_type=geospatial). For geographic datasets, geographic indexing is recommended. The H3 index, an open-source hierarchical geospatial indexing system originally developed at Uber, is a powerful option: it partitions the globe into hexagonal cells at multiple resolutions, so that any location or area can be represented as a set of cell identifiers that are fast to store, compare, and match.
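The idea of cell-based indexing can be sketched with a plain latitude/longitude square grid standing in for H3's hexagonal hierarchy; the dataset names and coordinates below are illustrative.

```python
# Simplified illustration of cell-based geographic indexing. H3 uses a
# hierarchy of hexagonal cells; a flat square grid stands in for it here
# so the sketch stays dependency-free.
def cell_id(lat, lon, resolution=1.0):
    """Map a coordinate to the ID of the grid cell that contains it."""
    return (int(lat // resolution), int(lon // resolution))

def index_dataset(dataset_id, points, index, resolution=1.0):
    """Register every cell a dataset touches in an inverted index: cell -> dataset IDs."""
    for lat, lon in points:
        index.setdefault(cell_id(lat, lon, resolution), set()).add(dataset_id)

index = {}
# Hypothetical imagery dataset covering points near Iloilo and Manila.
index_dataset("night-lights-ph", [(10.7, 122.5), (14.6, 121.0)], index)
print(index[cell_id(10.7, 122.5)])  # datasets whose coverage touches that cell
```

A geographic query then reduces to computing the cells of the query area and taking the union of the datasets registered in those cells, which is a fast set operation rather than a polygon intersection.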
One must also take into account that many users will rely on a keyword search to identify data. For example, a raster image of the Philippines (e.g., a dataset derived from satellite imagery) will contain the country name in the metadata, but the metadata cannot contain the names of all geographic areas covered by the data. A user looking for “Iloilo”, for example, would not find this relevant dataset through a simple keyword search. The solution is for the search engine to parse the query, detect whether it contains the name of a geographic area, automatically identify the corresponding area (a polygon of geographic coordinates, possibly obtained using an API built around Nominatim), and retrieve the resources in the catalog that cover the area (which requires that the datasets in the catalog be indexed geographically).
(Describe how this works; illustrate from our KCP project “Indexing the world”.)
Example of use of Nominatim: for the search query “Iloilo City”, the Nominatim application shows the polygon boundary automatically provided by the API.
The search API endpoint of Nominatim returns JSON data that can be processed to generate the search cell(s).
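A sketch of this lookup step is shown below. It builds the Nominatim search request URL and extracts a bounding box from a response of the documented shape (the `boundingbox` field holds `[south, north, west, east]` as strings). No network call is made, and the sample coordinate values are illustrative.

```python
from urllib.parse import urlencode

def nominatim_url(place):
    """Build a Nominatim search request for a place name detected in a query."""
    params = {"q": place, "format": "json", "polygon_geojson": 1}
    return "https://nominatim.openstreetmap.org/search?" + urlencode(params)

def bounding_box(result):
    """Extract a numeric bounding box from one Nominatim result item."""
    south, north, west, east = (float(v) for v in result["boundingbox"])
    return {"south": south, "north": north, "west": west, "east": east}

# Sample item mimicking the shape of Nominatim's JSON output
# (coordinates are illustrative, not authoritative).
sample = {
    "display_name": "Iloilo City, Philippines",
    "boundingbox": ["10.6640", "10.7630", "122.4696", "122.6101"],
}

print(nominatim_url("Iloilo City"))
print(bounding_box(sample))
```

The resulting box (or, better, the `geojson` polygon that `polygon_geojson=1` requests) can then be converted into index cells and matched against the geographically indexed catalog.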
There are two types of search engines: lexical and semantic. The former matches literal terms in the query to the search engine’s index, while the latter aims to identify datasets that have semantically similar metadata to the query. While an ideal data catalog would offer both types of search engines, implementing semantic searchability can be complex.
Semantic search relies on embeddings: a language model converts the metadata of each resource into a numerical vector, and the user’s query is converted into a vector in the same space. The catalog then returns the resources whose vectors are closest to the query vector, typically measured by cosine similarity, with a vector index used to make nearest-neighbor lookups fast at scale. These capabilities are usually exposed to the catalog application through an API, and the implementation details vary by data type, since different metadata elements carry the semantic content of a document, a survey dataset, or an image.
For microdata, embeddings based on thematic variable groupings offer one option for implementing semantic search and recommendations. Discovery of microdata poses specific challenges. Typically, a data dictionary will be available, with variables organized by data file. A “virtual” organization of variables by thematic group, based on a pre-defined ontology, can significantly improve data discoverability. AI solutions can be used to generate such groupings and map variables to them. The DDI metadata standard provides the metadata elements needed to store information on variable groups.
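A hypothetical sketch of mapping variables to thematic groups is shown below, using keyword overlap between a variable label and each group's vocabulary. The group names and vocabularies are invented for the example; a production system would use model embeddings and a richer ontology instead.

```python
# Invented thematic groups and vocabularies, standing in for a pre-defined ontology.
GROUPS = {
    "Education": {"school", "literacy", "education", "grade", "enrolment"},
    "Health":    {"illness", "health", "vaccination", "nutrition"},
    "Labor":     {"employment", "wage", "occupation", "work"},
}

def assign_group(variable_label):
    """Assign a variable to the group whose vocabulary best overlaps its label."""
    words = set(variable_label.lower().split())
    scores = {g: len(words & vocab) for g, vocab in GROUPS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no thematic match

print(assign_group("Highest grade of school completed"))  # "Education"
print(assign_group("Weekly wage from main occupation"))   # "Labor"
```

The resulting variable-to-group mapping can be stored in the DDI variable-group elements, so that a query about a theme retrieves every dataset containing variables in that group.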
+Build your own dashboards
+- Allow users to set preferences: thematic, data type, geographies, search query
+- Have a page where pre-designed dashboards (country/thematic pages) and custom dashboards are accessible
+- Allow sharing of dashboards
+- Core idea: all data and metadata accessible via API; the platform operates as a service to feed dashboards (within the platform or external)
+A search engine not only needs to identify relevant datasets but also must return the results in a proper order of relevance, with the most relevant results at the top of the list. If users fail to find a relevant response among the top results, they may choose to search for data elsewhere. The ability of a search engine to return relevant results in the optimal order depends on the metadata’s content and structure. To optimize the ranking of results, considerable relevance engineering is required, including tuning advanced search tools like Solr or ElasticSearch. Large data catalogs managed by well-resourced agencies can leverage data scientists to explore the possibility of using machine learning solutions such as “learn-to-rank” to improve result ranking. See section “Improving results ranking” below. For more detailed information, see D. Turnbull and J. Berryman’s (2016) in-depth description of tools and methods.
+Keyword-based searches can be optimized using tools like Solr or ElasticSearch. Out-of-the-box solutions, such as those provided by SQL databases, rarely deliver satisfactory results. Structured metadata can help optimize search engines and the ranking of results by allowing for the boosting of specific metadata elements. For instance, a query term found in the title of a dataset would carry more weight than if it were found in the notes element, and the results would be ranked accordingly. Similarly, a country name found in the nation or reference country metadata elements should be given more weight than if it were found in a variable description. Advanced indexing tools like Solr and ElasticSearch provide boosting functionalities to fine-tune search engines and enhance result relevancy.
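As a sketch of what such boosting looks like in practice, the function below builds an ElasticSearch `multi_match` query body in which a term found in the title counts four times, and in the reference country element three times, as much as one found in the notes. The field names follow a hypothetical catalog mapping, not any specific product:

```python
import json

def boosted_query(terms: str) -> dict:
    """Build an ElasticSearch multi_match query with per-field boosts
    (the `field^N` syntax). Field names are illustrative."""
    return {
        "query": {
            "multi_match": {
                "query": terms,
                "fields": ["title^4", "nation^3", "notes"],
            }
        }
    }

print(json.dumps(boosted_query("population census Philippines"), indent=2))
```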
+Facets or filters are useful for narrowing down datasets based on specific metadata categories. For instance, in a data catalog with datasets from different countries, a “country” facet can help users find relevant datasets quickly. To be effective, filters should be based on metadata elements that have a limited number of categories and a predictable set of options. Controlled vocabularies can be used to enable such filters. Furthermore, as some metadata elements are specific to particular data types, contextual facets should be integrated into the catalog’s user interface to offer relevant filters based on the type of data being searched.
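Under the hood, a facet is simply a count of catalog entries per category of a controlled-vocabulary element; indexing tools compute these as aggregations, but the logic can be sketched with a few illustrative records (titles and values are made up):

```python
from collections import Counter

records = [
    {"title": "LSMS 2019",  "country": "Malawi",      "data_type": "microdata"},
    {"title": "DHS 2017",   "country": "Philippines", "data_type": "microdata"},
    {"title": "Land cover", "country": "Philippines", "data_type": "geospatial"},
]

def facet_counts(records, element):
    """Count catalog entries per category of a controlled-vocabulary element."""
    return Counter(r[element] for r in records)

print(facet_counts(records, "country"))
print(facet_counts(records, "data_type"))
```

Each count is displayed next to the facet option in the user interface, and clicking an option restricts the result set to the matching records.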
+Tags and tag groups (which are available in all schemas we recommend) provide much flexibility to implement facets, as we showed in section 1.7.
+(use pills / …)
+Not all data catalog users know exactly what they are looking for and may need to explore the catalog to find relevant resources. E-commerce platforms use recommender systems to suggest products to customers, and data catalogs should have a similar commitment to bringing relevant resources to users’ attention. To achieve this, modern data catalogs display relationships between entries, which may involve data of different types, such as microdata files, analytical scripts, and working papers.
+These relationships can be documented in the metadata, such as identifying datasets as part of a series or new versions of a previous dataset. When relationships are not known or documented, machine learning tools such as topic models and word embedding models can be used to establish the topical or semantic closeness between resources of different types. This can be used to implement a recommender system in data catalogs, which automatically identifies and displays related documents and data for a given resource. The image below shows how “related documents” and “related data” can be automatically identified and displayed for a resource (in this case a document).
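As a minimal stand-in for the topic and embedding models mentioned above, overlap between keyword sets (Jaccard similarity) already yields a workable "related entries" list; the catalog entries and keywords below are illustrative:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative keyword sets extracted from the metadata of three entries:
keywords = {
    "Poverty assessment report": {"poverty", "welfare", "household"},
    "Household budget survey":   {"household", "expenditure", "welfare"},
    "Road network map":          {"roads", "transport", "gis"},
}

def related(entry, k=2):
    """Return the k catalog entries with the highest keyword overlap."""
    others = [(title, jaccard(keywords[entry], kw))
              for title, kw in keywords.items() if title != entry]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

print(related("Poverty assessment report", k=1))
```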
+When a data catalog contains multiple types of data, it should offer an easy way for users to filter and display query results by data type. For example, when searching for “US population,” one user may only be interested in knowing the total population of the USA, while another may need the public use census microdata sample, and a third may be searching for a publication. To cater to such needs, presenting query results in type-specific tabs (with an “All” option) and/or providing a filter (facet) by type will allow users to focus on the types of data relevant to them. This is similar to commercial platforms that offer search results organized by department, allowing users to search for “keyboard” in either the “music” or “electronics” department.
+Users could also be given the option to set a profile with preferences that may be used to customize the display of results.
+To make metadata easily accessible to users, it’s important to display it in a convenient way. The display of metadata will vary depending on the data type being used, as each type uses a specific metadata schema. For online catalogs, style sheets can be utilized to control the appearance of the HTML pages.
+In addition to being displayed in HTML format, metadata should be available as electronic files in JSON, XML, and potentially PDF format. Structured metadata provides greater control and flexibility to automatically generate JSON and XML files, as well as format and create PDF outputs. It’s important that the JSON and XML files generated by the data catalog comply with the underlying metadata schema and are properly validated. This ensures that the metadata files can be easily and reliably reused and repurposed.
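A sketch of how structured metadata can be exported to both formats is shown below; the XML element names are illustrative, loosely inspired by DDI Codebook, and a production exporter would follow the target standard's schema exactly and validate the output:

```python
import json
import xml.etree.ElementTree as ET

meta = {"title": "Child Mortality Survey 2010-2011",
        "nation": [{"name": "Popstan", "abbreviation": "POP"}]}

# JSON export is direct:
json_text = json.dumps(meta, indent=2)

# Minimal XML export (element names mirror the JSON keys):
root = ET.Element("codeBook")
ET.SubElement(root, "titl").text = meta["title"]
for n in meta["nation"]:
    ET.SubElement(root, "nation", abbr=n["abbreviation"]).text = n["name"]
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```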
+E-commerce platforms commonly allow customers to compare products by displaying their pictures and descriptions (i.e., metadata) side-by-side. Similarly, for data users, the ability to compare datasets can be valuable to evaluate the consistency or comparability of a variable or an indicator over time or across sources and countries. However, to implement this functionality, detailed and structured metadata at the variable level are necessary. Metadata standards such as DDI and ISO 19110/19139 provide the variable-level metadata elements needed to implement this feature.
+In the example below, we show how a query for “water” returns not only a list of seven datasets, but also a list of variables in each dataset that match the query.
+The variable view shows that a total of 90 variables match the searched keyword.
+After selecting the variables of interest, users should be able to display their metadata in a format that facilitates comparison. The availability of detailed metadata is crucial to ensure the quality and usefulness of these comparisons. For example, when working with a survey dataset, capturing information on the variable universe, categories, questions, interviewer instructions, and summary statistics would be ideal. This comprehensive metadata will enable users to make informed decisions about which variables to use and how to analyze them.
+The terms of use (ideally provided in the form of a standard license) and the conditions of access to data should be made transparent and visible in the data catalog. The access policy will preferably be provided using a controlled vocabulary, which can be used to enable a facet (filter) as shown in the screenshot below.
+To keep up with modern data management needs, a comprehensive data catalog must provide users with convenient access to both data and metadata through an application programming interface (API). The structured metadata in a catalog allows users to extract specific components of the metadata they need, such as the identifier and title of all microdata and geographic datasets collected after a certain year. With an API, users can easily and automatically access datasets or subsets of datasets they require. This also enables internal features of the catalog such as dynamic visualizations and data previews, making data management more efficient. It is crucial that detailed documentation and guidelines on the use of the data and metadata API are provided to users to maximize the benefits of this feature.
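A sketch of how a user might build such an API request is shown below; the endpoint path and parameter names are hypothetical, not those of any specific product:

```python
from urllib.parse import urlencode

# Hypothetical catalog API endpoint:
BASE = "https://catalog.example.org/api/search"

def search_url(data_type=None, from_year=None, fields=None):
    """Build a metadata API query URL (parameter names are illustrative)."""
    params = {}
    if data_type:
        params["type"] = ",".join(data_type)
    if from_year:
        params["from"] = from_year
    if fields:
        params["fields"] = ",".join(fields)
    return f"{BASE}?{urlencode(params)}"

# "Identifier and title of all microdata and geographic datasets from 2015 on":
print(search_url(data_type=["microdata", "geospatial"],
                 from_year=2015, fields=["idno", "title"]))
```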
+- Metadata (and data) should be accessible via API
+- The API should be well documented with examples
+- API query builder: a UI for building an API query
+Make the process of registration, requests fully digital, easy, and fully traceable.
+When the data (time series and tabular data, possibly also microdata) are made available via API, the data catalog can also provide a data preview option, and possibly a data extraction option, to the users. Multiple JavaScript tools, some of them open-source, are available to easily embed data grids in catalog pages.
+For a document, the “data preview” would consist of a document viewer that would allow the user to view the document within the application (even when the document is not stored in the catalog itself but in an external website). When implementing such a feature, check that the terms of use of the originating source allow it.
+For some data (microdata / time series), provide a simple way for users to extract specific variables / observations.
+Embedding visualizations in a data catalog can greatly enhance its usefulness. Different types of data require different types of visualizations. For instance, time series data can be effectively displayed using a line chart, while images with geographic information can be displayed on a map that shows the location of the image capture. For more complex data, other types of charts can be created as well. However, in order to embed dynamic charts in a catalog page, the data needs to be available via API. A good data catalog should offer flexibility in the types of charts and maps that can be embedded in a metadata page. For instance, the NADA catalog provides catalog administrators with the ability to create visualizations using various tools. By including visualizations in a data catalog, users are able to quickly and easily understand the data and gain insights from it.
+The NADA catalog allows catalog administrators to generate such visualizations using different tools of their choice. The examples below were generated using the open-source Apache ECharts library.
+
+Example: Line chart for a time series
+Example: Geo-location of an image
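To give a sense of what such an embedded chart involves, the sketch below builds an Apache ECharts "option" object for a time-series line chart in Python and serializes it to JSON for a catalog page; the series values are illustrative:

```python
import json

# Illustrative time series (an actual catalog would fetch this via its data API):
years = [2018, 2019, 2020, 2021]
values = [71.2, 71.5, 70.9, 71.1]

# Apache ECharts option object for a simple line chart:
option = {
    "xAxis": {"type": "category", "data": [str(y) for y in years]},
    "yAxis": {"type": "value", "name": "Life expectancy (years)"},
    "series": [{"type": "line", "data": values}],
    "tooltip": {"trigger": "axis"},
}
print(json.dumps(option))
```

On the page, this JSON would be passed to ECharts' `setOption` call in a small JavaScript snippet to render the interactive chart.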
To ensure efficient management and organization of datasets within a data catalog, it is essential to assign a unique identifier to each dataset. This identifier should not only meet technical requirements but also serve other purposes such as facilitating dataset citation. To achieve maximum effectiveness, it is recommended that datasets have a globally unique identifier, which can be accomplished through the assignment of a Digital Object Identifier (DOI). DOIs can be generated in addition to a catalog-specific unique identifier and provide a permanent and persistent identifier for the dataset. For more information about the process of generating DOIs and the reasons to use them, visit the DataCite website.
+Include a citation requirement in metadata.
+When a dataset is removed or replaced, the reproducibility of some analyses may become impossible. This may be a problem for some users. Unless there is a reason for not making them accessible, old versions of datasets should be kept accessible. But they should not be the ones indexed and displayed in the catalog, to avoid confusion or the risk that a user would exploit a version other than the latest. Moving datasets that are replaced to an archive section of the catalog (not indexed) is an option. Note that DOIs require a permanent web page.
+A data catalog should not be limited to data. Ideally, the scripts produced by researchers to analyze the data, and the output of their analysis, should also be available. An ideal data catalog will allow a user to:
+Maintain a catalog of citations of datasets.
+Document, catalog, and publish reproducible/replicable scripts.
+Users may want to be automatically notified (by email) when new entries of interest are added, or when changes are made to a specific resource. A system allowing users to set criteria for automatic notification can be developed.
+Example of Google Scholar alerts:
+Feedback on the catalog should certainly be enabled, in the form of a “Contact” email and possibly a feedback form. Also, if the platform itself is open source, GitHub can be used for issues and suggestions on the application itself.
+BUT: a users forum with “reviews”, as in e-commerce platforms, is not always recommended. Not all users are “constructive” and qualified. It requires moderation, which can be costly and controversial, and may create disincentives for data producers to publish their data. It could be a good option for data platforms that are internal to an organization (where comments are attributed, and an authentication system controls who can provide feedback), but not for public data platforms.
+The Web Content Accessibility Guidelines (WCAG) are an international standard. The WCAG documents explain how to make web content more accessible to people with disabilities. The Americans with Disabilities Act (ADA) provides people with disabilities the same opportunities, free of discrimination. WCAG is a compilation of accessibility guidelines for websites, whereas the ADA is a civil rights law in the same ambit.
+When the data catalog is not administered by the producer of the data but by an entrusted repository, data providers want:
+“do not disturb”: low burden of deposit and no burden of serving users (minimum interaction with users; providing detailed metadata helps)
+In addition to meeting the needs of its users, a modern data catalog should also offer features that a catalog administrator can appreciate or expect. The features listed below can serve as a checklist for the choice of an application or the development of features. These features may include:
+User-friendly interface for data deposit, compliant with metadata standards, with embedded quality gateways and clearance procedures
+Tools for privacy protection control (e.g., tools to identify direct identifiers)
+Availability of the application as an open-source software, accompanied by detailed technical documentation
+Robust security measures, such as compatibility with advanced authentication systems, flexible role/profile definitions, regular upgrades and security patches, and accreditation by information security experts
+Reasonable IT requirements, such as shared server operability and sufficient memory capacity
+Interoperability with other catalogs and applications, as well as compliance with metadata standards. By publishing metadata across multiple catalogs and hubs, data visibility can be increased, and the service provided to users can be maximized. This requires automation to ensure proper synchronization between catalogs (with only one catalog serving as the “owner” of a dataset), which necessitates interoperability between the catalogs, enabled by compliance with common formats and metadata standards and schemas.
+Flexibility in implementing data access policies that conform to the specific procedures and protocols of the organization managing the catalog
+Availability of APIs for catalog administration
+Easy automation of procedures (harvesting, migration of formats, editing, etc.), which requires an API-based system
+Easy activation of usage analytics (using Google Analytics, Omniture, or other)
+Multilingual capability, including internationalization of the code and the option for catalog administrators to translate or adapt software translations
+In Chapter 1, we emphasized the importance of generating comprehensive metadata and how machine learning can be leveraged to enrich it. Natural language processing (NLP) tools and models, in particular, have been employed to enhance the performance of search engines. By utilizing machine learning models, semantic search engines and recommender systems can be developed to aid users in locating relevant data. Moreover, machine learning can improve the ranking of search results to ensure that the most pertinent results are brought to users’ attention. Google, Bing, and other leading search engines have employed machine learning for years. While specialized data catalogs may not have the resources to implement such advanced systems, catalog administrators should explore opportunities to utilize machine learning to enhance their users’ experience. Catalogs can make use of external APIs to exploit machine learning solutions without requiring administrators to develop machine learning expertise or train their own models. For instance, APIs can be used to automatically and instantly translate queries or convert queries into embeddings. Ideally, a global community of practice will develop such APIs, including training NLP models, and provide them as a global public good.
+In 2019, Google introduced their NLP model, BERT (Bidirectional Encoder Representations from Transformers), as a component of their search engine. Other major companies, such as Amazon, Apple, and Microsoft, are also developing similar models to enhance their search engines. One of the objectives of these companies is to create search engines that can support digital assistants like Siri, Alexa, Cortana, and Hey Google, which operate on a conversational mode and provide answers to users rather than just links to resources. Improving NLP models is a continuous and strategic priority for these companies, as not all answers can be found in textual resources. Google is also conducting research to develop solutions for extracting answers from tabular data.
+Specialized data catalogs maintained by data centers, statistical agencies, and other data producers still rely almost exclusively on full-text search engines. The search engine within these catalogs looks for matches between keywords submitted by the user and keywords found in an index, without attempting to understand or improve the user’s query. This can result in issues such as misinterpretation of the query, as discussed in Chapter 1, where a search for “dutch disease” may be mistakenly interpreted as a health-related query rather than an economic concept.
+The administrators of these specialized data catalogs often lack the resources to develop and implement the most advanced NLP solutions, and should not be required to do so. To assist them in transitioning from keyword-based search systems to semantic search and recommender systems, open solutions should be developed and published, such as pre-trained NLP models, open source tools, and open APIs. This would necessitate the creation and publishing of global public goods, including specialized corpora and the training of embedding models on these corpora, open NLP models and APIs that data catalogs can utilize to generate embeddings for their metadata, query parsers that can automatically improve/optimize queries and convert them into numeric vectors, and guidelines for implementing semantic search and recommender systems using tools like Solr, ElasticSearch, and Milvus.
+Simple models created from open source tools and publicly-available documents can provide straightforward solutions. In the example below, we demonstrate how these models can “understand” the concept of “dutch disease” and correctly associate it with relevant economic concepts.
+Effective search engines not only identify relevant resources, but also rank and present them to users in an optimal order of relevance. As highlighted in Chapter 1, research shows that 75% of search engine users do not click past the first page, emphasizing the importance of ranking and presenting results effectively.
+Data catalog administrators face two challenges in improving their search engine performance. Firstly, they need to improve their ranking in search engines such as Google by enriching metadata and embedding metadata compliant with DCAT or schema.org standards on catalog pages. Secondly, they need to improve the ranking of results returned by their own search engines in response to user queries.
+Google’s success in 1996 was largely attributed to their revolutionary approach to ranking search results called PageRank. Since then, they and other leading search engines have invested heavily in improving ranking methodologies with advanced techniques like RankBrain (introduced in 2015). These approaches include primary, contextual, and user-specific ranking, which utilize machine learning models referred to as Learn to Rank models. Lucidworks provides a clear description of this approach, noting that “Learning to rank (LTR) is a class of algorithmic techniques that apply supervised machine learning to solve ranking problems in search relevancy. In other words, it’s what orders query results. Done well, you have happy employees and customers; done poorly, at best you have frustrations, and worse, they will never return. To perform learning to rank you need access to training data, user behaviors, user profiles, and a powerful search engine such as SOLR. The training data for a learning to rank model consists of a list of results for a query and a relevance rating for each of those results with respect to the query. Data scientists create this training data by examining results and deciding to include or exclude each result from the data set.”
+Implementing Learn to Rank models can be challenging for data catalog administrators due to the resource-intensive nature of building the training dataset, fitting models, and implementing them. An alternative solution is to optimize the implementation of Solr or ElasticSearch, which can often contribute significantly to improving the ranking of search results. For more information on the challenge and available tools and methods for relevancy engineering, refer to D. Turnbull and J. Berryman’s 2016 publication.
+The examples we provided in this chapter are taken from our NADA cataloguing application. Other open-source cataloguing applications are available, including CKAN, GeoNetwork, and Dataverse.
+CKAN
+CKAN is a data management system that provides a platform for cataloging, storing and accessing datasets with a rich front-end, full API (for both data and catalog), visualization tools and more. CKAN is open source software held in trust by the Open Knowledge Foundation, licensed under the GNU Affero General Public License (AGPL) v3.0. CKAN is used by some of the leading open data platforms, such as the US data.gov or the OCHA Humanitarian Data Exchange. CKAN does not require that the metadata comply with any metadata standard (which brings flexibility, but at a cost in terms of discoverability and quality control), but organizes the metadata in the following elements (information extracted from the CKAN on-line documentation):
+The extra fields section allows ingestion of structured metadata, which makes it relatively easy to export data and metadata from NADA to CKAN. Importing data and metadata from CKAN to NADA is also possible (using the catalogs’ respective APIs), but with a reduced metadata structure.
+GeoNetwork
+GeoNetwork is a cataloguing tool for geographic data and services (not for other types of data), which includes a specialized metadata editor. According to its website, “It provides powerful metadata editing and search functions as well as an interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world. (…) The metadata editor support ISO19115/119/110 standards used for spatial resources and also Dublin Core format usually used for opendata portals.”
+Dataverse
+The Dataverse Project is led by the Institute for Quantitative Social Science (IQSS). Dataverse makes use of the DDI Codebook and Dublin Core metadata standards. According to its website, Dataverse “is an open source web application to share, preserve, cite, explore, and analyze research data. (…) The central insight behind the Dataverse Project is to automate much of the job of the professional archivist, and to provide services for and to distribute credit to the data creator.”
+“The Institute for Quantitative Social Science (IQSS) collaborates with the Harvard University Library and Harvard University Information Technology organization to make the installation of the Harvard Dataverse Repository openly available to researchers and data collectors worldwide from all disciplines, to deposit data. IQSS leads the development of the open source Dataverse Project software and, with the Open Data Assistance Program at Harvard (a collaboration with Harvard Library, the Office for Scholarly Communication and IQSS), provides user support.”
+The previous chapter defined the features of an advanced data discoverability and dissemination solution. What enables such a solution is not only the algorithms and technology, but also the quality of the metadata available to enable them. Metadata is defined as “… structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage that resource” (Data thesaurus, NIH, https://nnlm.gov/data/thesaurus). Metadata must be findable by machines and usable by humans. This chapter describes what metadata are needed, and how they can be organized and improved to fully enable the search and recommender tools. The metadata must be rich and structured. To make them rich, machine learning can be used. To ensure consistent structure, the use of metadata standards and schemas is highly recommended. In this chapter, we build the case for rich, augmented, structured metadata and for the adoption of metadata standards and schemas. The second part of this Guide will provide a detailed description of each recommended standard or schema, for different data types.
+Rich metadata means detailed and comprehensive metadata. Rich metadata are beneficial to both the users and the providers (producers and curators) of data.
+Being provided with rich metadata helps data users:
+For the data producers, rich metadata will contribute to:
+What makes metadata “rich and comprehensive” is not always easy to define, and is specific to each data type. Microdata and geospatial datasets, for example, will require much more – and different – metadata than a document or an image. Metadata standards and schemas provide data curators with detailed lists of elements (or fields), specific to each data type, that must or may be provided to document a dataset. The metadata elements included in a standard or schema will typically cover cataloguing material, contextual information, and explanatory materials.
+Cataloguing material includes elements such as a title, a unique identifier for the dataset, a version number and description, as well as information related to the data curation (including who generated the metadata and when, or where and when metadata may have been harvested from an external catalog). This information allows the dataset to be uniquely identified within a collection/catalog, and serves as a bibliographic record of the dataset, allowing it to be properly acknowledged and cited in publications.
+Contextual information describes the context in which the data were collected and how they were put to use. It enables secondary users to understand the background and processes behind the data production. Contextual information should cover topics such as:
+Explanatory materials are the information that should be created and preserved to ensure the long-term functionality of a dataset and its contents. This applies mostly to microdata, geospatial data, and to some extent to tabulations and to time series and indicators databases. It is less relevant for images, videos, and documents. Explanatory materials include:
+Metadata standards and schemas provide lists of elements with a description of the expected content to be captured in each element. For some elements, it may be appropriate to restrict the valid content to pre-selected options or “controlled vocabularies”. A controlled vocabulary is a pre-defined list of values that can be accepted as valid content for some elements. For example, a metadata element “data type” should not be populated with free text, but should make use of a pre-defined taxonomy of data types. The use of controlled vocabularies (for selected metadata elements) will be particularly useful to implement search and filter features in data catalogs (see section 3.1.1 of this Guide), and to foster inter-operability of data catalogs.
+In library and information science, a controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. (Wikipedia)
+Controlled vocabularies can be specific to an agency, or be developed by a community of practice. For example, the list of countries and codes provided by ISO 3166 can be used as a controlled vocabulary for a metadata element `country` or `nation`; the ISO 639 list of languages can be used as a controlled vocabulary for a metadata element `language`. Or the CESSDA topics classification can be used as a controlled vocabulary for the element `topics` found in most metadata schemas. When a controlled vocabulary is used in a metadata standard or schema, it is good practice to include an identification of its origin and version. Some recommended controlled vocabularies are included in the description of the ISO 19139 standard for geographic data and services (see chapter 6). Most standards and schemas we recommend also include a `topics` element. Annex 1 provides a description of the CESSDA topics classification. Ideally, controlled vocabularies will be developed in compliance with the FAIR principles for scientific data management and stewardship: Findability, Accessibility, Interoperability, and Reuse.
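To illustrate how a controlled vocabulary can be enforced at metadata-entry time, here is a minimal validation sketch; the three-entry code table is a small excerpt standing in for the full ISO 3166-1 alpha-3 list, and the element names follow the examples used in this chapter:

```python
# Small excerpt standing in for the full ISO 3166-1 alpha-3 code list:
ISO_3166_SUBSET = {"PHL": "Philippines", "MWI": "Malawi", "IND": "India"}

def validate_nation(entry):
    """Return a list of problems found in an entry's `nation` element."""
    problems = []
    for nation in entry.get("nation", []):
        code = nation.get("abbreviation")
        if code not in ISO_3166_SUBSET:
            problems.append(f"unknown country code: {code!r}")
        elif nation.get("name") != ISO_3166_SUBSET[code]:
            problems.append(f"name/code mismatch for {code}")
    return problems

ok  = {"nation": [{"name": "Philippines", "abbreviation": "PHL"}]}
bad = {"nation": [{"name": "Popstan", "abbreviation": "POP"}]}
print(validate_nation(ok))   # no problems
print(validate_nation(bad))
```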
+Metadata should not only be comprehensive and detailed, they should also be organized in a structured manner, preferably using a standardized structure. Structured metadata means that the metadata are stored in specific fields (or elements) organized in a metadata schema. Standardized means that the list and description of elements are commonly agreed by a community of practice.
+“A metadata schema is a system that defines the data elements needed to describe a particular object, such as a certain type of research data.” (Ten rules data discovery - add ref)
+Some metadata standards have originated from academic data centers, like the Data Documentation Initiative (DDI), maintained by the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan. Others found their origins in specialized communities of practice (like the ISO 19139 for geospatial resources). The private sector also contributes to the development of standards, like the International Press Telecommunications Council (IPTC) standard developed by and for news media.
+Metadata compliant with standards and schemas will typically be stored as JSON or XML files (described in Chapter 2), which are plain text files. The example below shows how simple free-text content would be structured and stored in JSON and XML formats, using metadata elements from the DDI Codebook metadata standard:
+Free text version:
+The Child Mortality Survey (CMS) was conducted by the National Statistics Office of Popstan from July 2010 to June 2011, with financial support from the Child Health Trust Fund (TF123_456).
+Structured, machine-readable (JSON) version:
+{
+  "title" : "Child Mortality Survey 2010-2011",
+  "alternate_title" : "CMS 2010-2011",
+  "authoring_entity": "National Statistics Office (NSO)",
+  "funding_agencies": [{"name":"Child Health Trust Fund (CHTF)", "grant":"TF123_456"}],
+  "coll_dates" : [{"start":"2010-07", "end":"2011-06"}],
+  "nation" : [{"name":"Popstan", "abbreviation":"POP"}]
+}
In XML format:
+<titl>Child Mortality Survey 2010-2011</titl>
+<altTitl>CMS 2010-2011</altTitl>
+<rspStmt><AuthEnty>National Statistics Office</AuthEnty></rspStmt>
+<fundAg abbr="CHTF">Child Health Trust Fund</fundAg>
+<collDate date="2010-07" event="start"/>
+<collDate date="2011-06" event="end"/>
+<nation abbr="POP">Popstan</nation>
All three versions contain (almost) the same information. In the structured version, we have added acronyms and the ISO country code. This does not create new information but will help make the existing information more discoverable and inter-operable. The structured version is clearly more suitable for publishing in a meta-database (or catalog). Organizing and storing metadata in such a structured manner will enable all kinds of applications. For example, when metadata for a collection of surveys are stored in a database, it becomes straightforward to apply filters (for example, a filter by country using the nation/name element) and targeted searches to answer questions like “What data are available that cover the month of December 2010?” or “What surveys did the CHTF sponsor?”.
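To illustrate, the filter and search queries mentioned above can be sketched in a few lines of code. The catalog records and helper functions below are hypothetical, reusing the element names from the JSON example:

```python
# Hypothetical catalog: a list of study-level metadata records using
# DDI Codebook-inspired element names, as in the JSON example above.
catalog = [
    {"title": "Child Mortality Survey 2010-2011",
     "funding_agencies": [{"name": "Child Health Trust Fund (CHTF)", "grant": "TF123_456"}],
     "coll_dates": [{"start": "2010-07", "end": "2011-06"}],
     "nation": [{"name": "Popstan", "abbreviation": "POP"}]},
    {"title": "Labor Force Survey 2012",  # made-up second record
     "funding_agencies": [{"name": "National Planning Agency"}],
     "coll_dates": [{"start": "2012-01", "end": "2012-12"}],
     "nation": [{"name": "Popstan", "abbreviation": "POP"}]},
]

def covers_month(record, month):
    """True if any collection period includes the given YYYY-MM month.
    Lexicographic comparison works because the dates are zero-padded."""
    return any(d["start"] <= month <= d["end"] for d in record["coll_dates"])

def sponsored_by(record, acronym):
    """True if any funding agency name mentions the given acronym."""
    return any(acronym in a["name"] for a in record.get("funding_agencies", []))

# "What data are available that cover the month of December 2010?"
december_2010 = [r["title"] for r in catalog if covers_month(r, "2010-12")]
# "What surveys did the CHTF sponsor?"
chtf_funded = [r["title"] for r in catalog if sponsored_by(r, "CHTF")]
```

None of this would be possible if the same information were stored only as free text.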
Metadata standards and schemas consist of structured lists of metadata fields. They serve multiple purposes. First, they help data curators generate complete and usable documentation of their datasets. Metadata standards that are intuitive and human-readable better serve this purpose. Second, they help generate machine-readable metadata that are the input to software applications like on-line data catalogs. Metadata available in open file formats like JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) are most suitable for this purpose.
Some international metadata standards like the Data Documentation Initiative (DDI Codebook, for microdata), the ISO 19139 (for geospatial data), or the Dublin Core (a more generic metadata specification) are described and published as XML specifications. Any XML standard or schema can be “translated” into JSON, which is our preferred format (a choice we justify in the next section).
JSON and XML formats have similarities: both are open, plain-text formats capable of representing hierarchical, structured metadata. JSON files are however easier to parse than XML, easier to generate programmatically, and easier for humans to read. This makes them our preferred choice for describing and using metadata standards and schemas.
Metadata in JSON are stored as key/value pairs, where the keys correspond to the names of the metadata elements in the standard. Values can be strings, numbers, booleans, arrays, null, or JSON objects (for a more detailed description of the JSON format, see www.w3schools.com). Metadata in XML are stored within named tags. The example below shows how the JSON and XML formats are used to document the list of authors of a document, using elements from the Dublin Core metadata standard.
In the documents schema, authors are documented in the metadata element `authors`, which contains the following sub-elements: `first_name`, `initial`, `last_name`, and `affiliation`. In JSON, this information will be stored in key/value pairs as follows.
```json
"authors": [
  {
    "first_name": "Dieter",
    "last_name": "Wang",
    "affiliation": "World Bank Group; Fragility, Conflict and Violence"
  },
  {
    "first_name": "Bo",
    "initial": "P.J.",
    "last_name": "Andrée",
    "affiliation": "World Bank Group; Fragility, Conflict and Violence"
  },
  {
    "first_name": "Andres",
    "initial": "F.",
    "last_name": "Chamorro",
    "affiliation": "World Bank Group; Development Data Analytics and Tools"
  },
  {
    "first_name": "Phoebe",
    "initial": "G.",
    "last_name": "Spencer",
    "affiliation": "World Bank Group; Fragility, Conflict and Violence"
  }
]
```
In XML, the same information will be stored within named tags as follows.

```xml
<authors>
  <author>
    <first_name>Dieter</first_name>
    <last_name>Wang</last_name>
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
  <author>
    <first_name>Bo</first_name>
    <initial>P.J.</initial>
    <last_name>Andrée</last_name>
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
  <author>
    <first_name>Andres</first_name>
    <initial>F.</initial>
    <last_name>Chamorro</last_name>
    <affiliation>World Bank Group; Development Data Analytics and Tools</affiliation>
  </author>
  <author>
    <first_name>Phoebe</first_name>
    <initial>G.</initial>
    <last_name>Spencer</last_name>
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
</authors>
```
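As a quick illustration of machine-readability, an XML fragment structured this way can be parsed with a few lines of standard-library code. The fragment below is a shortened, hypothetical copy of the authors example:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the authors fragment shown above.
xml_metadata = """
<authors>
  <author>
    <first_name>Dieter</first_name>
    <last_name>Wang</last_name>
  </author>
  <author>
    <first_name>Bo</first_name>
    <initial>P.J.</initial>
    <last_name>Andr\u00e9e</last_name>
  </author>
</authors>
"""

root = ET.fromstring(xml_metadata)
# findtext() returns None when a sub-element (e.g. <initial>) is absent;
# filter(None, ...) drops those missing parts before joining.
names = [
    " ".join(filter(None, [a.findtext("first_name"),
                           a.findtext("initial"),
                           a.findtext("last_name")]))
    for a in root.findall("author")
]
```

The same extraction from free text would require fragile pattern matching; with structured metadata it is a deterministic lookup.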
Metadata standards and schemas must be comprehensive and intuitive. They aim to provide comprehensive and granular lists of elements. Some standards may contain a very long list of elements. Most often, only a subset of the available elements will be used to document a specific dataset. For example, the elements of the DDI metadata standard related to sample design will be used to document sample survey datasets but will be ignored when documenting a population census or an administrative dataset. In all standards and schemas, most elements are optional, not required. Data curators should however try to provide content for all elements for which information is or can be made available.
Complying with metadata standards and schemas contributes to the completeness, usability, discoverability, and inter-operability of the metadata, and to the visibility of the data and metadata.

When they document datasets, data curators who do not make use of metadata standards and schemas tend to focus on the readily-available documentation and may omit some information that secondary data users –and search engines– may need. Metadata standards and schemas provide checklists of what information could or should be provided. These checklists are developed by experts, and are regularly updated or upgraded based on feedback received from users or to accommodate new technologies.

Generating complete metadata will often be a collaborative exercise, as the production of data involves multiple stakeholders. The implementation of a survey, for example, may involve sampling specialists, field managers, data processing experts, subject matter specialists, and programmers. Documenting a dataset should not be seen as a last and independent step in the implementation of a data collection or production project. Ideally, metadata will be captured continuously and in quasi-real time during the entire life cycle of the data collection/production, and contributed by those who have the most knowledge of each phase of the data production process.

Generating complete and detailed metadata may be seen as a burden by some organizations or researchers. But it will typically represent only a small fraction of the time and budget invested in the production of the data, and it is an investment that will add much value to the data by increasing their usability and discoverability.
Fully understanding a dataset before conducting analysis should be a prerequisite for all researchers and data users. But this will only be possible when the documentation is easy to obtain and exploit. Convenience to users is key. When using a geographic dataset, for example, the user should be able to immediately find the coordinate reference system that was used. When using survey microdata, which may contain hundreds or thousands of variables, the user needs to be able to immediately access information on a variable's label, underlying question, universe, categories, etc. Structured metadata enables such “convenience”, as it can easily be transformed into bookmarked PDF documents, searchable websites, machine-readable codebooks, etc. The way metadata are displayed can be tailored to the specific needs of different categories of users.
Data discoverability is one of the main tasks, next to availability and interoperability, that public policy makers and implementers should take into due consideration in order to foster access, use, and re-use of public sector information, particularly in the case of open data. Users should be able to easily search for and find the data they need for the most diverse purposes. That is clearly highlighted in the introductory statements of the INSPIRE Directive, where we can read that “The loss of time and resources in searching for existing (spatial) data or establishing whether they may be used for a particular purpose is a key obstacle to the full exploitation of the data available”. Metadata and data portals/catalogues are essential assets to enable that data discoverability.
What matters is not only what metadata are provided as input to the search engines; it is also how the metadata are provided. To understand the value of structured metadata, we need to take into consideration how search engines ingest, index, and exploit the metadata. In brief, the metadata will need to be acquired, augmented, analyzed and transformed, and indexed before they can be made searchable. We provide here an overview of the process, which is described in detail by D. Turnbull and J. Berryman in “Relevant Search: With applications for Solr and Elasticsearch” (2016).
Acquisition: Search engines like Google and Bing acquire metadata by crawling billions of web pages using web crawlers (or bots), with the objective of covering the entire web. Guidance is available to webmasters on how to optimize websites for visibility (see for example Google’s Search Engine Optimization (SEO) Starter Guide). The search tools embedded in specialized data catalogs have a much simpler task, as the catalog administrators and curators generate or control, and provide, the well-contained content to be indexed. In a cataloguing application like NADA, this content is provided in the form of structured metadata files saved in JSON or XML format. For textual data (documents), the content of the document (not only the metadata about it) can also be indexed. The process of acquisition/extraction of metadata by the search engine tool must preserve the structure of the metadata, in its original or in a modified form. This will be critical for optimizing the performance of the search tool and the ranking of query results (e.g., a keyword found in a document title may have more weight than the same keyword found in the document abstract), for implementing facets, or for providing advanced search options (e.g., search only in the “authors” metadata field).
Augmentation or enrichment: The content of the metadata can be augmented or enriched in multiple ways, often automatically (by extracting information from an external source, or using machine learning algorithms). Part of this augmentation process should happen before the metadata are submitted to the search engine. Other enrichment procedures may be implemented after acquisition of the metadata by the search engine tool. Metadata augmentation can have a significant impact on the discoverability of data. See the section “Augmented (enriched) metadata” below.
Analysis or transformation: The metadata generated by the data curator and by the augmentation process will mostly (not exclusively) consist of text. For the purpose of discoverability, some of the text has no value; words like “the”, “a”, “it”, “with”, etc., referred to as stop words, will be removed from the metadata (multiple tools are available for this purpose). The remaining words will be converted to lowercase, may be submitted to spell checkers (to exclude or fix errors), and will be stemmed or lemmatized. Stemming or lemmatization consists of converting words to their stem or root; this will, among other transformations, change plurals to singular and the conjugated forms of verbs to their base form. Last, the transformed metadata will be tokenized, i.e., split into a list of terms (tokens). To enable semantic searchability, the metadata can also be converted into numeric vectors using natural language processing embedding models. These vectors will be saved in a database (such as ElasticSearch or Milvus) that will provide functionalities to measure similarity/distance between vectors. Section 1.8 below provides more information on text embedding and semantic searchability.
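This analysis step can be sketched in a few lines; the stop-word list and suffix-stripping rules below are illustrative stand-ins for the analyzers that real search engines (e.g., Elasticsearch) provide:

```python
import re

# Illustrative stop-word list; production analyzers ship much longer ones.
STOP_WORDS = {"the", "a", "it", "with", "of", "and", "in", "to", "was", "were"}

def naive_stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stripped = token[: -len(suffix)]
            return stripped + ("y" if suffix == "ies" else "")
    return token

def analyze(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For instance, `analyze("The surveys were conducted with trained interviewers")` reduces the sentence to the tokens `["survey", "conduct", "train", "interviewer"]`, which is what the indexer receives.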
Indexing: The last phase of metadata processing is the indexing of the tokens. The index of a search engine is an inverted index, which will contain a list of all terms found in the metadata, with information attached to each term (among others, the identifiers of the documents in which the term appears and its frequency in each of them).
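The structure of an inverted index can be sketched as follows; the two "documents" are hypothetical token lists of the kind produced by the analysis step:

```python
from collections import defaultdict

# Hypothetical analyzed documents: document id -> list of tokens.
docs = {
    "doc1": ["child", "mortality", "survey", "popstan"],
    "doc2": ["labor", "force", "survey", "survey"],
}

# Inverted index: term -> {document id: term frequency in that document}.
index = defaultdict(dict)
for doc_id, tokens in docs.items():
    for t in tokens:
        index[t][doc_id] = index[t].get(doc_id, 0) + 1
```

A query for "survey" is then answered by a single dictionary lookup rather than a scan of every document, and the stored frequencies feed the relevance-ranking formulas.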
Once the metadata have been acquired, transformed, and indexed, they are available for use via a user interface (UI). A data catalog UI will typically include a search box and facets (filters). The search engine underlying the search box can be simple (out-of-the-box full-text search, looking for exact matches of keywords) or advanced (with semantic search capability and optimized ranking of query results). Basic full-text search does not provide a satisfactory user experience, as we illustrated in the introduction to this Guide. Rich, structured metadata, combined with advanced search optimization tools and machine learning solutions, allow catalog administrators to tune the search engine and implement advanced solutions including semantic searchability.
Data catalogs that adopt common metadata standards and schemas can exchange information, including through automated harvesting and synchronization of catalogs. This allows them to increase their visibility, and to publish their metadata in hubs. Recommendations and guidelines for improved inter-operability of data catalogs are provided by the Open Archives Initiative.
Interoperability between data catalogs can be further improved by the adoption of common controlled vocabularies. For example, the adoption of the ISO country codes in country lists will guarantee that all catalogs will be able to filter datasets by country in a consistent manner. This will solve the issue of possible differences in the spelling of country names (e.g., one catalog referring to the Democratic Republic of Congo as Congo, DR, and another one as Congo, Dem. Rep.). It also solves issues of changing country names (e.g., Swaziland was renamed Eswatini in 2018). Controlled vocabularies are often used for “categorical” metadata elements like topics, keywords, data type, etc. Some metadata standards like the ISO 19139 for geospatial data include their own recommended controlled vocabularies. Ideally, controlled vocabularies are developed in accordance with FAIR principles (Findability, Accessibility, Interoperability, and Reuse of digital assets). “The principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.” (https://www.go-fair.org/fair-principles/)
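A minimal sketch of how a controlled vocabulary resolves naming variants; the mapping below covers only the examples mentioned, using the actual ISO 3166-1 alpha-3 codes:

```python
# Variant display names (lowercased) mapped to ISO 3166-1 alpha-3 codes.
# COD = Democratic Republic of the Congo; SWZ = Eswatini (formerly Swaziland).
ISO_ALPHA3 = {
    "congo, dr": "COD",
    "congo, dem. rep.": "COD",
    "democratic republic of the congo": "COD",
    "swaziland": "SWZ",
    "eswatini": "SWZ",
}

def country_code(name):
    """Normalize a display name to its ISO code; None if unknown."""
    return ISO_ALPHA3.get(name.strip().lower())
```

Two catalogs that both store the code alongside the display name can filter by `COD` consistently, regardless of which spelling each one shows to its users.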
The adoption of standards and schemas by software developers also contributes to the easy transfer of metadata across applications. For example, data capture tools like Survey Solutions by the World Bank and CSPro by the US Census Bureau offer options to export metadata compliant with the DDI Codebook standard; ESRI’s ArcGIS software exports geospatial metadata in the ISO 19139 standard.
Data cataloguing applications provide search and filtering tools to help users of the catalog identify data of interest. But not all users will start their search for data directly in specialized data catalogs; many will start their search in Google, Google Dataset Search, Bing, Yahoo!, or another search engine.

Some search engines may provide users with a direct answer to their query, without transiting via the source catalog. This will be the case when the query can be associated with a specific indicator, time, and location for which data are openly available or accessible via a public API. For example, a search for “population india 2020” on Google will provide an answer first, followed by links to the underlying sources. In some cases, the user may not be brought to the data catalog at all, if the catalog ranked low in the relevance order of the Google query results. User behavior data (2020) showed that “only 9% of Google searchers make it to the bottom of the first page of the search results”, and that “only .44% of searchers go to the second page of Google’s search results”. (source: https://www.smartinsights.com/search-engine-marketing/search-engine-statistics/)
It is thus critical to optimize the visibility of the content of specialized data catalogs in the lead search engines, Google in particular. This optimization process is referred to as search engine optimization or SEO. Wikipedia describes SEO as “the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic (known as “natural” or “organic” results) rather than direct traffic or paid traffic. (…) As an Internet marketing strategy, SEO considers how search engines work, the computer-programmed algorithms that dictate search engine behavior, what people search for, the actual search terms or keywords typed into search engines, and which search engines are preferred by their targeted audience. SEO is performed because a website will receive more visitors from a search engine when websites rank higher on the search engine results page.”
Because search engines crawl the web pages that are generated from databases (rather than crawling the databases themselves), your carefully applied metadata inside the database will not even be seen by search engines unless you write scripts to display the metadata tags and their values in HTML meta tags. It is crucial to understand that any metadata offered to search engines must be recognizable as part of a schema and must be machine-readable, which is to say that the search engine must be able to parse the metadata accurately. For example, if you enter a bibliographic citation into a single metadata field, the search engine probably won’t know how to distinguish the article title from the journal title, or the volume from the issue number. In order for the search engine to read those citations effectively, each part of the citation must have its own field. (…) Making sure metadata is machine-readable requires patterns and consistency, which will also prepare it for transformation to other schemas. This is far more important than picking any single metadata schema. (…) From the blog post “Metadata, Schema.org, and Getting Your Digital Collection Noticed” by Patrick Hogan (https://www.ala.org/tools/article/ala-techsource/metadata-schemaorg-and-getting-your-digital-collection-noticed-3)
Guidelines for implementing SEO are provided by Google Search, Google Dataset Search, and other lead search engines. These guidelines are to be implemented not only by webmasters, but also by the developers of data cataloguing tools, who should embed SEO into their software applications.

An important element of SEO is the provision of structured metadata that can be exploited directly by the crawlers and indexers of search engines. This is the purpose of a set of schemas known as schema.org. In 2011, Google, Microsoft, Yandex, and Yahoo! created a common set of schemas for structured data markup on web pages with the aim of helping search engines to better understand websites. An alternative to schema.org is the DCAT (Data Catalog Vocabulary) metadata schema recommended by the W3C, also recognized by Google. “DCAT is a vocabulary for publishing data catalogs on the Web, which was originally developed in the context of government data catalogs such as data.gov and data.gov.uk (…)” (https://www.w3.org/TR/vocab-dcat-2/) Mapping augmented and structured metadata to the schema.org and/or DCAT standard is a critical element of such optimization. It will contribute significantly to the visibility of on-line data and metadata. Implementing such structured data markup in digital repositories is the responsibility of data librarians and of developers of data cataloguing applications.
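A sketch of what schema.org markup for a dataset page could look like. The JSON-LD below uses real schema.org property names (`name`, `creator`, `temporalCoverage`, `spatialCoverage`), but the dataset values are the hypothetical Popstan survey from this chapter, and the way a cataloguing application would assemble the snippet is illustrative:

```python
import json

# schema.org "Dataset" markup as JSON-LD, of the kind a cataloguing
# application could embed in each dataset page so that crawlers such as
# Google Dataset Search can parse it. Values are hypothetical.
dataset_markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Child Mortality Survey 2010-2011",
    "description": "Household survey on child mortality conducted in Popstan.",
    "creator": {"@type": "Organization", "name": "National Statistics Office"},
    "temporalCoverage": "2010-07/2011-06",
    "spatialCoverage": "Popstan",
}

# The snippet a page template would emit inside <head>.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_markup, indent=2)
    + "\n</script>"
)
```

Because the markup sits in the rendered HTML, the crawler sees it even though it never queries the catalog's database directly, which is exactly the gap described in the quote above.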
Detailed and complete metadata foster usability and discoverability of data. Augmentation (“enrichment” or “enhancement”) of the metadata will therefore be beneficial. There are multiple ways metadata can be made richer, or augmented, programmatically and in a largely automated manner. Metadata can be extracted from external sources or from the data themselves.
Extraction from external sources

Metadata can be augmented by tapping into external sources related to the data being documented. For example, in a catalog of documents published in peer-reviewed journals, the Scimago Journal Rank (SJR) indicator could be extracted and added as an additional metadata element for each document. This information can then be used by the catalog’s search engine to rank query results, by “boosting” the rank of documents published in prestigious journals.

Extraction from the data

Metadata can be extracted from the data themselves. What metadata can be extracted will be specific to each data type. Examples of metadata augmentation will be provided in the subsequent chapters. We mention a few below.

Embeddings and semantic discovery
Previous sections of the chapter showed the value of rich and structured metadata to improve data usability and discoverability. Comprehensive and structured metadata are required to build and develop advanced and optimized lexical search engines (i.e., search engines that return results based on a matching of terms found in a query and in an inverted index). The richness of the metadata guarantees that the search engine will have all the necessary “raw material” to identify datasets of interest. The metadata structure allows catalog administrators to tune their search engine (provided they use advanced solutions like Solr or ElasticSearch) to return and rank results in the most relevant manner. But this leaves one issue unsolved: the dependency on keyword matching. A user interested in datasets related to malnutrition, for example, will not find the indicators on Prevalence of stunting and Prevalence of wasting that the catalog may contain, unless the keyword “malnutrition” was included in these indicators’ metadata. Smarter search engines will be able to “understand” users’ intent, and identify relevant data based not only on a keyword matching process, but also on the semantic closeness between a query submitted by the user and the metadata available in the database. The combination of rich metadata and natural language processing (NLP) models can solve this issue, by enabling semantic searchability in data catalogs.
To enable a semantic search engine (or a recommender system), we need a way to “quantify” the semantic content of a query submitted by the user and the semantic content of the metadata associated with a dataset, and to measure the closeness between them. This “quantitative” representation of semantic content can be generated in the form of numeric vectors called embeddings. “Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.” (Jurafsky and Martin, 2000). These vectors will typically have a large dimension, with a length of 100 or more. They can be generated for a word, a phrase, or a longer text such as a paragraph or a full document. They are calculated using models like word2vec (Mikolov et al., 2013) or others. Training such models requires a large corpus of hundreds of thousands or millions of documents. Pre-trained models and APIs are available that allow data catalog curators to generate embeddings for their metadata and, in real time, for queries submitted by users.
Practically, embeddings are used as follows: metadata (or part of the metadata) associated with a dataset are converted into a numeric vector using a pre-trained embedding model. These embeddings are stored in a database. When a user submits a search query (which can be a term, a phrase, or even a document), the query is analyzed and enhanced (stop words are removed, spelling errors may be fixed, language detection and automatic translation may be applied, and more), then transformed into a vector using the same pre-trained model that was used to generate the metadata vectors. The metadata vectors that have the shortest distance (typically the cosine distance) to the query vector will be identified. The search engine will then return a sorted list of datasets having the highest semantic similarity with the query, or the distance between vectors will be used in combination with other criteria to rank and return results to the user. The fast identification of the closest vectors requires a specialized and optimized tool like the open source Milvus application.
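The vector-matching logic can be sketched with toy numbers. Real embeddings come from pre-trained models and have 100+ dimensions; the 3-dimensional vectors below, including the pretend query vector for “malnutrition”, are made up for illustration:

```python
import math

# Made-up embeddings for two catalog entries (real ones would come from
# a pre-trained model such as word2vec and have 100+ dimensions).
metadata_vectors = {
    "Prevalence of stunting": [0.9, 0.2, 0.1],
    "GDP per capita": [0.1, 0.9, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embedding of the query "malnutrition": close to the stunting
# indicator's vector even though the word "malnutrition" never appears
# in that indicator's metadata.
query_vector = [0.85, 0.25, 0.05]

ranked = sorted(metadata_vectors,
                key=lambda k: cosine_similarity(query_vector, metadata_vectors[k]),
                reverse=True)
```

At catalog scale, the sort over all vectors is replaced by an approximate nearest-neighbor lookup in a vector database such as Milvus, but the similarity measure is the same.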
The standards and schemas we recommend and describe in this guide are the following:

| Data type | Standard or schema |
|---|---|
| Documents | Dublin Core Metadata Initiative (DCMI), MARC |
| Microdata | Data Documentation Initiative 2.5 (Codebook) |
| Geographic datasets and services | ISO 19110, ISO 19115, ISO 19119, ISO 19139 |
| Time series, Indicators | Custom-designed schema |
| Statistical tables | Custom-designed schema |
| Photos / Images | IPTC (for advanced use) or Dublin Core augmented with ImageObject from schema.org |
| Audio files | Dublin Core augmented with AudioObject from schema.org |
| Videos | Dublin Core augmented with VideoObject from schema.org |
| Programs and scripts | Custom-designed schema |
| External resources | Dublin Core |
| All data types | schema.org and DCAT (used for search engine optimization purposes, not as the primary schema to document resources) |
Note on SDMX: The metadata standards and schemas described in the Guide do not include the Statistical Data and Metadata eXchange (SDMX) standard sponsored by a group of international organisations. Although SDMX includes a metadata component, it is intended to support machine-to-machine data exchange, not data documentation and discoverability. SDMX and the metadata standards and schemas we describe in the Guide could –and should– be made inter-operable.
Documents are bibliographic resources of any type, such as books, working papers and papers published in scientific journals, reports, manuals, and other resources consisting mainly of text. Document libraries have a long tradition of using structured metadata to manage their collections, which dates back to before the days this was computerized. Multiple standards are available. The Dublin Core Metadata Initiative specification (DCMI) provides a simple and flexible option. The MARC (MAchine-Readable Cataloging) standard used by the United States Library of Congress is another, more advanced one. The schema we describe in this Guide is the DCMI complemented by a few elements inspired by the MARC standard.
Microdata are unit-level data on a population of individuals, households, dwellings, facilities, establishments, or other units. Microdata are typically obtained from surveys, censuses, or administrative recording systems. To document microdata, the Data Documentation Initiative (DDI) Alliance has developed the DDI metadata standard. “The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. DDI is a free standard that can document and manage different stages in the research data lifecycle, such as conceptualization, collection, processing, distribution, discovery, and archiving. Documenting data with DDI facilitates understanding, interpretation, and use – by people, software systems, and computer networks.” (Source: https://ddialliance.org/, accessed on 7 June 2021)

The DDI standard comes in two versions: DDI Codebook and DDI Lifecycle.

In this Guide, which focuses on the use of metadata standards for documentation, cataloguing, and dissemination purposes, we recommend the use of DDI Codebook, which is much easier to implement than DDI Lifecycle. DDI Codebook provides all the elements needed for our purpose of improving data discoverability and usability.
Geographic data identify and depict geographic locations, boundaries, and characteristics of features on the surface of the earth. Geographic datasets include raster and vector data files. More and more data are disseminated not in the form of datasets, but in the form of geographic data services, mainly via web applications. The ISO Technical Committee on Geographic Information/Geomatics (ISO/TC 211) created a set of metadata standards to describe geographic datasets (ISO 19115), the data structures of vector data (ISO 19110), and geographic data services (ISO 19119). These ISO standards are also available as an XML specification, the ISO 19139. In this Guide, we describe a JSON-based and simplified –but ISO-compatible– version of this complex schema.
Indicators are summary (or “aggregated”) measures related to a key issue or phenomenon and derived from a series of observed facts. For example, the school enrollment rate indicator can be obtained from survey or census microdata, and the GDP per capita indicator is the output of a complex national accounting process that exploits many sources. When an indicator is repeated over time at a regular frequency (annual, quarterly, monthly, or other), and when the time dimension is attached to its values, we obtain a time series. National statistical agencies and many other organizations publish indicators and time series. Some well-known public databases of time series indicators include the World Bank’s World Development Indicators (WDI), the Asian Development Bank’s Key Indicators (KI), and the United Nations Statistics Division Sustainable Development Goals (SDG) database. Some databases provide indicators that are not time series, like the Demographic and Health Survey (DHS) StatCompiler. Time series and indicators must be published with metadata that provide information on their spatial and temporal coverage, definition, methodology, sources, and more. No international standard is available to document indicators and time series. The JSON metadata schema we describe in this guide was developed by compiling a list of metadata elements found in international indicators databases, complemented with elements from other metadata schemas.

Statistical tables (or cross tabulations or contingency tables) are summary presentations of data, presented as arrays of rows and columns that display numeric aggregates in a clearly labeled fashion. They are typically found in publications such as statistical yearbooks, census and survey reports, and research papers, or published on-line. We developed the metadata schema presented in this Guide based on a review of a large collection of tables and of the 2015 W3C Model for Tabular Data and Metadata on the Web. This schema is intended to facilitate the cataloguing and discovery of tabular data, not to provide an electronic solution to automatically reproduce tables.
The images we are interested in are photos and images available in electronic format. Some images are generated using digital cameras and are “born digital”. Others may have been created by scanning photos, or using other techniques. Note that satellite and remote sensing imagery are not considered in this Guide as images, but as geospatial (raster) data which should be documented using the ISO 19139 schema. To document images, we suggest two options: the Dublin Core Metadata Initiative standard augmented by some ImageObject (from schema.org) elements as a simple option, and the IPTC standard for more advanced uses and users.

To document and catalog audio recordings, we propose a simple metadata schema that combines elements of the Dublin Core Metadata Initiative and of the AudioObject (from schema.org) schemas.

To document and catalog videos, we propose a simple metadata schema that combines elements of the Dublin Core Metadata Initiative and of the VideoObject (from schema.org) schemas.
We are interested in documenting and disseminating data processing and analysis programs and scripts. By “programs and scripts” we mean the code written to conduct data processing and data analysis, which results in the production of research and knowledge products including publications, derived datasets, visualizations, and other outputs. These scripts are produced using statistical analysis software or programming languages like R, Python, SAS, SPSS, Stata, or equivalent. There are multiple reasons to invest in the documentation and dissemination of reproducible and replicable data processing and analysis (see chapter 12). Increasingly, the dissemination of reproducible scripts is a condition imposed by peer-reviewed journals on authors of the papers they publish. Data catalogs should be the go-to place for those who look for reproducible research and examples of good practice in data analysis. As no international metadata schema is available to document and catalog scripts, we developed a schema for this purpose.
External resources are files and links that we may want to attach to a dataset’s published metadata in a data catalog. When we publish metadata in a catalog, what is published is only the textual documentation contained in the JSON or XML metadata file. Other resources attached to a dataset (such as the questionnaire for a survey, technical or training manuals, tabulations, reports, possibly microdata files, etc.) are not included in these metadata, but also constitute important materials for data users. All these resources are what we consider as external resources (“external” to the schema-compliant metadata), which need to be catalogued and (for most of them) published with the metadata. A simple metadata schema, based on the Dublin Core, is used to provide some essential information on these resources.
The standards and schemas we recommend are lists of elements that have been tailored to each data type. The importance of structured and rich metadata has been described. Specialized metadata standards foster comprehensiveness and discoverability in specialized catalogs, and help build optimized data discovery systems. But it is also critical to ensure the visibility and discoverability of the metadata in generic search engines, which are not built around the same schemas. The web makes use of its own collection of schemas: schema.org. To enable search engine optimization (SEO), the specialized schemas should be mapped to it.
Data catalogs must be optimized to improve the visibility and ranking of their content in search engines, including specialized search engines like Google’s Dataset Search. The ranking of web pages by Google and other leading search engines is determined by complex, proprietary, and non-disclosed algorithms. The only option for a web developer to guarantee that a web page appears on top of the Google list of results is to pay for it, publishing it as a commercial ad. Otherwise, the ranking of a web page will be determined by a combination of known and unknown criteria. “Google’s automated ranking systems are designed to present helpful, reliable information that’s primarily created to benefit people, not to gain search engine rankings, in the top Search results.” (Google Search Central) But Google, Bing, and other search engines provide web developers with some guidance and recommendations on search engine optimization (SEO). See for example the Google Search Central website, where Google publishes “specific things you can do to improve the SEO of your website”.
Improving the ranking of catalog pages is a shared responsibility of data curators and of catalog developers and administrators. Data curators must pay particular attention to providing rich, useful content in the catalog web pages (the HTML pages that describe each catalog entry). To identify relevant results, search engines index the content of web pages. Datasets that are well documented, i.e., those published with rich and structured metadata, will thus have a better chance of being discovered. Much attention should be paid to some core elements including the dataset title, producer, description (abstract), keywords, topics, access license, and geographic coverage. In Google Search Central’s terms, curators must “create helpful, reliable, people-first content” (not search engine-first content) and “use words that people would use to look for your content, and place those words in prominent locations on the page, such as the title and main heading of a page, and other descriptive locations such as alt text and link text.”
+Developers and administrators of cataloguing applications must pay attention to other aspects of a catalog that will make it rank higher in Google and other search engine results:
Last, Google will “reward” popular websites, i.e., websites that are frequently visited and to which many other influential and popular websites provide links. Google’s recommendation is thus to “tell people about your site. Be active in communities where you can tell like-minded people about your services and products that you mention on your site.”
+A helpful and detailed self-assessment list of items that data curators, catalog developers, and catalog administrators should pay attention to is provided by Google. Various tools are also available to catalog developers and administrators to assess the technical performance of their websites.
Structured data is information embedded in HTML pages that helps Google classify, understand, and display the content of a page when the page relates to a specific type of content. The information stored in the structured data does not impact how the page itself is displayed in a web browser; it only impacts the display of information about the page when it is returned in Google search results. The types of content to which structured data applies are diverse and include items like job postings, cooking recipes, books, events, movies, math solvers, and others (see the list provided in Google’s Search Gallery). It also applies to resources of type dataset and image. In this context, a dataset can be any type of structured dataset, including microdata, indicators, tables, and geographic datasets.
+The structured data to be embedded in an HTML page consists of a set of metadata elements compliant with either the dataset schema from schema.org or W3C’s Data Catalog Vocabulary (DCAT) for datasets, and with the image schema from schema.org for images. For datasets, the schema.org schema is the most frequently used option.[^1]
+schema.org is a collection of schemas designed to document many types of resources. The most generic type is a “thing” which can be a person, an organization, an event, a creative work, etc. A creative work can be a book, a movie, a photograph, a data catalog, a dataset, etc. Among the many types of creative work for which schemas are available, we are particularly interested in the ones that correspond to the types of data and resources we recommend in this guide. This includes:
The schemas proposed by schema.org have been developed primarily “to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results.” (from schema.org, Q&A) These schemas have not been developed by specialized communities of practice (statisticians, survey specialists, data librarians) to document datasets for the preservation of institutional memory, to increase transparency in the data production process, or to provide data users with the “cook book” they may need to safely and responsibly use data. These schemas are not the ones that statistical organizations need to comply with international recommendations like the Generic Statistical Business Process Model (GSBPM). But they play a critical role in improving data discoverability, as they provide webmasters and search engines with a means to better capture and index the content of web-based data platforms. Schemas from schema.org should thus be embedded in data catalogs. Data cataloguing applications should automatically map (some of) the elements of the specialized metadata standards and schemas they use to the appropriate fields of schema.org. Recommended mappings between the specialized standards and schemas and schema.org are not yet available. The production of such mappings, and the development of utilities to facilitate the production of content compliant with schema.org, would contribute to the objective of visibility and discoverability of data.
+DCAT describes datasets and data services using a standard model and vocabulary. It is organized in 13 “classes” (Catalog, Cataloged Resource, Catalog Record, Dataset, Distribution, Data Service, Concept Scheme, Concept, Organization/Person, Relationship, Role, Period of Time, and Location). Within classes, properties are used as metadata elements. For example, the class Cataloged Resource includes properties like title, description, resource creator; the class Dataset includes properties like spatial resolution, temporal coverage; many of these properties can easily be mapped to equivalent elements of the specialized metadata schemas we recommend in this Guide.
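To make the class/property organization of DCAT concrete, the sketch below builds a minimal DCAT-style dataset description as JSON-LD using only the standard DCAT and DCTERMS namespaces. The title, description, and distribution values are invented for illustration, and the property selection is ours, not an exhaustive rendering of the 13 classes:

```python
import json

# A minimal DCAT-style dataset description in JSON-LD.
# Namespace URIs are the standard DCAT/DCTERMS ones;
# the sample values and property selection are illustrative.
dcat_dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/"
    },
    "@type": "dcat:Dataset",
    "dct:title": "Household Survey 2020",
    "dct:description": "A nationally representative household survey.",
    "dct:spatial": "Lao PDR",
    "dcat:distribution": [{
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://example.org/data/hhs2020.csv",
        "dcat:mediaType": "text/csv"
    }]
}

print(json.dumps(dcat_dataset, indent=2))
```

Properties such as `dct:title` and `dct:description` map directly to the title and abstract elements of the specialized schemas described in this Guide.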
The embedding of structured data into HTML pages must be automated in a data cataloguing tool. Data catalog applications dynamically generate the HTML pages that display the description of each catalog entry. They do so by extracting the necessary metadata from the catalog database, and applying “transformations and styles” to this content to produce a user-friendly output that catalog visitors will view in their web browser. To embed structured data in these pages, the catalog application will (i) extract the relevant subset of metadata elements from the original metadata (e.g., from the DDI-compliant metadata for a micro-dataset), (ii) map these extracted elements to the schema.org or DCAT schema, and (iii) save them in the HTML page as a JSON-LD “hidden” component. Mapping the core elements of specialized metadata standards to the schema.org schema is thus essential to enable this feature. A mapping between the schemas presented in this Guide and schema.org is provided in annex 2 of the Guide.
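Steps (i) to (iii) can be sketched in a few lines of Python. The DDI-style input below and the mapping function are purely illustrative (the field names are simplified, not an exact DDI element list), but the output is a valid schema.org `Dataset` block of the kind a cataloguing application would inject into its HTML pages:

```python
import json

# Step (i): a (hypothetical, simplified) DDI-style metadata extract
ddi_metadata = {
    "study_desc": {
        "title_statement": {"idno": "SRV-2020-001", "title": "Household Survey 2020"},
        "study_info": {"abstract": "A nationally representative household survey.",
                       "coll_dates": [{"start": "2020-01-01", "end": "2020-06-30"}]},
        "authoring_entity": [{"name": "National Statistics Office"}],
    }
}

# Step (ii): map the extracted elements to the schema.org 'Dataset' type
def to_schema_org(ddi):
    study = ddi["study_desc"]
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": study["title_statement"]["title"],
        "identifier": study["title_statement"]["idno"],
        "description": study["study_info"]["abstract"],
        "creator": [{"@type": "Organization", "name": a["name"]}
                    for a in study["authoring_entity"]],
        "temporalCoverage": "{start}/{end}".format(**study["study_info"]["coll_dates"][0]),
    }

# Step (iii): embed the mapping as a JSON-LD <script> block in the HTML page
json_ld = json.dumps(to_schema_org(ddi_metadata), indent=2)
html_snippet = '<script type="application/ld+json">\n{}\n</script>'.format(json_ld)
print(html_snippet)
```

The `<script type="application/ld+json">` block is ignored by browsers when rendering the page, but is read by the web crawlers of Google, Bing, and other search engines.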
+The screenshots below show an example of an HTML page for a dataset published in a NADA catalog, with the underlying code. The structured metadata is used by Google to display this information as a formatted, “rich result” in Google Dataset Search.
+
+The HTML page as viewed by the catalog user - The web browser will ignore the embedded structured metadata when the HTML page is displayed. What users will see is entirely controlled by the catalog application.
+
The HTML page code (abstract) - The automatically-generated structured data can be seen in the HTML page code (or page source). This information is visible to and processed by the web crawlers of Google, Bing, and other search engines. Note that the structured data, although not “visible” to users, can be made accessible to them via API. Other data cataloguing applications may be able to ingest this information; the CKAN cataloguing tool, for example, makes use of metadata compliant with DCAT or schema.org. Making the structured data accessible is one way to improve the inter-operability of data catalogs.
+
+The result - Higher visibility/ranking in Google Dataset Search - The websites catalog.ihsn.org and microdata.worldbank.org are NADA catalogs, which embed schema.org metadata.
+
The most recent documentation of the schemas described in the Guide is available on-line at https://ihsn.github.io/nada-api-redoc/catalog-admin/#.
+The documentation of each standard or schema starts with four common elements that are not actually part of the standard or schema, but that contain information that will be used when the metadata are published in a data catalog that uses the NADA application. If NADA is not used, these “administrative elements” can be ignored.
- `repositoryid` identifies the collection in which the metadata will be published.
- `access_policy` determines if and how the data files will be accessible from the catalog in which the metadata are published. This element only applies to the microdata and geographic metadata standards. It makes use of a controlled vocabulary with the following access policy options:
  - `direct`: data can be downloaded without requiring users to be registered;
  - `open`: same as “direct”, with an open data license attached to the dataset;
  - `public`: public use files, which only require users to be registered in the catalog;
  - `licensed`: access to data is restricted to registered users who receive authorization to use the data, after submitting a request;
  - `remote`: data are made available by an external data repository;
  - `data_na`: data are not accessible to the public (only the metadata are published).
- `published` determines the status of the metadata in the on-line catalog (with options 0 = draft and 1 = published). Published entries are visible to all visitors of the on-line catalog; unpublished (draft) entries will only be visible to the catalog administrators and reviewers.
- `overwrite` determines whether the metadata already in the catalog for this entry can be overwritten (with options yes or no, “no” being the default).

This set of administrative elements is followed by one or multiple sections that contain the elements specific to each standard/schema. For example, the DDI Codebook metadata standard, used to document microdata, contains the following main sections:

- `document description`: a description of the metadata (who documented the dataset, when, etc.). Most schemas contain such a section describing the metadata, useful mainly to data curators and catalog administrators. In other schemas, this section may be named `metadata_description`.
- `study description`: the description of the survey/census/study, not including the data files and data dictionary.
- `file description`: a list and description of the data files associated with the study.
- `variable description`: the data dictionary (description of variables).

The schema-specific sections are followed by a few other metadata elements common to most schemas. These elements are used to provide additional information useful for cataloguing and discoverability purposes. They include tags (which allow catalog administrators to attach tags to datasets independently of their type, and which can be used as filters in the catalog), and external resources.
+Some schemas provide the possibility for data curators to add their own metadata elements in an additional section. The use of additional elements should be the exception, as metadata standards and schemas are designed to provide all elements needed to fully document a data resource.
In each standard and schema, metadata elements can have the following properties:

- *Required* or *Optional*: an element can be declared `required` but have all its components (for elements that have sub-elements) declared as optional. This will be the case when at least one (but any) of the sub-elements must contain information. It is also possible for an element to be declared optional but have one or more of its sub-elements declared `mandatory` (this means that the field is optional, but if it is used, some of its components MUST be provided).
- *Repeatable* or *Not repeatable*: for example, the element `nation` in the DDI standard is Repeatable because a dataset can cover more than one country, while the element `title` is Not repeatable because a study should be identified by a unique title.

Some schemas may recommend controlled vocabularies for some elements. For example, the ISO 19139 used to document geographic datasets recommends …
In most cases, however, controlled vocabularies are not part of the metadata standard or schema. They will be selected and activated in templates and applications.
…example…
Metadata compliant with the standards and schemas described in this Guide can be generated in two different ways: programmatically, using a programming language like R or Python, or by using a specialized metadata editor application. The first option provides a high degree of flexibility and efficiency. It offers multiple opportunities to automate part of the metadata generation process, and to exploit advanced machine learning solutions to enhance metadata. Also, metadata generated using R or Python can be published in a NADA catalog using the NADA API and the R package NADAR or the Python library PyNADA. The programmatic option may thus be the preferred option for organizations that have strong expertise in R or Python. For other organizations, and for some types of data, the use of a specialized metadata editor may be a better option. Metadata editors are specialized software applications designed to offer a user-friendly alternative to the programmatic generation of metadata. We provide in this section a brief description of how structured metadata can be generated and published using respectively R, Python, and a metadata editor application.
+The easiest way to generate metadata compliant with the standards and schemas we describe in this Guide is to use a specialized Metadata Editor. A Metadata Editor provides a user-friendly and flexible interface to document data. Most metadata editors are specific to a certain standard. The IHSN / World Bank developed an open source multi-standard Metadata Editor.
This Metadata Editor contains all suggested standards. The full version of each standard is embedded in the application. But few users will ever make use of all the elements contained in a standard, and some will want to customize the labels of the metadata elements, the controlled vocabularies, and the instructions to the curators who will enter the metadata.
+The Metadata Editor allows users to develop their own templates based on the full version of the standards. A template is a subset of the elements available in the standard/schema, where the elements can be renamed and other customization can be made (within limits, as the metadata generated must remain compliant with the standard independently of the template).
+Template manager:
(describe / provide better example)
+All schemas described in the on-line documentation can be used to generate compliant metadata using R scripts. Generating metadata using R will consist of producing a list object (itself containing lists). In the documentation of the standards and schemas, curly brackets indicate to R users that a list must be created to store the metadata elements. Square brackets indicate that a block of elements is repeatable, which corresponds in R to a list of lists. For example (using the DOCUMENT metadata schema):
+The sequence in which the metadata elements are created when documenting a dataset using R or Python does not have to match the sequence in the schema documentation.
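This order-independence can be verified directly: two Python dictionaries (or R lists) holding the same elements in different orders describe the same metadata, and serialize to identical JSON once keys are sorted. A small illustration (the sample values are taken from the example developed below):

```python
import json

# The same metadata entered in two different element orders
a = {"idno": "WB_10986/7710", "title": "Teaching in Lao PDR", "date_published": "2007"}
b = {"date_published": "2007", "idno": "WB_10986/7710", "title": "Teaching in Lao PDR"}

# Dictionary equality ignores insertion order
assert a == b

# Serializing with sorted keys yields byte-identical JSON
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```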
+Metadata compliant with a standard/schema can be generated using R, and directly uploaded in a NADA catalog without having to be saved as a JSON file. An object (a list) must be created in the R script that contains metadata compliant with the JSON schema. The example below shows how such an object is created and published in a NADA catalog. We assume here that we have a document with the following information:
+We will use the DOCUMENT schema to document the publication, and the EXTERNAL RESOURCE schema to publish a link to the document in NADA.
Publishing data and metadata in a NADA catalog (using R and the NADAR package, or Python and the PyNADA library) requires first identifying the on-line catalog where the metadata will be published (by providing its URL in the `set_api_url` command) and providing a key to authenticate as a catalog administrator (in the `set_api_key` command; note that this key should never be entered in clear in a script, to avoid accidental disclosure).

We then create an object (a list in R, or a dictionary in Python) that we will, for example, name `my_doc`. Within this list (or dictionary), we enter all metadata elements. Some will be simple elements, others will be lists (or dictionaries). The first element to be included is the required `document_description`. Within it, we include the `title_statement`, which is also required and contains the mandatory elements `idno` and `title` (all documents must have a unique ID number for cataloguing purposes, and a title). The list of countries that the document covers is a repeatable element, i.e. a list of lists (although we only have one country in this case). Information on the authors is also a repeatable element, allowing us to capture the information on the three co-authors individually.

This `my_doc` object is then published in the NADA catalog using the `add_document` function. Last, we publish (as an external resource) a link to the file, with only basic information. We do not need to document this resource in detail, as it corresponds to the metadata provided in `my_doc`. If we had a different external resource (for example, an MS-Excel table that contains all tables shown in the publication), we would make use of more of the external resource metadata elements to document it. Note that instead of a URL, we could have provided a path to an electronic file (e.g., to the PDF document), in which case the file would be uploaded to the web server and made available directly from the on-line catalog. We had previously captured a screenshot of the cover page of the document to be used as thumbnail in the catalog (optional).
```r
library(nadar)

# Define the NADA catalog URL and provide an API key
set_api_url("http://nada-demo.ihsn.org/index.php/api/")
set_api_key("a1b2c3d4e5")
# Note: an administrator API key must always be kept strictly confidential;
# it is good practice to read it from an external file, not to enter it in clear

# Cover page image to be used as thumbnail
thumb <- "C:/DOCS/teaching_lao.JPG"

# Generate and publish the metadata on the publication
doc_id <- "WB_10986/7710"

my_doc <- list(
  document_description = list(
    title_statement = list(
      idno = doc_id,
      title = "Teaching in Lao PDR"
    ),
    date_published = "2007",
    ref_country = list(
      list(name = "Lao PDR", code = "LAO")
    ),
    # Authors: a list of lists, as the 'authors' element
    # is a repeatable element in the schema
    authors = list(
      list(first_name = "Luis",     last_name = "Benveniste", affiliation = "World Bank"),
      list(first_name = "Jeffery",  last_name = "Marshall",   affiliation = "World Bank"),
      list(first_name = "Lucrecia", last_name = "Santibañez", affiliation = "World Bank")
    )
  )
)

# Publish the metadata in the central catalog
add_document(idno = doc_id,
             metadata = my_doc,
             repositoryid = "central",
             published = 1,
             thumbnail = thumb,
             overwrite = "yes")

# Add a link as an external resource of type document/analytical (doc/anl)
external_resources_add(
  title = "Teaching in Lao PDR",
  idno = doc_id,
  dctype = "doc/anl",
  file_path = "http://hdl.handle.net/10986/7710",
  overwrite = "yes"
)
```
The document is now available in the NADA catalog.
Generating metadata using Python will consist of producing a dictionary object, which will itself contain lists and dictionaries. Non-repeatable metadata elements will be stored as dictionaries, and repeatable elements as lists of dictionaries. In the metadata documentation, curly brackets indicate that a dictionary must be created to store the metadata elements. Square brackets indicate that a list of dictionaries must be created.
+Dictionaries in Python are very similar to JSON schemas. When documenting a dataset, data curators who use Python can copy a schema from the ReDoc website, paste it in their script editor, then fill out the relevant metadata elements and delete the ones that are not used.
+The Python equivalent of the R example we provided above is as follows:
```python
import pynada as nada

# Define the NADA catalog URL and provide an API key
nada.set_api_url("http://nada-demo.ihsn.org/index.php/api/")
nada.set_api_key("a1b2c3d4e5")
# Note: an administrator API key must always be kept strictly confidential;
# it is good practice to read it from an external file, not to enter it in clear

# Cover page image to be used as thumbnail
thumb = "C:/DOCS/teaching_lao.JPG"

# Generate and publish the metadata on the publication
doc_id = "WB_10986/7710"

document_description = {
    'title_statement': {
        'idno': doc_id,
        'title': "Teaching in Lao PDR"
    },
    'date_published': "2007",
    'ref_country': [
        {'name': "Lao PDR", 'code': "LAO"}
    ],
    # Authors: a list of dictionaries, as the 'authors' element
    # is a repeatable element in the schema
    'authors': [
        {'first_name': "Luis",     'last_name': "Benveniste", 'affiliation': "World Bank"},
        {'first_name': "Jeffery",  'last_name': "Marshall",   'affiliation': "World Bank"},
        {'first_name': "Lucrecia", 'last_name': "Santibañez", 'affiliation': "World Bank"}
    ]
}

# Publish the metadata in the central catalog
nada.create_document_dataset(
    dataset_id = doc_id,
    repository_id = "central",
    published = 1,
    overwrite = "yes",
    document_description = document_description,
    thumbnail_path = thumb
)

# Add a link as an external resource of type document/analytical (doc/anl)
nada.add_resource(
    dataset_id = doc_id,
    dctype = "doc/anl",
    title = "Teaching in Lao PDR",
    file_path = "http://hdl.handle.net/10986/7710",
    overwrite = "yes"
)
```
[^1]: See Omar Benjelloun, Shiyu Chen, Natasha Noy, 2020, “Google Dataset Search by the Numbers”, https://doi.org/10.48550/arXiv.2006.06894
This chapter describes the use of a metadata schema for documenting documents. By document, we mean a bibliographic resource of any type, such as a book, a working paper or a paper published in a scientific journal, a report, a presentation, a manual, or any other resource consisting mainly of text and available in physical and/or electronic format.
+Suggestions and recommendations to data curators
Librarians have developed specific standards to describe and catalog documents. The MARC 21 (MAchine-Readable Cataloging) standard used by the United States Library of Congress is one of them. It provides a detailed structure for documenting bibliographic resources, and is the recommended standard for well-resourced document libraries.
+For the purpose of cataloguing documents in a less-specialized repository intended to accommodate data of multiple types, we built our schema on a simpler but also highly popular standard, the Dublin Core Metadata Element Set. We will refer to this metadata specification, developed by the Dublin Core Metadata Initiative, as the Dublin Core. The Dublin Core became an ISO standard (ISO 15836) in 2009. It consists of a list of fifteen core metadata elements, to which more specialized elements can be added. These fifteen elements, with a definition extracted from the Dublin Core website, are the following:
| No | Element name | Description |
|----|--------------|-------------|
| 1 | contributor | An entity responsible for making contributions to the resource. |
| 2 | coverage | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. |
| 3 | creator | An entity primarily responsible for making the resource. |
| 4 | date | A point or period of time associated with an event in the life cycle of the resource. |
| 5 | description | An account of the resource. |
| 6 | format | The file format, physical medium, or dimensions of the resource. |
| 7 | identifier | An unambiguous reference to the resource within a given context. |
| 8 | language | A language of the resource. |
| 9 | publisher | An entity responsible for making the resource available. |
| 10 | relation | A related resource. |
| 11 | rights | Information about rights held in and over the resource. |
| 12 | source | A related resource from which the described resource is derived. |
| 13 | subject | The topic of the resource. |
| 14 | title | A name given to the resource. |
| 15 | type | The nature or genre of the resource. |
Due to its simplicity and versatility, this standard is widely used for multiple purposes. It can be used to document not only documents but also resources of other types, such as images. Documents that can be described using the MARC 21 standard can be described using the Dublin Core, although not with the same granularity of information. The US Library of Congress provides a mapping between the MARC and the Dublin Core metadata elements.
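As an illustration, a Dublin Core description of a document is simply a set of element–value pairs over the fifteen elements listed above. The sketch below describes the “Teaching in Lao PDR” publication used as a running example in this chapter; the `rights`, `format`, and `subject` values are invented for illustration:

```python
# A minimal Dublin Core record using the fifteen core element names.
# Sample values for rights, format, and subject are illustrative only.
dublin_core_record = {
    "contributor": "World Bank",
    "coverage": "Lao PDR",
    "creator": "Luis Benveniste",
    "date": "2007",
    "description": "A study of teaching practices in Lao PDR.",
    "format": "application/pdf",
    "identifier": "WB_10986/7710",
    "language": "en",
    "publisher": "World Bank",
    "relation": "",
    "rights": "CC BY 4.0",
    "source": "",
    "subject": "Education",
    "title": "Teaching in Lao PDR",
    "type": "Text",
}

# All fifteen Dublin Core elements are present
assert len(dublin_core_record) == 15
```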
MARC 21 and the Dublin Core are used to document a resource (typically, the electronic file containing the document) and its content. Another schema, BibTeX, has been developed for the specific purpose of recording bibliographic citations. BibTeX is a list of fields that may be used to generate bibliographic citations compliant with different bibliography styles. It applies to documents of multiple types: books, articles, reports, etc.

The metadata schema we propose to document publications and reports is a combination of Dublin Core, MARC 21, and BibTeX elements. The technical documentation of the schema and its API is available at https://ihsn.github.io/nada-api-redoc/catalog-admin/#tag/Documents.
The proposed schema comprises two main blocks of elements, `metadata_information` and `document_description`. It also contains the `tags` element common to all our schemas. The `repository_id`, `published`, and `overwrite` items in the schema are not metadata elements per se, but parameters used when publishing the metadata in a NADA catalog.
+
```json
{
  "repositoryid": "string",
  "published": 0,
  "overwrite": "no",
  "metadata_information": {},
  "document_description": {},
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": { }
}
```
The `metadata_information` block contains information not related to the document itself but to its metadata. In other words, it contains “metadata on the metadata”. This information is optional, but we recommend entering content at least in the `name` and `date` sub-elements, which indicate who generated the metadata and when. This information is not useful to end-users of document catalogs, but is useful to catalog administrators for two reasons:

- Metadata compliant with standards are intended to be shared and used by inter-operable applications. Data catalogs offer opportunities to harvest (pull) information from other catalogs, or to publish (push) metadata in other catalogs. Metadata information helps to keep track of the provenance of metadata.
- Metadata for the same document may have been generated by more than one person or organization, or one version of the metadata can be updated and replaced with a new version. The metadata information helps catalog administrators distinguish and manage different versions of the metadata.
+
"metadata_information": {
+"title": "string",
+ "idno": "string",
+ "producers": [
+ {
+ "name": "string",
+ "abbr": "string",
+ "affiliation": "string",
+ "role": "string"
+ }
+ ],
+ "production_date": "string",
+ "version": "string"
+ }
The elements in the block are:
+title
[Required ; Not repeatable ; String]
+The title of the metadata document (which will usually be the same as the “Title” in the “Document description / Title statement” section). The metadata document is the metadata file (XML or JSON file) that is being generated.
idno
[Optional ; Not repeatable ; String]
+A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the “Primary ID” in section “Document description / Title statement”, although it is good practice to generate identifiers that establish a clear connection between these two identifiers. The Document ID could also include the metadata document version identifier. For example, if the “Primary ID” of the publication is “978-1-4648-1342-9”, the Document ID could be “IHSN_978-1-4648-1342-9_v1.0” if the metadata are produced by the IHSN and if this is version 1.0 of the metadata. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format. The identifier may not contain blank spaces.
- `producers` [Optional ; Repeatable]
  This refers to the producer(s) of the metadata, not to the producer(s) of the document itself. The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a "Recommended" element. For catalog administration purposes, information on the producer and on the date of metadata production is useful.
  - `name` [Optional ; Not repeatable ; String]
    The name of the person or organization who produced the metadata.
  - `abbr` [Optional ; Not repeatable ; String]
    The abbreviation (acronym) of the organization mentioned in `name`.
  - `affiliation` [Optional ; Not repeatable ; String]
    The affiliation of the person or organization mentioned in `name`.
  - `role` [Optional ; Not repeatable ; String]
    The specific role of the person or organization mentioned in `name` in the production of the metadata.
- `production_date` [Optional ; Not repeatable ; String]
  The date the metadata on this document was produced (not distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a "Recommended" element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.
- `version` [Optional ; Not repeatable ; String]
  The version of the metadata document (not the version of the publication, report, or other resource being documented).
**Example**

```r
my_doc <- list(

  metadata_information = list(

    idno = "WBDG_978-1-4648-1342-9",

    producers = list(
      list(name = "Development Data Group, Curation Team",
           abbr = "WBDG",
           affiliation = "World Bank")
    ),

    production_date = "2020-12-27"

  ),
  # ...
)
```
The `document_description` block contains the metadata elements used to describe the document. It includes the Dublin Core elements and a few more. The schema also includes elements intended to store information generated by machine learning (natural language processing, NLP) models to augment the metadata on documents.
```json
"document_description": {
  "title_statement": {},
  "authors": [],
  "editors": [],
  "date_created": "string",
  "date_available": "string",
  "date_modified": "string",
  "date_published": "string",
  "identifiers": [],
  "type": "string",
  "status": "string",
  "description": "string",
  "toc": "string",
  "toc_structured": [],
  "abstract": "string",
  "notes": [],
  "scope": "string",
  "ref_country": [],
  "geographic_units": [],
  "bbox": [],
  "spatial_coverage": "string",
  "temporal_coverage": "string",
  "publication_frequency": "string",
  "languages": [],
  "license": [],
  "bibliographic_citation": [],
  "chapter": "string",
  "edition": "string",
  "institution": "string",
  "journal": "string",
  "volume": "string",
  "number": "string",
  "pages": "string",
  "series": "string",
  "publisher": "string",
  "publisher_address": "string",
  "annote": "string",
  "booktitle": "string",
  "crossref": "string",
  "howpublished": "string",
  "key": "string",
  "organization": "string",
  "url": null,
  "translators": [],
  "contributors": [],
  "contacts": [],
  "rights": "string",
  "copyright": "string",
  "usage_terms": "string",
  "disclaimer": "string",
  "security_classification": "string",
  "access_restrictions": "string",
  "sources": [],
  "data_sources": [],
  "keywords": [],
  "themes": [],
  "topics": [],
  "disciplines": [],
  "audience": "string",
  "mandate": "string",
  "pricing": "string",
  "relations": [],
  "reproducibility": {}
}
```
title_statement
[Required ; Not repeatable]

The `title_statement` is a required group of five elements, two of which are required:

```json
"title_statement": {
  "idno": "string",
  "title": "string",
  "sub_title": "string",
  "alternate_title": "string",
  "translated_title": "string"
}
```
- `idno` [Required ; Not repeatable ; String]
  `idno` is a unique identification number used to identify the document. A unique identifier is required for cataloguing purposes, so this element is declared as "Required". The identifier will allow users to cite the document properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the document. Alternatively, the `idno` can be constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for a same document (in element `identifiers`); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The `idno` should not contain blank spaces.
- `title` [Required ; Not repeatable ; String]
  The title of the document.
- `sub_title` [Optional ; Not repeatable ; String]
  The subtitle of the document.
- `alternate_title` [Optional ; Not repeatable ; String]
  An alternate title of the document, such as an abbreviated version of the title.
- `translated_title` [Optional ; Not repeatable ; String]
  The title of the document translated into another language.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    title_statement = list(
      idno = "978-1-4648-1342-9",
      title = "The Changing Nature of Work",
      sub_title = "World Development Report 2019",
      alternate_title = "WDR 2019",
      translated_title = "Rapport sur le Développement dans le Monde 2019"
    ),
    # ...
  )
)
```
authors
[Optional ; Repeatable]

The authors should be listed in the same order as they appear in the source itself, which is not necessarily alphabetical.
```json
"authors": [
  {
    "first_name": "string",
    "initial": "string",
    "last_name": "string",
    "affiliation": "string",
    "author_id": [
      {
        "type": null,
        "id": null
      }
    ],
    "full_name": "string"
  }
]
```
- `first_name` [Optional ; Not repeatable ; String]
  The first name of the author.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the author.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the author.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the author.
- `author_id` [Optional ; Repeatable]
  A unique identifier of the author, such as an ORCID iD. This is a block of two elements:
  - `type` [Optional ; Not repeatable ; String]
    The type of identifier, e.g., "ORCID".
  - `id` [Optional ; Not repeatable ; String]
    The identifier of the author for the given `type`.
- `full_name` [Optional ; Not repeatable ; String]
  The full name of the author. This element should only be used when the elements `first_name` and `last_name` cannot be filled out. This element can also be used when the author of a document is an organization or other type of entity.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    authors = list(
      list(first_name = "John", last_name = "Smith",
           author_id = list(type = "ORCID", id = "0000-0002-1234-XXXX")),
      list(first_name = "Jane", last_name = "Doe",
           author_id = list(type = "ORCID", id = "0000-0002-5678-YYYY"))
    ),
    # ...
  )
)
```
editors
[Optional ; Repeatable]

If the source is a text within an edited volume, it should be listed under the name of the author of the text used, not under the name of the editor. The name of the editor should however be provided in the bibliographic citation, in accordance with a reference style.
+
"editors": [
+{
+ "first_name": "string",
+ "initial": "string",
+ "last_name": "string",
+ "affiliation": "string"
+ }
+ ]
- `first_name` [Optional ; Not repeatable ; String]
  The first name of the editor.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the editor.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the editor.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the editor.
date_created
[Optional ; Not repeatable ; String]

The date, preferably entered in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY), when the document was produced. This can be different from the date the document was published or made available, and from its temporal coverage. The document "Nigeria - Displacement Report" by the International Organization for Migration (IOM) shown below provides an example of this. The document was produced in November 2020 (`date_created`), refers to events that occurred between 21 September and 10 October 2020 (`temporal_coverage`), and was published (`date_published`) on 28 January 2021.
date_available
[Optional ; Not repeatable ; String]

The date, preferably entered in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY), when the document was made available. This is different from the date it was published (see element `date_published` below). This element will not be used frequently.

date_modified
[Optional ; Not repeatable ; String]

The date, preferably entered in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY), when the document was last modified.

date_published
[Optional ; Not repeatable ; String]

The date, preferably entered in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY), when the document was published.
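A catalog ingest script can check the three accepted date shapes before publishing. A minimal sketch in Python (the helper name is illustrative):

```python
from datetime import datetime

def valid_partial_iso(date):
    """Accept the three ISO 8601 shapes used in this schema: YYYY, YYYY-MM, YYYY-MM-DD."""
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d"):
        try:
            datetime.strptime(date, fmt)
            return True
        except ValueError:
            pass
    return False
```

Because `strptime` also validates calendar values, an impossible month such as "2020-13" is rejected, not just malformed strings.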
The example below is a report from the International Organization for Migration (IOM). It shows the difference between the date the document was created (`date_created`), the date it was published (`date_published`), and the period it covers (`temporal_coverage`). In R, this will be captured as follows:
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    temporal_coverage = "21 September 2020 to 10 October 2020",
    date_created = "2020-11",
    date_published = "2021-01-28",
    # ...
  ),
  # ...
)
```
identifiers
[Optional ; Repeatable]

This element is used to enter document identifiers (IDs) other than the catalog ID entered in the `title_statement` (`idno`). It can for example be a Digital Object Identifier (DOI), an International Standard Book Number (ISBN), or an International Standard Serial Number (ISSN). The ID entered in the `title_statement` can be repeated here (the `title_statement` does not provide a `type` parameter; if a DOI, ISBN, ISSN, or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`).
```json
"identifiers": [
  {
    "type": "string",
    "identifier": "string"
  }
]
```
- `type` [Optional ; Not repeatable ; String]
  The type of identifier, e.g., "DOI", "ISBN", or "ISSN".
- `identifier` [Required ; Not repeatable ; String]
  The identifier itself.

The example shows the list of identifiers of the World Bank World Development Report 2019, The Changing Nature of Work (see the full metadata for this document in Complete Example 2 of this chapter).
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    identifiers = list(
      list(type = "ISSN",           identifier = "0163-5085"),
      list(type = "ISBN softcover", identifier = "978-1-4648-1328-3"),
      list(type = "ISBN hardcover", identifier = "978-1-4648-1342-9"),
      list(type = "e-ISBN",         identifier = "978-1-4648-1356-6"),
      list(type = "DOI softcover",  identifier = "10.1596/978-1-4648-1328-3"),
      list(type = "DOI hardcover",  identifier = "10.1596/978-1-4648-1342-9")
    ),
    # ...
  ),
  # ...
)
```
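Curators who record DOIs in `identifiers` may want a quick sanity check of their shape. A hedged sketch in Python (the pattern is a loose approximation of the common "10.prefix/suffix" form, not the full DOI syntax):

```python
import re

# Loose check for the common DOI shape "10.<registrant>/<suffix>".
# This is an approximation; the full DOI syntax is more permissive.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def is_doi(identifier):
    """Return True when the identifier looks like a DOI."""
    return bool(DOI_RE.fullmatch(identifier))
```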
type
[Optional ; Not repeatable ; String]

This describes the nature of the resource. It is recommended practice to select a value from a controlled vocabulary, which could for example include the following options: "article", "book", "booklet", "collection", "conference proceedings", "manual", "master thesis", "patent", "PhD thesis", "proceedings", "technical report", "working paper", "website", "other". Specialized agencies may want to create their own controlled vocabularies; for example, a national statistical agency may need options like "press release", "methodology document", "protocol", or "yearbook". The `type` element can be used to create a "Document type" facet (filter) in a data catalog. If the controlled vocabulary contains values that are not mutually exclusive (i.e., if a document could have more than one type), the element `type` cannot be used, as it is not repeatable. In such a case, the solution is to provide the type of document as `tags`, in a `tag_group` that could for example be named type or document_type. Note also that the Dublin Core provides a controlled vocabulary for the `type` element (the DCMI Type Vocabulary), but this vocabulary relates to the type of resource (dataset, event, image, software, sound, etc.), not the type of document, which is what we are interested in here.
status
[Optional ; Not repeatable ; String]
The status of the document. The status of the document should (but does not have to) be provided using a controlled vocabulary, for example with the following options: “first draft”, “draft”, “reviewed draft”, “final draft”, “final”. Most documents published in a catalog will likely be “final”.
description
[Optional ; Not repeatable ; String]
This element is used to provide a brief description of the document (not an abstract, which would be provided in the field abstract
). It should not be used to provide content that is contained in other, more specific elements. As stated in the Dublin Core Usage Guide, “Since the description
field is a potentially rich source of indexable terms, care should be taken to provide this element when possible. Best practice recommendation for this element is to use full sentences, as description is often used to present information to users to assist in their selection of appropriate resources from a set of search results.”
toc
[Optional ; Not repeatable ; String]

The table of content of the document, provided as a single string, i.e., with no structure (a structured alternative is provided with the field `toc_structured` described below). This element is also a rich source of indexable terms which can contribute to document discoverability; care should thus be taken to use it (or the `toc_structured` alternative) whenever possible.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    toc = "Introduction
    1. The importance of rich and structured metadata
    1.1 Rich metadata
    1.2 Structured metadata
    2. Technology: JSON schemas and tools
    2.1 JSON schemas
    2.1.1 Advantages of JSON over XML
    2.2 Defining a metadata schema in JSON format",
    # ...
  ),
  # ...
)
```
toc_structured
[Optional ; Not repeatable]

```json
"toc_structured": [
  {
    "id": "string",
    "parent_id": "string",
    "name": "string"
  }
]
```

This element is used as an alternative to `toc` to provide a structured table of content. The element contains a repeatable block of sub-elements which provides the possibility to define a hierarchical structure:
- `id` [Optional ; Not repeatable ; String]
  A unique identifier for the section of the table of content. For example, the `id` for Chapter 1 could be "1" while the `id` for section 1 of chapter 1 would be "11".
- `parent_id` [Optional ; Not repeatable ; String]
  The `id` of the parent section (e.g., if the table of content is divided into chapters, themselves divided into sections, the `parent_id` of a section would be the `id` of the chapter it belongs to.)
- `name` [Required ; Not repeatable ; String]
  The label (title) of the section.

The example below shows how the content provided in the previous example is presented in a structured format.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    toc_structured = list(
      list(id = "0",   parent_id = "",   name = "Introduction"),
      list(id = "1",   parent_id = "",   name = "1. The importance of rich and structured metadata"),
      list(id = "11",  parent_id = "1",  name = "1.1 Rich metadata"),
      list(id = "12",  parent_id = "1",  name = "1.2 Structured metadata"),
      list(id = "2",   parent_id = "",   name = "2. Technology: JSON schemas and tools"),
      list(id = "21",  parent_id = "2",  name = "2.1 JSON schemas"),
      list(id = "211", parent_id = "21", name = "2.1.1 Advantages of JSON over XML"),
      list(id = "22",  parent_id = "2",  name = "2.2 Defining a metadata schema in JSON format")
      # etc.
    ),
    # ...
  ),
  # ...
)
```
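The `id`/`parent_id` pairs are sufficient to rebuild the hierarchy, for example when rendering a nested table of content in a catalog. A minimal sketch in Python (the function name is illustrative):

```python
def nest_toc(entries):
    """Rebuild the TOC hierarchy from flat records with id / parent_id / name."""
    nodes = {e["id"]: {**e, "children": []} for e in entries}
    roots = []
    for node in nodes.values():
        parent = nodes.get(node["parent_id"])
        # Records whose parent_id matches no id (e.g., "") become top-level sections
        (parent["children"] if parent else roots).append(node)
    return roots
```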
abstract
[Optional ; Not repeatable ; String]
The abstract is a summary of the document, usually about one or two paragraph(s) long (around 150 to 300 words).
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    abstract = "The 2019 World Development Report studies how the nature of work is changing as a result of advances in technology today.
    While technology improves overall living standards, the process can be disruptive.
    A new social contract is needed to smooth the transition and guard against inequality.",
    # ...
  ),
  # ...
)
```
notes
[Optional ; Repeatable]

```json
"notes": [
  {
    "note": "string"
  }
]
```

This field can be used to provide information on the document that does not belong in the other, more specific metadata elements provided in the schema.

- `note` [Optional ; Not repeatable ; String]
  A note, entered as free text.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    notes = list(
      list(note = "This is note 1"),
      list(note = "This is note 2")
    ),
    # ...
  ),
  # ...
)
```
scope
[Optional ; Not repeatable ; String]
A textual description of the topics covered in the document, which complements (but does not duplicate) the elements description
and topics
available in the schema.
ref_country
[Optional ; Repeatable]

The list of countries (or regions) covered by the document, if applicable. This is a repeatable block of two elements:

- `name` [Required ; Not repeatable ; String]
  The name of the country.
- `code` [Optional ; Not repeatable ; String]
  The code of the country.

```json
"ref_country": [
  {
    "name": "string",
    "code": "string"
  }
]
```
The field ref_country
will often be used as a filter (facet) in data catalogs. When a document is related to only part of a country, we still want to capture this information in the metadata. For example, the ref_country
element for the document “Sewerage and sanitation : Jakarta and Manila” will list “Indonesia” (code IDN) and “Philippines” (code PHL).
Considering the importance of the geographic coverage of a document as a filter, the ref_country
element deserves particular attention. The document title will often but not always provide the necessary information. Using R, Python or other programming languages, a list of all countries mentioned in a document can be automatically extracted, with their frequencies. This approach (which requires a lookup file containing a list of all countries in the world with their different denominations and spelling) can be used to extract the information needed to populate the ref_country
element (not all countries in the list will have to be included; some threshold can be set to only include countries that are “significantly” mentioned in a document). Tools like the R package countrycode are available to facilitate this process.
When a document is related to a region (not to specific countries), or when it is related to a topic but not a specific geographic area, the `ref_country` element might still be applicable. Try to extract (possibly using a script that parses the document) information on the countries mentioned in the document. For example, `ref_country` for the World Bank document "The investment climate in South Asia" should include Afghanistan (mentioned 81 times in the document), Bangladesh (113), Bhutan (94), India (148), Maldives (62), Nepal (64), Pakistan (103), and Sri Lanka (98), but also China (not a South-Asian country, but mentioned 63 times in the document).
If a document is not specific to any country, the element ref_country
would be ignored (not included in the metadata) if the content of the document is not related to any geographic area (for example, the user’s guide of a software application), or would contain “World” (code WLD) if the document is related but not specific to countries (for example, a document on “Climate change mitigation”).
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    ref_country = list(
      list(name = "Bangladesh", code = "BGD"),
      list(name = "India", code = "IND"),
      list(name = "Nepal", code = "NPL")
    ),
    # ...
  )
)
```
geographic_units
[Optional ; Repeatable]

A list of geographic units covered by the document, other than the countries listed in `ref_country`.

```json
"geographic_units": [
  {
    "name": "string",
    "code": "string",
    "type": "string"
  }
]
```

- `name` [Required ; Not repeatable ; String]
  The name of the geographic unit.
- `code` [Optional ; Not repeatable ; String]
  The code of the geographic unit.
- `type` [Optional ; Not repeatable ; String]
  The type of the geographic unit (e.g., "state", "province", "district").

bbox
[Optional ; Repeatable]

A geographic bounding box defining the area covered by the document, expressed as coordinates:

```json
"bbox": [
  {
    "west": "string",
    "east": "string",
    "south": "string",
    "north": "string"
  }
]
```

- `west` [Required ; Not repeatable ; String]
  The western longitude of the bounding box.
- `east` [Optional ; Not repeatable ; String]
  The eastern longitude of the bounding box.
- `south` [Optional ; Not repeatable ; String]
  The southern latitude of the bounding box.
- `north` [Optional ; Not repeatable ; String]
  The northern latitude of the bounding box.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    bbox = list(
      list(west  = "92.12973",
           east  = "92.26863",
           south = "20.91856",
           north = "21.22292")
    ),
    # ...
  ),
  # ...
)
```
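Since the coordinates are stored as strings, a catalog can sanity-check them before publication. A minimal sketch in Python (it assumes west <= east, so it does not handle boxes crossing the antimeridian; the function name is illustrative):

```python
def valid_bbox(bbox):
    """Check that the four coordinates parse as numbers and fall in plausible ranges.
    Assumes west <= east and south <= north (no antimeridian-crossing boxes)."""
    try:
        west, east = float(bbox["west"]), float(bbox["east"])
        south, north = float(bbox["south"]), float(bbox["north"])
    except (KeyError, ValueError):
        return False
    return -180 <= west <= east <= 180 and -90 <= south <= north <= 90
```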
spatial_coverage
[Optional ; Not repeatable ; String]
This element provides another space for capturing information on the spatial coverage of a document, which complements the ref_country
, geographic_units
, and bbox
elements. It can be used to qualify the geographic coverage of the document, in the form of a free text. For example, a report on refugee camps in the Cox’s Bazar district of Bangladesh would have Bangladesh as reference country, “Cox’s Bazar” as a geographic unit, and “Rohingya’s refugee camps” as spatial coverage.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    ref_country = list(
      list(name = "Bangladesh", code = "BGD")
    ),
    geographic_units = list(
      list(name = "Cox's Bazar", type = "District")
    ),
    spatial_coverage = "Rohingya's refugee camps",
    # ...
  ),
  # ...
)
```
temporal_coverage
[Optional ; Not repeatable ; String]
Not all documents have a specific time coverage. When they do, it can be specified in this element.
publication_frequency
[Optional ; Not repeatable ; String]

Some documents are published regularly. The frequency of publication can be documented using this element. It is recommended to use a controlled vocabulary, for example the PRISM Publishing Frequency Vocabulary, which identifies standard publishing frequencies for a serial or periodical publication.

| Frequency    | Description |
|--------------|-------------|
| annually     | Published once a year |
| semiannually | Published twice a year |
| quarterly    | Published every 3 months, or once a quarter |
| bimonthly    | Published twice a month |
| monthly      | Published once a month |
| biweekly     | Published twice a week |
| weekly       | Published once a week |
| daily        | Published every day |
| continually  | Published continually as new content is added, typically several times a day; typical of websites and blogs |
| irregularly  | Published on an irregular schedule, such as every month except July and August |
| other        | Published on another schedule not enumerated in this controlled vocabulary |
languages
[Optional ; Repeatable]

```json
"languages": [
  {
    "name": "string",
    "code": "string"
  }
]
```

This is a block of two elements (at least one must be provided for each language):

- `name` [Optional ; Not repeatable ; String]
  The name of the language.
- `code` [Optional ; Not repeatable ; String]
  The code of the language.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    languages = list(
      list(name = "English", code = "EN")
    ),
    # ...
  ),
  # ...
)
```
license
[Optional ; Repeatable]

```json
"license": [
  {
    "name": "string",
    "uri": "string"
  }
]
```

- `name` [Required ; Not repeatable ; String]
  The name of the license.
- `uri` [Optional ; Not repeatable ; String]
  A URI to the license terms.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    license = list(
      list(name = "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)",
           uri  = "http://creativecommons.org/licenses/by/3.0/igo")
    ),
    # ...
  ),
  # ...
)
```
bibliographic_citation
[Optional ; Repeatable]

In `bibliographic_citation`, the citation is provided as a single item. It should be provided in a standard style: Modern Language Association (MLA), American Psychological Association (APA), or Chicago. Note that the schema provides an itemized list of all elements (BibTex fields) required to build a citation in a format of their choice.

```json
"bibliographic_citation": [
  {
    "style": "string",
    "citation": "string"
  }
]
```

- `style` [Optional ; Not repeatable ; String]
  The citation style, e.g., "MLA", "APA", or "Chicago".
- `citation` [Optional ; Not repeatable ; String]
  The bibliographic citation in the given `style`.

The example below shows how the bibliographic citation for an article published in Econometrica can be provided in three different formats.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    bibliographic_citation = list(

      list(style = "MLA",
           citation = 'Davidson, Russell, and Jean-Yves Duclos. “Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality.” Econometrica, vol. 68, no. 6, [Wiley, Econometric Society], 2000, pp. 1435–64, http://www.jstor.org/stable/3003995.'),

      list(style = "APA",
           citation = 'Davidson, R., & Duclos, J.-Y. (2000). Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality. Econometrica, 68(6), 1435–1464. http://www.jstor.org/stable/3003995'),

      list(style = "Chicago",
           citation = 'Davidson, Russell, and Jean-Yves Duclos. “Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality.” Econometrica 68, no. 6 (2000): 1435–64. http://www.jstor.org/stable/3003995.')

    ),
    # ...
  ),
  # ...
)
```
**Bibliographic elements**

The elements that follow are bibliographic elements that correspond to BibTex fields. Note that some of the BibTex elements are found elsewhere in the schema (namely `type`, `authors`, `editors`, `year` and `month`, `isbn`, `issn`, and `doi`); when constructing a bibliographic citation, these external elements will have to be included as relevant. The description of the bibliographic fields listed below was adapted from Wikipedia's description of BibTex.
```json
{
  "chapter": "string",
  "edition": "string",
  "institution": "string",
  "journal": "string",
  "volume": "string",
  "number": "string",
  "pages": "string",
  "series": "string",
  "publisher": "string",
  "publisher_address": "string",
  "annote": "string",
  "booktitle": "string",
  "crossref": "string",
  "howpublished": "string",
  "key": "string",
  "organization": "string",
  "url": null
}
```
The elements that are required to form a complete bibliographic citation depend on the type of document. The table below, adapted from the BibTex templates, provides a list of required and optional fields by type of document:

| Document type | Required fields | Optional fields |
|---|---|---|
| Article from a journal or magazine | author, title, journal, year | volume, number, pages, month, note, key |
| Book with an explicit publisher | author or editor, title, publisher, year | volume, series, address, edition, month, note, key |
| Printed and bound document without a named publisher or sponsoring institution | title | author, howpublished, address, month, year, note, key |
| Part of a book (chapter and/or range of pages) | author or editor, title, chapter and/or pages, publisher, year | volume, series, address, edition, month, note, key |
| Part of a book with its own title | author, title, book title, publisher, year | editor, pages, organization, publisher, address, month, note, key |
| Article in a conference proceedings | author, title, book title, year | editor, pages, organization, publisher, address, month, note, key |
| Technical documentation | title | author, organization, address, edition, month, year, key |
| Master's thesis | author, title, school, year | address, month, note, key |
| Ph.D. thesis | author, title, school, year | address, month, note, key |
| Proceedings of a conference | title, year | editor, publisher, organization, address, month, note, key |
| Report published by a school or other institution, usually numbered within a series | author, title, institution, year | type, number, address, month, note, key |
| Document with an author and title, but not formally published | author, title, note | month, year, key |
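The table can be turned into a simple completeness check run before a citation is generated. A minimal sketch in Python (the dictionary keys are illustrative BibTex-style type names, and only a few rows of the table are transcribed):

```python
# Required fields for a few document types, transcribed from the table above
# (keys are illustrative type names, not controlled-vocabulary values)
REQUIRED_FIELDS = {
    "article": {"author", "title", "journal", "year"},
    "techreport": {"author", "title", "institution", "year"},
    "proceedings": {"title", "year"},
}

def missing_fields(doc_type, record):
    """Return the required BibTex fields absent from a metadata record (a dict)."""
    return REQUIRED_FIELDS.get(doc_type, set()) - set(record)
```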
chapter
[Optional ; Not repeatable ; String]

A chapter (or section) number. This element is only used to document a resource which has been extracted from a book.

edition
[Optional ; Not repeatable ; String]

The edition of a book, for example "Second". When a book has no edition number/name present, it can be assumed to be a first edition. If the edition is other than the first, information on the edition of the book being documented must be mentioned in the citation. The edition can be identified by a number, a label (such as "Revised edition" or "Abridged edition"), and/or a year. The first letter of the label should be capitalized.

institution
[Optional ; Not repeatable ; String]

The sponsoring institution of a technical report. For citations of Master's and Ph.D. theses, this will be the name of the school.

journal
[Optional ; Not repeatable ; String]

A journal name. Abbreviations are provided for many journals.

volume
[Optional ; Not repeatable ; String]

The volume of a journal or multi-volume book. Periodical publications, such as scholarly journals, are published on a regular basis in installments that are called issues. A volume usually consists of the issues published during one year.

number
[Optional ; Not repeatable ; String]

The number of a journal, magazine, technical report, or of a work in a series. An issue of a journal or magazine is usually identified by its `volume` (see previous element) and `number`; the organization that issues a technical report usually gives it a number; and sometimes books are given numbers in a named series.

pages
[Optional ; Not repeatable ; String]

One or more page numbers or ranges of numbers, such as 42-111 or 7,41,73-97 or 43+ (the '+' indicates pages following that don't form a simple range).

series
[Optional ; Not repeatable ; String]

The name of a series or set of books. When citing an entire book, the title field gives its title, and an optional series field gives the name of a series or multi-volume set in which the book is published.

publisher
[Optional ; Not repeatable ; String]

The entity responsible for making the resource available. For major publishing houses, the information can be omitted. For small publishers, providing the complete address is recommended. If the company is a university press, the abbreviation UP (for University Press) can be used. The publisher is not stated for journal articles, working papers, and similar types of documents.

publisher_address
[Optional ; Not repeatable ; String]

The address of the publisher. For major publishing houses, just the city is given. For small publishers, the complete address can be provided.

annote
[Optional ; Not repeatable ; String]

An annotation. This element will not be used by standard bibliography styles like MLA, APA, or Chicago, but may be used by others that produce an annotated bibliography.

booktitle
[Optional ; Not repeatable ; String]

Title of a book, part of which is being cited. If you are documenting the book itself, this element will not be used; it is only used when part of a book is being documented.

crossref
[Optional ; Not repeatable ; String]

The catalog identifier ("database key") of another catalog entry being cross-referenced. This element may be used when multiple entries refer to a same publication, to avoid duplication.

howpublished
[Optional ; Not repeatable ; String]

The `howpublished` element is used to store the notice for unusual publications. The first word should be capitalized. For example, "WebPage", or "Distributed at the local tourist office".

key
[Optional ; Not repeatable ; String]

A key is a field used for alphabetizing, cross-referencing, and creating a label when the 'author' information is missing.

organization
[Optional ; Not repeatable ; String]

The organization that sponsors a conference or that publishes a manual.

url
[Optional ; Not repeatable ; String]

The URL of the document, preferably a permanent URL.
This example makes use of the same Econometrica paper used in the previous example.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    bibliographic_fields = list(
      doi     = "https://doi.org/10.1111/1468-0262.00167",
      journal = "Econometrica",
      volume  = "68",
      issue   = "6",
      pages   = "1435-1464",
      url     = "https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00167"
    ),
    # ...
  ),
  # ...
)
```
translators
[Optional ; Repeatable]

```json
"translators": [
  {
    "first_name": "string",
    "initial": "string",
    "last_name": "string",
    "affiliation": "string"
  }
]
```

- `first_name` [Optional ; Not repeatable ; String]
  The first name of the translator.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the translator.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the translator.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the translator.

contributors
[Optional ; Repeatable]

This element is used to acknowledge persons or organizations who contributed to the production of the document, other than those already mentioned in `authors` or `translators`.
```json
"contributors": [
  {
    "first_name": "string",
    "initial": "string",
    "last_name": "string",
    "affiliation": "string",
    "contribution": "string"
  }
]
```
first_name
[Optional ; Not repeatable ; String] initial
[Optional ; Not repeatable ; String] last_name
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] contribution
[Optional ; Not repeatable ; String] contacts
[Optional ; Repeatable] "contacts": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "email": "string",
+ "telephone": "string",
+ "uri": "string"
+ }
+ ]
- **`name`** [Optional ; Not repeatable ; String] <br> The name of the contact person.
- **`role`** [Optional ; Not repeatable ; String] <br> The role of the contact person.
- **`affiliation`** [Optional ; Not repeatable ; String] <br> The affiliation of the contact person.
- **`email`** [Optional ; Not repeatable ; String] <br> The email address of the contact person.
- **`telephone`** [Optional ; Not repeatable ; String] <br> The telephone number of the contact person.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link (URL) to a web page related to the contact.

**`rights`** [Optional ; Not repeatable ; String]

A statement on the rights associated with the document (other than the copyright, which should be described in the element `copyright` described below).
The example is extracted from the World Bank World Development Report 2019.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    rights = "Some rights reserved. Nothing herein shall constitute or be considered to be a limitation upon or waiver of the privileges and immunities of The World Bank, all of which are specifically reserved.",
    # ...
  ),
  # ...
)
```
**`copyright`** [Optional ; Not repeatable ; String]

A statement and identifier indicating the legal ownership and rights regarding use and re-use of all or part of the resource. If the document is protected by a copyright, enter the information on the person or organization who owns the rights.

**`usage_terms`** [Optional ; Not repeatable ; String]

A description of the legal terms or other conditions that a person or organization who wants to use or reproduce the document has to comply with.

**`disclaimer`** [Optional ; Not repeatable ; String]

A disclaimer limits the liability of the author(s) and/or publisher(s) of the document. A standard legal statement should be used for all documents from the same agency.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    disclaimer = "This work is a product of the staff of The World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries.",
    # ...
  ),
  # ...
)
```
**`security_classification`** [Optional ; Not repeatable ; String]

Information on the security classification attached to the document. The different levels of classification indicate the degree of sensitivity of the content of the document. This field should make use of a controlled vocabulary, specific to or adopted by the organization that curates or disseminates the document. Such a vocabulary could contain the following levels: public, internal only, confidential, restricted, strictly confidential.

**`access_restrictions`** [Optional ; Not repeatable ; String]

A textual description of the access restrictions that apply to the document.
**`sources`** [Optional ; Repeatable]

```json
"sources": [
  {
    "source_origin": "string",
    "source_char": "string",
    "source_doc": "string"
  }
]
```

This element is used to describe sources of various types (except data sources, which must be listed in the next element, `data_sources`) that were used in the production of the document.

- **`source_origin`** [Optional ; Not repeatable ; String] <br> For historical materials, information about the origin(s) of the sources and the rules followed in establishing them.
- **`source_char`** [Optional ; Not repeatable ; String] <br> An assessment of the characteristics and quality of the source material.
- **`source_doc`** [Optional ; Not repeatable ; String] <br> Documentation of, and access to, the source.
**`data_sources`** [Optional ; Repeatable]

```json
"data_sources": [
  {
    "name": "string",
    "uri": "string",
    "note": "string"
  }
]
```

Used to list the machine-readable data file(s), if any, that served as the source(s) of the data collection.

- **`name`** [Required ; Not repeatable ; String] <br> The name (title) of the dataset used as a source.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link (URL) to the dataset or to a web page describing the dataset.
- **`note`** [Optional ; Not repeatable ; String] <br> Additional information on the data source.
The data source for the publication *Bangladesh Demographic and Health Survey (DHS), 2017-18 - Final Report* would be entered as follows:

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    data_sources = list(
      list(name = "Bangladesh Demographic and Health Survey 2017-18",
           uri = "https://www.dhsprogram.com/methodology/survey/survey-display-536.cfm",
           note = "Household survey conducted by the National Institute of Population Research and Training, Medical Education and Family Welfare Division and Ministry of Health and Family Welfare. Data and documentation available at https://dhsprogram.com/")
    ),
    # ...
  ),
  # ...
)
```
**`keywords`** [Optional ; Repeatable]

```json
"keywords": [
  {
    "name": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
```

A list of keywords that provide information on the core content of the document. Keywords provide a convenient solution to improve the discoverability of the document, as they allow terms and phrases not found in the document itself to be indexed, making the document discoverable by text-based search engines. A controlled vocabulary can be used (although not required), such as the UNESCO Thesaurus. The list provided here can combine keywords from multiple controlled vocabularies and user-defined keywords.

- **`name`** [Required ; Not repeatable ; String] <br> The keyword itself.
- **`vocabulary`** [Optional ; Not repeatable ; String] <br> The name of the controlled vocabulary from which the keyword was taken, if any.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link to the controlled vocabulary.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    keywords = list(
      list(name = "Migration", vocabulary = "Unesco Thesaurus (June 2021)",
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
      list(name = "Migrants", vocabulary = "Unesco Thesaurus (June 2021)",
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
      list(name = "Refugee", vocabulary = "Unesco Thesaurus (June 2021)",
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
      list(name = "Conflict"),
      list(name = "Asylum seeker"),
      list(name = "Forced displacement"),
      list(name = "Forcibly displaced"),
      list(name = "Internally displaced population (IDP)"),
      list(name = "Population of concern (PoC)"),
      list(name = "Returnee"),
      list(name = "UNHCR")
    ),
    # ...
  ),
  # ...
)
```
**`themes`** [Optional ; Repeatable]

```json
"themes": [
  {
    "id": "string",
    "name": "string",
    "parent_id": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
```

A list of themes covered by the document. A controlled vocabulary will preferably be used. The list provided here can combine themes from multiple controlled vocabularies and user-defined themes. Note that `themes` will rarely be used, as the elements `topics` and `disciplines` are more appropriate for most uses. This is a block of five fields:

- **`id`** [Optional ; Not repeatable ; String] <br> The unique identifier of the theme.
- **`name`** [Required ; Not repeatable ; String] <br> The label of the theme.
- **`parent_id`** [Optional ; Not repeatable ; String] <br> The identifier of the parent theme, if any.
- **`vocabulary`** [Optional ; Not repeatable ; String] <br> The name of the controlled vocabulary, if any.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link to the controlled vocabulary.

**`topics`** [Optional ; Repeatable]

```json
"topics": [
  {
    "id": "string",
    "name": "string",
    "parent_id": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
```

Information on the topics covered in the document. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification, a typology of topics available in 11 languages; the Journal of Economic Literature (JEL) Classification System; or the World Bank topics classification. The list provided here can combine topics from multiple controlled vocabularies and user-defined topics. The element is a block of five fields:

- **`id`** [Optional ; Not repeatable ; String] <br> The unique identifier of the topic.
- **`name`** [Required ; Not repeatable ; String] <br> The label of the topic.
- **`parent_id`** [Optional ; Not repeatable ; String] <br> The identifier of the parent topic, if any.
- **`vocabulary`** [Optional ; Not repeatable ; String] <br> The name of the controlled vocabulary, if any.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link to the controlled vocabulary.

We use the working paper "Push and Pull - A Study of International Migration from Nepal" by Maheshwor Shrestha, World Bank Policy Research Working Paper 7965, February 2017, as an example.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    topics = list(
      list(name = "Demography.Migration",
           vocabulary = "CESSDA Topic Classification",
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      list(name = "Demography.Censuses",
           vocabulary = "CESSDA Topic Classification",
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      list(id = "F22",
           name = "International Migration",
           parent_id = "F2 - International Factor Movements and International Business",
           vocabulary = "JEL Classification System",
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
      list(id = "O15",
           name = "Human Resources - Human Development - Income Distribution - Migration",
           parent_id = "O1 - Economic Development",
           vocabulary = "JEL Classification System",
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
      list(id = "O12",
           name = "Microeconomic Analyses of Economic Development",
           parent_id = "O1 - Economic Development",
           vocabulary = "JEL Classification System",
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
      list(id = "J61",
           name = "Geographic Labor Mobility - Immigrant Workers",
           parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers",
           vocabulary = "JEL Classification System",
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
    ),
    # ...
  ),
  # ...
)
```
**`disciplines`** [Optional ; Repeatable]

```json
"disciplines": [
  {
    "id": "string",
    "name": "string",
    "parent_id": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
```

Information on the academic disciplines related to the content of the document. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia. The list provided here can combine disciplines from multiple controlled vocabularies and user-defined disciplines. This is a block of five elements:

- **`id`** [Optional ; Not repeatable ; String] <br> The unique identifier of the discipline.
- **`name`** [Optional ; Not repeatable ; String] <br> The label of the discipline.
- **`parent_id`** [Optional ; Not repeatable ; String] <br> The identifier of the parent discipline, if any.
- **`vocabulary`** [Optional ; Not repeatable ; String] <br> The name of the controlled vocabulary, if any.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link to the controlled vocabulary.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    disciplines = list(
      list(name = "Economics",
           vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
           uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
      list(name = "Agricultural economics",
           vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
           uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
      list(name = "Econometrics",
           vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
           uri = "https://en.wikipedia.org/wiki/List_of_academic_fields")
    ),
    # ...
  ),
  # ...
)
```
**`audience`** [Optional ; Not repeatable ; String]

Information on the intended audience for the document, i.e., the category or categories of users for whom the resource is intended in terms of their interest, skills, status, or other characteristics.

**`mandate`** [Optional ; Not repeatable ; String]

The legislative or other mandate under which the resource was produced.

**`pricing`** [Optional ; Not repeatable ; String]

The current price of the document in any defined currency. As this information is subject to regular change, it will often not be included in the document metadata.
**`relations`** [Optional ; Repeatable]

```json
"relations": [
  {
    "name": "string",
    "type": "isPartOf"
  }
]
```

- **`name`** [Optional ; Not repeatable ; String] <br> The name (or citation) of the related resource.
- **`type`** [Optional ; Not repeatable ; String] <br> The type of relationship, preferably from the following controlled vocabulary: isPartOf, hasPart, isVersionOf, isFormatOf, hasFormat, references, isReferencedBy, isBasedOn, isBasisFor, replaces, isReplacedBy, requires, isRequiredBy.

| Type | Description |
|------|-------------|
| isPartOf | The described resource is a physical or logical part of the referenced resource. |
| hasPart | The described resource includes the referenced resource either physically or logically. |
| isVersionOf | The described resource is a version, edition, or adaptation of the referenced resource. A change in version implies substantive changes in content rather than differences in format. |
| isFormatOf | The described resource is the same intellectual content as the referenced resource, but presented in another format. |
| hasFormat | The described resource pre-existed the referenced resource, which is essentially the same intellectual content presented in another format. |
| references | The described resource references, cites, or otherwise points to the referenced resource. |
| isReferencedBy | The described resource is referenced, cited, or otherwise pointed to by the referenced resource. |
| isBasedOn | |
| isBasisFor | |
| replaces | The described resource supplants, displaces, or supersedes the referenced resource. |
| isReplacedBy | The described resource is supplanted, displaced, or superseded by the referenced resource. |
| requires | The described resource requires the referenced resource to support its function, delivery, or coherence of content. |
| isRequiredBy | The described resource is required by the referenced resource, either physically or logically. |
**`reproducibility`** [Optional ; Not repeatable]

```json
"reproducibility": {
  "statement": "string",
  "links": [
    {
      "uri": "string",
      "description": "string"
    }
  ]
}
```

We present in chapter 12 a metadata schema intended to document reproducible research and scripts. That chapter lists multiple reasons to make research reproducible, replicable, and auditable. Ideally, when a research output (paper) is published, the data and code used in the underlying analysis should be made as openly available as possible. Increasingly, academic journals make this a requirement. The `reproducibility` element is used to provide interested users with information on the reproducibility and replicability of the research output.

- **`statement`** [Optional ; Not repeatable ; String] <br> A statement on the reproducibility of the research output.
- **`links`** [Optional ; Repeatable] <br> Links to reproducibility resources:
  - **`uri`** [Optional ; Not repeatable ; String] <br> The link (URL) to the resource.
  - **`description`** [Optional ; Not repeatable ; String] <br> A brief description of the resource.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    reproducibility = list(
      statement = "The scripts used to acquire data, assess and edit data files, train the econometric models, and to generate the tables and charts included in the publication, are openly accessible (Stata 15 scripts).",
      links = list(
        list(uri = "www.[...]",
             description = "Description and access to reproducible Stata scripts"),
        list(uri = "www.[...]",
             description = "Derived data files")
      )
    ),
    # ...
  ),
  # ...
)
```
Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.

```json
"provenance": [
  {
    "origin_description": {
      "harvest_date": "string",
      "altered": true,
      "base_url": "string",
      "identifier": "string",
      "date_stamp": "string",
      "metadata_namespace": "string"
    }
  }
]
```

**`origin_description`** [Required ; Not repeatable]

The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.

- **`harvest_date`** [Required ; Not repeatable ; String] <br> The date and time the metadata were harvested.
- **`altered`** [Optional ; Not repeatable ; Boolean] <br> Indicates whether the harvested metadata were altered before being re-published. A common alteration is a change of the unique identifier (element `idno` in the Document Description / Title Statement section), which will be modified when published in a new catalog.
- **`base_url`** [Required ; Not repeatable ; String] <br> The URL of the source catalog from which the metadata were harvested.
- **`identifier`** [Optional ; Not repeatable ; String] <br> The identifier of the document (element `idno`) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
- **`date_stamp`** [Optional ; Not repeatable ; String] <br> The date stamp of the metadata record in the source catalog.
- **`metadata_namespace`** [Optional ; Not repeatable ; String] <br> The namespace of the harvested metadata.

**`lda_topics`** [Optional ; Not repeatable]
```json
"lda_topics": [
  {
    "model_info": [
      {
        "source": "string",
        "author": "string",
        "version": "string",
        "model_id": "string",
        "nb_topics": 0,
        "description": "string",
        "corpus": "string",
        "uri": "string"
      }
    ],
    "topic_description": [
      {
        "topic_id": null,
        "topic_score": null,
        "topic_label": "string",
        "topic_words": [
          {
            "word": "string",
            "word_weight": 0
          }
        ]
      }
    ]
  }
]
```
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference provides the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
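The selective indexing mentioned above (applying a threshold on topic shares) can be sketched in a few lines of Python. This is an illustration only: the topic identifiers, labels, and scores below are hypothetical, not the output of an actual model.

```python
# Hypothetical topic composition of one document, mirroring the
# topic_description sub-elements of the lda_topics schema element.
topic_description = [
    {"topic_id": "topic_27", "topic_score": 0.32, "topic_label": "Education"},
    {"topic_id": "topic_8",  "topic_score": 0.24, "topic_label": "Gender"},
    {"topic_id": "topic_39", "topic_score": 0.22, "topic_label": "Forced displacement"},
    {"topic_id": "topic_40", "topic_score": 0.11, "topic_label": "Development policies"},
    {"topic_id": "topic_3",  "topic_score": 0.02, "topic_label": "Transport"},
]

def topics_for_indexing(topics, min_score=0.10):
    """Keep only the topics whose share in the document reaches the
    threshold, so that marginal topics do not pollute the search index."""
    return [t["topic_label"] for t in topics if t["topic_score"] >= min_score]

print(topics_for_indexing(topic_description))
```

A search engine could then index only these dominant topic labels, rather than all 75 topics of the model, most of which have near-zero shares for any given document.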
The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition.

**Important note:** the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.

The image below provides an example of topics extracted from a document from the United Nations High Commissioner for Refugees, using an LDA topic model trained by the World Bank (this model was trained to identify 75 topics; no document will cover all topics).

The `lda_topics` element includes the following metadata fields:
**`model_info`** [Optional ; Not repeatable]

Information on the LDA model.

- **`source`** [Optional ; Not repeatable ; String] <br> The source of the model.
- **`author`** [Optional ; Not repeatable ; String] <br> The author(s) of the model.
- **`version`** [Optional ; Not repeatable ; String] <br> The version of the model.
- **`model_id`** [Optional ; Not repeatable ; String] <br> The unique identifier of the model.
- **`nb_topics`** [Optional ; Not repeatable ; Numeric] <br> The number of topics in the model.
- **`description`** [Optional ; Not repeatable ; String] <br> A brief description of the model.
- **`corpus`** [Optional ; Not repeatable ; String] <br> The corpus of documents on which the model was trained.
- **`uri`** [Optional ; Not repeatable ; String] <br> A link to a web page with more information on the model.

**`topic_description`** [Optional ; Repeatable]

The topic composition of the document.

- **`topic_id`** [Optional ; Not repeatable ; String] <br> The identifier of the topic.
- **`topic_score`** [Optional ; Not repeatable ; Numeric] <br> The share of the topic in the document.
- **`topic_label`** [Optional ; Not repeatable ; String] <br> The label of the topic.
- **`topic_words`** [Optional ; Not repeatable] <br> The list of words defining the topic:
  - **`word`** [Optional ; Not repeatable ; String] <br> The word.
  - **`word_weight`** [Optional ; Not repeatable ; Numeric] <br> The weight of the word in the topic.
```r
lda_topics = list(
  list(
    model_info = list(
      list(source = "World Bank, Development Data Group",
           author = "A.S.",
           version = "2021-06-22",
           model_id = "Mallet_WB_75",
           nb_topics = 75,
           description = "LDA model, 75 topics, trained on Mallet",
           corpus = "World Bank Documents and Reports (1950-2021)",
           uri = "")
    ),
    topic_description = list(
      list(topic_id = "topic_27",
           topic_score = 32,
           topic_label = "Education",
           topic_words = list(list(word = "school", word_weight = ""),
                              list(word = "teacher", word_weight = ""),
                              list(word = "student", word_weight = ""),
                              list(word = "education", word_weight = ""),
                              list(word = "grade", word_weight = ""))),
      list(topic_id = "topic_8",
           topic_score = 24,
           topic_label = "Gender",
           topic_words = list(list(word = "women", word_weight = ""),
                              list(word = "gender", word_weight = ""),
                              list(word = "man", word_weight = ""),
                              list(word = "female", word_weight = ""),
                              list(word = "male", word_weight = ""))),
      list(topic_id = "topic_39",
           topic_score = 22,
           topic_label = "Forced displacement",
           topic_words = list(list(word = "refugee", word_weight = ""),
                              list(word = "programme", word_weight = ""),
                              list(word = "country", word_weight = ""),
                              list(word = "migration", word_weight = ""),
                              list(word = "migrant", word_weight = ""))),
      list(topic_id = "topic_40",
           topic_score = 11,
           topic_label = "Development policies",
           topic_words = list(list(word = "development", word_weight = ""),
                              list(word = "policy", word_weight = ""),
                              list(word = "national", word_weight = ""),
                              list(word = "strategy", word_weight = ""),
                              list(word = "activity", word_weight = "")))
    )
  )
)
```
**`embeddings`** [Optional ; Repeatable]

In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones.

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
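As a minimal illustration of the distance calculation mentioned above, the cosine similarity between embedding vectors can be computed directly. The four-dimensional vectors below are hypothetical stand-ins; real embedding models produce vectors of 100 or more dimensions.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Hypothetical document vectors (in practice, produced by one same embedding model)
doc_vectors = {
    "doc_A": [0.9, 0.1, 0.0, 0.2],
    "doc_B": [0.1, 0.8, 0.7, 0.0],
    "doc_C": [0.0, 0.2, 0.9, 0.1],
}

query = [0.8, 0.2, 0.1, 0.3]   # vector of a query or of a reference document
ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query, doc_vectors[d]),
                reverse=True)
print(ranked)   # documents ordered from most to least semantically close
```

Dedicated vector databases such as Milvus perform this kind of nearest-neighbor search at scale; the sketch above only shows the underlying principle.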
```json
"embeddings": [
  {
    "id": "string",
    "description": "string",
    "date": "string",
    "vector": null
  }
]
```

The `embeddings` element contains four metadata fields:

- **`id`** [Optional ; Not repeatable ; String] <br> An identifier for the embedding.
- **`description`** [Optional ; Not repeatable ; String] <br> A brief description of the embedding model used.
- **`date`** [Optional ; Not repeatable ; String] <br> The date the vector was generated.
- **`vector`** [Required ; Not repeatable ; Object] <br> The numeric vector representing the document, provided as an object (array or string).

**`additional`** [Optional ; Not repeatable]

The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
Generating metadata compliant with the document schema is easy. The three examples below illustrate how metadata can be generated and published in a NADA catalog programmatically. In the first two examples, we assume that an electronic copy of a document is available, and that the metadata must be generated from scratch (not by re-purposing/mapping existing metadata). In the third example, we assume that a list of publications with some metadata is available as a CSV file; metadata compliant with the schema are created and published in a catalog using a single script.
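The CSV-based approach can be sketched as follows. This is a hedged illustration, not a tested NADA script: the CSV column names are assumptions, and the final publishing call (e.g., `document_add()` in the nadar package, or pynada's equivalent) is only indicated in a comment.

```python
import csv
import io

# A small in-memory stand-in for the CSV file (column names are hypothetical)
csv_text = """idno,title,author_last,author_first,date_published
WP001,Measuring Poverty,Doe,Jane,2019-05
WP002,Education Outcomes,Roe,John,2020-11
"""

documents = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Map each CSV row to a metadata object compliant with the document schema
    documents.append({
        "document_description": {
            "title_statement": {"idno": row["idno"], "title": row["title"]},
            "date_published": row["date_published"],
            "authors": [{"last_name": row["author_last"],
                         "first_name": row["author_first"]}],
        }
    })
    # A real script would now publish each object to the catalog,
    # e.g. with nadar's document_add() or pynada's create_document_dataset().

print(len(documents))   # one schema-compliant metadata object per CSV row
```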
This document is the World Bank Policy Research Working Paper No. 9412, titled "Predicting Food Crises", published in September 2020 under a CC-BY 4.0 license. The list of authors is provided on the cover page; an abstract, a list of acknowledgments, and a list of keywords are also provided.
```r
library(nadar)

# ----------------------------------------------------------------------------------
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_folder")

doc_file <- "WB_PRWP_9412_Food_Crises.pdf"
id       <- "WB_WPS9412"

thumb_file <- gsub(".pdf", ".jpg", doc_file)
capture_pdf_cover(doc_file)   # Capture the cover page for use as thumbnail

example_1 <- list(

  document_description = list(

    title_statement = list(idno = id, title = "Predicting Food Crises"),

    date_published = "2020-09",

    authors = list(
      list(last_name = "Andrée", first_name = "Bo Pieter Johannes",
           affiliation = "World Bank",
           author_id = list(list(type = "ORCID", id = "0000-0002-8007-5007"))),
      list(last_name = "Chamorro", first_name = "Andres",
           affiliation = "World Bank"),
      list(last_name = "Kraay", first_name = "Aart",
           affiliation = "World Bank"),
      list(last_name = "Spencer", first_name = "Phoebe",
           affiliation = "World Bank"),
      list(last_name = "Wang", first_name = "Dieter",
           affiliation = "World Bank",
           author_id = list(list(type = "ORCID", id = "0000-0003-1287-332X")))
    ),

    journal   = "World Bank Policy Research Working Paper",
    number    = "9412",
    publisher = "World Bank",

    ref_country = list(
      list(name="Afghanistan", code="AFG"),
      list(name="Burkina Faso", code="BFA"),
      list(name="Chad", code="TCD"),
      list(name="Congo, Dem. Rep.", code="COD"),
      list(name="Ethiopia", code="ETH"),
      list(name="Guatemala", code="GTM"),
      list(name="Haiti", code="HTI"),
      list(name="Kenya", code="KEN"),
      list(name="Malawi", code="MWI"),
      list(name="Mali", code="MLI"),
      list(name="Mauritania", code="MRT"),
      list(name="Mozambique", code="MOZ"),
      list(name="Niger", code="NER"),
      list(name="Nigeria", code="NGA"),
      list(name="Somalia", code="SOM"),
      list(name="South Sudan", code="SSD"),
      list(name="Sudan", code="SDN"),
      list(name="Uganda", code="UGA"),
      list(name="Yemen, Rep.", code="YEM"),
      list(name="Zambia", code="ZMB"),
      list(name="Zimbabwe", code="ZWE")
    ),

    abstract = "Globally, more than 130 million people are estimated to be in food crisis. These humanitarian disasters are associated with severe impacts on livelihoods that can reverse years of development gains. The existing outlooks of crisis-affected populations rely on expert assessment of evidence and are limited in their temporal frequency and ability to look beyond several months. This paper presents a statistical forecasting approach to predict the outbreak of food crises with sufficient lead time for preventive action. Different use cases are explored related to possible alternative targeting policies and the levels at which finance is typically unlocked. The results indicate that, particularly at longer forecasting horizons, the statistical predictions compare favorably to expert-based outlooks. The paper concludes that statistical models demonstrate good ability to detect future outbreaks of food crises and that using statistical forecasting approaches may help increase lead time for action.",

    languages = list(list(name="English", code="EN")),

    reproducibility = list(
      statement = "The code and data needed to reproduce the analysis are openly available.",
      links = list(
        list(uri="http://fcv.ihsn.org/catalog/study/RR_WLD_2020_PFC_v01",
             description= "Source code"),
        list(uri="http://fcv.ihsn.org/catalog/study/WLD_2020_PFC_v01_M",
             description= "Dataset")
      )
    )

  )

)

# Publish the metadata in NADA
document_add(idno = id,
             metadata = example_1,
             repositoryid = "central",
             published = 1,
             thumbnail = thumb_file,
             overwrite = "yes")

# Provide a link to the document (as an external resource)
external_resources_add(
  title = "Predicting Food Crises",
  idno = id,
  dctype = "doc/anl",
  file_path = "http://hdl.handle.net/10986/34510",
  overwrite = "yes"
)
```
The document will now be available in the NADA catalog.
The Python equivalent of the R script presented above is as follows.

```python
# @@@ Script not tested yet

import pynada as nada
import inspect

dataset_id    = "WB_WPS9412"
repository_id = "central"
published     = 0
overwrite     = "yes"

document_description = {

    'title_statement': {
        'idno': dataset_id,
        'title': "Predicting Food Crises"
    },

    'date_published': "2020-09",

    'authors': [
        {'last_name': "Andrée",   'first_name': "Bo Pieter Johannes", 'affiliation': "World Bank"},
        {'last_name': "Chamorro", 'first_name': "Andres",             'affiliation': "World Bank"},
        {'last_name': "Kraay",    'first_name': "Aart",               'affiliation': "World Bank"},
        {'last_name': "Spencer",  'first_name': "Phoebe",             'affiliation': "World Bank"},
        {'last_name': "Wang",     'first_name': "Dieter",             'affiliation': "World Bank"}
    ],

    'journal': "World Bank Policy Research Working Paper No. 9412",

    'publisher': "World Bank",

    'ref_country': [
        {'name': "Afghanistan", 'code': "AFG"},
        {'name': "Burkina Faso", 'code': "BFA"},
        {'name': "Chad", 'code': "TCD"},
        {'name': "Congo, Dem. Rep.", 'code': "COD"},
        {'name': "Ethiopia", 'code': "ETH"},
        {'name': "Guatemala", 'code': "GTM"},
        {'name': "Haiti", 'code': "HTI"},
        {'name': "Kenya", 'code': "KEN"},
        {'name': "Malawi", 'code': "MWI"},
        {'name': "Mali", 'code': "MLI"},
        {'name': "Mauritania", 'code': "MRT"},
        {'name': "Mozambique", 'code': "MOZ"},
        {'name': "Niger", 'code': "NER"},
        {'name': "Nigeria", 'code': "NGA"},
        {'name': "Somalia", 'code': "SOM"},
        {'name': "South Sudan", 'code': "SSD"},
        {'name': "Sudan", 'code': "SDN"},
        {'name': "Uganda", 'code': "UGA"},
        {'name': "Yemen, Rep.", 'code': "YEM"},
        {'name': "Zambia", 'code': "ZMB"},
        {'name': "Zimbabwe", 'code': "ZWE"}
    ],

    'abstract': inspect.cleandoc("""\
        Globally, more than 130 million people are estimated to be in food crisis. These humanitarian disasters are associated with severe impacts on livelihoods that can reverse years of development gains.
        The existing outlooks of crisis-affected populations rely on expert assessment of evidence and are limited in their temporal frequency and ability to look beyond several months.
        This paper presents a statistical forecasting approach to predict the outbreak of food crises with sufficient lead time for preventive action.
        Different use cases are explored related to possible alternative targeting policies and the levels at which finance is typically unlocked.
        The results indicate that, particularly at longer forecasting horizons, the statistical predictions compare favorably to expert-based outlooks.
        The paper concludes that statistical models demonstrate good ability to detect future outbreaks of food crises and that using statistical forecasting approaches may help increase lead time for action.
    """),

    'languages': [
        {'name': "English", 'code': "EN"}
    ],

    'reproducibility': {
        'statement': "The code and data needed to reproduce the analysis are openly available.",
        'links': [
            {'uri': "http://fcv.ihsn.org/catalog/study/RR_WLD_2020_PFC_v01",
             'description': "Source code"},
            {'uri': "http://fcv.ihsn.org/catalog/study/WLD_2020_PFC_v01_M",
             'description': "Dataset"}
        ]
    }
}

files = [
    {'file_uri': "http://hdl.handle.net/10986/34510"}
]

nada.create_document_dataset(
    dataset_id           = dataset_id,
    repository_id        = repository_id,
    published            = published,
    overwrite            = overwrite,
    document_description = document_description,
    files                = files
)

# If you have the pdf file, generate a thumbnail from it.
pdf_file = "WB_PRWP_9412_Food_Crises.pdf"
thumbnail_path = nada.pdf_to_thumbnail(pdf_file, page_no=1)
nada.upload_thumbnail(dataset_id, thumbnail_path)
```
This example documents the World Bank World Development Report (WDR) 2019, titled "The Changing Nature of Work". The book is available in multiple languages. It also has related resources, like presentations and an Overview available in multiple languages, which we also document.
library(nadar)

# ----------------------------------------------------------------------------------
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_folder")

doc_file <- "2019-WDR-Report.pdf"

id       <- "WB_WDR2019"
meta_id  <- "WBDG_WB_WDR2019"

thumb_file <- gsub(".pdf", ".jpg", doc_file)
capture_pdf_cover(doc_file)   # Capture the cover page for use as thumbnail
+
+# Generate the metadata
+
example_2 = list(

  metadata_information = list(
    title = "The Changing Nature of Work",
    idno = meta_id,
    producers = list(
      list(name = "Development Data Group, Curation Team",
           abbr = "DECDG",
           affiliation = "World Bank")
    ),
    production_date = "2020-12-27"
  ),
+document_description = list(
+
+ title_statement = list(
+ idno = id,
+ title = "The Changing Nature of Work",
+ sub_title = "World Development Report 2019",
+ abbreviated_title = "WDR 2019"
+
+ ),
+authors = list(
+ list(first_name = "Rong", last_name = "Chen", affiliation = "World Bank"),
+ list(first_name = "Davida", last_name = "Connon", affiliation = "World Bank"),
+ list(first_name = "Ana P.", last_name = "Cusolito", affiliation = "World Bank"),
+ list(first_name = "Ugo", last_name = "Gentilini", affiliation = "World Bank"),
+ list(first_name = "Asif", last_name = "Islam", affiliation = "World Bank"),
+ list(first_name = "Shwetlena", last_name = "Sabarwal", affiliation = "World Bank"),
+ list(first_name = "Indhira", last_name = "Santos", affiliation = "World Bank"),
+ list(first_name = "Yucheng", last_name = "Zheng", affiliation = "World Bank")
+
+ ),
+ date_created = "2019",
+ date_published = "2019",
+
  identifiers = list(
+ list(type = "ISSN", value = "0163-5085"),
+ list(type = "ISBN softcover", value = "978-1-4648-1328-3"),
+ list(type = "ISBN hardcover", value = "978-1-4648-1342-9"),
+ list(type = "e-ISBN", value = "978-1-4648-1356-6"),
+ list(type = "DOI softcover", value = "10.1596/978-1-4648-1328-3"),
+ list(type = "DOI hardcover", value = "10.1596/978-1-4648-1342-9")
+
+ ),
+ type = "book",
+
+ description = "The World Development Report (WDR) 2019: The Changing Nature of Work studies how the nature of work is changing as a result of advances in technology today. Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Work is constantly reshaped by technological progress. Firms adopt new ways of production, markets expand, and societies evolve. Overall, technology brings opportunity, paving the way to create new jobs, increase productivity, and deliver effective public services. Firms can grow rapidly thanks to digital transformation, expanding their boundaries and reshaping traditional production patterns. The rise of the digital platform firm means that technological effects reach more people faster than ever before. Technology is changing the skills that employers seek. Workers need to be better at complex problem-solving, teamwork and adaptability. Digital technology is also changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. The Report analyzes these changes and considers how governments can best respond. Investing in human capital must be a priority for governments in order for workers to build the skills in demand in the labor market. In addition, governments need to enhance social protection and extend it to all people in society, irrespective of the terms on which they work. To fund these investments in human capital and social protection, the Report offers some suggestions as to how governments can mobilize additional revenues by increasing the tax base.",
+
+ toc_structured = list(
+ list(id = "00", name = "Overview"),
+ list(id = "01", parent_id = "00", name = "Changes in the nature of work"),
+ list(id = "02", parent_id = "00", name = "What can governments do?"),
+ list(id = "03", parent_id = "00", name = "Organization of this study"),
+ list(id = "10", name = "1. The changing nature of work"),
+ list(id = "11", parent_id = "10", name = "Technology generates jobs"),
+ list(id = "12", parent_id = "10", name = "How work is changing"),
+ list(id = "13", parent_id = "10", name = "A simple model of changing work"),
+ list(id = "20", name = "2. The changing nature of firms"),
+ list(id = "21", parent_id = "20", name = "Superstar firms"),
+ list(id = "22", parent_id = "20", name = "Competitive markets"),
+ list(id = "23", parent_id = "20", name = "Tax avoidance"),
+ list(id = "30", name = "3. Building human capital"),
+ list(id = "31", parent_id = "30", name = "Why governments should get involved"),
+ list(id = "32", parent_id = "30", name = "Why measurement helps"),
+ list(id = "33", parent_id = "30", name = "The human capital project"),
+ list(id = "40", name = "4. Lifelong learning"),
+ list(id = "41", parent_id = "40", name = "Learning in early childhood"),
+ list(id = "42", parent_id = "40", name = "Tertiary education"),
+ list(id = "43", parent_id = "40", name = "Adult learning outside the workplace"),
+ list(id = "50", name = "5. Returns to work"),
+ list(id = "51", parent_id = "50", name = "Informality"),
+ list(id = "52", parent_id = "50", name = "Working women"),
+ list(id = "53", parent_id = "50", name = "Working in agriculture"),
+ list(id = "60", name = "6. Strengthening social protection"),
+ list(id = "61", parent_id = "60", name = "Social assistance"),
+ list(id = "62", parent_id = "60", name = "Social insurance"),
+ list(id = "63", parent_id = "60", name = "Labor regulation"),
+ list(id = "70", name = "7. Ideas for social inclusion"),
+ list(id = "71", parent_id = "70", name = "A global 'New Deal'"),
+ list(id = "72", parent_id = "70", name = "Creating a new social contract"),
+ list(id = "73", parent_id = "70", name = "Financing social inclusion")
+
+ ),
+abstract = "Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Instead, technology is bringing opportunity, paving the way to create new jobs, increase productivity, and improve public service delivery. The nature of work is changing.
+ Firms can grow rapidly thanks to digital transformation, which blurs their boundaries and challenges traditional production patterns.
+The rise of the digital platform firm means that technological effects reach more people faster than ever before.
+Technology is changing the skills that employers seek. Workers need to be good at complex problem-solving, teamwork and adaptability.
+Technology is changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers.
+What can governments do? The 2019 WDR suggests three solutions:
+1 - Invest in human capital especially in disadvantaged groups and early childhood education to develop the new skills that are increasingly in demand in the labor market, such as high-order cognitive and sociobehavioral skills
+2 - Enhance social protection to ensure universal coverage and protection that does not fully depend on having formal wage employment
+3 - Increase revenue mobilization by upgrading taxation systems, where needed, to provide fiscal space to finance human capital development and social protection.",
+
+ ref_country = list(
+ list(name = "World", code = "WLD")
+
+ ),
+ spatial_coverage = "Global",
+
+ publication_frequency = "Annual",
+
+ languages = list(
+ list(name = "English", code = "EN"),
+ list(name = "Chinese", code = "ZH"),
+ list(name = "Arabic", code = "AR"),
+ list(name = "French", code = "FR"),
+ list(name = "Spanish", code = "ES"),
+ list(name = "Italian", code = "IT"),
+ list(name = "Bulgarian", code = "BG"),
+ list(name = "Russian", code = "RU"),
+ list(name = "Serbian", code = "SR")
+
+ ),
+ license = list(
+ list(name = "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)",
+ uri = "http://creativecommons.org/licenses/by/3.0/igo")
+
+ ),
+ bibliographic_citation = list(
    list(citation = "World Bank. 2019. World Development Report 2019: The Changing Nature of Work. Washington, DC: World Bank. doi:10.1596/978-1-4648-1328-3. License: Creative Commons Attribution CC BY 3.0 IGO")
+
+ ),
+ series = "World Development Report",
+
+ contributors = list(
+ list(first_name = "Simeon", last_name = "Djankov",
+ affiliation = "World Bank", role = "WDR Director"),
+ list(first_name = "Federica", last_name = "Saliola",
+ affiliation = "World Bank", role = "WDR Director"),
+ list(first_name = "David", last_name = "Sharrock",
+ affiliation = "World Bank", role = "Communications"),
+ list(first_name = "Consuelo Jurado", last_name = "Tan",
+ affiliation = "World Bank", role = "Program Assistant")
+
+ ),
+ publisher = "World Bank Publications",
+ publisher_address = "The World Bank Group, 1818 H Street NW, Washington, DC 20433, USA",
+
+ contacts = list(
+ list(name = "World Bank Publications", email = "pubrights@worldbank.org")
+
+ ),
+ topics = list(
+ list(name = "Labour And Employment - Employee Training",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(name = "Labour And Employment - Labour And Employment Policy",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(name = "Labour And Employment - Working Conditions",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(name = "Social Stratification And Groupings - Social And Occupational Mobility",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
+
+ ),
+ disciplines = list(
+ list(name = "Economics")
+
+ )
+
+ )
+
+ )
+# Publish the metadata in NADA
+
document_add(idno = id,
             metadata = example_2,
             repositoryid = "central",
             published = 1,
             thumbnail = thumb_file,
             overwrite = "yes")
+
+# Provide links to the document and related resources
+
+external_resources_add(
+title = "The Changing Nature of Work",
+ description = "Links to the PDF report in all available languages",
+ idno = id,
+ dctype = "doc/anl",
+ language = "English, Chinese, Arabic, French, Spanish, Italian, Bulgarian, Russian, Serbian",
+ file_path = "https://www.worldbank.org/en/publication/wdr2019",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "WORLD DEVELOPMENT REPORT 2019 - THE CHANGING NATURE OF WORK - Presentation (slide deck), English",
+ idno = id,
+ dctype = "doc/oth",
+ language = "English",
+ file_path = "http://pubdocs.worldbank.org/en/808261547222082195/WDR19-English-Presentation.pdf",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "INFORME SOBRE EL DESARROLLO MUNDIAL 2019 - LA NATURALEZA CAMBIANTE DEL TRABAJO - Presentation (slide deck), Spanish",
+ idno = id,
+ dctype = "doc/oth",
+ language = "Spanish",
+ file_path = "http://pubdocs.worldbank.org/en/942911547222108647/WDR19-Spanish-Presentation.pdf",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "RAPPORT SUR LE DÉVELOPPEMENT DANS LE MONDE 2019 - LE TRAVAIL EN MUTATION - Presentation (slide deck), French",
+ idno = id,
+ dctype = "doc/oth",
+ language = "French",
+ file_path = "http://pubdocs.worldbank.org/en/132831547222088914/WDR19-French-Presentation.pdf",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "RAPPORTO SULLO SVILUPPO MONDIALE 2019 - CAMBIAMENTI NEL MONDO DEL LAVORO - Presentation (slide deck), Italian",
+ idno = id,
+ dctype = "doc/oth",
+ language = "Italian",
+ file_path = "http://pubdocs.worldbank.org/en/842271547222095493/WDR19-Italian-Presentation.pdf",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "ДОКЛАД О МИРОВОМ РАЗВИТИИ 2019 - ИЗМЕНЕНИЕ ХАРАКТЕРА ТРУДА - Presentation (slide deck), Russian",
+ idno = id,
+ dctype = "doc/oth",
+ language = "Russian",
+ file_path = "http://pubdocs.worldbank.org/en/679061547222101914/WDR19-Russian-Presentation.pdf",
+ overwrite = "yes"
+
+ )
external_resources_add(
  title = "Jobs of the future require more investment in people - Press Release (October 11, 2018)",
  idno = id,
  dctype = "doc/oth",
  dcdate = "2018-10-11",
  language = "English",
  file_path = "https://www.worldbank.org/en/news/press-release/2018/10/11/jobs-of-the-future-require-more-investment-in-people",
  overwrite = "yes"
)
The document is now available in the NADA catalog.
+
+
+
The Python equivalent of the R script presented above is as follows.
+# @@@ Script not tested yet - must be edited to match the R script
+
+import pynada as nada
+import inspect
+
+= "DOC_001"
+ dataset_id
+= "central"
+ repository_id
+= 0
+ published
+= "yes"
+ overwrite
+= {
+ metadata_information 'title': "The Changing Nature of Work",
+ 'idno': "META_DOC_001",
+ 'producers': [
+
+ {'name': "Development Data Group, Curation Team",
+ 'abbr': "DECDG",
+ 'affiliation': "World Bank"
+
+ }
+ ],'production_date': "2020-12-27"
+
+ }
+= {
+ document_description 'title_statement': {
+ 'idno': dataset_id,
+ 'title': "The Changing Nature of Work",
+ 'sub-title': "World Development Report 2019",
+ 'abbreviated_title': "WDR2019"
+
+ },
+ 'type': "book",
+
+ 'description': inspect.cleandoc("""\
+
+The World Development Report (WDR) 2019: The Changing Nature of Work studies how the nature of work is changing as a result of advances in technology today. Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Work is constantly reshaped by technological progress. Firms adopt new ways of production, markets expand, and societies evolve. Overall, technology brings opportunity, paving the way to create new jobs, increase productivity, and deliver effective public services. Firms can grow rapidly thanks to digital transformation, expanding their boundaries and reshaping traditional production patterns. The rise of the digital platform firm means that technological effects reach more people faster than ever before. Technology is changing the skills that employers seek. Workers need to be better at complex problem-solving, teamwork and adaptability. Digital technology is also changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. The Report analyzes these changes and considers how governments can best respond. Investing in human capital must be a priority for governments in order for workers to build the skills in demand in the labor market. In addition, governments need to enhance social protection and extend it to all people in society, irrespective of the terms on which they work. To fund these investments in human capital and social protection, the Report offers some suggestions as to how governments can mobilize additional revenues by increasing the tax base.
+
+ """),
+
    'toc_structured': [
        {'id': "00", 'name': "Overview"},
        {'id': "01", 'parent_id': "00", 'name': "Changes in the nature of work"},
        {'id': "02", 'parent_id': "00", 'name': "What can governments do?"},
        {'id': "03", 'parent_id': "00", 'name': "Organization of this study"},
        {'id': "10", 'name': "1. The changing nature of work"},
        {'id': "11", 'parent_id': "10", 'name': "Technology generates jobs"},
        {'id': "12", 'parent_id': "10", 'name': "How work is changing"},
        {'id': "13", 'parent_id': "10", 'name': "A simple model of changing work"},
        {'id': "20", 'name': "2. The changing nature of firms"},
        {'id': "21", 'parent_id': "20", 'name': "Superstar firms"},
        {'id': "22", 'parent_id': "20", 'name': "Competitive markets"},
        {'id': "23", 'parent_id': "20", 'name': "Tax avoidance"},
        {'id': "30", 'name': "3. Building human capital"},
        {'id': "31", 'parent_id': "30", 'name': "Why governments should get involved"},
        {'id': "32", 'parent_id': "30", 'name': "Why measurement helps"},
        {'id': "33", 'parent_id': "30", 'name': "The human capital project"},
        {'id': "40", 'name': "4. Lifelong learning"},
        {'id': "41", 'parent_id': "40", 'name': "Learning in early childhood"},
        {'id': "42", 'parent_id': "40", 'name': "Tertiary education"},
        {'id': "43", 'parent_id': "40", 'name': "Adult learning outside the workplace"},
        {'id': "50", 'name': "5. Returns to work"},
        {'id': "51", 'parent_id': "50", 'name': "Informality"},
        {'id': "52", 'parent_id': "50", 'name': "Working women"},
        {'id': "53", 'parent_id': "50", 'name': "Working in agriculture"},
        {'id': "60", 'name': "6. Strengthening social protection"},
        {'id': "61", 'parent_id': "60", 'name': "Social assistance"},
        {'id': "62", 'parent_id': "60", 'name': "Social insurance"},
        {'id': "63", 'parent_id': "60", 'name': "Labor regulation"},
        {'id': "70", 'name': "7. Ideas for social inclusion"},
        {'id': "71", 'parent_id': "70", 'name': "A global 'New Deal'"},
        {'id': "72", 'parent_id': "70", 'name': "Creating a new social contract"},
        {'id': "73", 'parent_id': "70", 'name': "Financing social inclusion"}
    ],
+ 'abstract': inspect.cleandoc("""\
+
+Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Instead, technology is bringing opportunity, paving the way to create new jobs, increase productivity, and improve public service delivery.
+The nature of work is changing.
+Firms can grow rapidly thanks to digital transformation, which blurs their boundaries and challenges traditional production patterns.
+The rise of the digital platform firm means that technological effects reach more people faster than ever before.
+Technology is changing the skills that employers seek. Workers need to be good at complex problem-solving, teamwork and adaptability.
+Technology is changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers.
+What can governments do?
+The 2019 WDR suggests three solutions:
+1 - Invest in human capital especially in disadvantaged groups and early childhood education to develop the new skills that are increasingly in demand in the labor market, such as high-order cognitive and sociobehavioral skills
+2 - Enhance social protection to ensure universal coverage and protection that does not fully depend on having formal wage employment
+3 - Increase revenue mobilization by upgrading taxation systems, where needed, to provide fiscal space to finance human capital development and social protection.
+
+ """),
+
    'ref_country': [
        {'name': "World", 'code': "WLD"}
    ],
    'spatial_coverage': "Global",

    'date_created': "2019",
    'date_published': "2019",

    'identifiers': [
        {'type': "ISSN", 'value': "0163-5085"},
        {'type': "ISBN softcover", 'value': "978-1-4648-1328-3"},
        {'type': "ISBN hardcover", 'value': "978-1-4648-1342-9"},
        {'type': "e-ISBN", 'value': "978-1-4648-1356-6"},
        {'type': "DOI softcover", 'value': "10.1596/978-1-4648-1328-3"},
        {'type': "DOI hardcover", 'value': "10.1596/978-1-4648-1342-9"}
    ],
    'publication_frequency': "Annual",

    'languages': [
        {'name': "English", 'code': "EN"},
        {'name': "Chinese", 'code': "ZH"},
        {'name': "Arabic", 'code': "AR"},
        {'name': "French", 'code': "FR"},
        {'name': "Spanish", 'code': "ES"},
        {'name': "Italian", 'code': "IT"},
        {'name': "Bulgarian", 'code': "BG"},
        {'name': "Russian", 'code': "RU"},
        {'name': "Serbian", 'code': "SR"}
    ],
    'license': [
        {'name': "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)",
         'uri': "http://creativecommons.org/licenses/by/3.0/igo"}
    ],
    'authors': [
        {'first_name': "Rong", 'last_name': "Chen", 'affiliation': "World Bank"},
        {'first_name': "Davida", 'last_name': "Connon", 'affiliation': "World Bank"},
        {'first_name': "Ana P.", 'last_name': "Cusolito", 'affiliation': "World Bank"},
        {'first_name': "Ugo", 'last_name': "Gentilini", 'affiliation': "World Bank"},
        {'first_name': "Asif", 'last_name': "Islam", 'affiliation': "World Bank"},
        {'first_name': "Shwetlena", 'last_name': "Sabarwal", 'affiliation': "World Bank"},
        {'first_name': "Indhira", 'last_name': "Santos", 'affiliation': "World Bank"},
        {'first_name': "Yucheng", 'last_name': "Zheng", 'affiliation': "World Bank"}
    ],
    'contributors': [
        {'first_name': "Simeon", 'last_name': "Djankov", 'affiliation': "World Bank", 'role': "WDR Director"},
        {'first_name': "Federica", 'last_name': "Saliola", 'affiliation': "World Bank", 'role': "WDR Director"},
        {'first_name': "David", 'last_name': "Sharrock", 'affiliation': "World Bank", 'role': "Communications"},
        {'first_name': "Consuelo Jurado", 'last_name': "Tan", 'affiliation': "World Bank", 'role': "Program Assistant"}
    ],
    'topics': [
        {'name': "LabourAndEmployment.EmployeeTraining",
         'vocabulary': "CESSDA Topic Classification",
         'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"},
        {'name': "LabourAndEmployment.LabourAndEmploymentPolicy",
         'vocabulary': "CESSDA Topic Classification",
         'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"},
        {'name': "LabourAndEmployment.WorkingConditions",
         'vocabulary': "CESSDA Topic Classification",
         'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"},
        {'name': "SocialStratificationAndGroupings.SocialAndOccupationalMobility",
         'vocabulary': "CESSDA Topic Classification",
         'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"}
    ],
    'disciplines': [
        {'name': "Economics"}
    ]
}
In this example we take a different use case. We assume that a list of publications is available as a CSV file. Each row in this file describes one publication, with the following columns containing the metadata (with no missing information for the required elements):
+The R (or Python) script reads the CSV file. The listed documents are downloaded (if not previously done), and the cover page of each document is captured and saved as a JPG file to be used as a thumbnail in the catalog. Metadata are formatted to comply with the document schema, then published. The documents are not uploaded in the catalog, but links to the originating catalog are provided. There is no limit to the number of documents that could be included in such a batch process. If a repository of documents is available with metadata available in a structured format (in a CSV file as in the example, from an API, or from another source), the migration of the documents to a NADA catalog can be fully automated using a script similar to the one shown in the example. Note that such a script could also include some processes of metadata augmentation (e.g., submitting each document to a topic model to extract and store the topic composition of the document).
+
+
+
library(nadar)
library(stringr)
library(rlist)
library(countrycode)   # Will be used to automatically add ISO country codes

# ----------------------------------------------------------------------------------
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_folder")

# Read the CSV file containing the information (metadata) on the 5 documents
doc_list <- read.csv("my_list_of_documents.csv", stringsAsFactors = FALSE)
+# Generate the metadata for each document in the list, and publish in NADA
+
for(i in 1:nrow(doc_list)) {

  # Download the file if not already done
  url      <- doc_list$URL_pdf[i]
  pdf_file <- basename(doc_list$URL_pdf[i])
  if(!file.exists(pdf_file)) download.file(url, pdf_file, mode = "wb")

  # Map the available metadata elements to the schema
  id       <- doc_list$ID[i]
  title    <- doc_list$title[i]
  date     <- as.character(doc_list$date_published[i])
  abstract <- doc_list$abstract[i]
  type     <- doc_list$type[i]

  # Split the authors' list and generate a list compliant with the schema
  list_authors <- doc_list$authors[i]
  list_authors <- str_split(list_authors, ";")
  authors = list()
  for(n in 1:length(list_authors[[1]])) {
    author = trimws(list_authors[[1]][n])
    if(grepl(",", author)) {   # If we have last name and first name
      last_first = str_split(author, ",")
      a_l = list(last_name = trimws(last_first[[1]][1]),
                 first_name = trimws(last_first[[1]][2]))
    } else {                   # E.g., when author is an organization
      a_l = list(last_name = author, first_name = "")
    }
    authors = list.append(authors, a_l)
  }

  # Split the country list and generate a list compliant with the schema
  list_countries <- doc_list$country[i]
  list_countries <- str_split(list_countries, ";")
  countries = list()
  for(n in 1:length(list_countries[[1]])) {
    country = trimws(list_countries[[1]][n])
    if(country == "World") {
      c_code = "WLD"
    } else {
      c_code = countrycode(country, origin = 'country.name', destination = 'iso3c')
    }
    if(is.na(c_code)) c_code = ""
    c_l = list(name = country, code = c_code)
    countries = list.append(countries, c_l)
  }

  # Capture the cover page as JPG, and generate the full document metadata
  thumb <- gsub(".pdf", ".jpg", pdf_file)
  capture_pdf_cover(pdf_file)   # To be used as thumbnail

  this_document <- list(
    document_description = list(
      title_statement = list(idno = id, title = title),
      date_published = date,
      authors = authors,
      abstract = abstract,
      ref_country = countries
    )
  )

  # Publish the metadata in NADA
  document_add(idno = id,
               published = 1,
               overwrite = "yes",
               metadata = this_document,
               thumbnail = thumb)

  # Add a link to the document
  external_resources_add(
    title = as.character(this_document$document_description$title_statement[1]),
    idno = id,
    dctype = "doc/anl",
    file_path = url,
    overwrite = "yes"
  )
}
# @@@ Script not tested yet - the variables desc, pages, and tags must still be
# mapped from the corresponding CSV columns

import pynada as nada
import pandas as pd
import urllib.request
import os.path

# Set API key and catalog URL
nada.set_api_key("my_api_key")
nada.set_api_url("http://my_catalog.ihsn.org/index.php/api/")

# Read the file containing information on the 5 documents
doc_list = pd.read_csv("my_list_of_documents.csv")

# Generate the metadata and publish in NADA catalog
for index, doc in doc_list.iterrows():

    # Download the file if not already done
    url = doc['URL']
    pdf_file = os.path.basename(url)
    if not os.path.exists(pdf_file):
        urllib.request.urlretrieve(url, pdf_file)

    # Map/generate metadata fields
    id = doc['id']
    title = f"{doc['title']} - Census {doc['censusyear']}"
    author = doc['authors']
    contrib = doc['contributor']
    date = doc['date_published']
    avail = doc['date_available']
    abstract = doc['description']
    publisher = doc['publisher']
    spatial = doc['state']
    language = [{'name': "English", 'code': "ENG"}]

    # Document the file, and publish in NADA
    idno = id
    repository_id = "central"
    published = 1
    overwrite = "yes"
    document_description = {
        'title_statement': {
            'idno': id,
            'title': title
        },
        'date_published': date,
        'date_available': avail,
        'authors': [
            {'last_name': author}
        ],
        'contributors': [
            {'last_name': contrib}
        ],
        'publisher': publisher,
        'abstract': abstract,
        'description': desc,
        'ref_country': [
            {'name': "India", 'code': "IND"}
        ],
        'languages': language,
        'pages': pages,
        'rights': "Office of the Registrar General, India (ORGI)"
    }
    files = [
        {'file_uri': pdf_file, 'format': "Adobe Acrobat PDF"}
    ]

    nada.create_document_dataset(
        dataset_id = idno,
        repository_id = repository_id,
        published = published,
        overwrite = overwrite,
        document_description = document_description,
        tags = tags,
        files = files
    )

    # Generate a thumbnail from the PDF file
    thumbnail_path = nada.pdf_to_thumbnail(pdf_file, page_no=1)
    nada.upload_thumbnail(idno, thumbnail_path)
When surveys or censuses are conducted, or when administrative data are recorded, information is collected on each unit of observation. The unit of observation can be a person, a household, a firm, an agricultural holding, a facility, or other. Microdata are the data files resulting from these data collection activities, which contain the unit-level information (as opposed to aggregated data in the form of counts, means, or other). Information on each unit is stored in variables, which can be of different types (e.g. numeric or alphanumeric, discrete or continuous). These variables may contain data reported by the respondent (e.g., the marital status of a person), obtained by observation or measurement (e.g., the GPS location of a dwelling), or generated by calculation, recoding or derivation (e.g., the sample weight in a survey).
For efficiency reasons, variables are often stored in numeric format (i.e., as coded values), even when they contain qualitative information. For example, the sex of a respondent may be stored in a variable named ‘Q_01’ with values 1, 2 and 9, where 1 represents “male”, 2 represents “female”, and 9 represents “unreported”. Microdata must therefore be provided, at a minimum, with a data dictionary containing the variable and value labels and, for derived variables, information on the derivation process. But many other features of a micro-dataset should also be described, such as the objectives and methodology of data collection (including a description of the sampling design for sample surveys), the period of data collection, the identification of the primary investigator and other contributors, the scope and geographic coverage of the data, and much more. This information makes the data usable and discoverable.
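The coded-variable convention described above can be sketched in a few lines of Python (the variable name, labels, and helper function are illustrative, not from any specific survey): a data dictionary entry pairs the stored codes with their labels, which lets software display or export labelled values instead of raw codes.

```python
# A minimal data dictionary entry for a coded variable (hypothetical example)
variable = {
    "name": "Q_01",
    "label": "Sex of respondent",
    "value_labels": {1: "male", 2: "female", 9: "unreported"},
}

def decode(values, var):
    """Replace stored codes with their labels, keeping invalid codes visible."""
    return [var["value_labels"].get(v, f"invalid code {v}") for v in values]

print(decode([1, 2, 9, 2], variable))
# → ['male', 'female', 'unreported', 'female']
```

A lookup that flags out-of-range codes, as here, also doubles as a basic consistency check on the data file.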
+The DDI metadata standard provides a structured and comprehensive list of hundreds of elements and attributes which may be used to document microdata. It is unlikely that any one study would ever require using them all, but this list provides a convenient solution to foster completeness of the information, and to generate documentation that will meet the needs of users.
The Data Documentation Initiative (DDI) metadata standard originated in the Inter-university Consortium for Political and Social Research (ICPSR), a membership-based organization with more than 500 member colleges and universities worldwide. The DDI is now the project of an alliance of North American and European institutions, whose members include many of the largest data producers and data archives in the world. The DDI standard is used by a large community of data archivists, including data librarians in academia, data managers in national statistical agencies and other official data producing agencies, and international organizations. The standard has two branches, which serve different purposes and audiences: the DDI-Codebook (version 2.x) and the DDI-Lifecycle (version 3.x). For the purpose of data archiving and cataloguing, the schema we recommend in this Guide is the DDI-Codebook. We use a slightly simplified version of version 2.5 of the standard, to which we add a few elements, including the `tags` element common to all schemas described in the Guide. A mapping between the elements included in our schema and the DDI-Codebook metadata tags is provided in Annex 2.
The DDI standard is published under the terms of the [GNU General Public License](http://www.gnu.org/licenses) (version 3 or later).
+The DDI Alliance developed the DDI-Codebook for organizing the content, presentation, transfer, and preservation of metadata in the social and behavioral sciences. It enables documenting microdata files in a simultaneously flexible and rigorous way. The DDI-Codebook aims to provide a straightforward means of recording and communicating all the salient characteristics of a micro-dataset.
The DDI-Codebook is designed to encompass the kinds of data resulting from surveys, censuses, administrative records, experiments, direct observation, and other systematic methodologies for generating empirical measurements. The unit of observation can be individual persons, households, families, business establishments, transactions, countries, or other subjects of scientific interest.
The DDI Alliance publishes the DDI-Codebook as an XML schema. We present in this Guide a JSON implementation of the schema, which is used in our R package NADAR and Python library PyNADA. The NADA cataloguing application works with both the XML and the JSON versions. A DDI-compliant metadata file can be converted from JSON to XML and from XML to JSON.
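As an illustration of this interoperability, a DDI-style JSON structure can be mapped to XML using Python's standard library. This is a minimal sketch, not the converter shipped with NADAR or PyNADA; the element and tag names are simplified assumptions:

```python
import xml.etree.ElementTree as ET

def json_to_xml(tag, obj):
    """Recursively convert a nested dict/list structure into an XML element.
    Lists are wrapped in generic <item> tags here; the real DDI XML repeats
    the parent tag name instead (simplified for illustration)."""
    element = ET.Element(tag)
    if isinstance(obj, dict):
        for key, value in obj.items():
            element.append(json_to_xml(key, value))
    elif isinstance(obj, list):
        for entry in obj:
            element.append(json_to_xml("item", entry))
    else:
        element.text = str(obj)
    return element

# A minimal DDI-style fragment (illustrative, not a complete codeBook)
doc = {"doc_desc": {"title": "Albania LSMS 2012", "prod_date": "2021-02-16"}}
xml_string = ET.tostring(json_to_xml("codeBook", doc), encoding="unicode")
```

The reverse direction (XML to JSON) can be sketched symmetrically by walking the element tree and rebuilding nested dictionaries.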
As indicated on the DDI Alliance website, DDI-Lifecycle is “designed to document and manage data across the entire life cycle, from conceptualization to data publication, analysis and beyond. It encompasses all of the DDI-Codebook specification and extends it. Based on XML Schemas, DDI-Lifecycle is modular and extensible.” DDI-Lifecycle can be used to “populate variable and question banks to explore available data and question structures for reuse in new surveys”. As this is not our objective, and because using the DDI-Lifecycle adds significant complexity, we do not make use of it; this chapter only covers the DDI-Codebook.
The DDI is a comprehensive schema that provides metadata elements to document a study (e.g., a survey or an administrative dataset), the related data files, and the variables they contain. A separate schema is used to document the related resources (questionnaires, reports, and others); see Chapter 13.
Some datasets may contain hundreds or even thousands of variables. For each variable, the DDI can include not only the variable name, label, and description, but also summary statistics like the count of valid and missing observations, weighted and unweighted frequencies, means, and others. Generating a DDI file manually, in particular the variable-level metadata, can be a tedious and time-consuming task. But variable names, summary statistics, and (when available) variable and value labels can be extracted directly from the data files. User-friendly solutions (specialized metadata editors) are available to automate a large part of this work. DDI can also be generated programmatically using R or Python. Section 5.5 provides examples of the use of specialized DDI metadata editors and programming languages to generate DDI-compliant metadata.
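As a minimal sketch of this programmatic approach (assuming a dataset already loaded as a list of row dictionaries; real workflows would read Stata, SPSS, or CSV files), variable-level metadata such as valid/missing counts and means can be computed in a few lines of Python. The field names used here are illustrative, not the exact DDI tags:

```python
import statistics

def describe_variables(records, numeric_vars):
    """Build DDI-style variable-level metadata (valid/missing counts, mean)
    from a list of row dictionaries, where None marks a missing value."""
    variables = []
    for name in numeric_vars:
        values = [row.get(name) for row in records]
        valid = [v for v in values if v is not None]
        variables.append({
            "name": name,
            "var_valid": len(valid),                 # count of valid observations
            "var_invalid": len(values) - len(valid),  # count of missing observations
            "mean": round(statistics.mean(valid), 2) if valid else None,
        })
    return variables

data = [{"age": 34, "weight": 62.0},
        {"age": 29, "weight": None},
        {"age": None, "weight": 70.0}]
meta = describe_variables(data, ["age", "weight"])
```

The resulting list can then be placed in the `variables` section of the JSON metadata described below.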
Documenting microdata is more complex than documenting publications or other types of data like tables or indicators. The production of microdata often involves experts in survey design, sampling, data processing, and analysis. Generating the metadata should thus be a collective responsibility and will ideally be done in real time (“document as you survey”). Data documentation should be implemented during the whole lifecycle of data production, not as an ex post task. This is in line with what the Generic Statistical Business Process Model (GSBPM) recommends: “Good metadata management is essential for the efficient operation of statistical business processes. Metadata are present in every phase, either created, updated or carried forward from a previous phase or reused from another business process. In the context of this model, the emphasis of the overarching process of metadata management is on the creation/revision, updating, use and archiving of statistical metadata, though metadata on the different sub-processes themselves are also of interest, including as an input for quality management. The key challenge is to ensure that these metadata are captured as early as possible, and stored and transferred from phase to phase alongside the data they refer to.” Too often, microdata are documented after completion of the data collection, sometimes by a team that was not directly involved in the production of the data. In such cases, some information may not have been captured and will be difficult to find or reconstruct.
**Suggestions and recommendations to data curators**

- The `keywords` metadata element provides a flexible solution to improve the discoverability of data. For example, a survey that collects data on children's age, weight, and height will be relevant for measuring malnutrition and for generating indicators such as the prevalence of stunting, wasting, overweight, and underweight. The variable descriptions alone would not make the data discoverable in keyword-based search engines, hence the importance of adding relevant terms and phrases in the `keywords` section.
- The schema includes an optional section – `variable_groups` – to organize variables differently, for example thematically. These variable groupings are virtual, in the sense that they do not impact the way variables are stored. Not all variables have to be mapped to such groups, and a same variable can belong to more than one group. This option provides the possibility to organize the variables based on a thematic or topical classification. Machine learning (AI) tools make it possible to automate the process of mapping variables to a pre-defined list of groups (each of them described by a label and a short description). By doing this, and by generating embeddings at the group level, it becomes possible to add semantic search and to implement a recommender system that applies to microdata.

The DDI-Codebook is a comprehensive, structured list of elements to be used to document microdata of any source. The standard contains five main sections:
- The document description (`doc_desc`), with elements used to describe the metadata (not the data); the term “document” refers here to the XML (or JSON) file that contains the metadata.
- The study description (`study_desc`), which contains the elements used to describe the study itself (the survey, the administrative process, or the other activity that resulted in the production of the microdata). This section will contain information on the primary investigator, scope and coverage of the data, sampling, etc.
- The data files description (`data_files`), which provides elements to document each data file that composes the dataset (this is thus a repeatable block of elements).
- The variables description (`variables`), with elements used to describe each variable contained in the data files, including the variable names, the variable and value labels, summary statistics for each variable, interviewers’ instructions, descriptions of recoding or derivation procedures, and more.
- The variable groups (`variable_groups`), an optional section that allows organizing variables by thematic or other groups, independently from the data file they belong to. Variable groups are “virtual”; the grouping of variables does not affect the data files.

The other sections in the schema are not part of the DDI-Codebook itself. Some are used for catalog administration purposes when the NADA cataloguing application is used (`repositoryid`, `access_policy`, `published`, `overwrite`, and `provenance`).
- `repositoryid` identifies the data catalog/collection in which the metadata will be published.
- `access_policy` indicates the access policy to be applied to the microdata (open access, public use files, licensed access, no access, etc.)
- `published`: Indicates whether the metadata will be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.
- `overwrite`: Indicates whether metadata that may have been previously uploaded for the same dataset can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. Note that a dataset will be considered the same as a previously uploaded one if the identifier provided in the metadata element `study_desc > title_statement > idno` is the same.
- `provenance` is used to store information on the source and time of harvesting, for metadata that were extracted automatically from external data catalogs.

Other sections are provided to allow additional metadata to be collected and stored, including metadata generated by machine learning models (`tags`, `lda_topics`, `embeddings`, and `additional`). The `tags` section is common to all schemas (with the exception of the external resources schema) and provides a flexible solution to generate customized facets in data catalogs. The `additional` section allows data curators to supplement the DDI standard with their own metadata elements, without breaking compliance with the DDI.
```json
{
  "repositoryid": "string",
  "access_policy": "data_na",
  "published": 0,
  "overwrite": "no",
  "doc_desc": {},
  "study_desc": {},
  "data_files": [],
  "variables": [],
  "variable_groups": [],
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": {}
}
```
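A sketch, in Python, of how a catalog might apply the defaults described above (`published = 0`, `overwrite = "no"`) and guard against keys that are not part of this skeleton; the helper name is hypothetical, not part of NADA or PyNADA:

```python
DEFAULTS = {"published": 0, "overwrite": "no"}
TOP_LEVEL_KEYS = {
    "repositoryid", "access_policy", "published", "overwrite",
    "doc_desc", "study_desc", "data_files", "variables",
    "variable_groups", "provenance", "tags", "lda_topics",
    "embeddings", "additional",
}

def prepare_metadata(meta):
    """Apply catalog defaults and reject keys outside the schema skeleton."""
    unknown = set(meta) - TOP_LEVEL_KEYS
    if unknown:
        raise ValueError(f"Unknown top-level keys: {sorted(unknown)}")
    return {**DEFAULTS, **meta}

m = prepare_metadata({"study_desc": {"title_statement": {"idno": "MDA_UGA_DHS_2005_v01"}}})
```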
The DDI-Codebook also provides a solution to describe OLAP cubes, which we do not make use of, as our purpose is to use the standard to document and catalog datasets, not to manage data.
Each metadata element in the DDI standard has a name. In our JSON version of the standard, we do not always use the exact same names; we adapted some of them for clarity. For example, we renamed the DDI element `titlStmt` as `title_statement`. The mapping between the DDI-Codebook 2.5 standard and the elements in our schema is provided in appendix. JSON files created using our adapted version of the DDI can be exported as a DDI 2.5 compliant and validated XML file using R or Python scripts provided in the NADAR package and PyNADA library.
`doc_desc` [Optional ; Not repeatable]

Documenting a study using the DDI-Codebook standard consists of generating a metadata file in XML or JSON format. This file is what is referred to as the metadata document. The `doc_desc` or document description is thus a description of the metadata file, and consists of bibliographic information describing the DDI-compliant document as a whole. As a same dataset can possibly be documented by more than one organization, and because metadata can be automatically harvested by on-line catalogs, traceability of the metadata is important. This section, which only contains five main elements, should be as complete as possible, and at least contain information in the `producers` and `prod_date` elements.
```json
"doc_desc": {
  "title": "string",
  "idno": "string",
  "producers": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "role": "string"
    }
  ],
  "prod_date": "string",
  "version_statement": {
    "version": "string",
    "version_date": "string",
    "version_resp": "string",
    "version_notes": "string"
  }
}
```
`title` [Optional ; Not repeatable ; String]

The title of the metadata document (which may be the title of the study itself). The metadata document is the DDI metadata file (XML or JSON) that is being generated. The document title should mention the geographic scope of the data collection as well as the time period covered. For example: “DDI 2.5: Albania Living Standards Study 2012”.
`idno` [Optional ; Not repeatable ; String]

A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the unique identifier `idno` found in the section `study_desc > title_statement`, although it is good practice to generate identifiers that establish a clear connection between the two. The document ID could also include the metadata document version identifier. For example, if the primary identifier of the study is “ALB_LSMS_2012”, the document ID could be “IHSN_DDI_v01_ALB_LSMS_2012” if the DDI metadata are produced by the IHSN. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format. The identifier should not contain blank spaces.
`producers` [Optional ; Repeatable]

The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a “Recommended” element; for catalog administration purposes, information on the producer and on the date of metadata production is useful.
- `name` [Optional ; Not repeatable ; String]: The name of the person or organization that produced the metadata.
- `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the organization mentioned in `name`.
- `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the person or organization mentioned in `name`.
- `role` [Optional ; Not repeatable ; String]: The specific role of the person or organization mentioned in `name` in the production of the DDI metadata.

`prod_date` [Optional ; Not repeatable ; String]

The date the DDI metadata document was produced (not the date it was distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a “Recommended” element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.
`version_statement` [Optional ; Not repeatable]

A version statement for the metadata (DDI) document. Documenting a dataset is not a trivial exercise. It may happen that, having identified errors or gaps in a DDI document, or after receiving suggestions for improvement or additional input, the DDI metadata are modified. The `version_statement` describes the version of the metadata document. It is good practice to provide a version number and date, and information on what distinguishes the current version from the previous one(s).
- `version` [Optional ; Not repeatable ; String]: The version number and label of the metadata document.
- `version_date` [Optional ; Not repeatable ; String]: The date of this version of the metadata document, which may differ from the date in the `prod_date` element. It is recommended to enter the date in the ISO 8601 date format (YYYY-MM-DD or YYYY-MM or YYYY).
- `version_resp` [Optional ; Not repeatable ; String]: The person or organization responsible for this version of the metadata document.
- `version_notes` [Optional ; Not repeatable ; String]: Additional information on this version of the metadata document.

```r
my_ddi <- list(
  doc_desc = list(
    title = "Albania Living Standards Study 2012",
    idno = "DDI_WB_ALB_2012_LSMS_v02",
    producers = list(
      list(name = "Development Data Group",
           abbr = "DECDG",
           affiliation = "World Bank",
           role = "Production of the DDI-compliant metadata")
    ),
    prod_date = "2021-02-16",
    version_statement = list(
      version = "Version 2.0",
      version_date = "2021-02-16",
      version_resp = "OD",
      version_notes = "Version identical to Version 1.0 except for the Data Appraisal section which was added."
    )
  ),
  # ... (other sections of the DDI)
)
```
`study_desc` [Required ; Not repeatable]

The `study_desc` or study description consists of information about the data collection or study that the DDI-compliant documentation file describes. This section includes study-level information such as scope and coverage, objectives, producers, sampling, data collection dates and methods, etc.
```json
"study_desc": {
  "title_statement": {},
  "authoring_entity": [],
  "oth_id": [],
  "production_statement": {},
  "distribution_statement": {},
  "series_statement": {},
  "version_statement": {},
  "bib_citation": "string",
  "bib_citation_format": "string",
  "holdings": [],
  "study_notes": "string",
  "study_authorization": {},
  "study_info": {},
  "study_development": {},
  "method": {},
  "data_access": {}
}
```
`title_statement` [Required ; Not repeatable]

The title statement for the study.
```json
"title_statement": {
  "idno": "string",
  "identifiers": [
    {
      "type": "string",
      "identifier": "string"
    }
  ],
  "title": "string",
  "sub_title": "string",
  "alternate_title": "string",
  "translated_title": "string"
}
```
`idno` [Required ; Not repeatable ; String]

`idno` is the primary identifier of the dataset. It is a unique identification number used to identify the study (survey, census, or other). A unique identifier is required for cataloguing purposes, so this element is declared as “Required”. The identifier will allow users to cite the dataset properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the study. Alternatively, the `idno` can be constructed by an organization using a consistent scheme. The scheme could for example be “catalog-country-study-year-version”, where catalog is the abbreviation of the catalog or producing agency, country is the 3-letter ISO country code, study is the study acronym, year is the reference year (or the year the study started), and version is a version number. Using that scheme, the Uganda 2005 Demographic and Health Survey would have the following `idno` (where “MDA” stands for “My Data Archive”): MDA_UGA_DHS_2005_v01. Note that the schema allows you to provide more than one identifier for a same study (in the element `identifiers`); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The identifier should not contain blank spaces.
`identifiers` [Optional ; Repeatable]

This repeatable element is used to enter identifiers (IDs) other than the `idno` entered in the title statement. It can for example be a Digital Object Identifier (DOI). The `idno` can be repeated here (the `idno` element does not provide a `type` parameter; if a DOI or another standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`).
- `type` [Optional ; Not repeatable ; String]: The type of identifier (e.g., “DOI”).
- `identifier` [Required ; Not repeatable ; String]: The identifier itself.

`title` [Required ; Not repeatable ; String]

This element is “Required”. Provide here the full authoritative title for the study. Make sure to use a unique name for each distinct study. The title should indicate the time period covered. For example, in a country conducting monthly labor force surveys, the title of a study would be, for example, “Labor Force Survey, December 2020”. When a survey spans two years (for example, a household income and expenditure survey conducted over a period of 12 months from June 2020 to June 2021), the range of years can be provided in the title, for example “Household Income and Expenditure Survey 2020-2021”. The title of a survey should be its official name as stated on the survey questionnaire or in other study documents (report, etc.). Including the country name in the title is optional (another metadata element is used to identify the reference countries). Pay attention to the consistent use of capitalization in the title.
`sub_title` [Optional ; Not repeatable ; String]

The `sub_title` is a secondary title used to amplify or state certain limitations on the main title, for example to add information usually associated with a sequential qualifier for a survey. For example, we may have “[country] Universal Primary Education Project, Impact Evaluation Survey 2007” as `title`, and “Baseline dataset” as `sub_title`. Note that this information could also be entered as a title with no subtitle: “[country] Universal Primary Education Project, Impact Evaluation Survey 2007 - Baseline dataset”.
`alternate_title` [Optional ; Not repeatable ; String]

The `alternate_title` will typically be used to capture the abbreviation of the survey title. Many surveys are known and referred to by their acronym. The survey reference year(s) may be included. For example, the “Demographic and Health Survey 2012” would be abbreviated as “DHS 2012”, and the “Living Standards Measurement Study 2020-2021” as “LSMS 2020-2021”.
`translated_title` [Optional ; Not repeatable ; String]

In countries with more than one official language, a translation of the title may be provided here. Likewise, the translated title may simply be a translation into English from a country’s own language. Special characters, such as accents, other stress marks, or different alphabets, should be properly displayed.
```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    title_statement = list(
      idno = "ML_ALB_2012_LSMS_v02",
      identifiers = list(
        list(type = "DOI", identifier = "XXX-XXXX-XXX")
      ),
      title = "Living Standards Study 2012",
      alternate_title = "LSMS 2012",
      translated_title = "Anketa e Matjes së Nivelit të Jetesës (AMNJ) 2012"
    )
  )
  # ...
)
```
`oth_id` [Optional ; Repeatable]

This element is used to acknowledge any other people and organizations that have in some form contributed to the study. This does not include other producers, which should be listed in `producers`, or financial sponsors, which should be listed in the element `funding_agencies`.
```json
"oth_id": [
  {
    "name": "string",
    "role": "string",
    "affiliation": "string"
  }
]
```
- `name` [Required ; Not repeatable ; String]: The name of the person or organization being acknowledged.
- `role` [Optional ; Not repeatable ; String]: The specific role of the person or organization mentioned in `name`.
- `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the person or organization mentioned in `name`.

```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    oth_id = list(
      list(name = "John Doe",
           role = "Technical advisor in sample design",
           affiliation = "World Bank Group")
    )
    # ...
  )
)
```
`production_statement` [Optional ; Not repeatable]

A production statement for the work at the appropriate level.
```json
"production_statement": {
  "producers": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "role": "string"
    }
  ],
  "copyright": "string",
  "prod_date": "string",
  "prod_place": "string",
  "funding_agencies": [
    {
      "name": "string",
      "abbr": "string",
      "grant": "string",
      "role": "string"
    }
  ]
}
```
`producers` [Optional ; Repeatable]

This field is provided to list other interested parties and persons that have played a significant but not the leading technical role in implementing and producing the data (the leading entities will be listed in `authoring_entity`), and who are not the financial sponsors (which will be listed in `funding_agencies`).
- `name` [Required ; Not repeatable ; String]: The name of the contributing person or organization.
- `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the organization mentioned in `name`.
- `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the person or organization mentioned in `name`.
- `role` [Optional ; Not repeatable ; String]: The specific role of the person or organization mentioned in `name`.

`copyright` [Optional ; Not repeatable ; String]

A copyright statement for the study at the appropriate level.

`prod_date` [Optional ; Not repeatable ; String]

This is the date (preferably entered in ISO 8601 format: YYYY-MM-DD or YYYY-MM or YYYY) of the actual and final production of the version of the dataset being documented. At least the month and year should be provided. A regular expression can be entered in user templates to validate the information captured in this field.

`prod_place` [Optional ; Not repeatable ; String]

The address of the organization that produced the study.

`funding_agencies` [Optional ; Repeatable]

The source(s) of funds for the production of the study. If different funding agencies sponsored different stages of the production process, use the `role` attribute to distinguish them.

- `name` [Required ; Not repeatable ; String]: The name of the funding agency.
- `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the agency mentioned in `name`.
- `grant` [Optional ; Not repeatable ; String]: The grant number.
- `role` [Optional ; Not repeatable ; String]: The specific role of the agency mentioned in `name`. This element is used when multiple funding agencies are listed, to distinguish their specific contributions.

This example shows the production statement for the Bangladesh 2018-2019 Demographic and Health Survey (DHS).
```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    production_statement = list(
      producers = list(
        list(name = "National Institute of Population Research and Training",
             abbr = "NIPORT",
             role = "Primary investigator"),
        list(name = "Medical Education and Family Welfare Division",
             role = "Advisory"),
        list(name = "Ministry of Health and Family Welfare",
             abbr = "MOHFW",
             role = "Advisory"),
        list(name = "Mitra and Associates",
             role = "Data collection - fieldwork"),
        list(name = "ICF (consulting firm)",
             role = "Technical assistance / DHS Program")
      ),
      prod_date = "2019",
      prod_place = "Dhaka, Bangladesh",
      funding_agencies = list(
        list(name = "United States Agency for International Development",
             abbr = "USAID")
      )
    )
    # ...
  )
  # ...
)
```
`distribution_statement` [Optional ; Not repeatable]

A distribution statement for the study.
```json
"distribution_statement": {
  "distributors": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "uri": "string"
    }
  ],
  "contact": [
    {
      "name": "string",
      "affiliation": "string",
      "email": "string",
      "uri": "string"
    }
  ],
  "depositor": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "uri": "string"
    }
  ],
  "deposit_date": "string",
  "distribution_date": "string"
}
```
`distributors` [Optional ; Repeatable]

The organization(s) designated by the author or producer to generate copies of the study output, including any necessary editions or revisions.

- `name` [Required ; Not repeatable ; String]: The name of the distributor.
- `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the organization mentioned in `name`.
- `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the organization mentioned in `name`.
- `uri` [Optional ; Not repeatable ; String]: A link (URL) to the website of the distributor.

`contact` [Optional ; Repeatable]

Names and addresses of individuals responsible for the study. Individuals listed as contact persons will be used as resource persons regarding problems or questions raised by users.

- `name` [Required ; Not repeatable ; String]: The name of the contact person.
- `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the person mentioned in `name`.
- `email` [Optional ; Not repeatable ; String]: The email address of the person mentioned in `name`.
- `uri` [Optional ; Not repeatable ; String]: A link (URL) related to the person mentioned in `name`.

`depositor` [Optional ; Repeatable]

The name of the person (or institution) who provided this study to the archive storing it.

- `name` [Required ; Not repeatable ; String]: The name of the depositor.
- `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the organization mentioned in `name`.
- `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the person or organization mentioned in `name`.
- `uri` [Optional ; Not repeatable ; String]: A link (URL) related to the depositor.

`deposit_date` [Optional ; Not repeatable ; String]

The date that the study was deposited with the archive that originally received it. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible.

`distribution_date` [Optional ; Not repeatable ; String]

The date that the study was made available for distribution/presentation. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible.
This example shows a distribution statement in which only the distributor information has been provided.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    distribution_statement = list(
      distributors = list(
        list(name = "World Bank Microdata Library",
             abbr = "WBML",
             affiliation = "World Bank Group",
             uri = "http://microdata.worldbank.org")
      ),
      contact = list(
        list(name = "",
             affiliation = "",
             email = "",
             uri = "")
      ),
      depositor = list(
        list(name = "",
             abbr = "",
             affiliation = "",
             uri = "")
      ),
      deposit_date = "",
      distribution_date = ""
    )
    # ...
  )
  # ...
)
```
`series_statement` [Optional ; Not repeatable]

A study may be repeated at regular intervals (such as an annual labor force survey), or be part of an international survey program (such as MICS, DHS, LSMS, and others). The series statement provides information on the series.
```json
"series_statement": {
  "series_name": "string",
  "series_info": "string"
}
```
- `series_name` [Optional ; Not repeatable ; String]: The name of the series.
- `series_info` [Optional ; Not repeatable ; String]: A description of the series and its history.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    series_statement = list(
      series_name = "Multiple Indicator Cluster Survey (MICS) by UNICEF",
      series_info = "The Multiple Indicator Cluster Survey, Round 3 (MICS3) is the third round of MICS surveys, previously conducted around 1995 (MICS1) and 2000 (MICS2). MICS surveys are designed by UNICEF, and implemented by national agencies in participating countries. MICS was designed to monitor various indicators identified at the World Summit for Children and the Millennium Development Goals. Many questions and indicators in MICS3 are consistent and compatible with the prior round of MICS (MICS2) but less so with MICS1, although there have been a number of changes in the definition of indicators between rounds. Round 1 covered X countries, round 2 covered Y countries, and Round 3 covered Z countries."
    )
    # ...
  )
  # ...
)
```
`version_statement` [Optional ; Not repeatable]

Version statement for the study.
```json
"version_statement": {
  "version": "string",
  "version_date": "string",
  "version_resp": "string",
  "version_notes": "string"
}
```
The version statement should contain a version number followed by a version label. The version number should follow a standard convention adopted by the data repository. We recommend that major versions be identified by the number to the left of the decimal, and iterations of the same version by a sequential number to the right of the decimal that identifies the release. The left number could for example be (0) for the raw, unedited dataset; (1) for the edited dataset, non-anonymized, available for internal use at the data producing agency; and (2) for the edited dataset, prepared for dissemination to secondary users (possibly anonymized). Example:

- v0: Basic raw data, resulting from the data capture process, before any data editing is implemented.
- v1.0: Edited data, first iteration, for internal use only.
- v1.1: Edited data, second iteration, for internal use only.
- v2.1: Edited data, anonymized and packaged for public distribution.
- `version` [Optional ; Not repeatable ; String]: The version number and label.
- `version_date` [Optional ; Not repeatable ; String]: The date of this version of the dataset, which may differ from the date in the `prod_date` element. It is recommended to enter the date in the ISO 8601 date format (YYYY-MM-DD or YYYY-MM or YYYY).
- `version_resp` [Optional ; Not repeatable ; String]: The person or organization responsible for this version of the dataset.
- `version_notes` [Optional ; Not repeatable ; String]: Additional information on this version of the dataset.

```r
my_ddi <- list(
  # ...
  study_desc = list(
    # ... ,
    version_statement = list(
      version = "Version 1.1",
      version_date = "2021-02-09",
      version_resp = "National Statistics Office, Data Processing unit",
      version_notes = "This dataset contains the edited version of the data that were used to produce the Final Survey Report. It is equivalent to version 1.0 of the dataset, except for the addition of an additional variable (variable weight2) containing a calibrated version of the original sample weights (variable weight)."
    )
    # ...
  )
  # ...
)
```
`bib_citation` [Optional ; Not repeatable ; String]

A complete bibliographic reference containing all of the standard elements of a citation that can be used to cite the study. The `bib_citation_format` (see below) is provided to enable specification of the particular citation style used, e.g., APA, MLA, or Chicago.

`bib_citation_format` [Optional ; Not repeatable ; String]

This element is used to specify the particular citation style used in the field `bib_citation` described above, e.g., APA, MLA, or Chicago.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    bib_citation = "",
    bib_citation_format = ""
    # ...
  )
  # ...
)
```
`holdings` [Optional ; Repeatable]

Information concerning either the physical or electronic holdings of the study being described.
```json
"holdings": [
  {
    "name": "string",
    "location": "string",
    "callno": "string",
    "uri": "string"
  }
]
```
- `name` [Optional ; Not repeatable ; String]: The name of the agency holding the study.
- `location` [Optional ; Not repeatable ; String]: The physical location where a copy of the study is held.
- `callno` [Optional ; Not repeatable ; String]: The call number for the study at the location mentioned in `location`.
- `uri` [Optional ; Not repeatable ; String]: A link (URL) to the website of the agency mentioned in `name`.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    holdings = list(
      list(name = "World Bank Microdata Library",
           location = "World Bank, Development Data Group",
           uri = "http://microdata.worldbank.org")
    )
    # ...
  )
  # ...
)
```
`study_notes` [Optional ; Not repeatable]

This element can be used to provide, in the form of a free text field, additional information on the study which cannot be accommodated in the specific metadata elements of the schema.

`study_authorization` [Optional ; Not repeatable]
```json
"study_authorization": {
  "date": "string",
  "agency": [
    {
      "name": "string",
      "affiliation": "string",
      "abbr": "string"
    }
  ],
  "authorization_statement": "string"
}
```
Provides structured information on the agency that authorized the study, the date of authorization, and an authorization statement. This element will be used when special legislation is required to conduct the data collection (for example, a Census Act) or when the approval of an Ethics Board or other body is required to collect the data.

- `date` [Optional ; Not repeatable ; String]: The date, preferably entered in ISO 8601 format (YYYY-MM-DD), when the authorization to conduct the study was granted.
- `agency` [Optional ; Repeatable]: The agency that granted the authorization, described by the following sub-elements:
  - `name` [Optional ; Not repeatable ; String]: The name of the agency.
  - `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the agency mentioned in `name`.
  - `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the agency.
- `authorization_statement` [Optional ; Not repeatable ; String]: The authorization statement itself.
+ my_ddi doc_desc = list(
+ # ...
+
+ ),study_desc = list(
+ # ... ,
+ study_authorization = list(
+ date = "2018-02-23",
+ agency = list(
+ name = "Institutional Review Board of the University of Popstan",
+ abbr = "IRB-UP")
+
+ ),authorization_statement = "The required documentation covering the study purpose, disclosure information, questionnaire content, and consent statements was delivered to the IRB-UP on 2017-12-27 and was reviewed by the compliance officer. Statement of authorization for the described study was issued on 2018-02-23."
+ # ...
+
+ ),# ...
+ )
study_info
[Required ; Not repeatable]
This section contains the metadata elements needed to describe the core elements of a study, including the dates of data collection and reference period, the country and other geographic coverage information, and more. These elements are not required in the DDI standard, but documenting a study without providing at least some of this information would make the metadata mostly irrelevant.
```json
"study_info": {
  "study_budget": "string",
  "keywords": [],
  "topics": [],
  "abstract": "string",
  "time_periods": [],
  "coll_dates": [],
  "nation": [],
  "bbox": [],
  "bound_poly": [],
  "geog_coverage": "string",
  "geog_coverage_notes": "string",
  "geog_unit": "string",
  "analysis_unit": "string",
  "universe": "string",
  "data_kind": "string",
  "notes": "string",
  "quality_statement": {},
  "ex_post_evaluation": {}
}
```
study_budget
[Optional ; Not repeatable ; String]
This is a free-text field, not a structured element. The budget of a study will ideally be described by budget line. The currency used to describe the budget should be specified. This element can also be used to document issues related to the budget (e.g., documenting possible under-run and over-run).
```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    study_info = list(
      study_budget = "The study had a total budget of 500,000 USD allocated as follows:
        By type of expense:
        - Staff: 150,000 USD
        - Consultants (incl. interviewers): 180,000 USD
        - Travel: 50,000 USD
        - Equipment: 90,000 USD
        - Other: 30,000 USD
        By activity:
        - Study design (questionnaire design and testing, sampling, piloting): 100,000 USD
        - Data collection: 250,000 USD
        - Data processing and tabulation: 80,000 USD
        - Analysis and dissemination: 50,000 USD
        - Evaluation: 20,000 USD
        By source of funding:
        - Government budget: 300,000 USD
        - External sponsors:
          - Grant ABC001: 150,000 USD
          - Grant XYZ987: 50,000 USD"
      # ...
    )
    # ...
  )
  # ...
)
```
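The nested R lists used in these examples mirror the JSON structure of the schema. Assuming the jsonlite package is available, such a list can be serialized directly to the JSON that a catalog would store or ingest (a sketch with illustrative content, not a call to any specific catalog API):

```r
# Sketch: serialize a metadata list to JSON (assumes the jsonlite package;
# the list content below is illustrative).
library(jsonlite)

my_ddi <- list(
  study_desc = list(
    study_info = list(
      nation = list(list(name = "Madagascar", abbreviation = "MDG"))
    )
  )
)

# auto_unbox = TRUE keeps scalar strings as JSON strings rather than arrays
json <- toJSON(my_ddi, pretty = TRUE, auto_unbox = TRUE)
cat(json)
```

The same approach works in reverse: `fromJSON()` reads a schema-compliant JSON file back into a nested list.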
keywords
[Optional ; Repeatable]
```json
"keywords": [
  {
    "keyword": "string",
    "vocab": "string",
    "uri": "string"
  }
]
```
Keywords are words or phrases that describe salient aspects of a data collection’s content. The addition of keywords can significantly improve the discoverability of data. Keywords can summarize and improve the description of the content or subject matter of a study. For example, keywords “poverty”, “inequality”, “welfare”, and “prosperity” could be attached to a household income survey used to generate poverty and inequality indicators (for which these keywords may not appear anywhere else in the metadata). A controlled vocabulary can be employed. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
- `keyword`
[Required ; Not repeatable ; String]
A keyword (or phrase).
- `vocab`
[Optional ; Not repeatable ; String]
The controlled vocabulary from which the keyword is extracted, if any.
- `uri`
[Optional ; Not repeatable ; String]
The URI of the controlled vocabulary used, if any.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      keywords = list(
        list(keyword = "poverty",
             vocab = "UNESCO Thesaurus",
             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"),
        list(keyword = "income distribution",
             vocab = "UNESCO Thesaurus",
             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"),
        list(keyword = "inequality",
             vocab = "UNESCO Thesaurus",
             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/")
      )
      # ...
    )
    # ...
  )
  # ...
)
```
`topics`
[Optional ; Repeatable]
The `topics` field indicates the broad substantive topic(s) that the study covers. A topic classification facilitates referencing and searches in on-line data catalogs.

```json
"topics": [
  {
    "topic": "string",
    "vocab": "string",
    "uri": "string"
  }
]
```
- `topic` [Required ; Not repeatable ; String]
- `vocab` [Required ; Not repeatable ; String]: The controlled vocabulary from which the topic is taken.
- `uri` [Required ; Not repeatable ; String]: The URI of the controlled vocabulary used.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      topics = list(
        list(topic = "Equality, inequality and social exclusion",
             vocab = "CESSDA topics classification",
             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
        list(topic = "Social and occupational mobility",
             vocab = "CESSDA topics classification",
             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
      )
      # ...
    )
    # ...
  )
  # ...
)
```
abstract
[Optional ; Not repeatable ; String]
An unformatted summary describing the purpose, nature, and scope of the data collection, special characteristics of its contents, major subject areas covered, and what questions the primary investigator(s) attempted to answer when they conducted the study. The summary should ideally be between 50 and 5,000 characters long. The abstract should provide a clear summary of the purposes, objectives, and content of the survey. It should be written by a researcher or survey statistician aware of the study. Inclusion of this element is strongly recommended.
This example is for the Afrobarometer Survey 1999-2000, Merged Round 1 dataset.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      abstract = "The Afrobarometer is a comparative series of public attitude surveys that assess African citizens' attitudes to democracy and governance, markets, and civil society, among other topics.

        The 12-country dataset is a combined dataset for the 12 African countries surveyed during round 1 of the survey, conducted between 1999-2000 (Botswana, Ghana, Lesotho, Mali, Malawi, Namibia, Nigeria, South Africa, Tanzania, Uganda, Zambia and Zimbabwe), plus data from the old Southern African Democracy Barometer, and similar surveys done in West and East Africa."
      # ...
    )
    # ...
  )
  # ...
)
```
`time_periods`
[Optional ; Repeatable]
This refers to the time period (also known as span) covered by the data, not the dates of data collection.

```json
"time_periods": [
  {
    "start": "string",
    "end": "string",
    "cycle": "string"
  }
]
```
- `start`
[Required ; Not repeatable ; String]
The start date for the cycle being described. Enter the date in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY).
- `end`
[Required ; Not repeatable ; String]
The end date for the cycle being described. Enter the date in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY). Indicate open-ended dates with two dots (..)
- `cycle`
[Optional ; Not repeatable ; String]
The `cycle` attribute permits specification of the relevant cycle, wave, or round of data.
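Since several elements in this schema (`date`, `start`, `end`) expect ISO 8601 dates, a small validation helper can catch malformed entries before metadata are published. A minimal sketch in R (the function name is illustrative, not part of the schema or any package):

```r
# Sketch: loosely check that a date string uses ISO 8601 at year, month,
# or day precision (illustrative helper, not part of the DDI schema).
is_iso_date <- function(x) {
  if (grepl("^\\d{4}$", x)) return(TRUE)                   # YYYY
  if (grepl("^\\d{4}-(0[1-9]|1[0-2])$", x)) return(TRUE)   # YYYY-MM
  !is.na(as.Date(x, format = "%Y-%m-%d"))                  # YYYY-MM-DD
}

is_iso_date("2020-01-10")   # TRUE
is_iso_date("10/01/2020")   # FALSE
```

A real curation pipeline would also want to accept the open-ended marker (..) and check that `start` does not fall after `end`.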
`coll_dates`
[Optional ; Repeatable]
Contains the date(s) when the data were collected, which may be different from the dates the data refer to (see `time_periods` above). For example, data may be collected over a period of 2 weeks (`coll_dates`) about household expenditures during a reference week (`time_periods`) preceding the beginning of data collection. Use the event attribute to specify the "start" and "end" for each period entered.

```json
"coll_dates": [
  {
    "start": "string",
    "end": "string",
    "cycle": "string"
  }
]
```
- `start` [Required ; Not repeatable ; String]
- `end` [Required ; Not repeatable ; String]
- `cycle` [Optional ; Not repeatable ; String]: The `cycle` attribute permits specification of the relevant cycle, wave, or round of data. For example, a household consumption survey could visit households in four phases (one per quarter). Each quarter would be a cycle, and the specific dates of data collection for each quarter would be entered.

This example is for an impact evaluation survey with a baseline and two follow-up surveys.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      time_periods = list(
        list(start = "2020-01-10",
             end = "2020-01-16",
             cycle = "Baseline survey"),
        list(start = "2020-07-10",
             end = "2020-07-16",
             cycle = "First follow-up survey"),
        list(start = "2021-01-10",
             end = "2021-01-16",
             cycle = "Second and last follow-up survey")
      ),
      coll_dates = list(
        list(start = "2020-01-17",
             end = "2020-01-25",
             cycle = "Baseline survey"),
        list(start = "2020-07-17",
             end = "2020-07-24",
             cycle = "First follow-up survey"),
        list(start = "2021-01-17",
             end = "2021-01-22",
             cycle = "Second and last follow-up survey")
      )
      # ...
    )
    # ...
  )
  # ...
)
```
`nation`
[Optional ; Repeatable]

```json
"nation": [
  {
    "name": "string",
    "abbreviation": "string"
  }
]
```
- `name`
[Required ; Not repeatable ; String]
The country name, even in cases where the study does not cover the entire country.
- `abbreviation`
[Optional ; Not repeatable ; String]
The `abbreviation` will contain a country code, preferably the 3-letter ISO 3166-1 country code.
`bbox`
[Optional ; Repeatable]
This element is used to define one or multiple bounding box(es), which are the rectangular, fundamental geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset's geographic coverage. The bounding box provides the geographic coordinates of the top-left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. This element is optional, but if the `bound_poly` element (see below) is used, then the `bbox` element must be included.

```json
"bbox": [
  {
    "west": "string",
    "east": "string",
    "south": "string",
    "north": "string"
  }
]
```
- `west` [Required ; Not repeatable ; String]
- `east` [Required ; Not repeatable ; String]
- `south` [Required ; Not repeatable ; String]
- `north` [Required ; Not repeatable ; String]

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      nation = list(
        list(name = "Madagascar", abbreviation = "MDG"),
        list(name = "Mauritius", abbreviation = "MUS")
      ),
      bbox = list(
        list(name = "Madagascar",
             west = "43.2541870461",
             east = "50.4765368996",
             south = "-25.6014344215",
             north = "-12.0405567359"),
        list(name = "Mauritius",
             west = "56.6",
             east = "72.466667",
             south = "-20.516667",
             north = "-5.25")
      )
      # ...
    )
    # ...
  )
  # ...
)
```
`bound_poly`
[Optional ; Repeatable]
The `bbox` metadata element (see above) describes a rectangular area representing the entire geographic coverage of a dataset. The element `bound_poly` allows for a more detailed description of the geographic coverage, by allowing multiple and non-rectangular polygons (areas) to be described. This is done by providing list(s) of latitude and longitude coordinates that define the area(s). It should only be used to define the outer boundaries of the covered areas. This field is intended to enable a refined coordinate-based search, not to actually map an area. Note that if the `bound_poly` element is used, then the element `bbox` MUST be present as well, and all points enclosed by the `bound_poly` MUST be contained within the bounding box defined in `bbox`.

```json
"bound_poly": [
  {
    "lat": "string",
    "lon": "string"
  }
]
```
- `lat` [Required ; Not repeatable ; String]
- `lon` [Required ; Not repeatable ; String]

This example shows a polygon for the State of Nevada, USA.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      bound_poly = list(
        list(lat = "42.002207", lon = "-120.005729004"),
        list(lat = "42.002207", lon = "-114.039663"),
        list(lat = "35.9", lon = "-114.039663"),
        list(lat = "36.080", lon = "-114.544"),
        list(lat = "35.133", lon = "-114.542"),
        list(lat = "35.00208499998", lon = "-114.63288"),
        list(lat = "35.00208499998", lon = "-114.63323"),
        list(lat = "38.999", lon = "-120.005729004"),
        list(lat = "42.002207", lon = "-120.005729004")
      )
      # ...
    )
    # ...
  )
  # ...
)
```
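The containment rule stated above (every `bound_poly` point MUST lie within the `bbox`) can be verified mechanically before metadata are published. A minimal sketch in R (the helper name and the bbox values are illustrative, not part of the schema or any package):

```r
# Sketch: check that every bound_poly point falls within a bbox
# (illustrative helper, not part of the DDI schema or any package).
points_in_bbox <- function(bound_poly, bbox) {
  lat <- as.numeric(sapply(bound_poly, `[[`, "lat"))
  lon <- as.numeric(sapply(bound_poly, `[[`, "lon"))
  all(lat >= as.numeric(bbox$south), lat <= as.numeric(bbox$north),
      lon >= as.numeric(bbox$west),  lon <= as.numeric(bbox$east))
}

# Illustrative bounding box enclosing the Nevada polygon above
nevada_bbox <- list(west = "-120.006", east = "-114.039",
                    south = "35.002", north = "42.003")
poly <- list(list(lat = "42.002207", lon = "-120.005729004"),
             list(lat = "35.00208499998", lon = "-114.63288"))

points_in_bbox(poly, nevada_bbox)   # TRUE
```

Note that this simple check assumes the area does not cross the 180th meridian, where west > east and the comparison logic would need adjusting.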
geog_coverage
[Optional ; Not repeatable ; String]
Information on the geographic coverage of the study. This includes the total geographic scope of the data, and any additional levels of geographic coding provided in the variables. Typical entries will be "National coverage", "Urban areas", "Rural areas", "State of ...", "Capital city", etc. This does not describe where the data were collected; it describes which area the data are representative of. This means, for example, that a sample survey could be declared as having national coverage even if some districts of the country were not included in the sample, as long as the sample is nationally representative.
geog_coverage_notes
[Optional ; Not repeatable ; String]
Additional information on the geographic coverage of the study entered as a free text field.
geog_unit
[Optional ; Not repeatable ; String]
Describes the levels of geographic aggregation covered by the data. Particular attention must be paid to include information on the lowest geographic area for which data are representative.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      geog_coverage = "National coverage",
      geog_coverage_notes = "The sample covered the urban and rural areas of all provinces of the country. Some areas of province X were however not accessible due to civil unrest.",
      geog_unit = "The survey provides data representative at the national, provincial and district levels. For the capital city, the data are representative at the ward level."
      # ...
    )
    # ...
  )
  # ...
)
```
analysis_unit
[Optional ; Not repeatable ; String]
A study can have multiple units of analysis. This field will list the various units that can be analyzed. For example, a Living Standard Measurement Study (LSMS) may have collected data on households and their members (individuals), on dwelling characteristics, on prices in local markets, on household enterprises, on agricultural plots, and on characteristics of health and education facilities in the sample areas.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      analysis_unit = "Data were collected on households, individuals (household members), dwellings, commodity prices at local markets, household enterprises, agricultural plots, and characteristics of health and education facilities."
      # ...
    )
    # ...
  )
  # ...
)
```
universe
[Optional ; Not repeatable ; String]
The universe is the group of persons (or other units of observation, like dwellings, facilities, or other) that are the object of the study and to which any analytic results refer. The universe will rarely cover the entire population of the country. Sample household surveys, for example, may not cover the homeless, nomads, diplomats, or community households. Population censuses do not cover diplomats. Facility surveys may be limited to facilities of a certain type (e.g., public schools). Try to provide the most detailed information possible on the population covered by the survey/census, focusing on excluded categories of the population. For household surveys, age, nationality, and residence commonly help to delineate a given universe, but any of a number of factors may be involved, such as sex, race, income, veteran status, criminal convictions, etc. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is a member of the population under study.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      universe = "The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household."
      # ...
    )
    # ...
  )
  # ...
)
```
data_kind
[Optional ; Not repeatable ; String]
This field describes the main type of microdata generated by the study: survey data, census/enumeration data, aggregate data, clinical data, event/transaction data, program source code, machine-readable text, administrative records data, experimental data, psychological test, textual data, coded textual, coded documents, time budget diaries, observation data/ratings, process-produced data, etc. A controlled vocabulary should be used as this information may be used to build facets (filters) in a catalog user interface.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      data_kind = "Sample survey data"
      # ...
    )
    # ...
  )
  # ...
)
```
notes
[Optional ; Not repeatable ; String]
This element is provided to document any specific situations, observations, or events that occurred during data collection, such as the dates and organization of pre-tests and interviewer training, the composition of field teams, the duration of interviews, or incidents that affected fieldwork.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      notes = "The pre-test for the survey took place from August 15, 2006 - August 25, 2006 and included 14 interviewers who would later become supervisors for the main survey.
        Each interviewing team comprised 3-4 female interviewers (no male interviewers were used due to the sensitivity of the subject matter), together with a field editor, a supervisor, and a driver. A total of 52 interviewers, 14 supervisors and 14 field editors were used. Training of interviewers took place at the headquarters of the Statistics Office from July 1 to July 12, 2006.
        Data collection took place over a period of about 6 weeks from September 2, 2006 until October 17, 2006. Interviewing took place every day throughout the fieldwork period, although interviewing teams were permitted to take one day off per week.
        Interviews averaged 35 minutes for the household questionnaire (excluding water testing), 23 minutes for the women's questionnaire, and 27 minutes for the under-five children's questionnaire (excluding the anthropometry). Interviews were conducted primarily in English, but occasionally used local translation.
        Six staff members of the Statistics Office provided overall fieldwork coordination and supervision."
      # ...
    )
    # ...
  )
  # ...
)
```
quality_statement
[Optional ; Not Repeatable]
This section lists the specific standards complied with during the execution of this study, and provides the option to formulate a general statement on the quality of the data. Any known quality issue should be reported here. Such issues are better reported by the data producer or curator, not left to secondary analysts to discover. Transparency in reporting quality issues will increase the credibility and reputation of the data provider.
```json
"quality_statement": {
  "compliance_description": "string",
  "standards": [
    {
      "name": "string",
      "producer": "string"
    }
  ],
  "other_quality_statement": "string"
}
```
- `compliance_description` [Optional ; Not repeatable ; String]: A description of the study's compliance with the standards listed in `standards`.
- `standards` [Optional ; Repeatable]
  - `name` [Optional ; Not repeatable ; String]
  - `producer` [Optional ; Not repeatable ; String]: The producer of the standard mentioned in `name`.
- `other_quality_statement` [Optional ; Not repeatable ; String]
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      quality_statement = list(
        compliance_description = "",
        standards = list(
          list(name = "",
               producer = "")
        ),
        other_quality_statement = ""
      )
      # ...
    )
    # ...
  )
  # ...
)
```
`ex_post_evaluation`
[Optional ; Not repeatable]

```json
"ex_post_evaluation": {
  "completion_date": "string",
  "type": "string",
  "evaluator": [
    {
      "name": "string",
      "affiliation": "string",
      "abbr": "string",
      "role": "string"
    }
  ],
  "evaluation_process": "string",
  "outcomes": "string"
}
```
- `completion_date` [Optional ; Not repeatable ; String]
- `type` [Optional ; Not repeatable ; String]: The `type` attribute identifies the type of evaluation, with or without the use of a controlled vocabulary.
- `evaluator` [Optional ; Repeatable]
  - `name` [Optional ; Not repeatable ; String]
  - `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the evaluator mentioned in `name`.
  - `abbr` [Optional ; Not repeatable ; String]: The abbreviation of the organization mentioned in `name`.
  - `role` [Optional ; Not repeatable ; String]: The specific role of the evaluator mentioned in `name` in the evaluation process.
- `evaluation_process` [Optional ; Not repeatable ; String]
- `outcomes` [Optional ; Not repeatable ; String]

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      ex_post_evaluation = list(
        completion_date = "2020-04-30",
        type = "Independent evaluation requested by the survey sponsor",
        evaluator = list(
          list(name = "John Doe",
               affiliation = "Alpha Consulting, Ltd.",
               abbr = "AC",
               role = "Evaluation of the sampling methodology"),
          list(name = "Jane Smith",
               affiliation = "Beta Statistical Services, Ltd.",
               abbr = "BSS",
               role = "Evaluation of the data processing and analysis")
        ),
        evaluation_process = "In-depth review of pre-collection and collection procedures",
        outcomes = "The following steps were highly effective in increasing response rates."
      )
    )
    # ...
  )
  # ...
)
```
`study_development`
[Optional ; Not repeatable]

```json
"study_development": {
  "development_activity": [
    {
      "activity_type": "string",
      "activity_description": "string",
      "participants": [
        {
          "name": "string",
          "affiliation": "string",
          "role": "string"
        }
      ],
      "resources": [
        {
          "name": "string",
          "origin": "string",
          "characteristics": "string"
        }
      ],
      "outcome": "string"
    }
  ]
}
```
This section is used to describe the process that led to the production of the final output of the study, from its inception/design to the dissemination of the final output.

- `development_activity`
[Optional ; Repeatable]
The Generic Statistical Business Process Model (GSBPM) provides a useful decomposition of such a process, which can be used to list the activities to be described. This is a repeatable set of metadata elements; each activity should be documented separately.
  - `activity_type` [Optional ; Not repeatable ; String]: The type of activity, ideally selected from a controlled vocabulary (e.g., {Needs specification, Design, Build, Collect, Process, Analyze, Disseminate, Evaluate}).
  - `activity_description` [Optional ; Not repeatable ; String]
  - `participants` [Optional ; Repeatable]
    - `name` [Optional ; Not repeatable ; String]
    - `affiliation` [Optional ; Not repeatable ; String]: The affiliation of the participant mentioned in `name`.
    - `role` [Optional ; Not repeatable ; String]: The role of the participant mentioned in `name`.
  - `resources` [Optional ; Repeatable]
    - `name` [Optional ; Not repeatable ; String]
    - `origin` [Optional ; Not repeatable ; String]: The origin of the resource mentioned in `name`.
    - `characteristics` [Optional ; Not repeatable ; String]: The characteristics of the resource mentioned in `name`.
  - `outcome` [Optional ; Not repeatable ; String]

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ...
    ),
    study_development = list(
      development_activity = list(
        list(
          activity_type = "Questionnaire design and piloting",
          activity_description = "",
          participants = list(
            list(name = "", affiliation = "", role = ""),
            list(name = "", affiliation = "", role = ""),
            list(name = "", affiliation = "", role = "")
          ),
          resources = list(
            list(name = "", origin = "", characteristics = "")
          ),
          outcome = ""
        ),
        list(
          activity_type = "Interviewers training",
          activity_description = "",
          participants = list(
            list(name = "", affiliation = "", role = ""),
            list(name = "", affiliation = "", role = ""),
            list(name = "", affiliation = "", role = "")
          ),
          resources = list(
            list(name = "", origin = "", characteristics = "")
          ),
          outcome = ""
        )
      )
    )
    # ...
  )
)
```
`method`
[Optional ; Not repeatable]
This section describes the methodology and processing involved in a study.

```json
"method": {
  "data_collection": {},
  "method_notes": "string",
  "analysis_info": {},
  "study_class": null,
  "data_processing": [],
  "coding_instructions": []
}
```
`data_collection`
[Optional ; Not repeatable]

```json
"data_collection": {
  "time_method": "string",
  "data_collectors": [],
  "collector_training": [],
  "frequency": "string",
  "sampling_procedure": "string",
  "sample_frame": {},
  "sampling_deviation": "string",
  "coll_mode": null,
  "research_instrument": "string",
  "instru_development": "string",
  "instru_development_type": "string",
  "sources": [],
  "coll_situation": "string",
  "act_min": "string",
  "control_operations": "string",
  "weight": "string",
  "cleaning_operations": "string"
}
```
`time_method`
[Optional ; Not repeatable ; String]
The time method or time dimension of the data collection. A controlled vocabulary can be used. The entries for this element may include "panel survey", "cross-section", "trend study", or "time-series".

`data_collectors`
[Optional ; Repeatable]
The entity (individual, agency, or institution) responsible for administering the questionnaire or interview or compiling the data.

```json
"data_collectors": [
  {
    "name": "string",
    "affiliation": "string",
    "abbr": "string",
    "role": "string"
  }
]
```
- `name`
[Optional ; Not repeatable ; String]
In most cases, the name of the agency will be recorded here, not the names of interviewers. Only in the case of very small-scale surveys, with a very limited number of interviewers, will the names of persons be included as well.
- `affiliation`
[Optional ; Not repeatable ; String]
The affiliation of the data collector mentioned in `name`.
- `abbr`
[Optional ; Not repeatable ; String]
The abbreviation given to the agency mentioned in `name`.
- `role`
[Optional ; Not repeatable ; String]
The specific role of the person or agency mentioned in `name`.
`collector_training`
[Optional ; Repeatable]
Describes the training provided to data collectors, including interviewer training, process testing, compliance with standards, etc. This set of elements is repeatable, to capture different aspects of the training process.

```json
"collector_training": [
  {
    "type": "string",
    "training": "string"
  }
]
```

- `type`
[Optional ; Not repeatable ; String]
The type of training being described. For example, "Training of interviewers", "Training of controllers", "Training of cartographers", "Training on the use of tablets for data collection", etc.
- `training`
[Optional ; Not repeatable ; String]
A brief description of the training. This may include information on the dates and duration, audience, location, content, trainers, issues, etc.

`frequency`
[Optional ; Not repeatable ; String]
For data collected at more than one point in time, the frequency with which the data were collected.
`sampling_procedure`
[Optional ; Not repeatable ; String]
This field only applies to sample surveys. It describes the type of sample and sample design used to select the survey respondents to represent the population. This section should include summary information that includes (but is not limited to): sample size (expected and actual) and how the sample size was decided; level of representation of the sample; sample frame used, and listing exercise conducted to update it; sample selection process (e.g., probability proportional to size, or oversampling); stratification (implicit and explicit); design omissions in the sample; strategy for absent respondents/not found/refusals (replacement or not). Detailed information on the sample design is critical to allow users to adequately calculate sampling errors and confidence intervals for their estimates. To do that, they will need to be able to clearly identify the variables in the dataset that represent the different levels of stratification and the primary sampling unit (PSU).

In publications and reports, the description of a sampling design often contains complex formulas and symbols. As the XML and JSON formats used to store the metadata are plain-text files, they cannot contain these complex representations. You may however provide references (title/author/date) to documents where such detailed descriptions are provided, and make sure that the documents (or links to the documents) are available in the catalog where the survey metadata are published.
`sample_frame`
[Optional ; Not repeatable]
A description of the sample frame used for identifying the population from which the sample was taken. For example, a telephone book may be a sample frame for a phone survey, or the listing of enumeration areas (EAs) of a population census can provide a sample frame for a household survey. In addition to the name, label, and text describing the sample frame, this structure lists who maintains the sample frame, the period for which it is valid, a use statement, the universe covered, the type of unit contained in the frame as well as the number of units available, the reference period of the frame, and the procedures used to update the frame.

```json
"sample_frame": {
  "name": "string",
  "valid_period": [
    {
      "event": "string",
      "date": "string"
    }
  ],
  "custodian": "string",
  "universe": "string",
  "frame_unit": {
    "is_primary": null,
    "unit_type": "string",
    "num_of_units": "string"
  },
  "reference_period": [
    {
      "event": "string",
      "date": "string"
    }
  ],
  "update_procedure": "string"
}
```
- `name`
[Optional ; Not repeatable ; String]
The name (title) of the sample frame.
- `valid_period`
[Optional ; Repeatable]
Defines a time period for the validity of the sampling frame, using a list of events and dates.
  - `event` [Optional ; Not repeatable ; String]: The event being dated, typically `start` or `end`.
  - `date` [Optional ; Not repeatable ; String]
- `custodian`
[Optional ; Not repeatable ; String]
Identifies the agency or individual responsible for creating and/or maintaining the sample frame.
- `universe`
[Optional ; Not repeatable ; String]
A description of the universe or population covered by the sample frame. Age, nationality, and residence commonly help to delineate a given universe, but any of a number of factors may be involved, such as sex, race, income, etc. The universe may consist of elements other than persons, such as housing units, court cases, deaths, countries, etc. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is included in the sample frame.
- `frame_unit`
[Optional ; Not repeatable]
Provides information about the sampling frame unit.
  - `is_primary` [Optional ; Not repeatable ; Boolean]
  - `unit_type` [Optional ; Not repeatable ; String]
  - `num_of_units` [Optional ; Not repeatable ; String]
- `reference_period`
[Optional ; Repeatable]
Indicates the period of time in which the sampling frame was actually used for the study in question. Use the ISO 8601 date format to enter the relevant date(s).
  - `event` [Optional ; Not repeatable ; String]
  - `date` [Optional ; Not repeatable ; String]
- `update_procedure`
[Optional ; Not repeatable ; String]
This element is used to describe how and with what frequency the sample frame is updated. For example: "The lists and boundaries of enumeration areas are updated every ten years at the occasion of the population census cartography work. Listings of households in enumeration areas are updated as and when needed, based on their selection in survey samples."
`sampling_deviation`
[Optional ; Not repeatable ; String]
Sometimes the reality of the field requires a deviation from the sampling design (for example, due to difficulty of access to certain zones because of weather problems, political instability, etc.). If for any reason the sample design has deviated, this can be reported here. This element will provide information indicating the correspondence as well as the possible discrepancies between the sampled units (obtained) and available statistics for the population (age, sex ratio, marital status, etc.) as a whole.
coll_mode
[Optional ; Repeatable ; String]
The mode of data collection is the manner in which the interview was conducted or information was gathered. Ideally, a controlled vocabulary will be used to constrain the entries in this field, which could include items like “telephone interview”, “face-to-face paper and pen interview”, “face-to-face computer-assisted interviews (CAPI)”, “mail questionnaire”, “computer-aided telephone interviews (CATI)”, “self-administered web forms”, “measurement by sensor”, and others.
This is a repeatable field, as some data collection activities implement multi-mode data collection (for example, a population census can offer respondents the options to submit information via web forms, telephone interviews, mailed forms, or face-to-face interviews). Note that in the API description (see screenshot above), the element is described as having type “null”, not {}. This is due to the fact that the element can be entered either as a list (repeatable element) or as a string.
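Because the element accepts either a string or a list, both of the following forms can be used when building the metadata in R; a minimal sketch (the mode labels are illustrative):

```r
# coll_mode entered as a single string (single-mode data collection)
coll_mode_single <- "Face-to-face computer-assisted interviews (CAPI)"

# coll_mode entered as a list of strings (multi-mode data collection);
# this duality is why the API reports the element's type as "null"
coll_mode_multi <- list(
  "Self-administered web forms",
  "Computer-aided telephone interviews (CATI)"
)
```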
research_instrument
[Optional ; Not repeatable ; String]
The research instrument refers to the questionnaire or form used for collecting data. The following should be mentioned:
- List of questionnaires and a short description of each (all questionnaires must be provided as External Resources)
- In what language(s) was/were the questionnaire(s) available?
- Information on the questionnaire design process (based on a previous questionnaire, based on a standard model questionnaire, review by stakeholders). If a document was compiled that contains the comments provided by the stakeholders on the draft questionnaire, or a report was prepared on the questionnaire testing, a reference to these documents can be provided here.
instru_development
[Optional ; Not repeatable ; String]
Describe any development work on the data collection instrument. This may include a description of the review process, standards followed, and a list of agencies/people consulted.
instru_development_type
[Optional ; Repeatable ; String]
The instrument development type. This element will be used when a pre-defined list of options (controlled vocabulary) is available.
sources
[Optional ; Repeatable]

```json
"sources": [
  {
    "name": "string",
    "origin": "string",
    "characteristics": "string"
  }
]
```
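Following the schema above, the sources block can be entered in R as a list of lists, one per source; a minimal sketch (the origin and characteristics texts are hypothetical, only the source name comes from the example given below):

```r
# Hypothetical example of the 'sources' element; each source is one
# named list with the three fields defined in the schema above
sources <- list(
  list(
    name = "United States Internal Revenue Service Quarterly Payroll File",
    origin = "Administrative records compiled by the US Internal Revenue Service",  # hypothetical
    characteristics = "Near-complete coverage of formal-sector employers"           # hypothetical
  )
)
```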
name
[Optional ; Not repeatable ; String]
The name and other information on the source. For example, “United States Internal Revenue Service Quarterly Payroll File”.
origin
[Optional ; Not repeatable ; String]
For historical materials, information about the origin(s) of the sources and the rules followed in establishing the sources should be specified. This may not be relevant to survey data.
characteristics
[Optional ; Not repeatable ; String]
Assessment of the characteristics and quality of the source material. This may not be relevant to survey data.
coll_situation
[Optional ; Not repeatable ; String]
A description of noteworthy aspects of the data collection situation. Includes information on factors such as cooperativeness of respondents, duration of interviews, number of call-backs, etc.
act_min
[Optional ; Not repeatable ; String]
A summary of actions taken to minimize data loss. This includes information on actions such as follow-up visits, supervisory checks, historical matching, estimation, etc. Note that this element does not have to include detailed information on response rates, as a specific metadata element is provided for that purpose in the section analysis_info / response_rate (see below).
control_operations
[Optional ; Not repeatable ; String]
This element provides information on the oversight of the data collection, i.e., on methods implemented to facilitate data control performed by the primary investigator or by the data archive.
weight
[Optional ; Not repeatable ; String]
This field only applies to sample surveys. The use of sampling procedures may make it necessary to apply weights to produce accurate statistical results. Describe here the criteria for using weights in the analysis of a collection, and provide a list of the variables used as weighting coefficients. If more than one variable is a weighting variable, describe how these variables differ from each other and what the purpose of each one of them is.
cleaning_operations
[Optional ; Not repeatable ; String]
A description of the methods used to clean or edit the data, e.g., consistency checking, wild code checking, etc. The data editing description should contain information on how the data were treated or controlled for in terms of consistency and coherence. This item does not concern the data entry phase, but only the editing of data, whether manual or automatic. It should provide answers to questions like: Was a hot deck or a cold deck technique used to edit the data? Were corrections made automatically (by program), or by visual control of the questionnaire? What software was used? If materials are available (specifications for data editing, report on data editing, programs used for data editing), they should be listed here and provided as external resources in data catalogs (the best documentation of data editing consists of well-documented reproducible scripts).
Example for the data_collection section:
```r
my_ddi <- list(

  doc_desc = list(
    # ...
  ),

  study_desc = list(
    # ... ,
    study_info = list(
      # ... ),
    study_development = list(
      # ... ),

    method = list(

      data_collection = list(

        time_method = "cross-section",

        data_collectors = list(
          list(name = "Staff from the Central Statistics Office",
               abbr = "NSO",
               affiliation = "Ministry of Planning")
        ),

        collector_training = list(
          list(
            type = "Training of interviewers",
            training = "72 staff (interviewers) were trained from [date] to [date] at the NSO headquarters. The training included 2 days of field work."
          ),
          list(
            type = "Training of controllers and supervisors",
            training = "A 3-day training of 10 controllers and 2 supervisors was organized from [date] to [date]. The controllers and supervisors had previously participated in the interviewer training."
          )
        ),

        sampling_procedure = "A list of 500 Enumeration Areas (EAs) was randomly selected from the sample frame, 300 in urban areas and 200 in rural areas. In each selected EA, 10 households were then randomly selected. 5000 households were thus selected for the sample (3000 urban and 2000 rural). The distribution of the sample (households) by province is as follows:
- Province A: Total: 1800 Urban: 1000 Rural: 800
- Province B: Total: 1200 Urban: 500 Rural: 700
- Province C: Total: 2000 Urban: 1500 Rural: 500",

        sample_frame = list(
          name = "Listing of Enumeration Areas (EAs) from the Population and Housing Census 2011",
          custodian = "National Statistics Office",
          universe = "The sample frame contains 25365 EAs covering the entire territory of the country. EAs contain an average of 400 households in rural areas, and 580 in urban areas.",
          frame_unit = list(
            is_primary = TRUE,
            unit_type = "Enumeration areas (EAs)",
            num_of_units = "25365, including 15100 in urban areas, and 10265 in rural areas."
          ),
          update_procedure = "The sample frame only provides EAs; a full household listing was conducted in each selected EA to provide an updated list of households."
        ),

        sampling_deviation = "Due to floods in Province A, two sampled rural EAs could not be reached. The sample was thus reduced to 4980 households. The response rate was 90%, so the actual final sample size was 4482 households.",

        coll_mode = "Face-to-face interviews, conducted using tablets (CAPI)",

        research_instrument = "The questionnaires for the Generic MICS were structured questionnaires based on the MICS3 Model Questionnaire with some modifications and additions. A household questionnaire was administered in each household, which collected various information on household members including sex, age, relationship, and orphanhood status. The household questionnaire includes household characteristics, support to orphaned and vulnerable children, education, child labour, water and sanitation, household use of insecticide treated mosquito nets, and salt iodization, with optional modules for child discipline, child disability, maternal mortality and security of tenure and durability of housing.
In addition to a household questionnaire, questionnaires were administered in each household for women age 15-49 and children under age five. For children, the questionnaire was administered to the mother or caretaker of the child.
The women's questionnaire includes women's characteristics, child mortality, tetanus toxoid, maternal and newborn health, marriage, polygyny, female genital cutting, contraception, and HIV/AIDS knowledge, with optional modules for unmet need, domestic violence, and sexual behavior.
The children's questionnaire includes children's characteristics, birth registration and early learning, vitamin A, breastfeeding, care of illness, malaria, immunization, and anthropometry, with an optional module for child development.
The questionnaires were developed in English from the MICS3 Model Questionnaires and translated into local languages. After an initial review the questionnaires were translated back into English by an independent translator with no prior knowledge of the survey. The back translation from the local language version was independently reviewed and compared to the English original. Differences in translation were reviewed and resolved in collaboration with the original translators. The English and local language questionnaires were both piloted as part of the survey pretest.",

        instru_development = "The questionnaire was pre-tested with split-panel tests, as well as an analysis of non-response rates for individual items, and response distributions.",

        coll_situation = "Floods in Province A made access to two selected enumeration areas impossible.",

        act_min = "Local authorities and local staff from the Ministry of Health contributed to an awareness campaign, which contributed to achieving a response rate of 90%.",

        control_operations = "Interviewing was conducted by teams of interviewers. Each interviewing team comprised 3-4 female interviewers, a field editor, a supervisor, and a driver. Each team used a 4 wheel drive vehicle to travel from cluster to cluster (and where necessary within cluster).
The role of the supervisor was to coordinate field data collection activities, including management of the field teams, supplies and equipment, finances, maps and listings, coordinate with local authorities concerning the survey plan and make arrangements for accommodation and travel. Additionally, the field supervisor assigned the work to the interviewers, spot checked work, maintained field control documents, and sent completed questionnaires and progress reports to the central office.
The field editor was responsible for validating questionnaires at the end of the day, when the data from interviews were transferred to their laptops. This included checking for missed questions, skip errors, fields incorrectly completed, and checking for inconsistencies in the data. The field editor also observed interviews and conducted review sessions with interviewers.
Responsibilities of the supervisors and field editors are described in the Instructions for Supervisors and Field Editors, together with the different field controls that were in place to control the quality of the fieldwork.
Field visits were also made by a team of central staff on a periodic basis during fieldwork. The senior staff of NSO also made 3 visits to field teams to provide support and to review progress.",

        weight = "Sample weights were calculated for each of the data files. Sample weights for the household data were computed as the inverse of the probability of selection of the household, computed at the sampling domain level (urban/rural within each region). The household weights were adjusted for non-response at the domain level, and were then normalized by a constant factor so that the total weighted number of households equals the total unweighted number of households. The household weight variable is called HHWEIGHT and is used with the HH data and the HL data.
Sample weights for the women's data used the un-normalized household weights, adjusted for non-response for the women's questionnaire, and were then normalized by a constant factor so that the total weighted number of women's cases equals the total unweighted number of women's cases.
Sample weights for the children's data followed the same approach as the women's and used the un-normalized household weights, adjusted for non-response for the children's questionnaire, and were then normalized by a constant factor so that the total weighted number of children's cases equals the total unweighted number of children's cases.",

        cleaning_operations = "Data editing took place at a number of stages throughout the processing, including:
a) Office editing and coding
b) During data entry
c) Structure checking and completeness
d) Secondary editing
e) Structural checking of SPSS data files
Detailed documentation of the editing of data can be found in the 'Data processing guidelines' document provided as an external resource."
      )
    )
  ),
  # ...
)
```
method_notes
[Optional ; Not repeatable ; String]
This element is provided to capture any additional relevant information on the data collection methodology which could not fit in the previous metadata elements.
analysis_info
[Optional ; Not Repeatable]

```json
"analysis_info": {
  "response_rate": "string",
  "sampling_error_estimates": "string",
  "data_appraisal": "string"
}
```
response_rate
[Optional ; Not repeatable ; String]
The percentage of sample members who provided information.
sampling_error_estimates
[Optional ; Not repeatable ; String]
A measure of how precisely one can estimate a population value from a given sample.
data_appraisal
[Optional ; Not repeatable ; String]
Other issues pertaining to data appraisal.

Example:

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ),
    study_development = list(
      # ... ),
    method = list(
      # ... ,

      analysis_info = list(

        response_rate = "Of these, 4996 were occupied households and 4811 were successfully interviewed for a response rate of 96.3%. Within these households, 7815 eligible women aged 15-49 were identified for interview, of which 7505 were successfully interviewed (response rate 96.0%), and 3242 children aged 0-4 were identified for whom the mother or caretaker was successfully interviewed for 3167 children (response rate 97.7%). These give overall response rates (household response rate times individual response rate) for the women's interview of 92.5% and for the children's interview of 94.1%.",

        sampling_error_estimates = "Estimates from a sample survey are affected by two types of errors: 1) non-sampling errors and 2) sampling errors. Non-sampling errors are the results of mistakes made in the implementation of data collection and data processing. Numerous efforts were made during implementation of the 2005-2006 MICS to minimize this type of error; however, non-sampling errors are impossible to avoid and difficult to evaluate statistically. If the sample of respondents had been a simple random sample, it would have been possible to use straightforward formulae for calculating sampling errors. However, the 2005-2006 MICS sample is the result of a multi-stage stratified design, and consequently needs to use more complex formulae. The SPSS complex samples module has been used to calculate sampling errors for the 2005-2006 MICS. This module uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. This method is documented in the SPSS file CSDescriptives.pdf found under the Help, Algorithms options in SPSS.
Sampling errors have been calculated for a select set of statistics (all of which are proportions due to the limitations of the Taylor linearization method) for the national sample, urban and rural areas, and for each of the five regions. For each statistic, the following are presented: the estimate, its standard error, the coefficient of variation (or relative error - the ratio between the standard error and the estimate), the design effect, and the square root design effect (DEFT - the ratio between the standard error using the given sample design and the standard error that would result if a simple random sample had been used), as well as the 95 percent confidence intervals (+/-2 standard errors). Details of the sampling errors are presented in the sampling errors appendix to the report and in the sampling errors table presented in the external resources.",

        data_appraisal = "A series of data quality tables and graphs are available to review the quality of the data and include the following:
- Age distribution of the household population
- Age distribution of eligible women and interviewed women
- Age distribution of eligible children and children for whom the mother or caretaker was interviewed
- Age distribution of children under age 5 by 3 month groups
- Age and period ratios at boundaries of eligibility
- Percent of observations with missing information on selected variables
- Presence of mother in the household and person interviewed for the under 5 questionnaire
- School attendance by single year of age
- Sex ratio at birth among children ever born, surviving and dead by age of respondent
- Distribution of women by time since last birth
- Scatter plot of weight by height, weight by age and height by age
- Graph of male and female population by single years of age
- Population pyramid
The results of each of these data quality tables are shown in the appendix of the final report.
The general rule for presentation of missing data in the final report tabulations is that a column is presented for missing data if the percentage of cases with missing data is 1% or more. Cases with missing data on the background characteristics (e.g. education) are included in the tables, but the missing data rows are suppressed and noted at the bottom of the tables in the report."
      )
    )
  ),
  # ...
)
```
study_class
[Optional ; Repeatable ; String] This element can be used to give the data archive’s class or study status number, which indicates the processing status of the study. But it can also be used as an element to indicate the type of study, based on a controlled vocabulary. The element is repeatable, allowing one study to belong to more than one class. Note that in the API description (see screenshot above), the element is described as having type “null”, not {}. This is due to the fact that the element can be entered either as a list (repeatable element) or as a string.
data_processing
[Optional ; Repeatable]

```json
"data_processing": [
  {
    "type": "string",
    "description": "string"
  }
]
```
This element is used to describe how data were electronically captured (e.g., entered in the field, in a centralized manner by data entry clerks, captured electronically using tablets and a CAPI application, via web forms, etc.). Information on devices and software used for data capture can also be provided here. Other data processing procedures not captured elsewhere in the documentation can be described here as well (tabulation, etc.).
- type
[Optional ; Not repeatable ; String]
The type attribute supports better classification of this activity, including the optional use of a controlled vocabulary. The vocabulary could include options like “data capture”, “data validation”, “variable derivation”, “tabulation”, “data visualizations”, “anonymization”, “documentation”, etc.
- description
[Optional ; Repeatable ; String]
A description of a data processing task.
coding_instructions
[Optional ; Repeatable]
The coding_instructions elements can be used to describe specific coding instructions used in data processing, cleaning, or tabulation. Providing this information may however be complex and very tedious for datasets with a significant number of variables, where hundreds of commands are used to process the data. An alternative option, preferable in many cases, will be to publish reproducible data editing, tabulation and analysis scripts together with the data, as related resources.

```json
"coding_instructions": [
  {
    "related_processes": "string",
    "type": "string",
    "txt": "string",
    "command": "string",
    "formal_language": "string"
  }
]
```
related_processes
[Optional ; Not repeatable ; String]
The related_processes element links a coding instruction to one or more processes such as “data editing”, “recoding”, “imputations and derivations”, “tabulation”, etc.
type
[Optional ; Not repeatable ; String]
txt
[Optional ; Not repeatable ; String]
command
[Optional ; Not repeatable ; String]
formal_language
[Optional ; Not repeatable ; String]

Example:

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ),
    study_development = list(
      # ... ),

    method = list(
      # ... ,
      study_class = "",

      data_processing = list(
        list(type = "Data capture",
             description = "Data collection was conducted using tablets and Survey Solutions software. Multiple quality controls and validations are embedded in the questionnaire."),
        list(type = "Batch data editing",
             description = "Data editing was conducted in batch using an R script, including techniques of hot deck, imputations, and recoding."),
        list(type = "Tabulation and visualizations",
             description = "The 25 tables and the visualizations published in the survey report were produced using Stata (script 'tabulation.do')."),
        list(type = "Anonymization",
             description = "An anonymized version of the dataset, published as a public use file, was created using the R package sdcMicro.")
      ),

      coding_instructions = list(
        list(related_processes = "",
             type = "",
             txt = "Suppression of observations with ...",
             command = "",
             formal_language = "Stata"),
        list(related_processes = "",
             type = "",
             txt = "Top coding age",
             command = "",
             formal_language = "Stata"),
        list(related_processes = "",
             type = "",
             txt = "",
             command = "",
             formal_language = "Stata")
      )
    )
  ),
  # ...
)
```
data_access
[Optional ; Not Repeatable]
This section describes the access conditions and terms of use for the dataset. This set of elements should be used when the access conditions are well-defined and unlikely to change. An alternative option is to document the terms of use in the catalog where the data will be published, instead of “freezing” them in a metadata file.

```json
"data_access": {
  "dataset_availability": {
    "access_place": "string",
    "access_place_url": "string",
    "original_archive": "string",
    "status": "string",
    "coll_size": "string",
    "complete": "string",
    "file_quantity": "string",
    "notes": "string"
  },
  "dataset_use": {}
}
```
dataset_availability
[Optional ; Not Repeatable]
Information on the availability and storage of the dataset.
access_place
[Optional ; Not repeatable ; String]
access_place_url
[Optional ; Not repeatable ; String]
original_archive
[Optional ; Not repeatable ; String]
Note that the schema also contains an element provenance, which is not part of the DDI, that can be used to document the origin of a dataset.
status
[Optional ; Not repeatable ; String]
coll_size
[Optional ; Not repeatable ; String]
complete
[Optional ; Not repeatable ; String]
file_quantity
[Optional ; Not repeatable ; String]
notes
[Optional ; Not repeatable ; String]

Example:

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ),
    study_development = list(
      # ... ),
    method = list(
      # ...),

    data_access = list(

      dataset_availability = list(
        access_place = "World Bank Microdata Library",
        access_place_url = "http://microdata.worldbank.org",
        status = "Available for public use",
        coll_size = "4 data files + machine-readable questionnaire and report (2 PDF files) + data editing script (1 Stata do file).",
        complete = "The variables 'latitude' and 'longitude' (GPS location of respondents) are not included, for confidentiality reasons.",
        file_quantity = "7"
      ),
      # ...
    )
  ),
  # ...
)
```
dataset_use
[Optional ; Not Repeatable]
Information on the terms of use for the study dataset.

```json
"dataset_use": {
  "conf_dec": [
    {
      "txt": "string",
      "required": "string",
      "form_url": "string",
      "form_id": "string"
    }
  ],
  "spec_perm": [
    {
      "txt": "string",
      "required": "string",
      "form_url": "string",
      "form_id": "string"
    }
  ],
  "restrictions": "string",
  "contact": [
    {
      "name": "string",
      "affiliation": "string",
      "uri": "string",
      "email": "string"
    }
  ],
  "cit_req": "string",
  "deposit_req": "string",
  "conditions": "string",
  "disclaimer": "string"
}
```
conf_dec
[Optional ; Repeatable]
This element is used to determine if the signing of a confidentiality declaration is needed to access a resource. We may indicate here what Affidavit of Confidentiality must be signed before the data can be accessed. Another option is to include this information in the next element (Access conditions). If there is no confidentiality issue, this field can be left blank.
txt
[Optional ; Not repeatable ; String]
The text of the confidentiality declaration (this information can also be provided as part of the element Access condition). An example of statement could be the following: “Confidentiality of respondents is guaranteed by Articles N to NN of the National Statistics Act of [date].” Before being granted access to the dataset, all users have to formally agree to these terms.
required
[Optional ; Not repeatable ; String]
form_url
[Optional ; Not repeatable ; String]
The form_url element is used to provide a link to an online confidentiality declaration form.
form_id
[Optional ; Not repeatable ; String]
spec_perm
[Optional ; Repeatable]
This element is used to determine if any special permissions are required to access a resource.
txt
[Optional ; Not repeatable ; String]
required
[Optional ; Not repeatable ; String]
The required element is used to aid machine processing of this element. The default specification is “yes”.
form_url
[Optional ; Not repeatable ; String]
The form_url element is used to provide a link to a special on-line permissions form.
form_id
[Optional ; Not repeatable ; String]
restrictions
[Optional ; Not repeatable ; String]
Any restrictions on access to or use of the collection, such as privacy certification or distribution restrictions, should be indicated here. These can be restrictions applied by the author, producer, or distributor of the data. This element can for example contain a statement (extracted from the DDI documentation) like: “In preparing the data file(s) for this collection, the National Center for Health Statistics (NCHS) has removed direct identifiers and characteristics that might lead to identification of data subjects. As an additional precaution NCHS requires, under Section 308(d) of the Public Health Service Act (42 U.S.C. 242m), that data collected by NCHS not be used for any purpose other than statistical analysis and reporting. NCHS further requires that analysts not use the data to learn the identity of any persons or establishments and that the director of NCHS be notified if any identities are inadvertently discovered. Users ordering data are expected to adhere to these restrictions.”
contact
[Optional ; Repeatable]
Users of the data may need further clarification and information on the terms of use and conditions to access the data. This set of elements is used to identify the contact persons who can serve as resource persons regarding problems or questions raised by the user community.
name
[Optional ; Not repeatable ; String]
affiliation
[Optional ; Not repeatable ; String]
uri
[Optional ; Not repeatable ; String]
email
[Optional ; Not repeatable ; String]
The email element is used to indicate an email address for the contact individual mentioned in name. Ideally, a generic email address should be provided. It is easy to configure a mail server in such a way that all messages sent to the generic email address are automatically forwarded to designated staff members.
cit_req
[Optional ; Not repeatable ; String]
A citation requirement that indicates the way the dataset should be referenced when cited in any publication. Providing a citation requirement will guarantee that the data producer gets proper credit, and that results of analyses can be linked to the proper version of the dataset. The data access policy should explicitly mention the obligation to comply with the citation requirement. The citation should include at least the primary investigator, the name and abbreviation of the dataset, the reference year, and the version number. Also include a website where the data, or information on the data, is made available by the official data depositor. Ideally, the citation requirement will include a DOI (see the DataCite website for recommendations).
deposit_req
[Optional ; Not repeatable ; String]
Information regarding data users’ responsibility for informing archives of their use of data, by providing citations to the published work or copies of the manuscripts.
conditions
[Optional ; Not repeatable ; String]
Indicates any additional information that will assist the user in understanding the access and use conditions of the data collection.
disclaimer
[Optional ; Not repeatable ; String]
A disclaimer limits the liability that the data producer or data custodian has regarding the use of the data. A standard legal statement should be used for all datasets from a same agency. The following formulation could be used: “The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.”
Example
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ),
    study_development = list(
      # ... ),
    method = list(
      # ...),

    data_access = list(
      # ...,

      dataset_use = list(

        conf_dec = list(
          list(txt = "Confidentiality of respondents is guaranteed by Articles N to NN of the National Statistics Act. All data users are required to sign an affidavit of confidentiality.",
               required = "yes",
               form_url = "http://datalibrary.org/affidavit",
               form_id = "F01_AC_v01")
        ),

        spec_perm = list(
          list(txt = "Permission will only be granted to residents of [country].",
               required = "yes",
               form_url = "http://datalibrary.org/residency",
               form_id = "F02_RS_v01")
        ),

        restrictions = "Data will only be shared with users who are registered to the National Data Center and have successfully completed the training on data privacy and responsible data use. Only users who legally reside in [country] will be authorized to access the data.",

        contact = list(
          list(name = "Head, Data Processing Division",
               affiliation = "National Statistics Office",
               uri = "www.cso.org/databank",
               email = "dataproc@cso.org")
        ),

        cit_req = "National Statistics Office of Popstan. Multiple Indicators Cluster Survey 2000 (MICS 2000). Version 01 of the scientific use dataset (April 2001). DOI: XXX-XXXX-XXX",

        deposit_req = "To provide funding agencies with essential information about use of archival resources and to facilitate the exchange of information among researchers and development practitioners, users of the Microdata Library data are requested to send to the Microdata Library bibliographic citations for, or copies of, each completed manuscript or thesis abstract. Please indicate in a cover letter which data were used.",

        disclaimer = "The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses."
      )
    )
  ),
  # ...
)
```
notes
[Optional ; Not repeatable ; String]
This element is provided to capture any additional information related to data access that could not be provided in the other elements of the section data_access.
data_files
[Optional ; Repeatable]
The data_files section of the DDI contains the elements needed to describe each of the data files that form the study dataset. These are elements at the file level; they do not include information at the variable level, which is contained in a separate section of the standard.
"data_files": [
+{
+ "file_id": "string",
+ "file_name": "string",
+ "file_type": "string",
+ "description": "string",
+ "case_count": 0,
+ "var_count": 0,
+ "producer": "string",
+ "data_checks": "string",
+ "missing_data": "string",
+ "version": "string",
+ "notes": "string"
+ }
+ ]
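A data_files entry can be built in R following the same pattern as the other sections; a minimal sketch (the file name, label, and counts below are invented for illustration):

```r
# Hypothetical example of one entry in the 'data_files' section
data_files <- list(
  list(
    file_id     = "HH.dta",                # electronic file name (invented)
    file_name   = "Household-level file",  # short label, not the file name
    file_type   = "Stata 15 data file",
    description = "One record per household; includes the weighting coefficient HHWEIGHT.",
    case_count  = 4811,
    var_count   = 52,
    producer    = "National Statistics Office"
  )
)
```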
file_id
[Optional ; Not repeatable ; String]
+A unique file identifier (within the metadata document, not necessarily within a catalog). This will typically be the electronic file name.
file_name
[Optional ; Not repeatable ; String]
+This is not the name of the electronic file (which is provided in the previous element). It is a short title (label) that will help distinguish a particular file/part from other files/parts in the dataset.
file_type
[Optional ; Not repeatable ; String]
The type of data file: for example, raw data (ASCII), or a software-dependent file such as a SAS, Stata, or SPSS data file. Provide specific information (e.g., Stata 10 or Stata 15, SPSS Windows or SPSS Export). Note that in an on-line catalog, data can be made available in multiple formats; in such cases, the file_type element is not useful.
description
[Optional ; Not repeatable ; String]
The file_id and file_name elements provide limited information on the content of the file. The description element is used to provide a more detailed description of the file content. This description should clearly distinguish collected variables from derived variables. It is also useful to indicate the availability in the data file of particular variables, such as the weighting coefficients. If the file contains derived variables, it is good practice to refer to the computer program that generated them.
case_count
[Optional ; Numeric ; Not Repeatable]
+Number of cases or observations in the data file. The value is 0 by default.
var_count
[Optional ; Numeric ; Not Repeatable]
+Number of variables in the data file. The value is 0 by default.
producer
[Optional ; Not repeatable ; String]
The name of the agency that produced the data file. Most data files will have been produced by the survey's primary investigator. In some cases, however, auxiliary or derived files from other producers may be released with a dataset; for example, a file containing derived variables generated by a researcher.
data_checks
[Optional ; Not repeatable ; String]
Use this element to provide information about the types of checks and operations that have been performed on the data file to make sure that the data are as correct as possible, e.g., consistency checking, wildcode checking, etc. Note that the information included here should be specific to the data file. Information about data processing checks carried out on the data collection (study) as a whole should be provided in the Data editing element at the study level. You may also provide here a reference to an external resource that contains the specifications for the data processing checks (that same information may also be provided in the Data editing field of the Study Description section).
missing_data
[Optional ; Not repeatable ; String]
+A description of missing data (number of missing cases, cause of missing values, etc.)
version
[Optional ; Not repeatable ; String]
+The version of the data file. A data file may undergo various changes and modifications. File specific versions can be tracked in this element. This field will in most cases be left empty.
notes
[Optional ; Not repeatable ; String]
+This field aims to provide information on the specific data file not covered elsewhere.
Example for the UNICEF MICS dataset:

```r
my_ddi <- list(

  doc_desc = list(
    # ...
  ),

  study_desc = list(
    # ...
  ),

  data_files = list(

    list(file_id = "HHS2020_S01",
         file_name = "Household roster (demographics)",
         description = "The file contains the demographic information on all individuals in the sample",
         case_count = 10000,
         var_count = 12,
         producer = "National Statistics Office",
         missing_data = "Values of age outside valid range (0 to 100) have been replaced with 'missing'.",
         version = "1.0 (edited, not anonymized)",
         notes = ""
    ),

    list(file_id = "HHS2020_S03A",
         file_name = "Section 3A - Education",
         description = "The file contains data related to section 3A of the household survey questionnaire (Education of household members aged 6 to 24 years). It also contains the weighting coefficient, and various recoded variables on levels of education.",
         case_count = 2500,
         var_count = 17,
         producer = "National Statistics Office",
         data_checks = "Education level (variable EDUCLEV) has been edited using hotdeck imputation when the reported value was out of acceptable range considering the AGE of the person.",
         version = "1.0 (edited, not anonymized)"
    ),

    list(file_id = "HHS2020_CONSUMPTION",
         file_name = "Annualized household consumption by products and services",
         description = "The file contains derived data on household consumption, annualized and aggregated by category of products and services. The file also contains a regional price deflator variable and the household weighting coefficient. The file was generated using a Stata program named 'cons_aggregate.do'.",
         case_count = 42000,
         var_count = 15,
         producer = "National Statistics Office",
         data_checks = "Outliers have been detected (> median + 5*IQR) for each product/service; fixed by imputation (regression model).",
         missing_data = "Missing consumption values are treated as 0",
         version = "1.0 (edited, not anonymized)"
    )

  ),
  # ...
)
```
The DDI Codebook metadata standard provides multiple elements to document variables contained in a micro-dataset. There is much value in documenting variables:

- it makes the data usable by providing users with a detailed data dictionary;
- it makes the data more discoverable, as all keywords included in the description of variables are indexed in data catalogs;
- it allows users to assess the comparability of data across sources;
- it enables the development of question banks; and
- it adds transparency and credibility to the data, especially when derived or imputed variables are documented.

All possible effort should thus be made to generate and publish detailed variable-level documentation.
A micro-dataset can contain many variables; some survey datasets include hundreds or even thousands of variables. Documenting variables can thus be a tedious process. The use of a specialized DDI metadata editor can make this process considerably more efficient, as much of the variable-level metadata can be automatically extracted from the electronic data files. Data files in Stata, SPSS, or other common formats include variable names, variable and value labels, and in some cases notes that can be extracted. The variable-level summary statistics that are part of the metadata can also be generated from the data files. Further, software applications used for capturing data, like Survey Solutions from the World Bank or CsPro from the US Census Bureau, can export variable metadata, including the variable names, the variable and value labels, and possibly the formulation of questions and the interviewer instructions when the software is used for conducting computer-assisted personal interviews (CAPI). Survey Solutions and CsPro can export metadata in multiple formats, including DDI Codebook. Multiple options thus exist to make the documentation of variables efficient; as much as possible, tedious manual curation of variable-level information should be avoided.
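As an illustration, much of the variable-level metadata can be assembled programmatically once the data are loaded in memory (packages such as pyreadstat can also recover variable and value labels from Stata or SPSS files). The Python sketch below uses a hypothetical helper and a crude heuristic to distinguish discrete from continuous variables:

```python
def build_variable_metadata(rows, file_id):
    """Derive a minimal DDI 'variables' entry for each column in `rows`
    (a list of dicts, one dict per observation). Hypothetical helper."""
    variables = []
    for i, name in enumerate(rows[0].keys(), start=1):
        values = [r[name] for r in rows if r[name] is not None]
        numeric = all(isinstance(v, (int, float)) for v in values)
        # crude heuristic: numeric columns with many distinct values are
        # treated as continuous, everything else as discrete
        intrvl = "continuous" if numeric and len(set(values)) > 10 else "discrete"
        variables.append({
            "file_id": file_id,
            "vid": f"V{i}",    # system-generated sequential identifier
            "name": name,      # copied exactly as found in the data file
            "var_intrvl": intrvl,
        })
    return variables

roster = [{"hh_id": 1, "age": 34, "sex": "F"},
          {"hh_id": 2, "age": 51, "sex": "M"}]
meta = build_variable_metadata(roster, file_id="HHS2020_S01")
```

Labels, questions, and universes would then be added by the curator; only the mechanical parts of the documentation are automated here.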
+variables
[Optional ; Repeatable]
+The metadata elements we describe below apply independently to each variable in the dataset.
```json
"variables": [
  {
    "file_id": "string",
    "vid": "string",
    "name": "string",
    "labl": "string",
    "var_intrvl": "discrete",
    "var_dcml": "string",
    "var_wgt": 0,
    "loc_start_pos": 0,
    "loc_end_pos": 0,
    "loc_width": 0,
    "loc_rec_seg_no": 0,
    "var_imputation": "string",
    "var_derivation": "string",
    "var_security": "string",
    "var_respunit": "string",
    "var_qstn_preqtxt": "string",
    "var_qstn_qstnlit": "string",
    "var_qstn_postqtxt": "string",
    "var_forward": "string",
    "var_backward": "string",
    "var_qstn_ivuinstr": "string",
    "var_universe": "string",
    "var_sumstat": [],
    "var_txt": "string",
    "var_catgry": [],
    "var_std_catgry": {},
    "var_codinstr": "string",
    "var_concept": [],
    "var_format": {},
    "var_notes": "string"
  }
]
```
file_id
[Required ; Not repeatable ; String]
A dataset can be composed of multiple data files. The file_id is the identifier of the data file that contains the variable being documented. It should correspond to a file_id listed in the data_files section of the DDI.
vid
[Required ; Not repeatable ; String]
+A unique identifier given to the variable. This can be a system-generated ID, such as a sequential number within each data file. The vid
is not the variable name.
name
[Required ; Not repeatable ; String]
+The name of the variable in the data file. The name
should be entered exactly as found in the data file (not abbreviated or converted to upper or lower cases, as some software applications are case-sensitive). This information can be programmatically extracted from the data file. The variable name is limited to eight characters in some statistical analysis software such as SAS or SPSS.
labl
[Optional ; Not repeatable ; String]
+All variables should have a label that provides a short but clear indication of what the variable contains. Ideally, all variables in a data file will have a different label. File formats like Stata or SPSS often contain variable labels. Variable labels can also be found in data dictionaries in software applications like Survey Solutions or CsPro. Avoid using the question itself as a label (specific elements are available to capture the literal question text; see below). Think of a label as what you would want to see in a tabulation of the variables. Keep in mind that software applications like Stata and others impose a limit to the number of characters in a label (often, 80).
var_intrvl
[Optional ; Not repeatable ; String]
+This element indicates whether the intervals between values for the variable are discrete
or continuous
.
var_dcml
[Optional ; Not repeatable ; String]
+This element refers to the number of decimal points in the values of the variable.
var_wgt
[Optional ; Not repeatable ; Numeric]
This element, which applies to datasets from sample surveys, indicates whether the variable is a sample weight (value “1”) or not (value “0”). Sample weights play an important role in the calculation of summary statistics and sampling errors, and should therefore be flagged.
loc_start_pos
[Optional ; Not repeatable ; Numeric]
+The starting position of the variable when the data are saved in an ASCII fixed-format data file.
loc_end_pos
[Optional ; Not repeatable ; Numeric]
+The end position of the variable when the data are saved in an ASCII fixed-format data file.
loc_width
[Optional ; Not repeatable ; Numeric]
+The length of the variable (the maximum number of characters used for its values) in an ASCII fixed-format data file.
loc_rec_seg_no
[Optional ; Not repeatable ; Numeric]
+Record segment number, deck or card number the variable is located on.
var_imputation
[Optional ; Not repeatable ; String]
+Imputation is the process of estimating values for variables when a value is missing. The element is used to describe the procedure used to impute values when missing.
var_derivation
[Optional ; Not repeatable ; String]
+Used only in the case of a derived variable, this element provides both a description of how the derivation was performed and the command used to generate the derived variable, as well as a specification of the other variables in the study used to generate the derivation. The var_derivation
element is used to provide a brief description of this process. As full transparency in derivation processes is critical to build trust and ensure replicability or reproducibility, the information captured in this element will often not be sufficient. A reference to a document and/or computer program can in such case be provided in this element, and the document/scripts provided as external resources. For example, a variable “TOT_EXP” containing the annualized total household expenditure obtained from a household budget survey may be the result of a complex process of aggregation, de-seasonalization, and more. In such case, the information provided in the var_derivation
element could be: “TOT_EXP was obtained by aggregating expenditure data on all goods and services, available in sections 4 to 6 of the household questionnaire. It contains imputed rental values for owner-occupied dwellings. The values have been deflated by a regional price deflator available in variable REG_DEF. All values are in local currency. Outliers have been fixed by imputation. Details on the calculations are available in Appendix 2 of the Report on Data Processing, and in the Stata program [generate_hh_exp_total.do].”
var_security
[Optional ; Not repeatable ; String]
+This element is used to provide information regarding levels of access, e.g., public, subscriber, need to know.
var_respunit
[Optional ; Not repeatable ; String]
+Provides information regarding who provided the information contained within the variable, e.g., head of household, respondent, proxy, interviewer.
var_qstn_preqtxt
[Optional ; Not repeatable ; String]
+The pre-question texts are the instructions provided to the interviewers and printed in the questionnaire before the literal question. This does not apply to all variables. Do not confuse this with instructions provided in the interviewer’s manual.
var_qstn_qstnlit
[Optional ; Not repeatable ; String]
The literal question is the full text of the question as the enumerator is expected to ask it when conducting the interview. This does not apply to all variables (for example, it does not apply to derived variables).
var_qstn_postqtxt
[Optional ; Not repeatable ; String]
The post-question texts are instructions provided to the interviewers, printed in the questionnaire after the literal question. Post-question texts can be used to document skip instructions provided in the questionnaire. This does not apply to all variables. Do not confuse this with instructions provided in the interviewer's manual.
+
+With the previous three elements, one should be able to understand how the question was formulated in a questionnaire. In the example below (extracted from the UNICEF Malawi 2006 MICS survey questionnaire), we find:
- a pre-question: “Ask this question ONLY ONCE for each mother/caretaker (even if she has more children).”
- a literal question: “Sometimes children have severe illnesses and should be taken immediately to a health facility. What types of symptoms would cause you to take your child to a health facility right away?”
- a post-question: “Keep asking for more signs or symptoms until the mother/caretaker cannot recall any additional symptoms. Circle all symptoms mentioned. DO NOT PROMPT WITH ANY SUGGESTIONS”
var_forward
[Optional ; Not repeatable ; String]
+Contains a reference to the IDs of possible following questions. This can be used to document forward skip instructions.
var_backward
[Optional ; Not repeatable ; String]
+Contains a reference to IDs of possible preceding questions. This can be used to document backward skip instructions.
var_qstn_ivuinstr
[Optional ; Not repeatable ; String]
+Specific instructions to the individual conducting an interview. The content will typically be entered by copy/pasting instructions in the interviewer’s manual (or in the CAPI application). In cases where the same instructions relate to multiple variables, repeat the same information in the metadata for all these variables.
NOTE: In earlier versions of the documentation, due to a typo, the element was named var_qstn_ivulnstr.
var_universe
[Optional ; Not repeatable ; String]
+The universe at the variable level defines the population the question applied to. It reflects skip patterns in a questionnaire. This information can typically be copy/pasted from the survey questionnaire. Try to be as specific as possible. This information is critical for the analyst, as it explains why missing values may be found in a variable. In the example below (from the Malawi MICS 2006 survey questionnaire), the universe for questions ED1 to ED2 will be “Household members age 5 and above”, and the universe for Question ED3 will be “Household members age 5 and above who ever attended school or pre-school”.
var_sumstat
[Optional ; Repeatable]
+The DDI metadata standard provides multiple elements to capture various summary statistics such as minimum, maximum, or mean values (weighted and un-weighted) for each variable (note that frequency statistics for categorical variables are reported in var_catgry
described below). The content of the var_sumstat
section will be easy to fill out programmatically (using R or Python) or using a specialized DDI metadata editor, which can read the data file and generate the summary statistics.
```json
"var_sumstat": [
  {
    "type": "string",
    "value": null,
    "wgtd": "string"
  }
]
```
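As an illustration, the var_sumstat entries can be computed in a few lines of Python using only the standard library. The helper name is hypothetical; the short type codes (vald for valid cases, invd for invalid cases, min, max, mean) follow common DDI Codebook usage:

```python
import statistics

def var_sumstat(values):
    """Build unweighted DDI var_sumstat entries for a list of raw values;
    None represents a missing (invalid) case. Hypothetical helper."""
    valid = [v for v in values if v is not None]
    return [
        {"type": "vald", "value": len(valid), "wgtd": ""},
        {"type": "invd", "value": len(values) - len(valid), "wgtd": ""},
        {"type": "min",  "value": min(valid), "wgtd": ""},
        {"type": "max",  "value": max(valid), "wgtd": ""},
        {"type": "mean", "value": round(statistics.mean(valid), 2), "wgtd": ""},
    ]

stats = var_sumstat([15, 22, None, 40, 3])
```

Weighted versions of the same statistics would be computed with the sample weight variable and flagged with "weighted" in the wgtd field.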
type
[Required ; Not repeatable ; String]
+The type of statistics being shown: mean, median, mode, valid cases, invalid cases, minimum, maximum, or standard deviation.
value
[Required ; Not repeatable ; Numeric]
+The value of the summary statistics mentioned in type
.
wgtd
[Required ; Not repeatable ; String]
+Indicates whether the statistics reported in value
are weighted or not (for variables in sample surveys). Enter “weighted” if weighted, otherwise leave this element empty.
var_txt
[Optional ; Not repeatable ; String]
+This element provides a space to describe the variable in detail. Not all variables require a definition.
var_catgry
[Optional ; Repeatable]
+Variable categories are the lists of codes (and their meaning) that apply to a categorical variable. This block of elements is used to describe the categories (code and label) and optionally capture their weighted and/or un-weighted frequencies.
```json
"var_catgry": [
  {
    "value": "string",
    "label": "string",
    "stats": [
      {
        "type": "string",
        "value": null,
        "wgtd": "string"
      }
    ]
  }
]
```
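The category frequencies can likewise be generated programmatically. A Python sketch (hypothetical helper; weighted frequencies are optional and rounded for display):

```python
from collections import Counter

def var_catgry(codes, labels, weights=None):
    """Build DDI var_catgry entries, with unweighted (and optionally
    weighted) frequencies, for a categorical variable."""
    freq = Counter(codes)
    wfreq = Counter()
    if weights is not None:
        for code, w in zip(codes, weights):
            wfreq[code] += w
    entries = []
    for code in sorted(freq):
        stats = [{"type": "freq", "value": freq[code], "wgtd": ""}]
        if weights is not None:
            stats.append({"type": "freq",
                          "value": round(wfreq[code], 1),
                          "wgtd": "weighted"})
        entries.append({"value": str(code),
                        "label": labels.get(code, ""),
                        "stats": stats})
    return entries

sex = var_catgry([1, 2, 1, 1], {1: "Male", 2: "Female"},
                 weights=[1.2, 0.8, 1.0, 1.0])
```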
value
[Required ; Not repeatable ; String]
The value (code) of the category.

label
[Required ; Not repeatable ; String]
The label of the category identified in value.

stats
[Optional ; Repeatable]
The summary statistics (typically, the frequencies) for the category.

type
[Required ; Not repeatable ; String]
The type of statistic being shown, e.g. freq for frequency.

value
[Required ; Not repeatable ; Numeric]
The value of the statistic mentioned in type.

wgtd
[Optional ; Not repeatable ; String]
Indicates whether the statistics reported in value are weighted or not (for variables in sample surveys). Enter “weighted” if weighted, otherwise leave this element empty.

var_std_catgry
[Optional ; Not repeatable]

```json
"var_std_catgry": {
  "name": "string",
  "source": "string",
  "date": "string",
  "uri": "string"
}
```
name
[Required ; Not repeatable ; String]
+The name of the classification, e.g. “International Standard Industrial Classification of All Economic Activities (ISIC), Revision 4”
source
[Required ; Not repeatable ; String]
+The source of the classification, e.g. “United Nations”
date
[Required ; Not repeatable ; String]
+The version (typically a date) of the classification used for the study.
+
uri
[Required ; Not repeatable ; String]
+A URL to a website where an electronic copy and more information on the classification can be obtained.
var_codinstr
[Optional ; Not repeatable ; String]
+The coder instructions for the variable. These are any special instructions to those who converted information from one form to another (e.g., textual to numeric) for a particular variable.
var_concept
[Optional ; Repeatable]
+The general subject to which the parent element may be seen as pertaining. This element serves the same purpose as the keywords and topic classification elements, but at the variable description level.
```json
"var_concept": [
  {
    "title": "string",
    "vocab": "string",
    "uri": "string"
  }
]
```
title
[Optional ; Not repeatable ; String]
+The name (label) of the concept.
+
vocab
[Optional ; Not repeatable ; String]
The controlled vocabulary, if any, from which the concept title was taken.

uri
[Optional ; Not repeatable ; String]
The location of the controlled vocabulary mentioned in vocab.
var_format
[Optional ; Not repeatable]
+The technical format of the variable in question.
```json
"var_format": {
  "type": "string",
  "name": "string",
  "note": "string"
}
```
type
[Optional ; Not repeatable ; String]
+Indicates if the variable is numeric, fixed string, dynamic string, or date. Numeric variables are used to store any number, integer or floating point (decimals). A fixed string variable has a predefined length which enables the publisher to handle this data type more efficiently. Dynamic string variables can be used to store open-ended questions.
name
[Optional ; Not repeatable ; String]
Where applicable, the name of the particular, proprietary format used.
note
[Optional ; Not repeatable ; String]
+Additional information on the variable format.
var_notes
[Optional ; Not repeatable ; String]
+This element is provided to record any additional or auxiliary information related to the specific variable.
Example for two variables only:

```r
my_ddi <- list(

  doc_desc = list(
    # ...
  ),

  study_desc = list(
    # ...
  ),

  data_files = list(
    # ...
  ),

  variables = list(

    list(file_id = "",
         vid = "",
         name = "",
         labl = "Main occupation",
         var_intrvl = "discrete",
         var_imputation = "",
         var_respunit = "",
         var_qstn_preqtxt = "",
         var_qstn_qstnlit = "",
         var_qstn_postqtxt = "",
         var_qstn_ivuinstr = "",
         var_universe = "",
         var_sumstat = list(list(type = "", value = "", wgtd = "")),
         var_txt = "",
         var_forward = "",
         var_catgry = list(list(value = "",
                                label = "",
                                stats = list(list(type = "", value = "", wgtd = ""),
                                             list(type = "", value = "", wgtd = ""),
                                             list(type = "", value = "", wgtd = ""))),
                           list(value = "",
                                label = "",
                                stats = list(list(type = "", value = "", wgtd = ""),
                                             list(type = "", value = "", wgtd = ""),
                                             list(type = "", value = "", wgtd = "")))),
         var_std_catgry = list(),
         var_codinstr = "",
         var_concept = list(list(title = "", vocab = "", uri = "")),
         var_format = list(type = "numeric", name = "")
    ),

    list(file_id = "",
         vid = "",
         name = "V75_HH_CONS",
         labl = "Household total consumption",
         var_intrvl = "continuous",
         var_dcml = "",
         var_wgt = 0,
         var_imputation = "",
         var_derivation = "",
         var_security = "",
         var_respunit = "",
         var_qstn_preqtxt = "",
         var_qstn_qstnlit = "",
         var_qstn_postqtxt = "",
         var_qstn_ivuinstr = "",
         var_universe = "",
         var_sumstat = list(list(type = "", value = "", wgtd = "")),
         var_txt = "",
         var_codinstr = "",
         var_concept = list(list(title = "", vocab = "", uri = "")),
         var_format = list(type = "", name = "", note = ""),
         var_notes = ""
    )

  ),
  # ...
)
```
variable_groups
[Optional ; Repeatable]
In a dataset, variables are grouped by data file. For the convenience of users, the DDI allows data curators to also organize variables into “virtual” groups, by theme, type of respondent, or any other criteria. Grouping variables is optional and does not impact the way variables are stored in the data files. One variable can belong to more than one group, and a group of variables can contain variables from more than one data file. The variable groups do not have to cover all variables in the data files. Variable groups can also contain other variable groups.
```json
"variable_groups": [
  {
    "vgid": "string",
    "variables": "string",
    "variable_groups": "string",
    "group_type": "subject",
    "label": "string",
    "universe": "string",
    "notes": "string",
    "txt": "string",
    "definition": "string"
  }
]
```
vgid
[Optional ; Not repeatable ; String]
+A unique identifier (within the DDI metadata file) for the variable group.
variables
[Optional ; Not repeatable ; String]
The list of variables (variable identifiers - vid) in the group. Enter a list with items separated by a space, e.g. “V21 V22 V30”.
variable_groups
[Optional ; Not repeatable ; String]
The variable groups (vgid) that are embedded in this variable group. Enter a list with items separated by a space, e.g. “VG2 VG5”.
group_type
[Optional ; Not repeatable ; String]
+The type of grouping of the variables. A controlled vocabulary should be used. The DDI proposes the following vocabulary: {section, multipleResp, grid, display, repetition, subject, version, iteration, analysis, pragmatic, record, file, randomized, other
}. A description of the groups can be found in this document by W. Thomas, W. Block, R. Wozniak and J. Buysse.
label
[Optional ; Not repeatable ; String]
+A short description of the variable group.
universe
[Optional ; Not repeatable ; String]
+The universe can be a population of individuals, households, facilities, organizations, or others, which can be defined by any type of criteria (e.g., “adult males”, “private schools”, “small and medium-size enterprises”, etc.).
notes
[Optional ; Not repeatable ; String]
+Used to provide additional information about the variable group.
txt
[Optional ; Not repeatable ; String]
+A more detailed description of variable group than the one provided in label
.
definition
[Optional ; Not repeatable ; String]
+A brief rationale for the variable grouping.
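Because the variables and variable_groups elements hold space-separated lists of identifiers, a simple consistency check can catch references to variables that were never documented. A Python sketch (hypothetical helper):

```python
def check_variable_groups(groups, known_vids):
    """Return (group id, variable id) pairs for ids referenced in a group's
    space-separated `variables` list that are not documented variables."""
    missing = []
    for g in groups:
        for vid in g.get("variables", "").split():
            if vid not in known_vids:
                missing.append((g["vgid"], vid))
    return missing

issues = check_variable_groups(
    [{"vgid": "vg01", "variables": "V1 V2 V9"}],
    known_vids={"V1", "V2"})
```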
```r
my_ddi <- list(

  doc_desc = list(
    # ...
  ),

  study_desc = list(
    # ...
  ),

  data_files = list(
    # ...
  ),

  variables = list(
    # ...
  ),

  variable_groups = list(

    list(vgid = "vg01",
         variables = "",
         variable_groups = "",
         group_type = "subject",
         label = "",
         universe = "",
         notes = "",
         txt = "",
         definition = ""
    ),

    list(vgid = "vg02",
         variables = "",
         variable_groups = "",
         group_type = "subject",
         label = "",
         universe = "",
         notes = "",
         txt = "",
         definition = ""
    )

  ),
  # ...
)
```
provenance
[Optional ; Repeatable]
+Metadata can be programmatically harvested from external catalogs. The provenance
group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. These elements are NOT part of the DDI metadata standard.
```json
"provenance": [
  {
    "origin_description": {
      "harvest_date": "string",
      "altered": true,
      "base_url": "string",
      "identifier": "string",
      "date_stamp": "string",
      "metadata_namespace": "string"
    }
  }
]
```
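A harvesting script can assemble this block at harvest time. A minimal Python sketch (the function name and URL are illustrative):

```python
from datetime import datetime, timezone

def make_provenance(base_url, identifier, altered=True):
    """Assemble a provenance entry for metadata harvested from `base_url`.
    Hypothetical helper; field names follow the schema above."""
    return {
        "origin_description": {
            # harvest date and time in ISO 8601 format
            "harvest_date": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "altered": altered,        # were the harvested metadata modified?
            "base_url": base_url,      # source catalog / API endpoint
            "identifier": identifier,  # the study identifier in the source catalog
        }
    }

prov = make_provenance("https://example-catalog.org/api", "SRC-MICS-2000")
```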
origin_description
[Required ; Not repeatable]
+The origin_description
elements are used to describe when and from where metadata have been extracted or harvested.
harvest_date
[Required ; Not repeatable ; String]
The date and time the metadata were harvested, entered in ISO 8601 format.

altered
[Optional ; Not repeatable ; Boolean]
Indicates whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Study Description / Title Statement section) will be modified when published in a new catalog.

base_url
[Required ; Not repeatable ; String]
The URL from which the metadata were harvested.

identifier
[Optional ; Not repeatable ; String]
The unique identifier of the study (element idno) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.

date_stamp
[Optional ; Not repeatable ; String]
The datestamp (in UTC date format) of the metadata record in the originating repository.

metadata_namespace
[Optional ; Not repeatable ; String]
The XML namespace of the harvested metadata, identifying the metadata standard used by the source catalog.

lda_topics
[Optional ; Not repeatable]
```json
"lda_topics": [
  {
    "model_info": [
      {
        "source": "string",
        "author": "string",
        "version": "string",
        "model_id": "string",
        "nb_topics": 0,
        "description": "string",
        "corpus": "string",
        "uri": "string"
      }
    ],
    "topic_description": [
      {
        "topic_id": null,
        "topic_score": null,
        "topic_label": "string",
        "topic_words": [
          {
            "word": "string",
            "word_weight": 0
          }
        ]
      }
    ]
  }
]
```
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document (in this case, the “document” is a compilation of elements from the dataset metadata) can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
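The inferred (topic, share) pairs returned by a trained model can be converted into the structure expected by the lda_topics element, applying a threshold on the topic shares. A Python sketch (hypothetical helper; scores are stored here as percentages):

```python
def topic_description(inferred, labels, min_share=0.05):
    """Convert inferred (topic_id, share) pairs into topic_description
    entries, keeping only topics above a minimum share. Hypothetical helper."""
    kept = sorted((ts for ts in inferred if ts[1] >= min_share),
                  key=lambda ts: ts[1], reverse=True)
    return [{"topic_id": tid,
             "topic_score": round(share * 100),  # share stored as a percentage
             "topic_label": labels.get(tid, ""),
             "topic_words": []}                  # filled with the model's top words
            for tid, share in kept]

topics = topic_description(
    [("topic_27", 0.32), ("topic_8", 0.24), ("topic_3", 0.01)],
    {"topic_27": "Education", "topic_8": "Gender"})
```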
+
+The metadata element lda_topics
is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition. The lda_topics
element is NOT part of the DDI Codebook standard.
Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the lda_topics
elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.
The lda_topics
element includes the following metadata fields:
model_info
[Optional ; Not repeatable]
Information on the LDA model used to generate the topic composition.

source
[Optional ; Not repeatable ; String]
The source of the model (e.g., the organization that trained it).

author
[Optional ; Not repeatable ; String]
The author(s) of the model.

version
[Optional ; Not repeatable ; String]
The version of the model, which could be defined by a date or a version number.

model_id
[Optional ; Not repeatable ; String]
The unique identifier given to the model.

nb_topics
[Optional ; Not repeatable ; Numeric]
The number of topics in the model (a parameter set when the model is trained).

description
[Optional ; Not repeatable ; String]
A brief description of the model.

corpus
[Optional ; Not repeatable ; String]
A brief description of the corpus on which the model was trained.

uri
[Optional ; Not repeatable ; String]
A link to a web page where additional information on the model is available.

topic_description
[Optional ; Repeatable]
The topic composition of the document.

topic_id
[Optional ; Not repeatable ; String]
The identifier of the topic.

topic_score
[Optional ; Not repeatable ; Numeric]
The share of the topic in the document.

topic_label
[Optional ; Not repeatable ; String]
The label of the topic, if any.

topic_words
[Optional ; Not repeatable]
The list of keywords defining the topic.

word
[Optional ; Not repeatable ; String]
A keyword in the list of words defining the topic.

word_weight
[Optional ; Not repeatable ; Numeric]
The weight of the keyword in the topic.
```r
lda_topics = list(

  list(

    model_info = list(
      list(source = "World Bank, Development Data Group",
           author = "A.S.",
           version = "2021-06-22",
           model_id = "Mallet_WB_75",
           nb_topics = 75,
           description = "LDA model, 75 topics, trained on Mallet",
           corpus = "World Bank Documents and Reports (1950-2021)",
           uri = "")
    ),

    topic_description = list(

      list(topic_id = "topic_27",
           topic_score = 32,
           topic_label = "Education",
           topic_words = list(list(word = "school", word_weight = ""),
                              list(word = "teacher", word_weight = ""),
                              list(word = "student", word_weight = ""),
                              list(word = "education", word_weight = ""),
                              list(word = "grade", word_weight = ""))),

      list(topic_id = "topic_8",
           topic_score = 24,
           topic_label = "Gender",
           topic_words = list(list(word = "women", word_weight = ""),
                              list(word = "gender", word_weight = ""),
                              list(word = "man", word_weight = ""),
                              list(word = "female", word_weight = ""),
                              list(word = "male", word_weight = ""))),

      list(topic_id = "topic_39",
           topic_score = 22,
           topic_label = "Forced displacement",
           topic_words = list(list(word = "refugee", word_weight = ""),
                              list(word = "programme", word_weight = ""),
                              list(word = "country", word_weight = ""),
                              list(word = "migration", word_weight = ""),
                              list(word = "migrant", word_weight = ""))),

      list(topic_id = "topic_40",
           topic_score = 11,
           topic_label = "Development policies",
           topic_words = list(list(word = "development", word_weight = ""),
                              list(word = "policy", word_weight = ""),
                              list(word = "national", word_weight = ""),
                              list(word = "strategy", word_weight = ""),
                              list(word = "activity", word_weight = "")))

    )
  )
)
```
embeddings
[Optional ; Repeatable]
In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. In this case, the text would be a compilation of selected elements of the dataset metadata. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model. The embeddings element is NOT part of the DDI Codebook standard.
"embeddings": [
+{
+ "id": "string",
+ "description": "string",
+ "date": "string",
+ "vector": { }
+ }
+ ]
The embeddings element contains four metadata fields:
- id [Optional ; Not repeatable ; String]
A unique identifier of the word embedding model used to generate the vector.
- description [Optional ; Not repeatable ; String]
A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- date [Optional ; Not repeatable ; String]
The date the model was trained (or a version date for the model).
- vector [Required ; Not repeatable ; Object]
The numeric vector representing the document, provided as an object (array or string). For example: [1,4,3,5,7,9]
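As a toy illustration of how such a vector might be produced (this is not a real embedding model: the words, vectors, and dimensionality below are invented, and real vectors typically have 100 or more dimensions), a document vector can be computed by averaging the vectors of the document's words:

```r
# Toy sketch: average the (invented) word vectors of a document
word_vectors <- list(
  school  = c(0.2, 0.6, 0.1),
  teacher = c(0.4, 0.8, 0.3)
)
doc_words  <- c("school", "teacher")
doc_vector <- Reduce(`+`, word_vectors[doc_words]) / length(doc_words)
doc_vector   # 0.3 0.7 0.2
```

The resulting vector is what would be stored in the vector field; a real implementation would obtain it from a pre-trained model, possibly served through an API.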
additional
[Optional ; Not repeatable]
The additional element is provided to allow users of the API to create their own elements and add them to the schema. It is not part of the DDI Codebook standard. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause DDI schema validation to fail in NADA.
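As an illustration (the element names below are invented for the example), a catalog administrator could attach agency-specific custom fields as follows:

```json
"additional": {
  "my_agency_review_status": "approved",
  "my_agency_internal_id": "ABC-123"
}
```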
The DDI-Codebook metadata standard provides multiple elements to describe the variables in detail, including elements that are usually not found in data dictionaries, such as summary statistics. Generating this information and manually capturing it in a DDI-compliant metadata file can be tedious; some datasets contain hundreds or even thousands of variables. Some of the metadata (the list of variables, possibly variable and value labels, and summary statistics) can be automatically extracted from the data files. Specialized metadata editors, which can read the data files, extract metadata, and generate DDI-compliant output, are thus the preferred option to document microdata. Other software applications, such as CSPro and Survey Solutions (CAPI applications), can also generate variable-level metadata in DDI-compliant formats. Stata and R scripts also provide solutions to generate variable-level metadata from data files. We present some of these tools below.
The World Bank Metadata Editor is compliant with the DDI-Codebook 2.5 standard. It is a flexible application that can also accommodate other standards and schemas such as the Dublin Core (for documents) and the ISO 19139 (for geospatial data).
When importing data files, variable-level metadata are automatically generated, including variable names, summary statistics, and variable and value labels when available in the source data files. Additional variable-level metadata can then be added manually.
The Metadata Editor provides forms to enter all other related metadata using the DDI-Codebook 2.5 standard, including the study description and a description of external resources.
The World Bank Metadata Editor exports the metadata (for microdata) in DDI-Codebook 2.5 format (XML) and in JSON format. Metadata related to external resources can be exported to a Dublin Core file. A transformation of the metadata files into a PDF document is also implemented.
DDI-compliant metadata can also be generated and published in a NADA catalog programmatically. Programming languages like R and Python provide much flexibility to generate such metadata, including variable-level metadata.
We provide here an example where a dataset is available in Stata format. We use two data files from the Core Welfare Indicator Questionnaire (CWIQ) survey conducted in Liberia in 2007 (the full dataset has 12 data files; the extension of the script to the full dataset would be straightforward). One data file, named "sec_abcde_individual.dta", contains individual-level variables. The other data file, named "sec_fgh_household.dta", contains household-level variables. The content of the Stata files is as follows:
When generating the variable-level metadata, we want to extract the value labels from the data files, keeping the original [code - value label] pairs as they are in the original dataset. For example, if the Stata dataset has codes 1 = Male and 2 = Female for variable sex, we do not want them to be changed, for example to 1 = Female and 2 = Male, by the data import process. Import functions in R packages do not always maintain the code/label pairs; some convert categorical data into factors and assign codes and value labels independently from the original coding.
```r
# In http://catalog.ihsn.org/catalog/1523

library(nadar)
library(haven)
library(rlist)
library(stringr)

# ----------------------------------------------------------------------------------
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header = F, stringsAsFactors = F)
set_api_key(my_keys[1, 1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

id = "LBR_CWIQ_2007"
setwd("D:/LBR_CWIQ_2007")

thumb = "liberia_cwiq.JPG"   # This image will be used as a thumbnail

# The literal questions are only found in a PDF file; we extract them.
# If the list of questions had been available in MS-Excel or equivalent format,
# we would import it from that file.
literal_questions = list(
  b1  = "Is [NAME] male or female?",
  b2  = "How long has [NAME] been away in the last 12 months?",
  b3  = "What is [NAME]'s relationship to the head of household?",
  b4  = "How old was [NAME] at last birthday?",
  b5  = "What is [NAME]'s marital status?",
  b6  = "Is [NAME]'s father alive?",
  b7  = "Is [NAME]'s father living in the household?",
  b8  = "Is [NAME]'s mother alive?",
  b9  = "Is [NAME]'s mother living in the household?",
  c1  = "Can [NAME] read and write in any language?",
  c2  = "Has [NAME] ever attended school?",
  c3  = "What is the highest grade [NAME] completed?",
  c4  = "Did [NAME] attend school last year?",
  c5  = "Is [NAME] currently in school?",
  c6  = "What is the current grade [NAME] is attending?",
  c7  = "Who runs the school [NAME] is attending?",
  c8  = "Did [NAME] have any problems with school?",
  c9  = "Why is [NAME] not currently in school?",
  c10 = "Why has [NAME] not started school?"
  # Etc. (we do not include all questions in the example)
)

# Generate file-level and variable-level metadata for the two data files

list_data_files = c("sec_abcde_individual.dta", "sec_fgh_household.dta")

list_var = list()
list_df  = list()
vno = 1
fno = 1

for (datafile in list_data_files) {

  data <- read_dta(datafile)

  # Generate file-level metadata

  # Create a file identifier (sequential)
  fid = paste0("F", str_pad(fno, 2, pad = "0"))
  fno = fno + 1

  # Add core metadata
  case_n = nrow(data)     # Nb of observations in the data file
  var_n  = length(data)   # Nb of variables in the data file
  df = list(file_id    = fid,
            file_name  = datafile,
            case_count = case_n,
            var_count  = var_n)
  list_df = list.append(list_df, df)

  # Generate variable-level metadata

  for (v in 1:length(data)) {

    # Create a variable identifier (sequential)
    vid = paste0("V", str_pad(vno, 4, pad = "0"))
    vno = vno + 1

    # Variable name and literal question
    vname = names(data[v])
    question = literal_questions[[vname]]
    if (is.null(question)) question = ""
    question = as.character(question)

    # Extract the variable label (trim leading and trailing white spaces)
    var_lab <- attr(data[[v]], 'label')
    var_lab <- if (is.null(var_lab)) "" else trimws(var_lab)

    # Variable-level summary statistics
    vval = sum(!is.na(data[[v]]))
    vmis = sum(is.na(data[[v]]))
    vmin = as.character(min(data[[v]], na.rm = TRUE))
    vmax = as.character(max(data[[v]], na.rm = TRUE))
    vstats = list(
      list(type = "valid",          value = vval),
      list(type = "system missing", value = vmis),
      list(type = "minimum",        value = vmin),
      list(type = "maximum",        value = vmax)
    )

    # Extract the (original) codes and value labels and calculate frequencies
    freqs = list()
    val_lab <- attr(data[[v]], 'labels')
    if (!is.null(val_lab) & typeof(data[[v]]) != "character") {
      for (i in 1:length(val_lab)) {
        f = list(value = as.character(val_lab[i]),
                 labl  = as.character(names(val_lab[i])),
                 stats = list(
                   list(type  = "count",
                        value = sum(data[[v]] == val_lab[i], na.rm = TRUE))
                 ))
        freqs = list.append(freqs, f)
      }
    }

    # Compile the variable-level metadata
    list_v = list(
      file_id          = fid,
      vid              = vid,
      name             = vname,
      labl             = var_lab,
      var_qstn_qstnlit = question,
      var_sumstat      = vstats,
      var_catgry       = freqs)

    # Add to the list of variables already documented
    list_var = list.append(list_var, list_v)

  }

}

# Generate the DDI-compliant metadata

cwiq_ddi_metadata <- list(

  doc_desc = list(
    producers = list(
      list(name = "WB consultants")
    ),
    prod_date = "2008-02-19"
  ),

  study_desc = list(

    title_statement = list(
      idno  = id,
      title = "Core Welfare Indicators Questionnaire 2007"
    ),

    authoring_entity = list(
      list(name = "Liberia Institute of Statistics and Geo-Information Services")
    ),

    study_info = list(

      coll_dates = list(
        list(start = "2007-08-06", end = "2007-09-22")
      ),

      nation = list(
        list(name = "Liberia", abbreviation = "LBR")
      ),

      abstract = "The Government of Liberia (GoL) is committed to producing a Poverty Reduction Strategy Paper (PRSP). To do this, the GoL will need to undertake an analysis of qualitative and quantitative sources to understand the nature of poverty ('Where are we?'); to develop a macro-economic framework, and conduct broad based and participatory consultations to choose objectives, define and prioritize strategies ('Where do we want to go? How far can we get?'); and to develop a monitoring and evaluation system ('How will we know when we get there?'). The analysis of the nature of poverty, the Poverty Profile, will establish the overall rate of poverty incidence, identifying the poor in relation to their location, habits, occupations, means of access to and use of government services, and their living standards in regard to health, education, nutrition. Given the capacity constraints it has been agreed that this information will be collected in a single visit survey using the Core Welfare Indicators Questionnaire (CWIQ) survey with an additional module to cover household income, expenditure and consumption. This will provide information to estimate welfare levels and poverty incidence, which can be combined and analyzed with the sectoral information from the main CWIQ questionnaire. While countries with more capacity usually do a household income, expenditure and consumption survey over 12 months, the single visit approach has been used in a number of countries (mainly in West Africa) fairly successfully.",

      geog_coverage = "National"

    ),

    method = list(

      data_collection = list(

        coll_mode = "face to face interview",

        sampling_procedure = "The CWIQ survey will be carried out on a sample of 3,600 randomly selected households located in 300 randomly selected clusters. This was the same basic sample used by the 2007 Liberian DHS. However, for Monrovia, a new listing was carried out and new EAs were chosen and the sampled households were chosen from that list. For rural areas, the same EAs were used but a new sample selection of households was drawn. Any household that may have participated in the LDHS was systematically eliminated. Twelve (12) households were selected in each of the 300 EAs using systematic sampling. The total number of households and number of EAs sampled in each County are given in the table below. (More on the sampling under the External Resources).",

        coll_situation = "On average, the interview process lasted about 2 hours 45 minutes. The Income and Expenditure questionnaire alone took about 2 hours to complete. On many occasions, the questionnaire was completed in 2 sitting sessions."

      )

    )

  ),

  # Information on data files
  data_files = list_df,

  # Information on variables
  variables = list_var

)

# Publish the metadata in the NADA catalog

microdata_add(
  idno          = id,
  repositoryid  = "central",
  access_policy = "licensed",
  published     = 1,
  overwrite     = "yes",
  metadata      = cwiq_ddi_metadata,
  thumbnail     = thumb
)

# Add links to data and documents

external_resources_add(
  title       = "Liberia, CWIQ 2007, Dataset in Stata 15 format",
  idno        = id,
  dcdate      = "2007",
  language    = "English",
  country     = "Liberia",
  dctype      = "dat/micro",
  file_path   = "LBR_CWIQ_2007_Stata15.zip",
  description = "Liberia CWIQ dataset in Stata 15 format (2 data files)",
  overwrite   = "yes"
)

external_resources_add(
  title       = "Liberia, CWIQ 2007, Dataset in SPSS Windows format",
  idno        = id,
  dcdate      = "2007",
  language    = "English",
  country     = "Liberia",
  dctype      = "dat/micro",
  file_path   = "LBR_CWIQ_2007_SPSS.zip",   # assumed filename; original pointed to the Stata zip
  description = "Liberia CWIQ dataset in SPSS for Windows [.sav] format (2 data files)",
  overwrite   = "yes"
)

external_resources_add(
  title     = "CWIQ 2007 Questionnaire",
  idno      = id,
  dcdate    = "2007",
  language  = "English",
  country   = "Liberia",
  dctype    = "doc/ques",
  file_path = "LCWIQ2007_.pdf",
  overwrite = "yes"
)
```
After running the script, the metadata (and links) are available in the NADA catalog.
To make geographic information discoverable and to facilitate its dissemination and use, the ISO Technical Committee on Geographic Information/Geomatics (ISO/TC 211) created a set of metadata standards to describe geographic datasets (ISO 19115), geographic data structures (ISO 19115-2 / ISO 19110), and geographic data services (ISO 19119). These standards have been “unified” into a common XML specification (ISO 19139). This set of standards, known as the ISO 19100 series, served as the cornerstone of multiple initiatives to improve the documentation and management of geographic information, such as the Open Geospatial Consortium (OGC), the US Federal Geographic Data Committee (FGDC), the European INSPIRE directive, and more recently the Research Data Alliance (RDA), among others.
The ISO 19100 standards have been designed to cover the large scope of geographic information. The level of detail they provide goes beyond the needs of most data curators. What we present in this Guide is a subset of the standards, which focuses on what we consider as the core requirements to describe and catalog geographic datasets and services. References and links to resources where more detailed information can be found are provided in the appendix.
Geographic information metadata standards cover three types of resources: i) datasets, ii) data structure definitions, and iii) data services. Each one of these three components is the object of a specific standard. To support their implementation, a common XML specification (ISO 19139) covering the three standards has been developed. The geographic metadata standard is however, by far, the most complex and “specialized” of all schemas described in this Guide. Its use requires expertise not only in data documentation, but also in the use of geospatial data. We provide in this chapter some information that readers who are not familiar with geographic data may find useful to better understand the purpose and use of the geographic metadata standards.
Geographic datasets “identify and depict geographic locations, boundaries and characteristics of features on the surface of the earth. They include geographic coordinates (e.g., latitude and longitude) and data associated to geographic locations (…)”. (Source: https://www.fws.gov/gis/)
The ISO 19115 standard defines the structure and content of the metadata to be used to document geographic datasets. The standard is split into two parts: ISO 19115-1, which covers the fundamentals of resource description, and ISO 19115-2, which provides extensions for imagery and gridded data.
Vector and raster spatial datasets are built with different structures and formats. The following summarizes how these two categories differ and how they can be processed using the R software. The descriptions of vector and raster data provided in this chapter are adapted from:
- https://gisgeography.com/spatial-data-types-vector-raster/
- https://datacarpentry.org/organization-geospatial/02-intro-vector-data/index.html
Vector data
Vector data are comprised of points, lines, and polygons (areas).
A vector point is defined by a single x, y coordinate. Generally, vector points are a latitude and longitude with a spatial reference frame. A point can for example represent the location of a building or facility. When multiple points are connected in a set order, they become a vector line, with each point representing a vertex. Lines usually represent features that are linear in nature, like roads and rivers. Each bend in the line represents a vertex that has a defined x, y location. When a set of 3 or more vertices is joined in a particular order and closed (i.e., the first and last coordinate pairs are the same), it becomes a polygon. Polygons are used to show boundaries. They will typically represent lakes, oceans, countries and their administrative subdivisions (provinces, states, districts), building footprints, or outlines of survey plots. Polygons have an area (which will correspond to the square footage for a building footprint, to the acreage for an agricultural plot, etc.).
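These three geometry types can be constructed directly with the sf package; a minimal sketch with invented coordinates:

```r
library(sf)

# A point: a single x, y coordinate (invented lon/lat)
pt <- st_point(c(89.64, 27.47))

# A line: points connected in a set order; each point is a vertex
ln <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))

# A polygon: a closed ring (first and last coordinate pairs are the same)
ring <- rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))
poly <- st_polygon(list(ring))

st_area(poly)   # 4 (a 2 x 2 square, unitless without a CRS)
```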
Vector data are often provided in file formats such as ESRI Shapefile, GeoJSON, KML, or GeoPackage.
Some examples
EXAMPLE 1
The figure below provides an example of vector data extracted from OpenStreetMap for a part of the city of Thimphu, Bhutan (as of 17 May 2021).
The content of this map can be exported as an OSM file.
Multiple applications allow users to read and process OSM files, including open source software applications like QGIS, or the R packages sf and osmdata.
```r
# Example of an R script that reads and shows the content of the map.osm file

library(sf)

# List the layers contained in the OSM file
lyrs <- st_layers("map.osm")

# Read the layers as sf objects
points   <- st_read("map.osm", layer = "points")
lines    <- st_read("map.osm", layer = "lines")
polygons <- st_read("map.osm", layer = "multipolygons")
```
EXAMPLE 2
In this second example, we use the R sf (Simple Features) package to read a shape (vector) file of refugee camps in Bangladesh, downloaded from the Humanitarian Data Exchange (HDX) website:
```r
# Load the sf package and utilities

library(sf)
library(utils)

# Download and unzip the shape file (published by HDX in a compressed zip format)

setwd("E:/my_data")
url <- "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip"
download.file(url, destfile = "200908_RRC_Outline_Camp_AL1.zip")
unzip("E:/my_data/200908_RRC_Outline_Camp_AL1.zip")

# Read the file and display core information about its content

al1 <- st_read("./200908_RRC_Outline_Camp_AL1/200908_RRC_Outline_Camp_AL1.shp")
print(al1)
str(al1)
plot(al1)

# ------------------------------
# Output of the 'print' command:
# ------------------------------

# Simple feature collection with 35 features and 14 fields
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: 92.12973 ymin: 20.91856 xmax: 92.26863 ymax: 21.22292
# geographic CRS: WGS 84
# First 10 features:
#   District    Upazila Settlement      Union        Name_Alias         SSID    SMSD__Cnam NPM_Name            Area_Acres PeriMe_Met
# 1 Cox's Bazar Ukhia   Collective site Palong Khali Bagghona-Putibonia CXB-224 Camp 16    Camp 16 (Potibonia) 130.57004  4136.730
# 2 Cox's Bazar Ukhia   Collective site Palong Khali <NA>               CXB-203 Camp 02E   Camp 02E             96.58179  4803.162
# 3 ...
#
#   Camp_Name Area_SqM         Latitude         Longitude        geometry
# 1 Camp 16   528946.95881724  21.1563813298438 92.1490685817901 MULTIPOLYGON (((92.15056 21...
# 2 Camp 2E   391267.799744003 21.2078084302778 92.1643360947381 MULTIPOLYGON (((92.16715 21...
# 3 ...

# ----------------------------
# Output of the 'str' command:
# ----------------------------

# Classes 'sf' and 'data.frame': 35 obs. of 15 variables:
# $ District  : chr "Cox's Bazar" "Cox's Bazar" "Cox's Bazar" "Cox's Bazar" ...
# $ Upazila   : chr "Ukhia" "Ukhia" "Ukhia" "Ukhia" ...
# $ Settlement: chr "Collective site" "Collective site" "Collective site" "Collective site" ...
# $ Union     : chr "Palong Khali" "Palong Khali" "Palong Khali" "Raja Palong" ...
# $ Name_Alias: chr "Bagghona-Putibonia" NA "Jamtoli-Baggona" "Kutupalong RC" ...
# $ SSID      : chr "CXB-224" "CXB-203" "CXB-223" "CXB-221" ...
# $ SMSD__Cnam: chr "Camp 16" "Camp 02E" "Camp 15" "Camp KRC" ...
# $ NPM_Name  : chr "Camp 16 (Potibonia)" "Camp 02E" "Camp 15 (Jamtoli)" "Kutupalong RC" ...
# $ Area_Acres: num 130.6 96.6 243.3 95.7 160.4 ...
# $ PeriMe_Met: num 4137 4803 4722 3095 4116 ...
# $ Camp_Name : chr "Camp 16" "Camp 2E" "Camp 15" "Kutupalong RC" ...
# $ Area_SqM  : chr "528946.95881724" "391267.799744003" "985424.393160958" "387729.666427279" ...
# $ Latitude  : chr "21.1563813298438" "21.2078084302778" "21.1606399787906" "21.2120281895357" ...
# $ Longitude : chr "92.1490685817901" "92.1643360947381" "92.1428956454661" "92.1638095873048" ...
# $ geometry  :sfc_MULTIPOLYGON of length 35; first list element: List of 1

# This information can be extracted and used to document the data
```
The output of the script shows that the shape file contains 35 features (or “objects”; in this case each object represents a refugee camp) and 14 fields (attributes and variables; including information like the camp name, administrative region, surface area, and more) related to each object.
The geometry type (multipolygon) and dimension (XY) provide information on the type of object. “All geometries are composed of points. Points are coordinates in a 2-, 3- or 4-dimensional space. All points in a geometry have the same dimensionality. In addition to X and Y coordinates, there are two optional additional dimensions: Z (a third dimension, usually altitude) and M (a “measure” associated with the point rather than with the feature as a whole).
The four possible cases then are: XY (two-dimensional points), XYZ (three-dimensional points), XYM (two-dimensional points with a measure), and XYZM (three-dimensional points with a measure).”
The following seven simple feature types are the most common:

| Type | Description |
|---|---|
| POINT | zero-dimensional geometry containing a single point |
| LINESTRING | sequence of points connected by straight, non-self-intersecting line pieces; one-dimensional geometry |
| POLYGON | geometry with a positive area (two-dimensional); sequence of points form a closed, non-self-intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring |
| MULTIPOINT | set of points; a MULTIPOINT is simple if no two points in the MULTIPOINT are equal |
| MULTILINESTRING | set of linestrings |
| MULTIPOLYGON | set of polygons |
| GEOMETRYCOLLECTION | set of geometries of any type except GEOMETRYCOLLECTION |
The remaining ten geometries are rarer: CIRCULARSTRING, COMPOUNDCURVE, CURVEPOLYGON, MULTICURVE, MULTISURFACE, CURVE, SURFACE, POLYHEDRALSURFACE, TIN, TRIANGLE (see https://r-spatial.github.io/sf/articles/sf1.html).
The geographic CRS tells us the coordinate reference system (CRS). Coordinates can only be placed on the Earth's surface when their CRS is known; this may be a spheroid CRS such as WGS 84, a projected, two-dimensional (Cartesian) CRS such as a UTM zone or Web Mercator, or a CRS in three dimensions, or one including time. In our example above, the CRS is WGS 84 (World Geodetic System 84), a standard used in cartography, geodesy, and satellite navigation, including GPS.
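To illustrate the role of the CRS, the same point can be re-expressed from geographic (WGS 84) to projected (Web Mercator) coordinates; a sketch with the sf package, using invented coordinates:

```r
library(sf)

# A point in lon/lat, declared as WGS 84 (EPSG:4326)
p <- st_sfc(st_point(c(92.15, 21.16)), crs = 4326)

# The same location expressed in Web Mercator (EPSG:3857), in meters
p_merc <- st_transform(p, 3857)
st_coordinates(p_merc)
```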
The bbox is the bounding box: the minimum and maximum x and y coordinates that enclose all features in the dataset.
Information on a subset of the features (the first 10; only 2 shown above) is displayed in the output of the script, along with the list of the 14 available fields.
The plot(al1) command in R produces a visualization of the numeric fields in the data file:
All this information represents important components of the metadata, which we will want to capture, enrich, and catalog (together with additional information) using the ISO metadata standard. “Enriching” (or “augmenting”) the metadata will consist of providing more contextual information (who produced the data, when, why, etc.) and additional information on the features (e.g., what does the variable ‘SMSD__Cnam’ represent?).
Raster data
Raster data are made up of pixels, also referred to as grid cells. Satellite imagery and other remote sensing data are raster datasets. Grid cells in raster data are usually (but not necessarily) regularly spaced and square. Data stored in a raster format are arranged in a grid without storing the coordinates of each cell (pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in the grid.
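This calculation can be sketched in a few lines of base R (the grid origin, resolution, and cell indices below are invented values):

```r
# Derive a cell-center coordinate from the grid origin and resolution
xmin <- 33.0; ymax <- 14.9   # top-left corner of the grid
res  <- 0.1                  # cell size (degrees)
col  <- 3; row <- 2          # 1-based cell indices

x <- xmin + (col - 0.5) * res   # center x of the cell
y <- ymax - (row - 0.5) * res   # center y of the cell
c(x, y)   # approximately 33.25 14.75
```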
Any given pixel in a grid stores one or more values (in one or more bands). For example, each cell (pixel) value in a satellite image has a red, a green, and a blue value. Cells in raster data can represent anything from elevation, temperature, rainfall, land cover, or population density, among others. (Source: https://worldbank.github.io/OpenNightLights/tutorials/mod2_1_data_overview.html)
Raster data can be discrete or continuous. Discrete rasters have distinct themes or categories. For example, one grid cell can represent a land cover class, or a soil type. In a discrete raster, each thematic class can be discretely defined (usually represented by an integer) and distinguished from other classes. In other words, each cell is definable and its value applies to the entire area of the cell. For example, the value 1 for a class might indicate “urban area”, value 2 “forest”, and value 3 “others”. Continuous (or non-discrete) rasters are grid cells with gradually changing values, which could for example represent elevation, temperature, or an aerial photograph.
The difference between vector and raster data, and between different types of vectors, is clearly illustrated in the figure below, taken from the World Bank’s Light Every Night GitHub repository.
In GIS applications, vector and raster data are often combined into multi-layer datasets, as shown in the figure below, extracted from the County of San Bernardino (US) website.
We may occasionally want to convert raster data into vector data. For example, a building footprint layer (vector data, composed of polygons) can be derived from a satellite image (raster data). Such conversions can be implemented in a largely automated manner using machine learning algorithms.
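While the machine-learning step itself is beyond this Guide, the mechanical raster-to-vector conversion can be sketched with the terra package (all values below are toy data):

```r
library(terra)

# A toy 4 x 4 discrete raster: top half class 1, bottom half class 2
r <- rast(nrows = 4, ncols = 4, vals = c(rep(1, 8), rep(2, 8)))

# Dissolve cells with equal values into polygons (vector data)
v <- as.polygons(r)
```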
Raster data are often provided in file formats such as GeoTIFF, NetCDF, or HDF5.
GeoTIFF is a popular file format for raster data. A Tagged Image File Format (TIFF or TIF) is a file format designed to store raster-type data. A GeoTIFF file is a TIFF file that contains specific tags to store structured geospatial metadata, including the spatial extent of the raster, its coordinate reference system, and its resolution.
TIFF files can be read using (among other options) the R package raster or the Python library rasterio.
GeoTIFF files can also be provided as Cloud Optimized GeoTIFFs (COGs). In COGs, the data are structured in a way that allows them to be shared via web services which allow users to query, visualize, or download a user-defined subset of the content of the file, without having to download the entire file. This option can be a major advantage, as GeoTIFF files generated by remote sensing/satellite imagery can be very large. Extracting only the relevant part of a file can save significant time and storage space.
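A hedged sketch of this workflow with the terra package; the URL below is a placeholder, and the bounds assume the remote file is in WGS 84:

```r
library(terra)

# Area of interest (invented lon/lat bounds)
win <- ext(38.6, 39.0, 8.8, 9.2)

# '/vsicurl/' lets GDAL open the remote file lazily, without a full download
cog <- rast("/vsicurl/https://example.org/some_cog.tif")   # placeholder URL

# Only the internal blocks overlapping 'win' are fetched, not the whole file
sub <- crop(cog, win)
```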
Some examples
EXAMPLE 1
The first example below shows the spatial distribution of the Ethiopian population in 2020. The data file was downloaded from the WorldPop website on 17 May 2021.
```r
# Load the raster R package

library(raster)

# Download a TIF file (spatial distribution of population, Ethiopia, 2020) - 62Mb

setwd("E:/my_data")
url <- "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif"
file_name = basename(url)
download.file(url, destfile = file_name, mode = 'wb')

# Read the file and display core information about its content

my_raster_file <- raster(file_name)
print(my_raster_file)

# ------------------------------
# Output of the 'print' command:
# ------------------------------

# dimensions : 13893, 17983, 249837819 (nrow, ncol, ncell)
# resolution : 0.0008333333, 0.0008333333 (x, y)
# extent     : 32.99958, 47.98542, 3.322084, 14.89958 (xmin, xmax, ymin, ymax)
# crs        : +proj=longlat +datum=WGS84 +no_defs
# source     : E:/my_data/eth_ppp_2020_constrained.tif
# names      : eth_ppp_2020_constrained
# values     : 1.36248, 847.9389 (min, max)
```
This output shows that the TIF file contains one layer of cells, forming an image of 13,893 by 17,983 cells. It also provides information on the projection system (datum): WGS 84 (World Geodetic System 84). This information (and more) will be part of the ISO-compliant metadata we want to generate to document and catalog a raster dataset.
EXAMPLE 2
In the second example, we demonstrate the advantages of Cloud Optimized GeoTIFFs (COGs). We extract information from the World Bank Light Every Night repository.
```r
# Load the 'aws.s3' package to access the Amazon Web Services (AWS) Simple Storage Service (S3)
library("aws.s3")

# Load the 'raster' package to read the target GeoTIFF
library("raster")

# List files in the World Bank bucket 'globalnightlight', setting a max number of items
contents <- get_bucket(bucket = 'globalnightlight', max = 10000)

# get_bucket_df is similar to 'get_bucket' but returns the list as a data frame
contents <- get_bucket_df(bucket = 'globalnightlight')

# Access DMSP-OLS data for satellite F12 in 1995
F12_1995 <- get_bucket(bucket = 'globalnightlight',
                       prefix = "F121995")

# As data.frame, with all objects listed
F12_1995_df <- get_bucket_df(bucket = 'globalnightlight',
                             prefix = "F121995",
                             max = Inf)

# Number of objects
nrow(F12_1995_df)

# Save the object
filename <- "F12199501140101.night.OIS.tir.co.tif"
save_object(bucket = 'globalnightlight',
            object = "F121995/F12199501140101.night.OIS.tir.co.tif",
            file = filename)

# Read it with the raster package
rs <- raster(filename)
```
The ISO 19115-2 provides the necessary metadata elements to describe the structure of raster data. The ISO 19115-1 standard does not provide all necessary metadata elements needed to describe the structure of vector datasets. The description of data structures for vector data (also referred to as feature types) is therefore often omitted. The ISO 19110 standard solves that issue, by providing the means to document the structure of vector datasets (column names and definitions, codes and value labels, measurement units, etc.), which will contribute to making the data more discoverable and usable.
More and more data are disseminated not in the form of datasets, but as data services via web applications. “Geospatial services provide the technology to create, analyze, maintain, and distribute geospatial data and information.” (https://www.fws.gov/gis/) The ISO 19119 standard provides the elements to document such services.
The three metadata standards previously described (ISO 19115 for vector and raster datasets, ISO 19110 for vector data structures, and ISO 19119 for data services) provide a set of concepts and definitions useful to describe geographic information. To facilitate their practical implementation, a digital specification, which defines how this information is stored and organized in an electronic metadata file, is required. The ISO/TS 19139 standard, an XML specification of ISO 19115, ISO 19110, and ISO 19119, was created for that purpose.
The ISO/TS 19139 is a standard used worldwide to describe geographic information. It is the backbone for the implementation of INSPIRE dataset and service metadata in the European Union. It is supported by a wide range of tools, including desktop GIS applications (e.g., Quantum GIS, ESRI ArcGIS), OGC-compliant metadata catalogs (e.g., GeoNetwork), and geographic servers (e.g., GeoServer).
ISO 19139-compliant metadata can be generated and edited using specialized metadata editors such as CatMDEdit or QSphere, or using programmatic tools like the Java Apache SIS library or the R packages geometa and geoflow, among others.
The ISO 19139 specification is complex. To enable and simplify its use in our NADA cataloguing application, we produced a JSON version of (part of) the standard. We selected the elements we considered most relevant for our purpose and organized them into the JSON schema described below. For data curators with limited expertise in XML and geographic data documentation, this JSON schema will make the production of metadata compliant with the ISO 19139 standard easier.
**Main structure**

```json
{
  "repositoryid": "string",
  "published": 0,
  "overwrite": "no",
  "metadata_information": {},
  "description": {},
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": {}
}
```
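As an illustration, the top-level skeleton above can be assembled and serialized with a few lines of Python. This is a minimal sketch, not an official NADA API call; the element names come from the schema shown above, and the values are placeholders.

```python
import json

# Assemble the top-level skeleton shown above as a plain dictionary.
# "central" is a made-up repository identifier for the example.
metadata = {
    "repositoryid": "central",
    "published": 0,            # default value shown in the schema
    "overwrite": "no",
    "metadata_information": {},
    "description": {},
    "provenance": [],
    "tags": [],
    "lda_topics": [],
    "embeddings": [],
    "additional": {},
}

# Serialize to a JSON string, ready to be stored or submitted to a catalog
print(json.dumps(metadata, indent=2))
```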
Geographic metadata (for both datasets and services) should include core metadata properties, and metadata sections aiming to describe specific aspects of the resource (e.g., resource identification or resource distribution).
The content of some metadata elements is controlled by codelists (or controlled vocabularies). A codelist is a pre-defined set of values, and the content of an element controlled by a codelist should be selected from that list. This may for example apply to the element “language”, whose content should be selected from the ISO 639 list of codes for language names, instead of being free text. The ISO 19139 suggests but does not impose codelists. It is highly recommended to make use of the suggested codelists (or of specific codelists that may be promoted by agencies or partnerships).
Some metadata elements (referred to as common elements) of the ISO 19139 can be repeated in different parts of a metadata file. For example, a standard set of fields is provided to describe a contact, a citation, or a file format. Such common elements can be used in multiple locations of a metadata file (e.g., to provide information on who the contact person is for information on data quality, on data access, on data documentation, etc.). In the following sections, we first present the common elements, then the elements that form the core metadata properties (information on the metadata themselves), followed by the elements from the main metadata sections used to describe the data, and finally the feature catalog elements, which are used to document attributes and variables related to vector data (ISO 19110).
Common elements are blocks of metadata fields that can appear in multiple locations of a metadata file. For example, information on contact person(s) or organization(s) may have to be provided in the sections of the file where we document the production and maintenance of the data, the production and maintenance of the metadata, the distribution and terms of use of the data, etc. Other types of common elements include online and offline resources, file formats, citations, keywords, constraints, and extent. We describe these sets of elements below.
The ISO 19139 specification provides a structured set of metadata elements to describe a contact. A contact is the party (person or organization) responsible for a specific task. The following set of elements can be used to describe a contact:
| Element | Description |
|---------|-------------|
| `individualName` | Name of the individual |
| `organisationName` | Name of the organization |
| `positionName` | Position of the individual in the organization |
| `contactInfo` | Contact information, divided into three sections: `phone` (including `voice` and/or `facsimile` numbers); `address`, handling the physical address elements (`deliveryPoint`, `city`, `postalCode`, `country`) and the contact e-mail (`electronicMailAddress`); and `onlineResource`, e.g., the URL of the organization website (which includes `linkage`, `name`, `description`, `protocol`, and `function`; see below) |
| `role` | Role of the person/organization. A recommended controlled vocabulary is provided by the ISO 19139, with the following options: {resourceProvider, custodian, owner, sponsor, user, distributor, originator, pointOfContact, principalInvestigator, processor, publisher, author, coAuthor, collaborator, editor, mediator, rightsHolder, contributor, funder, stakeholder} |
"contact": [
+{
+ "individualName": "string",
+ "organisationName": "string",
+ "positionName": "string",
+ "contactInfo": {
+ "phone": {
+ "voice": "string",
+ "facsimile": "string"
+ },
+ "address": {
+ "deliveryPoint": "string",
+ "city": "string",
+ "postalCode": "string",
+ "country": "string",
+ "electronicMailAddress": "string"
+ },
+ "onlineResource": {
+ "linkage": "string",
+ "name": "string",
+ "description": "string",
+ "protocol": "string",
+ "function": "string"
+ }
+ },
+ "role": "string"
+ }
+ ]
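As an illustration, a contact entry with a role validated against the ISO 19139 role codelist quoted above can be built with a small helper. This helper is hypothetical (not part of any standard toolkit); only the element names and the codelist come from the schema.

```python
# Role codelist recommended by the ISO 19139 (quoted in the table above)
ISO_ROLES = {
    "resourceProvider", "custodian", "owner", "sponsor", "user", "distributor",
    "originator", "pointOfContact", "principalInvestigator", "processor",
    "publisher", "author", "coAuthor", "collaborator", "editor", "mediator",
    "rightsHolder", "contributor", "funder", "stakeholder",
}

def make_contact(individual, organisation, role):
    """Hypothetical helper: build a minimal contact entry, rejecting
    roles that are not in the recommended codelist."""
    if role not in ISO_ROLES:
        raise ValueError(f"'{role}' is not in the ISO 19139 role codelist")
    return {
        "individualName": individual,
        "organisationName": organisation,
        "role": role,
    }

# Names below are invented for the example
contact = make_contact("Jane Doe", "National Statistics Office", "pointOfContact")
```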
An online resource is a common set of elements frequently used in the geographic data/services schema. It can be used for example to provide a link to an organization website, to a data file or to a document, etc. An online resource is described with the following properties:
| Element | Description |
|---------|-------------|
| `linkage` | URL of the online resource. In the case of standard geographic data services, only the base URL should be provided, without any service parameter. |
| `name` | Name of the online resource. In the case of standard geographic data services, this should be filled with the identifier of the resource as published in the service. For example, for an OGC Web Map Service (WMS), we will use the layer name. |
| `description` | Description of the online resource |
| `protocol` | Web protocol used to get the resource, e.g., FTP, HTTP. In the case of a basic HTTP link, the ISO 19139 suggests the value `WWW:LINK-1.0-http--link`. For standard geographic data services, it is recommended to fill this element with the appropriate protocol identifier; for an OGC Web Map Service (WMS) link, for example, use `OGC:WMS-1.1.0-http-get-map` |
| `function` | Function (purpose) of the online resource. |
"onlineResource": {
+"linkage": "string",
+ "name": "string",
+ "description": "string",
+ "protocol": "string",
+ "function": "string"
+ }
An offline resource (medium) is a common set of elements that can be used to describe a physical resource used to distribute a dataset, e.g., a DVD or a CD-ROM. A medium is described with the following properties:
| Element | Description |
|---------|-------------|
| `name` | Name of the medium, e.g., `dvd`. Recommended code following the ISO/TS 19139 MediumName codelist. Suggested values: {cdRom, dvd, dvdRom, 3halfInchFloppy, 5quarterInchFloppy, 7trackTape, 9trackTape, 3480Cartridge, 3490Cartridge, 3580Cartridge, 4mmCartridgeTape, 8mmCartridgeTape, 1quarterInchCartridgeTape, digitalLinearTape, onLine, satellite, telephoneLink, hardcopy} |
| `density` | Density (or list of densities) at which the data is recorded |
| `densityUnit` | Unit(s) of measure for the recording density |
| `volumes` | Number of items in the media identified |
| `mediumFormat` | Method used to write to the medium, e.g., `tar`. Recommended code following the ISO/TS 19139 MediumFormat codelist. Suggested values: {cpio, tar, highSierra, iso9660, iso9660RockRidge, iso9660AppleHFS, udf} |
| `mediumNote` | Description of other limitations or requirements for using the medium |
The table below lists the ISO 19139 elements used to document a file format. A format is defined at a minimum by its `name`. It is also recommended to provide a `version`, and possibly a format `specification`. It is good practice to provide a standardized format name, using the file’s mime type, e.g., `text/csv` or `image/tiff`. A list of available mime types is available from the IANA website.
| Element | Description |
|---------|-------------|
| `name` | Format name - Recommended |
| `version` | Format version (if applicable) - Recommended |
| `amendmentNumber` | Amendment number (if applicable) |
| `specification` | Name of the specification - Recommended |
| `fileDecompressionTechnique` | Technique for file decompression (if applicable) |
| `FormatDistributor` | Contact(s) responsible for the distribution |
"resourceFormat": [
+{
+ "name": "string",
+ "version": "string",
+ "amendmentNumber": "string",
+ "specification": "string",
+ "fileDecompressionTechnique": "string",
+ "FormatDistributor": {
+ "individualName": "string",
+ "organisationName": "string",
+ "positionName": "string",
+ "contactInfo": {},
+ "role": "string"
+ }
+ }
+ ]
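The good practice of using mime types as standardized format names can be supported directly by Python's standard library; a small sketch (the file names below are made up for the example):

```python
import mimetypes

# Derive a standardized format name (mime type) from a file name,
# as recommended for the "name" element of resourceFormat.
for f in ["boundaries.csv", "population.tif"]:
    mime, _encoding = mimetypes.guess_type(f)
    print(f, "->", mime)  # e.g., "text/csv", "image/tiff"
```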
The citation is another common element that can be used in various parts of a geographic metadata file. Citations are used to provide detailed information on external resources related to the dataset or service being documented. A citation can be defined using the following set of (mostly optional) elements:
| Element | Description |
|---------|-------------|
| `title` | Title of the resource |
| `alternateTitle` | An alternate title (if applicable) |
| `date` | Date(s) associated with the resource, with sub-elements `date` and `type`. This may include different types of dates. The type of date should be provided, and selected from the controlled vocabulary proposed by the ISO 19139: date of {creation, publication, revision, expiry, lastUpdate, lastRevision, nextUpdate, unavailable, inForce, adopted, deprecated, superseded, validityBegins, validityExpires, released, distribution} |
| `edition` | Edition of the resource |
| `editionDate` | Edition date |
| `identifier` | A unique persistent identifier for the resource. If a DOI is available for the resource, the DOI should be entered here. The same `fileIdentifier` should be used if no other persistent identifier is available. |
| `citedResponsibleParty` | Contact(s)/party(ies) responsible for the resource. |
| `presentationForm` | Form in which the resource is made available. The ISO 19139 recommends the following controlled vocabulary: {documentDigital, imageDigital, documentHardcopy, imageHardcopy, mapDigital, mapHardcopy, modelDigital, modelHardcopy, profileDigital, profileHardcopy, tableDigital, tableHardcopy, videoDigital, videoHardcopy, audioDigital, audioHardcopy, multimediaDigital, multimediaHardcopy, physicalSample, diagramDigital, diagramHardcopy}. For a geospatial dataset or web-layer, the value `mapDigital` will be preferred. |
| `series` | A description of the series, in case the resource is part of a series. This includes the series `name`, `issueIdentification`, and `page` |
| `otherCitationDetails` | Any other citation details to specify |
| `collectiveTitle` | A title, in case the resource is part of a broader resource (e.g., a data collection) |
| `ISBN` | International Standard Book Number (ISBN); an international standard identification number for uniquely identifying publications that are not intended to continue indefinitely. |
| `ISSN` | International Standard Serial Number (ISSN); an international standard for serial publications. |
"citation": {
+"title": "string",
+ "alternateTitle": "string",
+ "date": [
+ {
+ "date": "string",
+ "type": "string"
+ }
+ ],
+ "edition": "string",
+ "editionDate": "string",
+ "identifier": {
+ "authority": "string",
+ "code": null
+ },
+ "citedResponsibleParty": [],
+ "presentationForm": [
+ "string"
+ ],
+ "series": {
+ "name": "string",
+ "issueIdentification": "string",
+ "page": "string"
+ },
+ "otherCitationDetails": "string",
+ "collectiveTitle": "string",
+ "ISBN": "string",
+ "ISSN": "string"
+ }
Keywords contribute significantly to making a resource more discoverable. Entering a list of relevant keywords is therefore highly recommended. Keywords can, but do not have to be selected from a controlled vocabulary (thesaurus). Keywords are documented using the following elements:
| Element | Description |
|---------|-------------|
| `type` | Keyword type. The ISO 19139 provides a recommended controlled vocabulary with the following options: {dataCenter, discipline, place, dataResolution, stratum, temporal, theme, dataCentre, featureType, instrument, platform, process, project, service, product, subTopicCategory} |
| `keyword` | The keyword itself. When possible, existing vocabularies should be preferred to writing free-text keywords. An example of a global vocabulary is the Global Change Master Directory, which could be a valuable source to reference data domains/disciplines, or the UNESCO Thesaurus. |
| `thesaurusName` | A reference to the thesaurus (if applicable) from which the keywords are extracted. The thesaurus itself should then be documented as a citation. |
"keywords": [
+{
+ "type": "string",
+ "keyword": "string",
+ "thesaurusName": "string"
+ }
+ ]
The constraints common set of elements is used to document legal and security constraints associated with the documented dataset or data service. Both types of constraints have one property in common, `useLimitation`, used to describe the use limitation(s) as free text.
"resourceConstraints": [
+{
+ "legalConstraints": {
+ "useLimitation": [
+ "string"
+ ],
+ "accessConstraints": [
+ "string"
+ ],
+ "useConstraints": [
+ "string"
+ ],
+ "otherConstraints": [
+ "string"
+ ]
+ },
+ "securityConstraints": {
+ "useLimitation": [
+ "string"
+ ],
+ "classification": "string",
+ "userNote": "string",
+ "classificationSystem": "string",
+ "handlingDescription": "string"
+ }
+ }
+ ]
In addition to the `useLimitation` element, legal constraints (`legalConstraints`) can be described using the following three metadata elements:
| Element | Description |
|---------|-------------|
| `accessConstraints` | Access constraints. The ISO 19139 provides a controlled vocabulary with the following options: {copyright, patent, patentPending, trademark, license, intellectualPropertyRights, restricted, otherRestrictions, unrestricted, licenceUnrestricted, licenceEndUser, licenceDistributor, private, statutory, confidential, SBU, in-confidence} |
| `useConstraints` | Use constraints, to be entered as free text. How this element is filled will depend on the resource being described. As a best practice, this is where terms of use, disclaimers, preferred citation, or even data limitations can be captured. |
| `otherConstraints` | Any other constraints related to the resource. |
In addition to the `useLimitation` element, security constraints (`securityConstraints`), which apply essentially to classified resources, can be described using the following four metadata elements:
| Element | Description |
|---------|-------------|
| `classification` | Classification code. The ISO 19139 provides a controlled vocabulary with the following options: {unclassified, restricted, confidential, secret, topSecret, SBU, forOfficialUseOnly, protected, limitedDistribution} |
| `userNote` | Note to users (free text) |
| `classificationSystem` | Information on the system used to classify the information. Organizations may have their own system to classify the information. |
| `handlingDescription` | Additional free-text description of the classification |
The extent defines the boundaries of the dataset in space (horizontally and vertically) and in time. The ISO 19139 standard defines the extent as follows:
| Element | Description |
|---------|-------------|
| `geographicElement` | Spatial (horizontal) extent element. This can be defined either with a `geographicBoundingBox` providing the coordinates bounding the limits of the dataset, by means of four properties: `southBoundLatitude`, `westBoundLongitude`, `northBoundLatitude`, `eastBoundLongitude` (recommended); or using `geographicDescription`, free text that defines the area covered. When the dataset covers one or more countries, it is recommended to enter the country names in this element, as they can then be used in data catalogs for filtering by geography. |
| `verticalElement` | Spatial (vertical) extent element, providing three properties: `minimumValue`, `maximumValue`, and `verticalCRS` (reference to the vertical coordinate reference system) |
| `temporalElement` | Temporal extent element. Depending on the temporal characteristics of the dataset, this will consist of a `TimePeriod` (made of a `beginPosition` and `endPosition`) or a `TimeInstant` (made of a single `timePosition`) referencing date/time information according to ISO 8601 |
"extent": {
+"geographicElement": [
+ {
+ "geographicBoundingBox": {
+ "westBoundLongitude": -180,
+ "eastBoundLongitude": -180,
+ "southBoundLatitude": -180,
+ "northBoundLatitude": -180
+ },
+ "geographicDescription": "string"
+ }
+ ],
+ "temporalElement": [
+ {
+ "extent": null
+ }
+ ],
+ "verticalElement": [
+ {
+ "minimumValue": 0,
+ "maximumValue": 0,
+ "verticalCRS": null
+ }
+ ]
+ }
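A `geographicBoundingBox` is only meaningful if longitudes fall within [-180, 180] and latitudes within [-90, 90]. A small validation sketch (the function name is ours, not part of the schema, and the bounding box values are illustrative):

```python
def valid_bbox(bbox):
    """Check a geographicBoundingBox: longitudes in [-180, 180],
    latitudes in [-90, 90], and south not above north."""
    return (
        -180 <= bbox["westBoundLongitude"] <= 180
        and -180 <= bbox["eastBoundLongitude"] <= 180
        and -90 <= bbox["southBoundLatitude"] <= 90
        and -90 <= bbox["northBoundLatitude"] <= 90
        and bbox["southBoundLatitude"] <= bbox["northBoundLatitude"]
    )

# Approximate bounding box for Ethiopia (illustrative values only)
eth = {"westBoundLongitude": 33.0, "eastBoundLongitude": 48.0,
       "southBoundLatitude": 3.4, "northBoundLatitude": 14.9}
print(valid_bbox(eth))
```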
A set of elements is provided in the ISO 19139 to document the core properties of the metadata (not the data). With a few exceptions, these elements apply to the metadata related to datasets and data services. The table below summarizes these elements and their applicability. A description of the elements follows.
| Property | Description | Used in dataset metadata | Used in service metadata |
|---|---|---|---|
| `fileIdentifier` | Unique persistent identifier for the resource | Yes | - |
| `language` | Main language used in the metadata description | Yes | Yes |
| `characterSet` | Character set encoding used in the metadata description | Yes | Yes |
| `parentIdentifier` | Unique persistent identifier of the parent resource (if any) | Yes | Yes |
| `hierarchyLevel` | Scope(s) / hierarchy level(s) of the resource. List of pre-defined values suggested by the ISO 19139. See details below. | Yes | Yes |
| `hierarchyLevelName` | Alternative name definitions for hierarchy levels | Yes | Yes |
| `contact` | Contact(s) associated with the metadata, i.e., the persons/organizations in charge of the creation, edition, and maintenance of the metadata. For more details, see the section on common elements | Yes | Yes |
| `dateStamp` | Date and time when the metadata record was created or updated | Yes | Yes |
| `metadataStandardName` | Reference or name of the metadata standard used | Yes | Yes |
| `metadataStandardVersion` | Version of the metadata standard. For the ISO/TC211 standards, the version corresponds to the creation/revision year. | Yes | Yes |
| `dataSetURI` | Unique persistent link to reference the dataset | Yes | - |
"description": {
+"idno": "string",
+ "language": "string",
+ "characterSet": {
+ "codeListValue": "string",
+ "codeList": "string"
+ },
+ "parentIdentifier": "string",
+ "hierarchyLevel": [],
+ "hierarchyLevelName": [],
+ "contact": [],
+ "dateStamp": "string",
+ "metadataStandardName": "string",
+ "metadataStandardVersion": "string",
+ "dataSetURI": "string"
+ }
**`idno`**

The `idno` must provide a unique and persistent identifier for the resource (dataset or service). A common approach consists in building a semantic identifier, constructed by concatenating some owner and data characteristics. Although this approach offers the advantage of readability, it may not guarantee the identifier's global uniqueness and persistence in time. The use of time periods and/or geographic extents as components of a file identifier is not recommended, as these elements may evolve over time. The use of random identifiers such as Universally Unique Identifiers (UUID) is sometimes suggested as an alternative, but this approach is also not recommended. The use of Digital Object Identifiers (DOI) as global and unique file identifiers is recommended.
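To illustrate the readability/persistence trade-off discussed above, here is a hypothetical semantic identifier built by concatenation, next to a random UUID; the naming convention is invented for the example.

```python
import uuid

# Hypothetical semantic identifier: readable, but its uniqueness and
# persistence depend entirely on the producer's discipline.
producer, country, product, version = "XYZ", "ETH", "POPDENSITY", "v01"
semantic_id = "_".join([producer, country, product, version])
print(semantic_id)      # XYZ_ETH_POPDENSITY_v01

# A UUID is globally unique but opaque; the text above recommends
# DOIs rather than either of these approaches.
random_id = str(uuid.uuid4())
print(random_id)        # 36-character string
```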
**`language`**

The metadata language refers to the main language used in the metadata. The recommended practice is to use the ISO 639-2 Language Code List (also known as the alpha-3 language code), e.g., ‘eng’ for English or ‘fra’ for French.
**`characterSet`**

The character set encoding of the metadata description. The best practice is to use the `utf8` codelist value (UTF-8 encoding). UTF-8 is capable of encoding all valid character code points in Unicode, a standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML, and it is the most common encoding on the World Wide Web. Many text editors provide an option to save your metadata (text) files in UTF-8, which is often the default option (see below the example of Notepad++ and RStudio).
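In a script, the UTF-8 recommendation translates into encoding metadata files explicitly; a sketch using Python's `json` module (the metadata fragment is a toy example; `ensure_ascii=False` keeps non-ASCII characters readable instead of escaping them):

```python
import json

# Toy metadata fragment containing a non-ASCII character
fragment = {"language": "fra", "title": "Densité de population"}

# Serialize and encode explicitly as UTF-8 bytes
encoded = json.dumps(fragment, ensure_ascii=False).encode("utf-8")

# Round-trip: decoding with UTF-8 restores the original text
assert json.loads(encoded.decode("utf-8"))["title"] == "Densité de population"
```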
**`parentIdentifier`**

A geographic data resource can be a subset of a larger dataset. For example, an aquatic species distribution map can be part of a data collection covering all species, or the 2010 population census dataset of a country can be part of a dataset that includes all population censuses for that country since 1900. In such a case, the parent identifier metadata element can be used to identify this higher-level resource. As for the `fileIdentifier`, the `parentIdentifier` must be a unique identifier persistent in time. In a data catalog, a `parentIdentifier` will allow the user to move from one dataset to another. The `parentIdentifier` is generally applied to datasets, although it may in some cases be used in data service descriptions.
**`hierarchyLevel`**

```json
"hierarchyLevel": [
  "string"
]
```
The `hierarchyLevel` defines the scope of the resource. It indicates whether the resource is a collection, a dataset, a series, a service, or another type of resource. The ISO 19139 provides a controlled vocabulary for this element; its use is recommended but not mandatory. The most relevant levels for the purpose of cataloguing geographic data and services are *dataset* (for both raster and vector data), *service* (a capability which a service provider entity makes available to a service user entity through a set of interfaces that define a behavior), and *series*. *Series* will be used when the data represent an ordered succession, in time or in space; this will typically apply to time series, but it can also be used to describe other types of series (e.g., a series of ocean water temperatures collected at a succession of depths).

The recommended controlled vocabulary for `hierarchyLevel` includes: {dataset, series, service, attribute, attributeType, collectionHardware, collectionSession, nonGeographicDataset, dimensionGroup, feature, featureType, propertyType, fieldSession, software, model, tile, initiative, stereomate, sensor, platformSeries, sensorSeries, productionSeries, transferAggregate, otherAggregate}
**`hierarchyLevelName`**

```json
"hierarchyLevelName": [
  "string"
]
```
The `hierarchyLevelName` provides an alternative way to describe hierarchy levels, using free text instead of a controlled vocabulary. The use of `hierarchyLevel` is preferred to the use of `hierarchyLevelName`.
**`contact`**

The `contact` element is described in the common elements section of this chapter. When associated with the metadata, it is used to identify the person(s) or organization(s) in charge of the creation, edition, and maintenance of the metadata. The contact(s) responsible for the metadata are not necessarily the ones responsible for the creation, edition, and maintenance of the dataset/service; the latter will be documented in the dataset identification elements of the metadata file.
**`dateStamp`**

The date stamp associated with the metadata. The metadata date stamp may be automatically filled by metadata editors, and will ideally use the standard ISO 8601 date format: YYYY-MM-DD (possibly with a time).
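A date stamp in the ISO 8601 format mentioned above can be produced directly with Python's standard library:

```python
from datetime import date, datetime, timezone

# ISO 8601 date (YYYY-MM-DD) for the metadata dateStamp
stamp = date.today().isoformat()

# Or with a time component, in UTC
stamp_with_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(stamp, stamp_with_time)
```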
**`metadataStandardName`**

The name of the geographic metadata standard used to describe the resource. The recommended values are:
**`metadataStandardVersion`**

The version of the metadata standard being used. It is good practice to enter the standard's inception/revision year. ISO standards are revised with an average periodicity of 10 years. Although the ISO TC211 geographic information metadata standards have been reviewed, it is still accepted to refer to the original version of the standard, as many information systems/catalogs still make use of that version. The recommended values are:
**`dataSetURI`**

A unique resource identifier for the dataset, such as a web link that uniquely identifies the dataset. The use of a Digital Object Identifier (DOI) is recommended.
Geographic data can be diverse and complex. Users need detailed information to discover data and to use them in an informed and responsible manner. The core of the information on the data will be provided in various sections of the metadata file. This includes information on the type of data, on the coordinate system being used, on the scope and coverage of the data, on the format and location of the data, on possible quality issues that users need to be aware of, and more. The table below summarizes the main metadata sections, by order of appearance in the ISO 19139 specification.
+"description": {
+"spatialRepresentationInfo": [],
+ "referenceSystemInfo": [],
+ "identificationInfo": [],
+ "contentInfo": [],
+ "distributionInfo": {},
+ "dataQualityInfo": [],
+ "metadataMaintenance": {}
+ }
| Section | Description | Usability in dataset metadata | Usability in service metadata |
|---|---|---|---|
| `spatialRepresentationInfo` | The spatial representation of the dataset. A distinction is made between vector and grid (raster) spatial representations. | Yes | - |
| `referenceSystemInfo` | The reference systems used in the resource. In practice, this will often be limited to the geographic coordinate system. | Yes | Yes |
| `identificationInfo` | Identifies the resource, including descriptive elements (e.g., title, purpose, abstract, keywords) and contact(s) having a role in the resource provision. See details below. | Yes | Yes |
| `contentInfo` | The content of a dataset resource, i.e., how the dataset is structured (dimensions, attributes, variables, etc.). In the case of vector datasets, this relates to separate metadata files compliant with the ISO 19110 standard (Feature Catalogue). In the case of raster/gridded data, this is covered by the ISO 19115-2 extension for imagery and gridded data. | Yes | - |
| `distributionInfo` | The mode(s) of distribution of the resource (format, online resources), and by whom it is distributed. | Yes | Yes |
| `dataQualityInfo` | The quality reports on the resource (dataset or service) and, in the case of datasets, the provenance/lineage information giving the process steps performed to obtain the dataset. | Yes | Yes |
| `metadataMaintenance` | The metadata maintenance cycle operated for the resource. | Yes | Yes |
These sections are described in more detail below.
**`spatialRepresentationInfo`**

```json
"spatialRepresentationInfo": [
  {
    "vectorSpatialRepresentation": {
      "topologyLevel": "string",
      "geometricObjects": [
        {
          "geometricObjectType": "string",
          "geometricObjectCount": 0
        }
      ]
    },
    "gridSpatialRepresentation": {
      "numberOfDimensions": 0,
      "axisDimensionProperties": [
        {
          "dimensionName": "string",
          "dimensionSize": 0,
          "resolution": 0
        }
      ],
      "cellGeometry": "string",
      "transformationParameterAvailability": true
    }
  }
]
```
Information on the spatial representation is critical to properly describe a geospatial dataset. The ISO/TS 19139 distinguishes two types of spatial representations, characterized by different properties.
The vector spatial representation describes the topology level and the geometric objects of vector datasets using the following two properties:

- The topology level (`topologyLevel`) is the type of topology used in the vector spatial dataset. The ISO 19139 provides a controlled vocabulary with the following options: {geometryOnly, topology1D, planarGraph, fullPlanarGraph, surfaceGraph, fullSurfaceGraph, topology3D, fullTopology3D, abstract}. In most cases, vector datasets will be described as `geometryOnly`, which covers common geometry types (points, lines, polygons).
- The geometric objects (`geometricObjects`) will define:
  - The geometric object type (`geometricObjectType`): the type of geometry handled. Possible values are: {complex, composite, curve, point, solid, surface}.
  - The geometric object count (`geometricObjectCount`): the number (count) of geometries in the dataset.

In the case of a homogeneous geometry type, a single `geometricObjects` element can be defined. For complex geometries (a mixture of various geometry types), one `geometricObjects` element will be defined for each geometry type.
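The "one entry per geometry type" rule can be sketched as a small aggregation; the feature list below is invented for the example:

```python
from collections import Counter

# Invented example: geometry types of the features in a vector dataset
feature_geometries = ["point", "point", "curve", "point", "surface"]

# Build one geometricObjects entry per geometry type, with its count
geometric_objects = [
    {"geometricObjectType": gtype, "geometricObjectCount": n}
    for gtype, n in sorted(Counter(feature_geometries).items())
]
print(geometric_objects)
```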
The grid spatial representation describes gridded (raster) data using the following three properties:

- The number of dimensions (`numberOfDimensions`) in the grid.
- The dimension properties (`axisDimensionProperties`): a list of dimensions including, for each dimension:
  - The name of the dimension type (`dimensionName`): the ISO 19139 provides a controlled vocabulary with the following options: {row, column, vertical, track, crossTrack, line, sample, time}. In the Ethiopia population density file we used as an example of raster data, the dimension types will be *row* and *column*, as the file is a spatial 2D raster. If we had data with elevation or time dimensions, we would use respectively the *vertical* and *time* dimension types.
  - The dimension size (`dimensionSize`): the length of the dimension.
  - The dimension resolution (`resolution`): a resolution number associated with a unit of measurement. This is the resolution of the grid cell dimension.
- The cell geometry (`cellGeometry`): the type of geometry used for grid cells. Possible values are: {point, area, voxel, stratum}. Most “grids” are commonly area-based, but in principle a grid goes beyond this and the grid cells can target a point, an area, or a volume.
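For a regular grid, the resolution of a dimension follows from the geographic extent and the dimension size; a sketch with illustrative numbers (not taken from an actual raster file):

```python
# Illustrative: a raster covering longitudes 33E..48E with 18000 columns
west, east = 33.0, 48.0
n_columns = 18000

# Resolution of the "column" dimension, in decimal degrees per cell
column_resolution = (east - west) / n_columns
print(column_resolution)
```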
**`referenceSystemInfo`**

The reference system(s) typically (but not necessarily) refer to the geographic reference system of the dataset. Multiple reference systems can be listed if a dataset is distributed with different spatial reference systems. This block of elements may also apply to service metadata; a spatial web service may support several map projections / geographic coordinate reference systems.
+"referenceSystemInfo": [
+{
+ "code": "string",
+ "codeSpace": "string"
+ }
+ ]
A reference system is defined by two properties:

- The code (`code`): the identifier of the reference system, typically expressed as a Spatial Reference IDentifier (SRID) number. For example, the SRID of the World Geodetic System (WGS 84) is 4326.
- The code space (`codeSpace`): the authority in charge of the registry in which the reference system is listed, generally `EPSG` (as most geographic reference systems are registered in it). Codes from other authorities can be used to define ad-hoc projections.
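As an illustration, the WGS 84 system mentioned above could be declared as follows (a sketch; the element names come from the schema and the SRID from the text):

```python
import json

# referenceSystemInfo entry for WGS 84: EPSG is the code space
# (registry authority), 4326 the SRID within that registry.
reference_system_info = [
    {"code": "4326", "codeSpace": "EPSG"}
]

print(json.dumps(reference_system_info))
```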
The main reference system registry is EPSG, which provides a “search by name” tool for users who need to find an SRID (global or local/country-specific). Other websites, such as http://epsg.io/ and https://spatialreference.org/, also reference geographic systems but are not authoritative sources. The advantage of these sites is that they go beyond the EPSG registry and handle other specific registries provided by vendors like ESRI.

The following ESRI projections could be relevant, in particular those in support of world equal-area projected maps (maps conserving area proportions):
**`identificationInfo`**

The identification information (`identificationInfo`) is where the citation elements of the resource are provided. This may include descriptive information like `title`, `abstract`, `purpose`, `keywords`, etc., and the identification of the parties/contact(s) associated with the resource, such as the owner, publisher, co-authors, etc. Providing and publishing detailed information in these elements will contribute significantly to improving the discoverability of the data.
"identificationInfo": [
+{
+ "citation": {},
+ "abstract": "string",
+ "purpose": "string",
+ "credit": "string",
+ "status": "string",
+ "pointOfContact": [],
+ "resourceMaintenance": [],
+ "graphicOverview": [],
+ "resourceFormat": [],
+ "descriptiveKeywords": [],
+ "resourceConstraints": [],
+ "resourceSpecificUsage": [],
+ "aggregationInfo": {},
+ "extent": {},
+ "spatialRepresentationType": "string",
+ "spatialResolution": {},
+ "language": [],
+ "characterSet": [],
+ "topicCategory": [],
+ "supplementalInformation": "string",
+ "serviceIdentification": {}
+ }
+ ]
The identification of a resource includes elements that are common to both datasets and data services, and others that are specific to the type of resource. The following table summarizes the identification elements that can be used for dataset, service, or both.
Identification elements applicable to datasets and data services

The following metadata elements apply to resources of type dataset and service.
| Element | Description |
|---|---|
| `citation` | A set of citation elements that describe the dataset/service, including `title`, associated contacts, etc. For more details, see the section on common elements. |
| `abstract` | An abstract for the dataset/service resource |
| `purpose` | A statement describing the purpose of the dataset/service resource |
| `credit` | Credit information |
| `status` | Status of the resource, with the following recommended controlled vocabulary: {completed, historicalArchive, obsolete, onGoing, planned, required, underDevelopment, final, pending, retired, superseded, tentative, valid, accepted, notAccepted, withdrawn, proposed, deprecated} |
| `pointOfContact` | One or more points of contact associated with the resource: people who can be contacted for information on the dataset/service. For more details, see the contact section in the common elements section of the chapter. |
| `resourceMaintenance` | Information on how the resource is maintained, essentially the maintenance and update frequency (`maintenanceAndUpdateFrequency`). This frequency should be chosen among the values recommended by the ISO 19139 standard: {continual, daily, weekly, fortnightly, monthly, quarterly, biannually, annually, asNeeded, irregular, notPlanned, unknown} |
| `graphicOverview` | One or more graphic overviews that provide a visual identification of the dataset/service, e.g., a link to a map overview image. A `graphicOverview` is defined with three properties: `fileName` (or URL), `fileDescription`, and optionally a `fileType`. |
| `resourceFormat` | Resource format(s) description. For more details on how to describe a format, see the common elements section of the chapter. |
| `descriptiveKeywords` | A set of keywords that describe the dataset. Keywords are grouped by keyword type, with the possibility to associate a thesaurus (if applicable). For more details on how to describe keywords, see the common elements section of the chapter. |
| `resourceConstraints` | Legal and/or security constraints associated with the resource. For more details on how to describe constraints, see the common elements section of the chapter. |
| `resourceSpecificUsage` | Information about specific usage(s) of the dataset/service, e.g., a research paper, a success story, etc. |
| `aggregationInfo` | Information on an aggregate or parent resource to which the resource belongs, i.e., a collection. |
Resource maintenance

```json
"resourceMaintenance": [
  {
    "maintenanceAndUpdateFrequency": "string"
  }
]
```
Graphic overview

```json
"graphicOverview": [
  {
    "fileName": "string",
    "fileDescription": "string",
    "fileType": "string"
  }
]
```
Resource specific usage

```json
"resourceSpecificUsage": [
  {
    "specificUsage": "string",
    "usageDateTime": "string",
    "userDeterminedLimitations": "string",
    "userContactInfo": []
  }
]
```

For `userContactInfo`, see the description of `contact` in the common elements section.
Aggregation information

```json
"aggregationInfo": {
  "aggregateDataSetName": "string",
  "aggregateDataSetIdentifier": "string",
  "associationType": "string",
  "initiativeType": "string"
}
```
Identification elements applicable to datasets

The following metadata elements are specific to resources of type dataset.

| Element | Description |
|---|---|
| `spatialRepresentationType` | The spatial representation type of the dataset. Values should be selected from the following controlled vocabulary: {vector, grid, textTable, tin, stereoModel, video} |
| `spatialResolution` | The spatial resolution of the data, as a numeric value associated with a unit of measure |
| `language` | The language used in the dataset |
| `characterSet` | The character set encoding used in the dataset |
| `topicCategory` | The topic category(ies) characterizing the dataset. Values should be selected from the following controlled vocabulary: {farming, biota, boundaries, climatologyMeteorologyAtmosphere, economy, elevation, environment, geoscientificInformation, health, imageryBaseMapsEarthCover, intelligenceMilitary, inlandWaters, location, oceans, planningCadastre, society, structure, transportation, utilitiesCommunication, extraTerrestrial, disaster} |
| `extent` | Defines the spatial (horizontal and vertical) and temporal region to which the content of the resource applies. For more details, see the common elements section of the chapter. |
| `supplementalInformation` | Any additional information, provided as free text |
Spatial resolution, language, character set, and topic category

```json
"spatialResolution": {
  "uom": "string",
  "value": 0
},
"language": ["string"],
"characterSet": [
  {
    "codeListValue": "string",
    "codeList": "string"
  }
],
"topicCategory": ["string"]
```
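As an illustration, the fragment below (a Python dictionary mirroring the JSON structures above) fills these elements for a hypothetical English-language dataset with a 30-meter spatial resolution; all values are illustrative:

```python
# Illustrative values for the dataset-specific identification elements above.
dataset_identification = {
    "spatialResolution": {"uom": "m", "value": 30},            # 30-meter resolution
    "language": ["eng"],
    "characterSet": [{"codeListValue": "utf8", "codeList": ""}],
    "topicCategory": ["environment"]  # value from the ISO 19139 controlled vocabulary
}
```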
Identification elements applicable to data services

The following metadata elements are specific to resources of type service.

| Element | Description |
|---|---|
| `serviceType` | The type of service (as free text), e.g., OGC:WMS |
| `serviceTypeVersion` | The version of the service, e.g., 1.3.0 |
| `accessProperties` | Access properties, including description of `fees`, `plannedAvailableDateTime`, `orderingInstructions`, and `turnaround` |
| `restrictions` | Legal and/or security constraints associated with the service. For more details, see the common elements section of the chapter. |
| `keywords` | Set of service keywords. For more details, see the common elements section of the chapter. |
| `extent` | Defines the spatial (horizontal and vertical) and temporal region to which the service applies (if applicable). See the common elements section of the chapter. |
| `coupledResource` | Resource(s), if any, coupled to a service operation |
| `couplingType` | The type of coupling between service and coupled resources. Values should be selected from the following controlled vocabulary: {loose, mixed, tight} |
| `containsOperations` | Operation(s) available for the service. See below for details. |
| `operatesOn` | List of dataset identifiers on which the service operates |
"serviceIdentification": {
+"serviceType": "string",
+ "serviceTypeVersion": "string",
+ "accessProperties": {
+ "fees": "string",
+ "plannedAvailableDateTime": "string",
+ "orderingInstructions": "string",
+ "turnaround": "string"
+ },
+ "restrictions": [],
+ "keywords": [],
+ "coupledResource": [
+ {
+ "operationName": "string",
+ "identifier": "string"
+ }
+ ],
+ "couplingType": "string",
+ "containsOperations": [
+ {
+ "operationName": "string",
+ "DCP": [
+ "string"
+ ],
+ "operationDescription": "string",
+ "invocationName": "string",
+ "parameters": [
+ {
+ "name": "string",
+ "direction": "string",
+ "description": "string",
+ "optionality": "string",
+ "repeatability": true,
+ "valueType": "string"
+ }
+ ],
+ "connectPoint": {
+ "linkage": "string",
+ "name": "string",
+ "description": "string",
+ "protocol": "string",
+ "function": "string"
+ },
+ "dependsOn": [
+ { }
+ ]
+ }
+ ],
+ "operatesOn": [
+ {
+ "uuidref": "string"
+ }
+ ]
+ }
A data service operation is described with the following metadata elements:

| Element | Description |
|---|---|
| `operationName` | Name of the operation |
| `DCP` | Distributed Computing Platform. Recommended value: 'WebServices' |
| `operationDescription` | Description of the operation |
| `invocationName` | Name of the operation as invoked when using the service |
| `parameters` | Operation parameter(s). A parameter can be defined with several properties, including `name`, `description`, `direction` ('in', 'out', or 'inout'), `optionality` ('Mandatory' or 'Optional'), `repeatability` (true/false), and `valueType` (the type of value expected, e.g., string, numeric, etc.) |
| `connectPoint` | URL entry points, defined as online resource(s) |
| `dependsOn` | Service operation(s) the operation depends on |
The service operation(s) descriptions are recommended when the service does not support the self-description of its operations.
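As an illustration, the sketch below describes a GetMap operation of a hypothetical OGC WMS service using the elements above; the endpoint URL, descriptions, and parameter values are illustrative assumptions, expressed as a Python dictionary mirroring the JSON structure:

```python
# Sketch: one containsOperations entry for a hypothetical WMS service.
get_map_operation = {
    "operationName": "GetMap",
    "DCP": ["WebServices"],  # recommended value
    "operationDescription": "Returns a rendered map image for a set of layers",
    "invocationName": "GetMap",
    "parameters": [
        {"name": "layers", "direction": "in",
         "description": "Comma-separated list of layer names",
         "optionality": "Mandatory", "repeatability": True, "valueType": "string"}
    ],
    "connectPoint": {"linkage": "https://example.org/wms",  # hypothetical URL
                     "name": "WMS endpoint",
                     "description": "Service endpoint",
                     "protocol": "OGC:WMS", "function": "download"}
}
```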
Content information (`contentInfo`)

For vector datasets, ISO 19115-1 does not provide all necessary elements; the structure of vector datasets is therefore documented using the `featureCatalogueDescription` of the ISO 19110 (Feature Catalogue) standard. ISO 19110 is included in the unified ISO 19139 XML specification.

Feature catalogue description (`featureCatalogueDescription`)
The feature catalogue description links the structural metadata (ISO 19110) to the dataset metadata (ISO 19115). This is required when the structural metadata is not contained in the same metadata file as the dataset metadata. The following elements are used to document this relationship:

| Element | Description |
|---|---|
| `complianceCode` | Indicates whether the dataset complies with the feature catalogue description |
| `language` | Language used in the feature catalogue |
| `includedWithDataset` | Indicates if the feature catalogue description is included with the dataset (essentially, as a downloadable resource) |
| `featureCatalogueCitation` | A citation that references the ISO 19110 feature catalogue. As best practice, this citation will essentially use two properties: `uuidref`, giving the persistent identifier of the feature catalogue, and `href`, giving a web link to access the ISO 19110 feature catalogue. |
"contentInfo": [
+{
+ "featureCatalogueDescription": {
+ "complianceCode": true,
+ "language": "string",
+ "includedWithDataset": true,
+ "featureCatalogueCitation": {
+ "title": "string",
+ "alternateTitle": "string",
+ "date": [
+ {
+ "date": "string",
+ "type": "string"
+ }
+ ],
+ "edition": "string",
+ "editionDate": "string",
+ "identifier": {
+ "authority": "string",
+ "code": null
+ },
+ "citedResponsibleParty": [],
+ "presentationForm": [
+ "string"
+ ],
+ "series": {
+ "name": "string",
+ "issueIdentification": "string",
+ "page": "string"
+ },
+ "otherCitationDetails": "string",
+ "collectiveTitle": "string",
+ "ISBN": "string",
+ "ISSN": "string"
+ }
+ },
+ "coverageDescription": {
+ "contentType": "string",
+ "dimension": [
+ {
+ "name": "string",
+ "type": "string"
+ }
+ ]
+ }
+ }
+ ]
The feature catalogue can be an external metadata file or document; we embedded it in our JSON schema. See the section on the ISO 19110 Feature Catalogue below.

Coverage description (`coverageDescription`)

The structure of raster/gridded datasets can be described using the ISO 19115-2 standard, using the `coverageDescription` element and the following properties:
| Element | Description |
|---|---|
| `contentType` | Type of coverage content, e.g., 'image'. It is recommended to define the content type using the controlled vocabulary suggested by ISO 19139, which contains the following values: {image, thematicClassification, physicalMeasurement, auxillaryInformation, qualityInformation, referenceInformation, modelResult, coordinate, auxilliaryData} |
| `dimension` | List of coverage dimensions. Each dimension can be defined by a `name` and a `type`. For the `type`, a good practice is to rely on primitive data types defined in the XML Schema: https://www.w3.org/2009/XMLSchema/XMLSchema.xsd |
| `rangeElementDescription` | List of range element descriptions. Each range element description has a `name`/`definition` (corresponding to the dimension considered) and a list of accepted values as `rangeElement`. For example, for a time series defined at specific instants in time, the Time dimension of the spatio-temporal coverage could be defined here, giving the list of time instants supported by the time series. |
Distribution information (`distributionInfo`)

The distribution information documents who the actual distributor of the resource is, and other aspects of the distribution in terms of format and online resources. This information is provided using the following elements:

| Element | Description |
|---|---|
| `distributionFormat` | Format(s) definitions. See the common elements section for information on how to document a format. |
| `distributor` | Contact(s) in charge of the resource distribution. See the common elements section for information on how to document a contact. |
| `transferOptions` | Transfer option(s) to get the resource. To align with ISO 19139, these resources should be set in an `onLine` element where all available online resources can be listed, or as `offLine` for media not available online. |
"distributionFormat": [
+{
+ "name": "string",
+ "version": "string",
+ "amendmentNumber": "string",
+ "specification": "string",
+ "fileDecompressionTechnique": "string",
+ "FormatDistributor": {}
+ }
+ ]
Data quality information (`dataQualityInfo`)

Information on the quality of the data is useful to secondary analysts, to ensure proper use of the data. Data quality is documented in the `dataQualityInfo` section using three main metadata elements:

| Element | Description |
|---|---|
| `scope` | Scope / hierarchy level targeted by the data quality information section. ISO 19139 recommends the use of a controlled vocabulary with the following options: {attribute, attributeType, collectionHardware, collectionSession, dataset, series, nonGeographicDataset, dimensionGroup, feature, featureType, propertyType, fieldSession, software, service, model, tile, initiative, stereomate, sensor, platformSeries, sensorSeries, productionSeries, transferAggregate, otherAggregate} |
| `report` | Report(s) describing the quality information, for example an INSPIRE metadata compliance report. To see how to create a data quality conformance report, see details below. |
| `lineage` | The lineage provides the elements needed to describe the process that led to the production of the data. In combination with `report`, the lineage allows data users to assess quality conformance. This is an important metadata element. |
"dataQualityInfo": [
+{
+ "scope": "string",
+ "report": [],
+ "lineage": {
+ "statement": "string",
+ "processStep": []
+ }
+ }
+ ]
Report (`report`)

```json
"report": [
  {
    "DQ_DomainConsistency": {
      "result": {
        "nameOfMeasure": [],
        "measureIdentification": "string",
        "measureDescription": "string",
        "evaluationMethodType": [],
        "evaluationMethodDescription": "string",
        "evaluationProcedure": {},
        "dateTime": "string",
        "result": []
      }
    }
  }
]
```
A `report` describes the result of an assessment of the conformance (or non-conformance) of a resource to consistency rules. The `result` is the main component of a report, and can be described with the following elements:

- `nameOfMeasure`: One or more measure names used for the data quality report
- `measureIdentification`: Identification of the measure, using a unique identifier (if applicable)
- `measureDescription`: A description of the measure
- `evaluationMethodType`: Type of evaluation method. ISO 19139 recommends the use of a controlled vocabulary with the following options: {directInternal, directExternal, indirect}
- `evaluationMethodDescription`: Description of the evaluation method
- `evaluationProcedure`: Citation of the evaluation procedure (as a citation element)
- `dateTime`: Date and time when the report was established
- `result`: Result(s) associated with the report. Each result should be described with a `specification`, an `explanation` (of the conformance or non-conformance result), and a `pass` property indicating whether the result was positive (true) or not (false).

```json
"result": {
  "nameOfMeasure": ["string"],
  "measureIdentification": "string",
  "measureDescription": "string",
  "evaluationMethodType": ["string"],
  "evaluationMethodDescription": "string",
  "evaluationProcedure": {
    "title": "string",
    "alternateTitle": "string",
    "date": [
      {
        "date": "string",
        "type": "string"
      }
    ],
    "edition": "string",
    "editionDate": "string",
    "identifier": {
      "authority": "string",
      "code": null
    },
    "citedResponsibleParty": [],
    "presentationForm": ["string"],
    "series": {
      "name": "string",
      "issueIdentification": "string",
      "page": "string"
    },
    "otherCitationDetails": "string",
    "collectiveTitle": "string",
    "ISBN": "string",
    "ISSN": "string"
  },
  "dateTime": "string",
  "result": []
}
```
Lineage (`lineage`)

The `lineage` provides a structured solution to describe the workflow that led to the production of the data/service, defined by:

- `statement`: A statement describing the workflow performed
- `processStep`: The process steps performed. Each `processStep` is defined by the following elements:
  - `description`: Description of the process step performed
  - `rationale`: Rationale of the process step
  - `dateTime`: Date of the processing
  - `processor`: Contact(s) acting as processor(s) for the step
  - `source`: Source(s) used for the process step. Each `source` can have a `description` and a `sourceCitation` (as a citation element).

```json
"lineage": {
  "statement": "string",
  "processStep": [
    {
      "description": "string",
      "rationale": "string",
      "dateTime": "string",
      "processor": [],
      "source": [
        {
          "description": "string",
          "sourceCitation": {
            "title": "string",
            "alternateTitle": "string",
            "date": [
              {
                "date": "string",
                "type": "string"
              }
            ],
            "edition": "string",
            "editionDate": "string",
            "identifier": {
              "authority": "string",
              "code": null
            },
            "citedResponsibleParty": [],
            "presentationForm": ["string"],
            "series": {
              "name": "string",
              "issueIdentification": "string",
              "page": "string"
            },
            "otherCitationDetails": "string",
            "collectiveTitle": "string",
            "ISBN": "string",
            "ISSN": "string"
          }
        }
      ]
    }
  ]
}
```
Metadata maintenance (`metadataMaintenanceInfo`)

The `metadataMaintenanceInfo` and `maintenanceAndUpdateFrequency` elements provide information on the maintenance of the metadata, including the frequency of updates. The `metadataMaintenanceInfo` element is a free-text element. The information provided in `maintenanceAndUpdateFrequency` should be chosen from the values recommended by the ISO 19139 controlled vocabulary: {continual, daily, weekly, fortnightly, monthly, quarterly, biannually, annually, asNeeded, irregular, notPlanned, unknown}.

```json
"metadataMaintenance": {
  "maintenanceAndUpdateFrequency": "string"
}
```
Feature catalogue (`feature_catalogue`)

We describe below how the ISO 19110 feature catalogue is used to document the structure of a vector dataset (complementing ISO 19115-1). This is equivalent to producing a "data dictionary" for the variables/features included in a vector dataset. An example of the implementation of such a feature catalogue using R is provided in the Examples section of this chapter (see Example 3 in section 5.5.3).

| Element | Description |
|---|---|
| `name` | Name of the feature catalogue |
| `scope` | Subject domain(s) of feature types defined in this feature catalogue |
| `fieldOfApplication` | One or more fields of application for this feature catalogue |
| `versionNumber` | Version number of this feature catalogue, which may include both a major version number or letter and a sequence of minor release numbers or letters, such as '3.2.4a'. The format of this attribute may differ between cataloguing authorities. |
| `versionDate` | Version date |
| `producer` | The `responsibleParty` in charge of the feature catalogue production |
| `functionalLanguage` | Formal functional language in which the feature operation formal definition occurs in this feature catalogue |
| `featureType` | One or more feature type(s) defined in the feature catalogue. Defining several feature types can be considered when targeting various forms of a dataset (e.g., simplified vs. complete set of attributes, raw vs. aggregated, etc.). In practice, a simple ISO 19110 feature catalogue will reference one feature type describing the unique dataset structure. See details below. |
"feature_catalogue": {
+"name": "string",
+ "scope": [],
+ "fieldOfApplication": [],
+ "versionNumber": "string",
+ "versionDate": {},
+ "producer": {},
+ "functionalLanguage": "string",
+ "featureType": []
+ }
The `featureType` is the actual data structure definition of a dataset (its data dictionary), and has the following properties:

| Element | Description |
|---|---|
| `typeName` | Text string that uniquely identifies this feature type within the feature catalogue that contains it |
| `definition` | Definition of the feature type |
| `code` | Code that uniquely identifies this feature type within the feature catalogue that contains it |
| `isAbstract` | Indicates whether the feature type is abstract |
| `aliases` | One or more aliases, as equivalent names of the feature type |
| `carrierOfCharacteristics` | Feature attribute(s) / column(s) definitions. See details below. |
"featureType": [
+{
+ "typeName": "string",
+ "definition": "string",
+ "code": "string",
+ "isAbstract": true,
+ "aliases": [
+ "string"
+ ],
+ "carrierOfCharacteristics": [
+ {
+ "memberName": "string",
+ "definition": "string",
+ "cardinality": {
+ "lower": 0,
+ "upper": 0
+ },
+ "code": "string",
+ "valueMeasurementUnit": "string",
+ "valueType": "string",
+ "listedValue": [
+ {
+ "label": "string",
+ "code": "string",
+ "definition": "string"
+ }
+ ]
+ }
+ ]
+ }
+ ]
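To illustrate, the sketch below builds a minimal `featureType` (data dictionary) for a hypothetical administrative-boundaries layer with two attributes; all names and values are invented for the example, expressed as a Python dictionary mirroring the JSON structure above:

```python
# Sketch: a featureType acting as a data dictionary for a hypothetical vector layer.
feature_type = {
    "typeName": "admin_boundaries",
    "definition": "Administrative boundary polygons",
    "code": "admin_boundaries",
    "isAbstract": False,
    "carrierOfCharacteristics": [
        {"memberName": "Country code", "code": "iso3",   # actual column name
         "definition": "ISO 3166-1 alpha-3 country code",
         "cardinality": {"lower": 1, "upper": 1},        # simple tabular data: 1-1
         "valueType": "xsd:string",
         "listedValue": [{"label": "Bangladesh", "code": "BGD", "definition": ""}]},
        {"memberName": "Area", "code": "area_km2",
         "definition": "Polygon area",
         "cardinality": {"lower": 1, "upper": 1},
         "valueMeasurementUnit": "km2",
         "valueType": "xsd:double"}
    ]
}
```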
Each feature attribute (i.e., each column that is a member of the vector data structure) is defined as a carrier of characteristics. Each set of characteristics can be defined with the following properties:

| Element | Description |
|---|---|
| `memberName` | Name of the property member of the feature type |
| `definition` | Definition of the property member |
| `cardinality` | Definition of the member type cardinality. The cardinality is a set of two properties: lower cardinality (`lower`) and upper cardinality (`upper`). For simple tabular datasets, the cardinality will be 1-1. Multiple cardinalities (e.g., 1-N, N-N) apply particularly to feature catalogues/types that describe relational databases. |
| `code` | Code for the attribute member of the feature type. Corresponds to the actual column name in an attributes table. |
| `valueMeasurementUnit` | Measurement unit of the values (in case the feature member corresponds to a measurable variable) |
| `valueType` | Type of value. A good practice is to rely on primitive data types defined in the XML Schema: https://www.w3.org/2009/XMLSchema/XMLSchema.xsd |
| `listedValue` | List of controlled value(s) used in the attribute member. Each value is an object composed of 1) a `label`, 2) a `code` (as contained in the dataset), and 3) a `definition`. This element is used when the feature member relates to reference datasets, such as code lists or registers, e.g., lists of countries, land cover types, etc. |
"provenance": [
+{
+ "origin_description": {
+ "harvest_date": "string",
+ "altered": true,
+ "base_url": "string",
+ "identifier": "string",
+ "date_stamp": "string",
+ "metadata_namespace": "string"
+ }
+ }
+ ]
`provenance` [Optional ; Repeatable]

Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. These elements are NOT part of the ISO 19139 metadata standard.

- `origin_description` [Required ; Not repeatable]
  The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
  - `harvest_date` [Required ; Not repeatable ; String]
    The date the metadata were harvested.
  - `altered` [Optional ; Not repeatable ; Boolean]
    Indicates whether the harvested metadata were altered. In some cases, the unique identifier of the entry (element `idno` in the Study Description / Title Statement section) will be modified when published in a new catalog.
  - `base_url` [Required ; Not repeatable ; String]
    The URL from which the metadata were harvested.
  - `identifier` [Optional ; Not repeatable ; String]
    The unique identifier of the entry (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
  - `date_stamp` [Optional ; Not repeatable ; String]
    The date stamp of the metadata record in the source catalog.
  - `metadata_namespace` [Optional ; Not repeatable ; String]

`lda_topics` [Optional ; Not repeatable]
"lda_topics": [
+{
+"model_info": [
+{
+ "source": "string",
+ "author": "string",
+ "version": "string",
+ "model_id": "string",
+ "nb_topics": 0,
+ "description": "string",
+ "corpus": "string",
+ "uri": "string"
+ }
+ ],
+ "topic_description": [
+ {
+ "topic_id": null,
+ "topic_score": null,
+ "topic_label": "string",
+ "topic_words": [
+ {
+ "word": "string",
+ "word_weight": 0
+ }
+ ]
+ }
+ ]
+ }
+ ]
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document (in this case, the "document" is a compilation of elements from the dataset metadata) can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference provides the share that each topic represents in the document. The sum of all represented topics is 1 (100%).

The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition.

Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to use the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.
The `lda_topics` element includes the following metadata fields:

- `model_info` [Optional ; Not repeatable]
  - `source` [Optional ; Not repeatable ; String]
  - `author` [Optional ; Not repeatable ; String]
  - `version` [Optional ; Not repeatable ; String]
  - `model_id` [Optional ; Not repeatable ; String]
  - `nb_topics` [Optional ; Not repeatable ; Numeric]
  - `description` [Optional ; Not repeatable ; String]
  - `corpus` [Optional ; Not repeatable ; String]
  - `uri` [Optional ; Not repeatable ; String]
- `topic_description` [Optional ; Repeatable]
  - `topic_id` [Optional ; Not repeatable ; String]
  - `topic_score` [Optional ; Not repeatable ; Numeric]
  - `topic_label` [Optional ; Not repeatable ; String]
  - `topic_words` [Optional ; Not repeatable]
    - `word` [Optional ; Not repeatable ; String]
    - `word_weight` [Optional ; Not repeatable ; Numeric]

Example of the `lda_topics` element populated in R:

```r
lda_topics = list(

  list(

    model_info = list(
      list(source      = "World Bank, Development Data Group",
           author      = "A.S.",
           version     = "2021-06-22",
           model_id    = "Mallet_WB_75",
           nb_topics   = 75,
           description = "LDA model, 75 topics, trained on Mallet",
           corpus      = "World Bank Documents and Reports (1950-2021)",
           uri         = "")
    ),

    topic_description = list(

      list(topic_id    = "topic_27",
           topic_score = 32,
           topic_label = "Education",
           topic_words = list(list(word = "school",    word_weight = ""),
                              list(word = "teacher",   word_weight = ""),
                              list(word = "student",   word_weight = ""),
                              list(word = "education", word_weight = ""),
                              list(word = "grade",     word_weight = ""))),

      list(topic_id    = "topic_8",
           topic_score = 24,
           topic_label = "Gender",
           topic_words = list(list(word = "women",  word_weight = ""),
                              list(word = "gender", word_weight = ""),
                              list(word = "man",    word_weight = ""),
                              list(word = "female", word_weight = ""),
                              list(word = "male",   word_weight = ""))),

      list(topic_id    = "topic_39",
           topic_score = 22,
           topic_label = "Forced displacement",
           topic_words = list(list(word = "refugee",   word_weight = ""),
                              list(word = "programme", word_weight = ""),
                              list(word = "country",   word_weight = ""),
                              list(word = "migration", word_weight = ""),
                              list(word = "migrant",   word_weight = ""))),

      list(topic_id    = "topic_40",
           topic_score = 11,
           topic_label = "Development policies",
           topic_words = list(list(word = "development", word_weight = ""),
                              list(word = "policy",      word_weight = ""),
                              list(word = "national",    word_weight = ""),
                              list(word = "strategy",    word_weight = ""),
                              list(word = "activity",    word_weight = "")))
    )
  )
)
```
`embeddings` [Optional ; Repeatable]

In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. In this case, the text is a compilation of selected elements of the dataset metadata. The vectors are generated by submitting the text to a pre-trained word embedding model (possibly via an API).

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
+"embeddings": [
+{
+ "id": "string",
+ "description": "string",
+ "date": "string",
+ "vector": { }
+ }
+ ]
The `embeddings` element contains four metadata fields:

- `id` [Optional ; Not repeatable ; String]
  A unique identifier of the word embedding model used to generate the vector.
- `description` [Optional ; Not repeatable ; String]
  A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- `date` [Optional ; Not repeatable ; String]
  The date the model was trained (or a version date for the model).
- `vector` [Required ; Not repeatable ; Object]
  The numeric vector representing the document, provided as an object (array or string), e.g., `[1,4,3,5,7,9]`.
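To illustrate how stored vectors are exploited by a semantic search engine, the sketch below ranks documents by cosine similarity between their embedding vectors and a query vector; the three-dimensional vectors are toy examples (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [1.0, 0.0, 1.0]                       # vector inferred from the search query
docs = {"doc_A": [1.0, 0.1, 0.9],             # stored embedding vectors (illustrative)
        "doc_B": [0.0, 1.0, 0.0]}

# Rank catalog entries by semantic similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
# doc_A ranks first: its vector points in nearly the same direction as the query.
```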
`additional` [Optional ; Not repeatable]

The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
In this first example, we use a geographic dataset that contains the outline of Rohingya refugee camps, settlements, and sites in Cox's Bazar, Bangladesh. The dataset was imported from the Humanitarian Data Exchange website on March 3, 2021.

We include in the metadata a simple description of the features (variables) contained in the shape files. This information will significantly increase data discoverability, as it provides information on the content of the data files (which is not described elsewhere in the metadata).
Generating the metadata using R
library(nadar)
library(sf)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_geo_data/")

thumb = "shape_camps.JPG"

# Download the data files (if not already downloaded)
# Note: the data are frequently updated; the links below may have become invalid.
# Visit: https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd for an update.

base_url = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/"
urls <- list(
  paste0(base_url, "7cec91fb-d0a8-4781-9f8d-9b69772ef2fd/download/210118_rrc_geodata_al1al2al3.gdb.zip"),
  paste0(base_url, "ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip"),
  paste0(base_url, "bd5351e7-3ffc-4eaa-acbc-c6d917b5549c/download/200908_rrc_outline_camp_al1.kmz"),
  paste0(base_url, "9d5693ec-eeb8-42ed-9b65-4c279f523276/download/200908_rrc_outline_block_al2.zip"),
  paste0(base_url, "ed119ae4-b13d-4473-9afe-a8c36e07870b/download/200908_rrc_outline_block_al2.kmz"),
  paste0(base_url, "0d2d87ae-52a5-4dca-b435-dcd9c617b417/download/210118_rrc_outline_subblock_al3.zip"),
  paste0(base_url, "6286c4a5-d2ab-499a-b019-a7f0c327bd5f/download/210118_rrc_outline_subblock_al3.kmz")
)

for(url in urls) {
  f <- basename(url)
  if (!file.exists(f)) download.file(url, destfile=f, mode="wb")
}

# Unzip and read the shape files to extract information
# The objects contain the number of features, layers, geodetic CRS, etc.

unzip("200908_rrc_outline_camp_al1.zip", exdir = "AL1")
al1 <- st_read("./AL1/200908_RRC_Outline_Camp_AL1.shp")

unzip("200908_rrc_outline_block_al2.zip", exdir = "AL2")
al2 <- st_read("./AL2/200908_RRC_Outline_Block_AL2.shp")

unzip("210118_rrc_outline_subblock_al3.zip", exdir = "AL3")
al3 <- st_read("./AL3/210118_RRC_Outline_SubBlock_AL3.shp")

# ---------------

id = "BGD_2021_COX_CAMPS_GEO_OUTLINE"

my_geo_metadata <- list(
+ metadata_information = list(
+ title = "(Demo) Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+ producers = list(list(name = "NADA team")),
+ production_date = "2022-02-18"
+
+ ),
+ description = list(
+
+ idno = id,
+
+ language = "eng",
+
+ characterSet = list(codeListValue = "utf8"),
+
+ hierarchyLevel = list("dataset"),
+
+ contact = list(
+ list(
+ organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+ contactInfo = list(
+ address = list(country = "Bangladesh"),
+ onlineResource = list(
+ linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ ),
+dateStamp = "2021-01-20",
+
+ metadataStandardName = "ISO 19115:2003/19139",
+
+ spatialRepresentationInfo = list(
+
+ # File 200908_rrc_outline_camp_al1.zip
+ list(
+ vectorSpatialRepresentationInfo = list(
+ topologyLevel = "geometryOnly",
+ geometricObjects = list(
+ geometricObjectType = "surface",
+ geometricObjectCount = "35"
+
+ )
+ )
+ ),
+ # File 200908_rrc_outline_block_al2.zip
+ list(
+ vectorSpatialRepresentationInfo = list(
+ topologyLevel = "geometryOnly",
+ geometricObjects = list(
+ geometricObjectType = "surface",
+ geometricObjectCount = "173"
+
+ )
+ )
+ ),
+ # File 210118_rrc_outline_subblock_al3.zip
+ list(
+ vectorSpatialRepresentationInfo = list(
+ topologyLevel = "geometryOnly",
+ geometricObjects = list(
+ geometricObjectType = "surface",
+ geometricObjectCount = "967"
+
+ )
+ )
+ )
+
+ ),
+ referenceSystemInfo = list(
+ list(code = "4326", codeSpace = "EPSG"),
+ list(code = "84", codespace = "WGS")
+
+ ),
+ identificationInfo = list(
+
+ list(
+
+ citation = list(
+ title = "Bangladesh, Outline of camps of Rohingya refugees in Cox's Bazar, January 2021",
+ date = list(
+ list(date = "2021-01-20", type = "creation")
+
+ ),citedResponsibleParty = list(
+ list(
+ organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+ contactInfo = list(
+ address = list(country = "Bangladesh"),
+ onlineResource = list(
+ linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ )
+ ),
+abstract = "These polygons were digitized through a combination of methodologies, originally using VHR satellite imagery and GPS points collected in the field, verified and amended according to Site Management Sector, RRRC, Camp in Charge (CiC) officers inputs, with technical support from other partners.",
+
+ purpose = "Inform the UNHCR operations (and other support agencies') in refugee camps in Cox's Bazar.",
+
+ credit = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+
+ status = "completed",
+
+ pointOfContact = list(
+ list(
+ organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+ contactInfo = list(
+ address = list(country = "Bangladesh"),
+ onlineResource = list(
+ linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/",
+ name = "Website"
+
+ )
+ ),role = "pointOfContact"
+
+ )
+ ),
+ resourceMaintenance = list(
+ list(maintenanceOrUpdateFrequency = "asNeeded")
+
+ ),
    graphicOverview = list(
+ list(fileName = "",
+ fileDescription = "",
+ fileType = "")
+
+ ),
+ resourceFormats = list(
+ list(name = "application/zip",
+ specification = "ESRI Shapefile (zipped)",
+ FormatDistributor = list(organisationName = "ESRI")
+
+ ),list(name = "application/vnd.google-earth.kmz",
+ specification = "KMZ file",
+ FormatDistributor = list(organisationName = "Google")
+
+ ),list(name = "ESRI Geodatabase",
+ FormatDistributor = list(organisationName = "ESRI")
+
+ )
+ ),
+ descriptiveKeywords = list(
+ list(keyword = "refugee camp"),
+ list(keyword = "forced displacement"),
+ list(keyword = "rohingya")
+
+ ),
+ resourceConstraints = list(
+ list(
+ legalConstraints = list(
+ uselimitation = list("License: http://creativecommons.org/publicdomain/zero/1.0/legalcode"),
+ accessConstraints = list("unrestricted"),
+ useConstraints = list("licenceUnrestricted")
+
+ )
+ )
+ ),
+ extent = list(
+ geographicElement = list(
+ list(
+ geographicBoundingBox = list(
+ southBoundLatitude = 20.91856,
+ westBoundLongitude = 92.12973,
+ northBoundLatitude = 21.22292,
+ eastBoundLongitude = 92.26863
+
+ )
+ )
+ )
+ ),
+ spatialRepresentationType = "vector",
+
+ language = list("eng")
+
+
+ )
+
+ ),
+ distributionInfo = list(
+
+ distributionFormat = list(
+ list(name = "application/zip",
+ specification = "ESRI Shapefile (zipped)",
+ FormatDistributor = list(organisationName = "ESRI")
+
+ ),list(name = "application/vnd.google-earth.kmz",
+ specification = "KMZ file",
+ FormatDistributor = list(organisationName = "Google")
+
+ ),list(name = "ESRI Geodatabase",
+ FormatDistributor = list(organisationName = "ESRI")
+
+ )
+ ),
+ distributor = list(
+ list(
+ organisationName = "United Nations Office for the Coordination of Humanitarian Affairs (OCHA)",
+ contactInfo = list(
+ onlineResource = list(
+ linkage = "https://data.humdata.org/dataset/outline-of-camps-sites-of-rohingya-refugees-in-cox-s-bazar-bangladesh",
+ name = "Website"
+
+ )
+ )
+ )#,
+ )
+ # transferOptions = list(
+ # list(
  # onLine = list(
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/7cec91fb-d0a8-4781-9f8d-9b69772ef2fd/download/210118_rrc_geodata_al1al2al3.gdb.zip",
+ # name = "210118_RRC_GeoData_AL1,AL2,AL3.gdb.zip",
+ # description = "This zipped geodatabase file (GIS) contains the Camp boundary (Admin level-1) and and camp-block boundary (admin level-2 or camp sub-division) and sub-block boundary of Rohingya refugee camps and administrative level-3 or sub block division of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip",
+ # name = "200908_RRC_Outline_Camp_AL1.zip",
+ # description = "This zipped shape file (GIS) contains the Camp boundary (Admin level-1) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/bd5351e7-3ffc-4eaa-acbc-c6d917b5549c/download/200908_rrc_outline_camp_al1.kmz",
+ # name = "200908_RRC_Outline_Camp_AL1.kmzKMZ",
+ # description = "This kmz file (Google Earth) contains the Camp boundary (Admin level-1) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/9d5693ec-eeb8-42ed-9b65-4c279f523276/download/200908_rrc_outline_block_al2.zip",
+ # name = "200908_RRC_Outline_Block_AL2.zip",
+ # description = "This zipped shape file (GIS) contains the camp-block boundary (admin level-2 or camp sub-division) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ed119ae4-b13d-4473-9afe-a8c36e07870b/download/200908_rrc_outline_block_al2.kmz",
+ # name = "200908_RRC_Outline_Block_AL2.kmzKMZ",
+ # description = "This kmz file (Google Earth) contains the camp-block boundary (admin level-2 or camp sub-division) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/0d2d87ae-52a5-4dca-b435-dcd9c617b417/download/210118_rrc_outline_subblock_al3.zip",
+ # name = "210118_RRC_Outline_SubBlock_AL3.zip",
+ # description = "This zipped shape file (GIS) contains the camp-sub-block (Admin level-3) of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/6286c4a5-d2ab-499a-b019-a7f0c327bd5f/download/210118_rrc_outline_subblock_al3.kmz",
+ # name = "210118_RRC_Outline_SubBlock_AL3.kmzKMZ",
+ # description = "This kmz file (Google Earth) contains the camp-sub-block (Admin level-3) of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # )
+ # )
+ # )
+ # )
+
+
+ ),
+dataQualityInfo = list(
+ list(
+ scope = "dataset",
+ lineage = list(
+ statement = "The camps are continuously expanding, and Camp Boundaries are structured around the GoB, RRRC official governance structure of the camps, taking into account the potential new land allocation. The database is kept as accurate as possible, given these challenges."
+
+ )
+ )
+ ),
+ metadataMaintenance = list(maintenanceAndUpdateFrequency = "asNeeded"),
+
+ feature_catalogue = list(
+
+ name = "Feature Catalogue dataset xxxxx",
+ scope = list("3 shape files: al1, al2, al3"),
+
+ featureType = list(
+ list(
+ typeName = "",
+ definition = "",
      carrierOfCharacteristics = list(
        list(
          memberName = 'District',
          definition = "Cox's Bazar"
        ),
        list(
          memberName = 'Upazila',
          definition = 'Teknaf, Ukhia'
        ),
        list(
          memberName = 'Settlement',
          definition = 'Collective site; Collective site with host community'
        ),
        list(
          memberName = 'Union',
          definition = 'Baharchhara; Nhilla; Palong Khali; Raja Palong; Whykong'
        ),
        list(
          memberName = 'Name_Alias',
          definition = 'Alikhali; Bagghona-Putibonia; Camp 20 Extension;
                        Camp 4; Camp 4 Extension; Chakmarkul; Choukhali;
                        Hakimpara; Jadimura; Jamtoli-Baggona; Jomer Chora;
                        Kutupalong RC; Modur Chora; Nayapara; Nayapara RC;
                        Shamlapur; Tasnimarkhola; Tasnimarkhola-Burmapara;
                        Unchiprang'
        ),
        list(
          memberName = 'SSID',
          definition = 'CXB-017 to CXB-235'
        ),
        list(
          memberName = 'SMSD__Cnam',
          definition = 'Camp 01E; Camp 01W; Camp 02E; Camp 02W; Camp 03; Camp 04;
                        Camp 04X; Camp 05; Camp 06; Camp 07; Camp 08E; Camp 08W;
                        Camp 09; Camp 10; Camp 11; Camp 12; Camp 13; Camp 14;
                        Camp 15; Camp 16; Camp 17; Camp 18; Camp 19; Camp 20;
                        Camp 20X; Camp 21; Camp 22; Camp 23; Camp 24; Camp 25;
                        Camp 26; Camp 27; Camp KRC; Camp NRC; Choukhali'
        ),
        list(
          memberName = 'NPM_Name',
          definition = 'Camp 01E; Camp 01W; Camp 02E; Camp 02W; Camp 03;
                        Camp 04; Camp 04 Extension; Camp 05; Camp 06; Camp 07;
                        Camp 08E; Camp 08W; Camp 09; Camp 10; Camp 11; Camp 12;
                        Camp 13; Camp 14 (Hakimpara); Camp 15 (Jamtoli);
                        Camp 16 (Potibonia); Camp 17; Camp 18; Camp 19; Camp 20;
                        Camp 20 Extension; Camp 21 (Chakmarkul); Camp 22 (Unchiprang);
                        Camp 23 (Shamlapur); Camp 24 (Leda); Camp 25 (Ali Khali);
                        Camp 26 (Nayapara); Camp 27 (Jadimura); Choukhali;
                        Kutupalong RC; Nayapara RC'
        ),
        list(
          memberName = 'Area_Acres',
          definition = 'Area in acres'
        ),
        list(
          memberName = 'PeriMe_Met',
          definition = 'Perimeter in meters'
        ),
        list(
          memberName = 'Camp_Name',
          definition = 'Camp 10; Camp 11; Camp 12; Camp 13; Camp 14; Camp 15;
                        Camp 16; Camp 17; Camp 18; Camp 19; Camp 1E; Camp 1W;
                        Camp 20; Camp 20 Extension; Camp 21; Camp 22; Camp 23;
                        Camp 24; Camp 25; Camp 26; Camp 27; Camp 2E; Camp 2W;
                        Camp 3; Camp 4; Camp 4 Extension; Camp 5; Camp 6;
                        Camp 7; Camp 8E; Camp 8W; Camp 9; Choukhali;
                        Kutupalong RC; Nayapara RC'
        ),
        list(
          memberName = 'Area_SqM',
          definition = 'Area in square km'
        ),
        list(memberName = 'Latitude'),
        list(memberName = 'Longitude'),
        list(memberName = 'geometry')
        # ... entries for the al2 and al3 files would be added here
      )
+ )
+ )
+ )
+
+ )
+
+ )
+
+# Publish in NADA catalog
+
+geospatial_add(
+idno = id,
+ metadata = my_geo_metadata,
+ repositoryid = "central",
+ published = 1,
+ thumbnail = thumb,
+ overwrite = "yes"
+
+ )
+# Add a link to HDX as an external resource
+
+external_resources_add(
+title = "Humanitarian Data Exchange website",
+ idno = id,
+ dctype = "web",
+ file_path = "https://data.humdata.org/",
+ overwrite = "yes"
+ )
The result in NADA
After running the script, the data and metadata will be available in NADA.
The Syria Refugee Sites dataset, used as a second example, contains verified data about the geographic location (point geometry), name, and operational status of refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. Only refugee sites operated by the United Nations High Commissioner for Refugees (UNHCR) or the Government of Turkey are included. The data are provided as CSV, TSV, and XLSX files. This example demonstrates the use of the ISO 19115 standard.
Generating the metadata using R
library(nadar)
library(sf)
library(sp)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_geo_data/")

options(stringsAsFactors = FALSE)

# Download and read the data file

url = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/cc3e9e48-e363-404e-948b-e42d13c316d9/download/syria_refugeesites_2016jan21_hiu_dos.csv"
data_file = basename(url)
if(!file.exists(data_file)) download.file(url, destfile = data_file, mode = "wb")

sf <- st_read(data_file)
sp <- as.data.frame(sf)
sp$Long <- as(sp$Long, "numeric")
sp$Lat <- as(sp$Lat, "numeric")
coordinates(sp) <- c("Long", "Lat")
proj4string(sp) <- CRS("+init=epsg:4326")

# Generate the metadata

id <- "EX2_SYR_REFUGEE_SITES"

my_geo_data <- list(
+ metadata_information = list(
+ title = "(Demo) Syria, Refugee Sites",
+ producers = list(
+ list(name = "NADA team")
+
+ ),production_date = "2022-02-18"
+
+ ),
+description = list(
+
+ idno = id,
+
+ language = "eng",
+
+ characterSet = list(codeListValue = "utf8"),
+
+ hierarchyLevel = list("dataset"),
+
+ contact = list(
+ list(
+ organisationName = "U.S. Department of State - Humanitarian Information Unit",
+ contactInfo = list(
+ address = list(electronicEmailAddress = "HIU_DATA@state.gov"),
+ onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website")
+
+ ),role = "pointOfContact"
+
+ )
+ ),
+ dateStamp = "2018-06-18",
+
+ metadataStandardName = "ISO 19115:2003/19139",
+
+ spatialRepresentationInfo = list(
+ list(
+ vectorSpatialRepresentation = list(
+ topologyLevel = "geometryOnly",
+ geometricObjects = list(
+ list(
+ geometricObjectType = "point",
          geometricObjectCount = nrow(sp)
+
+ )
+ )
+ )
+ )
+ ),
+ referenceSystemInfo = list(
+ list(code = "4326", codeSpace = "EPSG")
+
+ ),
+ identificationInfo = list(
+
+ list(
+
+citation = list(
+ title = "Syria Refugee Sites",
+ date = list(
+ list(date = "2016-01-14", type = "creation"),
+ list(date = "2016-02-04", type = "publication")
+
+ ),identifier = list(authority = "IHSN", code = id),
+ citedResponsibleParty = list(
+ list(
+ individualName = "Humanitarian Information Unit",
+ organisationName = "U.S. Department of State - Humanitarian Information Unit",
+ contactInfo = list(
+ address = list(
+ electronicEmailAddress = "HIU_DATA@state.gov"
+
+ ),onlineResource = list(
+ linkage = "http://hiu.state.gov/",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ )
+ ),
+ abstract = "The 'Syria Refugee Sites' dataset is compiled by the U.S. Department of State, Humanitarian Information Unit (INR/GGI/HIU). This dataset contains open source derived data about the geographic location (point geometry), name, and operational status of refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. Only refugee sites operated by the United Nations High Commissioner for Refugees (UNHCR) or the Government of Turkey are included. Compiled by the U.S Department of State, Humanitarian Information Unit (HIU), each attribute in the dataset (including name, location, and status) is verified against multiple sources. The name and status are obtained from UN and AFAD reporting and the UNHCR data portal (accessible at http://data.unhcr.org/syrianrefugees/regional.php). The locations are obtained from both the U.S. Department of State, PRM and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) (accessible at http://geonames.nga.mil/ggmagaz/). The name and status for each refugee site is verified with PRM. Locations are verified using high-resolution commercial satellite imagery and/or known areas of population. Additionally, all data is checked against various news sources. The data contained herein is entirely unclassified and is current as of 14 January 2016. The data is updated as needed.",
+
+purpose = "The 'Syria Refugee Sites' dataset contains verified data about the refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. This file is compiled by the U.S Department of State, Humanitarian Information Unit (HIU) and is used in the production of the unclassified 'Syria: Numbers and Locations of Syrian Refugees' map product (accessible at https://hiu.state.gov/Pages/MiddleEast.aspx). The data contained herein is entirely unclassified and is current as of 14 January 2016.",
+
+credit = "U.S. Department of State - Humanitarian Information Unit",
+
+status = "onGoing",
+
+ pointOfContact = list(
+ list(
+ individualName = "Humanitarian Information Unit",
+ organisationName = "U.S. Department of State - Humanitarian Information Unit",
+ contactInfo = list(
+ address = list(electronicEmailAddress = "HIU_DATA@state.gov"),
+ onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website")
+
+ ),role = "pointOfContact"
+
+ )
+ ),
+ resourceMaintenance = list(
+ list(maintenanceOrUpdateFrequency = "fortnightly")
+
+ ),
+ # graphicOverview = list(),
+
+ resourceFormat = list(
+ list(
+ name = "text/csv",
+ specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files"
+
+ ),list(
+ name = "text/tab-separated-values",
+ specification = "Tab-Separated Values (CSV)"
+
+ ),list(
+ name = "xlsx",
+ specification = "Microsoft Excel (XLSX)"
+
+ )
+ ),
+ descriptiveKeywords = list(
+ list(type = "theme", keyword = "Middle East"),
+ list(type = "theme", keyword = "Refugees"),
+ list(type = "theme", keyword = "Displacement"),
+ list(type = "theme", keyword = "Refugee Camps"),
+ list(type = "theme", keyword = "UNHCR"),
+ list(type = "place", keyword = "Syria"),
+ list(type = "place", keyword = "Turkey"),
+ list(type = "place", keyword = "Lebanon"),
+ list(type = "place", keyword = "Jordan"),
+ list(type = "place", keyword = "Iraq"),
+ list(type = "place", keyword = "Egypt")
+
+ ),
+resourceConstraints = list(
+ list(
+ legalConstraints = list(
+ uselimitation = list("License: Creative Commons Attribution 4.0 International License"),
+ accessConstraints = list("unrestricted"),
+ useConstraints = list("licenceUnrestricted")
+
+ )
+ ),list(
+ securityConstraints = list(
+ classification = "unclassified",
+ handlingDescription = "All data contained herein are strictly unclassified with no restrictions on distribution. Accuracy of geographic data is not assured by the U.S. Department of State."
+
+ )
+ )
+ ),
+ extent = list(
+ geographicElement = list(
+ list(
+ geographicBoundingBox = list(
+ southBoundLatitude = bbox(sp)[2,1],
+ westBoundLongitude = bbox(sp)[1,1],
+ northBoundLatitude = bbox(sp)[2,2],
+ eastBoundLongitude = bbox(sp)[1,2]
+
+ )
+ )
+ )
+ ),
+spatialRepresentationType = "vector",
+
+language = list("eng"),
+
+characterSet = list(
+ list(codeListValue = "utf8")
+
+ ),
+topicCategory = list("society")
+
+
+ )
+
+ ),
+ distributionInfo = list(
+
+ distributionFormat = list(
+ list(
+ name = "text/csv",
+ specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files"
+
+ ),list(
+ name = "text/tab-separated-values",
+ specification = "Tab-Separated Values (CSV)"
+
+ ),list(
+ name = "xlsx",
+ specification = "Microsoft Excel (XLSX)"
+
+ )
+ ),
+ distributor = list(
+ list(
+ individualName = "Humanitarian Information Unit",
+ organisationName = "U.S. Department of State - Humanitarian Information Unit",
+ contactInfo = list(
+ address = list(electronicEmailAddress = "HIU_DATA@state.gov"),
+ onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website")
+
+ ),role = "distributor"
+
+ )#,
+ )
+ # transferOptions = list(
+ # list(
+ # onLine = list(
+ # list(
+ # linkage = "https://data.humdata.org/dataset/syria-refugee-sites",
+ # name = "Source metadata (HTML View)",
+ # protocol = "WWW:LINK-1.0-http--link",
+ # "function" = "Information"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/cc3e9e48-e363-404e-948b-e42d13c316d9/download/syria_refugeesites_2016jan21_hiu_dos.csv",
+ # name = "syria_refugeesites_2016jan21_hiu_dos.csv",
+ # description = "Data download (CSV)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/42f7884c-f54d-478c-a970-623945740e5d/download/syria_refugeesites_2016jan21_hiu_dos.tsv",
+ # name = "syria_refugeesites_2016jan21_hiu_dos.tsv",
+ # description = "Data download (TSV)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/59660c9a-e41a-4d54-bfc2-dd8fd1032c97/download/syria_refugeesites_2016jan21_hiu_dos.xlsx",
+ # name = "syria_refugeesites_2016jan21_hiu_dos.xlsx",
+ # description = "Data download (TSV)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # )
+ # )
+ # )
+ # )
+
+
+ ),
+dataQualityInfo = list(
+ list(
+ scope = "dataset",
+ lineage = list(
+ statement = "Methodology: Compiled by the U.S Department of State, Humanitarian Information Unit (INR/GGI/HIU), each attribute in the dataset (including name, location, and status) is verified against multiple sources. The name and status are obtained from the UNHCR data portal (accessible at http://data.unhcr.org/syrianrefugees/regional.php). The locations are obtained from the U.S. Department of State, Bureau of Population, Refugees, and Migration (PRM) and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) (accessible at http://geonames.nga.mil/ggmagaz/). The name and status for each refugee site is verified with PRM. Locations are verified using high-resolution commercial satellite imagery and/or known areas of population. Additionally, all data is checked against various news sources."
+
+ )
+ )
+ ),
+metadataMaintenance = list(maintenanceAndUpdateFrequency = "fortnightly")
+
+
+ )
+
+ )
+# Publish in NADA catalog
+
+geospatial_add(
+idno = id,
+ metadata = my_geo_data,
+ repositoryid = "central",
+ published = 1,
+ thumbnail = NULL,
+ overwrite = "yes"
+ )
This example demonstrates the combined use of the ISO 19115 (geographic datasets) and ISO 19110 (feature catalogues) standards. Documenting the features contained in a dataset makes the metadata richer and more discoverable. It is recommended to provide such information, which can easily be extracted from shape files and other geographic data formats. The dataset used for this example is the Geocoded Disasters (GDIS) Dataset, v1 (1960-2018).
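The R script below enters feature names and their listed values by hand, after inspecting the data; the same ISO 19110 `carrierOfCharacteristics` entries can also be generated programmatically by enumerating the distinct values of each column. A minimal Python sketch, using made-up rows rather than the actual GDIS data:

```python
# Sketch: build ISO 19110 carrierOfCharacteristics entries from a table.
# Rows and column names below are illustrative placeholders.

def feature_characteristics(rows, columns, exclude=()):
    """Return one {memberName, definition} entry per column, where the
    definition enumerates the distinct values found in the data.
    Columns in `exclude` (e.g. coordinates) get no value list."""
    entries = []
    for col in columns:
        entry = {"memberName": col}
        if col not in exclude:
            values = sorted({str(r[col]) for r in rows})
            entry["definition"] = "; ".join(values)
        entries.append(entry)
    return entries

rows = [{"District": "Cox's Bazar", "Upazila": "Teknaf"},
        {"District": "Cox's Bazar", "Upazila": "Ukhia"}]
chars = feature_characteristics(rows, ["District", "Upazila"])
```

For large datasets like GDIS, columns with continuous values (areas, coordinates) would be excluded or summarized rather than enumerated.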
library(nadar)
library(sf)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_geo_data/")

thumb = "disaster.JPG"

# Load the dataset (2 Gb) to extract some information

load("pend-gdis-1960-2018-disasterlocations.rdata")
data = GDIS_disasterlocations
df = as.data.frame(GDIS_disasterlocations)
column_names = colnames(df)[!colnames(df) %in% c("geometry","centroid")]
exclude_listed_values_for = c("longitude", "latitude") # exclude ISO 19110 listed values for these columns

# Generate the metadata

id <- "GDIS_TEST_01"
ttl = "Geocoded Disasters (GDIS) Dataset, v1 (1960–2018)"

my_geo_data <- list(
+ metadata_information = list(
+ title = ttl,
+ idno = id,
+ producers = list(
+ list(name = "NADA team")
+
+ ),production_date = "2022-02-18",
+ version = "v1.0 2022-02"
+
+ ),
+description = list(
+
+ idno = id,
+ language = "English",
+ characterSet = list(
+ codeListValue = "utf8",
+ codeList = "http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CharacterSetCode"
+
+ ),hierarchyLevel = list("dataset"),
+ contact = list(
+ list(
+ organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+ contactInfo = list(
+ phone = list(
+ voice = "+1 845-365-8920",
+ facsimile = "+1 845-365-8922"
+
+ ),address = list(
+ deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000",
+ city = "Palisades, NY",
+ postalCode = "10964",
+ electronicEmailAddress = "ciesin.info@ciesin.columbia.edu"
+
+ )
+ ),role = "pointOfContact"
+
+ )
+ ),dateStamp = "2021-03-10",
+ metadataStandardName = "ISO 19115:2003/19139",
+ dataSetURI = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018",
+
+ spatialRepresentationInfo = list(
+ list(
+ vectorSpatialRepresentation = list(
+ topologyLevel = "geometryOnly",
+ geometricObjects = list(
+ list(
+ geometricObjectType = tolower(as.character(st_geometry_type(data)[1])),
          geometricObjectCount = nrow(data)
+
+ )
+ )
+ )
+ )
+ ),
+ referenceSystemInfo = list(
+ list(code = "4326", codeSpace = "EPSG")
+
+ ),
+ identificationInfo = list(
+ list(
+ citation = list(
+ title = ttl,
+ date = list(
+ list(date = "2021-03-10", type = "publication")
+
+ ),identifier = list(authority= "DOI", code = "10.7927/zz3b-8y61"),
+ citedResponsibleParty = list(
+ list(
+ individualName = "Rosvold, E., and H. Buhaug",
+ role = "owner"
+
+ )
+ ),edition = "1.00",
+ presentationForm = list("raster", "map", "map service"),
+ series = list(
+ name = "Scientific Data",
+ issueIdentification = "8:61"
+
+ )
+ ),abstract = "The Geocoded Disasters (GDIS) Dataset is a geocoded extension of a selection of natural disasters from the Centre for Research on the Epidemiology of Disasters' (CRED) Emergency Events Database (EM-DAT). The data set encompasses 39,953 locations for 9,924 disasters that occurred worldwide in the years 1960 to 2018. All floods, storms (typhoons, monsoons etc.), earthquakes, landslides, droughts, volcanic activity and extreme temperatures that were recorded in EM-DAT during these 58 years and could be geocoded are included in the data set. The highest spatial resolution in the data set corresponds to administrative level 3 (usually district/commune/village) in the Global Administrative Areas database (GADM, 2018). The vast majority of the locations are administrative level 1 (typically state/province/region).",
+ purpose = "To provide the subnational location for different types of natural disasters recorded in EM-DAT between 1960-2018.",
+ credit = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+ status = "completed",
+ pointOfContact = list(
+ list(
+ organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+ contactInfo = list(
+ phone = list(
+ voice = "+1 845-365-8920",
+ facsimile = "+1 845-365-8922"
+
+ ),address = list(
+ deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000",
+ city = "Palisades, NY",
+ postalCode = "10964",
+ electronicEmailAddress = "ciesin.info@ciesin.columbia.edu"
+
+ )
+ ),role = "pointOfContact"
+
+ )
+ ),resourceMaintenance = list(
+ list(maintenanceOrUpdateFrequency = "asNeeded")
+
+ ),graphicOverview = list(
+ list(
+ fileName = "https://sedac.ciesin.columbia.edu/downloads/maps/pend/pend-gdis-1960-2018/sedac-logo.jpg",
+ fileDescription = "Geocoded Disasters (GDIS) Dataset",
+ fileType = "image/jpeg"
+
+ )
+ ),resourceFormat = list(
+ list(
+ name = "OpenFileGDB",
+ specification = "ESRI - GeoDatabase"
+
+ ),list(
+ name = "text/csv",
+ specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files"
+
+ ),list(
+ name = "application/geopackage+sqlite3",
+ specification = "http://www.geopackage.org/spec/"
+
+ )
+ ),descriptiveKeywords = list(
+ list(type = "theme", keyword = "climatology"),
+ list(type = "theme", keyword = "meteorology"),
+ list(type = "theme", keyword = "atmosphere"),
+ list(type = "theme", keyword = "earth science",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "human dimension",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "natural hazard",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "drought",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "earthquake",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "flood",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "landslides",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "tropical cyclones",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "cyclones",
+ thesaurusName = "GCMD Science Keywords, Version 8.6"),
+ list(type = "theme", keyword = "volcanic eruption",
+ thesaurusName = "GCMD Science Keywords, Version 8.6")
+
+ ),resourceConstraints = list(
+ list(
+ legalConstraints = list(
+ uselimitation = list(
+ "This work is licensed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). Users are free to use, copy, distribute, transmit, and adapt the work for commercial and non-commercial purposes, without restriction, as long as clear attribution of the source is provided.",
+ "Recommended citation: Rosvold, E.L., Buhaug, H. GDIS, a global dataset of geocoded disaster locations. Scientific Data 8, 61 (2021). https://doi.org/10.1038/s41597-021-00846-6."
+
+ ),accessConstraints = list("unrestricted"),
+ useConstraints = list("licenceUnrestricted")
+
+ )
+ )
+ ),extent = list(
+ geographicElement = list(
+ list(
+ geographicBoundingBox = list(
+ westBoundLongitude = -180,
+ eastBoundLongitude = 180,
+ southBoundLatitude = -58,
+ northBoundLatitude = 90
+
+ )
+ )#,
+ )
+ # temporalElement = list(
+ # list(
+ # extent = list(
+ # TimePeriod = list(
+ # beginPosition = "1960-01-01",
+ # endPosition = "2018-12-31"
+ # )
+ # )
+ # )
+ # )
+
+ ),spatialRepresentationType = "vector",
+ language = list("eng"),
+ characterSet = list(
+ list(
+ codeListValue = "utf8",
+ codeList = "http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CharacterSetCode"
+
+ )
+ )
+ )
+ ),
+ distributionInfo = list(
+
+ distributionFormat = list(
+ list(name = "OpenFileGDB",
+ specification = "ESRI - GeoDatabase",
+ fileDecompressionTechnique = "Unzip"),
+ list(name = "text/csv",
+ specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files",
+ fileDecompressionTechnique = "Unzip"),
+ list(name = "application/geopackage+sqlite3",
+ specification = "http://www.geopackage.org/spec/",
+ fileDecompressionTechnique = "Unzip")
+
+ ),
+ distributor = list(
+ list(
+ organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+ contactInfo = list(
+ phone = list(
+ voice = "+1 845-365-8920",
+ facsimile = "+1 845-365-8922"
+
+ ),address = list(
+ deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000",
+ city = "Palisades, NY",
+ postalCode = "10964",
+ electronicEmailAddress = "ciesin.info@ciesin.columbia.edu"
+
+ )
+ ),role = "pointOfContact"
+
+ )#,
+ )
+ # transferOptions = list(
+ # list(
+ # onLine = list(
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018",
+ # name = "Source metadata (HTML View)",
+ # protocol = "WWW:LINK-1.0-http--link",
+ # "function" = "Information"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-gdb.zip",
+ # name = "pend-gdis-1960-2018-disasterlocations-gdb.zip",
+ # description = "Data download (Geodatabase)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-gpkg.zip",
+ # name = "pend-gdis-1960-2018-disasterlocations-gpkg.zip",
+ # description = "Data download (GeoPackage)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-csv.zip",
+ # name = "pend-gdis-1960-2018-disasterlocations-csv.zip",
+ # description="Data download (CSV)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-priogrid-key-csv.zip",
+ # name = "pend-gdis-1960-2018-priogrid-key-csv.zip",
+ # description = "Data download (CSV)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-rdata.zip",
+ # name = "pend-gdis-1960-2018-disasterlocations-rdata.zip",
+ # description = "Data download (RData)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-replicationcode-r.zip",
+ # name = "pend-gdis-1960-2018-replicationcode-r.zip",
+ # description = "Source code (R)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-codebook.pdf",
+ # name = "pend-gdis-1960-2018-codebook.pdf",
+ # description = "Codebook (PDF)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # )
+ # )
+ # )
+ # )
+
+ ),
+ dataQualityInfo = list(
+ list(
+ scope = "dataset",
+ lineage = list(
+ statement = "CIESIN follows procedures designed to ensure that data disseminated by CIESIN are of reasonable quality. If, despite these procedures, users encounter apparent errors or misstatements in the data, they should contact SEDAC User Services at +1 845-365-8920 or via email at ciesin.info@ciesin.columbia.edu. Neither CIESIN nor NASA verifies or guarantees the accuracy, reliability, or completeness of any data provided. CIESIN provides this data without warranty of any kind whatsoever, either expressed or implied. CIESIN shall not be liable for incidental, consequential, or special damages arising out of the use of any data provided by CIESIN."
+
+ )
+ )
+ ),
+ metadataMaintenance = list(
+ maintenanceAndUpdateFrequency = "asNeeded"
+
+ )
+
+ ),
+# Feature catalog (ISO 19110/19139)
+
+ feature_catalogue = list(
+ name = sprintf("%s - Feature Catalogue", ttl),
+ featureType = list(
+ list(
+ typeName = ttl,
+ definition = "Disaster locations",
+ code = "pend-gdis-1960-2018-disasterlocations",
+ isAbstract = FALSE,
+ # carrierOfCharacteristics = lapply(column_names, function(column_name){
+ # print(column_name)
+ # values = unique(df[,column_name])
+ # values = values[order(values)]
+ # member = list(
+ # memberName = sprintf("Label for '%s'", column_name),
+ # definition = sprintf("Definition for '%s'", column_name),
+ # cardinality = list(lower = 1, upper = 1),
+ # code = column_name,
+ # valueType = switch(class(df[,column_name]), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ # valueMeasurementUnit = NA,
+ # listedValue = if(column_name %in% exclude_listed_values_for) {list()} else {lapply(values, function(x){ list(label = sprintf("Label for '%s'", x), code = x, definition = sprintf("Definition for '%s'", x)) })}
+ # )
+ # return(member)
+ # })
+ carrierOfCharacteristics = list(
+ list(
+ memberName = 'id',
+ definition = 'ID-variable identifying each disaster in the geocoded dataset. Contrary to disasterno each disaster in each country has a unique id number',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA01', # short for Disaster Feature Attribute 01
+ valueType = switch(class(df[,'id']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'country',
+ definition = 'Name of the country within which the location is',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA02',
+ valueType = switch(class(df[,'country']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'iso3',
+ definition = 'Three-letter country code, ISO 3166-1',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA03',
+ valueType = switch(class(df[,'iso3']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'gwno',
+ definition = 'Gleditsch and Ward country code (Gleditsch & Ward, 1999)',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA04',
+ valueType = switch(class(df[,'gwno']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'geo_id',
+ definition = 'Unique ID-variable for each location',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA05',
+ valueType = switch(class(df[,'geo_id']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'geolocation',
+ definition = 'Name of the location of the observation, which corresponds to the highest (most disaggregated) level available. For instance, observations at the third administrative level will have geolocation values identical to the adm3 variable',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA06',
+ valueType = switch(class(df[,'geolocation']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'level',
+ definition = 'The administrative level of the observation, ranges from 1-3 where 3 is the most disaggregated',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA07',
+ valueType = switch(class(df[,'level']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'adm1',
+ definition = 'Name of administrative level 1 for the given location',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA08',
+ valueType = switch(class(df[,'adm1']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'adm2',
+ definition = 'Name of administrative level 2 for the given location',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA09',
+ valueType = switch(class(df[,'adm2']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'location',
+ definition = 'Name of administrative level 3 for the given location',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA10',
+ valueType = switch(class(df[,'location']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'historical',
+ definition = 'Marks whether the disaster happened in a country that has since changed, takes the value 1 if the disaster happened in a country that has since changed, and 0 if not',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA11',
+ valueType = switch(class(df[,'historical']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'hist_country',
+ definition = 'Name of country at the time of the disaster, if the observation takes the value 1 on the historical variable, this is different from the country variable',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA12',
+ valueType = switch(class(df[,'hist_country']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ ),list(
+ memberName = 'disastertype',
+ definition = 'Type of disaster as defined by EM-DAT (Guha-Sapir et al., 2014): flood, storm, earthquake, extreme temperature, landslide, volcanic activity, drought or mass movement (dry)',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA13',
+ valueType = switch(class(df[,'disastertype']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA',
+ listedValue = list(
+ list(
+ label = 'flood',
+ code = 'flood',
+ definition = 'A general term for the overflow of water from a stream channel onto normally dry land in the floodplain (riverine flooding), higher-than-normal levels along the coast and in lakes or reservoirs (coastal flooding) as well as ponding of water at or near the point where the rain fell (flash floods).'
+
+ ),list(
+ label = 'storm',
+ code = 'storm',
+ definition = 'A type of meteorological hazard generated by the heating of air and the availability of moist and unstable air masses. Convective storms range from localized thunderstorms (with heavy rain and/or hail, lightning, high winds, tornadoes) to meso-scale, multi-day events.'
+
+ ),list(
+ label = 'earthquake',
+ code = 'earthquake',
+ definition = 'Sudden movement of a block of the Earth’s crust along a geological fault and associated ground shaking.'
+
+ ),list(
+ label = 'extreme temperature',
+ code = 'extreme temperature',
+ definition = 'A general term for temperature variations above (extreme heat) or below (extreme cold) normal conditions.'
+
+ ),list(
+ label = 'landslide',
+ code = 'landslide',
+ definition = 'Independent of the presence of water, mass movement may also be triggered by earthquakes.'
+
+ ),list(
+ label = 'volcanic activity',
+ code = 'volcanic activity',
+ definition = 'A type of volcanic event near an opening/vent in the Earth’s surface including volcanic eruptions of lava, ash, hot vapor, gas, and pyroclastic material.'
+
+ ),list(
+ label = 'drought',
+ code = 'drought',
+ definition = 'An extended period of unusually low precipitation that produces a shortage of water for people, animals, and plants. Drought is different from most other hazards in that it develops slowly, sometimes even over years, and its onset is generally difficult to detect. Drought is not solely a physical phenomenon because its impacts can be exacerbated by human activities and water supply demands. Drought is therefore often defined both conceptually and operationally. Operational definitions of drought, meaning the degree of precipitation reduction that constitutes a drought, vary by locality, climate and environmental sector.'
+
+ ),list(
+ label = 'mass movement (dry)',
+ code = 'mass movement (dry)',
+ definition = 'Any type of downslope movement of earth materials.'
+
+ )
+ )
+ ),list(
+ memberName = 'disasterno',
+ definition = 'ID-variable from EM-DAT (Guha-Sapir et al., 2014), use this to join the geocoded data with EM-DAT records in order to obtain information on the specific disasters',
+ cardinality = list(lower = 1, upper = 1),
+ code = 'DFA14',
+ valueType = switch(class(df[,'disasterno']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+ valueMeasurementUnit = 'NA'
+
+ )
+
+ )
+ )
+ )
+ )
+ )
+# Publish in NADA catalog
+
+geospatial_add(
+idno = id,
+ metadata = my_geo_data,
+ repositoryid = "central",
+ published = 1,
+ thumbnail = thumb,
+ overwrite = "yes")
+
+# Add links as external resources
+
+external_resources_add(
+idno = id,
+ dctype = "web",
+ title = "Website: Geocoded Disasters (GDIS) Dataset, v1 (1960–2018)",
+ file_path = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018",
+ overwrite = "yes"
+ )
This fourth example uses elements of the ISO 19115 standard to document a dataset generated by the WorldPop program from multiple data sources and machine learning models. "WorldPop develops peer-reviewed research and methods for the construction of open and high-resolution geospatial data on population distributions, demographics and dynamics, with a focus on low and middle income countries." As of March 1st, 2021, WorldPop was publishing over 44,600 datasets on its website. See https://www.worldpop.org/project/categories?id=3.
+The selected example represents the spatial distribution of the Ethiopian population in 2020.
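The dataset's resolution is 3 arc-seconds, which the metadata below describes as "approximately 100m at the equator". That figure can be checked with a quick computation; the 111,320 meters-per-degree constant is the usual equatorial approximation for WGS84, not a value taken from the dataset itself:

```python
import math

METERS_PER_DEGREE_AT_EQUATOR = 111_320  # common WGS84 approximation

def arcseconds_to_meters(arcsec, latitude_deg=0.0):
    """Ground distance spanned by `arcsec` of longitude at a given latitude."""
    degrees = arcsec / 3600
    return degrees * METERS_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude_deg))

print(round(arcseconds_to_meters(3)))      # cell width at the equator, about 93 m
print(round(arcseconds_to_meters(3, 9)))   # narrower away from the equator (9°N is illustrative)
```

So "approximately 100m at the equator" is a fair rounding of the true cell width, and the east-west size shrinks with the cosine of latitude as one moves north across Ethiopia.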
Generating the metadata using R
library(nadar)
+library(raster)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/")
+
+# Download and read the dataset
+
+url = "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif"
+filename = basename(url)
+if(!file.exists(filename)) download.file(url, destfile = filename, mode = "wb")
+ras <- raster("eth_ppp_2020_constrained.tif")
+
+id <- "WP_ETH_POP"
+thumb <- "ethiopia_pop.JPG"
+# Generate the metadata
+
+my_geo_data <- list(
+ metadata_information = list(
+ title = "(Demo) Ethiopia Gridded Population 2020 (WorldPop)",
+ producers = list(list(name = "NADA team")),
+ production_date = "2022-02-18"
+
+ ),
+ description = list(
+
+ idno = id,
+ language = "eng",
+ characterSet = list(codeListValue = "utf8"),
+ hierarchyLevel = list("dataset"),
+ contact = list(
+ list(organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+ contactInfo = list(
+ onlineResource = list(
+ linkage = "https://www.worldpop.org/", name = "Website"
+
+ )
+ ),role = "pointOfContact"
+
+ )
+ ),
+ dateStamp = "2020-09-20",
+ metadataStandardName = "ISO 19115:2003/19139",
+
+ spatialRepresentationInfo = list(
+
+ list(
+ gridSpatialRepresentationInfo = list(
+ numberOfDimensions = 2L,
+ axisDimensionproperties = list(
+ list(
+ dimensionName = "row", dimensionSize = dim(ras)[1]
+
+ ),list(
+ dimensionName = "column", dimensionSize = dim(ras)[2]
+
+ )
+ ),cellGeometry = "area"
+
+ )
+ )
+
+ ),
+ referenceSystemInfo = list(
+ list(code = "4326", codeSpace = "EPSG")
+
+ ),
+ identificationInfo = list(
+
+ list(
+
+ citation = list(
+ title = "Ethiopia population 2020",
+ alternateTitle = "Estimated total number of people per grid-cell at a resolution of 3 arc-seconds (approximately 100m at the equator)",
+ date=list(
+ list(date = "2020-09-12", type = "creation")
+
+ ),identifier = list(authority = "DOI", code = id),
+ citedResponsibleParty = list(
+ list(
+ organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+ contactInfo = list(
+ onlineResource = list(
+ linkage = "https://www.worldpop.org/",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ )
+ ),
+abstract = "The spatial distribution of population in 2020, Ethiopia",
+
+ credit = "World Pop - School of Geography and Environmental Science, University of Southampton",
+
+ status = "completed",
+
+ pointOfContact = list(
+ list(
+ organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+ contactInfo = list(
+ onlineResource = list(
+ linkage = "https://www.worldpop.org/",
+ name = "Website"
+
+ )
+ ),role = "pointOfContact"
+
+ )
+ ),
+ resourceMaintenance = list(
+ list(maintenanceOrUpdateFrequency = "notPlanned")
+
+ ),
+ graphicOverview = list(
+ list(fileName = thumb, fileDescription = "Ethiopia population 2020")
+
+ ),
+ resourceFormat = list(
+ list(name = "image/tiff", specification = "GeoTIFF")
+
+ ),
+ descriptiveKeywords = list(
+ list(type = "theme", keyword = "population density"),
+ list(type = "theme", keyword = "gridded population"),
+ list(type = "place", keyword = "Ethiopia")
+
+ ),
+ resourceConstraints = list(
+ list(
+ legalConstraints = list(
+ accessConstraints = list("unrestricted"),
+ useConstraints = list("licenceUnrestricted"),
+ uselimitation = list(
+ "License: Creative Commons Attribution 4.0 International License",
+ "Recommended citation: Bondarenko M., Kerr D., Sorichetta A., and Tatem, A.J. 2020. Census/projection-disaggregated gridded population datasets for 51 countries across sub-Saharan Africa in 2020 using building footprints. WorldPop, University of Southampton, UK. doi:10.5258/SOTON/WP00682"
+
+ )
+ )
+ )
+ ),
+ extent = list(
+ geographicElement = list(
+ list(
+ geographicBoundingBox = list(
+ southBoundLatitude = bbox(ras)[2,1],
+ westBoundLongitude = bbox(ras)[1,1],
+ northBoundLatitude = bbox(ras)[2,2],
+ eastBoundLongitude = bbox(ras)[1,2]
+
+ ),geographicDescription = "Ethiopia"
+
+ )
+ )
+ ),
+ spatialRepresentationType = "grid",
+
+ #spatialResolution = list(value = 3, uom = "arc_second"),
+
+ language = list("eng"),
+
+ characterSet = list(
+ list(codeListValue = "utf8")
+
+ ),
+ topicCategory = list("society"),
+
+supplementalInformation = "References:
+ - Stevens FR, Gaughan AE, Linard C, Tatem AJ (2015) Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. PLoS ONE 10(2): e0107042. https://doi.org/10.1371/journal.pone.0107042
+ - WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076).
+ - Dooley, C. A., Boo, G., Leasure, D.R. and Tatem, A.J. 2020. Gridded maps of building patterns throughout sub-Saharan Africa, version 1.1. University of Southampton: Southampton, UK. Source of building footprints \"Ecopia Vector Maps Powered by Maxar Satellite Imagery\"© 2020. doi:10.5258/SOTON/WP00677
+ - Bondarenko M., Nieves J. J., Stevens F. R., Gaughan A. E., Tatem A. and Sorichetta A. 2020. wpgpRFPMS: Random Forests population modelling R scripts, version 0.1.0. University of Southampton: Southampton, UK. https://dx.doi.org/10.5258/SOTON/WP00665
+ - Ecopia.AI and Maxar Technologies. 2020. Digitize Africa data. http://digitizeafrica.ai"
+
+
+ )
+ ),
+ distributionInfo = list(
+
+ distributionFormat = list(
+ list(name = "image/tiff", specification = "GeoTIFF")
+
+ ),distributor = list(
+ list(
+ organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+ contactInfo = list(
+ onlineResource = list(
+ linkage = "https://www.worldpop.org/",
+ name = "Website"
+
+ )
+ ),role = "distributor"
+
+ )#,
+ )
+ # transferOptions = list( @@@ Use DC external resources?
+ # list(
+ # onLine = list(
+ # list(
+ # linkage = "https://www.worldpop.org/geodata/summary?id=49635",
+ # name = "Source metadata (HTML View)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://www.worldpop.org/ajax/pdf/summary?id=49635",
+ # name = "Source metadata (PDF)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif",
+ # name = "eth_ppp_2020_constrained.tif",
+ # description = "Data download (GeoTIFF)",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # )
+ # )
+ # )
+ # )
+
+
+ ),
+ dataQualityInfo = list(
+
+ list(
+ scope = "dataset",
+ lineage = list(
+ statement = "Data management workflow",
+ processStep = list(
+ list(
+ description = "This dataset was produced based on the 2020 population census/projection-based estimates for 2020 (information and sources of the input population data can be found here). Building footprints were provided by the Digitize Africa project of Ecopia.AI and Maxar Technologies (2020) and gridded building patterns derived from the datasets produced by Dooley et al. 2020. Geospatial covariates representing factors related to population distribution, were obtained from the \"Global High Resolution Population Denominators Project\" (OPP1134076)",
+ rationale = "Source data acquisition"
+
+ ),list(
+ description = "The mapping approach is the Random Forests-based dasymetric redistribution developed by Stevens et al. (2015). The disaggregation was done by Maksym Bondarenko (WorldPop) and David Kerr (WorldPop), using the Random Forests population modelling R scripts (Bondarenko et al., 2020), with oversight from Alessandro Sorichetta (WorldPop).",
+ rationale = "Mapping"
+
+ )
+ )
+ )
+ )
+
+ ),
+ metadataMaintenance = list(maintenanceAndUpdateFrequency = "notPlanned")
+
+
+ )
+
+ )
+# Publish the metadata in a NADA catalog
+
+geospatial_add(
+idno = id,
+ metadata = my_geo_data,
+ repositoryid = "central",
+ published = 1,
+ thumbnail = thumb,
+ overwrite = "yes"
+
+ )
+# Add a link to WorldPop website as an external resource
+
+external_resources_add(
+idno = id,
+ dctype = "web",
+ title = "WorldPop website",
+ file_path = "https://www.worldpop.org/",
+ overwrite = "yes"
+ )
Generating the metadata using Python
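The Python version of this example is not reproduced in this extract. As a rough sketch of what it involves, the same nested structure built with R lists above maps directly onto Python dictionaries and lists, which can then be serialized to JSON for submission to the catalog API. Only a small subset of the metadata is shown; the commented `requests.post` line is a hypothetical illustration, and the real endpoint URL and API key would come from your NADA installation.

```python
import json

idno = "WP_ETH_POP"

my_geo_data = {
    "metadata_information": {
        "title": "(Demo) Ethiopia Gridded Population 2020 (WorldPop)",
        "producers": [{"name": "NADA team"}],
        "production_date": "2022-02-18",
    },
    "description": {
        "idno": idno,
        "language": "eng",
        "hierarchyLevel": ["dataset"],
        "identificationInfo": [{
            "citation": {"title": "Ethiopia population 2020"},
            "spatialRepresentationType": "grid",
        }],
    },
}

payload = json.dumps(my_geo_data)
# Publishing would POST this payload to the catalog with an API key header,
# e.g. (hypothetical URL):
# requests.post("https://.../index.php/api/...", headers={"X-API-KEY": key}, json=my_geo_data)
print(json.loads(payload)["description"]["idno"])  # → WP_ETH_POP
```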
The result in NADA
The previous four examples documented geographic datasets (ISO 19115). In this fifth example, we document a geographic service using elements of the ISO 19119 standard. The service described in this example is the United Nations Clear Map application from United Nations Geospatial.
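Unlike a dataset, a service is consumed by issuing requests against an endpoint; for an OGC WMS such as Clear Map, the first request a client typically makes is GetCapabilities. The sketch below assembles such a request URL using one of the Clear Map service endpoints listed in the distribution links further down; the parameter set is the standard WMS one for the version declared in the metadata (1.1.0).

```python
from urllib.parse import urlencode

def wms_get_capabilities_url(base_url, version="1.1.0"):
    """Build an OGC WMS GetCapabilities request URL for a service endpoint."""
    params = {"service": "WMS", "request": "GetCapabilities", "version": version}
    return base_url + "?" + urlencode(params)

url = wms_get_capabilities_url(
    "https://geoservices.un.org/arcgis/rest/services/ClearMap_Topo/MapServer"
)
print(url)
```

Fetching this URL returns an XML capabilities document listing the layers, formats, and coordinate systems the service offers, which is how the `resourceFormat` entries recorded below (PNG32, JPG, PDF, and so on) would be discovered programmatically.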
Generating the metadata using R
library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/")
+
+thumb = "un_clear_map.JPG"
+id = "UN_GEO_CLEAR-MAP"
+
+my_geo_service <- list(
+ metadata_information = list(
+ idno = id,
+ title = "United Nations Geospatial, Clear Map",
+ producers = list(
+ list(name = "NADA team")
+
+ ),production_date = "2022-02-18",
+ version = "v1.0 2022-02"
+
+ ),
+ description = list(
+
+ idno = id,
+ language = "eng",
+ characterSet = list(codeListValue = "utf8"),
+ hierarchyLevel = list("service"),
+ contact = list(
+ list(
+ organisationName = "United Nations Geospatial",
+ contactInfo = list(
+ address = list(
+ electronicEmailAddress = "gis@un.org"
+
+ ),onlineResource = list(
+ linkage = "https://www.un.org/geospatial",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ ),dateStamp = "2022-02-22",
+ metadataStandardName = "ISO 19119:2005/19139",
+
+ referenceSystemInfo = list(
+ list(code = "3857", codeSpace = "EPSG")
+
+ ),
+ identificationInfo = list(
+
+ list(
+ citation = list(
+ title = "United Nations Clear Map - OGC Web Map Service",
+ date = list(
+ list(date = "2019-08-19", type = "creation"),
+ list(date = "2020-03-19", type ="lastUpdate")
+
+ ),citedResponsibleParty = list(
+ list(
+ organisationName = "United Nations Geospatial",
+ contactInfo = list(
+ address = list(electronicEmailAddress = "gis@un.org"),
+ onlineResource = list(
+ linkage = "https://www.un.org/geospatial",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ )
+ ),
+ abstract = "The United Nations Clear Map (hereinafter 'Clear Map') is a background reference web mapping service produced to facilitate 'the issuance of any map at any duty station, including dissemination via public electronic networks such as Internet' and 'to ensure that maps meet publication standards and that they are not in contravention of existing United Nations policies' in accordance with the in the Administrative Instruction on 'Regulations for the Control and Limitation of Documentation - Guidelines for the Publication of Maps' of 20 January 1997 (http://undocs.org/ST/AI/189/Add.25/Rev.1).",
+ purpose = "Clear Map is created for the use of the United Nations Secretariat and community. All departments, offices and regional commissions of the United Nations Secretariat including offices away from Headquarters using Clear Map remain bound to the instructions as contained in the Administrative Instruction and should therefore seek clearance from the UN Geospatial Information Section (formerly Cartographic Section) prior to the issuance of their thematic maps using Clear Map as background reference.",
+ credit = "Produced by: United Nations Geospatial Contributor: UNGIS, UNGSC, Field Missions CONTACT US: Feedback is appreciated and should be sent directly to: Email:Clearmap@un.org / gis@un.org (UNCLASSIFIED) (c) UNITED NATIONS 2018",
+ status = "onGoing",
+
+ pointOfContact = list(
+ list(
+ organisationName = "United Nations Geospatial",
+ contactInfo = list(
+ address = list(electronicEmailAddress = "gis@un.org"),
+ onlineResource = list(linkage = "https://www.un.org/geospatial", name = "Website")
+
+ ),role = "pointOfContact"
+
+ )
+ ),
+ resourceMaintenance = list(
+ list(maintenanceOrUpdateFrequency = "asNeeded")
+
+ ),
+ graphicOverview = list(
+ list(
+ fileName = "https://geoportal.dfs.un.org/arcgis/sharing/rest/content/items/6f4eb9e136ee43758a62f587ceb0da01/info/thumbnail/thumbnail1567157577600.png",
+ fileDescription = "Service overview",
+ fileType = "image/png"
+
+ )
+ ),
+ resourceFormat = list(
+ list(name = "PNG32"),
+ list(name = "PNG24"),
+ list(name = "PNG"),
+ list(name = "JPG"),
+ list(name = "DIB"),
+ list(name = "TIFF"),
+ list(name = "EMF"),
+ list(name = "PS"),
+ list(name = "PDF"),
+ list(name = "GIF"),
+ list(name = "SVG"),
+ list(name = "SVGZ"),
+ list(name = "BMP")
+
+ ),
+ descriptiveKeywords = list(
+ list(type = "theme", keyword = "wms"),
+ list(type = "theme", keyword = "united nations"),
+ list(type = "theme", keyword = "global boundaries"),
+ list(type = "theme", keyword = "ocean coastline"),
+ list(type = "theme", keyword = "authoritative")
+
+ ),
+ resourceConstraints = list(
+ list(
+ legalConstraints = list(
+ uselimitation = list("The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries.
+ Final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined.
+ Final status of the Abyei area is not yet determined.
+ * Dotted line represents approximately the Line of Control in Jammu and Kashmir agreed upon by India and Pakistan. The final status of Jammu and Kashmir has not yet been agreed upon by the parties.
+ ** Chagos Archipelago appears without prejudice to the question of sovereignty.
+ *** A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas)."),
+accessConstraints = list("unrestricted"),
+ useConstraints = list("licenceUnrestricted")
+
+ )
+ )
+ ),
+ extent = list(
+ geographicElement = list(
+ list(
+ geographicBoundingBox = list(
+ southBoundLatitude = -1.4000299034940418,
+ westBoundLongitude = -1.40477223188626,
+ northBoundLatitude = 2.149247026187029,
+ eastBoundLongitude = 1.367128649366541
+
+ )
+ )
+ )
+ ),
+ topicCategory = list("boundaries", "oceans"),
+
+ serviceIdentification = list(
+ serviceType = "OGC:WMS",
+ serviceTypeVersion = "1.1.0"
+
+ )
+ )
+ ),
+ distributionInfo = list(
+
+ distributionFormat = list(
+ list(name = "PNG32"),
+ list(name = "PNG24"),
+ list(name = "PNG"),
+ list(name = "JPG"),
+ list(name = "DIB"),
+ list(name = "TIFF"),
+ list(name = "EMF"),
+ list(name = "PS"),
+ list(name = "PDF"),
+ list(name = "GIF"),
+ list(name = "SVG"),
+ list(name = "SVGZ"),
+ list(name = "BMP")
+
+ ),
+ distributor = list(
+ list(
+ organisationName = "United Nations Geospatial",
+ contactInfo = list(
+ address = list(electronicEmailAddress = "gis@un.org"),
+ onlineResource = list(
+ linkage = "https://www.un.org/geospatial",
+ name = "Website"
+
+ )
+ ),role = "owner"
+
+ )
+ )#,
+
+ # transferOptions = list(
+ # list(
+ # onLine = list(
+ # list(
+ # linkage = "https://geoportal.dfs.un.org/arcgis/home/item.html?id=541557fd0d4d42efb24449be614e6887",
+ # name = "Original metadata",
+ # description = "Original metadata from UN ClearMap portal",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://geoportal.dfs.un.org/arcgis/sharing/rest/content/items/541557fd0d4d42efb24449be614e6887/data",
+ # name = "UN ClearMap WMS map service user guide",
+ # description = "How to import and use WMS services of the UN Clear map",
+ # protocol = "WWW:LINK-1.0-http--link"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Dark/MapServer?service=WMS",
+ # name = "ClearMap_Dark",
+ # description = "ClearMap Dark WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Gray/MapServer?service=WMS",
+ # name = "ClearMap_Gray",
+ # description = "ClearMap Gray WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Imagery/MapServer?service=WMS",
+ # name = "ClearMap_Imagery",
+ # description = "ClearMap Imagery WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Plain/MapServer?service=WMS",
+ # name = "ClearMap_Plain",
+ # description = "ClearMap Plain WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Topo/MapServer?service=WMS",
+ # name = "ClearMap_Topo",
+ # description = "ClearMap Topo WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebDark/MapServer?service=WMS",
+ # name = "ClearMap_WebDark",
+ # description = "ClearMap WebDark WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebGray/MapServer?service=WMS",
+ # name = "ClearMap_WebGray",
+ # description = "ClearMap WebGray WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebPlain/MapServer?service=WMS",
+ # name = "ClearMap_WebPlain",
+ # description = "ClearMap WebPlain WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # ),
+ # list(
+ # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebTopo/MapServer?service=WMS",
+ # name = "ClearMap_WebTopo",
+ # description = "ClearMap WebTopo WMS",
+ # protocol = "OGC:WMS-1.1.0-http-get-map"
+ # )
+ # )
+ # )
+ # )
+
+
+ ),
+ metadataMaintenance = list(maintenanceAndUpdateFrequency = "asNeeded")
+
+
+ )
+
+ )
+# Publish in a NADA catalog
+
+geospatial_add(
+idno = id,
+ metadata = my_geo_service,
+ repositoryid = "central",
+ published = 1,
+ thumbnail = thumb,
+ overwrite = "yes"
+
+ )
+# Add links as external resources
+
+external_resources_add(
+title = "United Nations Clear Map application",
+ idno = id,
+ dctype = "web",
+ file_path = "https://www.un.org/geospatial/",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "United Nations Geospatial website",
+ idno = id,
+ dctype = "web",
+ file_path = "https://geoservices.un.org/Html5Viewer/index.html?viewer=clearmap",
+ overwrite = "yes"
+ )
Generating the metadata using Python
[to do]
The result in NADA
The ISO standard is complex and contains many nested elements. Using R or Python to generate the metadata is a convenient and powerful option, although it requires close attention to avoid errors. The geometa R package can be used to facilitate the process of documenting datasets using R.
Using a specialized metadata editor to generate the ISO-compliant metadata is a good alternative for those who have limited expertise in R or Python. The GeoNetwork editor provides such a solution.
In our JSON schema, the structural metadata and the dataset metadata are stored in a single container.
The schema we describe in this chapter is intended to document databases of indicators or time series, not the indicators or time series themselves (a schema for the description of indicators and time series is presented in chapter 8). Indicators are summary measures related to key issues or phenomena, derived from observed facts. Indicators form time series when they are provided with a temporal ordering, i.e. when their values are provided with an ordered annual, quarterly, monthly, daily, or other time reference. Indicators and time series are often contained in multi-indicator databases, like the World Bank’s World Development Indicators (WDI), whose online version contains series for 1,430 indicators (as of 2021).
The metadata related to a database can be published in a catalog as specific entries, or as information attached to an indicator.
[provide example / screenshot in NADA]
The database schema is used to document the database that contains the time series, not to document the indicators or series.
+{
+"published": 0,
+ "overwrite": "no",
+ "metadata_information": {},
+ "database_description": {},
+ "provenance": [],
+ "tags",
+ "lda_topics": {},
+ "embeddings": {},
+ "additional": {}
+ }
The schema includes two elements that are not metadata, but parameters used when publishing the metadata in a NADA catalog:
+published
: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished), in which case it is only visible to catalog administrators. This value must be set to 1 (published) to make the metadata visible. Note that the database metadata will only be shown in NADA in association with the metadata of an indicator.
overwrite
: Indicates whether metadata that may have been previously uploaded for the same database can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. A database will be considered as being the same as a previously uploaded one if they have the same identifier (provided in the metadata element database_description > title_statement > idno).
metadata_information
[Optional, Not Repeatable]
+The set of elements in metadata_information
is used to provide information on the production of the database metadata. This information is used mostly for administrative purposes by data curators and catalog administrators.
"metadata_information": {
+"title": "string",
+ "idno": "string",
+ "producers": [
+ {
+ "name": "string",
+ "abbr": "string",
+ "affiliation": "string",
+ "role": "string"
+ }
+ ],
+ "prod_date": "string",
+ "version": "string"
+ }
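As an illustration (all names and values below are hypothetical, not taken from an actual catalog), the metadata_information block can be assembled in R as a named list mirroring the JSON structure above:

```r
# Hypothetical example of a metadata_information block (administrative
# metadata about the metadata document itself, not about the database)
metadata_information <- list(
  title = "World Development Indicators, April 2020 - Metadata",
  idno  = "META_WB_WDI_APR_2020",
  producers = list(
    list(name = "Development Data Group",
         abbr = "DECDG",
         affiliation = "World Bank",
         role = "Metadata production")
  ),
  prod_date = "2020-04-15",
  version = "1.0"
)
```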
title
[Optional ; Not repeatable ; String]
The title of the metadata document (which may differ from the title of the database being documented).
idno
[Required ; Not repeatable ; String]
A unique identifier for the metadata document.
producers
[Optional ; Repeatable]
The organization(s) or person(s) who produced the metadata.
name
[Optional ; Not repeatable ; String]
The name of the organization or person who produced the metadata.
abbr
[Optional ; Not repeatable ; String]
The abbreviation (acronym) of the organization mentioned in name.
affiliation
[Optional ; Not repeatable ; String]
The affiliation of the organization or person mentioned in name.
role
[Optional ; Not repeatable ; String]
The specific role of the organization or person mentioned in name in the production of the metadata.
prod_date
[Optional ; Not repeatable ; String]
The date the metadata was produced, preferably entered in ISO 8601 format (YYYY-MM-DD).
version
[Optional ; Not repeatable ; String]
The version of the metadata document.
database_description
[Required, Not Repeatable]
"database_description": {
+"title_statement": {},
+ "authoring_entity": [],
+ "abstract": "string",
+ "url": "string",
+ "type": "string",
+ "date_created": "string",
+ "date_published": "string",
+ "version": [],
+ "update_frequency": "string",
+ "update_schedule": [],
+ "time_coverage": [],
+ "time_coverage_note": "string",
+ "periodicity": [],
+ "themes": [],
+ "topics": [],
+ "keywords": [],
+ "dimensions": [],
+ "ref_country": [],
+ "geographic_units": [],
+ "geographic_coverage_note": "string",
+ "bbox": [],
+ "geographic_granularity": "string",
+ "geographic_area_count": "string",
+ "sponsors": [],
+ "acknowledgments": [],
+ "acknowledgment_statement": "string",
+ "contacts": [],
+ "links": [],
+ "languages": [],
+ "access_options": [],
+ "errata": [],
+ "license": [],
+ "citation": "string",
+ "notes": [],
+ "disclaimer": "string",
+ "copyright": "string"
+ }
title_statement
[Required, Not Repeatable] "title_statement": {
+"idno": "string",
+ "identifiers": [
+ {
+ "type": "string",
+ "identifier": "string"
+ }
+ ],
+ "title": "string",
+ "sub_title": "string",
+ "alternate_title": "string",
+ "translated_title": "string"
+ }
idno
[Required ; Not repeatable ; String]
+A unique identifier of the database. For example, the World Bank’s World Development Indicators database published in April 2020 could have idno
= “WB_WDI_APR_2020”.
identifiers
[Optional ; Repeatable]
+This element is used to store database identifiers (IDs) other than the catalog ID entered in idno
. It can for example be a Digital Object Identifier (DOI). The idno
can be repeated here (idno
does not provide a type
parameter; if a DOI or other standard reference ID is used as idno
, it is recommended to repeat it here with the identification of its type
).
type
[Optional ; Not repeatable ; String] identifier
[Required ; Not repeatable ; String] title
[Required ; Not repeatable ; String]
+The title is the name by which the database is formally known. It is good practice to include the year of production in the title (and possibly the month, or quarter, if a new version of the database is released more than once a year). For example, “World Development Indicators, April 2020”.
+
sub_title
[Optional ; Not repeatable ; String]
+The database subtitle can be used when there is a need to distinguish characteristics of a database. This element will rarely be used.
alternate_title
[Optional ; Not repeatable ; String]
+This can be an acronym, or an alternative name of the database. For example, “WDI April 2020”.
translated_title
[Optional ; Not repeatable ; String]
+The title of the database in a secondary language (if more than one other language, they may be entered as one string, as this element is not repeatable).
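To illustrate the recommendation above on repeating a standard identifier used as idno in the identifiers block (the DOI shown is fictitious, and the nesting under database_description is omitted for brevity):

```r
# Hypothetical title_statement: the DOI used as idno carries no type
# attribute, so it is repeated in `identifiers` together with its type
title_statement <- list(
  idno = "10.12345/WB_WDI_APR_2020",   # fictitious DOI
  identifiers = list(
    list(type = "DOI", identifier = "10.12345/WB_WDI_APR_2020")
  ),
  title = "World Development Indicators, April 2020",
  alternate_title = "WDI April 2020"
)
```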
authoring_entity
[Optional ; Repeatable]
+This set of five elements is used to identify the organization(s) or person(s) who are the main producers/curators of the database. Note that a similar element is available at the indicator/series level.
"authoring_entity": [
+{
+ "name": "string",
+ "affiliation": "string",
+ "abbreviation": "string",
+ "email": "string",
+ "uri": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
+The name of the person or organization who maintains the contents of the database (back-end). Write the name in full (use the element abbreviation
to capture the acronym of the organization, if relevant).
affiliation
[Optional ; Not repeatable ; String]
+The affiliation of the person or organization mentioned in name
.
+
abbreviation
[Optional ; Not repeatable ; String]
+The abbreviated name (acronym) of the organization mentioned in name
.
email
[Optional ; Not repeatable ; String]
+The public email contact of the person or organizations mentioned in name
. It is good practice to provide a service account email address, not a personal one.
uri
[Optional ; Not repeatable ; String]
+A link (URL) to the website of the entity mentioned in name
.
abstract
[Optional ; Not repeatable ; String]
The abstract
is a brief description of the database. It can for example include a short statement on the database scope and coverage (not in detail, as other fields are available for that purpose), objectives, history, and expected audience.
url
[Optional ; Not repeatable ; String]
The link to the public interface of the database (home page).
type
[Optional ; Not repeatable ; String]
The type of database.
date_created
[Optional ; Not repeatable ; String]
This is the date the database was created. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
date_published
[Optional ; Not repeatable ; String]
This is the date the database was made public. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
version
[Optional ; Repeatable]
+A database rarely remains static; it will be regularly updated and upgraded. The version
element is a compound element and contains important information regarding the updating of the database. This includes any extension of the database (adding new series data), appending existing data, correcting existing data, etc.
"version": [
+{
+ "version": "string",
+ "date": "string",
+ "responsibility": "string",
+ "notes": "string"
+ }
+ ]
version
[Optional ; Not repeatable ; String]
+A label for the version. The version specification will be determined by a curator or a data manager under conventions determined by the authoring entity.
date
[Optional ; Not repeatable ; String]
The date the version was released. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
responsibility
[Optional ; Not repeatable ; String]
+The organization or person in charge of this version of the database.
notes
[Optional ; Not repeatable ; String]
+Additional information on this version of the database. Notes can for example be used to document how this version differs from previous ones.
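A hypothetical version entry is sketched below; the object is named db_version to avoid masking R's built-in version object, and all values are invented:

```r
# Hypothetical example of a version entry for a database
db_version <- list(
  list(version = "2.0",
       date = "2020-04-15",
       responsibility = "Development Data Group, World Bank",
       notes = "Adds data for 2019; corrects the 2018 values of two series.")
)
```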
update_frequency
[Optional ; Not repeatable ; String]
+Indicates at which frequency the database is updated (for example, “annual” or “quarterly”). The use of a controlled vocabulary is recommended. If a database contains many indicators, the update frequency may vary by indicator (e.g., some may be updated on a monthly or quarterly basis while others are only updated annually). The information provided in the update_frequency
will correspond to the frequency of update for the indicators that are most frequently updated.
+
update_schedule
[Optional ; Repeatable]
+The update schedule is intended to provide users with information on scheduled updates. This is a repeatable field that allows for capturing specific dates, but this information would then have to be regularly updated. Often a single description will be used, which would avoid having to regularly update the metadata. For example, “The database is updated in January, April, July, October of each year.”
"update_schedule": [
+{
+ "update": "string"
+ }
+ ]
update
[Optional ; Not repeatable ; String]
+A description of the schedule of updates or a date entered in ISO 8601 format.
time_coverage
[Optional ; Repeatable]
+The time coverage is the time span of all the data contained in the database across all series.
+
"time_coverage": [
+{
+ "start": "string",
+ "end": "string"
+ }
+ ]
start
[Optional ; Not repeatable ; String]
Indicates the start date of the period covered by the data (across all series) in the database. The date should be provided in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
end
[Optional ; Not repeatable ; String]
Indicates the end date of the period covered by the data (across all series) in the database. The date should be provided in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
time_coverage_note
[Optional ; Not repeatable ; String]
+The element is used to annotate and/or describe auxiliary information related to the time coverage described in time_coverage
.
periodicity
[Optional ; Repeatable]
+The periodicity of the data describes the periodicity of the indicators contained in the database. A database can contain series covering different periods, in which case the information will be repeated for each type of periodicity. A controlled vocabulary should be used.
+
"periodicity": [
+{
+ "period": "string"
+ }
+ ]
period
[Optional ; Not repeatable ; String]
+Periodicity of the time series included in the database, for example, “annual”, “quarterly”, or “monthly”.
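The time coverage and periodicity blocks can be sketched together (hypothetical values; a database holding both annual and quarterly series repeats the period entry):

```r
# Hypothetical example: data from 1960 to 2019, with annual and
# quarterly series in the same database
time_coverage <- list(list(start = "1960", end = "2019"))
periodicity   <- list(list(period = "annual"),
                      list(period = "quarterly"))
```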
themes
[Optional ; Repeatable]
+Themes provide a general idea of the research that might guide the creation and/or demand for the series. A theme is broad and is likely also subject to a community based definition or list. A controlled vocabulary should be used. This element will rarely be used (the element topics
described below will be used more often).
"themes": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
id
[Optional ; Not repeatable ; String]
+The unique identifier of the theme. It can be a sequential number, or the identifier of the theme in a controlled vocabulary.
name
[Required ; Not repeatable ; String]
+The label of the theme associated with the data.
parent_id
[Optional ; Not repeatable ; String]
+When a hierarchical (nested) controlled vocabulary is used, the parent_id
field can be used to indicate a higher-level theme to which this theme belongs.
vocabulary
[Optional ; Not repeatable ; String]
+The name of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
+A link to the controlled vocabulary mentioned in field ‘vocabulary’.
topics
[Optional ; Repeatable]
The topics field indicates the broad substantive topic(s) that the indicator/series covers. A topic classification facilitates referencing and searches in electronic data catalogs. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) topic classification.
"topics": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
id
[Optional ; Not repeatable ; String]
+The unique identifier of the topic. It can be a sequential number, or the identifier of the topic in a controlled vocabulary.
name
[Required ; Not repeatable ; String]
+The label of the topic associated with the data.
+
parent_id
[Optional ; Not repeatable ; String]
+When a hierarchical (nested) controlled vocabulary is used, the parent_id
field can be used to indicate a higher-level topic to which this topic belongs.
vocabulary
[Optional ; Not repeatable ; String]
+The name of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
A link to the controlled vocabulary mentioned in field vocabulary.
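A sketch of a topics block using the CESSDA classification (topic names, ids, and the URL are shown for illustration only):

```r
# Hypothetical example: two topics taken from the CESSDA topic
# classification; ids are sequential numbers assigned by the curator
topics <- list(
  list(id = "1",
       name = "Economic conditions and indicators",
       vocabulary = "CESSDA Topic Classification",
       uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
  list(id = "2",
       name = "Trade, industry and markets",
       vocabulary = "CESSDA Topic Classification",
       uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
)
```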
keywords
[Optional ; Repeatable]
+Words or phrases that describe salient aspects of a data collection’s content. This can be used for building keyword indexes and for classification and retrieval purposes. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus. The list of keywords can include keywords extracted from one or more controlled vocabularies and user-defined keywords.
"keywords": [
+{
+ "name": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+A keyword (or phrase).
+
vocabulary
[Optional ; Not repeatable ; String]
+The name of the controlled vocabulary from which the keyword was extracted, if any.
+
uri
[Optional ; Not repeatable ; String]
+The URI of the controlled vocabulary used, if any.
dimensions
[Optional ; Repeatable]
+The dimensions available for the series included in the database. For example, “country, year”.
"dimensions": [
+{
+ "name": "string",
+ "label": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+The name of the dimension.
+
label
[Optional ; Not repeatable ; String]
+A label for the dimension.
ref_country
[Optional ; Repeatable]
+A list of countries for which data are available in the database. This element is somewhat redundant with the next element (geographic_units
) which may also contain a list of countries. Identifying geographic areas of type “country” is important to enable filters and facets in data catalogs (country names are among the most frequent queries submitted to catalogs).
"ref_country": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+The name of the country.
code
[Optional ; Not repeatable ; String]
+The code of the country. The use of the ISO 3166-1 alpha-3 codes is recommended.
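A minimal ref_country sketch using ISO 3166-1 alpha-3 codes, as recommended above (the selection of countries is arbitrary):

```r
# Hypothetical example: countries listed with ISO 3166-1 alpha-3 codes
ref_country <- list(
  list(name = "Afghanistan", code = "AFG"),
  list(name = "Albania",     code = "ALB"),
  list(name = "Algeria",     code = "DZA")
)
```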
geographic_units
[Optional ; Repeatable]
A list of geographic units (regions, countries, states, provinces, etc.) for which data are available in the database. This list is not limited to countries; it can contain sub-national areas, supra-national regions, or non-administrative area names. The type element is used to indicate the type of geographic area. Countries may, but do not have to, be repeated here if provided in the element ref_country.
+
"geographic_units": [
+{
+ "name": "string",
+ "code": "string",
+ "type": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+The name of the geographic unit e.g. ‘World’, ‘Sub-Saharan Africa’, ‘Afghanistan’, ‘Low-income countries’.
code
[Optional ; Not repeatable ; String]
+The code of the geographic unit as found in the database. If no code is available in the database, a code still can be added to the metadata. In such case, using the ISO 3166-1 alpha-3 codes is recommended for countries.
type
[Optional ; Not repeatable ; String]
+Type of geographic unit e.g. country, state, region, province, or other grouping.
geographic_coverage_note
[Optional ; Not repeatable ; String]
+The note can be used to capture additional information on the geographic coverage of the database.
bbox
[Optional ; Repeatable]
+Bounding boxes are typically used for geographic datasets to indicate the geographic coverage of the data, but can be provided for databases as well, although this will rarely be done. A geographic bounding box defines a rectangular geographic area.
+
"bbox": [
+{
+ "west": "string",
+ "east": "string",
+ "south": "string",
+ "north": "string"
+ }
+ ]
west
[Required ; Not repeatable ; String]
+Western geographic parameter of the bounding box.
east
[Required ; Not repeatable ; String]
+Eastern geographic parameter of the bounding box.
south
[Required ; Not repeatable ; String]
+Southern geographic parameter of the bounding box.
north
[Required ; Not repeatable ; String]
+Northern geographic parameter of the bounding box.
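A bbox sketch (a hypothetical box covering the whole world; note that the schema stores the coordinates as strings):

```r
# Hypothetical example: a single bounding box covering the whole world
bbox <- list(
  list(west = "-180", east = "180", south = "-90", north = "90")
)
```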
geographic_granularity
[Optional ; Not repeatable ; String]
Whereas the geographic_units element lists the geographic areas for which data are available in the database, the geographic_granularity element describes the geographic levels covered by these data. For example: “The database contains data at the national, provincial (admin 1) and district (admin 2) levels.”
geographic_area_count
[Optional ; Not repeatable ; String]
The number of geographic areas for which data are provided in the database. The World Bank World Development Indicators for example provides data for 262 different areas (which includes countries and territories, geographic regions, and other country groupings).
sponsors
[Optional ; Repeatable]
+The source(s) of funds for the production and maintenance of the database. If different funding agencies sponsored different stages of the database development, use the role
attribute to distinguish their respective contributions.
"sponsors": [
+{
+ "name": "string",
+ "abbreviation": "string",
+ "role": "string",
+ "grant": "string",
+ "uri": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+Name of the funding agency/sponsor
abbreviation
[Optional ; Not repeatable ; String]
+Abbreviation of the funding/sponsoring agency mentioned in name
.
role
[Optional ; Not repeatable ; String]
+Role of the funding/sponsoring agency mentioned in name
.
grant
[Optional ; Not repeatable ; String]
+Grant or award number. If an agency provided more than one grant, list all grants separated with a “;”.
uri
[Optional ; Not repeatable ; String]
+URI of the sponsor agency mentioned in name
.
acknowledgments
[Optional ; Repeatable]
+An itemized list of person(s) and/or organization(s) other than sponsors and contributors already mentioned in metadata elements contributors
and sponsors
whose contribution to the database must be acknowledged.
"acknowledgments": [
+{
+ "name": "string",
+ "affiliation": "string",
+ "role": "string",
+ "uri": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
+The name of the person or agency being recognized for supporting the database.
affiliation
[Optional ; Not repeatable ; String]
+Affiliation of the person or agency recognized or acknowledged for supporting the database.
role
[Optional ; Not repeatable ; String]
+Role of the person or agency that is being recognized or acknowledged for supporting the database.
uri
[Optional ; Not repeatable ; String]
+Website URL or email of the person or organization being recognized or acknowledged for supporting the database.
acknowledgment_statement
[Optional ; Not repeatable ; String]
An overall statement of acknowledgment, which can be used as an alternative (or supplement) to the itemized list provided in acknowledgments
.
contacts
[Optional ; Repeatable]
+The contacts
element provides the public interface for questions associated with the development and maintenance of the database. There could be various contacts provided depending upon the organization.
"contacts": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "email": "string",
+ "telephone": "string",
+ "uri": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
The name of the person to be contacted. Instead of the name of an individual (which would be subject to change and require frequent updates of the metadata), a title can be provided here (e.g. “data helpdesk”).
role
[Optional ; Not repeatable ; String]
+The specific role of the contact person mentioned in name
. This will be used when multiple contacts are listed, and is intended to help users direct their questions and requests to the right contact person.
+
affiliation
[Optional ; Not repeatable ; String]
+The organization or affiliation of the contact person mentioned in name
.
email
[Optional ; Not repeatable ; String]
The email address of the person or organization mentioned in name. Avoid using personal email accounts; the use of an anonymous email address is recommended (e.g., “helpdesk@….org”).
telephone
[Optional ; Not repeatable ; String]
+The phone number of the person or organization mentioned in name
.
uri
[Optional ; Not repeatable ; String]
+The URI of the agency (typically, a URL to a “contact us” web page).
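A contacts sketch following the advice above to prefer a generic helpdesk over a named individual (all names, addresses, and URLs are invented):

```r
# Hypothetical example: a generic helpdesk contact rather than a
# named individual, with an anonymous service email address
contacts <- list(
  list(name = "Data Helpdesk",
       role = "User support and data questions",
       affiliation = "Example Statistics Agency",
       email = "helpdesk@example.org",
       uri = "https://example.org/contact-us")
)
```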
links
[Optional ; Repeatable]
+This field allows for the association of auxiliary links referring to the database.
"links": [
+{
+ "uri": "string",
+ "description": "string"
+ }
+ ]
uri
[Optional ; Not repeatable ; String]
+The URI for the associated link.
description
[Optional ; Not repeatable ; String]
+A brief description of the link, in relation to the database.
+
languages
[Optional ; Repeatable]
+This set of elements is provided to list the languages that are supported in the database.
+
"languages": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
The name of the language.
code
[Optional ; Not repeatable ; String]
The code of the language mentioned in name, preferably the three-letter ISO 639-2 code.
access_options
[Optional ; Repeatable]
+This repeatable set of elements describes the different modes and formats in which the database is made accessible. When more than one mode of access is provided, describe them separately.
"access_options": [
+{
+ "type": "string",
+ "uri": "string",
+ "note": "string"
+ }
+ ]
type
[Optional ; Not repeatable ; String]
+The access type, e.g. “Application Programming Interface (API)”, “Bulk download in CSV format”, “On-line query interface”, etc.
uri
[Optional ; Not repeatable ; String]
+The URI corresponding to the access mode mentioned in type
.
note
[Optional ; Not repeatable ; String]
+This element allows for annotating any specific information associated with the access mode mentioned in type
.
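Following the recommendation above to describe each access mode separately, a hypothetical access_options block (all URLs are invented):

```r
# Hypothetical example: three access modes described separately
access_options <- list(
  list(type = "Application Programming Interface (API)",
       uri  = "https://api.example.org/v1/",
       note = "REST API returning JSON"),
  list(type = "Bulk download in CSV format",
       uri  = "https://example.org/data/bulk.zip"),
  list(type = "On-line query interface",
       uri  = "https://example.org/query")
)
```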
errata
[Optional ; Repeatable]
+A list of errata at the database level. Note that an errata
element is also available in the schema used for the description of indicators/series.
"errata": [
+{
+ "date": "string",
+ "description": "string"
+ }
+ ]
date
[Optional ; Not repeatable ; String]
+The date the erratum was published, preferably entered in ISO format.
description
[Optional ; Not repeatable ; String]
A description of the error and of the measures taken to remedy it.
license
[Optional ; Repeatable]
+This set of elements is used to describe the access license(s) attached to the database.
"license": [
+{
+ "name": "string",
+ "uri": "string",
+ "note": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
+The name of the license, for example “Creative Commons Attribution 4.0 International license (CC-BY 4.0)”.
uri
[Optional ; Not repeatable ; String]
A URI to a description of the license, for example “https://creativecommons.org/licenses/by/4.0/”.
note
[Optional ; Not repeatable ; String]
+Any additional information to qualify the license requirements.
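A license sketch using the CC-BY 4.0 example cited above:

```r
# Hypothetical example: a single CC-BY 4.0 license entry
license <- list(
  list(name = "Creative Commons Attribution 4.0 International license (CC-BY 4.0)",
       uri  = "https://creativecommons.org/licenses/by/4.0/")
)
```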
citation
[Optional ; Not repeatable ; String]
The citation requirement for the database (i.e. how users should cite the database in publications and reports).
notes
[Optional ; Repeatable]
+This element is provided to add notes that are relevant for describing the database, that cannot be provided in other metadata elements.
"notes": [
+{
+ "note": "string"
+ }
+ ]
note
[Optional ; Not repeatable ; String]
+A free-text note.
disclaimer
[Optional ; Not repeatable ; String]
If the agency responsible for managing the database has determined that there may be some liability as a result of the data, the element may be used to provide a disclaimer statement.
copyright
[Optional ; Not repeatable ; String]
Information on the copyright, if any, that applies to the database.
provenance
[Optional ; Repeatable]
+Metadata can be programmatically harvested from external catalogs. The provenance
group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+
"provenance": [
+{
+ "origin_description": {
+ "harvest_date": "string",
+ "altered": true,
+ "base_url": "string",
+ "identifier": "string",
+ "date_stamp": "string",
+ "metadata_namespace": "string"
+ }
+ }
+ ]
origin_description
[Required ; Not repeatable]
+The origin_description
elements are used to describe when and from where metadata have been extracted or harvested.
harvest_date
[Required ; Not repeatable ; String]
The date, preferably entered in ISO 8601 format, on which the metadata were harvested from the source catalog.
altered
[Optional ; Not repeatable ; Boolean]
Indicates whether the harvested metadata have been modified before being re-published. In some cases, the unique identifier of the entry (element idno in the Document Description / Title Statement section) will be modified when published in a new catalog.
base_url
[Required ; Not repeatable ; String]
The URL of the source catalog from where the metadata were harvested.
identifier
[Optional ; Not repeatable ; String]
The unique identifier of the entry (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
date_stamp
[Optional ; Not repeatable ; String]
The date stamp of the metadata record in the source catalog.
metadata_namespace
[Optional ; Not repeatable ; String]
lda_topics
[Optional ; Not repeatable]
+
"lda_topics": [
+{
+ "model_info": [
+ {
+ "source": "string",
+ "author": "string",
+ "version": "string",
+ "model_id": "string",
+ "nb_topics": 0,
+ "description": "string",
+ "corpus": "string",
+ "uri": "string"
+ }
+ ],
+ "topic_description": [
+ {
+ "topic_id": null,
+ "topic_score": null,
+ "topic_label": "string",
+ "topic_words": [
+ {
+ "word": "string",
+ "word_weight": 0
+ }
+ ]
+ }
+ ]
+ }
+ ]
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference provides the share that each topic represents in the document. The shares of all represented topics sum to 1 (100%).
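The normalization of inferred topic weights into shares can be sketched in Python; this is a minimal, library-agnostic illustration (the raw weights, topic identifiers, and the `topic_shares` helper are hypothetical, not part of any LDA library):

```python
# Sketch: turn raw topic weights from an LDA inference step into topic
# shares that sum to 1, keeping only topics above a minimum share.
# Weights and topic identifiers below are hypothetical, for illustration.

def topic_shares(raw_weights, min_share=0.0):
    """Normalize raw topic weights into shares summing to 1, then filter."""
    total = sum(raw_weights.values())
    shares = {topic: weight / total for topic, weight in raw_weights.items()}
    return {t: round(s, 4) for t, s in shares.items() if s >= min_share}

weights = {"topic_27": 8.0, "topic_8": 6.0, "topic_39": 5.5, "topic_40": 2.75}
topic_shares(weights)                  # shares of all topics sum to 1 (100%)
topic_shares(weights, min_share=0.15)  # drops topic_40 (~12% share)
```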
The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition.

Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to use the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.
The image below provides an example of topics extracted from a document from the United Nations High Commissioner for Refugees, using an LDA topic model trained by the World Bank (this model was trained to identify 75 topics; no document will cover all topics).

The `lda_topics` element includes the following metadata fields:
- **`model_info`** *[Optional ; Not repeatable]*
  Information on the LDA model.
  - **`source`** *[Optional ; Not repeatable ; String]*
  - **`author`** *[Optional ; Not repeatable ; String]*
  - **`version`** *[Optional ; Not repeatable ; String]*
  - **`model_id`** *[Optional ; Not repeatable ; String]*
  - **`nb_topics`** *[Optional ; Not repeatable ; Numeric]*
  - **`description`** *[Optional ; Not repeatable ; String]*
  - **`corpus`** *[Optional ; Not repeatable ; String]*
  - **`uri`** *[Optional ; Not repeatable ; String]*
- **`topic_description`** *[Optional ; Repeatable]*
  The topic composition of the document.
  - **`topic_id`** *[Optional ; Not repeatable ; String]*
  - **`topic_score`** *[Optional ; Not repeatable ; Numeric]*
  - **`topic_label`** *[Optional ; Not repeatable ; String]*
  - **`topic_words`** *[Optional ; Not repeatable]*
    - **`word`** *[Optional ; Not repeatable ; String]*
    - **`word_weight`** *[Optional ; Not repeatable ; Numeric]*

```r
lda_topics = list(

  list(

    model_info = list(
      list(source      = "World Bank, Development Data Group",
           author      = "A.S.",
           version     = "2021-06-22",
           model_id    = "Mallet_WB_75",
           nb_topics   = 75,
           description = "LDA model, 75 topics, trained on Mallet",
           corpus      = "World Bank Documents and Reports (1950-2021)",
           uri         = "")
    ),

    topic_description = list(

      list(topic_id    = "topic_27",
           topic_score = 32,
           topic_label = "Education",
           topic_words = list(list(word = "school",    word_weight = ""),
                              list(word = "teacher",   word_weight = ""),
                              list(word = "student",   word_weight = ""),
                              list(word = "education", word_weight = ""),
                              list(word = "grade",     word_weight = ""))),

      list(topic_id    = "topic_8",
           topic_score = 24,
           topic_label = "Gender",
           topic_words = list(list(word = "women",  word_weight = ""),
                              list(word = "gender", word_weight = ""),
                              list(word = "man",    word_weight = ""),
                              list(word = "female", word_weight = ""),
                              list(word = "male",   word_weight = ""))),

      list(topic_id    = "topic_39",
           topic_score = 22,
           topic_label = "Forced displacement",
           topic_words = list(list(word = "refugee",   word_weight = ""),
                              list(word = "programme", word_weight = ""),
                              list(word = "country",   word_weight = ""),
                              list(word = "migration", word_weight = ""),
                              list(word = "migrant",   word_weight = ""))),

      list(topic_id    = "topic_40",
           topic_score = 11,
           topic_label = "Development policies",
           topic_words = list(list(word = "development", word_weight = ""),
                              list(word = "policy",      word_weight = ""),
                              list(word = "national",    word_weight = ""),
                              list(word = "strategy",    word_weight = ""),
                              list(word = "activity",    word_weight = "")))

    )

  )
)
```
The information provided by LDA models can be used to build a “filter by topic composition” tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic.
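Such a filter can be sketched as follows, assuming each document carries pre-computed topic shares; the `filter_by_topics` helper and the document records are hypothetical, for illustration only:

```python
# Sketch of a "filter by topic composition" helper: select documents whose
# inferred topic shares meet user-defined minimum thresholds.

def filter_by_topics(documents, thresholds):
    """Keep ids of documents meeting every minimum topic-share threshold."""
    return [
        doc["id"] for doc in documents
        if all(doc["topics"].get(topic, 0.0) >= min_share
               for topic, min_share in thresholds.items())
    ]

docs = [
    {"id": "doc1", "topics": {"education": 0.32, "gender": 0.24}},
    {"id": "doc2", "topics": {"education": 0.05, "gender": 0.40}},
]
filter_by_topics(docs, {"education": 0.20, "gender": 0.20})  # → ["doc1"]
```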
**`embeddings`** *[Optional ; Repeatable]*

In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into numeric vectors of large dimension (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model.
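The distance calculation can be illustrated with cosine similarity on toy vectors; the `closest` helper and the catalog content below are hypothetical (a production system would delegate this to a vector database such as Milvus, and real vectors have, e.g., 100 or 200 dimensions):

```python
import math

# Sketch: rank documents by semantic closeness using cosine similarity
# between embedding vectors. Vectors here are tiny and illustrative.

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closest(query_vec, catalog):
    """Return catalog document ids, most similar first."""
    return sorted(catalog, key=lambda d: cosine(query_vec, catalog[d]),
                  reverse=True)

catalog = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.1, 0.9, 0.2]}
closest([1.0, 0.0, 0.0], catalog)  # doc_a ranks first
```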
The `embeddings` element contains four metadata fields:

- **`id`** *[Optional ; Not repeatable ; String]*
- **`description`** *[Optional ; Not repeatable ; String]*
- **`date`** *[Optional ; Not repeatable ; String]*
- **`vector`** *[Required ; Not repeatable ; Object]*
  The numeric vector representing the document, provided as an object (array or string).

**`additional`** *[Optional ; Not repeatable]*

The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
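This validation rule can be sketched as a simple pre-publication check; the `find_misplaced_custom_keys` helper and the abridged key list below are illustrative assumptions, not part of the schema tooling:

```python
# Sketch: check that custom elements only appear inside the `additional`
# block. The list of known top-level keys is abridged for illustration.

KNOWN_KEYS = {"metadata_information", "series_description", "provenance",
              "tags", "lda_topics", "embeddings", "additional"}

def find_misplaced_custom_keys(metadata):
    """Return top-level keys that are not part of the schema."""
    return sorted(set(metadata) - KNOWN_KEYS)

meta = {"series_description": {}, "my_custom_field": "x",
        "additional": {"my_other_field": "y"}}
find_misplaced_custom_keys(meta)  # → ["my_custom_field"]
```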
We use the World Bank’s World Development Indicators 2021 (WDI) database as an example. In this example, we assume that all information is entered manually in the script. In a real application, it is likely that some elements like the list and number of geographic areas covered in the database, or the start and end year of the period covered by the data, will be extracted programmatically by reading the data file (the WDI data and related metadata can be downloaded as CSV or MS-Excel files), or by extracting information from the database API (WDI metadata is available via API).
**Using R**

```r
# The code below creates an object `wdi_database` ready to be published
# in a NADA catalog (using the NADAR package).

wdi_database <- list(

  database_description = list(

    title_statement = list(
      idno = "WB_WDI_2021_09_15",
      title = "World Development Indicators 2021",
      alternate_title = "WDI 2021"
    ),

    authoring_entity = list(name = "Development Data Group",
                            affiliation = "The World Bank Group"),

    abstract = "The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.",

    url = "https://datatopics.worldbank.org/world-development-indicators/",

    type = "Time series database",

    date_created = "2021-09-15",
    date_published = "2021-09-15",

    version = list(
      list(version = "On-line public version (open data), 15 September 2021",
           date = "2021-09-15",
           responsibility = "World Bank, Development Data Group")),

    update_frequency = "Quarterly",

    update_schedule = list(list(update = "April, July, September, December")),

    time_coverage = list(list(start = "1960", end = "2021")),

    periodicity = list(list(period = "Annual")),

    topics = list_topics,   # `list_topics` assumed to be defined previously

    geographic_units = list(
      list(code = "ABW", name = "Aruba"),
      list(code = "AFE", name = "Africa Eastern and Southern"),
      list(code = "AFG", name = "Afghanistan"),
      list(code = "AFW", name = "Africa Western and Central"),
      list(code = "AGO", name = "Angola"),
      list(code = "ALB", name = "Albania"),
      list(code = "AND", name = "Andorra"),
      list(code = "ARB", name = "Arab World"),
      list(code = "ARE", name = "United Arab Emirates"),
      list(code = "ARG", name = "Argentina")
      # ... and 255 more - not shown here
    ),

    geographic_granularity = "global, national, regional",

    geographic_area_count = "265",

    languages = list(
      list(code = "en", name = "English"),
      list(code = "sp", name = "Spanish"),
      list(code = "fr", name = "French"),
      list(code = "ar", name = "Arabic"),
      list(code = "cn", name = "Chinese")
    ),

    contacts = list(list(name = "Data Help Desk",
                         affiliation = "World Bank",
                         uri = "https://datahelpdesk.worldbank.org/",
                         email = "data@worldbank.org")),

    access_options = list(
      list(type = "API",
           uri = "https://datahelpdesk.worldbank.org/knowledgebase/articles/889386"),
      list(type = "Bulk (CSV)",
           uri = "https://data.worldbank.org/data-catalog/world-development-indicators"),
      list(type = "Query",
           uri = "http://databank.worldbank.org/data/source/world-development-indicators"),
      list(type = "PDF",
           uri = "https://openknowledge.worldbank.org/bitstream/handle/10986/26447/WDI-2017-web.pdf")),

    license = list(list(type = "CC BY-4.0",
                        uri = "https://creativecommons.org/licenses/by/4.0/")),

    citation = "World Development Indicators 2021 (September), The World Bank"

  )
)
```
**Using Python**

```python
# The code below creates a dictionary `wdi_database` ready to be published
# in a NADA catalog (using the PyNADA library).

wdi_database = {
  "database_description": {

    "title_statement": {
      "idno": "WB_WDI_2021_09_15",
      "title": "World Development Indicators 2021",
      "alternate_title": "WDI 2021"
    },

    "authoring_entity": {"name": "Development Data Group",
                         "affiliation": "The World Bank Group"},

    "abstract": "The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.",

    "url": "https://datatopics.worldbank.org/world-development-indicators/",

    "type": "Time series database",

    "date_created": "2021-09-15",
    "date_published": "2021-09-15",

    "version": [{"version": "On-line public version (open data), 15 September 2021",
                 "date": "2021-09-15",
                 "responsibility": "World Bank, Development Data Group"}],

    "update_frequency": "Quarterly",

    "update_schedule": [{"update": "April, July, September, December"}],

    "time_coverage": [{"start": "1960", "end": "2021"}],

    "periodicity": [{"period": "Annual"}],

    "topics": list_topics,  # `list_topics` assumed to be defined previously

    "geographic_units": [
      {"code": "ABW", "name": "Aruba"},
      {"code": "AFE", "name": "Africa Eastern and Southern"},
      {"code": "AFG", "name": "Afghanistan"},
      {"code": "AFW", "name": "Africa Western and Central"},
      {"code": "AGO", "name": "Angola"},
      {"code": "ALB", "name": "Albania"},
      {"code": "AND", "name": "Andorra"},
      {"code": "ARB", "name": "Arab World"},
      {"code": "ARE", "name": "United Arab Emirates"},
      {"code": "ARG", "name": "Argentina"}
      # ... and 255 more, not shown here
    ],

    "geographic_granularity": "global, national, regional",

    "geographic_area_count": "265",

    "languages": [
      {"code": "en", "name": "English"},
      {"code": "sp", "name": "Spanish"},
      {"code": "fr", "name": "French"},
      {"code": "ar", "name": "Arabic"},
      {"code": "cn", "name": "Chinese"}
    ],

    "contacts": [{"name": "Data Help Desk",
                  "affiliation": "World Bank",
                  "uri": "https://datahelpdesk.worldbank.org/",
                  "email": "data@worldbank.org"}],

    "access_options": [
      {"type": "API",
       "uri": "https://datahelpdesk.worldbank.org/knowledgebase/articles/889386"},
      {"type": "Bulk (CSV)",
       "uri": "https://data.worldbank.org/data-catalog/world-development-indicators"},
      {"type": "Query",
       "uri": "http://databank.worldbank.org/data/source/world-development-indicators"},
      {"type": "PDF",
       "uri": "https://openknowledge.worldbank.org/bitstream/handle/10986/26447/WDI-2017-web.pdf"}
    ],

    "license": [{"type": "CC BY-4.0",
                 "uri": "https://creativecommons.org/licenses/by/4.0/"}],

    "citation": "World Development Indicators 2021 (September), The World Bank"

  }
}
```
Indicators are summary measures related to key issues or phenomena, derived from observed facts. Indicators form time series when they are provided with a temporal ordering, i.e., when their values are provided with an ordered annual, quarterly, monthly, daily, or other time reference. Time series are usually published with equal intervals between values. In the context of this Guide, however, we consider as time series all indicators provided for a given geographic area with an associated time reference, whether the time references form a regular, continuous succession or not. For example, the indicators provided by the Demographic and Health Surveys (DHS) StatCompiler, which are only available for the years when DHS are conducted in countries (which for some countries can be a single year), are considered here as “time series”.

Time series are often contained in multi-indicator databases, like the World Bank’s World Development Indicators (WDI), whose on-line version contains series for 1,430 indicators (as of 2021). To document not only the series but also the databases they belong to, we propose two metadata schemas: one to document the series/indicators, the other to document the databases they belong to.

In the NADA application, a series can be documented and published without an associated database, but information on a database will only be published in association with a series. The information on a database is thus treated as an “attachment” to the information on a series. A SERIES DESCRIPTION tab will display all metadata related to the series, i.e., all content entered in the series schema.
The (optional) SOURCE DATABASE tab will display the metadata related to the database, i.e. all content entered in the series database schema. This information is displayed for information, but not indexed in the NADA catalog (i.e. not searchable).
Suggestions and recommendations to data curators

Indicators and time series often come with metadata limited to the indicator/series name and a brief definition. This significantly reduces the discoverability of the indicators, and the possibility to implement semantic searchability and recommender systems. It is therefore highly recommended to generate more detailed metadata for each time series, including information on the purpose and typical use of the indicator, its relevance to different audiences, its limitations, and more.

When documenting an indicator or time series, attention should be paid to include keywords and phrases in the metadata that reflect how data users are likely to formulate their queries when searching data catalogs. Subject-matter expertise, combined with an analysis of queries submitted to data catalogs, can help to identify such keywords. For example, the metadata related to an indicator “Prevalence of stunting” should contain the keyword “malnutrition”, and the metadata related to “GDP per capita” should include keywords like “economic growth” or “national income”. By doing so, data curators will provide richer input to search engines and recommender systems, and will have a significant and direct impact on the discoverability of the data. The use of AI tools can considerably facilitate the process of identifying related keywords. We provide in this chapter an example of the use of ChatGPT for that purpose.
An indicator or time series is documented using the time series/indicators schema. The database schema is optional, and used to document the database, if any, that the indicator belongs to. When multiple series of the same database are documented, the metadata related to the database only needs to be generated once, then applied to all series. One metadata element in the time series/indicators schema is used to link an indicator to the corresponding database.

The time series schema is used to document an indicator or a time series. In NADA, the data and metadata of an indicator can (but do not have to) be published with information on the database it belongs to (if any). A metadata element is provided to indicate the identifier of that database (if any), and to establish the link between the indicator metadata and the database metadata generated using the schema described above.
```json
{
  "repositoryid": "string",
  "access_policy": "na",
  "data_remote_url": "string",
  "published": 0,
  "overwrite": "no",
  "metadata_information": {},
  "series_description": {},
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": { }
}
```
The first elements of the schema (`repositoryid`, `access_policy`, `data_remote_url`, `published`, and `overwrite`) are not part of the series metadata. They are parameters used to indicate how the series will be published in a NADA catalog.

- `repositoryid` identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA.
- `access_policy` indicates the access policy to be applied to the data: direct access, open access, public use files, licensed access, data accessible from an external repository, or data not accessible. A controlled vocabulary is provided and must be used, with the following respective options: {`direct`; `open`; `public`; `licensed`; `remote`; `data_na`}.
- `data_remote_url` provides the link to an external website where the data can be obtained, if `access_policy` has been set to `remote`.
- `published` indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.
- `overwrite` indicates whether metadata that may have been previously uploaded for the same series can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. Note that a series will be considered the same as a previously uploaded one if the identifier provided in the metadata element `series_description > idno` is the same.
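The overwrite logic can be sketched as follows; the `publish` helper and the in-memory catalog are hypothetical simplifications of what NADA does server-side:

```python
# Sketch of the overwrite rule: a series is treated as already present when
# its `series_description > idno` matches an existing entry; it is replaced
# only if `overwrite` is set to "yes".

def publish(catalog, entry, overwrite="no"):
    """Add an entry to a catalog dict keyed by idno; honor the overwrite flag."""
    idno = entry["series_description"]["idno"]
    if idno in catalog and overwrite != "yes":
        return "skipped"      # existing metadata kept
    catalog[idno] = entry
    return "published"

catalog = {}
s1 = {"series_description": {"idno": "SERIES_001"}}
publish(catalog, s1)                   # → "published"
publish(catalog, s1)                   # → "skipped"
publish(catalog, s1, overwrite="yes")  # → "published"
```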
**`metadata_information`** *[Optional ; Not repeatable]*

The set of elements in `metadata_information` is used to provide information on the production of the indicator metadata. This information is used mostly for administrative purposes by data curators and catalog administrators.
```json
"metadata_information": {
  "title": "string",
  "idno": "string",
  "producers": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "role": "string"
    }
  ],
  "prod_date": "string",
  "version": "string"
}
```
- **`title`** *[Optional ; Not repeatable ; String]*
  The title of the metadata document containing the indicator metadata.
- **`idno`** *[Required ; Not repeatable ; String]*
  A unique identifier of the indicator metadata document. It can be, for example, the identifier of the indicator preceded by a prefix identifying the metadata producer.
- **`producers`** *[Optional ; Repeatable]*
  A list of producers involved in the documentation (production of the metadata) of the series.
  - **`name`** *[Optional ; Not repeatable ; String]*
    The name of the metadata producer.
  - **`abbr`** *[Optional ; Not repeatable ; String]*
    An abbreviation (acronym) of the producer mentioned in `name`.
  - **`affiliation`** *[Optional ; Not repeatable ; String]*
    The affiliation of the producer mentioned in `name`.
  - **`role`** *[Optional ; Not repeatable ; String]*
    The specific role of the producer mentioned in `name` in the production of the metadata. This element will be used when more than one person or organization is listed in the `producers` element to distinguish the specific contribution of each metadata producer.
- **`prod_date`** *[Optional ; Not repeatable ; String]*
  The date the metadata was generated. The date should be entered in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY).
- **`version`** *[Optional ; Not repeatable ; String]*
  The version of the metadata on this series. This element will rarely be used.
```r
metadata_creation = list(

  producers = list(list(name = "Development Data Group",
                        abbr = "DECDG",
                        affiliation = "World Bank")),

  prod_date = "2021-10-15"

)
```
**`series_description`** *[Required ; Repeatable]*

This section contains all elements used to describe a specific series or indicator.
```json
"series_description": {
  "idno": "string",
  "doi": "string",
  "name": "string",
  "database_id": "string",
  "aliases": [],
  "alternate_identifiers": [],
  "languages": [],
  "measurement_unit": "string",
  "dimensions": [],
  "periodicity": "string",
  "base_period": "string",
  "definition_short": "string",
  "definition_long": "string",
  "definition_references": [],
  "statistical_concept": "string",
  "concepts": [],
  "methodology": "string",
  "derivation": "string",
  "imputation": "string",
  "missing": "string",
  "quality_checks": "string",
  "quality_note": "string",
  "sources_discrepancies": "string",
  "series_break": "string",
  "limitation": "string",
  "themes": [],
  "topics": [],
  "disciplines": [],
  "relevance": "string",
  "time_periods": [],
  "ref_country": [],
  "geographic_units": [],
  "bbox": [],
  "aggregation_method": "string",
  "disaggregation": "string",
  "license": [],
  "confidentiality": "string",
  "confidentiality_status": "string",
  "confidentiality_note": "string",
  "links": [],
  "api_documentation": [],
  "authoring_entity": [],
  "sources": [],
  "sources_note": "string",
  "keywords": [],
  "acronyms": [],
  "errata": [],
  "notes": [],
  "related_indicators": [],
  "compliance": [],
  "framework": [],
  "series_groups": []
}
```
**`idno`** *[Required ; Not repeatable ; String]*

A unique identifier (ID) for the series. Most agencies and databases will have a coherent coding convention to generate their series IDs. For example, the series identifiers in the World Bank’s World Development Indicators are composed of the following elements, separated by a dot:

For example, the series with identifier “DT.DIS.PRVT.CD” is the series containing data on “External debt disbursements by private creditors in current US dollars” (for more information, see “How does the World Bank code its indicators?”).
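Parsing such an identifier is a simple split on dots. In the sketch below, the `describe_series_id` helper and the topic-prefix lookup are illustrative assumptions; consult the World Bank’s coding documentation for the authoritative component meanings:

```python
# Sketch: split a WDI-style series identifier into its dot-separated
# components, and map the leading component to a topic using a tiny,
# hypothetical lookup table (abridged for illustration).

TOPIC_PREFIXES = {"DT": "External debt", "NY": "National accounts"}

def describe_series_id(idno):
    """Split a dot-separated series id and guess its topic prefix."""
    parts = idno.split(".")
    return {"parts": parts, "topic": TOPIC_PREFIXES.get(parts[0], "unknown")}

describe_series_id("DT.DIS.PRVT.CD")
# → {"parts": ["DT", "DIS", "PRVT", "CD"], "topic": "External debt"}
```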
**`doi`** *[Optional ; Not repeatable ; String]*

A Digital Object Identifier (DOI) for the series.

**`name`** *[Required ; Not repeatable ; String]*

The name (label) of the series. Note that a field `alias` is provided (see below) to capture alternative names for the series.
**`database_id`** *[Optional ; Not repeatable ; String]*

The unique identifier of the database the series belongs to. This field must correspond to the element `database_description > title_statement > idno` of the database schema described above. This is the only field needed to establish the link between the database metadata and the indicator metadata.
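The link can be illustrated as a simple lookup; the `attach_database` helper and the abridged metadata records are hypothetical:

```python
# Sketch: attaching database metadata to a series via `database_id`, which
# must match `database_description > title_statement > idno` in the
# database schema. Metadata content is abridged for illustration.

databases = {
    "WB_WDI_2021_09_15": {"title": "World Development Indicators 2021"}
}

def attach_database(series, databases):
    """Return the database metadata a series points to, if any."""
    return databases.get(series.get("database_id"))

series = {"idno": "WB_WDI_SP.POP.TOTL", "database_id": "WB_WDI_2021_09_15"}
attach_database(series, databases)
# → {"title": "World Development Indicators 2021"}
```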
**`aliases`** *[Optional ; Repeatable]*

A series or an indicator can be referred to using different names. The `aliases` element is provided to capture the multiple names and labels (i.e., synonyms) that may be associated with the documented series or indicator.
```json
"aliases": [
  {
    "alias": "string"
  }
]
```

- **`alias`** *[Optional ; Not repeatable ; String]*
  An alternative name for the indicator or series being documented.
**`alternate_identifiers`** *[Optional ; Repeatable]*

The `idno` element described above is the reference unique identifier for the catalog in which the metadata is intended to be published. But the same indicator/metadata may be published in other catalogs. For example, a data catalog may publish metadata for series extracted from the World Bank World Development Indicators (WDI) database. And the WDI itself contains series generated and published by other organizations, such as the World Health Organization or UNICEF. Catalog administrators may want to assign a unique identifier specific to their catalog (the `idno` element), but keep track of the identifier of the series or indicator in other catalogs or databases. The `alternate_identifiers` element serves that purpose.
```json
"alternate_identifiers": [
  {
    "identifier": "string",
    "name": "string",
    "database": "string",
    "uri": "string",
    "notes": "string"
  }
]
```
- **`identifier`** *[Required ; Not repeatable ; String]*
  An identifier for the series other than the identifier entered in `idno`. Note that the identifier entered in `idno` can be included in this list, if it is useful to provide it with a type of identifier (see the `name` element below), which is not provided in `idno`. This can be the identifier of the indicator in another database/catalog, or a global unique identifier.
- **`name`** *[Optional ; Not repeatable ; String]*
  This element is used to define the type of identifier. It will typically be used to flag DOIs by entering “Digital Object Identifier (DOI)”.
- **`database`** *[Optional ; Not repeatable ; String]*
  The name of the database (or catalog) where this alternative identifier is used, e.g., “IMF, International Financial Statistics (IFS)”.
- **`uri`** *[Optional ; Not repeatable ; String]*
  A link (URL) to the database mentioned in `database`.
- **`notes`** *[Optional ; Not repeatable ; String]*
  Any additional information on the alternate identifier.
**`languages`** *[Optional ; Repeatable]*

The language(s) in which the metadata of the series or indicator are provided.
```json
"languages": [
  {
    "name": "string",
    "code": "string"
  }
]
```
- **`name`** *[Required ; Not repeatable ; String]*
  The name of the language.
- **`code`** *[Optional ; Not repeatable ; String]*
  The code of the language, preferably the ISO code.
**`measurement_unit`** *[Optional ; Not repeatable ; String]*

The unit of measurement. Note that in many databases the measurement unit will be included in the series name/label. In the World Bank’s World Development Indicators, for example, series are named as follows:

In such cases, the name of the series should not be changed, but the measurement unit may be extracted from it and stored in the element `measurement_unit`.
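Extracting the unit from a series name can be sketched with a regular expression; the `extract_unit` helper assumes the unit appears in trailing parentheses, which is a naming convention, not a rule:

```python
import re

# Sketch: extract a measurement unit from a series name of the form
# "Indicator name (unit)", without altering the name itself.

def extract_unit(series_name):
    """Return the content of a trailing parenthesized unit, if present."""
    match = re.search(r"\(([^()]*)\)\s*$", series_name)
    return match.group(1) if match else None

extract_unit("GDP per capita (current US$)")  # → "current US$"
```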
**`dimensions`** *[Optional ; Repeatable]*

An indicator or time series can be made available at different levels of disaggregation. For example, a time series containing annual estimates of the indicator “Resident population (mid-year)” can be provided by country, by urban/rural area of residence, by sex, and by age group. The data curator has to decide how to organize such data. One option is to create an indicator “Resident population (mid-year)” and to define a set of “dimensions” for the breakdowns. The dimensions would in this case be the year, the country, the area of residence, the sex, and the age group. Some of the dimensions would have to be provided with a code list (or “controlled vocabulary”, for example stating that “F” means “Female”, “M” means “Male”, and “T” means “Total” for the dimension sex). Another option would be to create multiple indicators (e.g., creating three distinct indicators “Resident population, male (mid-year)”, “Resident population, female (mid-year)”, and “Resident population, total (mid-year)”, and using year, country, area of residence, and age group as dimensions). The element `dimensions` is used to provide an itemized list of disaggregations that correspond to the published data. Note that another element in the schema, `disaggregation`, is also provided, in which a narrative description of the actual or recommended disaggregations can be documented. Note also that in the SDMX standard, dimensions are listed in the *Data Structure Definition* and are complemented by *code lists* that provide the related controlled vocabularies.
```json
"dimensions": [
  {
    "name": "string",
    "label": "string",
    "description": "string"
  }
]
```
- **`name`** *[Required ; Not repeatable ; String]*
  The name of the dimension.
- **`label`** *[Required ; Not repeatable ; String]*
  The label of the dimension, for example “sex” or “urban/rural”.
- **`description`** *[Optional ; Not repeatable ; String]*
  A description of the dimension (for example, if the label was “age group”, the description can provide detailed information on the age groups, e.g., “The age groups in the database are 0-14, 15-49, 50-64, and 65+ years old”).
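A dimension and its code list can be represented as in the sketch below, in the spirit of SDMX dimensions complemented by code lists; the `sex_dimension` record and the `decode` helper are illustrative, not schema elements:

```python
# Sketch: a dimension definition paired with a code list (controlled
# vocabulary). Codes follow the F/M/T example given in the text.

sex_dimension = {
    "name": "sex",
    "label": "Sex",
    "code_list": {"F": "Female", "M": "Male", "T": "Total"},
}

def decode(dimension, code):
    """Translate a dimension code into its label."""
    return dimension["code_list"].get(code, "unknown code")

decode(sex_dimension, "F")  # → "Female"
```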
**`release_calendar`** *[Optional ; Not repeatable ; String]*

Information on when updates for the indicator can be expected. This will usually not consist of exact dates (which would have to be updated regularly), but of more general information like “Every first Monday of the month”, “Every year on June 30”, or “The last week of each quarter”.

**`periodicity`** *[Optional ; Not repeatable ; String]*

The periodicity of the series. It is recommended to use a controlled vocabulary with values like annual, quarterly, monthly, daily, etc.

**`base_period`** *[Optional ; Not repeatable ; String]*

The base period for the series. This field only applies to series that require a base year (or other reference time) used as a benchmark, like a Consumer Price Index (CPI), which will have a value of 100 for a reference base year.
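The role of the base period can be illustrated by rebasing an index series so that the base year equals 100, as for a CPI; the `rebase` helper and the index values are hypothetical:

```python
# Sketch: re-express an index series against a chosen base period, so that
# the value for the base period equals 100. Values are illustrative.

def rebase(series, base_period):
    """Scale all values so the base period equals 100."""
    base_value = series[base_period]
    return {period: round(100 * value / base_value, 2)
            for period, value in series.items()}

cpi = {"2019": 250.0, "2020": 255.0, "2021": 267.5}
rebase(cpi, "2020")  # → {"2019": 98.04, "2020": 100.0, "2021": 104.9}
```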
**`definition_short`** *[Optional ; Not repeatable ; String]*

A short definition of the series. The short definition captures the essence of the series.

**`definition_long`** *[Optional ; Not repeatable ; String]*

A long(er) version of the definition of the series. If only one definition is available (not a short/long version), it is recommended to capture it in the `definition_short` element. Alternatively, the same definition can be stored in both `definition_short` and `definition_long`.
**`definition_references`** *[Optional ; Repeatable]*

This element is provided to link to external resources from which the definition was extracted.
```json
"definition_references": [
  {
    "source": "string",
    "uri": "string",
    "note": "string"
  }
]
```
- **`source`** *[Optional ; Not repeatable ; String]*
  The source of the definition (title, or label).
- **`uri`** *[Optional ; Not repeatable ; String]*
  A link (URL) to the source of the definition.
- **`note`** *[Optional ; Not repeatable ; String]*
  This element provides for annotating or explaining the reason the reference has been included as part of the metadata.
**`statistical_concept`** *[Optional ; Not repeatable ; String]*

This element allows a reference to the statistical content of the series to be included. This can include coding concepts or standards that are applied to render the data statistically relevant.
**`concepts`** *[Optional ; Repeatable]*

This repeatable element can be used to document concepts related to the indicator or time series (other than the main statistical concept that may have been entered in `statistical_concept`). For example, the concept of malnutrition could be documented in relation to the indicators “Prevalence of stunting” and “Prevalence of wasting”.
```json
"concepts": [
  {
    "name": "string",
    "definition": "string",
    "uri": "string"
  }
]
```
name
[Required ; Not repeatable ; String]
+A concise and standardized name (label) for the concept.
definition
[Required ; Not repeatable ; String]
+The definition of the concept.
uri
[Optional ; Not repeatable ; String]
+A link (URL) to a resource providing more detailed information on the concept.
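+For illustration, a filled-in concepts entry for the malnutrition example mentioned above might look as follows (the definition wording and URL are indicative, not authoritative):

```json
"concepts": [
  {
    "name": "Malnutrition",
    "definition": "Deficiencies, excesses, or imbalances in a person's intake of energy and/or nutrients.",
    "uri": "https://www.who.int/health-topics/malnutrition"
  }
]
```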
data_collection
[Optional ; Not repeatable]
+This group of elements can be used to document data collection activities that led to or allowed the production of the indicator. This element will typically be used for the description of surveys or censuses.
+Note: the schema also contains an element “sources”. That element will be used to document the organization and/or main data production program from which the indicator is derived.
+
"data_collection": [
+{
+ "data_source": "string",
+ "method": "string",
+ "period": "string",
+ "note": "string",
+ "uri": "string"
+ }
+ ]
data_source
[Required ; Not repeatable ; String]
+A concise and standardized name (label) for the data source, e.g. “National Labor Force Survey, 1st quarter 2022”. If multiple data sources were used, they can all be listed here. Note that if a time series has values obtained from many different sources, the source for each value (or group of values) will not be part of the indicator/series metadata, but will be stored as an attribute in the data file, where the information can be associated with a specific observation (“cell note”) or group of observations (e.g., attached to an indicator for all values for a same year or a same area).
method
[Required ; Not repeatable ; String]
+Brief information on the data collection method, e.g. “Sample household survey”.
period
[Optional ; Not repeatable ; String]
+Information on the period of the data collection, e.g. “January to March 2022”.
note
[Optional ; Not repeatable ; String]
+Additional information on the data collection.
uri
[Optional ; Not repeatable ; String]
+A link to a resource (website, document) where more information on the data collection can be found.
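+Using the labor force survey example given under data_source, a complete data_collection block could look like this (all values are illustrative):

```json
"data_collection": [
  {
    "data_source": "National Labor Force Survey, 1st quarter 2022",
    "method": "Sample household survey",
    "period": "January to March 2022",
    "note": "Quarterly survey conducted by the national statistical office.",
    "uri": "https://example.org/lfs-2022-q1"
  }
]
```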
imputation
[Optional ; Not repeatable ; String]
Data may have been imputed to account for data gaps or for other reasons (harmonization/standardization, and others). If imputations have been made, this element provides the space for their description.
adjustments
[Optional ; Repeatable ; String]
Description of any adjustments with respect to use of standard classifications and harmonization of breakdowns for age group and other dimensions, or adjustments made for compliance with specific international or national definitions.
missing
[Optional ; Not repeatable ; String]
Information on missing values in the series or indicator. This information can be related to treatment of missing values, to the cause(s) of missing values, and others.
validation_rules
[Optional ; Repeatable ; String]
Description of the set of rules (itemized) used to validate values for the indicator, e.g. “Is within range 0-100”, or “Is the sum of indicatorX + indicator Y”.
quality_checks
[Optional ; Not repeatable ; String]
Data may have gone through data quality checks to assure that the values are reasonable and coherent, which can be described in this element. These quality checks may include checking for outlying values or other. A brief description of such quality control procedures will contribute to reinforcing the credibility of the data being disseminated.
quality_note
[Optional ; Not repeatable ; String]
Additional notes or an overall statement on data quality. These could for example cover non-standard quality notes and/or information on independent reviews on the data quality.
sources_discrepancies
[Optional ; Not repeatable ; String]
This element is used to describe and explain why the data in the series may be different from the data for the same series published in other sources. International organizations, for example, may apply different techniques to make data obtained from national sources comparable across countries, in which cases the data published in international databases may differ from the data published in national, official databases.
series_break
[Optional ; Not repeatable ; String]
Breaks in statistical series occur when there is a change in the standards, sources of data, or reference year used in the compilation of a series. Breaks in series must be well documented. The documentation should include the reason(s) for the break, the time it occurred, and information on the impact on comparability of data over time.
limitation
[Optional ; Not repeatable ; String]
This element is used to communicate to the user any limitations or exceptions in using the data. The limitations may result from the methodology, from issues of quality or consistency in the data source, or other.
themes
[Optional ; Repeatable]
+Themes provide a general idea of the research that might guide the creation and/or demand for the series. A theme is broad and is likely subject to a community-based definition or list. A controlled vocabulary should be used. This element will rarely be used (the element topics
described below will be used more often).
+
"themes": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
id
[Optional ; Not repeatable ; String]
+The unique identifier of the theme. It can be a sequential number, or the ID of the theme in a controlled vocabulary.
name
[Required ; Not repeatable ; String]
+The label of the theme associated with the data.
parent_id
[Optional ; Not repeatable ; String]
+When a hierarchical (nested) controlled vocabulary is used, the parent_id
field can be used to indicate a higher-level theme to which this theme belongs.
vocabulary
[Optional ; Not repeatable ; String]
+The name of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
+A link to the controlled vocabulary mentioned in field vocabulary.
topics
[Optional ; Repeatable]
+The topics
field indicates the broad substantive topic(s) that the indicator/series covers. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) topics classification.
+
"topics": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
id
[Optional ; Not repeatable ; String]
+The unique identifier of the topic. It can be a sequential number, or the ID of the topic in a controlled vocabulary.
name
[Required ; Not repeatable ; String]
+The label of the topic associated with the data.
+
parent_id
[Optional ; Not repeatable ; String]
+When a hierarchical (nested) controlled vocabulary is used, the parent_id
field can be used to indicate a higher-level topic to which this topic belongs.
vocabulary
[Optional ; Not repeatable ; String]
+The name of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
+A link to the controlled vocabulary mentioned in field vocabulary.
disciplines
[Optional ; Repeatable]
+Information on the academic disciplines related to the content of the document. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia.
+
"disciplines": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
This is a block of five elements:
id
[Optional ; Not repeatable ; String]
+The ID of the discipline, preferably taken from a controlled vocabulary.
name
[Optional ; Not repeatable ; String]
+The name (label) of the discipline, preferably taken from a controlled vocabulary.
parent_id
[Optional ; Not repeatable ; String]
+The parent ID of the discipline (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
vocabulary
[Optional ; Not repeatable ; String]
+The name (including version number) of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
+The URL to the controlled vocabulary used, if any.
relevance
[Optional ; Not repeatable ; String]
This field documents the relevance of an indicator or series in relation to a social imperative or policy objective.
mandate
[Optional ; Not repeatable]
+This group of elements is used to document the institutional mandate under which the indicator is produced and disseminated.
mandate
[Optional ; Not repeatable ; String]
+A description of the mandate, i.e. the formal set of rules or instructions assigning responsibility and authority to an organization for the collection, processing, and dissemination of the indicator.
URI
[Optional ; Not repeatable ; String]
+A link to a resource (document, website) where the mandate is described.
time_periods
[Optional ; Repeatable]
+The time period covers the entire span of data available for the series. The time period has a start and an end and is reported according to the periodicity provided in a previous element.
+
"time_periods": [
+{
+ "start": "string",
+ "end": "string",
+ "notes": "string"
+ }
+ ]
start
[Required ; Not repeatable ; String]
+The initial date of the series in the dataset. The start date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
end
[Required ; Not repeatable ; String]
+The end date is the latest date for which an estimate for the indicator is available. The end date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
notes
[Optional ; Not repeatable ; String]
+Additional information on the time period.
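+For example, a series with annual estimates available from 2000 to 2022 could be documented as follows (the note is illustrative):

```json
"time_periods": [
  {
    "start": "2000",
    "end": "2022",
    "notes": "Annual estimates; some countries have gaps in the early years."
  }
]
```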
ref_country
[Optional ; Repeatable]
+A list of countries for which data are available in the series. This element is somewhat redundant with the next element (geographic_units
) which may also contain a list of countries. Identifying geographic areas of type “country” is important to enable filters and facets in data catalogs (country names are among the most frequent queries submitted to catalogs).
+
"ref_country": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+The name of the country.
code
[Optional ; Not repeatable ; String]
+The code of the country. The use of the ISO 3166-1 alpha-3 codes is recommended.
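+A minimal ref_country entry using the recommended ISO 3166-1 alpha-3 codes might be:

```json
"ref_country": [
  { "name": "Madagascar", "code": "MDG" },
  { "name": "Mauritius", "code": "MUS" }
]
```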
geographic_units
[Optional ; Repeatable]
+List of geographic units (regions, countries, states, provinces, etc.) for which data are available for the series.
+
"geographic_units": [
+{
+ "name": "string",
+ "code": "string",
+ "type": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+Name of the geographic unit, e.g. “World”, “Africa”, “Afghanistan”, “OECD countries”, “Bangkok”.
code
[Optional ; Not repeatable ; String]
+Code of the geographic unit. The ISO 3166-1 alpha-3 code is preferred when the unit is a country.
type
[Optional ; Not repeatable ; String]
+Type of geographic unit e.g. “country”, “state”, “region”, “province”, “city”, etc.
bbox
[Optional ; Repeatable]
+This element is used to define one or multiple bounding box(es), which are the rectangular fundamental geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset’s geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. This element is optional, but if the bound_poly
element (see below) is used, then the bbox
element must be included.
+
"bbox": [
+{
+ "west": "string",
+ "east": "string",
+ "south": "string",
+ "north": "string"
+ }
+ ]
west
[Required ; Not repeatable ; String]
+West longitude of the bounding box.
east
[Required ; Not repeatable ; String]
+East longitude of the bounding box.
south
[Required ; Not repeatable ; String]
+South latitude of the bounding box.
north
[Required ; Not repeatable ; String]
+North latitude of the bounding box.
+
+my_indicator <- list(
+
+  metadata_information = list(
+    # ...
+  ),
+
+  series_description = list(
+    # ...
+
+    study_info = list(
+      # ...
+
+      ref_country = list(
+        list(name = "Madagascar", code = "MDG"),
+        list(name = "Mauritius", code = "MUS")
+      ),
+
+      bbox = list(
+        list(name = "Madagascar",
+             west = "43.2541870461",
+             east = "50.4765368996",
+             south = "-25.6014344215",
+             north = "-12.0405567359"),
+        list(name = "Mauritius",
+             west = "56.6",
+             east = "72.466667",
+             south = "-20.516667",
+             north = "-5.25")
+      )
+
+      # ...
+    )
+
+    # ...
+  )
+)
aggregation_method
[Optional ; Not repeatable ; String]
The aggregation_method
element describes how values can be aggregated from one geographic level (for example, a country) to a higher-level geographic area, for example a group of countries defined on a geographic basis (region, world) or on another criterion (low/medium/high-income countries, island countries, OECD countries, etc.). The aggregation method can be simple (like “sum” or “population-weighted average”) or more complex, involving the weighting of values.
disaggregation
[Optional ; Not repeatable ; String]
This element is intended to inform users that an indicator or series is available at various levels of disaggregation. The related series should be listed (by name and/or identifier). For indicator “Population, total” for example, one may inform the user that the indicator is also available by sex, urban/rural, and age group (in series “Population, male”, “Population, female”, etc.).
license
[Optional ; Repeatable]
+The license refers to the accessibility and terms of use associated with the data. Providing a license and a link to the terms of the license allows data users to determine, with full clarity, what they can and cannot do with the data.
+
"license": [
+{
+ "name": "string",
+ "uri": "string",
+ "note": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+The name of the license, e.g. “Creative Commons Attribution 4.0 International license (CC-BY 4.0)”.
uri
[Optional ; Not repeatable ; String]
+The URL of a website where the license is described in detail, for example “https://creativecommons.org/licenses/by/4.0/”.
note
[Optional ; Not repeatable ; String]
+Any additional information on the license.
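+Using the CC-BY 4.0 license mentioned above as an example, a license entry could be entered as follows (the note is illustrative):

```json
"license": [
  {
    "name": "Creative Commons Attribution 4.0 International license (CC-BY 4.0)",
    "uri": "https://creativecommons.org/licenses/by/4.0/",
    "note": "The data may be freely used and redistributed, provided that the source is attributed."
  }
]
```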
confidentiality
[Optional ; Not repeatable ; String]
A statement of confidentiality for the series.
confidentiality_status
[Optional ; Not repeatable ; String]
This indicates a confidentiality status for the series. A controlled vocabulary should be used with possible options “public”, “official use only”, “confidential”, “strictly confidential”. When all series are made publicly available, and belong to a database that has an open or public access policy, this element can be ignored.
confidentiality_note
[Optional ; Not repeatable ; String]
This element is reserved for additional notes regarding confidentiality of the data. This could involve references to specific laws and circumstances regarding the use of data.
links
[Optional ; Repeatable]
+This element provides links to online resources of any type that could be useful to the data users. This can be links to description of methods and reference documents, analytics tools, visualizations, data sources, or other.
+
"links": [
+{
+ "type": "string",
+ "description": "string",
+ "uri": "string"
+ }
+ ]
type
[Optional ; Not repeatable ; String]
+This element is used to classify the type of link that is provided.
description
[Optional ; Not repeatable ; String]
+A description of the link that is provided.
uri
[Optional ; Not repeatable ; String]
+The uri (URL) to the described resource.
api_documentation
[Optional ; Repeatable]
+Increasingly, data are made accessible via Application Programming Interfaces (APIs). The API associated with a series must be documented. The documentation will usually not be specific to a series, but will apply to all series in the same database.
+
"api_documentation": [
+{
+ "description": "string",
+ "uri": "string"
+ }
+ ]
description
[Optional ; Not repeatable ; String]
+This element will not contain the API documentation itself, but information on what documentation is available.
uri
[Optional ; Not repeatable ; String]
+The URL of the API documentation.
authoring_entity
[Optional ; Repeatable]
+This set of five elements is used to identify the organization(s) or person(s) who are the main producers/curators of the indicator. Note that a similar element is provided at the database level. The authoring_entity for the indicator can be different from the authoring_entity of the database. For example, the World Bank is the authoring entity for the World Development Indicators database, which contains indicators obtained from the International Monetary Fund, World Health Organization, and other organizations that are thus the authoring entities for specific indicators.
+
"authoring_entity": [
+{
+ "name": "string",
+ "affiliation": "string",
+ "abbreviation": null,
+ "email": null,
+ "uri": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
+The name of the person or organization who is responsible for the production of the indicator or series. Write the name in full (use the element abbreviation
to capture the acronym of the organization, if relevant).
affiliation
[Optional ; Not repeatable ; String]
+The affiliation of the person or organization mentioned in name
.
+
abbreviation
[Optional ; Not repeatable ; String]
+Abbreviated name (acronym) of the organization mentioned in name
.
email
[Optional ; Not repeatable ; String]
+The public email contact of the person or organizations mentioned in name
. It is good practice to provide a service account email address, not a personal one.
uri
[Optional ; Not repeatable ; String]
+A link (URL) to the website of the entity mentioned in name
.
sources
[Optional ; Repeatable]
+This element provides information on the source(s) of data that were used to generate the indicator. A source can refer to an organization (e.g., “Source: World Health Organization”), or to a dataset (e.g., for a national poverty headcount indicator, the sources will likely be a list of sample household surveys). In sources
, we are mainly interested in the latter. When a series in a database is a series extracted from another database (e.g., when the World Bank World Development Indicators include a series from the World Health Organization in its database), the source organization should be mentioned in the authoring_entity
element of the schema. The sources
element is a repeatable element.
+Note 1: In some cases, the source of a specific value in a database will be stored as an attribute of the data file (e.g., as a “footnote” attached to a specific cell). If the sources are listed in the data file, they may, but do not need to, be stored in the metadata.
+Note 2: the schema also contains an element “data_collection” that would be used to describe a specific data collection activity from which an indicator is derived.
+
"sources": [
+{
+ "id": "string",
+ "name": "string",
+ "organization": "string",
+ "type": "string",
+ "note": "string"
+ }
+ ]
id
[Required ; String]
+This element records the unique identifier of a source. It is a required element. If the source does not have a specific unique identifier, a sequential number can be used. If the source is a dataset or database that has its own unique identifier (possibly a DOI), this identifier should be used.
name
[Optional ; String]
+The name (title, or label) of the source.
organization
[Optional ; String]
+The organization responsible for the source data.
+
type
[Optional ; String]
+The type of source, e.g. “household survey”, “administrative data”, or “external database”.
+
note
[Optional ; String]
+This element can be used to provide additional information regarding the source data.
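+For a poverty indicator derived from household surveys, a sources entry might look like the following (the identifier, titles, and note are illustrative):

```json
"sources": [
  {
    "id": "1",
    "name": "National Household Income and Expenditure Survey 2019",
    "organization": "National Statistical Office",
    "type": "household survey",
    "note": "Primary source for the 2019 estimate."
  }
]
```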
sources_note
[Optional ; Not repeatable ; String]
Additional information on the source(s) of data used to generate the series or indicator.
keywords
[Optional ; Repeatable]
+Words or phrases that describe salient aspects of a data collection’s content. Can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed. Keywords should be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
+
"keywords": [
+{
+ "name": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String]
+Keyword (or phrase). Keywords summarize the content or subject matter of the study.
vocabulary
[Optional ; Not repeatable ; String]
+Controlled vocabulary from which the keyword is extracted, if any.
+
uri
[Optional ; Not repeatable ; String]
+The URI of the controlled vocabulary used, if any.
acronyms
[Optional ; Repeatable]
+The acronyms
element is used to document the meaning of all acronyms used in the metadata of a series. While some acronyms are well known (like “GDP” or “IMF”), others may be less obvious or ambiguous (does “PPP” mean “public-private partnership”, or “purchasing power parity”?). In any case, providing a list of acronyms with their meaning will help users and make your metadata more discoverable. Note that acronyms should not include country codes used in the documentation of the geographic coverage of the data.
+
"acronyms": [
+{
+ "acronym": "string",
+ "expansion": "string",
+ "occurrence": 0
+ }
+ ]
acronym
[Required ; Not repeatable ; String]
+An acronym referenced in the series metadata (e.g. “GDP”).
expansion
[Required ; Not repeatable ; String]
+The expansion of the acronym, i.e. the full name or title that it represents (e.g., “Gross Domestic Product”).
occurrence
[Optional ; Not repeatable ; Numeric]
+This numeric element can be used to indicate the number of times the acronym is mentioned in the metadata. The element will rarely be used.
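+Using the “PPP” example from above, an acronyms entry could be (the occurrence count is illustrative):

```json
"acronyms": [
  {
    "acronym": "PPP",
    "expansion": "Purchasing power parity",
    "occurrence": 8
  }
]
```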
errata
[Optional ; Repeatable]
+This element is used to provide information on detected errors in the data or metadata for the series, and on the measures taken to remedy them.
+
"errata": [
+{
+ "date": "string",
+ "description": "string"
+ }
+ ]
date
[Required ; Repeatable ; String]
+The date the erratum was published.
description
[Required ; Repeatable ; String]
+A description of the error and remedy measures.
notes
[Optional ; Repeatable]
+This element is open and reserved for explanatory notes deemed useful to the users of the data. Notes should provide additional information that might help users replicate the series, access the data and related resources, or improve discoverability in general.
+
"notes": [
+{
+ "note": "string"
+ }
+ ]
note
[Required ; Repeatable ; String]
+The note itself.
related_indicators
[Optional ; Repeatable]
+This element is used to reference indicators that are often associated with the indicator being documented.
+
"related_indicators": [
+{
+ "code": "string",
+ "label": "string",
+ "uri": "string"
+ }
+ ]
code
[Optional ; Not repeatable ; String]
+The code for the indicator that is referenced in the document. It will likely be an ID that is used by that indicator.
label
[Optional ; Not repeatable ; String]
+The name or label of the indicator that is associated with the indicator being documented.
uri
[Optional ; Not repeatable ; String]
+A link to the related indicator.
compliance
[Optional ; Repeatable]
+For some indicators, international standards have been established. This is for example the case for indicators like the employment or unemployment rate, for which the International Conference of Labour Statisticians defines the standard concepts and methods. The compliance
element is used to document the compliance of a series with one or multiple national or international standards.
+
"compliance": [
+{
+ "standard": "string",
+ "abbreviation": "string",
+ "custodian": "string",
+ "uri": "string"
+ }
+ ]
standard
[Optional ; Not repeatable ; String]
+The name of the standard that the series complies with. This name will ideally include a label and a version or a date. For example: “International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4, published in 2007”
abbreviation
[Optional ; Not repeatable ; String]
+The acronym of the standard that the series complies with.
custodian
[Optional ; Not repeatable ; String]
+The organization that maintains the standard that is being used for compliance. For example: “United Nations Statistics Division”.
uri
[Optional ; Not repeatable ; String]
+A link to a public website where information on the compliance standard can be obtained. For example: “https://unstats.un.org/unsd/classifications/Family/Detail/27”
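+Combining the ISIC example given above, a complete compliance entry might read:

```json
"compliance": [
  {
    "standard": "International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4, published in 2007",
    "abbreviation": "ISIC Rev.4",
    "custodian": "United Nations Statistics Division",
    "uri": "https://unstats.un.org/unsd/classifications/Family/Detail/27"
  }
]
```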
framework
[Optional ; Repeatable]
+Some national, regional, and international agencies develop monitoring frameworks, with goals, targets, and indicators. Some well-known examples are the Millennium Development Goals and the Sustainable Development Goals which establish international goals for human development, or the World Summit for Children (1990) which set international goals in the areas of child survival, development and protection, supporting sector goals such as women’s health and education, nutrition, child health, water and sanitation, basic education, and children in difficult circumstances. The framework
element is used to link an indicator or series to the framework, goal, and target associated with it.
+
"framework": [
+{
+ "name": "string",
+ "abbreviation": "string",
+ "custodian": "string",
+ "description": "string",
+ "goal_id": "string",
+ "goal_name": "string",
+ "goal_description": "string",
+ "target_id": "string",
+ "target_name": "string",
+ "target_description": "string",
+ "indicator_id": "string",
+ "indicator_name": "string",
+ "indicator_description": "string",
+ "uri": "string",
+ "notes": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
+The name of the framework.
abbreviation
[Optional ; Not repeatable ; String]
+The abbreviation of the name of the framework.
custodian
[Optional ; Not repeatable ; String]
+The name of the organization that is the official custodian of the framework.
description
[Optional ; Not repeatable ; String]
+A brief description of the framework.
goal_id
[Optional ; Not repeatable ; String]
+The identifier of the Goal that the indicator or series is associated with.
goal_name
[Optional ; Not repeatable ; String]
+The name (label) of the Goal that the indicator or series is associated with.
+
goal_description
[Optional ; Not repeatable ; String]
+A brief description of the Goal that the indicator or series is associated with.
target_id
[Optional ; Not repeatable ; String]
+The identifier of the Target that the indicator or series is associated with.
target_name
[Optional ; Not repeatable ; String]
+The name (label) of the Target that the indicator or series is associated with.
target_description
[Optional ; Not repeatable ; String]
+A brief description of the Target that the indicator or series is associated with.
indicator_id
[Optional ; Not repeatable ; String]
+The identifier of the indicator, as provided in the framework (this is not the idno
identifier).
indicator_name
[Optional ; Not repeatable ; String]
+The name of the indicator, as provided in the framework (which may be different from the name provided in name
)
indicator_description
[Optional ; Not repeatable ; String]
+A brief description of the indicator, as provided in the framework.
uri
[Optional ; Not repeatable ; String]
+A link to a website providing detailed information on the framework, its goals, targets, and indicators.
notes
[Optional ; Not repeatable ; String]
+Any additional information on the relationship between the indicator/series and the framework.
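+For a poverty headcount indicator, a framework entry linking the series to the Sustainable Development Goals might look like this (the goal, target, and indicator labels are paraphrased from the UN SDG framework; the exact wording should be verified against the official SDG metadata):

```json
"framework": [
  {
    "name": "Sustainable Development Goals",
    "abbreviation": "SDG",
    "custodian": "United Nations",
    "goal_id": "1",
    "goal_name": "End poverty in all its forms everywhere",
    "target_id": "1.1",
    "target_description": "By 2030, eradicate extreme poverty for all people everywhere",
    "indicator_id": "1.1.1",
    "indicator_name": "Proportion of the population living below the international poverty line",
    "uri": "https://sdgs.un.org/goals"
  }
]
```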
series_groups
[Optional ; Repeatable]
+The group(s) the indicator belongs to. Groups can be created to organize indicators/series by theme, producer, or other criteria.
+
"series_groups": [
+ {
+ "name": "string",
+ "description": "string",
+ "version": "string",
+ "uri": "string"
+ }
+]
name
[Optional ; Not repeatable ; String]
+The name of the group.
description
[Optional ; Not repeatable ; String]
+A brief description of the group.
version
[Optional ; Not repeatable ; String]
+The version of the grouping.
uri
[Optional ; Not repeatable ; String]
+A link to a public website where information on the grouping can be obtained.
contacts
[Optional ; Repeatable]
+The contacts
element provides the public interface for questions associated with the production of the indicator or time series.
"contacts": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "email": "string",
+ "telephone": "string",
+ "uri": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String]
+The name of the contact person or organization.
role
[Optional ; Not repeatable ; String]
+The specific role of the person or organization mentioned in name
. This will be used when multiple contacts are listed, and is intended to help users direct their questions and requests to the right contact person.
affiliation
[Optional ; Not repeatable ; String]
+The affiliation of the person or organization mentioned in name
.
email
[Optional ; Not repeatable ; String]
+The email address of the person or organization mentioned in name
. Avoid using personal email accounts; the use of an anonymous email is recommended (e.g., “helpdesk@….org”)
telephone
[Optional ; Not repeatable ; String]
+The phone number of the person or organization mentioned in name
.
uri
[Optional ; Not repeatable ; String]
+A link to a website where additional information on the contact can be found.
provenance
[Optional ; Repeatable]
+Metadata can be programmatically harvested from external catalogs. The provenance
group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+
"provenance": [
+{
+ "origin_description": {
+ "harvest_date": "string",
+ "altered": true,
+ "base_url": "string",
+ "identifier": "string",
+ "date_stamp": "string",
+ "metadata_namespace": "string"
+ }
+ }
+ ]
origin_description
[Required ; Not repeatable]
+The origin_description
elements are used to describe when and from where metadata have been extracted or harvested.
harvest_date
[Required ; Not repeatable ; String]
+The date the metadata were harvested, entered in ISO 8601 format.
altered
[Optional ; Not repeatable ; Boolean]
+Indicates whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the entry (element idno
in the Document Description / Title Statement section) will be modified when published in a new catalog.
base_url
[Required ; Not repeatable ; String]
+The URL from where the metadata were harvested.
identifier
[Optional ; Not repeatable ; String]
+The unique identifier of the entry (idno
element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier
element in provenance
is used to maintain traceability.
date_stamp
[Optional ; Not repeatable ; String]
+The date stamp (in UTC date format) of the metadata record in the originating repository.
metadata_namespace
[Optional ; Not repeatable ; String]
+The namespace of the metadata standard used to document the entry in the source catalog. This element is reserved for future use.
We use a series from the World Bank’s World Development Indicators (WDI 2021) as an example: the series “Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)”.
+Note that we only show how metadata are generated and published in a NADA catalog. We do not demonstrate the (optional) publishing of the underlying data in a MongoDB database, which makes the data accessible via API and allows activation of data visualizations in the NADA catalog. This is covered in the NADA documentation.
+The discoverability of indicators by keyword-based search engines can be significantly improved by including a list of relevant keywords in the metadata. These keywords can be synonyms, or terms and concepts that are closely associated with the indicator. Identifying the most relevant keywords requires subject matter expertise, but the process can be considerably facilitated by AI tools. We provide below an example of a query submitted to ChatGPT. The terms returned by the application MUST be reviewed by a subject matter expert, but having a proposed list (which can be copy-pasted then edited in a Metadata Editor or in a script) makes the process very efficient.
+The returned list is as follows: Poverty, Headcount ratio, Income, Consumption, Living standards, Basic needs, Poverty line, Purchasing power parity (PPP), International poverty line, Economic development, Social inequality, Human development, Poverty reduction, Extreme poverty, Global poverty, Developing countries, Wealth distribution, Rural poverty, Urban poverty, Household income, Inclusive growth, Multidimensional poverty, Income inequality, Poverty gap, Human capital, Poverty trap, Food security, Employment, Vulnerability, Social protection, Poverty measurement, Poverty alleviation, Social exclusion, Targeted interventions, Poverty incidence, Poverty dynamics, Poverty cycle, Equity, Income distribution, Sustainable development.
```r
# The code below generates metadata at the database level (object "wdi_database")
# and for a time series (object "this_series").
# It then publishes the metadata in a NADA catalog using the R package NADAR.
# It also publishes related materials as "external resources".

library(nadar)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_indicators/")

thumb = "poverty.JPG"        # Image to be used as thumbnail in the data catalog
db_id = "WB_WDI_2021_09_15"  # The WDI database identifier

# Document the indicator (Poverty headcount ratio at $1.90 a day)

this_series = list(

  metadata_creation = list(
    producers = list(
      list(name = "Development Data Group",
           abbr = "DECDG",
           affiliation = "World Bank",
           role = "Metadata curation")
    ),
    prod_date = "2021-10-15",
    version = "Example v 1.0"
  ),

  series_description = list(

    idno = "SI.POV.DDAY",

    name = "Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)",

    database_id = db_id,  # To attach the database metadata to the series metadata

    measurement_unit = "% of population",

    periodicity = "Annual",

    definition_short = "Poverty headcount ratio at $1.90 a day is the percentage of the population living on less than $1.90 a day at 2011 international prices. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions.",

    definition_references = list(
      list(source = "World Bank, Development Data Group",
           uri = "https://databank.worldbank.org/metadataglossary/millennium-development-goals/series/SI.POV.DDAY")
    ),

    methodology = "International comparisons of poverty estimates entail both conceptual and practical problems. Countries have different definitions of poverty, and consistent comparisons across countries can be difficult. Local poverty lines tend to have higher purchasing power in rich countries, where more generous standards are used, than in poor countries. Since World Development Report 1990, the World Bank has aimed to apply a common standard in measuring extreme poverty, anchored to what poverty means in the world's poorest countries. The welfare of people living in different countries can be measured on a common scale by adjusting for differences in the purchasing power of currencies. The commonly used $1 a day standard, measured in 1985 international prices and adjusted to local currency using purchasing power parities (PPPs), was chosen for World Development Report 1990 because it was typical of the poverty lines in low-income countries at the time. As differences in the cost of living across the world evolve, the international poverty line has to be periodically updated using new PPP price data to reflect these changes. The last change was in October 2015, when we adopted $1.90 as the international poverty line using the 2011 PPP. Prior to that, the 2008 update set the international poverty line at $1.25 using the 2005 PPP. Poverty measures based on international poverty lines attempt to hold the real value of the poverty line constant across countries, as is done when making comparisons over time. The $3.20 poverty line is derived from typical national poverty lines in countries classified as Lower Middle Income. The $5.50 poverty line is derived from typical national poverty lines in countries classified as Upper Middle Income. Early editions of World Development Indicators used PPPs from the Penn World Tables to convert values in local currency to equivalent purchasing power measured in U.S dollars. Later editions used 1993, 2005, and 2011 consumption PPP estimates produced by the World Bank. The current extreme poverty line is set at $1.90 a day in 2011 PPP terms, which represents the mean of the poverty lines found in 15 of the poorest countries ranked by per capita consumption. The new poverty line maintains the same standard for extreme poverty - the poverty line typical of the poorest countries in the world - but updates it using the latest information on the cost of living in developing countries. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions. The statistics reported here are based on consumption data or, when unavailable, on income surveys. Analysis of some 20 countries for which income and consumption expenditure data were both available from the same surveys found income to yield a higher mean than consumption but also higher inequality. When poverty measures based on consumption and income were compared, the two effects roughly cancelled each other out: there was no significant statistical difference.",

    limitation = "Despite progress in the last decade, the challenges of measuring poverty remain. The timeliness, frequency, quality, and comparability of household surveys need to increase substantially, particularly in the poorest countries. The availability and quality of poverty monitoring data remains low in small states, countries with fragile situations, and low-income countries and even some middle-income countries. The low frequency and lack of comparability of the data available in some countries create uncertainty over the magnitude of poverty reduction. Besides the frequency and timeliness of survey data, other data quality issues arise in measuring household living standards. The surveys ask detailed questions on sources of income and how it was spent, which must be carefully recorded by trained personnel. Income is generally more difficult to measure accurately, and consumption comes closer to the notion of living standards. And income can vary over time even if living standards do not. But consumption data are not always available: the latest estimates reported here use consumption data for about two-thirds of countries. However, even similar surveys may not be strictly comparable because of differences in timing or in the quality and training of enumerators. Comparisons of countries at different levels of development also pose a potential problem because of differences in the relative importance of the consumption of nonmarket goods. The local market value of all consumption in kind (including own production, particularly important in underdeveloped rural economies) should be included in total consumption expenditure but may not be. Most survey data now include valuations for consumption or income from own production, but valuation methods vary.",

    topics = list(
      list(id = "1",
           name = "Economics, Consumption and consumer behaviour",
           vocabulary = "",
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      list(id = "2",
           name = "Economics, Economic conditions and indicators",
           vocabulary = "CESSDA Version 4.1",
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      list(id = "3",
           name = "Economics, Economic systems and development",
           vocabulary = "CESSDA Version 4.1",
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      list(id = "4",
           name = "Social stratification and groupings, Equality, inequality and social exclusion",
           vocabulary = "CESSDA Version 4.1",
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
    ),

    relevance = "The World Bank Group is committed to reducing extreme poverty to 3 percent or less, globally, by 2030. Monitoring poverty is important on the global development agenda as well as on the national development agenda of many countries. The World Bank produced its first global poverty estimates for developing countries for World Development Report 1990: Poverty (World Bank 1990) using household survey data for 22 countries (Ravallion, Datt, and van de Walle 1991). Since then there has been considerable expansion in the number of countries that field household income and expenditure surveys. The World Bank's Development Research Group maintains a database that is updated annually as new survey data become available (and thus may contain more recent data or revisions) and conducts a major reassessment of progress against poverty every year. PovcalNet is an interactive computational tool that allows users to replicate these internationally comparable $1.90, $3.20 and $5.50 a day global, regional and country-level poverty estimates and to compute poverty measures for custom country groupings and for different poverty lines. The Poverty and Equity Data portal provides access to the database and user-friendly dashboards with graphs and interactive maps that visualize trends in key poverty and inequality indicators for different regions and countries. The country dashboards display trends in poverty measures based on the national poverty lines alongside the internationally comparable estimates, produced from and consistent with PovcalNet.",

    time_periods = list(list(start = "1960", end = "2020")),

    geographic_units = list(
      list(name = "Afghanistan", code = "AFG", type = "country/economy"),
      list(name = "Africa Eastern and Southern", code = "AFE", type = "geographic region"),
      list(name = "Africa Western and Central", code = "AFW", type = "geographic region"),
      list(name = "Albania", code = "ALB", type = "country/economy"),
      list(name = "Algeria", code = "DZA", type = "country/economy"),
      list(name = "Angola", code = "AGO", type = "country/economy"),
      list(name = "Aruba", code = "ABW", type = "country/economy")
      # ... and many more - In a real situation, this would be programmatically extracted from the data
    ),

    license = list(name = "CC BY-4.0", uri = "https://creativecommons.org/licenses/by/4.0/"),

    api_documentation = list(
      description = "See the Developer Information webpage for detailed documentation of the API",
      uri = "https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information"
    ),

    source = "World Bank, Development Data Group (DECDG) and Poverty and Inequality Global Practice. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments. Data for high-income economies are from the Luxembourg Income Study database. For more information and methodology, see PovcalNet website: http://iresearch.worldbank.org/PovcalNet/home.aspx",

    keywords = list(
      list(name = "poverty rate"),
      list(name = "poverty incidence"),
      list(name = "global poverty line"),
      list(name = "international poverty line"),
      list(name = "welfare"),
      list(name = "prosperity"),
      list(name = "inequality"),
      list(name = "income")
    ),

    acronyms = list(
      list(acronym = "PPP", expansion = "Purchasing Power Parity")
    ),

    related_indicators = list(
      list(code = "SI.POV.GAPS",
           label = "Poverty gap at $1.90 a day (2011 PPP) (%)",
           uri = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.GAPS"),
      list(code = "SI.POV.NAHC",
           label = "Poverty headcount ratio at national poverty lines (% of population)",
           uri = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.NAHC")
    ),

    framework = list(
      list(name = "Sustainable Development Goals (SDGs)",
           description = "The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed and developing - in a global partnership.",
           goal_id = "SDG Goal 1",
           goal_name = "End poverty in all its forms everywhere",
           target_id = "SDG Target 1.1",
           target_name = "By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day",
           indicator_id = "SDG Indicator 1.1.1",
           indicator_name = "Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)",
           uri = "https://sdgs.un.org/goals")
    )

  )

)

# Publish the metadata in NADA, with a link to the WDI website

# Database-level metadata
timeseries_database_add(idno = db_id,
                        published = 1,
                        overwrite = "yes",
                        metadata = wdi_database)

# Indicator-level metadata
timeseries_add(
  idno = this_series$series_description$idno,
  repositoryid = "central",
  published = 1,
  overwrite = "yes",
  metadata = this_series,
  thumbnail = thumb
)

# Add a link to the WDI website as an external resource
external_resources_add(
  title = "World Development Indicators website",
  idno = this_series$series_description$idno,
  dctype = "web",
  file_path = "https://datatopics.worldbank.org/world-development-indicators/",
  overwrite = "yes"
)
```
After uploading the above metadata, and activating some visualization widgets, the result in NADA will be as follows (not all metadata displayed here; see https://nada-demo.ihsn.org/index.php/catalog/study/SI.POV.DDAY for the full view):
A statistical table (cross tabulation or contingency table) is a summary presentation of data. The OECD Glossary of Statistical Terms defines it as “observation data gained by a purposeful aggregation of statistical microdata conforming to statistical methodology [organized in] groups or aggregates, such as counts, means, or frequencies.”
Tables are produced as an array of rows and columns that display numeric aggregates in a clearly labeled fashion. They may have a complex structure and become quite elaborate. They are typically found in publications such as statistical yearbooks, census and survey reports, and research papers, or published online.

Statistical tables can be understood by a broad audience. In some cases, they may be the only publicly available output of a data collection activity. Even when other output is available (such as microdata, dashboards, or databases accessible via user interfaces or APIs), statistical tables are an important component of data dissemination. It is thus important to make tables as discoverable as possible. The schema described in this chapter was designed to structure and foster the comprehensiveness of information on tables by rendering the pertinent metadata into a structured, machine-readable format. Its purpose is to improve data discoverability; the schema is not intended to store the information needed to programmatically re-create tables.

The schema description is available at http://dev.ihsn.org/nada/api-documentation/catalog-admin/index.html#tag/Tables

The figure below, adapted from LabWrite Resources, provides an illustration of what statistical tables typically look like. The main parts of a table are highlighted. They provide a content structure for the metadata schema we describe in this chapter.

- **Table number and title:** Every table must have a title, and should have a number. Tables in yearbooks, reports, and papers are usually numbered in the order in which they are referred to in the document. They can be numbered sequentially (Table 1, Table 2, and so on), by chapter (Table 1.1, Table 1.2, Table 2.1, …), or according to another reference system. The table number typically precedes the table title. The title provides a description of the contents of the table. It should be concise and include the key elements shown in the table.
- **Column spanner, column heads, and stub head:** The column headings (and sub-headings) identify what data are listed in the table in a vertical arrangement. The column heading placed above the leftmost column is often referred to as the stub head, and that column is the stub column. A heading that sits above two or more columns to indicate a certain grouping is referred to as a column spanner.
- **Stubs:** The horizontal headings and sub-headings of the rows are called row captions. Together, they form the stub.
- **Table body:** The actual data (values) in a table (containing for example percentages, means, or counts of certain variables) form the table body.
- **Table spanner:** A table spanner is located in the body of the table to divide the data without changing the columns. Spanners extend across the entire width of the table.
- **Table notes:** Table notes are used to provide information that is not self-explanatory (e.g., to provide the expanded form of acronyms used in row or column captions).
- **Table source:** The source identifies the dataset(s) or database(s) that contain the data used to generate the table. This can for example be a survey or a census dataset.
The table schema contains six blocks of elements. The first block of three elements (`repositoryid`, `published`, and `overwrite`) does not describe the table, but is used by the NADA cataloguing application to determine where and how the table metadata is published in the catalog. The second block, `metadata_information`, contains "metadata on the metadata" and is used mainly for archiving purposes. The third block, `table_description`, contains the elements used to describe the table and its production process. A fourth block, `provenance`, is used to document the origin of metadata that may be harvested from other catalogs. The block `tags` is used to add information (in the form of words or short phrases) that will be useful to create facets in a catalog user interface. Last, an empty block `additional` is provided as a container for additional metadata elements that users may want to create.
```json
{
  "repositoryid": "string",
  "published": 0,
  "overwrite": "no",
  "metadata_information": {},
  "table_description": {},
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": {}
}
```
The following elements are used by the NADA application API (see the NADA documentation for more information):

- `repositoryid`: A NADA catalog can be composed of multiple collections. The `repositoryid` element identifies the collection in which the table will be published. This collection must have been previously created in the catalog. By default, the table will be published in the `central` catalog (i.e., in no particular collection).
- `published`: The NADA catalog allows tables to be published (in which case they will be visible to users of the catalog) or unpublished (in which case they will only be visible to administrators). The default value is 0 (unpublished). Code 1 is used to set the status to "published".
- `overwrite`: This element defines what action will be taken when a command is issued to add the table to a catalog and a table with the same identifier (element `idno`) is already in the catalog. By default, the command will not overwrite the existing table (the default value of `overwrite` is "no"). Set this parameter to "yes" to allow the existing table to be overwritten in the catalog.
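For instance, a table intended to be immediately visible in the main catalog, replacing any table already stored under the same identifier, would set these three elements as follows (a fragment only; the values follow the descriptions above):

```json
{
  "repositoryid": "central",
  "published": 1,
  "overwrite": "yes"
}
```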
**`metadata_information`** *[Optional, Not Repeatable]*

The `metadata_information` block is used to document the table metadata (not the table itself). It provides information on the process of generating the table metadata. This block is optional. The information it contains is useful to catalog administrators, not to the public. It is however recommended to enter at least the identification of the metadata producer, her/his affiliation, and the date the metadata were created. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so the metadata produced by one organization can be found in other data centers (complying with standards and schemas is precisely intended to facilitate the interoperability of catalogs and automated information sharing).
```json
"metadata_information": {
  "idno": "string",
  "title": "string",
  "producers": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "role": "string"
    }
  ],
  "production_date": "string",
  "version": "string"
}
```
- `idno` *[Optional, Not Repeatable, String]*: A unique identifier for the metadata document (the metadata document is the JSON file containing the table metadata). This is different from the table unique identifier (see section `title_statement` below), although the same identifier can be used, and it is good practice to generate identifiers that maintain an easy connection between the metadata `idno` and the table `idno`. For example, if the unique identifier of the table is "TBL_0001", the `idno` in the `metadata_information` could be "META_TBL_0001".
- `title` *[Optional, Not Repeatable, String]*: The title of the metadata document (not necessarily the title of the table).
- `producers` *[Optional, Repeatable]*: This refers to the producer(s) of the table metadata, not to the producer(s) of the table. This could for example be the data curator in a data center. Four elements can be used to provide information on the metadata producer(s):
  - `name` *[Optional, Not Repeatable, String]*: The name of the metadata producer.
  - `abbr` *[Optional, Not Repeatable, String]*: The abbreviation of `name`.
  - `affiliation` *[Optional, Not Repeatable, String]*: The affiliation of `name`.
  - `role` *[Optional, Not Repeatable, String]*: The specific role of `name` (applicable when more than one person was involved in the production of the metadata).
- `production_date` *[Optional, Not Repeatable, String]*: The date the metadata (not the table) was produced. The date will preferably be entered in ISO 8601 format (YYYY-MM-DD).
- `version` *[Optional, Not Repeatable, String]*: The version of the metadata (not the version of the table).
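The convention suggested above for linking the metadata `idno` to the table `idno` can easily be scripted. A minimal sketch in R (the prefix convention and identifier values are illustrative, not a schema requirement):

```r
# Derive the metadata document id from the table id by adding a "META_" prefix
# (an illustrative convention, not a schema requirement)
tbl_idno  <- "TBL_0001"
meta_idno <- paste0("META_", tbl_idno)  # "META_TBL_0001"
```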
```r
my_table = list(
  # ... ,
  metadata_information = list(
    idno = "META_TBL_POP_PC2001_02-01",
    producers = list(
      list(name = "John Doe",
           affiliation = "National Data Center of Popstan")
    ),
    production_date = "2020-12-27",
    version = "version 1.0"
  ),
  # ...
)
```
**`table_description`** *[Required, Not Repeatable]*

This section contains the metadata elements that describe the table itself. Not all elements will be required to fully document a table, but efforts should be made to provide as much and as detailed information as possible, as richer metadata will make the table more discoverable.
```json
"table_description": {
  "title_statement": {},
  "identifiers": [],
  "authoring_entity": [],
  "contributors": [],
  "publisher": [],
  "date_created": "string",
  "date_published": "string",
  "date_modified": "string",
  "version": "string",
  "description": "string",
  "table_columns": [],
  "table_rows": [],
  "table_footnotes": [],
  "table_series": [],
  "statistics": [],
  "unit_observation": [],
  "data_sources": [],
  "time_periods": [],
  "universe": [],
  "ref_country": [],
  "geographic_units": [],
  "geographic_granularity": "string",
  "bbox": [],
  "languages": [],
  "links": [],
  "api_documentation": [],
  "publications": [],
  "keywords": [],
  "themes": [],
  "topics": [],
  "disciplines": [],
  "definitions": [],
  "classifications": [],
  "rights": "string",
  "license": [],
  "citation": "string",
  "confidentiality": "string",
  "sdc": "string",
  "contacts": [],
  "notes": [],
  "relations": []
}
```
**`title_statement`** *[Required, Not Repeatable]*

```json
"title_statement": {
  "idno": "string",
  "table_number": "string",
  "title": "string",
  "sub_title": "string",
  "alternate_title": "string",
  "translated_title": "string"
}
```
- `idno` *[Required, Not Repeatable, String]*: A unique identifier for the table. Do not include spaces in the `idno`. This identifier must be unique in the catalog in which the table will be published. Some organizations have their own system to assign unique identifiers to tables. Ideally, an identifier that guarantees global uniqueness will be used, such as a Digital Object Identifier (DOI) or an ISBN. Note that a table may have more than one identifier. In such a case, the element `idno` (as a non-repeatable element) will contain the main identifier (selected as the "reference" one by the catalog administrator). The other identifiers will be provided in the element `identifiers` (see below).
- `table_number` *[Optional, Not Repeatable, String]*: The table number. The table number will usually begin with the word "Table" followed by a numeric identifier, such as Table 1 or Table 2.1. Different publications may use different ways to reference a table. This is particularly the case for publications that are part of a standard survey program and have well-defined table templates. The following are different ways to number a table:
| Type | Description |
|---|---|
| Sequential | This is a sequential number given to each table produced and appearing within the publication (e.g., Table 1, Table 2, to Table n). |
| Thematic | Provides a numbering scheme based on the theme and a sequential number. |
| Chapter | The tables can be numbered according to the chapter and then a sequential reference within that chapter, such as Table 1.1 or Table 3.5. |
| Annex | Tables in an annex will usually be given a letter referring to the annex and a sequential number, such as Table A.1 or Table B.3. |
| Note | A table number is usually set apart from the title with a colon. The word "Table" should never be abbreviated. |
- `title` *[Required, Not Repeatable, String]*: The title of the table. The title provides a brief description of the content of the table. It should be concise and include the key elements shown in the table. There are varying styles for writing a table title. A consistent style should be applied to all tables published in a catalog.
- `sub_title` *[Optional, Not Repeatable, String]*: A subtitle can provide further descriptive or explanatory content to the table.
- `alternate_title` *[Optional, Not Repeatable, String]*: An alternate title for the table.
- `translated_title` *[Optional, Not Repeatable, String]*: A translation of the title.
```r
my_table = list(
  # ...
  table_description = list(
    title_statement = list(
      idno = "EXAMPLE_TBL_001",
      table_number = "Table 1.0",
      title = "Resident population by age group, sex, and area of residence, 2020",
      sub_title = "District of X, as of June 30",
      translated_title = "Population résidente par groupe d'âge, sexe et zone de résidence, 2020 (district X, au 30 juin)"
    ),
    # ...
  )
)
```
**`identifiers`** *[Optional ; Repeatable]*

This element is used to enter table identifiers other than the catalog identifier entered in the `title_statement` (`idno`). It can for example be a Digital Object Identifier (DOI). The identifier entered in the `title_statement` can be repeated here (the `title_statement` does not provide a `type` parameter; if a DOI or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`).

```json
"identifiers": [
  {
    "type": "string",
    "identifier": "string"
  }
]
```
- `type` *[Optional, Not Repeatable, String]*: The type of identifier (for example, "DOI").
- `identifier` *[Required, Not Repeatable, String]*: The identifier itself.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    identifiers = list(
      type = "DOI",
      identifier = "XXX.XXX.XXXX"
    ),
    # ...
  )
)
```
**`authoring_entity`** *[Optional, Not Repeatable]*

```json
"authoring_entity": [
  {
    "name": "string",
    "affiliation": "string",
    "abbreviation": "string",
    "uri": "string",
    "author_id": [
      {
        "type": null,
        "id": null
      }
    ]
  }
]
```
- `name` *[Optional, Not Repeatable, String]*: The name of the authoring entity (person or organization).
- `affiliation` *[Optional, Not Repeatable, String]*: The affiliation of `name`.
- `abbreviation` *[Optional, Not Repeatable, String]*: The abbreviation of `name`.
- `uri` *[Optional, Not Repeatable, String]*: A URI (e.g., a website) for `name`.
- `author_id` *[Optional ; Repeatable]*: A unique identifier of the author, composed of:
  - `type` *[Optional ; Not repeatable ; String]*: The type of identifier (for example, "ORCID").
  - `id` *[Optional ; Not repeatable ; String]*: The identifier of the type declared in `type`.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    authoring_entity = list(
      name = "John Doe",
      affiliation = "National Research Center, Popstan",
      abbreviation = "NRC",
      uri = "www. ...",
      author_id = list(
        list(type = "ORCID", id = "XYZ123")
      )
    ),
    # ...
  )
)
```
**`contributors`** *[Optional, Repeatable]*

```json
"contributors": [
  {
    "name": "string",
    "affiliation": "string",
    "abbreviation": "string",
    "role": "string",
    "uri": "string"
  }
]
```
- `name` *[Optional, Not Repeatable, String]*: The name of the contributor (person or organization).
- `affiliation` *[Optional, Not Repeatable, String]*: The affiliation of `name`. This could be a government agency, a university or a department in a university, etc.
- `abbreviation` *[Optional, Not Repeatable, String]*: The abbreviation of `name`.
- `role` *[Optional, Not Repeatable, String]*: The specific role of `name`. This could for example be "Research assistant", "Technical specialist", "Programmer", or "Reviewer".
- `uri` *[Optional, Not Repeatable, String]*: A URI (e.g., a website) for `name`.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    contributors = list(
      name = "John Doe",
      affiliation = "National Research Center",
      abbreviation = "NRC",
      role = "Research assistant; Stata programming",
      uri = "www. ..."
    ),
    # ...
  )
)
```
**`publisher`** *[Optional, Not repeatable]*

```json
"publisher": [
  {
    "name": "string",
    "affiliation": "string",
    "abbreviation": "string",
    "role": "string",
    "uri": "string"
  }
]
```
- `name` *[Optional, Not Repeatable, String]*: The name of the publisher.
- `affiliation` *[Optional, Not Repeatable, String]*: The affiliation of the publisher.
- `abbreviation` *[Optional, Not Repeatable, String]*: The abbreviation of the publisher's name.
- `role` *[Optional, Not Repeatable, String]*: The specific role of the publisher.
- `uri` *[Optional, Not Repeatable, String]*: A URI (e.g., a website) of the publisher.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    publisher = list(
      name = "National Statistics Office, Publishing Department",
      affiliation = "Ministry of Planning, National Statistics Office",
      abbreviation = "NSO",
      uri = "www. ..."
    ),
    # ...
  )
)
```
**`date_created`** *[Optional, Not Repeatable, String]*

The date the table was created. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the table was created refers to the date that the output was produced and considered ready for publication.

**`date_published`** *[Optional, Not Repeatable, String]*

The date the table was published. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). If the table is contained in a document (report, paper, book, etc.), the date the table was published is associated with the publication date of that document. If the table is found in a statistics yearbook for example, then the publication date will be the date the yearbook was published.

**`date_modified`** *[Optional, Not Repeatable, String]*

The date the table was last modified. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). Modifications, revisions, or re-publications of the table are recorded in this element.

**`version`** *[Optional, Not Repeatable, String]*

The version of the table refers to the published version of the table. If for some reason data in a published table are revised, then the version of the table is captured in this element.

**`description`** *[Optional, Not Repeatable, String]*

A brief "narrative" description of the table. The description can contain information on the content, purpose, production process, or other relevant information.
```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    date_created = "2020-06-15",
    date_published = "2020-10-30",
    version = "Version 1.0",
    description = "The table is part of a series of tables extracted from the Population Census 2020 dataset. It presents counts of resident population by type of disability, sex, and age group, by province and at the national level. The data were collected in compliance with questions from the Washington Group.",
    # ...
  )
)
```
**`table_columns`** *[Optional, Repeatable]*

The columns description is composed of the column spanner and the column heads. Column spanners group the column heads together in a logical fashion to present the data to the user. Not all columns presented in a table will have a column spanner. Column spanners can become quite complicated; when a table is documented, the information found in the column spanner and heads can be merged and edited. What matters is not to document the exact structure of the table, but to ensure that the text of the spanners and heads is included in the metadata, as this text will be used by search engines to find tables in data catalogs.
```json
"table_columns": [
  {
    "label": "string",
    "var_name": "string",
    "dataset": "string"
  }
]
```
- `label` *[Required, Not Repeatable, String]*: The label of the column (the text of the column head, possibly combined with the column spanner).
- `var_name` *[Optional, Not Repeatable, String]*: The name of the variable, in the source dataset, that corresponds to the column.
- `dataset` *[Optional, Not Repeatable, String]*: The identification of the dataset that contains the variable.

The column captions of the following table can be documented in the following manner:
```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    table_columns = list(

      list(label = "Area of residence: National (total)",
           var_name = "urbrur", dataset = "pop_census_2020_v01"),

      list(label = "Area of residence: Urban",
           var_name = "urbrur", dataset = "pop_census_2020_v01"),

      list(label = "Area of residence: Rural",
           var_name = "urbrur", dataset = "pop_census_2020_v01"),

      list(label = "Sex: total",
           var_name = "sex", dataset = "pop_census_2020_v01"),

      list(label = "Sex: male",
           var_name = "sex", dataset = "pop_census_2020_v01"),

      list(label = "Sex: female",
           var_name = "sex", dataset = "pop_census_2020_v01")

    ),
    # ...
  )
)
```
Or, in a more concise but also valid version:
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    table_columns = list(

      list(label = "Area of residence: national (total) / urban / rural",
           var_name = "urbrur", dataset = "pop_census_2020_v01"),

      list(label = "Sex: total / male / female",
           var_name = "sex", dataset = "pop_census_2020_v01")

    ), # ...
  )
)
table_rows
[Optional, Repeatable]
The table_rows section is composed of the stub head and stubs (row captions). The stubs are the captions of the rows of data, and the stub head is the label that groups the rows together in a logical fashion. As for table_columns, the information found in the stubs can be merged and edited to be optimized for clarity and discoverability.
+"table_rows": [
+{
+ "label": "string",
+ "var_name": "string",
+ "dataset": "string"
+ }
+ ]
label
[Required, Not Repeatable, String]
The label is designed to include the stub head, stubs, and any captions included.
var_name
[Optional, Not Repeatable, String]
The name of the variable, in the source dataset, that corresponds to the row.
dataset
[Optional, Not Repeatable, String]
The identifier of the dataset that contains that variable. A data_sources element (see below) is available to describe in more detail the sources of data. The content of the dataset element must be compatible with the information provided in that other element.

Example using the same table as for table_columns:
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    table_rows = list(

      list(label = "Age group; 0-4 years",
           var_name = "age", dataset = "pop_census_2020_v01"),

      list(label = "Age group; 5-9 years",
           var_name = "age", dataset = "pop_census_2020_v01"),

      list(label = "Age group; 10-14 years",
           var_name = "age", dataset = "pop_census_2020_v01"),

      list(label = "Age group; 15-19 years",
           var_name = "age", dataset = "pop_census_2020_v01")

    ), # ...
  )
)
The same information can be provided in a more concise version as follows:
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    table_rows = list(
      list(label = "Age group; 0-4 years, 5-9 years, 10-14 years, 15-19 years",
           var_name = "age",
           dataset = "pop_census_2020_v01")
    ), # ...
  )
)
table_footnotes
[Optional, Repeatable]

"table_footnotes": [
  {
    "number": "string",
    "text": "string"
  }
]

number
[Optional, Not Repeatable, String]
The number of the footnote.
text
[Required, Not Repeatable, String]
The text of the footnote.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    table_footnotes = list(

      list(number = "1",
           text = "Data refer to the resident population only."),

      list(number = "2",
           text = "Figures for the district of X have been imputed.")

    ),
    # ...
  )
)
table_series
[Optional, Repeatable]

"table_series": [
  {
    "name": "string",
    "maintainer": "string",
    "uri": "string",
    "description": "string"
  }
]

name
[Optional, Not Repeatable, String]
The name (label) of the series of tables.
maintainer
[Optional, Not Repeatable, String]
The organization or agency that maintains the series.
uri
[Optional, Not Repeatable, String]
A link (URL) to the series.
description
[Optional, Not Repeatable, String]
A brief description of the series.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    table_series = list(

      list(name = "Population Census - Age distribution",
           description = "Series 1 - Tables on demographic composition of the population")

    ),
    # ...
  )
)
statistics
[Optional, Repeatable]
The statistics element refers to the type of statistics included in the table. Some tables may only contain counts, such as a table of population by age group and sex (which shows counts of persons; other tables could show counts of households, facilities, or any other observation unit). But statistical tables can contain many other types of summary statistics. This element is used to list these types of statistics.

"statistics": [
  {
    "value": "string"
  }
]
value
[Required, Not Repeatable, String]
The use of a controlled vocabulary is recommended. This list could contain (but does not have to be limited to):
- Count (frequencies)
- Number of missing values
- Mean (average)
- Median
- Mode
- Minimum value
- Maximum value
- Range
- Standard deviation
- Variance
- Confidence interval (95%) - Lower limit
- Confidence interval (95%) - Upper limit
- Standard error
- Sum
- Inter-quartile Range (IQR)
- Percentile (possibly with specification, e.g., "10th percentile")
- Mean Absolute Deviation
- Mean Absolute Deviation from the Median (MADM)
- Coefficient of Variation (COV)
- Coefficient of Dispersion (COD)
- Skewness
- Kurtosis
- Entropy
- Regression coefficient
- R-squared
- Adjusted R-squared
- Z-score
- Accuracy
- Precision
- Mean squared logarithmic error (MSLE)

Example in R for a table showing the distribution of the population by age group and sex, and the mean age by sex:
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    statistics = list(
      list(value = "count"),
      list(value = "mean")
    ), # ...
  )
)
unit_observation
[Optional, Repeatable] "unit_observation": [
+{
+ "value": "string"
+ }
+ ]
value
[Required, Not repeatable, String]
+The value
is not a numeric value; it is the label (description) of the observation unit, e.g, “individual” or “person”, “household”, “dwelling”, enterprise, “country”, etc. = list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+ unit_observation = list(
+ list(value = "individual")
+
+ ), # ...
+
+ ) )
data_sources
[Optional, Repeatable]
The name, source_id, and uri elements are optional, but at least one of them must be provided.

"data_sources": [
  {
    "name": "string",
    "abbreviation": "string",
    "source_id": "string",
    "note": "string",
    "uri": "string"
  }
]

name
[Optional, Not repeatable, String]
The name (title) of the data source.
abbreviation
[Optional, Not repeatable, String]
The abbreviation (acronym) of the data source.
source_id
[Optional, Not repeatable, String]
A unique identifier of the data source, such as its identifier in a data catalog.
note
[Optional, Not repeatable, String]
A note on the data source, for example a description of how it was used.
uri
[Optional, Not repeatable, String]
A link (URL) to the data source.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    data_sources = list(
      list(name = "Population and Housing Census 2020",
           abbreviation = "PHC 2020",
           source_id = "ABC_PHC_2020_PUF")
    ), # ...
  )
)
time_periods
[Optional, Repeatable]
The time period(s) covered by the data in the table, each defined by its from and to fields.

"time_periods": [
  {
    "from": "string",
    "to": "string"
  }
]

from
[Required, Not repeatable, String]
The start of the time period, entered as a year or as a date in ISO format.
to
[Required, Not repeatable, String]
The end of the time period, entered as a year or as a date in ISO format. For a single year or date, enter the same value in from and to.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    time_periods = list(
      list(from = "1990", to = "1990"),
      list(from = "2000", to = "2004"),
      list(from = "2014", to = "2019-06")
    ),
    # ...
  )
)
universe
[Optional, Repeatable] "universe": [
+{
+ "value": "string"
+ }
+ ]
value
[Required, Not repeatable, String] = list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+ universe = list(
+ list(value = "Resident male population aged 0 to 6 years; this excludes visitors and people present in the country under a diplomatic status.
+ Nomadic and homeless populations are included.")
+
+ ), # ...
+
+ ) )
ref_country
[Optional, Repeatable]
The country or countries covered by the data in the table. Even when the table covers only part of a country, the ref_country field should still be filled. Another element called geographic_units is provided (see below) to capture more detailed information on the table's geographic coverage.
+"ref_country": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
name
[Required, Not repeatable, String]
The name of a country for which data are in the table.
code
[Required, Not repeatable, String]
The code of the country mentioned in name, preferably an ISO 3166 country code.
geographic_units
[Optional, Repeatable]
An itemized list of geographic areas covered by the data in the table, other than the country/countries that must be entered in ref_country.

"geographic_units": [
  {
    "name": "string",
    "code": "string",
    "type": "string"
  }
]
name
[Required, Not repeatable, String]
The name of the geographic unit.
code
[Optional, Not repeatable, String]
The code of the geographic unit mentioned in name.
type
[Optional, Not repeatable, String]
The type of geographic unit mentioned in name (e.g., "State", "Province", "Town", "Region", etc.)

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    ref_country = list(
      list(name = "Malawi", code = "MWI")
    ),
    geographic_units = list(
      list(name = "Northern", type = "region"),
      list(name = "Central", type = "region"),
      list(name = "Southern", type = "region"),
      list(name = "Lilongwe", type = "town"),
      list(name = "Mzuzu", type = "town"),
      list(name = "Blantyre", type = "town")
    ),
    # ...
  )
)
geographic_granularity
[Optional, Not repeatable, String]
A description of the geographic levels for which data are presented in the table. This is not a list of specific geographic areas, but a list of the administrative level(s) that correspond to these geographic areas.

Example for a table showing the population of a country by State, district, and sub-district (+ total):

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    ref_country = list(
      list(name = "India", code = "IND")
    ),
    geographic_granularity = "national, state (admin 1), district (admin 2), sub-district (admin 3)",
    # ...
  )
)
bbox
[Optional ; Repeatable]
Bounding boxes are typically used for geographic datasets to indicate the geographic coverage of the data, but can be provided for tables as well, although this will rarely be done. A geographic bounding box defines a rectangular geographic area.

"bbox": [
  {
    "west": "string",
    "east": "string",
    "south": "string",
    "north": "string"
  }
]

west
[Required ; Not repeatable ; String]
Western geographic parameter of the bounding box.
east
[Required ; Not repeatable ; String]
Eastern geographic parameter of the bounding box.
south
[Required ; Not repeatable ; String]
Southern geographic parameter of the bounding box.
north
[Required ; Not repeatable ; String]
Northern geographic parameter of the bounding box.
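Following the pattern of the other examples in this chapter, a bounding box could be entered as follows (the coordinates below are purely illustrative, not taken from an actual table):

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    bbox = list(
      # Illustrative coordinates, entered as strings (decimal degrees)
      list(west = "32.67", east = "35.92",
           south = "-17.13", north = "-9.37")
    ),
    # ...
  )
)
```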
languages
[Optional, Repeatable]
Most tables will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed.

"languages": [
  {
    "name": "string",
    "code": "string"
  }
]

name
[Required, Not repeatable, String]
The name of the language.
code
[Optional, Not repeatable, String]
The code of the language, preferably an ISO 639 code.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    languages = list(
      list(name = "English", code = "EN"),
      list(name = "French", code = "FR")
    ),
    # ...
  )
)
links
[Optional, Repeatable] "links": [
+{
+ "uri": "string",
+ "description": "string"
+ }
+ ]
uri
[Required, Not repeatable, String] description
[Optional, Not repeatable, String] Example for a table extracted from the Gambia Demographic and Health Survey 2019/2020 Report, the links could be the following:
+= list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+
+ links = list(
+
+ list(uri = "https://dhsprogram.com/pubs/pdf/FR369/FR369.pdf",
+ description = "The Gambia, Demographic and Health Survey 2019/2020 Report"),
+
+ list(uri = "https://dhsprogram.com/data/available-datasets.cfm",
+ description = "DHS microdata for The Gambia")
+
+
+ ),
+ # ...
+
+ ) )
api_documentation
[Optional ; Repeatable] "api_documentation": [
+{
+ "description": "string",
+ "uri": "string"
+ }
+ ]
description
[Optional ; Not repeatable ; String]
+This element will not contain the API documentation itself, but information on what documentation is available.
uri
[Optional ; Not repeatable ; String]
+The URL of the API documentation.
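A sketch following the same pattern as the other repeatable elements; the description and URL shown here are hypothetical:

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    api_documentation = list(
      # Hypothetical entry pointing to the documentation of a data API
      list(description = "API documentation and sample queries for the table",
           uri = "https://example.org/api-documentation")
    ),
    # ...
  )
)
```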
publications
[Optional, Repeatable]
This element identifies the publication(s) in which the table is published. This could for example be a Statistics Yearbook, a report, a paper, etc.

"publications": [
  {
    "title": "string",
    "uri": "string"
  }
]

title
[Required, Not repeatable, String]
The title of the publication.
uri
[Optional, Not repeatable, String]
A link (URL) to the publication.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    publications = list(
      list(title = "United Nations Statistical Yearbook, Fifty-second issue, May 2023",
           uri = "https://www.un-ilibrary.org/content/books/9789210557566")
    ),
    # ...
  )
)
keywords
[Optional ; Repeatable] "keywords": [
+{
+ "name": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
A list of keywords that provide information on the core content of the table. Keywords provide a convenient solution to improve the discoverability of the table, as it allows terms and phrases not found in the table itself to be indexed and to make a table discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the UNESCO Thesaurus. The list provided here can combine keywords from multiple controlled vocabularies, and user-defined keywords.
+name
[Required ; Not repeatable ; String] vocabulary
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String] = list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+
+ keywords = list(
+ list(name = "Migration", vocabulary = "Unesco Thesaurus (June 2021)",
+ uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+ list(name = "Migrants", vocabulary = "Unesco Thesaurus (June 2021)",
+ uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+ list(name = "Refugee", vocabulary = "Unesco Thesaurus (June 2021)",
+ uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+ list(name = "Forced displacement"),
+ list(name = "Forcibly displaced")
+
+ ),
+ # ...
+
+ ),# ...
+ )
themes
[Optional ; Repeatable] "themes": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
A list of themes covered by the table. A controlled vocabulary will preferably be used. Note that themes
will rarely be used as the elements topics
and disciplines
are more appropriate for most uses. This is a block of five fields:
+- id
[Optional ; Not repeatable ; String]
+The ID of the theme, taken from a controlled vocabulary.
+- name
[Required ; Not repeatable ; String]
+The name (label) of the theme, preferably taken from a controlled vocabulary.
+- parent_id
[Optional ; Not repeatable ; String]
+The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
+- vocabulary
[Optional ; Not repeatable ; String]
+The name (including version number) of the controlled vocabulary used, if any.
+- uri
[Optional ; Not repeatable ; String]
+The URL to the controlled vocabulary used, if any.
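In the rare cases where themes is used, it can be filled following the same pattern as topics; the theme and vocabulary shown here are illustrative, not from an actual classification:

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    themes = list(
      # Illustrative theme; id and vocabulary depend on the classification used
      list(id = "2",
           name = "Population and demography",
           vocabulary = "(name and version of the controlled vocabulary, if any)")
    ),
    # ...
  )
)
```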
topics
[Optional ; Repeatable] "topics": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
Information on the topics covered in the table. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification, a typology of topics available in 11 languages; or the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. Note that you may use more than one controlled vocabulary.
+This element is a block of five fields:
+id
[Optional ; Not repeatable ; String] name
[Required ; Not repeatable ; String] parent_id
[Optional ; Not repeatable ; String] vocabulary
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String] <- list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+
+ topics = list(
+ list(name = "Demography.Migration",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(name = "Demography.Censuses",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "F22",
+ name = "International Migration",
+ parent_id = "F2 - International Factor Movements and International Business",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+ list(id = "O15",
+ name = "Human Resources - Human Development - Income Distribution - Migration",
+ parent_id = "O1 - Economic Development",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
+
+ ),
+ # ...
+
+
+ ),
+ )
disciplines
[Optional ; Repeatable] "disciplines": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
Information on the academic disciplines related to the content of the table. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia. +This is a block of five elements:
+id
[Optional ; Not repeatable ; String] name
[Optional ; Not repeatable ; String] parent_id
[Optional ; Not repeatable ; String] vocabulary
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String] <- list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+
+ disciplines = list(
+
+ list(name = "Economics",
+ vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
+ uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+
+ list(name = "Agricultural economics",
+ vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
+ uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+
+ list(name = "Econometrics",
+ vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
+ uri = "https://en.wikipedia.org/wiki/List_of_academic_fields")
+
+
+ ),
+ # ...
+
+ ),# ...
+ )
definitions
[Optional, Repeatable] "definitions": [
+{
+ "name": "string",
+ "definition": "string",
+ "uri": "string"
+ }
+ ]
name
[Required, Not repeatable, String] definition
[Required, Not repeatable, String] uri
[Optional, Not repeatable, String] Example for a table on malnutrition that would include estimates of stunting and wasting prevalence:
+= list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+ definitions = list(
+
+ list(name = "stunting",
+ definition = "Prevalence of stunting is the percentage of children under age 5 whose height for age is more than two standard deviations below the median for the international reference population ages 0-59 months. For children up to two years old height is measured by recumbent length. For older children height is measured by stature while standing. The data are based on the WHO's new child growth standards released in 2006.",
+ uri = "https://data.worldbank.org/indicator/SH.STA.STNT.ZS?locations=1W"),
+
+ list(name = "wasting",
+ definition = "Prevalence of wasting, male,is the proportion of boys under age 5 whose weight for height is more than two standard deviations below the median for the international reference population ages 0-59.",
+ uri = "https://data.worldbank.org/indicator/SH.STA.WAST.MA.ZS?locations=1W")
+
+
+ ), # ...
+
+ ) )
classifications
[Optional, Repeatable] "classifications": [
+{
+ "name": "string",
+ "version": "string",
+ "organization": "string",
+ "uri": "string"
+ }
+ ]
name
[Required, Not repeatable, String] version
[Optional, Not repeatable, String] organization
[Optional, Not repeatable, String] uri
[Optional, Not repeatable, String] = list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+
+ classifications = list(
+
+ list(name = "International Standard Classification of Occupations (ISCO)",
+ version = "ISCO-08",
+ organization = "International Labour Organization (ILO)",
+ uri = "https://www.ilo.org/public/english/bureau/stat/isco/")
+
+
+ ), # ...
+
+
+ )
+ )
rights
[Optional, Not repeatable, String]
Information on the rights or copyright that applies to the table.
license
[Optional, Repeatable]
A table may require a license to use or reproduce. This is done to protect the intellectual content of the research product. The licensing entity may be different from the researcher or the publisher; it is the entity that holds the intellectual rights to the table(s) and grants rights or restrictions on its reuse.

"license": [
  {
    "name": "string",
    "uri": "string"
  }
]

name
[Required, Not repeatable, String]
The name of the license.
uri
[Optional, Not repeatable, String]
The URL of the license, where its terms can be found.

my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    license = list(
      list(name = "Attribution 4.0 International (CC BY 4.0)",
           uri = "https://creativecommons.org/licenses/by/4.0/")
    ),
    # ...
  )
)
citation
[Optional, Not repeatable, String]
A citation requirement for the table (i.e., an indication of how the table should be cited in publications).

confidentiality
[Optional, Not repeatable, String]
A published table may be protected by a confidentiality agreement between the publisher and the researcher, which may determine certain rights regarding the use of the research and of the data presented in the table. The table may also present confidential information produced for selective audiences. This element is used to provide a statement on any limitations or restrictions on the use of the table based on confidential data or agreements.

sdc
[Optional, Not repeatable, String]
Information on statistical disclosure control measures applied to the table. This can include cell suppression or other techniques. Specialized packages have been developed for this purpose, like sdcTable: Methods for Statistical Disclosure Control in Tabular Data (https://cran.r-project.org/web/packages/sdcTable/sdcTable.pdf).
The information provided here should be such that it does not provide intruders with useful information for reverse-engineering the protection measures applied to the table.
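The rights, citation, confidentiality, and sdc elements are simple strings; a sketch with purely illustrative values:

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    # All four values below are illustrative
    rights          = "(c) 2020, National Statistics Office",
    citation        = "National Statistics Office, Population Census 2020, Table 1.2",
    confidentiality = "The table only contains aggregated data and no confidential information.",
    sdc             = "Cells with fewer than 5 observations have been suppressed.",
    # ...
  )
)
```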
+
contacts
[Optional, Repeatable]
Users of the data may need further clarification and information. This section may include the name, affiliation, email, and URI of one or multiple contact persons who can serve as resource persons for problems or questions raised by the user community. The uri element should be used to indicate a URN or URL for the homepage of the contact; the email element provides an email address. It is recommended to avoid naming individuals: the information provided here should remain valid in the long term, so it is preferable to identify contact persons by a title. The same applies to the email field; ideally, a "generic" email address should be provided, as a mail server can easily be configured to automatically forward all messages sent to the generic address to the relevant staff members.
+
"contacts": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "email": "string",
+ "telephone": "string",
+ "uri": "string"
+ }
+ ]
name
[Required, Not repeatable, String] role
[Optional, Not repeatable, String] name
, in regards to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact.affiliation
[Optional, Not repeatable, String] email
[Optional, Not repeatable, String] telephone
[Optional, Not repeatable, String] uri
[Optional, Not repeatable, String] name
can be found.= list(
+ my_table # ... ,
+ table_description = list(
+ # ... ,
+
+ contacts = list(
+
+ list(name = "Data helpdesk",
+ role = "Support to data users",
+ affiliation = "National Statistics Office",
+ email = "data_helpdesk@ ...")
+
+
+ )
+ ) )
notes
[Optional, Repeatable] "notes": [
+{
+ "note": "string"
+ }
+ ]
note
[Required, Not repeatable, String]
+The note itself.
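A note would be entered as follows (the text of the note is illustrative):

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    notes = list(
      # Illustrative note
      list(note = "The table was produced before the final census results were released.")
    ),
    # ...
  )
)
```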
relations
[Optional ; Repeatable]
If the table has a relation to other resources (e.g., it is a subset of another resource, or a translation of another resource), the relation(s) and associated resources can be listed in this element.

"relations": [
  {
    "name": "string",
    "type": "isPartOf"
  }
]

name
[Optional ; Not repeatable ; String]
The related resource. Recommended practice is to identify the related resource by means of a URI. If this is not possible or feasible, a string conforming to a formal identification system may be provided.
type
[Optional ; Not repeatable ; String]
The type of relationship. The use of a controlled vocabulary is recommended. The Dublin Core proposes the following vocabulary: isPartOf, hasPart, isVersionOf, isFormatOf, hasFormat, references, isReferencedBy, isBasedOn, isBasisFor, replaces, isReplacedBy, requires, isRequiredBy.

| Type           | Description |
|----------------|-------------|
| isPartOf       | The described resource is a physical or logical part of the referenced resource. |
| hasPart        | The described resource includes the referenced resource either physically or logically. |
| isVersionOf    | The described resource is a version, edition, or adaptation of the referenced resource. A change in version implies substantive changes in content rather than differences in format. |
| isFormatOf     | The described resource is the same intellectual content as the referenced resource, but presented in another format. |
| hasFormat      | The described resource pre-existed the referenced resource, which is essentially the same intellectual content presented in another format. |
| references     | The described resource references, cites, or otherwise points to the referenced resource. |
| isReferencedBy | The described resource is referenced, cited, or otherwise pointed to by the referenced resource. |
| isBasedOn      | The described resource is derived from the referenced resource. |
| isBasisFor     | The described resource is the basis for the referenced resource. |
| replaces       | The described resource supplants, displaces, or supersedes the referenced resource. |
| isReplacedBy   | The described resource is supplanted, displaced, or superseded by the referenced resource. |
| requires       | The described resource requires the referenced resource to support its function, delivery, or coherence of content. |
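A relation could for example be documented as follows (the identifier of the related resource shown here is hypothetical):

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    relations = list(
      # Hypothetical related resource, identified by a URI
      list(name = "https://example.org/catalog/table/POP_2020_T01",
           type = "isVersionOf")
    ),
    # ...
  )
)
```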
provenance
[Optional ; Repeatable]

"provenance": [
  {
    "origin_description": {
      "harvest_date": "string",
      "altered": true,
      "base_url": "string",
      "identifier": "string",
      "date_stamp": "string",
      "metadata_namespace": "string"
    }
  }
]

Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.

origin_description
[Required ; Not repeatable]
The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
harvest_date
[Required ; Not repeatable ; String]
The date and time the metadata were harvested.
altered
[Optional ; Not repeatable ; Boolean]
Indicates whether the harvested metadata have been altered before being re-published. Some elements (such as the unique identifier idno in the Table Description / Title Statement section) will be modified when published in a new catalog.
base_url
[Required ; Not repeatable ; String]
The URL of the source catalog from which the metadata were harvested.
identifier
[Optional ; Not repeatable ; String]
The identifier of the table (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
date_stamp
[Optional ; Not repeatable ; String]
The date stamp of the metadata record in the source catalog.
metadata_namespace
[Optional ; Not repeatable ; String]
The namespace of the harvested metadata.

additional
[Optional ; Not repeatable]
The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.
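Harvested provenance metadata could look as follows. This is a sketch with illustrative values, assuming that provenance is a root-level element (a sibling of table_description, not nested within it):

```r
my_table = list(
  # ... ,
  table_description = list(
    # ...
  ),
  provenance = list(
    list(origin_description = list(
      harvest_date       = "2021-06-15T10:00:00Z",   # illustrative
      altered            = TRUE,
      base_url           = "https://example.org/source-catalog",
      identifier         = "TBL_POP_2020_T01",       # idno in the source catalog
      date_stamp         = "2021-05-01"
    ))
  )
)
```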
We provide here examples of the documentation of actual tables, and of their publishing in a NADA catalog. We use the R package NADAR and the Python library PyNada to publish metadata in the catalog. The examples only demonstrate the production and publishing of table metadata. We do not show how the data themselves can also be published in a NADA database (MongoDB), to be made available via API. The use of the data API is covered in the NADA documentation.

This first example is a table presenting the evolution since 1960 of the number of households by size and of the average household size in the United States, published by the US Census Bureau. This table, published in MS-Excel format, was downloaded on 20 February 2021 from https://www.census.gov/data/tables/time-series/demo/families/households.html.
Using R

library(nadar)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL

my_keys <- read.csv("C:/confidential/my_API_keys.csv", header = F, stringsAsFactors = F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_tables/")

id    = "TBL_EXAMPLE_01"
thumb = "household_pic.JPG"   # To be used as thumbnail in the data catalog
+# Document the table
+
+<- list(
+ my_table_hh4
+ metadata_information = list(
+ idno = "META_TBL_EXAMPLE-01",
+ producers = list(
+ list(name = "Olivier Dupriez",affiliation = "World Bank")
+
+ ),production_date = "2021-02-20"
+
+ ),
+ table_description = list(
+
+ title_statement = list(
+ idno = id,
+ table_number = "Table HH-4",
+ title = "Households by Size: 1960 to Present",
+ sub_title = "(Numbers in thousands, except for averages)"
+
+ ),
+ authoring_entity = list(
+ list(name = "United States Census Bureau",
+ affiliation = " U.S. Department of Commerce",
+ abbreviation = "US BUCEN",
+ uri = "https://www.census.gov/en.html"
+
+ )
+ ),
+ date_created = "2020",
+
+ date_published = "2020-12",
+
+ table_columns = list(
+ list(label = "Year"),
+ list(label = "All households (number)"),
+ list(label = "Number of people: One"),
+ list(label = "Number of people: Two"),
+ list(label = "Number of people: Three"),
+ list(label = "Number of people: Four"),
+ list(label = "Number of people: Five"),
+ list(label = "Number of people: Six"),
+ list(label = "Number of people: Seven or more"),
+ list(label = "Average number of people per household")
+
+ ),
+ table_rows = list(
+ list(label = "Year (values from 1960 to 2020)")
+
+ ),
+ table_footnotes = list(
+
+ list(number = "1",
+ text = "This table uses the householder's person weight to describe characteristics of people living in households. As a result, estimates of the number of households do not match estimates of housing units from the Housing Vacancy Survey (HVS). The HVS is weighted to housing units, rather than the population, in order to more accurately estimate the number of occupied and vacant housing units. If you are primarily interested in housing inventory estimates, then see the published tables and reports here: http://www.census.gov/housing/hvs/. If you are primarily interested in characteristics about the population and people who live in households, then see the H table series and reports here: https://www.census.gov/topics/families/families-and-households.html."),
+
+ list(number = "2",
+ text = "Details may not sum to total due to rounding."),
+
+ list(number = "3",
+ text = "1993 figures revised based on population from the most recent decennial census."),
+
+ list(number = "4",
+ text = "The 2014 CPS ASEC included redesigned questions for income and health insurance coverage. All of the approximately 98,000 addresses were selected to receive the improved set of health insurance coverage items. The improved income questions were implemented using a split panel design. Approximately 68,000 addresses were selected to receive a set of income questions similar to those used in the 2013 CPS ASEC. The remaining 30,000 addresses were selected to receive the redesigned income questions. The source of data for this table is the CPS ASEC sample of 98,000 addresses.")
+
+
+ ),
+ table_series = list(
+ list(name = "Historical Households Tables",
+ maintainer = "United States Census Bureau",
+ uri = "https://www.census.gov/data/tables/time-series/demo/families/households.html",
+ description = "Tables on households generated from the Current Population Survey")
+
+ ),
+ statistics = list(
+ list(value = "Count"),
+ list(value = "Average")
+
+ ),
+ unit_observation = list(
+ list(value = "Household")
+
+ ),
+ data_sources = list(
+ list(source = "U.S. Census Bureau, Current Population Survey, March and Annual Social and Economic Supplements")
+
+ ),
+ time_periods = list(
+ list(from = "1960", to = "2020")
+
+ ),
+ universe = list(
+ list(value = "US resident population")
+
+ ),
+ ref_country = list(
+ list(name = "United States", code = "USA")
+
+ ),
+ geographic_granularity = "Country",
+
+ languages = list(
+ list(name = "English", code = "EN")
+
+ ),
+ links = list(
+ list(uri = "https://www2.census.gov/programs-surveys/demo/tables/families/time-series/households/hh4.xls",
+         description = "Table in MS-Excel format"),
+ list(uri = "https://www.census.gov/programs-surveys/cps/technical-documentation/complete.html",
+ description = "Technical documentation with information about ASEC, including the source and accuracy statement")
+
+ ),
+ topics = list(
+ list(
+ id = "1",
+ name = "Demography - Censuses",
+ parent_id = "Demography",
+ vocabulary = "CESSDA Controlled Vocabulary for CESSDA Topic Classification v. 3.0 (2019-05-20)",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification?v=3.0"
+
+ )
+ ),
+ contacts = list(
+ list(name = "Fertility and Family Statistics Branch",
+ affiliation = "US Census Bureau",
+ telephone = "+1 - 301-763-2416",
+ uri = "ask.census.gov")
+
+ )
+
+ )
+
+ )
+ # Publish the table in a NADA catalog
+
+table_add(idno = id,
+          metadata = my_table_hh4,
+          repositoryid = "central",
+          published = 1,
+          thumbnail = thumb,
+          overwrite = "yes")
+
+# Provide a link to the table series page (US Bucen website)
+
+external_resources_add(
+  title = "Historical Households Tables (US Bucen web page)",
+  idno = id,
+  dctype = "web",
+  file_path = "https://www.census.gov/data/tables/time-series/demo/families/households.html",
+  overwrite = "yes"
+)
+The result in NADA will be as follows (only part of metadata displayed):
+
+
Using Python
+The same result can be achieved in Python; the script will be as follows:
+# Python script
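The full Python script is not reproduced here, but the general pattern can be sketched with the standard library. In the sketch below, the `tables/{idno}` route and the `X-API-KEY` header are assumptions to be verified against your NADA installation's API documentation; the payload construction mirrors the arguments of the `table_add()` call in the R script above.

```python
import json
import urllib.request

# Minimal sketch, not a tested client. The endpoint route and API-key header
# are assumptions; check your NADA installation's API documentation.
API_URL = "https://.../index.php/api"  # same catalog URL as in the R script

def build_payload(metadata, repositoryid="central", published=1, overwrite="yes"):
    # Merge the publishing options with the schema-compliant table metadata,
    # mirroring the arguments of nadar::table_add() used above
    return {"repositoryid": repositoryid,
            "published": published,
            "overwrite": overwrite,
            **metadata}

def table_add(idno, metadata, api_key):
    # POST the metadata to the catalog (hypothetical route)
    req = urllib.request.Request(
        f"{API_URL}/tables/{idno}",
        data=json.dumps(build_payload(metadata)).encode("utf-8"),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST")
    return urllib.request.urlopen(req)
```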
For this second example, we use a regional table from the World Bank: “World Development Indicators - Country profiles”. The table is available on-line in Excel and in PDF formats, for many geographic areas: world, geographic regions, country groups (income level, etc), and country. A separate table is available for each of these areas. Metadata common to all table files is available in a separate Excel file.
+
+
+
+
+
As the same metadata applies to all tables, we generate the metadata once, and use a function to publish the geography-specific tables in one loop. In our example, we only generate the tables for the following geographies: world, World Bank regions, and countries of South Asia. This will result in the documentation and publishing of 15 tables. By providing the list of all countries to the loop, we would publish 200+ tables using this script.
+We include definitions in the metadata. These definitions are extracted from the World Development Indicators API.
+In the script, we assume that we only want to publish the metadata in the catalog, and provide a link to the originating World Bank website. In other words, we do not make the XLSX or PDF directly accessible from the NADA catalog (which would be easy to implement).
+Using R
+# --------------------------------------------------------------------------
+# Load libraries and establish the catalog administrator credentials
+# --------------------------------------------------------------------------
+
+library(nadar)
+library(jsonlite)
+library(httr)
+library(rlist)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_tables/")
+
+thumb_file <- "WB_country_profiles_WLD.jpg"
+
+src_data <- "World Bank, World Development Indicators database - WDI Central, 2021"
+# The tables contain data extracted from WDI time series. We identified these
+# series ID and we list them here in their order of appearance in the table.
+
+tbl_wdi_indicators = list(
+  "SP.POP.TOTL", "SP.POP.GROW", "AG.SRF.TOTL.K2", "EN.POP.DNST",
+ "SI.POV.NAHC", "SI.POV.DDAY", "NY.GNP.ATLS.CD", "NY.GNP.PCAP.CD",
+ "NY.GNP.MKTP.PP.CD", "NY.GNP.PCAP.PP.CD", "SI.DST.FRST.20",
+ "SP.DYN.LE00.IN", "SP.DYN.TFRT.IN", "SP.ADO.TFRT", "SP.DYN.CONU.ZS",
+ "SH.STA.BRTC.ZS", "SH.DYN.MORT", "SH.STA.MALN.ZS", "SH.IMM.MEAS",
+ "SE.PRM.CMPT.ZS", "SE.PRM.ENRR", "SE.SEC.ENRR", "SE.ENR.PRSC.FM.ZS",
+ "SH.DYN.AIDS.ZS", "AG.LND.FRST.K2", "ER.PTD.TOTL.ZS",
+ "ER.H2O.FWTL.ZS", "SP.URB.GROW", "EG.USE.PCAP.KG.OE",
+ "EN.ATM.CO2E.PC", "EG.USE.ELEC.KH.PC", "NY.GDP.MKTP.CD",
+ "NY.GDP.MKTP.KD.ZG", "NY.GDP.DEFL.KD.ZG", "NV.AGR.TOTL.ZS",
+ "NV.IND.TOTL.ZS", "NE.EXP.GNFS.ZS", "NE.IMP.GNFS.ZS",
+ "NE.GDI.TOTL.ZS", "GC.REV.XGRT.GD.ZS", "GC.NLD.TOTL.GD.ZS",
+ "FS.AST.DOMS.GD.ZS", "GC.TAX.TOTL.GD.ZS", "MS.MIL.XPND.GD.ZS",
+ "IT.CEL.SETS.P2", "IT.NET.USER.ZS", "TX.VAL.TECH.MF.ZS",
+ "IQ.SCI.OVRL", "TG.VAL.TOTL.GD.ZS", "TT.PRI.MRCH.XD.WD",
+ "DT.DOD.DECT.CD", "DT.TDS.DECT.EX.ZS", "SM.POP.NETM",
+ "BX.TRF.PWKR.CD.DT", "BX.KLT.DINV.CD.WD", "DT.ODA.ODAT.CD"
+
+ )
+rows = list()
+defs = list()
+# We then use the WDI API to retrieve information on the series (name, label,
+# definition) to be included in the published metadata.
+
+for(s in tbl_wdi_indicators) {
+
+  url <- paste0("https://api.worldbank.org/v2/sources/2/series/", s,
+                "/metadata?format=JSON")
+  s_meta <- GET(url)
+  if(http_error(s_meta)) {
+    stop("The request failed")
+  } else {
+    s_metadata <- fromJSON(content(s_meta, as = "text"))
+    s_metadata <- s_metadata$source$concept[[1]][[2]][[1]][[2]][[1]]
+  }
+
+  indic_lbl = s_metadata$value[s_metadata$id=="IndicatorName"]
+  indic_def = s_metadata$value[s_metadata$id=="Longdefinition"]
+
+  this_row = list(var_name = s, dataset = src_data, label = indic_lbl)
+  rows = list.append(rows, this_row)
+
+  this_def = list(name = indic_lbl, definition = indic_def)
+  defs = list.append(defs, this_def)
+
+}
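The same retrieval can be written in Python with only the standard library. In this sketch, the nested path into the response (source, then concept, variable, and metatype) mirrors the nested indexing used in the R loop; treat that path as an assumption to verify against an actual API response.

```python
import json
import urllib.request

def series_metadata_url(series_id):
    # Same WDI endpoint as in the R script above
    return (f"https://api.worldbank.org/v2/sources/2/series/{series_id}"
            "/metadata?format=JSON")

def parse_label_and_definition(metatypes):
    # `metatypes` is the list of {"id": ..., "value": ...} entries found
    # under source -> concept -> variable -> metatype in the response
    values = {m["id"]: m["value"] for m in metatypes}
    return values.get("IndicatorName"), values.get("Longdefinition")

def fetch_series_metadata(series_id):
    # Performs the HTTP request; the nested path below is an assumption
    with urllib.request.urlopen(series_metadata_url(series_id)) as r:
        doc = json.loads(r.read().decode("utf-8"))
    metatypes = doc["source"][0]["concept"][0]["variable"][0]["metatype"]
    return parse_label_and_definition(metatypes)
```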
+# --------------------------------------------------------------------------
+# We create a function that takes two parameters: the country (or region)
+# name, and the country (or region) code. This function will generate the
+# table metadata and publish the selected table in the NADA catalog.
+# --------------------------------------------------------------------------
+
+publish_country_profile <- function(country_name, country_code) {
+
+  # Generate the country/region-specific unique table ID and table title
+
+  idno_meta <- paste0("UC013_", country_code)
+  idno_tbl  <- paste0("UC013_", country_code)
+  tbl_title <- paste0("World Development Indicators, Country Profile, ",
+                      country_name, " - 2021")
+  citation  <- paste("World Bank,", tbl_title,
+                     ", https://datacatalog.worldbank.org/dataset/country-profiles, accessed on [date]")
+
+ # Generate the schema-compliant metadata
+
+  my_tbl <- list(
+ metadata_information = list(
+ producers = list(list(name = "NADA team")),
+ production_date = "2021-09-14",
+ version = "v01"
+
+ ),
+ table_description = list(
+
+ title_statement = list(
+ idno = idno_tbl,
+ title = tbl_title
+
+ ),
+ authoring_entity = list(
+ list(name = "World Bank, Development Data Group",
+ abbreviation = "WB",
+ uri = "https://data.worldbank.org/")
+
+ ),
+ date_created = "2021-07-03",
+ date_published = "2021-07",
+
+ description = "Country profiles present the latest key development data drawn from the World Development Indicators (WDI) database. They follow the format of The Little Data Book, the WDI's quick reference publication.",
+
+ table_columns = list(
+ list(label = "Year 1990"),
+ list(label = "Year 2000"),
+ list(label = "Year 2010"),
+ list(label = "Year 2018")
+
+ ),
+ table_rows = rows,
+
+ table_series = list(
+ list(name = "World Development Indicators, Country Profiles",
+ maintainer = "World Bank, Development Data Group (DECDG)")
+
+ ),
+ data_sources = list(
+ list(source = src_data)
+
+ ),
+ time_periods = list(
+ list(from = "1990", to = "1990"),
+ list(from = "2000", to = "2000"),
+ list(from = "2010", to = "2010"),
+ list(from = "2018", to = "2018")
+
+ ),
+ ref_country = list(
+ list(name = country_name, code = country_code)
+
+ ),
+ geographic_granularity = area,
+
+ languages = list(
+ list(name = "English", code = "EN")
+
+ ),
+ links = list(
+ list(uri = "https://datacatalog.worldbank.org/dataset/country-profiles",
+ description = "Country Profiles in World Bank Data Catalog website"),
+      list(uri = "http://wdi.worldbank.org/tables",
+           description = "Country Profiles in World Bank World Development Indicators website"),
+      list(uri = "https://datatopics.worldbank.org/world-development-indicators/",
+           description = "World Development Indicators website")
+
+ ),
+ keywords = list(
+ list(name = "World View"),
+ list(name = "People"),
+ list(name = "Environment"),
+ list(name = "Economy"),
+ list(name = "States and markets"),
+ list(name = "Global links")
+
+ ),
+ topics = list(
+ list(id = "1", name = "Demography",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "2", name = "Economics",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "3", name = "Education",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "4", name = "Health",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "5", name = "Labour And Employment",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "6", name = "Natural Environment",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "7", name = "Social Welfare Policy and Systems",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "8", name = "Trade Industry and Markets",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "9", name = "Economic development")
+
+ ),
+ definitions = defs,
+
+ license = list(
+ list(name = "Creative Commons - Attribution 4.0 International - CC BY 4.0",
+ uri = "https://creativecommons.org/licenses/by/4.0/")
+
+ ),
+ citation = citation,
+
+ contacts = list(
+ list(name = "World Bank, Development Data Group, Help Desk",
+ telephone = "+1 (202) 473-7824 or +1 (800) 590-1906",
+ email = "data@worldbank.org",
+ uri = "https://datahelpdesk.worldbank.org/")
+
+ )
+
+ )
+
+ )
+ # Publish the table in the NADA catalog
+
+ table_add(idno = my_tbl$table_description$title_statement$idno,
+ metadata = my_tbl,
+ repositoryid = "central",
+ published = 1,
+ overwrite = "yes",
+ thumbnail = thumb_file)
+
+ # Add a link to the WDI website as an external resource
+
+ external_resources_add(
+ title = "World Development Indicators - Regional tables",
+ idno = idno_tbl,
+ dctype = "web",
+ file_path = "http://wdi.worldbank.org/table",
+ overwrite = "yes"
+
+ )
+
+ }
+# --------------------------------------------------------------------------
+# We run the function in a loop to publish the selected tables
+# --------------------------------------------------------------------------
+
+# List of countries/regions
+
+geo_list <- list(
+  list(name = "World", code = "WLD", area = "World"),
+ list(name = "East Asia and Pacific", code = "EAP", area = "Region"),
+ list(name = "Europe and Central Asia", code = "ECA", area = "Region"),
+ list(name = "Latin America and Caribbean", code = "LAC", area = "Region"),
+ list(name = "Middle East and North Africa", code = "MNA", area = "Region"),
+ list(name = "South Asia", code = "SAR", area = "Region"),
+ list(name = "Sub-Saharan Africa", code = "AFR", area = "Region"),
+ list(name = "Afghanistan", code = "AFG", area = "Country"),
+ list(name = "Bangladesh", code = "BGD", area = "Country"),
+ list(name = "Bhutan", code = "BHU", area = "Country"),
+ list(name = "India", code = "IND", area = "Country"),
+ list(name = "Maldives", code = "MDV", area = "Country"),
+ list(name = "Nepal", code = "NPL", area = "Country"),
+ list(name = "Pakistan", code = "PAK", area = "Country"),
+ list(name = "Sri Lanka", code = "LKA", area = "Country"))
+
+# Loop through the list of countries/region to publish the tables
+
+for(i in 1:length(geo_list)) {
+  area <- as.character(geo_list[[i]][3])
+  publish_country_profile(
+    country_name = as.character(geo_list[[i]][1]),
+    country_code = as.character(geo_list[[i]][2]))
+}
Using Python
+# Python script
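A Python version of the publishing loop follows the same shape. Only the identifier and title construction is shown as working code; the body of `publish_country_profile()` would build the same schema-compliant metadata dictionary and end with the same `table_add` and `external_resources_add` calls as the R function above.

```python
# Sketch of the geography loop; only the pure logic is implemented here.

geo_list = [
    {"name": "World", "code": "WLD", "area": "World"},
    {"name": "South Asia", "code": "SAR", "area": "Region"},
    {"name": "India", "code": "IND", "area": "Country"},
    # ... remaining countries/regions as listed in the R script
]

def profile_identifiers(country_name, country_code):
    # Mirror the idno/title logic of the R publish_country_profile() function
    idno = f"UC013_{country_code}"
    title = f"World Development Indicators, Country Profile, {country_name} - 2021"
    return idno, title

for geo in geo_list:
    idno, title = profile_identifiers(geo["name"], geo["code"])
    # my_tbl = {...}           # build the schema-compliant metadata dictionary
    # table_add(idno, my_tbl)  # then publish, as in the first example
```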
The result in NADA
+This example is selected to show how the documentation can take advantage of R or Python to extract information from the table. Here we have the table in MS-Excel format. The table contains a long list of countries, which would be tedious to manually enter. A script reads the Excel file and extracts some of the information which is then added to the table metadata. The table also contains the definitions of the indicators shown in the table.
+Here we assume we want to provide the XLS and PDF tables in addition to a link to the source website. We will identify and upload the resources (XLS and PDF) on our web server.
+The table:
+Using R
+
library(nadar)
+library(readxl)
+library(rlist)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_tables/")
+
+thumb = "SDGs.jpg"
+
+id = "TBL_EXAMPLE-03"
+# ---------------------------------------------------------------------------
+# We read the MS-Excel file and extract the list of countries and definitions
+# ---------------------------------------------------------------------------
+
+# We generate the list of countries
+df <- read_xlsx("WV2_Global_goals_ending_poverty_and_improving_lives.xlsx",
+                range = "A5:A230")
+
+ctry_list <- list()
+for(i in 1:nrow(df)) {
+  c <- list(name = as.character(df[[1]][i]))
+  ctry_list <- list.append(ctry_list, c)
+}
+# We extract the definitions found in the table.
+# Note that we could have instead copy/pasted the definitions.
+# For example, the command line:
+# list(name = as.character(df[1,1]), definition = as.character(df[3,1]))
+# is equivalent to:
+# list(name = "Income share held by lowest 20%",
+# definition = "Percentage share of income or consumption is the share that accrues to subgroups of population indicated by deciles or quintiles. Percentage shares by quintile may not sum to 100 because of rounding.")
+
+df <- read_xlsx("WV2_Global_goals_ending_poverty_and_improving_lives.xlsx",
+                range = "A241:A340", col_names = FALSE)
+
+def_list = list(
+  list(name = as.character(df[1,1]), definition = as.character(df[3,1])),
+ list(name = as.character(df[11,1]), definition = as.character(df[13,1])),
+ list(name = as.character(df[21,1]), definition = as.character(df[23,1])),
+ list(name = as.character(df[31,1]), definition = as.character(df[33,1])),
+ list(name = as.character(df[41,1]), definition = as.character(df[43,1])),
+ list(name = as.character(df[51,1]), definition = as.character(df[53,1])),
+ list(name = as.character(df[61,1]), definition = as.character(df[63,1])),
+ list(name = as.character(df[71,1]), definition = as.character(df[73,1])),
+ list(name = as.character(df[78,1]), definition = as.character(df[80,1])),
+ list(name = as.character(df[85,1]), definition = as.character(df[87,1])),
+ list(name = as.character(df[92,1]), definition = as.character(df[94,1]))
+
+ )
+# We generate the table metadata
+
+my_tbl <- list(
+ metadata_information = list(
+ idno = "META_TBL_EXAMPLE-03",
+ producers = list(
+ list(name = "Olivier Dupriez", affiliation = "World Bank")
+
+ ),production_date = "2021-02-20"
+
+ ),
+ table_description = list(
+
+ title_statement = list(
+ idno = id,
+ table_number = "WV.2",
+ title = "Global Goals: Ending Poverty and Improving Lives"
+
+ ),
+ authoring_entity = list(
+ list(name = "World Bank, Development Data Group",
+ abbreviation = "WB",
+ uri = "https://data.worldbank.org/")
+
+ ),
+ date_created = "2020-12-16",
+ date_published = "2020-12",
+
+ description = "",
+
+ table_columns = list(
+ list(label = "Percentage share of income or consumption - Lowest 20% - 2007-18"),
+ list(label = "Prevalence of child malnutrition - Stunting, height for age - % of children under 5 - 2011-19"),
+ list(label = "Maternal mortality ratio - Modeled estimates - per 100,000 live births - 2017"),
+ list(label = "Under-five mortality rate - Total - per 1,000 live births - 2019"),
+ list(label = "Incidence of HIV, ages 15-49 (per 1,000 uninfected population ages 15-49) - 2019"),
+ list(label = "Incidence of tuberculosis - per 100,000 people - 2019"),
+ list(label = "Mortality caused by road traffic injury - per 100,000 people - 2016"),
+ list(label = "Primary completion rate - Total - % of relevant age group - 2018"),
+ list(label = "Contributing family workers - Male - % of male employment - 2018"),
+ list(label = "Contributing family workers - Female - % of female employment - 2018"),
+ list(label = "Labor productivity - GDP per person employed - % growth - 2015-18")
+
+ ),
+ table_rows = list(
+ list(label = "Country or region")
+
+ ),
+ table_series = list(
+ list(name = "World Development Indicators - World View",
+ description = "World Development Indicators includes data spanning up to 56 years-from 1960 to 2016. World view frames global trends with indicators on population, population density, urbanization, GNI, and GDP. As in previous years, the World view online tables present indicators measuring the world's economy and progress toward improving lives, achieving sustainable development, providing support for vulnerable populations, and reducing gender disparities. Data on poverty and shared prosperity are now in a separate section, while highlights of progress toward the Sustainable Development Goals are now presented in the companion publication, Atlas of Sustainable Development Goals 2017.
+
+ The global highlights in this section draw on the six themes of World Development Indicators:
+ - Poverty and shared prosperity, which presents indicators that measure progress toward the World Bank Group's twin goals of ending extreme poverty by 2030 and promoting shared prosperity in every country.
+ - People, which showcases indicators covering education, health, jobs, social protection, and gender and provides a portrait of societal progress across the world.
+ - Environment, which presents indicators on the use of natural resources, such as water and energy, and various measures of environmental degradation, including pollution, deforestation, and loss of habitat, all of which must be considered in shaping development strategies.
+ - Economy, which provides a window on the global economy through indicators that describe the economic activity of the more than 200 countries and territories that produce, trade, and consume the world's output.
+ - States and markets, which encompasses indicators on private investment and performance, financial system development, quality and availability of infrastructure, and the role of the public sector in nurturing investment and growth.
+ - Global links, which presents indicators on the size and direction of the flows and links that enable economies to grow, including measures of trade, remittances, equity, and debt, as well as tourism and migration.",
+uri = "http://wdi.worldbank.org/tables",
+ maintainer = "World Bank, Development Data Group (DECDG)")
+
+ ),
+ data_sources = list(
+ list(source = "World Bank, World Development Indicators database, 2020")
+
+ ),
+ time_periods = list(
+    list(from = "2007", to = "2019")  # The table covers all years from 2007 to 2019
+
+ ),
+ ref_country = ctry_list,
+ geographic_granularity = "Country, WB geographic region, other country groupings",
+
+ languages = list(
+ list(name = "English", code = "EN")
+
+ ),
+ links = list(
+ list(uri = "http://wdi.worldbank.org/tables",
+ description = "World Development Indicators - Global Goals tables"),
+ list(uri = "https://datatopics.worldbank.org/world-development-indicators/",
+         description = "World Development Indicators website"),
+ list(uri = "https://sdgs.un.org/goals",
+ description = "United Nations, Sustainable Development Goals (SDG) website")
+
+ ),
+ keywords = list(
+ list(name = "Sustainable Development Goals (SDGs)"),
+ list(name = "Shared prosperity"),
+ list(name = "HIV - AIDS")
+
+ ),
+ topics = list(
+ list(id = "1",
+ name = "Demography",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "2",
+ name = "Economics",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "3",
+ name = "Education",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+ list(id = "4",
+ name = "Health",
+ vocabulary = "CESSDA",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
+
+ ),
+ disciplines = list(
+ list(name = "Economics")
+
+ ),
+ definitions = def_list,
+
+ license = list(
+ list(name = "Creative Commons - Attribution 4.0 International - CC BY 4.0",
+ uri = "https://creativecommons.org/licenses/by/4.0/")
+
+ ),
+ citation = "",
+
+ contacts = list(
+ list(name = "World Bank, Development Data Group, Help Desk",
+ telephone = "+1 (202) 473-7824 or +1 (800) 590-1906",
+ email = "data@worldbank.org",
+ uri = "https://datahelpdesk.worldbank.org/")
+
+ )
+
+ )
+ )
+# We publish the table in the catalog
+
+table_add(idno = id,
+          metadata = my_tbl,
+          repositoryid = "central",
+          published = 1,
+          overwrite = "yes",
+          thumbnail = thumb)
+
+# We add the MS-Excel and PDF versions of the table as external resources
+
+external_resources_add(
+  title = "Global Goals: Ending Poverty and Improving Lives (in MS-Excel format)",
+  idno = id,
+  dctype = "tbl",
+  file_path = "WV2_Global_goals_ending_poverty_and_improving_lives.xlsx",
+  overwrite = "yes"
+)
+external_resources_add(
+  title = "Global Goals: Ending Poverty and Improving Lives (in PDF format)",
+  idno = id,
+  dctype = "tbl",
+  file_path = "WV2_Global_goals_ending_poverty_and_improving_lives.pdf",
+  overwrite = "yes"
+)
The table will now be available in the NADA catalog.
+
+Using Python
# Python script
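The full Python script for this third example is not reproduced here, but its distinctive step, pulling indicator names and definitions out of fixed row positions in the spreadsheet, can be sketched as follows. Reading the Excel range itself would be done with a package such as openpyxl or pandas (not shown); here `cells` is assumed to already hold the values of range A241:A340 as a flat list, with `None` for empty rows.

```python
def extract_definitions(cells, pairs):
    """Build (name, definition) records from fixed row offsets.

    `pairs` lists (name_row, definition_row) positions, 1-based to match
    the df[row, 1] indexing used in the R script above.
    """
    return [{"name": str(cells[n - 1]), "definition": str(cells[d - 1])}
            for n, d in pairs]

# Same offsets as in the R script (first three shown)
offsets = [(1, 3), (11, 13), (21, 23)]
```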
This chapter describes the use of two metadata standards for the documentation of images. Images may include both electronic and physical representations, but we are interested here in images available as electronic files, intended to be catalogued and published in on-line catalogs or albums. These files will typically be available in one of the following formats: JPG, PNG, or TIFF. Images can be photos taken by digital cameras, computer-generated images, or scanned images. The metadata standards we describe are intended to make these images discoverable, accessible, and usable. For that purpose, metadata must be provided on the content of the image (in the form of a caption, description, keywords, etc.), on the location and date the image was generated, on the author, and more. Information on the use license and copyrights, and on possible privacy protection issues (e.g., images showing persons, possibly minors), is also needed so that users can ensure their use of the published images is legal, ethical, and responsible.
+The devices used to generate images as electronic files (such as digital cameras) embed metadata in the files they produce. Digital cameras generate EXIF metadata. This information may be useful to some users, but (with a few exceptions such as the date the photo was taken and the GPS location, if recorded) it lacks information on the content of the image (what is represented in it), which is required for discoverability. This information must be added by curators. Part of it will be entered manually; other parts can be extracted in a largely automated manner using machine learning models and APIs. This information must be structured and stored in compliance with a metadata standard. We present in this chapter two standards that can serve that purpose: the comprehensive (and somewhat complex) IPTC standard, and the simpler Dublin Core (DCMI) standard. The metadata schema we propose embeds both options; when using the schema, users will select one or the other to document their images. We also make references to the ImageObject metadata schema from schema.org, and include some of its elements in our schema.
+Although photographs may be more explicit than a long discourse for humans, they do not describe themselves in terms of content the way texts do. For texts, authors provide many clues to indicate what they are talking about: titles, abstracts, keywords, etc., which may be used for automatic cataloguing. Searching for photos must rely on manual cataloguing, or on related texts and documents that come with the photos. (Source: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.5077&rep=rep1&type=pdf)
+We start with a brief presentation of the EXIF metadata, then describe the schema we propose for the documentation and cataloguing of images.
+Modern digital cameras automatically generate metadata and embed it into the image file. This metadata is known as the Exchangeable Image File Format or EXIF. EXIF will record information on the date and time the image was taken, on the GPS location coordinates (latitude & longitude, possibly altitude) if the camera was equipped with a GPS and geolocation was enabled, information on the device including manufacturer and model, technical information (lens type, focal range, aperture, shutter speed, flash settings), the system-generated unique image identifier, and more.
+There are several ways to extract or view an image’s EXIF Data. For example, the R packages ExifTool and ExifR allow extraction and use of EXIF metadata, and applications like Flickr will display EXIF content.
+
+
+
But with the exception of the date, location (if captured), and unique image identifier, the content of the EXIF does not provide information that users interested in identifying images based on their source and/or content will find useful. Metadata describing the content and source of an image will have to be obtained from another source or using other tools.
+The metadata schema we propose for documenting images contains two mutually exclusive options: the Dublin Core, as a simple option, and the IPTC, as a more complex and advanced solution. The schema also contains a few metadata elements that will be used no matter which option is selected. The schema is structured as follows:
+
+- A few elements common to both options are provided to document the metadata (not the image itself), to provide some cataloguing parameters, and to set a unique identifier for the image being documented.
+- Then come the two options for documenting the image itself: the IPTC block of metadata elements, and the Dublin Core block of elements. Users will make use of one of them, not both.
+  - The IPTC is the most detailed and complex schema. The version embedded in our schema is 2019.1. According to the IPTC website, "The IPTC Photo Metadata Standard is the most widely used standard to describe photos, because of its universal acceptance among news agencies, photographers, photo agencies, libraries, museums, and other related industries. It structures and defines metadata properties that allow users to add precise and reliable data about images." The IPTC standard consists of two schemas: IPTC Core and IPTC Extension. They provide a comprehensive set of fields to document an image, including information on time and geographic coverage, people and objects shown in the image, information on rights, and more. The schema is complex, and in most cases only a small subset of fields will be used to document an image. Controlled vocabularies are recommended for some elements.
+  - The Dublin Core (DCMI) is a simpler and highly flexible standard, composed of 15 core elements, which we supplement with a few elements mostly taken from the ImageObject schema from schema.org.
+- Last, a small number of additional metadata elements are provided, which are common to both options described above.
+
+Whether the IPTC or the simpler DCMI option is used, the metadata should be made as rich as possible.
+To make images discoverable, metadata that describe the content depicted in an image, its source, and the rights and licensing associated with it are essential, but they are not provided in the EXIF. This additional metadata must be provided.
+
+Some of this metadata will have to be generated by image authors and/or curators; other parts can be generated in a largely automated manner using machine learning models and tools. Image processing algorithms that make it possible to augment metadata include face detection, person identification, automated labeling, text extraction, and others. Before describing the proposed metadata schema in the following sections, we present some examples of tools that make such metadata enhancement easy and affordable.
+The example below makes use of the Google Vision API to generate image metadata. Google Vision is one of several tools that can be used for that purpose; others include Amazon Rekognition and Microsoft Azure Computer Vision. This example makes use of a photo selected from the World Bank Flickr album.
+The image comes with a brief description that identifies the photographer, the location (name of the country and town, not GPS location), and the content of the image. The description of the image includes important keywords that, when indexed in a catalog, will support discoverability of the image. This information, to be manually entered, is valuable and must be part of the curated image metadata.
But we can add useful additional information in an automated manner and at low cost using machine learning models. In the example below, we use the (free) on-line “Try it” tool of the Google Vision application.
+The Google Vision API returns and displays the results of the image processing in multiple tabs. The same content is available programmatically in JSON format. The content of this JSON file can be mapped to elements of the metadata schema, for automatic addition to the image metadata.
+The first tab shows the result of face detection. Each detected face has a bounding box and metadata such as the derived emotion of the person. The bounding box can be used to automatically flag images that contain one or more "significant size" faces, which may have to be excluded from the published images for privacy protection reasons.
+The second tab reports on detected objects.
+The third tab suggests labels that could be attached to the image, provided with a degree of confidence. A threshold can be set to automatically add (or not) each proposed label as a keyword in the image metadata.
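As an illustration of how this thresholding could be automated, the sketch below filters a mocked label-detection response before adding labels as keywords. The `description` and `score` fields follow the Vision API label-detection response format, but the 0.80 cutoff is an arbitrary choice.

```python
import json

def labels_as_keywords(vision_json, threshold=0.80):
    """Keep only labels whose confidence score meets the threshold."""
    labels = json.loads(vision_json).get("labelAnnotations", [])
    return [{"name": lab["description"]}
            for lab in labels if lab.get("score", 0) >= threshold]

# Mocked response, in the shape returned by Vision API label detection
response = '''{"labelAnnotations": [
    {"description": "Person", "score": 0.97},
    {"description": "Market", "score": 0.88},
    {"description": "Toy",    "score": 0.41}]}'''
```

The resulting list can be merged into the image metadata as `keywords` entries, with the low-confidence labels dropped.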
+The fourth tab shows the text detected in the image. The quality of text detection and recognition depends on the resolution of the image and on the size and orientation of the text in the image. In our example, the algorithm fails to read (most of) the small, rotated and truncated text.
+The tool managed to recognize some, but not all, characters. In this case, the extracted text would not be useful enough to add to the image metadata.
+We are not interested in the properties tab which does not provide information that can be used for discoverability of images based on their content or source.
+The last tab, Safe search, could be used to generate content warnings if you plan to make the image publicly accessible.
+This “Try it” tool demonstrates the capabilities of the application which, for automating the processing of a collection of images, would be accessed programmatically using R, Python or another programming language. Accessing the application’s API requires a key. The cost of image labeling, face detection, and other image processing is low. For information on pricing, consult the website of the API providers.
The schema contains two options to document images: the IPTC and the Dublin Core metadata standards. The schema contains four main groups of metadata elements:

1. A small set of "common elements" (used no matter what option, IPTC or Dublin Core, is used), used mostly for cataloguing purposes.
2. The IPTC metadata elements.
3. The Dublin Core (DCMI) elements.
4. Another small set of common elements.

The description of IPTC metadata elements is largely taken from the Photo Metadata section of the IPTC website.
```json
{
  "repositoryid": "central",
  "published": "0",
  "overwrite": "no",
  "metadata_information": {},
  "image_description": {
    "idno": "string",
    "identifiers": [],
    "iptc": {},
    "dcmi": {},
    "license": [],
    "album": []
  },
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": { }
}
```
**`metadata_information`** *[Optional ; Not Repeatable]* <br>
The `metadata_information` block documents the production of the metadata, not the image itself. It is used regardless of whether the IPTC or DCMI section is used to describe the image.
```json
"metadata_information": {
  "title": "string",
  "idno": "string",
  "producers": [
    {
      "name": "string",
      "abbr": "string",
      "affiliation": "string",
      "role": "string"
    }
  ],
  "production_date": "string",
  "version": "string"
}
```
- **`title`** *[Optional ; Not Repeatable ; String]* <br>
The title of the image metadata. This can be the same as the image title.
- **`idno`** *[Optional ; Not Repeatable ; String]* <br>
The unique identifier of the image metadata document (which can be different from the image identifier).
- **`producers`** *[Optional ; Repeatable]* <br>
A list of persons or organizations involved in the documentation (production of the metadata) of the image.
  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the person or organization who produced the metadata.
  - **`abbr`** *[Optional ; Not Repeatable ; String]* <br>
  The abbreviation (acronym) of the organization mentioned in `name`.
  - **`affiliation`** *[Optional ; Not Repeatable ; String]* <br>
  The affiliation of the person or organization mentioned in `name`.
  - **`role`** *[Optional ; Not Repeatable ; String]* <br>
  The specific role of the person or organization mentioned in `name` in the production of the metadata. This element will be used when more than one person or organization is listed in the `producers` element to distinguish the specific contribution of each metadata producer.
- **`production_date`** *[Optional ; Not Repeatable ; String]* <br>
The date the image metadata was generated (not the date the image was created), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
- **`version`** *[Optional ; Not Repeatable ; String]* <br>
The version of the metadata on this image. This element will rarely be used.
**`image_description`** *[Required ; Not Repeatable]* <br>
The `image_description` block will contain the metadata related to one image.
```json
"image_description": {
  "idno": "string",
  "identifiers": [
    {
      "type": "string",
      "identifier": "string"
    }
  ],
  "iptc": {},
  "dcmi": {},
  "license": [],
  "album": []
}
```
- **`idno`** *[Required ; Not Repeatable ; String]* <br>
The (main) unique identifier of the image, to be used for cataloguing purposes.
- **`identifiers`** *[Optional ; Repeatable]* <br>
The repeatable element `identifiers` is used to list image identifiers (IDs) other than the catalog ID entered in the `image_description/idno` element. Some images may have unique identifiers assigned by different organizations or cataloguing systems; this element is used to document them. Such an identifier can for example be a Digital Object Identifier (DOI) or the EXIF identifier. Note that the ID entered in the `idno` element can be repeated here (`idno` does not provide a `type` parameter, which curators may want to document).
  - **`type`** *[Optional ; Not Repeatable ; String]* <br>
  The type of identifier. This could be for example "DOI".
  - **`identifier`** *[Required ; Not Repeatable ; String]* <br>
  The identifier itself.
- **`iptc`** *[Optional ; Not Repeatable]* <br>
The schema provides two options (standards) to document an image: the IPTC and the Dublin Core. Only one of these standards, not both, will be used to document an image. The block `iptc` will be used when IPTC is the preferred option. In that case, the `dcmi` block described later in this chapter will be left empty. IPTC is the more complex of the two options.
```json
"iptc": {
  "photoVideoMetadataIPTC": {
    "title": "string",
    "imageSupplierImageId": "string",
    "registryEntries": [],
    "digitalImageGuid": "string",
    "dateCreated": "2023-04-11T15:06:09Z",
    "headline": "string",
    "eventName": "string",
    "description": "string",
    "captionWriter": "string",
    "keywords": [],
    "sceneCodes": [],
    "sceneCodesLabelled": [],
    "subjectCodes": [],
    "subjectCodesLabelled": [],
    "creatorNames": [],
    "creatorContactInfo": {},
    "creditLine": "string",
    "digitalSourceType": "http://example.com",
    "jobid": "string",
    "jobtitle": "string",
    "source": "string",
    "locationsShown": [],
    "imageRating": 0,
    "supplier": [],
    "copyrightNotice": "string",
    "copyrightOwners": [],
    "usageTerms": "string",
    "embdEncRightsExpr": [],
    "linkedEncRightsExpr": [],
    "webstatementRights": "http://example.com",
    "instructions": "string",
    "genres": [],
    "intellectualGenre": "string",
    "artworkOrObjects": [],
    "personInImageNames": [],
    "personsShown": [],
    "modelAges": [],
    "additionalModelInfo": "string",
    "minorModelAgeDisclosure": "http://example.com",
    "modelReleaseDocuments": [],
    "modelReleaseStatus": {},
    "organisationInImageCodes": [],
    "organisationInImageNames": [],
    "productsShown": [],
    "maxAvailHeight": 0,
    "maxAvailWidth": 0,
    "propertyReleaseStatus": {},
    "propertyReleaseDocuments": [],
    "aboutCvTerms": []
  }
}
```
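The rule that `iptc` and `dcmi` are mutually exclusive is easy to enforce programmatically before a document is submitted to a catalog. A minimal sketch (the function name is ours, not part of the schema):

```python
def check_standard_exclusivity(image_description):
    """Verify that only one of the 'iptc' and 'dcmi' blocks is populated.

    Returns the name of the standard in use ('iptc' or 'dcmi'), None when
    neither is filled in, and raises ValueError when both contain metadata.
    """
    has_iptc = bool(image_description.get("iptc"))
    has_dcmi = bool(image_description.get("dcmi"))
    if has_iptc and has_dcmi:
        raise ValueError("Use either 'iptc' or 'dcmi' to document an image, not both")
    if has_iptc:
        return "iptc"
    if has_dcmi:
        return "dcmi"
    return None  # neither block is filled in
```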
**`photoVideoMetadataIPTC`** *[Required ; Not Repeatable]* <br>
Contains all elements used to describe the image using the IPTC standard.

- **`title`** *[Optional ; Not Repeatable ; String]* <br>
The title is a shorthand reference for the digital image. It provides a short verbal and human-readable name, which can be a text and/or a numeric reference. It is not the same as the headline (see below). Some may use the `title` field to store the file name of the image, though the field may be used in many ways. This element should not be used to provide the unique identifier of the image.
- **`imageSupplierImageId`** *[Optional ; Not Repeatable ; String]* <br>
A unique identifier assigned by the image supplier to the image.
- **`registryEntries`** *[Optional ; Repeatable]* <br>
A structured element used to provide cataloguing information (i.e., an entry in a registry). It includes the unique identifier for the image issued by the registry and the registry's organization identifier.
```json
"registryEntries": [
  {
    "role": "http://example.com",
    "assetIdentifier": "string",
    "registryIdentifier": "http://example.com"
  }
]
```
  - **`role`** *[Optional ; Not Repeatable ; String]* <br>
  An identifier of the reason and/or purpose for this registry entry.
  - **`assetIdentifier`** *[Optional ; Not Repeatable ; String]* <br>
  A unique identifier created by the registry and applied by the creator of the digital image. This value shall not be changed after being applied. This identifier is linked to a corresponding registry organization identifier. The identifier may be globally unique by itself, but it must be unique for the issuing registry. An input to this field should be made mandatory.
  - **`registryIdentifier`** *[Optional ; Not Repeatable ; String]* <br>
  An identifier for the registry/organization which issued the corresponding registry image ID.
- **`digitalImageGuid`** *[Optional ; Not Repeatable ; String]* <br>
A globally unique identifier for the image. This identifier is created and applied by the creator of the digital image at the time of its creation. This value shall not be changed after that time. The identifier can be generated using an algorithm that guarantees that the created identifier is globally unique. Devices that create digital images, like digital cameras, video cameras, or scanners, usually create such an identifier at the time of the creation of the digital data and add it to the metadata embedded in the image file (e.g., the EXIF metadata).
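When the capture device did not embed such an identifier, a curator can generate one. The sketch below uses Python's standard `uuid` module; an RFC 4122 version-4 UUID is one common way to obtain a globally unique value, though the IPTC does not mandate a particular algorithm (the helper name is ours):

```python
import uuid

def make_digital_image_guid():
    """Generate a random (version 4) UUID to use as a digitalImageGuid.

    uuid4 draws 122 random bits, which makes collisions practically
    impossible; the value must be assigned once and never changed.
    """
    return str(uuid.uuid4())
```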
- **`dateCreated`** *[Optional ; Not Repeatable ; String]* <br>
Designates the date and, optionally, the time the content of the image was created. For a photo, this will be the date and time the photo was taken. When no information is available on the time, the time is set to 00:00:00. The preferred format for the `dateCreated` element is the truncated DateTime format, for example: 2021-02-22T21:24:06Z.
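The preferred format can be produced with Python's standard `datetime` module. A minimal sketch (the helper name is ours; note that the trailing `Z` asserts UTC, so local timestamps should be converted first):

```python
from datetime import datetime, time as dtime

def to_iptc_datetime(d, t=None):
    """Format a date (and optional time) as the truncated DateTime
    string preferred for dateCreated; a missing time becomes 00:00:00."""
    t = t if t is not None else dtime(0, 0, 0)
    return datetime.combine(d, t).strftime("%Y-%m-%dT%H:%M:%SZ")
```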
- **`headline`** *[Optional ; Not Repeatable ; String]* <br>
A brief publishable summary of the contents of the image. Note that a headline is not the same as a title.
- **`eventName`** *[Optional ; Not Repeatable ; String]* <br>
The name or a brief description of the event where the image was taken. If this is a sub-event of a larger event, mention both in the description. For example: "Opening statement, 1st International Conference on Metadata Standards, New York, November 2021".
- **`description`** *[Optional ; Not Repeatable ; String]* <br>
A textual description, including captions, of the image. This describes the who, what, and why of what is happening in this image. This might include names of people, and/or their role in the action that is taking place within the image. Example: "The president of the Metadata Association delivers the keynote address".
- **`captionWriter`** *[Optional ; Not Repeatable ; String]* <br>
An identifier, or the name, of the person involved in writing, editing, or correcting the description of the image.
- **`keywords`** *[Optional ; Repeatable ; String]* <br>

```json
"keywords": [
  "string"
]
```

Keywords (terms or phrases) used to express the subject of the image. Keywords do not have to be taken from a controlled vocabulary.
- **`sceneCodes`** *[Optional ; Repeatable ; String]* <br>

```json
"sceneCodes": [
  "string"
]
```

The `sceneCodes` element describes the scene of a photo's content. The IPTC Scene-NewsCodes controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license) should be used, where a scene is represented as a string of 6 digits.
| Code | Label | Description |
|---|---|---|
| 010100 | headshot | A head-only view of a person (or animal/s) or persons, as in a montage. |
| 010200 | half-length | A torso-and-head view of a person or persons. |
| 010300 | full-length | A view from head to toe of a person or persons. |
| 010400 | profile | A view of a person from the side. |
| 010500 | rear view | A view of a person or persons from the rear. |
| 010600 | single | A view of only one person, object or animal. |
| 010700 | couple | A view of two people who are in a personal relationship, for example engaged, married or in a romantic partnership. |
| 010800 | two | A view of two people. |
| 010900 | group | A view of more than two people. |
| 011000 | general view | An overall view of the subject and its surrounds. |
| 011100 | panoramic view | A panoramic or wide-angle view of a subject and its surrounds. |
| 011200 | aerial view | A view taken from above. |
| 011300 | under-water | A photo taken under water. |
| 011400 | night scene | A photo taken during darkness. |
| 011500 | satellite | A photo taken from a satellite in orbit. |
| 011600 | exterior view | A photo that shows the exterior of a building or other object. |
| 011700 | interior view | A scene or view of the interior of a building or other object. |
| 011800 | close-up | A view of, or part of, a person/object taken at close range in order to emphasize detail or accentuate mood. Macro photography. |
| 011900 | action | Subject in motion, such as children jumping or a horse running. |
| 012000 | performing | Subject or subjects on a stage performing to an audience. |
| 012100 | posing | Subject or subjects posing, such as a "victory" pose or other stance that symbolizes leadership. |
| 012200 | symbolic | A posed picture symbolizing an event, such as two rings for marriage. |
| 012300 | off-beat | An attractive, perhaps fun picture of everyday events, such as a dog with sunglasses or people cooling off in the fountain. |
| 012400 | movie scene | Photos taken during the shooting of a movie or TV production. |
- **`sceneCodesLabelled`** *[Optional ; Repeatable]* <br>

```json
"sceneCodesLabelled": [
  {
    "code": "string",
    "label": "string",
    "description": "string"
  }
]
```

The `sceneCodes` element described above only allows for the capture of codes. To improve discoverability (by indexing important keywords), not only the scene codes but also the scene description should be provided. The IPTC standard does not provide an element that allows the scene label and description to be entered. The `sceneCodesLabelled` element is one that we added to our schema. Ideally, curators will enter the scene codes in the element `sceneCodes` to maintain full compatibility with the IPTC, and complement that information by also entering the codes and their description in the `sceneCodesLabelled` element.
  - **`code`** *[Optional ; Not Repeatable ; String]* <br>
  The code for the scene of a photo's content. The IPTC Scene-NewsCodes controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license) should be used, where a scene is represented as a string of 6 digits. See table above.
  - **`label`** *[Optional ; Not Repeatable ; String]* <br>
  The label of the scene. See table above for examples.
  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A more detailed description of the scene. See table above for examples.
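Populating `sceneCodesLabelled` from `sceneCodes` is mechanical once the controlled vocabulary is available as a lookup table. A minimal sketch (the dictionary below holds only two entries from the table above; a real implementation would load the full IPTC vocabulary, and the function name is ours):

```python
# A tiny excerpt of the IPTC Scene-NewsCodes vocabulary (see table above).
SCENE_VOCABULARY = {
    "010100": ("headshot", "A head-only view of a person"),
    "011200": ("aerial view", "A view taken from above"),
}

def label_scene_codes(scene_codes, vocabulary=SCENE_VOCABULARY):
    """Expand bare 6-digit scene codes into sceneCodesLabelled entries,
    skipping codes absent from the vocabulary."""
    labelled = []
    for code in scene_codes:
        if code in vocabulary:
            label, description = vocabulary[code]
            labelled.append({"code": code, "label": label, "description": description})
    return labelled
```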
- **`subjectCodes`** *[Optional ; Repeatable ; String]* <br>

```json
"subjectCodes": [
  "string"
]
```

Specifies one or more subjects from the IPTC Subject-NewsCodes controlled vocabulary to categorize the image. Each subject is represented as a string of 8 digits. The vocabulary consists of about 1,400 terms organized into 3 levels (users can decide to use only the first, or the first two, levels; the more detail is provided, the better the discoverability of the image). The first level of the controlled vocabulary is as follows:
| Code | Label | Description |
|---|---|---|
| 01000000 | arts, culture and entertainment | Matters pertaining to the advancement and refinement of the human mind, of interests, skills, tastes and emotions. |
| 02000000 | crime, law and justice | Establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules and the punishment of offenders. Organizations and bodies involved in these activities. |
| 03000000 | disaster and accident | Man-made and natural events resulting in loss of life or injury to living creatures and/or damage to inanimate objects or property. |
| 04000000 | economy, business and finance | All matters concerning the planning, production and exchange of wealth. |
| 05000000 | education | All aspects of furthering knowledge of human individuals from birth to death. |
| 06000000 | environmental issue | All aspects of protection, damage, and condition of the ecosystem of the planet earth and its surroundings. |
| 07000000 | health | All aspects pertaining to the physical and mental welfare of human beings. |
| 08000000 | human interest | Lighter items about individuals, groups, animals or objects. |
| 09000000 | labor | Social aspects, organizations, rules and conditions affecting the employment of human effort for the generation of wealth or provision of services and the economic support of the unemployed. |
| 10000000 | lifestyle and leisure | Activities undertaken for pleasure, relaxation or recreation outside paid employment, including eating and travel. |
| 11000000 | politics | Local, regional, national and international exercise of power, or struggle for power, and the relationships between governing bodies and states. |
| 12000000 | religion and belief | All aspects of human existence involving theology, philosophy, ethics and spirituality. |
| 13000000 | science and technology | All aspects pertaining to human understanding of nature and the physical world and the development and application of this knowledge. |
| 14000000 | social issue | Aspects of the behavior of humans affecting the quality of life. |
| 15000000 | sport | Competitive exercise involving physical effort. Organizations and bodies involved in these activities. |
| 16000000 | unrest, conflicts and war | Acts of socially or politically motivated protest and/or violence. |
| 17000000 | weather | The study, reporting and prediction of meteorological phenomena. |
As an example of subjects at the three levels, the list below zooms in on the subject "education".

| Code | Subject | Description |
|---|---|---|
| 05000000 | education | All aspects of furthering knowledge of human individuals from birth to death. |
| 05001000 | adult education | Education provided for older students outside the usual age groups of 5-25. |
| 05002000 | further education | Any form of education beyond basic education of several levels. |
| 05003000 | parent organization | Groups of parents set up to support schools. |
| 05004000 | preschool | Education for children under the national compulsory education age. |
| 05005000 | school | A building or institution in which education of various sorts is provided. |
| 05005001 | elementary schools | Schools usually of a level from kindergarten through 11 or 12 years of age. |
| 05005002 | middle schools | Transitional school between elementary and high school, 12 through 13 years of age. |
| 05005003 | high schools | Pre-college/university level education, 14 to 17 or 18 years of age, called freshman, sophomore, junior and senior. |
| 05006000 | teachers union | Organization of teachers for collective bargaining and other purposes. |
| 05007000 | university | Institutions of higher learning capable of providing doctorate degrees. |
| 05008000 | upbringing | Lessons learned from parents and others as one grows up. |
| 05009000 | entrance examination | Exams for entering colleges, universities, junior and senior high schools, and all other higher and lower education institutes, including cram schools, which help students prepare for exams for entry to prestigious schools. |
| 05010000 | teaching and learning | Either end of the education equation. |
| 05010001 | students | People of any age in a structured environment, not necessarily a classroom, in order to learn something. |
| 05010002 | teachers | People with knowledge who can impart that knowledge to others. |
| 05010003 | curriculum | The courses offered by a learning institution and the regulation of those courses. |
| 05010004 | test/examination | A measurement of student accomplishment. |
| 05011000 | religious education | Instruction by any faith, in that faith or about other faiths, usually, but not always, conducted in schools run by religious bodies. |
| 05011001 | parochial school | A school run by the Roman Catholic faith. |
| 05011002 | seminary | A school of any faith specifically designed to train ministers. |
| 05011003 | yeshiva | A school for training rabbis. |
| 05011004 | madrasa | A school for teaching Islam. |
- **`subjectCodesLabelled`** *[Optional ; Repeatable]* <br>

```json
"subjectCodesLabelled": [
  {
    "code": "string",
    "label": "string",
    "description": "string"
  }
]
```

The `subjectCodes` element described above only allows for the capture of codes. To improve discoverability (by indexing important keywords), not only the subject codes but also the subject description should be provided. The IPTC standard does not provide an element that allows the subject label and description to be entered. The `subjectCodesLabelled` element is one that we added to our schema. Ideally, curators will enter the subject codes in the element `subjectCodes` to maintain full compatibility with the IPTC, and complement that information by also entering the codes and their description in the `subjectCodesLabelled` element.
  - **`code`** *[Optional ; Not Repeatable ; String]* <br>
  Specifies one or more subjects from the IPTC Subject-NewsCodes controlled vocabulary to categorize the image. Each subject is represented as a string of 8 digits. The vocabulary consists of about 1,400 terms organized into 3 levels (users can decide to use only the first, or the first two, levels; the more detail is provided, the better the discoverability of the image). See examples in the table above.
  - **`label`** *[Optional ; Not Repeatable ; String]* <br>
  The label of the subject. See table above for examples.
  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A more detailed description of the subject. See table above for examples.
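Judging from the examples in the table above, the 8-digit subject codes appear to encode the hierarchy positionally: digits 1-2 identify the top level, digits 3-5 the second level, and digits 6-8 the third. Under that assumption, a curator who wants to index broader categories alongside a detailed code can derive its ancestors; a sketch (the function name is ours):

```python
def subject_code_ancestors(code):
    """Return the level-1 and level-2 ancestor codes of an 8-digit
    IPTC Subject-NewsCode, assuming positional encoding (digits 1-2,
    3-5, and 6-8 for the three levels, as the table's examples suggest)."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError("subject codes are strings of 8 digits")
    level1 = code[:2] + "000000"   # e.g. 05005002 -> 05000000 (education)
    level2 = code[:5] + "000"      # e.g. 05005002 -> 05005000 (school)
    return level1, level2
```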
- **`creatorNames`** *[Optional ; Repeatable ; String]* <br>

```json
"creatorNames": [
  "string"
]
```

Enter details about the creator or creators of this image. The image creator must often be attributed in association with any use of the image. The image creator, copyright owner, image supplier, and licensor may be the same or different entities.
- **`creatorContactInfo`** *[Optional ; Not Repeatable]* <br>

```json
"creatorContactInfo": {
  "country": "string",
  "emailwork": "string",
  "region": "string",
  "phonework": "string",
  "weburlwork": "string",
  "address": "string",
  "city": "string",
  "postalCode": "string"
}
```

The creator's contact information provides all necessary information to get in contact with the creator of this image, and comprises a set of elements for proper addressing. Note that if the creator is also the licensor, his or her contact information should be provided in the `licensor` fields.
  - **`country`** *[Optional ; Not Repeatable ; String]* <br>
  The country name for the address of the person who created this image.
  - **`emailwork`** *[Optional ; Not Repeatable ; String]* <br>
  The work email address(es) for the creator of the image. Multiple email addresses can be given, in which case they should be separated by a comma.
  - **`region`** *[Optional ; Not Repeatable ; String]* <br>
  The state or province for the address of the creator of the image.
  - **`phonework`** *[Optional ; Not Repeatable ; String]* <br>
  The work phone number(s) for the creator of the image. Use the international format, including the country code, such as +1 (123) 456789. Multiple numbers can be given, in which case they should be separated by a comma.
  - **`weburlwork`** *[Optional ; Not Repeatable ; String]* <br>
  The work web address for the creator of the image. Multiple addresses can be given, in which case they should be separated by a comma.
  - **`address`** *[Optional ; Not Repeatable ; String]* <br>
  The address of the creator of the image. This may comprise a company name.
  - **`city`** *[Optional ; Not Repeatable ; String]* <br>
  The city for the address of the person who created the image.
  - **`postalCode`** *[Optional ; Not Repeatable ; String]* <br>
  The local postal code for the address of the person who created the image.
- **`creditLine`** *[Optional ; Not Repeatable ; String]* <br>
The credit to person(s) and/or organization(s) required by the supplier of the image to be used when published. This is a free-text field.
- **`digitalSourceType`** *[Optional ; Not Repeatable ; String]* <br>
The type of the source of this digital image. One value should be selected from the IPTC controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license), which contains the following values:
+
Type | +Source | +Description | +
---|---|---|
digitalCapture | +Original digital capture of a real life scene | +The digital image is the original and only instance and was taken by a digital camera | +
negativeFilm | +Digitized from a negative on film | +The digital image was digitized from a negative on film on any other transparent medium | +
positiveFilm | +Digitized from a positive on film | +The digital image was digitized from a positive on a transparency or any other transparent medium | +
Digitized from a print on non-transparent medium | +The digital image was digitized from an image printed on a non-transparent medium | +|
softwareImage | +Created by software | +The digital image was created by computer software | +
- **`jobid`** *[Optional ; Not Repeatable ; String]* <br>
Number or identifier used for improved workflow handling (control or tracking). This is a user-created identifier related to the job for which the image is supplied. Note: as this identifier references a job in the receiver's workflow, it must first be issued by the receiver, then transmitted to the creator or provider of the news object, and finally added by the creator to this field.
- **`jobtitle`** *[Optional ; Not Repeatable ; String]* <br>
The job title of the photographer (the person listed in `creatorNames`). The use of this element implies that the photographer information (`creatorNames`) is not empty.
- **`source`** *[Optional ; Not Repeatable ; String]* <br>
The name of a person or party who has a role in the content supply chain. The `source` can be different from the `creator` and from the entities listed in the copyright notice.
- **`locationsShown`** *[Optional ; Repeatable]* <br>

```json
"locationsShown": [
  {
    "name": "string",
    "identifiers": [
      "http://example.com"
    ],
    "worldRegion": "string",
    "countryName": "string",
    "countryCode": "string",
    "provinceState": "string",
    "city": "string",
    "sublocation": "string",
    "gpsAltitude": 0,
    "gpsLatitude": 0,
    "gpsLongitude": 0
  }
]
```
This block of elements is used to document the location shown in the image. This information should be provided with as much detail as possible. It contains elements that can be used to provide a "nested" description of the location, from a high geographic level (world region) down to a very specific location (city and sub-location within a city).

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The full name of the location.
  - **`identifiers`** *[Optional ; Repeatable ; String]* <br>
  A globally unique identifier of the location shown.
  - **`worldRegion`** *[Optional ; Not Repeatable ; String]* <br>
  The name of a world region. This element is at the first (top) level of the top-down geographical hierarchy.
  - **`countryName`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the country of a location. This element is at the second level of the top-down geographical hierarchy.
  - **`countryCode`** *[Optional ; Not Repeatable ; String]* <br>
  The ISO code of the country mentioned in `countryName`.
  - **`provinceState`** *[Optional ; Not Repeatable ; String]* <br>
  The name of a sub-region of the country, for example a province or a state name. This element is at the third level of the top-down geographical hierarchy.
  - **`city`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the city. This element is at the fourth level of the top-down geographical hierarchy.
  - **`sublocation`** *[Optional ; Not Repeatable ; String]* <br>
  The sublocation name could either be the name of a sublocation within a city or the name of a well-known location or (natural) monument outside a city. This element is at the fifth (lowest) level of the top-down geographical hierarchy.
  - **`gpsAltitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The altitude in meters of a WGS84-based position of this location.
  - **`gpsLatitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The latitude of a WGS84-based position of this location (in some cases, this information may be contained in the EXIF metadata).
  - **`gpsLongitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The longitude of a WGS84-based position of this location (in some cases, this information may be contained in the EXIF metadata).
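When the GPS position is embedded in the EXIF metadata, it is commonly stored as degrees, minutes, and seconds plus a hemisphere reference, while the schema elements above expect signed decimal degrees. A conversion sketch (the helper name is ours):

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds and a hemisphere
    reference ('N', 'S', 'E', 'W') to signed decimal degrees."""
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    return -decimal if ref in ("S", "W") else decimal
```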
- **`imageRating`** *[Optional ; Not Repeatable ; Numeric]* <br>
Rating of the image by its user or supplier. The value shall be -1 or in the range 0 to 5, where -1 indicates "rejected" and 0 "unrated". If an explicit value is not provided, the default value of 0 will be assumed.
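This constraint is simple to validate before a metadata document is published; a sketch (the function name is ours):

```python
VALID_RATINGS = {-1, 0, 1, 2, 3, 4, 5}  # -1 = rejected, 0 = unrated

def normalize_image_rating(rating=None):
    """Apply the default (0, 'unrated') when no rating is given, and
    reject values outside -1 or the range 0 to 5."""
    if rating is None:
        return 0
    if rating not in VALID_RATINGS:
        raise ValueError("imageRating must be -1 or an integer from 0 to 5")
    return rating
```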
- **`supplier`** *[Optional ; Repeatable]* <br>

```json
"supplier": [
  {
    "name": "string",
    "identifiers": [
      "http://example.com"
    ]
  }
]
```

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the supplier of the image (person or organization).
  - **`identifiers`** *[Optional ; Repeatable ; String]* <br>
  The identifier for the most recent supplier of this image. This will not necessarily be the creator or the owner of the image.
- **`copyrightNotice`** *[Optional ; Not Repeatable ; String]* <br>
Contains any necessary copyright notice for claiming the intellectual property of this photograph, and should identify the current owner of the copyright of the photograph. Other entities, like the creator of the photograph, may be added in the corresponding field. Notes on usage rights should be provided in "Rights usage terms". Example: ©2008 Jane Doe. If the copyright ownership must be expressed in a more controlled manner, use the fields "Copyright Owner", "Copyright Owner ID", and "Copyright Owner Name" described below instead of the `copyrightNotice` element.
- **`copyrightOwners`** *[Optional ; Repeatable]* <br>
Owner or owners of the copyright of the licensed image, described in a structured format (as an alternative to the element `copyrightNotice` described above). This block serves the same purpose of identifying the rights holder(s) for the image. The copyright owner, image creator, and licensor may be the same or different entities.
```json
"copyrightOwners": [
  {
    "name": "string",
    "role": [
      "http://example.com"
    ],
    "identifiers": [
      "http://example.com"
    ]
  }
]
```
  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the owner of the copyright of the licensed image.<br>
  - **`role`** *[Optional ; Repeatable ; String]*<br>
  The role of the entity.<br>
  - **`identifiers`** *[Optional ; Repeatable ; String]*<br>
  The identifier of the owner of the copyright of the licensed image.<br><br>

- **`usageTerms`** *[Optional ; Not Repeatable ; String]* <br>
The licensing parameters of the image expressed in free text. Enter instructions on how this image can legally be used. The PLUS fields of the IPTC Extension can be used in parallel to express the licensed usage in more controlled terms.<br>

- **`embdEncRightsExpr`** *[Optional ; Repeatable]* <br>
An embedded encoded rights expression (EERE): a rights expression written in a rights expression language and encoded as a string. The structure below provides the details of an embedded encoded rights expression.<br>
```json
"embdEncRightsExpr": [
  {
    "encRightsExpr": "string",
    "rightsExprEncType": "string",
    "rightsExprLangId": "http://example.com"
  }
]
```
  - **`encRightsExpr`** *[Optional ; Not Repeatable ; String]* <br>
  The embedded rights expression, using any rights expression language, encoded as a string.
  - **`rightsExprEncType`** *[Optional ; Not Repeatable ; String]* <br>
  The encoding type of the rights expression, identified by an IANA Media Type.
  - **`rightsExprLangId`** *[Optional ; Not Repeatable ; String]* <br>
  An identifier of the rights expression language used by the rights expression (see https://www.iptc.org/std/photometadata/specification/IPTC-PhotoMetadata#embedded-encoded-rights-expression-eere-structure).
- **`linkedEncRightsExpr`** *[Optional ; Repeatable]* <br>
A link to an encoded rights expression.

```json
"linkedEncRightsExpr": [
  {
    "linkedRightsExpr": "http://example.com",
    "rightsExprEncType": "string",
    "rightsExprLangId": "http://example.com"
  }
]
```
  - **`linkedRightsExpr`** *[Optional ; Not Repeatable ; String]* <br>
  The link to a web resource representing an encoded rights expression.
  - **`rightsExprEncType`** *[Optional ; Not Repeatable ; String]* <br>
  The encoding type of the rights expression, identified by an IANA Media Type.
  - **`rightsExprLangId`** *[Optional ; Not Repeatable ; String]* <br>
  The identifier of the rights expression language used by the rights expression.
- **`webstatementRights`** *[Optional ; Not Repeatable ; String]* <br>
The URL of a web resource providing a statement of the copyright ownership and usage rights of the image.
- **`instructions`** *[Optional ; Not Repeatable ; String]* <br>
Any of a number of instructions from the provider or creator to the receiver of the image, which might include any of the following: embargoes and other restrictions not covered by the "Rights Usage Terms" field; information regarding the original means of capture (scanning notes, colourspace info) or other specific text information that the user may need for accurate reproduction; additional permissions required when publishing; or credits for publishing if they exceed the IIM length of the credit field.
genres
[Optional ; Repeatable]
+
"genres": [
+{
+ "cvId": "http://example.com",
+ "cvTermName": "string",
+ "cvTermId": "http://example.com",
+ "cvTermRefinedAbout": "http://example.com"
+ }
+ ]
cvId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the Controlled Vocabulary the term is from.
cvTermName
[Optional ; Not Repeatable ; String]
The natural language name of the term from a Controlled Vocabulary.
cvTermId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the term from a Controlled Vocabulary.
+Optionally enter a refinement of the ‘about’ relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary. May be used to refine the generic about relationship.
+Artistic, style, journalistic, product or other genre(s) of the image (expressed by a term from any Controlled Vocabulary)
intellectualGenre
[Optional ; Not Repeatable ; String]
A term to describe the nature of the image in terms of its intellectual or journalistic characteristics (for example "actuality", "interview", "background", "feature", "summary", or "wrapup" for journalistic genres; or "daybook", "obituary", "press release", or "transcript" for news category related genres). It is advised to use terms from a controlled vocabulary such as the NewsCodes Scheme published by the IPTC under a Creative Commons Attribution (CC BY) 4.0 license.

| Genre | Description |
|---|---|
| Actuality | Recording of an event |
| Advertiser Supplied | Content is supplied by an organization or individual that has paid the news provider for its placement |
| Advice | Letters and answers about readers' personal problems |
| Advisory | Recommendation on editorial or technical matters by a provider to its customers |
| On This Day | List of data, including birthdays of famous people and items of historical significance, for a given day |
| Analysis | Data and conclusions drawn by a journalist who has conducted in-depth research for a story |
| Archival material | Material selected from the originator's archive that has been previously distributed |
| Background | Scene setting and explanation for an event being reported |
| Behind the Story | The content describes how a story was reported and offers context on the reporting |
| Biography | Facts and background about a person |
| Birth Announcement | News of newly born children |
| Current Events | Content about events taking place at the time of the report |
| Curtain Raiser | Information about the staging and outcome of an immediately upcoming event |
| Daybook | Items filed on a regular basis that are lists of upcoming events with time and place, designed to inform others of events for planning purposes |
| Exclusive | Information content, in any form, that is unique to a specific information provider |
| Fact Check | The news item looks into the truth or falsehood of another reported news item or assertion (for example a statement on social media by a public figure) |
| Feature | The object content is about a particular event or individual that may not be significant to the current breaking news |
| Fixture | The object contains data that occurs often and predictably |
| Forecast | The object contains opinion as to the outcome of a future event |
| From the Scene | The object contains a report from the scene of an event |
| Help us to Report | The news item is a call for readers to provide information that may help journalists to investigate a potential news story |
| History | The object content is based on previous rather than current events |
| Horoscope | Astrological forecasts |
| Interview | The object contains a report of a dialogue with a news source that gives it significant voice (includes Q and A) |
| Listing of facts | Detailed listing of facts related to a topic or a story |
| Music | The object contains music alone |
| Obituary | The object contains a narrative about an individual's life and achievements for publication after his or her death |
| Opinion | The object contains an editorial comment that reflects the views of the author |
| Polls and Surveys | The object contains numeric or other information produced as a result of questionnaires or interviews |
| Press Release | The object contains promotional material or information provided to a news organisation |
| Press-Digest | The object contains an editorial comment by another medium completely or in parts without significant journalistic changes |
| Profile | The object contains a description of the life or activity of a news subject (often a living individual) |
| Program | A news item giving lists of intended events and time to be covered by the news provider. Each program covers a day, a week, a month or a year. The covered period is referenced as a keyword |
| Question and Answer Session | The object contains the interviewer and subject questions and answers |
| Quote | The object contains a one or two sentence verbatim direct quote |
| Raw Sound | The object contains unedited sounds |
| Response to a Question | The object contains a reply to a question |
| Results Listings and Statistics | The object contains alphanumeric data suitable for presentation in tabular form |
| Retrospective | The object contains material that looks back on a specific (generally long) period of time such as a season, quarter, year or decade |
| Review | The object contains a critique of a creative activity or service (for example a book, a film or a restaurant) |
| Satire | Uses exaggeration, irony, or humor to make a point; not intended to be understood as factual |
| Scener | The object contains a description of the event circumstances |
| Side bar and supporting information | Related story that provides additional context or insight into a news event |
| Special Report | In-depth examination of a single subject requiring extensive research and usually presented at great length, either as a single item or as a series of items |
| Sponsored | Content is produced on behalf of an organization or individual that has paid the news provider for production and may approve content publication |
| Summary | Single item synopsis of a number of generally unrelated news stories |
| Supported | Content is produced with financial support from an organization or individual, yet not approved by the underwriter before or after publication |
| Synopsis | The object contains a condensed version of a single news item |
| Text only | The object contains a transcription of text |
| Transcript and Verbatim | A word for word report of a discussion or briefing |
| Update | The object contains an intraday snapshot (as for electronic services) of a single news subject |
| Voicer | Content is only voice |
| Wrap | Complete summary of an event |
| Wrapup | Recap of a running story |
artworkOrObjects
[Optional ; Repeatable]

"artworkOrObjects": [
  {
    "title": "string",
    "contentDescription": "string",
    "physicalDescription": "string",
    "creatorNames": [
      "string"
    ],
    "creatorIdentifiers": [
      "string"
    ],
    "contributionDescription": "string",
    "stylePeriod": [
      "string"
    ],
    "dateCreated": "2023-04-11T15:06:09Z",
    "circaDateCreated": "string",
    "source": "string",
    "sourceInventoryNr": "string",
    "sourceInventoryUrl": "http://example.com",
    "currentCopyrightOwnerName": "string",
    "currentCopyrightOwnerIdentifier": "http://example.com",
    "copyrightNotice": "string",
    "currentLicensorName": "string",
    "currentLicensorIdentifier": "http://example.com"
  }
]
title
[Optional ; Not Repeatable ; String]
A human readable name of the object or artwork shown in the image.
contentDescription
[Optional ; Not Repeatable ; String]
A textual description of the content depicted in the object or artwork.
physicalDescription
[Optional ; Not Repeatable ; String]
A textual description of the physical characteristics of the artwork or object, without reference to the content depicted. This would be used to describe the object type, materials, techniques, and measurements.
creatorNames
[Optional ; Repeatable ; String]
The name of the person(s) (possibly an organization) who created the object or artwork shown in the image.
creatorIdentifiers
[Optional ; Repeatable ; String]
One or multiple globally unique identifier(s) for the artist who created the artwork or object shown in the image. This could be an identifier issued by an online registry of persons or companies. Make sure to enter these identifiers in the exact same sequence as the names entered in the field creatorNames.
contributionDescription
[Optional ; Not Repeatable ; String]
A description of any contributions made to the artwork or object. It should include the type, date and location of contribution, and details about the contributor.
stylePeriod
[Optional ; Repeatable ; String]
The style, historical or artistic period, movement, group, or school whose characteristics are represented in the artwork or object. It is advised to take the terms from a Controlled Vocabulary.
dateCreated
[Optional ; Not Repeatable ; String]
The date, and optionally the time, the artwork or object shown in the image was created.
circaDateCreated
[Optional ; Not Repeatable ; String]
The approximate date or range of dates associated with the creation and production of an artwork or object or its components.
source
[Optional ; Not Repeatable ; String]
The name of the organization or body holding and registering the artwork or object in this image for inventory purposes.
sourceInventoryNr
[Optional ; Not Repeatable ; String]
The inventory number issued by the organization or body holding and registering the artwork or object in the image.
sourceInventoryUrl
[Optional ; Not Repeatable ; String]
A reference URL for the metadata record of the inventory maintained by the Source.
currentCopyrightOwnerName
[Optional ; Not Repeatable ; String]
The name of the current owner of the copyright of the artwork or object.
currentCopyrightOwnerIdentifier
[Optional ; Not Repeatable ; String]
A globally unique identifier for the current copyright owner, e.g. issued by an online registry of persons or companies.
copyrightNotice
[Optional ; Not Repeatable ; String]
Any necessary copyright notice for claiming the intellectual property of the artwork or object in the image; it should identify the current owner of the copyright of this work and the associated intellectual property rights.
currentLicensorName
[Optional ; Not Repeatable ; String]
Name of the current licensor of the artwork or object.
currentLicensorIdentifier
[Optional ; Not Repeatable ; String]
A globally unique identifier for the current licensor, e.g. issued by an online registry of persons or companies.
personInImageNames
[Optional ; Repeatable ; String]

"personInImageNames": [
  "string"
]

This repeatable element is used to list the name(s) of the person(s) shown in the image.
personsShown
[Optional ; Repeatable]

"personsShown": [
  {
    "name": "string",
    "description": "string",
    "identifiers": [
      "http://example.com"
    ],
    "characteristics": [
      {
        "cvId": "http://example.com",
        "cvTermName": "string",
        "cvTermId": "http://example.com",
        "cvTermRefinedAbout": "http://example.com"
      }
    ]
  }
]

name
[Optional ; Not Repeatable ; String]
The name of the person shown in the image.
description
[Optional ; Not Repeatable ; String]
A textual description of the person shown in the image.
identifiers
[Optional ; Repeatable ; String]
Globally unique identifier(s) of the person shown in the image.
characteristics
[Optional ; Repeatable]
Characteristics of the person shown in the image, expressed by terms from a Controlled Vocabulary.
cvId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the Controlled Vocabulary the term is from.
cvTermName
[Optional ; Not Repeatable ; String]
The natural language name of the term from a Controlled Vocabulary.
cvTermId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the term from a Controlled Vocabulary.
cvTermRefinedAbout
[Optional ; Not Repeatable ; String]
A refinement of the 'about' relationship of the term with the content of the image.
modelAges
[Optional ; Repeatable ; Numeric]

"modelAges": [
  0
]

Age of the human model(s) at the time the image was taken. Be aware of any legal implications of providing ages for young models. Ages below 18 years should not be included.
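As a minimal illustration of the rule above (the function name is an assumption, not part of the schema), a curation script could drop the ages of minors before populating modelAges:

```python
def publishable_model_ages(ages):
    """Keep only ages that may be recorded in modelAges (18 or older)."""
    return [age for age in ages if age >= 18]

print(publishable_model_ages([17, 23, 41]))  # [23, 41]
```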
additionalModelInfo
[Optional ; Not Repeatable ; String]
Information about other facets of the model(s).
minorModelAgeDisclosure
[Optional ; Not Repeatable ; String]
The age of the youngest model pictured in the image, at the time the image was created. This information is not intended to be displayed publicly; it is intended to be used as a filter for inclusion/exclusion of images in catalogs and dissemination processes.
modelReleaseDocuments
[Optional ; Repeatable ; String]

"modelReleaseDocuments": [
  "string"
]

The identifier associated with each Model Release.
modelReleaseStatus
[Optional ; Not Repeatable]

"modelReleaseStatus": {
  "cvId": "http://example.com",
  "cvTermName": "string",
  "cvTermId": "http://example.com",
  "cvTermRefinedAbout": "http://example.com"
}
cvId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the Controlled Vocabulary the term is from.
cvTermName
[Optional ; Not Repeatable ; String]
The natural language name of the term from a Controlled Vocabulary.
cvTermId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the term from a Controlled Vocabulary.
cvTermRefinedAbout
[Optional ; Not Repeatable ; String]
A refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary.
organisationInImageCodes
[Optional ; Repeatable ; String]

"organisationInImageCodes": [
  "string"
]

The code, taken from a controlled vocabulary, used to identify the organization or company featured in the image. For example, a stock ticker symbol may be used. Enter an identifier for the controlled vocabulary, then a colon, and finally the code from the vocabulary assigned to the organization (e.g. nasdaq:companyA).
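The vocabulary:code convention above lends itself to a simple programmatic check; a sketch (the function name is illustrative, not part of the schema):

```python
def parse_org_code(entry):
    """Split an organisationInImageCodes entry into (vocabulary, code)."""
    vocabulary, sep, code = entry.partition(":")
    if not sep or not vocabulary or not code:
        raise ValueError(f"expected '<vocabulary>:<code>', got {entry!r}")
    return vocabulary, code

print(parse_org_code("nasdaq:companyA"))  # ('nasdaq', 'companyA')
```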
organisationInImageNames
[Optional ; Repeatable ; String]

"organisationInImageNames": [
  "string"
]

The name of the organization or company which is featured in the image.
productsShown
[Optional ; Repeatable]

"productsShown": [
  {
    "description": "string",
    "gtin": "string",
    "name": "string"
  }
]

description
[Optional ; Not Repeatable ; String]
A textual description of the product.
gtin
[Optional ; Not Repeatable ; String]
The Global Trade Item Number (GTIN) of the product (GTIN-8 to GTIN-14 codes can be used).
name
[Optional ; Not Repeatable ; String]
The name of the product.
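GTIN codes carry a check digit, so a catalog can verify gtin values before storing them. A minimal sketch of the standard GS1 modulo-10 check (the function name is an assumption):

```python
def valid_gtin(gtin):
    """Validate a GTIN-8 to GTIN-14 code using the GS1 check digit."""
    if not gtin.isdigit() or not 8 <= len(gtin) <= 14:
        return False
    # Pad to 14 digits; weights alternate 3,1,3,... from the left,
    # and the weighted sum plus the check digit must be a multiple of 10.
    digits = [int(d) for d in gtin.zfill(14)]
    weighted = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(digits[:-1]))
    return (weighted + digits[-1]) % 10 == 0

print(valid_gtin("4006381333931"))  # True  (valid EAN-13)
print(valid_gtin("4006381333932"))  # False (wrong check digit)
```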
maxAvailHeight
[Optional ; Not Repeatable ; Numeric]
The maximum available height in pixels of the original photo from which this photo has been derived by downsizing.
maxAvailWidth
[Optional ; Not Repeatable ; Numeric]
The maximum available width in pixels of the original photo from which this photo has been derived by downsizing.
propertyReleaseStatus
[Optional ; Not Repeatable]

"propertyReleaseStatus": {
  "cvId": "http://example.com",
  "cvTermName": "string",
  "cvTermId": "http://example.com",
  "cvTermRefinedAbout": "http://example.com"
}

This summarizes the availability and scope of property releases authorizing usage of the properties appearing in the photograph. One value should be selected from a controlled vocabulary. It is recommended to apply the value PR-UPR very carefully and to check the wording of the property release thoroughly before applying it.
cvId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the Controlled Vocabulary the term is from.
cvTermName
[Optional ; Not Repeatable ; String]
The natural language name of the term from a Controlled Vocabulary.
cvTermId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the term from a Controlled Vocabulary.
cvTermRefinedAbout
[Optional ; Not Repeatable ; String]
A refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary.
propertyReleaseDocuments
[Optional ; Repeatable ; String]

"propertyReleaseDocuments": [
  "string"
]

Optional identifier associated with each Property Release.
aboutCvTerms
[Optional ; Repeatable]

"aboutCvTerms": [
  {
    "cvId": "http://example.com",
    "cvTermName": "string",
    "cvTermId": "http://example.com",
    "cvTermRefinedAbout": "http://example.com"
  }
]

One or more topics, themes or entities the content is about, each one expressed by a term from a controlled vocabulary.
cvId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the Controlled Vocabulary the term is from.
cvTermName
[Optional ; Not Repeatable ; String]
The natural language name of the term from a Controlled Vocabulary.
cvTermId
[Optional ; Not Repeatable ; String]
The globally unique identifier of the term from a Controlled Vocabulary.
cvTermRefinedAbout
[Optional ; Not Repeatable ; String]
A refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary.
The IPTC elements are followed by a small set of common elements: see license, tags, and album in the section Additional elements.
We introduced the Dublin Core Metadata Initiative (DCMI) specification in chapter 3 - Documents. It contains 15 core elements, which are generic and versatile enough to be used for documenting different types of resources. Other elements can be added to the specification to increase its relevancy for specific uses. In the schema we recommend for the documentation of publications, we added elements inspired by the MARC 21 standard. We take a similar approach for the use of the Dublin Core for documenting images, by adding elements inspired by the ImageObject schema from schema.org to the 15 elements.
The fifteen elements, with their definitions extracted from the Dublin Core website, are the following:
| Element name | Description |
|---|---|
| identifier | An unambiguous reference to the resource within a given context. |
| type | The nature or genre of the resource. |
| title | A name given to the resource. |
| description | An account of the resource. |
| subject | The topic of the resource. |
| creator | An entity primarily responsible for making the resource. |
| contributor | An entity responsible for making contributions to the resource. |
| publisher | An entity responsible for making the resource available. |
| date | A point or period of time associated with an event in the life cycle of the resource. |
| coverage | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. |
| format | The file format, physical medium, or dimensions of the resource. |
| language | A language of the resource. |
| relation | A related resource. |
| rights | Information about rights held in and over the resource. |
| source | A related resource from which the described resource is derived. |
We do not use the identifier element, as we already have a unique identifier in the common element idno.
We also added a few elements to the schema which are not part of the core list of the DCMI.
The common additional elements license, album and tags also complement the DCMI metadata (see the section Additional elements).
We describe below how the DCMI elements are used to document images.
dcmi
[Optional, Not repeatable]
Users of the schema will choose either IPTC or Dublin Core (DCMI), not both, to document their images. If the choice is DCMI, the elements under dcmi will be used.

"dcmi": {
+"type": "image",
+ "title": "string",
+ "caption": "string",
+ "description": "string",
+ "topics": [],
+ "keywords": [],
+ "creator": "string",
+ "contributor": "string",
+ "publisher": "string",
+ "date": "string",
+ "country": [],
+ "coverage": "string",
+ "gps": {},
+ "format": "string",
+ "languages": [],
+ "relations": [],
+ "rights": "string",
+ "source": "string",
+ "note": "string"
+ }
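Because a dcmi block is plain JSON, it can be assembled and serialized with standard tools; a sketch with illustrative values (not a real catalog record):

```python
import json

# A minimal "dcmi" block for an image; all values are invented for the example.
dcmi = {
    "type": "image",
    "title": "Market day",
    "caption": "Vendors at a fruit stall",
    "creator": "Jane Doe",
    "date": "2019-07-15",
    "country": [{"name": "Ghana", "code": "GHA"}],
    "languages": [{"name": "English", "code": "eng"}],
}
print(json.dumps({"dcmi": dcmi}, indent=2))
```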
type
[Required, Not Repeatable, String]
The Dublin Core schema is flexible and versatile, and can be used to document different types of resources. This element documents the type of resource being described. The DCMI provides a list of suggested categories, including "image", which is the relevant type to be entered here. Some users may want to be more specific in the description of the type of resource, for example distinguishing color from black & white images. This distinction should not be made in this element; another element can be used for that purpose (like tags and tag groups).
title
[Optional, Not Repeatable, String]
The title of the photo.
caption
[Optional, Not Repeatable, String]
A caption for the photo.
description
[Optional, Not Repeatable, String]
A brief description of the content depicted in the image. This element will typically provide more detailed information than the title or caption. Note that other elements can be used to provide a more specific and "itemized" description of an image; the element keywords, for example, can be used to list labels associated with an image (possibly generated in an automated manner using machine learning tools).
topics
[Optional ; Repeatable]
The topics field indicates the broad substantive topic(s) that the image represents. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) thesaurus.

"topics": [
  {
    "id": "string",
    "name": "string",
    "parent_id": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
id
[Optional ; Not repeatable ; String]
The unique identifier of the topic. It can be a sequential number, or the ID of the topic in a controlled vocabulary.
name
[Required ; Not repeatable ; String]
The label of the topic associated with the data.
parent_id
[Optional ; Not repeatable ; String]
When a hierarchical (nested) controlled vocabulary is used, the parent_id field can be used to indicate a higher-level topic to which this topic belongs.
vocabulary
[Optional ; Not repeatable ; String]
The name of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
A link to the controlled vocabulary mentioned in the field vocabulary.
keywords
[Optional ; Repeatable]
Words or phrases that describe salient aspects of the image content. Can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed. Keywords should be selected from a standard thesaurus, preferably an international, multilingual thesaurus.

"keywords": [
  {
    "name": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]

name
[Required ; Not repeatable ; String]
The keyword (or phrase). Keywords summarize the content or subject matter of the image.
vocabulary
[Optional ; Not repeatable ; String]
The controlled vocabulary from which the keyword is extracted, if any.
uri
[Optional ; Not repeatable ; String]
The URI of the controlled vocabulary used, if any.
creator
[Optional, Not Repeatable, String]
The name of the person (or organization) who took the photo or created the image.
contributor
[Optional, Not Repeatable, String]
The contributor could be a person or an organization, possibly a sponsoring organization.
publisher
[Optional, Not Repeatable, String]
The person or organization who publishes the image.
date
[Optional, Not Repeatable, String]
The date when the photo was taken / the image was created, preferably entered in ISO 8601 format.
country
[Optional, Repeatable]

"country": [
  {
    "name": "string",
    "code": "string"
  }
]

name
[Optional, Not Repeatable, String]
The name of the country/economy where the photo was taken.
code
[Optional, Not Repeatable, String]
The code of the country/economy mentioned in name. This will preferably be the ISO country code.
coverage
[Optional, Not Repeatable, String]
In the Dublin Core, the coverage can be either temporal or geographic. In this schema, coverage is used to document the geographic coverage of the image. This element complements the country element, and allows more specific information to be provided.
gps
[Optional, Not Repeatable]
The geographic location where the photo was taken. Some digital cameras equipped with GPS can, when the option is activated, capture and store in the EXIF metadata the exact geographic location where the photo was taken.

"gps": {
  "latitude": -90,
  "longitude": -180,
  "altitude": 0
}

latitude
[Optional, Not Repeatable, Numeric]
The latitude of the geographic location where the photo was taken.
longitude
[Optional, Not Repeatable, Numeric]
The longitude of the geographic location where the photo was taken.
altitude
[Optional, Not Repeatable, Numeric]
The altitude of the geographic location where the photo was taken.
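Since latitude and longitude have bounded ranges (-90 to 90 and -180 to 180, as suggested by the example values above), an ingest script can reject malformed coordinates; a minimal sketch (the function name is illustrative):

```python
def valid_gps(latitude, longitude):
    """Check that a coordinate pair falls within the valid ranges."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180

print(valid_gps(5.56, -0.20))  # True
print(valid_gps(91.0, 0.0))    # False
```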
format
[Optional, Not Repeatable, String]
This refers to the image file format. It is typically expressed using a MIME type.
languages
[Optional, Repeatable]
The language(s) in which the image metadata (caption, title) are provided. This is a block of two elements (at least one must be provided for each language).

"languages": [
  {
    "name": "string",
    "code": "string"
  }
]

name
[Optional ; Not repeatable ; String]
The name of the language.
code
[Optional ; Not repeatable ; String]
The code of the language. The use of ISO 639-2 (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.
relations
[Optional, Repeatable]
A list of related resources (images or resources of other types).

"relations": [
  {
    "name": "string",
    "type": "isPartOf",
    "uri": "string"
  }
]
name
[Optional ; Not repeatable ; String]
The name (title) of the related resource.
type
[Optional ; Not repeatable ; String]
A brief description of the type of relation. A controlled vocabulary could be used.
uri
[Optional ; Not repeatable ; String]
A link to the related resource being described.
rights
[Optional, Not Repeatable, String]
The copyright statement for the photograph. The license is documented in a separate (common) element.
source
[Optional, Not Repeatable, String]
A related resource from which the described image is derived.
note
[Optional, Not Repeatable, String]
Any additional information on the image not captured in one of the other metadata elements.
Two elements are added to the image_description section of the schema. They apply both to the IPTC and to the DCMI options.
license
[Optional ; Repeatable]

"license": [
  {
    "name": "string",
    "uri": "string"
  }
]

name
[Optional ; Not Repeatable ; String]
The name of the license.
uri
[Optional ; Not Repeatable ; String]
A URL where detailed information on the license / terms of use can be found.
album
[Optional ; Repeatable]
If your catalog contains many images, you will likely want to group them by album. Albums are collections of images organized by theme, period, location, photographer, or other criteria. One image can belong to more than one album. Albums are thus "virtual collections".

"album": [
  {
    "name": "string",
    "description": "string",
    "owner": "string",
    "uri": "string"
  }
]

name
[Optional ; Not Repeatable ; String]
A short name (label) given to the album.
description
[Optional ; Not Repeatable ; String]
A brief description of the album.
owner
[Optional ; Not Repeatable ; String]
Identification of the owner/custodian of the album. This can be the name of a person or an organization.
uri
[Optional ; Not Repeatable ; String]
A URL for the album.
provenance
[Optional ; Repeatable]
Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.

"provenance": [
  {
    "origin_description": {
      "harvest_date": "string",
      "altered": true,
      "base_url": "string",
      "identifier": "string",
      "date_stamp": "string",
      "metadata_namespace": "string"
    }
  }
]
origin_description
[Required ; Not repeatable]
The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
harvest_date
[Required ; Not repeatable ; String]
The date and time the metadata were harvested, preferably entered in ISO 8601 format.
altered
[Optional ; Not repeatable ; Boolean]
Indicates whether the harvested metadata have been altered before being re-published. Some elements (such as the unique identifier idno in the Study Description / Title Statement section) will be modified when published in a new catalog.
base_url
[Required ; Not repeatable ; String]
The URL of the source catalog from which the metadata were harvested.
identifier
[Optional ; Not repeatable ; String]
The identifier of the resource (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
date_stamp
[Optional ; Not repeatable ; String]
The date stamp of the metadata record in the source catalog.
metadata_namespace
[Optional ; Not repeatable ; String]
The namespace of the harvested metadata.
tags
[Optional ; Repeatable]
Tags, particularly when organized in tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R.

"tags": [
  {
    "tag": "string",
    "tag_group": "string"
  }
]

tag
[Required ; Not repeatable ; String]
A user-defined tag (word or short phrase).
tag_group
[Optional ; Not repeatable ; String]
The name of the tag group the tag belongs to, if any.
lda_topics
[Optional ; Not repeatable]

"lda_topics": [
  {
    "model_info": [
      {
        "source": "string",
        "author": "string",
        "version": "string",
        "model_id": "string",
        "nb_topics": 0,
        "description": "string",
        "corpus": "string",
        "uri": "string"
      }
    ],
    "topic_description": [
      {
        "topic_id": null,
        "topic_score": null,
        "topic_label": "string",
        "topic_words": [
          {
            "word": "string",
            "word_weight": 0
          }
        ]
      }
    ]
  }
]
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the extraction of topics using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference provides the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
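This property gives a simple sanity check for stored topic compositions; a sketch (the function name and tolerance are assumptions):

```python
def valid_topic_composition(topic_scores, tol=1e-6):
    """The inferred topic shares of one document should sum to 1."""
    return abs(sum(topic_scores) - 1.0) <= tol

print(valid_topic_composition([0.32, 0.24, 0.22, 0.11, 0.11]))  # True
print(valid_topic_composition([0.5, 0.2]))                      # False
```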

The metadata element lda_topics is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition.
The lda_topics element includes the following metadata fields:
model_info
[Optional ; Not repeatable]
Information on the LDA model.
source
[Optional ; Not repeatable ; String]
The source of the model, e.g. the organization that trained it.
author
[Optional ; Not repeatable ; String]
The author(s) of the model.
version
[Optional ; Not repeatable ; String]
The version of the model, which could be defined as a date or a number.
model_id
[Optional ; Not repeatable ; String]
The unique ID given to the model.
nb_topics
[Optional ; Not repeatable ; Numeric]
The number of topics in the model (a parameter set when the model is trained).
description
[Optional ; Not repeatable ; String]
A brief description of the model.
corpus
[Optional ; Not repeatable ; String]
A brief description of the corpus on which the model was trained.
uri
[Optional ; Not repeatable ; String]
A link to a web page where additional information on the model is available.
topic_description
[Optional ; Repeatable]
The topic composition of the document.
topic_id
[Optional ; Not repeatable ; String]
The identifier of the topic.
topic_score
[Optional ; Not repeatable ; Numeric]
The share of the topic in the document.
topic_label
[Optional ; Not repeatable ; String]
The label of the topic, if any.
topic_words
[Optional ; Not repeatable]
The list of keywords describing the topic.
word
[Optional ; Not repeatable ; String]
A word from the list of keywords describing the topic.
word_weight
[Optional ; Not repeatable ; Numeric]
The weight of the word in the definition of the topic.

The example below shows how this information can be represented in R (the object name is illustrative):

my_image = list(

  lda_topics = list(

    list(

      model_info = list(
        list(source      = "World Bank, Development Data Group",
             author      = "A.S.",
             version     = "2021-06-22",
             model_id    = "Mallet_WB_75",
             nb_topics   = 75,
             description = "LDA model, 75 topics, trained on Mallet",
             corpus      = "World Bank Documents and Reports (1950-2021)",
             uri         = "")
      ),

      topic_description = list(

        list(topic_id    = "topic_27",
             topic_score = 32,
             topic_label = "Education",
             topic_words = list(list(word = "school",    word_weight = ""),
                                list(word = "teacher",   word_weight = ""),
                                list(word = "student",   word_weight = ""),
                                list(word = "education", word_weight = ""),
                                list(word = "grade",     word_weight = ""))),

        list(topic_id    = "topic_8",
             topic_score = 24,
             topic_label = "Gender",
             topic_words = list(list(word = "women",  word_weight = ""),
                                list(word = "gender", word_weight = ""),
                                list(word = "man",    word_weight = ""),
                                list(word = "female", word_weight = ""),
                                list(word = "male",   word_weight = ""))),

        list(topic_id    = "topic_39",
             topic_score = 22,
             topic_label = "Forced displacement",
             topic_words = list(list(word = "refugee",   word_weight = ""),
                                list(word = "programme", word_weight = ""),
                                list(word = "country",   word_weight = ""),
                                list(word = "migration", word_weight = ""),
                                list(word = "migrant",   word_weight = ""))),

        list(topic_id    = "topic_40",
             topic_score = 11,
             topic_label = "Development policies",
             topic_words = list(list(word = "development", word_weight = ""),
                                list(word = "policy",      word_weight = ""),
                                list(word = "national",    word_weight = ""),
                                list(word = "strategy",    word_weight = ""),
                                list(word = "activity",    word_weight = "")))

      )
    )
  )
)
The information provided by LDA models can be used to build a “filter by topic composition” tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic.
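A minimal sketch of such a filter, assuming the catalog entries carry their inferred topic shares as simple dictionaries (the document identifiers, topic shares, and function name below are hypothetical):

```python
# Filter documents by topic composition (hypothetical catalog entries)
docs = [
    {"id": "doc1", "topics": {"topic_27": 0.32, "topic_8": 0.24}},
    {"id": "doc2", "topics": {"topic_27": 0.05, "topic_8": 0.60}},
    {"id": "doc3", "topics": {"topic_27": 0.40, "topic_39": 0.21}},
]

def filter_by_topics(docs, thresholds):
    """Keep documents that meet every minimum topic-share threshold."""
    return [d["id"] for d in docs
            if all(d["topics"].get(t, 0) >= s for t, s in thresholds.items())]

print(filter_by_topics(docs, {"topic_27": 0.30}))  # ['doc1', 'doc3']
```

A production implementation would run this kind of query against the catalog's search index rather than in application code, but the logic is the same.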
+embeddings
[Optional ; Repeatable]
In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones.

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by a single model.
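To illustrate how stored vectors support semantic search, the sketch below computes the cosine similarity between a query vector and document vectors, and returns the closest document. The vectors here are very short, hypothetical stand-ins; real embeddings would have hundreds of dimensions.

```python
# Identify the semantically closest document by cosine similarity
# (the vectors below are short, hypothetical stand-ins for real embeddings)
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.9, 0.1, 0.3]
catalog = {
    "doc_a": [0.8, 0.2, 0.4],
    "doc_b": [0.1, 0.9, 0.2],
}

closest = max(catalog, key=lambda k: cosine_similarity(query, catalog[k]))
print(closest)  # doc_a
```

Dedicated vector databases such as Milvus implement this same nearest-neighbor search at scale, with approximate indexing.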
+"embeddings": [
+{
+ "id": "string",
+ "description": "string",
+ "date": "string",
+ "vector": null
+ }
+ ]
The `embeddings` element contains four metadata fields:

- `id` [Optional ; Not repeatable ; String]
  A unique identifier of the word embedding model used to generate the vector.
- `description` [Optional ; Not repeatable ; String]
  A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- `date` [Optional ; Not repeatable ; String]
  The date the model was trained (or a version date for the model).
- `vector` [Required ; Not repeatable ; Object]
  The numeric vector representing the document, provided as an array of numbers, e.g., `[1,4,3,5,7,9]`.
additional
[Optional ; Not repeatable]
The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
We selected an image from the World Bank Flickr collection, available at https://www.flickr.com/photos/worldbank/8120361619/in/album-72157648790716931/. Some metadata is provided with the photo.

The image is made available in multiple formats. We assume that we only want to provide access to the small, medium, and original versions of the image in our NADA catalog. We also assume that, instead of uploading the images to our catalog server to make them available directly from our catalog, we want to provide links to the images in the source repository (Flickr in this case).
Using R
+library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header = FALSE, stringsAsFactors = FALSE)
set_api_key(my_keys[1, 1])  # the API key is read from a file, not hard-coded
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_images/")
+# Download image files from Flickr (different resolutions)
+
+download.file("https://live.staticflickr.com/4858/31953178928_77e4d7abae_o_d.jpg",
+destfile = "img_001_original.jpg", mode = "wb")
+
+download.file("https://live.staticflickr.com/4858/31953178928_44abb01418_w_d.jpg",
+destfile = "img_001_small.jpg", mode = "wb")
+
+# Generate image metadata (using the IPTC metadata elements)
+
my_image <- list(
+ metadata_information = list(
+
+ producers = list(name = "OD"),
+
+ production_date = "2022-01-10"
+
+
+ ),
+ idno = "image_001",
+
+ image_description = list(
+
+ iptc = list(
+
+ photoVideoMetadataIPTC = list(
+
+ title = "Man fetching water, Afghanistan",
+
+ imageSupplierImageId = "Image_001",
+
+ headline = "Residents get water",
+
+ dateCreated = "2008-09-20T00:00:00Z",
+
+ creatorNames = list("Sofie Tesson, Taimani Films"),
+
+ description = "View of villagers, getting some water.
+ World Bank Emergency Horticulture and Livestock Project",
+
+ digitalImageGuid = "72157648790716931",
+
+ locationsShown = list(
+ list(countryCode = "AFG", countryName = "Afghanistan")
+
+ ),
+ keywords = list("Water and sanitation"),
+
      sceneCodes = list("010600", "011000", "011100", "011900"),
+
+ sceneCodesLabelled = list(
+
+ list(code = "010600",
+ label = "single",
+ description = "A view of only one person, object or animal."),
+
+ list(code = "011000",
+ label = "general view",
+ description = "An overall view of the subject and its surrounds"),
+
+list(code = "011100",
+ label = "panoramic view",
+ description = "A panoramic or wide angle view of a subject and its surrounds"),
+
+ list(code = "011900",
+ label = "action",
+ description = "Subject in motion")
+
+
+ ),
      subjectCodes = list("06000000", "09000000", "14000000"),
+
+ subjectCodesLabelled = list(
+
+ list(code = "06000000",
+ label = "environmental issue",
+ description = "All aspects of protection, damage, and condition of the ecosystem of the planet earth and its surroundings."),
+
+ list(code = "09000000",
+ label = "labor",
+ description = "Social aspects, organizations, rules and conditions affecting the employment of human effort for the generation of wealth or provision of services and the economic support of the unemployed."),
+
+ list(code = "14000000",
+ label = "social issue",
+ description = "Aspects of the behavior of humans affecting the quality of life.")
+
+
+ ),
+ source = "World Bank",
+
+ supplier = list(
+ list(name = "World Bank")
+
+ )
+
+ )
+
+ ),
+license = list(
+ list(name = "Attribution 2.0 Generic (CC BY 2.0)",
+ uri = "https://creativecommons.org/licenses/by/2.0/")
+
+ ),
+ album = list(
+ list(name = "World Bank Projects in Afghanistan")
+
+ )
+
+ )
+
+ )
+# Publish the image metadata in the NADA catalog
+
image_add(idno = "image_001",
          metadata = my_image,
          repositoryid = "central",
          overwrite = "yes",
          published = 1,
          thumbnail = "img_001_small.jpg")  # use the small image as the thumbnail
+
+# Provide a link to the images in the originating repository, and upload files
+# (uploading files will make them available directly from the NADA catalog)
+
+external_resources_add(
+idno = "image_001",
+ dctype = "pic",
+ title = "Man fetching water, Afghanistan (Flickr link)",
+ file_path = "https://www.flickr.com/photos/water_alternatives/31953178928/in/photolist-QFAoS5",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+idno = "image_001",
+ dctype = "pic",
+ title = "Man fetching water, Afghanistan (original size)",
+ file_path = "img_001_original.jpg",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+idno = "image_001",
+ dctype = "pic",
+ title = "Man fetching water, Afghanistan (small size)",
+ file_path = "img_001_small.jpg",
+ overwrite = "yes"
+ )
Result in NADA
+The metadata, links, and images will be displayed in NADA.
+
Different views (mosaic, list, page views) are available. If the metadata contained a GPS location, a map showing the exact location where the photo was taken would also be displayed on the image page.
Using Python
+# Python script
We document the same image as in Example 1.
+Using R
+library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header = FALSE, stringsAsFactors = FALSE)
set_api_key(my_keys[1, 1])  # the API key is read from a file, not hard-coded
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_images/")
+# Download image files from Flickr (different resolutions)
+
+download.file("https://live.staticflickr.com/4858/31953178928_77e4d7abae_o_d.jpg",
+destfile = "img_001_original.jpg", mode = "wb")
+
+download.file("https://live.staticflickr.com/4858/31953178928_44abb01418_w_d.jpg",
+destfile = "img_001_small.jpg", mode = "wb")
+
+# Generate image metadata (using the DCMI metadata elements)
+
pic_desc <- list(
+ metadata_information = list(
+
+ producers = list(name = "OD"),
+
+ production_date = "2022-01-10"
+
+
+ ),
+ idno = "image_001",
+
+ image_description = list(
+
+ dcmi = list(
+
+ identifier = "72157648790716931",
+
+ type = "image",
+
+ title = "Man fetching water, Afghanistan",
+
+ caption = "Residents get water",
+
+ description = "View of villagers, getting some water.
+ World Bank Emergency Horticulture and Livestock Project",
+
+ subject = "",
+
+ topics = list(),
+
+ keywords = list(
+ list(name = "water and sanitation")
+
+ ),
+ creator = "Sofie Tesson, Taimani Films",
+
+ publisher = "World Bank",
+
+ date = "2008-09-20T00:00:00Z",
+
+ country = list(name = "Afghanistan", code = "AFG"),
+
+ language = "English"
+
+
+ ),
+ license = list(
+ list(name = "Attribution 2.0 Generic (CC BY 2.0)",
+ uri = "https://creativecommons.org/licenses/by/2.0/")),
+
+ album = list(
+ list(name = "World Bank Projects in Afghanistan")
+
+ )
+
+ )
+
+ )
+# Publish the image metadata in the NADA catalog
+
image_add(idno = "image_001",
          metadata = pic_desc,
          repositoryid = "central",
          overwrite = "yes",
          published = 1,
          thumbnail = "img_001_small.jpg")  # use the small image as the thumbnail
+
+# Provide a link to the images in the originating repository, and upload files
+# (uploading files will make them available directly from the NADA catalog)
+
+external_resources_add(
+idno = "image_001",
+ dctype = "pic",
+ title = "Man fetching water, Afghanistan (Flickr link)",
+ file_path = "https://www.flickr.com/photos/water_alternatives/31953178928/in/photolist-QFAoS5",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+idno = "image_001",
+ dctype = "pic",
+ title = "Man fetching water, Afghanistan (original size)",
+ file_path = "img_001_original.jpg",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+idno = "image_001",
+ dctype = "pic",
+ title = "Man fetching water, Afghanistan (small size)",
+ file_path = "img_001_small.jpg",
+ overwrite = "yes"
+ )
Using Python
+# Python script
The schema we propose to document video files is a combination of elements extracted from the Dublin Core Metadata Initiative (DCMI) and from the VideoObject schema (from schema.org). This schema is very similar to the one we proposed for audio files (see Chapter 10).

The Dublin Core is a generic and versatile standard, which we also use (in an augmented form) for the documentation of Documents (Chapter 4), Images (Chapter 9), and Audio files (Chapter 10). It contains 15 core elements, to which we added a selection of elements from VideoObject. We also included the elements `keywords`, `topics`, `tags`, `provenance`, and `additional` that are found in other schemas documented in the Guide.

The resulting metadata schema is simple, but it contains the elements needed to document the resources and their content in a way that will foster their discoverability in data catalogs. Compliance with the VideoObject elements contributes to search engine optimization, as search engines like Google, Bing, and others “reward” metadata published in formats compatible with the schema.org recommendations.
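For illustration, here is a minimal (hypothetical) VideoObject description expressed as schema.org JSON-LD, the format search engines read from a catalog page; all values below are invented for the example:

```json
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Example video title",
  "description": "A short description of the video content.",
  "uploadDate": "2022-01-10",
  "duration": "PT4M30S",
  "contentUrl": "https://example.org/videos/example.mp4"
}
```

Embedding such markup in the catalog's HTML pages (in a script tag of type `application/ld+json`) is what allows search engines to recognize and index the resource as a video.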
+{
+"repositoryid": "string",
+ "published": 0,
+ "overwrite": "no",
+ "metadata_information": {},
+ "video_description": {},
+ "provenance": [],
+ "tags": [],
+ "lda_topics": [],
+ "embeddings": [],
+ "additional": { }
+ }
When published in a NADA catalog, the metadata related to video files will appear in a specific tab.
Videos typically come with limited metadata. To make them more discoverable, a transcription of the video content can be generated, stored, and indexed in the catalog. The metadata schema we propose includes an element transcription
that can store transcriptions (and possibly their automatically-generated translations) in the video metadata. Word embedding models and topic models can be applied to the transcriptions to further augment the metadata. This will significantly increase the discoverability of the resource, and offer the possibility to apply semantic searchability on video metadata.
Machine learning speech-to-text solutions are available (although not for all languages) to automatically generate transcriptions at low cost. These include applications and services such as Whisper by OpenAI, Microsoft Azure Speech to Text, and Amazon Transcribe. Open source solutions in Python also exist.
Transcriptions of videos published on YouTube are available online (the example below was extracted from https://www.youtube.com/watch?v=Axs8NPVYmms).
Note that some care must be taken when adding automatic speech transcriptions to your metadata, as transcriptions are not always perfect and may return unexpected results. This will be the case when the sound quality is low, or when the video includes sections in an unrecognized language (see the example below, of a video in English that includes a brief segment in Somali; in such cases the speech-to-text algorithm may attempt to transcribe text it does not recognize, returning invalid information).
The first three elements of the schema (`repositoryid`, `published`, and `overwrite`) are not part of the video metadata. They are parameters used to indicate how the video metadata will be published in a NADA catalog.
repositoryid
identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA.
published
: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.
overwrite
: Indicates whether metadata that may have been previously uploaded for the same video can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. Note that a video will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element video_description > idno
is the same.
metadata_information
[Optional ; Not Repeatable]
+The metadata information set is used to document the video metadata (not the video itself). This provides information useful for archiving purposes. This set is optional. It is recommended however to enter at least the identification and affiliation of the metadata producer, and the date of creation of the metadata. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so metadata produced by one organization can be found in other data centers.
+
"metadata_information": {
+"title": "string",
+ "idno": "string",
+ "producers": [
+ {
+ "name": "string",
+ "abbr": "string",
+ "affiliation": "string",
+ "role": "string"
+ }
+ ],
+ "production_date": "string",
+ "version": "string"
+ }
title
[Optional ; Not Repeatable ; String]
+The title of the video.
idno
[Optional ; Not Repeatable ; String]
A unique identifier for the metadata document (unique in the catalog, and ideally also unique globally). This is different from the video's unique ID (see the `idno` element in the `video_description` section below), although it is good practice to generate identifiers that maintain an easy connection between the metadata `idno` and the video `idno` found under `video_description`.
producers
[Optional ; Repeatable]
+This refers to the producer(s) of the metadata, NOT to the producer(s) of the video. This could for example be the data curator in a data center.
- `name` [Optional ; Not repeatable ; String]
- `abbr` [Optional ; Not repeatable ; String]
- `affiliation` [Optional ; Not repeatable ; String]
- `role` [Optional ; Not repeatable ; String]

production_date
[Optional ; Not repeatable ; String]
Date the metadata (not the video) was produced.
version
[Optional ; Not repeatable ; String]
Version of the metadata (not the version of the video).
video_description
[Required ; Not Repeatable]
+The video_description
section contains all elements that will be used to describe the video and its content. These are the elements that will be indexed and made searchable when published in a data catalog.
idno
[Mandatory, Not Repeatable ; String]
+idno
is an identification number that is used to uniquely identify a video in a catalog. It will also help users of the data cite the video properly. The best option is to obtain a Digital Object Identifier (DOI) for the video, as it will ensure that the ID is unique globally. Alternatively, it can be an identifier constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for a video (see identifiers
below). This element maps to the “identifier” element in the Dublin Core.
identifiers
[Optional ; Repeatable]
+
"identifiers": [
+{
+ "type": "string",
+ "identifier": "string"
+ }
+ ]
This element is used to enter video identifiers other than the `idno` element described above. It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in `idno` can be repeated here, which allows a “type” attribute to be attached to it.
- `type` [Optional ; Not repeatable ; String]
  The type of unique identifier, e.g., “DOI”.
- `identifier` [Required ; Not repeatable ; String]
  The identifier itself.
title
[Required ; Not repeatable ; String]
The title of the video. This element maps to the element caption in VideoObject.
alt_title
[Optional ; Not repeatable ; String]
An alias for the video title. This element maps to the element alternateName in VideoObject.
description
[Optional ; Not repeatable ; String]
A brief description of the video, typically about a paragraph long (around 150 to 250 words). This element maps to the element abstract in VideoObject.
genre
[Optional ; Repeatable ; String]
The genre of the video, broadcast channel or group. This is a VideoObject element. A controlled vocabulary can be used.
keywords
[Optional ; Repeatable]
+
"keywords": [
+{
+ "name": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
A list of keywords that provide information on the core content of the video. Keywords provide a convenient solution to improve the discoverability of the video, as it allows terms and phrases not found elsewhere in the video metadata to be indexed and to make the video discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the UNESCO Thesaurus. The list can combine keywords from multiple controlled vocabularies, and user-defined keywords.
+- name
[Required ; Not repeatable ; String]
+The keyword itself.
+- vocabulary
[Optional ; Not repeatable ; String]
+The controlled vocabulary (including version number or date) from which the keyword is extracted, if any.
+- uri
[Optional ; Not repeatable ; String]
+The URL of the controlled vocabulary from which the keyword is extracted, if any.
my_video <- list(
  # ... ,
+ video_description = list(
+ # ... ,
+
+ keywords = list(
+
+ list(name = "Migration",
+ vocabulary = "Unesco Thesaurus (June 2021)",
+ uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+
+ list(name = "Migrants",
+ vocabulary = "Unesco Thesaurus (June 2021)",
+ uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+
+ list(name = "Refugee",
+ vocabulary = "Unesco Thesaurus (June 2021)",
+ uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+
+ list(name = "Forced displacement"),
+
+ list(name = "Internally displaced population (IDP)")
+
+
+ ),
+ # ...
+
+ ),# ...
+ )
topics
[Optional ; Repeatable]

"topics": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
Information on the topics covered in the video. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification, a typology of topics available in 11 languages; or the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. Note that you may use more than one controlled vocabulary. This element is a block of five fields:
- `id` [Optional ; Not repeatable ; String]
- `name` [Required ; Not repeatable ; String]
- `parent_id` [Optional ; Not repeatable ; String]
- `vocabulary` [Optional ; Not repeatable ; String]
- `uri` [Optional ; Not repeatable ; String]

my_video <- list(
  # ... ,
+ video_description = list(
+ # ... ,
+
+ topics = list(
+
+ list(name = "Demography.Migration",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+
+ list(name = "Demography.Censuses",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+
+ list(id = "F22",
+ name = "International Migration",
+ parent_id = "F2 - International Factor Movements and International Business",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+
+ list(id = "O15",
+ name = "Human Resources - Human Development - Income Distribution - Migration",
+ parent_id = "O1 - Economic Development",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+
+ list(id = "O12",
+ name = "Microeconomic Analyses of Economic Development",
+ parent_id = "O1 - Economic Development",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+
+ list(id = "J61",
+ name = "Geographic Labor Mobility - Immigrant Workers",
+ parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
+
+
+ ),
+ # ...
+
  )
)
persons
[Optional ; Repeatable]

"persons": [
+{
+ "name": "string",
+ "role": "string"
+ }
+ ]
A list of persons who appear in the video.
+- name
[Required ; Not repeatable ; String]
+The name of the person.
+- role
[Optional ; Not repeatable, String]
The role of the person mentioned in `name`.
my_video <- list(
  metadata_information = list(
    # ...
  ),
  video_description = list(
    # ... ,
+
+ persons = list(
+
+ list(name = "John Smith",
+ role = "Keynote speaker"),
+
+ list(name = "Jane Doe",
+ role = "Debate moderator")
+
+
    ),
    # ...
  )
)
main_entity
[Optional ; Not repeatable ; String]
Indicates the primary entity described in the video. This element maps to the element mainEntity
in VideoObject.
date_created
[Optional, Not Repeatable ; String]
The date the video was created. It is recommended to enter the date in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the video is created refers to the date that the video was produced and considered ready for dissemination.
date_published
[Optional, Not Repeatable ; String]
The date the video was published. It is recommended to use the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
version
[Optional, Not Repeatable ; String]
The version of the video refers to the published version of the video.
status
[Optional ; Not repeatable, String]
The status of the video in terms of its stage in a lifecycle. A controlled vocabulary should be used. Example terms include {Incomplete, Draft, Published, Obsolete}. Some organizations define a set of terms for the stages of their publication lifecycle. This element maps to the element creativeWorkStatus in VideoObject.
country
[Optional ; Repeatable]
+
"country": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
The list of countries (or regions) covered by the video, if applicable. This refers to the content of the video, not to the country where the video was released. This is a repeatable block of two elements:
+- name
[Required ; Not repeatable ; String]
+The country/region name. Note that many organizations have their own policies on the naming of countries/regions/economies/territories, which data curators will have to comply with.
+- code
[Optional ; Not repeatable ; String]
+The country/region code (entered as a string, even for numeric codes). It is recommended to use a standard list of countries and regions, such as the ISO country list (ISO 3166).
+
spatial_coverage
[Optional ; Not repeatable ; String]
Indicates the place(s) which are depicted or described in the video. This element maps to the element contentLocation
in VideoObject. This element complements the `country` element described above. It can be used to qualify the geographic coverage of the video, in the form of free text.
content_reference_time
[Optional ; Not repeatable ; String]
The specific time described by the video, for works that emphasize a particular moment within an event. This element maps to the element contentReferenceTime
in VideoObject.
temporal_coverage
[Optional ; Not repeatable ; String]
Indicates the period that the video applies to, i.e. that it describes, either as a DateTime or as a textual string indicating a time period in ISO 8601 time interval format. This element maps to the element temporalCoverage
in VideoObject.
recorded_at
[Optional ; Not repeatable ; String]
This element maps to the element recordedAt
in VideoObject schema. It identifies the event where the video was recorded (e.g., a conference, or a demonstration).
audience
[Optional ; Not repeatable ; String]
A brief description of the intended audience of the video, i.e. the group for whom it was created.
bbox
[Optional ; Repeatable]

"bbox": [
+{
+ "west": "string",
+ "east": "string",
+ "south": "string",
+ "north": "string"
+ }
+ ]
This element is used to define one or multiple bounding box(es), which are the (rectangular) fundamental geometric description of the geographic coverage of the video. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the video’s geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search.
+- west
[Required ; Not repeatable ; String]
+West longitude of the box
+- east
[Required ; Not repeatable ; String]
+East longitude of the box
+- south
[Required ; Not repeatable ; String]
+South latitude of the box
+- north
[Required ; Not repeatable ; String]
+North latitude of the box
+
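As an illustration of this “first pass” of a coordinate-based search, the sketch below tests whether a point falls inside a bounding box stored with the schema's west/east/south/north fields (the coordinates below are hypothetical, and boxes crossing the antimeridian would need extra handling):

```python
# First-pass geographic search: test whether a point falls inside a bounding box
# (the schema stores coordinates as strings, so they are converted to numbers)

def in_bbox(lat, lon, bbox):
    """Return True if the point (lat, lon) lies inside the bounding box."""
    return (float(bbox["south"]) <= lat <= float(bbox["north"])
            and float(bbox["west"]) <= lon <= float(bbox["east"]))

# Hypothetical bounding box roughly covering Afghanistan
afg = {"west": "60.5", "east": "74.9", "south": "29.4", "north": "38.5"}

print(in_bbox(34.5, 69.2, afg))   # True  (a point near Kabul)
print(in_bbox(48.8, 2.35, afg))   # False (a point near Paris)
```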
language
[Optional, Repeatable]

"language": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
Most videos will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed. For the language code, ISO codes will preferably be used. The language refers to the language in which the video is published. This is a block of two elements (at least one must be provided for each language):
+- name
[Optional ; Not repeatable ; String]
+The name of the language.
+- code
[Optional ; Not repeatable ; String]
+The code of the language. The use of ISO 639-2 (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.
creator
[Optional, Not repeatable ; String]
Organization or person who created/authored the video.
production_company
[Optional, Not repeatable ; String]
The production company or studio responsible for producing the video. This element maps to the element productionCompany in VideoObject.
publisher
[Optional, Not Repeatable ; String]
The person or organization who published the video.

my_video <- list(
  # ... ,
  video_description = list(
    # ... ,
    publisher = "",   # name of the publishing organization
    # ...
  )
)
repository
[Optional ; Not repeatable ; String]
The name of the repository (organization).
contacts
[Optional, Repeatable]
Users of the video may need further clarification and information. This section may include the name, affiliation, email, and URI of one or multiple contact persons who can serve as resource persons regarding problems or questions raised by the user community. The `uri` attribute should be used to indicate a URN or URL for the homepage of the contact, and the `email` attribute to indicate a contact email address. It is recommended to avoid using the actual names of individuals; the information provided here should remain valid in the long term, so it is preferable to identify contact persons by a title or function. The same applies to the email field: ideally, a “generic” email address should be provided, as it is easy to configure a mail server so that all messages sent to the generic address are automatically forwarded to the relevant staff members.
+
"contacts": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "email": "string",
+ "telephone": "string",
+ "uri": "string"
+ }
+ ]
name
[Required, Not repeatable, String]
+Name of a person or unit (such as a data help desk). It will usually be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position.
role
[Optional, Not repeatable, String]
The specific role of `name` with regard to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact.
affiliation
[Optional, Not repeatable, String]
+Affiliation of the person/unit.
email
[Optional, Not repeatable, String]
+E-mail address of the person.
telephone
[Optional, Not repeatable, String]
A phone number that can be called to obtain information or provide feedback on the video. This should never be a personal phone number; a corporate number (typically of a data help desk) should be provided.
uri
[Optional, Not repeatable, String]
A link to a website where contact information for `name` can be found.
contributors
[Optional, Repeatable]
+
"contributors": [
+{
+ "name": "string",
+ "affiliation": "string",
+ "abbr": "string",
+ "role": "string",
+ "uri": "string"
+ }
+ ]
Identifies the person(s) and/or organization(s) who contributed to the production of the video. The role
attribute allows defining what the specific contribution of the identified person or organization was.
+- name
[Optional, Not Repeatable ; String]
+The name of the contributor (person or organization).
+- affiliation
[Optional, Not Repeatable ; String]
+The affiliation of the contributor.
+- abbr
[Optional, Not Repeatable ; String]
+The abbreviation for the institution which has been listed as the affiliation of the contributor.
+- role
[Optional, Not Repeatable ; String]
+The specific role of the contributor. This could for example be “Cameraman”, “Sound engineer”, etc.
+- uri
[Optional, Not Repeatable ; String]
+A URI (link to a website, or email address) for the contributor.
my_video = list(
  # ... ,
  video_description = list(
    # ... ,
    contributors = list(
      list(name = "",
           affiliation = "",
           abbr = "",
           role = "",
           uri = "")
    )
    # ...
  )
)
sponsors
[Optional ; Repeatable]
"sponsors": [
  {
    "name": "string",
    "abbr": "string",
    "grant": "string",
    "role": "string"
  }
]
This element is used to list the funders/sponsors of the video. If different funding agencies financed different stages of the production process, use the “role” attribute to distinguish them.
+- name
[Required ; Not repeatable ; String]
+The name of the sponsor (person or organization)
+- abbr
[Optional ; Not repeatable ; String]
+The abbreviation (acronym) of the sponsor.
+- grant
[Optional ; Not repeatable ; String]
+The grant (or contract) number.
+- role
[Optional ; Not repeatable ; String]
+The specific role of the sponsor.
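As a hedged illustration of the use of the role attribute to distinguish funders of different stages of the production, a sponsors block could look as follows (the agency names and grant number are fictitious):

```r
# A minimal sketch of a "sponsors" block; names and grant are fictitious.
# The "role" attribute distinguishes what each sponsor financed.
my_video = list(
  # ... ,
  video_description = list(
    # ... ,
    sponsors = list(
      list(name  = "Example Development Agency",
           abbr  = "EDA",
           grant = "GR-2021-001",
           role  = "Financed the production of the video"),
      list(name  = "Example Foundation",
           role  = "Financed the translation and dissemination")
    )
    # ...
  )
)
```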
translators
[Optional ; Repeatable]
"translators": [
  {
    "first_name": "string",
    "initial": "string",
    "last_name": "string",
    "affiliation": "string"
  }
]
Organization or person who translated (adapted) the video into different languages. This element maps to the element translator in VideoObject.
+- first_name
[Optional ; Not repeatable ; String]
+The first name of the translator.
+- initial
[Optional ; Not repeatable ; String]
+The initials of the translator.
+- last_name
[Optional ; Not repeatable ; String]
+The last name of the translator.
+- affiliation
[Optional ; Not repeatable ; String]
+The affiliation of the translator.
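A translators block could be filled as in this minimal sketch (the person and affiliation are hypothetical):

```r
# A minimal sketch of a "translators" block; person and affiliation
# are hypothetical.
my_video = list(
  # ... ,
  video_description = list(
    # ... ,
    translators = list(
      list(first_name  = "Jane",
           initial     = "J.",
           last_name   = "Doe",
           affiliation = "Translation Unit, National Library")
    )
    # ...
  )
)
```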
is_based_on
[Optional ; Not repeatable, String]
A resource from which this video is derived, or of which it is a modification or adaptation. This element maps to the element isBasedOn in VideoObject.
is_part_of
[Optional ; Not repeatable, String]
Indicates another video that this video is part of. This element maps to the element isPartOf in VideoObject.
relations
[Optional ; Repeatable, String]
+
"relations": [
+"string"
+ ]
Defines, as a free text field, the relation between the video being documented and other resources. This is a Dublin Core element.
video_provider
[Optional ; Not repeatable, String]
The person or organization who provides the video. This element maps to the element provider in VideoObject.
video_url
[Optional ; Not repeatable, String]
URL of the video. This element maps to the element url in VideoObject.
embed_url
[Optional ; Not repeatable, String]
A URL pointing to a player for a specific video. This element maps to the element embedUrl in VideoObject. For example, “https://www.youtube.com/embed/7Aif1xjstws”
To be embedded, a video must be hosted on a video sharing platform like YouTube (www.youtube.com). To obtain the embed link from YouTube, click the “Share” button, then “Embed”, and copy the content of the src element from the code displayed in the result box.
+
+
encoding_format
[Optional ; Not repeatable, String]
The video file format, typically expressed using a MIME format. This element corresponds to the “encodingFormat” element of VideoObject and maps to the element format of the Dublin Core.
duration
[Optional ; Not repeatable, String]
The duration of the item (movie, audio recording, event, etc.) in ISO 8601 format. This element is a VideoObject element.
ISO 8601 durations are expressed using the following format, where (n) is replaced by the value of each date and time element that follows it (for example, (3)H means 3 hours):
+P(n)Y(n)M(n)DT(n)H(n)M(n)S
For example, P1Y2M20DT3H30M8S represents a duration of one year, two months, twenty days, three hours, thirty minutes, and eight seconds.
+Date and time elements including their designator may be omitted if their value is zero, and lower-order elements may also be omitted for reduced precision. For example, “P23DT23H” and “P4Y” are both acceptable duration representations.
+As M can represent both Month and Minutes, the time designator T is used. For example, “P1M” is a one-month duration and “PT1M” is a one-minute duration.
This information on ISO 8601 durations was adapted from Wikipedia, where more detailed information can be found.
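As an illustration of the format, here is a minimal base R helper (a sketch, under the assumption that video durations are below 24 hours, so only the time components H, M, and S are generated) that converts a number of seconds into an ISO 8601 duration string:

```r
# Minimal sketch: convert a duration in seconds into an ISO 8601 string
# such as "PT2M14S". Assumes durations below 24 hours (no date components).
iso8601_duration <- function(total_seconds) {
  h <- total_seconds %/% 3600
  m <- (total_seconds %% 3600) %/% 60
  s <- total_seconds %% 60
  out <- "PT"                                  # "T" marks the time components
  if (h > 0) out <- paste0(out, h, "H")
  if (m > 0) out <- paste0(out, m, "M")
  if (s > 0 || (h == 0 && m == 0)) out <- paste0(out, s, "S")
  out
}
```

A video of 2 minutes and 14 seconds would thus be documented as duration = "PT2M14S".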
+rights
[Optional ; Not repeatable, String]
A textual description of the rights associated with the video. If a copyright applies, the three following elements should be used instead of this element.
copyright_holder
[Optional ; Not repeatable, String]
The party holding the legal copyright to the video. This element corresponds to the “copyrightHolder” element of VideoObject.
copyright_notice
[Optional ; Not repeatable, String]
Text of a notice appropriate for describing the copyright aspects of the video, ideally indicating the owner of the copyright. This element corresponds to the “copyrightNotice” element of VideoObject.
copyright_year
[Optional ; Not repeatable, String]
The year during which the claimed copyright for the video was first asserted. This element corresponds to the “copyrightYear” element of VideoObject.
credit_text
[Optional ; Not repeatable, String]
This element can be used to credit the person(s) and/or organization(s) associated with a published video. This element corresponds to the “creditText” element of VideoObject.
citation
[Optional ; Not repeatable, String]
This element provides a required or recommended citation of the video.
transcript
[Optional ; Repeatable, String]
+
"transcript": [
+{
+ "language_name": "string",
+ "language_code": "string",
+ "text": "string"
+ }
+ ]
The transcript of the video content, provided as text. Note that if the text is very long, an alternative is to save it in a separate text file and to make it available in a data catalog as an external resource.
+- language_name
[Optional ; Not repeatable ; String]
+The name of the language of the transcript.
+- language_code
[Optional ; Not repeatable ; String]
+The code of the language of the transcript, preferably the ISO code.
+- text
[Optional ; Not repeatable ; String]
The transcript itself. Adding the transcript in the metadata will make the video much more discoverable, as the content of the transcription can be indexed in catalogs.
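In R, a transcript (shortened and hypothetical here) would be stored with its language as follows:

```r
# A minimal sketch of a "transcript" block; the text shown is a
# hypothetical placeholder for the full transcript.
my_video = list(
  # ... ,
  video_description = list(
    # ... ,
    transcript = list(
      list(language_name = "English",
           language_code = "EN",
           text = "Full transcript of the video content ...")
    )
    # ...
  )
)
```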
+media
[Optional ; Repeatable ; String]
"media": [
  "string"
]
A description of the media on which the recording is stored (other than the online file format); e.g., “CD-ROM”.
+album
[Optional ; Repeatable]
"album": [
  {
    "name": "string",
    "description": "string",
    "owner": "string",
    "uri": "string"
  }
]
When a video is published in a catalog containing many other videos, it may be desirable to organize them by album. Albums are collections of videos organized by theme, period, location, or other criteria. One video can belong to more than one album. Albums are “virtual collections”.
+- name
[Optional ; Not Repeatable ; String]
+The name (label) of the album.
+- description
[Optional ; Not Repeatable ; String]
+A brief description of the album.
+- owner
[Optional ; Not Repeatable ; String]
+The owner of the album.
+- uri
[Optional ; Not Repeatable ; String]
+A link (URL) to the album.
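Since one video can belong to more than one album, the element is repeatable; a sketch with two hypothetical albums:

```r
# A minimal sketch assigning a video to two (hypothetical) virtual albums.
my_video = list(
  # ... ,
  video_description = list(
    # ... ,
    album = list(
      list(name = "Humanitarian crises",
           description = "Videos related to humanitarian emergencies",
           owner = "National Data Center",
           uri = "https://example.org/albums/humanitarian-crises"),
      list(name = "East Africa",
           description = "Videos documenting events in East Africa")
    )
    # ...
  )
)
```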
provenance
[Optional ; Repeatable]
Metadata can be programmatically harvested from external catalogs. The provenance
group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been done to the harvested metadata. These elements are NOT part of the IPTC or DCMI metadata standard.
+
"provenance": [
+{
+ "origin_description": {
+ "harvest_date": "string",
+ "altered": true,
+ "base_url": "string",
+ "identifier": "string",
+ "date_stamp": "string",
+ "metadata_namespace": "string"
+ }
+ }
+ ]
origin_description
[Required ; Not repeatable]
+The origin_description
elements are used to describe when and from where metadata have been extracted or harvested.
harvest_date
[Required ; Not repeatable ; String]
The date the metadata were harvested.
altered
[Optional ; Not repeatable ; Boolean]
Indicates whether the harvested metadata were altered before being re-published. In many cases, the unique identifier of the entry (element idno
in the Study Description / Title Statement section) will be modified when published in a new catalog.
base_url
[Required ; Not repeatable ; String]
The base URL of the originating catalog from which the metadata were harvested.
identifier
[Optional ; Not repeatable ; String]
The unique identifier of the entry (idno
element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier
element in provenance
is used to maintain traceability.
date_stamp
[Optional ; Not repeatable ; String]
metadata_namespace
[Optional ; Not repeatable ; String]
tags
[Optional ; Repeatable]
+As shown in section 1.7 of the Guide, tags, when associated with tag_groups
, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R.
+
"tags": [
+{
+ "tag": "string",
+ "tag_group": "string"
+ }
+ ]
tag
[Required ; Not repeatable ; String]
+A user-defined tag.
tag_group
[Optional ; Not repeatable ; String]
+A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
lda_topics
[Optional ; Not repeatable]
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of videos, this text will typically be a concatenation of metadata elements such as the title, description, and keywords. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).
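As a hedged illustration (the model, topic identifiers, labels, scores, and word weights below are all hypothetical), an inferred topic composition could be stored as follows, with the topic scores summing to 1:

```r
# A minimal sketch of an "lda_topics" block; all values are hypothetical.
lda_topics = list(
  list(
    model_info = list(
      list(source = "Example LDA model", model_id = "LDA-EX-75", nb_topics = 75)
    ),
    topic_description = list(
      list(topic_id = "topic_27", topic_score = 0.60, topic_label = "food security",
           topic_words = list(list(word = "food",  word_weight = 0.045),
                              list(word = "price", word_weight = 0.032))),
      list(topic_id = "topic_40", topic_score = 0.40, topic_label = "displacement")
    )
  )
)

# The shares of all represented topics must sum to 1 (100%)
scores <- sapply(lda_topics[[1]]$topic_description, function(t) t$topic_score)
```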
"lda_topics": [
+{
+ "model_info": [
+ {
+ "source": "string",
+ "author": "string",
+ "version": "string",
+ "model_id": "string",
+ "nb_topics": 0,
+ "description": "string",
+ "corpus": "string",
+ "uri": "string"
+ }
+ ],
+ "topic_description": [
+ {
+ "topic_id": null,
+ "topic_score": null,
+ "topic_label": "string",
+ "topic_words": [
+ {
+ "word": "string",
+ "word_weight": 0
+ }
+ ]
+ }
+ ]
+ }
+ ]
The lda_topics
element includes the following metadata fields.
model_info
[Optional ; Not repeatable]
+Information on the LDA model.
- source
[Optional ; Not repeatable ; String]
- author
[Optional ; Not repeatable ; String]
- version
[Optional ; Not repeatable ; String]
- model_id
[Optional ; Not repeatable ; String]
- nb_topics
[Optional ; Not repeatable ; Numeric]
- description
[Optional ; Not repeatable ; String]
- corpus
[Optional ; Not repeatable ; String]
- uri
[Optional ; Not repeatable ; String]
topic_description
[Optional ; Repeatable]
The topic composition extracted from selected elements of the video metadata (typically, the title and description).
- topic_id
[Optional ; Not repeatable ; String]
- topic_score
[Optional ; Not repeatable ; Numeric]
- topic_label
[Optional ; Not repeatable ; String]
- topic_words
[Optional ; Not repeatable]
- word
[Optional ; Not repeatable ; String]
- word_weight
[Optional ; Not repeatable ; Numeric]
embeddings
[Optional ; Repeatable]
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).
The word vectors do not have to be stored in the video metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
"embeddings": [
+{
+ "id": "string",
+ "description": "string",
+ "date": "string",
+ "vector": null
+ }
+ ]
The embeddings
element contains four metadata fields:
id
[Optional ; Not repeatable ; String]
+A unique identifier of the word embedding model used to generate the vector.
description
[Optional ; Not repeatable ; String]
+A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
date
[Optional ; Not repeatable ; String]
+The date the model was trained (or a version date for the model).
vector
[Required ; Not repeatable ; @@@@]
+The numeric vector representing the video metadata.
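A sketch of an embeddings block in R; the model identifier and description are hypothetical, and a dummy 8-dimension vector is used for brevity (real models typically produce vectors of 100 or more dimensions):

```r
# A minimal sketch of an "embeddings" block; the model is hypothetical and
# the 8-dimension vector is a dummy (real vectors are much longer).
my_video = list(
  # ... ,
  embeddings = list(
    list(id = "word2vec-example-v1",
         description = "Hypothetical word embedding model trained on a news corpus",
         date = "2021-06-30",
         vector = c(0.12, -0.08, 0.33, 0.05, -0.21, 0.17, 0.02, -0.10))
  )
)
```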
additional
[Optional ; Not repeatable]
+The additional
element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional
block; embedding them elsewhere in the schema would cause schema validation to fail.
+
+
+
library(nadar)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_videos")

id    = "MDA_VDO_001"
thumb = "vdo_001.jpg"

# Generate the metadata

my_video = list(

  metadata_information = list(
    title = "Mogadishu, Somalia: A Call for Help",
    idno = id,
    producers = list(
      list(name = "John Doe", affiliation = "National Library")
    ),
    production_date = "2021-09-03"
  ),

  video_description = list(

    idno = id,

    title = "Mogadishu, Somalia: A Call for Help",

    alt_title = "Somalia: Guterres in Mogadishu",

    date_published = "2011-09-01",

    description = "During a landmark visit, the United Nations High Commissioner for Refugees calls on the international community to rapidly increase aid to Somalia.",

    genre = "Documentary",

    persons = list(
      list(name = "António Guterres", role = "High Commissioner for Refugees"),
      list(name = "Fadhumo", role = "Somali internally displaced person (IDP)")
    ),

    main_entity = "United Nations High Commission for Refugees (UNHCR), the UN Refugee Agency",

    country = list(
      list(name = "Somalia", code = "SOM")
    ),

    spatial_coverage = "Mogadishu, Somalia",

    content_reference_time = "2011-09",

    languages = list(
      list(name = "English", code = "EN")
    ),

    creator = "United Nations High Commission for Refugees (UNHCR)",

    video_url = "https://www.youtube.com/watch?v=7Aif1xjstws",

    embed_url = "https://www.youtube.com/embed/7Aif1xjstws",

    transcript = list(
      list(
        language_name = "English",
        text = "Mogadishu is a dangerous place security has improved since al-shabaab militias
        withdrew last month but not a lot despite the insecurity hundreds of thousands of Somalis
        have been streaming into the capital from surrounding areas they're fleeing the worst famine
        to strike the region in 60 years in a landmark visit the UN High Commissioner for Refugees
        Antonio Gutierrez traveled to Mogadishu this week to visit with Somalis he urged the international
        community to rapidly increase aid to people who have been through so much already makes us very emotional is to
        feel that for 2020 as these people has been suffering the suffering enormously of course there is a large
        responsibility of Somalis in the way things have happened but let's also recognize that international community
        there sometimes also be part of the problem and not part of the solution some aid is getting through fatuma has
        just been registered to receive assistance from UNHCR she left her home and is now seeking help in the capital
        she is camped with thousands of others in a settlement not far from the shoreline UNHCR is providing plastic
        sheeting and other supplies there are also food distributions there are a total of four hundred thousand displaced
        people in Mogadishu 100,000 arrived in the past two months alone getting assistance to them despite the
        dangers is an urgent priority otherwise settlements like these are certain to you"
      )
    ),

    duration = "PT2M14S"   # 2 minutes and 14 seconds

  )
)

# Publish in the NADA catalog

video_add(idno = id,
          published = 1,
          overwrite = "yes",
          metadata = my_video,
          thumbnail = thumb)
In NADA, the video will now appear in the “All” tab and in the “Videos” tab.
+
+
+
If the embed_url
element was provided, the video can be played within the NADA page.
+
+
Documenting, cataloguing, and disseminating data has the potential to increase the volume and diversity of data analysis. There is also much value in documenting, cataloguing, and disseminating data processing and analysis scripts. Technological solutions such as GitHub, Jupyter Notebooks, and JupyterLab facilitate the preservation and sharing of code, and enable collaborative work around data analysis. Coding style guides, like the Google style guides and the Guide to Reproducible Code in Ecology and Evolution by the British Ecological Society, help foster the usability, adaptability, and reproducibility of code. But these tools and guidelines do not fully address the issue of cataloguing and discoverability of data processing and analysis programs and scripts. We propose, as a complement to collaboration tools and style guides, a metadata schema to document data analysis projects and scripts. The production of structured metadata will contribute not only to discoverability, but also to the reproducibility, replicability, and auditability of data analytics.
There are multiple reasons to make reproducibility, replicability, and auditability of data analytics a component of a data dissemination system.
+Stodden et al (2013) make a useful distinction between five levels of research openness:
+Search and filter by title, author, software, method, country, etc. Get links to analytical output and data. Example: search for a “project that implemented multiple imputation in R for a project related to poverty in Kenya”: search for poverty AND “multiple imputation” and filter the results by software / country.
+Note: the code will also be “attached” to the output page (paper) and to the dataset page of the catalog if they are available in the catalog.
+Provide access to scripts with detailed information, including software and libraries used, distribution license, IT requirements, datasets used, list of outputs, and more.
+To make data processing and analysis scripts more discoverable and usable, we propose a metadata schema inspired by the schemas available to document datasets. The proposed schema contains two main blocks of metadata elements: the document description intended to document the metadata themselves (the term document refers to the file that will contain the metadata), and the project description used to document the research or analytical work and the related scripts. We also include in the schema the tags
, provenance
, and additional
elements common to all schemas.
{
  "repositoryid": "string",
  "published": 0,
  "overwrite": "no",
  "doc_desc": {},
  "project_desc": {},
  "provenance": [],
  "tags": [],
  "lda_topics": [],
  "embeddings": [],
  "additional": { }
}
doc_desc
[Optional ; Not repeatable]
The document description is a description of the metadata file being generated; it provides metadata about the metadata. This optional block is used to document the research project metadata, not the project itself. This information is not needed to document the project; it only provides information, useful for archiving purposes, on the process of generating the project metadata. It is typically useful to a catalog administrator, not to the public, and does not need to be displayed in the publicly-available catalog interface. It is recommended to enter at least the identification of the metadata producer, her/his affiliation, and the date the metadata were created. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so the metadata produced by one organization may be found in other data centers (complying with standards and schemas is precisely intended to facilitate the interoperability of catalogs and automated information sharing). Keeping track of who documented a resource is thus useful.
"doc_desc": {
+"title": "string",
+ "idno": "string",
+ "producers": [
+ {
+ "name": "string",
+ "abbr": "string",
+ "affiliation": "string",
+ "role": "string"
+ }
+ ],
+ "prod_date": "string",
+ "version": "string"
+ }
title
[Optional ; Not Repeatable ; String]
+The title of the project. This will usually be the same as the element title
in the project description section.
idno
[Optional ; Not Repeatable ; String]
+A unique identifier for the metadata document.
producers
[Optional ; Repeatable]
+A list of producers of the metadata (who may be but do not have to be the authors of the research project and scripts being documented). These can be persons or organizations. The following four elements are used to identify them and specify their specific role as and if relevant (this block of four elements is repeated for each contributor to the metadata):
- name
[Optional ; Not Repeatable ; String]
- abbr
[Optional ; Not Repeatable ; String]
- affiliation
[Optional ; Not Repeatable ; String]
- role
[Optional ; Not Repeatable ; String]
prod_date
[Optional ; Not Repeatable ; String]
The date the metadata on this project were produced (not distributed or archived), preferably in ISO 8601 format (YYYY-MM-DD or YYYY-MM).
version
[Optional ; Not Repeatable ; String]
+Documenting a research project is not a trivial exercise. It may happen that, having identified errors or omissions in the metadata or having received suggestions for improvement, a new version of the metadata is produced. This element is used to identify and describe the current version of the metadata. It is good practice to provide a version number, and information on what distinguishes this version from the previous one(s) if relevant.
+
my_project = list(
  doc_desc = list(
    idno = "META_RP_001",
    producers = list(
      list(name = "John Doe",
           affiliation = "National Data Center of Popstan")
    ),
    prod_date = "2020-12-27",
    version = "Version 1.0 - Original version of the documentation provided by the author of the project"
  )
  # ...
)
project_desc
[Required ; Not repeatable]
+The project description contains the metadata related to the project itself. All efforts should be made to provide as much and as detailed information as possible.
"project_desc": {
+"title_statement": {},
+ "abstract": "string",
+ "review_board": "string",
+ "output": [],
+ "approval_process": [],
+ "project_website": [],
+ "language": [],
+ "production_date": "string",
+ "version_statement": {},
+ "errata": [],
+ "process": [],
+ "authoring_entity": [],
+ "contributors": [],
+ "sponsors": [],
+ "curators": [],
+ "reviews_comments": [],
+ "acknowledgments": [],
+ "acknowledgment_statement": "string",
+ "disclaimer": "string",
+ "confidentiality": "string",
+ "citation_requirement": "string",
+ "related_projects": [],
+ "geographic_units": [],
+ "keywords": [],
+ "themes": [],
+ "topics": [],
+ "disciplines": [],
+ "repository_uri": [],
+ "license": [],
+ "copyright": "string",
+ "technology_environment": "string",
+ "technology_requirements": "string",
+ "reproduction_instructions": "string",
+ "methods": [],
+ "software": [],
+ "scripts": [],
+ "data_statement": "string",
+ "datasets": [],
+ "contacts": []
+ }
title_statement
[Required ; Non repeatable]
"title_statement": {
  "idno": "string",
  "identifiers": [
    {
      "type": "string",
      "identifier": "string"
    }
  ],
  "title": "string",
  "sub_title": "string",
  "alternate_title": "string",
  "translated_title": "string"
}
idno
[Required ; Not Repeatable ; String]
identifiers
[Optional ; Repeatable]
A list of identifiers other than the idno
entered in the title_statement
. It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in idno
can (and in some cases should) be repeated here. The element idno
does not provide a type
parameter; repeating it in this section makes it possible to add that information.
- type
[Optional ; Not repeatable ; String]
- identifier
[Required ; Not repeatable ; String]
title
[Required ; Not Repeatable ; String]
sub_title
[Optional ; Not Repeatable ; String]
alternate_title
[Optional ; Not Repeatable ; String]
translated_title
[Optional ; Not Repeatable ; String]

my_project = list(
  # ... ,
  project_desc = list(

    title_statement = list(
      idno = "RR_WB_2020_001",
      identifiers = list(
        list(type = "DOI", identifier = "XXX-XXX-XXXX")
      ),
      date = "2020",
      title = "Predicting Food Crises - Econometric Model"
    )
    # ...
  )
  # ...
)
abstract
[Optional ; Non repeatable ; String]
+The abstract should provide a clear summary of the purposes, objectives and content of the project. An abstract can make reference to the various outputs associated with the research project.
Example extracted from https://microdata.worldbank.org/index.php/catalog/4218:
my_project = list(
  # ... ,
  project_desc = list(
    # ... ,

    abstract = "Food price inflation is an important metric to inform economic policy but traditional sources of consumer prices are often produced with delay during crises and only at an aggregate level. This may poorly reflect the actual price trends in rural or poverty-stricken areas, where large populations reside in fragile situations.
    This data set includes food price estimates and is intended to help gain insight in price developments beyond what can be formally measured by traditional methods. The estimates are generated using a machine-learning approach that imputes ongoing subnational price surveys, often with accuracy similar to direct measurement of prices. The data set provides new opportunities to investigate local price dynamics in areas where populations are sensitive to localized price shocks and where traditional data are not available."

    # ...
  )
  # ...
)
review_board
[Optional ; Non repeatable ; String]
+Information on whether and when the project was submitted, reviewed, and approved by an institutional review board (or independent ethics committee, ethical review board (ERB), research ethics board, or equivalent).
+
output
[Optional ; Repeatable]
+This element will describe and reference all substantial/intended products of the research project, which may include publications, reports, websites, datasets, interactive applications, presentations, visualizations, and others. An output may also be referred to as a “deliverable”.
+
"output": [
+{
+ "type": "string",
+ "title": "string",
+ "authors": "string",
+ "description": "string",
+ "abstract": "string",
+ "uri": "string",
+ "doi": "string"
+ }
+ ]
The output
is a repeatable block of seven elements, used to document all output of the research project:
+- type
[Optional ; Non repeatable]
Type of output. The type of output relates to the medium used to convey or communicate the intended results, findings, or conclusions of the research project. A controlled vocabulary may be used. The type could for example be “Working paper”, “Database”, etc.
+- title
[Required ; Non repeatable]
+Formal title of the output. Depending upon the kind of output, the title will vary in formality.
+- authors
[Optional ; Non repeatable]
Authors of the output; if there are multiple authors, they are all listed in the same text field.
+- description
[Optional ; Non repeatable]
+Brief description of the output (NOT an abstract)
+- abstract
[Optional ; Non repeatable]
+If the output consists of a document, the abstract will be entered here.
+- uri
[Optional ; Non repeatable]
+A link where the output or information on the output can be found.
+- doi
[Optional ; Non repeatable]
+Digital Object Identifier (DOI) of the output, if available.
my_project = list(
  # ... ,
  project_desc = list(
    # ... ,

    output = list(

      list(type = "working paper",
           title = "Estimating Food Price Inflation from Partial Surveys",
           authors = "Andrée, B. P. J.",
           description = "World Bank Policy Research Working Paper",
           abstract = "The traditional consumer price index is often produced at an aggregate level, using data from few, highly urbanized, areas. As such, it poorly describes price trends in rural or poverty-stricken areas, where large populations may reside in fragile situations. Traditional price data collection also follows a deliberate sampling and measurement process that is not well suited for monitoring during crisis situations, when price stability may deteriorate rapidly. To gain real-time insights beyond what can be formally measured by traditional methods, this paper develops a machine-learning approach for imputation of ongoing subnational price surveys. The aim is to monitor inflation at the market level, relying only on incomplete and intermittent survey data. The capabilities are highlighted using World Food Programme surveys in 25 fragile and conflict-affected countries where real-time monthly food price data are not publicly available from official sources. The results are made available as a data set that covers more than 1200 markets and 43 food types. The local statistics provide a new granular view on important inflation events, including the World Food Price Crisis of 2007–08 and the surge in global inflation following the 2020 pandemic. The paper finds that imputations often achieve accuracy similar to direct measurement of prices. The estimates may provide new opportunities to investigate local price dynamics in markets where prices are sensitive to localized shocks and traditional data are not available.",
           uri = "http://hdl.handle.net/10986/36778"),

      list(type = "dataset",
           title = "Monthly food price estimates",
           authors = "Andrée, B. P. J.",
           description = "A dataset of derived data, published as open data",
           abstract = "Food price inflation is an important metric to inform economic policy but traditional sources of consumer prices are often produced with delay during crises and only at an aggregate level. This may poorly reflect the actual price trends in rural or poverty-stricken areas, where large populations reside in fragile situations.
           This data set includes food price estimates and is intended to help gain insight in price developments beyond what can be formally measured by traditional methods. The estimates are generated using a machine-learning approach that imputes ongoing subnational price surveys, often with accuracy similar to direct measurement of prices. The data set provides new opportunities to investigate local price dynamics in areas where populations are sensitive to localized price shocks and where traditional data are not available.",
           uri = "https://microdata.worldbank.org/index.php/catalog/4218",
           doi = "https://doi.org/10.48529/2ZH0-JF55")

    )
    # ...
  )
)
approval_process
[Optional ; Repeatable] approval_process
is a group of six elements used to describe the formal approval process(es) (if any) that the project had to go through. This may for example include an approval by an Ethics Board to collect new data, followed by an internal review process to endorse the results.
+"approval_process": [
+{
+ "approval_phase": "string",
+ "approval_authority": "string",
+ "submission_date": "string",
+ "reviewer": "string",
+ "review_status": "string",
+ "approval_date": "string"
+ }
+ ]
approval_phase
[Optional ; Non repeatable] approval_authority
[Optional ; Non repeatable] submission_date
[Optional ; Non repeatable] reviewer
[Optional ; Non repeatable] review_status
[Optional ; Non repeatable] approval_date
[Optional ; Non repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    approval_process = list(

      list(approval_phase = "Authorization to conduct the survey",
           approval_authority = "Internal Ethics Board, [Organization]",
           submission_date = "2019-01-15",
           review_status = "Approved (permission No ABC123)",
           approval_date = "2020-04-30"),

      list(approval_phase = "Review of research output and authorization to publish",
           approval_authority = "Internal Ethics Board, [Organization]",
           submission_date = "2021-07-15",
           review_status = "Approved",
           approval_date = "2021-10-30")

    ),
    # ...
  )
  # ...
)
project_website
[Optional ; Repeatable ; String] "project_website": [
+"string"
+ ]
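The text provides no R example for this element; a minimal sketch following the pattern of the other examples in this chapter (the URL below is a hypothetical placeholder, not from the source) could be:

```r
# Sketch (hypothetical URL): listing the project's website(s)
my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,
    project_website = list(
      "https://www.example.org/my_project"  # hypothetical placeholder URL
    )
    # ...
  )
)
```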
language
[Optional ; Repeatable] "language": [
+{
+ "name": "string",
+ "code": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] code
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    language = list(
      list(name = "English", code = "EN"),
      list(name = "French", code = "FR")
    ),
    # ...
  )
  # ...
)
production_date
The date, in ISO 8601 format (YYYY-MM-DD), on which the project was completed (this refers to the version of the project being documented and released).
+
version_statement
[Optional ; Repeatable]
+This repeatable block of four elements is used to list and describe the successive versions of the project.
+
"version_statement": {
+"version": "string",
+ "version_date": "string",
+ "version_resp": "string",
+ "version_notes": "string"
+ }
version
[Optional ; Not repeatable ; String] version_date
[Optional ; Not repeatable ; String] version_resp
[Optional ; Not repeatable ; String] version_notes
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    version_statement = list(

      list(version = "v1.0",
           version_date = "2021-12-27",
           version_resp = "University of Popstan, Department of Economics",
           version_notes = "First version approved for open dissemination")

    ),
    # ...
  )
)
errata
[Optional ; Repeatable] "errata": [
+{
+ "date": "string",
+ "description": "string"
+ }
+ ]
date
[Optional ; Not repeatable ; String] description
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    errata = list(

      list(date = "2021-10-30",
           description = "Outliers in the data for Afghanistan resulted in unrealistic model estimates of the food prices for January 2020. In the latest version of the 'model.R' script, outliers are detected and dropped from the input data file. The published dataset has been updated.")

    ),
    # ...
  )
)
process
[Optional ; Repeatable] "process": [
+{
+ "name": "string",
+ "date_start": "string",
+ "date_end": "string",
+ "description": "string"
+ }
+ ]
name
: [Optional ; Not repeatable ; String] date_start
[Optional ; Not repeatable ; String] date_end
[Optional ; Not repeatable ; String] description
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    process = list(

      list(name = "Presentation of the concept note at the Review Committee decision meeting",
           date_start = "2018-02-23",
           date_end = "2018-02-23",
           description = "Presentation of the research objectives and method by the primary investigator to the Review Committee, which resulted in the approval of the concept note."),

      list(name = "Fundraising",
           date_start = "2018-02-24",
           date_end = "2018-02-28",
           description = "Discussion with project sponsors, and conclusion of the funding agreement."),

      list(name = "Data acquisition and analytics",
           date_start = "2018-03-15",
           date_end = "2019-01-30",
           description = "Implementation of web scraping, then data analysis"),

      list(name = "Working paper",
           date_start = "2019-01-30",
           date_end = "2019-02-25",
           description = "Production (and copy editing) of the working paper"),

      list(name = "Presentation to conferences",
           date_start = "2019-04-12",
           date_end = "2019-04-12",
           description = "Presentation of the paper by the primary investigator at the ... conference, London"),

      list(name = "Curation and dissemination of data and code",
           date_start = "2019-02-25",
           date_end = "2019-03-18",
           description = "Data and script documentation, and publishing in the National Microdata Library")

    ),
    # ...
  )
)
authoring_entity
[Optional ; Repeatable] "authoring_entity": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "abbreviation": "string",
+ "email": "string",
+ "author_id": []
+ }
+ ]
name
[Optional ; Not repeatable ; String] role
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] abbreviation
[Optional ; Not repeatable ; String] email
[Optional ; Not repeatable ; String] author_id
[Optional ; Repeatable] type
[Optional ; Not repeatable ; String] id
[Required ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    authoring_entity = list(

      list(name = "",
           role = "",
           affiliation = "",
           email = "",
           author_id = list(
             list(type = "ORCID", id = "")
           ))

    ),
    # ...
  )
)
contributors
[Optional ; Repeatable] This section is provided to record other contributors to the research project and give recognition for the roles they played.
+"contributors": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "abbreviation": "string",
+ "email": "string",
+ "url": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] role
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] abbreviation
[Optional ; Not repeatable ; String] email
[Optional ; Not repeatable ; String] url
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    contributors = list(
      list(name = "",
           role = "",
           affiliation = "",
           email = "")
    ),
    # ...
  )
)
sponsors
[Optional ; Repeatable] "sponsors": [
+{
+ "name": "string",
+ "abbreviation": "string",
+ "role": "string",
+ "grant_no": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] abbreviation
[Optional ; Not repeatable ; String] role
[Optional ; Not repeatable ; String] grant_no
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    sponsors = list(

      list(name = "ABC Foundation",
           abbreviation = "ABCF",
           role = "Purchase of the data",
           grant_no = "ABC_001_XYZ"),

      list(name = "National Research Foundation",
           abbreviation = "NRF",
           role = "Funding of staff and research assistant costs, and variable costs for participation in conferences",
           grant_no = "NRF_G01")

    ),
    # ...
  )
)
curators
[Optional ; Repeatable] "curators": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "abbreviation": "string",
+ "email": "string",
+ "url": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] role
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] abbreviation
[Optional ; Not repeatable ; String] email
[Optional ; Not repeatable ; String] url
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    curators = list(

      list(name = "National Data Archive of Popstan",
           role = "Documentation, preservation and dissemination of the data and reproducible code",
           email = "helpdesk@nda. ...",
           url = "popstan_nda.org")

    ),
    # ...
  )
)
reviews_comments
[Optional ; Repeatable] "reviews_comments": [
+{
+ "comment_date": "string",
+ "comment_by": "string",
+ "comment_description": "string",
+ "comment_response": "string"
+ }
+ ]
comment_date
[Optional ; Not repeatable ; String] comment_by
[Optional ; Not repeatable ; String] comment_description
[Optional ; Not repeatable ; String] comment_response
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    reviews_comments = list(
      list(comment_date = "",
           comment_by = "",
           comment_description = "",
           comment_response = "")
    ),
    # ...
  )
)
acknowledgments
[Optional ; Repeatable]
This repeatable block is used to itemize the persons or organizations to be acknowledged, and the roles they played. An alternative is the acknowledgement_statement field (see below), which can be used to provide the acknowledgment in the form of an unstructured text.
+"acknowledgments": [
+{
+ "name": "string",
+ "affiliation": "string",
+ "role": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] role
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    acknowledgments = list(
      list(name = "",
           affiliation = "",
           role = ""),
      list(name = "",
           affiliation = "",
           role = "")
    ),
    # ...
  )
)
acknowledgement_statement
[Optional ; Not repeatable ; String]
This field is used to provide acknowledgments in the form of unstructured text. An alternative is the acknowledgments field, which provides a structured way to itemize the acknowledgments.
disclaimer
[Optional ; Not repeatable ; String]
+Disclaimers limit the responsibility or liability of the publishing organization or researchers associated with the research project. Disclaimers assure that any research in the public domain produced by an organization has limited repercussions to the publishing organization. A disclaimer is intended to prevent liability from any effects occurring as a result of the acts or omissions in the research.
confidentiality
[Optional ; Not repeatable ; String]
A confidentiality statement binds the publisher to ethical considerations regarding the subjects of the research. In most cases, the identity of an individual who is the subject of research cannot be released, and special effort is required to ensure the preservation of privacy.
citation_requirement
[Optional ; Not repeatable ; String]
The citation requirement is specific to the output: it provides the preferred form of citation to be used when referring to the publication or other published output of the project.
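No R example accompanies these free-text elements; a minimal sketch following the pattern of the other examples (all wording below is hypothetical, not from the source) might look like:

```r
# Sketch with hypothetical wording for the free-text elements
my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,
    disclaimer = "The findings and conclusions are those of the authors and do not necessarily represent the views of their organizations.",
    confidentiality = "The identity of survey respondents is confidential and will not be disclosed.",
    citation_requirement = "Doe, J. (2021). Project X, version 1.0. [hypothetical citation]"
    # ...
  )
)
```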
related_projects
[Optional ; Repeatable]
+The objective of this block is to provide links (URLs) to other, related projects which can be documented and disseminated in the same catalog or any other location on the internet.
+
"related_projects": [
+{
+ "name": "string",
+ "uri": "string",
+ "note": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String] note
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    related_projects = list(
      list(name = "",
           uri = "",
           note = "")
    ),
    # ...
  )
)
geographic_units
[Optional ; Repeatable] "geographic_units": [
+{
+ "name": "string",
+ "code": "string",
+ "type": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] code
[Optional ; Not repeatable ; String] type
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    geographic_units = list(
      list(name = "India", code = "IND", type = "Country"),
      list(name = "New Delhi", type = "City"),
      list(name = "Kerala", type = "State"),
      list(name = "Nepal", code = "NPL", type = "Country"),
      list(name = "Kathmandu", type = "City")
    ),
    # ...
  )
)
keywords
[Optional ; Repeatable] "keywords": [
+{
+ "name": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
A list of keywords that provide information on the core scope and objectives of the research project. Keywords provide a convenient way to improve the discoverability of the research, because they allow terms and phrases not found elsewhere in the metadata to be indexed, making the project discoverable by text-based search engines. A controlled vocabulary, such as the UNESCO Thesaurus, will preferably (though not necessarily) be used. The list provided here can combine keywords from multiple controlled vocabularies and user-defined keywords.
+name
[Required ; Not repeatable ; String] vocabulary
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    keywords = list(

      list(name = "Migration",
           vocabulary = "Unesco Thesaurus (June 2021)",
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),

      list(name = "Migrants",
           vocabulary = "Unesco Thesaurus (June 2021)",
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),

      list(name = "Refugee",
           vocabulary = "Unesco Thesaurus (June 2021)",
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),

      list(name = "Conflict"),
      list(name = "Asylum seeker"),
      list(name = "Forced displacement"),
      list(name = "Forcibly displaced"),
      list(name = "Internally displaced population (IDP)"),
      list(name = "Population of concern (PoC)"),
      list(name = "Returnee"),
      list(name = "UNHCR")

    ),
    # ...
  )
)
themes
[Optional ; Repeatable] "themes": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
A list of themes covered by the research project. A controlled vocabulary will preferably be used. Note that themes will rarely be used, as the elements topics and disciplines are more appropriate for most uses. This is a block of five fields:
id
[Optional ; Not repeatable ; String]
+The ID of the theme, taken from a controlled vocabulary.
name
[Required ; Not repeatable ; String]
+The name (label) of the theme, preferably taken from a controlled vocabulary.
parent_id
[Optional ; Not repeatable ; String]
+The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
vocabulary
[Optional ; Not repeatable ; String]
+The name (including version number) of the controlled vocabulary used, if any.
uri
[Optional ; Not repeatable ; String]
+The URL to the controlled vocabulary used, if any.
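The text provides no R example for themes; a minimal sketch mirroring the pattern of the other elements (the theme ID, label, and vocabulary below are hypothetical, not from the source) could be:

```r
# Sketch (hypothetical theme and vocabulary) of the themes block
my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,
    themes = list(
      list(id = "8",                                   # hypothetical ID
           name = "Poverty",                           # hypothetical label
           vocabulary = "Internal theme taxonomy v1",  # hypothetical vocabulary
           uri = "https://example.org/themes")         # hypothetical URL
    )
    # ...
  )
)
```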
topics
[Optional ; Repeatable]
+
"topics": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
Information on the topics covered in the research project. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification, a typology of topics available in 11 languages; or the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. Note that you may use more than one controlled vocabulary.
+This element is a block of five fields:
+- id
[Optional ; Not repeatable ; String]
+The identifier of the topic, taken from a controlled vocabulary.
+- name
[Required ; Not repeatable ; String]
+The name (label) of the topic, preferably taken from a controlled vocabulary.
+- parent_id
[Optional ; Not repeatable ; String]
+The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
+- vocabulary
[Optional ; Not repeatable ; String]
+The name (including version number) of the controlled vocabulary used, if any.
+- uri
[Optional ; Not repeatable ; String]
+The URL to the controlled vocabulary used, if any.
my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    topics = list(
+
+ list(name = "Demography.Migration",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+
+ list(name = "Demography.Censuses",
+ vocabulary = "CESSDA Topic Classification",
+ uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+
+ list(id = "F22",
+ name = "International Migration",
+ parent_id = "F2 - International Factor Movements and International Business",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+
+ list(id = "O15",
+ name = "Human Resources - Human Development - Income Distribution - Migration",
+ parent_id = "O1 - Economic Development",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+
+ list(id = "O12",
+ name = "Microeconomic Analyses of Economic Development",
+ parent_id = "O1 - Economic Development",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+
+ list(id = "J61",
+ name = "Geographic Labor Mobility - Immigrant Workers",
+ parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers",
+ vocabulary = "JEL Classification System",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
+
+ ),
+ # ...
+
+ ) )
disciplines
[Optional ; Repeatable] "disciplines": [
+{
+ "id": "string",
+ "name": "string",
+ "parent_id": "string",
+ "vocabulary": "string",
+ "uri": "string"
+ }
+ ]
+Information on the academic disciplines related to the content of the research project. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia.
+This is a block of five elements:
id
[Optional ; Not repeatable ; String] name
[Optional ; Not repeatable ; String] parent_id
[Optional ; Not repeatable ; String] vocabulary
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    disciplines = list(
+
+ list(name = "Economics",
+ vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
+ uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+
+ list(name = "Agricultural economics",
+ vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
+ uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+
+ list(name = "Econometrics",
+ vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)",
+ uri = "https://en.wikipedia.org/wiki/List_of_academic_fields")
+
+
+ ),
+ # ...
+
+ ),# ...
+ )
repository_uri
In the process of producing the outputs of the research project, a researcher may want to share the source code for transparency and replicability. This element provides the information needed to locate the repository where the source code is kept.
"repository_uri": [
{
 "name": "string",
 "type": "string",
 "uri": "string"
 }
 ]
name
[Optional ; Not repeatable ; String] type
[Optional ; Not repeatable ; String] uri
[Required ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    repository_uri = list(
      list(name = "A comparative assessment of machine learning classification algorithms applied to poverty prediction",
           type = "GitHub public repo",
           uri = "https://github.com/worldbank/ML-classification-algorithms-poverty")
    ),
    # ...
  )
)
license
[Optional ; Repeatable] "license": [
+{
+ "name": "string",
+ "uri": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String] note
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    license = list(
      list(name = "Attribution 4.0 International (CC BY 4.0)",
           uri = "https://creativecommons.org/licenses/by/4.0/")
    ),
    # ...
  )
)
copyright
[Optional ; Not repeatable ; String]
+Information on the copyright, if any, that applies to the research project metadata.
technology_environment
[Optional ; Not repeatable ; String]
+This field is used to provide a description (as detailed as possible) of the computational environment under which the scripts were implemented and are expected to be reproducible. A substantial challenge in reproducing analyses is installing and configuring the web of dependencies of specific versions of various analytical tools. Virtual machines (a computer inside a computer) enable you to efficiently share your entire computational environment with all the dependencies intact. (https://ropensci.github.io/reproducibility-guide/sections/introduction/)
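In R, one convenient way (not prescribed by the schema) to populate this element is to capture the output of sessionInfo(), which lists the R version, platform, and loaded packages:

```r
# Sketch: capturing a description of the R computational environment
env_description <- paste(capture.output(sessionInfo()), collapse = "\n")

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,
    technology_environment = env_description
    # ...
  )
)
```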
technology_requirements
[Optional ; Not repeatable ; String]
+Software/hardware or other technology requirements needed to run the scripts and replicate the outputs
reproduction_instructions
[Optional ; Not repeatable ; String]
+Instructions to secondary analysts who may want to reproduce the scripts.
methods
[Optional ; Repeatable]
+A list of analytic, statistical, econometric, machine learning methods used in the project. The objective is to allow users to find projects based on a search on methods applied, e.g. answer a query like “poverty prediction using random forest”.
+
"methods": [
+{
+ "name": "string",
+ "note": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String] note
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    methods = list(

      list(name = "linear regression",
           note = "Implemented using R package 'stats'"),

      list(name = "random forest",
           note = "Used for both regression and classification"),

      list(name = "lasso regression (least absolute shrinkage and selection operator)",
           note = "Implemented using R package 'glmnet'"),

      list(name = "gradient boosting machine (GBM)"),

      list(name = "cross validation"),

      list(name = "mean square error, quadratic loss, L2 loss",
           note = "Loss functions used to fit models")

    ),
    # ...
  )
)
software
[Optional ; Repeatable] "software": [
+{
+ "name": "string",
+ "version": "string",
+ "library": [
+ "string"
+ ]
+ }
+ ]
name
[Required ; Not repeatable ; String] version
[Optional ; Not repeatable ; String] library
[Optional ; Repeatable]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    software = list(

      list(name = "R",
           version = "4.0.2",
           library = list("caret", "dplyr", "ggplot2")),

      list(name = "Stata",
           version = "15"),

      list(name = "Python",
           version = "3.7 (Anaconda install)",
           library = list("pandas", "scikit-learn"))

    ),
    # ...
  )
)
scripts
[Optional ; Repeatable] "scripts": [
+{
+ "file_name": "string",
+ "zip_package": "string",
+ "title": "string",
+ "authors": [
+ {
+ "name": "string",
+ "affiliation": "string",
+ "role": "string"
+ }
+ ],
+ "date": "string",
+ "format": "string",
+ "software": "string",
+ "description": "string",
+ "methods": "string",
+ "dependencies": "string",
+ "instructions": "string",
+ "source_code_repo": "string",
+ "notes": "string",
+ "license": [
+ {
+ "name": "string",
+ "uri": "string",
+ "note": "string"
+ }
+ ]
+ }
+ ]
file_name
[Optional ; Not repeatable ; String] zip_package
[Optional ; Not repeatable] title
[Optional ; Not repeatable ; String] authors
[Optional ; Repeatable] name
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] role
[Optional ; Not repeatable ; String] date
[Optional ; Not repeatable ; String]* format
[Optional ; Not repeatable ; String] software
[Optional ; Not repeatable ; String] description
[Optional ; Not repeatable ; String] methods
[Optional ; Not repeatable ; String] dependencies
[Optional ; Not repeatable ; String] instructions
[Optional ; Not repeatable ; String] source_code_repo
[Optional ; Not repeatable ; String] notes
[Optional ; Not repeatable ; String] license
[Optional ; Repeatable] name
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    scripts = list(

      list(file_name = "00_script.R",
           zip_package = "all_scripts.zip",
           title = "Project X - Master script",
           authors = list(list(name = "John Doe",
                               affiliation = "IHSN",
                               role = "Writing, testing and documenting the script")),
           date = "2020-12-27",
           format = "R script",
           software = "R x64 4.0.2",
           description = "Master script for automated reproduction of the analysis. Calls all other scripts in proper sequence to reproduce the full analysis.",
           methods = "box-cox transformation of data",
           dependencies = "",
           instructions = "",
           source_code_repo = "",
           notes = "",
           license = list(list(name = "CC BY 4.0",
                               uri = "https://creativecommons.org/licenses/by/4.0/deed.ast"))),

      list(file_name = "01_regression.R",
           zip_package = "",
           title = "Charts and maps",
           authors = list(list(name = "",
                               affiliation = "",
                               role = "")),
           date = "",
           format = "R script",
           software = "R",
           description = "This script runs all linear regressions and PCA presented in the working paper.",
           methods = "linear regression; principal component analysis",
           dependencies = "",
           instructions = "",
           source_code_repo = "",
           notes = "",
           license = list(list(name = "CC BY 4.0",
                               uri = "https://creativecommons.org/licenses/by/4.0/deed.ast"))),

      list(file_name = "02_visualization",
           zip_package = "",
           title = "",
           authors = list(list(name = "",
                               affiliation = "",
                               role = "")),
           date = "",
           format = "",
           software = "",
           description = "",
           instructions = "",
           source_code_repo = "",
           notes = "",
           license = list(list(name = "CC BY 4.0",
                               uri = "https://creativecommons.org/licenses/by/4.0/deed.ast")))

    ),
    # ...
  )
)
data_statement
[Optional ; Not repeatable ; String]
+An overall statement on the data used in the project. A separate field is provided to list and document the origin and key characteristics of the datasets.
datasets
[Optional ; Repeatable]
+This field is used to provide an itemized list of datasets used in the project. The data are not documented here (specific metadata are available for documenting data of different types, like the DDI for microdata, the ISO 19139 for geographic datasets, etc.)
+
"datasets": [
+{
+ "name": "string",
+ "idno": "string",
+ "note": "string",
+ "access_type": "string",
+ "license": "string",
+ "license_uri": "string",
+ "uri": "string"
+ }
+ ]
name
[Optional ; Not repeatable ; String] idno
[Optional ; Not repeatable ; String] note
[Optional ; Not repeatable ; String] access_type
[Optional ; Not repeatable ; String] license
[Optional ; Not repeatable ; String] license_uri
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    datasets = list(

      list(name = "Multiple Indicator Cluster Survey 2019, Round 6, Chad",
           idno = "TCD_2019_MICS_v01_M",
           uri = "https://microdata.worldbank.org/index.php/catalog/4150"),

      list(name = "World Bank Group Country Survey 2018, Chad",
           idno = "TCD_2018_WBCS_v01_M",
           access_type = "Public access",
           uri = "https://microdata.worldbank.org/index.php/catalog/3058")

    ),
    # ...
  )
)
contacts
[Optional ; Repeatable] "contacts": [
+{
+ "name": "string",
+ "role": "string",
+ "affiliation": "string",
+ "email": "string",
+ "telephone": "string",
+ "uri": "string"
+ }
+ ]
name
[Required ; Not repeatable ; String] role
[Optional ; Not repeatable ; String] affiliation
[Optional ; Not repeatable ; String] email
[Optional ; Not repeatable ; String] telephone
[Optional ; Not repeatable ; String] uri
[Optional ; Not repeatable ; String]

my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,

    contacts = list(
      list(name = "Data helpdesk",
           affiliation = "National Data Center",
           role = "Support to data users",
           email = "helpdesk@ndc. ...")
    ),
    # ...
  )
)
provenance
[Optional ; Repeatable]
+Metadata can be programmatically harvested from external catalogs. The provenance
group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+
"provenance": [
+{
+ "origin_description": {
+ "harvest_date": "string",
+ "altered": true,
+ "base_url": "string",
+ "identifier": "string",
+ "date_stamp": "string",
+ "metadata_namespace": "string"
+ }
+ }
+ ]
origin_description
[Required ; Not repeatable]
The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
- harvest_date
[Required ; Not repeatable ; String]
The date and time the metadata were harvested.
- altered
[Optional ; Not repeatable ; Boolean]
Indicates whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the record (element idno in the Document Description / Title Statement section) will be modified when published in a new catalog.
- base_url
[Required ; Not repeatable ; String]
The URL from where the metadata were harvested.
- identifier
[Optional ; Not repeatable ; String]
The unique identifier of the record (element idno) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
- date_stamp
[Optional ; Not repeatable ; String]
The date stamp of the metadata record in the source catalog.
- metadata_namespace
[Optional ; Not repeatable ; String]
The namespace of the metadata schema of the harvested record.
additional
[Optional ; Not repeatable]
+The additional
element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional
block; embedding them elsewhere in the schema would cause schema validation to fail.
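A sketch of how a data curator might attach custom elements within the additional block (the element names and values below are hypothetical, not from the source):

```r
# Sketch: hypothetical custom elements inside the 'additional' block
my_project <- list(
  # ... ,
  additional = list(
    internal_review_id = "REV-2021-042",             # hypothetical custom element
    archive_location = "Server X, folder /projects"  # hypothetical custom element
  )
)
```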
For this example of documentation and publishing of reproducible research, we use the Replication data for: Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia, published on the openICPSR website. The primary investigators for the project were Vivi Alatas, Abhijit Banerjee, Rema Hanna, Benjamin A. Olken, Ririn Purnamasari, and Matthew Wai-Poi.
A service of the Inter-university Consortium for Political and Social Research (ICPSR), openICPSR is a self-publishing repository for social, behavioral, and health sciences research data. openICPSR is particularly well-suited for the deposit of replication data sets for researchers who need to publish the raw data associated with a journal article so that other researchers can replicate their findings. (from the openICPSR website)
+library(jsonlite)
+library(httr)
+library(dplyr)
+library(nadar)
+
+# ----credentials and catalog URL --------------------------------------------------
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header = FALSE, stringsAsFactors = FALSE)
set_api_key(my_keys[1,1])
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_project")

thumb = "elite_capture.JPG"  # Will be used as thumbnail in the data catalog
id = "IDN_2019_ECTWP_v01_RR"

# Generate the metadata

my_project_metadata <- list(
+ # Information on metadata production
+
+ doc_desc = list(
+
+ producers = list(
+ list(name = "OD", affiliation = "National Data Center")
+
+ ),
+ prod_date = "2022-01-15"
+
+
+ ),
+ # Documentation of the research project, and scripts
+
+ project_desc = list(
+
+ title_statement = list(
+ idno = id,
+ title = "Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+ sub_title = "Reproducible scripts"
+
+ ),
+ production_date = list("2019"),
+
+ geographic_units = list(
+ list(name="Indonesia", code="IDN", type="Country")
+
+ ),
+ authoring_entity = list(
+
+ list(name = "Vivi Alatas",
+ role = "Primary investigator",
+ affiliation = "World Bank",
+ email = "valatas@worldbank.org"),
+
+ list(name = "Abhijit Banerjee",
+ role = "Primary investigator",
+ affiliation = "Department of Economics, MIT",
+ email = "banerjee@mit.edu"),
+
+ list(name = "Rema Hanna",
+ role = "Primary investigator",
+ affiliation = "Harvard Kennedy School",
+ email = "rema_hanna@hks.harvard.edu"),
+
+ list(name = "Benjamin A. Olken",
+ role = "Primary investigator",
+ affiliation = "Department of Economics, MIT",
+ email = "bolken@mit.edu"),
+
+ list(name = "Ririn Purnamasari",
+ role = "Primary investigator",
+ affiliation = "World Bank",
+ email = "rpurnamasari@worldbank.org"),
+
+ list(name = "Matthew Wai-Poi",
+ role = "Primary investigator",
+ affiliation = "World Bank",
+ email = "mwaipoi@worldbank.org")
+
+
+ ),
+ abstract = "This paper investigates how elite capture affects the welfare gains from targeted government transfer programs in Indonesia, using both a high-stakes field experiment that varied the extent of elite influence and nonexperimental data on a variety of existing government programs. While the relatives of those holding formal leadership positions are more likely to receive benefits in some programs, we argue that the welfare consequences of elite capture appear small: eliminating elite capture entirely would improve the welfare gains from these programs by less than one percent.",
+
+ keywords = list(
+ list(name="proxy-means test (PMT)"),
+ list(name="experimental design")
+
+ ),
+ topics = list(
+
+ list(id="D72",
+ name = "Political Processes: Rent-seeking, Lobbying, Elections, Legislatures, and Voting Behavior",
+ vocabulary = "JEL codes",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php"),
+
+ list(id = "H53",
+ name = "National Government Expenditures and Welfare Programs",
+ vocabulary = "JEL codes",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php"),
+
+ list(id = "I38",
+ name = "Welfare, Well-Being, and Poverty: Government Programs; Provision and Effects of Welfare Programs",
+ vocabulary = "JEL codes",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php"),
+
+ list(id = "O15",
+ name = "Economic Development: Human Resources; Human Development; Income Distribution; Migration",
+ vocabulary = "JEL codes",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php"),
+
+ list(id = "O17",
+ name = "Formal and Informal Sectors; Shadow Economy Institutional Arrangements",
+ vocabulary = "JEL codes",
+ uri = "https://www.aeaweb.org/econlit/jelCodes.php")
+
+
+ ),
+ output_types = list(
+
+ list(type = "Article",
+ title = "Does Elite Capture Matter Local Elites and Targeted Welfare Programs in Indonesia",
+ description = "AEA Papers and Proceedings 2019, 109: 334-339",
+ uri = "https://doi.org/10.1257/pandp.20191047",
+ doi = "10.1257/pandp.20191047"),
+
+ list(type = "Working Paper",
+ title = "Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+ description = "NBER Working Paper No. 18798, February 2013",
+ uri = "https://www.nber.org/papers/w18798")
+
+
+ ),
+ version_statement = list(version = "1.0", version_date = "2019"),
+
+ language = list(
+ list(name = "English", code = "EN")
+
+ ),
+ methods = list(
+ list(name = "linear regression with large dummy-variable set (areg)"),
+ list(name = "probit regression"),
+ list(name = "Test linear hypotheses after estimation")
+
+ ),
+ software = list(
+ list(name= "Stata", version = "14")
+
+ ),
+ reproduction_instructions = "The master do file should run start to finish in less than five minutes from the master do file '0MASTER 20190918.do'. Original data is in data-PUBLISH/originaldata and is all that is needed to run the code; all data in data-PUBLISH/codeddata is created from the coding do files. All results are then created and saved in output-PUBLISH/tables.
+
+ Key Subfolders:
+ 1. code-PUBLISH: This folder contains all relevant code. The master do file is located here ('0Master20190918.do') as well as the two folders that are necessary for the creation of datasets/coding ('coding_matching' folder) and for the analysis/table creation ('analysis' folder). Users should update the directory on the master file to reflect the location of the directory on their computers once downloaded. Following that, all the data and output files needed to replicate the main findings of the paper (Tables 1A-1D, Table 2 and the 4 Appendix Tables) will be generated. The sub do files provide specific notes on the variables created where relevant.
+ 2. data-PUBLISH: This folder contains all relevant .dta files. The first folder, 'original data' contains the 'Baseline' folder that has the original baseline survey information. Under 'original data' you will also find the 'Others' folder with the randomization results, the 2008 PPLS data and the PODES 2008 village level administrative data. The 'Endline2' folder contains the endline survey information. These datasets have been modified only to mask sensitive information. Finally, the 'codeddata' folder that stores intermediate datasets that are created through the sub 'coding_matching' do files.
+ 3. log-PUBLISH: This folder contains the latest log file. When users run the master do file, a new log file will automatically be created and stored here.
+ 4. output-PUBLISH: This folder contains all the tables of the main paper and appendix. When users run the master do file, these tables will be automatically overwritten.",
+
+ confidentiality = "The published materials do not contain confidential information.",
+
+ datasets = list(
+
+ list(name = "Village survey (original data; baseline)",
+ idno = "",
+ note = "Stata 14 data files",
+ access_type = "Public",
+ uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+
+ list(name = "Village survey (original data; endline)",
+ idno = "",
+ note = "Stata 14 data files",
+ access_type = "Public",
+ uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+
+ list(name = "Randomization data",
+ idno = "",
+ note = "Stata 14 data files",
+ access_type = "Public",
+ uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+
+ list(name = "2008 PPLS",
+ idno = "",
+ note = "Stata 14 data files",
+ access_type = "Public",
+ uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+
+ list(name = "2008 PODES - Village level administrative data",
+ idno = "",
+ note = "Stata 14 data files",
+ access_type = "Public",
+ uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+
+ list(name = "Coded data (intermediary data files generated by the scripts)",
+ idno = "",
+ note = "Stata 14 data files",
+ access_type = "Public",
+ uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view")
+
+
+ ),
+ sponsors = list(
+
+ list(name="Australian Aid (World Bank Trust Fund)",
+ abbr="AusAID",
+ role="Financial support"),
+
+ list(name="3ie",
+ grant_no="OW3.1055",
+ role="Financial support"),
+
+ list(name="NIH",
+ grant_no="P01 HD061315",
+ role="Financial support")
+
+
+ ),
+ acknowledgements = list(
+
+ list(name = "Jurist Tan, Talitha Chairunissa, Amri Ilmma, Chaeruddin Kodir, He Yang, and Gabriel Zucker",
+ role = "Research assistance"),
+
+ list(name = "Scott Guggenheim",
+ role = "Provided comments"),
+
+ list(name = "Mitra Samya, BPS, TNP2K, and SurveyMeter",
+ role = "Field cooperation")
+
+
+ ),
+ disclaimer = "Users acknowledge that the original collector of the data, ICPSR, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.",
+
+ scripts = list(
+
+ list(file_name = "0MASTER-20190918.do",
+ zip_package = "119802-V1.zip",
+ title = "Master Stata do file",
+ authors = list(list(name="Rema Hanna, Ben Olken (PIs) and Sam Solomon (RA)")),
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Master do file; this script calls all do files required to replicate the output from start to finish (in no more than a few minutes)",
+ notes = "Original data is in data-PUBLISH/originaldata and is all that is needed to run the code; all data in data-PUBLISH/codeddata is created from the coding do files. All results are then created and saved in output-PUBLISH/tables."),
+
+ list(file_name = "coding baseline.do",
+ title = "coding baseline variables",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 1/7"),
+
+ list(file_name = "coding suseti pmt.do",
+ title = "coding pmt",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 2/7"),
+
+ list(file_name = "coding elite relation.do",
+ title = "coding additional variables for analysis",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 3/7"),
+
+ list(file_name = "matching hybrid.do",
+ title = "matching baseline survey data and matching results",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 4/7; Generates poverty density measure"),
+
+ list(file_name = "coding existing social programs.do",
+ title = "coding existing social programs",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 5/7"),
+
+ list(file_name = "coding kitchen-sink variables.do",
+ title = "coding miscellaneous variables",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 6/7"),
+
+ list(file_name = "coding_partV_hybrid.do",
+ title = "coding for part V of analysis plan",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Coding/matching script 7/7"),
+
+ list(file_name = "0 Table 1AB.do",
+ title = "Table 1: formal vs. informal elites - Panels A and B: historical benefits",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 1/7"),
+
+ list(file_name = "0 Table 1CD.do",
+ title = "Table 1: formal vs. informal elites - Panels C and D: PKH Experiment",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 2/7"),
+
+ list(file_name = "0 Table 2 Appendix Table 3.do",
+ title = "Table 7: Social welfare simulations",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 3/7"),
+
+ list(file_name = "0 Appendix Table 1A.do",
+ title = "Table 2A: Elite capture in historical programs",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 4/7"),
+
+ list(file_name = "0 Appendix Table 1B.do",
+ title = "Table 2B: Elite capture in PKH experiment",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 5/7"),
+
+ list(file_name = "0 Appendix Table 2.do",
+ title = "Appendix Table 12: Probit Model from Table 7",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 6/7"),
+
+ list(file_name = "0 Appendix Table 4.do",
+ title = "Appendix Table 13: Social welfare simulations -- PKH - Additional model from Table 7",
+ zip_package = "119802-V1.zip",
+ format = "Stata do file",
+ software = "Stata 14",
+ description = "Analysis script 7/7"),
+
+ list(file_name = "master_log_09182019.smcl",
+ title = "Log file - Run of master do file",
+ zip_package = "119802-V1.zip",
+ format = "Stata log file",
+ software = "Stata 14",
+ description = "Latest log file obtained by running the master do file")
+
+ )
+
+ )
+
+ )
+
+# Publish the project metadata in the NADA catalog
+
+script_add(idno = id,
+metadata = my_project_metadata,
+ repositoryid = "central",
+ published = 1,
+ thumbnail = thumb,
+ overwrite = "yes")
+
+
+# Add links to ICPSROpen website and AEA website as external resources:
+
+external_resources_add(
+title = "Elite Capture Paper (Alatas et Al., 2019) - Project page - OpenICPSR",
+ idno = id,
+ dctype = "web",
+ file_path = "https://www.openicpsr.org/openicpsr/project/116471/version/V1/view;jsessionid=31C3E76620D0DDD1CABADAA263A1E491",
+ overwrite = "yes"
+
+ )
+external_resources_add(
+title = "American Economic Association (AEA) paper: Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+ idno = id,
+ dctype = "doc/anl",
+ file_path = "https://www.aeaweb.org/articles?id=10.1257/pandp.20191047",
+ overwrite = "yes"
+ )
The metadata and all resources (script files, etc.) are now available in the NADA catalog.
The metadata schemas presented in chapters 4 to 12 of the Guide are intended to document in detail resources of multiple types (data and scripts). When published in a NADA catalog, these metadata will be made visible and searchable. But publishing metadata in an HTML format is not enough. In most cases, you will also want to make files (data files, documents, or others) accessible in your catalog, and provide links to other, related resources. These files will have to be uploaded to your web server, and the links created, with some documentation. These related materials are what is referred to as “external resources”.
External resources are not a specific type of data. They are resources of any type (data, document, web page, or any other type of resource that can be provided as an electronic file or a web link) that can be attached as a “related resource” to a catalog entry. A schema that is intentionally kept very simple, based on the Dublin Core standard, is used to describe these resources. This schema will never be used independently; it will always be used in combination with one of the other metadata standards and schemas documented in this Guide.

The table below shows some examples of the kind of external resources that may be attached to the metadata of different data types.
| Data type | Resources that may be documented and published as external resources |
|---|---|
| Document | MS-Excel version of tables included in a publication; PDF/DOC version of the publication; visualization files (scripts and images) for visualizations included in the publication; link to electronic annexes |
| Microdata | survey questionnaire; survey report; technical documentation (sampling, etc.); data entry application; survey budget in Excel; microdata files in different formats; link to an external website |
| Geographic dataset | link to an interactive web application; technical documentation in PDF; data analysis scripts; publicly accessible data files |
| Time series | link to a database query interface; technical documents; link to external websites; visualization scripts |
| Tables | link to an organization website; tabulation scripts |
| Images | image files in different formats and resolutions; link to a photo album application; link to a photographer's website |
| Audio recordings | audio file in MP3 or other format; transcript in PDF |
| Videos | video file in WAV or other format; transcript in PDF |
| Scripts | publication; link to a package/library web page; link to datasets |
Note that a catalog entry (e.g. a document, or a table) can itself be provided as a link (i.e. as an external resource) for another catalog entry.
In a NADA catalog, the external resources will not appear as catalog entries. Their list and description will be displayed (and the resources made accessible) in a “DOWNLOAD” tab for the entry to which they are attached.
The schema used to document external resources contains only 16 elements.
```json
{
  "dctype": "doc/adm",
  "dcformat": "application/zip",
  "title": "string",
  "author": "string",
  "dcdate": "string",
  "country": "string",
  "language": "string",
  "contributor": "string",
  "publisher": "string",
  "rights": "string",
  "description": "string",
  "abstract": "string",
  "toc": "string",
  "filename": "string",
  "created": "2023-04-09T19:23:22Z",
  "changed": "2023-04-09T19:23:22Z"
}
```
dctype
[Optional, Not Repeatable, String]
This element defines the type of external resource being documented. It plays an important role in the cataloguing system (NADA), as it is used to determine where and how the resource will be published. Particular attention must be paid to the type “Microdata File” (`dat/micro`) and to other data types, when the datasets will be published in a data catalog with access restrictions. The NADA catalog allows data to be published under different levels of accessibility: open data, direct access, public use files, licensed data, access in a data enclave, or no access. Most standards include an element `access_policy`, which is used to determine the type of access to a resource and applies to data of type `dat/micro`. The resource type `dctype` must be selected from a controlled vocabulary.
dcformat
[Optional, Not Repeatable, String]
The resource file format, preferably entered using a controlled vocabulary. The Dublin Core recommendation is to use MIME types (as in the example above, `application/zip`).
title
[Required, Not Repeatable, String]
The title of the resource.
author
[Optional, Not Repeatable, String]
The author(s) of the resource. If more than one, separate the names with a “;”.
dcdate
[Optional, Not Repeatable, String]
The date the resource was produced or released, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
country
[Optional, Not Repeatable, String]
The country name, if the resource is specific to a country. If more than one, enter the country names separated with a “;”.
language
[Optional, Not Repeatable, String]
The language name. If more than one, enter the language names separated with a “;”.
contributor
[Optional, Not Repeatable, String]
The list of contributors (free text). If more than one, enter the names separated with a “;”.
publisher
[Optional, Not Repeatable, String]
The publisher(s) of the resource (free text). If more than one, enter the names separated with a “;”.
rights
[Optional, Not Repeatable, String]
The rights associated with the resource.
description
[Optional, Not Repeatable, String]
A brief description of the resource (but not the abstract; see the next element).
abstract
[Optional, Not Repeatable, String]
An abstract of the resource.
toc
[Optional, Not Repeatable, String]
The table of contents of the resource (if the resource is a publication), entered as free text.
filename
[Optional, Not Repeatable, String]
A file name or a URL.
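Because the schema is so small, an external resource record can be assembled programmatically before it is sent to a catalog. The sketch below (in Python) builds and minimally validates such a record against the 16-element schema; the `make_external_resource` helper and its checks are illustrative conveniences, not part of the NADAR or PyNada libraries, and the `doc/qst` type code is an assumed example value.

```python
from datetime import datetime, timezone

def make_external_resource(title, **optional):
    """Assemble an external-resource record following the 16-element,
    Dublin Core-based schema. Only 'title' is required."""
    allowed = {
        "dctype", "dcformat", "author", "dcdate", "country", "language",
        "contributor", "publisher", "rights", "description", "abstract",
        "toc", "filename",
    }
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError(f"unknown schema elements: {sorted(unknown)}")
    if not title:
        raise ValueError("'title' is the only required element and must not be empty")
    record = {"title": title, **optional}
    # 'created' (and 'changed') are timestamps in the ISO 8601 form
    # shown in the schema example
    record["created"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return record

# Example: a survey questionnaire attached as an external resource
resource = make_external_resource(
    title="Survey questionnaire (PDF)",
    dctype="doc/qst",              # assumed document-type code, for illustration
    dcformat="application/pdf",
    dcdate="2019-06-30",
    language="English",
    filename="questionnaire.pdf",
)
```

A record built this way can then be handed to whatever publishing function the catalog client provides, such as the `external_resources_add` function of the NADAR R package shown earlier in this chapter.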
The “complete examples” provided in the previous chapters included some examples of the use of the “external_resources_add” command (from the Nadar R package) or “…” (from the PyNada Python library). We provide here one more example.
# R example @@@@
# Python example @@@@
Numerous organizations (government agencies, international organizations, the private sector, academia, and others) invest in data collection and creation. Their datasets often possess intrinsic value not only for their creators but also for a broader community of secondary users and researchers. By repurposing and reusing data, this community adds value to the data. However, many valuable datasets remain difficult to find, access, and use, and are therefore underexploited. A dedicated and concerted effort to improve the discoverability, accessibility, and usability of data is needed. Such an effort would largely hinge on the quality of the metadata associated with the data. This Guide aims to promote and facilitate the production and use of rich and structured metadata, ultimately promoting the responsible use and repurposing of data.
The primary audience for the Guide are data producers and curators, data librarians and catalog administrators, and the developers of data management and dissemination platforms, who seek to maximize the value of existing data in a responsible and technically proficient manner. The Guide applies mainly to socio-economic data of different types (indicators, microdata, geographic datasets, publications, and others).
The Guide is part of a broader toolset that also includes specialized software applications: a specialized metadata editor and a cataloguing tool. This toolset covers the technical aspects of data documentation and dissemination. Legal and ethical considerations are equally important, but are addressed in other guidelines and are supported by different tools.
The Guide was written by Olivier Dupriez (Deputy Chief Statistician, World Bank) and Mehmood Asghar (Senior Data Engineer, World Bank). Kamwoo Lee (Data Scientist, World Bank) produced some of the examples of the use of metadata schemas included in the Guide and contributed to the testing of the schemas. Emmanuel Blondel (consultant) contributed much of chapter 6. Geoffrey Greenwell (consultant) provided input to chapter 9. Tefera Bekele Degefu and Cathrine Machingauta (Data Scientists, World Bank) participated in the testing of the metadata schemas.
The production of the Guide and related tools has been made possible by financial contributions from:
The Guide was created using R Bookdown and is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
ChatGPT was used as a copy editor, but not for substantive content suggestion or creation.
Feedback and suggestions on the Guide are welcome. They can be sent to […] or submitted on GitHub, where the Guide’s source code is stored (https://github.com/mah0001/schema-guide).
Over the last decade, the supply of socio-economic data available to researchers and policy makers has increased considerably, along with advances in the tools and methods available to exploit these data. This provides the research community and development practitioners with unprecedented opportunities to increase the use and value of existing data.
> Data that were initially collected with one intention can be reused for a completely different purpose. (…) Because the potential of data to serve a productive use is essentially limitless, enabling the reuse and repurposing of data is critical if data are to lead to better lives. (World Bank, World Development Report 2021)
But data can be challenging to find, access, and use, resulting in many valuable datasets remaining underutilized. Data repositories and libraries, and the data catalogs they maintain, play a crucial role in making data more discoverable, visible, and usable. But many of these catalogs are built on sub-optimal standards and technological solutions, resulting in limited findability and visibility of their assets. To address such market failures, a better marketplace for data is needed.

A better marketplace for data can be developed on the model of large e-commerce platforms, which are designed to effectively and efficiently serve both buyers and sellers. In a marketplace for data, the “buyers” are the data users, and the “sellers” are the organizations that own or curate datasets and seek to make them available to users, preferably free of charge to maximize the use of data. Data platforms must be optimized to provide data users with convenient ways of identifying, locating, and acquiring data (which requires the implementation of a user-friendly search and recommendation system), and to provide data owners with a trustworthy mechanism to make their datasets visible and discoverable and to share them in a cost-effective, convenient, and safe manner.
Achieving such objectives requires detailed and structured metadata that properly describe the data products. Indeed, search algorithms and recommender systems exploit metadata, not data. Metadata are essential to the credibility, discoverability, visibility, and usability of the data. Adopting metadata standards and schemas is a practical and efficient solution to achieve completeness and quality of the metadata. This Guide presents a set of recommended standards and schemas covering multiple types of data, along with guidance for their implementation. The data types covered include microdata, statistical tables, indicators and time series, geographic datasets, text, images, video recordings, and programs and scripts.

Chapter 1 of the Guide outlines the challenges associated with finding and using data. Chapter 2 describes the essential features of a modern data catalog, and Chapter 3 explains how rich and structured metadata, compliant with the metadata standards and schemas we describe in the Guide, can enable advanced search algorithms and recommender systems. Finally, Chapters 4 to 13 present the recommended standards and schemas, along with examples of their use.

This Guide was produced by the Office of the World Bank Chief Statistician as a reference guide for World Bank staff and for partners involved in the curation and dissemination of data related to social and economic development. The standards and schemas it describes are used by the World Bank in its data management and dissemination systems, and for the development of systems and tools for the acquisition, documentation, cataloguing, and dissemination of data. Among these tools are a specialized Metadata Editor designed to facilitate the documentation of datasets in compliance with the recommended standards and schemas, and a cataloguing application (“NADA”). Both applications are openly available.