diff --git a/00_introduction.md b/00_introduction.md
new file mode 100644
index 0000000..29c2ee9
--- /dev/null
+++ b/00_introduction.md
@@ -0,0 +1,20 @@
+---
+output: html_document
+---
+
+# Introduction {-}
+
+Over the last decade, the supply of socio-economic data available to researchers and policy makers has increased considerably, along with advances in the tools and methods available to exploit these data. This provides the research community and development practitioners with unprecedented opportunities to increase the use and value of existing data.
+
+:::quote
+Data that were initially collected with one intention can be reused for a completely different purpose. (…) Because the potential of data to serve a productive use is essentially limitless, enabling the reuse and repurposing of data is critical if data are to lead to better lives. ([World Bank, World Development Report 2021](https://www.worldbank.org/en/publication/wdr2021))
+:::
+
+But data can be challenging to find, access, and use, resulting in many valuable datasets remaining underutilized. Data repositories and libraries, and the data catalogs they maintain, play a crucial role in making data more discoverable, visible, and usable. But many of these catalogs are built on sub-optimal standards and technological solutions, resulting in limited findability and visibility of their assets. To address such market failures, a better marketplace for data is needed.
+
+A better marketplace for data can be developed on the model of large e-commerce platforms, which are designed to effectively and efficiently serve both buyers and sellers. In a marketplace for data, the "buyers" are the data users, and the "sellers" are the organizations that own or curate datasets and seek to make them available to users -- preferably free of charge to maximize the use of data. Data platforms must be optimized to provide data users with convenient ways of identifying, locating, and acquiring data (which requires the implementation of a user-friendly search and recommendation system), and to provide data owners with a trusted mechanism to make their datasets visible and discoverable and to share them in a cost-effective, convenient, and safe manner.
+
+Achieving such objectives requires detailed and structured metadata that properly describe the data products. Indeed, search algorithms and recommender systems exploit metadata, not data. Metadata are essential to the credibility, discoverability, visibility, and usability of the data. Adopting metadata standards and schemas is a practical and efficient solution to achieve completeness and quality of the metadata. This Guide presents a set of recommended standards and schemas covering multiple types of data, along with guidance for their implementation. The data types covered include microdata, statistical tables, indicators and time series, geographic datasets, text, images, video recordings, and programs and scripts.
+
+Chapter 1 of the Guide outlines the challenges associated with finding and using data. Chapter 2 describes the essential features of a modern data catalog, and Chapter 3 explains how rich and structured metadata, compliant with the metadata standards and schemas we describe in the Guide, can enable advanced search algorithms and recommender systems. Finally, Chapters 4 to 13 present the recommended standards and schemas, along with examples of their use.
+ +This Guide was produced by the Office of the World Bank Chief Statistician as a reference guide for World Bank staff and for partners involved in the curation and dissemination of data related to social and economic development. The standards and schemas it describes are used by the World Bank in its data management and dissemination systems, and for the development of systems and tools for the acquisition, documentation, cataloguing, and dissemination of data. Among these tools is a specialized **Metadata Editor** designed to facilitate the documentation of datasets in compliance with the recommended standards and schemas, and a cataloguing application ("NADA"). Both applications are openly available. diff --git a/01_chapter01_challenge_finding_using_data.md b/01_chapter01_challenge_finding_using_data.md new file mode 100644 index 0000000..b183984 --- /dev/null +++ b/01_chapter01_challenge_finding_using_data.md @@ -0,0 +1,55 @@ +--- +output: html_document +--- + +# (PART) RATIONALE AND OBJECTIVES {-} + +# The challenge of finding and assessing, accessing, and using data {#chapter01} + +In the realm of data sharing policies adopted by numerous national and international organizations, a common challenge arises for researchers and other data users: the practicality of finding, accessing, and using data. Navigating through an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and effectively analyzing them. This challenge is characterized by issues such as insufficient metadata, limitations of data discovery systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the following sections, we will delve into these challenges. + +## Finding and assessing data + +Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks, often referred to as *tribal knowledge*, to locate and obtain the data they require. This may lead to the use of *convenient* data that may not be the most relevant. Others may encounter datasets of interest in academic publications, which can be challenging due to the inconsistent or non-standardized citation of datasets. However, most data users use general search engines or turn to specialized data catalogs to discover relevant data resources. + +Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, such as a query for "population of India in 2023," yield instant informative responses (though not always from the most authoritative source). Even less direct queries, like "indicators of malnutrition in Yemen," return adequate responses, as the engine can "understand" concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and the underweight population. 
Additionally, generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although it is not without the risk of errors and inaccuracies. However, these search engines may not be optimized to identify the most relevant data when the user's requirements cannot be expressed in the form of a straightforward query. For instance, internet search engines might offer limited assistance to a researcher seeking "satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition." + +While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking pertinent data. Nonetheless, the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal search indexes and algorithms. With the rapid advancements in AI-based solutions, many of which are available as open-source software, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems. + +The solution to improve data discoverability involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within specialized data catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines index and use to identify and locate data of interest. + +Metadata is the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access to both relevant datasets and the metadata essential for evaluating the data's suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should allocate resources and time exclusively to obtain data that is known to be of high quality and relevance. Evaluating a dataset's fitness for a specific purpose necessitates different metadata elements for various data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is minimal, the dataset would not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time. + +## Accessing data + +Accessing data is a multifaceted challenge that encompasses legal, ethical, and practical considerations. 
To ensure that data access is lawful, ethical, and efficient, and enables relevant and responsible use of the data, data providers and users must adhere to specific principles and practices:
+
+- Data providers must ensure that they possess the legal rights to share the data and define clear usage rights for data users.
+- Data users must understand how they can use the data, whether for research, commercial purposes, or other applications, and they must strictly adhere to the terms of use.
+- Data access must comply with data privacy laws and ethical standards. Sensitive or personally identifiable information must be handled with care to protect individuals' privacy.
+- Data providers must furnish comprehensive metadata that provides context and a full understanding of the data. Metadata should include details about the data's provenance, encompassing its history, transformations, and processing steps. Understanding how the data was created and modified is essential for accurate and responsible analysis.
+- Data should be available in user-friendly formats compatible with common data analysis tools, such as CSV, JSON, or Excel.
+- Data should be accessible through various means, accommodating users' preferences and capacities. This may involve offering downloadable files, providing access through web-based tools, and supporting data streaming.
+- APIs are essential for enabling programmable data access, allowing researchers to retrieve and manipulate data programmatically for integration into their research workflows and applications.
+
+Data users in developing countries often encounter additional challenges in accessing data, including:
+
+- Lack of resources: Researchers in developing countries may lack the financial resources to purchase data or access data stored in expensive cloud-based repositories.
+- Lack of infrastructure: Researchers in developing countries may lack access to the high-speed internet and computing resources required for working with large datasets.
+- Lack of expertise: Researchers in developing countries may lack the expertise to work with complex data formats and utilize data analysis tools.
+
+These specific challenges should be considered when developing data dissemination systems.
+
+## Using data
+
+The challenge for data users extends beyond discovering data to obtaining all the necessary information for a comprehensive understanding of the data and for responsible and appropriate use. A single indicator label, such as "unemployment rate (%)," can obscure significant variations by country, source, and time. The international recommendations for the definition and calculation of the "unemployment rate" have evolved over time, and not all countries employ the same data collection instrument (e.g., labor force surveys) to gather the underlying data. Detailed metadata should always accompany data on online data dissemination platforms. This association should be close; relevant metadata should ideally be no more than one click away from the data. This is particularly crucial when a platform publishes data from multiple sources that are not fully harmonized.
+
+:::quote
+The scope and meaning of labor statistics, in general, are determined by their source and methodology, which holds true for the unemployment rate. To interpret the data accurately, it is crucial to understand what the data convey, how they were collected and constructed, and to have information on the relevant metadata.
The design and characteristics of the data source, typically a labor force survey or a similar household survey for the unemployment rate, especially in terms of definitions and concepts used, geographical and age coverage, and reference periods, have significant implications for the resulting data. Taking these aspects into account is essential when analyzing the statistics. Additionally, it is crucial to seek information on any methodological changes and breaks in series to assess their impact on trend analysis and to keep in mind methodological differences across countries when conducting cross-country studies. (From Quick guide on interpreting the unemployment rate, International Labour Office – Geneva: ILO, 2019, ISBN: 978-92-2-133323-4 (web pdf)). +::: + +Whenever possible, reproducible or replicable scripts used with the data, along with the analytical output of these scripts, should be published alongside the data. These scripts can be highly valuable to researchers who wish to expand the scope of previous data analysis or reuse parts of the code, and to students who can learn from reading and replicating the work of experienced analysts. To enhance data usability, we have developed a specific metadata schema for documenting research projects and scripts. + +## A FAIR solution + +To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, search engines, and increased accessibility. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469). + +It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To ensure data is findable, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more accessible. Moreover, multiple modes of data access should be available to enhance accessibility, while data should be made interoperable to promote data sharing and reusability. Detailed metadata, including fitness-for-purpose assessments, should be displayed alongside scripts and permanent availability options, such as a DOI, to encourage reuse. diff --git a/02_chapter02_search_engine_for_data.md b/02_chapter02_search_engine_for_data.md new file mode 100644 index 0000000..380ac6d --- /dev/null +++ b/02_chapter02_search_engine_for_data.md @@ -0,0 +1,592 @@ +--- +output: html_document +--- + +# The features of a modern data dissemination platform {#chapter02} + +In the introductory section of this Guide, we proposed that a data dissemination platform should be modeled after highly successful e-commerce platforms. 
These platforms are designed to optimally satisfy the requirements and expectations of both buyers (in our context, the data users) and sellers (in our context, the data providers who make their datasets accessible through a data catalog). In this chapter, we outline the crucial features that a modern online data catalog should incorporate to adhere to this model and effectively cater to the diverse needs and expectations of its users. + +Our objective is to provide recommendations for developing data catalogs that encompass lexical search and semantic search, filtering, advanced search functionality, interactive user interfaces, and the capability to operate as a data recommender system. To define these features, we approach the topic from three distinct perspectives: the viewpoint of data users, who represent a highly diverse community with varying needs, preferences, expectations, and capabilities; the standpoint of data suppliers, who either publish their data or delegate the task to a data library; and the perspective of catalog administrators, responsible for curating and disseminating data in a responsible, effective, and efficient manner while optimizing both user and supplier satisfaction. + +The creation of a contemporary data dissemination platform is a collaborative endeavor, engaging data curators, user experience (UX) experts, designers, search engineers, and subject matter specialists with a profound understanding of both the data and the users' requirements and preferences. Inclusive in this development process should be the active participation of the users themselves, allowing them to provide feedback that directly influences the system's design. + +## Features for data users + +In order to cultivate a favorable user experience, online data catalogs must offer an intuitive and efficient interface, allowing users to effortlessly access the most pertinent datasets. To meet user expectations effectively, one should emphasize simplicity, predictability, relevance, speed, and reliability. Integrating these principles into the design of data catalogs can deliver a seamless and user-friendly experience, akin to the convenience and ease provided by well-known internet search engines and e-commerce platforms. This, in turn, streamlines the process of discovering and obtaining the necessary data, making it quick and hassle-free for users. + +### Simple search interface + +The default option to search for data in a specialized catalog should be a single search box, following the model of general search engines. The objective of the search algorithm should then be to "understand" the user's query as accurately as possible, potentially by parsing and enhancing the query, and returning the most relevant results ranked in order of importance. + +
+
+![image](https://user-images.githubusercontent.com/35276300/229823626-311376be-f75f-4e0b-9e6b-767fa307246b.png) +
+
+ +However, not all users can be expected to provide ideal queries. The search engine must be able to tolerate spelling mistakes to provide a seamless user experience. Auto-completion and spell checkers of queries are independent of the metadata being searched and can be enabled using indexing tools such as Solr or ElasticSearch. Additionally, after processing a user query, the application can provide suggestions for related keywords. This can be implemented using a graph of related words generated by natural language processing (NLP) models. Access to an API is necessary to implement keyword suggestions based on such graphs. The example below shows a related words graph for the terms "climate change" as returned by an NLP model. + +
+
+![](./images/related_words_graph.JPG){width=100%} +
+
+ +A search interface could retrieve such information via API and display it as follows: + +
+
+![](./images/catalog_search_01.JPG){width=100%} +
+
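+
+The spelling tolerance and query suggestions described above can be implemented with features built into standard indexing engines. The sketch below is a minimal illustration only: it assumes an Elasticsearch index named `catalog` with a `title` field (names are illustrative), and uses the Python `elasticsearch` client (8.x style) to combine fuzzy matching with a term suggester.
+
+```python
+from elasticsearch import Elasticsearch
+
+es = Elasticsearch("http://localhost:9200")  # assumed local test instance
+
+user_query = "malnutrion indicators yemen"  # query containing a spelling mistake
+
+response = es.search(
+    index="catalog",
+    # Fuzzy matching tolerates small spelling errors in the query terms
+    query={"match": {"title": {"query": user_query, "fuzziness": "AUTO"}}},
+    # A term suggester proposes corrected spellings ("did you mean ...")
+    suggest={"spellcheck": {"text": user_query, "term": {"field": "title"}}},
+)
+
+for hit in response["hits"]["hits"]:
+    print(hit["_score"], hit["_source"].get("title"))
+
+for entry in response["suggest"]["spellcheck"]:
+    print("Did you mean:", entry["text"], "->", [o["text"] for o in entry["options"]])
+```
+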
+ + +### Browser + +Some users will just want to browse a catalog. This should be made easy. The use of cards is recommended. For images, a mosaic view can be provided. For microdata, a variable view. + + +### Latest additions and history + +The catalog must provide a list of the most recent additions, and a history of additions and updates. +For each entry, information must be available on the date the entry was first added to the catalog, and when it was last updated. +When a dataset is replaced with a new version, the versioning must be clear. + +
+![image](https://user-images.githubusercontent.com/35276300/231492091-a96d4c5c-c461-4b5f-88c1-f8db26daa98d.png) +
+
+
+### Advanced search
+
+It is also useful to provide users with an option to build a more advanced search, targeted to specific metadata elements and using boolean operators. Advanced searches are enabled by structured metadata, i.e., by the use of metadata standards and schemas. The advanced search should be available both through a user interface and through a syntax option. The interface could be as follows:
+
+
+
+![image](https://user-images.githubusercontent.com/35276300/229806372-8c33d0ca-5d3e-48b1-af4f-5a0405c30c22.png){width=85%} +
+
+ +This would correspond to the following syntax that the user could enter directly in the search box (and save and/or share with others): +
+
+**title:"demographic transition" AND country:(*Kenya*) AND body:(poverty)** +
+
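+
+Behind a form like the one above, the fielded boolean syntax can be passed more or less directly to the search engine. The sketch below is illustrative (the field names `title`, `nation`, and `body` are assumptions that depend on how the catalog metadata is indexed); it uses Elasticsearch's `query_string` query, which supports this syntax natively.
+
+```python
+from elasticsearch import Elasticsearch
+
+es = Elasticsearch("http://localhost:9200")
+
+# The syntax a user could type directly, or that the advanced-search form generates
+advanced_query = 'title:"demographic transition" AND nation:(Kenya) AND body:(poverty)'
+
+response = es.search(
+    index="catalog",
+    query={
+        "query_string": {
+            "query": advanced_query,
+            "default_field": "body",  # field searched when none is specified
+        }
+    },
+)
+
+for hit in response["hits"]["hits"]:
+    print(hit["_source"].get("title"))
+```
+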
+ + +### Document as a query + +A search engine with semantic search capability should be able to process short or long queries, even accepting a document (a PDF or a TXT file) as a query. The search engine will then first analyze the semantic content of the document, convert it into an embedding vector, and identify the closest resources available in the catalog. + +
+
+![image](https://user-images.githubusercontent.com/35276300/229806674-941ac085-6f6f-45e8-bfa4-d0834cf73587.png) +
+
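+
+A hedged sketch of how such a "document as a query" feature could work: the text of the submitted document and the metadata of each catalog entry are converted to embedding vectors, and entries are ranked by cosine similarity. The model name, file name, and metadata fields below are illustrative; the example assumes the `sentence-transformers` package.
+
+```python
+from sentence_transformers import SentenceTransformer, util
+
+# Illustrative model choice; any sentence-embedding model could be used
+model = SentenceTransformer("all-MiniLM-L6-v2")
+
+# Text extracted from the document submitted as a query (e.g., from a PDF)
+document_text = open("uploaded_paper.txt", encoding="utf-8").read()
+
+# Metadata of catalog entries (in practice: titles, abstracts, keywords, etc.)
+catalog_entries = [
+    {"id": "SURVEY-001", "metadata": "Demographic and Health Survey 2020, anthropometry ..."},
+    {"id": "DOC-042", "metadata": "Working paper on child malnutrition and poverty ..."},
+]
+
+query_vector = model.encode(document_text, convert_to_tensor=True)
+entry_vectors = model.encode(
+    [entry["metadata"] for entry in catalog_entries], convert_to_tensor=True
+)
+
+# Cosine similarity between the document and every catalog entry
+scores = util.cos_sim(query_vector, entry_vectors)[0]
+
+for entry, score in sorted(zip(catalog_entries, scores.tolist()), key=lambda x: -x[1]):
+    print(f"{entry['id']}: {score:.3f}")
+```
+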
+ + +### Geographic search + +Data catalogs receive numerous queries that are related to a particular geography. Analysis of millions of queries from the World Bank (WB) and International Monetary Fund (IMF) data catalogs revealed that a significant percentage of queries consist of a single country name. For data catalogs that cover multiple countries, creating a "Country page" can provide a quick overview of the most recent and popular datasets of different types, which many users may find helpful. + +However, geography is not limited to countries alone. Many users may be interested in sub-national data or geographic areas that do not correspond to administrative areas, such as a watershed or an ocean. Especially when a data catalog contains geographic datasets, it is recommended to provide specialized search tools. Most metadata standards allow the use of bounding boxes to specify geographic coverage, which could be used to develop a "search" tool that enables a user to draw a box on a map. But this option is very imperfect (explain why). + +Example from data.gov (https://catalog.data.gov/dataset/?metadata_type=geospatial) +
+
+![image](https://user-images.githubusercontent.com/35276300/230094206-ff3bca7b-58ee-4061-ab0c-7777d9286813.png) +
+
+
+For geographic datasets, geographic indexing is recommended. The H3 index is a powerful option. (describe)
+
+Also, one must take into account that many users will rely on a keyword search to identify data. For example, a raster image of the Philippines (e.g., a dataset derived from satellite imagery) will contain the country name in the metadata, but the metadata cannot contain the names of all geographic areas covered by the data. A user looking for "Iloilo", for example, would not find this relevant dataset based on a simple keyword search. The solution would be for the search engine to parse the query, detect if it contains the name of a geographic area, automatically identify the area (polygon of geographic coordinates) that corresponds to it (possibly using an API built around Nominatim), and retrieve resources in the catalog that cover the area (which requires that the datasets in the catalog be indexed geographically).
+
+(describe how this works - illustrate from our KCP project "Indexing the world").
+
+
+
+![image](https://user-images.githubusercontent.com/35276300/230091095-d63c8b8f-7684-41db-b347-d75ded1dc95a.png) +
+
+ +Example of use of Nominatim: The Nominatim application shows the polygon boundary for the search query “Iloilo City” automatically provided by the API. + +
+
+![image](https://user-images.githubusercontent.com/35276300/230091354-b44c38fa-f628-4693-97bb-f49fb4f23b3e.png) +
+
+ +The search API endpoint of Nominatim returns this JSON data which can be processed to generate search cell(s). + +
+
+![image](https://user-images.githubusercontent.com/35276300/230091598-fee71949-29d2-4bac-b60f-dd8efb49278f.png) +
+
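+
+The workflow described above can be sketched as follows. This is a minimal illustration, not a production implementation: it uses the public Nominatim search API to resolve a place name into a GeoJSON polygon, and the `h3` package (v3 API assumed) to cover that polygon with index cells that can be matched against the pre-computed cells of geographically indexed datasets. The resolution level and cell values are illustrative.
+
+```python
+import requests
+import h3  # h3-py, v3 API assumed
+
+# 1. Resolve the place name to a GeoJSON polygon with Nominatim
+resp = requests.get(
+    "https://nominatim.openstreetmap.org/search",
+    params={"q": "Iloilo City", "format": "json", "polygon_geojson": 1, "limit": 1},
+    headers={"User-Agent": "catalog-geo-search-demo"},  # identification required by Nominatim's usage policy
+)
+place = resp.json()[0]
+polygon = place["geojson"]  # GeoJSON geometry of the matched area
+
+# 2. Cover the polygon with H3 cells at an illustrative resolution
+#    (a MultiPolygon result would need to be split into individual polygons first)
+query_cells = h3.polyfill(polygon, 7, geo_json_conformant=True)
+
+# 3. Compare with the H3 cells stored in the catalog's geographic index
+dataset_cells = {"87694e64dffffff", "87694e641ffffff"}  # illustrative values
+if query_cells & dataset_cells:
+    print("The dataset covers the area searched by the user")
+```
+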
+
+
+### Semantic search and recommendations
+
+There are two types of search engines: lexical and semantic. The former matches literal terms in the query to the search engine's index, while the latter aims to identify datasets that have semantically similar metadata to the query. While an ideal data catalog would offer both types of search engines, implementing semantic searchability can be complex.
+
+(explain how semantic search works for different data types - with embeddings and vector indexing and cosine similarity - use of API)
+
+For microdata: embeddings based on thematic variable groupings - an option to implement semantic search and recommendations
+Discovery of microdata poses specific challenges. Typically, a data dictionary will be available, with variables organized by data file. A "virtual" organization of variables by thematic group, with a pre-defined ontology, can significantly improve data discoverability. AI solutions can be used to generate such groupings and map variables to them. The DDI metadata standard provides the metadata elements needed to store information on variable groups.
+
+
+### Customized views
+
+Build your own dashboards
+- Allow users to set preferences: thematic, data type, geographies, search query
+- Have a page where pre-designed dashboards (country/thematic pages) and custom dashboards are accessible
+- Allow sharing of dashboards
+- Core idea: all data and metadata accessible via API; platform operates as a service to feed dashboards (within the platform or external)
+
+
+### Data and metadata as a service
+
+- Maintain a data service: let external users build dashboards/platforms dynamically connected via API; one organization cannot customize its platform for all communities of users.
+
+
+### Query user interface
+
+For time series only
+
+
+### Ranking results
+
+A search engine not only needs to identify relevant datasets but also must return the results in a proper order of relevance, with the most relevant results at the top of the list. If users fail to find a relevant response among the top results, they may choose to search for data elsewhere. The ability of a search engine to return relevant results in the optimal rank depends on the metadata's content and structure. To optimize the ranking of results, a lot of relevance engineering is required, including tuning advanced search tools like Solr or ElasticSearch. Large data catalogs managed by well-resourced agencies can leverage data scientists to explore the possibility of using machine learning solutions such as "learn-to-rank" to improve result ranking. See the section "Improved results ranking" below. For more detailed information, see D. Turnbull and J. Berryman's (2016) in-depth description of tools and methods.
+
+Keyword-based searches can be optimized using tools like Solr or ElasticSearch. Out-of-the-box solutions, such as those provided by SQL databases, rarely deliver satisfactory results. Structured metadata can help optimize search engines and the ranking of results by allowing for the boosting of specific metadata elements. For instance, a query term found in the *title* of a dataset would carry more weight than if it were found in the *notes* element, and the results would be ranked accordingly. Similarly, a country name found in the *nation* or *reference country* metadata elements should be given more weight than if it were found in a variable description.
Advanced indexing tools like Solr and ElasticSearch provide boosting functionalities to fine-tune search engines and enhance result relevancy. + + +### Filtering results + +Facets or filters are useful for narrowing down datasets based on specific metadata categories. For instance, in a data catalog with datasets from different countries, a "country" facet can help users find relevant datasets quickly. To be effective, filters should be based on metadata elements that have a limited number of categories and a predictable set of options. Controlled vocabularies can be used to enable such filters. Furthermore, as some metadata elements are specific to particular data types, contextual facets should be integrated into the catalog's user interface to offer relevant filters based on the type of data being searched. + +
+![](./images/catalog_facets_01.JPG){width=100%} +
+
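+
+The boosting and faceting described above can be combined in a single request to the search engine. The sketch below is illustrative (field names and boost factors are assumptions): terms found in `title` weigh three times more than terms found in `notes`, and a terms aggregation on a `nation` field produces the counts needed to populate a country facet like the one shown in the screenshot.
+
+```python
+from elasticsearch import Elasticsearch
+
+es = Elasticsearch("http://localhost:9200")
+
+response = es.search(
+    index="catalog",
+    # Boost matches found in the title relative to matches found in the notes
+    query={"multi_match": {"query": "poverty", "fields": ["title^3", "notes"]}},
+    # Terms aggregation used to populate a "country" facet
+    aggs={"countries": {"terms": {"field": "nation.keyword", "size": 20}}},
+)
+
+for bucket in response["aggregations"]["countries"]["buckets"]:
+    print(bucket["key"], bucket["doc_count"])
+```
+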
+ +Tags and tag groups (which are available in all schemas we recommend) provide much flexibility to implement facets, as we showed in section 1.7. + +(use pills / ...) + + +### Sorting results + +Sorting results + + +### Collections + +Organize entries by collections + + +### Linking results + +Not all data catalog users know exactly what they are looking for and may need to explore the catalog to find relevant resources. E-commerce platforms use recommender systems to suggest products to customers, and data catalogs should have a similar commitment to bringing relevant resources to users' attention. To achieve this, modern data catalogs display relationships between entries, which may involve data of different types, such as microdata files, analytical scripts, and working papers. + +These relationships can be documented in the metadata, such as identifying datasets as part of a series or new versions of a previous dataset. When relationships are not known or documented, machine learning tools such as topic models and word embedding models can be used to establish the topical or semantic closeness between resources of different types. This can be used to implement a recommender system in data catalogs, which automatically identifies and displays related documents and data for a given resource. The image below shows how "related documents" and "related data" can be automatically identified and displayed for a resource (in this case a document). + +
+![](./images/catalog_related_01.JPG){width=100%} +
+
+ + +### Organized results + +When a data catalog contains multiple types of data, it should offer an easy way for users to filter and display query results by data type. For example, when searching for "US population," one user may only be interested in knowing the total population of the USA, while another may need the public use census microdata sample, and a third may be searching for a publication. To cater to such needs, presenting query results in type-specific tabs (with an "All" option) and/or providing a filter (facet) by type will allow users to focus on the types of data relevant to them. This is similar to commercial platforms that offer search results organized by department, allowing users to search for "keyboard" in either the "music" or "electronics" department. + +
+![](./images/catalog_tabs_01.JPG){width=100%} +
+
+ + +### Saving and sharing results + +URL / API query ; export list ; social networks, etc. + + +### Personalized results + +Option for user to set a profile with preferences that may be used to display results. + + +### Metadata display and formats + +To make metadata easily accessible to users, it's important to display it in a convenient way. The display of metadata will vary depending on the data type being used, as each type uses a specific metadata schema. For online catalogs, style sheets can be utilized to control the appearance of the HTML pages. + +In addition to being displayed in HTML format, metadata should be available as electronic files in JSON, XML, and potentially PDF format. Structured metadata provides greater control and flexibility to automatically generate JSON and XML files, as well as format and create PDF outputs. It's important that the JSON and XML files generated by the data catalog comply with the underlying metadata schema and are properly validated. This ensures that the metadata files can be easily and reliably reused and repurposed. + +
+![](./images/catalog_display_01.JPG){width=100%} +
+ +
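+
+Because the exported JSON files should validate against the underlying schema, an automated check is useful. The sketch below uses the Python `jsonschema` package; the metadata record and the schema fragment are purely illustrative and are not one of the standards documented later in this Guide.
+
+```python
+import json
+from jsonschema import validate, ValidationError
+
+# Illustrative fragment of a metadata schema (not an official schema)
+schema = {
+    "type": "object",
+    "required": ["idno", "title"],
+    "properties": {
+        "idno": {"type": "string"},
+        "title": {"type": "string"},
+        "nation": {"type": "array", "items": {"type": "string"}},
+    },
+}
+
+with open("dataset_metadata.json", encoding="utf-8") as f:
+    record = json.load(f)
+
+try:
+    validate(instance=record, schema=schema)
+    print("Metadata record is valid")
+except ValidationError as err:
+    print("Invalid metadata:", err.message)
+```
+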
+
+
+### Variable-level comparison
+
+E-commerce platforms commonly allow customers to compare products by displaying their pictures and descriptions (i.e., metadata) side-by-side. Similarly, for data users, the ability to compare datasets can be valuable to evaluate the consistency or comparability of a variable or an indicator over time or across sources and countries. However, implementing this functionality requires detailed and structured metadata at the variable level. Metadata standards that describe data at this level of detail, such as the DDI and ISO 19110/19139, make this feature possible.
+
+In the example below, we show how a query for *water* returns not only a list of seven datasets, but also a list of variables in each dataset that match the query.
+
+
+![](./images/catalog_variable_view_01.JPG){width=100%} +
+
+ +The *variable view* shows that a total of 90 variables match the searched keyword. + +
+![](./images/catalog_variable_view_02.JPG){width=100%} +
+
+ +After selecting the variables of interest, users should be able to display their metadata in a format that facilitates comparison. The availability of detailed metadata is crucial to ensure the quality and usefulness of these comparisons. For example, when working with a survey dataset, capturing information on the variable universe, categories, questions, interviewer instructions, and summary statistics would be ideal. This comprehensive metadata will enable users to make informed decisions about which variables to use and how to analyze them. + +
+![](./images/catalog_variable_view_03.JPG){width=100%} +
+
+ + +### Transparency in access policies + +The terms of use (ideally provided in the form of a standard license) and the conditions of access to data should be made transparent and visible in the data catalog. The access policy will preferably be provided using a controlled vocabulary, which can be used to enable a facet (filter) as shown in the screenshot below. + +
+![](./images/catalog_access_policy_01.JPG){width=100%} +
+
+
+
+### Data and metadata API
+
+To keep up with modern data management needs, a comprehensive data catalog must provide users with convenient access to both data and metadata through an application programming interface (API). The structured metadata in a catalog allows users to extract specific components of the metadata they need, such as the identifier and title of all microdata and geographic datasets produced after a certain year. With an API, users can easily and automatically access datasets or subsets of datasets they require. This also enables internal features of the catalog such as dynamic visualizations and data previews, making data management more efficient. It is crucial that detailed documentation and guidelines on the use of the data and metadata API are provided to users to maximize the benefits of this feature.
+
+- Metadata (and data) should be accessible via API.
+- The API should be well documented, with examples.
+- An API query builder (a UI for building an API query) can assist users.
+
+
+### Online data access forms
+
+Make the process of registration and requests fully digital, easy, and fully traceable.
+
+
+#### Bulk download option
+
+Even when a user interface, visualizations, and other features are provided, many users just want to download the data and metadata.
+(...)
+
+
+### Data preview
+
+When the data (time series and tabular data, possibly also microdata) are made available via API, the data catalog can also provide a data preview option, and possibly a data extraction option, to the users. Multiple JavaScript tools, some of them open-source, are available to easily embed data grids in catalog pages.
+
+
+![](./images/catalog_data_preview_01.JPG){width=80%} +
+
+For a document, the "data preview" would consist of a document viewer that allows the user to view the document within the application (even when the document is not stored in the catalog itself but on an external website). When implementing such a feature, check that the terms of use of the originating source allow it.
+
+
+
+![image](https://user-images.githubusercontent.com/35276300/230733447-55c75dbb-5e5c-4788-9e58-ae4fca646a85.png) +
+
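+
+As an illustration of how API access enables features such as previews and extraction, the sketch below retrieves a subset of observations from a catalog's data API and loads it into a table that a preview grid could display. The base URL, endpoint, and parameter names are hypothetical and would depend on the catalog platform.
+
+```python
+import requests
+import pandas as pd
+
+BASE_URL = "https://datacatalog.example.org/api"  # hypothetical catalog API
+
+# Retrieve a subset of observations for one indicator (illustrative endpoint and parameters)
+resp = requests.get(
+    f"{BASE_URL}/data/POP_TOTAL",
+    params={"country": "KEN", "from": 2000, "to": 2023, "format": "json"},
+)
+resp.raise_for_status()
+
+observations = resp.json()["data"]   # assumed response structure
+table = pd.DataFrame(observations)   # grid that a preview widget could display
+print(table.head())
+```
+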
+
+
+### Data extraction
+
+For some data (microdata / time series), provide a simple way for users to extract specific variables / observations.
+
+
+### Data visualizations
+
+Embedding visualizations in a data catalog can greatly enhance its usefulness. Different types of data require different types of visualizations. For instance, time series data can be effectively displayed using a line chart, while images with geographic information can be displayed on a map that shows the location of the image capture. For more complex data, other types of charts can be created as well. However, in order to embed dynamic charts in a catalog page, the data needs to be available via API. A good data catalog should offer flexibility in the types of charts and maps that can be embedded in a metadata page. For instance, the NADA catalog provides catalog administrators with the ability to create visualizations using various tools. By including visualizations in a data catalog, users are able to quickly and easily understand the data and gain insights from it.
+
+The NADA catalog allows catalog administrators to generate such visualizations using different tools of their choice. The examples below were generated using the open-source [Apache ECharts](https://echarts.apache.org/en/index.html) library.
+
+
+*Example: Line chart for a time series* + +
+![](./images/catalog_visualization_03.JPG){width=100%} +
+ +
+*Example: Geo-location of an image* + +
+![](./images/catalog_visualization_05.JPG){width=100%} +
+
+
+
+### Permanent URLs
+
+To ensure efficient management and organization of datasets within a data catalog, it is essential to assign a unique identifier to each dataset. This identifier should not only meet technical requirements but also serve other purposes such as facilitating dataset citation. To achieve maximum effectiveness, it is recommended that datasets have a globally unique identifier, which can be accomplished through the assignment of a Digital Object Identifier (DOI). DOIs can be generated in addition to a catalog-specific unique identifier and provide a permanent and persistent identifier for the dataset. For more information about the process of generating DOIs and the reasons to use them, visit the [DataCite website](https://datacite.org/).
+
+Include a citation requirement in metadata.
+
+
+### Archive / tombstone
+
+When a dataset is removed or replaced, the reproducibility of some analyses may become impossible. This may be a problem for some users. Unless there is a reason for not making them accessible, old versions of datasets should therefore be kept accessible. But they should not be the ones indexed and displayed in the catalog, to avoid confusion or the risk that a user would exploit a version other than the latest. Moving datasets that are replaced to an archive section of the catalog (not indexed) is an option. Note that DOIs require a permanent web page.
+
+
+### Catalog of citations
+
+A data catalog should not be limited to data. Ideally, the scripts produced by researchers to analyze the data, and the output of their analysis, should also be available. An ideal data catalog will allow a user to:
+
+- search for data, and find/access the related scripts and citations
+- search for a document (analytical output), and find/access the related data and scripts
+- search for a script, and find/access the data and analytical output
+
+Maintain a catalog of citations of datasets.
+
+
+
+![image](https://user-images.githubusercontent.com/35276300/229811421-fbda05da-2390-42c5-815c-5fcbc90d9ee1.png) +
+
+ + +### Reproducible and replicable scripts + +Document, catalog, and publish reproducible/replicable scripts. + +
+
+![image](https://user-images.githubusercontent.com/35276300/229810244-f68655ee-5173-444a-a4c6-5c2446a5361d.png) +
+
+
+
+### Notifications or alerts
+
+Users may want to be automatically notified (by email) when new entries of interest are added, or when changes are made to a specific resource. A system allowing users to set criteria for automatic notification can be developed.
+
+Example of Google Scholar alerts:
+
+
+
+![image](https://user-images.githubusercontent.com/35276300/230730245-ea3702f6-b877-436a-9833-492afafa0270.png) +
+
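+
+A notification system of this kind can be as simple as periodically re-running each user's saved query against the catalog's search API and emailing any new matches. The sketch below is illustrative only; the search endpoint, the response structure, and the mail relay are assumptions.
+
+```python
+import smtplib
+from email.message import EmailMessage
+
+import requests
+
+CATALOG_API = "https://datacatalog.example.org/api/search"  # hypothetical endpoint
+
+# Saved alerts: user email, saved query, and identifiers already notified
+alerts = [
+    {"email": "researcher@example.org", "query": "food security", "seen": {"SURVEY-001"}},
+]
+
+for alert in alerts:
+    results = requests.get(CATALOG_API, params={"q": alert["query"]}).json()["results"]
+    new_entries = [r for r in results if r["id"] not in alert["seen"]]
+    if not new_entries:
+        continue
+    msg = EmailMessage()
+    msg["Subject"] = f"New catalog entries for '{alert['query']}'"
+    msg["From"] = "catalog@example.org"
+    msg["To"] = alert["email"]
+    msg.set_content("\n".join(f"- {r['title']} ({r['id']})" for r in new_entries))
+    with smtplib.SMTP("localhost") as smtp:  # assumed local mail relay
+        smtp.send_message(msg)
+    alert["seen"].update(r["id"] for r in new_entries)
+```
+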
+
+### Providing feedback
+
+Users should be able to provide feedback on the catalog, in the form of a "Contact" email and possibly a "feedback form". If the platform itself is open source, GitHub can also be used to collect issues and suggestions on the application itself.
+
+A users' forum with "reviews", as on e-commerce platforms, is however not always recommended. Not all users are constructive and qualified, reviews require moderation (which can be costly and controversial), and they may create disincentives for data producers to publish their data. Reviews could be a good option for data platforms that are internal to an organization (where comments are attributed, and an authentication system controls who can provide feedback), but not for public data platforms.
+
+
+### Getting support
+
+Contact, responsive
+FAQs
+
+
+### Web content accessibility
+
+The Web Content Accessibility Guidelines (WCAG) are an international standard. The WCAG documents explain how to make web content more accessible to people with disabilities. The Americans with Disabilities Act (ADA) provides people with disabilities the same opportunities, free of discrimination. WCAG is a compilation of accessibility guidelines for websites, whereas the ADA is a civil rights law covering the same domain.
+
+
+## Features for data providers
+
+When the data catalog is not administered by the producer of the data but by an entrusted repository, data providers want:
+
+
+### Safety
+
+- Safety, protection against reputation risk (responsible use of data)
+- Guarantee that regulations and terms of use will be strictly complied with; reputation of the organization that manages the catalog (Seal of Approval or other accreditation; properly staffed)
+
+### Visibility
+
+- Visibility to maximize the use of data (including options to share/publicize on social media) - screenshot from data.gov
+
+
+
+![image](https://user-images.githubusercontent.com/35276300/230095637-85901bdc-857a-4d23-a55c-7f67ffbf7a4a.png) +
+
+
+### Low burden
+
+"Do not disturb": low burden of deposit and no burden of serving users (minimum interaction with users; providing detailed metadata helps)
+
+
+### Real time information on usage
+
+Monitoring of usage (downloads and citations) to assess demand; reports on this (automatically generated)
+
+
+### Feedback from users
+
+Feedback on quality issues
+
+
+## Features for catalog administrators
+
+In addition to meeting the needs of its users, a modern data catalog should also offer features that a catalog administrator can appreciate or expect. The features listed below can serve as a checklist for the choice of an application or the development of features. These features may include:
+
+
+### Data deposit
+
+A user-friendly interface for data deposit, compliant with metadata standards, with embedded quality gateways and clearance procedures.
+
+### Privacy protection
+
+Tools for privacy protection control (e.g., tools to identify direct identifiers)
+
+
+### Free software
+
+Availability of the application as open-source software, accompanied by detailed technical documentation
+
+
+### Security
+
+Robust security measures, such as compatibility with advanced authentication systems, flexible role/profile definitions, regular upgrades and security patches, and accreditation by information security experts
+
+
+### IT affordability
+
+Reasonable IT requirements, such as shared server operability and sufficient memory capacity
+
+
+### Ease of maintenance
+
+Ease of upgrading to the latest version
+
+
+### Interoperability
+
+Interoperability with other catalogs and applications, as well as compliance with metadata standards. By publishing metadata across multiple catalogs and hubs, data visibility can be increased, and the service provided to users can be maximized. This requires automation to ensure proper synchronization between catalogs (with only one catalog serving as the "owner" of a dataset), which necessitates interoperability between the catalogs, enabled by compliance with common formats and metadata standards and schemas.
+
+
+### Flexibility on access policies
+
+Flexibility in implementing data access policies that conform to the specific procedures and protocols of the organization managing the catalog
+
+
+### API based system for automation and efficiency
+
+Availability of APIs for catalog administration, and easy automation of procedures (harvesting, migration of formats, editing, etc.). This requires an API-based system.
+
+
+### Featuring tools
+
+Ability to feature datasets
+
+
+### Usage monitoring and analytics
+
+Easy activation of usage analytics (using Google Analytics, Omniture, or other)
+
+
+### Multilingual capability
+
+Multilingual capability, including internationalization of the code and the option for catalog administrators to translate or adapt software translations
+
+
+### Embedded SEO
+
+Embedded Search Engine Optimization (SEO) procedures
+
+
+### Widgets and plugins
+
+Ability to use widgets to embed custom charts, maps, and data grids in the catalog
+
+
+### Feedback to developers
+
+Ability to provide feedback and suggestions to the application developers.
+
+
+## Machine learning for a better user experience
+
+In Chapter 1, we emphasized the importance of generating comprehensive metadata and how machine learning can be leveraged to enrich it. Natural language processing (NLP) tools and models, in particular, have been employed to enhance the performance of search engines.
By utilizing machine learning models, semantic search engines and recommender systems can be developed to aid users in locating relevant data. Moreover, machine learning can improve the ranking of search results to ensure that the most pertinent results are brought to users' attention. Google, Bing, and other leading search engines have employed machine learning for years. While specialized data catalogs may not have the resources to implement such advanced systems, catalog administrators should explore opportunities to utilize machine learning to enhance their users' experience. Catalogs can make use of external APIs to exploit machine learning solutions without requiring administrators to develop machine learning expertise or train their own models. For instance, APIs can be used to automatically and instantly translate queries or convert queries into embeddings. Ideally, a global community of practice will develop such APIs, including training NLP models, and provide them as a global public good.
+
+
+### Improved discoverability
+
+In 2019, Google introduced its NLP model, BERT (Bidirectional Encoder Representations from Transformers), as a component of its search engine. Other major companies, such as Amazon, Apple, and Microsoft, are also developing similar models to enhance their search engines. One of the objectives of these companies is to create search engines that can support digital assistants like Siri, Alexa, Cortana, and Hey Google, which operate in a conversational mode and provide answers to users rather than just links to resources. Improving NLP models is a continuous and strategic priority for these companies, as not all answers can be found in textual resources. Google is also conducting research to develop solutions for extracting answers from tabular data.
+
+Specialized data catalogs maintained by data centers, statistical agencies, and other data producers still rely almost exclusively on full-text search engines. The search engine within these catalogs looks for matches between keywords submitted by the user and keywords found in an index, without attempting to understand or improve the user's query. This can result in issues such as misinterpretation of the query, as discussed in Chapter 1, where a search for "Dutch disease" may be mistakenly interpreted as a health-related query rather than a query about an economic concept.
+
+The administrators of these specialized data catalogs often lack the resources to develop and implement the most advanced NLP solutions, and should not be required to do so. To assist them in transitioning from keyword-based search systems to semantic search and recommender systems, open solutions should be developed and published, such as pre-trained NLP models, open source tools, and open APIs. This would necessitate the creation and publishing of global public goods, including specialized corpora and the training of embedding models on these corpora, open NLP models and APIs that data catalogs can utilize to generate embeddings for their metadata, query parsers that can automatically improve/optimize queries and convert them into numeric vectors, and guidelines for implementing semantic search and recommender systems using tools like Solr, ElasticSearch, and Milvus.
+
+Simple models created from open source tools and publicly-available documents can provide straightforward solutions. In the example below, we demonstrate how these models can "understand" the concept of "Dutch disease" and correctly associate it with relevant economic concepts.
+
+
+
+![](./images/word_graph_dutch_disease.JPG){width=100%} +
+
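+
+Related-term graphs like the one above can be produced with a word-embedding model trained on a domain corpus. The sketch below assumes a pre-trained gensim `KeyedVectors` file; the file name and the convention of joining multi-word expressions with underscores during preprocessing are assumptions, not existing resources.
+
+```python
+from gensim.models import KeyedVectors
+
+# Hypothetical embedding model trained on an economics/development corpus,
+# with multi-word expressions joined by underscores during preprocessing
+vectors = KeyedVectors.load("econ_embeddings.kv")
+
+# Terms most similar to "dutch_disease" according to the model
+for term, similarity in vectors.most_similar("dutch_disease", topn=10):
+    print(f"{term:25s} {similarity:.3f}")
+```
+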
+ + +### Improved results ranking + +Effective search engines not only identify relevant resources, but also rank and present them to users in an optimal order of relevance. As highlighted in Chapter 1, [research](https://www.webfx.com/internet-marketing/seo-statistics.html) shows that 75% of search engine users do not click past the first page, emphasizing the importance of ranking and presenting results effectively. + +Data catalog administrators face two challenges in improving their search engine performance. Firstly, they need to improve their ranking in search engines such as Google by enriching metadata and embedding metadata compliant with DCAT or schema.org standards on catalog pages. Secondly, they need to improve the ranking of results returned by their own search engines in response to user queries. + +Google's success in 1996 was largely attributed to their revolutionary approach to ranking search results called *PageRank*. Since then, they and other leading search engines have invested heavily in improving ranking methodologies with advanced techniques like *RankBrain* (introduced in 2015). These approaches include primary, contextual, and user-specific ranking, which utilize machine learning models referred to as Learn to Rank models. [Lucidworks](https://lucidworks.com/post/abcs-learning-to-rank/) provides a clear description of this approach, noting that "Learning to rank (LTR) is a class of algorithmic techniques that apply supervised machine learning to solve ranking problems in search relevancy. In other words, it’s what orders query results. Done well, you have happy employees and customers; done poorly, at best you have frustrations, and worse, they will never return. To perform learning to rank you need access to training data, user behaviors, user profiles, and a powerful search engine such as SOLR. The training data for a learning to rank model consists of a list of results for a query and a relevance rating for each of those results with respect to the query. Data scientists create this training data by examining results and deciding to include or exclude each result from the data set." + +Implementing Learn to Rank models can be challenging for data catalog administrators due to the resource-intensive nature of building the training dataset, fitting models, and implementing them. An alternative solution is to optimize the implementation of Solr or ElasticSearch, which can often contribute significantly to improving the ranking of search results. For more information on the challenge and available tools and methods for relevancy engineering, refer to D. Turnbull and J. Berryman's 2016 publication. + +
+
+![](./images/schema_search_ranking.JPG){width=100%} +
+
+
+
+## Cataloguing tools
+
+The examples we provided in this chapter are taken from our NADA cataloguing application. Other open-source cataloguing applications are available, including CKAN, GeoNetwork, and Dataverse.
+
+**CKAN**
+
+[CKAN](https://ckan.org/) is a data management system that provides a platform for cataloging, storing and accessing datasets with a rich front-end, full API (for both data and catalog), visualization tools and more. CKAN is open source software held in trust by the Open Knowledge Foundation and is licensed under the GNU Affero General Public License (AGPL) v3.0. CKAN is used by some of the leading open data platforms, such as the [US data.gov](https://www.data.gov/) or the [OCHA Humanitarian Data Exchange](https://data.humdata.org/). CKAN does not require that the metadata comply with any metadata standard (which brings flexibility, but at a cost in terms of discoverability and quality control), but organizes the metadata in the following elements (information extracted from [CKAN on-line documentation](https://docs.ckan.org/en/2.9/)):
+
+ - *Title*: allows intuitive labeling of the dataset for search, sharing and linking.
+ - *Unique identifier*: dataset has a unique URL which is customizable by the publisher.
+ - *Groups*: display of which groups the dataset belongs to if applicable. Groups (such as science data) allow easier data linking, finding and sharing among interested publishers and users.
+ - *Description*: additional information describing or analyzing the data. This can either be static or an editable wiki which anyone can contribute to instantly or via admin moderation.
+ - *Data preview*: preview [.csv] data quickly and easily in browser to see if this is the dataset you want.
+ - *Revision history*: CKAN allows you to display a revision history for datasets which are freely editable by users.
+ - *Extra fields*: these hold any additional information, such as location data (see geospatial feature) or types relevant to the publisher or dataset. How and where extra fields display is customizable.
+ - *Licence*: instant view of whether the data is available under an open license or not. This makes it clear to users whether they have the rights to use, change and re-distribute the data.
+ - *Tags*: see what labels the dataset in question belongs to. Tags also allow for browsing between similarly tagged datasets in addition to enabling better discoverability through tag search and faceting by tags.
+ - *Multiple formats* (if provided): see the different formats the data has been made available in quickly in a table, with any further information relating to specific files provided inline.
+ - *API key*: allows access to every metadata field of the dataset and the ability to change the data via API if you have the relevant permissions.
+
+The *extra fields* section allows ingestion of structured metadata, which makes it relatively easy to export data and metadata from NADA to CKAN. Importing data and metadata from CKAN to NADA is also possible (using the catalogs' respective APIs), but with a reduced metadata structure.
+
+**GeoNetwork**
+
+[GeoNetwork](https://geonetwork-opensource.org/) is a cataloguing tool for geographic data and services (not for other types of data), which includes a specialized metadata editor. According to its website, "It provides powerful metadata editing and search functions as well as an interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world. (...)
The metadata editor support ISO19115/119/110 standards used for spatial resources and also Dublin Core format usually used for opendata portals."

**Dataverse**

The [Dataverse Project](https://dataverse.org/about) is led by the Institute for Quantitative Social Science (IQSS). Dataverse makes use of the DDI Codebook and Dublin Core metadata standards. According to its website, Dataverse "is an open source web application to share, preserve, cite, explore, and analyze research data. (...) The central insight behind the Dataverse Project is to automate much of the job of the professional archivist, and to provide services for and to distribute credit to the data creator."

"The Institute for Quantitative Social Science (IQSS) collaborates with the Harvard University Library and Harvard University Information Technology organization to make the installation of the Harvard Dataverse Repository openly available to researchers and data collectors worldwide from all disciplines, to deposit data. IQSS leads the development of the open source Dataverse Project software and, with the Open Data Assistance Program at Harvard (a collaboration with Harvard Library, the Office for Scholarly Communication and IQSS), provides user support."

diff --git a/03_chapter03_rich_structured_metadata.md b/03_chapter03_rich_structured_metadata.md
new file mode 100644
index 0000000..10c989f
--- /dev/null
+++ b/03_chapter03_rich_structured_metadata.md
@@ -0,0 +1,751 @@
---
output: html_document
---

# The power of rich, structured metadata {#chapter03}

The previous chapter defined the features of an advanced data discoverability and dissemination solution. What enables such a solution is not only the algorithms and technology, but also the quality of the metadata available to feed them. Metadata are defined as "... structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage that resource" (Data Thesaurus, NIH, https://nnlm.gov/data/thesaurus). Metadata must be findable by machines and usable by humans. This chapter describes what metadata are needed, and how they can be organized and improved to fully enable the search and recommender tools. The metadata must be rich and structured. To make them rich, machine learning can be used. To ensure consistent structure, the use of metadata standards and schemas is highly recommended. In this chapter, we build the case for rich, augmented, structured metadata and for the adoption of metadata standards and schemas. The second part of this Guide will provide a detailed description of each recommended standard or schema, for different data types.

## Rich metadata

*Rich* metadata means detailed and comprehensive metadata. Rich metadata are beneficial to both the users and the providers (producers and curators) of data.

### Benefits for data users

Being provided with rich metadata helps data users:

 - **Find** data of interest. The metadata provide much of the content that the search engine will be able to index and discover. The richer the metadata, the better the search engine will be able to help users identify relevant data.
 - **Understand** what the data are measuring and how they have been created. Without a proper description of the data, the risk is high that a user will misunderstand and possibly misuse them, or simply decide not to make use of them.
+ - **Assess** the quality of the data, including their reliability, fitness for purpose, and consistency with other datasets when the purpose requires integration of multiple datasets. + +### Benefits for data producers + +For the data producers, rich metadata will contribute to: + + - Ensure **transparency, auditability, and credibility** of the data and of the derived products. + - Increase the **visibility** of the data, and thus the demand for, and use of the data. + - **Reduce the cost** of operating a data dissemination service by lowering the burden of responding to users' requests for information. + - Support the preservation of **institutional memory**. + - Provide the meta-database needed to **harmonize data collection** methods and instruments, e.g., by providing convenient tools to compare variables across datasets. A compelling case for rich metadata for transparency and harmonization can be found in ["The Struggle for Integration and Harmonization of Social Statistics in +a Statistical Agency - A Case Study of Statistics Canada"](https://www.ihsn.org/sites/default/files/resources/IHSN-WP004.pdf) by Gordon Priest (2010).
+
+ ![](./images/compare_variables_IHSN.JPG){width=90%} +
+ +### Scope of the metadata + +What makes metadata "rich and comprehensive" is not always easy to define, and is specific to each data type. Microdata and geospatial datasets for example will require much more -- and different-- metadata than a document or an image. Metadata standards and schemas provide data curators with detailed lists of *elements* (or *fields*), specific to each data type, that must or may be provided to document a dataset. The metadata elements included in a standard or schema will typically cover *cataloguing material*, *contextual information*, and *explanatory materials*. + +#### Cataloguing material + +Cataloguing material includes elements such as a title, a unique identifier for the dataset, a version number and description, as well as information related to the data curation (including who generated the metadata and when, or where and when metadata may have been harvested from an external catalog). This information allows the dataset to be uniquely identified within a collection/catalog, and serves as a bibliographic record of the dataset, allowing it to be properly acknowledged and cited in publications. + +#### Contextual information + +Contextual information describes the context in which the data were collected and how they were put to use. It enables secondary users to understand the background and processes behind the data production. Contextual information should cover topics such as: + + - What justified or required the data collection (the objectives of the data production exercise); + - Who or what was being studied; + - The geographic and temporal coverage of the data; + - Changes and developments that occurred over time in the data collection methodology and in the dataset, if relevant. For repeated cross-section, panel, or time series datasets, this may include information describing changes in the question text, variable labeling, sampling procedures, or others; + - What are the key output of the data collection, such as a publication, the design or implementation of a policy or project, etc. + - Problems encountered in the process of data collection, entry, checking, and cleaning; + - Other useful information on the life cycle of the dataset. + +#### Explanatory material + +Explanatory materials are the information that should be created and preserved to ensure the long-term functionality of a dataset and its contents. This applies mostly to microdata, geospatial data, and to some extent to tabulations and to time series and indicators databases. It is less relevant for images, videos, and documents. Explanatory materials include: + + - **Information about the data collection methods**: This section should describe the instruments used and methods employed, and how they were developed. If applicable, details of the sampling design and sampling frames should be included. It is also useful to include information on any monitoring process undertaken during the data collection as well as details of quality controls. + - **Information about the structure of the dataset**: Key to this information is a detailed data dictionary describing the structure of the dataset, including information about relationships between individual files or records within the study. For example, it should include key variables required for unique identification of subjects across files (required to properly merge data files), the number of cases and variables in each file, and the number of files in the dataset. 
For relational models, the structure and relations between datasets records and elements should be described. + - **Technical information**: This information relates to the technical framework and should include the computer system used to generate the data and related files; the software packages with which the files were created. + - **Variables and values, coding and classification schemes** (for microdata and geospatial data): The documentation should contain an exhaustive list of variables in the dataset, including a complete explanation and full details about the coding and classifications used for the information allocated to those fields. It is especially important to have blank and missing fields explained and accounted for. It is helpful to identify variables to which standard coding classifications apply, and to record the version of the classification scheme used. + - **Information about derived variables** (for microdata and geospatial data, and tabulations): Many data producers derive new variables from original data. This may be as simple as grouping raw age (in years) data according to groups of years appropriate for the survey, or it may be much more complex and require the use of sophisticated algorithms. When grouped or derived variables are created, it is important that the logic for the grouping or derivation is clear. Simple grouping, such as for age, can be included within the data dictionary. More complex derivations require other means of recording the information. Sufficient supporting information should be provided to allow an easy link between the core variables used and the resultant variables. In addition, computer algorithms used to create the derivations should be saved together with information on the software. + - **Weighting and grossing** (for sample survey microdata): Weighting and grossing variables must be fully documented, with explanations of the construction of the variables and clear indications of the circumstances in which they should be used. The latter is particularly important when different weights are applied for different purposes. + - **Data source**: Details about the source from which the data is derived should be included. For example, when the data source consists of responses to survey questionnaires, each question should be carefully recorded in the documentation. Ideally, the text will include a reference to the generated variable(s). It is also useful to explain the conditions under which a question would be asked, including, if possible, the cases to which it applies and, ideally, a summary of response statistics. + - **Confidentiality and anonymization**: It is important to determine whether the data contains any confidential information on individuals, households, organizations, or institutions. If so, such information should be recorded together with any agreement on how to use the data, such as with survey respondents. Issues of confidentiality may restrict the analyses to be undertaken or results to be published, particularly if the data is to be made available for secondary use. If the data were anonymized to prevent identification, it is wise to record the anonymization procedure (taking care of not providing information that would enable a reverse-engineering of the procedure) and its impact on the data, as such modification may restrict subsequent analysis. + +### Controlled vocabularies + +Metadata standards and schemas provide lists of elements with a description of the expected content to be captured in each element. 
For some elements, it may be appropriate to restrict the valid content to pre-selected options or "controlled vocabularies". A controlled vocabulary is a pre-defined list of values that can be accepted as valid content for some elements. For example, a metadata element "data type" should not be populated with free text, but should make use of a pre-defined taxonomy of data types. The use of controlled vocabularies (for selected metadata elements) is particularly useful for implementing search and filter features in data catalogs (see section 3.1.1 of this Guide), and for fostering inter-operability of data catalogs.

:::quote
In library and information science, controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. ([Wikipedia](https://en.wikipedia.org/wiki/Controlled_vocabulary))
:::

Controlled vocabularies can be specific to an agency, or be developed by a community of practice. For example, the list of countries and codes provided by [ISO 3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) can be used as a controlled vocabulary for a metadata element `country` or `nation`; the [ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) list of languages can be used as a controlled vocabulary for a metadata element `language`; and the [CESSDA topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification) can be used as a controlled vocabulary for the element `topics` found in most metadata schemas. When a controlled vocabulary is used in a metadata standard or schema, it is good practice to include an identification of its origin and version.

Some recommended controlled vocabularies are included in the description of the ISO 19139 standard for geographic data and services (see chapter 6). Most standards and schemas we recommend also include a `topics` element. Annex 1 provides a description of the CESSDA topics classification.

Ideally, controlled vocabularies will be developed in compliance with the [FAIR principles](https://www.go-fair.org/fair-principles/) for scientific data management and stewardship: **F**indability, **A**ccessibility, **I**nteroperability, and **R**euse.
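
As an illustration, the sketch below shows how a `topics` element referencing a controlled vocabulary could be captured in JSON. The sub-element names used here (`topic`, `vocabulary`, `uri`) are illustrative assumptions; the exact names may differ slightly across the schemas described later in this Guide.

```json
"topics": [
  {
    "topic": "Demography and population",
    "vocabulary": "CESSDA Topic Classification",
    "uri": "https://vocabularies.cessda.eu/vocabulary/TopicClassification"
  },
  {
    "topic": "Health",
    "vocabulary": "CESSDA Topic Classification",
    "uri": "https://vocabularies.cessda.eu/vocabulary/TopicClassification"
  }
]
```
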

### Tags

All metadata standards and schemas described in this guide include a `tags` element, even when this element is not part of a standard. This element enables the implementation of filters (facets) in data cataloguing applications, in a flexible manner. The `tags` metadata element is repeatable (meaning that more than one tag can be attached to a dataset) and contains two sub-elements to capture a `tag` (word or phrase), and the `tag_group` (if any) it belongs to.

![](./images/reDoc_tags.JPG){width=100%}
+ +To illustrate the use of tags, let's assume that a catalog contains datasets that are available freely, and others that are available for a fee. The catalog administrator may want to provide a filter (facet) in the user interface that would allow users to filter datasets based on their free or not free status. None of the metadata schemas we describe in the Guide contains an element specifically designed to indicate the "free" or "for a fee" nature of the data. But this information can be captured in a tag "*Free*" or "*For a fee*" that would be added to each dataset in the catalog, with a tag group that could be named "free_or_fee". In R, this would be done as follows (for a "Free" dataset): + + +```r +# ... , +tags = list( + list(tag = "Free", tag_group = "free_or_fee") +) +# ... +``` + +In the NADA catalog, a facet titled "Free or for a fee" can then be created based on the information found in the `tags` element where `tag_group` = "free_or_fee". + +
+![](./images/reDoc_tags_2.JPG){width=100%} +
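
Once stored in the catalog's metadata, the tag added in the R snippet above would appear along the following lines (a minimal JSON sketch; the exact nesting depends on the schema used for the dataset):

```json
"tags": [
  {
    "tag": "Free",
    "tag_group": "free_or_fee"
  }
]
```
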

## Structured metadata

### What structure?

Metadata should not only be comprehensive and detailed, they should also be organized in a **structured** manner, preferably using a standardized structure. **Structured metadata** means that the metadata are stored in specific *fields* (or *elements*) organized in a metadata schema. **Standardized** means that the list and description of elements are commonly agreed by a community of practice.

"A metadata schema is a system that defines the data elements needed to describe a particular object, such as a certain type of research data." (Ten rules of data discovery - reference to be added)

Some metadata standards have originated from academic data centers, like the [Data Documentation Initiative (DDI)](https://ddialliance.org/), maintained by the [Inter-University Consortium for Political and Social Research](https://www.icpsr.umich.edu/web/pages/) (ICPSR) at the University of Michigan. Others found their origins in specialized communities of practice (like the [ISO 19139](https://www.iso.org/standard/67253.html?browse=tc) for geospatial resources). The private sector also contributes to the development of standards, like the [International Press Telecommunications Council (IPTC)](https://iptc.org/) standard developed by and for news media.

Metadata compliant with standards and schemas will typically be stored as JSON or XML files (described in Chapter 2), which are plain text files. The example below shows how simple free-text content can be structured and stored in JSON and XML formats, using metadata elements from the [DDI Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) metadata standard:

**Free text version**:

*The Child Mortality Survey (CMS) was conducted by the National Statistics Office of Popstan from July 2010 to June 2011, with financial support from the Child Health Trust Fund (TF123_456).*

**Structured, machine-readable (JSON) version**:

```json
{
  "title"            : "Child Mortality Survey 2010-2011",
  "alternate_title"  : "CMS 2010-2011",
  "authoring_entity" : "National Statistics Office (NSO)",
  "funding_agencies" : [{"name":"Child Health Trust Fund (CHTF)", "grant":"TF123_456"}],
  "coll_dates"       : [{"start":"2010-07", "end":"2011-06"}],
  "nation"           : [{"name":"Popstan", "abbreviation":"POP"}]
}
```

In XML format:

```xml
<title>Child Mortality Survey 2010-2011</title>
<alternate_title>CMS 2010-2011</alternate_title>
<authoring_entity>National Statistics Office (NSO)</authoring_entity>
<funding_agencies name="Child Health Trust Fund (CHTF)" grant="TF123_456"/>
<coll_dates start="2010-07" end="2011-06"/>
<nation name="Popstan" abbreviation="POP"/>
```

All three versions contain (almost) the same information. In the structured version, we have added acronyms and the ISO country code. This does not create new information but will help make the existing information more discoverable and inter-operable. The structured version is clearly more suitable for publishing in a meta-database (or catalog). Organizing and storing metadata in such a structured manner will enable all kinds of applications. For example, when metadata for a collection of surveys are stored in a database, it becomes straightforward to apply filters (for example, a filter by country using the nation/name element) and targeted searches to answer questions like "What data are available that cover the month of December 2010?" or "What surveys did the CHTF sponsor?".
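
To make this concrete, the short R sketch below answers the second question by filtering a small list of survey metadata records. The records and code are purely illustrative: they mirror the JSON structure shown above and do not correspond to any specific catalog API.

```r
# Two hypothetical survey metadata records, mirroring the JSON structure above
surveys <- list(
  list(title = "Child Mortality Survey 2010-2011",
       funding_agencies = list(list(name = "Child Health Trust Fund (CHTF)", grant = "TF123_456")),
       coll_dates = list(list(start = "2010-07", end = "2011-06"))),
  list(title = "Labor Force Survey 2012",
       funding_agencies = list(list(name = "Ministry of Labor")),
       coll_dates = list(list(start = "2012-01", end = "2012-12")))
)

# Keep only the surveys with at least one funding agency whose name mentions "CHTF"
sponsored_by_chtf <- Filter(
  function(s) any(sapply(s$funding_agencies, function(f) grepl("CHTF", f$name))),
  surveys
)

sapply(sponsored_by_chtf, function(s) s$title)
# [1] "Child Mortality Survey 2010-2011"
```
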

### Formats for structured metadata: JSON and XML

Metadata standards and schemas consist of structured lists of metadata fields. They serve multiple purposes. First, they help data curators generate complete and usable documentation of their datasets. Metadata standards that are intuitive and *human-readable* better serve this purpose. Second, they help generate *machine-readable* metadata that are the input to software applications like on-line data catalogs. Metadata available in open file formats like JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) are most suitable for this purpose.

Some international metadata standards like the Data Documentation Initiative (DDI Codebook, for microdata), the ISO 19139 (for geospatial data), or the Dublin Core (a more generic metadata specification) are described and published as XML specifications. Any XML standard or schema can be "translated" into JSON, which is our preferred format (a choice we justify in the next section).

JSON and XML formats have similarities:

 - Both are non-proprietary text files
 - Both are hierarchical (they may contain values within values)
 - Both can be parsed and used by many programming languages including R and Python

JSON files are however easier to parse than XML, easier to generate programmatically, and easier for humans to read. This makes them our preferred choice for describing and using metadata standards and schemas.

Metadata in JSON are stored as *key/value* pairs, where the keys correspond to the names of the metadata elements in the standard. Values can be strings, numbers, booleans, arrays, null, or JSON objects (for a more detailed description of the JSON format, see [www.w3schools.com](https://www.w3schools.com/js/js_json_intro.asp)). Metadata in XML are stored within named tags. The example below shows how the JSON and XML formats are used to document the list of authors of a [document](http://hdl.handle.net/10986/34511), using elements from the Dublin Core metadata standard.
+![](./images/document_example_01b_authors_keywords.JPG){width=100%} +
+
+ +In the *documents* schema, authors are documented in the metadata element `authors` which contains the following sub-elements: `first_name`, `initial`, `last_name`, and `affiliation`. + +
+![](./images/JSON_array_list_authors.JPG){width=100%} +
+
In JSON, this information will be stored in key/value pairs as follows.

```json
"authors" : [
  {"first_name" : "Dieter",
   "last_name" : "Wang",
   "affiliation": "World Bank Group; Fragility, Conflict and Violence"},
  {"first_name" : "Bo",
   "initial" : "P.J.",
   "last_name" : "Andrée",
   "affiliation": "World Bank Group; Fragility, Conflict and Violence"},
  {"first_name" : "Andres",
   "initial" : "F.",
   "last_name" : "Chamorro",
   "affiliation": "World Bank Group; Development Data Analytics and Tools"},
  {"first_name" : "Phoebe",
   "initial" : "G.",
   "last_name" : "Spencer",
   "affiliation": "World Bank Group; Fragility, Conflict and Violence"}
]
```

In XML, the same information will be stored within named tags as follows.

```xml
<authors>
  <author>
    <first_name>Dieter</first_name>
    <last_name>Wang</last_name>
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
  <author>
    <first_name>Bo</first_name>
    <initial>P.J.</initial>
    <last_name>Andrée</last_name>
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
  <author>
    <first_name>Andres</first_name>
    <initial>F.</initial>
    <last_name>Chamorro</last_name>
    <affiliation>World Bank Group; Development Data Analytics and Tools</affiliation>
  </author>
  <author>
    <first_name>Phoebe</first_name>
    <initial>G.</initial>
    <last_name>Spencer</last_name>
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
</authors>
```

### Benefits of structured metadata

Metadata standards and schemas must be comprehensive and intuitive. They aim to provide granular and exhaustive lists of elements. Some standards may contain a very long list of elements. Most often, only a subset of the available elements will be used to document a specific dataset. For example, the elements of the DDI metadata standard related to sample design will be used to document sample survey datasets but will be ignored when documenting a population census or an administrative dataset. In all standards and schemas, most elements are *optional*, not *required*. Data curators should however try to provide content for all elements for which information is or can be made available.

Complying with metadata standards and schemas contributes to the completeness, usability, discoverability, and inter-operability of the metadata, and to the visibility of the data and metadata.

#### Completeness

When they document datasets, data curators who do not make use of metadata standards and schemas tend to focus on the readily-available documentation and may omit some information that secondary data users --and search engines-- may need. Metadata standards and schemas provide checklists of what information could or should be provided. These checklists are developed by experts, and are regularly updated or upgraded based on feedback received from users or to accommodate new technologies.

Generating complete metadata will often be a collaborative exercise, as the production of data involves multiple stakeholders. The implementation of a survey, for example, may involve sampling specialists, field managers, data processing experts, subject matter specialists, and programmers. Documenting a dataset should not be seen as a last and independent step in the implementation of a data collection or production project. Ideally, metadata will be captured continuously and in quasi-real time during the entire life cycle of the data collection/production, and contributed by those who have most knowledge of each phase of the data production process.

Generating complete and detailed metadata may be seen as a burden by some organizations or researchers. But it will typically represent only a small fraction of the time and budget invested in the production of the data, and it is an investment that will add much value to the data by increasing their usability and discoverability.

#### Usability

Fully understanding a dataset before conducting analysis should be a pre-requisite for all researchers and data users. But this will only be possible when the documentation is easy to obtain and exploit. Convenience to users is key. When using a geographic dataset for example, the user should be able to immediately find the coordinate reference system that was used. When using survey microdata, which may contain hundreds or thousands of variables, the user needs to be able to immediately access information on a variable label, underlying question, universe, categories, etc. Structured metadata enable such "convenience", as they can easily be transformed into bookmarked PDF documents, searchable websites, machine-readable codebooks, etc. The way metadata are displayed can be tailored to the specific needs of different categories of users.

#### Discoverability

:::quote
Data discoverability is one of the main tasks, next to availability and interoperability, that public policy makers and implementers should take into due consideration in order to foster access, use and re-use of public sector information, particularly in case of open data. Users shall be enabled to easily search and find data they need for the most different purposes. That is clearly highlighted in the introduction statements of the INSPIRE Directive, where we can read that “The loss of time and resources in searching for existing (spatial) data or establishing whether they may be used for a particular purpose is a key obstacle to the full exploitation of the data available”. Metadata and data portals/catalogues are essential assets to enable that data discoverability. (From the [European Data Portal](https://www.europeandataportal.eu/en/impact-studies/country-insights/italy/italy-discoverability-practice))
:::

What matters is not only what metadata are provided as input to the search engines, but also how the metadata are provided. To understand the value of structured metadata, we need to take into consideration how search engines ingest, index, and exploit the metadata. In brief, the metadata will need to be acquired, augmented, analyzed and transformed, and indexed before they can be made searchable. We provide here an overview of the process, which is described in detail by D. Turnbull and J. Berryman in ["Relevant Search: With applications for Solr and Elasticsearch"](https://www.manning.com/books/relevant-search) (2016).

- **Acquisition**: Search engines like Google and Bing acquire metadata by crawling billions of web pages using *web crawlers* (or *bots*), with an objective to cover the entire web. Guidance is available to webmasters on how to optimize websites for visibility (see for example [Google's Search Engine Optimization (SEO) Starter Guide](https://developers.google.com/search/docs/beginner/seo-starter-guide)). The search tools embedded in specialized data catalogs have a much simpler task, as the catalog administrators and curators generate or control, and provide, the well-contained content to be indexed. In a cataloguing application like NADA, this content is provided in the form of *structured metadata files* saved in JSON or XML format. For textual data (documents), the content of the document (not only the metadata about the document) can also be indexed. The process of acquisition/extraction of metadata by the search engine tool must preserve the structure of the metadata, in its original or in a modified form.
This will be critical for optimizing the performance of the search tool and the ranking of query results (e.g., a keyword found in a document title may have more weight than the same keyword found in the document abstract), for implementing facets, or for providing advanced search options (e.g., search only in the "authors" metadata field).

- **Augmentation** or **enrichment**: the content of the metadata can be *augmented* or *enriched* in multiple ways, often automatically (by extracting information from an external source, or using machine learning algorithms). Part of this augmentation process should happen before the metadata are submitted to the search engine. Other procedures of enrichment of the metadata may be implemented after acquisition of the metadata by the search engine tool. Metadata augmentation can have a significant impact on the discoverability of data. See the section "Augmenting metadata" below.

- **Analysis** or **transformation**: The metadata generated by the data curator and by the augmentation process will mostly (not exclusively) consist of text. For the purpose of discoverability, some of the text has no value; words like "the", "a", "it", "with", etc., referred to as *stop words*, will be removed from the metadata (multiple tools are available for this purpose). The remaining words will be converted to lowercase, may be submitted to spell checkers (to exclude or fix errors), and words will be *stemmed* or *lemmatized*. Stemming or lemmatization consists of converting words to their *stem* or *root*; this will, among other transformations, change plurals to singular and conjugated forms of verbs to their base form. Last, the transformed metadata will be *tokenized*, i.e. split into a list of terms (*tokens*). To enable semantic searchability, the metadata can also be converted into numeric vectors using natural language processing *embedding* models. These vectors will be saved in a database (such as [ElasticSearch](https://github.com/elastic/elasticsearch) or [Milvus](https://milvus.io/)) that provides functionality to measure similarity/distance between vectors. The section on embeddings and semantic discovery below provides more information on text embeddings and semantic searchability.

- **Indexing**: The last phase of metadata processing is the indexing of the tokens. The index of a search engine is an *inverted index*, which will contain a list of all terms found in the metadata, with the following information (among others) attached to each term:
  - The *document frequency*, i.e. the number of metadata documents where the term is found (a *metadata document* is the metadata related to one dataset).
  - The identification of the metadata documents in which the term was found.
  - The *term frequency* in each metadata document.
  - The *term positions* in the metadata document, i.e. where the term is found in the document. This is important to identify collocations. When a user submits a query for "demographic transition" for example, documents where the two terms are found next to each other will be more relevant than documents where both terms appear but in different parts of the document.
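
To make these steps concrete, here is a minimal, illustrative R sketch (a toy example, not the implementation of any particular search engine) that lowercases and tokenizes the metadata of three hypothetical datasets, removes a few stop words, and builds a small inverted index recording, for each term, its document frequency and the documents in which it appears:

```r
# Toy metadata "documents" (one per dataset)
docs <- c(doc1 = "Child Mortality Survey of Popstan",
          doc2 = "Popstan Demographic and Health Survey",
          doc3 = "National accounts of Popstan")

stop_words <- c("the", "a", "of", "and", "in", "with")

# Lowercase, tokenize, and drop stop words (stemming/lemmatization omitted for brevity)
tokens <- lapply(docs, function(d) {
  words <- unlist(strsplit(tolower(d), "[^a-z0-9]+"))
  words[!(words %in% c(stop_words, ""))]
})

# Inverted index: for each term, the document frequency and the documents containing it
terms <- sort(unique(unlist(tokens)))
inverted_index <- lapply(terms, function(term) {
  in_docs <- names(tokens)[sapply(tokens, function(t) term %in% t)]
  list(document_frequency = length(in_docs), documents = in_docs)
})
names(inverted_index) <- terms

inverted_index[["survey"]]
# the term "survey" appears in doc1 and doc2 (document frequency = 2)
```
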
Once the metadata have been acquired, transformed, and indexed, they are available for use via a user interface (UI). A data catalog UI will typically include a search box and facets (filters). The search engine underlying the search box can be simple (out-of-the-box full-text search, looking for exact matches of keywords) or advanced (with semantic search capability and optimized ranking of query results). Basic full-text search does not provide a satisfactory user experience, as we illustrated in the introduction to this Guide. Rich, structured metadata, combined with advanced search optimization tools and machine learning solutions, allow catalog administrators to tune the search engine and implement advanced solutions including semantic searchability.
+![](./images/schema_documentation_indexing.JPG){width=100%} +

#### Interoperability

Data catalogs that adopt common metadata standards and schemas can exchange information, including through automated harvesting and synchronization of catalogs. This allows them to increase their visibility, and to publish their metadata in hubs. Recommendations and guidelines for improved inter-operability of data catalogs are provided by the [Open Archives Initiative](https://www.openarchives.org/).

Interoperability between data catalogs can be further improved by the adoption of common controlled vocabularies. For example, the adoption of the ISO country codes in country lists will guarantee that all catalogs will be able to filter datasets by country in a consistent manner. This will solve the issue of possible differences in the spelling of country names (e.g., one catalog referring to the *Democratic Republic of Congo* as *Congo, DR*, and another one as *Congo, Dem. Rep.*). It also solves issues of changing country names (e.g., *Swaziland* renamed as *Eswatini* in 2018). Controlled vocabularies are often used for "categorical" metadata elements like topics, keywords, data type, etc. Some metadata standards like the ISO 19139 for geospatial data include their own recommended controlled vocabularies. Ideally, controlled vocabularies are developed in accordance with the [FAIR principles](https://www.go-fair.org/fair-principles/) (Findability, Accessibility, Interoperability, and Reuse of digital assets). "The principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data." (https://www.go-fair.org/fair-principles/)

The adoption of standards and schemas by software developers also contributes to the easy transfer of metadata across applications. For example, data capture tools like [Survey Solutions](https://mysurvey.solutions/en/) by the World Bank and [CSPro](https://www.census.gov/data/software/cspro.html) by the US Census Bureau offer options to export metadata compliant with the DDI Codebook standard; ESRI's ArcGIS software exports geospatial metadata in the ISO 19139 standard.

#### Visibility

Data cataloguing applications provide search and filtering tools to help users of the catalog identify data of interest. But not all users will start their search for data directly in specialized data catalogs; many will start their search in Google, Google Dataset Search, Bing, Yahoo! or another search engine.

Some search engines may provide users with a direct answer to their query, without transiting via the source catalog. This will be the case when the query can be associated with a specific indicator, time and location for which data are openly available or accessible via a public API. For example, a search for "population india 2020" on Google will provide an *answer* first, followed by links to the underlying sources.
+
+ ![](./images/Google_Population_India_2020.JPG){width=90%} +
+
+ +In other cases, the search engine will provide users with a link to a specific catalog page, not to the catalog's home page. In such cases, the user will not be directly connected to the catalog's own search engine. For example, a search for "albania lsms 2012" (a Living Standard Measurement Study, i.e. household survey) in Google will send the user directly to the survey page of the catalog, not to the home or search page of the catalog. +
+
+ ![](./images/Google_LSMS_Albania_2012.JPG){width=90%} +
+
+ +In some cases, the user may not be brought to the data catalog at all, if the catalog ranked low in the relevance order of the Google query results. User behavior data (2020) showed that "only 9% of Google searchers make it to the bottom of the first page of the search results", and that "only .44% of searchers go to the second page of Google’s search results". (source: https://www.smartinsights.com/search-engine-marketing/search-engine-statistics/) + +It is thus critical to optimize the visibility of the content of specialized data catalogs in the lead search engines, Google in particular. This optimization process is referred to as **search engine optimization** or SEO. [Wikipedia](https://en.wikipedia.org/wiki/Search_engine_optimization) describes SEO as “the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic (known as "natural" or "organic" results) rather than direct traffic or paid traffic. (…) As an Internet marketing strategy, SEO considers how search engines work, the computer-programmed algorithms that dictate search engine behavior, what people search for, the actual search terms or keywords typed into search engines, and which search engines are preferred by their targeted audience. SEO is performed because a website will receive more visitors from a search engine when websites rank higher on the search engine results page.” + +:::quote +Because search engines crawl the web pages that are generated from databases (rather than crawling the databases themselves), your carefully applied metadata inside the database will not even be seen by search engines unless you write scripts to display the metadata tags and their values in HTML meta tags. It is crucial to understand that any metadata offered to search engines must be recognizable as part of a schema and must be machine-readable, which is to say that the search engine must be able to parse the metadata accurately. (For example, if you enter a bibliographic citation into a single metadata field, the search engine probably won’t know how to distinguish the article title from the journal title, or the volume from the issue number. In order for the search engine to read those citations effectively each part of the citation must have its own field. (...) Making sure metadata is machine-readable requires patterns and consistency, which will also prepare it for transformation to other schema. (This is far more important than picking any single metadata schema. (...) *From the blog post "Metadata, Schema.org, and Getting Your Digital Collection Noticed" by Patrick Hogan (https://www.ala.org/tools/article/ala-techsource/metadata-schemaorg-and-getting-your-digital-collection-noticed-3)* +::: + +Guidelines for implementing SEO are provided by Google Search, Google Dataset Search, and other lead search engines. These guidelines are to be implemented not only by webmasters, but also by the developers of data cataloguing tools who should embed SEO into their software applications. + + - [Google Search Engine Optimization Starter Guide](https://developers.google.com/search/docs/beginner/seo-starter-guide) + - [Google Dataset Search, Advanced SEO](https://developers.google.com/search/docs/data-types/dataset) + - [Bing webmaster Tools](https://www.bing.com/webmasters/about) + +An important element of SEO is the provision of structured metadata that can be exploited directly by the crawlers and indexers of search engines. 
This is the purpose of a set of schemas known as [**schema.org**](https://schema.org/). In 2011, Google, Microsoft, Yandex, and Yahoo! created this common set of schemas for structured data markup on web pages, with the aim of helping search engines better understand websites. An alternative to schema.org is the [DCAT (Data Catalog Vocabulary)](https://www.w3.org/TR/vocab-dcat-2/) metadata schema recommended by the W3C, also recognized by Google. "DCAT is a vocabulary for publishing data catalogs on the Web, which was originally developed in the context of government data catalogs such as data.gov and data.gov.uk (...)" (https://www.w3.org/TR/vocab-dcat-2/) Mapping augmented and structured metadata to the schema.org and/or DCAT standard is a critical element of such optimization. It will contribute significantly to the visibility of on-line data and metadata. Implementing such structured data markup in digital repositories is the responsibility of data librarians and of developers of data cataloguing applications.

## Augmenting metadata

Detailed and complete metadata foster usability and discoverability of data. Augmentation (also referred to as "enrichment" or "enhancement") of the metadata will therefore be beneficial. There are multiple ways metadata can be made richer, or *augmented*, programmatically and in a largely automated manner. Metadata can be extracted from external sources or from the data themselves.

**Extraction from external sources**

Metadata can be augmented by tapping into external sources related to the data being documented. For example, in a catalog of documents published in peer-reviewed journals, the [Scimago Journal Rank (SJR)](https://www.scimagojr.com/) indicator could be extracted and added as an additional metadata element for each document. This information can then be used by the catalog's search engine to rank query results, by "boosting" the rank of documents published in prestigious journals.

**Extraction from the data**

Metadata can be extracted from the data themselves. What metadata can be extracted will be specific to each data type. Examples of metadata augmentation will be provided in the subsequent chapters. We mention a few below.

 - For microdata: variable-level statistics (range of values, number of valid/missing cases, frequencies for categorical variables, summary statistics like means or standard deviations for continuous variables) can be extracted and stored as metadata. The DDI Codebook metadata standard provides elements for that purpose.
 - For documents: information such as the country counts (how many times each country is mentioned) can be extracted automatically to fill out the metadata element related to geographic coverage. Natural language processing (NLP) models can be applied to automatically extract keywords or topics (e.g., using a Latent Dirichlet Allocation - LDA - topic model). Classification models can be applied to categorize documents by type.
 - For geospatial data: bounding boxes (i.e. the *extent* of the data) can be derived from the data files.
 - For photos taken by digital cameras: metadata such as the date and time the photo was taken and possibly the geographic location can be extracted from the EXIF metadata generated by digital cameras and stored in the image file. Also, machine learning models allow image labeling, face detection, and text detection and recognition to be applied at low cost (using commercial solutions like [Google Vision](https://cloud.google.com/vision) or [Amazon Rekognition](https://aws.amazon.com/rekognition/) among others).
 - For videos and audio files: machine learning models or speech-to-text API solutions can be used to automatically generate transcripts (see for example [Amazon Transcribe](https://aws.amazon.com/transcribe/), [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text), [Microsoft Azure Speech to Text](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/), or [rev.ai](https://www.rev.ai/)). The content of the transcripts can then be indexed in search engines, making the content of video and audio files more discoverable.
 - For programs and scripts: a parsing of the commands used in the script may be used to derive information on the methods applied.
 - For all types: user-defined tags can be added, possibly generated by machine learning classification algorithms.

**Embeddings and semantic discovery**

Previous sections of the chapter showed the value of rich and structured metadata to improve data usability and discoverability. Comprehensive and structured metadata are required to build and develop advanced and optimized lexical search engines (i.e. search engines that return results based on a matching of terms found in a query and in an inverted index). The richness of the metadata guarantees that the search engine will have all the necessary "raw material" to identify datasets of interest.
The metadata structure allows catalog administrators to tune their search engine (provided they use advanced solutions like Solr or ElasticSearch) to return and rank results in the most relevant manner. But this leaves one issue unsolved: the dependency on keyword matching. A user interested in datasets related to *malnutrition*, for example, will not find the indicators on *Prevalence of stunting* and *Prevalence of wasting* that the catalog may contain, unless the keyword "malnutrition" was included in these indicators' metadata. Smarter search engines will be able to "understand" users' intent, and identify relevant data based not only on a keyword matching process, but also on the **semantic closeness** between a query submitted by the user and the metadata available in the database. The combination of rich metadata and natural language processing (NLP) models can solve this issue, by enabling semantic searchability in data catalogs.

To enable a semantic search engine (or a recommender system), we need a way to "quantify" the semantic content of a query submitted by the user and the semantic content of the metadata associated with a dataset, and to measure the closeness between them. This "quantitative" representation of semantic content can be generated in the form of numeric vectors called **embeddings**. "Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning." (Jurafsky, Daniel and James H. Martin, 2000) These vectors will typically have a large dimension, with a length of 100 or more. They can be generated for a word, a phrase, or a longer text such as a paragraph or a full document. They are calculated using models like word2vec (Mikolov et al., 2013) or others. Training such models requires a large corpus of hundreds of thousands or millions of documents. Pre-trained models and APIs are available that allow data catalog curators to generate embeddings for their metadata and, in real time, for queries submitted by users.

Practically, embeddings are used as follows: metadata (or part of the metadata) associated with a dataset are converted into a numeric vector using a pre-trained embedding model. These embeddings are stored in a database. When a user submits a search query (which can be a term, a phrase, or even a document), the query is analyzed and enhanced (stop words are removed, spelling errors may be fixed, language detection and automatic translation may be applied, and more), then transformed into a vector using the same pre-trained model that was used to generate the metadata vectors. The metadata vectors that have the shortest distance (typically the cosine distance) to the query vector will be identified. The search engine will then return a sorted list of datasets having the highest semantic similarity with the query, or the distance between vectors will be used in combination with other criteria to rank and return results to the user. The fast identification of the closest vectors requires a specialized and optimized tool like the open source [Milvus](https://milvus.io/) application.
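
The following R sketch illustrates the core idea with tiny made-up vectors; a real system would use embeddings of length 100 or more produced by a pre-trained model, and a vector database for fast retrieval. The entry names, vectors, and similarity values are hypothetical.

```r
# Hypothetical embeddings for two catalog entries (toy 4-dimensional vectors)
metadata_vectors <- list(
  "Prevalence of stunting indicator" = c(0.8, 0.1, 0.4, 0.2),
  "Road network of Popstan"          = c(0.1, 0.9, 0.0, 0.3)
)

# Hypothetical embedding of the user query "malnutrition"
query_vector <- c(0.7, 0.2, 0.5, 0.1)

cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

similarities <- sapply(metadata_vectors, cosine_similarity, b = query_vector)
sort(similarities, decreasing = TRUE)
# The stunting indicator ranks first: its vector is closest to the query vector,
# even though the word "malnutrition" appears nowhere in its metadata.
```
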

## Recommended standards and schemas

The standards and schemas we recommend and describe in this guide are the following:

| Data type                        | Standard or schema                                      |
| -------------------------------- | ------------------------------------------------------- |
| Documents                        | Dublin Core Metadata Initiative (DCMI), MARC             |
| Microdata                        | Data Documentation Initiative 2.5 (Codebook)             |
| Geographic datasets and services | ISO 19110, ISO 19115, ISO 19119, ISO 19139               |
| Time series, Indicators          | Custom-designed schema                                   |
| Statistical tables               | Custom-designed schema                                   |
| Photos / Images                  | IPTC (for advanced use) or Dublin Core augmented         |
| Audio files                      | Dublin Core augmented with AudioObject from schema.org   |
| Videos                           | Dublin Core augmented with VideoObject from schema.org   |
| Programs and scripts             | Custom-designed schema                                   |
| External resources               | Dublin Core                                              |
| All data types                   | schema.org and DCAT (used for search engine optimization purposes, not as the primary schema to document resources) |

:::note
Note on SDMX: The metadata standards and schemas described in the Guide do not include the [Statistical Data and Metadata eXchange (SDMX)](https://sdmx.org/?sdmx_news=launching-the-new-sdmx-3-0-standard) standard sponsored by a group of international organisations. Although SDMX includes a metadata component, it is intended to support machine-to-machine data exchange, not data documentation and discoverability. SDMX and the metadata standards and schemas we describe in the Guide could --and should-- be made inter-operable.
:::

### Documents

**Documents** are bibliographic resources of any type, such as books, working papers and papers published in scientific journals, reports, manuals, and other resources consisting mainly of text. Document libraries have a long tradition of using structured metadata to manage their collections, one that dates back to well before cataloguing was computerized. Multiple standards are available. The Dublin Core Metadata Initiative (DCMI) specification provides a simple and flexible option. The MARC (**MA**chine-**R**eadable **C**ataloging) standard used by the United States Library of Congress is another, more advanced one. The schema we describe in this Guide is the DCMI, complemented by a few elements inspired by the MARC standard.

### Microdata

**Microdata** are unit-level data on a population of individuals, households, dwellings, facilities, establishments, or other units. Microdata are typically obtained from surveys, censuses, or administrative recording systems. To document microdata, the Data Documentation Initiative (DDI) Alliance has developed the DDI metadata standard. "The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. DDI is a free standard that can document and manage different stages in the research data lifecycle, such as conceptualization, collection, processing, distribution, discovery, and archiving. Documenting data with DDI facilitates understanding, interpretation, and use -- by people, software systems, and computer network." (Source: https://ddialliance.org/, accessed on 7 June 2021)

The DDI standard comes in two versions: DDI Codebook and DDI Lifecycle.

 - [DDI-Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) is a light-weight version of the standard. Its elements include descriptive content for variables, files, source material, and study level information. The standard is designed to support the discovery, preservation, and informed use of data.
 - [DDI Lifecycle](https://ddialliance.org/Specification/DDI-Lifecycle/3.3/) is designed to document and manage data across the entire life cycle, from conceptualization to data publication, analysis and beyond. It encompasses all of the DDI-Codebook specification and extends it.

In this Guide, which focuses on the use of metadata standards for documentation, cataloguing and dissemination purposes, we recommend the use of the DDI Codebook, which is much easier to implement than DDI Lifecycle. DDI Codebook provides all the elements needed for our purpose of improving data discoverability and usability.

### Geographic datasets, data structures, and data services

Geographic data identify and depict geographic locations, boundaries and characteristics of features on the surface of the earth. **Geographic datasets** include raster and vector data files.
More and more data are disseminated not in the form of datasets, but in the form of **geographic data services**, mainly via web applications. The ISO Technical Committee on Geographic Information/Geomatics (ISO/TC211) created a set of metadata standards to describe geographic datasets (ISO 19115), the geographic data structures of vector data (ISO 19110), and geographic data services (ISO 19119). These ISO standards are also available as an XML specification, the ISO 19139. In this Guide, we describe a JSON-based and simplified --but ISO-compatible-- version of this complex schema.

### Time series, indicators

**Indicators** are summary (or "aggregated") measures related to a key issue or phenomenon and derived from a series of observed facts. For example, the *school enrollment rate* indicator can be obtained from survey or census microdata, and the *GDP per capita* indicator is the output of a complex national accounting process that exploits many sources. When an indicator is repeated over time at a regular frequency (annual, quarterly, monthly or other), and when the time dimension is attached to its values, we obtain a **time series**. National statistical agencies and many other organizations publish indicators and time series. Some well-known public databases of time series indicators include the World Bank's [World Development Indicators (WDI)](https://datatopics.worldbank.org/world-development-indicators/), the Asian Development Bank's [Key Indicators (KI)](https://www.adb.org/publications/series/key-indicators-for-asia-and-the-pacific), and the United Nations Statistics Division's [Sustainable Development Goals (SDG) database](https://unstats.un.org/sdgs/indicators/database/). Some databases provide indicators that are not time series, like the Demographic and Health Survey (DHS) [StatCompiler](https://www.statcompiler.com/en/). Time series and indicators must be published with metadata that provide information on their spatial and temporal coverage, definition, methodology, sources, and more. No international standard is available to document indicators and time series. The JSON metadata schema we describe in this guide was developed by compiling a list of metadata elements found in international indicators databases, complemented with elements from other metadata schemas.

### Statistical tables

**Statistical tables** (or *cross tabulations* or *contingency tables*) are summary presentations of data, presented as arrays of rows and columns that display numeric aggregates in a clearly labeled fashion. They are typically found in publications such as statistical yearbooks, census and survey reports, and research papers, or published on-line. We developed the metadata schema presented in this Guide based on a review of a large collection of tables and of the 2015 [W3C Model for Tabular Data and Metadata on the Web](https://www.w3.org/TR/tabular-data-model/#bib-tabular-metadata). This schema is intended to facilitate the cataloguing and discovery of tabular data, not to provide an electronic solution to automatically reproduce tables.

### Images

The **images** we are interested in are photos and images available in electronic format. Some images are generated using digital cameras and are "born digital". Others may have been created by scanning photos, or using other techniques. Note that satellite and remote sensing imagery are not considered in this Guide as images, but as geospatial (raster) data which should be documented using the ISO 19139 schema.
To document images, we suggest two options: the [Dublin Core Metadata Initiative](https://dublincore.org/) standard augmented by some [ImageObject (from schema.org)](https://schema.org/ImageObject) elements as a simple option, and the IPTC standard for more advanced uses and users.

### Audio

To document and catalog audio recordings, we propose a simple metadata schema that combines elements of the [Dublin Core Metadata Initiative](https://dublincore.org/) and of the [AudioObject (from schema.org)](https://schema.org/AudioObject) schemas.

### Videos

To document and catalog videos, we propose a simple metadata schema that combines elements of the [Dublin Core Metadata Initiative](https://dublincore.org/) and of the [VideoObject (from schema.org)](https://schema.org/VideoObject) schemas.

### Programs and scripts

We are interested in documenting and disseminating **data processing and analysis programs and scripts**. By “programs and scripts” we mean the code written to conduct data processing and data analysis that results in the production of research and knowledge products, including publications, derived datasets, visualizations, or other outputs. These scripts are produced using statistical analysis software or programming languages like [R](https://www.r-project.org/), Python, [SAS](https://www.sas.com/en_us/software/stat.html), [SPSS](https://www.ibm.com/products/spss-statistics), [Stata](https://www.stata.com/) or equivalent. There are multiple reasons to invest in the documentation and dissemination of reproducible and replicable data processing and analysis (see chapter 12). Increasingly, the dissemination of reproducible scripts is a condition imposed by peer-reviewed journals on the authors of papers they publish. Data catalogs should be the go-to place for those who look for reproducible research and examples of good practice in data analysis. As no international metadata schema is available to document and catalog scripts, we developed a schema for this purpose.

### External resources

**External resources** are files and links that we may want to attach to a dataset's published metadata in a data catalog. When we publish metadata in a catalog, what is published is only the textual documentation contained in the JSON or XML metadata file. Other resources attached to a dataset (such as the questionnaire for a survey, technical or training manuals, tabulations, reports, possibly micro-data files, etc.) are not included in these metadata, but also constitute important materials for data users. All these resources are what we consider as *external resources* ("external" to the schema-compliant metadata), which need to be catalogued and (for most of them) published with the metadata. A simple metadata schema, based on the Dublin Core, is used to provide some essential information on these resources.

## Search engine optimization: schema.org

The standards and schemas we recommend are lists of elements that have been tailored for each data type. The importance of structured and rich metadata has been described. Specialized metadata standards will foster comprehensiveness and discoverability in specialized catalogs, and help build optimized data discovery systems. But it is also critical to ensure the visibility and discoverability of the metadata in generic search engines, which are not built around the same schemas. The web makes use of its own schemas: schema.org. To ensure SEO, the specialized schemas should be mapped to it.
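
As an illustration of such a mapping, a catalog page describing the fictitious survey used earlier in this chapter might embed a JSON-LD snippet along the following lines in its HTML (typically inside a `<script type="application/ld+json">` tag). This is a minimal, hypothetical sketch: the catalog URL and license are invented, and a complete schema.org *Dataset* record would usually carry more properties.

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Child Mortality Survey 2010-2011",
  "description": "Survey conducted by the National Statistics Office of Popstan between July 2010 and June 2011.",
  "url": "https://example-catalog.org/catalog/study/CMS-2010-2011",
  "keywords": ["child mortality", "health", "Popstan"],
  "creator": { "@type": "Organization", "name": "National Statistics Office (NSO)" },
  "temporalCoverage": "2010-07/2011-06",
  "spatialCoverage": { "@type": "Place", "name": "Popstan" },
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
```
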
+
+### The basics of search engine optimization
+
+Data catalogs must be optimized to improve the visibility and ranking of their content in search engines, including specialized search engines like Google's [Dataset Search](https://datasetsearch.research.google.com/). The ranking of web pages by Google and other leading search engines is determined by complex, proprietary, and non-disclosed algorithms. The only option for a web developer to guarantee that a web page appears at the top of the Google list of results is to pay for it, publishing it as a commercial ad. Otherwise, the ranking of a web page will be determined by a combination of known and unknown criteria. "Google's automated ranking systems are designed to present helpful, reliable information that's primarily created to benefit people, not to gain search engine rankings, in the top Search results." ([Google Search Central](https://developers.google.com/search/docs/fundamentals/creating-helpful-content)) But Google, Bing and other search engines provide web developers with some guidance and recommendations on search engine optimization (SEO). See for example the [Google Search Central](https://developers.google.com/search) website, where Google publishes "Specific things you can do to improve the SEO of your website".
+
+Improving the ranking of catalog pages is a shared responsibility of data curators and of catalog developers and administrators. **Data curators** must pay particular attention to providing rich, useful content in the catalog web pages (the HTML pages that describe each catalog entry). To identify relevant results, search engines index the content of web pages. Datasets that are well documented, i.e. those published with rich and structured metadata, will thus have a better chance of being discovered. Much attention should be paid to some core elements including the dataset title, producer, description (abstract), keywords, topics, access license, and geographic coverage. In Google Search Central's terms, curators must "create helpful, reliable, people-first content" (not search engine-first content) and "use words that people would use to look for your content, and place those words in prominent locations on the page, such as the title and main heading of a page, and other descriptive locations such as alt text and link text."
+
+**Developers and administrators of cataloguing applications** must pay attention to other aspects of a catalog that will make it rank higher in Google and other search engine results:
+
+- Ensuring that the data catalog delivers a good experience to users (see [Understanding page experience in Google Search results](https://developers.google.com/search/docs/appearance/page-experience)), which among other things involves:
+   - catalog pages that load fast;
+   - catalog pages that are mobile-friendly (a data catalog should thus be built with a responsive design);
+   - a secure connection, obtained by serving the catalog over HTTPS (for more information, see for example https://web.dev/enable-https/).
+- Embedding *structured data* in the catalog's HTML pages. The HTML pages in a data catalog are mostly the pages that make the metadata specific to an entry visible to the user. These pages are automatically generated by the cataloguing application, by extracting and formatting the metadata stored in the catalog's database. Structured data is information that will be included in these HTML pages (but not shown to the user) to help Google understand the content of the page.
The use of *structured data* only applies to certain types of content, including datasets, and it influences not only the ranking of a page, but also the way the information on the page will be displayed by Google. The next section is dedicated to this.
+
+Last, Google will "reward" popular websites, i.e. websites that are frequently visited and to which many other influential and popular websites provide links. Google's recommendation is thus to "tell people about your site. Be active in communities where you can tell like-minded people about your services and products that you mention on your site."
+
+A helpful and detailed [self-assessment list](https://developers.google.com/search/docs/fundamentals/creating-helpful-content) of items that data curators, catalog developers, and catalog administrators should pay attention to is provided by Google. Various tools are also available to catalog developers and administrators to assess the technical performance of their websites.
+
+#### Structured data for rich results in Google
+
+*Structured data* is information embedded in HTML pages that helps Google classify, understand, and display the content of a page when the page is related to a specific type of content. The information stored in the structured data does not impact how the page itself is displayed in a web browser; it only impacts the display of information about the page when it is returned in Google search results. The types of content to which structured data applies are diverse and include items like job postings, cooking recipes, books, events, movies, math solvers, and others (see the list provided in [Google's Search Gallery](https://developers.google.com/search/docs/appearance/structured-data/search-gallery)). It also applies to resources of type *dataset* and *image*. In this context, a *dataset* can be any type of structured dataset including microdata, indicators, tables, and geographic datasets.
+
+The *structured data* to be embedded in an HTML page consists of a set of metadata elements compliant with either the [*dataset* schema from schema.org](https://schema.org/Dataset) or W3C's Data Catalog Vocabulary ([DCAT](https://www.w3.org/TR/vocab-dcat/#dcat-scope)) for datasets, and with the *image* schema from schema.org for images. For datasets, the schema.org schema is the most frequently used option.[^1]
+
+#### schema.org
+
+[**schema.org**](https://schema.org) is a collection of schemas designed to document many types of resources. The most generic type is a "thing", which can be a person, an organization, an event, a creative work, etc. A *creative work* can be a book, a movie, a photograph, a data catalog, a dataset, etc. Among the many types of *creative work* for which schemas are available, we are particularly interested in the ones that correspond to the types of data and resources we recommend in this Guide. This includes:
+
+  - [**DataCatalog**](https://schema.org/DataCatalog): A data catalog, i.e. a collection of datasets.
+  - [**Dataset**](https://schema.org/Dataset): A body of structured information describing some topic(s) of interest.
+  - [**MediaObject**](https://schema.org/MediaObject): A media object, such as an image, video, or audio object embedded in a web page or a downloadable dataset. This includes:
+    - [**ImageObject**](https://schema.org/ImageObject): An image file.
+    - [**AudioObject**](https://schema.org/AudioObject): An audio file.
+    - [**VideoObject**](https://schema.org/VideoObject): A video file.
+  - [**Book**](https://schema.org/Book): A book.
+  - [**DigitalDocument**](https://schema.org/DigitalDocument): An electronic file or document.
+
+The schemas proposed by schema.org have been developed primarily "to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results." (from [schema.org, Q&A](https://schema.org/docs/faq.html#0)) These schemas have not been developed by specialized communities of practice (statisticians, survey specialists, data librarians) to document datasets for the preservation of institutional memory, to increase transparency in the data production process, or to provide data users with the "cook book" they may need to safely and responsibly use data. Nor are they the schemas that statistical organizations need in order to comply with international recommendations like the Generic Statistical Business Process Model (GSBPM). But they play a critical role in improving data discoverability, as they provide webmasters and search engines with a means to better capture and index the content of web-based data platforms. Schemas from schema.org should thus be embedded in data catalogs. Data cataloguing applications should automatically map (some of) the elements of the specialized metadata standards and schemas they use to the appropriate fields of schema.org. Recommended mappings between the specialized standards and schemas and schema.org are not yet available. The production of such mappings, and the development of utilities to facilitate the production of content compliant with schema.org, would contribute to the objective of visibility and discoverability of data.
+
+#### DCAT
+
+[**DCAT**](https://www.w3.org/TR/vocab-dcat-2/) describes datasets and data services using a standard model and vocabulary. It is organized in 13 "classes" (Catalog, Cataloged Resource, Catalog Record, Dataset, Distribution, Data Service, Concept Scheme, Concept, Organization/Person, Relationship, Role, Period of Time, and Location). Within classes, *properties* are used as metadata elements. For example, the class *Cataloged Resource* includes properties like *title*, *description*, and *resource creator*; the class *Dataset* includes properties like *spatial resolution* and *temporal coverage*. Many of these properties can easily be mapped to equivalent elements of the specialized metadata schemas we recommend in this Guide.
+
+#### Practical implementation of structured data
+
+The embedding of structured data into HTML pages must be automated in a data cataloguing tool. Data cataloguing applications dynamically generate the HTML pages that display the description of each catalog entry. They do so by extracting the necessary metadata from the catalog database, and applying "transformations and styles" to this content to produce a user-friendly output that catalog visitors will view in their web browser. To embed structured data in these pages, the catalog application will (i) extract the relevant subset of metadata elements from the original metadata (e.g., from the DDI-compliant metadata for a micro-dataset), (ii) map these extracted elements to the schema.org or DCAT schema, and (iii) save the result in the HTML page as a "hidden" JSON-LD component.
A mapping between the schemas presented in this Guide and schema.org is provided in annex 2 of the Guide.
+
+The sketch below illustrates the kind of JSON-LD object that a cataloguing application may embed in the page describing a dataset. The screenshots that follow show an example of an HTML page for a dataset published in a NADA catalog, with the underlying code.
+
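+The block below is a minimal, entirely fictitious sketch of such a JSON-LD object (the dataset name, identifier, URL, and dates are illustrative assumptions, not the content of an actual catalog entry). It uses a few properties of the schema.org *Dataset* type and would typically be placed inside a `<script type="application/ld+json">` element of the page:
+
+```json
+{
+  "@context": "https://schema.org/",
+  "@type": "Dataset",
+  "name": "Example Household Survey 2020",
+  "description": "Illustrative description of an example household survey dataset.",
+  "url": "https://catalog.example.org/index.php/catalog/1234",
+  "identifier": "EXAMPLE-HHS-2020-v01",
+  "creator": {"@type": "Organization", "name": "Example National Statistical Office"},
+  "datePublished": "2021-06-15",
+  "spatialCoverage": "Exampleland",
+  "keywords": ["household survey", "income", "consumption"],
+  "license": "https://creativecommons.org/licenses/by/4.0/"
+}
+```
+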
+**The HTML page as viewed by the catalog user** - The web browser will ignore the embedded structured metadata when the HTML page is displayed. What users will see is entirely controlled by the catalog application. +
+ +
+
+![](./images/reDoc_html_view.JPG){width=80%} +
+
+ +
+
+**The HTML page code (abstract)** - The automatically generated structured data can be seen in the HTML page code (or *page source*). This information is visible to, and processed by, the web crawlers of Google, Bing, and other search engines. Note that the structured data, although not "visible" to users, can be made accessible to them via API. Other data cataloguing applications may be able to ingest this information; the CKAN cataloguing tool, for example, makes use of metadata compliant with DCAT or schema.org. Making the structured data accessible is one way to improve the interoperability of data catalogs.
+
+ +
+
+![](./images/reDoc_html_code.JPG){width=80%} +
+
+ +
+**The result - Higher visibility/ranking in Google Dataset Search** - The websites catalog.ihsn.org and microdata.worldbank.org are NADA catalogs, which embed schema.org metadata. +
+ +
+
+![](./images/reDoc_html_rank.JPG){width=80%} +
+
+ + +## Where to find the schemas' documentation + +The most recent documentation of the schemas described in the Guide is available on-line at https://ihsn.github.io/nada-api-redoc/catalog-admin/#. + +
+![](./images/reDoc.JPG){width=100%} +
+
+ +The documentation of each standard or schema starts with four common elements that are not actually part of the standard or schema, but that contain information that will be used when the metadata are published in a data catalog that uses the NADA application. If NADA is not used, these "administrative elements" can be ignored. + +
+![](./images/reDoc_0.JPG){width=100%} +
+
+
+ - **`repositoryid`** identifies the collection in which the metadata will be published.
+ - **`access_policy`** determines if and how the data files will be accessible from the catalog in which the metadata are published. This element only applies to the microdata and geographic metadata standards. It makes use of a controlled vocabulary with the following access policy options:
+   - **`direct`**: data can be downloaded without requiring users to be registered;
+   - **`open`**: same as "direct", with an open data license attached to the dataset;
+   - **`public`**: public use files, which only require users to be registered in the catalog;
+   - **`licensed`**: access to data is restricted to registered users who receive authorization to use the data, after submitting a request;
+   - **`remote`**: data are made available by an external data repository;
+   - **`data_na`**: data are not accessible to the public (only the metadata are published).
+ - **`published`** determines the status of the metadata in the on-line catalog (with options 0 = draft and 1 = published). Published entries are visible to all visitors of the on-line catalog; unpublished (draft) entries will only be visible to the catalog administrators and reviewers.
+ - **`overwrite`** determines whether the metadata already in the catalog for this entry can be overwritten (with options "yes" or "no", "no" being the default).
+
+This set of administrative elements is followed by one or multiple sections that contain the elements specific to each standard/schema. For example, the DDI Codebook metadata standard, used to document microdata, contains the following main sections:
+
+ - **`document description`**: a description of the metadata (who documented the dataset, when, etc.). Most schemas will contain such a section describing the metadata, useful mainly to data curators and catalog administrators. In other schemas, this section may be named `metadata_description`.
+ - **`study description`**: the description of the survey/census/study, not including the data files and data dictionary.
+ - **`file description`**: a list and description of the data files associated with the study.
+ - **`variable description`**: the data dictionary (description of variables).
+
+The schema-specific sections are followed by a few other metadata elements common to most schemas. These elements are used to provide additional information useful for cataloguing and discoverability purposes. They include **tags** (which allow catalog administrators to attach tags to datasets independently of their type; tags can be used as filters in the catalog), and **external resources**.
+
+Some schemas provide the possibility for data curators to add their own metadata elements in an **additional** section. The use of additional elements should be the exception, as metadata standards and schemas are designed to provide all elements needed to fully document a data resource.
+
+In each standard and schema, metadata elements can have the following properties:
+
+ - **Optional** or **required**. When an element is declared as *required* (or *mandatory*), the metadata will be considered invalid if it contains no information in that element. To keep the schemas flexible, very few elements are set as required. Note that it is possible for a metadata element to be `required` but have all its components (for elements that have sub-elements) declared as optional. This will be the case when at least one (any) of the sub-elements must contain information.
It is also possible for an element to be declared *optional* but have one or more of its sub-elements declared `mandatory` (this means that the element is optional, but if it is used, some of its sub-elements MUST be provided).
+ - **Repeatable** or **Not repeatable**. For example, the element `nation` in the DDI standard is *Repeatable* because a dataset can cover more than one country, while the element `title` is *Not repeatable* because a study should be identified by a unique title.
+ - **Type**. This indicates the format of the information contained in an element. It can be a *string* (text), a *numeric* value, a *boolean* variable (TRUE/FALSE), or an *array*.
+
+Some schemas may recommend controlled vocabularies for some elements. For example, the ISO 19139 used to document geographic datasets recommends ...
+
+In most cases, however, controlled vocabularies are not part of the metadata standard or schema. They will be selected and activated in templates and applications.
+...example...
+
+
+## Generating structured metadata
+
+Metadata compliant with the standards and schemas described in this Guide can be generated in two different ways: **programmatically**, using a programming language like R or Python, or by **using a specialized metadata editor** application. The first option provides a high degree of flexibility and efficiency. It offers multiple opportunities to automate part of the metadata generation process, and to exploit advanced machine learning solutions to enhance metadata. Metadata generated using R or Python can also be published in a NADA catalog using the NADA API and the R package NADAR or the Python library PyNADA. The programmatic option may thus be preferred by organizations that have strong expertise in R or Python. For other organizations, and for some types of data, the use of a specialized metadata editor may be a better option. Metadata editors are specialized software applications designed to offer a user-friendly alternative to the programmatic generation of metadata. We provide in this section a brief description of how structured metadata can be generated and published using a metadata editor, R, and Python, respectively.
+
+### Generating compliant metadata using a metadata editor
+
+The easiest way to generate metadata compliant with the standards and schemas we describe in this Guide is to use a specialized Metadata Editor, which provides a user-friendly and flexible interface to document data. Most metadata editors are specific to a certain standard; the IHSN / World Bank developed an open source, multi-standard Metadata Editor.
+
+This Metadata Editor contains all suggested standards. The full version of each standard is embedded in the application, but few users will ever make use of all elements contained in a standard, and some will want to customize the labels of the metadata elements, the controlled vocabularies, and the instructions provided to the curators who will enter the metadata.
+
+The Metadata Editor therefore allows users to develop their own templates based on the full version of the standards. A template is a subset of the elements available in the standard/schema, in which elements can be renamed and other customizations can be made (within limits, as the metadata generated must remain compliant with the standard independently of the template).
+
+The screenshot below shows the template manager:
+
+
+![image](https://user-images.githubusercontent.com/35276300/230179149-87eb17ca-2a60-4ae6-a993-423a51880da8.png) +
+
+ +Then UI with (for some types) import of data and automated generation of some metadata. +
+
+![image](https://user-images.githubusercontent.com/35276300/230179493-6e945fed-3bcf-4ab6-9545-8a7e982d5c46.png) +
+
+ +(describe / provide bettere example) + + +### Generating compliant metadata using R + +All schemas described in the [on-line documentation](https://ihsn.github.io/nada-api-redoc/catalog-admin/#) can be used to generate compliant metadata using R scripts. Generating metadata using R will consist of producing a *list* object (itself containing lists). In the documentation of the standards and schemas, curly brackets indicate to R users that a *list* must be created to store the metadata elements. Square brackets indicate that a block of elements is repeatable, which corresponds in R to a *list of lists*. For example (using the [DOCUMENT]((https://ihsn.github.io/nada-api-redoc/catalog-admin/#operation/createDocument)) metadata schema): + +
+![](./images/JSON_to_R_interpret.JPG){width=100%} +
+ +:::note +The sequence in which the metadata elements are created when documenting a dataset using R or Python does not have to match the sequence in the schema documentation. +::: + +Metadata compliant with a standard/schema can be generated using R, and directly uploaded in a NADA catalog without having to be saved as a JSON file. An object (a list) must be created in the R script that contains metadata compliant with the JSON schema. The example below shows how such an object is created and published in a NADA catalog. We assume here that we have a document with the following information: + + - document unique id: *WB_10986/7710* + - title: *Teaching in Lao PDR* + - authors: *Luis Benveniste, Jeffery Marshall, Lucrecia Santibañez (World Bank)* + - date published: *2007* + - countries: *Lao PDR*. + - The document is available from the World Bank Open knowledge Repository at http://hdl.handle.net/10986/7710. + +We will use the [DOCUMENT schema](https://ihsn.github.io/nada-api-redoc/catalog-admin/#tag/Documents) to document the publication, and the [EXTERNAL RESOURCE schema](https://ihsn.github.io/nada-api-redoc/catalog-admin/#tag/External-resources) to publish a link to the document in NADA. + +
+![](./images/ReDoc_documents_21.JPG){width=100%} +
+
+
+Publishing data and metadata in a NADA catalog (using R and the NADAR package, or Python and the PyNADA library) requires first identifying the on-line catalog where the metadata will be published (by providing its URL in the `set_api_url` command line) and providing a key to authenticate as a catalog administrator (in the `set_api_key` command line; note that this key should never be entered in clear in a script, to avoid accidental disclosure).
+
+We then create an object (a list in R, or a dictionary in Python) that we will for example name *my_doc*. Within this list (or dictionary), we will enter all metadata elements. Some will be simple elements, others will be lists (or dictionaries). The first element to be included is the required `document_description`. Within it, we include the `title_statement`, which is also required and contains the mandatory elements `idno` and `title` (all documents must have a unique ID number for cataloguing purposes, and a title). The list of countries that the document covers is a repeatable element, i.e. a list of lists (although we only have one country in this case). Information on the authors is also a repeatable element, allowing us to capture the information on the three co-authors individually.
+
+This *my_doc* object is then published in the NADA catalog using the `add_document` function. Last, we publish (as an external resource) a link to the file, with only basic information. We do not need to document this resource in detail, as it corresponds to the metadata provided in *my_doc*. If we had a different external resource (for example an MS-Excel table that contains all tables shown in the publication), we would make use of more of the external resources metadata elements to document it. Note that instead of a URL, we could have provided a path to an electronic file (e.g., to the PDF document), in which case the file would be uploaded to the web server and made available directly from the on-line catalog. We had previously captured a screenshot of the cover page of the document to be used as a thumbnail in the catalog (optional).
+
+
+```r
+library(nadar)
+# Define the NADA catalog URL and provide an API key
+set_api_url("http://nada-demo.ihsn.org/index.php/api/")
+set_api_key("a1b2c3d4e5")
+  # Note: an administrator API key must always be kept strictly confidential;
+  # it is good practice to read it from an external file, not to enter it in clear
+thumb <- "C:/DOCS/teaching_lao.JPG"   # Cover page image to be used as thumbnail
+# Generate and publish the metadata on the publication
+doc_id <- "WB_10986/7710"
+my_doc <- list(
+  document_description = list(
+
+    title_statement = list(
+      idno = doc_id,
+      title = "Teaching in Lao PDR"
+    ),
+
+    date_published = "2007",
+
+    # Countries: 'ref_country' is a repeatable element (a list of lists),
+    # although we only have one country in this case
+    ref_country = list(
+      list(name = "Lao PDR", code = "LAO")
+    ),
+
+    # Authors: 'authors' is also a repeatable element (a list of lists),
+    # which allows us to document the three co-authors individually
+    authors = list(
+      list(first_name = "Luis",     last_name = "Benveniste", affiliation = "World Bank"),
+      list(first_name = "Jeffery",  last_name = "Marshall",   affiliation = "World Bank"),
+      list(first_name = "Lucrecia", last_name = "Santibañez", affiliation = "World Bank")
+    )
+  )
+)
+# Publish the metadata in the central catalog
+add_document(idno = doc_id,
+             metadata = my_doc,
+             repositoryid = "central",
+             published = 1,
+             thumbnail = thumb,
+             overwrite = "yes")
+# Add a link as an external resource of type document/analytical (doc/anl).
+external_resources_add( + title = "Teaching in Lao PDR", + idno = doc_id, + dctype = "doc/anl", + file_path = "http://hdl.handle.net/10986/7710", + overwrite = "yes" +) +``` + +The document is now available in the NADA catalog. + +
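+Note that publishing to a catalog is not the only possible output: the same `my_doc` list can also be written to a standalone JSON file, for example for archiving or for exchange with another application. A minimal sketch using the jsonlite package (the file name is an arbitrary choice):
+
+```r
+library(jsonlite)
+# Save the 'my_doc' metadata list created above as a JSON file;
+# auto_unbox ensures that single values are not written as one-element arrays
+write_json(my_doc, "WB_10986_7710.json", auto_unbox = TRUE, pretty = TRUE)
+```
+
+The screenshot below shows the entry as published in the NADA catalog.
+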
+![](./images/ReDoc_documents_21b.JPG){width=100%} +
+
+
+### Generating compliant metadata using Python
+
+Generating metadata using Python will consist of producing a *dictionary* object, which will itself contain lists and dictionaries. Non-repeatable metadata elements will be stored as dictionaries, and repeatable elements as lists of dictionaries. In the [metadata documentation](https://ihsn.github.io/nada-api-redoc/catalog-admin/#), curly brackets indicate that a *dictionary* must be created to store the metadata elements. Square brackets indicate that a *list* of dictionaries must be created.
+
+![](./images/JSON_to_Python_interpret.JPG){width=100%} +
+ +
+ +:::idea +Dictionaries in Python are very similar to JSON schemas. When documenting a dataset, data curators who use Python can copy a schema from the ReDoc website, paste it in their script editor, then fill out the relevant metadata elements and delete the ones that are not used. +::: + +
+![](./images/copy_ReDoc.JPG){width=75%} +
+ +
+
+The Python equivalent of the R example we provided above is as follows:
+
+
+```python
+import pynada as nada
+# Define the NADA catalog URL and provide an API key
+nada.set_api_url("http://nada-demo.ihsn.org/index.php/api/")
+nada.set_api_key("a1b2c3d4e5")
+  # Note: an administrator API key must always be kept strictly confidential;
+  # it is good practice to read it from an external file, not to enter it in clear
+thumb = "C:/DOCS/teaching_lao.JPG"   # Cover page image to be used as thumbnail
+# Generate and publish the metadata on the publication
+doc_id = "WB_10986/7710"
+document_description = {
+  'title_statement': {
+      'idno': "WB_10986/7710",
+      'title': "Teaching in Lao PDR"
+  },
+
+  'date_published': "2007",
+
+  # Countries: 'ref_country' is a repeatable element (a list of dictionaries),
+  # although we only have one country in this case
+  'ref_country': [
+      {'name': "Lao PDR", 'code': "LAO"}
+  ],
+
+  # Authors: 'authors' is also a repeatable element (a list of dictionaries),
+  # which allows us to document the three co-authors individually
+  'authors': [
+      {'first_name': "Luis",     'last_name': "Benveniste", 'affiliation': "World Bank"},
+      {'first_name': "Jeffery",  'last_name': "Marshall",   'affiliation': "World Bank"},
+      {'first_name': "Lucrecia", 'last_name': "Santibañez", 'affiliation': "World Bank"}
+  ]
+}
+# Publish the metadata in the central catalog
+nada.create_document_dataset(
+  dataset_id = doc_id,
+  repository_id = "central",
+  published = 1,
+  overwrite = "yes",
+  document_description = document_description,  # the metadata dictionary created above
+  thumbnail_path = thumb)
+# Add a link as an external resource of type document/analytical (doc/anl).
+nada.add_resource(
+  dataset_id = doc_id,
+  dctype = "doc/anl",
+  title = "Teaching in Lao PDR",
+  file_path = "http://hdl.handle.net/10986/7710",
+  overwrite = "yes")
+```
+
+
+
+[^1]: See Omar Benjelloun, Shiyu Chen, Natasha Noy, 2020, *Google Dataset Search by the Numbers*, https://doi.org/10.48550/arXiv.2006.06894
diff --git a/04_chapter04_document.md b/04_chapter04_document.md
new file mode 100644
index 0000000..87729e6
--- /dev/null
+++ b/04_chapter04_document.md
@@ -0,0 +1,2818 @@
+---
+output: html_document
+---
+
+# (PART) STANDARDS AND SCHEMAS {-}
+
+# Documents {#chapter04}
+
+![](./images/DCMI_MARC21_BIBTEX.JPG){width=100%} +
+
+
+This chapter describes the use of a metadata schema for documenting *documents*. By *document*, we mean a bibliographic resource of any type such as a book, a working paper or a paper published in a scientific journal, a report, a presentation, a manual, or any other resource consisting mainly of text and available in physical and/or electronic format.
+
+:::idea
+Suggestions and recommendations to data curators
+
+ - Documents in a data catalog can appear (i) as "data" in the catalog, or (ii) as "related resources" attached to other datasets. The schema we describe here is to be used for documents that will be listed as catalog entries and made searchable, not for those that will be attached as resources (for which the "external resource" metadata schema must be used).
+ - For all types of data we describe in this Guide (microdata, geographic, indicators, tables, images, audio, video, and scripts), what is indexed and made searchable in the catalog are the **metadata** associated with the data (some of these metadata may have been extracted directly from the data). For *documents*, not only the metadata but also the content of the document (the "data") can and should be indexed and made searchable. Some documents may have been scanned and submitted to optical character recognition (OCR). The OCR process will not always manage to properly convert images to text, resulting in errors and non-existing words that should not be included in an index. It is thus recommended to submit the text version of these documents to a pipeline of quality control and enhancement (spell checking and other corrections).
+ - Including a screenshot of a document cover page in a data catalog adds value.
+ - Documents should be categorized by type, and the *type* metadata element should have a controlled vocabulary. If a document can have more than one type, use the *tags* element (with a *tag_group* = *type*) instead of the non-repeatable *type* element to store this information. Use this information to activate a facet in the catalog user interface; many users will find it useful to be able to filter documents by type.
+ - The document metadata can be augmented in different ways, including by applying automated topic extraction (e.g., using an LDA topic model) and by generating document embeddings. When topic models and embedding models are used, it is important to ensure that the same topic model and the same embedding model are consistently used for all resources in the catalog.
+ - Machine learning tools also provide automatic language detection and translation solutions that may be useful to enhance the metadata.
+ - Documenting documents using R or Python is not very complex. For large collections of documents, managing and publishing metadata can be made significantly more efficient when programmatic solutions are used.
+ - It is highly recommended to obtain a globally unique identifier for each document, such as a DOI, an ISBN, or other.
+:::
+
+
+## MARC 21, Dublin Core, and BibTex
+
+Librarians have developed specific standards to describe and catalog documents. The [MARC 21](https://www.loc.gov/marc/bibliographic/) (**MA**chine-**R**eadable **C**ataloging) standard used by the United States Library of Congress is one of them. It provides a detailed structure for documenting bibliographic resources, and is the recommended standard for well-resourced document libraries.
+
+For the purpose of cataloguing documents in a less-specialized repository intended to accommodate data of multiple types, we built our schema on a simpler but also highly popular standard, the **Dublin Core Metadata Element Set**. We will refer to this metadata specification, developed by the [Dublin Core Metadata Initiative](https://dublincore.org/), as the *Dublin Core*. The Dublin Core became an ISO standard (ISO 15836) in 2009. It consists of a list of fifteen core metadata elements, to which more specialized elements can be added.
These fifteen elements, with a definition extracted from the Dublin Core [website](https://dublincore.org/), are the following: + +|No | Element name | Description | +|-- | -------------------- | --------------------------------------------------------------- | +|1 | contributor | An entity responsible for making contributions to the resource. | +|2 | coverage | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. | +|3 | creator | An entity primarily responsible for making the resource. | +|4 | date | A point or period of time associated with an event in the life cycle of the resource. | +|5 | description | An account of the resource. | +|6 | format | The file format, physical medium, or dimensions of the resource. | +|7 | identifier | An unambiguous reference to the resource within a given context. | +|8 | language | A language of the resource. | +|9 | publisher | An entity responsible for making the resource available. | +|10 | relation | A related resource. | +|11 | rights | Information about rights held in and over the resource. | +|12 | source | A related resource from which the described resource is derived. | +|13 | subject | The topic of the resource. | +|14 | title | A name given to the resource. | +|15 | type | The nature or genre of the resource. | + +Due to its simplicity and versatility, this standard is widely used for multiple purposes. It can be used to document not only documents but also resources of other types like images or others. Documents that can be described using the MARC 21 standard can be described using the Dublin Core, although not with the same granularity of information. The US Library of Congress provides a [mapping between the MARC and the Dublin Core](https://www.loc.gov/marc/marc2dc.html) metadata elements. + +MARC 21 and the Dublin Core are used to document a resource (typically, the electronic file containing the document) and its content. Another schema, [BibTex](https://en.wikipedia.org/wiki/BibTeX), has been developed for the specific purpose of recording bibliographic citations. BibTex is a list of fields that may be used to generate bibliographic citations compliant with different bibliography styles. It applies to documents of multiple types: books, articles, reports, etc. + +The metadata schema we propose to document publications and reports is a combination of Dublin Core, MARC 21, and BibTex elements. The technical documentation of the schema and its API is available at https://ihsn.github.io/nada-api-redoc/catalog-admin/#tag/Documents. + + +## Schema description + +The proposed schema comprises two main blocks of elements, **`metadata_information`** and **`document_description`**. It also contains the `tags` element common to all our schemas. The `repository_id`, `published` and `overwrite` items in the schema are not metadata elements *per se*, but parameters used when publishing the metadata in a NADA catalog. +
+```json +{ + "repositoryid": "string", + "published": 0, + "overwrite": "no", + "metadata_information": {}, + "document_description": {}, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
+ +### Metadata information + +The **`metadata_information`** contains information not related to the document itself but to its metadata. In other words, it contains "metadata on the metadata". This information is optional but we recommend to enter content at least in the `name` and `date` sub-elements, which indicate who generated the metadata and when. This information is not useful to end-users of document catalogs, but is useful to catalog administrators for two reasons: + + - metadata compliant with standards are intended to be shared and used by inter-operable applications. Data catalogs offer opportunities to harvest (pull) information from other catalogs, or to publish (push) metadata in other catalogs. Metadata information helps to keep track of the provenance of metadata. + + - metadata for a same document may have been generated by more than one person or organization, or one version of the metadata can be updated and replaced with a new version. The `metadata information` helps catalog administrators distinguish and manage different versions of the metadata. +
+```json +"metadata_information": { + "title": "string", + "idno": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "production_date": "string", + "version": "string" +} +``` +
+ +The elements in the block are: + +- **`title`** *[Required ; Not repeatable ; String]*
+The title of the metadata document (which will usually be the same as the "Title" in the "Document description / Title statement" section). The metadata document is the metadata file (XML or JSON file) that is being generated. + +- **`idno`** *[Optional ; Not repeatable ; String]*
+A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the "Primary ID" in section "Document description / Title statement", although it is good practice to generate identifiers that establish a clear connection between these two identifiers. The Document ID could also include the metadata document version identifier. For example, if the "Primary ID" of the publication is “978-1-4648-1342-9”, the Document ID could be “IHSN_978-1-4648-1342-9_v1.0” if the metadata are produced by the IHSN and if this is version 1.0 of the metadata. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format. The identifier may not contain blank spaces. + +- **`producers`** *[Optional ; Repeatable]*
+This refers to the producer(s) of the metadata, not to the producer(s) of the document itself. The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a "Recommended" element. For catalog administration purposes, information on the producer and on the date of metadata production is useful. + + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization who produced the metadata or contributed to its production. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The abbreviation (or acronym) of the organization that is referenced in `name`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the person or organization mentioned in `name` in the production of the metadata.

+ +- **`production_date`** *[Optional ; Not repeatable ; String]*
+The date the metadata on this document was produced (not distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a "Recommended" element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.
+
+- **`version`** *[Optional ; Not repeatable ; String]*
+The version of the metadata document (not the version of the publication, report, or other resource being documented). + +> Example +
+ +```r +my_doc = list( + + metadata_information = list( + + idno = "WBDG_978-1-4648-1342-9", + + producers = list( + list(name = "Development Data Group, Curation Team", + abbr = "WBDG", + affiliation = "World Bank") + ), + + production_date = "2020-12-27" + ), + + # ... + +) +``` +
+ +### Document description + +The **`document_description`** block contains the metadata elements used to describe the document. It includes the Dublin Core elements and a few more. The schema also includes elements intended to store information generated by machine learning (natural language processing - NLP) models to augment metadata on documents. +
+```json +"document_description": { + "title_statement": {}, + "authors": [], + "editors": [], + "date_created": "string", + "date_available": "string", + "date_modified": "string", + "date_published": "string", + "identifiers": [], + "type": "string", + "status": "string", + "description": "string", + "toc": "string", + "toc_structured": [], + "abstract": "string", + "notes": [], + "scope": "string", + "ref_country": [], + "geographic_units": [], + "bbox": [], + "spatial_coverage": "string", + "temporal_coverage": "string", + "publication_frequency": "string", + "languages": [], + "license": [], + "bibliographic_citation": [], + "chapter": "string", + "edition": "string", + "institution": "string", + "journal": "string", + "volume": "string", + "number": "string", + "pages": "string", + "series": "string", + "publisher": "string", + "publisher_address": "string", + "annote": "string", + "booktitle": "string", + "crossref": "string", + "howpublished": "string", + "key": "string", + "organization": "string", + "url": null, + "translators": [], + "contributors": [], + "contacts": [], + "rights": "string", + "copyright": "string", + "usage_terms": "string", + "disclaimer": "string", + "security_classification": "string", + "access_restrictions": "string", + "sources": [], + "data_sources": [], + "keywords": [], + "themes": [], + "topics": [], + "disciplines": [], + "audience": "string", + "mandate": "string", + "pricing": "string", + "relations": [], + "reproducibility": {} +} +``` +
+ +- **`title_statement`** *[Required ; Not repeatable]*
+ +The `title_statement` is a required group of five elements, two of which are required: +
+ + ```json + "title_statement": { + "idno": "string", + "title": "string", + "sub_title": "string", + "alternate_title": "string", + "translated_title": "string" + } + ``` +
+ + - **`idno`** *[Required ; Not repeatable ; String]*
+ A unique identifier of the document, which serves as the "primary ID". `idno` is a unique identification number used to identify the document. A unique identifier is required for cataloguing purposes, so this element is declared as "Required". The identifier will allow users to cite the document properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the document. Alternatively, the `idno` can be constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for a same document (in element `identifiers`); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The `idno` should not contain blank spaces.
+ - **`title`** *[Required ; Not repeatable ; String]*
+ The title of the book, report, paper, or other document. Pay attention to the consistent use of capitalization in the title, to ensure consistency across the documents listed in your catalog. It is recommended to use sentence capitalization.
+ - **`sub_title`** *[Optional ; Not repeatable ; String]*
+ The document subtitle can be used when there is a need to distinguish characteristics of a document. Pay attention to the consistent use of capitalization in the subtitle. + - **`alternate_title`** *[Optional ; Not repeatable ; String]*
+ An alternate version of the title, possibly an abbreviated version. For example, the World Bank’s World Development Report is often referred to as the WDR; the alternate title for the “World Development Report 2021” could then be “WDR 2021”.
+ - **`translated_title`** *[Optional ; Not repeatable ; String]*
+ A translation of the title of the document. Special characters should be properly displayed, such as accents and other stress marks or different alphabets.
+ +
+
+  ```r
+  my_doc <- list(
+
+    # ... ,
+
+    document_description = list(
+      title_statement = list(
+        idno = "978-1-4648-1342-9",
+        title = "The Changing Nature of Work",
+        sub_title = "World Development Report 2019",
+        alternate_title = "WDR 2019",
+        translated_title = "Rapport sur le Développement dans le Monde 2019"
+      ),
+
+      # ...
+    )
+  )
+  ```
+ +- **`authors`** *[Optional ; Repeatable]*
+The authors should be listed in the same order as they appear in the source itself, which is not necessarily alphabetical. +
+ + ```json + "authors": [ + { + "first_name": "string", + "initial": "string", + "last_name": "string", + "affiliation": "string", + "author_id": [ + { + "type": null, + "id": null + } + ], + "full_name": "string" + } + ] + ``` +
+ + - **`first_name`** *[Optional ; Not repeatable ; String]*
+ The first name of the author.
+ - **`initial`** *[Optional ; Not repeatable ; String]*
+ The initials of the author.
+ - **`last_name`** *[Optional ; Not repeatable ; String]*
+ The last name of the author.
+ - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the author.
+ - **`author_id`** *[Optional ; Repeatable]*
+ The author ID in a registry of academic researchers such as the [Open Researcher and Contributor ID (ORCID)](https://orcid.org/).
+ - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of ID, i.e. the identification of the registry that assigned the author's identifier, for example "ORCID".
+ - **`id`** *[Optional ; Not repeatable ; String]*
+ The ID of the author in the registry mentioned in `type`.

+ - **`full_name`** *[Optional ; Not repeatable ; String]*
+ The full name of the author. This element should only be used when the first and last name of an author cannot be distinguished, i.e. when elements `first_name` and `last_name` cannot be filled out. This element can also be used when the author of a document is an organization or other type of entity.
+
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+
+      authors = list(
+        list(first_name = "John", last_name = "Smith",
+             author_id = list(type = "ORCID", id = "0000-0002-1234-XXXX")),
+        list(first_name = "Jane", last_name = "Doe",
+             author_id = list(type = "ORCID", id = "0000-0002-5678-YYYY"))
+      ),
+
+      # ...
+    )
+  )
+  ```
+ +- **`editors`** *[Optional ; Repeatable]*
+If the source is a text within an edited volume, it should be listed under the name of the author of the text used, not under the name of the editor. The name of the editor should however be provided in the bibliographic citation, in accordance with a [reference style](https://awelu.srv.lu.se/sources-and-referencing/using-a-reference-style/elements-of-the-reference-list/). +
+```json +"editors": [ + { + "first_name": "string", + "initial": "string", + "last_name": "string", + "affiliation": "string" + } +] +``` +
+ - **`first_name`** *[Optional ; Not repeatable ; String]*
+ The first name of the editor. + - **`initial`** *[Optional ; Not repeatable ; String]*
+ The initials of the editor. + - **`last_name`** *[Optional ; Not repeatable ; String]*
+ The last name of the editor. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the editor.

+ +- **`date_created`** *[Optional ; Not repeatable ; String]*
+The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was produced. This can be different from the date the document was published or made available, and from the temporal coverage. The document "Nigeria - Displacement Report" by the International Organization for Migration (IOM) shown below provides an example of this. The document was produced in November 2020 (`date_created`), refers to events that occurred between 21 September and 10 October 2020 (`temporal_coverage`), and was published (`date_published`) on 28 January 2021.
+
+- **`date_available`** *[Optional ; Not repeatable ; String]*
+The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was made available. This is different from the date it was published (see element `date_published` below). This element will not be used frequently. + +- **`date_modified`** *[Optional ; Not repeatable ; String]*
+The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was last modified. + +- **`date_published`** *[Optional ; Not repeatable ; String]*
+The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was published.
+
+  The example below is a [report from the International Organization for Migration](https://displacement.iom.int/node/10647) (IOM). It shows the difference between the date the document was created (`date_created`), published (`date_published`), and the period it covers (`temporal_coverage`).
+
+ ![](./images/document_example_00b.JPG){width=85%} +
+ + In R, this will be captured as follows: +
+ + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + temporal_coverage = "21 September 2020 to 10 October 2020", + date_created = "2020-11", + date_published = "2021-01-28", + + # ... + ), + # ... + ) + ``` +
+ +- **`identifiers`** *[Optional ; Repeatable]* +This element is used to enter document identifiers (IDs) other than the catalog ID entered in the `title_statement` (`idno`). It can for example be a Digital Object Identifier (DOI), an International Standard Book Number (ISBN), or an International Standard Serial Number (ISSN). The ID entered in the `title_statement` can be repeated here (the `title_statement` does not provide a `type` parameter; if a DOI, ISBN, ISSN, or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`). + +
+```json +"identifiers": [ + { + "type": "string", + "identifier": "string" + } +] +``` +
+ + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of identifier, for example "DOI", "ISBN", or "ISSN". + - **`identifier`** *[Required ; Not repeatable ; String]*
+ The identifier itself.

+ + The example shows the list of identifiers of the World Bank World Development Report 2020 *The Changing Nature of Work* (see full metadata for this document in the *Complete Example 2* of this chapter). + +
+ + ```r + my_doc <- list( + + # ... , + + document_description = list( + + # ... , + + identifiers = list( + list(type = "ISSN", identifier = "0163-5085"), + list(type = "ISBN softcover", identifier = "978-1-4648-1328-3"), + list(type = "ISBN hardcover", identifier = "978-1-4648-1342-9"), + list(type = "e-ISBN", identifier = "978-1-4648-1356-6"), + list(type = "DOI softcover", identifier = "10.1596/978-1-4648-1328-3"), + list(type = "DOI hardcover", identifier = "10.1596/978-1-4648-1342-9") + ), + + # ... + ), + # ... + ) + ``` +
+ +- **`type`** *[Optional ; Not repeatable ; String]*
+ + This describes the nature of the resource. It is recommended practice to select a value from a controlled vocabulary, which could for example include the following options: "article", "book", "booklet", "collection", "conference proceedings", "manual", "master thesis", "patent", "PhD thesis", "proceedings", "technical report", "working paper", "website", "other". Specialized agencies may want to create their own controlled vocabularies; for example, a national statistical agency may need options like "press release", "methodology document", "protocol", or "yearbook". The `type` element can be used to create a "Document type" facet (filter) in a data catalog. If the controlled vocabulary is such that it contains values that are not mutually exclusive (i.e. if a document could possibly have more than one type), the element `type` cannot be used as it is not repeatable. In such case, the solution is to provide the type of document as `tags`, in a `tag_group` that could for example be named *type* or *document_type*. Note also that the Dublin Core provides a controlled vocabulary (the [DCMI Type Vocabulary](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#section-7)) for the `type` element, but this vocabulary is related to the types of resources (dataset, event, image, software, sound, etc.), not the type of document which is what we are interested in here. + +
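+
+  As a hypothetical illustration (not one of this chapter's worked examples), a report that is both a "manual" and a "methodology document" could be documented in R as follows, using the repeatable `tags` element (with its `tag` and `tag_group` sub-elements) rather than the non-repeatable `type` element:
+
+  ```r
+  my_doc <- list(
+    document_description = list(
+      # ... title_statement and other elements ...
+    ),
+    # Two document types attached as tags, grouped under "document_type"
+    tags = list(
+      list(tag = "manual",               tag_group = "document_type"),
+      list(tag = "methodology document", tag_group = "document_type")
+    )
+  )
+  ```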
+ +- **`status`** *[Optional ; Not repeatable ; String]*
+ + The status of the document. The status of the document should (but does not have to) be provided using a controlled vocabulary, for example with the following options: "first draft", "draft", "reviewed draft", "final draft", "final". Most documents published in a catalog will likely be "final". + +
+ +- **`description`** *[Optional ; Not repeatable ; String]*
+ + This element is used to provide a brief description of the document (not an abstract, which would be provided in the field `abstract`). It should not be used to provide content that is contained in other, more specific elements. As stated in the [Dublin Core Usage Guide](https://www.dublincore.org/specifications/dublin-core/usageguide/elements/), "Since the `description` field is a potentially rich source of indexable terms, care should be taken to provide this element when possible. Best practice recommendation for this element is to use full sentences, as description is often used to present information to users to assist in their selection of appropriate resources from a set of search results." + +
+ +- **`toc`** *[Optional ; Not repeatable ; String]*
+
+  The table of content of the document, provided as a single string element, i.e. with no structure (a structured alternative is provided with the field `toc_structured` described below). This element is also a rich source of indexable terms which can contribute to document discoverability; care should thus be taken to use it (or the `toc_structured` alternative) whenever possible.
+ + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + toc = "Introduction + 1. The importance of rich and structured metadata + 1.1 Rich metadata + 1.2 Structured metadata + 2. Technology: JSON schemas and tools + 2.1 JSON schemas + 2.1.1 Advantages of JSON over XML + 2.2 Defining a metadata schema in JSON format", + # ... + ), + + # ... + ) + ``` + +
+ +- **`toc_structured`** *[Optional ; Not repeatable]*
+ +
+```json +"toc_structured": [ + { + "id": "string", + "parent_id": "string", + "name": "string" + } +] +``` +
+ + This element is used as an alternative to `toc` to provide a structured table of content. The element contains a repeatable block of sub-elements which provides the possibility to define a hierarchical structure: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier for the element of the table of content. For example, the `id` for Chapter 1 could be "1" while the `id` for section 1 of chapter 1 would be "11". + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The `id` of the parent section (e.g., if the table of content is divided into chapters, themselves divided into sections, the `parent_id` of a section would be the id of the chapter it belongs to.) + - **`name`** *[Required ; Not repeatable ; String]*
+ The label of this section of the table of content (e.g., the chapter or section title)

+ + The example below shows how the content provided in the previous example is presented in a structured format. + +
+ + ```r + my_doc <- list( + # ... , + document_description = list( + # ..., + + toc_structured = list( + list(id = "0", parent_id = "" , name = "Introduction"), + list(id = "1", parent_id = "" , name = "1. The importance of rich and structured metadata"), + list(id = "11", parent_id = "1", name = "1.1 Rich metadata"), + list(id = "12", parent_id = "1", name = "1.2 Structured metadata"), + list(id = "2", parent_id = "" , name = "2. Technology: JSON schemas and tools"), + list(id = "21", parent_id = "2", name = "2.1 JSON schemas"), + list(id = "211", parent_id = "21", name = "2.1.1 Advantages of JSON over XML"), + list(id = "22", parent_id = "2", name = "2.2 Defining a metadata schema in JSON format") + # etc. + ), + # ... + ), + # ... + ) + ``` +
+ +- **`abstract`** *[Optional ; Not repeatable ; String]*
+ + The abstract is a summary of the document, usually about one or two paragraph(s) long (around 150 to 300 words). + +
+ + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + abstract = "The 2019 World Development Report studies how the nature of work is changing as a result of advances in technology today. + While technology improves overall living standards, the process can be disruptive. + A new social contract is needed to smooth the transition and guard against inequality.", + + # ... + ), + # ... + ) + ``` +
+ +- **`notes`** *[Optional ; Repeatable ; String]*
+ +
+```json
+"notes": [
+  {
+    "note": "string"
+  }
+]
+```
+
+ + This field can be used to provide information on the document that does not belong to the other, more specific metadata elements provided in the schema. + - **`note`**
+ A note, entered as free text. + +
+ + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + notes = list( + list(note = "This is note 1"), + list(note = "This is note 2") + ), + + # ... + ), + # ... + ) + ``` + +
+ +- **`scope`** *[Optional ; Not repeatable ; String]*
+ + A textual description of the topics covered in the document, which complements (but does not duplicate) the elements `description` and `topics` available in the schema. + +- **`ref_country`** *[Optional ; Repeatable]*
+The list of countries (or regions) covered by the document, if applicable. +This is a repeatable block of two elements: + + - **`name`** *[Required ; Not repeatable ; String]*
+ The country/region name. Note that many organizations have their own policies on the naming of countries/regions/economies/territories, which data curators will have to comply with. + - **`code`** *[Optional ; Not repeatable ; String]*
+    The country/region code. It is recommended to use a standard list of country codes, such as the [ISO 3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) list.
+
+```json +"ref_country": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + The field `ref_country` will often be used as a filter (facet) in data catalogs. When a document is related to only part of a country, we still want to capture this information in the metadata. For example, the `ref_country` element for the document ["Sewerage and sanitation : Jakarta and Manila"](https://documents.worldbank.org/en/publication/documents-reports/documentdetail/880091468095971513/sewerage-and-sanitation-jakarta-and-manila) will list "Indonesia" (code IDN) and "Philippines" (code PHL). + + Considering the importance of the geographic coverage of a document as a filter, the `ref_country` element deserves particular attention. The document title will often but not always provide the necessary information. Using R, Python or other programming languages, a list of all countries mentioned in a document can be automatically extracted, with their frequencies. This approach (which requires a lookup file containing a list of all countries in the world with their different denominations and spelling) can be used to extract the information needed to populate the `ref_country` element (not all countries in the list will have to be included; some threshold can be set to only include countries that are "significantly" mentioned in a document). Tools like the R package [countrycode](https://cran.r-project.org/web/packages/countrycode/index.html) are available to facilitate this process. + + When a document is related to a region (not to specific countries), or when it is related to a topic but not a specific geographic area, the `ref_country` might still be applicable. Try and extract (possibly using a script that parses the document) information on the countries mentioned in the document. For example, `ref_country` for the World Bank document ["The investment climate in South Asia"](http://documents1.worldbank.org/curated/en/242391468114239381/pdf/715140v10ESW0P0Climate0I0OCR0Needed.pdf) should include Afghanistan (mentioned 81 times in the document), Bangladesh (113), Bhutan (94), India (148), Maldives (62), Nepal (64), Pakistan (103), and Sri Lanka (98), but also China (not a South-Asian country, but mentioned 63 times in the document). + + If a document is not specific to any country, the element `ref_country` would be ignored (not included in the metadata) if the content of the document is not related to any geographic area (for example, the user's guide of a software application), or would contain "World" (code WLD) if the document is related but not specific to countries (for example, a document on "Climate change mitigation"). + +
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      ref_country = list(
+        list(name = "Bangladesh", code = "BGD"),
+        list(name = "India",      code = "IND"),
+        list(name = "Nepal",      code = "NPL")
+      ),
+      
+      # ...
+    )
+  )
+  ```
+
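+
+  As a complement to the example above, the sketch below illustrates the country-extraction approach described earlier. It is only a sketch: the input file name, the matching rule, and the threshold of 5 mentions are arbitrary assumptions, and the simple matching on English country names from the `countrycode` package's `codelist` table would need to be refined (aliases, alternative spellings, etc.) for production use.
+
+  ```r
+  # Illustrative sketch: count country mentions in a document to help populate `ref_country`
+  library(countrycode)
+  library(stringr)
+  
+  doc_text <- paste(readLines("my_document.txt", warn = FALSE), collapse = " ")
+  
+  countries <- countrycode::codelist[, c("country.name.en", "iso3c")]
+  countries <- countries[!is.na(countries$iso3c), ]
+  
+  # Simple case-insensitive count of each country name in the text
+  countries$mentions <- sapply(
+    countries$country.name.en,
+    function(x) str_count(tolower(doc_text), fixed(tolower(x)))
+  )
+  
+  # Keep countries that are "significantly" mentioned (threshold set by the curator)
+  selected <- countries[countries$mentions >= 5, ]
+  
+  ref_country <- lapply(seq_len(nrow(selected)), function(i) {
+    list(name = selected$country.name.en[i], code = selected$iso3c[i])
+  })
+  ```
+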
+ +- **`geographic_units`** *[Optional ; Repeatable]*
+A list of geographic units covered in the document, other than the countries listed in `ref_country`. + +
+```json +"geographic_units": [ + { + "name": "string", + "code": "string", + "type": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the geographic unit. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the geographic unit. + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of the geographic unit (e.g., "province", "state", "district", or "town").
+ +
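+
+  The illustrative sketch below (the names and types are placeholders for a hypothetical report on a district of Bangladesh) shows how `geographic_units` can complement `ref_country`:
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      ref_country = list(
+        list(name = "Bangladesh", code = "BGD")
+      ),
+      
+      geographic_units = list(
+        list(name = "Chattogram",  type = "Division"),
+        list(name = "Cox's Bazar", type = "District")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+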
+ +- **`bbox`** *[Optional ; Repeatable]*
+This element is used to define one or multiple geographic bounding box(es), which are the fundamental rectangular geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the document's geographic coverage. The bounding box provides the geographic coordinates of the top-left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. The valid range of latitude in degrees is -90 to +90 for the southern and northern hemisphere, respectively. Longitude is in the range -180 to +180, specifying coordinates west and east of the Prime Meridian, respectively. This element will rarely be used for documenting publications. Bounding boxes are an optional element, but when a bounding box is defined, all four coordinates are required.
+
+```json +"bbox": [ + { + "west": "string", + "east": "string", + "south": "string", + "north": "string" + } +] +``` +
+ + - **`west`** *[Required ; Not repeatable ; String]*
+ The West longitude of the bounding box. + - **`east`** *[Optional ; Not repeatable ; String]*
+ The East longitude of the bounding box. + - **`south`** *[Optional ; Not repeatable ; String]*
+ The South latitude of the bounding box. + - **`north`** *[Optional ; Not repeatable ; String]*
+ The North latitude of the bounding box.
+ + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + bbox = list( + list(west = "92.12973", + east = "92.26863", + south = "20.91856", + north = "21.22292") + ), + + # ... + ), + # ... + ) + ``` + +
+ +- **`spatial_coverage`** *[Optional ; Not repeatable ; String]*
+
+  This element provides another space for capturing information on the spatial coverage of a document, which complements the `ref_country`, `geographic_units`, and `bbox` elements. It can be used to qualify the geographic coverage of the document as free text. For example, a report on refugee camps in the Cox's Bazar district of Bangladesh would have Bangladesh as reference country, "Cox's Bazar" as a geographic unit, and "Rohingya's refugee camps" as spatial coverage.
+
+ + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + ref_country = list( + list(name = "Bangladesh", code = "BGD") + ), + + geographic_units = list( + list(name = "Cox's Bazar", type = "District") + ), + + spatial_coverage = "Rohingya's refugee camps", + + # ... + ), + # ... + + ) + ``` + +
+ +- **`temporal_coverage`** *[Optional ; Not repeatable ; String]*
+ + Not all documents have a specific time coverage. When they do, it can be specified in this element. + +
+ +- **`publication_frequency`** *[Optional ; Not repeatable ; String]*
+Some documents are published regularly. The frequency of publications can be documented using this element. + + It is recommended to use a controlled vocabulary, for example the [PRISM Publishing Frequency Vocabulary](http://prismstandard.org/vocabularies/3.0/pubfrequency.xml) which identifies standard publishing frequencies for a serial or periodical publication. + + | Frequency | Description | + |--------------|-----------------------------------| + | annually | Published once a year | + | semiannually | Published twice a year | + | quarterly | Published every 3 months, or once a quarter| + | bimonthly | Published twice a month | + | monthly | Published once a month | + | biweekly | Published twice a week | + | weekly | Published once a week | + | daily | Published every day | + | continually | Published continually as new content is added; typical of websites and blogs, typically several times a day| + | irregularly | Published on an irregular schedule, such as every month except July and August| + | other | Published on another schedule not enumerated in this controlled vocabulary | + +
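+
+  A minimal sketch (the values are illustrative, for a hypothetical quarterly update covering the years 2010 to 2020) combining this element with the `temporal_coverage` element described above:
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      temporal_coverage     = "2010 to 2020",
+      publication_frequency = "quarterly",
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+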
+ +- **`languages`** *[Optional ; Repeatable]*
+The language(s) in which the document is written. For the language codes and names, the use of the ISO 639-2 standard is recommended. + +
+```json +"languages": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + This is a block of two elements (at least one must be provided for each language): + + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the language. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the language. The use of [ISO 639-2](https://www.loc.gov/standards/iso639-2/php/code_list.php) (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings. +
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      languages = list(
+        list(name = "English", code = "eng")
+      )
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+
+ +- **`license`** *[Optional ; Repeatable]*
+Information on the license(s) attached to the document, which defines the terms of use. +
+```json +"license": [ + { + "name": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the license (e.g., CC-BY 4.0). + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the license, where detailed information on the license can be obtained. + +
+ + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + license = list( + list(name = "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)", + uri = "http://creativecommons.org/licenses/by/3.0/igo") + ), + + # ... + ), + # ... + ) + ``` + +
+ +- **`bibliographic_citation`** *[Optional ; Repeatable]*
+The bibliographic citation provides relevant information about the author and the publication. When using the element `bibliographic_citation`, the citation is provided as a single item. It should be provided in a standard style, such as the Modern Language Association ([MLA](https://www.mla.org/)), the American Psychological Association ([APA](https://apastyle.apa.org/)), or [Chicago](https://owl.purdue.edu/owl/research_and_citation/chicago_manual_17th_edition/cmos_formatting_and_style_guide/chicago_manual_of_style_17th_edition.html) style. Note that the schema also provides an itemized list of the elements (BibTeX fields) needed to build a citation in the format of the data curator's choice.
+
+```json +"bibliographic_citation": [ + { + "style": "string", + "citation": "string" + } +] +``` +
+ + - **`style`** *[Optional ; Not repeatable ; String]*
+ The citation style, e.g. "MLA", "APA", or "Chicago". + - **`citation`** *[Optional ; Not repeatable ; String]*
+ The citation in the style mentioned in `style`.

+ + The example below shows how the bibliographic citation for an article published in [Econometrica](https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00167) can be provided in three different formats. + + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + bibliographic_citation = list( + + list(style = "MLA", + citation = 'Davidson, Russell, and Jean-Yves Duclos. “Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality.” Econometrica, vol. 68, no. 6, [Wiley, Econometric Society], 2000, pp. 1435–64, http://www.jstor.org/stable/3003995.'), + + list(style = "APA", + citation = 'Davidson, R., & Duclos, J.-Y. (2000). Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality. Econometrica, 68(6), 1435–1464. http://www.jstor.org/stable/3003995'), + + list(style = "Chicago", + citation = 'Davidson, Russell, and Jean-Yves Duclos. “Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality.” Econometrica 68, no. 6 (2000): 1435–64. http://www.jstor.org/stable/3003995.') + + ), + + # ... + ), + # ... + ) + ``` + +
+ +------- +**Bibliographic elements** +------- + +The elements that follow are bibliographic elements that correspond to BibTex fields. Note that some of the BibTex elements are found elsewhere in the schema (namely `type`, `authors`, `editors`, `year` and `month`, `isbn`, `issn` and `doi`); when constructing a bibliographic citation, these external elements will have to be included as relevant. The description of the bibliographic fields listed below was adapted from [Wikipedia's description of BibTex](https://en.wikipedia.org/wiki/BibTeX). + +```json +{ + "chapter": "string", + "edition": "string", + "institution": "string", + "journal": "string", + "volume": "string", + "number": "string", + "pages": "string", + "series": "string", + "publisher": "string", + "publisher_address": "string", + "annote": "string", + "booktitle": "string", + "crossref": "string", + "howpublished": "string", + "key": "string", + "organization": "string", + "url": null +} +``` + +The elements that are required to form a complete bibliographic citation depend on the type of document. The table below, adapted from the [BibTex templates](https://www.bibtex.com/format/), provides a list of required and optional fields by type of document: + + | Document type | Required fields | Optional fields | + |------------------------------------|-----------------------------------|--------------------------------------| + | Article from a journal or magazine | author, title, journal, year | volume, number, pages, month, note, key | + | Book with an explicit publisher | author or editor, title, publisher, year | volume, series, address, edition, month, note, key | + | Printed and bound document without a named publisher or sponsoring institution | title | author, howpublished, address, month, year, note, key | + | Part of a book (chapter and/or range of pages) | author or editor, title, chapter and/or pages, publisher, year | volume, series, address, edition, month, note, key | + | Part of a book with its own title | author, title, book title, publisher, year | editor, pages, organization, publisher, address, month, note, key | + | Article in a conference proceedings | author, title, book title, year | editor, pages, organization, publisher, address, month, note, key | + | Technical documentation | title | author, organization, address, edition, month, year, key | + | Master's thesis | author, title, school, year | address, month, note, key | + | Ph.D. thesis | author, title, school, year | address, month, note, key | + | Proceedings of a conference | title, year | editor, publisher, organization, address, month, note, key | + | Report published by a school or other institution, usually numbered within a series | author, title, institution, year | type, number, address, month, note, key | + | Document with an author and title, but not
formally published | author, title, note | month, year, key | + + + - **`chapter`** *[Optional ; Not repeatable ; String]*
+ A chapter (or section) number. This element is only used to document a resource which has been extracted from a book. + + - **`edition`** *[Optional ; Not repeatable ; String]*
+ The edition of a book - for example "Second". When a book has no edition number/name present, it can be assumed to be a first edition. If the edition is other than the first, information on the edition of the book being documented must be mentioned in the citation. The edition can be identified by a number, a label (such as “Revised edition” or “Abridged edition”), and/or a year. The first letter of the label should be capitalized. + + - **`institution`** *[Optional ; Not repeatable ; String]*
+ The sponsoring institution of a technical report. For citations of Master's and Ph.D. thesis, this will be the name of the school. + + - **`journal`** *[Optional ; Not repeatable ; String]*
+ A journal name. Abbreviations are provided for many journals. + + - **`volume`** *[Optional ; Not repeatable ; String]*
+ The volume of a journal or multi-volume book. Periodical publications, such as scholarly journals, are published on a regular basis in installments that are called issues. A volume usually consists of the issues published during one year. + + - **`number`** *[Optional ; Not repeatable ; String]*
+ The number of a journal, magazine, technical report, or of a work in a series. An issue of a journal or magazine is usually identified by its `volume` (see previous element) and `number`; the organization that issues a technical report usually gives it a number; and sometimes books are given numbers in a named series. + + - **`pages`** *[Optional ; Not repeatable ; String]*
+    One or more page numbers or ranges of numbers, such as 42-111 or 7,41,73-97 or 43+ (the '+' indicates pages following that do not form a simple range).
+
+  - **`series`** *[Optional ; Not repeatable ; String]*
+ The name of a series or set of books. When citing an entire book, the title field gives its title and an optional series field gives the name of a series or multi-volume set in which the book is published. + + - **`publisher`** *[Optional ; Not repeatable ; String]*
+ The entity responsible for making the resource available. For major publishing houses, the information can be omitted. For small publishers, providing the complete address is recommended. If the company is a university press, the abbreviation UP (for University Press) can be used. The publisher is not stated for journal articles, working papers, and similar types of documents. + + - **`publisher_address`** *[Optional ; Not repeatable ; String]*
+ The address of the publisher. For major publishing houses, just the city is given. For small publishers, the complete address can be provided. + + - **`annote`** *[Optional ; Not repeatable ; String]*
+ An annotation. This element will not be used by standard bibliography styles like the MLA, APA or Chicago, but may be used by others that produce an annotated bibliography. + + - **`booktitle`** *[Optional ; Not repeatable ; String]*
+ Title of a book, part of which is being cited. If you are documenting the book itself, this element will not be used; it is only used when part of a book is being documented. + + - **`crossref`** *[Optional ; Not repeatable ; String]*
+ The catalog identifier ("database key") of another catalog entry being cross referenced. This element may be used when multiple entries refer to a same publication, to avoid duplication. + + - **`howpublished`** *[Optional ; Not repeatable ; String]*
+ The `howpublished` element is used to store the notice for unusual publications. The first word should be capitalized. For example, "WebPage", or "Distributed at the local tourist office". + + - **`key`** *[Optional ; Not repeatable ; String]*
+    A key is a field used for alphabetizing, cross referencing, and creating a label when the `author` information is missing.
+
+  - **`organization`** *[Optional ; Not repeatable ; String]*
+ The organization that sponsors a conference or that publishes a manual. + + - **`url`** *[Optional ; Not repeatable ; String]*
+ The URL of the document, preferably a permanent URL. +
+ + This example makes use of the same *Econometrica* paper used in the previous example. + +
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      doi     = "https://doi.org/10.1111/1468-0262.00167",
+      journal = "Econometrica",
+      volume  = "68",
+      number  = "6",
+      pages   = "1435-1464",
+      url     = "https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00167",
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+
+-------
+
+ +- **`translators`** *[Optional ; Repeatable]*
+Information on translators, for publications that are translations of publication originally created in another language. + +
+```json +"translators": [ + { + "first_name": "string", + "initial": "string", + "last_name": "string", + "affiliation": "string" + } +] +``` +
+ + - **`first_name`** *[Optional ; Not repeatable ; String]*
+ The first name of the translator. + - **`initial`** *[Optional ; Not repeatable ; String]*
+ The initials of the translator. + - **`last_name`** *[Optional ; Not repeatable ; String]*
+ The last name of the translator. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the translator.
+ +
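+
+  A short illustrative example (the name and affiliation are fictitious):
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      translators = list(
+        list(first_name = "John", last_name = "Doe",
+             affiliation = "ABC Translation Services")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+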
+ +- **`contributors`** *[Optional ; Repeatable]*
+These elements are used to acknowledge contributions to the production of the document, other than those for which specific metadata elements are provided (like `authors` or `translators`).
+
+```json +"contributors": [ + { + "first_name": "string", + "initial": "string", + "last_name": "string", + "affiliation": "string", + "contribution": "string" + } +] +``` +
+ + - **`first_name`** *[Optional ; Not repeatable ; String]*
+ The first name of the contributor. + - **`initial`** *[Optional ; Not repeatable ; String]*
+ The initials of the contributor. + - **`last_name`** *[Optional ; Not repeatable ; String]*
+ The last name of the contributor. If the contributor is an organization, enter the name of the organization here. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the contributor. + - **`contribution`** *[Optional ; Not repeatable ; String]*
+ A brief description of the specific contribution of the person to the document, e.g. "Design of the cover page", or "Proofreading".
+ +
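+
+  A short illustrative example (the names and contributions are fictitious):
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      contributors = list(
+        list(first_name = "Jane", last_name = "Doe", affiliation = "ABC Design Studio",
+             contribution = "Design of the cover page"),
+        list(last_name = "XYZ Editing Services",
+             contribution = "Proofreading")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+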
+ +- **`contacts`** *[Optional ; Repeatable]*
+Contact information for a person or organization that can be contacted for inquiries related to the document. +
+```json +"contacts": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "email": "string", + "telephone": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+    The name of the contact. This can be a person or an organization.
+  - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the person or organization mentioned in `contact`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the contact person. + - **`email`** *[Optional ; Not repeatable ; String]*
+ The email address of the contact person or organization. Personal emails should be avoided.
+ - **`telephone`** *[Optional ; Not repeatable ; String]*
+ The telephone number for the contact person or organization. Personal phone numbers should be avoided.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to an on-line resource related to the contact person or organization.
+ +
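+
+  A short illustrative example (the organization, email, and URL are fictitious placeholders; as recommended above, an institutional rather than personal contact is used):
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      contacts = list(
+        list(name  = "Publications Office, National Statistics Agency",
+             role  = "Inquiries on content and reproduction",
+             email = "publications@example.org",
+             uri   = "https://www.example.org/publications")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+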
+ +- **`rights`** *[Optional ; Not repeatable ; String]*
+
+  A statement on the rights associated with the document (other than the copyright, which should be documented in the `copyright` element described below).
+
+  The example below is extracted from the World Bank World Development Report 2019.
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      rights = "Some rights reserved. Nothing herein shall constitute or be considered to be a limitation upon or waiver of the privileges and immunities of The World Bank, all of which are specifically reserved.",
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+
+ +- **`copyright`** *[Optional ; Not repeatable ; String]*
+ + A statement and identifier indicating the legal ownership and rights regarding use and re-use of all or part of the resource. If the document is protected by a copyright, enter the information on the person or organization who owns the rights. + +
+ +- **`usage_terms`** *[Optional ; Not repeatable ; String]*
+ + This element is used to provide a description of the legal terms or other conditions that a person or organization who wants to use or reproduce the document has to comply with. + +
+ +- **`disclaimer`** *[Optional ; Not repeatable ; String]*
+
+  A disclaimer limits the liability of the author(s) and/or publisher(s) of the document. A standard legal statement should be used for all documents from the same agency.
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      disclaimer = "This work is a product of the staff of The World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries."
+      # ...
+    ),
+    # ...
+  )
+  ```
+
+
+- **`security_classification`** *[Optional ; Not repeatable ; String]*
+
+  Information on the security classification attached to the document. The different levels of classification indicate the degree of sensitivity of the content of the document. This field should make use of a controlled vocabulary, specific to or adopted by the organization that curates or disseminates the document. Such a vocabulary could contain the following levels: `public, internal only, confidential, restricted, strictly confidential`.
+
+ + +- **`access_restrictions`** *[Optional ; Not repeatable ; String]*
+ + A textual description of access restrictions that apply to the document. +
+ + +- **`sources`** *[Optional ; Repeatable]*
+ +
+```json +"sources": [ + { + "source_origin": "string", + "source_char": "string", + "source_doc": "string" + } +] +``` +
+
+  This element is used to describe sources of various types (other than data sources, which must be listed in the next element, `data_sources`) that were used in the production of the document.
+  - **`source_origin`** *[Optional ; Not repeatable ; String]*
+ For historical materials, information about the origin(s) of the sources and the rules followed in establishing the sources should be specified. + - **`source_char`** *[Optional ; Not repeatable ; String]*
+ Characteristics of the source. Assessment of characteristics and quality of source material. + - **`source_doc`** *[Optional ; Not repeatable ; String]*
+ Documentation and access to the source.

+ +
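+
+  A short illustrative example (the source and its characteristics are fictitious), for a document partly based on historical archive material:
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      sources = list(
+        list(source_origin = "Scanned copies of the 1950-1960 annual reports held by the national archives",
+             source_char   = "Print quality is uneven; a few tables are partially illegible",
+             source_doc    = "Digitized copies can be consulted in the national archives reading room")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+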
+ +- **`data_sources`** *[Optional ; Repeatable]*
+ +
+```json +"data_sources": [ + { + "name": "string", + "uri": "string", + "note": "string" + } +] +``` +
+
+  Used to list the machine-readable data file(s), if any, that served as source(s) of the data presented in the document.
+  - **`name`** *[Required ; Not repeatable ; String]*
+ Name (title) of the dataset used as source. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ Link (URL) to the dataset or to a web page describing the dataset.
+ - **`note`** *[Optional ; Not repeatable ; String]*
+ Additional information on the data source.

+
+  The data source for the publication [Bangladesh Demographic and Health Survey (DHS), 2017-18 - Final Report](https://dhsprogram.com/publications/publication-FR208-DHS-Final-Reports.cfm) would be entered as follows:
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      data_sources = list(
+        list(name = "Bangladesh Demographic and Health Survey 2017-18",
+             uri  = "https://www.dhsprogram.com/methodology/survey/survey-display-536.cfm",
+             note = "Household survey conducted by the National Institute of Population Research and Training, Medical Education and Family Welfare Division and Ministry of Health and Family Welfare. Data and documentation available at https://dhsprogram.com/"
+        )
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+
+ +- **`keywords`** *[Optional ; Repeatable]*
+ +
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of keywords that provide information on the core content of the document. Keywords provide a convenient solution to improve the discoverability of the document, as it allows terms and phrases not found in the document itself to be indexed and to make a document discoverable by text-based search engines. A controlled vocabulary can be used (although not required), such as the [UNESCO Thesaurus](http://vocabularies.unesco.org/browser/thesaurus/en/). The list provided here can combine keywords from multiple controlled vocabularies and user-defined keywords. + + - **`name`** *[Required ; Not repeatable ; String]*
+ The keyword itself. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The controlled vocabulary (including version number or date) from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the controlled vocabulary from which the keyword is extracted, if any.
+    The URL of the controlled vocabulary from which the keyword is extracted, if any.
+
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      keywords = list(
+        list(name = "Migration", vocabulary = "Unesco Thesaurus (June 2021)",
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Migrants", vocabulary = "Unesco Thesaurus (June 2021)",
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Refugee", vocabulary = "Unesco Thesaurus (June 2021)",
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Conflict"),
+        list(name = "Asylum seeker"),
+        list(name = "Forced displacement"),
+        list(name = "Forcibly displaced"),
+        list(name = "Internally displaced population (IDP)"),
+        list(name = "Population of concern (PoC)"),
+        list(name = "Returnee"),
+        list(name = "UNHCR")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+
+ +- **`themes`** *[Optional ; Repeatable]*
+ +
+```json +"themes": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of themes covered by the document. A controlled vocabulary will preferably be used. The list provided here can combine themes from multiple controlled vocabularies and user-defined themes. Note that `themes` will rarely be used as the elements `topics` and `disciplines` are more appropriate for most uses. This is a block of five fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The ID of the theme, taken from a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the theme, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.
+ +
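+
+  A short illustrative example (the vocabulary name and theme labels are fictitious placeholders, combining an agency-specific vocabulary and a user-defined theme):
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      themes = list(
+        list(id = "3", name = "Labor markets",
+             vocabulary = "Agency thematic classification (v1.0)"),
+        list(name = "Informal economy")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+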
+ +- **`topics`** *[Optional ; Repeatable]*
+
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + Information on the topics covered in the document. A controlled vocabulary will preferably be used, for example the [CESSDA Topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification), a typology of topics available in 11 languages; or the [Journal of Economic Literature (JEL) Classification System](https://en.wikipedia.org/wiki/JEL_classification_codes), or the [World Bank topics classification](https://documents.worldbank.org/en/publication/documents-reports/docadvancesearch). The list provided here can combine topics from multiple controlled vocabularies and user-defined topics. The element is a block of five fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the topic, taken from a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the topic, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.

+ + We use the working paper "[Push and Pull - A Study of International Migration from Nepal](http://documents1.worldbank.org/curated/en/318581486560991532/pdf/WPS7965.pdf)" by Maheshwor Shrestha, World Bank Policy Research Working Paper 7965, February 2017, as an example.
+ + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + topics = list( + + list(name = "Demography.Migration", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(name = "Demography.Censuses", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(id = "F22", + name = "International Migration", + parent_id = "F2 - International Factor Movements and International Business", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "O15", + name = "Human Resources - Human Development - Income Distribution - Migration", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "O12", + name = "Microeconomic Analyses of Economic Development", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "J61", + name = "Geographic Labor Mobility - Immigrant Workers", + parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J") + + ), + + # ... + ), + ) + ``` +
+ + +- **`disciplines`** *[Optional ; Repeatable]*
+ +
+```json +"disciplines": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + Information on the academic disciplines related to the content of the document. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in [Wikipedia](https://en.wikipedia.org/wiki/List_of_academic_fields). The list provided here can combine disciplines from multiple controlled vocabularies and user-defined disciplines. This is a block of five elements: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the discipline, taken from a controlled vocabulary. + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the discipline, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the discipline (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.

+ + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + disciplines = list( + + list(name = "Economics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"), + + list(name = "Agricultural economics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"), + + list(name = "Econometrics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields") + + ), + + # ... + ), + # ... + ) + ``` + +
+ +- **`audience`** *[Optional ; Not repeatable ; String]*
+
+  Information on the intended audience for the document, i.e. the category or categories of users for whom the resource is intended in terms of their interest, skills, status, or other characteristics.
+
+ +- **`mandate`** *[Optional ; Not repeatable ; String]*
+ + The legislative or other mandate under which the resource was produced. + +
+ +- **`pricing`** *[Optional ; Not repeatable ; String]*
+ + The current price of the document in any defined currency. As this information is subject to regular change, it will often not be included in the document metadata. + +
+ +- **`relations`** *[Optional ; Repeatable]*
+References to related resources with a specification of the type of relationship. + +
+```json +"relations": [ + { + "name": "string", + "type": "isPartOf" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The related resource. Recommended practice is to identify the related resource by means of a URL. If this is not possible or feasible, a string conforming to a formal identification system may be provided. + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of relationship. The use of a controlled vocabulary is recommended. The Dublin Core proposes the following vocabulary: {`isPartOf, hasPart, isVersionOf, isFormatOf, hasFormat, references, isReferencedBy, isBasedOn, isBasisFor, replaces, isReplacedBy, requires, isRequiredBy`}.
+
+
+  | Type                     | Description                                                    |
+  | ------------------------ | -------------------------------------------------------------- |
+  | isPartOf                 | The described resource is a physical or logical part of the referenced resource. |
+  | hasPart                  | The described resource includes the referenced resource either physically or logically. |
+  | isVersionOf              | The described resource is a version, edition, or adaptation of the referenced resource. A change in version implies substantive changes in content rather than differences in format.|
+  | isFormatOf               | The described resource is the same intellectual content as the referenced resource, but presented in another format.|
+  | hasFormat                | The described resource pre-existed the referenced resource, which is essentially the same intellectual content presented in another format.|
+  | references               | The described resource references, cites, or otherwise points to the referenced resource. |
+  | isReferencedBy           | The described resource is referenced, cited, or otherwise pointed to by the referenced resource. |
+  | isBasedOn                | The described resource is derived from, or builds on, the referenced resource. |
+  | isBasisFor               | The described resource is the basis for the referenced resource. |
+  | replaces                 | The described resource supplants, displaces or supersedes the referenced resource.|
+  | isReplacedBy             | The described resource is supplanted, displaced or superseded by the referenced resource.|
+  | requires                 | The described resource requires the referenced resource to support its function, delivery, or coherence of content. |
+  | isRequiredBy             | The described resource is required by the referenced resource, either physically or logically. |
+
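+
+  A short illustrative example (the DOI and catalog URL are placeholders), for a working paper that is a pre-print version of a published journal article and that references a related dataset:
+
+  ```r
+  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      relations = list(
+        list(name = "https://doi.org/10.xxxx/xxxxx", type = "isVersionOf"),
+        list(name = "https://catalog.example.org/index.php/catalog/123", type = "references")
+      ),
+      
+      # ...
+    ),
+    # ...
+  )
+  ```
+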
+ +- **`reproducibility`** *[Optional ; Not repeatable]*
+ +
+```json +"reproducibility": { + "statement": "string", + "links": [ + { + "uri": "string", + "description": "string" + } + ] +} +``` +
+ + We present in chapter 12 a metadata schema intended to document reproducible research and scripts. That chapter lists multiple reasons to make research reproducible, replicable, and auditable. Ideally, when a research output (paper) is published, the data and code used in the underlying analysis should be made as openly available as possible. Increasingly, academic journals make it a requirement. The `reproducibility` element is used to provide interested users with information on reproducibility and replicability of the research output. + + - **`statement`** *[Optional ; Not repeatable ; String]*
+ A general statement on reproducibility and replicability of the analysis (including data processing, tabulation, production of visualizations, modeling, etc.) being presented in the document. + - **`links`** *[Optional ; Repeatable]*
+ Links to web pages where reproducible materials and the related information can be found. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The link to a web page. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the content of the web page. + + + + ```r + my_doc <- list( + # ... , + document_description = list( + # ... , + + reproducibility = list( + statement = "The scripts used to acquire data, assess and edit data files, train the econometric models, and to generate the tables and charts included in the publication, are openly accessible (Stata 15 scripts).", + links = list( + list(uri = "www.[...]", + description = "Description and access to reproducible Stata scripts"), + list(uri = "www.[...]", + description = "Derived data files") + ) + ), + # ... + ), + # ... + ) + ``` + + +### Provenance + +Metadata can be programmatically harvested from external catalogs. The **`provenance`** group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+ +
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ +- **`origin_description`** *[Required ; Not repeatable]*
+The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ + - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+    A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the document (element `idno` in the Document Description / Title Statement section) will be modified when published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+    The namespace (e.g., a URL) identifying the metadata standard or schema in which the harvested metadata are expressed in the source catalog.
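+
+The illustrative sketch below (the URL, identifier, and dates are placeholders) shows how a harvested record could document its provenance. Note that `provenance` sits outside the `document_description` block:
+
+```r
+my_doc <- list(
+  # ... ,
+  
+  provenance = list(
+    list(origin_description = list(
+      harvest_date       = "2022-04-01T10:30:00Z",
+      altered            = TRUE,
+      base_url           = "https://catalog.example.org/index.php/api",
+      identifier         = "DOC_2021_012",
+      date_stamp         = "2021-11-15",
+      metadata_namespace = ""
+    ))
+  ),
+  
+  # ...
+)
+```
+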
+ + +### Tags + +**`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. + +
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ + - **`tag`** *[Required ; Not repeatable ; String]*
+ A user-defined tag. + - **`tag_group`** *[Optional ; Not repeatable ; String]*
+ A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs. + + +### LDA topics + +**`lda_topics`** *[Optional ; Not repeatable]*
+ +
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+
+We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to the enrichment of metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
+
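+
+An LDA model can be trained and applied with standard tools. The sketch below, using the R packages `tm` and `topicmodels`, is purely illustrative: the corpus folder, the document file, the number of topics, and the pre-processing options are assumptions, and a production model (such as the 75-topic model mentioned below) would be trained on a much larger corpus and carefully tuned.
+
+```r
+library(tm)
+library(topicmodels)
+
+# Train a topic model on a (small, illustrative) corpus of plain-text documents
+corpus <- VCorpus(DirSource("corpus_txt/"))   # folder of .txt files (assumed)
+dtm    <- DocumentTermMatrix(corpus,
+                             control = list(removePunctuation = TRUE,
+                                            removeNumbers     = TRUE,
+                                            stopwords         = TRUE))
+lda_model <- LDA(dtm, k = 75, control = list(seed = 123))
+
+# Infer the topic composition of a new document, using the vocabulary of the trained model
+new_doc <- VCorpus(VectorSource(paste(readLines("my_document.txt", warn = FALSE),
+                                      collapse = " ")))
+new_dtm <- DocumentTermMatrix(new_doc, control = list(dictionary = Terms(dtm)))
+
+topic_shares <- posterior(lda_model, newdata = new_dtm)$topics   # shares sum to 1
+```
+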
+The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition.
+
+:::note
+Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.
+:::
+
+The image below provides an example of topics extracted from a document from the United Nations High Commissioner for Refugees (UNHCR), using an LDA topic model trained by the World Bank (this model was trained to identify 75 topics; no document will cover all topics).
+
+![](./images/LDA_refugee_education.JPG){width=100%}
+
+The `lda_topics` element includes the following metadata fields:
+ +- **`model_info`** *[Optional ; Not repeatable]*
+Information on the LDA model. + + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + +- **`topic_description`** *[Optional ; Repeatable]*
+The topic composition of the document. + + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the document (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic. This is specific to the model, not to a document.
+
+
+
+```r
+lda_topics = list(
+  
+   list(
+  
+      model_info = list(
+        list(source      = "World Bank, Development Data Group",
+             author      = "A.S.",
+             version     = "2021-06-22",
+             model_id    = "Mallet_WB_75",
+             nb_topics   = 75,
+             description = "LDA model, 75 topics, trained on Mallet",
+             corpus      = "World Bank Documents and Reports (1950-2021)",
+             uri         = "")
+      ),
+      
+      topic_description = list(
+      
+        list(topic_id    = "topic_27",
+             topic_score = 32,
+             topic_label = "Education",
+             topic_words = list(list(word = "school",    word_weight = ""),
+                                list(word = "teacher",   word_weight = ""),
+                                list(word = "student",   word_weight = ""),
+                                list(word = "education", word_weight = ""),
+                                list(word = "grade",     word_weight = ""))),
+        
+        list(topic_id    = "topic_8",
+             topic_score = 24,
+             topic_label = "Gender",
+             topic_words = list(list(word = "women",  word_weight = ""),
+                                list(word = "gender", word_weight = ""),
+                                list(word = "man",    word_weight = ""),
+                                list(word = "female", word_weight = ""),
+                                list(word = "male",   word_weight = ""))),
+        
+        list(topic_id    = "topic_39",
+             topic_score = 22,
+             topic_label = "Forced displacement",
+             topic_words = list(list(word = "refugee",   word_weight = ""),
+                                list(word = "programme", word_weight = ""),
+                                list(word = "country",   word_weight = ""),
+                                list(word = "migration", word_weight = ""),
+                                list(word = "migrant",   word_weight = ""))),
+        
+        list(topic_id    = "topic_40",
+             topic_score = 11,
+             topic_label = "Development policies",
+             topic_words = list(list(word = "development", word_weight = ""),
+                                list(word = "policy",      word_weight = ""),
+                                list(word = "national",    word_weight = ""),
+                                list(word = "strategy",    word_weight = ""),
+                                list(word = "activity",    word_weight = "")))
+      )
+      
+   )
+   
+)
+```
+ +The information provided by LDA models can be used to build a "filter by topic composition" tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic. +
+
+![](./images/filter_by_topic_share_1.JPG){width=85%} +
+ + +### Embeddings + +**`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.
+
+![](./images/embedding_related_docs.JPG){width=100%}
+
+The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
+
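+
+Outside of a dedicated vector database, stored vectors can also be compared directly. The sketch below is illustrative only: it assumes that `doc_vectors` is a matrix with one row per document (row names being the document identifiers) and that `query_vec` is the vector of a query document, all produced by the same embedding model.
+
+```r
+# Cosine similarity between two embedding vectors
+cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
+
+# Similarity of every document to the query document, and the 10 closest matches
+similarities <- apply(doc_vectors, 1, cosine_sim, b = query_vec)
+closest_docs <- head(sort(similarities, decreasing = TRUE), 10)
+```
+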
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": null + } +] +``` +
+ +The `embeddings` element contains four metadata fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+    The date the model was trained (or a version date for the model).
+  - **`vector`** *[Required ; Not repeatable ; Object]*
+    The numeric vector representing the document, provided as an object (array or string).

+    For example: `[1, 4, 3, 5, 7, 9]`
+
+
+### Additional fields
+
+**`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. + + +## Complete examples + +Generating metadata compliant with the **document schema** is easy. The three examples below illustrate how metadata can be generated and published in a NADA catalog, programmatically. In the first two examples, we assume that an electronic copy of a document is available, and that the metadata must be generated from scratch (not by re-purposing/mapping existing metadata). In the third example, we assume that a list of publications with some metadata is available as a CSV file; metadata compliant with the schema are created and published in a catalog using a single script. + +### Example 1: Working Paper + +#### Description + +This document is the World Bank Policy Working Paper No 9412, titled "[Predicting Food Crises](http://hdl.handle.net/10986/34510)" published in September 2020 under a CC-By 4.0 license. The list of authors is provided on the cover page; an abstract, a list of acknowledgments, and a list of keywords are also provided. + +
+![](./images/document_example_01_cover.JPG){width=75%} + +![](./images/document_example_01_authors_keywords.JPG){width=65%} + +![](./images/document_example_01_abstract.JPG){width=75%} +
+
+
+#### Using a metadata editor
+
+The same metadata can also be created interactively with the open source World Bank Metadata Editor, as shown in the screenshot below.
+
+
+![image](https://user-images.githubusercontent.com/35276300/229924530-40bd8e92-961b-405e-85a9-321d9a045921.png) +
+
+
+
+#### Using R
+
+
+```r
+library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Read the API key from a file kept outside the script, then set the catalog API URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_folder")
+doc_file <- "WB_PRWP_9412_Food_Crises.pdf"
+
+id <- "WB_WPS9412"
+
+thumb_file <- gsub(".pdf", ".jpg", doc_file)
+capture_pdf_cover(doc_file) # Capture cover page for use as thumbnail
+
+example_1 <- list(
+
+  document_description = list(
+
+    title_statement = list(idno = id, title = "Predicting Food Crises"),
+
+    date_published = "2020-09",
+
+    authors = list(
+      list(last_name = "Andrée", first_name = "Bo Pieter Johannes",
+           affiliation = "World Bank",
+           author_id = list(list(type = "ORCID", id = "0000-0002-8007-5007"))),
+      list(last_name = "Chamorro", first_name = "Andres",
+           affiliation = "World Bank"),
+      list(last_name = "Kraay", first_name = "Aart",
+           affiliation = "World Bank"),
+      list(last_name = "Spencer", first_name = "Phoebe",
+           affiliation = "World Bank"),
+      list(last_name = "Wang", first_name = "Dieter",
+           affiliation = "World Bank",
+           author_id = list(list(type = "ORCID", id = "0000-0003-1287-332X")))
+    ),
+
+    journal = "World Bank Policy Research Working Paper",
+    number = "9412",
+    publisher = "World Bank",
+
+    ref_country = list(
+      list(name="Afghanistan", code="AFG"),
+      list(name="Burkina Faso", code="BFA"),
+      list(name="Chad", code="TCD"),
+      list(name="Congo, Dem. Rep.", code="COD"),
+      list(name="Ethiopia", code="ETH"),
+      list(name="Guatemala", code="GTM"),
+      list(name="Haiti", code="HTI"),
+      list(name="Kenya", code="KEN"),
+      list(name="Malawi", code="MWI"),
+      list(name="Mali", code="MLI"),
+      list(name="Mauritania", code="MRT"),
+      list(name="Mozambique", code="MOZ"),
+      list(name="Niger", code="NER"),
+      list(name="Nigeria", code="NGA"),
+      list(name="Somalia", code="SOM"),
+      list(name="South Sudan", code="SSD"),
+      list(name="Sudan", code="SDN"),
+      list(name="Uganda", code="UGA"),
+      list(name="Yemen, Rep.", code="YEM"),
+      list(name="Zambia", code="ZMB"),
+      list(name="Zimbabwe", code="ZWE")
+    ),
+
+    abstract = "Globally, more than 130 million people are estimated to be in food crisis. These humanitarian disasters are associated with severe impacts on livelihoods that can reverse years of development gains. The existing outlooks of crisis-affected populations rely on expert assessment of evidence and are limited in their temporal frequency and ability to look beyond several months. This paper presents a statistical forecasting approach to predict the outbreak of food crises with sufficient lead time for preventive action. Different use cases are explored related to possible alternative targeting policies and the levels at which finance is typically unlocked. The results indicate that, particularly at longer forecasting horizons, the statistical predictions compare favorably to expert-based outlooks.
The paper concludes that statistical models demonstrate good ability to detect future outbreaks of food crises and that using statistical forecasting approaches may help increase lead time for action.", + + languages = list(list(name="English", code="EN")), + + reproducibility = list( + statement = "The code and data needed to reproduce the analysis are openly available.", + links = list( + list(uri="http://fcv.ihsn.org/catalog/study/RR_WLD_2020_PFC_v01", + description= "Source code"), + list(uri="http://fcv.ihsn.org/catalog/study/WLD_2020_PFC_v01_M", + description= "Dataset") + ) + ) + + ) + +) + +# Publish the metadata in NADA +document_add(idno = id, + metadata = example_1, + repositoryid = "central", + published = 1, + thumbnail = thumb_file, + overwrite = "yes") + +# Provide a link to the document (as an external resource) +external_resources_add( + title = "Predicting Food Crises", + idno = id, + dctype = "doc/anl", + file_path = "http://hdl.handle.net/10986/34510", + overwrite = "yes" +) +``` + +The document will now be available in the NADA catalog. + +
+![](./images/document_example_01_nada.JPG) +
+
+#### Using Python
+
+The Python equivalent of the R script presented above is as follows.
+
+
+```python
+# @@@ Script not tested yet
+
+import pynada as nada
+import inspect
+
+dataset_id = "WB_WPS9412"
+
+repository_id = "central"
+published = 0
+overwrite = "yes"
+
+document_description = {
+
+    'title_statement': {
+        'idno': dataset_id,
+        'title': "Predicting Food Crises"
+    },
+
+    'date_published': "2020-09",
+
+    'authors': [
+        {
+            'last_name': "Andrée",
+            'first_name': "Bo Pieter Johannes",
+            'affiliation': "World Bank"
+        },
+        {
+            'last_name': "Chamorro",
+            'first_name': "Andres",
+            'affiliation': "World Bank"
+        },
+        {
+            'last_name': "Kraay",
+            'first_name': "Aart",
+            'affiliation': "World Bank"
+        },
+        {
+            'last_name': "Spencer",
+            'first_name': "Phoebe",
+            'affiliation': "World Bank"
+        },
+        {
+            'last_name': "Wang",
+            'first_name': "Dieter",
+            'affiliation': "World Bank"
+        }
+    ],
+
+    'journal': "World Bank Policy Research Working Paper No. 9412",
+
+    'publisher': "World Bank",
+
+    'ref_country': [
+        {'name': "Afghanistan", 'code': "AFG"},
+        {'name': "Burkina Faso", 'code': "BFA"},
+        {'name': "Chad", 'code': "TCD"},
+        {'name': "Congo, Dem. Rep.", 'code': "COD"},
+        {'name': "Ethiopia", 'code': "ETH"},
+        {'name': "Guatemala", 'code': "GTM"},
+        {'name': "Haiti", 'code': "HTI"},
+        {'name': "Kenya", 'code': "KEN"},
+        {'name': "Malawi", 'code': "MWI"},
+        {'name': "Mali", 'code': "MLI"},
+        {'name': "Mauritania", 'code': "MRT"},
+        {'name': "Mozambique", 'code': "MOZ"},
+        {'name': "Niger", 'code': "NER"},
+        {'name': "Nigeria", 'code': "NGA"},
+        {'name': "Somalia", 'code': "SOM"},
+        {'name': "South Sudan", 'code': "SSD"},
+        {'name': "Sudan", 'code': "SDN"},
+        {'name': "Uganda", 'code': "UGA"},
+        {'name': "Yemen, Rep.", 'code': "YEM"},
+        {'name': "Zambia", 'code': "ZMB"},
+        {'name': "Zimbabwe", 'code': "ZWE"}
+    ],
+
+    'abstract': inspect.cleandoc("""\
+
+Globally, more than 130 million people are estimated to be in food crisis. These humanitarian disasters are associated with severe impacts on livelihoods that can reverse years of development gains.
+The existing outlooks of crisis-affected populations rely on expert assessment of evidence and are limited in their temporal frequency and ability to look beyond several months.
+This paper presents a statistical forecasting approach to predict the outbreak of food crises with sufficient lead time for preventive action.
+Different use cases are explored related to possible alternative targeting policies and the levels at which finance is typically unlocked.
+The results indicate that, particularly at longer forecasting horizons, the statistical predictions compare favorably to expert-based outlooks.
+The paper concludes that statistical models demonstrate good ability to detect future outbreaks of food crises and that using statistical forecasting approaches may help increase lead time for action.
+
+    """),
+
+    'languages': [
+        {'name': "English", 'code': "EN"}
+    ],
+
+    'reproducibility': {
+        'statement': "The code and data needed to reproduce the analysis are openly available.",
+        'links': [
+            {
+                'uri': "http://fcv.ihsn.org/catalog/study/RR_WLD_2020_PFC_v01",
+                'description': "Source code"
+            },
+            {
+                'uri': "http://fcv.ihsn.org/catalog/study/WLD_2020_PFC_v01_M",
+                'description': "Dataset"
+            }
+        ]
+    }
+
+}
+
+resources = []   # No additional external resources in this example
+files = [
+    {'file_uri': "http://hdl.handle.net/10986/34510"},
+]
+
+
+nada.create_document_dataset(
+    dataset_id = dataset_id,
+    repository_id = repository_id,
+    published = published,
+    overwrite = overwrite,
+    document_description = document_description,
+    resources = resources,
+    files = files
+)
+
+# If you have a PDF file, generate a thumbnail from it.
+pdf_file = "WB_PRWP_9412_Food_Crises.pdf"
+thumbnail_path = nada.pdf_to_thumbnail(pdf_file, page_no=1)
+nada.upload_thumbnail(dataset_id, thumbnail_path)
+```
+
+
+### Example 2: Book
+
+This example documents the World Bank World Development Report (WDR) 2019, titled "The Changing Nature of Work". The book is available in multiple languages. It also has related resources, such as presentations and an *Overview* available in multiple languages, which we also document.
+
+#### Description
+
+![](./images/document_example_02_cover.JPG){width=60%} +![](./images/document_example_02_rights.JPG){width=60%} +![](./images/document_example_02_toc.JPG){width=60%} +
+
+#### Using R
+
+
+```r
+library(nadar)
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_folder")
+doc_file <- "2019-WDR-Report.pdf"
+
+id      <- "WB_WDR2019"
+meta_id <- "WBDG_WB_WDR2019"
+
+thumb_file <- gsub(".pdf", ".jpg", doc_file)
+capture_pdf_cover(doc_file)  # Capture cover page for use as thumbnail
+
+# Generate the metadata
+
+example_2 = list(
+
+  metadata_information = list(
+    title = "The Changing Nature of Work",
+    idno = meta_id,
+    producers = list(
+      list(name = "Development Data Group, Curation Team",
+           abbr = "DECDG",
+           affiliation = "World Bank")
+    ),
+    production_date = "2020-12-27"
+  ),
+
+  document_description = list(
+
+    title_statement = list(
+      idno = id,
+      title = "The Changing Nature of Work",
+      sub_title = "World Development Report 2019",
+      abbreviated_title = "WDR 2019"
+    ),
+
+    authors = list(
+      list(first_name = "Rong",      last_name = "Chen",      affiliation = "World Bank"),
+      list(first_name = "Davida",    last_name = "Connon",    affiliation = "World Bank"),
+      list(first_name = "Ana P.",    last_name = "Cusolito",  affiliation = "World Bank"),
+      list(first_name = "Ugo",       last_name = "Gentilini", affiliation = "World Bank"),
+      list(first_name = "Asif",      last_name = "Islam",     affiliation = "World Bank"),
+      list(first_name = "Shwetlena", last_name = "Sabarwal",  affiliation = "World Bank"),
+      list(first_name = "Indhira",   last_name = "Santos",    affiliation = "World Bank"),
+      list(first_name = "Yucheng",   last_name = "Zheng",     affiliation = "World Bank")
+    ),
+
+    date_created = "2019",
+    date_published = "2019",
+
+    identifiers = list(
+      list(type = "ISSN", value = "0163-5085"),
+      list(type = "ISBN softcover", value = "978-1-4648-1328-3"),
+      list(type = "ISBN hardcover", value = "978-1-4648-1342-9"),
+      list(type = "e-ISBN", value = "978-1-4648-1356-6"),
+      list(type = "DOI softcover", value = "10.1596/978-1-4648-1328-3"),
+      list(type = "DOI hardcover", value = "10.1596/978-1-4648-1342-9")
+    ),
+
+    type = "book",
+
+    description = "The World Development Report (WDR) 2019: The Changing Nature of Work studies how the nature of work is changing as a result of advances in technology today. Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Work is constantly reshaped by technological progress. Firms adopt new ways of production, markets expand, and societies evolve. Overall, technology brings opportunity, paving the way to create new jobs, increase productivity, and deliver effective public services. Firms can grow rapidly thanks to digital transformation, expanding their boundaries and reshaping traditional production patterns. The rise of the digital platform firm means that technological effects reach more people faster than ever before. Technology is changing the skills that employers seek. Workers need to be better at complex problem-solving, teamwork and adaptability. Digital technology is also changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers.
The Report analyzes these changes and considers how governments can best respond. Investing in human capital must be a priority for governments in order for workers to build the skills in demand in the labor market. In addition, governments need to enhance social protection and extend it to all people in society, irrespective of the terms on which they work. To fund these investments in human capital and social protection, the Report offers some suggestions as to how governments can mobilize additional revenues by increasing the tax base.", + + toc_structured = list( + list(id = "00", name = "Overview"), + list(id = "01", parent_id = "00", name = "Changes in the nature of work"), + list(id = "02", parent_id = "00", name = "What can governments do?"), + list(id = "03", parent_id = "00", name = "Organization of this study"), + list(id = "10", name = "1. The changing nature of work"), + list(id = "11", parent_id = "10", name = "Technology generates jobs"), + list(id = "12", parent_id = "10", name = "How work is changing"), + list(id = "13", parent_id = "10", name = "A simple model of changing work"), + list(id = "20", name = "2. The changing nature of firms"), + list(id = "21", parent_id = "20", name = "Superstar firms"), + list(id = "22", parent_id = "20", name = "Competitive markets"), + list(id = "23", parent_id = "20", name = "Tax avoidance"), + list(id = "30", name = "3. Building human capital"), + list(id = "31", parent_id = "30", name = "Why governments should get involved"), + list(id = "32", parent_id = "30", name = "Why measurement helps"), + list(id = "33", parent_id = "30", name = "The human capital project"), + list(id = "40", name = "4. Lifelong learning"), + list(id = "41", parent_id = "40", name = "Learning in early childhood"), + list(id = "42", parent_id = "40", name = "Tertiary education"), + list(id = "43", parent_id = "40", name = "Adult learning outside the workplace"), + list(id = "50", name = "5. Returns to work"), + list(id = "51", parent_id = "50", name = "Informality"), + list(id = "52", parent_id = "50", name = "Working women"), + list(id = "53", parent_id = "50", name = "Working in agriculture"), + list(id = "60", name = "6. Strengthening social protection"), + list(id = "61", parent_id = "60", name = "Social assistance"), + list(id = "62", parent_id = "60", name = "Social insurance"), + list(id = "63", parent_id = "60", name = "Labor regulation"), + list(id = "70", name = "7. Ideas for social inclusion"), + list(id = "71", parent_id = "70", name = "A global 'New Deal'"), + list(id = "72", parent_id = "70", name = "Creating a new social contract"), + list(id = "73", parent_id = "70", name = "Financing social inclusion") + ), + + abstract = "Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Instead, technology is bringing opportunity, paving the way to create new jobs, increase productivity, and improve public service delivery. The nature of work is changing. +Firms can grow rapidly thanks to digital transformation, which blurs their boundaries and challenges traditional production patterns. +The rise of the digital platform firm means that technological effects reach more people faster than ever before. +Technology is changing the skills that employers seek. Workers need to be good at complex problem-solving, teamwork and adaptability. +Technology is changing how people work and the terms on which they work. 
Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. +What can governments do? The 2019 WDR suggests three solutions: +1 - Invest in human capital especially in disadvantaged groups and early childhood education to develop the new skills that are increasingly in demand in the labor market, such as high-order cognitive and sociobehavioral skills +2 - Enhance social protection to ensure universal coverage and protection that does not fully depend on having formal wage employment +3 - Increase revenue mobilization by upgrading taxation systems, where needed, to provide fiscal space to finance human capital development and social protection.", + + ref_country = list( + list(name = "World", code = "WLD") + ), + + spatial_coverage = "Global", + + publication_frequency = "Annual", + + languages = list( + list(name = "English", code = "EN"), + list(name = "Chinese", code = "ZH"), + list(name = "Arabic", code = "AR"), + list(name = "French", code = "FR"), + list(name = "Spanish", code = "ES"), + list(name = "Italian", code = "IT"), + list(name = "Bulgarian", code = "BG"), + list(name = "Russian", code = "RU"), + list(name = "Serbian", code = "SR") + ), + + license = list( + list(name = "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)", + uri = "http://creativecommons.org/licenses/by/3.0/igo") + ), + + bibliographic_citation = list( + list(citation = " World Bank. 2019. World Development Report 2019: The Changing Nature of Work. Washington, DC: World Bank. doi:10.1596/978-1-4648-1328-3. License: Creative Commons Attribution CC BY 3.0 IGO") + ), + + series = "World Development Report", + + contributors = list( + list(first_name = "Simeon", last_name = "Djankov", + affiliation = "World Bank", role = "WDR Director"), + list(first_name = "Federica", last_name = "Saliola", + affiliation = "World Bank", role = "WDR Director"), + list(first_name = "David", last_name = "Sharrock", + affiliation = "World Bank", role = "Communications"), + list(first_name = "Consuelo Jurado", last_name = "Tan", + affiliation = "World Bank", role = "Program Assistant") + ), + + publisher = "World Bank Publications", + publisher_address = "The World Bank Group, 1818 H Street NW, Washington, DC 20433, USA", + + contacts = list( + list(name = "World Bank Publications", email = "pubrights@worldbank.org") + ), + + topics = list( + list(name = "Labour And Employment - Employee Training", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(name = "Labour And Employment - Labour And Employment Policy", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(name = "Labour And Employment - Working Conditions", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(name = "Social Stratification And Groupings - Social And Occupational Mobility", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification") + ), + + disciplines = list( + list(name = "Economics") + ) + + ) + +) + +# Publish the metadata in NADA + +document_add(idno = id, + metadata = example_2, + repositoryid = "central", + published = 1, + thumbnail = thumb_file, + overwrite = "yes") + +# Provide links to the document and related resources + +external_resources_add( + title = "The 
Changing Nature of Work", + description = "Links to the PDF report in all available languages", + idno = id, + dctype = "doc/anl", + language = "English, Chinese, Arabic, French, Spanish, Italian, Bulgarian, Russian, Serbian", + file_path = "https://www.worldbank.org/en/publication/wdr2019", + overwrite = "yes" +) + +external_resources_add( + title = "WORLD DEVELOPMENT REPORT 2019 - THE CHANGING NATURE OF WORK - Presentation (slide deck), English", + idno = id, + dctype = "doc/oth", + language = "English", + file_path = "http://pubdocs.worldbank.org/en/808261547222082195/WDR19-English-Presentation.pdf", + overwrite = "yes" +) + +external_resources_add( + title = "INFORME SOBRE EL DESARROLLO MUNDIAL 2019 - LA NATURALEZA CAMBIANTE DEL TRABAJO - Presentation (slide deck), Spanish", + idno = id, + dctype = "doc/oth", + language = "Spanish", + file_path = "http://pubdocs.worldbank.org/en/942911547222108647/WDR19-Spanish-Presentation.pdf", + overwrite = "yes" +) + +external_resources_add( + title = "RAPPORT SUR LE DÉVELOPPEMENT DANS LE MONDE 2019 - LE TRAVAIL EN MUTATION - Presentation (slide deck), French", + idno = id, + dctype = "doc/oth", + language = "French", + file_path = "http://pubdocs.worldbank.org/en/132831547222088914/WDR19-French-Presentation.pdf", + overwrite = "yes" +) + +external_resources_add( + title = "RAPPORTO SULLO SVILUPPO MONDIALE 2019 - CAMBIAMENTI NEL MONDO DEL LAVORO - Presentation (slide deck), Italian", + idno = id, + dctype = "doc/oth", + language = "Italian", + file_path = "http://pubdocs.worldbank.org/en/842271547222095493/WDR19-Italian-Presentation.pdf", + overwrite = "yes" +) + +external_resources_add( + title = "ДОКЛАД О МИРОВОМ РАЗВИТИИ 2019 - ИЗМЕНЕНИЕ ХАРАКТЕРА ТРУДА - Presentation (slide deck), Russian", + idno = id, + dctype = "doc/oth", + language = "Russian", + file_path = "http://pubdocs.worldbank.org/en/679061547222101914/WDR19-Russian-Presentation.pdf", + overwrite = "yes" +) + +external_resources_add( + title = "Jobs of the future require more investment in people - Press Release (October 11, 2018)", + idno = id, + dctype = "doc/oth", + dcdate = "2018-10-11", + language = "Russian", + file_path = "https://www.worldbank.org/en/news/press-release/2018/10/11/jobs-of-the-future-require-more-investment-in-people", + overwrite = "yes" +) +``` + +The document is now available in the NADA catalog. +
+![](./images/document_example_02_nada.JPG) +
+ + +#### Using Python + +The Python equivalent of the R script presented above is as follows. + + +```python +# @@@ Script not tested yet - must be edited to match the R script + +import pynada as nada +import inspect + +dataset_id = "DOC_001" + +repository_id = "central" + +published = 0 + +overwrite = "yes" + +metadata_information = { + 'title': "The Changing Nature of Work", + 'idno': "META_DOC_001", + 'producers': [ + { + 'name': "Development Data Group, Curation Team", + 'abbr': "DECDG", + 'affiliation': "World Bank" + } + ], + 'production_date': "2020-12-27" +} + +document_description = { + 'title_statement': { + 'idno': dataset_id, + 'title': "The Changing Nature of Work", + 'sub-title': "World Development Report 2019", + 'abbreviated_title': "WDR2019" + }, + + 'type': "book", + + 'description': inspect.cleandoc("""\ + +The World Development Report (WDR) 2019: The Changing Nature of Work studies how the nature of work is changing as a result of advances in technology today. Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Work is constantly reshaped by technological progress. Firms adopt new ways of production, markets expand, and societies evolve. Overall, technology brings opportunity, paving the way to create new jobs, increase productivity, and deliver effective public services. Firms can grow rapidly thanks to digital transformation, expanding their boundaries and reshaping traditional production patterns. The rise of the digital platform firm means that technological effects reach more people faster than ever before. Technology is changing the skills that employers seek. Workers need to be better at complex problem-solving, teamwork and adaptability. Digital technology is also changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. The Report analyzes these changes and considers how governments can best respond. Investing in human capital must be a priority for governments in order for workers to build the skills in demand in the labor market. In addition, governments need to enhance social protection and extend it to all people in society, irrespective of the terms on which they work. To fund these investments in human capital and social protection, the Report offers some suggestions as to how governments can mobilize additional revenues by increasing the tax base. + + """), + + 'toc_structured': [ + {'id': "00", 'name': "Overview"}, + {'id': "01", 'parent_id': "00", 'name': "Changes in the nature of work"}, + {'id': "02", 'parent_id': "00", 'name': "What can governments do?"}, + {'id': "03", 'parent_id': "00", 'name': "Organization of this study"}, + {'id': "10", 'name': "1. The changing nature of work"}, + {'id': "11", 'parent_id': "10", 'name': "Technology generates jobs"}, + {'id': "12", 'parent_id': "10", 'name': "How work is changing"}, + {'id': "13", 'parent_id': "10", 'name': "A simple model of changing work"}, + {'id': "20", 'name': "2. The changing nature of firms"}, + {'id': "21", 'parent_id': "20", 'name': "Superstar firms"}, + {'id': "22", 'parent_id': "20", 'name': "Competitive markets"}, + {'id': "23", 'parent_id': "20", 'name': "Tax avoidance"}, + {'id': "30", 'name': "3. 
Building human capital"}, + {'id': "31", 'parent_id': "30", 'name': "Why governments should get involved"}, + {'id': "32", 'parent_id': "30", 'name': "Why measurement helps"}, + {'id': "33", 'parent_id': "30", 'name': "The human capital project"}, + {'id': "40", 'name': "4. Lifelong learning"}, + {'id': "41", 'parent_id': "40", 'name': "Learning in early childhood"}, + {'id': "42", 'parent_id': "40", 'name': "Tertiary education"}, + {'id': "43", 'parent_id': "40", 'name': "Adult learning outside the workplace"}, + {'id': "50", 'name': "5. Returns to work"}, + {'id': "51", 'parent_id': "50", 'name': "Informality"}, + {'id': "52", 'parent_id': "50", 'name': "Working women"}, + {'id': "53", 'parent_id': "50", 'name': "Working in agriculture"}, + {'id': "60", 'name': "6. Strengthening social protection"}, + {'id': "61", 'parent_id': "60", 'name': "Social assistance"}, + {'id': "62", 'parent_id': "60", 'name': "Social insurance"}, + {'id': "63", 'parent_id': "60", 'name': "Labor regulation"}, + {'id': "70", 'name': "7. Ideas for social inclusion"}, + {'id': "71", 'parent_id': "70", 'name': "A global 'New Deal'"}, + {'id': "72", 'parent_id': "70", 'name': "Creating a new social contract"}, + {'id': "73", 'parent_id': "70", 'name': "Financing social inclusion"} + ], + + 'abstract': inspect.cleandoc("""\ + +Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Instead, technology is bringing opportunity, paving the way to create new jobs, increase productivity, and improve public service delivery. +The nature of work is changing. +Firms can grow rapidly thanks to digital transformation, which blurs their boundaries and challenges traditional production patterns. +The rise of the digital platform firm means that technological effects reach more people faster than ever before. +Technology is changing the skills that employers seek. Workers need to be good at complex problem-solving, teamwork and adaptability. +Technology is changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. +What can governments do? +The 2019 WDR suggests three solutions: +1 - Invest in human capital especially in disadvantaged groups and early childhood education to develop the new skills that are increasingly in demand in the labor market, such as high-order cognitive and sociobehavioral skills +2 - Enhance social protection to ensure universal coverage and protection that does not fully depend on having formal wage employment +3 - Increase revenue mobilization by upgrading taxation systems, where needed, to provide fiscal space to finance human capital development and social protection. 
+ + """), + + 'ref_country': [ + {'name': "World", 'code': "WLD"} + ], + + 'spatial_coverage': "Global", + + 'date_created': "2019", + + 'date_published': "2019", + + 'identifiers': [ + {'type': "ISSN", 'value': "0163-5085"}, + {'type': "ISBN softcover", 'value': "978-1-4648-1328-3"}, + {'type': "ISBN hardcover", 'value': "978-1-4648-1342-9"}, + {'type': "e-ISBN", 'value': "978-1-4648-1356-6"}, + {'type': "DOI softcover", 'value': "10.1596/978-1-4648-1328-3"}, + {'type': "DOI hardcover", 'value': "10.1596/978-1-4648-1342-9"} + ], + + 'publication_frequency': "Annual", + + 'languages': [ + {'name': "English", 'code': "EN"}, + {'name': "Chinese", 'code': "ZH"}, + {'name': "Arabic", 'code': "AR"}, + {'name': "French", 'code': "FR"}, + {'name': "Spanish", 'code': "ES"}, + {'name': "Italian", 'code': "IT"}, + {'name': "Bulgarian", 'code': "BG"}, + {'name': "Russian", 'code': "RU"}, + {'name': "Serbian", 'code': "SR"} + ], + + 'license': [ + { + 'name': "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)", + 'uri': "http://creativecommons.org/licenses/by/3.0/igo" + } + ], + + 'authors': [ + {'first_name': "Rong", 'last_name': "Chen", 'affiliation': "World Bank"}, + {'first_name': "Davida", 'last_name': "Connon", 'affiliation': "World Bank"}, + {'first_name': "Ana P.", 'last_name': "Cusolito", 'affiliation': "World Bank"}, + {'first_name': "Ugo", 'last_name': "Gentilini", 'affiliation': "World Bank"}, + {'first_name': "Asif", 'last_name': "Islam", 'affiliation': "World Bank"}, + {'first_name': "Shwetlena", 'last_name': "Sabarwal", 'affiliation': "World Bank"}, + {'first_name': "Indhira", 'last_name': "Santos", 'affiliation': "World Bank"}, + {'first_name': "Yucheng", 'last_name': "Zheng", 'affiliation': "World Bank"} + ], + + 'contributors': [ + {'first_name': "Simeon", 'last_name': "Djankov", 'affiliation': "World Bank", 'role': "WDR Director"}, + {'first_name': "Federica", 'last_name': "Saliola", 'affiliation': "World Bank", 'role': "WDR Director"}, + {'first_name': "David", 'last_name': "Sharrock", 'affiliation': "World Bank", 'role': "Communications"}, + {'first_name': "Consuelo Jurado", 'last_name': "Tan", 'affiliation': "World Bank", 'role': "Program Assistant"} + ], + + 'topics': [ + { + 'name': "LabourAndEmployment.EmployeeTraining", + 'vocabulary': "CESSDA Topic Classification", + 'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification" + }, + { + 'name': "LabourAndEmployment.LabourAndEmploymentPolicy", + 'vocabulary': "CESSDA Topic Classification", + 'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification" + }, + { + 'name': "LabourAndEmployment.WorkingConditions", + 'vocabulary': "CESSDA Topic Classification", + 'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification" + }, + { + 'name': "SocialStratificationAndGroupings.SocialAndOccupationalMobility", + 'vocabulary': "CESSDA Topic Classification", + 'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification" + } + ], + + 'disciplines': [ + {'name': "Economics"} + ] +} +``` + + +### Example 3: Importing from a list of documents + +In this example we take a different use case. We assume that a list of publications is available as a CSV file. 
Each row in this file describes one publication, with the following columns containing the metadata (with no missing information for the required elements):
+
+ - **URL_pdf** (required): a link to the publication (a direct link to a PDF file)
+ - **ID** (required): a unique identifier for each document, with no missing values
+ - **title** (required): the title of the document
+ - **country** (optional): the country (or countries) that the document is about, separated by a ";"
+ - **authors** (optional): the authors, separated by a ";", with the last name and first name separated by a "," (last name always provided before first name)
+ - **abstract** (optional): the abstract of the document
+ - **type** (optional): the type of document
+ - **date_published** (optional): the date the document was published; optional but highly recommended
+
+The R (or Python) script reads the CSV file (a hypothetical example of such a file is sketched below). The listed documents are downloaded (if not previously done), and the cover page of each document is captured and saved as a JPG file to be used as a thumbnail in the catalog. Metadata are formatted to comply with the document schema, then published. The documents themselves are not uploaded to the catalog, but links to the originating catalog are provided. There is no limit to the number of documents that can be included in such a batch process. If a repository of documents is available with metadata in a structured format (in a CSV file as in the example, from an API, or from another source), the migration of the documents to a NADA catalog can be fully automated using a script similar to the one shown in the example. Note that such a script could also include some metadata augmentation processes (e.g., submitting each document to a topic model to extract and store the topic composition of the document).
+
+ ![](./images/ReDoc_documents_22.JPG){width=100%} +
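+
+The sketch below shows what a minimal, compliant input file could look like; the values are purely hypothetical, and the column names match the list above (multiple authors or countries are separated by a ";").
+
+```r
+# Hypothetical example: create a one-row input file with the expected columns
+example_list <- data.frame(
+  URL_pdf        = "http://example.org/documents/paper_001.pdf",
+  ID             = "DOC_001",
+  title          = "An Example Working Paper",
+  country        = "Kenya;Uganda",
+  authors        = "Doe, Jane;Smith, John",
+  abstract       = "A short abstract of the document.",
+  type           = "working paper",
+  date_published = "2020-06",
+  stringsAsFactors = FALSE
+)
+write.csv(example_list, "my_list_of_documents.csv", row.names = FALSE)
+```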
+
+#### Using R
+
+
+```r
+library(nadar)
+library(stringr)
+library(rlist)
+library(countrycode)  # Will be used to automatically add ISO country codes
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_folder")
+
+# Read the CSV file containing the information (metadata) on the 5 documents
+doc_list <- read.csv("my_list_of_documents.csv", stringsAsFactors = FALSE)
+
+# Generate the metadata for each document in the list, and publish in NADA
+
+for(i in 1:nrow(doc_list)) {
+
+  # Download the file if not already done
+  url <- doc_list$URL_pdf[i]
+  pdf_file <- basename(doc_list$URL_pdf[i])
+  if(!file.exists(pdf_file)) download.file(url, pdf_file, mode = "wb")
+
+  # Map the available metadata elements to the schema
+  id       <- doc_list$ID[i]
+  title    <- doc_list$title[i]
+  date     <- as.character(doc_list$date_published[i])
+  abstract <- doc_list$abstract[i]
+  type     <- doc_list$type[i]
+
+  # Split the authors' list and generate a list compliant with the schema
+  list_authors <- doc_list$authors[i]
+  list_authors <- str_split(list_authors, ";")
+  authors = list()
+  for(n in 1:length(list_authors[[1]])) {
+    author = trimws(list_authors[[1]][n])
+    if(grepl(",", author)) {  # If we have a last name and a first name
+      last_first = str_split(author, ",")
+      a_l = list(last_name  = trimws(last_first[[1]][1]),
+                 first_name = trimws(last_first[[1]][2]))
+    } else {  # E.g., when the author is an organization
+      a_l = list(last_name = author, first_name = "")
+    }
+    authors = list.append(authors, a_l)
+  }
+
+  # Split the country list and generate a list compliant with the schema
+  list_countries <- doc_list$country[i]
+  list_countries <- str_split(list_countries, ";")
+  countries = list()
+  for(n in 1:length(list_countries[[1]])) {
+    country = trimws(list_countries[[1]][n])
+    if(country == "World"){
+      c_code = "WLD"
+    } else {
+      c_code = countrycode(country, origin = 'country.name', destination = 'iso3c')
+    }
+    if(is.na(c_code)) c_code = ""
+    c_l = list(name = country, code = c_code)
+    countries = list.append(countries, c_l)
+  }
+
+  # Capture the cover page as JPG, and generate the full document metadata
+
+  thumb <- gsub(".pdf", ".jpg", pdf_file)
+  capture_pdf_cover(pdf_file)  # To be used as thumbnail
+
+  this_document <- list(
+    document_description = list(
+      title_statement = list(idno = id, title = title),
+      date_published = date,
+      authors = authors,
+      abstract = abstract,
+      ref_country = countries
+    )
+  )
+
+  # Publish the metadata in NADA
+
+  document_add(idno = id,
+               published = 1,
+               overwrite = "yes",
+               metadata = this_document,
+               thumbnail = thumb)
+
+  # Add a link to the document
+
+  external_resources_add(
+    title = this_document$document_description$title_statement$title,
+    idno = id,
+    dctype = "doc/anl",
+    file_path = url,
+    overwrite = "yes"
+  )
+
+}
+```
+
+
+#### Using Python
+
+
+```python
+# @@@ Script not tested yet
+
+import pynada as nada
+import pandas as pd
+import urllib.request
+import os.path
+
+# Set API key and catalog URL
+nada.set_api_key("my_api_key")
+nada.set_api_url("http://my_catalog.ihsn.org/index.php/api/")
+
+# Read the file containing information on the 5 documents
+doc_list = pd.read_csv("my_list_of_documents.csv")
+
+# Generate the metadata and publish in NADA catalog
+for index, doc in doc_list.iterrows():
+
+    # Download the file if not already done
+    url = doc['URL_pdf']
+    pdf_file = os.path.basename(url)
+    if not os.path.exists(pdf_file):
+        urllib.request.urlretrieve(url, pdf_file)
+
+    # Map the available metadata elements to the schema
+    idno = doc['ID']
+    title = doc['title']
+    date = str(doc['date_published'])
+    abstract = doc['abstract']
+    doc_type = doc['type']
+
+    # Split the authors' list (last name and first name separated by a ",")
+    authors = []
+    for author in str(doc['authors']).split(";"):
+        author = author.strip()
+        if "," in author:   # We have a last name and a first name
+            last, first = [x.strip() for x in author.split(",", 1)]
+            authors.append({'last_name': last, 'first_name': first})
+        else:               # E.g., when the author is an organization
+            authors.append({'last_name': author, 'first_name': ""})
+
+    # Split the country list (ISO codes could be looked up here, as in the R script)
+    countries = []
+    for country in str(doc['country']).split(";"):
+        country = country.strip()
+        code = "WLD" if country == "World" else ""
+        countries.append({'name': country, 'code': code})
+
+    # Document the file, and publish in NADA
+    repository_id = "central"
+    published = 1
+    overwrite = "yes"
+    document_description = {
+        'title_statement': {
+            'idno': idno,
+            'title': title
+        },
+        'type': doc_type,
+        'date_published': date,
+        'authors': authors,
+        'abstract': abstract,
+        'ref_country': countries
+    }
+    files = [
+        {'file_uri': url, 'format': "Adobe Acrobat PDF"},
+    ]
+
+    nada.create_document_dataset(
+        dataset_id = idno,
+        repository_id = repository_id,
+        published = published,
+        overwrite = overwrite,
+        document_description = document_description,
+        files = files
+    )
+
+    # Generate the thumbnail from the PDF file
+    thumbnail_path = nada.pdf_to_thumbnail(pdf_file, page_no=1)
+    nada.upload_thumbnail(idno, thumbnail_path)
+```
+
diff --git a/05_chapter05_microdata.md b/05_chapter05_microdata.md
new file mode 100644
index 0000000..b26f4f8
--- /dev/null
+++ b/05_chapter05_microdata.md
@@ -0,0 +1,3538 @@
+---
+output: html_document
+---
+
+# Microdata {#chapter05}
+
+![](./images/DDI.JPG){width=60%} +
+
+
+
+## Definition of microdata
+
+When surveys or censuses are conducted, or when administrative data are recorded, information is collected on each unit of observation. The unit of observation can be a person, a household, a firm, an agricultural holding, a facility, or other. Microdata are the data files resulting from these data collection activities, which contain the unit-level information (as opposed to aggregated data in the form of counts, means, or other). Information on each unit is stored in *variables*, which can be of different types (e.g., numeric or alphanumeric, discrete or continuous). These variables may contain data reported by the respondent (e.g., the marital status of a person), obtained by observation or measurement (e.g., the GPS location of a dwelling), or generated by calculation, recoding, or derivation (e.g., the sample weight in a survey).
+
+For efficiency reasons, variables are often stored in numeric format (i.e., as coded values), even when they contain qualitative information. For example, the sex of a respondent may be stored in a variable named ‘Q_01’ with values 1, 2, and 9, where 1 represents "male", 2 represents "female", and 9 represents "unreported". Microdata must therefore be provided, at a minimum, with a data dictionary containing the variable and value labels and, for derived variables, information on the derivation process. But many other features of a micro-dataset should also be described, such as the objectives and methodology of the data collection (including a description of the sampling design for sample surveys), the period of data collection, the identification of the primary investigator and other contributors, the scope and geographic coverage of the data, and much more. This information will make the data usable and discoverable.
+
+
+## The Data Documentation Initiative (DDI) metadata standard
+
+The DDI metadata standard provides a structured and comprehensive list of hundreds of elements and attributes which may be used to document microdata. It is unlikely that any one study would ever require using them all, but this list provides a convenient solution to foster completeness of the information, and to generate documentation that will meet the needs of users.
+
+The Data Documentation Initiative (DDI) metadata standard originated in the [Inter-university Consortium for Political and Social Research (ICPSR)](https://www.icpsr.umich.edu/web/pages/), a membership-based organization with more than 500 member colleges and universities worldwide. The DDI is now the project of an alliance of North American and European institutions. Member institutions comprise many of the largest data producers and data archives in the world. The DDI standard is used by a large community of data archivists, including data librarians from academia, data managers in national statistical agencies and other official data producing agencies, and international organizations. The standard has two branches: the [DDI-Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) (version 2.x) and the [DDI-Lifecycle](https://ddialliance.org/Specification/DDI-Lifecycle/) (version 3.x). These two branches serve different purposes and audiences. For the purpose of data archiving and cataloguing, the schema we recommend in this Guide is the DDI-Codebook. We use a slightly simplified version of version 2.5 of the standard, to which we add a few elements (including the `tags` element common to all schemas described in the Guide).
A mapping between the elements included in our schema and the DDI Codebook metadata tags is provided in annex 2.
+
+The DDI standard is published under the terms of the [GNU General Public License](http://www.gnu.org/licenses/) (version 3 or later).
+
+
+### DDI-Codebook
+
+The DDI Alliance developed the [DDI-Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) for organizing the content, presentation, transfer, and preservation of metadata in the social and behavioral sciences. It enables documenting microdata files in a simultaneously flexible and rigorous way. The DDI-Codebook aims to provide a straightforward means of recording and communicating all the salient characteristics of a micro-dataset.
+
+The DDI-Codebook is designed to encompass the kinds of data resulting from surveys, censuses, administrative records, experiments, direct observation, and other systematic methods for generating empirical measurements. The unit of observation can be individual persons, households, families, business establishments, transactions, countries, or other subjects of scientific interest.
+
+The DDI Alliance publishes the DDI-Codebook as an XML schema. We present in this Guide a JSON implementation of the schema, which is used in our R package *NADAR* and Python library *PyNADA*. The [NADA cataloguing](https://nada.ihsn.org/) application works with both the XML and the JSON versions. A DDI-compliant metadata file can be converted from JSON to XML, or from XML to JSON.
+
+
+### DDI-Lifecycle
+
+As indicated by the [DDI Alliance website](https://ddialliance.org/Specification/DDI-Lifecycle/3.3/), **DDI-Lifecycle** is "designed to document and manage data across the entire life cycle, from conceptualization to data publication, analysis and beyond. It encompasses all of the DDI-Codebook specification and extends it. Based on XML Schemas, DDI-Lifecycle is modular and extensible." DDI-Lifecycle can be used to "populate variable and question banks to explore available data and question structures for reuse in new surveys". As this is not our objective, and because using DDI-Lifecycle adds significant complexity, we do not make use of it; this chapter only covers the DDI-Codebook.
+
+
+## Some practical considerations
+
+The DDI is a comprehensive schema that provides metadata elements to document a **study** (e.g., a survey or an administrative dataset), the related **data files**, and the **variables** they contain. A separate schema is used to document the **related resources** (questionnaires, reports, and others); see Chapter 13.
+
+Some datasets may contain hundreds or even thousands of variables. For each variable, the DDI can include not only the variable name, label, and description, but also summary statistics like the count of valid and missing observations, weighted and unweighted frequencies, means, and others. Generating a DDI file manually, in particular the variable-level metadata, can be a tedious and time-consuming task. But variable names, summary statistics, and (when available) variable and value labels can be extracted directly from the data files. User-friendly solutions (specialized metadata editors) are available to automate a large part of this work. DDI metadata can also be generated programmatically using R or Python. Section 5.5 provides examples of the use of specialized DDI metadata editors and programming languages to generate DDI-compliant metadata.
+ +Documenting microdata is more complex than documenting publications or other types of data like tables or indicators. The production of microdata often involves experts in survey design, sampling, data processing, and analysis. Generating the metadata should thus be a collective responsibility and will ideally be done in real time ("document as you survey"). Data documentation should be implemented during the whole lifecycle of data production, not as an *ex post* task. This is in line with what the [Generic Statistical Business process Model (GSBPM)](https://statswiki.unece.org/display/GSBPM/VI.+Overarching+Processes) recommends: "Good metadata management is essential for the efficient operation of statistical business processes. Metadata are present in every phase, either created, updated or carried forward from a previous phase or reused from another business process. In the context of this model, the emphasis of the overarching process of metadata management is on the creation/revision, updating, use and archiving of statistical metadata, though metadata on the different sub-processes themselves are also of interest, including as an input for quality management. The key challenge is to ensure that these metadata are captured as early as possible, and stored and transferred from phase to phase alongside the data they refer to." Too often, microdata are documented after completion of the data collection, sometimes by a team who was not directly involved in the production of the data. In such cases, some information may not have been captured and will be difficult to find or reconstruct. + +:::idea +**Suggestions and recommendations to data curators** + +- Generating detailed metadata at the variable level (including elements like the formulation of the questions, variable and value labels, interviewer instructions, universe, derivation procedures, etc.) may seem to be a tedious exercise, but it adds considerable value to the metadata. Indeed, it will (i) provide a detailed data dictionary, required to make the data usable, (ii) provide the necessary information for making the data more discoverable and to enable variable comparison tools, and (iii) guarantee the preservation of institutional memory. The cost of generating such metadata will be very small relative to the cost of generating the data.
+- To make the data more discoverable, attention should be paid to providing a detailed description of the scope and objectives of the data collection. When a survey (or other microdataset) is used to generate statistical indicators, a list of these indicators should be provided in the metadata.
+- The `keywords` metadata element provides a flexible solution to improve the discoverability of data. For example, a survey that collects data on children's age, weight, and height will be relevant for measuring malnutrition and generating indicators like the prevalence of stunting, wasting, overweight, and underweight. The variable descriptions alone would not make the data discoverable in keyword-based search engines, hence the importance of adding relevant terms and phrases in the `keywords` element.
+- The DDI metadata will be saved as an XML or JSON file, i.e. as plain text. This means that the DDI metadata cannot include complex formulas. The description of some variables, as well as the description of a survey sample design, may require the use of formulas. In such case, the recommendation is to provide as much of the information as possible in the DDI, and to provide links to documents where the formulas can be found (these documents would be published with the metadata as *external resources*). +- Typically, the variables in the DDI are organized by data file. The DDI provides an option --the `variable groups`-- to organize variables differently, for example thematically. These variable groupings are virtual, in the sense that they do not impact the way variables are stored. Not all variables have to be mapped to such groups, and a same variable can belong to more than one group. This option provides the possibility to organize the variables based on a thematic or topical classification. Machine learning (AI) tools make it possible to automate the process of mapping variables to a pre-defined list of groups (each one of them described by a label and a short description). By doing this, and by generating embeddings at the group level, it becomes possible to add semantic search and to implement a recommender system that applies to microdata. +::: + + +## Schema description: DDI-Codebook 2.5 + +The DDI-Codebook is a comprehensive, structured list of elements to be used to document microdata of any source. The standard contains five main sections: + +- **Document description** (`doc_desc`), with elements used to describe the metadata (not the data); the term "document" refers here to the XML (or JSON) file that contains the metadata. +- **Study description** (`study_desc`), which contains the elements used to describe the study itself (the survey, the administrative process, or the other activity that resulted in the production of the microdata). This section will contain information on the primary investigator, scope and coverage of the data, sampling, etc. +- **File description** (`data_files`), which provides elements to document each data file that compose the dataset (this is thus a repeatable block of elements). +- **Variable description** (`variables`), with elements used to describe each variable contained in the data files, including the variable names, the variable and value labels, summary statistics for each variable, interviewers' instructions, description of recoding or derivation procedure, and more. +- **Variable groups** (`variable_groups`), an optional section that allows organizing variables by thematic or other groups, independently from the data file they belong to. Variable groups are "virtual"; the grouping of variables does not affect the data files. + +The other sections in the schema are not part of the DDI Codebook itself. Some are used for catalog administration purposes when the NADA cataloguing application is used (`repositoryid`, `access_policy`, `published`, `overwrite`, and `provenance`). + +- **`repositoryid`** identifies the data catalog/collection in which the metadata will be published. +- **`access_policy`** indicates the access policy to be applied to the microdata (open access, public use files, licensed access, no access, etc.) +- **`published`**: Indicates whether the metadata will be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible. 
+- **`overwrite`**: Indicates whether metadata that may have been previously uploaded for the same dataset can be overwritten. By default, the value is "no". It must be set to "yes" to overwrite existing information. Note that a dataset will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element `study_desc > title_statement > idno` is the same. +- **`provenance`** is used to store information on the source and time of harvesting, for metadata that were extracted automatically from external data catalogs. + +Other sections are provided to allow additional metadata to be collected and stored, including metadata generated by machine learning models (`tags`, `lda_topics`, `embeddings`, and `additional`). The `tags` is a section common to all schemas (with the exception of the *external resources* schema), which provides a flexible solution to generate customized facets in data catalogs. The `additional` section allows data curators to supplement the DDI standard with their own metadata elements, without breaking compliance with the DDI. + +```json +{ + "repositoryid": "string", + "access_policy": "data_na", + "published": 0, + "overwrite": "no", + "doc_desc": {}, + "study_desc": {}, + "data_files": [], + "variables": [], + "variable_groups": [], + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
+ +The DDI-Codebook also provides a solution to describe OLAP cubes, which we do not make use of as our purpose is to use the standard to document and catalog datasets, not to manage data. + +:::note +Each metadata element in the DDI standard has a name. In our JSON version of the standard, we do not make use of the exact same names. We adapted some of them for clarity. For example, we renamed the DDI element `titlStmt` as `title_statement`. The mapping between the DDI Codebook 2.5 standard and the elements in our schema is provided in appendix. JSON files created using our adapted version of the DDI can be exported as a DDI 2.5 compliant and validated XML file using R or Python scripts provided in the NADAR package and PyNADA library. +::: + +### Document description + +**`doc_desc`** *[Optional ; Not repeatable]*
+Documenting a study using the DDI-Codebook standard consists of generating a metadata file in XML or JSON format. This file is what is referred to as the metadata *document*. The `doc_desc` or **document description** is thus a description of the metadata file, and consists of bibliographic information describing the DDI-compliant document as a whole. As the same dataset may be documented by more than one organization, and because metadata can be automatically harvested by on-line catalogs, traceability of the metadata is important. This section, which only contains five main elements, should be as complete as possible, and should at least contain information on the `producers` and `prod_date` elements.
+
+```json
+"doc_desc": {
+  "title": "string",
+  "idno": "string",
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "prod_date": "string",
+  "version_statement": {
+    "version": "string",
+    "version_date": "string",
+    "version_resp": "string",
+    "version_notes": "string"
+  }
+}
+```
+
+ +- **`title`** *[Optional ; Not repeatable ; String]*
+The title of the metadata document (which may be the title of the study itself). The metadata document is the DDI metadata file (XML or JSON file) that is being generated. The "Document title" should mention the geographic scope of the data collection as well as the time period covered. For example: “DDI 2.5: Albania Living Standards Study 2012”. + +- **`idno`** *[Optional ; Not repeatable ; String]*
+A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the unique identifier `idno` found in `study_desc > title_statement`, although it is good practice to generate identifiers that establish a clear connection between the two identifiers. The Document ID could also include the metadata document version identifier. For example, if the "Primary identifier" of the study is “ALB_LSMS_2012”, the "Document ID" in the Metadata information could be “IHSN_DDI_v01_ALB_LSMS_2012” if the DDI metadata are produced by the IHSN. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format; a minimal sketch of such a rule is provided at the end of this section. The identifier should not contain blank spaces.
+
+- **`producers`** *[Optional ; Repeatable]*
+The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a "Recommended" element. For catalog administration purposes, information on the producer and on the date of metadata production is useful. + + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization in charge of the production of the DDI metadata. If the name of individuals cannot be provided due to an organization's data protection rules, the title of the person, or an anonymized identifier, can be provided (or this field can be left blank if no other option is available). + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The initials of the person, or the abbreviation of the organization's name mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the person or organization mentioned in `name` in the production of the DDI metadata.

+ +- **`prod_date`** *[Optional ; Not repeatable ; String]*
+The date the DDI metadata document was produced (not the date it was distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a "Recommended" element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.
+
+- **`version_statement`** *[Optional ; Not repeatable]*
+A version statement for the metadata (DDI) document. Documenting a dataset is not a trivial exercise. It may happen that, having identified errors or gaps in a DDI document, or after receiving suggestions for improvement or additional input, the DDI metadata are modified. The `version_statement` describes the version of the metadata document. It is good practice to provide a version number and date, and information on what distinguishes the current version from the previous one(s). + + - **`version`** *[Optional ; Not repeatable ; String]*
+ The label of the version, also known as release or edition. For example, *Version 1.2* + - **`version_date`** *[Optional ; Not repeatable ; String]*
+ The date when this version of the metadata document (DDI file) was produced, preferably identifying an exact date. This will usually correspond to the `prod_date` element. It is recommended to enter the date in the ISO 8601 date format (YYYY-MM-DD or YYYY-MM or YYYY). + - **`version_resp`** *[Optional ; Not repeatable ; String]*
+ The organization or person responsible for this version of the metadata document. + - **`version_notes`** *[Optional ; Not repeatable ; String]*
+ This element can be used to clarify information/annotation regarding this version of the metadata document, for example to indicate what is new or specific in this version comparing with a previous version. + + + +```r +my_ddi <- list( + + doc_desc = list( + title = "Albania Living Standards Study 2012", + idno = "DDI_WB_ALB_2012_LSMS_v02", + producers = list( + list(name = "Development Data Group", + abbr = "DECDG", + affiliation = "World Bank", + role = "Production of the DDI-compliant metadata" + ) + ), + prod_date = "2021-02-16", + version_statement = list( + version = "Version 2.0", + version_date = "2021-02-16", + version_resp = "OD", + version_notes = "Version identical to Version 1.0 except for the Data Appraisal section which was added." + ) + ), + + # ... (other sections of the DDI) + +) +``` +
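+
+As noted in the description of the `idno` element, a validation rule can be used to enforce a specific identifier format. A minimal sketch of such a rule in R is shown below; the regular expression itself is only an assumption for illustration (matching identifiers like the “IHSN_DDI_v01_ALB_LSMS_2012” example above) and should be adapted to each organization's own convention.
+
+```r
+# Hypothetical identifier pattern: PRODUCER_DDI_vNN_COUNTRY_STUDY_YEAR
+id_pattern <- "^[A-Z]+_DDI_v[0-9]{2}_[A-Z]{3}_[A-Z0-9]+_[0-9]{4}$"
+
+grepl(id_pattern, "IHSN_DDI_v01_ALB_LSMS_2012")   # TRUE
+grepl(id_pattern, "IHSN DDI v01 ALB LSMS 2012")   # FALSE (blank spaces are not allowed)
+```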
+ +### Study description + +**`study_desc`** *[Required ; Not repeatable]*
+The `study_desc` or **study description** consists of information about the data collection or study that the DDI-compliant documentation file describes. This section includes study-level information such as scope and coverage, objectives, producers, sampling, data collection dates and methods, etc. + +```json +"study_desc": { + "title_statement": {}, + "authoring_entity": [], + "oth_id": [], + "production_statement": {}, + "distribution_statement": {}, + "series_statement": {}, + "version_statement": {}, + "bib_citation": "string", + "bib_citation_format": "string", + "holdings": [], + "study_notes": "string", + "study_authorization": {}, + "study_info": {}, + "study_development": {}, + "method": {}, + "data_access": {} +} +``` +
+ + +#### Title statement + +**`title_statement`** *[Required ; Not repeatable]*
+The title statement for the study. + +```json +"title_statement": { + "idno": "string", + "identifiers": [ + { + "type": "string", + "identifier": "string" + } + ], + "title": "string", + "sub_title": "string", + "alternate_title": "string", + "translated_title": "string" +} +``` +
+ +- **`idno`** *[Required ; Not repeatable ; String]*
+`idno` is the primary identifier of the dataset. It is a unique identification number used to identify the study (survey, census, or other). A unique identifier is required for cataloguing purposes, so this element is declared as "Required". The identifier will allow users to cite the dataset properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the study. Alternatively, the `idno` can be constructed by an organization using a consistent scheme. The scheme could for example be "catalog-country-study-year-version", where catalog is the abbreviation of the catalog or data archive, country is the 3-letter ISO country code, study is the study acronym, year is the reference year (or the year the study started), and version is a version number. Using that scheme, the Uganda 2005 Demographic and Health Survey would have the following `idno` (where "MDA" stands for "My Data Archive"): MDA_UGA_DHS_2005_v01. Note that the schema allows you to provide more than one identifier for the same study (in element `identifiers`); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The identifier should not contain blank spaces.
+
+- **`identifiers`** *[Optional ; Repeatable]*
+This repeatable element is used to enter identifiers (IDs) other than the `idno` entered in the Title statement. It can for example be a Digital Object Identifier (DOI). The `idno` can be repeated here (the `idno` element does not provide a `type` parameter; if a DOI or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`). + + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of unique ID, e.g. "DOI". + - **`identifier`** *[Required ; Not repeatable ; String]*
+ The identifier itself.

+ +- **`title`** *[Required ; Not repeatable ; String]*
+This element is "Required". Provide here the full authoritative title for the study, and make sure to use a unique name for each distinct study. The title should indicate the time period covered. For example, in a country conducting monthly labor force surveys, the title of a study could be "Labor Force Survey, December 2020". When a survey spans two years (for example, a household income and expenditure survey conducted over a period of 12 months from June 2020 to June 2021), the range of years can be provided in the title, for example "Household Income and Expenditure Survey 2020-2021". The title of a survey should be its official name as stated on the survey questionnaire or in other study documents (report, etc.). Including the country name in the title is optional (another metadata element is used to identify the reference countries). Pay attention to the consistent use of capitalization in the title.
+
+- **`sub_title`** *[Optional ; Not repeatable ; String]*
+The `sub_title` is a secondary title used to amplify or state certain limitations on the main title, for example to add information usually associated with a sequential qualifier for a survey. For example, we may have "[country] Universal Primary Education Project, Impact Evaluation Survey 2007" as `title`, and "Baseline dataset" as `sub_title`. Note that this information could also be entered as a title with no subtitle: "[country] Universal Primary Education Project, Impact Evaluation Survey 2007 - Baseline dataset".
+
+- **`alternate_title`** *[Optional ; Not repeatable ; String]*
+The `alternate_title` will typically be used to capture the abbreviation of the survey title. Many surveys are known and referred to by their acronym. The survey reference year(s) may be included. For example, the "Demographic and Health Survey 2012" would be abbreviated as "DHS 2012", and the "Living Standards Measurement Study 2020-2021" as "LSMS 2020-2021".
+
+- **`translated_title`** *[Optional ; Not repeatable ; String]*
+In countries with more than one official language, a translation of the title may be provided here. Likewise, the translated title may simply be a translation into English from a country’s own language. Special characters should be properly displayed, such as accents and other stress marks or different alphabets. + + +```r +my_ddi <- list( + + # ... , + + study_desc = list( + title_statement = list( + idno = "ML_ALB_2012_LSMS_v02", + identifiers = list( + list(type = "DOI", identifier = "XXX-XXXX-XXX") + ), + title = "Living Standards Study 2012", + alternate_title = "LSMS 2012", + translated_title = "Anketa e Matjes së Nivelit të Jetesës (AMNJ) 2012" + ) + ), + + # ... +) +``` +
+ +#### Authoring entity + +**`authoring_entity`** *[Optional ; Repeatable]*
+The name and affiliation of the person, corporate body, or agency responsible for the study's substantive and intellectual content (the "authoring entity" or "primary investigator"). Generally, in a survey, the authoring entity will be the institution implementing the survey. Repeat the element for each authoring entity, and enter the `affiliation` when relevant. If several institutions were equally involved as main investigators, they should all be listed. Include only the agencies responsible for the implementation of the study, not the sponsoring agencies or the entities providing technical assistance (for which other metadata elements are available). The order in which authoring entities are listed is discretionary; it can be alphabetic or by significance of contribution. Individual persons can also be mentioned, if not prohibited by privacy protection rules.
+
```json
"authoring_entity": [
  {
    "name": "string",
    "affiliation": "string"
  }
]
```
+ +- **`name`** *[Optional ; Not repeatable ; String]*
+The name of the person, corporate body, or agency responsible for the work's substantive and intellectual content. The primary investigator will in most cases be an institution, but could also be an individual in the case of small-scale academic surveys. If persons are mentioned, use the appropriate format of *Surname, First name*. +- **`affiliation`** *[Optional ; Not repeatable ; String]*
+The affiliation of the person, corporate body, or agency mentioned in `name`. + + +```r +my_ddi <- list( + + # ... , + + study_desc = list( + + # ... , + + authoring_entity = list( + + list(name = "National Statistics Office of Popstan (NSOP)", + affiliation = "Ministry of Planning"), + + list(name = "Department of Public Health of Popstan (DPH)", + affiliation = "Ministry of Health") + + ), + + # ... + ) + +) +``` +
+ + +#### Other entity + +**`oth_id`** *[Optional ; Repeatable]*
+This element is used to acknowledge any other people and organizations that have in some form contributed to the study. This does not include other producers which should be listed in `producers`, and financial sponsors which should be listed in the element `funding_agencies`. + +```json +"oth_id": [ + { + "name": "string", + "role": "string", + "affiliation": "string" + } +] +``` +
+ +- **`name`** *[Required ; Not repeatable ; String]*
+The name of the person or organization. +- **`role`** *[Optional ; Not repeatable ; String]*
+A brief description of the specific role of the person or organization mentioned in `name`. +- **`affiliation`** *[Optional ; Not repeatable ; String]*
+The affiliation of the person or organization mentioned in `name`. + + +```r +my_ddi <- list( + + # ... , + + study_desc = list( + # ... , + + oth_id = list( + list(name = "John Doe", + role = "Technical advisor in sample design", + affiliation = "World Bank Group" + ) + ), + # ... + + ) + +) +``` +
+ + +#### Production statement + +**`production_statement`** *[Optional ; Not repeatable]*
+A production statement for the work at the appropriate level. + +```json +"production_statement": { + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "copyright": "string", + "prod_date": "string", + "prod_place": "string", + "funding_agencies": [ + { + "name": "string", + "abbr": "string", + "grant": "string", + "role": "string" + } + ] +} +``` +
+ +- **`producers`** *[Optional ; Repeatable]*
+This field is used to list other parties and persons who played a significant, but not the leading, technical role in implementing and producing the data. The leading agency should be listed in `authoring_entity`, and the financial sponsors in `funding_agencies`.
+  - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the person or organization. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The official abbreviation of the organization mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`role`** *[Optional ; Not repeatable ; String]*
+ A succinct description of the specific contribution by the person or organization in the production of the data.
+ +- **`copyright`** *[Optional ; Not repeatable ; String]*
+A copyright statement for the study at the appropriate level. + +- **`prod_date`** *[Optional ; Not repeatable ; String]*
+This is the date (preferably entered in ISO 8601 format: YYYY-MM-DD or YYYY-MM or YYYY) of the actual and final production of the version of the dataset being documented. At least the month and year should be provided. A regular expression can be entered in user templates to validate the information captured in this field. + +- **`prod_place`** *[Optional ; Not repeatable ; String]*
+The address of the organization that produced the study. + +- **`funding_agencies`** *[Optional ; repeatable]*
+The source(s) of funds for the production of the study. If different funding agencies sponsored different stages of the production process, use the `role` attribute to distinguish them. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the funding agency. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The abbreviation (acronym) of the funding agency mentioned in `name`. + - **`grant`** *[Optional ; Not repeatable ; String]*
+ The grant number. If an agency has provided more than one grant, list them all separated with a ";". + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific contribution of the funding agency mentioned in `name`. This element is used when multiple funding agencies are listed to distinguish their specific contributions.

+ +This example shows the Bangladesh 2018-2019 Demographic and Health Survey (DHS) + + +```r +my_ddi <- list( + + # ... , + + study_desc = list( + + # ... , + + production_statement = list( + + producers = list( + + list(name = "National Institute of Population Research and Training", + abbr = "NIPORT", + role = "Primary investigator"), + + list(name = "Medical Education and Family Welfare Division", + role = "Advisory"), + + list(name = "Ministry of Health and Family Welfare", + abbr = "MOHFW", + role = "Advisory"), + + list(name = "Mitra and Associates", + role = "Data collection - fieldwork"), + + list(name = "ICF (consulting firm)", + role = "Technical assistance / DHS Program") + + ), + + prod_date = "2019", + + prod_place = "Dhaka, Bangladesh", + + funding_agencies = list( + list(name = "United States Agency for International Development", + abbr = "USAID") + ) + + ), + # ..., + + ) + # ... + +) +``` +
+ + +#### Distribution statement + +**`distribution_statement`** *[Optional ; Not repeatable]*
+A distribution statement for the study. + +```json +"distribution_statement": { + "distributors": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "uri": "string" + } + ], + "contact": [ + { + "name": "string", + "affiliation": "string", + "email": "string", + "uri": "string" + } + ], + "depositor": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "uri": "string" + } + ], + "deposit_date": "string", + "distribution_date": "string" +} +``` +
+ + +- **`distributors`** *[Optional ; Repeatable]*
+The organization(s) designated by the author or producer to generate copies of the study output including any necessary editions or revisions. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the distributor. It can be an individual or an organization. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The official abbreviation of the organization mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ A URL to the ordering service or download facility on a Web site.

+ +- **`contact`** *[Optional ; Repeatable]*
+Names and addresses of individuals responsible for the study. Individuals listed as contact persons will be used as resource persons regarding problems or questions raised by users.
+ - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the person or organization that can be contacted. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ An email address for the contact mentioned in `name`.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ A URL to the contact mentioned in `name`.

+ +- **`depositor`** *[Optional ; Repeatable]*
+The name of the person (or institution) who provided this study to the archive storing it.
+ - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the depositor. It can be an individual or an organization. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The official abbreviation of the organization mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A URL to the depositor

+ +- **`deposit_date`** *[Optional ; Not repeatable ; String]*
+The date that the study was deposited with the archive that originally received it. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible.
+ +- **`distribution_date`** *[Optional ; Not repeatable ; String]*
+The date that the study was made available for distribution/presentation. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible.

+
+This example shows a study distributed by the World Bank Microdata Library; the elements for which no value is shown are left empty in this illustration.
+
+
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    
    distribution_statement = list(
    
      distributors = list(
        list(name = "World Bank Microdata Library",
             abbr = "WBML",
             affiliation = "World Bank Group",
             uri = "http://microdata.worldbank.org")
      ),
      
      contact = list(
        list(name = "",
             affiliation = "",
             email = "",
             uri = "")
      ),
      
      depositor = list(
        list(name = "",
             abbr = "",
             affiliation = "",
             uri = "")
      ),
      
      deposit_date = "",
      
      distribution_date = ""
      
    ),
    # ...
  )
  # ...
)
```
+ + +#### Series statement + +**`series_statement`** *[Optional; Not repeatable]*
+A study may be repeated at regular intervals (such as an annual labor force survey), or be part of an international survey program (such as the MICS, DHS, LSMS and others). The series statement provides information on the series. + +```json +"series_statement": { + "series_name": "string", + "series_info": "string" +} +``` +
+ +- **`series_name`** *[Optional ; Not repeatable ; String]*
+The name of the series to which the study belongs. For example, "Living Standards Measurement Study (LSMS)" or "Demographic and Health Survey (DHS)" or "Multiple Indicator Cluster Survey VII (MICS7)". A description of the series can be provided in the element "series_info".
+- **`series_info`** *[Optional ; Not repeatable ; String]*
+A brief description of the characteristics of the series, including when it started, how many rounds have already been implemented, and who is in charge of it.
+ + +```r +my_ddi <- list( + doc_desc = list( + # ... + ), + + study_desc = list( + # ... , + series_statement = list( + list(series_name = "Multiple Indicator Cluster Survey (MICS) by UNICEF", + series_info = "The Multiple Indicator Cluster Survey, Round 3 (MICS3) is the third round of MICS surveys, previously conducted around 1995 (MICS1) and 2000 (MICS2). MICS surveys are designed by UNICEF, and implemented by national agencies in participating countries. MICS was designed to monitor various indicators identified at the World Summit for Children and the Millennium Development Goals. Many questions and indicators in MICS3 are consistent and compatible with the prior round of MICS (MICS2) but less so with MICS1, although there have been a number of changes in definition of indicators between rounds. Round 1 covered X countries, round 2 covered Y countries, and Round 3 covered Z countries.") + ), + # ... + ), + # ... +) +``` +
+ + +#### Version statement + +**`version_statement`** *[Optional; Not repeatable]*
+Version statement for the study. + +```json +"version_statement": { + "version": "string", + "version_date": "string", + "version_resp": "string", + "version_notes": "string" +} +``` +
+
+The version statement should contain a version number followed by a version label. The version number should follow a standard convention adopted by the data repository. We recommend that major versions be identified by the number to the left of the decimal point, and successive releases within a major version by the number to the right. The major version number could for example be (0) for the raw, unedited dataset; (1) for the edited dataset, non-anonymized, available for internal use at the data-producing agency; and (2) for the edited dataset prepared for dissemination to secondary users (possibly anonymized). Example:
+
+v0: Basic raw data, resulting from the data capture process, before any data editing is implemented.
+v1.0: Edited data, first iteration, for internal use only.
+v1.1: Edited data, second iteration, for internal use only.
+v2.1: Edited data, anonymized and packaged for public distribution.
+ +- **`version`** *[Optional ; Not repeatable ; String]*
+The version number, also known as release or edition. +- **`version_date`** *[Optional ; Not repeatable ; String]*
+The date of this version of the dataset, preferably entered in ISO 8601 format (YYYY-MM-DD).
+- **`version_resp`** *[Optional ; Not repeatable ; String]*
+The person(s) or organization(s) responsible for this version of the study. +- **`version_notes`** *[Optional ; Not repeatable ; String]*
+Version notes should provide a brief report on the changes made through the versioning process. The note should indicate how this version differs from other versions of the same dataset.
+
+ + +```r +my_ddi <- list( + + # ... + + study_desc = list( + + # ... , + + version_statement = list( + version = "Version 1.1", + version_date = "2021-02-09", + version_resp = "National Statistics Office, Data Processing unit", + version_notes = "This dataset contains the edited version of the data that were used to produce the Final Survey Report. It is equivalent to version 1.0 of the dataset, except for the addition of an additional variable (variable weight2) containing a calibrated version of the original sample weights (variable weight)" + ), + + # ... + + ), + + # ... + +) +``` +
+ + +#### Bibliographic citation + + +**`bib_citation`** *[Optional ; Not repeatable ; String]*
+Complete bibliographic reference containing all of the standard elements of a citation that can be used to cite the study. The `bib_citation_format` (see below) is provided to enable specification of the particular citation style used, e.g., APA, MLA, or Chicago. + + +#### Bibliographic citation format + +**`bib_citation_format`** *[Optional ; Not repeatable ; String]*
+This element is used to specify the particular citation style used in the field `bib_citation` described above, e.g., APA, MLA, or Chicago.
+ + +```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + bib_citation = "", + bib_citation_format = "" + # ... + ), + # ... + ) +``` +
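+For illustration, a hypothetical entry could look like this (the study, organization, and citation text are invented):
+
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    # Hypothetical citation, shown only to illustrate the expected content
    bib_citation = "National Statistics Office of Popstan (2021). Popstan Household Budget Survey 2020, version 2.0 (public use file). Popstan City: National Statistics Office of Popstan.",
    bib_citation_format = "APA"
    # ...
  )
  # ...
)
```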
+ + +#### Holdings + +**`holdings`** *[Optional ; Repeatable]*
+Information concerning either the physical or electronic holdings of the study being described. + +```json +"holdings": [ + { + "name": "string", + "location": "string", + "callno": "string", + "uri": "string" + } +] +``` +
+ +- **`name`** *[Optional ; Not repeatable ; String]*
+Name of the physical or electronic holdings of the cited study.
+- **`location`** *[Optional ; Not repeatable ; String]*
+The physical location where a copy of the study is held.
+- **`callno`** *[Optional ; Not repeatable ; String]*
+The call number at the location specified in `location`.
+- **`uri`** *[Optional ; Not repeatable ; String]*
+A URL for accessing the electronic copy of the cited study from the location mentioned in `name`.
+
+
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    holdings = list(
      list(name = "World Bank Microdata Library",
           location = "World Bank, Development Data Group",
           uri = "http://microdata.worldbank.org")
    ),
    # ...
  ),
  # ...
)
```
+ + +#### Study notes + +**`study_notes`** *[Optional ; Not repeatable]*
+
+This element can be used to provide additional information on the study that cannot be accommodated in the specific metadata elements of the schema, in the form of a free text field.
+
+
+#### Study authorization
+
+**`study_authorization`** *[Optional ; Not repeatable]*
+ +```json +"study_authorization": { + "date": "string", + "agency": [ + { + "name": "string", + "affiliation": "string", + "abbr": "string" + } + ], + "authorization_statement": "string" +} +``` +
+
+Provides structured information on the agency that authorized the study, the date of authorization, and an authorization statement. This element is used when special legislation is required to conduct the data collection (for example, a Census Act), or when the approval of an Ethics Board or another body is required to collect the data.
+
+- **`date`** *[Optional ; Not repeatable ; String]*
+The date, preferably entered in ISO 8601 format (YYYY-MM-DD), when the authorization to conduct the study was granted.
+- **`agency`** *[Optional ; Repeatable]*
+Identification of the agency that authorized the study. + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the agent or agency that authorized the study. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The institutional affiliation of the authorizing agent or agency mentioned in `name`. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The abbreviation of the authorizing agent's or agency's name.

+ +- **`authorization_statement`** *[Optional ; Not repeatable ; String]*
+The text of the authorization (or a description and link to a document or other resource containing the authorization statement).
+ + +```r +my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_authorization = list( + date = "2018-02-23", + agency = list( + name = "Institutional Review Board of the University of Popstan", + abbr = "IRB-UP") + ), + authorization_statement = "The required documentation covering the study purpose, disclosure information, questionnaire content, and consent statements was delivered to the IRB-UP on 2017-12-27 and was reviewed by the compliance officer. Statement of authorization for the described study was issued on 2018-02-23." + # ... + ), + # ... +) +``` +
+ + +#### Study information + +**`study_info`** *[Required ; Not repeatable]*
+This section contains the metadata elements needed to describe the core elements of a study, including the dates of data collection and reference periods, the country and other geographic coverage information, and more. These elements are not required in the DDI standard, but documenting a study without providing at least some of this information would make the metadata mostly irrelevant.
+
```json
"study_info": {
  "study_budget": "string",
  "keywords": [],
  "topics": [],
  "abstract": "string",
  "time_periods": [],
  "coll_dates": [],
  "nation": [],
  "bbox": [],
  "bound_poly": [],
  "geog_coverage": "string",
  "geog_coverage_notes": "string",
  "geog_unit": "string",
  "analysis_unit": "string",
  "universe": "string",
  "data_kind": "string",
  "notes": "string",
  "quality_statement": {},
  "ex_post_evaluation": {}
}
```
+ +- **`study_budget`** *[Optional ; Not repeatable ; String]*
+ + This is a free-text field, not a structured element. The budget of a study will ideally be described by budget line. The currency used to describe the budget should be specified. This element can also be used to document issues related to the budget (e.g., documenting possible under-run and over-run).
+ + + ```r + my_ddi <- list( + # ... , + study_desc = list( + # ... , + study_info = list( + study_budget = "The study had a total budget of 500,000 USD allocated as follows: + By type of expense: + - Staff: 150,000 USD + - Consultants (incl. interviewers): 180,000 USD + - Travel: 50,000 USD + - Equipment: 90,000 USD + - Other: 30,000 USD + By activity + - Study design (questionnaire design and testing, sampling, piloting): 100,000 USD + - Data collection: 250,000 USD + - Data processing and tabulation: 80,000 USD + - Analysis and dissemination: 50,000 USD + - Evaluation: 20,000 USD + By source of funding: + - Government budget: 300,000 USD + - External sponsors + - Grant ABC001 - 150,000 USD + - Grant XYZ987 - 50,000 USD", + + # ... + + ), + # ... + ) + ``` +
+ +- **`keywords`** *[Optional ; Repeatable]*
+ +```json +"keywords": [ + { + "keyword": "string", + "vocab": "string", + "uri": "string" + } +] +``` +
+ + Keywords are words or phrases that describe salient aspects of a data collection's content. The addition of keywords can significantly improve the discoverability of data. Keywords can summarize and improve the description of the content or subject matter of a study. For example, keywords "poverty", "inequality", "welfare", and "prosperity" could be attached to a household income survey used to generate poverty and inequality indicators (for which these keywords may not appear anywhere else in the metadata). A controlled vocabulary can be employed. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
+ - **`keyword`** *[ Required ; String ; Non repeatable]*
+  - **`keyword`** *[Required ; Not repeatable ; String]*
+    A keyword (or phrase).
+  - **`vocab`** *[Optional ; Not repeatable ; String]*
+ The controlled vocabulary from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI of the controlled vocabulary used, if any.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + keywords = list( + list(keyword = "poverty", + vocab = "UNESCO Thesaurus", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"), + list(keyword = "income distribution", + vocab = "UNESCO Thesaurus", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"), + list(keyword = "inequality", + vocab = "UNESCO Thesaurus", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/") + ), + # ... + ), + # ... + ) + ``` +
+ +- **`topics`** *[Optional ; Repeatable]*
+The `topics` field indicates the broad substantive topic(s) that the study covers. A topic classification facilitates referencing and searches in on-line data catalogs. + +```json +"topics": [ + { + "topic": "string", + "vocab": "string", + "uri": "string" + } +] +``` +
+ + - **`topic`** *[Required ; Not repeatable]*
+ The label of the topic. Topics should be selected from a standard controlled vocabulary such as the [Council of European Social Science Data Archives (CESSDA) Topic Classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification).
+ - **`vocab`** *[Required ; Not repeatable]*
+ The specification (name including the version) of the controlled vocabulary in use.
+ - **`uri`** *[Required ; Not repeatable]*
+ A link (URL) to the controlled vocabulary website.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + topics = list( + + list(topic = "Equality, inequality and social exclusion", + vocab = "CESSDA topics classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(topic = "Social and occupational mobility", + vocab = "CESSDA topics classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification") + + ), + # ... + ), + # ... + ) + ``` +
+ +- **`abstract`** *[Optional ; Not repeatable ; String]*
+An unformatted summary describing the purpose, nature, and scope of the data collection, special characteristics of its contents, major subject areas covered, and the questions the primary investigator(s) attempted to answer when they conducted the study. The summary should ideally be between 50 and 5,000 characters long. The abstract should provide a clear summary of the purposes, objectives, and content of the survey. It should be written by a researcher or survey statistician familiar with the study. Inclusion of this element is strongly recommended.
+ + This example is for the Afrobarometer Survey 1999-2000, Merged Round 1 dataset. + + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + abstract = "The Afrobarometer is a comparative series of public attitude surveys that assess African citizen's attitudes to democracy and governance, markets, and civil society, among other topics. + + The 12 country dataset is a combined dataset for the 12 African countries surveyed during round 1 of the survey, conducted between 1999-2000 (Botswana, Ghana, Lesotho, Mali, Malawi, Namibia, Nigeria South Africa, Tanzania, Uganda, Zambia and Zimbabwe), plus data from the old Southern African Democracy Barometer, and similar surveys done in West and East Africa.", + + # ... + ), + # ... + ) + ``` +
+ +- **`time_periods`** *[Optional ; Repeatable]*
+This refers to the time period (also known as span) covered by the data, not the dates of data collection.
+ +```json +"time_periods": [ + { + "start": "string", + "end": "string", + "cycle": "string" + } +] +``` +
+ + - **`start`** *[Required ; Not repeatable ; String]*
+ The start date for the cycle being described. Enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
+ - **`end`** *[Required ; Not repeatable ; String]*
+ The end date for the cycle being described. Enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). Indicate open-ended dates with two decimal points (..)
+ - **`cycle`** *[Optional ; Not repeatable ; String]*
+ The `cycle` attribute permits specification of the relevant cycle, wave, or round of data.

+ + +- **`coll_dates`** *[Optional ; Repeatable]*
+Contains the date(s) when the data were collected, which may differ from the dates the data refer to (see `time_periods` above). For example, data may be collected over a period of two weeks (`coll_dates`) about household expenditures during a reference week (`time_periods`) preceding the start of data collection. Use the `start` and `end` elements to specify each data collection period.
+ +```json +"coll_dates": [ + { + "start": "string", + "end": "string", + "cycle": "string" + } +] +``` +
+ + - **`start`** *[Required ; Not repeatable ; String]*
+ Date the data collection started (for the specified cycle, if any). Enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
+ - **`end`** *[Required ; Not repeatable ; String]*
+ Date the data collection ended (for the specified cycle, if any). Enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
+ - **`cycle`** *[Optional ; Not repeatable ; String]*
+ Identification of the cycle of data collection. The `cycle` attribute permits specification of the relevant cycle, wave, or round of data. For example, a household consumption survey could visit households in four phases (one per quarter). Each quarter would be a cycle, and the specific dates of data collection for each quarter would be entered.
+
+  This example is for an impact evaluation survey with a baseline and two follow-up surveys.
+
  
  ```r
  my_ddi <- list(
    doc_desc = list(
      # ...
    ),  
    study_desc = list(
      # ... ,
      study_info = list(
        # ... ,  
        
        time_periods = list(
        
          list(start = "2020-01-10",
               end = "2020-01-16",
               cycle = "Baseline survey"),
               
          list(start = "2020-07-10",
               end = "2020-07-16",
               cycle = "First follow-up survey"),
               
          list(start = "2021-01-10",
               end = "2021-01-16",
               cycle = "Second and last follow-up survey")
        ),
        
        coll_dates = list(
        
          list(start = "2020-01-17",
               end = "2020-01-25",
               cycle = "Baseline survey"),
               
          list(start = "2020-07-17",
               end = "2020-07-24",
               cycle = "First follow-up survey"),
               
          list(start = "2021-01-17",
               end = "2021-01-22",
               cycle = "Second and last follow-up survey")
        ),
        
        # ...
      ),
    # ...
    )
  ```
+ +- **`nation`** *[Optional ; Repeatable]*
+Indicates the country or countries (or "economies", or "territories") covered in the study (but not the sub-national geographic areas). If the study covers more than one country, they will be entered separately. + +```json +"nation": [ + { + "name": "string", + "abbreviation": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The country name, even in cases where the study does not cover the entire country.
+ - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ The `abbreviation` will contain a country code, preferably the 3-letter [ISO 3166-1 country code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3).
+ + +- **`bbox`** *[Optional ; Repeatable]*
+This element is used to define one or multiple bounding box(es), which are the rectangular fundamental geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset's geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. This element is optional, but if the `bound_poly` element (see below) is used, then the `bbox` element must be included.
+ +```json +"bbox": [ + { + "west": "string", + "east": "string", + "south": "string", + "north": "string" + } +] +``` +
+ + - **`west`** *[Required ; Not repeatable ; String]*
+ West longitude of the bounding box.
+ - **`east`** *[Required ; Not repeatable ; String]*
+ East longitude of the bounding box.
+ - **`south`** *[Required ; Not repeatable ; String]*
+ South latitude of the bounding box.
+ - **`north`** *[Required ; Not repeatable ; String]*
+ North latitude of the bounding box.
+ + This example is for a study covering the islands of Madagascar and Mauritius +
+ ![](./images/Microdata_bbox.JPG){width=45%} +
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + nation = list( + list(name = "Madagascar", abbreviation = "MDG"), + list(name = "Mauritius", abbreviation = "MUS") + ), + + bbox = list( + + list(name = "Madagascar", + west = "43.2541870461", + east = "50.4765368996", + south = "-25.6014344215", + north = "-12.0405567359"), + + list(name = "Mauritius", + west = "56.6", + east = "72.466667", + south = "-20.516667", + north = "-5.25") + + ), + # ... + ), + # ... + ) + ``` +
+ +- **`bound_poly`** *[Optional ; Repeatable]*
+The `bbox` metadata element (see above) describes a rectangular area representing the entire geographic coverage of a dataset. The element `bound_poly` allows for a more detailed description of the geographic coverage, by allowing multiple and non-rectangular polygons (areas) to be described. This is done by providing list(s) of latitude and longitude coordinates that define the area(s). It should only be used to define the outer boundaries of the covered areas. This field is intended to enable a refined coordinate-based search, not to actually map an area. Note that if the `bound_poly` element is used, then the element `bbox` MUST be present as well, and all points enclosed by the `bound_poly` MUST be contained within the bounding box defined in `bbox`.
+ +```json +"bound_poly": [ + { + "lat": "string", + "lon": "string" + } +] +``` +
+ + - **`lat`** *[Required ; Not repeatable ; String]*
+ The latitude of the coordinate.
+ - **`lon`** *[Required ; Not repeatable ; String]*
+ The longitude of the coordinate.
+ + This example shows a polygon for the State of Nevada, USA + + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + bound_poly = list( + list(lat = "42.002207", lon = "-120.005729004"), + list(lat = "42.002207", lon = "-114.039663"), + list(lat = "35.9", lon = "-114.039663"), + list(lat = "36.080", lon = "-114.544"), + list(lat = "35.133", lon = "-114.542"), + list(lat = "35.00208499998", lon = "-114.63288"), + list(lat = "35.00208499998", lon = "-114.63323"), + list(lat = "38.999", lon = "-120.005729004"), + list(lat = "42.002207", lon = "-120.005729004") + ), + + # ... + ), + # ... + ) + ``` +
+ +- **`geog_coverage`** *[Optional ; Not repeatable ; String]*
+
+  Information on the geographic coverage of the study. This includes the total geographic scope of the data, and any additional levels of geographic coding provided in the variables. Typical entries will be "National coverage", "Urban areas", "Rural areas", "State of ...", "Capital city", etc. This does not describe where the data were collected; it describes the area that the data are representative of. This means, for example, that a sample survey could be declared as having national coverage even if some districts of the country were not included in the sample, as long as the sample is nationally representative.
+ +- **`geog_coverage_notes`** *[Optional ; Not repeatable ; String]*
+ + Additional information on the geographic coverage of the study entered as a free text field.
+ +- **`geog_unit`** *[Optional ; Not repeatable ; String]*
+ + Describes the levels of geographic aggregation covered by the data. Particular attention must be paid to include information on the lowest geographic area for which data are representative.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + geog_coverage = "National coverage", + + geog_coverage_notes = "The sample covered the urban and rural areas of all provinces of the country. Some areas of province X were however not accessible due to civil unrest.", + + geog_unit = "The survey provides data representative at the national, provincial and district levels. For the capital city, the data are representative at the ward level.", + + # ... + ), + # ... + ) + ``` +
+ +- **`analysis_unit`** *[Optional ; Not repeatable ; String]*
+ + A study can have multiple units of analysis. This field will list the various units that can be analyzed. For example, a Living Standard Measurement Study (LSMS) may have collected data on households and their members (individuals), on dwelling characteristics, on prices in local markets, on household enterprises, on agricultural plots, and on characteristics of health and education facilities in the sample areas. + + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + analysis_unit = "Data were collected on households, individuals (household members), dwellings, commodity prices at local markets, household enterprises, agricultural plots, and characteristics of health and education facilities." + + # ... + ), + # ... + ) + ``` +
+ +- **`universe`** *[Optional ; Not repeatable ; String]*
+
+  The universe is the group of persons (or other units of observation, such as dwellings, facilities, or others) that are the object of the study and to which any analytic results refer. The universe will rarely cover the entire population of the country. Sample household surveys, for example, may not cover the homeless, nomads, diplomats, or community households. Population censuses do not cover diplomats. Facility surveys may be limited to facilities of a certain type (e.g., public schools). Try to provide the most detailed information possible on the population covered by the survey/census, focusing on excluded categories of the population. For household surveys, age, nationality, and residence commonly help to delineate a given universe, but any of a number of factors may be involved, such as sex, race, income, veteran status, criminal convictions, etc. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is a member of the population under study.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + universe = "The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household.", + + # ... + ), + # ... + ) + ``` +
+ +- **`data_kind`** *[Optional ; Not repeatable ; String]*
+ + This field describes the main type of microdata generated by the study: survey data, census/enumeration data, aggregate data, clinical data, event/transaction data, program source code, machine-readable text, administrative records data, experimental data, psychological test, textual data, coded textual, coded documents, time budget diaries, observation data/ratings, process-produced data, etc. A controlled vocabulary should be used as this information may be used to build facets (filters) in a catalog user interface.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + data_kind = "Sample survey data", + + # ... + ), + # ... + ) + ``` +
+ +- **`notes`** *[Optional ; Not repeatable ; String]*
+
+  This element is provided to document any specific situations, observations, or events that occurred during data collection. Consider addressing items such as:
+ - Was a training of enumerators held? (elaborate)
+ - Was a pilot survey conducted?
+ - Did any events have a bearing on the data quality? (elaborate)
+ - How long did an interview take on average?
+ - In what language(s) were the interviews conducted?
+ - Were there any corrective actions taken by management when problems occurred in the field?
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + notes = "The pre-test for the survey took place from August 15, 2006 - August 25, 2006 and included 14 interviewers who would later become supervisors for the main survey. + Each interviewing team comprised of 3-4 female interviewers (no male interviewers were used due to the sensitivity of the subject matter), together with a field editor and a supervisor and a driver. A total of 52 interviewers, 14 supervisors and 14 field editors were used. Training of interviewers took place at the headquarters of the Statistics Office from July 1 to July 12, 2006. + Data collection took place over a period of about 6 weeks from September 2, 2006 until October 17, 2006. Interviewing took place everyday throughout the fieldwork period, although interviewing teams were permitted to take one day off per week. + Interviews averaged 35 minutes for the household questionnaire (excluding water testing), 23 minutes for the women's questionnaire, and 27 for the under five children's questionnaire (excluding the anthropometry). Interviews were conducted primarily in English, but occasionally used local translation. + Six staff members of the Statistics Office provided overall fieldwork coordination and supervision." + + # ... + ), + # ... + ) + ``` +
+ +- **`quality_statement`** *[Optional ; Not Repeatable]*
+This section lists the specific standards complied with during the execution of this study, and provides the option to formulate a general statement on the quality of the data. Any known quality issue should be reported here. Such issues are better reported by the data producer or curator than left for secondary analysts to discover: transparency in reporting quality issues increases the credibility and reputation of the data provider.
+
```json
"quality_statement": {
  "compliance_description": "string",
  "standards": [
    {
      "name": "string",
      "producer": "string"
    }
  ],
  "other_quality_statement": "string"
}
```
+ + - **`compliance_description`** *[Optional ; Not repeatable ; String]*
+ A statement on compliance with standard quality assessment procedures. The list of these standards can be documented in the next element, `standards`. + - **`standards`** *[Optional ; Repeatable]*
+ An itemized list of quality standards complied with during the execution of the study.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+        The name of the quality standard, if such a standard was used. Include the date when the standard was published and the version of the standard with which the study is compliant.
+ - **`producer`** *[Optional ; Not repeatable ; String]*
+        The producer of the quality standard mentioned in `name`.

+ - **`other_quality_statement`** *[Optional ; Not repeatable ; String]*
+ Any additional statement on the quality of the data, entered as free text. This can be independent of any particular quality standard.
+
+  The example below is a hypothetical illustration; the organization and the name of the quality standard are fictitious.
+
  
  ```r
  my_ddi <- list(
    doc_desc = list(
      # ...
    ),  
    study_desc = list(
      # ... ,
      study_info = list(
        # ... ,  
        
        quality_statement = list(
        
          # Fictitious values, for illustration only
          compliance_description = "The survey was designed and implemented in compliance with the quality assurance guidelines of the National Statistics Office of Popstan.",
          
          standards = list(
            list(name = "Household Survey Quality Guidelines, 2015 edition",
                 producer = "National Statistics Office of Popstan")
          ),
          
          other_quality_statement = "No major quality issue was identified during the implementation of the survey."
          
        ),
        
        # ... 
      ),
    # ... 
    )
  ```
+ +- **`ex_post_evaluation`** *[Optional ; Not Repeatable]*
+Ex-post evaluations are frequently done within large statistical or research organizations, in particular when a study is intended to be repeated. Such evaluations are recommended by the [Generic Statistical Business Process Model](https://statswiki.unece.org/display/GSBPM/Generic+Statistical+Business+Process+Model) (GSBPM). This section of the schema is used to describe the evaluation procedures and their outcomes.
+ +```json +"ex_post_evaluation": { + "completion_date": "string", + "type": "string", + "evaluator": [ + { + "name": "string", + "affiliation": "string", + "abbr": "string", + "role": "string" + } + ], + "evaluation_process": "string", + "outcomes": "string" +} +``` +
+ + - **`completion_date`** *[Optional ; Not repeatable ; String]*
+ The date the ex-post evaluation was completed.
+ - **`type`** *[Optional ; Not Repeatable]*
+ The `type` attribute identifies the type of evaluation with or without the use of a controlled vocabulary.
+ - **`evaluator`** *[Optional ; Repeatable]*
+ The evaluator element identifies the person(s) and/or organization(s) involved in the evaluation.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization involved in the evaluation.
+ - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the individual or organization mentioned in `name`.
+ - **`abbr`** *[Optional ; Not repeatable ; String]*
+ An abbreviation for the organization mentioned in `name`.
+ - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role played by the individual or organization mentioned in `name` in the evaluation process.
+ - **`evaluation_process`** *[Optional ; Not repeatable ; String]*
+      A description of the evaluation process. This may include information on the dates the evaluation was conducted, cost/budget, relevance, institutional or legal arrangements, etc.
+ - **`outcomes`** *[Optional ; Not repeatable ; String]*
+ A description of the outcomes of the evaluation. It may include a reference to an evaluation report.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... , + + ex_post_evaluation = list( + + completion_date = "2020-04-30", + + type = "Independent evaluation requested by the survey sponsor", + + evaluator = list( + list(name = "John Doe", + affiliation = "Alpha Consulting, Ltd.", + abbr = "AC", + role = "Evaluation of the sampling methodology"), + list(name = "Jane Smith", + affiliation = "Beta Statistical Services, Ltd.", + abbr = "BSS", + role = "Evaluation of the data processing and analysis") + ), + + evaluation_process = "In-depth review of pre-collection and collection procedures", + + outcomes = "The following steps were highly effective in increasing response rates." + + ) + ), + # ... + ) + ``` +
+ + +#### Study development + +**`study_development`** *[Optional ; Not repeatable]*
+ +```json +"study_development": { + "development_activity": [ + { + "activity_type": "string", + "activity_description": "string", + "participants": [ + { + "name": "string", + "affiliation": "string", + "role": "string" + } + ], + "resources": [ + { + "name": "string", + "origin": "string", + "characteristics": "string" + } + ], + "outcome": "string" + } + ] +} +``` +
+ +This section is used to describe the process that led to the production of the final output of the study, from its inception/design to the dissemination of the final output. + +- **`development_activity`** *[Optional ; Repeatable]*
+Each activity should be documented separately (this is a repeatable set of metadata elements). The [Generic Statistical Business Process Model (GSBPM)](https://statswiki.unece.org/display/GSBPM/Generic+Statistical+Business+Process+Model) provides a useful decomposition of the study development process, which can be used to identify the activities to be described.
+
+  - **`activity_type`** *[Optional ; Not repeatable ; String]*
+ The type of activity. A controlled vocabulary can be used, possibly comprising the main components of the GSBPM: `{Needs specification, Design, Build, Collect, Process, Analyze, Disseminate, Evaluate}`).
+ - **`activity_description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the activity.
+ - **`participants`** *[Optional ; Repeatable]*
+ A list of participants (persons or organizations) in the activity. This is a repeatable set of elements; each participant can be documented separately.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the participating person or organization.
+ - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ Affiliation of the person or organization mentioned in `name`.
+ - **`role`** *[Optional ; Not repeatable ; String]*
+ Specific role (participation) of the person or organization mentioned in `name`.

+ - **`resources`** *[Optional ; Not Repeatable]*
+ A description of the data sources and other resources used to implement the activity.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the resource.
+ - **`origin`** *[Optional ; Not repeatable ; String]*
+ The origin of the resource mentioned in `name`.
+ - **`characteristics`** *[Optional ; Not repeatable ; String]*
+ The characteristics of the resource mentioned in `name`.

+ - **`outcome`** *[Optional ; Not repeatable ; String]*
+ Description of the main outcome of the activity.
+ + +```r +my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... ), + + study_development = list( + + development_activity = list( + + list( + activity_type = "Questionnaire design and piloting", + activity_description = "", + participants = list( + list(name = "", + affiliation = "", + role = ""), + list(name = "", + affiliation = "", + role = ""), + list(name = "", + affiliation = "", + role = "") + ), + resources = list( + list(name = "", + origin = "", + characteristics = "") + ), + outcome = "" + ), + + list( + activity_type = "Interviewers training", + activity_description = "", + participants = list( + list(name = "", + affiliation = "", + role = ""), + list(name = "", + affiliation = "", + role = ""), + list(name = "", + affiliation = "", + role = "") + ), + resources = list( + list(name = "", + origin = "", + characteristics = "") + ), + outcome = "" + ) + + ) + + ), + + # ... + +) +``` +
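+A hypothetical, partially completed entry for a single activity might look like this (all names, dates, and descriptions are fictitious):
+
```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    study_development = list(
      development_activity = list(
        list(
          activity_type = "Design",
          activity_description = "Design and piloting of the household questionnaire.",
          participants = list(
            list(name = "National Statistics Office of Popstan",
                 affiliation = "Ministry of Planning",
                 role = "Questionnaire design and pilot implementation")
          ),
          resources = list(
            list(name = "Questionnaire of the previous (2015) round of the survey",
                 origin = "National Statistics Office of Popstan",
                 characteristics = "Paper questionnaire, used as the starting point for the revision")
          ),
          outcome = "A final questionnaire, translated into the two main national languages, approved by the survey steering committee."
        )
      )
    )
    # ...
  )
)
```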
+ + +#### Method + +**`method`** *[Optional ; Not Repeatable]*
+This section describes the methodology and processing involved in a study.
+ +```json +"method": { + "data_collection": {}, + "method_notes": "string", + "analysis_info": {}, + "study_class": null, + "data_processing": [], + "coding_instructions": [] +} +``` +
+ +- **`data_collection`** *[Optional ; Not Repeatable]*
+A block of metadata elements used to describe the methodology employed in a data collection. This includes the design of the questionnaire, sampling, supervision of field work, and other characteristics of the data collection phase. + +```json +"data_collection": { + "time_method": "string", + "data_collectors": [], + "collector_training": [], + "frequency": "string", + "sampling_procedure": "string", + "sample_frame": {}, + "sampling_deviation": "string", + "coll_mode": null, + "research_instrument": "string", + "instru_development": "string", + "instru_development_type": "string", + "sources": [], + "coll_situation": "string", + "act_min": "string", + "control_operations": "string", + "weight": "string", + "cleaning_operations": "string" +} +``` +
+ + - **`time_method`** *[Optional ; Not repeatable ; String]*
+ The time method or time dimension of the data collection. A controlled vocabulary can be used. The entries for this element may include "panel survey", "cross-section", "trend study", or "time-series". + + - **`data_collectors`** *[Optional ; Not Repeatable]*
+ The entity (individual, agency, or institution) responsible for administering the questionnaire or interview or compiling the data. + +```json +"data_collectors": [ + { + "name": "string", + "affiliation": "string", + "abbr": "string", + "role": "string" + } +] +``` + + - **`name`** *[Optional ; Not repeatable ; String]*
+ In most cases, we will record here the name of the agency, not the name of interviewers. Only in the case of very small-scale surveys, with a very limited number of interviewers, the name of persons will be included as well. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the data collector mentioned in `name`. + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The abbreviation given to the agency mentioned in `name`. + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the person or agency mentioned in `name`.
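+
+    A minimal, hypothetical illustration of `time_method` and `data_collectors` (the agency name is fictitious):
+
  ```r
  my_ddi <- list(
    # ... ,
    study_desc = list(
      # ... ,
      method = list(
        data_collection = list(
          # ... ,
          time_method = "Cross-section",
          data_collectors = list(
            list(name = "National Statistics Office of Popstan",
                 abbr = "NSOP",
                 affiliation = "Ministry of Planning",
                 role = "Data collection and data editing")
          )
          # ...
        )
        # ...
      )
      # ...
    )
    # ...
  )
  ```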

+ + - **`collector_training`** *[Optional ; Repeatable]*
+ Describes the training provided to data collectors including interviewer training, process testing, compliance with standards etc. This set of elements is repeatable, to capture different aspects of the training process. + +```json +"collector_training": [ + { + "type": "string", + "training": "string" + } +] +``` +
+ + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of training being described. For example, "Training of interviewers", "Training of controllers", "Training of cartographers", "Training on the use of tablets for data collection", etc.
+ - **`training`** *[Optional ; Not repeatable ; String]*
+ A brief description of the training. This may include information on the dates and duration, audience, location, content, trainers, issues, etc.
+ + - **`frequency`** *[Optional ; Not repeatable ; String]*
+ For data collected at more than one point in time, the frequency with which the data were collected.
+ + - **`sampling_procedure`** *[Optional ; Not repeatable ; String]*
+ This field only applies to sample surveys. It describes the type of sample and sample design used to select the survey respondents to represent the population. This section should include summary information that includes (but is not limited to): sample size (expected and actual) and how the sample size was decided; level of representation of the sample; sample frame used, and listing exercise conducted to update it; sample selection process (e.g., probability proportional to size or over sampling); stratification (implicit and explicit); design omissions in the sample; strategy for absent respondents/not found/refusals (replacement or not). Detailed information on the sample design is critical to allow users to adequately calculate sampling errors and confidence intervals for their estimates. To do that, they will need to be able to clearly identify the variables in the dataset that represent the different levels of stratification and the primary sampling unit (PSU).
+ In publications and reports, the description of sampling design often contains complex formulas and symbols. As the XML and JSON formats used to store the metadata are plain text files, they cannot contain these complex representations. You may however provide references (title/author/date) to documents where such detailed descriptions are provided, and make sure that the documents (or links to the documents) are provided in the catalog where the survey metadata are published.
+ + - **`sample_frame`** *[Optional ; Not Repeatable]*
+ A description of the sample frame used for identifying the population from which the sample was taken. For example, a telephone book may be a sample frame for a phone survey. Or the listing of enumeration areas (EAs) of a population census can provide a sample frame for a household survey. In addition to the name, label and text describing the sample frame, this structure lists who maintains the sample frame, the period for which it is valid, a use statement, the universe covered, the type of unit contained in the frame as well as the number of units available, the reference period of the frame and procedures used to update the frame. + +```json +"sample_frame": { + "name": "string", + "valid_period": [ + { + "event": "string", + "date": "string" + } + ], + "custodian": "string", + "universe": "string", + "frame_unit": { + "is_primary": null, + "unit_type": "string", + "num_of_units": "string" + }, + "reference_period": [ + { + "event": "string", + "date": "string" + } + ], + "update_procedure": "string" +} +``` +
+ + - **`name`** *[Optional ; Not Repeatable]*
+ The name (title) of the sample frame.
+ - **`valid_period`** *[Optional ; Repeatable]*
+ Defines a time period for the validity of the sampling frame, using a list of events and dates.
+ - **`event`** *[Optional ; Not repeatable ; String]*
+ The event can for example be `start` or `end`. + - **`date`** *[Optional ; Not repeatable ; String]*
+ The date corresponding to the event, entered in ISO 8601 format: YYYY-MM-DD.

+ + - **`custodian`** *[ Optional ; Not Repeatable]*
+ Custodian identifies the agency or individual responsible for creating and/or maintaining the sample frame. + - **`universe`** *[Optional ; Not Repeatable]*
+    A description of the universe of population covered by the sample frame. Age, nationality, and residence commonly help to delineate a given universe, but any of a number of factors may be involved, such as sex, race, income, etc. The universe may consist of elements other than persons, such as housing units, court cases, deaths, countries, etc. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is included in the sample frame.
+    - **`frame_unit`** *[Optional ; Not Repeatable]*
+ Provides information about the sampling frame unit. + - **`is_primary`** *[Optional ; Boolean ; Not Repeatable]*
+ This boolean attribute (true/false) indicates whether the unit is primary or not. + - **`unit_type`** *[Optional ; Not repeatable ; String]*
+ The type of the sampling frame unit (for example "household", or "dwelling"). + - **`num_of_units`** *[Optional ; Not Repeatable ; String]*
+ The number of units in the sample frame, possibly with information on its distribution (e.g. by urban/rural, province, or other).

+
+    - **`reference_period`** *[Optional ; Repeatable]*
+ Indicates the period of time in which the sampling frame was actually used for the study in question. Use ISO 8601 date format to enter the relevant date(s). + - **`event`** *[Optional ; Not repeatable ; String]*
+ Indicates the type of event that the date corresponds to, e.g., "start", "end", "single". + - **`date`** *[Optional ; Not repeatable ; String]*
+ The relevant date in ISO 8601 date/time format.

+ + - **`update_procedure`** *[Optional ; Not repeatable ; String]*
+ This element is used to describe how and with what frequency the sample frame is updated. For example: "The lists and boundaries of enumeration areas are updated every ten years at the occasion of the population census cartography work. Listing of households in enumeration areas are updated as and when needed, based on their selection in survey samples."
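+
+    The `valid_period` and `reference_period` blocks are not shown in the larger `data_collection` example further below; as a minimal sketch (dates are purely illustrative), they could be entered as follows within `sample_frame`:
+
+```r
+# Illustrative fragment of a sample_frame entry (dates are hypothetical)
+sample_frame = list(
+  name = "Listing of Enumeration Areas (EAs) from the Population and Housing Census 2011",
+  valid_period = list(
+    list(event = "start", date = "2011-07-01"),
+    list(event = "end",   date = "2021-06-30")
+  ),
+  reference_period = list(
+    list(event = "single", date = "2020-03-15")
+  )
+  # other sample_frame elements (custodian, universe, frame_unit, update_procedure, ...)
+)
+```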

+ + - **`sampling_deviation`** *[Optional ; Not repeatable ; String]*
+
+    Sometimes the reality of the field requires a deviation from the sampling design (for example, because some zones could not be accessed due to weather problems, political instability, etc.). If the sample design deviated from the original plan for any reason, this can be reported here. This element can also provide information on the correspondence, and possible discrepancies, between the sampled units (obtained) and available statistics for the population as a whole (age, sex ratio, marital status, etc.).
+ + - **`coll_mode`** *[Optional ; Repeatable ; String]*
+
+    The mode of data collection is the manner in which the interview was conducted or information was gathered. Ideally, a controlled vocabulary will be used to constrain the entries in this field (which could include items like "telephone interview", "face-to-face paper and pen interview", "face-to-face computer-assisted interviews (CAPI)", "mail questionnaire", "computer-assisted telephone interviews (CATI)", "self-administered web forms", "measurement by sensor", and others).
+    This is a repeatable field, as some data collection activities implement multi-mode data collection (for example, a population census can offer respondents the option to submit information via web forms, telephone interviews, mailed forms, or face-to-face interviews). Note that in the API description (see screenshot above), the element is described as having type "null", not {}. This is because the element can be entered either as a list (repeatable element) or as a string.
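+
+    As a minimal sketch (values are purely illustrative), the element can therefore be provided either as a single string or as a list of strings:
+
+```r
+# Within data_collection = list(...):
+
+# Single-mode data collection: a simple string
+coll_mode = "Face-to-face computer-assisted interviews (CAPI)"
+
+# Multi-mode data collection: a list of strings
+coll_mode = list(
+  "Self-administered web forms",
+  "Computer-assisted telephone interviews (CATI)",
+  "Face-to-face computer-assisted interviews (CAPI)"
+)
+```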
+ + - **`research_instrument`** *[Optional ; Not repeatable ; String]*
+ + The research instrument refers to the questionnaire or form used for collecting data. The following should be mentioned:
+ - List of questionnaires and short description of each (all questionnaires must be provided as External Resources)
+ - In what language(s) was/were the questionnaire(s) available?
+ - Information on the questionnaire design process (based on a previous questionnaire, based on a standard model questionnaire, review by stakeholders). If a document was compiled that contains the comments provided by the stakeholders on the draft questionnaire, or a report prepared on the questionnaire testing, a reference to these documents can be provided here. + + - **`instru_development`** *[Optional ; Not repeatable ; String]*
+ + Describe any development work on the data collection instrument. This may include a description of the review process, standards followed, and a list of agencies/people consulted.
+ + - **`instru_development_type`** *[Optional ; Repeatable ; String]*
+ + The instrument development type. This element will be used when a pre-defined list of options (controlled vocabulary) is available. + + - **`sources`** *[Optional ; Repeatable]*
+ A description of sources used for developing the methodology of the data collection. + +```json +"sources": [ + { + "name": "string", + "origin": "string", + "characteristics": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name and other information on the source. For example, "United States Internal Revenue Service Quarterly Payroll File"
+ - **`origin`** *[Optional ; Not repeatable ; String]*
+ For historical materials, information about the origin(s) of the sources and the rules followed in establishing the sources should be specified. This may not be relevant to survey data. + - **`characteristics`** *[Optional ; Not repeatable ; String]*
+ Assessment of characteristics and quality of source material. This may not be relevant to survey data.
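+
+    The `sources` block is not included in the larger `data_collection` example below; a minimal, hypothetical entry could look as follows:
+
+```r
+# Hypothetical sources entry, within data_collection = list(...)
+sources = list(
+  list(name = "MICS3 Model Questionnaire (UNICEF)",
+       origin = "Standard model questionnaire developed for the third round of the Multiple Indicator Cluster Surveys",
+       characteristics = "Internationally reviewed and tested survey instrument, adapted to the national context")
+)
+```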

+ + - **`coll_situation`** *[Optional ; Not repeatable ; String]*
+ + A description of noteworthy aspects of the data collection situation. Includes information on factors such as cooperativeness of respondents, duration of interviews, number of call-backs, etc. + + - **`act_min`** *[Optional ; Not repeatable ; String]*
+ + A summary of actions taken to minimize data loss. This includes information on actions such as follow-up visits, supervisory checks, historical matching, estimation, etc. Note that this element does not have to include detailed information on response rates, as a specific metadata element is provided for that purpose in section `analysis_info / response_rate` (see below).
+ + - **`control_operations`** *[Optional ; Not repeatable ; String]*
+ + This element will provide information on the oversight of the data collection, i.e. on methods implemented to facilitate data control performed by the primary investigator or by the data archive.
+ + - **`weight`** *[Optional ; Not repeatable ; String]*
+
+    This field only applies to sample surveys. The use of sampling procedures may make it necessary to apply weights to produce accurate statistical results. Describe here the criteria for using weights in analysis of a collection, and provide a list of the variables used as weighting coefficients. If more than one weighting variable is provided, describe how these variables differ from each other and what the purpose of each of them is.
+ + - **`cleaning_operations`** *[Optional ; Not repeatable ; String]*
+ + A description of the methods used to clean or edit the data, e.g., consistency checking, wild code checking, etc. The data editing should contain information on how the data was treated or controlled for in terms of consistency and coherence. This item does not concern the data entry phase but only the editing of data whether manual or automatic. It should provide answers to questions like: Was a hot deck or a cold deck technique used to edit the data? Were corrections made automatically (by program), or by visual control of the questionnaire? What software was used? If materials are available (specifications for data editing, report on data editing, programs used for data editing), they should be listed here and provided as external resources in data catalogs (the best documentation of data editing consists of well-documented reproducible scripts).
+
+  Example for the `data_collection` section:
+
+
+  ```r
+  my_ddi <- list(
+
+    doc_desc = list(
+      # ...
+    ),
+
+    study_desc = list(
+      # ... ,
+      study_info = list(
+        # ...
+      ),
+      study_development = list(
+        # ...
+      ),
+
+      method = list(
+
+        data_collection = list(
+
+          time_method = "cross-section",
+
+          data_collectors = list(
+            list(name = "Staff from the National Statistics Office",
+                 abbr = "NSO",
+                 affiliation = "Ministry of Planning")
+          ),
+
+          collector_training = list(
+            list(
+              type = "Training of interviewers",
+              training = "72 staff (interviewers) were trained from [date] to [date] at the NSO headquarters. The training included 2 days of field work."
+            ),
+            list(
+              type = "Training of controllers and supervisors",
+              training = "A 3-day training of 10 controllers and 2 supervisors was organized from [date] to [date]. The controllers and supervisors had previously participated in the interviewer training."
+            )
+          ),
+
+          sampling_procedure = "A sample of 500 Enumeration Areas (EAs) was randomly selected from the sample frame, 300 in urban areas and 200 in rural areas. In each selected EA, 10 households were then randomly selected. 5000 households were thus selected for the sample (3000 urban and 2000 rural). The distribution of the sample (households) by province is as follows:
+            - Province A: Total: 1800   Urban: 1000   Rural: 800
+            - Province B: Total: 1200   Urban: 500    Rural: 700
+            - Province C: Total: 2000   Urban: 1500   Rural: 500",
+
+          sample_frame = list(
+            name = "Listing of Enumeration Areas (EAs) from the Population and Housing Census 2011",
+            custodian = "National Statistics Office",
+            universe = "The sample frame contains 25365 EAs covering the entire territory of the country. EAs contain an average of 400 households in rural areas, and 580 in urban areas.",
+            frame_unit = list(
+              is_primary = TRUE,
+              unit_type = "Enumeration areas (EAs)",
+              num_of_units = "25365, including 15100 in urban areas, and 10265 in rural areas."
+            ),
+            update_procedure = "The sample frame only provides EAs; a full household listing was conducted in each selected EA to provide an updated list of households."
+          ),
+
+          sampling_deviation = "Due to floods, two sampled rural EAs in province A could not be reached. The sample was thus reduced to 4980 households. The response rate was 90%, so the actual final sample size was 4482 households.",
+
+          coll_mode = "Face-to-face interviews, conducted using tablets (CAPI)",
+
+          research_instrument = "The questionnaires for the Generic MICS were structured questionnaires based on the MICS3 Model Questionnaire with some modifications and additions. A household questionnaire was administered in each household, which collected various information on household members including sex, age, relationship, and orphanhood status. The household questionnaire includes household characteristics, support to orphaned and vulnerable children, education, child labour, water and sanitation, household use of insecticide treated mosquito nets, and salt iodization, with optional modules for child discipline, child disability, maternal mortality and security of tenure and durability of housing.
+          In addition to a household questionnaire, questionnaires were administered in each household for women age 15-49 and children under age five. For children, the questionnaire was administered to the mother or caretaker of the child.
+          The women's questionnaire includes women's characteristics, child mortality, tetanus toxoid, maternal and newborn health, marriage, polygyny, female genital cutting, contraception, and HIV/AIDS knowledge, with optional modules for unmet need, domestic violence, and sexual behavior.
+          The children's questionnaire includes children's characteristics, birth registration and early learning, vitamin A, breastfeeding, care of illness, malaria, immunization, and anthropometry, with an optional module for child development.
+          The questionnaires were developed in English from the MICS3 Model Questionnaires and translated into local languages. After an initial review, the questionnaires were translated back into English by an independent translator with no prior knowledge of the survey. The back translation from the local language version was independently reviewed and compared to the English original. Differences in translation were reviewed and resolved in collaboration with the original translators. The English and local language questionnaires were both piloted as part of the survey pretest.",
+
+          instru_development = "The questionnaire was pre-tested with split-panel tests, as well as an analysis of non-response rates for individual items, and response distributions.",
+
+          coll_situation = "Floods in province A made access to two selected enumeration areas impossible.",
+
+          act_min = "Local authorities and local staff from the Ministry of Health supported an awareness campaign, which contributed to achieving a response rate of 90%.",
+
+          control_operations = "Interviewing was conducted by teams of interviewers. Each interviewing team comprised 3-4 female interviewers, a field editor and a supervisor, and a driver. Each team used a 4-wheel drive vehicle to travel from cluster to cluster (and where necessary within cluster).
+          The role of the supervisor was to coordinate field data collection activities, including management of the field teams, supplies and equipment, finances, maps and listings, to coordinate with local authorities concerning the survey plan, and to make arrangements for accommodation and travel. Additionally, the field supervisor assigned the work to the interviewers, spot-checked work, maintained field control documents, and sent completed questionnaires and progress reports to the central office.
+          The field editor was responsible for validating questionnaires at the end of the day, when the data from interviews were transferred to their laptops. This included checking for missed questions, skip errors, fields incorrectly completed, and checking for inconsistencies in the data. The field editor also observed interviews and conducted review sessions with interviewers.
+          Responsibilities of the supervisors and field editors are described in the Instructions for Supervisors and Field Editors, together with the different field controls that were in place to control the quality of the fieldwork.
+          Field visits were also made by a team of central staff on a periodic basis during fieldwork. The senior staff of the NSO also made 3 visits to field teams to provide support and to review progress.",
+
+          weight = "Sample weights were calculated for each of the data files. Sample weights for the household data were computed as the inverse of the probability of selection of the household, computed at the sampling domain level (urban/rural within each region). The household weights were adjusted for non-response at the domain level, and were then normalized by a constant factor so that the total weighted number of households equals the total unweighted number of households. The household weight variable is called HHWEIGHT and is used with the HH data and the HL data.
+          Sample weights for the women's data used the un-normalized household weights, adjusted for non-response for the women's questionnaire, and were then normalized by a constant factor so that the total weighted number of women's cases equals the total unweighted number of women's cases.
+          Sample weights for the children's data followed the same approach as the women's and used the un-normalized household weights, adjusted for non-response for the children's questionnaire, and were then normalized by a constant factor so that the total weighted number of children's cases equals the total unweighted number of children's cases.",
+
+          cleaning_operations = "Data editing took place at a number of stages throughout the processing, including:
+          a) Office editing and coding
+          b) During data entry
+          c) Structure checking and completeness
+          d) Secondary editing
+          e) Structural checking of SPSS data files
+          Detailed documentation of the editing of data can be found in the 'Data processing guidelines' document provided as an external resource."
+        )
+
+      ),
+      # ...
+    )
+
+  )
+  ```
+ +- **`method_notes`** *[Optional ; Not repeatable ; String]*
+ +This element is provided to capture any additional relevant information on the data collection methodology, which could not fit in the previous metadata elements. + +- **`analysis_info`** *[Optional ; Not Repeatable]*
+This block of elements is used to organize information related to data quality and appraisal.
+ +```json +"analysis_info": { + "response_rate": "string", + "sampling_error_estimates": "string", + "data_appraisal": "string" +} +``` +
+ + - **`response_rate`** *[Optional ; Not repeatable ; String]*
+    The response rate is the percentage of sample units that participated in the survey, based on the original sample size. Omissions may occur due to refusal to participate, inability to locate the respondent, or other reasons. This element is used to provide a narrative description of the response rate, possibly by stratum or other criteria, and if possible with an identification of the likely causes. If information is available on the causes of non-response (refusal/not found/other), it can be reported here. This field can also be used to describe non-response in population censuses.
+ - **`sampling_error_estimates`** *[Optional ; Not repeatable ; String]*
+    Sampling errors are intended to measure how precisely one can estimate a population value from a given sample. For sample surveys, it is good practice to calculate and publish sampling errors. This field is used to provide information on these calculations (not to provide the sampling errors themselves, which should be made available in publications or reports). Information can be provided on which ratios/indicators have been subjected to the calculation of sampling errors, and on the software used for computing them. A reference to a report or other document where the results can be found can also be provided.
+ - **`data_appraisal`** *[Optional ; Not repeatable ; String]*
+ This section is used to report any other action taken to assess the reliability of the data, or any observations regarding data quality. Describe here issues such as response variance, interviewer and response bias, question bias, etc. For a population census, this can include information on the main results of a post enumeration survey (a report should be provided in external resources and mentioned here); it can also include relevant comparisons with data from other sources that can be used as benchmarks.
+ + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... ), + study_development = list( + # ... ), + method = list( + # ... , + + analysis_info = list( + + response_rate = "Of these, 4996 were occupied households and 4811 were successfully interviewed for a response rate of 96.3%. Within these households, 7815 eligible women aged 15-49 were identified for interview, of which 7505 were successfully interviewed (response rate 96.0%), and 3242 children aged 0-4 were identified for whom the mother or caretaker was successfully interviewed for 3167 children (response rate 97.7%). These give overall response rates (household response rate times individual response rate) for the women's interview of 92.5% and for the children's interview of 94.1%.", + + sampling_error_estimates = "Estimates from a sample survey are affected by two types of errors: 1) non-sampling errors and 2) sampling errors. Non-sampling errors are the results of mistakes made in the implementation of data collection and data processing. Numerous efforts were made during implementation of the 2005-2006 MICS to minimize this type of error, however, non-sampling errors are impossible to avoid and difficult to evaluate statistically. If the sample of respondents had been a simple random sample, it would have been possible to use straightforward formulae for calculating sampling errors. However, the 2005-2006 MICS sample is the result of a multi-stage stratified design, and consequently needs to use more complex formulae. The SPSS complex samples module has been used to calculate sampling errors for the 2005-2006 MICS. This module uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. This method is documented in the SPSS file CSDescriptives.pdf found under the Help, Algorithms options in SPSS. + Sampling errors have been calculated for a select set of statistics (all of which are proportions due to the limitations of the Taylor linearization method) for the national sample, urban and rural areas, and for each of the five regions. For each statistic, the estimate, its standard error, the coefficient of variation (or relative error - the ratio between the standard error and the estimate), the design effect, and the square root design effect (DEFT - the ratio between the standard error using the given sample design and the standard error that would result if a simple random sample had been used), as well as the 95 percent confidence intervals (+/-2 standard errors). 
Details of the sampling errors are presented in the sampling errors appendix to the report and in the sampling errors table presented in the external resources.", + + data_appraisal = "A series of data quality tables and graphs are available to review the quality of the data and include the following: + - Age distribution of the household population + - Age distribution of eligible women and interviewed women + - Age distribution of eligible children and children for whom the mother or caretaker was interviewed + - Age distribution of children under age 5 by 3 month groups + - Age and period ratios at boundaries of eligibility + - Percent of observations with missing information on selected variables + - Presence of mother in the household and person interviewed for the under 5 questionnaire + - School attendance by single year age + - Sex ratio at birth among children ever born, surviving and dead by age of respondent + - Distribution of women by time since last birth + - Scatter plot of weight by height, weight by age and height by age + - Graph of male and female population by single years of age + - Population pyramid + The results of each of these data quality tables are shown in the appendix of the final report. + The general rule for presentation of missing data in the final report tabulations is that a column is presented for missing data if the percentage of cases with missing data is 1% or more. Cases with missing data on the background characteristics (e.g. education) are included in the tables, but the missing data rows are suppressed and noted at the bottom of the tables in the report." + + ), + + # ... + ) + # ... + ) + ``` +
+ +- **`study_class`** *[Optional ; Repeatable ; String]*
+ +This element can be used to give the data archive's class or study status number, which indicates the processing status of the study. But it can also be used as an element to indicate the type of study, based on a controlled vocabulary. The element is repeatable, allowing one study to belong to more than one class. Note that in the API description (see screenshot above), the element is described as having type "null", not {}. This is due to the fact that the element can be entered either as a list (repeatable element) or as a string.
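+
+As a minimal sketch (values are purely illustrative), the element can thus be provided either as a single string or as a list:
+
+```r
+# Within method = list(...):
+
+# As a single string
+study_class = "Nationally representative sample survey"
+
+# Or as a list, when the study belongs to more than one class
+study_class = list("Sample survey", "Living conditions and welfare study")
+```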
+ +- **`data_processing`** *[Optional ; Repeatable]*
+```json +"data_processing": [ + { + "type": "string", + "description": "string" + } +] +``` +
+ +This element is used to describe how data were electronically captured (e.g., entered in the field, in a centralized manner by data entry clerks, captured electronically using tablets and a CAPI application, via web forms, etc.). Information on devices and software used for data capture can also be provided here. Other data processing procedures not captured elsewhere in the documentation can be described here (tabulation, etc.)
+ - **`type`** *[Optional ; Not repeatable ; String]*
+    The `type` attribute supports better classification of this activity, including the optional use of a controlled vocabulary. The vocabulary could include options like "data capture", "data validation", "variable derivation", "tabulation", "data visualizations", "anonymization", "documentation", etc.
+  - **`description`** *[Optional ; Not repeatable ; String]*
+    A description of the data processing task.
+
+ +- **`coding_instructions`** *[Optional ; Repeatable]*
+The `coding_instructions` elements can be used to describe specific coding instructions used in data processing, cleaning, or tabulation. Providing this information may however be complex and very tedious for datasets with a significant number of variables, where hundreds of commands are used to process the data. An alternative option, preferable in many cases, will be to publish reproducible data editing, tabulation and analysis scripts together with the data, as related resources.
+ +```json +"coding_instructions": [ + { + "related_processes": "string", + "type": "string", + "txt": "string", + "command": "string", + "formal_language": "string" + } +] +``` +
+ + - **`related_processes`** *[Optional ; Not repeatable ; String]*
+ The `related_processes` links a coding instruction to one or more processes such as "data editing", "recoding", "imputations and derivations", "tabulation", etc.
+ - **`type`** *[Optional ; Not repeatable ; String]*
+ The "type" attribute supports the classification of this activity (e.g. "topcoding"). A controlled vocabulary can be used.
+ - **`txt`** *[Optional ; Not repeatable ; String]*
+ A description of the code/command, in a human readable form.
+ - **`command`** *[Optional ; Not repeatable ; String]*
+ The command code for the coding instruction.
+ - **`formal_language`** *[Optional ; Not repeatable ; String]*
+ The language of the command code, e.g. "Stata", "R", "SPSS", "SAS", "Python", etc.
+ + + + ```r + my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... , + study_info = list( + # ... ), + study_development = list( + # ... ), + + method = list( + # ... , + study_class = "", + + data_processing = list( + list(type = "Data capture", + description = "Data collection was conducted using tablets and Survey Solutions software. Multiple quality controls and validations are embedded in the questionnaire."), + list(type = "Batch data editing", + description = "Data editing was conducted in batch using a R script, including techniques of hot deck, imputations, and recoding."), + list(type = "Tabulation and visualizations", + description = "The 25 tables and the visualizations published in the survey report were produced using Stata (script 'tabulation.do')."), + list(type = "Anonymization", + description = "An anonymized version of the dataset, published as a public use file, was created using the R package sdcMicro.") + ), + + coding_instructions = list( + list(related_processes = "", + type = "", + txt = "Suppression of observations with ...", + command = "", + formal_language = "Stata"), + list(related_processes = "", + type = "", + txt = "Top coding age", + command = "", + formal_language = "Stata"), + list(related_processes = "", + type = "", + txt = "", + command = "", + formal_language = "Stata") + ) + + ) + # ... + ) + ``` +
+ + +#### Data access + +**`data_access`** *[Optional ; Not Repeatable]*
+This section describes the access conditions and terms of use for the dataset. This set of elements should be used when the access conditions are well-defined and are unlikely to change. An alternative option is to document the terms of use in the catalog where the data will be published, instead of "freezing" them in a metadata file. + +```json +"data_access": { + "dataset_availability": { + "access_place": "string", + "access_place_url": "string", + "original_archive": "string", + "status": "string", + "coll_size": "string", + "complete": "string", + "file_quantity": "string", + "notes": "string" + }, + "dataset_use": {} +} +``` +
+ +- **`dataset_availability`** *[Optional ; Not Repeatable]*
+Information on the availability and storage of the dataset. + + - **`access_place`** *[Optional ; Not repeatable ; String]*
+ Name of the location where the data collection is currently stored.
+ - **`access_place_url`** *[Optional ; Not repeatable ; String]*
+ The URL of the website of the location where the data collection is currently stored.
+ - **`original_archive`** *[Optional ; Not repeatable ; String]*
+ Archive from which the data collection was obtained, if any (the originating archive). Note that the schema we propose provides an element `provenance`, which is not part of the DDI, that can be used to document the origin of a dataset.
+ - **`status`** *[Optional ; Not repeatable ; String]*
+ A statement of the data availability. An archive may need to indicate that a collection is unavailable because it is embargoed for a period of time, because it has been superseded, because a new edition is imminent, etc. This element will rarely be used.
+ - **`coll_size`** *[Optional ; Not repeatable ; String]*
+ Extent of the collection. This is a summary of the number of physical files that exist in a collection. We will record here the number of files that contain data and note whether the collection contains other machine-readable documentation and/or other supplementary files and information such as data dictionaries, data definition statements, or data collection instruments. This element will rarely be used.
+ - **`complete`** *[Optional ; Not repeatable ; String]*
+ This item indicates the relationship of the data collected to the amount of data coded and stored in the data collection. Information as to why certain items of collected information were not included in the data file stored by the archive should be provided here. Example: "Because of embargo provisions, data values for some variables have been masked. Users should consult the data definition statements to see which variables are under embargo." This element will rarely be used.
+ - **`file_quantity`** *[Optional ; Not repeatable ; String]*
+ The total number of physical files associated with a collection. This element will rarely be used.
+ - **`notes`** *[Optional ; Not repeatable ; String]*
+ Additional information on the dataset availability, not included in one of the elements above.
+
+
+
+  ```r
+  my_ddi <- list(
+    doc_desc = list(
+      # ...
+    ),
+    study_desc = list(
+      # ... ,
+      study_info = list(
+        # ...
+      ),
+      study_development = list(
+        # ...
+      ),
+      method = list(
+        # ...
+      ),
+
+      data_access = list(
+
+        dataset_availability = list(
+          access_place = "World Bank Microdata Library",
+          access_place_url = "http://microdata.worldbank.org",
+          status = "Available for public use",
+          coll_size = "4 data files + machine-readable questionnaire and report (2 PDF files) + data editing script (1 Stata do file).",
+          complete = "The variables 'latitude' and 'longitude' (GPS location of respondents) are not included, for confidentiality reasons.",
+          file_quantity = "7"
+        ),
+
+        # ...
+      )
+    )
+    # ...
+  )
+  ```
+ +- **`dataset_use`** *[Optional ; Not Repeatable]*
+Information on the terms of use for the study dataset. + +```json +"dataset_use": { + "conf_dec": [ + { + "txt": "string", + "required": "string", + "form_url": "string", + "form_id": "string" + } + ], + "spec_perm": [ + { + "txt": "string", + "required": "string", + "form_url": "string", + "form_id": "string" + } + ], + "restrictions": "string", + "contact": [ + { + "name": "string", + "affiliation": "string", + "uri": "string", + "email": "string" + } + ], + "cit_req": "string", + "deposit_req": "string", + "conditions": "string", + "disclaimer": "string" +} +``` +
+ + - **`conf_dec`** *[Optional ; Repeatable]*
+ This element is used to determine if signing of a confidentiality declaration is needed to access a resource. We may indicate here what *Affidavit of Confidentiality* must be signed before the data can be accessed. Another option is to include this information in the next element (Access conditions). If there is no confidentiality issue, this field can be left blank. +
+ - **`txt`** *[Optional ; Not repeatable ; String]*
+ A statement on confidentiality and limitations to data use. This statement does not replace a more comprehensive data agreement (see `Access condition`). An example of statement could be the following: "Confidentiality of respondents is guaranteed by Articles N to NN of the National Statistics Act of [date]. Before being granted access to the dataset, all users have to formally agree:
+ - To make no copies of any files or portions of files to which s/he is granted access except those authorized by the data depositor.
+ - Not to use any technique in an attempt to learn the identity of any person, establishment, or sampling unit not identified on public use data files.
+ - To hold in strictest confidence the identification of any establishment or individual that may be inadvertently revealed in any documents or discussion, or analysis.
+ - That such inadvertent identification revealed in her/his analysis will be immediately and in confidentiality brought to the attention of the data depositor."
+ - **`required`** *[Optional ; Not repeatable ; String]*
+ The "required" attribute is used to aid machine processing of this element. The default specification is "yes".
+ - **`form_url`** *[Optional ; Not repeatable ; String]*
+    The `form_url` element is used to provide a link to an online confidentiality declaration form.
+ - **`form_id`** *[Optional ; Not repeatable ; String]*
+ Indicates the number or ID of the confidentiality declaration form that the user must fill out.

+ + - **`spec_perm`** *[Optional ; Repeatable]*
+ This element is used to determine if any special permissions are required to access a resource.
+ - **`txt`** *[Optional ; Not repeatable ; String]*
+ A statement on the special permissions required to access the dataset.
+ - **`required`** *[Optional ; Not repeatable ; String]*
+    The `required` attribute is used to aid machine processing of this element. The default specification is "yes".
+ - **`form_url`** *[Optional ; Not repeatable ; String]*
+    The `form_url` element is used to provide a link to a special online permissions form.
+ - **`form_id`** *[Optional ; Not repeatable ; String]*
+ The "form_id" indicates the number or ID of the special permissions form that the user must fill out.

+ + - **`restrictions`** *[Optional ; Not repeatable ; String]*
+ Any restrictions on access to or use of the collection such as privacy certification or distribution restrictions should be indicated here. These can be restrictions applied by the author, producer, or distributor of the data. This element can for example contain a statement (extracted from the DDI documentation) like: "In preparing the data file(s) for this collection, the National Center for Health Statistics (NCHS) has removed direct identifiers and characteristics that might lead to identification of data subjects. As an additional precaution NCHS requires, under Section 308(d) of the Public Health Service Act (42 U.S.C. 242m), that data collected by NCHS not be used for any purpose other than statistical analysis and reporting. NCHS further requires that analysts not use the data to learn the identity of any persons or establishments and that the director of NCHS be notified if any identities are inadvertently discovered. Users ordering data are expected to adhere to these restrictions."
+ + - **`contact`** *[Optional ; Repeatable]*
+ Users of the data may need further clarification and information on the terms of use and conditions to access the data. This set of elements is used to identify the contact persons who can be used as resource persons regarding problems or questions raised by the user community.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the person. Note that in some cases, it might be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position.
+ - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ Affiliation of the person.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ URI for the person; it can be the URL of the organization the person belongs to.
+ - **`email`** *[Optional ; Not repeatable ; String]*
+ The `email` element is used to indicate an email address for the contact individual mentioned in `name`. Ideally, a generic email address should be provided. It is easy to configure a mail server in such a way that all messages sent to the generic email address would be automatically forwarded to some staff members.

+ + - **`cit_req`** *[Optional ; Not repeatable ; String]*
+ A citation requirement that indicates the way that the dataset should be referenced when cited in any publication. Providing a citation requirement will guarantee that the data producer gets proper credit, and that results of analysis can be linked to the proper version of the dataset. The data access policy should explicitly mention the obligation to comply with the citation requirement. The citation should include at least the primary investigator, the name and abbreviation of the dataset, the reference year, and the version number. Include also a website where the data or information on the data is made available by the official data depositor. Ideally, the citation requirement will include a DOI (see the [DataCite](https://datacite.org/) website for recommendations).
+ + - **`deposit_req`** *[Optional ; Not repeatable ; String]*
+ Information regarding data users' responsibility for informing archives of their use of data through providing citations to the published work or providing copies of the manuscripts.
+ + - **`conditions`** *[Optional ; Not repeatable ; String]*
+ Indicates any additional information that will assist the user in understanding the access and use conditions of the data collection.
+ + - **`disclaimer`** *[Optional ; Not repeatable ; String]*
+  A disclaimer limits the liability that the data producer or data custodian has regarding the use of the data. A standard legal statement should be used for all datasets from the same agency. The following formulation could be used: *The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.*
+
+  Example
+
+
+  ```r
+  my_ddi <- list(
+    doc_desc = list(
+      # ...
+    ),
+    study_desc = list(
+      # ... ,
+      study_info = list(
+        # ...
+      ),
+      study_development = list(
+        # ...
+      ),
+      method = list(
+        # ...
+      ),
+
+      data_access = list(
+        # ...,
+
+        dataset_use = list(
+
+          conf_dec = list(
+            list(txt = "Confidentiality of respondents is guaranteed by Articles N to NN of the National Statistics Act. All data users are required to sign an affidavit of confidentiality.",
+                 required = "yes",
+                 form_url = "http://datalibrary.org/affidavit",
+                 form_id = "F01_AC_v01")
+          ),
+
+          spec_perm = list(
+            list(txt = "Permission will only be granted to residents of [country].",
+                 required = "yes",
+                 form_url = "http://datalibrary.org/residency",
+                 form_id = "F02_RS_v01")
+          ),
+
+          restrictions = "Data will only be shared with users who are registered with the National Data Center and have successfully completed the training on data privacy and responsible data use. Only users who legally reside in [country] will be authorized to access the data.",
+
+          contact = list(
+            list(name = "Head, Data Processing Division",
+                 affiliation = "National Statistics Office",
+                 uri = "www.cso.org/databank",
+                 email = "dataproc@cso.org")
+          ),
+
+          cit_req = "National Statistics Office of Popstan. Multiple Indicator Cluster Survey 2000 (MICS 2000). Version 01 of the scientific use dataset (April 2001). DOI: XXX-XXXX-XXX",
+
+          deposit_req = "To provide funding agencies with essential information about use of archival resources and to facilitate the exchange of information among researchers and development practitioners, users of the Microdata Library data are requested to send to the Microdata Library bibliographic citations for, or copies of, each completed manuscript or thesis abstract. Please indicate in a cover letter which data were used.",
+
+          disclaimer = "The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses."
+
+        )
+
+      ),
+      # ...
+
+    )
+  )
+  ```
+ +- **`notes`** *[Optional ; Not repeatable ; String]*

+ Any additional information related to data access that is not contained in the specific metadata elements provided in the section `data_access`.
+ + +### Description of data files + +**`data_files`** *[Optional ; Repeatable]*
+The `data_files` section of the DDI contains the elements needed to describe each of the data files that form the study dataset. These are file-level elements; they do not include the variable-level information, which is contained in a separate section of the standard.
+
+```json
+"data_files": [
+  {
+    "file_id": "string",
+    "file_name": "string",
+    "file_type": "string",
+    "description": "string",
+    "case_count": 0,
+    "var_count": 0,
+    "producer": "string",
+    "data_checks": "string",
+    "missing_data": "string",
+    "version": "string",
+    "notes": "string"
+  }
+]
+```
+
+ +- **`file_id`** *[Optional ; Not repeatable ; String]*
+A unique file identifier (within the metadata document, not necessarily within a catalog). This will typically be the electronic file name. + +- **`file_name`** *[Optional ; Not repeatable ; String]*
+This is not the name of the electronic file (which is provided in the previous element). It is a short title (label) that will help distinguish a particular file/part from other files/parts in the dataset.
+ +- **`file_type`** *[Optional ; Not repeatable ; String]*
+The type of data files. For example, raw data (ASCII), or software-dependent files such as SAS / Stata / SPSS data file, etc. Provide specific information (e.g. Stata 10 or Stata 15, SPSS Windows or SPSS Export, etc.) Note that in an on-line catalog, data can be made available in multiple formats. In such case, the `file_type` element is not useful.
+ +- **`description`** *[Optional ; Not repeatable ; String]*
+The `file_id` and `file_name` elements provide limited information on the content of the file. The `description` element is used to provide a more detailed description of the file content. This description should clearly distinguish between collected and derived variables. It is also useful to indicate the presence in the data file of particular variables, such as the weighting coefficients. If the file contains derived variables, it is good practice to refer to the computer program that generated them.
+ +- **`case_count`** *[Optional ; Numeric ; Not Repeatable]*
+Number of cases or observations in the data file. The value is 0 by default. + +- **`var_count`** *[Optional ; Numeric ; Not Repeatable]*
+Number of variables in the data file. The value is 0 by default. + +- **`producer`** *[Optional ; Not repeatable ; String]*
+The name of the agency that produced the data file. Most data files will have been produced by the survey primary investigator. In some cases however, auxiliary or derived files from other producers may be released with a data set. This may for example be a file containing derived variables generated by a researcher. + +- **`data_checks`** *[Optional ; Not repeatable ; String]*
+Use this element if needed to provide information about the types of checks and operations that have been performed on the data file to make sure that the data are as correct as possible, e.g. consistency checking, wild code checking, etc. Note that the information included here should be specific to the data file. Information about data processing checks that have been carried out on the data collection (study) as a whole should be provided in the `Data editing` element at the study level. You may also provide here a reference to an external resource that contains the specifications for the data processing checks (that same information may also be provided in the `Data Editing` field in the `Study Description` section).
+
+- **`missing_data`** *[Optional ; Not repeatable ; String]*
+A description of missing data (number of missing cases, cause of missing values, etc.) + +- **`version`** *[Optional ; Not repeatable ; String]*
+The version of the data file. A data file may undergo various changes and modifications. File specific versions can be tracked in this element. This field will in most cases be left empty. + +- **`notes`** *[Optional ; Not repeatable ; String]*
+This field aims to provide any additional information on the specific data file that is not covered elsewhere.
+
+  Example (household survey dataset)
+
+
+```r
+my_ddi <- list(
+  doc_desc = list(
+    # ...
+  ),
+  study_desc = list(
+    # ...
+  ),
+
+  data_files = list(
+
+    list(file_id = "HHS2020_S01",
+         file_name = "Household roster (demographics)",
+         description = "The file contains the demographic information on all individuals in the sample.",
+         case_count = 10000,
+         var_count = 12,
+         producer = "National Statistics Office",
+         missing_data = "Values of age outside the valid range (0 to 100) have been replaced with 'missing'.",
+         version = "1.0 (edited, not anonymized)",
+         notes = ""
+    ),
+
+    list(file_id = "HHS2020_S03A",
+         file_name = "Section 3A - Education",
+         description = "The file contains data related to section 3A of the household survey questionnaire (Education of household members aged 6 to 24 years). It also contains the weighting coefficient, and various recoded variables on levels of education.",
+         case_count = 2500,
+         var_count = 17,
+         producer = "National Statistics Office",
+         data_checks = "Education level (variable EDUCLEV) has been edited using hotdeck imputation when the reported value was out of the acceptable range considering the AGE of the person.",
+         version = "1.0 (edited, not anonymized)"
+    ),
+
+    list(file_id = "HHS2020_CONSUMPTION",
+         file_name = "Annualized household consumption by products and services",
+         description = "The file contains derived data on household consumption, annualized and aggregated by category of products and services. The file also contains a regional price deflator variable and the household weighting coefficient. The file was generated using a Stata program named 'cons_aggregate.do'.",
+         case_count = 42000,
+         var_count = 15,
+         producer = "National Statistics Office",
+         data_checks = "Outliers have been detected (> median + 5*IQR) for each product/service; fixed by imputation (regression model).",
+         missing_data = "Missing consumption values are treated as 0.",
+         version = "1.0 (edited, not anonymized)"
+    )
+
+  ),
+
+  # ...
+)
+```
+
+
+### Variable description
+
+The DDI Codebook metadata standard provides multiple elements to document the variables contained in a micro-dataset. There is much value in documenting variables:
+
+ - it makes the data **usable** by providing users with a detailed data dictionary;
+ - it makes the data more **discoverable**, as all keywords included in the description of variables are indexed in data catalogs;
+ - it allows users to assess the comparability of data across sources;
+ - it enables the development of question banks; and
+ - it adds transparency and credibility to the data, especially when derived or imputed variables are documented.
+
+All possible effort should thus be made to generate and publish detailed variable-level documentation.
+
+A micro-dataset can contain many variables; some survey datasets include hundreds or even thousands of them. Documenting variables can thus be a tedious process. The use of a specialized DDI metadata editor can make this process considerably more efficient, as much of the variable-level metadata can be automatically extracted from the electronic data files. Data files in Stata, SPSS, or other common formats include variable names, variable and value labels, and in some cases notes that can be extracted. The variable-level summary statistics that are part of the metadata can also be generated from the data files. Further, software applications used for capturing data, like [Survey Solutions](https://mysurvey.solutions/en/) from the World Bank or [CSPro](https://www.census.gov/data/software/cspro.html) from the US Census Bureau, can export variable metadata, including the variable names, the variable and value labels, and possibly the formulation of questions and the interviewer instructions when the software is used for conducting computer-assisted personal interviews (CAPI). Survey Solutions and CSPro can export metadata in multiple formats, including the DDI Codebook. Multiple options thus exist to make the documentation of variables efficient; as much as possible, tedious manual curation of variable-level information should be avoided.
+
+**`variables`** *[Optional ; Repeatable]*
+The metadata elements we describe below apply independently to each variable in the dataset. + +```json +"variables": [ + { + "file_id": "string", + "vid": "string", + "name": "string", + "labl": "string", + "var_intrvl": "discrete", + "var_dcml": "string", + "var_wgt": 0, + "loc_start_pos": 0, + "loc_end_pos": 0, + "loc_width": 0, + "loc_rec_seg_no": 0, + "var_imputation": "string", + "var_derivation": "string", + "var_security": "string", + "var_respunit": "string", + "var_qstn_preqtxt": "string", + "var_qstn_qstnlit": "string", + "var_qstn_postqtxt": "string", + "var_forward": "string", + "var_backward": "string", + "var_qstn_ivuinstr": "string", + "var_universe": "string", + "var_sumstat": [], + "var_txt": "string", + "var_catgry": [], + "var_std_catgry": {}, + "var_codinstr": "string", + "var_concept": [], + "var_format": {}, + "var_notes": "string" + } +] +``` +
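+
+Before describing each of these elements, the fragment below provides a minimal, purely illustrative sketch of how a single variable could be documented (only a subset of the elements listed above is shown):
+
+```r
+# Illustrative sketch: documentation of one variable (all values are hypothetical)
+my_ddi <- list(
+  # ... ,
+  variables = list(
+    list(file_id = "HHS2020_S01",
+         vid = "V012",
+         name = "age",
+         labl = "Age of household member (in completed years)",
+         var_intrvl = "discrete",
+         var_wgt = 0,
+         var_respunit = "Head of household or other knowledgeable adult member",
+         var_qstn_qstnlit = "How old is [NAME]? Record age in completed years.",
+         var_universe = "All household members")
+  )
+  # ...
+)
+```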
+ +- **`file_id`** *[Required ; Not repeatable ; String]*
+A dataset can be composed of multiple data files. The `file_id` is the name of the data file that contains the variable being documented. This file name should correspond to a `file_id` listed in the `data_files` section of the DDI.
+ +- **`vid`** *[Required ; Not repeatable ; String]*
+A unique identifier given to the variable. This can be a system-generated ID, such as a sequential number within each data file. The `vid` is not the variable name.
+ +- **`name`** *[Required ; Not repeatable ; String]*
+The name of the variable in the data file. The `name` should be entered exactly as found in the data file (not abbreviated or converted to upper or lower cases, as some software applications are case-sensitive). This information can be programmatically extracted from the data file. The variable name is limited to eight characters in some statistical analysis software such as SAS or SPSS.
+ +- **`labl`** *[Optional ; Not repeatable ; String]*
+All variables should have a label that provides a short but clear indication of what the variable contains. Ideally, all variables in a data file will have a different label. File formats like Stata or SPSS often contain variable labels. Variable labels can also be found in data dictionaries in software applications like Survey Solutions or CsPro. Avoid using the question itself as a label (specific elements are available to capture the literal question text; see below). Think of a label as what you would want to see in a tabulation of the variables. Keep in mind that software applications like Stata and others impose a limit to the number of characters in a label (often, 80).
+ +- **`var_intrvl`** *[Optional ; Not repeatable ; String]*
+This element indicates whether the intervals between values for the variable are `discrete` or `continuous`.
+ +- **`var_dcml`** *[Optional ; Not repeatable ; String]*
+This element refers to the number of decimal points in the values of the variable.
+ +- **`var_wgt`** *[Optional ; Not repeatable ; Numeric]*
+This element, which applies to datasets from sample surveys, indicates whether the variable is a sample weight (value "1") or not (value "0"). Sample weights play an important role in the calculation of summary statistics and sampling errors, and should therefore be flagged.
+ +- **`loc_start_pos`** *[Optional ; Not repeatable ; Numeric]*
+The starting position of the variable when the data are saved in an ASCII fixed-format data file.
+ +- **`loc_end_pos`** *[Optional ; Not repeatable ; Numeric]*
+The end position of the variable when the data are saved in an ASCII fixed-format data file.
+ +- **`loc_width`** *[Optional ; Not repeatable ; Numeric]*
+The length of the variable (the maximum number of characters used for its values) in an ASCII fixed-format data file.
+ +- **`loc_rec_seg_no`** *[Optional ; Not repeatable ; Numeric]*
+Record segment number, deck or card number the variable is located on.
+ +- **`var_imputation`** *[Optional ; Not repeatable ; String]*
+Imputation is the process of estimating values for a variable when values are missing. This element is used to describe the procedure that was used to impute missing values.
+ +- **`var_derivation`** *[Optional ; Not repeatable ; String]*
+Used only in the case of a derived variable, this element provides a brief description of how the derivation was performed, the command used to generate the derived variable, and the other variables in the study that were used in the derivation. As full transparency in derivation processes is critical to build trust and to ensure replicability or reproducibility, the information captured in this element will often not be sufficient. In such cases, a reference to a document and/or computer program can be provided in this element, and the documents/scripts provided as external resources. For example, a variable "TOT_EXP" containing the annualized total household expenditure obtained from a household budget survey may be the result of a complex process of aggregation, de-seasonalization, and more. In such a case, the information provided in the `var_derivation` element could be: "TOT_EXP was obtained by aggregating expenditure data on all goods and services, available in sections 4 to 6 of the household questionnaire. It contains imputed rental values for owner-occupied dwellings. The values have been deflated by a regional price deflator available in variable REG_DEF. All values are in local currency. Outliers have been fixed by imputation. Details on the calculations are available in Appendix 2 of the Report on Data Processing, and in the Stata program [generate_hh_exp_total.do]."
+ +- **`var_security`** *[Optional ; Not repeatable ; String]*
+This element is used to provide information regarding levels of access, e.g., public, subscriber, need to know.
+ +- **`var_respunit`** *[Optional ; Not repeatable ; String]*
+Provides information regarding who provided the information contained within the variable, e.g., head of household, respondent, proxy, interviewer.
+ +- **`var_qstn_preqtxt`** *[Optional ; Not repeatable ; String]*
+The pre-question texts are the instructions provided to the interviewers and printed in the questionnaire before the literal question. This does not apply to all variables. Do not confuse this with instructions provided in the interviewer's manual.
+ +- **`var_qstn_qstnlit`** *[Optional ; Not repeatable ; String]*
+The literal question is the full text of the question as it appears in the questionnaire, i.e., as the enumerator is expected to ask it when conducting the interview. This does not apply to all variables (for example, it does not apply to derived variables).
+ +- **`var_qstn_postqtxt`** *[Optional ; Not repeatable ; String]*
+The post-question texts are instructions provided to the interviewers, printed in the questionnaire after the literal question. The post-question text can be used to enter information on skips provided in the questionnaire. This does not apply to all variables. Do not confuse this with instructions provided in the interviewer's manual.
+
+With the previous three elements, one should be able to understand how the question was formulated in a questionnaire. In the example below (extracted from the UNICEF [Malawi 2006 MICS](https://microdata.worldbank.org/index.php/catalog/1798) survey questionnaire), we find: + + - a pre-question: *"Ask this question ONLY ONCE for each mother/caretaker (even if she has more children)."*
+ - a literal question: *"Sometimes children have severe illnesses and should be taken immediately to a health facility. What types of symptoms would cause you to take your child to a health facility right away?"*
+ - a post-question: *"Keep asking for more signs or symptoms until the mother/caretaker cannot recall any additional symptoms. Circle all symptoms mentioned. DO NOT PROMPT WITH ANY SUGGESTIONS"*
+ + ![](./images/ReDoc_Microdata_37.JPG){width=100%} + +- **`var_forward`** *[Optional ; Not repeatable ; String]*
+Contains a reference to the IDs of possible following questions. This can be used to document forward skip instructions.
+ +- **`var_backward`** *[Optional ; Not repeatable ; String]*
+Contains a reference to IDs of possible preceding questions. This can be used to document backward skip instructions.
+ +- **`var_qstn_ivuinstr`** *[Optional ; Not repeatable ; String]*
+Specific instructions to the individual conducting the interview. The content will typically be entered by copy/pasting instructions from the interviewer's manual (or from the CAPI application). In cases where the same instructions relate to multiple variables, repeat the same information in the metadata for all these variables.
+NOTE: In earlier versions of the documentation, due to a typo, the element was named `var_qstn_ivulnstr`.
+ +- **`var_universe`** *[Optional ; Not repeatable ; String]*
+The universe at the variable level defines the population the question applied to. It reflects skip patterns in a questionnaire. This information can typically be copy/pasted from the survey questionnaire. Try to be as specific as possible. This information is critical for the analyst, as it explains why missing values may be found in a variable. In the example below (from the Malawi MICS 2006 survey questionnaire), the universe for questions ED1 to ED2 will be *"Household members age 5 and above"*, and the universe for Question ED3 will be *"Household members age 5 and above who ever attended school or pre-school"*.
+ + ![](./images/ReDoc_Microdata_37.JPG){width=100%} + +- **`var_sumstat`** *[Optional ; Repeatable]*
+The DDI metadata standard provides multiple elements to capture various summary statistics such as minimum, maximum, or mean values (weighted and un-weighted) for each variable (note that frequency statistics for categorical variables are reported in `var_catgry` described below). The content of the `var_sumstat` section will be easy to fill out programmatically (using R or Python) or using a specialized DDI metadata editor, which can read the data file and generate the summary statistics. + +```json +"var_sumstat": [ + { + "type": "string", + "value": null, + "wgtd": "string" + } +] +``` +
+ + - **`type`** *[Required ; Not repeatable ; String]*
+ The type of statistics being shown: mean, median, mode, valid cases, invalid cases, minimum, maximum, or standard deviation.
+ - **`value`** *[Required ; Not repeatable ; Numeric]*
+ The value of the summary statistics mentioned in `type`.
+ - **`wgtd`** *[Required ; Not repeatable ; String]*
+ Indicates whether the statistics reported in `value` are weighted or not (for variables in sample surveys). Enter "weighted" if weighted, otherwise leave this element empty.
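+
+A minimal R sketch of how the content of `var_sumstat` could be generated programmatically is shown below; the data file name and the variable names (`income`, `hh_weight`) are hypothetical.
+
+```r
+# Sketch: generating var_sumstat content for a (hypothetical) numeric variable
+library(haven)
+
+df <- read_dta("my_survey_file.dta")   # hypothetical Stata data file
+v  <- df$income                        # variable being documented
+w  <- df$hh_weight                     # sample weight variable
+
+var_sumstat <- list(
+  list(type = "valid",   value = sum(!is.na(v))),
+  list(type = "missing", value = sum(is.na(v))),
+  list(type = "minimum", value = min(v, na.rm = TRUE)),
+  list(type = "maximum", value = max(v, na.rm = TRUE)),
+  list(type = "mean",    value = mean(v, na.rm = TRUE)),
+  list(type = "mean",    value = weighted.mean(v, w, na.rm = TRUE), wgtd = "weighted")
+)
+```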

+ +- **`var_txt`** *[Optional ; Not repeatable ; String]*
+This element provides a space to describe the variable in detail. Not all variables require a definition.
+ +- **`var_catgry`** *[Optional ; Repeatable]*
+Variable categories are the lists of codes (and their meaning) that apply to a categorical variable. This block of elements is used to describe the categories (code and label) and optionally capture their weighted and/or un-weighted frequencies.
+ +```json +"var_catgry": [ + { + "value": "string", + "label": "string", + "stats": [ + { + "type": "string", + "value": null, + "wgtd": "string" + } + ] + } +] +``` +
+ + - **`value`** *[Required ; Not repeatable ; String]*
+ The value here is the code assigned to a variable category. For example, a variable "Sex" could have value 1 for "Male" and value 2 for "Female".
+ - **`label`** *[Required ; Not repeatable ; String]*
+The label attached to the code mentioned in `value`.
+ - **`stats`** *[Optional ; Repeatable]*
+ This repeatable block of elements will contain the summary statistics for the category (not for the variable) being documented. This may include frequencies, percentages, or cross-tabulation results.
+ - **`type`** *[Required ; Not repeatable ; String]*
+ The type of the summary statistic. This will usually be `freq` for frequency.
+ - **`value`** *[Required ; Not repeatable ; Numeric]*
+ The value of the summary statistic, for the corresponding `type`.
+ - **`wgtd`** *[Optional ; Not repeatable ; String]*
+      Indicates whether the statistic reported in `value` is weighted or not (for variables in sample surveys). Enter "weighted" if weighted; otherwise leave this element empty.
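+
+A minimal R sketch of how `var_catgry` content (codes, labels, and frequencies) could be extracted from a Stata file, while preserving the original value labels, is shown below; the data file name and the variable name `sex` are hypothetical.
+
+```r
+# Sketch: generating var_catgry content for a (hypothetical) categorical variable
+library(haven)
+
+df <- read_dta("my_survey_file.dta")    # hypothetical Stata data file
+val_labels <- attr(df$sex, "labels")    # named vector, e.g., c(Male = 1, Female = 2)
+
+var_catgry <- lapply(seq_along(val_labels), function(i) {
+  list(value = as.character(val_labels[i]),
+       label = names(val_labels)[i],
+       stats = list(list(type = "freq",
+                         value = sum(df$sex == val_labels[i], na.rm = TRUE))))
+})
+```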

+ +- **`var_std_catgry`** *[Optional ; Not repeatable]*
+This element is used to indicate that the codes used for a categorical variable are from a standard international or other classification, like COICOP, ISIC, ISO country codes, etc.
+ +```json +"var_std_catgry": { + "name": "string", + "source": "string", + "date": "string", + "uri": "string" +} +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the classification, e.g. "International Standard Industrial Classification of All Economic Activities (ISIC), Revision 4"
+ - **`source`** *[Required ; Not repeatable ; String]*
+ The source of the classification, e.g. "United Nations"
+ - **`date`** *[Required ; Not repeatable ; String]*
+ The version (typically a date) of the classification used for the study.
+ - **`uri`** *[Required ; Not repeatable ; String]*
+ A URL to a website where an electronic copy and more information on the classification can be obtained.
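+
+A filled-in example of `var_std_catgry` is shown below (the URI is indicative):
+
+```r
+var_std_catgry = list(
+  name   = "International Standard Industrial Classification of All Economic Activities (ISIC), Revision 4",
+  source = "United Nations",
+  date   = "2008",
+  uri    = "https://unstats.un.org/unsd/classifications/Econ/isic"
+)
+```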
+ +- **`var_codinstr`** *[Optional ; Not repeatable ; String]*
+The coder instructions for the variable. These are any special instructions to those who converted information from one form to another (e.g., textual to numeric) for a particular variable.
+ +- **`var_concept`** *[Optional ; Repeatable]*
+The general subject to which the parent element may be seen as pertaining. This element serves the same purpose as the keywords and topic classification elements, but at the variable description level.
+ +```json +"var_concept": [ + { + "title": "string", + "vocab": "string", + "uri": "string" + } +] +``` +
+ + - **`title`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the concept. + - **`vocab`** *[Optional ; Not repeatable ; String]*
+    The controlled vocabulary, if any, from which the concept `title` was taken.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+    The location for the controlled vocabulary mentioned in `vocab`.

+ +- **`var_format`** *[Optional ; Not repeatable]*
+The technical format of the variable in question. + +```json +"var_format": { + "type": "string", + "name": "string", + "note": "string" +} +``` +
+ + - **`type`** *[Optional ; Not repeatable ; String]*
+ Indicates if the variable is numeric, fixed string, dynamic string, or date. Numeric variables are used to store any number, integer or floating point (decimals). A fixed string variable has a predefined length which enables the publisher to handle this data type more efficiently. Dynamic string variables can be used to store open-ended questions.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+    In some cases, this element may provide the name of the particular, proprietary format used.
+ - **`note`** *[Optional ; Not repeatable ; String]*
+ Additional information on the variable format.

+
+- **`var_notes`** *[Optional ; Not repeatable ; String]*
+This element is provided to record any additional or auxiliary information related to the specific variable.
+
+
+Example for two variables only:
+
+
+```r
+my_ddi <- list(
+  doc_desc = list(
+    # ...
+  ),
+  study_desc = list(
+    # ...
+  ),
+  data_files = list(
+    # ...
+  ),
+
+  variables = list(
+
+    list(file_id = "",
+         vid = "",
+         name = "",
+         labl = "Main occupation",
+         var_intrvl = "discrete",
+         var_imputation = "",
+         var_respunit = "",
+         var_qstn_preqtxt = "",
+         var_qstn_qstnlit = "",
+         var_qstn_postqtxt = "",
+         var_qstn_ivuinstr = "",
+         var_universe = "",
+         var_sumstat = list(list(type = "", value = "", wgtd = "")),
+         var_txt = "",
+         var_forward = "",
+         var_catgry = list(
+           list(value = "",
+                label = "",
+                stats = list(list(type = "", value = "", wgtd = ""),
+                             list(type = "", value = "", wgtd = ""),
+                             list(type = "", value = "", wgtd = ""))),
+           list(value = "",
+                label = "",
+                stats = list(list(type = "", value = "", wgtd = ""),
+                             list(type = "", value = "", wgtd = ""),
+                             list(type = "", value = "", wgtd = "")))
+         ),
+         var_std_catgry = list(),
+         var_codinstr = "",
+         var_concept = list(list(title = "", vocab = "", uri = "")),
+         var_format = list(type = "numeric", name = "")
+    ),
+
+    list(file_id = "",
+         vid = "",
+         name = "V75_HH_CONS",
+         labl = "Household total consumption",
+         var_intrvl = "continuous",
+         var_dcml = "",
+         var_wgt = 0,
+         var_imputation = "",
+         var_derivation = "",
+         var_security = "",
+         var_respunit = "",
+         var_qstn_preqtxt = "",
+         var_qstn_qstnlit = "",
+         var_qstn_postqtxt = "",
+         var_qstn_ivuinstr = "",
+         var_universe = "",
+         var_sumstat = list(list(type = "", value = "", wgtd = "")),
+         var_txt = "",
+         var_codinstr = "",
+         var_concept = list(list(title = "", vocab = "", uri = "")),
+         var_format = list(type = "", name = "", note = ""),
+         var_notes = ""
+    )
+
+  ),
+  # ...
+)
+```
+
+ + +### Variable groups + +**`variable_groups`** *[Optional ; Repeatable]*
+
+In a dataset, variables are grouped by data file. For the convenience of users, the DDI allows data curators to organize variables into "virtual" groups, for example by theme, by type of respondent, or by any other criteria. Grouping variables is optional and does not impact the way variables are stored in the data files. One variable can belong to more than one group, and a group can contain variables from more than one data file. The variable groups do not necessarily have to cover all variables in the data files. Variable groups can also contain other variable groups.
+ +```json +"variable_groups": [ + { + "vgid": "string", + "variables": "string", + "variable_groups": "string", + "group_type": "subject", + "label": "string", + "universe": "string", + "notes": "string", + "txt": "string", + "definition": "string" + } +] +``` +
+ +- **`vgid`** *[Optional ; Not repeatable ; String]*
+A unique identifier (within the DDI metadata file) for the variable group.
+ +- **`variables`** *[Optional ; Not repeatable ; String]*
+The list of variables (variable identifiers, `vid`) in the group. Enter a list with items separated by a space, e.g., "V21 V22 V30".
+ +- **`variable_groups`** *[Optional ; Not repeatable ; String]*
+The variable groups (`vgid`) that are embedded in this variable group. Enter a list with items separated by a space, e.g., "VG2 VG5".
+
+- **`group_type`** *[Optional ; Not repeatable ; String]*
+The type of grouping of the variables. A controlled vocabulary should be used. The DDI proposes the following vocabulary: {`section, multipleResp, grid, display, repetition, subject, version, iteration, analysis, pragmatic, record, file, randomized, other`}. A description of the groups can be found in [this document](https://zenodo.org/record/3823051/files/maddiewshop.pdf) by W. Thomas, W. Block, R. Wozniak and J. Buysse.
+ +- **`label`** *[Optional ; Not repeatable ; String]*
+A short description of the variable group.
+ +- **`universe`** *[Optional ; Not repeatable ; String]*
+The universe can be a population of individuals, households, facilities, organizations, or others, which can be defined by any type of criteria (e.g., "adult males", "private schools", "small and medium-size enterprises", etc.).
+ +- **`notes`** *[Optional ; Not repeatable ; String]*
+Used to provide additional information about the variable group.
+ +- **`txt`** *[Optional ; Not repeatable ; String]*
+A more detailed description of the variable group than the one provided in `label`.
+ +- **`definition`** *[Optional ; Not repeatable ; String]*
+A brief rationale for the variable grouping.
+ + +```r +my_ddi <- list( + doc_desc = list( + # ... + ), + study_desc = list( + # ... + ), + data_files = list( + # ... + ), + variables = list( + # ... + ), + + variable_groups = list( + + list(vgid = "vg01", + variables = "", + variable_groups = "", + group_type = "subject", + label = "", + universe = "", + notes = "", + txt = "", + definition = "" + ), + + list(vgid = "vg02", + variables = "", + variable_groups = "", + group_type = "subject", + label = "", + universe = "", + notes = "", + txt = "", + definition = "" + ) + + ), + + # ... +) +``` +
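+
+A filled-in example of two variable groups is shown below (the variable identifiers, labels, and descriptions are fictitious):
+
+```r
+variable_groups = list(
+
+  list(vgid = "vg01",
+       variables = "V0012 V0013 V0014 V0015",
+       group_type = "subject",
+       label = "Education",
+       universe = "Household members age 5 and above",
+       definition = "Variables from the education section of the questionnaire"),
+
+  list(vgid = "vg02",
+       variables = "V0016 V0017 V0018",
+       group_type = "subject",
+       label = "Health",
+       definition = "Variables from the health section of the questionnaire")
+
+)
+```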
+ + +### Provenance + +**`provenance`** *[Optional ; Repeatable]*
+Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. These elements are NOT part of the DDI metadata standard.
+ +```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ +- **`origin_description`** *[Required ; Not repeatable]*
+The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ + - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+    A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element `idno` in the Study Description / Title Statement section) will be modified when published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+    The XML namespace of the metadata standard in which the harvested record was provided (e.g., `ddi:codebook:2_5` for DDI Codebook 2.5).
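+
+A filled-in example of a `provenance` block is shown below (all values are fictitious):
+
+```r
+provenance = list(
+  list(
+    origin_description = list(
+      harvest_date       = "2022-04-15T10:35:00Z",
+      altered            = TRUE,
+      base_url           = "https://catalog.example.org/index.php/api/catalog",
+      identifier         = "SRC_SURVEY_2007_v01",
+      date_stamp         = "2022-03-02",
+      metadata_namespace = "ddi:codebook:2_5"
+    )
+  )
+)
+```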
+ + +### Tags + +**`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. Tags are NOT part of the DDI codebook standard. + +```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ +- **`tag`** *[Required ; Not repeatable ; String]*
+A user-defined tag. +- **`tag_group`** *[Optional ; Not repeatable ; String]*
+A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
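+
+A short example of user-defined tags and tag groups is shown below (all values are fictitious):
+
+```r
+tags = list(
+  list(tag = "poverty",          tag_group = "topics"),
+  list(tag = "household survey", tag_group = "data collection method")
+)
+```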
+ + +### LDA topics + +**`lda_topics`** *[Optional ; Not repeatable]*
+ +```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+
+We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document (in this case, the "document" is a compilation of elements from the dataset metadata) can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
+
+The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition. The `lda_topics` element is NOT part of the DDI Codebook standard. + +:::note +Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated. +::: + +The `lda_topics` element includes the following metadata fields:
+ +- **`model_info`** *[Optional ; Not repeatable]*
+Information on the LDA model.
+ - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ +- **`topic_description`** *[Optional ; Repeatable]*
+The topic composition of the document.
+ - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the document (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic. This is specific to the model, not to a document.
+
+
+
+```r
+lda_topics = list(
+
+  list(
+
+    model_info = list(
+      list(source      = "World Bank, Development Data Group",
+           author      = "A.S.",
+           version     = "2021-06-22",
+           model_id    = "Mallet_WB_75",
+           nb_topics   = 75,
+           description = "LDA model, 75 topics, trained on Mallet",
+           corpus      = "World Bank Documents and Reports (1950-2021)",
+           uri         = "")
+    ),
+
+    topic_description = list(
+
+      list(topic_id    = "topic_27",
+           topic_score = 32,
+           topic_label = "Education",
+           topic_words = list(list(word = "school",    word_weight = ""),
+                              list(word = "teacher",   word_weight = ""),
+                              list(word = "student",   word_weight = ""),
+                              list(word = "education", word_weight = ""),
+                              list(word = "grade",     word_weight = ""))),
+
+      list(topic_id    = "topic_8",
+           topic_score = 24,
+           topic_label = "Gender",
+           topic_words = list(list(word = "women",  word_weight = ""),
+                              list(word = "gender", word_weight = ""),
+                              list(word = "man",    word_weight = ""),
+                              list(word = "female", word_weight = ""),
+                              list(word = "male",   word_weight = ""))),
+
+      list(topic_id    = "topic_39",
+           topic_score = 22,
+           topic_label = "Forced displacement",
+           topic_words = list(list(word = "refugee",   word_weight = ""),
+                              list(word = "programme", word_weight = ""),
+                              list(word = "country",   word_weight = ""),
+                              list(word = "migration", word_weight = ""),
+                              list(word = "migrant",   word_weight = ""))),
+
+      list(topic_id    = "topic_40",
+           topic_score = 11,
+           topic_label = "Development policies",
+           topic_words = list(list(word = "development", word_weight = ""),
+                              list(word = "policy",      word_weight = ""),
+                              list(word = "national",    word_weight = ""),
+                              list(word = "strategy",    word_weight = ""),
+                              list(word = "activity",    word_weight = "")))
+
+    )
+
+  )
+
+)
+```
+
+ + +### Embeddings + +**`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. In this case, the text would be a compilation of selected elements of the dataset metadata. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).
+
+The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model. The `embeddings` element is NOT part of the DDI Codebook standard.
+
+```json
+"embeddings": [
+  {
+    "id": "string",
+    "description": "string",
+    "date": "string",
+    "vector": { }
+  }
+]
+```
+
+ +The `embeddings` element contains four metadata fields: + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+    The date the model was trained (or a version date for the model).
+  - **`vector`** *[Required ; Not repeatable ; Object]*
+    The numeric vector representing the document, e.g.:
+ [1,4,3,5,7,9] + + +### Additional + +**`additional`** *[Optional ; Not repeatable]*
+The `additional` element is provided to allow users of the API to create their own elements and add them to the schema. It is not part of the DDI Codebook standard. All custom elements must be added within the `element` block; embedding them elsewhere in the schema would cause DDI schema validation to fail in NADA.
+
+
+## Generating and publishing DDI metadata
+
+The DDI-Codebook metadata standard provides multiple elements to describe the variables in detail. This includes elements that are usually not found in data dictionaries, like summary statistics. Generating this information and manually capturing it in a DDI-compliant metadata file could be tedious; some datasets contain hundreds or even thousands of variables. Some of the metadata (list of variables, possibly variable and value labels, and summary statistics) can be automatically extracted from the data files. Specialized metadata editors, which can read the data files, extract metadata, and generate DDI-compliant output, are thus the preferred option to document microdata. Other software applications, such as CsPro and Survey Solutions (CAPI applications), can also generate variable-level metadata in DDI-compliant formats. Stata and R scripts also provide solutions to generate variable-level metadata from data files. We present some of these tools below.
+
+### Using the World Bank Metadata Editor
+
+@@@ Update this whole section with proper screenshots and description
+
+The World Bank Metadata Editor is compliant with the DDI-Codebook 2.5. It is open source software. [@@@@@ not yet - wait for license] It is a flexible application that can also accommodate other standards and schemas such as the Dublin Core (for documents) and the ISO 19139 (for geospatial data).
+
+When importing data files, variable-level metadata are automatically generated, including variable names, summary statistics, and variable and value labels if available in the source data files. Additional variable-level metadata can then be added manually.
+
+![](./images/ReDoc_Microdata_WBME_01.JPG) +
+ +The Metadata Editor provides forms to enter all other related metadata using the DDI-Codebook 2.5 standard, including the study description and a description of external resources. +
+![image](https://user-images.githubusercontent.com/35276300/229926157-4ca798d2-ea70-44d4-83e7-6d6eeb7f25cc.png){width=100%} +
+ +The World Bank Metadata Editor exports the metadata (for microdataset) in DDI-Codebook 2.5 format (XML) and in JSON format. Metadata related to external resources can be exported to a Dublin Core file. A transformation of the metadata files into a PDF document is also implemented. + +
+![](./images/ReDoc_Microdata_WBME_03.JPG){width=100%} +
+
+
+### Using R or Python
+
+DDI-compliant metadata can also be generated and published in a NADA catalog programmatically. Programming languages like R and Python provide much flexibility to generate such metadata, including variable-level metadata.
+
+We provide here an example where a dataset is available in Stata format. We use two data files from the Core Welfare Indicator Questionnaire (CWIQ) survey conducted in Liberia in 2007 (the full dataset has 12 data files; the extension of the script to the full dataset would be straightforward). One data file, named "sec_abcde_individual.dta", contains individual-level variables. The other data file, named "sec_fgh_household.dta", contains household-level variables. The content of the Stata files is as follows:
+
+
+![](./images/CWIQ_Stata.JPG){width=80%} +
+
+
+:::note
+When generating the variable-level metadata, we want to extract the value labels from the data files, keeping the original [code - value label] pairs as they are in the original dataset. For example, if the Stata dataset has codes 1 = Male and 2 = Female for variable *sex*, we do not want them to be changed, for example to 1 = Female and 2 = Male, by the data import process. The import functions of R packages do not always maintain the code/label pairs; some convert categorical data into factors and assign codes and value labels independently from the original coding.
+:::
+
+
+```r
+# In http://catalog.ihsn.org/catalog/1523
+
+library(nadar)
+library(haven)
+library(rlist)
+library(stringr)
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+id = "LBR_CWIQ_2007"
+
+setwd("D:/LBR_CWIQ_2007")
+
+thumb = "liberia_cwiq.JPG"  # This image will be used as a thumbnail
+
+# The literal questions are only found in a PDF file; we extract them.
+# If the list of questions had been available in MS-Excel format or equivalent, we
+# would import it from that file.
+literal_questions = list(
+  b1 = "Is [NAME] male or female?",
+  b2 = "How long has [NAME] been away in the last 12 months?",
+  b3 = "What is [NAME]'s relationship to the head of household?",
+  b4 = "How old was [NAME] at last birthday?",
+  b5 = "What is [NAME]'s marital status?",
+  b6 = "Is [NAME]'s father alive?",
+  b7 = "Is [NAME]'s father living in the household?",
+  b8 = "Is [NAME]'s mother alive?",
+  b9 = "Is [NAME]'s mother living in the household?",
+  c1 = "Can [NAME] read and write in any language?",
+  c2 = "Has [NAME] ever attended school?",
+  c3 = "What is the highest grade [NAME] completed?",
+  c4 = "Did [NAME] attend school last year?",
+  c5 = "Is [NAME] currently in school?",
+  c6 = "What is the current grade [NAME] is attending?",
+  c7 = "Who runs the school [NAME] is attending?",
+  c8 = "Did [NAME] have any problems with school?",
+  c9 = "Why is [NAME] not currently in school?",
+  c10= "Why has [NAME] not started school?"
+  # Etc. 
(we do not include all questions in the example) +) + +# Generate file-level and variable-level metadata for the two data files + +list_data_files = c("sec_abcde_individual.dta", "sec_fgh_household.dta") + +list_var = list() +list_df = list() +vno = 1 +fno = 1 + +for (datafile in list_data_files) { + + data <- read_dta(datafile) + + # Generate file-level metadata + + # Create a file identifier (sequential) + fid = paste0("F", str_pad(fno, 2, pad = "0")) + fno = fno + 1 + + # Add core metadata + case_n = nrow(data) # Nb of observations in the data file + var_n = length(data) # Nb of variables in the data file + df = list(file_id = fid, + file_name = datafile, + case_count = case_n, + var_count = var_n) + list_df = list.append(list_df, df) + + # Generate variable-level metadata + + for(v in 1:length(data)) { + + # Create a variable identifier (sequential) + vid = paste0("V", str_pad(vno, 4, pad = "0")) + vno = vno + 1 + + # Variable name and literal question + vname = names(data[v]) + question = as.character(literal_questions[vname]) + if(is.null(question)) question = "" + + # Extract the variable label (trim leading and trailing white spaces) + var_lab <- trimws(attr(data[[v]], 'label')) + if(is.null(var_lab)) var_lab = "" + + # Variable-level summary statistics + vval = sum(!is.na(data[[v]])) + vmis = sum(is.na(data[[v]])) + vmin = as.character(min(data[[v]], na.rm = TRUE)) + vmax = as.character(max(data[[v]], na.rm = TRUE)) + vstats = list( + list(type = "valid", value = vval), + list(type = "system missing", value = vmis), + list(type = "minimum", value = vmin), + list(type = "maximum", value = vmax) + ) + + # Extract the (original) codes and value labels and calculate frequencies + freqs = list() + val_lab <- attr(data[[v]], 'labels') + if(!is.null(val_lab) & typeof(data[[v]]) != "character") { + freq_tbl = table(data[[v]]) + for (i in 1:length(val_lab)) { + f = list(value = as.character(val_lab[i]), + labl = as.character(names(val_lab[i])), + stats = list( + list(type = "count", + value = sum(data[[v]] == val_lab[i], na.rm = TRUE) + ) + ) + ) + freqs = list.append(freqs, f) + } + } + + # Compile the variable-level metadata + list_v = list( + file_id = fid, + vid = vid, + name = vname, + labl = var_lab, + var_qstn_qstnlit = question, + var_sumstat = vstats, + var_catgry = freqs) + + # Add to the list of variables already documented + list_var = list.append(list_var, list_v) + + } + +} + +# Generate the DDI-compliant metadata + +cwiq_ddi_metadata <- list( + + doc_desc = list( + producers = list( + list(name = "WB consultants") + ), + prod_date = "2008-02-19" + ), + + study_desc = list( + + title_statement = list( + idno = id, + title = "Core Welfare Indicators Questionnaire 2007" + ), + + authoring_entity = list( + list(name = "Liberia Institute of Statistics and Geo_Information Services") + ), + + study_info = list( + + coll_dates = list( + list(start = "2007-08-06", end = "2007-09-22") + ), + + nation = list( + list(name = "Liberia", abbreviation = "LBR") + ), + + abstract = "The Government of Liberia (GoL) is committed to producing a Poverty Reduction Strategy Paper (PSRP). To do this, the GoL will need to undertake an analysis of qualitative and quantitative sources to understand the nature of poverty ('Where are we?'); to develop a macro-economic framework, and conduct broad based and participatory consultations to choose objectives, define and prioritize strategies ('Where do we want to go? 
How far can we get?); and to develop a monitoring and evaluation system ('How will we know when we get there?). The analysis of the nature of poverty, the Poverty Profile, will establish the overall rate of poverty incidence, identifying the poor in relation to their location, habits, occupations, means of access to and use of government services, and their living standards in regard to health, education, nutrition. Given the capacity constraints it has been agreed that this information will be collected in a single visit survey using the Core Welfare Indicators Questionnaire (CWIQ) survey with an additional module to cover household income, expenditure and consumption. This will provide information to estimate welfare levels & poverty incidence, which can be combined and analyzed with the sectoral information from the main CWIQ questionnaire. While countries with more capacity usually do a household income, expenditure and consumption survey over 12 months, the single visit approach has been used in a number of countries (mainly in West Africa) fairly successfully.", + + geog_coverage = "National" + + ), + + method = list( + + data_collection = list( + + coll_mode = "face to face interview", + + sampling_procedure = "The CWIQ survey will be carried out on a sample of 3,600 randomly selected households located in 300 randomly selected clusters. This was the same basic sample used by the 2007 Liberian DHS. However, for Monrovia, a new listing was carried out and new EAs were chosen and the sampled households were chosen from that list. For rural areas, the same EAs were used but a new sample selection of housholds was drawn. Any household that may have participated in the LDHS was systematically eliminated. Twelve (12) households were selected in each of the 300 EA using systematic sampling. The total number of households and number of EAs sampled in each County are given in the table below. (More on the Sampling under the External Resources).", + + coll_situation = "On average, the interview process lasted about about 2 hours 45 minutes. The Income and Expenditure questionnaire alone took about 2 hours to complete. In many occasions, the questionnaire was completed in 2 sitting sessions." 
+ + ) + + ) + + ), + + # Information of data files + data_files = list_df, + + # Information on variables + variables = list_var + +) + +# Publish the metadata in the NADA catalog + +microdata_add( + idno = id, + repositoryid = "central", + access_policy = "licensed", + published = 1, + overwrite = "yes", + metadata = cwiq_ddi_metadata, + thumbnail = thumb +) + +# Add links to data and documents + +external_resources_add( + title = "Liberia, CWIQ 2007, Dataset in Stata 15 format", + idno = id, + dcdate = "2007", + language = "English", + country = "Liberia", + dctype = "dat/micro", + file_path = "LBR_CWIQ_2007_Stata15.zip", + description = "Liberia CWIQ dataset in Stata 15 format (2 data files)", + overwrite = "yes" +) + +external_resources_add( + title = "Liberia, CWIQ 2007, Dataset in SPSS Windows format", + idno = id, + dcdate = "2007", + language = "English", + country = "Liberia", + dctype = "dat/micro", + file_path = "LBR_CWIQ_2007_Stata15.zip", + description = "Liberia CWIQ dataset in SPSS for Windows [.sav] format (2 data files)", + overwrite = "yes" +) + +external_resources_add( + title = "CWIQ 2007 Questionnaire", + idno = id, + dcdate = "2007", + language = "English", + country = "Liberia", + dctype = "doc/ques", + file_path = "LCWIQ2007_.pdf", + overwrite = "yes" +) +``` + +After running the script, the metadata (and links) are available in the NADA catalog. + +
+![](./images/CWIQ_in_NADA_1.JPG){width=100%} +
+ +
+![](./images/CWIQ_in_NADA_2.JPG){width=100%} +
+ +
+![](./images/CWIQ_in_NADA_3.JPG){width=100%} +
+ +
+![](./images/CWIQ_in_NADA_4.JPG){width=100%} +
+ diff --git a/06_chapter06_geospatial.md b/06_chapter06_geospatial.md new file mode 100644 index 0000000..0e1afd9 --- /dev/null +++ b/06_chapter06_geospatial.md @@ -0,0 +1,3803 @@ +--- +output: html_document +--- + +# Geographic data and services {#chapter06} + +
+
+![](./images/geo_logo.JPG){width=25%} +
+
+
+
+## Background
+
+To make geographic information discoverable and to facilitate its dissemination and use, the ISO Technical Committee on Geographic Information/Geomatics (ISO/TC211) created a set of metadata standards to describe geographic **datasets** (ISO 19115), geographic **data structures** (ISO 19115-2 / ISO 19110), and geographic **data services** (ISO 19119). These standards have been "unified" into a common XML specification (ISO 19139). This set of standards, known as the ISO 19100 series, served as the cornerstone of multiple initiatives to improve the documentation and management of geographic information such as the [Open Geospatial Consortium (OGC)](https://www.ogc.org/), the [US Federal Geographic Data Committee (FGDC)](https://www.fgdc.gov/), the [European INSPIRE directive](https://inspire.ec.europa.eu), or more recently the [Research Data Alliance (RDA)](https://rd-alliance.org/), among others.
+
+The ISO 19100 standards have been designed to cover the large scope of geographic information. The level of detail they provide goes beyond the needs of most data curators. What we present in this Guide is a subset of the standards, which focuses on what we consider the core requirements to describe and catalog geographic datasets and services. References and links to resources where more detailed information can be found are provided in the appendix.
+
+
+## Geographic information metadata standards
+
+Geographic information metadata standards cover three types of resources: i) datasets, ii) data structure definitions, and iii) data services. Each one of these three components is the object of a specific standard. To support their implementation, a common XML specification (ISO 19139) covering the three standards has been developed. The geographic metadata standard is however, by far, the most complex and "specialized" of all schemas described in this Guide. Its use requires expertise not only in data documentation, but also in the use of geospatial data. We provide in this chapter some information that readers who are not familiar with geographic data may find useful to better understand the purpose and use of the geographic metadata standards.
+
+### Documenting geographic datasets - The ISO 19115 standard
+
+**Geographic datasets** "identify and depict geographic locations, boundaries and characteristics of features on the surface of the earth. They include geographic coordinates (e.g., latitude and longitude) and data associated to geographic locations (...)". (Source: https://www.fws.gov/gis/)
+
+The ISO 19115 standard defines the structure and content of the metadata to be used to document geographic datasets. The standard is split into two parts covering:
+
+1. **vector data** (ISO 19115-1), and
+2. **raster data** including imagery and gridded data (ISO 19115-2).
+
+*Vector* and *raster* spatial datasets are built with different structures and formats. The following summarizes how these two categories differ and how they can be processed using R. The descriptions of vector and raster data provided in this chapter are adapted from:
+ - https://gisgeography.com/spatial-data-types-vector-raster/
+ - https://datacarpentry.org/organization-geospatial/02-intro-vector-data/index.html
+
+**Vector data**
+
+Vector data consist of **points**, **lines**, and **polygons** (areas).
+
+A vector **point** is defined by a single x, y coordinate. Generally, vector points are a latitude and longitude with a spatial reference frame. 
A point can for example represent the location of a building or facility. When multiple dots are connected in a set order, they become a vector **line** with each dot representing a **vertex**. Lines usually represent features that are linear in nature, like roads and rivers. Each bend in the line represents a vertex that has a defined x, y location. When a set of 3 or more vertices is joined in a particular order and closed (i.e. the first and last coordinate pairs are the same), it becomes a **polygon**. Polygons are used to show boundaries. They will typically represent lakes, oceans, countries and their administrative subdivisions (provinces, states, districts), building footprints, or outline of survey plots. Polygons have an area (which will correspond to the square-footage for a building footprint, to the acreage for an agricultural plot, etc.) + +Vector data are often provided in one of the following file formats: + + - ESRI Shapefile (actually a zip set of files; not standard and limited as it is based on an outdated DBF format, but still widely used); + - ESRI GeoDatabase file (not a standard format, but widely used); + - GML: the Official OGC geospatial standard format, used by standard spatial data services; + - GeoPackage: the OGC recommended standard for handling vector data; + - GeoJSON: another OGC standard, often used when a *service* is associated to the data; + - KML/KMZ: [Keyhole Markup Language](https://en.wikipedia.org/wiki/Keyhole_Markup_Language), an XML notation for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers; + - CSV file: Comma-separated values files, with geometries provided in OGC Well-Known-Text (WKT); + - OSM: An XML-formatted file containing "nodes" (points), "ways" (connections), and "relations" from [OpenStreetMap](https://www.openstreetmap.org) format. + +------------- +Some examples +-------------- + +**EXAMPLE 1** + +The figure below provides an example of vector data extracted from [Open Street Map](https://www.openstreetmap.org/node/1376501203#map=18/27.47008/89.63725) for a part of the city of Thimphu, Bhutan (as of 17 May, 2021). + +
+![](./images/geospatial_example_03_vector_OSM.JPG){width=100%} +
+ +The content of this map can be exported as an OSM file. + +
+![](./images/geospatial_example_03_vector_OSM_export.JPG){width=40%} +
+ +Multiple applications will allow users to read and process OSM files, including open source software applications like [QGIS](https://www.qgis.org/en/site/) or the R packages [sf](https://cran.r-project.org/package=sf) and [osmdata](https://cran.r-project.org/web/packages/osmdata/vignettes/osmdata.html) + + +```r +# Example of a R script that reads and shows the content of the map.osm file + +library(sf) + +# List the layers contained in the OSM file +lyrs <- st_layers("map.osm") + +# Read the layers as sf objects +points <- st_read("map.osm", layer = "points") +lines <- st_read("map.osm", layer = "lines") +polygons <- st_read("map.osm", layer = "multipolygons") +``` + +**EXAMPLE 2** + +In this second example, we use the R `sf` (Simple Features) package to read a shape (vector) file of refugee camps in Bangladesh, downloaded from the Humanitarian [Data Exchange (HDX)](https://data.humdata.org) website: + + +```r +# Load the sf package and utilities + +library(sf) +library(utils) + +# Download and unzip the shape file (published by HDX as a compressed zip format) + +setwd("E:/my_data") +url <- "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip" +download.file(url, destfile = "200908_RRC_Outline_Camp_AL1.zip") +unzip("E:/my_data/200908_RRC_Outline_Camp_AL1.zip") + +# Read the file and display core information about its content + +al1 <- st_read("./200908_RRC_Outline_Camp_AL1/200908_RRC_Outline_Camp_AL1.shp") +print(al1) +plot(al1) + +# ------------------------------ +# Output of the 'print' command: +# ------------------------------ + +# Simple feature collection with 35 features and 14 fields +# geometry type: MULTIPOLYGON +# dimension: XY +# bbox: xmin: 92.12973 ymin: 20.91856 xmax: 92.26863 ymax: 21.22292 +# geographic CRS: WGS 84 +# First 10 features: +# District Upazila Settlement Union Name_Alias SSID SMSD__Cnam NPM_Name Area_Acres PeriMe_Met +# 1 Cox's Bazar Ukhia Collective site Palong Khali Bagghona-Putibonia CXB-224 Camp 16 Camp 16 (Potibonia) 130.57004 4136.730 +# 2 Cox's Bazar Ukhia Collective site Palong Khali CXB-203 Camp 02E Camp 02E 96.58179 4803.162 +# 3 ... +# +# Camp_Name Area_SqM Latitude Longitude geometry +# 1 Camp 16 528946.95881724 21.1563813298438 92.1490685817901 MULTIPOLYGON (((92.15056 21... +# 2 Camp 2E 391267.799744003 21.2078084302778 92.1643360947381 MULTIPOLYGON (((92.16715 21... +# 3 ... + +# Output of 'str' command: + +# Classes 'sf' and 'data.frame': 35 obs. of 15 variables: +# $ District : chr "Cox's Bazar" "Cox's Bazar" "Cox's Bazar" "Cox's Bazar" ... +# $ Upazila : chr "Ukhia" "Ukhia" "Ukhia" "Ukhia" ... +# $ Settlement: chr "Collective site" "Collective site" "Collective site" "Collective site" ... +# $ Union : chr "Palong Khali" "Palong Khali" "Palong Khali" "Raja Palong" ... +# $ Name_Alias: chr "Bagghona-Putibonia" NA "Jamtoli-Baggona" "Kutupalong RC" ... +# $ SSID : chr "CXB-224" "CXB-203" "CXB-223" "CXB-221" ... +# $ SMSD__Cnam: chr "Camp 16" "Camp 02E" "Camp 15" "Camp KRC" ... +# $ NPM_Name : chr "Camp 16 (Potibonia)" "Camp 02E" "Camp 15 (Jamtoli)" "Kutupalong RC" ... +# $ Area_Acres: num 130.6 96.6 243.3 95.7 160.4 ... +# $ PeriMe_Met: num 4137 4803 4722 3095 4116 ... +# $ Camp_Name : chr "Camp 16" "Camp 2E" "Camp 15" "Kutupalong RC" ... +# $ Area_SqM : chr "528946.95881724" "391267.799744003" "985424.393160958" "387729.666427279" ... +# $ Latitude : chr "21.1563813298438" "21.2078084302778" "21.1606399787906" "21.2120281895357" ... 
+# $ Longitude : chr "92.1490685817901" "92.1643360947381" "92.1428956454661" "92.1638095873048" ... +# $ geometry :sfc_MULTIPOLYGON of length 35; first list element: List of 1 + +# This information can be extracted and used to document the data +``` + +The output of the script shows that the shape file contains 35 features (or "objects"; in this case each object represents a refugee camp) and 14 fields (attributes and variables; including information like the camp name, administrative region, surface area, and more) related to each object. + +The *geometry type* (multipolygon) and *dimension* (XY) provide information on the type of object. "All geometries are composed of points. Points are coordinates in a 2-, 3- or 4-dimensional space. All points in a geometry have the same dimensionality. In addition to X and Y coordinates, there are two optional additional dimensions: + + - a Z coordinate, denoting the altitude; + - an M coordinate (rarely used), denoting some measure that is associated with the point, rather than with the feature as a whole (in which case it would be a feature attribute); examples could be time of measurement, or measurement error of the coordinates. + +The four possible cases then are: + + - two-dimensional points refer to x and y, easting and northing, or longitude and latitude, referred to as XY + - three-dimensional points as XYZ + - three-dimensional points as XYM + - four-dimensional points as XYZM (the third axis is Z, the fourth is M) + +The following seven simple feature types are the most common: + +| Type | Description | +| ------------------ | ------------------------------------------------------------ | +| POINT | zero-dimensional geometry containing a single point | +| LINESTRING | sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry | +| POLYGON | geometry with a positive area (two-dimensional); sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring | +| MULTIPOINT | set of points; a MULTIPOINT is simple if no two Points in the MULTIPOINT are equal | +| MULTILINESTRING | set of linestrings | +| MULTIPOLYGON | set of polygons | +| GEOMETRYCOLLECTION | set of geometries of any type except GEOMETRYCOLLECTION | + +The remaining ten geometries are rarer : CIRCULARSTRING, COMPOUNDCURVE, CURVEPOLYGON, MULTICURVE, MULTISURFACE, CURVE, SURFACE, POLYHEDRALSURFACE, TIN, TRIANGLE (see https://r-spatial.github.io/sf/articles/sf1.html). + +The *geographic CRS* informs us on the coordinate reference system (CRS). Coordinates can only be placed on the Earth's surface when their CRS is known; this may be a spheroid CRS such as WGS 84, a projected, two-dimensional (Cartesian) CRS such as a UTM zone or Web Mercator, or a CRS in three-dimensions, or including time. In our example above, the CRS is the WGS 84 (World Geodetic System 84), a standard for use in cartography, geodesy, and satellite navigation including GPS. + +The *bbox* is the bounding box. + +Information on a subset (top 10 - only 2 shown above) of the features is displayed in the output of the script, with the list of the 14 available fields. +The `plot(al1)` command in R produces a visualization of the numeric fields in the data file: + +
+![](./images/geospatial_plot_vector.JPG){width=100%} +
+ +All this information represents important components of the metadata, which we will want to capture, enrich, and catalog (together with additional information) using the ISO metadata standard. "Enriching" (or "augmenting") the metadata will consist of providing more contextual information (who produced the data, when, why, etc.) and additional information on the features (e.g., what does the variable 'SMSD__Cnam' represent?). + +**Raster data** + +**Raster data** are made up of pixels, also referred to as *grid cells*. Satellite imagery and other remote sensing data are raster datasets. Grid cells in raster data are usually (but not necessarily) regularly-spaced and square. Data stored in a raster format is arranged in a grid without storing the coordinates of each cell (pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in a grid. + +Any given pixel in a grid stores one or more values (in one or more bands). For example, each cell (pixel) value in a satellite image has a red, a green, and a blue value. Cells in raster data could represent anything from elevation, temperature, rainfall, land cover, population density, or others. (Source: https://worldbank.github.io/OpenNightLights/tutorials/mod2_1_data_overview.html) + +Raster data can be **discrete** or **continuous**. Discrete rasters have distinct themes or categories. For example, one grid cell can represent a land cover class, or a soil type. In a discrete raster, each thematic class can be discretely defined (usually represented by an integer) and distinguished from other classes. In other words, each cell is definable and its value applies to the entire area of the cell. For example, the value 1 for a class might indicate "urban area", value 2 "forest", and value 3 "others". Continuous (or non-discrete) rasters are grid cells with gradual changing values, which could for example represent elevation, temperature, or an aerial photograph. + +The difference between vector and raster data, and between different types of vectors, is clearly illustrated in the figure below taken from the World Bank's [Light Every Night GitHub repository](https://worldbank.github.io/OpenNightLights/tutorials/mod2_1_data_overview.html). + +
+![](./images/geospatial_example_00c_vector_raster_2.JPG){width=100%} +
+ +In GIS applications, vector and raster data are often combined into multi-layer datasets, as shown in the figure below extracted from the [County of San Bernardino (US) website](http://sbcounty.gov/). + +
+![](./images/geospatial_example_00b_layers.JPG){width=75%} +
+ +We may occasionally want to convert raster data into vector data. For example, a building footprint layer (vector data, composed of polygons) can be derived from a satellite image (raster data). Such conversions can be implemented in a largely automated manner using machine learning algorithms. + +
+![](./images/geospatial_example_01_building_footprint.JPG){width=100%} +
+Source: https://blogs.bing.com/maps/2019-09/microsoft-releases-18M-building-footprints-in-uganda-and-tanzania-to-enable-ai-assisted-mapping
+
+Raster data are often provided in one of the following file formats:
+
+ - GeoTIFF (standard): most remote sensing data are stored as GeoTIFF files. https://www.ogc.org/standards/geotiff
+ - NetCDF (standard): https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html
+ - ECW: https://en.wikipedia.org/wiki/ECW_(file_format)
+ - JPEG 2000: https://fr.wikipedia.org/wiki/JPEG_2000
+ - MrSID: https://en.wikipedia.org/wiki/MrSID
+ - ArcGrid (ESRI Grid format)
+
+**GeoTIFF** is a popular file format for raster data. A *Tagged Image File Format* (TIFF or TIF) is a file format designed to store raster-type data. A GeoTIFF file is a TIFF file that contains specific tags to store structured geospatial metadata, including:
+
+ - Spatial extent: the area coverage of the file
+ - Coordinate reference system: the projection / coordinate reference system used
+ - Resolution: the spatial extent of each pixel (spatial resolution)
+ - Number of layers: number of layers or bands available in the file
+
+TIFF files can be read using (among other options) the R package [*raster*](https://cran.r-project.org/package=raster) or the Python library [*rasterio*](https://pypi.org/project/rasterio/).
+
+GeoTIFF files can also be provided as **Cloud Optimized GeoTIFFs (COGs)**. In COGs, the data are structured in a way that allows them to be shared via web services, which lets users query, visualize, or download a user-defined subset of the content of the file without having to download the entire file. This can be a major advantage, as GeoTIFF files generated by remote sensing/satellite imagery can be very large. Extracting only the relevant part of a file can save significant time and storage space.
+
+**Some examples**
+
+**EXAMPLE 1**
+
+The first example below shows the spatial distribution of the Ethiopian population in 2020. The data file was downloaded from the [WorldPop](https://www.worldpop.org/) website on 17 May 2021.
+
+![](./images/geospatial_example_script_worldpop_ETH.JPG){width=100%} +
+ + +```r +# Load the raster R package + +library(raster) + +# Download a TIF file (spatial distribution of population, Ethiopia, 2020) - 62Mb + +setwd("E:/my_data") +url <- "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif" +file_name = basename(url) +download.file(url, destfile = file_name, mode = 'wb') + +# Read the file and display core information about its content + +my_raster_file <- raster(file_name) +print(my_raster_file) + +# ------------------------------ +# Output of the 'print' command: +# ------------------------------ + +# dimensions : 13893, 17983, 249837819 (nrow, ncol, ncell) +# resolution : 0.0008333333, 0.0008333333 (x, y) +# extent : 32.99958, 47.98542, 3.322084, 14.89958 (xmin, xmax, ymin, ymax) +# crs : +proj=longlat +datum=WGS84 +no_defs +# source : E:/my_data/eth_ppp_2020_constrained.tif +# names : eth_ppp_2020_constrained +# values : 1.36248, 847.9389 (min, max) +``` + +This output shows that the TIF file contains one layer of cells, forming an image of 13,893 by 17,983 cells. It also provides information on the projection system (datum): WGS 84 (World Geodetic System 84). This information (and more) will be part of the ISO-compliant metadata we want to generate to document and catalog a raster dataset. + +**EXAMPLE 2** + +In the second example, we demonstrate the advantages of Cloud Optimized GeoTIFFS (COGs). We extract information from the World Bank [Light Every Night](https://worldbank.github.io/OpenNightLights/wb-light-every-night-readme.html#) repository. + + +```r +# Load 'aws.s3' package to access the Amazon Web Services (AWS) Simple Storage Service (s3) +library("aws.s3") + +# Load 'raster' package to read the target GeoTiFF +library("raster") + +# List files for World Bank bucket 'globalnightlight', setting a max number of items +contents <- get_bucket(bucket = 'globalnightlight', max = 10000) + +# Get_bucket_df is similar to 'get_bucket' but returns the list as a dataframe +contents <- get_bucket_df(bucket = 'globalnightlight') + +# Access DMSP-OLS data for satellite F12 in 1995 +F12_1995 <- get_bucket(bucket = 'globalnightlight', + prefix = "F121995") + +# As data.frame, with all objects listed +F12_1995_df <- get_bucket_df(bucket = 'globalnightlight', + prefix = "F121995", + max = Inf) +# Number of objects +nrow(F12_1995_df) + +# Save the object +filename <- "F12199501140101.night.OIS.tir.co.tif" +save_object(bucket = 'globalnightlight', + object = "F121995/F12199501140101.night.OIS.tir.co.tif", + file = filename) + +# Read it with raster package +rs <- raster(filename) +``` +
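+
+Whether the file is read locally (Example 1) or retrieved from a COG repository (Example 2), the properties that will feed the ISO metadata can be collected programmatically. A sketch, using the `rs` object created above:
+
+```r
+# Gather the core GeoTIFF properties (extent, CRS, resolution, dimensions,
+# number of bands) in a list, ready to be mapped to the corresponding
+# metadata elements
+raster_properties <- list(
+  extent     = as.vector(extent(rs)),  # xmin, xmax, ymin, ymax
+  crs        = projection(rs),         # coordinate reference system
+  resolution = res(rs),                # cell size (x, y)
+  dimensions = c(nrow(rs), ncol(rs)),  # number of rows and columns
+  bands      = nlayers(rs)             # number of layers/bands
+)
+str(raster_properties)
+```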
+
+### Describing data structures - The ISO 19115-2 and ISO 19110 standards
+
+The ISO 19115-2 standard provides the metadata elements needed to describe the structure of raster data. The ISO 19115-1 standard does not provide all the metadata elements needed to describe the structure of vector datasets. The description of data structures for vector data (also referred to as *feature types*) is therefore often omitted. The ISO 19110 standard solves that issue by providing the means to document the structure of vector datasets (column names and definitions, codes and value labels, measurement units, etc.), which will contribute to making the data more discoverable and usable.
+
+
+### Describing data services - The ISO 19119 standard
+
+More and more data are disseminated not in the form of datasets, but as data services via web applications. "Geospatial services provide the technology to create, analyze, maintain, and distribute geospatial data and information." (https://www.fws.gov/gis/) The ISO 19119 standard provides the elements needed to document such services.
+
+
+### Unified metadata specification - The ISO/TS 19139 standard
+
+The three metadata standards described above (ISO 19115 for vector and raster datasets, ISO 19110 for vector data structures, and ISO 19119 for data services) provide a set of concepts and definitions useful to describe geographic information. To facilitate their practical implementation, a digital specification, which defines how this information is stored and organized in an electronic metadata file, is required. The ISO/TS 19139 standard, an XML specification of ISO 19115, 19110, and 19119, was created for that purpose.
+
+The ISO/TS 19139 is a standard used worldwide to describe geographic information. It is the backbone for the implementation of [INSPIRE](https://inspire.ec.europa.eu/) dataset and service metadata in the European Union. It is supported by a wide range of tools, including desktop applications like [Quantum GIS](https://qgis.org/en/site/) and [ESRI ArcGIS](https://www.arcgis.com/index.html), and OGC-compliant metadata catalogs (e.g., [GeoNetwork](https://geonetwork-opensource.org/)) and geographic servers (e.g., [GeoServer](http://geoserver.org/)).
+
+ISO 19139-compliant metadata can be generated and edited using specialized metadata editors such as [CatMDEdit](http://catmdedit.sourceforge.net/) or [QSphere](https://www.fgdc.gov/organization/working-groups-subcommittees/mwg/iso-metadata-editors-registry/qsphere), or using programmatic tools like Java Apache SIS or the R packages [geometa](https://cran.r-project.org/web/packages/geometa/index.html) and [geoflow](https://github.com/eblondel/geoflow), among others.
+
+The ISO 19139 specification is complex. To enable and simplify its use in our NADA cataloguing application, we produced a JSON version of (part of) the standard. We selected the elements we considered most relevant for our purpose, and organized them into the JSON schema described below. For data curators with limited expertise in XML and geographic data documentation, this JSON schema will make the production of metadata compliant with the ISO 19139 standard easier.
+
+
+## Schema description
+
+The main (top-level) structure of the schema is shown below; its components are described in the following sections.
+
+```json +{ + "repositoryid": "string", + "published": 0, + "overwrite": "no", + "metadata_information": {}, + "description": {}, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
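+
+As an illustration of how such a record can be produced programmatically, the top-level structure above can be assembled as a nested list in R and serialized to JSON with the *jsonlite* package. This is only a sketch; the `repositoryid` value is hypothetical, and the empty elements are filled using the sections described in the rest of this chapter.
+
+```r
+library(jsonlite)
+
+# Skeleton of a metadata record following the structure above
+geo_metadata <- list(
+  repositoryid = "my-catalog",     # hypothetical repository identifier
+  published    = 0,                # publication status flag, as in the schema above
+  overwrite    = "no",
+  metadata_information = list(),   # to be completed (see the following sections)
+  description          = list(),   # the ISO 19139 content described below goes here
+  provenance           = list(),
+  tags                 = list(),
+  lda_topics           = list(),
+  embeddings           = list(),
+  additional           = list()
+)
+
+toJSON(geo_metadata, auto_unbox = TRUE, pretty = TRUE)
+```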
+
+### Introduction to ISO19139
+
+Geographic metadata (for both *datasets* and *services*) should include **core metadata properties**, and metadata **sections** aiming to describe specific aspects of the resource (e.g., resource identification or resource distribution).
+
+The content of some metadata elements is controlled by **codelists** (or **controlled vocabularies**). A codelist is a pre-defined set of values. The content of an element controlled by a codelist should be selected from that list. This may for example apply to the element "language", whose content should be selected from the ISO 639 list of codes for language names, instead of being free text. The ISO 19139 suggests but does not impose codelists. It is highly recommended to make use of the suggested codelists (or of specific codelists that may be promoted by agencies or partnerships).
+
+Some metadata elements (referred to as *common elements*) of the ISO 19139 can be repeated in different parts of a metadata file. For example, a standard set of fields is provided to describe a `contact`, a `citation`, or a `file format`. Such common elements can be used in multiple locations of a metadata file (e.g., to identify the contact person for information on data quality, on data access, on data documentation, etc.)
+
+In the following sections, we first present the **common elements**, then the elements that form the **core metadata properties** (information on the metadata themselves), followed by the elements from the main **metadata sections** used to describe the data, and finally the **features catalog** elements which are used to document attributes and variables related to vector data (ISO 19110).
+
+
+### Common sets of elements
+
+Common elements are blocks of metadata fields that can appear in multiple locations of a metadata file. For example, information on `contact` person(s) or organization(s) may have to be provided in the section of the file where we document the production and maintenance of the data, where we document the production and maintenance of the metadata, where we document the distribution and terms of use of the data, etc. Other types of common elements include online and offline resources, file formats, citations, keywords, constraints, and extent. We describe these sets of elements below.
+
+
+#### Contact / Responsible party
+
+The ISO 19139 specification provides a structured set of metadata elements to describe a **contact**. A contact is the party (person or organization) responsible for a specific task. The following set of elements can be used to describe a contact:
+
+| Element | Description |
+| ------------------ | ------------------------------------------------------------ |
+| `individualName` | Name of the individual |
+| `organisationName` | Name of the organization |
+| `positionName` | Position of the individual in the organization |
+| `contactInfo` | Contact information, divided into three sections: `phone` (including `voice` and/or `facsimile` numbers); `address`, handling the physical address elements (`deliveryPoint`, `city`, `postalCode`, `country`) and the contact e-mail (`electronicMailAddress`); and `onlineResource`, e.g., the URL of the organization website (which includes `linkage`, `name`, `description`, `protocol`, and `function`; see below) |
+| `role` | Role of the person/organization. A recommended controlled vocabulary is provided by ISO 19139, with the following options: `{resourceProvider, custodian, owner, sponsor, user, distributor, originator, pointOfContact, principalInvestigator, processor, publisher, author, coAuthor, collaborator, editor, mediator, rightsHolder, contributor, funder, stakeholder}` |
+```json +"contact": [ + { + "individualName": "string", + "organisationName": "string", + "positionName": "string", + "contactInfo": { + "phone": { + "voice": "string", + "facsimile": "string" + }, + "address": { + "deliveryPoint": "string", + "city": "string", + "postalCode": "string", + "country": "string", + "electronicMailAddress": "string" + }, + "onlineResource": { + "linkage": "string", + "name": "string", + "description": "string", + "protocol": "string", + "function": "string" + } + }, + "role": "string" + } +] +``` +
+
+
+#### Online resource
+
+An **online resource** is a common set of elements frequently used in the geographic data/services schema. It can be used, for example, to provide a link to an organization website, to a data file, or to a document. An online resource is described with the following properties:
+
+| Element | Description |
+| ------------- | ------------------------------------------------------------ |
+| `linkage` | URL of the online resource. In the case of geographic standard data services, only the base URL should be provided, without any service parameter. |
+| `name` | Name of the online resource. In the case of geographic standard data services, this should be filled with the identifier of the resource as published in the service. For example, for an OGC Web Map Service (WMS), we will use the layer name. |
+| `description` | Description of the online resource |
+| `protocol` | Web protocol used to get the resource, e.g., FTP, HTTP. In the case of a basic HTTP link, the ISO 19139 suggests the value 'WWW:LINK-1.0-http--link'. For geographic standard data services, it is recommended to fill this element with the appropriate protocol identifier. For an OGC Web Map Service (WMS) link for example, use 'OGC:WMS-1.1.0-http-get-map' |
+| `function` | Function (purpose) of the online resource. |
+
+```json +"onlineResource": { + "linkage": "string", + "name": "string", + "description": "string", + "protocol": "string", + "function": "string" +} +``` +
+
+#### Offline resource (Medium)
+
+An **offline resource** (medium) is a common set of elements that can be used to describe a physical resource used to distribute a dataset, e.g., a DVD or a CD-ROM. A `medium` is described with the following properties:
+
+| Element | Description |
+| ------------- | ------------------------------------------------------------ |
+| `name` | Name of the medium, e.g., 'dvd'. Recommended code following the [ISO/TS 19139](http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_MediumNameCode) MediumName codelist. Suggested values: {cdRom, dvd, dvdRom, 3halfInchFloppy, 5quarterInchFloppy, 7trackTape, 9trackTape, 3480Cartridge, 3490Cartridge, 3580Cartridge, 4mmCartridgeTape, 8mmCartridgeTape, 1quarterInchCartridgeTape, digitalLinearTape, onLine, satellite, telephoneLink, hardcopy} |
+| `density` | Density(ies) at which the data is recorded |
+| `densityUnit` | Unit(s) of measure for the recording density |
+| `volumes` | Number of items in the media identified |
+| `mediumFormat` | Method used to write to the medium, e.g., 'tar'. Recommended code following the [ISO/TS 19139](http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_MediumFormatCode) MediumFormat codelist. Suggested values: {cpio, tar, highSierra, iso9660, iso9660RockRidge, iso9660AppleHFS, udf} |
+| `mediumNote` | Description of other limitations or requirements for using the medium |
+
+
+#### File format
+
+The table below lists the ISO 19139 elements used to document a **file format**. A format is defined at a minimum by its `name`. It is also recommended to provide a `version`, and possibly a format `specification`. It is good practice to provide a standardized format name, using the file's *mime type*, e.g., `text/csv`, `image/tiff`. A list of mime types is available from the [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml) website.
+
+| Element | Description |
+| ---------------------------- | ------------------------------------------------- |
+| `name` | Format name - *Recommended* |
+| `version` | Format version (if applicable) - *Recommended* |
+| `amendmentNumber` | Amendment number (if applicable) |
+| `specification` | Name of the specification - *Recommended* |
+| `fileDecompressionTechnique` | Technique for file decompression (if applicable) |
+| `FormatDistributor` | Contact(s) responsible for the distribution |
+
+```json +"resourceFormat": [ + { + "name": "string", + "version": "string", + "amendmentNumber": "string", + "specification": "string", + "fileDecompressionTechnique": "string", + "FormatDistributor": { + "individualName": "string", + "organisationName": "string", + "positionName": "string", + "contactInfo": {}, + "role": "string" + } + } +] +``` +
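+
+For example, a GeoTIFF file could be described with a minimal, hypothetical `resourceFormat` entry, using the file's mime type as the standardized format name:
+
+```r
+library(jsonlite)
+
+# Hypothetical resourceFormat entry for a GeoTIFF file
+geotiff_format <- list(
+  name          = "image/tiff",  # mime type used as the standardized format name
+  specification = "GeoTIFF"
+)
+
+toJSON(list(resourceFormat = list(geotiff_format)), auto_unbox = TRUE, pretty = TRUE)
+```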
+
+#### Citation
+
+The **citation** is another common element that can be used in various parts of a geographic metadata file. Citations are used to provide detailed information on external resources related to the dataset or service being documented. A citation can be defined using the following set of (mostly optional) elements:
+
+| Element | Description |
+| ----------------------- | ------------------------------------------------------------ |
+| `title` | Title of the resource |
+| `alternateTitle` | An alternate title (if applicable) |
+| `date` | Date(s) associated with the resource, with sub-elements `date` and `type`. This may include different types of dates. The type of date should be provided, and selected from the controlled vocabulary proposed by the ISO 19139: date of `{creation, publication, revision, expiry, lastUpdate, lastRevision, nextUpdate, unavailable, inForce, adopted, deprecated, superseded, validityBegins, validityExpires, released, distribution}` |
+| `edition` | Edition of the resource |
+| `editionDate` | Edition date |
+| `identifier` | A unique persistent identifier for the metadata. If a DOI is available for the resource, the DOI should be entered here. The same `fileIdentifier` should be used if no other persistent identifier is available. |
+| `citedResponsibleParty` | Contact(s)/party(ies) responsible for the resource. |
+| `presentationForm` | Form in which the resource is made available. The ISO 19139 recommends the following controlled vocabulary: `{documentDigital, imageDigital, documentHardcopy, imageHardcopy, mapDigital, mapHardcopy, modelDigital, modelHardcopy, profileDigital, profileHardcopy, tableDigital, tableHardcopy, videoDigital, videoHardcopy, audioDigital, audioHardcopy, multimediaDigital, multimediaHardcopy, physicalSample, diagramDigital, diagramHardcopy}`. For a geospatial dataset or web-layer, the value `mapDigital` will be preferred. |
+| `series` | A description of the series, in case the resource is part of a series. This includes the series `name`, `issueIdentification`, and `page` |
+| `otherCitationDetails` | Any other citation details to specify |
+| `collectiveTitle` | A title in case the resource is part of a broader resource (e.g., a data collection) |
+| `ISBN` | International Standard Book Number (ISBN); an international standard identification number for uniquely identifying publications that are not intended to continue indefinitely. |
+| `ISSN` | International Standard Serial Number (ISSN); an international standard for serial publications. |
+
+```json +"citation": { + "title": "string", + "alternateTitle": "string", + "date": [ + { + "date": "string", + "type": "string" + } + ], + "edition": "string", + "editionDate": "string", + "identifier": { + "authority": "string", + "code": null + }, + "citedResponsibleParty": [], + "presentationForm": [ + "string" + ], + "series": { + "name": "string", + "issueIdentification": "string", + "page": "string" + }, + "otherCitationDetails": "string", + "collectiveTitle": "string", + "ISBN": "string", + "ISSN": "string" +} +``` +
+
+#### Keywords
+
+**Keywords** contribute significantly to making a resource more discoverable. Entering a list of relevant keywords is therefore highly recommended. Keywords can, but do not have to, be selected from a controlled vocabulary (thesaurus). Keywords are documented using the following elements:
+
+| Element | Description |
+| --------------- | ------------------------------------------------------------ |
+| `type` | Keywords type. The ISO 19139 provides a recommended controlled vocabulary with the following options: {`dataCenter`, `discipline`, `place`, `dataResolution`, `stratum`, `temporal`, `theme`, `dataCentre`, `featureType`, `instrument`, `platform`, `process`, `project`, `service`, `product`, `subTopicCategory`} |
+| `keyword` | The keyword itself. When possible, existing vocabularies should be preferred to writing free-text keywords. Examples of global vocabularies that can be valuable sources for referencing data domains/disciplines are the [*Global Change Master Directory*](http://vocab.nerc.ac.uk/collection/P04/current/) and the [UNESCO Thesaurus](http://vocabularies.unesco.org/browser/thesaurus/en/). |
+| `thesaurusName` | A reference to a thesaurus (if applicable) from which the keywords are extracted. The thesaurus itself should then be documented as a citation. |
+
+```json +"keywords": [ + { + "type": "string", + "keyword": "string", + "thesaurusName": "string" + } +] +``` +
+
+#### Constraints
+
+The **constraints** common set of elements is used to document *legal* and *security* constraints associated with the documented dataset or data service. In the schema, they are provided in the `resourceConstraints` element, which contains a `legalConstraints` and a `securityConstraints` block (see below). Both types of constraints have one property in common, `useLimitation`, used to describe the use limitation(s) as free text.
+
+```json +"resourceConstraints": [ + { + "legalConstraints": { + "useLimitation": [ + "string" + ], + "accessConstraints": [ + "string" + ], + "useConstraints": [ + "string" + ], + "otherConstraints": [ + "string" + ] + }, + "securityConstraints": { + "useLimitation": [ + "string" + ], + "classification": "string", + "userNote": "string", + "classificationSystem": "string", + "handlingDescription": "string" + } + } +] +``` +
+
+In addition to the `useLimitation` element, **legal constraints** (`legalConstraints`) can be described using the following three metadata elements:
+
+| Element | Description |
+| ------------------- | ------------------------------------------------------------ |
+| `accessConstraints` | Access constraints. The ISO 19139 provides a controlled vocabulary with the following options: `{copyright, patent, patentPending, trademark, license, intellectualPropertyRights, restricted, otherRestrictions, unrestricted, licenceUnrestricted, licenceEndUser, licenceDistributor, private, statutory, confidential, SBU, in-confidence}` |
+| `useConstraints` | Use constraints, entered as free text. What to enter will depend on the resource being described. It is recommended as best practice to fill this element; this is where *terms of use*, *disclaimers*, *preferred citation*, or even *data limitations* can be captured. |
+| `otherConstraints` | Any other constraints related to the resource. |
+
+In addition to the `useLimitation` element, **security constraints** (`securityConstraints`), which apply essentially to *classified* resources, can be described using the following four metadata elements:
+
+| Element | Description |
+| ---------------------- | ------------------------------------------------------------ |
+| `classification` | Classification code. The ISO 19139 provides a controlled vocabulary with the following options: `{unclassified, restricted, confidential, secret, topSecret, SBU, forOfficialUseOnly, protected, limitedDistribution}` |
+| `userNote` | Note to users (free text) |
+| `classificationSystem` | Information on the system used to classify the information. Organizations may have their own system to classify the information. |
+| `handlingDescription` | Additional free-text description of the classification |
+
+
+#### Extent
+
+The **extent** defines the boundaries of the dataset in space (horizontally and vertically) and in time. The ISO 19139 standard defines the extent as follows:
+
+| Element | Description |
+| ------------- | ------------------------------------------------------------ |
+| `geographicElement` | Spatial (horizontal) extent element. This can be defined either with a `geographicBoundingBox` providing the coordinates bounding the limits of the dataset, by means of four properties: `westBoundLongitude`, `eastBoundLongitude`, `southBoundLatitude`, `northBoundLatitude` (recommended); or using `geographicDescription`, free text that defines the area covered. When the dataset covers one or more countries, it is recommended to enter the country names in this element, as it can then be used in data catalogs for filtering by geography.|
+| `verticalElement` | Spatial (vertical) extent element, providing three properties: `minimumValue`, `maximumValue`, and `verticalCRS` (reference to the vertical coordinate reference system) |
+| `temporalElement` | Temporal extent element. Depending on the temporal characteristics of the dataset, this will consist of a `TimePeriod` (made of a `beginPosition` and `endPosition`) or a `TimeInstant` (made of a single `timePosition`) referencing date/time information according to [ISO 8601](https://www.iso.org/iso-8601-date-and-time-format.html) |
+
+```json +"extent": { + "geographicElement": [ + { + "geographicBoundingBox": { + "westBoundLongitude": -180, + "eastBoundLongitude": -180, + "southBoundLatitude": -180, + "northBoundLatitude": -180 + }, + "geographicDescription": "string" + } + ], + "temporalElement": [ + { + "extent": null + } + ], + "verticalElement": [ + { + "minimumValue": 0, + "maximumValue": 0, + "verticalCRS": null + } + ] +} +``` +
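+
+When the dataset is at hand, the bounding box does not need to be entered manually; it can be derived from the data and mapped to the `geographicBoundingBox` element. A sketch with the *sf* and *jsonlite* packages (the file name is hypothetical, and the layer is assumed to use geographic longitude/latitude coordinates):
+
+```r
+library(sf)
+library(jsonlite)
+
+layer <- st_read("my_vector_layer.shp")  # hypothetical file name
+bb <- st_bbox(layer)                     # xmin, ymin, xmax, ymax
+
+# Map the bounding box to the geographicBoundingBox element of the schema
+geographicBoundingBox <- list(
+  westBoundLongitude = unname(bb["xmin"]),
+  eastBoundLongitude = unname(bb["xmax"]),
+  southBoundLatitude = unname(bb["ymin"]),
+  northBoundLatitude = unname(bb["ymax"])
+)
+
+toJSON(geographicBoundingBox, auto_unbox = TRUE, pretty = TRUE)
+```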
+
+
+### Core metadata properties
+
+A set of elements is provided in the ISO 19139 to document the core properties of the metadata (not the data). With a few exceptions, these elements apply to the metadata related to datasets and data services. The table below summarizes these elements and their applicability. A description of the elements follows.
+
+| Property | Description | Used in *dataset metadata* | Used in *service metadata* |
+| ------------------------- | ------------------------------------------------------------ | ------------------------------- | ------------------------------- |
+| `fileIdentifier` | Unique persistent identifier for the resource | Yes | - |
+| `language` | Main language used in the metadata description | Yes | Yes |
+| `characterSet` | Character set encoding used in the metadata description | Yes | Yes |
+| `parentIdentifier` | Unique persistent identifier of the parent resource (if any) | Yes | Yes |
+| `hierarchyLevel` | Scope(s) / hierarchy level(s) of the resource. List of pre-defined values suggested by the ISO 19139. See details below. | Yes | Yes |
+| `hierarchyLevelName` | Alternative name definitions for hierarchy levels | Yes | Yes |
+| `contact` | Contact(s) associated with the metadata, i.e., persons/organizations in charge of the creation, edition, and maintenance of the metadata. For more details, see the section on ***common elements*** | Yes | Yes |
+| `dateStamp` | Date and time when the metadata record was created or updated | Yes | Yes |
+| `metadataStandardName` | Reference or name of the metadata standard used. | Yes | Yes |
+| `metadataStandardVersion` | Version of the metadata standard. For the ISO/TC211, the version corresponds to the creation/revision year. | Yes | Yes |
+| `dataSetURI` | Unique persistent link to reference the dataset | Yes | - |
+
+```json +"description": { + "idno": "string", + "language": "string", + "characterSet": { + "codeListValue": "string", + "codeList": "string" + }, + "parentIdentifier": "string", + "hierarchyLevel": [], + "hierarchyLevelName": [], + "contact": [], + "dateStamp": "string", + "metadataStandardName": "string", + "metadataStandardVersion": "string", + "dataSetURI": "string" +} +``` +
+
+
+#### Resource identifier (`idno`)
+
+The `idno` must provide a unique and persistent identifier for the resource (dataset or service). A common approach consists of building a _semantic identifier_, constructed by concatenating some owner and data characteristics. Although this approach offers the advantage of readability, it may not guarantee the identifier's global uniqueness and its persistence in time. The use of time periods and/or geographic extents as components of a file identifier is not recommended, as these elements may evolve over time. The use of random identifiers such as the Universally Unique Identifiers (UUID) is sometimes suggested as an alternative, but this approach is also not recommended. The use of [Digital Object Identifiers](https://www.doi.org/) (DOI) as global and unique file identifiers is recommended.
+
+#### Language (`language`)
+
+The metadata language refers to the main language used in the metadata. The recommended practice is to use the [ISO 639-2 Language Code List](http://www.loc.gov/standards/iso639-2/) (also known as the alpha-3 language code), e.g., 'eng' for English or 'fra' for French.
+
+#### Character set (`characterSet`)
+
+The character set encoding of the metadata description. The best practice is to use the `utf8` encoding codelist value (UTF-8 encoding). It is capable of encoding all valid character code points in Unicode, a standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML. UTF-8 is the most common encoding for the World Wide Web. Many text editors will provide you with an option to save your metadata (text) files in UTF-8, which will often be the default option (see below the example of Notepad++ and RStudio).
+
+![](./images/geospatial_encoding_utf8.JPG){width=100%} +
+
+#### Parent Identifier (`parentIdentifier`)
+
+A geographic data resource can be a subset of a larger dataset. For example, an aquatic species distribution map can be part of a data collection covering all species, or the 2010 population census dataset of a country can be part of a dataset that includes all population censuses for that country since 1900. In such cases, the _parent identifier_ metadata element can be used to identify this higher-level resource. Like the `fileIdentifier`, the `parentIdentifier` must be a unique identifier that is persistent in time. In a data catalog, a `parentIdentifier` will allow the user to move from one dataset to another. The `parentIdentifier` is generally applied to *datasets*, although it may in some cases be used in *data services* descriptions.
+
+#### Hierarchy level(s) (`hierarchyLevel`)
+
+```json +"hierarchyLevel": [ + "string" + ] +``` +
+
+The `hierarchyLevel` defines the scope of the resource. It indicates whether the resource is a collection, a dataset, a series, a service, or another type of resource. The ISO 19139 provides a controlled vocabulary for this element. It is recommended but not mandatory to make use of it. The most relevant levels for the purpose of cataloguing geographic data and services are **dataset** (for both raster and vector data), **service** (a capability which a service provider entity makes available to a service user entity through a set of interfaces that define a behavior), and **series**. `Series` will be used when the data represent an ordered succession, in time or in space; this will typically apply to time series, but it can also be used to describe other types of series (e.g., a series of ocean water temperatures collected at a succession of depths).
+
+The recommended controlled vocabulary for `hierarchyLevel` includes: `{dataset, series, service, attribute, attributeType, collectionHardware, collectionSession, nonGeographicDataset, dimensionGroup, feature, featureType, propertyType, fieldSession, software, model, tile, initiative, stereomate, sensor, platformSeries, sensorSeries, productionSeries, transferAggregate, otherAggregate}`
+
+#### Hierarchy level name(s) (`hierarchyLevelName`)
+
+```json +"hierarchyLevelName": [ + "string" +] +``` +
+
+The `hierarchyLevelName` provides an alternative to describe hierarchy levels, using free text instead of a controlled vocabulary. The use of `hierarchyLevel` is preferred to the use of `hierarchyLevelName`.
+
+#### Contact(s) (`contact`)
+
+The `contact` element is a common element described in the ***common elements*** section of this chapter. When associated with the **metadata**, it is used to identify the person(s) or organization(s) in charge of the creation, edition, and maintenance of the metadata. The contact(s) responsible for the metadata are not necessarily the ones who are responsible for the dataset/service creation/edition/maintenance. The latter will be documented in the dataset identification elements of the metadata file.
+
+#### Date stamp (`dateStamp`)
+
+The date stamp associated with the metadata. The metadata date stamp may be automatically filled by metadata editors, and will ideally use the standard ISO 8601 date format: YYYY-MM-DD (possibly with a time).
+
+#### Metadata standard name (`metadataStandardName`)
+
+The name of the geographic metadata standard used to describe the resource. The recommended values are:
+
+* in the case of vector dataset metadata: *ISO 19115 Geographic information - Metadata*
+* in the case of grid/imagery dataset metadata: *ISO 19115-2 Geographic Information - Metadata Part 2 Extensions for imagery and gridded data*
+* in the case of service metadata: *ISO 19119 Geographic information - Services*
+
+#### Metadata standard version (`metadataStandardVersion`)
+
+The version of the metadata standard being used. It is good practice to enter the standard's inception/revision year. ISO standards are revised with an average periodicity of 10 years. Although the ISO TC211 geographic information metadata standards have been revised, it is still accepted to refer to the original version of the standard, as many information systems/catalogs still make use of that version.
+
+The recommended values are:
+
+* in the case of vector dataset metadata: *ISO 19115:2003*
+* in the case of grid/imagery dataset metadata: *ISO 19115-2:2009*
+* in the case of service metadata: *ISO 19119:2005*
+
+#### Dataset URI (`dataSetURI`)
+
+A unique resource identifier for the dataset, such as a web link that uniquely identifies the dataset. The use of a [Digital Object Identifier (DOI)](https://www.doi.org/) is recommended.
+
+
+### Main metadata sections
+
+Geographic data can be diverse and complex. Users need detailed information to discover data and to use them in an informed and responsible manner. The core of the information on data will be provided in various *sections* of the metadata file. This will include information on the type of data, on the coordinate system being used, on the scope and coverage of the data, on the format and location of the data, on possible quality issues that users need to be aware of, and more. The table below summarizes the main metadata *sections*, by order of appearance in the ISO 19139 specification.
+
+```json +"description": { + "spatialRepresentationInfo": [], + "referenceSystemInfo": [], + "identificationInfo": [], + "contentInfo": [], + "distributionInfo": {}, + "dataQualityInfo": [], + "metadataMaintenance": {} +} +``` +
+
+| Section | Description | Usability in *dataset metadata* | Usability in *service metadata* |
+| --------------------------- | ------------------------------------------------------------ | ------------------------------- | ------------------------------- |
+| `spatialRepresentationInfo` | The spatial representation of the dataset. Distinction is made between *vector* and *grid* (raster) spatial representations. | Yes | - |
+| `referenceSystemInfo` | The reference systems used in the resource. In practice, this will often be limited to the geographic coordinate system. | Yes | Yes |
+| `identificationInfo` | Identifies the resource, including descriptive elements (e.g., title, purpose, abstract, keywords) and contact(s) having a role in the resource provision. See details below. | Yes | Yes |
+| `contentInfo` | The content of a dataset resource, i.e., how the dataset is structured (dimensions, attributes, variables, etc.). In the case of *vector* datasets, this relates to separate metadata files compliant with the ISO 19110 standard (Feature Catalogue). In the case of *raster* / *gridded* data, this is covered by the ISO 19115-2 extension for imagery and gridded data. | Yes | - |
+| `distributionInfo` | The mode(s) of distribution of the resource (format, online resources), and by whom it is distributed. | Yes | Yes |
+| `dataQualityInfo` | The quality reports on the resource (dataset or service), and, in the case of *datasets*, the provenance / lineage information describing the process steps performed to obtain the dataset. | Yes | Yes |
+| `metadataMaintenance` | The metadata maintenance cycle operated for the resource. | Yes | Yes |
+
+These sections are described in more detail below.
+
+#### Spatial representation (`spatialRepresentationInfo`)
+
+```json +"spatialRepresentationInfo": [ + { + "vectorSpatialRepresentation": { + "topologyLevel": "string", + "geometricObjects": [ + { + "geometricObjectType": "string", + "geometricObjectCount": 0 + } + ] + }, + "gridSpatialRepresentation": { + "numberOfDimensions": 0, + "axisDimensionProperties": [ + { + "dimensionName": "string", + "dimensionSize": 0, + "resolution": 0 + } + ], + "cellGeometry": "string", + "transformationParameterAvailability": true + } + } +] +``` +
+
+Information on the spatial representation is critical to properly describe a geospatial dataset. The ISO/TS 19139 distinguishes two types of spatial representations, characterized by different properties.
+
+The **vector spatial representation** describes the topology level and the geometric objects of **vector datasets** using the following two properties:
+
+- **Topology level** (`topologyLevel`) is the type of topology used in the vector spatial dataset. The ISO 19139 provides a controlled vocabulary with the following options: `{geometryOnly, topology1D, planarGraph, fullPlanarGraph, surfaceGraph, fullSurfaceGraph, topology3D, fullTopology3D, abstract}`. In most cases, vector datasets will be described as `geometryOnly`, which covers common geometry types (points, lines, polygons).
+- **Geometric objects** (`geometricObjects`) will define:
+   - Geometry type (`geometricObjectType`): The type of geometry handled. Possible values are: `{complex, composite, curve, point, solid, surface}`.
+   - Geometry count (`geometricObjectCount`): The number (count) of geometries in the dataset.
+
+In the case of a homogeneous geometry type, a single `geometricObjects` element can be defined. For complex geometries (mixture of various geometry types), one `geometricObjects` element will be defined for each geometry type.
+
+The **grid spatial representation** describes gridded (raster) data using the following three properties:
+
+- **Number of dimensions** (`numberOfDimensions`) in the grid.
+- **Axis dimension properties** (`axisDimensionProperties`): a list of each dimension including, for each dimension:
+   - The name of the dimension type (`dimensionName`): the ISO 19139 provides a controlled vocabulary with the following options: `{row, column, vertical, track, crossTrack, line, sample, time}`. These options represent the following:
+     - row: ordinate (y) axis
+     - column: abscissa (x) axis
+     - vertical: vertical (z) axis
+     - track: along the direction of motion of the scan point
+     - crossTrack: perpendicular to the direction of motion of the scan point
+     - line: scan line of a sensor
+     - sample: element along a scan line
+     - time: duration
+
+   In the Ethiopia population density file we used as an example of raster data, the dimension types will be row and column, as the file is a spatial 2D raster. If we had data with elevation or time dimensions, we would use the "vertical" and "time" dimension name types, respectively.
+
+   - The dimension size (`dimensionSize`): the length of the dimension.
+   - The dimension resolution: a resolution number associated with a unit of measurement. This is the resolution of the grid cell dimension. For example:
+     - for longitude/latitude dimensions, and a grid at 1deg x 5deg, the 'row' dimension will have a resolution of 1 deg, and the 'column' dimension will have a resolution of 5 deg
+     - for a "vertical" dimension, this will represent the elevation step. For example, the vertical resolution of the mean ozone concentration between 40m and 50m altitude at a location of longitude x / latitude y would be 10 m.
+     - similarly, in the case of a spatio-temporal grid, the "time" resolution will represent the time lag (e.g., 1 year, 1 month, 1 week, etc.) between two measures.

+
+- **Cell geometry type** (`cellGeometry`): The type of geometry used for grid cells. Possible values are: `{point, area, voxel, stratum}`. Most "grids" are commonly area-based, but in principle a grid goes beyond this and the grid cells can target a point, an area, or a volume.
+   - point: each cell represents a point
+   - area: each cell represents an area
+   - voxel: each cell represents a volumetric measurement on a regular grid in a three-dimensional space
+   - stratum: height range for a single point vertical profile
+
+#### Reference system(s) (`referenceSystemInfo`)
+
+The reference system information typically (but not necessarily) refers to the **geographic reference system** of the dataset. Multiple reference systems can be listed if a dataset is distributed with different spatial reference systems. This block of elements may also apply to *service metadata*. A spatial web-service may support several map projections / geographic coordinate reference systems.
+
+```json +"referenceSystemInfo": [ + { + "code": "string", + "codeSpace": "string" + } +] +``` +
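+
+When the data are available, the reference system can often be read directly from the file rather than entered by hand. A sketch, assuming the `al1` vector object and the `my_raster_file` raster object loaded in the earlier examples:
+
+```r
+library(sf)
+library(raster)
+
+# For a vector layer read with sf: returns the EPSG/SRID code (e.g., 4326 for WGS 84),
+# which can be used to fill the 'code' element, with 'EPSG' as the codeSpace
+st_crs(al1)$epsg
+
+# For a raster layer read with the raster package: returns the full CRS definition,
+# from which the authority code can be identified
+crs(my_raster_file)
+```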
+
+A reference system is defined by two properties:
+
+- the **identifier** of the reference system. The recommended practice is to use the `Spatial Reference IDentifier` (SRID) number. For example, the SRID of the World Geodetic System (WGS 84) is *4326*.
+- the **code space** of the source authority providing the SRID. The best practice is to use the [EPSG](https://epsg.org/home.html) authority code `EPSG` (as most geographic reference systems are registered in it). Codes from other authorities (such as ESRI) can also be used, for example for ad-hoc projections. Examples of codes include:
+   - ESRI:54012 (Eckert IV equal area projection)
+   - EPSG:4326 ([World Geodetic System 84 - aka WGS84](https://epsg.org/crs_4326/WGS-84.html)), the system used for GPS
+   - EPSG:3857 ([Web Mercator / Pseudo-Mercator](https://epsg.org/crs_3857/WGS-84-Pseudo-Mercator.html)), widely used for map visualization by web map tile providers.
+
+The main reference system registry is [EPSG](https://epsg.org/search/by-name), which provides a "search by name" tool for users who need to find a SRID (global or local/country-specific). Other websites, such as http://epsg.io/ and https://spatialreference.org/, also reference geographic systems but are not authoritative sources. The advantage of these sites is that they go beyond the EPSG registry and handle other specific registries provided by organizations like ESRI.
+
+The following ESRI projections could be relevant, in particular those in support of world equal-area projected maps (maps conserving area proportions):
+
+ - [ESRI:54012 (Eckert IV)](https://epsg.io/54012)
+ - [ESRI:54009 (Mollweide)](https://epsg.io/54009)
+ - [ESRI:54030 (Robinson)](https://epsg.io/54030)
+
+
+#### Identification (`identificationInfo`)
+
+The **identification information** (`identificationInfo`) is where the citation elements of the resource will be provided. This may include descriptive information like `title`, `abstract`, `purpose`, `keywords`, etc., and identification of the parties/contact(s) associated with the resource, such as the *owner*, *publisher*, *co-authors*, etc. Providing and publishing detailed information in these elements will contribute significantly to improving the discoverability of the data.
+
+```json +"identificationInfo": [ + { + "citation": {}, + "abstract": "string", + "purpose": "string", + "credit": "string", + "status": "string", + "pointOfContact": [], + "resourceMaintenance": [], + "graphicOverview": [], + "resourceFormat": [], + "descriptiveKeywords": [], + "resourceConstraints": [], + "resourceSpecificUsage": [], + "aggregationInfo": {}, + "extent": {}, + "spatialRepresentationType": "string", + "spatialResolution": {}, + "language": [], + "characterSet": [], + "topicCategory": [], + "supplementalInformation": "string", + "serviceIdentification": {} + } +] +``` +
+
+The identification of a resource includes elements that are common to both *datasets* and *data services*, and others that are specific to the type of resource. The following table summarizes the **identification elements** that can be used for *dataset*, *service*, or both.
+
+
+**Identification elements applicable to datasets and data services**
+
+The following metadata elements apply to resources of type **dataset** and **service**.
+
+| Element | Description |
+| ----------------------- | ------------------------------------------------------------ |
+| `citation` | A citation set of elements that will describe the dataset/service from a citation perspective, including `title`, associated contacts, etc. For more details, see the section on ***common elements*** |
+| `abstract` | An abstract for the dataset/service resource |
+| `purpose` | A statement describing the purpose of the dataset/service resource |
+| `credit` | Credit information. |
+| `status` | Status of the resource, with the following recommended controlled vocabulary: `{completed, historicalArchive, obsolete, onGoing, planned, required, underDevelopment, final, pending, retired, superseded, tentative, valid, accepted, notAccepted, withdrawn, proposed, deprecated}` |
+| `pointOfContact` | One or more points of contact associated with the resource, i.e., persons who can be contacted for information on the dataset/service. For more details, see `contact` in the ***common elements*** section of the chapter. |
+| `resourceMaintenance` | Information on how the resource is maintained, essentially informing on the maintenance and update frequency (`maintenanceAndUpdateFrequency`). This frequency should be chosen among possible values recommended by the ISO 19139 standard: `{continual, daily, weekly, fortnightly, monthly, quarterly, biannually, annually, asNeeded, irregular, notPlanned, unknown}`. |
+| `graphicOverview` | One or more graphic overview(s) that provide a visual identification of the dataset/service, e.g., a link to a map overview image. A `graphicOverview` will be defined with three properties: `fileName` (or URL), `fileDescription`, and optionally `fileType`. |
+| `resourceFormat` | Resource format(s) description. For more details on how to describe a format, see the ***common elements*** section of the chapter. |
+| `descriptiveKeywords` | A set of keywords that describe the dataset. Keywords are grouped by keyword type, with the possibility to associate a thesaurus (if applicable). For more details on how to describe keywords, see the ***common elements*** section of the chapter. |
+| `resourceConstraints` | Legal and/or security *constraints* associated with the resource. For more details on how to describe constraints, see the ***common elements*** section of the chapter. |
+| `resourceSpecificUsage` | Information about specific usage(s) of the dataset/service, e.g., a research paper, a success story, etc. |
+| `aggregationInfo` | Information on an aggregate or parent resource to which the resource belongs, e.g., a collection. |
+ +Resource maintenance +
+```json +"resourceMaintenance": [ + { + "maintenanceAndUpdateFrequency": "string" + } +] +``` +
+ +Graphic overview +
+```json +"graphicOverview": [ + { + "fileName": "string", + "fileDescription": "string", + "fileType": "string" + } +] +``` +
+ +Resource specific usage +
+```json
+"resourceSpecificUsage": [
+  {
+    "specificUsage": "string",
+    "usageDateTime": "string",
+    "userDeterminedLimitations": "string",
+    "userContactInfo": []
+  }
+]
+```
For `userContactInfo`, see the `Contact` description in the ***common elements*** section.
+ +Aggregation information +
+```json +"aggregationInfo": { + "aggregateDataSetName": "string", + "aggregateDataSetIdentifier": "string", + "associationType": "string", + "initiativeType": "string" +} +``` +
+ +**Identification elements applicable to datasets** + +The following metadata elements are specific to resources of type **dataset**. + +| Element | Description | +| --------------------------- | ------------------------------------------------------------ | +| `spatialRepresentationType` | The spatial representation type of the dataset. Values should be selected from the following controlled vocabulary: `{vector, grid, textTable, tin, stereoModel, video}` | +| `spatialResolution` | The spatial resolution of the data as numeric value associated to a unit of measure. | +| `language` | The language used in the dataset. | +| `characterSet` | The character set encoding used in the dataset. | +| `topicCategory` | The topic category(ies) characterizing the dataset resource. Values should be selected from the following controlled vocabulary: `{farming, biota, boundaries, climatologyMeteorologyAtmosphere, economy, elevation, environment, geoscientificInformation, health, imageryBaseMapsEarthCover, intelligenceMilitary, inlandWaters, location, oceans, planningCadastre, society, structure, transportation, utilitiesCommunication, extraTerrestrial, disaster}` | +| `extent` | Defines the spatial (horizontal and vertical) and temporal region to which the content of the resource applies. For more details, see the **common elements** section of the chapter| +| `supplementalInformation` | Any additional information, provided as free text. | + +
+Spatial resolution, language, characterset, and topic category +```json +"spatialResolution": { + "uom": "string", + "value": 0 +}, +"language": [ + "string" +], +"characterSet": [ + { + "codeListValue": "string", + "codeList": "string" + } +], +"topicCategory": [ + "string" +] +``` +
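+
+To make this concrete, here is a minimal, purely illustrative identification fragment for a vector dataset, assembled in R. Element names follow the schema above; all values are hypothetical:
+
+```r
+library(jsonlite)
+
+# Hypothetical identification fragment for a vector dataset
+identification <- list(
+  citation = list(title = "Example vector dataset (illustrative)"),
+  abstract = "Short abstract describing the dataset, its content, and its purpose.",
+  spatialRepresentationType = "vector",
+  topicCategory = list("society"),
+  language = list("eng"),
+  descriptiveKeywords = list(
+    list(type = "theme", keyword = "example keyword")
+  )
+)
+
+toJSON(list(identificationInfo = list(identification)), auto_unbox = TRUE, pretty = TRUE)
+```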
+
+**Identification elements applicable to data services**
+
+The following metadata elements are specific to resources of type **service**.
+
+| Element | Description |
+| -------------------- | ------------------------------------------------------------ |
+| `serviceType` | The type of service (as free text), e.g., OGC:WMS |
+| `serviceTypeVersion` | The version of the service, e.g., 1.3.0 |
+| `accessProperties` | Access properties, including description of `fees`, `plannedAvailableDateTime`, `orderingInstructions`, and `turnaround` |
+| `restrictions` | Legal and/or security constraints associated with the service. For more details, see the ***common elements*** section of the chapter. |
+| `keywords` | Set of service keywords. For more details, see the ***common elements*** section of the chapter. |
+| `extent` | Defines the spatial (horizontal and vertical) and temporal region to which the service applies (if applicable). See the ***common elements*** section of the chapter. |
+| `coupledResource` | Resource(s) coupled to a service operation, if any. |
+| `couplingType` | The type of coupling between service and coupled resources. Values should be selected from the following controlled vocabulary: `{loose, mixed, tight}` |
+| `containsOperations` | Operation(s) available for the service. See below for details. |
+| `operatesOn` | List of dataset identifiers on which the service operates. |
+
+```json +"serviceIdentification": { + "serviceType": "string", + "serviceTypeVersion": "string", + "accessProperties": { + "fees": "string", + "plannedAvailableDateTime": "string", + "orderingInstructions": "string", + "turnaround": "string" + }, + "restrictions": [], + "keywords": [], + "coupledResource": [ + { + "operationName": "string", + "identifier": "string" + } + ], + "couplingType": "string", + "containsOperations": [ + { + "operationName": "string", + "DCP": [ + "string" + ], + "operationDescription": "string", + "invocationName": "string", + "parameters": [ + { + "name": "string", + "direction": "string", + "description": "string", + "optionality": "string", + "repeatability": true, + "valueType": "string" + } + ], + "connectPoint": { + "linkage": "string", + "name": "string", + "description": "string", + "protocol": "string", + "function": "string" + }, + "dependsOn": [ + { } + ] + } + ], + "operatesOn": [ + { + "uuidref": "string" + } + ] +} +``` +
+
+
+##### Service operation
+
+A data service operation is described with the following metadata elements:
+
+| Element | Description |
+| ---------------------- | ------------------------------------------------------------ |
+| `operationName` | Name of the operation |
+| `DCP` | Distributed Computing Platform. Recommended value: 'WebServices' |
+| `operationDescription` | Description of the operation |
+| `invocationName` | Name of the operation as invoked when using the service |
+| `parameters` | Operation parameter(s). A parameter can be defined with several properties including `name`, `description`, `direction` (in, out, or 'inout'), `optionality` ('Mandatory' or 'Optional'), `repeatability` (true/false), and the `valueType` (type of value expected, e.g., string, numeric, etc.) |
+| `connectPoint` | Connection URL(s), defined as online resource(s) |
+| `dependsOn` | Service operation(s) the service operation depends on. |
+
+Describing the *service operations* is recommended when the *service* does not support the self-description of its operations.
+
+
+#### Content (`contentInfo`)
+
+For vector datasets, the ISO 19115-1 does not provide all necessary elements; the structure of vector datasets is therefore documented using the `featureCatalogueDescription` of the ISO 19110 (*Feature Catalogue*) standard. The ISO 19110 is included in the unified ISO 19139 XML specification.
+
+**Feature catalogue description** (`featureCatalogueDescription`)
+
+The Feature Catalogue description aims to link the structural metadata (ISO 19110) to the dataset metadata (ISO 19115). This will be required when the structural metadata is not contained in the same metadata file as the dataset metadata.^[In our JSON schema, the structural metadata and the dataset metadata are stored in the same container.] The following elements are used to document this relationship:
+
+| Element | Description |
+| -------------------------- | ------------------------------------------------------------ |
+| `complianceCode` | Indicates whether the dataset complies with the feature catalogue description |
+| `language` | Language used in the feature catalogue |
+| `includedWithDataset` | Indicates if the feature catalogue description is included with the dataset (essentially, as a downloadable resource) |
+| `featureCatalogueCitation` | A `citation` that references the ISO 19110 feature catalogue. As best practice, this citation will essentially use two properties: `uuidref`, giving the persistent identifier of the feature catalogue, and `href`, giving a web link to access the ISO 19110 feature catalogue. |
+
+```json +"contentInfo": [ + { + "featureCatalogueDescription": { + "complianceCode": true, + "language": "string", + "includedWithDataset": true, + "featureCatalogueCitation": { + "title": "string", + "alternateTitle": "string", + "date": [ + { + "date": "string", + "type": "string" + } + ], + "edition": "string", + "editionDate": "string", + "identifier": { + "authority": "string", + "code": null + }, + "citedResponsibleParty": [], + "presentationForm": [ + "string" + ], + "series": { + "name": "string", + "issueIdentification": "string", + "page": "string" + }, + "otherCitationDetails": "string", + "collectiveTitle": "string", + "ISBN": "string", + "ISSN": "string" + } + }, + "coverageDescription": { + "contentType": "string", + "dimension": [ + { + "name": "string", + "type": "string" + } + ] + } + } +] +``` +
+
+The feature catalog can be an external metadata file or document. We embedded it in our JSON schema. See the section **ISO 19110 Feature Catalogue** below.
+
+**Coverage description** (`coverageDescription`)
+
+The structure of raster/gridded datasets can be described using the ISO 19115-2 standard, using the `coverageDescription` element and the following properties:
+
+| Element | Description |
+| ------------- | ------------------------------------------------------------ |
+| `contentType` | Type of coverage content, e.g., 'image'. It is recommended to define the content type using the [controlled vocabulary](http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CoverageContentTypeCode) suggested by the ISO 19139, which contains the following values: {`image`, `thematicClassification`, `physicalMeasurement`, `auxillaryInformation`, `qualityInformation`, `referenceInformation`, `modelResult`, `coordinate`, `auxilliaryData`} |
+| `dimension` | List of coverage dimensions. Each dimension can be defined by a `name` and a `type`. For the `type`, a good practice is to rely on primitive data types defined in the XML Schema https://www.w3.org/2009/XMLSchema/XMLSchema.xsd |
+| `rangeElementDescription` | List of range element descriptions. Each range element description will have a `name`/`definition` (corresponding to the dimension considered), and a list of accepted values as `rangeElement`. For example, for a time series defined at specific instants in time, the `Time` dimension of the spatio-temporal coverage could be defined here, giving the list of time instants supported by the time series. |
+
+#### Distribution (`distributionInfo`)
+
+The distribution information identifies the actual *distributor* of the resource, and documents other aspects of the distribution in terms of _format_ and _online resources_. This information is provided using the following elements:
+
+| Element | Description |
+| -------------------- | ------------------------------------------------------------ |
+| `distributionFormat` | Format(s) definitions. See the ***common elements*** section for information on how to document a format. |
+| `distributor` | Contact(s) in charge of the resource distribution. See the ***common elements*** section for information on how to document a contact. |
+| `transferOptions` | Transfer option(s) to get the resource. To align with the ISO 19139, these resources should be set in an `onLine` element where all available online resources can be listed, or as `offLine` for media not available online. |
+
+```json +"distributionFormat": [ + { + "name": "string", + "version": "string", + "amendmentNumber": "string", + "specification": "string", + "fileDecompressionTechnique": "string", + "FormatDistributor": {} + } +] +``` +
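+
+To illustrate, a `distributionInfo` block could be written as follows in R (the organization name and URL below are hypothetical):
+
+```r
+# A minimal sketch of a distributionInfo block: one format, one distributor,
+# and one online transfer option (hypothetical values)
+distributionInfo = list(
+  distributionFormat = list(
+    list(name = "application/zip", specification = "ESRI Shapefile (zipped)")
+  ),
+  distributor = list(
+    list(organisationName = "Example Data Agency", role = "distributor")
+  ),
+  transferOptions = list(
+    list(
+      onLine = list(
+        list(linkage = "https://example.org/data/example_dataset.zip",
+             name = "example_dataset.zip",
+             description = "Data download (zipped shapefile)")
+      )
+    )
+  )
+)
+```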
+
+#### Data quality (`dataQualityInfo`)
+
+Information on the quality of the data will be useful to secondary analysts to ensure a proper use of the data. Data quality is documented in the section `dataQualityInfo` using three main metadata elements:
+
+| Element   | Description                                                  |
+| --------- | ------------------------------------------------------------ |
+| `scope`   | Scope / hierarchy level targeted by the data quality information section. The ISO 19139 recommends the use of a controlled vocabulary with the following options: {`attribute`, `attributeType`, `collectionHardware`, `collectionSession`, `dataset`, `series`, `nonGeographicDataset`, `dimensionGroup`, `feature`, `featureType`, `propertyType`, `fieldSession`, `software`, `service`, `model`, `tile`, `initiative`, `stereomate`, `sensor`, `platformSeries`, `sensorSeries`, `productionSeries`, `transferAggregate`, `otherAggregate`} |
+| `report`  | Report(s) describing the quality information, for example an [INSPIRE](https://inspire.ec.europa.eu/metadata/6541) metadata compliance report. See below for details on how to create a data quality conformance `report`. |
+| `lineage` | The `lineage` provides the elements needed to describe the process that led to the production of the data. In combination with `report`, the lineage allows data users to assess quality conformance. This is an important metadata element. |
+
+```json +"dataQualityInfo": [ + { + "scope": "string", + "report": [], + "lineage": { + "statement": "string", + "processStep": [] + } + } +] +``` +
+ +##### Report (`report`) + +
+```json +"report": [ + { + "DQ_DomainConsistency": { + "result": { + "nameOfMeasure": [], + "measureIdentification": "string", + "measureDescription": "string", + "evaluationMethodType": [], + "evaluationMethodDescription": "string", + "evaluationProcedure": {}, + "dateTime": "string", + "result": [] + } + } + } +] +``` +
+
+A `report` describes the *result* of an assessment of whether a resource conforms to a set of consistency rules. The `result` is the main component of a report; it can be described with the following elements:
+
+ - `nameOfMeasure`: One or more names of the measure(s) used for the data quality report
+ - `measureIdentification`: Identification of the measure, using a unique identifier (if applicable)
+ - `measureDescription`: A description of the measure
+ - `evaluationMethodType`: Type of evaluation method. The ISO 19139 recommends the use of a controlled vocabulary with the following options: `{directInternal, directExternal, indirect}`
+ - `evaluationMethodDescription`: Description of the evaluation method
+ - `evaluationProcedure`: Citation of the evaluation procedure (as a citation element)
+ - `dateTime`: Date and time when the report was established
+ - `result`: Result(s) associated with the report. Each result should be described with a `specification`, an `explanation` (of the conformance or non-conformance), and a `pass` property indicating whether the result was positive (true) or not (false).
+
+```json +"result": { + "nameOfMeasure": [ + "string" + ], + "measureIdentification": "string", + "measureDescription": "string", + "evaluationMethodType": [ + "string" + ], + "evaluationMethodDescription": "string", + "evaluationProcedure": { + "title": "string", + "alternateTitle": "string", + "date": [ + { + "date": "string", + "type": "string" + } + ], + "edition": "string", + "editionDate": "string", + "identifier": { + "authority": "string", + "code": null + }, + "citedResponsibleParty": [], + "presentationForm": [ + "string" + ], + "series": { + "name": "string", + "issueIdentification": "string", + "page": "string" + }, + "otherCitationDetails": "string", + "collectiveTitle": "string", + "ISBN": "string", + "ISSN": "string" + }, + "dateTime": "string", + "result": [] + } +} +``` +
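+
+For example, a conformance report could be provided as follows in R (the measure, date, and result shown are hypothetical):
+
+```r
+# A minimal sketch of a data quality conformance report (hypothetical values)
+report = list(
+  list(
+    DQ_DomainConsistency = list(
+      result = list(
+        nameOfMeasure = list("Conformity with the INSPIRE metadata implementing rules"),
+        evaluationMethodType = list("indirect"),
+        evaluationMethodDescription = "Conformance assessed against the corresponding abstract test suite",
+        dateTime = "2022-02-18",
+        result = list(
+          list(specification = "INSPIRE metadata implementing rules",
+               explanation = "See the referenced specification",
+               pass = TRUE)
+        )
+      )
+    )
+  )
+)
+```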
+
+##### Lineage (`lineage`)
+
+The `lineage` provides a structured way to describe the workflow that led to the production of the data or service. It is defined by:
+
+- a general `statement` describing the workflow performed
+- a sequence of process steps. Each `processStep` is defined by the following elements:
+  - `description`: Description of the process step performed
+  - `rationale`: Rationale of the process step
+  - `dateTime`: Date of the processing
+  - `processor`: Contact(s) acting as processor(s) for the target step
+  - `source`: Source(s) used for the process step. Each `source` can have a `description` and a `sourceCitation` (as a citation element).
+
+```json +"lineage": { + "statement": "string", + "processStep": [ + { + "description": "string", + "rationale": "string", + "dateTime": "string", + "processor": [], + "source": [ + { + "description": "string", + "sourceCitation": { + "title": "string", + "alternateTitle": "string", + "date": [ + { + "date": "string", + "type": "string" + } + ], + "edition": "string", + "editionDate": "string", + "identifier": { + "authority": "string", + "code": null + }, + "citedResponsibleParty": [], + "presentationForm": [ + "string" + ], + "series": { + "name": "string", + "issueIdentification": "string", + "page": "string" + }, + "otherCitationDetails": "string", + "collectiveTitle": "string", + "ISBN": "string", + "ISSN": "string" + } + } + ] + } + ] +} +``` +
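+
+In R, a lineage element could be provided as follows (the statement and process step below are hypothetical):
+
+```r
+# A minimal sketch of a lineage element (hypothetical statement and step)
+lineage = list(
+  statement = "Camp outlines digitized from VHR satellite imagery and verified in the field.",
+  processStep = list(
+    list(description = "Digitization of camp outlines from VHR imagery",
+         rationale = "Produce up-to-date camp boundaries",
+         dateTime = "2021-01-18",
+         source = list(
+           list(description = "VHR satellite imagery, January 2021")
+         ))
+  )
+)
+```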
+ +#### Metadata maintenance (`metadataMaintenanceInfo`) + +The `metadataMaintenanceInfo` and `maintenanceAndUpdateFrequency` elements provide information on the maintenance of the metadata including the frequency of updates. The `metadataMaintenanceInfo` element is a free text element. The information provided in `maintenanceAndUpdateFrequency` should be chosen from values recommended by the ISO 19139 controlled vocabulary with the following options: `{continual, daily, weekly, fortnightly, monthly, quarterly, biannually, annually, asNeeded, irregular, notPlanned, unknown}`. + +
+```json +"metadataMaintenance": { + "maintenanceAndUpdateFrequency": "string" +} +``` +
+
+
+## ISO 19110 Feature Catalogue (`feature_catalogue`)
+
+We describe below how the ISO 19110 feature catalogue is used to document the structure of a vector dataset (complementing the ISO 19115-1). This is equivalent to producing a "data dictionary" for the variables/features included in a vector dataset. An example of the implementation of such a feature catalogue using R is provided in the **Examples** section of this chapter (see Example 3 in section 5.5.3).
+
+| Element              | Description                                                  |
+| -------------------- | ------------------------------------------------------------ |
+| `name`               | Name of the feature catalogue |
+| `scope`              | Subject domain(s) of feature types defined in this feature catalogue |
+| `fieldOfApplication` | One or more fields of application for this feature catalogue |
+| `versionNumber`      | Version number of this feature catalogue, which may include both a major version number or letter and a sequence of minor release numbers or letters, such as '3.2.4a.' The format of this attribute may differ between cataloguing authorities. |
+| `versionDate`        | Version `date` |
+| `producer`           | The `responsibleParty` in charge of the feature catalogue production |
+| `functionalLanguage` | Formal functional language in which the feature operation formal definition occurs in this feature catalogue |
+| `featureType`        | One or more feature type(s) defined in the feature catalogue. The definition of several feature types can be considered when targeting various forms of a dataset (e.g., simplified vs. complete set of attributes, raw vs. aggregated, etc.). In practice, a simple ISO 19110 feature catalogue will reference one feature type describing the unique dataset structure. See details below. |
+
+```json +"feature_catalogue": { + "name": "string", + "scope": [], + "fieldOfApplication": [], + "versionNumber": "string", + "versionDate": {}, + "producer": {}, + "functionalLanguage": "string", + "featureType": [] +} +``` +
+ + The `featureType` is the actual data structure definition of a dataset (data dictionary), and has the following properties: + +| Element | Description | +| -------------------------- | ------------------------------------------------------------ | +| `typeName` | Text string that uniquely identifies this feature type within the feature catalogue that contains this feature type | +| `definition` | Definition of the feature type | +| `code` | Code that uniquely identifies this feature type within the feature catalogue that contains this feature type | +| `isAbstract` | Indicates if the feature type is abstract or not | +| `aliases` | One or more aliases as equivalent names of the feature type | +| `carrierOfCharacteristics` | Feature attribute(s) / column(s) definitions. See below details. | + +
+```json +"featureType": [ + { + "typeName": "string", + "definition": "string", + "code": "string", + "isAbstract": true, + "aliases": [ + "string" + ], + "carrierOfCharacteristics": [ + { + "memberName": "string", + "definition": "string", + "cardinality": { + "lower": 0, + "upper": 0 + }, + "code": "string", + "valueMeasurementUnit": "string", + "valueType": "string", + "listedValue": [ + { + "label": "string", + "code": "string", + "definition": "string" + } + ] + } + ] + } +] +``` +
+
+Each feature attribute (i.e., each column that is a member of the vector data structure) is defined as a `carrierOfCharacteristics`. Each set of characteristics can be defined with the following properties:
+
+| Element                | Description                                                  |
+| ---------------------- | ------------------------------------------------------------ |
+| `memberName`           | Name of the property member of the feature type |
+| `definition`           | Definition of the property member |
+| `cardinality`          | Definition of the member type cardinality. The cardinality is a set of two properties: lower cardinality (`lower`) and upper cardinality (`upper`). For simple tabular datasets, the cardinality will be 1-1. Multiple cardinalities (e.g., 1-N, N-N) apply particularly to feature catalogues/types that describe relational databases. |
+| `code`                 | Code for the attribute member of the feature type. Corresponds to the actual column name in an attributes table. |
+| `valueMeasurementUnit` | Measurement unit of the values (in case the feature member corresponds to a measurable variable) |
+| `valueType`            | Type of value. A good practice is to rely on the primitive data types defined in the XML Schema https://www.w3.org/2009/XMLSchema/XMLSchema.xsd |
+| `listedValue`          | List of controlled value(s) used in the attribute member. Each value corresponds to an object composed of 1) a `label`, 2) a `code` (as contained in the dataset), and 3) a `definition`. This element will be used when the feature member relates to reference datasets, such as code lists or registers (e.g., lists of countries, land cover types, etc.). |
+
+
+## Provenance
+
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ +**`provenance`** *[Optional ; Repeatable]*
+Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been done to the harvested metadata. These elements are NOT part of the ISO 19139 metadata standard.
+ +- **`origin_description`** *[Required ; Not repeatable]*
+The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+ A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the dataset (element `idno`) will be modified when published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The datestamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ The namespace of the metadata schema used by the originating catalog from which the metadata were harvested (typically provided as a URI).
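+
+For example, a provenance entry could be provided as follows in R (the URL, identifier, and dates below are hypothetical):
+
+```r
+# A minimal sketch of a provenance entry for harvested metadata (hypothetical values)
+provenance = list(
+  list(
+    origin_description = list(
+      harvest_date = "2022-02-18T10:30:00Z",
+      altered = TRUE,
+      base_url = "https://example.org/catalog/api/catalog",
+      identifier = "SRC_CATALOG_DATASET_001",
+      date_stamp = "2021-12-01"
+    )
+  )
+)
+```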
+ + +## Tags + +**`tags`** *[Optional ; Repeatable]*
+Tags provide an easy way to include custom facets in a NADA catalog. Data curators should consider using one or more controlled vocabularies to define the tags. See section 1.7 for more on the importance and use of tags and tag_groups in data catalogs.
+
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
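+
+For example (hypothetical tags and groups):
+
+```r
+# A minimal sketch: two tags assigned to user-defined groups that can be used
+# as filters (facets) in the catalog (hypothetical values)
+tags = list(
+  list(tag = "refugee camps", tag_group = "topics"),
+  list(tag = "Cox's Bazar", tag_group = "regions")
+)
+```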
+ +- **`tag`** *[Required ; Not repeatable ; String]*
+A user-defined tag. +- **`tag_group`** *[Optional ; Not repeatable ; String]*

+A user-defined group to which the tag belongs. Grouping tags allows implementation of controlled facets (filters) in data catalogs. + + +## LDA topics + +**`lda_topics`** *[Optional ; Not repeatable]*
+ +
+```json +"lda_topics": [ +{ +"model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+
+We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, that can be used to enrich metadata is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document (in this case, the "document" is a compilation of elements from the dataset metadata) can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
+
+The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition. + +:::note +Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated. +::: + +The `lda_topics` element includes the following metadata fields:
+ +- **`model_info`** *[Optional ; Not repeatable]*
+Information on the LDA model.
+ - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.
+
+ +- **`topic_description`** *[Optional ; Repeatable]*
+The topic composition of the document.
+ - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the document (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic. This is specific to the model, not to a document.
+
+
+The example below illustrates how the `lda_topics` element can be provided using R:
+
+```r
+lda_topics = list(
+
+  list(
+
+    model_info = list(
+      list(source = "World Bank, Development Data Group",
+           author = "A.S.",
+           version = "2021-06-22",
+           model_id = "Mallet_WB_75",
+           nb_topics = 75,
+           description = "LDA model, 75 topics, trained on Mallet",
+           corpus = "World Bank Documents and Reports (1950-2021)",
+           uri = "")
+    ),
+
+    topic_description = list(
+
+      list(topic_id = "topic_27",
+           topic_score = 32,
+           topic_label = "Education",
+           topic_words = list(list(word = "school", word_weight = ""),
+                              list(word = "teacher", word_weight = ""),
+                              list(word = "student", word_weight = ""),
+                              list(word = "education", word_weight = ""),
+                              list(word = "grade", word_weight = ""))),
+
+      list(topic_id = "topic_8",
+           topic_score = 24,
+           topic_label = "Gender",
+           topic_words = list(list(word = "women", word_weight = ""),
+                              list(word = "gender", word_weight = ""),
+                              list(word = "man", word_weight = ""),
+                              list(word = "female", word_weight = ""),
+                              list(word = "male", word_weight = ""))),
+
+      list(topic_id = "topic_39",
+           topic_score = 22,
+           topic_label = "Forced displacement",
+           topic_words = list(list(word = "refugee", word_weight = ""),
+                              list(word = "programme", word_weight = ""),
+                              list(word = "country", word_weight = ""),
+                              list(word = "migration", word_weight = ""),
+                              list(word = "migrant", word_weight = ""))),
+
+      list(topic_id = "topic_40",
+           topic_score = 11,
+           topic_label = "Development policies",
+           topic_words = list(list(word = "development", word_weight = ""),
+                              list(word = "policy", word_weight = ""),
+                              list(word = "national", word_weight = ""),
+                              list(word = "strategy", word_weight = ""),
+                              list(word = "activity", word_weight = "")))
+
+    )
+
+  )
+
+)
+```
+
+
+## Embeddings
+
+**`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. In this case, the text would be a compilation of selected elements of the dataset metadata. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).
+
+The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
+
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": { } + } +] +``` +
+ +The `embeddings` element contains four metadata fields: + +- **`id`** *[Optional ; Not repeatable ; String]*
+A unique identifier of the word embedding model used to generate the vector. + +- **`description`** *[Optional ; Not repeatable ; String]*
+A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + +- **`date`** *[Optional ; Not repeatable ; String]*
+The date the model was trained (or a version date for the model).
+
+- **`vector`** *[Required ; Not repeatable ; Object]*
+The numeric vector representing the document, provided as an object (an array of numbers), e.g., `[1, 4, 3, 5, 7, 9]`.
+
+
+## Additional
+
+**`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
+
+
+## Complete examples
+
+### Example 1 (vector - shape files): Bangladesh, Outline of camps of Rohingya refugees in Cox's Bazar, January 2021
+
+In this first example, we use a geographic dataset that contains the outline of Rohingya refugee camps, settlements, and sites in Cox's Bazar, Bangladesh. The dataset was imported from the [Humanitarian Data Exchange website](https://data.humdata.org/dataset/outline-of-camps-sites-of-rohingya-refugees-in-cox-s-bazar-bangladesh) on March 3, 2021.
+
+We include in the metadata a simple description of the features (variables) contained in the shape files. This information will significantly increase data discoverability, as it provides information on the content of the data files (which is not described elsewhere in the metadata).
+
+![](./images/geospatial_example_script_OCHA_BGD.JPG){width=100%} +
+
+----------
+**Generating the metadata using R**
+----------
+
+
+```r
+library(nadar)
+library(sf)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/")
+
+thumb = "shape_camps.JPG"
+
+# Download the data files (if not already downloaded)
+# Note: the data are frequently updated; the links below may have become invalid.
+# Visit: https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd for an update.
+
+base_url = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/"
+urls <- list(
+  paste0(base_url, "7cec91fb-d0a8-4781-9f8d-9b69772ef2fd/download/210118_rrc_geodata_al1al2al3.gdb.zip"),
+  paste0(base_url, "ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip"),
+  paste0(base_url, "bd5351e7-3ffc-4eaa-acbc-c6d917b5549c/download/200908_rrc_outline_camp_al1.kmz"),
+  paste0(base_url, "9d5693ec-eeb8-42ed-9b65-4c279f523276/download/200908_rrc_outline_block_al2.zip"),
+  paste0(base_url, "ed119ae4-b13d-4473-9afe-a8c36e07870b/download/200908_rrc_outline_block_al2.kmz"),
+  paste0(base_url, "0d2d87ae-52a5-4dca-b435-dcd9c617b417/download/210118_rrc_outline_subblock_al3.zip"),
+  paste0(base_url, "6286c4a5-d2ab-499a-b019-a7f0c327bd5f/download/210118_rrc_outline_subblock_al3.kmz")
+)
+
+for(url in urls) {
+  f <- basename(url)
+  if (!file.exists(f)) download.file(url, destfile=f, mode="wb")
+}
+
+# Unzip and read the shape files to extract information
+# The resulting objects contain the number of features, the layers, the geodetic CRS, etc.
+ +unzip("200908_rrc_outline_camp_al1.zip", exdir = "AL1") +al1 <- st_read("./AL1/200908_RRC_Outline_Camp_AL1.shp") + +unzip("200908_rrc_outline_block_al2.zip", exdir = "AL2") +al2 <- st_read("./AL2/200908_RRC_Outline_Block_AL2.shp") + +unzip("210118_rrc_outline_subblock_al3.zip", exdir = "AL3") +al3 <- st_read("./AL3/210118_RRC_Outline_SubBlock_AL3.shp") + +# --------------- + +id = "BGD_2021_COX_CAMPS_GEO_OUTLINE" + +my_geo_metadata <- list( + + metadata_information = list( + title = "(Demo) Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)", + producers = list(list(name = "NADA team")), + production_date = "2022-02-18" + ), + + description = list( + + idno = id, + + language = "eng", + + characterSet = list(codeListValue = "utf8"), + + hierarchyLevel = list("dataset"), + + contact = list( + list( + organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)", + contactInfo = list( + address = list(country = "Bangladesh"), + onlineResource = list( + linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/", + name = "Website" + ) + ), + role = "owner" + ) + ), + + dateStamp = "2021-01-20", + + metadataStandardName = "ISO 19115:2003/19139", + + spatialRepresentationInfo = list( + + # File 200908_rrc_outline_camp_al1.zip + list( + vectorSpatialRepresentationInfo = list( + topologyLevel = "geometryOnly", + geometricObjects = list( + geometricObjectType = "surface", + geometricObjectCount = "35" + ) + ) + ), + + # File 200908_rrc_outline_block_al2.zip + list( + vectorSpatialRepresentationInfo = list( + topologyLevel = "geometryOnly", + geometricObjects = list( + geometricObjectType = "surface", + geometricObjectCount = "173" + ) + ) + ), + + # File 210118_rrc_outline_subblock_al3.zip + list( + vectorSpatialRepresentationInfo = list( + topologyLevel = "geometryOnly", + geometricObjects = list( + geometricObjectType = "surface", + geometricObjectCount = "967" + ) + ) + ) + + ), + + referenceSystemInfo = list( + list(code = "4326", codeSpace = "EPSG"), + list(code = "84", codespace = "WGS") + ), + + identificationInfo = list( + + list( + + citation = list( + title = "Bangladesh, Outline of camps of Rohingya refugees in Cox's Bazar, January 2021", + date = list( + list(date = "2021-01-20", type = "creation") + ), + citedResponsibleParty = list( + list( + organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)", + contactInfo = list( + address = list(country = "Bangladesh"), + onlineResource = list( + linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/", + name = "Website" + ) + ), + role = "owner" + ) + ) + ), + + abstract = "These polygons were digitized through a combination of methodologies, originally using VHR satellite imagery and GPS points collected in the field, verified and amended according to Site Management Sector, RRRC, Camp in Charge (CiC) officers inputs, with technical support from other partners.", + + purpose = "Inform the UNHCR operations (and other support agencies') in refugee camps in Cox's Bazar.", + + credit = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)", + + status = "completed", + + pointOfContact = list( + list( + organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)", + contactInfo = list( + address = list(country = "Bangladesh"), + onlineResource = list( + linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/", + name = "Website" + ) + ), + role = "pointOfContact" + 
) + ), + + resourceMaintenance = list( + list(maintenanceOrUpdateFrequency = "asNeeded") + ), + + graphicOverview = list( # @@@@@@@@@@@@ + list(fileName = "", + fileDescription = "", + fileType = "") + ), + + resourceFormats = list( + list(name = "application/zip", + specification = "ESRI Shapefile (zipped)", + FormatDistributor = list(organisationName = "ESRI") + ), + list(name = "application/vnd.google-earth.kmz", + specification = "KMZ file", + FormatDistributor = list(organisationName = "Google") + ), + list(name = "ESRI Geodatabase", + FormatDistributor = list(organisationName = "ESRI") + ) + ), + + descriptiveKeywords = list( + list(keyword = "refugee camp"), + list(keyword = "forced displacement"), + list(keyword = "rohingya") + ), + + resourceConstraints = list( + list( + legalConstraints = list( + uselimitation = list("License: http://creativecommons.org/publicdomain/zero/1.0/legalcode"), + accessConstraints = list("unrestricted"), + useConstraints = list("licenceUnrestricted") + ) + ) + ), + + extent = list( + geographicElement = list( + list( + geographicBoundingBox = list( + southBoundLatitude = 20.91856, + westBoundLongitude = 92.12973, + northBoundLatitude = 21.22292, + eastBoundLongitude = 92.26863 + ) + ) + ) + ), + + spatialRepresentationType = "vector", + + language = list("eng") + + ) + + ), + + distributionInfo = list( + + distributionFormat = list( + list(name = "application/zip", + specification = "ESRI Shapefile (zipped)", + FormatDistributor = list(organisationName = "ESRI") + ), + list(name = "application/vnd.google-earth.kmz", + specification = "KMZ file", + FormatDistributor = list(organisationName = "Google") + ), + list(name = "ESRI Geodatabase", + FormatDistributor = list(organisationName = "ESRI") + ) + ), + + distributor = list( + list( + organisationName = "United Nations Office for the Coordination of Humanitarian Affairs (OCHA)", + contactInfo = list( + onlineResource = list( + linkage = "https://data.humdata.org/dataset/outline-of-camps-sites-of-rohingya-refugees-in-cox-s-bazar-bangladesh", + name = "Website" + ) + ) + ) + )#, + + # transferOptions = list( + # list( + # onLine = list( # @@@@@@@@ / use external resources schema? + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/7cec91fb-d0a8-4781-9f8d-9b69772ef2fd/download/210118_rrc_geodata_al1al2al3.gdb.zip", + # name = "210118_RRC_GeoData_AL1,AL2,AL3.gdb.zip", + # description = "This zipped geodatabase file (GIS) contains the Camp boundary (Admin level-1) and and camp-block boundary (admin level-2 or camp sub-division) and sub-block boundary of Rohingya refugee camps and administrative level-3 or sub block division of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip", + # name = "200908_RRC_Outline_Camp_AL1.zip", + # description = "This zipped shape file (GIS) contains the Camp boundary (Admin level-1) of Rohinya refugees in Cox's Bazar, Bangladesh. 
Updated: September 8, 2020", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/bd5351e7-3ffc-4eaa-acbc-c6d917b5549c/download/200908_rrc_outline_camp_al1.kmz", + # name = "200908_RRC_Outline_Camp_AL1.kmzKMZ", + # description = "This kmz file (Google Earth) contains the Camp boundary (Admin level-1) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/9d5693ec-eeb8-42ed-9b65-4c279f523276/download/200908_rrc_outline_block_al2.zip", + # name = "200908_RRC_Outline_Block_AL2.zip", + # description = "This zipped shape file (GIS) contains the camp-block boundary (admin level-2 or camp sub-division) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ed119ae4-b13d-4473-9afe-a8c36e07870b/download/200908_rrc_outline_block_al2.kmz", + # name = "200908_RRC_Outline_Block_AL2.kmzKMZ", + # description = "This kmz file (Google Earth) contains the camp-block boundary (admin level-2 or camp sub-division) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/0d2d87ae-52a5-4dca-b435-dcd9c617b417/download/210118_rrc_outline_subblock_al3.zip", + # name = "210118_RRC_Outline_SubBlock_AL3.zip", + # description = "This zipped shape file (GIS) contains the camp-sub-block (Admin level-3) of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/6286c4a5-d2ab-499a-b019-a7f0c327bd5f/download/210118_rrc_outline_subblock_al3.kmz", + # name = "210118_RRC_Outline_SubBlock_AL3.kmzKMZ", + # description = "This kmz file (Google Earth) contains the camp-sub-block (Admin level-3) of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021", + # protocol = "WWW:LINK-1.0-http--link" + # ) + # ) + # ) + # ) + + ), + + dataQualityInfo = list( + list( + scope = "dataset", + lineage = list( + statement = "The camps are continuously expanding, and Camp Boundaries are structured around the GoB, RRRC official governance structure of the camps, taking into account the potential new land allocation. The database is kept as accurate as possible, given these challenges." 
+ ) + ) + ), + + metadataMaintenance = list(maintenanceAndUpdateFrequency = "asNeeded"), + + feature_catalogue = list( + + name = "Feature Catalogue dataset xxxxx", + scope = list("3 shape files: al1, al2, al3"), + + featureType = list( + list( + typeName = "", + definition = "", + carrierOfCharacteristics = list( + list( + memberName = 'District', + definition = 'Cox s Bazar' + ), + list( + memberName = 'Upazila', + definition = 'Teknaf, Ukhia', + ), + list( + memberName = 'Settlement', + definition = 'Collective site; Collective site with host community', + ), + list( + memberName = 'Union', + definition = 'Baharchhara; Nhilla; Palong Khali; Raja Palong; Whykong', + ), + list( + memberName = 'Name_Alias', + definition = 'Alikhali; Bagghona-Putibonia; Camp 20 Extension; + Camp 4; Camp 4 Extension; Chakmarkul; Choukhali; + Hakimpara; Jadimura; Jamtoli-Baggona; Jomer Chora; + Kutupalong RC; Modur Chora; Nayapara; Nayapara RC; + Shamlapur; Tasnimarkhola; Tasnimarkhola-Burmapara; + Unchiprang' + ), + list( + memberName = 'SSID', + definition = 'CXB-017 to CXB-235', + ), + list( + memberName = 'SMSD__Cnam', + definition = 'Camp 01E; Camp 01W; Camp 02E; Camp 02W; Camp 03; Camp 04; + Camp 04X; Camp 05; Camp 06; Camp 07; Camp 08E; Camp 08W; + Camp 09; Camp 10; Camp 11; Camp 12; Camp 13; Camp 14; + Camp 15; Camp 16; Camp 17; Camp 18; Camp 19; Camp 20; + Camp 20X; Camp 21; Camp 22; Camp 23; Camp 24; Camp 25; + Camp 26; Camp 27; Camp KRC; Camp NRC; Choukhali', + ), + list( + memberName = 'NPM_Name', + definition = 'Camp 01E; Camp 01W; Camp 02E; Camp 02W; Camp 03; + Camp 04; Camp 04 Extension; Camp 05; Camp 06; ; Camp 07; + Camp 08E; Camp 08W; Camp 09; Camp 10; Camp 11; Camp 12; + Camp 13 Camp 14 (Hakimpara); Camp 15 (Jamtoli); + Camp 16 (Potibonia); Camp 17; Camp 18; Camp 19; Camp 20; + Camp 20 Extension; Camp 21 (Chakmarkul); Camp 22 (Unchiprang); + Camp 23 (Shamlapur); Camp 24 (Leda); Camp 25 (Ali Khali); + Camp 26 (Nayapara); Camp 27 (Jadimura); Choukhali; + Kutupalong RC; Nayapara RC', + ), + list( + memberName = 'Area_Acres', + definition = 'Area in acres', + ), + list( + memberName = 'PeriMe_Met', + definition = 'Perimeter in meters', + ), + list( + memberName = 'Camp_Name', + definition = 'Camp 10; Camp 11; Camp 12; Camp 13; Camp 14; Camp 15; + Camp 16; Camp 17; Camp 18; Camp 19; Camp 1E; Camp 1W; + Camp 20 Camp 20 Extension; Camp 21; Camp 22; Camp 23; + Camp 24; Camp 25; Camp 26; Camp 27; Camp 2E; Camp 2W; + Camp 3; Camp 4; Camp 4 Extension; Camp 5; Camp 6; + Camp 7; Camp 8E; Camp 8W; Camp 9; Choukhali; + Kutupalong RC; Nayapara RC', + ), + list( + memberName = 'Area_SqM', + definition = 'Area in square km', + ), + list( + memberName = 'Latitude' + ), + list( + memberName = 'Longitude' + ), + list( + memberName = 'geometry' + ) + #, + #... al2, al3 @@@@@@@@@ complete + ) + ) + ) + ) + + ) + +) + + +# Publish in NADA catalog + +geospatial_add( + idno = id, + metadata = my_geo_metadata, + repositoryid = "central", + published = 1, + thumbnail = thumb, + overwrite = "yes" +) + +# Add a link to HDX as an external resource + +external_resources_add( + title = "Humanitarian Data Exchange website", + idno = id, + dctype = "web", + file_path = "https://data.humdata.org/", + overwrite = "yes" +) +``` + +**The result in NADA** + +After running the script, the data and metadata will be available in NADA. + +
+![](./images/geo_example1_in_nada.JPG){width=100%} +
+ + +**Generating the metadata using Python** + + + + + +### Example 2 (vector, CSV data): Syria Refugee Sites (OCHA) + +The [Syria Refugee Sites](https://data.humdata.org/dataset/syria-refugee-sites) dataset used as a second example contains verified data about the geographic location (point geometry), name, and operational status of refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. Only refugee sites operated by the United Nations High Commissioner for Refugees (UNHCR) or the Government of Turkey are included. Data are provided as CSV, TSV and XLSX files. This example demonstrates the use of the ISO 19115 standard. + +---------- +**Generating the metadata using R** +---------- + + +```r +library(nadar) +library(sf) +library(sp) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_geo_data/") + +options(stringsAsFactors = FALSE) + +# Download and read the data file + +url = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/cc3e9e48-e363-404e-948b-e42d13c316d9/download/syria_refugeesites_2016jan21_hiu_dos.csv" +data_file = basename(url) +if(!file.exists(data_file)) download.file(url, destfile = data_file, mode = "wb") + +sf <- st_read(data_file) +sp <- as.data.frame(sf) +sp$Long <- as(sp$Long, "numeric") +sp$Lat <- as(sp$Lat, "numeric") +coordinates(sp) <- c("Long", "Lat") +proj4string(sp) <- CRS("+init=epsg:4326") + +# Generate the metadata + +id <- "EX2_SYR_REFUGEE_SITES" + +my_geo_data <- list( + + metadata_information = list( + title = "(Demo) Syria, Refugee Sites", + producers = list( + list(name = "NADA team") + ), + production_date = "2022-02-18" + ), + + description = list( + + idno = id, + + language = "eng", + + characterSet = list(codeListValue = "utf8"), + + hierarchyLevel = list("dataset"), + + contact = list( + list( + organisationName = "U.S. Department of State - Humanitarian Information Unit", + contactInfo = list( + address = list(electronicEmailAddress = "HIU_DATA@state.gov"), + onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website") + ), + role = "pointOfContact" + ) + ), + + dateStamp = "2018-06-18", + + metadataStandardName = "ISO 19115:2003/19139", + + spatialRepresentationInfo = list( + list( + vectorSpatialRepresentation = list( + topologyLevel = "geometryOnly", + geometricObjects = list( + list( + geometricObjectType = "point", + geometricObjectCounty = nrow(sp) + ) + ) + ) + ) + ), + + referenceSystemInfo = list( + list(code = "4326", codeSpace = "EPSG") + ), + + identificationInfo = list( + + list( + + citation = list( + title = "Syria Refugee Sites", + date = list( + list(date = "2016-01-14", type = "creation"), + list(date = "2016-02-04", type = "publication") + ), + identifier = list(authority = "IHSN", code = id), + citedResponsibleParty = list( + list( + individualName = "Humanitarian Information Unit", + organisationName = "U.S. 
Department of State - Humanitarian Information Unit", + contactInfo = list( + address = list( + electronicEmailAddress = "HIU_DATA@state.gov" + ), + onlineResource = list( + linkage = "http://hiu.state.gov/", + name = "Website" + ) + ), + role = "owner" + ) + ) + ), + + abstract = "The 'Syria Refugee Sites' dataset is compiled by the U.S. Department of State, Humanitarian Information Unit (INR/GGI/HIU). This dataset contains open source derived data about the geographic location (point geometry), name, and operational status of refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. Only refugee sites operated by the United Nations High Commissioner for Refugees (UNHCR) or the Government of Turkey are included. Compiled by the U.S Department of State, Humanitarian Information Unit (HIU), each attribute in the dataset (including name, location, and status) is verified against multiple sources. The name and status are obtained from UN and AFAD reporting and the UNHCR data portal (accessible at http://data.unhcr.org/syrianrefugees/regional.php). The locations are obtained from both the U.S. Department of State, PRM and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) (accessible at http://geonames.nga.mil/ggmagaz/). The name and status for each refugee site is verified with PRM. Locations are verified using high-resolution commercial satellite imagery and/or known areas of population. Additionally, all data is checked against various news sources. The data contained herein is entirely unclassified and is current as of 14 January 2016. The data is updated as needed.", + + purpose = "The 'Syria Refugee Sites' dataset contains verified data about the refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. This file is compiled by the U.S Department of State, Humanitarian Information Unit (HIU) and is used in the production of the unclassified 'Syria: Numbers and Locations of Syrian Refugees' map product (accessible at https://hiu.state.gov/Pages/MiddleEast.aspx). The data contained herein is entirely unclassified and is current as of 14 January 2016.", + + credit = "U.S. Department of State - Humanitarian Information Unit", + + status = "onGoing", + + pointOfContact = list( + list( + individualName = "Humanitarian Information Unit", + organisationName = "U.S. 
Department of State - Humanitarian Information Unit", + contactInfo = list( + address = list(electronicEmailAddress = "HIU_DATA@state.gov"), + onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website") + ), + role = "pointOfContact" + ) + ), + + resourceMaintenance = list( + list(maintenanceOrUpdateFrequency = "fortnightly") + ), + + # graphicOverview = list(), + + resourceFormat = list( + list( + name = "text/csv", + specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files" + ), + list( + name = "text/tab-separated-values", + specification = "Tab-Separated Values (CSV)" + ), + list( + name = "xlsx", + specification = "Microsoft Excel (XLSX)" + ) + ), + + descriptiveKeywords = list( + list(type = "theme", keyword = "Middle East"), + list(type = "theme", keyword = "Refugees"), + list(type = "theme", keyword = "Displacement"), + list(type = "theme", keyword = "Refugee Camps"), + list(type = "theme", keyword = "UNHCR"), + list(type = "place", keyword = "Syria"), + list(type = "place", keyword = "Turkey"), + list(type = "place", keyword = "Lebanon"), + list(type = "place", keyword = "Jordan"), + list(type = "place", keyword = "Iraq"), + list(type = "place", keyword = "Egypt") + ), + + resourceConstraints = list( + list( + legalConstraints = list( + uselimitation = list("License: Creative Commons Attribution 4.0 International License"), + accessConstraints = list("unrestricted"), + useConstraints = list("licenceUnrestricted") + ) + ), + list( + securityConstraints = list( + classification = "unclassified", + handlingDescription = "All data contained herein are strictly unclassified with no restrictions on distribution. Accuracy of geographic data is not assured by the U.S. Department of State." + ) + ) + ), + + extent = list( + geographicElement = list( + list( + geographicBoundingBox = list( + southBoundLatitude = bbox(sp)[2,1], + westBoundLongitude = bbox(sp)[1,1], + northBoundLatitude = bbox(sp)[2,2], + eastBoundLongitude = bbox(sp)[1,2] + ) + ) + ) + ), + + spatialRepresentationType = "vector", + + language = list("eng"), + + characterSet = list( + list(codeListValue = "utf8") + ), + + topicCategory = list("society") + + ) + + ), + + distributionInfo = list( + + distributionFormat = list( + list( + name = "text/csv", + specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files" + ), + list( + name = "text/tab-separated-values", + specification = "Tab-Separated Values (CSV)" + ), + list( + name = "xlsx", + specification = "Microsoft Excel (XLSX)" + ) + ), + + distributor = list( + list( + individualName = "Humanitarian Information Unit", + organisationName = "U.S. 
Department of State - Humanitarian Information Unit", + contactInfo = list( + address = list(electronicEmailAddress = "HIU_DATA@state.gov"), + onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website") + ), + role = "distributor" + ) + ) #, + + # transferOptions = list( + # list( + # onLine = list( + # list( + # linkage = "https://data.humdata.org/dataset/syria-refugee-sites", + # name = "Source metadata (HTML View)", + # protocol = "WWW:LINK-1.0-http--link", + # "function" = "Information" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/cc3e9e48-e363-404e-948b-e42d13c316d9/download/syria_refugeesites_2016jan21_hiu_dos.csv", + # name = "syria_refugeesites_2016jan21_hiu_dos.csv", + # description = "Data download (CSV)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/42f7884c-f54d-478c-a970-623945740e5d/download/syria_refugeesites_2016jan21_hiu_dos.tsv", + # name = "syria_refugeesites_2016jan21_hiu_dos.tsv", + # description = "Data download (TSV)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/59660c9a-e41a-4d54-bfc2-dd8fd1032c97/download/syria_refugeesites_2016jan21_hiu_dos.xlsx", + # name = "syria_refugeesites_2016jan21_hiu_dos.xlsx", + # description = "Data download (TSV)", + # protocol = "WWW:LINK-1.0-http--link" + # ) + # ) + # ) + # ) + + ), + + dataQualityInfo = list( + list( + scope = "dataset", + lineage = list( + statement = "Methodology: Compiled by the U.S Department of State, Humanitarian Information Unit (INR/GGI/HIU), each attribute in the dataset (including name, location, and status) is verified against multiple sources. The name and status are obtained from the UNHCR data portal (accessible at http://data.unhcr.org/syrianrefugees/regional.php). The locations are obtained from the U.S. Department of State, Bureau of Population, Refugees, and Migration (PRM) and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) (accessible at http://geonames.nga.mil/ggmagaz/). The name and status for each refugee site is verified with PRM. Locations are verified using high-resolution commercial satellite imagery and/or known areas of population. Additionally, all data is checked against various news sources." + ) + ) + ), + + metadataMaintenance = list(maintenanceAndUpdateFrequency = "fortnightly") + + ) + +) + +# Publish in NADA catalog + +geospatial_add( + idno = id, + metadata = my_geo_data, + repositoryid = "central", + published = 1, + thumbnail = NULL, + overwrite = "yes" +) +``` + + +---------- +**Generating the metadata using Python** +---------- + + +---------- +**The result in NADA** +---------- + + +### Example 3 (vector, with Feature Catalogue) - The GDIS (beta) dataset + +This example demonstrates the use of the ISO 19115 (geographic dataset) and ISO 19110 (feature catalogue). Documenting features contained in datasets makes the metadata richer and more discoverable. It is recommended to provide such information, which can easily be extracted from shape files and others. 
The dataset used for the example is the [Geocoded Disasters (GDIS) Dataset, v1 (1960-2018)](https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018) + + + +```r +library(nadar) +library(sf) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_geo_data/") + +thumb = "disaster.JPG" + +# Load the dataset (2 Gb) to extract some information + +load("pend-gdis-1960-2018-disasterlocations.rdata") +data = GDIS_disasterlocations +df = as.data.frame(GDIS_disasterlocations) +column_names = colnames(df)[!colnames(df) %in% c("geometry","centroid")] +exclude_listed_values_for = c("longitude", "latitude") #exclude ISO 19110 listed values for these columns + +# Generate the metadata + +id <- "GDIS_TEST_01" + +ttl = "Geocoded Disasters (GDIS) Dataset, v1 (1960–2018)" + +my_geo_data <- list( + + metadata_information = list( + title = ttl, + idno = id, + producers = list( + list(name = "NADA team") + ), + production_date = "2022-02-18", + version = "v1.0 2022-02" + ), + + description = list( + + idno = id, + language = "English", + characterSet = list( + codeListValue = "utf8", + codeList = "http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CharacterSetCode" + ), + hierarchyLevel = list("dataset"), + contact = list( + list( + organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)", + contactInfo = list( + phone = list( + voice = "+1 845-365-8920", + facsimile = "+1 845-365-8922" + ), + address = list( + deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000", + city = "Palisades, NY", + postalCode = "10964", + electronicEmailAddress = "ciesin.info@ciesin.columbia.edu" + ) + ), + role = "pointOfContact" + ) + ), + dateStamp = "2021-03-10", + metadataStandardName = "ISO 19115:2003/19139", + dataSetURI = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018", + + spatialRepresentationInfo = list( + list( + vectorSpatialRepresentation = list( + topologyLevel = "geometryOnly", + geometricObjects = list( + list( + geometricObjectType = tolower(as.character(st_geometry_type(data)[1])), + geometricObjectCounty = nrow(data) + ) + ) + ) + ) + ), + + referenceSystemInfo = list( + list(code = "4326", codeSpace = "EPSG") + ), + + identificationInfo = list( + list( + citation = list( + title = ttl, + date = list( + list(date = "2021-03-10", type = "publication") + ), + identifier = list(authority= "DOI", code = "10.7927/zz3b-8y61"), + citedResponsibleParty = list( + list( + individualName = "Rosvold, E., and H. Buhaug", + role = "owner" + ) + ), + edition = "1.00", + presentationForm = list("raster", "map", "map service"), + series = list( + name = "Scientific Data", + issueIdentification = "8:61" + ) + ), + abstract = "The Geocoded Disasters (GDIS) Dataset is a geocoded extension of a selection of natural disasters from the Centre for Research on the Epidemiology of Disasters' (CRED) Emergency Events Database (EM-DAT). The data set encompasses 39,953 locations for 9,924 disasters that occurred worldwide in the years 1960 to 2018. 
All floods, storms (typhoons, monsoons etc.), earthquakes, landslides, droughts, volcanic activity and extreme temperatures that were recorded in EM-DAT during these 58 years and could be geocoded are included in the data set. The highest spatial resolution in the data set corresponds to administrative level 3 (usually district/commune/village) in the Global Administrative Areas database (GADM, 2018). The vast majority of the locations are administrative level 1 (typically state/province/region).", + purpose = "To provide the subnational location for different types of natural disasters recorded in EM-DAT between 1960-2018.", + credit = "NASA Socioeconomic Data and Applications Center (SEDAC)", + status = "completed", + pointOfContact = list( + list( + organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)", + contactInfo = list( + phone = list( + voice = "+1 845-365-8920", + facsimile = "+1 845-365-8922" + ), + address = list( + deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000", + city = "Palisades, NY", + postalCode = "10964", + electronicEmailAddress = "ciesin.info@ciesin.columbia.edu" + ) + ), + role = "pointOfContact" + ) + ), + resourceMaintenance = list( + list(maintenanceOrUpdateFrequency = "asNeeded") + ), + graphicOverview = list( + list( + fileName = "https://sedac.ciesin.columbia.edu/downloads/maps/pend/pend-gdis-1960-2018/sedac-logo.jpg", + fileDescription = "Geocoded Disasters (GDIS) Dataset", + fileType = "image/jpeg" + ) + ), + resourceFormat = list( + list( + name = "OpenFileGDB", + specification = "ESRI - GeoDatabase" + ), + list( + name = "text/csv", + specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files" + ), + list( + name = "application/geopackage+sqlite3", + specification = "http://www.geopackage.org/spec/" + ) + ), + descriptiveKeywords = list( + list(type = "theme", keyword = "climatology"), + list(type = "theme", keyword = "meteorology"), + list(type = "theme", keyword = "atmosphere"), + list(type = "theme", keyword = "earth science", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "human dimension", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "natural hazard", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "drought", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "earthquake", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "flood", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "landslides", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "tropical cyclones", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "cyclones", + thesaurusName = "GCMD Science Keywords, Version 8.6"), + list(type = "theme", keyword = "volcanic eruption", + thesaurusName = "GCMD Science Keywords, Version 8.6") + ), + resourceConstraints = list( + list( + legalConstraints = list( + uselimitation = list( + "This work is licensed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). 
Users are free to use, copy, distribute, transmit, and adapt the work for commercial and non-commercial purposes, without restriction, as long as clear attribution of the source is provided.", + "Recommended citation: Rosvold, E.L., Buhaug, H. GDIS, a global dataset of geocoded disaster locations. Scientific Data 8, 61 (2021). https://doi.org/10.1038/s41597-021-00846-6." + ), + accessConstraints = list("unrestricted"), + useConstraints = list("licenceUnrestricted") + ) + ) + ), + extent = list( + geographicElement = list( + list( + geographicBoundingBox = list( + westBoundLongitude = -180, + eastBoundLongitude = 180, + southBoundLatitude = -58, + northBoundLatitude = 90 + ) + ) + )#, + # temporalElement = list( + # list( + # extent = list( + # TimePeriod = list( + # beginPosition = "1960-01-01", + # endPosition = "2018-12-31" + # ) + # ) + # ) + # ) + ), + spatialRepresentationType = "vector", + language = list("eng"), + characterSet = list( + list( + codeListValue = "utf8", + codeList = "http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CharacterSetCode" + ) + ) + ) + ), + + distributionInfo = list( + + distributionFormat = list( + list(name = "OpenFileGDB", + specification = "ESRI - GeoDatabase", + fileDecompressionTechnique = "Unzip"), + list(name = "text/csv", + specification = "RFC4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files", + fileDecompressionTechnique = "Unzip"), + list(name = "application/geopackage+sqlite3", + specification = "http://www.geopackage.org/spec/", + fileDecompressionTechnique = "Unzip") + ), + + distributor = list( + list( + organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)", + contactInfo = list( + phone = list( + voice = "+1 845-365-8920", + facsimile = "+1 845-365-8922" + ), + address = list( + deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. 
Box 1000", + city = "Palisades, NY", + postalCode = "10964", + electronicEmailAddress = "ciesin.info@ciesin.columbia.edu" + ) + ), + role = "pointOfContact" + ) + )#, + + # transferOptions = list( + # list( + # onLine = list( + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018", + # name = "Source metadata (HTML View)", + # protocol = "WWW:LINK-1.0-http--link", + # "function" = "Information" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-gdb.zip", + # name = "pend-gdis-1960-2018-disasterlocations-gdb.zip", + # description = "Data download (Geodatabase)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-gpkg.zip", + # name = "pend-gdis-1960-2018-disasterlocations-gpkg.zip", + # description = "Data download (GeoPackage)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-csv.zip", + # name = "pend-gdis-1960-2018-disasterlocations-csv.zip", + # description="Data download (CSV)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-priogrid-key-csv.zip", + # name = "pend-gdis-1960-2018-priogrid-key-csv.zip", + # description = "Data download (CSV)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-rdata.zip", + # name = "pend-gdis-1960-2018-disasterlocations-rdata.zip", + # description = "Data download (RData)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-replicationcode-r.zip", + # name = "pend-gdis-1960-2018-replicationcode-r.zip", + # description = "Source code (R)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-codebook.pdf", + # name = "pend-gdis-1960-2018-codebook.pdf", + # description = "Codebook (PDF)", + # protocol = "WWW:LINK-1.0-http--link" + # ) + # ) + # ) + # ) + ), + + dataQualityInfo = list( + list( + scope = "dataset", + lineage = list( + statement = "CIESIN follows procedures designed to ensure that data disseminated by CIESIN are of reasonable quality. If, despite these procedures, users encounter apparent errors or misstatements in the data, they should contact SEDAC User Services at +1 845-365-8920 or via email at ciesin.info@ciesin.columbia.edu. Neither CIESIN nor NASA verifies or guarantees the accuracy, reliability, or completeness of any data provided. CIESIN provides this data without warranty of any kind whatsoever, either expressed or implied. CIESIN shall not be liable for incidental, consequential, or special damages arising out of the use of any data provided by CIESIN." 
+ ) + ) + ), + + metadataMaintenance = list( + maintenanceAndUpdateFrequency = "asNeeded" + ) + + ), + + # Feature catalog (ISO 19110/19139) + + feature_catalogue = list( + name = sprintf("%s - Feature Catalogue", ttl), + featureType = list( + list( + typeName = ttl, + definition = "Disaster locations", + code = "pend-gdis-1960-2018-disasterlocations", + isAbstract = FALSE, + # carrierOfCharacteristics = lapply(column_names, function(column_name){ + # print(column_name) + # values = unique(df[,column_name]) + # values = values[order(values)] + # member = list( + # memberName = sprintf("Label for '%s'", column_name), + # definition = sprintf("Definition for '%s'", column_name), + # cardinality = list(lower = 1, upper = 1), + # code = column_name, + # valueType = switch(class(df[,column_name]), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + # valueMeasurementUnit = NA, + # listedValue = if(column_name %in% exclude_listed_values_for) {list()} else {lapply(values, function(x){ list(label = sprintf("Label for '%s'", x), code = x, definition = sprintf("Definition for '%s'", x)) })} + # ) + # return(member) + # }) + carrierOfCharacteristics = list( + list( + memberName = 'id', + definition = 'ID-variable identifying each disaster in the geocoded dataset. Contrary to disasterno each disaster in each country has a unique id number', + cardinality = list(lower = 1, upper = 1), + code = 'DFA01', # short for Disaster Feature Attribute 01 + valueType = switch(class(df[,'id']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'country', + definition = 'Name of the country within which the location is', + cardinality = list(lower = 1, upper = 1), + code = 'DFA02', + valueType = switch(class(df[,'country']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'iso3', + definition = 'Three-letter country code, ISO 3166-1', + cardinality = list(lower = 1, upper = 1), + code = 'DFA03', + valueType = switch(class(df[,'iso3']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'gwno', + definition = 'Gledistsch and Ward country code (Gleditsch & Ward, 1999)', + cardinality = list(lower = 1, upper = 1), + code = 'DFA04', + valueType = switch(class(df[,'gwno']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'geo_id', + definition = 'Unique ID-variable for each location', + cardinality = list(lower = 1, upper = 1), + code = 'DFA05', + valueType = switch(class(df[,'geo_id']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'geolocation', + definition = 'Name of the location of the observation, which corresponds to the highest (most disaggregated) level available. 
For instance, observations at the third administrative level will have geolocation values identical to the adm3 variable', + cardinality = list(lower = 1, upper = 1), + code = 'DFA06', + valueType = switch(class(df[,'geolocation']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'level', + definition = 'The administrative level of the observation, ranges from 1-3 where 3 is the most disaggregated', + cardinality = list(lower = 1, upper = 1), + code = 'DFA07', + valueType = switch(class(df[,'level']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'adm1', + definition = 'Name of administrative level 1 for the given location', + cardinality = list(lower = 1, upper = 1), + code = 'DFA08', + valueType = switch(class(df[,'adm1']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'adm2', + definition = 'Name of administrative level 2 for the given location', + cardinality = list(lower = 1, upper = 1), + code = 'DFA09', + valueType = switch(class(df[,'adm2']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'location', + definition = 'Name of administrative level 3 for the given location', + cardinality = list(lower = 1, upper = 1), + code = 'DFA10', + valueType = switch(class(df[,'location']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'historical', + definition = 'Marks whether the disaster happened in a country that has since changed, takes the value 1 if the disaster happened in a country that has since changed, and 0 if not', + cardinality = list(lower = 1, upper = 1), + code = 'DFA11', + valueType = switch(class(df[,'historical']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'hist_country', + definition = 'Name of country at the time of the disaster, if the observation takes the value 1 on the historical variable, this is different from the country variable', + cardinality = list(lower = 1, upper = 1), + code = 'DFA12', + valueType = switch(class(df[,'hist_country']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ), + list( + memberName = 'disastertype', + definition = 'Type of disaster as defined by EM-DAT (Guha-Sapir et al., 2014): flood, storm, earthquake, extreme temperature, landslide, volcanic activity, drought or mass movement (dry)', + cardinality = list(lower = 1, upper = 1), + code = 'DFA13', + valueType = switch(class(df[,'disastertype']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA', + listedValue = list( + list( + label = 'flood', + code = 'flood', + definition = 'A general term for the overflow of water from a stream channel onto normally dry land in the floodplain (riverine flooding), higher-than-normal levels along the coast and in lakes or reservoirs (coastal flooding) as well as ponding of water at or near the point where the rain fell (flash floods).' 
+ ), + list( + label = 'storm', + code = 'storm', + definition = 'A type of meteorological hazard generated by the heating of air and the availability of moist and unstable air masses. Convective storms range from localized thunderstorms (with heavy rain and/or hail, lightning, high winds, tornadoes) to meso-scale, multi-day events.' + ), + list( + label = 'earthquake', + code = 'earthquake', + definition = 'Sudden movement of a block of the Earth’s crust along a geological fault and associated ground shaking.' + ), + list( + label = 'extreme temperature', + code = 'extreme temperature', + definition = 'A general term for temperature variations above (extreme heat) or below (extreme cold) normal conditions.' + ), + list( + label = 'landslide', + code = 'landslide', + definition = 'Independent of the presence of water, mass movement may also be triggered by earthquakes.' + ), + list( + label = 'volcanic activity', + code = 'volcanic activity', + definition = 'A type of volcanic event near an opening/vent in the Earth’s surface including volcanic eruptions of lava, ash, hot vapor, gas, and pyroclastic material.' + ), + list( + label = 'drought', + code = 'drought', + definition = 'An extended period of unusually low precipitation that produces a shortage of water for people, animals, and plants. Drought is different from most other hazards in that it develops slowly, sometimes even over years, and its onset is generally difficult to detect. Drought is not solely a physical phenomenon because its impacts can be exacerbated by human activities and water supply demands. Drought is therefore often defined both conceptually and operationally. Operational definitions of drought, meaning the degree of precipitation reduction that constitutes a drought, vary by locality, climate and environmental sector.' + ), + list( + label = 'mass movement (dry)', + code = 'mass movement (dry)', + definition = 'Any type of downslope movement of earth materials.' + ) + ) + ), + list( + memberName = 'disasterno', + definition = 'ID-variable from EM-DAT (Guha-Sapir et al., 2014), use this to join the geocoded data with EM-DAT records in order to obtain information on the specific disasters', + cardinality = list(lower = 1, upper = 1), + code = 'DFA14', + valueType = switch(class(df[,'disasterno']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"), + valueMeasurementUnit = 'NA' + ) + + ) + ) + ) + ) +) + +# Publish in NADA catalog + +geospatial_add( + idno = id, + metadata = my_geo_data, + repositoryid = "central", + published = 1, + thumbnail = thumb, + overwrite = "yes") + +# Add links as external resources + +external_resources_add( + idno = id, + dctype = "web", + title = "Website: Geocoded Disasters (GDIS) Dataset, v1 (1960–2018)", + file_path = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018", + overwrite = "yes" +) +``` + + +### Example 4 (raster): Spatial distribution of the Ethiopian population in 2020 + +This fourth example makes use of elements from the ISO 19115 to document a dataset generated by the WorldPop program using data from multiple sources and machine learning models. "WorldPop develops peer-reviewed research and methods for the construction of open and high-resolution geospatial data on population distributions, demographic and dynamics, with a focus on low and middle income countries." As of March 1st, 2021 WorldPop was publishing over 44,600 datasets on its website. See https://www.worldpop.org/project/categories?id=3. + +
+![](./images/geospatial_example_script_worldpop_00.JPG){width=100%} +
+ +The selected example represents the spatial distribution of the Ethiopian population in 2020. + +
+![](./images/geospatial_example_script_worldpop_ETH.JPG){width=60%} +
+ + +---------- +**Generating the metadata using R** +---------- + + +```r +library(nadar) +library(raster) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_geo_data/") + +# Download and read the dataset + +url = "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif" +filename = basename(url) +if(!file.exists(filename)) download.file(url, destfile = filename, mode = "wb") +ras <- raster("eth_ppp_2020_constrained.tif") + +id <- "WP_ETH_POP" +thumb <- "ethiopia_pop.JPG" + +# Generate the metadata + +my_geo_data <- list( + + metadata_information = list( + title = "(Demo) Ethiopia Gridded Population 2020 (WorldPop)", + producers = list(list(name = "NADA team")), + production_date = "2022-02-18" + ), + + description = list( + + idno = id, + language = "eng", + characterSet = list(codeListValue = "utf8"), + hierarchyLevel = list("dataset"), + contact = list( + list(organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton", + contactInfo = list( + onlineResource = list( + linkage = "https://www.worldpop.org/", name = "Website" + ) + ), + role = "pointOfContact" + ) + ), + + dateStamp = "2020-09-20", + metadataStandardName = "ISO 19115:2003/19139", + + spatialRepresentationInfo = list( + + list( + gridSpatialRepresentationInfo = list( + numberOfDimensions = 2L, + axisDimensionproperties = list( + list( + dimensionName = "row", dimensionSize = dim(ras)[1] + ), + list( + dimensionName = "column", dimensionSize = dim(ras)[2] + ) + ), + cellGeometry = "area" + ) + ) + + ), + + referenceSystemInfo = list( + list(code = "4326", codeSpace = "EPSG") + ), + + identificationInfo = list( + + list( + + citation = list( + title = "Ethiopia population 2020", + alternateTitle = "Estimated total number of people per grid-cell at a resolution of 3 arc-seconds (approximately 100m at the equator)", + date=list( + list(date = "2020-09-12", type = "creation") + ), + identifier = list(authority = "DOI", code = id), + citedResponsibleParty = list( + list( + organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton", + contactInfo = list( + onlineResource = list( + linkage = "https://www.worldpop.org/", + name = "Website" + ) + ), + role = "owner" + ) + ) + ), + + abstract = "The spatial distribution of population in 2020, Ethiopia", + + credit = "World Pop - School of Geography and Environmental Science, University of Southampton", + + status = "completed", + + pointOfContact = list( + list( + organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton", + contactInfo = list( + onlineResource = list( + linkage = "https://www.worldpop.org/", + name = "Website" + ) + ), + role = "pointOfContact" + ) + ), + + resourceMaintenance = list( + list(maintenanceOrUpdateFrequency = "notPlanned") + ), + + graphicOverview = list( + list(fileName = thumb, fileDescription = "Ethiopia population 2020") + ), + + resourceFormat = list( + list(name = "image/tiff", specification = "GeoTIFF") + ), + + descriptiveKeywords = list( + list(type = "theme", keyword = 
"population density"), + list(type = "theme", keyword = "gridded population"), + list(type = "place", keyword = "Ethiopia") + ), + + resourceConstraints = list( + list( + legalConstraints = list( + accessConstraints = list("unrestricted"), + useConstraints = list("licenceUnrestricted"), + uselimitation = list( + "License: Creative Commons Attribution 4.0 International License", + "Recommended citation: Bondarenko M., Kerr D., Sorichetta A., and Tatem, A.J. 2020. Census/projection-disaggregated gridded population datasets for 51 countries across sub-Saharan Africa in 2020 using building footprints. WorldPop, University of Southampton, UK. doi:10.5258/SOTON/WP00682" + ) + ) + ) + ), + + extent = list( + geographicElement = list( + list( + geographicBoundingBox = list( + southBoundLatitude = bbox(ras)[2,1], + westBoundLongitude = bbox(ras)[1,1], + northBoundLatitude = bbox(ras)[2,2], + eastBoundLongitude = bbox(ras)[1,2] + ), + geographicDescription = "Ethiopia" + ) + ) + ), + + spatialRepresentationType = "grid", + + #spatialResolution = list(value = 3, uom = "arc_second"), + + language = list("eng"), + + characterSet = list( + list(codeListValue = "utf8") + ), + + topicCategory = list("society"), + + supplementalInformation = "References: + - Stevens FR, Gaughan AE, Linard C, Tatem AJ (2015) Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. PLoS ONE 10(2): e0107042. https://doi.org/10.1371/journal.pone.0107042 + - WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). + - Dooley, C. A., Boo, G., Leasure, D.R. and Tatem, A.J. 2020. Gridded maps of building patterns throughout sub-Saharan Africa, version 1.1. University of Southampton: Southampton, UK. Source of building footprints \"Ecopia Vector Maps Powered by Maxar Satellite Imagery\"© 2020. doi:10.5258/SOTON/WP00677 + - Bondarenko M., Nieves J. J., Stevens F. R., Gaughan A. E., Tatem A. and Sorichetta A. 2020. wpgpRFPMS: Random Forests population modelling R scripts, version 0.1.0. University of Southampton: Southampton, UK. https://dx.doi.org/10.5258/SOTON/WP00665 + - Ecopia.AI and Maxar Technologies. 2020. Digitize Africa data. http://digitizeafrica.ai" + + ) + ), + + distributionInfo = list( + + distributionFormat = list( + list(name = "image/tiff", specification = "GeoTIFF") + ), + distributor = list( + list( + organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton", + contactInfo = list( + onlineResource = list( + linkage = "https://www.worldpop.org/", + name = "Website" + ) + ), + role = "distributor" + ) + )#, + + # transferOptions = list( @@@ Use DC external resources? 
+ # list( + # onLine = list( + # list( + # linkage = "https://www.worldpop.org/geodata/summary?id=49635", + # name = "Source metadata (HTML View)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://www.worldpop.org/ajax/pdf/summary?id=49635", + # name = "Source metadata (PDF)", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif", + # name = "eth_ppp_2020_constrained.tif", + # description = "Data download (GeoTIFF)", + # protocol = "WWW:LINK-1.0-http--link" + # ) + # ) + # ) + # ) + + ), + + dataQualityInfo = list( + + list( + scope = "dataset", + lineage = list( + statement = "Data management workflow", + processStep = list( + list( + description = "This dataset was produced based on the 2020 population census/projection-based estimates for 2020 (information and sources of the input population data can be found here). Building footprints were provided by the Digitize Africa project of Ecopia.AI and Maxar Technologies (2020) and gridded building patterns derived from the datasets produced by Dooley et al. 2020. Geospatial covariates representing factors related to population distribution, were obtained from the \"Global High Resolution Population Denominators Project\" (OPP1134076)", + rationale = "Source data acquisition" + ), + list( + description = "The mapping approach is the Random Forests-based dasymetric redistribution developed by Stevens et al. (2015). The disaggregation was done by Maksym Bondarenko (WorldPop) and David Kerr (WorldPop), using the Random Forests population modelling R scripts (Bondarenko et al., 2020), with oversight from Alessandro Sorichetta (WorldPop).", + rationale = "Mapping" + ) + ) + ) + ) + + ), + + metadataMaintenance = list(maintenanceAndUpdateFrequency = "notPlanned") + + ) + +) + +# Publish the metadata in a NADA catalog + +geospatial_add( + idno = id, + metadata = my_geo_data, + repositoryid = "central", + published = 1, + thumbnail = thumb, + overwrite = "yes" +) + +# Add a link to WorldPop website as an external resource + +external_resources_add( + idno = id, + dctype = "web", + title = "WorldPop website", + file_path = "https://www.worldpop.org/", + overwrite = "yes" +) +``` + +---------- +**Generating the metadata using Python** +---------- + + +---------- +**The result in NADA** +---------- + + +### Example 5 (service): The United Nations Geospatial website + +The previous four examples documented geographic datasets (ISO 19115). In this fourth example, we document a **geographic service** using elements from the ISO 19119 standard. The service described in this example is the [United Nations Clear Map application](https://geoservices.un.org/Html5Viewer/index.html?viewer=clearmap) from [United Nations Geospatial](https://www.un.org/geospatial/). + +
+![](./images/geospatial_example_script_UN.JPG){width=100%} +
+ +---------- +**Generating the metadata using R** +---------- + + +```r +library(nadar) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_geo_data/") + +thumb = "un_clear_map.JPG" + +id = "UN_GEO_CLEAR-MAP" + +my_geo_service <- list( + + metadata_information = list( + idno = id, + title = "United Nations Geospatial, Clear Map", + producers = list( + list(name = "NADA team") + ), + production_date = "2022-02-18", + version = "v1.0 2022-02" + ), + + description = list( + + idno = id, + language = "eng", + characterSet = list(codeListValue = "utf8"), + hierarchyLevel = list("service"), + contact = list( + list( + organisationName = "United Nations Geospatial", + contactInfo = list( + address = list( + electronicEmailAddress = "gis@un.org" + ), + onlineResource = list( + linkage = "https://www.un.org/geospatial", + name = "Website" + ) + ), + role = "owner" + ) + ), + dateStamp = "2022-02-22", + metadataStandardName = "ISO 19119:2005/19139", + + referenceSystemInfo = list( + list(code = "3857", codeSpace = "EPSG") + ), + + identificationInfo = list( + + list( + citation = list( + title = "United Nations Clear Map - OGC Web Map Service", + date = list( + list(date = "2019-08-19", type = "creation"), + list(date = "2020-03-19", type ="lastUpdate") + ), + citedResponsibleParty = list( + list( + organisationName = "United Nations Geospatial", + contactInfo = list( + address = list(electronicEmailAddress = "gis@un.org"), + onlineResource = list( + linkage = "https://www.un.org/geospatial", + name = "Website" + ) + ), + role = "owner" + ) + ) + ), + + abstract = "The United Nations Clear Map (hereinafter 'Clear Map') is a background reference web mapping service produced to facilitate 'the issuance of any map at any duty station, including dissemination via public electronic networks such as Internet' and 'to ensure that maps meet publication standards and that they are not in contravention of existing United Nations policies' in accordance with the in the Administrative Instruction on 'Regulations for the Control and Limitation of Documentation - Guidelines for the Publication of Maps' of 20 January 1997 (http://undocs.org/ST/AI/189/Add.25/Rev.1).", + purpose = "Clear Map is created for the use of the United Nations Secretariat and community. 
All departments, offices and regional commissions of the United Nations Secretariat including offices away from Headquarters using Clear Map remain bound to the instructions as contained in the Administrative Instruction and should therefore seek clearance from the UN Geospatial Information Section (formerly Cartographic Section) prior to the issuance of their thematic maps using Clear Map as background reference.", + credit = "Produced by: United Nations Geospatial Contributor: UNGIS, UNGSC, Field Missions CONTACT US: Feedback is appreciated and should be sent directly to: Email:Clearmap@un.org / gis@un.org (UNCLASSIFIED) (c) UNITED NATIONS 2018", + status = "onGoing", + + pointOfContact = list( + list( + organisationName = "United Nations Geospatial", + contactInfo = list( + address = list(electronicEmailAddress = "gis@un.org"), + onlineResource = list(linkage = "https://www.un.org/geospatial", name = "Website") + ), + role = "pointOfContact" + ) + ), + + resourceMaintenance = list( + list(maintenanceOrUpdateFrequency = "asNeeded") + ), + + graphicOverview = list( + list( + fileName = "https://geoportal.dfs.un.org/arcgis/sharing/rest/content/items/6f4eb9e136ee43758a62f587ceb0da01/info/thumbnail/thumbnail1567157577600.png", + fileDescription = "Service overview", + fileType = "image/png" + ) + ), + + resourceFormat = list( + list(name = "PNG32"), + list(name = "PNG24"), + list(name = "PNG"), + list(name = "JPG"), + list(name = "DIB"), + list(name = "TIFF"), + list(name = "EMF"), + list(name = "PS"), + list(name = "PDF"), + list(name = "GIF"), + list(name = "SVG"), + list(name = "SVGZ"), + list(name = "BMP") + ), + + descriptiveKeywords = list( + list(type = "theme", keyword = "wms"), + list(type = "theme", keyword = "united nations"), + list(type = "theme", keyword = "global boundaries"), + list(type = "theme", keyword = "ocean coastline"), + list(type = "theme", keyword = "authoritative") + ), + + resourceConstraints = list( + list( + legalConstraints = list( + uselimitation = list("The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. + Final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined. + Final status of the Abyei area is not yet determined. + * Dotted line represents approximately the Line of Control in Jammu and Kashmir agreed upon by India and Pakistan. The final status of Jammu and Kashmir has not yet been agreed upon by the parties. + ** Chagos Archipelago appears without prejudice to the question of sovereignty. 
+ *** A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas)."), + accessConstraints = list("unrestricted"), + useConstraints = list("licenceUnrestricted") + ) + ) + ), + + extent = list( + geographicElement = list( + list( + geographicBoundingBox = list( + southBoundLongitude = -1.4000299034940418, + westBoundLongitude = -1.40477223188626, + northBoundLongitude = 2.149247026187029, + eastBoundLongitude = 1.367128649366541 + ) + ) + ) + ), + + topicCategory = list("boundaries", "oceans"), + + serviceIdentification = list( + serviceType = "OGC:WMS", + serviceTypeVersion = "1.1.0" + ) + ) + ), + + distributionInfo = list( + + distributionFormat = list( + list(name = "PNG32"), + list(name = "PNG24"), + list(name = "PNG"), + list(name = "JPG"), + list(name = "DIB"), + list(name = "TIFF"), + list(name = "EMF"), + list(name = "PS"), + list(name = "PDF"), + list(name = "GIF"), + list(name = "SVG"), + list(name = "SVGZ"), + list(name = "BMP") + ), + + distributor = list( + list( + organisationName = "United Nations Geospatial", + contactInfo = list( + address = list(electronicEmailAddress = "gis@un.org"), + onlineResource = list( + linkage = "https://www.un.org/geospatial", + name = "Website" + ) + ), + role = "owner" + ) + ) + #, + + # transferOptions = list( + # list( + # onLine = list( + # list( + # linkage = "https://geoportal.dfs.un.org/arcgis/home/item.html?id=541557fd0d4d42efb24449be614e6887", + # name = "Original metadata", + # description = "Original metadata from UN ClearMap portal", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://geoportal.dfs.un.org/arcgis/sharing/rest/content/items/541557fd0d4d42efb24449be614e6887/data", + # name = "UN ClearMap WMS map service user guide", + # description = "How to import and use WMS services of the UN Clear map", + # protocol = "WWW:LINK-1.0-http--link" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Dark/MapServer?service=WMS", + # name = "ClearMap_Dark", + # description = "ClearMap Dark WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Gray/MapServer?service=WMS", + # name = "ClearMap_Gray", + # description = "ClearMap Gray WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Imagery/MapServer?service=WMS", + # name = "ClearMap_Imagery", + # description = "ClearMap Imagery WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Plain/MapServer?service=WMS", + # name = "ClearMap_Plain", + # description = "ClearMap Plain WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Topo/MapServer?service=WMS", + # name = "ClearMap_Topo", + # description = "ClearMap Topo WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebDark/MapServer?service=WMS", + # name = "ClearMap_WebDark", + # description = "ClearMap WebDark WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebGray/MapServer?service=WMS", + # name = "ClearMap_WebGray", + # description = "ClearMap 
WebGray WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebPlain/MapServer?service=WMS", + # name = "ClearMap_WebPlain", + # description = "ClearMap WebPlain WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ), + # list( + # linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebTopo/MapServer?service=WMS", + # name = "ClearMap_WebTopo", + # description = "ClearMap WebTopo WMS", + # protocol = "OGC:WMS-1.1.0-http-get-map" + # ) + # ) + # ) + # ) + + ), + + metadataMaintenance = list(maintenanceAndUpdateFrequency = "asNeeded") + + ) + +) + +# Publish in a NADA catalog + +geospatial_add( + idno = id, + metadata = my_geo_service, + repositoryid = "central", + published = 1, + thumbnail = thumb, + overwrite = "yes" +) + +# Add links as external resources + +external_resources_add( + title = "United Nations Clear Map application", + idno = id, + dctype = "web", + file_path = "https://www.un.org/geospatial/", + overwrite = "yes" +) + +external_resources_add( + title = "United Nations Geospatial website", + idno = id, + dctype = "web", + file_path = "https://geoservices.un.org/Html5Viewer/index.html?viewer=clearmap", + overwrite = "yes" +) +``` + + +---------- +**Generating the metadata using Python** +---------- + +[to do] + +---------- +**The result in NADA** +---------- + + +## Useful tools + +The ISO standard is complex and contains many nested elements. Using R or Python to generate the metadata is a convenient and powerful option, although it requires much attention to avoid errors. The [geometa R package](https://cran.r-project.org/web/packages/geometa/index.html) can be used to facilitate the process of documenting datasets using R. + +Using a specialized metadata editor to generate the ISO-compliant metadata is a good alternative for those who have limited expertise in R or Python. The [GeoNetwork editor](https://geonetwork-opensource.org/) provides such a solution. + diff --git a/07_chapter07_database.md b/07_chapter07_database.md new file mode 100644 index 0000000..4219203 --- /dev/null +++ b/07_chapter07_database.md @@ -0,0 +1,1161 @@ +--- +output: html_document +--- + +# Databases of indicators {#chapter07} + +
+![](./images/time_series_logo.JPG){width=25%} +
+
+## Database vs indicators
+
+The schema we describe in this chapter is intended to document *databases* of indicators or time series, not the indicators or time series themselves (a schema for the description of indicators and time series is presented in chapter 8). **Indicators** are summary measures related to key issues or phenomena, derived from observed facts. Indicators form **time series** when they are provided with a temporal ordering, i.e., when their values are provided with an ordered annual, quarterly, monthly, daily, or other time reference. Indicators and time series are often contained in multi-indicator databases, like the World Bank's [World Development Indicators - WDI](https://datatopics.worldbank.org/world-development-indicators/), whose online version contains series for 1,430 indicators (as of 2021).
+
+The metadata related to a database can be published in a catalog as specific entries, or as information attached to an indicator.
+[provide example / screenshot in NADA]
+
+
+## Schema description
+
+The **database** schema is used to document the database that contains the time series, not to document the indicators or series themselves.
+
+```json
+{
+  "published": 0,
+  "overwrite": "no",
+  "metadata_information": {},
+  "database_description": {},
+  "provenance": [],
+  "tags": [],
+  "lda_topics": {},
+  "embeddings": {},
+  "additional": {}
+}
+```
+ +The schema includes two elements that are not metadata, but parameters used when publishing the metadata in a NADA catalog: + +- **`published`**: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished), in which case it is only visible to catalog administrators. This value must be set to 1 (published) to make the metadata visible. Note that the database metadata will only be shown in NADA in association with the metadata of an indicator. +- **`overwrite`**: Indicates whether metadata that may have been previously uploaded for the same database can be overwritten. By default, the value is "no". It must be set to "yes" to overwrite existing information. A database will be considered as being the same as a previously uploaded one if they have the same identifier (provided in the metadata element `database_description > title_statement > idno`). + +#### Metadata information + +**`metadata_information`** *[Optional, Not Repeatable]*
+The set of elements in `metadata_information` is used to provide information on the production of the database metadata. This information is used mostly for administrative purposes by data curators and catalog administrators. + +
+```json +"metadata_information": { + "title": "string", + "idno": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "prod_date": "string", + "version": "string" +} +``` +
+ +- **`title`** *[Optional ; Not repeatable ; String]*
+The title of the metadata document containing the database metadata.
+- **`idno`** *[Required ; Not repeatable ; String]*
+A unique identifier of the database metadata document. It can be for example the identifier of the database preceded by a prefix identifying the metadata producer.
+- **`producers`** *[Optional ; Repeatable]*
+A list and description of the producers of the database metadata (not the producers of the database).
+ - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization who produced the metadata (or contributed to its production).
+ - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The abbreviation (acronym) of the organization mentioned in `name`.
+ - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`.
+ - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the person or organization mentioned in `name` in the production of the metadata.
+- **`prod_date`** *[Optional ; Not repeatable ; String]*
+The date when the metadata was produced, preferably entered in ISO 8601 format (YYYY-MM-DD).
+- **`version`** *[Optional ; Not repeatable ; String]*
+The version of the metadata (not the version of the database).
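+
+For illustration, a `metadata_information` block could be filled as follows, shown here as an R list in the style of the R examples used elsewhere in this Guide (the identifier, names, and dates are fictitious):
+
+```r
+# Fictitious example of a metadata_information block (R list)
+metadata_information <- list(
+  title     = "World Development Indicators, April 2020 - Database metadata",
+  idno      = "META_WB_WDI_APR_2020",   # identifier of the metadata document, not of the database
+  producers = list(list(name = "Example Data Team", abbr = "EDT",
+                        role = "Production of the database metadata")),
+  prod_date = "2020-04-20",
+  version   = "1.0"
+)
+```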
+ + +#### Database description + +**`database_description`** *[Required, Not Repeatable]*
+ +
+```json +"database_description": { + "title_statement": {}, + "authoring_entity": [], + "abstract": "string", + "url": "string", + "type": "string", + "date_created": "string", + "date_published": "string", + "version": [], + "update_frequency": "string", + "update_schedule": [], + "time_coverage": [], + "time_coverage_note": "string", + "periodicity": [], + "themes": [], + "topics": [], + "keywords": [], + "dimensions": [], + "ref_country": [], + "geographic_units": [], + "geographic_coverage_note": "string", + "bbox": [], + "geographic_granularity": "string", + "geographic_area_count": "string", + "sponsors": [], + "acknowledgments": [], + "acknowledgment_statement": "string", + "contacts": [], + "links": [], + "languages": [], + "access_options": [], + "errata": [], + "license": [], + "citation": "string", + "notes": [], + "disclaimer": "string", + "copyright": "string" +} +``` +
+ +- **`title_statement`** *[Required, Not Repeatable]*
+ +
+```json +"title_statement": { + "idno": "string", + "identifiers": [ + { + "type": "string", + "identifier": "string" + } + ], + "title": "string", + "sub_title": "string", + "alternate_title": "string", + "translated_title": "string" +} +``` +
+ + - **`idno`** *[Required ; Not repeatable ; String]*
+ A unique identifier of the database. For example, the World Bank's World Development Indicators database published in April 2020 could have `idno` = "WB_WDI_APR_2020". + - **`identifiers`** *[Optional ; Repeatable]*
+ This element is used to store database identifiers (IDs) other than the catalog ID entered in `idno`. It can for example be a Digital Object Identifier (DOI). The `idno` can be repeated here (`idno` does not provide a `type` parameter; if a DOI or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`). + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of unique ID, e.g. "DOI".
+ - **`identifier`** *[Required ; Not repeatable ; String]*
+ The identifier itself.
+ - **`title`** *[Required ; Not repeatable ; String]*
+ The title is the name by which the database is formally known. It is good practice to include the year of production in the title (and possibly the month, or quarter, if a new version of the database is released more than once a year). For example, "World Development Indicators, April 2020".
+ - **`sub_title`** *[Optional ; Not repeatable ; String]*
+ The database subtitle can be used when there is a need to distinguish characteristics of a database. This element will rarely be used.
+ - **`alternate_title`** *[Optional ; Not repeatable ; String]*
+ This can be an acronym, or an alternative name of the database. For example, "WDI April 2020".
+ - **`translated_title`** *[Optional ; Not repeatable ; String]*
+ The title of the database in a secondary language (if more than one other language, they may be entered as one string, as this element is not repeatable).
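+
+As an illustration, the `title_statement` for the example used above could be entered as follows (R list; the DOI shown is fictitious):
+
+```r
+# Illustrative title_statement (the DOI is fictitious)
+title_statement <- list(
+  idno            = "WB_WDI_APR_2020",
+  identifiers     = list(list(type = "DOI", identifier = "10.1234/wdi-april-2020")),
+  title           = "World Development Indicators, April 2020",
+  alternate_title = "WDI April 2020"
+)
+```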

+ + +- **`authoring_entity`** *[Optional ; Repeatable]*
+This set of five elements is used to identify the organization(s) or person(s) who are the main producers/curators of the database. Note that a similar element is available at the indicator/series level. + +
+```json +"authoring_entity": [ + { + "name": "string", + "affiliation": "string", + "abbreviation": "string", + "email": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization who maintains the contents of the database (back-end). Write the name in full (use the element `abbreviation` to capture the acronym of the organization, if relevant). + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ The abbreviated name (acronym) of the organization mentioned in `name`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ The public email contact of the person or organization mentioned in `name`. It is good practice to provide a service account email address, not a personal one.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link (URL) to the website of the entity mentioned in `name`.
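+
+For example, the authoring entity of a database like the WDI could be documented as follows (a sketch; the abbreviation and URL shown are assumptions to be verified before cataloguing):
+
+```r
+# Illustrative authoring_entity (one organization)
+authoring_entity <- list(
+  list(
+    name         = "Development Data Group",
+    affiliation  = "The World Bank",
+    abbreviation = "DECDG",
+    uri          = "https://data.worldbank.org"
+  )
+)
+```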

+ + +- **`abstract`** *[Optional ; Not repeatable ; String]*
+ + The `abstract` is a brief description of the database. It can for example include a short statement on the database scope and coverage (not in detail, as other fields are available for that purpose), objectives, history, and expected audience.
+ + +- **`url`** *[Optional ; Not repeatable ; String]*
+ + The link to the public interface of the database (home page).
+ + +- **`type`** *[Optional ; Not repeatable ; String]*
+ + The type of database.
+ + +- **`date_created`** *[Optional ; Not repeatable ; String]*
+This is the date the database was created. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
+
+
+- **`date_published`** *[Optional ; Not repeatable ; String]*
+This is the date the database was made public. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
+ + +- **`version`** *[Optional ; Repeatable]*
+A database rarely remains static; it will be regularly updated and upgraded. The `version` element is a compound element and contains important information regarding the updating of the database. This includes any extension of the database (adding new series data), appending existing data, correcting existing data, etc. + +
+```json +"version": [ + { + "version": "string", + "date": "string", + "responsibility": "string", + "notes": "string" + } +] +``` +
+ + - **`version`** *[Optional ; Not repeatable ; String]*
+ A label for the version. The version specification will be determined by a curator or a data manager under conventions determined by the authoring entity. + - **`date`** *[Optional ; Not repeatable ; String]*
+ The date the version was released. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
+ - **`responsibility`** *[Optional ; Not repeatable ; String]*
+ The organization or person in charge of this version of the database. + - **`notes`** *[Optional ; Not repeatable ; String]*
+ Additional information on this version of the database. Notes can for example be used to document how this version differs from previous ones.
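+
+A fictitious example of a `version` entry:
+
+```r
+# Illustrative version entry (fictitious content)
+version <- list(
+  list(
+    version        = "April 2020 edition",
+    date           = "2020-04-20",
+    responsibility = "Development Data Group, The World Bank",
+    notes          = "Adds new series and updates existing series with the latest available observations."
+  )
+)
+```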

+ + +- **`update_frequency`** *[Optional ; Not repeatable ; String]*
+Indicates the frequency at which the database is updated (for example, "annual" or "quarterly"). The use of a controlled vocabulary is recommended. If a database contains many indicators, the update frequency may vary by indicator (e.g., some may be updated on a monthly or quarterly basis while others are only updated annually). In that case, the value provided in `update_frequency` should correspond to the frequency of update of the indicators that are most frequently updated.
+ + +- **`update_schedule`** *[Optional ; Repeatable]*
+The update schedule is intended to provide users with information on scheduled updates. This is a repeatable field that allows for capturing specific dates, but this information would then have to be regularly updated. Often a single description will be used, which would avoid having to regularly update the metadata. For example, "The database is updated in January, April, July, October of each year." + +
+```json +"update_schedule": [ + { + "update": "string" + } +] +``` +
+ + - **`update`** *[Optional ; Not repeatable ; String]*
+ A description of the schedule of updates or a date entered in ISO 8601 format.

+ + +- **`time_coverage`** *[Optional ; Repeatable]*
+The time coverage is the time span of all the data contained in the database across all series. +
+```json +"time_coverage": [ + { + "start": "string", + "end": "string" + } +] +``` +
+ - **`start`** *[Optional ; Not repeatable ; String]*
+ Indicates the start date of the period covered by the data (across all series) in the database. The date should be provided in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
+ - **`end`** *[Optional ; Not repeatable ; String]*
+ Indicates the end date of the period covered by the data (across all series) in the database. The date should be provided in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).

+ +- **`time_coverage_note`** *[Optional ; Not repeatable ; String]*
+The element is used to annotate and/or describe auxiliary information related to the time coverage described in `time_coverage`.
+ + +- **`periodicity`** *[Optional ; Repeatable]*
+The periodicity of the data describes the periodicity of the indicators contained in the database. A database can contain series covering different periods, in which case the information will be repeated for each type of periodicity. A controlled vocabulary should be used. +
+```json +"periodicity": [ + { + "period": "string" + } +] +``` +
+ + - **`period`** *[Optional ; Not repeatable ; String]*
+Periodicity of the time series included in the database, for example, "annual", "quarterly", or "monthly".
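+
+The update and time-related elements described above could, for a fictitious quarterly-updated database containing annual and quarterly series, be entered as follows:
+
+```r
+# Illustrative values for the update and time-related elements (fictitious)
+update_frequency <- "quarterly"
+update_schedule  <- list(list(update = "The database is updated in January, April, July, and October of each year."))
+time_coverage    <- list(list(start = "1960", end = "2019"))
+periodicity      <- list(list(period = "annual"), list(period = "quarterly"))
+```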

+ + +- **`themes`** *[Optional ; Repeatable]*
+Themes provide a general idea of the research that might guide the creation and/or demand for the series. A theme is broad and is likely to be subject to a community-based definition or list. A controlled vocabulary should be used. This element will rarely be used (the element `topics` described below will be used more often).
+
+```json +"themes": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`id`** *[Optional ; Not repeatable ; String]*
+ The unique identifier of the theme. It can be a sequential number, or the identifier of the theme in a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The label of the theme associated with the data. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ When a hierarchical (nested) controlled vocabulary is used, the `parent_id` field can be used to indicate a higher-level theme to which this theme belongs. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to the controlled vocabulary mentioned in field `vocabulary`.

+ + +- **`topics`** *[Optional ; Repeatable]*
+The `topics` field indicates the broad substantive topic(s) that the indicators/series in the database cover. A topic classification facilitates referencing and searches in electronic data catalogs. Topics should be selected from a standard controlled vocabulary such as the [Council of European Social Science Data Archives (CESSDA) topic classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification).
+ +
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`id`** *[Optional ; Not repeatable ; String]*
+ The unique identifier of the topic. It can be a sequential number, or the identifier of the topic in a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The label of the topic associated with the data. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ When a hierarchical (nested) controlled vocabulary is used, the `parent_id` field can be used to indicate a higher-level topic to which this topic belongs. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name of the controlled vocabulary used, if any.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to the controlled vocabulary mentioned in field `vocabulary`.
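+
+For example, topics taken from the CESSDA classification could be entered as follows (labels shown for illustration only):
+
+```r
+# Illustrative topics, using the CESSDA topic classification
+topics <- list(
+  list(id = "1", name = "Economics", vocabulary = "CESSDA Topic Classification",
+       uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+  list(id = "2", name = "Health", vocabulary = "CESSDA Topic Classification",
+       uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
+)
+```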

+ + +- **`keywords`** *[Optional ; Repeatable]*
+Words or phrases that describe salient aspects of a data collection's content. This can be used for building keyword indexes and for classification and retrieval purposes. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus. The list of keywords can include keywords extracted from one or more controlled vocabularies and user-defined keywords. + +
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; String ; Non repeatable]*
+ A keyword (or phrase). + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name of the controlled vocabulary from which the keyword was extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI of the controlled vocabulary used, if any.
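+
+A simple illustration, using user-defined keywords only (no controlled vocabulary):
+
+```r
+# Illustrative, user-defined keywords
+keywords <- list(
+  list(name = "poverty"),
+  list(name = "gross domestic product (GDP)"),
+  list(name = "access to electricity")
+)
+```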

+ +- **`dimensions`** *[Optional ; Repeatable]*
+The dimensions available for the series included in the database. For example, "country, year". + +
+```json +"dimensions": [ + { + "name": "string", + "label": "string" + } +] +``` +
+ + - **`name`** *[Required ; String ; Non repeatable]*
+ The name of the dimension. + - **`label`** *[Optional ; Not repeatable ; String]*
+ A label for the dimension.

+ + +- **`ref_country`** *[Optional ; Repeatable]*
+A list of countries for which data are available in the database. This element is somewhat redundant with the next element (`geographic_units`) which may also contain a list of countries. Identifying geographic areas of type "country" is important to enable filters and facets in data catalogs (country names are among the most frequent queries submitted to catalogs). + +
+```json +"ref_country": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the country. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the country. The use of the [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) codes is recommended.
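+
+For example (using ISO 3166-1 alpha-3 codes):
+
+```r
+# Illustrative ref_country entries (ISO 3166-1 alpha-3 codes)
+ref_country <- list(
+  list(name = "Afghanistan", code = "AFG"),
+  list(name = "Albania",     code = "ALB"),
+  list(name = "Algeria",     code = "DZA")
+)
+```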

+ + +- **`geographic_units`** *[Optional ; Repeatable]*
+A list of geographic units (regions, countries, states, provinces, etc.) for which data are available in the database. This list is not limited to countries; it can contain sub-national areas, supra-national regions, or non-administrative area names. The `type` element is used to indicate the type of geographic area. Countries may, but do not have to, be repeated here if they are already provided in the element `ref_country`.
+```json +"geographic_units": [ + { + "name": "string", + "code": "string", + "type": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the geographic unit e.g. 'World', 'Sub-Saharan Africa', 'Afghanistan', 'Low-income countries'. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the geographic unit as found in the database. If no code is available in the database, a code still can be added to the metadata. In such case, using the [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) codes is recommended for countries. + - **`type`** *[Optional ; Not repeatable ; String]*
+ Type of geographic unit e.g. country, state, region, province, or other grouping.
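+
+An illustrative `geographic_units` entry mixing countries and regions (the region codes shown are assumptions; use the codes found in the database itself when available):
+
+```r
+# Illustrative geographic_units entries (region codes are assumptions)
+geographic_units <- list(
+  list(name = "World",              code = "WLD", type = "region"),
+  list(name = "Sub-Saharan Africa", code = "SSF", type = "region"),
+  list(name = "Afghanistan",        code = "AFG", type = "country")
+)
+```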
+ +- **`geographic_coverage_note`** *[Optional ; Not repeatable ; String]*
+The note can be used to capture additional information on the geographic coverage of the database.
+ +- **`bbox`** *[Optional ; Repeatable]*
+Bounding boxes are typically used for geographic datasets to indicate the geographic coverage of the data, but can be provided for databases as well, although this will rarely be done. A geographic bounding box defines a rectangular geographic area. +
+```json +"bbox": [ + { + "west": "string", + "east": "string", + "south": "string", + "north": "string" + } +] +``` +
+ + - **`west`** *[Required ; Not repeatable ; String]*
+ Western geographic parameter of the bounding box. + - **`east`** *[Required ; Not repeatable ; String]*
+ Eastern geographic parameter of the bounding box. + - **`south`** *[Required ; Not repeatable ; String]*
+ Southern geographic parameter of the bounding box. + - **`north`** *[Required ; Not repeatable ; String]*
+ Northern geographic parameter of the bounding box. + +- **`geographic_granularity`** *[Optional ; Not repeatable ; String]*
+
+ Whereas the `geographic_units` element lists the specific geographic areas for which data are available in the database, the `geographic_granularity` element describes the geographic (administrative) levels at which the data are available. For example: "The database contains data at the national, provincial (admin 1) and district (admin 2) levels."
+ +- **`geographic_area_count`** *[Optional ; Not repeatable ; String]*
+ + The number of geographic areas for which data are provided in the database. The World Bank World Development Indicators for example provides data for 262 different areas (which includes countries and territories, geographic regions, and other country groupings).
+ + +- **`sponsors`** *[Optional ; Repeatable]*
+The source(s) of funds for the production and maintenance of the database. If different funding agencies sponsored different stages of the database development, use the `role` attribute to distinguish their respective contributions. + +
+```json +"sponsors": [ + { + "name": "string", + "abbreviation": "string", + "role": "string", + "grant": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ Name of the funding agency/sponsor + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ Abbreviation of the funding/sponsoring agency mentioned in `name`. + - **`role`** *[Optional ; Not repeatable ; String]*
+ Role of the funding/sponsoring agency mentioned in `name`. + - **`grant`** *[Optional ; Not repeatable ; String]*
+ Grant or award number. If an agency provided more than one grant, list all grants separated with a ";". + - **`uri`** *[Optional ; Not repeatable ; String]*
+ URI of the sponsor agency mentioned in `name`.
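+
+A fictitious example of a `sponsors` entry:
+
+```r
+# Illustrative sponsors entry (all values fictitious)
+sponsors <- list(
+  list(name         = "Example Trust Fund for Development Data",
+       abbreviation = "ETFDD",
+       role         = "Funding of data curation and dissemination",
+       grant        = "TF-0123456")
+)
+```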

+ + +- **`acknowledgments`** *[Optional ; Repeatable]*
+An itemized list of person(s) and/or organization(s) other than sponsors and contributors already mentioned in metadata elements `contributors` and `sponsors` whose contribution to the database must be acknowledged. + +
+```json +"acknowledgments": [ + { + "name": "string", + "affiliation": "string", + "role": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or agency being recognized for supporting the database. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ Affiliation of the person or agency recognized or acknowledged for supporting the database. + - **`role`** *[Optional ; Not repeatable ; String]*
+ Role of the person or agency that is being recognized or acknowledged for supporting the database. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ Website URL or email of the person or organization being recognized or acknowledged for supporting the database. + + +- **`acknowledgment_statement`** *[Optional ; Not repeatable ; String]*
+ + An overall statement of acknowledgment, which can be used as an alternative (or supplement) to the itemized list provided in `acknowledgments`.

+ + +- **`contacts`** *[Optional ; Repeatable]*
+The `contacts` element provides the public interface for questions associated with the development and maintenance of the database. There could be various contacts provided depending upon the organization. + +
+```json +"contacts": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "email": "string", + "telephone": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the contact person that should be contacted. Instead of the name of an individual (which would be subject to change and require frequent update of the metadata), a title can be provided here (e.g. "data helpdesk"). + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the contact person mentioned in `name`. This will be used when multiple contacts are listed, and is intended to help users direct their questions and requests to the right contact person. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The organization or affiliation of the contact person mentioned in `name`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ The email address of the person or organization mentioned in `name`. Avoid using personal email accounts; the use of an anonymous email is recommended (e.g, "helpdesk@....org") + - **`telephone`** *[Optional ; Not repeatable ; String]*
+ The phone number of the person or organization mentioned in `name`. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI of the agency (typically, a URL to a "contact us" web page).

+ + +- **`links`** *[Optional ; Repeatable]*
+This field allows for the association of auxiliary links referring to the database. + +
+```json +"links": [ + { + "uri": "string", + "description": "string" + } +] +``` +
+ + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI for the associated link. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the link, in relation to the database. +

+ + +- **`languages`** *[Optional ; Repeatable]*
+This set of elements is provided to list the languages that are supported in the database. +
+ ```json + "languages": [ + { + "name": "string", + "code": "string" + } + ] + ``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The official name of the language being supported; it is recommended to use a name from the [ISO 639-1 language name list](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the language mentioned in `name`, preferably the two-letter [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), as in the example below.
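+
+ For example, a database available in English, French, and Spanish could be documented as follows (R sketch, using two-letter ISO 639-1 codes):
+
+```r
+languages = list(
+  list(name = "English", code = "en"),
+  list(name = "French",  code = "fr"),
+  list(name = "Spanish", code = "es")
+)
+```
+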

+ + +- **`access_options`** *[Optional ; Repeatable]*
+This repeatable set of elements describes the different modes and formats in which the database is made accessible. When more than one mode of access is provided, describe them separately. + +
+```json +"access_options": [ + { + "type": "string", + "uri": "string", + "note": "string" + } +] +``` +
+ + - **`type`** *[Optional ; Not repeatable ; String]*
+ The access type, e.g. "Application Programming Interface (API)", "Bulk download in CSV format", "On-line query interface", etc. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI corresponding to the access mode mentioned in `type`. + - **`note`** *[Optional ; Not repeatable ; String]*
+ This element allows for annotating any specific information associated with the access mode mentioned in `type`.
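+
+ The example below (an R sketch, with hypothetical URLs) shows how two access modes could be described:
+
+```r
+access_options = list(
+  list(type = "Application Programming Interface (API)",
+       uri  = "https://api.example.org/data",        # hypothetical URL
+       note = "Registration (free of charge) is required to obtain an API key."),
+  list(type = "Bulk download in CSV format",
+       uri  = "https://example.org/data/download")   # hypothetical URL
+)
+```
+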

+ + +- **`errata`** *[Optional ; Repeatable]*
+A list of errata at the database level. Note that an `errata` element is also available in the schema used for the description of indicators/series. + +
+```json +"errata": [ + { + "date": "string", + "description": "string" + } +] +``` +
+ + - **`date`** *[Optional ; Not repeatable ; String]*
+ The date the erratum was published, preferably entered in ISO format. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A description of the error and of the measures taken to remedy it.

+ + +- **`license`** *[Optional ; Repeatable]*
+This set of elements is used to describe the access license(s) attached to the database. + +
+```json +"license": [ + { + "name": "string", + "uri": "string", + "note": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the license, for example "Creative Commons Attribution 4.0 International license (CC-BY 4.0)". + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A URI to a description of the license, for example "https://creativecommons.org/licenses/by/4.0/".
+ - **`note`** *[Optional ; Not repeatable ; String]*
+ Any additional information to qualify the license requirements.
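+
+ For example, a database published under a Creative Commons Attribution license could be documented as follows (R sketch):
+
+```r
+license = list(
+  list(name = "Creative Commons Attribution 4.0 International license (CC-BY 4.0)",
+       uri  = "https://creativecommons.org/licenses/by/4.0/")
+)
+```
+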

+ + +- **`citation`** *[Optional ; Not repeatable ; String]*
+ + The citation requirement for the database (i.e. how users should cite the database in publications and reports). + + +- **`notes`** *[Optional ; Repeatable]*
+This element is provided to add notes that are relevant for describing the database, that cannot be provided in other metadata elements. + +
+```json +"notes": [ + { + "note": "string" + } +] +``` +
+ +- **`note`** *[Optional ; Not repeatable ; String]*
+ A free-text note. + + +- **`disclaimer`** *[Optional ; Not repeatable ; String]*
+If the agency responsible for managing the database has determined that there may be some liability resulting from the use of the data, this element may be used to provide a disclaimer statement.

+ + +- **`copyright`** *[Optional ; Not repeatable ; String]*
+The copyright attached to the database, if any. + + +### Provenance + +**`provenance`** *[Optional ; Repeatable]*
+Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ + - **`origin_description`** *[Required ; Not repeatable]*
+ The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ + - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+ A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the entry (element `idno` in the title statement) will be modified when the metadata are published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ The namespace (typically a URI) identifying the metadata standard or schema with which the harvested metadata comply, for example the XML namespace of the metadata format when records are harvested through a protocol such as OAI-PMH.
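+
+ A `provenance` entry describing metadata harvested from a (hypothetical) external catalog could look as follows (R sketch; all values are illustrative):
+
+```r
+provenance = list(
+  list(origin_description = list(
+    harvest_date       = "2022-04-01T10:30:00Z",
+    altered            = TRUE,
+    base_url           = "https://catalog.example.org",   # hypothetical source catalog
+    identifier         = "EXAMPLE-DB-001",                # identifier in the source catalog
+    date_stamp         = "2022-03-15",
+    metadata_namespace = ""
+  ))
+)
+```
+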
+ + +### Tags + +**`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. +
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ +- **`tag`** *[Required ; Not repeatable ; String]*
+A user-defined tag. +- **`tag_group`** *[Optional ; Not repeatable ; String]*

+A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs. + + +### LDA topics + +**`lda_topics`** *[Optional ; Not repeatable]*
+
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+
+We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to the enrichment of metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
+
+The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition.
+
+:::note
+Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the `lda_topics` elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.
+:::
+
+The image below provides an example of topics extracted from a document from the United Nations High Commissioner for Refugees (UNHCR), using an LDA topic model trained by the World Bank (this model was trained to identify 75 topics; no document will cover all topics).
+
+![](./images/LDA_refugee_education.JPG){width=100%}
+
+The `lda_topics` element includes the following metadata fields:
+ +- **`model_info`** *[Optional ; Not repeatable]*
+Information on the LDA model. + + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + +- **`topic_description`** *[Optional ; Repeatable]*
+The topic composition of the document. + + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the document (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic. This is specific to the model, not to a document.
+
+
+```r
+lda_topics = list(
+  
+  list(
+    
+    model_info = list(
+      list(source      = "World Bank, Development Data Group",
+           author      = "A.S.",
+           version     = "2021-06-22",
+           model_id    = "Mallet_WB_75",
+           nb_topics   = 75,
+           description = "LDA model, 75 topics, trained on Mallet",
+           corpus      = "World Bank Documents and Reports (1950-2021)",
+           uri         = "")),
+    
+    topic_description = list(
+      
+      list(topic_id    = "topic_27",
+           topic_score = 32,
+           topic_label = "Education",
+           topic_words = list(list(word = "school",    word_weight = ""),
+                              list(word = "teacher",   word_weight = ""),
+                              list(word = "student",   word_weight = ""),
+                              list(word = "education", word_weight = ""),
+                              list(word = "grade",     word_weight = ""))),
+      
+      list(topic_id    = "topic_8",
+           topic_score = 24,
+           topic_label = "Gender",
+           topic_words = list(list(word = "women",  word_weight = ""),
+                              list(word = "gender", word_weight = ""),
+                              list(word = "man",    word_weight = ""),
+                              list(word = "female", word_weight = ""),
+                              list(word = "male",   word_weight = ""))),
+      
+      list(topic_id    = "topic_39",
+           topic_score = 22,
+           topic_label = "Forced displacement",
+           topic_words = list(list(word = "refugee",   word_weight = ""),
+                              list(word = "programme", word_weight = ""),
+                              list(word = "country",   word_weight = ""),
+                              list(word = "migration", word_weight = ""),
+                              list(word = "migrant",   word_weight = ""))),
+      
+      list(topic_id    = "topic_40",
+           topic_score = 11,
+           topic_label = "Development policies",
+           topic_words = list(list(word = "development", word_weight = ""),
+                              list(word = "policy",      word_weight = ""),
+                              list(word = "national",    word_weight = ""),
+                              list(word = "strategy",    word_weight = ""),
+                              list(word = "activity",    word_weight = "")))
+      
+    )
+    
+  )
+  
+)
+```
+
+The information provided by LDA models can be used to build a "filter by topic composition" tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic.
+
+![](./images/filter_by_topic_share_1.JPG){width=85%} +
+ + +### Embeddings + +**`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.
+
+![](./images/embedding_related_docs.JPG){width=100%}
+
+The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model.
+
+![](./images/ReDoc_documents_18.JPG){width=100%}
+
+The `embeddings` element contains four metadata fields:
+
+ - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+ The date the model was trained (or a version date for the model).
+ - **`vector`** *[Required ; Not repeatable ; Object]*
+ The numeric vector representing the document, provided as an object (array or string), e.g. `[1, 4, 3, 5, 7, 9]` for a (very short, illustrative) vector.
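+
+ As an illustration of how stored vectors can be exploited, the R sketch below (using made-up, very short vectors) computes cosine similarities between a query document and a set of catalog entries to identify the semantically closest ones. In a production system, this matching would typically be delegated to a vector database or a tool like Milvus, as noted above.
+
+```r
+# Illustrative only: tiny made-up vectors; real embeddings have hundreds of dimensions
+doc_vectors <- list(
+  doc_A = c(0.12, 0.85, 0.33, 0.10),
+  doc_B = c(0.90, 0.05, 0.20, 0.40),
+  doc_C = c(0.15, 0.80, 0.30, 0.05)
+)
+query <- c(0.10, 0.82, 0.35, 0.12)
+
+cosine_similarity <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
+
+# Rank catalog entries by similarity to the query document (highest = most related)
+similarities <- sapply(doc_vectors, cosine_similarity, y = query)
+sort(similarities, decreasing = TRUE)
+```
+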

+
+
+### Additional
+
+**`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. + +![](./images/ReDoc_ts_series_49.JPG){width=100%} + + +#### Complete example + +We use the World Bank's [World Development Indicators 2021 (WDI)](https://datatopics.worldbank.org/world-development-indicators/) database as an example. In this example, we assume that all information is entered manually in the script. In a real application, it is likely that some elements like the list and number of geographic areas covered in the database, or the start and end year of the period covered by the data, will be extracted programmatically by reading the data file (the WDI data and related metadata can be downloaded as CSV or MS-Excel files), or by extracting information from the database API (WDI metadata is available via [API](https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information)). + +-------- +** Using R +-------- + + +```r +# The code below creates an object `wdi_database` ready to be published in a NADA catalog (using the NADAR package). + +wdi_database <- list( + + database_description = list( + + title_statement = list( + idno = "WB_WDI_2021_09_15", + title = "World Development Indicators 2021", + alternate_title = "WDI 2021" + ), + + authoring_entity = list(name = "Development Data Group", + affiliation = "The World Bank Group"), + + abstract = "The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.", + + url = "https://datatopics.worldbank.org/world-development-indicators/", + + type = "Time series database", + + date_created = "2021-09-15", + date_published = "2021-09-15", + + version = list( + list(version = "On-line public version (open data), 15 September 2021", + date = "2021-09-15", + responsibility = "World Bank, Development Data Group")), + + update_frequency = "Quarterly", + + update_schedule = list(list(update = "April, July, September, December")), + + time_coverage = list(list(start = "1960", end = "2021")), + + periodicity = list(list(period = "Annual")), + + topics = list_topics, + + geographic_units = list( + list(code = "ABW", name = "Aruba"), + list(code = "AFE", name = "Africa Eastern and Southern"), + list(code = "AFG", name = "Afghanistan"), + list(code = "AFW", name = "Africa Western and Central"), + list(code = "AGO", name = "Angola"), + list(code = "ALB", name = "Albania"), + list(code = "AND", name = "Andorra"), + list(code = "ARB", name = "Arab World"), + list(code = "ARE", name = "United Arab Emirates"), + list(code = "ARG", name = "Argentina") + # ... 
and 255 more - not shown here
+  ),
+  
+  geographic_granularity = "global, national, regional",
+  
+  geographic_area_count = "265",
+  
+  languages = list(
+    list(code = "en", name = "English"),
+    list(code = "es", name = "Spanish"),
+    list(code = "fr", name = "French"),
+    list(code = "ar", name = "Arabic"),
+    list(code = "zh", name = "Chinese")
+  ),
+  
+  contacts = list(list(name = "Data Help Desk",
+                       affiliation = "World Bank",
+                       uri = "https://datahelpdesk.worldbank.org/",
+                       email = "data@worldbank.org")),
+  
+  access_options = list(
+    list(type = "API",
+         uri = "https://datahelpdesk.worldbank.org/knowledgebase/articles/889386"),
+    list(type = "Bulk (CSV)",
+         uri = "https://data.worldbank.org/data-catalog/world-development-indicators"),
+    list(type = "Query",
+         uri = "http://databank.worldbank.org/data/source/world-development-indicators"),
+    list(type = "PDF",
+         uri = "https://openknowledge.worldbank.org/bitstream/handle/10986/26447/WDI-2017-web.pdf")),
+  
+  license = list(list(name = "CC BY-4.0",
+                      uri = "https://creativecommons.org/licenses/by/4.0/")),
+  
+  citation = "World Development Indicators 2021 (September), The World Bank"
+  
+  )
+
+)
+```
+
+--------
+** Using Python
+--------
+
+
+```python
+# The code below creates a dictionary `wdi_database` ready to be published in a NADA catalog (using the PyNADA library).
+
+wdi_database = {
+  
+  "database_description" : {
+    
+    "title_statement" : {
+      "idno" : "WB_WDI_2021_09_15",
+      "title" : "World Development Indicators 2021",
+      "alternate_title" : "WDI 2021"
+    },
+    
+    "authoring_entity" : {"name" : "Development Data Group",
+                          "affiliation" : "The World Bank Group"},
+    
+    "abstract" : "The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.",
+    
+    "url" : "https://datatopics.worldbank.org/world-development-indicators/",
+    
+    "type" : "Time series database",
+    
+    "date_created" : "2021-09-15",
+    "date_published" : "2021-09-15",
+    
+    "version" : [{"version" : "On-line public version (open data), 15 September 2021",
+                  "date" : "2021-09-15",
+                  "responsibility" : "World Bank, Development Data Group"}],
+    
+    "update_frequency" : "Quarterly",
+    
+    "update_schedule" : [{"update" : "April, July, September, December"}],
+    
+    "time_coverage" : [{"start" : "1960", "end" : "2021"}],
+    
+    "periodicity" : [{"period" : "Annual"}],
+    
+    "topics" : list_topics,   # list_topics: list of topics assumed to be defined earlier in the script
+    
+    "geographic_units" : [
+      {"code" : "ABW", "name" : "Aruba"},
+      {"code" : "AFE", "name" : "Africa Eastern and Southern"},
+      {"code" : "AFG", "name" : "Afghanistan"},
+      {"code" : "AFW", "name" : "Africa Western and Central"},
+      {"code" : "AGO", "name" : "Angola"},
+      {"code" : "ALB", "name" : "Albania"},
+      {"code" : "AND", "name" : "Andorra"},
+      {"code" : "ARB", "name" : "Arab World"},
+      {"code" : "ARE", "name" : "United Arab Emirates"},
+      {"code" : "ARG", "name" : "Argentina"}
+      # ... 
and 255 more, not shown here
+    ],
+    
+    "geographic_granularity" : "global, national, regional",
+    
+    "geographic_area_count" : "265",
+    
+    "languages" : [
+      {"code" : "en", "name" : "English"},
+      {"code" : "es", "name" : "Spanish"},
+      {"code" : "fr", "name" : "French"},
+      {"code" : "ar", "name" : "Arabic"},
+      {"code" : "zh", "name" : "Chinese"}
+    ],
+    
+    "contacts" : [{"name" : "Data Help Desk",
+                   "affiliation" : "World Bank",
+                   "uri" : "https://datahelpdesk.worldbank.org/",
+                   "email" : "data@worldbank.org"}],
+    
+    "access_options" : [
+      {"type" : "API",
+       "uri" : "https://datahelpdesk.worldbank.org/knowledgebase/articles/889386"},
+      {"type" : "Bulk (CSV)",
+       "uri" : "https://data.worldbank.org/data-catalog/world-development-indicators"},
+      {"type" : "Query",
+       "uri" : "http://databank.worldbank.org/data/source/world-development-indicators"},
+      {"type" : "PDF",
+       "uri" : "https://openknowledge.worldbank.org/bitstream/handle/10986/26447/WDI-2017-web.pdf"}
+    ],
+    
+    "license" : [{"name" : "CC BY-4.0",
+                  "uri" : "https://creativecommons.org/licenses/by/4.0/"}],
+    
+    "citation" : "World Development Indicators 2021 (September), The World Bank"
+    
+  }
+  
+}
+```

diff --git a/08_chapter08_indicators.md b/08_chapter08_indicators.md
new file mode 100644
index 0000000..2bc5f0e
--- /dev/null
+++ b/08_chapter08_indicators.md
@@ -0,0 +1,1528 @@
+---
+output: html_document
+---
+
+# Indicators and time series {#chapter08}
+
+![](./images/time_series_logo.JPG){width=25%} +
+ +## Indicators, time series, database, and scope of the schema + +**Indicators** are summary measures related to key issues or phenomena, derived from observed facts. Indicators form **time series** when they are provided with a temporal ordering, i.e. when their values are provided with an ordered annual, quarterly, monthly, daily, or other time reference. Time series are usually published with equal intervals between values. In the context of this Guide, we however consider as time series all indicators provided for a given geographic area with an associated time reference, whether this time represents a regular, continuous succession of time stamps or not. For example, the indicators provided by the Demographic and Health Surveys (DHS) [StatCompiler](https://www.statcompiler.com/en/), which are only available for the years when DHS are conducted in countries (which for some countries can be a single year), would be considered here as "time series". + +Time series are often contained in multi-indicators databases, like the World Bank's [World Development Indicators - WDI](https://datatopics.worldbank.org/world-development-indicators/), whose on-line version contains series for 1,430 indicators (as of 2021). To document not only the series but also the databases they belong to, we propose two metadata schemas: one to document the series/indicators, the other one to document the databases they belong to. + +In the NADA application, a series can be documented and published without an associated database, but information on a database will only be published in association with a series. The information on a database is thus treated as an "attachment" to the information on a series. A **SERIES DESCRIPTION** tab will display all metadata related to the series, i.e. all content entered in the *series schema*. + +---------- +
+![](./images/NADA_Timeseries_Series_view.JPG){width=100%} +
+---------- + +The (optional) **SOURCE DATABASE** tab will display the metadata related to the database, i.e. all content entered in the *series database schema*. This information is displayed for information, but not indexed in the NADA catalog (i.e. not searchable). + +---------- +
+![](./images/NADA_Timeseries_Database_view.JPG){width=100%} +
+----------
+
+
+:::idea
+**Suggestions and recommendations to data curators**
+
+- Indicators and time series often come with metadata limited to the indicator/series name and a brief definition. This significantly reduces the discoverability of the indicators, and the possibility of implementing semantic search and recommender systems. It is therefore highly recommended to generate more detailed metadata for each time series, including information on the purpose and typical use of the indicator, its relevance to different audiences, its limitations, and more.
+
+- When documenting an indicator or time series, attention should be paid to including keywords and phrases in the metadata that reflect how data users are likely to formulate their queries when searching data catalogs. Subject-matter expertise, combined with an analysis of queries submitted to data catalogs, can help to identify such keywords. For example, the metadata related to an indicator "Prevalence of stunting" should contain the keyword "malnutrition", and the metadata related to "GDP per capita" should include keywords like "economic growth" or "national income". By doing so, data curators will provide richer input to search engines and recommender systems, and will have a significant and direct impact on the discoverability of the data. The use of AI tools can considerably facilitate the process of identifying related keywords. We provide later in this chapter an example of the use of ChatGPT for this purpose.
+:::
+
+
+## Schema description
+
+An indicator or time series is documented using the **time series/indicators** schema. The **database** schema is optional, and used to document the database, if any, that the indicator belongs to. When multiple series from the same database are documented, the metadata related to the database only need to be generated once, then applied to all series. One metadata element in the **time series/indicators** schema is used to link an indicator to the corresponding database.
+
+
+### The time series (indicators) schema
+
+The time series schema is used to document an indicator or a time series. In NADA, the data and metadata of an indicator can (but do not have to) be published with information on the database it belongs to (if any). A metadata element is provided to indicate the identifier of that database (if any), and to establish the link between the indicator metadata and the database metadata generated using the schema described above.
+
+```json +{ + "repositoryid": "string", + "access_policy": "na", + "data_remote_url": "string", + "published": 0, + "overwrite": "no", + "metadata_information": {}, + "series_description": {}, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
+ +#### Cataloguing parameters + +The first elements of the schema (`repositoryid`, `access_policy`, `data_remote_url`, `published`, and `overwrite`) are not part of the series metadata. They are parameters used to indicate how the series will be published in a NADA catalog. + +**`repositoryid`** identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA. + +**`access_policy`** indicates the access policy to be applied to the data: direct access, open access, public use files, licensed access, data accessible from an external repository, and data not accessible. A controlled vocabulary is provided and must be used, with the following respective options: {`direct; open; public; licensed; remote; data_na`}. + +**`data_remote_url`** provides the link to an external website where the data can be obtained, if the `access_policy` has been set to `remote`. + +**`published`**: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible. + +**`overwrite`**: Indicates whether metadata that may have been previously uploaded for the same series can be overwritten. By default, the value is "no". It must be set to "yes" to overwrite existing information. Note that a series will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element `series_description > idno` is the same. + + +#### Metadata information + +**`metadata_information`** *[Optional, Not Repeatable]*
+The set of elements in `metadata_information` is used to provide information on the production of the indicator metadata. This information is used mostly for administrative purposes by data curators and catalog administrators. +
+```json +"metadata_information": { + "title": "string", + "idno": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "prod_date": "string", + "version": "string" +} +``` +
+ +- **`title`** *[Optional ; Not repeatable ; String]*
+The title of the metadata document containing the indicator metadata.
+- **`idno`** *[Required ; Not repeatable ; String]*
+A unique identifier of the indicator metadata document. It can be for example the identifier of the indicator preceded by a prefix identifying the metadata producer.
+- **`producers`** *[Optional ; Repeatable]*
+This is a list of producers involved in the documentation (production of the metadata) of the series. + + - **`name`** *[Optional ; Not repeatable, String]*
+ The name of the agency that is responsible for the documentation of the series. + - **`abbr`** *[Optional ; Not repeatable, String]*
+ Abbreviation (acronym) of the agency mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable, String]*
+ Affiliation of the agency mentioned in `name`. + - **`role`** *[Optional ; Not repeatable, String]*
+ The specific role of the agency mentioned in `name` in the production of the metadata. This element will be used when more than one person or organization is listed in the `producers` element to distinguish the specific contribution of each metadata producer.

+ +- **`prod_date`** *[Optional ; Not repeatable, String]*
+The date the metadata was generated. The date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). + +- **`version`** *[Optional ; Not repeatable, String]*
+The version of the metadata on this series. This element will rarely be used.
+
+
+
+ ```r
+ metadata_information = list(
+ 
+   producers = list(list(name = "Development Data Group",
+                         abbr = "DECDG",
+                         affiliation = "World Bank")),
+ 
+   prod_date = "2021-10-15"
+ 
+ )
+ ```
+
+
+#### Series description
+
+**`series_description`** *[Required ; Repeatable]*
+This section contains all elements used to describe a specific series or indicator. +
+```json +"series_description": { + "idno": "string", + "doi": "string", + "name": "string", + "database_id": "string", + "aliases": [], + "alternate_identifiers": [], + "languages": [], + "measurement_unit": "string", + "dimensions": [], + "periodicity": "string", + "base_period": "string", + "definition_short": "string", + "definition_long": "string", + "definition_references": [], + "statistical_concept": "string", + "concepts": [], + "methodology": "string", + "derivation": "string", + "imputation": "string", + "missing": "string", + "quality_checks": "string", + "quality_note": "string", + "sources_discrepancies": "string", + "series_break": "string", + "limitation": "string", + "themes": [], + "topics": [], + "disciplines": [], + "relevance": "string", + "time_periods": [], + "ref_country": [], + "geographic_units": [], + "bbox": [], + "aggregation_method": "string", + "disaggregation": "string", + "license": [], + "confidentiality": "string", + "confidentiality_status": "string", + "confidentiality_note": "string", + "links": [], + "api_documentation": [], + "authoring_entity": [], + "sources": [], + "sources_note": "string", + "keywords": [], + "acronyms": [], + "errata": [], + "notes": [], + "related_indicators": [], + "compliance": [], + "framework": [], + "series_groups": [] +} +``` +
+ + +- **`idno`** *[Required ; Not repeatable ; String]*
+
+ A unique identifier (ID) for the series. Most agencies and databases will have a coherent coding convention to generate their series IDs. For example, the identifiers of the series in the World Bank's World Development Indicators database are composed of the following elements, separated by a dot:
+
+ - Topic code (2 characters)
+ - General subject code (3 characters)
+ - Specific subject code (4 characters)
+ - Extensions (2 characters each)
+ + For example, the series with identifier "DT.DIS.PRVT.CD" is the series containing data on "External debt disbursements by private creditors in current US dollars" (for more information, see [*How does the World Bank code its indicators?*]( https://datahelpdesk.worldbank.org/knowledgebase/articles/201175-how-does-the-world-bank-code-its-indicators).
+ +- **`doi`** *[Optional ; Not repeatable ; String]*
+
+ A Digital Object Identifier (DOI) for the series.
+
+- **`name`** *[Required ; Not repeatable ; String]*
+ + The name (label) of the series. Note that a field `alias` is provided (see below) to capture alternative names for the series. + +- **`database_id`** *[Optional ; Not repeatable ; String]*
+ + The unique identifier of the database the series belongs to. This field must correspond to the element `database_description > title_statement > idno` of the database schema described above. This is the only field that is needed to establish the link between the database metadata and the indicator metadata. + +- **`aliases`** *[Optional ; Repeatable]*
+A series or an indicator can be referred to using different names. The `aliases` element is provided to capture the multiple names and labels that may be associated with (i.e., synonyms of) the documented series or indicator.
+
+```json +"aliases": [ + { + "alias": "string" + } +] +``` +
+ - **`alias`** *[Optional ; Not repeatable ; String]*
+ An alternative name for the indicator or series being documented.

+
+- **`alternate_identifiers`** *[Optional ; Repeatable]*
+The element `idno` described above is the reference unique identifier for the catalog in which the metadata is intended to be published. But the same indicator/metadata may be published in other catalogs. For example, a data catalog may publish metadata for series extracted from the World Bank World Development Indicators (WDI) database. And the WDI itself contains series generated and published by other organizations, such as the World Health Organization or UNICEF. Catalog administrators may want to assign a unique identifier specific to their catalog (the `idno` element), but keep track of the identifier of the series or indicator in other catalogs or databases. The `alternate_identifiers` element serves that purpose. +
+```json +"alternate_identifiers": [ + { + "identifier": "string", + "name": "string", + "database": "string", + "uri": "string", + "notes": "string" + } +] +``` +
+ + - **`identifier`** *[Required ; Not repeatable ; String]*
+ An identifier for the series other than the identifier entered in `idno` (note that the identifier entered in `idno` can be included in this list, if it is useful to provide it with a type identifier (see `name` element below) which is not provided in `idno`. This can be the identifier of the indicator in another database/catalog, or a global unique identifier. + - **`name`**
+ This element will be used to define the type of identifier. This will typically be used to flag DOIs by entering "Digital Object Identifier (DOI)". + - **`database`**
+ The name of the database (or catalog) where this alternative identifier is used, e.g. "IMF, International Financial Statistics (IFS)".
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link (URL) to the database mentioned in `database`.
+ - **`notes`** *[Optional ; Not repeatable ; String]*
+ Any additional information on the alternate identifier.
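+
+ For example, an indicator republished from the WDI database could keep track of its identifiers in other catalogs as follows (R sketch; the DOI shown is fictitious):
+
+```r
+alternate_identifiers = list(
+  list(identifier = "NY.GDP.PCAP.CD",
+       name       = "Indicator code in the source database",
+       database   = "World Bank, World Development Indicators (WDI)",
+       uri        = "https://datatopics.worldbank.org/world-development-indicators/"),
+  list(identifier = "10.1234/example-doi",   # fictitious DOI, for illustration only
+       name       = "Digital Object Identifier (DOI)")
+)
+```
+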

+ +- **`languages`** *[Optional ; Repeatable]*
+An indicator or time series can be made available at different levels of disaggregation. For example, an indicator containing estimates of the "Population" of a country by year can be available by sex. The data curators in such case will have two options: (i) create and document three separate indicators, namely "Population, Total", "Population, Female", and "Population, Male"; or create a single indicator "Population" and attach a *dimension* "sex" to it, with values "Total", "Female", and "Male". The `dimensions` are features (or "variables") that define the different levels of disaggregation within an indicator/series. The element `dimensions` is used to provide an itemized list of disaggregations that correspond exactly to the published data. Note that when an indicator is available at two "non-overlapping" levels of disaggregation, it should be split into two indicators. For example, if the Population indicator is available by male/female and by urban/rural, but not by male/urban/male/rural/female urban/female rural, it should be treated as two separate indicators ("Population by sex" with dimension sex = "male / female" and "Population by area of residence" with dimension area = "urban / rural".) Note also that another element in the schema, `disaggregation`, is also provided, in which a narrative description of the actual or recommended disaggregations can be documented.
+
+```json
+"languages": [
+  {
+    "name": "string",
+    "code": "string"
+  }
+]
+```
+
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the language. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the language, preferably the ISO code.

+ +- **`measurement_unit`** *[Optional ; Not repeatable ; String]*
+ + The unit of measurement. Note that in many databases the measurement unit will be included in the series name/label. In the World Bank's World Development Indicators for example, series are named as follows: + + - CO2 emissions (kg per 2010 US$ of GDP) + - GDP per capita (current US$) + - GDP per capita (current LCU) + - Population density (people per sq. km of land area) + + In such case, the name of the series should not be changed, but the measurement unit may be extracted from it and stored in element `measurement_unit`.
+ +- **`dimensions`** *[Optional ; Repeatable]*
+An indicator or time series can be made available at different levels of disaggregation. For example, a time series containing annual estimates of the indicator "Resident population (mid-year)" can be provided by country, by urban/rural area of residence, by sex, and by age group. The data curator has to make a decision on how to organize such data. One option is to create an indicator "Resident population (mid-year)" and to define a set of "dimensions" for the breakdowns. The dimensions would in such case be the year, the country, the area of residence, the sex, and the age group. Some of the dimensions would have to be provided with a code list (or "controlled vocabulary"), for example stating that "F" means "Female", "M" means "Male", and "T" means "Total" for the dimension *sex*. Another option would be to create multiple indicators (e.g., creating three distinct indicators "Resident population, male (mid-year)", "Resident population, female (mid-year)", and "Resident population, total (mid-year)") and using year, country, area of residence, and age group as dimensions. The element `dimensions` is used to provide an itemized list of disaggregations that correspond to the published data (see the example below). Note that another element in the schema, `disaggregation`, is also provided, in which a narrative description of the actual or recommended disaggregations can be documented. Note also that in the SDMX standard, dimensions are listed in the *Data Structure Definition* and are complemented by *code lists* that provide the related controlled vocabularies.
+
+```json +"dimensions": [ + { + "name": "string", + "label": "string", + "description": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the dimension. + - **`label`** *[Required ; Not repeatable ; String]*
+ The label of the dimension, for example "sex", or "urban/rural". + - **`description`** *[Optional ; Not repeatable ; String]*
+ A description of the dimension (for example, if the label was "age group", the description can provide detailed information on the age groups, e.g. "The age groups in the database are 0-14, 15-49, 50-64, and 65+ years old".)
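+
+ For example, a population indicator disaggregated by sex and by age group could declare its dimensions as follows (R sketch):
+
+```r
+dimensions = list(
+  list(name        = "sex",
+       label       = "Sex",
+       description = "Codes used in the data file: M = Male ; F = Female ; T = Total"),
+  list(name        = "age_group",
+       label       = "Age group",
+       description = "Age groups used in the database: 0-14, 15-64, and 65+ years old")
+)
+```
+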

+ +- **`release_calendar`** *[Optional ; Not repeatable ; String]*
+ + Information on when updates for the indicators can be expected. This will usually not consist of exact dates (which would have to be updated regularly), but of more general information like "Every first Monday of the Month", or "Every year on June 30", or "The last week of each quarter". + +- **`periodicity`** *[Optional ; Not repeatable ; String]*
+ + The periodicity of the series. It is recommended to use a controlled vocabulary with values like *annual*, *quarterly*, *monthly*, *daily*, etc. + +- **`base_period`** *[Optional ; Not repeatable ; String]*
+ + The base period for the series. This field will only apply to series that require a base year (or other reference time) used as a benchmark, like a Consumer Price Index (CPI) which will have a value of 100 for a reference base year. + +- **`definition_short`** *[Optional ; Not repeatable ; String]*
+A short definition of the series. The short definition captures the essence of the series.
+ +- **`definition_long`** *[Optional ; Not repeatable ; String]*
+A long(er) version of the definition of the series. If only one definition is available (not a short/long version), it is recommended to capture it in the `definition_short` element. Alternatively, the same definition can be stored in both `definition_short` and `definition_long`.
+
+- **`definition_references`** *[Optional ; Repeatable]*
+This element is provided to link to external resources from which the definition was extracted.
+
+```json +"definition_references": [ + { + "source": "string", + "uri": "string", + "note": "string" + } +] +``` +
+ + - **`source`** *[Optional ; Not repeatable ; String]*
+ The source of the definition (title, or label). + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link (URL) to the source of the definition. + - **`note`** *[Optional ; Not repeatable ; String]*
+ This element provides for annotating or explaining the reason the reference has been included as part of the metadata.

+ +- **`statistical_concept`** *[Optional ; Not repeatable ; String]*
+ + This element allows to insert a reference of the series with content of a statistical character. This can include coding concepts or standards that are applied to render the data statistically relevant. + +- **`concepts`** *[Optional ; Repeatable]*
+This repeatable element can be used to document concepts related to the indicators or time series (other than the main statistical concept that may have been entered in `statistical_concept`). For example, the concept of *malnutrition* could be documented in relation to the indicators "Prevalence of stunting" and "Prevalence of wasting".
+
+```json +"concepts": [ + { + "name": "string", + "definition": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ A concise and standardized name (label) for the concept. + - **`definition`** *[Required ; Not repeatable ; String]*
+ The definition of the concept. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link (URL) to a resource providing more detailed information on the concept.

+ +- **`data_collection`** *[Optional ; Not repeatable]*
+This group of elements can be used to document data collection activities that led to or allowed the production of the indicator. This element will typically be used for the description of surveys or censuses. +Note: the schema also contains an element "sources". That element will be used to document the organization and/or main data production program from which the indicator is derived. +
+```json +"data_collection": [ + { + "data_source": "string", + "method": "string", + "period": "string", + "note": "string" + "uri": "string" + } +] +``` +
+ + - **`data_source`** *[Required ; Not repeatable ; String]*
+ A concise and standardized name (label) for the data source, e.g. "National Labor Force Survey, 1st quarter 2022". If multiple data sources were used, they can all be listed here. Note that if a time series has values obtained from many different sources, the source for each value (or group of values) will not be part of the indicator/series metadata, but will be stored as an attribute in the data file, where the information can be associated with a specific observation ("cell note") or with a group of observations (e.g., attached to all values for a same year or for a same area).
+ - **`method`** *[Required ; Not repeatable ; String]*
+ Brief information on the data collection method, e.g. "Sample household survey".
+- **`period`** *[Optional ; Not repeatable ; String]*
+Information on the period of the data collection, e.g. "January to March 2022".
+- **`note`** *[Optional ; Not repeatable ; String]*
+Additional information on the data collection. +- **`uri`** *[Optional ; Not repeatable ; String]*
+A link to a resource (website, document) where more information on the data collection can be found. + +- **`imputation`** *[Optional ; Not repeatable ; String]*
+ + Data may have been imputed to account for data gaps or for other reasons (harmonization/standardization, and others). If imputations have been made, this element provides the space for their description. + +- **`adjustments`** *[Optional ; Repeatable ; String]*
+ + Description of any adjustments with respect to use of standard classifications and harmonization of breakdowns for age group and other dimensions, or adjustments made for compliance with specific international or national definitions. + +- **`missing`** *[Optional ; Not repeatable ; String]*
+ + Information on missing values in the series or indicator. This information can be related to treatment of missing values, to the cause(s) of missing values, and others. + +- **`validation_rules`** *[Optional ; Repeatable ; String]*
+ + Description of the set of rules (itemized) used to validate values for the indicator, e.g. "Is within range 0-100", or "Is the sum of indicatorX + indicator Y". + +- **`quality_checks`** *[Optional ; Not repeatable ; String]*
+ + Data may have gone through data quality checks to assure that the values are reasonable and coherent, which can be described in this element. These quality checks may include checking for outlying values or other. A brief description of such quality control procedures will contribute to reinforcing the credibility of the data being disseminated. + +- **`quality_note`** *[Optional ; Not repeatable ; String]*
+ + Additional notes or an overall statement on data quality. These could for example cover non-standard quality notes and/or information on independent reviews on the data quality. + +- **`sources_discrepancies`** *[Optional ; Not repeatable ; String]*
+ + This element is used to describe and explain why the data in the series may be different from the data for the same series published in other sources. International organizations, for example, may apply different techniques to make data obtained from national sources comparable across countries, in which cases the data published in international databases may differ from the data published in national, official databases. + +- **`series_break`** *[Optional ; Not repeatable ; String]*
+
+ Breaks in statistical series occur when there is a change in the standards, sources of data, or reference year used in the compilation of a series. Breaks in series must be well documented. The documentation should include the reason(s) for the break, the time it occurred, and information on the impact on comparability of data over time.
+
+- **`limitation`** *[Optional ; Not repeatable ; String]*
+ + This element is used to communicate to the user any limitations or exceptions in using the data. The limitations may result from the methodology, from issues of quality or consistency in the data source, or other. + +- **`themes`** *[Optional ; Repeatable]*
+Themes provide a general idea of the research that might guide the creation and/or demand for the series. A theme is broad and is likely also subject to a community based definition or list. A controlled vocabulary should be used. This element will rarely be used (the element `topics` described below will be used more often). +
+```json +"themes": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`id`** *[Optional ; Not repeatable ; String]*
+ The unique identifier of the theme. It can be a sequential number, or the ID of the theme in a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The label of the theme associated with the data. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ When a hierarchical (nested) controlled vocabulary is used, the `parent_id` field can be used to indicate a higher-level theme to which this theme belongs. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to the controlled vocabulary mentioned in field `vocabulary'. + +- **`topics`** *[Optional ; Repeatable]*
+The `topics` field indicates the broad substantive topic(s) that the indicator/series covers. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the [Council of European Social Science Data Archives (CESSDA) topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification).
+
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`id`** *[Optional ; Not repeatable ; String]*
+ The unique identifier of the topic. It can be a sequential number, or the ID of the topic in a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The label of the topic associated with the data. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ When a hierarchical (nested) controlled vocabulary is used, the `parent_id` field can be used to indicate a higher-level topic to which this topic belongs. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name of the controlled vocabulary used, if any. + - **`uri`**
+ A link to the controlled vocabulary mentioned in field `vocabulary`.

+ +- **`disciplines`** *[Optional ; Repeatable]*
+Information on the academic disciplines related to the content of the document. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in [Wikipedia](https://en.wikipedia.org/wiki/List_of_academic_fields). +
+```json +"disciplines": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + This is a block of five elements: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The ID of the discipline, preferably taken from a controlled vocabulary. + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the discipline, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent ID of the discipline (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.

+ +- **`relevance`** *[Optional ; Not repeatable ; String]*
+ + This field documents the relevance of an indicator or series in relation to a social imperative or policy objective.
+ +- **`mandate`** *[Optional ; Not repeatable ; String]*
+ + - **`mandate`** *[Optional ; Not repeatable ; String]*
+ Description of the institutional mandate or of a set of rules or other formal set of instructions assigning responsibility as well as the authority to an organization for the collection, processing, and dissemination of statistics for this indicator.
+ - **`URI`** *[Optional ; Not repeatable ; String]*
+ A link to a resource (document, website) describing the mandate.
+ +- **`time_periods`** *[Optional ; Repeatable]*
+The time period covers the entire span of data available for the series. The time period has a start and an end and is reported according to the periodicity provided in a previous element. +
+```json +"time_periods": [ + { + "start": "string", + "end": "string", + "notes": "string" + } +] +``` +
+ + - **`start`** *[Required ; Not repeatable ; String]*
+ The initial date of the series in the dataset. The start date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). + - **`end`** *[Required ; Not repeatable ; String]*
+ The end date is the latest date for which an estimate for the indicator is available. The end date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
+ - **`notes`** *[Optional ; Not repeatable ; String]*
+ Additional information on the time period.
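+
+For example, a series with estimates available from 1960 to 2020 (as in the complete example at the end of this chapter) would be documented as:
+
+```r
+time_periods = list(
+  list(start = "1960", end = "2020")
+)
+```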

+ +- **`ref_country`** *[Optional ; Repeatable]*
+A list of countries for which data are available in the series. This element is somewhat redundant with the next element (`geographic_units`) which may also contain a list of countries. Identifying geographic areas of type "country" is important to enable filters and facets in data catalogs (country names are among the most frequent queries submitted to catalogs). +
+```json +"ref_country": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the country. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the country. The use of the [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) codes is recommended. + +- **`geographic_units`** *[Optional ; Repeatable]*
+List of geographic units (regions, countries, states, provinces, etc.) for which data are available for the series. +
+```json +"geographic_units": [ + { + "name": "string", + "code": "string", + "type": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ Name of the geographic unit, e.g. "World", "Africa", "Afghanistan", "OECD countries", "Bangkok". + - **`code`** *[Optional ; Not repeatable ; String]*
+ Code of the geographic unit. The [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) code is preferred when the unit is a country. + - **`type`** *[Optional ; Not repeatable ; String]*
+ Type of geographic unit e.g. "country", "state", "region", "province", "city", etc.
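+
+A short illustration in R, combining `ref_country` and `geographic_units` (the values are taken from the complete example at the end of this chapter; in practice, the lists would be much longer and preferably generated programmatically from the data):
+
+```r
+ref_country = list(
+  list(name = "Afghanistan", code = "AFG"),
+  list(name = "Albania",     code = "ALB")
+)
+
+geographic_units = list(
+  list(name = "Afghanistan",                 code = "AFG", type = "country/economy"),
+  list(name = "Africa Eastern and Southern", code = "AFE", type = "geographic region"),
+  list(name = "Albania",                     code = "ALB", type = "country/economy")
+)
+```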

+ +- **`bbox`** *[Optional ; Repeatable]*
+This element is used to define one or multiple bounding box(es), the fundamental rectangular geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset's geographic coverage. The bounding box provides the geographic coordinates of the top-left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. This element is optional, but if the `bound_poly` element (see below) is used, then the `bbox` element must be included.
+
+```json +"bbox": [ + { + "west": "string", + "east": "string", + "south": "string", + "north": "string" + } +] +``` +
+ + - **`west`** *[Required ; Not repeatable ; String]*
+ West longitude of the bounding box.
+ - **`east`** *[Required ; Not repeatable ; String]*
+ East longitude of the bounding box.
+ - **`south`** *[Required ; Not repeatable ; String]*
+ South latitude of the bounding box.
+ - **`north`** *[Required ; Not repeatable ; String]*
+ North latitude of the bounding box.
+ + This example is for an indicator covering the islands of Madagascar and Mauritius: +
+ ![](./images/Microdata_bbox.JPG){width=45%} +
+ +
+  ```r
+  my_indicator <- list(
+    metadata_information = list(
+      # ...
+    ),
+    series_description = list(
+      # ... ,
+      
+      ref_country = list(
+        list(name = "Madagascar", code = "MDG"),
+        list(name = "Mauritius",  code = "MUS")
+      ),
+      
+      bbox = list(
+        
+        list(west  = "43.2541870461",   # Madagascar
+             east  = "50.4765368996",
+             south = "-25.6014344215",
+             north = "-12.0405567359"),
+        
+        list(west  = "56.6",            # Mauritius
+             east  = "72.466667",
+             south = "-20.516667",
+             north = "-5.25")
+        
+      )
+      # ...
+    )
+    # ...
+  )
+  ```
+
+ +- **`aggregation_method`** *[Optional ; Not repeatable ; String]*
+ + The `aggregation_method` element describes how values can be aggregated from one geographic level (for example, a country) to a higher-level geographic area (for example, a group of countries defined based on a geographic criterion such as a region or the world, or on another criterion such as low/medium/high-income countries, island countries, or OECD countries). The aggregation method can be simple (like "sum" or "population-weighted average") or more complex, involving weighting of values.
+ +- **`disaggregation`** *[Optional ; Not repeatable ; String]*
+ + This element is intended to inform users that an indicator or series is available at various levels of disaggregation. The related series should be listed (by name and/or identifier). For indicator "Population, total" for example, one may inform the user that the indicator is also available (in other series) by sex, urban/rural, and age group (in series "Population, male" and "Population, female", etc.). + +- **`license`** *[Optional ; Repeatable]*
+The license refers to the accessibility and terms of use associated with the data. Providing a license and a link to the terms of the license allows data users to determine, with full clarity, what they can and cannot do with the data. +
+```json +"license": [ + { + "name": "string", + "uri": "string", + "note": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the license, e.g. "Creative Commons Attribution 4.0 International license (CC-BY 4.0)". + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of a website where the license is described in detail, for example "https://creativecommons.org/licenses/by/4.0/".
+ - **`note`** *[Optional ; Not repeatable ; String]*
+ Any additional information on the license.
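+
+For example, for a series published under the Creative Commons Attribution 4.0 license (as in the complete example at the end of this chapter):
+
+```r
+license = list(
+  list(name = "Creative Commons Attribution 4.0 International license (CC BY-4.0)",
+       uri  = "https://creativecommons.org/licenses/by/4.0/")
+)
+```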

+ +- **`confidentiality`** *[Optional ; Not repeatable ; String]*
+ + A statement of confidentiality for the series. + +- **`confidentiality_status`** *[Optional ; Not repeatable ; String]*
+ + This indicates a confidentiality status for the series. A controlled vocabulary should be used with possible options "public", "official use only", "confidential", "strictly confidential". When all series are made publicly available, and belong to a database that has an open or public access policy, this element can be ignored. + +- **`confidentiality_note`** *[Optional ; Not repeatable ; String]*
+ + This element is reserved for additional notes regarding confidentiality of the data. This could involve references to specific laws and circumstances regarding the use of data.
+ +- **`links`** *[Optional ; Repeatable]*
+This element provides links to online resources of any type that could be useful to the data users. This can be links to description of methods and reference documents, analytics tools, visualizations, data sources, or other. +
+```json +"links": [ + { + "type": "string", + "description": "string", + "uri": "string" + } +] +``` +
+ + - **`type`** *[Optional ; Not repeatable ; String]*
+ This element is used to classify the link that is provided. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A description of the link that is provided. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The uri (URL) to the described resource.
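+
+An illustrative example in R (the resources and URLs listed below are fictitious and only meant to show the intended use of the `type`, `description`, and `uri` fields):
+
+```r
+links = list(
+  list(type        = "Methodology document",
+       description = "Detailed description of the methods used to compile the indicator",
+       uri         = "http://www.example.org/methodology"),
+  list(type        = "Visualization",
+       description = "Interactive dashboard for the indicator",
+       uri         = "http://www.example.org/dashboard")
+)
+```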

+ +- **`api_documentation`** *[Optional ; Repeatable]*
+Increasingly, data are made accessible via Application Programming Interfaces (APIs). The API associated with a series must be documented. The documentation will usually not be specific to a series, but will apply to all series in the same database. +
+```json +"api_documentation": [ + { + "description": "string", + "uri": "string" + } +] +``` +
+ + - **`description`** *[Optional ; Not repeatable ; String]*
+ This element will not contain the API documentation itself, but information on what documentation is available. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the API documentation.
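+
+For example (as in the complete example at the end of this chapter, for an indicator from the World Bank World Development Indicators):
+
+```r
+api_documentation = list(
+  list(description = "See the Developer Information webpage for detailed documentation of the API",
+       uri         = "https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information")
+)
+```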

+ +- **`authoring_entity`** *[Optional ; Repeatable]*
+This set of five elements is used to identify the organization(s) or person(s) who are the main producers/curators of the indicator. Note that a similar element is provided at the database level. The `authoring_entity` for the indicator can be different from the `authoring_entity` of the database. For example, the World Bank is the authoring entity for the World Development Indicators database, which contains indicators obtained from the International Monetary Fund, World Health Organization, and other organizations that are thus the authoring entities for specific indicators. +
+```json +"authoring_entity": [ + { + "name": "string", + "affiliation": "string", + "abbreviation": null, + "email": null, + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization who is responsible for the production of the indicator or series. Write the name in full (use the element `abbreviation` to capture the acronym of the organization, if relevant). + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization mentioned in `name`. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ Abbreviated name (acronym) of the organization mentioned in `name`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ The public email contact of the person or organizations mentioned in `name`. It is good practice to provide a service account email address, not a personal one. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link (URL) to the website of the entity mentioned in `name`.
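+
+A brief illustration in R, for an indicator curated by the World Health Organization (shown for illustration only):
+
+```r
+authoring_entity = list(
+  list(name         = "World Health Organization",
+       abbreviation = "WHO",
+       uri          = "https://www.who.int")
+)
+```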

+ +- **`sources`** *[Optional ; Repeatable]*
+This element provides information on the source(s) of data that were used to generate the indicator. A source can refer to an organization (e.g., "Source: World Health Organization"), or to a dataset (e.g., for a national poverty headcount indicator, the sources will likely be a list of sample household surveys). In `sources`, we are mainly interested in the latter. When a series in a database is extracted from another database (e.g., when the World Bank World Development Indicators include a series from the World Health Organization in its database), the source organization should be mentioned in the `authoring_entity` element of the schema. The `sources` element is a repeatable element. +Note 1: In some cases, the source of a specific value in a database will be stored as an attribute of the data file (e.g., as a "footnote" attached to a specific cell). If the sources are listed in the data file, they may, but do not need to, be stored in the metadata. +Note 2: The schema also contains an element "data_collection" that would be used to describe a specific data collection activity from which an indicator is derived. +
+```json +"sources": [ + { + "id": "string", + "name": "string", + "organization": "string", + "type": "string", + "note": "string" + } +] +``` +
+ + - **`id`** *[Required ; String]*
+ This element records the unique identifier of a source. It is a required element. If the source does not have a specific unique identifier, a sequential number can be used. If the source is a dataset or database that has its own unique identifier (possibly a DOI), this identifier should be used. + - **`name`** *[Optional ; String]*
+ The name (title, or label) of the source. + - **`organization`** *[Optional ; String]*
+ The organization responsible for the source data. + - **`type`** *[Optional ; String]*
+ The type of source, e.g. "household survey", "administrative data", or "external database". + - **`note`** *[Optional ; String]*
+ This element can be used to provide additional information regarding the source data.
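+
+A hypothetical illustration in R, for an indicator derived from two household surveys (the identifiers and survey titles below are invented):
+
+```r
+sources = list(
+  list(id           = "SRC_001",
+       name         = "Household Income and Expenditure Survey 2015",
+       organization = "National Statistical Office",
+       type         = "household survey"),
+  list(id           = "SRC_002",
+       name         = "Household Income and Expenditure Survey 2020",
+       organization = "National Statistical Office",
+       type         = "household survey")
+)
+```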

+ +- **`sources_note`** *[Optional ; Not repeatable ; String]*
+ + Additional information on the source(s) of data used to generate the series or indicator. + +- **`keywords`** *[Optional ; Repeatable]*
+Words or phrases that describe salient aspects of a data collection's content. Can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed. Keywords should be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
+
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ Keyword (or phrase). Keywords summarize the content or subject matter of the study. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ Controlled vocabulary from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI of the controlled vocabulary used, if any. + +- **`acronyms`** *[Optional ; Repeatable]*
+The `acronyms` element is used to document the meaning of all acronyms used in the metadata of a series. While some acronyms are well known (like "GDP" or "IMF"), others may be less obvious or ambiguous (does "PPP" mean "public-private partnership" or "purchasing power parity"?). In any case, providing a list of acronyms with their meaning will help users and make your metadata more discoverable. Note that acronyms should not include country codes used in the documentation of the geographic coverage of the data. +
+```json +"acronyms": [ + { + "acronym": "string", + "expansion": "string", + "occurrence": 0 + } +] +``` +
+ + - **`acronym`** *[Required ; Not repeatable ; String]*
+ An acronym referenced in the series metadata (e.g. "GDP"). + - **`expansion`** *[Required ; Not repeatable ; String]*
+ The expansion of the acronym, i.e. the full name or title that it represents (e.g., "Gross Domestic Product"). + - **`occurrence`** *[Optional ; Not repeatable ; Numeric]*
+ This numeric element can be used to indicate the number of times the acronym is mentioned in the metadata. The element will rarely be used.
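+
+For example, the keywords and acronyms for a poverty headcount indicator could be entered as follows (values adapted from the complete example at the end of this chapter):
+
+```r
+keywords = list(
+  list(name = "poverty rate"),
+  list(name = "international poverty line"),
+  list(name = "welfare"),
+  list(name = "inequality")
+)
+
+acronyms = list(
+  list(acronym = "PPP", expansion = "Purchasing Power Parity")
+)
+```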

+ +- **`errata`** *[Optional ; Repeatable]*
+This element is used to provide information on detected errors in the data or metadata for the series, and on the measures taken to remedy them. +
+```json +"errata": [ + { + "date": "string", + "description": "string" + } +] +``` +
+ + - **`date`** *[Required ; Repeatable ; String]*
+ The date the erratum was published.
+ - **`description`** *[Required ; Repeatable ; String]*
+ A description of the error and remedy measures.
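+
+A hypothetical illustration in R:
+
+```r
+errata = list(
+  list(date        = "2021-07-12",
+       description = "The 2019 values for rural areas were found to be incorrect due to a data processing error, and have been corrected in this release.")
+)
+```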

+ +- **`notes`** *[Optional ; Repeatable]*
+This element is open and reserved for explanatory notes deemed useful to the users of the data. Notes can provide any additional information that might, for example, help users replicate the series, access and use the data, or improve the overall discoverability of the series. +
+```json +"notes": [ + { + "note": "string" + } +] +``` +
+ + - **`note`** *[Required ; Repeatable ; String]*
+ The note itself.

+ +- **`related_indicators`** *[Optional ; Repeatable]*
+This element is used to reference indicators that are often associated with the indicator being documented. +
+```json +"related_indicators": [ + { + "code": "string", + "label": "string", + "uri": "string" + } +] +``` +
+ + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the related indicator that is being referenced. This will typically be the identifier used for that indicator in its own database or catalog. + - **`label`** *[Optional ; Not repeatable ; String]*
+ The name or label of the indicator that is associated with the indicator being documented. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to the related indicator.
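+
+For example (from the complete example at the end of this chapter), a poverty headcount indicator can reference the corresponding poverty gap indicator:
+
+```r
+related_indicators = list(
+  list(code  = "SI.POV.GAPS",
+       label = "Poverty gap at $1.90 a day (2011 PPP) (%)",
+       uri   = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.GAPS")
+)
+```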

+ +- **`compliance`** *[Optional ; Repeatable]*
+For some indicators, international standards have been established. This is for example the case of indicators like the unemployment rate, for which the International Conference of Labour Statisticians defines the standard concepts and methods. The `compliance` element is used to document the compliance of a series with one or multiple national or international standards. +
+```json +"compliance": [ + { + "standard": "string", + "abbreviation": "string", + "custodian": "string", + "uri": "string" + } +] +``` +
+ + - **`standard`** *[Optional ; Not repeatable ; String]*
+ The name of the standard that the series complies with. This name will ideally include a label and a version or a date. For example: "International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4, published in 2007" + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ The acronym of the standard that the series complies with. + - **`custodian`** *[Optional ; Not repeatable ; String]*
+ The organization that maintains the standard that is being used for compliance. For example: "United Nations Statistics Division". + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to a public website where information on the compliance standard can be obtained. For example: "https://unstats.un.org/unsd/classifications/Family/Detail/27"
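+
+An illustration in R, using the examples mentioned above (ISIC Revision 4, maintained by the United Nations Statistics Division):
+
+```r
+compliance = list(
+  list(standard     = "International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4, published in 2007",
+       abbreviation = "ISIC Rev.4",
+       custodian    = "United Nations Statistics Division",
+       uri          = "https://unstats.un.org/unsd/classifications/Family/Detail/27")
+)
+```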

+ +- **`framework`** *[Optional ; Repeatable]*
+Some national, regional, and international agencies develop monitoring frameworks, with goals, targets, and indicators. Some well-known examples are the [Millennium Development Goals](https://www.un.org/millenniumgoals/) and the [Sustainable Development Goals](https://sdgs.un.org/goals) which establish international goals for human development, or the World Summit for Children (1990) which set international goals in the areas of child survival, development and protection, supporting sector goals such as women’s health and education, nutrition, child health, water and sanitation, basic education, and children in difficult circumstances. The `framework` element is used to link an indicator or series to the framework, goal, and target associated with it. +
+```json +"framework": [ + { + "name": "string", + "abbreviation": "string", + "custodian": "string", + "description": "string", + "goal_id": "string", + "goal_name": "string", + "goal_description": "string", + "target_id": "string", + "target_name": "string", + "target_description": "string", + "indicator_id": "string", + "indicator_name": "string", + "indicator_description": "string", + "uri": "string", + "notes": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the framework. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ The abbreviation of the name of the framework. + - **`custodian`** *[Optional ; Not repeatable ; String]*
+ The name of the organization that is the official custodian of the framework. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the framework. + - **`goal_id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the Goal that the indicator or series is associated with. + - **`goal_name`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the Goal that the indicator or series is associated with. + - **`goal_description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the Goal that the indicator or series is associated with. + - **`target_id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the Target that the indicator or series is associated with. + - **`target_name`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the Target that the indicator or series is associated with. + - **`target_description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the Target that the indicator or series is associated with. + - **`indicator_id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the indicator, as provided in the framework (this is not the `idno` identifier). + - **`indicator_name`** *[Optional ; Not repeatable ; String]*
+ The name of the indicator, as provided in the framework (which may be different from the name provided in `name`). + - **`indicator_description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the indicator, as provided in the framework. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to a website providing detailed information on the framework, its goals, targets, and indicators. + - **`notes`** *[Optional ; Not repeatable ; String]*
+ Any additional information on the relationship between the indicator/series and the framework.
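+
+For example (adapted from the complete example at the end of this chapter), an extreme poverty indicator can be mapped to the Sustainable Development Goals as follows (only a subset of the available fields is shown):
+
+```r
+framework = list(
+  list(name           = "Sustainable Development Goals (SDGs)",
+       goal_id        = "SDG Goal 1",
+       goal_name      = "End poverty in all its forms everywhere",
+       target_id      = "SDG Target 1.1",
+       indicator_id   = "SDG Indicator 1.1.1",
+       indicator_name = "Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)",
+       uri            = "https://sdgs.un.org/goals")
+)
+```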

+ + +- **`series_group`** *[Optional ; Repeatable]*
+The group(s) the indicator belongs to. Groups can be created to organize indicators/series by theme, producer, or other criteria. +
+```json +"series_groups": [ + { + "name": "string", + "description": "string", + "version": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the group. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the group. + - **`version`** *[Optional ; Not repeatable ; String]*
+ The version of the grouping. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to a public website where information on the grouping can be obtained.

+ +- **`contacts`** *[Optional ; Repeatable]*
+The `contacts` element provides the public interface for questions associated with the production of the indicator or time series. + +
+```json +"contacts": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "email": "string", + "telephone": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person who should be contacted. Instead of the name of an individual (which would be subject to change and require frequent updates of the metadata), a title or function can be provided here (e.g. "data helpdesk"). + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the contact person mentioned in `name`. This will be used when multiple contacts are listed, and is intended to help users direct their questions and requests to the right contact person. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The organization or affiliation of the contact person mentioned in `name`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ The email address of the person or organization mentioned in `name`. Avoid using personal email accounts; the use of an anonymous email is recommended (e.g., "helpdesk@....org"). + - **`telephone`** *[Optional ; Not repeatable ; String]*
+ The phone number of the person or organization mentioned in `name`. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI of the agency (typically, a URL to a "contact us" web page).
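+
+A brief illustration in R (the contact details shown are placeholders):
+
+```r
+contacts = list(
+  list(name        = "Data Helpdesk",
+       affiliation = "National Statistical Office",
+       email       = "helpdesk@example.org",
+       uri         = "http://www.example.org/contact-us")
+)
+```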

+ + +### Provenance + +**`provenance`** *[Optional ; Repeatable]*
+Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ + - **`origin_description`** *[Required ; Not repeatable]*
+ The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ + - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+ A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the series (element `idno`) will be modified when the metadata are published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ @@@@@@@
+ +### Tags + +**`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. +
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ +- **`tag`** *[Required ; Not repeatable ; String]*
+A user-defined tag. +- **`tag_group`** *[Optional ; Not repeatable ; String]*

+A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs. + +- **`lda_topics`** *[Optional ; Not repeatable]*
+
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+ + We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights). + + Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of indicators and time series, this text will be a concatenation of some metadata elements including the series' name, definitions, keywords, concepts, and possibly others. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%). + + The `lda_topics` element includes the following metadata fields. An example in R was provided in Chapter 4 - Documents. + + - **`model_info`** *[Optional ; Not repeatable]*
+ Information on the LDA model.
+ + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + - **`topic_description`** *[Optional ; Repeatable]*
+ The topic composition extracted from selected elements of the series metadata (typically, the name, definitions, and concepts).
+ + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the metadata (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic.

+
+
+```r
+lda_topics = list(
+   
+   list(
+      
+      model_info = list(
+        list(source      = "World Bank, Development Data Group",
+             author      = "A.S.",
+             version     = "2021-06-22",
+             model_id    = "Mallet_WB_75",
+             nb_topics   = 75,
+             description = "LDA model, 75 topics, trained on Mallet",
+             corpus      = "World Bank Documents and Reports (1950-2021)",
+             uri         = "")
+      ),
+      
+      topic_description = list(
+      
+        list(topic_id    = "topic_27",
+             topic_score = 32,
+             topic_label = "Education",
+             topic_words = list(list(word = "school",    word_weight = ""),
+                                list(word = "teacher",   word_weight = ""),
+                                list(word = "student",   word_weight = ""),
+                                list(word = "education", word_weight = ""),
+                                list(word = "grade",     word_weight = ""))),
+        
+        list(topic_id    = "topic_8",
+             topic_score = 24,
+             topic_label = "Gender",
+             topic_words = list(list(word = "women",  word_weight = ""),
+                                list(word = "gender", word_weight = ""),
+                                list(word = "man",    word_weight = ""),
+                                list(word = "female", word_weight = ""),
+                                list(word = "male",   word_weight = ""))),
+        
+        list(topic_id    = "topic_39",
+             topic_score = 22,
+             topic_label = "Forced displacement",
+             topic_words = list(list(word = "refugee",   word_weight = ""),
+                                list(word = "programme", word_weight = ""),
+                                list(word = "country",   word_weight = ""),
+                                list(word = "migration", word_weight = ""),
+                                list(word = "migrant",   word_weight = ""))),
+        
+        list(topic_id    = "topic_40",
+             topic_score = 11,
+             topic_label = "Development policies",
+             topic_words = list(list(word = "development", word_weight = ""),
+                                list(word = "policy",      word_weight = ""),
+                                list(word = "national",    word_weight = ""),
+                                list(word = "strategy",    word_weight = ""),
+                                list(word = "activity",    word_weight = "")))
+        
+      )
+      
+   )
+   
+)
+```
+
+- **`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into high-dimensional numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). + + The word vectors do not have to be stored in the series/indicator metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model. + +
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": null + } +] +``` +
+ + The `embeddings` element contains four metadata fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+ The date the model was trained (or a version date for the model). + - **`vector`** *[Required ; Not repeatable ; @@@@]* + The numeric vector representing the series metadata.

+ +### Additional + +**`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. + + +## Generating and publishing compliant metadata - Complete example + +We use a series from the World Bank's World Development Indicators (WDI 2021) as an example: the series ["Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)"](https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.DDAY). + +:::note +Note that we only show how metadata are generated and published in a NADA catalog. We do not demonstrate the (optional) publishing of the underlying data in a MongoDB database, which makes the data accessible via API and allows activation of data visualizations in the NADA catalog. This is covered in the NADA documentation. +::: + + +### Use of AI for metadata augmentation + +The discoverability of indicators by keyword-based search engines can be significantly improved by the inclusion of a list of relevant keywords in the metadata. These keywords can be synonyms or terms and concepts that are closely associated with the indicator. Identifying the most relevant related keywords requires subject matter expertise. But this can be considerably facilitated by the use of AI tools. We provide below an example of a query submitted to ChatGPT. The proposed terms returned by the application MUST be reviewed by a subject matter expert. But having the proposed list (which can be copy-pasted and then edited in a Metadata Editor or in a script) will make the process very efficient. + +
+
+![image](https://user-images.githubusercontent.com/35276300/230427783-28276a6c-a284-423c-9205-3ab4ce2d465b.png) +
+
+ +The returned list is as follows: +Poverty +Headcount ratio +Income +Consumption +Living standards +Basic needs +Poverty line +Purchasing power parity (PPP) +International poverty line +Economic development +Social inequality +Human development +Poverty reduction +Extreme poverty +Global poverty +Developing countries +Wealth distribution +Rural poverty +Urban poverty +Household income +Inclusive growth +Multidimensional poverty +Income inequality +Poverty gap +Human capital +Poverty trap +Food security +Employment +Vulnerability +Social protection +Poverty measurement +Poverty alleviation +Social exclusion +Targeted interventions +Poverty incidence +Poverty dynamics +Poverty cycle +Equity +Income distribution +Sustainable development + + +### Using R + + +```r +# The code below generates metadata at the database level (object "wdi_database") +# and for a time series (object "this_series"). +# It then publishes the metadata in a NADA catalog using the R package NADAR. +# It also publishes related materials as "external resources". +library(nadar) +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- +setwd("C:/my_indicators/") +thumb = "poverty.JPG" # Image to be used as thumbnail in the data catalog +db_id = "WB_WDI_2021_09_15" # The WDI database identifier + +# Document the indicator (Poverty headcount ratio at $1.90 a day) +this_series = list( + + metadata_creation = list( + producers = list( + list(name = "Development Data Group", + abbr = "DECDG", + affiliation = "World Bank", + role = "Metadata curation") + ), + prod_date = "2021-10-15", + version = "Example v 1.0" + ), + + series_description = list( + + idno = "SI.POV.DDAY", + + name = "Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)", + + database_id = db_id, # To attach the database metadata to the series metadata + + measurement_unit = "% of population", + + periodicity = "Annual", + + definition_short = "Poverty headcount ratio at $1.90 a day is the percentage of the population living on less than $1.90 a day at 2011 international prices. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions.", + + definition_references = list( + list(source = "World Bank, Development Data Group", + uri = "https://databank.worldbank.org/metadataglossary/millennium-development-goals/series/SI.POV.DDAY" + ) + ), + + methodology = "International comparisons of poverty estimates entail both conceptual and practical problems. Countries have different definitions of poverty, and consistent comparisons across countries can be difficult. Local poverty lines tend to have higher purchasing power in rich countries, where more generous standards are used, than in poor countries. Since World Development Report 1990, the World Bank has aimed to apply a common standard in measuring extreme poverty, anchored to what poverty means in the world's poorest countries. The welfare of people living in different countries can be measured on a common scale by adjusting for differences in the purchasing power of currencies. 
The commonly used $1 a day standard, measured in 1985 international prices and adjusted to local currency using purchasing power parities (PPPs), was chosen for World Development Report 1990 because it was typical of the poverty lines in low-income countries at the time. As differences in the cost of living across the world evolve, the international poverty line has to be periodically updated using new PPP price data to reflect these changes. The last change was in October 2015, when we adopted $1.90 as the international poverty line using the 2011 PPP. Prior to that, the 2008 update set the international poverty line at $1.25 using the 2005 PPP. Poverty measures based on international poverty lines attempt to hold the real value of the poverty line constant across countries, as is done when making comparisons over time. The $3.20 poverty line is derived from typical national poverty lines in countries classified as Lower Middle Income. The $5.50 poverty line is derived from typical national poverty lines in countries classified as Upper Middle Income. Early editions of World Development Indicators used PPPs from the Penn World Tables to convert values in local currency to equivalent purchasing power measured in U.S dollars. Later editions used 1993, 2005, and 2011 consumption PPP estimates produced by the World Bank. The current extreme poverty line is set at $1.90 a day in 2011 PPP terms, which represents the mean of the poverty lines found in 15 of the poorest countries ranked by per capita consumption. The new poverty line maintains the same standard for extreme poverty - the poverty line typical of the poorest countries in the world - but updates it using the latest information on the cost of living in developing countries. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions. The statistics reported here are based on consumption data or, when unavailable, on income surveys. Analysis of some 20 countries for which income and consumption expenditure data were both available from the same surveys found income to yield a higher mean than consumption but also higher inequality. When poverty measures based on consumption and income were compared, the two effects roughly cancelled each other out: there was no significant statistical difference.", + + limitation = "Despite progress in the last decade, the challenges of measuring poverty remain. The timeliness, frequency, quality, and comparability of household surveys need to increase substantially, particularly in the poorest countries. The availability and quality of poverty monitoring data remains low in small states, countries with fragile situations, and low-income countries and even some middle-income countries. The low frequency and lack of comparability of the data available in some countries create uncertainty over the magnitude of poverty reduction. Besides the frequency and timeliness of survey data, other data quality issues arise in measuring household living standards. The surveys ask detailed questions on sources of income and how it was spent, which must be carefully recorded by trained personnel. Income is generally more difficult to measure accurately, and consumption comes closer to the notion of living standards. And income can vary over time even if living standards do not. But consumption data are not always available: the latest estimates reported here use consumption data for about two-thirds of countries. 
However, even similar surveys may not be strictly comparable because of differences in timing or in the quality and training of enumerators. Comparisons of countries at different levels of development also pose a potential problem because of differences in the relative importance of the consumption of nonmarket goods. The local market value of all consumption in kind (including own production, particularly important in underdeveloped rural economies) should be included in total consumption expenditure but may not be. Most survey data now include valuations for consumption or income from own production, but valuation methods vary.", + + topics = list( + list(id = "1", + name = "Economics, Consumption and consumer behaviour", + vocabulary = "", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "2", + name = "Economics, Economic conditions and indicators", + vocabulary = "CESSDA Version 4.1", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "3", + name = "Economics, Economic systems and development", + vocabulary = "CESSDA Version 4.1", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "4", + name = "Social stratification and groupings, Equality, inequality and social exclusion", + vocabulary = "CESSDA Version 4.1", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification") + ), + + relevance = "The World Bank Group is committed to reducing extreme poverty to 3 percent or less, globally, by 2030. Monitoring poverty is important on the global development agenda as well as on the national development agenda of many countries. The World Bank produced its first global poverty estimates for developing countries for World Development Report 1990: Poverty (World Bank 1990) using household survey data for 22 countries (Ravallion, Datt, and van de Walle 1991). Since then there has been considerable expansion in the number of countries that field household income and expenditure surveys. The World Bank's Development Research Group maintains a database that is updated annually as new survey data become available (and thus may contain more recent data or revisions) and conducts a major reassessment of progress against poverty every year. PovcalNet is an interactive computational tool that allows users to replicate these internationally comparable $1.90, $3.20 and $5.50 a day global, regional and country-level poverty estimates and to compute poverty measures for custom country groupings and for different poverty lines. The Poverty and Equity Data portal provides access to the database and user-friendly dashboards with graphs and interactive maps that visualize trends in key poverty and inequality indicators for different regions and countries. 
The country dashboards display trends in poverty measures based on the national poverty lines alongside the internationally comparable estimates, produced from and consistent with PovcalNet.", + + time_periods = list(list(start = "1960", end = "2020")), + + geographic_units = list( + list(name = "Afghanistan", code = "AFG", type = "country/economy"), + list(name = "Africa Eastern and Southern", code = "AFE", type = "geographic region"), + list(name = "Africa Western and Central", code = "AFW", type = "geographic region"), + list(name = "Albania", code = "ALB", type = "country/economy"), + list(name = "Algeria", code = "DZA", type = "country/economy"), + list(name = "Angola", code = "AGO", type = "country/economy"), + list(name = "Aruba", code = "ABW", type = "country/economy") + # ... and many more - In a real situation, this would be programmatically extracted from the data + ), + + license = list(name = "CC BY-4.0", uri = "https://creativecommons.org/licenses/by/4.0/"), + + api_documentation = list( + description = "See the Developer Information webpage for detailed documentation of the API", + uri = "https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information" + ), + + source = "World Bank, Development Data Group (DECDG) and Poverty and Inequality Global Practice. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments. Data for high-income economies are from the Luxembourg Income Study database. For more information and methodology, see PovcalNet website: http://iresearch.worldbank.org/PovcalNet/home.aspx", + + keywords = list( + list(name = "poverty rate"), + list(name = "poverty incidence"), + list(name = "global poverty line"), + list(name = "international poverty line"), + list(name = "welfare"), + list(name = "prosperity"), + list(name = "inequality"), + list(name = "income") + ), + + acronyms = list( + list(acronym = "PPP", expansion = "Purchasing Power Parity") + ), + + related_indicators = list( + list(code = "SI.POV.GAPS", + label = "Poverty gap at $1.90 a day (2011 PPP) (%)", + uri = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.GAPS"), + list(code = "SI.POV.NAHC", + label = "Poverty headcount ratio at national poverty lines (% of population)", + uri = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.NAHC") + ), + + framework = list( + list(name = "Sustainable Development Goals (SDGs)", + description = "The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. 
At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed and developing - in a global partnership.", + goal_id = "SDG Goal 1", + goal_name = "End poverty in all its forms everywhere", + target_id = "SDG Target 1.1", + target_name = "By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day", + indicator_id = "SDG Indicator 1.1.1", + indicator_name = "Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)", + uri = "https://sdgs.un.org/goals") + ) + + ) + +) +# Publish the metadata in NADA, with a link to the WDI website + # Database-level metadata + timeseries_database_add(idno = db_id, + published = 1, + overwrite = "yes", + metadata = wdi_database) + + # Indicator-level metadata + timeseries_add( + idno = this_series$series_description$idno, + repositoryid = "central", + published = 1, + overwrite = "yes", + metadata = this_series, + thumbnail = thumb + ) +# Add a link to the WDI website as an external resource + +external_resources_add( + title = "World Development Indicators website", + idno = this_series$series_description$idno, + dctype = "web", + file_path = "https://datatopics.worldbank.org/world-development-indicators/", + overwrite = "yes" +) +``` + +After uploading the above metadata, and activating some visualization widgets, the result in NADA will be as follows (not all metadata displayed here; see https://nada-demo.ihsn.org/index.php/catalog/study/SI.POV.DDAY for the full view): + +
+![](./images/ReDoc_ts_series_50.JPG){width=100%} +![](./images/ReDoc_ts_series_51.JPG){width=100%} +![](./images/ReDoc_ts_series_52.JPG){width=100%} +![](./images/ReDoc_ts_series_53.JPG){width=100%} +![](./images/ReDoc_ts_series_54.JPG){width=100%} +
+ + +### Using Python + +The equivalent in Python of the R script provided above is as follows. + + +```python +# Same example in Python @@@@@@@@ +``` + diff --git a/09_chapter09_table.md b/09_chapter09_table.md new file mode 100644 index 0000000..6779141 --- /dev/null +++ b/09_chapter09_table.md @@ -0,0 +1,2600 @@ +--- +output: html_document +--- + +# Statistical tables {#chapter09} + +
+![](./images/table_logo.JPG){width=25%} +
+ + +## Introduction + +A statistical table (*cross tabulation* or *contingency table*) is a summary presentation of data. The [OECD Glossary of Statistical Terms](https://stats.oecd.org/glossary/) defines it as “observation data gained by a purposeful aggregation of statistical microdata conforming to statistical methodology [organized in] groups or aggregates, such as counts, means, or frequencies.” + +Tables are produced as an array of rows and columns that display numeric aggregates in a clearly labeled fashion. They may have a complex structure and become quite elaborate. They are typically found in publications such as statistical yearbooks, census and survey reports, research papers, or published on-line. + +Statistical tables can be understood by a broad audience. In some cases, they may be the only publicly-available output of a data collection activity. Even when other output is available --such as microdata, dashboards, or databases accessible via user interfaces or APIs-- statistical tables are an important component of data dissemination. It is thus important to make tables as discoverable as possible. The schema described in this chapter was designed to structure and foster the comprehensiveness of information on tables by rendering the pertinent metadata into a structured, machine-readable format. It is intended for the purpose of improving data discoverability. The schema is not intended to store information to programmatically re-create tables. + +The schema description is available at http://dev.ihsn.org/nada/api-documentation/catalog-admin/index.html#tag/Tables + + +## Anatomy of a table + +The figure below, adapted from [LabWrite Resources](https://labwrite.ncsu.edu/res/gh/gh-tables.html), provides an illustration of what statistical tables typically look like. The main parts of a table are highlighted. They provide a content structure for the metadata schema we describe in this chapter. + +
+![](./images/Anatomy_Table.JPG){width=100%} +
+**Table number and title**: Every table must have a title, and should have a number. Tables in yearbooks, reports, and papers are usually numbered in the order that they are referred to in the document. They can be numbered sequentially (Table 1, Table 2, and so on), by chapter (Table 1.1, Table 1.2, Table 2.1, ...), or based on another reference system. The table number typically precedes the table title. The title provides a description of the contents of the table. It should be concise and include the key elements shown in the table. + +**Column spanner, column heads, and stub head**: The column headings (and sub-headings) identify what data are listed in the table in a vertical arrangement. A column heading placed above the leftmost column is often referred to as the *stubhead*, and the column is the *stub column*. A heading that sits above two or more columns to indicate a certain grouping is referred to as a *column spanner*. + +**Stubs**: The horizontal headings and sub-headings of the rows are called *row captions*. Together, they form the *stub*. + +**Table body**: The actual data (values) in a table (containing for example percentages, means, or counts of certain variables) form the *table body*. + +**Table spanner**: A table spanner is located in the body of the table in order to divide the data in a table without changing the columns. Spanners go the entire length of the table. + +**Table notes**: Table notes are used to provide information that is not self-explanatory (e.g., to provide the expanded form of acronyms used in row or column captions). + +**Table source**: The source identifies the dataset(s) or database(s) that contain the data used to generate the table. This can for example be a survey or a census dataset. + + +## Schema description + +The table schema contains six blocks of elements. The first block of three elements (`repositoryid`, `published`, and `overwrite`) does not describe the table, but is used by the NADA cataloguing application to determine where and how the table metadata is published in the catalog. The second block, `metadata_information`, contains "metadata on the metadata" and is used mainly for archiving purposes. The third block, `table_description`, contains the elements used to describe the table and its production process. A fourth block, `provenance`, is used to document the origin of metadata that may be harvested from other catalogs. The block `tags` is used to add information (in the form of words or short phrases) that will be useful to create *facets* in the catalog user interface. Last, an empty block `additional` is provided as a container for additional metadata elements that users may want to create. +
+```json +{ + "repositoryid": "string", + "published": 0, + "overwrite": "no", + "metadata_information": {}, + "table_description": {}, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
+ +### Cataloguing parameters + +The following elements are used by the NADA application API (see the NADA documentation for more information): + +- **`repositoryid`**: A NADA catalog can be composed of multiple *collections*. The *repositoryid* element identifies in which collection the table will be published. This collection must have been previously created in the catalog. By default, the table will be published in the `central` catalog (i.e. in no particular collection). +- **`published`**: The NADA catalog allows tables to be published (in which case they will be visible to users of the catalog) or unpublished (in which case they will only be visible by administrators). The default value is 0 (unpublished). Code 1 is used to set the status to "published". + +- **`overwrite`**: This element defines what action will be taken when a command is issued to add the table to a catalog and a table with the same identifier (element *idno*) is already in the catalog. By default, the command will not overwrite the existing table (the default value of overwrite is "no"). Set this parameter to "yes" to allow the existing table to be overwritten in the catalog. + + +### Metadata information + +**`metadata_information`** *[Optional, Not Repeatable]*
+The `metadata_information` block is used to document the table metadata (not the table itself). It provides information on the process of generating the table metadata. This block is optional. The information it contains is useful to catalog administrators, not to the public. It is however recommended to enter at least the identification of the metadata producer, her/his affiliation, and the date the metadata were created. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so the metadata produced by one organization can be found in other data centers (complying with standards and schema is precisely intended to facilitate inter-operability of catalogs and automated information sharing). +
+```json +"metadata_information": { + "idno": "string", + "title": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "production_date": "string", + "version": "string" +} +``` +
+ +- **`idno`** *[Optional, Not Repeatable, String]*
+A unique identifier for the metadata document (the *metadata document* is the JSON file containing the table metadata). This is different from the table unique identifier (see section `title_statement` below), although the same identifier can be used, and it is good practice to generate identifiers that maintain an easy connection between the metadata idno and the table idno. For example, if the unique identifier of the table is "TBL_0001", the `idno` in the metadata_information could be "META_TBL_0001".
+
+- **`title`** *[Optional, Not Repeatable, String]*
+The title of the metadata document (not necessarily the title of the table). + +- **`producers`** *[Optional, Repeatable]*
+This refers to the producer(s) of the table metadata, not to the producer(s) of the table. This could for example be the data curator in a data center. Four elements can be used to provide information on the metadata producer(s): + + - **`name`** *[Optional, Not Repeatable, String]*
+ The name of the metadata producer/curator. An alternative to entering the name of the curator (e.g., for privacy protection purposes) is to enter the curator identifier (see the element *abbr* below).
+ - **`abbr`** *[Optional, Not Repeatable, String]*
+ This element can be used to provide an identifier of the metadata producer/curator mentioned in `name`. + - **`affiliation`** *[Optional, Not Repeatable, String]*
+ The affiliation of the metadata producer/curator mentioned in `name`. + - **`role`** *[Optional, Not Repeatable, String]*
+ The specific role of the metadata producer/curator mentioned in `name` (applicable when more than one person was involved in the production of the metadata).

+ +- **`production_date`** *[Optional, Not Repeatable, String]*
+The date the metadata (not the table) was produced. The date will preferably be entered in ISO 8601 format (YYYY-MM-DD). + +- **`version`** *[Optional, Not Repeatable, String]*
+The version of the metadata (not the version of the table). + + + +```r +my_table = list( + # ... , + metadata_information = list( + idno = "META_TBL_POP_PC2001_02-01", + producers = list( + list(name = "John Doe", + affiliation = "National Data Center of Popstan") + ), + production_date = "2020-12-27", + version = "version 1.0" + ), + # ... +) +``` + + +### Table description + +**`table_description`** *[Required, Not Repeatable]*
+This section contains the metadata elements that describe the table itself. Not all elements will be required to fully document a table, but efforts should be made to provide as much and as detailed information as possible, as richer metadata will make the table more discoverable. +
+```json +"table_description": { + "title_statement": {}, + "identifiers": [], + "authoring_entity": [], + "contributors": [], + "publisher": [], + "date_created": "string", + "date_published": "string", + "date_modified": "string", + "version": "string", + "description": "string", + "table_columns": [], + "table_rows": [], + "table_footnotes": [], + "table_series": [], + "statistics": [], + "unit_observation": [], + "data_sources": [], + "time_periods": [], + "universe": [], + "ref_country": [], + "geographic_units": [], + "geographic_granularity": "string", + "bbox": [], + "languages": [], + "links": [], + "api_documentation": [], + "publications": [], + "keywords": [], + "themes": [], + "topics": [], + "disciplines": [], + "definitions": [], + "classifications": [], + "rights": "string", + "license": [], + "citation": "string", + "confidentiality": "string", + "sdc": "string", + "contacts": [], + "notes": [], + "relations": [] + } +``` +
+ +- **`title_statement`** *[Required, Not Repeatable]*
+The `title_statement` is a group of elements used to identify the table. It includes the table's unique identifier, its number, and its title(s).
+
+```json +"title_statement": { + "idno": "string", + "table_number": "string", + "title": "string", + "sub_title": "string", + "alternate_title": "string", + "translated_title": "string" +} +``` +
+ + - **`idno`** *[Required, Not Repeatable, String]*
+ A unique identifier for the table. Do not include spaces in the `idno`. This identifier must be unique to the catalog in which the table will be published. Some organizations have their own system to assign unique identifiers to tables. Ideally, an identifier that guarantees global uniqueness will be used, such as a Digital Object Identifier (DOI) or an ISBN. Note that a table may have more than one identifier. In such a case, the element `idno` (a non-repeatable element) will contain the main identifier (the one selected as the "reference" identifier by the catalog administrator). The other identifiers will be provided in the element `identifiers` (see below).
+ - **`table_number`** *[Optional, Not Repeatable, String]*
+ The table number. The table number will usually begin with the word “Table” followed by a numeric identifier, such as Table 1 or Table 2.1. Different publications may use different ways to reference a table. This is particularly the case for publications that are part of a standard survey program and have well-defined table templates. The following are different ways to number a table:
+
+ | Type | Description |
+ | --------------- | ------------------------------------- |
+ | Sequential | A sequential number given to each table produced and appearing within the publication (e.g., Table 1, Table 2, to Table n). |
+ | Thematic | A numbering scheme based on the theme and a sequential number. |
+ | Chapter | Tables are numbered according to the chapter and then a sequential reference within that chapter, such as Table 1.1 or Table 3.5. |
+ | Annex | Tables in an annex will usually be given a letter referring to the annex and a sequential number, such as Table A.1 or Table B.3. |
+ | Note | A table number is usually set apart from the title with a colon. The word “Table” should never be abbreviated. |
+
+ - **`title`** *[Required, Not Repeatable, String]*
+ The title of the table. The title provides a brief description of the content of the table. It should be concise and include the key elements shown in the table. There are varying styles for writing a table title. A consistent style should be applied to all tables published in a catalog.
+ + - **`sub_title`** *[Optional, Not Repeatable, String]*
+ A subtitle can provide further descriptive or explanatory content to the table. + + - **`alternate_title`** *[Optional, Not Repeatable, String]*
+ An alternate title for the table. + + - **`translated_title`** *[Optional, Not Repeatable, String]*
+ A translation of the title.
+ + + + ```r + my_table = list( + # ... + table_description = list( + title_statement = list( + idno = "EXAMPLE_TBL_001", + table_number = "Table 1.0", + title = "Resident population by age group, sex, and area of residence, 2020", + sub_title = "District of X, as of June 30", + translated_title = "Population résidente par groupe d'âge, sexe et zone de résidence, 2020 (district X, au 30 juin)" + ), + # ... + ) + ) + ``` +
+ + +- **`identifiers`** *[Optional ; Repeatable]*
+This element is used to enter table identifiers other than the catalog identifier entered in the `title_statement` (`idno`). It can for example be a Digital Object Identifier (DOI). The identifier entered in the `title_statement` can be repeated here (the `title_statement` does not provide a `type` parameter; if a DOI or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`).
+
+```json +"identifiers": [ + { + "type": "string", + "identifier": "string" + } +] +``` +
+ + - **`type`** *[Optional, Not Repeatable, String]*
+ The type of unique ID, e.g. "DOI". + - **`identifier`** *[Required, Not Repeatable, String]*
+ The identifier itself.
+
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ identifiers = list(
+ list(type = "DOI",
+ identifier = "XXX.XXX.XXXX")
+ ),
+ # ...
+ )
+ )
+ ```
+
+ + +- **`authoring_entity`** *[Optional, Not Repeatable]*
+The authoring entity identifies the person(s) or organization(s) responsible for the production of the table. An authoring entity is identified by its name, affiliation, abbreviation, URI, and author's identifiers (if any). +
+```json +"authoring_entity": [ + { + "name": "string", + "affiliation": "string", + "abbreviation": "string", + "uri": "string", + "author_id": [ + { + "type": null, + "id": null + } + ] + } +] +``` +
+ + - **`name`** *[Optional, Not Repeatable, String]*
+ The name of person(s) or organization responsible for the production and content of the table. + - **`affiliation`** *[Optional, Not Repeatable, String]*
+ The affiliation of the person(s) or organization(s) mentioned in `name`. + - **`abbreviation`** *[Optional, Not Repeatable, String]*
+ The abbreviation (acronym) of the organization mentioned in `name`. + - **`uri`** *[Optional, Not Repeatable, String]*
+ The URI can be a link to the website, or the email address, of the authoring entity mentioned in `name`.
+ - **`author_id`** *[Optional ; Repeatable]*
+ The author identifier in a registry of academic researchers such as the [Open Researcher and Contributor ID (ORCID)](https://orcid.org/).
+ - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of identifier, i.e. the identification of the registry that assigned the author's identifier, e.g. "ORCID".
+ - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the author in the registry mentioned in `type`.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + authoring_entity = list( + name = "John Doe", + affiliation = "National Research Center, Popstan", + abbreviation = "NRC", + uri = "www. ...", + author_id = list( + list(type = "ORCID", id = "XYZ123") + ) + ), + # ... + ) + ) + ``` +
+ + +- **`contributors`** *[Optional, Repeatable]*
+This set of elements identifies the person(s) and/or organization(s), other than the authoring entity, who contributed to the production of the table.
+
+```json +"contributors": [ + { + "name": "string", + "affiliation": "string", + "abbreviation": "string", + "role": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional, Not Repeatable, String]*
+ The name of the contributor (person or organization). + - **`affiliation`** *[Optional, Not Repeatable, String]*
+ The affiliation of the contributor mentioned in `name`. This could be a government agency, a university or a department in a university, etc. + - **`abbreviation`** *[Optional, Not Repeatable, String]*
+ The abbreviation for the institution which has been listed as the affiliation of the contributor. + - **`role`** *[Optional, Not Repeatable, String]*
+ The specific role of the contributor mentioned in `name`. This could for example be "Research assistant", "Technical specialist", "Programmer", or "Reviewer".
+ - **`uri`** *[Optional, Not Repeatable, String]*
+ A URI (link to a website, or email address) for the contributor mentioned in `name`.
+
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ contributors = list(
+ list(name = "John Doe",
+ affiliation = "National Research Center",
+ abbreviation = "NRC",
+ role = "Research assistant; Stata programming",
+ uri = "www. ...")
+ ),
+ # ...
+ )
+ )
+ ```
+
+ + +- **`publisher`** *[Optional, Not repeatable]*
+The entity responsible for publishing the table. +
+```json +"publisher": [ + { + "name": "string", + "affiliation": "string", + "abbreviation": "string", + "role": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional, Not Repeatable, String]*
+ The name of the publisher (person or organization). + - **`affiliation`** *[Optional, Not Repeatable, String]*
+ The affiliation of the publisher. This could be a government agency, a university or a department in a university, etc. + - **`abbreviation`** *[Optional, Not Repeatable, String]*
+ The abbreviation for the institution which has been listed as the affiliation of the publisher. + - **`role`** *[Optional, Not Repeatable, String]*
+ The specific role of the publisher (this element is unlikely to be used as the role is obvious). + - **`uri`** *[Optional, Not Repeatable, String]*
+ A URI (link to a website, or email address) of the publisher.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + publisher = list( + name = "National Statistics Office, Publishing Department", + affiliation = "Ministry of Planning, National Statistics Office", + abbreviation = "NSO", + uri = "www. ..." + ), + # ... + ) + ) + ``` +
+ + +- **`date_created`** *[Optional, Not Repeatable, String]*
+The date the table was created. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the table is created refers to the date that the output was produced and considered ready for publication. + + +- **`date_published`** *[Optional, Not Repeatable, String]*
+The date the table was published. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). If the table is contained in a document (report, paper, book, etc.), the date the table is published is associated with the publication date of that document. If the table is found in a statistics yearbook for example, then the publication date will be the date the yearbook was published. + + +- **`date_modified`** *[Optional, Not Repeatable, String]*
+The date the table was last modified. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). Modifications, revisions, or re-publications of the table are recorded in this element. + + +- **`version`** *[Optional, Not Repeatable, String]*
+The version of the table refers to the published version of the table. If for some reason, data in a published table are revised, then the version of the table is captured in this element. + + +- **`description`** *[Optional, Not Repeatable, String]*
+A brief "narrative" description of the table. The description can contain information on the content, purpose, production process, or other relevant information. + + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + date_created = "2020-06-15", + date_published = "2020-10-30", + version = "Version 1.0", + description = "The table is part of a series of tables extracted from the Population Census 2020 dataset. It presents counts of resident population by type of disability, sex, and age group, by province and at the national level. The data were collected in compliance with questions from the Washington Group.", + # ... + ) + ) + ``` +
+ + +- **`table_columns`** *[Optional, Repeatable]*
+The columns description is composed of the column spanners and the column heads. Column spanners group the column heads together in a logical fashion to present the data to the user. Not all columns presented in a table will have a column spanner. The column spanners can become quite complicated; when a table is documented, the information found in the column spanners and heads can be merged and edited. What matters is not to document the exact structure of the table, but to ensure that the text of the spanners and heads is included in the metadata, as this text will be used by search engines to find tables in data catalogs.
+
+```json +"table_columns": [ + { + "label": "string", + "var_name": "string", + "dataset": "string" + } +] +``` +
+ + - **`label`** *[Required, Not Repeatable, String]*
+ The labels of the table columns (or *column captions*) are vital for discoverability. The column labels will include both column spanners and column headers. Column spanners are captions that join various column headers together.
+ - **`var_name`** *[Optional, Not Repeatable, String]*
+ This refers to the name of the variables found in the dataset (typically microdata) used to produce the table. The objective of this optional field is to help establish a link between the source dataset and the table, to foster reproducibility. + - **`dataset`** *[Optional, Not Repeatable, String]*
+ This refers to the dataset (typically microdata) used to produce the table. If the dataset is available in a catalog and has a unique identifier (DOI or other), this identifier can be entered here. Alternatively, the title of the dataset or a permanent URI can be provided.

+ + The column captions of the following table can be documented in the following manner: + + ![](./images/table_example_05.JPG){width=100%} +
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ table_columns = list(
+
+ list(label = "Area of residence: National (total)",
+ var_name = "urbrur", dataset = "pop_census_2020_v01"),
+
+ list(label = "Area of residence: Urban",
+ var_name = "urbrur", dataset = "pop_census_2020_v01"),
+
+ list(label = "Area of residence: Rural",
+ var_name = "urbrur", dataset = "pop_census_2020_v01"),
+
+ list(label = "Sex: total",
+ var_name = "sex", dataset = "pop_census_2020_v01"),
+
+ list(label = "Sex: male",
+ var_name = "sex", dataset = "pop_census_2020_v01"),
+
+ list(label = "Sex: female",
+ var_name = "sex", dataset = "pop_census_2020_v01")
+
+ ),
+ # ...
+ )
+ )
+ ```
+
+ + Or, in a more concise but also valid version: + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + table_columns = list( + + list(label = "Area of residence: national (total) / urban / rural", + var_name = "urbrur", dataset = "pop_census_2020_v01"), + + list(label = "Sex: total / male / female", + var_name = "sex", dataset = "pop_census_2020_v01") + + ), + # ... + ) + ) + ``` +
+
+
+- **`table_rows`** *[Optional, Repeatable]*
+Like the column spanner and column heads, the `table_rows` section is composed of the stub head and stubs (*row captions*). The stubs are the captions of the rows of data and the stub head is the label that groups the rows together in a logical fashion. As for `table_columns`, the information found in the stubs can be merged and edited to be optimized for clarity and discoverability. +
+```json +"table_rows": [ + { + "label": "string", + "var_name": "string", + "dataset": "string" + } +] +``` +
+ + - **`label`** *[Required, Not Repeatable, String]*
+ As with the column labels, the content of this `label` element is designed to include the stub head, stubs, and any captions included.
+ - **`var_name`** *[Optional, Not Repeatable, String]*
+ As with the column variables, this optional element is reserved to identify those variables found in the source dataset that are associated with the row of data. + - **`dataset`** *[Optional, Not Repeatable, String]*
+ This refers to the dataset (typically microdata) used to produce the table. If the dataset is available in a catalog and has a unique identifier (DOI or other), this identifier can be entered here. Alternatively, the title of the dataset or a permanent URI can be provided. Note also that the schema provides a `data_sources` element (see below) to describe in more detail the sources of data. The content of the `dataset` element must be compatible with the information provided in that other element.

+ + Example using the same table as for `table_columns`: + + + ```r + my_table = list( + # ... , + + table_description = list( + # ... , + + table_rows = list( + + list(label = "Age group; 0-4 years", + var_name = "age", dataset = "pop_census_2020_v01"), + + list(label = "Age group; 5-9 years", + var_name = "age", + dataset = "pop_census_2020_v01"), + + list(label = "Age group; 10-14 years", + var_name = "age", + dataset = "pop_census_2020_v01"), + + list(label = "Age group; 15-19 years", + var_name = "age", + dataset = "pop_census_2020_v01") + + ), + # ... + ) + ) + ``` + + The same information can be provided in a more concise version as follows: + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + table_rows = list( + list(label = "Age group; 0-4 years, 5-9 years, 10-14 years, 15-19 years", + var_name = "age", + dataset = "pop_census_2020_v01") + ), + # ... + ) + ) + ``` +
+ +- **`table_footnotes`** *[Optional, Repeatable]*
+Footnotes provide additional clarity. They may for example be used to ensure that the user is aware of conditions and exceptions that may apply to a table. Footnotes may include statements of missing data, imputation of data, or other content that is not included in the body of the publication.
+
+```json +"table_footnotes": [ + { + "number": "string", + "text": "string" + } +] +``` +
+ + - **`number`** *[Optional, Not Repeatable, String]*
+ Footnotes are usually numbered sequentially, starting at 1 for each table.
+ - **`text`** *[Required, Not Repeatable, String]*
+ The text of the footnote.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + table_footnotes = list( + + list(number = "1", + text = "Data refer to the resident population only."), + + list(number = "2", + text = "Figures for the district of X have been imputed.") + + ), + + # ... + ) + ) + ``` +
+ + +- **`table_series`** *[Optional, Repeatable]*
+Tables may be organized into series, typically by theme.
+
+```json +"table_series": [ + { + "name": "string", + "maintainer": "string", + "uri": "string", + "description": "string" + } +] +``` +
+ + - **`name`** *[Optional, Not Repeatable, String]*
+ The name (label) of the series. + - **`maintainer`** *[Optional, Not Repeatable, String]*
+ The person or organization in charge of maintaining the series. This will often be the same person/organization that produces and publishes the table. This optional element will often be ignored.
+ - **`uri`** *[Optional, Not Repeatable, String]*
+ A URI to the series information. This optional element will often be ignored. + - **`description`** *[Optional, Not Repeatable, String]*
+ A description of the series.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + table_series = list( + + list(name = "Population Census - Age distribution", + description = "Series 1 - Tables on demographic composition of the population") + + ), + + # ... + ) + ) + ``` +
+ + +- **`statistics`** *[Optional, Repeatable]*
+The table metadata will not contain data. What the `statistics` element refers to is the type of statistics included in the table. Some tables may only contain counts, such as a table of population by age group and sex (which shows counts of persons; other tables could be counts of households, facilities, or any other observation unit). But statistical tables can contain many other types of summary statistics. This element is used to list these types of statistics. +
+```json +"statistics": [ + { + "value": "string" + } +] +``` +
+ + - **`value`** *[Required, Not Repeatable, String]*
+
+ The use of a controlled vocabulary is recommended. This list could contain (but does not have to be limited to):
+
+ - Count (frequencies)
+ - Number of missing values
+ - Mean (average)
+ - Median
+ - Mode
+ - Minimum value
+ - Maximum value
+ - Range
+ - Standard deviation
+ - Variance
+ - Confidence interval (95%) - Lower limit
+ - Confidence interval (95%) - Upper limit
+ - Standard error
+ - Sum
+ - Inter-quartile Range (IQR)
+ - Percentile (possibly with specification, e.g., "10th percentile")
+ - Mean Absolute Deviation
+ - Mean Absolute Deviation from the Median (MADM)
+ - Coefficient of Variation (COV)
+ - Coefficient of Dispersion (COD)
+ - Skewness
+ - Kurtosis
+ - Entropy
+ - Regression coefficient
+ - R-squared
+ - Adjusted R-squared
+ - Z-score
+ - Accuracy
+ - Precision
+ - Mean squared logarithmic error (MSLE)

+
+
+ Example in R for a table showing the distribution of the population by age group and sex, and the mean age by sex:
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ statistics = list(
+ list(value = "count"),
+ list(value = "mean")
+ ),
+ # ...
+ )
+ )
+ ```
+
+ + +- **`unit_observation`** *[Optional, Repeatable]*
+This element provides information on the unit(s) of observation that correspond to the values shown in the table.
+
+```json +"unit_observation": [ + { + "value": "string" + } +] +``` +
+
+
+ - **`value`** *[Required, Not repeatable, String]*
+ The `value` is not a numeric value; it is the label (description) of the observation unit, e.g., "individual" or "person", "household", "dwelling", "enterprise", "country", etc.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + unit_observation = list( + list(value = "individual") + ), + # ... + ) + ) + ``` +
+ + +- **`data_sources`** *[Optional, Repeatable]*
+The data sources are often cited in the footnote section of a table. The `name`, `source_id`, and `uri` elements are optional, but at least one of them must be provided.
+
+```json +"data_sources": [ + { + "name": "string", + "abbreviation": "string", + "source_id": "string", + "note": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional, Not repeatable, String]*
+ The name (title) of the data source. For example, the data in a table may be extracted from the "Population Census 2020".
+ - **`abbreviation`** *[Optional, Not repeatable, String]*
+ The abbreviation (acronym) of the data source. + - **`source_id`** *[Optional, Not repeatable, String]* + A unique identifier for the source, such as a Digital Object Identifier (DOI). + - **`note`** *[Optional, Not repeatable, String]*
+ A note that describes how the source was used, possibly mentioning issues in the use of the source. + - **`uri`** *[Optional, Not repeatable, String]*
+ A link (URL) to the source dataset.
+
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ data_sources = list(
+ list(name = "Population and Housing Census 2020",
+ abbreviation = "PHC 2020",
+ source_id = "ABC_PHC_2020_PUF"
+ )
+ ),
+ # ...
+
+ )
+ )
+ ```
+
+ + +- **`time_periods`** *[Optional, Repeatable]*
+The time periods consist of a list of periods (ranges of years / quarters / months / days) that the data relate to, preferably entered in ISO 8601 format (YYYY, or YYYY-MM, or YYYY-MM-DD). If the data are by quarter, convert them into ISO 8601 format (e.g., the first quarter of 2020 would be "from 2020-01 to 2020-03"). This is a repeatable field. If the time periods are for example 1990, 2000 to 2004, and 2014 to June 2019, do not enter the time period as a single range 1990-2019, as this would include irrelevant periods. It should be entered as three separate ranges, as in the example below. For data that relate to a specific date (for example, the population of a country as of the census day), enter the date in both the `from` and `to` fields.
+
+```json +"time_periods": [ + { + "from": "string", + "to": "string" + } +] +``` +
+ + - **`from`** *[Required, Not repeatable, String]*
+ The start date of the time period covered by the table, preferably entered in ISO 8601 format (YYYY, or YYYY-MM, or YYYY-MM-DD). + - **`to`** *[Required, Not repeatable, String]*
+ The end date of the time period covered by the table, preferably entered in ISO 8601 format (YYYY, or YYYY-MM, or YYYY-MM-DD). + + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + time_periods = list( + list(from = "1990", to = "1990"), + list(from = "2000", to = "2004"), + list(from = "2014", to = "2019-06") + ), + + # ... + ) + ) + ``` +
+ + +- **`universe`** *[Optional, Repeatable]*
+The universe of a table refers to the population (or *respondents*) covered in the data. It does not have to be a population of individuals; it can for example be a population of households, facilities, firms, groups of persons, or even objects. The description of the universe should clearly inform the data users of inclusions and exclusions that they may not expect. +
+```json +"universe": [ + { + "value": "string" + } +] +``` +
+ + - **`value`** *[Required, Not repeatable, String]*
+ A textual description of the universe covered by the data. + + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + universe = list( + list(value = "Resident male population aged 0 to 6 years; this excludes visitors and people present in the country under a diplomatic status. + Nomadic and homeless populations are included.") + ), + # ... + ) + ) + ``` +
+ + +- **`ref_country`** *[Optional, Repeatable]*
+This element is used to document the list of countries for which data are in the table. It serves to ensure that the country names and codes are easily discoverable, and it allows the table to contribute to a virtual national catalog. If the table only refers to part of a country (for example a city), the `ref_country` field should still be filled. Another element, `geographic_units` (see below), is provided to capture more detailed information on the table's geographic coverage.
+
+```json +"ref_country": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ The name of a country for which data are in the table. + - **`code`** *[Required, Not repeatable, String]*
+ The code of the country mentioned in `name`, preferably an [ISO 3166 country code](https://en.wikipedia.org/wiki/ISO_3166-1).

+ + +- **`geographic_units`** *[Optional, Repeatable]*
+An itemized list of geographic areas covered by the data in the table, other than the country/countries that must be entered in `ref_country`. +
+```json +"geographic_units": [ + { + "name": "string", + "code": "string", + "type": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ The name of the geographic unit. + - **`code`** *[Optional, Not repeatable, String]*
+ The code of the geographic unit mentioned in `name`. + - **`type`** *[Optional, Not repeatable, String]*
+ The type of geographic unit mentioned in `name` (e.g., "State", "Province", "Town", "Region", etc.)
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + ref_country = list( + list(name = "Malawi", code = "MWI") + ), + + geographic_units = list( + list(name = "Northern", type = "region"), + list(name = "Central", type = "region"), + list(name = "Southern", type = "region"), + list(name = "Lilongwe", type = "town"), + list(name = "Mzuzu", type = "town"), + list(name = "Blantyre", type = "town") + ), + + # ... + ) + ) + ``` +
+ +- **`geographic_granularity`** *[Optional, Not repeatable, String]*
+A description of the geographic levels for which data are presented in the table. This is not a list of specific geographic areas, but a list of the administrative level(s) that correspond to these geographic areas.
+ + Example for a table showing the population of a country by State, district, and sub-district (+ total) + + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + ref_country = list( + list(name = "India", code = "IND") + ), + + geographic_granularity = "national, state (admin 1), district (admin 2), sub-district (admin 3)", + + # ... + ) + ) + ``` +
+ + +- **`bbox`** *[Optional ; Repeatable]*
+Bounding boxes are typically used for geographic datasets to indicate the geographic coverage of the data, but can be provided for tables as well, although this will rarely be done. A geographic bounding box defines a rectangular geographic area. +
+```json +"bbox": [ + { + "west": "string", + "east": "string", + "south": "string", + "north": "string" + } +] +``` +
+ + - **`west`** *[Required ; Not repeatable ; String]*
+ Western geographic parameter of the bounding box. + - **`east`** *[Required ; Not repeatable ; String]*
+ Eastern geographic parameter of the bounding box. + - **`south`** *[Required ; Not repeatable ; String]*
+ Southern geographic parameter of the bounding box. + - **`north`** *[Required ; Not repeatable ; String]*
+ Northern geographic parameter of the bounding box.
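+
+ The example below is an illustrative sketch of how a bounding box could be entered (the coordinates shown are fictitious):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+
+ bbox = list(
+ list(west = "-17.54", east = "-11.36", south = "12.30", north = "16.69")
+ ),
+
+ # ...
+ )
+ )
+ ```
+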

+ + +- **`languages`** *[Optional, Repeatable]*
+Most tables will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed. +
+```json +"languages": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ The name of the language(s) in which the table is published, preferably extracted from the [ISO 639 standardized nomenclature of languages](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). + - **`code`** *[Optional, Not repeatable, String]*
+ The code of the language, preferably from the [ISO 639 code list](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). + + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + languages = list( + list(name = "English", code = "EN"), + list(name = "French", code = "FR") + ), + + # ... + ) + ) + ``` +
+ + +- **`links`** *[Optional, Repeatable]*
+A list of associated links related to the table. +
+```json +"links": [ + { + "uri": "string", + "description": "string" + } +] +``` +
+ + - **`uri`** *[Required, Not repeatable, String]*
+ The URI to an external resource. + - **`description`** *[Optional, Not repeatable, String]*
+ A brief description of the resource.

+
+
+ For example, for a table extracted from the Gambia Demographic and Health Survey 2019/2020 Report, the links could be the following:
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+
+ links = list(
+
+ list(uri = "https://dhsprogram.com/pubs/pdf/FR369/FR369.pdf",
+ description = "The Gambia, Demographic and Health Survey 2019/2020 Report"),
+
+ list(uri = "https://dhsprogram.com/data/available-datasets.cfm",
+ description = "DHS microdata for The Gambia")
+
+ ),
+
+ # ...
+ )
+ )
+ ```
+
+ + +- **`api_documentation`** *[Optional ; Repeatable]*
+Increasingly, data are made accessible via Application Programming Interfaces (APIs). The API associated with a table should be documented.
+
+```json +"api_documentation": [ + { + "description": "string", + "uri": "string" + } +] +``` +
+ + - **`description`** *[Optional ; Not repeatable ; String]*
+ This element will not contain the API documentation itself, but information on what documentation is available. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the API documentation.
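+
+ The example below is an illustrative sketch; the URL shown is fictitious:
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+
+ api_documentation = list(
+ list(description = "Documentation of the API that provides access to the table data",
+ uri = "https://example.org/api/documentation")
+ ),
+
+ # ...
+ )
+ )
+ ```
+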

+ + +- **`publications`** *[Optional, Repeatable]*
+This element identifies the publication(s) where the table is published. This could for example be a Statistics Yearbook, a report, a paper, etc.
+
+```json +"publications": [ + { + "title": "string", + "uri": "string" + } +] +``` +
+ + - **`title`** *[Required, Not repeatable, String]*
+ The title of the publication (including the producer and the year). + - **`uri`** *[Optional, Not repeatable, String]*
+ A link to the publication. + + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + publications = list( + list(title = "United Nations Statistical Yearbook, Fifty-second issue, May 2023", + uri = "https://www.un-ilibrary.org/content/books/9789210557566") + ), + + # ... + ) + ) + ``` +
+ + +- **`keywords`** *[Optional ; Repeatable]*
+ +
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of keywords that provide information on the core content of the table. Keywords provide a convenient solution to improve the discoverability of the table, as it allows terms and phrases not found in the table itself to be indexed and to make a table discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the [UNESCO Thesaurus](http://vocabularies.unesco.org/browser/thesaurus/en/). The list provided here can combine keywords from multiple controlled vocabularies, and user-defined keywords. + + - **`name`** *[Required ; Not repeatable ; String]*
+ The keyword itself. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The controlled vocabulary (including version number or date) from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the controlled vocabulary from which the keyword is extracted, if any.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + keywords = list( + list(name = "Migration", vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + list(name = "Migrants", vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + list(name = "Refugee", vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + list(name = "Forced displacement"), + list(name = "Forcibly displaced") + ), + + # ... + ), + # ... + ) + ``` +
+ + +- **`themes`** *[Optional ; Repeatable]*
+ +
+```json +"themes": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of themes covered by the table. A controlled vocabulary will preferably be used. Note that `themes` will rarely be used as the elements `topics` and `disciplines` are more appropriate for most uses. This is a block of five fields: + - **`id`** *[Optional ; Not repeatable ; String]*
+ The ID of the theme, taken from a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the theme, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.
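+
+ If the element is used, it could be filled as in the sketch below (the theme and vocabulary shown are illustrative, not taken from an actual controlled vocabulary):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+
+ themes = list(
+ list(id = "2", name = "Population and demography",
+ vocabulary = "Agency-specific list of themes")
+ ),
+
+ # ...
+ )
+ )
+ ```
+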

+ + +- **`topics`** *[Optional ; Repeatable]*
+ +
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + Information on the topics covered in the table. A controlled vocabulary will preferably be used, for example the [CESSDA Topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification), a typology of topics available in 11 languages; or the [Journal of Economic Literature (JEL) Classification System](https://en.wikipedia.org/wiki/JEL_classification_codes), or the [World Bank topics classification](https://documents.worldbank.org/en/publication/documents-reports/docadvancesearch). Note that you may use more than one controlled vocabulary. + + This element is a block of five fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the topic, taken from a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the topic, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.
+ + + + ```r + my_table <- list( + # ... , + table_description = list( + # ... , + + topics = list( + list(name = "Demography.Migration", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(name = "Demography.Censuses", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "F22", + name = "International Migration", + parent_id = "F2 - International Factor Movements and International Business", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + list(id = "O15", + name = "Human Resources - Human Development - Income Distribution - Migration", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J") + ), + + # ... + + ), + + ) + ``` +
+ + +- **`disciplines`** *[Optional ; Repeatable]*
+ +
+```json +"disciplines": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + Information on the academic disciplines related to the content of the table. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in [Wikipedia](https://en.wikipedia.org/wiki/List_of_academic_fields). +This is a block of five elements: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the discipline, taken from a controlled vocabulary. + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the discipline, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the discipline (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.
+ + + + ```r + my_table <- list( + # ... , + table_description = list( + # ... , + + disciplines = list( + + list(name = "Economics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"), + + list(name = "Agricultural economics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"), + + list(name = "Econometrics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields") + + ), + + # ... + ), + # ... + ) + ``` +
+ + +- **`definitions`** *[Optional, Repeatable]*
+Definitions or concepts covered by the table. +
+```json +"definitions": [ + { + "name": "string", + "definition": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ The name (or label) of the term, indicator, or concept being defined. + - **`definition`** *[Required, Not repeatable, String]*
+ The definition of the term, indicator, or concept. + - **`uri`** *[Optional, Not repeatable, String]*
+ A link to the source of the definition, or to a site providing a more detailed definition.

+
+
+ Example for a table on malnutrition that would include estimates of stunting and wasting prevalence:
+
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ definitions = list(
+
+ list(name = "stunting",
+ definition = "Prevalence of stunting is the percentage of children under age 5 whose height for age is more than two standard deviations below the median for the international reference population ages 0-59 months. For children up to two years old height is measured by recumbent length. For older children height is measured by stature while standing. The data are based on the WHO's new child growth standards released in 2006.",
+ uri = "https://data.worldbank.org/indicator/SH.STA.STNT.ZS?locations=1W"),
+
+ list(name = "wasting",
+ definition = "Prevalence of wasting, male, is the proportion of boys under age 5 whose weight for height is more than two standard deviations below the median for the international reference population ages 0-59.",
+ uri = "https://data.worldbank.org/indicator/SH.STA.WAST.MA.ZS?locations=1W")
+
+ ),
+ # ...
+ )
+ )
+ ```
+
+ + +- **`classifications`** *[Optional, Repeatable]*
+The element is used to document the use of standard classifications (or "ontologies", or "taxonomies") in the table. +
+```json +"classifications": [ + { + "name": "string", + "version": "string", + "organization": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ Name (label) of the classification, ontology, or taxonomy. + - **`version`** *[Optional, Not repeatable, String]*
+ Version of the classification, ontology, or taxonomy used in the table. + - **`organization`** *[Optional, Not repeatable, String]*
+ Organization that is the custodian of the classification, ontology, or taxonomy. + - **`uri`** *[Optional, Not repeatable, String]*
+ Link to an external resource where detailed information on the classification, ontology, or taxonomy can be obtained.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + classifications = list( + + list(name = "International Standard Classification of Occupations (ISCO)", + version = "ISCO-08", + organization = "International Labour Organization (ILO)", + uri = "https://www.ilo.org/public/english/bureau/stat/isco/") + + ), + # ... + + ) + + ) + ``` +
+ + +- **`rights`** *[Optional, Not repeatable, String]*
+Information on the rights or copyright that applies to the table. +
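+
+ For example (the statement below is illustrative):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ rights = "(c) 2020 National Statistics Office of Popstan. All rights reserved.",
+ # ...
+ )
+ )
+ ```
+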
+ + +- **`license`** *[Optional, Repeatable]*
+A table may require a license for its use or reproduction. This is done to protect the intellectual content of the research product. The licensing entity may be different from the researcher or the publisher. It is the entity that has the intellectual rights to the table(s) and that grants rights or restrictions on the reuse of the table.
+
+```json +"license": [ + { + "name": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ The name of the license.
+ - **`uri`** *[Optional, Not repeatable, String]*
+ A link to a publicly-accessible description of the terms of the license.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + license = list( + list(name = "Attribution 4.0 International (CC BY 4.0)", + uri = "https://creativecommons.org/licenses/by/4.0/") + ), + + # ... + ) + + ) + ``` +
+ + +- **`citation`** *[Optional, Not repeatable, String]*
+A citation requirement for the table (i.e. an indication of how the table should be cited in publications). +
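+
+ For example (the citation shown is illustrative):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ citation = "National Statistics Office of Popstan. 2020. Resident population by age group, sex, and area of residence, 2020. Statistical Yearbook 2020, Table 1.0.",
+ # ...
+ )
+ )
+ ```
+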
+ + +- **`confidentiality`** *[Optional, Not repeatable, String]*
+A published table may be protected through a confidentiality agreement between the publisher and the researcher. Such an agreement may also determine certain rights regarding the use of the research and of the data presented in the table. The data may also present confidential information that is produced for selective audiences. This element is used to provide a statement on any limitations or restrictions on use of the table based on confidential data or agreements.
+
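+
+ For example (the statement below is illustrative):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ confidentiality = "The table only contains aggregated values; no information on individual respondents is disclosed.",
+ # ...
+ )
+ )
+ ```
+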
+ + +- **`sdc`** *[Optional, Not repeatable, String]*
+Information on statistical disclosure control measures applied to the table. This can include cell suppression or other techniques. Specialized packages have been developed for this purpose, such as [*sdcTable: Methods for Statistical Disclosure Control in Tabular Data*](https://cran.r-project.org/web/packages/sdcTable/index.html) (see also the package [reference manual](https://cran.r-project.org/web/packages/sdcTable/sdcTable.pdf)).
+The information provided here should be such that it does not provide intruders with useful information for reverse-engineering the protection measures applied to the table. +
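+
+ For example (the statement below is illustrative):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ sdc = "Cells corresponding to fewer than 5 respondents have been suppressed.",
+ # ...
+ )
+ )
+ ```
+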
+ + +- **`contacts`** *[Optional, Repeatable]*
+Users of the data may need further clarification or information. This section identifies one or more contact persons or units (by name, affiliation, email, and URI) who can serve as resource persons for problems or questions raised by the user community. The `uri` element should be used to indicate a URN or URL for the homepage of the contact, and the `email` element to indicate an email address. It is recommended to avoid using the actual names of individuals: the information provided here should remain valid over the long term, so it is preferable to identify contacts by a title or function. The same applies to the email field; ideally, a "generic" email address should be provided (a mail server can easily be configured so that all messages sent to the generic address are automatically forwarded to the relevant staff members).
+
+```json +"contacts": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "email": "string", + "telephone": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ Name of a person or unit (such as a data help desk). It will usually be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position. + - **`role`** *[Optional, Not repeatable, String]*
+ The specific role of `name` with regard to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact.
+ - **`affiliation`** *[Optional, Not repeatable, String]*
+ Affiliation of the person/unit. + - **`email`** *[Optional, Not repeatable, String]*
+ E-mail address of the person. + - **`telephone`** *[Optional, Not repeatable, String]*
+ A phone number that can be called to obtain information or provide feedback on the table. This should never be a personal phone number; a corporate number (typically of a data help desk) should be provided. + - **`uri`** *[Optional, Not repeatable, String]*
+ A link to a website where contact information for `name` can be found.
+ + + + ```r + my_table = list( + # ... , + table_description = list( + # ... , + + contacts = list( + + list(name = "Data helpdesk", + role = "Support to data users", + affiliation = "National Statistics Office", + email = "data_helpdesk@ ...") + + ) + ) + ) + ``` +
+ + +- **`notes`** *[Optional, Repeatable]*
+The notes provide a space for observations or open-ended content that may be material to understanding the table and that has not been captured in other elements of the schema.
+
+```json +"notes": [ + { + "note": "string" + } +] +``` +
+ + - **`note`** *[Required, Not repeatable, String]*
+ The note itself.
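+
+ For example (the note shown is illustrative):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ notes = list(
+ list(note = "The 2020 figures are provisional and may be revised in a future edition of the yearbook.")
+ ),
+ # ...
+ )
+ )
+ ```
+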

+ + +- **`relations`** *[Optional ; Repeatable]*
+If the table has a relation to other resources (e.g., it is a subset of another resource, or a translation of another resource), the relation(s) and associated resources can be listed in this element. +
+```json +"relations": [ + { + "name": "string", + "type": "isPartOf" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The related resource. Recommended practice is to identify the related resource by means of a URI. If this is not possible or feasible, a string conforming to a formal identification system may be provided. + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of relationship. The use of a controlled vocabulary is recommended. The Dublin Core proposes the following vocabulary: `isPartOf, hasPart, isVersionOf, isFormatOf, hasFormat, references, isReferencedBy, isBasedOn, isBasisFor, replaces, isReplacedBy, requires, isRequiredBy`.

+
+ | Type | Description |
+ | ------------------------| ------------------------------------------------------------ |
+ | isPartOf | The described resource is a physical or logical part of the referenced resource. |
+ | hasPart | The described resource includes the referenced resource either physically or logically. |
+ | isVersionOf | The described resource is a version, edition, or adaptation of the referenced resource. A change in version implies substantive changes in content rather than differences in format.|
+ | isFormatOf | The described resource is essentially the same intellectual content as the referenced resource, presented in another format.|
+ | hasFormat | The described resource pre-existed the referenced resource, which is essentially the same intellectual content presented in another format.|
+ | references | The described resource references, cites, or otherwise points to the referenced resource.|
+ | isReferencedBy | The described resource is referenced, cited, or otherwise pointed to by the referenced resource.|
+ | isBasedOn | The described resource is derived from or based on the referenced resource.|
+ | isBasisFor | The described resource is the basis for the referenced resource.|
+ | replaces | The described resource supplants, displaces or supersedes the referenced resource.|
+ | isReplacedBy | The described resource is supplanted, displaced or superseded by the referenced resource.|
+ | requires | The described resource requires the referenced resource to support its function, delivery, or coherence of content.|
+ | isRequiredBy | The described resource is required by the referenced resource, either physically or logically.|
+
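+
+ For example, for a table that is part of a statistical yearbook (the reference shown is illustrative):
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ table_description = list(
+ # ... ,
+ relations = list(
+ list(name = "Statistical Yearbook of Popstan 2020, National Statistics Office",
+ type = "isPartOf")
+ ),
+ # ...
+ )
+ )
+ ```
+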
+ + +### Provenance + +**`provenance`** *[Optional ; Repeatable]*
+
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ + Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+ +- **`origin_description`** *[Required ; Not repeatable]*
+The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+ A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the table (element `idno` in the Table Description / Title Statement section) will be modified when published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ The namespace of the metadata standard or schema used to document the table in the source catalog, typically provided as a URI.
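+
+ The example below is an illustrative sketch of a provenance entry for metadata harvested from another catalog (the URL and identifier shown are fictitious). Note that `provenance` is entered at the root level of the metadata, not within `table_description`.
+
+
+ ```r
+ my_table = list(
+ # ... ,
+ provenance = list(
+ list(origin_description = list(
+ harvest_date = "2021-10-16T14:25:00Z",
+ altered = TRUE,
+ base_url = "https://catalog.example.org/index.php/catalog",
+ identifier = "TBL_POP_2020_001",
+ date_stamp = "2021-09-30"
+ ))
+ ),
+ # ...
+ )
+ ```
+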
+ + +### Tags + +**`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. +
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ + - **`tag`** *[Required ; Not repeatable ; String]*
+ A user-defined tag. + - **`tag_group`** *[Optional ; Not repeatable ; String]*

+ A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs. + + +- **`lda_topics`** *[Optional ; Not repeatable]*
+
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+
+
+ We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+ Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of tables, this text will be a concatenation of selected metadata elements, including the table title, description, keywords, definitions, and possibly others. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).
+
+ The `lda_topics` element includes the following metadata fields. An example in R was provided in Chapter 4 - Documents.
+
+ - **`model_info`** *[Optional ; Not repeatable]*
+ Information on the LDA model.
+ + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + - **`topic_description`** *[Optional ; Repeatable]*
+ The topic composition extracted from selected elements of the series metadata (typically, the name, definitions, and concepts).
+ + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the metadata (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic.
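As a rough illustration of the inference step described above (and not the procedure used by any specific tool), the sketch below uses the `topicmodels` and `tm` packages to compute the topic composition of a series' metadata with a pre-trained LDA model. The model object `my_lda_model`, the sample text, the 0.05 threshold, and the `model_id` value are all assumptions made for the example.

```r
library(tm)
library(topicmodels)

# Concatenation of selected metadata elements (name, definition, keywords, ...)
series_text <- paste(
  "Prevalence of stunting, height for age, children under five.",
  "Anthropometric indicator of chronic malnutrition."
)

# Build a document-term matrix restricted to the vocabulary of the trained model
dtm <- DocumentTermMatrix(
  VCorpus(VectorSource(tolower(series_text))),
  control = list(dictionary = my_lda_model@terms)
)

# Infer the topic shares; the shares of all topics sum to 1 (100%)
shares <- posterior(my_lda_model, newdata = dtm)$topics[1, ]

# Keep the topics above a threshold and map them to the 'lda_topics' structure
lda_topics <- list(list(
  model_info = list(list(model_id = "lda-model-001",
                         nb_topics = length(shares))),
  topic_description = lapply(which(shares > 0.05), function(i) {
    list(topic_id = paste0("topic_", i), topic_score = round(shares[i], 3))
  })
))
```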

+ + +- **`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). + + The word vectors do not have to be stored in the series/indicator metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model. A brief illustration is provided after the list of fields below. + +
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": null + } +] +``` +
+ + The `embeddings` element contains four metadata fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+ The date the model was trained (or a version date for the model). + - **`vector`** *[Required ; Not repeatable ; @@@@]* + The numeric vector representing the series metadata.
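As a brief illustration, and assuming that a numeric vector has already been obtained from a pre-trained embedding model, the `embeddings` block could be filled as follows. The `get_embedding()` function is only a placeholder for whatever model or API is actually used, and the identifier, description, and date are illustrative.

```r
# Placeholder for a call to a pre-trained embedding model or API (not shown here);
# it simply returns a 200-dimension numeric vector for the example
get_embedding <- function(text) runif(200)

series_text <- "Households by size and average household size, United States, 1960-2020"

embeddings <- list(
  list(
    id          = "embedding-model-v1",   # illustrative model identifier
    description = "200-dimension vectors produced by a pre-trained embedding model",
    date        = "2021-06-30",
    vector      = get_embedding(series_text)
  )
)
```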

+ + +### Additional (custom) elements + +**`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. + + +## Complete examples + +We provide here examples of the documentation of actual tables and their publishing in a NADA catalog. We use the R package NADAR and the Python library PyNada to publish metadata in the catalog. The examples only demonstrate the production and publishing of table metadata. We do not show how the **data** can also be published in a NADA database (MongoDB), to be made available via API. The use of the data API is covered in the NADA documentation. + +### Example 1 + +This first example is a table presenting the evolution since 1960 of the number of households by size and of the average household size in the United States, published by the US Census Bureau. This table, published in MS-Excel format, was downloaded on 20 February 2021 from https://www.census.gov/data/tables/time-series/demo/families/households.html. +
+![](./images/table_example_01_US_BUCEN.JPG){width=100%} +
+ +**Using R** + + +```r +library(nadar) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_tables/") + +id = "TBL_EXAMPLE_01" +thumb = "household_pic.JPG" # To be used as thumbnail in the data catalog + +# Document the table + +my_table_hh4 <- list( + + metadata_information = list( + idno = "META_TBL_EXAMPLE-01", + producers = list( + list(name = "Olivier Dupriez",affiliation = "World Bank") + ), + production_date = "2021-02-20" + ), + + table_description = list( + + title_statement = list( + idno = id, + table_number = "Table HH-4", + title = "Households by Size: 1960 to Present", + sub_title = "(Numbers in thousands, except for averages)" + ), + + authoring_entity = list( + list(name = "United States Census Bureau", + affiliation = " U.S. Department of Commerce", + abbreviation = "US BUCEN", + uri = "https://www.census.gov/en.html" + ) + ), + + date_created = "2020", + + date_published = "2020-12", + + table_columns = list( + list(label = "Year"), + list(label = "All households (number)"), + list(label = "Number of people: One"), + list(label = "Number of people: Two"), + list(label = "Number of people: Three"), + list(label = "Number of people: Four"), + list(label = "Number of people: Five"), + list(label = "Number of people: Six"), + list(label = "Number of people: Seven or more"), + list(label = "Average number of people per household") + ), + + table_rows = list( + list(label = "Year (values from 1960 to 2020)") + ), + + table_footnotes = list( + + list(number = "1", + text = "This table uses the householder's person weight to describe characteristics of people living in households. As a result, estimates of the number of households do not match estimates of housing units from the Housing Vacancy Survey (HVS). The HVS is weighted to housing units, rather than the population, in order to more accurately estimate the number of occupied and vacant housing units. If you are primarily interested in housing inventory estimates, then see the published tables and reports here: http://www.census.gov/housing/hvs/. If you are primarily interested in characteristics about the population and people who live in households, then see the H table series and reports here: https://www.census.gov/topics/families/families-and-households.html."), + + list(number = "2", + text = "Details may not sum to total due to rounding."), + + list(number = "3", + text = "1993 figures revised based on population from the most recent decennial census."), + + list(number = "4", + text = "The 2014 CPS ASEC included redesigned questions for income and health insurance coverage. All of the approximately 98,000 addresses were selected to receive the improved set of health insurance coverage items. The improved income questions were implemented using a split panel design. Approximately 68,000 addresses were selected to receive a set of income questions similar to those used in the 2013 CPS ASEC. The remaining 30,000 addresses were selected to receive the redesigned income questions. 
The source of data for this table is the CPS ASEC sample of 98,000 addresses.") + + ), + + table_series = list( + list(name = "Historical Households Tables", + maintainer = "United States Census Bureau", + uri = "https://www.census.gov/data/tables/time-series/demo/families/households.html", + description = "Tables on households generated from the Current Population Survey") + ), + + statistics = list( + list(value = "Count"), + list(value = "Average") + ), + + unit_observation = list( + list(value = "Household") + ), + + data_sources = list( + list(source = "U.S. Census Bureau, Current Population Survey, March and Annual Social and Economic Supplements") + ), + + time_periods = list( + list(from = "1960", to = "2020") + ), + + universe = list( + list(value = "US resident population") + ), + + ref_country = list( + list(name = "United States", code = "USA") + ), + + geographic_granularity = "Country", + + languages = list( + list(name = "English", code = "EN") + ), + + links = list( + list(uri = "https://www2.census.gov/programs-surveys/demo/tables/families/time-series/households/hh4.xls", + description = "Table in MS-Excel formal"), + list(uri = "https://www.census.gov/programs-surveys/cps/technical-documentation/complete.html", + description = "Technical documentation with information about ASEC, including the source and accuracy statement") + ), + + topics = list( + list( + id = "1", + name = "Demography - Censuses", + parent_id = "Demography", + vocabulary = "CESSDA Controlled Vocabulary for CESSDA Topic Classification v. 3.0 (2019-05-20)", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification?v=3.0" + ) + ), + + contacts = list( + list(name = "Fertility and Family Statistics Branch", + affiliation = "US Census Bureau", + telephone = "+1 - 301-763-2416", + uri = "ask.census.gov") + ) + + ) + +) + +# Publish the table in a NADA catalog + +table_add(idno = id, + metadata = my_table_hh4, + repositoryid = "central", + published = 1, + thumbnail = thumb, + overwrite = "yes") + +# Provide a link to the table series page (US Bucen website) + +external_resources_add( + title = "Historical Households Tables (US Bucen web page)", + idno = id, + dctype = "web", + file_path = "https://www.census.gov/data/tables/time-series/demo/families/households.html", + overwrite = "yes" +) +``` +

+The result in NADA will be as follows (only part of the metadata is displayed): + +
+![](./images/table_example_01_US_BUCEN_nada1.JPG){width=100%} +
+ +
+ +**Using Python** + +The same result can be achieved in Python; the script will be as follows: + + +```python +# Python script +``` + + +### Example 2 + +For this second example, we use a regional table from the World Bank: ["World Development Indicators - Country profiles"](https://databank.worldbank.org/views/reports/reportwidget.aspx?Report_Name=CountryProfile&Id=b450fd57&tbar=y&dd=y&inf=n&zm=n). The table is available on-line in Excel and in PDF formats, for many geographic areas: world, geographic regions, country groups (income level, etc), and country. A separate table is available for each of these areas. Metadata common to all table files is available in a separate Excel file. + +
+![](./images/table_example_02_WB_CTRY_PROFILE_SEL.JPG){width=100%} +
+![](./images/table_example_02_WB_CTRY_PROFILE.JPG){width=100%} +
+ +As the same metadata applies to all tables, we generate the metadata once, and use a function to publish the geography-specific tables in one loop. In our example, we only generate the tables for the following geographies: world, World Bank regions, and countries of South Asia. This will result in the documentation and publishing of 15 tables. By providing the list of all countries to the loop, we would publish 200+ tables using this script. + +We include definitions in the metadata. These definitions are extracted from the [World Development Indicators API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-about-the-indicators-api-documentation). + +In the script, we assume that we only want to publish the metadata in the catalog, and provide a link to the originating World Bank website. In other words, we do not make the XLSX or PDF directly accessible from the NADA catalog (which would be easy to implement). + +**Using R** + + +```r +# -------------------------------------------------------------------------- +# Load libraries and establish the catalog administrator credentials +# -------------------------------------------------------------------------- + +library(nadar) +library(jsonlite) +library(httr) +library(rlist) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_tables/") + +thumb_file <- "WB_country_profiles_WLD.jpg" + +src_data <- "World Bank, World Development Indicators database - WDI Central, 2021" + +# The tables contain data extracted from WDI time series. We identified these +# series ID and we list them here in their order of appearance in the table. + +tbl_wdi_indicators = list( + "SP.POP.TOTL", "SP.POP.GROW", "AG.SRF.TOTL.K2", "EN.POP.DNST", + "SI.POV.NAHC", "SI.POV.DDAY", "NY.GNP.ATLS.CD", "NY.GNP.PCAP.CD", + "NY.GNP.MKTP.PP.CD", "NY.GNP.PCAP.PP.CD", "SI.DST.FRST.20", + "SP.DYN.LE00.IN", "SP.DYN.TFRT.IN", "SP.ADO.TFRT", "SP.DYN.CONU.ZS", + "SH.STA.BRTC.ZS", "SH.DYN.MORT", "SH.STA.MALN.ZS", "SH.IMM.MEAS", + "SE.PRM.CMPT.ZS", "SE.PRM.ENRR", "SE.SEC.ENRR", "SE.ENR.PRSC.FM.ZS", + "SH.DYN.AIDS.ZS", "AG.LND.FRST.K2", "ER.PTD.TOTL.ZS", + "ER.H2O.FWTL.ZS", "SP.URB.GROW", "EG.USE.PCAP.KG.OE", + "EN.ATM.CO2E.PC", "EG.USE.ELEC.KH.PC", "NY.GDP.MKTP.CD", + "NY.GDP.MKTP.KD.ZG", "NY.GDP.DEFL.KD.ZG", "NV.AGR.TOTL.ZS", + "NV.IND.TOTL.ZS", "NE.EXP.GNFS.ZS", "NE.IMP.GNFS.ZS", + "NE.GDI.TOTL.ZS", "GC.REV.XGRT.GD.ZS", "GC.NLD.TOTL.GD.ZS", + "FS.AST.DOMS.GD.ZS", "GC.TAX.TOTL.GD.ZS", "MS.MIL.XPND.GD.ZS", + "IT.CEL.SETS.P2", "IT.NET.USER.ZS", "TX.VAL.TECH.MF.ZS", + "IQ.SCI.OVRL", "TG.VAL.TOTL.GD.ZS", "TT.PRI.MRCH.XD.WD", + "DT.DOD.DECT.CD", "DT.TDS.DECT.EX.ZS", "SM.POP.NETM", + "BX.TRF.PWKR.CD.DT", "BX.KLT.DINV.CD.WD", "DT.ODA.ODAT.CD" +) + +rows = list() +defs = list() + +# We then use the WDI API to retrieve information on the series (name, label, +# definition) to be included in the published metadata. 
+ +for(s in tbl_wdi_indicators) { + + url = paste0("https://api.worldbank.org/v2/sources/2/series/", s, + "/metadata?format=JSON") + s_meta <- GET(url) + if(http_error(s_meta)){ + stop("The request failed") + } else { + s_metadata <- fromJSON(content(s_meta, as = "text")) + s_metadata <- s_metadata$source$concept[[1]][[2]][[1]][[2]][[1]] + } + + indic_lbl = s_metadata$value[s_metadata$id=="IndicatorName"] + indic_def = s_metadata$value[s_metadata$id=="Longdefinition"] + + this_row = list(var_name = s, dataset = src_data, label = indic_lbl) + rows = list.append(rows, this_row) + + this_def = list(name = indic_lbl, definition = indic_def) + defs = list.append(defs, this_def) + +} + +# -------------------------------------------------------------------------- +# We create a function that takes two parameters: the country (or region) +# name, and the country (or region) code. This function will generate the +# table metadata and publish the selected table in the NADA catalog. +# -------------------------------------------------------------------------- + +publish_country_profile <- function(country_name, country_code) { + + # Generate the country/region-specific unique table ID and table title + + idno_meta <- paste0("UC013_", country_code) + idno_tbl <- paste0("UC013_", country_code) + tbl_title <- paste0("World Development Indicators, Country Profile, ", + country_name, " - 2021") + citation <- paste("World Bank,", tbl_title, + ", https://datacatalog.worldbank.org/dataset/country-profiles, accessed on [date]") + + # Generate the schema-compliant metadata + + my_tbl <- list( + + metadata_information = list( + producers = list(list(name = "NADA team")), + production_date = "2021-09-14", + version = "v01" + ), + + table_description = list( + + title_statement = list( + idno = idno_tbl, + title = tbl_title + ), + + authoring_entity = list( + list(name = "World Bank, Development Data Group", + abbreviation = "WB", + uri = "https://data.worldbank.org/") + ), + + date_created = "2021-07-03", + date_published = "2021-07", + + description = "Country profiles present the latest key development data drawn from the World Development Indicators (WDI) database. 
They follow the format of The Little Data Book, the WDI's quick reference publication.", + + table_columns = list( + list(label = "Year 1990"), + list(label = "Year 2000"), + list(label = "Year 2010"), + list(label = "Year 2018") + ), + + table_rows = rows, + + table_series = list( + list(name = "World Development Indicators, Country Profiles", + maintainer = "World Bank, Development Data Group (DECDG)") + ), + + data_sources = list( + list(source = src_data) + ), + + time_periods = list( + list(from = "1990", to = "1990"), + list(from = "2000", to = "2000"), + list(from = "2010", to = "2010"), + list(from = "2018", to = "2018") + ), + + ref_country = list( + list(name = country_name, code = country_code) + ), + + geographic_granularity = area, + + languages = list( + list(name = "English", code = "EN") + ), + + links = list( + list(uri = "https://datacatalog.worldbank.org/dataset/country-profiles", + description = "Country Profiles in World Bank Data Catalog website"), + list(uri = "http://wdi.worldbank.org/tables", + description = "Country Profiles in World Bank Word Development Indicators website"), + list(uri = "https://datatopics.worldbank.org/world-development-indicators/", + description = "Word Development Indicators website") + ), + + keywords = list( + list(name = "World View"), + list(name = "People"), + list(name = "Environment"), + list(name = "Economy"), + list(name = "States and markets"), + list(name = "Global links") + ), + + topics = list( + list(id = "1", name = "Demography", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "2", name = "Economics", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "3", name = "Education", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "4", name = "Health", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "5", name = "Labour And Employment", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "6", name = "Natural Environment", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "7", name = "Social Welfare Policy and Systems", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "8", name = "Trade Industry and Markets", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "9", name = "Economic development") + ), + + definitions = defs, + + license = list( + list(name = "Creative Commons - Attribution 4.0 International - CC BY 4.0", + uri = "https://creativecommons.org/licenses/by/4.0/") + ), + + citation = citation, + + contacts = list( + list(name = "World Bank, Development Data Group, Help Desk", + telephone = "+1 (202) 473-7824 or +1 (800) 590-1906", + email = "data@worldbank.org", + uri = "https://datahelpdesk.worldbank.org/") + ) + + ) + + ) + + # Publish the table in the NADA catalog + + table_add(idno = my_tbl$table_description$title_statement$idno, + metadata = my_tbl, + repositoryid = "central", + published = 1, + overwrite = "yes", + thumbnail = thumb_file) + + # Add a link to the WDI website as an external resource + + external_resources_add( + title = "World Development Indicators - Regional tables", + idno = idno_tbl, + dctype = "web", + file_path = 
"http://wdi.worldbank.org/table", + overwrite = "yes" + ) + +} + +# -------------------------------------------------------------------------- +# We run the function in a loop to publish the selected tables +# -------------------------------------------------------------------------- + +# List of countries/regions + +geo_list <- list( + list(name = "World", code = "WLD", area = "World"), + list(name = "East Asia and Pacific", code = "EAP", area = "Region"), + list(name = "Europe and Central Asia", code = "ECA", area = "Region"), + list(name = "Latin America and Caribbean", code = "LAC", area = "Region"), + list(name = "Middle East and North Africa", code = "MNA", area = "Region"), + list(name = "South Asia", code = "SAR", area = "Region"), + list(name = "Sub-Saharan Africa", code = "AFR", area = "Region"), + list(name = "Afghanistan", code = "AFG", area = "Country"), + list(name = "Bangladesh", code = "BGD", area = "Country"), + list(name = "Bhutan", code = "BHU", area = "Country"), + list(name = "India", code = "IND", area = "Country"), + list(name = "Maldives", code = "MDV", area = "Country"), + list(name = "Nepal", code = "NPL", area = "Country"), + list(name = "Pakistan", code = "PAK", area = "Country"), + list(name = "Sri Lanka", code = "LKA", area = "Country")) + +# Loop through the list of countries/region to publish the tables + +for(i in 1:length(geo_list)) { + area <- as.character(geo_list[[i]][3]) + publish_country_profile( + country_name = as.character(geo_list[[i]][1]), + country_code = as.character(geo_list[[i]][2])) +} +``` + +

+ +**Using Python** + + +```python +# Python script +``` + +**The result in NADA** + +
+![](./images/Table_Example02_in_NADA.JPG){width=100%} +
+ + +### Example 3 + +This example shows how the documentation process can take advantage of R or Python to extract information directly from the table. The table is available in MS-Excel format and contains a long list of countries, which would be tedious to enter manually. A script reads the Excel file and extracts some of the information, which is then added to the table metadata. The file also contains the definitions of the indicators shown in the table. + +Here we assume we want to provide the XLS and PDF versions of the table in addition to a link to the source website; we therefore upload these resources (XLS and PDF) to our web server. + +The table: + +![](./images/table_example_03_WB_GLOBAL_GOAL.JPG){width=100%} + +

+ +> Using R + + +```r +library(nadar) +library(readxl) +library(rlist) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_tables/") + +thumb = "SDGs.jpg" + +id = "TBL_EXAMPLE-03" + +# --------------------------------------------------------------------------- +# We read the MS-Excel file and extract the list of countries and definitions +# --------------------------------------------------------------------------- + +# We generate the list of countries +df <- read_xlsx("WV2_Global_goals_ending_poverty_and_improving_lives.xlsx", + range = "A5:A230") +ctry_list <- list() +for(i in 1:nrow(df)) { + c <- list(name = as.character(df[[1]][i])) + ctry_list <- list.append(ctry_list, c) +} + +# We extract the definitions found in the table. +# Note that we could have instead copy/pasted the definitions. +# For example, the command line: +# list(name = as.character(df[1,1]), definition = as.character(df[3,1])) +# is equivalent to: +# list(name = "Income share held by lowest 20%", +# definition = "Percentage share of income or consumption is the share that accrues to subgroups of population indicated by deciles or quintiles. Percentage shares by quintile may not sum to 100 because of rounding.") + +df <- read_xlsx("WV2_Global_goals_ending_poverty_and_improving_lives.xlsx", + range = "A241:A340", col_names = FALSE) + +def_list = list( + list(name = as.character(df[1,1]), definition = as.character(df[3,1])), + list(name = as.character(df[11,1]), definition = as.character(df[13,1])), + list(name = as.character(df[21,1]), definition = as.character(df[23,1])), + list(name = as.character(df[31,1]), definition = as.character(df[33,1])), + list(name = as.character(df[41,1]), definition = as.character(df[43,1])), + list(name = as.character(df[51,1]), definition = as.character(df[53,1])), + list(name = as.character(df[61,1]), definition = as.character(df[63,1])), + list(name = as.character(df[71,1]), definition = as.character(df[73,1])), + list(name = as.character(df[78,1]), definition = as.character(df[80,1])), + list(name = as.character(df[85,1]), definition = as.character(df[87,1])), + list(name = as.character(df[92,1]), definition = as.character(df[94,1])) +) + +# We generate the table metadata + +my_tbl <- list( + + metadata_information = list( + idno = "META_TBL_EXAMPLE-03", + producers = list( + list(name = "Olivier Dupriez", affiliation = "World Bank") + ), + production_date = "2021-02-20" + ), + + table_description = list( + + title_statement = list( + idno = id, + table_number = "WV.2", + title = "Global Goals: Ending Poverty and Improving Lives" + ), + + authoring_entity = list( + list(name = "World Bank, Development Data Group", + abbreviation = "WB", + uri = "https://data.worldbank.org/") + ), + + date_created = "2020-12-16", + date_published = "2020-12", + + description = "", + + table_columns = list( + list(label = "Percentage share of income or consumption - Lowest 20% - 2007-18"), + list(label = "Prevalence of child malnutrition - Stunting, height for age - % of children under 5 - 2011-19"), + list(label = "Maternal mortality ratio - Modeled estimates - per 100,000 live births - 2017"), + list(label = "Under-five 
mortality rate - Total - per 1,000 live births - 2019"), + list(label = "Incidence of HIV, ages 15-49 (per 1,000 uninfected population ages 15-49) - 2019"), + list(label = "Incidence of tuberculosis - per 100,000 people - 2019"), + list(label = "Mortality caused by road traffic injury - per 100,000 people - 2016"), + list(label = "Primary completion rate - Total - % of relevant age group - 2018"), + list(label = "Contributing family workers - Male - % of male employment - 2018"), + list(label = "Contributing family workers - Female - % of female employment - 2018"), + list(label = "Labor productivity - GDP per person employed - % growth - 2015-18") + ), + + table_rows = list( + list(label = "Country or region") + ), + + table_series = list( + list(name = "World Development Indicators - World View", + description = "World Development Indicators includes data spanning up to 56 years-from 1960 to 2016. World view frames global trends with indicators on population, population density, urbanization, GNI, and GDP. As in previous years, the World view online tables present indicators measuring the world's economy and progress toward improving lives, achieving sustainable development, providing support for vulnerable populations, and reducing gender disparities. Data on poverty and shared prosperity are now in a separate section, while highlights of progress toward the Sustainable Development Goals are now presented in the companion publication, Atlas of Sustainable Development Goals 2017. + + The global highlights in this section draw on the six themes of World Development Indicators: + - Poverty and shared prosperity, which presents indicators that measure progress toward the World Bank Group's twin goals of ending extreme poverty by 2030 and promoting shared prosperity in every country. + - People, which showcases indicators covering education, health, jobs, social protection, and gender and provides a portrait of societal progress across the world. + - Environment, which presents indicators on the use of natural resources, such as water and energy, and various measures of environmental degradation, including pollution, deforestation, and loss of habitat, all of which must be considered in shaping development strategies. + - Economy, which provides a window on the global economy through indicators that describe the economic activity of the more than 200 countries and territories that produce, trade, and consume the world's output. + - States and markets, which encompasses indicators on private investment and performance, financial system development, quality and availability of infrastructure, and the role of the public sector in nurturing investment and growth. 
+ - Global links, which presents indicators on the size and direction of the flows and links that enable economies to grow, including measures of trade, remittances, equity, and debt, as well as tourism and migration.", + uri = "http://wdi.worldbank.org/tables", + maintainer = "World Bank, Development Data Group (DECDG)") + ), + + data_sources = list( + list(source = "World Bank, World Development Indicators database, 2020") + ), + + time_periods = list( + list(from = "2007", to = "2019") # The table cover all years from 2007 to 2019 + ), + + ref_country = ctry_list, + geographic_granularity = "Country, WB geographic region, other country groupings", + + languages = list( + list(name = "English", code = "EN") + ), + + links = list( + list(uri = "http://wdi.worldbank.org/tables", + description = "World Development Indicators - Global Goals tables"), + list(uri = "https://datatopics.worldbank.org/world-development-indicators/", + description = "Word Development Indicators website"), + list(uri = "https://sdgs.un.org/goals", + description = "United Nations, Sustainable Development Goals (SDG) website") + ), + + keywords = list( + list(name = "Sustainable Development Goals (SDGs)"), + list(name = "Shared prosperity"), + list(name = "HIV - AIDS") + ), + + topics = list( + list(id = "1", + name = "Demography", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "2", + name = "Economics", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "3", + name = "Education", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + list(id = "4", + name = "Health", + vocabulary = "CESSDA", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification") + ), + + disciplines = list( + list(name = "Economics") + ), + + definitions = def_list, + + license = list( + list(name = "Creative Commons - Attribution 4.0 International - CC BY 4.0", + uri = "https://creativecommons.org/licenses/by/4.0/") + ), + + citation = "", + + contacts = list( + list(name = "World Bank, Development Data Group, Help Desk", + telephone = "+1 (202) 473-7824 or +1 (800) 590-1906", + email = "data@worldbank.org", + uri = "https://datahelpdesk.worldbank.org/") + ) + + ) +) + +# We publish the table in the catalog + +table_add(idno = id, + metadata = my_tbl, + repositoryid = "central", + published = 1, + overwrite = "yes", + thumbnail = thumb) + +# We add the MS-Excel and PDF versions of the table as external resources + +external_resources_add( + title = "Global Goals: Ending Poverty and Improving Lives (in MS-Excel format)", + idno = id, + dctype = "tbl", + file_path = "WV2_Global_goals_ending_poverty_and_improving_lives.xlsx", + overwrite = "yes" +) + +external_resources_add( + title = "Global Goals: Ending Poverty and Improving Lives (in PDF format)", + idno = id, + dctype = "tbl", + file_path = "WV2_Global_goals_ending_poverty_and_improving_lives.pdf", + overwrite = "yes" +) +``` + +The table will now be available in the NADA catalog. + +
+![](./images/Table_Example03_in_NADA.JPG){width=100%} +
+ +

+**Using Python** + + +```python +# Python script +``` diff --git a/10_chapter10_image.md b/10_chapter10_image.md new file mode 100644 index 0000000..9c856de --- /dev/null +++ b/10_chapter10_image.md @@ -0,0 +1,2053 @@ +--- +output: html_document +--- + +# Images {#chapter10} + +
+
+![](./images/IPTC_DCMI.JPG){width=70%} +
+
+ + +## Image metadata + +This chapter describes the use of two metadata standards for the documentation of images. Images may include both electronic and physical representations, but we are here interested in images available as electronic files, intended to be catalogued and published in on-line catalogs/albums. These files will typically be available in one of the following formats: JPG, PNG, or TIFF. Images can be photos taken by digital cameras, images generated by computer, or scanned images. The metadata standards we describe are intended to make these images discoverable, accessible, and usable. For that purpose, metadata must be provided on the content of the image (in the form of caption, description, keywords, etc.), on the location and date the image was generated, on the author, and more. Information on the use license and copyrights, and on possible privacy protection issues (identifiable persons, possibly minors, etc.), is needed to provide users with the information they need to ensure their use of the published images is legal, ethical, and responsible. + +The devices used to generate images in the form of electronic files (such as digital cameras) embed metadata into the files they produce. Digital cameras generate EXIF metadata. This information may be useful to some users, but (with a few exceptions like the date the photo was taken and the GPS location, if captured) it lacks information on the content of the image (what is represented in it), which is required for discoverability. This information must be added by curators. Part of it will be entered manually; other parts can be extracted in a largely automated manner using machine learning models and APIs. This information must be structured and stored in compliance with a metadata standard. We present in this chapter two standards that can serve that purpose: the comprehensive (and somewhat complex) [IPTC standard](https://iptc.org/), and the simpler [Dublin Core (DCMI)](https://dublincore.org/) standard. The metadata schema we propose embeds both options; when using the schema, users will select either one or the other to document their images. We also make references to the [ImageObject metadata schema](https://schema.org/ImageObject) from schema.org, and include some of its elements in our schema. + + +:::quote +Although photographs may be more explicit than a long discourse for humans, they don't describe themselves in term of content as texts do. For texts, authors use many clues to indicate what they are talking about: titles, abstract, keywords, etc. which may be used for automatic cataloguing. Searching for photos must rely on manual cataloguing, or relate texts and documents that come with the photos. *(Source: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.5077&rep=rep1&type=pdf)* +::: + + +We start with a brief presentation of the EXIF metadata, then describe the schema we propose for the documentation and cataloguing of images. + + +### Embedded metadata: EXIF + +Modern digital cameras automatically generate metadata and embed it into the image file. This metadata is known as the Exchangeable Image File Format or EXIF. EXIF will record information on the date and time the image was taken, on the GPS location coordinates (latitude & longitude, possibly altitude) if the camera was equipped with a GPS and geolocation was enabled, information on the device including manufacturer and model, technical information (lens type, focal range, aperture, shutter speed, flash settings), the system-generated unique image identifier, and more. 
+ +There are several ways to extract or view an image's EXIF data. For example, the R packages `exifr` and `exiftoolr` (both built on the ExifTool utility) allow extraction and use of EXIF metadata, as illustrated below, and applications like Flickr will display EXIF content. + +
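A minimal sketch of reading EXIF tags into a data frame with the `exifr` package follows; the file name and the selection of tags are illustrative.

```r
library(exifr)

# Read selected EXIF tags from an image file into a data frame
exif <- read_exif("my_photo.jpg",
                  tags = c("Make", "Model", "DateTimeOriginal",
                           "GPSLatitude", "GPSLongitude", "ImageUniqueID"))
exif
```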
+![](./images/schema_guide_exif_01.JPG){width=100%} +
+ +But with the exception of the date, location (if captured), and unique image identifier, the content of the EXIF does not provide information that users interested in identifying images based on their source and/or content will find useful. Metadata describing the content and source of an image will have to be obtained from another source or using other tools. + + +### IPTC and Dublin Core standards + +The metadata schema we propose for documenting images contains two mutually-exclusive options: the Dublin Core, as a simple option, and the IPTC as a more complex and advanced solution. The schema also contains a few metadata elements that will be used no matter which option is selected. The schema is structured as follows: + +- A few elements common to both options are provided to document the metadata (not the image itself), to provide some cataloguing parameters, and to set a unique identifier for the image being documented. + +- Then come the two options for documenting the image itself: the IPTC block of metadata elements, and the Dublin Core block of elements. Users will make use of one of them, not both. + + - The IPTC is the most detailed and complex schema. The version embedded in our schema is 2019.1. According to the [IPTC website](https://iptc.org/standards/photo-metadata/iptc-standard/), "The IPTC Photo Metadata Standard is the most widely used standard to describe photos, because of its universal acceptance among news agencies, photographers, photo agencies, libraries, museums, and other related industries. It structures and defines metadata properties that allow users to add precise and reliable data about images." The IPTC standard consists of two schemas: IPTC Core and IPTC Extension. They provide a comprehensive set of fields to document an image, including information on time and geographic coverage, people and objects shown in the image, information on rights, and more. The schema is complex and in most cases only a small subset of fields will be used to document an image. Controlled vocabularies are recommended for some elements. + + - The Dublin Core (DCMI) is a simpler and highly flexible standard, composed of 15 core elements, which we supplement with a few elements mostly taken from the ImageObject schema from schema.org. + +- Last, a small number of additional metadata elements are provided, which are common to both options described above. + +Whether the IPTC or the simpler DCMI option is used, the metadata should be made as rich as possible. + + +### Augmenting image metadata + +To make images discoverable, metadata that describe the content depicted in an image, the source of the image, and the rights and licensing associated with it are essential, but they are not provided in the EXIF. Additional metadata must therefore be provided. + +Some of these metadata will have to be generated by image authors and/or curators; others can be generated in a largely automated manner using machine learning models and tools. Image processing algorithms that make it possible to augment metadata include face detection, person identification, automated labeling, text extraction, and others. Before describing the proposed metadata schema in the following sections, we present here some examples of tools that make such metadata enhancement easy and affordable. + +The example we provide below makes use of the [Google Vision API](https://cloud.google.com/vision/docs/drag-and-drop) to generate image metadata. 
Google Vision is one out of multiple tools that can be used for that purpose such as [Amazon Rekognition](https://aws.amazon.com/rekognition/), or [Microsoft Azure Computer Vision](https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/). This example makes use of a photo selected from the [World Bank Flickr album](https://www.flickr.com/photos/worldbank/1543136297/in/album-72157634086023459/). + +
+![](./images/Google_Vision_00.JPG){width=100%} +
+ +The image comes with a brief description that identifies the photographer, the location (name of the country and town, not GPS location), and the content of the image. The description of the image includes important keywords that, when indexed in a catalog, will support discoverability of the image. This information, to be manually entered, is valuable and must be part of the curated image metadata. + +----------- +
+![](./images/Google_Vision_00a.JPG){width=70%} +
+----------- + +But we can add useful additional information in an automated manner and at low cost using machine learning models. In the example below, we use the (free) on-line ["Try it" tool](https://cloud.google.com/vision) of the Google Vision application. + +
+![](./images/Google_Vision_01.JPG){width=85%} +
+ +The Google Vision API returns and displays the results of the image processing in multiple tabs. The same content is available programmatically in JSON format. The content of this JSON file can be mapped to elements of the metadata schema, for automatic addition to the image metadata. + +The first tab is the result of **faces** detection. Each detected face has a bounding box and metadata such as the derived emotion of the person. The bounding box can be used to automatically flag images that have one or multiple "significant size" face(s) and may have to be excluded from the published images for privacy protection reasons. + +
+![](./images/Google_Vision_02.JPG){width=85%} +
+ +The second tab reports on detected **objects**. + +
+![](./images/Google_Vision_03.JPG){width=85%} +
+ +The third tab suggests **labels** that could be attached to the image, provided with a degree of confidence. A threshold can be set to automatically add (or not) each proposed label as a keyword in the image metadata. + +
+![](./images/Google_Vision_04.JPG){width=85%} +
+ +![](./images/Google_Vision_04a.JPG){width=30%} +![](./images/Google_Vision_04b.JPG){width=30%} +![](./images/Google_Vision_04c.JPG){width=30%} + +The fourth tab shows the **text** detected in the image. The quality of text detection and recognition depends on the resolution of the image and on the size and orientation of the text in the image. In our example, the algorithm fails to read (most of) the small, rotated and truncated text. + +
+![](./images/Google_Vision_06.JPG){width=65%} +
+ +The tool managed to recognize some, but not all characters. In this case, this would be considered as not useful information to be added to the image metadata. + +
+![](./images/Google_Vision_05.JPG){width=85%} +
+ +We are not interested in the **properties** tab which does not provide information that can be used for discoverability of images based on their content or source. + +The last tab, **Safe search**, could be used as warnings if you plan to make the image publicly accessible. + +
+![](./images/Google_Vision_07.JPG){width=85%} +
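The annotations shown in these tabs can also be requested from the Vision REST API and mapped to the metadata schema. The sketch below is a rough illustration only, assuming the `httr`, `jsonlite`, and `base64enc` packages, a valid API key stored in `my_key`, label detection only, and an arbitrary 0.80 confidence threshold.

```r
library(httr)
library(jsonlite)
library(base64enc)

my_key <- "YOUR_GOOGLE_API_KEY"   # assumption: a valid Cloud Vision API key

# Build a label-detection request for one image file
request <- list(requests = list(list(
  image    = list(content = base64encode("my_photo.jpg")),
  features = list(list(type = "LABEL_DETECTION", maxResults = 10))
)))

resp <- POST(
  url  = paste0("https://vision.googleapis.com/v1/images:annotate?key=", my_key),
  body = toJSON(request, auto_unbox = TRUE),
  content_type_json()
)

# Extract the suggested labels and their confidence scores
labels <- fromJSON(content(resp, as = "text"))$responses$labelAnnotations[[1]]

# Keep only high-confidence labels, to be added as keywords in the image metadata
keywords <- labels$description[labels$score >= 0.80]
```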
+ +This "Try it" tool demonstrates the capabilities of the application which, for automating the processing of a collection of images, would be accessed programmatically using R, Python or another programming language. Accessing the application's API requires a key. The cost of image labeling, face detection, and other image processing is low. For information on pricing, consult the website of the API providers. + + +## Schema description + +The schema contains two options to document images: the IPTC and the Dublin Core metadata standards. The schema contains four main groups of metadata elements: + 1. A small set of "common elements" (used no matter what option -- IPTC or Dublin Core -- is used), used mostly for cataloguing purpose. + 2. The IPTC metadata elements + 3. The Dublin Core (DCMI) elements + 4. Another small set of common elements. + +The description of IPTC metadata elements is largely taken from the Photo Metadata section of the [IPTC website](https://iptc.org/standards/photo-metadata/). + +
+```json +{ + "repositoryid": "central", + "published": "0", + "overwrite": "no", + "metadata_information": {}, + "image_description": { + "idno": "string", + "identifiers": [], + "iptc": {}, + "dcmi": {}, + "license": [], + "album": [] + }, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
+ + +### Common elements + +- **`metadata_information`** *[Optional ; Not repeatable]*
+This block is used to describe who produced the metadata and when. This is an optional section of the schema. It is useful to archivists more than to data users. The description of the image itself is found in the `IPTC` or `DCMI` section. +
+```json +"metadata_information": { + "title": "string", + "idno": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "production_date": "string", + "version": "string" +} +``` +
+ + - **`title`** *[Optional ; Not Repeatable ; String]*
+ The title of the image metadata. This can be the same as the image title. + + - **`idno`** *[Optional ; Not Repeatable ; String]*
+ The unique identifier of the image metadata document (which can be different from the image identifier). + + - **`producers`** *[Optional ; Repeatable]*
+ A list of persons or organizations involved in the documentation (production of the metadata) of the image. + - **`name`** *[Optional ; Not repeatable, String]*
+ The name of the person or agency that is responsible for the documentation of the image. + - **`abbr`** *[Optional ; Not repeatable, String]*
+ Abbreviation (acronym) of the agency mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable, String]*
+ Affiliation of the person or agency mentioned in `name`. + - **`role`** *[Optional ; Not repeatable, String]*
+ The specific role of the person or agency mentioned in `name` in the production of the metadata. This element will be used when more than one person or organization is listed in the `producers` element to distinguish the specific contribution of each metadata producer.

+ + - **`production_date`** *[Optional ; Not repeatable, String]*
+ The date the image metadata was generated (not the date the image was created), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). + + - **`version`** *[Optional ; Not repeatable, String]*
+ The version of the metadata on this image. This element will rarely be used. + + +- **`image_description`** *[Required ; Not Repeatable]*
+The `image_description` will contain the metadata related to one image. +
+```json +"image_description": { + "idno": "string", + "identifiers": [ + { + "type": "string", + "identifier": "string" + } + ], + "iptc": {}, + "dcmi": {}, + "license": [], + "album": [] +} +``` +
+ + - **`idno`** *[Required ; Not Repeatable, String]*
+ The (main) unique identifier of the image, to be used for cataloguing purpose. + + - **`identifiers`** *[Optional, Repeatable]*
+ The repeatable element `identifiers` is used to list image identifiers other than the one used in `idno`. Some images may have unique identifiers assigned by different organizations or cataloguing systems; this element is used to document them. + + This element is used to enter image identifiers (IDs) other than the catalog ID entered in the `image_description / idno` element. It can for example be a Digital Object Identifier (DOI), or the EXIF identifier. Note that the ID entered in the `idno` element can be repeated here (`idno` does not provide a `type` parameter, that curators may want to document). + + - **`type`** *[Optional, Not Repeatable, String]* + The type of identifier. This could be for example "DOI". + - **`identifier`** *[Required, Not Repeatable, String]* + The identifier itself. + + +### IPTC option + +**`iptc`** *[Optional ; Not Repeatable]*
+The schema provides two options (standards) to document an image: the IPTC, and the Dublin Core. Only one of these standards, not both, will be used to document an image. The block `iptc` will be used when IPTC is the preferred option. In such case, the `dcmi` block describe later in this chapter will be left empty. IPTC is the most complex of these two options. +
+```json +"iptc": { + "photoVideoMetadataIPTC": { + "title": "string", + "imageSupplierImageId": "string", + "registryEntries": [], + "digitalImageGuid": "string", + "dateCreated": "2023-04-11T15:06:09Z", + "headline": "string", + "eventName": "string", + "description": "string", + "captionWriter": "string", + "keywords": [], + "sceneCodes": [], + "sceneCodesLabelled": [], + "subjectCodes": [], + "subjectCodesLabelled": [], + "creatorNames": [], + "creatorContactInfo": {}, + "creditLine": "string", + "digitalSourceType": "http://example.com", + "jobid": "string", + "jobtitle": "string", + "source": "string", + "locationsShown": [], + "imageRating": 0, + "supplier": [], + "copyrightNotice": "string", + "copyrightOwners": [], + "usageTerms": "string", + "embdEncRightsExpr": [], + "linkedEncRightsExpr": [], + "webstatementRights": "http://example.com", + "instructions": "string", + "genres": [], + "intellectualGenre": "string", + "artworkOrObjects": [], + "personInImageNames": [], + "personsShown": [], + "modelAges": [], + "additionalModelInfo": "string", + "minorModelAgeDisclosure": "http://example.com", + "modelReleaseDocuments": [], + "modelReleaseStatus": {}, + "organisationInImageCodes": [], + "organisationInImageNames": [], + "productsShown": [], + "maxAvailHeight": 0, + "maxAvailWidth": 0, + "propertyReleaseStatus": {}, + "propertyReleaseDocuments": [], + "aboutCvTerms": [] + } +} +``` +
+ + +**`photoVideoMetadataIPTC`** *[Required ; Not Repeatable ; String]*
+Contains all elements used to describe the image using the IPTC standard. + +- **`title`** *[Optional ; Not Repeatable ; String]*
+The title is a shorthand reference for the digital image. It provides a short verbal and human readable name which can be a text and/or a numeric reference. It is not the same as the Headline (see below). Some may use the `title` field to store the file name of the image, though the field may be used in many ways. This element should not be used to provide the unique identifier of the image. + +- **`imageSupplierImageId`** *[Optional ; Not Repeatable ; String]*
+A unique identifier assigned by the image supplier to the image. + +- **`registryEntries`** *[Optional ; Repeatable]*
+A structured element used to provide cataloguing information (i.e. an entry in a registry). It includes the unique identifier for the image issued by the registry and the registry’s organization identifier. +
+```json +"registryEntries": [ + { + "role": "http://example.com", + "assetIdentifier": "string", + "registryIdentifier": "http://example.com" + } +] +``` +
+ + - **`role`**: *[Optional ; Not Repeatable ; String]*
+ An identifier of the reason and/or purpose for this Registry Entry. + - **`assetIdentifier`** *[Optional ; Not Repeatable ; String]*
+ A unique identifier created by the registry and applied by the creator of the digital image. This value shall not be changed after being applied. This identifier is linked to a corresponding Registry Organization Identifier. Enter the unique identifier created by a registry and applied by the creator of the digital image. This value shall not be changed after being applied. This identifier may be globally unique by itself, but it must be unique for the issuing registry. An input to this field should be made mandatory. + - **`registryIdentifier`** *[Optional ; Not Repeatable ; String]*
+ An identifier for the registry/organization which issued the corresponding Registry Image Id.

+ +- **`digitalImageGuid`** *[Optional ; Not Repeatable ; String]*
+A globally unique identifier for the image. This identifier is created and applied by the creator of the digital image at the time of its creation. This value shall not be changed after that time. The identifier can be generated using an algorithm that guarantees that the created identifier is globally unique. Devices that create digital images, like digital or video cameras or scanners, usually create such an identifier at the time of the creation of the digital data, and add it to the metadata embedded in the image file (e.g., the EXIF metadata). IPTC's requirements for unique ids are as follows: + - It must be globally unique. Algorithms for this purpose exist. + - It should identify the camera body. + - It should identify each individual photo from this camera body. + - It should identify the date and time of the creation of the picture. + - It should be secured against tampering.

+ +- **`dateCreated`** *[Optional ; Not Repeatable ; String]*
+Designates the date and optionally the time the content of the image was created. For a photo, this will be the date and time the photo was taken. When no information is available on the time, the time is set to 00:00:00. The preferred format for the `dateCreated` element is the truncated DateTime format, for example: 2021-02-22T21:24:06Z + +- **`headline`** *[Optional ; Not Repeatable ; String]*
+A brief publishable summary of the contents of the image. Note that a headline is not the same as a title. + +- **`eventName`** *[Optional ; Not Repeatable ; String]*
+The name or a brief description of the event where the image was taken. If this is a sub-event of a larger event, mention both in the description. For example: "Opening statement, 1st International Conference on Metadata Standards, New York, November 2021". + +- **`description`** *[Optional ; Not Repeatable ; String]*
+A textual description, including captions, of the image. This describes the who, what, and why of what is happening in this image. This might include names of people, and/or their role in the action that is taking place within the image. Example: "The president of the Metadata Association delivers the keynote address". + +- **`captionWriter`** *[Optional ; Not Repeatable ; String]*
+An identifier, or the name, of the person involved in writing, editing or correcting the description of the image. + +- **`keywords`**: *[Optional ; Repeatable ; String]*
+
+```json +"keywords": [ + "string" +] +``` +
+ +Keywords (terms or phrases) to express the subject of the image. Keywords do not have to be taken from a controlled vocabulary. + +- **`sceneCodes`** *[Optional ; Repeatable ; String]*
+
+```json +"sceneCodes": [ + "string" +] +``` +
+The `sceneCodes` describe the scene of a photo content. The [IPTC Scene-NewsCodes](http://cv.iptc.org/newscodes/scene) controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license) should be used, where a scene is represented as a string of 6 digits.
+ + | code | Label | Description | + |:-----:|:----------------:|-----------------------------------| + |010100 | headshot | A head only view of a person (or animal/s) or persons as in a montage.| + |010200 | half-length | A torso and head view of a person or persons.| + |010300 | full-length | A view from head to toe of a person or persons | + |010400 | profile | A view of a person from the side | + |010500 | rear view | A view of a person or persons from the rear. | + |010600 | single | A view of only one person, object or animal. | + |010700 | couple | A view of two people who are in a personal relationship, for example engaged, married or in a romantic partnership. | + |010800 | two | A view of two people | + |010900 | group | A view of more than two people | + |011000 | general view | An overall view of the subject and its surrounds | + |011100 | panoramic view | A panoramic or wide angle view of a subject and its surrounds | + |011200 | aerial view | A view taken from above | + |011300 | under-water | A photo taken under water | + |011400 | night scene | A photo taken during darkness | + |011500 | satellite | A photo taken from a satellite in orbit | + |011600 | exterior view | A photo that shows the exterior of a building or other object | + |011700 | interior view | A scene or view of the interior of a building or other object | + |011800 | close-up | A view of, or part of a person/object taken at close range in order to emphasize detail or accentuate mood. Macro photography. | + |011900 | action | Subject in motion such as children jumping, horse running | + |012000 | performing | Subject or subjects on a stage performing to an audience | + |012100 | posing | Subject or subjects posing such as a "victory" pose or other stance that symbolizes leadership. | + |012200 | symbolic | A posed picture symbolizing an event - two rings for marriage | + |012300 | off-beat | An attractive, perhaps fun picture of everyday events - dog with sunglasses, people cooling off in the fountain | + |012400 | movie scene | Photos taken during the shooting of a movie or TV production. | + +
+ +- **`sceneCodesLabelled`** *[Optional ; Repeatable]*
+
+```json +"sceneCodesLabelled": [ + { + "code": "string", + "label": "string", + "description": "string" + } +] +``` +
+ +The `sceneCodes` element described above only allows for the capture of codes. To improve discoverability (by indexing important keywords), not only the scene codes but also the scene description should be provided. The IPTC standard does not provide an element that allows the scene label and description to be entered. The `sceneCodesLabelled` is an element that we added to our schema. Ideally, curators will enter the scene codes in the element `sceneCodes` to maintain full compatibility with the IPTC, and complement that information by also entering the codes and their description in the `sceneCodesLabelled` element.
+ + - **`code`** *[Optional ; Not Repeatable ; String]*
+ The code for the scene depicted in the photo. The [IPTC Scene-NewsCodes](http://cv.iptc.org/newscodes/scene) controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license) should be used, where a scene is represented as a string of 6 digits. See table above.
+ - **`label`** *[Optional ; Not Repeatable ; String]*
+ The label of the scene. See table above for examples.
+ - **`description`** *[Optional ; Not Repeatable ; String]*
+ A more detailed description of the scene. See table above for examples.
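+
+For illustration, a brief R sketch (with hypothetical values) showing how both elements could be populated in the `image_description` section:
+
+```r
+image_description <- list(
+  sceneCodes = list("011200"),     # "aerial view" in the IPTC Scene-NewsCodes vocabulary
+  sceneCodesLabelled = list(
+    list(code        = "011200",
+         label       = "aerial view",
+         description = "A view taken from above")
+  )
+)
+```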

+ +- **`subjectCodes`** *[Optional ; Repeatable ; String]*
+
+```json +"subjectCodes": [ + "string" +] +``` +
+ +Specifies one or more subjects from the [IPTC Subject-NewsCodes](http://cv.iptc.org/newscodes/subjectcode) controlled vocabulary to categorize the image. Each Subject is represented as a string of 8 digits. The vocabulary consists of about 1400 terms organized into 3 levels (users can decide to use only the first, or the first two levels; the more detail is provided, the better the discoverability of the image). The first level of the controlled vocabulary is as follows: +
+
+ | code | Label | Description |
+ |:------:|:-----------------:|-----------------------------------|
+ |01000000| arts, culture and entertainment | Matters pertaining to the advancement and refinement of the human mind, of interests, skills, tastes and emotions |
+ |02000000| crime, law and justice | Establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules and the punishment of offenders. Organizations and bodies involved in these activities. |
+ |03000000| disaster and accident | Man made and natural events resulting in loss of life or injury to living creatures and/or damage to inanimate objects or property.|
+ |04000000| economy, business and finance | All matters concerning the planning, production and exchange of wealth.|
+ |05000000| education | All aspects of furthering knowledge of human individuals from birth to death.|
+ |06000000| environmental issue | All aspects of protection, damage, and condition of the ecosystem of the planet earth and its surroundings.|
+ |07000000| health | All aspects pertaining to the physical and mental welfare of human beings.|
+ |08000000| human interest | Lighter items about individuals, groups, animals or objects.|
+ |09000000| labor | Social aspects, organizations, rules and conditions affecting the employment of human effort for the generation of wealth or provision of services and the economic support of the unemployed.|
+ |10000000| lifestyle and leisure | Activities undertaken for pleasure, relaxation or recreation outside paid employment, including eating and travel.|
+ |11000000| politics | Local, regional, national and international exercise of power, or struggle for power, and the relationships between governing bodies and states.|
+ |12000000| religion and belief | All aspects of human existence involving theology, philosophy, ethics and spirituality.|
+ |13000000| science and technology | All aspects pertaining to human understanding of nature and the physical world and the development and application of this knowledge.|
+ |14000000| social issue | Aspects of the behavior of humans affecting the quality of life.|
+ |15000000| sport | Competitive exercise involving physical effort. Organizations and bodies involved in these activities.|
+ |16000000| unrest, conflicts and war | Acts of socially or politically motivated protest and/or violence.|
+ |17000000| weather | The study, reporting and prediction of meteorological phenomena.|
+

+ + As an example of subjects at the three levels, the list below zooms on the subject "education".

+ + | code | Subject | Description | + |:------:|:----------------------:|-----------------------------------| + |05000000| education | All aspects of furthering knowledge of human individuals from birth to death| + |05001000| Adult education | Education provided for older students outside the usual age groups of 5-25| + |05002000| Further education | Any form of education beyond basic education of several levels| + |05003000| parent organization | Groups of parents set up to support schools| + |05004000| preschool | Education for children under the national compulsory education age| + |05005000| school | A building or institution in which education of various sorts is provided| + |05005001| elementary schools | Schools usually of a level from kindergarten through 11 or 12 years of age| + |05005002| middle schools | Transitional school between elementary and high school, 12 through 13 years of age| + |05005003| high schools | Pre-college/ university level education 14 to 17 or 18 years of age, called freshman, sophomore, junior and senior| + |05006000| teachers union | Organization of teachers for collective bargaining and other purposes| + |05007000| university | Institutions of higher learning capable of providing doctorate degrees| + |05008000| upbringing | Lessons learned from parents and others as one grows up| + |05009000| entrance examination | Exams for entering colleges, universities, junior and senior high schools, and all other higher and lower education institutes, including cram schools, which help students prepare for exams for entry to prestigious schools.| + |05010000| teaching and learning| Either end of the education equation| + |05010001| students | People of any age in a structured environment, not necessarily a classroom, in order to learn something| + |05010002| teachers | People with knowledge who can impart that knowledge to others| + |05010003| curriculum | The courses offered by a learning institution and the regulation of those courses| + |05010004| test/examination | A measurement of student accomplishment| + |05011000| religious education | Instruction by any faith, in that faith or about other faiths, usually, but not always, conducted in schools run by religious bodies| + |05011001| parochial school | A school run by the Roman Catholic faith| + |05011002| seminary | A school of any faith specifically designed to train ministers| + |05011003| yeshiva | A school for training rabbis| + |05011004| madrasa | A school for teaching Islam| +
+ +- **`subjectCodesLabelled`** *[Optional ; Repeatable]*
+
+```json +"subjectCodesLabelled": [ + { + "code": "string", + "label": "string", + "description": "string" + } +] +``` +
+ +The `subjectCodes` element described above only allows for the capture of codes. To improve discoverability (by indexing important keywords), not only the subject codes but also the subject description should be provided. The IPTC standard does not provide an element that allows the subject label and description to be entered. The `subjectCodesLabelled` is an element that we added to our schema. Ideally, curators will enter the subject codes in the element `subjectCodes` to maintain full compatibility with the IPTC, and complement that information by also entering the codes and their description in the `subjectCodesLabelled` element.
+ + - **`code`** *[Optional ; Not Repeatable ; String]*
+ Specifies one or more subjects from the [IPTC Subject-NewsCodes](http://cv.iptc.org/newscodes/subjectcode) controlled vocabulary to categorize the image. Each Subject is represented as a string of 8 digits. The vocabulary consists of about 1400 terms organized into 3 levels (users can decide to use only the first, or the first two levels; the more detail is provided, the better the discoverability of the image). See examples in the table above.
+ - **`label`** *[Optional ; Not Repeatable ; String]*
+ The label of the subject. See table above for examples.
+ - **`description`** *[Optional ; Not Repeatable ; String]*
+ A more detailed description of the subject. See table above for examples.
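+
+As an illustration, the subject of an image on schooling could be captured as follows (hypothetical R sketch, using codes from the tables above):
+
+```r
+image_description <- list(
+  subjectCodes = list("05000000", "05010002"),   # education ; teachers
+  subjectCodesLabelled = list(
+    list(code        = "05000000",
+         label       = "education",
+         description = "All aspects of furthering knowledge of human individuals from birth to death"),
+    list(code        = "05010002",
+         label       = "teachers",
+         description = "People with knowledge who can impart that knowledge to others")
+  )
+)
+```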

+ + +- **`creatorNames`** *[Optional ; Repeatable ; String]*
+
+```json +"creatorNames": [ + "string" +] +``` +
+ +Enter details about the creator or creators of this image. The Image Creator must often be attributed in association with any use of the image. The Image Creator, Copyright Owner, Image Supplier and Licensor may be the same or different entities.
+
+
+- **`creatorContactInfo`** *[Optional ; Not Repeatable]*
+
+```json +"creatorContactInfo": { + "country": "string", + "emailwork": "string", + "region": "string", + "phonework": "string", + "weburlwork": "string", + "address": "string", + "city": "string", + "postalCode": "string" +} +``` +
+ +The creator’s contact information provides all necessary information to get in contact with the creator of this image and comprises a set of elements for proper addressing. Note that if the creator is also the licensor, his or her contact information should be provided in the `licensor` fields.
+ + - **`country`** *[Optional ; Not Repeatable ; String]*
+ The country name for the address of the person that created this image.
+ - **`emailwork`** *[Optional ; Not Repeatable ; String]*
+ The work email address(es) for the creator of the image. Multiple email addresses can be given, in which case they should be separated by a comma.
+ - **`region`** *[Optional ; Not Repeatable ; String]*
+ The state or province for the address of the creator of the image.
+ - **`phonework`** *[Optional ; Not Repeatable ; String]*
+ The work phone number(s) for the creator of the image. Use the international format including the country code, such as +1 (123) 456789. Multiple numbers can be given, in which case they should be separated by a comma.
+ - **`weburlwork`** *[Optional ; Not Repeatable ; String]*
+ The work web address for the creator of the image. Multiple addresses can be given, in which case they should be separated by a comma.
+ - **`address`** *[Optional ; Not Repeatable ; String]*
+ The address of the creator of the image. This may comprise a company name.
+ - **`city`** *[Optional ; Not Repeatable ; String]*
+ The city for the address of the person that created the image.
+ - **`postalCode`** *[Optional ; Not Repeatable ; String]*
+ Enter the local postal code for the address of the person who created the image.
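+
+A minimal R sketch, with fictitious creator and contact details, showing how this information might be entered:
+
+```r
+image_description <- list(
+  creatorNames = list("John Doe"),
+  creatorContactInfo = list(
+    country    = "Kenya",
+    city       = "Nairobi",
+    emailwork  = "jdoe@example.org",
+    phonework  = "+254 123 456789",
+    weburlwork = "http://www.example.org"
+  )
+)
+```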

+ + +- **`creditLine`** *[Optional ; Not Repeatable ; String]*
+The credit to person(s) and/or organization(s) required by the supplier of the image to be used when published. This is a free-text field.

+ + +- **`digitalSourceType`** *[Optional ; Not Repeatable ; String]*
+The type of the source of this digital image. One value should be selected from the IPTC controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license), which contains the following values:
+
+
+ | Value | Label | Description |
+ |:------------------:|:------------------:|-----------------------------------|
+ |digitalCapture | Original digital capture of a real life scene | The digital image is the original and only instance and was taken by a digital camera|
+ |negativeFilm | Digitized from a negative on film | The digital image was digitized from a negative on film or any other transparent medium|
+ |positiveFilm | Digitized from a positive on film | The digital image was digitized from a positive on a transparency or any other transparent medium |
+ |print | Digitized from a print on non-transparent medium | The digital image was digitized from an image printed on a non-transparent medium|
+ |softwareImage | Created by software | The digital image was created by computer software|
+
+
+ + +- **`jobid`** *[Optional ; Not Repeatable ; String]*
+Number or identifier for the purpose of improved workflow handling (control or tracking). This is a user created identifier related to the job for which the image is supplied.
+Note: As this identifier references a job of the receiver’s workflow it must first be issued by the receiver, then transmitted to the creator or provider of the news object and finally added by the creator to this field.
+ + +- **`jobtitle`** *[Optional ; Not Repeatable ; String]*
+The job title of the photographer (the person listed in `creatorNames`). The use of this element implies that the photographer information is provided (i.e., that `creatorNames` is not empty).
+ + +- **`source`** *[Optional ; Not Repeatable ; String]*
+The name of a person or party who has a role in the content supply chain. The `source` can be different from the `creator` and from the entities listed in the Copyright Notice.
+ + +- **`locationsShown`** *[Optional ; Repeatable]*
+
+```json +"locationsShown": [ + { + "name": "string", + "identifiers": [ + "http://example.com" + ], + "worldRegion": "string", + "countryName": "string", + "countryCode": "string", + "provinceState": "string", + "city": "string", + "sublocation": "string", + "gpsAltitude": 0, + "gpsLatitude": 0, + "gpsLongitude": 0 + } +] +``` +
+ +This block of elements is used to document the location shown in the image. This information should be provided with as much detail as possible. It contains elements that can be used to provide a "nested" description of the location, from a high geographic level (world region) down to a very specific location (city and sub-location within a city).
+ + - **`name`** *[Optional ; Not Repeatable ; String]*
+ The full name of the location.
+ - **`identifiers`** *[Optional ; Repeatable ; String]*
+ A globally unique identifier of the location shown.
+ - **`worldRegion`** *[Optional ; Not Repeatable ; String]*
+ The name of a world region. This element is at the first (top) level of the top-down geographical hierarchy.
+ - **`countryName`** *[Optional ; Not Repeatable ; String]*
+ The name of a country of a location. This element is at the second level of a top-down geographical hierarchy.
+ - **`countryCode`** *[Optional ; Not Repeatable ; String]*
+ The ISO code of the country mentioned in `countryName`.
+ - **`provinceState`** *[Optional ; Not Repeatable ; String]*
+ The name of a sub-region of the country - for example a province or a state name. This element is at the third level of a top-down geographical hierarchy.
+ - **`city`** *[Optional ; Not Repeatable ; String]*
+ The name of the city. This element is at the fourth level of a top-down geographical hierarchy.
+ - **`sublocation`** *[Optional ; Not Repeatable ; String]*
+ The sublocation name could either be the name of a sublocation to a city or the name of a well known location or (natural) monument outside a city. This element is at the fifth (lowest) level of a top-down geographical hierarchy.
+ - **`gpsAltitude`** *[Optional ; Not Repeatable ; Numeric]*
+ The altitude in meters of a WGS84 based position of this location.
+ - **`gpsLatitude`** *[Optional ; Not Repeatable ; Numeric]*
+ Latitude of a WGS84 based position of this location (in some cases, this information may be contained in the EXIF metadata).
+ - **`gpsLongitude`** *[Optional ; Not Repeatable ; Numeric]*
+ Longitude of a WGS84 based position of this location (in some cases, this information may be contained in the EXIF metadata).
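+
+The location shown could for example be documented as follows (hypothetical R sketch; names and coordinates are for illustration only):
+
+```r
+image_description <- list(
+  locationsShown = list(
+    list(name         = "Central market",
+         worldRegion  = "Latin America and the Caribbean",
+         countryName  = "Haiti",
+         countryCode  = "HTI",
+         city         = "Port-au-Prince",
+         gpsLatitude  = 18.55,     # approximate, illustrative coordinates
+         gpsLongitude = -72.34)
+  )
+)
+```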

+ + +- **`imageRating`** *[Optional ; Not Repeatable ; Numeric]*
+Rating of the image by its user or supplier. The value shall be -1 or in the range 0 to 5, where -1 indicates "rejected" and 0 "unrated". If an explicit value is not provided, a default value of 0 will be assumed.
+ + +- **`supplier`** *[Optional ; Repeatable]*
+
+```json +"supplier": [ + { + "name": "string", + "identifiers": [ + "http://example.com" + ] + } +] +``` +
+ + - **`name`** *[Optional ; Not Repeatable ; String]*
+ The name of the supplier of the image (person or organization).
+ - **`identifiers`** *[Optional ; Repeatable ; String]*
+ The identifier for the most recent supplier of this image. This will not necessarily be the creator or the owner of the image.

+ + +- **`copyrightNotice`** *[Optional ; Not Repeatable ; String]*
+
+![](./images/ReDoc_image_27.JPG){width=100%} +
+Contains any necessary copyright notice for claiming the intellectual property for this photograph; it should identify the current owner of the copyright for the photograph. Other entities, like the creator of the photograph, may be added in the corresponding field. Notes on usage rights should be provided in the `usageTerms` element. Example: ©2008 Jane Doe. If the copyright ownership must be expressed in a more controlled manner, use the `copyrightOwners` element described below instead of the `copyrightNotice` element.
+ + +- **`copyrightOwners`** *[Optional ; Repeatable]*
+Owner or owners of the copyright in the licensed image, described in a structured format (as an alternative to the element `copyrightNotice` described above). This block serves the same purpose of identifying the rights holder(s) for the image. The Copyright Owner, Image Creator and Licensor may be the same or different entities.
+
+```json
+"copyrightOwners": [
+  {
+    "name": "string",
+    "role": [
+      "http://example.com"
+    ],
+    "identifiers": [
+      "http://example.com"
+    ]
+  }
+]
+```
+
+ + - **`name`** *[Optional ; Not Repeatable ; String]*
+ The name of the owner of the copyright in the licensed image.
+ - **`role`** *[Optional ; Repeatable ; String]*
+ The role of the entity.
+ - **`identifiers`** *[Optional ; Repeatable ; String]*
+ The identifier of the owner of the copyright in the licensed image.
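+
+A brief, hypothetical R sketch showing how the copyright notice and copyright owner(s) might be documented:
+
+```r
+image_description <- list(
+  copyrightNotice = "(c) 2021 Example Photo Agency",
+  copyrightOwners = list(
+    list(name        = "Example Photo Agency",
+         identifiers = list("https://example.com/org/0001"))
+  )
+)
+```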

+ + +- **`usageTerms`** *[Optional ; Not Repeatable ; String]*
+The licensing parameters of the image expressed in free-text. Enter instructions on how this image can legally be used. The PLUS fields of the IPTC Extension can be used in parallel to express the licensed usage in more controlled terms.
+ + +- **`embdEncRightsExpr`** *[Optional ; Repeatable]*
+An embedded rights expression, encoded as a string, using a rights expression language. This block corresponds to the IPTC "Embedded Encoded Rights Expression (EERE)" structure, which provides the details of an embedded encoded rights expression.
+
+```json +"embdEncRightsExpr": [ + { + "encRightsExpr": "string", + "rightsExprEncType": "string", + "rightsExprLangId": "http://example.com" + } +] +``` +
+
+ - **`encRightsExpr`** *[Optional ; Not Repeatable ; String]*
+ The embedded rights expression, serialized (encoded) as a string, using any rights expression language.
+ - **`rightsExprEncType`** *[Optional ; Not Repeatable ; String]*
+ The encoding type of the rights expression, identified by an IANA Media Type.
+ - **`rightsExprLangId`** *[Optional ; Not Repeatable ; String]*
+ An identifier of the rights expression language used by the rights expression.
+More information on the Embedded Encoded Rights Expression (EERE) structure is available in the [IPTC Photo Metadata specification](https://www.iptc.org/std/photometadata/specification/IPTC-PhotoMetadata#embedded-encoded-rights-expression-eere-structure).
+
+ + +- **`linkedEncRightsExpr`** *[Optional ; Repeatable]*
+Link to Encoded Rights Expression.
+
+```json +"linkedEncRightsExpr": [ + { + "linkedRightsExpr": "http://example.com", + "rightsExprEncType": "string", + "rightsExprLangId": "http://example.com" + } +] +``` +
+ + - **`linkedRightsExpr`** *[Optional ; Not Repeatable ; String]*
+ The link to a web resource representing an encoded rights expression.
+ - **`rightsExprEncType`** *[Optional ; Not Repeatable ; String]*
+ The encoding type of the rights expression, identified by an IANA Media Type.
+ - **`rightsExprLangId`** *[Optional ; Not Repeatable ; String]*
+ The identifier of the rights expression language used by the rights expression.

+ + +- **`webstatementRights`** *[Optional ; Not Repeatable ; String]*
+URL referencing a web resource providing a statement of the copyright ownership and usage rights of the image.

+ + +- **`instructions`** *[Optional ; Not Repeatable ; String]*
+Any of a number of instructions from the provider or creator to the receiver of the image which might include any of the following: embargoes and other restrictions not covered by the "Rights Usage Terms" field; information regarding the original means of capture (scanning notes, colourspace info) or other specific text information that the user may need for accurate reproduction; additional permissions required when publishing; credits for publishing if they exceed the IIM length of the credit field.
+ + +- **`genres`** *[Optional ; Repeatable]*
+
+```json +"genres": [ + { + "cvId": "http://example.com", + "cvTermName": "string", + "cvTermId": "http://example.com", + "cvTermRefinedAbout": "http://example.com" + } +] +``` +
+ + - **`cvId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the Controlled Vocabulary the term is from.
+ - **`cvTermName`** *[Optional ; Not Repeatable ; String]*
+ The natural language name of the term from a Controlled Vocabulary.
+ - **`cvTermId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the term from a Controlled Vocabulary.
+ - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]*
+ Optionally enter a refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary. May be used to refine the generic about relationship.
+
+Overall, the `genres` element captures the artistic, style, journalistic, product or other genre(s) of the image, each expressed by a term from a Controlled Vocabulary.

+ + +- **`intellectualGenre`** *[Optional ; Not Repeatable ; String]*
+A term to describe the nature of the image in terms of its intellectual or journalistic characteristics (for example "actuality", "interview", "background", "feature", "summary", "wrapup" for journalistic genres, or "daybook", "obituary", "press release", "transcript" for news category related genres). It is advised to use terms from a controlled vocabulary such as the [NewsCodes Scheme](http://cv.iptc.org/newscodes/genre) published by the IPTC under a Creative Commons Attribution (CC BY) 4.0 license.
+
+
+ | Genre | Description |
+ |:------------------:|-----------------------------------|
+ |Actuality | Recording of an event|
+ |Advertiser Supplied | Content is supplied by an organization or individual that has paid the news provider for its placement|
+ |Advice | Letters and answers about readers' personal problems|
+ |Advisory | Recommendation on editorial or technical matters by a provider to its customers|
+ |On This Day | List of data, including birthdays of famous people and items of historical significance, for a given day|
+ |Analysis | Data and conclusions drawn by a journalist who has conducted in depth research for a story|
+ |Archival material | Material selected from the originator's archive that has been previously distributed|
+ |Background | Scene setting and explanation for an event being reported|
+ |Behind the Story | The content describes how a story was reported and offers context on the reporting|
+ |Biography | Facts and background about a person|
+ |Birth Announcement | News of newly born children|
+ |Current Events | Content about events taking place at the time of the report|
+ |Curtain Raiser | Information about the staging and outcome of an immediately upcoming event|
+ |Daybook | Items filed on a regular basis that are lists of upcoming events with time and place, designed to inform others of events for planning purposes.|
+ |Exclusive | Information content, in any form, that is unique to a specific information provider.|
+ |Fact Check | The news item looks into the truth or falsehood of another reported news item or assertion (for example a statement on social media by a public figure)|
+ |Feature | The object content is about a particular event or individual that may not be significant to the current breaking news.|
+ |Fixture | The object contains data that occurs often and predictably.|
+ |Forecast | The object contains opinion as to the outcome of a future event.|
+ |From the Scene | The object contains a report from the scene of an event.|
+ |Help us to Report | The news item is a call for readers to provide information that may help journalists to investigate a potential news story|
+ |History | The object content is based on previous rather than current events.|
+ |Horoscope | Astrological forecasts|
+ |Interview | The object contains a report of a dialogue with a news source that gives it significant voice (includes Q and A).|
+ |Listing of facts | Detailed listing of facts related to a topic or a story|
+ |Music | The object contains music alone.|
+ |Obituary | The object contains a narrative about an individual's life and achievements for publication after his or her death.|
+ |Opinion | The object contains an editorial comment that reflects the views of the author.|
+ |Polls and Surveys | The object contains numeric or other information produced as a result of questionnaires or interviews.|
+ |Press Release | The object contains promotional material or information provided to a news organisation.|
+ |Press-Digest | The object contains an editorial comment by another medium completely or in parts without significant journalistic changes.|
+ |Profile | The object contains a description of the life or activity of a news subject (often a living individual).|
+ |Program | A news item giving lists of intended events and time to be covered by the news provider. Each program covers a day, a week, a month or a year. The covered period is referenced as a keyword.|
+ |Question and Answer Session | The object contains the interviewer and subject questions and answers.|
+ |Quote | The object contains a one or two sentence verbatim in direct quote.|
+ |Raw Sound | The object contains unedited sounds.|
+ |Response to a Question | The object contains a reply to a question.|
+ |Results Listings and Statistics | The object contains alphanumeric data suitable for presentation in tabular form.|
+ |Retrospective | The object contains material that looks back on a specific (generally long) period of time such as a season, quarter, year or decade.|
+ |Review | The object contains a critique of a creative activity or service (for example a book, a film or a restaurant).|
+ |Satire | Uses exaggeration, irony, or humor to make a point; not intended to be understood as factual|
+ |Scener | The object contains a description of the event circumstances.|
+ |Side bar and supporting information | Related story that provides additional context or insight into a news event|
+ |Special Report | In-depth examination of a single subject requiring extensive research and usually presented at great length, either as a single item or as a series of items|
+ |Sponsored | Content is produced on behalf of an organization or individual that has paid the news provider for production and may approve content publication|
+ |Summary | Single item synopsis of a number of generally unrelated news stories|
+ |Supported | Content is produced with financial support from an organization or individual, yet not approved by the underwriter before or after publication|
+ |Synopsis | The object contains a condensed version of a single news item.|
+ |Text only | The object contains a transcription of text.|
+ |Transcript and Verbatim | A word for word report of a discussion or briefing|
+ |Update | The object contains an intraday snapshot (as for electronic services) of a single news subject.|
+ |Voicer | Content is only voice|
+ |Wrap | Complete summary of an event|
+ |Wrapup | Recap of a running story|
+
+
+ + +- **`artworkOrObjects`** *[Optional ; Repeatable]*
+This block provides a set of metadata elements to be used to describe the object or artwork shown in the image. +
+```json +"artworkOrObjects": [ +{ + "title": "string", + "contentDescription": "string", + "physicalDescription": "string", + "creatorNames": [ + "string" + ], + "creatorIdentifiers": [ + "string" + ], + "contributionDescription": "string", + "stylePeriod": [ + "string" + ], + "dateCreated": "2023-04-11T15:06:09Z", + "circaDateCreated": "string", + "source": "string", + "sourceInventoryNr": "string", + "sourceInventoryUrl": "http://example.com", + "currentCopyrightOwnerName": "string", + "currentCopyrightOwnerIdentifier": "http://example.com", + "copyrightNotice": "string", + "currentLicensorName": "string", + "currentLicensorIdentifier": "http://example.com" + } +] +``` +
+ + - **`title`** *[Optional ; Not Repeatable ; String]*
+ A human readable name of the object or artwork shown in the image.
+ - **`contentDescription`** *[Optional ; Not Repeatable ; String]*
+ A textual description of the content depicted in the object or artwork.
+ - **`physicalDescription`** *[Optional ; Not Repeatable ; String]*
+ A textual description of the physical characteristics of the artwork or object, without reference to the content depicted. This would be used to describe the object type, materials, techniques, and measurements.
+ - **`creatorNames`** *[Optional ; Repeatable ; String]*
+ The name of the person(s) (possibly an organization) who created the object or artwork shown in the image.
+ - **`creatorIdentifiers`** *[Optional ; Repeatable ; String]*
+ One or multiple globally unique identifier(s) for the artist who created the artwork or object shown in the image. This could be an identifier issued by an online registry of persons or companies. Make sure to enter these identifiers in the exact same sequence as the names entered in the field `creatorNames`.
+ - **`contributionDescription`** *[Optional ; Not Repeatable ; String]*
+ A description of any contributions made to the artwork or object. It should include the type, date and location of contribution, and details about the contributor.
+ - **`stylePeriod`** *[Optional ; Repeatable ; String]*
+ The style, historical or artistic period, movement, group, or school whose characteristics are represented in the artwork or object. It is advised to take the terms from a Controlled Vocabulary.
+ - **`dateCreated`** *[Optional ; Not Repeatable ; String]*
+ The date and optionally the time the artwork or object shown in the image was created.
+ - **`circaDateCreated`** *[Optional ; Not Repeatable ; String]*
+ The approximate date or range of dates associated with the creation and production of an artwork or object or its components.
+ - **`source`** *[Optional ; Not Repeatable ; String]*
+ The name of the organization or body holding and registering the artwork or object in this image for inventory purposes.
+ - **`sourceInventoryNr`** *[Optional ; Not Repeatable ; String]*
+ The inventory number issued by the organization or body holding and registering the artwork or object in the image.
+ - **`sourceInventoryUrl`** *[Optional ; Not Repeatable ; String]*
+ A reference URL for the metadata record of the inventory maintained by the Source.
+ - **`currentCopyrightOwnerName`** *[Optional ; Not Repeatable ; String]*
+ The name of the current owner of the copyright of the artwork or object.
+ - **`currentCopyrightOwnerIdentifier`** *[Optional ; Not Repeatable ; String]*
+ A globally unique identifier for the current copyright owner e.g. issued by an online registry of persons or companies.
+ - **`copyrightNotice`** *[Optional ; Not Repeatable ; String]*
+ Any necessary copyright notice for claiming the intellectual property of the artwork or object in the image; it should identify the current owner of the copyright of this work with associated intellectual property rights.
+ - **`currentLicensorName`** *[Optional ; Not Repeatable ; String]*
+ Name of the current licensor of the artwork or object.
+ - **`currentLicensorIdentifier`** *[Optional ; Not Repeatable ; String]*
+ A globally unique identifier for the current licensor e.g. issued by an online registry of persons or companies.
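+
+As an illustration, an artwork shown in an image could be documented as follows (all values are fictitious):
+
+```r
+image_description <- list(
+  artworkOrObjects = list(
+    list(title               = "Fishing boats at dawn",
+         contentDescription  = "Oil painting of fishing boats leaving a harbor",
+         physicalDescription = "Oil on canvas, 60 x 90 cm",
+         creatorNames        = list("Jane Doe"),
+         dateCreated         = "1962",
+         source              = "Example National Museum",
+         sourceInventoryNr   = "INV-1962-017",
+         copyrightNotice     = "(c) Estate of Jane Doe")
+  )
+)
+```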

+ + +- **`personInImageNames`** *[Optional ; Repeatable ; String]*
+
+```json +"personInImageNames": [ + "string" +] +``` +
+
+The name(s) of the person(s) shown in the image, provided as a simple list of names (one string per person). More structured information on the persons shown can be provided in the `personsShown` element described below.

+ + +- **`personsShown`** *[Optional ; Repeatable]*
+Details about the person(s) shown in the image. It is not required to describe every person shown; provide only the details that can be recognized.
+
+```json +"personsShown": [ + { + "name": "string", + "description": "string", + "identifiers": [ + "http://example.com" + ], + "characteristics": [ + { + "cvId": "http://example.com", + "cvTermName": "string", + "cvTermId": "http://example.com", + "cvTermRefinedAbout": "http://example.com" + } + ] + } +] +``` +
+ + - **`name`** *[Optional ; Not Repeatable ; String]*
+ The name of a person shown in the image.
+ - **`description`** *[Optional ; Not Repeatable ; String]*
+ A textual description of the person. For example, you may include actions taken, emotional expressions shown and more.
+ - **`identifiers`** *[Optional ; Repeatable ; String]*
+ Globally unique identifiers of the person, such as those from [WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page).
+ - **`characteristics`** *[Optional ; Repeatable]*
+ A property or trait of the person, provided as a term selected from a Controlled Vocabulary.
+ - **`cvId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the Controlled Vocabulary the term is from.
+ - **`cvTermName`** *[Optional ; Not Repeatable ; String]*
+ The natural language name of the term from a Controlled Vocabulary.
+ - **`cvTermId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the term from a Controlled Vocabulary.
+ - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]*
+ The refined 'about' relationship of the term with the content. Optionally enter a refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary.
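+
+For example (fictitious person and identifier), the persons shown could be documented as follows:
+
+```r
+image_description <- list(
+  personsShown = list(
+    list(name        = "Jane Doe",
+         description = "Health worker administering a vaccine",
+         identifiers = list("https://www.wikidata.org/entity/Q00000000"),
+         characteristics = list(
+           list(cvTermName = "health worker")
+         ))
+  )
+)
+```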

+ + +- **`modelAges`** *[Optional ; Repeatable ; Numeric]*
+
+```json +"modelAges": [ + 0 +] +``` +
+ +Age of the human model(s) at the time the image was taken. Be aware of any legal implications of providing ages for young models. Ages below 18 years should not be included.

+ + +- **`additionalModelInfo`** *[Optional ; Not Repeatable ; String]*
+Information about other facets of the model(s).

+ + +- **`minorModelAgeDisclosure`** *[Optional ; Not Repeatable ; String]*
+The age of the youngest model pictured in the image, at the time the image was created. This information is not intended to be displayed publicly; it is intended to be used as a filter for inclusion/exclusion of images in catalogs and dissemination processes.
+ + +- **`modelReleaseDocuments`** *[Optional ; Repeatable ; String]*
+
+```json +"modelReleaseDocuments": [ + "string" +] +``` +
+ +Identifier associated with each Model Release.

+ + +- **`modelReleaseStatus`** *[Optional ; Not Repeatable]*
+
+```json +"modelReleaseStatus": { + "cvId": "http://example.com", + "cvTermName": "string", + "cvTermId": "http://example.com", + "cvTermRefinedAbout": "http://example.com" +} +``` +
+
+Summarizes the availability and scope of model releases authorizing the usage of the likenesses of persons appearing in the photograph. One value should be selected from a controlled vocabulary.
+
+ - **`cvId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the Controlled Vocabulary the term is from.
+ - **`cvTermName`** *[Optional ; Not Repeatable ; String]*
+ The natural language name of the term from a Controlled Vocabulary.
+ - **`cvTermId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the term from a Controlled Vocabulary.
+ - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]*
+ The refined 'about' relationship of the term with the content. Optionally enter a refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary. May be used to refine the generic about relationship.

+ + +- **`organisationInImageCodes`** *[Optional ; Repeatable ; String]*
+
+```json +"organisationInImageCodes": [ + "string" +] +``` +
+ +The code, extracted from a controlled vocabulary, used to identify the organization or company featured in the image. For example a stock ticker symbol may be used. Enter an identifier for the controlled vocabulary, then a colon, and finally the code from the vocabulary assigned to the organization (e.g. nasdaq:companyA)
+ + +- **`organisationInImageNames`** *[Optional ; Repeatable ; String]*
+
+```json +"organisationInImageNames": [ + "string" +] +``` +
+ +The name of the organization or company which is featured in the image.
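+
+A short, hypothetical example in R:
+
+```r
+image_description <- list(
+  organisationInImageNames = list("Example Corporation"),
+  organisationInImageCodes = list("nasdaq:EXMPL")   # <vocabulary>:<code>, hypothetical
+)
+```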

+ + +- **`productsShown`** *[Optional ; Repeatable]*
+Details about a product shown in the image. +
+```json +"productsShown": [ + { + "description": "string", + "gtin": "string", + "name": "string" + } +] +``` +
+ + - **`description`** *[Optional ; Not Repeatable ; String]*
+ A textual description of the product.
+ - **`gtin`** *[Optional ; Not Repeatable ; String]*
+ The [Global Trade Item Number (GTIN)](https://www.gs1.org/standards/id-keys/gtin) of the product (GTIN-8 to GTIN-14 codes can be used).
+ - **`name`** *[Optional ; Not Repeatable ; String]*
+ The name of the product.
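+
+For example (hypothetical product and GTIN):
+
+```r
+image_description <- list(
+  productsShown = list(
+    list(name        = "Fortified infant cereal, 250g box",
+         gtin        = "00012345678905",   # hypothetical GTIN
+         description = "Packaged fortified cereal distributed under the nutrition program")
+  )
+)
+```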

+ + +- **`maxAvailHeight`** *[Optional ; Not Repeatable ; Numeric]*
+The maximum available height in pixels of the original photo from which this photo has been derived by downsizing.

+ + +- **`maxAvailWidth`** *[Optional ; Not Repeatable ; Numeric]*
+The maximum available width in pixels of the original photo from which this photo has been derived by downsizing.

+ + +- **`propertyReleaseStatus`** *[Optional ; Not Repeatable]*
+
+```json +"propertyReleaseStatus": { + "cvId": "http://example.com", + "cvTermName": "string", + "cvTermId": "http://example.com", + "cvTermRefinedAbout": "http://example.com" +} +``` +
+ +This summarizes the availability and scope of property releases authorizing usage of the properties appearing in the photograph. One value should be selected from a controlled vocabulary. It is recommended to apply the value PR-UPR very carefully and to check the wording of the property release thoroughly before applying it.
+ - **`cvId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the Controlled Vocabulary the term is from.
+ - **`cvTermName`** *[Optional ; Not Repeatable ; String]*
+ The natural language name of the term from a Controlled Vocabulary.
+ - **`cvTermId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the term from a Controlled Vocabulary.
+ - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]*
+ Refined 'about' relationship of the CV-Term. The refined 'about' relationship of the term with the content. Optionally enter a refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary.

+ + +- **`propertyReleaseDocuments`** *[Optional ; Repeatable ; String]*
+
+```json +"propertyReleaseDocuments": [ + "string" +] +``` +
+Optional identifier associated with each Property Release.
+ + +- **`aboutCvTerms`** *[Optional ; Repeatable]*
+
+```json +"aboutCvTerms": [ + { + "cvId": "http://example.com", + "cvTermName": "string", + "cvTermId": "http://example.com", + "cvTermRefinedAbout": "http://example.com" + } +] +``` +
+ +One or more topics, themes or entities the content is about, each one expressed by a term from a controlled vocabulary.
+ - **`cvId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the Controlled Vocabulary the term is from.
+ - **`cvTermName`** *[Optional ; Not Repeatable ; String]*
+ The natural language name of the term from a Controlled Vocabulary.
+ - **`cvTermId`** *[Optional ; Not Repeatable ; String]*
+ The globally unique identifier of the term from a Controlled Vocabulary.
+ - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]*
+ Refined 'about' relationship of the CV-Term. The refined 'about' relationship of the term with the content. Optionally enter a refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a Controlled Vocabulary.
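+
+As an illustration, using the IPTC Media Topic vocabulary (identifiers shown for illustration purposes):
+
+```r
+image_description <- list(
+  aboutCvTerms = list(
+    list(cvId       = "http://cv.iptc.org/newscodes/mediatopic/",
+         cvTermName = "education",
+         cvTermId   = "http://cv.iptc.org/newscodes/mediatopic/05000000")
+  )
+)
+```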

+ + +:::note +The IPTC elements are followed by a small set of common elements: see `license`, `tags`, and `album` in section **`Additional elements`**. +::: + + +### Dublin Core option + +We introduced the Dublin Core Metadata Initiative (DCMI) specification in chapter 3 - Documents. It contains 15 core elements, which are generic and versatile enough to be used for documenting different types of resources. Other elements can be added to the specification to increase its relevancy for specific uses. In the schema we recommend for the documentation of publications, we added elements inspired by the MARC 21 standard. We take a similar approach for the use of the Dublin Core for documenting images, by adding elements inspired by the [ImageObject](https://schema.org/ImageObject) schema from schema.org to the 15 elements. + +The fifteen elements, with their definition extracted from the [Dublin Core website](https://dublincore.org/), are the following: + +| Element name | Description | +| -------------------- | ---------------------------------------------------------------- | +| identifier | An unambiguous reference to the resource within a given context. | +| type | The nature or genre of the resource. | +| title | A name given to the resource. | +| description | An account of the resource. | +| subject | The topic of the resource. | +| creator | An entity primarily responsible for making the resource. | +| contributor | An entity responsible for making contributions to the resource. | +| publisher | An entity responsible for making the resource available. | +| date | A point or period of time associated with an event in the life cycle of the resource. | +| coverage | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.| +| format | The file format, physical medium, or dimensions of the resource. | +| language | A language of the resource. | +| relation | A related resource. | +| rights | Information about rights held in and over the resource. | +| source | A related resource from which the described resource is derived. | + +We do not use the `identifier` element, as we already have a unique identifier in the common element `idno`. + +We added the following elements to the schema, which are not part of the core list of the DCMI: + +- identifiers +- caption +- keywords +- topics +- country +- gps (latitude, longitude, altitude) +- note + +:::note +The common additional elements `license`, `album` and `tags` also complement the DCMI metadata (see section **`Additional elements`**). +::: + + +We describe below how DCMI elements are used to document images. + +**`dcmi`** *[Optional, Not repeatable]*
+Users of the schema will choose either IPTC or Dublin Core (DCMI), not both, to document their images. If the choice is DCMI, the elements under `dcmi` will be used.
+
+```json +"dcmi": { + "type": "image", + "title": "string", + "caption": "string", + "description": "string", + "topics": [], + "keywords": [], + "creator": "string", + "contributor": "string", + "publisher": "string", + "date": "string", + "country": [], + "coverage": "string", + "gps": {}, + "format": "string", + "languages": [], + "relations": [], + "rights": "string", + "source": "string", + "note": "string" +} +``` +
+ + +- **`type`** *[Required, Not Repeatable, String]*
+The Dublin Core schema is flexible and versatile, and can be used to document different types of resources. This element is used to document the type of resource being documented. The DCMI provides a list of suggested categories, including "image" which is the relevant type to be entered here. Some users may want to be more specific in the description of the type of resource, for example distinguishing color from black & white images. This distinction should not be made in this element; another element can be used for such purpose (like tags and tag groups). + + +- **`title`** *[Optional, Not Repeatable, String]*
+The title of the photo. + + +- **`caption`** *[Optional, Not Repeatable, String]*
+A caption for the photo. + + +- **`description`** *[Optional, Not Repeatable, String]*
+A brief description of the content depicted in the image. This element will typically provide more detailed information than the title or caption. Note that other elements can be used to provide a more specific and "itemized" description of an image; the element `keywords` for example can be used to list labels associated with an image (possibly generated in an automated manner using machine learning tools). + + +- **`topics`** *[Optional ; Repeatable]*
+The `topics` field indicates the broad substantive topic(s) that the image represents. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the [Council of European Social Science Data Archives (CESSDA) thesaurus](https://vocabularies.cessda.eu/vocabulary/TopicClassification).
+
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`id`** *[Optional ; Not repeatable ; String]*
+ The unique identifier of the topic. It can be a sequential number, or the ID of the topic in a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The label of the topic associated with the data. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ When a hierarchical (nested) controlled vocabulary is used, the `parent_id` field can be used to indicate a higher-level topic to which this topic belongs. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name of the controlled vocabulary used, if any. + - **`uri`**
+ A link to the controlled vocabulary mentioned in field `vocabulary'.

+ + +- **`keywords`** *[Optional ; Repeatable]*
+Words or phrases that describe salient aspects of an image content. Can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed. Keywords should be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
+
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; String ; Non repeatable]*
+ Keyword (or phrase). Keywords summarize the content or subject matter of the image. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ Controlled vocabulary from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI of the controlled vocabulary used, if any.

+ + +- **`creator`** *[Optional, Not Repeatable, String]*
+The name of the person (or organization) who has taken the photo or created the image. + + +- **`contributor`** *[Optional, Not Repeatable, String]*
+The contributor could be a person or an organization, possibly a sponsoring organization.
+
+
+- **`publisher`** *[Optional, Not Repeatable, String]*
+The person or organization who publishes the image.
+
+
+- **`date`** *[Optional, Not Repeatable, String]*
+The date when the photo was taken / the image was created, preferably entered in ISO 8601 format. + + +- **`country`** *[Optional, Repeatable]*
+
+```json +"country": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Optional, Not Repeatable, String]*
+ The name of the country/economy where the photo was taken. + - **`code`** *[Optional, Not Repeatable, String]*
+ The code of the country/economy mentioned in `name`. This will preferably be the ISO country code.

+ + +- **`coverage`** *[Optional, Not Repeatable, String]*
+In the Dublin Core, the coverage can be either temporal or geographic. In the use of the schema, `coverage` is used to document the geographic coverage of the image. This element complements the `country` element, and allows more specific information to be provided. + + +- **`gps`** *[Optional, Not Repeatable]*
+The geographic location where the photo was taken. Some digital cameras equipped with GPS can, when the option is activated, capture and store in the EXIF metadata the exact geographic location where the photo was taken. +
+```json +"gps": { + "latitude": -90, + "longitude": -180, + "altitude": 0 +} +``` +
+ + - **`latitude`** *[Optional, Not Repeatable, String]*
+ The latitude of the geographic location where the photo was taken. + - **`longitude`** *[Optional, Not Repeatable, String]*
+ The longitude of the geographic location where the photo was taken. + - **`altitude`** *[Optional, Not Repeatable, String]*
+ The altitude of the geographic location where the photo was taken.

+ + +- **`format`** *[Optional, Not Repeatable, String]*
+This refers to the image file format. It is typically expressed using a MIME format. + + +- **`languages`** *[Optional, Not repeatable, String]*
+The language(s) in which the image metadata (caption, title) is provided. This is a block of two elements (at least one must be provided for each language). +
+```json +"languages": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the language. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the language. The use of [ISO 639-2](https://www.loc.gov/standards/iso639-2/php/code_list.php) (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.

+ + +- **`relations`** *[Optional, Repeatable, String]*
+A list of related resources (images or of other type) +
+```json +"relations": [ + { + "name": "string", + "type": "isPartOf", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name (title) of the related resource. + - **`type`** *[Optional ; Not repeatable ; String]*
+ A brief description of the type of relation. A controlled vocabulary could be used. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link to the related resource being described.

+ + +- **`rights`** *[Optional, Not Repeatable, String]*
+The copyrights for the photograph. License is in another (common) element. + + +- **`source`** *[Optional, Not Repeatable, String]*
+A related resource from which the described image is derived. + + +- **`note`** *[Optional, Not Repeatable, String]*
+Any additional information on the image, not captured in one of the other metadata elements. + + +### Additional elements (IPTC and DCMI) + +Two elements are added to the list of `image_description` section of the schema. They apply both to the IPTC and to the DCMI options. + +- **`license`** *[Optional ; Repeatable]*
+The license under which the image is published. +
+```json +"license": [ + { + "name": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not Repeatable ; String]*
+ The name of the license. + - **`uri`** *[Optional ; Not Repeatable ; String]*
+ A URL where detailed information on the license / terms of use can be found.

+ + +- **`album`** *[Optional ; Repeatable]*
+If your catalog contains many images, you will likely want to group them by album. Albums are collections of images organized by theme, period, location, photographer, or other criteria. One image can belong to more than one album. Albums are thus "virtual collections". +
+```json +"album": [ + { + "name": "string", + "description": "string", + "owner": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not Repeatable ; String]*
+ A short name (label) given to the album. + - **`description`** *[Optional ; Not Repeatable ; String]*
+ A brief description of the album. + - **`owner`** *[Optional ; Not Repeatable ; String]*
+ Identification of the owner/custodian of the album. This can be the name of a person or an organization. + - **`uri`** *[Optional ; Not Repeatable ; String]*
+ A URL for the album.

+ + +- **`provenance`** *[Optional ; Repeatable]* +Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ + - **`origin_description`** *[Required ; Not repeatable]*
+ The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+ A boolean variable ("true" or "false"; "true by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element `idno` in the Study Description / Title Statement section) will be modified when published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ @@@@@@@ definition

+ + +- **`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. +
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ + - **`tag`** *[Required ; Not repeatable ; String]*
+ A user-defined tag. + - **`tag_group`** *[Optional ; Not repeatable ; String]*

+ A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.

+ + +### LDA topics + +**`lda_topics`** *[Optional ; Not repeatable]*
+ +
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+ +We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights). +
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
+
+The metadata element `lda_topics` is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition. + +The `lda_topics` element includes the following metadata fields:
+ +- **`model_info`** *[Optional ; Not repeatable]*
+Information on the LDA model. + + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + +- **`topic_description`** *[Optional ; Repeatable]*
+The topic composition of the document. + + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the document (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic. This is specific to the model, not to a document.
+
+
+
+```r
+lda_topics = list(
+
+   list(
+
+      model_info = list(
+        list(source      = "World Bank, Development Data Group",
+             author      = "A.S.",
+             version     = "2021-06-22",
+             model_id    = "Mallet_WB_75",
+             nb_topics   = 75,
+             description = "LDA model, 75 topics, trained on Mallet",
+             corpus      = "World Bank Documents and Reports (1950-2021)",
+             uri         = "")
+      ),
+
+      topic_description = list(
+
+        list(topic_id    = "topic_27",
+             topic_score = 32,
+             topic_label = "Education",
+             topic_words = list(list(word = "school",    word_weight = ""),
+                                list(word = "teacher",   word_weight = ""),
+                                list(word = "student",   word_weight = ""),
+                                list(word = "education", word_weight = ""),
+                                list(word = "grade",     word_weight = ""))),
+
+        list(topic_id    = "topic_8",
+             topic_score = 24,
+             topic_label = "Gender",
+             topic_words = list(list(word = "women",  word_weight = ""),
+                                list(word = "gender", word_weight = ""),
+                                list(word = "man",    word_weight = ""),
+                                list(word = "female", word_weight = ""),
+                                list(word = "male",   word_weight = ""))),
+
+        list(topic_id    = "topic_39",
+             topic_score = 22,
+             topic_label = "Forced displacement",
+             topic_words = list(list(word = "refugee",   word_weight = ""),
+                                list(word = "programme", word_weight = ""),
+                                list(word = "country",   word_weight = ""),
+                                list(word = "migration", word_weight = ""),
+                                list(word = "migrant",   word_weight = ""))),
+
+        list(topic_id    = "topic_40",
+             topic_score = 11,
+             topic_label = "Development policies",
+             topic_words = list(list(word = "development", word_weight = ""),
+                                list(word = "policy",      word_weight = ""),
+                                list(word = "national",    word_weight = ""),
+                                list(word = "strategy",    word_weight = ""),
+                                list(word = "activity",    word_weight = "")))
+
+      )
+
+   )
+
+)
+```
+
+The information provided by LDA models can be used to build a "filter by topic composition" tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic.
+
+![](./images/filter_by_topic_share_1.JPG){width=85%} +
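+
+A sketch of how such a filter could operate on the catalog side, assuming the topic scores have been extracted from the `lda_topics` metadata into a data frame (document identifiers, scores, and thresholds below are illustrative):
+
+```r
+# Hypothetical data frame: one row per document, one column per topic share (%)
+docs <- data.frame(
+  idno     = c("doc_001", "doc_002", "doc_003"),
+  topic_27 = c(32, 5, 18),   # Education
+  topic_8  = c(24, 60, 2)    # Gender
+)
+
+# Keep documents where Education >= 20% AND Gender >= 10%
+selected <- subset(docs, topic_27 >= 20 & topic_8 >= 10)
+selected$idno
+```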
+ + +### Embeddings + +**`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.
+
+![](./images/embedding_related_docs.JPG){width=100%}
+
+The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
+
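+A minimal sketch of how the proximity between two document vectors could be measured (here with cosine similarity), assuming both vectors were produced by the same embedding model; the vectors shown are made up for illustration:
+
+```r
+cosine_similarity <- function(a, b) {
+  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
+}
+
+# Toy vectors (a real model would return e.g. 100- or 200-dimension vectors)
+v1 <- c(0.12, -0.45, 0.88, 0.10)
+v2 <- c(0.10, -0.40, 0.90, 0.05)
+
+cosine_similarity(v1, v2)  # values close to 1 indicate semantically close documents
+```
+
+The `embeddings` element, used to store such vectors, is structured as follows:
+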
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": null + } +] +``` +
+ +The `embeddings` element contains four metadata fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+   The date the model was trained (or a version date for the model).
+ - **`vector`** *[Required ; Not repeatable ; Object]*
+   The numeric vector representing the document, provided as an array of numbers.

+   For example (a much-shortened vector): `[1, 4, 3, 5, 7, 9]`
+
+
+- **`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
+
+
+## Examples
+
+The examples below use the image metadata schema to document an image, and the external resources schema to publish links to the image files.
+
+### Example 1 - Using the IPTC option
+
+We selected an image from the World Bank Flickr collection. The image is available at https://www.flickr.com/photos/worldbank/8120361619/in/album-72157648790716931/
+Some metadata is provided with the photo.
+
+![](./images/Image_Example_01a.JPG){width=80%} +
+ +Metadata: + +
+![](./images/Image_Example_01b.JPG){width=80%} +
+ +The image is made available in multiple formats. We assume that we want to only provide access to the small, medium and original version of the image available in our NADA catalog. We also assume that instead of uploading the images to our catalog server to make them available directly from our catalog, we want to provide link to the images in the source repository (Flickr in this case). + +![](./images/Image_Example_01c.JPG){width=40%} + +**Using R** + + +```r +library(nadar) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_images/") +# Download image files from Flickr (different resolutions) + +download.file("https://live.staticflickr.com/4858/31953178928_77e4d7abae_o_d.jpg", + destfile = "img_001_original.jpg", mode = "wb") + +download.file("https://live.staticflickr.com/4858/31953178928_44abb01418_w_d.jpg", + destfile = "img_001_small.jpg", mode = "wb") + +# Generate image metadata (using the IPTC metadata elements) + +my_image <- list( + + metadata_information = list( + + producers = list(name = "OD"), + + production_date = "2022-01-10" + + ), + + idno = "image_001", + + image_description = list( + + iptc = list( + + photoVideoMetadataIPTC = list( + + title = "Man fetching water, Afghanistan", + + imageSupplierImageId = "Image_001", + + headline = "Residents get water", + + dateCreated = "2008-09-20T00:00:00Z", + + creatorNames = list("Sofie Tesson, Taimani Films"), + + description = "View of villagers, getting some water. + World Bank Emergency Horticulture and Livestock Project", + + digitalImageGuid = "72157648790716931", + + locationsShown = list( + list(countryCode = "AFG", countryName = "Afghanistan") + ), + + keywords = list("Water and sanitation"), + + @@@ as list? sceneCodes = list("010600, 011000, 011100, 011900"), + + sceneCodesLabelled = list( + + list(code = "010600", + label = "single", + description = "A view of only one person, object or animal."), + + list(code = "011000", + label = "general view", + description = "An overall view of the subject and its surrounds"), + + list(code = "011100", + label = "panoramic view", + description = "A panoramic or wide angle view of a subject and its surrounds"), + + list(code = "011900", + label = "action", + description = "Subject in motion") + + ), + + @@@ as list? 
subjectCodes = list("06000000, 09000000, 14000000"), + + subjectCodesLabelled = list( + + list(code = "06000000", + label = "environmental issue", + description = "All aspects of protection, damage, and condition of the ecosystem of the planet earth and its surroundings."), + + list(code = "09000000", + label = "labor", + description = "Social aspects, organizations, rules and conditions affecting the employment of human effort for the generation of wealth or provision of services and the economic support of the unemployed."), + + list(code = "14000000", + label = "social issue", + description = "Aspects of the behavior of humans affecting the quality of life.") + + ), + + source = "World Bank", + + supplier = list( + list(name = "World Bank") + ) + + ) + + ), + + license = list( + list(name = "Attribution 2.0 Generic (CC BY 2.0)", + uri = "https://creativecommons.org/licenses/by/2.0/") + ), + + album = list( + list(name = "World Bank Projects in Afghanistan") + ) + + ) + +) + +# Publish the image metadata in the NADA catalog + +image_add(idno = "image_001", + metadata = my_image, + repositoryid = "central", + overwrite = "yes", + published = 1, + thumbnail = thumb) + +# Provide a link to the images in the originating repository, and upload files +# (uploading files will make them available directly from the NADA catalog) + +external_resources_add( + idno = "image_001", + dctype = "pic", + title = "Man fetching water, Afghanistan (Flickr link)", + file_path = "https://www.flickr.com/photos/water_alternatives/31953178928/in/photolist-QFAoS5", + overwrite = "yes" +) + +external_resources_add( + idno = "image_001", + dctype = "pic", + title = "Man fetching water, Afghanistan (original size)", + file_path = "img_001_original.jpg", + overwrite = "yes" +) + +external_resources_add( + idno = "image_001", + dctype = "pic", + title = "Man fetching water, Afghanistan (small size)", + file_path = "img_001_small.jpg", + overwrite = "yes" +) +``` +

+ + + +**Result in NADA** + +The metadata, links, and images will be displayed in NADA. + +
+![](./images/ReDoc_images_34.JPG){width=80%} +
+ +

+Different views (mosaic, list, page views) are available. If the metadata contained a GPS location, a map showing the exact location where the photo was taken will also be displayed in the image page. + +
+![](./images/ReDoc_images_35.JPG){width=80%} +
+ +

+ + + + +**Using Python ** + + +```python +# Python script +``` + + + + + + +### Example 2 - Using the DCMI option + +We document the same image as in Example 1. + +**Using R ** + + +```r +library(nadar) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_images/") +# Download image files from Flickr (different resolutions) + +download.file("https://live.staticflickr.com/4858/31953178928_77e4d7abae_o_d.jpg", + destfile = "img_001_original.jpg", mode = "wb") + +download.file("https://live.staticflickr.com/4858/31953178928_44abb01418_w_d.jpg", + destfile = "img_001_small.jpg", mode = "wb") + +# Generate image metadata (using the DCMI metadata elements) + +pic_desc <- list( + + metadata_information = list( + + producers = list(name = "OD"), + + production_date = "2022-01-10" + + ), + + idno = "image_001", + + image_description = list( + + dcmi = list( + + identifier = "72157648790716931", + + type = "image", + + title = "Man fetching water, Afghanistan", + + caption = "Residents get water", + + description = "View of villagers, getting some water. + World Bank Emergency Horticulture and Livestock Project", + + subject = "", + + topics = list(), + + keywords = list( + list(name = "water and sanitation") + ), + + creator = "Sofie Tesson, Taimani Films", + + publisher = "World Bank", + + date = "2008-09-20T00:00:00Z", + + country = list(name = "Afghanistan", code = "AFG"), + + language = "English" + + ), + + license = list( + list(name = "Attribution 2.0 Generic (CC BY 2.0)", + uri = "https://creativecommons.org/licenses/by/2.0/")), + + album = list( + list(name = "World Bank Projects in Afghanistan") + ) + + ) + +) + +# Publish the image metadata in the NADA catalog + +image_add(idno = "image_001", + metadata = pic_desc, + repositoryid = "central", + overwrite = "yes", + published = 1, + thumbnail = thumb) + +# Provide a link to the images in the originating repository, and upload files +# (uploading files will make them available directly from the NADA catalog) + +external_resources_add( + idno = "image_001", + dctype = "pic", + title = "Man fetching water, Afghanistan (Flickr link)", + file_path = "https://www.flickr.com/photos/water_alternatives/31953178928/in/photolist-QFAoS5", + overwrite = "yes" +) + +external_resources_add( + idno = "image_001", + dctype = "pic", + title = "Man fetching water, Afghanistan (original size)", + file_path = "img_001_original.jpg", + overwrite = "yes" +) + +external_resources_add( + idno = "image_001", + dctype = "pic", + title = "Man fetching water, Afghanistan (small size)", + file_path = "img_001_small.jpg", + overwrite = "yes" +) +``` +

+ + + +**Using Python ** + + +```python +# Python script +``` diff --git a/11_chapter11_video.md b/11_chapter11_video.md new file mode 100644 index 0000000..54f9a57 --- /dev/null +++ b/11_chapter11_video.md @@ -0,0 +1,1087 @@ +--- +output: html_document +--- + +# Videos {#chapter11} + +
+![](./images/movie_logo.JPG){width=25%} +
+ +The schema we propose to document video files is a combination of elements extracted from the [Dublin Core Metadata Initiative](https://dublincore.org/) (DCMI) and from the [VideoObject (from schema.org)](https://schema.org/VideoObject) schemas. This schema is very similar to the schema we proposed for audio files (see chapter 10). + +The Dublin Core is a generic and versatile standard, which we also use (in an augmented form) for the documentation of *Documents* (Chapter 4), *Images* (Chapter 9), and *Audio* files (chapter 10). It contains 15 core elements, to which we added a selection of elements from VideoObject. We also included the elements `keywords`, `topics`, `tags`, `provenance` and `additional` that are found in other schemas documented in the Guide. + +The resulting metadata schema is simple, but it contains the elements needed to document the resources and their content in a way that will foster their discoverability in data catalogs. Compliance with the VideoObject elements contributes to search engine optimization, as search engines like Google, Bing and others "reward" metadata published in formats compatible with the schema.org recommendations. + +
+```json +{ + "repositoryid": "string", + "published": 0, + "overwrite": "no", + "metadata_information": {}, + "video_description": {}, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
+ +When published in a NADA catalog, the metadata related to video files will appear in a specific tab. + +
+![](./images/Video_NADA_tabs.JPG){width=100%} +
+ + +## Augmenting video metadata + +Videos typically come with limited metadata. To make them more discoverable, a transcription of the video content can be generated, stored, and indexed in the catalog. The metadata schema we propose includes an element `transcription` that can store transcriptions (and possibly their automatically-generated translations) in the video metadata. Word embedding models and topic models can be applied to the transcriptions to further augment the metadata. This will significantly increase the discoverability of the resource, and offer the possibility to apply semantic searchability on video metadata. + +Machine learning speech-to-text solutions are available (although not for all languages) to automatically generate transcriptions at a low cost. This includes commercial applications like [Whisper by openAI](https://openai.com/research/whisper), [Microsoft Azure](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/), or [Amazon Transcribe](https://aws.amazon.com/transcribe/pricing/). Open source solutions in Python also exist. + +Transcriptions of videos published on Youtube are available on-line (the example below was extracted from https://www.youtube.com/watch?v=Axs8NPVYmms). + +
+![](./images/ReDoc_videos_47.JPG){width=100%} +
+
+Note that some care must be taken when adding automatic speech transcriptions to your metadata, as the transcriptions are not always perfect and may return unexpected results. This will be the case when the sound quality is low, or when the video includes sections in an unknown language (see the example below, of a video in English that includes a brief segment in Somali; the speech-to-text algorithm may in such a case attempt to transcribe text it does not recognize, returning invalid information).
+
+![](./images/ReDoc_videos_48.JPG){width=100%} +
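+
+Whatever tool is used to produce the transcription, its output can then be stored in the `transcript` element described later in this chapter. A minimal sketch in R, assuming the transcription has been saved to a plain text file (file name and language values are illustrative):
+
+```r
+# Read a transcription produced by a speech-to-text tool (hypothetical file)
+transcript_text <- paste(readLines("transcriptions/video_001_en.txt", warn = FALSE),
+                         collapse = " ")
+
+my_video <- list(
+  # ... ,
+  video_description = list(
+    # ... ,
+    transcript = list(
+      list(language_name = "English",
+           language_code = "EN",
+           text          = transcript_text)
+    )
+    # ...
+  )
+)
+```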
+ + +## Schema description + +The first three elements of the schema (`repositoryid`, `published`, and `overwrite`) are not part of the video metadata. They are parameters used to indicate how the video metadata will be published in a NADA catalog. + +- **`repositoryid`** identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA. + +- **`published`**: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible. + +- **`overwrite`**: Indicates whether metadata that may have been previously uploaded for the same video can be overwritten. By default, the value is "no". It must be set to "yes" to overwrite existing information. Note that a video will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element `video_description > idno` is the same. + + +### Metadata information + +**`metadata_information`** *[Optional ; Not Repeatable]* +The metadata information set is used to document the video metadata (not the video itself). This provides information useful for archiving purposes. This set is optional. It is recommended however to enter at least the identification and affiliation of the metadata producer, and the date of creation of the metadata. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so metadata produced by one organization can be found in other data centers. +
+```json +"metadata_information": { + "title": "string", + "idno": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "production_date": "string", + "version": "string" +} +``` +
+ +- **`title`** *[Optional ; Not Repeatable ; String]*
+The title of the video.
+ +- **`idno`** *[Optional ; Not Repeatable ; String]*
+A unique identifier for the metadata document (unique in the catalog; ideally also unique globally). This is different from the video unique ID (see `idno` element in section *video_description* below), although it is good practice to generate identifiers that would maintain an easy connection between the metadata `idno` element and the video `idno` found under `video_description` (see below).
+ +- **`producers`** *[Optional ; Repeatable]*
+This refers to the producer(s) of the metadata, NOT to the producer(s) of the video. This could for example be the data curator in a data center.
+ - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the metadata producer/curator. An alternative to entering the name of the curator (e.g. for privacy protection purpose) is to enter the curator ID (see the element *abbr* below)
+ - **`abbr`** *[Optional ; Not repeatable ; String]*
+ Can be used to provide an ID of the metadata producer/curator.
+ - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ Affiliation of the metadata producer/curator.
+ - **`role`** *[Optional ; Not repeatable ; String]*
+ Specific role of the metadata producer/curator.
+ +- **`production_date`** *[Optional ; Not repeatable ; String]*
+Date the metadata (not the table) was produced.
+ +- **`version`** *[Optional ; Not repeatable ; String]*
+Version of the metadata (not version of the table).
+ + +### Video description + +**`video_description`** *[Required ; Not Repeatable]*
+The `video_description` section contains all elements that will be used to describe the video and its content. These are the elements that will be indexed and made searchable when published in a data catalog. + + +- **`idno`** *[Mandatory, Not Repeatable ; String]*
+ `idno` is an identification number that is used to uniquely identify a video in a catalog. It will also help users of the data cite the video properly. The best option is to obtain a [Digital Object Identifier (DOI)](https://www.doi.org/) for the video, as it will ensure that the ID is unique globally. Alternatively, it can be an identifier constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for a video (see `identifiers` below). This element maps to the “identifier” element in the Dublin Core. + + +- **`identifiers`** *[Optional ; Repeatable]*
+
+```json +"identifiers": [ + { + "type": "string", + "identifier": "string" + } +] +``` +
+
+  This element is used to enter video identifiers other than the `idno` element described above. It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in `idno` can be repeated here, which makes it possible to attach a "type" attribute to it.
+  - **`type`** *[Optional ; Not repeatable ; String]*
+  The type of unique identifier, e.g., "DOI".
+  - **`identifier`** *[Required ; Not repeatable ; String]*
+ The identifier.

+ + +- **`title`** *[Required ; Not repeatable ; String]*
+ + The title of the video. This element maps to the element *caption* in VideoObject. + + +- **`alt_title`** *[Optional ; Not repeatable ; String]*
+ + An alias for the video title. This element maps to the element *alternateName* in VideoObject. + + +- **`description`** *[Optional ; Not repeatable ; String]*
+ + A brief description of the video, typically about a paragraph long (around 150 to 250 words). This element maps to the element *abstract* in VideoObject. + + +- **`genre`** *[Optional ; Repeatable ; String]*
+ + The genre of the video, broadcast channel or group. This is a VideoObject element. A controlled vocabulary can be used. + + +- **`keywords`** *[Optional ; Repeatable]*
+
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of keywords that provide information on the core content of the video. Keywords provide a convenient solution to improve the discoverability of the video, as it allows terms and phrases not found elsewhere in the video metadata to be indexed and to make the video discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the [UNESCO Thesaurus](http://vocabularies.unesco.org/browser/thesaurus/en/). The list can combine keywords from multiple controlled vocabularies, and user-defined keywords. + - **`name`** *[Required ; Not repeatable ; String]*
+ The keyword itself. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The controlled vocabulary (including version number or date) from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the controlled vocabulary from which the keyword is extracted, if any.

+ + + + ```r + my_video <- list( + # ... , + video_description = list( + # ... , + + keywords = list( + + list(name = "Migration", + vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + + list(name = "Migrants", + vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + + list(name = "Refugee", + vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + + list(name = "Forced displacement"), + + list(name = "Internally displaced population (IDP)") + + ), + + # ... + ), + # ... + ) + ``` +
+ +- **`topics`** *[Optional ; Repeatable]*
+
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + Information on the topics covered in the video. A controlled vocabulary will preferably be used, for example the [CESSDA Topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification), a typology of topics available in 11 languages; or the [Journal of Economic Literature (JEL) Classification System](https://en.wikipedia.org/wiki/JEL_classification_codes), or the [World Bank topics classification](https://documents.worldbank.org/en/publication/documents-reports/docadvancesearch). Note that you may use more than one controlled vocabulary. This element is a block of five fields:
+ + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the topic, taken from a controlled vocabulary.
+ - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the topic, preferably taken from a controlled vocabulary.
+ - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
+ - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any.
+ - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.

+ + + + ```r + my_video <- list( + # ... , + video_description = list( + # ... , + + topics = list( + + list(name = "Demography.Migration", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(name = "Demography.Censuses", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(id = "F22", + name = "International Migration", + parent_id = "F2 - International Factor Movements and International Business", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "O15", + name = "Human Resources - Human Development - Income Distribution - Migration", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "O12", + name = "Microeconomic Analyses of Economic Development", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "J61", + name = "Geographic Labor Mobility - Immigrant Workers", + parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J") + + ), + + # ... + ), + ) + ``` +
+ +- **`persons`** *[Optional ; Repeatable]*
+
+```json +"persons": [ + { + "name": "string", + "role": "string" + } +] +``` +
+ + A list of persons who appear in the video.
+ - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the person.
+ - **`role`** *[Optional ; Not repeatable, String]*
+ The role of the person mentioned in `name`.

+ + + + ```r + my_video <- list( + metadata_information = list( + # ... + ), + video_description = list( + # ... , + + persons = list( + + list(name = "John Smith", + role = "Keynote speaker"), + + list(name = "Jane Doe", + role = "Debate moderator") + + ), + # ... + ) + ``` +
+ +- **`main_entity`** *[Optional ; Not repeatable ; String]*
+ + Indicates the primary entity described in the video. This element maps to the element `mainEntity` in VideoObject. + + +- **`date_created`** *[Optional, Not Repeatable ; String]*
+ + The date the video was created. It is recommended to enter the date in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the video is created refers to the date that the video was produced and considered ready for dissemination. + + +- **`date_published`** *[Optional, Not Repeatable ; String]*
+ + The date the video was published. It is recommended to use the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). + + +- **`version`** *[Optional, Not Repeatable ; String]*
+ + The version of the video refers to the published version of the video. + + +- **`status`** *[Optional ; Not repeatable, String]*
+ + The status of the video in terms of its stage in a lifecycle. A controlled vocabulary should be used. Example terms include {`Incomplete, Draft, Published, Obsolete`}. Some organizations define a set of terms for the stages of their publication lifecycle. This element maps to the element *creativeWorkStatus* in VideoObject. + + +- **`country`** *[Optional ; Repeatable]*
+
+```json +"country": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + The list of countries (or regions) covered by the video, if applicable. This refers to the content of the video, not to the country where the video was released. This is a repeatable block of two elements: + - **`name`** *[Required ; Not repeatable ; String]*
+ The country/region name. Note that many organizations have their own policies on the naming of countries/regions/economies/territories, which data curators will have to comply with. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The country/region code (entered as a string, even for numeric codes). It is recommended to use a standard list of countries and regions, such as the ISO country list ([ISO 3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes)). +
+ +- **`spatial_coverage`** *[Optional ; Not repeatable ; String]*
+
+  Indicates the place(s) which are depicted or described in the video. This element maps to the element `contentLocation` in VideoObject. It complements the `country` element and can be used to qualify the geographic coverage of the video as free text.
+
+
+- **`content_reference_time`** *[Optional ; Not repeatable ; String]*
+ + The specific time described by the video, for works that emphasize a particular moment within an event. This element maps to the element `contentReferenceTime` in VideoObject. + + +- **`temporal_coverage`** *[Optional ; Not repeatable ; String]*
+ + Indicates the period that the video applies to, i.e. that it describes, either as a DateTime or as a textual string indicating a time period in ISO 8601 time interval format. This element maps to the element `temporalCoverage` in VideoObject. + + +- **`recorded_at`** *[Optional ; Not repeatable ; String]*
+ + This element maps to the element `recordedAt` in VideoObject schema. It identifies the event where the video was recorded (e.g., a conference, or a demonstration). + + +- **`audience`** *[Optional ; Not repeatable ; String]*
+ +A brief description of the intended audience of the video, i.e. the group for whom it was created. + + +- **`bbox`** *[Optional ; Repeatable]*
+
+```json +"bbox": [ + { + "west": "string", + "east": "string", + "south": "string", + "north": "string" + } +] +``` +
+ + This element is used to define one or multiple bounding box(es), which are the (rectangular) fundamental geometric description of the geographic coverage of the video. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the video's geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. + - **`west`** *[Required ; Not repeatable ; String]*
+ West longitude of the box + - **`east`** *[Required ; Not repeatable ; String]*
+ East longitude of the box + - **`south`** *[Required ; Not repeatable ; String]*
+ South latitude of the box + - **`north`** *[Required ; Not repeatable ; String]*
+ North latitude of the box +
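+
+  For example, a sketch of a bounding box entry (the coordinates below are approximate values for Somalia, shown for illustration only; coordinates are entered as strings):
+
+  ```r
+  my_video <- list(
+    # ... ,
+    video_description = list(
+      # ... ,
+      bbox = list(
+        list(west  = "40.99",
+             east  = "51.41",
+             south = "-1.67",
+             north = "11.99")
+      )
+      # ...
+    )
+  )
+  ```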
+ +- **`language`** *[Optional, Repeatable]*
+
+```json +"language": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + Most videos will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed. For the language code, ISO codes will preferably be used. The language refers to the language in which the video is published. This is a block of two elements (at least one must be provided for each language): + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the language. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the language. The use of [ISO 639-2](https://www.loc.gov/standards/iso639-2/php/code_list.php) (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.

+ + +- **`creator`** *[Optional, Not repeatable ; String]*
+ +Organization or person who created/authored the video. + + +- **`production_company`** *[Optional, Not repeatable ; String]*
+ + The production company or studio responsible for the item. This element maps to the element *productionCompany* in VideoObject. + + +- **`publisher`** *[Optional, Not repeatable ; String]*
+
+  The organization or person who published the video (i.e., who made it publicly available). This element maps to the *publisher* element in VideoObject and in the Dublin Core.
+
+  ```r
+  my_video = list(
+    # ... ,
+    video_description = list(
+      # ... ,
+      publisher = "@@@@@",
+      # ...
+    )
+  )
+  ```
+
+ + +- **`repository`** *[Optional ; Not repeatable ; String]*
+ + The name of the repository (organization). + + +- **`contacts`** *[Optional, Repeatable]*
+Users of the video may need further clarification and information. This section may include the name-affiliation-email-URI of one or multiple contact persons. This block of elements will identify contact persons who can be used as resource persons regarding problems or questions raised by the user community. The URI attribute should be used to indicate a URN or URL for the homepage of the contact individual. The email attribute is used to indicate an email address for the contact individual. It is recommended to avoid putting the actual name of individuals. The information provided here should be valid for the long term. It is therefore preferable to identify contact persons by a title. The same applies for the email field. Ideally, a "generic" email address should be provided. It is easy to configure a mail server in such a way that all messages sent to the generic email address would be automatically forwarded to some staff members. +
+```json +"contacts": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "email": "string", + "telephone": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required, Not repeatable, String]*
+ Name of a person or unit (such as a data help desk). It will usually be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position. + - **`role`** *[Optional, Not repeatable, String]*
+ The specific role of `name`, in regards to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact. + - **`affiliation`** *[Optional, Not repeatable, String]*
+ Affiliation of the person/unit. + - **`email`** *[Optional, Not repeatable, String]*
+ E-mail address of the person. + - **`telephone`** *[Optional, Not repeatable, String]*
+ A phone number that can be called to obtain information or provide feedback on the table. This should never be a personal phone number; a corporate number (typically of a data help desk) should be provided. + - **`uri`** *[Optional, Not repeatable, String]*
+ A link to a website where contact information for `name` can be found.

+ + +- **`contributors`** *[Optional, Repeatable]*
+
+```json +"contributors": [ + { + "name": "string", + "affiliation": "string", + "abbr": "string", + "role": "string", + "uri": "string" + } +] +``` +
+ + Identifies the person(s) and/or organization(s) who contributed to the production of the video. The `role` attribute allows defining what the specific contribution of the identified person or organization was.
+ - **`name`** *[Optional, Not Repeatable ; String]*
+ The name of the contributor (person or organization). + - **`affiliation`** *[Optional, Not Repeatable ; String]*
+ The affiliation of the contributor. + - **`abbr`** *[Optional, Not Repeatable ; String]*
+ The abbreviation for the institution which has been listed as the affiliation of the contributor. + - **`role`** *[Optional, Not Repeatable ; String]*
+ The specific role of the contributor. This could for example be "Cameraman", "Sound engineer", etc. + - **`uri`** *[Optional, Not Repeatable ; String]*
+ A URI (link to a website, or email address) for the contributor.

+ + + ```r + my_video = list( + # ... , + video_description = list( + # ... , + contributors = list( + list( + name = "", + affiliation = "", + abbr = "", + role = "", + uri = "") + ), + # ... + ) + ) + ``` + + +- **`sponsors`** *[Optional ; Repeatable]*
+
+```json +"sponsors": [ + { + "name": "string", + "abbr": "string", + "grant": "string", + "role": "string" + } +] +``` +
+ + This element is used to list the funders/sponsors of the video. If different funding agencies financed different stages of the production process, use the "role" attribute to distinguish them. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the sponsor (person or organization) + - **`abbr`** *[Optional ; Not repeatable ; String]*
+ The abbreviation (acronym) of the sponsor. + - **`grant`** *[Optional ; Not repeatable ; String]*
+ The grant (or contract) number. + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the sponsor.

+ + +- **`translators`** *[Optional ; Repeatable]*
+
+```json +"translators": [ + { + "first_name": "string", + "initial": "string", + "last_name": "string", + "affiliation": "string" + } +] +``` +
+ + Organization or person who adapted the video to different languages. This element maps to the element *translator* in VideoObject. + - **`first_name`** *[Optional ; Not repeatable ; String]*
+ The first name of the translator. + - **`initial`** *[Optional ; Not repeatable ; String]*
+ The initials of the translator. + - **`last_name`** *[Optional ; Not repeatable ; String]*
+ The last name of the translator. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the translator.

+ + +- **`is_based_on`** *[Optional ; Not repeatable, String]*
+
+  A resource from which this video is derived, or of which it is a modification or adaptation. This element maps to the element *isBasedOn* in VideoObject.
+
+
+- **`is_part_of`** *[Optional ; Not repeatable, String]*
+ + Indicates another video that this video is part of. This element maps to the element *isPartOf* in VideoObject. + + +- **`relations`** *[Optional ; Repeatable, String]*
+
+```json +"relations": [ + "string" +] +``` +
+ + Defines, as a free text field, the relation between the video being documented and other resources. This is a Dublin Core element. + + +- **`video_provider`** *[Optional ; Not repeatable, String]*
+
+![](./images/ReDoc_videos_34.JPG){width=100%} +
+ + The person or organization who provides the video. This element maps to the element *provider* in VideoObject. + + +- **`video_url`** *[Optional ; Not repeatable, String]*
+ + URL of the video. This element maps to the element *url* in VideoObject. + + +- **`embed_url`** *[Optional ; Not repeatable, String]*
+ + A URL pointing to a player for a specific video. This element maps to the element *embedUrl* in VideoObject. For example, "https://www.youtube.com/embed/7Aif1xjstws" + + To be embedded, a video must be hosted on a video sharing platform like Youtube (www.youtube.com). To obtain the "embed link" from youtube, click on the "Share" button, then "Embed". In the result box, select the content of the element `src = `. + +
+ ![](./images/ReDoc_videos_46.JPG){width=100%} +
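+
+  For videos hosted on YouTube, the embed link can also be derived programmatically from the standard "watch" URL. A minimal sketch, using base R only:
+
+  ```r
+  video_url <- "https://www.youtube.com/watch?v=7Aif1xjstws"
+
+  # Replace the "watch?v=" pattern with "embed/" to obtain the embeddable player URL
+  embed_url <- sub("watch\\?v=", "embed/", video_url)
+  embed_url   # "https://www.youtube.com/embed/7Aif1xjstws"
+  ```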
+ + +- **`encoding_format`** *[Optional ; Not repeatable, String]*
+ + The video file format, typically expressed using a MIME format. This element corresponds to the "encodingFormat" element of VideoObject and maps to the element *format* of the Dublin Core. + + +- **`duration`** *[Optional ; Not repeatable, String]*
+ + The duration of the item (movie, audio recording, event, etc.) in ISO 8601 format. This element is a VideoObject element. + + ISO 8601 durations are expressed using the following format, where (n) is replaced by the value for each of the date and time elements that follows the (n). For example: (3)H means 3 hours. + + + :::note + **`P(n)Y(n)M(n)DT(n)H(n)M(n)S`** + + Where:
+ - P is the **Period designator** and is always placed at the beginning of the duration
+ - (n)Y represents the number of years
+ - (n)M represents the number of months
+ - (n)W represents the number of weeks
+ - (n)D represents the number of days
+ - T is the **Time designator** and always precedes the time components
+ - (n)H represents the number of hours
+ - (n)M represents the number of minutes
+ - (n)S represents the number of seconds
+ + For example, **P1Y2M20DT3H30M8S** represents a duration of one year, two months, twenty days, three hours, thirty minutes, and eight seconds. + + Date and time elements including their designator may be omitted if their value is zero, and lower-order elements may also be omitted for reduced precision. For example, "P23DT23H" and "P4Y" are both acceptable duration representations. + + As *M* can represent both Month and Minutes, the time designator *T* is used. For example, "P1M" is a one-month duration and "PT1M" is a one-minute duration. + + This information on the ISO 8601 was adapted from [wikipedia](https://en.wikipedia.org/wiki/ISO_8601) where more detailed information can be found. + ::: + + +- **`rights`** *[Optional ; Not repeatable, String]*
+
+  A textual description of the rights associated with the video. If a copyright applies, the three following elements will be used instead of this element.
+
+
+- **`copyright_holder`** *[Optional ; Not repeatable, String]*
+ + The party holding the legal copyright to the video. This element corresponds to the "copyrightHolder" element of VideoObject. + + +- **`copyright_notice`** *[Optional ; Not repeatable, String]*
+ + Text of a notice appropriate for describing the copyright aspects of the video, ideally indicating the owner of the copyright. This element corresponds to the "copyrightNotice" element of VideoObject. + + +- **`copyright_year`** *[Optional ; Not repeatable, String]*
+ + The year during which the claimed copyright for the video was first asserted. This element corresponds to the "copyrightYear" element of VideoObject. + + +- **`credit_text`** *[Optional ; Not repeatable, String]*
+ + This element can be used to credit the person(s) and/or organization(s) associated with a published video. This element corresponds to the "creditText" element of VideoObject. + + +- **`citation`** *[Optional ; Not repeatable, String]*
+
+  This element provides a required or recommended citation for the video.
+
+
+- **`transcript`** *[Optional ; Repeatable, String]*
+
+```json +"transcript": [ + { + "language_name": "string", + "language_code": "string", + "text": "string" + } +] +``` +
+ + The transcript of the video content, provided as a text. Note that if the text is very long, an alternative is to save it in a separate text file and to make it available in a data catalog as an external resource. + - **`language_name`** *[Optional ; Not repeatable ; String]*
+ The name of the language of the transcript. + - **`language_code`** *[Optional ; Not repeatable ; String]*
+ The code of the language of the transcript, preferably the ISO code. + - **`text`** *[Optional ; Not repeatable ; String]*
+ + The transcript itself. Adding the transcript in the metadata will make the video much more discoverable, as the content of the transcription can be indexed in catalogs. + + +- **`media`** *[Optional ; Repeatable ; String]*
+
+```json +"media": [ + "string" +] +``` +
+
+  A description of the media on which the video is stored (other than the online file format), e.g., "CD-ROM".
+
+
+- **`album`** *[Optional ; Repeatable]*
+
+```json +"album": [ + { + "name": "string", + "description": "string", + "owner": "string", + "uri": "string" + } +] +``` +
+ + When a video is published in a catalog containing many other videos, it may be desirable to organize them by album. Albums are collections of videos organized by theme, period, location, or other criteria. One video can belong to more than one album. Albums are "virtual collections". + - **`name`** *[Optional ; Not Repeatable ; String]*
+ The name (label) of the album. + - **`description`** *[Optional ; Not Repeatable ; String]*
+ A brief description of the album. + - **`owner`** *[Optional ; Not Repeatable ; String]*
+ The owner of the album. + - **`uri`** *[Optional ; Not Repeatable ; String]*
+ A link (URL) to the album. + + +- **`provenance`** *[Optional ; Repeatable]*
+
+  Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. These elements are not part of the DCMI or VideoObject standards.
+
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ + - **`origin_description`** *[Required ; Not repeatable]*
+ The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+ A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the video (the `idno` element) will be modified when the metadata are re-published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The datestamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ The namespace (typically a URI) identifying the metadata schema in which the harvested record was originally expressed.

+ + +- **`tags`** *[Optional ; Repeatable]*
+As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. +
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ + - **`tag`** *[Required ; Not repeatable ; String]*
+ A user-defined tag. + - **`tag_group`** *[Optional ; Not repeatable ; String]*

+ A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs. + + +- **`lda_topics`** *[Optional ; Not repeatable]*
+
+  We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+  Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of videos, this text will typically be a concatenation of metadata elements such as the title, description, keywords, and transcript. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).
+
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+ + The `lda_topics` element includes the following metadata fields. + + - **`model_info`** *[Optional ; Not repeatable]*
+ Information on the LDA model.
+ + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + - **`topic_description`** *[Optional ; Repeatable]*
+ The topic composition extracted from selected elements of the video metadata (typically, the title, description, keywords, and transcript).
+ + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the metadata (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic.

+ + +- **`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).
+
+  The word vectors do not have to be stored in the video metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
+
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": null + } +] +``` +
+ + The `embeddings` element contains four metadata fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+   The date the model was trained (or a version date for the model).
+ - **`vector`** *[Required ; Not repeatable ; Object]*
+   The numeric vector representing the video metadata.

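+
+A minimal sketch of how a pre-computed vector could be stored in the `embeddings` element; the model identifier, description, and vector values below are illustrative, and in practice the vector would be returned by the embedding model applied to the video metadata:
+
+```r
+# Vector assumed to have been generated by a pre-trained embedding model (toy values)
+doc_vector <- c(0.031, -0.852, 0.441, 0.220, -0.075)
+
+my_video <- list(
+  # ... ,
+  embeddings = list(
+    list(id          = "word2vec_wb_300",   # hypothetical model identifier
+         description = "Word embedding model trained on a development-related corpus",
+         date        = "2021-06-22",
+         vector      = doc_vector)
+  )
+  # ...
+)
+```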
+ + +- **`additional`** *[Optional ; Not repeatable]*
+The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. +
+![](./images/ReDoc_videos_45.JPG){width=100%} +
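+
+As an illustration, a sketch of how custom elements could be entered under `additional`; the element names and values are hypothetical, and their meaning should be documented by the data curator:
+
+```r
+my_video <- list(
+  # ... ,
+  additional = list(
+    internal_reference = "VDO-2021-0117",          # hypothetical internal archive number
+    review_status      = "approved by media team"  # hypothetical workflow status
+  )
+  # ...
+)
+```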
+ + +## Complete example + + +### In R + + +```r +library(nadar) + +# ---------------------------------------------------------------------------------- +# Enter credentials (API confidential key) and catalog URL +my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F) +set_api_key("my_keys[1,1") +set_api_url("https://.../index.php/api/") +set_api_verbose(FALSE) +# ---------------------------------------------------------------------------------- + +setwd("C:/my_videos") + +id = "MDA_VDO_001" + +thumb = "vdo_001.jpg" + +# Generate the metadata + +my_video = list( + + metadata_information = list( + title = "Mogadishu, Somalia: A Call for Help", + idno = id, + producers = list( + list(name = "John Doe", affiliation = "National Library") + ), + production_date = "2021-09-03" + ), + + video_description = list( + + idno = id, + + title = "Mogadishu, Somalia: A Call for Help", + + alt_title = "Somalia: Guterres in Mogadishu", + + date_published = "2011-09-01", + + description = "During a landmark visit, the United Nations High Commissioner for Refugees calls on the international community to rapidly increase aid to Somalia.", + + genre = "Documentary", + + persons = list( + list(name = "António Guterres", role = "High Commissioner for Refugees"), + list(name = "Fadhumo", role = "Somali internally displaced person (IDP)") + ), + + main_entity = "United Nations High Commission for Refugees (UNHCR), the UN Refugee Agency", + + country = list( + list(name = "Somalia", code = "SOM") + ), + + spatial_coverage = "Mogadishu, Somalia", + + content_reference_time = "2011-09", + + languages = list( + list(name = "English", code = "EN") + ), + + creator = "United Nations High Commission for Refugees (UNHCR)", + + video_url = "https://www.youtube.com/watch?v=7Aif1xjstws", + + embed_url = "https://www.youtube.com/embed/7Aif1xjstws", + + transcript = list( + list( + language = "English", + transcript = "Mogadishu is a dangerous place securityhas improved since al-shabaab militias + withdrew last month but not a lot despite the insecurity hundreds of thousands of Somalis + have been streaming into the capital from surrounding areas they're fleeing the worst famine + to strike the region in 60 years in a landmark visit the UN High Commissioner for Refugees + Antonio Gutierrez traveled to Mogadishu this week to visit with Somalis he urged the international + community to rapidly increase aid to people who have been through so much already makes us very emotional is to + feel that for 2020 as these people has been suffering the suffering enormously of course there is a large + responsibility of Somalis in the way things have happened but let's also recognize that international community + there sometimes also be part of the problem and not part of the solution some aid is getting through fatuma has + just been registered to receive assistance from UNHCR she left her home and is now seeking help in the capital + she is camped with thousands of others in a settlement not far from the shoreline UNHCR is providing plastic + sheeting and other supplies there are also food distributions there are a total of four hundred thousand displaced + people in Mogadishu 100,000 arrived in the past two months alone getting assistance to them despite the + dangers is an urgent priority otherwise settlements like these are certain to you" + ) + ), + + duration = "PT2M14S" # 2 minutes and 14 seconds + + ) + +) + +# Publish in the NADA catalog + +video_add(idno = id, + published = 1, + overwrite = "yes", + 
metadata = my_video, + thumbnail = thumb) +``` + +In NADA, the video will now appear in the "All" tab and in the "Videos" tab. + +
+ ![](./images/video_in_NADA.JPG){width=100%} +
+ +If the `embed_url` element was provided, the video can be played within the NADA page. + +
+ ![](./images/video_in_NADA_2.JPG){width=100%} +
+ + +### In Python + + +```python +# Python script +``` + diff --git a/12_chapter12_reproducible_scripts.md b/12_chapter12_reproducible_scripts.md new file mode 100644 index 0000000..3ca8c6d --- /dev/null +++ b/12_chapter12_reproducible_scripts.md @@ -0,0 +1,2321 @@ +--- +output: html_document +--- + +# Research projects and scripts {#chapter12} + +
+
+![](./images/script_logo.JPG){width=25%} +
+
+ + +## Rationale + +Documenting, cataloguing and disseminating **data** has the potential to increase the volume and diversity of data analysis. There is also much value in documenting, cataloguing and disseminating **data processing and analysis scripts**. Technological solutions such as GitHub, [Jupyter Notebooks or JupyterLab](https://jupyter.org/) facilitate the preservation and sharing of code, and enable collaborative work around data analysis. Coding style guides like the [Google style guides](https://google.github.io/styleguide/) and the [Guide to Reproducible Code in Ecology and Evolution](https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf) by the British Ecological Society help foster the usability, adaptability, and reproducibility of code. But these tools and guidelines do not fully address the issue of cataloguing and discoverability of data processing and analysis programs and scripts. We propose --as a complement to collaboration tools and style guides-- a metadata schema to document data analysis projects and scripts. The production of structured metadata will contribute not only to discoverability, but also to the reproducibility, replicability, and auditability of data analytics. + +There are multiple reasons to make reproducibility, replicability, and auditability of data analytics a component of a data dissemination system. This will: + +- Improve the **quality of research and analysis**. Public scrutiny enables contestability and independent quality control of the output of research and analysis; these are strong incentives for additional rigor in data analysis. +- Allow the **re-purposing or expansion of analysis** by the research community, thereby increasing the relevance, utility, and value of both the data and the analytical work. +- Strengthen the **reputation and credibility** of the analysis. +- Provide students and peers with rich **training materials**. +- In some cases, satisfy a **requirement** imposed by peer-reviewed journals or financial sponsors of research activities. For example, the [Data and Policy Code of the American Economic Association](https://www.aeaweb.org/journals/policies/data-code) (accessed on June 29, 2020) states that *It is the policy of the American Economic Association to publish papers only if the data and code used in the analysis are clearly and precisely documented, and access to the data and code is clearly and precisely documented and is non-exclusive to the authors. Authors of accepted papers that contain empirical work, simulations, or experimental work must provide, prior to acceptance, information about the data, programs, and other details of the computations sufficient to permit replication, as well as information about access to data and programs.* +- Contribute to **assuring the fairness of policy advice and interventions** resulting from data analysis. Data analysis may be used to identify or target the beneficiaries of policies and programs, or may contribute otherwise to the design and implementation of development policies and projects; in doing so, it also contributes to identifying populations to be excluded from these interventions. Errors and biases may be introduced into the analysis by accidental or intentional human error, by the algorithms themselves, or by flaws in the data. The analysis that informs such projects and policies should therefore be made auditable and contestable, i.e., documented and published. 
+ + +## Motivation for open analytics + +[Stodden et al (2013)](http://stodden.net/icerm_report.pdf) make a useful distinction between five levels of research openness: + +1. **Reviewable research**. The descriptions of the research methods can be independently assessed, and the results judged credible. This includes both traditional peer review and community review and does not imply reproducibility. +2. **Replicable research**. Tools are made available that would allow one to duplicate the results of the research, for example by running the authors' code to produce the plots shown in the publication. (Here tools might be limited in scope, e.g., only essential data or executables, and might only be made available to referees or only upon request.) +3. **Confirmable research**. The main conclusions of the research can be attained independently without the use of software provided by the author. (But using the complete description of algorithms and methodology provided in the publication and any supplementary materials.) +4. **Auditable research**. Sufficient records (including data and software) have been archived so that the research can be defended later if necessary or differences between independent confirmations resolved. The archive might be private. +5. **Open or Reproducible research**. This is auditable research made openly available. This comprises well-documented and fully open code and data, publicly available, that would allow one to (a) fully audit the computational procedure, (b) replicate and also independently reproduce the results of the research, and (c) extend the results or apply the method to new problems. + + +## Goal: discoverable code + +The goal is to let users search and filter scripts by title, author, software, method, country, etc., and to obtain links to the related analytical output and data. Example: to find a project that implemented multiple imputation in R for a study of poverty in Kenya, search for *poverty AND "multiple imputation"* and filter the results by software and country. + +Note that the code will also be "attached" to the output page (paper) and to the dataset page of the catalog, if these are available in the catalog. + +
+
+![image](https://user-images.githubusercontent.com/35276300/229812919-8a457692-310a-4095-80c2-e3bf202ecf21.png) +
+
+ +Provide access to scripts with detailed information, including software and libraries used, distribution license, IT requirements, datasets used, list of outputs, and more. + +
+
+![image](https://user-images.githubusercontent.com/35276300/229813050-6ab8d762-7e09-40ca-83b3-877f64caa9e4.png) +
+
+ + +## Schema description + +To make data processing and analysis scripts more discoverable and usable, we propose a metadata schema inspired by the schemas available to document datasets. The proposed schema contains two main blocks of metadata elements: the *document description* intended to document the metadata themselves (the term *document* refers to the file that will contain the metadata), and the *project description* used to document the research or analytical work and the related scripts. We also include in the schema the `tags`, `provenance`, and `additional` elements common to all schemas. + +
+```json +{ + "repositoryid": "string", + "published": 0, + "overwrite": "no", + "doc_desc": {}, + "project_desc": {}, + "provenance": [], + "tags": [], + "lda_topics": [], + "embeddings": [], + "additional": { } +} +``` +
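+
+As an illustration, the same structure can be prepared in R as a named list before each block is filled in (a minimal sketch; all values shown here are placeholders):
+
+```r
+# Skeleton of a project metadata object (placeholder values only);
+# the doc_desc and project_desc blocks are detailed in the next sections.
+my_project <- list(
+  published    = 0,        # keep unpublished until the metadata are reviewed
+  overwrite    = "no",
+  doc_desc     = list(),   # metadata about the metadata (see "Document description")
+  project_desc = list(),   # metadata on the project and its scripts (see "Project description")
+  tags         = list(),
+  additional   = list()
+)
+```
+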
+ + +### Document description + +**`doc_desc`** *[Optional ; Not repeatable]*
+The document description provides metadata about the metadata, i.e. it documents the research project metadata, not the project itself. This information is not needed to document the project; it is useful mainly for archiving purposes and for catalog administrators, and does not need to be displayed in the publicly-available catalog interface. This block is optional, but it is recommended to enter at least the identification of the metadata producer, her/his affiliation, and the date the metadata were created. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so the metadata produced by one organization may be found in other data centers (complying with standards and schemas is precisely intended to facilitate the interoperability of catalogs and automated information sharing). Keeping track of who documented a resource is thus useful. +
+```json +"doc_desc": { + "title": "string", + "idno": "string", + "producers": [ + { + "name": "string", + "abbr": "string", + "affiliation": "string", + "role": "string" + } + ], + "prod_date": "string", + "version": "string" +} +``` +
+ +- **`title`** *[Optional ; Not Repeatable ; String]*
+The title of the project. This will usually be the same as the element `title` in the project description section. + +- **`idno`** *[Optional ; Not Repeatable ; String]*
+A unique identifier for the metadata document. + +- **`producers`** *[Optional ; Repeatable]*
+A list of producers of the metadata (who may be but do not have to be the authors of the research project and scripts being documented). These can be persons or organizations. The following four elements are used to identify them and specify their specific role as and if relevant (this block of four elements is repeated for each contributor to the metadata): + + - **`name`** *[Optional ; Not Repeatable ; String]*
+ Name of the person or organization who documented the project. + - **`abbr`**: *[Optional ; Not Repeatable ; String]*
+ The abbreviation of the organization that is referenced under 'name' above. + - **`affiliation`** *[Optional ; Not Repeatable ; String]*
+ Affiliation of the person(s) or organization(s) who documented the project. + - **`role`** *[Optional ; Not Repeatable ; String]*
+ This attribute is used to distinguish different stages of involvement in the metadata production process.

+ +- **`prod_date`** *[Optional ; Not Repeatable ; String]*
+The date the metadata on this project was produced (not distributed or archived), preferably in ISO 8601 format (YYYY-MM-DD or YYYY-MM). + +- **`version`** *[Optional ; Not Repeatable ; String]*
+Documenting a research project is not a trivial exercise. It may happen that, having identified errors or omissions in the metadata or having received suggestions for improvement, a new version of the metadata is produced. This element is used to identify and describe the current version of the metadata. It is good practice to provide a version number, and information on what distinguishes this version from the previous one(s) if relevant. +
+ + + ```r + my_project = list( + doc_desc = list( + idno = "META_RP_001", + producers = list( + list(name = "John Doe", + affiliation = "National Data Center of Popstan") + ), + prod_date = "2020-12-27", + version = "Version 1.0 - Original version of the documentation provided by the author of the project" + ), + # ... + ) + ``` + +### Project description + +**`project_desc`** *[Required ; Not repeatable]*
+The project description contains the metadata related to the project itself. All efforts should be made to provide as much and as detailed information as possible. + +
+```json +"project_desc": { + "title_statement": {}, + "abstract": "string", + "review_board": "string", + "output": [], + "approval_process": [], + "project_website": [], + "language": [], + "production_date": "string", + "version_statement": {}, + "errata": [], + "process": [], + "authoring_entity": [], + "contributors": [], + "sponsors": [], + "curators": [], + "reviews_comments": [], + "acknowledgments": [], + "acknowledgment_statement": "string", + "disclaimer": "string", + "confidentiality": "string", + "citation_requirement": "string", + "related_projects": [], + "geographic_units": [], + "keywords": [], + "themes": [], + "topics": [], + "disciplines": [], + "repository_uri": [], + "license": [], + "copyright": "string", + "technology_environment": "string", + "technology_requirements": "string", + "reproduction_instructions": "string", + "methods": [], + "software": [], + "scripts": [], + "data_statement": "string", + "datasets": [], + "contacts": [] +} +``` +
+ +- **`title_statement`** *[Required ; Non repeatable]*
+The *title_statement* is a group of six elements, two of them (`idno` and `title`) mandatory. +
+```json +"title_statement": { + "idno": "string", + "identifiers": [ + { + "type": "string", + "identifier": "string" + } + ], + "title": "string", + "sub_title": "string", + "alternate_title": "string", + "translated_title": "string" +} +``` +
+ + - **`idno`** *[Required ; Not Repeatable ; String]*
+ A unique identifier for the project, which serves as a vital reference: a research project may give rise to surveys, scripts, tables and knowledge products that all refer to it. Define and apply a consistent identification scheme, do not include spaces in the ID, and use a system that guarantees uniqueness of the ID (e.g., a DOI or your own reference number). + - **`identifiers`** *[Optional ; Repeatable]*
+ This repeatable element is used to enter identifiers (IDs) other than the `idno` entered in the `title_statement`. It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in `idno` can (and in some cases should) be repeated here. The element `idno` does not provide a `type` parameter; repeating it in this section makes it possible to add that information. + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of unique ID, e.g. "DOI". + - **`identifier`** *[Required ; Not repeatable ; String]*
+ The identifier itself.

+ + - **`title`** *[Required ; Not Repeatable ; String]*
+ The title is the official name of the project as it may be stated in reports, papers or other documents. The title will in most cases be identical to the Document Title (see above). The title may correspond to the title of an academic paper, of a project impact evaluation, etc. Pay attention to capitalization in the title. + - **`sub_title`** *[Optional ; Not Repeatable ; String]*
+ A short subtitle for the project, optional and rarely used; it typically qualifies or rephrases the title. + - **`alternate_title`** *[Optional ; Not Repeatable ; String]*
+ An alternate title of the project. This would be any alternate title that would help discover the research project. In countries with more than one official language, a translation of the title may be provided. Likewise, the translated title may simply be a translation into English from a country's own language. + - **`translated_title`** *[Optional ; Not Repeatable ; String]*
+ A translated version of the title (this will be used for example when a catalog documents all entries in English, but wants to preserve the title of a project in its original language when the original language is not English).
+ + + + ```r + my_project = list( + # ... , + project_desc = list( + + title_statement = list( + idno = "RR_WB_2020_001", + identifiers = list( + list(type = "DOI", identifier = "XXX-XXX-XXXX") + ), + date = "2020", + title = "Predicting Food Crises - Econometric Model" + ), + + # ... + ), + # ... + ) + ``` +
+ + +- **`abstract`** *[Optional ; Non repeatable ; String]*
+The abstract should provide a clear summary of the purposes, objectives and content of the project. An abstract can make reference to the various outputs associated with the research project. + + Example extracted from https://microdata.worldbank.org/index.php/catalog/4218: + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + abstract = "Food price inflation is an important metric to inform economic policy but traditional sources of consumer prices are often produced with delay during crises and only at an aggregate level. This may poorly reflect the actual price trends in rural or poverty-stricken areas, where large populations reside in fragile situations. + This data set includes food price estimates and is intended to help gain insight in price developments beyond what can be formally measured by traditional methods. The estimates are generated using a machine-learning approach that imputes ongoing subnational price surveys, often with accuracy similar to direct measurement of prices. The data set provides new opportunities to investigate local price dynamics in areas where populations are sensitive to localized price shocks and where traditional data are not available.", + + # ... + ), + # ... + ) + ``` +
+ +- **`review_board`** *[Optional ; Non repeatable ; String]*
+Information on whether and when the project was submitted, reviewed, and approved by an institutional review board (or independent ethics committee, ethical review board (ERB), research ethics board, or equivalent). +
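+
+For example (a minimal sketch; the review board, reference number, and dates are hypothetical):
+
+```r
+my_project = list(
+  # ... ,
+  project_desc = list(
+    # ... ,
+    review_board = "Reviewed and approved by the Institutional Review Board of [Organization] on 2019-03-12 (approval reference IRB-2019-045)",
+    # ...
+  )
+)
+```
+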
+ +- **`output`** *[Optional ; Repeatable]*
+This element will describe and reference all substantial/intended products of the research project, which may include publications, reports, websites, datasets, interactive applications, presentations, visualizations, and others. An output may also be referred to as a "deliverable". +
+```json +"output": [ + { + "type": "string", + "title": "string", + "authors": "string", + "description": "string", + "abstract": "string", + "uri": "string", + "doi": "string" + } +] +``` +
+ +The `output` is a repeatable block of seven elements, used to document all output of the research project: + - **`type`** *[Optional ; Non repeatable]*
+ Type of output. The type of output relates to the media used to convey or communicate the intended results, findings or conclusions of the research project. A controlled vocabulary may be used for this field. The kind of content could be "Working paper", "Database", etc. + - **`title`** *[Required ; Non repeatable]*
+ Formal title of the output. Depending upon the kind of output, the title will vary in formality. + - **`authors`** *[Optional ; Non repeatable]*
+ Authors of the output; if there are multiple authors, they are all listed in this single text field. + - **`description`** *[Optional ; Non repeatable]*
+ Brief description of the output (NOT an abstract) + - **`abstract`** *[Optional ; Non repeatable]*
+ If the output consists of a document, the abstract will be entered here. + - **`uri`** *[Optional ; Non repeatable]*
+ A link where the output or information on the output can be found. + - **`doi`** *[Optional ; Non repeatable]* + Digital Object Identifier (DOI) of the output, if available.

+ + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + output = list( + + list(type = "working paper", + title = "Estimating Food Price Inflation from Partial Surveys", + authors = "Andrée, B. P. J.", + description = "World Bank Policy Research Working Paper", + abstract = "The traditional consumer price index is often produced at an aggregate level, using data from few, highly urbanized, areas. As such, it poorly describes price trends in rural or poverty-stricken areas, where large populations may reside in fragile situations. Traditional price data collection also follows a deliberate sampling and measurement process that is not well suited for monitoring during crisis situations, when price stability may deteriorate rapidly. To gain real-time insights beyond what can be formally measured by traditional methods, this paper develops a machine-learning approach for imputation of ongoing subnational price surveys. The aim is to monitor inflation at the market level, relying only on incomplete and intermittent survey data. The capabilities are highlighted using World Food Programme surveys in 25 fragile and conflict-affected countries where real-time monthly food price data are not publicly available from official sources. The results are made available as a data set that covers more than 1200 markets and 43 food types. The local statistics provide a new granular view on important inflation events, including the World Food Price Crisis of 2007–08 and the surge in global inflation following the 2020 pandemic. The paper finds that imputations often achieve accuracy similar to direct measurement of prices. The estimates may provide new opportunities to investigate local price dynamics in markets where prices are sensitive to localized shocks and traditional data are not available.", + uri = "http://hdl.handle.net/10986/36778"), + + list(type = "dataset", + title = "Monthly food price estimates", + authors = "Andrée, B. P. J.", + description = "A dataset of derived data, published as open data", + abstract = "Food price inflation is an important metric to inform economic policy but traditional sources of consumer prices are often produced with delay during crises and only at an aggregate level. This may poorly reflect the actual price trends in rural or poverty-stricken areas, where large populations reside in fragile situations. + This data set includes food price estimates and is intended to help gain insight in price developments beyond what can be formally measured by traditional methods. The estimates are generated using a machine-learning approach that imputes ongoing subnational price surveys, often with accuracy similar to direct measurement of prices. The data set provides new opportunities to investigate local price dynamics in areas where populations are sensitive to localized price shocks and where traditional data are not available.", + uri = "https://microdata.worldbank.org/index.php/catalog/4218", + doi = "https://doi.org/10.48529/2ZH0-JF55") + + ), + + # ... + ) + ) + ``` +
+ +- **`approval_process`** *[Optional ; Repeatable]*
+The *`approval_process`* is a group of six elements used to describe the formal approval process(es) (if any) that the project had to go through. This may for example include an approval by an Ethics Board to collect new data, followed by an internal review process to endorse the results. +
+```json +"approval_process": [ + { + "approval_phase": "string", + "approval_authority": "string", + "submission_date": "string", + "reviewer": "string", + "review_status": "string", + "approval_date": "string" + } +] +``` +
+ + - **`approval_phase`** *[Optional ; Non repeatable]*
+ A label that describes the approval phase. + - **`approval_authority`** *[Optional ; Non repeatable]*
+ Identification of the person(s) or organization(s) whose approval was required or sought. + - **`submission_date`** *[Optional ; Non repeatable]*
+ The date, entered in ISO 8601 format (YYYY-MM-DD), when the project (or a component of it) was submitted for approval. + - **`reviewer`** *[Optional ; Non repeatable]*
+ Identification of the reviewer(s). + - **`review_status`** *[Optional ; Non repeatable]*
+ Status of approval. + - **`approval_date`** *[Optional ; Non repeatable]*
+ Date the approval was formally received, preferably entered in ISO 8601 format (YYYY-MM-DD).

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + approval_process = list( + + list(approval_phase = "Authorization to conduct the survey", + approval_authority = "Internal Ethics Board, [Organization]", + submission_date = "2019-01-15", + review_status = "Approved (permission No ABC123)", + approval_date = "2020-04-30"), + + list(approval_phase = "Review of research output and authorization to publish", + approval_authority = "Internal Ethics Board, [Organization]", + submission_date = "2021-07-15", + review_status = "Approved", + approval_date = "2021-10-30") + + ), + # ... + ) + # ... + ) + ``` +
+ +- **`project_website`** *[Optional ; Repeatable ; String]*
+URL of the project website. +
+```json +"project_website": [ + "string" +] +``` +
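+
+For example (the URL is illustrative):
+
+```r
+my_project = list(
+  # ... ,
+  project_desc = list(
+    # ... ,
+    project_website = list("https://www.example.org/food-price-monitoring"),
+    # ...
+  )
+)
+```
+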
+ + +- **`language`** *[Optional ; Repeatable]*
+A block of two elements describing the language(s) of the project. At least one of the two elements must be provided for each listed language. The use of [ISO 639-2](https://www.loc.gov/standards/iso639-2/php/code_list.php) (the alpha-3 code in Codes for the representation of names of languages) is recommended. +
+```json +"language": [ + { + "name": "string", + "code": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the language. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the language. Numeric codes must be entered as strings.

+ + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + language = list( + list(name = "English", code = "EN"), + list(name = "French", code = "FR") + ), + + # ... + ) + # ... + ) + ``` +
+ +- **`production_date`** *[Optional ; Not repeatable ; String]*
+The date in ISO 8601 format (YYYY-MM-DD) the project was completed (this refers to the version that is being documented and released.) +
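+
+For example (the date is illustrative):
+
+```r
+my_project = list(
+  # ... ,
+  project_desc = list(
+    # ... ,
+    production_date = "2020-06-30",
+    # ...
+  )
+)
+```
+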
+ +- **`version_statement`** *[Optional ; Repeatable]*
+This repeatable block of four elements is used to list and describe the successive versions of the project. +
+```json +"version_statement": { + "version": "string", + "version_date": "string", + "version_resp": "string", + "version_notes": "string" +} +``` +
+ + - **`version`** *[Optional ; Not repeatable ; String]*
+ A label describing the version. For example, "Version 1.2" *[String]* + - **`version_date`** *[Optional ; Not repeatable ; String]*
+ Date (in ISO 8601 format, YYYY-MM-DD) the version was released *[String]* + - **`version_resp`** *[Optional ; Not repeatable ; String]*
+ Person(s) or organization(s) responsible for this version. *[String]* + - **`version_notes`** *[Optional ; Not repeatable ; String]*
+ Additional information on the version if any; it is good practice to describe what distinguishes this version from the previous one(s). The version must be entered as a string, even when composed only of numbers.

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + version_statement = list( + + list(version = "v1.0", + version_date = "2021-12-27", + version_resp = "University of Popstan, Department of Economics", + version_notes = "First version approved for open dissemination") + + ), + + # ... + ) + ``` +
+ + +- **`errata`** *[Optional ; Repeatable]*
+This field is used to list and describe errata. +
+```json +"errata": [ + { + "date": "string", + "description": "string" + } +] +``` +
+ + - **`date`** *[Optional ; Not repeatable ; String]*
+ Date (in ISO 8601 format, YYYY-MM-DD) the erratum was released. + - **`description`** *[Optional ; Not repeatable ; String]*
+ Description of the error(s) and measures taken to address it/them.

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + errata = list( + list(date = "2021-10-30", + description = "Outliers in the data for Afghanistan resulted in unrealistic model estimates of the food prices for January 2020. In the latest version of the 'model.R' script, outliers are detected and dropped from the input data file. The published dataset has been updated." + ) + ), + + # ... + ) + ) + ``` +
+ +- **`process`** *[Optional ; Repeatable]*
+This element is used to document the life cycle of the research project, from its design and inception to its conclusion. This can include phases of fundraising, IRB, concept note review, data acquisition, analysis, publishing of a working paper, peer review, publishing in journal, presentation to conferences, publishing, evaluation, reporting to sponsors, etc. It is recommended to provide these steps in a chronological order. +
+```json +"process": [ + { + "name": "string", + "date_start": "string", + "date_end": "string", + "description": "string" + } +] +``` +
+ + - **`name`**: *[Optional ; Not repeatable ; String]*
+ This is a header for the phase of the process. + - **`date_start`** *[Optional ; Not repeatable ; String]*
+ Date the phase started (preferably in ISO 8601 format, YYYY-MM-DD) + - **`date_end`** *[Optional ; Not repeatable ; String]*
+ Date the phase ended (preferably in ISO 8601 format, YYYY-MM-DD) + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the phase.

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + process = list( + + list(name = "Presentation of the concept note at the Review Committee decision meeting", + date_start = "2018-02-23", + date_end = "2018-02-23", + description = "Presentation of the research objectives and method by the primary investigator to the Review Committee, which resulted in the approval of the concept note." + ), + + list(name = "Fundraising", + date_start = "2018-02-24", + date_end = "2018-02-30", + description = "Discussion with project sponsors, and conclusion of the funding agreement." + ), + + list(name = "Data acquisition and analytics", + date_start = "2018-03-15", + date_end = "2019-01-30", + description = "Implementation of web scraping, then data analysis" + ), + + list(name = "Working paper", + date_start = "2019-01-30", + date_end = "2019-02-25", + description = "Production (and copy editing) of the working paper" + ), + + list(name = "Presentation to conferences", + date_start = "2019-04-12", + date_end = "2019-04-12", + description = "Presentation of the paper by the primary investigator at the ... conference, London" + ), + + list(name = "Curation and dissemination of data and code", + date_start = "2019-02-25", + date_end = "2019-03-18", + description = "Data and script documentation, and publishing in the National Microdata Library" + ) + + ), + + # ... + ) + ) + ``` +
+ +- **`authoring_entity`** *[Optional ; Repeatable]*
+This section will identify the person(s) and/or organization(s) in charge of the intellectual content of the research project, and specify their respective role. +
+```json +"authoring_entity": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "abbreviation": "string", + "email": "string", + "author_id": [] + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the person or organization responsible for the research project. + - **`role`** *[Optional ; Not repeatable ; String]*
+ Specific role of the person or organization mentioned in `name`. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ Agency or organization affiliation of the author/primary investigator mentioned in `name`. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ Abbreviation used to identify the agency stated under `affiliation`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ Depending on the agency policies, a researcher may provide a personal email or an agency email to field inquiries related to the project. + - **`author_id`** *[Optional ; Repeatable]*
+ A block of two elements used to provide unique identifiers of the authors, as provided by different registers of researchers. For example, this can be an ORCID number (ORCID is a non-profit organization supported by a global community of member organizations, including research institutions, publishers, sponsors, professional associations, service providers, and other stakeholders in the research ecosystem.) + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of ID; for example, "ORCID". + - **`id`** *[Required ; Not repeatable ; String]*
+ A unique identification number/code for the authoring entity, entered as a string variable.

+ + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + authoring_entity = list( + + list(name = "", + role = "", + affiliation = "", + email = "", + author_id = list( + list(type = "ORCID", id = "") + ) + ) + + ), + + # ... + ) + ) + ``` +
+ +- **`contributors`** *[Optional ; Repeatable]* This section is used to record other contributors to the research project and to give recognition to the roles they played. +
+```json +"contributors": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "abbreviation": "string", + "email": "string", + "url": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the person, corporate body, or agency contributing to the intellectual content of the project (other than the PI). If a person, invert first and last name and use commas. + - **`role`** *[Optional ; Not repeatable ; String]*
+ Title of the person (if any) responsible for the work's substantive and intellectual content. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ Agency or organization affiliation of the contributor. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ Abbreviation used to identify the agency stated under `affiliation`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ Depending on the agency policies, a researcher may provide a personal email or an agency email to field inquiries related to the project. + - **`url`** *[Optional ; Not repeatable ; String]*
+ The URL that provides information on the contributor or their affiliation.

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + contributors = list( + list(name = "", + role = "", + affiliation = "", + email = "" + ) + ), + + # ... + ) + ) + ``` +
+ +- **`sponsors`** *[Optional ; Repeatable]*
The source(s) of funds for production of the work. If different funding agencies sponsored different stages of the production process, use the 'role' attribute to distinguish them. +
+```json +"sponsors": [ + { + "name": "string", + "abbreviation": "string", + "role": "string", + "grant_no": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the funding agency/sponsor. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ Abbreviation of the funding/sponsoring agency. + - **`role`** *[Optional ; Not repeatable ; String]*
+ Specific role of the funding/sponsoring agency. + - **`grant_no`** *[Optional ; Not repeatable ; String]*
+ Grant or award number. + + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + sponsors = list( + + list(name = "ABC Foundation", + abbr = "ABCF", + role = "Purchase of the data", + grant_no = "ABC_001_XYZ" + ), + + list(name = "National Research Foundation", + abbr = "NRF", + role = "Funding of staff and research assistant costs, and variable costs for participation in conferences", + grant_no = "NRF_G01" + ) + + ), + + # ... + ) + ) + ``` +
+ +- **`curators`** *[Optional ; Repeatable]*
+A list of persons and/or organizations in charge of curating the resources associated with the project. +
+```json +"curators": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "abbreviation": "string", + "email": "string", + "url": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization. + - **`role`** *[Optional ; Not repeatable ; String]*
+ The specific role of the person or organization in the curation of the project resources. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or organization. + - **`abbreviation`** *[Optional ; Not repeatable ; String]*
+ An acronym of the organization, if an organization was entered in `name`. + - **`email`** *[Optional ; Not repeatable ; String]*
+ The email address of the person or organization. The use of personal email addresses must be avoided. + - **`url`** *[Optional ; Not repeatable ; String]*
+ A link to the website of the person or organization. +

+ + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + curators = list( + + list(name = "National Data Archive of Popstan", + role = "Documentation, preservation and dissemination of the data and reproducible code", + email = "helpdesk@nda. ...", + url = "popstan_nda.org" + ) + + ), + + # ... + ) + ) + ``` +
+ +- **`reviews_comments`** *[Optional ; Repeatable]*
+Many research projects will be subject to a review process, which may happen at different stages of the project implementation (from design to review of the final output). This block is intended to document the comments received by reviewers during this process. It is a repeatable block of metadata elements, which can be used to document comments with a fine granularity. +
+```json +"reviews_comments": [ + { + "comment_date": "string", + "comment_by": "string", + "comment_description": "string", + "comment_response": "string" + } +] +``` +
+ + - **`comment_date`** *[Optional ; Not repeatable ; String]*
+ The date the comment was provided, in ISO 8601 format (YYYY-MM-DD or YYYY-MM). + - **`comment_by`** *[Optional ; Not repeatable ; String]*
+ The name of the person or organization that provided the comment. + - **`comment_description`** *[Optional ; Not repeatable ; String]*
+ The comment itself, in its original formulation or in a summary version. + - **`comment_response`** *[Optional ; Not repeatable ; String]*
+ The response provided by the research team/person to the comment, in its original formulation or in a summary version. +

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + reviews_comments = list( + list(comment_date = "", + comment_by = "", + comment_description = "", + comment_response = "" + ) + ), + # ... + ) + ) + ``` +
+ +- **`acknowledgments`** *[Optional ; Repeatable]*
+This repeatable block of elements is used to provide an itemized list of persons and organizations whose contribution to the project must be acknowledged. Note that specific metadata elements are available for listing financial sponsors and main contributors to the study.
+An alternative to this field is the `acknowledgment_statement` field (see below) which can be used to provide the acknowledgment in the form of an unstructured text. +
+```json +"acknowledgments": [ + { + "name": "string", + "affiliation": "string", + "role": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the person or agency being recognized for supporting the project. + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the person or agency being acknowledged. + - **`role`** *[Optional ; Not repeatable ; String]*
+ A brief description of the role of the person or agency that is being recognized or acknowledged for supporting the project.

+ + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + acknowledgments = list( + list(name = "", + affiliation = "", + role = "" + ), + list(name = "", + affiliation = "", + role = "" + ) + ), + # ... + ) + ) + ``` +
+ +- **`acknowledgment_statement`** *[Optional ; Not repeatable ; String]*
+This field is used to provide acknowledgments in the form of an unstructured text. An alternative to this field is the *acknowledgments* field which provides a solution to itemize the acknowledgments. + + +- **`disclaimer`** *[Optional ; Not repeatable ; String]*
+Disclaimers limit the responsibility or liability of the publishing organization or researchers associated with the research project. Disclaimers assure that any research in the public domain produced by an organization has limited repercussions to the publishing organization. A disclaimer is intended to prevent liability from any effects occurring as a result of the acts or omissions in the research. + + +- **`confidentiality`** *[Optional ; Not repeatable ; String]*
+A confidentiality statement binds the publisher to ethical considerations regarding the subjects of the research. In most cases, the identity of an individual who is the subject of research cannot be released, and special effort is required to assure the preservation of privacy. + + +- **`citation_requirement`** *[Optional ; Not repeatable ; String]*
+The citation requirement indicates how the project and its outputs should be cited; it provides the preferred form of citation (a shorthand used to refer to the publication or other published good). + + +- **`related_projects`** *[Optional ; Repeatable]*
+The objective of this block is to provide links (URLs) to other, related projects which can be documented and disseminated in the same catalog or any other location on the internet. +
+```json +"related_projects": [ + { + "name": "string", + "uri": "string", + "note": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name (title) of the related project. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ A link (URL) to the related project web page. + - **`note`** *[Optional ; Not repeatable ; String]*
+ A brief description or other relevant information on the related project.

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + related_projects = list( + list(name = "", + uri = "", + note = "") + ), + # ... + ) + ) + ``` +
+ +- **`geographic_units`** *[Optional ; Repeatable]*
+The geographic areas covered by the project. When the project relates to one or more countries, or part of one or more countries, it is important to provide the country name. This means that for a project related to a specific province or town of a country, the country name will be entered in addition to the province or town (as separate entries in this repeatable block of elements). Note that the area does not have to be an administrative area; it can for example be an ocean. +
+```json +"geographic_units": [ + { + "name": "string", + "code": "string", + "type": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name of the geographic area. + - **`code`** *[Optional ; Not repeatable ; String]*
+ The code of the geographic area. For countries, it is recommended to use the [ISO 3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) country codes and names. + - **`type`** *[Optional ; Not repeatable ; String]*
+ The type of geographic area.
+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + geographic_units = list( + list(name = "India", code = "IND", type = "Country"), + list(name = "New Delhi", type = "City"), + list(name = "Kerala", type = "State"), + list(name = "Nepal", code = "NPL", type = "Country"), + list(name = "Kathmandu", type = "City") + ), + + # ... + ) + ) + ``` +
+ +- **`keywords`** *[Optional ; Repeatable]*
+
+```json +"keywords": [ + { + "name": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of keywords that provide information on the core scope and objectives of the research project. Keywords provide a convenient solution to improve the discoverability of the research, as it allows terms and phrases not found elsewhere in the metadata to be indexed and to make a project discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the [UNESCO Thesaurus](http://vocabularies.unesco.org/browser/thesaurus/en/). The list provided here can combine keywords from multiple controlled vocabularies, and user-defined keywords. + + - **`name`** *[Required ; Not repeatable ; String]*
+ The keyword itself. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The controlled vocabulary (including version number or date) from which the keyword is extracted, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the controlled vocabulary from which the keyword is extracted, if any.

+ + + ```r + my_project <- list( + # ... , + project_desc = list( + # ... , + + keywords = list( + + list(name = "Migration", + vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + + list(name = "Migrants", + vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + + list(name = "Refugee", + vocabulary = "Unesco Thesaurus (June 2021)", + uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"), + + list(name = "Conflict"), + list(name = "Asylum seeker"), + list(name = "Forced displacement"), + list(name = "Forcibly displaced"), + list(name = "Internally displaced population (IDP)"), + list(name = "Population of concern (PoC)"), + list(name = "Returnee"), + list(name = "UNHCR") + ), + + # ... + ), + # ... + ) + ``` +
+ + +- **`themes`** *[Optional ; Repeatable]*
+
+```json +"themes": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + A list of themes covered by the research project. A controlled vocabulary will preferably be used. Note that `themes` will rarely be used as the elements `topics` and `disciplines` are more appropriate for most uses. This is a block of five fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The ID of the theme, taken from a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the theme, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.
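+
+For example (a minimal sketch using user-defined themes, without a controlled vocabulary):
+
+```r
+my_project = list(
+  # ... ,
+  project_desc = list(
+    # ... ,
+    themes = list(
+      list(name = "Food security"),
+      list(name = "Price monitoring")
+    ),
+    # ...
+  )
+)
+```
+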
+ + +- **`topics`** *[Optional ; Repeatable]*
+
+```json +"topics": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ + Information on the topics covered in the research project. A controlled vocabulary will preferably be used, for example the [CESSDA Topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification), a typology of topics available in 11 languages; or the [Journal of Economic Literature (JEL) Classification System](https://en.wikipedia.org/wiki/JEL_classification_codes), or the [World Bank topics classification](https://documents.worldbank.org/en/publication/documents-reports/docadvancesearch). Note that you may use more than one controlled vocabulary. + This element is a block of five fields: + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the topic, taken from a controlled vocabulary. + - **`name`** *[Required ; Not repeatable ; String]*
+ The name (label) of the topic, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.

+ + + + ```r + my_project = list( + # ... , + + project_desc = list( + # ... , + + topics = list( + + list(name = "Demography.Migration", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(name = "Demography.Censuses", + vocabulary = "CESSDA Topic Classification", + uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"), + + list(id = "F22", + name = "International Migration", + parent_id = "F2 - International Factor Movements and International Business", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "O15", + name = "Human Resources - Human Development - Income Distribution - Migration", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "O12", + name = "Microeconomic Analyses of Economic Development", + parent_id = "O1 - Economic Development", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"), + + list(id = "J61", + name = "Geographic Labor Mobility - Immigrant Workers", + parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers", + vocabulary = "JEL Classification System", + uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J") + ), + + # ... + ) + ) + ``` +
+ +- **`disciplines`** *[Optional ; Repeatable]*
+
+```json +"disciplines": [ + { + "id": "string", + "name": "string", + "parent_id": "string", + "vocabulary": "string", + "uri": "string" + } +] +``` +
+ Information on the academic disciplines related to the content of the research project. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in [Wikipedia](https://en.wikipedia.org/wiki/List_of_academic_fields). + This is a block of five elements: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ The identifier of the discipline, taken from a controlled vocabulary. + - **`name`** *[Optional ; Not repeatable ; String]*
+ The name (label) of the discipline, preferably taken from a controlled vocabulary. + - **`parent_id`** *[Optional ; Not repeatable ; String]*
+ The parent identifier of the discipline (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used. + - **`vocabulary`** *[Optional ; Not repeatable ; String]*
+ The name (including version number) of the controlled vocabulary used, if any. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL to the controlled vocabulary used, if any.

+ + + + ```r + my_project <- list( + # ... , + + project_desc = list( + # ... , + + disciplines = list( + + list(name = "Economics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"), + + list(name = "Agricultural economics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"), + + list(name = "Econometrics", + vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", + uri = "https://en.wikipedia.org/wiki/List_of_academic_fields") + + ), + + # ... + ), + # ... + ) + ``` +
+ +- **`repository_uri`** In the process of producing the outputs of the research project, a researcher may want to share their source code for transparency and replicability. This repository provides information for finding the repository where the source code is kept. + +
+```json +"repository_uri": [ + { + "name": "string", + "type": "string", + "uri": null + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the repository where code is hosted. + - **`type`** *[Optional ; Not repeatable ; String]*
+ Repository type, e.g., GitHub, Bitbucket, etc. + - **`uri`** *[Required ; Not repeatable ; String]*
+ URI of the project source code/script repository

+ + + + ```r + my_project = list( + # ... , + + project_desc = list( + # ... , + + repository_uri = list( + list(name = "A comparative assessment of machine learning classification algorithms applied to poverty prediction", + type = "GitHub public repo", + uri = "https://github.com/worldbank/ML-classification-algorithms-poverty") + ), + + # ... + ) + ) + ``` +
+ +- **`license`** *[Optional ; Repeatable]*
+Information on the license(s) attached to the research project resources, which defines their terms of use. +
+```json +"license": [ + { + "name": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the license. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URL of the license, where detailed information on the license can be obtained. + - **`note`** *[Optional ; Not repeatable ; String]*
+ Additional information on the license. +
+ + + ```r + my_project <- list( + # ... , + project_desc = list( + # ... , + + license = list( + + list(name = "Attribution 4.0 International (CC BY 4.0)", + uri = "https://creativecommons.org/licenses/by/4.0/") + + ), + + # ... + ), + # ... + ) + ``` +
+ +- **`copyright`** *[Optional ; Not repeatable ; String]*
+Information on the copyright, if any, that applies to the research project metadata. + + +- **`technology_environment`** *[Optional ; Not repeatable ; String]*
+This field is used to provide a description (as detailed as possible) of the computational environment under which the scripts were implemented and are expected to be reproducible. A substantial challenge in reproducing analyses is installing and configuring the web of dependencies of specific versions of various analytical tools. Virtual machines (a computer inside a computer) enable you to efficiently share your entire computational environment with all the dependencies intact. (https://ropensci.github.io/reproducibility-guide/sections/introduction/) + + +- **`technology_requirements`** *[Optional ; Not repeatable ; String]*
+Software/hardware or other technology requirements needed to run the scripts and replicate the outputs + + +- **`reproduction_instructions`** *[Optional ; Not repeatable ; String]*
+Instructions to secondary analysts who may want to reproduce the scripts. + + +- **`methods`** *[Optional ; Repeatable]*
+A list of analytic, statistical, econometric, machine learning methods used in the project. The objective is to allow users to find projects based on a search on methods applied, e.g. answer a query like *"poverty prediction using random forest"*. +
+```json +"methods": [ + { + "name": "string", + "note": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ A short name for the method being described. + - **`note`** *[Optional ; Not repeatable ; String]*
+ Any additional information on the method. +

+ + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + methods = list( + + list(name = "linear regression", + note = "Implemented using R package 'stats'"), + + list(name = "random forest", + note = "Used for both regression and classification"), + + list(name = "lasso regression (least absolute shrinkage and selection operator)", + note = "Implemented using R package glmnet"), + + list(name = "gradient boosting machine (GBM)"), + + list(name = "cross validation"), + + list(name = "mean square error, quadratic loss, L2 loss", + note = "Loss functions used to fit models") + + ), + + # ... + ) + ) + ``` +
+ +- **`software`** *[Optional ; Repeatable]*
+This field is used to list the software and the specialized libraries/packages that were used to implement the project and that are required to reproduce the scripts. The libraries that are loaded by the scripts (e.g., by the R *require* or *library* command) are included, but not all of their own dependencies, which are assumed to be installed automatically. +
+```json +"software": [ + { + "name": "string", + "version": "string", + "library": [ + "string" + ] + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+ The name of the software. + - **`version`** *[Optional ; Not repeatable ; String]*
+ The version of the software. + - **`library`** *[Optional ; Repeatable]*
+ A list of libraries/packages required to run the scripts. Note that the specific version of each package is not documented here; it is expected to be found in the script or in the reproduction instructions. + + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + software = list( + + list(name = "R", + version = "4.0.2", + library = list("caret", "dplyr", "ggplot2")), + + list(name = "Stata", + version = "15"), + + list(name = "Python", + version = "3.7 (Anaconda install)", + library = list("pandas", "scikit-learn")) + + ), + + # ... + ) + ) + ``` +
+ +- **`scripts`** *[Optional ; Repeatable]*
+This field is used to describe the scripts written by the project authors. All scripts are expected to have been written using software listed in the field *software*. +
+```json +"scripts": [ + { + "file_name": "string", + "zip_package": "string", + "title": "string", + "authors": [ + { + "name": "string", + "affiliation": "string", + "role": "string" + } + ], + "date": "string", + "format": "string", + "software": "string", + "description": "string", + "methods": "string", + "dependencies": "string", + "instructions": "string", + "source_code_repo": "string", + "notes": "string", + "license": [ + { + "name": "string", + "uri": "string", + "note": "string" + } + ] + } +] +``` +
+ + - **`file_name`** *[Optional ; Not repeatable ; String]*
+ Name of the script file (for R users, this will typically include files with extension [.R], for Stata users it will be files with extension [.do], for Python users ...). But this can also include other related files required to run the scripts (for example lookup CSV files, etc.) This does not include the data files, which are described in a specific field. + - **``zip_package``** *[Optional ; Not repeatable]*
+ If the script files have been saved as or in a compressed file (zip, rar, of equivalent), we provide here the name of the zip file containing the script. + - **`title`** *[Optional ; Not repeatable ; String]*
+ A title (label) given to the script file + - **`authors`** *[Optional ; Repeatable]*
+ This is a repeatable block that allows entering a list of authors and co-authors of a script + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name of the author (person or organization) of the script + - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The affiliation of the author. + - **`role`** *[Optional ; Not repeatable ; String]*
+ Specific role of the person or organization in the production of the script.

+
+  - **`date`** *[Optional ; Not repeatable ; String]*
+ Date the script was produced, in ISO 8601 format (YYYY-MM-DD) + - **`format`** *[Optional ; Not repeatable ; String]*
+ File format + - **`software`** *[Optional ; Not repeatable ; String]*
+ Software used to run the script + - **`description`** *[Optional ; Not repeatable ; String]*
+ Brief description of the script + - **`methods`** *[Optional ; Not repeatable ; String]*
+ Statistical/analytic methods included in the script + - **`dependencies`** *[Optional ; Not repeatable ; String]*
+ Any dependencies (packages/libraries) that the script relies on. This field is not needed if dependencies were described in the `library` element. + - **`instructions`** *[Optional ; Not repeatable ; String]*
+ Instructions for running the script. Information on the sequence in which the scripts must be run is critical. + - **`source_code_repo`** *[Optional ; Not repeatable ; String]*
+ Repository (e.g. GitHub repo) where the script has been published. + - **`notes`** *[Optional ; Not repeatable ; String]*
+ Any additional information on the script. + - **`license`** *[Optional ; Repeatable]*
+ License, if any, under which the script is published. + - **`name`** *[Optional ; Not repeatable ; String]*
+ Name (label) of the license + - **`uri`** *[Optional ; Not repeatable ; String]*
+ License URI

+
+
+  ```r
+  my_project = list(
+    # ... ,
+    project_desc = list(
+      # ... ,
+
+      scripts = list(
+
+        list(file_name = "00_script.R",
+             zip_package = "all_scripts.zip",
+             title = "Project X - Master script",
+             authors = list(name = "John Doe",
+                            affiliation = "IHSN",
+                            role = "Writing, testing and documenting the script"),
+             date = "2020-12-27",
+             format = "R script",
+             software = "R x64 4.0.2",
+             description = "Master script for automated reproduction of the analysis. Calls all other scripts in proper sequence to reproduce the full analysis.",
+             methods = "box-cox transformation of data",
+             dependencies = "",
+             instructions = "",
+             source_code_repo = "",
+             notes = "",
+             license = list(name = "CC BY 4.0",
+                            uri = "https://creativecommons.org/licenses/by/4.0/deed.ast")),
+
+        list(file_name = "01_regression.R",
+             zip_package = "",
+             title = "Charts and maps",
+             authors = list(name = "",
+                            affiliation = "",
+                            role = ""),
+             date = "",
+             format = "R script",
+             software = "R",
+             description = "This script runs all linear regressions and PCA presented in the working paper.",
+             methods = "linear regression; principal component analysis",
+             dependencies = "",
+             instructions = "",
+             source_code_repo = "",
+             notes = "",
+             license = list(list(name = "CC BY 4.0",
+                                 uri = "https://creativecommons.org/licenses/by/4.0/deed.ast"))),
+
+        list(file_name = "02_visualization",
+             zip_package = "",
+             title = "",
+             authors = list(name = "",
+                            affiliation = "",
+                            role = ""),
+             date = "",
+             format = "",
+             software = "",
+             description = "",
+             instructions = "",
+             source_code_repo = "",
+             notes = "",
+             license = list(list(name = "CC BY 4.0",
+                                 uri = "https://creativecommons.org/licenses/by/4.0/deed.ast")))
+
+      ),
+      # ...
+    )
+  )
+  ```
+
+ +- **`data_statement`** *[Optional ; Not repeatable ; String]*
+An overall statement on the data used in the project. A separate field is provided to list and document the origin and key characteristics of the datasets. + + +- **`datasets`** *[Optional ; Repeatable]*
+This field is used to provide an itemized list of the datasets used in the project. The data are not documented here; specific metadata standards and schemas are available for documenting data of different types, like the DDI for microdata or the ISO 19139 for geographic datasets.
+
+```json +"datasets": [ + { + "name": "string", + "idno": "string", + "note": "string", + "access_type": "string", + "license": "string", + "license_uri": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Optional ; Not repeatable ; String]*
+ The dataset name (title) + - **`idno`** *[Optional ; Not repeatable ; String]*
+ The unique identifier of the dataset + - **`note`** *[Optional ; Not repeatable ; String]*
+ A brief description of the dataset. + - **`access_type`** *[Optional ; Not repeatable ; String]*
+  The access policy applied to the dataset.
+  - **`license`** *[Optional ; Not repeatable ; String]*
+ The access license that applies to the dataset. + - **`license_uri`** *[Optional ; Not repeatable ; String]*
+ The URL of a web page where more information on the license can be obtained. + - **`uri`** *[Optional ; Not repeatable ; String]*
+ The URI where the dataset (or a detailed description of it) can be obtained. +

+ + + + ```r + my_project = list( + # ... , + project_desc = list( + # ... , + + datasets = list( + + list(name = "Multiple Indicator Cluster Survey 2019, Round 6, Chad", + idno = "TCD_2019_MICS_v01_M", + uri = "https://microdata.worldbank.org/index.php/catalog/4150"), + + list(name = "World Bank Group Country Survey 2018, Chad", + idno = "TCD_2018_WBCS_v01_M", + access_type = "Public access", + uri = "https://microdata.worldbank.org/index.php/catalog/3058") + + ), + # ... + ) + ) + ``` +
+ +- **`contacts`** *[Optional ; Repeatable]*
+The contacts element provides the public interface for questions associated with the research project. Various contacts may be provided, depending on the organization. It is important to ensure that the proper contacts are listed to channel public inquiries.
+
+```json +"contacts": [ + { + "name": "string", + "role": "string", + "affiliation": "string", + "email": "string", + "telephone": "string", + "uri": "string" + } +] +``` +
+ + - **`name`** *[Required ; Not repeatable ; String]*
+  The name of the contact person, to be contacted according to the role defined below.
+  - **`role`** *[Optional ; Not repeatable ; String]*
+  The role of the contact person. A research project may have different contact persons for different outputs or technical inputs. Some complex projects involve multiple data collection processes, each with its own processing channels and contacts. This section should identify a primary public interface that can refer public inquiries, or provide a collection of entry points.
+  - **`affiliation`** *[Optional ; Not repeatable ; String]*
+ The organization or affiliation of the contact person. This is usually the organization that the contact person represents. + - **`email`** *[Optional ; Not repeatable ; String]*
+ Email address of the responsible person, institution, or division in charge of the research project or output. + - **`telephone`** *[Optional ; Not repeatable ; String]*
+ Phone number of the responsible institution or division of the research project or output. + - **`uri`** *[Optional ; Not repeatable ; String]*
+  The URI of the contact person's agency or organization. This may be the same as the project web page, or a permanent institutional contact that is not tied to the project. A project website may eventually be removed while a contact is still needed; it is therefore recommended to provide a permanent contact.

+
+
+  ```r
+  my_project = list(
+    # ... ,
+    project_desc = list(
+      # ... ,
+
+      contacts = list(
+
+        list(name = "Data helpdesk",
+             affiliation = "National Data Center",
+             role = "Support to data users",
+             email = "helpdesk@ndc. ...")
+      ),
+
+      # ...
+    )
+  )
+  ```
+
+ +### Provenance + +**`provenance`** *[Optional ; Repeatable]*
+Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. +
+```json +"provenance": [ + { + "origin_description": { + "harvest_date": "string", + "altered": true, + "base_url": "string", + "identifier": "string", + "date_stamp": "string", + "metadata_namespace": "string" + } + } +] +``` +
+ + - **`origin_description`** *[Required ; Not repeatable]*
+ The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested.
+ - **`harvest_date`** *[Required ; Not repeatable ; String]*
+ The date and time the metadata were harvested, entered in ISO 8601 format.
+ - **`altered`** *[Optional ; Not repeatable ; Boolean]*
+  A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the project (element `idno` in the title statement) will be modified when the metadata are published in a new catalog.
+ - **`base_url`** *[Required ; Not repeatable ; String]*
+ The URL from where the metadata were harvested.
+ - **`identifier`** *[Optional ; Not repeatable ; String]*
+ The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.
+ - **`date_stamp`** *[Optional ; Not repeatable ; String]*
+ The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
+ - **`metadata_namespace`** *[Optional ; Not repeatable ; String]*
+ @@@@@@@
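+
+  The example below is a simple illustration (with hypothetical values) of how the `provenance` block could be entered in R. We assume here that `provenance` is provided at the same level as `doc_desc` and `project_desc` in the metadata list; the catalog URL and identifiers are purely illustrative.
+
+  ```r
+  my_project = list(
+    # ... ,
+
+    provenance = list(
+      list(origin_description = list(
+        harvest_date = "2022-04-01T10:30:00Z",
+        altered      = TRUE,
+        base_url     = "https://catalog.example.org/index.php/api/",
+        identifier   = "SRC_2020_PROJ_v01",
+        date_stamp   = "2021-11-15"
+      ))
+    )
+  )
+  ```
+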
+ + +### Tags + +**`tags`** *[Optional ; Repeatable]*
+Tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 of the Guide for an example in R.
+
+```json +"tags": [ + { + "tag": "string", + "tag_group": "string" + } +] +``` +
+ + - **`tag`** *[Required ; Not repeatable ; String]*
+ A user-defined tag. + - **`tag_group`** *[Optional ; Not repeatable ; String]*

+ A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs. + + +- **`lda_topics`** *[Optional ; Not repeatable]*
+
+```json +"lda_topics": [ + { + "model_info": [ + { + "source": "string", + "author": "string", + "version": "string", + "model_id": "string", + "nb_topics": 0, + "description": "string", + "corpus": "string", + "uri": "string" + } + ], + "topic_description": [ + { + "topic_id": null, + "topic_score": null, + "topic_label": "string", + "topic_words": [ + { + "word": "string", + "word_weight": 0 + } + ] + } + ] + } +] +``` +
+
+  We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+  Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of research projects, this text will be a concatenation of selected metadata elements, such as the project title, abstract, keywords, and methods. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).
+
+  The `lda_topics` element includes the following metadata fields. An example in R was provided in Chapter 4 - Documents.
+
+  - **`model_info`** *[Optional ; Not repeatable]*
+ Information on the LDA model.
+ + - `source` *[Optional ; Not repeatable ; String]*
+ The source of the model (typically, an organization).
+ - `author` *[Optional ; Not repeatable ; String]*
+ The author(s) of the model.
+ - `version` *[Optional ; Not repeatable ; String]*
+ The version of the model, which could be defined by a date or a number.
+ - `model_id` *[Optional ; Not repeatable ; String]*
+ The unique ID given to the model.
+ - `nb_topics` *[Optional ; Not repeatable ; Numeric]*
+ The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
+ - `description` *[Optional ; Not repeatable ; String]*
+ A brief description of the model.
+ - `corpus` *[Optional ; Not repeatable ; String]*
+ A brief description of the corpus on which the LDA model was trained.
+ - `uri` *[Optional ; Not repeatable ; String]*
+ A link to a web page where additional information on the model is available.

+ + - **`topic_description`** *[Optional ; Repeatable]*
+  The topic composition extracted from selected elements of the project metadata (typically, the title, abstract, and keywords).
+ + - `topic_id` *[Optional ; Not repeatable ; String]*
+ The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
+ - `topic_score` *[Optional ; Not repeatable ; Numeric]*
+ The share of the topic in the metadata (%).
+ - `topic_label` *[Optional ; Not repeatable ; String]*
+ The label of the topic, if any (not automatically generated by the LDA model).
+ - `topic_words` *[Optional ; Not repeatable]*
+ The list of N keywords describing the topic (e.g., the top 5 words).
+ - `word` *[Optional ; Not repeatable ; String]*
+ The word.
+ - `word_weight` *[Optional ; Not repeatable ; Numeric]*
+ The weight of the word in the definition of the topic.

+ + +- **`embeddings`** *[Optional ; Repeatable]*
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).
+
+  The word vectors do not have to be stored in the project metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model.
+
+
+```json +"embeddings": [ + { + "id": "string", + "description": "string", + "date": "string", + "vector": null + } +] +``` +
+ + The `embeddings` element contains four metadata fields: + + - **`id`** *[Optional ; Not repeatable ; String]*
+ A unique identifier of the word embedding model used to generate the vector. + - **`description`** *[Optional ; Not repeatable ; String]*
+ A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc. + - **`date`** *[Optional ; Not repeatable ; String]*
+  The date the model was trained (or a version date for the model).
+  - **`vector`** *[Required ; Not repeatable ; @@@@]*
+  The numeric vector representing the project metadata.
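+
+  A hypothetical illustration of the `embeddings` block in R is shown below. The model identifier and description are illustrative, only the first values of the vector are shown, and the block is assumed to be provided at the same level as `project_desc` in the metadata list.
+
+  ```r
+  my_project = list(
+    # ... ,
+
+    embeddings = list(
+      list(id          = "word2vec_dev_corpus_v1",
+           description = "Word2vec model (100 dimensions) trained on a corpus of development-related documents",
+           date        = "2021-03-15",
+           vector      = c(0.0132, -0.0405, 0.0267, 0.0981))   # vector truncated for display
+    )
+  )
+  ```
+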

+ + +### Additional + +**`additional`** *[Optional ; Not repeatable]*
@@@@ add this to the schema and do screenshot +The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. + + +## Generating compliant metadata + +For this example of documentation and publishing of reproducible research, we use the [Replication data for: Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia](https://www.openicpsr.org/openicpsr/project/116471/version/V1/view;jsessionid=31C3E76620D0DDD1CABADAA263A1E491) published in the OpenICPSR website. The primary investigators for the project were Vivi Alatas, Abhijit Banerjee, Rema Hanna, Benjamin A. Olken, Ririn Purnamasari, and Matthew Wai-Poi. + +:::quote +A service of the Inter-university Consortium for Political and Social Research (ICPSR), openICPSR is a self-publishing repository for social, behavioral, and health sciences research data. openICPSR is particularly well-suited for the deposit of replication data sets for researchers who need to publish their raw data associated with a journal article so that other researchers can replicate their findings. (from [OpenICPSR website](https://www.openicpsr.org/openicpsr/about)) +::: + + +### Full example, using a metadata editor + +
+
+![image](https://user-images.githubusercontent.com/35276300/229925181-c9ebbfb6-9934-4307-a8b5-b7903a6c8381.png) +
+
+
+
+### Full example, using R
+
+
+```r
+library(jsonlite)
+library(httr)
+library(dplyr)
+library(nadar)
+
+# ----credentials and catalog URL --------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/")
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_project")
+thumb = "elite_capture.JPG"   # Will be used as thumbnail in the data catalog
+
+id = "IDN_2019_ECTWP_v01_RR"
+
+# Generate the metadata
+
+my_project_metadata <- list(
+
+  # Information on metadata production
+
+  doc_desc = list(
+
+    producers = list(
+      list(name = "OD", affiliation = "National Data Center")
+    ),
+
+    prod_date = "2022-01-15"
+
+  ),
+
+  # Documentation of the research project, and scripts
+
+  project_desc = list(
+
+    title_statement = list(
+      idno = id,
+      title = "Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+      sub_title = "Reproducible scripts"
+    ),
+
+    production_date = list("2019"),
+
+    geographic_units = list(
+      list(name="Indonesia", code="IDN", type="Country")
+    ),
+
+    authoring_entity = list(
+
+      list(name = "Vivi Alatas",
+           role = "Primary investigator",
+           affiliation = "World Bank",
+           email = "valatas@worldbank.org"),
+
+      list(name = "Abhijit Banerjee",
+           role = "Primary investigator",
+           affiliation = "Department of Economics, MIT",
+           email = "banerjee@mit.edu"),
+
+      list(name = "Rema Hanna",
+           role = "Primary investigator",
+           affiliation = "Harvard Kennedy School",
+           email = "rema_hanna@hks.harvard.edu"),
+
+      list(name = "Benjamin A. Olken",
+           role = "Primary investigator",
+           affiliation = "Department of Economics, MIT",
+           email = "bolken@mit.edu"),
+
+      list(name = "Ririn Purnamasari",
+           role = "Primary investigator",
+           affiliation = "World Bank",
+           email = "rpurnamasari@worldbank.org"),
+
+      list(name = "Matthew Wai-Poi",
+           role = "Primary investigator",
+           affiliation = "World Bank",
+           email = "mwaipoi@worldbank.org")
+
+    ),
+
+    abstract = "This paper investigates how elite capture affects the welfare gains from targeted government transfer programs in Indonesia, using both a high-stakes field experiment that varied the extent of elite influence and nonexperimental data on a variety of existing government programs. 
While the relatives of those holding formal leadership positions are more likely to receive benefits in some programs, we argue that the welfare consequences of elite capture appear small: eliminating elite capture entirely would improve the welfare gains from these programs by less than one percent.", + + keywords = list( + list(name="proxy-means test (PMT)"), + list(name="experimental design") + ), + + topics = list( + + list(id="D72", + name = "Political Processes: Rent-seeking, Lobbying, Elections, Legislatures, and Voting Behavior", + vocabulary = "JEL codes", + uri = "https://www.aeaweb.org/econlit/jelCodes.php"), + + list(id = "H53", + name = "National Government Expenditures and Welfare Programs", + vocabulary = "JEL codes", + uri = "https://www.aeaweb.org/econlit/jelCodes.php"), + + list(id = "I38", + name = "Welfare, Well-Being, and Poverty: Government Programs; Provision and Effects of Welfare Programs", + vocabulary = "JEL codes", + uri = "https://www.aeaweb.org/econlit/jelCodes.php"), + + list(id = "O15", + name = "Economic Development: Human Resources; Human Development; Income Distribution; Migration", + vocabulary = "JEL codes", + uri = "https://www.aeaweb.org/econlit/jelCodes.php"), + + list(id = "O17", + name = "Formal and Informal Sectors; Shadow Economy Institutional Arrangements", + vocabulary = "JEL codes", + uri = "https://www.aeaweb.org/econlit/jelCodes.php") + + ), + + output_types = list( + + list(type = "Article", + title = "Does Elite Capture Matter Local Elites and Targeted Welfare Programs in Indonesia", + description = "AEA Papers and Proceedings 2019, 109: 334-339", + uri = "https://doi.org/10.1257/pandp.20191047", + doi = "10.1257/pandp.20191047"), + + list(type = "Working Paper", + title = "Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia", + description = "NBER Working Paper No. 18798, February 2013", + uri = "https://www.nber.org/papers/w18798") + + ), + + version_statement = list(version = "1.0", version_date = "2019"), + + language = list( + list(name = "English", code = "EN") + ), + + methods = list( + list(name = "linear regression with large dummy-variable set (areg)"), + list(name = "probit regression"), + list(name = "Test linear hypotheses after estimation") + ), + + software = list( + list(name= "Stata", version = "14") + ), + + reproduction_instructions = "The master do file should run start to finish in less than five minutes from the master do file '0MASTER 20190918.do'. Original data is in data-PUBLISH/originaldata and is all that is needed to run the code; all data in data-PUBLISH/codeddata is created from the coding do files. All results are then created and saved in output-PUBLISH/tables. + + Key Subfolders: + 1. code-PUBLISH: This folder contains all relevant code. The master do file is located here ('0Master20190918.do') as well as the two folders that are necessary for the creation of datasets/coding ('coding_matching' folder) and for the analysis/table creation ('analysis' folder). Users should update the directory on the master file to reflect the location of the directory on their computers once downloaded. Following that, all the data and output files needed to replicate the main findings of the paper (Tables 1A-1D, Table 2 and the 4 Appendix Tables) will be generated. The sub do files provide specific notes on the variables created where relevant. + 2. data-PUBLISH: This folder contains all relevant .dta files. 
The first folder, 'original data' contains the 'Baseline' folder that has the original baseline survey information. Under 'original data' you will also find the 'Others' folder with the randomization results, the 2008 PPLS data and the PODES 2008 village level administrative data. The 'Endline2' folder contains the endline survey information. These datasets have been modified only to mask sensitive information. Finally, the 'codeddata' folder that stores intermediate datasets that are created through the sub 'coding_matching' do files. + 3. log-PUBLISH: This folder contains the latest log file. When users run the master do file, a new log file will automatically be created and stored here. + 4. output-PUBLISH: This folder contains all the tables of the main paper and appendix. When users run the master do file, these tables will be automatically overwritten.", + + confidentiality = "The published materials do not contain confidential information.", + + datasets = list( + + list(name = "Village survey (original data; baseline)", + idno = "", + note = "Stata 14 data files", + access_type = "Public", + uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"), + + list(name = "Village survey (original data; endline)", + idno = "", + note = "Stata 14 data files", + access_type = "Public", + uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"), + + list(name = "Randomization data", + idno = "", + note = "Stata 14 data files", + access_type = "Public", + uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"), + + list(name = "2008 PPLS", + idno = "", + note = "Stata 14 data files", + access_type = "Public", + uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"), + + list(name = "2008 PODES - Village level administrative data", + idno = "", + note = "Stata 14 data files", + access_type = "Public", + uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"), + + list(name = "Coded data (intermediary data files generated by the scripts)", + idno = "", + note = "Stata 14 data files", + access_type = "Public", + uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view") + + ), + + sponsors = list( + + list(name="Australian Aid (World Bank Trust Fund)", + abbr="AusAID", + role="Financial support"), + + list(name="3ie", + grant_no="OW3.1055", + role="Financial support"), + + list(name="NIH", + grant_no="P01 HD061315", + role="Financial support") + + ), + + acknowledgements = list( + + list(name = "Jurist Tan, Talitha Chairunissa, Amri Ilmma, Chaeruddin Kodir, He Yang, and Gabriel Zucker", + role = "Research assistance"), + + list(name = "Scott Guggenheim", + role = "Provided comments"), + + list(name = "Mitra Samya, BPS, TNP2K, and SurveyMeter", + role = "Field cooperation") + + ), + + disclaimer = "Users acknowledge that the original collector of the data, ICPSR, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.", + + scripts = list( + + list(file_name = "0MASTER-20190918.do", + zip_package = "119802-V1.zip", + title = "Master Stata do file", + authors = list(list(name="Rema Hanna, Ben Olken (PIs) and Sam Solomon (RA)")), + format = "Stata do file", + software = "Stata 14", + description = "Master do file; this script calls all do files required to replicate the output from start to finish (in no more than a few minutes)", + notes = "Original data is in data-PUBLISH/originaldata and is all that is 
needed to run the code; all data in data-PUBLISH/codeddata is created from the coding do files. All results are then created and saved in output-PUBLISH/tables."), + + list(file_name = "coding baseline.do", + title = "coding baseline variables", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 1/7"), + + list(file_name = "coding suseti pmt.do", + title = "coding pmt", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 2/7"), + + list(file_name = "coding elite relation.do", + title = "coding additional variables for analysis", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 3/7"), + + list(file_name = "matching hybrid.do", + title = "matching baseline survey data and matching results", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 4/7; Generates poverty density measure"), + + list(file_name = "coding existing social programs.do", + title = "coding existing social programs", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 5/7"), + + list(file_name = "coding kitchen-sink variables.do", + title = "coding miscellaneous variables", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 6/7"), + + list(file_name = "coding_partV_hybrid.do", + title = "coding for part V of analysis plan", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Coding/matching script 7/7"), + + list(file_name = "0 Table 1AB.do", + title = "Table 1: formal vs. informal elites - Panels A and B: historical benefits", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 1/7"), + + list(file_name = "0 Table 1CD.do", + title = "Table 1: formal vs. 
informal elites - Panels C and D: PKH Experiment", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 2/7"), + + list(file_name = "0 Table 2 Appendix Table 3.do", + title = "Table 7: Social welfare simulations", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 3/7"), + + list(file_name = "0 Appendix Table 1A.do", + title = "Table 2A: Elite capture in historical programs", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 4/7"), + + list(file_name = "0 Appendix Table 1B.do", + title = "Table 2B: Elite capture in PKH experiment", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 5/7"), + + list(file_name = "0 Appendix Table 2.do", + title = "Appendix Table 12: Probit Model from Table 7", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 6/7"), + + list(file_name = "0 Appendix Table 4.do", + title = "Appendix Table 13: Social welfare simulations -- PKH - Additional model from Table 7", + zip_package = "119802-V1.zip", + format = "Stata do file", + software = "Stata 14", + description = "Analysis script 7/7"), + + list(file_name = "master_log_09182019.smcl", + title = "Log file - Run of master do file", + zip_package = "119802-V1.zip", + format = "Stata log file", + software = "Stata 14", + description = "Latest log file obtained by running the master do file") + ) + + ) + +) + + +# Publish the project metadata in the NADA catalog + +script_add(idno = id, + metadata = my_project_metadata, + repositoryid = "central", + published = 1, + thumbnail = thumb, + overwrite = "yes") + + +# Add links to ICPSROpen website and AEA website as external resources: + +external_resources_add( + title = "Elite Capture Paper (Alatas et Al., 2019) - Project page - OpenICPSR", + idno = id, + dctype = "web", + file_path = "https://www.openicpsr.org/openicpsr/project/116471/version/V1/view;jsessionid=31C3E76620D0DDD1CABADAA263A1E491", + overwrite = "yes" +) + +external_resources_add( + title = "American Economic Association (AEA) paper: Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia", + idno = id, + dctype = "doc/anl", + file_path = "https://www.aeaweb.org/articles?id=10.1257/pandp.20191047", + overwrite = "yes" +) +``` + +The metadata and all resources (script files, etc.) are now available in the NADA catalog. +@@@@@ redo screenshot when displays external resources + +
+![](./images/script_example1_nada.JPG) +
+
+
+
+### Full example, using Python
+
+
+```python
+# Python example
+```
diff --git a/13_chapter13_related_resources.md b/13_chapter13_related_resources.md
new file mode 100644
index 0000000..07753c0
--- /dev/null
+++ b/13_chapter13_related_resources.md
@@ -0,0 +1,152 @@
+---
+output: html_document
+---
+
+# External resources {#chapter13}
+
+The metadata schemas presented in chapters 4 to 12 of the Guide are intended to document in detail resources of multiple types (data and scripts). When published in a NADA catalog, these metadata will be made visible and searchable. But publishing metadata in an HTML format is not enough. In most cases, you will also want to make files (data files, documents, or others) accessible in your catalog, and provide links to other, related resources. These files will have to be uploaded to your web server, and the links created, with some documentation. These related materials are what is referred to as "external resources".
+
+**External resources** are not a specific type of data. They are resources of any type (data, document, web page, or any other type of resource that can be provided as an electronic file or a web link) that can be attached as a "related resource" to a catalog entry. A schema that is intentionally kept very simple, based on the Dublin Core standard, is used to describe these resources. This schema will never be used independently; it will always be used in combination with one of the other metadata standards and schemas documented in this Guide.
+
+The table below shows some examples of the kind of external resources that may be attached to the metadata of different data types.
+
+| Data type          | Resources that may be documented and published as external resources |
+| ------------------ | ------------------------------------------------------------ |
+| Document           | MS-Excel version of tables included in a publication ; PDF/DOC version of the publication ; visualization files (scripts and image) for visualizations included in the publication ; link to electronic annexes |
+| Microdata          | survey questionnaire ; survey report ; technical documentation (sampling, etc.) ; data entry application ; survey budget in Excel ; microdata files in different formats ; link to an external website |
+| Geographic dataset | link to an interactive web application ; technical documentation in PDF ; data analysis scripts ; publicly accessible data files |
+| Time series        | link to a database query interface ; technical documents ; link to external websites ; visualization scripts |
+| Tables             | link to an organization website ; tabulation scripts ; electronic copy of the table |
+| Images             | image files in different formats and resolutions ; link to a photo album application ; link to a photographer website |
+| Audio recordings   | audio file in MP3 or other format ; transcript in PDF |
+| Videos             | video file in MP4 or other format ; transcript in PDF |
+| Scripts            | publication ; link to a package/library web page ; link to datasets |
+
+Note that a catalog entry (e.g. a document, or a table) can itself be provided as a link (i.e. as an external resource) for another catalog entry.
+
+In a NADA catalog, the external resources will not appear as catalog entries. Their list and description will be displayed (and the resources made accessible) in a "DOWNLOAD" tab for the entry to which they are attached.
+
+![](./images/external_resources_tab_NADA.JPG){width=100%} +
+ +The schema used to document external resources only contains 16 elements. + +
+```json +{ + "dctype": "doc/adm", + "dcformat": "application/zip", + "title": "string", + "author": "string", + "dcdate": "string", + "country": "string", + "language": "string", + "contributor": "string", + "publisher": "string", + "rights": "string", + "description": "string", + "abstract": "string", + "toc": "string", + "filename": "string", + "created": "2023-04-09T19:23:22Z", + "changed": "2023-04-09T19:23:22Z" +} +``` +
+ +**`dctype`** *[Optional, Not Repeatable, String]*
+This element defines the type of external resource being documented. It plays an important role in the cataloguing system (NADA), as it is used to determine where and how the resource will be published. Particular attention must be paid to the type "Microdata File" (`dat/micro`) and to other data types when the datasets will be published in a data catalog with access restrictions. The NADA catalog allows data to be published under different levels of accessibility: open data, direct access, public use files, licensed data, access in data enclave, or no access. Most standards include an element **`access_policy`** which is used to determine the type of access to a resource, and will apply to data of type `dat/micro`. The resource type `dctype` must be selected from a controlled vocabulary:
+
+ - **doc/anl**: Document, Analytical [doc/anl]
+ - **doc/oth**: Document, Other [doc/oth]
+ - **doc/qst**: Document, Questionnaire [doc/qst]
+ - **doc/ref**: Document, Reference [doc/ref]
+ - **doc/rep**: Document, Report [doc/rep]
+ - **doc/tec**: Document, Technical [doc/tec]
+ - **aud**: Audio [aud]
+ - **dat**: Database [dat] (not including microdata)
+ - **map**: Map [map]
+ - **dat/micro**: Microdata File [dat/micro]
+ - **pic**: Photo / image [pic]
+ - **prg**: Program / script [prg]
+ - **tbl**: Table [tbl]
+ - **vid**: Video [vid]
+ - **web**: Web Site [web]
+ +**`dcformat`** *[Optional, Not Repeatable, String]*
+The resource file format. This format can be entered using a controlled vocabulary. Options could include: + + - **application/x-compressed**: Compressed, Generic
+ - **application/zip**: Compressed, ZIP
+ - **application/x-cspro**: Data, CSPro
+ - **application/dbase**: Data, dBase
+ - **application/msaccess**: Data, Microsoft Access
+ - **application/x-sas**: Data, SAS
+ - **application/x-spss**: Data, SPSS
+ - **application/x-stata**: Data, Stata
+ - **text**: Document, Generic
+ - **text/html**: Document, HTML
+ - **application/msexcel**: Document, Microsoft Excel
+ - **application/mspowerpoint**: Document, Microsoft PowerPoint
+ - **application/msword**: Document, Microsoft Word
+ - **application/pdf**: Document, PDF
+ - **application/postscript**: Document, Postscript
+ - **text/plain**: Document, Plain
+ - **text/wordperfect**: Document, WordPerfect
+ - **image/gif**: Image, GIF
+ - **image/jpeg**: Image, JPEG
+ - **image/png**: Image, PNG
+ - **image/tiff**: Image, TIFF
+ +**`title`** *[Required, Not Repeatable, String]*
+The title of the resource. + +**`author`** *[Optional, Not Repeatable, String]*
+The author(s) of the resource. If more than one, separate the names with a ";". + +**`dcdate`** *[Optional, Not Repeatable, String]*
+The date the resource was produced or released, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). + +**`country`** *[Optional, Not Repeatable, String]*
+The country name, if the resource is specific to a country. If more than one, enter the country names separated with a ";". + +**`language`** *[Optional, Not Repeatable, String]*
+The language name. If more than one, enter the language names separated with a ";". + +**`contributor`** *[Optional, Not Repeatable, String]*
+List of contributors (free text). If more than one, enter the names separated with a ";".
+
+**`publisher`** *[Optional, Not Repeatable, String]*
+The publisher(s) of the resource (free text). If more than one, enter the names separated with a ";".
+
+**`rights`** *[Optional, Not Repeatable, String]*
+The rights associated with the resource. + +**`description`** *[Optional, Not Repeatable, String]*
+A brief description of the resource (but not the abstract; see the next element). + +**`abstract`** *[Optional, Not Repeatable, String]*
+An abstract for the resource.
+
+**`toc`** *[Optional, Not Repeatable, String]*
+The table of contents of the resource (if the resource is a publication), entered as free text.
+
+**`filename`** *[Optional, Not Repeatable, String]*
+A file name or a URL.
+
+
+## Example of use of external resources
+
+The "complete examples" provided in the previous chapters included some examples of the use of the "external_resources_add" command (from the NADAR R package) or "..." (from the PyNada Python library). We provide here one more example.
+
+```
+# R example @@@@
+```
+
+
+```
+# Python example @@@@
+```
+
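+As an illustration, a minimal sketch using the NADAR `external_resources_add` function (as used in the previous chapters) could look as follows; the API key, catalog URL, entry identifier, and file name are hypothetical.
+
+```r
+library(nadar)
+
+# Catalog credentials (hypothetical values)
+set_api_key("my_api_key")
+set_api_url("https://my.catalog.org/index.php/api/")
+
+# Attach a PDF report as an external resource to an existing catalog entry
+external_resources_add(
+  title     = "Survey report (PDF)",
+  idno      = "IDN_2019_ECTWP_v01_RR",   # identifier of the catalog entry the resource is attached to
+  dctype    = "doc/rep",                 # Document, Report
+  file_path = "report.pdf",              # a local file or a URL
+  overwrite = "yes"
+)
+```
+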
+ + + + + + + + + + + + + diff --git a/90_annexes.md b/90_annexes.md new file mode 100644 index 0000000..e2de399 --- /dev/null +++ b/90_annexes.md @@ -0,0 +1,602 @@ +--- +output: html_document +--- + +# (APPENDIX) ANNEXES {-} + +# Annex 1: References and links {-} + +**Documents** + +- Asian Development Bank (ADB). 2001. [*Mapping the Spatial Distribution of Poverty Using Satellite Imagery in Thailand*](http://dx.doi.org/10.22617/TCS210112-2) ISBN 978-92-9262-768-3 (print), 978-92-9262-769-0 (electronic), 978-92-9262-770-6 (ebook) +Publication Stock No. TCS210112-2. DOI: http://dx.doi.org/10.22617/TCS210112-2 + +- Balashankar, A., L.Subramanian, and S.P. Fraiberger. 2021. [*Fine-grained prediction of food insecurity using news streams*](https://arxiv.org/pdf/2111.15602.pdf) + +- British Ecological Society. 2017. [*Guide to Reproducible Code in Ecology and Evolution*](https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf) + +- Google. [Google's Search Engine Optimization (SEO) Starter Guide](https://developers.google.com/search/docs/beginner/seo-starter-guide) + +- Jurafsky, Daniel; H. James, Martin. 2000. *Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition*. Upper Saddle River, N.J.: Prentice Hall. ISBN 978-0-13-095069-7 + +- Mikolov, T., K.Chen, G.Corrado, and J.Dean. 2013. [*Efficient Estimation of Word Representations in Vector Space*](https://arxiv.org/abs/1301.3781) + +- Min, B. and Z.O'Keeffe. 2021. http://www-personal.umich.edu/~brianmin/HREA/index.html + +- Priest, G.. 2010. [*The Struggle for Integration and Harmonization of Social Statistics in a Statistical Agency - A Case Study of Statistics Canada*](https://www.ihsn.org/sites/default/files/resources/IHSN-WP004.pdf) + +- Stodden et al. 2013. [*Setting the Default to Reproducible - Reproducibility in Computational and Experimental Mathematics*](http://stodden.net/icerm_report.pdf) + +- Turnbull, D. and J. Berryman. 2016. 
[*Relevant Search: With applications for Solr and Elasticsearch*](https://www.manning.com/books/relevant-search) + + +**Links (standards, schemas, controlled vocabularies)** + +- American Psychological Association (APA): [APA Style (example of specific publications styles for a table)](https://apastyle.apa.org/style-grammar-guidelines/tables-figures/tables) + +- [Consortium of European Social Science Data Archives (CESSDA)](https://www.cessda.eu/) + +- US Census Bureau, CsPro Users Guide: [Parts of a Table](https://www.csprousers.org/help/CSPro/parts_of_a_table.html) + +- [Data Documentation Initiative (DDI) Alliance](https://ddialliance.org/) + +- DDI Alliance, [Data Documentation Initiative (DDI) Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) + +- [Dublin Core Metadata Initiative (DCMI)](https://www.dublincore.org/) + +- eMathZone: [Construction of a Statistical Table](https://www.emathzone.com/tutorials/basic-statistics/construction-of-statistical-table.html) + +- GoFair [(Findable, Accessible, Interoperable and Reusable (FAIR))](https://www.go-fair.org/) + +- [International Household Survey Network (IHSN)](https://www.ihsn.org/) + +- [International Press Telecommunications Council (IPTC)](https://iptc.org/) + +- International Organization for Standardization (ISO) 19139: [Geographic information — Metadata — XML schema implementation](https://www.iso.org/standard/32557.html) + +- [LabWrite: Designing Tables](https://labwrite.ncsu.edu/res/gh/gh-tables.html) + +- [schema.org](https://schema.org/) + +- Microsoft Bing: [Bing Webmaster Tools Help & How-To Center, Bing Webmaster Guidelines](https://www.bing.com/webmasters/help/webmaster-guidelines-30fba23a) + +- [Vedantu: Tabulation](https://www.vedantu.com/commerce/tabulation) + + +**Links (tools)** + +- [CKAN open-source data management system](https://ckan.org/) +- [ElasticSearch](https://github.com/elastic/elasticsearch) +- [GeoNetwork](https://geonetwork-opensource.org/) +- [Milvus](https://milvus.io/)) +- [NADA cataloguing application, web page](https://nada.ihsn.org/) +- [NADA cataloguing application, demo page](https://nada-demo.ihsn.org/index.php/home) +- [NADA cataloguing application, GitHub repository](https://github.com/ihsn/nada/releases/tag/V5.0.3) +- [NADAR package]() +- [Nesstar Publisher (DDI 1.n Metadata Editor](http://www.nesstar.com/) +- [R: The R Project for Statistical Computing](https://www.r-project.org/) +- [R Bookdown](https://bookdown.org/): Write HTML, PDF, ePub, and Kindle books with R Markdown +- [R geometa](https://cran.r-project.org/web/packages/geometa/index.html): Tools for Reading and Writing ISO/OGC Geographic Metadata +- [Solr](https://solr.apache.org/) + +**Links (others)** + +- WorldPop: https://www.worldpop.org/ + + +# Annex 2: Mapping standards and schemas to schema.org {-} + +The use of *structured data* described in section 1.6.2 requires a mapping between the relevant elements of some of the metadata standards and schemas described in the Guide to the schema.org standard. We provide here a suggested selection and mapping for the core set of elements (we do not attempt to map all possible elements that are common to our schemas and schema.org). 
+ +### Microdata + +|schema.org/dataset |DDI CodeBook | Recommendation | +|-----------------------|-----------------|-----------------| +|name | | +|description | | +|url | | +|sameAs | | +|identifier | | +|keywords | | +|license | | +|isAccessibleForFree | | +|hasPart / isPartOf | | +|creator type / url / name / contactPoint / funder | | +|includedInDataCatalog | | +|distribution | | +|temporalCoverage | | +|spatialCoverage | | + + +**Example:** + +```html + + + + + + + +``` + +### Geographic data + +|schema.org/dataset |ISO 19139 | Recommendation | +|-----------------------|-----------------|-----------------| +|name | | +|description | | +|url | | +|sameAs | | +|identifier | | +|keywords | | +|license | | +|isAccessibleForFree | | +|hasPart / isPartOf | | +|creator type / url / name / contactPoint / funder | | +|includedInDataCatalog | | +|distribution | | +|temporalCoverage | | +|spatialCoverage | | + + +**Example:** + + + +### Indicators (and database) + +|schema.org/dataset |INDICATOR schema | Recommendation | +|-----------------------|-----------------|-----------------| +|name | | +|description | | +|url | | +|sameAs | | +|identifier | | +|keywords | | +|license | | +|isAccessibleForFree | | +|hasPart / isPartOf | | +|creator type / url / name / contactPoint / funder | | +|includedInDataCatalog | | +|distribution | | +|temporalCoverage | | +|spatialCoverage | | + + +**Example:** + +### Tables + +|schema.org/dataset |TABLES schema | Recommendation | +|-----------------------|-----------------|-----------------| +|name | | +|description | | +|url | | +|sameAs | | +|identifier | | +|keywords | | +|license | | +|isAccessibleForFree | | +|hasPart / isPartOf | | +|creator type / url / name / contactPoint / funder | | +|includedInDataCatalog | | +|distribution | | +|temporalCoverage | | +|spatialCoverage | | + + +**Example:** + +### Images + +The complete list of elements available in schema.org to document an image object is available at https://schema.org/ImageObject. We only show in the table below a selection of the ones we consder the most relevant and frequently available. Images can be documented either using the IPTC-based schema, or the Dublin Core (DCMI)-based schema. 
+ +|schema.org/dataset |IMAGE schema (IPTC) | Recommendation | +|-----------------------|--------------------|-----------------| +| name | | +| abstract | | +| creator | | +| provider | | +| sourceOrganization | | +| dateCreated | | +| keywords | | +| contentLocation | | +| contentReferenceTime | | +| copyrightHolder | | +| copyrightNotice | | +| copyrightYear | | +| creditText | | +| isAccessibleForFree | | +| license | | +| acquireLicensePage | | +| contentUrl | | + + +|schema.org/dataset |IMAGE schema (DCMI) | Recommendation | +|-----------------------|--------------------|-----------------| +| name | | +| abstract | | +| creator | | +| provider | | +| sourceOrganization | | +| dateCreated | | +| keywords | | +| contentLocation | | +| contentReferenceTime | | +| copyrightHolder | | +| copyrightNotice | | +| copyrightYear | | +| creditText | | +| isAccessibleForFree | | +| license | | +| acquireLicensePage | | +| contentUrl | | + + +**Example:** + + + + Residents get water from an artesian well, Sindh, Pakistan + + + + + + +# Annex 3: Mapping the microdata schema to the DDI Codebook 2.5 {-} + +|JSON Schema |DDI/XML CodeBook 2.5 |Title | +|----------------------------------------------------------------------|----------------------------------------------------------------------------|-----------------------------------------------------------------------| +|doc_desc |docDscr | | +|doc_desc/title |docDscr/citation/titlStmt/titl |Document title | +|doc_desc/idno |docDscr/citation/titlStmt/IDNo |Unique ID number for the document | +|doc_desc/producers |docDscr/citation/prodStmt/producer |Producers | +|- name |. |Name | +|- abbr |- abbr |Abbreviation | +|- affiliation |- affiliation |Affiliation | +|- role |- role |Role | +|doc_desc/prod_date |docDscr/citation/prodStmt/prodDate |Date of Production | +|doc_desc/version_statement |docDscr/citation/verStmt |Version Statement | +|doc_desc/version_statement/version |docDscr/citation/verStmt/version |Version | +|doc_desc/version_statement/version_date |docDscr/citation/verStmt/version/@date |Version Date | +|doc_desc/version_statement/version_resp |docDscr/citation/verStmt/verResp |Version Responsibility Statement | +|doc_desc/version_statement/version_notes |docDscr/citation/verStmt/notes |Version Notes | +|study_desc |stdyDscr | | +|study_desc/title_statement |stdyDscr/citation/titlStmt | | +|study_desc/title_statement/idno |stdyDscr/citation/titlStmt/IDNo |Unique user defined ID | +|study_desc/title_statement/identifiers | |Other identifiers | +|- type | |Identifier type | +|- identifier | |Identifier | +|study_desc/title_statement/title |stdyDscr/citation/titlStmt/titl |Survey title | +|study_desc/title_statement/sub_title |stdyDscr/citation/titlStmt/subTitl |Survey subtitle | +|study_desc/title_statement/alternate_title |stdyDscr/citation/titlStmt/altTitl |Abbreviation or Acronym | +|study_desc/title_statement/translated_title |stdyDscr/citation/titlStmt/parTitl |Translated Title | +|study_desc/authoring_entity |stdyDscr/citation/rspStmt/AuthEnty |Authoring entity/Primary investigators | +|- name |. |Agency Name | +|- affiliation |- affiliation |Affiliation | +|study_desc/oth_id |stdyDscr/citation/rspStmt/othId |Other Identifications/Acknowledgments | +|- name |. |Name | +|- role |- role |Role | +|- affiliation |- affiliation |Affiliation | +|study_desc/production_statement |stdyDscr/citation/prodStmt |Production Statement | +|study_desc/production_statement/producers |stdyDscr/citation/prodStmt/producer |Producers | +|- name |. 
|Name | +|- abbr |- abbr |Abbreviation | +|- affiliation |- affiliation |Affiliation | +|- role |- role |Role | +|study_desc/production_statement/copyright |stdyDscr/citation/prodStmt/copyright |Copyright | +|study_desc/production_statement/prod_date |stdyDscr/citation/prodStmt/prodDate |Production Date | +|study_desc/production_statement/prod_place |stdyDscr/citation/prodStmt/prodPlac |Production Place | +|study_desc/production_statement/funding_agencies |stdyDscr/citation/prodStmt/fundAg |Funding Agency/Sponsor | +|- name |. |Funding Agency/Sponsor | +|- abbr |- abbr |Abbreviation | +|- grant |- stdyDscr/citation/prodStmt/fundAg |Grant Number | +|- role |- role |Role | +|study_desc/distribution_statement |stdyDscr/citation/distStmt |Distribution Statement | +|study_desc/distribution_statement/distributors |stdyDscr/citation/distStmt/distrbtr |Distributor | +|- name |. |Organization name | +|- abbr |- abbr |Abbreviation | +|- affiliation |- affiliation |Affiliation | +|- uri |- uri |URI | +|study_desc/distribution_statement/contact |stdyDscr/citation/distStmt/contact |Contact | +|- name |. |Name | +|- affiliation |- affiliation |Affiliation | +|- email |- email |Email | +|- uri |- uri |URI | +|study_desc/distribution_statement/depositor |stdyDscr/citation/distStmt/depositr |Depositor | +|- name |. |Name | +|- abbr |- abbr |Abbreviation | +|- affiliation |- affiliation |Affiliation | +|- uri | |URI | +|study_desc/distribution_statement/deposit_date |stdyDscr/citation/distStmt/depDate |Date of Deposit | +|study_desc/distribution_statement/distribution_date |stdyDscr/citation/distStmt/distDate |Date of Distribution | +|study_desc/series_statement |stdyDscr/citation/serStmt |Series Statement | +|study_desc/series_statement/series_name |stdyDscr/citation/serStmt/serName |Series Name | +|study_desc/series_statement/series_info |stdyDscr/citation/serStmt/serInfo |Series Information | +|study_desc/version_statement |stdyDscr/citation/verStmt |Version Statement | +|study_desc/version_statement/version |stdyDscr/citation/verStmt/version |Version | +|study_desc/version_statement/version_date |stdyDscr/citation/verStmt/version/@date |Version Date | +|study_desc/version_statement/version_resp |stdyDscr/citation/verStmt/verResp |Version Responsibility Statement | +|study_desc/version_statement/version_notes |stdyDscr/citation/verStmt/notes |Version Notes | +|study_desc/bib_citation |stdyDscr/citation/biblCit |Bibliographic Citation | +|study_desc/bib_citation_format |stdyDscr/citation/biblCit/@format |Bibliographic Citation Format | +|study_desc/holdings |stdyDscr/citation/holdings |Holdings Information | +|- name |. |Name | +|- location |- location |Location | +|- callno |- callno |Callno | +|- uri |- uri |URI | +|study_desc/study_notes |stdyDscr/citation/notes |Study notes | +|study_desc/study_authorization |stdyDscr/studyAuthorization |Study Authorization | +|study_desc/study_authorization/date |stdyDscr/studyAuthorization/@date |Authorization Date | +|study_desc/study_authorization/agency |stdyDscr/studyAuthorization/authorizingAgency |Authorizing Agency | +|- name |. 
|Funding Agency/Sponsor | +|- affiliation |- affiliation |Affiliation | +|- abbr |- abbr |Abbreviation | +|study_desc/study_authorization/authorization_statement |stdyDscr/studyAuthorization/authorizationStatement |Authorization Statement | +|study_desc/study_info |stdyDscr/stdyInfo |Study Scope | +|study_desc/study_info/study_budget |stdyDscr/stdyInfo/studyBudget |Study Budget | +|study_desc/study_info/keywords |stdyDscr/stdyInfo/subject/keyword | | +|- keyword |. |Keyword | +|- vocab |- vocab |Vocabulary | +|- uri |- vocabURI |uri | +|study_desc/study_info/topics |stdyDscr/stdyInfo/subject/topcClas |Topic Classification | +|- topic |. |Topic | +|- vocab |- vocab |Vocab | +|- uri |- vocabURI |URI | +|study_desc/study_info/abstract |stdyDscr/stdyInfo/abstract |Abstract | +|study_desc/study_info/time_periods |stdyDscr/stdyInfo/sumDscr/timePrd |Time periods (YYYY/MM/DD) | +|- start | |Start date | +|- end | |End date | +|- cycle | |Cycle | +|study_desc/study_info/coll_dates |stdyDscr/stdyInfo/sumDscr/collDate |Dates of Data Collection (YYYY/MM/DD) | +|- start | |Start date | +|- end | |End date | +|- cycle | |Cycle | +|study_desc/study_info/nation |stdyDscr/stdyInfo/sumDscr/nation |Country | +|- name |. |Name | +|- abbreviation |- abbr |Country code | +|study_desc/study_info/bbox |stdyDscr/sumDscr/geoBndBox |Geographic bounding box | +|- west |- westBL |West | +|- east |- eastBL |East | +|- south |- southBL |South | +|- north |- northBL |North | +|study_desc/study_info/bound_poly |stdyDscr/sumDscr/boundPoly/polygon/point |Geographic Bounding Polygon | +|- lat |gringLat |Latitude | +|- lon |gringLon |longitude | +|study_desc/study_info/geog_coverage |stdyDscr/stdyInfo/sumDscr/geogCover |Geographic Coverage | +|study_desc/study_info/geog_coverage_notes |stdyDscr/sumDscr/geogCover/txt |Geographic Coverage notes | +|study_desc/study_info/geog_unit |stdyDscr/stdyInfo/sumDscr/geogUnit |Geographic Unit | +|study_desc/study_info/analysis_unit |stdyDscr/stdyInfo/sumDscr/anlyUnit |Unit of Analysis | +|study_desc/study_info/universe |stdyDscr/stdyInfo/sumDscr/universe |Universe | +|study_desc/study_info/data_kind |stdyDscr/stdyInfo/sumDscr/dataKind |Kind of Data | +|study_desc/study_info/notes |stdyDscr/stdyInfo/notes |Study notes | +|study_desc/study_info/quality_statement |stdyDscr/stdyInfo/qualityStatement |Quality Statement | +|study_desc/study_info/quality_statement/compliance_description |stdyDscr/stdyInfo/qualityStatement/standardsCompliance/complianceDescription|Standard compliance description | +|study_desc/study_info/quality_statement/standards |stdyDscr/stdyInfo/qualityStatement/standardsCompliance/standard |Standards | +|- name |standardName |Name | +|- producer |producer * |Producer | +|study_desc/study_info/quality_statement/other_quality_statement |stdyDscr/stdyInfo/qualityStatement/otherQualityStatement |Other quality statement | +|study_desc/study_info/ex_post_evaluation |stdyDscr/stdyInfo/exPostEvaluation |Ex-Post Evaluation | +|study_desc/study_info/ex_post_evaluation/completion_date |stdyDscr/stdyInfo/exPostEvaluation/@completionDate |Evaluation completion date | +|study_desc/study_info/ex_post_evaluation/type |stdyDscr/stdyInfo/@type |Evaluation type | +|study_desc/study_info/ex_post_evaluation/evaluator |stdyDscr/stdyInfo/exPostEvaluation/evaluator |Evaluators | +|- name |. 
|Funding Agency/Sponsor | +|- affiliation |- affiliation |Affiliation | +|- abbr |- abbr |Abbreviation | +|- role |- role |Role | +|study_desc/study_info/ex_post_evaluation/evaluation_process |stdyDscr/stdyInfo/exPostEvaluation/evaluationProcess |Evaluation process | +|study_desc/study_info/ex_post_evaluation/outcomes |stdyDscr/stdyInfo/exPostEvaluation/outcomes |Outcomes | +|study_desc/study_development |stdyDscr/studyDevelopment |Study Development | +|study_desc/study_development/development_activity |stdyDscr/studyDevelopment/developmentActivity |Development activity | +|- activity_type |. |Development activity type | +|- activity_description |- description |Development activity description | +|- participants |- participants |Participants | +|- resources |- resources |Development activity resources | +|- outcome |- outcome |Development Activity Outcome | +|study_desc/method |stdyDscr/method |Methodology and Processing | +|study_desc/method/data_collection |stdyDscr/method/dataColl |Data Collection | +|study_desc/method/data_collection/time_method |stdyDscr/method/dataColl/timeMeth |Time Method | +|study_desc/method/data_collection/data_collectors |stdyDscr/method/dataColl/dataCollector |Data Collectors | +|- name |. |Name | +|- affiliation | |Affiliation | +|- abbr | |Abbreviation | +|- role | |Role | +|study_desc/method/data_collection/collector_training |stdyDscr/method/dataColl/collectorTraining |Collector training | +|- type |@type |Training type | +|- training |. |Training | +|study_desc/method/data_collection/frequency |stdyDscr/method/dataColl/frequenc |Frequency of Data Collection | +|study_desc/method/data_collection/sampling_procedure |stdyDscr/method/dataColl/sampProc |Sampling Procedure | +|study_desc/method/data_collection/sample_frame |stdyDscr/method/dataColl/sampleFrame |Sample Frame | +|study_desc/method/data_collection/sample_frame/name |stdyDscr/method/dataColl/sampleFrame/sampleFrameName |Sample frame name | +|study_desc/method/data_collection/sample_frame/valid_period |stdyDscr/method/dataColl/sampleFrame/validPeriod |Valid periods (YYYY/MM/DD) | +|- event | |Event | +|- date | |Date | +|study_desc/method/data_collection/sample_frame/custodian |stdyDscr/method/dataColl/sampleFrame/custodian |Custodian | +|study_desc/method/data_collection/sample_frame/universe |stdyDscr/method/dataColl/sampleFrame/universe |Universe | +|study_desc/method/data_collection/sample_frame/frame_unit |stdyDscr/method/dataColl/sampleFrame/frameUnit |Frame unit | +|study_desc/method/data_collection/sample_frame/frame_unit/is_primary |stdyDscr/method/dataColl/sampleFrame/frameUnit/@isPrimary |Is Primary | +|study_desc/method/data_collection/sample_frame/frame_unit/unit_type |stdyDscr/method/dataColl/sampleFrame/frameUnit/unitType |Unit Type | +|study_desc/method/data_collection/sample_frame/frame_unit/num_of_units|stdyDscr/method/dataColl/sampleFrame/frameUnit/@numberOfUnits |Number of units | +|study_desc/method/data_collection/sample_frame/reference_period |stdyDscr/method/dataColl/sampleFrame/referencePeriod |Reference periods (YYYY/MM/DD) | +|- event | |Event | +|- date | |Date | +|study_desc/method/data_collection/sample_frame/update_procedure |stdyDscr/method/dataColl/sampleFrame/updateProcedure |Update procedure | +|study_desc/method/data_collection/sampling_deviation |stdyDscr/method/dataColl/deviat |Deviations from the Sample Design | +|study_desc/method/data_collection/coll_mode |stdyDscr/method/dataColl/collMode |Mode of data collection | 
+|study_desc/method/data_collection/research_instrument |stdyDscr/method/dataColl/resInstru |Type of Research Instrument | +|study_desc/method/data_collection/instru_development |stdyDscr/method/dataColl/instrumentDevelopment |Instrument development | +|study_desc/method/data_collection/instru_development_type |stdyDscr/method/dataColl/instrumentDevelopment/@type |Instrument development type | +|study_desc/method/data_collection/sources |stdyDscr/method/dataColl/sources |Sources | +|- name | |Source name | +|- origin | |Origin of Source | +|- characteristics | |Characteristics of Source Noted | +|study_desc/method/data_collection/coll_situation |stdyDscr/method/dataColl/collSitu |Characteristics of Data Collection Situation - Notes on data collection| +|study_desc/method/data_collection/act_min |stdyDscr/method/dataColl/actMin |Supervision | +|study_desc/method/data_collection/control_operations |stdyDscr/method/dataColl/ConOps |Control Operations | +|study_desc/method/data_collection/weight |stdyDscr/method/dataColl/weight |Weighting | +|study_desc/method/data_collection/cleaning_operations |stdyDscr/method/dataColl/cleanOps |Cleaning Operations | +|study_desc/method/method_notes |stdyDscr/method/notes |Methodology notes | +|study_desc/method/analysis_info |stdyDscr/method/anlyInfo |Data Appraisal | +|study_desc/method/analysis_info/response_rate |stdyDscr/method/anlyInfo/respRate |Response Rate | +|study_desc/method/analysis_info/sampling_error_estimates |stdyDscr/method/anlyInfo/EstSmpErr |Estimates of Sampling Error | +|study_desc/method/analysis_info/data_appraisal |stdyDscr/method/anlyInfo/dataAppr |Data Appraisal | +|study_desc/method/study_class |stdyDscr/method/stdyClas |Class of the Study | +|study_desc/method/data_processing |stdyDscr/method/dataProcessing |Data Processing | +|- type | |Data processing type | +|- description | |Data processing description | +|study_desc/method/coding_instructions |stdyDscr/method/codingInstructions |Coding Instructions | +|- related_processes | |Related processes | +|- type | |Coding instructions type | +|- txt | |Coding instructions text | +|- command | |Command | +|- formal_language | |Identify the language of the command code | +|study_desc/data_access |stdyDscr/dataAccs/setAvail/dataAccs | | +|study_desc/data_access/dataset_availability |stdyDscr/dataAccs/setAvail |Data Set Availability | +|study_desc/data_access/dataset_availability/access_place |stdyDscr/dataAccs/setAvail/accsPlac |Location of Data Collection | +|study_desc/data_access/dataset_availability/access_place_url |stdyDscr/dataAccs/setAvail/accsPlac/@URI |URL for Location of Data Collection | +|study_desc/data_access/dataset_availability/original_archive |stdyDscr/dataAccs/setAvail/origArch |Archive where study is originally stored | +|study_desc/data_access/dataset_availability/status |stdyDscr/dataAccs/setAvail/avlStatus |Availability Status | +|study_desc/data_access/dataset_availability/coll_size |stdyDscr/dataAccs/setAvail/collSize |Extent of Collection | +|study_desc/data_access/dataset_availability/complete |stdyDscr/dataAccs/setAvail/complete |Completeness of Study Stored | +|study_desc/data_access/dataset_availability/file_quantity |stdyDscr/dataAccs/setAvail/fileQnty |Number of Files | +|study_desc/data_access/dataset_availability/notes |stdyDscr/dataAccs/setAvail/notes |Notes | +|study_desc/data_access/dataset_use |stdyDscr/dataAccs/useStmt |Data Set Availability | +|study_desc/data_access/dataset_use/conf_dec |stdyDscr/dataAccs/useStmt/confDec |Confidentiality 
Declaration | +|- txt |. |Confidentiality declaration text | +|- required |- required |Is signing of a confidentiality declaration required? | +|- form_url |- URI |Confidentiality declaration form URL | +|- form_id |- formNo |Form ID | +|study_desc/data_access/dataset_use/spec_perm |stdyDscr/dataAccs/useStmt/specPerm |Special Permissions | +|- txt | |Special permissions description | +|- required |- required |Indicate if special permissions are required to access a resource | +|- form_url |- URI |Form URL | +|- form_id |- formNo |Form ID | +|study_desc/data_access/dataset_use/restrictions |stdyDscr/dataAccs/useStmt/restrctn |Restrictions | +|study_desc/data_access/dataset_use/contact |stdyDscr/dataAccs/useStmt/contact |Contact | +|- name |. |Name | +|- affiliation |- affiliation |Affiliation | +|- uri |- URI |URI | +|- email |- email |Email | +|study_desc/data_access/dataset_use/cit_req |stdyDscr/dataAccs/useStmt/citReq |Citation requirement | +|study_desc/data_access/dataset_use/deposit_req |stdyDscr/dataAccs/useStmt/deposReq |Deposit requirement | +|study_desc/data_access/dataset_use/conditions |stdyDscr/dataAccs/useStmt/conditions |Conditions | +|study_desc/data_access/dataset_use/disclaimer |stdyDscr/dataAccs/useStmt/disclaimer |Disclaimer | +|study_desc/data_access/notes |stdyDscr/dataAccs/setAvail/notes |Notes | +|data_files | | | +|variables | | | +|variable_groups | |Variable groups | + +# Annex 4: Mapping the geographic schema to DCAT/schema.org {-} + +[to do] + +# Annex 5: Mapping the indicator/time series schema to schema.org {-} + +[to do] + +# Annex 6: Mapping the table schema to schema.org {-} + +[to do] + +# Annex 7: Mapping the image schema to Dublin Core, IPTC, and schema.org {-} + +[to do] + +# Annex 8: Mapping the audio schema to Dublin Core and schema.org {-} + +[to do] + +# Annex 9: Mapping the video schema to Dublin Core and schema.org {-} + +[to do] + +# Annex 10: Mapping the research/script schema to Dublin Core and schema.org {-} + +[to do] + + diff --git a/annex-1-references-and-links.html b/annex-1-references-and-links.html new file mode 100644 index 0000000..6579ab2 --- /dev/null +++ b/annex-1-references-and-links.html @@ -0,0 +1,617 @@ + + + + + + + Annex 1: References and links | [DRAFT - WORK IN PROGRESS] Metadata Standards and Schemas for Improved Data Discoverability and Usability + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Annex 2: Mapping standards and schemas to schema.org

+

The use of structured data described in section 1.6.2 requires mapping the relevant elements of the metadata standards and schemas described in this Guide to the schema.org standard. We provide here a suggested selection and mapping for a core set of elements (we do not attempt to map every element that our schemas and schema.org have in common).

+
+

1.1 Microdata

| schema.org/dataset | DDI CodeBook | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:

+
<html>
+  <head>
+    <script type="application/ld+json">
+    {
+      "@context":"https://schema.org/",
+      "@type":"Dataset",
+      "name":"Albania Living Standards Measurement Survey 2012 (LSMS 2012)",
+      "description":"The Living Standards Measurement Survey (LSMS) is a multi-purpose household survey conducted to measure living conditions and poverty situation, and to help policymakers in monitoring and developing social programs. LSMS has been carried out in Albania in the context of continuing monitoring of poverty and the creation of policy evaluation system in the framework of the National Strategy for Development and Integration (previously the National Strategy for Economic and Social Development). The first Albania LSMS was conducted in 2002, followed by 2003, 2004, 2005, 2008 and 2012 surveys. In 2012, 6,671 households participated in the survey.",
+      "url":"https://microdata.worldbank.org/index.php/catalog/1970",
+      "identifier": ["ALB_2012_LSMS_v01_M_v01_A_PUF"],
+      "keywords":[
+         "demographic characteristics",
+         "education",
+         "communication",
+         "labor",
+         "employment",
+         "non-farm business",
+         "migration",
+         "remittances",
+         "subjective poverty",
+         "health",
+         "fertility",
+         "non-food expenditures",
+         "dwelling",
+         "utilities",
+         "durable goods",
+         "daily food consumption"
+      ],
+      "license" : "",
+      "isAccessibleForFree" : true,
+      "creator":[
+         {
+            "@type":"Organization",
+            "url": "http://www.instat.gov.al/en/",
+            "name":"Institute of Statistics of Albania",
+            "contactPoint":{
+               "@type":"ContactPoint",
+               "email":"info@instat.gov.al"
+            }
+         },
+         {
+            "@type":"Organization",
+            "url": "https://www.worldbank.org/",
+            "name":"World Bank",
+            "contactPoint":{
+               "@type":"ContactPoint",
+               "contactType": "LSMS technical support",
+               "email":"lsms@worldbank.org"
+            }
+         }
+      ],
+      "funder":{
+         "@type": "Organization",
+         "name": "World Bank"
+      },
+      "includedInDataCatalog":{
+         "@type":"DataCatalog",
+         "name":"World Bank Microdata Library",
+         "url":"https://microdata.worldbank.org/index.php/home"
+      },
+      "distribution":[
+         {
+            "@type":"DataDownload",
+            "encodingFormat":"SPSS Windows (.sav)",
+            "contentUrl":"http://www.instat.gov.al/en/figures/micro-data/"
+         }
+      ],
+      "temporalCoverage":"2012",
+      "spatialCoverage":{
+         "@type":"Place",
+         "name": "Albania"
+      }
+    }
+    </script>
+  </head>
+  <body>
+  </body>
+</html>
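
The mapping above can also be applied programmatically when a catalog generates its pages. The sketch below is a minimal illustration, in Python, of how a metadata record structured according to the microdata schema (field names follow the schema shown in Annex 3; the sample record itself is invented) could be converted into a schema.org `Dataset` JSON-LD block. It is a sketch under those assumptions, not the implementation used by any particular catalog.

```python
import json

def ddi_to_schema_org(ddi: dict) -> dict:
    """Map a few core elements of a microdata (DDI-style) record to a schema.org Dataset."""
    study = ddi.get("study_desc", {})
    info = study.get("study_info", {})
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": study.get("title_statement", {}).get("title"),
        "identifier": study.get("title_statement", {}).get("idno"),
        "description": info.get("abstract"),
        "keywords": [k.get("keyword") for k in info.get("keywords", [])],
        "temporalCoverage": (info.get("coll_dates") or [{}])[0].get("start"),
        "spatialCoverage": {
            "@type": "Place",
            "name": ", ".join(n.get("name", "") for n in info.get("nation", [])),
        },
        "creator": [
            {"@type": "Organization", "name": a.get("name")}
            for a in study.get("authoring_entity", [])
        ],
    }

# Invented sample record, for illustration only.
record = {"study_desc": {
    "title_statement": {"title": "Example Survey 2012", "idno": "EXA_2012_v01"},
    "study_info": {"abstract": "An illustrative record.",
                   "keywords": [{"keyword": "health"}],
                   "coll_dates": [{"start": "2012-01"}],
                   "nation": [{"name": "Albania"}]},
    "authoring_entity": [{"name": "National Statistics Office"}]}}

json_ld = json.dumps(ddi_to_schema_org(record), indent=2)
print(f'<script type="application/ld+json">\n{json_ld}\n</script>')
```
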
+
+
+

1.2 Geographic data

| schema.org/dataset | ISO 19139 | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:

+
+
+

1.3 Indicators (and database)

| schema.org/dataset | INDICATOR schema | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:

+
+
+

1.4 Tables

| schema.org/dataset | TABLES schema | Recommendation |
|---|---|---|
| name | | |
| description | | |
| url | | |
| sameAs | | |
| identifier | | |
| keywords | | |
| license | | |
| isAccessibleForFree | | |
| hasPart / isPartOf | | |
| creator type / url / name / contactPoint / funder | | |
| includedInDataCatalog | | |
| distribution | | |
| temporalCoverage | | |
| spatialCoverage | | |

Example:

+
+
+

1.5 Images

+

The complete list of elements available in schema.org to document an image object is available at https://schema.org/ImageObject. The tables below show only a selection of the elements we consider most relevant and most frequently available. Images can be documented either using the IPTC-based schema or the Dublin Core (DCMI)-based schema.

| schema.org/ImageObject | IMAGE schema (IPTC) | Recommendation |
|---|---|---|
| name | | |
| abstract | | |
| creator | | |
| provider | | |
| sourceOrganization | | |
| dateCreated | | |
| keywords | | |
| contentLocation | | |
| contentReferenceTime | | |
| copyrightHolder | | |
| copyrightNotice | | |
| copyrightYear | | |
| creditText | | |
| isAccessibleForFree | | |
| license | | |
| acquireLicensePage | | |
| contentUrl | | |

| schema.org/ImageObject | IMAGE schema (DCMI) | Recommendation |
|---|---|---|
| name | | |
| abstract | | |
| creator | | |
| provider | | |
| sourceOrganization | | |
| dateCreated | | |
| keywords | | |
| contentLocation | | |
| contentReferenceTime | | |
| copyrightHolder | | |
| copyrightNotice | | |
| copyrightYear | | |
| creditText | | |
| isAccessibleForFree | | |
| license | | |
| acquireLicensePage | | |
| contentUrl | | |

Example:

[The rendered example shows a photograph captioned "Residents get water from an artesian well, Sindh, Pakistan"; the rest of the example did not survive the HTML rendering.]
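
As an illustration of how the elements listed in the tables above could be populated, the sketch below builds a schema.org `ImageObject` description in Python for that photograph. Only the caption comes from the original example; every other value (photographer, organization, dates, URLs, license) is a placeholder, not the original metadata.

```python
import json

# Illustrative ImageObject record; all values except the caption are placeholders.
image_metadata = {
    "@context": "https://schema.org/",
    "@type": "ImageObject",
    "name": "Residents get water from an artesian well, Sindh, Pakistan",
    "creator": {"@type": "Person", "name": "Jane Doe"},                      # placeholder photographer
    "sourceOrganization": {"@type": "Organization", "name": "Example Photo Library"},
    "dateCreated": "2010-09-15",                                             # placeholder date
    "keywords": ["water", "artesian well", "Sindh", "Pakistan"],
    "contentLocation": {"@type": "Place", "name": "Sindh, Pakistan"},
    "copyrightNotice": "© Example Photo Library",                            # placeholder rights statement
    "isAccessibleForFree": True,
    "license": "https://creativecommons.org/licenses/by/4.0/",               # placeholder license
    "contentUrl": "https://images.example.org/sindh-artesian-well.jpg",      # placeholder URL
}

print(json.dumps(image_metadata, indent=2, ensure_ascii=False))
```
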

Chapter 1 The challenge of finding and assessing, accessing, and using data

+

In the realm of data sharing policies adopted by numerous national and international organizations, a common challenge arises for researchers and other data users: the practicality of finding, accessing, and using data. Navigating through an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and effectively analyzing them. This challenge is characterized by issues such as insufficient metadata, limitations of data discovery systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the following sections, we will delve into these challenges.

+
+

1.1 Finding and assessing data

+

Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks, often referred to as tribal knowledge, to locate and obtain the data they require. This may lead to the use of convenient data that may not be the most relevant. Others may encounter datasets of interest in academic publications, which can be challenging due to the inconsistent or non-standardized citation of datasets. However, most data users use general search engines or turn to specialized data catalogs to discover relevant data resources.

+

Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, such as a query for “population of India in 2023,” yield instant informative responses (though not always from the most authoritative source). Even less direct queries, like “indicators of malnutrition in Yemen,” return adequate responses, as the engine can “understand” concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and the underweight population. Additionally, generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although it is not without the risk of errors and inaccuracies. However, these search engines may not be optimized to identify the most relevant data when the user’s requirements cannot be expressed in the form of a straightforward query. For instance, internet search engines might offer limited assistance to a researcher seeking “satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition.”

+

While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking pertinent data. Nonetheless, the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal search indexes and algorithms. With the rapid advancements in AI-based solutions, many of which are available as open-source software, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems.

+

The solution to improve data discoverability involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within specialized data catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines index and use to identify and locate data of interest.

+

Metadata is the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access to both relevant datasets and the metadata essential for evaluating the data’s suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should allocate resources and time exclusively to obtain data that is known to be of high quality and relevance. Evaluating a dataset’s fitness for a specific purpose necessitates different metadata elements for various data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is minimal, the dataset would not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time.

+
+
+

1.2 Accessing data

+

Accessing data is a multifaceted challenge that encompasses legal, ethical, and practical considerations. To ensure that data access is lawful, ethical, efficient, and enables relevant and responsible use of the data, data providers and users must adhere to specific principles and practices:

+
    +
  • Data providers must ensure that they possess the legal rights to share the data and define clear usage rights for data users.
  • +
  • Data users must understand how they can use the data, whether for research, commercial purposes, or other applications, and they must strictly adhere to the terms of use.
  • +
  • Data access must comply with data privacy laws and ethical standards. Sensitive or personally identifiable information must be handled with care to protect individuals’ privacy.
  • +
  • Data providers must furnish comprehensive metadata that provides context and a full understanding of the data. Metadata should include details about the data’s provenance, encompassing its history, transformations, and processing steps. Understanding how the data was created and modified is essential for accurate and responsible analysis.
  • +
  • Data should be available in user-friendly formats compatible with common data analysis tools, such as CSV, JSON, or Excel.
  • +
  • Data should be accessible through various means, accommodating users' preferences and capacities. This may involve offering downloadable files, providing access through web-based tools, and supporting data streaming.
  • APIs are essential for enabling programmable data access, allowing researchers to retrieve and manipulate data programmatically and to integrate it into their research workflows and applications (see the sketch after this list).
  • +
+
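
To make the last point concrete, the snippet below sketches how a researcher might query a catalog and retrieve data programmatically. The endpoint, parameters, and response fields are hypothetical (loosely modeled on typical catalog search APIs), not a reference to any specific platform's API.

```python
import requests

BASE_URL = "https://data.example.org/api"  # hypothetical catalog API endpoint

# 1. Search the catalog for candidate datasets (parameter names are illustrative).
resp = requests.get(f"{BASE_URL}/catalog/search",
                    params={"keywords": "unemployment rate", "country": "Albania", "format": "json"},
                    timeout=30)
resp.raise_for_status()
hits = resp.json().get("results", [])

# 2. Inspect metadata before deciding whether to download anything.
for hit in hits[:5]:
    print(hit.get("id"), "-", hit.get("title"), "-", hit.get("year_range"))

# 3. Retrieve the data for a selected entry (assuming the API exposes a CSV export).
if hits:
    data_resp = requests.get(f"{BASE_URL}/datasets/{hits[0]['id']}/export",
                             params={"format": "csv"}, timeout=60)
    with open("dataset.csv", "wb") as f:
        f.write(data_resp.content)
```
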

Data users in developing countries often encounter additional challenges in accessing data, including:

+
    +
  • Lack of resources: Researchers in developing countries may lack the financial resources to purchase data or access data stored in expensive cloud-based repositories.
  • +
  • Lack of infrastructure: Researchers in developing countries may lack access to the high-speed internet and computing resources required for working with large datasets.
  • +
  • Lack of expertise: Researchers in developing countries may lack the expertise to work with complex data formats and to use data analysis tools.

These specific challenges should be considered when developing data dissemination systems.
+
+

1.3 Using data

+

The challenge for data users extends beyond discovering data to obtaining all the necessary information for a comprehensive understanding of the data and for responsible and appropriate use. A single indicator label, such as “unemployment rate (%),” can obscure significant variations by country, source, and time. The international recommendations for the definition and calculation of the “unemployment rate” have evolved over time, and not all countries employ the same data collection instrument (e.g., labor force surveys) to gather the underlying data. Detailed metadata should always accompany data on online data dissemination platforms. This association should be close; relevant metadata should ideally be no more than one click away from the data. This is particularly crucial when a platform publishes data from multiple sources that are not fully harmonized.

+
+

The scope and meaning of labor statistics, in general, are determined by their source and methodology, which holds true for the unemployment rate. To interpret the data accurately, it is crucial to understand what the data convey, how they were collected and constructed, and to have information on the relevant metadata. The design and characteristics of the data source, typically a labor force survey or a similar household survey for the unemployment rate, especially in terms of definitions and concepts used, geographical and age coverage, and reference periods, have significant implications for the resulting data. Taking these aspects into account is essential when analyzing the statistics. Additionally, it is crucial to seek information on any methodological changes and breaks in series to assess their impact on trend analysis and to keep in mind methodological differences across countries when conducting cross-country studies. (From Quick guide on interpreting the unemployment rate, International Labour Office – Geneva: ILO, 2019, ISBN: 978-92-2-133323-4 (web pdf)).

+
+

Whenever possible, reproducible or replicable scripts used with the data, along with the analytical output of these scripts, should be published alongside the data. These scripts can be highly valuable to researchers who wish to expand the scope of previous data analysis or reuse parts of the code, and to students who can learn from reading and replicating the work of experienced analysts. To enhance data usability, we have developed a specific metadata schema for documenting research projects and scripts.

+
+
+

1.4 A FAIR solution

+

To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, search engines, and increased accessibility. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469).

+

It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To ensure data is findable, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more accessible. Moreover, multiple modes of data access should be available to enhance accessibility, while data should be made interoperable to promote data sharing and reusability. Detailed metadata, including fitness-for-purpose assessments, should be displayed alongside scripts and permanent availability options, such as a DOI, to encourage reuse.


Chapter 2 The features of a modern data dissemination platform

+

In the introductory section of this Guide, we proposed that a data dissemination platform should be modeled after highly successful e-commerce platforms. These platforms are designed to optimally satisfy the requirements and expectations of both buyers (in our context, the data users) and sellers (in our context, the data providers who make their datasets accessible through a data catalog). In this chapter, we outline the crucial features that a modern online data catalog should incorporate to adhere to this model and effectively cater to the diverse needs and expectations of its users.

+

Our objective is to provide recommendations for developing data catalogs that encompass lexical search and semantic search, filtering, advanced search functionality, interactive user interfaces, and the capability to operate as a data recommender system. To define these features, we approach the topic from three distinct perspectives: the viewpoint of data users, who represent a highly diverse community with varying needs, preferences, expectations, and capabilities; the standpoint of data suppliers, who either publish their data or delegate the task to a data library; and the perspective of catalog administrators, responsible for curating and disseminating data in a responsible, effective, and efficient manner while optimizing both user and supplier satisfaction.

+

The creation of a contemporary data dissemination platform is a collaborative endeavor, engaging data curators, user experience (UX) experts, designers, search engineers, and subject matter specialists with a profound understanding of both the data and the users’ requirements and preferences. Inclusive in this development process should be the active participation of the users themselves, allowing them to provide feedback that directly influences the system’s design.

+
+

2.1 Features for data users

+

In order to cultivate a favorable user experience, online data catalogs must offer an intuitive and efficient interface, allowing users to effortlessly access the most pertinent datasets. To meet user expectations effectively, one should emphasize simplicity, predictability, relevance, speed, and reliability. Integrating these principles into the design of data catalogs can deliver a seamless and user-friendly experience, akin to the convenience and ease provided by well-known internet search engines and e-commerce platforms. This, in turn, streamlines the process of discovering and obtaining the necessary data, making it quick and hassle-free for users.

+
+

2.1.1 Simple search interface

+

The default option to search for data in a specialized catalog should be a single search box, following the model of general search engines. The objective of the search algorithm should then be to “understand” the user’s query as accurately as possible, potentially by parsing and enhancing the query, and returning the most relevant results ranked in order of importance.

+
+
+
+ +


+
+
+


+

However, not all users can be expected to provide ideal queries. The search engine must be able to tolerate spelling mistakes to provide a seamless user experience. Auto-completion and spell checkers of queries are independent of the metadata being searched and can be enabled using indexing tools such as Solr or ElasticSearch. Additionally, after processing a user query, the application can provide suggestions for related keywords. This can be implemented using a graph of related words generated by natural language processing (NLP) models. Access to an API is necessary to implement keyword suggestions based on such graphs. The example below shows a related words graph for the terms “climate change” as returned by an NLP model.


A search interface could retrieve such information via API and display it as follows:
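In R, such a retrieval could look like the sketch below. The endpoint URL and the response structure are hypothetical; an actual implementation would depend on the NLP service made available to the catalog.

```r
library(httr)      # HTTP client
library(jsonlite)  # JSON parsing

# Hypothetical NLP service returning terms related to a query
related_terms <- function(query,
                          api_url = "https://nlp.example.org/related-terms") {
  resp <- GET(api_url, query = list(q = query, top = 10))
  stop_for_status(resp)
  # Assumed response: a JSON array of objects {term, similarity}
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Suggest terms related to "climate change" alongside the search results
suggestions <- related_terms("climate change")
print(suggestions)
```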

2.1.2 Browser

Some users will simply want to browse a catalog, and this should be made easy. The use of cards is recommended. For images, a mosaic view can be provided; for microdata, a variable-level view.

2.1.3 Latest additions and history

The catalog must provide a list of the most recent additions, and a history of additions and updates. For each entry, information must be available on the date the entry was first added to the catalog and the date it was last updated. When a dataset is replaced with a new version, the versioning must be clear.

2.1.5 Document as a query

A search engine with semantic search capability should be able to process short or long queries, even accepting a document (a PDF or a TXT file) as a query. The search engine will then first analyze the semantic content of the document, convert it into an embedding vector, and identify the closest resources available in the catalog.
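A minimal sketch of this logic is shown below in R. It assumes that the catalog stores one pre-computed embedding vector per entry, and that an external (hypothetical) embedding service converts the text extracted from the uploaded document into a vector of the same dimension; ranking is done by cosine similarity.

```r
library(httr)
library(jsonlite)

# Hypothetical service that converts a text into an embedding vector
embed_text <- function(text, api_url = "https://nlp.example.org/embed") {
  resp <- POST(api_url, body = list(text = text), encode = "json")
  stop_for_status(resp)
  as.numeric(fromJSON(content(resp, as = "text", encoding = "UTF-8"))$vector)
}

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# 'catalog_vectors': a named list of pre-computed embeddings, one per catalog entry
document_as_query <- function(file, catalog_vectors, top = 10) {
  text      <- paste(readLines(file, warn = FALSE), collapse = " ")
  query_vec <- embed_text(text)
  scores    <- sapply(catalog_vectors, cosine, b = query_vec)
  head(sort(scores, decreasing = TRUE), top)  # most similar catalog entries
}
```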

2.1.7 Semantic search and recommendations

There are two types of search engines: lexical and semantic. The former matches literal terms in the query to the search engine’s index, while the latter aims to identify datasets that have semantically similar metadata to the query. While an ideal data catalog would offer both types of search engines, implementing semantic searchability can be complex.


(explain how semantic search works for different data types - with embeddings, vector indexing, and cosine similarity - use of an API)


For microdata, embeddings based on thematic variable groupings are an option to implement semantic search and recommendations. Discovery of microdata poses specific challenges. Typically, a data dictionary will be available, with variables organized by data file. A “virtual” organization of variables by thematic group, with a pre-defined ontology, can significantly improve data discoverability. AI solutions can be used to generate such groupings and map variables to them. The DDI metadata standard provides the metadata elements needed to store information on variable groups.

2.1.8 Customized views

Build your own dashboards:

- Allow users to set preferences: themes, data types, geographies, search queries
- Have a page where pre-designed dashboards (country/thematic pages) and custom dashboards are accessible
- Allow sharing of dashboards
- Core idea: all data and metadata are accessible via API; the platform operates as a service that feeds dashboards (within the platform or external)

2.1.9 Data and metadata as a service

Maintain a data service: let external users build dashboards and platforms dynamically connected via API; one organization cannot customize its platform for all communities of users.

2.1.10 Query user interface

For time series only

2.1.11 Ranking results


A search engine not only needs to identify relevant datasets but also must return the results in a proper order of relevance, with the most relevant results at the top of the list. If users fail to find a relevant response among the top results, they may choose to search for data elsewhere. The ability of a search engine to return relevant results in the optimal rank depends on the metadata’s content and structure. To optimize the ranking of results, substantial relevance engineering is required, including tuning advanced search tools like Solr or ElasticSearch. Large data catalogs managed by well-resourced agencies can leverage data scientists to explore the possibility of using machine learning solutions such as “learn-to-rank” to improve result ranking. See the section “Improved results ranking” below. For more detailed information, see D. Turnbull and J. Berryman’s (2016) in-depth description of tools and methods.


Keyword-based searches can be optimized using tools like Solr or ElasticSearch. Out-of-the-box solutions, such as those provided by SQL databases, rarely deliver satisfactory results. Structured metadata can help optimize search engines and the ranking of results by allowing for the boosting of specific metadata elements. For instance, a query term found in the title of a dataset would carry more weight than if it were found in the notes element, and the results would be ranked accordingly. Similarly, a country name found in the nation or reference country metadata elements should be given more weight than if it were found in a variable description. Advanced indexing tools like Solr and ElasticSearch provide boosting functionalities to fine-tune search engines and enhance result relevancy.
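The sketch below illustrates such boosting using the Elasticsearch query DSL, sent from R. The index name, field names, and boost factors are illustrative and would depend on how the catalog metadata are indexed.

```r
library(httr)
library(jsonlite)

# Boost matches found in the title and nation fields over matches in notes
query <- list(
  query = list(
    multi_match = list(
      query  = "child mortality Popstan",
      fields = c("title^4", "nation^3", "abstract^2", "notes")
    )
  )
)

resp <- POST("http://localhost:9200/catalog/_search",
             body = toJSON(query, auto_unbox = TRUE),
             content_type_json())
results <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```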

2.1.12 Filtering results

Facets or filters are useful for narrowing down datasets based on specific metadata categories. For instance, in a data catalog with datasets from different countries, a “country” facet can help users find relevant datasets quickly. To be effective, filters should be based on metadata elements that have a limited number of categories and a predictable set of options. Controlled vocabularies can be used to enable such filters. Furthermore, as some metadata elements are specific to particular data types, contextual facets should be integrated into the catalog’s user interface to offer relevant filters based on the type of data being searched.
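With an indexing tool such as Elasticsearch, a facet can be implemented as a terms aggregation over a field populated from a controlled vocabulary, as in the illustrative R sketch below (the index and field names are assumptions that depend on the catalog's index mapping).

```r
library(httr)
library(jsonlite)

# Count catalog entries by country to populate a "country" facet
agg_query <- list(
  size = 0,  # no documents returned, only the facet counts
  aggs = list(
    country = list(terms = list(field = "nation.keyword", size = 50))
  )
)

resp  <- POST("http://localhost:9200/catalog/_search",
              body = toJSON(agg_query, auto_unbox = TRUE),
              content_type_json())
facet <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
facet$aggregations$country$buckets  # pairs of country and entry count
```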


Tags and tag groups (which are available in all schemas we recommend) provide much flexibility to implement facets, as we showed in section 1.7.

(use pills / …)

2.1.13 Sorting results

Allow users to sort query results, for example by relevance, date, or title.

2.1.14 Collections

Organize entries into collections.

2.1.15 Linking results

Not all data catalog users know exactly what they are looking for and may need to explore the catalog to find relevant resources. E-commerce platforms use recommender systems to suggest products to customers, and data catalogs should have a similar commitment to bringing relevant resources to users’ attention. To achieve this, modern data catalogs display relationships between entries, which may involve data of different types, such as microdata files, analytical scripts, and working papers.


These relationships can be documented in the metadata, such as identifying datasets as part of a series or new versions of a previous dataset. When relationships are not known or documented, machine learning tools such as topic models and word embedding models can be used to establish the topical or semantic closeness between resources of different types. This can be used to implement a recommender system in data catalogs, which automatically identifies and displays related documents and data for a given resource. The image below shows how “related documents” and “related data” can be automatically identified and displayed for a resource (in this case a document).

2.1.16 Organized results

When a data catalog contains multiple types of data, it should offer an easy way for users to filter and display query results by data type. For example, when searching for “US population,” one user may only be interested in knowing the total population of the USA, while another may need the public use census microdata sample, and a third may be searching for a publication. To cater to such needs, presenting query results in type-specific tabs (with an “All” option) and/or providing a filter (facet) by type will allow users to focus on the types of data relevant to them. This is similar to commercial platforms that offer search results organized by department, allowing users to search for “keyboard” in either the “music” or “electronics” department.

2.1.17 Saving and sharing results

Allow users to save and share results: via a URL or API query, an exported list, social networks, etc.

2.1.18 Personalized results

Offer users the option to set a profile with preferences that can be used to personalize the display of results.

2.1.19 Metadata display and formats

To make metadata easily accessible to users, it’s important to display it in a convenient way. The display of metadata will vary depending on the data type being used, as each type uses a specific metadata schema. For online catalogs, style sheets can be utilized to control the appearance of the HTML pages.


In addition to being displayed in HTML format, metadata should be available as electronic files in JSON, XML, and potentially PDF format. Structured metadata provides greater control and flexibility to automatically generate JSON and XML files, as well as format and create PDF outputs. It’s important that the JSON and XML files generated by the data catalog comply with the underlying metadata schema and are properly validated. This ensures that the metadata files can be easily and reliably reused and repurposed.
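As an illustration, the R sketch below validates an exported JSON metadata file against a JSON schema using the jsonvalidate package. The file names are placeholders; the schema would be the JSON representation of the relevant metadata standard.

```r
library(jsonvalidate)  # JSON Schema validation

# Validate exported metadata against the schema it claims to comply with
valid <- json_validate(
  json    = "child_mortality_survey_2010.json",  # placeholder metadata file
  schema  = "ddi_codebook_schema.json",          # placeholder schema file
  engine  = "ajv",
  verbose = TRUE
)

if (!valid) {
  print(attr(valid, "errors"))  # data frame describing the validation failures
}
```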

2.1.20 Variable-level comparison

E-commerce platforms commonly allow customers to compare products by displaying their pictures and descriptions (i.e., metadata) side-by-side. Similarly, for data users, the ability to compare datasets can be valuable to evaluate the consistency or comparability of a variable or an indicator over time or across sources and countries. Implementing this functionality requires detailed and structured metadata at the variable level, which metadata standards such as the DDI and ISO 19110/19139 provide.


In the example below, we show how a query for “water” returns not only a list of seven datasets, but also a list of variables in each dataset that match the query.


The variable view shows that a total of 90 variables match the searched keyword.


After selecting the variables of interest, users should be able to display their metadata in a format that facilitates comparison. The availability of detailed metadata is crucial to ensure the quality and usefulness of these comparisons. For example, when working with a survey dataset, capturing information on the variable universe, categories, questions, interviewer instructions, and summary statistics would be ideal. This comprehensive metadata will enable users to make informed decisions about which variables to use and how to analyze them.

2.1.21 Transparency in access policies

The terms of use (ideally provided in the form of a standard license) and the conditions of access to data should be made transparent and visible in the data catalog. The access policy will preferably be provided using a controlled vocabulary, which can be used to enable a facet (filter) as shown in the screenshot below.

2.1.22 Data and metadata API

To keep up with modern data management needs, a comprehensive data catalog must provide users with convenient access to both data and metadata through an application programming interface (API). Structured metadata allow users to extract the specific components of the metadata they need, such as the identifier and title of all microdata and geographic datasets produced after a certain year. With an API, users can easily and automatically access the datasets or subsets of datasets they require. An API also enables internal features of the catalog such as dynamic visualizations and data previews, making data management more efficient. It is crucial that detailed documentation and guidelines on the use of the data and metadata API are provided to users to maximize the benefits of this feature.


Metadata (and data) should be accessible via API. The API should be well documented, with examples. An API query builder (a user interface for building an API query) is a useful complement.
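The R sketch below shows the kind of programmatic access this enables. The endpoint and parameter names are hypothetical, modeled on a generic catalog search API; actual names depend on the cataloguing application.

```r
library(httr)
library(jsonlite)

# List the identifier and title of microdata entries added since 2020
# (hypothetical endpoint and parameter names)
resp <- GET("https://datacatalog.example.org/api/catalog/search",
            query = list(type = "survey", from = 2020, ps = 100))
stop_for_status(resp)
catalog <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Assumed response structure: a data frame of entries under result$rows
catalog$result$rows[, c("idno", "title")]
```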

2.1.23 Online data access forms

Make the process of registration and data access requests fully digital, easy, and fully traceable.

2.1.23.1 Bulk download option

Even when a user interface, visualizations, and other features are provided, many users just want to download the data and metadata in bulk. (…)

2.1.24 Data preview

When the data (time series and tabular data, possibly also microdata) are made available via API, the data catalog can also provide a data preview option, and possibly a data extraction option, to the users. Multiple JavaScript tools, some of them open-source, are available to easily embed data grids in catalog pages.


For a document, the “data preview” would consist of a document viewer that allows the user to view the document within the application (even when the document is not stored in the catalog itself but on an external website). When implementing such a feature, check that the terms of use of the originating source allow it.

2.1.25 Data extraction

For some data (microdata / time series), provide a simple way for users to extract specific variables / observations.

2.1.26 Data visualizations

Embedding visualizations in a data catalog can greatly enhance its usefulness. Different types of data require different types of visualizations. For instance, time series data can be effectively displayed using a line chart, while images with geographic information can be displayed on a map that shows the location of the image capture. For more complex data, other types of charts can be created as well. However, in order to embed dynamic charts in a catalog page, the data needs to be available via API. A good data catalog should offer flexibility in the types of charts and maps that can be embedded in a metadata page. For instance, the NADA catalog provides catalog administrators with the ability to create visualizations using various tools. By including visualizations in a data catalog, users are able to quickly and easily understand the data and gain insights from it.


The NADA catalog allows catalog administrators to generate such visualizations using different tools of their choice. The examples below were generated using the open-source Apache ECharts library.

Example: Line chart for a time series

Example: Geo-location of an image

2.1.27 Permanent URLs

To ensure efficient management and organization of datasets within a data catalog, it is essential to assign a unique identifier to each dataset. This identifier should not only meet technical requirements but also serve other purposes such as facilitating dataset citation. To achieve maximum effectiveness, it is recommended that datasets have a globally unique identifier, which can be accomplished through the assignment of a Digital Object Identifier (DOI). DOIs can be generated in addition to a catalog-specific unique identifier and provide a permanent and persistent identifier for the dataset. For more information about the process of generating DOIs and the reasons to use them, visit the DataCite website.


Include a citation requirement in metadata.

2.1.28 Archive / tombstone

When a dataset is removed or replaced, the reproducibility of some analyses may become impossible. This may be a problem for some users. Unless there is a reason for not making them accessible, old versions of datasets should be kept accessible. But they should not be the ones indexed and displayed in the catalog, to avoid confusion or the risk that a user would exploit a version other than the latest. Moving datasets that are replaced to an archive section of the catalog (not indexed) is an option. Note that DOIs require a permanent web page.

2.1.29 Catalog of citations

A data catalog should not be limited to data. Ideally, the scripts produced by researchers to analyze the data, and the output of their analysis, should also be available. An ideal data catalog will allow a user to:

  • search for data, and find/access the related scripts and citations
  • search for a document (analytical output), and find/access the related data and scripts
  • search for a script, and find/access the data and analytical output

Maintain a catalog of citations of datasets.

2.1.30 Reproducible and replicable scripts

Document, catalog, and publish reproducible/replicable scripts.

2.1.31 Notifications or alerts

Users may want to be automatically notified (by email) when new entries of interest are added, or when changes are made to a specific resource. A system allowing users to set criteria for automatic notification can be developed.


Example of Google Scholar alerts:

2.1.32 Providing feedback

Users should be able to provide feedback on the catalog, for example through a “Contact” email address and possibly a feedback form. If the platform itself is open source, a GitHub repository can also be used to collect issues and suggestions on the application itself.


A users’ forum or “reviews” feature, as found on e-commerce platforms, is however not always recommended. Not all users are constructive and qualified, and such a feature requires moderation, which can be costly and controversial, and may create disincentives for data producers to publish their data. It could be a good option for data platforms that are internal to an organization (where comments are attributed, and an authentication system controls who can provide feedback), but not for public data platforms.

2.1.33 Getting support

Provide a contact point with responsive support, and a list of frequently asked questions (FAQs).

2.1.34 Web content accessibility

The Web Content Accessibility Guidelines (WCAG) are an international standard; the WCAG documents explain how to make web content more accessible to people with disabilities. The Americans with Disabilities Act (ADA) provides people with disabilities the same opportunities, free of discrimination. WCAG is a compilation of accessibility guidelines for websites, whereas the ADA is a civil rights law covering the same domain.

2.2 Features for data providers

When the data catalog is not administered by the producer of the data but by an entrusted repository, data providers want:


2.2.1 Safety

  • Safety, protection against reputation risk (responsible use of data)
  • Guarantee that regulations and terms of use will be strictly complied with; reputation of the organization that manages the catalog (Seal of Approval or other accreditation; properly staffed)

2.2.2 Visibility

  • Visibility to maximize the use of data (including options to share/publicize on social media) - screenshot from data.gov

2.2.3 Low burden


“Do not disturb”: a low burden of deposit and no burden of serving users (minimal interaction with users; providing detailed metadata helps reduce such interactions).

2.2.4 Real time information on usage

Monitoring of usage (downloads and citations) to assess demand, with automatically generated reports.

2.2.5 Feedback from users

Feedback on quality issues

2.3 Features for catalog administrators

In addition to meeting the needs of its users, a modern data catalog should also offer features that catalog administrators can appreciate or expect. The features listed below can serve as a checklist when selecting a cataloguing application or developing new features. These features may include:

2.3.1 Data deposit

A user-friendly interface for data deposit, compliant with metadata standards, with embedded quality gateways and clearance procedures.

2.3.2 Privacy protection

Tools for privacy protection control (e.g., tools to identify direct identifiers)

2.3.3 Free software

Availability of the application as open-source software, accompanied by detailed technical documentation.

2.3.4 Security

Robust security measures, such as compatibility with advanced authentication systems, flexible role/profile definitions, regular upgrades and security patches, and accreditation by information security experts

2.3.5 IT affordability

Reasonable IT requirements, such as shared server operability and sufficient memory capacity

2.3.6 Ease of maintenance

Ease of upgrading to the latest version

2.3.7 Interoperability

Interoperability with other catalogs and applications, as well as compliance with metadata standards. By publishing metadata across multiple catalogs and hubs, data visibility can be increased, and the service provided to users can be maximized. This requires automation to ensure proper synchronization between catalogs (with only one catalog serving as the “owner” of a dataset), which necessitates interoperability between the catalogs, enabled by compliance with common formats and metadata standards and schemas.
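A minimal sketch of such synchronization is shown below in R: the metadata of an entry are pulled from the API of the catalog that owns it and re-published in a partner catalog. Both endpoints and the payload structure are hypothetical; in practice the exchange would rely on a shared standard or schema.

```r
library(httr)
library(jsonlite)

# Pull the metadata of one entry from the catalog that "owns" it
owner_url <- "https://owner-catalog.example.org/api/catalog/child-mortality-2010"
metadata  <- fromJSON(content(GET(owner_url), as = "text", encoding = "UTF-8"))

# Re-publish (harvest) the same metadata in a partner catalog, preserving
# the original identifier so that a single catalog remains the owner
POST("https://partner-catalog.example.org/api/datasets",
     body = toJSON(metadata, auto_unbox = TRUE),
     content_type_json(),
     add_headers("X-API-KEY" = Sys.getenv("PARTNER_CATALOG_KEY")))
```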

2.3.8 Flexibility on access policies

Flexibility in implementing data access policies that conform to the specific procedures and protocols of the organization managing the catalog

2.3.9 API based system for automation and efficiency

Availability of APIs for catalog administration, allowing easy automation of procedures (harvesting, migration of formats, editing, etc.). This requires an API-based system.

2.3.10 Featuring tools

Ability to feature datasets

2.3.11 Usage monitoring and analytics

Easy activation of usage analytics (using Google Analytics, Omniture, or other)

2.3.12 Multilingual capability

Multilingual capability, including internationalization of the code and the option for catalog administrators to translate or adapt software translations

2.3.13 Embedded SEO

Embedded Search Engine Optimization (SEO) procedures

2.3.14 Widgets and plugins

Ability to use widgets to embed custom charts, maps, and data grids in the catalog

2.3.15 Feedback to developers

Ability to provide feedback and suggestions to the application developers.

2.4 Machine learning for a better user experience

In Chapter 1, we emphasized the importance of generating comprehensive metadata and how machine learning can be leveraged to enrich it. Natural language processing (NLP) tools and models, in particular, have been employed to enhance the performance of search engines. By utilizing machine learning models, semantic search engines and recommender systems can be developed to aid users in locating relevant data. Moreover, machine learning can improve the ranking of search results to ensure that the most pertinent results are brought to users’ attention. Google, Bing, and other leading search engines have employed machine learning for years. While specialized data catalogs may not have the resources to implement such advanced systems, catalog administrators should explore opportunities to utilize machine learning to enhance their users’ experience. Catalogs can make use of external APIs to exploit machine learning solutions without requiring administrators to develop machine learning expertise or train their own models. For instance, APIs can be used to automatically and instantly translate queries or convert queries into embeddings. Ideally, a global community of practice will develop such APIs, including training NLP models, and provide them as a global public good.

2.4.1 Improved discoverability

In 2019, Google introduced their NLP model, BERT (Bidirectional Encoder Representations from Transformers), as a component of their search engine. Other major companies, such as Amazon, Apple, and Microsoft, are also developing similar models to enhance their search engines. One of the objectives of these companies is to create search engines that can support digital assistants like Siri, Alexa, Cortana, and Google Assistant, which operate in a conversational mode and provide answers to users rather than just links to resources. Improving NLP models is a continuous and strategic priority for these companies, as not all answers can be found in textual resources. Google is also conducting research to develop solutions for extracting answers from tabular data.


Specialized data catalogs maintained by data centers, statistical agencies, and other data producers still rely almost exclusively on full-text search engines. The search engine within these catalogs looks for matches between keywords submitted by the user and keywords found in an index, without attempting to understand or improve the user’s query. This can result in issues such as misinterpretation of the query, as discussed in Chapter 1, where a search for “dutch disease” may be mistakenly interpreted as a health-related query rather than an economic concept.


The administrators of these specialized data catalogs often lack the resources to develop and implement the most advanced NLP solutions, and should not be required to do so. To assist them in transitioning from keyword-based search systems to semantic search and recommender systems, open solutions should be developed and published, such as pre-trained NLP models, open source tools, and open APIs. This would necessitate the creation and publishing of global public goods, including specialized corpora and the training of embedding models on these corpora, open NLP models and APIs that data catalogs can utilize to generate embeddings for their metadata, query parsers that can automatically improve/optimize queries and convert them into numeric vectors, and guidelines for implementing semantic search and recommender systems using tools like Solr, ElasticSearch, and Milvus.


Simple models created from open source tools and publicly-available documents can provide straightforward solutions. In the example below, we demonstrate how these models can “understand” the concept of “dutch disease” and correctly associate it with relevant economic concepts.

2.4.2 Improved results ranking

Effective search engines not only identify relevant resources, but also rank and present them to users in an optimal order of relevance. As highlighted in Chapter 1, research shows that 75% of search engine users do not click past the first page, emphasizing the importance of ranking and presenting results effectively.


Data catalog administrators face two challenges in improving their search engine performance. Firstly, they need to improve their ranking in search engines such as Google by enriching metadata and embedding metadata compliant with DCAT or schema.org standards on catalog pages. Secondly, they need to improve the ranking of results returned by their own search engines in response to user queries.


Google’s early success was largely attributed to its revolutionary approach to ranking search results, called PageRank. Since then, Google and other leading search engines have invested heavily in improving ranking methodologies with advanced techniques like RankBrain (introduced in 2015). These approaches include primary, contextual, and user-specific ranking, which utilize machine learning models referred to as Learn to Rank models. Lucidworks provides a clear description of this approach, noting that “Learning to rank (LTR) is a class of algorithmic techniques that apply supervised machine learning to solve ranking problems in search relevancy. In other words, it’s what orders query results. Done well, you have happy employees and customers; done poorly, at best you have frustrations, and worse, they will never return. To perform learning to rank you need access to training data, user behaviors, user profiles, and a powerful search engine such as SOLR. The training data for a learning to rank model consists of a list of results for a query and a relevance rating for each of those results with respect to the query. Data scientists create this training data by examining results and deciding to include or exclude each result from the data set.”


Implementing Learn to Rank models can be challenging for data catalog administrators due to the resource-intensive nature of building the training dataset, fitting models, and implementing them. An alternative solution is to optimize the implementation of Solr or ElasticSearch, which can often contribute significantly to improving the ranking of search results. For more information on the challenge and available tools and methods for relevancy engineering, refer to D. Turnbull and J. Berryman’s 2016 publication.

2.5 Cataloguing tools

The examples we provided in this chapter are taken from our NADA cataloguing application. Other open-source cataloguing applications are available, including CKAN, GeoNetwork, and Dataverse.


CKAN


CKAN is a data management system that provides a platform for cataloguing, storing, and accessing datasets, with a rich front-end, a full API (for both data and catalog), visualization tools, and more. CKAN is open-source software held in trust by the Open Knowledge Foundation, licensed under the GNU Affero General Public License (AGPL) v3.0. CKAN is used by some of the leading open data platforms, such as the US data.gov or the OCHA Humanitarian Data Exchange. CKAN does not require that the metadata comply with any metadata standard (which brings flexibility, but at a cost in terms of discoverability and quality control), but organizes the metadata in the following elements (information extracted from the CKAN on-line documentation):

  • Title: allows intuitive labeling of the dataset for search, sharing and linking.
  • Unique identifier: dataset has a unique URL which is customizable by the publisher.
  • Groups: display of which groups the dataset belongs to if applicable. Groups (such as science data) allow easier data linking, finding and sharing among interested publishers and users.
  • Description: additional information describing or analyzing the data. This can either be static or an editable wiki which anyone can contribute to instantly or via admin moderation.
  • Data preview: preview [.csv] data quickly and easily in browser to see if this is the dataset you want.
  • Revision history: CKAN allows you to display a revision history for datasets which are freely editable by users
  • Extra fields: these hold any additional information, such as location data (see geospatial feature) or types relevant to the publisher or dataset. How and where extra fields display is customizable.
  • Licence: instant view of whether the data is available under an open license or not. This makes it clear to users whether they have the rights to use, change and re-distribute the data.
  • Tags: see what labels the dataset in question belongs to. Tags also allow for browsing between similarly tagged datasets in addition to enabling better discoverability through tag search and faceting by tags.
  • Multiple formats (if provided): see the different formats the data has been made available in quickly in a table, with any further information relating to specific files provided inline.
  • API key: allows access to every metadata field of the dataset and the ability to change the data if you have the relevant permissions, via API.

The extra fields section allows ingestion of structured metadata, which makes it relatively easy to export data and metadata from NADA to CKAN. Importing data and metadata from CKAN to NADA is also possible (using the catalogs’ respective APIs), but with a reduced metadata structure.


GeoNetwork


GeoNetwork is a cataloguing tool for geographic data and services (not for other types of data), which includes a specialized metadata editor. According to its website, “It provides powerful metadata editing and search functions as well as an interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world. (…) The metadata editor support ISO19115/119/110 standards used for spatial resources and also Dublin Core format usually used for opendata portals.”


Dataverse


The Dataverse Project is led by the Institute for Quantitative Social Science (IQSS). Dataverse makes use of the DDI Codebook and Dublin Core metadata standards. According to its website, Dataverse “is an open source web application to share, preserve, cite, explore, and analyze research data. (…) The central insight behind the Dataverse Project is to automate much of the job of the professional archivist, and to provide services for and to distribute credit to the data creator.”


“The Institute for Quantitative Social Science (IQSS) collaborates with the Harvard University Library and Harvard University Information Technology organization to make the installation of the Harvard Dataverse Repository openly available to researchers and data collectors worldwide from all disciplines, to deposit data. IQSS leads the development of the open source Dataverse Project software and, with the Open Data Assistance Program at Harvard (a collaboration with Harvard Library, the Office for Scholarly Communication and IQSS), provides user support.”


Chapter 3 The power of rich, structured metadata


The previous chapter defined the features of an advanced data discoverability and dissemination solution. What enables such a solution is not only the algorithms and technology, but also the quality of the metadata available to power them. Metadata are defined as “… structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage that resource” (Data Thesaurus, NIH, https://nnlm.gov/data/thesaurus). Metadata must be findable by machines and usable by humans. This chapter describes what metadata are needed, and how they can be organized and improved to fully enable the search and recommender tools. The metadata must be rich and structured. To make them rich, machine learning can be used. To ensure consistent structure, the use of metadata standards and schemas is highly recommended. In this chapter, we build the case for rich, augmented, structured metadata and for the adoption of metadata standards and schemas. The second part of this Guide will provide a detailed description of each recommended standard or schema, for different data types.

3.1 Rich metadata

Rich metadata means detailed and comprehensive metadata. Rich metadata are beneficial to both the users and the providers (producers and curators) of data.

3.1.1 Benefits for data users

Being provided with rich metadata helps data users:

  • Find data of interest. The metadata provide much of the content that the search engine will be able to index and discover. The richer the metadata, the better the search engine will be able to help users identify relevant data.
  • Understand what the data are measuring and how they have been created. Without a proper description of the data, the risk is high that a user will misunderstand and possibly misuse them, or simply decide not to make use of them.
  • Assess the quality of the data, including their reliability, fitness for purpose, and consistency with other datasets when the purpose requires integration of multiple datasets.

3.1.2 Benefits for data producers


For the data producers, rich metadata will contribute to:

  • Ensure transparency, auditability, and credibility of the data and of the derived products.
  • Increase the visibility of the data, and thus the demand for, and use of the data.
  • Reduce the cost of operating a data dissemination service by lowering the burden of responding to users’ requests for information.
  • Support the preservation of institutional memory.
  • Provide the meta-database needed to harmonize data collection methods and instruments, e.g., by providing convenient tools to compare variables across datasets. A compelling case for rich metadata for transparency and harmonization can be found in “The Struggle for Integration and Harmonization of Social Statistics in a Statistical Agency - A Case Study of Statistics Canada” by Gordon Priest (2010).

3.1.3 Scope of the metadata


What makes metadata “rich and comprehensive” is not always easy to define, and is specific to each data type. Microdata and geospatial datasets for example will require much more – and different– metadata than a document or an image. Metadata standards and schemas provide data curators with detailed lists of elements (or fields), specific to each data type, that must or may be provided to document a dataset. The metadata elements included in a standard or schema will typically cover cataloguing material, contextual information, and explanatory materials.

3.1.3.1 Cataloguing material

Cataloguing material includes elements such as a title, a unique identifier for the dataset, a version number and description, as well as information related to the data curation (including who generated the metadata and when, or where and when metadata may have been harvested from an external catalog). This information allows the dataset to be uniquely identified within a collection/catalog, and serves as a bibliographic record of the dataset, allowing it to be properly acknowledged and cited in publications.

3.1.3.2 Contextual information

Contextual information describes the context in which the data were collected and how they were put to use. It enables secondary users to understand the background and processes behind the data production. Contextual information should cover topics such as:

  • What justified or required the data collection (the objectives of the data production exercise);
  • Who or what was being studied;
  • The geographic and temporal coverage of the data;
  • Changes and developments that occurred over time in the data collection methodology and in the dataset, if relevant. For repeated cross-section, panel, or time series datasets, this may include information describing changes in the question text, variable labeling, sampling procedures, or others;
  • The key outputs of the data collection, such as a publication or the design or implementation of a policy or project;
  • Problems encountered in the process of data collection, entry, checking, and cleaning;
  • Other useful information on the life cycle of the dataset.

3.1.3.3 Explanatory material


Explanatory materials are the information that should be created and preserved to ensure the long-term functionality of a dataset and its contents. This applies mostly to microdata, geospatial data, and to some extent to tabulations and to time series and indicators databases. It is less relevant for images, videos, and documents. Explanatory materials include:

  • Information about the data collection methods: This section should describe the instruments used and methods employed, and how they were developed. If applicable, details of the sampling design and sampling frames should be included. It is also useful to include information on any monitoring process undertaken during the data collection as well as details of quality controls.
  • Information about the structure of the dataset: Key to this information is a detailed data dictionary describing the structure of the dataset, including information about relationships between individual files or records within the study. For example, it should include key variables required for unique identification of subjects across files (required to properly merge data files), the number of cases and variables in each file, and the number of files in the dataset. For relational models, the structure and relations between dataset records and elements should be described.
  • Technical information: This information relates to the technical framework and should include the computer system used to generate the data and related files; the software packages with which the files were created.
  • Variables and values, coding and classification schemes (for microdata and geospatial data): The documentation should contain an exhaustive list of variables in the dataset, including a complete explanation and full details about the coding and classifications used for the information allocated to those fields. It is especially important to have blank and missing fields explained and accounted for. It is helpful to identify variables to which standard coding classifications apply, and to record the version of the classification scheme used.
  • Information about derived variables (for microdata and geospatial data, and tabulations): Many data producers derive new variables from original data. This may be as simple as grouping raw age (in years) data according to groups of years appropriate for the survey, or it may be much more complex and require the use of sophisticated algorithms. When grouped or derived variables are created, it is important that the logic for the grouping or derivation is clear. Simple grouping, such as for age, can be included within the data dictionary. More complex derivations require other means of recording the information. Sufficient supporting information should be provided to allow an easy link between the core variables used and the resultant variables. In addition, computer algorithms used to create the derivations should be saved together with information on the software.
  • Weighting and grossing (for sample survey microdata): Weighting and grossing variables must be fully documented, with explanations of the construction of the variables and clear indications of the circumstances in which they should be used. The latter is particularly important when different weights are applied for different purposes.
  • Data source: Details about the source from which the data is derived should be included. For example, when the data source consists of responses to survey questionnaires, each question should be carefully recorded in the documentation. Ideally, the text will include a reference to the generated variable(s). It is also useful to explain the conditions under which a question would be asked, including, if possible, the cases to which it applies and, ideally, a summary of response statistics.
  • Confidentiality and anonymization: It is important to determine whether the data contains any confidential information on individuals, households, organizations, or institutions. If so, such information should be recorded together with any agreement on how to use the data, such as with survey respondents. Issues of confidentiality may restrict the analyses to be undertaken or results to be published, particularly if the data is to be made available for secondary use. If the data were anonymized to prevent identification, it is wise to record the anonymization procedure (taking care of not providing information that would enable a reverse-engineering of the procedure) and its impact on the data, as such modification may restrict subsequent analysis.

3.1.4 Controlled vocabularies


Metadata standards and schemas provide lists of elements with a description of the expected content to be captured in each element. For some elements, it may be appropriate to restrict the valid content to pre-selected options or “controlled vocabularies”. A controlled vocabulary is a pre-defined list of values that can be accepted as valid content for some elements. For example, a metadata element “data type” should not be populated with free text, but should make use of a pre-defined taxonomy of data types. The use of controlled vocabularies (for selected metadata elements) will be particularly useful to implement search and filter features in data catalogs (see Chapter 2 of this Guide), and to foster inter-operability of data catalogs.


“In library and information science, controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search.” (Wikipedia)


Controlled vocabularies can be specific to an agency, or be developed by a community of practice. For example, the list of countries and codes provided by the ISO 3166 can be used as a controlled vocabulary for a metadata element country or nation; the ISO 639 list of languages can be used as a controlled vocabulary for a metadata element language. Or the CESSDA topics classification can be used as a controlled vocabulary for the element topics found in most metadata schemas. When a controlled vocabulary is used in a metadata standard or schema, it is good practice to include an identification of its origin and version.
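As a small illustration, the R sketch below checks the nation element of a metadata record against a controlled vocabulary of ISO 3166 country names and codes. The vocabulary shown is a tiny illustrative subset; a real implementation would load the full, versioned ISO 3166 list.

```r
# Illustrative subset of an ISO 3166-based controlled vocabulary
iso_3166 <- data.frame(
  name = c("India", "Kenya", "Yemen"),
  code = c("IND", "KEN", "YEM"),
  stringsAsFactors = FALSE
)

# Check a 'nation' element (name + abbreviation) against the vocabulary
validate_nation <- function(nation, vocabulary = iso_3166) {
  ok <- nation$name %in% vocabulary$name & nation$abbreviation %in% vocabulary$code
  if (!all(ok)) {
    warning("Not in the controlled vocabulary: ",
            paste(nation$name[!ok], collapse = ", "))
  }
  all(ok)
}

# Example metadata fragment (Popstan is a fictitious country used in this Guide)
nation <- data.frame(name = "Popstan", abbreviation = "POP", stringsAsFactors = FALSE)
validate_nation(nation)  # returns FALSE with a warning
```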


Some recommended controlled vocabularies are included in the description of the ISO 19139 standard for geographic data and services (see chapter 6). Most standards and schemas we recommend also include a topics element. Annex 1 provides a description of the CESSDA topics classification.


Ideally, controlled vocabularies will be developed in compliance with the FAIR principles for scientific data management and stewardship: Findability, Accessibility, Interoperability, and Reuse.

3.1.5 Tags

All metadata standards and schemas described in this guide include a tags element, even when this element is not part of a standard. This element enables the implementation of filters (facets) in data cataloguing applications, in a flexible manner. The tags metadata element is repeatable (meaning that more than one tag can be attached to a dataset) and contains two sub-elements to capture a tag (word or phrase), and the tag_group (if any) it belongs to.


To illustrate the use of tags, let’s assume that a catalog contains datasets that are available freely, and others that are available for a fee. The catalog administrator may want to provide a filter (facet) in the user interface that would allow users to filter datasets based on their free or not free status. None of the metadata schemas we describe in the Guide contains an element specifically designed to indicate the “free” or “for a fee” nature of the data. But this information can be captured in a tag “Free” or “For a fee” that would be added to each dataset in the catalog, with a tag group that could be named “free_or_fee”. In R, this would be done as follows (for a “Free” dataset):

# ... ,
tags = list(
  list(tag = "Free", tag_group = "free_or_fee")
)
# ...

In the NADA catalog, a facet titled “Free or for a fee” can then be created based on the information found in the tags element where tag_group = “free_or_fee”.

3.2 Structured metadata

3.2.1 What structure?

Metadata should not only be comprehensive and detailed, they should also be organized in a structured manner, preferably using a standardized structure. Structured metadata means that the metadata are stored in specific fields (or elements) organized in a metadata schema. Standardized means that the list and description of elements are commonly agreed by a community of practice.


“A metadata schema is a system that defines the data elements needed to describe a particular object, such as a certain type of research data.” (Ten rules data discovery - add ref)


Some metadata standards have originated from academic data centers, like the Data Documentation Initiative (DDI), maintained by the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan. Others found their origins in specialized communities of practice (like the ISO 19139 for geospatial resources). The private sector also contributes to the development of standards, like the International Press Telecommunications Council (IPTC) standard developed by and for news media.


Metadata compliant with standards and schemas will typically be stored as JSON or XML files (formats described in the next section), which are plain text files. The example below shows how a simple free-text description would be structured and stored in JSON and XML formats, using metadata elements from the DDI Codebook metadata standard:


Free text version:


The Child Mortality Survey (CMS) was conducted by the National Statistics Office of Popstan from July 2010 to June 2011, with financial support from the Child Health Trust Fund (TF123_456).


Structured, machine-readable (JSON) version:

{
  "title"           : "Child Mortality Survey 2010-2011",
  "alternate_title" : "CMS 2010-2011", 
  "authoring_entity": "National Statistics Office (NSO)", 
  "funding_agencies": [{"name":"Child Health Trust Fund (CHTF)", "grant":"TF123_456"}],
  "coll_dates"      : [{"start":"2010-07", "end":"2011-06"}],
  "nation"          : [{"name":"Popstan", "abbreviation":"POP"}] 
}

In XML format:

<titl>Child Mortality Survey 2010-2011</titl>
<altTitl>CMS 2010-2011</altTitl>
<rspStmt><AuthEnty>National Statistics Office</AuthEnty></rspStmt>
<fundAg abbr="CHTF">Child Health Trust Fund</fundAg>
<collDate date="2010-07" event="start"/>
<collDate date="2011-06" event="end"/>
<nation abbr="POP">Popstan</nation>

All three versions contain (almost) the same information. In the structured version, we have added acronyms and the ISO country code. This does not create new information but will help make the existing information more discoverable and inter-operable. The structured version is clearly more suitable for publishing in a meta-database (or catalog). Organizing and storing metadata in such a structured manner will enable all kinds of applications. For example, when metadata for a collection of surveys are stored in a database, it becomes straightforward to apply filters (for example, a filter by country using the nation/name element) and targeted searches to answer questions like “What data are available that cover the month of December 2010?” or “What surveys did the CHTF sponsor?”.
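As a small R illustration of what this enables: assuming each catalog entry is stored as a JSON file with the structure shown above, a few lines suffice to answer a question such as “What surveys did the CHTF sponsor?”. The folder and file layout are illustrative.

```r
library(jsonlite)

# Read all JSON metadata files from a (hypothetical) catalog folder
files   <- list.files("catalog_metadata", pattern = "\\.json$", full.names = TRUE)
records <- lapply(files, fromJSON)

# Keep the entries that list the Child Health Trust Fund among their funders
chtf_funded <- Filter(function(rec) {
  funders <- rec$funding_agencies$name
  !is.null(funders) && any(grepl("Child Health Trust Fund", funders))
}, records)

# Titles of the matching surveys
sapply(chtf_funded, function(rec) rec$title)
```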

3.2.2 Formats for structured metadata: JSON and XML

Metadata standards and schemas consist of structured lists of metadata fields. They serve multiple purposes. First, they help data curators generate complete and usable documentation of their datasets. Metadata standards that are intuitive and human-readable better serve this purpose. Second, they help generate machine-readable metadata that are the input to software applications like on-line data catalogs. Metadata available in open file formats like JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) are most suitable for this purpose.


Some international metadata standards like the Data Documentation Initiative (DDI Codebook, for microdata), the ISO 19139 (for geospatial data), or the Dublin Core (a more generic metadata specification) are described and published as XML specifications. Any XML standard or schema can be “translated” into JSON, which is our preferred format (a choice we justify in the next section).


JSON and XML formats have similarities:

  • Both are non-proprietary text files
  • Both are hierarchical (they may contain values within values)
  • Both can be parsed and used by many programming languages including R and Python

JSON files are however easier to parse than XML, easier to generate programmatically, and easier to read by humans. This makes them our preferred choice for describing and using metadata standards and schemas.


Metadata in JSON are stored as key/value pairs, where the keys correspond to the names of the metadata elements in the standard. Values can be string, numeric, boolean, arrays, null, or JSON objects (for a more detailed description of the JSON format, see www.w3schools.com). Metadata in XML are stored within named tags. The example below shows how the JSON and XML formats are used to document the list of authors of a document, using elements from the Dublin Core metadata standard.


In the documents schema, authors are documented in the metadata element authors which contains the following sub-elements: first_name, initial, last_name, and affiliation.


In JSON, this information will be stored in key/value pairs as follows.

"authors" : [
  {"first_name" : "Dieter", 
   "last_name"  : "Wang", 
   "affiliation": "World Bank Group; Fragility, Conflict and Violence"},
  {"first_name" : "Bo",     
   "initial"    : "P.J.", 
   "last_name"  : "Andrée", 
   "affiliation": "World Bank Group; Fragility, Conflict and Violence"},
  {"first_name" : "Andres", 
   "initial"    : "F.", 
   "last_name"  : "Chamorro", 
   "affiliation": "World Bank Group; Development Data Analytics and Tools"},
  {"first_name" : "Phoebe", 
   "initial"    : "G.", 
   "last_name"  : "Spencer",  
   "affiliation": "World Bank Group; Fragility, Conflict and Violence"}
]

In XML, the same information will be stored within named tags as follows.

<authors>
  <author>
    <first_name>Dieter</first_name> 
    <last_name>Wang</last_name> 
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
  <author>
    <first_name>Bo</first_name> 
    <initial>P.J.</initial> 
    <last_name>Andrée</last_name> 
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
  <author>
    <first_name>Andres</first_name> 
    <initial>E.</initial>
    <last_name>Chamorro</last_name> 
    <affiliation>World Bank Group; Development Data Analytics and Tools</affiliation>
  </author>
  <author>
    <first_name>Phoebe</first_name> 
    <initial>G.</initial>
    <last_name>Spencer</last_name> 
    <affiliation>World Bank Group; Fragility, Conflict and Violence</affiliation>
  </author>
</authors>
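Because both formats are machine-readable, either can be parsed programmatically. The short R sketch below reads the JSON version with the jsonlite package and turns the authors element into a data frame; the file name is a placeholder.

```r
library(jsonlite)

# Read the document metadata (placeholder file name)
meta <- fromJSON("document_metadata.json")

# 'authors' is returned as a data frame with columns first_name, initial,
# last_name, and affiliation (missing initials appear as NA)
authors <- meta$authors
paste(authors$first_name, authors$last_name)
```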

3.2.3 Benefits of structured metadata


Metadata standards and schemas must be comprehensive and intuitive. They aim to provide comprehensive and granular lists of elements. Some standards may contain a very long list of elements. Most often, only a subset of the available elements will be used to document a specific dataset. For example, the elements of the DDI metadata standard related to sample design will be used to document sample survey datasets but will be ignored when documenting a population census or an administrative dataset. In all standards and schemas, most elements are optional, not required. Data curators should however try and provide content for all elements for which information is or can be made available.


Complying with metadata standards and schemas contributes to the completeness, usability, discoverability, and inter-operability of the metadata, and to the visibility of the data and metadata.

3.2.3.1 Completeness

When they document datasets, data curators who do not make use of metadata standards and schemas tend to focus on the readily-available documentation and may omit some information that secondary data users –and search engines– may need. Metadata standards and schemas provide checklists of what information could or should be provided. These checklists are developed by experts, and are regularly updated or upgraded based on feedback received from users or to accommodate new technologies.


Generating complete metadata will often be a collaborative exercise, as the production of data involves multiple stakeholders. The implementation of a survey, for example, may involve sampling specialists, field managers, data processing experts, subject matter specialists, and programmers. Documenting a dataset should not be seen as a last and independent step in the implementation of a data collection or production project. Ideally, metadata will be captured continuously and in quasi-real time during the entire life cycle of the data collection/production, and contributed by those who have most knowledge of each phase of the data production process.


Generating complete and detailed metadata may be seen as a burden by some organizations or researchers. But it will typically represent only a small fraction of the time and budget invested in the production of the data, and is an investment that will add much value to the data by increasing their usability and discoverability.

3.2.3.2 Usability

Fully understanding a dataset before conducting analysis should be a pre-requisite for all researchers and data users. But this will only be possible when the documentation is easy to obtain and exploit. Convenience to users is key. When using a geographic dataset, for example, the user should be able to immediately find the coordinate reference system that was used. When using survey microdata, which may contain hundreds or thousands of variables, the user needs to be able to immediately access information on a variable’s label, underlying question, universe, categories, etc. Structured metadata enable such “convenience”, as they can easily be transformed into bookmarked PDF documents, searchable websites, machine-readable codebooks, etc. The way metadata are displayed can be tailored to the specific needs of different categories of users.

#### 3.2.3.3 Discoverability

Data discoverability is, next to availability and interoperability, one of the main concerns that public policy makers and implementers should take into consideration to foster the access, use, and re-use of public sector information, particularly in the case of open data. Users should be able to easily search and find the data they need for the most diverse purposes. This is clearly highlighted in the introductory statements of the INSPIRE Directive: "The loss of time and resources in searching for existing (spatial) data or establishing whether they may be used for a particular purpose is a key obstacle to the full exploitation of the data available". Metadata, and the data portals and catalogs that expose them, are essential assets to enable data discoverability.

What matters is not only what metadata are provided as input to search engines, but also how the metadata are provided. To understand the value of structured metadata, we need to take into consideration how search engines ingest, index, and exploit the metadata. In brief, the metadata need to be acquired, augmented, analyzed and transformed, and indexed before they can be made searchable. We provide here an overview of the process, which is described in detail by D. Turnbull and J. Berryman in "Relevant Search: With applications for Solr and Elasticsearch" (2016).

- **Acquisition**: Search engines like Google and Bing acquire metadata by crawling billions of web pages using web crawlers (or bots), with the objective of covering the entire web. Guidance is available to webmasters on how to optimize websites for visibility (see for example Google's Search Engine Optimization (SEO) Starter Guide). The search tools embedded in specialized data catalogs have a much simpler task, as the catalog administrators and curators generate or control, and provide, the well-contained content to be indexed. In a cataloguing application like NADA, this content is provided in the form of structured metadata files saved in JSON or XML format. For textual data (documents), the content of the document (not only the metadata on the document) can also be indexed. The process of acquisition/extraction of metadata by the search engine tool must preserve the structure of the metadata, in its original or in a modified form. This is critical for optimizing the performance of the search tool and the ranking of query results (e.g., a keyword found in a document title may have more weight than the same keyword found in the document abstract), for implementing facets, and for providing advanced search options (e.g., search only in the "authors" metadata field).

- **Augmentation or enrichment**: The content of the metadata can be augmented or enriched in multiple ways, often automatically (by extracting information from an external source, or using machine learning algorithms). Part of this augmentation process should happen before the metadata are submitted to the search engine. Other procedures of enrichment may be implemented after acquisition of the metadata by the search engine tool. Metadata augmentation can have a significant impact on the discoverability of data. See the section "Augmenting metadata" below.

- **Analysis or transformation**: The metadata generated by the data curator and by the augmentation process will mostly (though not exclusively) consist of text. For the purpose of discoverability, some of this text has no value; words like "the", "a", "it", "with", etc., referred to as stop words, will be removed from the metadata (multiple tools are available for this purpose). The remaining words will be converted to lowercase, may be submitted to spell checkers (to exclude or fix errors), and will be stemmed or lemmatized. Stemming or lemmatization consists of converting words to their stem or root; among other transformations, this changes plurals to singular and conjugated forms of verbs to their base form. Last, the transformed metadata will be tokenized, i.e., split into a list of terms (tokens). To enable semantic searchability, the metadata can also be converted into numeric vectors using natural language processing embedding models. These vectors will be saved in a database (such as ElasticSearch or Milvus) that provides functionalities to measure similarity/distance between vectors. The section on embeddings below provides more information on text embedding and semantic searchability.

- **Indexing**: The last phase of metadata processing is the indexing of the tokens. The index of a search engine is an inverted index, which contains a list of all terms found in the metadata, with the following information (among others) attached to each term:

    - The document frequency, i.e., the number of metadata documents where the term is found (a metadata document is the metadata related to one dataset).
    - The identification of the metadata documents in which the term was found.
    - The term frequency in each metadata document.
    - The term positions in the metadata document, i.e., where the term is found in the document. This is important to identify collocations. When a user submits a query for "demographic transition" for example, documents where the two terms are found next to each other will be more relevant than documents where both terms appear but in different parts of the document.

Once the metadata have been acquired, transformed, and indexed, they are available for use via a user interface (UI). A data catalog UI will typically include a search box and facets (filters). The search engine underlying the search box can be simple (out-of-the-box full-text search, looking for exact matches of keywords) or advanced (with semantic search capability and optimized ranking of query results). Basic full-text search does not provide a satisfactory user experience, as we illustrated in the introduction to this Guide. Rich, structured metadata, combined with advanced search optimization tools and machine learning solutions, allow catalog administrators to tune the search engine and implement advanced solutions, including semantic searchability.
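To make the acquisition, analysis, and indexing steps described above more concrete, the minimal Python sketch below builds a toy inverted index over three short metadata records. It is an illustration only, with a simplified stop word list and a crude stemming rule; a production catalog would rely on a dedicated engine such as Solr or ElasticSearch, as noted above.

```python
import re
from collections import defaultdict

# Toy metadata "documents" (in practice, these would be full structured metadata records)
docs = {
    "doc1": "Prevalence of stunting among children under five",
    "doc2": "Demographic transition and population growth indicators",
    "doc3": "Malnutrition indicators: stunting and wasting prevalence",
}

STOP_WORDS = {"the", "a", "it", "with", "of", "and", "among"}  # simplified stop word list

def analyze(text):
    """Lowercase, tokenize, drop stop words, and apply a naive 'stemming' rule."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # naive plural stripping

# Inverted index: term -> {document id: [positions of the term in that document]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(analyze(text)):
        index[term].setdefault(doc_id, []).append(pos)

# For any term we can now read off document frequency, term frequencies, and positions
term = "stunting"
postings = index.get(term, {})
print(f"'{term}': document frequency = {len(postings)}")
for doc_id, positions in postings.items():
    print(f"  {doc_id}: term frequency = {len(positions)}, positions = {positions}")
```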

#### 3.2.3.4 Interoperability

Data catalogs that adopt common metadata standards and schemas can exchange information, including through automated harvesting and synchronization of catalogs. This allows them to increase their visibility and to publish their metadata in hubs. Recommendations and guidelines for improved interoperability of data catalogs are provided by the Open Archives Initiative.

Interoperability between data catalogs can be further improved by the adoption of common controlled vocabularies. For example, the adoption of ISO country codes in country lists guarantees that all catalogs will be able to filter datasets by country in a consistent manner. This solves the issue of possible differences in the spelling of country names (e.g., one catalog referring to the Democratic Republic of Congo as "Congo, DR" and another one as "Congo, Dem. Rep."). It also solves issues of changing country names (e.g., Swaziland was renamed Eswatini in 2018). Controlled vocabularies are often used for "categorical" metadata elements like topics, keywords, data type, etc. Some metadata standards, like the ISO 19139 for geospatial data, include their own recommended controlled vocabularies. Ideally, controlled vocabularies are developed in accordance with the FAIR principles (Findability, Accessibility, Interoperability, and Reuse of digital assets). "The principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data." (https://www.go-fair.org/fair-principles/)
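As a minimal illustration of the value of such a controlled vocabulary, the hypothetical Python snippet below normalizes variant country spellings to ISO 3166-1 alpha-3 codes before metadata are published or harvested. The variant spellings and the lookup table are illustrative assumptions, not part of any standard.

```python
# Hypothetical lookup from variant country spellings to ISO 3166-1 alpha-3 codes
ISO_3166_ALPHA3 = {
    "congo, dr": "COD",
    "congo, dem. rep.": "COD",
    "democratic republic of the congo": "COD",
    "swaziland": "SWZ",   # renamed Eswatini in 2018; the ISO alpha-3 code is unchanged
    "eswatini": "SWZ",
}

def to_iso3(country_name: str) -> str:
    """Return the ISO 3166-1 alpha-3 code for a country name, if known."""
    return ISO_3166_ALPHA3.get(country_name.strip().lower(), "UNKNOWN")

print(to_iso3("Congo, Dem. Rep."))  # COD
print(to_iso3("Eswatini"))          # SWZ
```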

The adoption of standards and schemas by software developers also contributes to the easy transfer of metadata across applications. For example, data capture tools like Survey Solutions by the World Bank and CSPro by the US Census Bureau offer options to export metadata compliant with the DDI Codebook standard, and ESRI's ArcGIS software exports geospatial metadata compliant with the ISO 19139 standard.

#### 3.2.3.5 Visibility

Data cataloguing applications provide search and filtering tools to help users of the catalog identify data of interest. But not all users will start their search for data directly in specialized data catalogs; many will start their search in Google, Google Dataset Search, Bing, Yahoo!, or another search engine.

Some search engines may provide users with a direct answer to their query, without transiting via the source catalog. This will be the case when the query can be associated with a specific indicator, time, and location for which data are openly available or accessible via a public API. For example, a search for "population india 2020" on Google will return an answer first, followed by links to the underlying sources.

In other cases, the search engine will provide users with a link to a specific catalog page, not to the catalog's home page. In such cases, the user will not be directly connected to the catalog's own search engine. For example, a search for "albania lsms 2012" (a Living Standards Measurement Study, i.e., a household survey) in Google will send the user directly to the survey page of the catalog, not to the home or search page of the catalog.

In some cases, the user may not be brought to the data catalog at all, if the catalog ranked low in the relevance order of the Google query results. User behavior data (2020) showed that "only 9% of Google searchers make it to the bottom of the first page of the search results", and that "only 0.44% of searchers go to the second page of Google's search results". (source: https://www.smartinsights.com/search-engine-marketing/search-engine-statistics/)

It is thus critical to optimize the visibility of the content of specialized data catalogs in the lead search engines, Google in particular. This optimization process is referred to as search engine optimization, or SEO. Wikipedia describes SEO as "the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic (known as 'natural' or 'organic' results) rather than direct traffic or paid traffic. (…) As an Internet marketing strategy, SEO considers how search engines work, the computer-programmed algorithms that dictate search engine behavior, what people search for, the actual search terms or keywords typed into search engines, and which search engines are preferred by their targeted audience. SEO is performed because a website will receive more visitors from a search engine when websites rank higher on the search engine results page."

"Because search engines crawl the web pages that are generated from databases (rather than crawling the databases themselves), your carefully applied metadata inside the database will not even be seen by search engines unless you write scripts to display the metadata tags and their values in HTML meta tags. It is crucial to understand that any metadata offered to search engines must be recognizable as part of a schema and must be machine-readable, which is to say that the search engine must be able to parse the metadata accurately. For example, if you enter a bibliographic citation into a single metadata field, the search engine probably won't know how to distinguish the article title from the journal title, or the volume from the issue number. In order for the search engine to read those citations effectively, each part of the citation must have its own field. (…) Making sure metadata is machine-readable requires patterns and consistency, which will also prepare it for transformation to other schemas. This is far more important than picking any single metadata schema." (From the blog post "Metadata, Schema.org, and Getting Your Digital Collection Noticed" by Patrick Hogan, https://www.ala.org/tools/article/ala-techsource/metadata-schemaorg-and-getting-your-digital-collection-noticed-3)

Guidelines for implementing SEO are provided by Google Search, Google Dataset Search, and other lead search engines. These guidelines are to be implemented not only by webmasters, but also by the developers of data cataloguing tools, who should embed SEO into their software applications.

An important element of SEO is the provision of structured metadata that can be exploited directly by the crawlers and indexers of search engines. This is the purpose of a set of schemas known as schema.org. In 2011, Google, Microsoft, Yandex, and Yahoo! created a common set of schemas for structured data markup on web pages, with the aim of helping search engines better understand websites. An alternative to schema.org is the DCAT (Data Catalog Vocabulary) metadata schema recommended by the W3C, which is also recognized by Google. "DCAT is a vocabulary for publishing data catalogs on the Web, which was originally developed in the context of government data catalogs such as data.gov and data.gov.uk (…)" (https://www.w3.org/TR/vocab-dcat-2/) Mapping augmented and structured metadata to the schema.org and/or DCAT standard is a critical element of such optimization. It will contribute significantly to the visibility of on-line data and metadata. Implementing such structured data markup in digital repositories is the responsibility of data librarians and of developers of data cataloguing applications.

## 3.3 Augmenting metadata

Detailed and complete metadata foster the usability and discoverability of data. Augmentation, "enrichment", or "enhancement" of the metadata is therefore beneficial. There are multiple ways metadata can be made richer, or augmented, programmatically and in a largely automated manner. Metadata can be extracted from external sources or from the data themselves.

**Extraction from external sources**

Metadata can be augmented by tapping into external sources related to the data being documented. For example, in a catalog of documents published in peer-reviewed journals, the Scimago Journal Rank (SJR) indicator could be extracted and added as an additional metadata element for each document. This information can then be used by the catalog's search engine to rank query results, by "boosting" the rank of documents published in prestigious journals.

**Extraction from the data**

Metadata can also be extracted from the data themselves. What metadata can be extracted will be specific to each data type. Examples of metadata augmentation will be provided in the subsequent chapters. We mention a few below.

- For microdata: variable-level statistics (range of values, number of valid/missing cases, frequencies for categorical variables, summary statistics like means or standard deviations for continuous variables) can be extracted and stored as metadata. The DDI Codebook metadata standard provides elements for that purpose.

- For documents: information such as country counts (how many times each country is mentioned) can be extracted automatically to fill out the metadata element related to geographic coverage (a minimal sketch is provided after this list). Natural language processing (NLP) models can be applied to automatically extract keywords or topics (e.g., using a Latent Dirichlet Allocation (LDA) topic model). Classification models can be applied to categorize documents by type.

- For geospatial data: bounding boxes (i.e., the extent of the data) can be derived from the data files.

- For photos taken by digital cameras: metadata such as the date and time the photo was taken, and possibly the geographic location, can be extracted from the EXIF metadata generated by digital cameras and stored in the image file. Also, machine learning models allow image labeling, face detection, and text detection and recognition to be applied at low cost (using commercial solutions like Google Vision or Amazon Rekognition, among others).

- For videos and audio files: machine learning models or speech-to-text API solutions can be used to automatically generate transcripts (see for example Amazon Transcribe, Google Cloud Speech-to-Text, Microsoft Azure Speech to Text, or rev.ai). The content of the transcripts can then be indexed in search engines, making the content of video and audio files more discoverable.

- For programs and scripts: a parsing of the commands used in the script may be used to derive information on the methods applied.

- For all types: user-defined tags can be added, possibly generated by machine learning classification algorithms.
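The sketch below illustrates one simple form of automated augmentation mentioned in the "For documents" item above: counting country mentions in a document's text and storing the result as an additional metadata field. The country list and the ref_country_counts element name are illustrative assumptions, not prescribed by any of the schemas.

```python
import re

# Illustrative country list; a real pipeline would use a full gazetteer with ISO codes
COUNTRIES = ["Lao PDR", "India", "Yemen", "Albania"]

def country_counts(text: str) -> dict:
    """Count how many times each country name appears in a document's text."""
    counts = {}
    for country in COUNTRIES:
        n = len(re.findall(re.escape(country), text, flags=re.IGNORECASE))
        if n > 0:
            counts[country] = n
    return counts

document_text = "Teacher deployment in Lao PDR ... school mapping in Lao PDR and India."
augmented_metadata = {"ref_country_counts": country_counts(document_text)}
print(augmented_metadata)  # {'ref_country_counts': {'Lao PDR': 2, 'India': 1}}
```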

**Embeddings and semantic discovery**

Previous sections of this chapter showed the value of rich and structured metadata in improving data usability and discoverability. Comprehensive and structured metadata are required to build advanced and optimized lexical search engines (i.e., search engines that return results based on a matching of terms found in a query and in an inverted index). The richness of the metadata guarantees that the search engine will have all the necessary "raw material" to identify datasets of interest. The metadata structure allows catalog administrators to tune their search engine (provided they use advanced solutions like Solr or ElasticSearch) to return and rank results in the most relevant manner. But this leaves one issue unsolved: the dependency on keyword matching. A user interested in datasets related to malnutrition, for example, will not find the indicators on prevalence of stunting and prevalence of wasting that the catalog may contain, unless the keyword "malnutrition" was included in these indicators' metadata. Smarter search engines are able to "understand" users' intent, and identify relevant data based not only on a keyword matching process, but also on the semantic closeness between a query submitted by the user and the metadata available in the database. The combination of rich metadata and natural language processing (NLP) models can solve this issue, by enabling semantic searchability in data catalogs.

To enable a semantic search engine (or a recommender system), we need a way to "quantify" the semantic content of a query submitted by the user and the semantic content of the metadata associated with a dataset, and to measure the closeness between them. This "quantitative" representation of semantic content can be generated in the form of numeric vectors called embeddings. "Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning." (Jurafsky and Martin, 2000) These vectors will typically have a large dimension, with a length of 100 or more. They can be generated for a word, a phrase, or a longer text such as a paragraph or a full document. They are calculated using models like word2vec (Mikolov et al., 2013) or others. Training such models requires a large corpus of hundreds of thousands or millions of documents. Pre-trained models and APIs are available that allow data catalog curators to generate embeddings for their metadata and, in real time, for queries submitted by users.

Practically, embeddings are used as follows: the metadata (or part of the metadata) associated with a dataset are converted into a numeric vector using a pre-trained embedding model. These embeddings are stored in a database. When a user submits a search query (which can be a term, a phrase, or even a document), the query is analyzed and enhanced (stop words are removed, spelling errors may be fixed, language detection and automatic translation may be applied, and more), then transformed into a vector using the same pre-trained model that was used to generate the metadata vectors. The metadata vectors that have the shortest distance (typically the cosine distance) to the query vector will be identified. The search engine will then return a sorted list of datasets having the highest semantic similarity with the query, or the distance between vectors will be used in combination with other criteria to rank and return results to the user. The fast identification of the closest vectors requires a specialized and optimized tool like the open source Milvus application.
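The minimal sketch below illustrates these mechanics with made-up three-dimensional vectors. In a real implementation, the vectors would be produced by a pre-trained embedding model, have hundreds of dimensions, and be stored in a vector database such as Milvus; the titles, vector values, and function names here are assumptions made for illustration only.

```python
import numpy as np

# Toy, made-up metadata embeddings; a real catalog would generate these with a
# pre-trained embedding model and store them in a vector database.
catalog_embeddings = {
    "Prevalence of stunting": np.array([0.9, 0.2, 0.1]),
    "Prevalence of wasting":  np.array([0.8, 0.3, 0.1]),
    "GDP per capita":         np.array([0.1, 0.1, 0.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query_vector: np.ndarray, top_k: int = 2):
    """Rank catalog entries by cosine similarity between the query and metadata vectors."""
    scores = {title: cosine_similarity(query_vector, vector)
              for title, vector in catalog_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# A query for "malnutrition" would be embedded with the same model; its vector is
# close to the stunting and wasting indicators even though the keyword "malnutrition"
# appears nowhere in their metadata.
query_vector = np.array([0.85, 0.25, 0.1])
print(semantic_search(query_vector))
```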


## 3.5 Search engine optimization: schema.org

The standards and schemas we recommend are lists of elements that have been tailored to each data type. The importance of structured and rich metadata has been described above. Specialized metadata standards foster comprehensiveness and discoverability in specialized catalogs, and help build optimized data discovery systems. But it is also critical to ensure the visibility and discoverability of the metadata in generic search engines, which are not built around the same schemas. The web makes use of its own schemas, in particular schema.org. To ensure SEO, the specialized schemas should be mapped to it.

### 3.5.1 The basics of search engine optimization

Data catalogs must be optimized to improve the visibility and ranking of their content in search engines, including specialized search engines like Google's Dataset Search. The ranking of web pages by Google and other lead search engines is determined by complex, proprietary, and non-disclosed algorithms. The only option for a web developer to guarantee that a web page appears at the top of the Google list of results is to pay for it, publishing it as a commercial ad. Otherwise, the ranking of a web page will be determined by a combination of known and unknown criteria. "Google's automated ranking systems are designed to present helpful, reliable information that's primarily created to benefit people, not to gain search engine rankings, in the top Search results." (Google Search Central) But Google, Bing, and other search engines provide web developers with some guidance and recommendations on search engine optimization (SEO). See for example the Google Search Central website, where Google publishes "specific things you can do to improve the SEO of your website".

Improving the ranking of catalog pages is a shared responsibility of data curators and of catalog developers and administrators. Data curators must pay particular attention to providing rich, useful content in the catalog web pages (the HTML pages that describe each catalog entry). To identify relevant results, search engines index the content of web pages. Datasets that are well documented, i.e., those published with rich and structured metadata, will thus have a better chance of being discovered. Much attention should be paid to some core elements including the dataset title, producer, description (abstract), keywords, topics, access license, and geographic coverage. In Google Search Central's terms, curators must "create helpful, reliable, people-first content" (not search engine-first content) and "use words that people would use to look for your content, and place those words in prominent locations on the page, such as the title and main heading of a page, and other descriptive locations such as alt text and link text."

Developers and administrators of cataloguing applications must pay attention to other aspects of a catalog that will make it rank higher in Google and other search engine results:

- Ensuring that a data catalog delivers a good experience to users (see *Understanding page experience in Google Search results*), which among other things involves:
    - catalog pages that load fast;
    - catalog pages that are mobile-friendly (a data catalog should thus be built with a responsive design);
    - serving the catalog over a secure HTTPS connection (for more information, see for example https://web.dev/enable-https/).

- Embedding structured data in the catalog's HTML pages. The HTML pages in a data catalog are mostly the pages that make the metadata specific to an entry visible to the user. These pages are automatically generated by the cataloguing application, by extracting and formatting the metadata stored in the catalog's database. Structured data is information that is included in these HTML pages (but not shown to the user) to help Google understand the content of the page. The use of structured data only applies to certain types of content, including datasets. The use of structured data influences not only the ranking of a page, but also the way information on the page will be displayed by Google. The next section is dedicated to this.

Last, Google will "reward" popular websites, i.e., websites that are frequently visited and to which many other influential and popular websites link. Google's recommendation is thus to "tell people about your site. Be active in communities where you can tell like-minded people about your services and products that you mention on your site."

A helpful and detailed self-assessment list of items that data curators, catalog developers, and catalog administrators should pay attention to is provided by Google. Various tools are also available to catalog developers and administrators to assess the technical performance of their websites.

#### 3.5.1.1 Structured data for rich results in Google

Structured data is information embedded in HTML pages that helps Google classify, understand, and display the content of the page when the page is related to a specific type of content. The information stored in the structured data does not impact how the page itself is displayed in a web browser; it only impacts the display of information on the page when returned in Google search results. The types of content to which structured data applies are diverse and include items like job postings, cooking recipes, books, events, movies, math solvers, and others (see the list provided in Google's Search Gallery). It also applies to resources of type dataset and image. In this context, a dataset can be any type of structured dataset, including microdata, indicators, tables, and geographic datasets.


The structured data to be embedded in an HTML page consists of a set of metadata elements compliant with either the dataset schema from schema.org or W3C’s Data Catalog Vocabulary (DCAT) for datasets, and with the image schema from schema.org for images. For datasets, the schema.org schema is the most frequently used option.[^1]

#### 3.5.1.2 schema.org

schema.org is a collection of schemas designed to document many types of resources. The most generic type is a "thing", which can be a person, an organization, an event, a creative work, etc. A creative work can be a book, a movie, a photograph, a data catalog, a dataset, etc. Among the many types of creative work for which schemas are available, we are particularly interested in the ones that correspond to the types of data and resources covered in this Guide. This includes:

- **DataCatalog**: A collection of datasets.

- **Dataset**: A body of structured information describing some topic(s) of interest.

- **MediaObject**: A media object, such as an image, video, or audio object embedded in a web page or a downloadable dataset. This includes, among others, the more specific types ImageObject, VideoObject, and AudioObject.

- **Book**: A book.

- **DigitalDocument**: An electronic file or document.

The schemas proposed by schema.org have been developed primarily "to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results." (from schema.org, Q&A) These schemas have not been developed by specialized communities of practice (statisticians, survey specialists, data librarians) to document datasets for the preservation of institutional memory, to increase transparency in the data production process, or to provide data users with the "cook book" they may need to safely and responsibly use data. Nor are these schemas the ones that statistical organizations need to comply with international recommendations like the Generic Statistical Business Process Model (GSBPM). But they play a critical role in improving data discoverability, as they provide webmasters and search engines with a means to better capture and index the content of web-based data platforms. Schemas from schema.org should thus be embedded in data catalogs. Data cataloguing applications should automatically map (some of) the elements of the specialized metadata standards and schemas they use to the appropriate fields of schema.org. Recommended mappings between the specialized standards and schemas and schema.org are not yet widely available. The production of such mappings, and the development of utilities to facilitate the production of content compliant with schema.org, would contribute to the objective of visibility and discoverability of data.

#### 3.5.1.3 DCAT

DCAT describes datasets and data services using a standard model and vocabulary. It is organized in 13 "classes" (Catalog, Cataloged Resource, Catalog Record, Dataset, Distribution, Data Service, Concept Scheme, Concept, Organization/Person, Relationship, Role, Period of Time, and Location). Within classes, properties are used as metadata elements. For example, the class Cataloged Resource includes properties like title, description, and resource creator, and the class Dataset includes properties like spatial resolution and temporal coverage. Many of these properties can easily be mapped to equivalent elements of the specialized metadata schemas we recommend in this Guide.

#### 3.5.1.4 Practical implementation of structured data

The embedding of structured data into HTML pages must be automated in a data cataloguing tool. Data catalog applications dynamically generate the HTML pages that display the description of each catalog entry. They do so by extracting the necessary metadata from the catalog database, and applying transformations and styles to this content to produce a user-friendly output that catalog visitors will view in their web browser. To embed structured data in these pages, the catalog application will (i) extract the relevant subset of metadata elements from the original metadata (e.g., from the DDI-compliant metadata for a micro-dataset), (ii) map these extracted elements to the schema.org or DCAT schema, and (iii) save the result in the HTML page as a "hidden" JSON-LD component. Mapping the core elements of specialized metadata standards to the schema.org schema is thus essential to enable this feature. A mapping between the schemas presented in this Guide and schema.org is provided in annex 2 of the Guide.
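A minimal sketch of steps (i) to (iii) is given below: a few illustrative metadata elements are mapped to schema.org Dataset properties and wrapped in the JSON-LD script tag that a cataloguing application would embed in the entry's HTML page. The element names, values, and URL are hypothetical, and the mapping is a simplification for illustration, not the actual mapping used by NADA (which is given in annex 2).

```python
import json

# (i) Illustrative subset of metadata extracted from the catalog database
entry = {
    "title": "Living Standards Survey 2012",
    "abstract": "Nationally representative household survey.",
    "producer": "National Statistics Office",
    "url": "https://catalog.example.org/index.php/catalog/1234",
}

# (ii) Map the extracted elements to schema.org 'Dataset' properties
structured_data = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": entry["title"],
    "description": entry["abstract"],
    "creator": {"@type": "Organization", "name": entry["producer"]},
    "url": entry["url"],
}

# (iii) Embed the result in the entry's HTML page as a hidden JSON-LD component
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(structured_data, indent=2)
    + "\n</script>"
)
print(html_snippet)
```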

The screenshots below show an example of an HTML page for a dataset published in a NADA catalog, with the underlying code. The structured metadata are used by Google to display this information as a formatted, "rich result" in Google Dataset Search.

*The HTML page as viewed by the catalog user* - The web browser will ignore the embedded structured metadata when the HTML page is displayed. What users see is entirely controlled by the catalog application.

*The HTML page code (abstract)* - The automatically-generated structured data can be seen in the HTML page code (or page source). This information is visible to, and processed by, the web crawlers of Google, Bing, and other search engines. Note that the structured data, although not "visible" to users, can be made accessible to them via API. Other data cataloguing applications may be able to ingest this information; the CKAN cataloguing tool, for example, makes use of metadata compliant with DCAT or schema.org. Making the structured data accessible is one way to improve the interoperability of data catalogs.

*The result - Higher visibility/ranking in Google Dataset Search* - The websites catalog.ihsn.org and microdata.worldbank.org are NADA catalogs, which embed schema.org metadata.
## 3.6 Where to find the schemas' documentation

The most recent documentation of the schemas described in the Guide is available on-line at https://ihsn.github.io/nada-api-redoc/catalog-admin/#.

The documentation of each standard or schema starts with four common elements that are not actually part of the standard or schema, but that contain information used when the metadata are published in a data catalog that runs the NADA application. If NADA is not used, these "administrative elements" can be ignored. They are described in the list below, followed by a brief illustration.

- **repositoryid** identifies the collection in which the metadata will be published.

- **access_policy** determines if and how the data files will be accessible from the catalog in which the metadata are published. This element only applies to the microdata and geographic metadata standards. It makes use of a controlled vocabulary with the following access policy options:
    - **direct**: data can be downloaded without requiring users to be registered;
    - **open**: same as "direct", with an open data license attached to the dataset;
    - **public**: public use files, which only require users to be registered in the catalog;
    - **licensed**: access to data is restricted to registered users who receive authorization to use the data, after submitting a request;
    - **remote**: data are made available by an external data repository;
    - **data_na**: data are not accessible to the public (only metadata are published).

- **published** determines the status of the metadata in the on-line catalog (with options 0 = draft and 1 = published). Published entries are visible to all visitors of the on-line catalog; unpublished (draft) entries will only be visible to the catalog administrators and reviewers.

- **overwrite** determines whether metadata already in the catalog for this entry can be overwritten (with options "yes" or "no", "no" being the default).
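For illustration, the snippet below shows how these four administrative elements might be filled in for a hypothetical licensed microdata entry; the values shown are assumptions, not defaults.

```python
# Hypothetical administrative elements for a licensed microdata entry published with NADA
administrative_elements = {
    "repositoryid": "central",    # collection in which the entry will be published
    "access_policy": "licensed",  # registered users must request authorization to access the data
    "published": 1,               # 1 = published (visible to all catalog visitors)
    "overwrite": "no",            # do not replace metadata already in the catalog for this entry
}
```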

This set of administrative elements is followed by one or multiple sections that contain the elements specific to each standard/schema. For example, the DDI Codebook metadata standard, used to document microdata, contains the following main sections:

- **document description**: a description of the metadata (who documented the dataset, when, etc.). Most schemas contain such a section describing the metadata, useful mainly to data curators and catalog administrators. In other schemas, this section may be named *metadata_description*.

- **study description**: the description of the survey/census/study, not including the data files and data dictionary.

- **file description**: a list and description of the data files associated with the study.

- **variable description**: the data dictionary (description of variables).

The schema-specific sections are followed by a few other metadata elements common to most schemas. These elements are used to provide additional information useful for cataloguing and discoverability purposes. They include tags (which allow catalog administrators to attach tags to datasets independently of their type, and which can be used as filters in the catalog) and external resources.

Some schemas provide the possibility for data curators to add their own metadata elements in an additional section. The use of additional elements should be the exception, as metadata standards and schemas are designed to provide all elements needed to fully document a data resource.

In each standard and schema, metadata elements can have the following properties:

- **Optional or required**. When an element is declared as required (or mandatory), the metadata will be considered invalid if it contains no information in that element. To keep the schemas flexible, very few elements are set as required. Note that it is possible for a metadata element to be required but have all its components (for elements that have sub-elements) declared as optional. This will be the case when at least one (but any) of the sub-elements must contain information. It is also possible for an element to be declared optional but have one or more of its sub-elements declared mandatory (this means that the element is optional, but if it is used, some of its sub-elements must be provided).

- **Repeatable or not repeatable**. For example, the element *nation* in the DDI standard is repeatable because a dataset can cover more than one country, while the element *title* is not repeatable because a study should be identified by a unique title.

- **Type**. This indicates the format of the information contained in an element. It can be a string (text), a numeric value, a boolean variable (TRUE/FALSE), or an array.

Some schemas may recommend controlled vocabularies for some elements. For example, the ISO 19139 used to document geographic datasets recommends …

In most cases, however, controlled vocabularies are not part of the metadata standard or schema. They will be selected and activated in templates and applications. …example…

## 3.7 Generating structured metadata

Metadata compliant with the standards and schemas described in this Guide can be generated in two different ways: programmatically, using a programming language like R or Python, or by using a specialized metadata editor application. The first option provides a high degree of flexibility and efficiency. It offers multiple opportunities to automate part of the metadata generation process, and to exploit advanced machine learning solutions to enhance the metadata. Metadata generated using R or Python can also be published in a NADA catalog using the NADA API and the R package NADAR or the Python library PyNADA. The programmatic option may thus be the preferred option for organizations that have strong expertise in R or Python. For other organizations, and for some types of data, the use of a specialized metadata editor may be a better option. Metadata editors are specialized software applications designed to offer a user-friendly alternative to the programmatic generation of metadata. We provide in this section a brief description of how structured metadata can be generated and published using, respectively, a metadata editor application, R, and Python.

### 3.7.1 Generating compliant metadata using a metadata editor

The easiest way to generate metadata compliant with the standards and schemas we describe in this Guide is to use a specialized Metadata Editor. A Metadata Editor provides a user-friendly and flexible interface to document data. Most metadata editors are specific to a single standard; the IHSN / World Bank developed an open source, multi-standard Metadata Editor.

This Metadata Editor supports all suggested standards. The full version of each standard is embedded in the application. But few users will ever make use of all elements contained in a standard, and some will want to customize the labels of the metadata elements, the controlled vocabularies, and the instructions to the curators who will enter the metadata.

The Metadata Editor allows users to develop their own templates based on the full version of the standards. A template is a subset of the elements available in the standard/schema, in which elements can be renamed and other customizations can be made (within limits, as the metadata generated must remain compliant with the standard independently of the template).

*Template manager (screenshot)*

*The Metadata Editor user interface, with (for some data types) import of data and automated generation of some metadata (screenshot)*

(describe / provide better example)

### 3.7.2 Generating compliant metadata using R

All schemas described in the on-line documentation can be used to generate compliant metadata using R scripts. Generating metadata using R consists of producing a list object (itself containing lists). In the documentation of the standards and schemas, curly brackets indicate to R users that a list must be created to store the metadata elements, and square brackets indicate that a block of elements is repeatable, which corresponds in R to a list of lists (the DOCUMENT metadata schema, shown in the next chapter, provides examples of both).


The sequence in which the metadata elements are created when documenting a dataset using R or Python does not have to match the sequence in the schema documentation.

Metadata compliant with a standard/schema can be generated using R and directly uploaded to a NADA catalog, without having to be saved as a JSON file. An object (a list) must be created in the R script that contains metadata compliant with the JSON schema. The example below shows how such an object is created and published in a NADA catalog. We assume here that we have a document with the following information:

- document unique id: WB_10986/7710
- title: Teaching in Lao PDR
- authors: Luis Benveniste, Jeffery Marshall, Lucrecia Santibañez (World Bank)
- date published: 2007
- countries: Lao PDR
- The document is available from the World Bank Open Knowledge Repository at http://hdl.handle.net/10986/7710.

We will use the DOCUMENT schema to document the publication, and the EXTERNAL RESOURCE schema to publish a link to the document in NADA.

Publishing data and metadata in a NADA catalog (using R and the NADAR package, or Python and the PyNADA library) requires first identifying the on-line catalog where the metadata will be published (by providing its URL in the set_api_url command) and providing a key to authenticate as a catalog administrator (in the set_api_key command; note that this key should never be entered in clear text in a script, to avoid accidental disclosure).

We then create an object (a list in R, or a dictionary in Python) that we will, for example, name my_doc. Within this list (or dictionary), we enter all metadata elements. Some are simple elements, others are lists (or dictionaries). The first element to be included is the required document_description. Within it, we include the title_statement, which is also required and contains the mandatory elements idno and title (all documents must have a unique ID number for cataloguing purposes, and a title). The list of countries that the document covers is a repeatable element, i.e., a list of lists (although we only have one country in this case). Information on the authors is also a repeatable element, allowing us to capture the information on the three co-authors individually.

This my_doc object is then published in the NADA catalog using the add_document function. Last, we publish (as an external resource) a link to the file, with only basic information. We do not need to document this resource in detail, as it corresponds to the metadata provided in my_doc. If we had a different external resource (for example, an MS-Excel file containing all tables shown in the publication), we would make use of more of the external resource metadata elements to document it. Note that instead of a URL, we could have provided a path to an electronic file (e.g., to the PDF document), in which case the file would be uploaded to the web server and made available directly from the on-line catalog. We had previously captured a screenshot of the cover page of the document, to be used as a thumbnail in the catalog (optional).

```r
library(nadar)

# Define the NADA catalog URL and provide an API key
set_api_url("http://nada-demo.ihsn.org/index.php/api/")
set_api_key("a1b2c3d4e5")  
    # Note: an administrator API key must always be kept strictly confidential; 
    # it is good practice to read it from an external file, not to enter it in clear text

thumb <- "C:/DOCS/teaching_lao.JPG"  # Cover page image to be used as thumbnail

# Generate and publish the metadata on the publication

doc_id <- "WB_10986/7710" 

my_doc <- list(
   document_description = list(
   
      title_statement = list(
        idno = doc_id, 
        title = "Teaching in Lao PDR"
      ),
      
      date_published = "2007",
  
      # Countries: 'ref_country' is a repeatable element (a list of lists),
      # although we only have one country in this case
      ref_country = list(
        list(name = "Lao PDR",  code = "LAO")
      ),
      
      # Authors: 'authors' is a repeatable element (a list of lists),
      # which allows us to describe each of the three co-authors
      authors = list(
        list(first_name = "Luis",     last_name = "Benveniste", affiliation = "World Bank"),
        list(first_name = "Jeffery",  last_name = "Marshall",   affiliation = "World Bank"),
        list(first_name = "Lucrecia", last_name = "Santibañez", affiliation = "World Bank")
      )
   )
)

# Publish the metadata in the central catalog 
add_document(idno = doc_id, 
             metadata = my_doc, 
             repositoryid = "central", 
             published = 1,
             thumbnail = thumb,
             overwrite = "yes")

# Add a link as an external resource of type document/analytical (doc/anl)
external_resources_add(
  title = "Teaching in Lao PDR",
  idno = doc_id,
  dctype = "doc/anl",
  file_path = "http://hdl.handle.net/10986/7710",
  overwrite = "yes"
)
```

The document is now available in the NADA catalog.


### 3.7.3 Generating compliant metadata using Python

Generating metadata using Python consists of producing a dictionary object, which will itself contain lists and dictionaries. Non-repeatable metadata elements are stored as dictionaries, and repeatable elements as lists of dictionaries. In the schema documentation, curly brackets indicate that a dictionary must be created to store the metadata elements, and square brackets indicate that a list of dictionaries must be created.

Dictionaries in Python are very similar to JSON schemas. When documenting a dataset, data curators who use Python can copy a schema from the ReDoc website, paste it into their script editor, then fill out the relevant metadata elements and delete the ones that are not used.

The Python equivalent of the R example we provided above is as follows:

```python
import pynada as nada

# Define the NADA catalog URL and provide an API key
nada.set_api_url("http://nada-demo.ihsn.org/index.php/api/")
nada.set_api_key("a1b2c3d4e5")  
    # Note: an administrator API key must always be kept strictly confidential; 
    # it is good practice to read it from an external file, not to enter it in clear text

thumb = "C:/DOCS/teaching_lao.JPG"  # Cover page image to be used as thumbnail

# Generate and publish the metadata on the publication

doc_id = "WB_10986/7710"

document_description = {

  'title_statement': {
      'idno': doc_id,
      'title': "Teaching in Lao PDR"
  },
  
  'date_published': "2007",

  # Countries: 'ref_country' is a repeatable element (a list of dictionaries),
  # although we only have one country in this case
  'ref_country': [
      {'name': "Lao PDR", 'code': "LAO"}
  ],
  
  # Authors: 'authors' is a repeatable element (a list of dictionaries),
  # which allows us to describe each of the three co-authors
  'authors': [
      {'first_name': "Luis",     'last_name': "Benveniste", 'affiliation': "World Bank"},
      {'first_name': "Jeffery",  'last_name': "Marshall",   'affiliation': "World Bank"},
      {'first_name': "Lucrecia", 'last_name': "Santibañez", 'affiliation': "World Bank"},
  ]
}

# Publish the metadata in the central catalog 
nada.create_document_dataset(
  dataset_id = doc_id,
  repository_id = "central",
  published = 1,
  overwrite = "yes",
  document_description = document_description,
  thumbnail_path = thumb)

# Add a link as an external resource of type document/analytical (doc/anl)
nada.add_resource(
  dataset_id = doc_id,
  dctype = "doc/anl",
  title = "Teaching in Lao PDR",
  file_path = "http://hdl.handle.net/10986/7710",
  overwrite = "yes")
```

[^1]: See Omar Benjelloun, Shiyu Chen, Natasha Noy, 2020, Google Dataset Search by the Numbers, https://doi.org/10.48550/arXiv.2006.06894


# Documents {#chapter04}

This chapter describes the use of a metadata schema for documenting documents. By document, we mean a bibliographic resource of any type, such as a book, a working paper or a paper published in a scientific journal, a report, a presentation, a manual, or any other resource consisting mainly of text and available in physical and/or electronic format.

**Suggestions and recommendations to data curators**

- Documents in a data catalog can appear (i) as "data" in the catalog, or (ii) as "related resources" attached to other datasets. The schema we describe here is to be used for documents that will be listed as catalog entries and made searchable, not for those that will be attached as resources (for which the "external resource" metadata schema must be used).

- For all types of data we describe in this Guide (microdata, geographic, indicators, tables, images, audio, video, and scripts), what is indexed and made searchable in the catalog are the metadata associated with the data (some of which may have been extracted directly from the data). For documents, not only the metadata but also the content of the document (the "data") can and should be indexed and made searchable. Some documents may have been scanned and submitted to optical character recognition (OCR). The OCR process will not always manage to properly convert images to text, resulting in errors and non-existing words that should not be included in an index. It is thus recommended to submit the text version of these documents to a pipeline of quality control and enhancement (spell checker, among others).

- Including a screenshot of a document's cover page in a data catalog adds value.

- Documents should be categorized by type, and the type metadata element should have a controlled vocabulary. If a document can have more than one type, use the tags element (with a tag_group = type) instead of the non-repeatable type element to store this information. Use this information to activate a facet in the catalog user interface. Many users will find it useful to be able to filter documents by type.

- The document metadata can be augmented in different manners, including by applying automated topic extraction (e.g., using an LDA topic model) and by generating document embeddings. When topic models and embedding models are used, it is important to ensure that the same topic model and the same embedding model are consistently used for all resources in the catalog.

- Machine learning tools also provide automatic language detection and translation solutions that may be useful to enhance the metadata.

- Documenting documents using R or Python is not very complex. For large collections of documents, managing and publishing metadata can be made significantly more efficient when programmatic solutions are used.

- It is highly recommended to obtain a globally unique identifier for each document, such as a DOI, an ISBN, or another persistent identifier.
  • +
## 4.1 MARC 21, Dublin Core, and BibTeX

Librarians have developed specific standards to describe and catalog documents. The MARC 21 (MAchine-Readable Cataloging) standard used by the United States Library of Congress is one of them. It provides a detailed structure for documenting bibliographic resources, and is the recommended standard for well-resourced document libraries.

For the purpose of cataloguing documents in a less-specialized repository intended to accommodate data of multiple types, we built our schema on a simpler but also highly popular standard, the Dublin Core Metadata Element Set. We refer to this metadata specification, developed by the Dublin Core Metadata Initiative, as the Dublin Core. The Dublin Core became an ISO standard (ISO 15836) in 2009. It consists of a list of fifteen core metadata elements, to which more specialized elements can be added. These fifteen elements, with definitions extracted from the Dublin Core website, are the following:

| No | Element name | Description |
|----|--------------|-------------|
| 1  | contributor  | An entity responsible for making contributions to the resource. |
| 2  | coverage     | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. |
| 3  | creator      | An entity primarily responsible for making the resource. |
| 4  | date         | A point or period of time associated with an event in the life cycle of the resource. |
| 5  | description  | An account of the resource. |
| 6  | format       | The file format, physical medium, or dimensions of the resource. |
| 7  | identifier   | An unambiguous reference to the resource within a given context. |
| 8  | language     | A language of the resource. |
| 9  | publisher    | An entity responsible for making the resource available. |
| 10 | relation     | A related resource. |
| 11 | rights       | Information about rights held in and over the resource. |
| 12 | source       | A related resource from which the described resource is derived. |
| 13 | subject      | The topic of the resource. |
| 14 | title        | A name given to the resource. |
| 15 | type         | The nature or genre of the resource. |

Due to its simplicity and versatility, this standard is widely used for multiple purposes. It can be used to document not only documents but also resources of other types, such as images. Documents that can be described using the MARC 21 standard can be described using the Dublin Core, although not with the same granularity of information. The US Library of Congress provides a mapping between the MARC and the Dublin Core metadata elements.

MARC 21 and the Dublin Core are used to document a resource (typically, the electronic file containing the document) and its content. Another schema, BibTeX, has been developed for the specific purpose of recording bibliographic citations. BibTeX is a list of fields that may be used to generate bibliographic citations compliant with different bibliography styles. It applies to documents of multiple types: books, articles, reports, etc.

The metadata schema we propose to document publications and reports is a combination of Dublin Core, MARC 21, and BibTeX elements. The technical documentation of the schema and its API is available at https://ihsn.github.io/nada-api-redoc/catalog-admin/#tag/Documents.

## 4.2 Schema description

The proposed schema comprises two main blocks of elements, metadata_information and document_description. It also contains the tags element common to all our schemas. The repositoryid, published, and overwrite items in the schema are not metadata elements per se, but parameters used when publishing the metadata in a NADA catalog.

```json
{
    "repositoryid": "string",
    "published": 0,
    "overwrite": "no",
    "metadata_information": {},
    "document_description": {},
    "provenance": [],
    "tags": [],
    "lda_topics": [],
    "embeddings": [],
    "additional": { }
}
```


### 4.2.1 Metadata information

The metadata_information block contains information not related to the document itself but to its metadata. In other words, it contains "metadata on the metadata". This information is optional, but we recommend entering content at least in the name and date sub-elements, which indicate who generated the metadata and when. This information is not useful to end-users of document catalogs, but it is useful to catalog administrators for two reasons:

- metadata compliant with standards are intended to be shared and used by interoperable applications. Data catalogs offer opportunities to harvest (pull) information from other catalogs, or to publish (push) metadata in other catalogs. Metadata information helps to keep track of the provenance of metadata.

- metadata for the same document may have been generated by more than one person or organization, or one version of the metadata may have been updated and replaced with a new version. The metadata information helps catalog administrators distinguish and manage different versions of the metadata.
"metadata_information": {
+    "title": "string",
+    "idno": "string",
+    "producers": [
+        {
+            "name": "string",
+            "abbr": "string",
+            "affiliation": "string",
+            "role": "string"
+        }
+    ],
+    "production_date": "string",
+    "version": "string"
+}
+


The elements in the block are:
- `title` [Required ; Not repeatable ; String]
  The title of the metadata document (which will usually be the same as the "Title" in the "Document description / Title statement" section). The metadata document is the metadata file (XML or JSON file) that is being generated.

- `idno` [Optional ; Not repeatable ; String]
  A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the "Primary ID" in section "Document description / Title statement", although it is good practice to generate identifiers that establish a clear connection between these two identifiers. The Document ID could also include the metadata document version identifier. For example, if the "Primary ID" of the publication is "978-1-4648-1342-9", the Document ID could be "IHSN_978-1-4648-1342-9_v1.0" if the metadata are produced by the IHSN and if this is version 1.0 of the metadata. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format. The identifier may not contain blank spaces.

- `producers` [Optional ; Repeatable]
  This refers to the producer(s) of the metadata, not to the producer(s) of the document itself. The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a "Recommended" element. For catalog administration purposes, information on the producer and on the date of metadata production is useful.

  - `name` [Optional ; Not repeatable ; String]
    The name of the person or organization who produced the metadata or contributed to its production.
  - `abbr` [Optional ; Not repeatable ; String]
    The abbreviation (or acronym) of the organization that is referenced in `name`.
  - `affiliation` [Optional ; Not repeatable ; String]
    The affiliation of the person or organization mentioned in `name`.
  - `role` [Optional ; Not repeatable ; String]
    The specific role of the person or organization mentioned in `name` in the production of the metadata.

- `production_date` [Optional ; Not repeatable ; String]
  The date the metadata on this document was produced (not distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a "Recommended" element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.

- `version` [Optional ; Not repeatable ; String]
  The version of the metadata document (not the version of the publication, report, or other resource being documented).

Example:
```r
my_doc = list(
  
  metadata_information = list(
    
    idno = "WBDG_978-1-4648-1342-9",
    
    producers = list(
      list(name = "Development Data Group, Curation Team", 
           abbr = "WBDG", 
           affiliation = "World Bank")
    ),
    
    production_date = "2020-12-27"
  ),
  
  # ...
  
)
```
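As noted above, a validation rule for the identifier format can be set using a regular expression. The sketch below is purely illustrative; the `PREFIX_ID_vX.Y` pattern it checks is an arbitrary example, not a prescribed convention.

```r
# Illustrative check of a metadata document ID against a hypothetical pattern
# of the form "PREFIX_ID_vX.Y" (e.g., "IHSN_978-1-4648-1342-9_v1.0").
# Each organization would define its own rules and corresponding expression.
is_valid_idno <- function(idno) {
  grepl("^[A-Z]+_[A-Za-z0-9-]+_v[0-9]+\\.[0-9]+$", idno) && !grepl("\\s", idno)
}

is_valid_idno("IHSN_978-1-4648-1342-9_v1.0")   # TRUE
is_valid_idno("IHSN 978-1-4648-1342-9")        # FALSE (contains a blank space)
```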



### Document description

The `document_description` block contains the metadata elements used to describe the document. It includes the Dublin Core elements and a few more. The schema also includes elements intended to store information generated by machine learning (natural language processing, NLP) models to augment the metadata on documents.
"document_description": {
+    "title_statement": {},
+    "authors": [],
+    "editors": [],
+    "date_created": "string",
+    "date_available": "string",
+    "date_modified": "string",
+    "date_published": "string",
+    "identifiers": [],
+    "type": "string",
+    "status": "string",
+    "description": "string",
+    "toc": "string",
+    "toc_structured": [],
+    "abstract": "string",
+    "notes": [],
+    "scope": "string",
+    "ref_country": [],
+    "geographic_units": [],
+    "bbox": [],
+    "spatial_coverage": "string",
+    "temporal_coverage": "string",
+    "publication_frequency": "string",
+    "languages": [],
+    "license": [],
+    "bibliographic_citation": [],
+    "chapter": "string",
+    "edition": "string",
+    "institution": "string",
+    "journal": "string",
+    "volume": "string",
+    "number": "string",
+    "pages": "string",
+    "series": "string",
+    "publisher": "string",
+    "publisher_address": "string",
+    "annote": "string",
+    "booktitle": "string",
+    "crossref": "string",
+    "howpublished": "string",
+    "key": "string",
+    "organization": "string",
+    "url": null,
+    "translators": [],
+    "contributors": [],
+    "contacts": [],
+    "rights": "string",
+    "copyright": "string",
+    "usage_terms": "string",
+    "disclaimer": "string",
+    "security_classification": "string",
+    "access_restrictions": "string",
+    "sources": [],
+    "data_sources": [],
+    "keywords": [],
+    "themes": [],
+    "topics": [],
+    "disciplines": [],
+    "audience": "string",
+    "mandate": "string",
+    "pricing": "string",
+    "relations": [],
+    "reproducibility": {}
+}
+


- `title_statement` [Required ; Not repeatable]
  The `title_statement` is a required group of five elements, two of which are required:
"title_statement": {
+ "idno": "string",
+ "title": "string",
+ "sub_title": "string",
+ "alternate_title": "string",
+ "translated_title": "string"
+}
+


- `idno` [Required ; Not repeatable ; String]
  A unique identifier of the document, which serves as the "primary ID". A unique identifier is required for cataloguing purposes, so this element is declared as "Required". The identifier will allow users to cite the document properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the document. Alternatively, the `idno` can be constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for the same document (in element `identifiers`); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The `idno` should not contain blank spaces.

- `title` [Required ; Not repeatable ; String]
  The title of the book, report, paper, or other document. Pay attention to the consistent use of capitalization in the title, to ensure consistency across documents listed in your catalog. It is recommended to use sentence capitalization.

- `sub_title` [Optional ; Not repeatable ; String]
  The document subtitle can be used when there is a need to distinguish characteristics of a document. Pay attention to the consistent use of capitalization in the subtitle.

- `alternate_title` [Optional ; Not repeatable ; String]
  An alternate version of the title, possibly an abbreviated version. For example, the World Bank's World Development Report is often referred to as the WDR; the alternate title for the "World Development Report 2021" could then be "WDR 2021".

- `translated_title` [Optional ; Not repeatable ; String]
  A translation of the title of the document. Special characters should be properly displayed, such as accents and other stress marks or different alphabets.


```r
my_doc <- list(

  # ... ,
  
  document_description = list(
    title_statement = list(
      idno = "978-1-4648-1342-9",
      title = "The Changing Nature of Work",
      sub_title = "World Development Report 2019",
      alternate_title = "WDR 2019",
      translated_title = "Rapport sur le Développement dans le Monde 2019"
    ),
    
    # ...
  )  
)
```


- `authors` [Optional ; Repeatable]
  The authors should be listed in the same order as they appear in the source itself, which is not necessarily alphabetical.

```json
"authors": [
    {
        "first_name": "string",
        "initial": "string",
        "last_name": "string",
        "affiliation": "string",
        "author_id": [
            {
                "type": null,
                "id": null
            }
        ],
        "full_name": "string"
    }
]
```


- `first_name` [Optional ; Not repeatable ; String]
  The first name of the author.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the author.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the author.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the author.
- `author_id` [Optional ; Repeatable]
  The author ID in a registry of academic researchers such as the Open Researcher and Contributor ID (ORCID).
  - `type` [Optional ; Not repeatable ; String]
    The type of ID, i.e. the identification of the registry that assigned the author's identifier, for example "ORCID".
  - `id` [Optional ; Not repeatable ; String]
    The ID of the author in the registry mentioned in `type`.
- `full_name` [Optional ; Not repeatable ; String]
  The full name of the author. This element should only be used when the first and last name of an author cannot be distinguished, i.e. when the elements `first_name` and `last_name` cannot be filled out. This element can also be used when the author of a document is an organization or other type of entity.
      my_doc <- list(
    +    # ... ,
    +    document_description = list(
    +      # ... ,
    +
    +      authors = list(
    +         list(first_name = "John", last_name = "Smith",
    +              author_id = list(type = "ORCID", id = "0000-0002-1234-XXXX")),
    +         list(first_name = "Jane", last_name = "Doe"),
    +              author_id = list(type = "ORCID", id = "0000-0002-5678-YYYY"))
    +      ),
    +
    +      # ...
    +    ) 
    +


- `editors` [Optional ; Repeatable]
  If the source is a text within an edited volume, it should be listed under the name of the author of the text used, not under the name of the editor. The name of the editor should however be provided in the bibliographic citation, in accordance with a reference style.
"editors": [
+    {
+        "first_name": "string",
+        "initial": "string",
+        "last_name": "string",
+        "affiliation": "string"
+    }
+]
+


- `first_name` [Optional ; Not repeatable ; String]
  The first name of the editor.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the editor.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the editor.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the editor.

- `date_created` [Optional ; Not repeatable ; String]
  The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was produced. This can be different from the date the document was published or made available, and from the temporal coverage. The document "Nigeria - Displacement Report" by the International Organization for Migration (IOM) provides an example of this: the document was produced in November 2020 (`date_created`), refers to events that occurred between 21 September and 10 October 2020 (`temporal_coverage`), and was published (`date_published`) on 28 January 2021.

- `date_available` [Optional ; Not repeatable ; String]
  The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was made available. This is different from the date it was published (see element `date_published` below). This element will not be used frequently.

- `date_modified` [Optional ; Not repeatable ; String]
  The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was last modified.

- `date_published` [Optional ; Not repeatable ; String]
  The date, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY), when the document was published.

  The example below, a report from the International Organization for Migration (IOM), shows the difference between the date the document was created (`date_created`), the date it was published (`date_published`), and the period it covers (`temporal_coverage`). In R, this will be captured as follows:

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,

    temporal_coverage = "21 September 2020 to 10 October 2020",
    date_created = "2020-11",  
    date_published = "2021-01-28",

    # ...
  ),
  # ...
)
```


- `identifiers` [Optional ; Repeatable]
  This element is used to enter document identifiers (IDs) other than the catalog ID entered in the `title_statement` (`idno`). It can for example be a Digital Object Identifier (DOI), an International Standard Book Number (ISBN), or an International Standard Serial Number (ISSN). The ID entered in the `title_statement` can be repeated here (the `title_statement` does not provide a `type` parameter; if a DOI, ISBN, ISSN, or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its type).
```json
"identifiers": [
    {
        "type": "string",
        "identifier": "string"
    }
]
```


- `type` [Optional ; Not repeatable ; String]
  The type of identifier, for example "DOI", "ISBN", or "ISSN".
- `identifier` [Required ; Not repeatable ; String]
  The identifier itself.

The example shows the list of identifiers of the World Bank World Development Report 2019, *The Changing Nature of Work* (see the full metadata for this document in Complete Example 2 of this chapter).

```r
my_doc <- list(
  
  # ... ,
  
  document_description = list(

    # ... ,
    
    identifiers = list(
      list(type = "ISSN",           identifier = "0163-5085"),
      list(type = "ISBN softcover", identifier = "978-1-4648-1328-3"),
      list(type = "ISBN hardcover", identifier = "978-1-4648-1342-9"),
      list(type = "e-ISBN",         identifier = "978-1-4648-1356-6"),
      list(type = "DOI softcover",  identifier = "10.1596/978-1-4648-1328-3"),   
      list(type = "DOI hardcover",  identifier = "10.1596/978-1-4648-1342-9")   
    ),
    
    # ...
  ),
  # ... 
)
```


- `type` [Optional ; Not repeatable ; String]
  This describes the nature of the resource. It is recommended practice to select a value from a controlled vocabulary, which could for example include the following options: "article", "book", "booklet", "collection", "conference proceedings", "manual", "master thesis", "patent", "PhD thesis", "proceedings", "technical report", "working paper", "website", "other". Specialized agencies may want to create their own controlled vocabularies; for example, a national statistical agency may need options like "press release", "methodology document", "protocol", or "yearbook". The `type` element can be used to create a "Document type" facet (filter) in a data catalog. If the controlled vocabulary contains values that are not mutually exclusive (i.e., if a document could possibly have more than one type), the element `type` cannot be used as it is not repeatable. In such a case, the solution is to provide the type of document as tags, in a tag group that could for example be named type or document_type, as illustrated in the sketch below. Note also that the Dublin Core provides a controlled vocabulary (the DCMI Type Vocabulary) for the type element, but this vocabulary is related to the types of resources (dataset, event, image, software, sound, etc.), not the type of document, which is what we are interested in here.
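The sketch below is illustrative only. It assumes that the `tags` element, common to all schemas in this Guide, takes `tag` and `tag_group` sub-elements; the group name "document_type" is an arbitrary choice.

```r
# Illustrative only: providing document types as tags when a document can have
# more than one type. Assumes the common `tags` element with `tag` and
# `tag_group` sub-elements; the group name "document_type" is arbitrary.
my_doc <- list(
  document_description = list(
    # ...
  ),
  tags = list(
    list(tag = "working paper",        tag_group = "document_type"),
    list(tag = "methodology document", tag_group = "document_type")
  )
)
```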


- `status` [Optional ; Not repeatable ; String]
  The status of the document. The status of the document should (but does not have to) be provided using a controlled vocabulary, for example with the following options: "first draft", "draft", "reviewed draft", "final draft", "final". Most documents published in a catalog will likely be "final".


- `description` [Optional ; Not repeatable ; String]
  This element is used to provide a brief description of the document (not an abstract, which would be provided in the field `abstract`). It should not be used to provide content that is contained in other, more specific elements. As stated in the Dublin Core Usage Guide, "Since the description field is a potentially rich source of indexable terms, care should be taken to provide this element when possible. Best practice recommendation for this element is to use full sentences, as description is often used to present information to users to assist in their selection of appropriate resources from a set of search results."


- `toc` [Optional ; Not repeatable ; String]
  The table of content of the document, provided as a single string element, i.e. with no structure (a structured alternative is provided with the field `toc_structured` described below). This element is also a rich source of indexable terms which can contribute to document discoverability; care should thus be taken to use it (or the `toc_structured` alternative) whenever possible.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,

    toc = "Introduction
           1. The importance of rich and structured metadata
           1.1 Rich metadata
           1.2 Structured metadata
           2. Technology: JSON schemas and tools
           2.1 JSON schemas
           2.1.1 Advantages of JSON over XML
           2.2 Defining a metadata schema in JSON format",
    # ...
  ),

  # ...
)
```


- `toc_structured` [Optional ; Not repeatable]


```json
"toc_structured": [
    {
        "id": "string",
        "parent_id": "string",
        "name": "string"
    }
]
```


This element is used as an alternative to `toc` to provide a structured table of content. The element contains a repeatable block of sub-elements which provides the possibility to define a hierarchical structure:

- `id` [Optional ; Not repeatable ; String]
  A unique identifier for the element of the table of content. For example, the id for Chapter 1 could be "1" while the id for section 1 of chapter 1 would be "11".
- `parent_id` [Optional ; Not repeatable ; String]
  The id of the parent section (e.g., if the table of content is divided into chapters, themselves divided into sections, the `parent_id` of a section would be the id of the chapter it belongs to).
- `name` [Required ; Not repeatable ; String]
  The label of this section of the table of content (e.g., the chapter or section title).

The example below shows how the content provided in the previous example is presented in a structured format.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ...,
        
    toc_structured = list(
      list(id = "0",   parent_id = "" ,  name = "Introduction"),
      list(id = "1",   parent_id = "" ,  name = "1. The importance of rich and structured metadata"),
      list(id = "11",  parent_id = "1",  name = "1.1 Rich metadata"),
      list(id = "12",  parent_id = "1",  name = "1.2 Structured metadata"),
      list(id = "2",   parent_id = "" ,  name = "2. Technology: JSON schemas and tools"),
      list(id = "21",  parent_id = "2",  name = "2.1 JSON schemas"),
      list(id = "211", parent_id = "21", name = "2.1.1 Advantages of JSON over XML"),
      list(id = "22",  parent_id = "2",  name = "2.2 Defining a metadata schema in JSON format")
      # etc.
    ),
    # ...
  ),
  # ...
)
```


- `abstract` [Optional ; Not repeatable ; String]
  The abstract is a summary of the document, usually about one or two paragraphs long (around 150 to 300 words).
```r
my_doc <- list(
  # ... ,
  document_description = list(
        # ... ,
        
        abstract = "The 2019 World Development Report studies how the nature of work is changing as a result of advances in technology today. 
                    While technology improves overall living standards, the process can be disruptive. 
                    A new social contract is needed to smooth the transition and guard against inequality.",
        
        # ...
  ),
  # ...
)
```


- `notes` [Optional ; Repeatable ; String]


```json
"notes": [
    {
        "note": "string"
    }
]
```


This field can be used to provide information on the document that does not belong to the other, more specific metadata elements provided in the schema.

- `note`
  A note, entered as free text.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    notes = list(
      list(note = "This is note 1"),
      list(note = "This is note 2")
    ),  
    
    # ...
  ),
  # ...
)
```


- `scope` [Optional ; Not repeatable ; String]
  A textual description of the topics covered in the document, which complements (but does not duplicate) the elements `description` and `topics` available in the schema.

- `ref_country` [Optional ; Repeatable]
  The list of countries (or regions) covered by the document, if applicable. This is a repeatable block of two elements:

  - `name` [Required ; Not repeatable ; String]
    The country/region name. Note that many organizations have their own policies on the naming of countries/regions/economies/territories, which data curators will have to comply with.
  - `code` [Optional ; Not repeatable ; String]
    The country/region code. It is recommended to use a standard list of country codes, such as ISO 3166.
```json
"ref_country": [
  {
      "name": "string",
      "code": "string"
  }
]
```


The field `ref_country` will often be used as a filter (facet) in data catalogs. When a document is related to only part of a country, we still want to capture this information in the metadata. For example, the `ref_country` element for the document "Sewerage and sanitation: Jakarta and Manila" will list "Indonesia" (code IDN) and "Philippines" (code PHL).

Considering the importance of the geographic coverage of a document as a filter, the `ref_country` element deserves particular attention. The document title will often, but not always, provide the necessary information. Using R, Python, or other programming languages, a list of all countries mentioned in a document can be automatically extracted, with their frequencies. This approach (which requires a lookup file containing a list of all countries in the world with their different denominations and spellings) can be used to extract the information needed to populate the `ref_country` element (not all countries in the list have to be included; a threshold can be set to only include countries that are "significantly" mentioned in a document). Tools like the R package countrycode are available to facilitate this process. A sketch of this approach is provided after the example below.

When a document is related to a region (not to specific countries), or when it is related to a topic but not to a specific geographic area, the `ref_country` element might still be applicable. Try to extract (possibly using a script that parses the document) information on the countries mentioned in the document. For example, `ref_country` for the World Bank document "The investment climate in South Asia" should include Afghanistan (mentioned 81 times in the document), Bangladesh (113), Bhutan (94), India (148), Maldives (62), Nepal (64), Pakistan (103), and Sri Lanka (98), but also China (not a South Asian country, but mentioned 63 times in the document).

If a document is not specific to any country, the element `ref_country` would be omitted (not included in the metadata) if the content of the document is not related to any geographic area (for example, the user's guide of a software application), or would contain "World" (code WLD) if the document is related but not specific to countries (for example, a document on "Climate change mitigation").
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    ref_country = list(
       list(name = "Bangladesh", code = "BGD"),
       list(name = "India",      code = "IND"),
       list(name = "Nepal",      code = "NPL")
    ),
    
    # ...
  )
)
```
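The sketch below illustrates, on a toy example, the country-extraction approach described above. The lookup table is a hypothetical, highly simplified stand-in for a complete list of country names and their variants (in practice, a resource such as the countrycode package would be used), and the threshold value is arbitrary.

```r
# Illustrative sketch: count country mentions in a document's text and keep
# those above an arbitrary threshold. The lookup table is a toy stand-in for
# a complete list of country names/variants and their ISO 3166 codes.
lookup <- data.frame(
  name = c("Bangladesh", "India", "Nepal"),
  code = c("BGD", "IND", "NPL"),
  stringsAsFactors = FALSE
)

doc_text  <- "... India and Bangladesh ... India ... Nepal ..."  # placeholder text
threshold <- 1  # arbitrary minimum number of mentions

counts <- sapply(lookup$name, function(x) {
  m <- gregexpr(x, doc_text, fixed = TRUE)[[1]]
  if (m[1] == -1) 0 else length(m)
})

ref_country <- lapply(which(counts >= threshold), function(i) {
  list(name = lookup$name[i], code = lookup$code[i])
})
```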


- `geographic_units` [Optional ; Repeatable]
  A list of geographic units covered in the document, other than the countries listed in `ref_country`.


```json
"geographic_units": [
    {
        "name": "string",
        "code": "string",
        "type": "string"
    }
]
```


- `name` [Required ; Not repeatable ; String]
  The name of the geographic unit.
- `code` [Optional ; Not repeatable ; String]
  The code of the geographic unit.
- `type` [Optional ; Not repeatable ; String]
  The type of the geographic unit (e.g., "province", "state", "district", or "town").


- `bbox` [Optional ; Repeatable]
  This element is used to define one or multiple geographic bounding box(es), which are the rectangular, fundamental geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset's geographic coverage. The bounding box provides the geographic coordinates of the top-left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. The valid range of latitude in degrees is -90 and +90 for the southern and northern hemisphere, respectively. Longitude is in the range -180 and +180, specifying coordinates west and east of the Prime Meridian, respectively. This element will rarely be used for documenting publications. Bounding boxes are optional, but when a bounding box is defined, all four coordinates are required.


```json
"bbox": [
    {
        "west": "string",
        "east": "string",
        "south": "string",
        "north": "string"
    }
]
```


- `west` [Required ; Not repeatable ; String]
  The West longitude of the bounding box.
- `east` [Required ; Not repeatable ; String]
  The East longitude of the bounding box.
- `south` [Required ; Not repeatable ; String]
  The South latitude of the bounding box.
- `north` [Required ; Not repeatable ; String]
  The North latitude of the bounding box.
my_doc <- list(
+  # ... ,
+  document_description = list(
+        # ... ,
+        
+        bbox = list(
+          list(west  = "92.12973", 
+               east  = "92.26863", 
+               south = "20.91856", 
+               north = "21.22292")
+        ),
+        
+        # ...
+  ),
+  # ...
+) 
+


- `spatial_coverage` [Optional ; Not repeatable ; String]
  This element provides another space for capturing information on the spatial coverage of a document, which complements the `ref_country`, `geographic_units`, and `bbox` elements. It can be used to qualify the geographic coverage of the document, in the form of free text. For example, a report on refugee camps in the Cox's Bazar district of Bangladesh would have Bangladesh as the reference country, "Cox's Bazar" as a geographic unit, and "Rohingya's refugee camps" as spatial coverage.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,

    ref_country = list(
      list(name = "Bangladesh", code = "BGD")
    ),

    geographic_units = list(
      list(name = "Cox's Bazar", type = "District")
    ),  

    spatial_coverage = "Rohingya's refugee camps",

    # ...
  ),
  # ...
)
```
- `temporal_coverage` [Optional ; Not repeatable ; String]
  Not all documents have a specific time coverage. When they do, it can be specified in this element.
- `publication_frequency` [Optional ; Not repeatable ; String]
  Some documents are published regularly. The frequency of publication can be documented using this element. It is recommended to use a controlled vocabulary, for example the PRISM Publishing Frequency Vocabulary, which identifies standard publishing frequencies for a serial or periodical publication.

| Frequency    | Description |
|--------------|-------------|
| annually     | Published once a year |
| semiannually | Published twice a year |
| quarterly    | Published every 3 months, or once a quarter |
| bimonthly    | Published twice a month |
| monthly      | Published once a month |
| biweekly     | Published twice a week |
| weekly       | Published once a week |
| daily        | Published every day |
| continually  | Published continually as new content is added; typical of websites and blogs, typically several times a day |
| irregularly  | Published on an irregular schedule, such as every month except July and August |
| other        | Published on another schedule not enumerated in this controlled vocabulary |
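A minimal, illustrative example for an annually-published report (the value simply repeats an entry from the vocabulary above):

```r
# Illustrative only: an annually-published report, using a value from the
# PRISM Publishing Frequency Vocabulary.
my_doc <- list(
  document_description = list(
    publication_frequency = "annually"
  )
)
```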


- `languages` [Optional ; Repeatable]
  The language(s) in which the document is written. For the language codes and names, the use of the ISO 639-2 standard is recommended.


```json
"languages": [
    {
        "name": "string",
        "code": "string"
    }
]
```


This is a block of two elements (at least one must be provided for each language):

- `name` [Optional ; Not repeatable ; String]
  The name of the language.
- `code` [Optional ; Not repeatable ; String]
  The code of the language. The use of ISO 639-2 (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.
  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      languages = list(
+        list(name = "English", code = "EN")
+      )
+      
+      # ...
+    ),
+    # ... 
+  )  
+


- `license` [Optional ; Repeatable]
  Information on the license(s) attached to the document, which defines the terms of use.
"license": [
+    {
+        "name": "string",
+        "uri": "string"
+    }
+]
+


- `name` [Required ; Not repeatable ; String]
  The name of the license (e.g., CC-BY 4.0).
- `uri` [Optional ; Not repeatable ; String]
  The URL of the license, where detailed information on the license can be obtained.


```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    license = list(
      list(name = "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)", 
           uri = "http://creativecommons.org/licenses/by/3.0/igo")
    ),
    
    # ...
  ),
  # ... 
)
```


- `bibliographic_citation` [Optional ; Repeatable]
  The bibliographic citation provides relevant information about the author and the publication. When using the element `bibliographic_citation`, the citation is provided as a single item. It should be provided in a standard style: Modern Language Association (MLA), American Psychological Association (APA), or Chicago. Note that the schema provides an itemized list of all elements (BibTex fields) required to build a citation in a format of their choice.


```json
"bibliographic_citation": [
    {
        "style": "string",
        "citation": "string"
    }
]
```


- `style` [Optional ; Not repeatable ; String]
  The citation style, e.g. "MLA", "APA", or "Chicago".
- `citation` [Optional ; Not repeatable ; String]
  The citation in the style mentioned in `style`.

The example below shows how the bibliographic citation for an article published in Econometrica can be provided in three different formats.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    bibliographic_citation = list(
    
      list(style = "MLA", 
           citation = 'Davidson, Russell, and Jean-Yves Duclos. “Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality.” Econometrica, vol. 68, no. 6, [Wiley, Econometric Society], 2000, pp. 1435–64, http://www.jstor.org/stable/3003995.'),
           
      list(style = "APA", 
           citation = 'Davidson, R., & Duclos, J.-Y. (2000). Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality. Econometrica, 68(6), 1435–1464. http://www.jstor.org/stable/3003995'),
           
      list(style = "Chicago", 
           citation = 'Davidson, Russell, and Jean-Yves Duclos. “Statistical Inference for Stochastic Dominance and for the Measurement of Poverty and Inequality.” Econometrica 68, no. 6 (2000): 1435–64. http://www.jstor.org/stable/3003995.')   
           
    ),
    
    # ...
  ),
  # ... 
)
```


**Bibliographic elements**

The elements that follow are bibliographic elements that correspond to BibTex fields. Note that some of the BibTex elements are found elsewhere in the schema (namely `type`, `authors`, `editors`, year and month, isbn, issn, and doi); when constructing a bibliographic citation, these external elements will have to be included as relevant. The description of the bibliographic fields listed below was adapted from Wikipedia's description of BibTex.

```json
{
    "chapter": "string",
    "edition": "string",
    "institution": "string",
    "journal": "string",
    "volume": "string",
    "number": "string",
    "pages": "string",
    "series": "string",
    "publisher": "string",
    "publisher_address": "string",
    "annote": "string",
    "booktitle": "string",
    "crossref": "string",
    "howpublished": "string",
    "key": "string",
    "organization": "string",
    "url": null
}
```

The elements that are required to form a complete bibliographic citation depend on the type of document. The table below, adapted from the BibTex templates, provides a list of required and optional fields by type of document:

| Document type | Required fields | Optional fields |
|---|---|---|
| Article from a journal or magazine | author, title, journal, year | volume, number, pages, month, note, key |
| Book with an explicit publisher | author or editor, title, publisher, year | volume, series, address, edition, month, note, key |
| Printed and bound document without a named publisher or sponsoring institution | title | author, howpublished, address, month, year, note, key |
| Part of a book (chapter and/or range of pages) | author or editor, title, chapter and/or pages, publisher, year | volume, series, address, edition, month, note, key |
| Part of a book with its own title | author, title, book title, publisher, year | editor, pages, organization, publisher, address, month, note, key |
| Article in a conference proceedings | author, title, book title, year | editor, pages, organization, publisher, address, month, note, key |
| Technical documentation | title | author, organization, address, edition, month, year, key |
| Master's thesis | author, title, school, year | address, month, note, key |
| Ph.D. thesis | author, title, school, year | address, month, note, key |
| Proceedings of a conference | title, year | editor, publisher, organization, address, month, note, key |
| Report published by a school or other institution, usually numbered within a series | author, title, institution, year | type, number, address, month, note, key |
| Document with an author and title, but not formally published | author, title, note | month, year, key |
- `chapter` [Optional ; Not repeatable ; String]
  A chapter (or section) number. This element is only used to document a resource which has been extracted from a book.

- `edition` [Optional ; Not repeatable ; String]
  The edition of a book, for example "Second". When a book has no edition number/name present, it can be assumed to be a first edition. If the edition is other than the first, information on the edition of the book being documented must be mentioned in the citation. The edition can be identified by a number, a label (such as "Revised edition" or "Abridged edition"), and/or a year. The first letter of the label should be capitalized.

- `institution` [Optional ; Not repeatable ; String]
  The sponsoring institution of a technical report. For citations of Master's and Ph.D. theses, this will be the name of the school.

- `journal` [Optional ; Not repeatable ; String]
  A journal name. Abbreviations are provided for many journals.

- `volume` [Optional ; Not repeatable ; String]
  The volume of a journal or multi-volume book. Periodical publications, such as scholarly journals, are published on a regular basis in installments that are called issues. A volume usually consists of the issues published during one year.

- `number` [Optional ; Not repeatable ; String]
  The number of a journal, magazine, technical report, or of a work in a series. An issue of a journal or magazine is usually identified by its volume (see previous element) and number; the organization that issues a technical report usually gives it a number; and sometimes books are given numbers in a named series.

- `pages` [Optional ; Not repeatable ; String]
  One or more page numbers or ranges of numbers, such as 42-111 or 7,41,73-97 or 43+ (the "+" indicates pages following that do not form a simple range).

- `series` [Optional ; Not repeatable ; String]
  The name of a series or set of books. When citing an entire book, the title field gives its title and an optional series field gives the name of a series or multi-volume set in which the book is published.

- `publisher` [Optional ; Not repeatable ; String]
  The entity responsible for making the resource available. For major publishing houses, the information can be omitted. For small publishers, providing the complete address is recommended. If the company is a university press, the abbreviation UP (for University Press) can be used. The publisher is not stated for journal articles, working papers, and similar types of documents.

- `publisher_address` [Optional ; Not repeatable ; String]
  The address of the publisher. For major publishing houses, just the city is given. For small publishers, the complete address can be provided.

- `annote` [Optional ; Not repeatable ; String]
  An annotation. This element will not be used by standard bibliography styles like the MLA, APA, or Chicago, but may be used by others that produce an annotated bibliography.

- `booktitle` [Optional ; Not repeatable ; String]
  Title of a book, part of which is being cited. If you are documenting the book itself, this element will not be used; it is only used when part of a book is being documented.

- `crossref` [Optional ; Not repeatable ; String]
  The catalog identifier ("database key") of another catalog entry being cross-referenced. This element may be used when multiple entries refer to a same publication, to avoid duplication.

- `howpublished` [Optional ; Not repeatable ; String]
  The `howpublished` element is used to store the notice for unusual publications. The first word should be capitalized. For example, "WebPage", or "Distributed at the local tourist office".

- `key` [Optional ; Not repeatable ; String]
  A key is a field used for alphabetizing, cross-referencing, and creating a label when the author information is missing.

- `organization` [Optional ; Not repeatable ; String]
  The organization that sponsors a conference or that publishes a manual.

- `url` [Optional ; Not repeatable ; String]
  The URL of the document, preferably a permanent URL.

This example makes use of the same Econometrica paper used in the previous example.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    journal = "Econometrica",
    volume  = "68",
    number  = "6",
    pages   = "1435-1464",
    url     = "https://onlinelibrary.wiley.com/doi/abs/10.1111/1468-0262.00167",
    
    # The DOI (10.1111/1468-0262.00167) would be provided in the `identifiers` element.
    
    # ...
  ),
  # ... 
)
```


- `translators` [Optional ; Repeatable]
  Information on translators, for publications that are translations of publications originally created in another language.


```json
"translators": [
    {
        "first_name": "string",
        "initial": "string",
        "last_name": "string",
        "affiliation": "string"
    }
]
```


- `first_name` [Optional ; Not repeatable ; String]
  The first name of the translator.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the translator.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the translator.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the translator.


- `contributors` [Optional ; Repeatable]
  These elements are used to acknowledge contributions to the production of the document, other than the ones for which specific metadata elements are provided (like `authors` or `translators`).
"contributors": [
+    {
+        "first_name": "string",
+        "initial": "string",
+        "last_name": "string",
+        "affiliation": "string",
+        "contribution": "string"
+    }
+]
+


- `first_name` [Optional ; Not repeatable ; String]
  The first name of the contributor.
- `initial` [Optional ; Not repeatable ; String]
  The initials of the contributor.
- `last_name` [Optional ; Not repeatable ; String]
  The last name of the contributor. If the contributor is an organization, enter the name of the organization here.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the contributor.
- `contribution` [Optional ; Not repeatable ; String]
  A brief description of the specific contribution of the person to the document, e.g. "Design of the cover page", or "Proofreading".
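A minimal, illustrative example (the contributor and contribution shown are placeholders):

```r
# Illustrative only; the contributor shown is a placeholder.
my_doc <- list(
  document_description = list(
    contributors = list(
      list(first_name = "Jane", last_name = "Doe",
           affiliation = "World Bank",
           contribution = "Design of the cover page")
    )
  )
)
```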


- `contacts` [Optional ; Repeatable]
  Contact information for a person or organization that can be contacted for inquiries related to the document.
"contacts": [
+    {
+        "name": "string",
+        "role": "string",
+        "affiliation": "string",
+        "email": "string",
+        "telephone": "string",
+        "uri": "string"
+    }
+]
+


- `name` [Optional ; Not repeatable ; String]
  The name of the contact. This can be a person or an organization.
- `role` [Optional ; Not repeatable ; String]
  The specific role of the person or organization mentioned in `name`.
- `affiliation` [Optional ; Not repeatable ; String]
  The affiliation of the contact person.
- `email` [Optional ; Not repeatable ; String]
  The email address of the contact person or organization. Personal emails should be avoided.
- `telephone` [Optional ; Not repeatable ; String]
  The telephone number for the contact person or organization. Personal phone numbers should be avoided.
- `uri` [Optional ; Not repeatable ; String]
  A link to an on-line resource related to the contact person or organization.
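A minimal, illustrative example (the contact details shown are placeholders):

```r
# Illustrative only; the contact shown is a placeholder.
my_doc <- list(
  document_description = list(
    contacts = list(
      list(name  = "Data helpdesk",
           role  = "Inquiries on content and usage",
           email = "helpdesk@example.org",
           uri   = "https://www.example.org/contact")
    )
  )
)
```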


- `rights` [Optional ; Not repeatable ; String]
  A statement on the rights associated with the document (other than the copyright, which should be described in the element `copyright` described below).

  The example is extracted from the World Bank World Development Report 2019.
```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,

    rights = "Some rights reserved. Nothing herein shall constitute or be considered to be a limitation upon or waiver of the privileges and immunities of The World Bank, all of which are specifically reserved.",

    # ...
  ),
  # ... 
)
```
- `copyright` [Optional ; Not repeatable ; String]
  A statement and identifier indicating the legal ownership and rights regarding use and re-use of all or part of the resource. If the document is protected by a copyright, enter the information on the person or organization who owns the rights.
- `usage_terms` [Optional ; Not repeatable ; String]
  This element is used to provide a description of the legal terms or other conditions that a person or organization who wants to use or reproduce the document has to comply with.
- `disclaimer` [Optional ; Not repeatable ; String]
  A disclaimer limits the liability of the author(s) and/or publisher(s) of the document. A standard legal statement should be used for all documents from a same agency.
      my_doc <- list(
    +  # ... ,
    +  document_description = list(
    +    # ... ,
    +    disclaimer = "This work is a product of the staff of The World Bank with external contributions. The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries."
    +    # ...
    +  ),
    +  # ... 
    +)  
- `security_classification` [Optional ; Not repeatable ; String]
  Information on the security classification attached to the document. The different levels of classification indicate the degree of sensitivity of the content of the document. This field should make use of a controlled vocabulary, specific to or adopted by the organization that curates or disseminates the document. Such a vocabulary could contain the following levels: "public", "internal only", "confidential", "restricted", "strictly confidential".
- `access_restrictions` [Optional ; Not repeatable ; String]
  A textual description of access restrictions that apply to the document.

- `sources` [Optional ; Repeatable]
```json
"sources": [
    {
        "source_origin": "string",
        "source_char": "string",
        "source_doc": "string"
    }
]
```


This element is used to describe the sources of different types (except data sources, which must be listed in the next element `data_sources`) that were used in the production of the document.

- `source_origin` [Optional ; Not repeatable ; String]
  For historical materials, information about the origin(s) of the sources and the rules followed in establishing the sources should be specified.
- `source_char` [Optional ; Not repeatable ; String]
  Characteristics of the source; an assessment of the characteristics and quality of the source material.
- `source_doc` [Optional ; Not repeatable ; String]
  Documentation and access to the source.
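A minimal, illustrative example (the source description and link shown are placeholders):

```r
# Illustrative only; the source described here is a placeholder.
my_doc <- list(
  document_description = list(
    sources = list(
      list(source_origin = "Administrative records of the national statistical office",
           source_char   = "Paper records, digitized and verified by the curation team",
           source_doc    = "https://example.org/sources/admin-records")
    )
  )
)
```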

- `data_sources` [Optional ; Repeatable]


```json
"data_sources": [
    {
        "name": "string",
        "uri": "string",
        "note": "string"
    }
]
```


Used to list the machine-readable data file(s), if any, that served as the source(s) of the data collection.

- `name` [Required ; Not repeatable ; String]
  Name (title) of the dataset used as a source.
- `uri` [Optional ; Not repeatable ; String]
  Link (URL) to the dataset or to a web page describing the dataset.
- `note` [Optional ; Not repeatable ; String]
  Additional information on the data source.


The data source for the publication Bangladesh Demographic and Health Survey (DHS), 2017-18 - Final Report would be entered as follows:

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    data_sources = list(
      list(name = "Bangladesh Demographic and Health Survey 2017-18", 
           uri  = "https://www.dhsprogram.com/methodology/survey/survey-display-536.cfm",
           note = "Household survey conducted by the National Institute of Population Research and Training, Medical Education and Family Welfare Division and Ministry of Health and Family Welfare. Data and documentation available at https://dhsprogram.com/")
    ),
    
    # ...
  ),
  # ... 
)
```


- `keywords` [Optional ; Repeatable]


```json
"keywords": [
    {
        "name": "string",
        "vocabulary": "string",
        "uri": "string"
    }
]
```


A list of keywords that provide information on the core content of the document. Keywords provide a convenient solution to improve the discoverability of the document, as they allow terms and phrases not found in the document itself to be indexed and to make the document discoverable by text-based search engines. A controlled vocabulary can be used (although not required), such as the UNESCO Thesaurus. The list provided here can combine keywords from multiple controlled vocabularies and user-defined keywords.

- `name` [Required ; Not repeatable ; String]
  The keyword itself.
- `vocabulary` [Optional ; Not repeatable ; String]
  The controlled vocabulary (including version number or date) from which the keyword is extracted, if any.
- `uri` [Optional ; Not repeatable ; String]
  The URL of the controlled vocabulary from which the keyword is extracted, if any.
  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,
+      
+      keywords = list(
+        list(name = "Migration", vocabulary = "Unesco Thesaurus (June 2021)", 
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Migrants", vocabulary = "Unesco Thesaurus (June 2021)", 
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Refugee", vocabulary = "Unesco Thesaurus (June 2021)", 
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Conflict"),
+        list(name = "Asylum seeker"),
+        list(name = "Forced displacement"),
+        list(name = "Forcibly displaced"),
+        list(name = "Internally displaced population (IDP)"),
+        list(name = "Population of concern (PoC)")
+        list(name = "Returnee")
+        list(name = "UNHCR")
+      ),
+      
+      # ...
+    ),
+    # ... 
+  )  
+


- `themes` [Optional ; Repeatable]


```json
"themes": [
    {
        "id": "string",
        "name": "string",
        "parent_id": "string",
        "vocabulary": "string",
        "uri": "string"
    }
]
```


A list of themes covered by the document. A controlled vocabulary will preferably be used. The list provided here can combine themes from multiple controlled vocabularies and user-defined themes. Note that themes will rarely be used, as the elements `topics` and `disciplines` are more appropriate for most uses. This is a block of five fields:

- `id` [Optional ; Not repeatable ; String]
  The ID of the theme, taken from a controlled vocabulary.
- `name` [Required ; Not repeatable ; String]
  The name (label) of the theme, preferably taken from a controlled vocabulary.
- `parent_id` [Optional ; Not repeatable ; String]
  The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
- `vocabulary` [Optional ; Not repeatable ; String]
  The name (including version number) of the controlled vocabulary used, if any.
- `uri` [Optional ; Not repeatable ; String]
  The URL to the controlled vocabulary used, if any.
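A minimal, illustrative example (the theme label and vocabulary name are placeholders, not taken from an actual controlled vocabulary):

```r
# Illustrative only; the theme and vocabulary shown are placeholders.
my_doc <- list(
  document_description = list(
    themes = list(
      list(id = "T1",
           name = "Labor markets",
           vocabulary = "Example agency theme list, v1.0",
           uri = "https://example.org/themes")
    )
  )
)
```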


- `topics` [Optional ; Repeatable]
"topics": [
+    {
+        "id": "string",
+        "name": "string",
+        "parent_id": "string",
+        "vocabulary": "string",
+        "uri": "string"
+    }
+]
+


Information on the topics covered in the document. A controlled vocabulary will preferably be used, for example the CESSDA Topic Classification (a typology of topics available in 11 languages), the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. The list provided here can combine topics from multiple controlled vocabularies and user-defined topics. The element is a block of five fields:

- `id` [Optional ; Not repeatable ; String]
  The identifier of the topic, taken from a controlled vocabulary.
- `name` [Required ; Not repeatable ; String]
  The name (label) of the topic, preferably taken from a controlled vocabulary.
- `parent_id` [Optional ; Not repeatable ; String]
  The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
- `vocabulary` [Optional ; Not repeatable ; String]
  The name (including version number) of the controlled vocabulary used, if any.
- `uri` [Optional ; Not repeatable ; String]
  The URL to the controlled vocabulary used, if any.

We use the working paper “Push and Pull - A Study of International Migration from Nepal” by Maheshwor Shrestha, World Bank Policy Research Working Paper 7965, February 2017, as an example.

```r
my_doc <- list(
  # ... ,
  document_description = list(
    # ... ,
    
    topics = list(
    
      list(name = "Demography.Migration", 
           vocabulary = "CESSDA Topic Classification", 
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      
      list(name = "Demography.Censuses", 
           vocabulary = "CESSDA Topic Classification", 
           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
      
      list(id = "F22", 
           name = "International Migration", 
           parent_id = "F2 - International Factor Movements and International Business", 
           vocabulary = "JEL Classification System", 
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
      
      list(id = "O15", 
           name = "Human Resources - Human Development - Income Distribution - Migration", 
           parent_id = "O1 - Economic Development", 
           vocabulary = "JEL Classification System", 
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
      
      list(id = "O12", 
           name = "Microeconomic Analyses of Economic Development", 
           parent_id = "O1 - Economic Development", 
           vocabulary = "JEL Classification System", 
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
      
      list(id = "J61", 
           name = "Geographic Labor Mobility - Immigrant Workers", 
           parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers", 
           vocabulary = "JEL Classification System", 
           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
           
    ),
    
    # ...
  ),
  # ...
)
```


- `disciplines` [Optional ; Repeatable]


```json
"disciplines": [
    {
        "id": "string",
        "name": "string",
        "parent_id": "string",
        "vocabulary": "string",
        "uri": "string"
    }
]
```


+

Information on the academic disciplines related to the content of the document. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia. The list provided here can combine disciplines from multiple controlled vocabularies and user-defined disciplines. This is a block of five elements:

- **id** [Optional ; Not repeatable ; String]: The identifier of the discipline, taken from a controlled vocabulary.
- **name** [Optional ; Not repeatable ; String]: The name (label) of the discipline, preferably taken from a controlled vocabulary.
- **parent_id** [Optional ; Not repeatable ; String]: The identifier of the parent discipline (the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
- **vocabulary** [Optional ; Not repeatable ; String]: The name (including version number) of the controlled vocabulary used, if any.
- **uri** [Optional ; Not repeatable ; String]: The URL of the controlled vocabulary used, if any.
+
  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... ,  
+      
+      disciplines = list(
+        
+        list(name = "Economics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+             
+        list(name = "Agricultural economics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+        
+        list(name = "Econometrics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields")
+             
+      ),
+      
+      # ...
+    ),
+    # ... 
+  )  
+


+
**audience** [Optional ; Not repeatable ; String]

Information on the intended audience for the document, i.e., the category or categories of users for whom the resource is intended in terms of their interests, skills, status, or other characteristics.
+


+
**mandate** [Optional ; Not repeatable ; String]

The legislative or other mandate under which the resource was produced.
+


+
**pricing** [Optional ; Not repeatable ; String]

The current price of the document in any defined currency. As this information is subject to regular change, it will often not be included in the document metadata.
+
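For illustration, these three elements could be filled as follows in R; this is a minimal sketch, and the values shown are purely illustrative (they are not taken from the example document).

```r
my_doc <- list(
  document_description = list(
    # ... other document_description elements would be listed here ...
    audience = "Researchers and policy makers with an interest in international migration",
    mandate  = "Produced under the World Bank's analytical and advisory work program",  # hypothetical
    pricing  = "Available as open access, free of charge"
  )
)
```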


+
**relations** [Optional ; Repeatable]

References to related resources, with a specification of the type of relationship.
+


+
"relations": [
+    {
+        "name": "string",
+        "type": "isPartOf"
+    }
+]
+


+
- **name** [Optional ; Not repeatable ; String]: The related resource. Recommended practice is to identify the related resource by means of a URL. If this is not possible or feasible, a string conforming to a formal identification system may be provided.
- **type** [Optional ; Not repeatable ; String]: The type of relationship. The use of a controlled vocabulary is recommended. The Dublin Core proposes the following vocabulary: {isPartOf, hasPart, isVersionOf, isFormatOf, hasFormat, references, isReferencedBy, isBasedOn, isBasisFor, replaces, isReplacedBy, requires, isRequiredBy}.
| Type           | Description |
|----------------|-------------|
| isPartOf       | The described resource is a physical or logical part of the referenced resource. |
| hasPart        | The described resource includes the referenced resource, physically or logically. |
| isVersionOf    | The described resource is a version, edition, or adaptation of the referenced resource. A change in version implies substantive changes in content rather than differences in format. |
| isFormatOf     | The described resource is essentially the same intellectual content as the referenced resource, presented in another format. |
| hasFormat      | The described resource pre-existed the referenced resource, which is essentially the same intellectual content presented in another format. |
| references     | The described resource references, cites, or otherwise points to the referenced resource. |
| isReferencedBy | The described resource is referenced, cited, or otherwise pointed to by the referenced resource. |
| isBasedOn      | The described resource is derived from the referenced resource. |
| isBasisFor     | The referenced resource is derived from the described resource. |
| replaces       | The described resource supplants, displaces, or supersedes the referenced resource. |
| isReplacedBy   | The described resource is supplanted, displaced, or superseded by the referenced resource. |
| requires       | The described resource requires the referenced resource to support its function, delivery, or coherence of content. |
+
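As an illustration, a relations block could be written as follows in R. This is a sketch only; the related resources shown are hypothetical, and the URL is a placeholder.

```r
my_doc <- list(
  document_description = list(
    # ... other document_description elements would be listed here ...
    relations = list(
      list(name = "https://openknowledge.worldbank.org/...",   # placeholder URL of a related resource
           type = "isVersionOf"),
      list(name = "World Bank Policy Research Working Paper series",
           type = "isPartOf")
    )
  )
)
```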


+
**reproducibility** [Optional ; Not repeatable]
+


+
"reproducibility": {
+    "statement": "string",
+    "links": [
+        {
+            "uri": "string",
+            "description": "string"
+        }
+    ]
+}
+


+

We present in Chapter 12 a metadata schema intended to document reproducible research and scripts. That chapter lists multiple reasons to make research reproducible, replicable, and auditable. Ideally, when a research output (paper) is published, the data and code used in the underlying analysis should be made as openly available as possible; increasingly, academic journals make this a requirement. The reproducibility element is used to provide interested users with information on the reproducibility and replicability of the research output.

+
- **statement** [Optional ; Not repeatable ; String]: A general statement on reproducibility and replicability of the analysis (including data processing, tabulation, production of visualizations, modeling, etc.) being presented in the document.
- **links** [Optional ; Repeatable]: Links to web pages where reproducible materials and the related information can be found.
  - **uri** [Optional ; Not repeatable ; String]: The link to a web page.
  - **description** [Optional ; Not repeatable ; String]: A brief description of the content of the web page.
+
  my_doc <- list(
+    # ... ,
+    document_description = list(
+      # ... , 
+      
+      reproducibility = list(
+        statement = "The scripts used to acquire data, assess and edit data files, train the econometric models, and to generate the tables and charts included in the publication, are openly accessible (Stata 15 scripts).",
+        links = list(
+          list(uri = "www.[...]",  
+               description = "Description and access to reproducible Stata scripts"),
+          list(uri = "www.[...]",  
+               description = "Derived data files")     
+        )       
+      ),
+      # ...
+    ),
+    # ... 
+  )        
+
+
+

### 4.2.3 Provenance

+

Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.

+


+
"provenance": [
+    {
+        "origin_description": {
+            "harvest_date": "string",
+            "altered": true,
+            "base_url": "string",
+            "identifier": "string",
+            "date_stamp": "string",
+            "metadata_namespace": "string"
+        }
+    }
+]
+


+
- **origin_description** [Required ; Not repeatable]: The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
  - **harvest_date** [Required ; Not repeatable ; String]: The date and time the metadata were harvested, entered in ISO 8601 format.
  - **altered** [Optional ; Not repeatable ; Boolean]: A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Document Description / Title Statement section) will be modified when published in a new catalog.
  - **base_url** [Required ; Not repeatable ; String]: The URL from where the metadata were harvested.
  - **identifier** [Optional ; Not repeatable ; String]: The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
  - **date_stamp** [Optional ; Not repeatable ; String]: The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
  - **metadata_namespace** [Optional ; Not repeatable ; String]: The namespace (URI) of the metadata standard or schema used by the originating repository, if known.
+
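A sketch of a provenance block, written as an R fragment in the same style as the other examples in this chapter; the source catalog, identifier, and dates are hypothetical.

```r
provenance = list(
  list(
    origin_description = list(
      harvest_date       = "2022-04-01T10:30:00Z",
      altered            = TRUE,
      base_url           = "https://catalog.example.org/index.php/api/",  # hypothetical source catalog
      identifier         = "SRC_CATALOG_DOC_0001",                        # identifier in the source catalog
      date_stamp         = "2022-03-15",
      metadata_namespace = ""
    )
  )
)
```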
+
+

### 4.2.4 Tags

+

**tags** [Optional ; Repeatable]

As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs; section 1.7 also provides an example of their use in R.

+


+
"tags": [
+    {
+        "tag": "string",
+        "tag_group": "string"
+    }
+]
+


+
- **tag** [Required ; Not repeatable ; String]: A user-defined tag.
- **tag_group** [Optional ; Not repeatable ; String]: A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
+
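A sketch of a tags block in R, using hypothetical tags and tag groups that a catalog administrator might define:

```r
tags = list(
  list(tag = "migration",      tag_group = "theme"),          # hypothetical tag and group
  list(tag = "remittances",    tag_group = "theme"),
  list(tag = "working paper",  tag_group = "document_type")
)
```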
+
+

### 4.2.5 LDA topics

+

**lda_topics** [Optional ; Not repeatable]

+


+
"lda_topics": [
+    {
+        "model_info": [
+            {
+                "source": "string",
+                "author": "string",
+                "version": "string",
+                "model_id": "string",
+                "nb_topics": 0,
+                "description": "string",
+                "corpus": "string",
+                "uri": "string"
+            }
+        ],
+        "topic_description": [
+            {
+                "topic_id": null,
+                "topic_score": null,
+                "topic_label": "string",
+                "topic_words": [
+                    {
+                        "word": "string",
+                        "word_weight": 0
+                    }
+                ]
+            }
+        ]
+    }
+]
+


+

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters", or "topics", is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference provides the share that each topic represents in the document; the shares of all topics in a document sum to 1 (100%).

The metadata element lda_topics is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition.

+
+

Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the lda_topics elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.

+
+

The image below provides an example of topics extracted from a document of the United Nations High Commissioner for Refugees (UNHCR), using an LDA topic model trained by the World Bank (this model was trained to identify 75 topics; no document will cover all topics).

+

+

The lda_topics element includes the following metadata fields:

+
- **model_info** [Optional ; Not repeatable]: Information on the LDA model.
  - **source** [Optional ; Not repeatable ; String]: The source of the model (typically, an organization).
  - **author** [Optional ; Not repeatable ; String]: The author(s) of the model.
  - **version** [Optional ; Not repeatable ; String]: The version of the model, which could be defined by a date or a number.
  - **model_id** [Optional ; Not repeatable ; String]: The unique ID given to the model.
  - **nb_topics** [Optional ; Not repeatable ; Numeric]: The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
  - **description** [Optional ; Not repeatable ; String]: A brief description of the model.
  - **corpus** [Optional ; Not repeatable ; String]: A brief description of the corpus on which the LDA model was trained.
  - **uri** [Optional ; Not repeatable ; String]: A link to a web page where additional information on the model is available.
- **topic_description** [Optional ; Repeatable]: The topic composition of the document.
  - **topic_id** [Optional ; Not repeatable ; String]: The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
  - **topic_score** [Optional ; Not repeatable ; Numeric]: The share of the topic in the document (%).
  - **topic_label** [Optional ; Not repeatable ; String]: The label of the topic, if any (not automatically generated by the LDA model).
  - **topic_words** [Optional ; Not repeatable]: The list of N keywords describing the topic (e.g., the top 5 words).
    - **word** [Optional ; Not repeatable ; String]: The word.
    - **word_weight** [Optional ; Not repeatable ; Numeric]: The weight of the word in the definition of the topic. This is specific to the model, not to a document.
+
lda_topics = list(
+  
+   list(
+  
+      model_info = list(
+        list(source      = "World Bank, Development Data Group",
+             author      = "A.S.",
+             version     = "2021-06-22",
+             model_id    = "Mallet_WB_75",
+             nb_topics   = 75,
+             description = "LDA model, 75 topics, trained on Mallet",
+             corpus      = "World Bank Documents and Reports (1950-2021)",
+             uri         = "")
+      ),
+      
+      topic_description = list(
+      
+        list(topic_id    = "topic_27",
+             topic_score = 32,
+             topic_label = "Education",
+             topic_words = list(list(word = "school",      word_weight = "")
+                                list(word = "teacher",     word_weight = ""),
+                                list(word = "student",     word_weight = ""),
+                                list(word = "education",   word_weight = ""),
+                                list(word = "grade",       word_weight = "")),
+        
+        list(topic_id    = "topic_8",
+             topic_score = 24,
+             topic_label = "Gender",
+             topic_words = list(list(word = "women",       word_weight = "")
+                                list(word = "gender",      word_weight = ""),
+                                list(word = "man",         word_weight = ""),
+                                list(word = "female",      word_weight = ""),
+                                list(word = "male",        word_weight = "")),
+        
+        list(topic_id    = "topic_39",
+             topic_score = 22,
+             topic_label = "Forced displacement",
+             topic_words = list(list(word = "refugee",     word_weight = "")
+                                list(word = "programme",   word_weight = ""),
+                                list(word = "country",     word_weight = ""),
+                                list(word = "migration",   word_weight = ""),
+                                list(word = "migrant",     word_weight = "")),
+                                
+        list(topic_id    = "topic_40",
+             topic_score = 11,
+             topic_label = "Development policies",
+             topic_words = list(list(word = "development", word_weight = "")
+                                list(word = "policy",      word_weight = ""),
+                                list(word = "national",    word_weight = ""),
+                                list(word = "strategy",    word_weight = ""),
+                                list(word = "activity",    word_weight = ""))
+                                
+      )
+      
+   )
+   
+)
+


+The information provided by LDA models can be used to build a “filter by topic composition” tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic. +
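The sketch below (base R, with purely hypothetical topic shares) illustrates the logic of such a filter: a document is retained only if each selected topic reaches the minimum share set by the user.

```r
# Hypothetical topic shares (%), one row per document, one column per topic
doc_topics <- data.frame(
  doc_id   = c("doc_001", "doc_002", "doc_003"),
  topic_08 = c(24,  2, 15),   # e.g., Gender
  topic_27 = c(32, 40,  5),   # e.g., Education
  topic_39 = c(22,  1, 60)    # e.g., Forced displacement
)

# User-defined filter: minimum share (%) required for each selected topic
thresholds <- c(topic_27 = 20, topic_39 = 10)

# Keep documents for which all selected topics meet their threshold
keep <- apply(doc_topics[, names(thresholds)], 1, function(x) all(x >= thresholds))
doc_topics$doc_id[keep]   # documents that satisfy the topic-composition filter
```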
+
+ +
+
+
+

### 4.2.6 Embeddings

+

**embeddings** [Optional ; Repeatable]

In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.
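The short sketch below illustrates the principle, using made-up low-dimensional vectors (real embeddings would typically have 100 or more dimensions) and cosine similarity as the measure of closeness.

```r
# Hypothetical document vectors (in practice, generated by a pre-trained embedding model)
doc_vectors <- list(
  doc_A = c(0.12, 0.40, 0.05, 0.33),
  doc_B = c(0.10, 0.38, 0.07, 0.30),
  doc_C = c(0.90, 0.02, 0.55, 0.01)
)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Rank the other documents by semantic similarity to doc_A
query <- doc_vectors$doc_A
similarities <- sapply(doc_vectors[names(doc_vectors) != "doc_A"], cosine, b = query)
sort(similarities, decreasing = TRUE)   # doc_B comes out as the closest document
```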

+

+

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.

+


+
"embeddings": [
+    {
+        "id": "string",
+        "description": "string",
+        "date": "string",
+        "vector": null
+    }
+]
+


+

The embeddings element contains four metadata fields:

+
- **id** [Optional ; Not repeatable ; String]: A unique identifier of the word embedding model used to generate the vector.
- **description** [Optional ; Not repeatable ; String]: A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- **date** [Optional ; Not repeatable ; String]: The date the model was trained (or a version date for the model).
- **vector** [Required ; Not repeatable ; Object]: The numeric vector representing the document, provided as an object (array or string), for example [1, 4, 3, 5, 7, 9].
+
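A sketch of an embeddings block in R; the model identifier, description, and vector values are hypothetical, and the vector is truncated for readability.

```r
embeddings = list(
  list(id          = "wb_doc_embedding_2021",   # hypothetical model identifier
       description = "Document vector produced by an embedding model trained on a corpus of development publications (vector size 100)",
       date        = "2021-06-30",
       vector      = c(0.013, -0.217, 0.334, 0.087, -0.145))   # truncated; a real vector would contain e.g. 100 values
)
```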
+
+

### 4.2.7 Additional fields

+

**additional** [Optional ; Not repeatable]

The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.
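For example, a data curator could store catalog-specific information in the additional block as follows; all element names and values shown are user-defined and hypothetical.

```r
additional = list(
  internal_review_status = "approved",                                            # user-defined element (hypothetical)
  curation_note          = "Metadata reviewed and augmented by the curation team" # user-defined element (hypothetical)
)
```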

+
+
+
+

## 4.3 Complete examples

+

Generating metadata compliant with the document schema is easy. The three examples below illustrate how metadata can be generated and published in a NADA catalog, programmatically. In the first two examples, we assume that an electronic copy of a document is available, and that the metadata must be generated from scratch (not by re-purposing/mapping existing metadata). In the third example, we assume that a list of publications with some metadata is available as a CSV file; metadata compliant with the schema are created and published in a catalog using a single script.

+
+

### 4.3.1 Example 1: Working Paper

+
+

#### 4.3.1.1 Description

+

This document is World Bank Policy Research Working Paper No. 9412, titled "Predicting Food Crises", published in September 2020 under a CC BY 4.0 license. The list of authors is provided on the cover page; an abstract, a list of acknowledgments, and a list of keywords are also provided.

+
+

+

+ +
+
+
+

#### 4.3.1.2 Using a metadata editor

+

The metadata for this document can also be captured using the open source World Bank Metadata Editor.

+
+
+
[Screenshot: the example document being documented in the World Bank Metadata Editor]

+
+
+


+
+
+

#### 4.3.1.3 Using R

+
library(nadar)
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_folder")
+doc_file <- "WB_PRWP_9412_Food_Crises.pdf"
+
+id <- "WB_WPS9412"  
+
+thumb_file <- gsub(".pdf", ".jpg", doc_file)
+capture_pdf_cover(doc_file)  # Capture cover page for use as thumbnail
+
+example_1 <- list(
+  
+  document_description = list(
+    
+    title_statement = list(idno = id, title = "Predicting Food Crises"),
+    
+    date_published = "2020-09",
+    
+    authors = list(
+      list(last_name = "Andrée", first_name = "Bo Pieter Johannes", 
+           affiliation = "World Bank",
+           author_id = list(list(type = "ORCID", id = "0000-0002-8007-5007"))),
+      list(last_name = "Chamorro", first_name = "Andres", 
+           affiliation = "World Bank"),
+      list(last_name = "Kraay", first_name = "Aart",   
+           affiliation = "World Bank"),
+      list(last_name = "Spencer", first_name = "Phoebe", 
+           affiliation = "World Bank"),
+      list(last_name = "Wang", first_name = "Dieter", 
+           affiliation = "World Bank",
+           author_id = list(list(type = "ORCID", id = "0000-0003-1287-332X")))
+    ),
+                   
+    journal   = "World Bank Policy Research Working Paper",
+    number    = "9412",
+    publisher = "World Bank",
+     
+    ref_country =  list(
+      list(name="Afghanistan",      code="AFG"),
+      list(name="Burkina Faso",     code="BFA"),
+      list(name="Chad",             code="TCD"),
+      list(name="Congo, Dem. Rep.", code="COD"),
+      list(name="Ethiopia",         code="ETH"),
+      list(name="Guatemala",        code="GTM"),
+      list(name="Haiti",            code="HTI"),
+      list(name="Kenya",            code="KEN"),
+      list(name="Malawi",           code="MWI"),
+      list(name="Mali",             code="MLI"),
+      list(name="Mauritania",       code="MRT"),
+      list(name="Mozambique",       code="MOZ"),
+      list(name="Niger",            code="NER"),
+      list(name="Nigeria",          code="NGA"),
+      list(name="Somalia",          code="SOM"),
+      list(name="South Sudan",      code="SSD"),
+      list(name="Sudan",            code="SDN"),
+      list(name="Uganda",           code="UGA"),
+      list(name="Yemen, Rep.",      code="YEM"),
+      list(name="Zambia",           code="ZMB"),
+      list(name="Zimbabwe",         code="ZWE")
+    ),
+     
+    abstract = "Globally, more than 130 million people are estimated to be in food crisis. These humanitarian disasters are associated with severe impacts on livelihoods that can reverse years of development gains. The existing outlooks of crisis-affected populations rely on expert assessment of evidence and are limited in their temporal frequency and ability to look beyond several months. This paper presents a statistical forecasting approach to predict the outbreak of food crises with sufficient lead time for preventive action. Different use cases are explored related to possible alternative targeting policies and the levels at which finance is typically unlocked. The results indicate that, particularly at longer forecasting horizons, the statistical predictions compare favorably to expert-based outlooks. The paper concludes that statistical models demonstrate good ability to detect future outbreaks of food crises and that using statistical forecasting approaches may help increase lead time for action.",
+     
+    languages = list(list(name="English", code="EN")),
+     
+    reproducibility = list(
+      statement = "The code and data needed to reproduce the analysis are openly available.",
+      links = list(
+        list(uri="http://fcv.ihsn.org/catalog/study/RR_WLD_2020_PFC_v01", 
+             description= "Source code"),
+        list(uri="http://fcv.ihsn.org/catalog/study/WLD_2020_PFC_v01_M",  
+             description= "Dataset")
+      )
+    )
+    
+  )
+  
+)  
+  
+# Publish the metadata in NADA
+document_add(idno = id, 
+             metadata = example_1, 
+             repositoryid = "central", 
+             published = 1, 
+             thumbnail = thumb_file, 
+             overwrite = "yes")
+
+# Provide a link to the document (as an external resource)
+external_resources_add(
+  title = "Predicting Food Crises",
+  idno = id,
+  dctype = "doc/anl",
+  file_path = "http://hdl.handle.net/10986/34510",
+  overwrite = "yes"
+)
+

The document will now be available in the NADA catalog.

+


+ +

+
+
+

#### 4.3.1.4 Using Python

+

The Python equivalent of the R script presented above is as follows.

+
# @@@ Script not tested yet
+
+import pynada as nada
+import inspect
+
+dataset_id = "WB_WPS9412"
+
+repository_id = "central"
+published = 0
+overwrite = "yes"
+
+document_description = {
+
+  'title_statement': {
+    'idno': dataset_id,
+    'title': "Predicting Food Crises"
+  },
+  
+  'date_published': "2020-09",
+  
+  'authors': [
+    {
+      'last_name': "Andrée",
+      'first_name': "Bo Pieter Johannes",
+      'affiliation': "World Bank"
+    },
+    {
+      'last_name': "Chamorro",
+      'first_name': "Andres",
+      'affiliation': "World Bank"
+    },
+    {
+      'last_name': "Kraay",
+      'first_name': "Aart",
+      'affiliation': "World Bank"
+    },
+    {
+      'last_name': "Spencer",
+      'first_name': "Phoebe",
+      'affiliation': "World Bank"
+    },
+    {
+      'last_name': "Wang",
+      'first_name': "Dieter",
+      'affiliation': "World Bank"
+    }
+  ],
+  
+  'journal': "World Bank Policy Research Working Paper No. 9412",
+  
+  'publisher': "World Bank",
+  
+  'ref_country': [
+    {'name'="Afghanistan",      'code'="AFG"},
+    {'name'="Burkina Faso",     'code'="BFA"},
+    {'name'="Chad",             'code'="TCD"},
+    {'name'="Congo, Dem. Rep.", 'code'="COD"},
+    {'name'="Ethiopia",         'code'="ETH"},
+    {'name'="Guatemala",        'code'="GTM"},
+    {'name'="Haiti",            'code'="HTI"},
+    {'name'="Kenya",            'code'="KEN"},
+    {'name'="Malawi",           'code'="MWI"},
+    {'name'="Mali",             'code'="MLI"},
+    {'name'="Mauritania",       'code'="MRT"},
+    {'name'="Mozambique",       'code'="MOZ"},
+    {'name'="Niger",            'code'="NER"},
+    {'name'="Nigeria",          'code'="NGA"},
+    {'name'="Somalia",          'code'="SOM"},
+    {'name'="South Sudan",      'code'="SSD"},
+    {'name'="Sudan",            'code'="SDN"},
+    {'name'="Uganda",           'code'="UGA"},
+    {'name'="Yemen, Rep.",      'code'="YEM"},
+    {'name'="Zambia",           'code'="ZMB"},
+    {'name'="Zimbabwe",         'code'="ZWE"}
+  ],
+  
+  'abstract': inspect.cleandoc("""\
+        
+Globally, more than 130 million people are estimated to be in food crisis. These humanitarian disasters are associated with severe impacts on livelihoods that can reverse years of development gains. 
+The existing outlooks of crisis-affected populations rely on expert assessment of evidence and are limited in their temporal frequency and ability to look beyond several months. 
+This paper presents a statistical forecasting approach to predict the outbreak of food crises with sufficient lead time for preventive action. 
+Different use cases are explored related to possible alternative targeting policies and the levels at which finance is typically unlocked. 
+The results indicate that, particularly at longer forecasting horizons, the statistical predictions compare favorably to expert-based outlooks. 
+The paper concludes that statistical models demonstrate good ability to detect future outbreaks of food crises and that using statistical forecasting approaches may help increase lead time for action.
+    
+    """),
+    
+  'languages': [
+    {'name': "English", 'code': "EN"}
+  ],
+  
+  'reproducibility': {
+    'statement': "The code and data needed to reproduce the analysis are openly available.",
+    'links': [
+      {
+        'uri': "http://fcv.ihsn.org/catalog/study/RR_WLD_2020_PFC_v01", 
+        'description':  "Source code"
+      },
+      {
+        'uri': "http://fcv.ihsn.org/catalog/study/WLD_2020_PFC_v01_M",
+        'description': "Dataset"
+      }
+    ]
+  }
+}
+
+files = [
+  {'file_uri': "http://hdl.handle.net/10986/34510"}
+]
+
+resources = []   # assumption: no additional external resources documented in this example
+
+
+nada.create_document_dataset(
+    dataset_id = dataset_id,
+    repository_id = repository_id,
+    published = published,
+    overwrite = overwrite,
+    document_description = document_description,
+    resources = resources,
+    files = files
+)
+
+# If you have pdf file, generate thumbnail from it.
+pdf_file = "WB_PRWP_9412_Food_Crises.pdf"
+thumbnail_path = nada.pdf_to_thumbnail(pdf_file, page_no=1)
+nada.upload_thumbnail(dataset_id, thumbnail_path)
+
+
+
+

### 4.3.2 Example 2: Book

+

This example documents the World Bank World Development Report (WDR) 2019 titled “The Changing Nature of Work”. The book is available in multiple languages. It also has related resources like presentations and an Overview available in multiple languages, which we also document.

+
+

#### 4.3.2.1 Description

+
+ + + +
+
+
+

#### 4.3.2.2 Using R

+
library(nadar)
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_folder")
+doc_file <- "2019-WDR-Report.pdf"
+
+id <- "WB_WDR2019"  
+meta_id <- "WBDG_WB_WDR2019"
+
+thumb_file <- gsub(".pdf", ".jpg", doc_file)
+capture_pdf_cover(doc_file)  # Capture cover page for use as thumbnail
+
+# Generate the metadata
+
+example_2 = list(
+  
+  metadata_information = list(
+    title = "The Changing Nature of Work",
+    idno = meta_id,
+    producers = list(
+      list(name = "Development Data Group, Curation Team", 
+           abbr = "DECDG", 
+           affiliation = "World Bank")
+    ),
+    production_date = "2020-12-27"
+  ),
+
+  document_description = list(
+    
+    title_statement = list(
+      idno = id,
+      title = "The Changing Nature of Work",
+      sub_title = "World Development Report 2019",
+      abbreviated_title = "WDR 2019"
+    ),
+
+    authors = list(
+      list(first_name = "Rong",      last_name = "Chen",      affiliation = "World Bank"),
+      list(first_name = "Davida",    last_name = "Connon",    affiliation = "World Bank"),
+      list(first_name = "Ana P.",    last_name = "Cusolito",  affiliation = "World Bank"),
+      list(first_name = "Ugo",       last_name = "Gentilini", affiliation = "World Bank"),
+      list(first_name = "Asif",      last_name = "Islam",     affiliation = "World Bank"),
+      list(first_name = "Shwetlena", last_name = "Sabarwal",  affiliation = "World Bank"),
+      list(first_name = "Indhira",   last_name = "Santos",    affiliation = "World Bank"),
+      list(first_name = "Yucheng",   last_name = "Zheng",     affiliation = "World Bank")
+    ),
+    
+    date_created = "2019",
+    date_published = "2019",
+    
+    identifiers = list(
+      list(type = "ISSN",           value = "0163-5085"),
+      list(type = "ISBN softcover", value = "978-1-4648-1328-3"),
+      list(type = "ISBN hardcover", value = "978-1-4648-1342-9"),
+      list(type = "e-ISBN",         value = "978-1-4648-1356-6"),
+      list(type = "DOI softcover",  value = "10.1596/978-1-4648-1328-3"),   
+      list(type = "DOI hardcover",  value = "10.1596/978-1-4648-1342-9")
+    ),
+    
+    type = "book",
+    
+    description = "The World Development Report (WDR) 2019: The Changing Nature of Work studies how the nature of work is changing as a result of advances in technology today. Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Work is constantly reshaped by technological progress. Firms adopt new ways of production, markets expand, and societies evolve. Overall, technology brings opportunity, paving the way to create new jobs, increase productivity, and deliver effective public services. Firms can grow rapidly thanks to digital transformation, expanding their boundaries and reshaping traditional production patterns. The rise of the digital platform firm means that technological effects reach more people faster than ever before. Technology is changing the skills that employers seek. Workers need to be better at complex problem-solving, teamwork and adaptability. Digital technology is also changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. The Report analyzes these changes and considers how governments can best respond. Investing in human capital must be a priority for governments in order for workers to build the skills in demand in the labor market. In addition, governments need to enhance social protection and extend it to all people in society, irrespective of the terms on which they work. To fund these investments in human capital and social protection, the Report offers some suggestions as to how governments can mobilize additional revenues by increasing the tax base.",
+    
+    toc_structured = list(
+      list(id = "00",                   name = "Overview"),
+      list(id = "01", parent_id = "00", name = "Changes in the nature of work"),
+      list(id = "02", parent_id = "00", name = "What can governments do?"),
+      list(id = "03", parent_id = "00", name = "Organization of this study"),
+      list(id = "10",                   name = "1. The changing nature of work"),
+      list(id = "11", parent_id = "10", name = "Technology generates jobs"),
+      list(id = "12", parent_id = "10", name = "How work is changing"),
+      list(id = "13", parent_id = "10", name = "A simple model of changing work"),
+      list(id = "20",                   name = "2. The changing nature of firms"),
+      list(id = "21", parent_id = "20", name = "Superstar firms"),
+      list(id = "22", parent_id = "20", name = "Competitive markets"),
+      list(id = "23", parent_id = "20", name = "Tax avoidance"),       
+      list(id = "30",                   name = "3. Building human capital"),
+      list(id = "31", parent_id = "30", name = "Why governments should get involved"),
+      list(id = "32", parent_id = "30", name = "Why measurement helps"),
+      list(id = "33", parent_id = "30", name = "The human capital project"), 
+      list(id = "40",                   name = "4. Lifelong learning"),
+      list(id = "41", parent_id = "40", name = "Learning in early childhood"),
+      list(id = "42", parent_id = "40", name = "Tertiary education"),
+      list(id = "43", parent_id = "40", name = "Adult learning outside the workplace"),
+      list(id = "50",                   name = "5. Returns to work"),
+      list(id = "51", parent_id = "50", name = "Informality"),
+      list(id = "52", parent_id = "50", name = "Working women"),
+      list(id = "53", parent_id = "50", name = "Working in agriculture"),  
+      list(id = "60",                   name = "6. Strengthening social protection"),
+      list(id = "61", parent_id = "60", name = "Social assistance"),
+      list(id = "62", parent_id = "60", name = "Social insurance"),
+      list(id = "63", parent_id = "60", name = "Labor regulation"),       
+      list(id = "70",                   name = "7. Ideas for social inclusion"),
+      list(id = "71", parent_id = "70", name = "A global 'New Deal'"),
+      list(id = "72", parent_id = "70", name = "Creating a new social contract"),
+      list(id = "73", parent_id = "70", name = "Financing social inclusion")
+    ),
+
+    abstract = "Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Instead, technology is bringing opportunity, paving the way to create new jobs, increase productivity, and improve public service delivery. The nature of work is changing.
+Firms can grow rapidly thanks to digital transformation, which blurs their boundaries and challenges traditional production patterns.
+The rise of the digital platform firm means that technological effects reach more people faster than ever before.
+Technology is changing the skills that employers seek. Workers need to be good at complex problem-solving, teamwork and adaptability.
+Technology is changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers.
+What can governments do? The 2019 WDR suggests three solutions:
+1 - Invest in human capital especially in disadvantaged groups and early childhood education to develop the new skills that are increasingly in demand in the labor market, such as high-order cognitive and sociobehavioral skills
+2 - Enhance social protection to ensure universal coverage and protection that does not fully depend on having formal wage employment
+3 - Increase revenue mobilization by upgrading taxation systems, where needed, to provide fiscal space to finance human capital development and social protection.",
+    
+    ref_country = list(
+      list(name = "World", code = "WLD")
+    ),
+    
+    spatial_coverage = "Global",
+    
+    publication_frequency = "Annual",
+    
+    languages = list(
+      list(name = "English",   code = "EN"),
+      list(name = "Chinese",   code = "ZH"),
+      list(name = "Arabic",    code = "AR"),
+      list(name = "French",    code = "FR"),
+      list(name = "Spanish",   code = "ES"),
+      list(name = "Italian",   code = "IT"),
+      list(name = "Bulgarian", code = "BG"),
+      list(name = "Russian",   code = "RU"),
+      list(name = "Serbian",   code = "SR")
+    ),
+    
+    license = list(
+      list(name = "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)", 
+           uri = "http://creativecommons.org/licenses/by/3.0/igo")
+    ),
+    
+    bibliographic_citation = list(
+      list(citation = " World Bank. 2019. World Development Report 2019: The Changing Nature of Work. Washington, DC: World Bank. doi:10.1596/978-1-4648-1328-3. License: Creative Commons Attribution CC BY 3.0 IGO")
+    ),
+    
+    series = "World Development Report",
+    
+    contributors = list(
+      list(first_name  = "Simeon", last_name = "Djankov",
+           affiliation = "World Bank", role = "WDR Director"),
+      list(first_name  = "Federica", last_name = "Saliola",
+           affiliation = "World Bank", role = "WDR Director"),
+      list(first_name  = "David", last_name = "Sharrock",
+           affiliation = "World Bank", role = "Communications"),
+      list(first_name  = "Consuelo Jurado", last_name = "Tan",
+           affiliation = "World Bank", role = "Program Assistant")
+    ),
+    
+    publisher = "World Bank Publications",
+    publisher_address = "The World Bank Group, 1818 H Street NW, Washington, DC 20433, USA",
+    
+    contacts = list(
+      list(name = "World Bank Publications", email = "pubrights@worldbank.org")
+    ),
+    
+    topics = list(
+      list(name = "Labour And Employment - Employee Training", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),  
+      list(name = "Labour And Employment - Labour And Employment Policy", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(name = "Labour And Employment - Working Conditions", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(name = "Social Stratification And Groupings - Social And Occupational Mobility", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
+    ),
+    
+    disciplines = list(
+      list(name = "Economics")
+    )
+    
+  )
+  
+)
+
+# Publish the metadata in NADA
+
+document_add(idno = id, 
+             metadata = example_2, 
+             repositoryid = "central", 
+             published = 1, 
+             thumbnail = thumb_file, 
+             overwrite = "yes")
+
+# Provide links to the document and related resources
+
+external_resources_add(
+  title = "The Changing Nature of Work",
+  description = "Links to the PDF report in all available languages",
+  idno = id,
+  dctype = "doc/anl",
+  language = "English, Chinese, Arabic, French, Spanish, Italian, Bulgarian, Russian, Serbian",
+  file_path = "https://www.worldbank.org/en/publication/wdr2019",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "WORLD DEVELOPMENT REPORT 2019 - THE CHANGING NATURE OF WORK - Presentation (slide deck), English",
+  idno = id,
+  dctype = "doc/oth",
+  language = "English",
+  file_path = "http://pubdocs.worldbank.org/en/808261547222082195/WDR19-English-Presentation.pdf",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "INFORME SOBRE EL DESARROLLO MUNDIAL 2019 - LA NATURALEZA CAMBIANTE DEL TRABAJO - Presentation (slide deck), Spanish",
+  idno = id,
+  dctype = "doc/oth",
+  language = "Spanish",
+  file_path = "http://pubdocs.worldbank.org/en/942911547222108647/WDR19-Spanish-Presentation.pdf",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "RAPPORT SUR LE DÉVELOPPEMENT DANS LE MONDE 2019 - LE TRAVAIL EN MUTATION - Presentation (slide deck), French",
+  idno = id,
+  dctype = "doc/oth",
+  language = "French",
+  file_path = "http://pubdocs.worldbank.org/en/132831547222088914/WDR19-French-Presentation.pdf",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "RAPPORTO SULLO SVILUPPO MONDIALE 2019 - CAMBIAMENTI NEL MONDO DEL LAVORO - Presentation (slide deck), Italian",
+  idno = id,
+  dctype = "doc/oth",
+  language = "Italian",
+  file_path = "http://pubdocs.worldbank.org/en/842271547222095493/WDR19-Italian-Presentation.pdf",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "ДОКЛАД О МИРОВОМ РАЗВИТИИ 2019 - ИЗМЕНЕНИЕ ХАРАКТЕРА ТРУДА - Presentation (slide deck), Russian",
+  idno = id,
+  dctype = "doc/oth",
+  language = "Russian",
+  file_path = "http://pubdocs.worldbank.org/en/679061547222101914/WDR19-Russian-Presentation.pdf",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "Jobs of the future require more investment in people - Press Release (October 11, 2018)",
+  idno = id,
+  dctype = "doc/oth",
+  dcdate = "2018-10-11",
+  language = "Russian",
+  file_path = "https://www.worldbank.org/en/news/press-release/2018/10/11/jobs-of-the-future-require-more-investment-in-people",
+  overwrite = "yes"
+)
+

The document is now available in the NADA catalog.
+ +

+
+
+

#### 4.3.2.3 Using Python

+

The Python equivalent of the R script presented above is as follows.

+
# @@@ Script not tested yet - must be edited to match the R script
+
+import pynada as nada
+import inspect
+
+dataset_id = "DOC_001"
+
+repository_id = "central"
+
+published = 0
+
+overwrite = "yes"
+
+metadata_information = {
+    'title': "The Changing Nature of Work",
+    'idno': "META_DOC_001",
+    'producers': [
+        {
+            'name': "Development Data Group, Curation Team", 
+            'abbr': "DECDG",
+            'affiliation': "World Bank"
+        }
+    ],
+    'production_date': "2020-12-27"
+}
+
+document_description = {
+  'title_statement': {
+    'idno': dataset_id,
+    'title': "The Changing Nature of Work",
+    'sub_title': "World Development Report 2019",
+    'abbreviated_title': "WDR2019"
+  },
+  
+  'type': "book",
+  
+  'description': inspect.cleandoc("""\
+        
+The World Development Report (WDR) 2019: The Changing Nature of Work studies how the nature of work is changing as a result of advances in technology today. Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Work is constantly reshaped by technological progress. Firms adopt new ways of production, markets expand, and societies evolve. Overall, technology brings opportunity, paving the way to create new jobs, increase productivity, and deliver effective public services. Firms can grow rapidly thanks to digital transformation, expanding their boundaries and reshaping traditional production patterns. The rise of the digital platform firm means that technological effects reach more people faster than ever before. Technology is changing the skills that employers seek. Workers need to be better at complex problem-solving, teamwork and adaptability. Digital technology is also changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers. The Report analyzes these changes and considers how governments can best respond. Investing in human capital must be a priority for governments in order for workers to build the skills in demand in the labor market. In addition, governments need to enhance social protection and extend it to all people in society, irrespective of the terms on which they work. To fund these investments in human capital and social protection, the Report offers some suggestions as to how governments can mobilize additional revenues by increasing the tax base.
+    
+    """),
+  
+  'toc_structured': [
+    {'id': "00",                    'name': "Overview"},
+    {'id': "01", 'parent_id': "00", 'name': "Changes in the nature of work"},
+    {'id': "02", 'parent_id': "00", 'name': "What can governments do?"},
+    {'id': "03", 'parent_id': "00", 'name': "Organization of this study"},
+    {'id': "10",                    'name': "1. The changing nature of work"},
+    {'id': "11", 'parent_id': "10", 'name': "Technology generates jobs"},
+    {'id': "12", 'parent_id': "10", 'name': "How work is changing"},
+    {'id': "13", 'parent_id': "10", 'name': "A simple model of changing work"},
+    {'id': "20",                    'name': "2. The changing nature of firms"},
+    {'id': "21", 'parent_id': "20", 'name': "Superstar firms"},
+    {'id': "22", 'parent_id': "20", 'name': "Competitive markets"},
+    {'id': "23", 'parent_id': "20", 'name': "Tax avoidance"},       
+    {'id': "30",                    'name': "3. Building human capital"},
+    {'id': "31", 'parent_id': "30", 'name': "Why governments should get involved"},
+    {'id': "32", 'parent_id': "30", 'name': "Why measurement helps"},
+    {'id': "33", 'parent_id': "30", 'name': "The human capital project"}, 
+    {'id': "40",                    'name': "4. Lifelong learning"},
+    {'id': "41", 'parent_id': "40", 'name': "Learning in early childhood"},
+    {'id': "42", 'parent_id': "40", 'name': "Tertiary education"},
+    {'id': "43", 'parent_id': "40", 'name': "Adult learning outside the workplace"},
+    {'id': "50",                    'name': "5. Returns to work"},
+    {'id': "51", 'parent_id': "50", 'name': "Informality"},
+    {'id': "52", 'parent_id': "50", 'name': "Working women"},
+    {'id': "53", 'parent_id': "50", 'name': "Working in agriculture"},  
+    {'id': "60",                    'name': "6. Strengthening social protection"},
+    {'id': "61", 'parent_id': "60", 'name': "Social assistance"},
+    {'id': "62", 'parent_id': "60", 'name': "Social insurance"},
+    {'id': "63", 'parent_id': "60", 'name': "Labor regulation"},       
+    {'id': "70",                    'name': "7. Ideas for social inclusion"},
+    {'id': "71", 'parent_id': "70", 'name': "A global 'New Deal'"},
+    {'id': "72", 'parent_id': "70", 'name': "Creating a new social contract"},
+    {'id': "73", 'parent_id': "70", 'name': "Financing social inclusion"}
+  ],
+  
+  'abstract': inspect.cleandoc("""\
+        
+Fears that robots will take away jobs from people have dominated the discussion over the future of work, but the World Development Report 2019 finds that on balance this appears to be unfounded. Instead, technology is bringing opportunity, paving the way to create new jobs, increase productivity, and improve public service delivery.
+The nature of work is changing.
+Firms can grow rapidly thanks to digital transformation, which blurs their boundaries and challenges traditional production patterns.
+The rise of the digital platform firm means that technological effects reach more people faster than ever before.
+Technology is changing the skills that employers seek. Workers need to be good at complex problem-solving, teamwork and adaptability.
+Technology is changing how people work and the terms on which they work. Even in advanced economies, short-term work, often found through online platforms, is posing similar challenges to those faced by the world’s informal workers.
+What can governments do?
+The 2019 WDR suggests three solutions:
+1 - Invest in human capital especially in disadvantaged groups and early childhood education to develop the new skills that are increasingly in demand in the labor market, such as high-order cognitive and sociobehavioral skills
+2 - Enhance social protection to ensure universal coverage and protection that does not fully depend on having formal wage employment
+3 - Increase revenue mobilization by upgrading taxation systems, where needed, to provide fiscal space to finance human capital development and social protection.
+    
+    """),
+  
+ 'ref_country': [
+    {'name': "World", 'code': "WLD"}
+  ],
+  
+  'spatial_coverage': "Global",
+  
+  'date_created': "2019",
+
+  'date_published': "2019",
+  
+  'identifiers': [
+      {'type': "ISSN",           'value': "0163-5085"},
+      {'type': "ISBN softcover", 'value': "978-1-4648-1328-3"},
+      {'type': "ISBN hardcover", 'value': "978-1-4648-1342-9"},
+      {'type': "e-ISBN",         'value': "978-1-4648-1356-6"},
+      {'type': "DOI softcover",  'value': "10.1596/978-1-4648-1328-3"},   
+      {'type': "DOI hardcover",  'value': "10.1596/978-1-4648-1342-9"}
+    ],
+  
+  'publication_frequency': "Annual",
+  
+  'languages': [
+      {'name': "English",   'code': "EN"},
+      {'name': "Chinese",   'code': "ZH"},
+      {'name': "Arabic",    'code': "AR"},
+      {'name': "French",    'code': "FR"},
+      {'name': "Spanish",   'code': "ES"},
+      {'name': "Italian",   'code': "IT"},
+      {'name': "Bulgarian", 'code': "BG"},
+      {'name': "Russian",   'code': "RU"},
+      {'name': "Serbian",   'code': "SR"}
+    ],
+  
+  'license': [
+        {
+            'name': "Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO)", 
+            'uri': "http://creativecommons.org/licenses/by/3.0/igo"
+        }
+    ],
+  
+  'authors': [
+      {'first_name': "Rong",      'last_name': "Chen",      'affiliation': "World Bank"},
+      {'first_name': "Davida",    'last_name': "Connon",    'affiliation': "World Bank"},
+      {'first_name': "Ana P.",    'last_name': "Cusolito",  'affiliation': "World Bank"},
+      {'first_name': "Ugo",       'last_name': "Gentilini", 'affiliation': "World Bank"},
+      {'first_name': "Asif",      'last_name': "Islam",     'affiliation': "World Bank"},
+      {'first_name': "Shwetlena", 'last_name': "Sabarwal",  'affiliation': "World Bank"},
+      {'first_name': "Indhira",   'last_name': "Santos",    'affiliation': "World Bank"},
+      {'first_name': "Yucheng",   'last_name': "Zheng",     'affiliation': "World Bank"}
+  ],
+  
+  'contributors': [
+    {'first_name': "Simeon", 'last_name': "Djankov", 'affiliation': "World Bank", 'role': "WDR Director"},
+    {'first_name': "Federica", 'last_name': "Saliola", 'affiliation': "World Bank", 'role': "WDR Director"},
+    {'first_name': "David", 'last_name': "Sharrock", 'affiliation': "World Bank", 'role': "Communications"},
+    {'first_name': "Consuelo Jurado", 'last_name': "Tan", 'affiliation': "World Bank", 'role': "Program Assistant"}
+  ],
+  
+  'topics': [
+    {
+      'name': "LabourAndEmployment.EmployeeTraining", 
+      'vocabulary': "CESSDA Topic Classification", 
+      'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"
+    },  
+    {
+      'name': "LabourAndEmployment.LabourAndEmploymentPolicy", 
+      'vocabulary': "CESSDA Topic Classification", 
+      'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"
+    },
+    {
+      'name': "LabourAndEmployment.WorkingConditions", 
+      'vocabulary': "CESSDA Topic Classification", 
+      'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"
+    },
+    {
+      'name': "SocialStratificationAndGroupings.SocialAndOccupationalMobility", 
+      'vocabulary': "CESSDA Topic Classification", 
+      'uri': "https://vocabularies.cessda.eu/vocabulary/TopicClassification"
+    }
+  ],
+  
+  'disciplines': [
+      {'name': "Economics"}
+  ]
+}
+
+
+
+

### 4.3.3 Example 3: Importing from a list of documents

+

In this example we take a different use case. We assume that a list of publications is available as a CSV file. Each row in this file describes one publication, with the following columns containing the metadata (with no missing information for the required elements):

+
- URL_pdf (required): a link to the publication (a direct link to a PDF file)
- ID (required): a unique identifier for each document, with no missing values
- title (required): the title of the document
- country (optional): the country (or countries) that the document is about, separated by a ";"
- authors (optional): the authors, separated by a ";", with the last name and first name separated by a "," (last name always provided before first name)
- abstract (optional): the abstract of the document
- type (optional): the type of document
- date_published (optional): the date the document was published; optional but highly recommended

The R (or Python) script reads the CSV file. The listed documents are downloaded (if not previously done), and the cover page of each document is captured and saved as a JPG file to be used as a thumbnail in the catalog. Metadata are formatted to comply with the document schema, then published. The documents are not uploaded in the catalog, but links to the originating catalog are provided. There is no limit to the number of documents that could be included in such a batch process. If a repository of documents is available with metadata available in a structured format (in a CSV file as in the example, from an API, or from another source), the migration of the documents to a NADA catalog can be fully automated using a script similar to the one shown in the example. Note that such a script could also include some processes of metadata augmentation (e.g., submitting each document to a topic model to extract and store the topic composition of the document).

+


+ +

+
+

#### 4.3.3.1 Using R

+
library(nadar)
+library(stringr)
+library(rlist)
+library(countrycode) # Will be used to automatically add ISO country codes
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+# Read the CSV file containing the information (metadata) on the 5 documents
+
+setwd("C:/my_folder")
+# Read the file containing information on the 5 documents
+doc_list <- read.csv("my_list_of_documents.csv", stringsAsFactors = FALSE)
+
+# Generate the metadata for each document in the list, and publish in NADA
+
+for(i in 1:nrow(doc_list)) {
+  
+  # Download the file if not already done
+  url <- doc_list$URL_pdf[i]
+  pdf_file  <- basename(doc_list$URL_pdf[i])
+  if(!file.exists(pdf_file)) download.file(url, pdf_file, mode = "wb")
+  
+  # Map the available metadata elements to the schema
+  id        <- doc_list$ID[i]
+  title     <- doc_list$title[i]
+  date      <- as.character(doc_list$date_published[i])
+  abstract  <- doc_list$abstract[i]
+  type      <- doc_list$type[i]
+  
+  # Split the authors' list and generate a list compliant with the schema
+  list_authors <- doc_list$authors[i]
+  list_authors <- str_split(list_authors, ";")
+  authors = list()
+  for(n in 1:length(list_authors[[1]])) {
+    author = trimws(list_authors[[1]][n])
+    if(grepl(",", author)) {  # If we have last name and first name
+      last_first = str_split(author, ",")
+      a_l = list(last_name  = trimws(last_first[[1]][1]), 
+                 first_name = trimws(last_first[[1]][2]))
+    } else {   # E.g., when author is an organization
+      a_l = list(last_name  = author, first_name = "")
+    }  
+    authors = list.append(authors, a_l)
+  }
+  
+  # Split the country list and generate a list compliant with the schema
+  list_countries <- doc_list$country[i]
+  list_countries <- str_split(list_countries, ";")
+  countries = list()
+  for(n in 1:length(list_countries[[1]])) {
+    country = trimws(list_countries[[1]][n])
+    if(country == "World"){
+      c_code = "WLD"
+    } else {
+      c_code = countrycode(country, origin = 'country.name', destination = 'iso3c')
+    }
+    if(is.na(c_code)) c_code = ""
+    c_l = list(name = country, code = c_code)
+    countries = list.append(countries, c_l)
+  }
+  
+  # Capture the cover page as JPG, and generate the full document metadata 
+  
+  thumb <- gsub(".pdf", ".jpg", pdf_file)
+  capture_pdf_cover(pdf_file)   # To be used as thumbnail
+  
+  this_document <- list(
+    document_description = list(
+      title_statement = list(idno = id, title = title),
+      date_published = date,
+      authors = authors,
+      abstract = abstract,
+      ref_country = countries
+    )
+  )
+  
+  # Publish the metadata in NADA
+  
+  document_add(idno = id,
+               published = 1,
+               overwrite = "yes",
+               metadata = this_document,
+               thumbnail = thumb)
+  
+  # Add a link to the document
+  
+  external_resources_add(
+    title = this_document$document_description$title_statement$title,
+    idno = id,
+    dctype = "doc/anl",
+    file_path = url,
+    overwrite = "yes"
+  )
+  
+}
+
+
+

4.3.3.2 Using Python

+
# @@@ Script not tested yet

import os.path
import urllib.request

import pandas as pd
import pynada as nada

# Set API key and catalog URL
nada.set_api_key("my_api_key")
nada.set_api_url("http://my_catalog.ihsn.org/index.php/api/")

# Read the CSV file containing the information (metadata) on the 5 documents
doc_list = pd.read_csv("my_list_of_documents.csv")

# Generate the metadata for each document in the list, and publish in NADA
for index, doc in doc_list.iterrows():

    # Download the file if not already done
    url = doc['URL_pdf']
    pdf_file = os.path.basename(url)
    if not os.path.exists(pdf_file):
        urllib.request.urlretrieve(url, pdf_file)

    # Map the available metadata elements to the schema
    idno     = doc['ID']
    title    = doc['title']
    date     = str(doc['date_published'])
    abstract = doc['abstract']
    doc_type = doc['type']

    # Split the authors' list and generate a list compliant with the schema
    authors = []
    for author in str(doc['authors']).split(";"):
        author = author.strip()
        if "," in author:  # If we have last name and first name
            last, first = [a.strip() for a in author.split(",", 1)]
            authors.append({'last_name': last, 'first_name': first})
        else:              # E.g., when the author is an organization
            authors.append({'last_name': author, 'first_name': ""})

    # Split the country list and generate a list compliant with the schema
    # (ISO country codes could be added here, as done with the countrycode package in R)
    countries = []
    for country in str(doc['country']).split(";"):
        countries.append({'name': country.strip()})

    # Document the file, and publish in NADA
    document_description = {
        'title_statement': {
            'idno': idno,
            'title': title
        },
        'type': doc_type,
        'date_published': date,
        'authors': authors,
        'abstract': abstract,
        'ref_country': countries
    }

    # The documents are not uploaded; a link to the original PDF is provided instead
    files = [
        {'file_uri': url, 'format': "Adobe Acrobat PDF"}
    ]

    nada.create_document_dataset(
        dataset_id = idno,
        repository_id = "central",
        published = 1,
        overwrite = "yes",
        document_description = document_description,
        files = files
    )

    # Generate a thumbnail from the first page of the PDF file
    thumbnail_path = nada.pdf_to_thumbnail(pdf_file, page_no=1)
    nada.upload_thumbnail(idno, thumbnail_path)

Chapter 5 Microdata

+
+


+
+

5.1 Definition of microdata

+

When surveys or censuses are conducted, or when administrative data are recorded, information is collected on each unit of observation. The unit of observation can be a person, a household, a firm, an agricultural holding, a facility, or another entity. Microdata are the data files resulting from these data collection activities, which contain the unit-level information (as opposed to aggregated data in the form of counts, means, or other statistics). Information on each unit is stored in variables, which can be of different types (e.g., numeric or alphanumeric, discrete or continuous). These variables may contain data reported by the respondent (e.g., the marital status of a person), obtained by observation or measurement (e.g., the GPS location of a dwelling), or generated by calculation, recoding or derivation (e.g., the sample weight in a survey).

+

For efficiency reasons, variables are often stored in numeric format (i.e., as coded values), even when they contain qualitative information. For example, the sex of a respondent may be stored in a variable named ‘Q_01’ containing the values 1, 2 and 9, where 1 represents “male”, 2 represents “female”, and 9 represents “unreported”. Microdata must therefore be provided, at a minimum, with a data dictionary containing the variable and value labels and, for derived variables, information on the derivation process. But many other features of a micro-dataset should also be described, such as the objectives and the methodology of data collection (including a description of the sampling design for sample surveys), the period of data collection, the identification of the primary investigator and other contributors, the scope and geographic coverage of the data, and much more. This information will make the data usable and discoverable.
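As a minimal sketch (with hypothetical values, using the haven R package as one of several possible tools), the coded variable described above could be stored with its value labels as follows:

library(haven)

# Hypothetical coded variable Q_01: 1 = male, 2 = female, 9 = unreported
q_01 <- labelled(
  c(1, 2, 2, 9, 1),
  labels = c("male" = 1, "female" = 2, "unreported" = 9),
  label  = "Sex of respondent"
)

print_labels(q_01)   # displays the value labels (the core of a data dictionary entry)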

+
+
+

5.2 The Data Documentation Initiative (DDI) metadata standard

+

The DDI metadata standard provides a structured and comprehensive list of hundreds of elements and attributes which may be used to document microdata. It is unlikely that any one study would ever require using them all, but this list provides a convenient solution to foster completeness of the information, and to generate documentation that will meet the needs of users.

+

The Data Documentation Initiative (DDI) metadata standard originated in the Inter-university Consortium for Political and Social Research (ICPSR), a membership-based organization with more than 500 member colleges and universities worldwide. The DDI is now the project of an alliance of North American and European institutions. Member institutions comprise many of the largest data producers and data archives in the world. The DDI standard is used by a large community of data archivists, including data librarians from academia, data managers in national statistical agencies and other official data producing agencies, and international organizations. The standard has two branches: the DDI-Codebook (version 2.x) and the DDI-Lifecycle (version 3.x). These two branches serve different purposes and audiences. For the purpose of data archiving and cataloguing, the schema we recommend in this Guide is the DDI-Codebook. We use a slightly simplified version of version 2.5 of the standard, to which we add a few elements (including the tags element common to all schemas described in the Guide). A mapping between the elements included in our schema and the DDI-Codebook metadata tags is provided in annex 2.

+

The DDI standard is published under the terms of the [GNU General Public License](http://www.gnu.org/licenses) (version 3 or later).

+
+

5.2.1 DDI-Codebook

+

The DDI Alliance developed the DDI-Codebook for organizing the content, presentation, transfer, and preservation of metadata in the social and behavioral sciences. It enables documenting microdata files in a simultaneously flexible and rigorous way. The DDI-Codebook aims to provide a straightforward means of recording and communicating all the salient characteristics of a micro-dataset.

+

The DDI-Codebook is designed to encompass the kinds of data resulting from surveys, censuses, administrative records, experiments, direct observation and other systematic methodology for generating empirical measurements. The unit of observation can be individual persons, households, families, business establishments, transactions, countries or other subjects of scientific interest.

+

The DDI Alliance publishes the DDI-Codebook as an XML schema. We present in this Guide a JSON implementation of the schema, which is used in our R package NADAR and Python library PyNADA. The NADA cataloguing application works with both the XML and the JSON versions. A DDI-compliant metadata file can be converted from JSON to XML or from XML to JSON.

+
+
+

5.2.2 DDI-Lifecycle

+

As indicated by the DDI Alliance website, DDI-Lifecycle is “designed to document and manage data across the entire life cycle, from conceptualization to data publication, analysis and beyond. It encompasses all of the DDI-Codebook specification and extends it. Based on XML Schemas, DDI-Lifecycle is modular and extensible.” DDI-Lifecycle can be used to “populate variable and question banks to explore available data and question structures for reuse in new surveys”. As this is not our objective, and because using DDI-Lifecycle adds significant complexity, we do not make use of it; this chapter only covers the DDI-Codebook.

+
+
+
+

5.3 Some practical considerations

+

The DDI is a comprehensive schema that provides metadata elements to document a study (e.g., a survey or an administrative dataset), the related data files, and the variables they contain. A separate schema is used to document the related resources (questionnaires, reports, and others); see Chapter 13.

+

Some datasets may contain hundreds or even thousands of variables. For each variable, the DDI can include not only the variable name, label and description, but also summary statistics like the count of valid and missing observations, weighted and unweighted frequencies, means, and others. Generating a DDI file manually, in particular the variable-level metadata, can be a tedious and time-consuming task. But variable names, summary statistics, and (when available) variable and value labels can be extracted directly from the data files. User-friendly solutions (specialized metadata editors) are available to automate a large part of this work. DDI metadata can also be generated programmatically using R or Python. Section 5.5 provides examples of the use of specialized DDI metadata editors and programming languages to generate DDI-compliant metadata.
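As a minimal sketch (assuming a hypothetical Stata data file named survey_data.dta), variable names, variable labels, and basic summary statistics can be extracted programmatically in R, for example with the haven package:

library(haven)

# Read a (hypothetical) Stata data file; haven preserves variable and value labels
data <- read_dta("survey_data.dta")

# Build a simple variable-level inventory: name, label, type, valid and missing counts
var_inventory <- data.frame(
  name    = names(data),
  label   = vapply(data, function(x) {
              lbl <- attr(x, "label"); if (is.null(lbl)) "" else lbl
            }, character(1)),
  type    = vapply(data, function(x) class(x)[1], character(1)),
  valid   = vapply(data, function(x) sum(!is.na(x)), integer(1)),
  missing = vapply(data, function(x) sum(is.na(x)), integer(1))
)

head(var_inventory)

Such an inventory can then be mapped to the variable-level elements of the schema described below.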

+

Documenting microdata is more complex than documenting publications or other types of data like tables or indicators. The production of microdata often involves experts in survey design, sampling, data processing, and analysis. Generating the metadata should thus be a collective responsibility and will ideally be done in real time (“document as you survey”). Data documentation should be implemented during the whole lifecycle of data production, not as an ex post task. This is in line with what the Generic Statistical Business Process Model (GSBPM) recommends: “Good metadata management is essential for the efficient operation of statistical business processes. Metadata are present in every phase, either created, updated or carried forward from a previous phase or reused from another business process. In the context of this model, the emphasis of the overarching process of metadata management is on the creation/revision, updating, use and archiving of statistical metadata, though metadata on the different sub-processes themselves are also of interest, including as an input for quality management. The key challenge is to ensure that these metadata are captured as early as possible, and stored and transferred from phase to phase alongside the data they refer to.” Too often, microdata are documented after completion of the data collection, sometimes by a team that was not directly involved in the production of the data. In such cases, some information may not have been captured and will be difficult to find or reconstruct.

+
+

Suggestions and recommendations to data curators

+
  • Generating detailed metadata at the variable level (including elements like the formulation of the questions, variable and value labels, interviewer instructions, universe, derivation procedures, etc.) may seem to be a tedious exercise, but it adds considerable value to the metadata. Indeed, it will (i) provide a detailed data dictionary, required to make the data usable, (ii) provide the necessary information for making the data more discoverable and for enabling variable comparison tools, and (iii) guarantee the preservation of institutional memory. The cost of generating such metadata will be very small relative to the cost of generating the data.
  • To make the data more discoverable, attention should be paid to providing a detailed description of the scope and objectives of the data collection. When a survey (or other microdataset) is used to generate statistical indicators, a list of these indicators should be provided in the metadata.
  • The keywords metadata element provides a flexible solution to improve the discoverability of data. For example, a survey that collects data on children's age, weight and height will be relevant for measuring malnutrition and generating indicators like the prevalence of stunting, wasting, overweight and underweight. The variable descriptions alone would not make the data discoverable in keyword-based search engines, hence the importance of adding relevant terms and phrases in the keyword section.
  • The DDI metadata will be saved as an XML or JSON file, i.e. as plain text. This means that the DDI metadata cannot include complex formulas. The description of some variables, as well as the description of a survey sample design, may require the use of formulas. In such cases, the recommendation is to provide as much of the information as possible in the DDI, and to provide links to documents where the formulas can be found (these documents would be published with the metadata as external resources).
  • Typically, the variables in the DDI are organized by data file. The DDI provides an option (the variable groups) to organize variables differently, for example thematically. These variable groupings are virtual, in the sense that they do not impact the way variables are stored. Not all variables have to be mapped to such groups, and the same variable can belong to more than one group. This option provides the possibility to organize the variables based on a thematic or topical classification. Machine learning (AI) tools make it possible to automate the process of mapping variables to a pre-defined list of groups (each one of them described by a label and a short description); a simplified sketch of this idea follows this list. By doing this, and by generating embeddings at the group level, it becomes possible to add semantic search and to implement a recommender system that applies to microdata.
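The following is a deliberately simplified sketch of this idea, using hypothetical variables and groups. It relies on simple word overlap between variable labels and group descriptions; a real implementation would instead use text embeddings produced by a language model:

# Hypothetical variable labels and group descriptions
variables <- c(
  q01_weight = "Weight of the child in kilograms",
  q02_height = "Height of the child in centimeters",
  q10_wage   = "Monthly wage from main job"
)
groups <- c(
  anthropometry = "Anthropometric measurements: weight, height, nutrition status of children",
  labor         = "Employment, wage, income and labor market participation"
)

# Bag-of-words cosine similarity (a stand-in for embedding-based similarity)
tokenize   <- function(x) unique(tolower(unlist(strsplit(x, "[^[:alpha:]]+"))))
similarity <- function(a, b) {
  ta <- tokenize(a); tb <- tokenize(b)
  length(intersect(ta, tb)) / sqrt(length(ta) * length(tb))
}

# Assign each variable to the most similar group
sapply(variables, function(v) names(groups)[which.max(sapply(groups, similarity, a = v))])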
+
+
+

5.4 Schema description: DDI-Codebook 2.5

+

The DDI-Codebook is a comprehensive, structured list of elements to be used to document microdata of any source. The standard contains five main sections:

+
  • Document description (doc_desc), with elements used to describe the metadata (not the data); the term “document” refers here to the XML (or JSON) file that contains the metadata.
  • Study description (study_desc), which contains the elements used to describe the study itself (the survey, the administrative process, or the other activity that resulted in the production of the microdata). This section will contain information on the primary investigator, scope and coverage of the data, sampling, etc.
  • File description (data_files), which provides elements to document each of the data files that compose the dataset (this is thus a repeatable block of elements).
  • Variable description (variables), with elements used to describe each variable contained in the data files, including the variable names, the variable and value labels, summary statistics for each variable, interviewers’ instructions, descriptions of recoding or derivation procedures, and more.
  • Variable groups (variable_groups), an optional section that allows organizing variables by thematic or other groups, independently from the data file they belong to. Variable groups are “virtual”; the grouping of variables does not affect the data files.

The other sections in the schema are not part of the DDI Codebook itself. Some are used for catalog administration purposes when the NADA cataloguing application is used (repositoryid, access_policy, published, overwrite, and provenance).

+
  • repositoryid identifies the data catalog/collection in which the metadata will be published.
  • access_policy indicates the access policy to be applied to the microdata (open access, public use files, licensed access, no access, etc.).
  • published indicates whether the metadata will be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.
  • overwrite indicates whether metadata that may have been previously uploaded for the same dataset can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. Note that a dataset will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element study_desc > title_statement > idno is the same.
  • provenance is used to store information on the source and time of harvesting, for metadata that were extracted automatically from external data catalogs.

Other sections are provided to allow additional metadata to be collected and stored, including metadata generated by machine learning models (tags, lda_topics, embeddings, and additional). The tags section is common to all schemas (with the exception of the external resources schema) and provides a flexible solution to generate customized facets in data catalogs. The additional section allows data curators to supplement the DDI standard with their own metadata elements, without breaking compliance with the DDI. An illustrative example of the tags and additional sections is provided after the schema overview below.

+
{
+  "repositoryid": "string",
+  "access_policy": "data_na",
+  "published": 0,
+  "overwrite": "no",
+  "doc_desc": {},
+  "study_desc": {},
+  "data_files": [],
+  "variables": [],
+  "variable_groups": [],
+  "provenance": [],
+  "tags": [],
+  "lda_topics": [],
+  "embeddings": [],
+  "additional": { }
+}
+
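For illustration, the tags and additional sections could be filled as follows in R (the tag values, tag group, and custom element names shown here are hypothetical):

my_ddi <- list(
  
  # ... (doc_desc, study_desc, data_files, variables, etc.)
  
  tags = list(
    list(tag = "poverty assessment", tag_group = "analytical use"),
    list(tag = "SDG monitoring",     tag_group = "analytical use")
  ),
  
  additional = list(
    # Custom, catalog-specific elements stored outside the DDI proper
    internal_project_code = "PRJ-2021-017",
    curation_team = "Data curation unit"
  )
  
)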


+

The DDI-Codebook also provides a solution to describe OLAP cubes, which we do not make use of as our purpose is to use the standard to document and catalog datasets, not to manage data.

+
+

Each metadata element in the DDI standard has a name. In our JSON version of the standard, we do not make use of the exact same names. We adapted some of them for clarity. For example, we renamed the DDI element titlStmt as title_statement. The mapping between the DDI Codebook 2.5 standard and the elements in our schema is provided in appendix. JSON files created using our adapted version of the DDI can be exported as a DDI 2.5 compliant and validated XML file using R or Python scripts provided in the NADAR package and PyNADA library.

+
+
+

5.4.1 Document description

+

doc_desc [Optional ; Not repeatable]
+Documenting a study using the DDI-Codebook standard consists of generating a metadata file in XML or JSON format. This file is what is referred to as the metadata document. The doc_desc or document description is thus a description of the metadata file, and consists of bibliographic information describing the DDI-compliant document as a whole. As the same dataset may be documented by more than one organization, and because metadata can be automatically harvested by on-line catalogs, traceability of the metadata is important. This section, which only contains five main elements, should be as complete as possible, and at least contain information on the producers and the production date (prod_date).

+
"doc_desc": {
+  "title": "string",
+  "idno": "string",
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "prod_date": "string",
+  "version_statement": {
+    "version": "string",
+    "version_date": "string",
+    "version_resp": "string",
+    "version_notes": "string"
+  }
+}
+


+
    +
  • title [Optional ; Not repeatable ; String]
    +The title of the metadata document (which may be the title of the study itself). The metadata document is the DDI metadata file (XML or JSON file) that is being generated. The “Document title” should mention the geographic scope of the data collection as well as the time period covered. For example: “DDI 2.5: Albania Living Standards Study 2012”.

  • +
  • idno [Optional ; Not repeatable ; String]
    +A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the unique identifier idno found in section study_description / title_statement, although it is good practice to generate identifiers that establish a clear connection between the two identifiers. The Document ID could also include the metadata document version identifier. For example, if the “Primary identifier” of the study is “ALB_LSMS_2012”, the “Document ID” in the Metadata information could be “IHSN_DDI_v01_ALB_LSMS_2012” if the DDI metadata are produced by the IHSN. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format. The identifier should not contain blank spaces.

  • +
  • producers [Optional ; Repeatable]
    +The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a “Recommended” element. For catalog administration purposes, information on the producer and on the date of metadata production is useful.

    +
      +
    • name [Optional ; Not repeatable ; String]
      +The name of the person or organization in charge of the production of the DDI metadata. If the name of individuals cannot be provided due to an organization’s data protection rules, the title of the person, or an anonymized identifier, can be provided (or this field can be left blank if no other option is available).
    • +
    • abbr [Optional ; Not repeatable ; String]
      +The initials of the person, or the abbreviation of the organization’s name mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +The affiliation of the person or organization mentioned in name.
    • +
    • role [Optional ; Not repeatable ; String]
      +The specific role of the person or organization mentioned in name in the production of the DDI metadata.

    • +
  • +
  • prod_date [Optional ; Not repeatable ; String]
    +The date the DDI metadata document was produced (not the date it was distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a “Recommended” element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.

  • +
  • version_statement [Optional ; Not repeatable]
    +A version statement for the metadata (DDI) document. Documenting a dataset is not a trivial exercise. It may happen that, having identified errors or gaps in a DDI document, or after receiving suggestions for improvement or additional input, the DDI metadata are modified. The version_statement describes the version of the metadata document. It is good practice to provide a version number and date, and information on what distinguishes the current version from the previous one(s).

    +
      +
    • version [Optional ; Not repeatable ; String]
      +The label of the version, also known as release or edition. For example, Version 1.2
    • +
    • version_date [Optional ; Not repeatable ; String]
      +The date when this version of the metadata document (DDI file) was produced, preferably identifying an exact date. This will usually correspond to the prod_date element. It is recommended to enter the date in the ISO 8601 date format (YYYY-MM-DD or YYYY-MM or YYYY).
    • +
    • version_resp [Optional ; Not repeatable ; String]
      +The organization or person responsible for this version of the metadata document.
    • +
    • version_notes [Optional ; Not repeatable ; String]
      +This element can be used to clarify information/annotation regarding this version of the metadata document, for example to indicate what is new or specific in this version comparing with a previous version.
    • +
  • +
+
my_ddi <- list(
+  
+  doc_desc = list(
+    title = "Albania Living Standards Study 2012",
+    idno = "DDI_WB_ALB_2012_LSMS_v02",
+    producers = list(
+      list(name = "Development Data Group", 
+           abbr = "DECDG", 
+           affiliation = "World Bank", 
+           role = "Production of the DDI-compliant metadata"
+      )     
+    ),
+    prod_date = "2021-02-16",
+    version_statement = list(
+      version = "Version 2.0",
+      version_date = "2021-02-16",
+      version_resp = "OD",
+      version_notes = "Version identical to Version 1.0 except for the Data Appraisal section which was added."
+    )
+  ),
+  
+  # ... (other sections of the DDI)
+  
+)  
+


+
+
+

5.4.2 Study description

+

study_desc [Required ; Not repeatable]
+The study_desc or study description consists of information about the data collection or study that the DDI-compliant documentation file describes. This section includes study-level information such as scope and coverage, objectives, producers, sampling, data collection dates and methods, etc.

+
"study_desc": {
+  "title_statement": {},
+  "authoring_entity": [],
+  "oth_id": [],
+  "production_statement": {},
+  "distribution_statement": {},
+  "series_statement": {},
+  "version_statement": {},
+  "bib_citation": "string",
+  "bib_citation_format": "string",
+  "holdings": [],
+  "study_notes": "string",
+  "study_authorization": {},
+  "study_info": {},
+  "study_development": {},
+  "method": {},
+  "data_access": {}
+}
+


+
+

5.4.2.1 Title statement

+

title_statement [Required ; Not repeatable]
+The title statement for the study.

+
"title_statement": {
+  "idno": "string",
+  "identifiers": [
+    {
+      "type": "string",
+      "identifier": "string"
+    }
+  ],
+  "title": "string",
+  "sub_title": "string",
+  "alternate_title": "string",
+  "translated_title": "string"
+}
+


+
    +
  • idno [Required ; Not repeatable ; String]
    +idno is the primary identifier of the dataset. It is a unique identification number used to identify the study (survey, census or other). A unique identifier is required for cataloguing purposes, so this element is declared as “Required”. The identifier will allow users to cite the dataset properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the study. Alternatively, the idno can be constructed by an organization using a consistent scheme. The scheme could for example be “catalog-country-study-year-version”, where catalog is the abbreviation of the catalog or data archive, country is the 3-letter ISO country code, study is the study acronym, year is the reference year (or the year the study started), and version is a version number. Using that scheme, the Uganda 2005 Demographic and Health Survey for example would have the following idno (where “MDA” stands for “My Data Archive”): MDA_UGA_DHS_2005_v01. Note that the schema allows you to provide more than one identifier for a same study (in element identifiers); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The identifier should not contain blank spaces.

  • +
  • identifiers [Optional ; Repeatable]
    +This repeatable element is used to enter identifiers (IDs) other than the idno entered in the Title statement. It can for example be a Digital Object Identifier (DOI). The idno can be repeated here (the idno element does not provide a type parameter; if a DOI or other standard reference ID is used as idno, it is recommended to repeat it here with the identification of its type).

    +
      +
    • type [Optional ; Not repeatable ; String]
      +The type of unique ID, e.g. “DOI”.
    • +
    • identifier [Required ; Not repeatable ; String]
      +The identifier itself.

    • +
  • +
  • title [Required ; Not repeatable ; String]
    +This element is “Required”. Provide here the full authoritative title for the study. Make sure to use a unique name for each distinct study. The title should indicate the time period covered. For example, in a country conducting monthly labor force surveys, the title of a study would be like “Labor Force Survey, December 2020”. When a survey spans two years (for example, a household income and expenditure survey conducted over a period of 12 months from June 2020 to June 2021), the range of years can be provided in the title, for example “Household Income and Expenditure Survey 2020-2021”. The title of a survey should be its official name as stated on the survey questionnaire or in other study documents (report, etc.). Including the country name in the title is optional (another metadata element is used to identify the reference countries). Pay attention to the consistent use of capitalization in the title.

  • +
  • sub_title [Optional ; Not repeatable ; String]
    +The sub-title is a secondary title used to amplify or state certain limitations on the main title, for example to add information usually associated with a sequential qualifier for a survey. For example, we may have “[country] Universal Primary Education Project, Impact Evaluation Survey 2007” as title, and “Baseline dataset” as sub-title. Note that this information could also be entered as a Title with no Subtitle: “[country] Universal Primary Education Project, Impact Evaluation Survey 2007 - Baseline dataset”.

  • +
  • alternate_title [Optional ; Not repeatable ; String]
    +The alternate_title will typically be used to capture the abbreviation of the survey title. Many surveys are known and referred to by their acronym. The survey reference year(s) may be included. For example, the “Demographic and Health Survey 2012” would be abbreviated as “DHS 2012”, or the “Living Standards Measurement Study 2020-2021” as “LSMS 2020-2021”.

  • +
  • translated_title [Optional ; Not repeatable ; String]

    +In countries with more than one official language, a translation of the title may be provided here. Likewise, the translated title may simply be a translation into English from a country’s own language. Special characters should be properly displayed, such as accents and other stress marks or different alphabets.

  • +
+
my_ddi <- list(
+  
+  # ... ,
+  
+  study_desc = list(
+    title_statement = list(
+      idno = "ML_ALB_2012_LSMS_v02",
+      identifiers = list(
+        list(type = "DOI", identifier = "XXX-XXXX-XXX")
+      ),
+      title = "Living Standards Study 2012",
+      alternate_title = "LSMS 2012",
+      translated_title = "Anketa e Matjes së Nivelit të Jetesës (AMNJ) 2012"
+    )
+  ),
+  
+  # ...
+)  
+


+
+
+

5.4.2.2 Authoring entity

+

authoring_entity [Optional ; Repeatable]
+The name and affiliation of the person, corporate body, or agency responsible for the study’s substantive and intellectual content (the “authoring entity” or “primary investigator”). Generally, in a survey, the authoring entity will be the institution implementing the survey. Repeat the element for each authoring entity, and enter the affiliation when relevant. If various institutions have been equally involved as main investigators, they should all be listed. This only includes the agencies responsible for the implementation of the study, not the sponsoring agencies or entities providing technical assistance (for which other metadata elements are available). The order in which authoring entities are listed is discretionary. It can be alphabetic or by significance of contribution. Individual persons can also be mentioned, if not prohibited by privacy protection rules.

+
"authoring_entity": [
+  {
+    "name": "string",
+    "affiliation": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the person, corporate body, or agency responsible for the work’s substantive and intellectual content. The primary investigator will in most cases be an institution, but could also be an individual in the case of small-scale academic surveys. If persons are mentioned, use the appropriate format of Surname, First name.
  • +
  • affiliation [Optional ; Not repeatable ; String]
    +The affiliation of the person, corporate body, or agency mentioned in name.
  • +
+
my_ddi <- list(
+  
+  # ... ,
+  
+  study_desc = list(
+    
+    # ... ,
+    
+    authoring_entity = list(
+      
+      list(name = "National Statistics Office of Popstan (NSOP)", 
+           affiliation = "Ministry of Planning"),
+      
+      list(name = "Department of Public Health of Popstan (DPH)", 
+           affiliation = "Ministry of Health")
+      
+    ),
+    
+    # ...
+  )
+  
+)  
+


+
+
+

5.4.2.3 Other entity

+

oth_id [Optional ; Repeatable]
+This element is used to acknowledge any other people and organizations that have in some form contributed to the study. This does not include other producers, which should be listed in producers, or financial sponsors, which should be listed in the element funding_agencies.

+
"oth_id": [
+  {
+    "name": "string",
+    "role": "string",
+    "affiliation": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the person or organization.
  • +
  • role [Optional ; Not repeatable ; String]
    +A brief description of the specific role of the person or organization mentioned in name.
  • +
  • affiliation [Optional ; Not repeatable ; String]
    +The affiliation of the person or organization mentioned in name.
  • +
+
my_ddi <- list(
+  
+  # ... ,
+  
+  study_desc = list(
+    # ... ,
+    
+    oth_id = list(
+      list(name = "John Doe", 
+           role = "Technical advisor in sample design", 
+           affiliation = "World Bank Group"
+      )
+    ),
+    # ...
+  
+  )
+  
+)  
+


+
+
+

5.4.2.4 Production statement

+

production_statement [Optional ; Not repeatable]
+A production statement for the work at the appropriate level.

+
"production_statement": {
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "copyright": "string",
+  "prod_date": "string",
+  "prod_place": "string",
+  "funding_agencies": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "grant": "string",
+      "role": "string"
+    }
+  ]
+}
+


+
    +
  • producers [Optional ; Repeatable]
    +This field is provided to list other interested parties and persons that have played a significant but not the leading technical role in implementing and producing the data (which will be listed in authoring_entity), and not the financial sponsors (which will be listed in funding_agencies).

    +
      +
    • name [Required ; Not repeatable ; String]
      +The name of the person or organization.
    • +
    • abbr [Optional ; Not repeatable ; String]
      +The official abbreviation of the organization mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +The affiliation of the person or organization mentioned in name.
    • +
    • role [Optional ; Not repeatable ; String]
      +A succinct description of the specific contribution by the person or organization in the production of the data.
    • +
  • +
  • copyright [Optional ; Not repeatable ; String]
    +A copyright statement for the study at the appropriate level.

  • +
  • prod_date [Optional ; Not repeatable ; String]
    +This is the date (preferably entered in ISO 8601 format: YYYY-MM-DD or YYYY-MM or YYYY) of the actual and final production of the version of the dataset being documented. At least the month and year should be provided. A regular expression can be entered in user templates to validate the information captured in this field.

  • +
  • prod_place [Optional ; Not repeatable ; String]
    +The address of the organization that produced the study.

  • +
  • funding_agencies [Optional ; repeatable]
    +The source(s) of funds for the production of the study. If different funding agencies sponsored different stages of the production process, use the role attribute to distinguish them.

    +
      +
    • name [Required ; Not repeatable ; String]
      +The name of the funding agency.
    • +
    • abbr [Optional ; Not repeatable ; String]
      +The abbreviation (acronym) of the funding agency mentioned in name.
    • +
    • grant [Optional ; Not repeatable ; String]
      +The grant number. If an agency has provided more than one grant, list them all separated with a “;”.
    • +
    • role [Optional ; Not repeatable ; String]
      +The specific contribution of the funding agency mentioned in name. This element is used when multiple funding agencies are listed to distinguish their specific contributions.

    • +
  • +
+

This example shows the production statement for the Bangladesh 2018-2019 Demographic and Health Survey (DHS).

+
my_ddi <- list(
+  
+  # ... ,
+  
+  study_desc = list(
+    
+    # ... ,
+    
+    production_statement = list(
+      
+      producers = list(
+        
+        list(name = "National Institute of Population Research and Training",
+             abbr = "NIPORT",
+             role = "Primary investigator"),
+        
+        list(name = "Medical Education and Family Welfare Division",
+             role = "Advisory"),
+        
+        list(name = "Ministry of Health and Family Welfare",
+             abbr = "MOHFW",
+             role = "Advisory"),
+        
+        list(name = "Mitra and Associates",
+             role = "Data collection - fieldwork"),
+        
+        list(name = "ICF (consulting firm)",
+             role = "Technical assistance / DHS Program")
+      
+      ),
+      
+      prod_date = "2019",   
+      
+      prod_place = "Dhaka, Bangladesh",
+      
+      funding_agencies = list(
+        list(name = "United States Agency for International Development",
+             abbr = "USAID")
+      )
+      
+    ),    
+    # ...,    
+    
+  )
+  # ...
+  
+)
+


+
+
+

5.4.2.5 Distribution statement

+

distribution_statement [Optional ; Not repeatable]
+A distribution statement for the study.

+
"distribution_statement": {
+  "distributors": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "uri": "string"
+    }
+  ],
+  "contact": [
+    {
+      "name": "string",
+      "affiliation": "string",
+      "email": "string",
+      "uri": "string"
+    }
+  ],
+  "depositor": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "uri": "string"
+    }
+  ],
+  "deposit_date": "string",
+  "distribution_date": "string"
+}
+


+
    +
  • distributors [Optional ; Repeatable]
    +The organization(s) designated by the author or producer to generate copies of the study output including any necessary editions or revisions.

    +
      +
    • name [Required ; Not repeatable ; String]
      +The name of the distributor. It can be an individual or an organization.
    • +
    • abbr [Optional ; Not repeatable ; String]
      +The official abbreviation of the organization mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +The affiliation of the person or organization mentioned in name.
    • +
    • uri [Optional ; Not repeatable ; String]

      +A URL to the ordering service or download facility on a Web site.

    • +
  • +
  • contact [Optional ; Repeatable]
    +Names and addresses of individuals responsible for the study. Individuals listed as contact persons will be used as resource persons regarding problems or questions raised by users.

    +
      +
    • name [Required ; Not repeatable ; String]
      +The name of the person or organization that can be contacted.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +The affiliation of the person or organization mentioned in name.
    • +
    • email [Optional ; Not repeatable ; String]
      +An email address for the contact mentioned in name.
    • +
    • uri [Optional ; Not repeatable ; String]

      +A URL to the contact mentioned in name.

    • +
  • +
  • depositor [Optional ; Repeatable]
    +The name of the person (or institution) who provided this study to the archive storing it.

    +
      +
    • name [Required ; Not repeatable ; String]
      +The name of the depositor. It can be an individual or an organization.
    • +
    • abbr [Optional ; Not repeatable ; String]
      +The official abbreviation of the organization mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +The affiliation of the person or organization mentioned in name.
    • +
    • uri [Optional ; Not repeatable ; String]

      +A URL to the depositor

    • +
  • +
  • deposit_date [Optional ; Not repeatable ; String]
    +The date that the study was deposited with the archive that originally received it. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible.

  • +
  • distribution_date [Optional ; Not repeatable ; String]
    +The date that the study was made available for distribution/presentation. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible.

  • +
+

This example shows the distribution statement of a study distributed by the World Bank Microdata Library (other elements are left blank).

+
my_ddi <- list(
+  doc_desc = list(
+    # ... 
+  ),
+  study_desc = list(
+    # ... ,
+    
+    distribution_statement = list(
+      
+       distributors = list(
+         list(name = "World Bank Microdata Library",           
+              abbr = "WBML",
+              affiliation = "World Bank Group",
+              uri = "http://microdata.worldbank.org")
+       ),
+       
+       contact = list(
+         list(name = "",
+              affiliation = "",
+              email = "",
+              uri = "")
+       ),
+       
+       depositor = list(
+         list(name = "",         
+              abbr = "",
+              affiliation = "",
+              uri = "")
+       ),
+       
+       deposit_date = "",
+       
+       distribution_date = ""
+       
+    ),
+    # ...
+  )
+  # ...
+)      
+


+
+
+

5.4.2.6 Series statement

+

series_statement [Optional; Not repeatable]
+A study may be repeated at regular intervals (such as an annual labor force survey), or be part of an international survey program (such as the MICS, DHS, LSMS and others). The series statement provides information on the series.

+
"series_statement": {
+  "series_name": "string",
+  "series_info": "string"
+}
+


+
    +
  • series_name [Optional ; Not repeatable ; String]
    +The name of the series to which the study belongs. For example, “Living Standards Measurement Study (LSMS)” or “Demographic and Health Survey (DHS)” or “Multiple Indicator Cluster Survey VII (MICS7)”. A description of the series can be provided in the element “series_info”.
  • +
  • series_info [Optional ; Not repeatable ; String]
    +A brief description of the characteristics of the series, including when it started, how many rounds have already been implemented, and who is in charge.
  • +
+
my_ddi <- list(
+  doc_desc = list(
+    # ... 
+  ),
+  
+  study_desc = list(
+    # ... ,
+    series_statement = list(
+      series_name = "Multiple Indicator Cluster Survey (MICS) by UNICEF",
+      series_info = "The Multiple Indicator Cluster Survey, Round 3 (MICS3) is the third round of MICS surveys, previously conducted around 1995 (MICS1) and 2000 (MICS2). MICS surveys are designed by UNICEF, and implemented by national agencies in participating countries. MICS was designed to monitor various indicators identified at the World Summit for Children and the Millennium Development Goals. Many questions and indicators in MICS3 are consistent and compatible with the prior round of MICS (MICS2) but less so with MICS1, although there have been a number of changes in definition of indicators between rounds. Round 1 covered X countries, round 2 covered Y countries, and Round 3 covered Z countries."
+    ),
+    # ...
+  ),
+  # ...
+)  
+


+
+
+

5.4.2.7 Version statement

+

version_statement [Optional; Not repeatable]
+Version statement for the study.

+
"version_statement": {
+  "version": "string",
+  "version_date": "string",
+  "version_resp": "string",
+  "version_notes": "string"
+}
+


+

The version statement should contain a version number followed by a version label. The version number should follow a standard convention adopted by the data repository. We recommend that major versions be identified by the number to the left of the decimal point, and that iterations within a major version be identified by a sequential number to the right of the decimal point. The number to the left of the decimal could for example be (0) for the raw, unedited dataset; (1) for the edited dataset, non-anonymized, available for internal use at the data producing agency; and (2) for the edited dataset, prepared for dissemination to secondary users (possibly anonymized). Example:

+

v0: Basic raw data, resulting from the data capture process, before any data editing is implemented.
+v1.0: Edited data, first iteration, for internal use only.
+v1.1: Edited data, second iteration, for internal use only.
+v2.1: Edited data, anonymized and packaged for public distribution.

+
    +
  • version [Optional ; Not repeatable ; String]
    +The version number, also known as release or edition.
  • +
  • version_date [Optional ; Not repeatable ; String]
    +The ISO 8601 standard for dates (YYYY-MM-DD) is recommended for use with the “date” attribute.
  • +
  • version_resp [Optional ; Not repeatable ; String]
    +The person(s) or organization(s) responsible for this version of the study.
  • +
  • version_notes [Optional ; Not repeatable ; String]
    +Version notes should provide a brief report on the changes made through the versioning process. The note should indicate how this version differs from other versions of the same dataset.
    +
  • +
+
my_ddi <- list(
+  
+    # ... 
+
+  study_desc = list(
+    
+    # ... ,
+    
+    version_statement = list(
+      version = "Version 1.1",
+      version_date = "2021-02-09",
+      version_resp = "National Statistics Office, Data Processing unit",
+      version_notes = "This dataset contains the edited version of the data that were used to produce the Final Survey Report. It is equivalent to version 1.0 of the dataset, except for the addition of one variable (weight2) containing a calibrated version of the original sample weights (variable weight)"
+    ),
+    
+    # ...
+    
+  ),
+  
+  # ...
+  
+)  
+


+
+
+

5.4.2.8 Bibliographic citation

+

bib_citation [Optional ; Not repeatable ; String]
+Complete bibliographic reference containing all of the standard elements of a citation that can be used to cite the study. The bib_citation_format (see below) is provided to enable specification of the particular citation style used, e.g., APA, MLA, or Chicago.

+
+
+

5.4.2.9 Bibliographic citation format

+

bib_citation_format [Optional ; Not repeatable ; String]
+This element is used to specify the particular citation style used in the field bib_citation described above, e.g., APA, MLA, or Chicago.

+
  my_ddi <- list(
+    doc_desc = list(
+      # ... 
+    ),
+    study_desc = list(
+      # ... ,
+      bib_citation = "",
+      bib_citation_format = ""
+      # ...
+    ),
+    # ...
+  )  
+


+
+
+

5.4.2.10 Holdings

+

holdings [Optional ; Repeatable]
+Information concerning either the physical or electronic holdings of the study being described.

+
"holdings": [
+  {
+    "name": "string",
+    "location": "string",
+    "callno": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +Name of the physical or electronic holdings of the cited study.
  • +
  • location [Optional ; Not repeatable ; String]
    +The physical location where a copy of the study is held.
  • +
  • callno [Optional ; Not repeatable ; String]
    +The call number at the location specified in location.
  • +
  • uri [Optional ; Not repeatable ; String]
    +A URL for accessing the electronic copy of the cited study from the location mentioned in name.
  • +
+
my_ddi <- list(
+  doc_desc = list(
+    # ... 
+  ),
+  study_desc = list(
+    # ... ,
+    holdings = list(
+      list(name = "World Bank Microdata Library",
+           location = "World Bank, Development Data Group",
+           uri = "http://microdata.worldbank.org")
+    ),
+    # ...
+  ),
+  # ...
+)  
+


+
+
+

5.4.2.11 Study notes

+

study_notes [Optional ; Not repeatable]

+

This element can be used to provide additional information on the study which cannot be accommodated in the specific metadata elements of the schema, in the form of a free text field.
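For example (with illustrative content):

my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    study_notes = "Fieldwork in the northern districts was delayed by two weeks due to flooding; the revised data collection calendar is described in the survey report.",
    # ...
  ),
  # ...
)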

+
+
+

5.4.2.12 Study authorization

+

study_authorization [Optional ; Not repeatable]

+
"study_authorization": {
+  "date": "string",
+  "agency": [
+    {
+      "name": "string",
+      "affiliation": "string",
+      "abbr": "string"
+    }
+  ],
+  "authorization_statement": "string"
+}
+


+

Provides structured information on the agency that authorized the study, the date of authorization, and an authorization statement. This element will be used when a special legislation is required to conduct the data collection (for example a Census Act) or when the approval of an Ethics Board or other body is required to collect the data.

+
  • date [Optional ; Not repeatable ; String]
    The date, preferably entered in ISO 8601 format (YYYY-MM-DD), when the authorization to conduct the study was granted.
  • agency [Optional ; Repeatable]
    Identification of the agency that authorized the study.
    • name [Optional ; Not repeatable ; String]
      Name of the agent or agency that authorized the study.
    • affiliation [Optional ; Not repeatable ; String]
      The institutional affiliation of the authorizing agent or agency mentioned in name.
    • abbr [Optional ; Not repeatable ; String]
      The abbreviation of the authorizing agent’s or agency’s name.
  • authorization_statement [Optional ; Not repeatable ; String]
    The text of the authorization (or a description and link to a document or other resource containing the authorization statement).
+
my_ddi <- list(
+  doc_desc = list(
+    # ... 
+  ),
+  study_desc = list(
+    # ... ,
+    study_authorization = list(
+       date = "2018-02-23",
+       agency = list(
+          list(name = "Institutional Review Board of the University of Popstan",
+               abbr = "IRB-UP")
+       ),
+       authorization_statement = "The required documentation covering the study purpose, disclosure information, questionnaire content, and consent statements was delivered to the IRB-UP on 2017-12-27 and was reviewed by the compliance officer. Statement of authorization for the described study was issued on 2018-02-23."
+    ),
+    # ...
+  ),
+  # ...
+)  
+


+
+
+

5.4.2.13 Study information

+

study_info [Required ; Not repeatable]
+This section contains the metadata elements needed to describe the core elements of a study, including the dates of data collection and reference period, the country and other geographic coverage information, and more. These elements are not required in the DDI standard, but documenting a study without providing at least some of this information would make the metadata mostly irrelevant.

+
"study_info": {
+  "study_budget": "string",
+  "keywords": [],
+  "topics": [],
+  "abstract": "string",
+  "time_periods": [],
+  "coll_dates": [],
+  "nation": [],
+  "bbox": [],
+  "bound_poly": [],
+  "geog_coverage": "string",
+  "geog_coverage_notes": "string",
+  "geog_unit": "string",
+  "analysis_unit": "string",
+  "universe": "string",
+  "data_kind": "string",
+  "notes": "string",
+  "quality_statement": {},
+  "ex_post_evaluation": {}
+}
+


+
    +
  • study_budget [Optional ; Not repeatable ; String]

    +

    This is a free-text field, not a structured element. The budget of a study will ideally be described by budget line. The currency used to describe the budget should be specified. This element can also be used to document issues related to the budget (e.g., documenting possible under-run and over-run).

    +
      my_ddi <- list(
    +  # ... ,
    +  study_desc = list(
    +    # ... ,
    +    study_info = list(
    +      study_budget = "The study had a total budget of 500,000 USD allocated as follows:
    +          By type of expense:
    +            - Staff: 150,000 USD
    +            - Consultants (incl. interviewers): 180,000 USD
    +            - Travel: 50,000 USD
    +            - Equipment: 90,000 USD
    +            - Other: 30,000 USD
    +          By activity
    +            - Study design (questionnaire design and testing, sampling, piloting): 100,000 USD
    +            - Data collection: 250,000 USD
    +            - Data processing and tabulation: 80,000 USD
    +            - Analysis and dissemination: 50,000 USD
    +            - Evaluation: 20,000 USD
    +          By source of funding:
    +            - Government budget: 300,000 USD 
    +            - External sponsors
    +               - Grant ABC001 - 150,000 USD
    +               - Grant XYZ987 - 50,000 USD",
    +
    +      # ... 
    +
    +  ),
    +  # ...
    +)  
    +


  • +
  • keywords [Optional ; Repeatable]

  • +
+
"keywords": [
+  {
+    "keyword": "string",
+    "vocab": "string",
+    "uri": "string"
+  }
+]
+


+

Keywords are words or phrases that describe salient aspects of a data collection’s content. The addition of keywords can significantly improve the discoverability of data. Keywords can summarize and improve the description of the content or subject matter of a study. For example, keywords “poverty”, “inequality”, “welfare”, and “prosperity” could be attached to a household income survey used to generate poverty and inequality indicators (for which these keywords may not appear anywhere else in the metadata). A controlled vocabulary can be employed. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
  • keyword [Required ; Not repeatable ; String]
    A keyword (or phrase).
  • vocab [Optional ; Not repeatable ; String]
    The controlled vocabulary from which the keyword is extracted, if any.
  • uri [Optional ; Not repeatable ; String]
    The URI of the controlled vocabulary used, if any.

+
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,
      keywords = list(
        list(keyword = "poverty",
             vocab = "UNESCO Thesaurus",
             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"),
        list(keyword = "income distribution",
             vocab = "UNESCO Thesaurus",
             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"),
        list(keyword = "inequality",
             vocab = "UNESCO Thesaurus",
             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/")
      ),
      # ...
    ),
    # ...
  ),
  # ...
)
```
- **topics** [Optional ; Repeatable]
The topics field indicates the broad substantive topic(s) that the study covers. A topic classification facilitates referencing and searches in on-line data catalogs.

```json
"topics": [
  {
    "topic": "string",
    "vocab": "string",
    "uri": "string"
  }
]
```

  - **topic** [Required ; Not repeatable]
  The label of the topic. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) Topic Classification.
  - **vocab** [Required ; Not repeatable]
  The specification (name, including the version) of the controlled vocabulary in use.
  - **uri** [Required ; Not repeatable]
  A link (URL) to the controlled vocabulary website.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      topics = list(

        list(topic = "Equality, inequality and social exclusion",
             vocab = "CESSDA topics classification",
             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),

        list(topic = "Social and occupational mobility",
             vocab = "CESSDA topics classification",
             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")

      ),
      # ...
    ),
    # ...
  ),
  # ...
)
```


- **abstract** [Optional ; Not repeatable ; String]
An unformatted summary describing the purpose, nature, and scope of the data collection, special characteristics of its contents, major subject areas covered, and what questions the primary investigator(s) attempted to answer when they conducted the study. The summary should ideally be between 50 and 5,000 characters long. The abstract should provide a clear summary of the purposes, objectives, and content of the survey. It should be written by a researcher or survey statistician familiar with the study. Inclusion of this element is strongly recommended.

This example is for the Afrobarometer Survey 1999-2000, Merged Round 1 dataset.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      abstract = "The Afrobarometer is a comparative series of public attitude surveys that assess African citizens' attitudes to democracy and governance, markets, and civil society, among other topics.

The 12-country dataset is a combined dataset for the 12 African countries surveyed during round 1 of the survey, conducted between 1999-2000 (Botswana, Ghana, Lesotho, Mali, Malawi, Namibia, Nigeria, South Africa, Tanzania, Uganda, Zambia and Zimbabwe), plus data from the old Southern African Democracy Barometer, and similar surveys done in West and East Africa.",

      # ...
    ),
    # ...
  ),
  # ...
)
```

- **time_periods** [Optional ; Repeatable]
This refers to the time period (also known as span) covered by the data, not the dates of data collection.

```json
"time_periods": [
  {
    "start": "string",
    "end": "string",
    "cycle": "string"
  }
]
```

  - **start** [Required ; Not repeatable ; String]
  The start date for the cycle being described. Enter the date in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY).
  - **end** [Required ; Not repeatable ; String]
  The end date for the cycle being described. Enter the date in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY). Indicate open-ended dates with two periods (..).
  - **cycle** [Optional ; Not repeatable ; String]
  The cycle attribute permits specification of the relevant cycle, wave, or round of data.

- **coll_dates** [Optional ; Repeatable]
Contains the date(s) when the data were collected, which may be different from the dates the data refer to (see time_periods above). For example, data may be collected over a period of two weeks (coll_dates) about household expenditures during a reference week (time_periods) preceding the beginning of data collection. Use the start and end elements to delimit each period entered.

```json
"coll_dates": [
  {
    "start": "string",
    "end": "string",
    "cycle": "string"
  }
]
```

  - **start** [Required ; Not repeatable ; String]
  Date the data collection started (for the specified cycle, if any). Enter the date in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY).
  - **end** [Required ; Not repeatable ; String]
  Date the data collection ended (for the specified cycle, if any). Enter the date in ISO 8601 format (YYYY-MM-DD, YYYY-MM, or YYYY).
  - **cycle** [Optional ; Not repeatable ; String]
  Identification of the cycle of data collection. The cycle attribute permits specification of the relevant cycle, wave, or round of data. For example, a household consumption survey could visit households in four phases (one per quarter). Each quarter would be a cycle, and the specific dates of data collection for each quarter would be entered.
This example is for an impact evaluation survey with a baseline and two follow-up surveys.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      time_periods = list(

        list(start = "2020-01-10",
             end   = "2020-01-16",
             cycle = "Baseline survey"),

        list(start = "2020-07-10",
             end   = "2020-07-16",
             cycle = "First follow-up survey"),

        list(start = "2021-01-10",
             end   = "2021-01-16",
             cycle = "Second and last follow-up survey")
      ),

      coll_dates = list(

        list(start = "2020-01-17",
             end   = "2020-01-25",
             cycle = "Baseline survey"),

        list(start = "2020-07-17",
             end   = "2020-07-24",
             cycle = "First follow-up survey"),

        list(start = "2021-01-17",
             end   = "2021-01-22",
             cycle = "Second and last follow-up survey")
      ),

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **nation** [Optional ; Repeatable]
Indicates the country or countries (or "economies", or "territories") covered by the study (but not the sub-national geographic areas). If the study covers more than one country, each is entered separately.

```json
"nation": [
  {
    "name": "string",
    "abbreviation": "string"
  }
]
```

  - **name** [Required ; Not repeatable ; String]
  The country name, even in cases where the study does not cover the entire country.
  - **abbreviation** [Optional ; Not repeatable ; String]
  The abbreviation will contain a country code, preferably the 3-letter ISO 3166-1 country code.

- **bbox** [Optional ; Repeatable]
This element is used to define one or multiple bounding box(es), which are the rectangular, fundamental geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset's geographic coverage. The bounding box provides the geographic coordinates of the top-left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. This element is optional, but if the bound_poly element (see below) is used, then the bbox element must be included.

```json
"bbox": [
  {
    "west": "string",
    "east": "string",
    "south": "string",
    "north": "string"
  }
]
```

  - **west** [Required ; Not repeatable ; String]
  West longitude of the bounding box.
  - **east** [Required ; Not repeatable ; String]
  East longitude of the bounding box.
  - **south** [Required ; Not repeatable ; String]
  South latitude of the bounding box.
  - **north** [Required ; Not repeatable ; String]
  North latitude of the bounding box.
This example is for a study covering the islands of Madagascar and Mauritius.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      nation = list(
        list(name = "Madagascar", abbreviation = "MDG"),
        list(name = "Mauritius",  abbreviation = "MUS")
      ),

      bbox = list(

        list(name  = "Madagascar",
             west  = "43.2541870461",
             east  = "50.4765368996",
             south = "-25.6014344215",
             north = "-12.0405567359"),

        list(name  = "Mauritius",
             west  = "56.6",
             east  = "72.466667",
             south = "-20.516667",
             north = "-5.25")

      ),

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **bound_poly** [Optional ; Repeatable]
The bbox metadata element (see above) describes a rectangular area representing the entire geographic coverage of a dataset. The element bound_poly allows for a more detailed description of the geographic coverage, by allowing multiple and non-rectangular polygons (areas) to be described. This is done by providing list(s) of latitude and longitude coordinates that define the area(s). It should only be used to define the outer boundaries of the covered areas. This field is intended to enable a refined coordinate-based search, not to actually map an area. Note that if the bound_poly element is used, then the element bbox MUST be present as well, and all points enclosed by the bound_poly MUST be contained within the bounding box defined in bbox.

```json
"bound_poly": [
  {
    "lat": "string",
    "lon": "string"
  }
]
```

  - **lat** [Required ; Not repeatable ; String]
  The latitude of the coordinate.
  - **lon** [Required ; Not repeatable ; String]
  The longitude of the coordinate.
This example shows a polygon for the State of Nevada, USA.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      bound_poly = list(
        list(lat = "42.002207",      lon = "-120.005729004"),
        list(lat = "42.002207",      lon = "-114.039663"),
        list(lat = "35.9",           lon = "-114.039663"),
        list(lat = "36.080",         lon = "-114.544"),
        list(lat = "35.133",         lon = "-114.542"),
        list(lat = "35.00208499998", lon = "-114.63288"),
        list(lat = "35.00208499998", lon = "-114.63323"),
        list(lat = "38.999",         lon = "-120.005729004"),
        list(lat = "42.002207",      lon = "-120.005729004")
      ),

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **geog_coverage** [Optional ; Not repeatable ; String]
Information on the geographic coverage of the study. This includes the total geographic scope of the data, and any additional levels of geographic coding provided in the variables. Typical entries will be "National coverage", "Urban areas", "Rural areas", "State of …", "Capital city", etc. This does not describe where the data were collected; it describes which area the data are representative of. This means, for example, that a sample survey could be declared as having national coverage even if some districts of the country were not included in the sample, as long as the sample is nationally representative.

- **geog_coverage_notes** [Optional ; Not repeatable ; String]
Additional information on the geographic coverage of the study, entered as a free-text field.

- **geog_unit** [Optional ; Not repeatable ; String]
Describes the levels of geographic aggregation covered by the data. Particular attention must be paid to including information on the lowest geographic area for which data are representative.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      geog_coverage = "National coverage",

      geog_coverage_notes = "The sample covered the urban and rural areas of all provinces of the country. Some areas of province X were however not accessible due to civil unrest.",

      geog_unit = "The survey provides data representative at the national, provincial and district levels. For the capital city, the data are representative at the ward level.",

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **analysis_unit** [Optional ; Not repeatable ; String]
A study can have multiple units of analysis. This field will list the various units that can be analyzed. For example, a Living Standards Measurement Study (LSMS) may have collected data on households and their members (individuals), on dwelling characteristics, on prices in local markets, on household enterprises, on agricultural plots, and on characteristics of health and education facilities in the sample areas.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      analysis_unit = "Data were collected on households, individuals (household members), dwellings, commodity prices at local markets, household enterprises, agricultural plots, and characteristics of health and education facilities."

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **universe** [Optional ; Not repeatable ; String]
The universe is the group of persons (or other units of observation, like dwellings, facilities, or other) that are the object of the study and to which any analytic results refer. The universe will rarely cover the entire population of the country. Sample household surveys, for example, may not cover homeless persons, nomads, diplomats, or community households. Population censuses do not cover diplomats. Facility surveys may be limited to facilities of a certain type (e.g., public schools). Try to provide the most detailed information possible on the population covered by the survey/census, focusing on excluded categories of the population. For household surveys, age, nationality, and residence commonly help to delineate a given universe, but any of a number of factors may be involved, such as sex, race, income, veteran status, criminal conviction, etc. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is a member of the population under study.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      universe = "The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household.",

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **data_kind** [Optional ; Not repeatable ; String]
This field describes the main type of microdata generated by the study: survey data, census/enumeration data, aggregate data, clinical data, event/transaction data, program source code, machine-readable text, administrative records data, experimental data, psychological test, textual data, coded textual, coded documents, time budget diaries, observation data/ratings, process-produced data, etc. A controlled vocabulary should be used, as this information may be used to build facets (filters) in a catalog user interface.

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      data_kind = "Sample survey data",

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **notes** [Optional ; Not repeatable ; String]
This element is provided to document any specific situations, observations, or events that occurred during data collection. Consider stating items such as:

  - Was a training of enumerators held? (elaborate)
  - Was a pilot survey conducted?
  - Did any events have a bearing on the data quality? (elaborate)
  - How long did an interview take on average?
  - In what language(s) were the interviews conducted?
  - Were there any corrective actions taken by management when problems occurred in the field?

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      notes = "The pre-test for the survey took place from August 15, 2006 - August 25, 2006 and included 14 interviewers who would later become supervisors for the main survey.
Each interviewing team comprised 3-4 female interviewers (no male interviewers were used due to the sensitivity of the subject matter), together with a field editor, a supervisor, and a driver. A total of 52 interviewers, 14 supervisors and 14 field editors were used. Training of interviewers took place at the headquarters of the Statistics Office from July 1 to July 12, 2006.
Data collection took place over a period of about 6 weeks from September 2, 2006 until October 17, 2006. Interviewing took place every day throughout the fieldwork period, although interviewing teams were permitted to take one day off per week.
Interviews averaged 35 minutes for the household questionnaire (excluding water testing), 23 minutes for the women's questionnaire, and 27 minutes for the under-five children's questionnaire (excluding the anthropometry). Interviews were conducted primarily in English, but occasionally used local translation.
Six staff members of the Statistics Office provided overall fieldwork coordination and supervision."

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **quality_statement** [Optional ; Not repeatable]
This section lists the specific standards complied with during the execution of this study, and provides the option to formulate a general statement on the quality of the data. Any known quality issue should be reported here. Such issues are better reported by the data producer or curator, not left to secondary analysts to discover. Transparency in reporting quality issues will increase the credibility and reputation of the data provider.

```json
"quality_statement": {
  "compliance_description": "string",
  "standards": [
    {
      "name": "string",
      "producer": "string"
    }
  ],
  "other_quality_statement": "string"
}
```

  - **compliance_description** [Optional ; Not repeatable ; String]
  A statement on compliance with standard quality assessment procedures. The list of these standards can be documented in the next element, standards.
  - **standards** [Optional ; Repeatable]
  An itemized list of quality standards complied with during the execution of the study.
    - **name** [Optional ; Not repeatable ; String]
    The name of the quality standard, if such a standard was used, including the date when the standard was published and the version of the standard with which the study is compliant.
    - **producer** [Optional ; Not repeatable ; String]
    The producer of the quality standard mentioned in name.
  - **other_quality_statement** [Optional ; Not repeatable ; String]
  Any additional statement on the quality of the data, entered as free text. This can be independent of any particular quality standard.

The example below is illustrative only; the standard and statements shown are fictitious.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      quality_statement = list(

         # fictitious values, for illustration only
         compliance_description = "The survey was implemented in compliance with the quality assurance framework of the national statistical agency.",

         standards = list(
           list(name = "National Quality Assurance Framework, version 2.0 (2015)",
                producer = "National Statistics Office")
         ),

         other_quality_statement = "No major quality issue was identified during or after data collection."

      ),

      # ...
    ),
    # ...
  ),
  # ...
)
```


- **ex_post_evaluation** [Optional ; Not repeatable]
Ex-post evaluations are frequently done within large statistical or research organizations, in particular when a study is intended to be repeated. Such evaluations are recommended by the Generic Statistical Business Process Model (GSBPM). This section of the schema is used to describe the evaluation procedures and their outcomes.

```json
"ex_post_evaluation": {
  "completion_date": "string",
  "type": "string",
  "evaluator": [
    {
      "name": "string",
      "affiliation": "string",
      "abbr": "string",
      "role": "string"
    }
  ],
  "evaluation_process": "string",
  "outcomes": "string"
}
```

  - **completion_date** [Optional ; Not repeatable ; String]
  The date the ex-post evaluation was completed.
  - **type** [Optional ; Not repeatable ; String]
  The type attribute identifies the type of evaluation, with or without the use of a controlled vocabulary.
  - **evaluator** [Optional ; Repeatable]
  The evaluator element identifies the person(s) and/or organization(s) involved in the evaluation.
    - **name** [Optional ; Not repeatable ; String]
    The name of the person or organization involved in the evaluation.
    - **affiliation** [Optional ; Not repeatable ; String]
    The affiliation of the individual or organization mentioned in name.
    - **abbr** [Optional ; Not repeatable ; String]
    An abbreviation for the organization mentioned in name.
    - **role** [Optional ; Not repeatable ; String]
    The specific role played in the evaluation process by the individual or organization mentioned in name.
  - **evaluation_process** [Optional ; Not repeatable ; String]
  A description of the evaluation process. This may include information on the dates the evaluation was conducted, cost/budget, relevance, institutional or legal arrangements, etc.
  - **outcomes** [Optional ; Not repeatable ; String]
  A description of the outcomes of the evaluation. It may include a reference to an evaluation report.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ... ,

      ex_post_evaluation = list(

        completion_date = "2020-04-30",

        type = "Independent evaluation requested by the survey sponsor",

        evaluator = list(
          list(name = "John Doe",
               affiliation = "Alpha Consulting, Ltd.",
               abbr = "AC",
               role = "Evaluation of the sampling methodology"),
          list(name = "Jane Smith",
               affiliation = "Beta Statistical Services, Ltd.",
               abbr = "BSS",
               role = "Evaluation of the data processing and analysis")
        ),

        evaluation_process = "In-depth review of pre-collection and collection procedures",

        outcomes = "The following steps were highly effective in increasing response rates."

      )
    ),
    # ...
  ),
  # ...
)
```


#### Study development

**study_development** [Optional ; Not repeatable]
"study_development": {
+  "development_activity": [
+    {
+    "activity_type": "string",
+    "activity_description": "string",
+    "participants": [
+      {
+      "name": "string",
+      "affiliation": "string",
+      "role": "string"
+      }
+    ],
+    "resources": [
+      {
+        "name": "string",
+        "origin": "string",
+        "characteristics": "string"
+      }
+    ],
+    "outcome": "string"
+    }
+  ]
+}
+


+

This section is used to describe the process that led to the production of the final output of the study, from its inception/design to the dissemination of the final output.

- **development_activity** [Optional ; Repeatable]
The Generic Statistical Business Process Model (GSBPM) provides a useful decomposition of such a process, which can be used to list the activities to be described. This is a repeatable set of metadata elements; each activity should be documented separately.
  - **activity_type** [Optional ; Not repeatable ; String]
  The type of activity. A controlled vocabulary can be used, possibly comprising the main components of the GSBPM: {Needs specification, Design, Build, Collect, Process, Analyze, Disseminate, Evaluate}.
  - **activity_description** [Optional ; Not repeatable ; String]
  A brief description of the activity.
  - **participants** [Optional ; Repeatable]
  A list of participants (persons or organizations) in the activity. This is a repeatable set of elements; each participant can be documented separately.
    - **name** [Optional ; Not repeatable ; String]
    Name of the participating person or organization.
    - **affiliation** [Optional ; Not repeatable ; String]
    Affiliation of the person or organization mentioned in name.
    - **role** [Optional ; Not repeatable ; String]
    Specific role (participation) of the person or organization mentioned in name.
  - **resources** [Optional ; Repeatable]
  A description of the data sources and other resources used to implement the activity.
    - **name** [Optional ; Not repeatable ; String]
    The name of the resource.
    - **origin** [Optional ; Not repeatable ; String]
    The origin of the resource mentioned in name.
    - **characteristics** [Optional ; Not repeatable ; String]
    The characteristics of the resource mentioned in name.
  - **outcome** [Optional ; Not repeatable ; String]
  Description of the main outcome of the activity.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ...
    ),

    study_development = list(

      development_activity = list(

        list(
            activity_type = "Questionnaire design and piloting",
            activity_description = "",
            participants = list(
              list(name = "",
                   affiliation = "",
                   role = ""),
              list(name = "",
                   affiliation = "",
                   role = ""),
              list(name = "",
                   affiliation = "",
                   role = "")
            ),
            resources = list(
              list(name = "",
                   origin = "",
                   characteristics = "")
            ),
            outcome = ""
          ),

        list(
            activity_type = "Interviewers training",
            activity_description = "",
            participants = list(
              list(name = "",
                   affiliation = "",
                   role = ""),
              list(name = "",
                   affiliation = "",
                   role = ""),
              list(name = "",
                   affiliation = "",
                   role = "")
            ),
            resources = list(
              list(name = "",
                   origin = "",
                   characteristics = "")
            ),
            outcome = ""
          )

      )

    ),

    # ...
  )
)
```


#### Method

**method** [Optional ; Not repeatable]
This section describes the methodology and processing involved in a study.
"method": {
+  "data_collection": {},
+  "method_notes": "string",
+  "analysis_info": {},
+  "study_class": null,
+  "data_processing": [],
+  "coding_instructions": []
+}
+


+
- **data_collection** [Optional ; Not repeatable]
A block of metadata elements used to describe the methodology employed in a data collection. This includes the design of the questionnaire, sampling, supervision of field work, and other characteristics of the data collection phase.

```json
"data_collection": {
  "time_method": "string",
  "data_collectors": [],
  "collector_training": [],
  "frequency": "string",
  "sampling_procedure": "string",
  "sample_frame": {},
  "sampling_deviation": "string",
  "coll_mode": null,
  "research_instrument": "string",
  "instru_development": "string",
  "instru_development_type": "string",
  "sources": [],
  "coll_situation": "string",
  "act_min": "string",
  "control_operations": "string",
  "weight": "string",
  "cleaning_operations": "string"
}
```
- **time_method** [Optional ; Not repeatable ; String]
The time method or time dimension of the data collection. A controlled vocabulary can be used. The entries for this element may include "panel survey", "cross-section", "trend study", or "time-series".

- **data_collectors** [Optional ; Repeatable]
The entity (individual, agency, or institution) responsible for administering the questionnaire or interview or compiling the data.

```json
"data_collectors": [
  {
    "name": "string",
    "affiliation": "string",
    "abbr": "string",
    "role": "string"
  }
]
```

  - **name** [Optional ; Not repeatable ; String]
  In most cases, we will record here the name of the agency, not the names of interviewers. Only in the case of very small-scale surveys, with a very limited number of interviewers, will the names of persons be included as well.
  - **affiliation** [Optional ; Not repeatable ; String]
  The affiliation of the data collector mentioned in name.
  - **abbr** [Optional ; Not repeatable ; String]
  The abbreviation given to the agency mentioned in name.
  - **role** [Optional ; Not repeatable ; String]
  The specific role of the person or agency mentioned in name.
- **collector_training** [Optional ; Repeatable]
Describes the training provided to data collectors, including interviewer training, process testing, compliance with standards, etc. This set of elements is repeatable, to capture different aspects of the training process.

```json
"collector_training": [
  {
    "type": "string",
    "training": "string"
  }
]
```

  - **type** [Optional ; Not repeatable ; String]
  The type of training being described. For example, "Training of interviewers", "Training of controllers", "Training of cartographers", "Training on the use of tablets for data collection", etc.
  - **training** [Optional ; Not repeatable ; String]
  A brief description of the training. This may include information on the dates and duration, audience, location, content, trainers, issues, etc.

- **frequency** [Optional ; Not repeatable ; String]
For data collected at more than one point in time, the frequency with which the data were collected.

- **sampling_procedure** [Optional ; Not repeatable ; String]
This field only applies to sample surveys. It describes the type of sample and sample design used to select the survey respondents to represent the population. This section should include summary information that includes (but is not limited to): sample size (expected and actual) and how the sample size was decided; level of representation of the sample; sample frame used, and listing exercise conducted to update it; sample selection process (e.g., probability proportional to size, or over-sampling); stratification (implicit and explicit); design omissions in the sample; strategy for absent respondents/not found/refusals (replacement or not). Detailed information on the sample design is critical to allow users to adequately calculate sampling errors and confidence intervals for their estimates. To do that, they will need to be able to clearly identify the variables in the dataset that represent the different levels of stratification and the primary sampling unit (PSU).
In publications and reports, the description of the sampling design often contains complex formulas and symbols. As the XML and JSON formats used to store the metadata are plain text files, they cannot contain these complex representations. You may however provide references (title/author/date) to documents where such detailed descriptions are provided, and make sure that the documents (or links to the documents) are provided in the catalog where the survey metadata are published.

- **sample_frame** [Optional ; Not repeatable]
A description of the sample frame used for identifying the population from which the sample was taken. For example, a telephone book may be the sample frame for a phone survey, and the listing of enumeration areas (EAs) of a population census can provide the sample frame for a household survey. In addition to the name, label, and text describing the sample frame, this structure lists who maintains the sample frame, the period for which it is valid, a use statement, the universe covered, the type of unit contained in the frame as well as the number of units available, the reference period of the frame, and the procedures used to update the frame.
```json
"sample_frame": {
  "name": "string",
  "valid_period": [
    {
      "event": "string",
      "date": "string"
    }
  ],
  "custodian": "string",
  "universe": "string",
  "frame_unit": {
    "is_primary": null,
    "unit_type": "string",
    "num_of_units": "string"
  },
  "reference_period": [
    {
      "event": "string",
      "date": "string"
    }
  ],
  "update_procedure": "string"
}
```
  - **name** [Optional ; Not repeatable ; String]
  The name (title) of the sample frame.
  - **valid_period** [Optional ; Repeatable]
  Defines a time period for the validity of the sampling frame, using a list of events and dates.
    - **event** [Optional ; Not repeatable ; String]
    The event can, for example, be "start" or "end".
    - **date** [Optional ; Not repeatable ; String]
    The date corresponding to the event, entered in ISO 8601 format (YYYY-MM-DD).
  - **custodian** [Optional ; Not repeatable ; String]
  Custodian identifies the agency or individual responsible for creating and/or maintaining the sample frame.
  - **universe** [Optional ; Not repeatable ; String]
  A description of the universe (population) covered by the sample frame. Age, nationality, and residence commonly help to delineate a given universe, but any of a number of factors may be involved, such as sex, race, income, etc. The universe may consist of elements other than persons, such as housing units, court cases, deaths, countries, etc. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is included in the sample frame.
  - **frame_unit** [Optional ; Not repeatable]
  Provides information about the sampling frame unit.
    - **is_primary** [Optional ; Not repeatable ; Boolean]
    This boolean attribute (true/false) indicates whether the unit is primary or not.
    - **unit_type** [Optional ; Not repeatable ; String]
    The type of the sampling frame unit (for example, "household" or "dwelling").
    - **num_of_units** [Optional ; Not repeatable ; String]
    The number of units in the sample frame, possibly with information on its distribution (e.g., by urban/rural, province, or other).
  - **reference_period** [Optional ; Repeatable]
  Indicates the period of time in which the sampling frame was actually used for the study in question. Use the ISO 8601 date format to enter the relevant date(s).
    - **event** [Optional ; Not repeatable ; String]
    Indicates the type of event that the date corresponds to, e.g., "start", "end", "single".
    - **date** [Optional ; Not repeatable ; String]
    The relevant date in ISO 8601 date/time format.
  - **update_procedure** [Optional ; Not repeatable ; String]
  This element is used to describe how and with what frequency the sample frame is updated. For example: "The lists and boundaries of enumeration areas are updated every ten years at the occasion of the population census cartography work. Listings of households in enumeration areas are updated as and when needed, based on their selection in survey samples."

- **sampling_deviation** [Optional ; Not repeatable ; String]
Sometimes the realities of the field require a deviation from the sampling design (for example, because of limited access to certain zones due to weather problems, political instability, etc.). If the sample design has deviated for any reason, this can be reported here. This element can also provide information on the correspondence, as well as the possible discrepancies, between the sampled units (obtained) and the available statistics for the population as a whole (age, sex ratio, marital status, etc.).
- **coll_mode** [Optional ; Repeatable ; String]
The mode of data collection is the manner in which the interview was conducted or the information was gathered. Ideally, a controlled vocabulary will be used to constrain the entries in this field (which could include items like "telephone interview", "face-to-face paper and pen interview", "face-to-face computer-assisted interview (CAPI)", "mail questionnaire", "computer-assisted telephone interview (CATI)", "self-administered web form", "measurement by sensor", and others).
This is a repeatable field, as some data collection activities implement multi-mode data collection (for example, a population census can offer respondents the option to submit information via web forms, telephone interviews, mailed forms, or face-to-face interviews). Note that in the API description (see the schema shown above), the element is described as having type "null", not {}. This is because the element can be entered either as a list (repeatable element) or as a string.

- **research_instrument** [Optional ; Not repeatable ; String]
The research instrument refers to the questionnaire or form used for collecting data. The following should be mentioned:
  - the list of questionnaires and a short description of each (all questionnaires must be provided as external resources);
  - the language(s) in which the questionnaire(s) was/were available;
  - information on the questionnaire design process (based on a previous questionnaire, based on a standard model questionnaire, review by stakeholders). If a document was compiled that contains the comments provided by the stakeholders on the draft questionnaire, or a report was prepared on the questionnaire testing, a reference to these documents can be provided here.

- **instru_development** [Optional ; Not repeatable ; String]
Describes any development work on the data collection instrument. This may include a description of the review process, standards followed, and a list of agencies/people consulted.

- **instru_development_type** [Optional ; Repeatable ; String]
The instrument development type. This element will be used when a pre-defined list of options (controlled vocabulary) is available.
- **sources** [Optional ; Repeatable]
A description of the sources used for developing the methodology of the data collection.

```json
"sources": [
  {
    "name": "string",
    "origin": "string",
    "characteristics": "string"
  }
]
```

  - **name** [Optional ; Not repeatable ; String]
  The name and other information on the source. For example, "United States Internal Revenue Service Quarterly Payroll File".
  - **origin** [Optional ; Not repeatable ; String]
  For historical materials, information about the origin(s) of the sources and the rules followed in establishing the sources should be specified. This may not be relevant to survey data.
  - **characteristics** [Optional ; Not repeatable ; String]
  Assessment of the characteristics and quality of the source material. This may not be relevant to survey data.
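A minimal sketch of how sources could be documented (the source named here is hypothetical and shown for illustration only):

```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    method = list(
      data_collection = list(
        # ... ,
        sources = list(
          # hypothetical source, for illustration
          list(name = "Population and Housing Census 2011, listing of enumeration areas",
               origin = "National Statistics Office",
               characteristics = "Used to design the sampling methodology of the survey.")
        )
      )
    )
  )
)
```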
- **coll_situation** [Optional ; Not repeatable ; String]
A description of noteworthy aspects of the data collection situation. This includes information on factors such as cooperativeness of respondents, duration of interviews, number of call-backs, etc.

- **act_min** [Optional ; Not repeatable ; String]
A summary of actions taken to minimize data loss. This includes information on actions such as follow-up visits, supervisory checks, historical matching, estimation, etc. Note that this element does not have to include detailed information on response rates, as a specific metadata element is provided for that purpose in the section analysis_info / response_rate (see below).

- **control_operations** [Optional ; Not repeatable ; String]
This element provides information on the oversight of the data collection, i.e., on methods implemented to facilitate data control performed by the primary investigator or by the data archive.

- **weight** [Optional ; Not repeatable ; String]
This field only applies to sample surveys. The use of sampling procedures may make it necessary to apply weights to produce accurate statistical results. Describe here the criteria for using weights in the analysis of a collection, and provide a list of variables used as weighting coefficients. If more than one variable is a weighting variable, describe how these variables differ from each other and what the purpose of each one of them is.

- **cleaning_operations** [Optional ; Not repeatable ; String]
A description of the methods used to clean or edit the data, e.g., consistency checking, wild code checking, etc. The data editing should contain information on how the data were treated or controlled for in terms of consistency and coherence. This item does not concern the data entry phase but only the editing of data, whether manual or automatic. It should provide answers to questions like: Was a hot-deck or a cold-deck technique used to edit the data? Were corrections made automatically (by program) or by visual control of the questionnaire? What software was used? If materials are available (specifications for data editing, report on data editing, programs used for data editing), they should be listed here and provided as external resources in data catalogs (the best documentation of data editing consists of well-documented reproducible scripts).

Example for the data_collection section:
```r
my_ddi <- list(

  doc_desc = list(
    # ...
  ),

  study_desc = list(
    # ... ,
    study_info = list(
      # ...
    ),
    study_development = list(
      # ...
    ),

    method = list(

      data_collection = list(

        time_method = "cross-section",

        data_collectors = list(
          list(name = "Staff from the Central Statistics Office",
               abbr = "NSO",
               affiliation = "Ministry of Planning")
        ),

        collector_training = list(
          list(
             type = "Training of interviewers",
             training = "72 staff (interviewers) were trained from [date] to [date] at the NSO headquarters. The training included 2 days of field work."
          ),
          list(
             type = "Training of controllers and supervisors",
             training = "A 3-day training of 10 controllers and 2 supervisors was organized from [date] to [date]. The controllers and supervisors had previously participated in the interviewer training."
          )
        ),

        sampling_procedure = "A list of 500 Enumeration Areas (EAs) was randomly selected from the sample frame, 300 in urban areas and 200 in rural areas. In each selected EA, 10 households were then randomly selected. 5000 households were thus selected for the sample (3000 urban and 2000 rural). The distribution of the sample (households) by province is as follows:
- Province A: Total: 1800  Urban: 1000  Rural: 800
- Province B: Total: 1200  Urban:  500  Rural: 700
- Province C: Total: 2000  Urban: 1500  Rural: 500",

        sample_frame = list(
           name = "Listing of Enumeration Areas (EAs) from the Population and Housing Census 2011",
           custodian = "National Statistics Office",
           universe = "The sample frame contains 25365 EAs covering the entire territory of the country. EAs contain an average of 400 households in rural areas, and 580 in urban areas.",
           frame_unit = list(
             is_primary = TRUE,
             unit_type = "Enumeration areas (EAs)",
             num_of_units = "25365, including 15100 in urban areas, and 10265 in rural areas."
           ),
           update_procedure = "The sample frame only provides EAs; a full household listing was conducted in each selected EA to provide an updated list of households."
        ),

        sampling_deviation = "Due to floods in two sampled rural areas in province A, two EAs could not be reached. The sample was thus reduced to 4980 households. The response rate was 90%, so the actual final sample size was 4482 households.",

        coll_mode = "Face-to-face interviews, conducted using tablets (CAPI)",

        research_instrument = "The questionnaires for the Generic MICS were structured questionnaires based on the MICS3 Model Questionnaire with some modifications and additions. A household questionnaire was administered in each household, which collected various information on household members including sex, age, relationship, and orphanhood status. The household questionnaire includes household characteristics, support to orphaned and vulnerable children, education, child labour, water and sanitation, household use of insecticide treated mosquito nets, and salt iodization, with optional modules for child discipline, child disability, maternal mortality and security of tenure and durability of housing.
In addition to a household questionnaire, questionnaires were administered in each household for women age 15-49 and children under age five. For children, the questionnaire was administered to the mother or caretaker of the child.
The women's questionnaire include women's characteristics, child mortality, tetanus toxoid, maternal and newborn health, marriage, polygyny, female genital cutting, contraception, and HIV/AIDS knowledge, with optional modules for unmet need, domestic violence, and sexual behavior.
The children's questionnaire includes children's characteristics, birth registration and early learning, vitamin A, breastfeeding, care of illness, malaria, immunization, and anthropometry, with an optional module for child development.
The questionnaires were developed in English from the MICS3 Model Questionnaires and translated into local languages. After an initial review the questionnaires were translated back into English by an independent translator with no prior knowledge of the survey. The back translation from the local language version was independently reviewed and compared to the English original. Differences in translation were reviewed and resolved in collaboration with the original translators. The English and local language questionnaires were both piloted as part of the survey pretest.",

        instru_development = "The questionnaire was pre-tested with split-panel tests, as well as an analysis of non-response rates for individual items, and response distributions.",

        coll_situation = "Floods in province A made access to two selected enumeration areas impossible.",

        act_min = "Local authorities and local staff from the Ministry of Health contributed to an awareness campaign, which helped achieve a response rate of 90%.",

        control_operations = "Interviewing was conducted by teams of interviewers. Each interviewing team comprised 3-4 female interviewers, a field editor and a supervisor, and a driver. Each team used a 4 wheel drive vehicle to travel from cluster to cluster (and where necessary within cluster).
The role of the supervisor was to coordinate field data collection activities, including management of the field teams, supplies and equipment, finances, maps and listings, coordinate with local authorities concerning the survey plan and make arrangements for accommodation and travel. Additionally, the field supervisor assigned the work to the interviewers, spot checked work, maintained field control documents, and sent completed questionnaires and progress reports to the central office.
The field editor was responsible for validating questionnaires at the end of the day when the data from interviews were transferred to their laptops. This included checking for missed questions, skip errors, fields incorrectly completed, and checking for inconsistencies in the data. The field editor also observed interviews and conducted review sessions with interviewers.
Responsibilities of the supervisors and field editors are described in the Instructions for Supervisors and Field Editors, together with the different field controls that were in place to control the quality of the fieldwork.
Field visits were also made by a team of central staff on a periodic basis during fieldwork. The senior staff of NSO also made 3 visits to field teams to provide support and to review progress.",

        weight = "Sample weights were calculated for each of the data files. Sample weights for the household data were computed as the inverse of the probability of selection of the household, computed at the sampling domain level (urban/rural within each region). The household weights were adjusted for non-response at the domain level, and were then normalized by a constant factor so that the total weighted number of households equals the total unweighted number of households. The household weight variable is called HHWEIGHT and is used with the HH data and the HL data.
Sample weights for the women's data used the un-normalized household weights, adjusted for non-response for the women's questionnaire, and were then normalized by a constant factor so that the total weighted number of women's cases equals the total unweighted number of women's cases.
Sample weights for the children's data followed the same approach as the women's and used the un-normalized household weights, adjusted for non-response for the children's questionnaire, and were then normalized by a constant factor so that the total weighted number of children's cases equals the total unweighted number of children's cases.",

        cleaning_operations = "Data editing took place at a number of stages throughout the processing, including:
           a) Office editing and coding
           b) During data entry
           c) Structure checking and completeness
           d) Secondary editing
           e) Structural checking of SPSS data files
           Detailed documentation of the editing of data can be found in the 'Data processing guidelines' document provided as an external resource."
      )

    ),
    # ...
  )
)
```


- **method_notes** [Optional ; Not repeatable ; String]
This element is provided to capture any additional relevant information on the data collection methodology, which could not fit in the previous metadata elements.
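A minimal sketch of where a method_notes entry would fit in the R list (the note shown is hypothetical, for illustration only):

```r
my_ddi <- list(
  # ... ,
  study_desc = list(
    # ... ,
    method = list(
      # ... ,
      # hypothetical note, for illustration
      method_notes = "The methodology of this survey round is identical to that of the previous round, except for the addition of a module on food security."
    )
  )
)
```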

- **analysis_info** [Optional ; Not repeatable]
This block of elements is used to organize information related to data quality and appraisal.

```json
"analysis_info": {
  "response_rate": "string",
  "sampling_error_estimates": "string",
  "data_appraisal": "string"
}
```

  - **response_rate** [Optional ; Not repeatable ; String]
  The response rate is the percentage of sample units that participated in the survey, based on the original sample size. Omissions may occur due to refusal to participate, impossibility to locate the respondent, or other reasons. This element is used to provide a narrative description of the response rate, possibly by stratum or other criteria, and if possible with an identification of possible causes. If information is available on the causes of non-response (refusal/not found/other), it can be reported here. This field can also be used to describe non-response in population censuses.
  - **sampling_error_estimates** [Optional ; Not repeatable ; String]
  Sampling errors are intended to measure how precisely one can estimate a population value from a given sample. For sample surveys, it is good practice to calculate and publish sampling errors. This field is used to provide information on these calculations (not to provide the sampling errors themselves, which should be made available in publications or reports). Information can be provided on which ratios/indicators have been subjected to the calculation of sampling errors, and on the software used for computing the sampling errors. A reference to a report or other document where the results can be found can also be provided.
  - **data_appraisal** [Optional ; Not repeatable ; String]
  This section is used to report any other action taken to assess the reliability of the data, or any observations regarding data quality. Describe here issues such as response variance, interviewer and response bias, question bias, etc. For a population census, this can include information on the main results of a post-enumeration survey (a report should be provided in external resources and mentioned here); it can also include relevant comparisons with data from other sources that can be used as benchmarks.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ... ,
    study_info = list(
      # ...
    ),
    study_development = list(
      # ...
    ),
    method = list(
      # ... ,

      analysis_info = list(

        response_rate = "Of these, 4996 were occupied households and 4811 were successfully interviewed for a response rate of 96.3%. Within these households, 7815 eligible women aged 15-49 were identified for interview, of which 7505 were successfully interviewed (response rate 96.0%), and 3242 children aged 0-4 were identified for whom the mother or caretaker was successfully interviewed for 3167 children (response rate 97.7%). These give overall response rates (household response rate times individual response rate) for the women's interview of 92.5% and for the children's interview of 94.1%.",

        sampling_error_estimates = "Estimates from a sample survey are affected by two types of errors: 1) non-sampling errors and 2) sampling errors. Non-sampling errors are the results of mistakes made in the implementation of data collection and data processing. Numerous efforts were made during implementation of the 2005-2006 MICS to minimize this type of error, however, non-sampling errors are impossible to avoid and difficult to evaluate statistically. If the sample of respondents had been a simple random sample, it would have been possible to use straightforward formulae for calculating sampling errors. However, the 2005-2006 MICS sample is the result of a multi-stage stratified design, and consequently needs to use more complex formulae. The SPSS complex samples module has been used to calculate sampling errors for the 2005-2006 MICS. This module uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. This method is documented in the SPSS file CSDescriptives.pdf found under the Help, Algorithms options in SPSS.
Sampling errors have been calculated for a select set of statistics (all of which are proportions due to the limitations of the Taylor linearization method) for the national sample, urban and rural areas, and for each of the five regions. For each statistic, the estimate, its standard error, the coefficient of variation (or relative error - the ratio between the standard error and the estimate), the design effect, and the square root design effect (DEFT - the ratio between the standard error using the given sample design and the standard error that would result if a simple random sample had been used), as well as the 95 percent confidence intervals (+/-2 standard errors). Details of the sampling errors are presented in the sampling errors appendix to the report and in the sampling errors table presented in the external resources.",

        data_appraisal = "A series of data quality tables and graphs are available to review the quality of the data and include the following:
        - Age distribution of the household population
        - Age distribution of eligible women and interviewed women
        - Age distribution of eligible children and children for whom the mother or caretaker was interviewed
        - Age distribution of children under age 5 by 3 month groups
        - Age and period ratios at boundaries of eligibility
        - Percent of observations with missing information on selected variables
        - Presence of mother in the household and person interviewed for the under 5 questionnaire
        - School attendance by single year age
        - Sex ratio at birth among children ever born, surviving and dead by age of respondent
        - Distribution of women by time since last birth
        - Scatter plot of weight by height, weight by age and height by age
        - Graph of male and female population by single years of age
        - Population pyramid
        The results of each of these data quality tables are shown in the appendix of the final report.
        The general rule for presentation of missing data in the final report tabulations is that a column is presented for missing data if the percentage of cases with missing data is 1% or more. Cases with missing data on the background characteristics (e.g. education) are included in the tables, but the missing data rows are suppressed and noted at the bottom of the tables in the report."

      ),

      # ...
    ),
    # ...
  )
)
```


- **study_class** [Optional ; Repeatable ; String]  
  This element can be used to give the data archive's class or study status number, which indicates the processing status of the study. It can also be used to indicate the type of study, based on a controlled vocabulary. The element is repeatable, allowing one study to belong to more than one class. Note that in the API description (see screenshot above), the element is described as having type "null", not {}; this is because the element can be entered either as a list (repeatable element) or as a string, as illustrated in the sketch below.

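A minimal sketch (with hypothetical values) of the two accepted forms, a single string or a list of strings:

```r
# study_class entered as a single string
my_ddi$study_desc$method$study_class <- "Income and Expenditure Survey"

# study_class entered as a list, when the study belongs to more than one class
my_ddi$study_desc$method$study_class <- list("Household Survey",
                                             "Income and Expenditure Survey")
```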
- **data_processing** [Optional ; Repeatable]  
  This element is used to describe how data were electronically captured (e.g., entered in the field, entered in a centralized manner by data entry clerks, captured electronically using tablets and a CAPI application, via web forms, etc.). Information on devices and software used for data capture can also be provided here. Other data processing procedures not captured elsewhere in the documentation can also be described here (tabulation, etc.).

```json
"data_processing": [
  {
    "type": "string",
    "description": "string"
  }
]
```

  - **type** [Optional ; Not repeatable ; String]  
    The type attribute supports better classification of this activity, including the optional use of a controlled vocabulary. The vocabulary could include options like "data capture", "data validation", "variable derivation", "tabulation", "data visualization", "anonymization", "documentation", etc.
  - **description** [Optional ; Repeatable ; String]  
    A description of a data processing task.

- **coding_instructions** [Optional ; Repeatable]  
  The coding_instructions elements can be used to describe specific coding instructions used in data processing, cleaning, or tabulation. Providing this information may however be complex and tedious for datasets with a significant number of variables, where hundreds of commands are used to process the data. An alternative option, preferable in many cases, is to publish reproducible data editing, tabulation, and analysis scripts together with the data, as related resources.

```json
"coding_instructions": [
  {
    "related_processes": "string",
    "type": "string",
    "txt": "string",
    "command": "string",
    "formal_language": "string"
  }
]
```

  - **related_processes** [Optional ; Not repeatable ; String]  
    The related_processes element links a coding instruction to one or more processes such as "data editing", "recoding", "imputations and derivations", "tabulation", etc.
  - **type** [Optional ; Not repeatable ; String]  
    The type attribute supports the classification of this activity (e.g., "topcoding"). A controlled vocabulary can be used.
  - **txt** [Optional ; Not repeatable ; String]  
    A description of the code/command, in a human-readable form.
  - **command** [Optional ; Not repeatable ; String]  
    The command code for the coding instruction.
  - **formal_language** [Optional ; Not repeatable ; String]  
    The language of the command code, e.g., "Stata", "R", "SPSS", "SAS", "Python", etc.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ...
    study_info = list(
      # ...
    ),
    study_development = list(
      # ...
    ),
    
    method = list(
      # ...
      study_class = "",
      
      data_processing = list(
        list(type = "Data capture",
             description = "Data collection was conducted using tablets and Survey Solutions software. Multiple quality controls and validations are embedded in the questionnaire."),
        list(type = "Batch data editing",
             description = "Data editing was conducted in batch using an R script, including techniques of hot deck, imputations, and recoding."),
        list(type = "Tabulation and visualizations",
             description = "The 25 tables and the visualizations published in the survey report were produced using Stata (script 'tabulation.do')."),
        list(type = "Anonymization",
             description = "An anonymized version of the dataset, published as a public use file, was created using the R package sdcMicro.")
      ),
      
      coding_instructions = list(
        list(related_processes = "",
             type = "",
             txt = "Suppression of observations with ...",
             command = "",
             formal_language = "Stata"),
        list(related_processes = "",
             type = "",
             txt = "Top coding age",
             command = "",
             formal_language = "Stata"),
        list(related_processes = "",
             type = "",
             txt = "",
             command = "",
             formal_language = "Stata")
      )
      
    )
  ),
  # ...
)
```


#### 5.4.2.16 Data access

**data_access** [Optional ; Not Repeatable]  
This section describes the access conditions and terms of use for the dataset. This set of elements should be used when the access conditions are well-defined and unlikely to change. An alternative option is to document the terms of use in the catalog where the data will be published, instead of "freezing" them in a metadata file.

```json
"data_access": {
  "dataset_availability": {
    "access_place": "string",
    "access_place_url": "string",
    "original_archive": "string",
    "status": "string",
    "coll_size": "string",
    "complete": "string",
    "file_quantity": "string",
    "notes": "string"
  },
  "dataset_use": {}
}
```


- **dataset_availability** [Optional ; Not Repeatable]  
  Information on the availability and storage of the dataset.

  - **access_place** [Optional ; Not repeatable ; String]  
    Name of the location where the data collection is currently stored.
  - **access_place_url** [Optional ; Not repeatable ; String]  
    The URL of the website of the location where the data collection is currently stored.
  - **original_archive** [Optional ; Not repeatable ; String]  
    Archive from which the data collection was obtained, if any (the originating archive). Note that the schema we propose provides an element provenance, which is not part of the DDI, that can be used to document the origin of a dataset.
  - **status** [Optional ; Not repeatable ; String]  
    A statement of the data availability. An archive may need to indicate that a collection is unavailable because it is embargoed for a period of time, because it has been superseded, because a new edition is imminent, etc. This element will rarely be used.
  - **coll_size** [Optional ; Not repeatable ; String]  
    Extent of the collection. This is a summary of the number of physical files that exist in a collection, recording the number of files that contain data and noting whether the collection contains other machine-readable documentation and/or other supplementary files and information such as data dictionaries, data definition statements, or data collection instruments. This element will rarely be used.
  - **complete** [Optional ; Not repeatable ; String]  
    This item indicates the relationship of the data collected to the amount of data coded and stored in the data collection. Information as to why certain items of collected information were not included in the data file stored by the archive should be provided here. Example: "Because of embargo provisions, data values for some variables have been masked. Users should consult the data definition statements to see which variables are under embargo." This element will rarely be used.
  - **file_quantity** [Optional ; Not repeatable ; String]  
    The total number of physical files associated with a collection. This element will rarely be used.
  - **notes** [Optional ; Not repeatable ; String]  
    Additional information on the dataset availability, not included in one of the elements above.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ...
    study_info = list(
      # ...
    ),
    study_development = list(
      # ...
    ),
    method = list(
      # ...
    ),

    data_access = list(

      dataset_availability = list(
        access_place = "World Bank Microdata Library",
        access_place_url = "http://microdata.worldbank.org",
        status = "Available for public use",
        coll_size = "4 data files + machine-readable questionnaire and report (2 PDF files) + data editing script (1 Stata do file).",
        complete = "The variables 'latitude' and 'longitude' (GPS location of respondents) are not included, for confidentiality reasons.",
        file_quantity = "7"
      ),

      # ...
    )
  ),
  # ...
)
```


- **dataset_use** [Optional ; Not Repeatable]  
  Information on the terms of use for the study dataset.
```json
"dataset_use": {
  "conf_dec": [
    {
      "txt": "string",
      "required": "string",
      "form_url": "string",
      "form_id": "string"
    }
  ],
  "spec_perm": [
    {
      "txt": "string",
      "required": "string",
      "form_url": "string",
      "form_id": "string"
    }
  ],
  "restrictions": "string",
  "contact": [
    {
      "name": "string",
      "affiliation": "string",
      "uri": "string",
      "email": "string"
    }
  ],
  "cit_req": "string",
  "deposit_req": "string",
  "conditions": "string",
  "disclaimer": "string"
}
```


  - **conf_dec** [Optional ; Repeatable]  
    This element is used to determine if the signing of a confidentiality declaration is needed to access a resource. We may indicate here what affidavit of confidentiality must be signed before the data can be accessed. Another option is to include this information in the next element (access conditions). If there is no confidentiality issue, this field can be left blank.

    - **txt** [Optional ; Not repeatable ; String]  
      A statement on confidentiality and limitations to data use. This statement does not replace a more comprehensive data agreement (see access conditions). An example of such a statement could be the following: "Confidentiality of respondents is guaranteed by Articles N to NN of the National Statistics Act of [date]. Before being granted access to the dataset, all users have to formally agree:
      - To make no copies of any files or portions of files to which s/he is granted access except those authorized by the data depositor.
      - Not to use any technique in an attempt to learn the identity of any person, establishment, or sampling unit not identified on public use data files.
      - To hold in strictest confidence the identification of any establishment or individual that may be inadvertently revealed in any documents or discussion, or analysis.
      - That such inadvertent identification revealed in her/his analysis will be immediately and in confidentiality brought to the attention of the data depositor."
    - **required** [Optional ; Not repeatable ; String]  
      The required attribute is used to aid machine processing of this element. The default specification is "yes".
    - **form_url** [Optional ; Not repeatable ; String]  
      The form_url element is used to provide a link to an online confidentiality declaration form.
    - **form_id** [Optional ; Not repeatable ; String]  
      Indicates the number or ID of the confidentiality declaration form that the user must fill out.

  - **spec_perm** [Optional ; Repeatable]  
    This element is used to determine if any special permissions are required to access a resource.

    - **txt** [Optional ; Not repeatable ; String]  
      A statement on the special permissions required to access the dataset.
    - **required** [Optional ; Not repeatable ; String]  
      The required attribute is used to aid machine processing of this element. The default specification is "yes".
    - **form_url** [Optional ; Not repeatable ; String]  
      The form_url element is used to provide a link to a special online permissions form.
    - **form_id** [Optional ; Not repeatable ; String]  
      The form_id indicates the number or ID of the special permissions form that the user must fill out.

  - **restrictions** [Optional ; Not repeatable ; String]  
    Any restrictions on access to or use of the collection, such as privacy certification or distribution restrictions, should be indicated here. These can be restrictions applied by the author, producer, or distributor of the data. This element can for example contain a statement (extracted from the DDI documentation) like: "In preparing the data file(s) for this collection, the National Center for Health Statistics (NCHS) has removed direct identifiers and characteristics that might lead to identification of data subjects. As an additional precaution NCHS requires, under Section 308(d) of the Public Health Service Act (42 U.S.C. 242m), that data collected by NCHS not be used for any purpose other than statistical analysis and reporting. NCHS further requires that analysts not use the data to learn the identity of any persons or establishments and that the director of NCHS be notified if any identities are inadvertently discovered. Users ordering data are expected to adhere to these restrictions."

  - **contact** [Optional ; Repeatable]  
    Users of the data may need further clarification and information on the terms of use and conditions to access the data. This set of elements is used to identify the contact persons who can serve as resource persons regarding problems or questions raised by the user community.

    - **name** [Optional ; Not repeatable ; String]  
      Name of the person. Note that in some cases, it might be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position.
    - **affiliation** [Optional ; Not repeatable ; String]  
      Affiliation of the person.
    - **uri** [Optional ; Not repeatable ; String]  
      URI for the person; it can be the URL of the organization the person belongs to.
    - **email** [Optional ; Not repeatable ; String]  
      The email element is used to indicate an email address for the contact individual mentioned in name. Ideally, a generic email address should be provided, as it is easy to configure a mail server so that all messages sent to the generic email address are automatically forwarded to designated staff members.

  - **cit_req** [Optional ; Not repeatable ; String]  
    A citation requirement that indicates the way the dataset should be referenced when cited in any publication. Providing a citation requirement will guarantee that the data producer gets proper credit, and that results of analysis can be linked to the proper version of the dataset. The data access policy should explicitly mention the obligation to comply with the citation requirement. The citation should include at least the primary investigator, the name and abbreviation of the dataset, the reference year, and the version number. Include also a website where the data or information on the data is made available by the official data depositor. Ideally, the citation requirement will include a DOI (see the DataCite website for recommendations).

  - **deposit_req** [Optional ; Not repeatable ; String]  
    Information regarding data users' responsibility for informing archives of their use of data, by providing citations to the published work or providing copies of the manuscripts.

  - **conditions** [Optional ; Not repeatable ; String]  
    Indicates any additional information that will assist the user in understanding the access and use conditions of the data collection.

  - **disclaimer** [Optional ; Not repeatable ; String]  
    A disclaimer limits the liability that the data producer or data custodian has regarding the use of the data. A standard legal statement should be used for all datasets from a same agency. The following formulation could be used: "The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses."

**Example**

```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ...
    study_info = list(
      # ...
    ),
    study_development = list(
      # ...
    ),
    method = list(
      # ...
    ),
      
    data_access = list(
      # ...
      
      dataset_use = list(
        
        conf_dec = list(
          list(txt = "Confidentiality of respondents is guaranteed by Articles N to NN of the National Statistics Act. All data users are required to sign an affidavit of confidentiality.", 
               required = "yes", 
               form_url = "http://datalibrary.org/affidavit", 
               form_id = "F01_AC_v01")
        ),
        
        spec_perm = list(
          list(txt = "Permission will only be granted to residents of [country].", 
               required = "yes", 
               form_url = "http://datalibrary.org/residency", 
               form_id = "F02_RS_v01")
        ),
        
        restrictions = "Data will only be shared with users who are registered to the National Data Center and have successfully completed the training on data privacy and responsible data use. Only users who legally reside in [country] will be authorized to access the data.",
        
        contact = list(
          list(name = "Head, Data Processing Division", 
               affiliation = "National Statistics Office", 
               uri = "www.cso.org/databank", 
               email = "dataproc@cso.org")
        ),

        cit_req = "National Statistics Office of Popstan. Multiple Indicators Cluster Survey 2000 (MICS 2000). Version 01 of the scientific use dataset (April 2001). DOI: XXX-XXXX-XXX",
        
        deposit_req = "To provide funding agencies with essential information about use of archival resources and to facilitate the exchange of information among researchers and development practitioners, users of the Microdata Library data are requested to send to the Microdata Library bibliographic citations for, or copies of, each completed manuscript or thesis abstract. Please indicate in a cover letter which data were used.",

        disclaimer = "The user of the data acknowledges that the original collector of the data, the authorized distributor of the data, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses."
        
      )
      
    )
  ),
  # ...
)
```
- **notes** [Optional ; Not repeatable ; String]  
  Any additional information related to data access that is not contained in the specific metadata elements provided in the section data_access.

### 5.4.3 Description of data files

**data_files** [Optional ; Repeatable]  
The data_files section is the DDI section that contains the elements needed to describe each of the data files that form the study dataset. These are elements at the file level; they do not include information at the variable level, which is contained in a separate section of the standard.

```json
"data_files": [
  {
    "file_id": "string",
    "file_name": "string",
    "file_type": "string",
    "description": "string",
    "case_count": 0,
    "var_count": 0,
    "producer": "string",
    "data_checks": "string",
    "missing_data": "string",
    "version": "string",
    "notes": "string"
  }
]
```


- **file_id** [Optional ; Not repeatable ; String]  
  A unique file identifier (within the metadata document, not necessarily within a catalog). This will typically be the electronic file name.
- **file_name** [Optional ; Not repeatable ; String]  
  This is not the name of the electronic file (which is provided in the previous element). It is a short title (label) that will help distinguish a particular file/part from other files/parts in the dataset.
- **file_type** [Optional ; Not repeatable ; String]  
  The type of data file. For example, raw data (ASCII), or software-dependent files such as SAS / Stata / SPSS data files, etc. Provide specific information (e.g., Stata 10 or Stata 15, SPSS Windows or SPSS Export, etc.). Note that in an on-line catalog, data can be made available in multiple formats; in such cases, the file_type element is not useful.
- **description** [Optional ; Not repeatable ; String]  
  The file_id and file_name elements provide limited information on the content of the file. The description element is used to provide a more detailed description of the file content. This description should clearly distinguish collected variables from derived variables. It is also useful to indicate the availability in the data file of some particular variables such as the weighting coefficients. If the file contains derived variables, it is good practice to refer to the computer program that generated it.
- **case_count** [Optional ; Numeric ; Not Repeatable]  
  Number of cases or observations in the data file. The value is 0 by default.
- **var_count** [Optional ; Numeric ; Not Repeatable]  
  Number of variables in the data file. The value is 0 by default.
- **producer** [Optional ; Not repeatable ; String]  
  The name of the agency that produced the data file. Most data files will have been produced by the survey primary investigator. In some cases however, auxiliary or derived files from other producers may be released with a dataset. This may for example be a file containing derived variables generated by a researcher.
- **data_checks** [Optional ; Not repeatable ; String]  
  Use this element if needed to provide information about the types of checks and operations that have been performed on the data file to make sure that the data are as correct as possible, e.g., consistency checking, wildcode checking, etc. Note that the information included here should be specific to the data file. Information about data processing checks that have been carried out on the data collection (study) as a whole should be provided in the data editing element at the study level. You may also provide here a reference to an external resource that contains the specifications for the data processing checks (that same information may also be provided in the data editing field in the study description section).
- **missing_data** [Optional ; Not repeatable ; String]  
  A description of missing data (number of missing cases, cause of missing values, etc.)
- **version** [Optional ; Not repeatable ; String]  
  The version of the data file. A data file may undergo various changes and modifications. File-specific versions can be tracked in this element. This field will in most cases be left empty.
- **notes** [Optional ; Not repeatable ; String]  
  This field aims to provide information on the specific data file not covered elsewhere.

**Example for the UNICEF MICS dataset**
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ...
  ),
  
  data_files = list(
    
    list(file_id = "HHS2020_S01",
         file_name = "Household roster (demographics)",
         description = "The file contains the demographic information on all individuals in the sample",
         case_count = 10000,
         var_count = 12,
         producer = "National Statistics Office",
         missing_data = "Values of age outside valid range (0 to 100) have been replaced with 'missing'.",
         version = "1.0 (edited, not anonymized)",
         notes = ""
    ),
     
    list(file_id = "HHS2020_S03A",
         file_name = "Section 3A - Education",
         description = "The file contains data related to section 3A of the household survey questionnaire (Education of household members aged 6 to 24 years). It also contains the weighting coefficient, and various recoded variables on levels of education.",
         case_count = 2500,
         var_count = 17,
         producer = "National Statistics Office",
         data_checks = "Education level (variable EDUCLEV) has been edited using hotdeck imputation when the reported value was out of acceptable range considering the AGE of the person.",
         version = "1.0 (edited, not anonymized)"
    ),
    
    list(file_id = "HHS2020_CONSUMPTION",
         file_name = "Annualized household consumption by products and services",
         description = "The file contains derived data on household consumption, annualized and aggregated by category of products and services. The file also contains a regional price deflator variable and the household weighting coefficient. The file was generated using a Stata program named 'cons_aggregate.do'.",
         case_count = 42000,
         var_count = 15,
         producer = "National Statistics Office",
         data_checks = "Outliers have been detected (> median + 5*IQR) for each product/service; fixed by imputation (regression model).",
         missing_data = "Missing consumption values are treated as 0",
         version = "1.0 (edited, not anonymized)"
    )
    
  ),
  
  # ...
)
```


### 5.4.4 Variable description

The DDI Codebook metadata standard provides multiple elements to document the variables contained in a micro-dataset. There is much value in documenting variables:

- it makes the data usable by providing users with a detailed data dictionary;
- it makes the data more discoverable, as all keywords included in the description of variables are indexed in data catalogs;
- it allows users to assess the comparability of data across sources;
- it enables the development of question banks; and
- it adds transparency and credibility to the data, especially when derived or imputed variables are documented.

All possible effort should thus be made to generate and publish detailed variable-level documentation.

A micro-dataset can contain many variables; some survey datasets include hundreds or even thousands of variables. Documenting variables can thus be a tedious process. The use of a specialized DDI metadata editor can make this process considerably more efficient, as much of the variable-level metadata can be automatically extracted from the electronic data files. Data files in Stata, SPSS, or other common formats include variable names, variable and value labels, and in some cases notes that can be extracted. The variable-level summary statistics that are part of the metadata can also be generated from the data files. Further, software applications used for capturing data, like Survey Solutions from the World Bank or CSPro from the US Census Bureau, can export variable metadata, including the variable names, the variable and value labels, and possibly the formulation of questions and the interviewer instructions when the software is used for conducting computer-assisted personal interviews (CAPI). Survey Solutions and CSPro can export metadata in multiple formats, including the DDI Codebook. Multiple options thus exist to make the documentation of variables efficient; as much as possible, tedious manual curation of variable-level information should be avoided. A minimal sketch of a programmatic approach is shown below.

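As an illustration, the following sketch (not part of the Guide's tooling; the file name and identifiers are hypothetical) uses the R package haven to extract variable names and labels from a Stata data file, as a starting point for the variables block described below:

```r
library(haven)

# Hypothetical file name; SPSS or SAS files read by haven could be handled similarly
data <- read_dta("HHS2020_S01.dta")

# Build a skeleton 'variables' block from the content of the data file
variables <- lapply(seq_along(data), function(i) {
  x <- data[[i]]
  list(
    file_id    = "HHS2020_S01",
    vid        = paste0("V", i),     # sequential, system-generated identifier
    name       = names(data)[i],     # variable name exactly as found in the file
    labl       = attr(x, "label"),   # variable label stored in the file (may be NULL)
    var_intrvl = if (is.numeric(x) && is.null(attr(x, "labels"))) "continuous" else "discrete"
  )
})
```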
**variables** [Optional ; Repeatable]  
The metadata elements we describe below apply independently to each variable in the dataset.

```json
"variables": [
  {
    "file_id": "string",
    "vid": "string",
    "name": "string",
    "labl": "string",
    "var_intrvl": "discrete",
    "var_dcml": "string",
    "var_wgt": 0,
    "loc_start_pos": 0,
    "loc_end_pos": 0,
    "loc_width": 0,
    "loc_rec_seg_no": 0,
    "var_imputation": "string",
    "var_derivation": "string",
    "var_security": "string",
    "var_respunit": "string",
    "var_qstn_preqtxt": "string",
    "var_qstn_qstnlit": "string",
    "var_qstn_postqtxt": "string",
    "var_forward": "string",
    "var_backward": "string",
    "var_qstn_ivuinstr": "string",
    "var_universe": "string",
    "var_sumstat": [],
    "var_txt": "string",
    "var_catgry": [],
    "var_std_catgry": {},
    "var_codinstr": "string",
    "var_concept": [],
    "var_format": {},
    "var_notes": "string"
  }
]
```


- **file_id** [Required ; Not repeatable ; String]  
  A dataset can be composed of multiple data files. The file_id is the name of the data file that contains the variable being documented. This file name should correspond to a file_id listed in the data_files section of the DDI.
- **vid** [Required ; Not repeatable ; String]  
  A unique identifier given to the variable. This can be a system-generated ID, such as a sequential number within each data file. The vid is not the variable name.
- **name** [Required ; Not repeatable ; String]  
  The name of the variable in the data file. The name should be entered exactly as found in the data file (not abbreviated or converted to upper or lower case, as some software applications are case-sensitive). This information can be programmatically extracted from the data file. The variable name is limited to eight characters in some statistical analysis software such as SAS or SPSS.
- **labl** [Optional ; Not repeatable ; String]  
  All variables should have a label that provides a short but clear indication of what the variable contains. Ideally, all variables in a data file will have a different label. File formats like Stata or SPSS often contain variable labels. Variable labels can also be found in data dictionaries in software applications like Survey Solutions or CSPro. Avoid using the question itself as a label (specific elements are available to capture the literal question text; see below). Think of a label as what you would want to see in a tabulation of the variable. Keep in mind that software applications like Stata and others impose a limit on the number of characters in a label (often 80).
- **var_intrvl** [Optional ; Not repeatable ; String]  
  This element indicates whether the intervals between values for the variable are discrete or continuous.
- **var_dcml** [Optional ; Not repeatable ; String]  
  This element refers to the number of decimal points in the values of the variable.
- **var_wgt** [Optional ; Not repeatable ; Numeric]  
  This element, which applies to datasets from sample surveys, indicates whether the variable is a sample weight (value "1") or not (value "0"). Sample weights play an important role in the calculation of summary statistics and sampling errors, and should therefore be flagged.
- **loc_start_pos** [Optional ; Not repeatable ; Numeric]  
  The starting position of the variable when the data are saved in an ASCII fixed-format data file.
- **loc_end_pos** [Optional ; Not repeatable ; Numeric]  
  The end position of the variable when the data are saved in an ASCII fixed-format data file.
- **loc_width** [Optional ; Not repeatable ; Numeric]  
  The length of the variable (the maximum number of characters used for its values) in an ASCII fixed-format data file.
- **loc_rec_seg_no** [Optional ; Not repeatable ; Numeric]  
  Record segment number, deck or card number the variable is located on.
- **var_imputation** [Optional ; Not repeatable ; String]  
  Imputation is the process of estimating values for variables when a value is missing. This element is used to describe the procedure used to impute values when missing.
- **var_derivation** [Optional ; Not repeatable ; String]  
  Used only in the case of a derived variable, this element provides both a description of how the derivation was performed and the command used to generate the derived variable, as well as a specification of the other variables in the study used to generate the derivation. The var_derivation element is used to provide a brief description of this process. As full transparency in derivation processes is critical to build trust and ensure replicability or reproducibility, the information captured in this element will often not be sufficient. A reference to a document and/or computer program can in such cases be provided in this element, and the document/scripts provided as external resources. For example, a variable "TOT_EXP" containing the annualized total household expenditure obtained from a household budget survey may be the result of a complex process of aggregation, de-seasonalization, and more. In such a case, the information provided in the var_derivation element could be: "TOT_EXP was obtained by aggregating expenditure data on all goods and services, available in sections 4 to 6 of the household questionnaire. It contains imputed rental values for owner-occupied dwellings. The values have been deflated by a regional price deflator available in variable REG_DEF. All values are in local currency. Outliers have been fixed by imputation. Details on the calculations are available in Appendix 2 of the Report on Data Processing, and in the Stata program [generate_hh_exp_total.do]."
- **var_security** [Optional ; Not repeatable ; String]  
  This element is used to provide information regarding levels of access, e.g., public, subscriber, need to know.
- **var_respunit** [Optional ; Not repeatable ; String]  
  Provides information regarding who provided the information contained within the variable, e.g., head of household, respondent, proxy, interviewer.
- **var_qstn_preqtxt** [Optional ; Not repeatable ; String]  
  The pre-question texts are the instructions provided to the interviewers and printed in the questionnaire before the literal question. This does not apply to all variables. Do not confuse this with instructions provided in the interviewer's manual.
- **var_qstn_qstnlit** [Optional ; Not repeatable ; String]  
  The literal question is the full text of the question as the enumerator is expected to ask it when conducting the interview. This does not apply to all variables (it does not apply to derived variables).
- **var_qstn_postqtxt** [Optional ; Not repeatable ; String]  
  The post-question texts are instructions provided to the interviewers, printed in the questionnaire after the literal question. The post-question can be used to enter information on skips provided in the questionnaire. This does not apply to all variables. Do not confuse this with instructions provided in the interviewer's manual.  
  With the previous three elements, one should be able to understand how the question was formulated in the questionnaire. In the example below (extracted from the UNICEF Malawi 2006 MICS survey questionnaire), we find:
  - a pre-question: "Ask this question ONLY ONCE for each mother/caretaker (even if she has more children)."
  - a literal question: "Sometimes children have severe illnesses and should be taken immediately to a health facility. What types of symptoms would cause you to take your child to a health facility right away?"
  - a post-question: "Keep asking for more signs or symptoms until the mother/caretaker cannot recall any additional symptoms. Circle all symptoms mentioned. DO NOT PROMPT WITH ANY SUGGESTIONS."
- **var_forward** [Optional ; Not repeatable ; String]  
  Contains a reference to the IDs of possible following questions. This can be used to document forward skip instructions.
- **var_backward** [Optional ; Not repeatable ; String]  
  Contains a reference to the IDs of possible preceding questions. This can be used to document backward skip instructions.
- **var_qstn_ivuinstr** [Optional ; Not repeatable ; String]  
  Specific instructions to the individual conducting the interview. The content will typically be entered by copying/pasting instructions from the interviewer's manual (or from the CAPI application). In cases where the same instructions relate to multiple variables, repeat the same information in the metadata for all these variables. NOTE: In earlier versions of the documentation, due to a typo, the element was named var_qstn_ivulnstr.
- **var_universe** [Optional ; Not repeatable ; String]  
  The universe at the variable level defines the population the question applied to. It reflects skip patterns in a questionnaire. This information can typically be copied/pasted from the survey questionnaire. Try to be as specific as possible. This information is critical for the analyst, as it explains why missing values may be found in a variable. In the example below (from the Malawi MICS 2006 survey questionnaire), the universe for questions ED1 to ED2 will be "Household members age 5 and above", and the universe for question ED3 will be "Household members age 5 and above who ever attended school or pre-school".
- **var_sumstat** [Optional ; Repeatable]  
  The DDI metadata standard provides multiple elements to capture various summary statistics such as minimum, maximum, or mean values (weighted and un-weighted) for each variable (note that frequency statistics for categorical variables are reported in var_catgry, described below). The content of the var_sumstat section can easily be filled out programmatically (using R or Python) or using a specialized DDI metadata editor, which can read the data file and generate the summary statistics.
```json
"var_sumstat": [
  {
    "type": "string",
    "value": null,
    "wgtd": "string"
  }
]
```


  - **type** [Required ; Not repeatable ; String]  
    The type of statistic being shown: mean, median, mode, valid cases, invalid cases, minimum, maximum, or standard deviation.
  - **value** [Required ; Not repeatable ; Numeric]  
    The value of the summary statistic mentioned in type.
  - **wgtd** [Required ; Not repeatable ; String]  
    Indicates whether the statistic reported in value is weighted or not (for variables in sample surveys). Enter "weighted" if weighted, otherwise leave this element empty.

- **var_txt** [Optional ; Not repeatable ; String]  
  This element provides a space to describe the variable in detail. Not all variables require a definition.
- **var_catgry** [Optional ; Repeatable]  
  Variable categories are the lists of codes (and their meaning) that apply to a categorical variable. This block of elements is used to describe the categories (code and label) and optionally capture their weighted and/or un-weighted frequencies.
```json
"var_catgry": [
  {
    "value": "string",
    "label": "string",
    "stats": [
      {
        "type": "string",
        "value": null,
        "wgtd": "string"
      }
    ]
  }
]
```


  - **value** [Required ; Not repeatable ; String]  
    The value here is the code assigned to a variable category. For example, a variable "Sex" could have value 1 for "Male" and value 2 for "Female".
  - **label** [Required ; Not repeatable ; String]  
    The label attached to the code mentioned in value.
  - **stats** [Optional ; Repeatable]  
    This repeatable block of elements will contain the summary statistics for the category (not for the variable) being documented. This may include frequencies, percentages, or cross-tabulation results.
    - **type** [Required ; Not repeatable ; String]  
      The type of the summary statistic. This will usually be "freq" for frequency.
    - **value** [Required ; Not repeatable ; Numeric]  
      The value of the summary statistic, for the corresponding type.
    - **wgtd** [Optional ; Not repeatable ; String]  
      Indicates whether the statistic reported in value is weighted or not (for variables in sample surveys). Enter "weighted" if weighted, otherwise leave this element empty.

- **var_std_catgry** [Optional ; Not repeatable]  
  This element is used to indicate that the codes used for a categorical variable are from a standard international or other classification, like COICOP, ISIC, ISO country codes, etc.
```json
"var_std_catgry": {
  "name": "string",
  "source": "string",
  "date": "string",
  "uri": "string"
}
```


  - **name** [Required ; Not repeatable ; String]  
    The name of the classification, e.g., "International Standard Industrial Classification of All Economic Activities (ISIC), Revision 4".
  - **source** [Required ; Not repeatable ; String]  
    The source of the classification, e.g., "United Nations".
  - **date** [Required ; Not repeatable ; String]  
    The version (typically a date) of the classification used for the study.
  - **uri** [Required ; Not repeatable ; String]  
    A URL to a website where an electronic copy and more information on the classification can be obtained.

- **var_codinstr** [Optional ; Not repeatable ; String]  
  The coder instructions for the variable. These are any special instructions to those who converted information from one form to another (e.g., textual to numeric) for a particular variable.
- **var_concept** [Optional ; Repeatable]  
  The general subject to which the parent element may be seen as pertaining. This element serves the same purpose as the keywords and topic classification elements, but at the variable description level.
```json
"var_concept": [
  {
    "title": "string",
    "vocab": "string",
    "uri": "string"
  }
]
```


  - **title** [Optional ; Not repeatable ; String]  
    The name (label) of the concept.
  - **vocab** [Optional ; Not repeatable ; String]  
    The controlled vocabulary, if any, from which the concept title was taken.
  - **uri** [Optional ; Not repeatable ; String]  
    The location of the controlled vocabulary mentioned in vocab.

- **var_format** [Optional ; Not repeatable]  
  The technical format of the variable in question.
```json
"var_format": {
  "type": "string",
  "name": "string",
  "note": "string"
}
```


  - **type** [Optional ; Not repeatable ; String]  
    Indicates whether the variable is numeric, fixed string, dynamic string, or date. Numeric variables are used to store any number, integer or floating point (decimals). A fixed string variable has a predefined length, which enables the publisher to handle this data type more efficiently. Dynamic string variables can be used to store open-ended questions.
  - **name** [Optional ; Not repeatable ; String]  
    In some cases, may provide the name of the particular, proprietary format used.
  - **note** [Optional ; Not repeatable ; String]  
    Additional information on the variable format.

- **var_notes** [Optional ; Not repeatable ; String]  
  This element is provided to record any additional or auxiliary information related to the specific variable.

**Example for two variables only**
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ...
  ),
  data_files = list(
    # ...
  ),
  
  variables = list(
  
    list(file_id = "",
         vid = "",
         name = "",
         labl = "Main occupation",
         var_intrvl = "discrete",
         var_imputation = "",
         var_respunit = "",
         var_qstn_preqtxt = "",
         var_qstn_qstnlit = "",
         var_qstn_postqtxt = "",
         var_qstn_ivuinstr = "",
         var_universe = "",
         var_sumstat = list(list(type = "", value = "", wgtd = "")),
         var_txt = "",
         var_forward = "",
         var_catgry = list(
           list(value = "", 
                label = "", 
                stats = list(list(type = "", value = "", wgtd = ""),
                             list(type = "", value = "", wgtd = ""),
                             list(type = "", value = "", wgtd = ""))),
           list(value = "", 
                label = "", 
                stats = list(list(type = "", value = "", wgtd = ""),
                             list(type = "", value = "", wgtd = ""),
                             list(type = "", value = "", wgtd = "")))
         ),
         var_std_catgry = list(),
         var_codinstr = "",
         var_concept = list(list(title = "", vocab = "", uri = "")),
         var_format = list(type = "numeric", name = "")
    ),
    
    list(file_id = "",
         vid = "",
         name = "V75_HH_CONS",
         labl = "Household total consumption",
         var_intrvl = "continuous",
         var_dcml = "",
         var_wgt = 0,
         var_imputation = "",
         var_derivation = "",
         var_security = "",
         var_respunit = "",
         var_qstn_preqtxt = "",
         var_qstn_qstnlit = "",
         var_qstn_postqtxt = "",
         var_qstn_ivuinstr = "",
         var_universe = "",
         var_sumstat = list(list(type = "", value = "", wgtd = "")),
         var_txt = "",
         var_codinstr = "",
         var_concept = list(list(title = "", vocab = "", uri = "")),
         var_format = list(type = "", name = "", note = ""),
         var_notes = ""
    )

  ),
  # ...
)
```

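The summary statistics and category frequencies that populate var_sumstat and var_catgry can be computed directly from the data. A minimal sketch, assuming a data frame `data` with hypothetical variables `age`, `sex`, and a sample weight `hh_weight`:

```r
# Hypothetical variables: 'age' (numeric), 'sex' (categorical), 'hh_weight' (sample weight)
x <- data$age
w <- data$hh_weight

var_sumstat <- list(
  list(type = "valid cases", value = sum(!is.na(x)),                    wgtd = ""),
  list(type = "min",         value = min(x, na.rm = TRUE),              wgtd = ""),
  list(type = "max",         value = max(x, na.rm = TRUE),              wgtd = ""),
  list(type = "mean",        value = mean(x, na.rm = TRUE),             wgtd = ""),
  list(type = "mean",        value = weighted.mean(x, w, na.rm = TRUE), wgtd = "weighted")
)

# Un-weighted frequencies for a categorical variable, to be stored in var_catgry
freq <- table(data$sex)
var_catgry <- lapply(names(freq), function(v) {
  list(value = v,
       label = "",   # to be taken from the value labels stored in the data file
       stats = list(list(type = "freq", value = as.numeric(freq[[v]]), wgtd = "")))
})
```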

### 5.4.5 Variable groups

**variable_groups** [Optional ; Repeatable]  
In a dataset, variables are grouped by data file. For the convenience of users, the DDI allows data curators to organize the variables into different, "virtual" groups, by theme, type of respondent, or any other criteria. Grouping variables is optional, and will not impact the way variables are stored in the data files. One variable can belong to more than one group, and a group of variables can contain variables from more than one data file. The variable groups do not necessarily have to cover all variables in the data files. Variable groups can also contain other variable groups.

```json
"variable_groups": [
  {
    "vgid": "string",
    "variables": "string",
    "variable_groups": "string",
    "group_type": "subject",
    "label": "string",
    "universe": "string",
    "notes": "string",
    "txt": "string",
    "definition": "string"
  }
]
```


- **vgid** [Optional ; Not repeatable ; String]  
  A unique identifier (within the DDI metadata file) for the variable group.
- **variables** [Optional ; Not repeatable ; String]  
  The list of variables (variable identifiers - vid) in the group. Enter a list with items separated by a space, e.g., "V21 V22 V30".
- **variable_groups** [Optional ; Not repeatable ; String]  
  The variable groups (vgid) that are embedded in this variable group. Enter a list with items separated by a space, e.g., "VG2 VG5".
- **group_type** [Optional ; Not repeatable ; String]  
  The type of grouping of the variables. A controlled vocabulary should be used. The DDI proposes the following vocabulary: {section, multipleResp, grid, display, repetition, subject, version, iteration, analysis, pragmatic, record, file, randomized, other}. A description of the groups can be found in this document by W. Thomas, W. Block, R. Wozniak and J. Buysse.
- **label** [Optional ; Not repeatable ; String]  
  A short description of the variable group.
- **universe** [Optional ; Not repeatable ; String]  
  The universe can be a population of individuals, households, facilities, organizations, or others, which can be defined by any type of criteria (e.g., "adult males", "private schools", "small and medium-size enterprises", etc.).
- **notes** [Optional ; Not repeatable ; String]  
  Used to provide additional information about the variable group.
- **txt** [Optional ; Not repeatable ; String]  
  A more detailed description of the variable group than the one provided in label.
- **definition** [Optional ; Not repeatable ; String]  
  A brief rationale for the variable grouping.
```r
my_ddi <- list(
  doc_desc = list(
    # ...
  ),
  study_desc = list(
    # ...
  ),
  data_files = list(
    # ...
  ),
  variables = list(
    # ...
  ),
  
  variable_groups = list(
    
    list(vgid = "vg01",
         variables = "",
         variable_groups = "",
         group_type = "subject",
         label = "",
         universe = "",
         notes = "",
         txt = "",
         definition = ""
    ),
    
    list(vgid = "vg02",
         variables = "",
         variable_groups = "",
         group_type = "subject",
         label = "",
         universe = "",
         notes = "",
         txt = "",
         definition = ""
    )
    
  ),
  
  # ...
)
```


### 5.4.6 Provenance

**provenance** [Optional ; Repeatable]  
Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. These elements are NOT part of the DDI metadata standard.

```json
"provenance": [
  {
    "origin_description": {
      "harvest_date": "string",
      "altered": true,
      "base_url": "string",
      "identifier": "string",
      "date_stamp": "string",
      "metadata_namespace": "string"
    }
  }
]
```


- **origin_description** [Required ; Not repeatable]  
  The origin_description elements are used to describe when and from where metadata have been extracted or harvested.

  - **harvest_date** [Required ; Not repeatable ; String]  
    The date and time the metadata were harvested, entered in ISO 8601 format.
  - **altered** [Optional ; Not repeatable ; Boolean]  
    A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Study Description / Title Statement section) will be modified when published in a new catalog.
  - **base_url** [Required ; Not repeatable ; String]  
    The URL from where the metadata were harvested.
  - **identifier** [Optional ; Not repeatable ; String]  
    The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
  - **date_stamp** [Optional ; Not repeatable ; String]  
    The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
  - **metadata_namespace** [Optional ; Not repeatable ; String]  
    @@@@@@@
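The other sections of the schema are illustrated with examples; a comparable sketch for provenance, with hypothetical values (the catalog URL, identifier, and dates are illustrative only), could look as follows:

```r
provenance = list(
  list(
    origin_description = list(
      harvest_date = "2022-04-01T10:30:00Z",              # hypothetical harvest date/time
      altered      = TRUE,
      base_url     = "https://catalog.example.org/api",   # hypothetical source catalog
      identifier   = "SRC-MICS-2006-v01",                 # idno in the source catalog (hypothetical)
      date_stamp   = "2021-11-15"
    )
  )
)
```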
### 5.4.7 Tags

**tags** [Optional ; Repeatable]  
As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. Tags are NOT part of the DDI Codebook standard.

```json
"tags": [
  {
    "tag": "string",
    "tag_group": "string"
  }
]
```


- **tag** [Required ; Not repeatable ; String]  
  A user-defined tag.
- **tag_group** [Optional ; Not repeatable ; String]  
  A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
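A minimal sketch, with hypothetical tags and tag groups:

```r
tags = list(
  list(tag = "poverty",           tag_group = "economy"),            # hypothetical tag and group
  list(tag = "social protection", tag_group = "economy"),
  list(tag = "children",          tag_group = "population group")
)
```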
### 5.4.8 LDA topics

**lda_topics** [Optional ; Not repeatable]

```json
"lda_topics": [
  {
    "model_info": [
      {
        "source": "string",
        "author": "string",
        "version": "string",
        "model_id": "string",
        "nb_topics": 0,
        "description": "string",
        "corpus": "string",
        "uri": "string"
      }
    ],
    "topic_description": [
      {
        "topic_id": null,
        "topic_score": null,
        "topic_label": "string",
        "topic_words": [
          {
            "word": "string",
            "word_weight": 0
          }
        ]
      }
    ]
  }
]
```


We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document (in this case, the "document" is a compilation of elements from the dataset metadata) can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference provides the share that each topic represents in the document. The sum of all represented topics is 1 (100%).

The metadata element lda_topics is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model and the topic composition. The lda_topics element is NOT part of the DDI Codebook standard.

Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the lda_topics elements, it is important to use the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.

The lda_topics element includes the following metadata fields:
  • model_info [Optional ; Not repeatable]
    +Information on the LDA model.
    +
      +
    • source [Optional ; Not repeatable ; String]
      +The source of the model (typically, an organization).
    • +
    • author [Optional ; Not repeatable ; String]
      +The author(s) of the model.
    • +
    • version [Optional ; Not repeatable ; String]
      +The version of the model, which could be defined by a date or a number.
    • +
    • model_id [Optional ; Not repeatable ; String]
      +The unique ID given to the model.
    • +
    • nb_topics [Optional ; Not repeatable ; Numeric]
      +The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    • +
    • description [Optional ; Not repeatable ; String]
      +A brief description of the model.
    • +
    • corpus [Optional ; Not repeatable ; String]
      +A brief description of the corpus on which the LDA model was trained.
    • +
    • uri [Optional ; Not repeatable ; String]
      +A link to a web page where additional information on the model is available.

    • +
  • +
  • topic_description [Optional ; Repeatable]
    +The topic composition of the document.
    +
      +
    • topic_id [Optional ; Not repeatable ; String]
      +The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    • +
    • topic_score [Optional ; Not repeatable ; Numeric]
      +The share of the topic in the document (%).
    • +
    • topic_label [Optional ; Not repeatable ; String]
      +The label of the topic, if any (not automatically generated by the LDA model).
    • +
    • topic_words [Optional ; Not repeatable]
      +The list of N keywords describing the topic (e.g., the top 5 words).
      +
        +
      • word [Optional ; Not repeatable ; String]
        +The word.
      • +
      • word_weight [Optional ; Not repeatable ; Numeric]
        +The weight of the word in the definition of the topic. This is specific to the model, not to a document.
      • +
    • +
  • +
+
lda_topics = list(
+  
+   list(
+  
+      model_info = list(
+        list(source      = "World Bank, Development Data Group",
+             author      = "A.S.",
+             version     = "2021-06-22",
+             model_id    = "Mallet_WB_75",
+             nb_topics   = 75,
+             description = "LDA model, 75 topics, trained on Mallet",
+             corpus      = "World Bank Documents and Reports (1950-2021)",
+             uri         = "")
+      ),
+      
+      topic_description = list(
+      
+        list(topic_id    = "topic_27",
+             topic_score = 32,
+             topic_label = "Education",
+             topic_words = list(list(word = "school",      word_weight = ""),
+                                list(word = "teacher",     word_weight = ""),
+                                list(word = "student",     word_weight = ""),
+                                list(word = "education",   word_weight = ""),
+                                list(word = "grade",       word_weight = ""))),
+        
+        list(topic_id    = "topic_8",
+             topic_score = 24,
+             topic_label = "Gender",
+             topic_words = list(list(word = "women",       word_weight = ""),
+                                list(word = "gender",      word_weight = ""),
+                                list(word = "man",         word_weight = ""),
+                                list(word = "female",      word_weight = ""),
+                                list(word = "male",        word_weight = ""))),
+        
+        list(topic_id    = "topic_39",
+             topic_score = 22,
+             topic_label = "Forced displacement",
+             topic_words = list(list(word = "refugee",     word_weight = ""),
+                                list(word = "programme",   word_weight = ""),
+                                list(word = "country",     word_weight = ""),
+                                list(word = "migration",   word_weight = ""),
+                                list(word = "migrant",     word_weight = ""))),
+                                
+        list(topic_id    = "topic_40",
+             topic_score = 11,
+             topic_label = "Development policies",
+             topic_words = list(list(word = "development", word_weight = ""),
+                                list(word = "policy",      word_weight = ""),
+                                list(word = "national",    word_weight = ""),
+                                list(word = "strategy",    word_weight = ""),
+                                list(word = "activity",    word_weight = "")))
+                                
+      )
+      
+   )
+   
+)
+


+
+
+

5.4.9 Embeddings

+

embeddings [Optional ; Repeatable]
In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. In this case, the text would be a compilation of selected elements of the dataset metadata. The vectors are generated by submitting the text to a pre-trained word embedding model (possibly via an API).

+

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is nevertheless provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model. The embeddings element is NOT part of the DDI Codebook standard.

+
"embeddings": [
+  {
+    "id": "string",
+    "description": "string",
+    "date": "string",
+    "vector": { }
+  }
+]
+


+

The embeddings element contains four metadata fields:

- id [Optional ; Not repeatable ; String]
  A unique identifier of the word embedding model used to generate the vector.
- description [Optional ; Not repeatable ; String]
  A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- date [Optional ; Not repeatable ; String]
  The date the model was trained (or a version date for the model).
- vector [Required ; Not repeatable ; Object] @@@@@@@@ do not offer options
  The numeric vector representing the document, provided as an object (array or string), e.g., [1,4,3,5,7,9].
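For illustration only, an embeddings entry could be captured as follows (using the R list syntax used for other examples in this chapter); the model identifier, description, and vector values are hypothetical, and a real vector would contain one number per dimension of the model:

```r
embeddings = list(
  list(id          = "wb_w2v_100_v1",   # hypothetical identifier of the embedding model
       description = "word2vec model, 100-dimension vectors, trained on a corpus of development reports",
       date        = "2021-06-30",
       vector      = c(0.0213, -0.1547, 0.0891))   # truncated for readability
)
```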

+
+
+

5.4.10 Additional

+

additional [Optional ; Not repeatable]
The additional element is provided to allow users of the API to create their own elements and add them to the schema. It is not part of the DDI Codebook standard. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause the DDI schema validation to fail in NADA.

+
+
+
+

5.5 Generating and publishing DDI metadata

+

The DDI-Codebook metadata standard provides multiple elements to describe the variables in detail. This includes elements that are usually not found in data dictionaries, like summary statistics. Generating this information and manually capturing it in a DDI-compliant metadata file could be tedious, as some datasets contain hundreds or even thousands of variables. Some of the metadata (the list of variables, possibly variable and value labels, and summary statistics) can be automatically extracted from the data files. Specialized metadata editors, which can read the data files, extract metadata, and generate DDI-compliant output, are thus the preferred option to document microdata. Other software applications, such as CSPro and Survey Solutions (CAPI applications), can also generate DDI-compliant variable-level metadata. Stata and R scripts also provide solutions to generate variable-level metadata from data files. We present some of these tools below.

+
+

5.5.1 Using the World Bank Metadata Editor

+

@@@ Update this whole section with proper screenshots and description

+

The World Bank Metadata Editor is compliant with the DDI-Codebook 2.5. It is open source software. [@@@@@ not yet - wait for license] It is a flexible application that can also accommodate other standards and schemas such as the Dublin Core (for documents) and the ISO 19139 (for geospatial data).

+

When importing data files, variable-level metadata is automatically generated including variable names, summary statistics, and variable and value labels if available in the source data files. Additional variable-level metadata can then be added manually.

+


+ +

+

The Metadata Editor provides forms to enter all other related metadata using the DDI-Codebook 2.5 standard, including the study description and a description of external resources.

+

The World Bank Metadata Editor exports the metadata (for microdata) in DDI-Codebook 2.5 format (XML) and in JSON format. Metadata related to external resources can be exported to a Dublin Core file. A transformation of the metadata files into a PDF document is also implemented.

+


+ +

+
+
+

5.5.2 Using R or Python

+

DDI-compliant metadata can also be generated and published in a NADA catalog programmatically. Programming languages like R and Python provide much flexibility to generate such metadata, including variable-level metadata.

+

We provide here an example where a dataset is available in Stata format. We use two data files from the Core Welfare Indicator Questionnaire (CWIQ) survey conducted in Liberia in 2007 (the full dataset has 12 data files; the extension of the script to the full dataset would be straightforward). One data file, named “sec_abcde_individual.dta”, contains individual-level variables. The other data file, named “sec_fgh_household.dta”, contains household-level variables. The content of the Stata files is as follows:

+
+
+ +
+


+
+

When generating the variable-level metadata, we want to extract the value labels from the data files, keeping the original [code - value label] pairs as they are in the original dataset. For example, if the Stata dataset has codes 1 = Male and 2 = Female for variable sex, we do not want them to be changed, for example to 1 = Female and 2 = Male, by the data import process. Import functions in R packages do not always preserve the code/label pairs; some convert categorical data into factors and assign codes and value labels independently of the original coding.
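The haven package used in the script below preserves the original Stata value labels in a labels attribute, so the original code/label pairs can be extracted unchanged. A minimal check, assuming the data file is in the working directory:

```r
library(haven)

ind <- read_dta("sec_abcde_individual.dta")
attr(ind$b1, "labels")   # e.g., returns the original coding, such as Male = 1, Female = 2
```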

+
+
# In http://catalog.ihsn.org/catalog/1523
+
+library(nadar)
+library(haven)
+library(rlist)
+library(stringr)
+
+# ----------------------------------------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+id = "LBR_CWIQ_2007"
+
+setwd("D:/LBR_CWIQ_2007")
+
+thumb = "liberia_cwiq.JPG"  # This image will be used as a thumbnail
+
+# The literal questions are only found in a PDF file; we extract them.
+# If the list of questions had been available in MS-Excel format or equivalent, we
+# would import it from that file.
+literal_questions = list(
+  b1 = "Is [NAME] male or female?",
+  b2 = "How long has [NAME] been away in the last 12 months?",
+  b3 = "What is [NAME]'s relationship to the head of household?",
+  b4 = "How old was [NAME] at last birthday?",
+  b5 = "What is [NAME]'s marital status?",
+  b6 = "Is [NAME]'s father alive?",
+  b7 = "Is [NAME]'s father living in the household?",
+  b8 = "Is [NAME]'s mother alive?",
+  b9 = "Is [NAME]'s mother living in the household?",
+  c1 = "Can [NAME] read and write in any language?",
+  c2 = "Has [NAME] ever attended school?",
+  c3 = "What is the highest grade [NAME] completed?",
+  c4 = "Did [NAME] attend school last year?",
+  c5 = "Is [NAME] currently in school?",
+  c6 = "What is the current grade [NAME] is attending?",
+  c7 = "Who runs the school [NAME] is attending?",
+  c8 = "Did [NAME] have any problems with school?",
+  c9 = "Why is [NAME] not currently in school?",
+  c10= "Why has [NAME] not started school?"
+  # Etc. (we do not include all questions in the example)
+)  
+
+# Generate file-level and variable-level metadata for the two data files
+
+list_data_files = c("sec_abcde_individual.dta", "sec_fgh_household.dta")
+
+list_var = list()
+list_df = list()
+vno = 1
+fno = 1
+
+for (datafile in list_data_files) {
+  
+  data <- read_dta(datafile)
+  
+  # Generate file-level metadata
+  
+  # Create a file identifier (sequential)
+  fid = paste0("F", str_pad(fno, 2, pad = "0"))
+  fno = fno + 1
+  
+  # Add core metadata
+  case_n = nrow(data)  # Nb of observations in the data file
+  var_n = length(data) # Nb of variables in the data file
+  df = list(file_id = fid, 
+            file_name = datafile, 
+            case_count = case_n, 
+            var_count = var_n)
+  list_df = list.append(list_df, df)
+  
+  # Generate variable-level metadata
+  
+  for(v in 1:length(data)) {
+    
+    # Create a variable identifier (sequential)
+    vid = paste0("V", str_pad(vno, 4, pad = "0"))
+    vno = vno + 1
+    
+    # Variable name and literal question
+    vname = names(data[v])
+    question = as.character(literal_questions[vname])
+    if(is.null(question)) question = ""
+    
+    # Extract the variable label (trim leading and trailing white spaces)
+    var_lab <- trimws(attr(data[[v]], 'label'))
+    if(is.null(var_lab)) var_lab = ""
+    
+    # Variable-level summary statistics
+    vval = sum(!is.na(data[[v]]))
+    vmis = sum(is.na(data[[v]]))
+    vmin = as.character(min(data[[v]], na.rm = TRUE))
+    vmax = as.character(max(data[[v]], na.rm = TRUE))  
+    vstats = list(
+      list(type = "valid", value = vval),
+      list(type = "system missing", value = vmis),
+      list(type = "minimum", value = vmin),
+      list(type = "maximum", value = vmax)
+    )
+    
+    # Extract the (original) codes and value labels and calculate frequencies
+    freqs = list()
+    val_lab <- attr(data[[v]], 'labels')
+    if(!is.null(val_lab) & typeof(data[[v]]) != "character") {
+      freq_tbl = table(data[[v]])
+      for (i in 1:length(val_lab)) {
+        f = list(value = as.character(val_lab[i]), 
+                 labl  = as.character(names(val_lab[i])), 
+                 stats = list(
+                   list(type = "count", 
+                        value = sum(data[[v]] == val_lab[i], na.rm = TRUE)
+                   )
+                 )
+        )
+        freqs = list.append(freqs, f)             
+      }
+    } 
+    
+    # Compile the variable-level metadata
+    list_v = list(
+      file_id = fid,
+      vid = vid,
+      name = vname,
+      labl = var_lab,
+      var_qstn_qstnlit = question,
+      var_sumstat = vstats,
+      var_catgry = freqs)
+    
+    # Add to the list of variables already documented    
+    list_var = list.append(list_var, list_v)
+    
+  }
+  
+}
+
+# Generate the DDI-compliant metadata
+
+cwiq_ddi_metadata <- list(
+  
+  doc_desc = list(
+    producers = list(
+      list(name = "WB consultants")
+    ), 
+    prod_date = "2008-02-19"
+  ),
+  
+  study_desc = list(
+    
+    title_statement = list(
+      idno  = id,
+      title = "Core Welfare Indicators Questionnaire 2007"
+    ),
+    
+    authoring_entity = list(
+      list(name = "Liberia Institute of Statistics and Geo_Information Services")
+    ),
+    
+    study_info = list(
+      
+      coll_dates = list(
+        list(start = "2007-08-06", end = "2007-09-22")
+      ),
+      
+      nation = list(
+        list(name = "Liberia", abbreviation = "LBR")
+      ),
+      
+      abstract = "The Government of Liberia (GoL) is committed to producing a Poverty Reduction Strategy Paper (PSRP). To do this, the GoL will need to undertake an analysis of qualitative and quantitative sources to understand the nature of poverty ('Where are we?'); to develop a macro-economic framework, and conduct broad based and participatory consultations to choose objectives, define and prioritize strategies ('Where do we want to go? How far can we get?); and to develop a monitoring and evaluation system ('How will we know when we get there?). The analysis of the nature of poverty, the Poverty Profile, will establish the overall rate of poverty incidence, identifying the poor in relation to their location, habits, occupations, means of access to and use of government services, and their living standards in regard to health, education, nutrition. Given the capacity constraints it has been agreed that this information will be collected in a single visit survey using the Core Welfare Indicators Questionnaire (CWIQ) survey with an additional module to cover household income, expenditure and consumption. This will provide information to estimate welfare levels & poverty incidence, which can be combined and analyzed with the sectoral information from the main CWIQ questionnaire. While countries with more capacity usually do a household income, expenditure and consumption survey over 12 months, the single visit approach has been used in a number of countries (mainly in West Africa) fairly successfully.",
+      
+      geog_coverage = "National"
+      
+    ),
+    
+    method = list(
+      
+      data_collection = list(
+        
+        coll_mode = "face to face interview",
+        
+        sampling_procedure = "The CWIQ survey will be carried out on a sample of 3,600 randomly selected households located in 300 randomly selected clusters. This was the same basic sample used by the 2007 Liberian DHS. However, for Monrovia, a new listing was carried out, new EAs were chosen, and the sampled households were chosen from that list. For rural areas, the same EAs were used but a new sample selection of households was drawn. Any household that may have participated in the LDHS was systematically eliminated. Twelve (12) households were selected in each of the 300 EAs using systematic sampling. The total number of households and number of EAs sampled in each County are given in the table below. (More on the Sampling under the External Resources).",
+        
+        coll_situation = "On average, the interview process lasted about 2 hours and 45 minutes. The Income and Expenditure questionnaire alone took about 2 hours to complete. On many occasions, the questionnaire was completed in two sittings."
+        
+      )
+      
+    )
+    
+  ),
+  
+  # Information of data files
+  data_files = list_df,  
+  
+  # Information on variables
+  variables = list_var
+  
+)
+
+# Publish the metadata in the NADA catalog
+
+microdata_add(
+  idno = id,
+  repositoryid = "central",
+  access_policy = "licensed",
+  published = 1,
+  overwrite = "yes",
+  metadata = cwiq_ddi_metadata,
+  thumbnail = thumb
+)
+
+# Add links to data and documents
+
+external_resources_add(
+  title = "Liberia, CWIQ 2007, Dataset in Stata 15 format",
+  idno = id,
+  dcdate = "2007",
+  language = "English",
+  country = "Liberia",
+  dctype = "dat/micro",
+  file_path = "LBR_CWIQ_2007_Stata15.zip",
+  description = "Liberia CWIQ dataset in Stata 15 format (2 data files)",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "Liberia, CWIQ 2007, Dataset in SPSS Windows format",
+  idno = id,
+  dcdate = "2007",
+  language = "English",
+  country = "Liberia",
+  dctype = "dat/micro",
+  file_path = "LBR_CWIQ_2007_Stata15.zip",
+  description = "Liberia CWIQ dataset in SPSS for Windows [.sav] format (2 data files)",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "CWIQ 2007 Questionnaire",
+  idno = id,
+  dcdate = "2007",
+  language = "English",
+  country = "Liberia",
+  dctype = "doc/ques",
+  file_path = "LCWIQ2007_.pdf",
+  overwrite = "yes"
+)
+

After running the script, the metadata (and links) are available in the NADA catalog.


Chapter 6 Geographic data and services


6.1 Background

+

To make geographic information discoverable and to facilitate its dissemination and use, the ISO Technical Committee on Geographic Information/Geomatics (ISO/TC211) created a set of metadata standards to describe geographic datasets (ISO 19115), geographic data structures (ISO 19115-2 / ISO 19110), and geographic data services (ISO 19119). These standards have been “unified” into a common XML specification (ISO 19139). This set of standards, known as the ISO 19100 series, served as the cornerstone of multiple initiatives to improve the documentation and management of geographic information, such as the Open Geospatial Consortium (OGC), the US Federal Geographic Data Committee (FGDC), the European INSPIRE directive, and more recently the Research Data Alliance (RDA), among others.

+

The ISO 19100 standards have been designed to cover the large scope of geographic information. The level of detail they provide goes beyond the needs of most data curators. What we present in this Guide is a subset of the standards, which focuses on what we consider to be the core requirements to describe and catalog geographic datasets and services. References and links to resources where more detailed information can be found are provided in the appendix.

+
+
+

6.2 Geographic information metadata standards

+

Geographic information metadata standards cover three types of resources: i) datasets, ii) data structure definitions, and iii) data services. Each one of these three components is the object of a specific standard. To support their implementation, a common XML specification (ISO 19139) covering the three standards has been developed. The geographic metadata standard is however, by far, the most complex and “specialized” of all schemas described in this Guide. Its use requires expertise not only in data documentation, but also in the use of geospatial data. We provide in this chapter some information that readers who are not familiar with geographic data may find useful to better understand the purpose and use of the geographic metadata standards.

+
+

6.2.1 Documenting geographic datasets - The ISO 19115 standard

+

Geographic datasets “identify and depict geographic locations, boundaries and characteristics of features on the surface of the earth. They include geographic coordinates (e.g., latitude and longitude) and data associated to geographic locations (…)”. (Source: https://www.fws.gov/gis/)

+

The ISO 19115 standard defines the structure and content of the metadata to be used to document geographic datasets. The standard is split into two parts covering:

+
1. vector data (ISO 19115-1), and
2. raster data, including imagery and gridded data (ISO 19115-2).

Vector and raster spatial datasets are built with different structures and formats. The following summarizes how these two categories differ and how they can be processed using R. The descriptions of vector and raster data provided in this chapter are adapted from:

- https://gisgeography.com/spatial-data-types-vector-raster/
- https://datacarpentry.org/organization-geospatial/02-intro-vector-data/index.html

+

Vector data

+

Vector data are comprised of points, lines, and polygons (areas).

+

A vector point is defined by a single x, y coordinate. Generally, vector points are a latitude and longitude with a spatial reference frame. A point can for example represent the location of a building or facility. When multiple dots are connected in a set order, they become a vector line with each dot representing a vertex. Lines usually represent features that are linear in nature, like roads and rivers. Each bend in the line represents a vertex that has a defined x, y location. When a set of 3 or more vertices is joined in a particular order and closed (i.e. the first and last coordinate pairs are the same), it becomes a polygon. Polygons are used to show boundaries. They will typically represent lakes, oceans, countries and their administrative subdivisions (provinces, states, districts), building footprints, or outline of survey plots. Polygons have an area (which will correspond to the square-footage for a building footprint, to the acreage for an agricultural plot, etc.)
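A minimal sketch with the sf package (used in the examples below) shows how these three geometry types are built from coordinates; the coordinate values are arbitrary:

```r
library(sf)

pt   <- st_point(c(89.64, 27.47))                                    # a single x (longitude), y (latitude) pair
ln   <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))              # vertices connected in a set order
poly <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))  # a closed ring (first = last coordinate)
```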

+

Vector data are often provided in one of the following file formats:

+
    +
  • ESRI Shapefile (actually a zip set of files; not standard and limited as it is based on an outdated DBF format, but still widely used);
  • +
  • ESRI GeoDatabase file (not a standard format, but widely used);
  • +
  • GML: the Official OGC geospatial standard format, used by standard spatial data services;
  • +
  • GeoPackage: the OGC recommended standard for handling vector data;
  • +
  • GeoJSON: another OGC standard, often used when a service is associated to the data;
  • +
  • KML/KMZ: Keyhole Markup Language, an XML notation for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers;
  • +
  • CSV file: Comma-separated values files, with geometries provided in OGC Well-Known-Text (WKT);
  • +
  • OSM: An XML-formatted file containing “nodes” (points), “ways” (connections), and “relations” from OpenStreetMap format.
  • +
+ + + + + + +
Some examples
+

EXAMPLE 1

+

The figure below provides an example of vector data extracted from Open Street Map for a part of the city of Thimphu, Bhutan (as of 17 May, 2021).

+
+ +
+

The content of this map can be exported as an OSM file.

+
+ +
+

Multiple applications will allow users to read and process OSM files, including open source software applications like QGIS or the R packages sf and osmdata

+
# Example of an R script that reads and shows the content of the map.osm file
+
+library(sf)
+
+# List the layers contained in the OSM file
+lyrs <- st_layers("map.osm")
+
+# Read the layers as sf objects
+points   <- st_read("map.osm", layer = "points")
+lines    <- st_read("map.osm", layer = "lines")
+polygons <- st_read("map.osm", layer = "multipolygons")
+

EXAMPLE 2

+

In this second example, we use the R sf (Simple Features) package to read a shape (vector) file of refugee camps in Bangladesh, downloaded from the Humanitarian Data Exchange (HDX) website:

+
# Load the sf package and utilities 
+
+library(sf)
+library(utils)
+
+# Download and unzip the shape file (published by HDX as a compressed zip format)
+
+setwd("E:/my_data")
+url <- "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip"
+download.file(url, destfile = "200908_RRC_Outline_Camp_AL1.zip", mode = "wb")
+unzip("E:/my_data/200908_RRC_Outline_Camp_AL1.zip")
+
+# Read the file and display core information about its content
+
+al1 <- st_read("./200908_RRC_Outline_Camp_AL1/200908_RRC_Outline_Camp_AL1.shp")
+print(al1)
+str(al1)
+plot(al1)
+
+# ------------------------------
+# Output of the 'print' command:
+# ------------------------------
+
+# Simple feature collection with 35 features and 14 fields
+# geometry type:  MULTIPOLYGON
+# dimension:      XY
+# bbox:           xmin: 92.12973 ymin: 20.91856 xmax: 92.26863 ymax: 21.22292
+# geographic CRS: WGS 84
+# First 10 features:
+#       District Upazila      Settlement        Union         Name_Alias    SSID SMSD__Cnam            NPM_Name Area_Acres PeriMe_Met
+# 1  Cox's Bazar   Ukhia Collective site Palong Khali Bagghona-Putibonia CXB-224    Camp 16 Camp 16 (Potibonia)  130.57004   4136.730
+# 2  Cox's Bazar   Ukhia Collective site Palong Khali               <NA> CXB-203   Camp 02E            Camp 02E   96.58179   4803.162
+# 3  ...
+# 
+#        Camp_Name         Area_SqM         Latitude        Longitude                       geometry
+# 1        Camp 16  528946.95881724 21.1563813298438 92.1490685817901 MULTIPOLYGON (((92.15056 21...
+# 2        Camp 2E 391267.799744003 21.2078084302778 92.1643360947381 MULTIPOLYGON (((92.16715 21...
+# 3        ...
+
+# Output of 'str' command:
+
+# Classes 'sf' and 'data.frame':    35 obs. of  15 variables:
+#  $ District  : chr  "Cox's Bazar" "Cox's Bazar" "Cox's Bazar" "Cox's Bazar" ...
+#  $ Upazila   : chr  "Ukhia" "Ukhia" "Ukhia" "Ukhia" ...
+#  $ Settlement: chr  "Collective site" "Collective site" "Collective site" "Collective site" ...
+#  $ Union     : chr  "Palong Khali" "Palong Khali" "Palong Khali" "Raja Palong" ...
+#  $ Name_Alias: chr  "Bagghona-Putibonia" NA "Jamtoli-Baggona" "Kutupalong RC" ...
+#  $ SSID      : chr  "CXB-224" "CXB-203" "CXB-223" "CXB-221" ...
+#  $ SMSD__Cnam: chr  "Camp 16" "Camp 02E" "Camp 15" "Camp KRC" ...
+#  $ NPM_Name  : chr  "Camp 16 (Potibonia)" "Camp 02E" "Camp 15 (Jamtoli)" "Kutupalong RC" ...
+#  $ Area_Acres: num  130.6 96.6 243.3 95.7 160.4 ...
+#  $ PeriMe_Met: num  4137 4803 4722 3095 4116 ...
+#  $ Camp_Name : chr  "Camp 16" "Camp 2E" "Camp 15" "Kutupalong RC" ...
+#  $ Area_SqM  : chr  "528946.95881724" "391267.799744003" "985424.393160958" "387729.666427279" ...
+#  $ Latitude  : chr  "21.1563813298438" "21.2078084302778" "21.1606399787906" "21.2120281895357" ...
+#  $ Longitude : chr  "92.1490685817901" "92.1643360947381" "92.1428956454661" "92.1638095873048" ...
+#  $ geometry  :sfc_MULTIPOLYGON of length 35; first list element: List of 1
+
+# This information can be extracted and used to document the data
+

The output of the script shows that the shape file contains 35 features (or “objects”; in this case each object represents a refugee camp) and 14 fields (attributes and variables; including information like the camp name, administrative region, surface area, and more) related to each object.

+

The geometry type (multipolygon) and dimension (XY) provide information on the type of object. “All geometries are composed of points. Points are coordinates in a 2-, 3- or 4-dimensional space. All points in a geometry have the same dimensionality. In addition to X and Y coordinates, there are two optional additional dimensions:

+
    +
  • a Z coordinate, denoting the altitude;
  • +
  • an M coordinate (rarely used), denoting some measure that is associated with the point, rather than with the feature as a whole (in which case it would be a feature attribute); examples could be time of measurement, or measurement error of the coordinates.
  • +
+

The four possible cases then are:

+
    +
  • two-dimensional points refer to x and y, easting and northing, or longitude and latitude, referred to as XY
  • +
  • three-dimensional points as XYZ
  • +
  • three-dimensional points as XYM
  • +
  • four-dimensional points as XYZM (the third axis is Z, the fourth is M)
  • +
+

The following seven simple feature types are the most common:

| Type | Description |
|---|---|
| POINT | zero-dimensional geometry containing a single point |
| LINESTRING | sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry |
| POLYGON | geometry with a positive area (two-dimensional); sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring |
| MULTIPOINT | set of points; a MULTIPOINT is simple if no two Points in the MULTIPOINT are equal |
| MULTILINESTRING | set of linestrings |
| MULTIPOLYGON | set of polygons |
| GEOMETRYCOLLECTION | set of geometries of any type except GEOMETRYCOLLECTION |

The remaining ten geometries are rarer : CIRCULARSTRING, COMPOUNDCURVE, CURVEPOLYGON, MULTICURVE, MULTISURFACE, CURVE, SURFACE, POLYHEDRALSURFACE, TIN, TRIANGLE (see https://r-spatial.github.io/sf/articles/sf1.html).

+

The geographic CRS line informs us about the coordinate reference system (CRS). Coordinates can only be placed on the Earth's surface when their CRS is known; this may be a spheroid CRS such as WGS 84, a projected two-dimensional (Cartesian) CRS such as a UTM zone or Web Mercator, or a CRS in three dimensions or including time. In our example above, the CRS is WGS 84 (World Geodetic System 84), a standard used in cartography, geodesy, and satellite navigation, including GPS.
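With sf, the CRS of a layer can be inspected and, if needed, the data can be reprojected. A small sketch reusing the al1 object read in Example 2 above (the EPSG code is only an illustration):

```r
st_crs(al1)                          # reports the CRS of the layer (WGS 84 in this example)
al1_utm <- st_transform(al1, 32646)  # e.g., reproject to a projected CRS (UTM zone 46N)
```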

+

The bbox is the bounding box.

+

Information on a subset (top 10 - only 2 shown above) of the features is displayed in the output of the script, with the list of the 14 available fields.
+The plot(al1) command in R produces a visualization of the numeric fields in the data file:

+
+ +
+

All this information represents important components of the metadata, which we will want to capture, enrich, and catalog (together with additional information) using the ISO metadata standard. “Enriching” (or “augmenting”) the metadata will consist of providing more contextual information (who produced the data, when, why, etc.) and additional information on the features (e.g., what does the variable ’SMSD__Cnam’ represent?).

+

Raster data

+

Raster data are made up of pixels, also referred to as grid cells. Satellite imagery and other remote sensing data are raster datasets. Grid cells in raster data are usually (but not necessarily) regularly-spaced and square. Data stored in a raster format is arranged in a grid without storing the coordinates of each cell (pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in a grid.
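A small worked illustration of this principle, with arbitrary values: given the grid origin and the cell size, the coordinates of any cell centre can be computed rather than stored.

```r
xmin <- 33.0; ymax <- 15.0; res <- 0.001   # grid origin (top-left corner) and cell size (arbitrary values)
col <- 10; row <- 4                        # position of a cell in the grid (1-based indices)

x_centre <- xmin + (col - 0.5) * res       # 33.0095
y_centre <- ymax - (row - 0.5) * res       # 14.9965
```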

+

Any given pixel in a grid stores one or more values (in one or more bands). For example, each cell (pixel) value in a satellite image has a red, a green, and a blue value. Cells in raster data could represent anything from elevation, temperature, rainfall, land cover, population density, or others. (Source: https://worldbank.github.io/OpenNightLights/tutorials/mod2_1_data_overview.html)

+

Raster data can be discrete or continuous. Discrete rasters have distinct themes or categories. For example, one grid cell can represent a land cover class, or a soil type. In a discrete raster, each thematic class can be discretely defined (usually represented by an integer) and distinguished from other classes. In other words, each cell is definable and its value applies to the entire area of the cell. For example, the value 1 for a class might indicate “urban area”, value 2 “forest”, and value 3 “others”. Continuous (or non-discrete) rasters are grid cells with gradual changing values, which could for example represent elevation, temperature, or an aerial photograph.
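A toy example of a discrete raster, built with the raster package used later in this chapter (the cell values and class labels are hypothetical):

```r
library(raster)

# A 3 x 3 grid where 1 = urban area, 2 = forest, and 3 = other
r <- raster(nrows = 3, ncols = 3, vals = c(1, 2, 2, 3, 1, 2, 3, 3, 1))
plot(r)
```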

+

The difference between vector and raster data, and between different types of vectors, is clearly illustrated in the figure below taken from the World Bank’s Light Every Night GitHub repository.

+
+ +
+

In GIS applications, vector and raster data are often combined into multi-layer datasets, as shown in the figure below extracted from the County of San Bernardino (US) website.

+
+ +
+

We may occasionally want to convert raster data into vector data. For example, a building footprint layer (vector data, composed of polygons) can be derived from a satellite image (raster data). Such conversions can be implemented in a largely automated manner using machine learning algorithms.

+
+ +
+

Source: https://blogs.bing.com/maps/2019-09/microsoft-releases-18M-building-footprints-in-uganda-and-tanzania-to-enable-ai-assisted-mapping

+

Raster data are often provided in one of the following file formats:

+ +

GeoTIFF is a popular file format for raster data. A Tagged Image File Format (TIFF or TIF) is a file format designed to store raster-type data. A GeoTIFF file is a TIFF file that contains specific tags to store structured geospatial metadata including:

+
    +
  • Spatial extent: the area coverage of the file
  • +
  • Coordinate reference system: the projection / coordinate reference system used
  • +
  • Resolution: the spatial extent of each pixel (spatial resolution)
  • +
  • Number of layers: number of layers or bands available in the file
  • +
+

TIFF files can be read using (among other options) the R package raster or the Python library rasterio.

+

GeoTIFF files can also be provided as Cloud Optimized GeoTIFFS (COGs). In COGs, the data are structured in a way that allows them to be shared via web services which allow users to query, visualize, or download a user-defined subset of the content of the file, without having to download the entire file. This option can be a major advantage, as geoTIFF files generated by remote sensing/satellite imagery can be very large. Extracting only the relevant part of a file can save significant time and storage space.

+ + + + + + +
Some examples
+

EXAMPLE 1

+

The first example below shows the spatial distribution of the Ethiopian population in 2020. The data file was downloaded from the WorldPop website on 17 May 2021.

+
+ +
+
# Load the raster R package 
+
+library(raster)
+
+# Download a TIF file (spatial distribution of population, Ethiopia, 2020) - 62Mb
+
+setwd("E:/my_data")
+url <- "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif"
+file_name = basename(url)
+download.file(url, destfile = file_name, mode = 'wb')
+
+# Read the file and display core information about its content
+
+my_raster_file <- raster(file_name)
+print(my_raster_file)
+
+# ------------------------------
+# Output of the 'print' command:
+# ------------------------------
+
+# dimensions : 13893, 17983, 249837819  (nrow, ncol, ncell)
+# resolution : 0.0008333333, 0.0008333333  (x, y)
+# extent     : 32.99958, 47.98542, 3.322084, 14.89958  (xmin, xmax, ymin, ymax)
+# crs        : +proj=longlat +datum=WGS84 +no_defs 
+# source     : E:/my_data/eth_ppp_2020_constrained.tif 
+# names      : eth_ppp_2020_constrained 
+# values     : 1.36248, 847.9389  (min, max)
+

This output shows that the TIF file contains one layer of cells, forming an image of 13,893 by 17,983 cells. It also provides information on the projection system (datum): WGS 84 (World Geodetic System 84). This information (and more) will be part of the ISO-compliant metadata we want to generate to document and catalog a raster dataset.

+

EXAMPLE 2

+

In the second example, we demonstrate the advantages of Cloud Optimized GeoTIFFS (COGs). We extract information from the World Bank Light Every Night repository.

+
# Load 'aws.s3' package to access the Amazon Web Services (AWS) Simple Storage Service (s3)
+library("aws.s3")
+
+# Load 'raster' package to read the target GeoTiFF
+library("raster")
+
+# List files for World Bank bucket 'globalnightlight', setting a max number of items
+contents <- get_bucket(bucket = 'globalnightlight', max = 10000)
+
+# Get_bucket_df is similar to 'get_bucket' but returns the list as a dataframe
+contents <- get_bucket_df(bucket = 'globalnightlight')
+
+# Access DMSP-OLS data for satellite F12 in 1995
+F12_1995 <- get_bucket(bucket = 'globalnightlight', 
+                       prefix = "F121995")
+
+# As data.frame, with all objects listed
+F12_1995_df <- get_bucket_df(bucket = 'globalnightlight', 
+                             prefix = "F121995", 
+                             max = Inf)
+# Number of objects
+nrow(F12_1995_df) 
+
+# Save the object
+filename <- "F12199501140101.night.OIS.tir.co.tif"
+save_object(bucket = 'globalnightlight', 
+            object = "F121995/F12199501140101.night.OIS.tir.co.tif", 
+            file = filename)
+
+# Read it with raster package
+rs <- raster(filename)
+


+
+
+

6.2.2 Describing data structures - The ISO 19115-2 and ISO 19110 standards

+

The ISO 19115-2 provides the necessary metadata elements to describe the structure of raster data. The ISO 19115-1 standard does not provide all necessary metadata elements needed to describe the structure of vector datasets. The description of data structures for vector data (also referred to as feature types) is therefore often omitted. The ISO 19110 standard solves that issue, by providing the means to document the structure of vector datasets (column names and definitions, codes and value labels, measurement units, etc.), which will contribute to making the data more discoverable and usable.

+
+
+

6.2.3 Describing data services - The ISO 19119 standard

+

More and more data are disseminated not in the form of datasets, but as data services via web applications. “Geospatial services provide the technology to create, analyze, maintain, and distribute geospatial data and information.” (https://www.fws.gov/gis/) The ISO 19119 standard provides the elements to document such services.

+
+
+

6.2.4 Unified metadata specification - The ISO/TS 19139 standard

+

The three metadata standards previously described - ISO 19115 for vector and raster datasets, ISO 19110 for vector data structures, and ISO 19119 for data services - provide a set of concepts and definitions useful to describe geographic information. To facilitate their practical implementation, a digital specification, which defines how this information is stored and organized in an electronic metadata file, is required. The ISO/TS 19139 standard, an XML specification of ISO 19115/19110/19119, was created for that purpose.

+

The ISO/TS 19139 is a standard used worldwide to describe geographic information. It is the backbone for the implementation of INSPIRE dataset and service metadata in the European Union. It is supported by a wide range of tools, including desktop applications such as Quantum GIS (QGIS) and ESRI ArcGIS, OGC-compliant metadata catalogs (e.g., GeoNetwork), and geographic servers (e.g., GeoServer).

+

ISO 19139-compliant metadata can be generated and edited using specialized metadata editors such as CatMDEdit or QSphere, or using programmatic tools like Java Apache SIS or the R packages geometa and geoflow, among others.
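As a minimal sketch of the programmatic route, the geometa package can create an ISO 19115 metadata object and encode it as ISO 19139 XML (the identifier below is hypothetical; many other elements can be set before encoding):

```r
library(geometa)

md <- ISOMetadata$new()                   # create an empty ISO 19115 metadata object
md$setFileIdentifier("my-dataset-0001")   # hypothetical unique identifier
xml <- md$encode()                        # encode the object as ISO 19139 XML
```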

+

The ISO 19139 specification is complex. To enable and simplify its use in our NADA cataloguing application, we produced a JSON version of (part of) the standard. We selected the elements we considered most relevant for our purpose, and organized them into the JSON schema described below. For data curators with limited expertise in XML and geographic data documentation, this JSON schema will make the production of metadata compliant with the ISO 19139 standard easier.

+
+
+
+

6.3 Schema description

+

Main structure (describe) @@@@

+


+
{
+ "repositoryid": "string",
+ "published": 0,
+ "overwrite": "no",
+ "metadata_information": {},
+ "description": {},
+ "provenance": [],
+ "tags": [],
+ "lda_topics": [],
+ "embeddings": [],
+ "additional": { }
+}
+


+
+

6.3.1 Introduction to ISO19139

+

Geographic metadata (for both datasets and services) should include core metadata properties, and metadata sections aiming to describe specific aspects of the resource (e.g., resource identification or resource distribution).

+

The content of some metadata elements is controlled by codelists (or controlled vocabularies). A codelist is a pre-defined set of values. The content of an element controlled by a codelist should be selected from that list. This may for example apply to the element “language”, whose content should be selected from the ISO 639 list of codes for language names, instead of being free text. The ISO 19139 suggests but does not impose codelists. It is highly recommended to make use of the suggested codelists (or of specific codelists that may be promoted by agencies or partnerships).

+

Some metadata elements (referred to as common elements) of the ISO 19139 can be repeated in different parts of a metadata file. For example, a standard set of fields is provided to describe a contact, a citation, or a file format. Such common elements can be used in multiple locations of a metadata file (e.g., to provide information on who the contact person is for information on data quality, on data access, on data documentation, etc.)

+

In the following sections, we first present the common elements, then the elements that form the core metadata properties (information on the metadata themselves), followed by the elements from the main metadata sections used to describe the data, and finally the features catalog elements which are used to document attributes and variables related to vector data (ISO 19110).

+
+
+

6.3.2 Common sets of elements

+

Common elements are blocks of metadata fields that can appear in multiple locations of a metadata file. For example, information on contact person(s) or organization(s) may have to be provided in the section of the file where we document the production and maintenance of the data, where we document the production and maintenance of the metadata, where we document the distribution and terms of use of the data, etc. Other types of common elements include online and offline resources, file formats, citations, keywords, constraints, and extent. We describe these sets of elements below.

+
+

6.3.2.1 Contact / Responsible party

+

The ISO 19139 specification provides a structured set of metadata elements to describe a contact. A contact is the party (person or organization) responsible for a specific task. The following set of elements can be used to describe a contact:

| Element | Description |
|---|---|
| individualName | Name of the individual |
| organisationName | Name of the organization |
| positionName | Position of the individual in the organization |
| contactInfo | Contact information, divided into three sections: phone (including either voice or facsimile numbers); address, handling the physical address elements (deliveryPoint, city, postalCode, country) and the contact e-mail (electronicMailAddress); and onlineResource, e.g., the URL of the organization website (which includes linkage, name, description, protocol, and function; see below) |
| role | Role of the person/organization. A recommended controlled vocabulary is provided by ISO 19139, with the following options: {resourceProvider, custodian, owner, sponsor, user, distributor, originator, pointOfContact, principalInvestigator, processor, publisher, author, coAuthor, collaborator, editor, mediator, rightsHolder, contributor, funder, stakeholder} |


+
"contact": [
+ {
+  "individualName": "string",
+  "organisationName": "string",
+  "positionName": "string",
+  "contactInfo": {
+   "phone": {
+    "voice": "string",
+    "facsimile": "string"
+   },
+   "address": {
+    "deliveryPoint": "string",
+    "city": "string",
+    "postalCode": "string",
+    "country": "string",
+    "electronicMailAddress": "string"
+   },
+   "onlineResource": {
+    "linkage": "string",
+    "name": "string",
+    "description": "string",
+    "protocol": "string",
+    "function": "string"
+   }
+  },
+  "role": "string"
+ }
+]
+


+
+
+

6.3.2.2 Online resource

+

An online resource is a common set of elements frequently used in the geographic data/services schema. It can be used for example to provide a link to an organization website, to a data file or to a document, etc. An online resource is described with the following properties:

| Element | Description |
|---|---|
| linkage | URL of the online resource. In the case of a geographic standard data service, only the base URL should be provided, without any service parameter. |
| name | Name of the online resource. In the case of a geographic standard data service, this should be filled with the identifier of the resource as published in the service. For example, for an OGC Web Map Service (WMS), we will use the layer name. |
| description | Description of the online resource |
| protocol | Web protocol used to get the resource, e.g., FTP, HTTP. In the case of a basic HTTP link, the ISO 19139 suggests the value 'WWW:LINK-1.0-http--link'. For geographic standard data services, it is recommended to fill this element with the appropriate protocol identifier. For an OGC Web Map Service (WMS) link, for example, use 'OGC:WMS-1.1.0-http-get-map'. |
| function | Function (purpose) of the online resource. |


+
"onlineResource": {
+ "linkage": "string",
+ "name": "string",
+ "description": "string",
+ "protocol": "string",
+ "function": "string"
+}
+


+
+
+

6.3.2.3 Offline resource (Medium)

+

An offline resource (medium) is a common set of elements that can be used to describe a physical resource used to distribute a dataset, e.g., a DVD or a CD-ROM. A medium is described with the following properties:

| Element | Description |
|---|---|
| name | Name of the medium, e.g., 'dvd'. Recommended code following the ISO/TS 19139 MediumName codelist. Suggested values: {cdRom, dvd, dvdRom, 3halfInchFloppy, 5quarterInchFloppy, 7trackTape, 9trackTape, 3480Cartridge, 3490Cartridge, 3580Cartridge, 4mmCartridgeTape, 8mmCartridgeTape, 1quarterInchCartridgeTape, digitalLinearTape, onLine, satellite, telephoneLink, hardcopy} |
| density | Density (or list of densities) at which the data is recorded |
| densityUnit | Unit(s) of measure for the recording density |
| volumes | Number of items in the media identified |
| mediumFormat | Method used to write to the medium, e.g., tar. Recommended code following the ISO/TS 19139 MediumFormat codelist. Suggested values: {cpio, tar, highSierra, iso9660, iso9660RockRidge, iso9660AppleHFS, udf} |
| mediumNote | Description of other limitations or requirements for using the medium |
+
+

6.3.2.4 File format

+

The table below lists the ISO 19139 elements used to document a file format. A format is defined at a minimum by its name. It is also recommended to provide a version, and possibly a format specification. It is good practice to provide a standardized format name, using the file’s mime type, e.g., text/csv, image/tiff. A list of available mime types is available from the IANA website.

| Element | Description |
|---|---|
| name | Format name - Recommended |
| version | Format version (if applicable) - Recommended |
| amendmentNumber | Amendment number (if applicable) |
| specification | Name of the specification - Recommended |
| fileDecompressionTechnique | Technique for file decompression (if applicable) |
| FormatDistributor | Contact(s) responsible for the distribution |


+
"resourceFormat": [
+ {
+  "name": "string",
+  "version": "string",
+  "amendmentNumber": "string",
+  "specification": "string",
+  "fileDecompressionTechnique": "string",
+  "FormatDistributor": {
+   "individualName": "string",
+   "organisationName": "string",
+   "positionName": "string",
+   "contactInfo": {},
+   "role": "string"
+  }
+ }
+]
+


+
+
+

6.3.2.5 Citation

+

The citation is another common element that can be used in various parts of a geographic metadata file. Citations are used to provide detailed information on external resources related to the dataset or service being documented. A citation can be defined using the following set of (mostly optional) elements:

| Element | Description |
|---|---|
| title | Title of the resource |
| alternateTitle | An alternate title (if applicable) |
| date | Date(s) associated with the resource, with sub-elements date and type. This may include different types of dates. The type of date should be provided, and selected from the controlled vocabulary proposed by the ISO 19139: date of {creation, publication, revision, expiry, lastUpdate, lastRevision, nextUpdate, unavailable, inForce, adopted, deprecated, superseded, validityBegins, validityExpires, released, distribution} |
| edition | Edition of the resource |
| editionDate | Edition date |
| identifier | A unique persistent identifier for the metadata. If a DOI is available for the resource, the DOI should be entered here. The same fileIdentifier should be used if no other persistent identifier is available. |
| citedResponsibleParty | Contact(s)/party(ies) responsible for the resource |
| presentationForm | Form in which the resource is made available. The ISO 19139 recommends the following controlled vocabulary: {documentDigital, imageDigital, documentHardcopy, imageHardcopy, mapDigital, mapHardcopy, modelDigital, modelHardcopy, profileDigital, profileHardcopy, tableDigital, tableHardcopy, videoDigital, videoHardcopy, audioDigital, audioHardcopy, multimediaDigital, multimediaHardcopy, physicalSample, diagramDigital, diagramHardcopy}. For a geospatial dataset or web layer, the value mapDigital will be preferred. |
| series | A description of the series, in case the resource is part of a series. This includes the series name, issueIdentification, and page |
| otherCitationDetails | Any other citation details to specify |
| collectiveTitle | A title in case the resource is part of a broader resource (e.g., a data collection) |
| ISBN | International Standard Book Number (ISBN); an international standard identification number for uniquely identifying publications that are not intended to continue indefinitely. |
| ISSN | International Standard Serial Number (ISSN); an international standard for serial publications. |


+
"citation": {
+ "title": "string",
+ "alternateTitle": "string",
+ "date": [
+  {
+   "date": "string",
+   "type": "string"
+  }
+ ],
+ "edition": "string",
+ "editionDate": "string",
+ "identifier": {
+  "authority": "string",
+  "code": null
+ },
+ "citedResponsibleParty": [],
+ "presentationForm": [
+  "string"
+ ],
+ "series": {
+  "name": "string",
+  "issueIdentification": "string",
+  "page": "string"
+ },
+ "otherCitationDetails": "string",
+ "collectiveTitle": "string",
+ "ISBN": "string",
+ "ISSN": "string"
+}
+


+
+
+

6.3.2.6 Keywords

+

Keywords contribute significantly to making a resource more discoverable. Entering a list of relevant keywords is therefore highly recommended. Keywords can, but do not have to be selected from a controlled vocabulary (thesaurus). Keywords are documented using the following elements:

| Element | Description |
|---|---|
| type | Keywords type. The ISO 19139 provides a recommended controlled vocabulary with the following options: {dataCenter, discipline, place, dataResolution, stratum, temporal, theme, dataCentre, featureType, instrument, platform, process, project, service, product, subTopicCategory} |
| keyword | The keyword itself. When possible, existing vocabularies should be preferred to writing free-text keywords. Examples of global vocabularies are the Global Change Master Directory, which could be a valuable source to reference data domains / disciplines, or the UNESCO Thesaurus. |
| thesaurusName | A reference to a thesaurus (if applicable) from which the keywords are extracted. The thesaurus itself should then be documented as a citation. |


+
"keywords": [
+ {
+  "type": "string",
+  "keyword": "string",
+  "thesaurusName": "string"
+ }
+]
+


+
+
+

6.3.2.7 Constraints @@@@ not clear. where is the element useLimitations? … what are the elements used in the schema?

+

The constraints common set of elements will be used to document legal and security constraints associated with the documented dataset or data service. Both types of constraints have one property in common, useLimitation, used to describe the use limitation(s) as free text.

+


+
"resourceConstraints": [
+ {
+  "legalConstraints": {
+   "useLimitation": [
+    "string"
+   ],
+   "accessConstraints": [
+    "string"
+   ],
+   "useConstraints": [
+    "string"
+   ],
+   "otherConstraints": [
+    "string"
+   ]
+  },
+  "securityConstraints": {
+  "useLimitation": [
+   "string"
+  ],
+  "classification": "string",
+  "userNote": "string",
+  "classificationSystem": "string",
+  "handlingDescription": "string"
+  }
+ }
+]
+


+

In addition to the useLimitation element, legal constraints (legalConstraints) can be described using the following three metadata elements:

| Element | Description |
|---|---|
| accessConstraints | Access constraints. The ISO 19139 provides a controlled vocabulary with the following options: {copyright, patent, patentPending, trademark, license, intellectualPropertyRights, restricted, otherRestrictions, unrestricted, licenceUnrestricted, licenceEndUser, licenceDistributor, private, statutory, confidential, SBU, in-confidence} |
| useConstraints | Use constraints, entered as free text. How this element is filled will depend on the resource being described. It is recommended best practice to fill this element; this is where terms of use, disclaimers, preferred citation, or even data limitations can be captured. |
| otherConstraints | Any other constraints related to the resource. |

In addition to the useLimitation element, security constraints (securityConstraints) - which apply essentially to classified resources - can be described using the following four metadata elements:

| Element | Description |
|---|---|
| classification | Classification code. The ISO 19139 provides a controlled vocabulary with the following options: {unclassified, restricted, confidential, secret, topSecret, SBU, forOfficialUseOnly, protected, limitedDistribution} |
| userNote | Note to users (free text) |
| classificationSystem | Information on the system used to classify the information. Organizations may have their own system to classify information. |
| handlingDescription | Additional free-text description of the classification |
+
+
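For example, a resourceConstraints block combining legal and security constraints could be written in R as follows (a sketch only; the free-text values are placeholders):

```r
# Sketch: legal and security constraints (placeholder values)
resourceConstraints <- list(
  list(
    legalConstraints = list(
      useLimitation     = list("Not to be used for navigation purposes"),
      accessConstraints = list("unrestricted"),
      useConstraints    = list("Cite the source when redistributing"),
      otherConstraints  = list("")
    ),
    securityConstraints = list(
      useLimitation        = list(""),
      classification       = "unclassified",
      userNote             = "",
      classificationSystem = "",
      handlingDescription  = ""
    )
  )
)
```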

6.3.2.8 Extent

+

The extent defines the boundaries of the dataset in space (horizontally and vertically) and in time. The ISO 19139 standard defines the extent as follows:

| Element | Description |
|---|---|
| geographicElement | Spatial (horizontal) extent element. This can be defined either with a geographicBoundingBox providing the coordinates bounding the limits of the dataset, by means of four properties: southBoundLatitude, westBoundLongitude, northBoundLatitude, eastBoundLongitude (recommended); or using geographicDescription, free text that defines the area covered. When the dataset covers one or more countries, it is recommended to enter the country names in this element, as it can then be used in data catalogs for filtering by geography. |
| verticalElement | Spatial (vertical) extent element, providing three properties: minimumValue, maximumValue, and verticalCRS (reference to the vertical coordinate reference system) |
| temporalElement | Temporal extent element. Depending on the temporal characteristics of the dataset, this will consist of a TimePeriod (made of a beginPosition and endPosition) or a TimeInstant (made of a single timePosition) referencing date/time information according to ISO 8601 |


+
"extent": {
+ "geographicElement": [
+  {
+   "geographicBoundingBox": {
+    "westBoundLongitude": -180,
+    "eastBoundLongitude": -180,
+    "southBoundLatitude": -180,
+    "northBoundLatitude": -180
+   },
+   "geographicDescription": "string"
+  }
+ ],
+ "temporalElement": [
+  {
+   "extent": null
+  }
+ ],
+ "verticalElement": [
+  {
+   "minimumValue": 0,
+   "maximumValue": 0,
+   "verticalCRS": null
+  }
+ ]
+}
+
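For example, the extent of a dataset covering Bangladesh in 2021 might be encoded as follows (a sketch; the bounding box coordinates are approximate, and the TimePeriod encoding of the temporal element is an assumption based on the ISO 8601 convention mentioned above):

```r
# Sketch: spatial and temporal extent (approximate bounding box for Bangladesh)
extent <- list(
  geographicElement = list(
    list(
      geographicBoundingBox = list(
        westBoundLongitude = 88.0, eastBoundLongitude = 92.7,
        southBoundLatitude = 20.6, northBoundLatitude = 26.6
      ),
      geographicDescription = "Bangladesh"
    )
  ),
  temporalElement = list(
    list(extent = list(beginPosition = "2021-01-01", endPosition = "2021-12-31"))
  )
)
```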


+
+
+
+

6.3.3 Core metadata properties

+

A set of elements is provided in the ISO 19139 to document the core properties of the metadata (not the data). With a few exceptions, these elements apply to the metadata related to datasets and data services. The table below summarizes these elements and their applicability. A description of the elements follows.

| Property | Description | Used in dataset metadata | Used in service metadata |
|---|---|---|---|
| fileIdentifier | Unique persistent identifier for the resource | Yes | - |
| language | Main language used in the metadata description | Yes | Yes |
| characterSet | Character set encoding used in the metadata description | Yes | Yes |
| parentIdentifier | Unique persistent identifier of the parent resource (if any) | Yes | Yes |
| hierarchyLevel | Scope(s) / hierarchy level(s) of the resource. List of pre-defined values suggested by the ISO 19139. See details below. | Yes | Yes |
| hierarchyLevelName | Alternative name definitions for hierarchy levels | Yes | Yes |
| contact | Contact(s) associated with the metadata, i.e. the persons/organizations in charge of the creation, edition, and maintenance of the metadata. For more details, see the section on common elements. | Yes | Yes |
| dateStamp | Date and time when the metadata record was created or updated | Yes | Yes |
| metadataStandardName | Reference or name of the metadata standard used. | Yes | Yes |
| metadataStandardVersion | Version of the metadata standard. For the ISO/TC211, the version corresponds to the creation/revision year. | Yes | Yes |
| dataSetURI | Unique persistent link to reference the dataset | Yes | - |


+
"description": {
+ "idno": "string",
+ "language": "string",
+ "characterSet": {
+  "codeListValue": "string",
+  "codeList": "string"
+ },
+ "parentIdentifier": "string",
+ "hierarchyLevel": [],
+ "hierarchyLevelName": [],
+ "contact": [],
+ "dateStamp": "string",
+ "metadataStandardName": "string",
+ "metadataStandardVersion": "string",
+ "dataSetURI": "string"
+}
+
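A minimal version of this block, written as an R list, could look as follows (a sketch; the identifier and dates are fictitious, and the contact element is left empty, see the common elements section for its structure):

```r
# Sketch: core metadata properties (fictitious values)
description <- list(
  idno           = "EXAMPLE_GEO_2021_001",     # unique, persistent identifier
  language       = "eng",                      # ISO 639-2 language code
  characterSet   = list(codeListValue = "utf8", codeList = ""),
  hierarchyLevel = list("dataset"),
  contact        = list(),                     # see the common elements section
  dateStamp      = "2021-03-03",
  metadataStandardName    = "ISO 19115 Geographic information - Metadata",
  metadataStandardVersion = "ISO 19115:2003"
)
```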


+
+

6.3.3.1 Resource identifier (idno)

+

The idno must provide a unique and persistent identifier for the resource (dataset or service). A common approach consists of building a semantic identifier, constructed by concatenating some owner and data characteristics. Although this approach offers the advantage of readability, it may not guarantee the identifier's global uniqueness and persistence over time. The use of time periods and/or geographic extents as components of a file identifier is not recommended, as these elements may evolve over time. The use of random identifiers such as Universally Unique Identifiers (UUID) is sometimes suggested as an alternative, but this approach is also not recommended. The use of Digital Object Identifiers (DOI) as global and unique file identifiers is recommended.

+
+
+

6.3.3.2 Language (language)

+

The metadata language refers to the main language used in the metadata. The recommended practice is to use the ISO 639-2 Language Code List (also known as the alpha-3 language code), e.g. ‘eng’ for English or ‘fra’ for French.

+
+
+

6.3.3.3 Character set (characterSet)

+

The character set encoding of the metadata description. The best practice is to use the utf8 encoding codelist value (UTF-8 encoding). It is capable of encoding all valid character code points in Unicode, a standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML. UTF-8 is the most common encoding for the World Wide Web. Many text editors will provide you with an option to save your metadata (text) files in UTF-8, which will often be the default option (see below the example of Notepad++ and R Studio).

+
+
+
+

6.3.3.4 Parent Identifier (parentIdentifier)

+

A geographic data resource can be a subset of a larger dataset. For example, an aquatic species distribution map can be part of a data collection covering all species, or the 2010 population census dataset of a country can be part of a dataset that includes all population censuses for that country since 1900. In such cases, the parent identifier metadata element can be used to identify this higher-level resource. As for the fileIdentifier, the parentIdentifier must be a unique identifier that is persistent in time. In a data catalog, a parentIdentifier allows the user to move from one dataset to another. The parentIdentifier is generally applied to datasets, although it may in some cases be used in data service descriptions.

+
+
+

6.3.3.5 Hierarchy level(s) (hierarchyLevel)

+


+
"hierarchyLevel": [
+  "string"
+ ]
+


+

The hierarchyLevel defines the scope of the resource. It indicates whether the resource is a collection, a dataset, a series, a service, or another type of resource. The ISO 19139 provides a controlled vocabulary for this element. It is recommended but not mandatory to make use of it. The most relevant levels for the purpose of cataloguing geographic data and services are dataset (for both raster and vector data), service (a capability which a service provider entity makes available to a service user entity through a set of interfaces that define a behavior), and series. Series will be used when the data represent an ordered succession, in time or in space; this will typically apply to time series, but it can also be used to describe other types of series (e.g., a series of ocean water temperatures collected at a succession of depths).

+

The recommended controlled vocabulary for hierarchylevel includes: {dataset, series, service, attribute, attributeType, collectionHardware, collectionSession, nonGeographicDataset, dimensionGroup, feature, featureType, propertyType, fieldSession, software, model, tile, initiative, stereomate, sensor, platformSeries, sensorSeries, productionSeries, transferAggregate, otherAggregate}

+
+
+

6.3.3.6 Hierarchy level name(s) (hierarchyLevelname)

+


+
"hierarchyLevelName": [
+  "string"
+]
+


+

The hierarchyLevelName provides an alternative to describe hierarchy levels, using free text instead of a controlled vocabulary. The use of hierarchyLevel is preferred to the use of hierarchyLevelName.

+
+
+

6.3.3.7 Contact(s) (contact)

+

The contact element is a common element described in the common elements section of this chapter. When associated with the metadata, it is used to identify the person(s) or organization(s) in charge of the creation, edition, and maintenance of the metadata. The contact(s) responsible for the metadata are not necessarily the ones responsible for the creation, edition, and maintenance of the dataset/service itself. The latter are documented in the dataset identification elements of the metadata file.

+
+
+

6.3.3.8 Date stamp (dateStamp)

+

The date stamp associated with the metadata. The metadata date stamp may be automatically filled by metadata editors, and will ideally use the standard ISO 8601 date format: YYYY-MM-DD (possibly with a time).

+
+
+

6.3.3.9 Metadata standard name (metadataStandardName)

+

The name of the geographic metadata standard used to describe the resource. The recommended values are:

+
- for vector dataset metadata: ISO 19115 Geographic information - Metadata
- for grid/imagery dataset metadata: ISO 19115-2 Geographic Information - Metadata Part 2: Extensions for imagery and gridded data
- for service metadata: ISO 19119 Geographic information - Services
+
+

6.3.3.10 Metadata standard version (metadataStandardVersion)

+

The version of the metadata standard being used. It is good practice to enter the standard's inception/revision year. ISO standards are revised with an average periodicity of about 10 years. Although the ISO TC211 geographic information metadata standards have been revised, it is still accepted to refer to the original version of the standard, as many information systems/catalogs still make use of that version.

+

The recommended values are:

+
- for vector dataset metadata: ISO 19115:2003
- for grid/imagery dataset metadata: ISO 19115-2:2009
- for service metadata: ISO 19119:2005
+
+

6.3.3.11 Dataset URI (datasetURI)

+

A unique resource identifier for the dataset, such as a web link that uniquely identifies the dataset. The use of a Digital Object Identifier (DOI) is recommended.

+
+
+
+

6.3.4 Main metadata sections

+

Geographic data can be diverse and complex. Users need detailed information to discover data and to use them in an informed and responsible manner. The core of the information on data will be provided in various sections of the metadata file. This will include information on the type of data, on the coordinate system being used, on the scope and coverage of the data, on the format and location of the data, on possible quality issues that users need to be aware of, and more. The table below summarizes the main metadata sections, by order of appearance in the ISO 19139 specification.

+


+
"description": {
+ "spatialRepresentationInfo": [],
+ "referenceSystemInfo": [],
+ "identificationInfo": [],
+ "contentInfo": [],
+ "distributionInfo": {},
+ "dataQualityInfo": [],
+ "metadataMaintenance": {} 
+}
+


| Section | Description | Usability in dataset metadata | Usability in service metadata |
|---|---|---|---|
| spatialRepresentationInfo | The spatial representation of the dataset. Distinction is made between vector and grid (raster) spatial representations. | Yes | - |
| referenceSystemInfo | The reference systems used in the resource. In practice, this will often be limited to the geographic coordinate system. | Yes | Yes |
| identificationInfo | Identifies the resource, including descriptive elements (e.g., title, purpose, abstract, keywords) and contact(s) having a role in the resource provision. See details below. | Yes | Yes |
| contentInfo | The content of a dataset resource, i.e. how the dataset is structured (dimensions, attributes, variables, etc.). In the case of vector datasets, this relates to separate metadata files compliant with the ISO 19110 standard (Feature Catalogue). In the case of raster / gridded data, this is covered by the ISO 19115-2 extension for imagery and gridded data. | Yes | - |
| distributionInfo | The mode(s) of distribution of the resource (format, online resources), and by whom it is distributed. | Yes | Yes |
| dataQualityInfo | The quality reports on the resource (dataset or service), and in the case of datasets, the provenance / lineage information giving the process steps performed to obtain the dataset resource. | Yes | Yes |
| metadataMaintenanceInfo | The metadata maintenance cycle operated for the resource. | Yes | Yes |

These sections are described in more detail below.

+
+

6.3.4.1 Spatial representation (spatialRepresentationInfo)

+


+
"spatialRepresentationInfo": [
+ {
+  "vectorSpatialRepresentation": {
+   "topologyLevel": "string",
+   "geometricObjects": [
+    {
+     "geometricObjectType": "string",
+     "geometricObjectCount": 0
+    }
+   ]
+  },
+  "gridSpatialRepresentation": {
+   "numberOfDimensions": 0,
+   "axisDimensionProperties": [
+    {
+     "dimensionName": "string",
+     "dimensionSize": 0,
+     "resolution": 0
+   }
+  ],
+  "cellGeometry": "string",
+  "transformationParameterAvailability": true
+  }
+ }
+]
+


+

Information on the spatial representation is critical to properly describe a geospatial dataset. The ISO/TS 19139 distinguishes two types of spatial representations, characterized by different properties.

+

The vector spatial representation describes the topology level and the geometric objects of vector datasets using the following two properties:

- Topology level (topologyLevel): the type of topology used in the vector spatial dataset. The ISO 19139 provides a controlled vocabulary with the following options: {geometryOnly, topology1D, planarGraph, fullPlanarGraph, surfaceGraph, fullSurfaceGraph, topology3D, fullTopology3D, abstract}. In most cases, vector datasets will be described as geometryOnly, which covers common geometry types (points, lines, polygons).
- Geometric objects (geometricObjects), which define:
  - Geometry type (geometricObjectType): the type of geometry handled. Possible values are: {complex, composite, curve, point, solid, surface}.
  - Geometry count (geometricObjectCount): the number (count) of geometries in the dataset.

In the case of a homogeneous geometry type, a single geometricObjects element can be defined. For complex geometries (a mixture of various geometry types), one geometricObjects element will be defined for each geometry type. A sketch showing how these properties can be derived programmatically from a data file is provided below, after the description of the grid spatial representation.

The grid spatial representation describes gridded (raster) data using the following three properties:

- Number of dimensions (numberOfDimensions) in the grid.
- Axis dimension properties (axisDimensionProperties): a list describing each dimension, including, for each dimension:
  - The name of the dimension type (dimensionName): the ISO 19139 provides a controlled vocabulary with the following options: {row, column, vertical, track, crossTrack, line, sample, time}. These options represent the following:
    - row: ordinate (y) axis
    - column: abscissa (x) axis
    - vertical: vertical (z) axis
    - track: along the direction of motion of the scan point
    - crossTrack: perpendicular to the direction of motion of the scan point
    - line: scan line of a sensor
    - sample: element along a scan line
    - time: duration

    In the Ethiopia population density file we used as an example of raster data, the dimension types will be row and column, as the file is a spatial 2D raster. If the data had an elevation or time dimension, we would use the "vertical" and "time" dimension name types respectively.
  - The dimension size (dimensionSize): the length of the dimension.
  - The dimension resolution (resolution): a resolution number associated with a unit of measurement. This is the resolution of the grid cell dimension. For example:
    - for longitude/latitude dimensions and a grid at 1deg x 5deg, the 'row' dimension will have a resolution of 1 deg, and the 'column' dimension will have a resolution of 5 deg;
    - for a "vertical" dimension, this represents the elevation step. For example, the vertical resolution of the mean ozone concentration between 40m and 50m altitude at a location of longitude x / latitude y would be 10 m;
    - similarly, in the case of a spatio-temporal grid, the "time" resolution represents the time lag (e.g., 1 year, 1 month, 1 week) between two measures.
- Cell geometry type (cellGeometry): the type of geometry used for grid cells. Possible values are: {point, area, voxel, stratum}. Most "grids" are area-based, but in principle the grid cells can target a point, an area, or a volume:
  - point: each cell represents a point
  - area: each cell represents an area
  - voxel: each cell represents a volumetric measurement on a regular grid in three-dimensional space
  - stratum: height range for a single point vertical profile
+
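When the data files are at hand, some of these properties can be derived programmatically rather than typed manually. The sketch below uses the sf package and a hypothetical file path, and assumes a polygon layer (polygons correspond to the "surface" geometric object type in the ISO vocabulary):

```r
# Sketch: derive a vectorSpatialRepresentation block from a shapefile with sf
# (hypothetical file path; assumes a polygon layer)
library(sf)

layer <- st_read("my_layer.shp", quiet = TRUE)   # replace with an actual file path
print(st_geometry_type(layer)[1])                # check the geometry type of the layer

vectorSpatialRepresentation <- list(
  topologyLevel = "geometryOnly",
  geometricObjects = list(
    list(
      geometricObjectType  = "surface",          # polygons map to "surface"
      geometricObjectCount = nrow(layer)         # number of features in the layer
    )
  )
)
```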
+

6.3.4.2 Reference system(s) (referenceSystemInfo)

+

The reference system(s) typically (but not necessarily) applies to the geographic reference system of the dataset. Multiple reference systems can be listed if a dataset is distributed with different spatial reference systems. This block of elements may also apply to service metadata. A spatial web-service may support several map projections / geographic coordinate reference systems.

+


+
"referenceSystemInfo": [
+ {
+  "code": "string",
+  "codeSpace": "string"
+ }
+]
+


+

A reference system is defined by two properties:

- the identifier of the reference system. The recommended practice is to use the Spatial Reference IDentifier (SRID) number. For example, the SRID of the World Geodetic System (WGS 84) is 4326.
- the code space of the source authority providing the SRID. The best practice is to use the EPSG authority code (as most geographic reference systems are registered in it). Codes from other authorities can be used to define ad-hoc projections.

The main reference system registry is EPSG, which provides a "search by name" tool for users who need to find an SRID (global or local/country-specific). Other websites, such as http://epsg.io/ and https://spatialreference.org/, also reference geographic systems but are not authoritative sources. The advantage of these sites is that they go beyond the EPSG registry and cover registries maintained by other providers such as ESRI. Some ESRI projections could be relevant, in particular those supporting world equal-area projected maps (maps that preserve area proportions).
+
+
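When a layer has been read with the sf package, the SRID can usually be retrieved directly from the object, as in the sketch below (hypothetical file path; st_crs() returns the EPSG code when it is known):

```r
# Sketch: retrieve the SRID (EPSG code) of a vector layer read with sf
library(sf)

layer <- st_read("my_layer.shp", quiet = TRUE)   # replace with an actual file path
srid  <- st_crs(layer)$epsg                      # e.g., 4326 for WGS 84

referenceSystemInfo <- list(
  list(code = as.character(srid), codeSpace = "EPSG")
)
```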

6.3.4.3 Identification (identificationInfo)

+

The identification information (identificationInfo) is where the citation elements of the resource will be provided. This may include descriptive information like title, abstract, purpose, keywords, etc., and identification of the parties/contact(s) associated with the resource, such as the owner, publisher, co-authors, etc. Providing and publishing detailed information in these elements will contribute significantly to improving the discoverability of the data.

+


+
"identificationInfo": [
+ {
+  "citation": {},
+  "abstract": "string",
+  "purpose": "string",
+  "credit": "string",
+  "status": "string",
+  "pointOfContact": [],
+  "resourceMaintenance": [],
+  "graphicOverview": [],
+  "resourceFormat": [],
+  "descriptiveKeywords": [],
+  "resourceConstraints": [],
+  "resourceSpecificUsage": [],
+  "aggregationInfo": {},
+  "extent": {},
+  "spatialRepresentationType": "string",
+  "spatialResolution": {},
+  "language": [],
+  "characterSet": [],
+  "topicCategory": [],
+  "supplementalInformation": "string",
+  "serviceIdentification": {}
+ }
+]
+


+

The identification of a resource includes elements that are common to both datasets and data services, and others that are specific to the type of resource. The following table summarizes the identification elements that can be used for dataset, service, or both.

+

Identification elements applicable to datasets and data services

+

The following metadata elements apply to resources of type dataset and service.

| Element | Description |
|---|---|
| citation | A citation set of elements that describes the dataset/service from a citation perspective, including title, associated contacts, etc. For more details, see the section on common elements. |
| abstract | An abstract for the dataset/service resource |
| purpose | A statement describing the purpose of the dataset/service resource |
| credit | Credit information. |
| status | Status of the resource, with the following recommended controlled vocabulary: {completed, historicalArchive, obsolete, onGoing, planned, required, underDevelopment, final, pending, retired, superseded, tentative, valid, accepted, notAccepted, withdrawn, proposed, deprecated} |
| pointOfContact | One or more points of contact to associate with the resource; people who can be contacted for information on the dataset/service. For more details, see the contact section in the common elements section of the chapter. |
| resourceMaintenance | Information on how the resource is maintained, essentially informing on the maintenance and update frequency (maintenanceAndUpdateFrequency). This frequency should be chosen among the values recommended by the ISO 19139 standard: {continual, daily, weekly, fortnightly, monthly, quarterly, biannually, annually, asNeeded, irregular, notPlanned, unknown}. |
| graphicOverview | One or more graphic overview(s) that provide a visual identification of the dataset/service, e.g., a link to a map overview image. A graphicOverview is defined with 3 properties: fileName (or URL), fileDescription, and optionally a fileType. |
| resourceFormat | Resource format(s) description. For more details on how to describe a format, see the common elements section of the chapter. |
| descriptiveKeywords | A set of keywords that describe the dataset. Keywords are grouped by keyword type, with the possibility to associate a thesaurus (if applicable). For more details on how to describe keywords, see the common elements section of the chapter. |
| resourceConstraints | Legal and/or security constraints associated with the resource. For more details on how to describe constraints, see the common elements section of the chapter. |
| resourceSpecificUsage | Information about specific usage(s) of the dataset/service, e.g., a research paper, a success story, etc. |
| aggregationInfo | Information on an aggregate or parent resource to which the resource belongs, i.e. a collection. |


+

Resource maintenance +

+
"resourceMaintenance": [
+ {
+ "maintenanceAndUpdateFrequency": "string"
+ }
+]
+


+

Graphic overview +

+
"graphicOverview": [
+ {
+  "fileName": "string",
+  "fileDescription": "string",
+  "fileType": "string"
+ }
+]
+


+

Resource specific usage +

+
"resourceSpecificUsage": [
+ {
+  "specificUsage": "string",
+  "usageDateTime": "string",
+  "userDeterminedLimitations": "string",
+  "userContactInfo": []
+ }
+]
+

For userContactInfo, see the Contact section in the common elements section of this chapter.

+

Aggregation information +

+
"aggregationInfo": {
+ "aggregateDataSetName": "string",
+ "aggregateDataSetIdentifier": "string",
+ "associationType": "string",
+ "initiativeType": "string"
+}
+


+

Identification elements applicable to datasets

+

The following metadata elements are specific to resources of type dataset.

| Element | Description |
|---|---|
| spatialRepresentationType | The spatial representation type of the dataset. Values should be selected from the following controlled vocabulary: {vector, grid, textTable, tin, stereoModel, video} |
| spatialResolution | The spatial resolution of the data, as a numeric value associated with a unit of measure. |
| language | The language used in the dataset. |
| characterSet | The character set encoding used in the dataset. |
| topicCategory | The topic category(ies) characterizing the dataset resource. Values should be selected from the following controlled vocabulary: {farming, biota, boundaries, climatologyMeteorologyAtmosphere, economy, elevation, environment, geoscientificInformation, health, imageryBaseMapsEarthCover, intelligenceMilitary, inlandWaters, location, oceans, planningCadastre, society, structure, transportation, utilitiesCommunication, extraTerrestrial, disaster} |
| extent | Defines the spatial (horizontal and vertical) and temporal region to which the content of the resource applies. For more details, see the common elements section of the chapter. |
| supplementalInformation | Any additional information, provided as free text. |


Spatial resolution, language, characterSet, and topic category

+
"spatialResolution": {
+ "uom": "string",
+ "value": 0
+},
+"language": [
+ "string"
+],
+"characterSet": [
+ {
+  "codeListValue": "string",
+  "codeList": "string"
+ }
+],
+"topicCategory": [
+ "string"
+]
+


+

Identification elements applicable to data services

+

The following metadata elements are specific to resources of type service.

| Element | Description |
|---|---|
| serviceType | The type of service (as free text), e.g., OGC:WMS |
| serviceTypeVersion | The version of the service, e.g., 1.3.0 |
| accessProperties | Access properties, including description of fees, plannedAvailableDateTime, orderingInstructions and turnaround |
| restrictions | Legal and/or security constraints associated with the service. For more details, see the common elements section of the chapter. |
| keywords | Set of service keywords. For more details, see the common elements section of the chapter. |
| extent | Defines the spatial (horizontal and vertical) and temporal region to which the service applies (if applicable). See the common elements section of the chapter. |
| coupledResource | Resource(s), if any, coupled to a service operation. |
| couplingType | The type of coupling between service and coupled resources. Values should be selected from the following controlled vocabulary: {loose, mixed, tight} |
| containsOperations | Operation(s) available for the service. See below for details. |
| operatesOn | List of dataset identifiers on which the service operates. |


+
"serviceIdentification": {
+ "serviceType": "string",
+ "serviceTypeVersion": "string",
+ "accessProperties": {
+  "fees": "string",
+  "plannedAvailableDateTime": "string",
+  "orderingInstructions": "string",
+  "turnaround": "string"
+ },
+ "restrictions": [],
+ "keywords": [],
+ "coupledResource": [
+  {
+   "operationName": "string",
+   "identifier": "string"
+  }
+ ],
+ "couplingType": "string",
+ "containsOperations": [
+  {
+   "operationName": "string",
+   "DCP": [
+    "string"
+   ],
+   "operationDescription": "string",
+   "invocationName": "string",
+   "parameters": [
+    {
+     "name": "string",
+     "direction": "string",
+     "description": "string",
+     "optionality": "string",
+     "repeatability": true,
+     "valueType": "string"
+    }
+   ],
+   "connectPoint": {
+    "linkage": "string",
+    "name": "string",
+    "description": "string",
+    "protocol": "string",
+    "function": "string"
+   },
+   "dependsOn": [
+    { }
+   ]
+  }
+ ],
+ "operatesOn": [
+  {
+   "uuidref": "string"
+  }
+ ]
+}
+


+
+
6.3.4.3.1 Service operation
+

A data service operation is described with the following metadata elements:

| Element | Description |
|---|---|
| operationName | Name of the operation |
| DCP | Distributed Computing Platform. Recommended value: 'WebServices' |
| operationDescription | Description of the operation |
| invocationName | Name of the operation as invoked when using the service |
| parameters | Operation parameter(s). A parameter can be defined with several properties including name, description, direction ('in', 'out', or 'inout'), optionality ('Mandatory' or 'Optional'), repeatability (true/false), and valueType (type of value expected, e.g., string, numeric, etc.) |
| connectPoint | URL connect points, defined as online resource(s) |
| dependsOn | Service operation(s) the service operation depends on. |

The service operation(s) descriptions are recommended when the service does not support the self-description of its operations.

+
+
+
+

6.3.4.4 Content (contentInfo)

+

For vector datasets, the ISO 19115-1 does not provide all necessary elements; the structure of vector datasets is therefore documented using the featureCatalogueDescription of the ISO 19110 (Feature Catalogue) standard. The ISO 19110 is included in the unified ISO 19139 XML specification.

+

Feature catalogue description (featureCatalogueDescription)

+

The Feature Catalogue description aims to link the structural metadata (ISO 19110) to the dataset metadata (ISO 19115). This will be required when the structural metadata is not contained in the same metadata file as the dataset metadata.1 The following elements are used to document this relationship:

| Element | Description |
|---|---|
| complianceCode | Indicates whether the dataset complies with the feature catalogue description |
| language | Language used in the feature catalogue |
| includedWithDataset | Indicates if the feature catalogue description is included with the dataset (essentially, as a downloadable resource) |
| featureCatalogueCitation | A citation that references the ISO 19110 feature catalogue. As best practice, this citation will essentially use two properties: uuidref, giving the persistent identifier of the feature catalogue, and href, giving a web link to access the ISO 19110 feature catalogue. |


+
"contentInfo": [
+ {
+ "featureCatalogueDescription": {
+  "complianceCode": true,
+  "language": "string",
+  "includedWithDataset": true,
+  "featureCatalogueCitation": {
+   "title": "string",
+   "alternateTitle": "string",
+   "date": [
+    {
+     "date": "string",
+     "type": "string"
+    }
+   ],
+   "edition": "string",
+   "editionDate": "string",
+   "identifier": {
+    "authority": "string",
+    "code": null
+   },
+   "citedResponsibleParty": [],
+   "presentationForm": [
+    "string"
+   ],
+   "series": {
+    "name": "string",
+    "issueIdentification": "string",
+    "page": "string"
+   },
+   "otherCitationDetails": "string",
+   "collectiveTitle": "string",
+   "ISBN": "string",
+   "ISSN": "string"
+   }
+  },
+  "coverageDescription": {
+   "contentType": "string",
+   "dimension": [
+    {
+     "name": "string",
+     "type": "string"
+    }
+   ]
+  }
+ }
+]
+


+

The feature catalogue can be an external metadata file or document. We embedded it in our JSON schema. See the section ISO 19110 Feature Catalogue below.

+

Coverage description (coverageDescription)

+

The structure of raster/gridded datasets can be described using the ISO 19115-2 standard, using the coverageDescription element and the following two properties:

| Element | Description |
|---|---|
| contentType | Type of coverage content, e.g., 'image'. It is recommended to define the content type using the controlled vocabulary suggested by the ISO 19139, which contains the following values: {image, thematicClassification, physicalMeasurement, auxillaryInformation, qualityInformation, referenceInformation, modelResult, coordinate, auxilliaryData} |
| dimension | List of coverage dimensions. Each dimension can be defined by a name and a type. For the type, a good practice is to rely on primitive data types defined in the XML Schema https://www.w3.org/2009/XMLSchema/XMLSchema.xsd |
| rangeElementDescription | List of range element descriptions. Each range element description will have a name/definition (corresponding to the dimension considered), and a list of accepted values as rangeElement. For example, for a time series with series defined at specific instants in time, the Time dimension of the spatio-temporal coverage could be defined here, giving the list of time instants supported by the time series. |
+
+

6.3.4.5 Distribution (distributionInfo)

+

The distribution information documents who the actual distributor of the resource is, as well as other aspects of the distribution in terms of format and online resources. This information is provided using the following elements:

| Element | Description |
|---|---|
| distributionFormat | Format(s) definitions. See the common elements section for information on how to document a format. |
| distributor | Contact(s) in charge of the resource distribution. See the common elements section for information on how to document a contact. |
| transferOptions | Transfer option(s) to get the resource. To align with the ISO 19139, these resources should be set in an onLine element where all online resources available can be listed, or as offLine for media not available online. |


+
"distributionFormat": [
+ {
+  "name": "string",
+  "version": "string",
+  "amendmentNumber": "string",
+  "specification": "string",
+  "fileDecompressionTechnique": "string",
+  "FormatDistributor": {}
+ }
+]
+


+
+
+

6.3.4.6 Data quality (dataQualityInfo)

+

Information on the quality of the data will be useful to secondary analysts, to ensure proper use of the data. Data quality is documented in the section dataQualityInfo using three main metadata elements:

| Element | Description |
|---|---|
| scope | Scope / hierarchy level targeted by the data quality information section. The ISO 19139 recommends the use of a controlled vocabulary with the following options: {attribute, attributeType, collectionHardware, collectionSession, dataset, series, nonGeographicDataset, dimensionGroup, feature, featureType, propertyType, fieldSession, software, service, model, tile, initiative, stereomate, sensor, platformSeries, sensorSeries, productionSeries, transferAggregate, otherAggregate} |
| report | Report(s) describing the quality information, for example an INSPIRE metadata compliance report. To see how to create a data quality conformance report, see details below. |
| lineage | The lineage provides the elements needed to describe the process that led to the production of the data. In combination with report, the lineage allows data users to assess quality conformance. This is an important metadata element. |


+
"dataQualityInfo": [
+ {
+  "scope": "string",
+  "report": [],
+  "lineage": {
+   "statement": "string",
+   "processStep": []
+  }
+ }
+]
+


+
+
6.3.4.6.1 Report (report)
+


+
"report": [
+ {
+  "DQ_DomainConsistency": {
+   "result": {
+   "nameOfMeasure": [],
+   "measureIdentification": "string",
+   "measureDescription": "string",
+   "evaluationMethodType": [],
+   "evaluationMethodDescription": "string",
+   "evaluationProcedure": {},
+   "dateTime": "string",
+   "result": []
+   }
+  }
+ }
+]
+


+

A report describes the result of an assessment of the conformance (or not) of a resource to consistency rules. The result is the main component of a report, and can be described with the following elements:

- nameOfMeasure: One or more names of the measures used for the data quality report
- measureIdentification: Identification of the measure, using a unique identifier (if applicable)
- measureDescription: A description of the measure
- evaluationMethodType: Type of evaluation method. The ISO 19139 recommends the use of a controlled vocabulary with the following options: {directInternal, directExternal, indirect}
- evaluationMethodDescription: Description of the evaluation method
- evaluationProcedure: Citation of the evaluation procedure (as a citation element)
- dateTime: Date and time when the report was established
- result: Result(s) associated with the report. Each result should be described with a specification, an explanation (of the conformance or non-conformance result), and a pass property indicating whether the result was positive (true) or not (false).
+


+
"result": {
+ "nameOfMeasure": [
+  "string"
+ ],
+ "measureIdentification": "string",
+ "measureDescription": "string",
+ "evaluationMethodType": [
+  "string"
+ ],
+ "evaluationMethodDescription": "string",
+ "evaluationProcedure": {
+ "title": "string",
+ "alternateTitle": "string",
+ "date": [
+  {
+   "date": "string",
+   "type": "string"
+  }
+ ],
+ "edition": "string",
+ "editionDate": "string",
+ "identifier": {
+ "authority": "string",
+ "code": null
+ },
+ "citedResponsibleParty": [],
+ "presentationForm": [
+  "string"
+ ],
+ "series": {
+  "name": "string",
+  "issueIdentification": "string",
+  "page": "string"
+ },
+ "otherCitationDetails": "string",
+ "collectiveTitle": "string",
+ "ISBN": "string",
+ "ISSN": "string"
+ },
+ "dateTime": "string",
+ "result": []
+ }
+}
+


+
+
+
6.3.4.6.2 Lineage (lineage)
+

The lineage provides a structured solution to describe the workflow that led to the production of the data/service. It is defined by:

- a general statement of the workflow performed
- a sequence of process steps performed. Each processStep is defined by the following elements:
  - description: Description of the process step performed
  - rationale: Rationale of the process step
  - dateTime: Date of the processing
  - processor: Contact(s) acting as processor(s) for the step
  - source: Source(s) used for the process step. Each source can have a description and a sourceCitation (as a citation element).


+
"lineage": {
+ "statement": "string",
+ "processStep": [
+  {
+   "description": "string",
+   "rationale": "string",
+   "dateTime": "string",
+   "processor": [],
+   "source": [
+   {
+   "description": "string",
+   "sourceCitation": {
+   "title": "string",
+   "alternateTitle": "string",
+   "date": [
+   {
+   "date": "string",
+   "type": "string"
+   }
+  ],
+  "edition": "string",
+  "editionDate": "string",
+  "identifier": {
+  "authority": "string",
+  "code": null
+ },
+ "citedResponsibleParty": [],
+ "presentationForm": [
+ "string"
+ ],
+ "series": {
+ "name": "string",
+ "issueIdentification": "string",
+ "page": "string"
+ },
+ "otherCitationDetails": "string",
+ "collectiveTitle": "string",
+ "ISBN": "string",
+ "ISSN": "string"
+ }
+ }
+ ]
+ }
+ ]
+}
+
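As an illustration, a simple lineage with a single process step could be encoded in R as follows (a sketch only; the statement, dates, and source citation are placeholders, not values prescribed by the standard):

```r
# Sketch: a lineage statement with one process step (placeholder values)
lineage <- list(
  statement = "Camp outlines digitized from high-resolution imagery and validated in the field.",
  processStep = list(
    list(
      description = "Digitization of camp outlines from satellite imagery",
      rationale   = "Produce an up-to-date delineation of the camps",
      dateTime    = "2021-01-18",
      processor   = list(),
      source      = list(
        list(description    = "High-resolution satellite imagery",
             sourceCitation = list(title = "Example imagery source", date = list()))
      )
    )
  )
)
```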


+
+
+
+

6.3.4.7 Metadata maintenance (metadataMaintenanceInfo)

+

The metadataMaintenanceInfo and maintenanceAndUpdateFrequency elements provide information on the maintenance of the metadata including the frequency of updates. The metadataMaintenanceInfo element is a free text element. The information provided in maintenanceAndUpdateFrequency should be chosen from values recommended by the ISO 19139 controlled vocabulary with the following options: {continual, daily, weekly, fortnightly, monthly, quarterly, biannually, annually, asNeeded, irregular, notPlanned, unknown}.

+


+
"metadataMaintenance": {
+ "maintenanceAndUpdateFrequency": "string"
+}
+


+
+
+
+
+

6.4 ISO 19110 Feature Catalogue (feature_catalogue)

+

We describe below how the ISO 19110 feature catalogue is used to document the structure of a vector dataset (complementing the ISO 19115-1). This is equivalent to producing a "data dictionary" for the variables/features included in a vector dataset. An example of the implementation of such a feature catalogue using R is provided in the Complete examples section at the end of this chapter.

| Element | Description |
|---|---|
| name | Name of the feature catalogue |
| scope | Subject domain(s) of feature types defined in this feature catalogue |
| fieldOfApplication | One or more fields of application for this feature catalogue. |
| versionNumber | Version number of this feature catalogue, which may include both a major version number or letter and a sequence of minor release numbers or letters, such as '3.2.4a'. The format of this attribute may differ between cataloguing authorities. |
| versionDate | Version date |
| producer | The responsibleParty in charge of the feature catalogue production |
| functionalLanguage | Formal functional language in which the feature operation formal definition occurs in this feature catalogue |
| featureType | One or more feature type(s) defined in the feature catalogue. The definition of several feature types can be considered when targeting various forms of a dataset (e.g., simplified vs. complete set of attributes, raw vs. aggregated, etc.). In practice, a simple ISO 19110 feature catalogue will reference one feature type describing the unique dataset structure. See details below. |


+
"feature_catalogue": {
+ "name": "string",
+ "scope": [],
+ "fieldOfApplication": [],
+ "versionNumber": "string",
+ "versionDate": {},
+ "producer": {},
+ "functionalLanguage": "string",
+ "featureType": []
+}
+


+

The featureType is the actual data structure definition of a dataset (data dictionary), and has the following properties:

| Element | Description |
|---|---|
| typeName | Text string that uniquely identifies this feature type within the feature catalogue that contains it |
| definition | Definition of the feature type |
| code | Code that uniquely identifies this feature type within the feature catalogue that contains it |
| isAbstract | Indicates if the feature type is abstract or not |
| aliases | One or more aliases as equivalent names of the feature type |
| carrierOfCharacteristics | Feature attribute(s) / column(s) definitions. See details below. |


+
"featureType": [
+ {
+  "typeName": "string",
+  "definition": "string",
+  "code": "string",
+  "isAbstract": true,
+  "aliases": [
+   "string"
+  ],
+  "carrierOfCharacteristics": [
+    {
+     "memberName": "string",
+     "definition": "string",
+     "cardinality": {
+      "lower": 0,
+      "upper": 0
+    },
+    "code": "string",
+    "valueMeasurementUnit": "string",
+    "valueType": "string",
+    "listedValue": [
+      {
+       "label": "string",
+       "code": "string",
+       "definition": "string"
+      }
+    ]
+   }
+  ]
+ }
+]
+


+

Each feature attribute (i.e., each column that is a member of the vector data structure) is defined as a carrier of characteristics. Each set of characteristics can be defined with the following properties:

| Element | Description |
|---|---|
| memberName | Name of the property member of the feature type |
| definition | Definition of the property member |
| cardinality | Definition of the member type cardinality. The cardinality is a set of two properties: lower cardinality (lower) and upper cardinality (upper). For simple tabular datasets, the cardinality will be 1-1. Multiple cardinalities (e.g., 1-N, N-N) apply particularly to feature catalogues/types that describe relational databases. |
| code | Code for the attribute member of the feature type. Corresponds to the actual column name in an attributes table. |
| valueMeasurementUnit | Measurement unit of the values (in case the feature member corresponds to a measurable variable) |
| valueType | Type of value. A good practice is to rely on primitive data types defined in the XML Schema https://www.w3.org/2009/XMLSchema/XMLSchema.xsd |
| listedValue | List of controlled value(s) used in the attribute member. Each value corresponds to an object composed of 1) a label, 2) a code (as contained in the dataset), and 3) a definition. This element will be used when the feature member relates to reference datasets, such as code lists or registers, e.g., lists of countries, land cover types, etc. |
+
+
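Because the carrierOfCharacteristics elements essentially form a data dictionary, a skeleton can be generated programmatically from a layer's attribute table and completed manually afterwards. The sketch below (sf package, hypothetical file path) derives column names and R data types; definitions, measurement units, and listed values would still need to be added by the curator, and the R class names mapped to XML Schema primitive types:

```r
# Sketch: generate skeleton carrierOfCharacteristics entries from a layer's columns
library(sf)

layer <- st_read("my_layer.shp", quiet = TRUE)    # replace with an actual file path
attrs <- st_drop_geometry(layer)                  # keep only the attribute columns

carrierOfCharacteristics <- lapply(names(attrs), function(col) {
  list(
    memberName  = col,
    definition  = "",                             # to be completed manually
    code        = col,                            # actual column name in the attribute table
    cardinality = list(lower = 1, upper = 1),
    valueType   = class(attrs[[col]])[1]          # R class; map to an XML Schema type
  )
})
```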

6.5 Provenance

+


+
"provenance": [
+ {
+  "origin_description": {
+   "harvest_date": "string",
+   "altered": true,
+   "base_url": "string",
+   "identifier": "string",
+   "date_stamp": "string",
+   "metadata_namespace": "string"
+  }
+ }
+]
+


+

provenance [Optional ; Repeatable]
Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata. These elements are NOT part of the ISO 19139 metadata standard.

- origin_description [Required ; Not repeatable]
  The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
  - harvest_date [Required ; Not repeatable ; String]
    The date and time the metadata were harvested, in ISO 8601 format.
  - altered [Optional ; Not repeatable ; Boolean]
    A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Study Description / Title Statement section) will be modified when published in a new catalog.
  - base_url [Required ; Not repeatable ; String]
    The URL from which the metadata were harvested.
  - identifier [Optional ; Not repeatable ; String]
    The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
  - date_stamp [Optional ; Not repeatable ; String]
    The datestamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
  - metadata_namespace [Optional ; Not repeatable ; String]
+
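For example, the provenance of metadata harvested from another catalog could be recorded as follows (a sketch; the URL, identifier, and dates are fictitious):

```r
# Sketch: provenance of harvested metadata (fictitious source catalog)
provenance <- list(
  list(
    origin_description = list(
      harvest_date = "2021-03-03T10:15:00Z",
      altered      = TRUE,
      base_url     = "https://example-catalog.org/index.php/api/",
      identifier   = "EXAMPLE_GEO_0001",
      date_stamp   = "2021-02-28"
    )
  )
)
```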
+
+

6.6 Tags

+

tags [Optional ; Repeatable]
Tags provide an easy way to include custom facets in a NADA catalog. Consider using one or more controlled vocabularies. See section 1.7 for more on the importance and use of tags and tag_groups in data catalogs.

+


+
"tags": [
+ {
+  "tag": "string",
+  "tag_group": "string"
+ }
+]
+


+
- tag [Required ; Not repeatable ; String]
  A user-defined tag.
- tag_group [Optional ; Not repeatable ; String]
  A user-defined group to which the tag belongs. Grouping tags allows implementation of controlled facets (filters) in data catalogs.
+
+
+

6.7 LDA topics

+

lda_topics [Optional ; Not repeatable]

+


+
"lda_topics": [
+{
+"model_info": [
+  {
+   "source": "string",
+   "author": "string",
+   "version": "string",
+   "model_id": "string",
+   "nb_topics": 0,
+   "description": "string",
+   "corpus": "string",
+   "uri": "string"
+  }
+ ],
+ "topic_description": [
+  {
+   "topic_id": null,
+   "topic_score": null,
+   "topic_label": "string",
+   "topic_words": [
+    {
+     "word": "string",
+     "word_weight": 0
+    }
+   ]
+  }
+ ]
+ }
+]
+


+

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document (in this case, the “document” is a compilation of elements from the dataset metadata) can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).

The metadata element lda_topics is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition.
+
+

Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the lda_topics elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.

+
+

The lda_topics element includes the following metadata fields:

+
    +
  • model_info [Optional ; Not repeatable]
    +Information on the LDA model.
    +
      +
    • source [Optional ; Not repeatable ; String]
      +The source of the model (typically, an organization).
    • +
    • author [Optional ; Not repeatable ; String]
      +The author(s) of the model.
    • +
    • version [Optional ; Not repeatable ; String]
      +The version of the model, which could be defined by a date or a number.
    • +
    • model_id [Optional ; Not repeatable ; String]
      +The unique ID given to the model.
    • +
    • nb_topics [Optional ; Not repeatable ; Numeric]
      +The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    • +
    • description [Optional ; Not repeatable ; String]
      +A brief description of the model.
    • +
    • corpus [Optional ; Not repeatable ; String]
      +A brief description of the corpus on which the LDA model was trained.
    • +
    • uri [Optional ; Not repeatable ; String]
      +A link to a web page where additional information on the model is available.
      +
    • +
  • +
  • topic_description [Optional ; Repeatable]
    +The topic composition of the document.
    +
      +
    • topic_id [Optional ; Not repeatable ; String]
      +The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    • +
    • topic_score [Optional ; Not repeatable ; Numeric]
      +The share of the topic in the document (%).
    • +
    • topic_label [Optional ; Not repeatable ; String]
      +The label of the topic, if any (not automatically generated by the LDA model).
    • +
    • topic_words [Optional ; Not repeatable]
      +The list of N keywords describing the topic (e.g., the top 5 words).
      +
        +
      • word [Optional ; Not repeatable ; String]
        +The word.
      • +
      • word_weight [Optional ; Not repeatable ; Numeric]
        +The weight of the word in the definition of the topic. This is specific to the model, not to a document.
      • +
    • +
  • +
+
```r
lda_topics = list(
  
   list(
  
      model_info = list(
        list(source      = "World Bank, Development Data Group",
             author      = "A.S.",
             version     = "2021-06-22",
             model_id    = "Mallet_WB_75",
             nb_topics   = 75,
             description = "LDA model, 75 topics, trained on Mallet",
             corpus      = "World Bank Documents and Reports (1950-2021)",
             uri         = "")
      ),
      
      topic_description = list(
      
        list(topic_id    = "topic_27",
             topic_score = 32,
             topic_label = "Education",
             topic_words = list(list(word = "school",      word_weight = ""),
                                list(word = "teacher",     word_weight = ""),
                                list(word = "student",     word_weight = ""),
                                list(word = "education",   word_weight = ""),
                                list(word = "grade",       word_weight = ""))),
        
        list(topic_id    = "topic_8",
             topic_score = 24,
             topic_label = "Gender",
             topic_words = list(list(word = "women",       word_weight = ""),
                                list(word = "gender",      word_weight = ""),
                                list(word = "man",         word_weight = ""),
                                list(word = "female",      word_weight = ""),
                                list(word = "male",        word_weight = ""))),
        
        list(topic_id    = "topic_39",
             topic_score = 22,
             topic_label = "Forced displacement",
             topic_words = list(list(word = "refugee",     word_weight = ""),
                                list(word = "programme",   word_weight = ""),
                                list(word = "country",     word_weight = ""),
                                list(word = "migration",   word_weight = ""),
                                list(word = "migrant",     word_weight = ""))),
                                
        list(topic_id    = "topic_40",
             topic_score = 11,
             topic_label = "Development policies",
             topic_words = list(list(word = "development", word_weight = ""),
                                list(word = "policy",      word_weight = ""),
                                list(word = "national",    word_weight = ""),
                                list(word = "strategy",    word_weight = ""),
                                list(word = "activity",    word_weight = "")))
                                
      )
      
   )
   
)
```
+
+
+

6.8 Embeddings

+

embeddings [Optional ; Repeatable]
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. In this case, the text would be a compilation of selected elements of the dataset metadata. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).

+

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.

+


+
"embeddings": [
+ {
+  "id": "string",
+  "description": "string",
+  "date": "string",
+  "vector": { }
+ }
+]
+


+

The embeddings element contains four metadata fields:

+
- id [Optional ; Not repeatable ; String]
  A unique identifier of the word embedding model used to generate the vector.
- description [Optional ; Not repeatable ; String]
  A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- date [Optional ; Not repeatable ; String]
  The date the model was trained (or a version date for the model).
- vector [Required ; Not repeatable ; Object]
  The numeric vector representing the document, provided as an object (array or string), e.g., [1,4,3,5,7,9].
+
+
+
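To illustrate how this block may be populated, the short R sketch below compiles a few metadata elements into a single text, submits it to an embedding service, and stores the returned vector in the embeddings element. The endpoint URL, the structure of the API response, and the model identifier are assumptions used for illustration only; they must be replaced with the model and service actually used.

library(httr)

# Compile selected metadata elements into a single text (illustrative content)
doc_text <- paste(
  "Bangladesh, Outline of camps of Rohingya refugees in Cox's Bazar",
  "refugee camp; forced displacement; rohingya",
  sep = " - "
)

# Submit the text to a pre-trained embedding model exposed via an API
# (hypothetical endpoint and response structure)
resp <- POST("https://example.org/api/embed",
             body = list(text = doc_text), encode = "json")
vec  <- content(resp)$vector    # assumed field name in the JSON response

# Store the vector in the repeatable embeddings element
embeddings <- list(
  list(
    id          = "example-embedding-model-v1",   # hypothetical model identifier
    description = "Pre-trained word embedding model; 200-dimension vectors",
    date        = "2021-06-01",
    vector      = vec
  )
)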

6.9 Additional

+

additional [Optional ; Not repeatable]

+The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.

+
+
+
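As a minimal sketch (with hypothetical element names), custom elements would be nested under additional as follows:

my_geo_metadata <- list(
  description = list(
    idno = "MY_DATASET_001",
    language = "eng"
  ),
  # Custom, catalog-specific elements go under `additional`, not elsewhere in the schema
  additional = list(
    internal_project_code = "GEO-2021-07",
    review_status         = "approved"
  )
)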

6.10 Complete examples

+
+

6.10.1 Example 1 (vector - shape files): Bangladesh, Outline of camps of Rohingya refugees in Cox’s Bazar, January 2021

+

In this first example, we use a geographic dataset that contains the outline of Rohingya refugee camps, settlements, and sites in Cox’s Bazar, Bangladesh. The dataset was imported from the Humanitarian Data Exchange website on March 3, 2021.

+

We include in the metadata a simple description of the features (variables) contained in the shape files. This information will significantly increase data discoverability, as it provides information on the content of the data files (which is not described elsewhere in the metadata).

+
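As a preview of what the full script below does, the basic layer properties and feature (attribute) names can be read directly from one of the shape files with the sf package. This is a minimal sketch, assuming the zipped shape file has already been downloaded and unzipped as in the script.

library(sf)

# Read one of the shape files and inspect its content
camps_al1 <- st_read("./AL1/200908_RRC_Outline_Camp_AL1.shp")
names(camps_al1)                 # attribute (feature) names to document
nrow(camps_al1)                  # number of geometric objects (geometricObjectCount)
st_geometry_type(camps_al1)[1]   # geometry type (e.g., POLYGON)
st_crs(camps_al1)                # coordinate reference system (referenceSystemInfo)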
Generating the metadata using R
+
library(nadar)
+library(sf)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/") 
+
+thumb = "shape_camps.JPG"
+
+# Download the data files (if not already downloaded)
+# Note: the data are frequently updated; the links below may have become invalid.
+# Visit: https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd for an update.
+
+base_url = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/"
+urls <- list(
+  paste0(base_url, "7cec91fb-d0a8-4781-9f8d-9b69772ef2fd/download/210118_rrc_geodata_al1al2al3.gdb.zip"),
+  paste0(base_url, "ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip"),
+  paste0(base_url, "bd5351e7-3ffc-4eaa-acbc-c6d917b5549c/download/200908_rrc_outline_camp_al1.kmz"),
+  paste0(base_url, "9d5693ec-eeb8-42ed-9b65-4c279f523276/download/200908_rrc_outline_block_al2.zip"),
+  paste0(base_url, "ed119ae4-b13d-4473-9afe-a8c36e07870b/download/200908_rrc_outline_block_al2.kmz"),
+  paste0(base_url, "0d2d87ae-52a5-4dca-b435-dcd9c617b417/download/210118_rrc_outline_subblock_al3.zip"),
+  paste0(base_url, "6286c4a5-d2ab-499a-b019-a7f0c327bd5f/download/210118_rrc_outline_subblock_al3.kmz")
+)  
+
+for(url in urls) {
+  f <- basename(url) 
+  if (!file.exists(f)) download.file(url, destfile=f, mode="wb")
+}
+
+# Unzip and read the shape files to extract information
+# The resulting objects contain the number of features, layers, geodetic CRS, etc.
+
+unzip("200908_rrc_outline_camp_al1.zip", exdir = "AL1")
+al1 <- st_read("./AL1/200908_RRC_Outline_Camp_AL1.shp")
+
+unzip("200908_rrc_outline_block_al2.zip", exdir = "AL2")
+al2 <- st_read("./AL2/200908_RRC_Outline_Block_AL2.shp")
+
+unzip("210118_rrc_outline_subblock_al3.zip", exdir = "AL3")
+al3 <- st_read("./AL3/210118_RRC_Outline_SubBlock_AL3.shp")
+
+# ---------------
+
+id = "BGD_2021_COX_CAMPS_GEO_OUTLINE"
+
+my_geo_metadata <- list(
+  
+  metadata_information = list(
+    title = "(Demo) Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+    producers = list(list(name = "NADA team")),
+    production_date = "2022-02-18"
+  ),
+  
+  description = list(
+    
+    idno = id,
+    
+    language = "eng",
+    
+    characterSet = list(codeListValue = "utf8"),
+    
+    hierarchyLevel = list("dataset"),
+    
+    contact = list(
+      list(
+        organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+        contactInfo = list(
+          address = list(country = "Bangladesh"),
+          onlineResource = list(
+            linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/",
+            name = "Website"
+          )
+        ),
+        role = "owner"
+      )
+    ),
+
+    dateStamp = "2021-01-20",
+      
+    metadataStandardName = "ISO 19115:2003/19139",
+      
+    spatialRepresentationInfo = list(
+      
+      # File 200908_rrc_outline_camp_al1.zip
+      list(
+        vectorSpatialRepresentationInfo = list(
+          topologyLevel = "geometryOnly",
+          geometricObjects = list(
+            geometricObjectType = "surface",
+            geometricObjectCount = "35" 
+          )
+        )
+      ),
+      
+      # File 200908_rrc_outline_block_al2.zip
+      list(
+        vectorSpatialRepresentationInfo = list(
+          topologyLevel = "geometryOnly",
+          geometricObjects = list(
+            geometricObjectType = "surface",
+            geometricObjectCount = "173" 
+          )
+        )
+      ),
+      
+      # File 210118_rrc_outline_subblock_al3.zip
+      list(
+        vectorSpatialRepresentationInfo = list(
+          topologyLevel = "geometryOnly",
+          geometricObjects = list(
+            geometricObjectType = "surface",
+            geometricObjectCount = "967" 
+          )
+        )
+      )
+      
+    ),
+    
+    referenceSystemInfo = list(
+      list(code = "4326", codeSpace = "EPSG"),
+      list(code = "84",   codeSpace = "WGS")
+    ),
+    
+    identificationInfo = list(
+      
+      list(
+        
+        citation = list(
+          title = "Bangladesh, Outline of camps of Rohingya refugees in Cox's Bazar, January 2021",
+          date = list(
+            list(date = "2021-01-20", type = "creation")
+          ),
+          citedResponsibleParty = list(
+            list(
+              organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+              contactInfo = list(
+                address = list(country = "Bangladesh"),
+                onlineResource = list(
+                  linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/",
+                  name = "Website"
+                )
+              ),
+              role = "owner"
+            )
+          )
+        ),
+
+        abstract = "These polygons were digitized through a combination of methodologies, originally using VHR satellite imagery and GPS points collected in the field, verified and amended according to Site Management Sector, RRRC, Camp in Charge (CiC) officers inputs, with technical support from other partners.",
+        
+        purpose = "Inform the UNHCR operations (and other support agencies') in refugee camps in Cox's Bazar.",
+        
+        credit = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+        
+        status = "completed",
+        
+        pointOfContact = list(
+          list(
+            organisationName = "Site Management Sector, RRRC, Inter Sector Coordination Group (ISCG)",
+            contactInfo = list(
+              address = list(country = "Bangladesh"),
+              onlineResource = list(
+                linkage = "https://www.humanitarianresponse.info/en/operations/bangladesh/",
+                name = "Website"
+              )
+            ),
+            role = "pointOfContact"
+          )  
+        ),
+        
+        resourceMaintenance = list(
+          list(maintenanceOrUpdateFrequency = "asNeeded")
+        ),
+        
+        # graphicOverview = list(
+        #   list(fileName = "",
+        #        fileDescription = "",
+        #        fileType = "")
+        # ),
+        
+        resourceFormats = list(
+          list(name = "application/zip", 
+               specification = "ESRI Shapefile (zipped)", 
+               FormatDistributor = list(organisationName = "ESRI")
+          ),
+          list(name = "application/vnd.google-earth.kmz", 
+               specification = "KMZ file", 
+               FormatDistributor = list(organisationName = "Google")
+          ),
+          list(name = "ESRI Geodatabase", 
+               FormatDistributor = list(organisationName = "ESRI")
+          )
+        ),
+        
+        descriptiveKeywords = list(
+          list(keyword = "refugee camp"), 
+          list(keyword = "forced displacement"),
+          list(keyword = "rohingya")
+        ),
+    
+        resourceConstraints = list(
+          list(
+            legalConstraints = list(
+              uselimitation = list("License: http://creativecommons.org/publicdomain/zero/1.0/legalcode"),
+              accessConstraints = list("unrestricted"),
+              useConstraints = list("licenceUnrestricted")
+            )
+          )
+        ),
+        
+        extent = list(
+          geographicElement = list(
+            list(
+              geographicBoundingBox = list(
+                southBoundLatitude = 20.91856,  
+                westBoundLongitude = 92.12973,
+                northBoundLatitude = 21.22292,
+                eastBoundLongitude = 92.26863
+              )
+            )
+          )
+        ),
+        
+        spatialRepresentationType = "vector",
+        
+        language = list("eng")
+      
+      )
+    
+    ),
+    
+    distributionInfo = list(
+      
+      distributionFormat = list(
+        list(name = "application/zip", 
+             specification = "ESRI Shapefile (zipped)", 
+             FormatDistributor = list(organisationName = "ESRI")
+        ),
+        list(name = "application/vnd.google-earth.kmz", 
+             specification = "KMZ file", 
+             FormatDistributor = list(organisationName = "Google")
+        ),
+        list(name = "ESRI Geodatabase", 
+             FormatDistributor = list(organisationName = "ESRI")
+        )
+      ),
+      
+      distributor = list(
+        list(
+          organisationName = "United Nations Office for the Coordination of Humanitarian Affairs (OCHA)", 
+          contactInfo = list(
+            onlineResource = list(
+              linkage = "https://data.humdata.org/dataset/outline-of-camps-sites-of-rohingya-refugees-in-cox-s-bazar-bangladesh",
+              name = "Website"
+            )
+          )
+        )
+      )#,
+      
+      # transferOptions = list(
+      #   list(
+      #     onLine = list(
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/7cec91fb-d0a8-4781-9f8d-9b69772ef2fd/download/210118_rrc_geodata_al1al2al3.gdb.zip",
+      #         name = "210118_RRC_GeoData_AL1,AL2,AL3.gdb.zip",
+      #         description = "This zipped geodatabase file (GIS) contains the Camp boundary (Admin level-1) and and camp-block boundary (admin level-2 or camp sub-division) and sub-block boundary of Rohingya refugee camps and administrative level-3 or sub block division of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ace4b0a6-ef0f-46e4-a50a-8c552cfe7bf3/download/200908_rrc_outline_camp_al1.zip",
+      #         name = "200908_RRC_Outline_Camp_AL1.zip",
+      #         description = "This zipped shape file (GIS) contains the Camp boundary (Admin level-1) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/bd5351e7-3ffc-4eaa-acbc-c6d917b5549c/download/200908_rrc_outline_camp_al1.kmz",
+      #         name = "200908_RRC_Outline_Camp_AL1.kmzKMZ",
+      #         description = "This kmz file (Google Earth) contains the Camp boundary (Admin level-1) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/9d5693ec-eeb8-42ed-9b65-4c279f523276/download/200908_rrc_outline_block_al2.zip",
+      #         name = "200908_RRC_Outline_Block_AL2.zip",
+      #         description = "This zipped shape file (GIS) contains the camp-block boundary (admin level-2 or camp sub-division) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/ed119ae4-b13d-4473-9afe-a8c36e07870b/download/200908_rrc_outline_block_al2.kmz",
+      #         name = "200908_RRC_Outline_Block_AL2.kmzKMZ",
+      #         description = "This kmz file (Google Earth) contains the camp-block boundary (admin level-2 or camp sub-division) of Rohinya refugees in Cox's Bazar, Bangladesh. Updated: September 8, 2020",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/0d2d87ae-52a5-4dca-b435-dcd9c617b417/download/210118_rrc_outline_subblock_al3.zip",
+      #         name = "210118_RRC_Outline_SubBlock_AL3.zip",
+      #         description = "This zipped shape file (GIS) contains the camp-sub-block (Admin level-3) of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/1a67eb3b-57d8-4062-b562-049ad62a85fd/resource/6286c4a5-d2ab-499a-b019-a7f0c327bd5f/download/210118_rrc_outline_subblock_al3.kmz",
+      #         name = "210118_RRC_Outline_SubBlock_AL3.kmzKMZ",
+      #         description = "This kmz file (Google Earth) contains the camp-sub-block (Admin level-3) of Camp 1E-1W, Camp 2E-2W, Camp 8E-8W, Camp 4 Extension, Camp 3-7, Camp 9-20, and Camp 21-27 in Cox's Bazar, Bangladesh. Updated: January 20, 2021",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       )
+      #     )
+      #   )
+      # )
+      
+    ),
+
+    dataQualityInfo = list(
+      list(
+        scope = "dataset", 
+        lineage = list(
+          statement = "The camps are continuously expanding, and Camp Boundaries are structured around the GoB, RRRC official governance structure of the camps, taking into account the potential new land allocation. The database is kept as accurate as possible, given these challenges."
+        )
+      )  
+    ),
+    
+    metadataMaintenance = list(maintenanceAndUpdateFrequency = "asNeeded"),
+    
+    feature_catalogue = list(
+      
+      name = "Feature Catalogue dataset xxxxx",
+      scope = list("3 shape files: al1, al2, al3"),
+    
+      featureType = list(
+        list(
+          typeName =  "",
+          definition = "",
+          carrierOfCharacteristics = list(
+            list(
+              memberName = 'District',
+              definition = "Cox's Bazar"
+            ),
+            list(
+              memberName = 'Upazila',
+              definition = 'Teknaf, Ukhia'
+            ),
+            list(
+              memberName = 'Settlement',
+              definition = 'Collective site; Collective site with host community'
+            ),
+            list(
+              memberName = 'Union',
+              definition = 'Baharchhara; Nhilla; Palong Khali; Raja Palong; Whykong'
+            ),
+            list(
+              memberName = 'Name_Alias',
+              definition = 'Alikhali; Bagghona-Putibonia; Camp 20 Extension; 
+                            Camp 4; Camp 4 Extension; Chakmarkul; Choukhali; 
+                            Hakimpara; Jadimura; Jamtoli-Baggona; Jomer Chora; 
+                            Kutupalong RC; Modur Chora; Nayapara; Nayapara RC; 
+                            Shamlapur; Tasnimarkhola; Tasnimarkhola-Burmapara; 
+                            Unchiprang'
+            ),
+            list(
+              memberName = 'SSID',
+              definition = 'CXB-017 to CXB-235'
+            ),
+            list(
+              memberName = 'SMSD__Cnam',
+              definition = 'Camp 01E; Camp 01W; Camp 02E; Camp 02W; Camp 03; Camp 04;
+                            Camp 04X; Camp 05; Camp 06; Camp 07; Camp 08E; Camp 08W;
+                            Camp 09; Camp 10; Camp 11; Camp 12; Camp 13; Camp 14;
+                            Camp 15; Camp 16; Camp 17; Camp 18; Camp 19; Camp 20;
+                            Camp 20X; Camp 21; Camp 22; Camp 23; Camp 24; Camp 25;
+                            Camp 26; Camp 27; Camp KRC; Camp NRC; Choukhali'
+            ),
+            list(
+              memberName = 'NPM_Name',
+              definition = 'Camp 01E; Camp 01W; Camp 02E; Camp 02W; Camp 03; 
+                            Camp 04; Camp 04 Extension; Camp 05; Camp 06; Camp 07; 
+                            Camp 08E; Camp 08W; Camp 09; Camp 10; Camp 11; Camp 12; 
+                            Camp 13; Camp 14 (Hakimpara); Camp 15 (Jamtoli); 
+                            Camp 16 (Potibonia); Camp 17; Camp 18; Camp 19; Camp 20; 
+                            Camp 20 Extension; Camp 21 (Chakmarkul); Camp 22 (Unchiprang); 
+                            Camp 23 (Shamlapur); Camp 24 (Leda); Camp 25 (Ali Khali); 
+                            Camp 26 (Nayapara); Camp 27 (Jadimura); Choukhali; 
+                            Kutupalong RC; Nayapara RC'
+            ),
+            list(
+              memberName = 'Area_Acres',
+              definition = 'Area in acres'
+            ),
+            list(
+              memberName = 'PeriMe_Met',
+              definition = 'Perimeter in meters'
+            ),
+            list(
+              memberName = 'Camp_Name',
+              definition = 'Camp 10; Camp 11; Camp 12; Camp 13; Camp 14; Camp 15; 
+                            Camp 16; Camp 17; Camp 18; Camp 19; Camp 1E; Camp 1W; 
+                            Camp 20; Camp 20 Extension; Camp 21; Camp 22; Camp 23; 
+                            Camp 24; Camp 25; Camp 26; Camp 27; Camp 2E; Camp 2W; 
+                            Camp 3; Camp 4; Camp 4 Extension; Camp 5; Camp 6; 
+                            Camp 7; Camp 8E; Camp 8W; Camp 9; Choukhali; 
+                            Kutupalong RC; Nayapara RC'
+            ),
+            list(
+              memberName = 'Area_SqM',
+              definition = 'Area in square km'
+            ),
+            list(
+              memberName = 'Latitude'
+            ),
+            list(
+              memberName = 'Longitude'
+            ),
+            list(
+              memberName = 'geometry'
+            )
+            #,
+            # ... the featureType entries for the AL2 and AL3 layers would be completed here in the same way
+          )
+        )
+      )
+    )
+
+  )
+  
+)  
+
+
+# Publish in NADA catalog
+
+geospatial_add(
+  idno = id, 
+  metadata = my_geo_metadata, 
+  repositoryid = "central", 
+  published = 1, 
+  thumbnail = thumb, 
+  overwrite = "yes"
+)
+
+# Add a link to HDX as an external resource
+
+external_resources_add(
+  title = "Humanitarian Data Exchange website",
+  idno = id,
+  dctype = "web",
+  file_path = "https://data.humdata.org/",
+  overwrite = "yes"
+)
+

The result in NADA

+

After running the script, the data and metadata will be available in NADA.

+


+ +

+

Generating the metadata using Python

+
+
+

6.10.2 Example 2 (vector, CSV data): Syria Refugee Sites (OCHA)

+

The Syria Refugee Sites dataset used as a second example contains verified data about the geographic location (point geometry), name, and operational status of refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. Only refugee sites operated by the United Nations High Commissioner for Refugees (UNHCR) or the Government of Turkey are included. Data are provided as CSV, TSV and XLSX files. This example demonstrates the use of the ISO 19115 standard.

Generating the metadata using R
+
library(nadar)
+library(sf)
+library(sp)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/") 
+
+options(stringsAsFactors = FALSE)
+
+# Download and read the data file
+
+url = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/cc3e9e48-e363-404e-948b-e42d13c316d9/download/syria_refugeesites_2016jan21_hiu_dos.csv"
+data_file = basename(url)
+if(!file.exists(data_file)) download.file(url, destfile = data_file, mode = "wb")
+
+sf <- st_read(data_file)
+sp <- as.data.frame(sf)
+sp$Long <- as(sp$Long, "numeric")
+sp$Lat  <- as(sp$Lat,  "numeric")
+coordinates(sp) <- c("Long", "Lat")
+proj4string(sp) <- CRS("+init=epsg:4326")
+
+# Generate the metadata
+
+id <- "EX2_SYR_REFUGEE_SITES"
+
+my_geo_data <- list(
+  
+  metadata_information = list(
+    title = "(Demo) Syria, Refugee Sites",
+    producers = list(
+      list(name = "NADA team")
+    ),
+    production_date = "2022-02-18"
+  ),
+
+  description = list(
+    
+    idno = id,
+    
+    language = "eng",
+    
+    characterSet = list(codeListValue = "utf8"),
+    
+    hierarchyLevel = list("dataset"),
+    
+    contact = list(
+      list(
+        organisationName = "U.S. Department of State - Humanitarian Information Unit",
+        contactInfo = list(
+          address = list(electronicEmailAddress = "HIU_DATA@state.gov"),
+          onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website")
+        ),
+        role = "pointOfContact"
+      )
+    ),
+    
+    dateStamp = "2018-06-18",
+  
+    metadataStandardName = "ISO 19115:2003/19139",
+    
+    spatialRepresentationInfo = list(
+      list(
+        vectorSpatialRepresentation = list(
+          topologyLevel = "geometryOnly",
+          geometricObjects = list(
+            list(
+              geometricObjectType = "point",
+              geometricObjectCount = nrow(sp)
+            )
+          )
+        )
+      )
+    ),
+    
+    referenceSystemInfo = list(
+      list(code = "4326", codeSpace = "EPSG")
+    ),
+    
+    identificationInfo = list(
+      
+      list(
+
+        citation = list(
+          title = "Syria Refugee Sites",
+          date = list(
+            list(date = "2016-01-14", type = "creation"),
+            list(date = "2016-02-04", type = "publication")
+          ),
+          identifier = list(authority = "IHSN", code = id),
+          citedResponsibleParty = list(
+            list(
+              individualName = "Humanitarian Information Unit",
+              organisationName = "U.S. Department of State - Humanitarian Information Unit",
+              contactInfo = list(
+                address = list(
+                  electronicEmailAddress = "HIU_DATA@state.gov"
+                ),
+                onlineResource = list(
+                  linkage = "http://hiu.state.gov/",
+                  name = "Website"
+                )
+              ),
+              role = "owner"
+            )
+          )
+        ),
+        
+        abstract = "The 'Syria Refugee Sites' dataset is compiled by the U.S. Department of State, Humanitarian Information Unit (INR/GGI/HIU). This dataset contains open source derived data about the geographic location (point geometry), name, and operational status of refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. Only refugee sites operated by the United Nations High Commissioner for Refugees (UNHCR) or the Government of Turkey are included. Compiled by the U.S Department of State, Humanitarian Information Unit (HIU), each attribute in the dataset (including name, location, and status) is verified against multiple sources. The name and status are obtained from UN and AFAD reporting and the UNHCR data portal (accessible at http://data.unhcr.org/syrianrefugees/regional.php). The locations are obtained from both the U.S. Department of State, PRM and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) (accessible at http://geonames.nga.mil/ggmagaz/). The name and status for each refugee site is verified with PRM.  Locations are verified using high-resolution commercial satellite imagery and/or known areas of population. Additionally, all data is checked against various news sources. The data contained herein is entirely unclassified and is current as of 14 January 2016. The data is updated as needed.",
+
+        purpose = "The 'Syria Refugee Sites' dataset contains verified data about the refugee sites hosting Syrian refugees in Turkey, Jordan, and Iraq. This file is compiled by the U.S Department of State, Humanitarian Information Unit (HIU) and is used in the production of the unclassified 'Syria: Numbers and Locations of Syrian Refugees' map product (accessible at https://hiu.state.gov/Pages/MiddleEast.aspx). The data contained herein is entirely unclassified and is current as of 14 January 2016.",
+
+        credit = "U.S. Department of State - Humanitarian Information Unit",
+
+        status = "onGoing",
+        
+        pointOfContact = list(
+          list(
+            individualName = "Humanitarian Information Unit",
+            organisationName = "U.S. Department of State - Humanitarian Information Unit",
+            contactInfo = list(
+              address = list(electronicEmailAddress = "HIU_DATA@state.gov"),
+              onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website")
+            ),
+            role = "pointOfContact"
+          )
+        ),
+        
+        resourceMaintenance = list(
+          list(maintenanceOrUpdateFrequency = "fortnightly")
+        ),
+        
+        # graphicOverview = list(),
+        
+        resourceFormat = list(
+          list(
+            name = "text/csv",
+            specification = "RFC4180 -  Common Format and MIME Type for Comma-Separated Values (CSV) Files"
+          ),
+          list(
+            name = "text/tab-separated-values",
+            specification = "Tab-Separated Values (TSV)"
+          ),
+          list(
+            name = "xlsx",
+            specification = "Microsoft Excel (XLSX)"
+          )
+        ),
+        
+        descriptiveKeywords = list(
+          list(type = "theme", keyword = "Middle East"),
+          list(type = "theme", keyword = "Refugees"),
+          list(type = "theme", keyword = "Displacement"),
+          list(type = "theme", keyword = "Refugee Camps"),
+          list(type = "theme", keyword = "UNHCR"),
+          list(type = "place", keyword = "Syria"),
+          list(type = "place", keyword = "Turkey"),
+          list(type = "place", keyword = "Lebanon"),
+          list(type = "place", keyword = "Jordan"),
+          list(type = "place", keyword = "Iraq"),
+          list(type = "place", keyword = "Egypt")
+        ),
+
+        resourceConstraints = list(
+          list(
+            legalConstraints = list(
+              uselimitation = list("License: Creative Commons Attribution 4.0 International License"),
+              accessConstraints = list("unrestricted"),
+              useConstraints = list("licenceUnrestricted")
+            )
+          ),
+          list(
+            securityConstraints = list(
+              classification = "unclassified",
+              handlingDescription = "All data contained herein are strictly unclassified with no restrictions on distribution. Accuracy of geographic data is not assured by the U.S. Department of State."
+            )
+          )
+        ),
+        
+        extent = list(
+          geographicElement = list(
+            list(
+              geographicBoundingBox = list(
+                southBoundLatitude = bbox(sp)[2,1],
+                westBoundLongitude = bbox(sp)[1,1],
+                northBoundLatitude = bbox(sp)[2,2],
+                eastBoundLongitude = bbox(sp)[1,2]
+              )
+            )
+          )
+        ),
+
+        spatialRepresentationType = "vector",
+
+        language = list("eng"),
+
+        characterSet = list(
+          list(codeListValue = "utf8")
+        ),
+
+        topicCategory = list("society")
+      
+      )  
+
+    ),
+    
+    distributionInfo = list(
+      
+      distributionFormat = list(
+        list(
+          name = "text/csv", 
+          specification = "RFC4180 -  Common Format and MIME Type for Comma-Separated Values (CSV) Files"
+        ),
+        list(
+          name = "text/tab-separated-values", 
+          specification = "Tab-Separated Values (TSV)"
+        ),
+        list(
+          name = "xlsx", 
+          specification = "Microsoft Excel (XLSX)"
+        )
+      ),
+      
+      distributor = list(
+        list(
+          individualName = "Humanitarian Information Unit",
+          organisationName = "U.S. Department of State - Humanitarian Information Unit",
+          contactInfo = list(
+            address = list(electronicEmailAddress = "HIU_DATA@state.gov"),
+            onlineResource = list(linkage = "http://hiu.state.gov/", name = "Website")
+          ),
+          role = "distributor"
+        )
+      ) #,
+      
+      # transferOptions = list(
+      #   list(
+      #     onLine = list(
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/syria-refugee-sites",
+      #         name = "Source metadata (HTML View)",
+      #         protocol = "WWW:LINK-1.0-http--link",
+      #         "function" = "Information"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/cc3e9e48-e363-404e-948b-e42d13c316d9/download/syria_refugeesites_2016jan21_hiu_dos.csv",
+      #         name = "syria_refugeesites_2016jan21_hiu_dos.csv",
+      #         description = "Data download (CSV)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/42f7884c-f54d-478c-a970-623945740e5d/download/syria_refugeesites_2016jan21_hiu_dos.tsv",
+      #         name = "syria_refugeesites_2016jan21_hiu_dos.tsv",
+      #         description = "Data download (TSV)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.humdata.org/dataset/ff383a8b-396a-4d78-b403-687b0a783769/resource/59660c9a-e41a-4d54-bfc2-dd8fd1032c97/download/syria_refugeesites_2016jan21_hiu_dos.xlsx",
+      #         name = "syria_refugeesites_2016jan21_hiu_dos.xlsx",
+      #         description = "Data download (XLSX)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       )
+      #     )
+      #   )
+      # )
+      
+    ),
+
+    dataQualityInfo = list(
+      list(
+        scope = "dataset",
+        lineage = list(
+          statement = "Methodology: Compiled by the U.S Department of State, Humanitarian Information Unit (INR/GGI/HIU), each attribute in the dataset (including name, location, and status) is verified against multiple sources. The name and status are obtained from the UNHCR data portal (accessible at http://data.unhcr.org/syrianrefugees/regional.php). The locations are obtained from the U.S. Department of State, Bureau of Population, Refugees, and Migration (PRM) and the National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) (accessible at http://geonames.nga.mil/ggmagaz/). The name and status for each refugee site is verified with PRM. Locations are verified using high-resolution commercial satellite imagery and/or known areas of population. Additionally, all data is checked against various news sources."
+        )
+      )
+    ),
+
+    metadataMaintenance = list(maintenanceAndUpdateFrequency = "fortnightly")
+    
+  )
+  
+)
+
+# Publish in NADA catalog
+
+geospatial_add(
+  idno = id, 
+  metadata = my_geo_data, 
+  repositoryid = "central", 
+  published = 1, 
+  thumbnail = NULL, 
+  overwrite = "yes"
+)
Generating the metadata using Python
The result in NADA
+
+
+

6.10.3 Example 3 (vector, with Feature Catalogue) - The GDIS (beta) dataset

+

This example demonstrates the combined use of the ISO 19115 (geographic dataset) and ISO 19110 (feature catalogue) standards. Documenting the features contained in a dataset makes the metadata richer and more discoverable. It is recommended to provide such information, which can easily be extracted from shape files and other geographic data files. The dataset used for this example is the Geocoded Disasters (GDIS) Dataset, v1 (1960-2018).

+
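Before documenting each feature by hand (or programmatically, as in the commented-out block further down in the script), the attribute names, types, and number of distinct values can be inspected directly from the data. This is a minimal sketch; the file name assumes the GeoPackage distribution of the GDIS dataset has been downloaded and unzipped.

library(sf)

# Inspect the attributes to prepare the ISO 19110 feature catalogue
gdis <- st_read("pend-gdis-1960-2018-disasterlocations.gpkg")
df   <- st_drop_geometry(gdis)
for (col in names(df)) {
  cat(col, ":", class(df[[col]]), "-", length(unique(df[[col]])), "distinct values\n")
}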
library(nadar)
+library(sf)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/") 
+
+thumb = "disaster.JPG"
+
+# Load the dataset (2 Gb) to extract some information
+
+load("pend-gdis-1960-2018-disasterlocations.rdata")
+data = GDIS_disasterlocations
+df = as.data.frame(GDIS_disasterlocations)
+column_names = colnames(df)[!colnames(df) %in% c("geometry","centroid")]
+exclude_listed_values_for = c("longitude", "latitude") #exclude ISO 19110 listed values for these columns
+
+# Generate the metadata 
+
+id <- "GDIS_TEST_01"
+
+ttl = "Geocoded Disasters (GDIS) Dataset, v1 (1960–2018)"
+
+my_geo_data <- list(
+  
+  metadata_information = list(
+    title = ttl,
+    idno = id,
+    producers = list(
+      list(name = "NADA team")
+    ),
+    production_date = "2022-02-18",
+    version = "v1.0 2022-02"
+  ),
+
+  description = list(
+  
+    idno = id,
+    language = "English",
+    characterSet = list(
+      codeListValue = "utf8",
+      codeList = "http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CharacterSetCode"
+    ),
+    hierarchyLevel = list("dataset"),
+    contact = list(
+      list(
+        organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+        contactInfo = list(
+          phone = list(
+            voice = "+1 845-365-8920",
+            facsimile = "+1 845-365-8922"
+          ),
+          address = list(
+            deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000",
+            city = "Palisades, NY",
+            postalCode = "10964",
+            electronicEmailAddress = "ciesin.info@ciesin.columbia.edu"
+          )
+        ),
+        role = "pointOfContact"
+      )
+    ),
+    dateStamp = "2021-03-10",
+    metadataStandardName = "ISO 19115:2003/19139",
+    dataSetURI = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018",
+    
+    spatialRepresentationInfo = list(
+      list(
+        vectorSpatialRepresentation = list(
+          topologyLevel = "geometryOnly",
+          geometricObjects = list(
+            list(
+              geometricObjectType = tolower(as.character(st_geometry_type(data)[1])),
+              geometricObjectCount = nrow(data)
+            )
+          )
+        )
+      )
+    ),
+    
+    referenceSystemInfo = list(
+      list(code = "4326", codeSpace = "EPSG")
+    ),
+    
+    identificationInfo = list(
+      list(
+        citation = list(
+          title = ttl,
+          date = list(
+            list(date = "2021-03-10", type = "publication")
+          ),
+          identifier = list(authority= "DOI", code = "10.7927/zz3b-8y61"),
+          citedResponsibleParty = list(
+            list(
+              individualName = "Rosvold, E., and H. Buhaug",
+              role = "owner"
+            )
+          ),
+          edition = "1.00",
+          presentationForm = list("raster", "map", "map service"),
+          series = list(
+            name = "Scientific Data",
+            issueIdentification = "8:61"
+          )
+        ),
+        abstract = "The Geocoded Disasters (GDIS) Dataset is a geocoded extension of a selection of natural disasters from the Centre for Research on the Epidemiology of Disasters' (CRED) Emergency Events Database (EM-DAT). The data set encompasses 39,953 locations for 9,924 disasters that occurred worldwide in the years 1960 to 2018. All floods, storms (typhoons, monsoons etc.), earthquakes, landslides, droughts, volcanic activity and extreme temperatures that were recorded in EM-DAT during these 58 years and could be geocoded are included in the data set. The highest spatial resolution in the data set corresponds to administrative level 3 (usually district/commune/village) in the Global Administrative Areas database (GADM, 2018). The vast majority of the locations are administrative level 1 (typically state/province/region).",
+        purpose = "To provide the subnational location for different types of natural disasters recorded in EM-DAT between 1960-2018.",
+        credit = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+        status = "completed",
+        pointOfContact = list(
+          list(
+            organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+            contactInfo = list(
+              phone = list(
+                voice = "+1 845-365-8920",
+                facsimile = "+1 845-365-8922"
+              ),
+              address = list(
+                deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000",
+                city = "Palisades, NY",
+                postalCode = "10964",
+                electronicEmailAddress = "ciesin.info@ciesin.columbia.edu"
+              )
+            ),
+            role = "pointOfContact"
+          )
+        ),
+        resourceMaintenance = list(
+          list(maintenanceOrUpdateFrequency = "asNeeded")
+        ),
+        graphicOverview = list(
+          list(
+            fileName = "https://sedac.ciesin.columbia.edu/downloads/maps/pend/pend-gdis-1960-2018/sedac-logo.jpg", 
+            fileDescription = "Geocoded Disasters (GDIS) Dataset", 
+            fileType = "image/jpeg"
+          )
+        ),
+        resourceFormat = list(
+          list(
+            name = "OpenFileGDB", 
+            specification = "ESRI - GeoDatabase"
+          ),
+          list(
+            name = "text/csv", 
+            specification = "RFC4180 -  Common Format and MIME Type for Comma-Separated Values (CSV) Files"
+          ),
+          list(
+            name = "application/geopackage+sqlite3", 
+            specification = "http://www.geopackage.org/spec/"
+          )
+        ),
+        descriptiveKeywords = list(
+          list(type = "theme", keyword = "climatology"),
+          list(type = "theme", keyword = "meteorology"),
+          list(type = "theme", keyword = "atmosphere"),
+          list(type = "theme", keyword = "earth science", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "human dimension", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "natural hazard", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "drought", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "earthquake", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "flood", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "landslides", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "tropical cyclones", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "cyclones", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6"),
+          list(type = "theme", keyword = "volcanic eruption", 
+               thesaurusName = "GCMD Science Keywords, Version 8.6")
+        ),
+        resourceConstraints = list(
+          list(
+            legalConstraints = list(
+              uselimitation = list(
+                "This work is licensed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0). Users are free to use, copy, distribute, transmit, and adapt the work for commercial and non-commercial purposes, without restriction, as long as clear attribution of the source is provided.",
+                "Recommended citation: Rosvold, E.L., Buhaug, H. GDIS, a global dataset of geocoded disaster locations. Scientific Data 8, 61 (2021). https://doi.org/10.1038/s41597-021-00846-6."
+              ),
+              accessConstraints = list("unrestricted"),
+              useConstraints = list("licenceUnrestricted")
+            )
+          )
+        ),
+        extent = list(
+          geographicElement = list(
+            list(
+              geographicBoundingBox = list(
+                westBoundLongitude = -180,
+                eastBoundLongitude = 180,
+                southBoundLatitude = -58,
+                northBoundLatitude = 90
+              )
+            )
+          )#, 
+          # temporalElement = list(
+          #   list(
+          #     extent = list(
+          #       TimePeriod = list(
+          #         beginPosition = "1960-01-01",
+          #         endPosition = "2018-12-31"
+          #       )
+          #     )
+          #   )
+          # )
+        ),
+        spatialRepresentationType = "vector",
+        language = list("eng"),
+        characterSet = list(
+          list(
+            codeListValue = "utf8",
+            codeList = "http://standards.iso.org/iso/19139/resources/gmxCodelists.xml#MD_CharacterSetCode"
+          )
+        )
+      )
+    ),
+    
+    distributionInfo = list(
+      
+      distributionFormat = list(
+        list(name = "OpenFileGDB", 
+             specification = "ESRI - GeoDatabase", 
+             fileDecompressionTechnique = "Unzip"),
+        list(name = "text/csv", 
+             specification = "RFC4180 -  Common Format and MIME Type for Comma-Separated Values (CSV) Files", 
+             fileDecompressionTechnique = "Unzip"),
+        list(name = "application/geopackage+sqlite3", 
+             specification = "http://www.geopackage.org/spec/", 
+             fileDecompressionTechnique = "Unzip")
+      ),
+      
+      distributor = list(
+        list(
+          organisationName = "NASA Socioeconomic Data and Applications Center (SEDAC)",
+          contactInfo = list(
+            phone = list(
+              voice = "+1 845-365-8920",
+              facsimile = "+1 845-365-8922"
+            ),
+            address = list(
+              deliveryPoint = "CIESIN, Columbia University, 61 Route 9W, P.O. Box 1000",
+              city = "Palisades, NY",
+              postalCode = "10964",
+              electronicEmailAddress = "ciesin.info@ciesin.columbia.edu"
+            )
+          ),
+          role = "pointOfContact"
+        )
+      )#,
+      
+      # transferOptions = list(
+      #   list(
+      #     onLine = list(
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018",
+      #         name = "Source metadata (HTML View)",
+      #         protocol = "WWW:LINK-1.0-http--link",
+      #         "function" = "Information"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-gdb.zip",
+      #         name = "pend-gdis-1960-2018-disasterlocations-gdb.zip",
+      #         description = "Data download (Geodatabase)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-gpkg.zip",
+      #         name = "pend-gdis-1960-2018-disasterlocations-gpkg.zip",
+      #         description = "Data download (GeoPackage)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-csv.zip",
+      #         name = "pend-gdis-1960-2018-disasterlocations-csv.zip",
+      #         description="Data download (CSV)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-priogrid-key-csv.zip",
+      #         name = "pend-gdis-1960-2018-priogrid-key-csv.zip",
+      #         description = "Data download (CSV)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-disasterlocations-rdata.zip",
+      #         name = "pend-gdis-1960-2018-disasterlocations-rdata.zip",
+      #         description = "Data download (RData)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-replicationcode-r.zip",
+      #         name = "pend-gdis-1960-2018-replicationcode-r.zip",
+      #         description = "Source code (R)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://beta.sedac.ciesin.columbia.edu/downloads/data/pend/pend-gdis-1960-2018/pend-gdis-1960-2018-codebook.pdf",
+      #         name = "pend-gdis-1960-2018-codebook.pdf",
+      #         description = "Codebook (PDF)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       )
+      #     )
+      #   )
+      # )
+    ),
+    
+    dataQualityInfo = list(
+      list(
+        scope = "dataset",
+        lineage = list(
+          statement = "CIESIN follows procedures designed to ensure that data disseminated by CIESIN are of reasonable quality. If, despite these procedures, users encounter apparent errors or misstatements in the data, they should contact SEDAC User Services at +1 845-365-8920 or via email at ciesin.info@ciesin.columbia.edu. Neither CIESIN nor NASA verifies or guarantees the accuracy, reliability, or completeness of any data provided. CIESIN provides this data without warranty of any kind whatsoever, either expressed or implied. CIESIN shall not be liable for incidental, consequential, or special damages arising out of the use of any data provided by CIESIN."
+        )
+      )
+    ),
+    
+    metadataMaintenance = list(
+      maintenanceAndUpdateFrequency = "asNeeded"
+    )
+  
+  ),
+
+  # Feature catalog (ISO 19110/19139)
+  
+  feature_catalogue = list(
+    name = sprintf("%s - Feature Catalogue", ttl),
+    featureType = list(
+      list(
+        typeName =  ttl,
+        definition = "Disaster locations",
+        code = "pend-gdis-1960-2018-disasterlocations",
+        isAbstract = FALSE,
+        # carrierOfCharacteristics = lapply(column_names, function(column_name){
+        #   print(column_name)
+        #   values = unique(df[,column_name])
+        #   values = values[order(values)]
+        #   member = list(
+        #     memberName = sprintf("Label for '%s'", column_name),
+        #     definition = sprintf("Definition for '%s'", column_name),
+        #     cardinality = list(lower = 1, upper = 1),
+        #     code = column_name,
+        #     valueType = switch(class(df[,column_name]), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+        #     valueMeasurementUnit = NA,
+        #     listedValue = if(column_name %in% exclude_listed_values_for) {list()} else {lapply(values, function(x){ list(label = sprintf("Label for '%s'", x), code = x, definition = sprintf("Definition for '%s'", x)) })}
+        #   )
+        #   return(member)
+        # })
+        carrierOfCharacteristics = list(
+          list(
+            memberName = 'id',
+            definition = 'ID-variable identifying each disaster in the geocoded dataset. Contrary to disasterno each disaster in each country has a unique id number',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA01', # short for Disaster Feature Attribute 01
+            valueType = switch(class(df[,'id']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'country',
+            definition = 'Name of the country within which the location is',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA02',
+            valueType = switch(class(df[,'country']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'iso3',
+            definition = 'Three-letter country code, ISO 3166-1',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA03',
+            valueType = switch(class(df[,'iso3']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'gwno',
+            definition = 'Gleditsch and Ward country code (Gleditsch & Ward, 1999)',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA04',
+            valueType = switch(class(df[,'gwno']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'geo_id',
+            definition = 'Unique ID-variable for each location',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA05',
+            valueType = switch(class(df[,'geo_id']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'geolocation',
+            definition = 'Name of the location of the observation, which corresponds to the highest (most disaggregated) level available. For instance, observations at the third administrative level will have geolocation values identical to the adm3 variable',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA06',
+            valueType = switch(class(df[,'geolocation']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'level',
+            definition = 'The administrative level of the observation, ranges from 1-3 where 3 is the most disaggregated',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA07',
+            valueType = switch(class(df[,'level']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'adm1',
+            definition = 'Name of administrative level 1 for the given location',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA08',
+            valueType = switch(class(df[,'adm1']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'adm2',
+            definition = 'Name of administrative level 2 for the given location',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA09',
+            valueType = switch(class(df[,'adm2']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'location',
+            definition = 'Name of administrative level 3 for the given location',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA10',
+            valueType = switch(class(df[,'location']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'historical',
+            definition = 'Marks whether the disaster happened in a country that has since changed, takes the value 1 if the disaster happened in a country that has since changed, and 0 if not',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA11',
+            valueType = switch(class(df[,'historical']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'hist_country',
+            definition = 'Name of country at the time of the disaster, if the observation takes the value 1 on the historical variable, this is different from the country variable',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA12',
+            valueType = switch(class(df[,'hist_country']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          ),
+          list(
+            memberName = 'disastertype',
+            definition = 'Type of disaster as defined by EM-DAT (Guha-Sapir et al., 2014): flood, storm, earthquake, extreme temperature, landslide, volcanic activity, drought or mass movement (dry)',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA13',
+            valueType = switch(class(df[,'disastertype']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA',
+            listedValue = list(
+              list(
+                label = 'flood',
+                code = 'flood',
+                definition = 'A general term for the overflow of water from a stream channel onto normally dry land in the floodplain (riverine flooding), higher-than-normal levels along the coast and in lakes or reservoirs (coastal flooding) as well as ponding of water at or near the point where the rain fell (flash floods).'
+              ),
+              list(
+                label = 'storm',
+                code = 'storm',
+                definition = 'A type of meteorological hazard generated by the heating of air and the availability of moist and unstable air masses. Convective storms range from localized thunderstorms (with heavy rain and/or hail, lightning, high winds, tornadoes) to meso-scale, multi-day events.'
+              ),
+              list(
+                label = 'earthquake',
+                code = 'earthquake',
+                definition = 'Sudden movement of a block of the Earth’s crust along a geological fault and associated ground shaking.'
+              ),
+              list(
+                label = 'extreme temperature',
+                code = 'extreme temperature',
+                definition = 'A general term for temperature variations above (extreme heat) or below (extreme cold) normal conditions.'
+              ),
+              list(
+                label = 'landslide',
+                code = 'landslide',
+                definition = 'Independent of the presence of water, mass movement may also be triggered by earthquakes.'
+              ),
+              list(
+                label = 'volcanic activity',
+                code = 'volcanic activity',
+                definition = 'A type of volcanic event near an opening/vent in the Earth’s surface including volcanic eruptions of lava, ash, hot vapor, gas, and pyroclastic material.'
+              ),
+              list(
+                label = 'drought',
+                code = 'drought',
+                definition = 'An extended period of unusually low precipitation that produces a shortage of water for people, animals, and plants. Drought is different from most other hazards in that it develops slowly, sometimes even over years, and its onset is generally difficult to detect. Drought is not solely a physical phenomenon because its impacts can be exacerbated by human activities and water supply demands. Drought is therefore often defined both conceptually and operationally. Operational definitions of drought, meaning the degree of precipitation reduction that constitutes a drought, vary by locality, climate and environmental sector.'
+              ),
+              list(
+                label = 'mass movement (dry)',
+                code = 'mass movement (dry)',
+                definition = 'Any type of downslope movement of earth materials.'
+              )
+            )
+          ),
+          list(
+            memberName = 'disasterno',
+            definition = 'ID-variable from EM-DAT (Guha-Sapir et al., 2014), use this to join the geocoded data with EM-DAT records in order to obtain information on the specific disasters',
+            cardinality = list(lower = 1, upper = 1),
+            code = 'DFA14',
+            valueType = switch(class(df[,'disasterno']), "character" = "xs:string", "integer" = "xs:int", "numeric" = "xs:decimal", "xs:string"),
+            valueMeasurementUnit = 'NA'
+          )
+        
+        )
+      )
+    )
+  )
+)
+
+# Publish in NADA catalog
+
+geospatial_add(
+  idno = id, 
+  metadata = my_geo_data, 
+  repositoryid = "central", 
+  published = 1, 
+  thumbnail = thumb, 
+  overwrite = "yes")
+
+# Add links as external resources
+
+external_resources_add(
+  idno = id,
+  dctype = "web",
+  title = "Website: Geocoded Disasters (GDIS) Dataset, v1 (1960–2018)",
+  file_path = "https://beta.sedac.ciesin.columbia.edu/data/set/pend-gdis-1960-2018",
+  overwrite = "yes"
+)
+
+
+

6.10.4 Example 4 (raster): Spatial distribution of the Ethiopian population in 2020

+

This fourth example makes use of elements from the ISO 19115 standard to document a dataset generated by the WorldPop program using data from multiple sources and machine learning models. “WorldPop develops peer-reviewed research and methods for the construction of open and high-resolution geospatial data on population distributions, demographics and dynamics, with a focus on low and middle income countries.” As of March 1, 2021, WorldPop was publishing over 44,600 datasets on its website. See https://www.worldpop.org/project/categories?id=3.

+
+

The selected example represents the spatial distribution of the Ethiopian population in 2020.

+
Generating the metadata using R
+
library(nadar)
+library(raster)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/") 
+
+# Download and read the dataset 
+
+url = "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif"
+filename = basename(url)
+if(!file.exists(filename)) download.file(url, destfile = filename, mode = "wb")
+ras <- raster("eth_ppp_2020_constrained.tif")
+
+id <- "WP_ETH_POP"
+thumb <- "ethiopia_pop.JPG"
+
+# Generate the metadata
+
+my_geo_data <- list(
+  
+  metadata_information = list(
+    title = "(Demo) Ethiopia Gridded Population 2020 (WorldPop)",
+    producers = list(list(name = "NADA team")),
+    production_date = "2022-02-18"
+  ),
+  
+  description = list(
+    
+    idno = id,
+    language = "eng",
+    characterSet = list(codeListValue = "utf8"),
+    hierarchyLevel = list("dataset"),
+    contact = list(
+      list(organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+           contactInfo = list(
+             onlineResource = list(
+               linkage = "https://www.worldpop.org/", name = "Website"
+             )
+           ),
+           role = "pointOfContact"
+      )
+    ),
+    
+    dateStamp = "2020-09-20",
+    metadataStandardName = "ISO 19115:2003/19139",
+    
+    spatialRepresentationInfo = list(
+      
+      list(
+        gridSpatialRepresentationInfo = list(
+          numberOfDimensions = 2L,
+          axisDimensionproperties = list(
+            list(
+              dimensionName = "row", dimensionSize = dim(ras)[1]
+            ),
+            list(
+              dimensionName = "column", dimensionSize = dim(ras)[2]
+            )
+          ),
+          cellGeometry = "area"
+        )
+      )
+      
+    ),
+    
+    referenceSystemInfo = list(
+      list(code = "4326", codeSpace = "EPSG")
+    ),
+    
+    identificationInfo = list(
+      
+      list(
+        
+        citation = list(
+          title = "Ethiopia population 2020",
+          alternateTitle = "Estimated total number of people per grid-cell at a resolution of 3 arc-seconds (approximately 100m at the equator)",
+          date=list(
+            list(date = "2020-09-12", type = "creation")
+          ),
+          identifier = list(authority = "DOI", code = id),
+          citedResponsibleParty = list(
+            list(
+              organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+              contactInfo = list(
+                onlineResource = list(
+                  linkage = "https://www.worldpop.org/",
+                  name = "Website"
+                )
+              ),
+              role = "owner"
+            )
+          )
+        ),
+
+        abstract = "The spatial distribution of population in 2020, Ethiopia",
+        
+        credit = "World Pop - School of Geography and Environmental Science, University of Southampton",
+        
+        status = "completed",
+        
+        pointOfContact = list(
+          list(
+            organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+            contactInfo = list(
+              onlineResource = list(
+                linkage = "https://www.worldpop.org/",
+                name = "Website"
+              )
+            ),
+            role = "pointOfContact"
+          )
+        ),
+        
+        resourceMaintenance = list(
+          list(maintenanceOrUpdateFrequency = "notPlanned")
+        ),
+        
+        graphicOverview = list(
+          list(fileName = thumb, fileDescription = "Ethiopia population 2020")
+        ),
+        
+        resourceFormat = list(
+          list(name = "image/tiff", specification = "GeoTIFF")
+        ),
+        
+        descriptiveKeywords = list(
+          list(type = "theme", keyword = "population density"),
+          list(type = "theme", keyword = "gridded population"),
+          list(type = "place", keyword = "Ethiopia")
+        ),  
+        
+        resourceConstraints = list(
+          list(
+            legalConstraints = list(
+              accessConstraints = list("unrestricted"),
+              useConstraints = list("licenceUnrestricted"),
+              uselimitation = list(
+                "License: Creative Commons Attribution 4.0 International License",
+                "Recommended citation: Bondarenko M., Kerr D., Sorichetta A., and Tatem, A.J. 2020. Census/projection-disaggregated gridded population datasets for 51 countries across sub-Saharan Africa in 2020 using building footprints. WorldPop, University of Southampton, UK. doi:10.5258/SOTON/WP00682"
+              )
+            )
+          )
+        ),
+        
+        extent = list(
+          geographicElement = list(
+            list(
+              geographicBoundingBox = list(
+                southBoundLatitude = bbox(ras)[2,1],
+                westBoundLongitude = bbox(ras)[1,1],
+                northBoundLatitude = bbox(ras)[2,2],
+                eastBoundLongitude = bbox(ras)[1,2]
+              ),
+              geographicDescription = "Ethiopia"
+            )
+          )
+        ),
+        
+        spatialRepresentationType = "grid",
+        
+        #spatialResolution = list(value = 3, uom = "arc_second"),
+        
+        language = list("eng"),
+        
+        characterSet = list(
+          list(codeListValue = "utf8")
+        ),
+        
+        topicCategory = list("society"),
+
+        supplementalInformation = "References:
+          - Stevens FR, Gaughan AE, Linard C, Tatem AJ (2015) Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. PLoS ONE 10(2): e0107042. https://doi.org/10.1371/journal.pone.0107042
+          - WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076).
+          - Dooley, C. A., Boo, G., Leasure, D.R. and Tatem, A.J. 2020. Gridded maps of building patterns throughout sub-Saharan Africa, version 1.1. University of Southampton: Southampton, UK. Source of building footprints \"Ecopia Vector Maps Powered by Maxar Satellite Imagery\"© 2020. doi:10.5258/SOTON/WP00677
+          - Bondarenko M., Nieves J. J., Stevens F. R., Gaughan A. E., Tatem A. and Sorichetta A. 2020. wpgpRFPMS: Random Forests population modelling R scripts, version 0.1.0. University of Southampton: Southampton, UK. https://dx.doi.org/10.5258/SOTON/WP00665
+          - Ecopia.AI and Maxar Technologies. 2020. Digitize Africa data. http://digitizeafrica.ai"
+    
+      )  
+    ),
+    
+    distributionInfo = list(
+      
+      distributionFormat = list(
+        list(name = "image/tiff", specification = "GeoTIFF")
+      ),
+      distributor = list(
+        list(
+          organisationName = "World Pop - School of Geography and Environmental Science, University of Southampton",
+          contactInfo = list(
+            onlineResource = list(
+              linkage = "https://www.worldpop.org/",
+              name = "Website"
+            )
+          ),
+          role = "distributor"
+        )
+      )#,
+      
+      # transferOptions = list(    @@@ Use DC external resources?
+      #   list(
+      #     onLine = list(
+      #       list(
+      #         linkage = "https://www.worldpop.org/geodata/summary?id=49635",
+      #         name = "Source metadata (HTML View)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://www.worldpop.org/ajax/pdf/summary?id=49635",
+      #         name = "Source metadata (PDF)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://data.worldpop.org/GIS/Population/Global_2000_2020_Constrained/2020/maxar_v1/ETH/eth_ppp_2020_constrained.tif",
+      #         name = "eth_ppp_2020_constrained.tif",
+      #         description = "Data download (GeoTIFF)",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       )
+      #     )
+      #   )
+      # )
+      
+    ),
+    
+    dataQualityInfo = list(
+      
+      list(
+        scope = "dataset", 
+        lineage = list(
+          statement = "Data management workflow",
+          processStep = list(
+            list(
+              description = "This dataset was produced based on the 2020 population census/projection-based estimates for 2020 (information and sources of the input population data can be found here). Building footprints were provided by the Digitize Africa project of Ecopia.AI and Maxar Technologies (2020) and gridded building patterns derived from the datasets produced by Dooley et al. 2020. Geospatial covariates representing factors related to population distribution, were obtained from the \"Global High Resolution Population Denominators Project\" (OPP1134076)",
+              rationale = "Source data acquisition"
+            ),
+            list(
+              description = "The mapping approach is the Random Forests-based dasymetric redistribution developed by Stevens et al. (2015). The disaggregation was done by Maksym Bondarenko (WorldPop) and David Kerr (WorldPop), using the Random Forests population modelling R scripts (Bondarenko et al., 2020), with oversight from Alessandro Sorichetta (WorldPop).",
+              rationale = "Mapping"
+            )
+          )
+        )
+      )
+      
+    ),
+    
+    metadataMaintenance = list(maintenanceAndUpdateFrequency = "notPlanned")
+    
+  )
+  
+)
+
+# Publish the metadata in a NADA catalog
+
+geospatial_add(
+  idno = id, 
+  metadata = my_geo_data, 
+  repositoryid = "central", 
+  published = 1, 
+  thumbnail = thumb, 
+  overwrite = "yes"
+)
+
+# Add a link to WorldPop website as an external resource
+
+external_resources_add(
+  idno = id,
+  dctype = "web",
+  title = "WorldPop website",
+  file_path = "https://www.worldpop.org/",
+  overwrite = "yes"
+)
Generating the metadata using Python
The result in NADA
+
+
+

6.10.5 Example 5 (service): The United Nations Geospatial website

+

The previous four examples documented geographic datasets (ISO 19115). In this fifth example, we document a geographic service using elements from the ISO 19119 standard. The service described in this example is the United Nations Clear Map application from United Nations Geospatial.

+
Generating the metadata using R
+
library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_geo_data/") 
+
+thumb = "un_clear_map.JPG"
+
+id = "UN_GEO_CLEAR-MAP"
+
+my_geo_service <- list(
+  
+  metadata_information = list(
+    idno = id,
+    title = "United Nations Geospatial, Clear Map",
+    producers = list(
+      list(name = "NADA team")
+    ),
+    production_date = "2022-02-18",
+    version = "v1.0 2022-02"
+  ),
+  
+  description = list(
+    
+    idno = id,
+    language = "eng",
+    characterSet = list(codeListValue = "utf8"),
+    hierarchyLevel = list("service"),
+    contact = list(
+      list(
+        organisationName = "United Nations Geospatial",
+        contactInfo = list(
+          address = list(
+            electronicEmailAddress = "gis@un.org"
+          ),
+          onlineResource = list(
+            linkage = "https://www.un.org/geospatial",
+            name = "Website"
+          )
+        ),
+        role = "owner"
+      )
+    ),
+    dateStamp = "2022-02-22",
+    metadataStandardName = "ISO 19119:2005/19139",
+    
+    referenceSystemInfo = list(
+      list(code = "3857", codeSpace = "EPSG")
+    ),
+    
+    identificationInfo = list(
+      
+      list(
+        citation = list(
+          title = "United Nations Clear Map - OGC Web Map Service",
+          date = list(
+            list(date = "2019-08-19", type = "creation"),
+            list(date = "2020-03-19", type ="lastUpdate")
+          ),
+          citedResponsibleParty = list(
+            list(
+              organisationName = "United Nations Geospatial",
+              contactInfo = list(
+                address = list(electronicEmailAddress = "gis@un.org"),
+                onlineResource = list(
+                  linkage = "https://www.un.org/geospatial",
+                  name = "Website"
+                )
+              ),
+              role = "owner"
+            )
+          )
+        ),
+        
+        abstract = "The United Nations Clear Map (hereinafter 'Clear Map') is a background reference web mapping service produced to facilitate 'the issuance of any map at any duty station, including dissemination via public electronic networks such as Internet' and 'to ensure that maps meet publication standards and that they are not in contravention of existing United Nations policies' in accordance with the Administrative Instruction on 'Regulations for the Control and Limitation of Documentation - Guidelines for the Publication of Maps' of 20 January 1997 (http://undocs.org/ST/AI/189/Add.25/Rev.1).",
+        purpose = "Clear Map is created for the use of the United Nations Secretariat and community.  All departments, offices and regional commissions of the United Nations Secretariat including offices away from Headquarters using Clear Map remain bound to the instructions as contained in the Administrative Instruction and should therefore seek clearance from the UN Geospatial Information Section (formerly Cartographic Section) prior to the issuance of their thematic maps using Clear Map as background reference.",
+        credit = "Produced by: United Nations Geospatial Contributor: UNGIS, UNGSC, Field Missions CONTACT US: Feedback is appreciated and should be sent directly to: Email:Clearmap@un.org / gis@un.org (UNCLASSIFIED) (c) UNITED NATIONS 2018",
+        status = "onGoing",
+        
+        pointOfContact = list(
+          list(
+            organisationName = "United Nations Geospatial",
+            contactInfo = list(
+              address = list(electronicEmailAddress = "gis@un.org"),
+              onlineResource = list(linkage = "https://www.un.org/geospatial", name = "Website")
+            ),
+            role = "pointOfContact"
+          )
+        ),
+        
+        resourceMaintenance = list(
+          list(maintenanceOrUpdateFrequency = "asNeeded")
+        ),
+        
+        graphicOverview = list(
+          list(
+            fileName = "https://geoportal.dfs.un.org/arcgis/sharing/rest/content/items/6f4eb9e136ee43758a62f587ceb0da01/info/thumbnail/thumbnail1567157577600.png",
+            fileDescription = "Service overview",
+            fileType = "image/png"
+          )
+        ),
+        
+        resourceFormat = list(
+          list(name = "PNG32"),
+          list(name = "PNG24"),
+          list(name = "PNG"),
+          list(name = "JPG"),
+          list(name = "DIB"),
+          list(name = "TIFF"),
+          list(name = "EMF"),
+          list(name = "PS"),
+          list(name = "PDF"),
+          list(name = "GIF"),
+          list(name = "SVG"),
+          list(name = "SVGZ"),
+          list(name = "BMP")
+        ),
+        
+        descriptiveKeywords = list(
+          list(type = "theme", keyword = "wms"),
+          list(type = "theme", keyword = "united nations"),
+          list(type = "theme", keyword = "global boundaries"),
+          list(type = "theme", keyword = "ocean coastline"),
+          list(type = "theme", keyword = "authoritative")
+        ),  
+        
+        resourceConstraints = list(
+          list(
+            legalConstraints = list(
+              uselimitation = list("The designations employed and the presentation of material on this map do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries.
+                Final boundary between the Republic of Sudan and the Republic of South Sudan has not yet been determined.
+                Final status of the Abyei area is not yet determined.
+                * Dotted line represents approximately the Line of Control in Jammu and Kashmir agreed upon by India and Pakistan. The final status of Jammu and Kashmir has not yet been agreed upon by the parties.
+                ** Chagos Archipelago appears without prejudice to the question of sovereignty.
+                *** A dispute exists between the Governments of Argentina and the United Kingdom of Great Britain and Northern Ireland concerning sovereignty over the Falkland Islands (Malvinas)."),
+              accessConstraints = list("unrestricted"),
+              useConstraints = list("licenceUnrestricted")
+            )
+          )
+        ),
+        
+        extent = list(
+          geographicElement = list(
+            list(
+              geographicBoundingBox = list(
+                southBoundLatitude = -1.4000299034940418,
+                westBoundLongitude  = -1.40477223188626,
+                northBoundLatitude =  2.149247026187029,
+                eastBoundLongitude  =  1.367128649366541
+              )
+            )
+          )
+        ),
+        
+        topicCategory = list("boundaries", "oceans"),
+        
+        serviceIdentification = list(
+          serviceType = "OGC:WMS",
+          serviceTypeVersion = "1.1.0"
+        )
+      )
+    ),
+    
+    distributionInfo = list(
+      
+      distributionFormat = list(
+        list(name = "PNG32"),
+        list(name = "PNG24"),
+        list(name = "PNG"),
+        list(name = "JPG"),
+        list(name = "DIB"),
+        list(name = "TIFF"),
+        list(name = "EMF"),
+        list(name = "PS"),
+        list(name = "PDF"),
+        list(name = "GIF"),
+        list(name = "SVG"),
+        list(name = "SVGZ"),
+        list(name = "BMP")
+      ),
+      
+      distributor = list(
+        list(
+          organisationName = "United Nations Geospatial",
+          contactInfo = list(
+            address = list(electronicEmailAddress = "gis@un.org"),
+            onlineResource = list(
+              linkage = "https://www.un.org/geospatial",
+              name = "Website"
+            )
+          ),
+          role = "owner"
+        )
+      )
+      #,
+      
+      # transferOptions = list(
+      #   list(
+      #     onLine = list(
+      #       list(
+      #         linkage = "https://geoportal.dfs.un.org/arcgis/home/item.html?id=541557fd0d4d42efb24449be614e6887",
+      #         name = "Original metadata",
+      #         description = "Original metadata from UN ClearMap portal",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://geoportal.dfs.un.org/arcgis/sharing/rest/content/items/541557fd0d4d42efb24449be614e6887/data",
+      #         name = "UN ClearMap WMS map service user guide",
+      #         description = "How to import and use WMS services of the UN Clear map",
+      #         protocol = "WWW:LINK-1.0-http--link"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Dark/MapServer?service=WMS",
+      #         name = "ClearMap_Dark",
+      #         description = "ClearMap Dark WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Gray/MapServer?service=WMS",
+      #         name = "ClearMap_Gray",
+      #         description = "ClearMap Gray WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Imagery/MapServer?service=WMS",
+      #         name = "ClearMap_Imagery",
+      #         description = "ClearMap Imagery WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Plain/MapServer?service=WMS",
+      #         name = "ClearMap_Plain",
+      #         description = "ClearMap Plain WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_Topo/MapServer?service=WMS",
+      #         name = "ClearMap_Topo",
+      #         description = "ClearMap Topo WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebDark/MapServer?service=WMS",
+      #         name = "ClearMap_WebDark",
+      #         description = "ClearMap WebDark WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebGray/MapServer?service=WMS",
+      #         name = "ClearMap_WebGray",
+      #         description = "ClearMap WebGray WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebPlain/MapServer?service=WMS",
+      #         name = "ClearMap_WebPlain",
+      #         description = "ClearMap WebPlain WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       ),
+      #       list(
+      #         linkage = "https://geoservices.un.org/arcgis/rest/services/ClearMap_WebTopo/MapServer?service=WMS",
+      #         name = "ClearMap_WebTopo",
+      #         description = "ClearMap WebTopo WMS",
+      #         protocol = "OGC:WMS-1.1.0-http-get-map"
+      #       )
+      #     )
+      #   )
+      # )
+      
+    ),
+  
+    metadataMaintenance = list(maintenanceAndUpdateFrequency = "asNeeded")
+  
+  )  
+  
+)
+
+# Publish in a NADA catalog 
+
+geospatial_add(
+  idno = id, 
+  metadata = my_geo_service, 
+  repositoryid = "central", 
+  published = 1, 
+  thumbnail = thumb, 
+  overwrite = "yes"
+)
+
+# Add links as external resources
+
+external_resources_add(
+  title = "United Nations Clear Map application",
+  idno = id,
+  dctype = "web",
+  file_path = "https://www.un.org/geospatial/",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "United Nations Geospatial website",
+  idno = id,
+  dctype = "web",
+  file_path = "https://geoservices.un.org/Html5Viewer/index.html?viewer=clearmap",
+  overwrite = "yes"
+)
Generating the metadata using Python
+

[to do]

The result in NADA
+
+
+
+

6.11 Useful tools

+

The ISO standard is complex and contains many nested elements. Using R or Python to generate the metadata is a convenient and powerful option, although it requires much attention to avoid errors. The geometa R package can be used to facilitate the process of documenting datasets using R.
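As an illustration, the sketch below assembles a very small ISO 19115 record with geometa. It is only a sketch: the identifier, title, and abstract are placeholder values borrowed from the Ethiopia example above, only a few of the many available elements are set, and method names should be checked against the documentation of the installed geometa version.

```r
library(geometa)

# Minimal ISO 19115/19139 record assembled with geometa (sketch; placeholder values)
md <- ISOMetadata$new()
md$setFileIdentifier("WP_ETH_POP")   # unique identifier of the metadata record
md$setLanguage("eng")
md$setCharacterSet("utf8")

ident <- ISODataIdentification$new()
ident$setAbstract("The spatial distribution of population in 2020, Ethiopia")

cit <- ISOCitation$new()
cit$setTitle("Ethiopia population 2020")
ident$setCitation(cit)

md$addIdentificationInfo(ident)

# Encode the record as ISO 19139 XML (the object can also be validated and saved to a file)
xml <- md$encode()
```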

+

Using a specialized metadata editor to generate the ISO-compliant metadata is a good alternative for those who have limited expertise in R or Python. The GeoNetwork editor provides such a solution.

+
+
+
+
    +
  1. In our JSON schema, the structural metadata and the dataset metadata are stored in the same container.

+
diff --git a/chapter07.html b/chapter07.html new file mode 100644 index 0000000..d03b754 --- /dev/null +++ b/chapter07.html @@ -0,0 +1,1590 @@
+
+

Chapter 7 Databases of indicators

+
+
+

7.1 Database vs indicators

+

The schema we describe in this chapter is intended to document databases of indicators or time series, not the indicators or time series themselves (a schema for the description of indicators and time series is presented in chapter 8). Indicators are summary measures related to key issues or phenomena, derived from observed facts. Indicators form time series when they are provided with a temporal ordering, i.e. when their values are provided with an ordered annual, quarterly, monthly, daily, or other time reference. Indicators and time series are often contained in multi-indicator databases, like the World Bank’s World Development Indicators (WDI), whose online version contains series for 1,430 indicators (as of 2021).

+

The metadata related to a database can be published in a catalog as specific entries, or as information attached to an indicator. +[provide example / screenshot in NADA]

+
+
+

7.2 Schema description

+

The database schema is used to document the database that contains the time series, not to document the indicators or series themselves.

+


+
{
+  "published": 0,
+  "overwrite": "no",
+  "metadata_information": {},
+  "database_description": {},
+  "provenance": [],
+  "tags": [],
+  "lda_topics": {},
+  "embeddings": {},
+  "additional": {}
+}
+


+

The schema includes two elements that are not metadata, but parameters used when publishing the metadata in a NADA catalog:

+
    +
  • published: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished), in which case it is only visible to catalog administrators. This value must be set to 1 (published) to make the metadata visible. Note that the database metadata will only be shown in NADA in association with the metadata of an indicator.
  • +
  • overwrite: Indicates whether metadata that may have been previously uploaded for the same database can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. A database will be considered as being the same as a previously uploaded one if they have the same identifier (provided in the metadata element database_description > title_statement > idno).
  • +
+
+

7.2.0.1 Metadata information

+

metadata_information [Optional, Not Repeatable]
+The set of elements in metadata_information is used to provide information on the production of the database metadata. This information is used mostly for administrative purposes by data curators and catalog administrators.

+


+
"metadata_information": {
+  "title": "string",
+  "idno": "string",
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "prod_date": "string",
+  "version": "string"
+}
+


+
    +
  • title [Optional ; Not repeatable ; String]
    +The title of the metadata document containing the database metadata.
  • +
  • idno [Required ; Not repeatable ; String]
    +A unique identifier of the database metadata document. It can be for example the identifier of the database preceded by a prefix identifying the metadata producer.
  • +
  • producers [Optional ; Repeatable]
    +A list and description of the producers of the database metadata (not the producers of the database).
    +
      +
    • name [Optional ; Not repeatable ; String]
      +The name of the person or organization who produced the metadata (or contributed to its production).
    • +
    • abbr [Optional ; Not repeatable ; String]
      +The abbreviation (acronym) of the organization mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +The affiliation of the person or organization mentioned in name.
    • +
    • role [Optional ; Not repeatable ; String]
      +The specific role of the person or organization mentioned in name in the production of the metadata.
    • +
  • +
  • prod_date [Optional ; Not repeatable ; String]
    +The date when the metadata was produced, preferably entered in ISO 8601 format (YYYY-MM-DD).
  • +
  • version [Optional ; Not repeatable ; String]
    +The version of the metadata (not the version of the database).
  • +
+
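For example, a metadata_information block for a database could be filled as follows in R (a hypothetical example; the identifier, producer, and dates are placeholders):

```r
# Hypothetical metadata_information block (values are placeholders)
metadata_information <- list(
  title = "World Development Indicators, April 2020 - Database metadata",
  idno  = "META_WB_WDI_APR_2020",
  producers = list(
    list(name = "Development Data Group",
         abbr = "DECDG",
         affiliation = "World Bank",
         role = "Metadata preparation")
  ),
  prod_date = "2020-04-15",
  version   = "1.0"
)
```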
+
+

7.2.0.2 Database description

+

database_description [Required, Not Repeatable]

+


+
"database_description": {
+  "title_statement": {},
+  "authoring_entity": [],
+  "abstract": "string",
+  "url": "string",
+  "type": "string",
+  "date_created": "string",
+  "date_published": "string",
+  "version": [],
+  "update_frequency": "string",
+  "update_schedule": [],
+  "time_coverage": [],
+  "time_coverage_note": "string",
+  "periodicity": [],
+  "themes": [],
+  "topics": [],
+  "keywords": [],
+  "dimensions": [],
+  "ref_country": [],
+  "geographic_units": [],
+  "geographic_coverage_note": "string",
+  "bbox": [],
+  "geographic_granularity": "string",
+  "geographic_area_count": "string",
+  "sponsors": [],
+  "acknowledgments": [],
+  "acknowledgment_statement": "string",
+  "contacts": [],
+  "links": [],
+  "languages": [],
+  "access_options": [],
+  "errata": [],
+  "license": [],
+  "citation": "string",
+  "notes": [],
+  "disclaimer": "string",
+  "copyright": "string"
+}
+


+
    +
  • title_statement [Required, Not Repeatable]
  • +
+


+
"title_statement": {
+  "idno": "string",
+  "identifiers": [
+    {
+      "type": "string",
+      "identifier": "string"
+    }
+  ],
+  "title": "string",
+  "sub_title": "string",
+  "alternate_title": "string",
+  "translated_title": "string"
+}
+


+
    +
  • idno [Required ; Not repeatable ; String]
    +A unique identifier of the database. For example, the World Bank’s World Development Indicators database published in April 2020 could have idno = “WB_WDI_APR_2020”.

  • +
  • identifiers [Optional ; Repeatable]
    +This element is used to store database identifiers (IDs) other than the catalog ID entered in idno. It can for example be a Digital Object Identifier (DOI). The idno can be repeated here (idno does not provide a type parameter; if a DOI or other standard reference ID is used as idno, it is recommended to repeat it here with the identification of its type).

    +
      +
    • type [Optional ; Not repeatable ; String]
      +The type of unique ID, e.g. “DOI”.
    • +
    • identifier [Required ; Not repeatable ; String]
      +The identifier itself.
    • +
  • +
  • title [Required ; Not repeatable ; String]
    +The title is the name by which the database is formally known. It is good practice to include the year of production in the title (and possibly the month, or quarter, if a new version of the database is released more than once a year). For example, “World Development Indicators, April 2020”.

    +

  • +
  • sub_title [Optional ; Not repeatable ; String]
    +The database subtitle can be used when there is a need to distinguish characteristics of a database. This element will rarely be used.

  • +
  • alternate_title [Optional ; Not repeatable ; String]
    +This can be an acronym, or an alternative name of the database. For example, “WDI April 2020”.

  • +
  • translated_title [Optional ; Not repeatable ; String]
    +The title of the database in a secondary language (if more than one other language, they may be entered as one string, as this element is not repeatable).

  • +
  • authoring_entity [Optional ; Repeatable]
    +This set of five elements is used to identify the organization(s) or person(s) who are the main producers/curators of the database. Note that a similar element is available at the indicator/series level.

  • +
+


+
"authoring_entity": [
+  {
+    "name": "string",
+    "affiliation": "string",
+    "abbreviation": "string",
+    "email": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the person or organization who maintains the contents of the database (back-end). Write the name in full (use the element abbreviation to capture the acronym of the organization, if relevant).

  • +
  • affiliation [Optional ; Not repeatable ; String]
    +The affiliation of the person or organization mentioned in name.
    +

  • +
  • abbreviation [Optional ; Not repeatable ; String]
    +The abbreviated name (acronym) of the organization mentioned in name.

  • +
  • email [Optional ; Not repeatable ; String]
    +The public email contact of the person or organizations mentioned in name. It is good practice to provide a service account email address, not a personal one.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link (URL) to the website of the entity mentioned in name.

  • +
  • abstract [Optional ; Not repeatable ; String]

    +

    The abstract is a brief description of the database. It can for example include a short statement on the database scope and coverage (not in detail, as other fields are available for that purpose), objectives, history, and expected audience.

  • +
  • url [Optional ; Not repeatable ; String]

    +

    The link to the public interface of the database (home page).

  • +
  • type [Optional ; Not repeatable ; String]

    +

    The type of database.

  • +
  • date_created [Optional ; Not repeatable ; String]
    +This is the date the database was created. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).

  • +
  • date_published [Optional ; Not repeatable ; String]
    +This is the date the database was made public. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).

  • +
  • version [Optional ; Repeatable]
    +A database rarely remains static; it will be regularly updated and upgraded. The version element is a compound element and contains important information regarding the updating of the database. This includes any extension of the database (adding new series data), appending existing data, correcting existing data, etc.

  • +
+


+
"version": [
+  {
+    "version": "string",
+    "date": "string",
+    "responsibility": "string",
+    "notes": "string"
+  }
+]
+


+
    +
  • version [Optional ; Not repeatable ; String]
    +A label for the version. The version specification will be determined by a curator or a data manager under conventions determined by the authoring entity.

  • +
  • date [Optional ; Not repeatable ; String]
    +The date the version was released. The date should be entered in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).

  • +
  • responsibility [Optional ; Not repeatable ; String]
    +The organization or person in charge of this version of the database.

  • +
  • notes [Optional ; Not repeatable ; String]
    +Additional information on this version of the database. Notes can for example be used to document how this version differs from previous ones.

  • +
  • update_frequency [Optional ; Not repeatable ; String]
    +Indicates at which frequency the database is updated (for example, “annual” or “quarterly”). The use of a controlled vocabulary is recommended. If a database contains many indicators, the update frequency may vary by indicator (e.g., some may be updated on a monthly or quarterly basis while others are only updated annually). The information provided in the update_frequency will correspond to the frequency of update for the indicators that are most frequently updated. +

  • +
  • update_schedule [Optional ; Repeatable]
    +The update schedule is intended to provide users with information on scheduled updates. This is a repeatable field that allows for capturing specific dates, but this information would then have to be regularly updated. Often a single description will be used, which would avoid having to regularly update the metadata. For example, “The database is updated in January, April, July, October of each year.”

  • +
+


+
"update_schedule": [
+  {
+    "update": "string"
+  }
+]
+


+
    +
  • update [Optional ; Not repeatable ; String]
    +A description of the schedule of updates or a date entered in ISO 8601 format.

  • +
  • time_coverage [Optional ; Repeatable]
    +The time coverage is the time span of all the data contained in the database across all series. +

  • +
+
"time_coverage": [
+  {
+    "start": "string",
+    "end": "string"
+  }
+]
+


  • start [Optional ; Not repeatable ; String]
    +Indicates the start date of the period covered by the data (across all series) in the database. The date should be provided in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).
  • end [Optional ; Not repeatable ; String]
    +Indicates the end date of the period covered by the data (across all series) in the database. The date should be provided in ISO 8601 format (YYYY-MM-DD, or YYYY-MM, or YYYY).

+
    +
  • time_coverage_note [Optional ; Not repeatable ; String]
    +The element is used to annotate and/or describe auxiliary information related to the time coverage described in time_coverage.

  • +
  • periodicity [Optional ; Repeatable]
    +The periodicity of the data describes the periodicity of the indicators contained in the database. A database can contain series covering different periods, in which case the information will be repeated for each type of periodicity. A controlled vocabulary should be used. +

  • +
+
"periodicity": [
+  {
+    "period": "string"
+  }
+]
+


+
    +
  • period [Optional ; Not repeatable ; String]
    +Periodicity of the time series included in the database, for example, “annual”, “quarterly”, or “monthly”.

  • +
  • themes [Optional ; Repeatable]
    +Themes provide a general idea of the research that might guide the creation and/or demand for the series. A theme is broad and is likely also subject to a community based definition or list. A controlled vocabulary should be used. This element will rarely be used (the element topics described below will be used more often).

  • +
+


+
"themes": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • id [Optional ; Not repeatable ; String]
    +The unique identifier of the theme. It can be a sequential number, or the identifier of the theme in a controlled vocabulary.

  • +
  • name [Required ; Not repeatable ; String]
    +The label of the theme associated with the data.

  • +
  • parent_id [Optional ; Not repeatable ; String]
    +When a hierarchical (nested) controlled vocabulary is used, the parent_id field can be used to indicate a higher-level theme to which this theme belongs.

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name of the controlled vocabulary used, if any.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to the controlled vocabulary mentioned in field ‘vocabulary’.

  • +
  • topics [Optional ; Repeatable]
    +The topics field indicates the broad substantive topic(s) that the indicator/series covers. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) topic classification.

  • +
+


+
"topics": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • id [Optional ; Not repeatable ; String]
    +The unique identifier of the topic. It can be a sequential number, or the identifier of the topic in a controlled vocabulary.

  • +
  • name [Required ; Not repeatable ; String]
    +The label of the topic associated with the data.
    +

  • +
  • parent_id [Optional ; Not repeatable ; String]
    +When a hierarchical (nested) controlled vocabulary is used, the parent_id field can be used to indicate a higher-level topic to which this topic belongs.

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name of the controlled vocabulary used, if any.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to the controlled vocabulary mentioned in field ‘vocabulary’.

  • +
  • keywords [Optional ; Repeatable]
    +Words or phrases that describe salient aspects of a data collection’s content. This can be used for building keyword indexes and for classification and retrieval purposes. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus. The list of keywords can include keywords extracted from one or more controlled vocabularies and user-defined keywords.

  • +
+


+
"keywords": [
+  {
+    "name": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required ; String ; Non repeatable]
    +A keyword (or phrase).
    +

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name of the controlled vocabulary from which the keyword was extracted, if any.
    +

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URI of the controlled vocabulary used, if any.

  • +
  • dimensions [Optional ; Repeatable]
    +The dimensions available for the series included in the database. For example, “country, year”.

  • +
+


+
"dimensions": [
+  {
+    "name": "string",
+    "label": "string"
+  }
+]
+


+
    +
  • name [Required ; String ; Non repeatable]
    +The name of the dimension.
    +

  • +
  • label [Optional ; Not repeatable ; String]
    +A label for the dimension.

  • +
  • ref_country [Optional ; Repeatable]

    +A list of countries for which data are available in the database. This element is somewhat redundant with the next element (geographic_units) which may also contain a list of countries. Identifying geographic areas of type “country” is important to enable filters and facets in data catalogs (country names are among the most frequent queries submitted to catalogs).

  • +
+


+
"ref_country": [
+  {
+    "name": "string",
+    "code": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the country.

  • +
  • code [Optional ; Not repeatable ; String]
    +The code of the country. The use of the ISO 3166-1 alpha-3 codes is recommended.

  • +
  • geographic_units [Optional ; Repeatable]
    +A list of geographic units (regions, countries, states, provinces, etc.) for which data are available in the database. This list is not limited to countries; it can contain sub-national areas, supra-national regions, or non-administrative area names. The type element is used to indicate the type of geographic area. Countries may, but do not have to, be repeated here if they are provided in the element ref_country. +

  • +
+
"geographic_units": [
+  {
+    "name": "string",
+    "code": "string",
+    "type": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the geographic unit e.g. ‘World’, ‘Sub-Saharan Africa’, ‘Afghanistan’, ‘Low-income countries’.

  • +
  • code [Optional ; Not repeatable ; String]
    +The code of the geographic unit as found in the database. If no code is available in the database, a code still can be added to the metadata. In such case, using the ISO 3166-1 alpha-3 codes is recommended for countries.

  • +
  • type [Optional ; Not repeatable ; String]
    +Type of geographic unit e.g. country, state, region, province, or other grouping.

  • +
  • geographic_coverage_note [Optional ; Not repeatable ; String]
    +The note can be used to capture additional information on the geographic coverage of the database.

  • +
  • bbox [Optional ; Repeatable]
    +Bounding boxes are typically used for geographic datasets to indicate the geographic coverage of the data, but can be provided for databases as well, although this will rarely be done. A geographic bounding box defines a rectangular geographic area. +

  • +
+
"bbox": [
+  {
+    "west": "string",
+    "east": "string",
+    "south": "string",
+    "north": "string"
+  }
+]
+


+
    +
  • west [Required ; Not repeatable ; String]
    +Western geographic parameter of the bounding box.

  • +
  • east [Required ; Not repeatable ; String]
    +Eastern geographic parameter of the bounding box.

  • +
  • south [Required ; Not repeatable ; String]
    +Southern geographic parameter of the bounding box.

  • +
  • north [Required ; Not repeatable ; String]
    +Northern geographic parameter of the bounding box.

  • +
  • geographic_granularity [Optional ; Not repeatable ; String]

    +

    Whereas the geographic_units element lists the individual geographic areas for which data are available in the database, the geographic_granularity element describes the geographic levels covered by the database. For example: “The database contains data at the national, provincial (admin 1) and district (admin 2) levels.”

  • +
  • geographic_area_count [Optional ; Not repeatable ; String]

    +

    The number of geographic areas for which data are provided in the database. The World Bank World Development Indicators for example provides data for 262 different areas (which includes countries and territories, geographic regions, and other country groupings).

  • +
  • sponsors [Optional ; Repeatable]
    +The source(s) of funds for the production and maintenance of the database. If different funding agencies sponsored different stages of the database development, use the role attribute to distinguish their respective contributions.

  • +
+


+
"sponsors": [
+  {
+    "name": "string",
+    "abbreviation": "string",
+    "role": "string",
+    "grant": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +Name of the funding agency/sponsor

  • +
  • abbreviation [Optional ; Not repeatable ; String]
    +Abbreviation of the funding/sponsoring agency mentioned in name.

  • +
  • role [Optional ; Not repeatable ; String]
    +Role of the funding/sponsoring agency mentioned in name.

  • +
  • grant [Optional ; Not repeatable ; String]
    +Grant or award number. If an agency provided more than one grant, list all grants separated with a “;”.

  • +
  • uri [Optional ; Not repeatable ; String]
    +URI of the sponsor agency mentioned in name.

  • +
  • acknowledgments [Optional ; Repeatable]
    +An itemized list of person(s) and/or organization(s) other than sponsors and contributors already mentioned in metadata elements contributors and sponsors whose contribution to the database must be acknowledged.

  • +
+


+
"acknowledgments": [
+  {
+    "name": "string",
+    "affiliation": "string",
+    "role": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the person or agency being recognized for supporting the database.

  • +
  • affiliation [Optional ; Not repeatable ; String]
    +Affiliation of the person or agency recognized or acknowledged for supporting the database.

  • +
  • role [Optional ; Not repeatable ; String]
    +Role of the person or agency that is being recognized or acknowledged for supporting the database.

  • +
  • uri [Optional ; Not repeatable ; String]
    +Website URL or email of the person or organization being recognized or acknowledged for supporting the database.

  • +
  • acknowledgment_statement [Optional ; Not repeatable ; String]

    +

    An overall statement of acknowledgment, which can be used as an alternative (or supplement) to the itemized list provided in acknowledgments.

  • +
  • contacts [Optional ; Repeatable]
    +The contacts element provides the public interface for questions associated with the development and maintenance of the database. There could be various contacts provided depending upon the organization.

  • +
+


+
"contacts": [
+  {
+    "name": "string",
+    "role": "string",
+    "affiliation": "string",
+    "email": "string",
+    "telephone": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the contact person that should be contacted. Instead of the name of an individual (which would be subject to change and require frequent update of the metadata), a title can be provided here (e.g. “data helpdesk”).

  • +
  • role [Optional ; Not repeatable ; String]
    +The specific role of the contact person mentioned in name. This will be used when multiple contacts are listed, and is intended to help users direct their questions and requests to the right contact person.
    +

  • +
  • affiliation [Optional ; Not repeatable ; String]
    +The organization or affiliation of the contact person mentioned in name.

  • +
  • email [Optional ; Not repeatable ; String]
    +The email address of the person or organization mentioned in name. Avoid using personal email accounts; the use of an anonymous email is recommended (e.g, “helpdesk@….org”)

  • +
  • telephone [Optional ; Not repeatable ; String]
    +The phone number of the person or organization mentioned in name.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URI of the agency (typically, a URL to a “contact us” web page).

  • +
  • links [Optional ; Repeatable]
    +This field allows for the association of auxiliary links referring to the database.

  • +
+


+
"links": [
+  {
+    "uri": "string",
+    "description": "string"
+  }
+]
+


+
    +
  • uri [Optional ; Not repeatable ; String]
    +The URI for the associated link.

  • +
  • description [Optional ; Not repeatable ; String]
    +A brief description of the link, in relation to the database. +

  • +
  • languages [Optional ; Repeatable]
    +This set of elements is provided to list the languages that are supported in the database. +

    +
    "languages": [
    +  {
    +    "name": "string",
    +    "code": "string"
    +  }
    +]
    +


    +
      +
    • name [Optional ; Not repeatable ; String]
      +The official name of the language being supported; it is recommended to use a name from the ISO 639-1 language name list.
    • +
    • code [Optional ; Not repeatable ; String]
      +The code of the language mentioned in name, preferably the three-letter ISO 639-2 (or two-letter ISO 639-1) code.

    • +
  • +
  • access_options [Optional ; Repeatable]
    +This repeatable set of elements describes the different modes and formats in which the database is made accessible. When more than one mode of access is provided, describe them separately.

  • +
+


+
"access_options": [
+  {
+    "type": "string",
+    "uri": "string",
+    "note": "string"
+  }
+]
+


+
    +
  • type [Optional ; Not repeatable ; String]
    +The access type, e.g. “Application Programming Interface (API)”, “Bulk download in CSV format”, “On-line query interface”, etc.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URI corresponding to the access mode mentioned in type.

  • +
  • note [Optional ; Not repeatable ; String]
    +This element allows for annotating any specific information associated with the access mode mentioned in type.

  • +
  • errata [Optional ; Repeatable]
    +A list of errata at the database level. Note that an errata element is also available in the schema used for the description of indicators/series.

  • +
+


+
"errata": [
+  {
+    "date": "string",
+    "description": "string"
+  }
+]
+


+
    +
  • date [Optional ; Not repeatable ; String]
    +The date the erratum was published, preferably entered in ISO 8601 format.

  • +
  • description [Optional ; Not repeatable ; String]
    +A description of the error and of the measures taken to remedy it.

  • +
  • license [Optional ; Repeatable]
    +This set of elements is used to describe the access license(s) attached to the database.

  • +
+


+
"license": [
+  {
+    "name": "string",
+    "uri": "string",
+    "note": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the license, for example “Creative Commons Attribution 4.0 International license (CC-BY 4.0)”.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A URI to a description of the license, for example “https://creativecommons.org/licenses/by/4.0/”;

  • +
  • note [Optional ; Not repeatable ; String]
    +Any additional information to qualify the license requirements.

  • +
  • citation [Optional ; Not repeatable ; String]

    +

    The citation requirement for the database (i.e. how users should cite the database in publications and reports).

  • +
  • notes [Optional ; Repeatable]
    +This element is provided to add notes that are relevant for describing the database, that cannot be provided in other metadata elements.

  • +
+


+
"notes": [
+  {
+    "note": "string"
+  }
+]
+


+
    +
  • note [Optional ; Not repeatable ; String]
    +A free-text note.

  • +
  • disclaimer [Optional ; Not repeatable ; String]
    +If the agency responsible for managing the database has determined that there may be some liability as a result of the data, the element may be used to provide a disclaimer statement.

+
    +
  • copyright [Optional ; Not repeatable ; String]
    +The copyright attached to the database, if any.
  • +
+
+
+

### 7.2.1 Provenance

+

provenance [Optional ; Repeatable]
+Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+

+
"provenance": [
+  {
+    "origin_description": {
+      "harvest_date": "string",
+      "altered": true,
+      "base_url": "string",
+      "identifier": "string",
+      "date_stamp": "string",
+      "metadata_namespace": "string"
+    }
+  }
+]
+


+
    +
  • origin_description [Required ; Not repeatable]
    +The origin_description elements are used to describe when and from where metadata have been extracted or harvested.

    +
      +
    • harvest_date [Required ; Not repeatable ; String]
      +The date and time the metadata were harvested, entered in ISO 8601 format.
    • +
    • altered [Optional ; Not repeatable ; Boolean]
      +A boolean variable (“true” or “false”; “true” by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Document Description / Title Statement section) will be modified when published in a new catalog.
    • +
    • base_url [Required ; Not repeatable ; String]
      +The URL from where the metadata were harvested.
    • +
    • identifier [Optional ; Not repeatable ; String]
      +The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
    • +
    • date_stamp [Optional ; Not repeatable ; String]
      +The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
    • +
    • metadata_namespace [Optional ; Not repeatable ; String]
      +The namespace of the metadata schema used to describe the resource in the originating repository (typically provided as a URI).
    • +
  • +
+
+
+
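For illustration, a provenance entry could be entered as follows in R (the source URL and identifier shown are hypothetical):

```r
provenance = list(
  list(
    origin_description = list(
      harvest_date       = "2021-11-03T10:25:00Z",
      altered            = TRUE,
      base_url           = "https://catalog.example.org/api/catalog",  # hypothetical source catalog
      identifier         = "SRC-DB-001",                               # identifier in the source catalog
      date_stamp         = "2021-10-20",
      metadata_namespace = ""
    )
  )
)
```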

### 7.2.2 Tags

+

tags [Optional ; Repeatable]
+As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. +

+
"tags": [
+  {
+    "tag": "string",
+    "tag_group": "string"
+  }
+]
+


+
    +
  • tag [Required ; Not repeatable ; String]
    +A user-defined tag.
  • +
  • tag_group [Optional ; Not repeatable ; String]

    +A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
  • +
+
+
+
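For example, tags organized in user-defined tag groups could be entered as follows (the tags and groups shown are illustrative):

```r
tags = list(
  list(tag = "poverty",       tag_group = "economy"),
  list(tag = "inequality",    tag_group = "economy"),
  list(tag = "malnutrition",  tag_group = "health")
)
```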

### 7.2.3 LDA topics

+

lda_topics [Optional ; Not repeatable]
+

+
"lda_topics": [
+    {
+        "model_info": [
+            {
+                "source": "string",
+                "author": "string",
+                "version": "string",
+                "model_id": "string",
+                "nb_topics": 0,
+                "description": "string",
+                "corpus": "string",
+                "uri": "string"
+            }
+        ],
+        "topic_description": [
+            {
+                "topic_id": null,
+                "topic_score": null,
+                "topic_label": "string",
+                "topic_words": [
+                    {
+                        "word": "string",
+                        "word_weight": 0
+                    }
+                ]
+            }
+        ]
+    }
+]
+


+

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
+
+Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).
+
+The metadata element lda_topics is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition.

+
+

Important note: the topic composition of a document is specific to a topic model. To ensure consistency of the information captured in the lda_topics elements, it is important to make use of the same model(s) for generating the topic composition of all documents in a catalog. If a new, better LDA model is trained, the topic composition of all documents in the catalog should be updated.

+
+

The image below provides an example of topics extracted from a document from the United Nations High Commissioner for Refugees, using an LDA topic model trained by the World Bank (this model was trained to identify 75 topics; no document will cover all topics).

+

+

The lda_topics element includes the following metadata fields:

+
    +
  • model_info [Optional ; Not repeatable]
    +Information on the LDA model.

    +
      +
    • source [Optional ; Not repeatable ; String]
      +The source of the model (typically, an organization).
    • +
    • author [Optional ; Not repeatable ; String]
      +The author(s) of the model.
    • +
    • version [Optional ; Not repeatable ; String]
      +The version of the model, which could be defined by a date or a number.
    • +
    • model_id [Optional ; Not repeatable ; String]
      +The unique ID given to the model.
    • +
    • nb_topics [Optional ; Not repeatable ; Numeric]
      +The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    • +
    • description [Optional ; Not repeatable ; String]
      +A brief description of the model.
    • +
    • corpus [Optional ; Not repeatable ; String]
      +A brief description of the corpus on which the LDA model was trained.
    • +
    • uri [Optional ; Not repeatable ; String]
      +A link to a web page where additional information on the model is available.

    • +
  • +
  • topic_description [Optional ; Repeatable]
    +The topic composition of the document.

    +
      +
    • topic_id [Optional ; Not repeatable ; String]
      +The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    • +
    • topic_score [Optional ; Not repeatable ; Numeric]
      +The share of the topic in the document (%).
    • +
    • topic_label [Optional ; Not repeatable ; String]
      +The label of the topic, if any (not automatically generated by the LDA model).
    • +
    • topic_words [Optional ; Not repeatable]
      +The list of N keywords describing the topic (e.g., the top 5 words).
      +
        +
      • word [Optional ; Not repeatable ; String]
        +The word.
      • +
      • word_weight [Optional ; Not repeatable ; Numeric]
        +The weight of the word in the definition of the topic. This is specific to the model, not to a document.
      • +
    • +
  • +
+
lda_topics = list(
  
  list(
  
    model_info = list(
      list(source      = "World Bank, Development Data Group",
           author      = "A.S.",
           version     = "2021-06-22",
           model_id    = "Mallet_WB_75",
           nb_topics   = 75,
           description = "LDA model, 75 topics, trained on Mallet",
           corpus      = "World Bank Documents and Reports (1950-2021)",
           uri         = "")
    ),
      
    topic_description = list(
      
      list(topic_id    = "topic_27",
           topic_score = 32,
           topic_label = "Education",
           topic_words = list(list(word = "school",      word_weight = ""),
                              list(word = "teacher",     word_weight = ""),
                              list(word = "student",     word_weight = ""),
                              list(word = "education",   word_weight = ""),
                              list(word = "grade",       word_weight = ""))),
        
      list(topic_id    = "topic_8",
           topic_score = 24,
           topic_label = "Gender",
           topic_words = list(list(word = "women",       word_weight = ""),
                              list(word = "gender",      word_weight = ""),
                              list(word = "man",         word_weight = ""),
                              list(word = "female",      word_weight = ""),
                              list(word = "male",        word_weight = ""))),
        
      list(topic_id    = "topic_39",
           topic_score = 22,
           topic_label = "Forced displacement",
           topic_words = list(list(word = "refugee",     word_weight = ""),
                              list(word = "programme",   word_weight = ""),
                              list(word = "country",     word_weight = ""),
                              list(word = "migration",   word_weight = ""),
                              list(word = "migrant",     word_weight = ""))),
                                
      list(topic_id    = "topic_40",
           topic_score = 11,
           topic_label = "Development policies",
           topic_words = list(list(word = "development", word_weight = ""),
                              list(word = "policy",      word_weight = ""),
                              list(word = "national",    word_weight = ""),
                              list(word = "strategy",    word_weight = ""),
                              list(word = "activity",    word_weight = "")))
                                
    )
      
  )
   
)
+

The information provided by LDA models can be used to build a “filter by topic composition” tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic.

+
+ +
+
+
+

### 7.2.4 Embeddings

+

embeddings [Optional ; Repeatable]
+In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.

+

+

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model.

+

+

The embeddings element contains four metadata fields:

+
    +
  • id [Optional ; Not repeatable ; String]
    +A unique identifier of the word embedding model used to generate the vector.
  • +
  • description [Optional ; Not repeatable ; String]
    +A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
  • +
  • date [Optional ; Not repeatable ; String]
    +The date the model was trained (or a version date for the model).
  • +
  • vector [Required ; Not repeatable ; Object]
    +The numeric vector representing the document, provided as an object (array or string), for example [1,4,3,5,7,9].
  • +
+
+
+
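As an illustration, an embeddings entry could look as follows in R (the model identifier is hypothetical, and the vector is truncated for readability):

```r
embeddings = list(
  list(id          = "wb_word2vec_en_2021",   # hypothetical model identifier
       description = "Word2vec model trained on a corpus of development-related documents; 100-dimension vectors",
       date        = "2021-06-30",
       vector      = c(0.0132, -0.0871, 0.0436, 0.0017))   # truncated; a real vector would have 100 elements
)
```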

### 7.2.5 Additional

+

additional [Optional ; Not repeatable]
+The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.

+

+
+

#### 7.2.5.1 Complete example

+

We use the World Bank’s World Development Indicators 2021 (WDI) database as an example. In this example, we assume that all information is entered manually in the script. In a real application, it is likely that some elements like the list and number of geographic areas covered in the database, or the start and end year of the period covered by the data, will be extracted programmatically by reading the data file (the WDI data and related metadata can be downloaded as CSV or MS-Excel files), or by extracting information from the database API (WDI metadata is available via API).

**Using R**
+
# The code below creates an object `wdi_database` ready to be published in a NADA catalog (using the NADAR package).
+
+wdi_database <- list(
+  
+  database_description = list(    
+    
+    title_statement = list(
+      idno = "WB_WDI_2021_09_15",
+      title = "World Development Indicators 2021",
+      alternate_title = "WDI 2021"
+    ),
+    
+    authoring_entity = list(name = "Development Data Group", 
+                            affiliation = "The World Bank Group"),
+    
+    abstract = "The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.",
+    
+    url = "https://datatopics.worldbank.org/world-development-indicators/",
+    
+    type = "Time series database",
+    
+    date_created = "2021-09-15",
+    date_published = "2021-09-15",
+    
+    version = list(
+      list(version = "On-line public version (open data), 15 September 2021",
+           date = "2021-09-15", 
+           responsibility = "World Bank, Development Data Group")),
+
+    update_frequency = "Quarterly",
+    
+    update_schedule = list(list(update = "April, July, September, December")),
+    
+    time_coverage = list(list(start = "1960", end = "2021")),
+    
+    periodicity = list(list(period = "Annual")),
+    
+    topics = list_topics,
+    
+    geographic_units = list(
+      list(code = "ABW", name = "Aruba"),
+      list(code = "AFE", name = "Africa Eastern and Southern"),
+      list(code = "AFG", name = "Afghanistan"),
+      list(code = "AFW", name = "Africa Western and Central"),
+      list(code = "AGO", name = "Angola"),
+      list(code = "ALB", name = "Albania"),
+      list(code = "AND", name = "Andorra"),
+      list(code = "ARB", name = "Arab World"),
+      list(code = "ARE", name = "United Arab Emirates"),
+      list(code = "ARG", name = "Argentina")
+      # ... and 255 more - not shown here
+    ),
+    
+    geographic_granularity = "global, national, regional",                           
+    
+    geographic_area_count = "265",           
+    
+    languages = list(
+      list(code = "en", name = "English"),
+      list(code = "sp", name = "Spanish"),
+      list(code = "fr", name = "French"),
+      list(code = "ar", name = "Arabic"),
+      list(code = "cn", name = "Chinese")
+    ),
+    
+    contacts = list(list(name = "Data Help Desk", 
+                         affiliation = "World Bank",
+                         uri = "https://datahelpdesk.worldbank.org/",
+                         email = "data@worldbank.org")),    
+    
+    access_options = list(
+      list(type = "API",   
+           uri = "https://datahelpdesk.worldbank.org/knowledgebase/articles/889386"),
+      list(type = "Bulk (CSV)",  
+           uri = "https://data.worldbank.org/data-catalog/world-development-indicators"),
+      list(type = "Query", 
+           uri = "http://databank.worldbank.org/data/source/world-development-indicators"),
+      list(type = "PDF",   
+           uri = "https://openknowledge.worldbank.org/bitstream/handle/10986/26447/WDI-2017-web.pdf")),    
+    
+    license = list(list(name = "CC BY-4.0", 
+                        uri = "https://creativecommons.org/licenses/by/4.0/")),
+    
+    citation = "World Development Indicators 2021 (September), The World Bank"
+    
+  ) 
+  
+)
**Using Python**
+
# The code below creates a dictionary `wdi_database` ready to be published in a NADA catalog (using the PyNADA library).
+
wdi_database = {
  
  "database_description": {    
    
    "title_statement": {
      "idno": "WB_WDI_2021_09_15",
      "title": "World Development Indicators 2021",
      "alternate_title": "WDI 2021"
    },
    
    "authoring_entity": {"name": "Development Data Group", 
                         "affiliation": "The World Bank Group"},
    
    "abstract": "The World Development Indicators is a compilation of relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty. The database contains 1,400 time series indicators for 217 economies and more than 40 country groups, with data for many indicators going back more than 50 years.",
    
    "url": "https://datatopics.worldbank.org/world-development-indicators/",
    
    "type": "Time series database",
    
    "date_created": "2021-09-15",
    "date_published": "2021-09-15",
    
    "version": [{"version": "On-line public version (open data), 15 September 2021",
                 "date": "2021-09-15", 
                 "responsibility": "World Bank, Development Data Group"}],

    "update_frequency": "Quarterly",
    
    "update_schedule": [{"update": "April, July, September, December"}],
    
    "time_coverage": [{"start": "1960", "end": "2021"}],
    
    "periodicity": [{"period": "Annual"}],
    
    "topics": list_topics,
    
    "geographic_units": [
      {"code": "ABW", "name": "Aruba"},
      {"code": "AFE", "name": "Africa Eastern and Southern"},
      {"code": "AFG", "name": "Afghanistan"},
      {"code": "AFW", "name": "Africa Western and Central"},
      {"code": "AGO", "name": "Angola"},
      {"code": "ALB", "name": "Albania"},
      {"code": "AND", "name": "Andorra"},
      {"code": "ARB", "name": "Arab World"},
      {"code": "ARE", "name": "United Arab Emirates"},
      {"code": "ARG", "name": "Argentina"}
      # ... and 255 more, not shown here
    ],
    
    "geographic_granularity": "global, national, regional",                           
    
    "geographic_area_count": "265",           
    
    "languages": [
      {"code": "en", "name": "English"},
      {"code": "sp", "name": "Spanish"},
      {"code": "fr", "name": "French"},
      {"code": "ar", "name": "Arabic"},
      {"code": "cn", "name": "Chinese"}
    ],
    
    "contacts": [{"name": "Data Help Desk", 
                  "affiliation": "World Bank",
                  "uri": "https://datahelpdesk.worldbank.org/",
                  "email": "data@worldbank.org"}],    
    
    "access_options": [
      {"type": "API",   
       "uri": "https://datahelpdesk.worldbank.org/knowledgebase/articles/889386"},
      {"type": "Bulk (CSV)",  
       "uri": "https://data.worldbank.org/data-catalog/world-development-indicators"},
      {"type": "Query", 
       "uri": "http://databank.worldbank.org/data/source/world-development-indicators"},
      {"type": "PDF",   
       "uri": "https://openknowledge.worldbank.org/bitstream/handle/10986/26447/WDI-2017-web.pdf"}
    ],    
    
    "license": [{"name": "CC BY-4.0", 
                 "uri": "https://creativecommons.org/licenses/by/4.0/"}],
    
    "citation": "World Development Indicators 2021 (September), The World Bank"
    
  } 
  
}
# Chapter 8: Indicators and time series

+
+ +
+
+

## 8.1 Indicators, time series, database, and scope of the schema

+

Indicators are summary measures related to key issues or phenomena, derived from observed facts. Indicators form time series when they are provided with a temporal ordering, i.e. when their values are provided with an ordered annual, quarterly, monthly, daily, or other time reference. Time series are usually published with equal intervals between values. In the context of this Guide, we however consider as time series all indicators provided for a given geographic area with an associated time reference, whether this time represents a regular, continuous succession of time stamps or not. For example, the indicators provided by the Demographic and Health Surveys (DHS) StatCompiler, which are only available for the years when DHS are conducted in countries (which for some countries can be a single year), would be considered here as “time series”.

+

Time series are often contained in multi-indicator databases, like the World Bank’s World Development Indicators (WDI), whose on-line version contains series for 1,430 indicators (as of 2021). To document not only the series but also the databases they belong to, we propose two metadata schemas: one to document the series/indicators, the other to document the databases they belong to.

+

In the NADA application, a series can be documented and published without an associated database, but information on a database will only be published in association with a series. The information on a database is thus treated as an “attachment” to the information on a series. A SERIES DESCRIPTION tab will display all metadata related to the series, i.e. all content entered in the series schema.


The (optional) SOURCE DATABASE tab will display the metadata related to the database, i.e. all content entered in the series database schema. This information is displayed for information, but not indexed in the NADA catalog (i.e. not searchable).


**Suggestions and recommendations to data curators**

+
    +
  • Indicators and time series often come with metadata limited to the indicator/series name and a brief definition. This significantly reduces the discoverability of the indicators and the possibility to implement semantic search and recommender systems. It is therefore highly recommended to generate more detailed metadata for each time series, including information on the purpose and typical use of the indicator, its relevance to different audiences, its limitations, and more.

  • +
  • When documenting an indicator or time series, attention should be paid to include keywords and phrases in the metadata that reflect how data users are likely to formulate their queries when searching data catalogs. Subject-matter expertise, combined with an analysis of queries submitted to data catalogs, can help to identify such keywords. For example, the metadata related to an indicator “Prevalence of stunting” should contain the keyword “malnutrition”, and the metadata related to “GDP per capita” should include keywords like “economic growth” or “national income”. By doing so, data curators will provide richer input to search engines and recommender systems, and will have a significant and direct impact on the discoverability of the data. The use of AI tools can considerably facilitate the process of identifying related keywords. We provide in this chapter an example of the use of ChatGPT for this purpose.

  • +
+
+
+
+

## 8.2 Schema description

+

An indicator or time series is documented using the time series/indicators schema. The database schema is optional, and is used to document the database, if any, that the indicator belongs to. When multiple series of the same database are documented, the metadata related to the database only need to be generated once, then applied to all series. One metadata element in the time series/indicators schema is used to link an indicator to the corresponding database.

+
+

### 8.2.1 The time series (indicators) schema

+

The time series schema is used to document an indicator or a time series. In NADA, the data and metadata of an indicator can (but does not have to) be published with information on the database it belongs to (if any). A metadata element is provided to indicate the identifier of that database (if any), and to establish the link between the indicator metadata and the database metadata generated using the schema described above. +

+
{
+  "repositoryid": "string",
+  "access_policy": "na",
+  "data_remote_url": "string",
+  "published": 0,
+  "overwrite": "no",
+  "metadata_information": {},
+  "series_description": {},
+  "provenance": [],
+  "tags": [],
+  "lda_topics": [],
+  "embeddings": [],
+  "additional": { }
+}
+


+
+

#### 8.2.1.1 Cataloguing parameters

+

The first elements of the schema (repositoryid, access_policy, data_remote_url, published, and overwrite) are not part of the series metadata. They are parameters used to indicate how the series will be published in a NADA catalog.

+

repositoryid identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA.

+

access_policy indicates the access policy to be applied to the data: direct access, open access, public use files, licensed access, data accessible from an external repository, and data not accessible. A controlled vocabulary is provided and must be used, with the following respective options: {direct; open; public; licensed; remote; data_na}.

+

data_remote_url provides the link to an external website where the data can be obtained, if the access_policy has been set to remote.

+

published: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.

+

overwrite: Indicates whether metadata that may have been previously uploaded for the same series can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. Note that a series will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element series_description > idno is the same.
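As an illustration, the cataloguing parameters for an openly accessible series could be set as follows in R; the values shown are illustrative, and the exact way these parameters are passed depends on the publishing tool used (e.g., NADAR or PyNADA):

```r
repositoryid    <- "ECO"    # ID of a (previously created) collection in which to publish the entry
access_policy   <- "open"   # data can be accessed without restriction
data_remote_url <- ""       # only used when access_policy is set to "remote"
published       <- 1        # make the entry visible in the catalog
overwrite       <- "yes"    # replace an existing entry with the same idno, if any
```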

+
+
+

#### 8.2.1.2 Metadata information

+

metadata_information [Optional, Not Repeatable]
+The set of elements in metadata_information is used to provide information on the production of the indicator metadata. This information is used mostly for administrative purposes by data curators and catalog administrators. +

+
"metadata_information": {
+  "title": "string",
+  "idno": "string",
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "prod_date": "string",
+  "version": "string"
+}
+


+
    +
  • title [Optional ; Not repeatable ; String]
    +The title of the metadata document containing the indicator metadata.

  • +
  • idno [Required ; Not repeatable ; String]
    +A unique identifier of the indicator metadata document. It can be for example the identifier of the indicator preceded by a prefix identifying the metadata producer.

  • +
  • producers [Optional ; Repeatable]
    +This is a list of producers involved in the documentation (production of the metadata) of the series.

    +
      +
    • name [Optional ; Not repeatable, String]
      +The name of the agency that is responsible for the documentation of the series.
    • +
    • abbr [Optional ; Not repeatable, String]
      +Abbreviation (acronym) of the agency mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable, String]
      +Affiliation of the agency mentioned in name.
    • +
    • role [Optional ; Not repeatable, String]
      +The specific role of the agency mentioned in name in the production of the metadata. This element will be used when more than one person or organization is listed in the producers element to distinguish the specific contribution of each metadata producer.

    • +
  • +
  • prod_date [Optional ; Not repeatable, String]
    +The date the metadata was generated. The date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).

  • +
  • version [Optional ; Not repeatable, String]
    +The version of the metadata on this series. This element will rarely be used.

    +
    metadata_creation = list(
    +
    +  producers = list(list(name = "Development Data Group", 
    +                        abbr = "DECDG", 
    +                        affiliation = "World Bank")),
    +
    +  prod_date = "2021-10-15"
    +
    +)
  • +
+
+
+

#### 8.2.1.3 Series description

+

series_description [Required ; Not repeatable]
+This section contains all elements used to describe a specific series or indicator. +

+
"series_description": {
+  "idno": "string",
+  "doi": "string",
+  "name": "string",
+  "database_id": "string",
+  "aliases": [],
+  "alternate_identifiers": [],
+  "languages": [],
+  "measurement_unit": "string",
+  "dimensions": [],
+  "periodicity": "string",
+  "base_period": "string",
+  "definition_short": "string",
+  "definition_long": "string",
+  "definition_references": [],
+  "statistical_concept": "string",
+  "concepts": [],
+  "methodology": "string",
+  "derivation": "string",
+  "imputation": "string",
+  "missing": "string",
+  "quality_checks": "string",
+  "quality_note": "string",
+  "sources_discrepancies": "string",
+  "series_break": "string",
+  "limitation": "string",
+  "themes": [],
+  "topics": [],
+  "disciplines": [],
+  "relevance": "string",
+  "time_periods": [],
+  "ref_country": [],
+  "geographic_units": [],
+  "bbox": [],
+  "aggregation_method": "string",
+  "disaggregation": "string",
+  "license": [],
+  "confidentiality": "string",
+  "confidentiality_status": "string",
+  "confidentiality_note": "string",
+  "links": [],
+  "api_documentation": [],
+  "authoring_entity": [],
+  "sources": [],
+  "sources_note": "string",
+  "keywords": [],
+  "acronyms": [],
+  "errata": [],
+  "notes": [],
+  "related_indicators": [],
+  "compliance": [],
+  "framework": [],
+  "series_groups": []
+}
+


+
    +
  • idno [Required ; Not repeatable ; String]

    +

    A unique identifier (ID) for the series. Most agencies and databases will have a coherent coding convention to generate their series IDs. For example, the name of the series in the World Bank’s World Development Indicators series are composed of the following elements, separated by a dot:

    +
      +
    • Topic code (2 digits).
    • +
    • General subject code (3 digits)
    • +
    • Specific subject code (4 digits)
    • +
    • Extensions (2 digits each)
    • +
    +

    For example, the series with identifier “DT.DIS.PRVT.CD” is the series containing data on “External debt disbursements by private creditors in current US dollars” (for more information, see “How does the World Bank code its indicators?”).
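    For illustration, the dot-separated components of such an identifier can be extracted programmatically, for example in R:

```r
# Decompose a WDI series identifier into its dot-separated components
idno <- "DT.DIS.PRVT.CD"
unlist(strsplit(idno, ".", fixed = TRUE))
#> [1] "DT"   "DIS"  "PRVT" "CD"
```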

  • +
  • doi [Optional ; Not repeatable ; String]

    +

    A Digital Object Identifier (DOI) for the series.

  • +
  • name [Required ; Not repeatable ; String]

    +

    The name (label) of the series. Note that a field alias is provided (see below) to capture alternative names for the series.

  • +
  • database_id [Optional ; Not repeatable ; String]

    +

    The unique identifier of the database the series belongs to. This field must correspond to the element database_description > title_statement > idno of the database schema described above. This is the only field that is needed to establish the link between the database metadata and the indicator metadata.

  • +
  • aliases [Optional ; Repeatable]
    +A series or an indicator can be referred to using different names. The aliases element is provided to capture the multiple names and labels that may be associated with (i.e., synonyms of) the documented series or indicator.

  • +
+
"aliases": [
+  {
+    "alias": "string"
+  }
+]
+


+- alias [Optional ; Not repeatable ; String]
+An alternative name for the indicator or series being documented.

+
    +
  • alternate_identifiers [Optional ; Repeatable]
    +The element idno described above is the reference unique identifier for the catalog in which the metadata is intended to be published. But the same indicator/metadata may be published in other catalogs. For example, a data catalog may publish metadata for series extracted from the World Bank World Development Indicators (WDI) database. And the WDI itself contains series generated and published by other organizations, such as the World Health Organization or UNICEF. Catalog administrators may want to assign a unique identifier specific to their catalog (the idno element), but keep track of the identifier of the series or indicator in other catalogs or databases. The alternate_identifiers element serves that purpose.
  • +
+
"alternate_identifiers": [
+  {
+    "identifier": "string",
+    "name": "string",
+    "database": "string",
+    "uri": "string",
+    "notes": "string"
+  }
+]
+


+
    +
  • identifier [Required ; Not repeatable ; String]
    +An identifier for the series other than the identifier entered in idno. (The identifier entered in idno may also be included in this list if it is useful to associate it with a type of identifier, using the name element below, which is not provided in idno.) This can be the identifier of the indicator in another database/catalog, or a global unique identifier.

  • +
  • name [Optional ; Not repeatable ; String]

    +This element will be used to define the type of identifier. This will typically be used to flag DOIs by entering “Digital Object Identifier (DOI)”.

  • +
  • database [Optional ; Not repeatable ; String]
    +The name of the database (or catalog) where this alternative identifier is used, e.g. “IMF, International Financial Statistics (IFS)”.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link (URL) to the database mentioned in database.

  • +
  • notes [Optional ; Not repeatable ; String]
    +Any additional information on the alternate identifier.
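    A minimal sketch in R (the identifier, database name, and URL shown are hypothetical):

```r
alternate_identifiers = list(
  list(identifier = "10.1234/example.sp.pop.totl",       # hypothetical DOI
       name       = "Digital Object Identifier (DOI)",
       database   = "Example aggregator data catalog",   # hypothetical catalog
       uri        = "https://datacatalog.example.org/")
)
```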

  • +
  • languages [Optional ; Repeatable]

    +This set of elements is used to list the language(s) in which the metadata (and possibly the data and related documentation) for the series or indicator are available.

  • +
+
"languages": [
+  {
+    "name": "string",
+    "code": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the language.

  • +
  • code [Optional ; Not repeatable ; String]
    +The code of the language, preferably the ISO code.

  • +
  • measurement_unit [Optional ; Not repeatable ; String]

    +

    The unit of measurement. Note that in many databases the measurement unit will be included in the series name/label. In the World Bank’s World Development Indicators for example, series are named as follows:

    +
      +
    • CO2 emissions (kg per 2010 US$ of GDP)
    • +
    • GDP per capita (current US$)
    • +
    • GDP per capita (current LCU)
    • +
    • Population density (people per sq. km of land area)
    • +
    +

    In such case, the name of the series should not be changed, but the measurement unit may be extracted from it and stored in element measurement_unit.
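    For illustration, the unit shown in parentheses at the end of a WDI-style series name could be extracted programmatically; a minimal sketch in R:

```r
series_name <- "GDP per capita (current US$)"

# Keep the content of the last pair of parentheses as the measurement unit
measurement_unit <- sub(".*\\(([^)]*)\\)\\s*$", "\\1", series_name)
measurement_unit
#> [1] "current US$"
```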

  • +
  • dimensions [Optional ; Repeatable]

    +An indicator or time series can be made available at different levels of disaggregation. For example, a time series containing annual estimates of the indicator “Resident population (mid-year)” can be provided by country, by urban/rural area of residence, by sex, by age group. The data curator has to make a decision on how to organize such data. One option is to create an indicator “Resident population (mid-year)” and to define a set of “dimensions” for the breakdowns. The dimensions would in such case be the year, the country, the area of residence, the sex, and the age group. Some of the dimensions would have to be provided with a code list (or “controlled vocabulary”), for example stating that “F” means “Female”, “M” means “Male”, and “T” means “Total” for the dimension sex. Another option would be to create multiple indicators (e.g., creating three distinct indicators “Resident population, male (mid-year)”, “Resident population, female (mid-year)”, “Resident population, total (mid-year)” and using year, country, area of residence, and age group as dimensions). The element dimensions is used to provide an itemized list of disaggregations that correspond to the published data. Note that another element in the schema, disaggregation, is also provided, in which a narrative description of the actual or recommended disaggregations can be documented. Note also that in the SDMX standard, dimensions are listed in the “Data Structure Definition” and are complemented by *code lists* that provide the related controlled vocabularies.
    +

  • +
+
"dimensions": [
+  {
+    "name": "string",
+    "label": "string",
+    "description": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the dimension.

  • +
  • label [Required ; Not repeatable ; String]
    +The label of the dimension, for example “sex”, or “urban/rural”.

  • +
  • description [Optional ; Not repeatable ; String]
    +A description of the dimension (for example, if the label was “age group”, the description can provide detailed information on the age groups, e.g. “The age groups in the database are 0-14, 15-49, 50-64, and 65+ years old”.)

  • +
  • release_calendar [Optional ; Not repeatable ; String]

    +

    Information on when updates for the indicators can be expected. This will usually not consist of exact dates (which would have to be updated regularly), but of more general information like “Every first Monday of the Month”, or “Every year on June 30”, or “The last week of each quarter”.

  • +
  • periodicity [Optional ; Not repeatable ; String]

    +

    The periodicity of the series. It is recommended to use a controlled vocabulary with values like annual, quarterly, monthly, daily, etc.

  • +
  • base_period [Optional ; Not repeatable ; String]

    +

    The base period for the series. This field will only apply to series that require a base year (or other reference time) used as a benchmark, like a Consumer Price Index (CPI) which will have a value of 100 for a reference base year.

  • +
  • definition_short [Optional ; Not repeatable ; String]
    +A short definition of the series. The short definition captures the essence of the series.

  • +
  • definition_long [Optional ; Not repeatable ; String]
    +A long(er) version of the definition of the series. If only one definition is available (not a short/long version), it is recommended to capture it in the definition_short element. Alternatively, the same definition can be stored in both definition_short and definition_long.

  • +
  • definition_references [Optional ; Repeatable]
    +This element is provided to link to the external resource(s) from which the definition was extracted.

  • +
+
"definition_references": [
+  {
+    "source": "string",
+    "uri": "string",
+    "note": "string"
+  }
+]
+


+
    +
  • source [Optional ; Not repeatable ; String]
    +The source of the definition (title, or label).

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link (URL) to the source of the definition.

  • +
  • note [Optional ; Not repeatable ; String]
    +This element provides for annotating or explaining the reason the reference has been included as part of the metadata.

  • +
  • statistical_concept [Optional ; Not repeatable ; String]

    +

    This element is used to reference the statistical concept(s) to which the series relates. This can include coding concepts or standards that are applied to render the data statistically relevant.

  • +
  • concepts [Optional ; Repeatable]
    +This repeatable element can be used to document concepts related to the indicators or time series (other than the main statistical concept that may have been entered in statistical_concept). For example, the concept of malnutrition could be documented in relation to the indicators “Prevalence of stunting” and “Prevalence of wasting”.

  • +
+
"concepts": [
+  {
+    "name": "string",
+    "definition": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +A concise and standardized name (label) for the concept.

  • +
  • definition [Required ; Not repeatable ; String]
    +The definition of the concept.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link (URL) to a resource providing more detailed information on the concept.
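    For example, the concept of malnutrition mentioned above could be documented as follows in R (the definition and link shown are illustrative and should be replaced by an authoritative source):

```r
concepts = list(
  list(name       = "Malnutrition",
       definition = "Deficiencies, excesses, or imbalances in a person's intake of energy and/or nutrients.",
       uri        = "https://www.who.int/health-topics/malnutrition")
)
```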

  • +
  • data_collection [Optional ; Not repeatable]
    +This group of elements can be used to document data collection activities that led to or allowed the production of the indicator. This element will typically be used for the description of surveys or censuses. +Note: the schema also contains an element “sources”. That element will be used to document the organization and/or main data production program from which the indicator is derived. +

  • +
+
"data_collection": [
+  {
+    "data_source": "string",
+    "method": "string",
+    "period": "string",
+    "note": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • data_source [Required ; Not repeatable ; String]
    +A concise and standardized name (label) for the data source, e.g. “National Labor Force Survey, 1st quarter 2022”. If multiple data sources were used, they can all be listed here. Note that if a time series has values obtained from many different sources, the source for each value (or group of values) will not be part of the indicator/series metadata; it will instead be stored as an attribute in the data file, where the information can be associated with a specific observation (a “cell note”) or with a group of observations (e.g., attached to all values for a same year or a same area).

  • +
  • method [Required ; Not repeatable ; String]
    +Brief information on the data collection method, e.g. “Sample household survey”.

  • +
  • period [Optional ; Not repeatable ; String]
    +Information on the period of the data collection, e.g. “January to March 2022”.

  • +
  • note [Optional ; Not repeatable ; String]
    +Additional information on the data collection.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to a resource (website, document) where more information on the data collection can be found.

  • +
  • imputation [Optional ; Not repeatable ; String]

    +

    Data may have been imputed to account for data gaps or for other reasons (harmonization/standardization, and others). If imputations have been made, this element provides the space for their description.

  • +
  • adjustments [Optional ; Repeatable ; String]

    +

    Description of any adjustments with respect to use of standard classifications and harmonization of breakdowns for age group and other dimensions, or adjustments made for compliance with specific international or national definitions.

  • +
  • missing [Optional ; Not repeatable ; String]

    +

    Information on missing values in the series or indicator. This information can be related to treatment of missing values, to the cause(s) of missing values, and others.

  • +
  • validation_rules [Optional ; Repeatable ; String]

    +

    Description of the set of rules (itemized) used to validate values for the indicator, e.g. “Is within range 0-100”, or “Is the sum of indicatorX + indicator Y”.

  • +
  • quality_checks [Optional ; Not repeatable ; String]

    +

    Data may have gone through data quality checks to assure that the values are reasonable and coherent, which can be described in this element. These quality checks may include checking for outlying values or other. A brief description of such quality control procedures will contribute to reinforcing the credibility of the data being disseminated.

  • +
  • quality_note [Optional ; Not repeatable ; String]

    +

    Additional notes or an overall statement on data quality. These could for example cover non-standard quality notes and/or information on independent reviews on the data quality.

  • +
  • sources_discrepancies [Optional ; Not repeatable ; String]

    +

    This element is used to describe and explain why the data in the series may be different from the data for the same series published in other sources. International organizations, for example, may apply different techniques to make data obtained from national sources comparable across countries, in which cases the data published in international databases may differ from the data published in national, official databases.

  • +
  • series_break [Optional ; Not repeatable ; String]

    +

    Breaks in statistical series occur when there is a change in the standards, sources of data, or reference year used in the compilation of a series. Breaks in series must be well documented. The documentation should include the reason(s) for the break, the time it occurred, and information on the impact on comparability of data over time.

  • +
  • limitation [Optional ; Not repeatable ; String]

    +

    This element is used to communicate to the user any limitations or exceptions in using the data. The limitations may result from the methodology, from issues of quality or consistency in the data source, or other.

  • +
  • themes [Optional ; Repeatable]
    +Themes provide a general idea of the research that might guide the creation and/or demand for the series. A theme is broad and is likely also subject to a community based definition or list. A controlled vocabulary should be used. This element will rarely be used (the element topics described below will be used more often). +

  • +
+
"themes": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • id [Optional ; Not repeatable ; String]
    +The unique identifier of the theme. It can be a sequential number, or the ID of the theme in a controlled vocabulary.

  • +
  • name [Required ; Not repeatable ; String]
    +The label of the theme associated with the data.

  • +
  • parent_id [Optional ; Not repeatable ; String]
    +When a hierarchical (nested) controlled vocabulary is used, the parent_id field can be used to indicate a higher-level theme to which this theme belongs.

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name of the controlled vocabulary used, if any.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to the controlled vocabulary mentioned in field `vocabulary’.

  • +
  • topics [Optional ; Repeatable]
    +The topics field indicates the broad substantive topic(s) that the indicator/series covers. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) topics classification.
    +

  • +
+
"topics": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • id [Optional ; Not repeatable ; String]
    +The unique identifier of the topic. It can be a sequential number, or the ID of the topic in a controlled vocabulary.

  • +
  • name [Required ; Not repeatable ; String]
    +The label of the topic associated with the data.
    +

  • +
  • parent_id [Optional ; Not repeatable ; String]
    +When a hierarchical (nested) controlled vocabulary is used, the parent_id field can be used to indicate a higher-level topic to which this topic belongs.

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name of the controlled vocabulary used, if any.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to the controlled vocabulary mentioned in field vocabulary.
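    For example, a topic taken from the CESSDA classification mentioned above could be entered as follows (the identifier shown is illustrative):

```r
topics = list(
  list(id         = "T01",   # illustrative identifier
       name       = "Demography and population",
       vocabulary = "CESSDA Topic Classification",
       uri        = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
)
```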

  • +
  • disciplines [Optional ; Repeatable]
    +Information on the academic disciplines related to the content of the document. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia. +

  • +
+
"disciplines": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+

This is a block of five elements:

+
    +
  • id [Optional ; Not repeatable ; String]
    +The ID of the discipline, preferably taken from a controlled vocabulary.

  • +
  • name [Optional ; Not repeatable ; String]
    +The name (label) of the discipline, preferably taken from a controlled vocabulary.

  • +
  • parent_id [Optional ; Not repeatable ; String]
    +The parent ID of the discipline (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name (including version number) of the controlled vocabulary used, if any.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL to the controlled vocabulary used, if any.

  • +
  • relevance [Optional ; Not repeatable ; String]

    +

    This field documents the relevance of an indicator or series in relation to a social imperative or policy objective.

  • +
  • mandate [Optional ; Not repeatable]

    +
      +
    • mandate [Optional ; Not repeatable ; String]
      +Description of the institutional mandate or of a set of rules or other formal set of instructions assigning responsibility as well as the authority to an organization for the collection, processing, and dissemination of statistics for this indicator.
    • +
    • URI [Optional ; Not repeatable ; String]
      +A link to a resource (document, website) describing the mandate.
    • +
  • +
  • time_periods [Optional ; Repeatable]
    +The time period covers the entire span of data available for the series. The time period has a start and an end and is reported according to the periodicity provided in a previous element. +

  • +
+
"time_periods": [
+  {
+    "start": "string",
+    "end": "string",
+    "notes": "string"
+  }
+]
+


+
    +
  • start [Required ; Not repeatable ; String]
    +The initial date of the series in the dataset. The start date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).

  • +
  • end [Required ; Not repeatable ; String]
    +The end date is the latest date for which an estimate for the indicator is available. The end date should be entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).

  • +
  • notes [Optional ; Not repeatable ; String]
    +Additional information on the time period.

  • +
  • ref_country [Optional ; Repeatable]

    +A list of countries for which data are available in the series. This element is somewhat redundant with the next element (geographic_units) which may also contain a list of countries. Identifying geographic areas of type “country” is important to enable filters and facets in data catalogs (country names are among the most frequent queries submitted to catalogs). +

  • +
+
"ref_country": [
+  {
+    "name": "string",
+    "code": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the country.

  • +
  • code [Optional ; Not repeatable ; String]
    +The code of the country. The use of the ISO 3166-1 alpha-3 codes is recommended.

  • +
  • geographic_units [Optional ; Repeatable]
    +List of geographic units (regions, countries, states, provinces, etc.) for which data are available for the series. +

  • +
+
"geographic_units": [
+  {
+    "name": "string",
+    "code": "string",
+    "type": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +Name of the geographic unit, e.g. “World”, “Africa”, “Afghanistan”, “OECD countries”, “Bangkok”.

  • +
  • code [Optional ; Not repeatable ; String]
    +Code of the geographic unit. The ISO 3166-1 alpha-3 code is preferred when the unit is a country.

  • +
  • type [Optional ; Not repeatable ; String]
    +Type of geographic unit e.g. “country”, “state”, “region”, “province”, “city”, etc.

  • +
  • bbox [Optional ; Repeatable]
    +This element is used to define one or multiple bounding box(es), which are the rectangular fundamental geometric description of the geographic coverage of the data. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the dataset’s geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. This element is optional, but if the bound_poly element (see below) is used, then the bbox element must be included.
    +

  • +
+
"bbox": [
+  {
+    "west": "string",
+    "east": "string",
+    "south": "string",
+    "north": "string"
+  }
+]
+


+
    +
  • west [Required ; Not repeatable ; String]
    +West longitude of the bounding box.
  • +
  • east [Required ; Not repeatable ; String]
    +East longitude of the bounding box.
  • +
  • south [Required ; Not repeatable ; String]
    +South latitude of the bounding box.
  • +
  • north [Required ; Not repeatable ; String]
    +North latitude of the bounding box.
  • +
+This example is for a study covering the islands of Madagascar and Mauritius +
+ +
+
my_indicator <- list(
+  metadata_information = list(
+    # ... 
+  ),
+  series_description = list(
+    # ... ,
+    study_info = list(
+      # ... ,
+      
+      ref_country = list(
+        list(name = "Madagascar", code = "MDG"),
+        list(name = "Mauritius",  code = "MUS")
+      ),
+      
+      bbox = list(
+        
+        list(name  = "Madagascar",
+             west  = "43.2541870461", 
+             east  = "50.4765368996", 
+             south = "-25.6014344215", 
+             north = "-12.0405567359"),
+        
+        list(name  = "Mauritius",
+             west  = "56.6", 
+             east  = "72.466667", 
+             south = "-20.516667", 
+             north = "-5.25")
+        
+        ),
+    # ...
+  ),
+  # ...
+  )
+)
+


+
    +
  • aggregation_method [Optional ; Not repeatable ; String]

    +

    The aggregation_method element describes how values can be aggregated from one geographic level (for example, a country) to a higher-level geographic area (for example, a group of countries defined based on a geographic criterion (region, world) or another criterion (low/medium/high-income countries, island countries, OECD countries, etc.)). The aggregation method can be simple (like “sum” or “population-weighted average”) or more complex, involving weighting of values.

  • +
  • disaggregation [Optional ; Not repeatable ; String]

    +

    This element is intended to inform users that an indicator or series is available at various levels of disaggregation. The related series should be listed (by name and/or identifier). For the indicator “Population, total”, for example, one may inform the user that the indicator is also available (in other series) by sex, urban/rural, and age group (in series “Population, male”, “Population, female”, etc.).

  • +
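+The sketch below (with fictitious content; the World Bank series codes SP.POP.TOTL.FE.IN and SP.POP.TOTL.MA.IN are shown for illustration only) suggests how these two elements could be filled for a total population indicator: +
+
my_indicator <- list(
+  # ... ,
+  series_description = list(
+    # ... ,
+    aggregation_method = "Sum of the values of the countries or economies in the group",
+    disaggregation = "The indicator is also available by sex, in series 'Population, female' (SP.POP.TOTL.FE.IN) and 'Population, male' (SP.POP.TOTL.MA.IN)"
+    # ...
+  )
+)
+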
  • license [Optional ; Repeatable]
    +The license refers to the accessibility and terms of use associated with the data. Providing a license and a link to the terms of the license allows data users to determine, with full clarity, what they can and cannot do with the data. +

  • +
+
"license": [
+  {
+    "name": "string",
+    "uri": "string",
+    "note": "string"
+  }
+]
+


+
    +
  • name [Required ; Not repeatable ; String]
    +The name of the license, e.g. “Creative Commons Attribution 4.0 International license (CC-BY 4.0)”.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL of a website where the license is described in detail, for example “https://creativecommons.org/licenses/by/4.0/”.

  • +
  • note [Optional ; Not repeatable ; String]
    +Any additional information on the license.

  • +
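+For example, for data published under a Creative Commons CC-BY 4.0 license: +
+
license = list(
+  list(name = "Creative Commons Attribution 4.0 International license (CC-BY 4.0)",
+       uri  = "https://creativecommons.org/licenses/by/4.0/")
+)
+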
  • confidentiality [Optional ; Not repeatable ; String]

    +

    A statement of confidentiality for the series.

  • +
  • confidentiality_status [Optional ; Not repeatable ; String]

    +

    This indicates a confidentiality status for the series. A controlled vocabulary should be used with possible options “public”, “official use only”, “confidential”, “strictly confidential”. When all series are made publicly available, and belong to a database that has an open or public access policy, this element can be ignored.

  • +
  • confidentiality_note [Optional ; Not repeatable ; String]

    +

    This element is reserved for additional notes regarding confidentiality of the data. This could involve references to specific laws and circumstances regarding the use of data.

  • +
  • links [Optional ; Repeatable]
    +This element provides links to online resources of any type that could be useful to the data users. This can be links to description of methods and reference documents, analytics tools, visualizations, data sources, or other. +

  • +
+
"links": [
+  {
+    "type": "string",
+    "description": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • type [Optional ; Not repeatable ; String]
    +This element is used to classify the type of link that is provided.

  • +
  • description [Optional ; Not repeatable ; String]
    +A description of the link that is provided.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The uri (URL) to the described resource.

  • +
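+A hypothetical example with two links (the URLs are placeholders): +
+
links = list(
+  list(type        = "Methodology document",
+       description = "Description of the methods used to compile the indicator",
+       uri         = "http://www.example.org/methodology"),
+  list(type        = "Dashboard",
+       description = "Interactive visualization of the indicator",
+       uri         = "http://www.example.org/dashboard")
+)
+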
  • api_documentation [Optional ; Repeatable]
    +Increasingly, data are made accessible via Application Programming Interfaces (APIs). The API associated with a series must be documented. The documentation will usually not be specific to a series, but will apply to all series in the same database. +

  • +
+
"api_documentation": [
+  {
+    "description": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • description [Optional ; Not repeatable ; String]
    +This element will not contain the API documentation itself, but information on what documentation is available.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL of the API documentation.

  • +
  • authoring_entity [Optional ; Repeatable]
    +This set of five elements is used to identify the organization(s) or person(s) who are the main producers/curators of the indicator. Note that a similar element is provided at the database level. The authoring_entity for the indicator can be different from the authoring_entity of the database. For example, the World Bank is the authoring entity for the World Development Indicators database, which contains indicators obtained from the International Monetary Fund, World Health Organization, and other organizations that are thus the authoring entities for specific indicators.
    +

  • +
+
"authoring_entity": [
+  {
+    "name": "string",
+    "affiliation": "string",
+    "abbreviation": null,
+    "email": null,
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the person or organization who is responsible for the production of the indicator or series. Write the name in full (use the element abbreviation to capture the acronym of the organization, if relevant).

  • +
  • affiliation [Optional ; Not repeatable ; String]
    +The affiliation of the person or organization mentioned in name.
    +

  • +
  • abbreviation [Optional ; Not repeatable ; String]
    +Abbreviated name (acronym) of the organization mentioned in name.

  • +
  • email [Optional ; Not repeatable ; String]
    +The public email contact of the person or organizations mentioned in name. It is good practice to provide a service account email address, not a personal one.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link (URL) to the website of the entity mentioned in name.

  • +
  • sources [Optional ; Repeatable]
    +This element provides information on the source(s) of data that were used to generate the indicator. A source can refer to an organization (e.g., “Source: World Health Organization”), or to a dataset (e.g., for a national poverty headcount indicator, the sources will likely be a list of sample household surveys). In sources, we are mainly interested in the latter. When a series in a database is extracted from another database (e.g., when the World Bank World Development Indicators database includes a series produced by the World Health Organization), the source organization should be mentioned in the authoring_entity element of the schema. The sources element is a repeatable element. +Note 1: In some cases, the source of a specific value in a database will be stored as an attribute of the data file (e.g., as a “footnote” attached to a specific cell). If the sources are listed in the data file, they may but do not need to be stored in the metadata. +Note 2: The schema also contains an element data_collection that can be used to describe a specific data collection activity from which an indicator is derived. +

  • +
+
"sources": [
+  {
+    "id": "string",
+    "name": "string",
+    "organization": "string",
+    "type": "string",
+    "note": "string"
+  }
+]
+


+
    +
  • id [Required ; String]
    +This element records the unique identifier of a source. It is a required element. If the source does not have a specific unique identifier, a sequential number can be used. If the source is a dataset or database that has its own unique identifier (possibly a DOI), this identifier should be used.

  • +
  • name [Optional ; String]
    +The name (title, or label) of the source.

  • +
  • organization [Optional ; String]
    +The organization responsible for the source data.
    +

  • +
  • type [Optional ; String]
    +The type of source, e.g. “household survey”, “administrative data”, or “external database”.
    +

  • +
  • note [Optional ; String]
    +This element can be used to provide additional information regarding the source data.

  • +
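+A sketch (with fictitious surveys) for an indicator derived from a series of household surveys: +
+
sources = list(
+  list(id           = "1",
+       name         = "Household Income and Expenditure Survey 2014",
+       organization = "National Statistics Office",
+       type         = "household survey"),
+  list(id           = "2",
+       name         = "Household Income and Expenditure Survey 2019",
+       organization = "National Statistics Office",
+       type         = "household survey")
+)
+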
  • sources_note [Optional ; Not repeatable ; String]

    +

    Additional information on the source(s) of data used to generate the series or indicator.

  • +
  • keywords [Optional ; Repeatable]
    +Words or phrases that describe salient aspects of a data collection’s content. Can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed. Keywords should be selected from a standard thesaurus, preferably an international, multilingual thesaurus.
    +

  • +
+
"keywords": [
+  {
+    "name": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required ; String ; Non repeatable]
    +Keyword (or phrase). Keywords summarize the content or subject matter of the study.

  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +Controlled vocabulary from which the keyword is extracted, if any.
    +

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URI of the controlled vocabulary used, if any.

  • +
  • acronyms [Optional ; Repeatable]
    +The acronyms element is used to document the meaning of all acronyms used in the metadata of a series. While some acronyms are well known (like “GDP” or “IMF”), others may be less obvious or ambiguous (does “PPP” mean “public-private partnership”, or “purchasing power parity”?). In any case, providing a list of acronyms with their meaning will help users and make your metadata more discoverable. Note that acronyms should not include country codes used in the documentation of the geographic coverage of the data. +

  • +
+
"acronyms": [
+  {
+    "acronym": "string",
+    "expansion": "string",
+    "occurrence": 0
+  }
+]
+


+
    +
  • acronym [Required ; Not repeatable ; String]
    +An acronym referenced in the series metadata (e.g. “GDP”).

  • +
  • expansion [Required ; Not repeatable ; String]
    +The expansion of the acronym, i.e. the full name or title that it represents (e.g., “Gross Domestic Product”).

  • +
  • occurrence [Optional ; Not repeatable ; Numeric]
    +This numeric element can be used to indicate the number of times the acronym is mentioned in the metadata. The element will rarely be used.

  • +
  • errata [Optional ; Repeatable]
    +This element is used to provide information on detected errors in the data or metadata for the series, and on the measures taken to remedy them. +

  • +
+
"errata": [
+  {
+    "date": "string",
+    "description": "string"
+  }
+]
+


+
    +
  • date [Required ; Repeatable ; String]
    +The date the erratum was published.

  • +
  • description [Required ; Repeatable ; String]
    +A description of the error and remedy measures.

  • +
  • notes [Optional ; Repeatable]
    +This element is open and reserved for explanatory notes deemed useful to the users of the data. Notes should provide additional information that may help users replicate the series or access and use the data, or that may otherwise improve the discoverability of the data. +

  • +
+
"notes": [
+  {
+    "note": "string"
+  }
+]
+


+
    +
  • note [Required ; Repeatable ; String]
    +The note itself.

  • +
  • related_indicators [Optional ; Repeatable]
    +This element is used to reference indicators that are often associated with the indicator being documented. +

  • +
+
"related_indicators": [
+  {
+    "code": "string",
+    "label": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • code [Optional ; Not repeatable ; String]
    +The code of the related indicator. This will typically be the identifier (ID) used for that indicator in its own database or catalog.

  • +
  • label [Optional ; Not repeatable ; String]
    +The name or label of the indicator that is associated with the indicator being documented.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to the related indicator.

  • +
  • compliance [Optional ; Repeatable]

    +For some indicators, international standards have been established. This is for example the case of indicators like the unemployment rate, for which the International Conference of Labour Statisticians defines the standard concepts and methods. The compliance element is used to document the compliance of a series with one or multiple national or international standards.
    +

  • +
+
"compliance": [
+  {
+    "standard": "string",
+    "abbreviation": "string",
+    "custodian": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • standard [Optional ; Not repeatable ; String]
    +The name of the standard that the series complies with. This name will ideally include a label and a version or a date. For example: “International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4, published in 2007”

  • +
  • abbreviation [Optional ; Not repeatable ; String]
    +The acronym of the standard that the series complies with.

  • +
  • custodian [Optional ; Not repeatable ; String]
    +The organization that maintains the standard that is being used for compliance. For example: “United Nations Statistics Division”.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to a public website where information on the compliance standard can be obtained. For example: “https://unstats.un.org/unsd/classifications/Family/Detail/27”

  • +
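+For example, for a series compliant with the ISIC Revision 4 classification mentioned above: +
+
compliance = list(
+  list(standard     = "International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4, published in 2007",
+       abbreviation = "ISIC Rev. 4",
+       custodian    = "United Nations Statistics Division",
+       uri          = "https://unstats.un.org/unsd/classifications/Family/Detail/27")
+)
+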
  • framework [Optional ; Repeatable]

    +Some national, regional, and international agencies develop monitoring frameworks, with goals, targets, and indicators. Some well-known examples are the Millennium Development Goals and the Sustainable Development Goals which establish international goals for human development, or the World Summit for Children (1990) which set international goals in the areas of child survival, development and protection, supporting sector goals such as women’s health and education, nutrition, child health, water and sanitation, basic education, and children in difficult circumstances. The framework element is used to link an indicator or series to the framework, goal, and target associated with it. +

  • +
+
"framework": [
+  {
+    "name": "string",
+    "abbreviation": "string",
+    "custodian": "string",
+    "description": "string",
+    "goal_id": "string",
+    "goal_name": "string",
+    "goal_description": "string",
+    "target_id": "string",
+    "target_name": "string",
+    "target_description": "string",
+    "indicator_id": "string",
+    "indicator_name": "string",
+    "indicator_description": "string",
+    "uri": "string",
+    "notes": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the framework.

  • +
  • abbreviation [Optional ; Not repeatable ; String]
    +The abbreviation of the name of the framework.

  • +
  • custodian [Optional ; Not repeatable ; String]
    +The name of the organization that is the official custodian of the framework.

  • +
  • description [Optional ; Not repeatable ; String]
    +A brief description of the framework.

  • +
  • goal_id [Optional ; Not repeatable ; String]
    +The identifier of the Goal that the indicator or series is associated with.

  • +
  • goal_name [Optional ; Not repeatable ; String]
    +The name (label) of the Goal that the indicator or series is associated with.
    +

  • +
  • goal_description [Optional ; Not repeatable ; String]
    +A brief description of the Goal that the indicator or series is associated with.

  • +
  • target_id [Optional ; Not repeatable ; String]
    +The identifier of the Target that the indicator or series is associated with.

  • +
  • target_name [Optional ; Not repeatable ; String]
    +The name (label) of the Target that the indicator or series is associated with.

  • +
  • target_description [Optional ; Not repeatable ; String]
    +A brief description of the Target that the indicator or series is associated with.

  • +
  • indicator_id [Optional ; Not repeatable ; String]
    +The identifier of the indicator, as provided in the framework (this is not the idno identifier).

  • +
  • indicator_name [Optional ; Not repeatable ; String]
    +The name of the indicator, as provided in the framework (which may be different from the name provided in the element name).

  • +
  • indicator_description [Optional ; Not repeatable ; String]
    +A brief description of the indicator, as provided in the framework.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to a website providing detailed information on the framework, its goals, targets, and indicators.

  • +
  • notes [Optional ; Not repeatable ; String]
    +Any additional information on the relationship between the indicator/series and the framework.

  • +
  • series_group [Optional ; Repeatable]

    +The group(s) the indicator belongs to. Groups can be created to organize indicators/series by theme, producer, or other criteria.
    +

  • +
+
"series_groups": [
+  {
+    "name": "string",
+    "description": "string",
+    "version": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the group.

  • +
  • description [Optional ; Not repeatable ; String]
    +A brief description of the group.

  • +
  • version [Optional ; Not repeatable ; String]
    +The version of the grouping.

  • +
  • uri [Optional ; Not repeatable ; String]
    +A link to a public website where information on the grouping can be obtained.

  • +
  • contacts [Optional ; Repeatable]
    +The contacts element provides the public interface for questions associated with the production of the indicator or time series.

  • +
+


+
"contacts": [
+  {
+    "name": "string",
+    "role": "string",
+    "affiliation": "string",
+    "email": "string",
+    "telephone": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The name of the person who should be contacted. Instead of the name of an individual (which would be subject to change and require frequent updates of the metadata), a title can be provided here (e.g., “data helpdesk”).
  • +
  • role [Optional ; Not repeatable ; String]
    +The specific role of the contact person mentioned in name. This will be used when multiple contacts are listed, and is intended to help users direct their questions and requests to the right contact person.
    +
  • +
  • affiliation [Optional ; Not repeatable ; String]
    +The organization or affiliation of the contact person mentioned in name.
  • +
  • email [Optional ; Not repeatable ; String]
    +The email address of the person or organization mentioned in name. Avoid using personal email accounts; the use of an anonymous email is recommended (e.g., “helpdesk@….org”).
  • +
  • telephone [Optional ; Not repeatable ; String]
    +The phone number of the person or organization mentioned in name.
  • +
  • uri [Optional ; Not repeatable ; String]
    +The URI of the agency (typically, a URL to a “contact us” web page).

  • +
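+A generic illustration, using a service account rather than a named individual (the values are placeholders): +
+
contacts = list(
+  list(name        = "Data helpdesk",
+       role        = "Support to data users",
+       affiliation = "National Statistics Office",
+       email       = "helpdesk@example.org")
+)
+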
+
+
+
+

8.2.2 Provenance

+

provenance [Optional ; Repeatable]
+Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.
+

+
"provenance": [
+    {
+        "origin_description": {
+            "harvest_date": "string",
+            "altered": true,
+            "base_url": "string",
+            "identifier": "string",
+            "date_stamp": "string",
+            "metadata_namespace": "string"
+        }
+    }
+]
+


+
    +
  • origin_description [Required ; Not repeatable]
    +The origin_description elements are used to describe when and from where metadata have been extracted or harvested.

    +
      +
    • harvest_date [Required ; Not repeatable ; String]
      +The date and time the metadata were harvested, entered in ISO 8601 format.
    • +
    • altered [Optional ; Not repeatable ; Boolean]
      +A boolean variable (“true” or “false”; “true” by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Document Description / Title Statement section) will be modified when published in a new catalog.
    • +
    • base_url [Required ; Not repeatable ; String]
      +The URL from where the metadata were harvested.
    • +
    • identifier [Optional ; Not repeatable ; String]
      +The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
    • +
    • date_stamp [Optional ; Not repeatable ; String]
      +The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
    • +
    • metadata_namespace [Optional ; Not repeatable ; String]
      +The namespace of the metadata standard or schema used to document the resource in the originating catalog.
    • +
  • +
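+A sketch showing how the provenance of metadata harvested from a (hypothetical) external catalog could be recorded: +
+
provenance = list(
+  list(origin_description = list(
+         harvest_date = "2021-10-01T10:00:00Z",
+         altered      = TRUE,
+         base_url     = "https://catalog.example.org",
+         identifier   = "SRC-IDNO-001",
+         date_stamp   = "2021-09-15"))
+)
+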
+
+
+

8.2.3 Tags

+

tags [Optional ; Repeatable]
+As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. +

+
"tags": [
+    {
+        "tag": "string",
+        "tag_group": "string"
+    }
+]
+


+
    +
  • tag [Required ; Not repeatable ; String]
    +A user-defined tag.

  • +
  • tag_group [Optional ; Not repeatable ; String]

    +A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.

  • +
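+For example, tags and tag groups could be used to feed “Topic” and “Framework” facets in a catalog (illustrative values): +
+
tags = list(
+  list(tag = "poverty",                       tag_group = "topic"),
+  list(tag = "living standards",              tag_group = "topic"),
+  list(tag = "sustainable development goals", tag_group = "framework")
+)
+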
  • lda_topics [Optional ; Not repeatable]
    +

  • +
+
"lda_topics": [
+    {
+        "model_info": [
+            {
+                "source": "string",
+                "author": "string",
+                "version": "string",
+                "model_id": "string",
+                "nb_topics": 0,
+                "description": "string",
+                "corpus": "string",
+                "uri": "string"
+            }
+        ],
+        "topic_description": [
+            {
+                "topic_id": null,
+                "topic_score": null,
+                "topic_label": "string",
+                "topic_words": [
+                    {
+                        "word": "string",
+                        "word_weight": 0
+                    }
+                ]
+            }
+        ]
+    }
+]
+


+

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

+

Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of indicators and time series, this text will be a concatenation of some metadata elements including the series’ name, definitions, keywords, concepts, and possibly others. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).

+

The lda_topics element includes the following metadata fields. An example in R was provided in Chapter 4 - Documents.

+
    +
  • model_info [Optional ; Not repeatable]
    +Information on the LDA model.

    +
      +
    • source [Optional ; Not repeatable ; String]
      +The source of the model (typically, an organization).
    • +
    • author [Optional ; Not repeatable ; String]
      +The author(s) of the model.
    • +
    • version [Optional ; Not repeatable ; String]
      +The version of the model, which could be defined by a date or a number.
    • +
    • model_id [Optional ; Not repeatable ; String]
      +The unique ID given to the model.
    • +
    • nb_topics [Optional ; Not repeatable ; Numeric]
      +The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    • +
    • description [Optional ; Not repeatable ; String]
      +A brief description of the model.
    • +
    • corpus [Optional ; Not repeatable ; String]
      +A brief description of the corpus on which the LDA model was trained.
    • +
    • uri [Optional ; Not repeatable ; String]
      +A link to a web page where additional information on the model is available.

    • +
  • +
  • topic_description [Optional ; Repeatable]
    +The topic composition extracted from selected elements of the series metadata (typically, the name, definitions, and concepts).

    +
      +
    • topic_id [Optional ; Not repeatable ; String]
      +The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    • +
    • topic_score [Optional ; Not repeatable ; Numeric]
      +The share of the topic in the metadata (%).
    • +
    • topic_label [Optional ; Not repeatable ; String]
      +The label of the topic, if any (not automatically generated by the LDA model).
    • +
    • topic_words [Optional ; Not repeatable]
      +The list of N keywords describing the topic (e.g., the top 5 words).
      +
        +
      • word [Optional ; Not repeatable ; String]
        +The word.
      • +
      • word_weight [Optional ; Not repeatable ; Numeric]
        +The weight of the word in the definition of the topic.

      • +
    • +
  • +
+
lda_topics = list(
+  
+   list(
+  
+      model_info = list(
+        list(source      = "World Bank, Development Data Group",
+             author      = "A.S.",
+             version     = "2021-06-22",
+             model_id    = "Mallet_WB_75",
+             nb_topics   = 75,
+             description = "LDA model, 75 topics, trained on Mallet",
+             corpus      = "World Bank Documents and Reports (1950-2021)",
+             uri         = "")
+      ),
+      
+      topic_description = list(
+      
+        list(topic_id    = "topic_27",
+             topic_score = 32,
+             topic_label = "Education",
+             topic_words = list(list(word = "school",      word_weight = ""),
+                                list(word = "teacher",     word_weight = ""),
+                                list(word = "student",     word_weight = ""),
+                                list(word = "education",   word_weight = ""),
+                                list(word = "grade",       word_weight = ""))),
+        
+        list(topic_id    = "topic_8",
+             topic_score = 24,
+             topic_label = "Gender",
+             topic_words = list(list(word = "women",       word_weight = ""),
+                                list(word = "gender",      word_weight = ""),
+                                list(word = "man",         word_weight = ""),
+                                list(word = "female",      word_weight = ""),
+                                list(word = "male",        word_weight = ""))),
+        
+        list(topic_id    = "topic_39",
+             topic_score = 22,
+             topic_label = "Forced displacement",
+             topic_words = list(list(word = "refugee",     word_weight = ""),
+                                list(word = "programme",   word_weight = ""),
+                                list(word = "country",     word_weight = ""),
+                                list(word = "migration",   word_weight = ""),
+                                list(word = "migrant",     word_weight = ""))),
+                                
+        list(topic_id    = "topic_40",
+             topic_score = 11,
+             topic_label = "Development policies",
+             topic_words = list(list(word = "development", word_weight = ""),
+                                list(word = "policy",      word_weight = ""),
+                                list(word = "national",    word_weight = ""),
+                                list(word = "strategy",    word_weight = ""),
+                                list(word = "activity",    word_weight = "")))
+                                
+      )
+      
+   )
+   
+)
+
    +
  • embeddings [Optional ; Repeatable]
    +In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).

    +

    The word vectors do not have to be stored in the series/indicator metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.

  • +
+


+
"embeddings": [
+    {
+        "id": "string",
+        "description": "string",
+        "date": "string",
+        "vector": null
+    }
+]
+


+

The embeddings element contains four metadata fields:

+
    +
  • id [Optional ; Not repeatable ; String]
    +A unique identifier of the word embedding model used to generate the vector.
  • +
  • description [Optional ; Not repeatable ; String]
    +A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
  • +
  • date [Optional ; Not repeatable ; String]
    +The date the model was trained (or a version date for the model).
  • +
  • vector [Required ; Not repeatable ; @@@@] +The numeric vector representing the series metadata.

  • +
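+A sketch (with a hypothetical model identifier and a truncated vector) showing how an embedding could be stored: +
+
embeddings = list(
+  list(id          = "wb_doc_embedding_2021",    # hypothetical model identifier
+       description = "200-dimension document embedding model trained on a corpus of development-related documents",
+       date        = "2021-06-22",
+       vector      = c(0.0132, -0.0917, 0.0521)) # truncated; a real vector would contain e.g. 200 values
+)
+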
+
+
+

8.2.4 Additional

+

additional [Optional ; Not repeatable]
+The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.

+
+
+
+

8.3 Generating and publishing compliant metadata - Complete example

+

We use a series from the World Bank’s World Development Indicators (WDI 2021) as an example: the series “Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)”

+
+

Note that we only show how metadata are generated and published in a NADA catalog. We do not demonstrate the (optional) publishing of the underlying data in a MongoDB database, which makes the data accessible via API and allows activation of data visualizations in the NADA catalog. This is covered in the NADA documentation.

+
+
+

8.3.1 Use of AI for metadata augmentation

+

The discoverability of indicators by keyword-based search engines can be significantly improved by the inclusion of a list of relevant keywords in the metadata. These keywords can be synonyms, or terms and concepts that are closely associated with the indicator. Identifying the most relevant related keywords requires subject matter expertise, but this can be considerably facilitated by the use of AI tools. We provide below an example of a query submitted to ChatGPT. The terms returned by the application MUST be reviewed by a subject matter expert. But having the proposed list (which can be copied and then edited in a Metadata Editor or in a script) will make the process very efficient.

+
+
+
[Image: example of a ChatGPT query requesting keywords related to the indicator “Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)”]
+
+
+


+

The returned list is as follows: Poverty; Headcount ratio; Income; Consumption; Living standards; Basic needs; Poverty line; Purchasing power parity (PPP); International poverty line; Economic development; Social inequality; Human development; Poverty reduction; Extreme poverty; Global poverty; Developing countries; Wealth distribution; Rural poverty; Urban poverty; Household income; Inclusive growth; Multidimensional poverty; Income inequality; Poverty gap; Human capital; Poverty trap; Food security; Employment; Vulnerability; Social protection; Poverty measurement; Poverty alleviation; Social exclusion; Targeted interventions; Poverty incidence; Poverty dynamics; Poverty cycle; Equity; Income distribution; Sustainable development.
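
Once reviewed by a subject matter expert, the retained terms can be converted into the repeatable keywords element of the schema. A minimal sketch in R (the resulting object can then be assigned to the keywords element of the series metadata, as in the script shown in the next section):

reviewed_terms <- c("Poverty", "Headcount ratio", "International poverty line",
+                    "Extreme poverty", "Purchasing power parity (PPP)")
+
+keywords <- lapply(reviewed_terms, function(k) list(name = k))
+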

+
+
+

8.3.2 Using R

+
# The code below generates metadata at the database level (object "wdi_database")
+# and for a time series (object "this_series"). 
+# It then publishes the metadata in a NADA catalog using the R package NADAR.
+# It also publishes related materials as "external resources".
+library(nadar)
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key(my_keys[1,1])  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+setwd("C:/my_indicators/")
+thumb = "poverty.JPG"   # Image to be used as thumbnail in the data catalog
+db_id = "WB_WDI_2021_09_15"  # The WDI database identifier
+
+# Document the indicator (Poverty headcount ratio at $1.90 a day)
+this_series = list(
+  
+  metadata_creation = list(
+    producers = list(
+      list(name = "Development Data Group",
+           abbr = "DECDG",
+           affiliation = "World Bank",
+           role = "Metadata curation")
+    ),  
+    prod_date = "2021-10-15",
+    version = "Example v 1.0"
+  ),
+  
+  series_description = list(
+    
+    idno = "SI.POV.DDAY",
+    
+    name = "Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)",
+    
+    database_id = db_id,   # To attach the database metadata to the series metadata
+    
+    measurement_unit = "% of population",
+    
+    periodicity = "Annual",
+    
+    definition_short = "Poverty headcount ratio at $1.90 a day is the percentage of the population living on less than $1.90 a day at 2011 international prices. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions.",
+    
+    definition_references = list(
+      list(source = "World Bank, Development Data Group",
+           uri = "https://databank.worldbank.org/metadataglossary/millennium-development-goals/series/SI.POV.DDAY"
+      )
+    ),
+    
+    methodology = "International comparisons of poverty estimates entail both conceptual and practical problems. Countries have different definitions of poverty, and consistent comparisons across countries can be difficult. Local poverty lines tend to have higher purchasing power in rich countries, where more generous standards are used, than in poor countries. Since World Development Report 1990, the World Bank has aimed to apply a common standard in measuring extreme poverty, anchored to what poverty means in the world's poorest countries. The welfare of people living in different countries can be measured on a common scale by adjusting for differences in the purchasing power of currencies. The commonly used $1 a day standard, measured in 1985 international prices and adjusted to local currency using purchasing power parities (PPPs), was chosen for World Development Report 1990 because it was typical of the poverty lines in low-income countries at the time. As differences in the cost of living across the world evolve, the international poverty line has to be periodically updated using new PPP price data to reflect these changes. The last change was in October 2015, when we adopted $1.90 as the international poverty line using the 2011 PPP. Prior to that, the 2008 update set the international poverty line at $1.25 using the 2005 PPP. Poverty measures based on international poverty lines attempt to hold the real value of the poverty line constant across countries, as is done when making comparisons over time. The $3.20 poverty line is derived from typical national poverty lines in countries classified as Lower Middle Income. The $5.50 poverty line is derived from typical national poverty lines in countries classified as Upper Middle Income. Early editions of World Development Indicators used PPPs from the Penn World Tables to convert values in local currency to equivalent purchasing power measured in U.S dollars. Later editions used 1993, 2005, and 2011 consumption PPP estimates produced by the World Bank. The current extreme poverty line is set at $1.90 a day in 2011 PPP terms, which represents the mean of the poverty lines found in 15 of the poorest countries ranked by per capita consumption. The new poverty line maintains the same standard for extreme poverty - the poverty line typical of the poorest countries in the world - but updates it using the latest information on the cost of living in developing countries. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions. The statistics reported here are based on consumption data or, when unavailable, on income surveys. Analysis of some 20 countries for which income and consumption expenditure data were both available from the same surveys found income to yield a higher mean than consumption but also higher inequality. When poverty measures based on consumption and income were compared, the two effects roughly cancelled each other out: there was no significant statistical difference.",
+    
+    limitation = "Despite progress in the last decade, the challenges of measuring poverty remain. The timeliness, frequency, quality, and comparability of household surveys need to increase substantially, particularly in the poorest countries. The availability and quality of poverty monitoring data remains low in small states, countries with fragile situations, and low-income countries and even some middle-income countries. The low frequency and lack of comparability of the data available in some countries create uncertainty over the magnitude of poverty reduction. Besides the frequency and timeliness of survey data, other data quality issues arise in measuring household living standards. The surveys ask detailed questions on sources of income and how it was spent, which must be carefully recorded by trained personnel. Income is generally more difficult to measure accurately, and consumption comes closer to the notion of living standards. And income can vary over time even if living standards do not. But consumption data are not always available: the latest estimates reported here use consumption data for about two-thirds of countries. However, even similar surveys may not be strictly comparable because of differences in timing or in the quality and training of enumerators. Comparisons of countries at different levels of development also pose a potential problem because of differences in the relative importance of the consumption of nonmarket goods. The local market value of all consumption in kind (including own production, particularly important in underdeveloped rural economies) should be included in total consumption expenditure but may not be. Most survey data now include valuations for consumption or income from own production, but valuation methods vary.",
+    
+    topics = list(
+      list(id = "1",
+           name = "Economics, Consumption and consumer behaviour",
+           vocabulary = "",
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(id = "2",
+           name = "Economics, Economic conditions and indicators",
+           vocabulary = "CESSDA Version 4.1",
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(id = "3",
+           name = "Economics, Economic systems and development",
+           vocabulary = "CESSDA Version 4.1",
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(id = "4",
+           name = "Social stratification and groupings, Equality, inequality and social exclusion",
+           vocabulary = "CESSDA Version 4.1",
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
+    ),
+    
+    relevance = "The World Bank Group is committed to reducing extreme poverty to 3 percent or less, globally, by 2030. Monitoring poverty is important on the global development agenda as well as on the national development agenda of many countries. The World Bank produced its first global poverty estimates for developing countries for World Development Report 1990: Poverty (World Bank 1990) using household survey data for 22 countries (Ravallion, Datt, and van de Walle 1991). Since then there has been considerable expansion in the number of countries that field household income and expenditure surveys. The World Bank's Development Research Group maintains a database that is updated annually as new survey data become available (and thus may contain more recent data or revisions) and conducts a major reassessment of progress against poverty every year. PovcalNet is an interactive computational tool that allows users to replicate these internationally comparable $1.90, $3.20 and $5.50 a day global, regional and country-level poverty estimates and to compute poverty measures for custom country groupings and for different poverty lines. The Poverty and Equity Data portal provides access to the database and user-friendly dashboards with graphs and interactive maps that visualize trends in key poverty and inequality indicators for different regions and countries. The country dashboards display trends in poverty measures based on the national poverty lines alongside the internationally comparable estimates, produced from and consistent with PovcalNet.",
+    
+    time_periods = list(list(start = "1960", end = "2020")),
+    
+    geographic_units = list(
+      list(name = "Afghanistan", code = "AFG", type = "country/economy"),
+      list(name = "Africa Eastern and Southern", code = "AFE", type = "geographic region"),
+      list(name = "Africa Western and Central", code = "AFW", type = "geographic region"),
+      list(name = "Albania", code = "ALB", type = "country/economy"),
+      list(name = "Algeria", code = "DZA", type = "country/economy"),
+      list(name = "Angola", code = "AGO", type = "country/economy"),
+      list(name = "Aruba", code = "ABW", type = "country/economy")
+      # ... and many more - In a real situation, this would be programmatically extracted from the data
+    ),
+    
+    license = list(list(name = "CC BY-4.0", uri = "https://creativecommons.org/licenses/by/4.0/")),
+    
+    api_documentation = list(
+      list(description = "See the Developer Information webpage for detailed documentation of the API",
+           uri = "https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information")
+    ),
+    
+    source = "World Bank, Development Data Group (DECDG) and Poverty and Inequality Global Practice. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments. Data for high-income economies are from the Luxembourg Income Study database. For more information and methodology, see PovcalNet website: http://iresearch.worldbank.org/PovcalNet/home.aspx",
+    
+    keywords = list(
+      list(name = "poverty rate"),
+      list(name = "poverty incidence"),
+      list(name = "global poverty line"),
+      list(name = "international poverty line"),
+      list(name = "welfare"),
+      list(name = "prosperity"),
+      list(name = "inequality"),
+      list(name = "income")
+    ),
+    
+    acronyms = list(
+      list(acronym = "PPP", expansion = "Purchasing Power Parity")
+    ),
+    
+    related_indicators = list(
+      list(code = "SI.POV.GAPS",
+           label = "Poverty gap at $1.90 a day (2011 PPP) (%)",
+           uri = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.GAPS"),
+      list(code = "SI.POV.NAHC",
+           label = "Poverty headcount ratio at national poverty lines (% of population)",
+           uri = "https://databank.worldbank.org/source/millennium-development-goals/Series/SI.POV.NAHC")
+    ),
+    
+    framework = list(
+      list(name = "Sustainable Development Goals (SDGs)",
+           description = "The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed and developing - in a global partnership.",
+           goal_id = "SDG Goal 1",
+           goal_name = "End poverty in all its forms everywhere",
+           target_id = "SDG Target 1.1",
+           target_name = "By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day",
+           indicator_id = "SDG Indicator 1.1.1",
+           indicator_name = "Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)",
+           uri = "https://sdgs.un.org/goals")
+    )
+    
+  ) 
+  
+)
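+
+# Note: the "wdi_database" object used below for the database-level metadata is assumed
+# to have been created earlier in the session; its construction is not shown in this example.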
+# Publish the metadata in NADA, with a link to the WDI website
+  # Database-level metadata
+  timeseries_database_add(idno = db_id, 
+                          published = 1, 
+                          overwrite = "yes", 
+                          metadata = wdi_database)
+  
+  # Indicator-level metadata
+  timeseries_add(
+    idno = this_series$series_description$idno, 
+    repositoryid = "central", 
+    published = 1, 
+    overwrite = "yes", 
+    metadata = this_series, 
+    thumbnail = thumb
+  )
+# Add a link to the WDI website as an external resource
+  
+external_resources_add(
+  title = "World Development Indicators website",
+  idno = this_series$series_description$idno,
+  dctype = "web",
+  file_path = "https://datatopics.worldbank.org/world-development-indicators/",
+  overwrite = "yes"
+)
+

After uploading the above metadata, and activating some visualization widgets, the result in NADA will be as follows (not all metadata displayed here; see https://nada-demo.ihsn.org/index.php/catalog/study/SI.POV.DDAY for the full view):

+


+
[Screenshots of the series page as displayed in the NADA catalog]

+
+
+

8.3.3 Using Python

+

The equivalent in Python of the R script provided above is as follows.

+
# Same example in Python @@@@@@@@
diff --git a/chapter09.html b/chapter09.html new file mode 100644 index 0000000..e2ba416 --- /dev/null +++ b/chapter09.html @@ -0,0 +1,2931 @@

Chapter 9 Statistical tables

+
+ +
+
+

9.1 Introduction

+

A statistical table (cross tabulation or contingency table) is a summary presentation of data. The OECD Glossary of Statistical Terms defines it as “observation data gained by a purposeful aggregation of statistical microdata conforming to statistical methodology [organized in] groups or aggregates, such as counts, means, or frequencies.”

+

Tables are produced as an array of rows and columns that display numeric aggregates in a clearly labeled fashion. They may have a complex structure and become quite elaborate. They are typically found in publications such as statistical yearbooks, census and survey reports, research papers, or published on-line.

+

Statistical tables can be understood by a broad audience. In some cases, they may be the only publicly-available output of a data collection activity. Even when other output is available (such as microdata, dashboards, or databases accessible via user interfaces or APIs), statistical tables remain an important component of data dissemination. It is thus important to make tables as discoverable as possible. The schema described in this chapter was designed to structure and foster the comprehensiveness of information on tables by rendering the pertinent metadata in a structured, machine-readable format. It is intended to improve data discoverability; it is not intended to store the information needed to programmatically re-create tables.

+

The schema description is available at http://dev.ihsn.org/nada/api-documentation/catalog-admin/index.html#tag/Tables

+
+
+

9.2 Anatomy of a table

+

The figure below, adapted from LabWrite Resources, provides an illustration of what statistical tables typically look like. The main parts of a table are highlighted. They provide a content structure for the metadata schema we describe in this chapter.

+
[Figure: the main parts of a statistical table (adapted from LabWrite Resources)]
+

Table number and title: Every table must have a title, and should have a number. Tables in yearbooks, reports, and papers are usually numbered in the order in which they are referred to in the document. They can be numbered sequentially (Table 1, Table 2, and so on), by chapter (Table 1.1, Table 1.2, Table 2.1, …), or based on another reference system. The table number typically precedes the table title. The title provides a description of the contents of the table. It should be concise and include the key elements shown in the table.

+

Column spanner, column heads, and stub head: The column headings (and sub-headings) identify what data are listed in the table in a vertical arrangement. A column heading placed above the leftmost column is often referred to as the stubhead, and the column is the stub column. A heading that sits above two or more columns to indicate a certain grouping is referred to as a column spanner.

+

Stubs: The horizontal headings and sub-headings of the rows are called row captions. Together, they form the stub.

+

Table body: The actual data (values) in a table (containing for example percentages, means, or counts of certain variables) form the table body.

+

Table spanner: A table spanner is located in the body of the table in order to divide the data in a table without changing the columns. Spanners go the entire length of the table.

+

Table notes: Table notes are used to provide information that is not self-explanatory (e.g., to provide the expanded form of acronyms used in row or column captions).

+

Table source: The source identifies the dataset(s) or database(s) that contain the data used to generate the table. This can for example be a survey or a census dataset.

+
+
+

9.3 Schema description

+

The table schema contains six blocks of elements. The first block of three elements (repositoryid, published, and overwrite) does not describe the table; it is used by the NADA cataloguing application to determine where and how the table metadata will be published in the catalog. The second block, metadata_information, contains “metadata on the metadata” and is used mainly for archiving purposes. The third block, table_description, contains the elements used to describe the table and its production process. A fourth block, provenance, is used to document the origin of metadata that may have been harvested from other catalogs. The block tags is used to add information (in the form of words or short phrases) that will be useful to create facets in the catalog user interface. Last, an empty block additional is provided as a container for additional metadata elements that users may want to create.

+


+
{
+  "repositoryid": "string",
+  "published": 0,
+  "overwrite": "no",
+  "metadata_information": {},
+  "table_description": {},
+  "provenance": [],
+  "tags": [],
+  "lda_topics": [],
+  "embeddings": [],
+  "additional": { }
+}
+


+
+

9.3.1 Cataloguing parameters

+

The following elements are used by the NADA application API (see the NADA documentation for more information):

+
    +
  • repositoryid: A NADA catalog can be composed of multiple collections. The repositoryid element identifies in which collection the table will be published. This collection must have been previously created in the catalog. By default, the table will be published in the central catalog (i.e. in no particular collection).
    +

  • +
  • published: The NADA catalog allows tables to be published (in which case they will be visible to users of the catalog) or unpublished (in which case they will only be visible by administrators). The default value is 0 (unpublished). Code 1 is used to set the status to “published”.

  • +
  • overwrite: This element defines what action will be taken when a command is issued to add the table to a catalog and a table with the same identifier (element idno) is already in the catalog. By default, the command will not overwrite the existing table (the default value of overwrite is “no”). Set this parameter to “yes” to allow the existing table to be overwritten in the catalog.

  • +
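+For example, to publish a table in the central catalog and allow an existing entry with the same identifier to be overwritten: +
+
my_table = list(
+  repositoryid = "central",
+  published    = 1,
+  overwrite    = "yes"
+  # ... metadata_information and table_description blocks follow
+)
+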
+
+
+

9.3.2 Metadata information

+

metadata_information [Optional, Not Repeatable]
+The metadata_information block is used to document the table metadata (not the table itself). It provides information on the process of generating the table metadata. This block is optional. The information it contains is useful to catalog administrators, not to the public. It is however recommended to enter at least the identification of the metadata producer, her/his affiliation, and the date the metadata were created. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so the metadata produced by one organization can be found in other data centers (complying with standards and schemas is precisely intended to facilitate the interoperability of catalogs and automated information sharing). +

+
"metadata_information": {
+  "idno": "string",
+  "title": "string",
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "production_date": "string",
+  "version": "string"
+}
+


+
    +
  • idno [Optional, Not Repeatable, String]
    +A unique identifier for the metadata document (the metadata document is the JSON file containing the table metadata). This is different from the table unique identifier (see section title_statement below), although the same identifier can be used, and it is good practice to generate identifiers that maintain an easy connection between the metadata idno and the table idno. For example, if the unique identifier of the table is “TBL_0001”, the idno in the metadata_information could be “META_TBL_0001”.

  • +
  • title [Optional, Not Repeatable, String]
    +The title of the metadata document (not necessarily the title of the table).

  • +
  • producers [Optional, Repeatable]
    +This refers to the producer(s) of the table metadata, not to the producer(s) of the table. This could for example be the data curator in a data center. Four elements can be used to provide information on the metadata producer(s):

    +
      +
    • name [Optional, Not Repeatable, String]
      +The name of the metadata producer/curator. An alternative to entering the name of the curator (e.g. for privacy protection purpose) is to enter the curator identifier (see the element abbr below)
    • +
    • abbr [Optional, Not Repeatable, String]
      +This element can be used to provide an identifier of the metadata producer/curator mentioned in name.
    • +
    • affiliation [Optional, Not Repeatable, String]
      +The affiliation of the metadata producer/curator mentioned in name.
    • +
    • role [Optional, Not Repeatable, String]
      +The specific role of the metadata producer/curator mentioned in name (applicable when more than one person was involved in the production of the metadata).

    • +
  • +
  • production_date [Optional, Not Repeatable, String]
    +The date the metadata (not the table) was produced. The date will preferably be entered in ISO 8601 format (YYYY-MM-DD).

  • +
  • version [Optional, Not Repeatable, String]
    +The version of the metadata (not the version of the table).

  • +
+
```r
my_table = list(
  # ... ,
  metadata_information = list(
    idno = "META_TBL_POP_PC2001_02-01", 
    producers = list(
      list(name = "John Doe",
           affiliation = "National Data Center of Popstan")
    ),
    production_date = "2020-12-27",
    version = "version 1.0"
  ),
  # ... 
)
```

### 9.3.3 Table description


**table_description** *[Required, Not Repeatable]*
This section contains the metadata elements that describe the table itself. Not all elements will be needed to fully document a table, but efforts should be made to provide as much and as detailed information as possible, as richer metadata will make the table more discoverable.
"table_description": {
+  "title_statement": {},
+  "identifiers": [],
+  "authoring_entity": [],
+  "contributors": [],
+  "publisher": [],
+  "date_created": "string",
+  "date_published": "string",
+  "date_modified": "string",
+  "version": "string",
+  "description": "string",
+  "table_columns": [],
+  "table_rows": [],
+  "table_footnotes": [],
+  "table_series": [],
+  "statistics": [],
+  "unit_observation": [],
+  "data_sources": [],
+  "time_periods": [],
+  "universe": [],
+  "ref_country": [],
+  "geographic_units": [],
+  "geographic_granularity": "string",
+  "bbox": [],
+  "languages": [],
+  "links": [],
+  "api_documentation": [],
+  "publications": [],
+  "keywords": [],
+  "themes": [],
+  "topics": [],
+  "disciplines": [],
+  "definitions": [],
+  "classifications": [],
+  "rights": "string",
+  "license": [],
+  "citation": "string",
+  "confidentiality": "string",
+  "sdc": "string",
+  "contacts": [],
+  "notes": [],
+  "relations": []
+  }
+


- **title_statement** *[Required, Not Repeatable]*
"title_statement": {
+  "idno": "string",
+  "table_number": "string",
+  "title": "string",
+  "sub_title": "string",
+  "alternate_title": "string",
+  "translated_title": "string"
+}
+


- **idno** *[Required, Not Repeatable, String]*
A unique identifier for the table. Do not include spaces in the `idno`. This identifier must be unique to the catalog in which the table will be published. Some organizations have their own system to assign unique identifiers to tables. Ideally, an identifier that guarantees global uniqueness will be used, such as a Digital Object Identifier (DOI) or an ISBN number. Note that a table may have more than one identifier. In such a case, the element `idno` (as a non-repeatable element) will contain the main identifier (selected as the "reference" one by the catalog administrator). The other identifiers will be provided in the element `identifiers` (see below).

- **table_number** *[Optional, Not Repeatable, String]*
The table number. The table number will usually begin with the word "Table" followed by a numeric identifier, such as "Table 1" or "Table 2.1". Different publications may use different ways to reference a table. This is particularly the case for publications that are part of a standard survey program and have well-defined table templates. The following are different ways to number a table:

| Type | Description |
| --- | --- |
| Sequential | A sequential number given to each table produced and appearing within the publication (e.g., Table 1, Table 2, ..., Table n). |
| Thematic | A numbering scheme based on the theme and a sequential number. |
| Chapter | Tables are numbered according to the chapter and then a sequential reference within that chapter, such as Table 1.1 or Table 3.5. |
| Annex | Tables in an annex will usually be given a letter referring to the annex and a sequential number, such as Table A.1 or Table B.3. |

Note: a table number is usually set apart from the title with a colon. The word "Table" should never be abbreviated.
- **title** *[Required, Not Repeatable, String]*
The title of the table. The title provides a brief description of the content of the table. It should be concise and include the key elements shown in the table. There are varying styles for writing a table title; a consistent style should be applied to all tables published in a catalog.

- **sub_title** *[Optional, Not Repeatable, String]*
A subtitle can provide further descriptive or explanatory content to the table.

- **alternate_title** *[Optional, Not Repeatable, String]*
An alternate title for the table.

- **translated_title** *[Optional, Not Repeatable, String]*
A translation of the title.
```r
my_table = list(
  # ... 
  table_description = list(
        title_statement = list(
           idno         = "EXAMPLE_TBL_001",
           table_number = "Table 1.0",
           title        = "Resident population by age group, sex, and area of residence, 2020",
           sub_title    = "District of X, as of June 30",
           translated_title = "Population résidente par groupe d'âge, sexe et zone de résidence, 2020 (district X, au 30 juin)"
        ),
        # ...
  )
)
```


+
    +
  • identifiers [Optional ; Repeatable]
    This element is used to enter identifiers of the table other than the catalog identifier entered in the title_statement (idno). It can for example be a Digital Object Identifier (DOI). The identifier entered in the title_statement can be repeated here (the title_statement does not provide a type parameter; if a DOI or other standard reference identifier is used as idno, it is recommended to repeat it here with the identification of its type).
  • +
+
"identifiers": [
+  {
+    "type": "string",
+    "identifier": "string"
+  }
+]
+


+
    +
  • type [Optional, Not Repeatable, String]
    +The type of unique ID, e.g. “DOI”.
  • +
  • identifier [Required, Not Repeatable, String]
    +The identifier itself.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+        # ... ,
+        identifiers = list(
+          type  = "DOI",
+          identifier = "XXX.XXX.XXXX"
+        ),
+  # ...
+  )
+)
+


+
    +
  • authoring_entity [Optional, Not Repeatable]
    +The authoring entity identifies the person(s) or organization(s) responsible for the production of the table. An authoring entity is identified by its name, affiliation, abbreviation, URI, and author’s identifiers (if any). +
  • +
+
"authoring_entity": [
+  {
+  "name": "string",
+  "affiliation": "string",
+  "abbreviation": "string",
+  "uri": "string",
+    "author_id": [
+      {
+    "type": null,
+    "id": null
+      }
+    ]
+  }
+]
+


+
    +
  • name [Optional, Not Repeatable, String]
    +The name of person(s) or organization responsible for the production and content of the table.
  • +
  • affiliation [Optional, Not Repeatable, String]
    +The affiliation of the person(s) or organization(s) mentioned in name.
  • +
  • abbreviation [Optional, Not Repeatable, String]
    +The abbreviation (acronym) of the organization mentioned in name.
  • +
  • uri [Optional, Not Repeatable, String]
    +The URI can be a link to the website, or the email address, of the authoring entity mentioned in name.
  • +
  • author_id [Optional ; Repeatable]
    +The author identifier in a registry of academic researchers such as the Open Researcher and Contributor ID (ORCID).
    +
      +
    • type [Optional ; Not repeatable ; String]
      +The type of identifier, i.e. the identification of the registry that assigned the author’s identifier, e.g. “ORCID”.
    • +
    • id [Optional ; Not repeatable ; String]
      +The identifier of the author in the registry mentioned in type.
    • +
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+        # ... ,
+        authoring_entity = list(
+          name         = "John Doe",
+          affiliation  = "National Research Center, Popstan",
+          abbreviation = "NRC",
+          uri          = "www. ...",
+          author_id = list(
+            list(type = "ORCID", id = "XYZ123")
+          )
+        ),  
+        # ...
+  )
+)
+


+
    +
  • contributors [Optional, Repeatable]
    +This set of elements identifies the person(s) and/or organization(s), other than the authoring entity, who contributed to the production of the table.
    +
  • +
+
"contributors": [
+  {
+    "name": "string",
+    "affiliation": "string",
+    "abbreviation": "string",
+    "role": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional, Not Repeatable, String]
    +The name of the contributor (person or organization).
  • +
  • affiliation [Optional, Not Repeatable, String]
    +The affiliation of the contributor mentioned in name. This could be a government agency, a university or a department in a university, etc.
  • +
  • abbreviation [Optional, Not Repeatable, String]
    +The abbreviation for the institution which has been listed as the affiliation of the contributor.
    +
  • +
  • role [Optional, Not Repeatable, String]
    The specific role of the contributor mentioned in name. This could for example be "Research assistant", "Technical specialist", "Programmer", or "Reviewer".
  • +
  • uri [Optional, Not Repeatable, String]
    +A URI (link to a website, or email address) for the contributor mentioned in name.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+        # ... ,
+        contributors = list(
+          name         = "John Doe",
+          affiliation  = "National Research Center",
+          abbreviation = "NRC",
+          role         = "Research assistant; Stata programming",
+          uri          = "www. ..."
+        ),  
+        # ...
+  )
+)
+


+
    +
  • publisher [Optional, Not repeatable]
    +The entity responsible for publishing the table. +
  • +
+
"publisher": [
+  {
+    "name": "string",
+    "affiliation": "string",
+    "abbreviation": "string",
+    "role": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional, Not Repeatable, String]
    +The name of the publisher (person or organization).
  • +
  • affiliation [Optional, Not Repeatable, String]
    +The affiliation of the publisher. This could be a government agency, a university or a department in a university, etc.
  • +
  • abbreviation [Optional, Not Repeatable, String]
    +The abbreviation for the institution which has been listed as the affiliation of the publisher.
  • +
  • role [Optional, Not Repeatable, String]
    +The specific role of the publisher (this element is unlikely to be used as the role is obvious).
  • +
  • uri [Optional, Not Repeatable, String]
    +A URI (link to a website, or email address) of the publisher.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+        # ... ,
+        publisher = list(
+          name = "National Statistics Office, Publishing Department",
+          affiliation = "Ministry of Planning, National Statistics Office",
+          abbreviation = "NSO",
+          uri = "www. ..."
+        ),  
+        # ...
+  )
+)
+


+
    +
  • date_created [Optional, Not Repeatable, String]
    +The date the table was created. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the table is created refers to the date that the output was produced and considered ready for publication.

  • +
  • date_published [Optional, Not Repeatable, String]
    +The date the table was published. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). If the table is contained in a document (report, paper, book, etc.), the date the table is published is associated with the publication date of that document. If the table is found in a statistics yearbook for example, then the publication date will be the date the yearbook was published.

  • +
  • date_modified [Optional, Not Repeatable, String]
    +The date the table was last modified. It is recommended to enter the date in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). Modifications, revisions, or re-publications of the table are recorded in this element.

  • +
  • version [Optional, Not Repeatable, String]
    +The version of the table refers to the published version of the table. If for some reason, data in a published table are revised, then the version of the table is captured in this element.

  • +
  • description [Optional, Not Repeatable, String]
    +A brief “narrative” description of the table. The description can contain information on the content, purpose, production process, or other relevant information.

```r
my_table = list(
  # ... ,
  table_description = list(
        # ... ,
        date_created = "2020-06-15",
        date_published = "2020-10-30",
        version = "Version 1.0",
        description = "The table is part of a series of tables extracted from the Population Census 2020 dataset. It presents counts of resident population by type of disability, sex, and age group, by province and at the national level. The data were collected in compliance with questions from the Washington Group.",
        # ...
  )
)
```


  • +
  • table_columns [Optional, Repeatable]
    The columns description is composed of the column spanners and the column heads. Column spanners group the column heads together in a logical fashion to present the data to the user. Not all columns presented in a table will have a column spanner. The column spanners can become quite complicated; when a table is documented, the information found in the column spanners and heads can be merged and edited. What matters is not to document the exact structure of the table, but to ensure that the text of the spanners and heads is included in the metadata, as this text will be used by search engines to find tables in data catalogs.

  • +
+
"table_columns": [
+  {
+    "label": "string",
+    "var_name": "string",
+    "dataset": "string"
+  }
+]
+


+
    +
  • label [Required, Not Repeatable, String]
    The labels of the table columns (or column captions) are vital for discoverability. The column labels will include both column spanners and column headers. Column spanners are captions that join various column headers together.
  • +
  • var_name *[Optional, Not Repeatable, String*]
    +This refers to the name of the variables found in the dataset (typically microdata) used to produce the table. The objective of this optional field is to help establish a link between the source dataset and the table, to foster reproducibility.
    +
  • +
  • dataset [Optional, Not Repeatable, String]
    +This refers to the dataset (typically microdata) used to produce the table. If the dataset is available in a catalog and has a unique identifier (DOI or other), this identifier can be entered here. Alternatively, the title of the dataset or a permanent URI can be provided.

  • +
+

The column captions of the following table can be documented in the following manner:

+

+

+
```r
my_table = list(
  # ... ,
  table_description = list(
        # ... ,
        table_columns = list(
          
          list(label = "Area of residence: National (total)", 
               var_name = "urbrur", dataset = "pop_census_2020_v01"),
          
          list(label = "Area of residence: Urban", 
               var_name = "urbrur", dataset = "pop_census_2020_v01"),
          
          list(label = "Area of residence: Rural", 
               var_name = "urbrur", dataset = "pop_census_2020_v01"),
          
          list(label = "Sex: total", 
               var_name = "sex", dataset = "pop_census_2020_v01"),
          
          list(label = "Sex: male", 
               var_name = "sex", dataset = "pop_census_2020_v01"),
          
          list(label = "Sex: female", 
               var_name = "sex", dataset = "pop_census_2020_v01")
          
        ),  
        # ...
  )
)
```
+


+

Or, in a more concise but also valid version:

+
my_table = list(
+  # ... ,
+  table_description = list(
+        # ... ,
+        table_columns = list(
+          
+          list(label = "Area of residence: national (total) / urban / rural",
+               var_name = "urbrur", dataset = "pop_census_2020_v01"),
+          
+          list(label = "Sex: total / male / female", 
+               var_name = "sex", dataset = "pop_census_2020_v01")
+          
+        ),  
+        # ...
+  )
+)
+


+
    +
  • table_rows [Optional, Repeatable]
    Like the column spanners and column heads, the table_rows section is composed of the stub head and stubs (row captions). The stubs are the captions of the rows of data, and the stub head is the label that groups the rows together in a logical fashion. As for table_columns, the information found in the stubs can be merged and edited to be optimized for clarity and discoverability.
  • +
+
"table_rows": [
+  {
+    "label": "string",
+    "var_name": "string",
+    "dataset": "string"
+  }
+]
+


+
    +
  • label [Required, Not Repeatable, String]
    +As with the column labels, the content in this row_label is designed to include the stub head, stubs and any captions included.
  • +
  • var_name [Optional, Not Repeatable, String]
    +As with the column variables, this optional element is reserved to identify those variables found in the source dataset that are associated with the row of data.
  • +
  • dataset [Optional, Not Repeatable, String]
    +This refers to the dataset (typically microdata) used to produce the table. If the dataset is available in a catalog and has a unique identifier (DOI or other), this identifier can be entered here. Alternatively, the title of the dataset or a permanent URI can be provided. Note also that the schema provides a data_sources element (see below) to describe in more detail the sources of data. The content of the dataset element must be compatible with the information provided in that other element.

  • +
+

Example using the same table as for table_columns:

+
my_table = list(
+  # ... ,
+  
+  table_description = list(
+    # ... ,
+    
+    table_rows = list(
+      
+      list(label    = "Age group; 0-4 years",
+           var_name = "age", dataset  = "pop_census_2020_v01"),
+      
+      list(label    = "Age group; 5-9 years",
+           var_name = "age",
+           dataset  = "pop_census_2020_v01"),
+      
+      list(label    = "Age group; 10-14 years",
+           var_name = "age", 
+           dataset  = "pop_census_2020_v01"),
+      
+      list(label = "Age group; 15-19 years",
+           var_name = "age", 
+           dataset  = "pop_census_2020_v01")
+      
+    ),  
+    # ...
+  )
+)
+

The same information can be provided in a more concise version as follows:

+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    table_rows = list(
+      list(label = "Age group; 0-4 years, 5-9 years, 10-14 years, 15-19 years",
+           var_name = "age", 
+           dataset = "pop_census_2020_v01")
+    ),  
+    # ...
+  )
+)
+


+
    +
  • table_footnotes [Optional, Repeatable]
    +Footnotes provide additional clarity. They may for example be used to assure that the user is aware of conditions and exceptions that may apply to a table. Footnotes may include statements of missing data, imputation of data, or other content that is not included in the body of the publication. +
  • +
+
"table_footnotes": [
+  {
+    "number": "string",
+    "text": "string"
+  }
+]
+


+
    +
  • number [Optional, Not Repeatable, String]
    +A sequential number is usually given to the footnotes, which starts with 1 for each table.
  • +
  • text [Required, Not Repeatable, String]
    +The text of the footnote.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    table_footnotes = list(
+      
+      list(number = "1", 
+           text = "Data refer to the resident population only."),
+      
+      list(number = "2", 
+           text = "Figures for the district of X have been imputed.")
+      
+    ),  
+    
+    # ...
+  )
+)
+


+
    +
  • table_series [Optional, Repeatable]
    Tables may be organized into series, typically by theme.
  • +
+
"table_series": [
+  {
+    "name": "string",
+    "maintainer": "string",
+    "uri": "string",
+    "description": "string"
+  }
+]
+


+
    +
  • name [Optional, Not Repeatable, String]
    +The name (label) of the series.
  • +
  • maintainer [Optional, Not Repeatable, String]
    The person or organization in charge of maintaining the series. This will often be the same person/organization that produces and publishes the table. This optional element will often be ignored.
  • +
  • uri [Optional, Not Repeatable, String]
    +A URI to the series information. This optional element will often be ignored.
  • +
  • description [Optional, Not Repeatable, String]
    +A description of the series.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    table_series = list(
+      
+      list(name = "Population Census - Age distribution", 
+           description = "Series 1 - Tables on demographic composition of the population")
+      
+    ),  
+    
+    # ...
+  )
+)
+


+
    +
  • statistics [Optional, Repeatable]
    +The table metadata will not contain data. What the statistics element refers to is the type of statistics included in the table. Some tables may only contain counts, such as a table of population by age group and sex (which shows counts of persons; other tables could be counts of households, facilities, or any other observation unit). But statistical tables can contain many other types of summary statistics. This element is used to list these types of statistics. +
  • +
+
"statistics": [
+  {
+    "value": "string"
+  }
+]
+


+
    +
  • value [Required, Not Repeatable, String]
  • +
+

The use of a controlled vocabulary is recommended. This list could contain (but does not have to be limited to):

- Count (frequencies)
- Number of missing values
- Mean (average)
- Median
- Mode
- Minimum value
- Maximum value
- Range
- Standard deviation
- Variance
- Confidence interval (95%) - Lower limit
- Confidence interval (95%) - Upper limit
- Standard error
- Sum
- Inter-quartile Range (IQR)
- Percentile (possibly with specification, e.g., "10th percentile")
- Mean Absolute Deviation
- Mean Absolute Deviation from the Median (MADM)
- Coefficient of Variation (COV)
- Coefficient of Dispersion (COD)
- Skewness
- Kurtosis
- Entropy
- Regression coefficient
- R-squared
- Adjusted R-squared
- Z-score
- Accuracy
- Precision
- Mean squared logarithmic error (MSLE)

Example in R for a table showing the distribution of the population by age group and sex, and the mean age by sex:

+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    statistics = list(
+      list(value = "count"),
+      list(value = "mean")
+    ),  
+    # ...
+  )
+)
+


+
    +
  • unit_observation [Optional, Repeatable]
    +The element provides information on the unit(s) of observations that correspond to the values shown in the table. +
  • +
+
"unit_observation": [
+  {
+    "value": "string"
+  }
+]
+


+
    +
  • value [Required, Not repeatable, String]
    The value is not a numeric value; it is the label (description) of the observation unit, e.g., "individual" or "person", "household", "dwelling", "enterprise", "country", etc.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    unit_observation = list(
+      list(value = "individual")
+    ),  
+    # ...
+  )
+)
+


+
    +
  • data_sources [Optional, Repeatable]

    +The data sources are often cited in the footnote section of a table. The name, source_id, and link elements are optional, but at least one of them must be provided. +
  • +
+
"data_sources": [
+  {
+    "name": "string",
+    "abbreviation": "string",
+    "source_id": "string",
+    "note": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Optional, Not repeatable, String]
    +The name (title) of the data source. For example, a table data may be extracted from the “Population Census 2020”.
    +
  • +
  • abbreviation [Optional, Not repeatable, String]
    +The abbreviation (acronym) of the data source.
  • +
  • source_id [Optional, Not repeatable, String]
    +A unique identifier for the source, such as a Digital Object Identifier (DOI).
  • +
  • note [Optional, Not repeatable, String]
    +A note that describes how the source was used, possibly mentioning issues in the use of the source.
  • +
  • uri [Optional, Not repeatable, String]
    +A link (URL) to the source dataset.
  • +
+
```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    data_sources = list(
      list(name = "Population and Housing Census 2020",
           abbreviation = "PHC 2020",
           source_id = "ABC_PHC_2020_PUF"
      )
    ),  
    # ...
  )
)
```


+
    +
  • time_periods [Optional, Repeatable]
    The time periods consist of a list of periods (ranges of years / quarters / months / days) that the data relate to, preferably entered in ISO 8601 format (YYYY, or YYYY-MM, or YYYY-MM-DD). If the data are by quarter, convert them into ISO 8601 format (e.g., the first quarter of 2020 would be "from 2020-01 to 2020-03"). This is a repeatable field. If the time periods are for example 1990, 2000 to 2004, and 2014 to June 2019, do not enter the time period as a single range 1990-2019, as this would include irrelevant periods; it should be entered as three separate ranges, as in the example below. For data that relate to a specific date (for example, the population of a country as of the census day), enter the date in both the from and to fields.
  • +
+
"time_periods": [
+  {
+    "from": "string",
+    "to": "string"
+  }
+]
+


+
    +
  • from [Required, Not repeatable, String]
    +The start date of the time period covered by the table, preferably entered in ISO 8601 format (YYYY, or YYYY-MM, or YYYY-MM-DD).
    +
  • +
  • to [Required, Not repeatable, String]
    +The end date of the time period covered by the table, preferably entered in ISO 8601 format (YYYY, or YYYY-MM, or YYYY-MM-DD).
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    time_periods = list(
+      list(from = "1990", to = "1990"),
+      list(from = "2000", to = "2004"),
+      list(from = "2014", to = "2019-06")
+    ),  
+    
+    # ...
+  )
+)
+


+
    +
  • universe [Optional, Repeatable]
    +The universe of a table refers to the population (or respondents) covered in the data. It does not have to be a population of individuals; it can for example be a population of households, facilities, firms, groups of persons, or even objects. The description of the universe should clearly inform the data users of inclusions and exclusions that they may not expect. +
  • +
+
"universe": [
+  {
+    "value": "string"
+  }
+]
+


+
    +
  • value [Required, Not repeatable, String]
    +A textual description of the universe covered by the data.
  • +
+
```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    universe = list(
      list(value = "Resident male population aged 0 to 6 years; this excludes visitors and people present in the country under a diplomatic status. Nomadic and homeless populations are included.")
    ),  
    # ...
  )
)
```


+
    +
  • ref_country [Optional, Repeatable]
    This element is used to document the list of countries for which data are in the table. It serves to ensure that the country name and code are easily discoverable, and it contributes to a virtual national catalog. If the table only refers to part of a country (for example a city), the ref_country field should still be filled. Another element, geographic_units (see below), is provided to capture more detailed information on the table's geographic coverage.
  • +
+
"ref_country": [
+  {
+    "name": "string",
+    "code": "string"
+  }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    +The name of a country for which data are in the table.

  • +
  • code [Required, Not repeatable, String]
    +The code of the country mentioned in name, preferably an ISO 3166 country code.

  • +
  • geographic_units [Optional, Repeatable]
    +An itemized list of geographic areas covered by the data in the table, other than the country/countries that must be entered in ref_country. +

  • +
+
"geographic_units": [
+  {
+    "name": "string",
+    "code": "string",
+    "type": "string"
+  }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    +The name of the geographic unit.
  • +
  • code [Optional, Not repeatable, String]
    +The code of the geographic unit mentioned in name.
  • +
  • type [Optional, Not repeatable, String]
    +The type of geographic unit mentioned in name (e.g., “State”, “Province”, “Town”, “Region”, etc.)
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    ref_country = list(
+      list(name = "Malawi", code = "MWI")
+    ),  
+    
+    geographic_units = list(
+      list(name = "Northern", type = "region"),
+      list(name = "Central",  type = "region"),
+      list(name = "Southern", type = "region"),
+      list(name = "Lilongwe", type = "town"),
+      list(name = "Mzuzu",    type = "town"),
+      list(name = "Blantyre", type = "town")
+    ),
+    
+    # ...
+  )
+)
+


  • geographic_granularity [Optional, Not repeatable, String]
    A description of the geographic levels for which data are presented in the table. This is not a list of specific geographic areas, but a list of the administrative level(s) that correspond to these geographic areas.

    Example for a table showing the population of a country by State, district, and sub-district (plus a national total):

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    ref_country = list(
      list(name = "India", code = "IND")
    ),  

    geographic_granularity = "national, state (admin 1), district (admin 2), sub-district (admin 3)"

    # ...
  )
)
```


  • +
  • bbox [Optional ; Repeatable]
    +Bounding boxes are typically used for geographic datasets to indicate the geographic coverage of the data, but can be provided for tables as well, although this will rarely be done. A geographic bounding box defines a rectangular geographic area. +

  • +
+
"bbox": [
+  {
+    "west": "string",
+    "east": "string",
+    "south": "string",
+    "north": "string"
+  }
+]
+


+
    +
  • west [Required ; Not repeatable ; String]
    +Western geographic parameter of the bounding box.

  • +
  • east [Required ; Not repeatable ; String]
    +Eastern geographic parameter of the bounding box.

  • +
  • south [Required ; Not repeatable ; String]
    +Southern geographic parameter of the bounding box.

  • +
  • north [Required ; Not repeatable ; String]
    +Northern geographic parameter of the bounding box.
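The schema listing does not include an example for `bbox`. The sketch below is a minimal, hypothetical illustration in R; the coordinates are illustrative only and roughly approximate a bounding box around Malawi.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    
    bbox = list(
      list(west  = "32.67",    # approximate, illustrative coordinates
           east  = "35.92",
           south = "-17.13",
           north = "-9.37")
    )
    
    # ...
  )
)
```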

  • +
  • languages [Optional, Repeatable]
    +Most tables will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed. +

  • +
+
"languages": [
+  {
+    "name": "string",
+    "code": "string"
+  }
+]
+


+ +
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    languages = list(
+      list(name = "English", code = "EN"),
+      list(name = "French",  code = "FR")
+    ),  
+    
+    # ...
+  )
+)
+


+
    +
  • links [Optional, Repeatable]
    +A list of associated links related to the table. +
  • +
+
"links": [
+  {
+    "uri": "string",
+    "description": "string"
+  }
+]
+


+
    +
  • uri [Required, Not repeatable, String]
    +The URI to an external resource.
  • +
  • description [Optional, Not repeatable, String]
    +A brief description of the resource.

  • +
+

Example for a table extracted from the Gambia Demographic and Health Survey 2019/2020 Report, the links could be the following:

+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    links = list(
+      
+      list(uri = "https://dhsprogram.com/pubs/pdf/FR369/FR369.pdf", 
+           description = "The Gambia, Demographic and Health Survey 2019/2020 Report"),
+      
+      list(uri = "https://dhsprogram.com/data/available-datasets.cfm", 
+           description = "DHS microdata for The Gambia")
+    
+    ),  
+    
+    # ...
+  )
+)
+


+
    +
  • api_documentation [Optional ; Repeatable]
    Increasingly, data are made accessible via Application Programming Interfaces (APIs). If the table (or the data behind it) is accessible via an API, that API should be documented.
  • +
+
"api_documentation": [
+  {
+    "description": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • description [Optional ; Not repeatable ; String]
    +This element will not contain the API documentation itself, but information on what documentation is available.

  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL of the API documentation.
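A minimal sketch of the `api_documentation` element (not one of the Guide's examples); the URL shown is hypothetical.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    
    api_documentation = list(
      list(description = "Documentation of the API that provides access to the underlying data",
           uri = "https://example.org/api/docs")   # hypothetical URL
    )
    
    # ...
  )
)
```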

  • +
  • publications [Optional, Repeatable]
    +This element identifies the publication(s) where the table is published. This could for example be a Statistics Yearbook, a report, a paper, etc.
    +

  • +
+
"publications": [
+  {
+    "title": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • title [Required, Not repeatable, String]
    +The title of the publication (including the producer and the year).
  • +
  • uri [Optional, Not repeatable, String]
    +A link to the publication.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... , 
+    
+    publications = list(
+      list(title = "United Nations Statistical Yearbook, Fifty-second issue, May 2023", 
+           uri   = "https://www.un-ilibrary.org/content/books/9789210557566")
+    ), 
+    
+    # ...
+  )
+)
+


+
    +
  • keywords [Optional ; Repeatable]
  • +
+


+
"keywords": [
+  {
+    "name": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+

A list of keywords that provide information on the core content of the table. Keywords provide a convenient solution to improve the discoverability of the table, as it allows terms and phrases not found in the table itself to be indexed and to make a table discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the UNESCO Thesaurus. The list provided here can combine keywords from multiple controlled vocabularies, and user-defined keywords.

+
    +
  • name [Required ; Not repeatable ; String]
    +The keyword itself.
  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The controlled vocabulary (including version number or date) from which the keyword is extracted, if any.
  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL of the controlled vocabulary from which the keyword is extracted, if any.
  • +
+
  my_table = list(
+    # ... ,
+    table_description = list(
+      # ... ,
+      
+      keywords = list(
+        list(name = "Migration", vocabulary = "Unesco Thesaurus (June 2021)", 
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Migrants", vocabulary = "Unesco Thesaurus (June 2021)", 
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Refugee", vocabulary = "Unesco Thesaurus (June 2021)", 
+             uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+        list(name = "Forced displacement"),
+        list(name = "Forcibly displaced")
+      ),
+      
+      # ...
+    ),
+    # ... 
+  )  
+


+
    +
  • themes [Optional ; Repeatable]
  • +
+


+
"themes": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+

A list of themes covered by the table. A controlled vocabulary will preferably be used. Note that themes will rarely be used, as the elements topics and disciplines are more appropriate for most uses. This is a block of five fields:

- **id** *[Optional ; Not repeatable ; String]*
The ID of the theme, taken from a controlled vocabulary.
- **name** *[Required ; Not repeatable ; String]*
The name (label) of the theme, preferably taken from a controlled vocabulary.
- **parent_id** *[Optional ; Not repeatable ; String]*
The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
- **vocabulary** *[Optional ; Not repeatable ; String]*
The name (including version number) of the controlled vocabulary used, if any.
- **uri** *[Optional ; Not repeatable ; String]*
The URL to the controlled vocabulary used, if any.
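A minimal sketch of the `themes` element in R; the theme label and vocabulary name are hypothetical, as the schema does not prescribe a standard theme vocabulary.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    
    themes = list(
      list(id         = "3",
           name       = "Population and demography",             # hypothetical theme label
           vocabulary = "Agency thematic classification, v1.0")  # hypothetical vocabulary
    )
    
    # ...
  )
)
```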

+
    +
  • topics [Optional ; Repeatable]
  • +
+


+
"topics": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+

Information on the topics covered in the table. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification, a typology of topics available in 11 languages; or the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. Note that you may use more than one controlled vocabulary.

+

This element is a block of five fields:

+
    +
  • id [Optional ; Not repeatable ; String]
    +The identifier of the topic, taken from a controlled vocabulary.
  • +
  • name [Required ; Not repeatable ; String]
    +The name (label) of the topic, preferably taken from a controlled vocabulary.
  • +
  • parent_id [Optional ; Not repeatable ; String]
    +The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name (including version number) of the controlled vocabulary used, if any.
  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL to the controlled vocabulary used, if any.
  • +
+
  my_table <- list(
+    # ... ,
+    table_description = list(
+      # ... ,
+      
+      topics = list(
+        list(name = "Demography.Migration", 
+             vocabulary = "CESSDA Topic Classification", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(name = "Demography.Censuses", 
+             vocabulary = "CESSDA Topic Classification", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "F22", 
+             name = "International Migration", 
+             parent_id = "F2 - International Factor Movements and International Business", 
+             vocabulary = "JEL Classification System", 
+             uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+        list(id = "O15", 
+             name = "Human Resources - Human Development - Income Distribution - Migration", 
+             parent_id = "O1 - Economic Development", 
+             vocabulary = "JEL Classification System", 
+             uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
+      ),
+      
+      # ...
+    
+    ),
+    
+  )  
+


+
    +
  • disciplines [Optional ; Repeatable]
  • +
+


+
"disciplines": [
+  {
+    "id": "string",
+    "name": "string",
+    "parent_id": "string",
+    "vocabulary": "string",
+    "uri": "string"
+  }
+]
+


+

Information on the academic disciplines related to the content of the table. A controlled vocabulary will preferably be used, for example the list of academic fields in Wikipedia. This is a block of five elements:

+
    +
  • id [Optional ; Not repeatable ; String]
    +The identifier of the discipline, taken from a controlled vocabulary.
  • +
  • name [Optional ; Not repeatable ; String]
    +The name (label) of the discipline, preferably taken from a controlled vocabulary.
  • +
  • parent_id [Optional ; Not repeatable ; String]
    +The parent identifier of the discipline (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name (including version number) of the controlled vocabulary used, if any.
  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL to the controlled vocabulary used, if any.
  • +
+
  my_table <- list(
+    # ... ,
+    table_description = list(
+      # ... ,  
+      
+      disciplines = list(
+        
+        list(name = "Economics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+             
+        list(name = "Agricultural economics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+        
+        list(name = "Econometrics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields")
+             
+      ),
+      
+      # ...
+    ),
+    # ... 
+  )  
+


+
    +
  • definitions [Optional, Repeatable]
    +Definitions or concepts covered by the table. +
  • +
+
"definitions": [
+  {
+    "name": "string",
+    "definition": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    +The name (or label) of the term, indicator, or concept being defined.
  • +
  • definition [Required, Not repeatable, String]
    +The definition of the term, indicator, or concept.
  • +
  • uri [Optional, Not repeatable, String]
    +A link to the source of the definition, or to a site providing a more detailed definition.

  • +
+

Example for a table on malnutrition that would include estimates of stunting and wasting prevalence:

+
```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    definitions = list(
      
      list(name = "stunting", 
           definition = "Prevalence of stunting is the percentage of children under age 5 whose height for age is more than two standard deviations below the median for the international reference population ages 0-59 months. For children up to two years old height is measured by recumbent length. For older children height is measured by stature while standing. The data are based on the WHO's new child growth standards released in 2006.",
           uri = "https://data.worldbank.org/indicator/SH.STA.STNT.ZS?locations=1W"),
      
      list(name = "wasting", 
           definition = "Prevalence of wasting, male, is the proportion of boys under age 5 whose weight for height is more than two standard deviations below the median for the international reference population ages 0-59.", 
           uri = "https://data.worldbank.org/indicator/SH.STA.WAST.MA.ZS?locations=1W")
    
    ),  
    # ...
  )
)
```


+
    +
  • classifications [Optional, Repeatable]
    +The element is used to document the use of standard classifications (or “ontologies”, or “taxonomies”) in the table. +
  • +
+
"classifications": [
+  {
+    "name": "string",
+    "version": "string",
+    "organization": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    +Name (label) of the classification, ontology, or taxonomy.
  • +
  • version [Optional, Not repeatable, String]
    +Version of the classification, ontology, or taxonomy used in the table.
  • +
  • organization [Optional, Not repeatable, String]
    +Organization that is the custodian of the classification, ontology, or taxonomy.
  • +
  • uri [Optional, Not repeatable, String]
    +Link to an external resource where detailed information on the classification, ontology, or taxonomy can be obtained.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    classifications = list(
+      
+      list(name = "International Standard Classification of Occupations (ISCO)", 
+           version = "ISCO-08", 
+           organization = "International Labour Organization (ILO)",
+           uri = "https://www.ilo.org/public/english/bureau/stat/isco/")
+      
+    ),  
+    # ...
+    
+  )
+  
+)
+


+
    +
  • rights [Optional, Not repeatable, String]
    +Information on the rights or copyright that applies to the table. +

  • +
  • license [Optional, Repeatable]
    A table may require a license to use or reproduce it. This is done to protect the intellectual content of the research product. The licensing entity may be different from the researcher or the publisher. It is the entity that holds the intellectual rights to the table(s) and grants rights or imposes restrictions on its reuse.

  • +
+
"license": [
+  {
+    "name": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    The name of the license.

  • uri [Optional, Not repeatable, String]
    A link to a publicly accessible description of the terms of the license.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    license = list(
+      list(name = "Attribution 4.0 International (CC BY 4.0)", 
+           uri = "https://creativecommons.org/licenses/by/4.0/")
+    ), 
+    
+    # ...
+  )
+  
+)
+


+
    +
  • citation [Optional, Not repeatable, String]
    +A citation requirement for the table (i.e. an indication of how the table should be cited in publications). +

  • +
  • confidentiality [Optional, Not repeatable, String]
    A published table may be protected by a confidentiality agreement between the publisher and the researcher. Such an agreement may also determine certain rights regarding the use of the research and of the data presented in the table. The data may also contain confidential information produced for selective audiences. This element is used to provide a statement on any limitations or restrictions on the use of the table based on confidential data or agreements.

  • sdc [Optional, Not repeatable, String]
    Information on statistical disclosure control measures applied to the table. This can include cell suppression or other techniques. Specialized packages have been developed for this purpose, such as sdcTable: Methods for Statistical Disclosure Control in Tabular Data (https://cran.r-project.org/web/packages/sdcTable/sdcTable.pdf). The information provided here should be such that it does not provide intruders with useful information for reverse-engineering the protection measures applied to the table.
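The elements rights, citation, confidentiality, and sdc are simple strings. The sketch below is not taken from the Guide; the statements shown are hypothetical illustrations of how they could be filled.

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    
    rights          = "(c) 2021, National Statistics Office of Popstan",
    citation        = "National Statistics Office of Popstan, Population and Housing Census 2020, Table 1.0",
    confidentiality = "The table only contains aggregated values; no confidential information is disclosed.",
    sdc             = "Cells corresponding to fewer than 5 respondents have been suppressed."
    
    # ...
  )
)
```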
  • contacts [Optional, Repeatable]
    +Users of the data may need further clarification and information. This section may include the name-affiliation-email-URI of one or multiple contact persons. This block of elements will identify contact persons who can be used as resource persons regarding problems or questions raised by the user community. The URI attribute should be used to indicate a URN or URL for the homepage of the contact individual. The email attribute is used to indicate an email address for the contact individual. It is recommended to avoid putting the actual name of individuals. The information provided here should be valid for the long term. It is therefore preferable to identify contact persons by a title. The same applies for the email field. Ideally, a “generic” email address should be provided. It is easy to configure a mail server in such a way that all messages sent to the generic email address would be automatically forwarded to some staff members. +

  • +
+
"contacts": [
+  {
+    "name": "string",
+    "role": "string",
+    "affiliation": "string",
+    "email": "string",
+    "telephone": "string",
+    "uri": "string"
+  }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    +Name of a person or unit (such as a data help desk). It will usually be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position.
  • +
  • role [Optional, Not repeatable, String]
    +The specific role of name, in regards to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact.
  • +
  • affiliation [Optional, Not repeatable, String]
    +Affiliation of the person/unit.
  • +
  • email [Optional, Not repeatable, String]
    +E-mail address of the person.
  • +
  • telephone [Optional, Not repeatable, String]
    +A phone number that can be called to obtain information or provide feedback on the table. This should never be a personal phone number; a corporate number (typically of a data help desk) should be provided.
  • +
  • uri [Optional, Not repeatable, String]
    +A link to a website where contact information for name can be found.
  • +
+
my_table = list(
+  # ... ,
+  table_description = list(
+    # ... ,
+    
+    contacts = list(
+      
+      list(name = "Data helpdesk", 
+           role = "Support to data users",
+           affiliation = "National Statistics Office",
+           email = "data_helpdesk@ ...")
+      
+    )  
+  )
+)
+


+
    +
  • notes [Optional, Repeatable]
    +The notes provide a space to include observations or open-ended content that may be material in understanding the table, which have not been captured in other elements of the schema. +
  • +
+
"notes": [
+  {
+    "note": "string"
+  }
+]
+


+
    +
  • note [Required, Not repeatable, String]
    +The note itself.
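A minimal sketch of the `notes` element, with hypothetical content:

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    
    notes = list(
      list(note = "The 2020 figures are provisional and may be revised when final census counts are released.")
    )
    
    # ...
  )
)
```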

  • +
  • relations [Optional ; Repeatable]
    +If the table has a relation to other resources (e.g., it is a subset of another resource, or a translation of another resource), the relation(s) and associated resources can be listed in this element. +

  • +
+
"relations": [
+  {
+    "name": "string",
+    "type": "isPartOf"
+  }
+]
+


+
    +
  • name [Optional ; Not repeatable ; String]
    +The related resource. Recommended practice is to identify the related resource by means of a URI. If this is not possible or feasible, a string conforming to a formal identification system may be provided.

  • +
  • type [Optional ; Not repeatable ; String]
    +The type of relationship. The use of a controlled vocabulary is recommended. The Dublin Core proposes the following vocabulary: isPartOf, hasPart, isVersionOf, isFormatOf, hasFormat, references, isReferencedBy, isBasedOn, isBasisFor, replaces, isReplacedBy, requires, isRequiredBy.

| Type | Description |
| --- | --- |
| isPartOf | The described resource is a physical or logical part of the referenced resource. |
| hasPart | |
| isVersionOf | The described resource is a version, edition, or adaptation of the referenced resource. A change in version implies substantive changes in content rather than differences in format. |
| isFormatOf | |
| hasFormat | The described resource pre-existed the referenced resource, which is essentially the same intellectual content presented in another format. |
| references | |
| isReferencedBy | |
| isBasedOn | |
| isBasisFor | |
| replaces | The described resource supplants, displaces or supersedes the referenced resource. |
| isReplacedBy | The described resource is supplanted, displaced or superseded by the referenced resource. |
| requires | |
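A minimal sketch of the `relations` element, using a hypothetical related resource:

```r
my_table = list(
  # ... ,
  table_description = list(
    # ... ,
    
    relations = list(
      list(name = "Population and Housing Census 2020 - Statistical Tables, Volume 1",  # hypothetical resource
           type = "isPartOf")
    )
    
    # ...
  )
)
```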



### 9.3.4 Provenance

+

provenance [Optional ; Repeatable]
+

+
"provenance": [
+    {
+        "origin_description": {
+            "harvest_date": "string",
+            "altered": true,
+            "base_url": "string",
+            "identifier": "string",
+            "date_stamp": "string",
+            "metadata_namespace": "string"
+        }
+    }
+]
+


+

Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.

+
    +
  • origin_description [Required ; Not repeatable]
    +The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
    +
      +
    • harvest_date [Required ; Not repeatable ; String]
      +The date and time the metadata were harvested, entered in ISO 8601 format.
    • +
    • altered [Optional ; Not repeatable ; Boolean]
      A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the table (element idno in the Table Description / Title Statement section) will be modified when published in a new catalog.
    • +
    • base_url [Required ; Not repeatable ; String]
      +The URL from where the metadata were harvested.
    • +
    • identifier [Optional ; Not repeatable ; String]
      +The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
    • +
    • date_stamp [Optional ; Not repeatable ; String]
      +The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
    • +
    • metadata_namespace [Optional ; Not repeatable ; String]
      +@@@@@@@
    • +
  • +
+
+
+
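The provenance block is typically filled programmatically when metadata are harvested. Below is a minimal sketch; the source catalog URL and identifier are hypothetical.

```r
my_table = list(
  # ... ,
  
  provenance = list(
    
    list(origin_description = list(
      harvest_date = "2021-06-15T10:30:00Z",
      altered      = TRUE,
      base_url     = "https://catalog.example.org",   # hypothetical source catalog
      identifier   = "SRC_TBL_0001",                  # idno of the table in the source catalog
      date_stamp   = "2021-05-30"
    ))
    
  )
  
  # ...
)
```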

### 9.3.5 Tags

+

tags [Optional ; Repeatable]
+As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. +

+
"tags": [
+    {
+        "tag": "string",
+        "tag_group": "string"
+    }
+]
+


+
    +
  • tag [Required ; Not repeatable ; String]
    +A user-defined tag.

  • +
  • tag_group [Optional ; Not repeatable ; String]

    +A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
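A minimal sketch of the `tags` element, with hypothetical tags and tag groups (section 1.7 of the Guide provides a fuller example):

```r
my_table = list(
  # ... ,
  
  tags = list(
    list(tag = "census",          tag_group = "data source"),   # hypothetical tags and groups
    list(tag = "disability",      tag_group = "topics"),
    list(tag = "Southern Africa", tag_group = "regions")
  )
  
  # ...
)
```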

  • +
  • lda_topics [Optional ; Not repeatable]
    +

  • +
+
"lda_topics": [
+    {
+        "model_info": [
+            {
+                "source": "string",
+                "author": "string",
+                "version": "string",
+                "model_id": "string",
+                "nb_topics": 0,
+                "description": "string",
+                "corpus": "string",
+                "uri": "string"
+            }
+        ],
+        "topic_description": [
+            {
+                "topic_id": null,
+                "topic_score": null,
+                "topic_label": "string",
+                "topic_words": [
+                    {
+                        "word": "string",
+                        "word_weight": 0
+                    }
+                ]
+            }
+        ]
+    }
+]
+


+

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

+

Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of indicators and time series, this text will be a concatenation of some metadata elements including the series’ name, definitions, keywords, concepts, and possibly others. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).
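
As a hedged illustration (all model identifiers, topic labels, scores, and word weights below are hypothetical, and the inference step itself is not shown), the output of such an inference could be stored in the lda_topics block as follows, using the R list structure adopted in the examples of section 9.4:

# Hypothetical result of applying a pre-trained LDA model to the metadata text
lda_topics <- list(
  list(
    model_info = list(
      list(source    = "Example organization",   # hypothetical model source
           model_id  = "lda_model_v1",           # hypothetical model identifier
           nb_topics = 75)
    ),
    topic_description = list(
      list(topic_id = "topic_27", topic_score = 0.46, topic_label = "households and income",
           topic_words = list(list(word = "household", word_weight = 0.041),
                              list(word = "income",    word_weight = 0.033))),
      list(topic_id = "topic_12", topic_score = 0.31, topic_label = "population",
           topic_words = list(list(word = "population", word_weight = 0.052),
                              list(word = "census",     word_weight = 0.027)))
    )
  )
)
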

+

The lda_topics element includes the following metadata fields. An example in R was provided in Chapter 4 - Documents.

+
    +
  • model_info [Optional ; Not repeatable]
    +Information on the LDA model.

    +
      +
    • source [Optional ; Not repeatable ; String]
      +The source of the model (typically, an organization).
    • +
    • author [Optional ; Not repeatable ; String]
      +The author(s) of the model.
    • +
    • version [Optional ; Not repeatable ; String]
      +The version of the model, which could be defined by a date or a number.
    • +
    • model_id [Optional ; Not repeatable ; String]
      +The unique ID given to the model.
    • +
    • nb_topics [Optional ; Not repeatable ; Numeric]
      +The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    • +
    • description [Optional ; Not repeatable ; String]
      +A brief description of the model.
    • +
    • corpus [Optional ; Not repeatable ; String]
      +A brief description of the corpus on which the LDA model was trained.
    • +
    • uri [Optional ; Not repeatable ; String]
      +A link to a web page where additional information on the model is available.

    • +
  • +
  • topic_description [Optional ; Repeatable]
    +The topic composition extracted from selected elements of the series metadata (typically, the name, definitions, and concepts).

    +
      +
    • topic_id [Optional ; Not repeatable ; String]
      +The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    • +
    • topic_score [Optional ; Not repeatable ; Numeric]
      +The share of the topic in the metadata (%).
    • +
    • topic_label [Optional ; Not repeatable ; String]
      +The label of the topic, if any (not automatically generated by the LDA model).
    • +
    • topic_words [Optional ; Not repeatable]
      +The list of N keywords describing the topic (e.g., the top 5 words).
      +
        +
      • word [Optional ; Not repeatable ; String]
        +The word.
      • +
      • word_weight [Optional ; Not repeatable ; Numeric]
        +The weight of the word in the definition of the topic.

      • +
    • +
  • +
  • embeddings [Optional ; Repeatable]
    +In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).

    +

    The word vectors do not have to be stored in the series/indicator metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model. A minimal sketch of how this element could be filled is provided at the end of this subsection.

  • +
+


+
"embeddings": [
+    {
+        "id": "string",
+        "description": "string",
+        "date": "string",
+        "vector": null
+    }
+]
+


+

The embeddings element contains four metadata fields:

+
    +
  • id [Optional ; Not repeatable ; String]
    +A unique identifier of the word embedding model used to generate the vector.
  • +
  • description [Optional ; Not repeatable ; String]
    +A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
  • +
  • date [Optional ; Not repeatable ; String]
    +The date the model was trained (or a version date for the model).
  • +
  • vector [Required ; Not repeatable ; @@@@] +The numeric vector representing the series metadata.

  • +
+
+
+
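
As announced above, here is a minimal sketch of how the embeddings block could be filled in R. The helper function get_text_embedding() is hypothetical; it stands for whatever pre-trained embedding model or API is used, and the model identifier and description are illustrative only:

# Concatenate the metadata elements to be converted into a vector
meta_text <- "Households by Size: 1960 to Present. Number of households by household size, United States."

# Hypothetical helper: sends the text to a pre-trained embedding model (or API)
# and returns a numeric vector (e.g., of length 200)
vec <- get_text_embedding(meta_text)

embeddings <- list(
  list(id          = "embedding_model_v1",   # hypothetical model identifier
       description = "200-dimension vectors produced by a model trained on development literature",
       date        = "2021-06",
       vector      = vec)
)
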

9.3.6 Additional (custom) elements

+

additional [Optional ; Not repeatable]
+The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.
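
For example (the element names and values below are hypothetical, and we assume my_tbl already contains the table metadata, as in the examples of section 9.4), custom elements could be added in R as follows:

# Hypothetical agency-specific elements, nested within the 'additional' block
my_tbl$additional <- list(
  internal_series_code   = "HH-TABLES-2021",   # hypothetical custom element
  internal_review_status = "approved"          # hypothetical custom element
)
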

+
+
+
+

9.4 Complete examples

+

We provide here examples of the documentation of actual tables and their publishing in a NADA catalog. We use the R package NADAR and the Python library PyNada to publish the metadata in the catalog. The examples only demonstrate the production and publishing of table metadata. We do not show how the data themselves can also be published in a NADA database (MongoDB) and made available via API; the use of the data API is covered in the NADA documentation.

+
+

9.4.1 Example 1

+

This first example is a table presenting the evolution since 1960 of the number of households by size and of the average household size in the United States, published by the US Census Bureau. This table, published in MS-Excel format, was downloaded on 20 February 2021 from https://www.census.gov/data/tables/time-series/demo/families/households.html. +
+ +

+

Using R

+
library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_tables/")
+
+id = "TBL_EXAMPLE_01"
+thumb = "household_pic.JPG"   # To be used as thumbnail in the data catalog
+
+# Document the table
+
+my_table_hh4 <- list(
+  
+  metadata_information = list(
+    idno = "META_TBL_EXAMPLE-01",
+    producers = list(
+      list(name = "Olivier Dupriez",affiliation = "World Bank")
+    ),
+    production_date = "2021-02-20"
+  ),
+  
+  table_description = list(
+    
+    title_statement = list(
+      idno = id,
+      table_number = "Table HH-4",
+      title = "Households by Size: 1960 to Present",
+      sub_title = "(Numbers in thousands, except for averages)"
+    ),
+    
+    authoring_entity = list(
+      list(name = "United States Census Bureau",
+           affiliation = " U.S. Department of Commerce",
+           abbreviation = "US BUCEN",
+           uri = "https://www.census.gov/en.html"
+      )
+    ),
+    
+    date_created = "2020",
+    
+    date_published = "2020-12",
+    
+    table_columns = list(
+      list(label = "Year"),
+      list(label = "All households (number)"),
+      list(label = "Number of people: One"),
+      list(label = "Number of people: Two"),
+      list(label = "Number of people: Three"),
+      list(label = "Number of people: Four"),
+      list(label = "Number of people: Five"),
+      list(label = "Number of people: Six"),
+      list(label = "Number of people: Seven or more"),
+      list(label = "Average number of people per household")
+    ),
+    
+    table_rows = list(
+      list(label = "Year (values from 1960 to 2020)")
+    ),
+    
+    table_footnotes = list(
+      
+      list(number = "1", 
+           text = "This table uses the householder's person weight to describe characteristics of people living in households. As a result, estimates of the number of households do not match estimates of housing units from the Housing Vacancy Survey (HVS). The HVS is weighted to housing units, rather than the population, in order to more accurately estimate the number of occupied and vacant housing units. If you are primarily interested in housing inventory estimates, then see the published tables and reports here: http://www.census.gov/housing/hvs/. If you are primarily interested in characteristics about the population and people who live in households, then see the H table series and reports here: https://www.census.gov/topics/families/families-and-households.html."),
+      
+      list(number = "2", 
+           text = "Details may not sum to total due to rounding."),
+      
+      list(number = "3", 
+           text = "1993 figures revised based on population from the most recent decennial census."),
+      
+      list(number = "4", 
+           text = "The 2014 CPS ASEC included redesigned questions for income and health insurance coverage. All of the approximately 98,000 addresses were selected to receive the improved set of health insurance coverage items. The improved income questions were implemented using a split panel design.  Approximately 68,000 addresses were selected to receive a set of income questions similar to those used in the 2013 CPS ASEC. The remaining 30,000 addresses were selected to receive the redesigned income questions. The source of data for this table is the CPS ASEC sample of 98,000 addresses.")
+      
+    ),
+    
+    table_series = list(
+      list(name = "Historical Households Tables",
+           maintainer = "United States Census Bureau",
+           uri = "https://www.census.gov/data/tables/time-series/demo/families/households.html",
+           description = "Tables on households generated from the Current Population Survey")
+    ),
+    
+    statistics = list(
+      list(value = "Count"),
+      list(value = "Average")
+    ),
+    
+    unit_observation = list(
+      list(value = "Household")
+    ),
+    
+    data_sources = list(
+      list(source = "U.S. Census Bureau, Current Population Survey, March and Annual Social and Economic Supplements")
+    ),
+    
+    time_periods = list(
+      list(from = "1960", to = "2020")
+    ),
+    
+    universe = list(
+      list(value = "US resident population")
+    ),
+    
+    ref_country = list(
+      list(name = "United States", code = "USA")
+    ),
+    
+    geographic_granularity = "Country",
+    
+    languages = list(
+      list(name = "English", code = "EN")
+    ),
+    
+    links = list(
+      list(uri = "https://www2.census.gov/programs-surveys/demo/tables/families/time-series/households/hh4.xls",
+           description = "Table in MS-Excel format"),
+      list(uri = "https://www.census.gov/programs-surveys/cps/technical-documentation/complete.html",
+           description = "Technical documentation with information about ASEC, including the source and accuracy statement")
+    ),
+    
+    topics = list(
+      list(
+        id = "1",
+        name = "Demography - Censuses",
+        parent_id = "Demography",
+        vocabulary = "CESSDA Controlled Vocabulary for CESSDA Topic Classification v. 3.0 (2019-05-20)",
+        uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification?v=3.0"
+      )
+    ),
+    
+    contacts = list(
+      list(name = "Fertility and Family Statistics Branch",
+           affiliation = "US Census Bureau",
+           telephone = "+1 - 301-763-2416",
+           uri = "ask.census.gov")
+    )
+    
+  )
+
+)    
+  
+# Publish the table in a NADA catalog
+
+table_add(idno = id, 
+          metadata = my_table_hh4, 
+          repositoryid = "central", 
+          published = 1, 
+          thumbnail = thumb, 
+          overwrite = "yes")
+
+# Provide a link to the table series page (US Bucen website)
+
+external_resources_add(
+  title = "Historical Households Tables (US Bucen web page)",
+  idno = id,
+  dctype = "web",
+  file_path = "https://www.census.gov/data/tables/time-series/demo/families/households.html",
+  overwrite = "yes"
+)
+



+The result in NADA will be as follows (only part of metadata displayed):

+


+ +

+


+

Using Python

+

The same result can be achieved in Python; the script will be as follows:

+
# Python script
+
+
+

9.4.2 Example 2

+

For this second example, we use a regional table from the World Bank: “World Development Indicators - Country profiles”. The table is available on-line in Excel and in PDF formats, for many geographic areas: world, geographic regions, country groups (income level, etc), and country. A separate table is available for each of these areas. Metadata common to all table files is available in a separate Excel file.

+


+ +
+ +

+

As the same metadata applies to all tables, we generate the metadata once, and use a function to publish the geography-specific tables in one loop. In our example, we only generate the tables for the following geographies: world, World Bank regions, and countries of South Asia. This will result in the documentation and publishing of 15 tables. By providing the list of all countries to the loop, we would publish 200+ tables using this script.

+

We include definitions in the metadata. These definitions are extracted from the World Development Indicators API.

+

In the script, we assume that we only want to publish the metadata in the catalog, and provide a link to the originating World Bank website. In other words, we do not make the XLSX or PDF directly accessible from the NADA catalog (which would be easy to implement).

+

Using R

+
# --------------------------------------------------------------------------
+# Load libraries and establish the catalog administrator credentials
+# --------------------------------------------------------------------------
+
+library(nadar)
+library(jsonlite)
+library(httr)
+library(rlist)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_tables/")
+
+thumb_file <- "WB_country_profiles_WLD.jpg"
+
+src_data <- "World Bank, World Development Indicators database - WDI Central, 2021"
+
+# The tables contain data extracted from WDI time series. We identified these 
+# series ID and we list them here in their order of appearance in the table. 
+
+tbl_wdi_indicators = list(
+  "SP.POP.TOTL", "SP.POP.GROW", "AG.SRF.TOTL.K2", "EN.POP.DNST",
+  "SI.POV.NAHC", "SI.POV.DDAY", "NY.GNP.ATLS.CD", "NY.GNP.PCAP.CD",
+  "NY.GNP.MKTP.PP.CD", "NY.GNP.PCAP.PP.CD", "SI.DST.FRST.20",
+  "SP.DYN.LE00.IN", "SP.DYN.TFRT.IN", "SP.ADO.TFRT", "SP.DYN.CONU.ZS",
+  "SH.STA.BRTC.ZS", "SH.DYN.MORT", "SH.STA.MALN.ZS", "SH.IMM.MEAS",
+  "SE.PRM.CMPT.ZS", "SE.PRM.ENRR", "SE.SEC.ENRR", "SE.ENR.PRSC.FM.ZS",
+  "SH.DYN.AIDS.ZS", "AG.LND.FRST.K2", "ER.PTD.TOTL.ZS", 
+  "ER.H2O.FWTL.ZS", "SP.URB.GROW", "EG.USE.PCAP.KG.OE", 
+  "EN.ATM.CO2E.PC", "EG.USE.ELEC.KH.PC", "NY.GDP.MKTP.CD", 
+  "NY.GDP.MKTP.KD.ZG", "NY.GDP.DEFL.KD.ZG", "NV.AGR.TOTL.ZS", 
+  "NV.IND.TOTL.ZS", "NE.EXP.GNFS.ZS", "NE.IMP.GNFS.ZS",
+  "NE.GDI.TOTL.ZS", "GC.REV.XGRT.GD.ZS", "GC.NLD.TOTL.GD.ZS", 
+  "FS.AST.DOMS.GD.ZS", "GC.TAX.TOTL.GD.ZS", "MS.MIL.XPND.GD.ZS",
+  "IT.CEL.SETS.P2", "IT.NET.USER.ZS", "TX.VAL.TECH.MF.ZS", 
+  "IQ.SCI.OVRL", "TG.VAL.TOTL.GD.ZS", "TT.PRI.MRCH.XD.WD", 
+  "DT.DOD.DECT.CD", "DT.TDS.DECT.EX.ZS", "SM.POP.NETM", 
+  "BX.TRF.PWKR.CD.DT", "BX.KLT.DINV.CD.WD", "DT.ODA.ODAT.CD"
+)
+
+rows = list()
+defs = list()
+
+# We then use the WDI API to retrieve information on the series (name, label, 
+# definition) to be included in the published metadata. 
+
+for(s in tbl_wdi_indicators) {
+  
+  url = paste0("https://api.worldbank.org/v2/sources/2/series/", s, 
+               "/metadata?format=JSON")
+  s_meta <- GET(url)
+  if(http_error(s_meta)){
+    stop("The request failed")
+  } else {
+    s_metadata <- fromJSON(content(s_meta, as = "text"))  
+    s_metadata <- s_metadata$source$concept[[1]][[2]][[1]][[2]][[1]]
+  }
+  
+  indic_lbl = s_metadata$value[s_metadata$id=="IndicatorName"]
+  indic_def = s_metadata$value[s_metadata$id=="Longdefinition"]
+  
+  this_row = list(var_name = s, dataset = src_data, label = indic_lbl)
+  rows = list.append(rows, this_row)
+  
+  this_def = list(name = indic_lbl, definition = indic_def)
+  defs = list.append(defs, this_def)
+  
+}
+
+# --------------------------------------------------------------------------
+# We create a function that takes three parameters: the country (or region) 
+# name, the country (or region) code, and the type of geographic area. The 
+# function generates the table metadata and publishes the selected table in 
+# the NADA catalog.
+# --------------------------------------------------------------------------
+
+publish_country_profile <- function(country_name, country_code, area) {
+  
+  # Generate the country/region-specific unique table ID and table title
+  
+  idno_meta <- paste0("UC013_", country_code)
+  idno_tbl  <- paste0("UC013_", country_code)
+  tbl_title <- paste0("World Development Indicators, Country Profile, ", 
+                      country_name, " - 2021")
+  citation  <- paste("World Bank,", tbl_title, 
+                     ", https://datacatalog.worldbank.org/dataset/country-profiles, accessed on [date]")
+  
+  # Generate the schema-compliant metadata
+  
+  my_tbl <- list(
+    
+    metadata_information = list(    
+      producers = list(list(name = "NADA team")),
+      production_date = "2021-09-14",
+      version = "v01"
+    ),
+    
+    table_description = list(
+      
+      title_statement = list(
+        idno = idno_tbl,
+        title = tbl_title
+      ),
+      
+      authoring_entity = list(
+        list(name = "World Bank, Development Data Group",
+             abbreviation = "WB",
+             uri = "https://data.worldbank.org/")
+      ),
+      
+      date_created = "2021-07-03",
+      date_published = "2021-07",
+      
+      description = "Country profiles present the latest key development data drawn from the World Development Indicators (WDI) database. They follow the format of The Little Data Book, the WDI's quick reference publication.",
+      
+      table_columns = list(
+        list(label = "Year 1990"),
+        list(label = "Year 2000"),
+        list(label = "Year 2010"),
+        list(label = "Year 2018")
+      ),
+      
+      table_rows = rows,
+      
+      table_series = list(
+        list(name = "World Development Indicators, Country Profiles",
+             maintainer = "World Bank, Development Data Group (DECDG)")
+      ),
+      
+      data_sources = list(
+        list(source = src_data)
+      ),
+      
+      time_periods = list(
+        list(from = "1990", to = "1990"),
+        list(from = "2000", to = "2000"),
+        list(from = "2010", to = "2010"),
+        list(from = "2018", to = "2018")
+      ),
+      
+      ref_country = list(
+        list(name = country_name, code = country_code)
+      ),
+      
+      geographic_granularity = area,
+      
+      languages = list(
+        list(name = "English", code = "EN")
+      ),
+      
+      links = list(
+        list(uri = "https://datacatalog.worldbank.org/dataset/country-profiles",
+             description = "Country Profiles in World Bank Data Catalog website"),
+        list(uri = "http://wdi.worldbank.org/tables",
+             description = "Country Profiles in World Bank World Development Indicators website"),
+        list(uri = "https://datatopics.worldbank.org/world-development-indicators/",
+             description = "World Development Indicators website")
+      ),
+      
+      keywords = list(
+        list(name = "World View"),
+        list(name = "People"),
+        list(name = "Environment"),
+        list(name = "Economy"),
+        list(name = "States and markets"),
+        list(name = "Global links")
+      ),
+      
+      topics = list(
+        list(id = "1", name = "Demography", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "2", name = "Economics", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "3", name = "Education", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "4", name = "Health", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "5", name = "Labour And Employment", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "6", name = "Natural Environment", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "7", name = "Social Welfare Policy and Systems", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "8", name = "Trade Industry and Markets", 
+             vocabulary = "CESSDA", 
+             uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+        list(id = "9", name = "Economic development")
+      ),
+      
+      definitions = defs,
+      
+      license  = list(
+        list(name = "Creative Commons - Attribution 4.0 International - CC BY 4.0",
+             uri = "https://creativecommons.org/licenses/by/4.0/")
+      ),
+      
+      citation = citation,
+      
+      contacts = list(
+        list(name = "World Bank, Development Data Group, Help Desk",
+             telephone = "+1 (202) 473-7824 or +1 (800) 590-1906",
+             email = "data@worldbank.org",
+             uri = "https://datahelpdesk.worldbank.org/")
+      )
+      
+    ) 
+    
+  )  
+  
+  # Publish the table in the NADA catalog
+  
+  table_add(idno = my_tbl$table_description$title_statement$idno, 
+            metadata = my_tbl, 
+            repositoryid = "central", 
+            published = 1, 
+            overwrite = "yes",
+            thumbnail = thumb_file)  
+  
+  # Add a link to the WDI website as an external resource
+  
+  external_resources_add(
+    title = "World Development Indicators - Regional tables",
+    idno = idno_tbl,
+    dctype = "web",
+    file_path = "http://wdi.worldbank.org/table",
+    overwrite = "yes"
+  )
+  
+}
+
+# --------------------------------------------------------------------------
+# We run the function in a loop to publish the selected tables 
+# --------------------------------------------------------------------------
+
+# List of countries/regions
+
+geo_list <- list(
+  list(name = "World",                        code = "WLD", area = "World"),
+  list(name = "East Asia and Pacific",        code = "EAP", area = "Region"),
+  list(name = "Europe and Central Asia",      code = "ECA", area = "Region"),
+  list(name = "Latin America and Caribbean",  code = "LAC", area = "Region"),
+  list(name = "Middle East and North Africa", code = "MNA", area = "Region"),
+  list(name = "South Asia",                   code = "SAR", area = "Region"),
+  list(name = "Sub-Saharan Africa",           code = "AFR", area = "Region"),
+  list(name = "Afghanistan",                  code = "AFG", area = "Country"),
+  list(name = "Bangladesh",                   code = "BGD", area = "Country"),
+  list(name = "Bhutan",                       code = "BHU", area = "Country"),
+  list(name = "India",                        code = "IND", area = "Country"),
+  list(name = "Maldives",                     code = "MDV", area = "Country"),
+  list(name = "Nepal",                        code = "NPL", area = "Country"),
+  list(name = "Pakistan",                     code = "PAK", area = "Country"),
+  list(name = "Sri Lanka",                    code = "LKA", area = "Country"))
+
+# Loop through the list of countries/region to publish the tables
+
+for(geo in geo_list) {
+  publish_country_profile(
+    country_name = geo$name, 
+    country_code = geo$code,
+    area         = geo$area)
+}  
+



+

Using Python

+
# Python script
+

The result in NADA will be as follows:

+
+ +
+
+
+

9.4.3 Example 3

+

This example is selected to show how the documentation can take advantage of R or Python to extract information from the table. Here we have the table in MS-Excel format. The table contains a long list of countries, which would be tedious to manually enter. A script reads the Excel file and extracts some of the information which is then added to the table metadata. The table also contains the definitions of the indicators shown in the table.

+

Here we assume we want to provide the XLSX and PDF versions of the table in addition to a link to the source website. We will upload the resources (XLSX and PDF) to our web server and identify them as external resources.

+

The table:

+

+



+
+

Using R

+
+
library(nadar)
+library(readxl)
+library(rlist)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_tables/")
+
+thumb = "SDGs.jpg"
+
+id = "TBL_EXAMPLE-03"
+
+# ---------------------------------------------------------------------------
+# We read the MS-Excel file and extract the list of countries and definitions
+# ---------------------------------------------------------------------------
+
+# We generate the list of countries
+df <- read_xlsx("WV2_Global_goals_ending_poverty_and_improving_lives.xlsx", 
+                range = "A5:A230")
+ctry_list <- list()
+for(i in 1:nrow(df)) {
+  c <- list(name = as.character(df[[1]][i]))
+  ctry_list <- list.append(ctry_list, c)
+}
+
+# We extract the definitions found in the table.  
+# Note that we could have instead copy/pasted the definitions. 
+# For example, the command line:
+#   list(name = as.character(df[1,1]), definition = as.character(df[3,1]))
+# is equivalent to:
+#   list(name = "Income share held by lowest 20%",
+#        definition = "Percentage share of income or consumption is the share that accrues to subgroups of population indicated by deciles or quintiles. Percentage shares by quintile may not sum to 100 because of rounding.")
+
+df <- read_xlsx("WV2_Global_goals_ending_poverty_and_improving_lives.xlsx", 
+                range = "A241:A340", col_names = FALSE)
+
+def_list = list(
+  list(name = as.character(df[1,1]),  definition = as.character(df[3,1])),
+  list(name = as.character(df[11,1]), definition = as.character(df[13,1])),
+  list(name = as.character(df[21,1]), definition = as.character(df[23,1])),
+  list(name = as.character(df[31,1]), definition = as.character(df[33,1])),
+  list(name = as.character(df[41,1]), definition = as.character(df[43,1])),
+  list(name = as.character(df[51,1]), definition = as.character(df[53,1])),
+  list(name = as.character(df[61,1]), definition = as.character(df[63,1])),
+  list(name = as.character(df[71,1]), definition = as.character(df[73,1])),
+  list(name = as.character(df[78,1]), definition = as.character(df[80,1])),
+  list(name = as.character(df[85,1]), definition = as.character(df[87,1])),
+  list(name = as.character(df[92,1]), definition = as.character(df[94,1]))
+)  
+
+# We generate the table metadata
+
+my_tbl <- list(
+  
+  metadata_information = list(
+    idno = "META_TBL_EXAMPLE-03",
+    producers = list(
+      list(name = "Olivier Dupriez", affiliation = "World Bank")
+    ),
+    production_date = "2021-02-20"
+  ),
+  
+  table_description = list(
+    
+    title_statement = list(
+      idno = id,
+      table_number = "WV.2",
+      title = "Global Goals: Ending Poverty and Improving Lives"
+    ),
+    
+    authoring_entity = list(
+      list(name = "World Bank, Development Data Group",
+           abbreviation = "WB",
+           uri = "https://data.worldbank.org/")
+    ),
+    
+    date_created = "2020-12-16",
+    date_published = "2020-12",
+    
+    description = "",
+    
+    table_columns = list(
+      list(label = "Percentage share of income or consumption - Lowest 20% - 2007-18"),
+      list(label = "Prevalence of child malnutrition - Stunting, height for age - %  of children under 5 - 2011-19"),
+      list(label = "Maternal mortality ratio - Modeled estimates - per 100,000 live births - 2017"),
+      list(label = "Under-five mortality rate - Total - per 1,000 live births - 2019"),
+      list(label = "Incidence of HIV, ages 15-49 (per 1,000 uninfected population ages 15-49) - 2019"),
+      list(label = "Incidence of tuberculosis - per 100,000 people - 2019"),
+      list(label = "Mortality caused by road traffic injury - per 100,000 people - 2016"),
+      list(label = "Primary completion rate - Total - % of relevant age group - 2018"),
+      list(label = "Contributing family workers - Male - % of male employment - 2018"),
+      list(label = "Contributing family workers - Female - % of female employment - 2018"),
+      list(label = "Labor productivity - GDP per person employed - % growth - 2015-18")
+    ),
+    
+    table_rows = list(
+      list(label = "Country or region")
+    ),
+    
+    table_series = list(
+      list(name = "World Development Indicators - World View",
+           description = "World Development Indicators includes data spanning up to 56 years-from 1960 to 2016. World view frames global trends with indicators on population, population density, urbanization, GNI, and GDP. As in previous years, the World view online tables present indicators measuring the world's economy and progress toward improving lives, achieving sustainable development, providing support for vulnerable populations, and reducing gender disparities. Data on poverty and shared prosperity are now in a separate section, while highlights of progress toward the Sustainable Development Goals are now presented in the companion publication, Atlas of Sustainable Development Goals 2017.
+           
+  The global highlights in this section draw on the six themes of World Development Indicators:
+  - Poverty and shared prosperity, which presents indicators that measure progress toward the World Bank Group's twin goals of ending extreme poverty by 2030 and promoting shared prosperity in every country.
+  - People, which showcases indicators covering education, health, jobs, social protection, and gender and provides a portrait of societal progress across the world.
+  - Environment, which presents indicators on the use of natural resources, such as water and energy, and various measures of environmental degradation, including pollution, deforestation, and loss of habitat, all of which must be considered in shaping development strategies.
+  - Economy, which provides a window on the global economy through indicators that describe the economic activity of the more than 200 countries and territories that produce, trade, and consume the world's output.
+  - States and markets, which encompasses indicators on private investment and performance, financial system development, quality and availability of infrastructure, and the role of the public sector in nurturing investment and growth.
+  - Global links, which presents indicators on the size and direction of the flows and links that enable economies to grow, including measures of trade, remittances, equity, and debt, as well as tourism and migration.",
+           uri = "http://wdi.worldbank.org/tables",
+           maintainer = "World Bank, Development Data Group (DECDG)")
+    ),
+    
+    data_sources = list(
+      list(source = "World Bank, World Development Indicators database, 2020")
+    ),
+    
+    time_periods = list(
+      list(from = "2007", to = "2019")  # The table cover all years from 2007 to 2019 
+    ),
+    
+    ref_country = ctry_list,
+    geographic_granularity = "Country, WB geographic region, other country groupings",
+    
+    languages = list(
+      list(name = "English", code = "EN")
+    ),
+    
+    links = list(
+      list(uri = "http://wdi.worldbank.org/tables",
+           description = "World Development Indicators - Global Goals tables"),
+      list(uri = "https://datatopics.worldbank.org/world-development-indicators/",
+           description = "World Development Indicators website"),
+      list(uri = "https://sdgs.un.org/goals",
+           description = "United Nations, Sustainable Development Goals (SDG) website")
+    ),
+    
+    keywords = list(
+      list(name = "Sustainable Development Goals (SDGs)"),
+      list(name = "Shared prosperity"),
+      list(name = "HIV - AIDS")
+    ),
+    
+    topics = list(
+      list(id = "1",
+           name = "Demography", 
+           vocabulary = "CESSDA", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(id = "2",
+           name = "Economics", 
+           vocabulary = "CESSDA", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(id = "3",
+           name = "Education", 
+           vocabulary = "CESSDA", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      list(id = "4",
+           name = "Health", 
+           vocabulary = "CESSDA", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification")
+    ),
+    
+    disciplines = list(
+      list(name = "Economics")
+    ),
+    
+    definitions = def_list,
+    
+    license  = list(
+      list(name = "Creative Commons - Attribution 4.0 International - CC BY 4.0",
+           uri = "https://creativecommons.org/licenses/by/4.0/")
+    ),
+    
+    citation = "",
+    
+    contacts = list(
+      list(name = "World Bank, Development Data Group, Help Desk",
+           telephone = "+1 (202) 473-7824 or +1 (800) 590-1906",
+           email = "data@worldbank.org",
+           uri = "https://datahelpdesk.worldbank.org/")
+    )
+    
+  )
+)  
+
+# We publish the table in the catalog
+
+table_add(idno = id, 
+          metadata = my_tbl, 
+          repositoryid = "central", 
+          published = 1, 
+          overwrite = "yes",
+          thumbnail = thumb)
+
+# We add the MS-Excel and PDF versions of the table as external resources
+
+external_resources_add(
+  title = "Global Goals: Ending Poverty and Improving Lives (in MS-Excel format)",
+  idno = id,
+  dctype = "tbl",
+  file_path = "WV2_Global_goals_ending_poverty_and_improving_lives.xlsx",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "Global Goals: Ending Poverty and Improving Lives (in PDF format)",
+  idno = id,
+  dctype = "tbl",
+  file_path = "WV2_Global_goals_ending_poverty_and_improving_lives.pdf",
+  overwrite = "yes"
+)
+

The table will now be available in the NADA catalog.

+
+ +
+



Using Python

+
#Python script

diff --git a/chapter10.html b/chapter10.html new file mode 100644 index 0000000..34ce665 --- /dev/null +++ b/chapter10.html @@ -0,0 +1,2904 @@

Chapter 10 Images

+
+
+ +
+


+
+

10.1 Image metadata

+

This chapter describes the use of two metadata standards for the documentation of images. Images may include both electronic and physical representations, but we are here interested in images available as electronic files, intended to be catalogued and published in on-line catalogs/albums. These files will typically be available in one of the following formats: JPG, PNG, or TIFF. Images can be photos taken by digital cameras, images generated by computer, or scanned images. The metadata standards we describe are intended to make these images discoverable, accessible, and usable. For that purpose, metadata must be provided on the content of the image (in the form of caption, description, keywords, etc.), on the location and date the image was generated, on the author, and more. Information on use license and copyrights, on possible privacy protection issues (persons, possibly minors, etc.) is needed to provide users with information they need to ensure their use of the published images is legal, ethical, and responsible.

+

The devices used to generate images as electronic files (such as digital cameras) embed metadata in the files they produce. Digital cameras generate EXIF metadata. This information may be useful to some users but, with a few exceptions like the date the photo was taken and the GPS location if captured, it lacks information on the content of the image (what is represented in it), which is required for discoverability. This information must be added by curators. Part of it will be entered manually; the rest can be extracted in a largely automated manner using machine learning models and APIs. This information must be structured and stored in compliance with a metadata standard. We present in this chapter two standards that can serve that purpose: the comprehensive (and somewhat complex) IPTC standard, and the simpler Dublin Core (DCMI) standard. The metadata schema we propose embeds both options; when using the schema, users will select one or the other to document their images. We also make references to the ImageObject metadata schema from schema.org, and include some of its elements in our schema.

+
+

Although photographs may be more explicit than a long discourse for humans, they don’t describe themselves in term of content as texts do. For texts, authors use many clues to indicate what they are talking about: titles, abstract, keywords, etc. which may be used for automatic cataloguing. Searching for photos must rely on manual cataloguing, or relate texts and documents that come with the photos. (Source: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.5077&rep=rep1&type=pdf)

+
+

We start with a brief presentation of the EXIF metadata, then describe the schema we propose for the documentation and cataloguing of images.

+
+

10.1.1 Embedded metadata: EXIF

+

Modern digital cameras automatically generate metadata and embed it into the image file. This metadata is known as the Exchangeable Image File Format or EXIF. EXIF will record information on the date and time the image was taken, on the GPS location coordinates (latitude & longitude, possibly altitude) if the camera was equipped with a GPS and geolocation was enabled, information on the device including manufacturer and model, technical information (lens type, focal range, aperture, shutter speed, flash settings), the system-generated unique image identifier, and more.

+

There are several ways to extract or view an image’s EXIF data. For example, the R packages exifr and exiftoolr (wrappers around the ExifTool utility) allow the extraction and use of EXIF metadata, and applications like Flickr will display EXIF content.
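
A minimal sketch in R using the exifr package (the file name is hypothetical):

library(exifr)

# Read all EXIF tags embedded in an image file into a data frame
exif <- read_exif("my_photo.jpg")

# Inspect a few tags of interest, when they are present in the file
exif[, intersect(c("DateTimeOriginal", "GPSLatitude", "GPSLongitude",
                   "Make", "Model", "ImageUniqueID"), names(exif))]
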

+


+ +

+

But with the exception of the date, location (if captured), and unique image identifier, the content of the EXIF does not provide information that users interested in identifying images based on their source and/or content will find useful. Metadata describing the content and source of an image will have to be obtained from another source or using other tools.

+
+
+

10.1.2 IPTC and Dublin Core standards

+

The metadata schema we propose for documenting images contains two mutually-exclusive options: the Dublin Core, as a simple option, and the IPTC, as a more complex and advanced solution. The schema also contains a few metadata elements that will be used no matter which option is selected. The schema is structured as follows:

+
    +
  • A few elements common to both options are provided to document the metadata (not the image itself), to provide some cataloguing parameters, and to set a unique identifier for the image being documented.

  • +
  • Then come the two options for documenting the image itself: the IPTC block of metadata elements, and the Dublin Core block of elements. Users will make use of one of them, not both.

    +
      +
    • The IPTC is the most detailed and complex schema. The version embedded in our schema is 2019.1 According to the IPTC website, “The IPTC Photo Metadata Standard is the most widely used standard to describe photos, because of its universal acceptance among news agencies, photographers, photo agencies, libraries, museums, and other related industries. It structures and defines metadata properties that allow users to add precise and reliable data about images.” The IPTC standard consists of two schemas: IPTC Core and IPTC Extension. They provide a comprehensive set of fields to document an image including information on time and geographic coverage, people and objects shown in the image, information on rights, and more. The schema is complex and in most cases only a small subset of fields will be used to document an image. Controlled vocabularies are recommended for some elements.

    • +
    • The Dublin Core (DCMI) is a simpler and highly flexible standard, composed of 15 core elements which we supplement with a few elements mostly taken from the ImageObject schema from schema.org.

    • +
  • +
  • Last, a small number of additional metadata elements are provided, which are common to both options described above.

  • +
+

Whether the IPTC or the simpler DCMI option is used, the metadata should be made as rich as possible.

+
+
+

10.1.3 Augmenting image metadata

+

To make images discoverable, metadata that describe the content depicted in an image, the source of the image, and the rights and licensing associated with it are essential; this information is not provided in the EXIF. Additional metadata must therefore be provided.

+

Some of these metadata will have to be generated by image authors and/or curators; others can be generated in a largely automated manner using machine learning models and tools. Image processing algorithms that make it possible to augment metadata include face detection, person identification, automated labeling, text extraction, and others. Before describing the proposed metadata schema in the following sections, we present some examples of tools that make such metadata enhancement easy and affordable.

+

The example we provide below makes use of the Google Vision API to generate image metadata. Google Vision is one of several tools that can be used for that purpose; others include Amazon Rekognition and Microsoft Azure Computer Vision. This example makes use of a photo selected from the World Bank Flickr album.

+
+ +
+

The image comes with a brief description that identifies the photographer, the location (name of the country and town, not GPS location), and the content of the image. The description of the image includes important keywords that, when indexed in a catalog, will support discoverability of the image. This information, to be manually entered, is valuable and must be part of the curated image metadata.

+

But we can add useful additional information in an automated manner and at low cost using machine learning models. In the example below, we use the (free) on-line “Try it” tool of the Google Vision application.

+
+ +
+

The Google Vision API returns and displays the results of the image processing in multiple tabs. The same content is available programmatically in JSON format. The content of this JSON file can be mapped to elements of the metadata schema, for automatic addition to the image metadata.
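
As a hedged sketch (assuming a valid API key; the file name and the 0.7 confidence threshold are arbitrary choices), the REST endpoint of the Vision API can be called from R with the httr package, and the returned labels mapped to the keywords element of the image metadata:

library(httr)
library(jsonlite)
library(base64enc)

api_key <- "MY_GOOGLE_VISION_API_KEY"          # assumption: a valid API key

# The image content is sent base64-encoded in the request body
body <- list(requests = list(list(
  image    = list(content = base64encode("my_photo.jpg")),
  features = list(list(type = "LABEL_DETECTION", maxResults = 10))
)))

resp <- POST(paste0("https://vision.googleapis.com/v1/images:annotate?key=", api_key),
             body = body, encode = "json")

parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
labels <- parsed$responses$labelAnnotations[[1]]   # data frame: description, score, ...

# Keep only labels proposed with sufficient confidence, and use them as keywords
keywords <- labels$description[labels$score >= 0.7]
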

+

The first tab shows the result of face detection. Each detected face has a bounding box and metadata such as the derived emotion of the person. The bounding box can be used to automatically flag images that have one or multiple “significant size” face(s) and may have to be excluded from the published images for privacy protection reasons.

+
+ +
+

The second tab reports on detected objects.

+
+ +
+

The third tab suggests labels that could be attached to the image, provided with a degree of confidence. A threshold can be set to automatically add (or not) each proposed label as a keyword in the image metadata.

+
+ +
+

+ +

+

The fourth tab shows the text detected in the image. The quality of text detection and recognition depends on the resolution of the image and on the size and orientation of the text in the image. In our example, the algorithm fails to read (most of) the small, rotated and truncated text.

+
+ +
+

The tool managed to recognize some, but not all, characters. In this case, the extracted text would not be considered useful information to add to the image metadata.

+
+ +
+

We are not interested in the properties tab which does not provide information that can be used for discoverability of images based on their content or source.

+

The last tab, Safe search, provides warnings that are useful if you plan to make the image publicly accessible.

+
+ +
+

This “Try it” tool demonstrates the capabilities of the application which, for automating the processing of a collection of images, would be accessed programmatically using R, Python or another programming language. Accessing the application’s API requires a key. The cost of image labeling, face detection, and other image processing is low. For information on pricing, consult the website of the API providers.

+
+
+
+

10.2 Schema description

+

The schema contains two options to document images: the IPTC and the Dublin Core metadata standards. It is organized into four main groups of metadata elements:

1. A small set of “common elements” (used no matter which option, IPTC or Dublin Core, is selected), used mostly for cataloguing purposes.
2. The IPTC metadata elements.
3. The Dublin Core (DCMI) elements.
4. Another small set of common elements.

+

The description of IPTC metadata elements is largely taken from the Photo Metadata section of the IPTC website.

+


+
{
+  "repositoryid": "central",
+  "published": "0",
+  "overwrite": "no",
+  "metadata_information": {},
+  "image_description": {
+    "idno": "string",
+    "identifiers": [],
+    "iptc": {},
+    "dcmi": {},
+    "license": [],
+    "album": []
+  },
+  "provenance": [],
+  "tags": [],
+  "lda_topics": [],
+  "embeddings": [],
+  "additional": { }
+}
+


+
+

10.2.1 Common elements

+
    +
  • metadata_information [Optional ; Not repeatable]
    +This block is used to describe who produced the metadata and when. This is an optional section of the schema, useful more to archivists than to data users. The description of the image itself is found in the IPTC or DCMI section. +
  • +
+
"metadata_information": {
+  "title": "string",
+  "idno": "string",
+  "producers": [
+    {
+      "name": "string",
+      "abbr": "string",
+      "affiliation": "string",
+      "role": "string"
+    }
+  ],
+  "production_date": "string",
+  "version": "string"
+}
+


+
    +
  • title [Optional ; Not Repeatable ; String]
    +The title of the image metadata. This can be the same as the image title.

  • +
  • idno [Optional ; Not Repeatable ; String]
    +The unique identifier of the image metadata document (which can be different from the image identifier).

  • +
  • producers [Optional ; Repeatable]
    +A list of persons or organizations involved in the documentation (production of the metadata) of the image.

    +
      +
    • name [Optional ; Not repeatable, String]
      +The name of the person or agency that is responsible for the documentation of the image.
    • +
    • abbr [Optional ; Not repeatable, String]
      +Abbreviation (acronym) of the agency mentioned in name.
    • +
    • affiliation [Optional ; Not repeatable, String]
      +Affiliation of the person or agency mentioned in name.
    • +
    • role [Optional ; Not repeatable, String]
      +The specific role of the person or agency mentioned in name in the production of the metadata. This element will be used when more than one person or organization is listed in the producers element to distinguish the specific contribution of each metadata producer.

    • +
  • +
  • production_date [Optional ; Not repeatable, String]

    +The date the image metadata was generated (not the date the image was created), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).

  • +
  • version [Optional ; Not repeatable, String]
    +The version of the metadata on this image. This element will rarely be used.

  • +
  • image_description [Required ; Not Repeatable]
    +The image_description will contain the metadata related to one image. +

  • +
+
"image_description": {
+  "idno": "string",
+  "identifiers": [
+    {
+      "type": "string",
+      "identifier": "string"
+    }
+  ],
+  "iptc": {},
+  "dcmi": {},
+  "license": [],
+  "album": []
+}
+


+
    +
  • idno [Required ; Not Repeatable, String]
    +The (main) unique identifier of the image, to be used for cataloguing purpose.

  • +
  • identifiers [Optional, Repeatable]
    +The repeatable element identifiers is used to list image identifiers other than the one used in idno. Some images may have unique identifiers assigned by different organizations or cataloguing systems; this element is used to document them.

    +

    This element is used to enter image identifiers (IDs) other than the catalog ID entered in the image_description / idno element. It can for example be a Digital Object Identifier (DOI), or the EXIF identifier. Note that the ID entered in the idno element can be repeated here (idno does not provide a type parameter, that curators may want to document).

    +
      +
    • type [Optional, Not Repeatable, String] +The type of identifier. This could be for example “DOI”.
    • +
    • identifier [Required, Not Repeatable, String] +The identifier itself.
    • +
  • +
+
+
+

10.2.2 IPTC option

+

iptc [Optional ; Not Repeatable]
+The schema provides two options (standards) to document an image: the IPTC and the Dublin Core. Only one of these standards, not both, will be used to document an image. The iptc block will be used when IPTC is the preferred option. In that case, the dcmi block described later in this chapter will be left empty. IPTC is the more complex of the two options. +

+
"iptc": {
+  "photoVideoMetadataIPTC": {
+    "title": "string",
+    "imageSupplierImageId": "string",
+    "registryEntries": [],
+    "digitalImageGuid": "string",
+    "dateCreated": "2023-04-11T15:06:09Z",
+    "headline": "string",
+    "eventName": "string",
+    "description": "string",
+    "captionWriter": "string",
+    "keywords": [],
+    "sceneCodes": [],
+    "sceneCodesLabelled": [],
+    "subjectCodes": [],
+    "subjectCodesLabelled": [],
+    "creatorNames": [],
+    "creatorContactInfo": {},
+    "creditLine": "string",
+    "digitalSourceType": "http://example.com",
+    "jobid": "string",
+    "jobtitle": "string",
+    "source": "string",
+    "locationsShown": [],
+    "imageRating": 0,
+    "supplier": [],
+    "copyrightNotice": "string",
+    "copyrightOwners": [],
+    "usageTerms": "string",
+    "embdEncRightsExpr": [],
+    "linkedEncRightsExpr": [],
+    "webstatementRights": "http://example.com",
+    "instructions": "string",
+    "genres": [],
+    "intellectualGenre": "string",
+    "artworkOrObjects": [],
+    "personInImageNames": [],
+    "personsShown": [],
+    "modelAges": [],
+    "additionalModelInfo": "string",
+    "minorModelAgeDisclosure": "http://example.com",
+    "modelReleaseDocuments": [],
+    "modelReleaseStatus": {},
+    "organisationInImageCodes": [],
+    "organisationInImageNames": [],
+    "productsShown": [],
+    "maxAvailHeight": 0,
+    "maxAvailWidth": 0,
+    "propertyReleaseStatus": {},
+    "propertyReleaseDocuments": [],
+    "aboutCvTerms": []
+  }
+}
+


+

photoVideoMetadataIPTC [Required ; Not Repeatable ; String]
+Contains all elements used to describe the image using the IPTC standard.

+
    +
  • title [Optional ; Not Repeatable ; String]
    +The title is a shorthand reference for the digital image. It provides a short verbal and human readable name which can be a text and/or a numeric reference. It is not the same as the Headline (see below). Some may use the title field to store the file name of the image, though the field may be used in many ways. This element should not be used to provide the unique identifier of the image.

  • +
  • imageSupplierImageId [Optional ; Not Repeatable ; String]
    +A unique identifier assigned by the image supplier to the image.

  • +
  • registryEntries [Optional ; Repeatable]
    +A structured element used to provide cataloguing information (i.e. an entry in a registry). It includes the unique identifier for the image issued by the registry and the registry’s organization identifier. +

  • +
+
"registryEntries": [
+  {
+    "role": "http://example.com",
+    "assetIdentifier": "string",
+    "registryIdentifier": "http://example.com"
+  }
+]
+


+
    +
  • role: [Optional ; Not Repeatable ; String]
    +An identifier of the reason and/or purpose for this Registry Entry.

  • +
  • assetIdentifier [Optional ; Not Repeatable ; String]
    +A unique identifier created by the registry and applied by the creator of the digital image. This value shall not be changed after being applied, and it is linked to a corresponding Registry Organization Identifier. The identifier may be globally unique by itself, but it must at least be unique for the issuing registry. An input to this field should be made mandatory.

  • +
  • registryIdentifier [Optional ; Not Repeatable ; String]
    +An identifier for the registry/organization which issued the corresponding Registry Image Id.

  • +
  • digitalImageGuid [Optional ; Not Repeatable ; String]
    +A globally unique identifier for the image. This identifier is created and applied by the creator of the digital image at the time of its creation, and shall not be changed after that time. The identifier can be generated using an algorithm that guarantees global uniqueness. Devices that create digital images, like digital or video cameras or scanners, usually create such an identifier at the time the digital data are created and add it to the metadata embedded in the image file (e.g., the EXIF metadata). IPTC’s requirements for unique ids are as follows:

    +
      +
    • It must be globally unique. Algorithms for this purpose exist.
    • +
    • It should identify the camera body.
    • +
    • It should identify each individual photo from this camera body.
    • +
    • It should identify the date and time of the creation of the picture.
    • +
    • It should be secured against tampering.

    • +
  • +
- **`dateCreated`** *[Optional ; Not Repeatable ; String]* <br>
Designates the date and, optionally, the time the content of the image was created. For a photo, this will be the date and time the photo was taken. When no information is available on the time, the time is set to 00:00:00. The preferred format for the dateCreated element is the truncated DateTime format, for example: 2021-02-22T21:24:06Z.
- **`headline`** *[Optional ; Not Repeatable ; String]* <br>
A brief publishable summary of the contents of the image. Note that a headline is not the same as a title.
- **`eventName`** *[Optional ; Not Repeatable ; String]* <br>
The name or a brief description of the event where the image was taken. If this is a sub-event of a larger event, mention both in the description. For example: "Opening statement, 1st International Conference on Metadata Standards, New York, November 2021".
- **`description`** *[Optional ; Not Repeatable ; String]* <br>
A textual description, including captions, of the image. This describes the who, what, and why of what is happening in the image, and may include names of people and/or their role in the action that is taking place. Example: "The president of the Metadata Association delivers the keynote address".
- **`captionWriter`** *[Optional ; Not Repeatable ; String]* <br>
An identifier, or the name, of the person involved in writing, editing, or correcting the description of the image.
- **`keywords`** *[Optional ; Repeatable ; String]* <br>
Keywords (terms or phrases) used to express the subject of the image. Keywords do not have to be taken from a controlled vocabulary.

```json
"keywords": [
  "string"
]
```
- **`sceneCodes`** *[Optional ; Repeatable ; String]* <br>
The sceneCodes describe the scene of the photo content. The IPTC Scene-NewsCodes controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license) should be used, in which a scene is represented as a string of 6 digits.

```json
"sceneCodes": [
  "string"
]
```
| Code   | Label          | Description |
|--------|----------------|-------------|
| 010100 | headshot       | A head only view of a person (or animal/s) or persons as in a montage. |
| 010200 | half-length    | A torso and head view of a person or persons. |
| 010300 | full-length    | A view from head to toe of a person or persons. |
| 010400 | profile        | A view of a person from the side. |
| 010500 | rear view      | A view of a person or persons from the rear. |
| 010600 | single         | A view of only one person, object or animal. |
| 010700 | couple         | A view of two people who are in a personal relationship, for example engaged, married or in a romantic partnership. |
| 010800 | two            | A view of two people. |
| 010900 | group          | A view of more than two people. |
| 011000 | general view   | An overall view of the subject and its surrounds. |
| 011100 | panoramic view | A panoramic or wide angle view of a subject and its surrounds. |
| 011200 | aerial view    | A view taken from above. |
| 011300 | under-water    | A photo taken under water. |
| 011400 | night scene    | A photo taken during darkness. |
| 011500 | satellite      | A photo taken from a satellite in orbit. |
| 011600 | exterior view  | A photo that shows the exterior of a building or other object. |
| 011700 | interior view  | A scene or view of the interior of a building or other object. |
| 011800 | close-up       | A view of, or part of, a person/object taken at close range in order to emphasize detail or accentuate mood. Macro photography. |
| 011900 | action         | Subject in motion, such as children jumping or a horse running. |
| 012000 | performing     | Subject or subjects on a stage performing to an audience. |
| 012100 | posing         | Subject or subjects posing, such as a "victory" pose or other stance that symbolizes leadership. |
| 012200 | symbolic       | A posed picture symbolizing an event - two rings for marriage. |
| 012300 | off-beat       | An attractive, perhaps fun picture of everyday events - dog with sunglasses, people cooling off in the fountain. |
| 012400 | movie scene    | Photos taken during the shooting of a movie or TV production. |


- **`sceneCodesLabelled`** *[Optional ; Repeatable]* <br>

```json
"sceneCodesLabelled": [
  {
    "code": "string",
    "label": "string",
    "description": "string"
  }
]
```

The sceneCodes element described above only allows for the capture of codes. To improve discoverability (by indexing important keywords), not only the scene codes but also the scene labels and descriptions should be provided. The IPTC standard does not provide an element that allows the scene label and description to be entered; sceneCodesLabelled is an element that we added to our schema. Ideally, curators will enter the scene codes in the element sceneCodes to maintain full compatibility with the IPTC, and complement that information by also entering the codes and their descriptions in the sceneCodesLabelled element.

  - **`code`** *[Optional ; Not Repeatable ; String]* <br>
  The code for the scene of the photo content, taken from the IPTC Scene-NewsCodes controlled vocabulary (a string of 6 digits). See table above.
  - **`label`** *[Optional ; Not Repeatable ; String]* <br>
  The label of the scene. See table above for examples.
  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A more detailed description of the scene. See table above for examples.
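For instance, a close-up portrait could be documented by combining the two elements as follows (an illustrative sketch using values from the scene table above):

```json
"sceneCodes": ["011800"],
"sceneCodesLabelled": [
  {
    "code": "011800",
    "label": "close-up",
    "description": "A view of, or part of, a person/object taken at close range in order to emphasize detail."
  }
]
```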
- **`subjectCodes`** *[Optional ; Repeatable ; String]* <br>

```json
"subjectCodes": [
  "string"
]
```

Specifies one or more subjects from the IPTC Subject-NewsCodes controlled vocabulary to categorize the image. Each subject is represented as a string of 8 digits. The vocabulary consists of about 1,400 terms organized into 3 levels (users can decide to use only the first, or the first two, levels; the more detail is provided, the better the discoverability of the image). The first level of the controlled vocabulary is as follows:

| Code     | Label | Description |
|----------|-------|-------------|
| 01000000 | arts, culture and entertainment | Matters pertaining to the advancement and refinement of the human mind, of interests, skills, tastes and emotions. |
| 02000000 | crime, law and justice | Establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules and the punishment of offenders. Organizations and bodies involved in these activities. |
| 03000000 | disaster and accident | Man made and natural events resulting in loss of life or injury to living creatures and/or damage to inanimate objects or property. |
| 04000000 | economy, business and finance | All matters concerning the planning, production and exchange of wealth. |
| 05000000 | education | All aspects of furthering knowledge of human individuals from birth to death. |
| 06000000 | environmental issue | All aspects of protection, damage, and condition of the ecosystem of the planet earth and its surroundings. |
| 07000000 | health | All aspects pertaining to the physical and mental welfare of human beings. |
| 08000000 | human interest | Lighter items about individuals, groups, animals or objects. |
| 09000000 | labor | Social aspects, organizations, rules and conditions affecting the employment of human effort for the generation of wealth or provision of services and the economic support of the unemployed. |
| 10000000 | lifestyle and leisure | Activities undertaken for pleasure, relaxation or recreation outside paid employment, including eating and travel. |
| 11000000 | politics | Local, regional, national and international exercise of power, or struggle for power, and the relationships between governing bodies and states. |
| 12000000 | religion and belief | All aspects of human existence involving theology, philosophy, ethics and spirituality. |
| 13000000 | science and technology | All aspects pertaining to human understanding of nature and the physical world and the development and application of this knowledge. |
| 14000000 | social issue | Aspects of the behavior of humans affecting the quality of life. |
| 15000000 | sport | Competitive exercise involving physical effort. Organizations and bodies involved in these activities. |
| 16000000 | unrest, conflicts and war | Acts of socially or politically motivated protest and/or violence. |
| 17000000 | weather | The study, reporting and prediction of meteorological phenomena. |



As an example of subjects at the three levels, the table below zooms in on the subject "education".

| Code     | Subject | Description |
|----------|---------|-------------|
| 05000000 | education | All aspects of furthering knowledge of human individuals from birth to death. |
| 05001000 | adult education | Education provided for older students outside the usual age groups of 5-25. |
| 05002000 | further education | Any form of education beyond basic education of several levels. |
| 05003000 | parent organization | Groups of parents set up to support schools. |
| 05004000 | preschool | Education for children under the national compulsory education age. |
| 05005000 | school | A building or institution in which education of various sorts is provided. |
| 05005001 | elementary schools | Schools usually of a level from kindergarten through 11 or 12 years of age. |
| 05005002 | middle schools | Transitional school between elementary and high school, 12 through 13 years of age. |
| 05005003 | high schools | Pre-college/university level education, 14 to 17 or 18 years of age, called freshman, sophomore, junior and senior. |
| 05006000 | teachers union | Organization of teachers for collective bargaining and other purposes. |
| 05007000 | university | Institutions of higher learning capable of providing doctorate degrees. |
| 05008000 | upbringing | Lessons learned from parents and others as one grows up. |
| 05009000 | entrance examination | Exams for entering colleges, universities, junior and senior high schools, and all other higher and lower education institutes, including cram schools, which help students prepare for exams for entry to prestigious schools. |
| 05010000 | teaching and learning | Either end of the education equation. |
| 05010001 | students | People of any age in a structured environment, not necessarily a classroom, in order to learn something. |
| 05010002 | teachers | People with knowledge who can impart that knowledge to others. |
| 05010003 | curriculum | The courses offered by a learning institution and the regulation of those courses. |
| 05010004 | test/examination | A measurement of student accomplishment. |
| 05011000 | religious education | Instruction by any faith, in that faith or about other faiths, usually, but not always, conducted in schools run by religious bodies. |
| 05011001 | parochial school | A school run by the Roman Catholic faith. |
| 05011002 | seminary | A school of any faith specifically designed to train ministers. |
| 05011003 | yeshiva | A school for training rabbis. |
| 05011004 | madrasa | A school for teaching Islam. |
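For instance, a photograph of students in a classroom could be categorized as follows (codes taken from the education branch shown above):

```json
"subjectCodes": [
  "05000000",
  "05010001"
]
```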


- **`subjectCodesLabelled`** *[Optional ; Repeatable]* <br>

```json
"subjectCodesLabelled": [
  {
    "code": "string",
    "label": "string",
    "description": "string"
  }
]
```

The subjectCodes element described above only allows for the capture of codes. To improve discoverability (by indexing important keywords), not only the subject codes but also the subject labels and descriptions should be provided. The IPTC standard does not provide an element that allows the subject label and description to be entered; subjectCodesLabelled is an element that we added to our schema. Ideally, curators will enter the subject codes in the element subjectCodes to maintain full compatibility with the IPTC, and complement that information by also entering the codes and their descriptions in the subjectCodesLabelled element.

  - **`code`** *[Optional ; Not Repeatable ; String]* <br>
  A subject code from the IPTC Subject-NewsCodes controlled vocabulary (a string of 8 digits). See examples in the table above.
  - **`label`** *[Optional ; Not Repeatable ; String]* <br>
  The label of the subject. See table above for examples.
  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A more detailed description of the subject. See table above for examples.

- **`creatorNames`** *[Optional ; Repeatable ; String]* <br>

```json
"creatorNames": [
  "string"
]
```

Details about the creator or creators of this image. The image creator must often be attributed in association with any use of the image. The image creator, copyright owner, image supplier, and licensor may be the same or different entities.

- **`creatorContactInfo`** *[Optional ; Not Repeatable]* <br>

```json
"creatorContactInfo": {
  "country": "string",
  "emailwork": "string",
  "region": "string",
  "phonework": "string",
  "weburlwork": "string",
  "address": "string",
  "city": "string",
  "postalCode": "string"
}
```

The creator's contact information provides all necessary information to get in contact with the creator of this image, and comprises a set of elements for proper addressing. Note that if the creator is also the licensor, his or her contact information should be provided in the licensor fields.

  - **`country`** *[Optional ; Not Repeatable ; String]* <br>
  The country name for the address of the person who created this image.
  - **`emailwork`** *[Optional ; Not Repeatable ; String]* <br>
  The work email address(es) for the creator of the image. Multiple email addresses can be given, in which case they should be separated by a comma.
  - **`region`** *[Optional ; Not Repeatable ; String]* <br>
  The state or province for the address of the creator of the image.
  - **`phonework`** *[Optional ; Not Repeatable ; String]* <br>
  The work phone number(s) for the creator of the image. Use the international format, including the country code, such as +1 (123) 456789. Multiple numbers can be given, in which case they should be separated by a comma.
  - **`weburlwork`** *[Optional ; Not Repeatable ; String]* <br>
  The work web address for the creator of the image. Multiple addresses can be given, in which case they should be separated by a comma.
  - **`address`** *[Optional ; Not Repeatable ; String]* <br>
  The address of the creator of the image. This may include a company name.
  - **`city`** *[Optional ; Not Repeatable ; String]* <br>
  The city for the address of the person who created the image.
  - **`postalCode`** *[Optional ; Not Repeatable ; String]* <br>
  The local postal code for the address of the person who created the image.
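A fictitious example (the name and all contact details below are invented for illustration):

```json
"creatorNames": [
  "Jane Doe"
],
"creatorContactInfo": {
  "country": "United States",
  "emailwork": "jane.doe@example.com",
  "region": "NY",
  "phonework": "+1 (123) 456789",
  "weburlwork": "http://example.com",
  "address": "Example Photo Agency, 123 Main Street",
  "city": "New York",
  "postalCode": "10001"
}
```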

- **`creditLine`** *[Optional ; Not Repeatable ; String]* <br>
The credit to person(s) and/or organization(s) required by the supplier of the image when it is published. This is a free-text field.
- **`digitalSourceType`** *[Optional ; Not Repeatable ; String]* <br>
The type of the source of this digital image. One value should be selected from the IPTC controlled vocabulary (published under a Creative Commons Attribution (CC BY) 4.0 license), which contains the following values:

| Type | Source | Description |
|------|--------|-------------|
| digitalCapture | Original digital capture of a real life scene | The digital image is the original and only instance and was taken by a digital camera. |
| negativeFilm | Digitized from a negative on film | The digital image was digitized from a negative on film or any other transparent medium. |
| positiveFilm | Digitized from a positive on film | The digital image was digitized from a positive on a transparency or any other transparent medium. |
| print | Digitized from a print on non-transparent medium | The digital image was digitized from an image printed on a non-transparent medium. |
| softwareImage | Created by software | The digital image was created by computer software. |
- **`jobid`** *[Optional ; Not Repeatable ; String]* <br>
A number or identifier used for improved workflow handling (control or tracking). This is a user-created identifier related to the job for which the image is supplied. Note: as this identifier references a job in the receiver's workflow, it must first be issued by the receiver, then transmitted to the creator or provider of the news object, and finally added by the creator to this field.
- **`jobtitle`** *[Optional ; Not Repeatable ; String]* <br>
The job title of the photographer (the person listed in creatorNames). The use of this element implies that the element creatorNames is not empty.
- **`source`** *[Optional ; Not Repeatable ; String]* <br>
The name of a person or party who has a role in the content supply chain. The source can be different from the creator and from the entities listed in the copyright notice.

- **`locationsShown`** *[Optional ; Repeatable]* <br>

```json
"locationsShown": [
  {
    "name": "string",
    "identifiers": [
      "http://example.com"
    ],
    "worldRegion": "string",
    "countryName": "string",
    "countryCode": "string",
    "provinceState": "string",
    "city": "string",
    "sublocation": "string",
    "gpsAltitude": 0,
    "gpsLatitude": 0,
    "gpsLongitude": 0
  }
]
```

This block of elements is used to document the location shown in the image. This information should be provided with as much detail as possible. The block contains elements that can be used to provide a "nested" description of the location, from a high geographic level (world region) down to a very specific location (city and sub-location within a city).

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The full name of the location.
  - **`identifiers`** *[Optional ; Repeatable ; String]* <br>
  A globally unique identifier of the location shown.
  - **`worldRegion`** *[Optional ; Not Repeatable ; String]* <br>
  The name of a world region. This element is at the first (top) level of the top-down geographical hierarchy.
  - **`countryName`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the country of the location. This element is at the second level of the top-down geographical hierarchy.
  - **`countryCode`** *[Optional ; Not Repeatable ; String]* <br>
  The ISO code of the country mentioned in countryName.
  - **`provinceState`** *[Optional ; Not Repeatable ; String]* <br>
  The name of a sub-region of the country, for example a province or state name. This element is at the third level of the top-down geographical hierarchy.
  - **`city`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the city. This element is at the fourth level of the top-down geographical hierarchy.
  - **`sublocation`** *[Optional ; Not Repeatable ; String]* <br>
  The name of a sublocation within a city, or the name of a well-known location or (natural) monument outside a city. This element is at the fifth (lowest) level of the top-down geographical hierarchy.
  - **`gpsAltitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The altitude in meters of a WGS84-based position of this location.
  - **`gpsLatitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The latitude of a WGS84-based position of this location (in some cases, this information may be contained in the EXIF metadata).
  - **`gpsLongitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The longitude of a WGS84-based position of this location (in some cases, this information may be contained in the EXIF metadata).
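An illustrative example (the location is real but the coordinates are approximate and shown for illustration only):

```json
"locationsShown": [
  {
    "name": "Central Business District, Nairobi",
    "worldRegion": "Africa",
    "countryName": "Kenya",
    "countryCode": "KEN",
    "provinceState": "Nairobi County",
    "city": "Nairobi",
    "sublocation": "Central Business District",
    "gpsLatitude": -1.28,
    "gpsLongitude": 36.82,
    "gpsAltitude": 1700
  }
]
```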

- **`imageRating`** *[Optional ; Not Repeatable ; Numeric]* <br>
The rating of the image by its user or supplier. The value shall be -1 or in the range 0 to 5, where -1 indicates "rejected" and 0 "unrated". If no explicit value is provided, the default value 0 will be assumed.
- **`supplier`** *[Optional ; Repeatable]* <br>

```json
"supplier": [
  {
    "name": "string",
    "identifiers": [
      "http://example.com"
    ]
  }
]
```

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the supplier of the image (person or organization).
  - **`identifiers`** *[Optional ; Repeatable ; String]* <br>
  The identifier for the most recent supplier of this image, which will not necessarily be the creator or the owner of the image.

- **`copyrightNotice`** *[Optional ; Not Repeatable ; String]* <br>
Contains any necessary copyright notice for claiming the intellectual property of the photograph, and should identify the current owner of the copyright of the photograph. Other entities, like the creator of the photograph, may be added in the corresponding field. Notes on usage rights should be provided in "Rights usage terms". Example: ©2008 Jane Doe. If the copyright ownership must be expressed in a more controlled manner, use the fields "Copyright Owner", "Copyright Owner ID", and "Copyright Owner Name" described below instead of the copyrightNotice element.
- **`copyrightOwners`** *[Optional ; Repeatable]* <br>
The owner or owners of the copyright in the licensed image, described in a structured format (as an alternative to the element copyrightNotice described above). This block serves the same purpose of identifying the rights holder(s) for the image. The copyright owner, image creator, and licensor may be the same or different entities.

```json
"copyrightOwners": [
  {
    "name": "string",
    "role": [
      "http://example.com"
    ],
    "identifiers": [
      "http://example.com"
    ]
  }
]
```
  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the owner of the copyright in the licensed image.
  - **`role`** *[Optional ; Repeatable ; String]* <br>
  The role of the entity.
  - **`identifiers`** *[Optional ; Repeatable ; String]* <br>
  The identifier of the owner of the copyright in the licensed image.
- **`usageTerms`** *[Optional ; Not Repeatable ; String]* <br>
The licensing parameters of the image expressed in free-text. Enter instructions on how this image can legally be used. The PLUS fields of the IPTC Extension can be used in parallel to express the licensed usage in more controlled terms.
- **`embdEncRightsExpr`** *[Optional ; Repeatable]* <br>
An Embedded Encoded Rights Expression (EERE) structure, providing the details of a rights expression that is formulated in a rights expression language, encoded as a string, and embedded in the image file.

```json
"embdEncRightsExpr": [
  {
    "encRightsExpr": "string",
    "rightsExprEncType": "string",
    "rightsExprLangId": "http://example.com"
  }
]
```


  - **`encRightsExpr`** *[Optional ; Not Repeatable ; String]* <br>
  The embedded rights expression, formulated in any rights expression language and encoded as a string.
  - **`rightsExprEncType`** *[Optional ; Not Repeatable ; String]* <br>
  The encoding type of the rights expression, identified by an IANA Media Type.
  - **`rightsExprLangId`** *[Optional ; Not Repeatable ; String]* <br>
  An identifier of the rights expression language used by the rights expression. See the IPTC specification: https://www.iptc.org/std/photometadata/specification/IPTC-PhotoMetadata#embedded-encoded-rights-expression-eere-structure

- **`linkedEncRightsExpr`** *[Optional ; Repeatable]* <br>
A link to an encoded rights expression available as a web resource.

```json
"linkedEncRightsExpr": [
  {
    "linkedRightsExpr": "http://example.com",
    "rightsExprEncType": "string",
    "rightsExprLangId": "http://example.com"
  }
]
```


  - **`linkedRightsExpr`** *[Optional ; Not Repeatable ; String]* <br>
  The link to a web resource representing an encoded rights expression.
  - **`rightsExprEncType`** *[Optional ; Not Repeatable ; String]* <br>
  The encoding type of the rights expression, identified by an IANA Media Type.
  - **`rightsExprLangId`** *[Optional ; Not Repeatable ; String]* <br>
  The identifier of the rights expression language used by the rights expression.
- **`webstatementRights`** *[Optional ; Not Repeatable ; String]* <br>
A URL referencing a web resource providing a statement of the copyright ownership and usage rights of the image.
- **`instructions`** *[Optional ; Not Repeatable ; String]* <br>
Any of a number of instructions from the provider or creator to the receiver of the image. These might include: embargoes and other restrictions not covered by the "Rights Usage Terms" field; information regarding the original means of capture (scanning notes, colourspace info) or other specific text information that the user may need for accurate reproduction; additional permissions required when publishing; and credits for publishing if they exceed the IIM length of the credit field.

- **`genres`** *[Optional ; Repeatable]* <br>
Artistic, style, journalistic, product, or other genre(s) of the image, each expressed by a term from a controlled vocabulary.

```json
"genres": [
  {
    "cvId": "http://example.com",
    "cvTermName": "string",
    "cvTermId": "http://example.com",
    "cvTermRefinedAbout": "http://example.com"
  }
]
```


  - **`cvId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the controlled vocabulary the term is from.
  - **`cvTermName`** *[Optional ; Not Repeatable ; String]* <br>
  The natural language name of the term from the controlled vocabulary.
  - **`cvTermId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the term from the controlled vocabulary.
  - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]* <br>
  An optional refinement of the generic 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a controlled vocabulary.
- **`intellectualGenre`** *[Optional ; Not Repeatable ; String]* <br>
A term to describe the nature of the image in terms of its intellectual or journalistic characteristics (for example "actuality", "interview", "background", "feature", "summary", or "wrapup" for journalistic genres, or "daybook", "obituary", "press release", or "transcript" for news-category-related genres). It is advised to use terms from a controlled vocabulary such as the NewsCodes scheme published by the IPTC under a Creative Commons Attribution (CC BY) 4.0 license:

| Genre | Description |
|-------|-------------|
| Actuality | Recording of an event. |
| Advertiser Supplied | Content is supplied by an organization or individual that has paid the news provider for its placement. |
| Advice | Letters and answers about readers' personal problems. |
| Advisory | Recommendation on editorial or technical matters by a provider to its customers. |
| On This Day | List of data, including birthdays of famous people and items of historical significance, for a given day. |
| Analysis | Data and conclusions drawn by a journalist who has conducted in-depth research for a story. |
| Archival material | Material selected from the originator's archive that has been previously distributed. |
| Background | Scene setting and explanation for an event being reported. |
| Behind the Story | The content describes how a story was reported and offers context on the reporting. |
| Biography | Facts and background about a person. |
| Birth Announcement | News of newly born children. |
| Current Events | Content about events taking place at the time of the report. |
| Curtain Raiser | Information about the staging and outcome of an immediately upcoming event. |
| Daybook | Items filed on a regular basis that are lists of upcoming events with time and place, designed to inform others of events for planning purposes. |
| Exclusive | Information content, in any form, that is unique to a specific information provider. |
| Fact Check | The news item looks into the truth or falsehood of another reported news item or assertion (for example, a statement on social media by a public figure). |
| Feature | The object content is about a particular event or individual that may not be significant to the current breaking news. |
| Fixture | The object contains data that occurs often and predictably. |
| Forecast | The object contains opinion as to the outcome of a future event. |
| From the Scene | The object contains a report from the scene of an event. |
| Help us to Report | The news item is a call for readers to provide information that may help journalists to investigate a potential news story. |
| History | The object content is based on previous rather than current events. |
| Horoscope | Astrological forecasts. |
| Interview | The object contains a report of a dialogue with a news source that gives it significant voice (includes Q and A). |
| Listing of facts | Detailed listing of facts related to a topic or a story. |
| Music | The object contains music alone. |
| Obituary | The object contains a narrative about an individual's life and achievements for publication after his or her death. |
| Opinion | The object contains an editorial comment that reflects the views of the author. |
| Polls and Surveys | The object contains numeric or other information produced as a result of questionnaires or interviews. |
| Press Release | The object contains promotional material or information provided to a news organisation. |
| Press-Digest | The object contains an editorial comment by another medium completely or in parts without significant journalistic changes. |
| Profile | The object contains a description of the life or activity of a news subject (often a living individual). |
| Program | A news item giving lists of intended events and times to be covered by the news provider. Each program covers a day, a week, a month or a year. The covered period is referenced as a keyword. |
| Question and Answer Session | The object contains the interviewer and subject questions and answers. |
| Quote | The object contains a one or two sentence verbatim direct quote. |
| Raw Sound | The object contains unedited sounds. |
| Response to a Question | The object contains a reply to a question. |
| Results Listings and Statistics | The object contains alphanumeric data suitable for presentation in tabular form. |
| Retrospective | The object contains material that looks back on a specific (generally long) period of time such as a season, quarter, year or decade. |
| Review | The object contains a critique of a creative activity or service (for example a book, a film or a restaurant). |
| Satire | Uses exaggeration, irony, or humor to make a point; not intended to be understood as factual. |
| Scener | The object contains a description of the event circumstances. |
| Side bar and supporting information | Related story that provides additional context or insight into a news event. |
| Special Report | In-depth examination of a single subject requiring extensive research and usually presented at great length, either as a single item or as a series of items. |
| Sponsored | Content is produced on behalf of an organization or individual that has paid the news provider for production and may approve content publication. |
| Summary | Single item synopsis of a number of generally unrelated news stories. |
| Supported | Content is produced with financial support from an organization or individual, yet not approved by the underwriter before or after publication. |
| Synopsis | The object contains a condensed version of a single news item. |
| Text only | The object contains a transcription of text. |
| Transcript and Verbatim | A word for word report of a discussion or briefing. |
| Update | The object contains an intraday snapshot (as for electronic services) of a single news subject. |
| Voicer | Content is only voice. |
| Wrap | Complete summary of an event. |
| Wrapup | Recap of a running story. |
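For example, a photo taken during a recorded interview could be tagged with the genre term "Interview" from the table above:

```json
"intellectualGenre": "Interview"
```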


- **`artworkOrObjects`** *[Optional ; Repeatable]* <br>
This block provides a set of metadata elements used to describe the object or artwork shown in the image.

```json
"artworkOrObjects": [
  {
    "title": "string",
    "contentDescription": "string",
    "physicalDescription": "string",
    "creatorNames": [
      "string"
    ],
    "creatorIdentifiers": [
      "string"
    ],
    "contributionDescription": "string",
    "stylePeriod": [
      "string"
    ],
    "dateCreated": "2023-04-11T15:06:09Z",
    "circaDateCreated": "string",
    "source": "string",
    "sourceInventoryNr": "string",
    "sourceInventoryUrl": "http://example.com",
    "currentCopyrightOwnerName": "string",
    "currentCopyrightOwnerIdentifier": "http://example.com",
    "copyrightNotice": "string",
    "currentLicensorName": "string",
    "currentLicensorIdentifier": "http://example.com"
  }
]
```


  - **`title`** *[Optional ; Not Repeatable ; String]* <br>
  A human-readable name of the object or artwork shown in the image.
  - **`contentDescription`** *[Optional ; Not Repeatable ; String]* <br>
  A textual description of the content depicted in the object or artwork.
  - **`physicalDescription`** *[Optional ; Not Repeatable ; String]* <br>
  A textual description of the physical characteristics of the artwork or object, without reference to the content depicted. This would be used to describe the object type, materials, techniques, and measurements.
  - **`creatorNames`** *[Optional ; Repeatable ; String]* <br>
  The name of the person(s) (possibly an organization) who created the object or artwork shown in the image.
  - **`creatorIdentifiers`** *[Optional ; Repeatable ; String]* <br>
  One or multiple globally unique identifier(s) of the artist who created the artwork or object shown in the image. This could be an identifier issued by an online registry of persons or companies. Make sure to enter these identifiers in the exact same sequence as the names entered in the field creatorNames.
  - **`contributionDescription`** *[Optional ; Not Repeatable ; String]* <br>
  A description of any contributions made to the artwork or object. It should include the type, date, and location of the contribution, and details about the contributor.
  - **`stylePeriod`** *[Optional ; Repeatable ; String]* <br>
  The style, historical or artistic period, movement, group, or school whose characteristics are represented in the artwork or object. It is advised to take the terms from a controlled vocabulary.
  - **`dateCreated`** *[Optional ; Not Repeatable ; String]* <br>
  The date and, optionally, the time the artwork or object shown in the image was created.
  - **`circaDateCreated`** *[Optional ; Not Repeatable ; String]* <br>
  The approximate date or range of dates associated with the creation and production of the artwork or object or its components.
  - **`source`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the organization or body holding and registering the artwork or object in this image for inventory purposes.
  - **`sourceInventoryNr`** *[Optional ; Not Repeatable ; String]* <br>
  The inventory number issued by the organization or body holding and registering the artwork or object in the image.
  - **`sourceInventoryUrl`** *[Optional ; Not Repeatable ; String]* <br>
  A reference URL for the metadata record of the inventory maintained by the source.
  - **`currentCopyrightOwnerName`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the current owner of the copyright of the artwork or object.
  - **`currentCopyrightOwnerIdentifier`** *[Optional ; Not Repeatable ; String]* <br>
  A globally unique identifier for the current copyright owner, e.g., issued by an online registry of persons or companies.
  - **`copyrightNotice`** *[Optional ; Not Repeatable ; String]* <br>
  Any necessary copyright notice for claiming the intellectual property of the artwork or object in the image; it should identify the current owner of the copyright of this work with associated intellectual property rights.
  - **`currentLicensorName`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the current licensor of the artwork or object.
  - **`currentLicensorIdentifier`** *[Optional ; Not Repeatable ; String]* <br>
  A globally unique identifier for the current licensor, e.g., issued by an online registry of persons or companies.

- **`personInImageNames`** *[Optional ; Repeatable ; String]* <br>

```json
"personInImageNames": [
  "string"
]
```

This repeatable element is used to list the name(s) of the person(s) shown in the image.

- **`personsShown`** *[Optional ; Repeatable]* <br>
Details about the person(s) shown in the image. It is not required to list all persons; only provide those details which can be recognized.

```json
"personsShown": [
  {
    "name": "string",
    "description": "string",
    "identifiers": [
      "http://example.com"
    ],
    "characteristics": [
      {
        "cvId": "http://example.com",
        "cvTermName": "string",
        "cvTermId": "http://example.com",
        "cvTermRefinedAbout": "http://example.com"
      }
    ]
  }
]
```


  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of a person shown in the image.
  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A textual description of the person. For example, you may include actions taken, emotional expressions shown, and more.
  - **`identifiers`** *[Optional ; Repeatable ; String]* <br>
  Globally unique identifiers of the person, such as those from WikiData.
  - **`characteristics`** *[Optional ; Repeatable]* <br>
  A property or trait of the person, provided as a term selected from a controlled vocabulary.
    - **`cvId`** *[Optional ; Not Repeatable ; String]* <br>
    The globally unique identifier of the controlled vocabulary the term is from.
    - **`cvTermName`** *[Optional ; Not Repeatable ; String]* <br>
    The natural language name of the term from the controlled vocabulary.
    - **`cvTermId`** *[Optional ; Not Repeatable ; String]* <br>
    The globally unique identifier of the term from the controlled vocabulary.
    - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]* <br>
    An optional refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a controlled vocabulary.
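An illustrative (fictitious) example:

```json
"personsShown": [
  {
    "name": "Jane Doe",
    "description": "Keynote speaker, standing at the podium",
    "identifiers": [
      "http://example.com/persons/jane-doe"
    ]
  }
]
```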
- **`modelAges`** *[Optional ; Repeatable ; Numeric]* <br>

```json
"modelAges": [
  0
]
```

The age(s) of the human model(s) at the time the image was taken. Be aware of any legal implications of providing ages for young models. Ages below 18 years should not be included.

- **`additionalModelInfo`** *[Optional ; Not Repeatable ; String]* <br>
Information about other facets of the model(s).
- **`minorModelAgeDisclosure`** *[Optional ; Not Repeatable ; String]* <br>
The age of the youngest model pictured in the image, at the time the image was created. This information is not intended to be displayed publicly; it is intended to be used as a filter for the inclusion/exclusion of images in catalogs and dissemination processes.
- **`modelReleaseDocuments`** *[Optional ; Repeatable ; String]* <br>

```json
"modelReleaseDocuments": [
  "string"
]
```

The identifier(s) associated with each model release.

- **`modelReleaseStatus`** *[Optional ; Not Repeatable]* <br>

```json
"modelReleaseStatus": {
  "cvId": "http://example.com",
  "cvTermName": "string",
  "cvTermId": "http://example.com",
  "cvTermRefinedAbout": "http://example.com"
}
```

  - **`cvId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the controlled vocabulary the term is from.
  - **`cvTermName`** *[Optional ; Not Repeatable ; String]* <br>
  The natural language name of the term from the controlled vocabulary.
  - **`cvTermId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the term from the controlled vocabulary.
  - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]* <br>
  An optional refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a controlled vocabulary.

- **`organisationInImageCodes`** *[Optional ; Repeatable ; String]* <br>

```json
"organisationInImageCodes": [
  "string"
]
```

A code, extracted from a controlled vocabulary, used to identify the organization or company featured in the image. For example, a stock ticker symbol may be used. Enter an identifier for the controlled vocabulary, then a colon, and finally the code from the vocabulary assigned to the organization (e.g., nasdaq:companyA).

- **`organisationInImageNames`** *[Optional ; Repeatable ; String]* <br>

```json
"organisationInImageNames": [
  "string"
]
```

The name(s) of the organization(s) or company(ies) featured in the image.

- **`productsShown`** *[Optional ; Repeatable]* <br>
Details about a product shown in the image.

```json
"productsShown": [
  {
    "description": "string",
    "gtin": "string",
    "name": "string"
  }
]
```


  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A textual description of the product.
  - **`gtin`** *[Optional ; Not Repeatable ; String]* <br>
  The Global Trade Item Number (GTIN) of the product (GTIN-8 to GTIN-14 codes can be used).
  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the product.
- **`maxAvailHeight`** *[Optional ; Not Repeatable ; Numeric]* <br>
The maximum available height in pixels of the original photo from which this photo has been derived by downsizing.
- **`maxAvailWidth`** *[Optional ; Not Repeatable ; Numeric]* <br>
The maximum available width in pixels of the original photo from which this photo has been derived by downsizing.

- **`propertyReleaseStatus`** *[Optional ; Not Repeatable]* <br>

```json
"propertyReleaseStatus": {
  "cvId": "http://example.com",
  "cvTermName": "string",
  "cvTermId": "http://example.com",
  "cvTermRefinedAbout": "http://example.com"
}
```

This element summarizes the availability and scope of property releases authorizing the usage of the properties appearing in the photograph. One value should be selected from a controlled vocabulary. It is recommended to apply the value PR-UPR very carefully and to check the wording of the property release thoroughly before applying it.

  - **`cvId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the controlled vocabulary the term is from.
  - **`cvTermName`** *[Optional ; Not Repeatable ; String]* <br>
  The natural language name of the term from the controlled vocabulary.
  - **`cvTermId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the term from the controlled vocabulary.
  - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]* <br>
  An optional refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a controlled vocabulary.

- **`propertyReleaseDocuments`** *[Optional ; Repeatable ; String]* <br>

```json
"propertyReleaseDocuments": [
  "string"
]
```

Optional identifier(s) associated with each property release.

- **`aboutCvTerms`** *[Optional ; Repeatable]* <br>

```json
"aboutCvTerms": [
  {
    "cvId": "http://example.com",
    "cvTermName": "string",
    "cvTermId": "http://example.com",
    "cvTermRefinedAbout": "http://example.com"
  }
]
```

One or more topics, themes, or entities the content is about, each one expressed by a term from a controlled vocabulary.

  - **`cvId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the controlled vocabulary the term is from.
  - **`cvTermName`** *[Optional ; Not Repeatable ; String]* <br>
  The natural language name of the term from the controlled vocabulary.
  - **`cvTermId`** *[Optional ; Not Repeatable ; String]* <br>
  The globally unique identifier of the term from the controlled vocabulary.
  - **`cvTermRefinedAbout`** *[Optional ; Not Repeatable ; String]* <br>
  An optional refinement of the 'about' relationship of the term with the content of the image. This must be a globally unique identifier from a controlled vocabulary.

The IPTC elements are followed by a small set of common elements: see license, tags, and album in the section "Additional elements" below.

### 10.2.3 Dublin Core option

We introduced the Dublin Core Metadata Initiative (DCMI) specification in chapter 3 - Documents. It contains 15 core elements, which are generic and versatile enough to be used for documenting different types of resources. Other elements can be added to the specification to increase its relevance for specific uses. In the schema we recommend for the documentation of publications, we added elements inspired by the MARC 21 standard. We take a similar approach for the use of the Dublin Core for documenting images, adding to the 15 core elements a set of elements inspired by the ImageObject schema from schema.org.

The fifteen elements, with their definitions extracted from the Dublin Core website, are the following:

| Element name | Description |
|--------------|-------------|
| identifier   | An unambiguous reference to the resource within a given context. |
| type         | The nature or genre of the resource. |
| title        | A name given to the resource. |
| description  | An account of the resource. |
| subject      | The topic of the resource. |
| creator      | An entity primarily responsible for making the resource. |
| contributor  | An entity responsible for making contributions to the resource. |
| publisher    | An entity responsible for making the resource available. |
| date         | A point or period of time associated with an event in the life cycle of the resource. |
| coverage     | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. |
| format       | The file format, physical medium, or dimensions of the resource. |
| language     | A language of the resource. |
| relation     | A related resource. |
| rights       | Information about rights held in and over the resource. |
| source       | A related resource from which the described resource is derived. |

We do not use the identifier element, as we already have a unique identifier in the common element idno.

We added the following elements to the schema, which are not part of the core list of the DCMI:

- identifiers
- caption
- keywords
- topics
- country
- gps (latitude, longitude, altitude)
- note

The common additional elements license, album, and tags also complement the DCMI metadata (see the section "Additional elements" below).

We describe below how the DCMI elements are used to document images.

**`dcmi`** *[Optional ; Not Repeatable]* <br>
Users of the schema will choose either IPTC or Dublin Core (DCMI), not both, to document their images. If the choice is DCMI, the elements under dcmi will be used.

```json
"dcmi": {
  "type": "image",
  "title": "string",
  "caption": "string",
  "description": "string",
  "topics": [],
  "keywords": [],
  "creator": "string",
  "contributor": "string",
  "publisher": "string",
  "date": "string",
  "country": [],
  "coverage": "string",
  "gps": {},
  "format": "string",
  "languages": [],
  "relations": [],
  "rights": "string",
  "source": "string",
  "note": "string"
}
```


+
    +
  • type [Required, Not Repeatable, String]
    +The Dublin Core schema is flexible and versatile, and can be used to document different types of resources. This element is used to document the type of resource being documented. The DCMI provides a list of suggested categories, including “image” which is the relevant type to be entered here. Some users may want to be more specific in the description of the type of resource, for example distinguishing color from black & white images. This distinction should not be made in this element; another element can be used for such purpose (like tags and tag groups).

  • +
  • title [Optional, Not Repeatable, String]
    +The title of the photo.

  • +
  • caption [Optional, Not Repeatable, String]
    +A caption for the photo.

  • +
  • description [Optional, Not Repeatable, String]
    +A brief description of the content depicted in the image. This element will typically provide more detailed information than the title or caption. Note that other elements can be used to provide a more specific and “itemized” description of an image; the element keywords for example can be used to list labels associated with an image (possibly generated in an automated manner using machine learning tools).

  • +
  • topics [Optional ; Repeatable]
    +The topics field indicates the broad substantive topic(s) that the image represents. A topic classification facilitates referencing and searches in electronic survey catalogs. Topics should be selected from a standard controlled vocabulary such as the Council of European Social Science Data Archives (CESSDA) thesaurus.
    +

```json
"topics": [
  {
    "id": "string",
    "name": "string",
    "parent_id": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
```
  - **`id`** *[Optional ; Not Repeatable ; String]* <br>
  The unique identifier of the topic. It can be a sequential number or the identifier of the topic in a controlled vocabulary.
  - **`name`** *[Required ; Not Repeatable ; String]* <br>
  The label of the topic associated with the data.
  - **`parent_id`** *[Optional ; Not Repeatable ; String]* <br>
  When a hierarchical (nested) controlled vocabulary is used, the parent_id field can be used to indicate a higher-level topic to which this topic belongs.
  - **`vocabulary`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the controlled vocabulary used, if any.
  - **`uri`** *[Optional ; Not Repeatable ; String]* <br>
  A link to the controlled vocabulary mentioned in the field vocabulary.
- **`keywords`** *[Optional ; Repeatable]* <br>
Words or phrases that describe salient aspects of the image content. Keywords can be used for building keyword indexes and for classification and retrieval purposes. A controlled vocabulary can be employed; keywords should preferably be selected from a standard thesaurus, ideally an international, multilingual one.

```json
"keywords": [
  {
    "name": "string",
    "vocabulary": "string",
    "uri": "string"
  }
]
```

  - **`name`** *[Required ; Not Repeatable ; String]* <br>
  The keyword (or phrase). Keywords summarize the content or subject matter of the image.
  - **`vocabulary`** *[Optional ; Not Repeatable ; String]* <br>
  The controlled vocabulary from which the keyword is extracted, if any.
  - **`uri`** *[Optional ; Not Repeatable ; String]* <br>
  The URI of the controlled vocabulary used, if any.
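A hypothetical example (the topic identifier, label, and vocabulary reference are shown for illustration only):

```json
"topics": [
  {
    "id": "05",
    "name": "Education",
    "vocabulary": "CESSDA topics classification"
  }
],
"keywords": [
  { "name": "school" },
  { "name": "classroom" }
]
```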

- **`creator`** *[Optional ; Not Repeatable ; String]* <br>
The name of the person (or organization) who took the photo or created the image.
- **`contributor`** *[Optional ; Not Repeatable ; String]* <br>
A contributor to the image; this could be a person or an organization, possibly a sponsoring organization.
- **`publisher`** *[Optional ; Not Repeatable ; String]* <br>
The person or organization who publishes the image.
- **`date`** *[Optional ; Not Repeatable ; String]* <br>
The date when the photo was taken / the image was created, preferably entered in ISO 8601 format.
- **`country`** *[Optional ; Repeatable]* <br>

```json
"country": [
  {
    "name": "string",
    "code": "string"
  }
]
```
  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the country/economy where the photo was taken.
  - **`code`** *[Optional ; Not Repeatable ; String]* <br>
  The code of the country/economy mentioned in name, preferably the ISO country code.
- **`coverage`** *[Optional ; Not Repeatable ; String]* <br>
In the Dublin Core, coverage can be either temporal or geographic. In this schema, coverage is used to document the geographic coverage of the image. This element complements the country element and allows more specific information to be provided.
- **`gps`** *[Optional ; Not Repeatable]* <br>
The geographic location where the photo was taken. Some digital cameras equipped with GPS can, when the option is activated, capture and store in the EXIF metadata the exact geographic location where the photo was taken.

```json
"gps": {
  "latitude": -90,
  "longitude": -180,
  "altitude": 0
}
```
  - **`latitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The latitude of the geographic location where the photo was taken.
  - **`longitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The longitude of the geographic location where the photo was taken.
  - **`altitude`** *[Optional ; Not Repeatable ; Numeric]* <br>
  The altitude of the geographic location where the photo was taken.
- **`format`** *[Optional ; Not Repeatable ; String]* <br>
The image file format, typically expressed using a MIME type.

- **`languages`** *[Optional ; Repeatable]* <br>
The language(s) in which the image metadata (caption, title) are provided. This is a block of two elements (at least one must be provided for each language).

```json
"languages": [
  {
    "name": "string",
    "code": "string"
  }
]
```

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the language.
  - **`code`** *[Optional ; Not Repeatable ; String]* <br>
  The code of the language. The use of ISO 639-2 (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.
- **`relations`** *[Optional ; Repeatable]* <br>
A list of resources (images or resources of other types) related to the image being documented.

```json
"relations": [
  {
    "name": "string",
    "type": "isPartOf",
    "uri": "string"
  }
]
```
  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name (title) of the related resource.
  - **`type`** *[Optional ; Not Repeatable ; String]* <br>
  A brief description of the type of relation. A controlled vocabulary could be used.
  - **`uri`** *[Optional ; Not Repeatable ; String]* <br>
  A link to the related resource being described.
- **`rights`** *[Optional ; Not Repeatable ; String]* <br>
The copyright statement for the photograph. The license is documented in another (common) element.
- **`source`** *[Optional ; Not Repeatable ; String]* <br>
A related resource from which the described image is derived.
- **`note`** *[Optional ; Not Repeatable ; String]* <br>
Any additional information on the image not captured in one of the other metadata elements.
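Bringing some of these elements together, a minimal (fictitious) example of the dcmi block could look as follows:

```json
"dcmi": {
  "type": "image",
  "title": "Students in a classroom",
  "caption": "Students attending a mathematics class in a public primary school",
  "creator": "Jane Doe",
  "date": "2021-02-22",
  "country": [
    { "name": "Kenya", "code": "KEN" }
  ],
  "format": "image/jpeg",
  "languages": [
    { "name": "English", "code": "eng" }
  ],
  "rights": "© 2021 Example Photo Agency"
}
```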
### 10.2.4 Additional elements (IPTC and DCMI)

A few additional elements are included in the image_description section of the schema. They apply both to the IPTC and to the DCMI options.

- **`license`** *[Optional ; Repeatable]* <br>
The license under which the image is published.

```json
"license": [
  {
    "name": "string",
    "uri": "string"
  }
]
```

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  The name of the license.
  - **`uri`** *[Optional ; Not Repeatable ; String]* <br>
  A URL where detailed information on the license / terms of use can be found.

- **`album`** *[Optional ; Repeatable]* <br>
If your catalog contains many images, you will likely want to group them by album. Albums are collections of images organized by theme, period, location, photographer, or other criteria. One image can belong to more than one album. Albums are thus "virtual collections".

```json
"album": [
  {
    "name": "string",
    "description": "string",
    "owner": "string",
    "uri": "string"
  }
]
```

  - **`name`** *[Optional ; Not Repeatable ; String]* <br>
  A short name (label) given to the album.
  - **`description`** *[Optional ; Not Repeatable ; String]* <br>
  A brief description of the album.
  - **`owner`** *[Optional ; Not Repeatable ; String]* <br>
  The identification of the owner/custodian of the album. This can be the name of a person or an organization.
  - **`uri`** *[Optional ; Not Repeatable ; String]* <br>
  A URL for the album.
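For example (a fictitious album):

```json
"album": [
  {
    "name": "Education facilities",
    "description": "Photos of schools and classrooms",
    "owner": "Example Photo Library"
  }
]
```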

- **`provenance`** *[Optional ; Repeatable]* <br>
Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata and on alterations that may have been made to the harvested metadata.

```json
"provenance": [
  {
    "origin_description": {
      "harvest_date": "string",
      "altered": true,
      "base_url": "string",
      "identifier": "string",
      "date_stamp": "string",
      "metadata_namespace": "string"
    }
  }
]
```

  - **`origin_description`** *[Required ; Not Repeatable]* <br>
  The origin_description elements are used to describe when and from where the metadata were extracted or harvested.
    - **`harvest_date`** *[Required ; Not Repeatable ; String]* <br>
    The date and time the metadata were harvested, entered in ISO 8601 format.
    - **`altered`** *[Optional ; Not Repeatable ; Boolean]* <br>
    A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata were modified before being re-published. In many cases, the unique identifier of the entry (element idno) will be modified when the metadata are re-published in a new catalog.
    - **`base_url`** *[Required ; Not Repeatable ; String]* <br>
    The URL from which the metadata were harvested.
    - **`identifier`** *[Optional ; Not Repeatable ; String]* <br>
    The unique identifier (idno element) of the entry in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
    - **`date_stamp`** *[Optional ; Not Repeatable ; String]* <br>
    The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
    - **`metadata_namespace`** *[Optional ; Not Repeatable ; String]* <br>
    The namespace (often a URI) of the metadata standard or schema used to describe the record in the source catalog.
- **tags** [Optional ; Repeatable]
  As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R.

"tags": [
    {
        "tag": "string",
        "tag_group": "string"
    }
]

  - **tag** [Required ; Not repeatable ; String]
    A user-defined tag.

  - **tag_group** [Optional ; Not repeatable ; String]
    A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
### 10.2.5 LDA topics

**lda_topics** [Optional ; Not repeatable]

"lda_topics": [
    {
        "model_info": [
            {
                "source": "string",
                "author": "string",
                "version": "string",
                "model_id": "string",
                "nb_topics": 0,
                "description": "string",
                "corpus": "string",
                "uri": "string"
            }
        ],
        "topic_description": [
            {
                "topic_id": null,
                "topic_score": null,
                "topic_label": "string",
                "topic_words": [
                    {
                        "word": "string",
                        "word_weight": 0
                    }
                ]
            }
        ]
    }
]

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

Once an LDA topic model has been trained, it can be used to infer the topic composition of any document. This inference will then provide the share that each topic represents in the document. The sum of all represented topics is 1 (100%).

The metadata element lda_topics is provided to allow data curators to store information on the inferred topic composition of the documents listed in a catalog. Sub-elements are provided to describe the topic model, and the topic composition.

The lda_topics element includes the following metadata fields:

+
    +
  • model_info [Optional ; Not repeatable]
    +Information on the LDA model.

    +
      +
    • source [Optional ; Not repeatable ; String]
      +The source of the model (typically, an organization).
    • +
    • author [Optional ; Not repeatable ; String]
      +The author(s) of the model.
    • +
    • version [Optional ; Not repeatable ; String]
      +The version of the model, which could be defined by a date or a number.
    • +
    • model_id [Optional ; Not repeatable ; String]
      +The unique ID given to the model.
    • +
    • nb_topics [Optional ; Not repeatable ; Numeric]
      +The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    • +
    • description [Optional ; Not repeatable ; String]
      +A brief description of the model.
    • +
    • corpus [Optional ; Not repeatable ; String]
      +A brief description of the corpus on which the LDA model was trained.
    • +
    • uri [Optional ; Not repeatable ; String]
      +A link to a web page where additional information on the model is available.

    • +
  • +
  • topic_description [Optional ; Repeatable]
    +The topic composition of the document.

    +
      +
    • topic_id [Optional ; Not repeatable ; String]
      +The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    • +
    • topic_score [Optional ; Not repeatable ; Numeric]
      +The share of the topic in the document (%).
    • +
    • topic_label [Optional ; Not repeatable ; String]
      +The label of the topic, if any (not automatically generated by the LDA model).
    • +
    • topic_words [Optional ; Not repeatable]
      +The list of N keywords describing the topic (e.g., the top 5 words).
      +
        +
      • word [Optional ; Not repeatable ; String]
        +The word.
      • +
      • word_weight [Optional ; Not repeatable ; Numeric]
        +The weight of the word in the definition of the topic. This is specific to the model, not to a document.
      • +
    • +
  • +
lda_topics = list(
  
   list(
  
      model_info = list(
        list(source      = "World Bank, Development Data Group",
             author      = "A.S.",
             version     = "2021-06-22",
             model_id    = "Mallet_WB_75",
             nb_topics   = 75,
             description = "LDA model, 75 topics, trained on Mallet",
             corpus      = "World Bank Documents and Reports (1950-2021)",
             uri         = "")
      ),
      
      topic_description = list(
      
        list(topic_id    = "topic_27",
             topic_score = 32,
             topic_label = "Education",
             topic_words = list(list(word = "school",      word_weight = ""),
                                list(word = "teacher",     word_weight = ""),
                                list(word = "student",     word_weight = ""),
                                list(word = "education",   word_weight = ""),
                                list(word = "grade",       word_weight = ""))),
        
        list(topic_id    = "topic_8",
             topic_score = 24,
             topic_label = "Gender",
             topic_words = list(list(word = "women",       word_weight = ""),
                                list(word = "gender",      word_weight = ""),
                                list(word = "man",         word_weight = ""),
                                list(word = "female",      word_weight = ""),
                                list(word = "male",        word_weight = ""))),
        
        list(topic_id    = "topic_39",
             topic_score = 22,
             topic_label = "Forced displacement",
             topic_words = list(list(word = "refugee",     word_weight = ""),
                                list(word = "programme",   word_weight = ""),
                                list(word = "country",     word_weight = ""),
                                list(word = "migration",   word_weight = ""),
                                list(word = "migrant",     word_weight = ""))),
                                
        list(topic_id    = "topic_40",
             topic_score = 11,
             topic_label = "Development policies",
             topic_words = list(list(word = "development", word_weight = ""),
                                list(word = "policy",      word_weight = ""),
                                list(word = "national",    word_weight = ""),
                                list(word = "strategy",    word_weight = ""),
                                list(word = "activity",    word_weight = "")))
                                
      )
      
   )
   
)

The information provided by LDA models can be used to build a “filter by topic composition” tool in a catalog, to help identify documents based on a combination of topics, allowing users to set minimum thresholds on the share of each selected topic.
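As an illustration of the logic of such a filter, the sketch below selects documents from a hypothetical data frame of topic shares; all names and values are illustrative, not taken from an actual catalog.

```r
# Illustrative sketch of a "filter by topic composition" query.
# The data frame below is hypothetical: one row per document, with topic
# shares (in %) inferred by an LDA model for two topics of interest.

catalog <- data.frame(
  id       = c("doc_001", "doc_002", "doc_003"),
  topic_27 = c(32, 5, 41),   # share of the "Education" topic (%)
  topic_8  = c(24, 60, 2)    # share of the "Gender" topic (%)
)

# Select documents where "Education" >= 25% and "Gender" >= 10%
selected <- catalog[catalog$topic_27 >= 25 & catalog$topic_8 >= 10, ]
selected$id
```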


### 10.2.6 Embeddings

**embeddings** [Optional ; Repeatable]

In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API). These vector representations can be used to identify semantically close documents, by calculating the distance between vectors and identifying the closest ones, as shown in the example below.
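The following sketch illustrates this idea with toy vectors; the values and document identifiers are purely hypothetical, and real embedding vectors would have hundreds of dimensions.

```r
# Toy illustration: ranking documents by semantic similarity to a query,
# using cosine similarity between embedding vectors. All vectors shown here
# are hypothetical; real embedding vectors have 100+ dimensions.

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

query_vector <- c(0.12, -0.40, 0.33, 0.05)

document_vectors <- list(
  doc_A = c(0.10, -0.38, 0.30, 0.07),
  doc_B = c(-0.50, 0.22, -0.11, 0.41),
  doc_C = c(0.15, -0.35, 0.28, 0.02)
)

# Similarity of each document to the query, most similar first
scores <- sapply(document_vectors, cosine_similarity, b = query_vector)
sort(scores, decreasing = TRUE)
```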

The word vectors do not have to be stored in the document metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one and the same model.

"embeddings": [
    {
        "id": "string",
        "description": "string",
        "date": "string",
        "vector": null
    }
]

The embeddings element contains four metadata fields:

- **id** [Optional ; Not repeatable ; String]
  A unique identifier of the word embedding model used to generate the vector.

- **description** [Optional ; Not repeatable ; String]
  A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.

- **date** [Optional ; Not repeatable ; String]
  The date the model was trained (or a version date for the model).

- **vector** [Required ; Not repeatable ; Object] @@@@@@@@ do not offer options
  The numeric vector representing the document, provided as an object (array or string), for example: [1,4,3,5,7,9]

- **additional** [Optional ; Not repeatable]
  The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail. A brief sketch is provided below.
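As a purely illustrative sketch, custom elements could be added as follows (the element names photographer_equipment and internal_review_status are hypothetical, not part of the schema):

```r
# Hypothetical custom elements added in the "additional" block

my_image <- list(
  # ... core schema elements (idno, image_description, etc.) ...
  additional = list(
    photographer_equipment = "Digital SLR, 50mm lens",
    internal_review_status = "approved"
  )
)
```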

## 10.3 Examples

The examples below use the image schema to document and publish an image, and the external resources schema to publish links to the image files.
### 10.3.1 Example 1 - Using the IPTC option

We selected an image from the World Bank Flickr collection. The image is available at https://www.flickr.com/photos/worldbank/8120361619/in/album-72157648790716931/. Some metadata is provided with the photo.

[Screenshot: the image and its metadata as displayed on Flickr]

The image is made available in multiple formats. We assume that we want to only provide access to the small, medium, and original versions of the image available in our NADA catalog. We also assume that instead of uploading the images to our catalog server to make them available directly from our catalog, we want to provide links to the images in the source repository (Flickr in this case).

**Using R**
library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_images/")
+# Download image files from Flickr (different resolutions)
+
+download.file("https://live.staticflickr.com/4858/31953178928_77e4d7abae_o_d.jpg", 
+              destfile = "img_001_original.jpg", mode = "wb")
+
+download.file("https://live.staticflickr.com/4858/31953178928_44abb01418_w_d.jpg", 
+              destfile = "img_001_small.jpg", mode = "wb")
+
+# Generate image metadata (using the IPTC metadata elements)
+
+my_image <- list(
+  
+  metadata_information = list(
+    
+    producers = list(name = "OD"),
+    
+    production_date = "2022-01-10"
+    
+  ),
+  
+  idno = "image_001",
+  
+  image_description = list(
+    
+    iptc = list(
+      
+      photoVideoMetadataIPTC = list(
+        
+        title = "Man fetching water, Afghanistan",
+        
+        imageSupplierImageId = "Image_001",
+        
+        headline = "Residents get water",
+        
+        dateCreated = "2008-09-20T00:00:00Z",
+        
+        creatorNames = list("Sofie Tesson, Taimani Films"),
+        
+        description = "View of villagers, getting some water. 
+                       World Bank Emergency Horticulture and Livestock Project",
+        
+        digitalImageGuid = "72157648790716931",
+        
+        locationsShown = list(
+          list(countryCode = "AFG", countryName = "Afghanistan")
+        ),
+        
+        keywords = list("Water and sanitation"),
+        
+        sceneCodes = list("010600", "011000", "011100", "011900"),
+        
+        sceneCodesLabelled = list(
+          
+          list(code = "010600", 
+               label = "single",
+               description = "A view of only one person, object or animal."),
+          
+          list(code = "011000", 
+               label = "general view",
+               description = "An overall view of the subject and its surrounds"),
+
+          list(code = "011100", 
+               label = "panoramic view",
+               description = "A panoramic or wide angle view of a subject and its surrounds"),
+          
+          list(code = "011900", 
+               label = "action",
+               description = "Subject in motion")
+          
+        ),
+          
+        subjectCodes = list("06000000", "09000000", "14000000"),
+        
+        subjectCodesLabelled = list(
+          
+          list(code = "06000000", 
+               label = "environmental issue",
+               description = "All aspects of protection, damage, and condition of the ecosystem of the planet earth and its surroundings."),
+          
+          list(code = "09000000", 
+               label = "labor",
+               description = "Social aspects, organizations, rules and conditions affecting the employment of human effort for the generation of wealth or provision of services and the economic support of the unemployed."),
+          
+          list(code = "14000000", 
+               label = "social issue",
+               description = "Aspects of the behavior of humans affecting the quality of life.")
+          
+        ),
+          
+        source = "World Bank",
+        
+        supplier = list(
+          list(name = "World Bank")
+        )
+      
+      )
+      
+    ),
+
+    license = list(
+      list(name = "Attribution 2.0 Generic (CC BY 2.0)", 
+           uri = "https://creativecommons.org/licenses/by/2.0/")
+    ),
+    
+    album = list(
+      list(name = "World Bank Projects in Afghanistan")
+    )
+    
+  )
+  
+)
+
+# Publish the image metadata in the NADA catalog
+
+image_add(idno = "image_001", 
+          metadata = my_image,
+          repositoryid = "central",
+          overwrite = "yes", 
+          published = 1,
+          thumbnail = "img_001_small.jpg")  # file used as thumbnail in the catalog
+
+# Provide a link to the images in the originating repository, and upload files
+# (uploading files will make them available directly from the NADA catalog)
+
+external_resources_add(
+  idno = "image_001",
+  dctype = "pic",
+  title = "Man fetching water, Afghanistan (Flickr link)",
+  file_path = "https://www.flickr.com/photos/water_alternatives/31953178928/in/photolist-QFAoS5",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  idno = "image_001",
+  dctype = "pic",
+  title = "Man fetching water, Afghanistan (original size)",
+  file_path = "img_001_original.jpg",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  idno = "image_001",
+  dctype = "pic",
+  title = "Man fetching water, Afghanistan (small size)",
+  file_path = "img_001_small.jpg",
+  overwrite = "yes"
+)
+



**Result in NADA**

The metadata, links, and images will be displayed in NADA.

[Screenshot: the image page in the NADA catalog]



Different views (mosaic, list, page views) are available. If the metadata contains a GPS location, a map showing the exact location where the photo was taken will also be displayed on the image page.

[Screenshot: alternative views of the image collection in NADA]



+

**Using Python**

# Python script

### 10.3.2 Example 2 - Using the DCMI option

We document the same image as in Example 1.

**Using R**
library(nadar)
+
+# ----------------------------------------------------------------------------------
+# Enter credentials (API confidential key) and catalog URL
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:/my_images/")
+# Download image files from Flickr (different resolutions)
+
+download.file("https://live.staticflickr.com/4858/31953178928_77e4d7abae_o_d.jpg", 
+              destfile = "img_001_original.jpg", mode = "wb")
+
+download.file("https://live.staticflickr.com/4858/31953178928_44abb01418_w_d.jpg", 
+              destfile = "img_001_small.jpg", mode = "wb")
+
+# Generate image metadata (using the DCMI metadata elements)
+
+pic_desc <- list(
+  
+  metadata_information = list(
+    
+    producers = list(name = "OD"),
+    
+    production_date = "2022-01-10"
+
+  ),
+  
+  idno = "image_001",
+  
+  image_description = list(
+    
+    dcmi = list(
+      
+      identifier = "72157648790716931",
+      
+      type = "image",
+      
+      title = "Man fetching water, Afghanistan",
+      
+      caption = "Residents get water",
+      
+      description = "View of villagers, getting some water.
+                     World Bank Emergency Horticulture and Livestock Project",
+      
+      subject = "",
+      
+      topics = list(),
+      
+      keywords = list(
+        list(name = "water and sanitation")
+      ),
+      
+      creator = "Sofie Tesson, Taimani Films",
+      
+      publisher = "World Bank",
+      
+      date = "2008-09-20T00:00:00Z",
+      
+      country = list(name = "Afghanistan", code = "AFG"),
+      
+      language = "English"
+      
+    ),
+  
+    license = list(
+      list(name = "Attribution 2.0 Generic (CC BY 2.0)", 
+           uri = "https://creativecommons.org/licenses/by/2.0/")),
+    
+    album = list(
+      list(name = "World Bank Projects in Afghanistan")
+    )
+    
+  )  
+    
+)
+
+# Publish the image metadata in the NADA catalog
+
+image_add(idno = "image_001", 
+          metadata = pic_desc,
+          repositoryid = "central",
+          overwrite = "yes", 
+          published = 1,
+          thumbnail = "img_001_small.jpg")  # file used as thumbnail in the catalog
+
+# Provide a link to the images in the originating repository, and upload files
+# (uploading files will make them available directly from the NADA catalog)
+
+external_resources_add(
+  idno = "image_001",
+  dctype = "pic",
+  title = "Man fetching water, Afghanistan (Flickr link)",
+  file_path = "https://www.flickr.com/photos/water_alternatives/31953178928/in/photolist-QFAoS5",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  idno = "image_001",
+  dctype = "pic",
+  title = "Man fetching water, Afghanistan (original size)",
+  file_path = "img_001_original.jpg",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  idno = "image_001",
+  dctype = "pic",
+  title = "Man fetching water, Afghanistan (small size)",
+  file_path = "img_001_small.jpg",
+  overwrite = "yes"
+)
+



**Using Python**

# Python script

# Chapter 11: Videos

The schema we propose to document video files is a combination of elements extracted from the Dublin Core Metadata Initiative (DCMI) and from the VideoObject (from schema.org) schemas. This schema is very similar to the schema we proposed for audio files (see chapter 10).

+

The Dublin Core is a generic and versatile standard, which we also use (in an augmented form) for the documentation of Documents (Chapter 4), Images (Chapter 9), and Audio files (chapter 10). It contains 15 core elements, to which we added a selection of elements from VideoObject. We also included the elements keywords, topics, tags, provenance and additional that are found in other schemas documented in the Guide.

+

The resulting metadata schema is simple, but it contains the elements needed to document the resources and their content in a way that will foster their discoverability in data catalogs. Compliance with the VideoObject elements contributes to search engine optimization, as search engines like Google, Bing and others “reward” metadata published in formats compatible with the schema.org recommendations.

+


+
{
+ "repositoryid": "string",
+ "published": 0,
+ "overwrite": "no",
+ "metadata_information": {},
+ "video_description": {},
+ "provenance": [],
+ "tags": [],
+ "lda_topics": [],
+ "embeddings": [],
+ "additional": { }
+}
+


+

When published in a NADA catalog, the metadata related to video files will appear in a specific tab.

+



+
+

## 11.1 Augmenting video metadata

+

Videos typically come with limited metadata. To make them more discoverable, a transcription of the video content can be generated, stored, and indexed in the catalog. The metadata schema we propose includes an element transcript that can store transcriptions (and possibly their automatically-generated translations) in the video metadata. Word embedding models and topic models can be applied to the transcriptions to further augment the metadata. This will significantly increase the discoverability of the resource, and offer the possibility of applying semantic search to the video metadata.

+

Machine learning speech-to-text solutions are available (although not for all languages) to automatically generate transcriptions at a low cost. Options include the open source Whisper model released by OpenAI, and commercial services such as Microsoft Azure or Amazon Transcribe; open source Python libraries also exist. A minimal example using the Whisper command-line tool is sketched below.

+

Transcriptions of videos published on Youtube are available on-line (the example below was extracted from https://www.youtube.com/watch?v=Axs8NPVYmms).

+



+

Note that some care must be taken when adding automatic speech transcriptions into your metadata, as the transcriptions are not always perfect and may return unexpected results. This will be the case when the sound quality is low, or when the video includes sections in an unknown language (see the example below, of a video in English that includes a brief segment in Somali; the speech-to-text algorithm may in such cases attempt to transcribe text it does not recognize, returning invalid information).

+



+
+
+

## 11.2 Schema description

+

The first three elements of the schema (repositoryid, published, and overwrite) are not part of the video metadata. They are parameters used to indicate how the video metadata will be published in a NADA catalog (see the sketch below).

- **repositoryid** identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA.

- **published**: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.

- **overwrite**: Indicates whether metadata that may have been previously uploaded for the same video can be overwritten. By default, the value is “no”. It must be set to “yes” to overwrite existing information. Note that a video will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element video_description > idno is the same.
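As a minimal sketch, assuming the nadar package provides a video_add() function analogous to the image_add() function used for images, and that `my_video` contains the video metadata, these parameters would be passed as follows:

```r
library(nadar)

video_add(
  idno         = "video_001",   # unique identifier of the video (hypothetical)
  metadata     = my_video,      # list containing the video metadata
  repositoryid = "central",     # publish in the central catalog
  published    = 1,             # 1 = metadata visible to catalog visitors
  overwrite    = "yes"          # overwrite metadata previously published for this idno
)
```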

### 11.2.1 Metadata information

+

metadata_information [Optional ; Not Repeatable] +The metadata information set is used to document the video metadata (not the video itself). This provides information useful for archiving purposes. This set is optional. It is recommended however to enter at least the identification and affiliation of the metadata producer, and the date of creation of the metadata. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so metadata produced by one organization can be found in other data centers. +

+
"metadata_information": {
+ "title": "string",
+ "idno": "string",
+ "producers": [
+  {
+   "name": "string",
+   "abbr": "string",
+   "affiliation": "string",
+   "role": "string"
+  }
+ ],
+ "production_date": "string",
+ "version": "string"
+}
+


+
    +
  • title [Optional ; Not Repeatable ; String]
    +The title of the video.

  • +
  • idno [Optional ; Not Repeatable ; String]
    +A unique identifier for the metadata document (unique in the catalog; ideally also unique globally). This is different from the video unique ID (see idno element in section video_description below), although it is good practice to generate identifiers that would maintain an easy connection between the metadata idno element and the video idno found under video_description (see below).

  • +
  • producers [Optional ; Repeatable]
    +This refers to the producer(s) of the metadata, NOT to the producer(s) of the video. This could for example be the data curator in a data center.

    +
      +
    • name [Optional ; Not repeatable ; String]
      +Name of the metadata producer/curator. An alternative to entering the name of the curator (e.g. for privacy protection purpose) is to enter the curator ID (see the element abbr below)
    • +
    • abbr [Optional ; Not repeatable ; String]
      +Can be used to provide an ID of the metadata producer/curator.
    • +
    • affiliation [Optional ; Not repeatable ; String]
      +Affiliation of the metadata producer/curator.
    • +
    • role [Optional ; Not repeatable ; String]
      +Specific role of the metadata producer/curator.
    • +
  • +
  • production_date [Optional ; Not repeatable ; String]
    +Date the metadata (not the table) was produced.

  • +
  • version [Optional ; Not repeatable ; String]
    +Version of the metadata (not version of the table).

  • +
+
+
+

### 11.2.2 Video description

+

video_description [Required ; Not Repeatable]
+The video_description section contains all elements that will be used to describe the video and its content. These are the elements that will be indexed and made searchable when published in a data catalog.

+
    +
  • idno [Mandatory, Not Repeatable ; String]
    +idno is an identification number that is used to uniquely identify a video in a catalog. It will also help users of the data cite the video properly. The best option is to obtain a Digital Object Identifier (DOI) for the video, as it will ensure that the ID is unique globally. Alternatively, it can be an identifier constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for a video (see identifiers below). This element maps to the “identifier” element in the Dublin Core.

  • +
  • identifiers [Optional ; Repeatable]

    +

  • +
+
"identifiers": [
+ {
+  "type": "string",
+  "identifier": "string"
+ }
+]
+


+

This element is used to enter video identifiers other than the idno element described above. It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in idno can be repeated here, allowing a “type” attribute to be attached to it.
- type [Optional ; Not repeatable ; String]
  The type of unique identifier, e.g., “DOI”.
- identifier [Required ; Not repeatable ; String]
  The identifier itself.
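For example (the DOI shown is purely hypothetical):

```r
my_video <- list(
  # ... ,
  video_description = list(
    # ... ,
    identifiers = list(
      list(type = "DOI", identifier = "10.12345/abcd-6789")  # hypothetical DOI
    )
    # ...
  )
)
```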
    +
  • title [Required ; Not repeatable ; String]

    +

    The title of the video. This element maps to the element caption in VideoObject.

  • +
  • alt_title [Optional ; Not repeatable ; String]

    +

    An alias for the video title. This element maps to the element alternateName in VideoObject.

  • +
  • description [Optional ; Not repeatable ; String]

    +

    A brief description of the video, typically about a paragraph long (around 150 to 250 words). This element maps to the element abstract in VideoObject.

  • +
  • genre [Optional ; Repeatable ; String]

    +

    The genre of the video, broadcast channel or group. This is a VideoObject element. A controlled vocabulary can be used.

  • +
  • keywords [Optional ; Repeatable]

    +

  • +
+
"keywords": [
+ {
+  "name": "string",
+  "vocabulary": "string",
+  "uri": "string"
+ }
+]
+


+

A list of keywords that provide information on the core content of the video. Keywords provide a convenient solution to improve the discoverability of the video, as it allows terms and phrases not found elsewhere in the video metadata to be indexed and to make the video discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the UNESCO Thesaurus. The list can combine keywords from multiple controlled vocabularies, and user-defined keywords.
+- name [Required ; Not repeatable ; String]
+The keyword itself. +- vocabulary [Optional ; Not repeatable ; String]
+The controlled vocabulary (including version number or date) from which the keyword is extracted, if any. +- uri [Optional ; Not repeatable ; String]
+The URL of the controlled vocabulary from which the keyword is extracted, if any.

+
my_video <- list(
+  # ... ,
+  video_description = list(
+    # ... ,
+    
+    keywords = list(
+    
+      list(name = "Migration", 
+           vocabulary = "Unesco Thesaurus (June 2021)", 
+           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+      
+      list(name = "Migrants", 
+           vocabulary = "Unesco Thesaurus (June 2021)", 
+           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+      
+      list(name = "Refugee", 
+           vocabulary = "Unesco Thesaurus (June 2021)", 
+           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
+           
+      list(name = "Forced displacement"),
+      
+      list(name = "Internally displaced population (IDP)")
+    
+    ),
+    
+    # ...
+  ),
+  # ... 
+)  
+


+
    +
  • topics [Optional ; Repeatable]

    +
  • +
+
"topics": [
+ {
+  "id": "string",
+  "name": "string",
+  "parent_id": "string",
+  "vocabulary": "string",
+  "uri": "string"
+ }
+]
+


+

Information on the topics covered in the video. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification, a typology of topics available in 11 languages; or the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. Note that you may use more than one controlled vocabulary. This element is a block of five fields:

+
    +
  • id [Optional ; Not repeatable ; String]
    +The identifier of the topic, taken from a controlled vocabulary.
  • +
  • name [Required ; Not repeatable ; String]
    +The name (label) of the topic, preferably taken from a controlled vocabulary.
  • +
  • parent_id [Optional ; Not repeatable ; String]
    +The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
  • +
  • vocabulary [Optional ; Not repeatable ; String]
    +The name (including version number) of the controlled vocabulary used, if any.
  • +
  • uri [Optional ; Not repeatable ; String]
    +The URL to the controlled vocabulary used, if any.

  • +
+
my_video <- list(
+  # ... ,
+  video_description = list(
+    # ... ,
+    
+    topics = list(
+    
+      list(name = "Demography.Migration", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      
+      list(name = "Demography.Censuses", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      
+      list(id = "F22", 
+           name = "International Migration", 
+           parent_id = "F2 - International Factor Movements and International Business", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+      
+      list(id = "O15", 
+           name = "Human Resources - Human Development - Income Distribution - Migration", 
+           parent_id = "O1 - Economic Development", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+      
+      list(id = "O12", 
+           name = "Microeconomic Analyses of Economic Development", 
+           parent_id = "O1 - Economic Development", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+      
+      list(id = "J61", 
+           name = "Geographic Labor Mobility - Immigrant Workers", 
+           parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
+           
+    ),
+    
+    # ...
+  ),
+)  
+


+
    +
  • persons [Optional ; Repeatable]
    +
  • +
+
"persons": [
+ {
+  "name": "string",
+  "role": "string"
+ }
+]
+


+

A list of persons who appear in the video.
+- name [Required ; Not repeatable ; String]
+The name of the person.
+- role [Optional ; Not repeatable, String]
+The role of the person mentioned in name.

+
my_video <- list(
+metadata_information = list(
+  # ... 
+),
+video_description = list(
+  # ... ,
+  
+  persons = list(
+    
+    list(name = "John Smith",
+         role = "Keynote speaker"),
+    
+    list(name = "Jane Doe",
+         role = "Debate moderator")
+    
+  ),
+# ...
+) 
+


+
    +
  • main_entity [Optional ; Not repeatable ; String]

    +

    Indicates the primary entity described in the video. This element maps to the element mainEntity in VideoObject.

  • +
  • date_created [Optional, Not Repeatable ; String]

    +

    The date the video was created. It is recommended to enter the date in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the video is created refers to the date that the video was produced and considered ready for dissemination.

  • +
  • date_published [Optional, Not Repeatable ; String]

    +

    The date the video was published. It is recommended to use the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).

  • +
  • version [Optional, Not Repeatable ; String]

    +

    The version of the video refers to the published version of the video.

  • +
  • status [Optional ; Not repeatable, String]

    +

    The status of the video in terms of its stage in a lifecycle. A controlled vocabulary should be used. Example terms include {Incomplete, Draft, Published, Obsolete}. Some organizations define a set of terms for the stages of their publication lifecycle. This element maps to the element creativeWorkStatus in VideoObject.

  • +
  • country [Optional ; Repeatable]

    +

  • +
+
"country": [
+ {
+  "name": "string",
+  "code": "string"
+ }
+]
+


+

The list of countries (or regions) covered by the video, if applicable. This refers to the content of the video, not to the country where the video was released. This is a repeatable block of two elements: +- name [Required ; Not repeatable ; String]
+The country/region name. Note that many organizations have their own policies on the naming of countries/regions/economies/territories, which data curators will have to comply with. +- code [Optional ; Not repeatable ; String]
+The country/region code (entered as a string, even for numeric codes). It is recommended to use a standard list of countries and regions, such as the ISO country list (ISO 3166). +

+
    +
  • spatial_coverage [Optional ; Not repeatable ; String]

    +

    Indicates the place(s) which are depicted or described in the video. This element maps to the element contentLocation in VideoObject. It complements the country element, and can be used to qualify the geographic coverage of the video in the form of free text.

  • +
  • content_reference_time [Optional ; Not repeatable ; String]

    +

    The specific time described by the video, for works that emphasize a particular moment within an event. This element maps to the element contentReferenceTime in VideoObject.

  • +
  • temporal_coverage [Optional ; Not repeatable ; String]

    +

    Indicates the period that the video applies to, i.e. that it describes, either as a DateTime or as a textual string indicating a time period in ISO 8601 time interval format. This element maps to the element temporalCoverage in VideoObject.

  • +
  • recorded_at [Optional ; Not repeatable ; String]

    +

    This element maps to the element recordedAt in VideoObject schema. It identifies the event where the video was recorded (e.g., a conference, or a demonstration).

  • +
  • audience [Optional ; Not repeatable ; String]

  • +
+

A brief description of the intended audience of the video, i.e. the group for whom it was created.

+
    +
  • bbox [Optional ; Repeatable]

    +
  • +
+
"bbox": [
+ {
+  "west": "string",
+  "east": "string",
+  "south": "string",
+  "north": "string"
+ }
+]
+


+

This element is used to define one or multiple bounding box(es), which are the (rectangular) fundamental geometric description of the geographic coverage of the video. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the video’s geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search. +- west [Required ; Not repeatable ; String]
+West longitude of the box +- east [Required ; Not repeatable ; String]
+East longitude of the box +- south [Required ; Not repeatable ; String]
+South latitude of the box +- north [Required ; Not repeatable ; String]
+North latitude of the box +

+
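For example, an approximate bounding box for Afghanistan could be entered as follows (coordinates are approximate and shown for illustration only):

```r
my_video <- list(
  video_description = list(
    # ... ,
    bbox = list(
      list(west = "60.5", east = "74.9", south = "29.4", north = "38.5")
    )
    # ...
  )
)
```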
    +
  • language [Optional, Repeatable]

    +
  • +
+
"language": [
+ {
+  "name": "string",
+  "code": "string"
+ }
+]
+


+

Most videos will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed. For the language code, ISO codes will preferably be used. The language refers to the language in which the video is published. This is a block of two elements (at least one must be provided for each language): +- name [Optional ; Not repeatable ; String]
+The name of the language. +- code [Optional ; Not repeatable ; String]
+The code of the language. The use of ISO 639-2 (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings.

+
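For example:

```r
my_video <- list(
  video_description = list(
    # ... ,
    language = list(
      list(name = "English", code = "eng")   # ISO 639-2 code
    )
    # ...
  )
)
```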
    +
  • creator [Optional, Not repeatable ; String]
  • +
+

Organization or person who created/authored the video.

+
    +
  • production_company [Optional, Not repeatable ; String]

    +

    The production company or studio responsible for the item. This element maps to the element productionCompany in VideoObject.

  • +
  • publisher [Optional, Not repeatable ; String]

    +
    my_video = list(
    +  # ... ,
    +  video_description = list(
    +        # ... ,
    +        publisher = "@@@@@",
    +        # ...
    +  )
    +)
    +


  • +
  • repository [Optional ; Not repeatable ; String]

    +

    The name of the repository (organization).

  • +
  • contacts [Optional, Repeatable]

    +Users of the video may need further clarification and information. This section may include the name-affiliation-email-URI of one or multiple contact persons. This block of elements will identify contact persons who can be used as resource persons regarding problems or questions raised by the user community. The URI attribute should be used to indicate a URN or URL for the homepage of the contact individual. The email attribute is used to indicate an email address for the contact individual. It is recommended to avoid putting the actual name of individuals. The information provided here should be valid for the long term. It is therefore preferable to identify contact persons by a title. The same applies for the email field. Ideally, a “generic” email address should be provided. It is easy to configure a mail server in such a way that all messages sent to the generic email address would be automatically forwarded to some staff members. +

  • +
+
"contacts": [
+ {
+  "name": "string",
+  "role": "string",
+  "affiliation": "string",
+  "email": "string",
+  "telephone": "string",
+  "uri": "string"
+ }
+]
+


+
    +
  • name [Required, Not repeatable, String]
    +Name of a person or unit (such as a data help desk). It will usually be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position.

  • +
  • role [Optional, Not repeatable, String]
    +The specific role of name, in regards to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact.

  • +
  • affiliation [Optional, Not repeatable, String]
    +Affiliation of the person/unit.

  • +
  • email [Optional, Not repeatable, String]
    +E-mail address of the person.

  • +
  • telephone [Optional, Not repeatable, String]
    +A phone number that can be called to obtain information or provide feedback on the table. This should never be a personal phone number; a corporate number (typically of a data help desk) should be provided.

  • +
  • uri [Optional, Not repeatable, String]
    +A link to a website where contact information for name can be found.

  • +
  • contributors [Optional, Repeatable]
    +

  • +
+
"contributors": [
+ {
+  "name": "string",
+  "affiliation": "string",
+  "abbr": "string",
+  "role": "string",
+  "uri": "string"
+ }
+]
+


+

Identifies the person(s) and/or organization(s) who contributed to the production of the video. The role attribute allows defining what the specific contribution of the identified person or organization was.
+- name [Optional, Not Repeatable ; String]
+The name of the contributor (person or organization). +- affiliation [Optional, Not Repeatable ; String]
+The affiliation of the contributor. +- abbr [Optional, Not Repeatable ; String]
+The abbreviation for the institution which has been listed as the affiliation of the contributor.
+- role [Optional, Not Repeatable ; String]
+The specific role of the contributor. This could for example be “Cameraman”, “Sound engineer”, etc. +- uri [Optional, Not Repeatable ; String]
+A URI (link to a website, or email address) for the contributor.

+
my_video = list(
+  # ... ,
+  video_description = list(
+        # ... ,
+        contributors = list(
+          list(
+            name         = "",
+            affiliation  = "",
+            abbr         = "",
+            role         = "",
+            uri          = "")
+        ),  
+        # ...
+  )
+)
+
    +
  • sponsors [Optional ; Repeatable]
    +
  • +
+
"sponsors": [
+ {
+  "name": "string",
+  "abbr": "string",
+  "grant": "string",
+  "role": "string"
+ }
+]
+


+

This element is used to list the funders/sponsors of the video. If different funding agencies financed different stages of the production process, use the “role” attribute to distinguish them. +- name [Required ; Not repeatable ; String]
+The name of the sponsor (person or organization) +- abbr [Optional ; Not repeatable ; String]
+The abbreviation (acronym) of the sponsor. +- grant [Optional ; Not repeatable ; String]
+The grant (or contract) number. +- role [Optional ; Not repeatable ; String]
+The specific role of the sponsor.

+
    +
  • translators [Optional ; Repeatable]

    +
  • +
+
"translators": [
+ {
+  "first_name": "string",
+  "initial": "string",
+  "last_name": "string",
+  "affiliation": "string"
+ }
+]
+


+

Organization or person who adapted the video to different languages. This element maps to the element translator in VideoObject. +- first_name [Optional ; Not repeatable ; String]
+The first name of the translator. +- initial [Optional ; Not repeatable ; String]
+The initials of the translator. +- last_name [Optional ; Not repeatable ; String]
+The last name of the translator. +- affiliation [Optional ; Not repeatable ; String]
+The affiliation of the translator.

+
    +
  • is_based_on [Optional ; Not repeatable, String]

    +

    A resource from which this video is derived or from which it is a modification or adaption. This element maps to the element isBasedOn in VideoObject.

  • +
  • is_part_of [Optional ; Not repeatable, String]

    +

    Indicates another video that this video is part of. This element maps to the element isPartOf in VideoObject.

  • +
  • relations [Optional ; Repeatable, String]

    +

  • +
+
"relations": [
+ "string"
+]
+


+

Defines, as a free text field, the relation between the video being documented and other resources. This is a Dublin Core element.

+
    +
  • video_provider [Optional ; Not repeatable, String]
    +

    +

    The person or organization who provides the video. This element maps to the element provider in VideoObject.

  • +
  • video_url [Optional ; Not repeatable, String]

    +

    URL of the video. This element maps to the element url in VideoObject.

  • +
  • embed_url [Optional ; Not repeatable, String]

    +

    A URL pointing to a player for a specific video. This element maps to the element embedUrl in VideoObject. For example, “https://www.youtube.com/embed/7Aif1xjstws

    +

    To be embedded, a video must be hosted on a video sharing platform like Youtube (www.youtube.com). To obtain the “embed link” from youtube, click on the “Share” button, then “Embed”. In the result box, select the content of the element src =.

    +



  • +
  • encoding_format [Optional ; Not repeatable, String]

    +

    The video file format, typically expressed using a MIME format. This element corresponds to the “encodingFormat” element of VideoObject and maps to the element format of the Dublin Core.

  • +
  • duration [Optional ; Not repeatable, String]

    +

    The duration of the item (movie, audio recording, event, etc.) in ISO 8601 format. This element is a VideoObject element.

    +

    ISO 8601 durations are expressed using the following format, where (n) is replaced by the value for each of the date and time elements that follows the (n). For example: (3)H means 3 hours.

    +
    +

    P(n)Y(n)M(n)DT(n)H(n)M(n)S

    +Where:
    +
      +
    • P is the Period designator and is always placed at the beginning of the duration
    • +
    • (n)Y represents the number of years
    • +
    • (n)M represents the number of months
    • +
    • (n)W represents the number of weeks
    • +
    • (n)D represents the number of days
    • +
    • T is the Time designator and always precedes the time components
    • +
    • (n)H represents the number of hours
    • +
    • (n)M represents the number of minutes
    • +
    • (n)S represents the number of seconds
    • +
    +

    For example, P1Y2M20DT3H30M8S represents a duration of one year, two months, twenty days, three hours, thirty minutes, and eight seconds.

    +

    Date and time elements including their designator may be omitted if their value is zero, and lower-order elements may also be omitted for reduced precision. For example, “P23DT23H” and “P4Y” are both acceptable duration representations.

    +

    As M can represent both Month and Minutes, the time designator T is used. For example, “P1M” is a one-month duration and “PT1M” is a one-minute duration.

    +

    This information on the ISO 8601 was adapted from wikipedia where more detailed information can be found.

    +
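For example, a video lasting 12 minutes and 35 seconds would be documented as:

```r
my_video <- list(
  video_description = list(
    # ... ,
    duration = "PT12M35S"   # 12 minutes and 35 seconds, in ISO 8601 format
    # ...
  )
)
```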
  • +
  • rights [Optional ; Not repeatable, String]

    +

    A textual description of the rights associated to the video. If a copyright is available, the three following elements will be used instead of this element.

  • +
  • copyright_holder [Optional ; Not repeatable, String]

    +

    The party holding the legal copyright to the video. This element corresponds to the “copyrightHolder” element of VideoObject.

  • +
  • copyright_notice [Optional ; Not repeatable, String]

    +

    Text of a notice appropriate for describing the copyright aspects of the video, ideally indicating the owner of the copyright. This element corresponds to the “copyrightNotice” element of VideoObject.

  • +
  • copyright_year [Optional ; Not repeatable, String]

    +

    The year during which the claimed copyright for the video was first asserted. This element corresponds to the “copyrightYear” element of VideoObject.

  • +
  • credit_text [Optional ; Not repeatable, String]

    +

    This element can be used to credit the person(s) and/or organization(s) associated with a published video. This element corresponds to the “creditText” element of VideoObject.

  • +
  • citation [Optional ; Not repeatable, String]

    +

    This element provides a required or recommended citation of the audio file.

  • +
  • transcript [Optional ; Repeatable, String]

    +

  • +
+
"transcript": [
+ {
+  "language_name": "string",
+  "language_code": "string",
+  "text": "string"
+ }
+]
+


+

The transcript of the video content, provided as a text. Note that if the text is very long, an alternative is to save it in a separate text file and to make it available in a data catalog as an external resource. +- language_name [Optional ; Not repeatable ; String]

+The name of the language of the transcript. +- language_code [Optional ; Not repeatable ; String]

+The code of the language of the transcript, preferably the ISO code. +- text [Optional ; Not repeatable ; String]

+

The transcript itself. Adding the transcript in the metadata will make the video much more discoverable, as the content of the transcription can be indexed in catalogs.

+
    +
  • media [Optional ; Repeatable ; String]

    +
  • +
+
"media": [
+ "string"
+]
+


+

A description of the media on which the recording is stored (other than the online file format); e.g., “CD-ROM”.

+
    +
  • album [Optional ; Repeatable]

    +
  • +
+
"album": [
+ {
+  "name": "string",
+  "description": "string",
+  "owner": "string",
+  "uri": "string"
+ }
+]
+


+

When a video is published in a catalog containing many other videos, it may be desirable to organize them by album. Albums are collections of videos organized by theme, period, location, or other criteria. One video can belong to more than one album. Albums are “virtual collections”. +- name [Optional ; Not Repeatable ; String]
+The name (label) of the album. +- description [Optional ; Not Repeatable ; String]
+A brief description of the album. +- owner [Optional ; Not Repeatable ; String]
+The owner of the album. +- uri [Optional ; Not Repeatable ; String]
+A link (URL) to the album.

+
    +
  • provenance [Optional ; Repeatable]

    +

    Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been done to the harvested metadata. These elements are NOT part of the IPTC or DCMI metadata standard.
    +

    +
    "provenance": [
    +  {
    +      "origin_description": {
    +          "harvest_date": "string",
    +          "altered": true,
    +          "base_url": "string",
    +          "identifier": "string",
    +          "date_stamp": "string",
    +          "metadata_namespace": "string"
    +      }
    +  }
    +]
    +


  • +
  • origin_description [Required ; Not repeatable]
    +The origin_description elements are used to describe when and from where metadata have been extracted or harvested.

    +
      +
    • harvest_date [Required ; Not repeatable ; String]
      +The date and time the metadata were harvested, in ISO 8601 format.
    • +
    • altered [Optional ; Not repeatable ; Boolean]
      +A boolean variable (“true” or “false”; “true by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Study Description / Title Statement section) will be modified when published in a new catalog.
    • +
    • base_url [Required ; Not repeatable ; String]
      +The URL from where the metadata were harvested.
    • +
    • identifier [Optional ; Not repeatable ; String]
      +The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
    • +
    • date_stamp [Optional ; Not repeatable ; String]
      +The datestamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
    • +
    • metadata_namespace [Optional ; Not repeatable ; String]
      +@@@@@@@

    • +
  • +
  • tags [Optional ; Repeatable]
    +As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. +

  • +
+
"tags": [
+    {
+        "tag": "string",
+        "tag_group": "string"
+    }
+]
+


+
    +
  • tag [Required ; Not repeatable ; String]
    +A user-defined tag.

  • +
  • tag_group [Optional ; Not repeatable ; String]

    +A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.

  • +
  • lda_topics [Optional ; Not repeatable]

    +

    We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

    +

    Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of videos, this text will be a concatenation of some metadata elements, such as the title, description, keywords, and transcript. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).

  • +
```json
"lda_topics": [
    {
        "model_info": [
            {
                "source": "string",
                "author": "string",
                "version": "string",
                "model_id": "string",
                "nb_topics": 0,
                "description": "string",
                "corpus": "string",
                "uri": "string"
            }
        ],
        "topic_description": [
            {
                "topic_id": null,
                "topic_score": null,
                "topic_label": "string",
                "topic_words": [
                    {
                        "word": "string",
                        "word_weight": 0
                    }
                ]
            }
        ]
    }
]
```

The `lda_topics` element includes the following metadata fields (a brief illustration in R is provided after the list):

  - **`model_info`** *[Optional ; Not repeatable]*: Information on the LDA model.
    - **`source`** *[Optional ; Not repeatable ; String]*: The source of the model (typically, an organization).
    - **`author`** *[Optional ; Not repeatable ; String]*: The author(s) of the model.
    - **`version`** *[Optional ; Not repeatable ; String]*: The version of the model, which could be defined by a date or a number.
    - **`model_id`** *[Optional ; Not repeatable ; String]*: The unique ID given to the model.
    - **`nb_topics`** *[Optional ; Not repeatable ; Numeric]*: The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
    - **`description`** *[Optional ; Not repeatable ; String]*: A brief description of the model.
    - **`corpus`** *[Optional ; Not repeatable ; String]*: A brief description of the corpus on which the LDA model was trained.
    - **`uri`** *[Optional ; Not repeatable ; String]*: A link to a web page where additional information on the model is available.
  - **`topic_description`** *[Optional ; Repeatable]*: The topic composition extracted from selected elements of the series metadata (typically, the name, definitions, and concepts).
    - **`topic_id`** *[Optional ; Not repeatable ; String]*: The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
    - **`topic_score`** *[Optional ; Not repeatable ; Numeric]*: The share of the topic in the metadata (%).
    - **`topic_label`** *[Optional ; Not repeatable ; String]*: The label of the topic, if any (not automatically generated by the LDA model).
    - **`topic_words`** *[Optional ; Not repeatable]*: The list of N keywords describing the topic (e.g., the top 5 words).
      - **`word`** *[Optional ; Not repeatable ; String]*: The word.
      - **`word_weight`** *[Optional ; Not repeatable ; Numeric]*: The weight of the word in the definition of the topic.
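As an illustration, a minimal sketch (in R) of a filled `lda_topics` block; the model information, topic labels, words, and scores are all hypothetical:

```r
# Illustrative only: model information, topics, words, and scores are hypothetical
my_video = list(
  # ... other metadata blocks ...,
  lda_topics = list(
    list(
      model_info = list(
        list(source      = "Example organization",
             author      = "John Doe",
             version     = "2021-06",
             model_id    = "lda-example-075",
             nb_topics   = 75,
             description = "LDA topic model used here for illustration purposes only",
             corpus      = "Corpus of development-related documents",
             uri         = "http://example.org/lda_model")
      ),
      topic_description = list(
        list(topic_id    = "topic_27",
             topic_score = 0.32,
             topic_label = "humanitarian assistance",
             topic_words = list(list(word = "refugee",  word_weight = 0.041),
                                list(word = "famine",   word_weight = 0.037),
                                list(word = "aid",      word_weight = 0.035))),
        list(topic_id    = "topic_12",
             topic_score = 0.18,
             topic_label = "conflict and security",
             topic_words = list(list(word = "conflict", word_weight = 0.052),
                                list(word = "security", word_weight = 0.047)))
      )
    )
  )
)
```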
- **`embeddings`** *[Optional ; Repeatable]*

    In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).

    The word vectors do not have to be stored in the series/indicator metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
```json
"embeddings": [
    {
        "id": "string",
        "description": "string",
        "date": "string",
        "vector": null
    }
]
```

The `embeddings` element contains four metadata fields:

  - **`id`** *[Optional ; Not repeatable ; String]*: A unique identifier of the word embedding model used to generate the vector.
  - **`description`** *[Optional ; Not repeatable ; String]*: A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
  - **`date`** *[Optional ; Not repeatable ; String]*: The date the model was trained (or a version date for the model).
  - **`vector`** *[Required ; Not repeatable ; @@@@]*: The numeric vector representing the video metadata. See the illustration below.
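A minimal sketch (in R) of how an `embeddings` block could be stored; the model identifier, description, and vector values are hypothetical, and a real vector would typically contain 100 or more numbers:

```r
# Illustrative only: model id, description, and vector values are hypothetical
my_video = list(
  # ... other metadata blocks ...,
  embeddings = list(
    list(id          = "word2vec_example_v1",
         description = "Example 100-dimension word embedding model trained on a corpus of development documents",
         date        = "2021-06-15",
         vector      = c(0.0132, -0.2415, 0.0871, 0.1349))  # truncated for display
  )
)
```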
  - **`additional`** *[Optional ; Not repeatable]*: The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail. A brief illustration is provided below.
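For example, a minimal sketch (in R) of an `additional` block containing custom, organization-specific elements; the element names shown are hypothetical:

```r
# Illustrative only: the custom element names are hypothetical
my_video = list(
  # ... other metadata blocks ...,
  additional = list(
    internal_catalog_id = "VDO-2021-0117",
    review_status       = "cleared for publication"
  )
)
```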


## 11.3 Complete example

### 11.3.1 In R

```r
library(nadar)

# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key(my_keys[1,1])  
set_api_url("https://.../index.php/api/") 
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------

setwd("C:/my_videos")

id = "MDA_VDO_001"

thumb = "vdo_001.jpg"

# Generate the metadata

my_video = list(
  
  metadata_information = list(
    title = "Mogadishu, Somalia: A Call for Help",
    idno = id,
    producers = list(
      list(name = "John Doe", affiliation = "National Library")
    ),
    production_date = "2021-09-03"
  ),
  
  video_description = list(
    
    idno = id,
    
    title = "Mogadishu, Somalia: A Call for Help",
    
    alt_title = "Somalia: Guterres in Mogadishu",
    
    date_published = "2011-09-01",
    
    description = "During a landmark visit, the United Nations High Commissioner for Refugees calls on the international community to rapidly increase aid to Somalia.",
    
    genre = "Documentary",
    
    persons = list(
      list(name = "António Guterres", role = "High Commissioner for Refugees"),
      list(name = "Fadhumo", role = "Somali internally displaced person (IDP)")
    ),
    
    main_entity = "United Nations High Commission for Refugees (UNHCR), the UN Refugee Agency",
    
    country = list(
      list(name = "Somalia", code = "SOM")
    ),
    
    spatial_coverage = "Mogadishu, Somalia",
    
    content_reference_time = "2011-09",
    
    languages = list(
      list(name = "English", code = "EN")
    ),
    
    creator = "United Nations High Commission for Refugees (UNHCR)",
    
    video_url = "https://www.youtube.com/watch?v=7Aif1xjstws",
    
    embed_url = "https://www.youtube.com/embed/7Aif1xjstws",
    
    transcript = list(
      list(
         language = "English",
         transcript = "Mogadishu is a dangerous place security has improved since al-shabaab militias 
         withdrew last month but not a lot despite the insecurity hundreds of thousands of Somalis 
         have been streaming into the capital from surrounding areas they're fleeing the worst famine 
         to strike the region in 60 years in a landmark visit the UN High Commissioner for Refugees 
         Antonio Gutierrez traveled to Mogadishu this week to visit with Somalis he urged the international 
         community to rapidly increase aid to people who have been through so much already makes us very emotional is to
         feel that for 2020 as these people has been suffering the suffering enormously of course there is a large
         responsibility of Somalis in the way things have happened but let's also recognize that international community
         there sometimes also be part of the problem and not part of the solution some aid is getting through fatuma has
         just been registered to receive assistance from UNHCR she left her home and is now seeking help in the capital
         she is camped with thousands of others in a settlement not far from the shoreline UNHCR is providing plastic
         sheeting and other supplies there are also food distributions there are a total of four hundred thousand displaced
         people in Mogadishu 100,000 arrived in the past two months alone getting assistance to them despite the
         dangers is an urgent priority otherwise settlements like these are certain to you"
      )   
    ),
    
    duration = "PT2M14S"  # 2 minutes and 14 seconds
    
  )
  
)

# Publish in the NADA catalog

video_add(idno = id, 
          published = 1, 
          overwrite = "yes", 
          metadata = my_video, 
          thumbnail = thumb)
```

In NADA, the video will now appear in the “All” tab and in the “Videos” tab.

If the `embed_url` element was provided, the video can be played within the NADA page.

### 11.3.2 In Python

```python
# Python script
```

# Research projects and scripts {#chapter12}

## 12.1 Rationale

Documenting, cataloguing, and disseminating data has the potential to increase the volume and diversity of data analysis. There is also much value in documenting, cataloguing, and disseminating data processing and analysis scripts. Technological solutions such as GitHub, Jupyter Notebooks, or JupyterLab facilitate the preservation and sharing of code, and enable collaborative work around data analysis. Coding style guides, like the Google style guides and the Guide to Reproducible Code in Ecology and Evolution by the British Ecological Society, help foster the usability, adaptability, and reproducibility of code. But these tools and guidelines do not fully address the issue of cataloguing and discoverability of data processing and analysis programs and scripts. We therefore propose, as a complement to collaboration tools and style guides, a metadata schema to document data analysis projects and scripts. The production of structured metadata will contribute not only to discoverability, but also to the reproducibility, replicability, and auditability of data analytics.

There are multiple reasons to make reproducibility, replicability, and auditability of data analytics a component of a data dissemination system. This will:

- Improve the quality of research and analysis. Public scrutiny enables contestability and independent quality control of the output of research and analysis; these are strong incentives for additional rigor in data analysis.
- Allow the re-purposing or expansion of analysis by the research community, thereby increasing the relevance, utility, and value of both the data and the analytical work.
- Strengthen the reputation and credibility of the analysis.
- Provide students and peers with rich training materials.
- In some cases, satisfy a requirement imposed by peer-reviewed journals or financial sponsors of research activities. For example, the Data and Code Availability Policy of the American Economic Association (accessed on June 29, 2020) states that "It is the policy of the American Economic Association to publish papers only if the data and code used in the analysis are clearly and precisely documented, and access to the data and code is clearly and precisely documented and is non-exclusive to the authors. Authors of accepted papers that contain empirical work, simulations, or experimental work must provide, prior to acceptance, information about the data, programs, and other details of the computations sufficient to permit replication, as well as information about access to data and programs."
- Contribute to assuring the fairness of policy advice and interventions resulting from data analysis. Data analysis may be used to identify or target the beneficiaries of policies and programs, or may otherwise contribute to the design and implementation of development policies and projects. By doing so, it also contributes to identifying populations to be excluded from these interventions. Errors and biases may be introduced in analysis by accidental or intentional human errors, by the algorithms themselves, or they can result from flaws in the data. The analysis that informs such projects and policies should therefore be made auditable and contestable, i.e., documented and published.
## 12.2 Motivation for open analytics

Stodden et al. (2013) make a useful distinction between five levels of research openness:

1. *Reviewable research.* The descriptions of the research methods can be independently assessed, and the results judged credible. This includes both traditional peer review and community review, and does not imply reproducibility.
2. *Replicable research.* Tools are made available that would allow one to duplicate the results of the research, for example by running the authors' code to produce the plots shown in the publication. (Here tools might be limited in scope, e.g., only essential data or executables, and might only be made available to referees or only upon request.)
3. *Confirmable research.* The main conclusions of the research can be attained independently without the use of software provided by the author, using only the complete description of algorithms and methodology provided in the publication and any supplementary materials.
4. *Auditable research.* Sufficient records (including data and software) have been archived so that the research can be defended later if necessary or differences between independent confirmations resolved. The archive might be private.
5. *Open or reproducible research.* This is auditable research made openly available. It comprises well-documented and fully open code and data that are publicly available and that would allow one to (a) fully audit the computational procedure, (b) replicate and also independently reproduce the results of the research, and (c) extend the results or apply the method to new problems.

## 12.3 Goal: discoverable code

The goal is to make code searchable and filterable by title, author, software, method, country, and other criteria, with links to the related analytical output and data. For example, to find a project that implemented multiple imputation in R for an analysis of poverty in Kenya, a user could search for *poverty* AND *"multiple imputation"* and filter the results by software and country.

Note: the code will also be "attached" to the output page (paper) and to the dataset page of the catalog, if these are available in the catalog.

Provide access to scripts with detailed information, including the software and libraries used, distribution license, IT requirements, datasets used, list of outputs, and more.

## 12.4 Schema description

To make data processing and analysis scripts more discoverable and usable, we propose a metadata schema inspired by the schemas available to document datasets. The proposed schema contains two main blocks of metadata elements: the *document description*, intended to document the metadata themselves (the term *document* refers to the file that will contain the metadata), and the *project description*, used to document the research or analytical work and the related scripts. We also include in the schema the `tags`, `provenance`, and `additional` elements common to all schemas.
```json
{
    "repositoryid": "string",
    "published": 0,
    "overwrite": "no",
    "doc_desc": {},
    "project_desc": {},
    "provenance": [],
    "tags": [],
    "lda_topics": [],
    "embeddings": [],
    "additional": { }
}
```

### 12.4.1 Document description

**`doc_desc`** *[Optional ; Not repeatable]*

The document description is a description of the metadata file being generated; it provides metadata about the metadata. This optional block is used to document the research project metadata (not the project itself). The information it contains is not needed to document the project; it only provides information, useful for archiving purposes, on the process of generating the project metadata. This information is typically useful to a catalog administrator; it is not useful to the public and does not need to be displayed in the publicly-available catalog interface. It is nevertheless recommended to enter at least the identification of the metadata producer, her/his affiliation, and the date the metadata were created. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so the metadata produced by one organization can be found in other data centers (complying with standards and schemas is precisely intended to facilitate the interoperability of catalogs and automated information sharing). Keeping track of who documented a resource is thus useful.


```json
"doc_desc": {
    "title": "string",
    "idno": "string",
    "producers": [
        {
            "name": "string",
            "abbr": "string",
            "affiliation": "string",
            "role": "string"
        }
    ],
    "prod_date": "string",
    "version": "string"
}
```
- **`title`** *[Optional ; Not repeatable ; String]*: The title of the project. This will usually be the same as the element `title` in the project description section.
- **`idno`** *[Optional ; Not repeatable ; String]*: A unique identifier for the metadata document.
- **`producers`** *[Optional ; Not repeatable]*: A list of producers of the metadata (who may be, but do not have to be, the authors of the research project and scripts being documented). These can be persons or organizations. The following four elements are used to identify them and specify their specific role as and if relevant (this block of four elements is repeated for each contributor to the metadata):
  - **`name`** *[Optional ; Not repeatable ; String]*: Name of the person or organization who documented the project.
  - **`abbr`** *[Optional ; Not repeatable ; String]*: The abbreviation of the organization that is referenced under `name` above.
  - **`affiliation`** *[Optional ; Not repeatable ; String]*: Affiliation of the person(s) or organization(s) who documented the project.
  - **`role`** *[Optional ; Not repeatable ; String]*: This attribute is used to distinguish different stages of involvement in the metadata production process.
- **`prod_date`** *[Optional ; Not repeatable ; String]*: The date the metadata on this project were produced (not distributed or archived), preferably in ISO 8601 format (YYYY-MM-DD or YYYY-MM).
- **`version`** *[Optional ; Not repeatable ; String]*: Documenting a research project is not a trivial exercise. It may happen that, having identified errors or omissions in the metadata or having received suggestions for improvement, a new version of the metadata is produced. This element is used to identify and describe the current version of the metadata. It is good practice to provide a version number, and information on what distinguishes this version from the previous one(s) if relevant.

```r
my_project = list(
  doc_desc = list(
    idno = "META_RP_001", 
    producers = list(
      list(name = "John Doe",
           affiliation = "National Data Center of Popstan")
    ),
    prod_date = "2020-12-27",
    version = "Version 1.0 - Original version of the documentation provided by the author of the project"
  ),
  # ... 
)
```
### 12.4.2 Project description

**`project_desc`** *[Required ; Not repeatable]*

The project description contains the metadata related to the project itself. All efforts should be made to provide as much and as detailed information as possible.

```json
"project_desc": {
    "title_statement": {},
    "abstract": "string",
    "review_board": "string",
    "output": [],
    "approval_process": [],
    "project_website": [],
    "language": [],
    "production_date": "string",
    "version_statement": {},
    "errata": [],
    "process": [],
    "authoring_entity": [],
    "contributors": [],
    "sponsors": [],
    "curators": [],
    "reviews_comments": [],
    "acknowledgments": [],
    "acknowledgment_statement": "string",
    "disclaimer": "string",
    "confidentiality": "string",
    "citation_requirement": "string",
    "related_projects": [],
    "geographic_units": [],
    "keywords": [],
    "themes": [],
    "topics": [],
    "disciplines": [],
    "repository_uri": [],
    "license": [],
    "copyright": "string",
    "technology_environment": "string",
    "technology_requirements": "string",
    "reproduction_instructions": "string",
    "methods": [],
    "software": [],
    "scripts": [],
    "data_statement": "string",
    "datasets": [],
    "contacts": []
}
```


- **`title_statement`** *[Required ; Not repeatable]*: The `title_statement` is a group of six elements, two of them mandatory.
"title_statement": {
+    "idno": "string",
+    "identifiers": [
+        {
+            "type": "string",
+            "identifier": "string"
+        }
+    ],
+    "title": "string",
+    "sub_title": "string",
+    "alternate_title": "string",
+    "translated_title": "string"
+}
+


  - **`idno`** *[Required ; Not repeatable ; String]*: A unique identifier for the project. Define and use a consistent identification scheme, and do not include spaces in the ID. The project ID is a vital reference: a research project can be the formal cause of a survey, scripts, tables, and knowledge products. Use a system that guarantees the uniqueness of the identifier (e.g., a DOI or your own reference numbering system).
  - **`identifiers`** *[Optional ; Repeatable]*: This repeatable element is used to enter identifiers (IDs) other than the `idno` entered in the `title_statement`. It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in `idno` can (and in some cases should) be repeated here. The element `idno` does not provide a `type` parameter; repeating it in this section makes it possible to add that information.
    - **`type`** *[Optional ; Not repeatable ; String]*: The type of unique ID, e.g., "DOI".
    - **`identifier`** *[Required ; Not repeatable ; String]*: The identifier itself.
  - **`title`** *[Required ; Not repeatable ; String]*: The title is the official name of the project as it may be stated in reports, papers, or other documents. The title will in most cases be identical to the Document Title (see above). The title may correspond to the title of an academic paper, of a project impact evaluation, etc. Pay attention to capitalization in the title.
  - **`sub_title`** *[Optional ; Not repeatable ; String]*: A short subtitle for the project, optional and rarely used. The subtitle is often used to qualify or rephrase the title.
  - **`alternate_title`** *[Optional ; Not repeatable ; String]*: An alternate title of the project, i.e., any other title that would help discover the research project. In countries with more than one official language, a translation of the title may be provided.
  - **`translated_title`** *[Optional ; Not repeatable ; String]*: A translated version of the title (this will be used for example when a catalog documents all entries in English, but wants to preserve the title of a project in its original language when the original language is not English).
```r
my_project = list(
  # ... ,
  project_desc = list(
    
      title_statement = list(
          idno = "RR_WB_2020_001",
          identifiers = list(
            list(type = "DOI", identifier = "XXX-XXX-XXXX")
          ),
          date = "2020",
          title = "Predicting Food Crises - Econometric Model"
      ),

      # ...
  ),
  # ...
)
```


- **`abstract`** *[Optional ; Not repeatable ; String]*: The abstract should provide a clear summary of the purposes, objectives, and content of the project. An abstract can make reference to the various outputs associated with the research project.

    Example extracted from https://microdata.worldbank.org/index.php/catalog/4218:

```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,

      abstract = "Food price inflation is an important metric to inform economic policy but traditional sources of consumer prices are often produced with delay during crises and only at an aggregate level. This may poorly reflect the actual price trends in rural or poverty-stricken areas, where large populations reside in fragile situations.
This data set includes food price estimates and is intended to help gain insight in price developments beyond what can be formally measured by traditional methods. The estimates are generated using a machine-learning approach that imputes ongoing subnational price surveys, often with accuracy similar to direct measurement of prices. The data set provides new opportunities to investigate local price dynamics in areas where populations are sensitive to localized price shocks and where traditional data are not available.",

      # ...
  ),
  # ...
)
```


- **`review_board`** *[Optional ; Not repeatable ; String]*: Information on whether and when the project was submitted, reviewed, and approved by an institutional review board (or independent ethics committee, ethical review board (ERB), research ethics board, or equivalent). A brief illustration is provided below.
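As an illustration, a minimal sketch (in R) of how this element could be filled; the board name, dates, and approval number are hypothetical:

```r
# Illustrative only: board name, dates, and approval number are hypothetical
my_project = list(
  # ... ,
  project_desc = list(
    # ... ,
    review_board = "The project proposal was submitted to the Institutional Review Board of [Organization] on 2019-01-15 and approved on 2019-03-02 (approval No. IRB-2019-014).",
    # ...
  )
)
```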

- **`output`** *[Optional ; Repeatable]*: This element will describe and reference all substantial/intended products of the research project, which may include publications, reports, websites, datasets, interactive applications, presentations, visualizations, and others. An output may also be referred to as a "deliverable".

```json
"output": [
    {
        "type": "string",
        "title": "string",
        "authors": "string",
        "description": "string",
        "abstract": "string",
        "uri": "string",
        "doi": "string"
    }
]
```

The `output` is a repeatable block of seven elements, used to document all outputs of the research project:

  - **`type`** *[Optional ; Not repeatable]*: Type of output. The type of output relates to the medium used to convey or communicate the intended results, findings, or conclusions of the research project. This field may make use of a controlled vocabulary. The kind of content could be "Working paper", "Database", etc.
  - **`title`** *[Required ; Not repeatable]*: Formal title of the output. Depending upon the kind of output, the title will vary in formality.
  - **`authors`** *[Optional ; Not repeatable]*: Authors of the output; if multiple, they will be listed in the same text field.
  - **`description`** *[Optional ; Not repeatable]*: Brief description of the output (not an abstract).
  - **`abstract`** *[Optional ; Not repeatable]*: If the output consists of a document, the abstract will be entered here.
  - **`uri`** *[Optional ; Not repeatable]*: A link where the output or information on the output can be found.
  - **`doi`** *[Optional ; Not repeatable]*: Digital Object Identifier (DOI) of the output, if available.

```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
      output = list(
        
        list(type = "working paper",
             title = "Estimating Food Price Inflation from Partial Surveys",
             authors = "Andrée, B. P. J.",
             description = "World Bank Policy Research Working Paper",
             abstract = "The traditional consumer price index is often produced at an aggregate level, using data from few, highly urbanized, areas. As such, it poorly describes price trends in rural or poverty-stricken areas, where large populations may reside in fragile situations. Traditional price data collection also follows a deliberate sampling and measurement process that is not well suited for monitoring during crisis situations, when price stability may deteriorate rapidly. To gain real-time insights beyond what can be formally measured by traditional methods, this paper develops a machine-learning approach for imputation of ongoing subnational price surveys. The aim is to monitor inflation at the market level, relying only on incomplete and intermittent survey data. The capabilities are highlighted using World Food Programme surveys in 25 fragile and conflict-affected countries where real-time monthly food price data are not publicly available from official sources. The results are made available as a data set that covers more than 1200 markets and 43 food types. The local statistics provide a new granular view on important inflation events, including the World Food Price Crisis of 2007–08 and the surge in global inflation following the 2020 pandemic. The paper finds that imputations often achieve accuracy similar to direct measurement of prices. The estimates may provide new opportunities to investigate local price dynamics in markets where prices are sensitive to localized shocks and traditional data are not available.",
             uri = "http://hdl.handle.net/10986/36778"),
             
        list(type = "dataset",
             title = "Monthly food price estimates",
             authors = "Andrée, B. P. J.",
             description = "A dataset of derived data, published as open data",
             abstract = "Food price inflation is an important metric to inform economic policy but traditional sources of consumer prices are often produced with delay during crises and only at an aggregate level. This may poorly reflect the actual price trends in rural or poverty-stricken areas, where large populations reside in fragile situations.
This data set includes food price estimates and is intended to help gain insight in price developments beyond what can be formally measured by traditional methods. The estimates are generated using a machine-learning approach that imputes ongoing subnational price surveys, often with accuracy similar to direct measurement of prices. The data set provides new opportunities to investigate local price dynamics in areas where populations are sensitive to localized price shocks and where traditional data are not available.",
             uri = "https://microdata.worldbank.org/index.php/catalog/4218",
             doi = "https://doi.org/10.48529/2ZH0-JF55")
      
      ),
  
      # ...
  )
)
```


- **`approval_process`** *[Optional ; Repeatable]*: The `approval_process` is a group of six elements used to describe the formal approval process(es) (if any) that the project had to go through. This may for example include an approval by an Ethics Board to collect new data, followed by an internal review process to endorse the results.

```json
"approval_process": [
    {
        "approval_phase": "string",
        "approval_authority": "string",
        "submission_date": "string",
        "reviewer": "string",
        "review_status": "string",
        "approval_date": "string"
    }
]
```


  - **`approval_phase`** *[Optional ; Not repeatable]*: A label that describes the approval phase.
  - **`approval_authority`** *[Optional ; Not repeatable]*: Identification of the person(s) or organization(s) whose approval was required or sought.
  - **`submission_date`** *[Optional ; Not repeatable]*: The date, entered in ISO 8601 format (YYYY-MM-DD), when the project (or a component of it) was submitted for approval.
  - **`reviewer`** *[Optional ; Not repeatable]*: Identification of the reviewer(s).
  - **`review_status`** *[Optional ; Not repeatable]*: Status of the approval.
  - **`approval_date`** *[Optional ; Not repeatable]*: Date the approval was formally received, preferably entered in ISO 8601 format (YYYY-MM-DD).
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
      approval_process = list(
        
        list(approval_phase = "Authorization to conduct the survey",
             approval_authority = "Internal Ethics Board, [Organization]",
             submission_date = "2019-01-15",
             review_status = "Approved (permission No ABC123)",
             approval_date = "2020-04-30"),
        
        list(approval_phase = "Review of research output and authorization to publish",
             approval_authority = "Internal Ethics Board, [Organization]",
             submission_date = "2021-07-15",
             review_status = "Approved",
             approval_date = "2021-10-30")
        
      ),
      # ...
  )
  # ...
)  
```


- **`project_website`** *[Optional ; Repeatable ; String]*: URL(s) of the project website. A brief illustration is provided below.

```json
"project_website": [
    "string"
]
```
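For example, a minimal sketch (in R); the URL is hypothetical:

```r
# Illustrative only: the URL is hypothetical
my_project = list(
  # ... ,
  project_desc = list(
    # ... ,
    project_website = list("https://www.example.org/food-price-project"),
    # ...
  )
)
```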


- **`language`** *[Optional ; Repeatable]*: A block of two elements describing the language(s) of the project. At least one of the two elements must be provided for each listed language. The use of ISO 639-2 (the alpha-3 codes in Codes for the representation of names of languages) is recommended.

```json
"language": [
    {
        "name": "string",
        "code": "string"
    }
]
```

  - **`name`** *[Optional ; Not repeatable ; String]*: The name of the language.
  - **`code`** *[Optional ; Not repeatable ; String]*: The code of the language. Numeric codes must be entered as strings.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
      language = list(
        list(name = "English", code = "EN"),
        list(name = "French",  code = "FR")
      ),
      
      # ...
  )
  # ...
)
```


- **`production_date`**: The date, in ISO 8601 format (YYYY-MM-DD), the project was completed (this refers to the version that is being documented and released).

- **`version_statement`** *[Optional ; Repeatable]*: This repeatable block of four elements is used to list and describe the successive versions of the project.

```json
"version_statement": {
    "version": "string",
    "version_date": "string",
    "version_resp": "string",
    "version_notes": "string"
}
```


  - **`version`** *[Optional ; Not repeatable ; String]*: A label describing the version, for example "Version 1.2". The version must be entered as a string, even when composed only of numbers.
  - **`version_date`** *[Optional ; Not repeatable ; String]*: Date (in ISO 8601 format, YYYY-MM-DD) the version was released.
  - **`version_resp`** *[Optional ; Not repeatable ; String]*: Person(s) or organization(s) responsible for this version.
  - **`version_notes`** *[Optional ; Not repeatable ; String]*: Additional information on the version, if any; it is good practice to describe what distinguishes this version from the previous one(s).
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    version_statement = list(
      
      list(version = "v1.0", 
           version_date = "2021-12-27",
           version_resp = "University of Popstan, Department of Economics",
           version_notes = "First version approved for open dissemination")
      
    ), 
    
    # ...
  )
)  
```


- **`errata`** *[Optional ; Repeatable]*: This field is used to list and describe errata.

```json
"errata": [
    {
        "date": "string",
        "description": "string"
    }
]
```


  - **`date`** *[Optional ; Not repeatable ; String]*: Date (in ISO 8601 format, YYYY-MM-DD) the erratum was released.
  - **`description`** *[Optional ; Not repeatable ; String]*: Description of the error(s) and measures taken to address it/them.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    errata = list(
      list(date = "2021-10-30", 
           description = "Outliers in the data for Afghanistan resulted in unrealistic model estimates of the food prices for January 2020. In the latest version of the 'model.R' script, outliers are detected and dropped from the input data file. The published dataset has been updated."
      )
    ),
    
    # ...
  )
)  
```


- **`process`** *[Optional ; Repeatable]*: This element is used to document the life cycle of the research project, from its design and inception to its conclusion. This can include phases of fundraising, IRB review, concept note review, data acquisition, analysis, publication of a working paper, peer review, publication in a journal, presentation at conferences, evaluation, reporting to sponsors, etc. It is recommended to list these steps in chronological order.

```json
"process": [
    {
        "name": "string",
        "date_start": "string",
        "date_end": "string",
        "description": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: A header for the phase of the process.
  - **`date_start`** *[Optional ; Not repeatable ; String]*: Date the phase started (preferably in ISO 8601 format, YYYY-MM-DD).
  - **`date_end`** *[Optional ; Not repeatable ; String]*: Date the phase ended (preferably in ISO 8601 format, YYYY-MM-DD).
  - **`description`** *[Optional ; Not repeatable ; String]*: A brief description of the phase.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    process = list(
      
      list(name = "Presentation of the concept note at the Review Committee decision meeting", 
           date_start = "2018-02-23",
           date_end = "2018-02-23",
           description = "Presentation of the research objectives and method by the primary investigator to the Review Committee, which resulted in the approval of the concept note."
      ),
      
      list(name = "Fundraising", 
           date_start = "2018-02-24",
           date_end = "2018-02-28",
           description = "Discussion with project sponsors, and conclusion of the funding agreement."
      ),
      
      list(name = "Data acquisition and analytics", 
           date_start = "2018-03-15",
           date_end = "2019-01-30",
           description = "Implementation of web scraping, then data analysis"
      ),  
      
      list(name = "Working paper", 
           date_start = "2019-01-30",
           date_end = "2019-02-25",
           description = "Production (and copy editing) of the working paper"
      ),
      
      list(name = "Presentation to conferences", 
           date_start = "2019-04-12",
           date_end = "2019-04-12",
           description = "Presentation of the paper by the primary investigator at the ... conference, London"
      ),
      
      list(name = "Curation and dissemination of data and code", 
           date_start = "2019-02-25",
           date_end = "2019-03-18",
           description = "Data and script documentation, and publishing in the National Microdata Library"
      )
      
    ),
    
    # ...
  )
)  
```


- **`authoring_entity`** *[Optional ; Repeatable]*: This section identifies the person(s) and/or organization(s) in charge of the intellectual content of the research project, and specifies their respective roles.

```json
"authoring_entity": [
    {
        "name": "string",
        "role": "string",
        "affiliation": "string",
        "abbreviation": "string",
        "email": "string",
        "author_id": []
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: Name of the person or organization responsible for the research project.
  - **`role`** *[Optional ; Not repeatable ; String]*: Specific role of the person or organization mentioned in `name`.
  - **`affiliation`** *[Optional ; Not repeatable ; String]*: Agency or organization affiliation of the author/primary investigator mentioned in `name`.
  - **`abbreviation`** *[Optional ; Not repeatable ; String]*: Abbreviation used to identify the agency stated under `affiliation`.
  - **`email`** *[Optional ; Not repeatable ; String]*: Depending on the agency policies, a researcher may provide a personal email or an agency email to field inquiries related to the project.
  - **`author_id`** *[Optional ; Repeatable]*: A block of two elements used to provide unique identifiers of the authors, as provided by different registries of researchers. For example, this can be an ORCID number (ORCID is a non-profit organization supported by a global community of member organizations, including research institutions, publishers, sponsors, professional associations, service providers, and other stakeholders in the research ecosystem).
    - **`type`** *[Optional ; Not repeatable ; String]*: The type of ID, for example "ORCID".
    - **`id`** *[Required ; Not repeatable ; String]*: A unique identification number/code for the authoring entity, entered as a string.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    authoring_entity = list(
      
      list(name = "", 
           role = "",
           affiliation = "",
           email = "",
           author_id = list(
             list(type = "ORCID", id = "")
          )   
      )
      
    ),
    
    # ...
  )
)  
```


- **`contributors`** *[Optional ; Repeatable]*: This section is provided to record other contributors to the research project and provide recognition for the roles they played.

```json
"contributors": [
    {
        "name": "string",
        "role": "string",
        "affiliation": "string",
        "abbreviation": "string",
        "email": "string",
        "url": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: Name of the person, corporate body, or agency contributing to the intellectual content of the project (other than the PI). If a person, invert first and last name and use commas.
  - **`role`** *[Optional ; Not repeatable ; String]*: Title or role of the person (if any) responsible for the work's substantive and intellectual content.
  - **`affiliation`** *[Optional ; Not repeatable ; String]*: Agency or organization affiliation of the contributor.
  - **`abbreviation`** *[Optional ; Not repeatable ; String]*: Abbreviation used to identify the agency stated under `affiliation`.
  - **`email`** *[Optional ; Not repeatable ; String]*: Depending on the agency policies, a researcher may provide a personal email or an agency email to field inquiries related to the project.
  - **`url`** *[Optional ; Not repeatable ; String]*: The URL that provides information on the contributor or their affiliation.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    contributors = list(
      list(name = "", 
           role = "",
           affiliation = "",
           email = ""
      )
    ),
    
    # ...
  )
)  
```


- **`sponsors`** *[Optional ; Repeatable]*: The source(s) of funds for the production of the work. If different funding agencies sponsored different stages of the production process, use the `role` attribute to distinguish them.

```json
"sponsors": [
    {
        "name": "string",
        "abbreviation": "string",
        "role": "string",
        "grant_no": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: Name of the funding agency/sponsor.
  - **`abbreviation`** *[Optional ; Not repeatable ; String]*: Abbreviation of the funding/sponsoring agency.
  - **`role`** *[Optional ; Not repeatable ; String]*: Specific role of the funding/sponsoring agency.
  - **`grant_no`** *[Optional ; Not repeatable ; String]*: Grant or award number.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    sponsors = list(
      
      list(name = "ABC Foundation", 
           abbreviation = "ABCF",
           role = "Purchase of the data",
           grant_no = "ABC_001_XYZ"
      ),
      
      list(name = "National Research Foundation", 
           abbreviation = "NRF",
           role = "Funding of staff and research assistant costs, and variable costs for participation in conferences",
           grant_no = "NRF_G01"
      )
      
    ),
    
    # ...
  )
)  
```


- **`curators`** *[Optional ; Repeatable]*: A list of persons and/or organizations in charge of curating the resources associated with the project.

```json
"curators": [
    {
        "name": "string",
        "role": "string",
        "affiliation": "string",
        "abbreviation": "string",
        "email": "string",
        "url": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: The name of the person or organization.
  - **`role`** *[Optional ; Not repeatable ; String]*: The specific role of the person or organization in the curation of the project resources.
  - **`affiliation`** *[Optional ; Not repeatable ; String]*: The affiliation of the person or organization.
  - **`abbreviation`** *[Optional ; Not repeatable ; String]*: An acronym of the organization, if an organization was entered in `name`.
  - **`email`** *[Optional ; Not repeatable ; String]*: The email address of the person or organization. The use of personal email addresses must be avoided.
  - **`url`** *[Optional ; Not repeatable ; String]*: A link to the website of the person or organization.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    curators = list(
      
      list(name = "National Data Archive of Popstan", 
           role = "Documentation, preservation and dissemination of the data and reproducible code",
           email = "helpdesk@nda. ...",
           url = "popstan_nda.org"
      )
      
    ),
    
    # ...
  )
)  
```


- **`reviews_comments`** *[Optional ; Repeatable]*: Many research projects will be subject to a review process, which may happen at different stages of the project implementation (from design to review of the final output). This block is intended to document the comments received from reviewers during this process. It is a repeatable block of metadata elements, which can be used to document comments with a fine granularity.

```json
"reviews_comments": [
    {
        "comment_date": "string",
        "comment_by": "string",
        "comment_description": "string",
        "comment_response": "string"
    }
]
```


  - **`comment_date`** *[Optional ; Not repeatable ; String]*: The date the comment was provided, in ISO 8601 format (YYYY-MM-DD or YYYY-MM).
  - **`comment_by`** *[Optional ; Not repeatable ; String]*: The name of the person or organization that provided the comment.
  - **`comment_description`** *[Optional ; Not repeatable ; String]*: The comment itself, in its original formulation or in a summary version.
  - **`comment_response`** *[Optional ; Not repeatable ; String]*: The response provided by the research team/person to the comment, in its original formulation or in a summary version.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    reviews_comments = list(
      list(comment_date = "", 
           comment_by = "",
           comment_description = "",
           comment_response = ""
      )
    ),
    # ...
  )
)  
```


- **`acknowledgments`** *[Optional ; Repeatable]*: This repeatable block of elements is used to provide an itemized list of persons and organizations whose contribution to the project must be acknowledged. Note that specific metadata elements are available for listing financial sponsors and main contributors to the study. An alternative to this field is the `acknowledgment_statement` field (see below), which can be used to provide the acknowledgment in the form of unstructured text.

```json
"acknowledgments": [
    {
        "name": "string",
        "affiliation": "string",
        "role": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: The name of the person or agency being recognized for supporting the project.
  - **`affiliation`** *[Optional ; Not repeatable ; String]*: The affiliation of the person or agency being acknowledged.
  - **`role`** *[Optional ; Not repeatable ; String]*: A brief description of the role of the person or agency being recognized or acknowledged for supporting the project.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    acknowledgments = list(
      list(name = "", 
           affiliation = "",
           role = ""
      ),
      list(name = "", 
           affiliation = "",
           role = ""
      )
    ),
    # ...
  )
)  
```


- **`acknowledgment_statement`** *[Optional ; Not repeatable ; String]*: This field is used to provide acknowledgments in the form of unstructured text. An alternative to this field is the `acknowledgments` field, which provides a solution to itemize the acknowledgments.

- **`disclaimer`** *[Optional ; Not repeatable ; String]*: Disclaimers limit the responsibility or liability of the publishing organization or researchers associated with the research project. Disclaimers assure that any research in the public domain produced by an organization has limited repercussions for the publishing organization. A disclaimer is intended to prevent liability from any effects occurring as a result of the acts or omissions in the research.

- **`confidentiality`** *[Optional ; Not repeatable ; String]*: A confidentiality statement binds the publisher to ethical considerations regarding the subjects of the research. In most cases, the identity of an individual who is the subject of research cannot be released, and special effort is required to assure the preservation of privacy.

- **`citation_requirement`** *[Optional ; Not repeatable ; String]*: The citation requirement is specific to the output; it indicates the preferred shorthand or means to refer to the publication or published good. A brief illustration of these four elements is provided below.
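As an illustration, a minimal sketch (in R) of how these four free-text elements could be filled; the statements shown are hypothetical:

```r
# Illustrative only: all statements are hypothetical
my_project = list(
  # ... ,
  project_desc = list(
    # ... ,
    acknowledgment_statement = "The authors are grateful to the staff of the National Data Center of Popstan for their support in curating the data and code.",
    disclaimer = "The findings, interpretations, and conclusions expressed in this work are those of the authors and do not necessarily reflect the views of their organizations.",
    confidentiality = "The input microdata are confidential; only anonymized extracts are distributed with the reproducibility package.",
    citation_requirement = "Doe, J. (2021). Predicting Food Crises - Econometric Model. Reproducibility package, version 1.0.",
    # ...
  )
)
```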
- **`related_projects`** *[Optional ; Repeatable]*: The objective of this block is to provide links (URLs) to other, related projects, which can be documented and disseminated in the same catalog or in any other location on the internet.

```json
"related_projects": [
    {
        "name": "string",
        "uri": "string",
        "note": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: The name (title) of the related project.
  - **`uri`** *[Optional ; Not repeatable ; String]*: A link (URL) to the related project web page.
  - **`note`** *[Optional ; Not repeatable ; String]*: A brief description or other relevant information on the related project.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    related_projects = list(
      list(name = "", 
           uri = "", 
           note = "")
    ),
    # ...
  )
)  
```


- **`geographic_units`** *[Optional ; Repeatable]*: The geographic areas covered by the project. When the project relates to one or more countries, or to part of one or more countries, it is important to provide the country name. This means that for a project related to a specific province or town of a country, the country name will be entered in addition to the province or town (as separate entries in this repeatable block of elements). Note that the area does not have to be an administrative area; it can for example be an ocean.

```json
"geographic_units": [
    {
        "name": "string",
        "code": "string",
        "type": "string"
    }
]
```


  - **`name`** *[Optional ; Not repeatable ; String]*: The name of the geographic area.
  - **`code`** *[Optional ; Not repeatable ; String]*: The code of the geographic area. For countries, it is recommended to use the ISO 3166 country codes and names.
  - **`type`** *[Optional ; Not repeatable ; String]*: The type of geographic area.
```r
my_project = list(
  # ... ,
  project_desc = list(
      # ... ,
    
    geographic_units = list(
      list(name = "India",     code = "IND", type = "Country"),
      list(name = "New Delhi",               type = "City"),
      list(name = "Kerala",                  type = "State"),
      list(name = "Nepal",     code = "NPL", type = "Country"),
      list(name = "Kathmandu",               type = "City")
    ),
    
    # ...
  )
)  
```


- **`keywords`** *[Optional ; Repeatable]*: A list of keywords that provide information on the core scope and objectives of the research project. Keywords provide a convenient solution to improve the discoverability of the research, as they allow terms and phrases not found elsewhere in the metadata to be indexed and make the project discoverable by text-based search engines. A controlled vocabulary will preferably be used (although this is not required), such as the UNESCO Thesaurus. The list provided here can combine keywords from multiple controlled vocabularies and user-defined keywords.

```json
"keywords": [
    {
        "name": "string",
        "vocabulary": "string",
        "uri": "string"
    }
]
```

  - **`name`** *[Required ; Not repeatable ; String]*: The keyword itself.
  - **`vocabulary`** *[Optional ; Not repeatable ; String]*: The controlled vocabulary (including version number or date) from which the keyword is extracted, if any.
  - **`uri`** *[Optional ; Not repeatable ; String]*: The URL of the controlled vocabulary from which the keyword is extracted, if any.
```r
my_project <- list(
  # ... ,
  project_desc = list(
    # ... ,
    
    keywords = list(
    
      list(name = "Migration", 
           vocabulary = "Unesco Thesaurus (June 2021)", 
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
      
      list(name = "Migrants", 
           vocabulary = "Unesco Thesaurus (June 2021)", 
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
      
      list(name = "Refugee", 
           vocabulary = "Unesco Thesaurus (June 2021)", 
           uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
           
      list(name = "Conflict"),
      list(name = "Asylum seeker"),
      list(name = "Forced displacement"),
      list(name = "Forcibly displaced"),
      list(name = "Internally displaced population (IDP)"),
      list(name = "Population of concern (PoC)"),
      list(name = "Returnee"),
      list(name = "UNHCR")
    ),
    
    # ...
  ),
  # ... 
)  
```


+
    +
  • themes [Optional ; Repeatable]
    +
  • +
+
"themes": [
+    {
+        "id": "string",
+        "name": "string",
+        "parent_id": "string",
+        "vocabulary": "string",
+        "uri": "string"
+    }
+]
+


+

A list of themes covered by the research project. A controlled vocabulary will preferably be used. Note that themes will rarely be used as the elements topics and disciplines are more appropriate for most uses. This is a block of five fields:

  • id [Optional ; Not repeatable ; String]
    The ID of the theme, taken from a controlled vocabulary.
  • name [Required ; Not repeatable ; String]
    The name (label) of the theme, preferably taken from a controlled vocabulary.
  • parent_id [Optional ; Not repeatable ; String]
    The parent ID of the theme (ID of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
  • vocabulary [Optional ; Not repeatable ; String]
    The name (including version number) of the controlled vocabulary used, if any.
  • uri [Optional ; Not repeatable ; String]
    The URL to the controlled vocabulary used, if any.
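No worked example accompanies the themes element here; as an illustration, a minimal hypothetical sketch in R, following the pattern of the other examples in this chapter (the theme name, vocabulary, and URI are purely illustrative), could look like this:

```r
# Minimal sketch of a themes block (illustrative values only)
my_project <- list(
  project_desc = list(
    themes = list(
      list(id         = "4",
           name       = "Forced displacement",
           vocabulary = "Agency-specific theme classification",
           uri        = "http://example.org/themes")
    )
  )
)
```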

  • topics [Optional ; Repeatable]
+
"topics": [
+    {
+        "id": "string",
+        "name": "string",
+        "parent_id": "string",
+        "vocabulary": "string",
+        "uri": "string"
+    }
+]
+


+

Information on the topics covered in the research project. A controlled vocabulary will preferably be used, for example the CESSDA Topics classification (a typology of topics available in 11 languages), the Journal of Economic Literature (JEL) Classification System, or the World Bank topics classification. Note that you may use more than one controlled vocabulary. This element is a block of five fields:

  • id [Optional ; Not repeatable ; String]
    The identifier of the topic, taken from a controlled vocabulary.
  • name [Required ; Not repeatable ; String]
    The name (label) of the topic, preferably taken from a controlled vocabulary.
  • parent_id [Optional ; Not repeatable ; String]
    The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
  • vocabulary [Optional ; Not repeatable ; String]
    The name (including version number) of the controlled vocabulary used, if any.
  • uri [Optional ; Not repeatable ; String]
    The URL to the controlled vocabulary used, if any.

+
my_project = list(
+  # ... ,
+  
+  project_desc = list(
+      # ... ,
+
+    topics = list(
+      
+      list(name = "Demography.Migration", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      
+      list(name = "Demography.Censuses", 
+           vocabulary = "CESSDA Topic Classification", 
+           uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
+      
+      list(id = "F22", 
+           name = "International Migration", 
+           parent_id = "F2 - International Factor Movements and International Business", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+      
+      list(id = "O15", 
+           name = "Human Resources - Human Development - Income Distribution - Migration", 
+           parent_id = "O1 - Economic Development", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+      
+      list(id = "O12", 
+           name = "Microeconomic Analyses of Economic Development", 
+           parent_id = "O1 - Economic Development", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
+      
+      list(id = "J61", 
+           name = "Geographic Labor Mobility - Immigrant Workers", 
+           parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers", 
+           vocabulary = "JEL Classification System", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
+    ),
+    
+    # ...
+  )
+)  
+


  • disciplines [Optional ; Repeatable]
+
"disciplines": [
+    {
+        "id": "string",
+        "name": "string",
+        "parent_id": "string",
+        "vocabulary": "string",
+        "uri": "string"
+    }
+]
+


Information on the academic disciplines related to the content of the research project. A controlled vocabulary will preferably be used, for example the one provided by the list of academic fields in Wikipedia. This is a block of five elements:

  • id [Optional ; Not repeatable ; String]
    The identifier of the discipline, taken from a controlled vocabulary.
  • name [Optional ; Not repeatable ; String]
    The name (label) of the discipline, preferably taken from a controlled vocabulary.
  • parent_id [Optional ; Not repeatable ; String]
    The parent identifier of the discipline (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.
  • vocabulary [Optional ; Not repeatable ; String]
    The name (including version number) of the controlled vocabulary used, if any.
  • uri [Optional ; Not repeatable ; String]
    The URL to the controlled vocabulary used, if any.
+
  my_project <- list(
+    # ... ,
+    
+    project_desc = list(
+      # ... ,  
+      
+      disciplines = list(
+        
+        list(name = "Economics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+             
+        list(name = "Agricultural economics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields"),
+        
+        list(name = "Econometrics", 
+             vocabulary = "Wikipedia List of academic fields (as of 21 June 2021)", 
+             uri = "https://en.wikipedia.org/wiki/List_of_academic_fields")
+             
+      ),
+      
+      # ...
+    ),
+    # ... 
+  )  
+


  • repository_uri
    In the process of producing the outputs of the research project, a researcher may want to share their source code for transparency and replicability. This element provides the information needed to find the repository where the source code is kept.
+


+
"repository_uri": [
+    {
+        "name": "string",
+        "type": "string",
+        "uri": null
+    }
+]
+


  • name [Optional ; Not repeatable ; String]
    Name of the repository where the code is hosted.
  • type [Optional ; Not repeatable ; String]
    Repository type, e.g., GitHub, Bitbucket, etc.
  • uri [Required ; Not repeatable ; String]
    URI of the project source code/script repository.
+
my_project = list(
+  # ... ,
+  
+  project_desc = list(
+      # ... ,
+    
+    repository_uri = list(
+      list(name = "A comparative assessment of machine learning classification algorithms applied to poverty prediction", 
+           type = "GitHub public repo", 
+           uri = "https://github.com/worldbank/ML-classification-algorithms-poverty")
+    ),
+    
+    # ...
+  )
+)  
+


  • license [Optional ; Repeatable]
    Information on the license(s) attached to the research project resources, which define their terms of use.
"license": [
+    {
+        "name": "string",
+        "uri": "string"
+    }
+]
+


  • name [Required ; Not repeatable ; String]
    The name of the license.
  • uri [Optional ; Not repeatable ; String]
    The URL of the license, where detailed information on the license can be obtained.
  • note [Optional ; Not repeatable ; String]
    Additional information on the license.
+
my_project <- list(
+  # ... ,
+  project_desc = list(
+    # ... ,
+    
+    license = list(
+    
+      list(name = "Attribution 4.0 International (CC BY 4.0)", 
+           uri  = "https://creativecommons.org/licenses/by/4.0/")
+           
+    ),
+    
+    # ...
+  ),
+  # ... 
+)  
+


  • copyright [Optional ; Not repeatable ; String]
    Information on the copyright, if any, that applies to the research project metadata.
  • technology_environment [Optional ; Not repeatable ; String]
    This field is used to provide a description (as detailed as possible) of the computational environment under which the scripts were implemented and are expected to be reproducible. A substantial challenge in reproducing analyses is installing and configuring the web of dependencies of specific versions of various analytical tools. "Virtual machines (a computer inside a computer) enable you to efficiently share your entire computational environment with all the dependencies intact." (https://ropensci.github.io/reproducibility-guide/sections/introduction/)
  • technology_requirements [Optional ; Not repeatable ; String]
    Software/hardware or other technology requirements needed to run the scripts and replicate the outputs.
  • reproduction_instructions [Optional ; Not repeatable ; String]
    Instructions to secondary analysts who may want to reproduce the scripts. A sketch of how these free-text elements can be filled is shown below.
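As an illustration of these free-text elements, a minimal hypothetical sketch in R (all values are invented for the example) could be:

```r
# Minimal sketch of the simple free-text elements of project_desc (illustrative values only)
my_project <- list(
  project_desc = list(
    copyright                 = "(c) 2022, National Data Center",
    technology_environment    = "Windows 10 (64-bit), 16 GB RAM, R 4.0.2",
    technology_requirements   = "R version 4.0.2 or later; no specific hardware requirement",
    reproduction_instructions = "Run the master script 00_master.R; it calls all other scripts in the required sequence."
  )
)
```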

  • methods [Optional ; Repeatable]
    A list of analytic, statistical, econometric, or machine learning methods used in the project. The objective is to allow users to find projects based on a search on the methods applied, e.g., to answer a query like "poverty prediction using random forest".
+
"methods": [
+    {
+        "name": "string",
+        "note": "string"
+    }
+]
+


  • name [Required ; Not repeatable ; String]
    A short name for the method being described.
  • note [Optional ; Not repeatable ; String]
    Any additional information on the method.
+
my_project = list(
+  # ... ,
+  project_desc = list(
+      # ... ,
+    
+    methods = list(
+      
+      list(name = "linear regression", 
+           note = "Implemented using R package 'stats'"),
+      
+      list(name = "random forest", 
+           note = "Used for both regression and classification"),
+      
+      list(name = "lasso regression (least asolute shrinkage and selection operator)", 
+           note = "Implemented using R package glmnet"),
+      
+      list(name = "gradient boosting machine (GBM)"),
+      
+      list(name = "cross validation"),
+      
+      list(name = "mean square error, quadratic loss, L2 loss", 
+           note = "Loss functions used to fit models")
+      
+    ),
+    
+    # ...
+  )
+)  
+


  • software [Optional ; Repeatable]
    This field is used to list the software and the specialized libraries/packages that were used to implement the project and that are required to reproduce the scripts. The libraries loaded by the scripts (e.g., by the R require or library command) are included, but not all their own dependencies, which are assumed to be installed automatically.
+
"software": [
+    {
+        "name": "string",
+        "version": "string",
+        "library": [
+            "string"
+        ]
+    }
+]
+


  • name [Required ; Not repeatable ; String]
    The name of the software.
  • version [Optional ; Not repeatable ; String]
    The version of the software.
  • library [Optional ; Repeatable]
    A list of libraries/packages required to run the scripts. Note that the specific version of each package is not documented here; it is expected to be found in the script or in the reproduction instructions.
+
my_project = list(
+  # ... ,
+  project_desc = list(
+      # ... ,
+    
+    software = list(
+      
+      list(name    = "R", 
+           version = "4.0.2",
+           library = list("caret", "dplyr", "ggplot2"),
+      
+      list(name    = "Stata", 
+           version = "15"),
+      
+      list(name    = "Python", 
+           version = "3.7 (Anaconda install)",
+           library = list("pandas", "scikit-learn")
+      
+    ),
+    
+    # ...
+  )
+)  
+


  • scripts [Optional ; Repeatable]
    This field is used to describe the scripts written by the project authors. All scripts are expected to have been written using software listed in the field software.
+
"scripts": [
+    {
+        "file_name": "string",
+        "zip_package": "string",
+        "title": "string",
+        "authors": [
+            {
+                "name": "string",
+                "affiliation": "string",
+                "role": "string"
+            }
+        ],
+        "date": "string",
+        "format": "string",
+        "software": "string",
+        "description": "string",
+        "methods": "string",
+        "dependencies": "string",
+        "instructions": "string",
+        "source_code_repo": "string",
+        "notes": "string",
+        "license": [
+            {
+                "name": "string",
+                "uri": "string",
+                "note": "string"
+            }
+        ]
+    }
+]
+


  • file_name [Optional ; Not repeatable ; String]
    Name of the script file (for R users, this will typically include files with extension [.R], for Stata users it will be files with extension [.do], for Python users …). This can also include other files related to and required to run the scripts (for example lookup CSV files, etc.) It does not include the data files, which are described in a specific field.
  • zip_package [Optional ; Not repeatable]
    If the script files have been saved as or in a compressed file (zip, rar, or equivalent), we provide here the name of the zip file containing the script.
  • title [Optional ; Not repeatable ; String]
    A title (label) given to the script file.
  • authors [Optional ; Repeatable]
    This is a repeatable block that allows entering a list of authors and co-authors of a script.
      • name [Optional ; Not repeatable ; String]
        Name of the author (person or organization) of the script.
      • affiliation [Optional ; Not repeatable ; String]
        The affiliation of the author.
      • role [Optional ; Not repeatable ; String]
        Specific role of the person or organization in the production of the script.
  • date [Optional ; Not repeatable ; String]
    Date the script was produced, in ISO 8601 format (YYYY-MM-DD).
  • format [Optional ; Not repeatable ; String]
    File format.
  • software [Optional ; Not repeatable ; String]
    Software used to run the script.
  • description [Optional ; Not repeatable ; String]
    Brief description of the script.
  • methods [Optional ; Not repeatable ; String]
    Statistical/analytic methods included in the script.
  • dependencies [Optional ; Not repeatable ; String]
    Any dependencies (packages/libraries) that the script relies on. This field is not needed if dependencies were described in the library element.
  • instructions [Optional ; Not repeatable ; String]
    Instructions for running the script. Information on the sequence in which the scripts must be run is critical.
  • source_code_repo [Optional ; Not repeatable ; String]
    Repository (e.g., GitHub repo) where the script has been published.
  • notes [Optional ; Not repeatable ; String]
    Any additional information on the script.
  • license [Optional ; Repeatable]
    License, if any, under which the script is published.
      • name [Optional ; Not repeatable ; String]
        Name (label) of the license.
      • uri [Optional ; Not repeatable ; String]
        License URI.
+
my_project = list(
+  # ... ,
+  project_desc = list(
+      # ... ,
+    
+    scripts = list(
+      
+      list(file_name = "00_script.R", 
+           zip_package = "all_scripts.zip", 
+           title = "Project X - Master script", 
+           authors = list(name = "John Doe", 
+                          affiliation = "IHSN", 
+                          role = "Writing, testing and documenting the script"),
+           date = "2020-12-27",
+           format = "R script",
+           software = "R x64 4.0.2",
+           description = "Master script for automated reproduction of the analysis. Calls all other scripts in proper sequence to reproduce the full analysis.",
+           methods = "box-cox transformation of data",
+           dependencies = "",
+           instructions = "",
+           source_code_repo = "",
+           notes = "",
+           license = list(name = "CC BY 4.0", 
+                          uri = "https://creativecommons.org/licenses/by/4.0/deed.ast")),
+      
+      list(file_name = "01_regression.R", 
+           zip_package = "", 
+           title = "Charts and maps", 
+           authors = list(name = "", 
+                          affiliation = "", 
+                          role = ""),
+           date = "",
+           format = "R script",
+           software = "R",
+           description = "This script runs all linear regressions and PCA presented in the working paper.",
+           methods = "linear regression; principal component analysis",
+           dependencies = "",
+           instructions = "",
+           source_code_repo = "",
+           notes = "",
+           license = list(list(name = "CC BY 4.0", 
+                               uri = "https://creativecommons.org/licenses/by/4.0/deed.ast"))),
+      
+      list(file_name = "02_visualization", 
+           zip_package = "", 
+           title = "", 
+           authors = list(name = "", 
+                          abbr = "", 
+                          role = ""),
+           date = "",
+           format = "",
+           software = "",
+           description = "",
+           instructions = "",
+           source_code_repo = "",
+           notes = "",
+           license = list(list(name = "CC BY 4.0", 
+                               uri = "https://creativecommons.org/licenses/by/4.0/deed.ast")))
+      
+    ),
+    # ...
+  )
+)  
+


  • data_statement [Optional ; Not repeatable ; String]
    An overall statement on the data used in the project. A separate field is provided to list and document the origin and key characteristics of the datasets.
  • datasets [Optional ; Repeatable]
    This field is used to provide an itemized list of datasets used in the project. The data are not documented here (specific metadata standards are available for documenting data of different types, like the DDI for microdata or the ISO 19139 for geographic datasets).
+
"datasets": [
+    {
+        "name": "string",
+        "idno": "string",
+        "note": "string",
+        "access_type": "string",
+        "license": "string",
+        "license_uri": "string",
+        "uri": "string"
+    }
+]
+


  • name [Optional ; Not repeatable ; String]
    The dataset name (title).
  • idno [Optional ; Not repeatable ; String]
    The unique identifier of the dataset.
  • note [Optional ; Not repeatable ; String]
    A brief description of the dataset.
  • access_type [Optional ; Not repeatable ; String]
    The access policy applied to the dataset.
  • license [Optional ; Not repeatable ; String]
    The access license that applies to the dataset.
  • license_uri [Optional ; Not repeatable ; String]
    The URL of a web page where more information on the license can be obtained.
  • uri [Optional ; Not repeatable ; String]
    The URI where the dataset (or a detailed description of it) can be obtained.
+
my_project = list(
+  # ... ,
+  project_desc = list(
+      # ... ,
+    
+    datasets = list(
+      
+      list(name = "Multiple Indicator Cluster Survey 2019, Round 6, Chad", 
+           idno = "TCD_2019_MICS_v01_M", 
+           uri  = "https://microdata.worldbank.org/index.php/catalog/4150"),
+      
+      list(name = "World Bank Group Country Survey 2018, Chad", 
+           idno = "TCD_2018_WBCS_v01_M", 
+           access_type = "Public access", 
+           uri = "https://microdata.worldbank.org/index.php/catalog/3058")
+      
+    ),
+    # ...
+  )
+)  
+


  • contacts [Optional ; Repeatable]
    The contacts element provides the public interface for questions associated with the research project. There could be various contacts provided depending upon the organization. It is important to ensure that the proper contacts are provided to channel public inquiries.
+
"contacts": [
+    {
+        "name": "string",
+        "role": "string",
+        "affiliation": "string",
+        "email": "string",
+        "telephone": "string",
+        "uri": "string"
+    }
+]
+


  • name [Required ; Not repeatable ; String]
    The name of the contact person that should be contacted, depending on the role defined below.
  • role [Optional ; Not repeatable ; String]
    Role of the contact person. A research project may have several contact persons, depending on the output or on some of the technical input. Some complex projects may have various data collection processes with different processing channels and contacts. This section should provide a key primary public interface that can refer public inquiries, or provide a collection of entry points.
  • affiliation [Optional ; Not repeatable ; String]
    The organization or affiliation of the contact person. This is usually the organization that the contact person represents.
  • email [Optional ; Not repeatable ; String]
    Email address of the responsible person, institution, or division in charge of the research project or output.
  • telephone [Optional ; Not repeatable ; String]
    Phone number of the responsible institution or division of the research project or output.
  • uri [Optional ; Not repeatable ; String]
    The URI of the agency or organization of the contact. This may be the same as the web page of the project, or a permanent contact point at an institutional level that is not project related. A project website may eventually be removed while a contact point is still needed; in such cases, it is recommended to provide a permanent contact.
+
my_project = list(
+  # ... ,
+  project_desc = list(
+      # ... ,
+    
+    contacts = list(
+      
+      list(name = "Data helpdesk", 
+           affiliation = "National Data Center", 
+           role = "Support to data users", 
+           uri = "helpdesk@ndc. ...")
+    ),
+    
+    # ...
+  )
+)  
+


+
+
+

12.4.3 Provenance

+

provenance [Optional ; Repeatable]
Metadata can be programmatically harvested from external catalogs. The provenance group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been made to the harvested metadata.

+
"provenance": [
+    {
+        "origin_description": {
+            "harvest_date": "string",
+            "altered": true,
+            "base_url": "string",
+            "identifier": "string",
+            "date_stamp": "string",
+            "metadata_namespace": "string"
+        }
+    }
+]
+


  • origin_description [Required ; Not repeatable]
    The origin_description elements are used to describe when and from where metadata have been extracted or harvested.
      • harvest_date [Required ; Not repeatable ; String]
        The date and time the metadata were harvested, entered in ISO 8601 format.
      • altered [Optional ; Not repeatable ; Boolean]
        A boolean variable ("true" or "false"; "true" by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element idno in the Document Description / Title Statement section) will be modified when published in a new catalog.
      • base_url [Required ; Not repeatable ; String]
        The URL from where the metadata were harvested.
      • identifier [Optional ; Not repeatable ; String]
        The unique dataset identifier (idno element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The identifier element in provenance is used to maintain traceability.
      • date_stamp [Optional ; Not repeatable ; String]
        The date stamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).
      • metadata_namespace [Optional ; Not repeatable ; String]
        @@@@@@@
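No example is given above for the provenance block; a minimal hypothetical sketch in R (the source catalog URL, identifier, and dates are invented) could be:

```r
# Minimal sketch of a provenance block for harvested metadata (illustrative values only)
my_project <- list(
  provenance = list(
    list(
      origin_description = list(
        harvest_date = "2022-01-15T10:30:00Z",
        altered      = TRUE,
        base_url     = "https://catalog.example.org/index.php/api/",
        identifier   = "PRJ_2019_EXAMPLE_v01",
        date_stamp   = "2021-12-01"
      )
    )
  )
)
```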
+
+
+

12.4.4 Tags

+

tags [Optional ; Repeatable]
+As shown in section 1.7 of the Guide, tags, when associated with tag_groups, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R. +

+
"tags": [
+    {
+        "tag": "string",
+        "tag_group": "string"
+    }
+]
+


  • tag [Required ; Not repeatable ; String]
    A user-defined tag.
  • tag_group [Optional ; Not repeatable ; String]
    A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
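As a complement to the example referenced in section 1.7, a minimal hypothetical sketch of tags and tag groups in R (the tags and group names are illustrative) could be:

```r
# Minimal sketch of user-defined tags and tag groups (illustrative values only)
my_project <- list(
  tags = list(
    list(tag = "poverty",           tag_group = "topics"),
    list(tag = "social protection", tag_group = "topics"),
    list(tag = "South-East Asia",   tag_group = "regions")
  )
)
```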

  • lda_topics [Optional ; Not repeatable]
+
"lda_topics": [
+    {
+        "model_info": [
+            {
+                "source": "string",
+                "author": "string",
+                "version": "string",
+                "model_id": "string",
+                "nb_topics": 0,
+                "description": "string",
+                "corpus": "string",
+                "uri": "string"
+            }
+        ],
+        "topic_description": [
+            {
+                "topic_id": null,
+                "topic_score": null,
+                "topic_label": "string",
+                "topic_words": [
+                    {
+                        "word": "string",
+                        "word_weight": 0
+                    }
+                ]
+            }
+        ]
+    }
+]
+


+

We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or “augment”) metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of “clustering” words that are likely to appear in similar contexts (the number of “clusters” or “topics” is a parameter provided when training a model). Clusters of related words form “topics”. A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).

+

Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of a research project, this text will be a concatenation of selected metadata elements, including the project title, abstract, keywords, and possibly others. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).

+

The lda_topics element includes the following metadata fields. An example in R was provided in Chapter 4 - Documents.

  • model_info [Optional ; Not repeatable]
    Information on the LDA model.
      • source [Optional ; Not repeatable ; String]
        The source of the model (typically, an organization).
      • author [Optional ; Not repeatable ; String]
        The author(s) of the model.
      • version [Optional ; Not repeatable ; String]
        The version of the model, which could be defined by a date or a number.
      • model_id [Optional ; Not repeatable ; String]
        The unique ID given to the model.
      • nb_topics [Optional ; Not repeatable ; Numeric]
        The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).
      • description [Optional ; Not repeatable ; String]
        A brief description of the model.
      • corpus [Optional ; Not repeatable ; String]
        A brief description of the corpus on which the LDA model was trained.
      • uri [Optional ; Not repeatable ; String]
        A link to a web page where additional information on the model is available.
  • topic_description [Optional ; Repeatable]
    The topic composition extracted from selected elements of the project metadata (typically, the title, abstract, and keywords).
      • topic_id [Optional ; Not repeatable ; String]
        The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).
      • topic_score [Optional ; Not repeatable ; Numeric]
        The share of the topic in the metadata (%).
      • topic_label [Optional ; Not repeatable ; String]
        The label of the topic, if any (not automatically generated by the LDA model).
      • topic_words [Optional ; Not repeatable]
        The list of N keywords describing the topic (e.g., the top 5 words).
          • word [Optional ; Not repeatable ; String]
            The word.
          • word_weight [Optional ; Not repeatable ; Numeric]
            The weight of the word in the definition of the topic.
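A full example in R is provided in Chapter 4 - Documents; as a quick illustration, a highly simplified and hypothetical sketch of an lda_topics block (model information, topic scores, and word weights are invented) could be:

```r
# Highly simplified sketch of an lda_topics block (illustrative values only)
my_project <- list(
  lda_topics = list(
    list(
      model_info = list(
        list(source = "Example organization", version = "2021-06", nb_topics = 75)
      ),
      topic_description = list(
        list(topic_id    = "topic_27",
             topic_score = 0.32,
             topic_label = "forced displacement",
             topic_words = list(list(word = "migration", word_weight = 0.045),
                                list(word = "refugee",   word_weight = 0.038)))
      )
    )
  )
)
```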
  • embeddings [Optional ; Repeatable]
    In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in the implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).

    The word vectors do not have to be stored in the project metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database and processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by the same model.
+


+
"embeddings": [
+    {
+        "id": "string",
+        "description": "string",
+        "date": "string",
+        "vector": null
+    }
+]
+


+

The embeddings element contains four metadata fields:

+
  • id [Optional ; Not repeatable ; String]
    A unique identifier of the word embedding model used to generate the vector.
  • description [Optional ; Not repeatable ; String]
    A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
  • date [Optional ; Not repeatable ; String]
    The date the model was trained (or a version date for the model).
  • vector [Required ; Not repeatable ; @@@@]
    The numeric vector representing the project metadata.
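As an illustration, a minimal hypothetical sketch of an embeddings block in R (the model identifier and the vector values are invented, and real vectors are typically much longer) could be:

```r
# Minimal sketch of an embeddings block (illustrative values only; vector truncated)
my_project <- list(
  embeddings = list(
    list(id          = "word2vec_dev_corpus_2021",
         description = "Word embedding model trained on a corpus of development-related documents",
         date        = "2021-06-30",
         vector      = c(0.0132, -0.0917, 0.2205, 0.0744))
  )
)
```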
+
+
+

12.4.5 Additional

+

additional [Optional ; Not repeatable]
The additional element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the additional block; embedding them elsewhere in the schema would cause schema validation to fail.
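Assuming that additional sits at the same level as project_desc, as the section structure suggests, a minimal hypothetical sketch in R (the custom element names are purely illustrative) could be:

```r
# Minimal sketch of curator-defined elements nested in the additional block (illustrative names only)
my_project <- list(
  project_desc = list(
    title_statement = list(idno = "PRJ_001", title = "Example project")
  ),
  additional = list(
    internal_review_status = "approved",
    curation_team          = "Data Lab"
  )
)
```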

+
+
+
+

12.5 Generating compliant metadata

+

For this example of documenting and publishing reproducible research, we use the "Replication data for: Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia" published on the openICPSR website. The primary investigators for the project were Vivi Alatas, Abhijit Banerjee, Rema Hanna, Benjamin A. Olken, Ririn Purnamasari, and Matthew Wai-Poi.

+
+

A service of the Inter-university Consortium for Political and Social Research (ICPSR), openICPSR is a self-publishing repository for social, behavioral, and health sciences research data. openICPSR is particularly well-suited for the deposit of replication data sets for researchers who need to publish their raw data associated with a journal article so that other researchers can replicate their findings. (from OpenICPSR website)

+
+
+

12.5.1 Full example, using a metadata editor

+
+
+
+ +

image

+
+
+


+
+
+

12.5.2 Full example, using R

+
library(jsonlite)
+library(httr)
+library(dplyr)
+library(nadar)
+
+# ----credentials and catalog URL --------------------------------------------------
+my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
+set_api_key("my_keys[1,1")  
+set_api_url("https://.../index.php/api/") 
+set_api_verbose(FALSE)
+# ----------------------------------------------------------------------------------
+
+setwd("C:\my_project")       
+thumb = "elite_capture.JPG"  # Will be used as thumbnail in the data catalog
+
+id = "IDN_2019_ECTWP_v01_RR" 
+
+# Generate the metadata
+
+my_project_metadata <- list(
+  
+  # Information on metadata production
+  
+  doc_desc = list(
+    
+    producers = list(
+      list(name = "OD", affiliation = "National Data Center")
+    ),
+    
+    prod_date = "2022-01-15"
+    
+  ),
+  
+  # Documentation of the research project, and scripts
+  
+  project_desc = list(
+    
+    title_statement = list(
+      idno  = id,
+      title = "Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+      sub_title = "Reproducible scripts"
+    ),
+    
+    production_date = list("2019"),
+    
+    geographic_units = list(
+      list(name="Indonesia", code="IDN", type="Country")
+    ),
+    
+    authoring_entity   = list(
+      
+      list(name        = "Vivi Alatas", 
+           role        = "Primary investigator",
+           affiliation = "World Bank",
+           email       = "valatas@worldbank.org"),
+      
+      list(name        = "Abhijit Banerjee", 
+           role        = "Primary investigator",
+           affiliation = "Department of Economics, MIT",
+           email       = "banerjee@mit.edu"),
+      
+      list(name        = "Rema Hanna", 
+           role        = "Primary investigator",
+           affiliation = "Harvard Kennedy School",
+           email       = "rema_hanna@hks.harvard.edu"),
+      
+      list(name        = "Benjamin A. Olken", 
+           role        = "Primary investigator",
+           affiliation = "Department of Economics, MIT",
+           email       = "bolken@mit.edu"),
+      
+      list(name        = "Ririn Purnamasari", 
+           role        = "Primary investigator",
+           affiliation = "World Bank",
+           email       = "rpurnamasari@worldbank.org"),
+      
+      list(name        = "Matthew Wai-Poi", 
+           role        = "Primary investigator",
+           affiliation = "World Bank",
+           email       = "mwaipoi@worldbank.org")
+      
+    ),
+    
+    abstract = "This paper investigates how elite capture affects the welfare gains from targeted government transfer programs in Indonesia, using both a high-stakes field experiment that varied the extent of elite influence and nonexperimental data on a variety of existing government programs. While the relatives of those holding formal leadership positions are more likely to receive benefits in some programs, we argue that the welfare consequences of elite capture appear small: eliminating elite capture entirely would improve the welfare gains from these programs by less than one percent.",
+    
+    keywords = list(
+      list(name="proxy-means test (PMT)"),
+      list(name="experimental design")
+    ),
+    
+    topics = list(
+      
+      list(id="D72", 
+           name = "Political Processes: Rent-seeking, Lobbying, Elections, Legislatures, and Voting Behavior", 
+           vocabulary = "JEL codes", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php"), 
+      
+      list(id = "H53", 
+           name = "National Government Expenditures and Welfare Programs", 
+           vocabulary = "JEL codes", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php"),
+      
+      list(id = "I38", 
+           name = "Welfare, Well-Being, and Poverty: Government Programs; Provision and Effects of Welfare Programs", 
+           vocabulary = "JEL codes", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php"), 
+      
+      list(id = "O15", 
+           name = "Economic Development: Human Resources; Human Development; Income Distribution; Migration", 
+           vocabulary = "JEL codes", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php"), 
+      
+      list(id = "O17", 
+           name = "Formal and Informal Sectors; Shadow Economy Institutional Arrangements", 
+           vocabulary = "JEL codes", 
+           uri = "https://www.aeaweb.org/econlit/jelCodes.php")
+      
+    ),
+    
+    output_types = list(
+      
+      list(type  = "Article", 
+           title = "Does Elite Capture Matter Local Elites and Targeted Welfare Programs in Indonesia",
+           description = "AEA Papers and Proceedings 2019, 109: 334-339", 
+           uri = "https://doi.org/10.1257/pandp.20191047",
+           doi = "10.1257/pandp.20191047"),
+      
+      list(type = "Working Paper", 
+           title = "Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+           description = "NBER Working Paper No. 18798, February 2013", 
+           uri = "https://www.nber.org/papers/w18798")
+      
+    ),
+    
+    version_statement = list(version = "1.0", version_date  = "2019"),  
+    
+    language = list(
+      list(name = "English", code = "EN")
+    ),   
+    
+    methods = list(
+      list(name = "linear regression with large dummy-variable set (areg)"),
+      list(name = "probit regression"),
+      list(name = "Test linear hypotheses after estimation")
+    ),
+    
+    software  = list(
+      list(name= "Stata", version = "14")
+    ),
+    
+    reproduction_instructions = "The master do file should run start to finish in less than five minutes from the master do file '0MASTER 20190918.do'. Original data is in data-PUBLISH/originaldata and is all that is needed to run the code; all data in data-PUBLISH/codeddata is created from the coding do files. All results are then created and saved in output-PUBLISH/tables.
+  
+      Key Subfolders:
+      1. code-PUBLISH: This folder contains all relevant code. The master do file is located here ('0Master20190918.do') as well as the two folders that are necessary for the creation of datasets/coding ('coding_matching' folder) and for the analysis/table creation ('analysis' folder). Users should update the directory on the master file to reflect the location of the directory on their computers once downloaded. Following that, all the data and output files needed to replicate the main findings of the paper (Tables 1A-1D, Table 2 and the 4 Appendix Tables) will be generated. The sub do files provide specific notes on the variables created where relevant.
+      2. data-PUBLISH: This folder contains all relevant .dta files. The first folder, 'original data' contains the 'Baseline' folder that has the original baseline survey information. Under 'original data' you will also find the 'Others' folder with the randomization results, the 2008 PPLS data and the PODES 2008 village level administrative data. The 'Endline2' folder contains the endline survey information. These datasets have been modified only to mask sensitive information. Finally, the 'codeddata' folder that stores intermediate datasets that are created through the sub 'coding_matching' do files.
+      3. log-PUBLISH: This folder contains the latest log file. When users run the master do file, a new log file will automatically be created and stored here.
+      4. output-PUBLISH: This folder contains all the tables of the main paper and appendix. When users run the master do file, these tables will be automatically overwritten.",
+    
+    confidentiality = "The published materials do not contain confidential information.",
+    
+    datasets = list(
+      
+      list(name = "Village survey (original data; baseline)", 
+           idno = "", 
+           note = "Stata 14 data files", 
+           access_type = "Public", 
+           uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+      
+      list(name = "Village survey (original data; endline)", 
+           idno = "", 
+           note = "Stata 14 data files", 
+           access_type = "Public", 
+           uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+      
+      list(name = "Randomization data", 
+           idno = "", 
+           note = "Stata 14 data files", 
+           access_type = "Public", 
+           uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+      
+      list(name = "2008 PPLS", 
+           idno = "", 
+           note = "Stata 14 data files", 
+           access_type = "Public", 
+           uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+      
+      list(name = "2008 PODES - Village level administrative data", 
+           idno = "", 
+           note = "Stata 14 data files", 
+           access_type = "Public", 
+           uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view"),
+      
+      list(name = "Coded data (intermediary data files generated by the scripts)", 
+           idno = "", 
+           note = "Stata 14 data files", 
+           access_type = "Public", 
+           uri = "https://www.openicpsr.org/openicpsr/project/119802/version/V1/view")
+      
+    ),
+    
+    sponsors = list(
+      
+      list(name="Australian Aid (World Bank Trust Fund)",
+           abbr="AusAID",
+           role="Financial support"),
+      
+      list(name="3ie",
+           grant_no="OW3.1055",
+           role="Financial support"),
+      
+      list(name="NIH",
+           grant_no="P01 HD061315",
+           role="Financial support")
+      
+    ),
+    
+    acknowledgements = list(
+      
+      list(name = "Jurist Tan, Talitha Chairunissa, Amri Ilmma, Chaeruddin Kodir, He Yang, and Gabriel Zucker",
+           role    = "Research assistance"),
+      
+      list(name    = "Scott Guggenheim",
+           role    = "Provided comments"),
+      
+      list(name    = "Mitra Samya, BPS, TNP2K, and SurveyMeter",
+           role    = "Field cooperation")
+      
+    ),
+    
+    disclaimer = "Users acknowledge that the original collector of the data, ICPSR, and the relevant funding agency bear no responsibility for use of the data or for interpretations or inferences based upon such uses.",
+    
+    scripts = list(
+      
+      list(file_name   = "0MASTER-20190918.do",
+           zip_package = "119802-V1.zip",
+           title       = "Master Stata do file",
+           authors     = list(list(name="Rema Hanna, Ben Olken (PIs) and Sam Solomon (RA)")),
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Master do file; this script calls all do files required to replicate the output from start to finish (in no more than a few minutes)",
+           notes       = "Original data is in data-PUBLISH/originaldata and is all that is needed to run the code; all data in data-PUBLISH/codeddata is created from the coding do files. All results are then created and saved in output-PUBLISH/tables."),
+      
+      list(file_name   = "coding baseline.do",
+           title       = "coding baseline variables",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 1/7"),
+      
+      list(file_name   = "coding suseti pmt.do",
+           title       = "coding pmt",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 2/7"),
+      
+      list(file_name   = "coding elite relation.do",
+           title       = "coding additional variables for analysis",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 3/7"),
+      
+      list(file_name   = "matching hybrid.do",
+           title       = "matching baseline survey data and matching results",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 4/7; Generates poverty density measure"),
+      
+      list(file_name   = "coding existing social programs.do",
+           title       = "coding existing social programs",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 5/7"),
+      
+      list(file_name   = "coding kitchen-sink variables.do",
+           title       = "coding miscellaneous variables",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 6/7"),
+      
+      list(file_name   = "coding_partV_hybrid.do",
+           title       = "coding for part V of analysis plan",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Coding/matching script 7/7"),
+      
+      list(file_name   = "0 Table 1AB.do",
+           title       = "Table 1: formal vs. informal elites - Panels A and B: historical benefits",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 1/7"),
+      
+      list(file_name   = "0 Table 1CD.do",
+           title       = "Table 1: formal vs. informal elites - Panels C and D: PKH Experiment",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 2/7"),
+      
+      list(file_name   = "0 Table 2 Appendix Table 3.do",
+           title       = "Table 7: Social welfare simulations",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 3/7"),
+      
+      list(file_name   = "0 Appendix Table 1A.do",
+           title       = "Table 2A: Elite capture in historical programs",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 4/7"),
+      
+      list(file_name   = "0 Appendix Table 1B.do",
+           title       = "Table 2B: Elite capture in PKH experiment",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 5/7"),
+      
+      list(file_name   = "0 Appendix Table 2.do",
+           title       = "Appendix Table 12: Probit Model from Table 7",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 6/7"),
+      
+      list(file_name   = "0 Appendix Table 4.do",
+           title       = "Appendix Table 13: Social welfare simulations -- PKH - Additional model from Table 7",
+           zip_package = "119802-V1.zip",
+           format      = "Stata do file",
+           software    = "Stata 14",
+           description = "Analysis script 7/7"),
+      
+      list(file_name   = "master_log_09182019.smcl",
+           title       = "Log file - Run of master do file",
+           zip_package = "119802-V1.zip",
+           format      = "Stata log file",
+           software    = "Stata 14",
+           description = "Latest log file obtained by running the master do file")
+    )
+    
+  )
+  
+)
+
+
+# Publish the project metadata in the NADA catalog
+
+script_add(idno = id, 
+           metadata = my_project_metadata, 
+           repositoryid = "central", 
+           published = 1, 
+           thumbnail = thumb, 
+           overwrite = "yes")
+
+
+# Add links to the openICPSR website and AEA website as external resources:
+
+external_resources_add(
+  title = "Elite Capture Paper (Alatas et Al., 2019) - Project page - OpenICPSR",
+  idno = id,
+  dctype = "web",
+  file_path = "https://www.openicpsr.org/openicpsr/project/116471/version/V1/view;jsessionid=31C3E76620D0DDD1CABADAA263A1E491",
+  overwrite = "yes"
+)
+
+external_resources_add(
+  title = "American Economic Association (AEA) paper: Does Elite Capture Matter? Local Elites and Targeted Welfare Programs in Indonesia",
+  idno = id,
+  dctype = "doc/anl",
+  file_path = "https://www.aeaweb.org/articles?id=10.1257/pandp.20191047",
+  overwrite = "yes"
+)
+

The metadata and all resources (script files, etc.) are now available in the NADA catalog.

+


+
+

+
+
+

12.5.3 Full example, using Python

+
# Python example

Chapter 13 External resources

+

The metadata schemas presented in chapters 4 to 12 of the Guide are intended to document in detail resources of multiple types (data and scripts). When published in a NADA catalog, these metadata will be made visible and searchable. But publishing metadata in an HTML format is not enough. In most cases, you will also want to make files (data files, documents, or others) accessible in your catalog, and provide links to other, related resources. These files will have to be uploaded to your web server, and the links created, with some documentation. These related materials are what is referred to as "external resources".

+

External resources are not a specific type of data. They are resources of any type (data, document, web page, or any other type of resource that can be provided as an electronic file or a web link) that can be attached as a “related resource” to a catalog entry. A schema that is intentionally kept very simple, based on the Dublin Core standard, is used to describe these resources. This schema will never be used independently; it will always be used in combination with one of the other metadata standards and schemas documented in this Guide.

+

The table below shows some examples of the kind of external resources that may be attached to the metadata of different data types.

| Data type          | Resources that may be documented and published as external resources |
|--------------------|-----------------------------------------------------------------------|
| Document           | MS-Excel version of tables included in a publication ; PDF/DOC version of the publication ; visualization files (scripts and images) for visualizations included in the publication ; link to electronic annexes |
| Microdata          | survey questionnaire ; survey report ; technical documentation (sampling, etc.) ; data entry application ; survey budget in Excel ; microdata files in different formats ; link to an external website |
| Geographic dataset | link to an interactive web application ; technical documentation in PDF ; data analysis scripts ; publicly accessible data files |
| Time series        | link to a database query interface ; technical documents ; link to external websites ; visualization scripts |
| Tables             | link to an organization website ; tabulation scripts |
| Images             | image files in different formats and resolutions ; link to a photo album application ; link to a photographer website |
| Audio recordings   | audio file in MP3 or other format ; transcript in PDF |
| Videos             | video file in WAV or other format ; transcript in PDF |
| Scripts            | publication ; link to a package/library web page ; link to datasets |

Note that a catalog entry (e.g. a document, or a table) can itself be provided as a link (i.e. as an external resource) for another catalog entry.

+

In a NADA catalog, the external resources will not appear as catalog entries. Their list and description will be displayed (and the resources made accessible) in a “DOWNLOAD” tab for the entry to which they are attached.

+


+ +

+

The schema used to document external resources only contains 16 elements.

+


+
{
+  "dctype": "doc/adm",
+  "dcformat": "application/zip",
+  "title": "string",
+  "author": "string",
+  "dcdate": "string",
+  "country": "string",
+  "language": "string",
+  "contributor": "string",
+  "publisher": "string",
+  "rights": "string",
+  "description": "string",
+  "abstract": "string",
+  "toc": "string",
+  "filename": "string",
+  "created": "2023-04-09T19:23:22Z",
+  "changed": "2023-04-09T19:23:22Z"
+}
+


+

dctype [Optional, Not Repeatable, String]
This element defines the type of external resource being documented. This element plays an important role in the cataloguing system (NADA), as it is used to determine where and how the resource will be published. Particular attention must be paid to the type "Microdata File" (dat/micro) and to other data types, when the datasets will be published in a data catalog with access restrictions. The NADA catalog allows data to be published under different levels of accessibility: open data, direct access, public use files, licensed data, access in data enclave, or no access. Most standards include an element access_policy which is used to determine the type of access to a resource, and will apply to data of type dat/micro. The resource type dctype must be selected from a controlled vocabulary:

+
  • doc/adm: Document, Administrative [doc/adm]
  • doc/anl: Document, Analytical [doc/anl]
  • doc/oth: Document, Other [doc/oth]
  • doc/qst: Document, Questionnaire [doc/qst]
  • doc/ref: Document, Reference [doc/ref]
  • doc/rep: Document, Report [doc/rep]
  • doc/tec: Document, Technical [doc/tec]
  • aud: Audio [aud]
  • dat: Database [dat] (not including microdata)
  • map: Map [map]
  • dat/micro: Microdata File [dat/micro]
  • pic: Photo / image [pic]
  • prg: Program / script [prg]
  • tbl: Table [tbl]
  • vid: Video [vid]
  • web: Web Site [web]

dcformat [Optional, Not Repeatable, String]
+The resource file format. This format can be entered using a controlled vocabulary. Options could include:

+
  • application/x-compressed: Compressed, Generic
  • application/zip: Compressed, ZIP
  • application/x-cspro: Data, CSPro
  • application/dbase: Data, dBase
  • application/msaccess: Data, Microsoft Access
  • application/x-sas: Data, SAS
  • application/x-spss: Data, SPSS
  • application/x-stata: Data, Stata
  • text: Document, Generic
  • text/html: Document, HTML
  • application/msexcel: Document, Microsoft Excel
  • application/mspowerpoint: Document, Microsoft PowerPoint
  • application/msword: Document, Microsoft Word
  • application/pdf: Document, PDF
  • application/postscript: Document, Postscript
  • text/plain: Document, Plain
  • text/wordperfect: Document, WordPerfect
  • image/gif: Image, GIF
  • image/jpeg: Image, JPEG
  • image/png: Image, PNG
  • image/tiff: Image, TIFF

title [Required, Not Repeatable, String]
+The title of the resource.

+

author [Optional, Not Repeatable, String]
+The author(s) of the resource. If more than one, separate the names with a “;”.

+

dcdate [Optional, Not Repeatable, String]
+The date the resource was produced or released, preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).

+

country [Optional, Not Repeatable, String]
+The country name, if the resource is specific to a country. If more than one, enter the country names separated with a “;”.

+

language [Optional, Not Repeatable, String]
+The language name. If more than one, enter the language names separated with a “;”.

+

contributor [Optional, Not Repeatable, String]
The contributor(s) to the resource (free text). If more than one, enter the names separated with a ";".

+

publisher [Optional, Not Repeatable, String]
+List of contributor (free text). If more than one, enter the names separated with a “;”.

+

rights [Optional, Not Repeatable, String]
+The rights associated with the resource.

+

description [Optional, Not Repeatable, String]
+A brief description of the resource (but not the abstract; see the next element).

+

abstract [Optional, Not Repeatable, String]
+And abstract for the resource.

+

toc [Optional, Not Repeatable, String]
+The table of content of the resource (if the resource is a publication), entered as free text.

+

filename [Optional, Not Repeatable, String]
+A file name or a URL.

+
+

## Example of use of external resources

+

The “complete examples” provided in the previous chapters included some examples of the use of the “external_resources_add” command (from the Nadar R package) or “…” (from the PyNada Python library). We provide here one more example.

+
# R example  @@@@

# Python example  @@@@
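Until the complete R and Python examples are added above, the sketch below suggests what such a call could look like with the Nadar package. The function name `external_resources_add` is the one referenced in this chapter; the catalog URL, API key, identifier, file name, and the exact argument names (assumed here to mirror the elements documented in this chapter) are illustrative assumptions that should be verified against the Nadar documentation.

```r
library(nadar)

# Point nadar to the target NADA catalog (placeholder URL and key; the helper
# functions set_api_url() / set_api_key() are assumed -- check the Nadar docs)
set_api_url("https://my-catalog.example.org/index.php/api/")
set_api_key("MY-SECRET-API-KEY")

# Attach a questionnaire (PDF) as an external resource to a previously
# published survey. Argument names are assumed to mirror the schema elements
# described above; verify them against the Nadar documentation.
external_resources_add(
  idno        = "EXAMPLE-HH-SURVEY-2020-v01",   # identifier of the catalog entry
  dctype      = "doc/qst",                      # resource type: questionnaire
  dcformat    = "application/pdf",              # file format
  title       = "Household Survey 2020 - Questionnaire",
  author      = "National Statistics Office",
  dcdate      = "2020-06-15",
  country     = "Popstan",
  language    = "English",
  description = "Final version of the household questionnaire (English).",
  file_path   = "HH2020_questionnaire.pdf",     # local file or URL
  overwrite   = "yes"
)
```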
diff --git a/index.md b/index.md

---
title: "[DRAFT - WORK IN PROGRESS] Metadata Standards and Schemas for Improved Data Discoverability and Usability"
author: "Olivier Dupriez and Mehmood Asghar"
date: "2023-11-23"
knit: bookdown::render_book
site: bookdown::bookdown_site
documentclass: krantz
monofont: "Source Code Pro"
monofontoptions: "Scale=0.7"
biblio-style: apalike
link-citations: yes
description: ""
github-repo: ""
cover-image: "./images/cover2.jpg"
url: ''
colorlinks: yes
graphics: yes
---

# Preface {-}

Numerous organizations --government agencies, international organizations, the private sector, academia, and others-- invest in data collection and creation. Their datasets often possess intrinsic value not only for their creators but also for a broader community of secondary users and researchers. By repurposing and reusing data, this community adds value to the data. However, many valuable datasets remain difficult to find, access, and use, and are therefore underexploited. A dedicated and concerted effort to improve the discoverability, accessibility, and usability of data is needed. Such an effort would largely hinge on the quality of the **metadata** associated with the data. This Guide aims to promote and facilitate the production and use of rich and structured metadata, ultimately promoting the responsible use and repurposing of data.

The primary audience for the Guide consists of data producers and curators, data librarians and catalog administrators, and the developers of data management and dissemination platforms, who seek to maximize the value of existing data in a responsible and technically proficient manner. The Guide applies mainly to socio-economic data of different types (indicators, microdata, geographic datasets, publications, and others).

The Guide is part of a broader toolset that also includes specialized software applications -- a specialized metadata editor and a cataloging tool. This toolset covers the *technical* aspects of data documentation and dissemination. *Legal* and *ethical* considerations are equally important, but are addressed in other guidelines and are supported by different tools.

## Acknowledgments {-}

The Guide was written by Olivier Dupriez (Deputy Chief Statistician, World Bank) and Mehmood Asghar (Senior Data Engineer, World Bank). Kamwoo Lee (Data Scientist, World Bank) produced some of the examples of the use of metadata schemas included in the Guide and contributed to the testing of the schemas. Emmanuel Blondel (consultant) contributed much of chapter 6. Geoffrey Greenwell (consultant) provided input to chapter 9. Tefera Bekele Degefu and Cathrine Machingauta (Data Scientists, World Bank) participated in the testing of the metadata schemas.

The production of the Guide and related tools has been made possible by financial contributions from:

 - The World Bank-UNHCR Joint Data Center Microdata Library project P174080, Grant No TF0B4772, administered by the World Bank Development Data Group.
 - The UK Aid-UNHCR-World Bank research program Building the Evidence on Protracted Forced Displacement, funded by the UK government (FCV Data Platform component, project P174529, Grant No TF0B4149). This project supported the development of a data platform, which led to the improvement and testing of some of the metadata schemas described in the Guide.
 - The World Bank administrative budget.

The Guide was created using [R Bookdown](https://bookdown.org/) and is licensed under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](https://creativecommons.org/licenses/by-nc-nd/4.0/).

ChatGPT was used as a copy editor, but not for substantive content suggestion or creation.

Feedback and suggestions on the Guide are welcome. They can be sent to [...] or submitted on GitHub, where the Guide's source code is stored (https://github.com/mah0001/schema-guide).

![](./images/index_ccby_logo.png){width=20%}
+.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .lisp .hljs-string, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-javadoc, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-javadoc { + color: #93a1a1; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-keyword, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-keyword, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-winutils, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-winutils, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .method, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .method, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-addition, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-addition, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-tag, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .css .hljs-tag, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-request, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-request, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-status, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-status, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .nginx .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .nginx .hljs-title { + color: #859900; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-number, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-number, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-command, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-command, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-string, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-string, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-tag .hljs-value, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-tag .hljs-value, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-rules .hljs-value, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-rules .hljs-value, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-phpdoc, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-phpdoc, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .tex .hljs-formula, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .tex .hljs-formula, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-regexp, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-regexp, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-hexcolor, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-hexcolor, +.book.color-theme-1 .book-body 
.page-wrapper .page-inner section.normal pre .hljs-link_url, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-link_url { + color: #2aa198; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-localvars, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-localvars, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-chunk, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-chunk, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-decorator, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-decorator, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-built_in, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-built_in, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-identifier, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-identifier, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .vhdl .hljs-literal, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .vhdl .hljs-literal, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-id, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-id, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-function, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .css .hljs-function { + color: #268bd2; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-attribute, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-attribute, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-variable, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-variable, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .lisp .hljs-body, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .lisp .hljs-body, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .smalltalk .hljs-number, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .smalltalk .hljs-number, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-constant, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-constant, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-class .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-class .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-parent, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-parent, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .haskell .hljs-type, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .haskell .hljs-type, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-link_reference, +.book.color-theme-1 
.book-body .page-wrapper .page-inner section.normal code .hljs-link_reference { + color: #b58900; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-preprocessor, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-preprocessor, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-preprocessor .hljs-keyword, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-preprocessor .hljs-keyword, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-pragma, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-pragma, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-shebang, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-shebang, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-symbol, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-symbol, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-symbol .hljs-string, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-symbol .hljs-string, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .diff .hljs-change, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .diff .hljs-change, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-special, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-special, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-attr_selector, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-attr_selector, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-subst, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-subst, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-cdata, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-cdata, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .clojure .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .clojure .hljs-title, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-pseudo, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .css .hljs-pseudo, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-header, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-header { + color: #cb4b16; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-deletion, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-deletion, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-important, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-important { + color: #dc322f; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .hljs-link_label, +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal code .hljs-link_label { + color: #6c71c4; +} +.book.color-theme-1 .book-body .page-wrapper .page-inner section.normal pre .tex .hljs-formula, +.book.color-theme-1 .book-body 
.page-wrapper .page-inner section.normal code .tex .hljs-formula { + background: #eee8d5; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code { + /* Tomorrow Night Bright Theme */ + /* Original theme - https://github.com/chriskempson/tomorrow-theme */ + /* http://jmblog.github.com/color-themes-for-google-code-highlightjs */ + /* Tomorrow Comment */ + /* Tomorrow Red */ + /* Tomorrow Orange */ + /* Tomorrow Yellow */ + /* Tomorrow Green */ + /* Tomorrow Aqua */ + /* Tomorrow Blue */ + /* Tomorrow Purple */ +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-comment, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-comment, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-title { + color: #969896; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-variable, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-variable, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-attribute, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-attribute, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-tag, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-tag, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-regexp, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-regexp, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .ruby .hljs-constant, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .ruby .hljs-constant, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .hljs-tag .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .hljs-tag .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .hljs-pi, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .hljs-pi, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .hljs-doctype, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .hljs-doctype, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .html .hljs-doctype, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .html .hljs-doctype, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-id, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .css .hljs-id, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-class, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .css .hljs-class, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-pseudo, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .css .hljs-pseudo { + color: #d54e53; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-number, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-number, +.book.color-theme-2 .book-body .page-wrapper 
.page-inner section.normal pre .hljs-preprocessor, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-preprocessor, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-pragma, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-pragma, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-built_in, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-built_in, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-literal, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-literal, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-params, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-params, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-constant, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-constant { + color: #e78c45; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .ruby .hljs-class .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .ruby .hljs-class .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-rules .hljs-attribute, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .css .hljs-rules .hljs-attribute { + color: #e7c547; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-string, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-string, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-value, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-value, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-inheritance, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-inheritance, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-header, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-header, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .ruby .hljs-symbol, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .ruby .hljs-symbol, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .hljs-cdata, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .hljs-cdata { + color: #b9ca4a; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .css .hljs-hexcolor, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .css .hljs-hexcolor { + color: #70c0b1; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-function, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-function, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .python .hljs-decorator, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .python .hljs-decorator, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .python .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .python .hljs-title, +.book.color-theme-2 .book-body 
.page-wrapper .page-inner section.normal pre .ruby .hljs-function .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .ruby .hljs-function .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .ruby .hljs-title .hljs-keyword, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .ruby .hljs-title .hljs-keyword, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .perl .hljs-sub, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .perl .hljs-sub, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .javascript .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .javascript .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .coffeescript .hljs-title, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .coffeescript .hljs-title { + color: #7aa6da; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs-keyword, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs-keyword, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .javascript .hljs-function, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .javascript .hljs-function { + color: #c397d8; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .hljs, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .hljs { + display: block; + background: black; + color: #eaeaea; + padding: 0.5em; +} +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .coffeescript .javascript, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .coffeescript .javascript, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .javascript .xml, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .javascript .xml, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .tex .hljs-formula, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .tex .hljs-formula, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .javascript, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .javascript, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .vbscript, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .vbscript, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .css, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .css, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal pre .xml .hljs-cdata, +.book.color-theme-2 .book-body .page-wrapper .page-inner section.normal code .xml .hljs-cdata { + opacity: 0.5; +} diff --git a/libs/gitbook-2.6.7/css/plugin-search.css b/libs/gitbook-2.6.7/css/plugin-search.css new file mode 100644 index 0000000..c85e557 --- /dev/null +++ b/libs/gitbook-2.6.7/css/plugin-search.css @@ -0,0 +1,31 @@ +.book .book-summary .book-search { + padding: 6px; + background: transparent; + position: absolute; + top: -50px; + left: 0px; + right: 0px; + transition: top 0.5s ease; +} +.book .book-summary .book-search input, +.book .book-summary .book-search input:focus, +.book 
.book-summary .book-search input:hover { + width: 100%; + background: transparent; + border: 1px solid #ccc; + box-shadow: none; + outline: none; + line-height: 22px; + padding: 7px 4px; + color: inherit; + box-sizing: border-box; +} +.book.with-search .book-summary .book-search { + top: 0px; +} +.book.with-search .book-summary ul.summary { + top: 50px; +} +.with-search .summary li[data-level] a[href*=".html#"] { + display: none; +} diff --git a/libs/gitbook-2.6.7/css/plugin-table.css b/libs/gitbook-2.6.7/css/plugin-table.css new file mode 100644 index 0000000..7fba1b9 --- /dev/null +++ b/libs/gitbook-2.6.7/css/plugin-table.css @@ -0,0 +1 @@ +.book .book-body .page-wrapper .page-inner section.normal table{display:table;width:100%;border-collapse:collapse;border-spacing:0;overflow:auto}.book .book-body .page-wrapper .page-inner section.normal table td,.book .book-body .page-wrapper .page-inner section.normal table th{padding:6px 13px;border:1px solid #ddd}.book .book-body .page-wrapper .page-inner section.normal table tr{background-color:#fff;border-top:1px solid #ccc}.book .book-body .page-wrapper .page-inner section.normal table tr:nth-child(2n){background-color:#f8f8f8}.book .book-body .page-wrapper .page-inner section.normal table th{font-weight:700} diff --git a/libs/gitbook-2.6.7/css/style.css b/libs/gitbook-2.6.7/css/style.css new file mode 100644 index 0000000..cba69b2 --- /dev/null +++ b/libs/gitbook-2.6.7/css/style.css @@ -0,0 +1,13 @@ +/*! normalize.css v2.1.0 | MIT License | git.io/normalize */img,legend{border:0}*{-webkit-font-smoothing:antialiased}sub,sup{position:relative}.book .book-body .page-wrapper .page-inner section.normal hr:after,.book-langs-index .inner .languages:after,.buttons:after,.dropdown-menu .buttons:after{clear:both}body,html{-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%}article,aside,details,figcaption,figure,footer,header,hgroup,main,nav,section,summary{display:block}audio,canvas,video{display:inline-block}.hidden,[hidden]{display:none}audio:not([controls]){display:none;height:0}html{font-family:sans-serif}body,figure{margin:0}a:focus{outline:dotted thin}a:active,a:hover{outline:0}h1{font-size:2em;margin:.67em 0}abbr[title]{border-bottom:1px dotted}b,strong{font-weight:700}dfn{font-style:italic}hr{-moz-box-sizing:content-box;box-sizing:content-box;height:0}mark{background:#ff0;color:#000}code,kbd,pre,samp{font-family:monospace,serif;font-size:1em}pre{white-space:pre-wrap}q{quotes:"\201C" "\201D" "\2018" "\2019"}small{font-size:80%}sub,sup{font-size:75%;line-height:0;vertical-align:baseline}sup{top:-.5em}sub{bottom:-.25em}svg:not(:root){overflow:hidden}fieldset{border:1px solid silver;margin:0 2px;padding:.35em .625em .75em}legend{padding:0}button,input,select,textarea{font-family:inherit;font-size:100%;margin:0}button,input{line-height:normal}button,select{text-transform:none}button,html input[type=button],input[type=reset],input[type=submit]{-webkit-appearance:button;cursor:pointer}button[disabled],html input[disabled]{cursor:default}input[type=checkbox],input[type=radio]{box-sizing:border-box;padding:0}input[type=search]{-webkit-appearance:textfield;-moz-box-sizing:content-box;-webkit-box-sizing:content-box;box-sizing:content-box}input[type=search]::-webkit-search-cancel-button{margin-right:10px;}button::-moz-focus-inner,input::-moz-focus-inner{border:0;padding:0}textarea{overflow:auto;vertical-align:top}table{border-collapse:collapse;border-spacing:0}/*! + * Preboot v2 + * + * Open sourced under MIT license by @mdo. 
+ * Some variables and mixins from Bootstrap (Apache 2 license). + */.link-inherit,.link-inherit:focus,.link-inherit:hover{color:inherit}/*! + * Font Awesome 4.7.0 by @davegandy - http://fontawesome.io - @fontawesome + * License - http://fontawesome.io/license (Font: SIL OFL 1.1, CSS: MIT License) + */@font-face{font-family:'FontAwesome';src:url('./fontawesome/fontawesome-webfont.ttf?v=4.7.0') format('truetype');font-weight:normal;font-style:normal}.fa{display:inline-block;font:normal normal normal 14px/1 FontAwesome;font-size:inherit;text-rendering:auto;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.fa-lg{font-size:1.33333333em;line-height:.75em;vertical-align:-15%}.fa-2x{font-size:2em}.fa-3x{font-size:3em}.fa-4x{font-size:4em}.fa-5x{font-size:5em}.fa-fw{width:1.28571429em;text-align:center}.fa-ul{padding-left:0;margin-left:2.14285714em;list-style-type:none}.fa-ul>li{position:relative}.fa-li{position:absolute;left:-2.14285714em;width:2.14285714em;top:.14285714em;text-align:center}.fa-li.fa-lg{left:-1.85714286em}.fa-border{padding:.2em .25em .15em;border:solid .08em #eee;border-radius:.1em}.fa-pull-left{float:left}.fa-pull-right{float:right}.fa.fa-pull-left{margin-right:.3em}.fa.fa-pull-right{margin-left:.3em}.pull-right{float:right}.pull-left{float:left}.fa.pull-left{margin-right:.3em}.fa.pull-right{margin-left:.3em}.fa-spin{-webkit-animation:fa-spin 2s infinite linear;animation:fa-spin 2s infinite linear}.fa-pulse{-webkit-animation:fa-spin 1s infinite steps(8);animation:fa-spin 1s infinite steps(8)}@-webkit-keyframes fa-spin{0%{-webkit-transform:rotate(0deg);transform:rotate(0deg)}100%{-webkit-transform:rotate(359deg);transform:rotate(359deg)}}@keyframes fa-spin{0%{-webkit-transform:rotate(0deg);transform:rotate(0deg)}100%{-webkit-transform:rotate(359deg);transform:rotate(359deg)}}.fa-rotate-90{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=1)";-webkit-transform:rotate(90deg);-ms-transform:rotate(90deg);transform:rotate(90deg)}.fa-rotate-180{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=2)";-webkit-transform:rotate(180deg);-ms-transform:rotate(180deg);transform:rotate(180deg)}.fa-rotate-270{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=3)";-webkit-transform:rotate(270deg);-ms-transform:rotate(270deg);transform:rotate(270deg)}.fa-flip-horizontal{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=0, mirror=1)";-webkit-transform:scale(-1, 1);-ms-transform:scale(-1, 1);transform:scale(-1, 1)}.fa-flip-vertical{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=2, mirror=1)";-webkit-transform:scale(1, -1);-ms-transform:scale(1, -1);transform:scale(1, -1)}:root .fa-rotate-90,:root .fa-rotate-180,:root .fa-rotate-270,:root .fa-flip-horizontal,:root 
.fa-flip-vertical{filter:none}.fa-stack{position:relative;display:inline-block;width:2em;height:2em;line-height:2em;vertical-align:middle}.fa-stack-1x,.fa-stack-2x{position:absolute;left:0;width:100%;text-align:center}.fa-stack-1x{line-height:inherit}.fa-stack-2x{font-size:2em}.fa-inverse{color:#fff}.fa-glass:before{content:"\f000"}.fa-music:before{content:"\f001"}.fa-search:before{content:"\f002"}.fa-envelope-o:before{content:"\f003"}.fa-heart:before{content:"\f004"}.fa-star:before{content:"\f005"}.fa-star-o:before{content:"\f006"}.fa-user:before{content:"\f007"}.fa-film:before{content:"\f008"}.fa-th-large:before{content:"\f009"}.fa-th:before{content:"\f00a"}.fa-th-list:before{content:"\f00b"}.fa-check:before{content:"\f00c"}.fa-remove:before,.fa-close:before,.fa-times:before{content:"\f00d"}.fa-search-plus:before{content:"\f00e"}.fa-search-minus:before{content:"\f010"}.fa-power-off:before{content:"\f011"}.fa-signal:before{content:"\f012"}.fa-gear:before,.fa-cog:before{content:"\f013"}.fa-trash-o:before{content:"\f014"}.fa-home:before{content:"\f015"}.fa-file-o:before{content:"\f016"}.fa-clock-o:before{content:"\f017"}.fa-road:before{content:"\f018"}.fa-download:before{content:"\f019"}.fa-arrow-circle-o-down:before{content:"\f01a"}.fa-arrow-circle-o-up:before{content:"\f01b"}.fa-inbox:before{content:"\f01c"}.fa-play-circle-o:before{content:"\f01d"}.fa-rotate-right:before,.fa-repeat:before{content:"\f01e"}.fa-refresh:before{content:"\f021"}.fa-list-alt:before{content:"\f022"}.fa-lock:before{content:"\f023"}.fa-flag:before{content:"\f024"}.fa-headphones:before{content:"\f025"}.fa-volume-off:before{content:"\f026"}.fa-volume-down:before{content:"\f027"}.fa-volume-up:before{content:"\f028"}.fa-qrcode:before{content:"\f029"}.fa-barcode:before{content:"\f02a"}.fa-tag:before{content:"\f02b"}.fa-tags:before{content:"\f02c"}.fa-book:before{content:"\f02d"}.fa-bookmark:before{content:"\f02e"}.fa-print:before{content:"\f02f"}.fa-camera:before{content:"\f030"}.fa-font:before{content:"\f031"}.fa-bold:before{content:"\f032"}.fa-italic:before{content:"\f033"}.fa-text-height:before{content:"\f034"}.fa-text-width:before{content:"\f035"}.fa-align-left:before{content:"\f036"}.fa-align-center:before{content:"\f037"}.fa-align-right:before{content:"\f038"}.fa-align-justify:before{content:"\f039"}.fa-list:before{content:"\f03a"}.fa-dedent:before,.fa-outdent:before{content:"\f03b"}.fa-indent:before{content:"\f03c"}.fa-video-camera:before{content:"\f03d"}.fa-photo:before,.fa-image:before,.fa-picture-o:before{content:"\f03e"}.fa-pencil:before{content:"\f040"}.fa-map-marker:before{content:"\f041"}.fa-adjust:before{content:"\f042"}.fa-tint:before{content:"\f043"}.fa-edit:before,.fa-pencil-square-o:before{content:"\f044"}.fa-share-square-o:before{content:"\f045"}.fa-check-square-o:before{content:"\f046"}.fa-arrows:before{content:"\f047"}.fa-step-backward:before{content:"\f048"}.fa-fast-backward:before{content:"\f049"}.fa-backward:before{content:"\f04a"}.fa-play:before{content:"\f04b"}.fa-pause:before{content:"\f04c"}.fa-stop:before{content:"\f04d"}.fa-forward:before{content:"\f04e"}.fa-fast-forward:before{content:"\f050"}.fa-step-forward:before{content:"\f051"}.fa-eject:before{content:"\f052"}.fa-chevron-left:before{content:"\f053"}.fa-chevron-right:before{content:"\f054"}.fa-plus-circle:before{content:"\f055"}.fa-minus-circle:before{content:"\f056"}.fa-times-circle:before{content:"\f057"}.fa-check-circle:before{content:"\f058"}.fa-question-circle:before{content:"\f059"}.fa-info-circle:before{content:"\f05a"}.fa-cross
hairs:before{content:"\f05b"}.fa-times-circle-o:before{content:"\f05c"}.fa-check-circle-o:before{content:"\f05d"}.fa-ban:before{content:"\f05e"}.fa-arrow-left:before{content:"\f060"}.fa-arrow-right:before{content:"\f061"}.fa-arrow-up:before{content:"\f062"}.fa-arrow-down:before{content:"\f063"}.fa-mail-forward:before,.fa-share:before{content:"\f064"}.fa-expand:before{content:"\f065"}.fa-compress:before{content:"\f066"}.fa-plus:before{content:"\f067"}.fa-minus:before{content:"\f068"}.fa-asterisk:before{content:"\f069"}.fa-exclamation-circle:before{content:"\f06a"}.fa-gift:before{content:"\f06b"}.fa-leaf:before{content:"\f06c"}.fa-fire:before{content:"\f06d"}.fa-eye:before{content:"\f06e"}.fa-eye-slash:before{content:"\f070"}.fa-warning:before,.fa-exclamation-triangle:before{content:"\f071"}.fa-plane:before{content:"\f072"}.fa-calendar:before{content:"\f073"}.fa-random:before{content:"\f074"}.fa-comment:before{content:"\f075"}.fa-magnet:before{content:"\f076"}.fa-chevron-up:before{content:"\f077"}.fa-chevron-down:before{content:"\f078"}.fa-retweet:before{content:"\f079"}.fa-shopping-cart:before{content:"\f07a"}.fa-folder:before{content:"\f07b"}.fa-folder-open:before{content:"\f07c"}.fa-arrows-v:before{content:"\f07d"}.fa-arrows-h:before{content:"\f07e"}.fa-bar-chart-o:before,.fa-bar-chart:before{content:"\f080"}.fa-twitter-square:before{content:"\f081"}.fa-facebook-square:before{content:"\f082"}.fa-camera-retro:before{content:"\f083"}.fa-key:before{content:"\f084"}.fa-gears:before,.fa-cogs:before{content:"\f085"}.fa-comments:before{content:"\f086"}.fa-thumbs-o-up:before{content:"\f087"}.fa-thumbs-o-down:before{content:"\f088"}.fa-star-half:before{content:"\f089"}.fa-heart-o:before{content:"\f08a"}.fa-sign-out:before{content:"\f08b"}.fa-linkedin-square:before{content:"\f08c"}.fa-thumb-tack:before{content:"\f08d"}.fa-external-link:before{content:"\f08e"}.fa-sign-in:before{content:"\f090"}.fa-trophy:before{content:"\f091"}.fa-github-square:before{content:"\f092"}.fa-upload:before{content:"\f093"}.fa-lemon-o:before{content:"\f094"}.fa-phone:before{content:"\f095"}.fa-square-o:before{content:"\f096"}.fa-bookmark-o:before{content:"\f097"}.fa-phone-square:before{content:"\f098"}.fa-twitter:before{content:"\f099"}.fa-facebook-f:before,.fa-facebook:before{content:"\f09a"}.fa-github:before{content:"\f09b"}.fa-unlock:before{content:"\f09c"}.fa-credit-card:before{content:"\f09d"}.fa-feed:before,.fa-rss:before{content:"\f09e"}.fa-hdd-o:before{content:"\f0a0"}.fa-bullhorn:before{content:"\f0a1"}.fa-bell:before{content:"\f0f3"}.fa-certificate:before{content:"\f0a3"}.fa-hand-o-right:before{content:"\f0a4"}.fa-hand-o-left:before{content:"\f0a5"}.fa-hand-o-up:before{content:"\f0a6"}.fa-hand-o-down:before{content:"\f0a7"}.fa-arrow-circle-left:before{content:"\f0a8"}.fa-arrow-circle-right:before{content:"\f0a9"}.fa-arrow-circle-up:before{content:"\f0aa"}.fa-arrow-circle-down:before{content:"\f0ab"}.fa-globe:before{content:"\f0ac"}.fa-wrench:before{content:"\f0ad"}.fa-tasks:before{content:"\f0ae"}.fa-filter:before{content:"\f0b0"}.fa-briefcase:before{content:"\f0b1"}.fa-arrows-alt:before{content:"\f0b2"}.fa-group:before,.fa-users:before{content:"\f0c0"}.fa-chain:before,.fa-link:before{content:"\f0c1"}.fa-cloud:before{content:"\f0c2"}.fa-flask:before{content:"\f0c3"}.fa-cut:before,.fa-scissors:before{content:"\f0c4"}.fa-copy:before,.fa-files-o:before{content:"\f0c5"}.fa-paperclip:before{content:"\f0c6"}.fa-save:before,.fa-floppy-o:before{content:"\f0c7"}.fa-square:before{content:"\f0c8"}.fa-navicon:before,.fa-reor
der:before,.fa-bars:before{content:"\f0c9"}.fa-list-ul:before{content:"\f0ca"}.fa-list-ol:before{content:"\f0cb"}.fa-strikethrough:before{content:"\f0cc"}.fa-underline:before{content:"\f0cd"}.fa-table:before{content:"\f0ce"}.fa-magic:before{content:"\f0d0"}.fa-truck:before{content:"\f0d1"}.fa-pinterest:before{content:"\f0d2"}.fa-pinterest-square:before{content:"\f0d3"}.fa-google-plus-square:before{content:"\f0d4"}.fa-google-plus:before{content:"\f0d5"}.fa-money:before{content:"\f0d6"}.fa-caret-down:before{content:"\f0d7"}.fa-caret-up:before{content:"\f0d8"}.fa-caret-left:before{content:"\f0d9"}.fa-caret-right:before{content:"\f0da"}.fa-columns:before{content:"\f0db"}.fa-unsorted:before,.fa-sort:before{content:"\f0dc"}.fa-sort-down:before,.fa-sort-desc:before{content:"\f0dd"}.fa-sort-up:before,.fa-sort-asc:before{content:"\f0de"}.fa-envelope:before{content:"\f0e0"}.fa-linkedin:before{content:"\f0e1"}.fa-rotate-left:before,.fa-undo:before{content:"\f0e2"}.fa-legal:before,.fa-gavel:before{content:"\f0e3"}.fa-dashboard:before,.fa-tachometer:before{content:"\f0e4"}.fa-comment-o:before{content:"\f0e5"}.fa-comments-o:before{content:"\f0e6"}.fa-flash:before,.fa-bolt:before{content:"\f0e7"}.fa-sitemap:before{content:"\f0e8"}.fa-umbrella:before{content:"\f0e9"}.fa-paste:before,.fa-clipboard:before{content:"\f0ea"}.fa-lightbulb-o:before{content:"\f0eb"}.fa-exchange:before{content:"\f0ec"}.fa-cloud-download:before{content:"\f0ed"}.fa-cloud-upload:before{content:"\f0ee"}.fa-user-md:before{content:"\f0f0"}.fa-stethoscope:before{content:"\f0f1"}.fa-suitcase:before{content:"\f0f2"}.fa-bell-o:before{content:"\f0a2"}.fa-coffee:before{content:"\f0f4"}.fa-cutlery:before{content:"\f0f5"}.fa-file-text-o:before{content:"\f0f6"}.fa-building-o:before{content:"\f0f7"}.fa-hospital-o:before{content:"\f0f8"}.fa-ambulance:before{content:"\f0f9"}.fa-medkit:before{content:"\f0fa"}.fa-fighter-jet:before{content:"\f0fb"}.fa-beer:before{content:"\f0fc"}.fa-h-square:before{content:"\f0fd"}.fa-plus-square:before{content:"\f0fe"}.fa-angle-double-left:before{content:"\f100"}.fa-angle-double-right:before{content:"\f101"}.fa-angle-double-up:before{content:"\f102"}.fa-angle-double-down:before{content:"\f103"}.fa-angle-left:before{content:"\f104"}.fa-angle-right:before{content:"\f105"}.fa-angle-up:before{content:"\f106"}.fa-angle-down:before{content:"\f107"}.fa-desktop:before{content:"\f108"}.fa-laptop:before{content:"\f109"}.fa-tablet:before{content:"\f10a"}.fa-mobile-phone:before,.fa-mobile:before{content:"\f10b"}.fa-circle-o:before{content:"\f10c"}.fa-quote-left:before{content:"\f10d"}.fa-quote-right:before{content:"\f10e"}.fa-spinner:before{content:"\f110"}.fa-circle:before{content:"\f111"}.fa-mail-reply:before,.fa-reply:before{content:"\f112"}.fa-github-alt:before{content:"\f113"}.fa-folder-o:before{content:"\f114"}.fa-folder-open-o:before{content:"\f115"}.fa-smile-o:before{content:"\f118"}.fa-frown-o:before{content:"\f119"}.fa-meh-o:before{content:"\f11a"}.fa-gamepad:before{content:"\f11b"}.fa-keyboard-o:before{content:"\f11c"}.fa-flag-o:before{content:"\f11d"}.fa-flag-checkered:before{content:"\f11e"}.fa-terminal:before{content:"\f120"}.fa-code:before{content:"\f121"}.fa-mail-reply-all:before,.fa-reply-all:before{content:"\f122"}.fa-star-half-empty:before,.fa-star-half-full:before,.fa-star-half-o:before{content:"\f123"}.fa-location-arrow:before{content:"\f124"}.fa-crop:before{content:"\f125"}.fa-code-fork:before{content:"\f126"}.fa-unlink:before,.fa-chain-broken:before{content:"\f127"}.fa-question:before{content:"\f128"}.fa-i
nfo:before{content:"\f129"}.fa-exclamation:before{content:"\f12a"}.fa-superscript:before{content:"\f12b"}.fa-subscript:before{content:"\f12c"}.fa-eraser:before{content:"\f12d"}.fa-puzzle-piece:before{content:"\f12e"}.fa-microphone:before{content:"\f130"}.fa-microphone-slash:before{content:"\f131"}.fa-shield:before{content:"\f132"}.fa-calendar-o:before{content:"\f133"}.fa-fire-extinguisher:before{content:"\f134"}.fa-rocket:before{content:"\f135"}.fa-maxcdn:before{content:"\f136"}.fa-chevron-circle-left:before{content:"\f137"}.fa-chevron-circle-right:before{content:"\f138"}.fa-chevron-circle-up:before{content:"\f139"}.fa-chevron-circle-down:before{content:"\f13a"}.fa-html5:before{content:"\f13b"}.fa-css3:before{content:"\f13c"}.fa-anchor:before{content:"\f13d"}.fa-unlock-alt:before{content:"\f13e"}.fa-bullseye:before{content:"\f140"}.fa-ellipsis-h:before{content:"\f141"}.fa-ellipsis-v:before{content:"\f142"}.fa-rss-square:before{content:"\f143"}.fa-play-circle:before{content:"\f144"}.fa-ticket:before{content:"\f145"}.fa-minus-square:before{content:"\f146"}.fa-minus-square-o:before{content:"\f147"}.fa-level-up:before{content:"\f148"}.fa-level-down:before{content:"\f149"}.fa-check-square:before{content:"\f14a"}.fa-pencil-square:before{content:"\f14b"}.fa-external-link-square:before{content:"\f14c"}.fa-share-square:before{content:"\f14d"}.fa-compass:before{content:"\f14e"}.fa-toggle-down:before,.fa-caret-square-o-down:before{content:"\f150"}.fa-toggle-up:before,.fa-caret-square-o-up:before{content:"\f151"}.fa-toggle-right:before,.fa-caret-square-o-right:before{content:"\f152"}.fa-euro:before,.fa-eur:before{content:"\f153"}.fa-gbp:before{content:"\f154"}.fa-dollar:before,.fa-usd:before{content:"\f155"}.fa-rupee:before,.fa-inr:before{content:"\f156"}.fa-cny:before,.fa-rmb:before,.fa-yen:before,.fa-jpy:before{content:"\f157"}.fa-ruble:before,.fa-rouble:before,.fa-rub:before{content:"\f158"}.fa-won:before,.fa-krw:before{content:"\f159"}.fa-bitcoin:before,.fa-btc:before{content:"\f15a"}.fa-file:before{content:"\f15b"}.fa-file-text:before{content:"\f15c"}.fa-sort-alpha-asc:before{content:"\f15d"}.fa-sort-alpha-desc:before{content:"\f15e"}.fa-sort-amount-asc:before{content:"\f160"}.fa-sort-amount-desc:before{content:"\f161"}.fa-sort-numeric-asc:before{content:"\f162"}.fa-sort-numeric-desc:before{content:"\f163"}.fa-thumbs-up:before{content:"\f164"}.fa-thumbs-down:before{content:"\f165"}.fa-youtube-square:before{content:"\f166"}.fa-youtube:before{content:"\f167"}.fa-xing:before{content:"\f168"}.fa-xing-square:before{content:"\f169"}.fa-youtube-play:before{content:"\f16a"}.fa-dropbox:before{content:"\f16b"}.fa-stack-overflow:before{content:"\f16c"}.fa-instagram:before{content:"\f16d"}.fa-flickr:before{content:"\f16e"}.fa-adn:before{content:"\f170"}.fa-bitbucket:before{content:"\f171"}.fa-bitbucket-square:before{content:"\f172"}.fa-tumblr:before{content:"\f173"}.fa-tumblr-square:before{content:"\f174"}.fa-long-arrow-down:before{content:"\f175"}.fa-long-arrow-up:before{content:"\f176"}.fa-long-arrow-left:before{content:"\f177"}.fa-long-arrow-right:before{content:"\f178"}.fa-apple:before{content:"\f179"}.fa-windows:before{content:"\f17a"}.fa-android:before{content:"\f17b"}.fa-linux:before{content:"\f17c"}.fa-dribbble:before{content:"\f17d"}.fa-skype:before{content:"\f17e"}.fa-foursquare:before{content:"\f180"}.fa-trello:before{content:"\f181"}.fa-female:before{content:"\f182"}.fa-male:before{content:"\f183"}.fa-gittip:before,.fa-gratipay:before{content:"\f184"}.fa-sun-o:before{content:"\f185"}.fa-moon-o:bef
ore{content:"\f186"}.fa-archive:before{content:"\f187"}.fa-bug:before{content:"\f188"}.fa-vk:before{content:"\f189"}.fa-weibo:before{content:"\f18a"}.fa-renren:before{content:"\f18b"}.fa-pagelines:before{content:"\f18c"}.fa-stack-exchange:before{content:"\f18d"}.fa-arrow-circle-o-right:before{content:"\f18e"}.fa-arrow-circle-o-left:before{content:"\f190"}.fa-toggle-left:before,.fa-caret-square-o-left:before{content:"\f191"}.fa-dot-circle-o:before{content:"\f192"}.fa-wheelchair:before{content:"\f193"}.fa-vimeo-square:before{content:"\f194"}.fa-turkish-lira:before,.fa-try:before{content:"\f195"}.fa-plus-square-o:before{content:"\f196"}.fa-space-shuttle:before{content:"\f197"}.fa-slack:before{content:"\f198"}.fa-envelope-square:before{content:"\f199"}.fa-wordpress:before{content:"\f19a"}.fa-openid:before{content:"\f19b"}.fa-institution:before,.fa-bank:before,.fa-university:before{content:"\f19c"}.fa-mortar-board:before,.fa-graduation-cap:before{content:"\f19d"}.fa-yahoo:before{content:"\f19e"}.fa-google:before{content:"\f1a0"}.fa-reddit:before{content:"\f1a1"}.fa-reddit-square:before{content:"\f1a2"}.fa-stumbleupon-circle:before{content:"\f1a3"}.fa-stumbleupon:before{content:"\f1a4"}.fa-delicious:before{content:"\f1a5"}.fa-digg:before{content:"\f1a6"}.fa-pied-piper-pp:before{content:"\f1a7"}.fa-pied-piper-alt:before{content:"\f1a8"}.fa-drupal:before{content:"\f1a9"}.fa-joomla:before{content:"\f1aa"}.fa-language:before{content:"\f1ab"}.fa-fax:before{content:"\f1ac"}.fa-building:before{content:"\f1ad"}.fa-child:before{content:"\f1ae"}.fa-paw:before{content:"\f1b0"}.fa-spoon:before{content:"\f1b1"}.fa-cube:before{content:"\f1b2"}.fa-cubes:before{content:"\f1b3"}.fa-behance:before{content:"\f1b4"}.fa-behance-square:before{content:"\f1b5"}.fa-steam:before{content:"\f1b6"}.fa-steam-square:before{content:"\f1b7"}.fa-recycle:before{content:"\f1b8"}.fa-automobile:before,.fa-car:before{content:"\f1b9"}.fa-cab:before,.fa-taxi:before{content:"\f1ba"}.fa-tree:before{content:"\f1bb"}.fa-spotify:before{content:"\f1bc"}.fa-deviantart:before{content:"\f1bd"}.fa-soundcloud:before{content:"\f1be"}.fa-database:before{content:"\f1c0"}.fa-file-pdf-o:before{content:"\f1c1"}.fa-file-word-o:before{content:"\f1c2"}.fa-file-excel-o:before{content:"\f1c3"}.fa-file-powerpoint-o:before{content:"\f1c4"}.fa-file-photo-o:before,.fa-file-picture-o:before,.fa-file-image-o:before{content:"\f1c5"}.fa-file-zip-o:before,.fa-file-archive-o:before{content:"\f1c6"}.fa-file-sound-o:before,.fa-file-audio-o:before{content:"\f1c7"}.fa-file-movie-o:before,.fa-file-video-o:before{content:"\f1c8"}.fa-file-code-o:before{content:"\f1c9"}.fa-vine:before{content:"\f1ca"}.fa-codepen:before{content:"\f1cb"}.fa-jsfiddle:before{content:"\f1cc"}.fa-life-bouy:before,.fa-life-buoy:before,.fa-life-saver:before,.fa-support:before,.fa-life-ring:before{content:"\f1cd"}.fa-circle-o-notch:before{content:"\f1ce"}.fa-ra:before,.fa-resistance:before,.fa-rebel:before{content:"\f1d0"}.fa-ge:before,.fa-empire:before{content:"\f1d1"}.fa-git-square:before{content:"\f1d2"}.fa-git:before{content:"\f1d3"}.fa-y-combinator-square:before,.fa-yc-square:before,.fa-hacker-news:before{content:"\f1d4"}.fa-tencent-weibo:before{content:"\f1d5"}.fa-qq:before{content:"\f1d6"}.fa-wechat:before,.fa-weixin:before{content:"\f1d7"}.fa-send:before,.fa-paper-plane:before{content:"\f1d8"}.fa-send-o:before,.fa-paper-plane-o:before{content:"\f1d9"}.fa-history:before{content:"\f1da"}.fa-circle-thin:before{content:"\f1db"}.fa-header:before{content:"\f1dc"}.fa-paragraph:before{content:"\f1dd"}
.fa-sliders:before{content:"\f1de"}.fa-share-alt:before{content:"\f1e0"}.fa-share-alt-square:before{content:"\f1e1"}.fa-bomb:before{content:"\f1e2"}.fa-soccer-ball-o:before,.fa-futbol-o:before{content:"\f1e3"}.fa-tty:before{content:"\f1e4"}.fa-binoculars:before{content:"\f1e5"}.fa-plug:before{content:"\f1e6"}.fa-slideshare:before{content:"\f1e7"}.fa-twitch:before{content:"\f1e8"}.fa-yelp:before{content:"\f1e9"}.fa-newspaper-o:before{content:"\f1ea"}.fa-wifi:before{content:"\f1eb"}.fa-calculator:before{content:"\f1ec"}.fa-paypal:before{content:"\f1ed"}.fa-google-wallet:before{content:"\f1ee"}.fa-cc-visa:before{content:"\f1f0"}.fa-cc-mastercard:before{content:"\f1f1"}.fa-cc-discover:before{content:"\f1f2"}.fa-cc-amex:before{content:"\f1f3"}.fa-cc-paypal:before{content:"\f1f4"}.fa-cc-stripe:before{content:"\f1f5"}.fa-bell-slash:before{content:"\f1f6"}.fa-bell-slash-o:before{content:"\f1f7"}.fa-trash:before{content:"\f1f8"}.fa-copyright:before{content:"\f1f9"}.fa-at:before{content:"\f1fa"}.fa-eyedropper:before{content:"\f1fb"}.fa-paint-brush:before{content:"\f1fc"}.fa-birthday-cake:before{content:"\f1fd"}.fa-area-chart:before{content:"\f1fe"}.fa-pie-chart:before{content:"\f200"}.fa-line-chart:before{content:"\f201"}.fa-lastfm:before{content:"\f202"}.fa-lastfm-square:before{content:"\f203"}.fa-toggle-off:before{content:"\f204"}.fa-toggle-on:before{content:"\f205"}.fa-bicycle:before{content:"\f206"}.fa-bus:before{content:"\f207"}.fa-ioxhost:before{content:"\f208"}.fa-angellist:before{content:"\f209"}.fa-cc:before{content:"\f20a"}.fa-shekel:before,.fa-sheqel:before,.fa-ils:before{content:"\f20b"}.fa-meanpath:before{content:"\f20c"}.fa-buysellads:before{content:"\f20d"}.fa-connectdevelop:before{content:"\f20e"}.fa-dashcube:before{content:"\f210"}.fa-forumbee:before{content:"\f211"}.fa-leanpub:before{content:"\f212"}.fa-sellsy:before{content:"\f213"}.fa-shirtsinbulk:before{content:"\f214"}.fa-simplybuilt:before{content:"\f215"}.fa-skyatlas:before{content:"\f216"}.fa-cart-plus:before{content:"\f217"}.fa-cart-arrow-down:before{content:"\f218"}.fa-diamond:before{content:"\f219"}.fa-ship:before{content:"\f21a"}.fa-user-secret:before{content:"\f21b"}.fa-motorcycle:before{content:"\f21c"}.fa-street-view:before{content:"\f21d"}.fa-heartbeat:before{content:"\f21e"}.fa-venus:before{content:"\f221"}.fa-mars:before{content:"\f222"}.fa-mercury:before{content:"\f223"}.fa-intersex:before,.fa-transgender:before{content:"\f224"}.fa-transgender-alt:before{content:"\f225"}.fa-venus-double:before{content:"\f226"}.fa-mars-double:before{content:"\f227"}.fa-venus-mars:before{content:"\f228"}.fa-mars-stroke:before{content:"\f229"}.fa-mars-stroke-v:before{content:"\f22a"}.fa-mars-stroke-h:before{content:"\f22b"}.fa-neuter:before{content:"\f22c"}.fa-genderless:before{content:"\f22d"}.fa-facebook-official:before{content:"\f230"}.fa-pinterest-p:before{content:"\f231"}.fa-whatsapp:before{content:"\f232"}.fa-server:before{content:"\f233"}.fa-user-plus:before{content:"\f234"}.fa-user-times:before{content:"\f235"}.fa-hotel:before,.fa-bed:before{content:"\f236"}.fa-viacoin:before{content:"\f237"}.fa-train:before{content:"\f238"}.fa-subway:before{content:"\f239"}.fa-medium:before{content:"\f23a"}.fa-yc:before,.fa-y-combinator:before{content:"\f23b"}.fa-optin-monster:before{content:"\f23c"}.fa-opencart:before{content:"\f23d"}.fa-expeditedssl:before{content:"\f23e"}.fa-battery-4:before,.fa-battery:before,.fa-battery-full:before{content:"\f240"}.fa-battery-3:before,.fa-battery-three-quarters:before{content:"\f241"}.fa-battery-2:before
authInHost=result.host&&result.host.indexOf("@")>0?result.host.split("@"):false;if(authInHost){result.auth=authInHost.shift();result.host=result.hostname=authInHost.shift()}}mustEndAbs=mustEndAbs||result.host&&srcPath.length;if(mustEndAbs&&!isAbsolute){srcPath.unshift("")}if(!srcPath.length){result.pathname=null;result.path=null}else{result.pathname=srcPath.join("/")}if(!isNull(result.pathname)||!isNull(result.search)){result.path=(result.pathname?result.pathname:"")+(result.search?result.search:"")}result.auth=relative.auth||result.auth;result.slashes=result.slashes||relative.slashes;result.href=result.format();return result};Url.prototype.parseHost=function(){var host=this.host;var port=portPattern.exec(host);if(port){port=port[0];if(port!==":"){this.port=port.substr(1)}host=host.substr(0,host.length-port.length)}if(host)this.hostname=host};function isString(arg){return typeof arg==="string"}function isObject(arg){return typeof arg==="object"&&arg!==null}function isNull(arg){return arg===null}function isNullOrUndefined(arg){return arg==null}},{punycode:6,querystring:9}],11:[function(require,module,exports){var $=require("jquery");function toggleDropdown(e){var $dropdown=$(e.currentTarget).parent().find(".dropdown-menu");$dropdown.toggleClass("open");e.stopPropagation();e.preventDefault()}function closeDropdown(e){$(".dropdown-menu").removeClass("open")}function init(){$(document).on("click",".toggle-dropdown",toggleDropdown);$(document).on("click",".dropdown-menu",function(e){e.stopPropagation()});$(document).on("click",closeDropdown)}module.exports={init:init}},{jquery:1}],12:[function(require,module,exports){var $=require("jquery");module.exports=$({})},{jquery:1}],13:[function(require,module,exports){var $=require("jquery");var _=require("lodash");var storage=require("./storage");var dropdown=require("./dropdown");var events=require("./events");var state=require("./state");var keyboard=require("./keyboard");var navigation=require("./navigation");var sidebar=require("./sidebar");var toolbar=require("./toolbar");function start(config){sidebar.init();keyboard.init();dropdown.init();navigation.init();toolbar.createButton({index:0,icon:"fa fa-align-justify",label:"Toggle Sidebar",onClick:function(e){e.preventDefault();sidebar.toggle()}});events.trigger("start",config);navigation.notify()}var gitbook={start:start,events:events,state:state,toolbar:toolbar,sidebar:sidebar,storage:storage,keyboard:keyboard};var MODULES={gitbook:gitbook,jquery:$,lodash:_};window.gitbook=gitbook;window.$=$;window.jQuery=$;gitbook.require=function(mods,fn){mods=_.map(mods,function(mod){mod=mod.toLowerCase();if(!MODULES[mod]){throw new Error("GitBook module "+mod+" doesn't exist")}return MODULES[mod]});fn.apply(null,mods)};module.exports={}},{"./dropdown":11,"./events":12,"./keyboard":14,"./navigation":16,"./sidebar":18,"./state":19,"./storage":20,"./toolbar":21,jquery:1,lodash:2}],14:[function(require,module,exports){var Mousetrap=require("mousetrap");var navigation=require("./navigation");var sidebar=require("./sidebar");function bindShortcut(keys,fn){Mousetrap.bind(keys,function(e){fn();return false})}function init(){bindShortcut(["right"],function(e){navigation.goNext()});bindShortcut(["left"],function(e){navigation.goPrev()});bindShortcut(["s"],function(e){sidebar.toggle()})}module.exports={init:init,bind:bindShortcut}},{"./navigation":16,"./sidebar":18,mousetrap:3}],15:[function(require,module,exports){var state=require("./state");function 
showLoading(p){state.$book.addClass("is-loading");p.always(function(){state.$book.removeClass("is-loading")});return p}module.exports={show:showLoading}},{"./state":19}],16:[function(require,module,exports){var $=require("jquery");var url=require("url");var events=require("./events");var state=require("./state");var loading=require("./loading");var usePushState=typeof history.pushState!=="undefined";function handleNavigation(relativeUrl,push){var uri=url.resolve(window.location.pathname,relativeUrl);notifyPageChange();location.href=relativeUrl;return}function updateNavigationPosition(){var bodyInnerWidth,pageWrapperWidth;bodyInnerWidth=parseInt($(".body-inner").css("width"),10);pageWrapperWidth=parseInt($(".page-wrapper").css("width"),10);$(".navigation-next").css("margin-right",bodyInnerWidth-pageWrapperWidth+"px")}function notifyPageChange(){events.trigger("page.change")}function preparePage(notify){var $bookBody=$(".book-body");var $bookInner=$bookBody.find(".body-inner");var $pageWrapper=$bookInner.find(".page-wrapper");updateNavigationPosition();$bookInner.scrollTop(0);$bookBody.scrollTop(0);if(notify!==false)notifyPageChange()}function isLeftClickEvent(e){return e.button===0}function isModifiedEvent(e){return!!(e.metaKey||e.altKey||e.ctrlKey||e.shiftKey)}function handlePagination(e){if(isModifiedEvent(e)||!isLeftClickEvent(e)){return}e.stopPropagation();e.preventDefault();var url=$(this).attr("href");if(url)handleNavigation(url,true)}function goNext(){var url=$(".navigation-next").attr("href");if(url)handleNavigation(url,true)}function goPrev(){var url=$(".navigation-prev").attr("href");if(url)handleNavigation(url,true)}function init(){$.ajaxSetup({});if(location.protocol!=="file:"){history.replaceState({path:window.location.href},"")}window.onpopstate=function(event){if(event.state===null){return}return handleNavigation(event.state.path,false)};$(document).on("click",".navigation-prev",handlePagination);$(document).on("click",".navigation-next",handlePagination);$(document).on("click",".summary [data-path] a",handlePagination);$(window).resize(updateNavigationPosition);preparePage(false)}module.exports={init:init,goNext:goNext,goPrev:goPrev,notify:notifyPageChange}},{"./events":12,"./loading":15,"./state":19,jquery:1,url:10}],17:[function(require,module,exports){module.exports={isMobile:function(){return document.body.clientWidth<=600}}},{}],18:[function(require,module,exports){var $=require("jquery");var _=require("lodash");var storage=require("./storage");var platform=require("./platform");var state=require("./state");function toggleSidebar(_state,animation){if(state!=null&&isOpen()==_state)return;if(animation==null)animation=true;state.$book.toggleClass("without-animation",!animation);state.$book.toggleClass("with-summary",_state);storage.set("sidebar",isOpen())}function isOpen(){return state.$book.hasClass("with-summary")}function init(){if(platform.isMobile()){toggleSidebar(false,false)}else{toggleSidebar(storage.get("sidebar",true),false)}$(document).on("click",".book-summary li.chapter a",function(e){if(platform.isMobile())toggleSidebar(false,false)})}function filterSummary(paths){var $summary=$(".book-summary");$summary.find("li").each(function(){var path=$(this).data("path");var st=paths==null||_.contains(paths,path);$(this).toggle(st);if(st)$(this).parents("li").show()})}module.exports={init:init,isOpen:isOpen,toggle:toggleSidebar,filter:filterSummary}},{"./platform":17,"./state":19,"./storage":20,jquery:1,lodash:2}],19:[function(require,module,exports){var 
$=require("jquery");var url=require("url");var path=require("path");var state={};state.update=function(dom){var $book=$(dom.find(".book"));state.$book=$book;state.level=$book.data("level");state.basePath=$book.data("basepath");state.innerLanguage=$book.data("innerlanguage");state.revision=$book.data("revision");state.filepath=$book.data("filepath");state.chapterTitle=$book.data("chapter-title");state.root=url.resolve(location.protocol+"//"+location.host,path.dirname(path.resolve(location.pathname.replace(/\/$/,"/index.html"),state.basePath))).replace(/\/?$/,"/");state.bookRoot=state.innerLanguage?url.resolve(state.root,".."):state.root};state.update($);module.exports=state},{jquery:1,path:4,url:10}],20:[function(require,module,exports){var baseKey="";module.exports={setBaseKey:function(key){baseKey=key},set:function(key,value){key=baseKey+":"+key;try{sessionStorage[key]=JSON.stringify(value)}catch(e){}},get:function(key,def){key=baseKey+":"+key;if(sessionStorage[key]===undefined)return def;try{var v=JSON.parse(sessionStorage[key]);return v==null?def:v}catch(err){return sessionStorage[key]||def}},remove:function(key){key=baseKey+":"+key;sessionStorage.removeItem(key)}}},{}],21:[function(require,module,exports){var $=require("jquery");var _=require("lodash");var events=require("./events");var buttons=[];function insertAt(parent,selector,index,element){var lastIndex=parent.children(selector).length;if(index<0){index=Math.max(0,lastIndex+1+index)}parent.append(element);if(index",{class:"dropdown-menu",html:''});if(_.isString(dropdown)){$menu.append(dropdown)}else{var groups=_.map(dropdown,function(group){if(_.isArray(group))return group;else return[group]});_.each(groups,function(group){var $group=$("
",{class:"buttons"});var sizeClass="size-"+group.length;_.each(group,function(btn){btn=_.defaults(btn||{},{text:"",className:"",onClick:defaultOnClick});var $btn=$("'; + var clipboard; + + gitbook.events.bind("page.change", function() { + + if (!ClipboardJS.isSupported()) return; + + // the page.change event is thrown twice: before and after the page changes + if (clipboard) { + // clipboard is already defined but we are on the same page + if (clipboard._prevPage === window.location.pathname) return; + // clipboard is already defined and url path change + // we can deduct that we are before page changes + clipboard.destroy(); // destroy the previous events listeners + clipboard = undefined; // reset the clipboard object + return; + } + + $(copyButton).prependTo("div.sourceCode"); + + clipboard = new ClipboardJS(".copy-to-clipboard-button", { + text: function(trigger) { + return trigger.parentNode.textContent; + } + }); + + clipboard._prevPage = window.location.pathname + + }); + +}); diff --git a/libs/gitbook-2.6.7/js/plugin-fontsettings.js b/libs/gitbook-2.6.7/js/plugin-fontsettings.js new file mode 100644 index 0000000..a70f0fb --- /dev/null +++ b/libs/gitbook-2.6.7/js/plugin-fontsettings.js @@ -0,0 +1,152 @@ +gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) { + var fontState; + + var THEMES = { + "white": 0, + "sepia": 1, + "night": 2 + }; + + var FAMILY = { + "serif": 0, + "sans": 1 + }; + + // Save current font settings + function saveFontSettings() { + gitbook.storage.set("fontState", fontState); + update(); + } + + // Increase font size + function enlargeFontSize(e) { + e.preventDefault(); + if (fontState.size >= 4) return; + + fontState.size++; + saveFontSettings(); + }; + + // Decrease font size + function reduceFontSize(e) { + e.preventDefault(); + if (fontState.size <= 0) return; + + fontState.size--; + saveFontSettings(); + }; + + // Change font family + function changeFontFamily(index, e) { + e.preventDefault(); + + fontState.family = index; + saveFontSettings(); + }; + + // Change type of color + function changeColorTheme(index, e) { + e.preventDefault(); + + var $book = $(".book"); + + if (fontState.theme !== 0) + $book.removeClass("color-theme-"+fontState.theme); + + fontState.theme = index; + if (fontState.theme !== 0) + $book.addClass("color-theme-"+fontState.theme); + + saveFontSettings(); + }; + + function update() { + var $book = gitbook.state.$book; + + $(".font-settings .font-family-list li").removeClass("active"); + $(".font-settings .font-family-list li:nth-child("+(fontState.family+1)+")").addClass("active"); + + $book[0].className = $book[0].className.replace(/\bfont-\S+/g, ''); + $book.addClass("font-size-"+fontState.size); + $book.addClass("font-family-"+fontState.family); + + if(fontState.theme !== 0) { + $book[0].className = $book[0].className.replace(/\bcolor-theme-\S+/g, ''); + $book.addClass("color-theme-"+fontState.theme); + } + }; + + function init(config) { + var $bookBody, $book; + + //Find DOM elements. 
+ $book = gitbook.state.$book; + $bookBody = $book.find(".book-body"); + + // Instantiate font state object + fontState = gitbook.storage.get("fontState", { + size: config.size || 2, + family: FAMILY[config.family || "sans"], + theme: THEMES[config.theme || "white"] + }); + + update(); + }; + + + gitbook.events.bind("start", function(e, config) { + var opts = config.fontsettings; + if (!opts) return; + + // Create buttons in toolbar + gitbook.toolbar.createButton({ + icon: 'fa fa-font', + label: 'Font Settings', + className: 'font-settings', + dropdown: [ + [ + { + text: 'A', + className: 'font-reduce', + onClick: reduceFontSize + }, + { + text: 'A', + className: 'font-enlarge', + onClick: enlargeFontSize + } + ], + [ + { + text: 'Serif', + onClick: _.partial(changeFontFamily, 0) + }, + { + text: 'Sans', + onClick: _.partial(changeFontFamily, 1) + } + ], + [ + { + text: 'White', + onClick: _.partial(changeColorTheme, 0) + }, + { + text: 'Sepia', + onClick: _.partial(changeColorTheme, 1) + }, + { + text: 'Night', + onClick: _.partial(changeColorTheme, 2) + } + ] + ] + }); + + + // Init current settings + init(opts); + }); +}); + + diff --git a/libs/gitbook-2.6.7/js/plugin-search.js b/libs/gitbook-2.6.7/js/plugin-search.js new file mode 100644 index 0000000..747fcce --- /dev/null +++ b/libs/gitbook-2.6.7/js/plugin-search.js @@ -0,0 +1,270 @@ +gitbook.require(["gitbook", "lodash", "jQuery"], function(gitbook, _, $) { + var index = null; + var fuse = null; + var _search = {engine: 'lunr', opts: {}}; + var $searchInput, $searchLabel, $searchForm; + var $highlighted = [], hi, hiOpts = { className: 'search-highlight' }; + var collapse = false, toc_visible = []; + + function init(config) { + // Instantiate search settings + _search = gitbook.storage.get("search", { + engine: config.search.engine || 'lunr', + opts: config.search.options || {}, + }); + }; + + // Save current search settings + function saveSearchSettings() { + gitbook.storage.set("search", _search); + } + + // Use a specific index + function loadIndex(data) { + // [Yihui] In bookdown, I use a character matrix to store the chapter + // content, and the index is dynamically built on the client side. + // Gitbook prebuilds the index data instead: https://github.com/GitbookIO/plugin-search + // We can certainly do that via R packages V8 and jsonlite, but let's + // see how slow it really is before improving it. On the other hand, + // lunr cannot handle non-English text very well, e.g. the default + // tokenizer cannot deal with Chinese text, so we may want to replace + // lunr with a dumb simple text matching approach. 
+ if (_search.engine === 'lunr') { + index = lunr(function () { + this.ref('url'); + this.field('title', { boost: 10 }); + this.field('body'); + }); + data.map(function(item) { + index.add({ + url: item[0], + title: item[1], + body: item[2] + }); + }); + return; + } + fuse = new Fuse(data.map((_data => { + return { + url: _data[0], + title: _data[1], + body: _data[2] + }; + })), Object.assign( + { + includeScore: true, + threshold: 0.1, + ignoreLocation: true, + keys: ["title", "body"] + }, + _search.opts + )); + } + + // Fetch the search index + function fetchIndex() { + return $.getJSON(gitbook.state.basePath+"/search_index.json") + .then(loadIndex); // [Yihui] we need to use this object later + } + + // Search for a term and return results + function search(q) { + let results = []; + switch (_search.engine) { + case 'fuse': + if (!fuse) return; + results = fuse.search(q).map(function(result) { + var parts = result.item.url.split('#'); + return { + path: parts[0], + hash: parts[1] + }; + }); + break; + case 'lunr': + default: + if (!index) return; + results = _.chain(index.search(q)).map(function(result) { + var parts = result.ref.split("#"); + return { + path: parts[0], + hash: parts[1] + }; + }) + .value(); + } + + // [Yihui] Highlight the search keyword on current page + $highlighted = $('.page-inner') + .unhighlight(hiOpts).highlight(q, hiOpts).find('span.search-highlight'); + scrollToHighlighted(0); + + return results; + } + + // [Yihui] Scroll the chapter body to the i-th highlighted string + function scrollToHighlighted(d) { + var n = $highlighted.length; + hi = hi === undefined ? 0 : hi + d; + // navignate to the previous/next page in the search results if reached the top/bottom + var b = hi < 0; + if (d !== 0 && (b || hi >= n)) { + var path = currentPath(), n2 = toc_visible.length; + if (n2 === 0) return; + for (var i = b ? 0 : n2; (b && i < n2) || (!b && i >= 0); i += b ? 1 : -1) { + if (toc_visible.eq(i).data('path') === path) break; + } + i += b ? -1 : 1; + if (i < 0) i = n2 - 1; + if (i >= n2) i = 0; + var lnk = toc_visible.eq(i).find('a[href$=".html"]'); + if (lnk.length) lnk[0].click(); + return; + } + if (n === 0) return; + var $p = $highlighted.eq(hi); + $p[0].scrollIntoView(); + $highlighted.css('background-color', ''); + // an orange background color on the current item and removed later + $p.css('background-color', 'orange'); + setTimeout(function() { + $p.css('background-color', ''); + }, 2000); + } + + function currentPath() { + var href = window.location.pathname; + href = href.substr(href.lastIndexOf('/') + 1); + return href === '' ? 'index.html' : href; + } + + // Create search form + function createForm(value) { + if ($searchForm) $searchForm.remove(); + if ($searchLabel) $searchLabel.remove(); + if ($searchInput) $searchInput.remove(); + + $searchForm = $('
', { + 'class': 'book-search', + 'role': 'search' + }); + + $searchLabel = $('",e.querySelectorAll("[msallowcapture^='']").length&&v.push("[*^$]="+M+"*(?:''|\"\")"),e.querySelectorAll("[selected]").length||v.push("\\["+M+"*(?:value|"+R+")"),e.querySelectorAll("[id~="+S+"-]").length||v.push("~="),(t=C.createElement("input")).setAttribute("name",""),e.appendChild(t),e.querySelectorAll("[name='']").length||v.push("\\["+M+"*name"+M+"*="+M+"*(?:''|\"\")"),e.querySelectorAll(":checked").length||v.push(":checked"),e.querySelectorAll("a#"+S+"+*").length||v.push(".#.+[+~]"),e.querySelectorAll("\\\f"),v.push("[\\r\\n\\f]")}),ce(function(e){e.innerHTML="";var t=C.createElement("input");t.setAttribute("type","hidden"),e.appendChild(t).setAttribute("name","D"),e.querySelectorAll("[name=d]").length&&v.push("name"+M+"*[*^$|!~]?="),2!==e.querySelectorAll(":enabled").length&&v.push(":enabled",":disabled"),a.appendChild(e).disabled=!0,2!==e.querySelectorAll(":disabled").length&&v.push(":enabled",":disabled"),e.querySelectorAll("*,:x"),v.push(",.*:")})),(d.matchesSelector=K.test(c=a.matches||a.webkitMatchesSelector||a.mozMatchesSelector||a.oMatchesSelector||a.msMatchesSelector))&&ce(function(e){d.disconnectedMatch=c.call(e,"*"),c.call(e,"[s!='']:x"),s.push("!=",F)}),v=v.length&&new RegExp(v.join("|")),s=s.length&&new RegExp(s.join("|")),t=K.test(a.compareDocumentPosition),y=t||K.test(a.contains)?function(e,t){var n=9===e.nodeType?e.documentElement:e,r=t&&t.parentNode;return e===r||!(!r||1!==r.nodeType||!(n.contains?n.contains(r):e.compareDocumentPosition&&16&e.compareDocumentPosition(r)))}:function(e,t){if(t)while(t=t.parentNode)if(t===e)return!0;return!1},j=t?function(e,t){if(e===t)return l=!0,0;var n=!e.compareDocumentPosition-!t.compareDocumentPosition;return n||(1&(n=(e.ownerDocument||e)==(t.ownerDocument||t)?e.compareDocumentPosition(t):1)||!d.sortDetached&&t.compareDocumentPosition(e)===n?e==C||e.ownerDocument==p&&y(p,e)?-1:t==C||t.ownerDocument==p&&y(p,t)?1:u?P(u,e)-P(u,t):0:4&n?-1:1)}:function(e,t){if(e===t)return l=!0,0;var n,r=0,i=e.parentNode,o=t.parentNode,a=[e],s=[t];if(!i||!o)return e==C?-1:t==C?1:i?-1:o?1:u?P(u,e)-P(u,t):0;if(i===o)return pe(e,t);n=e;while(n=n.parentNode)a.unshift(n);n=t;while(n=n.parentNode)s.unshift(n);while(a[r]===s[r])r++;return r?pe(a[r],s[r]):a[r]==p?-1:s[r]==p?1:0}),C},se.matches=function(e,t){return se(e,null,null,t)},se.matchesSelector=function(e,t){if(T(e),d.matchesSelector&&E&&!N[t+" "]&&(!s||!s.test(t))&&(!v||!v.test(t)))try{var n=c.call(e,t);if(n||d.disconnectedMatch||e.document&&11!==e.document.nodeType)return n}catch(e){N(t,!0)}return 0":{dir:"parentNode",first:!0}," ":{dir:"parentNode"},"+":{dir:"previousSibling",first:!0},"~":{dir:"previousSibling"}},preFilter:{ATTR:function(e){return e[1]=e[1].replace(te,ne),e[3]=(e[3]||e[4]||e[5]||"").replace(te,ne),"~="===e[2]&&(e[3]=" "+e[3]+" "),e.slice(0,4)},CHILD:function(e){return e[1]=e[1].toLowerCase(),"nth"===e[1].slice(0,3)?(e[3]||se.error(e[0]),e[4]=+(e[4]?e[5]+(e[6]||1):2*("even"===e[3]||"odd"===e[3])),e[5]=+(e[7]+e[8]||"odd"===e[3])):e[3]&&se.error(e[0]),e},PSEUDO:function(e){var t,n=!e[6]&&e[2];return G.CHILD.test(e[0])?null:(e[3]?e[2]=e[4]||e[5]||"":n&&X.test(n)&&(t=h(n,!0))&&(t=n.indexOf(")",n.length-t)-n.length)&&(e[0]=e[0].slice(0,t),e[2]=n.slice(0,t)),e.slice(0,3))}},filter:{TAG:function(e){var t=e.replace(te,ne).toLowerCase();return"*"===e?function(){return!0}:function(e){return e.nodeName&&e.nodeName.toLowerCase()===t}},CLASS:function(e){var t=m[e+" "];return t||(t=new 
RegExp("(^|"+M+")"+e+"("+M+"|$)"))&&m(e,function(e){return t.test("string"==typeof e.className&&e.className||"undefined"!=typeof e.getAttribute&&e.getAttribute("class")||"")})},ATTR:function(n,r,i){return function(e){var t=se.attr(e,n);return null==t?"!="===r:!r||(t+="","="===r?t===i:"!="===r?t!==i:"^="===r?i&&0===t.indexOf(i):"*="===r?i&&-1:\x20\t\r\n\f]*)[\x20\t\r\n\f]*\/?>(?:<\/\1>|)$/i;function j(e,n,r){return m(n)?S.grep(e,function(e,t){return!!n.call(e,t,e)!==r}):n.nodeType?S.grep(e,function(e){return e===n!==r}):"string"!=typeof n?S.grep(e,function(e){return-1)[^>]*|#([\w-]+))$/;(S.fn.init=function(e,t,n){var r,i;if(!e)return this;if(n=n||D,"string"==typeof e){if(!(r="<"===e[0]&&">"===e[e.length-1]&&3<=e.length?[null,e,null]:q.exec(e))||!r[1]&&t)return!t||t.jquery?(t||n).find(e):this.constructor(t).find(e);if(r[1]){if(t=t instanceof S?t[0]:t,S.merge(this,S.parseHTML(r[1],t&&t.nodeType?t.ownerDocument||t:E,!0)),N.test(r[1])&&S.isPlainObject(t))for(r in t)m(this[r])?this[r](t[r]):this.attr(r,t[r]);return this}return(i=E.getElementById(r[2]))&&(this[0]=i,this.length=1),this}return e.nodeType?(this[0]=e,this.length=1,this):m(e)?void 0!==n.ready?n.ready(e):e(S):S.makeArray(e,this)}).prototype=S.fn,D=S(E);var L=/^(?:parents|prev(?:Until|All))/,H={children:!0,contents:!0,next:!0,prev:!0};function O(e,t){while((e=e[t])&&1!==e.nodeType);return e}S.fn.extend({has:function(e){var t=S(e,this),n=t.length;return this.filter(function(){for(var e=0;e\x20\t\r\n\f]*)/i,he=/^$|^module$|\/(?:java|ecma)script/i;ce=E.createDocumentFragment().appendChild(E.createElement("div")),(fe=E.createElement("input")).setAttribute("type","radio"),fe.setAttribute("checked","checked"),fe.setAttribute("name","t"),ce.appendChild(fe),y.checkClone=ce.cloneNode(!0).cloneNode(!0).lastChild.checked,ce.innerHTML="",y.noCloneChecked=!!ce.cloneNode(!0).lastChild.defaultValue,ce.innerHTML="",y.option=!!ce.lastChild;var ge={thead:[1,"","
"],col:[2,"","
"],tr:[2,"","
"],td:[3,"","
"],_default:[0,"",""]};function ve(e,t){var n;return n="undefined"!=typeof e.getElementsByTagName?e.getElementsByTagName(t||"*"):"undefined"!=typeof e.querySelectorAll?e.querySelectorAll(t||"*"):[],void 0===t||t&&A(e,t)?S.merge([e],n):n}function ye(e,t){for(var n=0,r=e.length;n",""]);var me=/<|&#?\w+;/;function xe(e,t,n,r,i){for(var o,a,s,u,l,c,f=t.createDocumentFragment(),p=[],d=0,h=e.length;d\s*$/g;function je(e,t){return A(e,"table")&&A(11!==t.nodeType?t:t.firstChild,"tr")&&S(e).children("tbody")[0]||e}function De(e){return e.type=(null!==e.getAttribute("type"))+"/"+e.type,e}function qe(e){return"true/"===(e.type||"").slice(0,5)?e.type=e.type.slice(5):e.removeAttribute("type"),e}function Le(e,t){var n,r,i,o,a,s;if(1===t.nodeType){if(Y.hasData(e)&&(s=Y.get(e).events))for(i in Y.remove(t,"handle events"),s)for(n=0,r=s[i].length;n").attr(n.scriptAttrs||{}).prop({charset:n.scriptCharset,src:n.url}).on("load error",i=function(e){r.remove(),i=null,e&&t("error"===e.type?404:200,e.type)}),E.head.appendChild(r[0])},abort:function(){i&&i()}}});var _t,zt=[],Ut=/(=)\?(?=&|$)|\?\?/;S.ajaxSetup({jsonp:"callback",jsonpCallback:function(){var e=zt.pop()||S.expando+"_"+wt.guid++;return this[e]=!0,e}}),S.ajaxPrefilter("json jsonp",function(e,t,n){var r,i,o,a=!1!==e.jsonp&&(Ut.test(e.url)?"url":"string"==typeof e.data&&0===(e.contentType||"").indexOf("application/x-www-form-urlencoded")&&Ut.test(e.data)&&"data");if(a||"jsonp"===e.dataTypes[0])return r=e.jsonpCallback=m(e.jsonpCallback)?e.jsonpCallback():e.jsonpCallback,a?e[a]=e[a].replace(Ut,"$1"+r):!1!==e.jsonp&&(e.url+=(Tt.test(e.url)?"&":"?")+e.jsonp+"="+r),e.converters["script json"]=function(){return o||S.error(r+" was not called"),o[0]},e.dataTypes[0]="json",i=C[r],C[r]=function(){o=arguments},n.always(function(){void 0===i?S(C).removeProp(r):C[r]=i,e[r]&&(e.jsonpCallback=t.jsonpCallback,zt.push(r)),o&&m(i)&&i(o[0]),o=i=void 0}),"script"}),y.createHTMLDocument=((_t=E.implementation.createHTMLDocument("").body).innerHTML="
",2===_t.childNodes.length),S.parseHTML=function(e,t,n){return"string"!=typeof e?[]:("boolean"==typeof t&&(n=t,t=!1),t||(y.createHTMLDocument?((r=(t=E.implementation.createHTMLDocument("")).createElement("base")).href=E.location.href,t.head.appendChild(r)):t=E),o=!n&&[],(i=N.exec(e))?[t.createElement(i[1])]:(i=xe([e],t,o),o&&o.length&&S(o).remove(),S.merge([],i.childNodes)));var r,i,o},S.fn.load=function(e,t,n){var r,i,o,a=this,s=e.indexOf(" ");return-1").append(S.parseHTML(e)).find(r):e)}).always(n&&function(e,t){a.each(function(){n.apply(this,o||[e.responseText,t,e])})}),this},S.expr.pseudos.animated=function(t){return S.grep(S.timers,function(e){return t===e.elem}).length},S.offset={setOffset:function(e,t,n){var r,i,o,a,s,u,l=S.css(e,"position"),c=S(e),f={};"static"===l&&(e.style.position="relative"),s=c.offset(),o=S.css(e,"top"),u=S.css(e,"left"),("absolute"===l||"fixed"===l)&&-1<(o+u).indexOf("auto")?(a=(r=c.position()).top,i=r.left):(a=parseFloat(o)||0,i=parseFloat(u)||0),m(t)&&(t=t.call(e,n,S.extend({},s))),null!=t.top&&(f.top=t.top-s.top+a),null!=t.left&&(f.left=t.left-s.left+i),"using"in t?t.using.call(e,f):c.css(f)}},S.fn.extend({offset:function(t){if(arguments.length)return void 0===t?this:this.each(function(e){S.offset.setOffset(this,t,e)});var e,n,r=this[0];return r?r.getClientRects().length?(e=r.getBoundingClientRect(),n=r.ownerDocument.defaultView,{top:e.top+n.pageYOffset,left:e.left+n.pageXOffset}):{top:0,left:0}:void 0},position:function(){if(this[0]){var e,t,n,r=this[0],i={top:0,left:0};if("fixed"===S.css(r,"position"))t=r.getBoundingClientRect();else{t=this.offset(),n=r.ownerDocument,e=r.offsetParent||n.documentElement;while(e&&(e===n.body||e===n.documentElement)&&"static"===S.css(e,"position"))e=e.parentNode;e&&e!==r&&1===e.nodeType&&((i=S(e).offset()).top+=S.css(e,"borderTopWidth",!0),i.left+=S.css(e,"borderLeftWidth",!0))}return{top:t.top-i.top-S.css(r,"marginTop",!0),left:t.left-i.left-S.css(r,"marginLeft",!0)}}},offsetParent:function(){return this.map(function(){var e=this.offsetParent;while(e&&"static"===S.css(e,"position"))e=e.offsetParent;return e||re})}}),S.each({scrollLeft:"pageXOffset",scrollTop:"pageYOffset"},function(t,i){var o="pageYOffset"===i;S.fn[t]=function(e){return $(this,function(e,t,n){var r;if(x(e)?r=e:9===e.nodeType&&(r=e.defaultView),void 0===n)return r?r[i]:e[t];r?r.scrollTo(o?r.pageXOffset:n,o?n:r.pageYOffset):e[t]=n},t,e,arguments.length)}}),S.each(["top","left"],function(e,n){S.cssHooks[n]=Fe(y.pixelPosition,function(e,t){if(t)return t=We(e,n),Pe.test(t)?S(e).position()[n]+"px":t})}),S.each({Height:"height",Width:"width"},function(a,s){S.each({padding:"inner"+a,content:s,"":"outer"+a},function(r,o){S.fn[o]=function(e,t){var n=arguments.length&&(r||"boolean"!=typeof e),i=r||(!0===e||!0===t?"margin":"border");return $(this,function(e,t,n){var r;return x(e)?0===o.indexOf("outer")?e["inner"+a]:e.document.documentElement["client"+a]:9===e.nodeType?(r=e.documentElement,Math.max(e.body["scroll"+a],r["scroll"+a],e.body["offset"+a],r["offset"+a],r["client"+a])):void 0===n?S.css(e,t,i):S.style(e,t,n,i)},s,n?e:void 0,n)}})}),S.each(["ajaxStart","ajaxStop","ajaxComplete","ajaxError","ajaxSuccess","ajaxSend"],function(e,t){S.fn[t]=function(e){return this.on(t,e)}}),S.fn.extend({bind:function(e,t,n){return this.on(e,null,t,n)},unbind:function(e,t){return this.off(e,null,t)},delegate:function(e,t,n,r){return this.on(t,e,n,r)},undelegate:function(e,t,n){return 1===arguments.length?this.off(e,"**"):this.off(t,e||"**",n)},hover:function(e,t){return 
this.mouseenter(e).mouseleave(t||e)}}),S.each("blur focus focusin focusout resize scroll click dblclick mousedown mouseup mousemove mouseover mouseout mouseenter mouseleave change select submit keydown keypress keyup contextmenu".split(" "),function(e,n){S.fn[n]=function(e,t){return 0