---
output: html_document
---

# Introduction {-}

Over the last decade, the supply of socio-economic data available to researchers and policy makers has increased considerably, along with advances in the tools and methods available to exploit these data. This provides the research community and development practitioners with unprecedented opportunities to increase the use and value of existing data.

:::quote
Data that were initially collected with one intention can be reused for a completely different purpose. (…) Because the potential of data to serve a productive use is essentially limitless, enabling the reuse and repurposing of data is critical if data are to lead to better lives. ([World Bank, World Development Report 2021](https://www.worldbank.org/en/publication/wdr2021))
:::

But data can be challenging to find, access, and use, and many valuable datasets therefore remain underutilized. Data repositories and libraries, and the data catalogs they maintain, play a crucial role in making data more discoverable, visible, and usable. Many of these catalogs, however, are built on sub-optimal standards and technological solutions, which limits the findability and visibility of their assets. Addressing such market failures calls for a better marketplace for data.

A better marketplace for data can be developed on the model of large e-commerce platforms, which are designed to serve both buyers and sellers effectively and efficiently. In a marketplace for data, the "buyers" are the data users, and the "sellers" are the organizations that own or curate datasets and seek to make them available to users -- preferably free of charge, to maximize the use of the data. Data platforms must be optimized to provide data users with convenient ways of identifying, locating, and acquiring data (which requires a user-friendly search and recommendation system), and to provide data owners with a trusted mechanism for making their datasets visible and discoverable and for sharing them in a cost-effective, convenient, and safe manner.

Achieving these objectives requires detailed and structured metadata that properly describe the data products. Indeed, search algorithms and recommender systems exploit metadata, not data. Metadata are essential to the credibility, discoverability, visibility, and usability of the data. Adopting metadata standards and schemas is a practical and efficient way to achieve metadata completeness and quality. This Guide presents a set of recommended standards and schemas covering multiple types of data, along with guidance for their implementation. The data types covered include microdata, statistical tables, indicators and time series, geographic datasets, text, images, video recordings, and programs and scripts.
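
To make "structured metadata" concrete, the sketch below assembles a minimal metadata record in R and serializes it to JSON, a format commonly used to exchange schema-compliant metadata between catalogs. The field names are simplified for illustration and do not reproduce any particular schema verbatim.

```r
# A minimal, illustrative metadata record for an indicator/time series.
# Field names are simplified and hypothetical; a real schema-compliant
# record would follow the element names defined by the relevant schema.
library(jsonlite)

indicator_metadata <- list(
  title               = "Unemployment rate (%)",
  data_type           = "indicator / time series",
  producer            = "National statistical office (example)",
  temporal_coverage   = list(start = 2010, end = 2023),
  geographic_coverage = "National",
  source              = "Labor force survey",
  access_policy       = "Open access"
)

# Serialize to JSON for exchange with catalogs and search indexes
toJSON(indicator_metadata, auto_unbox = TRUE, pretty = TRUE)
```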

Chapter 1 of the Guide outlines the challenges associated with finding and using data. Chapter 2 describes the essential features of a modern data catalog, and Chapter 3 explains how rich and structured metadata, compliant with the metadata standards and schemas we describe in the Guide, can enable advanced search algorithms and recommender systems. Finally, Chapters 4 to 13 present the recommended standards and schemas, along with examples of their use.

This Guide was produced by the Office of the World Bank Chief Statistician as a reference for World Bank staff and for partners involved in the curation and dissemination of data related to social and economic development. The standards and schemas it describes are used by the World Bank in its data management and dissemination systems, and in the development of systems and tools for the acquisition, documentation, cataloguing, and dissemination of data. Among these tools are a specialized **Metadata Editor**, designed to facilitate the documentation of datasets in compliance with the recommended standards and schemas, and a cataloguing application ("NADA"). Both applications are openly available.

---
output: html_document
---

# (PART) RATIONALE AND OBJECTIVES {-}

# The challenge of finding, assessing, accessing, and using data {#chapter01}

Many national and international organizations have adopted data sharing policies, but researchers and other data users still face a practical challenge: finding, accessing, and using the data. Navigating an extensive and continually expanding pool of data sources and types can be complex, time-consuming, and occasionally frustrating. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and analyzing them effectively. The challenge stems from issues such as insufficient metadata, the limitations of data discovery systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. The following sections delve into these challenges.

## Finding and assessing data

Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks, often referred to as *tribal knowledge*, to locate and obtain the data they require; this may lead to the use of *convenient* data that are not necessarily the most relevant. Others encounter datasets of interest in academic publications, although the inconsistent or non-standardized citation of datasets can make these hard to trace. Most data users, however, use general search engines or turn to specialized data catalogs to discover relevant data resources.

Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, such as "population of India in 2023," yield instant informative responses (though not always from the most authoritative source). Even less direct queries, like "indicators of malnutrition in Yemen," return adequate responses, as the engine can "understand" concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and underweight. Additionally, generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although not without the risk of errors and inaccuracies. However, these search engines may not be optimized to identify the most relevant data when the user's requirements cannot be expressed as a straightforward query. For instance, internet search engines might offer limited assistance to a researcher seeking "satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition."

While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking pertinent data. Nonetheless, the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal search indexes and algorithms. With the rapid advancements in AI-based solutions, many of which are available as open-source software, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems.

Improving data discoverability involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within those catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which provide a detailed description of datasets, are what search engines index and use to identify and locate data of interest.

Metadata are also the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access both to relevant datasets and to the metadata essential for evaluating the data's suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should allocate resources and time only to obtaining data known to be of high quality and relevance. Evaluating a dataset's fitness for a specific purpose requires different metadata elements for different data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents; if the sample size is too small, the dataset will not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications, so the metadata should offer a comprehensive description of the sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time.
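
As a concrete version of the survey example above, the short R sketch below uses hypothetical variable-level metadata to check whether each category of a variable has enough respondents to support the intended analysis; the data and threshold are invented for illustration.

```r
# Hypothetical variable-level metadata for a survey variable, with
# respondent counts per category (information a rich microdata
# schema can record at the variable level).
variable_meta <- data.frame(
  category      = c("employed", "unemployed", "inactive"),
  n_respondents = c(4210, 85, 2954)
)

min_n <- 100  # analyst's own threshold for valid inference (illustrative)

# Flag categories with too few respondents for the intended analysis
variable_meta$sufficient <- variable_meta$n_respondents >= min_n
print(variable_meta)
```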

## Accessing data

Accessing data is a multifaceted challenge that encompasses legal, ethical, and practical considerations. To ensure that data access is lawful, ethical, and efficient, and that it enables relevant and responsible use of the data, data providers and users must adhere to specific principles and practices:

- Data providers must ensure that they possess the legal rights to share the data and must define clear usage rights for data users.
- Data users must understand how they can use the data, whether for research, commercial purposes, or other applications, and they must strictly adhere to the terms of use.
- Data access must comply with data privacy laws and ethical standards. Sensitive or personally identifiable information must be handled with care to protect individuals' privacy.
- Data providers must furnish comprehensive metadata that provide context and a full understanding of the data. Metadata should include details about the data's provenance, encompassing its history, transformations, and processing steps. Understanding how the data were created and modified is essential for accurate and responsible analysis.
- Data should be available in user-friendly formats compatible with common data analysis tools, such as CSV, JSON, or Excel.
- Data should be accessible through various means, accommodating users' preferences and capacities. This may involve offering downloadable files, providing access through web-based tools, and supporting data streaming.
- APIs are essential for enabling programmable data access, allowing researchers to retrieve and manipulate data programmatically and to integrate them into their research workflows and applications (see the sketch below).
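
As a sketch of what programmatic access can look like, the R snippet below queries a hypothetical catalog API and parses the response; the endpoint URL and query parameters are invented for illustration and do not refer to any actual service.

```r
# Minimal sketch of programmatic data access through a catalog API.
# The endpoint and parameters below are hypothetical.
library(httr)
library(jsonlite)

response <- GET(
  "https://catalog.example.org/api/search",  # hypothetical endpoint
  query = list(keywords = "child malnutrition", format = "json")
)
stop_for_status(response)  # fail early on HTTP errors

# Parse the JSON payload into R objects for further analysis
results <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
str(results)
```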

Data users in developing countries often encounter additional challenges in accessing data, including:

- Lack of resources: Researchers in developing countries may lack the financial resources to purchase data or to access data stored in expensive cloud-based repositories.
- Lack of infrastructure: Researchers in developing countries may lack access to the high-speed internet and computing resources required for working with large datasets.
- Lack of expertise: Researchers in developing countries may lack the expertise to work with complex data formats and data analysis tools.

These specific challenges should be considered when developing data dissemination systems.

## Using data

The challenge for data users extends beyond discovering data to obtaining all the information necessary for a comprehensive understanding of the data and for their responsible and appropriate use. A single indicator label, such as "unemployment rate (%)," can obscure significant variations by country, source, and time. The international recommendations for the definition and calculation of the "unemployment rate" have evolved over time, and not all countries employ the same data collection instrument (e.g., labor force surveys) to gather the underlying data. Detailed metadata should always accompany data on online dissemination platforms, and the association should be close: relevant metadata should ideally be no more than one click away from the data. This is particularly crucial when a platform publishes data from multiple sources that are not fully harmonized.

:::quote
The scope and meaning of labor statistics, in general, are determined by their source and methodology, which holds true for the unemployment rate. To interpret the data accurately, it is crucial to understand what the data convey, how they were collected and constructed, and to have information on the relevant metadata. The design and characteristics of the data source, typically a labor force survey or a similar household survey for the unemployment rate, especially in terms of definitions and concepts used, geographical and age coverage, and reference periods, have significant implications for the resulting data. Taking these aspects into account is essential when analyzing the statistics. Additionally, it is crucial to seek information on any methodological changes and breaks in series to assess their impact on trend analysis, and to keep in mind methodological differences across countries when conducting cross-country studies. (From *Quick guide on interpreting the unemployment rate*, International Labour Office, Geneva: ILO, 2019, ISBN 978-92-2-133323-4 (web PDF))
:::

Whenever possible, reproducible or replicable scripts used with the data, along with the analytical output of these scripts, should be published alongside the data. These scripts can be highly valuable to researchers who wish to expand the scope of previous data analysis or reuse parts of the code, and to students who can learn from reading and replicating the work of experienced analysts. To enhance data usability, we have developed a specific metadata schema for documenting research projects and scripts.

## A FAIR solution

To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, better search engines, and increased accessibility. Adhering to the [FAIR principles](https://doi.org/10.1371/journal.pcbi.1008469) (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management.

It is essential to focus on the entire data curation process, from acquisition to dissemination, streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To make data findable, libraries should implement advanced search algorithms and filters, including full-text, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more visible. Moreover, multiple modes of data access should be offered to enhance accessibility, and data should be made interoperable to promote sharing and reuse. Detailed metadata, including the information needed to assess fitness for purpose, should be displayed alongside the data, together with related scripts and permanent identifiers such as a DOI, to encourage reuse.