From 4ecf69a82650813ee811dc6ee0378f5c5b1a17c9 Mon Sep 17 00:00:00 2001
From: Olivier Dupriez <35276300+odwb@users.noreply.github.com>
Date: Tue, 31 Oct 2023 13:24:54 -0400
Subject: [PATCH] Update 01_chapter01_challenge_finding_using_data.Rmd

---
 01_chapter01_challenge_finding_using_data.Rmd | 51 +++++++++----------
 1 file changed, 23 insertions(+), 28 deletions(-)

diff --git a/01_chapter01_challenge_finding_using_data.Rmd b/01_chapter01_challenge_finding_using_data.Rmd
index 0658168..b183984 100644
--- a/01_chapter01_challenge_finding_using_data.Rmd
+++ b/01_chapter01_challenge_finding_using_data.Rmd
@@ -6,55 +6,50 @@ output: html_document

# The challenge of finding and assessing, accessing, and using data {#chapter01}

-In the landscape of data sharing policies adopted by numerous national and international organizations, a common challenge emerges for researchers and other data users: the practicality of finding, accessing, and using data in an efficient manner. Navigating through a vast and ever-growing pool of data sources and types can be a complex, time-consuming, and sometimes frustrating endeavor. It involves identifying pertinent sources, acquiring and comprehending relevant datasets, and efficiently analyzing them. This challenge is marked by issues such as inadequate metadata, limitations of data discovery systems, and low visibility of valuable data repositories and cataloguing systems. The technical hurdles to data discoverability, accessibility, and usability must be addressed to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the upcoming sections, we will delve into these challenges.
+In the realm of data sharing policies adopted by numerous national and international organizations, a common challenge arises for researchers and other data users: the practicality of finding, accessing, and using data. Navigating through an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and effectively analyzing them. This challenge is characterized by issues such as insufficient metadata, limitations of data discovery systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the following sections, we will delve into these challenges.

## Finding and assessing data

-Researchers and data users identify and acquire data in various ways. Some rely on personal networks --or "tribal knowledge"-- to find and obtain the data they need. This can lead to the use of "convenient" data that may not be the most relevant. Others may locate datasets of interest in academic publications, which can be challenging as datasets are often not cited in a consistent or standardized manner. But most data users search specialized data catalogs or use general search engines to locate relevant data resources or catalogs.
+Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks, often referred to as *tribal knowledge*, to locate and obtain the data they require. This may lead to the use of *convenient* data that may not be the most relevant. Others may encounter datasets of interest in academic publications, which can be challenging due to the inconsistent or non-standardized citation of datasets. However, most data users rely on general search engines or turn to specialized data catalogs to discover relevant data resources.
-The lead internet search engines have remarkable capabilities in locating and ranking relevant resources available online. The algorithms that power these search engines have lexical and semantic capabilities. Simple data queries -- for example a query for "population of India in 2023" will receive an instant informative response (although not always from the most authoritative source of information). Less straightforward queries -- for example, "indicators of malnutrition in yemen") will also return adequate responses, as the engine will be able to associate malnutrition with the anthropometric indicators of stunting, wasting, and underweight population. But these search engines are not optimized and may not be able to find the most relevant data when the user's requirements cannot be expressed in the form of a straightforward query. For example, internet search engines would be of limited help to a researcher looking for "satellite imagery that could be combined with survey data to generate small-area estimates of child malnutrition".
+Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, such as a query for "population of India in 2023," yield instant informative responses (though not always from the most authoritative source). Even less direct queries, like "indicators of malnutrition in Yemen," return adequate responses, as the engine can "understand" concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and the underweight population. Additionally, generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although it is not without the risk of errors and inaccuracies. However, these search engines may not be optimized to identify the most relevant data when the user's requirements cannot be expressed in the form of a straightforward query. For instance, internet search engines might offer limited assistance to a researcher seeking "satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition."

-While general search engines are crucial in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms maintained by national or international organizations, academic data centers, data archives, or data libraries may thus be better suited for researchers seeking relevant data. However, the search algorithms integrated into such specialized data catalogs can sometimes provide unsatisfactory search results due to the lack of optimization of search indexes and algorithms. With the fast developments of AI-based solutions, many of them available as open source software, specialized catalogs have the possibility to considerably improve the capabilities of their search engine, transforming them into proper data recommender systems.
+While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking pertinent data. Nonetheless, the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal search indexes and algorithms. With the rapid advancements in AI-based solutions, many of which are available as open-source software, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems.
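+
+To make this concrete, the toy sketch below ranks a few invented catalog entries against a query using TF-IDF weighting and cosine similarity, the kind of lexical relevance scoring a basic catalog search engine performs. The entries, query, and scoring are purely illustrative; production catalogs rely on dedicated search engines and, increasingly, on semantic embeddings.
+
+```{r, eval=FALSE}
+# Toy relevance ranking: TF-IDF weights + cosine similarity.
+# Catalog entries and query are invented for illustration.
+docs <- c(
+  survey  = "household survey anthropometric indicators stunting wasting underweight children",
+  imagery = "satellite imagery night lights land cover small area estimation",
+  prices  = "consumer price index monthly series inflation food"
+)
+query <- "child malnutrition anthropometric survey"
+
+tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")[[1]]
+texts <- c(docs, query = query)
+vocab <- unique(unlist(lapply(texts, tokenize)))
+
+# Term-frequency vector over the shared vocabulary
+tf <- function(x) {
+  tokens <- tokenize(x)
+  as.numeric(table(factor(tokens, levels = vocab))) / length(tokens)
+}
+m   <- sapply(texts, tf)      # terms x texts matrix
+idf <- log(length(docs) / pmax(1, rowSums(m[, names(docs)] > 0)))
+w   <- m * idf                # TF-IDF weighting
+
+cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
+scores <- apply(w[, names(docs)], 2, cosine, b = w[, "query"])
+sort(scores, decreasing = TRUE)  # ranked catalog entries
+# Note: purely lexical matching never pairs "child" with "children" --
+# the kind of gap that semantic (embedding-based) search closes.
+```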
-The solution to improve discoverability of data involves (i) improving the on-line visibility of specialized data catalogs, and (ii) modernizing the discoverability tools in specialized data catalogs.[1] Both require high quality, comprehensive and structured metadata. Metadata -- the detailed description of datasets -- is what search engines will index and use to identify and locate data of interest
+The solution to improve data discoverability involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within these catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines index and use to identify and locate data of interest.
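+
+One practical step toward online visibility is to embed structured, machine-readable metadata in each catalog page, for example a schema.org/Dataset record serialized as JSON-LD, the format harvested by services such as Google Dataset Search. The sketch below assembles such a record in R; the dataset, identifier, and all field values are fictitious placeholders.
+
+```{r, eval=FALSE}
+# Sketch: build a schema.org/Dataset JSON-LD record for a catalog page.
+# All values below are fictitious placeholders.
+library(jsonlite)
+
+record <- list(
+  `@context`       = "https://schema.org/",
+  `@type`          = "Dataset",
+  name             = "Household Health Survey 2023 (illustrative)",
+  description      = "Nationally representative household survey microdata.",
+  temporalCoverage = "2023-01/2023-12",
+  spatialCoverage  = "Yemen",
+  license          = "https://creativecommons.org/licenses/by/4.0/",
+  identifier       = "https://doi.org/10.xxxx/placeholder",  # placeholder DOI
+  keywords         = c("malnutrition", "stunting", "wasting")
+)
+
+# Embed the output in the page inside a <script type="application/ld+json"> tag.
+toJSON(record, pretty = TRUE, auto_unbox = TRUE)
+```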
-[geographic - location name challenge; heographic indexing combined with Nominatim]
-
-Metadata is the first thing that data users will look at to assess whether the data meet their needs. Ideally, researchers should have easy access to both relevant datasets and the metadata required to evaluate the data's suitability for their specific purposes. Obtaining a dataset may be time-consuming and sometimes costly. Therefore, users should only invest resources and time in acquiring data that they know is of high quality and relevance.
-
-Assessing a dataset's fitness for a specific purpose necessitates different metadata elements for varying data types and uses. Some metadata elements are straightforward, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy. However, more detailed information may be required. For instance, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a specific variable has a sufficient number of respondents. If the sample size is minimal, the dataset would not allow for any valid statistical inference. Furthermore, comparability across sources is crucial to many users and uses, so the metadata should provide a detailed description of sampling, universe, variables, concepts, and methods relevant to the data type. A data user may also require information on the frequency of data updates (for time series or panel surveys, for example) and on previous uses of the dataset by the research community.
+Metadata is the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access to both relevant datasets and the metadata essential for evaluating the data's suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should invest resources and time only in acquiring data known to be of high quality and relevance. Evaluating a dataset's fitness for a specific purpose necessitates different metadata elements for various data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is minimal, the dataset would not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time.

## Accessing data

-Accessing data is a multifaceted challenge that involves legal, ethical, and practical considerations. To ensure that data access is legal, ethical and enables relevant and responsible use of the data, data providers and users must adhere to certain principles and practices:
+Accessing data is a multifaceted challenge that encompasses legal, ethical, and practical considerations. To ensure that data access is lawful, ethical, efficient, and enables relevant and responsible use of the data, data providers and users must adhere to specific principles and practices:

-- Data providers must ensure that they have the legal rights to share the data and that they define clear usage rights for data users. Data users need to know how they can use the data, whether it's for research, commercial purposes, or other applications, and they must strictly comply with the terms of use.
+- Data providers must ensure that they possess the legal rights to share the data and define clear usage rights for data users.
+- Data users must understand how they can use the data, whether for research, commercial purposes, or other applications, and they must strictly adhere to the terms of use.
 - Data access must comply with data privacy laws and ethical standards. Sensitive or personally identifiable information must be handled with care to protect individuals' privacy.
-- Data providers must offer comprehensive metadata that provides context and full understanding of the data. Metadata should include details about the data's provenance, including its history, transformations, and processing steps. Understanding how the data was created and modified is essential for accurate and responsible analysis.
-- Data should be available in formats that are user-friendly and compatible with common data analysis tools. Common formats like CSV, JSON, or Excel can be practical choices.
-- Data should be made accessible through various means, considering users' preferences and capacities. This might involve offering downloadable files, providing access through web-based tools, and supporting data streaming. APIs are crucial for enabling programmable access to data. They allow researchers to retrieve and manipulate data programmatically, integrating it into their research workflows and applications.
+- Data providers must furnish comprehensive metadata that provides context and a full understanding of the data. Metadata should include details about the data's provenance, encompassing its history, transformations, and processing steps. Understanding how the data was created and modified is essential for accurate and responsible analysis.
+- Data should be available in user-friendly formats compatible with common data analysis tools, such as CSV, JSON, or Excel.
+- Data should be accessible through various means, accommodating users' preferences and capacities. This may involve offering downloadable files, providing access through web-based tools, and supporting data streaming.
+- APIs are essential for enabling programmable data access, allowing researchers to retrieve and manipulate data programmatically for integration into their research workflows and applications (a minimal sketch follows this list).
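+
+As a minimal sketch of the last point, the code below retrieves one indicator through a public REST API. The World Bank Indicators API is used here purely as an example; any documented data API follows the same request-and-parse pattern.
+
+```{r, eval=FALSE}
+# Sketch: programmatic data access through a REST API
+# (total population of India in 2023, World Bank Indicators API).
+library(httr)
+library(jsonlite)
+
+resp <- GET(
+  "https://api.worldbank.org/v2/country/IND/indicator/SP.POP.TOTL",
+  query = list(format = "json", date = "2023")
+)
+stop_for_status(resp)
+
+# The API returns a two-element array: request metadata, then the records.
+parsed  <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
+records <- parsed[[2]]
+records[, c("date", "value")]
+```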
-Data users in developing countries often face additional challenges in accessing data. These challenges include:
- - Lack of resources: Researchers in developing countries may not have the financial resources to purchase data or to access data that is stored in expensive cloud-based repositories.
- - Lack of infrastructure: Researchers in developing countries may not have access to the high-speed internet and computing resources that are needed to work with large datasets.
- - Lack of expertise: Researchers in developing countries may not have the expertise to work with complex data formats and to use data analysis tools.
-These specific challenges should be taken into consideration when developming data dissemination systems.
+Data users in developing countries often encounter additional challenges in accessing data, including:
+
+- Lack of resources: Researchers in developing countries may lack the financial resources to purchase data or access data stored in expensive cloud-based repositories.
+- Lack of infrastructure: Researchers in developing countries may lack access to the high-speed internet and computing resources required for working with large datasets.
+- Lack of expertise: Researchers in developing countries may lack the expertise to work with complex data formats and utilize data analysis tools.
+
+These specific challenges should be considered when developing data dissemination systems.

## Using data

-The challenge for data users is not only to discover data, but also to obtain all necessary information to fully understand the data and to use them responsibly and appropriately. A same indicator label, for example *unemployment rate (%)*, can mask significant differences by country, source, and time. The international recommendations for the definition and calculation of *unemployment rate* has changed over time, and not all countries use the same data collection instrument (labor force survey or other) to collect the underlying data. In on-line data dissemination platforms, detailed metadata should therefore always be associated and disseminated with the data. This must be a close association; the relevant metadata will ideally not be more than one click away from the data. This is particularly critical when a platform publishes data from multiple sources that are not fully harmonized.
+The challenge for data users extends beyond discovering data to obtaining all the necessary information for a comprehensive understanding of the data and for responsible and appropriate use. A single indicator label, such as "unemployment rate (%)," can obscure significant variations by country, source, and time. The international recommendations for the definition and calculation of the "unemployment rate" have evolved over time, and not all countries employ the same data collection instrument (e.g., labor force surveys) to gather the underlying data. Detailed metadata should always accompany data on online data dissemination platforms. This association should be close; relevant metadata should ideally be no more than one click away from the data. This is particularly crucial when a platform publishes data from multiple sources that are not fully harmonized.
:::quote
-The scope and meaning of labour statistics in general are determined by their source and methodology, and this is certainly true for the unemployment rate. In order to interpret the data accurately, it is crucial to understand what the data convey and how they were collected and constructed, which implies having information on the relevant metadata. The design and characteristics of the data source (typically a labour force survey or similar household survey for the unemployment rate), especially in terms of definitions and concepts used, geographical and age coverage, and reference periods have great implications for the resulting data, making it crucial to take them into account when analysing the statistics. It is also essential to seek information on any methodological changes and breaks in series to assess their impact for trend analysis, and to keep in mind methodological differences across countries when conducting cross-country studies. (From [*Quick guide on interpreting the unemployment rate*](https://ilo.org/wcmsp5/groups/public/---dgreports/---stat/documents/publication/wcms_675155.pdf), International Labour Office – Geneva: ILO, 2019, ISBN : 978-92-2-133323-4 (web pdf)).
+The scope and meaning of labor statistics, in general, are determined by their source and methodology, which holds true for the unemployment rate. To interpret the data accurately, it is crucial to understand what the data convey, how they were collected and constructed, and to have information on the relevant metadata. The design and characteristics of the data source, typically a labor force survey or a similar household survey for the unemployment rate, especially in terms of definitions and concepts used, geographical and age coverage, and reference periods, have significant implications for the resulting data. Taking these aspects into account is essential when analyzing the statistics. Additionally, it is crucial to seek information on any methodological changes and breaks in series to assess their impact on trend analysis and to keep in mind methodological differences across countries when conducting cross-country studies. (From [*Quick guide on interpreting the unemployment rate*](https://ilo.org/wcmsp5/groups/public/---dgreports/---stat/documents/publication/wcms_675155.pdf), International Labour Office – Geneva: ILO, 2019, ISBN: 978-92-2-133323-4 (web pdf)).
:::

-When possible, reproducible or replicable scripts that made use of the data, and the analytical output of these scripts, should be published with the data. These scripts may be highly valuable to researchers who may want to expand the scope of previous data analysis or re-purpose part of the code, and to students who may learn from reading and replicating the work of experienced analysts. To foster the usability of data, we developed a specific metadata schema for the documentation of research projects and scripts.
+Whenever possible, reproducible or replicable scripts used with the data, along with the analytical output of these scripts, should be published alongside the data. These scripts can be highly valuable to researchers who wish to expand the scope of previous data analysis or reuse parts of the code, and to students who can learn from reading and replicating the work of experienced analysts. To enhance data usability, we have developed a specific metadata schema for documenting research projects and scripts.
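+
+As an illustration, a script-level metadata record could capture fields such as those below. The field names and values are invented for this sketch; they do not reproduce the actual schema referred to above.
+
+```{r, eval=FALSE}
+# Sketch: a minimal metadata record for an analysis script.
+# Field names are illustrative, not the schema referred to in the text.
+library(jsonlite)
+
+script_metadata <- list(
+  title        = "Small-area estimation of child malnutrition",
+  authors      = c("A. Analyst", "B. Researcher"),
+  language     = "R",
+  date         = "2023-10-31",
+  datasets     = c("Household survey microdata", "satellite imagery"),
+  dependencies = c("survey", "sf"),
+  license      = "MIT",
+  instructions = "Run 01_prepare.R, then 02_estimate.R; outputs in results/."
+)
+
+toJSON(script_metadata, pretty = TRUE, auto_unbox = TRUE)
+```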
## A FAIR solution

-To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. To improve search capabilities and increase the visibility of specialized data libraries, a combination of better data curation, search engines, and increased accessibility is necessary. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management. (https://doi.org/10.1371/journal.pcbi.1008469)
-It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This requires anticipating user needs and investing in the curation of data for reuse. To ensure data is **findable**, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more **accessible**. Additionally, multiple modes of data access should be available to improve accessibility, while data should be made **interoperable** to promote data sharing and reusability. Detailed metadata, including fitness for purpose assessments, should be displayed, alongside scripts and permanent availability options, such as a DOI, to promote **reuse**.
+To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, search engines, and increased accessibility. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469).
-[1] The internet search engines are themselves investing in specialized data discovery solutions. See for example Google Dataset Search and Google Data Commons.
 [2] The results shown are for a specific date, and are subject to variation over time.
+It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To ensure data is **findable**, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more **accessible**.
+Moreover, multiple modes of data access should be available to enhance accessibility, while data should be made **interoperable** to promote data sharing and reusability. Detailed metadata, including fitness-for-purpose assessments, should be displayed alongside scripts and permanent availability options, such as a DOI, to encourage **reuse**.
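+
+As a final sketch of what a persistent identifier buys in practice, the code below resolves the DOI cited above through standard content negotiation on doi.org and retrieves machine-readable citation metadata.
+
+```{r, eval=FALSE}
+# Sketch: retrieve machine-readable citation metadata for a DOI via
+# content negotiation on doi.org (a standard DOI registry service).
+library(httr)
+library(jsonlite)
+
+resp <- GET(
+  "https://doi.org/10.1371/journal.pcbi.1008469",
+  accept("application/vnd.citationstyles.csl+json")
+)
+stop_for_status(resp)
+
+meta <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
+meta$title  # article title
+meta$DOI    # the persistent identifier itself
+```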