From 211cb109eb741537f2f4eb9b98704880abe9ee60 Mon Sep 17 00:00:00 2001 From: Olivier Dupriez <35276300+odwb@users.noreply.github.com> Date: Thu, 23 Nov 2023 14:48:22 -0500 Subject: [PATCH] Update 01_chapter01_challenge_finding_using_data.Rmd --- 01_chapter01_challenge_finding_using_data.Rmd | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/01_chapter01_challenge_finding_using_data.Rmd b/01_chapter01_challenge_finding_using_data.Rmd index 8d18d0d..922f0c8 100644 --- a/01_chapter01_challenge_finding_using_data.Rmd +++ b/01_chapter01_challenge_finding_using_data.Rmd @@ -6,21 +6,17 @@ output: html_document # The challenge of finding, accessing, and using data {#chapter01} -In the realm of data sharing policies adopted by numerous national and international organizations, a common challenge arises for researchers and other data users: the practicality of finding, accessing, and using data. Navigating through an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and effectively analyzing them. This challenge is characterized by issues such as insufficient metadata, limitations of data discovery systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the following sections, we will delve into these challenges. +In the realm of data sharing policies adopted by numerous national and international organizations, a common challenge arises for researchers and other data users: the practicality of finding, accessing, and using data. 
Navigating through an extensive and continually expanding pool of data sources and types can be a complex, time-consuming, and occasionally frustrating undertaking. It entails identifying relevant sources, acquiring and comprehending pertinent datasets, and effectively analyzing them. This challenge is characterized by issues such as insufficient metadata, limitations of data discovery algorithms and systems, and the limited visibility of valuable data repositories and cataloging systems. Addressing the technical hurdles to data discoverability, accessibility, and usability is vital to enhance the effectiveness of data sharing policies and maximize the utility of collected data. In the following sections, we will delve into these challenges. ## Finding data -Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks or *tribal knowledge* to locate and obtain the data they require, or identify datasets of interest in academic publications. This may lead to the use of *convenient* data that may not be the most relevant. Many other data users will use general search engines or turn to specialized data catalogs to discover relevant data resources. +Researchers and data users employ various methods to identify and acquire data. Some rely on personal networks or *tribal knowledge* to locate and obtain the data they require, or identify datasets of interest in academic publications. This may lead to the use of data that may not be the most relevant for the researcher's specific purpose. Many other data users rely on general search engines or turn to specialized data catalogs to discover relevant data resources. -Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. 
Straightforward data queries, for example a query for *population of India in 2023*, may yield instant informative responses (though not always from the most authoritative source). Less specific queries, for example *indicators of malnutrition in Yemen*, may return adequate responses, as the search engine can understand concepts and associate malnutrition with anthropometric indicators like stunting, wasting, and the underweight population. Generative AI has augmented the capabilities of these search engines to engage with data users in a conversational manner, which can be suitable for addressing simple queries, although not without risk of errors and inaccuracies. +Prominent internet search engines possess notable capabilities in locating and ranking pertinent resources available online. The algorithms powering these search engines incorporate lexical and semantic capabilities. Straightforward data queries, for example a search for *population of India in 2023*, will yield an instant answer to the prompt (though not always from the most authoritative source). Less specific queries, for example a search for *malnutrition indicators for Yemen*, will not return a direct answer, but will provide adequate information and links to useful resources, as the search engine has semantic capability and can associate anthropometric indicators like *percentage of stunting, wasting, and underweight population* with the concept of malnutrition. Generative artificial intelligence has added to the capability of these search engines to engage with data users using natural language, which may be suitable for addressing simple queries, although not without risk of errors and inaccuracies. -But these search engines are not be optimized to identify the most relevant data when the user's requirements cannot be expressed in the form of a straightforward query. 
For instance, internet search engines might offer limited assistance to a researcher seeking *satellite imagery that can be combined with survey data to generate small-area estimates of child malnutrition*. +But these search engines are not optimized to identify the most relevant data when the user's requirements cannot be expressed in the form of a straightforward query. And the answers they return to users' queries are constrained by the quality of metadata attached to data available online. While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking data. Unfortunately, the search algorithms integrated into these specialized data catalogs are often limited to simple, poorly optimized keyword-based systems, and many rely on sub-optimal metadata. They have lexical search capability -- although not always adequately optimized -- but lack semantic search capability and fail to operate as recommender systems. With the rapid advancements of technology, the search performance of specialized data catalogs has the potential to be significantly enhanced, transforming data catalogs into effective data recommender systems. The solution necessitates high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines -- from specialized catalogs or from internet search engines -- index and use to identify and locate data of interest. -While general search engines are pivotal in directing users to relevant catalogs and repositories, specialized online data catalogs and platforms managed by national or international organizations, academic data centers, data archives, or data libraries may be better suited for researchers seeking data. 
But the search algorithms integrated into these specialized data catalogs may at times yield unsatisfactory search results due to suboptimal metadata, indexing, and search algorithms. With the rapid advancements of technology, specialized catalogs have the potential to significantly enhance the capabilities of their search engines, transforming them into effective data recommender systems and making the vision of a better market place for data achievable. - -The solution involves (i) enhancing the online visibility of specialized data catalogs and (ii) modernizing the discoverability tools within specialized data catalogs.[1] Both necessitate high-quality, comprehensive, and structured metadata. Metadata, which offers a detailed description of datasets, is what search engines index and use to identify and locate data of interest. - -Metadata is the first element that data users examine to assess whether the data align with their requirements. Ideally, researchers should have easy access to both relevant datasets and the metadata essential for evaluating the data's suitability for their specific purposes. Acquiring a dataset can be time-consuming and occasionally costly; hence, users should allocate resources and time exclusively to obtain data that is known to be of high quality and relevance. Evaluating a dataset's fitness for a specific purpose necessitates different metadata elements for various data types and applications. Some metadata elements, such as data type, temporal coverage, geographic coverage, scope and universe, and access policy, are straightforward. However, more intricate information may be required. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is minimal, the dataset would not support valid statistical inference. 
Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time. +Metadata is not only needed to enable data discovery systems; it is also required to allow users to assess the relevance, or fitness-for-purpose, of a dataset. Metadata is the first element that data users examine to assess whether the data align with their requirements. Acquiring a dataset can be a time-consuming and occasionally costly activity. Users should be provided with the necessary information to assess the relevance of a dataset prior to acquiring it. Evaluating the relevance of a dataset necessitates different metadata elements. Some, such as the data type, geographic and temporal coverage, scope and universe, and access policy, are straightforward. Others may be more specific. For example, a survey dataset (microdata) may only be relevant to a researcher if a specific modality of a particular variable has a sufficient number of respondents. If the sample size is too small, the dataset would not support valid statistical inference. Furthermore, comparability across sources is vital for many users and applications; thus, the metadata should offer a comprehensive description of sampling, universe, variables, concepts, and methods relevant to the data type. Data users may also seek information on the frequency of data updates, previous uses of the dataset within the research community, and methodological changes over time. Too often, the metadata provided to users lacks such detail. 
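As an illustration of the structured metadata that search engines index, the following is a minimal sketch of a dataset description using the schema.org `Dataset` vocabulary, the kind of markup that general-purpose dataset search engines crawl. The survey name and all field values are hypothetical:

```python
import json

# Minimal, hypothetical dataset description using the schema.org
# "Dataset" vocabulary -- the kind of structured metadata that
# general-purpose search engines can index. All values are illustrative.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Household Survey 2023 (hypothetical)",
    "description": "Nationally representative household survey "
                   "covering nutrition and living conditions.",
    "temporalCoverage": "2023-01/2023-12",
    "spatialCoverage": "Yemen",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "variableMeasured": ["stunting", "wasting", "underweight"],
}

# Serialize as JSON-LD, ready to embed in a catalog landing page.
print(json.dumps(metadata, indent=2))
```

A record like this covers only the "straightforward" elements (type, coverage, license); the richer fitness-for-purpose information discussed above (sampling, universe, variable-level detail) requires more specialized standards such as those used by survey data archives.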
## Accessing data @@ -52,6 +48,6 @@ Whenever possible, reproducible or replicable scripts used with the data, along ## A FAIR solution -To effectively address the information retrieval challenge, researchers should consider not only the content of the information but also the context within which it is created and the diverse range of potential users who may need it. A foundational element is being mindful of users and their potential interactions with the data and work. Improving search capabilities and increasing the visibility of specialized data libraries requires a combination of enhanced data curation, search engines, and increased accessibility. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469). +Improving search capabilities and increasing the accessibility of data requires a combination of enhanced data curation and improved search engines. Adhering to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) is an effective approach to data management (https://doi.org/10.1371/journal.pcbi.1008469). It is essential to focus on the entire data curation process, from acquisition to dissemination, to optimize data analysis by streamlining the process of finding, assessing, accessing, and preparing data. This involves anticipating user needs and investing in data curation for reuse. To ensure data is findable, libraries should implement advanced search algorithms and filters, including full-text, advanced, semantic, and recommendation-based search options. Search engine optimization is also crucial for making catalogs more accessible. Moreover, multiple modes of data access should be available to enhance accessibility, while data should be made interoperable to promote data sharing and reusability. 
Detailed metadata, including fitness-for-purpose assessments, should be displayed alongside scripts and permanent availability options, such as a DOI, to encourage reuse.
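The gap between lexical and semantic search discussed in the "Finding data" section can be sketched with a toy example. The catalog entries, the synonym table, and the query below are all hypothetical, and a real semantic search engine would rely on embedding models rather than a hand-built synonym list:

```python
# Toy contrast between lexical (exact keyword) and semantic search.
# Catalog entries and the synonym table are hypothetical; real systems
# would use embedding models instead of a hand-built synonym list.
catalog = {
    "DHS-2023": "survey with stunting and wasting indicators",
    "CPI-2023": "monthly consumer price index series",
}

# A crude stand-in for semantic capability: map a concept to related terms.
synonyms = {"malnutrition": {"stunting", "wasting", "underweight"}}

def lexical_search(query):
    # Match only documents containing the literal query terms.
    terms = set(query.lower().split())
    return [k for k, text in catalog.items() if terms & set(text.split())]

def semantic_search(query):
    # Expand the query with related concepts before matching.
    terms = set(query.lower().split())
    for t in list(terms):
        terms |= synonyms.get(t, set())
    return [k for k, text in catalog.items() if terms & set(text.split())]

print(lexical_search("malnutrition"))   # no hit: the word never appears
print(semantic_search("malnutrition"))  # finds the survey via related terms
```

The lexical query returns nothing because no catalog entry contains the literal word *malnutrition*, while the concept-expanded query retrieves the relevant survey, which is precisely the limitation of keyword-only catalog search that the chapter describes.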