A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
Discovering authoritative links between publications and the datasets that
they use can be a labor-intensive process. We introduce a natural language
processing pipeline that retrieves and reviews publications for informal
references to research datasets, which complements the work of data librarians.
We first describe the components of the pipeline and then apply it to expand an
authoritative bibliography linking thousands of social science studies to the
data-related publications in which they are used. The pipeline increases recall
for literature to review for inclusion in data-related collections of
publications and makes it possible to detect informal data references at scale.
We contribute (1) a novel Named Entity Recognition (NER) model that reliably
detects informal data references and (2) a dataset connecting items from social
science literature with datasets they reference. Together, these contributions
enable future work on data reference, data citation networks, and data reuse.
Comment: 13 pages, 7 figures, 3 tables
Spatial Discovery and the Research Library: Linking Research Datasets and Documents
Academic libraries have always supported research across disciplines by integrating access to diverse contents and resources. They now have the opportunity to reinvent their role in facilitating interdisciplinary work by offering researchers new ways of sharing, curating, discovering, and linking research data. Spatial data and metadata support this process because location often integrates disciplinary perspectives, enabling researchers to make their own research data more discoverable, to discover data of other researchers, and to integrate data from multiple sources. The Center for Spatial Studies at the University of California, Santa Barbara (UCSB) and the UCSB Library are undertaking joint research to better enable the discovery of research data and publications. The research addresses the question of how to spatially enable data discovery in a setting that allows for mapping and analysis in a GIS while connecting the data to publications about them. It suggests a framework for an integrated data discovery mechanism and shows how publications may be linked to associated data sets exposed either directly or through metadata on Esri’s Open Data platform. The results demonstrate a simple form of linking data to publications through spatially referenced metadata and persistent identifiers. This linking adds value to research products and increases their discoverability across disciplinary boundaries. Current data publishing practices in academia result in datasets that are not easily discovered, hard to integrate across domains, and typically not linked to publications about them. For example, discovering that two datasets, such as archaeological observations and specimen data collections, share a spatial extent in Mesoamerica, is not currently supported, nor is it easy to get from those data sets to relevant publications or other documents. In our previous work, we had developed a basic linked metadata model relating spatially referenced datasets to documents. 
The research reported here applies the model to a collection of spatially referenced researcher datasets, capturing metadata and encoding them as linked open data. We use existing RDF vocabularies to triplify the metadata, to make them spatially explicit, and to link them thematically. Our latest research has produced a simple and extensible method for exposing metadata of research objects as a library service and for spatially integrating collections across repositories.
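The "triplify" step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record, the example.org URIs, and the choice of Dublin Core terms (`title`, `spatial`, `isReferencedBy`) are assumptions made for illustration, and the spatial extent is stored as a plain literal for simplicity.

```python
# Minimal sketch: serialize one dataset metadata record as N-Triples lines.
# All URIs and values below are hypothetical placeholders.
DCTERMS = "http://purl.org/dc/terms/"


def literal(value):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def uri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def triplify(record):
    """Turn one metadata record (a dict) into a list of N-Triples lines."""
    subject = uri(record["id"])
    triples = [
        (subject, uri(DCTERMS + "title"), literal(record["title"])),
        # Spatial extent kept as a literal here; a fuller model would use a geometry node.
        (subject, uri(DCTERMS + "spatial"), literal(record["bbox"])),
    ]
    # Link the dataset to each publication that references it.
    for publication in record["publications"]:
        triples.append((subject, uri(DCTERMS + "isReferencedBy"), uri(publication)))
    return [f"{s} {p} {o} ." for s, p, o in triples]


dataset = {
    "id": "https://example.org/dataset/1",
    "title": "Example field observations",
    "bbox": "westlimit=-92.5; southlimit=14.5; eastlimit=-87.5; northlimit=21.5",
    "publications": ["https://example.org/publication/42"],
}

for line in triplify(dataset):
    print(line)
```

Because each record becomes a set of subject-predicate-object statements, datasets from different repositories can be merged and queried together once they share vocabularies and identifiers.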
DataChat: Prototyping a Conversational Agent for Dataset Search and Visualization
Data users need relevant context and research expertise to effectively search
for and identify relevant datasets. Leading data providers, such as the
Inter-university Consortium for Political and Social Research (ICPSR), offer
standardized metadata and search tools to support data search. Metadata
standards emphasize the machine-readability of data and its documentation.
There are opportunities to enhance dataset search by improving users' ability
to learn about, and make sense of, information about data. Prior research has
shown that context and expertise are two main barriers users face in
effectively searching for, evaluating, and deciding whether to reuse data. In
this paper, we propose a novel chatbot-based search system, DataChat, that
leverages a graph database and a large language model to provide novel ways for
users to interact with and search for research data. DataChat complements data
archives' and institutional repositories' ongoing efforts to curate, preserve,
and share research data for reuse by making it easier for users to explore and
learn about available research data.
Comment: 6 pages, 2 figures, and 1 table. Accepted to the 86th Annual Meeting
of the Association for Information Science & Technology
How and Why do Researchers Reference Data? A Study of Rhetorical Features and Functions of Data References in Academic Articles
Data reuse is a common practice in the social sciences. While published data
play an essential role in the production of social science research, they are
not consistently cited, which makes it difficult to assess their full scholarly
impact and give credit to the original data producers. Furthermore, it can be
challenging to understand researchers' motivations for referencing data. Like
references to academic literature, data references perform various rhetorical
functions, such as paying homage, signaling disagreement, or drawing
comparisons. This paper studies how and why researchers reference social
science data in their academic writing. We develop a typology to model
relationships between the entities that anchor data references, along with
their features (access, actions, locations, styles, types) and functions
(critique, describe, illustrate, interact, legitimize). We illustrate the use
of the typology by coding multidisciplinary research articles (n=30)
referencing social science data archived at the Inter-university Consortium for
Political and Social Research (ICPSR). We show how our typology captures
researchers' interactions with data and purposes for referencing data. Our
typology provides a systematic way to document and analyze researchers'
narratives about data use, extending our ability to give credit to data that
support research.
Comment: 35 pages, 2 appendices, 1 table
Enabling the discovery of thematically related research objects with systematic spatializations
It is challenging for scholars to discover thematically related research in a multidisciplinary setting, such as that of a university library. In this work, we use spatialization techniques to convey the relatedness of research themes without requiring scholars to have specific knowledge of disciplinary search terminology. We approach this task conceptually by revisiting existing spatialization techniques and reframing them in terms of core concepts of spatial information, highlighting their different capacities. To apply our design, we spatialize master's and doctoral theses (two kinds of research objects available through a university library repository) using topic modeling to assign a relatively small number of research topics to the objects. We discuss and implement two distinct spaces for exploration: a field view of research topics and a network view of research objects. We find that each space enables distinct visual perceptions and questions about the relatedness of research themes. A field view enables questions about the distribution of research objects in the topic space, while a network view enables questions about connections between research objects or about their centrality. Our work contributes to spatialization theory a systematic choice of spaces informed by core concepts of spatial information. Its application to the design of library discovery tools offers two distinct and intuitive ways to gain insights into the thematic relatedness of research objects, regardless of the disciplinary terms used to describe them.
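The network view described in this abstract can be sketched as follows. The sketch assumes each research object already has a topic-weight vector from some topic model. The vectors, the cosine similarity measure, the 0.8 threshold, and the use of degree as a stand-in for centrality are all illustrative assumptions rather than the authors' method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def network_edges(topic_vectors, threshold=0.8):
    """Connect every pair of objects whose topic similarity meets the threshold."""
    names = sorted(topic_vectors)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            similarity = cosine(topic_vectors[first], topic_vectors[second])
            if similarity >= threshold:
                edges.append((first, second, similarity))
    return edges


def degree(edges):
    """Count connections per object, a simple proxy for centrality."""
    counts = {}
    for first, second, _ in edges:
        counts[first] = counts.get(first, 0) + 1
        counts[second] = counts.get(second, 0) + 1
    return counts


# Hypothetical topic weights for three theses over three topics.
theses = {
    "thesis_a": [0.7, 0.2, 0.1],
    "thesis_b": [0.6, 0.3, 0.1],
    "thesis_c": [0.1, 0.1, 0.8],
}

edges = network_edges(theses)
print(edges)
print(degree(edges))
```

With these weights, only thesis_a and thesis_b are linked, because they share a dominant topic. A field view would instead interpolate topic weights over a continuous two-dimensional layout.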
How do properties of data, their curation, and their funding relate to reuse?
Despite large public investments in facilitating the secondary use of data, there is little information about the specific factors that predict data’s reuse. Using data download logs from the Inter-university Consortium for Political and Social Research (ICPSR), this study examines how data properties, curation decisions, and repository funding models relate to data reuse. We find that datasets deposited by institutions, subject to many curatorial tasks, and whose access and preservation are funded externally are used more often. Our findings confirm that investments in data collection, curation, and preservation are associated with more data reuse.
National Science Foundation grant 1930645 (LH, AP, DA)
Institute of Museum and Library Services grant LG-37-19-0134-19 (LH, DA)
National Institute on Drug Abuse contract number N01DA-14-5576 (AP)
http://deepblue.lib.umich.edu/bitstream/2027.42/168212/5/Hemphill et al Data downloads.pdf
How and Why Do Researchers Reference Data? A Study of Rhetorical Features and Functions of Data References in Academic Articles
Data reuse is a common practice in the social sciences. It can be difficult to understand researchers' motivations for referencing data. This article studies how and why researchers reference scientific data in their academic writing. We illustrate the use of the typology by coding multidisciplinary research articles. The typology offers a systematic way to document and analyze researchers' narratives.
Detecting Informal Data References in Academic Literature
The Inter-university Consortium for Political and Social Research (ICPSR) is developing a machine learning approach using natural language processing (NLP) to assist in the detection of informal data references. Formal data citations that reference unique identifiers are readily discoverable; however, informal references indicating research data reuse are challenging to infer and detect. We contribute a model that uses a combination of cues, such as the presence of indicator terms and syntactical patterns, to assign a likelihood score to dataset mentions and extract candidate data citations from academic text. In production, the model will support the evaluation of candidate documents for ingest into the ICPSR Bibliography of Data-related Literature. This work supports a larger effort to measure the impact of research data.
http://deepblue.lib.umich.edu/bitstream/2027.42/168392/1/Detecting_Informal_Data_Refs.pdf
Description of Detecting_Informal_Data_Refs.pdf: Preprint
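The idea of combining indicator terms with syntactic patterns to score dataset mentions can be illustrated with a small rule-based sketch. This is not the ICPSR model: the term list, the two regular expressions, and the weights are hypothetical, chosen only to show how several cues can be combined into one score.

```python
import re

# Hypothetical cue lists and weights, for illustration only.
INDICATOR_TERMS = {"data", "dataset", "survey", "study", "sample", "respondents"}
# Phrases such as "data from the X" or "drawn from the X", where X is capitalized.
USE_PATTERN = re.compile(r"\b(?:data|drawn|obtained|taken)\s+(?:from|of)\s+the\s+[A-Z]")
# Two or more capitalized words followed by a typical dataset noun.
TITLE_PATTERN = re.compile(r"(?:[A-Z][a-z]+\s+){2,}(?:Survey|Study|Census|Panel)")


def score_sentence(sentence):
    """Return a score in [0, 1] estimating whether a sentence mentions a dataset."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    term_hits = len(words & INDICATOR_TERMS)
    score = 0.2 * min(term_hits, 2)  # indicator terms contribute at most 0.4
    if USE_PATTERN.search(sentence):
        score += 0.3  # syntactic cue: a data-use phrase
    if TITLE_PATTERN.search(sentence):
        score += 0.3  # syntactic cue: a title-like dataset name
    return round(min(score, 1.0), 2)


def candidates(sentences, minimum=0.5):
    """Keep sentences scoring at or above the minimum, highest score first."""
    scored = [(score_sentence(s), s) for s in sentences]
    return sorted([pair for pair in scored if pair[0] >= minimum], reverse=True)


text = [
    "We use data from the General Social Survey to estimate trust.",
    "Trust has declined over several decades.",
]
for score, sentence in candidates(text):
    print(score, sentence)
```

A learned model would replace the hand-set weights with parameters estimated from labeled examples, but the sentences it flags would feed the same kind of review queue described above.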
Designing for Serendipity: Research Data Curation in Topic Spaces
Researchers seeking relevant data across disciplines confront the challenge of navigating technical descriptions. How can curation support the serendipitous discovery of related research data? Everyday spaces like bookshelves are designed to support browsing and exploration by placing similar resources closer together. Space and time are foundational ordering relations for knowledge organization. I ask how this ordering, which is well-established in the geographic context, can be translated to locate and organize research data in abstract topic spaces. This dissertation develops methods for making the latent topics of research metadata explicit. These methods produce spatial configurations where related research topics are co-located in neighborhoods. This has the potential to support serendipitous discovery by offering researchers ways to discover related data. I test this notion in three studies that develop topic spaces for research data curation. The first part of this dissertation in Chapter 2 focuses on supporting research data discovery with a common terminology. I develop a crosscutting base vocabulary of geospatial topics to help users discover related government data in a ubiquitous open civic data platform. Semantic annotation expands search terms by mapping users’ vernacular onto the language of metadata. In the second part of this dissertation, I shift away from addressing terminological search to supporting spatial curation by developing topic spaces. In Chapter 3, I develop two kinds of topic spaces for curating research theses and dissertations: landscapes and networks. I use topic modeling to determine the latent semantic similarity of research metadata and then produce topic spaces from these using spatialization techniques. In Chapter 4, I spatialize an institute’s multidisciplinary body of research, producing topic maps at two distinct levels of detail. 
Emerging spatial patterns, like centrality and proximity, support high-level narratives about cross-disciplinary research activities that complement the quantitative metrics currently cited in reviews of institutional research. Together, these three studies demonstrate strategies for developing topic spaces in which diverse, yet related, multidisciplinary research data are curated. Future research will extend these methods by tracing the impact of specific curatorial actions contributing to research data discovery and reuse.