The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures
Materials science literature contains millions of materials synthesis
procedures described in unstructured natural language text. Large-scale
analysis of these synthesis procedures would facilitate deeper scientific
understanding of materials synthesis and enable automated synthesis planning.
Such analysis requires extracting structured representations of synthesis
procedures from the raw text as a first step. To facilitate the training and
evaluation of synthesis extraction models, we introduce a dataset of 230
synthesis procedures annotated by domain experts with labeled graphs that
express the semantics of the synthesis sentences. The nodes in this graph are
synthesis operations and their typed arguments, and labeled edges specify
relations between the nodes. We describe this new resource in detail and
highlight some specific challenges to annotating scientific text with shallow
semantic structure. We make the corpus available to the community to promote
further research and development of scientific information extraction systems.
Comment: Accepted as a long paper at the Linguistic Annotation Workshop (LAW)
at ACL 201
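The labeled-graph annotation described in the abstract above can be pictured with a small data structure. This is only an illustrative sketch, not the corpus's actual format: the class names, the node labels ("Operation", "Material", "Condition"), and the relation names ("acts_on", "has_condition") are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    text: str    # the annotated text span
    label: str   # hypothetical type label, e.g. "Operation" or "Material"


@dataclass
class SynthesisGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        # Reject edges whose endpoints have not been annotated as nodes.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source_id, relation, target_id))

    def arguments_of(self, operation_id):
        # Return (relation, Node) pairs for edges leaving the given operation.
        return [(rel, self.nodes[dst])
                for src, rel, dst in self.edges if src == operation_id]


def build_example():
    # Hypothetical sentence: "The mixture was heated at 500 C."
    graph = SynthesisGraph()
    graph.add_node(Node("n1", "heated", "Operation"))
    graph.add_node(Node("n2", "The mixture", "Material"))
    graph.add_node(Node("n3", "500 C", "Condition"))
    graph.add_edge("n1", "acts_on", "n2")
    graph.add_edge("n1", "has_condition", "n3")
    return graph
```

Storing edges as labeled triples keeps the relation type explicit, which mirrors the abstract's description of typed arguments connected to operations by labeled edges.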
SynKB: Semantic Search for Synthetic Procedures
In this paper we present SynKB, an open-source, automatically extracted
knowledge base of chemical synthesis protocols. Similar to proprietary
chemistry databases such as Reaxys, SynKB allows chemists to retrieve
structured knowledge about synthetic procedures. By taking advantage of recent
advances in natural language processing for procedural texts, SynKB supports
more flexible queries about reaction conditions, and thus has the potential to
help chemists search the literature for conditions used in relevant reactions
as they design new synthetic routes. Using customized Transformer models to
automatically extract information from 6 million synthesis procedures described
in U.S. and EU patents, we show that for many queries, SynKB has higher recall
than Reaxys, while maintaining high precision. We plan to make SynKB available
as an open-source tool; in contrast, proprietary chemistry databases require
costly subscriptions.
Comment: Accepted to EMNLP 2022 Demo track
Building Open Knowledge Graph for Metal-Organic Frameworks (MOF-KG): Challenges and Case Studies
Metal-Organic Frameworks (MOFs) are a class of modular, porous crystalline
materials that have great potential to revolutionize applications such as gas
storage, molecular separations, chemical sensing, catalysis, and drug delivery.
The Cambridge Structural Database (CSD) reports 10,636 synthesized MOF crystals
and additionally contains ca. 114,373 MOF-like structures. The sheer number of
synthesized (plus potentially synthesizable) MOF structures requires
researchers to pursue computational techniques to screen and isolate MOF
candidates. In this demo paper, we describe our effort to leverage knowledge
graph methods to facilitate MOF prediction, discovery, and synthesis. We
present challenges and case studies about (1) construction of a MOF knowledge
graph (MOF-KG) from structured and unstructured sources and (2) leveraging the
MOF-KG for discovery of new or missing knowledge.
Comment: Accepted by the International Workshop on Knowledge Graphs and Open
Knowledge Network (OKN'22), co-located with the 28th ACM SIGKDD Conference
The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain
This paper presents a new challenging information extraction task in the
domain of materials science. We develop an annotation scheme for marking
information on experiments related to solid oxide fuel cells in scientific
publications, such as involved materials and measurement conditions. With this
paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus
consisting of 45 open-access scholarly articles annotated by domain experts. A
corpus and an inter-annotator agreement study demonstrate the complexity of the
suggested named entity recognition and slot filling tasks as well as high
annotation quality. We also present strong neural-network based models for a
variety of tasks that can be addressed on the basis of our new data set. On all
tasks, using BERT embeddings leads to large performance gains, but with
increasing task complexity, adding a recurrent neural network on top seems
beneficial. Our models will serve as competitive baselines in future work, and
analysis of their performance highlights difficult cases when modeling the data
and suggests promising research directions.
Comment: Accepted for publication at ACL 202
Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science
This paper describes a machine learning and data science pipeline for
structured information extraction from documents, implemented as a suite of
open-source tools and extensions to existing tools. It centers around a
methodology for extracting procedural information in the form of recipes,
stepwise procedures for creating an artifact (in this case synthesizing a
nanomaterial), from published scientific literature. From our overall goal of
producing recipes from free text, we derive the technical objectives of a
system consisting of pipeline stages: document acquisition and filtering,
payload extraction, recipe step extraction as a relationship extraction task,
recipe assembly, and presentation through an information retrieval interface
with question answering (QA) functionality. This system meets computational
information and knowledge management (CIKM) requirements of metadata-driven
payload extraction, named entity extraction, and relationship extraction from
text. Functional contributions described in this paper include semi-supervised
machine learning methods for PDF filtering and payload extraction tasks,
followed by structured extraction and data transformation tasks beginning with
section extraction, recipe steps as information tuples, and finally assembled
recipes. Measurable objective criteria for extraction quality include precision
and recall of recipe steps, ordering constraints, and QA accuracy, precision,
and recall. Results, key novel contributions, and significant open problems
derived from this work center around the attribution of these holistic quality
measures to specific machine learning and inference stages of the pipeline,
each with its own performance measures. The desired recipes contain identified
preconditions, material inputs, and operations, and constitute the overall
output generated by our computational information and knowledge management
(CIKM) system.
Comment: 15th International Conference on Document Analysis and Recognition
Workshops (ICDARW 2019)
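The precision and recall of recipe steps mentioned in the abstract above can be computed as in the minimal sketch below. It assumes each step is represented as a hashable tuple and that a predicted step counts as correct only on an exact match with a gold step. Both the tuple layout and the exact-match criterion are assumptions made for illustration, not the paper's evaluation protocol.

```python
def precision_recall(predicted_steps, gold_steps):
    """Exact-match precision and recall over two collections of step tuples."""
    predicted, gold = set(predicted_steps), set(gold_steps)
    true_positives = len(predicted & gold)
    # Guard against empty collections to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (operation, material) step tuples.
gold = [("dissolve", "precursor"), ("heat", "solution"), ("wash", "product")]
predicted = [("dissolve", "precursor"), ("stir", "solution")]
precision, recall = precision_recall(predicted, gold)
```

Here one of the two predicted steps matches a gold step, so precision is 1/2 and recall is 1/3.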
Recommended from our members
Supporting Story Synthesis: Bridging the Gap between Visual Analytics and Storytelling
Visual analytics usually deals with complex data and uses sophisticated algorithmic, visual, and interactive techniques. Findings of the analysis often need to be communicated to an audience that lacks visual analytics expertise. This requires analysis outcomes to be presented in simpler ways than those typically used in visual analytics systems. However, not only may analytical visualizations be too complex for the target audience, but so may the information that needs to be presented. Hence, there exists a gap on the path from obtaining analysis findings to communicating them, which involves two aspects: information complexity and display complexity. We propose a general framework where data analysis and result presentation are linked by story synthesis, in which the analyst creates and organizes story contents. Unlike previous research, where analytic findings are represented by stored display states, we treat findings as data constructs. In story synthesis, findings are selected, assembled, and arranged in views using meaningful layouts that take into account the structure of information and inherent properties of its components. We propose a workflow for applying the proposed framework in designing visual analytics systems and demonstrate the generality of the approach by applying it to two domains: social media and movement analysis.