The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures
Materials science literature contains millions of materials synthesis
procedures described in unstructured natural language text. Large-scale
analysis of these synthesis procedures would facilitate deeper scientific
understanding of materials synthesis and enable automated synthesis planning.
Such analysis requires extracting structured representations of synthesis
procedures from the raw text as a first step. To facilitate the training and
evaluation of synthesis extraction models, we introduce a dataset of 230
synthesis procedures annotated by domain experts with labeled graphs that
express the semantics of the synthesis sentences. The nodes in this graph are
synthesis operations and their typed arguments, and labeled edges specify
relations between the nodes. We describe this new resource in detail and
highlight some specific challenges to annotating scientific text with shallow
semantic structure. We make the corpus available to the community to promote
further research and development of scientific information extraction systems.
Comment: Accepted as a long paper at the Linguistic Annotation Workshop (LAW) at ACL 2019.
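The labeled-graph representation described above can be sketched as a small data structure: nodes for operations and their typed arguments, labeled edges for the relations between them. The node labels and relation names below are illustrative assumptions, not the corpus's actual annotation schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    text: str   # span of source text, e.g. "heat"
    label: str  # node type, e.g. "Operation" or "Material" (illustrative)

@dataclass
class SynthesisGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_id, dst_id, relation)

    def add_node(self, text, label):
        node = Node(len(self.nodes), text, label)
        self.nodes.append(node)
        return node.id

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

# "Heat the mixture at 500 C" -> one operation node with typed arguments
g = SynthesisGraph()
op = g.add_node("heat", "Operation")
mat = g.add_node("mixture", "Material")
cond = g.add_node("500 C", "Condition")
g.add_edge(op, mat, "acts_on")        # relation names are hypothetical
g.add_edge(op, cond, "condition_of")
```

A full sentence-level annotation is then just the set of such graphs, one per synthesis step.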
Di-hadron correlations at ISR and RHIC energies
The structure of hadron-hadron correlations is investigated in proton-proton
collisions. We focus on the transmission of the initial transverse momenta of
partons ("intrinsic") to the hadron-hadron correlations. Values of the
intrinsic transverse momentum obtained from experimental correlations are
compared to the results of a model with partially randomized parton transverse
momenta at ISR and RHIC energies. Procedures for extracting the correlations
from data are discussed.
Comment: Modifications in the text and in the title. 12 pages, 4 figures.
Are e-readers suitable tools for scholarly work?
This paper aims to offer insights into the usability, acceptance and
limitations of e-readers with regard to the specific requirements of scholarly
text work. To fit into the academic workflow, non-linear reading, bookmarking,
commenting, extracting text, and the integration of non-textual elements must be
supported. A group of social science students were questioned about their
experiences with electronic publications for study purposes. This same group
executed several text-related tasks with the digitized material presented to
them in two different file formats on four different e-readers. Their
performances were subsequently evaluated in detail by means of frequency
analyses. Findings - e-Publications have made advances in the academic world;
however, e-readers do not yet fit seamlessly into the established chain of
scholarly text-processing, particularly in how readers work with material
during and after reading. Our tests revealed major deficiencies in these
techniques. With a small number of participants (n=26), qualitative insights
can be obtained, but not representative results. Further testing with
participants from various disciplines and of varying academic status is
required to arrive at more broadly applicable results. Practical implications
- Our test results help to
optimize file conversion routines for scholarly texts. We evaluated our data on
the basis of descriptive statistics and abstained from any statistical
significance test. The usability test of e-readers in a scientific context
aligns with both studies on the prevalence of e-books in the sciences and
technical test reports of portable reading devices. Still, it takes a
distinctive angle in focusing on the characteristics and procedures of textual
work in the social sciences and measures the usability of e-readers and
file-features against these standards.
Comment: 22 pages, 6 figures, accepted for publication in Online Information Review.
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), including exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues. The initialization
can lead to variability depending on the machine learning algorithm.
Furthermore, the distortions can be misleading when regarding cluster geometry.
Amongst the causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology since similar procedures have
different terms. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, the factorization, and the clustering algorithms that are
directly or indirectly related to the reviewed works.
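The initialization variability the survey highlights is easy to reproduce even on a toy example. The following is a minimal 1-D k-means sketch (Lloyd's algorithm, not any of the surveyed implementations): the same data converges to two different partitions, with different distortions, depending only on the starting centroids.

```python
def kmeans_1d(points, centroids, iters=20):
    """Plain Lloyd's algorithm on 1-D data; returns (clusters, inertia)."""
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    inertia = sum((p - centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, round(inertia, 2)

data = [0, 1, 5, 6, 10, 11]
run_a = kmeans_1d(data, [0.0, 11.0])  # -> ([[0, 1, 5], [6, 10, 11]], 28.0)
run_b = kmeans_1d(data, [0.0, 1.0])   # -> ([[0, 1], [5, 6, 10, 11]], 26.5)
```

Both runs are stable local optima of the same objective; without a fixed seed or a multi-restart policy, reruns of such algorithms can legitimately report different clusterings, which is exactly the reproducibility problem discussed above.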
EXACT2: the semantics of biomedical protocols
Background: The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for better representation of biomedical protocols to enable other agents (human or machine) to reproduce results. A framework that ensures that all information required for the replication of experimental protocols is captured is essential to achieve reproducibility.
Methods: We have developed the ontology EXACT2 (EXperimental ACTions), designed to capture the full semantics of biomedical protocols required for their reproducibility. To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information, we utilized text-mining tools to translate the protocols into a machine-amenable format. We verified the utility of EXACT2 through the successful processing of previously 'unseen' protocols (not used for the construction of EXACT2).
Results: The paper reports on a fundamentally new version of EXACT2 that supports the semantically defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text-mining use case, in which EXACT2 is used as a reference model for text-mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine-amenable format is proposed.
Conclusions: The EXACT2 ontology is sufficient to record, in a machine-processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions and can be used by various computer applications. It can serve as a reference model for the translation of biomedical protocols in natural language into a semantically defined format.
This work has been partially funded by the Brunel University BRIEF award and a grant from Occams Resources.
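As a rough illustration of what a "machine amenable" protocol step might look like, an experimental action can be encoded as a typed record and checked for completeness before a protocol is accepted. The field names here are assumptions made for the example, not the actual EXACT2 ontology terms.

```python
def missing_fields(step, required=("action", "object", "parameters")):
    """Return the required fields absent from a protocol step.
    The field names are illustrative, not the EXACT2 schema."""
    return [f for f in required if f not in step]

# A fully specified step: an action, the object it acts on, and its parameters.
step = {
    "action": "centrifuge",
    "object": "cell suspension",
    "parameters": {"speed": "5000 rpm", "duration": "10 min", "temperature": "4 C"},
}
# An underspecified step, as often found in published protocols.
incomplete = {"action": "incubate"}

ok = missing_fields(step)            # -> []
gaps = missing_fields(incomplete)    # -> ['object', 'parameters']
```

The point of such a check mirrors the framework described above: a protocol is machine-reproducible only when every action carries all the information another agent needs to repeat it.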
Using Multi-granular Fuzzy Linguistic Modelling Methods to Represent Social Networks Related Information in an Organized Way
Social networks are the preferred means for experts to share their knowledge and provide information. Therefore, they are one of the best sources for obtaining data that can serve a wide range of purposes: for instance, determining social needs, identifying problems, or gathering opinions about certain topics. Nevertheless, this kind of information is difficult for a computational system to interpret, because the text is presented in free form and the information it represents is imprecise. In this paper, a novel method for extracting information from social networks and representing it in a fuzzy ontology is presented. Sentiment analysis procedures are used to extract information from free text. Moreover, multi-granular fuzzy linguistic modelling methods are used for converting the information into the most suitable representation means.
This work has been supported by the 'Juan de la Cierva Incorporación' grant from the Spanish Ministry of Economy and Competitiveness and by the grant from the FEDER funds provided by the Spanish Ministry of Economy and Competitiveness (No. TIN2016-75850-R).
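A minimal sketch of the multi-granular idea: the same numeric sentiment score can be mapped onto linguistic term sets of different granularities using triangular membership functions. The label sets below are invented for illustration and are not the paper's actual term sets.

```python
def triangular(x, center, width):
    """Triangular membership function peaking at `center`, zero beyond `width`."""
    return max(0.0, 1.0 - abs(x - center) / width)

def fuzzify(score, labels):
    """Map a score in [0, 1] to label->membership for one term set,
    with label centers spread evenly over [0, 1]."""
    width = 1.0 / (len(labels) - 1)
    return {lab: round(triangular(score, i * width, width), 2)
            for i, lab in enumerate(labels)}

coarse = ["negative", "neutral", "positive"]                        # 3 terms
fine = ["very_neg", "negative", "neutral", "positive", "very_pos"]  # 5 terms

c = fuzzify(0.7, coarse)  # -> {'negative': 0.0, 'neutral': 0.6, 'positive': 0.4}
f = fuzzify(0.7, fine)    # 'positive' dominates at the finer granularity
```

Choosing among term sets of different granularities, and translating between them, is the kind of decision the multi-granular modelling methods in the paper address.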
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather the large amounts of structured data continuously generated
and disseminated by Web 2.0, Social Media and Online Social Network users,
which offers unprecedented opportunities to analyze human behavior at a very
large scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain in other domains.
Comment: Knowledge-Based Systems.
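The wrapper-style extraction such surveys cover can be illustrated with a tiny stdlib-only sketch that turns repeated HTML blocks into structured records. The page structure and class names below are invented for the example.

```python
from html.parser import HTMLParser

class RecordExtractor(HTMLParser):
    """Collect {name, price} records from <div class="item"> blocks."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None    # which field the current text belongs to
        self._current = {}    # record being assembled

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "item":
            self._current = {}                       # start a new record
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]             # capture the next text node

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current:
            self.records.append(self._current)       # record complete
            self._current = {}

html = ('<div class="item"><span class="name">Widget</span>'
        '<span class="price">9.99</span></div>'
        '<div class="item"><span class="name">Gadget</span>'
        '<span class="price">19.50</span></div>')
parser = RecordExtractor()
parser.feed(html)
# parser.records -> [{'name': 'Widget', 'price': '9.99'},
#                    {'name': 'Gadget', 'price': '19.50'}]
```

Real wrappers add robustness (XPath/CSS selectors, layout-change detection), but the core idea is the same: a hand-built or induced mapping from recurring page structure to typed records.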