D-TERMINE : data-driven term extraction methodologies investigated
Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation.
One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, they will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well.
Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora. The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state-of-the-art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams.
Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns) and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system.
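The two-step hybrid pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not HAMLET itself: the part-of-speech pattern (adjective/noun runs ending in a noun), the frequency-ratio termhood score, and the toy corpora are all simplifying assumptions.

```python
from collections import Counter

def extract_candidates(tagged_tokens):
    """Linguistic step: keep maximal ADJ/NOUN runs that end in a NOUN."""
    candidates, run = [], []
    for word, tag in tagged_tokens + [("", "EOS")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
        else:
            # trim trailing adjectives so every candidate ends in a noun
            while run and run[-1][1] != "NOUN":
                run.pop()
            if run:
                candidates.append(" ".join(w for w, _ in run).lower())
            run = []
    return candidates

def rank_by_termhood(candidates, domain_counts, reference_counts):
    """Statistical step: rank by domain vs. reference frequency ratio."""
    scored = []
    for cand in set(candidates):
        dom = domain_counts[cand]
        ref = reference_counts.get(cand, 0)
        scored.append((dom / (ref + 1), cand))  # +1 smoothing for unseen words
    return sorted(scored, reverse=True)

tagged = [("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
          ("is", "VERB"), ("a", "DET"), ("task", "NOUN")]
cands = extract_candidates(tagged)
# "task" is frequent in the reference corpus, so it is ranked low
ranking = rank_by_termhood(cands, Counter(cands), {"task": 50})
```

A real system would of course use many more patterns and metrics; the point is only the shape of the pipeline: candidates in, ranked list out.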
Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context. Two sequential labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results.
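The sequence-labelling reformulation can be illustrated by projecting a gold term list onto running text as token-level BIO tags, i.e. the targets a CRF or RNN tagger would learn to predict. The whitespace tokenisation and plain B/I/O scheme here are simplified assumptions for the example.

```python
def bio_tags(tokens, terms):
    """Label tokens B (begin), I (inside) or O (outside) a known term."""
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # match longer terms first so nested terms do not overwrite them
    for term in sorted(terms, key=lambda t: -len(t.split())):
        parts = term.lower().split()
        for i in range(len(lowered) - len(parts) + 1):
            span = slice(i, i + len(parts))
            if lowered[span] == parts and all(t == "O" for t in tags[span]):
                tags[i] = "B"
                for j in range(i + 1, i + len(parts)):
                    tags[j] = "I"
    return tags

sentence = "Automatic term extraction labels terms in context".split()
print(bio_tags(sentence, ["automatic term extraction", "term"]))
# → ['B', 'I', 'I', 'O', 'O', 'O', 'O']
```

Note that the nested term "term" stays inside the longer annotation instead of splitting it, which mirrors one of the annotation decisions any sequence-labelling formulation has to make.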
In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies.
Exploiting domain knowledge to enhance opinion mining using a hybrid semantic knowledgebase-machine learning approach
With the fast growth of World Wide Web 2.0, a great number of opinions about a variety of products have been published on blogs, forums, and social networks. Online opinions play an important role in helping consumers make decisions about purchasing products or services. In addition, customer reviews allow companies to understand the strengths and limitations of their products and services, which aids in improving their marketing campaigns. The challenge is that online opinions are predominantly expressed in natural language text, and hence opinion mining tools are required to facilitate the effective analysis of opinions from the unstructured text and to allow for qualitative information extraction. This research presents a Hybrid Semantic Knowledgebase-Machine Learning approach for mining opinions at the domain feature level and classifying the overall opinion on a multi-point scale. The proposed approach benefits from the advantages of deploying a novel Semantic Knowledgebase approach to analyse a collection of reviews at the domain feature level and produce a set of structured information that associates the expressed opinions with specific domain features. The information in the knowledgebase is further supplemented with domain-relevant facts sourced from public Semantic datasets, and the enriched semantically-tagged information is then used to infer valuable semantic information about the domain as well as the expressed opinions on the domain features by summarising the overall opinions about the domain across multiple reviews, and by averaging the overall opinions about other cinematic features. The retrieved semantic information represents a valuable resource for training a Machine Learning classifier to predict the numerical rating of each review.
Experimental evaluation revealed that the proposed Hybrid Semantic Knowledgebase-Machine Learning approach improved the precision and recall of the extracted domain features, and hence proved suitable for producing an enriched dataset of semantic features that resulted in higher classification accuracy.
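The aggregation step described in this abstract, summarising per-feature opinions into one rating on a multi-point scale, can be sketched as follows. The feature names, the [-1, 1] polarity scores, and the 1-5 scale mapping are invented for the example; the paper's actual knowledgebase and classifier are not shown in the abstract.

```python
def overall_rating(feature_opinions, scale=5):
    """Average per-feature opinion scores in [-1, 1] onto a 1..scale rating."""
    scores = [s for opinions in feature_opinions.values() for s in opinions]
    mean = sum(scores) / len(scores)            # overall polarity in [-1, 1]
    # linearly map [-1, 1] onto the 1..scale rating points
    return round((mean + 1) / 2 * (scale - 1)) + 1

# hypothetical review: mostly positive acting/plot, mildly negative soundtrack
review = {"acting": [0.9, 0.6], "plot": [0.4], "soundtrack": [-0.2]}
print(overall_rating(review))
```

In the paper this numerical summary is one of the semantic features fed to the machine learning classifier, rather than the final prediction itself.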
Data-driven prognosis method using hybrid deep recurrent neural network
Prognostics and health management (PHM) has attracted increasing attention in modern manufacturing systems to achieve accurate predictive maintenance that reduces production downtime and enhances system safety. Remaining useful life (RUL) prediction plays a crucial role in PHM by providing direct evidence for a cost-effective maintenance decision. With the advances in sensing and communication technologies, data-driven approaches have achieved remarkable progress in machine prognostics. This paper develops a novel data-driven approach to precisely estimate the remaining useful life of machines using a hybrid deep recurrent neural network (RNN). The long short-term memory (LSTM) layers and classical neural networks are combined in the deep structure to capture the temporal information from the sequential data. Sequential sensory data from multiple sensors can be fused and used directly as input to the model. The extraction of handcrafted features that relies heavily on prior knowledge and domain expertise as required by traditional approaches is avoided. The dropout technique and decaying learning rate are adopted in the training process of the hybrid deep RNN structure to increase the learning efficiency. A comprehensive experimental study on a widely used prognosis dataset is carried out to show the outstanding effectiveness and superior performance of the proposed approach in RUL prediction. © 2020 Elsevier B.V.
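The gating mechanism that lets LSTM layers carry temporal information across a sensor sequence can be sketched in scalar form. This is a toy forward pass only: the weights are arbitrary illustrative numbers, not a trained model, and a real network would operate on vectors and stack such cells into layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM cell update; w maps each gate to (w_x, w_h, bias)."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    c = f * c_prev + i * g        # cell state mixes old memory with new input
    h = o * math.tanh(c)          # hidden state exposed to the next layer
    return h, c

w = {gate: (0.5, 0.3, 0.0) for gate in "ifog"}  # illustrative shared weights
h = c = 0.0
for reading in [0.1, 0.4, 0.9, 1.3]:   # e.g. a degradation signal from one sensor
    h, c = lstm_step(reading, h, c, w)
# in the paper's architecture, the final hidden state feeds classical
# fully connected layers that regress the remaining useful life
```

The forget gate `f` is what allows long-range degradation trends to persist in `c` across many time steps, which is why LSTMs suit RUL prediction better than plain RNNs.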
MeInfoText 2.0: gene methylation and cancer relation extraction from biomedical literature
Background: DNA methylation is regarded as a potential biomarker in the diagnosis and treatment of cancer. The relations between aberrant gene methylation and cancer development have been identified by a number of recent scientific studies. In a previous work, we used co-occurrences to mine those associations and compiled the MeInfoText 1.0 database. To reduce the amount of manual curation and improve the accuracy of relation extraction, we have now developed MeInfoText 2.0, which uses a machine learning-based approach to extract gene methylation-cancer relations.
Description: Two maximum entropy models are trained to predict if aberrant gene methylation is related to any type of cancer mentioned in the literature. After evaluation based on 10-fold cross-validation, the average precision/recall rates of the two models are 94.7/90.1 and 91.8/90%, respectively. MeInfoText 2.0 provides the gene methylation profiles of different types of human cancer. The extracted relations with maximum probability, evidence sentences, and specific gene information are also retrievable. The database is available at http://bws.iis.sinica.edu.tw:8081/MeInfoText2/.
Conclusion: The previous version, MeInfoText, was developed by using association rules, whereas MeInfoText 2.0 is based on a new framework that combines machine learning, dictionary lookup and pattern matching for epigenetics information extraction. The results of experiments show that MeInfoText 2.0 outperforms existing tools in many respects. To the best of our knowledge, this is the first study that uses a hybrid approach to extract gene methylation-cancer relations. It is also the first attempt to develop a gene methylation and cancer relation corpus.
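A maximum entropy model over binary outcomes, as used above, reduces to logistic regression trained by maximising the log-likelihood. The sketch below is a generic illustration under invented features (the abstract does not describe the paper's actual feature set): a bias, a gene-cancer sentence co-occurrence flag, and a cue-phrase flag.

```python
import math

def predict(weights, features):
    """P(relation | features) under a binary maximum entropy model."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=500, lr=0.5):
    """Stochastic gradient ascent on the log-likelihood (no regularisation)."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for features, label in data:
            err = label - predict(weights, features)
            weights = [w + lr * err * f for w, f in zip(weights, features)]
    return weights

# toy examples: (bias, methylation and cancer terms co-occur in a sentence,
#                a cue phrase such as "hypermethylated in" links them) -> related?
data = [([1, 1, 1], 1), ([1, 1, 0], 1), ([1, 0, 0], 0), ([1, 0, 1], 0)]
w = train(data)
```

In practice maxent toolkits use rich lexical and dictionary-derived features and regularised batch optimisation, but the probabilistic form of the classifier is the same.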
Machine learning-based analysis of experimental electron beams and gamma energy distributions
The photon flux resulting from high-energy electron beam interactions with high field systems, such as in the upcoming FACET-II experiments at SLAC National Accelerator Laboratory, may give deep insight into the electron beam's underlying dynamics at the interaction point. Extraction of this information is an intricate process, however. To demonstrate how to approach this challenge with modern methods, this paper utilizes data from simulated plasma wakefield acceleration-derived betatron radiation experiments and high-field laser-electron-based radiation production to determine reliable methods of reconstructing key beam and interaction properties. For these measurements, recovering the emitted 200 keV to 10 GeV photon energy spectra from two advanced spectrometers now being commissioned requires testing multiple methods to finalize a pipeline from their responses to incident electron beam information. In each case, we compare the performance of: neural networks, which detect patterns between data sets through repeated training; maximum likelihood estimation (MLE), a statistical technique used to determine unknown parameters from the distribution of observed data; and a hybrid approach combining the two. Further, in the case of photons with energies above 30 MeV, we also examine the efficacy of QR decomposition, a matrix decomposition method. The betatron radiation and the high-energy photon cases demonstrate the effectiveness of a hybrid ML-MLE approach, while the high-field electrodynamics interaction and the low-energy photon cases showcase the machine learning (ML) model's efficiency in the presence of noise. As such, while there is utility in all the methods, the ML-MLE hybrid approach proves to be the most generalizable.
Comment: 23 pages, 30 figures
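The QR decomposition step mentioned above amounts to solving a linear detector-response model R x ≈ y for the incident spectrum x. Below is a minimal pure-Python sketch via classical Gram-Schmidt on a small invented response matrix; a real spectrometer response would be far larger, noisier, and typically solved in a regularised least-squares sense.

```python
def qr_solve(R, y):
    """Solve R x = y for square R via Gram-Schmidt QR and back-substitution."""
    n = len(R)
    cols = [[R[i][j] for i in range(n)] for j in range(n)]
    q, r = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            r[k][j] = sum(q[k][i] * cols[j][i] for i in range(n))
            v = [vi - r[k][j] * qi for vi, qi in zip(v, q[k])]
        r[j][j] = sum(vi * vi for vi in v) ** 0.5
        q.append([vi / r[j][j] for vi in v])
    # with R = QR, solve the triangular system r x = q^T y
    qty = [sum(q[j][i] * y[i] for i in range(n)) for j in range(n)]
    x = [0.0] * n
    for j in reversed(range(n)):
        x[j] = (qty[j] - sum(r[j][k] * x[k] for k in range(j + 1, n))) / r[j][j]
    return x

# invented 3-bin detector response: each energy bin leaks into its neighbours
R = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
true_spectrum = [100.0, 50.0, 20.0]
signal = [sum(R[i][j] * true_spectrum[j] for j in range(3)) for i in range(3)]
recovered = qr_solve(R, signal)   # approximately equals true_spectrum
```

QR avoids forming the ill-conditioned normal equations directly, which is why it is a natural baseline against the ML and MLE reconstructions compared in the paper.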