109,209 research outputs found

    D-TERMINE: data-driven term extraction methodologies investigated

    Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication, since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation.

    One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, the results invariably differ. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so that this knowledge can be used to detect terms in other corpora as well. The first part of this PhD project was therefore dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER (Annotated Corpora for Term Extraction Research). Terms and named entities were manually identified with four different labels in twelve specialised corpora. The dataset contains corpora in three languages and four domains, amounting to more than 100k annotations over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state of the art, but also revealed that there is still considerable room for improvement, with moderate scores even for the best teams.

    The second part of this dissertation was therefore devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms: an initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns), and this list is then filtered with statistical metrics that use frequencies to measure whether a candidate term is likely to be relevant, yielding a ranked list of candidate terms. HAMLET (Hybrid, Adaptable Machine Learning Approach to Extract Terminology) was developed from this traditional approach and applies machine learning to efficiently combine more information than a rule-based approach could use, which makes it less susceptible to typical issues such as low recall on rare terms.
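    The candidate-extraction-and-ranking pipeline described above can be illustrated in a few lines. The sketch below, which assumes spaCy and its small English model, collects adjective/noun sequences ending in a noun and ranks them with a simplified, C-value-style termhood score; the POS pattern and the score are illustrative choices, not HAMLET's actual feature set.

```python
# A minimal sketch of the traditional hybrid pipeline: linguistic filtering
# (part-of-speech patterns) followed by statistical ranking. The pattern and
# termhood score are illustrative, not the dissertation's configuration.
import math
from collections import Counter

import spacy  # assumes en_core_web_sm is installed

nlp = spacy.load("en_core_web_sm")

def extract_candidates(text):
    """Count maximal ADJ/NOUN token runs that end in a noun."""
    counts = Counter()
    run = []
    for token in list(nlp(text)) + [None]:  # None acts as a final flush
        if token is not None and token.pos_ in ("ADJ", "NOUN", "PROPN"):
            run.append(token)
            continue
        while run and run[-1].pos_ == "ADJ":
            run.pop()  # a candidate term must end in a noun
        if run:
            counts[" ".join(t.lemma_.lower() for t in run)] += 1
        run = []
    return counts

def rank(counts):
    """Order candidates by a simplified C-value: log2(1 + length) * frequency."""
    return sorted(counts,
                  key=lambda t: math.log2(1 + len(t.split())) * counts[t],
                  reverse=True)

text = ("Automatic term extraction identifies terminology in corpora. "
        "Automatic term extraction relies on part-of-speech patterns.")
print(rank(extract_candidates(text))[:3])
```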
    While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system. Building on these findings, the third and final part of the project was dedicated to investigating methodologies that depart even further from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context. Two sequence-labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings (sketched below). The latter outperformed the feature-based approach and was also compared to HAMLET, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies.
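    To make the sequence-labelling formulation concrete, the sketch below tags each token of running text with a BIO label (B-Term, I-Term, O) using a bidirectional LSTM over word embeddings, in the spirit of the recurrent approach described above. It assumes TensorFlow/Keras; the vocabulary size, sequence length and layer widths are invented for illustration, not the dissertation's actual configuration.

```python
# A minimal sketch of term extraction as token-level sequence labelling with
# a recurrent network over word embeddings. All sizes are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # word-index vocabulary (assumed)
MAX_LEN = 50          # padded sentence length (assumed)
N_LABELS = 3          # B-Term, I-Term, O

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 100, mask_zero=True),  # word embeddings
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dense(N_LABELS, activation="softmax"),       # one label per token
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy shapes: X holds word indices per sentence, y holds per-token labels.
X = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, N_LABELS, size=(32, MAX_LEN))
model.fit(X, y, epochs=1, verbose=0)
```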

    Data-driven prognosis method using hybrid deep recurrent neural network

    Prognostics and health management (PHM) has attracted increasing attention in modern manufacturing systems as a means of achieving accurate predictive maintenance that reduces production downtime and enhances system safety. Remaining useful life (RUL) prediction plays a crucial role in PHM by providing direct evidence for cost-effective maintenance decisions. With the advances in sensing and communication technologies, data-driven approaches have achieved remarkable progress in machine prognostics. This paper develops a novel data-driven approach that precisely estimates the remaining useful life of machines using a hybrid deep recurrent neural network (RNN). Long short-term memory (LSTM) layers and classical neural network layers are combined in the deep structure to capture temporal information from the sequential data. Sequential sensory data from multiple sensors can be fused and used directly as input to the model, avoiding the extraction of handcrafted features required by traditional approaches, which relies heavily on prior knowledge and domain expertise. The dropout technique and a decaying learning rate are adopted in the training process of the hybrid deep RNN to increase learning efficiency. A comprehensive experimental study on a widely used prognosis dataset demonstrates the effectiveness and superior performance of the proposed approach for RUL prediction. © 2020 Elsevier B.V.
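    The architecture described above can be sketched as stacked LSTM layers feeding classical fully connected layers, trained with dropout and a decaying learning rate. The window length, sensor count and layer sizes below are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a hybrid deep RNN for RUL regression: LSTM layers for
# temporal features, dense layers on top, dropout, and learning-rate decay.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

WINDOW, N_SENSORS = 30, 14   # time steps per sample, sensor channels (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, N_SENSORS)),  # fused multi-sensor sequences
    layers.LSTM(64, return_sequences=True),     # temporal feature extraction
    layers.LSTM(32),
    layers.Dropout(0.2),                        # dropout against overfitting
    layers.Dense(32, activation="relu"),        # classical NN layers
    layers.Dense(1),                            # RUL regression output
])

# Decaying learning rate, as adopted in the paper's training process.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse")

# Dummy data with the expected shapes: sequences in, RUL values out.
X = np.random.rand(128, WINDOW, N_SENSORS).astype("float32")
y = np.random.rand(128, 1).astype("float32")
model.fit(X, y, epochs=1, verbose=0)
```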

    MeInfoText 2.0: gene methylation and cancer relation extraction from biomedical literature

    Background: DNA methylation is regarded as a potential biomarker in the diagnosis and treatment of cancer. The relations between aberrant gene methylation and cancer development have been identified by a number of recent scientific studies. In a previous work, we used co-occurrences to mine those associations and compiled the MeInfoText 1.0 database. To reduce the amount of manual curation and improve the accuracy of relation extraction, we have now developed MeInfoText 2.0, which uses a machine learning-based approach to extract gene methylation-cancer relations. Description: Two maximum entropy models are trained to predict whether aberrant gene methylation is related to any type of cancer mentioned in the literature. Evaluated with 10-fold cross-validation, the two models achieve average precision/recall rates of 94.7%/90.1% and 91.8%/90%, respectively. MeInfoText 2.0 provides the gene methylation profiles of different types of human cancer. The extracted relations with maximum probability, evidence sentences, and specific gene information are also retrievable. The database is available at http://bws.iis.sinica.edu.tw:8081/MeInfoText2/. Conclusion: The previous version, MeInfoText, was developed using association rules, whereas MeInfoText 2.0 is based on a new framework that combines machine learning, dictionary lookup and pattern matching for epigenetics information extraction. Experimental results show that MeInfoText 2.0 outperforms existing tools in many respects. To the best of our knowledge, this is the first study to use a hybrid approach to extract gene methylation-cancer relations, and the first attempt to develop a gene methylation and cancer relation corpus.
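    As a concrete illustration of such a maximum entropy classifier, the sketch below trains a logistic regression model (the binary maximum entropy model) on bag-of-words features of toy evidence sentences and evaluates it with 10-fold cross-validation. The sentences and features are invented stand-ins; MeInfoText 2.0's real system uses a far richer feature set.

```python
# A minimal sketch of a maximum entropy relation classifier: logistic
# regression over n-gram features, evaluated by 10-fold cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Label 1: the sentence asserts a methylation-cancer relation; 0: it does not.
sentences = [
    "aberrant methylation of BRCA1 was associated with breast cancer",
    "promoter hypermethylation of MLH1 silences the gene in colorectal cancer",
    "the gene was broadly expressed in normal tissue samples",
    "no methylation difference was observed between tumour and control",
] * 5  # replicate the toy set so 10-fold cross-validation has enough data
labels = [1, 1, 0, 0] * 5

maxent = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                       LogisticRegression(max_iter=1000))

scores = cross_val_score(maxent, sentences, labels, cv=10, scoring="f1")
print(f"mean F1 over 10 folds: {scores.mean():.3f}")
```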

    Machine learning-based analysis of experimental electron beams and gamma energy distributions

    The photon flux resulting from high-energy electron beam interactions with high-field systems, such as in the upcoming FACET-II experiments at SLAC National Accelerator Laboratory, may give deep insight into the electron beam's underlying dynamics at the interaction point. Extraction of this information is an intricate process, however. To demonstrate how to approach this challenge with modern methods, this paper utilizes data from simulated plasma wakefield acceleration-derived betatron radiation experiments and high-field laser-electron radiation production to determine reliable methods of reconstructing key beam and interaction properties. For these measurements, recovering the emitted 200 keV to 10 GeV photon energy spectra from two advanced spectrometers now being commissioned requires testing multiple methods to establish a pipeline from detector responses to incident electron beam information. In each case, we compare the performance of: neural networks, which detect patterns between data sets through repeated training; maximum likelihood estimation (MLE), a statistical technique used to determine unknown parameters from the distribution of observed data; and a hybrid approach combining the two. Further, for photons with energies above 30 MeV, we also examine the efficacy of QR decomposition, a matrix decomposition method. The betatron radiation and high-energy photon cases demonstrated the effectiveness of a hybrid ML-MLE approach, while the high-field electrodynamics interaction and low-energy photon cases showcased the machine learning (ML) model's efficiency in the presence of noise. As such, while all the methods have utility, the ML-MLE hybrid approach proved to be the most generalizable.
    Comment: 23 pages, 30 figures
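    For intuition about the QR-decomposition method mentioned above, the sketch below unfolds a photon spectrum from a linear detector model: the readout is taken to be s = R·phi for a response matrix R, and the least-squares estimate of phi is recovered via QR factorisation. The response matrix here is a random stand-in, not a model of either spectrometer.

```python
# A minimal sketch of linear spectrum unfolding via QR decomposition:
# solve the least-squares problem R @ phi ≈ signal by factoring R = Q @ U
# (U upper triangular) and back-substituting.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)

n_channels, n_bins = 64, 32           # detector channels, spectrum energy bins
R = rng.random((n_channels, n_bins))  # stand-in response matrix (assumed)
phi_true = rng.random(n_bins)         # "true" incident photon spectrum
signal = R @ phi_true + 0.01 * rng.standard_normal(n_channels)  # noisy readout

# Reduced QR factorisation, then solve U @ phi = Q.T @ signal.
Q, U = np.linalg.qr(R)
phi_hat = solve_triangular(U, Q.T @ signal)

print("relative error:",
      np.linalg.norm(phi_hat - phi_true) / np.linalg.norm(phi_true))
```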