713 research outputs found

    Optimization issues in machine learning of coreference resolution

    Get PDF

    A big data MapReduce framework for fault diagnosis in cloud-based manufacturing

    Get PDF
    This research develops a MapReduce framework for automatic pattern recognition in fault diagnosis by addressing the data imbalance problem in cloud-based manufacturing (CBM). Fault diagnosis in a CBM system contributes significantly to reducing product testing cost and enhances manufacturing quality. One of the major challenges facing big data analytics in cloud-based manufacturing is handling datasets that are highly imbalanced in nature, since machine learning techniques applied to such datasets yield poor classification results. The framework proposed in this research uses a hybrid approach to deal with big datasets for smarter decisions. Furthermore, we compare the performance of a radial basis function based Support Vector Machine classifier with standard techniques. Our findings suggest that the most important task in cloud-based manufacturing is to predict the effect of data errors on quality, given the highly imbalanced, unstructured datasets involved. The proposed framework is an original contribution to the body of literature: our MapReduce framework is used for fault detection by managing the data imbalance problem appropriately and relating it to the firm's profit function. The experimental results are validated using a case study of steel plate manufacturing fault diagnosis, with crucial performance metrics such as accuracy, specificity and sensitivity. A comparative study shows that the methods used in the proposed framework outperform traditional ones.
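
    A minimal sketch, not the paper's implementation: one common way to pair an RBF-kernel Support Vector Machine with class weighting on imbalanced fault data, and to report the accuracy, sensitivity and specificity metrics the abstract mentions (scikit-learn and a synthetic dataset are assumed):

    from sklearn.datasets import make_classification
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic stand-in for imbalanced fault data: ~5% faulty plates.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight="balanced" penalizes errors on the rare (fault) class more.
    clf = SVC(kernel="rbf", class_weight="balanced").fit(X_tr, y_tr)

    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
    print("sensitivity:", tp / (tp + fn))  # true-positive rate on faults
    print("specificity:", tn / (tn + fp))  # true-negative rate on non-faults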

    A Comprehensive Survey on Rare Event Prediction

    Full text link
    Rare event prediction involves identifying and forecasting events with a low probability using machine learning and data analysis. Because of imbalanced data distributions, where the frequency of common events vastly outweighs that of rare events, it requires specialized methods within each step of the machine learning pipeline, from data processing to algorithms to evaluation protocols. Predicting the occurrence of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and machine learning. This paper comprehensively reviews the current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature, highlight the challenges of predicting rare events, and suggest potential research directions that can help guide practitioners and researchers.
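
    A minimal illustration, not from the survey, of why imbalanced distributions force specialized evaluation: with roughly 1% rare events, a classifier that never predicts the rare class still scores about 99% accuracy, whereas a precision-recall-based metric exposes it (NumPy and scikit-learn are assumed):

    import numpy as np
    from sklearn.metrics import accuracy_score, average_precision_score

    rng = np.random.default_rng(0)
    y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% rare events
    y_trivial = np.zeros_like(y_true)                 # always predict "common"

    print(accuracy_score(y_true, y_trivial))          # ~0.99, yet no event found

    scores = rng.random(10_000) + 0.5 * y_true        # a weak, imperfect scorer
    print(average_precision_score(y_true, scores))    # PR-AUC reflects rare-class skill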

    Large-scale Multi-Label Text Classification for an Online News Monitoring System

    Get PDF
    This thesis provides a detailed exploration of numerous methods, some established and some novel, considered in the construction of a text-categorization system for use in a large-scale, online news-monitoring system known as PULS. PULS is an information extraction (IE) system consisting of a number of tools for automatically collecting named entities from text. The system also has access to large training corpora in the business domain, where documents are annotated with associated industry sectors. These assets are leveraged in the construction of a multi-label industry-sector classifier, whose output is displayed for new articles on the web-based front-end of PULS. Through review of background literature and direct experimentation with each stage of development, we illuminate many major challenges of multi-label classification. These challenges include: working effectively in a real-world scenario that poses time and memory restrictions; organizing and processing semi-structured, pre-annotated text corpora; handling large-scale datasets and label sets with significant class imbalances; weighing the trade-offs of different learning algorithms and feature-selection methods with respect to end-user performance; and finding meaningful evaluations for each system component. In addition to presenting the challenges associated with large-scale multi-label learning, this thesis presents a number of experiments and evaluations to determine methods that enhance overall performance. The major outcome of these experiments is a multi-stage, multi-label classifier that combines IE-based rote classification (with features extracted by the PULS system) with an array of balanced statistical classifiers. Evaluation of this multi-stage system shows improvement over a baseline classifier and, for certain evaluations, over state-of-the-art performance from the literature when tested on a commonly used corpus. Aspects of the classification method and their associated experimental results have also been published in international conference proceedings.
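
    A minimal sketch, not the PULS classifier itself: a TF-IDF, one-vs-rest linear model is a common baseline for the kind of multi-label industry-sector task described above (scikit-learn is assumed; the toy documents and sector labels are invented for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = ["oil prices rise on supply fears",
            "chipmaker reports record quarterly revenue",
            "bank expands energy-sector lending"]
    labels = [["energy"], ["technology"], ["finance", "energy"]]

    # Binarize the label sets; one binary classifier is trained per sector.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    model = make_pipeline(TfidfVectorizer(),
                          OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    model.fit(docs, Y)
    print(mlb.inverse_transform(model.predict(["refinery output hits new high"])))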

    Using a shallow linguistic kernel for drug-drug interaction extraction

    Get PDF
    A drug–drug interaction (DDI) occurs when one drug influences the level or activity of another drug. Information extraction (IE) techniques can offer health care professionals an effective way to reduce the time spent reviewing the literature for potential drug–drug interactions. Nevertheless, no complete solution had previously been proposed for the problem of extracting DDIs from biomedical texts. In this article, we study whether a machine learning-based method is appropriate for DDI extraction in biomedical texts and whether its results are superior to those obtained with our previously proposed pattern-based approach [1]. The method proposed here for DDI extraction is based on a supervised machine learning technique, specifically the shallow linguistic kernel proposed by Giuliano et al. (2006) [2]. Since no benchmark corpus was available to evaluate our approach to DDI extraction, we created the first such corpus, DrugDDI, annotated with 3169 DDIs. We performed several experiments varying the configuration parameters of the shallow linguistic kernel. The model that maximizes the F-measure was evaluated on the test data of the DrugDDI corpus, achieving a precision of 51.03%, a recall of 72.82% and an F-measure of 60.01%. To the best of our knowledge, this work proposes the first full solution for the automatic extraction of DDIs from biomedical texts. Our study confirms that the shallow linguistic kernel outperforms our previous pattern-based approach. Additionally, it is our hope that the DrugDDI corpus will allow researchers to explore new solutions to the DDI extraction problem. This study was funded by the projects MA2VICMR (S2009/TIC-1542) and MULTIMEDICA (TIN2010-20644-C03-01).
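
    The reported F-measure can be checked directly from the reported precision and recall, since F is their harmonic mean (a quick arithmetic sketch):

    # F-measure as the harmonic mean of precision and recall.
    precision, recall = 0.5103, 0.7282
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 4))  # 0.6001, matching the 60.01% reported above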

    Bottom-Up Modeling of Permissions to Reuse Residual Clinical Biospecimens and Health Data

    Full text link
    Consent forms serve as evidence of permissions granted by patients for clinical procedures. As the recognized value of biospecimens and health data increases, many clinical consent forms also seek permission from patients or their legally authorized representative to reuse residual clinical biospecimens and health data for secondary purposes, such as research. Such permissions are also granted by the government, which regulates how residual clinical biospecimens may be reused with or without consent. There is a need for increasingly capable information systems to facilitate discovery, access, and responsible reuse of residual clinical biospecimens and health data in accordance with these permissions. Semantic web technologies, especially ontologies, hold great promise as infrastructure for scalable, semantically interoperable approaches in healthcare and research. While there are many published ontologies for the biomedical domain, there is not yet ontological representation of the permissions relevant for reuse of residual clinical biospecimens and health data. The Informed Consent Ontology (ICO), originally designed for representing consent in research procedures, may already contain core classes necessary for representing clinical consent processes. However, formal evaluation is needed to make this determination and to extend the ontology to cover the new domain. This dissertation focuses on identifying the necessary information required for facilitating responsible reuse of residual clinical biospecimens and health data, and evaluating its representation within ICO. The questions guiding these studies include: 1. What is the necessary information regarding permissions for facilitating responsible reuse of residual clinical biospecimens and health data? 2. How well does the Informed Consent Ontology represent the identified information regarding permissions and obligations for reuse of residual clinical biospecimens and health data? We performed three sequential studies to answer these questions. First, we conducted a scoping review to identify regulations and norms that bear authority or give guidance over reuse of residual clinical biospecimens and health data in the US, the permissions by which reuse of residual clinical biospecimens and health data may occur, and key issues that must be considered when interpreting these regulations and norms. Second, we developed and tested an annotation scheme to identify permissions within clinical consent forms. Lastly, we used these findings as source data for bottom-up modelling and evaluation of ICO for representation of this new domain. We found considerable overlap in classes already in ICO and those necessary for representing permissions to reuse residual clinical biospecimens and health data. However, we also identified more than fifty classes that should be added to or imported into ICO. These efforts provide a foundation for comprehensively representing permissions to reuse residual clinical biospecimens and health data. Such representation fills a critical gap for developing applications which safeguard biospecimen resources and enable querying based on their permissions for use. By modeling information about permissions in an ontology, the heterogeneity of these permissions at a range of levels (e.g., federal regulations, consent forms) can be richly represented using entity-relationship links and embedded rules of inference and inheritance. 
    Furthermore, by developing this content in ICO, missing content will be added to the Open Biological and Biomedical Ontology (OBO) Foundry, enabling use alongside other widely adopted ontologies and providing a valuable resource for biospecimen and information management. These methods may also serve as a model for domain experts to interact with ontology development communities to improve ontologies and address gaps which hinder successful uptake.
    PhD dissertation, Nursing, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162937/1/eliewolf_1.pd
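
    A minimal sketch of the kind of ontology extension the dissertation describes, using the owlready2 library (an assumption; the IRI and class names here are hypothetical placeholders, not the actual additions to ICO):

    from owlready2 import Thing, get_ontology

    onto = get_ontology("http://example.org/ico-extension.owl")  # placeholder IRI

    with onto:
        # Hypothetical permission classes; real work would import and extend ICO.
        class Permission(Thing): pass
        class ResidualBiospecimenReusePermission(Permission): pass
        class HealthDataReusePermission(Permission): pass

    onto.save(file="ico_extension.owl", format="rdfxml")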