
    Information-theoretic Interestingness Measures for Cross-Ontology Data Mining

    Community annotation of biological entities with concepts from multiple bio-ontologies has created large and growing repositories of ontology-based annotation data with embedded implicit relationships among orthogonal ontologies. Development of efficient data mining methods and metrics to mine and assess the quality of the mined relationships has not kept pace with the growth of annotation data. In this study, we present a data mining method that uses ontology-guided generalization to discover relationships across ontologies along with a new interestingness metric based on information theory. We apply our data mining algorithm and interestingness measures to datasets from the Gene Expression Database at the Mouse Genome Informatics as a preliminary proof of concept to mine relationships between developmental stages in the mouse anatomy ontology and Gene Ontology concepts (biological process, molecular function and cellular component). In addition, we present a comparison of our interestingness metric to four existing metrics. Ontology-based annotation datasets provide a valuable resource for discovery of relationships across ontologies. The use of efficient data mining methods and appropriate interestingness metrics enables the identification of high quality relationships.

    How to define co-occurrence in different domains of study?

    This position paper presents a comparative study of co-occurrences. The definition of co-occurrence shows both similarities and differences across research domains (e.g. linguistics, NLP, computer science). This paper discusses these points and addresses the methodological aspects of identifying co-occurrences in a multidisciplinary paradigm. Comment: CICLING'2018 (International Conference on Computational Linguistics and Intelligent Text Processing) - March 18 to 24, 2018 - Hanoi, Vietnam (not published in CICLING proceedings)

    Relation Extraction : A Survey

    With the advent of the Internet, a large amount of digital text is generated every day in the form of news articles, research publications, blogs, question answering forums and social media. It is important to develop techniques for extracting information automatically from these documents, as a lot of important information is hidden within them. This extracted information can be used to improve access to and management of knowledge hidden in large text corpora. Several applications, such as Question Answering and Information Retrieval, would benefit from this information. Entities like persons and organizations form the most basic unit of the information. Occurrences of entities in a sentence are often linked through well-defined relations; e.g., occurrences of person and organization in a sentence may be linked through relations such as employed at. The task of Relation Extraction (RE) is to identify such relations automatically. In this paper, we survey several important supervised, semi-supervised and unsupervised RE techniques. We also cover the paradigms of Open Information Extraction (OIE) and Distant Supervision. Finally, we describe some of the recent trends in RE techniques and possible future research directions. This survey would be useful for three kinds of readers: i) newcomers to the field who want to quickly learn about RE; ii) researchers who want to know how the various RE techniques evolved over time and what the possible future research directions are; and iii) practitioners who just need to know which RE technique works best in various settings.

    An Approach to Find Missing Values in Medical Datasets

    Mining medical datasets is a challenging problem for data mining researchers, as these datasets have several hidden challenges compared to conventional datasets. Starting from the collection of samples through field experiments and clinical trials to performing classification, there are numerous challenges at every stage in the mining process. The preprocessing phase in the mining process is itself a challenging issue when we work on medical datasets. One of the prime challenges in mining medical datasets is handling missing values, which is part of the preprocessing phase. In this paper, we address the issue of handling missing values in medical datasets consisting of categorical attribute values. The main contribution of this research is to use the proposed imputation measure to estimate and fix the missing values. We discuss a case study to demonstrate the working of the proposed measure. Comment: 7 pages, ACM Digital Library, ICEMIS September 201

    Clinical Relationships Extraction Techniques from Patient Narratives

    The Clinical E-Science Framework (CLEF) project was used to extract important information from medical texts by building a system for the purpose of clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. The system is divided into two parts. The first part is concerned with the identification of relationships between clinically important entities in the text; full parses and domain-specific grammars have been used in many approaches to extract these relationships. In the second part of the system, statistical machine learning (ML) approaches are applied to extract relationships. A corpus of oncology narratives that is hand-annotated with clinical relationships can be used to train and test a system designed and implemented with supervised ML approaches. Many features can be extracted from these texts and used by the classifier to build a model. Multiple supervised machine learning algorithms can be applied for relationship extraction. The effects of adding features, changing the size of the corpus, and changing the type of algorithm on relationship extraction are examined. Keywords: text mining; information extraction; NLP; entities; relations. Comment: 15 pages, 13 figures, 7 tables

    ECO-AMLP: A Decision Support System using an Enhanced Class Outlier with Automatic Multilayer Perceptron for Diabetes Prediction

    With advanced data analytical techniques, efforts toward more accurate decision support systems for disease prediction are on the rise. Surveys by the World Health Organization (WHO) indicate a great increase in the number of diabetic patients and related deaths each year. Early diagnosis of diabetes is a major concern among researchers and practitioners. The paper presents an application of Automatic Multilayer Perceptron, which is combined with an outlier detection method, Enhanced Class Outlier Detection using distance based algorithm, to create a prediction framework named Enhanced Class Outlier with Automatic Multi layer Perceptron (ECO-AMLP). A series of experiments is performed on the publicly available Pima Indian Diabetes Dataset (PIDD) to compare ECO-AMLP with other individual classifiers as well as ensemble based methods. The outlier technique used in our framework gave better results than other pre-processing and classification techniques. Finally, the results are compared with other state-of-the-art methods reported in the literature for diabetes prediction on PIDD, and the achieved accuracy of 88.7% bests all other reported studies.

    A Study of Recent Contributions on Information Extraction

    This paper reports on modern approaches in Information Extraction (IE) and its two main sub-tasks of Named Entity Recognition (NER) and Relation Extraction (RE). Basic concepts and the most recent approaches in this area are reviewed, which mainly include Machine Learning (ML) based approaches and the more recent trend toward Deep Learning (DL) based methods.

    AppTechMiner: Mining Applications and Techniques from Scientific Articles

    This paper presents AppTechMiner, a rule-based information extraction framework that automatically constructs a knowledge base of all application areas and problem solving techniques. Techniques include tools, methods, datasets or evaluation metrics. We also categorize individual research articles based on their application areas and the techniques proposed/improved in the article. Our system achieves high average precision (~82%) and recall (~84%) in knowledge base creation. It also performs well in application and technique assignment to an individual article (average accuracy ~66%). Finally, we present two use cases: a trivial information retrieval system and an extensive temporal analysis of the usage of techniques and application areas. At present, we demonstrate the framework for the domain of computational linguistics, but it can be easily generalized to any other field of research. Comment: JCDL 2017, 6th International Workshop on Mining Scientific Publications. arXiv admin note: substantial text overlap with arXiv:1608.0638

    Towards Utility-driven Anonymization of Transactions

    Publishing person-specific transactions in an anonymous form is increasingly required by organizations. Recent approaches ensure that potentially identifying information (e.g., a set of diagnosis codes) cannot be used to link published transactions to persons' identities, but all are limited in application because they incorporate coarse privacy requirements (e.g., protecting a certain set of m diagnosis codes requires protecting all m-sized sets), do not integrate utility requirements, and tend to explore a small portion of the solution space. In this paper, we propose a more general framework for anonymizing transactional data under specific privacy and utility requirements. We model such requirements as constraints, investigate how these constraints can be specified, and propose COAT (COnstraint-based Anonymization of Transactions), an algorithm that anonymizes transactions using a flexible hierarchy-free generalization scheme to meet the specified constraints. Experiments with benchmark datasets verify that COAT significantly outperforms the current state-of-the-art algorithm in terms of data utility, while being comparable in terms of efficiency. The effectiveness of our approach is also demonstrated in a real-world scenario, which requires disseminating a private, patient-specific transactional dataset in a way that preserves both privacy and utility in intended studies.

    CLINIQA: A Machine Intelligence Based Clinical Question Answering System

    Recent developments in the field of biomedicine have made large volumes of biomedical literature available to medical practitioners. Because of the size of this literature and the lack of efficient search strategies, medical practitioners struggle to obtain the information they need from it. Moreover, even the most sophisticated search engines of the age are not intelligent enough to interpret clinicians' questions. These facts reflect the urgent need for an information retrieval system that accepts queries from medical practitioners in natural language and returns answers quickly and efficiently. In this paper, we present an implementation of a machine intelligence based CLINIcal Question Answering system (CLINIQA) to answer medical practitioners' questions. The system was rigorously evaluated with different text mining algorithms, and the best components for the system were selected. The system makes use of the Unified Medical Language System for semantic analysis of both questions and medical documents. In addition, the system employs supervised machine learning algorithms for classifying documents, identifying the focus of the question, and selecting answers. Effective domain-specific heuristics are designed for answer ranking. The performance evaluation on one hundred clinical questions shows the effectiveness of our approach. Comment: This manuscript was submitted to IEEE Transactions on Information Technology in Biomedicine in 2007 and was in second revision when it was withdrawn, as I moved to industry and could not find enough time to revise it. I am uploading it here for anyone interested in conventional ML based approach to NL