13,215 research outputs found
Information-theoretic Interestingness Measures for Cross-Ontology Data Mining
Community annotation of biological entities with concepts from multiple
bio-ontologies has created large and growing repositories of ontology-based
annotation data with embedded implicit relationships among orthogonal
ontologies. Development of efficient data mining methods and metrics to mine
and assess the quality of the mined relationships has not kept pace with the
growth of annotation data. In this study, we present a data mining method that
uses ontology-guided generalization to discover relationships across ontologies
along with a new interestingness metric based on information theory. We apply
our data mining algorithm and interestingness measures to datasets from the
Gene Expression Database at the Mouse Genome Informatics as a preliminary proof
of concept to mine relationships between developmental stages in the mouse
anatomy ontology and Gene Ontology concepts (biological process, molecular
function and cellular component). In addition, we present a comparison of our
interestingness metric to four existing metrics. Ontology-based annotation
datasets provide a valuable resource for discovery of relationships across
ontologies. The use of efficient data mining methods and appropriate
interestingness metrics enables the identification of high-quality
relationships.
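As a rough illustration of what an information-theoretic score over cross-ontology annotations can look like, the sketch below computes pointwise mutual information (PMI) between concepts drawn from two ontologies. PMI is a generic measure swapped in here for illustration; it is not the interestingness metric proposed in the paper, and the function name `pmi_scores` and the input layout are assumptions.

```python
import math
from collections import Counter
from itertools import product


def pmi_scores(annotations):
    """Score concept pairs across two ontologies with pointwise mutual information.

    `annotations` is a list with one entry per annotated entity; each entry is a
    pair (concepts_a, concepts_b) of sets of concept labels (hypothetical layout).
    Only pairs that co-occur on at least one entity receive a score.
    """
    n = len(annotations)
    count_a, count_b, count_ab = Counter(), Counter(), Counter()
    for concepts_a, concepts_b in annotations:
        count_a.update(concepts_a)
        count_b.update(concepts_b)
        count_ab.update(product(concepts_a, concepts_b))

    scores = {}
    for (a, b), joint in count_ab.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
        # as relative frequencies over the n annotated entities.
        scores[(a, b)] = math.log2((joint * n) / (count_a[a] * count_b[b]))
    return scores
```

A positive score means the two concepts are annotated together more often than independence would predict; a score near zero suggests no association.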
How to define co-occurrence in different domains of study?
This position paper presents a comparative study of co-occurrences. Some
similarities and differences in the definition exist depending on the research
domain (e.g. linguistics, NLP, computer science). This paper discusses these
points, and deals with the methodological aspects in order to identify
co-occurrences in a multidisciplinary paradigm.
Comment: CICLING'2018 (International Conference on Computational Linguistics
and Intelligent Text Processing) - March 18 to 24, 2018 - Hanoi, Vietnam (not
published in CICLING proceedings)
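One common operational definition of co-occurrence in NLP counts two tokens as co-occurring when they appear within a fixed-size window of each other. The sketch below shows that single definition only; the window size, the unordered pairing, and the function name `window_cooccurrences` are assumptions, and other domains may define co-occurrence differently (for example, at sentence or document level).

```python
from collections import Counter


def window_cooccurrences(tokens, window=2):
    """Count unordered token pairs that appear within `window` positions.

    Each token is paired with the next `window` tokens to its right, so every
    pair of positions is counted once. Pairs are stored in sorted order, which
    makes ("a", "b") and ("b", "a") the same key.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            counts[tuple(sorted((left, right)))] += 1
    return counts
```

Widening `window` moves the definition from strict adjacency toward sentence-level co-occurrence, which is one of the places where domain conventions tend to diverge.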
Relation Extraction: A Survey
With the advent of the Internet, a large amount of digital text is generated
every day in the form of news articles, research publications, blogs, question
answering forums and social media. It is important to develop techniques for
extracting information automatically from these documents, as a lot of important
information is hidden within them. This extracted information can be used to
improve access and management of knowledge hidden in large text corpora.
Several applications, such as Question Answering and Information Retrieval, would
benefit from this information. Entities, like persons and organizations, form
the most basic unit of the information. Occurrences of entities in a sentence
are often linked through well-defined relations; e.g., occurrences of person
and organization in a sentence may be linked through relations such as employed
at. The task of Relation Extraction (RE) is to identify such relations
automatically. In this paper, we survey several important supervised,
semi-supervised and unsupervised RE techniques. We also cover the paradigms of
Open Information Extraction (OIE) and Distant Supervision. Finally, we describe
some of the recent trends in the RE techniques and possible future research
directions. This survey would be useful for three kinds of readers: i)
newcomers to the field who want to quickly learn about RE; ii) researchers who
want to know how the various RE techniques evolved over time and what the
possible future research directions are; and iii) practitioners who just need to
know which RE technique works best in various settings.
An Approach to Find Missing Values in Medical Datasets
Mining medical datasets is a challenging problem for data mining
researchers, as these datasets have several hidden challenges compared to
conventional datasets. Starting from the collection of samples through field
experiments and clinical trials to performing classification, there are numerous
challenges at every stage in the mining process. The preprocessing phase in the
mining process is itself challenging when we work on medical datasets.
One of the prime challenges in mining medical datasets is handling missing
values, which is part of the preprocessing phase. In this paper, we address the
issue of handling missing values in medical datasets consisting of categorical
attribute values. The main contribution of this research is to use the proposed
imputation measure to estimate and fix the missing values. We discuss a case
study to demonstrate the working of the proposed measure.
Comment: 7 pages, ACM Digital Library, ICEMIS September 201
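For context, a simple baseline for filling missing categorical values is mode imputation: each gap is replaced with the most frequent observed value of that attribute. The sketch below shows only this baseline, not the imputation measure proposed in the paper; the function name `impute_mode`, the record layout, and the use of `None` as the missing marker are assumptions.

```python
from collections import Counter


def impute_mode(records, missing=None):
    """Fill missing categorical values with the per-attribute mode.

    `records` is a list of dicts that share the same keys (hypothetical layout).
    A value equal to `missing` is treated as absent. Attributes with no
    observed values at all are left unchanged.
    """
    columns = list(records[0].keys()) if records else []

    # Most frequent observed value for each attribute.
    modes = {}
    for col in columns:
        observed = [row[col] for row in records if row[col] != missing]
        if observed:
            modes[col] = Counter(observed).most_common(1)[0][0]

    # Return new rows and leave the input untouched.
    return [
        {
            col: (modes.get(col, missing) if row[col] == missing else row[col])
            for col in columns
        }
        for row in records
    ]
```

Mode imputation ignores relationships between attributes, which is the kind of limitation a dedicated imputation measure would aim to address.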
Clinical Relationships Extraction Techniques from Patient Narratives
The Clinical E-Science Framework (CLEF) project was used to extract important
information from medical texts by building a system for the purpose of clinical
research, evidence-based healthcare and genotype-meets-phenotype informatics.
The system is divided into two parts. The first part concerns the identification
of relationships between clinically important entities in the text. Full
parses and domain-specific grammars have been used in many approaches to
extract the relationships. In the second part of the system, statistical machine
learning (ML) approaches are applied to extract relationships. A corpus of
oncology narratives that is hand-annotated with clinical relationships can be used
to train and test a system that has been designed and implemented using supervised
machine learning (ML) approaches. Many features can be extracted from these
texts and used by the classifier to build a model. Multiple supervised
machine learning algorithms can be applied for relationship extraction. The effects
of adding features, changing the size of the corpus, and changing the type
of algorithm on relationship extraction are examined. Keywords: Text
mining; information extraction; NLP; entities; and relations.
Comment: 15 pages, 13 figures, 7 tables
ECO-AMLP: A Decision Support System using an Enhanced Class Outlier with Automatic Multilayer Perceptron for Diabetes Prediction
With advanced data analytical techniques, efforts for more accurate decision
support systems for disease prediction are on the rise. Surveys by the World Health
Organization (WHO) indicate a great increase in the number of diabetic patients and
related deaths each year. Early diagnosis of diabetes is a major concern among
researchers and practitioners. The paper presents an application of
Automatic Multilayer Perceptron, which is combined with an
outlier detection method, Enhanced Class Outlier Detection using a
distance-based algorithm, to create a prediction framework named Enhanced
Class Outlier with Automatic Multilayer Perceptron (ECO-AMLP). A series of
experiments are performed on the publicly available Pima Indian Diabetes Dataset (PIDD) to
compare ECO-AMLP with other individual classifiers as well as ensemble-based
methods. The outlier technique used in our framework gave better results
compared to other pre-processing and classification techniques. Finally, the
results are compared with other state-of-the-art methods reported in the literature
for diabetes prediction on PIDD, and the achieved accuracy of 88.7% bests all other
reported studies.
A Study of Recent Contributions on Information Extraction
This paper reports on modern approaches in Information Extraction (IE) and
its two main sub-tasks of Named Entity Recognition (NER) and Relation
Extraction (RE). Basic concepts and the most recent approaches in this area are
reviewed, which mainly include Machine Learning (ML) based approaches and the
more recent trend toward Deep Learning (DL) based methods.
AppTechMiner: Mining Applications and Techniques from Scientific Articles
This paper presents AppTechMiner, a rule-based information extraction
framework that automatically constructs a knowledge base of all application
areas and problem solving techniques. Techniques include tools, methods,
datasets or evaluation metrics. We also categorize individual research articles
based on their application areas and the techniques proposed/improved in the
article. Our system achieves high average precision (~82%) and recall (~84%) in
knowledge base creation. It also performs well in application and technique
assignment to an individual article (average accuracy ~66%). In the end, we
further present two use cases presenting a trivial information retrieval system
and an extensive temporal analysis of the usage of techniques and application
areas. At present, we demonstrate the framework for the domain of computational
linguistics but this can be easily generalized to any other field of research.
Comment: JCDL 2017, 6th International Workshop on Mining Scientific
Publications. arXiv admin note: substantial text overlap with
arXiv:1608.0638
Towards Utility-driven Anonymization of Transactions
Publishing person-specific transactions in an anonymous form is increasingly
required by organizations. Recent approaches ensure that potentially
identifying information (e.g., a set of diagnosis codes) cannot be used to link
published transactions to persons' identities, but all are limited in
application because they incorporate coarse privacy requirements (e.g.,
protecting a certain set of m diagnosis codes requires protecting all m-sized
sets), do not integrate utility requirements, and tend to explore a small
portion of the solution space. In this paper, we propose a more general
framework for anonymizing transactional data under specific privacy and utility
requirements. We model such requirements as constraints, investigate how these
constraints can be specified, and propose COAT (COnstraint-based Anonymization
of Transactions), an algorithm that anonymizes transactions using a flexible
hierarchy-free generalization scheme to meet the specified constraints.
Experiments with benchmark datasets verify that COAT significantly outperforms
the current state-of-the-art algorithm in terms of data utility, while being
comparable in terms of efficiency. The effectiveness of our approach is also
demonstrated in a real-world scenario, which requires disseminating a private,
patient-specific transactional dataset in a way that preserves both privacy and
utility in intended studies.
CLINIQA: A Machine Intelligence Based Clinical Question Answering System
The recent developments in the field of biomedicine have made large volumes
of biomedical literature available to the medical practitioners. Due to the
large size and lack of efficient searching strategies, medical practitioners
struggle to obtain necessary information available in the biomedical
literature. Moreover, the most sophisticated search engines of the age are not
intelligent enough to interpret the clinicians' questions. These facts reflect
the urgent need for an information retrieval system that accepts queries
from medical practitioners in natural language and returns the answers quickly
and efficiently. In this paper, we present an implementation of a machine
intelligence based CLINIcal Question Answering system (CLINIQA) to answer
medical practitioners' questions. The system was rigorously evaluated on
different text mining algorithms and the best components for the system were
selected. The system makes use of the Unified Medical Language System for semantic
analysis of both questions and medical documents. In addition, the system
employs supervised machine learning algorithms for classification of the
documents, identifying the focus of the question and answer selection.
Effective domain-specific heuristics are designed for answer ranking. The
performance evaluation on one hundred clinical questions shows the effectiveness of
our approach.
Comment: This manuscript was submitted to IEEE Transactions on Information
Technology in Biomedicine in 2007 and was in second revision when it was
withdrawn, as I moved to industry and could not get enough time to revise it.
I am uploading it here for anyone interested in a conventional ML-based
approach to NL