Process Discovery using Classification Tree Hidden Semi-Markov Model
Various ubiquitous information systems are used to monitor, exchange, and
collect information. These systems generate massive amounts of event-sequence
logs that may help us understand underlying phenomena. By analyzing these
logs, we can learn process models that describe system procedures, predict the
development of the system, or check whether observed changes are expected. In
this paper, we consider a novel technique that models these sequences of
events in a temporal-probabilistic manner. Specifically, we propose a
probabilistic process model that combines a hidden semi-Markov model with
classification-tree learning. Our experimental results show that the proposed
approach can answer questions of the form "What are the most frequent
sequences of system dynamics relevant to a given sequence of observable
events?" For example, "Given a series of medical treatments, what are the most
relevant patterns of change in a patient's health condition over time?"
Comment: 2016 IEEE International Conference on Information Reuse and Integration
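For a plain hidden Markov model (not the paper's classification-tree hidden semi-Markov model), the question above — the most likely sequence of hidden dynamics given observable events — is answered by the Viterbi algorithm. A minimal sketch; all states, observations, and probabilities below are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    # best[t][s] = (probability, previous state) of the best path ending in s at t
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({
            s: max(
                ((best[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                 for r in states),
                key=lambda pr: pr[0],
            )
            for s in states
        })
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical hidden health states observed through symptoms.
path = viterbi(
    obs=("normal", "cold", "dizzy"),
    states=("Healthy", "Fever"),
    start_p={"Healthy": 0.6, "Fever": 0.4},
    trans_p={"Healthy": {"Healthy": 0.7, "Fever": 0.3},
             "Fever": {"Healthy": 0.4, "Fever": 0.6}},
    emit_p={"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
            "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}},
)
```

A semi-Markov model additionally models how long each hidden state persists, which this sketch omits.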
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
The amount of text generated every day is increasing dramatically. This
tremendous volume of mostly unstructured text cannot simply be processed and
understood by computers. Therefore, efficient and effective techniques and
algorithms are required to discover useful patterns. Text mining, the task of
extracting meaningful information from text, has gained significant attention
in recent years. In this paper, we describe several of the most fundamental
text mining tasks and techniques, including text pre-processing,
classification, and clustering. Additionally, we briefly discuss text mining
in the biomedical and health care domains.
Comment: some reference formats have been updated
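The pre-processing and weighting steps the survey covers can be illustrated with a minimal tokenize-and-TF-IDF sketch (a generic textbook construction, not any specific method from the survey; the length cutoff stands in for proper stopword removal):

```python
import math
import re
from collections import Counter

def preprocess(text):
    """Lowercase, tokenize, and drop very short tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 2]

def tfidf(docs):
    """One TF-IDF weight vector (a dict) per document."""
    tokens = [preprocess(d) for d in docs]
    n = len(docs)
    df = Counter(t for doc in tokens for t in set(doc))  # document frequency
    vectors = []
    for doc in tokens:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vecs = tfidf(docs)
```

Terms that occur in every document (here "the") get weight zero, which is the point of the IDF factor: they carry no discriminative information for classification or clustering.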
Geographical Hidden Markov Tree for Flood Extent Mapping (With Proof Appendix)
Flood extent mapping plays a crucial role in disaster management and national
water forecasting. Unfortunately, traditional classification methods are often
hampered by noise, obstacles, and heterogeneity in spectral features, as well
as by implicit anisotropic spatial dependency across class labels. In this
paper, we propose the geographical hidden Markov tree, a probabilistic
graphical model that generalizes the common hidden Markov model from a
one-dimensional sequence to a two-dimensional map. Anisotropic spatial
dependency is incorporated in the hidden class layer with a reverse tree
structure. We also investigate computational algorithms for reverse tree
construction, model parameter learning, and class inference. Extensive
evaluations on both synthetic and real-world datasets show that the proposed
model outperforms multiple baselines in flood mapping, and that our algorithms
scale to large data sizes.
PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
In the big data era, scalability has become a crucial requirement for any
useful computational model. Probabilistic graphical models are very useful for
mining and discovering data insights, but they are not scalable enough to be
suitable for big data problems. Bayesian networks particularly demonstrate
this limitation when the data is represented using few random variables, each
of which has a massive set of values. With hierarchical data (data arranged in
a treelike structure with several levels), one would expect to see hundreds of
thousands or millions of values distributed over even just a small number of
levels. When modeling this kind of hierarchical data across large data sets,
Bayesian networks become infeasible for representing the probability
distributions, for the following reasons: i) each level represents a single
random variable with hundreds of thousands of values; ii) the number of levels
is usually small, so there are also few random variables; and iii) the
structure of the network is predefined, since the dependency is modeled
top-down from each parent to each of its child nodes, so the network contains
a single linear path of random variables from each parent to each child node.
In this paper we present a scalable probabilistic graphical model that
overcomes these limitations for massive hierarchical data. We believe the
proposed model leads to an easily scalable, more readable, and expressive
implementation for problems that require probability-based solutions over
massive amounts of hierarchical data. We successfully applied this model to
two different challenging problems on massive hierarchical data sets from
different domains, namely bioinformatics and latent semantic discovery over
search logs.
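The top-down parent-to-child dependency the abstract describes can be estimated from data by simple co-occurrence counting over root-to-leaf paths. A frequency-count sketch (not the paper's PGMHD model; the search-log paths below are hypothetical):

```python
from collections import defaultdict

def fit_level_model(paths):
    """Estimate P(child | parent) per hierarchy level from root-to-leaf paths."""
    pair = defaultdict(int)    # (level, parent, child) -> co-occurrence count
    parent = defaultdict(int)  # (level, parent) -> count
    for path in paths:
        for level in range(len(path) - 1):
            pair[(level, path[level], path[level + 1])] += 1
            parent[(level, path[level])] += 1

    def prob(level, p, c):
        # Conditional probability of child c given parent p at this level.
        return pair[(level, p, c)] / parent[(level, p)] if parent[(level, p)] else 0.0

    return prob

# Hypothetical 3-level search-log hierarchy: category -> query -> clicked item.
paths = [
    ("electronics", "phone", "item_a"),
    ("electronics", "phone", "item_b"),
    ("electronics", "laptop", "item_c"),
]
p = fit_level_model(paths)
```

Because each level is modeled only through its parent level, the storage cost grows with the number of observed parent-child pairs rather than with the full joint distribution, which is the scalability argument sketched in the abstract.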
Learning the Dimensionality of Hidden Variables
A serious problem in learning probabilistic models is the presence of hidden
variables. These variables are not observed, yet interact with several of the
observed variables. Detecting hidden variables poses two problems: determining
the relations to other variables in the model and determining the number of
states of the hidden variable. In this paper, we address the latter problem in
the context of Bayesian networks. We describe an approach that utilizes a
score-based agglomerative state-clustering approach. As we show, this allows us
to efficiently evaluate models with a range of cardinalities for the hidden
variable. We show how to extend this procedure to deal with multiple
interacting hidden variables. We demonstrate the effectiveness of this approach
by evaluating it on synthetic and real-life data. We show that our approach
learns models with hidden variables that generalize better and have better
structure than previous approaches.
Comment: Appears in Proceedings of the Seventeenth Conference on Uncertainty
in Artificial Intelligence (UAI 2001)
A semi-supervised deep learning algorithm for abnormal EEG identification
Systems that can automatically analyze EEG signals can aid neurologists by
reducing heavy workloads and delays. However, such systems first need to be
trained on a labeled dataset. While large corpora of EEG data exist, only a
fraction are labeled. Hand-labeling data increases the workload of the very
neurologists we aim to aid. This paper proposes a semi-supervised learning
workflow that can not only extract meaningful information from large unlabeled
EEG datasets but also make predictions with minimal supervision, using labeled
datasets as small as 5 examples.
Comment: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract
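The general shape of such a workflow is self-training: fit on the few labels, pseudo-label the most confident unlabeled points, and refit. A toy sketch using a nearest-centroid classifier on 1-D features (not the paper's deep model; all data below is invented):

```python
def nearest_centroid_self_train(labeled, unlabeled, rounds=3):
    """Self-training: fit class centroids on labeled points, pseudo-label the
    most confident unlabeled point each round, and refit."""
    labeled = dict(labeled)   # feature value -> class label
    pool = list(unlabeled)
    for _ in range(rounds):
        # Fit one centroid per class from the current labeled set.
        centroids = {}
        for cls in set(labeled.values()):
            pts = [x for x, c in labeled.items() if c == cls]
            centroids[cls] = sum(pts) / len(pts)
        if not pool:
            break
        # "Most confident" here = closest to any centroid.
        x = min(pool, key=lambda v: min(abs(v - m) for m in centroids.values()))
        labeled[x] = min(centroids, key=lambda c: abs(x - centroids[c]))
        pool.remove(x)
    return labeled

# Two labeled "EEG feature" values, four unlabeled ones.
result = nearest_centroid_self_train(
    labeled={0.0: "normal", 10.0: "abnormal"},
    unlabeled=[1.0, 2.0, 9.0, 8.0],
    rounds=4,
)
```

The labeled set grows from the easy (high-confidence) points inward, which is how a small seed set can bootstrap predictions over a much larger unlabeled pool.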
Combining complex networks and data mining: why and how
The increasing power of computer technology does not dispense with the need
to extract meaningful information out of data sets of ever-growing size, and
indeed typically exacerbates the complexity of this task. To tackle this
general problem, two methods have emerged, at chronologically different times,
that are now commonly used in the scientific community: data mining and complex
network theory. Not only do complex network analysis and data mining share the
same general goal, that of extracting information from complex systems to
ultimately create a new compact quantifiable representation, but they also
often address similar problems. In the face of this, a surprisingly low
number of researchers turn out to resort to both methodologies. One may then be
tempted to conclude that these two fields are either largely redundant or
totally antithetic. The starting point of this review is that this state of
affairs should be put down to contingent rather than conceptual differences,
and that these two fields can in fact advantageously be used in a synergistic
manner. An overview of both fields is first provided, some fundamental concepts
of which are illustrated. A variety of contexts in which complex network theory
and data mining have been used in a synergistic manner are then presented.
Contexts in which the appropriate integration of complex network metrics can
lead to improved classification rates with respect to classical data mining
algorithms and, conversely, contexts in which data mining can be used to tackle
important issues in complex network theory applications are illustrated.
Finally, ways to achieve a tighter integration between complex networks and
data mining, and open lines of research are discussed.
Comment: 58 pages, 19 figures
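One synergy of the kind the review describes is feeding complex-network metrics into a data-mining classifier as node features. A stdlib sketch of two standard metrics, degree and local clustering coefficient, on a toy undirected graph (the graph is invented for illustration):

```python
def degree(adj, node):
    """Number of neighbours of a node."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))

# Toy undirected graph: a triangle a-b-c plus a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
# Per-node feature vector (degree, clustering) for a downstream classifier.
features = {n: (degree(adj, n), clustering_coefficient(adj, n)) for n in adj}
```

In the contexts the review surveys, vectors like these are appended to conventional attribute features before training a classifier, which is where the reported classification-rate improvements come from.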
Automatic Keyword Extraction for Text Summarization: A Survey
In recent times, data is growing rapidly in every domain, such as news, social
media, banking, and education. Given this excess of data, there is a need for
an automatic summarizer capable of summarizing data, especially the textual
data in original documents, without losing any critical information. Text
summarization has emerged as an important research area in the recent past. In
this regard, a review of existing work on the text summarization process is
useful for carrying out further research. In this paper, recent literature on
automatic keyword extraction and text summarization is presented, since the
text summarization process depends heavily on keyword extraction. This
literature includes a discussion of the different methodologies used for
keyword extraction and text summarization. It also discusses the different
databases used for text summarization in several domains, along with
evaluation metrics. Finally, it briefly discusses the issues and research
challenges faced by researchers, along with future directions.
Comment: 12 pages, 4 figures
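The dependency the survey stresses — summarization building on keyword extraction — can be shown in its simplest form: score words by frequency, then score sentences by how many keywords they contain. A frequency-based sketch (one of many families the survey covers; the stopword list and example text are illustrative):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to", "it"}

def keywords(text, k=5):
    """Top-k most frequent non-stopword terms."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def summarize(text, n=1):
    """Return the n sentences containing the most keyword occurrences."""
    keys = set(keywords(text))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = sorted(
        sentences,
        key=lambda s: sum(w in keys for w in re.findall(r"[a-z]+", s.lower())),
        reverse=True,
    )
    return scored[:n]

text = ("Text mining extracts patterns. "
        "Keyword extraction finds important words. "
        "Text summarization uses keyword extraction to compress text.")
top = keywords(text)
summary = summarize(text, 1)
```

This is extractive summarization: whole sentences from the source are selected, as opposed to abstractive methods that generate new wording.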
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised, and semi-supervised algorithms. Despite
considerable development, the success of trajectory clustering remains limited
by complex conditions such as application scenarios and data dimensions. This
paper provides a holistic understanding of and deep insight into trajectory
clustering, and presents a comprehensive analysis of representative methods
and promising future directions.
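The two ingredients every method in this space needs — a trajectory distance and a grouping rule — can be sketched minimally: mean pointwise Euclidean distance between equal-length trajectories, plus greedy threshold clustering. A toy unsupervised example (the survey's methods are far richer; the trajectories and threshold are invented):

```python
import math

def traj_distance(a, b):
    """Mean pointwise Euclidean distance between two equal-length trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def cluster(trajectories, threshold):
    """Greedy clustering: assign each trajectory to the first cluster whose
    representative (first member) lies within the threshold, else start a new one."""
    clusters = []
    for t in trajectories:
        for c in clusters:
            if traj_distance(t, c[0]) <= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

# Two trajectories heading east, one heading north.
trajs = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 0), (0, 1), (0, 2)],
]
groups = cluster(trajs, threshold=1.2)
```

Real trajectory data rarely has aligned, equal-length samples, which is exactly why the literature develops distances such as DTW, LCSS, and Hausdorff instead of the naive pointwise mean used here.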
An Interdisciplinary Comparison of Sequence Modeling Methods for Next-Element Prediction
Data of a sequential nature arise in many application domains, in the form of,
e.g., textual data, DNA sequences, and software execution traces. Different
research disciplines have developed methods to learn sequence models from such
datasets: (i) in the machine learning field, methods such as (hidden) Markov
models and recurrent neural networks have been developed and successfully
applied to a wide range of tasks; (ii) in process mining, process discovery
techniques aim to generate human-interpretable descriptive models; and (iii)
in the grammar inference field, the focus is on finding descriptive models in
the form of formal grammars. Despite their different focuses, these fields
share a common goal: learning a model that accurately describes the behavior
in the underlying data. Such sequence models are generative, i.e., they can
predict what elements are likely to occur after a given unfinished sequence.
So far, these fields have developed mainly in isolation from each other, and
no comparison exists. This paper presents an interdisciplinary experimental
evaluation that compares sequence modeling techniques on the task of
next-element prediction on four real-life sequence datasets. The results
indicate that, in terms of accuracy, machine learning techniques that
generally do not aim at interpretability outperform techniques from the
process mining and grammar inference fields that aim to yield interpretable
models.
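The shared task — predicting the next element of an unfinished sequence — has, as its simplest generative baseline, a first-order Markov (bigram) model. A sketch on invented execution traces (the paper evaluates much stronger RNN, process mining, and grammar inference techniques):

```python
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Count element-to-element transitions across training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, prefix):
    """Most frequent observed successor of the prefix's last element."""
    successors = counts[prefix[-1]]
    return successors.most_common(1)[0][0] if successors else None

# Hypothetical software execution traces.
logs = [
    ["login", "browse", "buy", "logout"],
    ["login", "browse", "browse", "logout"],
    ["login", "buy", "logout"],
]
model = fit_bigram(logs)
```

A first-order model conditions only on the last element; the techniques compared in the paper differ precisely in how much longer a history they can exploit and how interpretable the resulting model remains.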