24,620 research outputs found
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining
Modern biomedical data mining requires feature selection methods that can (1)
be applied to large scale feature spaces (e.g. 'omics' data), (2) function in
noisy problems, (3) detect complex patterns of association (e.g. gene-gene
interactions), (4) be flexibly adapted to various problem domains and data
types (e.g. genetic variants, gene expression, and clinical data) and (5) remain
computationally tractable. To that end, this work examines a set of
filter-style feature selection algorithms inspired by the 'Relief' algorithm,
i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an
open source framework called ReBATE (Relief-Based Algorithm Training
Environment). We conduct a comprehensive genetic simulation study comparing
existing RBAs, a proposed RBA called MultiSURF, and other established feature
selection methods, over a variety of problems. The results of this study (1)
support the assertion that RBAs are particularly flexible, efficient, and
powerful feature selection methods that differentiate relevant features having
univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm
the efficacy of expansions for classification vs. regression, discrete vs.
continuous features, missing data, multiple classes, or class imbalance, (3)
identify previously unknown limitations of specific RBAs, and (4) suggest that
while MultiSURF* performs best for explicitly identifying pure 2-way
interactions, MultiSURF yields the most reliable feature selection performance
across a wide range of problem types.
Comment: Revised submission to JB
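The abstract does not spell out how a Relief-style score is computed, so the following is only a minimal sketch of the basic two-class Relief update, not ReBATE's API and not MultiSURF. The function name `relief_weights` is hypothetical. The sketch assumes numeric features already scaled to [0, 1], Manhattan distance, at least two instances per class, and a pass over every instance rather than random sampling.

```python
def relief_weights(X, y):
    """Score features with a basic two-class Relief update (illustrative only)."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a, b):
        # Manhattan distance; assumes every feature is already scaled to [0, 1].
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        # Assumption: visit every instance once instead of sampling at random.
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: distance(X[i], X[j]))    # nearest same-class neighbor
        miss = min(other, key=lambda j: distance(X[i], X[j]))  # nearest other-class neighbor
        for f in range(d):
            # Reward features that differ at the nearest miss,
            # penalize features that differ at the nearest hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]  # the class depends on feature 0 only
print(relief_weights(X, y))  # prints [1.0, -1.0]
```

In this toy data the class-defining feature receives the higher weight and the irrelevant feature a negative one, which is the ranking behavior a filter-style selector relies on.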
Spatio-Temporal Modeling of Wireless Users Internet Access Patterns Using Self-Organizing Maps
User online behavior and interests will play a central role in future mobile
networks. We introduce a systematic method for large-scale multi-dimensional
analysis of online activity for thousands of mobile users across 79 buildings
over a variety of web domains. We propose a modeling approach based on
self-organizing maps (SOM) for discovering, organizing and visualizing
different mobile users' trends from billions of WLAN records. Surprisingly,
we find that users' trends based on domains and locations can be
accurately modeled using a self-organizing map with clearly distinct
characteristics. We also find many non-trivial correlations between different
types of web domains and locations. Based on our analysis, we introduce a
mixture model as an initial step towards realistic simulation of wireless
network usage.
Algorithms for Efficient Mining of Statistically Significant Attribute Association Information
Knowledge of the association information between the attributes in a data set
provides insight into the underlying structure of the data and explains the
relationships (independence, synergy, redundancy) between the attributes and
class (if present). Complex models learnt computationally from the data are
more interpretable to a human analyst when such interdependencies are known. In
this paper, we focus on mining two types of association information among the
attributes - correlation information and interaction information for both
supervised (class attribute present) and unsupervised analysis (class attribute
absent). Identifying the statistically significant attribute associations is a
computationally challenging task - the number of possible associations
increases exponentially and many associations contain redundant information
when a number of correlated attributes are present. In this paper, we explore
efficient data mining methods to discover non-redundant attribute sets that
contain significant association information indicating the presence of
informative patterns in the data.
Comment: 16 pages, 7 figure
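To make the synergy/redundancy distinction concrete, here is a small sketch that estimates mutual information and one common form of interaction information from discrete samples. The sign convention (positive means synergy) and the function names are assumptions for illustration; the paper's own measure and significance tests may differ.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a list of hashable values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def interaction_gain(a, b, c):
    """I(A, B; C) - I(A; C) - I(B; C): > 0 suggests synergy, < 0 redundancy."""
    joint = list(zip(a, b))
    return mutual_information(joint, c) - mutual_information(a, c) - mutual_information(b, c)


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
xor = [0, 1, 1, 0]
print(interaction_gain(a, b, xor))  # synergy: neither attribute alone informs the class
print(interaction_gain(a, a, a))    # redundancy: both attributes carry the same information
```

With the XOR class the gain is positive because only the attribute pair is informative; with duplicated attributes it is negative because the pair adds nothing beyond either attribute alone.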
Re-mining positive and negative association mining results
Positive and negative association mining are well-known and extensively studied data mining techniques to analyze market basket data. Efficient algorithms exist to find both types of association, separately or simultaneously. Association mining is performed by operating on the transaction data. Despite being an integral part of the transaction data, the pricing and time information has not been incorporated into market basket analysis so far, and additional attributes have been handled using quantitative association mining. In this paper, a new approach is proposed to incorporate price, time and domain related attributes into data mining by re-mining the association mining results. The underlying factors behind positive and negative relationships, as indicated by the association rules, are characterized and described through the second data mining stage, re-mining. The applicability of the methodology is demonstrated by analyzing data from the apparel retailing industry, where price markdown is an essential tool for promoting sales and generating increased revenue.
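As a rough illustration of what positive and negative association between items can mean, the sketch below computes support and lift over a toy basket list. Using lift above or below 1 as the indicator is an assumption made here for illustration; the abstract does not state which interestingness measure the paper uses, and the item names and helper functions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def lift(transactions, a, b):
    """Lift of itemsets a and b: > 1 hints at positive, < 1 at negative association.

    Assumes both a and b occur at least once, otherwise the division fails.
    """
    return support(transactions, a | b) / (support(transactions, a) * support(transactions, b))


baskets = [
    {"shirt", "tie"},
    {"shirt", "tie"},
    {"shirt"},
    {"jeans"},
    {"jeans", "belt"},
]
print(lift(baskets, {"shirt"}, {"tie"}))    # above 1: bought together more often than chance
print(lift(baskets, {"shirt"}, {"jeans"}))  # below 1: rarely (here never) bought together
```

A re-mining stage in the spirit of the abstract would then take such item pairs as records and relate their association scores to attributes like price and time.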
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms generally have problems and challenges including huge memory cost,
low processing speed, and inadequate hard disk space. As a fundamental task of
data mining, sequential pattern mining (SPM) is used in a wide variety of
real-life applications. However, it is more complex and challenging than other
pattern mining tasks, i.e., frequent itemset mining and association rule
mining, and also suffers from the above challenges when handling
large-scale data. To solve these problems, mining sequential patterns in a
parallel or distributed computing environment has emerged as an important issue
with many applications. In this paper, we provide an in-depth survey of the
current status of parallel sequential pattern mining (PSPM),
including detailed categorization of traditional serial SPM approaches, and
state of the art parallel SPM. We review the related work of parallel
sequential pattern mining in detail, including partition-based algorithms for
PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for
PSPM, and provide a detailed description (i.e., characteristics, advantages,
disadvantages and summarization) of these parallel approaches to PSPM. Some
advanced topics for PSPM, including parallel quantitative / weighted / utility
sequential pattern mining, PSPM from uncertain data and stream data, and hardware
acceleration for PSPM, are further reviewed in detail. We also review
some well-known open-source software for PSPM. Finally, we summarize
some challenges and opportunities of PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 page
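The core operation that data-partitioned approaches distribute is support counting for a candidate pattern. The sketch below is a toy illustration of that idea only, not any specific algorithm from the survey: it treats each sequence as an ordered list of single items (real SPM uses itemsets), counts matches per partition, and sums the partial counts. The partitions are processed serially here; all function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def count_in_partition(partition, pattern):
    """Local support count; this is the unit of work a worker would run."""
    return sum(contains(seq, pattern) for seq in partition)


def support_count(database, pattern, n_partitions=2):
    """Split the database, count locally, then merge the partial counts."""
    partitions = [database[k::n_partitions] for k in range(n_partitions)]
    return sum(count_in_partition(p, pattern) for p in partitions)


database = ["abcd", "acbd", "bd", "ab"]
print(support_count(database, "ab"))  # prints 3
```

Because local counts simply add up, the per-partition step could be handed to separate processes or machines without changing the result.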
Recent Trends and Research Issues in Video Association Mining
With ever-growing digital libraries and video databases, it is
increasingly important to understand and mine the knowledge from video databases
automatically. Discovering association rules between items in a large video
database plays a considerable role in the video data mining research areas.
Based on the research and development in the past years, application of
association rule mining is growing in different domains such as surveillance,
meetings, broadcast news, sports, archives, movies, medical data, as well as
personal and online media collections. The purpose of this paper is to provide
a general framework for mining association rules from video databases. This
article also presents the research issues in video association mining,
followed by recent trends.
Comment: 13 pages; 1 Figure; 1 Tabl
Random Forests on Distance Matrices for Imaging Genetics Studies
We propose a non-parametric regression methodology, Random Forests on
Distance Matrices (RFDM), for detecting genetic variants associated with
quantitative phenotypes representing the human brain's structure or function,
and obtained using neuroimaging techniques. RFDM, which is an extension of
decision forests, requires a distance matrix as response that encodes all
pair-wise phenotypic distances in the random sample. We discuss ways to learn
such distances directly from the data using manifold learning techniques, and
how to define such distances when the phenotypes are non-vectorial objects such
as brain connectivity networks. We also describe an extension of RFDM to detect
epistatic effects while keeping the computational complexity low. Extensive
simulation results and an application to an imaging genetics study of
Alzheimer's Disease are presented and discussed.
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS
Statistical Modeling of Epistasis and Linkage Decay using Logic Regression
Logic regression has been recognized as a tool that can identify and model non-additive genetic interactions using Boolean logic groups. Logic regression, TASSEL-GLM and SAS-GLM were compared for analytical precision using a previously characterized model system to identify the best genetic model explaining epistatic interaction of vernalization-sensitivity in barley. A genetic model containing two molecular markers identified in vernalization response in barley was selected using logic regression, while both TASSEL-GLM and SAS-GLM included spurious associations in their models. The results also suggest that logic regression can be used to identify dominant/recessive relationships between epistatic alleles through its use of conjugate operators.
EagleMine: Vision-Guided Mining in Large Graphs
Given a graph with millions of nodes, what patterns exist in the
distributions of node characteristics, and how can we detect them and separate
anomalous nodes in a way similar to human vision? In this paper, we propose a
vision-guided algorithm, EagleMine, to summarize micro-cluster patterns in
two-dimensional histogram plots constructed from node features in a large
graph. EagleMine utilizes a water-level tree to capture cluster structures
according to vision-based intuition at multi-resolutions. EagleMine traverses
the water-level tree from the root and adopts statistical hypothesis tests to
determine the optimal clusters that should be fitted along the path, and
summarizes each cluster with a truncated Gaussian distribution. Experiments on
real data show that our method can find truncated and overlapped elliptical
clusters, even when some baseline methods split one visual cluster into pieces
with Gaussian spheres. To identify potentially anomalous microclusters,
EagleMine also designates a score to measure the suspiciousness of outlier
groups (i.e. node clusters) and outlier nodes, detecting bots and anomalous
users with high accuracy in the real Microblog data.
Comment: 9 page