24,620 research outputs found

Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining

Author: Meeker Melissa
Moore Jason H.
Olson Randal S.
Schmitt Peter
Urbanowicz Ryan J.
Publication date: 02/04/2018

Modern biomedical data mining requires feature selection methods that can (1) be applied to large-scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types. Comment: Revised submission to JB

arXiv.org e-Print Archive
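
To make the idea behind the Relief family in the abstract above concrete, here is a minimal sketch of the basic Relief weight update for discrete features and a binary class. It is not ReBATE or MultiSURF. The function name `relief_scores`, the use of every instance as a target (rather than a random sample), and the averaging over tied nearest neighbours are assumptions made for illustration.

```python
from itertools import product


def relief_scores(X, y):
    """Basic Relief-style weights for discrete features and a binary class.

    Assumptions (not from the paper): every instance is used as a target,
    distance is the Hamming distance, and tied nearest neighbours are averaged.
    """
    n, p = len(X), len(X[0])
    weights = [0.0] * p

    def distance(a, b):
        # Hamming distance: number of features on which two instances differ.
        return sum(u != v for u, v in zip(a, b))

    def nearest(i, same_class):
        # All nearest neighbours of instance i with the same (or the other) class.
        candidates = [j for j in range(n)
                      if j != i and (y[j] == y[i]) == same_class]
        best = min(distance(X[i], X[j]) for j in candidates)
        return [j for j in candidates if distance(X[i], X[j]) == best]

    for i in range(n):
        # Differing from a nearest hit lowers a feature's weight;
        # differing from a nearest miss raises it.
        for same_class, sign in ((True, -1.0), (False, 1.0)):
            neighbours = nearest(i, same_class)
            for j in neighbours:
                for f in range(p):
                    if X[i][f] != X[j][f]:
                        weights[f] += sign / (len(neighbours) * n)
    return weights


# Toy data: the class is A XOR B (a pure 2-way interaction); N is irrelevant.
X = [list(row) for row in product([0, 1], repeat=3)]
y = [a ^ b for a, b, _ in X]
scores = relief_scores(X, y)
print(scores)
```

On this toy problem neither A nor B is associated with the class on its own, yet both receive higher weights than the irrelevant feature N, which is the kind of interaction sensitivity the abstract attributes to RBAs.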

Spatio-Temporal Modeling of Wireless Users Internet Access Patterns Using Self-Organizing Maps

Author: Helmy Ahmed
Moghaddam Saeed
Publication date: 29/08/2010

User online behavior and interests will play a central role in future mobile networks. We introduce a systematic method for large-scale multi-dimensional analysis of online activity for thousands of mobile users across 79 buildings over a variety of web domains. We propose a modeling approach based on self-organizing maps (SOM) for discovering, organizing and visualizing different mobile users' trends from billions of WLAN records. Surprisingly, we find that users' trends based on domains and locations can be accurately modeled using a self-organizing map with clearly distinct characteristics. We also find many non-trivial correlations between different types of web domains and locations. Based on our analysis, we introduce a mixture model as an initial step towards realistic simulation of wireless network usage.

arXiv.org e-Print Archive

Algorithms for Efficient Mining of Statistically Significant Attribute Association Information

Author: Chanda Pritam
Ramanathan Murali
Zhang Aidong
Publication date: 19/08/2012

Knowledge of the association information between the attributes in a data set provides insight into the underlying structure of the data and explains the relationships (independence, synergy, redundancy) between the attributes and class (if present). Complex models learnt computationally from the data are more interpretable to a human analyst when such interdependencies are known. In this paper, we focus on mining two types of association information among the attributes, correlation information and interaction information, for both supervised (class attribute present) and unsupervised (class attribute absent) analysis. Identifying the statistically significant attribute associations is a computationally challenging task: the number of possible associations increases exponentially, and many associations contain redundant information when a number of correlated attributes are present. In this paper, we explore efficient data mining methods to discover non-redundant attribute sets that contain significant association information indicating the presence of informative patterns in the data. Comment: 16 pages, 7 figures

arXiv.org e-Print Archive
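
The synergy and redundancy relationships mentioned in the abstract above can be illustrated with plain entropy arithmetic. The sketch below uses one common definition, synergy = I(A,B;C) - I(A;C) - I(B;C), computed from empirical counts. Sign conventions for interaction information differ between authors, so this is not necessarily the quantity used in the paper, and the helper names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Empirical joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(attributes, target):
    """I(attributes; target), where attributes is a list of columns."""
    return entropy(*attributes) + entropy(target) - entropy(*attributes, target)


def synergy(a, b, target):
    """Information about the target available only from a and b jointly.

    Positive values suggest synergy, negative values suggest redundancy
    (under the convention assumed in this sketch).
    """
    return (mutual_information([a, b], target)
            - mutual_information([a], target)
            - mutual_information([b], target))


# Toy data: the class is A XOR B, so each attribute alone is uninformative.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]

mi_a = mutual_information([a], c)
mi_ab = mutual_information([a, b], c)
syn = synergy(a, b, c)
# A copy of A is fully redundant with A.
red = synergy(a, a, a)
print(mi_a, mi_ab, syn, red)
```

Here each attribute alone carries no information about the class while the pair determines it, so the synergy is one bit; duplicating an attribute gives a negative value, which is the redundant case.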

Re-mining positive and negative association mining results

Author: Atan Tankut
Demiriz Ayhan
Ertek Gürdal
Kula Ufuk
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Positive and negative association mining are well-known and extensively studied data mining techniques to analyze market basket data. Efficient algorithms exist to find both types of association, separately or simultaneously. Association mining is performed by operating on the transaction data. Despite being an integral part of the transaction data, the pricing and time information has not been incorporated into market basket analysis so far, and additional attributes have been handled using quantitative association mining. In this paper, a new approach is proposed to incorporate price, time and domain related attributes into data mining by re-mining the association mining results. The underlying factors behind positive and negative relationships, as indicated by the association rules, are characterized and described through the second data mining stage, re-mining. The applicability of the methodology is demonstrated by analyzing data coming from the apparel retailing industry, where price markdown is an essential tool for promoting sales and generating increased revenue.

A Survey of Parallel Sequential Pattern Mining

Author: Chao Han-Chieh
Fournier-Viger Philippe
Gan Wensheng
Lin Jerry Chun-Wei
Yu Philip S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 04/04/2019

With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms generally face problems and challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, i.e., frequent itemset mining and association rule mining, and also suffers from the above challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, we provide an in-depth survey of the current status of parallel sequential pattern mining (PSPM), including a detailed categorization of traditional serial SPM approaches and state-of-the-art parallel SPM. We review the related work on parallel sequential pattern mining in detail, including partition-based algorithms for PSPM, Apriori-based PSPM, pattern-growth-based PSPM, and hybrid algorithms for PSPM, and provide an in-depth description (i.e., characteristics, advantages, disadvantages and summarization) of these parallel approaches. Some advanced topics for PSPM, including parallel quantitative / weighted / utility sequential pattern mining, PSPM from uncertain data and stream data, and hardware acceleration for PSPM, are further reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era. Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages

arXiv.org e-Print Archive
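
As a rough illustration of the support counting that underlies sequential pattern mining, and of the partition-based idea named in the abstract above, the sketch below counts how many sequences contain a pattern as an ordered subsequence, first over the whole database and then partition by partition. It is a simplification under stated assumptions: sequences hold single items rather than itemsets, the partitions are processed serially rather than on separate workers, and the function names are hypothetical rather than taken from any surveyed algorithm.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # 'item in remaining' consumes the iterator up to the first match,
    # so later items can only match later positions.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)


def partitioned_support(pattern, partitions):
    """Same support, computed from per-partition counts that are then merged.

    In a parallel setting each partition's count could be computed by a
    separate worker; here the partitions are simply looped over.
    """
    local_counts = [sum(is_subsequence(pattern, seq) for seq in part)
                    for part in partitions]
    total = sum(len(part) for part in partitions)
    return sum(local_counts) / total


database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["c", "b"],
]
partitions = [database[:2], database[2:]]

sup_ac = support(["a", "c"], database)
sup_bc = support(["b", "c"], database)
sup_ac_parts = partitioned_support(["a", "c"], partitions)
print(sup_ac, sup_bc, sup_ac_parts)
```

Because support is a simple count, the per-partition results merge by addition, which is what makes this step straightforward to distribute.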

Recent Trends and Research Issues in Video Association Mining

Author: R Nedunchezhian
V Vijayakumar
Publication date: 09/12/2011

With ever-growing digital libraries and video databases, it is increasingly important to understand and mine the knowledge in video databases automatically. Discovering association rules between items in a large video database plays a considerable role in video data mining research. Based on the research and development of the past years, the application of association rule mining is growing in different domains such as surveillance, meetings, broadcast news, sports, archives, movies, medical data, as well as personal and online media collections. The purpose of this paper is to provide a general framework for mining association rules from video databases. This article also presents the research issues in video association mining, followed by the recent trends. Comment: 13 pages; 1 Figure; 1 Table

arXiv.org e-Print Archive

Random Forests on Distance Matrices for Imaging Genetics Studies

Author: Montana Giovanni
Sim Aaron
Tsagkrasoulis Dimosthenis
Publication date: 24/09/2013

We propose a non-parametric regression methodology, Random Forests on Distance Matrices (RFDM), for detecting genetic variants associated with quantitative phenotypes that represent the human brain's structure or function and are obtained using neuroimaging techniques. RFDM, which is an extension of decision forests, requires a distance matrix as response that encodes all pair-wise phenotypic distances in the random sample. We discuss ways to learn such distances directly from the data using manifold learning techniques, and how to define such distances when the phenotypes are non-vectorial objects such as brain connectivity networks. We also describe an extension of RFDM to detect epistatic effects while keeping the computational complexity low. Extensive simulation results and an application to an imaging genetics study of Alzheimer's Disease are presented and discussed.

arXiv.org e-Print Archive

State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"

Author: Andrews Pierre
Rajman Martin
Vesely Martin
Publication date: 29/12/2004

Several Networks of Excellence have been set up in the framework of the European FP5 research program. Among these Networks of Excellence, the NEMIS project focuses on the field of Text Mining. Within this field, document processing and visualization was identified as one of the key topics, and the WG1 working group was created in the NEMIS project to carry out a detailed survey of techniques associated with the text mining process and to identify the relevant research topics in related research areas. In this document we present the results of this comprehensive survey. The report includes a description of the current state-of-the-art and practice, a roadmap for follow-up research in the identified areas, and recommendations for anticipated technological development in the domain of text mining. Comment: 54 pages, Report of Working Group 1 for the European Network of Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)

arXiv.org e-Print Archive

Statistical Modeling of Epistasis and Linkage Decay using Logic Regression

Author: Jean-Luc Jannink
John A. Henning
Peter Szucs
Thomas B. Parker
Walt F. Mahaffee
Publication date: 18/11/2008

Logic regression has been recognized as a tool that can identify and model non-additive genetic interactions using Boolean logic groups. Logic regression, TASSEL-GLM and SAS-GLM were compared for analytical precision using a previously characterized model system to identify the best genetic model explaining epistatic interaction of vernalization-sensitivity in barley. A genetic model containing two molecular markers identified in vernalization response in barley was selected using logic regression, while both TASSEL-GLM and SAS-GLM included spurious associations in their models. The results also suggest that logic regression can be used to identify dominant/recessive relationships between epistatic alleles through its use of conjugate operators.

EagleMine: Vision-Guided Mining in Large Graphs

Author: Cheng Xueqi
Faloutsos Christos
Feng Wenjie
Hooi Bryan
Liu Shenghua
Shen Huawei
Publication date: 24/10/2017

Given a graph with millions of nodes, what patterns exist in the distributions of node characteristics, and how can we detect them and separate anomalous nodes in a way similar to human vision? In this paper, we propose a vision-guided algorithm, EagleMine, to summarize micro-cluster patterns in two-dimensional histogram plots constructed from node features in a large graph. EagleMine utilizes a water-level tree to capture cluster structures according to vision-based intuition at multiple resolutions. EagleMine traverses the water-level tree from the root and adopts statistical hypothesis tests to determine the optimal clusters that should be fitted along the path, and summarizes each cluster with a truncated Gaussian distribution. Experiments on real data show that our method can find truncated and overlapped elliptical clusters, even when some baseline methods split one visual cluster into pieces with Gaussian spheres. To identify potentially anomalous microclusters, EagleMine also designates a score to measure the suspiciousness of outlier groups (i.e. node clusters) and outlier nodes, detecting bots and anomalous users with high accuracy in the real Microblog data. Comment: 9 pages

arXiv.org e-Print Archive