6,304 research outputs found
A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining
Big data comes in various ways, types, shapes, forms and sizes. Indeed,
almost all areas of science, technology, medicine, public health, economics,
business, linguistics and social science are bombarded by ever increasing flows
of data begging to analyzed efficiently and effectively. In this paper, we
propose a rough idea of a possible taxonomy of big data, along with some of the
most commonly used tools for handling each particular category of bigness. The
dimensionality p of the input space and the sample size n are usually the main
ingredients in the characterization of data bigness. The specific statistical
machine learning technique used to handle a particular big data set will depend
on which category it falls in within the bigness taxonomy. Large p small n data
sets for instance require a different set of tools from the large n small p
variety. Among other tools, we discuss Preprocessing, Standardization,
Imputation, Projection, Regularization, Penalization, Compression, Reduction,
Selection, Kernelization, Hybridization, Parallelization, Aggregation,
Randomization, Replication, Sequentialization. Indeed, it is important to
emphasize right away that the so-called no free lunch theorem applies here, in
the sense that there is no universally superior method that outperforms all
other methods on all categories of bigness. It is also important to stress the
fact that simplicity in the sense of Ockham's razor non plurality principle of
parsimony tends to reign supreme when it comes to massive data. We conclude
with a comparison of the predictive performance of some of the most commonly
used methods on a few data sets.Comment: 18 pages, 2 figures 3 table
Autonomous clustering using rough set theory
This paper proposes a clustering technique that minimises the need for subjective
human intervention and is based on elements of rough set theory. The proposed algorithm is
unified in its approach to clustering and makes use of both local and global data properties to
obtain clustering solutions. It handles single-type and mixed attribute data sets with ease and
results from three data sets of single and mixed attribute types are used to illustrate the
technique and establish its efficiency
Two new feature selection algorithms with rough sets theory
Rough Sets Theory has opened new trends for the development of the Incomplete Information Theory. Inside this one, the notion of reduct is a very significant one, but to obtain a reduct in a decision system is an expensive computing process although very important in data analysis and knowledge discovery. Because of this, it has been necessary the development of different variants to calculate reducts. The present work look into the utility that offers Rough Sets Model and Information Theory in feature selection and a new method is presented with the purpose of calculate a good reduct. This new method consists of a greedy algorithm that uses heuristics to work out a good reduct in acceptable times. In this paper we propose other method to find good reducts, this method combines elements of Genetic Algorithm with Estimation of Distribution Algorithms. The new methods are compared with others which are implemented inside Pattern Recognition and Ant Colony Optimization Algorithms and the results of the statistical tests are shown.IFIP International Conference on Artificial Intelligence in Theory and Practice - Knowledge Acquisition and Data MiningRed de Universidades con Carreras en InformĂĄtica (RedUNCI
Fuzzy-Rough Nearest Neighbour Classification and Prediction
AbstractNearest neighbour (NN) approaches are inspired by the way humans make decisions, comparing a test object to previously encountered samples. In this paper, we propose an NN algorithm that uses the lower and upper approximations from fuzzy-rough set theory in order to classify test objects, or predict their decision value. It is shown experimentally that our method outperforms other NN approaches (classical, fuzzy and fuzzy-rough ones) and that it is competitive with leading classification and prediction methods. Moreover, we show that the robustness of our methods against noise can be enhanced effectively by invoking the approximations of the Vaguely Quantified Rough Set (VQRS) model, which emulates the linguistic quantifiers âsomeâ and âmostâ from natural language
Integrating rough set theory and medical applications
AbstractMedical science is not an exact science in which processes can be easily analyzed and modeled. Rough set theory has proven well suited for accommodating such inexactness of the medical profession. As rough set theory matures and its theoretical perspective is extended, the theory has been also followed by development of innovative rough sets systems as a result of this maturation. Unique concerns in medical sciences as well as the need of integrated rough sets systems are discussed. We present a short survey of ongoing research and a case study on integrating rough set theory and medical application. Issues in the current state of rough sets in advancing medical technology and some of its challenges are also highlighted
Data mining in soft computing framework: a survey
The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included
The inference of gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the
reconstruction of gene trees based on sequence alignments. Yet, most
phylogeneticists seek to reveal the history of species. Although the histories
of genes and species are tightly linked, they are seldom identical, because
genes duplicate, are lost or horizontally transferred, and because alleles can
co-exist in populations for periods that may span several speciation events.
Building models describing the relationship between gene and species trees can
thus improve the reconstruction of gene trees when a species tree is known, and
vice-versa. Several approaches have been proposed to solve the problem in one
direction or the other, but in general neither gene trees nor species trees are
known. Only a few studies have attempted to jointly infer gene trees and
species trees. In this article we review the various models that have been used
to describe the relationship between gene trees and species trees. These models
account for gene duplication and loss, transfer or incomplete lineage sorting.
Some of them consider several types of events together, but none exists
currently that considers the full repertoire of processes that generate gene
trees along the species tree. Simulations as well as empirical studies on
genomic data show that combining gene tree-species tree models with models of
sequence evolution improves gene tree reconstruction. In turn, these better
gene trees provide a better basis for studying genome evolution or
reconstructing ancestral chromosomes and ancestral gene sequences. We predict
that gene tree-species tree methods that can deal with genomic data sets will
be instrumental to advancing our understanding of genomic evolution.Comment: Review article in relation to the "Mathematical and Computational
Evolutionary Biology" conference, Montpellier, 201
- âŠ