1,213 research outputs found
Feature selection in high-dimensional dataset using MapReduce
This paper describes a distributed MapReduce implementation of the minimum
Redundancy Maximum Relevance algorithm, a popular feature selection method in
bioinformatics and network inference problems. The proposed approach handles
both tall/narrow and wide/short datasets. We further provide an open source
implementation based on Hadoop/Spark, and illustrate its scalability on
datasets involving millions of observations or features
Spatial and verbal routes to number comparison in young children
The ability to compare the numerical magnitude of symbolic numbers represents a milestone in the development of numerical skills. However, it remains unclear how basic numerical abilities contribute to the understanding of symbolic magnitude and whether the impact of these abilities may vary when symbolic numbers are presented as number words (e.g., \u201csix vs. eight\u201d) vs. Arabic numbers (e.g., 6 vs. 8). In the present study on preschool children, we show that comparison of number words is related to cardinality knowledge whereas the comparison of Arabic digits is related to both cardinality knowledge and the ability to spatially map numbers. We conclude that comparison of symbolic numbers in preschool children relies on multiple numerical skills and representations, which can be differentially weighted depending on the presentation format. In particular, the spatial arrangement of digits on the number line seems to scaffold the development of a \u201cspatial route\u201d to understanding the exact magnitude of numerals
On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems
We present a new distributed fuzzy partitioning method to reduce the
complexity of multi-way fuzzy decision trees in Big Data classification
problems. The proposed algorithm builds a fixed number of fuzzy sets for all
variables and adjusts their shape and position to the real distribution of
training data. A two-step process is applied : 1) transformation of the
original distribution into a standard uniform distribution by means of the
probability integral transform. Since the original distribution is generally
unknown, the cumulative distribution function is approximated by computing the
q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy
partition in the transformed attribute space using a fixed number of equally
distributed triangular membership functions. Despite the aforementioned
transformation, the definition of every fuzzy set in the original space can be
recovered by applying the inverse cumulative distribution function (also known
as quantile function). The experimental results reveal that the proposed
methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT)
induction algorithm to maintain classification accuracy with up to 6 million
fewer leaves.Comment: Appeared in 2018 IEEE International Congress on Big Data (BigData
Congress). arXiv admin note: text overlap with arXiv:1902.0935
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
Circuit Transformations for Quantum Architectures
Quantum computer architectures impose restrictions on qubit interactions. We propose efficient circuit transformations that modify a given quantum circuit to fit an architecture, allowing for any initial and final mapping of circuit qubits to architecture qubits. To achieve this, we first consider the qubit movement subproblem and use the ROUTING VIA MATCHINGS framework to prove tighter bounds on parallel routing. In practice, we only need to perform partial permutations, so we generalize ROUTING VIA MATCHINGS to that setting. We give new routing procedures for common architecture graphs and for the generalized hierarchical product of graphs, which produces subgraphs of the Cartesian product. Secondly, for serial routing, we consider the TOKEN SWAPPING framework and extend a 4-approximation algorithm for general graphs to support partial permutations. We apply these routing procedures to give several circuit transformations, using various heuristic qubit placement subroutines. We implement these transformations in software and compare their performance for large quantum circuits on grid and modular architectures, identifying strategies that work well in practice
- …