Towards an automated classification of spreadsheets
Many spreadsheets in the wild have neither documentation nor categorization associated with them. This makes it difficult to apply spreadsheet research that targets specific spreadsheet domains such as finance or databases. In this paper we introduce a methodology to automatically classify spreadsheets into different domains. We exploit existing data mining classification algorithms using spreadsheet-specific features. The algorithms were trained and validated with cross-validation on the EUSES corpus, reaching up to 89% accuracy. The best algorithm was then applied to the larger Enron corpus in order to gain some insight from it and to demonstrate the usefulness of this work.
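The abstract does not name its features or algorithms, so the following is only a minimal sketch of the general recipe it describes: compute spreadsheet-specific features, train a classifier, and score it with k-fold cross-validation. The feature names (formula density, numeric-cell ratio), the toy values, and the nearest-centroid classifier are all invented for illustration.

```python
# Hedged sketch: classify spreadsheets into domains from hand-crafted features,
# evaluated with plain k-fold cross-validation. Features and data are invented.
from statistics import mean

def fit_centroids(X, y):
    """Compute the per-class mean feature vector (centroid)."""
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_class.items()}

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def cross_validate(X, y, k=3):
    """k-fold cross-validation; returns mean accuracy over the folds."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    accs = []
    for fold in folds:
        train = [i for i in range(len(X)) if i not in fold]
        cents = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(cents, X[i]) == y[i] for i in fold)
        accs.append(hits / len(fold))
    return mean(accs)

# Toy feature vectors: (formula density, numeric-cell ratio) -- invented values.
X = [(0.8, 0.9), (0.7, 0.8), (0.75, 0.85),
     (0.1, 0.2), (0.15, 0.3), (0.05, 0.25)]
y = ["financial", "financial", "financial",
     "database", "database", "database"]
print(cross_validate(X, y))
```

The paper's reported 89% accuracy refers to its own features and algorithms on the EUSES corpus, not to this toy setup.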
Operation of a quantum dot in the finite-state machine mode: single-electron dynamic memory
A single electron dynamic memory is designed based on the non-equilibrium
dynamics of charge states in electrostatically-defined metallic quantum dots.
Using the orthodox theory for computing the transfer rates and a master
equation, we model the dynamical response of devices consisting of a charge
sensor coupled to either a single or a double quantum dot subjected to a
pulsed gate voltage. We show that transition rates between charge states in
metallic quantum dots are characterized by an asymmetry that can be controlled
by the gate voltage. This effect is more pronounced when the switching between
charge states corresponds to a Markovian process involving electron transport
through a chain of several quantum dots. By simulating the dynamics of electron
transport we demonstrate that the quantum box operates as a finite-state
machine that can be addressed by choosing suitable shapes and switching rates
of the gate pulses. We further show that writing times in the ns range and
memory retention times six orders of magnitude longer, in the ms range, can be
achieved on the double quantum dot system using experimentally feasible
parameters, thereby demonstrating that the device can operate as a dynamic
single-electron memory. Comment: 18 pages, 8 figures
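The paper models charge dynamics with transfer rates plugged into a master equation. As a minimal sketch of that idea only, the two-state master equation below tracks the occupation probability P1 of one extra electron on a dot, with asymmetric in/out rates; the rate values are invented (the paper derives its rates from the orthodox theory), and the steady state is P1* = G_in / (G_in + G_out).

```python
# Hedged sketch, not the paper's model: forward-Euler integration of
# dP1/dt = G_in * (1 - P1) - G_out * P1 for a two-state charge system.
# Asymmetric rates G_in != G_out give a tunable steady-state occupation.

def evolve(p1, g_in, g_out, dt, steps):
    """Integrate the two-state master equation with the forward Euler method."""
    for _ in range(steps):
        p1 += dt * (g_in * (1.0 - p1) - g_out * p1)
    return p1

g_in, g_out = 5.0, 1.0               # asymmetric tunnelling rates (arbitrary units)
p_inf = evolve(0.0, g_in, g_out, dt=1e-3, steps=20000)
print(round(p_inf, 3))               # approaches g_in / (g_in + g_out)
```

The relaxation time constant is 1 / (G_in + G_out), which is the kind of quantity that separates the ns-scale writing times from the ms-scale retention times quoted in the abstract.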
Digging into acceptor splice site prediction: an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
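The paper's exact wrapper method is not given in the abstract, so the following is only a generic illustration of the wrapper idea it belongs to: candidate feature subsets are scored by the classifier itself. Here a greedy forward search adds, at each step, the feature that most improves leave-one-out accuracy of a 1-nearest-neighbour classifier; the data and the choice of classifier are invented.

```python
# Hedged sketch of a generic wrapper feature-selection loop (not the paper's
# method): greedy forward selection scored by leave-one-out 1-NN accuracy.

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of 1-NN restricted to the chosen features."""
    if not feats:
        return 0.0
    hits = 0
    for i in range(len(X)):
        best = min((j for j in range(len(X)) if j != i),
                   key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in feats))
        hits += y[best] == y[i]
    return hits / len(X)

def forward_select(X, y):
    """Add features one at a time while the wrapper score keeps improving."""
    chosen, score = [], 0.0
    while True:
        cands = [f for f in range(len(X[0])) if f not in chosen]
        if not cands:
            return chosen
        best = max(cands, key=lambda f: loo_accuracy(X, y, chosen + [f]))
        new = loo_accuracy(X, y, chosen + [best])
        if new <= score:
            return chosen
        chosen, score = chosen + [best], new

# Toy data: feature 0 is informative, feature 1 is noise (invented values).
X = [(0.0, 0.9), (0.1, 0.1), (0.2, 0.8), (1.0, 0.2), (1.1, 0.9), (0.9, 0.1)]
y = [0, 0, 0, 1, 1, 1]
print(forward_select(X, y))   # keeps only the informative feature
```

In the paper's iterative procedure, such selection rounds alternate with feature construction steps, which is how a derived feature like AG-scanning can emerge.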
Different Approaches to Community Evolution Prediction in Blogosphere
Predicting the future direction of community evolution is a problem with high
theoretical and practical significance. It makes it possible to determine which
characteristics of communities matter for their future behaviour. Knowledge
about the probable future trajectory of a community aids decisions about
investing in contact with members of a given community and about carrying out
actions to achieve a key position in it. It also helps to determine effective
ways of forming opinions or to protect group participants against such
activities. In the paper, a new approach to group identification and prediction
of future events is presented, together with a comparison to an existing
method. The experiments performed demonstrate a high quality of prediction
results. Comparison to previous studies shows that using many measures to
describe the group profile, and consequently as classifier input, can improve
predictions. Comment: SNAA 2013 at ASONAM 2013, IEEE Computer Society
On the Construction of Human-Automation Interfaces by Formal Abstraction
In this paper we present a formal methodology and an algorithmic procedure for constructing human-automation interfaces and corresponding user manuals. Our focus is the information provided to the user about the behavior of the underlying machine, rather than the graphical and layout features of the interface itself. Our approach involves a systematic reduction of the behavioral model of the machine, as well as a systematic abstraction of the information displayed in the interface. This reduction procedure satisfies two requirements. First, the interface must be correct so as not to cause mode confusion that may lead the user to perform incorrect actions. Second, the interface must be as simple as possible and not include any unnecessary information. The algorithm for generating such interfaces can be automated, and a preliminary software system for its implementation has been developed.
Preceding rule induction with instance reduction methods
A new prepruning technique for rule induction is presented which applies instance reduction before rule induction. An empirical evaluation records the predictive accuracy and size of rule sets generated from 24 datasets from the UCI Machine Learning Repository. Three instance reduction algorithms (Edited Nearest Neighbour, AllKnn and DROP5) are compared. Each one is used to reduce the size of the training set prior to inducing a set of rules using Clark and Boswell's modification of CN2. A hybrid instance reduction algorithm (comprising AllKnn and DROP5) is also tested. For most of the datasets, pruning the training set using ENN, AllKnn or the hybrid significantly reduces the number of rules generated by CN2 without adversely affecting the predictive performance. The hybrid achieves the highest average predictive accuracy.
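Of the three instance-reduction methods compared, Edited Nearest Neighbour is the simplest to state: discard every training instance whose label disagrees with the majority label of its k nearest neighbours. The sketch below is a plain implementation of that rule on invented 1-D data; it is not the paper's code, and real use would pair it with a rule inducer such as CN2.

```python
# Hedged sketch of Edited Nearest Neighbour (ENN) instance reduction:
# keep an instance only if its k nearest neighbours agree with its label.
from collections import Counter

def enn(X, y, k=3):
    """Return the indices of the instances that survive ENN editing."""
    keep = []
    for i in range(len(X)):
        neigh = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(X[i], X[j])))[:k]
        majority = Counter(y[j] for j in neigh).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return keep

# Toy 1-D data with one mislabelled point at index 2 (invented values).
X = [(0.0,), (0.1,), (0.2,), (0.3,), (1.0,), (1.1,), (1.2,)]
y = [0, 0, 1, 0, 1, 1, 1]
print(enn(X, y))   # the mislabelled point is edited out
```

Editing out such noisy instances before induction is exactly why the paper observes smaller rule sets from CN2 without a loss of accuracy.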
Building cloud applications for challenged networks
Cloud computing has seen vast advancements and uptake in many parts of the world. However, many of the design patterns and deployment models are not very suitable for locations with challenged networks, such as countries with no nearby datacenters. This paper describes the problem and discusses the options available for such locations, focusing specifically on community clouds as a short-term solution. The paper highlights the impact of recent trends in the development of cloud applications and how changing these could better help deployment in challenged networks. The paper also outlines the consequent challenges in bridging different cloud deployments, also known as cross-cloud computing.
Fairness in Algorithmic Decision Making: An Excursion Through the Lens of Causality
As virtually all aspects of our lives are increasingly impacted by
algorithmic decision making systems, it is incumbent upon us as a society to
ensure such systems do not become instruments of unfair discrimination on the
basis of gender, race, ethnicity, religion, etc. We consider the problem of
determining whether the decisions made by such systems are discriminatory,
through the lens of causal models. We introduce two definitions of group
fairness grounded in causality: fair on average causal effect (FACE), and fair
on average causal effect on the treated (FACT). We use the Rubin-Neyman
potential outcomes framework for the analysis of cause-effect relationships to
robustly estimate FACE and FACT. We demonstrate the effectiveness of our
proposed approach on synthetic data. Our analyses of two real-world data sets,
the Adult income data set from the UCI repository (with gender as the protected
attribute), and the NYC Stop and Frisk data set (with race as the protected
attribute), show that the evidence of discrimination obtained by FACE and FACT,
or lack thereof, is often in agreement with the findings from other studies. We
further show that FACT, being somewhat more nuanced compared to FACE, can yield
findings of discrimination that differ from those obtained using FACE. Comment: 7 pages, 2 figures, 2 tables. To appear in Proceedings of the International Conference on World Wide Web (WWW), 201
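The abstract defines FACE as a fairness criterion on the average causal effect of the protected attribute. As a minimal sketch only, under the Rubin-Neyman potential-outcomes framework and the strong extra assumption that the protected attribute is unconfounded with the outcome, the average causal effect E[Y(1)] - E[Y(0)] reduces to a difference of group means; the data below are invented, and the paper's actual estimators are more robust than this.

```python
# Hedged sketch, not the authors' estimator: a FACE-style quantity as a
# difference of mean outcomes across the protected attribute A, valid only
# under an assumed absence of confounding. All numbers are invented.
from statistics import mean

def average_causal_effect(a, y):
    """Difference in mean outcome between the A=1 and A=0 groups."""
    y1 = [yi for ai, yi in zip(a, y) if ai == 1]
    y0 = [yi for ai, yi in zip(a, y) if ai == 0]
    return mean(y1) - mean(y0)

a = [1, 1, 1, 0, 0, 0]               # protected attribute (e.g. gender, coded)
y = [1, 1, 0, 1, 0, 0]               # binary decision outcome
print(average_causal_effect(a, y))   # 2/3 - 1/3
```

A value near zero would indicate fairness on average causal effect; FACT restricts the same comparison to the treated group, which is why the abstract describes it as more nuanced.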
Assisted Diagnosis of Parkinsonism Based on the Striatal Morphology
Parkinsonism is a clinical syndrome characterized by the progressive loss of striatal dopamine. Its diagnosis is usually corroborated by neuroimaging data such as DaTSCAN neuroimages, which allow visualizing the possible dopamine deficiency. During the last decade, a number of computer systems have been proposed to automatically analyze DaTSCAN neuroimages, eliminating the subjectivity inherent in the visual examination of the data. In this work, we propose a computer system based on machine learning to separate Parkinsonian patients and control subjects using the size and shape of the striatal region, modeled from DaTSCAN data. First, an algorithm based on adaptive thresholding is used to parcel the striatum. This region is then divided into two according to the brain hemisphere division and characterized with 152 measures, extracted from the volume and its three possible 2-dimensional projections. Afterwards, the Bhattacharyya distance is used to discard the least discriminative measures and, finally, the neuroimage category is estimated by means of a Support Vector Machine classifier. This method was evaluated using a dataset with 189 DaTSCAN neuroimages, obtaining an accuracy rate over 94%. This rate outperforms those obtained by previous approaches that use the intensity of each striatal voxel as a feature. This work was supported by the MINECO/FEDER under the TEC2015-64718-R project, the Ministry of Economy, Innovation, Science and Employment of the Junta de Andalucía under the P11-TIC-7103 Excellence Project, and the Vice-Rectorate of Research and Knowledge Transfer of the University of Granada.
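The feature-screening step of the pipeline can be made concrete. The sketch below ranks features by the Bhattacharyya distance between their class-conditional distributions, under the simplifying assumption that each measure is univariate Gaussian per class; the closed form used is the standard Gaussian one, and the subject data are invented, not the paper's 152 striatal measures.

```python
# Hedged sketch of the screening step only: rank features by the Bhattacharyya
# distance between two assumed univariate Gaussian class distributions, then
# keep the most discriminative ones before classification. Data are invented.
from math import log
from statistics import mean, pvariance

def bhattacharyya(x0, x1):
    """Bhattacharyya distance between two 1-D Gaussian class samples."""
    m0, m1 = mean(x0), mean(x1)
    v0, v1 = pvariance(x0), pvariance(x1)
    return (0.25 * log(0.25 * (v0 / v1 + v1 / v0 + 2))
            + 0.25 * (m0 - m1) ** 2 / (v0 + v1))

def rank_features(X0, X1):
    """Sort feature indices, most discriminative first."""
    d = [bhattacharyya([r[f] for r in X0], [r[f] for r in X1])
         for f in range(len(X0[0]))]
    return sorted(range(len(d)), key=lambda f: d[f], reverse=True)

# Toy measures per subject: feature 0 separates the classes, feature 1 does not.
X_control = [(5.0, 1.0), (5.2, 1.4), (4.8, 1.2)]
X_patient = [(2.0, 1.1), (2.2, 1.3), (1.8, 1.2)]
print(rank_features(X_control, X_patient))   # feature 0 ranks first
```

The surviving features would then be fed to a Support Vector Machine, as the abstract describes; the filter step and the classifier are independent, which keeps the 152-measure space tractable.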
The identification of informative genes from multiple datasets with increasing complexity
Background
In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selecting a Bayesian classifier with an appropriate level of complexity by evaluating predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing model complexity; (4) functional analysis of the informative genes.
Results
In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative ones. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004), since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets of increasing complexity, while additionally modelling the interactions between genes.
Conclusions
We show that Bayesian networks derived from simpler controlled systems perform better than those trained on datasets from more complex biological systems. Furthermore, we show that genes that are highly predictive and consistent across independent datasets, from the pool of differentially expressed genes, are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.
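Step (2) of the framework, choosing model complexity by predictive performance on an independent dataset, can be illustrated without the paper's Bayesian machinery. In this invented sketch, "complexity" is simply the number of leading features a nearest-centroid classifier uses; candidates are trained on one toy dataset and the complexity that scores best on an independent one is kept.

```python
# Hedged sketch of complexity selection on independent data (invented example,
# not the paper's Bayesian classifiers): train candidate models of increasing
# complexity c on one dataset, keep the c that predicts best on another.

def fit(X, y, c):
    """Nearest-centroid model restricted to the first c features."""
    cents = {}
    for lab in set(y):
        rows = [x[:c] for x, l in zip(X, y) if l == lab]
        cents[lab] = [sum(col) / len(col) for col in zip(*rows)]
    return cents

def accuracy(cents, X, y):
    """Fraction of instances assigned to their true class."""
    pred = [min(cents, key=lambda l: sum((a - b) ** 2
                                         for a, b in zip(x, cents[l])))
            for x in X]
    return sum(p == t for p, t in zip(pred, y)) / len(y)

# Feature 0 generalises across datasets; feature 1 only fits the training set.
X_train = [(0.0, 0.0), (0.2, 0.2), (1.0, 1.0), (1.2, 1.2)]
y_train = [0, 0, 1, 1]
X_indep = [(0.1, 1.2), (1.1, 0.0)]    # independent dataset, feature 1 flipped
y_indep = [0, 1]

best_c = max((1, 2), key=lambda c: accuracy(fit(X_train, y_train, c),
                                            [x[:c] for x in X_indep], y_indep))
print(best_c)   # the simpler model wins on independent data
```

The outcome mirrors the paper's conclusion: the model that relies only on what transfers between systems beats the more complex one that has overfitted to training-set peculiarities.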