The Effect of Class Noise on Continuous Test Case Selection: A Controlled Experiment on Industrial Data
Continuous integration and testing produce a large amount of data about defects in code revisions, which can be used to train a predictive learner to select an effective subset of test suites. One challenge in using predictive learners is noise in the training data, which often degrades classification performance. This study examines the impact of one type of noise, called class noise, on a learner's ability to select test cases. Understanding the impact of class noise on the performance of a learner for test case selection would help testers decide on the appropriateness of different noise-handling strategies. For this purpose, we design and implement a controlled experiment on an industrial data set to measure the impact of class noise, at six different levels, on the predictive performance of a learner. We measure learning performance using the Precision, Recall, F-score, and Matthews Correlation Coefficient (MCC) metrics. The results show a statistically significant relationship between class noise and the learner's performance for test case selection. In particular, a significant difference was found between the three performance measures (Precision, F-score, and MCC) at all six noise levels and the 0% level, whereas a similar relationship between Recall and class noise was found only at levels above 30%. We conclude that higher class-noise ratios lead to more tests being missed in the predicted subset of the test suite, and that the rate of false alarms increases when the class-noise ratio exceeds 30%.
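The experimental design above can be sketched as follows: flip a fraction of the training labels to inject class noise, retrain, and record the four metrics at each noise level. The dataset, the random-forest learner, and the noise levels are illustrative placeholders, not the paper's industrial setup.

```python
# Sketch of a class-noise experiment, assuming a binary "test fails / passes"
# label; dataset, learner, and noise levels are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef)
from sklearn.model_selection import train_test_split

def flip_labels(y, ratio, rng):
    """Inject class noise by flipping a fraction of the training labels."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(ratio * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for ratio in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:   # six noise levels
    y_noisy = flip_labels(y_tr, ratio, rng)     # noise only in training data
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_noisy)
    pred = clf.predict(X_te)                    # evaluated on clean labels
    print(f"noise={ratio:.0%}  P={precision_score(y_te, pred):.2f}  "
          f"R={recall_score(y_te, pred):.2f}  F={f1_score(y_te, pred):.2f}  "
          f"MCC={matthews_corrcoef(y_te, pred):.2f}")
```

Evaluating on clean test labels isolates the effect of training-set noise, which mirrors the controlled nature of the experiment.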
Evaluating Nuclei Concentration in Amyloid Fibrillation Reactions Using Back-Calculation Approach
Background: In spite of our extensive knowledge of the more than 20 proteins associated with different amyloid diseases, we do not know how amyloid toxicity occurs or how to block its action. Recent contradictory reports suggest that the fibrils and/or their oligomer precursors cause toxicity. An estimate of their temporal concentration may broaden understanding of the amyloid aggregation process. Methodology/Principal Findings: Assuming that conversion of folded protein to fibril is initiated by a nucleation event, we back-calculate the distribution of nuclei concentration. The temporal in vitro concentration of nuclei for the model hormone, recombinant human insulin, is estimated to be in the picomolar range. This is a conservative estimate, since the back-calculation method is likely to overestimate the nuclei concentration: it does not take into consideration fibril fragmentation, which would lower the number of nuclei. Conclusions: Because of their propensity to form aggregates (non-ordered) and fibrils (ordered), this very low concentration could explain the difficulty in isolating oligomers or nuclei and blocking their toxicity, and the long onset time of amyloid diseases.
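The core arithmetic of a back-calculation of this kind can be illustrated in a few lines: if each fibril is assumed to grow from a single nucleus, the nuclei (fibril-number) concentration is the converted monomer concentration divided by the mean number of monomers per fibril. The numbers below are illustrative assumptions, not the paper's insulin data or its actual model.

```python
# Toy back-calculation of nuclei concentration, assuming each fibril grows
# from exactly one nucleus. All numeric inputs are illustrative.

def nuclei_concentration(fibril_monomer_molar, avg_monomers_per_fibril):
    """Nuclei (fibril-number) concentration in mol/L: monomer concentration
    converted into fibrils, divided by mean monomers per fibril."""
    return fibril_monomer_molar / avg_monomers_per_fibril

# e.g. 100 uM of monomer converted into fibrils averaging 1e8 monomers each
c = nuclei_concentration(100e-6, 1e8)
print(c)  # on the order of 1e-12 M, i.e. the picomolar range
```

Note that fragmentation would split fibrils without creating new nuclei, which is why ignoring it overestimates the nuclei count, as the abstract states.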
Ensemble of a subset of kNN classifiers
Combining multiple classifiers, known as ensemble learning, can substantially improve the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of subsets of kNN classifiers, ESkNN, for the classification task, built in two steps. First, we choose classifiers based on their individual performance, using out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets, with their original features and some added non-informative features, to evaluate our method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest, and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
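The two-step idea can be sketched as: build kNN models on random feature subsets, rank them by out-of-sample accuracy, then grow the ensemble greedily from the best model while majority-vote accuracy on a validation set keeps improving. Subset sizes, the number of candidates, and tie handling below are illustrative choices, not the authors' exact algorithm.

```python
# Minimal sketch of an ESkNN-style ensemble; details are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

# Step 1: rank candidate kNN models, each trained on a random feature subset.
candidates = []
for _ in range(30):
    feats = rng.choice(X.shape[1], size=5, replace=False)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, feats], y_tr)
    acc = knn.score(X_val[:, feats], y_val)      # out-of-sample accuracy
    candidates.append((acc, feats, knn))
candidates.sort(key=lambda t: t[0], reverse=True)

# Step 2: add models best-first while majority-vote validation accuracy improves.
def vote(members, Xs):
    preds = np.array([m.predict(Xs[:, f]) for _, f, m in members])
    return (preds.mean(axis=0) >= 0.5).astype(int)

ensemble = [candidates[0]]
best = np.mean(vote(ensemble, X_val) == y_val)
for cand in candidates[1:]:
    acc = np.mean(vote(ensemble + [cand], X_val) == y_val)
    if acc > best:
        ensemble.append(cand)
        best = acc

print("test accuracy:", np.mean(vote(ensemble, X_te) == y_te))
```

The random feature subsets are what give the selection step leverage against non-informative features: subsets dominated by noise features score poorly out-of-sample and are never added.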
Graph Perturbation as Noise Graph Addition: A New Perspective for Graph Anonymization
Different types of data privacy techniques have been applied to graphs and social networks. They have been used under different assumptions about intruders' knowledge, i.e., different assumptions about what can lead to disclosure. The analysis of the different methods is also guided by how data protection techniques influence the analysis of the data, i.e., information loss or data utility.
One of the techniques proposed for graphs is graph perturbation. Several algorithms have been proposed for this purpose. They proceed by adding or removing edges, although some also consider adding and removing nodes.
In this paper we propose to study these graph perturbation techniques from a different perspective. Following the model of standard database perturbation as noise addition, we propose to study graph perturbation as noise graph addition. We believe that changing the perspective on graph sanitization in this direction will make it possible to study the properties of perturbed graphs in a more systematic way.
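One concrete way to read "noise graph addition" is as a symmetric difference on edge sets: sample a random noise graph N and XOR its edges with those of G, so edges present in both are removed and edges only in N are added. The Erdős–Rényi-style sampling below is an illustrative choice.

```python
# Minimal sketch of graph perturbation as noise graph addition: the
# perturbed graph is G XOR N on edges. The noise model is an assumption.
import itertools
import random

def random_noise_graph(nodes, p, rng):
    """Erdos-Renyi style noise graph: include each possible edge with prob p."""
    return {e for e in itertools.combinations(sorted(nodes), 2)
            if rng.random() < p}

def add_noise_graph(edges, noise_edges):
    """Perturbed edge set = symmetric difference of G's edges and N's edges."""
    return edges ^ noise_edges

rng = random.Random(0)
G = {(1, 2), (2, 3), (3, 4)}                       # edges in canonical order
N = random_noise_graph({1, 2, 3, 4}, p=0.3, rng=rng)
print(add_noise_graph(G, N))
```

Framing edge addition and deletion as one XOR operation is what lets the perturbation be analyzed through the distribution of the single noise graph N, analogous to additive noise on numerical databases.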
A synthetic data generator for online social network graphs
Two of the difficulties for data analysts of online social networks are (1) the public availability of data and (2) respecting the privacy of the users. One possible solution to both of these problems is to use synthetically generated data. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, correlations, and so on. In the following work, we present and validate an approach for populating a graph topology with synthetic data that approximates an online social network. The empirical tests confirm that our approach generates a dataset which is both diverse and a good fit to the target requirements, with realistic modeling of noise and fitting to communities. A good match is obtained between the generated data and the target profiles and distributions, which is competitive with other state-of-the-art methods. The data generator is also highly configurable, with a sophisticated control parameter set for different "similarity/diversity" levels. This work is partially funded by the Spanish MEC (project TIN2013-49814-EXP).
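The core of such a populator can be sketched as: given a community partition of the topology, sample each node's attribute from its community's target profile, with a small probability of an off-profile value to model realistic noise. The community partition, profile names, and noise rate below are hypothetical, not the paper's generator.

```python
# Minimal sketch of populating a graph topology with synthetic attributes,
# fitted to communities with a controlled noise rate. All names/values here
# are illustrative assumptions.
import random

random.seed(0)
communities = {n: n % 3 for n in range(12)}             # toy 3-community split
profiles = {0: ["music"], 1: ["sports"], 2: ["films"]}  # per-community profiles
NOISE = 0.1                                             # off-profile probability

def populate(communities, profiles, noise):
    all_values = [v for vs in profiles.values() for v in vs]
    data = {}
    for node, com in communities.items():
        if random.random() < noise:
            data[node] = random.choice(all_values)       # realistic noise
        else:
            data[node] = random.choice(profiles[com])    # fit community profile
    return data

print(populate(communities, profiles, NOISE))
```

The `noise` parameter is one example of the kind of control knob such a generator exposes: raising it trades fit to the target profiles for diversity.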