Improved Error Bounds Based on Worst Likely Assignments
Error bounds based on worst likely assignments use permutation tests to
validate classifiers. Worst likely assignments can produce effective bounds
even for data sets with 100 or fewer training examples. This paper introduces a
statistic for use in the permutation tests of worst likely assignments that
improves error bounds, especially for accurate classifiers, which are typically
the classifiers of interest.
Comment: IJCNN 201
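As a rough illustration of the permutation-test idea this abstract refers to, the sketch below runs a generic label-permutation test on a classifier's held-out predictions. It is not the worst-likely-assignments procedure or the statistic the paper introduces; the function names, the add-one correction, and the default of 1000 permutations are assumptions made for this example.

```python
import random


def error_rate(predicted, actual):
    """Fraction of positions where the prediction differs from the label."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


def permutation_p_value(predicted, actual, n_permutations=1000, seed=0):
    """Estimate how often randomly shuffled labels yield an error rate
    at least as low as the one observed on the real labels."""
    rng = random.Random(seed)
    observed = error_rate(predicted, actual)
    shuffled = list(actual)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if error_rate(predicted, shuffled) <= observed:
            at_least_as_good += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (at_least_as_good + 1) / (n_permutations + 1)
```

A small p-value suggests the classifier's agreement with the labels is unlikely to arise by chance; turning such tests into error bounds requires the additional machinery described in the paper.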
Network Model Selection for Task-Focused Attributed Network Inference
Networks are models representing relationships between entities. Often these
relationships are explicitly given, or we must learn a representation which
generalizes and predicts observed behavior in underlying individual data (e.g.
attributes or labels). Whether given or inferred, choosing the best
representation affects subsequent tasks and questions on the network. This work
focuses on model selection to evaluate network representations from data, with
an emphasis on fundamental predictive tasks on networks. We present a modular
methodology using general, interpretable network models, task neighborhood
functions found across domains, and several criteria for robust model
selection. We demonstrate our methodology on three online user activity
datasets and show that network model selection for the appropriate network task
vs. an alternate task increases performance by an order of magnitude in our
experiments.
Analysis of group evolution prediction in complex networks
In a world in which acceptance by and identification with social
communities are highly desired, the ability to predict the evolution of groups
over time appears to be a vital but very complex research problem. Therefore, we
propose a new, adaptable, generic, and multi-stage method for Group Evolution
Prediction (GEP) in complex networks that facilitates reasoning about the
future states of recently discovered groups. The precise GEP modularity
enabled us to carry out extensive and versatile empirical studies on many
real-world complex / social networks to analyze the impact of numerous setups
and parameters like time window type and size, group detection method,
evolution chain length, prediction models, etc. Additionally, many new
predictive features reflecting the group state at a given time have been
identified and tested. Some other research problems like enriching learning
evolution chains with external data have been analyzed as well.
Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk
Background. Test resources are usually limited and therefore it is often not
possible to completely test an application before a release. To cope with the
problem of scarce resources, development teams can apply defect prediction to
identify fault-prone code regions. However, defect prediction tends to have low
precision in cross-project prediction scenarios.
Aims. We take an inverse view on defect prediction and aim to identify
methods that can be deferred when testing because they contain hardly any
faults due to their code being "trivial". We expect that characteristics of
such methods might be project-independent, so that our approach could improve
cross-project predictions.
Method. We compute code metrics and apply association rule mining to create
rules for identifying methods with low fault risk. We conduct an empirical
study to assess our approach with six Java open-source projects containing
precise fault data at the method level.
Results. Our results show that inverse defect prediction identifies approx.
32-44% of a project's methods as having a low fault risk; on average, they
are about six times less likely to contain a fault than other methods. In
cross-project predictions with larger, more diversified training sets,
identified methods are even eleven times less likely to contain a fault.
Conclusions. Inverse defect prediction supports the efficient allocation of
test resources by identifying methods that can be given lower priority in
testing activities, and it is well suited to cross-project prediction
scenarios.
Comment: Submitted to PeerJ C
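The rule-based approach in the Method paragraph can be illustrated by computing the support and confidence of a single candidate rule of the form "antecedent implies low fault risk". This is a minimal sketch under assumptions: the data layout (a set of metric-derived items plus a fault flag per method), the item names, and the function name `rule_stats` are hypothetical and do not come from the paper.

```python
def rule_stats(methods, antecedent):
    """Support and confidence of the rule: antecedent -> fault-free.

    methods: list of (set_of_items, is_faulty) pairs (hypothetical layout).
    antecedent: set of items that must all be present in a method.
    """
    # Fault flags of all methods that satisfy the antecedent.
    matching = [faulty for items, faulty in methods if antecedent <= items]
    if not matching:
        return 0.0, 0.0
    fault_free = sum(1 for faulty in matching if not faulty)
    support = fault_free / len(methods)      # antecedent and fault-free together
    confidence = fault_free / len(matching)  # fault-free given the antecedent
    return support, confidence


# Hypothetical example data: four methods described by illustrative items.
example_methods = [
    ({"short", "no_branches"}, False),
    ({"short"}, False),
    ({"short", "no_branches"}, True),
    ({"long"}, True),
]
support, confidence = rule_stats(example_methods, {"short"})
```

A full association rule miner would enumerate many candidate antecedents and keep only rules above chosen support and confidence thresholds; this sketch evaluates one rule only.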
Classification in Networked Data: A Toolkit and a Univariate Case Study
This paper is about classifying entities that are interlinked with entities for which the class is
known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked
data, and a case-study of its application to networked data used in prior machine learning
research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier,
a relational classifier, and a collective inference procedure. Various existing node-centric
relational learning algorithms can be instantiated with appropriate choices for these components,
and new combinations of components realize new algorithms. The case study focuses on univariate
network classification, for which the only information used is the structure of class linkage in
the network (i.e., only links and some class labels). To our knowledge, no work previously has
evaluated systematically the power of class-linkage alone for classification in machine learning
benchmark data sets. The results demonstrate that very simple network-classification models perform
quite well—well enough that they should be used regularly as baseline classifiers for studies
of learning with networked data. The simplest method (which performs remarkably well) highlights
the close correspondence between several existing methods introduced for different purposes—that
is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study
also shows that there are two sets of techniques that are preferable in different situations, namely
when few versus many labels are known initially. We also demonstrate that link selection plays an
important role similar to traditional feature selection.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
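To make the univariate setting concrete, the sketch below predicts the class of each unlabeled node from the known labels of its direct neighbors by unweighted majority vote. It is a simplified, single-pass illustration of a relational-neighbor style classifier and not NetKit's API: the function name and data layout are assumptions, and no collective inference step is performed.

```python
from collections import Counter


def relational_neighbor_predict(edges, known_labels):
    """Predict labels for unlabeled nodes from their labeled neighbors.

    edges: iterable of (u, v) pairs describing an undirected network.
    known_labels: dict mapping some nodes to their known class.
    Returns a dict with a prediction for every unlabeled node that has
    at least one labeled neighbor.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    predictions = {}
    for node, adjacent in neighbors.items():
        if node in known_labels:
            continue
        # Only links and known class labels are used, no node attributes.
        votes = Counter(known_labels[n] for n in adjacent if n in known_labels)
        if votes:
            predictions[node] = votes.most_common(1)[0][0]
    return predictions
```

Nodes without any labeled neighbor receive no prediction here; an iterative collective inference procedure, as described in the abstract, would propagate estimates further through the network.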
TSML: A XML-based Format for Exchange of Training Samples for Pattern Recognition in Remote Sensing Images
The availability of large and complex data sets has shifted the focus of pattern recognition towards developing techniques that can efficiently handle these types of data sets. For example, Multiple Classifier Systems claim to reduce the error and complexity of classification by partitioning the data space and combining classifiers' predictions. However, generating several partitions and using them efficiently is not an easy task. Another difficulty is exchanging training data in different formats so that classifiers from different, heterogeneous systems can be combined. This paper presents a model and structure for training samples based on XML (eXtensible Markup Language) to facilitate partitioning and exchange among different image classification systems. The main contribution is to apply the flexibility of XML, which addresses interoperability and communication among heterogeneous systems, to partitioning data sets and to interchanging such sets among image processing and pattern recognition systems.
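The idea of exchanging and partitioning training samples through XML can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names used here (`trainingSet`, `sample`, `feature`, `class`) are hypothetical placeholders and do not reflect the actual TSML schema, which the abstract does not specify.

```python
import xml.etree.ElementTree as ET


def samples_to_xml(samples):
    """Serialize (class_name, feature_values) pairs to an XML string.

    The tag names are illustrative assumptions, not the TSML format.
    """
    root = ET.Element("trainingSet")
    for class_name, features in samples:
        sample = ET.SubElement(root, "sample", {"class": class_name})
        for value in features:
            ET.SubElement(sample, "feature").text = repr(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_partitions(xml_text):
    """Parse the XML back and partition the samples by class name."""
    partitions = {}
    for sample in ET.fromstring(xml_text).iter("sample"):
        features = [float(f.text) for f in sample.findall("feature")]
        partitions.setdefault(sample.get("class"), []).append(features)
    return partitions
```

Because the samples travel as plain XML text, a receiving system only needs an XML parser to rebuild the partitions, which is the interoperability benefit the abstract emphasizes.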