One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This situation constrains the learning of efficient classifiers, because the
class boundary must be defined using only knowledge of the positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data, the
algorithms used and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
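As a minimal sketch of the setting described in this abstract, the example below learns an acceptance region from positive-class samples only. The Gaussian model, the threshold `k`, and the function names are illustrative assumptions, not a method taken from the survey.

```python
import statistics

def fit_one_class(positives):
    """Estimate mean and standard deviation from positive-class samples only."""
    mu = statistics.fmean(positives)
    sigma = statistics.stdev(positives)
    return mu, sigma

def is_target(x, mu, sigma, k=3.0):
    """Accept x as positive if it lies within k standard deviations of the mean.

    k is a hypothetical tuning parameter; no negative examples are used to set it.
    """
    return abs(x - mu) <= k * sigma

# Training data contains the positive class only.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mu, sigma = fit_one_class(train)

print(is_target(10.05, mu, sigma))  # close to the training data
print(is_target(14.0, mu, sigma))   # far from the training data
```

The boundary here is fixed by the positive samples alone, which is the constraint the abstract highlights: there is no negative data against which to calibrate it.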
Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models
Background: In this paper we present the approaches and methods employed to
deal with a large-scale multi-label semantic indexing task of
biomedical papers. This work was mainly implemented within the context of the
BioASQ challenge of 2014. Methods: The main contribution of this work is a
multi-label ensemble method that incorporates a McNemar statistical
significance test in order to validate the combination of the constituent
machine learning algorithms. Some secondary contributions include a study on
the temporal aspects of the BioASQ corpus (observations apply also to the
BioASQ's super-set, the PubMed articles collection) and the proper adaptation
of the algorithms used to deal with this challenging classification task.
Results: The ensemble method we developed is compared to other approaches in
experimental scenarios with subsets of the BioASQ corpus giving positive
results. During the BioASQ 2014 challenge we obtained first place in
the first batch and third place in the two following batches. Our success in the
BioASQ challenge proved that a fully automated machine-learning approach, which
does not implement any heuristics and rule-based approaches, can be highly
competitive and outperform other approaches in similar challenging contexts.
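The abstract does not say how the McNemar test is wired into the ensemble, so the following is only a sketch of the test itself, comparing two classifiers on the same labelled examples. The function name, the continuity-corrected form of the statistic, and the toy predictions are assumptions made for illustration.

```python
import math

def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test for two classifiers on the same examples.

    Returns (statistic, p_value). Only the discordant examples matter:
    b = A correct and B wrong, c = A wrong and B correct.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy data: 20 examples, all with true label 1.
y_true = [1] * 20
pred_a = [1] * 16 + [0] * 4            # A is wrong on the last 4
pred_b = [1] * 4 + [0] * 12 + [1] * 4  # B is wrong on the middle 12

stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(stat, p_value)
```

A small p-value suggests that the two classifiers make systematically different errors, which is the kind of evidence one might use when deciding whether combining them is justified.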
NeuroSVM: A Graphical User Interface for Identification of Liver Patients
Diagnosis of liver infection at a preliminary stage is important for better
treatment. In today's scenario, devices like sensors are used for detection of
infections. Accurate classification techniques are required for automatic
identification of disease samples. In this context, this study utilizes data
mining approaches for classification of liver patients from healthy
individuals. Four algorithms (Naive Bayes, Bagging, Random forest and SVM) were
implemented for classification using the R platform. Further, to improve the
accuracy of classification a hybrid NeuroSVM model was developed using SVM and
feed-forward artificial neural network (ANN). The hybrid model was tested for
its performance using statistical parameters like root mean square error (RMSE)
and mean absolute percentage error (MAPE). The model resulted in a prediction
accuracy of 98.83%. The results suggested that development of the hybrid model
improved the accuracy of prediction. To serve the medical community in the
prediction of liver disease among patients, a graphical user interface (GUI)
has been developed using R. The GUI is deployed as a package in a local
repository of the R platform for users to perform prediction.
Comment: 9 pages, 6 figures
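For readers unfamiliar with the two error measures named in this abstract, the sketch below computes RMSE and MAPE from their standard definitions. The study itself used R; Python and the sample values here are illustrative assumptions, and the numbers are not taken from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: sqrt of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Requires non-zero actual values."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Made-up values for illustration only.
actual = [2.0, 4.0, 5.0]
predicted = [1.0, 4.0, 7.0]

print(rmse(actual, predicted))
print(mape(actual, predicted))
```

RMSE is expressed in the units of the target variable, while MAPE is relative to the size of each actual value, so the two can rank models differently.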
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research.
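The bag formulation described in this abstract can be sketched as a data structure plus a rule that turns instance-level decisions into a bag label. The rule used below (a bag is positive if at least one instance is positive) is one widely used assumption and is not stated in the abstract; the instance classifier is a hypothetical stand-in.

```python
from typing import Callable, List, Sequence

Instance = Sequence[float]
Bag = List[Instance]

def bag_label(bag: Bag, instance_classifier: Callable[[Instance], bool]) -> int:
    """Label a bag 1 if any instance is classified positive, else 0.

    This encodes the assumption that one positive instance makes the bag positive.
    """
    return int(any(instance_classifier(x) for x in bag))

def toy_instance_classifier(x: Instance) -> bool:
    """Hypothetical instance-level rule: positive if the first feature exceeds 0.5."""
    return x[0] > 0.5

positive_bag = [[0.1, 0.3], [0.9, 0.2], [0.2, 0.8]]  # one instance triggers the rule
negative_bag = [[0.1, 0.3], [0.4, 0.9]]              # no instance triggers the rule

print(bag_label(positive_bag, toy_instance_classifier))
print(bag_label(negative_bag, toy_instance_classifier))
```

During training only the bag labels are observed, so which instance made a bag positive is unknown; this is the ambiguity of instance labels that the survey lists among its problem characteristics.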
Ensembles of Randomized Time Series Shapelets Provide Improved Accuracy while Reducing Computational Costs
Shapelets are discriminative time series subsequences that allow generation
of interpretable classification models, which provide faster and generally
better classification than the nearest neighbor approach. However, the shapelet
discovery process requires the evaluation of all possible subsequences of all
time series in the training set, making it extremely computation intensive.
Consequently, shapelet discovery for large time series datasets quickly becomes
intractable. A number of improvements have been proposed to reduce the training
time. These techniques use approximation or discretization and often lead to
reduced classification accuracy compared to the exact method.
We propose the use of ensembles of shapelet-based classifiers obtained
using random sampling of the shapelet candidates. Using random sampling reduces
the number of evaluated candidates and consequently the required computational
cost, while the classification accuracy of the resulting models is not
significantly different from that of the exact algorithm. The combination of
randomized classifiers rectifies the inaccuracies of individual models because
of the diversity of the solutions. Based on the experiments performed, it is
shown that the proposed approach of using an ensemble of inexpensive
classifiers provides better classification accuracy compared to the exact
method at a significantly lower computational cost.
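The two ingredients described in this abstract, the distance from a shapelet to a series and random sampling of candidates in place of exhaustive enumeration, can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, candidates are not normalized, and the quality evaluation and ensemble voting steps are omitted.

```python
import math
import random

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any same-length window."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        best = min(best, d)
    return math.sqrt(best)

def sample_candidates(train_series, length, n_candidates, rng):
    """Draw random subsequences instead of enumerating every possible one.

    Assumes every training series has at least `length` points.
    """
    candidates = []
    for _ in range(n_candidates):
        s = rng.choice(train_series)
        start = rng.randrange(len(s) - length + 1)
        candidates.append(s[start:start + length])
    return candidates

series_a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
series_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

bump = [1.0, 2.0, 1.0]
print(shapelet_distance(series_a, bump))  # the bump occurs exactly in series_a
print(shapelet_distance(series_b, bump))  # the bump is absent from the flat series

rng = random.Random(0)  # fixed seed so the sample is repeatable
candidates = sample_candidates([series_a, series_b], length=3, n_candidates=5, rng=rng)
print(len(candidates))
```

Evaluating only `n_candidates` subsequences per classifier is what lowers the cost; each ensemble member would draw its own sample, which is where the diversity mentioned in the abstract comes from.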
How Unique is Your .onion? An Analysis of the Fingerprintability of Tor Onion Services
Recent studies have shown that Tor onion (hidden) service websites are
particularly vulnerable to website fingerprinting attacks due to their limited
number and sensitive nature. In this work we present a multi-level feature
analysis of onion site fingerprintability, considering three state-of-the-art
website fingerprinting methods and 482 Tor onion services, making this the
largest analysis of this kind completed on onion services to date.
Prior studies typically report average performance results for a given
website fingerprinting method or countermeasure. We investigate which sites are
more or less vulnerable to fingerprinting and which features make them so. We
find that there is a high variability in the rate at which sites are classified
(and misclassified) by these attacks, implying that average performance figures
may not be informative of the risks that website fingerprinting attacks pose to
particular sites.
We analyze the features exploited by the different website fingerprinting
methods and discuss what makes onion service sites more or less easily
identifiable, in terms of both their traffic traces and their webpage
design. We study misclassifications to understand how onion service sites can
be redesigned to be less vulnerable to website fingerprinting attacks. Our
results also inform the design of website fingerprinting countermeasures and
their evaluation considering disparate impact across sites.
Comment: Accepted by ACM CCS 201