4,663 research outputs found

    One-Class Classification: Taxonomy of Study and Review of Techniques

    Full text link
    One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers by defining class boundary just with the knowledge of positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, algorithms used and the application domains applied. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of the OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research.Comment: 24 pages + 11 pages of references, 8 figure

    Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models

    Full text link
    Background: In this paper we present the approaches and methods employed in order to deal with a large scale multi-label semantic indexing task of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge of 2014. Methods: The main contribution of this work is a multi-label ensemble method that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Some secondary contributions include a study on the temporal aspects of the BioASQ corpus (observations apply also to the BioASQ's super-set, the PubMed articles collection) and the proper adaptation of the algorithms used to deal with this challenging classification task. Results: The ensemble method we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus giving positive results. During the BioASQ 2014 challenge we obtained the first place during the first batch and the third in the two following batches. Our success in the BioASQ challenge proved that a fully automated machine-learning approach, which does not implement any heuristics and rule-based approaches, can be highly competitive and outperform other approaches in similar challenging contexts

    NeuroSVM: A Graphical User Interface for Identification of Liver Patients

    Full text link
    Diagnosis of liver infection at preliminary stage is important for better treatment. In todays scenario devices like sensors are used for detection of infections. Accurate classification techniques are required for automatic identification of disease samples. In this context, this study utilizes data mining approaches for classification of liver patients from healthy individuals. Four algorithms (Naive Bayes, Bagging, Random forest and SVM) were implemented for classification using R platform. Further to improve the accuracy of classification a hybrid NeuroSVM model was developed using SVM and feed-forward artificial neural network (ANN). The hybrid model was tested for its performance using statistical parameters like root mean square error (RMSE) and mean absolute percentage error (MAPE). The model resulted in a prediction accuracy of 98.83%. The results suggested that development of hybrid model improved the accuracy of prediction. To serve the medicinal community for prediction of liver disease among patients, a graphical user interface (GUI) has been developed using R. The GUI is deployed as a package in local repository of R platform for users to perform prediction.Comment: 9 pages, 6 figure

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Full text link
    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research

    Ensembles of Randomized Time Series Shapelets Provide Improved Accuracy while Reducing Computational Costs

    Full text link
    Shapelets are discriminative time series subsequences that allow generation of interpretable classification models, which provide faster and generally better classification than the nearest neighbor approach. However, the shapelet discovery process requires the evaluation of all possible subsequences of all time series in the training set, making it extremely computation intensive. Consequently, shapelet discovery for large time series datasets quickly becomes intractable. A number of improvements have been proposed to reduce the training time. These techniques use approximation or discretization and often lead to reduced classification accuracy compared to the exact method. We are proposing the use of ensembles of shapelet-based classifiers obtained using random sampling of the shapelet candidates. Using random sampling reduces the number of evaluated candidates and consequently the required computational cost, while the classification accuracy of the resulting models is also not significantly different than that of the exact algorithm. The combination of randomized classifiers rectifies the inaccuracies of individual models because of the diversity of the solutions. Based on the experiments performed, it is shown that the proposed approach of using an ensemble of inexpensive classifiers provides better classification accuracy compared to the exact method at a significantly lesser computational cost

    How Unique is Your .onion? An Analysis of the Fingerprintability of Tor Onion Services

    Full text link
    Recent studies have shown that Tor onion (hidden) service websites are particularly vulnerable to website fingerprinting attacks due to their limited number and sensitive nature. In this work we present a multi-level feature analysis of onion site fingerprintability, considering three state-of-the-art website fingerprinting methods and 482 Tor onion services, making this the largest analysis of this kind completed on onion services to date. Prior studies typically report average performance results for a given website fingerprinting method or countermeasure. We investigate which sites are more or less vulnerable to fingerprinting and which features make them so. We find that there is a high variability in the rate at which sites are classified (and misclassified) by these attacks, implying that average performance figures may not be informative of the risks that website fingerprinting attacks pose to particular sites. We analyze the features exploited by the different website fingerprinting methods and discuss what makes onion service sites more or less easily identifiable, both in terms of their traffic traces as well as their webpage design. We study misclassifications to understand how onion service sites can be redesigned to be less vulnerable to website fingerprinting attacks. Our results also inform the design of website fingerprinting countermeasures and their evaluation considering disparate impact across sites.Comment: Accepted by ACM CCS 201
    • …
    corecore