16 research outputs found

    A Web Page Classifier Library Based on Random Image Content Analysis Using Deep Learning

    In this paper, we present a methodology and the corresponding Python library for the classification of webpages. Our method retrieves a fixed number of images from a given webpage and, based on them, assigns the webpage to one of a set of predefined classes with an associated probability. The library trains a random forest model built upon features extracted from the images by a pre-trained deep network. The implementation is tested by recognizing weapon-class webpages in a curated list of 3859 websites. The results show that the best way to classify a webpage into the studied classes is to assign the (weapon) class when the maximum probability, across all retrieved images, of an image belonging to that class exceeds a threshold. Further research explores whether the developed methodology can also be applied to image classification in healthcare applications.
    Comment: 4 pages, 3 figures. Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference. ACM, 201
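The decision rule described in this abstract can be illustrated with a minimal sketch. The function name `classify_page` and the default threshold of 0.5 are hypothetical and are not taken from the library itself; the per-image probabilities are assumed to come from some upstream image classifier.

```python
from typing import Sequence


def classify_page(image_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag a page when the maximum per-image class probability exceeds the threshold.

    image_probs: probability that each retrieved image belongs to the target
    class (for example, "weapon"). The threshold value is an illustrative
    assumption, not a value reported in the paper.
    """
    if not image_probs:
        # No images retrieved: nothing supports assigning the class.
        return False
    return max(image_probs) > threshold
```

Under this rule a single confident image is enough to flag the page, which matches the "maximum probability across all retrieved images" wording of the abstract.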

    Machine Learning for Internet Security: Malware Detection and Web Image Classification

    In today's fast-moving, Internet-driven world, new opportunities are emerging to take advantage of the latest technologies. However, this empowerment is available not only for good purposes but also for various questionable and criminal activities. The first part of the thesis addresses the problem of automatic malware detection. An unusual restriction applied to malware classification is a strict zero False Positive rate. To satisfy this restriction, a two-stage methodology is proposed. Because the features are nominal, the first stage uses an adaptation of the MinHash algorithm, which balances accuracy and running time. The second-stage classifier uses two Extreme Learning Machines (ELMs), each with a hyper-parameter adjusting the trade-off between coverage and the number of False Positives/Negatives. The final outputs include a third, "unknown" class, sacrificing some coverage to achieve a near-zero False Positive rate (2 out of 38,000 on the test set). The second half of the thesis explores web image classification for web content filtering. The training dataset inherits the properties of real web images: high variability, often weak clues to the website class, and a large amount of semantic noise. For the classification, a suitable image representation and a two-stage methodology are proposed. Images are represented by their local features, with the local feature descriptor being the smallest processing unit. In the first stage, the class probability density in the descriptor space is estimated with a random Vector Quantization. In the second stage, the classes of images are derived from their classified descriptors, in an image-to-class fashion. The approach achieves an average accuracy of 35% in a 10-class setting, with the accuracy for the "Adult" class exceeding 70%.
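To illustrate why MinHash suits nominal (set-valued) features, here is a minimal sketch of the standard MinHash similarity estimate. It is not the adaptation developed in the thesis; the helper names, the seeded BLAKE2b hashing scheme, and the example feature strings are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List


def _seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(features: Iterable[str], num_hashes: int = 128) -> List[int]:
    """For each seeded hash function, keep the minimum hash over a non-empty feature set."""
    items = list(features)
    return [min(_seeded_hash(seed, f) for f in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical nominal features of two samples (names are made up for illustration).
sample_a = {"imports:kernel32", "section:.text", "packer:none", "api:CreateFile"}
sample_b = {"imports:kernel32", "section:.text", "packer:none", "api:WriteFile"}
similarity = estimate_jaccard(minhash_signature(sample_a), minhash_signature(sample_b))
```

The two example sets share 3 of 5 distinct features, so the estimate should land near the true Jaccard similarity of 0.6; using more hash functions tightens the estimate at the cost of running time.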

    Fusing extreme learning machine with convolutional neural network


    Predicting Huntington’s Disease: Extreme Learning Machine with Missing Values

    Problems with incomplete data and missing values are common and important in real-world machine learning scenarios, yet they are often underrepresented in research. Healthcare data in particular tends to contain missing values, which must be handled properly; discarding incomplete samples is not an acceptable solution. The Extreme Learning Machine has demonstrated excellent performance in a variety of machine learning tasks, including situations with missing values. In this paper, we present an application that predicts the onset of Huntington’s disease several years in advance based on data from MRI brain scans. Experimental results show that such prediction is indeed realistic with reasonable accuracy, provided the missing values are handled with care. In particular, Multiple Imputation ELM achieves exceptional prediction accuracy.

    Practical Estimation of Mutual Information on Non-Euclidean Spaces

    Part 3: MAKE Privacy. International audience. In this paper, we propose to measure the impact of privacy and anonymization techniques by measuring the data loss between “before” and “after”. The proposed approach therefore focuses on data usability rather than on ensuring that the data is sufficiently anonymized. We use Mutual Information as the measurement criterion, and detail how we propose to measure Mutual Information over non-Euclidean data in practice, using two possible existing estimators. We test this approach on toy data to illustrate the effects of some well-known anonymization techniques on the proposed measure.
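The abstract does not name the two estimators it uses. As one possible illustration, the sketch below implements the widely used Kraskov-Stögbauer-Grassberger (KSG) nearest-neighbour estimator, which only needs pairwise distances and therefore accepts any user-supplied metric. The choice of KSG, the function names, the circular distance, and the toy angular data are all assumptions and are not taken from the paper.

```python
import math
import random
from typing import Callable, Sequence

EULER_GAMMA = 0.5772156649015329


def digamma_int(n: int) -> float:
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def ksg_mutual_information(xs: Sequence, ys: Sequence,
                           dist_x: Callable, dist_y: Callable, k: int = 3) -> float:
    """KSG (algorithm 1) estimate of I(X;Y) in nats from paired samples."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        dx = [dist_x(xs[i], xs[j]) for j in range(n) if j != i]
        dy = [dist_y(ys[i], ys[j]) for j in range(n) if j != i]
        # Joint distance is the max of the marginal distances; eps is the
        # distance to the k-th nearest neighbour in the joint space.
        eps = sorted(max(a, b) for a, b in zip(dx, dy))[k - 1]
        n_x = sum(d < eps for d in dx)  # marginal neighbours strictly within eps
        n_y = sum(d < eps for d in dy)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) + digamma_int(n) - total / n


def circular_distance(a: float, b: float) -> float:
    """Arc-length distance between two angles: a simple non-Euclidean metric."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


# Toy data (hypothetical): angles, a noisy copy of them, and unrelated angles.
rng = random.Random(0)
angles = [rng.uniform(0, 2 * math.pi) for _ in range(200)]
noisy = [(a + rng.gauss(0, 0.1)) % (2 * math.pi) for a in angles]
unrelated = [rng.uniform(0, 2 * math.pi) for _ in range(200)]

mi_dependent = ksg_mutual_information(angles, noisy, circular_distance, circular_distance)
mi_independent = ksg_mutual_information(angles, unrelated, circular_distance, circular_distance)
```

In a before/after comparison of the kind the abstract describes, a perturbation that preserves the relationship to the original data should keep the estimate high, while one that destroys it should push the estimate toward zero.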