9 research outputs found

    Active Learning for One-Class Classification Using Two One-Class Classifiers

    This paper introduces a novel, generic active learning method for one-class classification. Active learning methods play an important role in reducing the effort of manual labeling in machine learning. Although many active learning approaches have been proposed in recent years, most of them are restricted to binary or multi-class problems. One-class classifiers use samples from only one class, the so-called target class, during training and hence require special active learning strategies. The few strategies proposed for one-class classification are either limited to specific one-class classifiers or depend on particular assumptions about the dataset, such as class imbalance. The proposed method is based on two one-class classifiers, one for the desired target class and one for the so-called outlier class. This makes it possible to devise new query strategies, to reuse binary query strategies, and to define simple stopping criteria. Based on the new method, two query strategies are proposed. The experiments compare the proposed approach with known strategies on various datasets and show improved results in almost all situations. Comment: EUSIPCO 201
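As a rough illustration of how two one-class models could drive a binary-style query strategy, the sketch below fits one model per class and queries the unlabeled sample whose two scores are closest. The abstract does not describe the paper's actual query strategies, so everything here is an assumption: the centroid-distance "classifier", the margin rule, and the names `centroid`, `score`, and `select_query` are hypothetical stand-ins.

```python
import math

def centroid(points):
    """Fit a toy one-class model: the mean of the labeled samples (assumption)."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def score(model, x):
    """Higher score means x looks more like the class the model was fit on."""
    return -math.dist(model, tuple(x))

def select_query(target_labeled, outlier_labeled, unlabeled):
    """Pick the unlabeled sample on which the two models disagree least.

    This is plain uncertainty (margin) sampling, used here only as an
    example of a binary query strategy; it is not the paper's strategy.
    """
    target_model = centroid(target_labeled)
    outlier_model = centroid(outlier_labeled)
    return min(
        unlabeled,
        key=lambda x: abs(score(target_model, x) - score(outlier_model, x)),
    )

if __name__ == "__main__":
    targets = [(0.0, 0.0), (0.0, 1.0)]
    outliers = [(4.0, 0.0), (4.0, 1.0)]
    pool = [(0.2, 0.5), (2.1, 0.5), (3.9, 0.5)]
    print(select_query(targets, outliers, pool))
```

In this toy setup the sample lying roughly midway between the two class centroids is the one sent to the annotator, since neither model claims it clearly.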

    Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

    Clustering is one of the most prominent data mining techniques, grouping data into clusters based on distance measures. With the growth of high dimensional data such as microarray gene expression data, clustering runs into the problem that similarity between objects in the full dimensional space is often invalid, because the data contains attributes of different types. Grouping high dimensional data into clusters is therefore often inaccurate and may fall short of expectations when the dimensionality of the dataset is high. The problem is attracting considerable attention in research and development. To address the performance of clustering on high dimensional data, issues such as dimensionality reduction, redundancy elimination, subspace clustering, co-clustering, and data labeling for clusters need to be analyzed and improved. In this paper, we present a brief comparison of existing algorithms that focus mainly on clustering high dimensional data

    Biomining:-An Efficient Data Retrieval Tool for Bioinformatics to Avoid Redundant and Irrelevant Data Retrieval from Biological Databases

    Mining biological data is an emerging area at the intersection of data mining and bioinformatics. Bioinformaticians have been working on the research and development of computational methodologies and tools for expanding the use of biological, medical, behavioral, or health-related data. Data mining researchers have made substantial contributions to the development of models and algorithms that meet the challenges posed by bioinformatics research. Mining these databases tends to run into data quality issues such as anomalies and duplication. For biological data to be corrected, methods and tools must be developed. This paper proposes one such tool, called BIOMINING, which is designed to eliminate anomalies and redundancy in biological web content

    Content Based Web Page Re-Ranking Using Relevancy Algorithm

    The World Wide Web is a system of interlinked hypertext documents accessed via the Internet, and it plays a leading role in retrieving user-requested information from web resources. Search engines support this retrieval by crawling web content on different nodes and organizing it into result pages, so that users can select the required information by navigating through the result links. This strategy worked well in the past because the number of resources available for a given request was limited, and users could identify the relevant information directly from the search engine results. As the Internet has grown, resource sharing has grown with it, which has led to automated techniques for ranking web content. Different search engines use different techniques to rank search results for a user query, and this creates a business motivation to push one's own web resources into the top ranking positions. As competition and the number of web resources increase, ranking web content with respect to the user query becomes tedious and dynamic. The proposed work introduces a new approach that ranks relevant pages based on their content and keywords rather than on the keyword and page ranking provided by search engines. Search engine results are retrieved for the user query, and every result is analyzed individually based on its keywords and content. The user query is pre-processed to identify root words. The root words are used to construct a dictionary, which is extended with synonyms for the user query. The keywords and content words of each result page are preprocessed and compared against the dictionary. If a match is found, a particular weight is awarded for each word. Finally, the total relevancy of a link to the user request is computed by summing the weights of its keywords and content words. The results are then re-ranked in descending order of their weights and displayed
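The pipeline described in this abstract (root words, synonym dictionary, per-word weights, summed relevancy, descending sort) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the crude suffix-stripping `root` function stands in for a real stemmer, the `synonyms` mapping and the page fields `keywords` and `content` are hypothetical, and the weights 2.0 and 1.0 are illustrative values that the abstract does not specify.

```python
import re

def root(word):
    """Crude stand-in for a stemmer (assumption): strip a few common suffixes."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text, split it into alphabetic words, reduce each to its root."""
    return [root(w) for w in re.findall(r"[a-z]+", text.lower())]

def build_dictionary(query, synonyms):
    """Root words of the query plus the roots of their synonyms."""
    dictionary = set(tokenize(query))
    for word in list(dictionary):
        dictionary.update(root(s) for s in synonyms.get(word, ()))
    return dictionary

def relevancy(page, dictionary, keyword_weight=2.0, content_weight=1.0):
    """Sum a weight for every keyword or content word found in the dictionary."""
    kw = sum(keyword_weight for w in tokenize(page["keywords"]) if w in dictionary)
    ct = sum(content_weight for w in tokenize(page["content"]) if w in dictionary)
    return kw + ct

def rerank(pages, query, synonyms):
    """Return the pages sorted by total relevancy, highest first."""
    dictionary = build_dictionary(query, synonyms)
    return sorted(pages, key=lambda p: relevancy(p, dictionary), reverse=True)

if __name__ == "__main__":
    pages = [
        {"url": "a", "keywords": "cooking recipes", "content": "pasta recipes for dinner"},
        {"url": "b", "keywords": "car repair", "content": "fixing engines"},
        {"url": "c", "keywords": "car", "content": "cars and automobiles"},
    ]
    synonyms = {"car": ["automobile"], "repair": ["fixing"]}
    for page in rerank(pages, "car repairs", synonyms):
        print(page["url"])
```

Weighting keyword matches above content matches is one possible design choice; the abstract only says that "a particular weight" is awarded per matching word.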

    Choice of clustering methods

    Choosing a machine learning algorithm to solve a given task is an open problem. This study reviews numerous clustering algorithms and develops and trials a method for choosing a clustering algorithm to solve practical problems efficiently

    Predicción de abandonos y pagos en videojuegos Freemium para móviles - Prediction of Churn and Payments in Freemium Mobile Video Games

    73 pages. Master's thesis in Economics, Finance and Computing. Supervisor: Dr. Emilio Congregado Ramírez de Aguilera. The mobile gaming industry has grown rapidly, and companies in the sector devote great effort to increasing user retention and monetization. A fundamental input for decision making is the prediction of which players will churn and/or make payments within a game. The objective of this research is to predict churn and payments in a game by the company Genera Games. To this end, the Knowledge Discovery in Databases process was carried out, from data extraction to pattern recognition. The binary classification methods Logistic Regression, Random Forests, and Gradient Boosting were used. Computational processing was done using SQL, R, Python, and the statistical analysis software Stata. The results comprise predictions for different horizons (short, medium, and long term) and a comparison of the classification methods

    Perbandingan Metode Ensemble Random Forest Dengan Smote-Boosting Dan Smote-Bagging Pada Klasifikasi Data Mining Untuk Kelas Imbalance (Studi Kasus : Data Beasiswa Bidikmisi Tahun 2017 di Jawa Timur) - A Comparison Of The Ensemble Random Forest Methods With Smote-Boosting And Smote-Bagging On Data Mining Classification For Imbalance Class

    Data mining techniques in education have begun to grow along with advances in technology and the volume of data that educational database systems can store. Classification methods are used to group students into identified classes. In classification, however, class imbalance occurs often and becomes a problem: the training data of the majority class can far outnumber those of the minority class, so a classifier tends to predict the majority class (negative class) rather than the minority class (positive class). Almost all classifiers, including random forest, assume an even distribution across the observed classes. Random forest generally shows a large performance gain over CART and C4.5, achieves a better generalization error rate than AdaBoost, and is more robust to noise. However, like most other classification methods, random forest can produce suboptimal results on an imbalanced dataset. An alternative for improving accuracy on imbalanced classes is to use ensemble methods. Methods that have recently become popular are SMOTE-Boosting and SMOTE-Bagging, which combine a data-level algorithm, SMOTE, with an ensemble method. The Bidikmisi data used in the study have 10 nominal variables (X1-X10) and 1 ratio variable (X11). On the g-mean and AUC performance criteria for the imbalanced class (Y), the ensemble algorithms SMOTE-Bagging (g-mean = 33.13%, AUC = 52.12%) and SMOTE-Boosting (g-mean = 30.22%, AUC = 50.76%) tended to show better classification accuracy than AdaBoost.M2 (g-mean = 9.03%, AUC = 50.26%). The difference between SMOTE-Boosting and SMOTE-Bagging is very small. Both methods can be said to succeed reasonably well in combining the advantages of boosting and bagging with SMOTE. Whereas boosting and bagging affect the accuracy of random forest by focusing on all data classes, SMOTE alters the performance of random forest only on the minority class
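The abstract reports g-mean without defining it. A common definition, assumed here, is the geometric mean of the per-class recalls, which for two classes reduces to the square root of sensitivity times specificity. The sketch below uses that definition to show why g-mean is preferred over plain accuracy on imbalanced data; the function names and the toy labels are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition of g-mean)."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1 / len(recalls))

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

if __name__ == "__main__":
    # Toy imbalanced data: 8 majority (0) and 2 minority (1) samples.
    y_true = [0] * 8 + [1] * 2
    always_majority = [0] * 10
    print(accuracy(y_true, always_majority), g_mean(y_true, always_majority))
```

A classifier that always predicts the majority class scores well on accuracy in this toy case, but its g-mean collapses to zero because minority recall is zero.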

    BIG DATA и анализ высокого уровня : материалы конференции - BIG DATA and Advanced Analytics: Conference Proceedings

    This collection publishes the results of scientific research and development in the field of BIG DATA and Advanced Analytics for optimizing IT and business solutions, as well as case studies in medicine, education, and ecology