
9 research outputs found

Active Learning for One-Class Classification Using Two One-Class Classifiers

Author: Schlachter Patrick
Yang Bin
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 10/01/2019

This paper introduces a novel, generic active learning method for one-class classification. Active learning methods play an important role in reducing the effort of manual labeling in machine learning. Although many active learning approaches have been proposed in recent years, most of them are restricted to binary or multi-class problems. One-class classifiers use samples from only one class, the so-called target class, during training and hence require special active learning strategies. The few strategies proposed for one-class classification are either limited to specific one-class classifiers or depend for their performance on particular assumptions about the datasets, such as imbalance. Our proposed method is based on using two one-class classifiers, one for the desired target class and one for the so-called outlier class. It makes it possible to devise new query strategies, to use binary query strategies, and to define simple stopping criteria. Based on the new method, two query strategies are proposed. The provided experiments compare the proposed approach with known strategies on various datasets and show improved results in almost all situations. Comment: EUSIPCO 201
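As a rough illustration of the two-classifier idea in the abstract above, the following Python sketch keeps one model for the target class and one for the outlier class and queries the pool sample on which their scores are closest. The centroid-distance model, the closest-score query rule, the fixed query budget used as a stopping criterion, and all names (`CentroidOneClass`, `query_index`, `active_learning`, `predict`) are assumptions made for this sketch; the abstract does not describe the paper's actual classifiers or its two query strategies.

```python
import math


class CentroidOneClass:
    """Toy one-class model (an assumption): score = negative distance to the centroid."""

    def __init__(self, seed_sample):
        # Each model starts from one labeled seed sample of its own class.
        self.points = [seed_sample]

    def add(self, x):
        self.points.append(x)

    def score(self, x):
        dim = len(x)
        n = len(self.points)
        centroid = [sum(p[i] for p in self.points) / n for i in range(dim)]
        return -math.dist(x, centroid)


def query_index(target_model, outlier_model, pool):
    """Return the index of the pool sample whose two scores are closest (most ambiguous)."""
    return min(
        range(len(pool)),
        key=lambda i: abs(target_model.score(pool[i]) - outlier_model.score(pool[i])),
    )


def active_learning(pool, oracle, seed_target, seed_outlier, budget):
    """Query up to `budget` labels; oracle(x) returns True for target-class samples."""
    target = CentroidOneClass(seed_target)
    outlier = CentroidOneClass(seed_outlier)
    pool = list(pool)
    for _ in range(min(budget, len(pool))):
        x = pool.pop(query_index(target, outlier, pool))
        # The newly labeled sample trains the model of its own class only.
        (target if oracle(x) else outlier).add(x)
    return target, outlier


def predict(target_model, outlier_model, x):
    """Classify x as target if the target model scores it at least as high."""
    return target_model.score(x) >= outlier_model.score(x)
```

Because both models produce comparable scores, any margin-style rule written for binary classifiers can be plugged into `query_index`, which is the kind of reuse the abstract alludes to.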

arXiv.org e-Print Archive

Crossref

Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

Author: Babu B.Hari
Chandra N.Subash
Gopal T. Venu
Publication venue: Institute for Project Management Pvt. Ltd
Publication date: 05/09/2020

Clustering is the most prominent data mining technique for grouping data into clusters based on distance measures. With the growth of high-dimensional data such as microarray gene expression data, clustering runs into a problem: similarity between objects measured in the full-dimensional space is often invalid because the data contain different types of attributes. As a result, clustering is inaccurate and may fall short of expectations when the dimensionality of the dataset is high. The problem is now attracting considerable research and development attention. To address the performance of clustering on high-dimensional data, issues such as dimensionality reduction, redundancy elimination, subspace clustering, co-clustering, and data labeling for clusters need to be analyzed and improved. In this paper, we present a brief comparison of existing algorithms that focus mainly on clustering high-dimensional data.

Interscience Research Network

Biomining: An Efficient Data Retrieval Tool for Bioinformatics to Avoid Redundant and Irrelevant Data Retrieval from Biological Databases

Author: Punithavalli Dr.M.
Sumithiradevi C.
Suresh S.
Publication venue: Global Journals Inc. (US)
Publication date: 15/01/2011

Mining biological data is an emerging area at the intersection of data mining and bioinformatics. Bioinformaticians have been working on the research and development of computational methodologies and tools for expanding the use of biological, medical, behavioral, or health-related data. Data mining researchers have been making substantial contributions to the development of models and algorithms to meet the challenges posed by bioinformatics research. The databases being mined tend to develop data quality issues such as anomalies and duplication. Methods and tools must be developed to correct biological data. This paper proposes one such tool, called BIOMINING, which is designed to eliminate anomalies and redundancy in biological web content.

Global Journal of Computer Science and Technology (GJCST)

Content Based Web Page Re-Ranking Using Relevancy Algorithm

Author: Harish Kumar B.T.
Venugopal K.R.
Vibha Lakshmikantha
Publication venue: Quest Journals
Publication date: 01/01/2014

The World Wide Web is a system of interlinked hypertext documents accessed via the Internet, and it plays a leading role in retrieving user-requested information from web resources. Search engines play a major part in this: they crawl web content on different nodes and organize it into result pages so that users can select the required information by navigating through the result links. This strategy worked well earlier because the number of resources available for a user request was limited, and users could feasibly identify the relevant information directly from the search engine results. As the Internet has grown, resource sharing has grown with it, which calls for automated techniques to rank web content. Different search engines use different techniques to rank search results for a user query, which gives businesses a motive to push their web resources into top ranking positions. As competition and the number of web resources increase, ranking web content with respect to the user query becomes tedious and dynamic. The proposed work introduces a new approach that ranks relevant pages based on their content and keywords rather than on the keywords and page ranking provided by search engines. Search engine results are retrieved for the user query, and every result is analyzed individually based on its keywords and content. The user query is pre-processed to identify root words. The root words are used to construct a dictionary, which is extended with synonyms for the user query. The keywords and content words of each result page are pre-processed and compared against the dictionary. If a match is found, a particular weight is awarded for each word. Finally, the total relevancy of a link to the user request is computed by summing the weights of its keyword and content words. The results are then re-ranked in descending order of their weights and displayed.
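The pipeline described in this abstract (root words, synonym dictionary, per-word weights, summed relevancy, descending re-rank) can be sketched in a few lines of Python. The synonym table, the two weight values, the crude suffix-stripping stemmer, and the page fields `keywords` and `content` are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import re

# Hypothetical synonym table and weights; the abstract gives no concrete values.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "cheap": {"affordable", "inexpensive"},
}
KEYWORD_WEIGHT = 2.0  # assumed weight for a match in a page's keywords
CONTENT_WEIGHT = 1.0  # assumed weight for a match in a page's body text


def root(word):
    """Very crude stand-in for a stemmer: lowercase and strip a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def tokenize(text):
    """Split text into alphabetic words and reduce each to its root form."""
    return [root(w) for w in re.findall(r"[A-Za-z]+", text)]


def build_dictionary(query):
    """Collect the query's root words plus their synonyms."""
    terms = set()
    for r in tokenize(query):
        terms.add(r)
        terms.update(SYNONYMS.get(r, ()))
    return terms


def relevancy(page, dictionary):
    """Sum the weights of all keyword and content words found in the dictionary."""
    keyword_score = sum(KEYWORD_WEIGHT for w in tokenize(page["keywords"]) if w in dictionary)
    content_score = sum(CONTENT_WEIGHT for w in tokenize(page["content"]) if w in dictionary)
    return keyword_score + content_score


def rerank(query, pages):
    """Return the pages sorted by descending total relevancy to the query."""
    dictionary = build_dictionary(query)
    return sorted(pages, key=lambda p: relevancy(p, dictionary), reverse=True)
```

A production system would replace `root` with a real stemmer or lemmatizer and load synonyms from a thesaurus; the sketch only shows how the scoring and re-ranking steps fit together.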

ePrints@Bangalore University

Choice of clustering methods

Author: Chochieva A. S.
Piletski I. I.
Пилецкий И. И.
Чочиева А. С.
Publication venue: Беспринт, РБ
Publication date: 01/01/2020

Choosing a machine learning algorithm to solve a given task is a problem in itself. This report reviews numerous cluster analysis algorithms and develops and tries out a method for choosing among them to solve practical problems efficiently.

Belarusian State University of Informatics and Radioelectronics Repository

Predicción de abandonos y pagos en videojuegos Freemium para móviles - Predicting Churn and Payments in Freemium Mobile Video Games

Author: Darias Jojorina Alexei
Publication venue: Universidad Internacional de Andalucía
Publication date: 01/01/2018

73 pages. Master's thesis in Economics, Finance and Computing. Supervisor: Dr. Emilio Congregado Ramírez de Aguilera. The mobile gaming industry has grown rapidly, and companies in the sector devote great effort to increasing user retention and monetization. A fundamental input to decision making is the prediction of which players will abandon a game and/or make payments within it. The objective of this research is to predict dropouts (churn) and payments in a game by the company Genera Games. For this purpose, the Knowledge Discovery in Databases process was carried out, from data extraction to pattern recognition. The binary classification methods Logistic Regression, Random Forests, and Gradient Boosting were used. Computational processing was done using SQL, R, Python, and the statistical analysis software Stata. The results are predictions for different periods (short, medium, and long term) and a comparison of the classification methods.

Repositorio de la UNIA

Perbandingan Metode Ensemble Random Forest Dengan Smote-Boosting Dan Smote-Bagging Pada Klasifikasi Data Mining Untuk Kelas Imbalance (Studi Kasus : Data Beasiswa Bidikmisi Tahun 2017 di Jawa Timur) - A Comparison Of The Ensemble Random Forest Methods With Smote-Boosting And Smote-Bagging On Data Mining Classification For Imbalance Class

Author: Pangastuti Sinta Septi
Publication venue
Publication date: 01/09/2018

Data mining techniques in education are developing along with advances in technology and the growing amount of data that can be stored in educational database systems. Classification methods are used to group students into identified classes. However, class imbalance often occurs in classification and becomes a problem: the majority class in the training data can far outnumber the minority class, so the classifier tends to predict the majority class (negative class) rather than the minority class (positive class). Almost all classifiers, including random forest, assume an even distribution across the observed classes. Random forest generally shows a large performance gain over CART and C4.5, yields a better generalization error rate than AdaBoost, and is more robust to noise. However, like most other classification methods, random forest can produce suboptimal results on imbalanced datasets. An alternative for improving accuracy on imbalanced classes is to use ensemble methods. Two methods that have recently become popular are SMOTE-Boosting and SMOTE-Bagging, which combine a data-level algorithm, SMOTE, with ensemble methods. The Bidikmisi data used in the study have 10 nominal variables (X1-X10) and 1 ratio variable (X11). Based on the g-mean and AUC performance criteria for the imbalanced class (Y), the ensemble algorithms SMOTE-Bagging (g-mean = 33.13% and AUC = 52.12%) and SMOTE-Boosting (g-mean = 30.22% and AUC = 50.76%) tend to classify more accurately than the AdaBoost.M2 method (g-mean = 9.03% and AUC = 50.26%). The difference between SMOTE-Boosting and SMOTE-Bagging is very small. Both methods can be said to be fairly successful at combining the advantages of boosting and bagging with SMOTE. Whereas boosting and bagging affect the accuracy of random forest by focusing on all data classes, the SMOTE algorithm changes the performance of random forest only on the minority class.
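To make the g-mean criterion used in this abstract concrete, here is a minimal Python sketch, assuming the usual binary definition: the square root of sensitivity times specificity. The function name `g_mean` and the toy labels are illustrative and do not come from the study.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced data: 10 positive (minority) and 90 negative (majority) samples.
y_true = [1] * 10 + [0] * 90

# A classifier that always predicts the majority class is 90% accurate,
# yet it never finds a positive sample, so its sensitivity and g-mean are zero.
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
```

This is why the abstract reports g-mean rather than plain accuracy: a classifier that leans toward the majority class can score high accuracy while its g-mean stays near zero.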

ITS Repository

BIG DATA и анализ высокого уровня : материалы конференции - BIG DATA and Advanced Analytics: Conference Proceedings

Author
Publication venue: Беспринт, РБ
Publication date: 01/01/2020

This collection publishes the results of research and development in the field of BIG DATA and Advanced Analytics for optimizing IT and business solutions, as well as case studies in medicine, education, and ecology.

Belarusian State University of Informatics and Radioelectronics Repository