
4,555 research outputs found

GOGGLES: Automatic Image Labeling with Affinity Coding

Author: Chaba Sanya
Chau Duen Horng
Chu Xu
Das Nilaksh
Gandhi Sakshi
Wu Renzhi
Publication venue: Association for Computing Machinery (ACM)
Publication date: 03/03/2020

Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets as it is hard to express domain knowledge using raw features for images (pixels). We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class on average should be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set. We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of 71% to a maximum of 98% without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound. Comment: Published at 2020 ACM SIGMOD International Conference on Management of Data.

arXiv.org e-Print Archive

Crossref

Corporate Credit Rating: A Survey

Author: Cheng Xi
Feng Bojing
Li Dan
Liu Zeyu
Xue Wenfang
Publication date: 18/09/2023

Corporate credit rating (CCR) plays a very important role in the process of contemporary economic and social development. How to use credit rating methods for enterprises has always been a problem worthy of discussion. Through reading and studying the relevant literature at home and abroad, this paper makes a systematic survey of CCR. This paper combs the context of the development of CCR methods from the three levels: statistical models, machine learning models and neural network models, summarizes the common databases of CCR, and deeply compares the advantages and disadvantages of the models. Finally, this paper summarizes the problems existing in the current research and prospects the future of CCR. Compared with the existing review of CCR, this paper expounds and analyzes the progress of neural network model in this field in recent years. Comment: 11 pages.

arXiv.org e-Print Archive

Dealing with imbalanced and weakly labelled data in machine learning using fuzzy and rough set methods

Author: Vluymans Sarah
Publication venue: Ghent University. Faculty of Medicine and Health Sciences ; University of Granada. Department of Computer Science and Artificial Intelligence
Publication date: 01/01/2018

Ghent University Academic Bibliography

CERN Document Server

Fuzzy-Rough Set based Semi-Supervised Learning

Author: Jensen Richard
MacParthalain Neil
Publication date: 01/01/2011

Abstract: Much work has been carried out in the area of fuzzy-rough sets for supervised learning. However, very little has been accomplished for the unsupervised or semi-supervised tasks. For many real-world applications, it is often expensive, time-consuming and difficult to obtain labels for all data objects. This often results in large quantities of data which may only have very few labelled data objects. This paper proposes a novel fuzzy-rough based semi-supervised self-learning or self-training approach for the assignment of labels to unlabelled data. Unlike other semi-supervised approaches, the proposed technique requires no subjective thresholding or domain information. An experimental evaluation is performed on artificial data and also applied to a real-world mammographic risk assessment problem with encouraging results. Index Terms: Rough sets, fuzzy sets, mammographic analysis, semi-supervised learning.

CiteSeerX

Aberystwyth Research Portal

SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

Author: Chawla Nitesh V.
Fernández Hilario Alberto Luis
García López Salvador
Herrera Triguero Francisco
Publication venue: AI Access Foundation
Publication date: 01/01/2018

The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has also significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE, its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Granada

A Bibliographic View on Constrained Clustering

Author: Hennessey Samuel
Kuncheva Ludmila
Williams Francis
Publication date: 22/09/2022

A keyword search on constrained clustering on Web-of-Science returned just under 3,000 documents. We ran automatic analyses of those, and compiled our own bibliography of 183 papers which we analysed in more detail based on their topic and experimental study, if any. This paper presents general trends of the area and its sub-topics by Pareto analysis, using citation count and year of publication. We list available software and analyse the experimental sections of our reference collection. We found a notable lack of large comparison experiments. Among the topics we reviewed, applications studies were most abundant recently, alongside deep learning, active learning and ensemble learning. Comment: 18 pages, 11 figures, 177 references.

arXiv.org e-Print Archive

Learning in Dynamic Data-Streams with a Scarcity of Labels

Author: Fahy Conor
Publication venue: Faculty of Computing, Engineering and Media
Publication date: 01/01/2019

Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch-analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting a false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity of labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available) or in situations where manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants (Ant Colony Stream Clustering (ACSC)) is proposed. This algorithm is shown to be faster and more accurate than comparative, peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream).

De Montfort University Open Research Archive