Statistical Active Learning Algorithms for Noise Tolerance and Differential Privacy
We describe a framework for designing efficient active learning algorithms
that are tolerant to random classification noise and are
differentially-private. The framework is based on active learning algorithms
that are statistical in the sense that they rely on estimates of expectations
of functions of filtered random examples. It builds on the powerful statistical
query framework of Kearns (1993).
We show that any efficient active statistical learning algorithm can be
automatically converted to an efficient active learning algorithm which is
tolerant to random classification noise as well as other forms of
"uncorrelated" noise. The complexity of the resulting algorithms has
information-theoretically optimal quadratic dependence on 1/(1 - 2η), where
η is the noise rate.
We show that commonly studied concept classes including thresholds,
rectangles, and linear separators can be efficiently actively learned in our
framework. These results combined with our generic conversion lead to the first
computationally-efficient algorithms for actively learning some of these
concept classes in the presence of random classification noise that provide
exponential improvement in the dependence on the error ε over their
passive counterparts. In addition, we show that our algorithms can be
automatically converted to efficient active differentially-private algorithms.
This leads to the first differentially-private active learning algorithms with
exponential label savings over the passive case. Comment: Extended abstract appears in NIPS 201
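The noise-tolerance conversion the abstract describes rests on a simple statistical-query identity: if each label is flipped independently with rate η < 1/2, then the noisy expectation of any correlation query satisfies E_noisy[f·ỹ] = (1 - 2η)·E[f·y], so dividing the empirical noisy estimate by (1 - 2η) recovers the clean value. The following is a minimal numpy sketch of that correction only, not the paper's full framework; the toy query f(x) = sign(x) and all sample sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def sq_estimate_denoised(f_vals, noisy_labels, eta):
    """Estimate E[f(x)*y] from labels flipped independently with rate eta.

    Under random classification noise, E_noisy[f * y_noisy] = (1 - 2*eta) * E[f * y],
    so dividing the empirical noisy estimate by (1 - 2*eta) recovers the
    clean expectation (valid for eta < 1/2).
    """
    return np.mean(f_vals * noisy_labels) / (1.0 - 2.0 * eta)

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = np.sign(x)                      # clean labels, perfectly correlated with f
eta = 0.3
flip = rng.random(n) < eta
y_noisy = np.where(flip, -y, y)     # random classification noise at rate eta

f_vals = np.sign(x)                 # the statistical query f(x)
est = sq_estimate_denoised(f_vals, y_noisy, eta)
# clean E[f * y] = 1 here; the corrected estimate should be close to 1
```

The quadratic dependence on 1/(1 - 2η) arises because the division amplifies the estimator's variance by (1 - 2η)^-2 as η approaches 1/2.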
Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy
Astrophysics and cosmology are rich with data. The advent of wide-area
digital cameras on large aperture telescopes has led to ever more ambitious
surveys of the sky. Data volumes of entire surveys a decade ago can now be
acquired in a single night and real-time analysis is often desired. Thus,
modern astronomy requires big data know-how, in particular it demands highly
efficient machine learning and image analysis algorithms. But scalability is
not the only challenge: Astronomy applications touch several current machine
learning research questions, such as learning from biased data and dealing with
label and measurement noise. We argue that this makes astronomy a great domain
for computer science research, as it pushes the boundaries of data analysis. In
the following, we will present this exciting application area for data
scientists. We will focus on exemplary results, discuss main challenges, and
highlight some recent methodological advancements in machine learning and image
analysis triggered by astronomical applications.
Unsupervised Visible-Infrared Person ReID by Collaborative Learning with Neighbor-Guided Label Refinement
Unsupervised visible-infrared person re-identification (USL-VI-ReID)
aims to learn modality-invariant features from unlabeled cross-modality
datasets, which is crucial for practical applications in video surveillance
systems. The key to addressing the USL-VI-ReID task is solving the
cross-modality data association problem so that heterogeneous joint
learning can follow. To this end, we propose a Dual Optimal Transport Label
Assignment (DOTLA) framework to simultaneously assign the generated labels from
one modality to its counterpart modality. The proposed DOTLA mechanism
formulates a mutually reinforcing and efficient solution to cross-modality data
association, which effectively reduces the side effects of insufficient and
noisy label associations. In addition, we further propose a
cross-modality neighbor consistency guided label refinement and regularization
module, to eliminate the negative effects of inaccurate supervisory
signals, under the assumption that the prediction or label distribution of each
example should be similar to its nearest neighbors. Extensive experimental
results on the public SYSU-MM01 and RegDB datasets demonstrate the
effectiveness of the proposed method, which surpasses the existing
state-of-the-art approach by a large margin of 7.76% mAP on average and even
surpasses some supervised VI-ReID methods.
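The paper's DOTLA formulation is not reproduced here, but the underlying idea of assigning pseudo-labels across modalities via optimal transport can be sketched with generic entropic-regularized (Sinkhorn) transport: samples from one modality are matched to cluster centers from the other under marginal constraints, so no center monopolizes the assignments. All data, shapes, and parameter values below are illustrative assumptions, not the paper's.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropic-regularized optimal transport (Sinkhorn iterations)
    between uniform marginals; returns a soft assignment matrix."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m          # uniform marginals
    K = np.exp(-cost / cost.max() / reg)           # Gibbs kernel (cost normalized)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                       # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy setup: six visible-modality features, three infrared cluster centers.
feats_vis = np.array([[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0],
                      [5.0, 0.0, 0.0, 0.0], [5.1, 0.0, 0.0, 0.0],
                      [0.0, 5.0, 0.0, 0.0], [0.0, 5.1, 0.0, 0.0]])
centers_ir = np.array([[0.0, 0.0, 0.0, 0.0],
                       [5.0, 0.0, 0.0, 0.0],
                       [0.0, 5.0, 0.0, 0.0]])
cost = ((feats_vis[:, None, :] - centers_ir[None, :, :]) ** 2).sum(-1)
P = sinkhorn(cost)
pseudo_labels = P.argmax(1)   # each visible sample receives an infrared pseudo-label
```

The marginal constraints are what distinguish this from plain nearest-center assignment: mass is spread so each infrared cluster receives its fair share, which tempers the noisy-association problem the abstract mentions.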
Multi-Label Dimensionality Reduction
Multi-label learning, which deals with data associated with multiple labels simultaneously, is ubiquitous in real-world applications. To overcome the curse of dimensionality in multi-label learning, in this thesis I study multi-label dimensionality reduction, which extracts a small number of features by removing irrelevant, redundant, and noisy information while accounting for the correlation among different labels. Specifically, I propose Hypergraph Spectral Learning (HSL), which performs dimensionality reduction for multi-label data by exploiting correlations among different labels using a hypergraph.
The regularization effect on the classical dimensionality reduction algorithm known as Canonical Correlation Analysis (CCA) is elucidated in this thesis, and the relationship between CCA and Orthonormalized Partial Least Squares (OPLS) is also investigated.
To perform dimensionality reduction efficiently for large-scale problems, two efficient implementations are proposed for a class of dimensionality reduction algorithms, including canonical correlation analysis, orthonormalized partial least squares, linear discriminant analysis, and hypergraph spectral learning. The first is a direct least-squares approach that allows the use of different regularization penalties but is applicable only under a certain assumption; the second is a two-stage approach that can be applied in the regularization setting without any assumption. Furthermore, an online implementation of the same class of dimensionality reduction algorithms is proposed for data that arrive sequentially.
A Matlab toolbox for multi-label dimensionality reduction has been developed and released. The proposed algorithms have been applied successfully to Drosophila gene expression pattern image annotation.
The experimental results on some benchmark data sets in multi-label learning also demonstrate the effectiveness and efficiency of the proposed algorithms. Dissertation/Thesis, Ph.D. Computer Science, 201
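As a point of reference for the CCA analysis discussed in the abstract, a minimal regularized CCA can be written in a few lines of numpy: whiten each view with the inverse covariance, then take the top singular pair of the whitened cross-covariance. This is a generic textbook sketch, not the thesis's efficient least-squares or two-stage implementation; the regularization constant and the toy data are assumptions made here for illustration.

```python
import numpy as np

def cca(X, Y, reg=1e-3):
    """Regularized CCA: returns directions wx, wy maximizing the
    correlation between X @ wx and Y @ wy (a minimal sketch)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whitening transforms: cov of X @ Lx is Lx.T @ Cxx @ Lx = I.
    Lx = np.linalg.cholesky(np.linalg.inv(Cxx))
    Ly = np.linalg.cholesky(np.linalg.inv(Cyy))
    U, s, Vt = np.linalg.svd(Lx.T @ Cxy @ Ly)      # SVD of whitened cross-cov
    wx = Lx @ U[:, 0]                              # map back to original space
    wy = Ly @ Vt[0]
    return wx, wy, s[0]                            # s[0] = top canonical corr.

# Toy data: both views share a latent signal z in one coordinate.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), z + 0.1 * rng.normal(size=n)])
wx, wy, corr = cca(X, Y)   # corr should be close to 1 for this data
```

The role of the `reg` term is exactly the regularization effect the thesis analyzes: it keeps the covariance inverses well-conditioned when features outnumber samples.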
More efficient manual review of automatically transcribed tabular data
Machine learning methods have proven useful in transcribing historical data.
However, results from even highly accurate methods require manual verification
and correction. Such manual review can be time-consuming and expensive; the
objective of this paper was therefore to make it more efficient.
Previously, we used machine learning to transcribe 2.3 million handwritten
occupation codes from the Norwegian 1950 census with high accuracy (97%). We
manually reviewed the 90,000 (3%) codes with the lowest model confidence. We
allocated those 90,000 codes to human reviewers, who used our annotation tool
to review the codes. To assess reviewer agreement, some codes were assigned to
multiple reviewers. We then analyzed the review results to understand the
relationship between accuracy improvements and effort. Additionally, we
interviewed the reviewers to improve the workflow. The reviewers corrected
62.8% of the labels and agreed with the model label in 31.9% of cases. About
0.2% of the images could not be assigned a label, while for 5.1% the reviewers
were uncertain or assigned an invalid label. 9,000 images were
independently reviewed by multiple reviewers, resulting in an agreement of
86.43% and disagreement of 8.96%. We learned that our automatic transcription
is biased towards the most frequent codes, with a higher degree of
misclassification for the lowest frequency codes. Our interview findings show
that the reviewers did internal quality control and found our custom tool
well-suited. Thus, only one reviewer is needed, but they should report
uncertainty. Comment: 19 pages, 5 figures, 1 table
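The review-allocation step described above (send the 3% of codes with the lowest model confidence to human reviewers) reduces to ranking predictions by their maximum class probability. A minimal sketch of that selection, with hypothetical probabilities and a made-up budget, assuming confidence is defined as the top softmax score:

```python
import numpy as np

def select_for_review(probs, budget):
    """Pick the `budget` predictions with the lowest model confidence
    (maximum class probability) for manual review."""
    confidence = probs.max(axis=1)
    order = np.argsort(confidence)      # ascending: least confident first
    return order[:budget]

# Toy example: 5 predictions over 3 classes; review the 2 least confident.
probs = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],   # low confidence
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],   # lowest confidence
    [0.80, 0.10, 0.10],
])
to_review = select_for_review(probs, budget=2)
```

The paper's bias finding is a caveat for this strategy: if rare codes are systematically misclassified with high confidence, a pure confidence threshold will miss them, which is one reason reviewer-reported uncertainty matters.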
BlindSage: Label Inference Attacks against Node-level Vertical Federated Graph Neural Networks
Federated learning enables collaborative training of machine learning models
by keeping the raw data of the involved workers private. One of its main
objectives is to improve the models' privacy, security, and scalability.
Vertical Federated Learning (VFL) offers an efficient cross-silo setting where
a few parties collaboratively train a model without sharing the same features.
In such a scenario, classification labels are commonly considered sensitive
information held exclusively by one (active) party, while other (passive)
parties use only their local information. Recent works have uncovered important
flaws of VFL, leading to possible label inference attacks under the assumption
that the attacker has some, even limited, background knowledge on the relation
between labels and data. In this work, we are the first (to the best of our
knowledge) to investigate label inference attacks on VFL using a
zero-background knowledge strategy. To concretely formulate our proposal, we
focus on Graph Neural Networks (GNNs) as a target model for the underlying VFL.
In particular, we consider node classification tasks, which are widely studied
and for which GNNs have shown promising results. Our proposed attack, BlindSage, provides
impressive results in the experiments, achieving nearly 100% accuracy in most
cases. Even when the attacker has no information about the used architecture or
the number of classes, the accuracy remained above 85% in most instances.
Finally, we observe that well-known defenses cannot mitigate our attack without
affecting the model's performance on the main classification task.
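BlindSage's zero-knowledge procedure is not reproduced here, but the classic observation that makes gradient-based label inference possible in the first place can be shown in a few lines: for softmax cross-entropy, the gradient of the loss with respect to the logits equals p − onehot(y), so its single negative coordinate reveals the true class. The numbers below are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def infer_label_from_logit_grad(grad):
    """For softmax cross-entropy, d(loss)/d(logits) = p - onehot(y).
    Since 0 < p_i < 1, the unique negative coordinate marks the true class."""
    return int(np.argmin(grad))

logits = np.array([1.2, -0.3, 0.5, 0.1])
true_label = 2
p = softmax(logits)
grad = p - np.eye(4)[true_label]   # what an attacker might observe upstream
inferred = infer_label_from_logit_grad(grad)
```

In VFL the passive party does not see this gradient directly, which is why attacks like BlindSage need more elaborate machinery; the sketch only conveys why label information leaks through gradients at all.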