Unsupervised Learning with Imbalanced Data via Structure Consolidation Latent Variable Model
Unsupervised learning on imbalanced data is challenging because current models
are often dominated by the majority category and ignore categories with only a
small amount of data. We develop a latent variable
model that can cope with imbalanced data by dividing the latent space into a
shared space and a private space. Based on Gaussian Process Latent Variable
Models, we propose a new kernel formulation that enables this separation of
the latent space, and we derive an efficient variational inference method. The
performance of our model is demonstrated with an imbalanced medical image
dataset.
Comment: ICLR 2016 Workshop
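As a rough illustration of the shared/private latent-space idea (not the paper's actual kernel, which is derived within the GPLVM framework), one can sketch a covariance where the first latent dimensions are shared across categories and the remaining, private dimensions only correlate points of the same category. All names and the toy latent coordinates below are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two vectors."""
    d2 = np.sum((a - b) ** 2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def shared_private_kernel(xi, xj, ci, cj, n_shared=2):
    """Covariance separating shared and private latent dimensions.

    xi, xj : latent coordinates; the first n_shared dims are shared,
             the rest are private to each category.
    ci, cj : category labels; the private term only correlates
             points from the same category.
    """
    k = rbf(xi[:n_shared], xj[:n_shared])      # shared-space term
    if ci == cj:                               # private term within a category
        k += rbf(xi[n_shared:], xj[n_shared:])
    return k

# Toy latent set: three majority points and one minority point.
Z = np.array([[0.0, 0.0, 0.1, 0.0],
              [0.1, 0.0, 0.2, 0.0],
              [0.0, 0.1, 0.0, 0.1],
              [2.0, 2.0, 0.0, 0.5]])
labels = [0, 0, 0, 1]
K = np.array([[shared_private_kernel(Z[i], Z[j], labels[i], labels[j])
               for j in range(4)] for i in range(4)])
print(K.shape)
```

The minority point still shares covariance with the majority through the shared dimensions, while its private dimensions remain free to model category-specific structure.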
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied in the name of `model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 2015
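A minimal sketch of the transfer recipe the abstract describes (reuse high-level features learned on a data-rich source task, retrain only a light output layer on the small target set). The frozen random projection below merely stands in for a pretrained model's hidden layers; all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained acoustic model: a frozen feature extractor.
# In practice W_frozen would come from the source-language model.
W_frozen = rng.normal(size=(20, 8))

def features(x):
    """Frozen high-level features shared across tasks/languages."""
    return np.tanh(x @ W_frozen)

# Tiny target-language dataset: only 10 labelled examples.
X_target = rng.normal(size=(10, 20))
y_target = (X_target[:, 0] > 0).astype(float)

# Adapt only a small output layer on top of the frozen features
# (plain logistic regression fitted by gradient descent).
F = features(X_target)
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-F @ w))
    w -= 0.5 * F.T @ (p - y_target) / len(y_target)

train_acc = np.mean((F @ w > 0) == (y_target == 1))
print(train_acc)
```

Only the 8 output-layer weights are trained, which is why little or no target data can suffice; fine-tuning deeper layers is the natural next step when more data is available.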
How to Do Machine Learning with Small Data? -- A Review from an Industrial Perspective
Artificial intelligence has experienced technological breakthroughs in science,
industry, and everyday life in recent decades. The advancements can be
credited to the ever-increasing availability and miniaturization of
computational resources that resulted in exponential data growth. However,
because of the insufficient amount of data in some cases, employing machine
learning in solving complex tasks is not straightforward or even possible. As a
result, machine learning with small data is gaining importance in data science
and in several application fields. The authors focus on interpreting the
general term "small data" and its role in engineering and industrial
applications. They give a brief overview of the most important industrial
applications of machine learning with small data. Small data is defined in
terms of various characteristics compared to big data, and a machine learning
formalism is introduced. Five critical challenges of machine learning with
small data in industrial applications are presented: unlabeled data, imbalanced
data, missing data, insufficient data, and rare events. Based on those
definitions, an overview of the considerations in domain representation and
data acquisition is given along with a taxonomy of machine learning approaches
in the context of small data.
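For the imbalanced-data and rare-event challenges listed above, the simplest remedy is random oversampling of the minority class; a minimal sketch (the function name and toy data are our own, and more elaborate schemes such as SMOTE or class-weighted losses follow the same idea):

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample_minority(X, y):
    """Random oversampling: resample each class (with replacement)
    up to the size of the largest class, so classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_out, y_out = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        resampled = rng.choice(idx, size=n_max, replace=True)
        X_out.append(X[resampled])
        y_out.append(y[resampled])
    return np.vstack(X_out), np.concatenate(y_out)

# 95 'normal' rows vs. 5 'rare event' rows.
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)
Xb, yb = oversample_minority(X, y)
print(np.bincount(yb))
```

After resampling both classes contribute equally to any loss computed over the data, at the cost of duplicated minority rows.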
Gaussian Processes for Data Scarcity Challenges
This thesis focuses on Gaussian process models specifically designed for scarce data problems. Data scarcity or lack of data can be a weak spot for many machine learning algorithms. Nevertheless, both are commonly found in a diverse set of applications such as medicine, quality assurance, and remote sensing. Supervised classification algorithms can require large amounts of labeled data, and fulfilling this requirement is not straightforward.
In medicine, breast cancer datasets typically contain few cancerous cells and many healthy ones, reflecting the overall scarcity of cancerous cells relative to non-cancerous ones. This lack of cancerous cells makes the dataset imbalanced, which hinders learning algorithms in distinguishing cancerous from healthy cells. A similar imbalance exists in the quality assurance industry, in which the ratio of faulty to non-faulty cases is very low. In sensor networks, and in particular those which measure air pollution across cities, combining sensors of different qualities can help fill gaps in what is often a very data-scarce landscape.
In data scarce scenarios, we present a probabilistic latent variable model that can cope with imbalanced data. By incorporating label information, we develop a kernel that can capture shared and private characteristics of data separately. On the other hand, in cases where no labels are available, an active learning based technique is proposed, based on a Gaussian process classifier with an oracle in the loop to annotate only the data about which the algorithm is uncertain. Finally, when disparate data types with different granularity levels are available, a transfer learning based approach is proposed. We show that jointly modeling data with various granularity helps improve prediction of rare data.
The developed methods are demonstrated in experiments with real and synthetic data. The results presented in this thesis show that these methods improve prediction for scarce data problems with various granularities.
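The oracle-in-the-loop selection rule described for the unlabelled case can be sketched as uncertainty sampling: query labels only where the classifier's predictive probability is closest to 0.5. The probabilities below are hypothetical; in the thesis they would come from a Gaussian process classifier's predictive distribution.

```python
import numpy as np

def most_uncertain(probs, k=3):
    """Pick the k points whose predicted probability is closest to 0.5,
    i.e. the points the classifier is least certain about. These are
    the points sent to the human oracle for annotation."""
    uncertainty = -np.abs(probs - 0.5)   # higher = less certain
    return np.argsort(uncertainty)[-k:]

# Hypothetical predictive probabilities for 8 unlabelled points.
probs = np.array([0.02, 0.97, 0.55, 0.48, 0.91, 0.51, 0.10, 0.80])
query = most_uncertain(probs, k=3)
print(sorted(query.tolist()))
```

Points with probabilities near 0 or 1 are skipped, so annotation effort concentrates where a label changes the model most.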
A scalable machine learning system for anomaly detection in manufacturing
Reports of recalls in the automotive industry have become part of everyday media coverage. In fact, their frequency and the number of affected vehicles have continued to rise in recent years. Most recalls can be traced back to faults in production. For manufacturers, the intelligent and automated analysis of production process data represents a largely untapped potential, alongside improvements in quality management. The technical challenges, however, are enormous: the data volumes are vast, and the data patterns characteristic of a fault are necessarily unknown. The use of machine learning (ML) methods is a promising approach to enable this search for the proverbial needle in the haystack. Algorithms are to learn autonomously from the data to distinguish between normal and abnormal process behaviour, in order to warn process experts at an early stage. Industry and research have been trying for years to establish such ML systems in production environments. Most ML projects, however, fail before reaching the production phase, or consume enormous resources in operation while delivering no economic value.
The goal of this work is the development of a technical framework for implementing a scalable ML system for anomaly detection in process data. The training processes for initializing and adapting the models are designed to be highly automatable, in order to enable a structured scaling process. The developed DM/ML method makes it possible to reduce the long-term effort of system operation through an initial extra investment in the model training process, and it has proven in practice to scale both relatively and absolutely. This reduces the complexity at the system level to a manageable degree, enabling subsequent system operation.
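A minimal stand-in for the normal/abnormal distinction such a system learns is a global z-score detector on a process signal; the thesis's actual DM/ML pipeline is far more elaborate, and the simulated readings below are our own.

```python
import numpy as np

def zscore_anomalies(signal, threshold=3.0):
    """Flag measurements more than `threshold` standard deviations
    from the mean - the simplest form of the learned distinction
    between normal and abnormal process behaviour."""
    mu, sigma = signal.mean(), signal.std()
    return np.flatnonzero(np.abs(signal - mu) > threshold * sigma)

# Simulated process data: stable readings with one injected fault.
rng = np.random.default_rng(7)
readings = rng.normal(loc=10.0, scale=0.1, size=200)
readings[150] = 14.0      # the proverbial needle in the haystack
print(zscore_anomalies(readings))
```

The fault-pattern-unknown setting motivates such unsupervised detectors: no labelled faults are needed, only an assumption that faults are rare and deviate from the bulk of the data.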
C²VAE: Gaussian Copula-based VAE Differing Disentangled from Coupled Representations with Contrastive Posterior
We present a self-supervised variational autoencoder (VAE) to jointly learn
disentangled and dependent hidden factors, and then enhance disentangled
representation learning by a self-supervised classifier that eliminates coupled
representations in a contrastive manner. To this end, a Contrastive Copula VAE
(C²VAE) is introduced that relies neither on prior knowledge about the data in
its probabilistic principle nor on strong modeling assumptions on the posterior
in its neural architecture. C²VAE simultaneously factorizes the posterior
(evidence lower bound, ELBO) with total correlation (TC)-driven decomposition
for learning factorized disentangled representations, and extracts the
dependencies between hidden features by a neural Gaussian copula for copula
coupled representations. A self-supervised contrastive classifier then
differentiates the disentangled representations from the coupled
representations, where a contrastive loss regularizes this contrastive
classification together with the TC loss to eliminate entangled factors and
strengthen disentangled representations. C²VAE demonstrates a strong effect in
enhancing disentangled representation learning. C²VAE further contributes to
improved optimization, addressing TC-based VAE instability and the trade-off
between reconstruction and representation.
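The TC-driven decomposition mentioned above usually refers to the standard split of the aggregate KL term from the TC-VAE literature (given here in its common form, which may differ in detail from this paper's exact objective):

```latex
\mathbb{E}_{p(x)}\!\left[\operatorname{KL}\!\big(q(z \mid x)\,\|\,p(z)\big)\right]
  = \underbrace{I_q(x; z)}_{\text{index-code MI}}
  + \underbrace{\operatorname{KL}\!\Big(q(z)\,\Big\|\,\textstyle\prod_j q(z_j)\Big)}_{\text{total correlation (TC)}}
  + \underbrace{\textstyle\sum_j \operatorname{KL}\!\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```

Penalizing the TC term drives the aggregate posterior toward a factorized form, which is what "factorized disentangled representations" refers to.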
Meta-survey on outlier and anomaly detection
The impact of outliers and anomalies on model estimation and data processing
is of paramount importance, as evidenced by the extensive body of research
spanning various fields over several decades: thousands of research papers have
been published on the subject. As a consequence, numerous reviews, surveys, and
textbooks have sought to summarize the existing literature, encompassing a wide
range of methods from both the statistical and data mining communities. While
these endeavors to organize and summarize the research are invaluable, they
face inherent challenges due to the pervasive nature of outliers and anomalies
in all data-intensive applications, irrespective of the specific application
field or scientific discipline. As a result, the resulting collection of papers
remains voluminous and somewhat heterogeneous. To address the need for
knowledge organization in this domain, this paper implements the first
systematic meta-survey of general surveys and reviews on outlier and anomaly
detection. Employing a classical systematic survey approach, the study collects
nearly 500 papers using two specialized scientific search engines. From this
comprehensive collection, a subset of 56 papers that claim to be general
surveys on outlier detection is selected using a snowball search technique to
enhance field coverage. A meticulous quality assessment phase further refines
the selection to a subset of 25 high-quality general surveys. Using this
curated collection, the paper investigates the evolution of the outlier
detection field over a 20-year period, revealing emerging themes and methods.
Furthermore, an analysis of the surveys sheds light on the survey writing
practices adopted by scholars from different communities who have contributed
to this field. Finally, the paper delves into several topics where consensus
has emerged from the literature. These include taxonomies of outlier types,
challenges posed by high-dimensional data, the importance of anomaly scores,
the impact of learning conditions, difficulties in benchmarking, and the
significance of neural networks. Non-consensual aspects are also discussed,
particularly the distinction between local and global outliers and the
challenges in organizing detection methods into meaningful taxonomies.
Knowledge-Guided Data-Centric AI in Healthcare: Progress, Shortcomings, and Future Directions
The success of deep learning is largely due to the availability of large
amounts of training data that cover a wide range of examples of a particular
concept or meaning. In the field of medicine, having a diverse set of training
data on a particular disease can lead to the development of a model that is
able to accurately predict the disease. However, despite the potential
benefits, there have not been significant advances in image-based diagnosis due
to a lack of high-quality annotated data. This article highlights the
importance of using a data-centric approach to improve the quality of data
representations, particularly in cases where the available data is limited. To
address this "small-data" issue, we discuss four methods for generating and
aggregating training data: data augmentation, transfer learning, federated
learning, and GANs (generative adversarial networks). We also propose the use
of knowledge-guided GANs to incorporate domain knowledge in the training data
generation process. With the recent progress in large pre-trained language
models, we believe it is possible to acquire high-quality knowledge that can be
used to improve the effectiveness of knowledge-guided generative methods.
Comment: 21 pages, 13 figures, 4 tables
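Of the four remedies listed for the "small-data" issue, data augmentation is the most direct; a minimal sketch with label-preserving image transforms (the function and toy image are our own, not the article's method):

```python
import numpy as np

def augment(image, rng):
    """Simple label-preserving augmentations for a 2-D image array:
    a random horizontal flip plus a small additive-noise perturbation."""
    if rng.random() < 0.5:
        image = image[:, ::-1]             # horizontal flip
    return image + rng.normal(scale=0.01, size=image.shape)

rng = np.random.default_rng(3)
image = np.arange(16, dtype=float).reshape(4, 4)
augmented = [augment(image, rng) for _ in range(8)]
print(len(augmented), augmented[0].shape)
```

Each pass yields a slightly different training example from the same source image, enlarging the effective dataset without new annotation; transfer learning, federated learning, and GANs address the same shortage by other means.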