
    Unsupervised Learning with Imbalanced Data via Structure Consolidation Latent Variable Model

    Full text link
    Unsupervised learning on imbalanced data is challenging because, when given imbalanced data, current models are often dominated by the majority category and ignore categories with small amounts of data. We develop a latent variable model that copes with imbalanced data by dividing the latent space into a shared space and a private space. Building on Gaussian Process Latent Variable Models, we propose a new kernel formulation that enables this separation of the latent space, and we derive an efficient variational inference method. The performance of our model is demonstrated on an imbalanced medical image dataset.
    Comment: ICLR 2016 Workshop
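The shared/private latent split described in this abstract can be illustrated with a toy kernel. This is a hypothetical sketch, not the authors' actual formulation: shared latent coordinates contribute to the covariance for every pair of points, while private coordinates contribute only between points of the same category.

```python
import numpy as np

def rbf(X, Y, lengthscale=1.0):
    # Squared-exponential kernel between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def shared_private_kernel(Z, labels, n_shared):
    # Split each latent point into shared and private coordinates.
    Zs, Zp = Z[:, :n_shared], Z[:, n_shared:]
    K = rbf(Zs, Zs)                          # shared structure: all pairs
    same = labels[:, None] == labels[None, :]
    K = K + rbf(Zp, Zp) * same               # private structure: same category only
    return K

Z = np.array([[0.0, 1.0], [0.1, -1.0], [2.0, 0.5]])
labels = np.array([0, 1, 0])
K = shared_private_kernel(Z, labels, n_shared=1)
```

Points in different categories then correlate only through the shared dimensions, which is the mechanism that keeps minority-category structure from being absorbed by the majority.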

    Transfer Learning for Speech and Language Processing

    Full text link
    Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language, with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and has traditionally been studied under the name of 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with high-level abstract features learned by deep models, and that the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.
    Comment: 13 pages, APSIPA 201
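The core transfer recipe the abstract describes, reusing high-level features learned on a data-rich source task and adapting only a small output layer on the scarce target task, can be sketched minimally. This is an illustrative toy, not the paper's actual acoustic models: the "pre-trained" extractor is a fixed random projection standing in for the hidden layers of a source-language network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: in practice, the frozen
# hidden layers of a deep acoustic model trained on the source language.
W_frozen = rng.standard_normal((20, 5))

def features(X):
    return np.tanh(X @ W_frozen)            # frozen high-level representation

# Scarce target-language data: only 30 labelled examples.
X_tgt = rng.standard_normal((30, 20))
y_tgt = rng.standard_normal(30)

# Transfer: keep the extractor, fit only a small linear head on target data.
Phi = features(X_tgt)
head, *_ = np.linalg.lstsq(Phi, y_tgt, rcond=None)

def predict(X):
    return features(X) @ head
```

Because only the 5-parameter head is estimated, the handful of target examples is enough; training the full 20-dimensional model from scratch would not be.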

    How to Do Machine Learning with Small Data? -- A Review from an Industrial Perspective

    Full text link
    Artificial intelligence has experienced technological breakthroughs in science, industry, and everyday life over recent decades. These advances can be credited to the ever-increasing availability and miniaturization of computational resources, which has resulted in exponential data growth. However, when the amount of available data is insufficient, employing machine learning to solve complex tasks is not straightforward, or even possible. As a result, machine learning with small data is of rising importance in data science and in applications across several fields. The authors focus on interpreting the general term "small data" and its role in engineering and industrial applications. They give a brief overview of the most important industrial applications of machine learning with small data. Small data is defined in terms of various characteristics in comparison to big data, and a machine learning formalism is introduced. Five critical challenges of machine learning with small data in industrial applications are presented: unlabeled data, imbalanced data, missing data, insufficient data, and rare events. Based on these definitions, an overview of the considerations in domain representation and data acquisition is given, along with a taxonomy of machine learning approaches in the context of small data.
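One of the five challenges the review names, imbalanced data, has a standard first-line remedy: reweight the loss so rare classes count as much as common ones. A minimal sketch of the usual inverse-frequency heuristic (the review itself surveys many approaches; this is just one):

```python
import numpy as np

def inverse_frequency_weights(y):
    # Weight each class inversely to its frequency, so that the total
    # weight assigned to every class is equal.
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

y = np.array([0] * 90 + [1] * 10)           # 9:1 class imbalance
weights = inverse_frequency_weights(y)
```

With these weights, the 10 minority examples contribute the same total loss as the 90 majority examples, which counteracts the dominance effect small-data settings amplify.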

    Gaussian Processes for Data Scarcity Challenges

    Get PDF
    This thesis focuses on Gaussian process models specifically designed for scarce-data problems. Data scarcity, or an outright lack of data, can be a weak spot for many machine learning algorithms. Nevertheless, both are commonly found in a diverse set of applications such as medicine, quality assurance, and remote sensing. Supervised classification algorithms can require large amounts of labeled data, and fulfilling this requirement is not straightforward. In medicine, breast cancer datasets typically contain few cancerous cells and many healthy cells, owing to the overall relative scarcity of cancerous cells versus non-cancerous ones. This lack of cancerous cells makes the dataset imbalanced, which makes it difficult for learning algorithms to learn the differences between cancerous and healthy cells. A similar imbalance exists in the quality assurance industry, where the ratio of faulty to non-faulty cases is very low. In sensor networks, and in particular those that measure air pollution across cities, combining sensors of different qualities can help fill gaps in what is often a very data-scarce landscape. For data-scarce scenarios, we present a probabilistic latent variable model that can cope with imbalanced data. By incorporating label information, we develop a kernel that captures the shared and private characteristics of the data separately. In cases where no labels are available, an active learning technique is proposed, based on a Gaussian process classifier with an oracle in the loop that annotates only the data about which the algorithm is uncertain. Finally, when disparate data types with different granularity levels are available, a transfer learning based approach is proposed. We show that jointly modeling data of various granularities helps improve prediction of rare data. The developed methods are demonstrated in experiments with real and synthetic data. The results presented in this thesis show that the developed methods improve prediction for scarce-data problems with various granularities.
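The oracle-in-the-loop idea from this abstract, querying labels only for points the classifier is uncertain about, reduces to uncertainty sampling once the classifier produces class probabilities. A minimal sketch (the probabilities here are given directly; in the thesis they would come from a Gaussian process classifier):

```python
import numpy as np

def query_oracle(probs, budget):
    # Uncertainty sampling: send the oracle the points whose predicted
    # positive-class probability is closest to 0.5 (maximum uncertainty).
    uncertainty = -np.abs(probs - 0.5)
    return np.argsort(uncertainty)[::-1][:budget]

probs = np.array([0.95, 0.52, 0.10, 0.49, 0.80])
picked = query_oracle(probs, budget=2)      # indices of the two least certain points
```

Only the selected points are annotated, so the labeling budget is spent where the model expects to learn the most.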

    A scalable machine learning system for anomaly detection in manufacturing

    Get PDF
    Reports of recall campaigns in the automotive industry have become part of everyday media coverage. Indeed, their frequency and the number of affected vehicles have continued to increase in recent years. Most recalls can be traced back to faults in production. For manufacturers, the intelligent and automated analysis of production process data represents, alongside improvements in quality management, a largely untapped potential. The technical challenges, however, are enormous: the data volumes are vast, and the data patterns characteristic of a fault are necessarily unknown. The use of machine learning (ML) methods is a promising approach to enable this search for the proverbial needle in a haystack. Algorithms are meant to learn autonomously from the data to distinguish between normal and conspicuous process behaviour, in order to warn process experts at an early stage. Industry and research have been trying for years to establish such ML systems in production environments. Most ML projects, however, fail before reaching the production phase, or consume enormous resources in operation while delivering no economic value. The goal of this thesis is the development of a technical framework for implementing a scalable ML system for anomaly detection in process data. The training processes for initializing and adapting the models are designed to be highly automatable in order to enable a structured scaling process. The developed DM/ML method makes it possible to reduce the long-term effort of operating the system through an initial additional investment in the model training process, and has proven in practice to be both relatively and absolutely scalable. In this way, complexity at the system level can be reduced to a manageable degree, enabling subsequent system operation.
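The core task the thesis describes, learning normal process behaviour from data and flagging conspicuous deviations, can be sketched with the simplest possible normal-behaviour model. This is an illustrative baseline, not the thesis's DM/ML method: fit per-sensor mean and spread on fault-free reference data, then score new samples by their largest standardized deviation.

```python
import numpy as np

def fit_normal_model(X):
    # Learn "normal" process behaviour from anomaly-free reference data.
    return X.mean(axis=0), X.std(axis=0) + 1e-12

def anomaly_score(X, mean, std):
    # Largest per-sensor deviation, in standard deviations, per sample.
    return np.abs((X - mean) / std).max(axis=1)

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(500, 4))      # normal production data
mean, std = fit_normal_model(X_train)

X_new = np.array([[0.1, -0.2, 0.0, 0.3],           # normal-looking sample
                  [0.0, 8.0, 0.1, -0.1]])          # one wildly deviating sensor
scores = anomaly_score(X_new, mean, std)
flagged = scores > 4.0                              # alert threshold
```

The thesis's point is precisely that such models must be trained and adapted automatically at scale; the scoring logic itself stays this simple.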

    C^2VAE: Gaussian Copula-based VAE Differing Disentangled from Coupled Representations with Contrastive Posterior

    Full text link
    We present a self-supervised variational autoencoder (VAE) to jointly learn disentangled and dependent hidden factors, and then to enhance disentangled representation learning with a self-supervised classifier that eliminates coupled representations in a contrastive manner. To this end, a Contrastive Copula VAE (C^2VAE) is introduced that neither relies on prior knowledge about the data in its probabilistic formulation nor imposes strong modeling assumptions on the posterior in its neural architecture. C^2VAE simultaneously factorizes the posterior (evidence lower bound, ELBO) with a total correlation (TC)-driven decomposition to learn factorized disentangled representations, and extracts the dependencies between hidden features with a neural Gaussian copula to obtain copula-coupled representations. A self-supervised contrastive classifier then differentiates the disentangled representations from the coupled representations, where a contrastive loss regularizes this contrastive classification together with the TC loss, eliminating entangled factors and strengthening disentangled representations. C^2VAE demonstrates a strong effect in enhancing disentangled representation learning. C^2VAE further contributes to improved optimization, addressing TC-based VAE instability and the trade-off between reconstruction and representation.
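The objective the abstract outlines combines ELBO terms, a TC penalty, and a contrastive regularizer. The following is a hypothetical sketch of how such terms might compose; the function names, weights `beta` and `gamma`, and the binary-classifier form of the contrastive term are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contrastive_loss(logits_disentangled, logits_coupled):
    # Binary contrastive objective: the classifier should score disentangled
    # codes as one class and copula-coupled codes as the other.
    p_pos = sigmoid(logits_disentangled)
    p_neg = sigmoid(logits_coupled)
    return -(np.log(p_pos).mean() + np.log(1.0 - p_neg).mean())

def total_loss(recon, kl, tc, logits_d, logits_c, beta=1.0, gamma=1.0):
    # Illustrative composition: ELBO terms (reconstruction + KL), a total
    # correlation penalty, and the contrastive regularizer.
    return recon + kl + beta * tc + gamma * contrastive_loss(logits_d, logits_c)
```

When the classifier separates the two representation types well, the contrastive term vanishes; when it cannot, the added loss pushes the encoder toward codes the classifier can distinguish, i.e. toward cleaner disentanglement.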

    Meta-survey on outlier and anomaly detection

    Full text link
    The impact of outliers and anomalies on model estimation and data processing is of paramount importance, as evidenced by the extensive body of research spanning various fields over several decades: thousands of research papers have been published on the subject. As a consequence, numerous reviews, surveys, and textbooks have sought to summarize the existing literature, encompassing a wide range of methods from both the statistical and data mining communities. While these endeavors to organize and summarize the research are invaluable, they face inherent challenges due to the pervasive nature of outliers and anomalies in all data-intensive applications, irrespective of the specific application field or scientific discipline. As a result, the resulting collection of papers remains voluminous and somewhat heterogeneous. To address the need for knowledge organization in this domain, this paper implements the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection. Employing a classical systematic survey approach, the study collects nearly 500 papers using two specialized scientific search engines. From this comprehensive collection, a subset of 56 papers that claim to be general surveys on outlier detection is selected using a snowball search technique to enhance field coverage. A meticulous quality assessment phase further refines the selection to a subset of 25 high-quality general surveys. Using this curated collection, the paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods. Furthermore, an analysis of the surveys sheds light on the survey writing practices adopted by scholars from different communities who have contributed to this field. Finally, the paper delves into several topics where consensus has emerged from the literature. 
These include taxonomies of outlier types, challenges posed by high-dimensional data, the importance of anomaly scores, the impact of learning conditions, difficulties in benchmarking, and the significance of neural networks. Aspects on which no consensus has emerged are also discussed, particularly the distinction between local and global outliers and the challenges of organizing detection methods into meaningful taxonomies.
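Among the consensus topics the meta-survey identifies is the central role of anomaly scores. A classic example from the distance-based family of methods, used here purely as an illustration of what such a score looks like, is the distance to the k-th nearest neighbour:

```python
import numpy as np

def knn_outlier_scores(X, k=2):
    # Distance-based outlier score: each point's distance to its k-th
    # nearest neighbour; large values indicate global outliers.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D_sorted = np.sort(D, axis=1)     # column 0 is the zero self-distance
    return D_sorted[:, k]             # k-th neighbour, self excluded

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = knn_outlier_scores(X, k=2)
```

Note this is a *global* score; the local-vs-global distinction the surveys disagree on concerns exactly whether such distances should instead be judged relative to each point's neighbourhood density.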

    Knowledge-Guided Data-Centric AI in Healthcare: Progress, Shortcomings, and Future Directions

    Full text link
    The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of examples of a particular concept or meaning. In the field of medicine, having a diverse set of training data on a particular disease can lead to the development of a model that is able to accurately predict the disease. However, despite the potential benefits, there have not been significant advances in image-based diagnosis due to a lack of high-quality annotated data. This article highlights the importance of using a data-centric approach to improve the quality of data representations, particularly in cases where the available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data: data augmentation, transfer learning, federated learning, and GANs (generative adversarial networks). We also propose the use of knowledge-guided GANs to incorporate domain knowledge into the training data generation process. With the recent progress in large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can be used to improve the effectiveness of knowledge-guided generative methods.
    Comment: 21 pages, 13 figures, 4 tables
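The first of the four small-data remedies listed above, data augmentation, expands a limited image dataset with label-preserving transformations. A minimal sketch, with flips and additive noise standing in for the richer augmentation pipelines used in practice:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, n_copies=4):
    # Generate label-preserving variants of one image: random horizontal
    # flips plus small additive noise (rotations/crops omitted for brevity).
    out = []
    for _ in range(n_copies):
        img = image.copy()
        if rng.random() < 0.5:
            img = np.fliplr(img)
        img = img + rng.normal(0.0, 0.01, img.shape)
        out.append(img)
    return np.stack(out)

image = rng.random((8, 8))          # stand-in for a small medical image
augmented = augment(image)          # four synthetic training examples
```

Each variant keeps the diagnostic content of the original while varying its presentation, which is exactly the property augmentation must preserve for medical labels to remain valid.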