13 research outputs found

    Deep Learning for Music Information Retrieval in Limited Data Scenarios.

    Get PDF
    PhD Thesis
    While deep learning (DL) models have achieved impressive results in settings where large amounts of annotated training data are available, overfitting often degrades performance when data is more limited. To improve the generalisation of DL models, we investigate "data-driven priors" that exploit additional unlabelled data or labelled data from related tasks. Unlike techniques such as data augmentation, these priors are applicable across a range of machine listening tasks, since their design does not rely on problem-specific knowledge. We first consider scenarios in which parts of samples can be missing, aiming to make more datasets available for model training. In an initial study focusing on audio source separation (ASS), we exploit additionally available unlabelled music and solo source recordings by using generative adversarial networks (GANs), resulting in higher separation quality. We then present a fully adversarial framework for learning generative models with missing data. Our discriminator consists of separately trainable components that can be combined to train the generator with the same objective as in the original GAN framework. We apply our framework to image generation, image segmentation and ASS, demonstrating superior performance compared to the original GAN. To improve performance on any given MIR task, we also aim to leverage datasets which are annotated for similar tasks. We use multi-task learning (MTL) to perform singing voice detection and singing voice separation with one model, improving performance on both tasks. Furthermore, we employ meta-learning on a diverse collection of ten MIR tasks to find a weight initialisation for a "universal MIR model" so that training the model on any MIR task with this initialisation quickly leads to good performance. Since our data-driven priors encode knowledge shared across tasks and datasets, they are suited for high-dimensional, end-to-end models, instead of small models relying on task-specific feature engineering, such as fixed spectrogram representations of audio commonly used in machine listening. To this end, we propose "Wave-U-Net", an adaptation of the U-Net, which can perform ASS directly on the raw waveform while performing favourably compared to its spectrogram-based counterpart. Finally, we derive "Seq-U-Net" as a causal variant of Wave-U-Net, which performs comparably to Wavenet and the Temporal Convolutional Network (TCN) on a variety of sequence modelling tasks, while being more computationally efficient.
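
    To make the Wave-U-Net idea concrete, here is a minimal sketch of a 1D U-Net that operates directly on raw waveforms with skip connections between its downsampling and upsampling paths, assuming PyTorch. The `TinyWaveUNet` name, layer sizes and depth are illustrative assumptions, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn

class TinyWaveUNet(nn.Module):
    """Minimal sketch of the Wave-U-Net idea: a 1D U-Net applied to raw
    waveforms, with a skip connection between the downsampling and
    upsampling paths. Sizes are illustrative, not the thesis's setup."""

    def __init__(self, channels=16):
        super().__init__()
        self.down1 = nn.Conv1d(1, channels, kernel_size=15, padding=7)
        self.down2 = nn.Conv1d(channels, channels * 2, kernel_size=15, padding=7)
        self.up1 = nn.Conv1d(channels * 2, channels, kernel_size=5, padding=2)
        # The skip connection doubles the channel count before the output layer.
        self.out = nn.Conv1d(channels * 2, 1, kernel_size=1)

    def forward(self, x):                           # x: (batch, 1, time)
        d1 = torch.relu(self.down1(x))               # shallow features
        d2 = torch.relu(self.down2(d1[:, :, ::2]))   # decimate by 2, go deeper
        u1 = torch.relu(self.up1(
            nn.functional.interpolate(d2, size=d1.shape[-1])))  # upsample back
        return torch.tanh(self.out(torch.cat([u1, d1], dim=1)))  # fuse skip

# One separated-source estimate from one second of mono audio at 16 kHz.
source = TinyWaveUNet()(torch.randn(1, 1, 16000))
```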

    AI-generated Content for Various Data Modalities: A Survey

    Full text link
    AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to their wide range of applications and the demonstrated potential of recent works, AIGC developments have attracted significant attention, and AIGC methods have been developed for various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and head), 3D motion, and audio -- each presenting different characteristics and challenges. Furthermore, there have also been many significant developments in cross-modality AIGC methods, where generative methods receive conditioning input in one modality and produce outputs in another; examples include going from various modalities to the image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio modalities. In this paper, we provide a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the various challenges, representative works, and recent technical directions in each setting. We also survey the representative datasets throughout the modalities and present comparative results for various modalities. Finally, we discuss the challenges and potential future research directions.

    Exploring variability in medical imaging

    Get PDF
    Although recent successes of deep learning and novel machine learning techniques have improved the performance of classification and (anomaly) detection in computer vision problems, applying these methods in the medical imaging pipeline remains very challenging. One of the main reasons for this is the amount of variability that is encountered and encapsulated in human anatomy and subsequently reflected in medical images. This fundamental factor impacts most stages of modern medical image processing pipelines. The variability of human anatomy makes it virtually impossible to build large datasets with labels and annotations for each disease for fully supervised machine learning. An efficient way to cope with this is to learn only from normal samples, since such data is much easier to collect. A case study of such an automatic anomaly detection system based on normative learning is presented in this work: a framework for detecting fetal cardiac anomalies during ultrasound screening using generative models that are trained only on normal/healthy subjects. However, despite significant improvements in automatic abnormality detection systems, clinical routine continues to rely exclusively on overburdened medical experts to diagnose and localise abnormalities. Integrating human expert knowledge into the medical image processing pipeline entails uncertainty, which is mainly correlated with inter-observer variability. From the perspective of building an automated medical imaging system, it remains an open issue to what extent this kind of variability and the resulting uncertainty are introduced during the training of a model and how they affect the final performance of the task. Consequently, it is very important to explore the effect of inter-observer variability both on the reliable estimation of a model's uncertainty and on the model's performance in a specific machine learning task. This issue is thoroughly investigated in this work by leveraging automated estimates of machine learning model uncertainty, inter-observer variability and segmentation task performance in lung CT scans. Finally, an overview of existing anomaly detection methods in medical imaging is presented. This state-of-the-art survey covers both conventional pattern recognition methods and deep learning based methods, and is one of the first literature surveys in this specific research area.
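
    As an illustration of the normative-learning idea described above, the sketch below trains a small autoencoder only on normal samples and scores test samples by reconstruction error. The autoencoder stands in for the thesis's generative models (an assumption made for brevity), and all sizes and data are placeholders.

```python
import torch
import torch.nn as nn

# Normative learning sketch: fit a model to normal/healthy data only, then
# flag samples that the model represents poorly as anomalous.
model = nn.Sequential(                 # encoder-decoder over flattened images
    nn.Linear(64 * 64, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),     # low-dimensional "normative" code
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 64 * 64),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

healthy = torch.randn(256, 64 * 64)    # placeholder for healthy training scans
for _ in range(100):                   # fit the model to normal anatomy only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(healthy), healthy)
    loss.backward()
    opt.step()

def anomaly_score(image: torch.Tensor) -> float:
    """Higher reconstruction error = further from the learned normal manifold."""
    with torch.no_grad():
        return nn.functional.mse_loss(model(image), image).item()
```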

    Visual Processing and Latent Representations in Biological and Artificial Neural Networks

    Get PDF
    The human visual system performs the impressive task of converting light arriving at the retina into a useful representation that allows us to make sense of the visual environment. We can navigate easily in the three-dimensional world and recognize objects and their properties, even if they appear from different angles and under different lighting conditions. Artificial systems can also perform well on a variety of complex visual tasks. While they may not be as robust and versatile as their biological counterpart, they have surprising capabilities that are rapidly improving. Studying the two types of systems can help us understand what computations enable the transformation of low-level sensory data into an abstract representation. To this end, this dissertation follows three different pathways. First, we analyze aspects of human perception. The focus is on perception in the peripheral visual field and its relation to texture perception. Our work builds on a texture model that is based on the features of a deep neural network. We start by expanding the model to the temporal domain to capture dynamic textures such as flames or water. Next, we use psychophysical methods to investigate quantitatively whether humans can distinguish natural textures from samples that were generated by a texture model. Finally, we study images that cover the entire visual field and test whether matching the local summary statistics can produce metameric images independent of the image content. Second, we compare the visual perception of humans and machines. We conduct three case studies that focus on the capabilities of artificial neural networks and the potential occurrence of biological phenomena in machine vision. We find that comparative studies are not always straightforward and propose a checklist on how to improve the robustness of the conclusions that we draw from such studies. Third, we address a fundamental discrepancy between human and machine vision. One major strength of biological vision is its robustness to changes in the appearance of image content. For example, for unusual scenarios, such as a cow on a beach, the recognition performance of humans remains high. This ability is lacking in many artificial systems. We discuss on a conceptual level how to robustly disentangle attributes that are correlated during training, and test this on a number of datasets.
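
    The texture model the dissertation builds on summarises a texture through statistics of deep network features. A common formulation of such summary statistics, assumed here purely for illustration, is the Gram matrix of a feature map, as in the Gatys et al. line of work:

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Spatially averaged second-order statistics of a feature map.

    Summarises a texture by the correlations between feature channels,
    discarding spatial layout. features: (channels, height, width).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)      # one row per feature channel
    return flat @ flat.T / (h * w)         # channel-by-channel correlations

# Under such a model, two patches are "the same texture" if their Gram
# matrices (computed over several network layers) match; random tensors
# stand in for real deep features here.
patch_a, patch_b = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
texture_distance = torch.norm(gram_matrix(patch_a) - gram_matrix(patch_b))
```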

    Automatic Image Captioning with Style

    Get PDF
    This thesis connects two core topics in machine learning, vision and language. The problem of choice is image caption generation: automatically constructing natural language descriptions of image content. Previous research into image caption generation has focused on generating purely descriptive captions; I focus on generating visually relevant captions with a distinct linguistic style. Captions with style have the potential to ease communication and add a new layer of personalisation. First, I consider naming variations in image captions, and propose a method for predicting context-dependent names that takes into account visual and linguistic information. This method makes use of a large-scale image caption dataset, which I also use to explore naming conventions and report naming conventions for hundreds of animal classes. Next, I propose the SentiCap model, which relies on recent advances in artificial neural networks to generate visually relevant image captions with positive or negative sentiment. To balance descriptiveness and sentiment, the SentiCap model dynamically switches between two recurrent neural networks, one tuned for descriptive words and one for sentiment words. As the first published model for generating captions with sentiment, SentiCap has influenced a number of subsequent works. I then investigate the sub-task of modelling styled sentences without images. The specific task chosen is sentence simplification: rewriting news article sentences to make them easier to understand. For this task I design a neural sequence-to-sequence model that can work with limited training data, using novel adaptations for word copying and sharing word embeddings. Finally, I present SemStyle, a system for generating visually relevant image captions in the style of an arbitrary text corpus. A shared term space allows a neural network for vision and content planning to communicate with a network for styled language generation. SemStyle achieves competitive results in human and automatic evaluations of descriptiveness and style. As a whole, this thesis presents two complete systems for styled caption generation that are the first of their kind and demonstrate, for the first time, that automatic style transfer for image captions is achievable. Contributions also include novel ideas for object naming and sentence simplification. This thesis opens up inquiries into highly personalised image captions; large-scale visually grounded concept naming; and, more generally, styled text generation with content control.
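
    A minimal sketch of the switching mechanism described for SentiCap, assuming PyTorch: at each step a learned gate blends the predictions of a descriptive RNN and a sentiment RNN. The single-sigmoid gate, names and sizes are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

vocab, hidden = 1000, 256
desc_rnn = nn.LSTMCell(hidden, hidden)   # branch tuned for descriptive words
sent_rnn = nn.LSTMCell(hidden, hidden)   # branch tuned for sentiment words
desc_out = nn.Linear(hidden, vocab)
sent_out = nn.Linear(hidden, vocab)
switch = nn.Linear(2 * hidden, 1)        # per-step mixing weight

def next_word_logits(x, desc_state, sent_state):
    """One decoding step: run both branches, gate between their predictions."""
    hd, cd = desc_rnn(x, desc_state)
    hs, cs = sent_rnn(x, sent_state)
    g = torch.sigmoid(switch(torch.cat([hd, hs], dim=-1)))   # gate in (0, 1)
    logits = g * sent_out(hs) + (1 - g) * desc_out(hd)       # blended word scores
    return logits, (hd, cd), (hs, cs)

# One step with placeholder input and zero-initialised states.
x = torch.randn(1, hidden)
state = (torch.zeros(1, hidden), torch.zeros(1, hidden))
logits, d_state, s_state = next_word_logits(x, state, state)
```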

    Statistical Learning and Inference at Particle Collider Experiments

    Get PDF
    Advances in data analysis techniques may play a decisive role in the discovery reach of particle collider experiments. However, importing expertise and methods from other data-centric disciplines such as machine learning and statistics faces significant hurdles, mainly due to the established use of different language and constructs. A large part of this document, also conceived as an introduction to the description of an analysis searching for non-resonant Higgs pair production in data collected by the CMS detector at the Large Hadron Collider (LHC), is therefore devoted to a broad redefinition of the relevant concepts for problems in experimental particle physics. The aim is to better connect these issues with those in other fields of research, so that the solutions found can be repurposed. The formal exploration of the properties of statistical models at particle colliders is useful to highlight the main challenges posed by statistical inference in this context: the multi-dimensional nature of the models, which can be studied only in a generative manner via forward simulation of observations, and the effect of nuisance parameters. The first issue can be tackled with likelihood-free inference methods coupled with the use of low-dimensional summary statistics, which may be constructed either with machine learning techniques or through physically motivated variables (e.g. event reconstruction). The second issue, the misspecification of the generative model, which is addressed by the inclusion of nuisance parameters, reduces the effectiveness of summary statistics constructed with machine learning techniques. A subset of the data analysis techniques formally discussed in the introductory part of the document is also exploited to study the non-resonant production process pp → HH → bbbb at the LHC in the context of the Standard Model (SM) and its extensions in effective field theories (EFT) based on anomalous couplings of the Higgs field. Data collected in 2016 by the CMS detector, corresponding to a total of 35.9 fb⁻¹ of proton-proton collisions, are used to set a 95% confidence upper limit of 847 fb on the production cross section σ(pp → HH → bbbb) in the SM. Upper limits are also obtained for the cross sections corresponding to a representative set of points in the EFT parameter space. The combination of these results with those obtained from the study of other decay channels of HH pairs is also discussed. In addition, the exercise of reformulating the goals of high energy physics analyses as statistical inference problems is combined with modern machine learning technologies to develop a new technique, referred to as inference-aware neural optimisation. The technique produces summary statistics that directly minimise the expected uncertainty on the parameters of interest, optimally accounting for the effect of nuisance parameters. The application of this technique to a synthetic problem demonstrates that the obtained summary statistics are considerably more effective than those obtained with standard supervised learning methods when the effect of the nuisance parameters is significant. Assuming its scalability to LHC data scenarios, this technique has ground-breaking potential for analyses dominated by systematic uncertainties.
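
    A toy sketch of the inference-aware idea, assuming PyTorch: the network producing the summary statistic is trained to minimise the expected uncertainty on a signal-strength parameter μ, rather than a classification loss. Here the summary is a soft (differentiable) histogram and the uncertainty is the Cramér-Rao bound from the Fisher information of a Poisson counting model; nuisance parameters and all physics detail are omitted, and the data are synthetic placeholders.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 8))

def expected_sigma(signal, background):
    # Soft histogram: softmax bin assignments, summed over events.
    s = torch.softmax(net(signal), dim=1).sum(0)      # expected signal per bin
    b = torch.softmax(net(background), dim=1).sum(0)  # expected background per bin
    lam = s + b                                       # Poisson means at mu = 1
    fisher = (s ** 2 / lam).sum()                     # I(mu) = sum_i s_i^2 / lam_i
    return 1.0 / torch.sqrt(fisher)                   # Cramér-Rao bound on sigma(mu)

sig = torch.randn(500, 4) + 0.5                       # toy signal events
bkg = torch.randn(2000, 4)                            # toy background events
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    expected_sigma(sig, bkg).backward()               # loss = expected uncertainty
    opt.step()
```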

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference

    Get PDF

    A Study on Learning Representations for Relations Between Words

    Get PDF
    Reasoning about relations between words or entities plays an important role in human cognition. It is thus essential for a computational system which processes human languages to be able to understand the semantics of relations in order to simulate human intelligence. Automatic relation learning provides valuable information for many natural language processing tasks including ontology creation, question answering and machine translation, to name a few. This need brings us to the topic of this thesis, where the main goal is to explore multiple resources and methodologies to effectively represent relations between words. How to effectively represent semantic relations between words remains an underexplored problem. One line of research makes use of relational patterns, the linguistic contexts in which two words co-occur in a corpus, to infer a relation between them (e.g., X leads to Y). This approach suffers from data sparseness because not every related word-pair co-occurs, even in a large corpus. In contrast, prior work on learning word embeddings has found that certain relations between words can be captured by applying linear arithmetic operators to the corresponding pre-trained word embeddings. Specifically, it has been shown that the vector offset (expressed as PairDiff) from one word to the other in a pair encodes the relation that holds between them, if any. Such a compositional method addresses data sparseness by inferring a relation from the constituent words of a word-pair and obviates the need for relational patterns. This thesis investigates the best way to compose word embeddings to represent relational instances. A systematic comparison is carried out for unsupervised operators, which in general reveals the superiority of the PairDiff operator across multiple word embedding models and benchmark datasets. Despite this empirical success, no theoretical analysis has been conducted so far explaining why and under what conditions PairDiff is optimal. To this end, a theoretical analysis is conducted for the generalised bilinear operators that can be used to measure the relational distance between two word-pairs. The main conclusion is that, under certain assumptions, the bilinear operator can be simplified to a linear form, of which the widely used PairDiff operator is a special case. Multiple recent works have raised concerns about existing unsupervised operators for inferring relations from pre-trained word embeddings. Thus, this thesis addresses the question of whether it is possible to learn better parametrised relational compositional operators. A supervised relation representation operator is proposed using a non-linear neural network that performs relation prediction. Evaluation on two benchmark datasets reveals that the penultimate layer of the trained neural network-based relational predictor acts as a good representation of the relations between words. Because both relational patterns and word embeddings provide complementary information for learning relations, a self-supervised context-guided relation embedding method trained on both sources of information is proposed. Experimentally, incorporating relational contexts improves the performance of a compositional operator for representing unseen word-pairs. Besides unstructured text corpora, knowledge graphs provide another source of relational facts, in the form of nodes (i.e., entities) connected by edges (i.e., relations). Knowledge graphs are employed widely in natural language processing applications such as question answering and dialogue systems. Embedding entities and relations in a graph has shown impressive results for inferring previously unseen relations between entities. This thesis contributes a theoretical model relating the connections in the graph to the embeddings of entities and relations. Learning graph embeddings that satisfy the proven theorem demonstrates efficient performance compared to existing heuristically derived graph embedding methods. As graph embedding methods generate representations only for existing relation types, a relation composition task is proposed in the thesis to tackle this limitation.
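
    A small numpy sketch of the PairDiff operator discussed above: a word-pair (a, b) is represented by the embedding offset b − a, and two pairs are compared by the cosine similarity of their offsets. The toy vocabulary and random vectors are placeholders for real pre-trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder embeddings; real pre-trained vectors would be loaded here.
embed = {w: rng.normal(size=50) for w in ["paris", "france", "rome", "italy"]}

def pair_diff(a: str, b: str) -> np.ndarray:
    """PairDiff representation of the relation in (a, b): the vector offset."""
    return embed[b] - embed[a]

def relational_similarity(pair1, pair2) -> float:
    """Cosine similarity between the PairDiff representations of two pairs."""
    r1, r2 = pair_diff(*pair1), pair_diff(*pair2)
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

# With real embeddings, a high score would suggest the pairs share a
# relation (here: capital-of).
print(relational_similarity(("paris", "france"), ("rome", "italy")))
```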

    Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning

    Get PDF
    BNAIC/BeneLearn 202