
    The Road to General Intelligence

    Humans have always dreamed of automating laborious physical and intellectual tasks, but the latter has proved more elusive than naively suspected. Seven decades of systematic study of Artificial Intelligence have witnessed cycles of hubris and despair. The successful realization of General Intelligence (evidenced by the kind of cross-domain flexibility enjoyed by humans) will spawn an industry worth billions and transform the range of viable automation tasks. The recent notable successes of Machine Learning have led to conjecture that it might be the appropriate technology for delivering General Intelligence. In this book, we argue that the framework of machine learning is fundamentally at odds with any reasonable notion of intelligence and that essential insights from previous decades of AI research are being forgotten. We claim that a fundamental change in perspective is required, mirroring that which took place in the philosophy of science in the mid 20th century. We propose a framework for General Intelligence, together with a reference architecture that emphasizes the need for anytime bounded rationality and a situated denotational semantics. We give due emphasis to compositional reasoning, with the required compositionality provided via principled symbolic-numeric inference mechanisms based on universal constructions from category theory.
    • Details the pragmatic requirements for real-world General Intelligence.
    • Describes how machine learning fails to meet these requirements.
    • Provides a philosophical basis for the proposed approach.
    • Provides mathematical detail for a reference architecture.
    • Describes a research program intended to address issues of concern in contemporary AI.
    The book includes an extensive bibliography, with ~400 entries covering the history of AI and many related areas of computer science and mathematics. The target audience is the entire gamut of Artificial Intelligence/Machine Learning researchers and industrial practitioners. There is a mixture of descriptive and rigorous sections, according to the nature of the topic. Undergraduate mathematics is in general sufficient. Familiarity with category theory is advantageous for a complete understanding of the more advanced sections, but these may be skipped by the reader who desires an overall picture of the essential concepts. This is an open access book.
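    As a purely illustrative aside (not the book's reference architecture), the flavour of "compositionality via universal constructions" can be shown concretely for the simplest such construction, the categorical product, in the category of Python types and functions: the mediating morphism is uniquely determined by its two components.

    # Illustrative sketch only: the universal property of a product in the
    # category of (Python) types and functions. Not taken from the book.
    from typing import Callable, Tuple, TypeVar

    A, B, X = TypeVar("A"), TypeVar("B"), TypeVar("X")

    # Projections out of the product A x B.
    def fst(p: Tuple[A, B]) -> A:
        return p[0]

    def snd(p: Tuple[A, B]) -> B:
        return p[1]

    def pair(f: Callable[[X], A], g: Callable[[X], B]) -> Callable[[X], Tuple[A, B]]:
        """The unique mediating morphism <f, g>: X -> A x B satisfying
        fst . <f, g> == f and snd . <f, g> == g."""
        return lambda x: (f(x), g(x))

    # Example: composing two independent "observations" of the same input.
    h = pair(len, str.upper)                  # str -> (int, str)
    assert h("general") == (7, "GENERAL")
    assert fst(h("general")) == len("general")
    assert snd(h("general")) == "general".upper()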

    Multimodal Integration for Natural Language Classification and Generation

    Multimodal integration is a framework for building models that can accept information from different types of modalities. Due to the recent success of the Transformer model and pre-training/fine-tuning techniques, Vision-and-Language Pre-training Models (VL-PMs) have been heavily investigated and have achieved state-of-the-art results in a variety of Vision-and-Language downstream tasks, such as Visual Question Answering, Image-Text Matching, and Image Captioning. However, most previous studies focus on improving the performance of the models and only provide code accessible for research purposes. There are several existing open-source libraries, such as the Natural Language Toolkit, OpenCV, and HuggingFace, which combine and standardise the available models and tools for easy access, but applying these libraries still requires expertise in both deep learning and programming. Moreover, there has been no recent research aimed at establishing user-friendly multimodal question-answering platforms for non-deep-learning users. Therefore, the question of how state-of-the-art multimodal models can be easily applied by professionals in other domains remains open. A second challenge arises in less common domains: general multimodal domains such as street views, landscapes, and indoor scenes have been extensively studied with current VL-PMs, while specific domains like medicine, geography, and esports have garnered less attention. Due to the difficulties of data collection, there are few publicly available multimodal datasets, and those that exist tend to be small. This scarcity poses challenges for model training. Consequently, the question of how to collect a comprehensive multimodal dataset in the esports domain and how to improve domain-specific multimodal models remains open. The main focus of this thesis is therefore integrating multimodal information for natural language classification and generation tasks by addressing these two challenges.
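    As a hedged illustration of the accessibility gap discussed above (and not of the platform developed in the thesis), a pre-trained vision-and-language model can already be queried in a few lines through the HuggingFace pipeline API; the checkpoint name below is a publicly available example, and the image path is hypothetical. A domain-specific setting such as esports would require a fine-tuned model in its place.

    # Minimal sketch: visual question answering with an off-the-shelf VL model
    # via the HuggingFace `transformers` pipeline API. The checkpoint is an
    # illustrative public example, not the model used in the thesis.
    from transformers import pipeline
    from PIL import Image

    vqa = pipeline(
        task="visual-question-answering",
        model="dandelin/vilt-b32-finetuned-vqa",
    )

    image = Image.open("match_screenshot.png")   # hypothetical input image
    answers = vqa(image=image, question="How many players are visible?", top_k=3)

    # Each candidate is a dict with an answer string and a confidence score.
    for candidate in answers:
        print(f"{candidate['answer']}: {candidate['score']:.3f}")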

    Lifelong Learning in the Clinical Open World

    Despite mounting evidence that data drift causes deep learning models to deteriorate over time, the majority of medical imaging research is developed for - and evaluated on - static closed-world environments. There have been exciting advances in the automatic detection and segmentation of diagnostically relevant findings. Yet the few studies that attempt to validate their performance in actual clinics are met with disappointing results and little utility as perceived by healthcare professionals. This is largely due to the many factors that introduce shifts in the distribution of medical image data, from changes in acquisition practices to naturally occurring variations in the patient population and disease manifestation. If we truly wish to leverage deep learning technologies to alleviate the workload of clinicians and drive forward the democratization of health care, we must move away from closed-world assumptions and start designing systems for the dynamic open world.
    This entails, first, the establishment of reliable quality assurance mechanisms with methods from the fields of uncertainty estimation, out-of-distribution detection, and domain-aware prediction appraisal. Part I of the thesis summarizes my contributions to this area. I first propose two approaches that identify outliers by monitoring a self-supervised objective or by quantifying the distance to training samples in a low-dimensional latent space. I then explore how to maximize the diversity among members of a deep ensemble for improved calibration and robustness, and present a lightweight method to detect low-quality lung lesion segmentation masks using domain knowledge.
    Of course, detecting failures is only the first step. We ideally want to train models that are reliable in the open world for a large portion of the data. Out-of-distribution generalization and domain adaptation may increase robustness, but only to a certain extent. As time goes on, models can only maintain acceptable performance if they continue learning with newly acquired cases that reflect changes in the data distribution. The goal of continual learning is to adapt to changes in the environment without forgetting previous knowledge. One practical strategy is expansion, whereby multiple parametrizations of the model are trained and the most appropriate one is selected during inference. In the second part of the thesis, I present two expansion-based methods that do not rely on information regarding when or how the data distribution changes.
    Even when appropriate mechanisms are in place to fail safely and accumulate knowledge over time, this will only translate to clinical usage insofar as the regulatory framework allows it. Current regulations in the USA and the European Union only authorize locked systems that do not learn post-deployment. Fortunately, regulatory bodies are noting the need for a modern lifecycle regulatory approach. I review these efforts, along with other practical aspects of developing systems that learn throughout their lifecycle, in the third part of the thesis.
    We are finally at a stage where healthcare professionals and regulators are embracing deep learning. The number of commercially available diagnostic radiology systems is also quickly rising. This opens up our chance - and responsibility - to show that these systems can be safe and effective throughout their lifespan.
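    One of the quality-assurance ideas summarized above, flagging outliers by their distance to training samples in a low-dimensional latent space, can be sketched generically as follows. The k-nearest-neighbour distance, the percentile threshold, and the random stand-in latents are illustrative assumptions, not the exact method proposed in Part I of the thesis.

    # Illustrative sketch of latent-distance out-of-distribution detection.
    # The latents would come from any trained feature extractor (assumed here);
    # the k-NN score and 95th-percentile threshold are generic choices.
    import numpy as np

    def knn_ood_scores(train_latents: np.ndarray,
                       test_latents: np.ndarray,
                       k: int = 5) -> np.ndarray:
        """Score each test sample by its mean distance to the k nearest
        training samples in latent space (higher = more out-of-distribution)."""
        scores = []
        for z in test_latents:
            dists = np.linalg.norm(train_latents - z, axis=1)
            scores.append(np.sort(dists)[:k].mean())
        return np.array(scores)

    # Usage: calibrate a threshold on held-out in-distribution data, then flag
    # incoming cases whose score exceeds it for review instead of silent prediction.
    rng = np.random.default_rng(0)
    train_z = rng.normal(size=(500, 32))          # latents of training images
    val_z = rng.normal(size=(100, 32))            # held-out in-distribution latents
    new_z = rng.normal(loc=3.0, size=(10, 32))    # shifted (out-of-distribution) cases

    threshold = np.percentile(knn_ood_scores(train_z, val_z), 95)
    flags = knn_ood_scores(train_z, new_z) > threshold
    print(f"{flags.sum()} of {len(flags)} new cases flagged as out-of-distribution")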

    Grounded Semantic Reasoning for Robotic Interaction with Real-World Objects

    Robots are increasingly transitioning from specialized, single-task machines to general-purpose systems that operate in unstructured environments, such as homes, offices, and warehouses. In these real-world domains, robots need to manipulate novel objects while adapting to changes in environments and goals. Semantic knowledge, which concisely describes target domains with symbols, can potentially reveal the meaningful patterns shared between problems and environments. However, existing robots cannot yet effectively reason about semantic data encoding complex relational knowledge, nor jointly reason about symbolic semantic data and the multimodal data pertinent to robotic manipulation (e.g., object point clouds, 6-DoF poses, and attributes detected with multimodal sensing). This dissertation develops semantic reasoning frameworks capable of modeling complex semantic knowledge grounded in robot perception and action. We show that grounded semantic reasoning enables robots to more effectively perceive, model, and interact with objects in real-world environments. Specifically, this dissertation makes the following contributions:
    (1) a survey providing a unified view of the diversity of works in the field by formulating semantic reasoning as the integration of knowledge sources, computational frameworks, and world representations;
    (2) a method for predicting missing relations in large-scale knowledge graphs by leveraging type hierarchies of entities, effectively avoiding ambiguity while maintaining generalization of multi-hop reasoning patterns;
    (3) a method for predicting unknown properties of objects in various environmental contexts, outperforming prior knowledge graph and statistical relational learning methods due to the use of n-ary relations for modeling object properties;
    (4) a method for purposeful robotic grasping that accounts for a broad range of contexts (including object visual affordance, material, state, and task constraint), outperforming existing approaches in novel contexts and for unknown objects;
    (5) a systematic investigation into the generalization of task-oriented grasping that includes a benchmark dataset of 250k grasps, and a novel graph neural network that incorporates semantic relations into end-to-end learning of 6-DoF grasps;
    (6) a method for rearranging novel objects into semantically meaningful spatial structures based on high-level language instructions, more effectively capturing multi-object spatial constraints than existing pairwise spatial representations;
    (7) a novel planning-inspired approach that iteratively optimizes placements of partially observed objects subject to both physical constraints and semantic constraints inferred from language instructions.
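    To make the contrast between binary triples and the n-ary relations mentioned in contribution (3) concrete, the toy sketch below (entirely illustrative, not the dissertation's model) stores object-property facts together with their environmental context, so that a query can distinguish, say, what a cup typically contains in a kitchen versus an office.

    # Toy sketch of n-ary object-property facts: each fact binds an object, a
    # property, a value, and an environmental context in a single relation,
    # rather than flattening them into binary (object, property) triples.
    # The facts and contexts are invented for illustration.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PropertyFact:
        obj: str
        prop: str
        value: str
        context: str

    FACTS = [
        PropertyFact("cup", "contains", "coffee", "office"),
        PropertyFact("cup", "contains", "water", "kitchen"),
        PropertyFact("cup", "material", "ceramic", "kitchen"),
        PropertyFact("sponge", "state", "wet", "kitchen"),
    ]

    def query(obj: str, prop: str, context: str) -> list:
        """Return candidate values for (obj, prop) restricted to one context."""
        return [f.value for f in FACTS
                if f.obj == obj and f.prop == prop and f.context == context]

    print(query("cup", "contains", "kitchen"))   # ['water']
    print(query("cup", "contains", "office"))    # ['coffee']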

    Deep Visual Parsing with Limited Supervision

    Scene parsing entails interpretation of the visual world in terms of meaningful semantic concepts. Automatically performing such analysis with machine learning techniques is not a purely scientific endeavour. It holds transformative potential for emerging technologies, such as autonomous driving and robotics, where deploying a human expert can be economically unfeasible or hazardous. Recent methods based on deep learning have made substantial progress towards realising this potential. However, to achieve high accuracy on application-specific formulations of the scene parsing task, such as semantic segmentation, deep learning models require significant amounts of high-quality dense annotation. Obtaining such supervision with human labour is costly and time-consuming. Therefore, reducing the need for precise annotation without sacrificing model accuracy is essential when it comes to deploying these models at scale. In this dissertation, we advance towards this goal by progressively reducing the amount of required supervision in the context of semantic image segmentation, where the aim is to label every pixel in the image with its semantic category. We formulate and implement four novel deep learning techniques operating under varying levels of task supervision.
    First, we develop a recurrent model for instance segmentation, which sequentially predicts one object mask at a time. Sequential models can exploit temporal context: segmenting prominent instances first may disambiguate mask prediction for hard objects (e.g. due to occlusion) later on. However, such an advantageous ordering of predictions is typically unavailable. Our proposed actor-critic framework discovers such orderings and provides empirical accuracy benefits compared to a baseline without this capacity.
    Second, we consider weakly supervised semantic segmentation. This problem setting requires the model to produce object masks with only image-level labels available as training supervision. In contrast to previous works, we approach this problem with a practical single-stage model. Despite its simple design, it produces highly accurate segmentation, competitive with, or even improving upon, several multi-stage methods.
    Reducing the amount of supervision further, we next study unsupervised domain adaptation. In this scenario, there are no labels available for real-world data; instead, we may only use the labels of synthetically generated visual scenes. We propose a novel approach, which adapts the segmentation model trained on synthetic data to unlabelled real-world images using pseudo-labels. Crucially, we construct these pseudo-annotations by leveraging the equivariance of the semantic segmentation task to similarity transformations. At the time of publication, our adaptation framework achieved state-of-the-art accuracy, on some benchmarks even substantially surpassing that of previous art.
    Last, we present an unsupervised technique for representation learning. We define the desired representation to be useful for the task of video object segmentation, which requires establishing dense object-level correspondences in video sequences. Learning such features efficiently in a fully convolutional regime is prone to degenerate solutions, yet our approach circumvents them with a simple and effective mechanism based on the already familiar model equivariance to similarity transformations. We empirically show that our framework attains new state-of-the-art video segmentation accuracy at a significantly reduced computational cost.
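    The equivariance-based pseudo-labelling described above can be sketched generically: average a segmentation network's predictions over transformed copies of an image (undoing each transform on the outputs), then keep only confident pixels as pseudo-labels. The choice of transforms (flip and rescale), the 0.9 confidence threshold, and the placeholder model are assumptions for illustration, not the published method's exact configuration.

    # Generic sketch of pseudo-label construction via equivariance to simple
    # transformations. `model` maps (1, 3, H, W) images to (1, C, h, w) logits.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def pseudo_labels(model, image, threshold=0.9, scales=(1.0, 0.75)):
        """image: (1, 3, H, W). Returns (labels, mask) where mask marks
        confident pixels; unconfident pixels should be ignored in the loss."""
        _, _, H, W = image.shape
        probs, n = 0.0, 0
        for s in scales:
            for flip in (False, True):
                x = F.interpolate(image, scale_factor=s, mode="bilinear",
                                  align_corners=False)
                if flip:
                    x = torch.flip(x, dims=[3])
                logits = model(x)                       # (1, C, h, w)
                if flip:                                # undo the flip
                    logits = torch.flip(logits, dims=[3])
                logits = F.interpolate(logits, size=(H, W), mode="bilinear",
                                       align_corners=False)  # undo the rescale
                probs = probs + logits.softmax(dim=1)
                n += 1
        probs = probs / n
        conf, labels = probs.max(dim=1)                 # (1, H, W) each
        return labels, conf > threshold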

    Deep 3D Information Prediction and Understanding

    3D information prediction and understanding play significant roles in 3D visual perception. For 3D information prediction, recent studies have demonstrated the superiority of deep neural networks. Despite the great success of deep learning, many challenging issues remain to be solved. One crucial issue is how to learn deep models in an unsupervised framework. In this thesis, we take monocular depth estimation as an example and study this problem by exploring domain adaptation techniques. Apart from prediction from a single image or multiple images, depth can also be estimated from multi-modal data, such as RGB images coupled with 3D laser scan data. Since the 3D data are usually sparse and irregularly distributed, we are required to model the contextual information in the sparse data and fuse the multi-modal features. We examine these issues by studying the depth completion task. For 3D information understanding, such as point cloud analysis, the sparsity and unordered nature of 3D point clouds mean that new operations capable of modeling local geometric shape are required in place of conventional convolutions. We design a basic operation for point cloud analysis by introducing a novel adaptive edge-to-edge interaction learning module. Moreover, due to the diversity of 3D laser scanner configurations, the captured 3D data often vary from dataset to dataset in object size, density, and viewpoint. As a result, domain generalization in 3D data analysis is also a critical problem. We study this issue in 3D shape classification by proposing an entropy regularization term. Through these four specific tasks, this thesis addresses several crucial issues in deep 3D information prediction and understanding, including model design, multi-modal fusion, sparse data analysis, unsupervised learning, domain adaptation, and domain generalization.
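    As a hedged illustration of the entropy regularization idea mentioned for 3D shape classification, such a term can be added to a standard classification loss as shown below; the weight `lam` and the sign convention (penalizing high prediction entropy) are generic assumptions, not the thesis's exact objective.

    # Generic sketch: cross-entropy classification loss plus an entropy
    # regularization term on the predicted class distribution.
    import torch
    import torch.nn.functional as F

    def classification_loss_with_entropy(logits, targets, lam=0.1):
        """logits: (N, C) raw scores; targets: (N,) integer class labels."""
        ce = F.cross_entropy(logits, targets)
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        return ce + lam * entropy

    # Usage with random stand-in data:
    logits = torch.randn(8, 10, requires_grad=True)
    targets = torch.randint(0, 10, (8,))
    loss = classification_loss_with_entropy(logits, targets)
    loss.backward()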