548 research outputs found

    Text Classification

    Get PDF
    There is an abundance of text data in this world but most of it is raw. We need to extract information from this data to make use of it. One way to extract this information from raw text is to apply informative labels drawn from a pre-defined fixed set i.e. Text Classification. In this thesis, we focus on the general problem of text classification, and work towards solving challenges associated to binary/multi-class/multi-label classification. More specifically, we deal with the problem of (i) Zero-shot labels during testing; (ii) Active learning for text screening; (iii) Multi-label classification under low supervision; (iv) Structured label space; (v) Classifying pairs of words in raw text i.e. Relation Extraction. For (i), we use a zero-shot classification model that utilizes independently learned semantic embeddings. Regarding (ii), we propose a novel active learning algorithm that reduces problem of bias in naive active learning algorithms. For (iii), we propose neural candidate-selector architecture that starts from a set of high-recall candidate labels to obtain high-precision predictions. In the case of (iv), we proposed an attention based neural tree decoder that recursively decodes an abstract into the ontology tree. For (v), we propose using second-order relations that are derived by explicitly connecting pairs of words via context token(s) for improved relation extraction. We use a wide variety of both traditional and deep machine learning tools. More specifically, we used traditional machine learning models like multi-valued linear regression and logistic regression for (i, ii), deep convolutional neural networks for (iii), recurrent neural networks for (iv) and transformer networks for (v)

    Deep learning architectures for 2D and 3D scene perception

    Get PDF
    Scene understanding is a fundamental problem in computer vision tasks, that is being more intensively explored in recent years with the development of deep learning. In this dissertation, we proposed deep learning structures to address challenges in 2D and 3D scene perception. We developed several novel architectures for 3D point cloud understanding at city-scale point by effectively capturing both long-range and short-range information to handle the challenging problem of large variations in object size for city-scale point cloud segmentation. GLSNet++ is a two-branch network for multiscale point cloud segmentation that models this complex problem using both global and local processing streams to capture different levels of contextual and structural 3D point cloud information. We developed PointGrad, a new graph convolution gradient operator for capturing structural relationships, that encoded point-based directional gradients into a high-dimensional multiscale tensor space. Using the Point- Grad operator with graph convolution on scattered irregular point sets captures the salient structural information in the point cloud across spatial and feature scale space, enabling efficient learning. We integrated PointGrad with several deep network architectures for large-scale 3D point cloud semantic segmentation, including indoor scene and object part segmentation. In many real application areas including remote sensing and aerial imaging, the class imbalance is common and sufficient data for rare classes is hard to acquire or has high-cost associated with expert labeling. We developed MDXNet for few-shot and zero-shot learning, which emulates the human visual system by leveraging multi-domain knowledge from general visual primitives with transfer learning for more specialized learning tasks in various application domains. We extended deep learning methods in various domains, including the material domain for predicting carbon nanotube forest attributes and mechanical properties, biomedical domain for cell segmentation.Includes bibliographical references

    Learning deep embeddings by learning to rank

    Full text link
    We study the problem of embedding high-dimensional visual data into low-dimensional vector representations. This is an important component in many computer vision applications involving nearest neighbor retrieval, as embedding techniques not only perform dimensionality reduction, but can also capture task-specific semantic similarities. In this thesis, we use deep neural networks to learn vector embeddings, and develop a gradient-based optimization framework that is capable of optimizing ranking-based retrieval performance metrics, such as the widely used Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). Our framework is applied in three applications. First, we study Supervised Hashing, which is concerned with learning compact binary vector embeddings for fast retrieval, and propose two novel solutions. The first solution optimizes Mutual Information as a surrogate ranking objective, while the other directly optimizes AP and NDCG, based on the discovery of their closed-form expressions for discrete Hamming distances. These optimization problems are NP-hard, therefore we derive their continuous relaxations to enable gradient-based optimization with neural networks. Our solutions establish the state-of-the-art on several image retrieval benchmarks. Next, we learn deep neural networks to extract Local Feature Descriptors from image patches. Local features are used universally in low-level computer vision tasks that involve sparse feature matching, such as image registration and 3D reconstruction, and their matching is a nearest neighbor retrieval problem. We leverage our AP optimization technique to learn both binary and real-valued descriptors for local image patches. Compared to competing approaches, our solution eliminates complex heuristics, and performs more accurately in the tasks of patch verification, patch retrieval, and image matching. Lastly, we tackle Deep Metric Learning, the general problem of learning real-valued vector embeddings using deep neural networks. We propose a learning to rank solution through optimizing a novel quantization-based approximation of AP. For downstream tasks such as retrieval and clustering, we demonstrate promising results on standard benchmarks, especially in the few-shot learning scenario, where the number of labeled examples per class is limited

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Full text link
    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, substantial local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve the efficiency. In terms of application scenarios and paradigms, local mechanisms have different characteristics. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. Categorization of local mechanisms in each field is summarized. Then, advantages and disadvantages for every category are analyzed deeply, leaving room for exploration. Finally, future research directions about local mechanisms have also been discussed that may benefit future works. To the best our knowledge, this is the first survey about local mechanisms on computer vision. We hope that this survey can shed light on future research in the computer vision field

    Learning from limited labeled data - Zero-Shot and Few-Shot Learning

    Get PDF
    Human beings have the remarkable ability to recognize novel visual concepts after observing only few or zero examples of them. Deep learning, however, often requires a large amount of labeled data to achieve a good performance. Labeled instances are expensive, difficult and even infeasible to obtain because the distribution of training instances among labels naturally exhibits a long tail. Therefore, it is of great interest to investigate how to learn efficiently from limited labeled data. This thesis concerns an important subfield of learning from limited labeled data, namely, low-shot learning. The setting assumes the availability of many labeled examples from known classes and the goal is to learn novel classes from only a few~(few-shot learning) or zero~(zero-shot learning) training examples of them. To this end, we have developed a series of multi-modal learning approaches to facilitate the knowledge transfer from known classes to novel classes for a wide range of visual recognition tasks including image classification, semantic image segmentation and video action recognition. More specifically, this thesis mainly makes the following contributions. First, as there is no agreed upon zero-shot image classification benchmark, we define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets. Second, in order to tackle the labeled data scarcity, we propose feature generation frameworks that synthesize data in the visual feature space for novel classes. Third, we extend zero-shot learning and few-shot learning to the semantic segmentation task and propose a challenging benchmark for it. We show that incorporating semantic information into a semantic segmentation network is effective in segmenting novel classes. Finally, we develop better video representation for the few-shot video classification task and leverage weakly-labeled videos by an efficient retrieval method.Menschen haben die bemerkenswerte FĂ€higkeit, neuartige visuelle Konzepte zu erkennen, nachdem sie nur wenige oder gar keine Beispiele davon beobachtet haben. Tiefes Lernen erfordert jedoch oft eine große Menge an beschrifteten Daten, um eine gute Leistung zu erzielen. Etikettierte Instanzen sind teuer, schwierig und sogar undurchfĂŒhrbar, weil die Verteilung der Trainingsinstanzen auf die Etiketten naturgemĂ€ĂŸ einen langen Schwanz aufweist. Daher ist es von großem Interesse zu untersuchen, wie man effizient aus begrenzten gelabelten Daten lernen kann. Diese These betrifft einen wichtigen Teilbereich des Lernens aus begrenzt gelabelten Daten, nĂ€mlich das Low-Shot-Lernen. Das Setting setzt die VerfĂŒgbarkeit vieler gelabelter Beispiele aus bekannten Klassen voraus, und das Ziel ist es, neuartige Klassen aus nur wenigen (few-shot learning) oder null (zero-shot learning) Trainingsbeispielen davon zu lernen. Zu diesem Zweck haben wir eine Reihe von multimodalen LernansĂ€tzen entwickelt, um den Wissenstransfer von bekannten Klassen zu neuartigen Klassen fĂŒr ein breites Spektrum von visuellen Erkennungsaufgaben zu erleichtern, darunter Bildklassifizierung, semantische Bildsegmentierung und Videoaktionserkennung. Genauer gesagt, leistet diese Arbeit hauptsĂ€chlich die folgenden BeitrĂ€ge. Da es keinen vereinbarten Benchmark fĂŒr die Zero-Shot- Bildklassifikation gibt, definieren wir zunĂ€chst einen neuen Benchmark, indem wir sowohl die Evaluierungsprotokolle als auch die Datensplits öffentlich zugĂ€nglicher DatensĂ€tze vereinheitlichen. Zweitens schlagen wir zur BewĂ€ltigung der etikettierten Datenknappheit einen Rahmen fĂŒr die Generierung von Merkmalen vor, der Daten im visuellen Merkmalsraum fĂŒr neuartige Klassen synthetisiert. Drittens dehnen wir das Zero-Shot-Lernen und das few-Shot-Lernen auf die semantische Segmentierungsaufgabe aus und schlagen dafĂŒr einen anspruchsvollen Benchmark vor. Wir zeigen, dass die Einbindung semantischer Informationen in ein semantisches Segmentierungsnetz bei der Segmentierung neuartiger Klassen effektiv ist. Schließlich entwickeln wir eine bessere Videodarstellung fĂŒr die Klassifizierungsaufgabe ”few-shot video” und nutzen schwach markierte Videos durch eine effiziente Abrufmethode.Max Planck Institute Informatic
    • 

    corecore