
    Semantics for vision-and-language understanding

    Get PDF
    Recent advances in Artificial Intelligence have led to breakthroughs in many heterogeneous scientific fields, such as protein structure prediction and self-driving cars. These results are obtained by means of Machine Learning techniques, which automatically learn, from the available annotated examples, a mathematical model capable of solving the task. One of its sub-fields, Deep Learning, brought further improvements by also learning an informative and non-redundant representation for each example as part of the same training process. To solve the task under analysis, the model needs to overcome the generalization gap: it must work well both on the training data and on examples drawn from the same distribution but never observed at training time. Several heuristics are often used to close this gap, such as introducing inductive biases when modeling the data or applying regularization techniques; a popular alternative is to collect and annotate more examples in the hope of covering the previously unobserved cases. Indeed, recent state-of-the-art solutions use hundreds of millions or even billions of annotated examples, and the underlying trend seems to imply that collecting and annotating ever more examples is the prominent way to overcome the generalization gap. However, in many fields, such as medicine, it is difficult to collect such a large number of examples, and producing high-quality annotations is even more arduous and costly. During my Ph.D. and in this thesis, I designed and proposed several solutions which address the generalization gap in three different domains by leveraging semantic aspects of the available data. The first part of the thesis covers techniques which create new annotations for the data under analysis: data augmentation techniques, which compute variations of the annotations by means of semantics-preserving transformations, and transfer learning, used here to automatically generate textual descriptions for a set of images. In the second part of the thesis, the gap is reduced by customizing the training objective based on the semantics of the annotations: a problem is shifted from the commonly used single-task setting to a multi-task learning setting by designing an additional task, and two variations of a standard loss function are proposed which introduce semantic knowledge into the training process.
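    The multi-task customization described in the abstract can be pictured with a minimal sketch (PyTorch). This is an illustration of the general pattern only, not the thesis's actual code; all names, dimensions, and the weighting factor are assumptions. A shared feature extractor feeds two heads, and the loss adds a semantics-derived auxiliary term to the primary objective.

    ```python
    # Illustrative sketch: a single-task classifier turned multi-task by
    # attaching an auxiliary, semantics-derived objective (all names assumed).
    import torch
    import torch.nn as nn

    class MultiTaskHead(nn.Module):
        def __init__(self, feat_dim: int, n_classes: int, n_aux: int):
            super().__init__()
            self.main = nn.Linear(feat_dim, n_classes)  # original task head
            self.aux = nn.Linear(feat_dim, n_aux)       # auxiliary semantic task

        def forward(self, feats):
            return self.main(feats), self.aux(feats)

    def multitask_loss(main_logits, aux_logits, y_main, y_aux, alpha=0.3):
        # Weighted sum of the two objectives; alpha balances the auxiliary
        # semantic signal against the primary task (value is an assumption).
        ce = nn.functional.cross_entropy
        return ce(main_logits, y_main) + alpha * ce(aux_logits, y_aux)
    ```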

    Towards Interaction-level Video Action Understanding

    Get PDF
    Huge numbers of videos are created, spread, and viewed daily, and human actions and activities account for a large part of them. We want machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction, and healthcare. Towards a truly intelligent system able to interact with humans, video understanding must go beyond simply answering "what is the action in the video?"; it must be aware of what those actions mean to humans and be more in line with human thinking, which we call interaction-level action understanding. This thesis identifies three main challenges on the way to interaction-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information, the proposed model is capable of understanding video from human consensus (e.g., how humans decide which parts of an action sequence are essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the fine-grained quality of the given action, yielding action understanding given specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique on the video captioning task, which takes an action video as input and outputs natural language, and achieve state-of-the-art performance. It can be concluded that the research directions and methods introduced in this thesis provide fundamental components of interaction-level action understanding.
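    The consensus-driven summarization idea can be made concrete with a minimal self-attention frame scorer in the spirit of the model described above. This is a hedged sketch, not the thesis's architecture; the class name, feature dimension, and head count are assumptions.

    ```python
    # Minimal sketch: score each frame's importance after letting every frame
    # attend to all others (names and sizes are assumptions).
    import torch
    import torch.nn as nn

    class FrameScorer(nn.Module):
        def __init__(self, feat_dim=1024, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, frames):  # frames: (batch, time, feat_dim)
            ctx, _ = self.attn(frames, frames, frames)  # frame-to-frame context
            return torch.sigmoid(self.score(ctx)).squeeze(-1)  # importance in [0, 1]
    ```

    Frames scoring above a threshold would be kept for the summary; in a consensus setting the targets come from aggregated human annotations.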

    Evolution of A Common Vector Space Approach to Multi-Modal Problems

    Get PDF
    A set of methods to address computer vision problems has been developed. Video understanding has been an active area of research in recent years. If one can accurately identify salient objects in a video sequence, these components can be used in information retrieval and scene analysis. This research started with the development of a coarse-to-fine framework to extract salient objects in video sequences. Previous work on image and video-frame background modeling ranged from simple and efficient to accurate but computationally complex. This research shows that the proposed object-extraction approach is both efficient and effective, outperforming existing state-of-the-art methods; its drawback is an inability to deal with non-rigid motion. With the rapid development of artificial neural networks, deep learning approaches are then explored as a solution to computer vision problems in general. Focusing on image and text, image (or video frame) understanding can be achieved using a Common Vector Space (CVS). With this concept, modality generation and other relevant applications, such as automatic image description and text paraphrasing, can be explored. Specifically, video sequences can be modeled by Recurrent Neural Networks (RNNs); greater RNN depth leads to smaller error, but it also makes the gradients in the network unstable during training. To overcome this problem, a Batch-Normalized Recurrent Highway Network (BNRHN) was developed and tested on the image captioning (image-to-text) task. In BNRHN, the highway layers incorporate batch normalization, which diminishes the vanishing and exploding gradient problems. In addition, a sentence-to-vector encoding framework suitable for advanced natural language processing is developed. This semantic text embedding uses an encoder-decoder model trained on sentence paraphrase pairs (text-to-text); with this scheme, the latent representation is shown to encode sentences with common semantic information into similar vector representations. In addition to image-to-text and text-to-text, an image generation model is developed to generate an image from text (text-to-image) or from another image (image-to-image) based on the semantics of the content. The developed model, referred to as the Multi-Modal Vector Representation (MMVR), encodes different modalities into a common vector space, preserving semantics and making conversion between text and image bidirectional. The concept of the CVS is introduced in this research to deal with multi-modal conversion problems; in theory, the method works not only on text and images but can also be generalized to other modalities, such as video and audio. The characteristics and performance are supported by both theoretical analysis and experimental results. Notably, the MMVR model is only one of many possible ways to build a CVS; in the final stages of this research, a simple and straightforward framework for building a CVS, considered as an alternative to the MMVR model, is presented.
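    The common-vector-space idea can be sketched as a symmetric contrastive alignment step: two encoders (assumed given) map images and texts into the same space, and matching pairs are pulled together. This is a generic illustration of the CVS concept, not the MMVR training objective itself; the temperature and function names are assumptions.

    ```python
    # Hedged sketch: align image and text embeddings in one shared space.
    import torch
    import torch.nn.functional as F

    def cvs_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature  # pairwise similarities
        targets = torch.arange(len(img), device=img.device)
        # Matching image/text pairs sit on the diagonal of the logit matrix.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
    ```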

    ADVISE: advanced digital video information segmentation engine.

    Get PDF
    by Chung-Wing Ng. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 100-107). Abstracts in English and Chinese.
    Contents: Chapter 1, Introduction (image-based video description; video summary; video matching; contributions; outline of thesis). Chapter 2, Literature Review (video retrieval in digital video libraries: the VISION and INFORMEDIA projects; video structuring: video segmentation, color histogram extraction, further structuring; XML technologies: XML syntax, Document Type Definition (DTD), Extensible Stylesheet Language (XSL); SMIL technology: SMIL syntax, model of SMIL applications). Chapter 3, Overview of ADVISE (objectives; system architecture: video preprocessing module, web-based video retrieval module, video streaming server). Chapter 4, Construction of the Video Table-of-Contents (V-ToC) (video structuring: terms and definitions, regional color histograms, video shot boundary detection, video group formation, video scene formation; storage and presentation: XML video structure definition, V-ToC presentation using XSL; evaluation of the video structure). Chapter 5, Video Summarization (terms and definitions; video features used for summarization; summarization algorithm: combining, scoring, selecting, and refining extracted video segments; video summary in SMIL; evaluations of feature extraction and the refinement process). Chapter 6, Video Matching Using V-ToC (terms and definitions; video features used for matching; non-ordered and ordered tree matching algorithms; evaluation of both). Chapter 7, Conclusion. Bibliography.
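    The V-ToC construction above rests on color histograms and shot-boundary detection. A minimal sketch of that classic technique follows (Python/OpenCV; the thesis predates these libraries, and this simplified whole-frame version, its bin count, and its threshold are all assumptions rather than the thesis's regional scheme).

    ```python
    # Sketch of histogram-difference cut detection (parameters are assumptions).
    import cv2
    import numpy as np

    def frame_histogram(frame, bins=16):
        # Joint 3-channel color histogram, normalised so frame size is irrelevant.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3,
                            [0, 256] * 3).flatten()
        return hist / (hist.sum() + 1e-8)

    def detect_cuts(frames, threshold=0.4):
        cuts, prev = [], None
        for i, frame in enumerate(frames):
            h = frame_histogram(frame)
            if prev is not None and np.abs(h - prev).sum() / 2 > threshold:
                cuts.append(i)  # large histogram change => likely shot boundary
            prev = h
        return cuts
    ```

    Detected shots would then be grouped into video groups and scenes to form the V-ToC tree.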

    Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges

    Full text link
    The past decade has witnessed the rapid development of machine learning (ML) and deep learning (DL) methodologies in agricultural systems, showcased by great successes in a variety of agricultural applications. However, these conventional ML/DL models have certain limitations: they rely heavily on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored to specific tasks, thus lacking generalizability. Recently, foundation models (FMs) have demonstrated remarkable success in language and vision tasks across various domains. These models are trained on vast amounts of data from multiple domains and modalities; once trained, they can accomplish versatile tasks with only minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture. Therefore, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, we present conceptual tools and technical background to facilitate understanding of the problem space and to uncover new research directions in this field. To this end, we first review recent FMs in the general computer science domain and categorize them into four groups: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Subsequently, we outline the process of developing agricultural foundation models (AFMs) and discuss their potential applications in smart agriculture. We also discuss the unique challenges associated with developing AFMs, including model training, validation, and deployment. Through this study, we contribute to the advancement of AI in agriculture by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.
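    The "minor fine-tuning with minimal labeled data" pattern the survey describes can be sketched as freezing a pretrained backbone and training only a small task head. This is a hedged illustration; here an ImageNet-pretrained ResNet stands in for a vision FM, and the class count and learning rate are assumptions.

    ```python
    # Sketch: adapt a pretrained backbone to a small labeled agricultural task.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights

    backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
    for p in backbone.parameters():
        p.requires_grad = False  # keep the pretrained features fixed

    # Replace the classifier head, e.g. for 4 hypothetical crop-disease classes.
    backbone.fc = nn.Linear(backbone.fc.in_features, 4)

    optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    # ...train only the head on the small task-specific labeled set...
    ```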

    Content-based video indexing for sports applications using integrated multi-modal approach

    Full text link
    This thesis presents research based on an integrated multi-modal approach to sports video indexing and retrieval. By combining specific features extractable from multiple (audio-visual) modalities, both generic structure and specific events can be detected and classified. During browsing and retrieval, users benefit from the integration of high-level semantics with descriptive mid-level features such as a whistle or a close-up view of the player(s). The main objective is to contribute to the three major components of sports video indexing systems. The first component is a set of powerful techniques to extract audio-visual features and semantic content automatically; the main purposes are to reduce manual annotation and to summarize lengthy content into a compact, meaningful, and more enjoyable presentation. The second component is an expressive and flexible indexing technique that supports gradual index construction; the indexing scheme is essential in determining how users can access a video database. The third and last component is a query language that can generate dynamic video summaries for smart browsing and support user-oriented retrieval.
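    The integration idea can be illustrated with a minimal fusion rule over per-second detector outputs. This is a hedged, simplified sketch under the assumption that audio (whistle) and visual (close-up) detectors already exist; the window size and labels are invented for illustration.

    ```python
    # Sketch: fuse mid-level audio and visual cues into indexable candidate events.
    def fuse_events(whistle_seconds, closeup_seconds, window=3):
        """Each argument is a set of second-indices flagged by one detector."""
        events = []
        for t in sorted(whistle_seconds):
            # A whistle followed closely by a close-up view often marks a
            # refereeing event (foul, goal); the window is an assumption.
            if any(t <= c <= t + window for c in closeup_seconds):
                events.append({"time": t, "label": "candidate_key_event"})
        return events
    ```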

    Reformulation and Decomposition: Multitask learning approaches to Long Document Problems

    Get PDF
    Recent advances in Natural Language Processing (NLP) have led to success across a wide range of tasks, including machine translation, summarization, and classification. Yet the field still faces major challenges. This thesis addresses two key under-researched areas: the absence of general multitask learning capabilities, and the inability to scale to long, complex documents. Firstly, it explores a form of multitasking in which NLP tasks are reformulated as question answering problems. I examine existing models and measure their robustness to paraphrasing of their input, and I contribute an annotated dataset which enables detailed analysis of model failures as well as evaluation of methods for improving model robustness. Secondly, a set of long-document tasks, MuLD, is introduced, forming a benchmark for evaluating the performance of models on large inputs with long-range dependencies; I show that these tasks are challenging for baseline models. I then design an approach using task decomposition to provide an interpretable solution which easily allows for multitask learning. Finally, I explore how these themes of task reformulation for multitask learning and task decomposition for long inputs can be applied to other modalities. I show how visual modelling, a visual analogue of language modelling, can be used to predict missing frames from videos of simple physics simulations, and I probe what knowledge about the physical world this induces in such models. I demonstrate how this task can unite vision and NLP within the same framework, describing how task reformulation and task decomposition can be used for this purpose.
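    The reformulation-as-QA idea can be made concrete with a small sketch: a classification example is rewritten as a question-answering instance so that a single QA model can serve many tasks. The question templates, field names, and label sets here are illustrative assumptions, not the thesis's actual dataset schema.

    ```python
    # Sketch: cast a classification example as a QA instance (names assumed).
    def as_qa(task: str, text: str, label_set: list[str]) -> dict:
        question = {
            "sentiment": "Is the sentiment of the passage positive or negative?",
            "topic": f"Which topic best describes the passage: {', '.join(label_set)}?",
        }[task]
        return {"context": text, "question": question, "answers": label_set}

    example = as_qa("sentiment", "The film was a delight.", ["positive", "negative"])
    # A QA model then picks its answer from label_set given context + question.
    ```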