175 research outputs found
Semantics for vision-and-language understanding
Recent advancements in Artificial Intelligence have led to breakthroughs in many heterogeneous scientific fields, such as the prediction of protein structures or self-driving cars. These results are obtained by means of Machine Learning techniques, which make it possible to automatically learn, from the available annotated examples, a mathematical model capable of solving the task. One of its sub-fields, Deep Learning, brought further improvements by making it possible to also compute an informative and non-redundant representation for each example through the same learning process. To successfully solve the task under analysis, the model needs to overcome the generalization gap: it needs to work well both on the training data and on examples drawn from the same distribution but never observed at training time. Several heuristics are often used to close this gap, such as the introduction of inductive biases when modeling the data or the use of regularization techniques; a popular alternative, however, consists of collecting and annotating more examples in the hope that they cover the cases not previously observed. Indeed, recent state-of-the-art solutions use hundreds of millions or even billions of annotated examples, and the underlying trend seems to imply that collecting and annotating ever more examples should be the predominant way to overcome the generalization gap. However, in many fields, e.g. medicine, it is difficult to collect such a large amount of examples, and producing high-quality annotations is even more arduous and costly.
During my Ph.D. and in this thesis, I designed and proposed several solutions which address the generalization gap in three different domains by leveraging semantic aspects of the available data. The first part of the thesis covers techniques which create new annotations for the data under analysis: data augmentation techniques, used to compute variations of the annotations by means of semantics-preserving transformations, and transfer learning, used in the scope of this thesis to automatically generate textual descriptions for a set of images. In the second part of the thesis, the gap is reduced by customizing the training objective based on the semantics of the annotations. By means of these customizations, a problem is shifted from the commonly used single-task setting to a multi-task learning setting by designing an additional task, and two variations of a standard loss function are then proposed which introduce semantic knowledge into the training process.
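The loss customizations summarized above can be illustrated with a toy sketch (the function names, the similarity-based penalty, and the auxiliary-task weight are illustrative assumptions, not the thesis's exact formulations):

```python
import math

def cross_entropy(probs, target):
    # Standard single-task loss: negative log-likelihood of the target class.
    return -math.log(probs[target])

def semantic_cross_entropy(probs, target, similarity):
    # Hypothetical semantics-aware variation: probability mass placed on
    # classes that are semantically similar to the target is penalised less.
    penalty = sum(p * (1.0 - similarity[target][c])
                  for c, p in enumerate(probs) if c != target)
    return cross_entropy(probs, target) + penalty

def multitask_loss(main_loss, aux_loss, weight=0.5):
    # Single-task -> multi-task shift: weighted sum with an auxiliary-task loss.
    return main_loss + weight * aux_loss
```

With a similarity matrix in hand, confusing the target with a semantically close class contributes a smaller penalty than confusing it with an unrelated one.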
Towards Interaction-level Video Action Understanding
A huge number of videos are created, spread, and viewed daily. Among these massive videos, human actions and activities account for a large part. We want machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving cars, security systems, human-robot interaction and healthcare. Towards a real intelligent system that is able to interact with humans, video understanding must go beyond simply answering "what is the action in the video?", and be more aware of what those actions mean to humans and more in line with human thinking, which we call interaction-level action understanding. This thesis identifies three main challenges on the way to interaction-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames to retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., how humans decide which parts of an action sequence are essential). For the second challenge, our works on action quality assessment utilize transformer decoders to parse the input action into several sub-actions and assess the more fine-grained qualities of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language.
We demonstrate the utility of this technique for the video captioning task, which takes an action video as input, outputs natural language, and yields state-of-the-art performance. It can be concluded that the research directions and methods introduced in this thesis provide fundamental components toward interaction-level action understanding.
Evolution of A Common Vector Space Approach to Multi-Modal Problems
A set of methods to address computer vision problems has been developed. Video understanding has been an active area of research in recent years. If one can accurately identify salient objects in a video sequence, these components can be used in information retrieval and scene analysis. This research started with the development of a coarse-to-fine framework to extract salient objects in video sequences. Previous work on image and video frame background modeling involved methods that ranged from simple and efficient to accurate but computationally complex. It will be shown in this research that the novel approach to object extraction is both efficient and effective, outperforming existing state-of-the-art methods. However, the drawback of this method is its inability to deal with non-rigid motion.
With the rapid development of artificial neural networks, deep learning approaches are explored as a solution to computer vision problems in general. Focusing on image and text, image (or video frame) understanding can be achieved using a common vector space (CVS). With this concept, modality generation and other relevant applications, such as automatic image description and text paraphrasing, can be explored. Specifically, video sequences can be modeled by Recurrent Neural Networks (RNN); greater depth of the RNN leads to smaller error, but makes the gradient in the network unstable during training. To overcome this problem, a Batch-Normalized Recurrent Highway Network (BNRHN) was developed and tested on the image captioning (image-to-text) task. In BNRHN, the highway layers incorporate batch normalization, which diminishes the gradient vanishing and exploding problem. In addition, a sentence-to-vector encoding framework suitable for advanced natural language processing is developed. This semantic text embedding makes use of an encoder-decoder model trained on sentence paraphrase pairs (text-to-text). With this scheme, the latent representation of the text is shown to encode sentences with common semantic information into similar vector representations. In addition to image-to-text and text-to-text, an image generation model is developed to generate an image from text (text-to-image) or from another image (image-to-image) based on the semantics of the content. The developed model, referred to as the Multi-Modal Vector Representation (MMVR), builds and encodes different modalities into a common vector space, achieving the goals of preserving semantics and of making conversion between text and image bidirectional. The concept of the CVS is introduced in this research to deal with multi-modal conversion problems. In theory, this method works not only on text and image, but can also be generalized to other modalities, such as video and audio.
The characteristics and performance are supported by both theoretical analysis and experimental results. Interestingly, the MMVR model is only one of many possible ways to build a CVS. In the final stages of this research, a simple and straightforward framework to build a CVS, considered an alternative to the MMVR model, is presented.
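The CVS idea can be illustrated in miniature: each modality is projected into a shared space where semantic similarity reduces to vector similarity (hand-written matrices and toy vectors stand in for the learned MMVR encoders):

```python
def project(vec, matrix):
    # Linear map from a modality-specific feature space into the shared space.
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cosine(u, v):
    # In the common space, semantic similarity reduces to vector similarity.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

# Hand-written projections for a text vector and an image vector (the real
# MMVR encoders are learned neural networks, not fixed matrices).
text_to_cvs = [[1.0, 0.0], [0.0, 1.0]]
image_to_cvs = [[0.0, 1.0], [1.0, 0.0]]

text_vec = project([0.9, 0.1], text_to_cvs)    # e.g. a caption embedding
image_vec = project([0.1, 0.9], image_to_cvs)  # e.g. an image embedding
```

Because both projections land in the same space, a caption and a semantically matching image end up close under the cosine measure, which is what makes bidirectional text-image conversion possible in principle.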
Natural-language video description with deep recurrent neural networks
For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting meaning from video pixels and generating a sentence description is a very complex problem. The goal of this thesis is to develop models that can automatically generate natural language descriptions for events in videos. It presents several approaches to automatic video description, building on recent advances in “deep” machine learning. The techniques presented in this thesis view the task of video description as akin to machine translation, treating the video domain as a source “language” and using deep neural net architectures to “translate” videos to text.
Specifically, I develop video captioning techniques using a unified deep neural network with both convolutional and recurrent structure, modeling the temporal elements in videos and language with deep recurrent neural networks. In my initial approach, I adapt a model that can learn from paired images and captions to transfer knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that can jointly model a sequence of video frames and a sequence of words. To further improve grammaticality and descriptive quality, I also propose methods to integrate linguistic knowledge from plain text corpora. Additionally, I show that such linguistic knowledge can help describe novel objects unseen in paired image/video-caption data. Finally, moving beyond short video clips, I present methods to process longer multi-activity videos, specifically to jointly segment and describe coherent event sequences in movies.
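The "translation" view of video captioning can be sketched as an encoder that pools frame features and a decoder that emits words greedily (both functions and the `score` callback are illustrative stand-ins for the deep networks described above):

```python
def encode(frame_features):
    # Mean-pool per-frame features into one video vector (a stand-in for
    # the convolutional + recurrent encoder described in the thesis).
    n = len(frame_features)
    return [sum(f[i] for f in frame_features) / n
            for i in range(len(frame_features[0]))]

def decode(video_vec, vocab, score, max_len=10):
    # Greedy word-by-word generation (a stand-in for the recurrent decoder);
    # `score` is a hypothetical conditional word-scoring function.
    sentence = []
    while len(sentence) < max_len:
        word = max(vocab, key=lambda w: score(video_vec, sentence, w))
        if word == "<eos>":
            break
        sentence.append(word)
    return sentence
```

In the real models the scoring function is the recurrent network's conditional word distribution, and beam search typically replaces the greedy argmax shown here.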
User-centred video abstraction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has imposed the need for technologies capable of producing condensed but semantically rich versions of an input video stream in an effective manner. Consequently, the topic of Video Summarisation is becoming increasingly popular in the multimedia community, and numerous video abstraction approaches have been proposed accordingly. These techniques can be divided into two major categories, automatic and semi-automatic, according to the required level of human intervention in the summarisation process. The fully automated methods mainly adopt low-level visual, aural and textual features alongside mathematical and statistical algorithms to extract the most significant segments of the original video. However, the effectiveness of this type of technique is restricted by a number of factors, such as domain-dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques, by contrast, attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user's subjectivity and other external contributing factors, such as distraction, can deteriorate the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three user-centred, effective video summarisation techniques that can be applied to different video categories and generate satisfactory results. In our first proposed approach, a novel mechanism for user-centred video summarisation is presented for scenarios in which multiple actors are employed in the summarisation process, in order to minimise the negative effects of relying on a single user.
Based on our recommended algorithm, the video frames were initially scored by a group of video annotators 'on the fly'. These assigned scores were then averaged to generate a single saliency score for each video frame and, finally, the highest-scored video frames, alongside the corresponding audio and textual content, were extracted to be included in the final summary. The effectiveness of our approach has been assessed by comparing the video summaries it generates against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method delivers remarkable outcomes in terms of Overall Satisfaction and Precision, with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. In an attempt to provide a better user experience, we have proposed a personalised video summarisation method with the ability to customise the generated summaries in accordance with the viewers' preferences. Accordingly, the end-user's priority levels towards different video scenes were captured and utilised to update the average scores previously assigned by the video annotators. Then, our earlier proposed summarisation method was adopted to extract the most significant audio-visual content of the video. Experimental results indicated the capability of this approach to deliver superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the required level of audience involvement for personalisation by proposing a new method for producing personalised video summaries. Accordingly, SIFT visual features were adopted to identify the semantic categories of video scenes.
By fusing this retrieved data with pre-built user profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared with our previously recommended algorithm and the three other automatic summarisation techniques.
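The score-averaging and frame-selection core of the first approach can be sketched as follows (toy annotator scores; the audio/textual extraction and the 'on the fly' annotation interface are not modelled):

```python
def summarise(annotator_scores, k):
    # annotator_scores: one list of per-frame scores per annotator.
    n_frames = len(annotator_scores[0])
    # Average across annotators into a single saliency score per frame,
    # damping any single user's subjectivity.
    saliency = [sum(s[i] for s in annotator_scores) / len(annotator_scores)
                for i in range(n_frames)]
    # Keep the k highest-scored frames, returned in temporal order.
    top = sorted(range(n_frames), key=lambda i: saliency[i], reverse=True)[:k]
    return sorted(top)
```

The personalised variant described above would adjust `saliency` with per-viewer scene preferences before the top-k selection.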
ADVISE: advanced digital video information segmentation engine.
by Chung-Wing Ng. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 100-107). Abstracts in English and Chinese. Contents: Chapter 1, Introduction (image-based video description; video summary; video matching; contributions; outline of thesis). Chapter 2, Literature Review (video retrieval in digital video libraries: the VISION and INFORMEDIA projects; video structuring: video segmentation, color histogram extraction, further structuring; XML technologies: XML syntax, Document Type Definition (DTD), Extensible Stylesheet Language (XSL); SMIL technology: SMIL syntax, model of SMIL applications). Chapter 3, Overview of ADVISE (objectives; system architecture: video preprocessing module, web-based video retrieval module, video streaming server). Chapter 4, Construction of the Video Table-of-Contents (V-ToC) (video structuring: terms and definitions, regional color histograms, video shot boundary detection, video group formation, video scene formation; storage and presentation: definition of the XML video structure, V-ToC presentation using XSL; evaluation of the video structure). Chapter 5, Video Summarization (terms and definitions; video features used for summarization; summarization algorithm: combining, scoring, selecting and refining extracted video segments; video summary in SMIL; evaluations: percentages of features extracted, evaluation of the refinement process). Chapter 6, Video Matching Using V-ToC (terms and definitions; video features used for matching; non-ordered and ordered tree matching algorithms; evaluation of video matching). Chapter 7, Conclusion. Bibliography.
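The contents above name color-histogram-based shot boundary detection as the first video structuring step. A minimal sketch of that standard heuristic (toy grayscale "frames" and an ad hoc threshold; ADVISE's actual regional color histograms are richer):

```python
def histogram(frame, bins=4):
    # Toy intensity histogram: count pixel values into coarse bins.
    h = [0] * bins
    for px in frame:
        h[min(px * bins // 256, bins - 1)] += 1
    return h

def shot_boundaries(frames, threshold):
    # Declare a shot boundary wherever consecutive frame histograms differ
    # strongly (the classic histogram-difference heuristic; the threshold
    # here is ad hoc and would be tuned in practice).
    cuts = []
    prev = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = histogram(frame)
        if sum(abs(a - b) for a, b in zip(prev, cur)) > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```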
Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges
The past decade has witnessed the rapid development of machine learning (ML) and deep learning (DL) methodologies in agricultural systems, showcased by great successes in a variety of agricultural applications. However, these conventional ML/DL models have certain limitations: they heavily rely on large, costly-to-acquire labeled datasets for training, require specialized expertise for development and maintenance, and are mostly tailored for specific tasks, thus lacking generalizability. Recently, foundation models (FMs) have demonstrated remarkable successes in language and vision tasks across various domains. These models are trained on a vast amount of data from multiple domains and modalities. Once trained, they can accomplish versatile tasks with just minor fine-tuning and minimal task-specific labeled data. Despite their proven effectiveness and huge potential, there has been little exploration of applying FMs to agriculture. Therefore, this study aims to explore the potential of FMs in the field of smart agriculture. In particular, we present conceptual tools and technical background to facilitate the understanding of the problem space and uncover new research directions in this field. To this end, we first review recent FMs in the general computer science domain and categorize them into four categories: language FMs, vision FMs, multimodal FMs, and reinforcement learning FMs. Subsequently, we outline the process of developing agricultural FMs (AFMs) and discuss their potential applications in smart agriculture. We also discuss the unique challenges associated with developing AFMs, including model training, validation, and deployment. Through this study, we contribute to the advancement of AI in agriculture by introducing AFMs as a promising paradigm that can significantly mitigate the reliance on extensive labeled datasets and enhance the efficiency, effectiveness, and generalization of agricultural AI systems.
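The "minor fine-tuning with minimal labeled data" workflow can be sketched in miniature: a frozen pretrained backbone supplies features, and only a small head is trained on the few task-specific labels (a pure-Python toy linear head with hypothetical inputs, not any specific FM API):

```python
def finetune_head(features, labels, lr=0.1, epochs=200):
    # FM-style adaptation sketch: a (pretrained, frozen) backbone has already
    # mapped raw inputs to `features`; only this small linear head is trained,
    # here by plain stochastic gradient descent on squared error.
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b
```

Because the backbone's parameters never change, the labeled-data requirement shrinks to whatever is needed to fit this small head, which is the efficiency argument made for AFMs above.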
Content-based video indexing for sports applications using integrated multi-modal approach
This thesis presents research based on an integrated multi-modal approach to sports video indexing and retrieval. By combining specific features extractable from multiple (audio-visual) modalities, generic structure and specific events can be detected and classified. During browsing and retrieval, users benefit from the integration of high-level semantics and some descriptive mid-level features, such as the whistle and close-up views of player(s). The main objective is to contribute to the three major components of sports video indexing systems. The first component is a set of powerful techniques to extract audio-visual features and semantic content automatically. The main purposes are to reduce manual annotation and to summarize lengthy content into a compact, meaningful and more enjoyable presentation. The second component is an expressive and flexible indexing technique that supports gradual index construction. The indexing scheme is essential in determining the methods by which users can access a video database. The third and last component is a query language that can generate dynamic video summaries for smart browsing and support user-oriented retrieval.
Reformulation and Decomposition: Multitask learning approaches to Long Document Problems
Recent advances in Natural Language Processing (NLP) have led to success across a wide range of tasks, including machine translation, summarization, and classification. Yet the field still faces major challenges. This thesis addresses two key under-researched areas: the absence of general multitask learning capabilities, and the inability to scale to long, complex documents. Firstly, this thesis explores a form of multitasking where NLP tasks are reformulated as question answering problems. I examine existing models and measure their robustness to paraphrasing of their input. I contribute an annotated dataset which enables detailed analysis of model failures, as well as evaluating methods for improving model robustness. Secondly, a set of long-document tasks, MuLD, is introduced, which forms a benchmark for evaluating the performance of models on large inputs with long-range dependencies. I show that this is challenging for baseline models. I then design an approach using task decomposition to provide an interpretable solution which easily allows for multitask learning. I then explore how these themes of task reformulation for multitask learning and task decomposition for long inputs can be applied to other modalities. I show how visual modelling, a visual analogue of language modelling, can be used to predict missing frames from videos of simple physics simulations, and probe what knowledge about the physical world this induces in such models. Finally, I demonstrate how this task can be used to unite vision and NLP in the same framework, describing how task reformulation and task decomposition can be used for this purpose.
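The two themes, reformulation and decomposition, can each be sketched in a few lines (the templates, helper names, and chunking scheme are illustrative, not the thesis's actual implementations):

```python
def as_question(task, text, label_options):
    # Reformulation: an NLP task instance becomes a question-answering
    # problem over the input text, so one QA model can serve many tasks.
    templates = {
        "sentiment": "Is the sentiment of the following text positive or negative?",
        "topic": "What is the topic of the following text?",
    }
    return {"question": templates[task], "context": text, "options": label_options}

def decompose(document, chunk_size):
    # Decomposition: split a long document into chunks that a fixed-context
    # model can handle; per-chunk answers are then recombined downstream.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
```

Putting the two together, a long-document task becomes a set of per-chunk QA instances, which is what makes the approach both interpretable and naturally multitask.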