322 research outputs found

    A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)

    Full text link
    The alignment of heterogeneous sequential data (video to text) is an important and challenging problem. Standard techniques for this task, including Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from inherent drawbacks. Mainly, the Markov assumption implies that, given the immediate past, future alignment decisions are independent of further history. The separation between similarity computation and alignment decision also prevents end-to-end training. In this paper, we propose an end-to-end neural architecture where alignment actions are implemented as moving data between stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture supports a large variety of alignment tasks, including one-to-one, one-to-many, skipping unmatched elements, and (with extensions) non-monotonic alignment. Extensive experiments on semi-synthetic and real datasets show that our algorithm outperforms state-of-the-art baselines.Comment: Accepted at CVPR 2018 (Spotlight). arXiv file includes the paper and the supplemental materia

    Mapping (Dis-)Information Flow about the MH17 Plane Crash

    Get PDF
    Digital media enables not only fast sharing of information, but also disinformation. One prominent case of an event leading to circulation of disinformation on social media is the MH17 plane crash. Studies analysing the spread of information about this event on Twitter have focused on small, manually annotated datasets, or used proxys for data annotation. In this work, we examine to what extent text classifiers can be used to label data for subsequent content analysis, in particular we focus on predicting pro-Russian and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though we find that a neural classifier improves over a hashtag based baseline, labeling pro-Russian and pro-Ukrainian content with high precision remains a challenging problem. We provide an error analysis underlining the difficulty of the task and identify factors that might help improve classification in future work. Finally, we show how the classifier can facilitate the annotation task for human annotators

    Supervision distante pour l'apprentissage de structures discursives dans les conversations multi-locuteurs

    Get PDF
    L'objectif principal de cette thèse est d'améliorer l'inférence automatique pour la modélisation et la compréhension des communications humaines. En particulier, le but est de faciliter considérablement l'analyse du discours afin d'implémenter, au niveau industriel, des outils d'aide à l'exploration des conversations. Il s'agit notamment de la production de résumés automatiques, de recommandations, de la détection des actes de dialogue, de l'identification des décisions, de la planification et des relations sémantiques entre les actes de dialogue afin de comprendre les dialogues. Dans les conversations à plusieurs locuteurs, il est important de comprendre non seulement le sens de l'énoncé d'un locuteur et à qui il s'adresse, mais aussi les relations sémantiques qui le lient aux autres énoncés de la conversation et qui donnent lieu à différents fils de discussion. Une réponse doit être reconnue comme une réponse à une question particulière ; un argument, comme un argument pour ou contre une proposition en cours de discussion ; un désaccord, comme l'expression d'un point de vue contrasté par rapport à une autre idée déjà exprimée. Malheureusement, les données de discours annotées à la main et de qualités sont coûteuses et prennent du temps, et nous sommes loin d'en avoir assez pour entraîner des modèles d'apprentissage automatique traditionnels, et encore moins des modèles d'apprentissage profond. Il est donc nécessaire de trouver un moyen plus efficace d'annoter en structures discursives de grands corpus de conversations multi-locuteurs, tels que les transcriptions de réunions ou les chats. Un autre problème est qu'aucune quantité de données ne sera suffisante pour permettre aux modèles d'apprentissage automatique d'apprendre les caractéristiques sémantiques des relations discursives sans l'aide d'un expert ; les données sont tout simplement trop rares. Les relations de longue distance, dans lesquelles un énoncé est sémantiquement connecté non pas à l'énoncé qui le précède immédiatement, mais à un autre énoncé plus antérieur/tôt dans la conversation, sont particulièrement difficiles et rares, bien que souvent centrales pour la compréhension. Notre objectif dans cette thèse a donc été non seulement de concevoir un modèle qui prédit la structure du discours pour une conversation multipartite sans nécessiter de grandes quantités de données annotées manuellement, mais aussi de développer une approche qui soit transparente et explicable afin qu'elle puisse être modifiée et améliorée par des experts.The main objective of this thesis is to improve the automatic capture of semantic information with the goal of modeling and understanding human communication. We have advanced the state of the art in discourse parsing, in particular in the retrieval of discourse structure from chat, in order to implement, at the industrial level, tools to help explore conversations. These include the production of automatic summaries, recommendations, dialogue acts detection, identification of decisions, planning and semantic relations between dialogue acts in order to understand dialogues. In multi-party conversations it is important to not only understand the meaning of a participant's utterance and to whom it is addressed, but also the semantic relations that tie it to other utterances in the conversation and give rise to different conversation threads. An answer must be recognized as an answer to a particular question; an argument, as an argument for or against a proposal under discussion; a disagreement, as the expression of a point of view contrasted with another idea already expressed. Unfortunately, capturing such information using traditional supervised machine learning methods from quality hand-annotated discourse data is costly and time-consuming, and we do not have nearly enough data to train these machine learning models, much less deep learning models. Another problem is that arguably, no amount of data will be sufficient for machine learning models to learn the semantic characteristics of discourse relations without some expert guidance; the data are simply too sparse. Long distance relations, in which an utterance is semantically connected not to the immediately preceding utterance, but to another utterance from further back in the conversation, are particularly difficult and rare, though often central to comprehension. It is therefore necessary to find a more efficient way to retrieve discourse structures from large corpora of multi-party conversations, such as meeting transcripts or chats. This is one goal this thesis achieves. In addition, we not only wanted to design a model that predicts discourse structure for multi-party conversation without requiring large amounts of hand-annotated data, but also to develop an approach that is transparent and explainable so that it can be modified and improved by experts. The method detailed in this thesis achieves this goal as well

    Deep Architectures for Visual Recognition and Description

    Get PDF
    In recent times, digital media contents are inherently of multimedia type, consisting of the form text, audio, image and video. Several of the outstanding computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the field of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches available today and attempts their classification and analysis based on different parameters, viz., type of captioning methods (generation/retrieval), type of learning models employed, the desired output description length generated, etc. This dissertation also attempts to critically analyze the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the resultant video descriptions generated. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages are also included. In this study a novel approach for video captioning on the Microsoft Video Description (MSVD) dataset and Microsoft Video-to-Text (MSR-VTT) dataset is proposed using supervised learning techniques to train a deep combinational framework, for achieving better quality video captioning via predicting semantic tags. We develop simple shallow CNN (2D and 3D) as feature extractors, Deep Neural Networks (DNNs and Bidirectional LSTMs (BiLSTMs) as tag prediction models and Recurrent Neural Networks (RNNs) (LSTM) model as the language model. The aim of the work was to provide an alternative narrative to generating captions from videos via semantic tag predictions and deploy simpler shallower deep model architectures with lower memory requirements as solution so that it is not very memory extensive and the developed models prove to be stable and viable options when the scale of the data is increased. This study also successfully employed deep architectures like the Convolutional Neural Network (CNN) for speeding up automation process of hand gesture recognition and classification of the sign languages of the Indian classical dance form, ‘Bharatnatyam’. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single hand gestures belonging to 27 classes that were collected from (i) Google search engine (Google images), (ii) YouTube videos (dynamic and with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds). 2) exploring the effectiveness of CNNs for identifying and classifying the single hand gestures by optimizing the hyperparameters, and 3) evaluating the impacts of transfer learning and double transfer learning, which is a novel concept explored for achieving higher classification accuracy

    Relation Classification with Limited Supervision

    Get PDF
    Large reams of unstructured data, for instance in form textual document collections containing entities and relations, exist in many domains. The process of deriving valuable domain insights and intelligence from such documents collections usually involves the extraction of information such as the relations between the entities in such collections. Relation classification is the task of detecting relations between entities. Supervised machine learning models, which have become the tool of choice for relation classification, require substantial quantities of annotated data for each relation in order to perform optimally. For many domains, such quantities of annotated data for relations may not be readily available, and manually curating such annotations may not be practical due to time and cost constraints. In this work, we develop both model-specific and model-agnostic approaches for relation classification with limited supervision. We start by proposing an approach for learning embeddings for contextual surface patterns, which are the set of surface patterns associated with entity pairs across a text corpus, to provide additional supervision signals for relation classification with limited supervision. We find that this approach improves classification performance on relations with limited supervision instances. However, this initial approach assumes the availability of at least one annotated instance per relation during training. In order to address this limitation, we propose an approach which formulates the task of relation classification as that of textual entailment. This reformulation allows us to use the textual descriptions of relations to classify their instances. It also allows us to utilize existing textual entailment datasets and models to classify relations with zero supervision instances. The two methods proposed previously rely on the use of specific model architectures for relation classification. Since a wide variety of models have been proposed for relation classification in the literature, a more general approach is thus desirable. We subsequently propose our first model-agnostic meta-learning algorithm for relation classification with limited supervision. This algorithm is applicable to any gradient-optimized relation classification model. We show that the proposed approach improves the predictive performance of two existing relation classification models when supervision for relations is limited. Next, because all the approaches we have proposed so far assume the availability of all supervision needed for classifying relations prior to model training, they are unable to handle the case when new supervision for relations becomes available after training. Such new supervision may need to be incorporated into the model to enable it classify new relations or to improve its performance on existing relations. Our last approach addresses this short-coming. We propose a model-agnostic algorithm which enables relation classification models to learn continually from new supervision as it becomes available, while doing so in a data-efficient manner and without forgetting knowledge of previous relations

    Data-driven machine translation for sign languages

    Get PDF
    This thesis explores the application of data-driven machine translation (MT) to sign languages (SLs). The provision of an SL MT system can facilitate communication between Deaf and hearing people by translating information into the native and preferred language of the individual. We begin with an introduction to SLs, focussing on Irish Sign Language - the native language of the Deaf in Ireland. We describe their linguistics and mechanics including similarities and differences with spoken languages. Given the lack of a formalised written form of these languages, an outline of annotation formats is discussed as well as the issue of data collection. We summarise previous approaches to SL MT, highlighting the pros and cons of each approach. Initial experiments in the novel area of example-based MT for SLs are discussed and an overview of the problems that arise when automatically translating these manual-visual languages is given. Following this we detail our data-driven approach, examining the MT system used and modifications made for the treatment of SLs and their annotation. Through sets of automatically evaluated experiments in both language directions, we consider the merits of data-driven MT for SLs and outline the mainstream evaluation metrics used. To complete the translation into SLs, we discuss the addition and manual evaluation of a signing avatar for real SL output
    corecore