20 research outputs found

    A First Summarization System of a Video in a Target Language

    In this paper, we present the first results of the AMIS project (Access Multilingual Information opinionS), funded by Chist-Era. The main goal of this project is to understand the content of a video in a foreign language. We treat understanding as the ability to capture the most important ideas conveyed by a piece of media in a foreign language; in other words, understanding is approached through the global meaning of the content as a whole rather than through the meaning of each fragment of the video. Several obstacles remain before this goal can be reached. They concern the following aspects: video summarization, speech recognition, machine translation and speech segmentation. All these issues are discussed, and the methods used to develop each of these components are presented. A first implementation has been achieved, and each component of this system is evaluated on representative test data. We also propose a protocol for a global subjective evaluation of AMIS.
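The cascade of components listed above (speech segmentation, speech recognition, machine translation, summarization) can be sketched as a simple pipeline; every function body below is an invented placeholder standing in for a real AMIS component, not the project's actual code.

```python
# Illustrative AMIS-style cascade. Each stage is a stub: in the real system
# these would be a speech segmenter, an ASR engine, an MT engine and a
# summarizer, each evaluated separately (as the paper describes).

def segment_speech(audio):
    # Placeholder segmenter: fixed-length chunks instead of real
    # speech/non-speech detection.
    return [audio[i:i + 10] for i in range(0, len(audio), 10)]

def transcribe(segment):
    # Placeholder ASR stage.
    return f"transcript({segment})"

def translate(text, target="en"):
    # Placeholder MT stage.
    return f"{target}:{text}"

def summarise(sentences, k=2):
    # Placeholder extractive summariser: keep the k longest sentences.
    return sorted(sentences, key=len, reverse=True)[:k]

def summarise_video(audio, target="en"):
    """Chain the stages; an error in any stage is carried onward."""
    segments = segment_speech(audio)
    translated = [translate(transcribe(s), target) for s in segments]
    return summarise(translated)
```

Chaining the stages this way is what makes end-to-end evaluation hard: an ASR error is silently propagated into translation and summarization, which motivates the per-component and global subjective evaluations described above.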

    Summarizing videos into a target language: Methodology, architectures and evaluation

    The aim of this work is to report the results of the Chist-Era project AMIS (Access Multilingual Information opinionS). The purpose of AMIS is to answer the following question: how can information in a foreign language be made accessible to everyone? The issue is not limited to translating a source video into a target-language video, since the objective is to provide only the main idea of an Arabic video in English. This objective requires research in several areas that have not all reached maturity: video summarization, speech recognition, machine translation, audio summarization and speech segmentation. In this article we present several possible architectures for achieving our objective, but focus on only one of them. The scientific challenges are presented, and we explain how we deal with them. One of the big challenges of this work is to devise a way to objectively evaluate a system composed of several components, knowing that each of them has its limits and can propagate errors to subsequent components. A subjective evaluation procedure is also proposed, in which several annotators were recruited to assess the quality of the resulting summaries.

    Speech segmentation and speaker diarisation for transcription and translation

    This dissertation outlines work related to Speech Segmentation – segmenting an audio recording into regions of speech and non-speech – and Speaker Diarization – further segmenting those regions into those pertaining to homogeneous speakers. Knowing not only what was said but also who said it and when has many useful applications. As well as providing a richer level of transcription for speech, we will show how such knowledge can improve Automatic Speech Recognition (ASR) system performance and can also benefit downstream Natural Language Processing (NLP) tasks such as machine translation and punctuation restoration. While segmentation and diarization may appear to be relatively simple tasks to describe, in practice we find that they are very challenging and are, in general, ill-defined problems. Therefore, we first provide a formalisation of each of the problems as the sub-division of speech within acoustic space and time. Here, we see that the task can become very difficult when we want to partition this domain into our target classes of speakers, whilst avoiding other classes that reside in the same space, such as phonemes. We present a theoretical framework for describing and discussing the tasks as well as introducing existing state-of-the-art methods and research. Current Speaker Diarization systems are notoriously sensitive to hyper-parameters and lack robustness across datasets. Therefore, we present a method which uses a series of oracle experiments to expose the limitations of current systems and to identify the system components to which these limitations can be attributed. We also demonstrate how Diarization Error Rate (DER), the dominant error metric in the literature, is not a comprehensive or reliable indicator of overall performance or of error propagation to subsequent downstream tasks. These results inform our subsequent research. We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation is a crucial first step in the system chain.
Current methods typically do not account for the inherent structure of spoken discourse. As such, we explore a novel method which exploits an utterance-duration prior in order to better model the segment distribution of speech. We show how this method improves not only segmentation, but also the performance of subsequent speech recognition, machine translation and speaker diarization systems. Typical ASR transcriptions do not include punctuation, and the task of enriching transcriptions with this information is known as ‘punctuation restoration’. The benefit is not only improved readability but also better compatibility with NLP systems that expect sentence-like units, such as conventional machine translation. We show how segmentation and diarization are related tasks that are able to contribute acoustic information that complements existing linguistically-based punctuation approaches. There is a growing demand for speech technology applications in the broadcast media domain. This domain presents many new challenges, including diverse noise and recording conditions. We show that the capacity of existing GMM-HMM based speech segmentation systems is limited for such scenarios, and present a Deep Neural Network (DNN) based method which offers more robust speech segmentation, resulting in improved speech recognition performance on a television broadcast dataset. Ultimately, we are able to show that speech segmentation is an inherently ill-defined problem for which the solution is highly dependent on the downstream task it is intended for.
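The DER metric discussed above can be illustrated with a minimal frame-level sketch. Real scoring tools (such as NIST's md-eval) also apply forgiveness collars and compute an optimal reference-to-hypothesis speaker mapping; both are assumed away here.

```python
def der(ref, hyp):
    """Frame-level Diarization Error Rate sketch.

    ref and hyp are equal-length lists of per-frame speaker labels, with
    None marking non-speech. Hypothesis speakers are assumed to be already
    optimally mapped onto reference speakers.
    """
    scored = sum(1 for r in ref if r is not None)
    missed = sum(1 for r, h in zip(ref, hyp)
                 if r is not None and h is None)        # missed speech
    false_alarm = sum(1 for r, h in zip(ref, hyp)
                      if r is None and h is not None)   # false-alarm speech
    confusion = sum(1 for r, h in zip(ref, hyp)
                    if r is not None and h is not None and r != h)
    return (missed + false_alarm + confusion) / scored
```

A single scalar like this hides which error type dominates, which is one reason the dissertation argues DER is not a reliable indicator of error propagation to downstream tasks.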

    Digital social networks and vector images: an introduction to communication with an international scope

    In the context of globalization, this preliminary work builds on the emerging relationship between digital social networks (Réseaux Sociaux Numériques, RSN) and artificial systems of visual communication, namely signage. Signage conveys information about a subject in order to facilitate communication between users at an international scale. This visual communication system also has a pragmatic aim: it should lead the recipient to perform an action and/or influence their perception of reality. Its unit of writing is the signagram, which is figurative in nature. Our objective is, first, to design a new prototype of an RSN for internationally oriented communication, drawing on signage and on a tool for automatically translating syntagms into signagrams. Our RSN, named SignaComm, is specialized and informative. We then develop this prototype in order to test its ability to communicate visual messages to national and international users. As an example, we address the case of an earthquake, which belongs to the domain of natural risks and disasters.
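The syntagm-to-signagram translation tool described above can be caricatured as a lexicon lookup; the lexicon entries and sign identifiers below are invented for illustration and are not SignaComm's actual sign inventory.

```python
# Toy syntagm -> signagram mapping (invented identifiers).
SIGNAGRAM_LEXICON = {
    "earthquake": "SIGN_EARTHQUAKE",
    "danger": "SIGN_DANGER",
    "evacuate the building": "SIGN_EVACUATE",
}

def to_signagrams(syntagms):
    # Map each syntagm to its signagram ID, keeping the original text
    # when no sign exists (a real system would need a fallback strategy).
    return [SIGNAGRAM_LEXICON.get(s.lower(), s) for s in syntagms]
```

For the earthquake scenario used as the example above, a warning message would be rendered as a sequence of pictographic signs readable regardless of the user's language.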

    The role of context in image annotation and recommendation

    With the rise of smart phones, lifelogging devices (e.g. Google Glass) and the popularity of image sharing websites (e.g. Flickr), users are capturing and sharing every aspect of their life online, producing a wealth of visual content. Of these uploaded images, the majority are poorly annotated or exist in complete semantic isolation, making the process of building retrieval systems difficult, as one must first understand the meaning of an image in order to retrieve it. To alleviate this problem, many image sharing websites offer manual annotation tools which allow the user to “tag” their photos; however, these techniques are laborious and, as a result, have been poorly adopted; Sigurbjörnsson and van Zwol (2008) showed that 64% of images uploaded to Flickr are annotated with < 4 tags. Due to this, an entire body of research has focused on the automatic annotation of images (Hanbury, 2008; Smeulders et al., 2000; Zhang et al., 2012a), where one attempts to bridge the semantic gap between an image’s appearance and its meaning (e.g. the objects present). Despite two decades of research, the semantic gap still largely exists, and as a result automatic annotation models often offer unsatisfactory performance for industrial implementation. Further, these techniques can only annotate what they see, thus ignoring the “bigger picture” surrounding an image (e.g. its location, the event, the people present etc.). Much work has therefore focused on building photo tag recommendation (PTR) methods which aid the user in the annotation process by suggesting tags related to those already present. These works have mainly focused on computing relationships between tags based on historical images, e.g. that NY and timessquare co-exist in many images and are therefore highly correlated. However, tags are inherently noisy, sparse and ill-defined, often resulting in poor PTR accuracy: does NY refer to New York or New Year?
This thesis proposes the exploitation of an image’s context which, unlike textual evidence, is always present, in order to alleviate this ambiguity in the tag recommendation process. Specifically, we exploit the “what, who, where, when and how” of the image capture process in order to complement textual evidence in various photo tag recommendation and retrieval scenarios. In Part II, we combine text, content-based (e.g. number of faces present) and contextual (e.g. day of the week taken) signals for tag recommendation purposes, achieving up to a 75% improvement in precision@5 in comparison to a text-only TF-IDF baseline. We then consider external knowledge sources (i.e. Wikipedia & Twitter) as an alternative to the (slower moving) Flickr on which to build recommendation models, showing that similar accuracy can be achieved on these faster moving, yet entirely textual, datasets. In Part II, we also highlight the merits of diversifying tag recommendation lists, before discussing at length various problems with existing automatic image annotation and photo tag recommendation evaluation collections. In Part III, we propose three new image retrieval scenarios, namely “visual event summarisation”, “image popularity prediction” and “lifelog summarisation”. In the first scenario, we attempt to produce a ranking of relevant and diverse images for various news events by (i) removing irrelevant images such as memes and visual duplicates, and (ii) semantically clustering images based on the tweets in which they were originally posted. Using this approach, we were able to achieve over 50% precision for images in the top 5 ranks. In the second retrieval scenario, we show that by combining contextual and content-based features from images, we are able to predict whether an image will become “popular” (or not) with 74% accuracy, using an SVM classifier.
Finally, in chapter 9 we employ blur detection and perceptual-hash clustering in order to remove noisy images from lifelogs, before combining visual and geo-temporal signals in order to capture a user’s “key moments” within their day. We believe that the results of this thesis represent an important step towards building effective image retrieval models when sufficient textual content is lacking (i.e. a cold start).
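The precision@5 results reported in Part II rest on a simple metric, sketched below; the tag names in the test are invented examples.

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended tags that appear in the
    ground-truth relevant tag set for the image."""
    return sum(1 for tag in recommended[:k] if tag in relevant) / k
```

If three of the five suggested tags for a photo appear in its ground-truth tag set, precision@5 is 0.6; the reported improvement is measured on this quantity relative to the TF-IDF baseline.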

    Information Refinement Technologies for Crisis Informatics: User Expectations and Design Implications for Social Media and Mobile Apps in Crises

    In the past 20 years, mobile technologies and social media have become established not only in everyday life, but also in crises, disasters, and emergencies. Large-scale events in particular, such as Hurricane Sandy in 2012 or the European floods of 2013, showed that citizens are not passive victims but active participants utilizing mobile and social information and communication technologies (ICT) for crisis response (Reuter, Hughes, et al., 2018). Accordingly, the research field of crisis informatics emerged as a multidisciplinary field which combines computing and social science knowledge of disasters and is rooted in disciplines such as human-computer interaction (HCI), computer science (CS), computer supported cooperative work (CSCW), and information systems (IS). While citizens use personal ICT to respond to a disaster and cope with uncertainty, emergency services such as fire and police departments have started using available online data to increase situational awareness and improve decision making for a better crisis response (Palen & Anderson, 2016). When looking at even larger crises, such as the ongoing COVID-19 pandemic, it becomes apparent that the challenges of crisis informatics are amplified (Xie et al., 2020). Notably, information is often not available in perfect shape to assist crisis response: the dissemination of high-volume, heterogeneous and highly semantic data by citizens, often referred to as big social data (Olshannikova et al., 2017), poses challenges for emergency services in terms of access, quality and quantity of information. In order to achieve situational awareness or even actionable information, meaning the right information for the right person at the right time (Zade et al., 2018), information must be refined according to event-based factors, organizational requirements, societal boundary conditions and technical feasibility.
In order to research the topic of information refinement, this dissertation combines the methodological framework of design case studies (Wulf et al., 2011) with principles of design science research (Hevner et al., 2004). These extended design case studies consist of four phases, each contributing distinct results to research. This thesis first reviews existing research on use, role, and perception patterns in crisis informatics, emphasizing the increasing potential of public participation in crisis response using social media. Then, empirical studies conducted with the German population reveal positive attitudes towards, and increasing use of, mobile and social technologies during crises, but also highlight barriers to use and expectations that emergency services should monitor and interact in social media. The findings led to the design of innovative ICT artefacts, including visual guidelines for citizens’ use of social media in emergencies (SMG), an emergency service web interface for aggregating mobile and social data (ESI), an efficient algorithm for detecting relevant information in social media (SMO), and a mobile app for bidirectional communication between emergency services and citizens (112.social). The evaluation of the artefacts involved the participation of end-users in the application field of crisis management, pointing out potential future improvements and research directions. The thesis concludes with a framework on information refinement for crisis informatics, integrating event-based, organizational, societal, and technological perspectives.

    WiFi-Based Human Activity Recognition Using Attention-Based BiLSTM

    Recently, significant efforts have been made to explore human activity recognition (HAR) techniques that use information gathered by existing indoor wireless infrastructures through WiFi signals, without requiring the monitored subject to carry a dedicated device. The key intuition is that different activities introduce different multi-paths in WiFi signals and generate different patterns in the time series of channel state information (CSI). In this paper, we propose and evaluate a full pipeline for a CSI-based human activity recognition framework covering 12 activities in three different spatial environments, using two deep learning models: ABiLSTM and CNN-ABiLSTM. Evaluation experiments have demonstrated that the proposed models outperform state-of-the-art models. The experiments also show that the proposed models can be applied to other environments with different configurations, albeit with some caveats. The proposed ABiLSTM model achieves overall accuracies of 94.03%, 91.96%, and 92.59% across the three target environments, while the proposed CNN-ABiLSTM model reaches accuracies of 98.54%, 94.25% and 95.09% across those same environments.
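At the core of an attention-based BiLSTM is softmax-weighted pooling of the per-timestep hidden states. The sketch below shows only that pooling step, with the attention scores passed in as givens rather than produced by a trained layer, so it is an illustration of the mechanism and not the paper's model.

```python
import math

def attention_pool(hidden_states, scores):
    """Pool per-timestep hidden vectors into one context vector.

    hidden_states: list of equal-length vectors (e.g. BiLSTM outputs,
    one per CSI frame). scores: one unnormalised attention score per
    timestep.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]
```

Timesteps with higher scores dominate the pooled vector, which lets the downstream classifier focus on the CSI frames most characteristic of an activity.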

    Detecting early signs of dementia in conversation

    Dementia can affect a person's speech, language and conversational interaction capabilities. The early diagnosis of dementia is of great clinical importance. Recent studies using the qualitative methodology of Conversation Analysis (CA) demonstrated that communication problems may be picked up during conversations between patients and neurologists, and that this can be used to differentiate between patients with Neuro-degenerative Disorders (ND) and those with non-progressive Functional Memory Disorder (FMD). However, conducting manual CA is expensive and difficult to scale up for routine clinical use. This study introduces an automatic approach for processing such conversations which can help in identifying the early signs of dementia and distinguishing them from the other clinical categories (FMD, Mild Cognitive Impairment (MCI), and Healthy Control (HC)). The dementia detection system starts with a speaker diarisation module to segment an input audio file (determining who talks when). The segmented files are then passed to an automatic speech recogniser (ASR) to transcribe the utterances of each speaker. Next, the feature extraction unit extracts a number of features (CA-inspired, acoustic, lexical and word-vector) from the transcripts and audio files. Finally, a classifier is trained on the features to determine the clinical category of the input conversation. Moreover, we investigate replacing the neurologist in the conversation with an Intelligent Virtual Agent (IVA) asking similar questions. We show that despite differences between the IVA-led and the neurologist-led conversations, the results achieved with the IVA are as good as those obtained with the neurologists. Furthermore, the IVA can be used to administer more standard cognitive tests, such as verbal fluency tests, and to produce automatic scores, which can then boost the performance of the classifier.
The final blind evaluation of the system shows that the classifier can identify early signs of dementia with an acceptable level of accuracy and robustness (considering both sensitivity and specificity).
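The final feature-and-classify step of the pipeline above can be caricatured in a few lines. The two toy features and the centroid values are invented for illustration and bear no relation to the study's actual CA-inspired, acoustic, lexical and word-vector feature set or trained classifier.

```python
def extract_features(patient_turns):
    # Two toy features per conversation: average patient turn length in
    # words, and the fraction of very short (< 3 word) turns - short
    # answers can signal word-finding difficulty.
    lengths = [len(t.split()) for t in patient_turns]
    avg_len = sum(lengths) / len(lengths)
    short_frac = sum(1 for n in lengths if n < 3) / len(lengths)
    return [avg_len, short_frac]

def classify(features, centroids):
    # Nearest-centroid stand-in for the study's trained classifier:
    # pick the clinical category whose centroid is closest.
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(features, centroids[label]))
```

In the real system the features come from diarised, ASR-transcribed audio, and the classifier is trained on labelled conversations from each clinical category.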

    Understanding misinformation on Twitter in the context of controversial issues

    Social media is slowly supplementing, or even replacing, traditional media outlets such as television, newspapers, and radio. However, social media presents some drawbacks when it comes to circulating information, including the spread of false information, rumors, and fake news. At least three main factors create these drawbacks: the filter bubble effect, misinformation, and information overload. These factors make gathering accurate and credible information online very challenging, which in turn may affect public trust in online information. These issues are even more challenging when the topic under discussion is controversial. In this thesis, four main controversial topics are studied, each from a different domain. This variety of domains gives a broad view of how misinformation is manifested in social media, and of how its manifestation differs across domains. This thesis aims to understand misinformation in the context of controversial-issue discussions, both by understanding how misinformation is manifested in social media and by understanding people’s opinions towards these controversial issues. Three different aspects of a tweet are studied: 1) the user sharing the information, 2) the information source shared, and 3) whether specific linguistic cues can help in assessing the credibility of information on social media. Finally, the web application tool TweetChecker is used to allow online users to gain a more in-depth understanding of the discussions about five different controversial health issues. The results and recommendations of this study can be used to build solutions to the problem of trustworthiness of user-generated content on different social media platforms, especially for controversial issues.
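The third aspect above, linguistic cues, can be illustrated with a toy feature extractor; the two word lists below are invented placeholders, not the thesis's actual cue lexicons.

```python
# Invented cue lexicons for illustration only.
HEDGES = {"maybe", "possibly", "allegedly", "reportedly", "rumor"}
CERTAINTY = {"definitely", "proven", "fact", "confirmed"}

def cue_features(tweet):
    # Count hedging and certainty words; such counts could feed a
    # credibility classifier alongside user and source features.
    words = tweet.lower().split()
    return {
        "hedges": sum(1 for w in words if w in HEDGES),
        "certainty": sum(1 for w in words if w in CERTAINTY),
    }
```

A real system would also need tokenisation beyond whitespace splitting and cue lists validated against annotated data.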