212 research outputs found

    RUSHES—an annotation and retrieval engine for multimedia semantic units

    Get PDF
    Multimedia analysis and reuse of raw un-edited audio visual content known as rushes is gaining acceptance by a large number of research labs and companies. A set of research projects are considering multimedia indexing, annotation, search and retrieval in the context of European funded research, but only the FP6 project RUSHES is focusing on automatic semantic annotation, indexing and retrieval of raw and un-edited audio-visual content. Even professional content creators and providers as well as home-users are dealing with this type of content and therefore novel technologies for semantic search and retrieval are required. In this paper, we present a summary of the most relevant achievements of the RUSHES project, focusing on specific approaches for automatic annotation as well as the main features of the final RUSHES search engine

    Semantic Knowledge Graphs for the News: A Review

    Get PDF
    ICT platforms for news production, distribution, and consumption must exploit the ever-growing availability of digital data. These data originate from different sources and in different formats; they arrive at different velocities and in different volumes. Semantic knowledge graphs (KGs) is an established technique for integrating such heterogeneous information. It is therefore well-aligned with the needs of news producers and distributors, and it is likely to become increasingly important for the news industry. This article reviews the research on using semantic knowledge graphs for production, distribution, and consumption of news. The purpose is to present an overview of the field; to investigate what it means; and to suggest opportunities and needs for further research and development.publishedVersio

    Accessing spoken interaction through dialogue processing [online]

    Get PDF
    Zusammenfassung Unser Leben, unsere Leistungen und unsere Umgebung, alles wird derzeit durch Schriftsprache dokumentiert. Die rasante Fortentwicklung der technischen Möglichkeiten Audio, Bilder und Video aufzunehmen, abzuspeichern und wiederzugeben kann genutzt werden um die schriftliche Dokumentation von menschlicher Kommunikation, zum Beispiel Meetings, zu unterstützen, zu ergänzen oder gar zu ersetzen. Diese neuen Technologien können uns in die Lage versetzen Information aufzunehmen, die anderweitig verloren gehen, die Kosten der Dokumentation zu senken und hochwertige Dokumente mit audiovisuellem Material anzureichern. Die Indizierung solcher Aufnahmen stellt die Kerntechnologie dar um dieses Potential auszuschöpfen. Diese Arbeit stellt effektive Alternativen zu schlüsselwortbasierten Indizes vor, die Suchraumeinschränkungen bewirken und teilweise mit einfachen Mitteln zu berechnen sind. Die Indizierung von Sprachdokumenten kann auf verschiedenen Ebenen erfolgen: Ein Dokument gehört stilistisch einer bestimmten Datenbasis an, welche durch sehr einfache Merkmale bei hoher Genauigkeit automatisch bestimmt werden kann. Durch diese Art von Klassifikation kann eine Reduktion des Suchraumes um einen Faktor der Größenordnung 4­10 erfolgen. Die Anwendung von thematischen Merkmalen zur Textklassifikation bei einer Nachrichtendatenbank resultiert in einer Reduktion um einen Faktor 18. Da Sprachdokumente sehr lang sein können müssen sie in thematische Segmente unterteilt werden. Ein neuer probabilistischer Ansatz sowie neue Merkmale (Sprecherinitia­ tive und Stil) liefern vergleichbare oder bessere Resultate als traditionelle schlüsselwortbasierte Ansätze. Diese thematische Segmente können durch die vorherrschende Aktivität charakterisiert werden (erzählen, diskutieren, planen, ...), die durch ein neuronales Netz detektiert werden kann. Die Detektionsraten sind allerdings begrenzt da auch Menschen diese Aktivitäten nur ungenau bestimmen. Eine maximale Reduktion des Suchraumes um den Faktor 6 ist bei den verwendeten Daten theoretisch möglich. Eine thematische Klassifikation dieser Segmente wurde ebenfalls auf einer Datenbasis durchgeführt, die Detektionsraten für diesen Index sind jedoch gering. Auf der Ebene der einzelnen Äußerungen können Dialogakte wie Aussagen, Fragen, Rückmeldungen (aha, ach ja, echt?, ...) usw. mit einem diskriminativ trainierten Hidden Markov Model erkannt werden. Dieses Verfahren kann um die Erkennung von kurzen Folgen wie Frage/Antwort­Spielen erweitert werden (Dialogspiele). Dialogakte und ­spiele können eingesetzt werden um Klassifikatoren für globale Sprechstile zu bauen. Ebenso könnte ein Benutzer sich an eine bestimmte Dialogaktsequenz erinnern und versuchen, diese in einer grafischen Repräsentation wiederzufinden. In einer Studie mit sehr pessimistischen Annahmen konnten Benutzer eines aus vier ähnlichen und gleichwahrscheinlichen Gesprächen mit einer Genauigkeit von ~ 43% durch eine graphische Repräsentation von Aktivität bestimmt. Dialogakte könnte in diesem Szenario ebenso nützlich sein, die Benutzerstudie konnte aufgrund der geringen Datenmenge darüber keinen endgültigen Aufschluß geben. Die Studie konnte allerdings für detailierte Basismerkmale wie Formalität und Sprecheridentität keinen Effekt zeigen. Abstract Written language is one of our primary means for documenting our lives, achievements, and environment. Our capabilities to record, store and retrieve audio, still pictures, and video are undergoing a revolution and may support, supplement or even replace written documentation. This technology enables us to record information that would otherwise be lost, lower the cost of documentation and enhance high­quality documents with original audiovisual material. The indexing of the audio material is the key technology to realize those benefits. This work presents effective alternatives to keyword based indices which restrict the search space and may in part be calculated with very limited resources. Indexing speech documents can be done at a various levels: Stylistically a document belongs to a certain database which can be determined automatically with high accuracy using very simple features. The resulting factor in search space reduction is in the order of 4­10 while topic classification yielded a factor of 18 in a news domain. Since documents can be very long they need to be segmented into topical regions. A new probabilistic segmentation framework as well as new features (speaker initiative and style) prove to be very effective compared to traditional keyword based methods. At the topical segment level activities (storytelling, discussing, planning, ...) can be detected using a machine learning approach with limited accuracy; however even human annotators do not annotate them very reliably. A maximum search space reduction factor of 6 is theoretically possible on the databases used. A topical classification of these regions has been attempted on one database, the detection accuracy for that index, however, was very low. At the utterance level dialogue acts such as statements, questions, backchannels (aha, yeah, ...), etc. are being recognized using a novel discriminatively trained HMM procedure. The procedure can be extended to recognize short sequences such as question/answer pairs, so called dialogue games. Dialog acts and games are useful for building classifiers for speaking style. Similarily a user may remember a certain dialog act sequence and may search for it in a graphical representation. In a study with very pessimistic assumptions users are able to pick one out of four similar and equiprobable meetings correctly with an accuracy ~ 43% using graphical activity information. Dialogue acts may be useful in this situation as well but the sample size did not allow to draw final conclusions. However the user study fails to show any effect for detailed basic features such as formality or speaker identity

    Interactive video retrieval

    Get PDF
    Video storage, analysis, and retrieval has become an important research topic recently due to the advancements in the creation and distribution of video data. In this thesis, an investigation into interactive video retrieval is presented. Advanced feedback techniques have been investigated in the retrieval of textual data. Novel interactive schemes, mainly based on the concept of relevance feedback, have been developed and experimented. However, such approaches have not been applied in the video retrieval domain. In this thesis, we investigate the use of advanced interactive retrieval schemes for the retrieval of video data. To understand the role of various features for the video retrieval, we experimented with various retrieval strategies. We benchmarked the role of visual features, the textual features and their combination. To explore this further, we categorized query into various classes and investigated the retrieval effectiveness of various features and their combination. Based on the results, we developed a retrieval scheme for video retrieval. We developed an interactive retrieval technique based on the concept of implicit feedback. A number of retrieval models are developed based on this concept and benchmarked with a simulation- based evaluation strategy. A Binary Voting Model performed well and has been reformed for user-based experiments. We experimented with the users and compared the performance of an interactive retrieval system, using a combination of implicit and explicit feedback techniques, with that of a system using explicit feedback techniques

    Volume 26, Number 1, March 2006 OLAC Newsletter

    Get PDF
    Digitized March 2006 issue of the OLAC Newsletter

    Multi-modal Video Content Understanding

    Get PDF
    Video is an important format of information. Humans use videos for a variety of purposes such as entertainment, education, communication, information sharing, and capturing memories. To this date, humankind accumulated a colossal amount of video material online which is freely available. Manual processing at this scale is simply impossible. To this end, many research efforts have been dedicated to the automatic processing of video content. At the same time, human perception of the world is multi-modal. A human uses multiple senses to understand the environment and objects, and their interactions. When watching a video, we perceive the content via both audio and visual modalities, and removing one of these modalities results in less immersive experience. Similarly, if information in both modalities does not correspond, it may create a sense of dissonance. Therefore, joint modelling of multiple modalities (such as audio, visual, and text) within one model is an active research area. In the last decade, the fields of automatic video understanding and multi-modal modelling have seen exceptional progress due to the ubiquitous success of deep learning models and, more recently, transformer-based architectures in particular. Our work draws on these advances and pushes the state-of-the-art of multi-modal video understanding forward. Applications of automatic multi-modal video processing are broad and exciting! For instance, the content-based textual description of a video (video captioning) may allow a visually- or auditory-impaired person to understand the content and, thus, engage in brighter social interactions. However, prior work in video content description relies on the visual input alone, missing vital information only available in the audio stream. To this end, we proposed two novel multi-modal transformer models that encode audio and visual interactions simultaneously. More specifically, first, we introduced a late-fusion multi-modal transformer that is highly modular and allows the processing of an arbitrary set of modalities. Second, an efficient bi-modal transformer was presented to encode audio-visual cues starting from the lower network layers allowing more rich audio-visual features and stronger performance as a result. Another application is the automatic visually-guided sound generation that might help professional sound (foley) designers who spend hours searching a database for relevant audio for a movie scene. Previous approaches for automatic conditional audio generation support only one class (e. g. “dog barking”), while real-life applications may require generation for hundreds of data classes and one would need to train one model for every data class which can be infeasible. To bridge this gap, we introduced a novel two-stage model that, first, efficiently encodes audio as a set of codebook vectors (i. e. trains to make “building blocks”) and, then, learns to sample these audio vectors given visual inputs to make a relevant audio track for this visual input. Moreover, we studied the automatic evaluation of the conditional audio generation model and proposed metrics that measure both quality and relevance of the generated samples. Finally, as video editing is becoming more common among non-professionals due to the increased popularity of such services as YouTube, automatic assistance during video editing grows in demand, e. g. off-sync detection between audio and visual tracks. Prior work in audio-visual synchronization was devoted to solving the task on lip-syncing datasets with “dense” signals, such as interviews and presentations. In such videos, synchronization cues occur “densely” across time, and it is enough to process just a few tens of a second to synchronize the tracks. In contrast, opendomain videos mostly have only “sparse” cues that occur just once in a seconds-long video clip (e. g. “chopping wood”). To address this, we: a) proposed a novel dataset with “sparse” sounds; b) designed a model which can efficiently encode seconds-long audio-visual tracks in a small set of “learnable selectors” that is, then, used for synchronization. In addition, we explored the temporal artefacts that common audio and video compression algorithms leave in data streams. To prevent a model from learning to rely on these artefacts, we introduced a list of recommendations on how to mitigate them. This thesis provides the details of the proposed methodologies as well as a comprehensive overview of advances in relevant fields of multi-modal video understanding. In addition, we provide a discussion of potential research directions that can bring significant contributions to the field

    Extraction of textual information from image for information retrieval

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4

    Full text link
    Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. LLMs, because of their large size and pretraining on large volumes of text data, exhibit special abilities which allow them to achieve remarkable performances without any task-specific training in many of the natural language processing tasks. The era of LLMs started with OpenAI GPT-3 model, and the popularity of LLMs is increasing exponentially after the introduction of models like ChatGPT and GPT4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey which summarizes the recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey paper with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss the performances of GLLMs in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. To summarize, this comprehensive survey paper will serve as a good resource for both academic and industry people to stay updated with the latest research related to GPT-3 family large language models.Comment: Preprint under review, 58 page

    Enhanced web-based summary generation for search.

    Get PDF
    After a user types in a search query on a major search engine, they are presented with a number of search results. Each search result is made up of a title, brief text summary and a URL. It is then the user\u27s job to select documents for further review. Our research aims to improve the accuracy of users selecting relevant documents by improving the way these web pages are summarized. Improvements in accuracy will lead to time improvements and user experience improvements. We propose ReClose, a system for generating web document summaries. ReClose generates summary content through combining summarization techniques from query-biased and query-independent summary generation. Query-biased summaries generally provide query terms in context. Query-independent summaries focus on summarizing documents as a whole. Combining these summary techniques led to a 10% improvement in user decision making over Google generated summaries. Color-coded ReClose summaries provide keyword usage depth at a glance and also alert users to topic departures. Color-coding further enhanced ReClose results and led to a 20% improvement in user decision making over Google generated summaries. Many online documents include structure and multimedia of various forms such as tables, lists, forms and images. We propose to include this structure in web page summaries. We found that the expert user was insignificantly slowed in decision making while the majority of average users made decisions more quickly using summaries including structure without any decrease in decision accuracy. We additionally extended ReClose for use in summarizing large numbers of tweets in tracking flu outbreaks in social media. The resulting summaries have variable length and are effective at summarizing flu related trends. Users of the system obtained an accuracy of 0.86 labeling multi-tweet summaries. This showed that the basis of ReClose is effective outside of web documents and that variable length summaries can be more effective than fixed length. Overall the ReClose system provides unique summaries that contain more informative content than current search engines produce, highlight the results in a more meaningful way, and add structure when meaningful. The applications of ReClose extend far beyond search and have been demonstrated in summarizing pools of tweets

    Full Issue: vol. 63, issue 4

    Get PDF
    corecore