921 research outputs found
Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media
Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method consists of canonical correlation analysis to learn a joint multimodal space, and long short term memory (LSTM) networks to model cross-modality temporal dependencies. Our contributions also include the introduction of a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real world dataset as compared to several baselines
A Domain Adaptation Approach to Improve Speaker Turn Embedding Using Face Representation
This paper proposes a novel approach to improve speaker modeling using knowledge transferred from face representation. In particular, we are interested in learning a discriminative metric which allows speaker turns to be compared directly, which is beneficial for tasks such as diarization and dialogue analysis. Our method improves the embedding space of speaker turns by applying maximum mean discrepancy loss to minimize the disparity between the distributions of facial and acoustic embedded features. This approach aims to discover the shared underlying structure of the two embedded spaces, thus enabling the transfer of knowledge from the richer face representation to the counterpart in speech. Experiments are conducted on broadcast TV news datasets, REPERE and ETAPE, to demonstrate the validity of our method. Quantitative results in verification and clustering tasks show promising improvement, especially in cases where speaker turns are short or the training data size is limited
AlbayzĂn-2014 evaluation: audio segmentation and classification in broadcast news domains
The electronic version of this article is the complete one and can be found online at: http://dx.doi.org/10.1186/s13636-015-0076-3Audio segmentation is important as a pre-processing task to improve the performance of many speech technology tasks and, therefore, it has an undoubted research interest. This paper describes the database, the metric, the systems and the results for the AlbayzĂn-2014 audio segmentation campaign. In contrast to previous evaluations where the task was the segmentation of non-overlapping classes, AlbayzĂn-2014 evaluation proposes the delimitation of the presence of speech, music and/or noise that can be found simultaneously. The database used in the evaluation was created by fusing different media and noises in order to increase the difficulty of the task. Seven segmentation systems from four different research groups were evaluated and combined. Their experimental results were analyzed and compared with the aim of providing a benchmark and showing up the promising directions in this field.This work has been partially funded by the Spanish Government and the European Union (FEDER) under the project TIN2011-28169-C05-02 and supported by the European Regional Development Fund and the Spanish
Government (âSpeechTech4All Projectâ TEC2012-38939-C03
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
separate from the strength of their semantic dependence. E.g. "red tape" might
be overall less frequent than "tape measure" in some corpus, but this does not
mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR
Multiclass audio segmentation based on recurrent neural networks for broadcast domain data
This paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task whose goal is to classify an audio signal as speech, music, noise or a combination of these. The proposed system is based on the use of bidirectional long short-term Memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module, gaining long term stability by means of the tied state concept in hidden Markov models. We explore different neural architectures introducing temporal pooling layers to reduce the neural network output sampling rate. Our findings show that removing redundant temporal information is beneficial for the segmentation system showing a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 if running on CPU and below 0.03 if running on GPU. This new architecture combined with a data-agnostic data augmentation technique called mixup allows our system to achieve competitive results in both the AlbayzĂn 2010 and 2012 evaluation datasets, presenting a relative improvement of 19.72% and 5.35% compared to the best results found in the literature for these databases
Game Plan: What AI can do for Football, and What Football can do for AI
The rapid progress in artificial intelligence (AI) and machine learning has opened unprecedented
analytics possibilities in various team and individual sports, including baseball, basketball, and
tennis. More recently, AI techniques have been applied to football, due to a huge increase in
data collection by professional teams, increased computational power, and advances in machine
learning, with the goal of better addressing new scientific challenges involved in the analysis of
both individual playersâ and coordinated teamsâ behaviors. The research challenges associated
with predictive and prescriptive football analytics require new developments and progress at the
intersection of statistical learning, game theory, and computer vision. In this paper, we provide
an overarching perspective highlighting how the combination of these fields, in particular, forms a
unique microcosm for AI research, while offering mutual benefits for professional teams, spectators,
and broadcasters in the years to come. We illustrate that this duality makes football analytics
a game changer of tremendous value, in terms of not only changing the game of football itself,
but also in terms of what this domain can mean for the field of AI. We review the state-of-theart and exemplify the types of analysis enabled by combining the aforementioned fields, including
illustrative examples of counterfactual analysis using predictive models, and the combination of
game-theoretic analysis of penalty kicks with statistical learning of player attributes. We conclude
by highlighting envisioned downstream impacts, including possibilities for extensions to other sports
(real and virtual)
Dutkat: A Privacy-Preserving System for Automatic Catch Documentation and Illegal Activity Detection in the Fishing Industry
United Nations' Sustainable Development Goal 14 aims to conserve and sustainably use the oceans and their resources for the benefit of people and the planet. This includes protecting marine ecosystems, preventing pollution, and overfishing, and increasing scientific understanding of the oceans. Achieving this goal will help ensure the health and well-being of marine life and the millions of people who rely on the oceans for their livelihoods. In order to ensure sustainable fishing practices, it is important to have a system in place for automatic catch documentation.
This thesis presents our research on the design and development of Dutkat, a privacy-preserving, edge-based system for catch documentation and detection of illegal activities in the fishing industry. Utilising machine learning techniques, Dutkat can analyse large amounts of data and identify patterns that may indicate illegal activities such as overfishing or illegal discard of catch. Additionally, the system can assist in catch documentation by automating the process of identifying and counting fish species, thus reducing potential human error and increasing efficiency. Specifically, our research has consisted of the development of various components of the Dutkat system, evaluation through experimentation, exploration of existing data, and organization of machine learning competitions. We have also implemented it from a compliance-by-design perspective to ensure that the system is in compliance with data protection laws and regulations such as GDPR. Our goal with Dutkat is to promote sustainable fishing practices, which aligns with the Sustainable Development Goal 14, while simultaneously protecting the privacy and rights of fishing crews
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in
multimedia search engines, we have identified and analyzed gaps within European research effort during our second year.
In this period we focused on three directions, notably technological issues, user-centred issues and use-cases and socio-
economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of functional breakdown
of generic multimedia search engine, and secondly, a representative use-cases descriptions with the related discussion on
requirement for technological challenges. Both studies have been carried out in cooperation and consultation with the
community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our
Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as
National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core
technological gaps that involve research challenges, and âenablersâ, which are not necessarily technical research
challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal
challenges
Event Based Retrieval From Digital Libraries Containing Data Streams
The objective of this research is to study the issues involved in building a digital library that contains data streams and allows event-based retrieval. âDigital Libraries are storehouses of information available through the Internet that provide ways to collect, store, and organize data and make it accessible for search, retrieval, and processingâ [29]. Data streams are sources of information for applications such as news-on-demand, weather services, and scientific research, to name a few. A data stream is a sequence of data units produced over a period of time. Examples of data streams are video streams, audio stream, and sensor readings. Saving data streams in digital libraries is advantageous because of the services provided by digital libraries such as archiving, preservation, administration, and access control. Events are noteworthy occurrences that happen during data streams. Events are easier to remember than specific time instances at which they occur; hence using them for retrieval is more commensurate with human behavior and can be more efficient via direct accessing instead of scanning. The focus of this research is not only on storing data streams in a digital library and using event-based retrieval, but also on relating streams and playing them back at the same time, possibly in a synchronized manner, to facilitate better understanding in research or other working situations.
Our approach for this research starts by considering digital libraries for: stock market, news streams, census bureau statistics, weather, sports games, and the educational environment. For each of these applications, we form categories of possible users and the basic requirements for each of them. As a result, we identify a list of design goals that we take into consideration in developing the architecture of the library. To illustrate and validate our approach we implement a medical digital library containing actual Computed Tomography (CT) scan streams. It also contains sample medical text and audio streams to show the heterogeneity of the library. Streams are displayed in a concise, yet complete, way that makes it unproblematic for users to decide whether or not to playback a stream and to set playback options. The playback interface itself is organized in a way that accommodates synchronous and asynchronous streams and enables users to control the playback of these streams. We study the performance of the specialized search and retrieval processes in comparison to traditional search and retrieval processes. We conclude with a discussion on how to adapt the library to additional stream types in addition to suggesting other future efforts in this area
- âŠ