    Towards a better integration of written names for unsupervised speakers identification in videos

    Existing methods for unsupervised identification of speakers in TV broadcasts usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call this "late naming". Written names extracted from title blocks tend to lead to high-precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged. While "integrated naming" yields identification performance similar to that of "late naming" (with better precision), "early naming" improves over this baseline in terms of both identification error rate and stability of the clustering stopping criterion.
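
    The "late naming" idea described in this abstract (diarize first, then name each cluster from written names) can be illustrated with a minimal sketch. The propagation rule used here, giving each cluster the written name with the largest temporal overlap, is an assumption chosen for illustration and is not necessarily the rule used by the authors. The function name `late_naming` and the tuple layouts are hypothetical.

```python
from collections import defaultdict


def late_naming(speech_turns, written_names):
    """Name diarization clusters after clustering is finished.

    speech_turns:  list of (start, end, cluster_id) from a diarization step.
    written_names: list of (start, end, name) for on-screen title blocks.
    Returns a dict mapping cluster_id to a name, or None if no name overlaps.
    Assumption: a cluster takes the name it co-occurs with the longest.
    """
    overlap = defaultdict(lambda: defaultdict(float))
    for s_start, s_end, cluster in speech_turns:
        for n_start, n_end, name in written_names:
            shared = min(s_end, n_end) - max(s_start, n_start)
            if shared > 0:
                overlap[cluster][name] += shared

    names = {}
    for _, _, cluster in speech_turns:
        candidates = overlap.get(cluster)  # .get() does not create a new entry
        names[cluster] = max(candidates, key=candidates.get) if candidates else None
    return names


# Toy data (made up for illustration): times are in seconds.
turns = [(0, 10, "c1"), (10, 20, "c2"), (20, 30, "c1"), (30, 40, "c3")]
titles = [(2, 6, "Alice"), (12, 15, "Bob")]
result = late_naming(turns, titles)
```

    Because naming happens after clustering, a cluster that wrongly mixes two speakers still receives a single name, which is the limitation the abstract attributes to "late naming".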

    Unsupervised Speaker Identification in TV Broadcast Based on Written Names

    Identifying speakers in TV broadcasts in an unsupervised way (i.e., without biometric models) avoids costly annotation. Existing methods usually use pronounced names as the source of names for identifying the speech clusters produced by a diarization step, but this source is too imprecise to give sufficient confidence. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compare these two sources of names on their ability to provide the names of the speakers in TV broadcasts. This study shows that written names are preferable because of their high precision in identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different propagations of written names onto clusters. Our second proposal, "early naming", modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on phase 1 of the REPERE corpus, which contains 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%. "Early naming" improves over this result in terms of both identification error rate and stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models, trained on matching development data and additional TV and radio data, reached an F-measure of only 57.2%.
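
    The "early naming" constraint described in this abstract can be sketched as agglomerative clustering with a cannot-merge rule. This is a toy illustration under stated assumptions, not the authors' system: segments are represented by a single number, distance is the absolute difference between cluster means, and merging stops at a fixed threshold. The function `early_naming_clustering` and its parameters are hypothetical.

```python
def early_naming_clustering(features, names, threshold):
    """Agglomerative clustering that never merges differently named clusters.

    features:  one float per speech segment (toy stand-in for acoustic features).
    names:     written name attached to each segment, or None if unnamed.
    threshold: merging stops when the closest allowed pair is farther than this.
    """
    clusters = [
        {"members": [i], "names": {names[i]} - {None}}
        for i in range(len(features))
    ]

    def centroid(cluster):
        members = cluster["members"]
        return sum(features[i] for i in members) / len(members)

    def compatible(a, b):
        # Constraint: two clusters carrying different written names stay apart.
        return not (a["names"] and b["names"] and a["names"] != b["names"])

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not compatible(clusters[i], clusters[j]):
                    continue
                dist = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if dist <= threshold and (best is None or dist < best[0]):
                    best = (dist, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i]["members"] += clusters[j]["members"]
        clusters[i]["names"] |= clusters[j]["names"]
        del clusters[j]
    return clusters


# Toy data (made up): segments 0 and 2 carry different written names.
found = early_naming_clustering(
    features=[0.0, 1.0, 3.0, 50.0],
    names=["Alice", None, "Bob", None],
    threshold=5.0,
)
groups = sorted(sorted(c["members"]) for c in found)
```

    In this toy run the unnamed segment 1 joins the "Alice" cluster, and the resulting cluster is close enough to segment 2 to merge under the threshold, but the name constraint blocks the merge because segment 2 is labeled "Bob".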

    Vocabulary-Learning Strategies of Students Learning Chinese as a Foreign Language in an Intensive-Training Setting

    Compared with the research on vocabulary-learning strategies in the field of teaching English as a second or foreign language, research on the strategy use of Chinese-as-a-foreign-language (CFL) students, especially CFL students in an intensive-training setting, is scarce. The relationship between CFL students’ vocabulary-learning-strategy use and their learning outcomes remains underresearched. Therefore, this mixed-methods study was conducted to investigate the strategy use of CFL students in learning Chinese vocabulary words in an intensive language program and its relationship to students’ learning outcomes. A total of 137 beginning to advanced students enrolled in the program participated in the study. The students’ strategy use was measured by a 50-item questionnaire, and their learning outcomes were measured by their end-of-semester II Proficiency Progress Test, which includes a listening test and a reading test. Interviews with nine participants of different grade-point averages (GPAs; high, middle, and low) were conducted to gain a better understanding of the strategy use of more-successful and less-successful students. Descriptive data analysis revealed that the students in this study commonly used 20 strategies in their vocabulary learning. Most of these 20 strategies were cognitive or metacognitive strategies, and the majority of the commonly used cognitive strategies were orthographic-knowledge-based strategies. The qualitative findings indicated that students with higher GPAs used more strategies and that certain patterns of strategy use differentiated more-successful students from less-successful students. Pearson product-moment correlation analyses revealed that several strategies involving learning and using vocabulary words in an authentic context had a positive and statistically significant association with students’ listening scores and reading scores, whereas several strategies focusing on decontextualized memorization of vocabulary words had a negative and statistically significant association with students’ listening scores. Two orthographic-knowledge-based strategies were found to be correlated positively with students’ reading scores. The findings of the study suggest that orthographic-knowledge-based strategies and metacognitive strategies such as selective attention are essential for CFL students in vocabulary learning. Strategies involving learning and using Chinese vocabulary words in an authentic context are important for CFL students to develop higher language proficiency. Research and pedagogical implications are drawn based on the findings.
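
    For readers unfamiliar with the Pearson product-moment correlation mentioned in this abstract, the coefficient can be computed as in the following minimal sketch. The numbers are invented purely to show the calculation and are not data from the study; the function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of co-deviations, then the two root sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# Invented example: strategy-use ratings against test scores.
rising = pearson_r([1, 2, 3], [2, 4, 6])    # perfectly increasing relation
falling = pearson_r([1, 2, 3], [3, 2, 1])   # perfectly decreasing relation
```

    A value near +1 corresponds to the positive associations the abstract reports for contextualized strategies, and a value near -1 to a negative association; statistical significance requires a separate test that this sketch does not perform.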

    Deliverable D1.1 State of the art and requirements analysis for hypervideo

    This deliverable presents a state-of-the-art and requirements analysis report for hypervideo, authored as part of WP1 of the LinkedTV project. We first present several use-case (viewer) scenarios in the LinkedTV project and, by analysing the distinctive needs and demands of each scenario, identify the technical requirements from a user-side perspective. We then study methods for the automatic and semi-automatic decomposition of audiovisual content in order to effectively support the annotation process. Since multimedia content comprises different types of information (visual, textual and audio), we report various methods for the analysis of these three streams. Finally, we present various annotation tools that could integrate the developed analysis results so as to effectively support users (video producers) in the semi-automatic linking of hypervideo content, and based on them we report on the initial progress in building the LinkedTV annotation tool. For each class of techniques discussed in the deliverable, we present evaluation results from applying one such method from the literature to a dataset well suited to the needs of the LinkedTV project, and we indicate the future technical requirements that should be addressed in order to achieve higher levels of performance (e.g., in terms of accuracy and time efficiency).

    Representing and Redefining Specialised Knowledge: Medical Discourse

    This volume brings together five selected papers on medical discourse which show how specialised medical corpora provide a framework that helps those engaging with medical discourse to determine how the everyday and the specialised combine to shape the discourse of medical professionals and non-medical communities in relation to both long-term and short-term factors. The papers illustrate, in an exemplary way, the shifting boundaries in today’s society between the two major poles making up the medical discourse cline: healthcare discourse at one end, which records the demand for personalised therapies and individual medical services; and clinical discourse at the other, which documents research into society’s collective medical needs.

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions: technological issues, user-centred issues and use cases, and socio-economic and legal aspects. These were assessed by two central studies: first, a concerted vision of the functional breakdown of a generic multimedia search engine, and second, a set of representative use-case descriptions with a related discussion of the requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    Grounding language in events

    Ph.D. thesis by Michael Ben Fleischman, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 137-142). Broadcast video and virtual environments are just two of the growing number of domains in which language is embedded in multiple modalities of rich non-linguistic information. Applications for such multimodal domains are often based on traditional natural language processing techniques that ignore the connection between words and the non-linguistic context in which they are used. This thesis describes a methodology for representing these connections in models which ground the meaning of words in representations of events. Incorporating these grounded language models with text-based techniques significantly improves the performance of three multimodal applications: natural language understanding in videogames, sports video search and automatic speech recognition. Two approaches to representing the structure of events are presented and used to model the meaning of words. In the domain of virtual game worlds, a hand-designed hierarchical behavior grammar is used to explicitly represent all the various actions that an agent can take in a virtual world. This grammar is used to interpret events by parsing sequences of observed actions in order to generate hierarchical event structures. In the noisier and more open-ended domain of broadcast sports video, hierarchical temporal patterns are automatically mined from large corpora of unlabeled video data. The structure of events in video is represented by vectors of these hierarchical patterns. Grounded language models are encoded using Hierarchical Bayesian models to represent the probability of words given elements of these event structures. These grounded language models are used to incorporate non-linguistic information into text-based approaches to multimodal applications. In the virtual game domain, this non-linguistic information improves natural language understanding for a virtual agent by nearly 10% and cuts in half the negative effects of noise caused by automatic speech recognition. For broadcast video of baseball and American football, video search systems that incorporate grounded language models are shown to perform up to 33% better than text-based systems. Further, systems for recognizing speech in baseball video that use grounded language models show 25% greater word accuracy than traditional systems.

    Deliverable D1.4 Visual, text and audio information analysis for hypervideo, final release

    The technologies included in the first release of the WP1 multimedia analysis tools were extensively evaluated, using content from the LinkedTV scenarios and through participation in international benchmarking activities. Based on this evaluation, concrete decisions were made regarding the appropriateness and importance of each individual method or combination of methods. Combined with an updated list of information needs for each scenario, these decisions led to a new set of analysis requirements to be addressed by the final set of WP1 analysis techniques. To this end, coordinated efforts in three directions, namely (a) improving a number of methods in terms of accuracy and time efficiency, (b) developing new technologies and (c) defining synergies between methods for obtaining new types of information via multimodal processing, resulted in the final set of multimedia analysis methods for video hyperlinking. Moreover, the developed analysis modules have been integrated into a web-based infrastructure, allowing the fully automatic linking of the multitude of WP1 technologies and the overall LinkedTV platform.

    New insights into hierarchical clustering and linguistic normalization for speaker diarization

    The ever-expanding volume of available audio and multimedia data has elevated technologies related to content indexing and structuring to the forefront of research. Speaker diarization, commonly referred to as the 'who spoke when?' task, is one such example and has emerged as a prominent, core enabling technology in the wider speech processing research community. Speaker diarization involves the detection of speaker turns within an audio document (segmentation) and the grouping together of all same-speaker segments (clustering). Much progress has been made in the field over recent years, partly spearheaded by the NIST Rich Transcription evaluations, which focus on the meeting domain and in whose proceedings two general approaches are found: top-down and bottom-up. Even though the best-performing systems over recent years have all been bottom-up approaches, we show in this thesis that the top-down approach is not without significant merit. Indeed, we first introduce a new purification component that leads to performance competitive with the bottom-up approach. Moreover, while investigating the two diarization approaches more thoroughly, we show that they behave differently in discriminating between individual speakers and in normalizing unwanted acoustic variation, i.e., variation that does not pertain to different speakers. This difference in behaviour leads to a new top-down/bottom-up system combination that outperforms the respective baseline systems. Finally, we introduce a new technology able to limit the influence of linguistic effects, which are responsible for biasing the convergence of the diarization system. Our novel approach is referred to as Phone Adaptive Training (PAT), by analogy with Speaker Adaptive Training. Thesis, Télécom ParisTech, Paris.