9 research outputs found

    Audio-visual content structuring for automatic summarization

    In recent years, with the advent of sites such as Youtube, Dailymotion and Blip TV, the number of videos available on the Internet has increased considerably. The size of these collections and their lack of structure limit content-based access to the data. Automatic summarization is one way to produce snippets that extract the essential content and present it as concisely as possible. In this work, we focus on extractive video summarization methods based on audio analysis. We address the scientific problems tied to this objective: content extraction, document structuring, the definition and estimation of interest functions, and summary composition algorithms. On each of these aspects, we make concrete proposals that are evaluated. For content extraction, we present a fast spoken-term detection method. Its main novelty is that it relies on the construction of a detector tailored to the searched terms. We show that this self-organization strategy improves the robustness of the system, which significantly exceeds that of the classical approach based on automatic speech recognition. We then present an acoustic filtering method for automatic speech recognition based on Gaussian mixture models and factor analysis, as recently used in speaker identification. The originality of our contribution is the use of factor-analysis decompositions to estimate supervised filters operating in the cepstral domain. We then address the structuring of video collections. We show that using different levels of representation and different sources of information makes it possible to characterize the editorial style of a video primarily from the audio source, whereas most previous work suggested that the bulk of genre-related information was contained in the image. Another contribution concerns the identification of the type of discourse; we propose low-level models for detecting spontaneous speech that significantly improve on the state of the art for this kind of approach. The third focus of this work is the summary itself. In the context of automatic video summarization, we first try to define what a synthetic view is. Is it what characterizes the document as a whole, or what a user would remember of it (for example an emotional or funny moment)? This question is discussed, and we make concrete proposals for the definition of interest functions corresponding to three criteria: salience, expressiveness and significance. We then propose an algorithm that searches for the summary of maximal interest, derived from one introduced in previous work and based on integer linear programming.
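    The maximal-interest search described above is formulated with integer linear programming. As an illustration only, the following minimal sketch solves the equivalent 0/1-knapsack form of the problem (pick segments maximizing total interest under a duration budget) with dynamic programming; the segment durations and interest scores are hypothetical.

```python
# Extractive summary selection as a budgeted optimization.
# The thesis uses integer linear programming; a 0/1-knapsack dynamic
# program is an equivalent formulation for this simple case.

def select_summary(segments, budget):
    """Return (total_interest, indices) of the subset of
    (duration, interest) segments with maximal total interest
    whose total duration fits within `budget` seconds."""
    # best[d] = (best total interest, chosen indices) using <= d seconds
    best = [(0.0, [])] * (budget + 1)
    for i, (dur, interest) in enumerate(segments):
        # iterate budgets downward so each segment is used at most once
        for d in range(budget, dur - 1, -1):
            cand = best[d - dur][0] + interest
            if cand > best[d][0]:
                best[d] = (cand, best[d - dur][1] + [i])
    return best[budget]

segments = [(10, 3.0), (20, 8.0), (15, 4.5), (30, 9.0)]  # (seconds, interest)
score, chosen = select_summary(segments, budget=45)
```

An ILP solver becomes necessary once the objective adds interactions between segments (redundancy penalties, coverage constraints), which a plain knapsack cannot express.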

    I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

    The I4U consortium was established to facilitate joint entries to the NIST speaker recognition evaluations (SRE). The latest such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of the I4U consortium's participation in the NIST SRE series of evaluations. The primary objective of this paper is to summarize the results and lessons learned from the twelve subsystems and their fusion submitted to SRE'18. It is also our intention to present a shared view of the advancements, progress, and major paradigm shifts that we have witnessed as an SRE participant over the past decade, from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representations to deep speaker embeddings, and a switch of research challenge from channel compensation to domain adaptation.
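    With the shift to deep speaker embeddings mentioned above, trial scoring in many systems reduces to a similarity between two fixed-size vectors. A minimal sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the decision threshold is tuned on development data, not fixed as here):

```python
import math

# Cosine scoring of speaker embeddings: an enrollment vector is compared
# against a test vector; a higher score means "more likely same speaker".

def cosine_score(enroll, test):
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = (math.sqrt(sum(a * a for a in enroll))
            * math.sqrt(sum(b * b for b in test)))
    return dot / norm

same = cosine_score([1.0, 0.2, 0.1], [0.9, 0.25, 0.05])   # similar direction
diff = cosine_score([1.0, 0.2, 0.1], [-0.3, 0.9, 0.4])    # dissimilar
accept = same > 0.5  # illustrative threshold only
```

Production systems typically add a backend (e.g. PLDA or score normalization) on top of raw cosine scores, which is where the domain-adaptation challenge the paper mentions enters.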

    MORFITT: a multi-label corpus of French scientific articles in the biomedical domain

    This article presents MORFITT, the first multi-label corpus in French annotated with specialties in the medical field. MORFITT is composed of 3,624 abstracts of scientific articles from PubMed, annotated with 12 specialties for a total of 5,116 annotations. We detail the corpus, the experiments and the preliminary results obtained using a classifier based on the pre-trained language model CamemBERT. These preliminary results demonstrate the difficulty of the task, with a weighted average F1-score of 61.78%.
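    The weighted average F1-score reported above averages per-label F1 weighted by each label's support. A minimal single-label sketch of the computation (MORFITT itself is multi-label; the specialty names and predictions below are hypothetical):

```python
from collections import Counter

# Support-weighted F1: compute F1 per label, then average weighted by
# how often each label occurs in the gold annotations.

def weighted_f1(gold, pred):
    support = Counter(gold)
    total = 0.0
    for lab in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += support[lab] * f1
    return total / len(gold)

gold = ["cardio", "cardio", "neuro", "neuro", "neuro"]
pred = ["cardio", "neuro", "neuro", "neuro", "cardio"]
score = weighted_f1(gold, pred)
```

Weighting by support keeps frequent specialties from being drowned out by rare ones, which matters for skewed label distributions like a 12-specialty corpus.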

    FrenchMedMCQA: a multiple-choice question answering dataset in French for the medical domain

    LOUHI Workshop. This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task, in order to report current performance and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. The corpus, models and tools are available online.
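    For a mixed single/multiple-answer MCQA task like this, two common evaluation choices are exact match over the selected answer set and a per-choice Hamming score. The sketch below illustrates both; the metrics, answer letters and examples are assumptions for illustration, not necessarily the metrics reported in the paper.

```python
# Scoring a multiple-answer MCQA prediction against the gold answer set.

CHOICES = {"a", "b", "c", "d", "e"}  # five possible answers per question

def exact_match(gold, pred):
    """1.0 only if all correct answers are selected and nothing else."""
    return 1.0 if set(gold) == set(pred) else 0.0

def hamming_score(gold, pred):
    """Fraction of the five choices labelled correctly
    (selected vs. not selected)."""
    gold, pred = set(gold), set(pred)
    agree = sum(1 for c in CHOICES if (c in gold) == (c in pred))
    return agree / len(CHOICES)

em = exact_match({"a", "c"}, {"a", "c"})
ham = hamming_score({"a", "c"}, {"a", "b"})
```

Exact match is strict and penalizes any partial answer, while the Hamming score gives partial credit, which is why the two are often reported together for multi-answer exams.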