148 research outputs found

    SPPAS: a tool for the phonetic segmentations of Speech

    SPPAS is a tool that produces automatic annotations, including utterance, word, syllable and phoneme segmentations, from a recorded speech sound and its transcription. SPPAS is distributed under the terms of the GNU Public License. It was successfully applied during the Evalita 2011 campaign, on Italian map-task dialogues. It can also handle French, English and Chinese, and other languages are easy to add. The paper describes the development of resources and free tools, consisting of acoustic models, phonetic dictionaries, and libraries and programs to deal with these data. All of them are publicly available.

    A Multilingual Text Normalization Approach

    Creating a text corpus requires a sequence of processing steps to build it, normalize it, and then make it directly usable by a given application. This paper presents a generic approach to text normalization and concentrates on the methodology and linguistic engineering needed to develop a multipurpose multilingual text corpus. The approach was applied to French, English, Spanish, Vietnamese, Khmer and Chinese. It consists in splitting the text normalization problem into a set of smaller sub-problems that are as language-independent as possible. A set of text corpus normalization tools with linked resources and a document structuring method are proposed.

    Recherche automatique d'hétéro-répétitions dans un dialogue oral spontané [Automatic detection of other-repetitions in spontaneous spoken dialogue]

    Other-repetitions are a device in which a speaker reproduces what another speaker has just said. This paper proposes a solution to automatically detect other-repetitions in French conversational dialogue. The first step of the proposed system finds all possible other-repetitions in the dialogue. The second step selects the other-repetitions to keep by combining rules with speaker statistics. This automatic detection, evaluated on a one-hour dialogue, shows good results with respect to the expected objectives: recall is 1, and precision is about 80%. The article also proposes defining criteria for other-repetitions that make their detection in spontaneous spoken dialogue systematic.
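    The two-step procedure described in this abstract (find every candidate, then filter) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the token-overlap candidate search, the one-turn window and the stop-word filter are hypothetical and do not reproduce the authors' rules or speaker statistics.

```python
from typing import List, Set, Tuple

# Hypothetical stop-word list; the paper's actual selection rules are not given here.
STOP_WORDS: Set[str] = {"the", "a", "and", "of", "to", "yes", "no"}

Turn = Tuple[str, List[str]]  # (speaker id, tokens of the turn)


def find_candidates(turns: List[Turn], window: int = 1):
    """Step 1: list every (source index, echo index, shared tokens) where a
    speaker reuses tokens from a recent turn produced by another speaker."""
    candidates = []
    for i, (speaker, tokens) in enumerate(turns):
        for j in range(max(0, i - window), i):
            prev_speaker, prev_tokens = turns[j]
            if prev_speaker == speaker:
                continue  # self-repetitions are out of scope
            shared = sorted(set(tokens) & set(prev_tokens))
            if shared:
                candidates.append((j, i, shared))
    return candidates


def select(candidates, min_content_words: int = 1):
    """Step 2: keep only candidates that share enough non-stop-word tokens."""
    kept = []
    for src, echo, shared in candidates:
        content = [w for w in shared if w not in STOP_WORDS]
        if len(content) >= min_content_words:
            kept.append((src, echo, content))
    return kept


dialogue = [
    ("A", "we went to the market yesterday".split()),
    ("B", "the market yesterday really".split()),
    ("A", "yes and the weather was fine".split()),
]
repetitions = select(find_candidates(dialogue))
print(repetitions)
```

    Generating candidates liberally and filtering afterwards mirrors the reported trade-off: the first step aims at full recall, and precision depends entirely on the selection step.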

    A quantitative view of feedback lexical markers in conversational French

    This paper presents a quantitative description of the lexical items used for linguistic feedback in the Corpus of Interactional Data (CID). It reports the raw figures for feedback lexical items as well as more detailed figures on inter-individual variability. This effort is a first step towards a broader analysis covering more discourse situations and featuring communicative function annotation.

    Orthographic Transcription: Which Enrichment is required for Phonetization?

    This paper addresses the enrichment of transcriptions with a view to automatic phonetization. Phonetization is the process of representing sounds with phonetic signs. There are two general ways to build a phonetization process: rule-based systems (with rules derived by inference or proposed by expert linguists) and dictionary-based solutions, which store as much phonological knowledge as possible in a lexicon. In both cases, phonetization is based on a manual transcription. Such a transcription follows conventions that can differ depending on the context in which they were developed. This study focuses on three different enrichments of such a transcription. Evaluations compare phonetizations obtained from automatic systems to a manually phonetized reference. The test corpus is made of three types of speech in French: conversational speech, read speech and political debate. A specific algorithm for the rule-based system is proposed to deal with enrichments. The final system produced a phonetization that is about 95.2% correct (error rates from 3.7% to 5.6% depending on the corpus).

    Developing Resources for Automated Speech Processing of Quebec French

    The analysis of the structure of speech nearly always rests on the alignment of the speech recording with a phonetic transcription. Several tools can now perform this speech segmentation automatically. However, none of them segments Quebec French (QF hereafter) properly. Contrary to what could be assumed, the acoustics and phonotactics of QF differ widely from those of France French (FF hereafter). To segment QF adequately, features like diphthongization of long vowels and affrication of coronal stops have to be taken into account. Thus acoustic models for automatic segmentation must be trained on speech samples exhibiting those phenomena. Dictionaries and lexicons must also be adapted to integrate differences in lexical units (such as very frequent words in QF that are not used in FF) and in the phonology of QF (such as the existence of tense and lax high vowels in QF but not in FF). This paper presents the development of linguistic resources to be included in the SPPAS software tool in order to perform text normalization, phonetization, alignment and syllabification. We adapted the existing French lexicon and developed a QF-specific pronunciation dictionary. We then created an acoustic model from the existing ones and adapted it with 5 minutes of manually time-aligned data. These new resources are all freely distributed with SPPAS version 2.7; they perform the full process of speech segmentation in Quebec French.

    Catégoriser les réponses aux interruptions dans les débats politiques [Categorizing responses to interruptions in political debates]

    This work analyzes political debates from a multimodal point of view. In particular, we focus on the answers produced by a main speaker after being disrupted. Our approach relies on the annotation of each modality and on a review of these annotations. We propose a manual categorization of the observed disruptions. We then apply a categorization method to validate the manual one. The difficulty is to deal with multimodality, missing values and uncertainty in the automatic classification system. This article deals with the analysis of political debates from a multimodal perspective. More specifically, we study responses to interruptions during a debate at the Assemblée nationale. We propose to carry out the analysis through systematic annotation of the different modalities. The argumentative analysis led us to propose a typology of these responses, which was then tested against an automatic classification. The difficulty in building such a system lies in the very nature of the data: multimodal, sometimes missing, and uncertain.

    Annotation automatique en syllabes d'un dialogue oral spontané [Automatic syllable annotation of a spontaneous spoken dialogue]

    This paper proposes a solution to automatically identify syllable boundaries in the particular context of spontaneous speech. The main goal is to identify syllables from a continuous stream of phonemes. First, phoneme classes are defined so as to reduce the complexity of the problem as much as possible. Second, a small number of general rules are defined. Finally, some exception rules adapt the approach to the specific context of spontaneous speech. The proposed system is evaluated and compares favorably to the only two other existing systems for French, with significant improvements. Keywords: syllable, phoneme, segmentation, rules. This article proposes a method to automatically identify syllable boundaries in the particular context of spontaneous speech. The principle is to identify syllables from a stream of phonemes. We first propose grouping phonemes into classes. We then propose segmentation rules based on the sequences of classes encountered. This method was applied to the CID, a French conversational corpus. The evaluations show that our proposal is closer to a manual segmentation than the 3 existing tools.
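    The class-then-rule idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the four phoneme classes, the SAMPA-like symbols and the two boundary rules are hypothetical simplifications, not the rule set of the published system, and no exception rules for spontaneous speech are included.

```python
from typing import List

# Hypothetical phoneme classes: V = vowel, G = glide, L = liquid,
# O = obstruent or other consonant. Symbols are SAMPA-like placeholders.
PHONEME_CLASS = {
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V", "y": "V",
    "j": "G", "w": "G",
    "l": "L", "R": "L",
    "p": "O", "t": "O", "k": "O", "b": "O", "d": "O", "g": "O",
    "s": "O", "z": "O", "f": "O", "v": "O", "m": "O", "n": "O",
}


def boundary_offset(cluster: List[str]) -> int:
    """Given the classes of the consonants between two vowels, return how
    many of them stay with the first syllable."""
    n = len(cluster)
    if n <= 1:
        return 0  # V.V and V.CV
    # Keep an obstruent + liquid/glide onset together: V.OLV, VC.OLV, ...
    if cluster[-2] == "O" and cluster[-1] in ("L", "G"):
        return n - 2
    return n - 1  # otherwise only the last consonant opens the next syllable


def syllabify(phonemes: List[str]) -> List[List[str]]:
    """Split a phoneme sequence into syllables, one vowel per syllable."""
    classes = [PHONEME_CLASS[p] for p in phonemes]
    vowels = [i for i, c in enumerate(classes) if c == "V"]
    if not vowels:
        return [phonemes] if phonemes else []
    syllables, start = [], 0
    for v1, v2 in zip(vowels, vowels[1:]):
        cluster = classes[v1 + 1:v2]
        end = v1 + 1 + boundary_offset(cluster)
        syllables.append(phonemes[start:end])
        start = end
    syllables.append(phonemes[start:])
    return syllables


print(syllabify("p a R t i".split()))
print(syllabify("a p R e".split()))
```

    Working on classes rather than on individual phonemes keeps the rule table small, which is the complexity reduction the abstract mentions; exception rules would be added as extra cases in `boundary_offset`.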

    Multimodal Annotations and Categorization for Political Debates

    The paper introduces an annotation scheme for a political debate dataset that mainly takes the form of video and audio annotations. The annotation contains various information, ranging from general linguistic to domain-specific information. Some of it is annotated with automatic tools, and some is annotated manually. One of the goals is to use this information to predict the categories of the speaker's answers to disruptions. A typology of such answers is proposed, and an automatic categorization system based on a multimodal parametrization is successfully applied.

    Identification thématique hiérarchique : Application aux forums de discussions [Hierarchical topic identification: application to discussion forums]

    Statistical language models aim to give a statistical representation of language but suffer from many imperfections. Recent work has shown that these models can be improved if they can draw on knowledge of the topic being discussed and adapt to it. The topic of the document is then obtained by a topic identification mechanism, but the topics handled in this way are often of different granularity, which is why it seems appropriate to organize them in a hierarchy. Structuring topics this way requires specific topic identification techniques. This article proposes a unigram-based statistical model to automatically identify the topic of a document within a predefined tree of possible topics. We also present a criterion that lets the model attach a degree of reliability to its decision. All experiments were carried out on data extracted from the 'fr' group of discussion forums. Statistical language modeling attempts to capture the regularities of natural language. The most accurate natural language processing systems still suffer from several shortcomings due to the complexity of natural language and to the weakness of current language models. It is commonly conjectured that they should benefit from topic adaptation. The topic of the document is then obtained by a topic identification mechanism, but topics treated this way are often of different granularity. This is why it seems appropriate to organize them in a hierarchy. This topic organization implies the development of specific techniques for topic identification. This paper proposes a statistical model based on unigrams to automatically identify the topic of a document within a tree structure of possible topics. We also present a criterion which reflects the degree of reliability of the decision. Experiments were carried out on data extracted from the French newsgroup 'fr'.
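    Unigram-based topic identification over a topic tree can be sketched as below. This is a hedged illustration, not the model from the paper: the `TopicNode` class, the add-one smoothing, the greedy top-down descent and the use of the log-likelihood margin as a reliability score are all assumptions chosen for the example, and the tiny training texts are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


class TopicNode:
    """A node of a hypothetical topic tree holding unigram counts."""

    def __init__(self, name: str, training_text: str = "", children=None):
        self.name = name
        self.children: List["TopicNode"] = children or []
        self.counts = Counter(training_text.split())
        # A parent topic also covers the vocabulary of its sub-topics.
        for child in self.children:
            self.counts.update(child.counts)
        self.total = sum(self.counts.values())


def log_likelihood(node: TopicNode, words: List[str], vocab_size: int) -> float:
    """Unigram log-probability of the words, with add-one smoothing."""
    return sum(
        math.log((node.counts[w] + 1) / (node.total + vocab_size))
        for w in words
    )


def identify(root: TopicNode, text: str, vocab_size: int) -> List[Tuple[str, float]]:
    """Descend the tree, picking the most likely child at each level.
    Returns (topic name, margin) pairs; the margin between the best and the
    second-best log-likelihood stands in for a reliability criterion."""
    words = text.split()
    path, node = [], root
    while node.children:
        scored = sorted(
            ((log_likelihood(c, words, vocab_size), c) for c in node.children),
            key=lambda pair: pair[0],
            reverse=True,
        )
        margin = scored[0][0] - scored[1][0] if len(scored) > 1 else float("inf")
        node = scored[0][1]
        path.append((node.name, margin))
    return path


# Invented toy hierarchy, for illustration only.
tree = TopicNode("root", children=[
    TopicNode("computing", children=[
        TopicNode("linux", "kernel shell package distribution kernel"),
        TopicNode("windows", "registry driver update license registry"),
    ]),
    TopicNode("leisure", children=[
        TopicNode("cooking", "recipe oven butter flour recipe"),
        TopicNode("cinema", "film actor director screen film"),
    ]),
])
vocab = len(tree.counts)
result = identify(tree, "which kernel package for this distribution", vocab)
print(result)
```

    A small margin at some level signals that the decision at that level is weakly supported, so a caller could stop at the parent topic instead of committing to a leaf.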