547 research outputs found

    From `Snippet-lects' to Doculects and Dialects: Leveraging Neural Representations of Speech for Placing Audio Signals in a Language Landscape

    XLSR-53, a multilingual model of speech, builds a vector representation from audio, which allows for a range of computational treatments. The experiments reported here use this neural representation to estimate the degree of closeness between audio files, ultimately aiming to extract relevant linguistic properties. We use max-pooling to aggregate the neural representations from a "snippet-lect" (the speech in a 5-second audio snippet) to a "doculect" (the speech in a given resource), then to dialects and languages. We use data from corpora of 11 dialects belonging to 5 less-studied languages. Similarity measurements between the 11 corpora bring out greatest closeness between those that are known to be dialects of the same language. The findings suggest that (i) dialect/language can emerge among the various parameters characterizing audio files and (ii) estimates of overall phonetic/phonological closeness can be obtained for a little-resourced or fully unknown language. The findings help shed light on the type of information captured by neural representations of speech and how it can be extracted from these representations.
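    The aggregation step described in this abstract can be illustrated with a minimal sketch. It assumes that each snippet is already represented by a fixed-length vector and that closeness is measured with cosine similarity; the abstract does not name the similarity measure, so that choice is an assumption, and the function names and toy 3-dimensional vectors below are hypothetical, not XLSR-53 output.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [max(column) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine similarity (an assumed measure; the abstract does not specify one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "snippet-lect" vectors (made-up values), two snippets per resource.
doculect_a = max_pool([[0.1, 0.9, 0.2], [0.3, 0.4, 0.1]])
doculect_b = max_pool([[0.2, 0.8, 0.1], [0.1, 0.5, 0.3]])
doculect_c = max_pool([[0.9, 0.1, 0.7], [0.8, 0.2, 0.9]])

# Pairwise closeness between the pooled "doculect" vectors.
sim_ab = cosine_similarity(doculect_a, doculect_b)
sim_ac = cosine_similarity(doculect_a, doculect_c)
print(f"a~b: {sim_ab:.3f}  a~c: {sim_ac:.3f}")
```

    The same pooling function could be applied again to go from doculect vectors to a dialect or language vector, mirroring the hierarchy the abstract describes.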

    Insights into Naxi and Pumi at the end of the 19th century: evidence on sound changes from the word lists by Charles-Eudes Bonin

    The word lists published in 1903 by C.-E. Bonin for several languages of East Asia are highly rudimentary; the transcription is based on French spelling conventions. These lists nonetheless provide hints about the pronunciation of these languages at the end of the 19th century. We examine two of Bonin's lists in light of more recent and more systematic descriptions of the same languages, looking for evidence of sound changes. The Naxi word list offers hints about the pronunciation of the vowels /i/, /y/ and /o/ and the degree of palatalization of velars before high front vowels. The list for Pumi shows that the initial cluster /st-/ was still present at the time in the dialect recorded.

    A simple architecture for the fine-grained documentation of endangered languages: the LACITO multimedia archive

    To appear in: Proceedings of Oriental-COCOSDA 2011. Presented at: 2011 International Conference on Speech Database and Assessments (Oriental COCOSDA 2011), 26-28 October 2011, Taiwan. The LACITO multimedia archive provides free access to documents of connected, spontaneous speech, mostly in "rare" or endangered languages, recorded in their cultural context and transcribed in consultation with native speakers. Its goal is to contribute to the documentation and study of a precious human heritage: the world's languages. It has a special strength in languages of Asia and the Pacific. The LACITO archive was built with little personnel and less funding. It has been devised, developed and maintained over two decades by two researchers assisted by one engineer. Its simple architecture is based on current standards: Unicode character coding and XML markup, and Dublin Core/Open Language Archives Community recommendations for metadata. The data can be consulted online with any standard browser. The technical simplicity of the tools developed at LACITO makes them suitable for the creation of similar databases at other institutions. (For instance, tools from this archive were successfully adapted in the creation of the Formosan Languages archive.)

    Approche pour la synthèse diastéréosélective d’analogues a-nucléosidiques et synthèse de nouveaux analogues de nucléosides C2'-désoxy

    Natural nucleotides are the constituent monomers of DNA and RNA, and are necessary for the proliferation of cancer cells and viruses. The use of nucleoside analogues as anti-cancer and/or antiviral agents quickly aroused great therapeutic interest. Several analogues containing an all-carbon quaternary center in the C3' position have been synthesized by the Guindon laboratory and have interesting biological activities. A new acyclic approach has been developed in order to gain efficient and selective access to a multitude of these novel targeted nucleoside analogues. This strategy is distinguished by the development of a new enantioselective Mukaiyama aldol reaction starting from a protected alpha-alkoxy aldehyde and a chiral complex. The stereoselective addition of the nucleobase to an acyclic dithioacetal precursor, followed by an intramolecular SN2-like cyclization, allows the direct synthesis of alpha-anomers in the (D)-1',2'-trans series bearing a fluorine atom at the C2' position and a quaternary center at C3'. The differentiation of the two primary alcohols throughout the synthetic routes is necessary to selectively functionalize the C3' or C5' position at the end of the synthesis to potentially confer enhanced biological properties on such analogues. The application of this strategy has also made it possible to easily and efficiently synthesize a new family of C2'-deoxy analogues bearing a C3' quaternary center. The differentiation of the C3' and C5' primary alcohols facilitates the separation of the glycosylation products through the selective deprotection of one of these two positions. It has been determined that when the glycosylation precursor contains an acetylglycine at the C3' position, a beta-directing effect is induced. Finally, new C2'-deoxy analogues have been prepared and are currently being tested against a series of viruses and polymerase enzymes.

    Phonetic lessons from automatic phonemic transcription: preliminary reflections on Na (Sino-Tibetan) and Tsuut’ina (Dene) data

    Automatic phonemic transcription tools now reach high levels of accuracy on a single speaker with relatively small amounts of training data: on the order of 100 to 250 minutes of transcribed speech. Beyond its practical usefulness for language documentation, use of automatic transcription also yields some insights for phoneticians. The present report illustrates this through qualitative error analysis on two test cases, Yongning Na (Sino-Tibetan) and Tsuut’ina (Dene). Among other benefits, error analysis allows for a renewed exploration of phonetic detail: examining the output of phonemic transcription software compared with spectrographic and aural evidence. From a methodological point of view, the present report is intended as a case study in Computational Language Documentation: an interdisciplinary approach that associates fieldworkers (“diversity linguists”) and computer scientists with phoneticians/phonologists.

    Contribuer au progrès solidaire des recherches et de la documentation : la Collection Pangloss et la Collection AuCo

    This talk sets out the scientific goals and achievements of two collections hosted by the Cocoon Open Archive of oral resources: the Pangloss Collection, which mainly focuses on unwritten languages from all areas of the world; and the AuCo Collection, which is dedicated to languages of Vietnam and neighbouring countries. The aim is to contribute to joint progress in language documentation and in research. Emphasis is placed on the perspectives for phonetic/phonological research that are opened by some recent achievements in the framework of these two Collections.

    alpha,omega-Bis(trialkoxysilyl) difunctionalized polycyclooctenes from ruthenium-catalyzed chain-transfer ring-opening metathesis polymerization

    The ring-opening metathesis polymerization/cross-metathesis (ROMP/CM) of cyclooctene (COE) using bis(trialkoxysilyl)alkenes as chain-transfer agents (CTAs) and Ru catalysts to afford difunctionalized polyolefins is reported. The formation of alpha,omega-bis(trialkoxysilyl) telechelic polycycloolefins (DF) with controlled molar mass values takes place quite selectively (>90 wt%), along with minor amounts of cyclic non-functionalized polymers (CNF), as evidenced by NMR, MALDI-ToF MS, SEC analyses and fractionation experiments. The nature of the CTA and catalyst strongly influenced the efficiency and selectivity of the reaction. (MeO)3SiCH2CH=CHCH2Si(OMe)3 (2) and (MeO)3Si(CH2)3NHC(O)OCH2CH=CHCH2OC(O)NH(CH2)3Si(OMe)3 (5) proved to be the most efficient CTAs in terms of reactivity, catalyst productivity and selectivity towards DF. Diurethane CTA 5 is easily prepared, and can also be conveniently generated in situ during the ROMP/CM. Grubbs' 2nd-generation catalyst (G2) and the Hoveyda-Grubbs catalyst (HG2) afforded the best compromise in terms of selectivity and productivity, with turnover numbers of up to 95 000 mol(COE)·mol(Ru)⁻¹ and 5000 mol(CTA)·mol(Ru)⁻¹.

    Integrating Automatic Transcription into the Language Documentation Workflow: Experiments with Na Data and the Persephone Toolkit

    Automatic speech recognition tools have potential for facilitating language documentation, but in practice these tools remain little-used by linguists for a variety of reasons: the technology is still new (and evolving rapidly), user-friendly interfaces are still under development, and case studies demonstrating the practical usefulness of automatic recognition in a low-resource setting remain few. This article reports on a success story in integrating automatic transcription into the language documentation workflow, specifically for Yongning Na, a language of Southwest China. Using Persephone, an open-source toolkit, a single-speaker speech transcription tool was trained on five hours of manually transcribed speech. The experiments found that this method can achieve a remarkably low error rate (on the order of 17%), and that automatic transcriptions were useful as a canvas for the linguist. The present report is intended for linguists with little or no knowledge of speech processing. It aims to provide insights into (i) the way the tool operates and (ii) the process of collaborating with natural language processing specialists. Practical recommendations are offered on how to anticipate the requirements of this type of technology from the early stages of data collection in the field. (National Foreign Language Resource Center)
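    For readers unfamiliar with how a transcription error rate such as the one quoted above is typically computed, here is a minimal sketch. It assumes the rate is an edit-distance-based label error rate (substitutions, insertions and deletions divided by the length of the reference); the abstract does not spell out the metric, so this is an assumption, and the function names and toy label sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_error_rate(ref, hyp):
    """Edit distance normalised by the reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up label sequences: one substitution and one deletion.
reference = ["a", "b", "c", "d", "e", "f"]
hypothesis = ["a", "b", "x", "d", "e"]
rate = label_error_rate(reference, hypothesis)
print(f"label error rate: {rate:.1%}")
```

    Under this definition, an error rate of about 17% would mean roughly one edit for every six reference labels.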

    Documenting and Researching Endangered Languages: The Pangloss Collection

    The Pangloss Collection is a language archive under development since 1994 at the Langues et Civilisations à Tradition Orale (LACITO) research group of the French Centre National de la Recherche Scientifique (CNRS). It contributes to the documentation and study of the world’s languages by providing free access to documents of connected, spontaneous speech, mostly in endangered or under-resourced languages, recorded in their cultural context and transcribed in consultation with native speakers. The Collection is an Open Archive containing media files (recordings), text annotations, and metadata; it currently contains over 1,400 recordings in 70 languages, including more than 400 transcribed and annotated documents. The annotations consist of transcription, free translation in English, French and/or other languages, and, in many cases, word or morpheme glosses; they are time-aligned with the recordings, usually at the utterance level. A web interface makes these annotations accessible online in an interlinear display format, in synchrony with the sound, using any standard browser. The structure of the XML documents makes them accessible to searching and indexing, always preserving the links to the recordings. Long-term preservation is guaranteed through a partnership with a digital archive. A guiding principle of the Pangloss Collection is that a close association between documentation and research is highly profitable to both. This article presents the collections currently available; it also aims to convey a sense of the range of possibilities they offer to the scientific and speaker communities and to the general public. (National Foreign Language Resource Center)