12 research outputs found

    Morphological disambiguation with the maximum entropy method (Morfológiai egyértelműsítés maximum entrópia módszerrel)

    In this paper we compare statistical morphological disambiguation models for Hungarian that also incorporate a corpus-independent morphological analyzer. It is known that, for Hungarian, using a morphological analyzer increases accuracy compared with purely statistical methods. At the same time, by means of the maximum entropy method, our models also give effective estimates for words that the morphological analyzer does not recognize, so they behave robustly even on test corpora to which the morphological analyzer has not been adapted.

    Named entity extraction for speech

    Named entity extraction is a field that has generated much interest over recent years with the explosion of the World Wide Web and the necessity for accurate information retrieval. Named entity extraction, the task of finding specific entities within documents, has proven of great benefit for numerous information extraction and information retrieval tasks. As well as multiple language evaluations, named entity extraction has been investigated on a variety of media forms with varying success. In general, these media forms have all been based upon standard text and assumed that any variation from standard text constitutes noise. We investigate how it is possible to find named entities in speech data. Where others have focussed on applying named entity extraction techniques to transcriptions of speech, we investigate a method for finding the named entities directly from the word lattices associated with the speech signal. The results show that it is possible to improve named entity recognition at the expense of word error rate (WER), in contrast to the general view that F-score is directly proportional to WER. We use a Hidden Markov Model (HMM) style approach to the task of named entity extraction and show how it is possible to utilise an HMM to find named entities within speech lattices. We further investigate how it is possible to improve results by considering an alternative derivation of the joint probability of words and entities to the one traditionally used. This new derivation is particularly appropriate to speech lattices as no presumptions are made about the sequence of words. The HMM style approach that we use requires using a number of language models in parallel. We have developed a system for discriminately retraining these language models based upon the results of the output, and we show how it is possible to improve named entity recognition by iterating over both training data and development data.
    We also consider how part-of-speech (POS) information can be used within word lattices. We devise a method of labelling a word lattice with POS tags and adapt the model to make use of these POS tags when producing the best path through the lattice. The resulting path provides the most likely sequence of words, entities and POS tags, and we show how this new path is better than the previous path, which ignored the POS tags.

    Principled hidden tagset design for tiered tagging of Hungarian

    For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to descriptively cover the content of a word-form lexicon tends to rise quite rapidly, approaching a thousand or more distinct codes. For the purpose of automatic disambiguation of arbitrary written texts, using such large tagsets would raise many problems, ranging from the implementation issues of making a tagger work with such large tagsets to the more theory-based difficulty of sparseness of training data. Tiered tagging is one way to alleviate this problem by reformulating it in the following way: starting from a large set of MSDs, design a reduced tagset, the C-tagset, that is manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the MSD-set cardinality is several thousand. This means that designing a manageable C-tagset calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features.
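    The reduction described in this abstract can be pictured with a small sketch. It is not the authors' design procedure: it assumes positional MSD codes (one character per feature), a hypothetical choice of which positions to keep, and that a full MSD would later be recovered by looking the word form up in the lexicon, so any word form whose MSDs collapse onto one C-tag is flagged. The function names, the MSD codes and the lexicon entries are all invented for illustration.

```python
def reduce_msd(msd, keep_positions):
    """Project a positional MSD code onto the kept feature positions."""
    return "".join(msd[i] for i in keep_positions if i < len(msd))


def build_ctag_mapping(msds, keep_positions):
    """Map every full MSD to its reduced C-tag."""
    return {msd: reduce_msd(msd, keep_positions) for msd in msds}


def unrecoverable_forms(lexicon, mapping):
    """Find word forms whose distinct MSDs collapse onto the same C-tag.

    For such forms a lexicon lookup on (form, C-tag) cannot restore
    a unique MSD, so the dropped features were still needed there.
    """
    clashes = {}
    for form, msds in lexicon.items():
        by_ctag = {}
        for msd in msds:
            by_ctag.setdefault(mapping[msd], []).append(msd)
        merged = {c: ms for c, ms in by_ctag.items() if len(ms) > 1}
        if merged:
            clashes[form] = merged
    return clashes


# Hypothetical MSD codes and lexicon entries, for illustration only.
MSDS = ["Nc-sn", "Nc-sa", "Nc-pn", "Vmip3s"]
LEXICON = {"form_a": ["Nc-sn"], "form_b": ["Nc-sn", "Nc-sa"]}

mapping = build_ctag_mapping(MSDS, keep_positions=(0, 1))
print(len(set(mapping.values())), "C-tags from", len(MSDS), "MSDs")
print(unrecoverable_forms(LEXICON, mapping))
```

    Keeping more positions (for example `keep_positions=(0, 1, 4)`) enlarges the C-tagset but removes the clash for `form_b`, which is the trade-off a feature evaluation of this kind has to weigh.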

    XVIII. Magyar Számítógépes Nyelvészeti Konferencia


    Automatic parts of speech determination in a morphologically complex language

    The study aimed to test the extent to which our cognitive system can rely on phonotactic information, i.e., possible/permissible combinations of phonemes/graphemes, in the tasks of automatic processing and production of words in languages with rich inflectional morphology. To answer this question, three studies were conducted. In the first study, support vector machines (SVM) were used to discriminate parts of speech (PoS) with more than one possible meaning (i.e., ambiguous PoS). In the second study, inflected word forms were produced with memory-based learning (MBL). Based on the results of the second study, a behavioral experiment was conducted as the third study to test the cognitive plausibility of the MBL performance. The discrimination of ambiguous PoS was performed using permissible sequences of two and three characters/sounds (i.e., bigrams and trigrams), whose frequency of occurrence within individual grammatical types was calculated depending on their position in a word: at the beginning, at the end, inside the word, and irrespective of position.
    The maximum accuracy achieved was approximately 95%. It was obtained when bigrams irrespective of position in a word were used, with an SVM model using the RBF kernel function. Such high accuracy suggests that the bigrams' probability distribution is informative about the types of inflected words. Interestingly, the least informative were bigrams at the end and at the beginning of words. The MBL model was used in the task of automatic production of inflected forms, utilizing phonotactic information from the last four syllables. On a sample of 89024 inflected words taken from the Frequency Dictionary of Serbian Language (daily press), the achieved accuracy was about 92%. For this result the MBL used the leave-one-out method and a constant nearest-neighbor set size of 7 (k = 7). We identified several factors that contributed to this accuracy, in particular part of speech, grammatical type, formation method, number of examples within one grammatical type, number of exceptions, and number of phonological alternations. The visual lexical decision experiment revealed that words that the MBL model produced incorrectly also induced longer reaction times. Thus, we concluded that the MBL model might be cognitively plausible. In addition, we reconfirmed the informativeness of phonotactic information, this time in a human comprehension task. Overall, the findings from the three studies support a significant role for phonotactic information in both the processing and the production of morphologically complex words. The results also suggest that this information needs to be taken into account when discussing the emergence of larger linguistic units and patterns.
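    The evaluation setup named in this abstract (character bigrams by position, nearest neighbors, leave-one-out) can be sketched in a few lines. This is a simplified stand-in, not the study's model: it classifies words by label instead of producing inflected forms, it uses plain bigram overlap as the similarity measure, and the function names and the four toy words with their labels are assumptions made only for the example.

```python
from collections import Counter


def positional_bigrams(word):
    """Return a word's character bigrams: first, last, and all of them."""
    grams = [word[i:i + 2] for i in range(len(word) - 1)]
    return {"initial": grams[:1], "final": grams[-1:], "all": grams}


def overlap(a, b):
    """Count bigrams shared by two words (multiset intersection)."""
    shared = Counter(positional_bigrams(a)["all"]) & Counter(positional_bigrams(b)["all"])
    return sum(shared.values())


def knn_predict(word, examples, k=7):
    """Label a word by majority vote among its k most similar stored examples."""
    neighbours = sorted(examples, key=lambda ex: overlap(word, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(examples, k=7):
    """Hold out each example in turn and predict it from all the others."""
    hits = 0
    for i, (word, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        hits += knn_predict(word, rest, k) == label
    return hits / len(examples)


# Toy data, invented for illustration only (N = noun, V = verb).
EXAMPLES = [("kuca", "N"), ("ruca", "N"), ("radim", "V"), ("gradim", "V")]

print(positional_bigrams("gradim"))
print(leave_one_out_accuracy(EXAMPLES, k=1))
```

    With only four toy words, `k=1` is used in the call above; the `k = 7` setting reported in the abstract only makes sense on a sample of realistic size.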