
    HPS: High precision stemmer

    Research into unsupervised stemming has, in the past few years, produced methods that are reliable and perform well. Our approach pushes the state of the art further by providing more accurate stemming results. The approach builds a stemmer in two stages. In the first stage, a clustering-based stemming algorithm that exploits the lexical and semantic information of words prepares large-scale training data for the second stage. The second-stage algorithm uses a maximum entropy classifier; stemming-specific features help the classifier decide when and how to stem a particular word. Our goal was a multi-purpose stemming tool: its design opens up possibilities for non-traditional tasks such as approximating lemmas or improving language modeling, while still achieving very good results in the traditional task of information retrieval. The conducted tests show excellent performance in all of these tasks. We compare our stemming method with three state-of-the-art statistical algorithms and one rule-based algorithm on corpora in Czech, Slovak, Polish, Hungarian, Spanish and English. In the tests, our algorithm excels at stemming previously unseen words (words not present in the training set). Moreover, it needs very little text data for training compared with competing unsupervised algorithms.
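
    The two-stage design lends itself to a compact illustration. The sketch below is a minimal, hypothetical rendering of the idea, assuming scikit-learn's LogisticRegression as a stand-in for the maximum entropy classifier, a handful of made-up first-stage (word, stem) pairs, and illustrative suffix features; it is not the authors' actual feature set or implementation.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical output of the first-stage, clustering-based algorithm:
    # large-scale (word, stem) training pairs (tiny toy sample here).
    pairs = [("walking", "walk"), ("walked", "walk"), ("walks", "walk"),
             ("talked", "talk"), ("cats", "cat"), ("cat", "cat")]

    def features(word):
        # Illustrative stemming-specific features: length and final n-grams.
        return {"len": len(word), "suf1": word[-1:],
                "suf2": word[-2:], "suf3": word[-3:]}

    # The second stage learns how many trailing characters to strip.
    X = [features(w) for w, _ in pairs]
    y = [len(w) - len(s) for w, s in pairs]

    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

    def stem(word):
        k = int(clf.predict(vec.transform([features(word)]))[0])
        return word[:-k] if k else word

    print(stem("talking"))  # expected "talk" if the -ing pattern was learned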

    Findings of the Shared Task on Multilingual Coreference Resolution

    This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Participants were asked to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, served as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was the main evaluation metric. Five participating teams submitted eight coreference prediction systems; in addition, the organizers provided a competitive Transformer-based baseline system at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of CoNLL scores averaged across all datasets for the individual languages).
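
    For concreteness, a small sketch of how the metric aggregates: the CoNLL score of a system on one dataset is the unweighted mean of the MUC, B-cubed and CEAF-e F1 scores, and the task ranking averages it across datasets. The F1 values and dataset names below are made-up placeholders, not real results.

    def conll_score(muc_f1, bcub_f1, ceafe_f1):
        # CoNLL score: unweighted mean of the three coreference F1 metrics.
        return (muc_f1 + bcub_f1 + ceafe_f1) / 3.0

    # Hypothetical per-dataset (MUC, B-cubed, CEAF-e) F1 triples.
    results = {"cs_pdt": (78.0, 74.5, 72.1), "en_gum": (70.2, 66.8, 64.9)}

    per_dataset = {d: conll_score(*f1s) for d, f1s in results.items()}
    overall = sum(per_dataset.values()) / len(per_dataset)
    print(per_dataset, f"average: {overall:.2f}")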

    Stochastic semantic parsing: technical report no. DCSE/TR-2006-01

    Recent achievements in automatic speech recognition have started the development of speech-enabled applications. It is becoming insufficient to merely recognize an utterance; applications increasingly need to understand its meaning. Semantic analysis (or semantic parsing) is a part of the natural language understanding process whose goal is to represent what the speaker intended to say. The thesis summarizes aspects of semantic analysis with an emphasis on the stochastic approach. Fundamental stochastic models, along with their training and evaluation, are explained in detail. Since the performance of such a system is significantly influenced by preprocessing, preprocessing is also described in the thesis.
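
    As a minimal illustration of the stochastic approach, the sketch below casts semantic analysis as concept tagging with a hidden Markov model decoded by Viterbi; the concept states, vocabulary and probabilities are toy values invented for the example, not taken from the report.

    states = ["FROM_CITY", "TO_CITY", "NONE"]
    start = {"FROM_CITY": 0.2, "TO_CITY": 0.2, "NONE": 0.6}
    trans = {s: {t: 1 / 3 for t in states} for s in states}  # uniform toy transitions
    emit = {
        "FROM_CITY": {"from": 0.5, "prague": 0.5},
        "TO_CITY": {"to": 0.5, "paris": 0.5},
        "NONE": {"i": 0.4, "fly": 0.4, "from": 0.1, "to": 0.1},
    }

    def viterbi(words):
        # V[t][s]: probability of the best concept path ending in state s at t.
        V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
        back = []
        for w in words[1:]:
            col, ptr = {}, {}
            for s in states:
                best = max(states, key=lambda p: V[-1][p] * trans[p][s])
                col[s] = V[-1][best] * trans[best][s] * emit[s].get(w, 1e-6)
                ptr[s] = best
            V.append(col)
            back.append(ptr)
        path = [max(states, key=lambda s: V[-1][s])]
        for ptr in reversed(back):  # follow back-pointers to recover the path
            path.append(ptr[path[-1]])
        return list(reversed(path))

    print(viterbi("i fly from prague to paris".split()))
    # -> ['NONE', 'NONE', 'FROM_CITY', 'FROM_CITY', 'TO_CITY', 'TO_CITY']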

    Deep Learning for Text Data on Mobile Devices

    With the rise of Artificial Intelligence (AI), it is becoming a significant phenomenon in our lives. As with many other powerful tools, AI brings many advantages but also many risks. Predictions and automation can significantly help in our everyday lives. However, sending our data to servers for processing can severely hurt our privacy. In this paper, we describe experiments designed to find out whether we can enjoy the benefits of AI in the privacy of our mobile devices. We focus on text data, since such data are easy to store in large quantities for mining by third parties. We measure the performance of deep learning methods in terms of accuracy (compared to fully fledged server models) and speed (the number of text documents processed per second). We conclude that with a few relatively small modifications, mobile devices can process hundreds to thousands of documents per second while leveraging deep learning models.
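
    The speed measurement can be sketched as a simple timing loop over a document batch; `classify` below is a hypothetical stand-in for an on-device model, and the corpus is made up.

    import time

    def classify(doc):
        # Hypothetical stand-in for an on-device deep-learning inference call.
        return len(doc) % 2

    def docs_per_second(docs, repeats=5):
        # Best-of-N wall-clock timing of a full pass over the batch.
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            for d in docs:
                classify(d)
            best = min(best, time.perf_counter() - t0)
        return len(docs) / best

    docs = ["some text document to classify"] * 1000  # made-up corpus
    print(f"{docs_per_second(docs):.0f} docs/s")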

    ULSAna: A Language-Independent Semantic Analyzer

    We present a live cross-lingual system capable of producing shallow semantic annotations of natural language sentences, currently for 51 languages. The domain of the input sentences is in principle unconstrained. The system uses a single set of training data (in English) for all languages, so the resulting semantic annotations are consistent across languages. We use CoNLL Semantic Role Labeling training data and Universal Dependencies as the basis for the system. The system is publicly available and supports processing data in batches; it can therefore be easily used by the community for research tasks.
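
    A rough, hypothetical sketch of the pipeline idea: parse any input language into Universal Dependencies, then run a semantic-role labeler trained once on English data over the language-neutral trees. Neither function below is the actual ULSAna interface; both are toy stand-ins.

    def ud_parse(sentence, lang):
        # Stand-in for a real UD parser (e.g. UDPipe); returns (token, head,
        # deprel) triples. Here we fake a flat parse for illustration.
        toks = sentence.split()
        return [(t, 0, "root" if i == 0 else "dep") for i, t in enumerate(toks)]

    def srl_label(tree):
        # Stand-in for the labeler trained once on English SRL data: it sees
        # only UD structure, so one model can serve every supported language.
        return [(tok, "PRED" if rel == "root" else "ARG") for tok, _, rel in tree]

    def annotate_batch(sentences, lang):
        # Batch processing, which the system supports.
        return [srl_label(ud_parse(s, lang)) for s in sentences]

    print(annotate_batch(["Kočka honí myš"], lang="cs"))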

    Curriculum Learning in Sentiment Analysis

    This work deals with curriculum learning for deep learning models on the sentiment analysis task. We design a new curriculum learning scheme for text data: the training dataset is reordered so that simpler examples are introduced first, with example difficulty estimated by sentence length (shorter examples are assumed to be simpler). We also experiment with measuring word frequency, a technique proposed by earlier researchers. We evaluate the changes in the overall accuracy of the models under both curriculum learning techniques. Our experiments do not show an increase in accuracy for either method. Nevertheless, we reach a new state of the art in sentiment analysis for Czech as a by-product of our effort.
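
    The length-based curriculum is easy to sketch: sort the training set so that shorter examples come first. The data below are made up, and the frequency-based ordering is only a rough proxy for the earlier researchers' technique.

    from collections import Counter

    # Made-up (text, label) training pairs.
    train = [
        ("terrible", 0),
        ("a wonderful, warm and funny film", 1),
        ("not good", 0),
        ("one of the best performances this year", 1),
    ]

    # Length-based curriculum: shorter sentences are assumed simpler.
    by_length = sorted(train, key=lambda ex: len(ex[0].split()))

    # Frequency-based variant (rough proxy for the earlier technique):
    # examples whose rarest word is still frequent come first.
    freq = Counter(w for text, _ in train for w in text.split())
    by_frequency = sorted(train, key=lambda ex: -min(freq[w] for w in ex[0].split()))

    for text, label in by_length:
        print(len(text.split()), label, text)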