HPS: High precision stemmer
Abstract Research into unsupervised stemming has, in the past few years, produced methods that are reliable and perform well. Our approach further shifts the state of the art by providing more accurate stemming results. The stemmer is built in two stages. In the first stage, a clustering-based stemming algorithm, which exploits the lexical and semantic information of words, prepares large-scale training data for the second stage. The second-stage algorithm uses a maximum entropy classifier with stemming-specific features that help it decide when and how to stem a particular word. We have pursued the goal of creating a multi-purpose stemming tool: its design opens up possibilities for non-traditional tasks such as approximating lemmas or improving language modeling, while we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above-mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm on corpora in the Czech, Slovak, Polish, Hungarian, Spanish, and English languages. In the tests, our algorithm excels at stemming previously unseen words (words not present in the training set). Moreover, our approach demands very little text data for training compared with competing unsupervised algorithms.
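The two-stage design described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the prefix-based clustering, the suffix features, and the majority-vote "classifier" standing in for the maximum entropy model are all assumptions made for the example.

```python
from collections import defaultdict

def stage1_training_pairs(words, min_prefix=4):
    """Stage 1 (illustrative): cluster words sharing a long common prefix,
    treat the cluster's longest common prefix as the stem, and emit
    (word, stem) pairs as training data for the second stage."""
    clusters = defaultdict(list)
    for w in words:
        clusters[w[:min_prefix]].append(w)
    pairs = []
    for members in clusters.values():
        if len(members) < 2:
            continue
        stem = members[0]
        for w in members[1:]:
            i = 0
            while i < min(len(stem), len(w)) and stem[i] == w[i]:
                i += 1
            stem = stem[:i]
        pairs.extend((w, stem) for w in members)
    return pairs

def stage2_train(pairs):
    """Stage 2 (stand-in for the maximum entropy classifier): for each
    word-final trigram, learn the most frequent number of characters
    to strip -- i.e., when and how to stem."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, stem in pairs:
        counts[word[-3:]][len(word) - len(stem)] += 1
    return {suffix: max(c, key=c.get) for suffix, c in counts.items()}

def stem(model, word):
    """Apply the learned suffix-stripping decision to a (possibly unseen) word."""
    strip = model.get(word[-3:], 0)
    return word[: len(word) - strip] if strip else word

words = ["walking", "walked", "walks", "talking", "talked"]
model = stage2_train(stage1_training_pairs(words))
```

Because the second stage generalizes over suffix features rather than memorizing words, it can stem previously unseen words such as "jumping", which mirrors the unseen-word strength the abstract reports.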
Findings of the Shared Task on Multilingual Coreference Resolution
This paper presents an overview of the shared task on multilingual
coreference resolution associated with the CRAC 2022 workshop. Shared task
participants were expected to develop trainable systems capable of identifying
mentions and clustering them according to identity coreference. The public
edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used
as the source of training and evaluation data. The CoNLL score used in previous
coreference-oriented shared tasks was used as the main evaluation metric. There
were 8 coreference prediction systems submitted by 5 participating teams; in
addition, a competitive Transformer-based baseline system was provided by
the organizers at the beginning of the shared task. The winning system
outperformed the baseline by 12 percentage points (in terms of CoNLL scores
averaged across all datasets for individual languages).
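The headline numbers above can be reproduced mechanically. The sketch below shows how a CoNLL score is formed (the average F1 of the MUC, B-cubed, and CEAF-e metrics) and how scores are macro-averaged across datasets to compute the percentage-point gap; the dataset scores used here are hypothetical, not the actual shared-task results.

```python
def conll_score(muc_f1, b3_f1, ceafe_f1):
    """The CoNLL score: average F1 of the MUC, B-cubed, and CEAF-e metrics."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

def macro_average(scores_by_dataset):
    """Average CoNLL scores across datasets -- the shared task's headline number."""
    return sum(scores_by_dataset.values()) / len(scores_by_dataset)

# hypothetical per-dataset CoNLL scores, for illustration only
baseline = {"cs_dataset": 60.0, "de_dataset": 55.0}
winner = {"cs_dataset": 72.0, "de_dataset": 67.0}

# gap in percentage points between the winning system and the baseline
gap_pp = macro_average(winner) - macro_average(baseline)
```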
Stochastic semantic parsing: technical report no. DCSE/TR-2006-01
The recent achievements in automatic speech recognition have started the development of speech-enabled applications. It is becoming insufficient to merely recognize an utterance; applications increasingly need to understand its meaning. Semantic analysis
(or semantic parsing) is part of the natural language understanding process; its goal is to represent what the speaker intended to say. The thesis summarizes aspects of semantic analysis with emphasis on the stochastic approach. Fundamental stochastic models, along with their training and evaluation, are explained in detail. Since the performance of such a system is significantly influenced by preprocessing, preprocessing is also described in the thesis.
Deep Learning for Text Data on Mobile Devices
With the rise of Artificial Intelligence (AI), it is becoming a significant phenomenon in our lives. As with many other powerful tools, AI brings many advantages but many risks as well. Predictions and automation can significantly help in our everyday lives. However, sending our data to servers for processing can severely hurt our privacy. In this paper, we describe experiments designed to find out whether we can enjoy the benefits of AI in the privacy of our mobile devices. We focus on text data, since such data are easy to store in large quantities for mining by third parties. We measure the performance of deep learning methods in terms of accuracy (compared with fully-fledged server models) and speed (the number of text documents processed per second). We conclude that, with a few relatively small modifications, mobile devices can process hundreds to thousands of documents while leveraging deep learning models.
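The speed metric above (documents processed per second) can be measured with a simple timing harness. This is a generic sketch, not the paper's benchmark code; the stand-in "model" is a trivial token counter chosen only so the example runs anywhere.

```python
import time

def docs_per_second(model_fn, documents, repeats=3):
    """Measure throughput as documents processed per second.
    Takes the best of several timed passes to reduce timer noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            model_fn(doc)
        best = min(best, time.perf_counter() - start)
    return len(documents) / best

# a trivial stand-in model: count tokens in each document
docs = ["some short text document"] * 1000
rate = docs_per_second(lambda d: len(d.split()), docs)
```

On a real device the same harness would wrap the on-device inference call, making the hundreds-to-thousands-of-documents claim directly checkable.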
ULSAna: A Language-Independent Semantic Analyzer
We present a live cross-lingual system capable of producing shallow semantic annotations of natural-language sentences, currently for 51 languages. The domain of the input sentences is in principle unconstrained. The system uses a single set of training data (in English) for all languages; the resulting semantic annotations are therefore consistent across languages. We use CoNLL Semantic Role Labeling training data and Universal Dependencies syntactic annotation as the basis for the system. The system is publicly available and supports processing data in batches; therefore, it can easily be used by the community for research tasks.
Hluboké učení pro textová data na mobilních zařízeních (Deep Learning for Text Data on Mobile Devices)
Like any powerful tool, AI brings many advantages but also many risks. Predictions and automation can significantly help in our everyday lives, but sending user data to servers for processing can compromise privacy. We focused on text data and conducted several experiments to verify whether sending data to computing servers is a necessity in the era of powerful mobile devices.
Curriculum Learning in Sentiment Analysis
This work deals with curriculum learning for deep neural networks on the sentiment analysis task. We design a new curriculum-learning approach for text data: we reorder the training dataset so that simpler examples are introduced first, estimating the difficulty of an example by sentence length (shorter examples are assumed to be simpler). We also experiment with ordering by word frequency, a technique proposed by earlier researchers. We evaluate the changes in overall model accuracy for both curriculum-learning techniques. Our experiments do not show an increase in accuracy for either method. Nevertheless, we reach a new state of the art in sentiment analysis for Czech as a by-product of our effort.
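The two ordering schemes described above can be sketched as follows. This is an illustrative sketch under assumed data shapes (each example a dict with a `"text"` field; whitespace tokenization), not the authors' training pipeline.

```python
def length_curriculum(dataset):
    """Order training examples from shortest to longest sentence --
    the length-based difficulty estimate proposed in this work."""
    return sorted(dataset, key=lambda ex: len(ex["text"].split()))

def frequency_curriculum(dataset, word_counts):
    """Alternative ordering by average word frequency, following earlier
    work: examples with more frequent (commoner) words come first."""
    def avg_freq(ex):
        tokens = ex["text"].split()
        return sum(word_counts.get(t, 0) for t in tokens) / max(len(tokens), 1)
    return sorted(dataset, key=avg_freq, reverse=True)

data = [
    {"text": "a much longer training sentence"},
    {"text": "short one"},
]
ordered = length_curriculum(data)
```

Either ordering is applied once before training; the model then sees batches drawn from the reordered dataset instead of a random shuffle.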