4 research outputs found

    HPS: High precision stemmer

    Research into unsupervised stemming has, in the past few years, produced methods that are reliable and perform well. Our approach pushes the state of the art further by providing more accurate stemming results. The approach builds a stemmer in two stages. In the first stage, a clustering-based stemming algorithm, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. Stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities for solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above-mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (words that are not present in the training set). Moreover, we found that our approach demands very little text data for training when compared with competing unsupervised algorithms.
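
    The two-stage idea in this abstract can be illustrated with a deliberately simplified sketch. It is not the HPS implementation: stage 1 here clusters words only by a shared prefix (the paper also uses semantic information), and stage 2 replaces the maximum entropy classifier with a plain suffix frequency table. All function names and parameters (`stage1_cluster`, `stage2_train`, `stem`, `min_prefix`, `min_stem`) are assumptions made for illustration.

```python
import os
from collections import Counter, defaultdict


def stage1_cluster(vocab, min_prefix=4):
    """Stand-in for stage 1: group words that share their first
    `min_prefix` letters and emit (word, suffix) training pairs."""
    clusters = defaultdict(list)
    for word in vocab:
        if len(word) >= min_prefix:
            clusters[word[:min_prefix]].append(word)

    pairs = []
    for members in clusters.values():
        if len(members) < 2:
            continue  # a single word gives no evidence about its stem
        stem_guess = os.path.commonprefix(members)
        for word in members:
            pairs.append((word, word[len(stem_guess):]))
    return pairs


def stage2_train(pairs):
    """Stand-in for stage 2: count how often each non-empty suffix
    was stripped in the automatically prepared training data."""
    return Counter(suffix for _, suffix in pairs if suffix)


def stem(word, suffix_counts, min_stem=3):
    """Strip the most frequent known suffix that matches the word,
    or return the word unchanged when no suffix applies."""
    candidates = [
        s for s in suffix_counts
        if word.endswith(s) and len(word) - len(s) >= min_stem
    ]
    if not candidates:
        return word
    best = max(candidates, key=lambda s: (suffix_counts[s], len(s)))
    return word[:-len(best)]


vocab = ["walking", "walked", "walks", "talking", "talked", "talks"]
model = stage2_train(stage1_cluster(vocab))
print(stem("jumping", model))  # a word never seen in training
```

    The point of the split is that the second stage generalizes from automatically produced training pairs, so it can stem words absent from the training vocabulary; a real classifier would weigh many features instead of a single suffix count.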

    Determination of basic form of words

    Lemmatization is an important preprocessing step for text mining in many applications. The lemmatization process is similar to stemming, with the difference that it does not only determine the word stem but tries to convert the word to its basic form using the Brute Force and Suffix Stripping methods. The main aim of this paper is to present methods for improving lemmatization algorithms for the Czech language. The paper also includes a newly created training data set, which can be freely used for student and academic work dealing with similar problems.
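
    As a rough illustration of how the two methods named in this abstract can be combined, the sketch below tries a brute-force dictionary lookup first and falls back to suffix stripping. The lookup table `LEMMA_TABLE`, the rule list `SUFFIX_RULES`, and the function `lemmatize` are hypothetical toy examples, not the data or rules from the paper.

```python
# Brute force: a direct word-form -> lemma table (hypothetical toy entries).
LEMMA_TABLE = {
    "ženy": "žena",
    "šel": "jít",
}

# Suffix stripping: (suffix, replacement) rules, longest suffix first
# (hypothetical rules for some Czech feminine nouns ending in -a).
SUFFIX_RULES = [
    ("ami", "a"),
    ("ách", "a"),
    ("ou", "a"),
    ("y", "a"),
]


def lemmatize(word):
    """Return a lemma for `word`: table lookup first, then suffix rules."""
    w = word.lower()
    if w in LEMMA_TABLE:  # brute-force step
        return LEMMA_TABLE[w]
    for suffix, replacement in SUFFIX_RULES:  # suffix-stripping step
        # Require at least two characters to remain before the suffix.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[:-len(suffix)] + replacement
    return w  # no rule applies: keep the word as its own lemma


for form in ["šel", "ženami", "školou", "pes"]:
    print(form, "->", lemmatize(form))
```

    The lookup handles irregular forms that no suffix rule can cover, while the rules give some coverage for words missing from the table; the trade-off is that naive rules overgenerate on words that merely happen to end in a listed suffix.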

    Borderlands of nations, nations of borderlands. Minorities in the borderlands and on the fringes of countries

    In the past two years, the European continent has become the destination of mass migration by various ethnic and religious groups who, for reasons of security or economic hardship, have decided to leave their homelands and go into dangerous exile, mostly by sea. In seeking to reach a world they perceive as an oasis of security and prosperity, and above all of tolerance for racial, ethnic, cultural and religious differences, the arrivals are deepening the already large diversity of the Old Continent's population, where various minorities have been living for a long time. Particularly interesting is the question of how national and religious minorities function in the borderlands between countries, as well as how such borderlands are formed by different nations. Therefore, the editors proposed that number 13 of Region and Regionalism address the issue of Borderlands of nations, nations of borderlands. The proposed subject matter met with a lively response from the authors, so much so that the number of submitted papers prompted the Editorial Board to divide them into two volumes. The first volume collects the works discussing Minorities in the borderlands and on the fringes of countries.