Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
In Automatic Text Summarization, preprocessing is an important phase to
reduce the space of textual representation. Classically, stemming and
lemmatization have been widely used for normalizing words. However, even using
normalization on large texts, the curse of dimensionality can disturb the
performance of summarizers. This paper describes a new method for normalization
of words to further reduce the space of representation. We propose to reduce
each word to its initial letters, as a form of Ultra-stemming. The results show
that Ultra-stemming not only preserves the content of summaries produced by this
representation, but often the performances of the systems can be dramatically
improved. Summaries on trilingual corpora were evaluated automatically with
Fresa. Results confirm an increase in the performance, regardless of summarizer
system used. Comment: 22 pages, 12 figures, 9 tables
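The idea of reducing each word to its initial letters can be illustrated with a minimal sketch. The function names, the prefix length parameter `n`, and the toy word list are assumptions made for illustration; the abstract does not specify the tokenization or the exact number of letters kept.

```python
from typing import List


def ultra_stem(word: str, n: int = 1) -> str:
    """Truncate a word to its first n letters (lowercased).

    Hypothetical helper: the prefix length n is an assumption.
    """
    return word.lower()[:n]


def ultra_stem_text(text: str, n: int = 1) -> List[str]:
    """Apply ultra_stem to every alphabetic token of a text."""
    # Naive whitespace tokenization with surrounding punctuation stripped.
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    return [ultra_stem(t, n) for t in tokens if t.isalpha()]


# Toy vocabulary: truncation collapses related word forms into one key,
# which shrinks the representation space.
words = ["summarization", "summaries", "summarizer", "system"]
vocab_full = set(words)
vocab_stem = {ultra_stem(w, 4) for w in words}
print(len(vocab_full), len(vocab_stem))
```

In this toy case the four distinct word forms collapse to two keys, which is the kind of dimensionality reduction the abstract describes.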
DEAR: A New Technique for Information Extraction and Context-Dependent Text Mining
The desire to store and the need to use electronic data have greatly increased as the power, availability, and connectivity of computers have grown. A large portion of this data is in the form of unstructured text documents. Locating specific information within this amorphous mass of documents is an area of active research. Our contribution to this pursuit is the development of the Document Entity and Resolution (DEAR) system. This system combines semantic similarity matching, as provided by the open source WordNet database, with the ability to recognize named entities through the OpenCalais system. Used in concert, these give users a novel way to quickly find relevant content and to detect and identify uniquely named entities within that content. The theory behind the system is defined and the working system is described. The system is then applied to a collection of assessment documents as a proof-of-concept test of its viability. The results are promising and indicate that further research is warranted.
A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages
The original publication is available at www.springerlink.com
The effects of separate and merged indexes and word normalization in multilingual CLIR
Multilingual IR may be performed in two environments: there may exist a separate index for each target language, or all the languages may be indexed in a merged index. In the first case, retrieval must be performed separately in each index, after which the result lists have to be merged. In the case of the merged index, there are two alternatives: either to perform retrieval with a merged query (all the languages in the same query), or to perform distinct retrievals in each language and merge the result lists. Further, there are several indexing approaches concerning word normalization. The present paper examines the impact of stemming compared with inflected retrieval in multilingual IR with separate indexes and with a merged index. Four different result list merging approaches are compared with each other. The best result was achieved when retrieval was performed in separate indexes and the result lists were merged. Stemming seems to improve the results compared with inflected retrieval.
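Merging per-language result lists can be sketched as follows. The abstract does not name the four merging approaches it compares, so the two strategies here (round-robin interleaving and raw-score merging) are commonly described options chosen as an assumption, and the document identifiers and scores are invented for illustration.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked lists: take rank 1 from each list, then rank 2, etc."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for hit in tier:
            # zip_longest pads shorter lists with None; skip the padding
            # and any document already emitted.
            if hit is not None and hit[0] not in seen:
                seen.add(hit[0])
                merged.append(hit)
    return merged


def raw_score_merge(result_lists):
    """Pool all hits and sort by retrieval score, highest first.

    Assumes scores from different indexes are comparable, which may not hold.
    """
    all_hits = [hit for lst in result_lists for hit in lst]
    return sorted(all_hits, key=lambda h: h[1], reverse=True)


# Hypothetical (doc_id, score) result lists from two language indexes.
english = [("en1", 0.9), ("en2", 0.4)]
finnish = [("fi1", 0.7), ("fi2", 0.6), ("fi3", 0.1)]

print([d for d, _ in round_robin_merge([english, finnish])])
print([d for d, _ in raw_score_merge([english, finnish])])
```

Round-robin ignores scores entirely, while raw-score merging depends on score comparability across indexes; that trade-off is one reason merging strategies are worth comparing empirically.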
An evaluation study of query processing methods using the inflected-form index of the Finnish Web Archive (original title: Kyselynkäsittelymenetelmien evaluointitutkimus Suomalaisen verkkoarkiston taivutusmuotoindeksiä käyttäen)
The rich morphology of Finnish poses challenges for information retrieval. For retrieval to be effective, the word form in the query must be made to match the word form that occurs in the document. This study compares the effectiveness of four query processing methods on an inflected-form index built from documents.
Earlier information retrieval evaluation research on Finnish-language material has mainly used test collections built from newspaper article collections. Instead of an article collection, this study uses a test collection built from the Finnish Web Archive (Suomalainen verkkoarkisto), which contains web pages whose content and quality vary greatly. The methods compared in the thesis are Frequent case generation 3 (FCG3), Simple word ending based rule generator (SWERG+), Snowball stemming combined with a wildcard, and unprocessed queries.
The research method of this study is testing according to the laboratory model of information retrieval. To carry it out, a test collection had to be built from the Finnish Web Archive. In the end, 16 search topics were selected for the test collection, and query runs were performed with short queries formed from them. The results of the runs were measured by precision at the first ten result documents and by a cumulated gain measure.
The study found that the FCG3 method produced better results than the unprocessed queries that served as the baseline. By contrast, the SWERG+ method, which had performed well in earlier research, did not produce results better than the baseline in this study. Snowball stemming combined with a wildcard, in turn, produced results weaker than the baseline.
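The two evaluation measures mentioned in this abstract, precision at the first ten result documents and cumulated gain, can be sketched as below. This is a minimal illustration under assumptions: the graded relevance values are invented, and the abstract does not say whether a discounted or normalized variant of cumulated gain was used, so only the plain running sum is shown.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that are relevant (grade > 0)."""
    top = relevance[:k]
    return sum(1 for r in top if r > 0) / k


def cumulated_gain(gains):
    """Running sum of graded relevance values down the ranked list."""
    total, out = 0, []
    for g in gains:
        total += g
        out.append(total)
    return out


# Hypothetical graded relevance (0 = not relevant, 3 = highly relevant)
# for the ten first documents returned by one query.
gains = [3, 0, 2, 1, 0, 0, 1, 0, 0, 2]

print(precision_at_k(gains, k=10))
print(cumulated_gain(gains))
```

Precision at ten treats every relevant document alike, whereas cumulated gain rewards highly relevant documents more, so the two measures can rank methods differently.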