Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
In Automatic Text Summarization, preprocessing is an important phase to
reduce the space of textual representation. Classically, stemming and
lemmatization have been widely used for normalizing words. However, even using
normalization on large texts, the curse of dimensionality can disturb the
performance of summarizers. This paper describes a new method for normalization
of words to further reduce the space of representation. We propose to reduce
each word to its initial letters, as a form of Ultra-stemming. The results show
that Ultra-stemming not only preserves the content of summaries produced with
this representation, but often dramatically improves the performance of the
systems. Summaries on trilingual corpora were evaluated automatically with
Fresa. Results confirm an increase in performance, regardless of the summarizer
system used. Comment: 22 pages, 12 figures, 9 tables
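To make the idea of reducing each word to its initial letters concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the regular-expression tokenizer, and the prefix length `n` are assumptions chosen for illustration.

```python
import re


def ultra_stem(word: str, n: int = 1) -> str:
    """Keep only the first n letters of a lower-cased word (assumed scheme)."""
    return word.lower()[:n]


def normalize(text: str, n: int = 1) -> list[str]:
    """Tokenize on runs of letters and ultra-stem every token."""
    # Hypothetical tokenizer: any run of Unicode letters counts as a word.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [ultra_stem(token, n) for token in tokens]


sentence = "Summarizers summarize summaries of long documents"

# With n=1 the three "summar..." forms collapse to the single symbol "s".
one_letter = normalize(sentence, n=1)
four_letters = normalize(sentence, n=4)

# The vocabulary size is the dimension of the resulting representation.
vocab_one = sorted(set(one_letter))
vocab_four = sorted(set(four_letters))
```

A shorter prefix gives a smaller vocabulary and therefore a lower-dimensional representation, at the cost of merging unrelated words that share their first letters.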
Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish
Semantic neologisms (SN) are defined as words that acquire a new word meaning
while maintaining their form. Given the nature of this kind of neologism, the
task of identifying these new word meanings is currently performed manually by
specialists at observatories of neology. To detect SN in a semi-automatic way,
we developed a system that implements a combination of the following
strategies: topic modeling, keyword extraction, and word sense disambiguation.
The role of topic modeling is to detect the themes that are treated in the
input text. Themes within a text give clues about the particular meaning of the
words that are used, for example: viral has one meaning in the context of
computer science (CS) and another when talking about health. To extract
keywords, we used TextRank with POS tag filtering. With this method, we can
obtain relevant words that are already part of the Spanish lexicon. We use a
deep learning model to determine if a given keyword could have a new meaning.
Embeddings that are different from all the known meanings (or topics) indicate
that a word might be a valid SN candidate. In this study, we examine the
following word embedding models: Word2Vec, Sense2Vec, and FastText. The models
were trained with equivalent parameters using the Spanish Wikipedia as corpus.
Then we used a list of words and their concordances (obtained from our database
of neologisms) to show the different embeddings that each model yields.
Finally, we present a comparison of these outcomes with the concordances of
each word to show how we can determine if a word could be a valid candidate for
SN. Comment: 16 pages, 3 figures
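The candidate test described above, an embedding that differs from all known meanings, can be sketched as follows. The abstract does not state how the difference is measured, so cosine similarity, the threshold value, and the toy three-dimensional vectors are all assumptions. In practice the vectors would come from a trained Word2Vec, Sense2Vec, or FastText model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_sn_candidate(context_vec, known_sense_vecs, threshold=0.5):
    """Flag a usage whose embedding is dissimilar to every known sense.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return all(cosine(context_vec, sense) < threshold for sense in known_sense_vecs)


# Toy vectors standing in for the known senses of "viral" (health, CS).
known_senses = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

familiar_usage = [0.9, 0.1, 0.0]  # close to the health sense
novel_usage = [0.1, 0.1, 1.0]     # far from both known senses

familiar_flag = is_sn_candidate(familiar_usage, known_senses)
novel_flag = is_sn_candidate(novel_usage, known_senses)
```

Only the usage that is far from every known sense is flagged. A flagged word would still need review by a specialist, in keeping with the semi-automatic setting.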