A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters
2000 Mathematics Subject Classification: 62H30
This paper describes a statistics-based methodology for unsupervised document clustering and for extracting topics from the resulting clusters.
For this purpose, multiword lexical units (MWUs) of any length are automatically extracted from corpora using the LiPXtractor extractor, a language-independent statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. Then, using the Model-Based Clustering Analysis software, the best number of clusters is obtained. Precision and Recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as its topics. Results on the classification of new documents are only briefly mentioned.
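The first stages of this pipeline (MWU features, then a document similarity matrix) can be illustrated with a minimal sketch. The documents, the MWU lists, and the plain-frequency weighting below are all invented for illustration; they are not the paper's LiPXtractor output nor its actual feature transformation.

```python
from math import sqrt

# Toy documents already reduced to their multiword-unit (MWU) features.
# In the paper these come from the LiPXtractor tool; here they are hand-picked.
docs = [
    ["stock market", "interest rate", "stock market"],
    ["interest rate", "central bank"],
    ["football match", "world cup", "football match"],
]

vocab = sorted({m for d in docs for m in d})

def mwu_vector(doc):
    """Raw MWU frequency vector over the shared vocabulary."""
    return [doc.count(m) for m in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vectors = [mwu_vector(d) for d in docs]
# Document similarity matrix: the input to the later feature-selection
# and clustering steps described in the abstract.
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
```

In the paper, feature selection (via PCA) and model-based clustering would then operate on this matrix; those steps are omitted here.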
Enhancements on Multiword Extraction and Inclusion of Relevant Single Words on LocalMaxs
The digital information available to us grows at an overwhelming pace. Following advances in Text Mining, this large amount of information can now be processed and understood more swiftly by people. For this purpose, extracting Relevant Expressions and Keywords from a text becomes an important task. This process consists in retrieving the most important ideas from a document or set of documents, which can be done using statistical and/or linguistic tools; the former are the focus of this work.
In order to extract these terms using statistical methodologies, one must take advantage of patterns that indicate the importance of a word or expression. Relevant Expressions tend to present some singularities: for example, the words within them usually show high cohesion values, conveying importance.
LocalMaxs is an algorithm that uses such a cohesion metric between words to capture meaningful Multiword Expressions from a text, with an average Precision close to 70%, but it is not able to extract 1-grams (single words). This dissertation aims at improving the performance of this algorithm, as well as adding the extraction of Relevant Single Words, which is especially important in languages where relevant compound nouns are written as long single words (e.g., German). These improvements must be made while keeping language independence.
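A minimal sketch of the LocalMaxs selection step, under simplifying assumptions: the glue (cohesion) values are hand-assigned toy numbers rather than computed from corpus statistics, and the local-maximum test shown (glue at least as high as the contained (n-1)-grams and strictly higher than the containing (n+1)-grams) is a simplified form of the published criterion.

```python
# Simplified LocalMaxs sketch. The real algorithm derives a cohesion
# ("glue") metric, such as SCP, from corpus frequencies; here glue
# values are invented so the selection step can be shown on its own.
glue = {
    ("New",): 0.1,
    ("York",): 0.1,
    ("New", "York"): 0.9,
    ("New", "York", "City"): 0.5,
    ("in", "New"): 0.2,
    ("in", "New", "York"): 0.3,
}

def subgrams(ngram):
    """The two (n-1)-grams contained in an n-gram."""
    return [ngram[:-1], ngram[1:]]

def supergrams(ngram, table):
    """All (n+1)-grams in the table that contain this n-gram."""
    return [g for g in table if len(g) == len(ngram) + 1
            and (g[:-1] == ngram or g[1:] == ngram)]

def is_local_max(ngram, table):
    """Simplified LocalMaxs test: glue is a local maximum relative to
    contained (n-1)-grams and containing (n+1)-grams."""
    if len(ngram) < 2:
        return False  # plain LocalMaxs does not select single words
    below = [table.get(s, 0.0) for s in subgrams(ngram)]
    above = [table[s] for s in supergrams(ngram, table)]
    return all(table[ngram] >= b for b in below) and \
           all(table[ngram] > a for a in above)

mwes = [g for g in glue if is_local_max(g, glue)]
```

With these toy values, only ("New", "York") is selected: its glue dominates both its single words and the longer n-grams that contain it, which is exactly the frequency-cohesion singularity the abstract describes.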
Towards improving WEBSOM with multi-word expressions
Dissertation for the Master's degree in Computer Engineering (Engenharia Informática)
Large quantities of free-text documents are usually rich in information and cover several topics. However, since their dimension is very large, searching and filtering data is an exhausting task. A large text collection covers a set of topics, where each topic is associated with a group of documents. This thesis presents a method for building a document map of the core contents covered in the collection.
WEBSOM is an approach that combines document encoding methods and Self-Organising Maps (SOM) to generate a document map. However, this methodology has a weakness in its document encoding method, because it uses single words to characterise documents.
Single words tend to be ambiguous and semantically vague, so some documents can be incorrectly related. This thesis proposes a new document encoding method that improves the WEBSOM approach by using multi-word expressions (MWEs) to describe documents. Previous research and ongoing experiments encourage us to use MWEs to characterise documents, because they are semantically more precise and more descriptive than single words.
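The map-building step of WEBSOM can be illustrated with a toy self-organising map. The sketch below trains a 1-D SOM on two-dimensional points instead of WEBSOM's 2-D map over high-dimensional document vectors; the unit count, learning schedule, and step-function neighbourhood are simplifications chosen for brevity, not the WEBSOM configuration.

```python
import random

def train_som(data, n_units=4, epochs=200, lr0=0.5, radius0=1.0, seed=0):
    """Minimal 1-D self-organising map (toy stand-in for WEBSOM's 2-D map)."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # decaying learning rate
        radius = max(radius0 * (1 - t / epochs), 0.01)  # shrinking neighbourhood
        x = rng.choice(data)
        # Best-matching unit: nearest prototype by squared distance.
        bmu = min(range(n_units),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], x)))
        for i in range(n_units):
            h = 1.0 if abs(i - bmu) <= radius else 0.0  # step neighbourhood
            units[i] = [u + lr * h * (a - u) for u, a in zip(units[i], x)]
    return units

def bmu_index(units, x):
    """Map a new vector to its position on the trained map."""
    return min(range(len(units)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], x)))

data = [[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]]
units = train_som(data)
```

In the thesis, the input vectors would be document encodings (single words in classic WEBSOM, MWEs in the proposed method); the map itself is indifferent to the encoding.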
A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus
This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e., it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an n-gram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested n-grams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version achieves top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n = 1000.
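One plausible reading of the observed-versus-expected comparison is a ratio test against an independence baseline. The sketch below scores single bigrams this way; the corpus counts are invented, and the paper's actual method operates on whole matrices around a node word rather than on isolated pairs.

```python
# Toy corpus statistics. The paper compares matrices of observed n-gram
# frequencies against matrices expected under independence; this sketch
# shows the core idea for single bigrams (names and numbers are invented).
N = 10_000                      # tokens in the toy corpus
unigram = {"strong": 50, "tea": 40, "powerful": 60}
bigram = {("strong", "tea"): 25, ("powerful", "tea"): 2}

def expected(pair):
    """Expected bigram count if the two words co-occurred by chance."""
    w1, w2 = pair
    return unigram[w1] * unigram[w2] / N

def anomaly(pair):
    """Observed-over-expected ratio: large values flag MWE candidates."""
    return bigram[pair] / expected(pair)

scores = {p: anomaly(p) for p in bigram}
```

Here "strong tea" scores far above "powerful tea", matching the intuition that frequency anomalies mark the boundaries of a genuine MWE.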
Automatic extraction of concepts from texts and applications
The extraction of relevant terms from texts is an extensively researched task in Text Mining. Relevant terms have been applied in areas such as Information Retrieval or document clustering and classification. However, relevance has a rather fuzzy nature, since the classification of some terms as relevant or not is not consensual. For instance, while words such as "president" and "republic" are generally considered relevant by human evaluators, and words like "the" and "or" are not, terms such as "read" and "finish" gather no consensus about their semantics and informativeness.
Concepts, on the other hand, have a less fuzzy nature. Therefore, instead of deciding on the relevance of a term during the extraction phase, as most extractors do, I propose to first extract from texts what I have called generic concepts (all concepts) and postpone the decision about relevance to downstream applications, according to their needs. For instance, a keyword extractor may assume that the most relevant keywords are the most frequent concepts in the documents. Moreover, most statistical extractors are incapable of extracting single-word and multi-word expressions with the same methodology. These factors led to the development of the ConceptExtractor, a statistical and language-independent methodology which is explained in Part I of this thesis.
In Part II, I will show that the automatic extraction of concepts has great applicability. For instance, for the extraction of keywords from documents, using the Tf-Idf metric only on concepts yields better results than using Tf-Idf without concepts, especially for multi-words. In addition, since concepts can be semantically related to other concepts, this allows us to build implicit document descriptors. These applications led to published work. Finally, I will present some work that, although not yet published, is briefly discussed in this document.
Fundação para a Ciência e a Tecnologia - SFRH/BD/61543/200
Un environnement générique et ouvert pour le traitement des expressions polylexicales
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity, and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language-independent, integrated, and comes with a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled along four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., the nature and size of the corpora, the language and type of MWE, the analysis depth, and the existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology in two applications: computer-assisted lexicography and statistical machine translation.
For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated into Portuguese by a standard statistical MT system. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up the work and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of past, ongoing and future work.
Attribute Selection for Unsupervised and Language Independent Classification of Documents
Raw text documents, that is, unstructured text, are the most common form in which documents are written, and so they contain most of the available information. It is therefore desirable to have tools capable of extracting the core content of each document and, through it, identifying the group to which it belongs, since unstructured texts usually have no designated place for indicating the document class. Nowadays, English is not the only language in which documents appear in the available repositories. This suggests the construction of tools that, if possible, do not depend on the language in which the texts are written, which is a challenge.
This dissertation focuses mainly on clustering documents according to their content, using no class labels, that is, unsupervised clustering. It aims to mine and create features from text in order to achieve that purpose. It also intends to classify new documents, in a supervised approach, according to the classes identified in the unsupervised training phase.
To solve this, the proposed solution finds the best features inside the documents and uses their discriminative power to drive the clustering. In order to summarise the core content of each cluster found by this approach, key expressions are automatically extracted from its documents.
Topic Segmentation: How Much Can We Do by Counting Words and Sequences of Words
In this paper, we present an innovative topic segmentation system based on a new informative similarity measure that takes word co-occurrence into account in order to avoid relying on existing linguistic resources, such as electronic dictionaries or lexico-semantic databases like thesauri or ontologies. Topic segmentation is the task of breaking documents into topically coherent multi-paragraph subparts, and it has been used extensively in information retrieval and text summarization. In particular, our architecture proposes a language-independent topic segmentation system that solves three main problems evidenced by previous research: systems based solely on lexical repetition, which show reliability problems; systems based on lexical cohesion using existing linguistic resources, which are usually available only for dominant languages and consequently do not apply to less favored ones; and systems that need previously harvested training data. For that purpose, we use only statistics on words and sequences of words computed over a set of texts. This provides a flexible solution that may narrow the gap between dominant languages and less favored ones, thus allowing equivalent access to information.
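The co-occurrence-based similarity idea can be sketched in a TextTiling-like form: compare adjacent blocks of text and cut where similarity dips. Everything below (the sentences, the one-sentence blocks, and plain cosine over word counts) is a toy simplification, not the paper's informative similarity measure.

```python
from math import sqrt
from collections import Counter

# Toy document: two sentences about one topic, then two about another.
sentences = [
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "stocks fell on monday",
    "the market closed lower on monday",
]

def bag(sentence):
    """Bag-of-words representation of one block (here, one sentence)."""
    return Counter(sentence.split())

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Similarity across each gap between consecutive blocks; a topic
# boundary is placed where the similarity is lowest.
gaps = [cosine(bag(sentences[i]), bag(sentences[i + 1]))
        for i in range(len(sentences) - 1)]
boundary = gaps.index(min(gaps)) + 1   # cut after this sentence index
```

On this toy input, the lowest similarity falls between the animal sentences and the finance sentences, so the segment boundary lands after sentence 2, which is the behaviour a word-statistics segmenter relies on.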