
    A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters

    2000 Mathematics Subject Classification: 62H30.
    This paper describes a statistics-based methodology for unsupervised document clustering and for extracting topics from the resulting clusters. Multiword lexical units (MWUs) of any length are automatically extracted from corpora using the LiPXtractor extractor, a language-independent statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed, and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. Then, using the Model Based Clustering Analysis software, the best number of clusters can be obtained. Precision and Recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as that cluster's topics. Results on the classification of new documents are only briefly mentioned.
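A minimal sketch of such a pipeline, using scikit-learn stand-ins for the tools named in the abstract (word n-gram counts approximate the LiPXtractor MWU features, and a BIC-selected Gaussian mixture plays the role of the Model Based Clustering Analysis software; the function name and parameters are illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.mixture import GaussianMixture

def cluster_documents(docs, max_components=10):
    # Word 2- to 4-grams stand in for the multiword-unit base features.
    vectorizer = CountVectorizer(ngram_range=(2, 4), min_df=2)
    X = vectorizer.fit_transform(docs).toarray()

    # Reduce the feature space, as the paper does with a PCA-based selection.
    X_red = PCA(n_components=min(10, min(X.shape) - 1)).fit_transform(X)

    # Choose the number of clusters by the lowest (best) BIC score.
    models = [GaussianMixture(n_components=k, random_state=0).fit(X_red)
              for k in range(1, max_components + 1)]
    best = min(models, key=lambda m: m.bic(X_red))
    return best.predict(X_red), vectorizer
```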

    Enhancements on Multiword Extraction and Inclusion of Relevant Single Words on LocalMaxs

    The amount of digital information available to us grows at an overwhelming rate. With advances in Text Mining, this large amount of information can now be processed and understood more swiftly by people. To this end, extracting Relevant Expressions and Keywords from a text becomes an important task. The process consists in retrieving the most important ideas from a document or set of documents, which can be done using statistical and/or linguistic tools, the former being the focus of this work. To extract these terminologies using statistical methodologies, one must exploit patterns that indicate the importance of a word or expression. Relevant Expressions tend to present certain singularities: for example, the words within them tend to show high cohesion values, conveying importance. LocalMaxs is an algorithm that uses such a cohesion metric between words to capture meaningful multiword expressions from a text, with an average Precision close to 70%, but it is not able to extract relevant 1-grams (single words). This dissertation aims at improving the precision of the algorithm, as well as adding the extraction of Relevant Single Words, an important factor especially in languages where relevant compound nouns are written as single long words (e.g., German). These improvements must preserve language independence.
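The dissertation's enhancements are not detailed in the abstract, but the core LocalMaxs idea can be sketched. Below is a compressed, illustrative Python version using the Symmetric Conditional Probability (SCP) cohesion measure: an n-gram is kept when its glue is a local maximum relative to the (n-1)-grams it contains and the (n+1)-grams that contain it. Details such as the fair-dispersion normalisation and the exact tie rules of the published algorithm are simplified here.

```python
from collections import Counter

def scp(ngram, prob):
    # Symmetric Conditional Probability glue: p(w1..wn)^2 divided by the
    # average product of probabilities over every split point.
    n = len(ngram)
    if n == 1:
        return prob[ngram]
    splits = [prob[ngram[:i]] * prob[ngram[i:]] for i in range(1, n)]
    return prob[ngram] ** 2 / (sum(splits) / len(splits))

def local_maxs(tokens, max_n=6):
    # Count every n-gram up to max_n + 1 (the extra length is needed to test
    # the (n+1)-gram condition) and convert counts to probabilities.
    counts = Counter(tuple(tokens[i:i + n])
                     for n in range(1, max_n + 2)
                     for i in range(len(tokens) - n + 1))
    total = len(tokens)
    prob = {ng: c / total for ng, c in counts.items()}

    relevant = []
    for ng in counts:
        n = len(ng)
        if n < 2 or n > max_n:
            continue
        g = scp(ng, prob)
        # Glue of the two (n-1)-grams contained in ng (skipped for bigrams).
        sub = [scp(ng[:-1], prob), scp(ng[1:], prob)] if n > 2 else []
        # Glue of every (n+1)-gram containing ng (quadratic scan, kept simple).
        sup = [scp(s, prob) for s in counts
               if len(s) == n + 1 and (s[:-1] == ng or s[1:] == ng)]
        if all(g >= x for x in sub) and all(g > y for y in sup):
            relevant.append(ng)
    return relevant
```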

    Towards improving WEBSOM with multi-word expressions

    Dissertation submitted for the degree of Master in Informatics Engineering.
    Large collections of free-text documents are usually rich in information and cover several topics. However, since their dimension is very large, searching and filtering the data is an exhaustive task. A large text collection covers a set of topics, and each topic is associated with a group of documents. This thesis presents a method for building a document map of the core contents covered by the collection. WEBSOM is an approach that combines document encoding methods and Self-Organising Maps (SOM) to generate a document map. However, this methodology has a weakness in its document encoding method, because it uses single words to characterise documents. Single words tend to be ambiguous and semantically vague, so some documents can be incorrectly related. This thesis proposes a new document encoding method that improves the WEBSOM approach by using multiword expressions (MWEs) to describe documents. Previous research and ongoing experiments encourage us to use MWEs to characterise documents, because they are semantically more accurate and more descriptive than single words.
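The thesis's contribution is the MWE-based encoding; the map itself is a standard Self-Organising Map. A minimal numpy sketch of the SOM half follows, assuming each document has already been encoded as a vector over MWE features (for example, tf-idf weights of extracted expressions); the function and parameter names are illustrative:

```python
import numpy as np

def train_som(doc_vectors, grid=(8, 8), iters=2000, lr0=0.5, sigma0=3.0):
    # A tiny Self-Organising Map: each grid cell holds a weight vector in
    # document-feature space; documents mapped to nearby cells share topics.
    rng = np.random.default_rng(0)
    h, w = grid
    weights = rng.random((h, w, doc_vectors.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))

    for t in range(iters):
        x = doc_vectors[rng.integers(len(doc_vectors))]
        # Best-matching unit: the cell whose weight vector is closest to x.
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Decay the learning rate and neighbourhood radius over time.
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        influence = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        weights += lr * influence * (x - weights)
    return weights
```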

    A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus

    This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e., it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an n-gram that constitutes a MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies and determining, for each individual input, one or more sub-sequences that have the highest probability of being a MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, producing a single list of score-ranked MWE candidates without having to indiscriminately generate all possible sub-sequences of the input strings. This knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested n-grams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n = 1000.
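The paper's exact observed/expected matrices are not reproduced in the abstract; the sketch below only illustrates the underlying intuition: score each sub-sequence around a user-specified node word by how far its observed frequency departs from the frequency expected if its parts co-occurred independently. All names and parameters here are illustrative, not the paper's.

```python
from collections import Counter

def mwe_candidates(tokens, node, window=3):
    # Collect every sub-sequence containing the node word within the window.
    unigram = Counter(tokens)
    total = len(tokens)
    spans = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for a in range(lo, i + 1):
            for b in range(i + 1, hi + 1):
                spans[tuple(tokens[a:b])] += 1

    # Score each span by observed frequency over independence-based expectation.
    scored = []
    for seq, obs in spans.items():
        if len(seq) < 2:
            continue
        expected = total
        for w in seq:
            expected *= unigram[w] / total
        scored.append((obs / max(expected, 1e-9), seq))
    return sorted(scored, reverse=True)
```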

    Automatic extraction of concepts from texts and applications

    The extraction of relevant terms from texts is an extensively researched task in Text Mining. Relevant terms have been applied in areas such as Information Retrieval and document clustering and classification. However, relevance has a rather fuzzy nature, since the classification of some terms as relevant or not relevant is not consensual. For instance, while words such as "president" and "republic" are generally considered relevant by human evaluators, and words like "the" and "or" are not, terms such as "read" and "finish" gather no consensus about their semantics and informativeness. Concepts, on the other hand, have a less fuzzy nature. Therefore, instead of deciding on the relevance of a term during the extraction phase, as most extractors do, I propose to first extract from texts what I call generic concepts (all concepts) and postpone the decision about relevance to downstream applications, according to their needs. For instance, a keyword extractor may assume that the most relevant keywords are the most frequent concepts in the documents. Moreover, most statistical extractors are incapable of extracting single-word and multi-word expressions with the same methodology. These factors led to the development of the ConceptExtractor, a statistical and language-independent methodology which is explained in Part I of this thesis. In Part II, I show that the automatic extraction of concepts has great applicability. For instance, for the extraction of keywords from documents, applying the Tf-Idf metric only to concepts yields better results than using Tf-Idf without concepts, especially for multi-words. In addition, since concepts can be semantically related to other concepts, they allow us to build implicit document descriptors. These applications led to published work. Finally, I present some work that, although not yet published, is briefly discussed in this document.
    Fundação para a Ciência e a Tecnologia - SFRH/BD/61543/200
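ConceptExtractor itself is not reproduced here; the sketch below illustrates only the downstream keyword application, assuming concepts (single- or multi-word) have already been extracted per document, and computing Tf-Idf over those concepts alone:

```python
import math
from collections import Counter

def keywords_from_concepts(docs_concepts, k=10):
    # docs_concepts: for each document, the list of concepts extracted from it.
    # Document frequency of each concept across the collection.
    df = Counter()
    for concepts in docs_concepts:
        df.update(set(concepts))
    n_docs = len(docs_concepts)

    keywords = []
    for concepts in docs_concepts:
        tf = Counter(concepts)
        total = sum(tf.values()) or 1
        # Tf-Idf restricted to concepts; the top-k scored become the keywords.
        scores = {c: (tf[c] / total) * math.log(n_docs / df[c]) for c in tf}
        keywords.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return keywords
```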

    A Generic and Open Environment for the Treatment of Multiword Expressions

    The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity, and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language-independent, integrated, and comes with a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task of MWE acquisition. The evaluation of MWE acquisition is modelled along four independent axes. We underline that evaluation results depend on parameters of the acquisition context, e.g., the nature and size of the corpora, the language and type of MWE, the depth of analysis, and the existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies for improving the treatment given to English phrasal verbs when translated into Portuguese by a standard statistical MT system. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up the process and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way of integrating MWE treatment into other applications. We therefore conclude the thesis with an overview of past, ongoing and future work.
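The mwetoolkit's real interface is not shown here; as a stand-in, the sketch below chains two module-like steps of the kind such a framework composes: bigram candidate generation with a frequency filter, followed by ranking with an association measure (pointwise mutual information):

```python
import math
from collections import Counter

def pmi_candidates(tokens, min_count=3):
    # Module 1: generate bigram candidates with raw counts.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # Module 2: frequency filter drops rare candidates.
        # Module 3: rank by pointwise mutual information.
        pmi = math.log((c / total) /
                       ((unigrams[w1] / total) * (unigrams[w2] / total)))
        scored.append((pmi, (w1, w2)))
    return sorted(scored, reverse=True)
```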

    Attribute Selection for Unsupervised and Language Independent Classification of Documents

    Raw text, that is, unstructured text, is the most common form in which documents are written, so such documents contain most of the information available. It is therefore desirable to have tools capable of extracting the core content of each document and, through it, identifying the group to which the document belongs, since unstructured texts usually have no designated place for indicating the document class. Nowadays, English is not the only language in which documents appear in the available repositories. This suggests building tools that, as far as possible, do not depend on the language in which the texts are written, which is a challenge. This dissertation focuses mainly on clustering documents according to their content, using no class labels, that is, unsupervised clustering. It aims to mine and create features from text in order to achieve that purpose. It also intends to classify new documents, in a supervised approach, according to the classes identified in the unsupervised training phase. The proposed solution finds the best features inside the documents and uses their discriminative power to drive the clustering. To summarise the core content of each cluster found by this approach, key expressions are automatically extracted from its documents.
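A minimal sketch of this kind of pipeline, using scikit-learn as a stand-in for the dissertation's own feature mining (the function and parameter names are illustrative): tf-idf over single words and 2-3 word expressions, k-means clustering, and per-cluster key expressions taken from the highest-scoring terms of each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_describe(docs, n_clusters=5, top_k=8):
    # Features: tf-idf weights of word 1- to 3-grams.
    vec = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)
    X = vec.fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)

    # Summarise each cluster by its highest-scoring expressions.
    terms = np.array(vec.get_feature_names_out())
    summaries = {}
    for c in range(n_clusters):
        centroid = X[labels == c].mean(axis=0).A1  # mean tf-idf per term
        summaries[c] = terms[centroid.argsort()[::-1][:top_k]].tolist()
    return labels, summaries
```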

    Topic Segmentation: How Much Can We Do by Counting Words and Sequences of Words

    In this paper, we present an innovative topic segmentation system based on a new informative similarity measure that takes word co-occurrence into account in order to avoid relying on existing linguistic resources, such as electronic dictionaries or lexico-semantic databases like thesauri and ontologies. Topic segmentation is the task of breaking documents into topically coherent multi-paragraph subparts, and it has been used extensively in information retrieval and text summarization. In particular, our architecture proposes a language-independent topic segmentation system that addresses three main problems evidenced by previous research: systems based solely on lexical repetition, which show reliability problems; systems based on lexical cohesion that use existing linguistic resources, which are usually available only for dominant languages and consequently do not apply to less-favoured languages; and systems that require previously harvested training data. For that purpose, we use only statistics on words and sequences of words computed from a set of texts. This provides a flexible solution that may narrow the gap between dominant languages and less-favoured languages, thus allowing equivalent access to information.
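As a rough illustration of segmentation by counting alone, here is a TextTiling-style sketch: adjacent blocks of sentences are compared with cosine similarity over raw word counts, and boundaries are placed at pronounced similarity valleys. The paper's co-occurrence-based similarity measure is replaced here by plain counts, so this is only the skeleton of the approach; all names and thresholds are illustrative.

```python
import numpy as np
from collections import Counter

def segment(sentences, block=3, depth_cutoff=0.1):
    # One bag of words per sentence.
    bags = [Counter(s.lower().split()) for s in sentences]

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = (np.sqrt(sum(v * v for v in a.values()))
               * np.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    # Similarity between the blocks on either side of each sentence gap.
    sims = []
    for gap in range(block, len(bags) - block + 1):
        left = sum((bags[i] for i in range(gap - block, gap)), Counter())
        right = sum((bags[i] for i in range(gap, gap + block)), Counter())
        sims.append(cosine(left, right))

    # A gap is a boundary when its similarity sits well below both neighbours.
    return [i + block for i in range(1, len(sims) - 1)
            if sims[i - 1] - sims[i] > depth_cutoff
            and sims[i + 1] - sims[i] > depth_cutoff]
```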