15 research outputs found

    Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

    In Automatic Text Summarization, preprocessing is an important phase for reducing the space of textual representation. Classically, stemming and lemmatization have been widely used to normalize words. However, even with normalization, the curse of dimensionality can hurt summarizer performance on large texts. This paper describes a new word-normalization method that further reduces the representation space: we propose reducing each word to its initial letters, a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of the summaries produced with this representation, but often dramatically improves system performance. Summaries over trilingual corpora were evaluated automatically with Fresa; the results confirm an increase in performance regardless of the summarizer used. Comment: 22 pages, 12 figures, 9 tables
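    The core idea, reducing each word to its initial letters, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the parameter `n` (how many initial letters to keep) is an assumption for demonstration.

```python
def ultra_stem(word, n=1):
    """Reduce a word to its first n letters (the Ultra-stemming idea).

    n is an illustrative parameter; the paper's normalization keeps
    only the initial letters of each word.
    """
    return word[:n].lower()

def normalize(text, n=1):
    """Map every token in a text to its ultra-stem, shrinking the vocabulary."""
    return [ultra_stem(tok, n) for tok in text.split()]

print(normalize("Summarization systems summarize summaries"))
# ['s', 's', 's', 's']
```

    With n=1 the vocabulary collapses to at most the alphabet size, which is what makes the representation space so much smaller than with stemming or lemmatization.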

    Data Mining in Electronic Commerce

    Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges. Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Development of a stemmer for the isiXhosa language

    IsiXhosa is one of the eleven official languages of South Africa and the second most widely spoken language in the country. However, in terms of computational linguistics the language has received little attention, and natural-language-related work is almost non-existent. Document retrieval using unstructured queries requires some kind of language processing, and efficient retrieval of documents can be achieved using a technique called stemming. The area concerned with document storage and retrieval is called Information Retrieval (IR). IR systems use a stemmer both to index document representations and to normalize the terms in users' queries when retrieving matching documents. In this dissertation we present a stemmer that can be used in both roles. Such a stemmer can be used in IR systems, such as Google, to retrieve documents written in isiXhosa. In the Eastern Cape Province of South Africa many public schools take isiXhosa as a subject, and a number of South African universities teach it. For a language of such importance, it is important to make the valuable information available online accessible to users through IR systems. In developing a stemmer for isiXhosa, we first investigated how stemmers have been developed for other languages. This investigation showed that the Porter stemming algorithm in particular is the main reference on which many other stemmers are based. We found that Porter's algorithm could not be used in its entirety for the isiXhosa stemmer because of the morphological complexity of the language. We therefore developed an affix-removal algorithm with rules that determine the order in which affixes are stripped.
    The rule is that the word under consideration is first checked against an exception list; if it is not in the list, stripping continues in the following order: prefix removal, then suffix removal, and finally the result is saved as the stem. The stemmer was developed, tested, and evaluated on a sample of data randomly collected from isiXhosa textbooks and an isiXhosa dictionary. From the results obtained we concluded that the stemmer can be used in IR systems, as it showed 91 percent accuracy. The error rate was 9 percent, which is within the accepted range, so the stemmer can help in the retrieval of isiXhosa documents. This is only a noun stemmer; in the future it can be extended to stem verbs as well. The stemmer can also be used in the development of isiXhosa spell-checkers.
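    The stripping order described above (exception check, then prefix removal, then suffix removal) can be sketched as follows. The affix lists and the example exception word here are illustrative placeholders, not the dissertation's actual isiXhosa morphological rules.

```python
# Hypothetical sketch of the described stripping order. EXCEPTIONS,
# PREFIXES and SUFFIXES are placeholder lists for illustration only.
EXCEPTIONS = {"umntu"}          # words returned unchanged
PREFIXES = ["aba", "um", "i"]
SUFFIXES = ["ana", "wa", "a"]

def stem(word):
    # Step 1: words on the exception list are never stripped.
    if word in EXCEPTIONS:
        return word
    # Step 2: remove one prefix, longest match first.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p) + 1:
            word = word[len(p):]
            break
    # Step 3: remove one suffix, longest match first; the result is the stem.
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 1:
            word = word[:-len(s)]
            break
    return word
```

    The length guards keep the algorithm from stripping a word down to nothing, a common safeguard in rule-based stemmers.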


    Identifying interactions between chemical entities in text

    Master's thesis in Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2014. Novel interactions between chemical compounds are often described in scientific articles, which are being published at an unprecedented rate. However, these articles are directed at humans, written in natural language, and cannot be easily processed by a machine. Text mining methods offer a solution to this problem by automatically extracting the relevant information from the literature.
    These methods should be adapted to the specific domain and task they are applied to. This dissertation proposes a system for the automatic and efficient identification of interactions between chemical entities in biomedical documents. The system was developed as two modules. The first module recognizes the chemical entities mentioned in a given text; it was based on an existing framework, which was improved with a novel type of semantic similarity measure. The second module identifies the pairs of entities that represent a chemical interaction in the same text, using machine learning techniques and domain knowledge. Each module was evaluated separately, achieving high precision against two different gold standards. Together, the two modules constitute the IICE system, which can be used to analyze any biomedical document for chemical entities and interactions, and is accessible via a web tool.

    A modular architecture for systematic text categorisation

    This work examines and attempts to overcome issues caused by the lack of formal standardisation in defining text categorisation techniques and in detailing how they might be appropriately integrated with each other. Despite text categorisation's long history, the concept of automation is relatively new, coinciding with the evolution of computing technology and the subsequent increase in the quantity and availability of electronic textual data. Nevertheless, insufficient descriptions of the diverse algorithms published have led to an acknowledged ambiguity when trying to accurately replicate methods, which has made reliable comparative evaluations impossible. Existing interpretations of general data mining and text categorisation methodologies are analysed in the first half of the thesis, and common elements are extracted to create a distinct set of significant stages. Their possible interactions are logically determined, and a universal architecture is generated that encapsulates all complexities and highlights the critical components. A variety of text-related algorithms are also comprehensively surveyed and grouped according to the stage they belong to, demonstrating how they can be mapped. The second part reviews several open-source data mining applications, placing an emphasis on their ability to handle the proposed architecture, their potential for expansion, and their text processing capabilities. Finding these inflexible and too elaborate to be readily adapted, designs for a novel framework are introduced that focus on rapid prototyping through lightweight customisations and reusable atomic components. As a consequence of the inadequacies of existing options, a rudimentary implementation is realised along with a selection of text categorisation modules.
    Finally, a series of experiments is conducted that validates the feasibility of the outlined methodology and the importance of its composition, while also establishing the practicality of the framework for research purposes. The simplicity of the experiments and the results gathered clearly indicate the potential benefits to be gained when a formalised approach is utilised.
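    The "reusable atomic components" idea, distinct stages composed into a pipeline, can be sketched as follows. The stage names and their order here are illustrative assumptions, not the thesis's actual architecture, which defines its own stages and permitted interactions.

```python
# Minimal sketch of composing atomic text-categorisation stages into a
# pipeline. Each stage is a plain function from a token list to a token
# list, so stages can be swapped or reordered for rapid prototyping.
def lowercase(tokens):
    return [t.lower() for t in tokens]

def drop_short(tokens, min_len=3):
    # Discard tokens shorter than min_len characters.
    return [t for t in tokens if len(t) >= min_len]

def make_pipeline(*stages):
    def run(tokens):
        for stage in stages:
            tokens = stage(tokens)
        return tokens
    return run

pipeline = make_pipeline(lowercase, drop_short)
print(pipeline("The Cat Sat ON a Mat".split()))
# ['the', 'cat', 'sat', 'mat']
```

    Because every stage shares the same interface, a new preprocessing step is a single function rather than a change to the framework, which is the lightweight-customisation property the thesis argues for.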

    Un modÚle de recherche d'information basé sur les graphes et les similarités structurelles pour l'amélioration du processus de recherche d'information

    The main objective of IR systems is to select documents relevant to a user's information need from a collection of documents. Traditional approaches to document/query comparison use surface similarity, i.e. the comparison engine uses surface attributes (indexing terms). We propose a new method which uses a special kind of similarity, namely structural similarity (similarity that uses both surface attributes and the relations between attributes). These similarities were inspired by cognitive studies and by a general similarity measure based on node comparison in a bipartite graph. We propose an adaptation of this general method to the specific context of information retrieval. The adaptation consists in taking the domain's specificities into account: data type, weighted edges, and normalization choices. The core problem is how documents are compared against queries. The idea we develop is that similar documents will share similar terms, and similar terms will appear in similar documents. We have developed an algorithm which implements this idea. We then studied problems related to convergence and complexity, ran tests on classical collections, and compared our measure with two reference measures in our domain. The thesis is structured in five chapters. The first chapter deals with the comparison problem and related concepts such as similarity; we explain different points of view and propose an analogy between cognitive similarity models and IR models. The second chapter presents the IR task, test collections, and the measures used to evaluate a ranked list of documents. The third chapter introduces graph definitions: our model is based on a bipartite graph representation, so we define graphs and the criteria used to evaluate them. The fourth chapter describes how we adopted and adapted the general comparison method.
    The fifth chapter describes how we evaluate the ranking performance of our method and how we compared it with two others.
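    The circular definition at the heart of the method, similar documents share similar terms, and similar terms appear in similar documents, can be computed as a fixed-point iteration on a document-term bipartite graph. The sketch below is an illustration of that SimRank-style idea on a toy incidence matrix; the normalization choice and iteration count are assumptions, not the thesis's actual algorithm.

```python
import numpy as np

# Toy document-term incidence matrix (rows: documents, cols: terms).
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

def structural_similarity(A, iters=10):
    """Alternate between document-document and term-term similarity
    updates until the mutually recursive definition stabilises."""
    nd, nt = A.shape
    Sd = np.eye(nd)   # document-document similarity, starts as identity
    St = np.eye(nt)   # term-term similarity
    for _ in range(iters):
        Sd = A @ St @ A.T            # documents sharing similar terms
        St = A.T @ Sd @ A            # terms appearing in similar documents
        Sd /= np.linalg.norm(Sd)     # normalise to keep values bounded
        St /= np.linalg.norm(St)
    return Sd, St

Sd, St = structural_similarity(A)
```

    Starting from surface similarity (the identity) and propagating through the graph is what adds the structural information the thesis claims improves retrieval over direct term matching alone.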