15 research outputs found

    Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

    In Automatic Text Summarization, preprocessing is an important phase for reducing the space of textual representation. Classically, stemming and lemmatization have been widely used to normalize words. However, even with normalization, the curse of dimensionality can hurt summarizer performance on large texts. This paper describes a new word-normalization method that further reduces the representation space: we propose reducing each word to its initial letters, a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of the summaries produced with this representation, but often dramatically improves system performance. Summaries over trilingual corpora were evaluated automatically with Fresa; the results confirm an increase in performance regardless of the summarizer used. Comment: 22 pages, 12 figures, 9 tables
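    The core idea, reducing each word to its initial letters, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the parameter `n` (how many initial letters to keep) is an assumption for demonstration.

```python
def ultra_stem(word, n=1):
    """Reduce a word to its first n letters (the Ultra-stemming idea).

    n is an illustrative parameter; the paper's normalization keeps
    only the initial letters of each word.
    """
    return word[:n].lower()

def normalize(text, n=1):
    """Map every token in a text to its ultra-stem, shrinking the vocabulary."""
    return [ultra_stem(tok, n) for tok in text.split()]

print(normalize("Summarization systems summarize summaries"))
# ['s', 's', 's', 's']
```

    With n=1 the vocabulary collapses to at most the alphabet size, which is what makes the representation space so much smaller than with stemming or lemmatization.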

    Data Mining in Electronic Commerce

    Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges. Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Development of a stemmer for the isiXhosa language

    IsiXhosa is one of the eleven official languages of South Africa and the second most widely spoken language in the country. However, in terms of computational linguistics the language has received little attention, and natural-language-related work is almost non-existent. Document retrieval using unstructured queries requires some kind of language processing, and efficient retrieval of documents can be achieved using a technique called stemming. The area concerned with document storage and retrieval is called Information Retrieval (IR). IR systems use a stemmer both to index document representations and to normalize the terms in users' queries when retrieving matching documents. In this dissertation we present a stemmer that can be used in both roles. Such a stemmer can be used in IR systems, such as Google, to retrieve documents written in isiXhosa. In the Eastern Cape Province of South Africa many public schools take isiXhosa as a subject, and a number of South African universities teach it. For a language of such importance, it is important to make the valuable information available online accessible to users through IR systems. In developing a stemmer for isiXhosa, we first investigated how stemmers have been developed for other languages. This investigation showed that the Porter stemming algorithm in particular is the main reference on which many other stemmers are based. We found that Porter's algorithm could not be used in its entirety for the isiXhosa stemmer because of the morphological complexity of the language. We therefore developed an affix-removal algorithm with rules that determine the order in which affixes are stripped.
    The rule is that the word under consideration is first checked against an exception list; if it is not in the list, stripping continues in the following order: prefix removal, then suffix removal, and finally the result is saved as the stem. The stemmer was developed, tested, and evaluated on a sample of data randomly collected from isiXhosa textbooks and an isiXhosa dictionary. From the results obtained we concluded that the stemmer can be used in IR systems, as it showed 91 percent accuracy. The error rate was 9 percent, which is within the accepted range, so the stemmer can help in the retrieval of isiXhosa documents. This is only a noun stemmer; in the future it can be extended to stem verbs as well. The stemmer can also be used in the development of isiXhosa spell-checkers.
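    The stripping order described above (exception check, then prefix removal, then suffix removal) can be sketched as follows. The affix lists and the example exception word here are illustrative placeholders, not the dissertation's actual isiXhosa morphological rules.

```python
# Hypothetical sketch of the described stripping order. EXCEPTIONS,
# PREFIXES and SUFFIXES are placeholder lists for illustration only.
EXCEPTIONS = {"umntu"}          # words returned unchanged
PREFIXES = ["aba", "um", "i"]
SUFFIXES = ["ana", "wa", "a"]

def stem(word):
    # Step 1: words on the exception list are never stripped.
    if word in EXCEPTIONS:
        return word
    # Step 2: remove one prefix, longest match first.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p) + 1:
            word = word[len(p):]
            break
    # Step 3: remove one suffix, longest match first; the result is the stem.
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 1:
            word = word[:-len(s)]
            break
    return word
```

    The length guards keep the algorithm from stripping a word down to nothing, a common safeguard in rule-based stemmers.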


    Identifying interactions between chemical entities in text

    Master's thesis in Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2014. Novel interactions between chemical compounds are often described in scientific articles, which are being published at an unprecedented rate. However, these articles are directed at humans, written in natural language, and cannot be easily processed by a machine. Text mining methods offer a solution to this problem by automatically extracting the relevant information from the literature.
    These methods should be adapted to the specific domain and task they are applied to. This dissertation proposes a system for the automatic and efficient identification of interactions between chemical entities in biomedical documents. The system was developed as two modules. The first module recognizes the chemical entities mentioned in a given text; it was based on an existing framework, which was improved with a novel type of semantic similarity measure. The second module identifies the pairs of entities that represent a chemical interaction in the same text, using machine learning techniques and domain knowledge. Each module was evaluated separately, achieving high precision against two different gold standards. Together, the two modules constitute the IICE system, which can be used to analyze any biomedical document for chemical entities and interactions, and is accessible via a web tool.

    A modular architecture for systematic text categorisation

    This work examines and attempts to overcome issues caused by the lack of formal standardisation in defining text categorisation techniques and in detailing how they might be appropriately integrated with each other. Despite text categorisation's long history, the concept of automation is relatively new, coinciding with the evolution of computing technology and the subsequent increase in the quantity and availability of electronic textual data. Nevertheless, insufficient descriptions of the diverse algorithms published have led to an acknowledged ambiguity when trying to accurately replicate methods, which has made reliable comparative evaluations impossible. Existing interpretations of general data mining and text categorisation methodologies are analysed in the first half of the thesis, and common elements are extracted to create a distinct set of significant stages. Their possible interactions are logically determined, and a universal architecture is generated that encapsulates all complexities and highlights the critical components. A variety of text-related algorithms are also comprehensively surveyed and grouped according to the stage they belong to, demonstrating how they can be mapped. The second part reviews several open-source data mining applications, placing an emphasis on their ability to handle the proposed architecture, their potential for expansion, and their text processing capabilities. Finding these inflexible and too elaborate to be readily adapted, designs for a novel framework are introduced that focus on rapid prototyping through lightweight customisations and reusable atomic components. As a consequence of the inadequacies of existing options, a rudimentary implementation is realised along with a selection of text categorisation modules.
    Finally, a series of experiments is conducted that validates the feasibility of the outlined methodology and the importance of its composition, while also establishing the practicality of the framework for research purposes. The simplicity of the experiments and the results gathered clearly indicate the potential benefits to be gained when a formalised approach is utilised.
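    The "reusable atomic components" idea, distinct stages composed into a pipeline, can be sketched as follows. The stage names and their order here are illustrative assumptions, not the thesis's actual architecture, which defines its own stages and permitted interactions.

```python
# Minimal sketch of composing atomic text-categorisation stages into a
# pipeline. Each stage is a plain function from a token list to a token
# list, so stages can be swapped or reordered for rapid prototyping.
def lowercase(tokens):
    return [t.lower() for t in tokens]

def drop_short(tokens, min_len=3):
    # Discard tokens shorter than min_len characters.
    return [t for t in tokens if len(t) >= min_len]

def make_pipeline(*stages):
    def run(tokens):
        for stage in stages:
            tokens = stage(tokens)
        return tokens
    return run

pipeline = make_pipeline(lowercase, drop_short)
print(pipeline("The Cat Sat ON a Mat".split()))
# ['the', 'cat', 'sat', 'mat']
```

    Because every stage shares the same interface, a new preprocessing step is a single function rather than a change to the framework, which is the lightweight-customisation property the thesis argues for.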

    Un modÚle de recherche d'information basé sur les graphes et les similarités structurelles pour l'amélioration du processus de recherche d'information

    The main objective of IR systems is to select documents relevant to a user's information need from a collection of documents. Traditional approaches to document/query comparison use surface similarity, i.e. the comparison engine uses surface attributes (indexing terms). We propose a new method which uses a special kind of similarity, namely structural similarity (similarity that uses both surface attributes and the relations between attributes). These similarities were inspired by cognitive studies and by a general similarity measure based on node comparison in a bipartite graph. We propose an adaptation of this general method to the specific context of information retrieval. The adaptation consists in taking the domain's specificities into account: data type, weighted edges, and normalization choices. The core problem is how documents are compared against queries. The idea we develop is that similar documents will share similar terms, and similar terms will appear in similar documents. We have developed an algorithm which implements this idea. We then studied problems related to convergence and complexity, ran tests on classical collections, and compared our measure with two reference measures in our domain. The thesis is structured in five chapters. The first chapter deals with the comparison problem and related concepts such as similarity; we explain different points of view and propose an analogy between cognitive similarity models and IR models. The second chapter presents the IR task, test collections, and the measures used to evaluate a ranked list of documents. The third chapter introduces graph definitions: our model is based on a bipartite graph representation, so we define graphs and the criteria used to evaluate them. The fourth chapter describes how we adopted and adapted the general comparison method.
    The fifth chapter describes how we evaluate the ranking performance of our method and how we compared it with two others.
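    The circular definition at the heart of the method, similar documents share similar terms, and similar terms appear in similar documents, can be computed as a fixed-point iteration on a document-term bipartite graph. The sketch below is an illustration of that SimRank-style idea on a toy incidence matrix; the normalization choice and iteration count are assumptions, not the thesis's actual algorithm.

```python
import numpy as np

# Toy document-term incidence matrix (rows: documents, cols: terms).
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

def structural_similarity(A, iters=10):
    """Alternate between document-document and term-term similarity
    updates until the mutually recursive definition stabilises."""
    nd, nt = A.shape
    Sd = np.eye(nd)   # document-document similarity, starts as identity
    St = np.eye(nt)   # term-term similarity
    for _ in range(iters):
        Sd = A @ St @ A.T            # documents sharing similar terms
        St = A.T @ Sd @ A            # terms appearing in similar documents
        Sd /= np.linalg.norm(Sd)     # normalise to keep values bounded
        St /= np.linalg.norm(St)
    return Sd, St

Sd, St = structural_similarity(A)
```

    Starting from surface similarity (the identity) and propagating through the graph is what adds the structural information the thesis claims improves retrieval over direct term matching alone.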