124 research outputs found

    A linguistic failure analysis of classification of medical publications: A study on stemming vs lemmatization

    Get PDF
    Technology-Assisted Review (TAR) systems are essential to minimize the effort of the user during the search and retrieval of relevant documents for a specific information need. In this paper, we present a failure analysis based on terminological and linguistic aspects of a TAR system for systematic medical reviews. In particular, we analyze the results of the worst performing topics in terms of recall using the dataset of the CLEF 2017 eHealth task on TAR in Empirical Medicine.I sistemi TAR (Technology-Assisted Review) sono fondamentali per ridurre al minimo lo sforzo dell’utente che intende ricercare e recuperare i documenti rilevanti per uno specifico bisogno informativo. In questo articolo, presentiamo una failure analysis basata su aspetti terminologici e linguistici di un sistema TAR per le revisioni sistematiche in campo medico. In particolare, analizziamo i topic per i quali abbiamo ottenuto dei risultati peggiori in termini di recall utilizzando il dataset di CLEF 2017 eHealth task on TAR in Empirical Medicine

    Improving Arabic Light Stemming in Information Retrieval Systems

    Get PDF
    Information retrieval refers to the retrieval of textual documents such as newsprint and magazine articles or Web documents. Due to extensive research in the IR field, there are many retrieval techniques that have been developed for Arabic language. The main objective of this research to improve Arabic information retrieval by enhancing light stemming and preprocessing stage and to contribute to the open source community, also establish a guideline for Arabic normalization and stop-word removal. To achieve these objectives, we create a GUI toolkit that implements preprocessing stage that is necessary for information retrieval. One of these steps is normalizing, which we improved and introduced a set of rules to be standardized and improved by other researchers. The next preprocessing step we improved is stop-word removal, we introduced two different stop-word lists, the first one is intensive stop-word list for reducing the size of the index and ambiguous words, and the other is light stop-word list for better results with recall in information retrieval applications. We improved light stemming by update a suffix rule, and introduce the use of Arabized words, 100 words manually collected, these words should not follow the stemming rules since they came to Arabic language from other languages, and show how this improve results compared to two popular stemming algorithms like Khoja and Larkey stemmers. The proposed toolkit was integrated into a popular IR platform known as Terrier IR platform. We implemented Arabic language support into the Terrier IR platform. We used TF-IDF scoring model from Terrier IR platform. We tested our results using OSAC datasets. We used java programming language and Terrier IR platform for the proposed systems. The infrastructure we used consisted of CORE I7 CPU ran speed at 3.4 GHZ and 8 GB RAM

    A Linguistic Failure Analysis of Classification of Medical Publications: A Study on Stemming vs Lemmatization

    Get PDF
    Technology-Assisted Review (TAR) systems are essential to minimize the effort of the user during the search and retrieval of relevant documents for a specific information need. In this paper, we present a failure analysis based on terminological and linguistic aspects of a TAR system for systematic medical reviews. In particular, we analyze the results of the worst performing topics in terms of recall using the dataset of the CLEF 2017 eHealth task on TAR in Empirical Medicine.I sistemi TAR (Technology-Assisted Review) sono fondamentali per ridurre al minimo lo sforzo dell’utente che intende ricercare e recuperare i documenti rilevanti per uno specifico bisogno informativo. In questo articolo, presentiamo una failure analysis basata su aspetti terminologici e linguistici di un sistema TAR per le revisioni sistematiche in campo medico. In particolare, analizziamo i topic per i quali abbiamo ottenuto dei risultati peggiori in termini di recall utilizzando il dataset di CLEF 2017 eHealth task on TAR in Empirical Medicine

    Supporting Text Retrieval Query Formulation In Software Engineering

    Get PDF
    The text found in software artifacts captures important information. Text Retrieval (TR) techniques have been successfully used to leverage this information. Despite their advantages, the success of TR techniques strongly depends on the textual queries given as input. When poorly chosen queries are used, developers can waste time investigating irrelevant results. The quality of a query indicates the relevance of the results returned by TR in response to the query and can give an indication if the results are worth investigating or a reformulation of the query should be sought instead. Knowing the quality of the query could lead to time saved when irrelevant results are returned. However, the only way to determine if a query led to the wanted artifacts is by manually inspecting the list of results. This dissertation introduces novel approaches to measure and predict the quality of queries automatically in the context of SE tasks, based on a set of statistical properties of the queries. The approaches are evaluated for the task of concept location in source code. The results reveal that the proposed approaches are able to accurately capture and predict the quality of queries for SE tasks supported by TR. When a query has low quality, the developer can reformulate it and improve it. However, this is just as hard as formulating the query in the first place. This dissertation presents two approaches for partial and complete automation of the query reformulation process. The semi-automatic approach relies on developer feedback about the relevance of TR results and uses this information to automatically reformulate the query. The automatic approach learns and applies the best reformulation approach for a query and relies on a set of training queries and their statistical properties to achieve this. Both approaches are evaluated for concept location and the results show that the techniques are able to improve the results of the original queries in the majority of the cases. We expect that on the long run the proposed approaches will contribute directly to the reduction of developer effort and implicitly the reduction of software evolution costs

    Configuring and Assembling Information Retrieval based Solutions for Software Engineering Tasks.

    Get PDF
    Information Retrieval (IR) approaches are used to leverage textual or unstructured data generated during the software development process to support various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis, etc.). Two of the most important steps for applying IR techniques to support SE tasks are preprocessing the corpus and configuring the IR technique, and these steps can significantly influence the outcome and the amount of effort developers have to spend for these maintenance tasks. We present the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support SE tasks. The approach named IR-GA determines the (near) optimal solution to be used for each step of the IR process without requiring any training. We applied IR-GA on three different SE tasks and the results of the study indicate that IR-GA outperforms approaches previously used in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised approach and a combinatorial approach

    Development of a stemmer for the isiXhosa language

    Get PDF
    IsiXhosa language is one of the eleven official languages and the second most widely spoken language in South Africa. However, in terms of computational linguistics, the language did not get attention and natural language related work is almost non-existent. Document retrieval using unstructured queries requires some kind of language processing, and an efficient retrieval of documents can be achieved if we use a technique called stemming. The area that involves document storage and retrieval is called Information Retrieval (IR). Basically, IR systems make use of a Stemmer to index document representations and also terms in users’ queries to retrieve matching documents. In this dissertation, we present the developed Stemmer that can be used in both conditions. The Stemmer is used in IR systems, like Google to retrieve documents written in isiXhosa. In the Eastern Cape Province of South Africa many public schools take isiXhosa as a subject and also a number of Universities in South Africa teach isiXhosa. Therefore, for a language important such as this, it is important to make valuable information that is available online accessible to users through the use of IR systems. In our efforts to develop a Stemmer for the isiXhosa language, an investigation on how others have developed Stemmers for other languages was carried out. From the investigation we came to realize that the Porter stemming algorithm in particular was the main algorithm that many of other Stemmers make use of as a reference. We found that Porter’s algorithm could not be used in its totality in the development of the isiXhosa Stemmer because of the morphological complexity of the language. We developed an affix removal that is embedded with rules that determine which order should be followed in stripping the affixes. The rule is that, the word under consideration is checked against the exceptions, if it’s not in the exceptions list then the stripping continue in the following order; Prefix removal, Suffix removal and finally save the result as stem. The Stemmer was successfully developed and was tested and evaluated in a sample data that was randomly collected from the isiXhosa text books and isiXhosa dictionary. From the results obtained we concluded that the Stemmer can be used in IR systems as it showed 91 percent accuracy. The errors were 9 percent and therefore these results are within the accepted range and therefore the Stemmer can be used to help in retrieval of isiXhosa documents. This is only a noun Stemmer and in the future it can be extended to also stem verbs as well. The Stemmer can also be used in the development of spell-checkers of isiXhosa
    corecore