
    Evaluating Bad Query Abandonment in an Iterative SMS-Based FAQ Retrieval System

    In this paper, we investigate how many iterations users are willing to tolerate in an iterative Frequently Asked Question (FAQ) system that provides information on HIV/AIDS. This is part of work in progress that aims to develop an automated FAQ system that can answer HIV/AIDS-related queries from users in Botswana. Our system engages the user in the question answering process through an iterative interaction approach in order to avoid giving inappropriate answers. Our findings indicate how long users are willing to engage with the system, and we subsequently use this to develop a novel evaluation metric for future developments of the system. As an additional finding, we show that users' previous search experience has a significant effect on their future behaviour.
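
    As a hedged illustration of the kind of log analysis such a study involves, the Python sketch below counts at which iteration unsuccessful sessions are abandoned; the session log format and the values in it are invented for the example and are not taken from the paper.

        from collections import Counter

        # assumed per-session logs: one boolean per iteration, True if the shown answer was accepted
        sessions = {
            "s1": [False, False, True],     # success at the third iteration
            "s2": [False, False, False],    # gave up after three iterations
            "s3": [False, True],
        }

        abandon_depths = Counter()
        for outcomes in sessions.values():
            if not any(outcomes):           # no accepted answer in the session -> abandonment
                abandon_depths[len(outcomes)] += 1

        total = sum(abandon_depths.values())
        for depth, count in sorted(abandon_depths.items()):
            print(f"abandoned after {depth} iterations: {count} of {total} abandoned sessions")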

    Automatic structuring and correction suggestion system for Hungarian clinical records

    The first steps in processing clinical documents are structuring and normalization. In this paper we show how we compensate for the lack of structure in the raw data by automatically transforming simple formatting features into structural units. We then developed an algorithm to separate running text from tabular and numerical data. Finally, we generated correction suggestions for word forms recognized as incorrect. We also provide evaluation results for using the system to correct input texts automatically by choosing the best suggestion from the generated list. Our method is based on the statistical characteristics of our Hungarian clinical data set and on the HUMor Hungarian morphological analyzer. We conclude that the algorithm cannot correct every mistake on its own, but it is a powerful aid for manually correcting Hungarian medical texts in order to produce a correct text corpus for this domain.
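
    The separation of running text from tabular data is not specified in detail in the abstract; the toy Python sketch below illustrates one plausible line-level heuristic. The character-class ratio, the threshold, and the example record are assumptions, not the authors' actual rules.

        def is_tabular(line: str, threshold: float = 0.4) -> bool:
            """Flag a line as tabular/numerical when digits and symbols dominate its characters."""
            chars = [ch for ch in line.strip() if not ch.isspace()]
            if not chars:
                return False
            non_alpha = sum(1 for ch in chars if not ch.isalpha())
            return non_alpha / len(chars) > threshold

        record = [
            "RR: 120/80  P: 72/min  T: 36.6 C",                        # vitals block
            "A beteg panaszmentes, gyógyszereit rendszeresen szedi.",  # running text
        ]
        for line in record:
            print("TABULAR" if is_tabular(line) else "TEXT", "|", line)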

    Adapting Sequence to Sequence models for Text Normalization in Social Media

    Social media offer an abundant source of valuable raw data; however, informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot explicitly handle the noise found in short online posts. Moreover, the variety of frequently occurring linguistic variations presents several challenges, even for humans, who may not be able to understand such posts, especially when they contain slang and abbreviations. Text normalization aims to transform online user-generated text into a canonical form. Current text normalization systems rely on string or phonetic similarity and on classification models that work in a local fashion. We argue that processing contextual information is crucial for this task and introduce a hybrid word-character attention-based encoder-decoder model for social media text normalization that can serve as a pre-processing step for NLP applications to adapt to noisy text in social media. Our character-based component is trained on synthetic adversarial examples designed to capture errors commonly found in online user-generated text. Experiments show that our model surpasses neural architectures designed for text normalization and achieves performance comparable to the state of the art. Comment: Accepted at the 13th International AAAI Conference on Web and Social Media (ICWSM 2019).
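
    The abstract mentions training the character-based component on synthetic adversarial examples; the Python sketch below illustrates the general idea of generating (noisy, clean) pairs with random character-level edits. The specific noise operations and example words are assumptions, not the paper's actual procedure.

        import random

        def add_noise(word: str, rng: random.Random) -> str:
            """Apply one random character-level edit mimicking informal spelling errors."""
            if len(word) < 3:
                return word
            chars = list(word)
            op = rng.choice(["drop_vowel", "repeat", "swap"])
            i = rng.randrange(1, len(chars) - 1)
            if op == "drop_vowel":
                vowels = [j for j, c in enumerate(chars) if c in "aeiou"]
                if vowels:
                    del chars[rng.choice(vowels)]
            elif op == "repeat":                        # duplicate a character
                chars.insert(i, chars[i])
            else:                                       # swap two adjacent characters
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            return "".join(chars)

        rng = random.Random(0)
        clean_words = ["tomorrow", "because", "people"]
        pairs = [(add_noise(w, rng), w) for w in clean_words]
        print(pairs)    # (noisy, clean) pairs; the exact noise depends on the random choices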

    Magyar nyelvű klinikai dokumentumok előfeldolgozása (Preprocessing of Hungarian-language clinical documents)

    The first step in processing clinical documents is structuring and normalizing them. We show how the missing structural units were reconstructed by automatic transformations based on formatting features, and how basic meta-information was extracted from the running text. We then separated the textual parts of the corpus from the non-textual parts and built an automatic spelling-correction and suggestion-generation system for the resulting set. Our method relies primarily on the statistical behaviour of the corpus at our disposal, but external resources were also included to achieve better quality. The algorithm was evaluated on its two functions: spelling correction and suggestion generation. We found that our method is currently not suitable on its own for fully automatic correction, which was not the goal; with minimal human involvement, however, it can be applied effectively to produce a correct medical-clinical corpus.
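
    As a toy illustration of the suggestion-generation step, the sketch below draws candidate corrections from a corpus-derived lexicon and ranks them by corpus frequency. The HUMor morphological analysis used in the real system is not reproduced here, and the words and counts are invented.

        import difflib
        from collections import Counter

        # word frequencies from a (correct) in-domain corpus -- invented numbers
        corpus_counts = Counter({"terápia": 41, "terápiás": 17, "anamnézis": 29, "diagnózis": 52})

        def suggest(word: str, n: int = 3):
            """Propose corrections: close matches from the lexicon, ranked by corpus frequency."""
            close = difflib.get_close_matches(word, corpus_counts, n=10, cutoff=0.75)
            return sorted(close, key=lambda c: -corpus_counts[c])[:n]

        print(suggest("terapia"))   # -> ['terápia', ...]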

    Correcting input noise in SMT as a char-based translation problem

    Misspelled words have a direct impact on the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents improvement strategies for translating real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator.
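
    To make the character-based framing concrete, the sketch below shows one way the parallel training data for such a preprocessing translator could be prepared: sentences are split into space-separated characters, with an underscore marking original spaces, so a standard SMT toolkit can learn character-level mappings. The file names, the marker convention, and the example pair are assumptions.

        def to_char_tokens(sentence: str) -> str:
            """Space-separate every character, marking original spaces with '_'."""
            return " ".join("_" if ch == " " else ch for ch in sentence)

        # assumed (noisy, clean) sentence pair and output file names
        noisy_clean_pairs = [("i dont no were it is", "I don't know where it is")]

        with open("train.noisy.char", "w", encoding="utf-8") as src, \
             open("train.clean.char", "w", encoding="utf-8") as tgt:
            for noisy, clean in noisy_clean_pairs:
                src.write(to_char_tokens(noisy) + "\n")
                tgt.write(to_char_tokens(clean) + "\n")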

    Helyesírási hibák automatikus javítása orvosi szövegekben a szövegkörnyezet figyelembevételével (Context-aware automatic correction of spelling errors in medical texts)

    In this paper we present a substantially improved version of a previously published medical spelling-correction system. Unlike its predecessor, it can correct words erroneously written together and takes the surrounding context into account while doing so, which makes it suitable for fully automatic correction as well.
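
    As a toy illustration of correcting run-together words with context, the sketch below proposes splits of an unknown token and scores them with a simple bigram model. The vocabulary, counts, and scoring are illustrative assumptions rather than the paper's actual model.

        from collections import Counter

        vocab = {"a", "beteg", "panaszmentes", "és", "lázas"}          # assumed lexicon
        bigram_counts = Counter({("a", "beteg"): 12, ("beteg", "panaszmentes"): 7})

        def split_candidates(token):
            """Propose ways to split a run-together token into two known words."""
            return [(token[:i], token[i:]) for i in range(1, len(token))
                    if token[:i] in vocab and token[i:] in vocab]

        def context_score(prev_word, candidate):
            """Very small bigram score: how well the split fits the preceding word."""
            first, second = candidate
            return bigram_counts[(prev_word, first)] + bigram_counts[(first, second)]

        prev, token = "a", "betegpanaszmentes"      # "beteg panaszmentes" written together
        best = max(split_candidates(token), key=lambda c: context_score(prev, c), default=None)
        print(best)                                 # -> ('beteg', 'panaszmentes')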

    Toward Tweets Normalization Using Maximum Entropy

    The use of social network services and microblogs, such as Twitter, has created valuable text resources that contain extremely noisy text. Twitter messages contain so much noise that it is difficult to use them in natural language processing tasks. This paper presents a new approach to normalizing tweets using a maximum entropy model. The proposed approach addresses words that are unseen in the training phase: although maximum entropy needs a training dataset to adjust its parameters, the approach can normalize data unseen during training. The principle of maximum entropy emphasizes incorporating the available features into a uniform model. First, we generate a set of normalized candidates for each out-of-vocabulary word based on lexical, phonemic, and morphophonemic similarities. Then, three probability scores are calculated for each candidate using positional indexing, a dependency-based frequency feature, and a language model. After the optimal values of the model parameters are obtained in the training phase, the model calculates the final probability for each candidate. The approach achieved an 83.12 BLEU score when tested on 2,000 tweets. Our experimental results show that the maximum entropy approach significantly outperforms previous well-known normalization approaches.
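
    The combination step can be illustrated with a minimal log-linear (maximum entropy style) sketch: each candidate gets feature scores that are weighted, summed, and normalized with a softmax. The candidates, feature values, and weights below are invented for illustration and are not the paper's trained model.

        import math

        # assumed feature scores for two candidate normalizations of the token "2moro":
        # (lexical similarity, phonemic similarity, language-model score)
        candidates = {
            "tomorrow": (0.6, 0.9, 0.8),
            "tumor":    (0.4, 0.5, 0.1),
        }
        weights = (1.2, 1.5, 2.0)   # in the real model these are learned from training data

        def probabilities(cands, w):
            """Log-linear combination: weighted feature sum, exponentiated and normalized."""
            scores = {c: math.exp(sum(wi * fi for wi, fi in zip(w, feats)))
                      for c, feats in cands.items()}
            z = sum(scores.values())
            return {c: s / z for c, s in scores.items()}

        print(probabilities(candidates, weights))   # "tomorrow" receives most of the mass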

    Normalización de texto en español de Argentina (Text normalization for Argentinian Spanish)

    Thesis (Lic. en Cs. de la Computación), Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía, Física y Computación, 2018. Nowadays, the amount of data consumed and generated by a single person is enormous, and it keeps growing because anyone can produce it. This brings with it an increase in noise: text from social networks is characteristically noisy, which is a problem when one wants to work with it. In this work we build a corpus of tweets in Argentinian Spanish. We collected a large set of tweets and selected them manually to obtain a representative sample of typical normalization errors. We then defined clear, explicit correction criteria and used them to annotate the corpus manually. We also present a text normalization system that works on tweets: given a set of tweets as input, it detects and corrects the words that need to be standardized, using components such as lexical resources, rule-based systems and language models. Finally, we ran experiments with different corpora, including our own, and with different system configurations to understand the advantages and disadvantages of each.
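
    As a hedged sketch of such a pipeline, the code below detects out-of-lexicon tokens, proposes candidates with simple rewrite rules, and keeps the most frequent in-lexicon candidate. The lexicon, rules, and example tweet are assumptions, not the system's actual resources.

        import re
        from collections import Counter

        lexicon = Counter({"que": 90, "bueno": 40, "muy": 60, "está": 35})   # assumed lexicon
        rules = [
            (re.compile(r"^q$"), "que"),            # expand a common shorthand
            (re.compile(r"(.)\1{2,}"), r"\1"),      # collapse letter repetitions: "bueeeno"
        ]

        def normalize(token: str) -> str:
            """Return the token unchanged if known, otherwise the best rule-generated candidate."""
            if token in lexicon:
                return token
            candidates = [c for pattern, repl in rules
                          if (c := pattern.sub(repl, token)) in lexicon]
            return max(candidates, key=lambda c: lexicon[c], default=token)

        print([normalize(t) for t in "q bueeeno está".split()])   # -> ['que', 'bueno', 'está']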