15 research outputs found
Evaluating Bad Query Abandonment in an Iterative SMS-Based FAQ Retrieval System
In this paper, we investigate how many iterations users are willing to tolerate in an iterative Frequently Asked Question (FAQ) system that provides information on HIV/AIDS. This is part of work in progress that aims to develop an automated FAQ system that can be used to provide answers to HIV/AIDS-related queries from users in Botswana. Our system engages the user in the question answering process by following an iterative interaction approach in order to avoid giving inappropriate answers. Our findings provide an indication of how long users are willing to engage with the system. We subsequently use this to develop a novel evaluation metric for future developments of the system. As an additional finding, we show that users' previous search experience has a significant effect on their future behaviour.
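A minimal sketch of such an iterative question-confirmation loop with an abandonment cutoff is given below; the function names, the confirmation callback, and the five-iteration threshold are illustrative assumptions, not details from the paper.

```python
MAX_ITERATIONS = 5  # assumed tolerance threshold; the paper estimates this empirically

def iterative_faq_session(ranked_faqs, user_confirms):
    """Offer one ranked FAQ per iteration until the user accepts or gives up.

    ranked_faqs: list of (question, answer) pairs, best match first.
    user_confirms: callback standing in for the user's SMS yes/no reply.
    Returns (answer, iterations_used); answer is None on abandonment.
    """
    for i, (question, answer) in enumerate(ranked_faqs[:MAX_ITERATIONS], start=1):
        if user_confirms(question):  # user accepts this paraphrase of their query
            return answer, i
    return None, min(len(ranked_faqs), MAX_ITERATIONS)  # bad query abandonment

# Example: the user accepts the second suggested FAQ.
faqs = [("What is HIV?", "HIV is a virus that..."),
        ("How is HIV transmitted?", "HIV is transmitted through...")]
print(iterative_faq_session(faqs, lambda q: "transmitted" in q))
```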
Automatic structuring and correction suggestion system for Hungarian clinical records
The first steps in processing clinical documents are structuring and normalization. In this paper, we demonstrate how we compensate for the lack of structure in the raw data by automatically transforming simple formatting features into structural units. We then developed an algorithm to separate running text from tabular and numerical data. Finally, we generated correction suggestions for word forms recognized as incorrect. We also provide evaluation results for using the system to automatically correct input texts by choosing the best suggestion from the generated list. Our method is based on the statistical characteristics of our Hungarian clinical data set and on the HUMor Hungarian morphological analyzer. We conclude that our algorithm cannot correct all mistakes by itself, but it is a powerful tool for assisting the manual correction of Hungarian medical texts in order to produce a correct text corpus for this domain.
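As a rough illustration of the running-text/tabular separation step, the sketch below flags lines dominated by digits or lacking word-like tokens. The thresholds are assumptions for illustration only, since the paper's actual algorithm is tuned to the statistics of its Hungarian clinical corpus.

```python
import re

def is_tabular_or_numeric(line, digit_ratio=0.3, min_alpha_tokens=3):
    """Flag lines dominated by digits or lacking enough word-like tokens."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True
    digits = sum(c.isdigit() for c in chars)
    alpha_tokens = re.findall(r"[^\W\d_]{2,}", line)  # runs of 2+ letters
    return digits / len(chars) > digit_ratio or len(alpha_tokens) < min_alpha_tokens

record = ["Patient presented with persistent cough and mild fever.",
          "WBC 11.2  RBC 4.5  HGB 13.8  PLT 250",
          "Therapy was continued with the same dosage."]
running_text = [line for line in record if not is_tabular_or_numeric(line)]
print(running_text)  # keeps only the two narrative sentences
```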
Adapting Sequence to Sequence models for Text Normalization in Social Media
Social media offer an abundant source of valuable raw data; however, informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot explicitly handle the noise found in short online posts. Moreover, the variety of frequently occurring linguistic variations presents several challenges, even for humans, who may be unable to comprehend the meaning of such posts, especially when they contain slang and abbreviations. Text normalization aims to transform online user-generated text into a canonical form. Current text normalization systems rely on string or phonetic similarity and on classification models that work in a local fashion. We argue that processing contextual information is crucial for this task and introduce a hybrid word-character attention-based encoder-decoder model for social media text normalization that can serve as a pre-processing step for NLP applications to adapt to noisy text in social media. Our character-based component is trained on synthetic adversarial examples that are designed to capture errors commonly found in online user-generated text. Experiments show that our model surpasses neural architectures designed for text normalization and achieves performance comparable with state-of-the-art related work. Comment: Accepted at the 13th International AAAI Conference on Web and Social Media (ICWSM 2019).
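The sketch below illustrates, under assumed perturbation rules, how synthetic noisy/clean training pairs of this kind can be generated for a character-level model. The specific substitutions and probabilities are stand-ins, not the paper's adversarial procedure.

```python
import random

PHONETIC_SUBS = {"ph": "f", "you": "u", "to": "2", "for": "4"}  # toy rules

def corrupt(word, rng=random):
    """Apply a few social-media-style perturbations to a clean word."""
    w = word
    for canon, noisy in PHONETIC_SUBS.items():  # phonetic/abbreviation swaps
        if canon in w and rng.random() < 0.5:
            w = w.replace(canon, noisy)
    if len(w) > 3 and rng.random() < 0.5:       # drop an internal vowel
        vowels = [i for i, c in enumerate(w[1:-1], 1) if c in "aeiou"]
        if vowels:
            i = rng.choice(vowels)
            w = w[:i] + w[i + 1:]
    if rng.random() < 0.3:                      # stretch a letter ("soo")
        i = rng.randrange(len(w))
        w = w[:i] + w[i] * 2 + w[i:]
    return w

# (noisy, clean) pairs to train the character-based component on
pairs = [(corrupt(w), w) for w in ["tomorrow", "photo", "see", "you"]]
print(pairs)
```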
Preprocessing of Hungarian clinical documents
The first step in processing clinical documents is structuring and normalizing them. We show how we were able to compensate for the missing structural units through automatic transformations based on formatting features, and how we extracted basic meta-information from the running text. We then separated the textual parts of the corpus from the non-textual parts, and built an automatic spelling-correction and suggestion-generation system for the resulting set. Our method relies primarily on the statistical behaviour of the corpus available to us, but we also incorporated external resources to achieve better quality. We evaluated the algorithm on its two functions: spelling correction and suggestion generation. We found that our method is currently not suitable on its own for fully automatic correction, although this was not the goal; with minimal human involvement, however, it can be applied effectively to create a correct medical-clinical corpus.
Correcting input noise in SMT as a char-based translation problem
Misspelled words have a direct impact on the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents several improvement strategies for translating real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator.
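One common way to cast correction as character-based translation is to explode each word into space-separated characters with an explicit boundary token, so that a standard phrase-based SMT system learns character-level mappings from noisy to clean text. The sketch below shows this preprocessing; the "&lt;w&gt;" boundary symbol is an assumption, and the paper may use a different marker.

```python
def to_char_tokens(sentence, boundary="<w>"):
    """Explode words into space-separated characters with boundary tokens."""
    return f" {boundary} ".join(" ".join(word) for word in sentence.split())

def from_char_tokens(char_string, boundary="<w>"):
    """Invert the transformation after the character-level translation step."""
    return " ".join(w.replace(" ", "") for w in char_string.split(boundary))

noisy = "I wnat to trvel tomorow"
chars = to_char_tokens(noisy)   # "I <w> w n a t <w> t o ..."
print(chars)
print(from_char_tokens(chars))  # round-trips back to the word sequence
```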
Automatic correction of spelling errors in medical texts, taking the textual context into account
In this paper, we present a substantially improved version of a previously presented medical spelling-correction system. Unlike its predecessor, it is able to correct run-together words, and it takes the textual context into account when doing so, which makes it suitable for fully automatic correction as well.
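A minimal sketch of one way such context-aware splitting of run-together words can work: every binary split whose halves are in-vocabulary is scored with a bigram language model over the neighbouring words. The vocabulary, probabilities, and penalty below are toy values, not the system's actual resources.

```python
VOCAB = {"blood", "pressure", "was", "the", "normal"}
BIGRAM_LOGP = {("blood", "pressure"): -1.0, ("was", "blood"): -2.0,
               ("pressure", "normal"): -2.5}  # toy log-probabilities

def split_candidates(token):
    """All binary splits whose halves are both in the vocabulary."""
    return [(token[:i], token[i:]) for i in range(1, len(token))
            if token[:i] in VOCAB and token[i:] in VOCAB]

def score(prev_word, pair, next_word):
    """Sum bigram log-probabilities across the split in its context."""
    parts = [prev_word, *pair, next_word]
    return sum(BIGRAM_LOGP.get(bg, -10.0)  # heavy penalty for unseen bigrams
               for bg in zip(parts, parts[1:]))

token, prev_word, next_word = "bloodpressure", "was", "normal"
best = max(split_candidates(token), key=lambda p: score(prev_word, p, next_word))
print(best)  # ('blood', 'pressure')
```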
Toward Tweets Normalization Using Maximum Entropy
The use of social network services and microblogs, such as Twitter, has created valuable text resources, which contain extremely noisy text. Twitter messages contain so much noise that it is difficult to use them in natural language processing tasks. This paper presents a new approach using the maximum entropy model for normalizing Tweets. The proposed approach addresses words that are unseen in the training phase. Although maximum entropy needs a training dataset to adjust its parameters, the proposed approach can normalize data unseen in the training set. The principle of maximum entropy emphasizes incorporating the available features into a uniform model. First, we generate a set of normalized candidates for each out-of-vocabulary word based on lexical, phonemic, and morphophonemic similarities. Then, three different probability scores are calculated for each candidate using positional indexing, a dependency-based frequency feature, and a language model. After the optimal values of the model parameters are obtained in a training phase, the model can calculate the final probability value for candidates. The approach achieved an 83.12 BLEU score in testing on 2,000 Tweets. Our experimental results show that the maximum entropy approach significantly outperforms previous well-known normalization approaches.
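The sketch below illustrates the candidate-generation-plus-scoring scheme in miniature: lexically similar in-vocabulary candidates are retrieved and ranked by a log-linear combination of feature scores. The features and weights are toy stand-ins; the actual model learns its weights by maximum entropy training on annotated Tweets.

```python
import difflib

LEXICON = ["tomorrow", "tonight", "together", "morrow"]  # toy vocabulary

def candidates(oov_word, n=5):
    """Lexically similar in-vocabulary candidates (phonemic and
    morphophonemic similarity would be added analogously)."""
    return difflib.get_close_matches(oov_word, LEXICON, n=n, cutoff=0.5)

def loglinear_score(oov_word, cand, weights=(1.0, 1.0)):
    """Weighted combination of two toy feature scores."""
    lexical = difflib.SequenceMatcher(None, oov_word, cand).ratio()
    length = 1.0 - abs(len(oov_word) - len(cand)) / max(len(oov_word), len(cand))
    return weights[0] * lexical + weights[1] * length

oov = "tmrw"
ranked = sorted(candidates(oov), key=lambda c: loglinear_score(oov, c), reverse=True)
print(ranked)  # best normalization candidate first
```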
Text normalization for Argentinian Spanish
Thesis (Lic. in Computer Science), Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía, Física y Computación, 2018. Nowadays, the amount of data consumed and generated by a single person is enormous, and it keeps growing because anyone can generate data. This brings with it an increase in noise, which is why social network text is characteristically noisy, a problem whenever one needs to work with it. Here, we built a corpus of tweets in Argentinian Spanish. We collected a large set of tweets and selected them manually to obtain a representative sample of common normalization errors. We then defined clear and explicit correction criteria and used them for the manual annotation of the corpus. In addition, we present a text normalization system that works on tweets. Given a set of tweets as input, the system detects and corrects the words that need to be standardized. To do so, it uses a series of components such as lexical resources, rule-based systems, and language models. Finally, we ran experiments with different corpora, among them our own, and with different system configurations to understand the advantages and disadvantages of each.
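A minimal sketch of the pipeline architecture described above, assuming a toy lexicon, a small rewrite-rule table, and a language-model fallback in place of the thesis's actual resources:

```python
LEXICON = {"que", "bueno", "hola", "todo", "bien", "tal"}   # toy lexicon
RULES = {"q": "que", "tb": "también", "xq": "porque"}       # assumed rewrite rules

def normalize_token(token, lm_candidates=lambda t: [t]):
    """Lexicon check, then rules, then a language-model-style fallback."""
    if token.lower() in LEXICON:   # in-vocabulary: leave untouched
        return token
    if token.lower() in RULES:     # rule component fires next
        return RULES[token.lower()]
    return lm_candidates(token)[0] # first LM candidate (identity by default)

tweet = "hola q tal todo bien"
print(" ".join(normalize_token(t) for t in tweet.split()))
# -> "hola que tal todo bien"
```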