A matter of words: NLP for quality evaluation of Wikipedia medical articles
Automatic quality evaluation of Web information is a task with many fields of
application and of great relevance, especially in critical domains like the
medical one. We start from the intuition that the quality of the content of
medical Web documents is affected by domain-specific features: the usage of a
specific vocabulary (Domain Informativeness), the adoption of specific codes
(like those used in the infoboxes of Wikipedia articles), and the type of
document (e.g., historical and technical ones). In this paper, we
propose to leverage specific domain features to improve the results of the
evaluation of Wikipedia medical articles. In particular, we evaluate the
articles adopting an "actionable" model, whose features are related to the
content of the articles, so that the model can also directly suggest strategies
for improving the quality of a given article. We rely on Natural Language Processing
(NLP) and dictionary-based techniques in order to extract the bio-medical
concepts in a text. We prove the effectiveness of our approach by classifying
the medical articles of the Wikipedia Medicine Portal, which have been
previously manually labeled by the Wiki Project team. The results of our
experiments confirm that, by considering domain-oriented features, it is
possible to obtain appreciable improvements over existing solutions,
especially for those articles that other approaches classified less accurately.
Besides being interesting in their own right, these results call for further
research on domain-specific features suitable for Web data quality
assessment.
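The Domain Informativeness feature described above amounts to a dictionary lookup: count how many tokens of an article belong to a domain vocabulary. A minimal illustration, assuming a hypothetical miniature term list (the paper relies on full bio-medical dictionaries and NLP pipelines):

```python
import re

# Hypothetical miniature biomedical dictionary; the paper uses full
# dictionary-based resources to recognize bio-medical concepts.
BIOMEDICAL_TERMS = {"diabetes", "insulin", "glucose", "pancreas", "hyperglycemia"}

def domain_informativeness(text):
    """Fraction of tokens that belong to the domain vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in BIOMEDICAL_TERMS)
    return hits / len(tokens)

score = domain_informativeness(
    "Insulin regulates glucose levels; diabetes impairs this mechanism."
)
```

A higher score marks text that uses the domain's vocabulary densely, which the model can turn into an actionable suggestion (e.g., the article lacks domain terminology).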
NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets
Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. These datasets typically represent a domain (a technical field such as automotive) and an application (e.g., maintenance). The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this research, we focus on the problem of technical issue pre-processing, clustering, and classification by considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open source library including logbook datasets from various domains and a pre-processing pipeline to clean unstructured datasets. Additionally, we adapted a feedback loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. We further investigated the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches including synonym replacement, random swap, and random deletion to address the issue of data scarcity in technical logbooks.
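The feedback-loop resampling strategy is only summarized above; a minimal sketch of one plausible reading, in which per-class error rates from the previous training round weight the sampling of the next round (the function name and weighting scheme are illustrative assumptions, not MaintNet's actual API):

```python
import random

def resample_by_error(examples, labels, error_rate, k, rng=random.Random(0)):
    """Resample k training examples, weighting each example by the error
    rate of its class in the previous round: classes the model gets wrong
    more often are sampled more (hypothetical weighting scheme)."""
    weights = [error_rate[y] for y in labels]
    idx = rng.choices(range(len(examples)), weights=weights, k=k)
    return [examples[i] for i in idx], [labels[i] for i in idx]

examples = ["engine overheating", "door latch loose", "engine stall", "hvac noise"]
labels   = ["engine", "door", "engine", "hvac"]
error_rate = {"engine": 0.1, "door": 0.8, "hvac": 0.4}  # from previous epoch
xs, ys = resample_by_error(examples, labels, error_rate, k=6)
```

After each epoch the error rates would be re-estimated and the training set re-drawn, closing the feedback loop.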
Automatic extraction of definitions
Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2014.
This doctoral research work provides a set of methods and heuristics for
building a definition extractor or for fine-tuning an existing one. In order
to develop and test the architecture, a generic definition extractor for the
Portuguese language was built. Furthermore, the methods were tested in the
construction of extractors for two languages other than Portuguese,
namely English and, less extensively, Dutch. The approach presented
in this work makes the proposed extractor completely different in nature
from the other works in the field. As a matter of fact,
most systems that automatically extract definitions have been constructed
with a specific corpus on a specific topic in mind, and are based on
the manual construction of a set of rules or patterns capable of identifying
a definition in a text.
This research focused on three types of definitions, characterized by the connector
between the defined term and its description. The strategy adopted
can be seen as a "divide and conquer" approach. Unlike other
works representing the state of the art, specific heuristics were developed in
order to deal with the different types of definitions, namely copula, verbal, and
punctuation definitions.
We used a different methodology for each type of definition: we propose
rule-based methods to extract punctuation definitions, machine
learning with sampling algorithms for copula definitions, and machine learning
with a method to increase the number of positive examples for verbal
definitions. This architecture is justified by the increasing linguistic complexity
that characterizes the different types of definitions. Numerous experiments
have led to the conclusion that punctuation definitions are
easily described using a set of rules. These rules can be easily adapted to
the relevant context and translated into other languages. However,
dealing with the other two definition types requires more than the exclusive
use of rules; good performance calls for more advanced methods, in
particular a machine learning based approach.
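As an illustration of why punctuation definitions suit rule-based extraction, a single hypothetical pattern (a term, a colon or dash connector, then a description) can be written as a regular expression; the thesis's actual rule set is richer than this one pattern:

```python
import re

# Hypothetical rule: a capitalized term, a colon or dash connector, and a
# lowercase description -- one of many patterns such a rule set could hold.
PUNCT_DEF = re.compile(r"^(?P<term>[A-Z][\w -]*?)\s*[:\u2013-]\s*(?P<description>[a-z].+)$")

def extract_punctuation_definition(sentence):
    """Return (term, description) if the sentence matches the rule, else None."""
    m = PUNCT_DEF.match(sentence.strip())
    return (m.group("term"), m.group("description")) if m else None

result = extract_punctuation_definition(
    "Lexicography: the practice of compiling dictionaries."
)
```

Adapting such a rule to another language mostly means swapping the connector inventory, which is why rules transfer well for this definition type.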
Unlike other similar systems, which were built with a specific
corpus or a specific domain in mind, the one reported here is meant to obtain good
results regardless of the domain or context. All the decisions made in the
construction of the definition extractor take into consideration this central
objective.
This doctoral work aims to provide a set of methods
and heuristics for building a definition extractor, or for improving
the performance of an existing system when it is used with a specific
corpus. In order to develop and test the architecture, a generic
definition extractor for the Portuguese language was built. In addition,
the methods were tested in the construction of an extractor for a language
other than Portuguese, namely English; some heuristics were also
tested with a third language, Dutch. The approach
presented in this work makes the proposed extractor completely
different from the other works in the area. It is
a fact that most automatic definition-extraction systems
have been built around a specific corpus with a well-determined
topic, and are based on the manual construction of a set of rules
or patterns capable of identifying a definition in a text from a specific
domain.
This research focused on three types of definitions, characterized by the
link between the defined term and its description. The strategy adopted can
be seen as a "divide and conquer" approach. Unlike
other research in this area, specific heuristics were developed
in order to deal with the different typologies of definitions, namely copula,
verbal, and punctuation definitions.
The present work proposes a different methodology for each
type of definition: rule-based methods
to extract punctuation definitions, machine learning
with sampling algorithms for copula definitions, and machine learning
with a method to automatically increase the number of
positive examples for verbal definitions. This architecture is justified
by the increasing linguistic complexity that characterizes the different types of
definitions. Numerous experiments have led to the conclusion that punctuation
definitions are easily described using a set of rules. These
rules can be easily adapted to the relevant context and translated
into other languages. However, in order to deal with the other two types of
definitions, the exclusive use of rules is not enough to obtain good
performance, and more advanced methods are needed, in particular those
based on machine learning.
Unlike other similar systems, which were built with
a specific corpus or domain in mind, the system presented here
was developed so as to obtain good results regardless of the
domain or the language. All the decisions made in the construction of the
definition extractor took this central objective into consideration.
Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/36732/2007
Effect of Text Processing Steps on Twitter Sentiment Classification using Word Embedding
Processing of raw text is the crucial first step in text classification and
sentiment analysis. However, text processing steps are often performed using
off-the-shelf routines and pre-built word dictionaries without optimizing for
domain, application, and context. This paper investigates the effect of seven
text processing scenarios on a particular text domain (Twitter) and application
(sentiment classification). Skip-gram-based word embeddings are developed to
include Twitter colloquial words, emojis, and hashtag keywords that are often
removed for being unavailable in conventional literature corpora. Our
experiments reveal negative effects on sentiment classification of two common
text processing steps: 1) stop word removal and 2) averaging of word vectors to
represent individual tweets. New effective steps for 1) including non-ASCII
emoji characters, 2) measuring word importance from word embedding, 3)
aggregating word vectors into a tweet embedding, and 4) developing a linearly
separable feature space have been proposed to optimize the sentiment
classification pipeline. The best combination of text processing steps yields
the highest average area under the curve (AUC) of 88.4 (+/-0.4) in classifying
14,640 tweets with three sentiment labels. Word selection from context-driven
word embedding reveals that only the ten most important words in Tweets
cumulatively yield over 98% of the maximum accuracy. Results demonstrate a
means for data-driven selection of important words in tweet classification as
opposed to using pre-built word dictionaries. The proposed tweet embedding is
robust to and alleviates the need for several text processing steps.
Comment: 14 pages, 3 figures, 7 tables
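The proposed replacement for plain averaging, aggregating word vectors into a tweet embedding with importance weights, might be sketched as follows; the norm-based weight and the toy vocabulary are assumptions for illustration, not the paper's exact weighting:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy word-embedding table (in the paper: skip-gram vectors trained on tweets).
vocab = {"great": rng.normal(size=8),
         "service": rng.normal(size=8),
         "the": 0.05 * rng.normal(size=8)}  # function words: small norms

def tweet_embedding(tokens, vocab):
    """Aggregate word vectors with importance weights instead of a plain
    average. Hypothetical weight: the vector's L2 norm, so that content
    words dominate over low-norm function words."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    if not vecs:
        return np.zeros(8)
    weights = np.array([np.linalg.norm(v) for v in vecs])
    return np.average(vecs, axis=0, weights=weights)

emb = tweet_embedding(["the", "great", "service"], vocab)
```

With this kind of weighting, dropping stop words becomes unnecessary: low-importance words simply contribute little, consistent with the paper's finding that stop-word removal hurts classification.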
Stock price change prediction using news text mining
Along with the advent of the Internet as a new way of propagating news in a digital format came the need to understand and transform this data into information. This work presents a computational framework that aims to predict the changes of stock prices along the day, given the occurrence of news articles related to the companies listed in the Dow Jones Index. For this task, an automated process that gathers, cleans, labels, classifies, and simulates investments was developed. This process integrates existing data mining and text algorithms with the proposal of new techniques for the alignment between news articles and stock prices, pre-processing, and classifier ensembles. The results of the experiments, in terms of classification measures and the cumulative return obtained through investment simulation, outperformed the other results found after an extensive review of the related literature. This work also argues that accuracy as a classification measure, together with incorrect use of the cross-validation technique, has little to contribute in terms of investment recommendation for the financial market. Altogether, the developed methodology and results contribute to the state of the art in this emerging research field, demonstrating that the correct use of text mining techniques is an applicable alternative to predict stock price movements in the financial market.
With the advent of the Internet as a means of propagating news in digital format came the need to understand and transform this data into information. This work presents a computational process for predicting stock prices throughout the day, given the occurrence of news related to the companies listed in the Dow Jones Index. For this task, an automated process that collects, cleans, labels, classifies, and simulates investments was developed.
This process integrates existing data and text mining algorithms with new techniques for the alignment between news and stock prices, pre-processing, and classifier ensembles. The results of the experiments, in terms of classification measures and the cumulative return obtained through investment simulation, were higher than other results found after an extensive review of the literature. This work also argues that accuracy as a classification measure, and the incorrect use of the cross-validation technique, have very little to contribute in terms of investment recommendation in the financial market. Altogether, the developed methodology and results contribute to the state of the art in this emerging research area, demonstrating that the correct use of data and text mining techniques is an applicable alternative for predicting movements in the financial market.
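One plausible form of the news-price alignment step, labeling each article by the relative price change over a fixed window after publication, can be sketched as follows; the horizon, thresholds, and function names are illustrative assumptions, not the thesis's actual alignment rules:

```python
from datetime import datetime, timedelta

def label_news(news_time, prices, horizon_minutes=30):
    """Label a news item UP/DOWN/STABLE by the relative price change
    between publication time and `horizon_minutes` later (hypothetical
    alignment rule; `prices` is a list of (timestamp, price) quotes)."""
    def price_at(t):
        # Last known quote at or before t.
        past = [(ts, p) for ts, p in prices if ts <= t]
        return max(past)[1] if past else None

    p0 = price_at(news_time)
    p1 = price_at(news_time + timedelta(minutes=horizon_minutes))
    if p0 is None or p1 is None:
        return None
    change = (p1 - p0) / p0
    if change > 0.002:           # assumed +0.2% threshold
        return "UP"
    if change < -0.002:          # assumed -0.2% threshold
        return "DOWN"
    return "STABLE"

prices = [(datetime(2024, 1, 2, 10, 0), 100.0),
          (datetime(2024, 1, 2, 10, 30), 101.0)]
label = label_news(datetime(2024, 1, 2, 10, 0), prices)
```

These labels would then feed the classifier ensemble, and the simulated investments would act on the predicted UP/DOWN signals.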
An Improved Sentiment Classification Approach for Measuring User Satisfaction toward Governmental Services’ Mobile Apps Using Machine Learning Methods with Feature Engineering and SMOTE Technique
Analyzing the sentiment of Arabic texts is still a big research challenge due to the special characteristics and complexity of the Arabic language. Few studies have been conducted on Arabic sentiment analysis (ASA) compared to English or other Latin languages. In addition, most of the existing studies on ASA analyzed datasets collected from Twitter. However, little attention was given to the huge amounts of reviews for governmental or commercial mobile applications on Google Play or the App Store. For instance, the government of Saudi Arabia developed several mobile applications in healthcare, education, and other sectors as a response to the COVID-19 pandemic. To address this gap, this paper aims to analyze the users’ opinions of six applications in the healthcare sector. An improved sentiment classification approach was proposed for measuring user satisfaction toward governmental services’ mobile apps using machine learning models with different preprocessing methods. The Arb-AppsReview dataset was collected from the reviews of these six mobile applications available on Google Play and the App Store, which includes 51k reviews. Then, several feature engineering approaches were applied, including the Bing Liu, AFINN, and MPQA subjectivity lexicons; bag of words (BoW); term frequency-inverse document frequency (TF-IDF); and the pre-trained Google Word2Vec embeddings. Additionally, the SMOTE technique was applied as a balancing technique on this dataset. Then, five ML models were applied to classify the sentiment opinions. The experimental results showed that the highest accuracy score (94.38%) was obtained by applying a support vector machine (SVM) using the SMOTE technique with all concatenated features.
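SMOTE balances a dataset by interpolating synthetic minority samples between a minority point and one of its nearest minority neighbours. A minimal self-contained sketch of the idea (a real pipeline would use the imbalanced-learn SMOTE implementation rather than this toy version):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=np.random.default_rng(0)):
    """Minimal SMOTE-style oversampling: each synthetic point lies on the
    segment between a random minority sample and one of its k nearest
    minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority-class feature vectors (e.g., TF-IDF rows for the rare label).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_oversample(X_min, n_new=5)
```

The synthetic rows are appended to the training set so the SVM sees a balanced class distribution, which is what lifts the reported accuracy.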
Machine Learning for Biomedical Literature Triage
This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine, and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain-relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.
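The under-sampling step can be illustrated by its simplest variant, randomly reducing every class to the size of the rarest one; a hypothetical sketch, not the paper's exact sampling factors:

```python
import random

def undersample(docs, labels, rng=random.Random(0)):
    """Randomly under-sample every class down to the size of the rarest
    class, so the triage classifier trains on a balanced set."""
    by_class = {}
    for doc, lab in zip(docs, labels):
        by_class.setdefault(lab, []).append(doc)
    n = min(len(v) for v in by_class.values())
    out = []
    for lab, items in sorted(by_class.items()):
        for doc in rng.sample(items, n):
            out.append((doc, lab))
    return out

docs = ["a", "b", "c", "d", "e", "f"]
labels = ["keep", "keep", "keep", "keep", "keep", "triage"]
balanced = undersample(docs, labels)
```

In practice a sampling factor would control how far the majority class is reduced, which is one of the parameters the paper experiments with.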
Prediction models for solitary pulmonary nodules based on curvelet textural features and clinical parameters
Lung cancer, one of the leading causes of cancer-related deaths, usually appears as solitary pulmonary nodules (SPNs), which are hard to diagnose with the naked eye. In this paper, curvelet-based textural features and clinical parameters are used with three prediction models [a multilevel model, a least absolute shrinkage and selection operator (LASSO) regression method, and a support vector machine (SVM)] to improve the diagnosis of benign and malignant SPNs. Dimensionality reduction of the original curvelet-based textural features was achieved using principal component analysis. In addition, non-conditional logistic regression was used to find clinical predictors among demographic parameters and morphological features. The results showed that, combined with 11 clinical predictors, the accuracy rates using 12 principal components were higher than those using the original curvelet-based textural features. To evaluate the models, 10-fold cross validation and back substitution were applied. The results obtained, respectively, were 0.8549 and 0.9221 for the LASSO method, 0.9443 and 0.9831 for SVM, and 0.8722 and 0.9722 for the multilevel model. All in all, it was found that, using curvelet-based textural features after dimensionality reduction and using clinical predictors, the highest accuracy rate was achieved with SVM. The method may be used as an auxiliary tool to differentiate between benign and malignant SPNs in CT images.
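The PCA step that compresses the curvelet texture features before combining them with the clinical predictors can be sketched with a centred SVD; the data here are random stand-ins, with only the dimensions (12 principal components, 11 clinical predictors) taken from the abstract:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project feature matrix X onto its top principal components
    (centred SVD), as used to compress the curvelet texture features."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
texture = rng.normal(size=(50, 30))    # stand-in curvelet texture features
clinical = rng.normal(size=(50, 11))   # 11 clinical predictors (per abstract)
components = pca_reduce(texture, n_components=12)
features = np.hstack([components, clinical])  # model input: 12 PCs + clinical
```

The concatenated matrix is what the LASSO, SVM, and multilevel models would receive as input.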