
    Using named entity recognition for relevance detection in social network messages

    The continuous growth of social networks in the past decade has led to massive amounts of information being generated on a daily basis. While much of this information is merely personal or simply irrelevant to a general audience, relevant news being transmitted through social networks is an increasingly common phenomenon, and detecting such news automatically has therefore become a field of interest and active research. The contribution of the present thesis consists of studying the importance of named entities in the task of relevance detection. With that in mind, the goal of this work was twofold: 1) to implement or find the best named entity recognition (NER) tools for social media texts, and 2) to analyze the importance of entities extracted from posts as features for relevance detection with machine learning. Although well-known named entity recognition tools already exist, most state-of-the-art tools show a significant decrease in performance when tested on social media texts rather than news media texts. This is mainly due to the informal character of social media texts: the absence of context, the lack of proper punctuation, wrong capitalization, the use of characters to represent emoticons, spelling errors, and even the use of different languages in the same text. To address these problems, four state-of-the-art toolkits - Stanford NLP, GATE with TwitIE, Twitter NLP tools, and OpenNLP - were tested on social media datasets. In addition, we examined how differently these toolkits predicted named entities, in terms of their precision and recall for three entity types (Person, Location, Organization), and how they could complement each other in order to achieve a combined performance superior to each individual one, creating an ensemble of toolkits.
    Following the extraction of entities using the developed ensemble, different features were generated based on these entities. These features included the number of persons, locations, and organizations mentioned in a post, and statistics retrieved from The Guardian's open API; they were also combined with word-embedding features. Multiple machine learning models were then trained on a manually annotated dataset of tweets. The performances obtained with different combinations of selected features, ML algorithms, hyperparameters, and datasets were analyzed. Our results showed that using an ensemble of toolkits can improve the recognition of specific entity types, depending on the voting criteria, and can even improve the overall average performance across the entity types Person, Location, and Organization. The relevance analysis showed that named entities can indeed be useful for relevance detection: used alone they achieved up to 74% AUC, and combined with other features such as word embeddings they achieved a maximum AUC of 94%, a 2.6% improvement over word embeddings alone.
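    The voting idea behind the toolkit ensemble can be sketched as follows. This is an illustrative majority-vote combiner over hypothetical toolkit outputs, not the thesis's actual implementation; the toolkit names and the `(token, position)` span representation are assumptions for the example:

    ```python
    from collections import Counter

    def ensemble_vote(predictions, min_votes=2):
        """Combine entity predictions from several NER toolkits by voting.

        predictions: one dict per toolkit, mapping an entity span to its
        predicted type. An entity is kept only when at least `min_votes`
        toolkits agree on both the span and the type.
        """
        votes = Counter()
        for toolkit_pred in predictions:
            for span, etype in toolkit_pred.items():
                votes[(span, etype)] += 1
        return {span: etype
                for (span, etype), n in votes.items() if n >= min_votes}

    # Hypothetical outputs of three toolkits on the same tweet,
    # with spans given as (token, token_index)
    stanford = {("Obama", 0): "Person", ("Lisbon", 3): "Location"}
    twitie   = {("Obama", 0): "Person", ("Lisbon", 3): "Organization"}
    opennlp  = {("Obama", 0): "Person"}

    combined = ensemble_vote([stanford, twitie, opennlp], min_votes=2)
    # "Obama"/Person gets three votes and is kept; the two Lisbon labels
    # disagree, so neither reaches the two-vote threshold.
    ```

    Raising `min_votes` trades recall for precision, which is one way the voting criteria can favor different entity types.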

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
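    To make one of these challenges concrete, missing data is often first addressed with simple imputation before omics blocks are integrated. The sketch below shows column-wise mean imputation on a toy samples-by-features matrix; it is a common baseline, not a method prescribed by the review, and the data values are invented:

    ```python
    def mean_impute(matrix):
        """Column-wise mean imputation for a samples x features matrix
        where missing entries are None. Each missing value is replaced
        by the mean of the observed values in its column."""
        n_cols = len(matrix[0])
        means = []
        for j in range(n_cols):
            observed = [row[j] for row in matrix if row[j] is not None]
            means.append(sum(observed) / len(observed))
        return [[row[j] if row[j] is not None else means[j]
                 for j in range(n_cols)]
                for row in matrix]

    # Toy "expression" block: 3 samples x 2 genes, two missing values
    data = [[1.0, 4.0], [None, 6.0], [3.0, None]]
    imputed = mean_impute(data)
    # → [[1.0, 4.0], [2.0, 6.0], [3.0, 5.0]]
    ```

    In real multi-omics settings each block typically has a different missingness pattern, which is why imputation is usually done per block before integration.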

    Physical Human Activity Recognition Using Machine Learning Algorithms

    With the rise of ubiquitous computing, the desire to make everyday life smarter and easier with technology is increasing. Human activity recognition (HAR) is the outcome of a similar motive. HAR enables a wide range of pervasive computing applications by recognizing the activity performed by a user. In order to contribute to the multifaceted applications that HAR can offer, predicting the right activity is of utmost importance. Even simple issues, such as incorrect data manipulation or the use of a wrong algorithm for prediction, can hinder the performance of a HAR system. This study performs HAR using two dimensionality reduction techniques followed by five different supervised machine learning algorithms, with the aim of achieving better predictive accuracy than the existing benchmark research. Correlation analysis (CA) and principal component analysis (PCA) were used for feature reduction, resulting in 173 and 100 features respectively. Decision Tree, K-Nearest Neighbor, Naive Bayes, Multinomial Logistic Regression, and Artificial Neural Network algorithms were used to perform the classification task. Repeated random sub-sampling cross-validation was used for evaluation, followed by a Wilcoxon signed-rank test to assess the significance of the results. The ANN performed the best classification, achieving 97% accuracy with CA as the feature reduction technique. The KNN and LR classifiers also provided satisfactory results, with predictive performance greater than the benchmark test. However, the Decision Tree and Naive Bayes algorithms did not prove efficient.
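    The correlation-analysis step can be illustrated with a greedy filter that keeps a feature only if it is not strongly correlated with any feature kept so far. This is a sketch of the general technique, not the study's exact procedure, and the threshold and toy sensor values are assumptions:

    ```python
    import math

    def pearson(x, y):
        """Pearson correlation coefficient between two equal-length lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    def drop_correlated(features, threshold=0.9):
        """Greedy correlation filter: keep feature j only if its absolute
        correlation with every already-kept feature is below threshold.
        features: list of columns (each a list of per-sample values).
        Returns the indices of the kept columns."""
        kept = []
        for j, col in enumerate(features):
            if all(abs(pearson(col, features[k])) < threshold for k in kept):
                kept.append(j)
        return kept

    # Toy sensor features: f2 is nearly collinear with f1, f3 is not
    f1 = [1.0, 2.0, 3.0, 4.0]
    f2 = [2.1, 4.0, 6.2, 8.1]   # roughly 2 * f1
    f3 = [5.0, 1.0, 4.0, 2.0]
    kept = drop_correlated([f1, f2, f3])
    # f2 is filtered out, so only the indices of f1 and f3 survive
    ```

    A filter of this kind reduces redundancy while keeping the original, interpretable features, in contrast to PCA, which projects onto new axes.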