16 research outputs found

    Comparing the Performance of Different NLP Toolkits in Formal and Social Media Text

    Get PDF
    Nowadays, many toolkits are available for performing common natural language processing tasks, enabling the development of more powerful applications without having to start from scratch. For English, in fact, there is no need to develop tools such as tokenizers, part-of-speech (POS) taggers, chunkers or named entity recognizers (NER). The current challenge is to select which one to use out of the range of available tools. This choice may depend on several aspects, including the kind and source of the text, whose level of formality may influence the performance of such tools. In this paper, we assess a range of natural language processing toolkits with their default configuration while performing a set of standard tasks (e.g. tokenization, POS tagging, chunking and NER) on popular datasets that cover newspaper and social network text. The obtained results are analyzed and, while we could not single out one toolkit, this exercise was very helpful in narrowing our choice.
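
    As an illustration of the kind of default-configuration comparison described above (a minimal sketch, not the paper's experimental setup), the snippet below runs two widely used Python toolkits, NLTK and spaCy, over the same informal sentence and prints their tokens, POS tags and entities; the sentence and the model name are placeholders chosen for this example.

        import nltk
        import spacy

        # One informal, social-media-style sentence (example input, not from the paper's datasets).
        text = "@alice omg the new phone frm Apple is sooo good, buying it in London tmrw :)"

        # --- NLTK with its default models (resources must be downloaded once) ---
        nltk.download("punkt", quiet=True)
        nltk.download("averaged_perceptron_tagger", quiet=True)
        tokens = nltk.word_tokenize(text)
        print("NLTK tokens:", tokens)
        print("NLTK POS tags:", nltk.pos_tag(tokens))

        # --- spaCy with its small default English pipeline ---
        nlp = spacy.load("en_core_web_sm")   # assumes this model has been installed
        doc = nlp(text)
        print("spaCy tokens:", [t.text for t in doc])
        print("spaCy POS tags:", [(t.text, t.tag_) for t in doc])
        print("spaCy entities:", [(e.text, e.label_) for e in doc.ents])

    Differences in how each toolkit handles tokens such as "@alice" or ":)" are exactly the formal-versus-informal behaviour the comparison above is concerned with.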

    High Accuracy Location Information Extraction from Social Network Texts Using Natural Language Processing

    Full text link
    Terrorism has become a worldwide plague with severe consequences for the development of nations. Besides killing innocent people daily and preventing educational activities from taking place, terrorism also hinders economic growth. Machine Learning (ML) and Natural Language Processing (NLP) can contribute to fighting terrorism by predicting future terrorist attacks in real time, provided accurate data is available. This paper is part of a research project that uses text from social networks to extract the information needed to build an adequate dataset for terrorist attack prediction. We collected a set of 3000 social network texts about terrorism in Burkina Faso and used a subset to experiment with existing NLP solutions. The experiment reveals that existing solutions have poor accuracy for location recognition, a shortcoming that our solution addresses. We will extend the solution to extract date and action information to achieve the project's goal.
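
    The sketch below shows the kind of off-the-shelf location recognition the paper evaluates existing solutions against; it is not the authors' improved system, and the post text and model name are invented for illustration.

        import spacy

        # Off-the-shelf NER baseline of the kind evaluated above (not the authors' solution).
        nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

        post = "An attack was reported near Ouagadougou, Burkina Faso, on Friday evening."
        doc = nlp(post)

        # Keep only entities tagged as geopolitical entities or locations.
        locations = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
        print(locations)  # e.g. ['Ouagadougou', 'Burkina Faso']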

    Understanding human decision-making during production ramp-up using natural language processing

    Get PDF
    Ramping up a manufacturing system from being just assembled to full-volume production capacity is a time-consuming and error-prone task. The full behaviour of a system is difficult to predict in advance, and disruptions frequently occur that need to be resolved before the required performance targets are reached. Information about the experienced faults and issues might be recorded, but usually no record is kept of decisions concerning the necessary physical and process adjustments. Having these data could help uncover significant insights into the ramp-up process and reduce the effort needed to bring the system to its required state. This paper proposes Natural Language Processing (NLP) to interpret human operator comments collected during ramp-up. Recurring patterns in this feedback could be used to gain a deeper understanding of the cause-and-effect relationship between the system state and the corrective action an operator applied. A manual dispensing experiment was conducted in which human assessments in the form of unstructured free-form text were gathered. These data were used as input for an initial NLP analysis, and preliminary results using the NLTK library are presented. The outcomes give first insights into the topics participants considered and provide valuable knowledge for learning from this experience in the future.
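
    A minimal sketch of the kind of preliminary NLTK analysis described above: tokenize free-form operator comments, drop stopwords, and count the most frequent terms to surface recurring issues. The comments are invented examples standing in for the experiment's data.

        import nltk
        from nltk.corpus import stopwords
        from nltk.probability import FreqDist

        nltk.download("punkt", quiet=True)
        nltk.download("stopwords", quiet=True)

        # Invented operator comments standing in for the free-form ramp-up feedback.
        comments = [
            "Dispenser nozzle clogged again, cleaned it and reduced feed pressure.",
            "Reduced conveyor speed because parts were misaligned at the fixture.",
            "Nozzle pressure too high, adjusted the regulator and restarted the run.",
        ]

        stop_words = set(stopwords.words("english"))
        tokens = [
            word.lower()
            for comment in comments
            for word in nltk.word_tokenize(comment)
            if word.isalpha() and word.lower() not in stop_words
        ]

        # Frequent terms hint at recurring issues (e.g. 'nozzle', 'pressure', 'reduced').
        print(FreqDist(tokens).most_common(5))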

    Different valuable tools for Arabic sentiment analysis: a comparative evaluation

    Get PDF
    Arabic natural language processing (ANLP) is a subfield of artificial intelligence (AI) that aims to build applications for the Arabic language, such as Arabic sentiment analysis (ASA): the task of classifying the feelings and emotions expressed in a text in order to determine the attitude of the writer (neutral, negative or positive). When working on ASA, researchers often use various tools without explaining the reasons behind their choice, or they select a set of libraries according to their familiarity with a specific programming language. Because of the abundance of their libraries in the ANLP field, and especially in ASA, our research work relies on the Java and Python programming languages. This paper carries out an in-depth comparative evaluation of different valuable Python and Java libraries to determine the most useful ones for Arabic sentiment analysis. Based on a large body of influential work in the ASA domain, we conclude that the NLTK, Gensim and TextBlob libraries are the most useful for ASA tasks in Python. Regarding Java ASA libraries, we conclude that the Weka and CoreNLP tools are the most widely used and give strong results in this research domain.
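
    As a sketch of the polarity-classification interface offered by one of the Python libraries named above, the snippet below uses TextBlob on an English sentence. TextBlob's default sentiment model is English-oriented, so applying it to Arabic text would need Arabic-specific resources or a translation step; that caveat and the example sentence are assumptions for illustration, not claims from the paper.

        from textblob import TextBlob

        # Illustration of the polarity interface only; TextBlob's default sentiment
        # model targets English, so real ASA work would need Arabic-specific
        # resources or a translation step (an assumption, not from the paper).
        review = TextBlob("The service was excellent and the staff were very friendly.")
        polarity = review.sentiment.polarity          # value in [-1.0, 1.0]

        label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
        print(polarity, label)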

    Social media data: opportunities and challenges of using social media data in the context of scientific research

    Get PDF
    This work demonstrates the opportunities and challenges of social media data analysis. Exploiting its potential requires an interdisciplinary approach. While the publications presented here could still be realised by a single person, it becomes clear that involving the different disciplines demands closer collaboration. Extending the agent-based simulation with mean field game theory, for example, requires advanced knowledge of physics; sentiment analysis requires collaboration between linguists and computer scientists; and cluster analysis calls for cooperation between data analysts and sociologists. To realise the potential of the analysis results, economists should be involved, as they act as drivers and value creators. Closer collaboration is therefore to be expected in the future, leading to more complex forms of cooperation. This in turn requires concepts and frameworks that make such collaboration transparent and comprehensible.

    A methodology for the resolution of cashtag collisions on Twitter – A natural language processing & data fusion approach

    Get PDF
    Investors use social media such as Twitter as a means of sharing news about financial stocks listed on international stock exchanges. Company ticker symbols uniquely identify companies listed on stock exchanges and can be embedded within tweets to create clickable hyperlinks referred to as cashtags, allowing investors to associate their tweets with specific companies. The main limitation is that identical ticker symbols are present on exchanges all over the world, so searching Twitter for such a cashtag returns a stream of tweets matching any company to which the cashtag may refer; we refer to this as a cashtag collision. The presence of colliding cashtags can sow confusion for investors seeking news about a specific company, and a resolution of this issue would benefit investors who rely on the speed of tweets for financial information, saving them precious time. We propose a methodology that combines Natural Language Processing and Data Fusion to construct company-specific corpora that aid in the detection and resolution of colliding cashtags, so that tweets can be classified as related to a specific stock exchange or not. Supervised machine learning classifiers are trained twice on each tweet: once on a count vectorisation of the tweet text, and again with the assistance of features contained in the company-specific corpora. We validate the cashtag collision methodology in an experiment involving companies listed on the London Stock Exchange. Results show that several machine learning classifiers benefit from the use of the custom corpora, yielding higher classification accuracy in the prediction and resolution of colliding cashtags.
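
    A minimal sketch of the first of the two training passes described above, using scikit-learn; the paper does not specify this library or classifier, and the tweets, the $SMT cashtag example and the labels are invented for illustration. The second pass would additionally append features derived from the company-specific corpora (e.g. word-overlap counts) to each tweet's vector.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy tweets mentioning the cashtag $SMT, labelled 1 if they concern the
        # London Stock Exchange listing and 0 otherwise (invented examples).
        tweets = [
            "$SMT Scottish Mortgage Investment Trust climbs on the FTSE 100 today",
            "$SMT trust announces new holdings, LSE filing due tomorrow",
            "$SMT SMART Modular Technologies beats NASDAQ earnings estimates",
            "$SMT chipmaker rallies after strong US guidance",
        ]
        labels = [1, 1, 0, 0]

        # First pass: a plain count vectorisation of the tweet text feeding a
        # supervised classifier (logistic regression used here for brevity).
        model = make_pipeline(CountVectorizer(), LogisticRegression())
        model.fit(tweets, labels)

        print(model.predict(["$SMT gains on the London Stock Exchange this morning"]))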

    Using NLP to generate user stories from software specification in natural language

    Get PDF
    Advisor: Andrey Ricardo Pimentel. Master's dissertation, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defended: Curitiba, 27/08/2018. Includes references: p. 80-82. Abstract: The process of eliciting the User Stories required for software development demands both time and dedication, and can involve a lot of rework if conversations with stakeholders do not provide cohesive information. The main problem is that the client often lacks clarity about what he really wants, and in the state of the art there was no approach or tool that helps transpose what the customer wants into user stories. With this in mind, we proposed the UserStoryGen approach and tool to simplify this extensive work, resolving the issue through the use of Natural Language Processing (NLP) techniques with structures suited to this purpose and the standard user story description template to automatically generate user stories. The UserStoryGen approach consists of extracting information such as title, description, main verb, users and systemic entities of user stories from unstructured text. UserStoryGen uses a big-picture text as input for text processing and automated generation of the user stories. The user stories are generated through a RESTful API in JSON format and can be viewed either in that format, if only the RESTful API call is used, or through a graphical interface that shows the results in a table. The implementation of UserStoryGen aimed to automate the laborious process of extracting user stories from text, and it obtained significant results, mainly with industry data. Among the three groups of case studies, the third, which used industry data, obtained the best results, with an average accuracy of 76%, precision of 88.23%, recall of 78.95% and F1 measure of 83.33%. The second group, using texts provided by software engineering specialists, obtained an average accuracy of 73.68%, precision of 85.71% and F1 measure of 82.76%. The first group, using texts from a white paper and a book, had the worst results, with an average accuracy of 60% and an F1 measure of 60.87%. Based on the results obtained with UserStoryGen, we conclude that it is entirely possible to pre-identify and extract the possible user stories from a given text, and the implementation of the proposed approach can also be improved in future work. UserStoryGen is a gain for the Agile development process, eliminating the time spent on User Story identification when the team has a big-picture text or a textual document of features to use as input. Keywords: Natural Language Processing, Automatic Extraction, User Stories, Stanford CoreNLP.
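
    A toy approximation of the extraction step described above, showing how a main verb, user and object can be pulled from a requirement sentence and slotted into the standard user story template. It uses spaCy's dependency parse rather than the authors' Stanford CoreNLP pipeline, and the sentence is invented for illustration.

        import spacy

        # Not the authors' implementation: a sketch using spaCy's dependency parse
        # instead of Stanford CoreNLP.
        nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

        # Invented "big picture" requirement sentence.
        sentence = "The manager approves new vendor accounts before activation."
        doc = nlp(sentence)

        root = next(tok for tok in doc if tok.dep_ == "ROOT")                     # main verb
        subject = next((tok for tok in root.lefts if tok.dep_ == "nsubj"), None)  # user
        obj = next((tok for tok in root.rights if tok.dep_ in ("dobj", "obj")), None)

        if subject is not None and obj is not None:
            # Fill the standard user story template with the extracted pieces.
            story = (f"As a {subject.text.lower()}, I want to {root.lemma_} "
                     f"{' '.join(t.text for t in obj.subtree)}.")
            print(story)  # e.g. "As a manager, I want to approve new vendor accounts."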

    A survey on extremism analysis using natural language processing: definitions, literature review, trends and challenges

    Get PDF
    Extremism has grown into a global problem for society in recent years, especially after the emergence of movements such as jihadism. These and other extremist groups have taken advantage of different channels, such as social media, to spread their ideology, promote their acts and recruit followers. Extremist discourse is therefore reflected in the language used by these groups. Natural language processing (NLP) provides a way of detecting this type of content, and several authors use it to describe and discriminate the discourse held by these groups, with the final objective of detecting and preventing its spread. Following this approach, this survey reviews the contributions of NLP to the field of extremism research, providing the reader with a comprehensive picture of the state of the art in this research area. The content includes a first conceptualization of the term extremism, the elements that compose an extremist discourse and the differences from other terms. After that, a description and comparison of the most frequently used NLP techniques is presented, including how they were applied, the insights they provided, the most frequently used NLP software tools, descriptive and classification applications, and the availability of datasets and data sources for research. Finally, research questions are approached and answered with highlights from the review, and future trends, challenges and directions derived from these highlights are suggested with the aim of stimulating further research in this area. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
