
    Towards the extraction of cross-sentence relations through event extraction and entity coreference

    Cross-sentence relation extraction deals with the extraction of relations beyond the sentence boundary. This thesis focuses on two NLP tasks that are of importance to the successful extraction of cross-sentence relation mentions: event extraction and coreference resolution. The first part of the thesis addresses data sparsity issues in event extraction. We propose a self-training approach for obtaining additional labeled examples for the task. The process starts with a Bi-LSTM event tagger, trained on a small labeled data set, which is used to discover new event instances in a large collection of unstructured text. The high-confidence model predictions are selected to construct a data set of automatically labeled training examples. We present several ways in which the resulting data set can be used to re-train the event tagger in conjunction with the initial labeled data. The best configuration achieves a statistically significant improvement over the baseline on the ACE 2005 test set (macro-F1), as well as in a 10-fold cross-validation (micro- and macro-F1) evaluation. Our error analysis reveals that the augmentation approach is especially beneficial for the classification of the most under-represented event types in the original data set. The second part of the thesis focuses on the problem of coreference resolution. While a certain level of precision can be reached by modeling surface information about entity mentions, their successful resolution often depends on semantic or world knowledge. This thesis investigates an unsupervised source of such knowledge, namely distributed word representations. We present several ways in which word embeddings can be utilized to extract features for a supervised coreference resolver.
    Our evaluation results and error analysis show that each of these features helps improve over the baseline coreference system’s performance, with a statistically significant improvement (CoNLL F1) achieved when the proposed features are used jointly. Moreover, all features lead to a reduction in the number of precision errors in resolving references between common nouns, demonstrating that they successfully incorporate semantic information into the process.
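The self-training loop described above can be sketched in a few lines: a tagger's predictions on unlabeled text are filtered by confidence, and the survivors become pseudo-labeled training data. This is a minimal illustration, not the thesis's implementation; the threshold, the toy model output, and the sentence-level confidence rule are all assumptions.

```python
# Sketch of confidence-based pseudo-label selection for self-training.
# In the thesis the predictions come from a Bi-LSTM event tagger trained
# on ACE 2005; here they are hand-written toy values.

CONFIDENCE_THRESHOLD = 0.95  # assumed cut-off; the real value would be tuned

def select_pseudo_labels(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Keep only sentences whose every token prediction is confident."""
    selected = []
    for tokens, labels, confidences in predictions:
        if min(confidences) >= threshold:
            selected.append((tokens, labels))
    return selected

# Toy tagger output: (tokens, predicted BIO labels, per-token confidence).
unlabeled_predictions = [
    (["Troops", "attacked", "the", "city"], ["O", "B-Attack", "O", "O"],
     [0.99, 0.97, 0.99, 0.98]),
    (["He", "left", "early"], ["O", "B-Transport", "O"],
     [0.99, 0.60, 0.99]),  # low-confidence trigger: discarded
]

augmented = select_pseudo_labels(unlabeled_predictions)
print(len(augmented))  # only the confident sentence survives
```

The selected examples would then be concatenated with the original labeled data for re-training, as the abstract describes.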

    Unsupervised learning of relation detection patterns

    Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge poses a drawback to the portability of the systems, as a change of language, domain or style demands a costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. This thesis proposes incorporating clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this ultimate goal required, first, considering the different strategies in which this combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third, devising pattern learning procedures which incorporate clustering information. By the end of this thesis, we had developed and implemented an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms other comparable approaches in the state of the art.
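The combination of clustering with pattern learning can be illustrated with a toy grouping of relation patterns by lexical overlap. This is a simplified stand-in for the clustering algorithms the thesis develops; the patterns, the Jaccard similarity measure, and the threshold are all invented for illustration.

```python
# Greedy single-pass clustering of toy relation-detection patterns by
# Jaccard similarity, a minimal sketch of grouping similar patterns so
# that supervision can be applied per cluster rather than per pattern.

def jaccard(a, b):
    """Jaccard similarity between two patterns treated as token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_patterns(patterns, threshold=0.5):
    """Attach each pattern to the first cluster whose representative is
    similar enough, else start a new cluster."""
    clusters = []
    for p in patterns:
        for cluster in clusters:
            if jaccard(p, cluster[0]) >= threshold:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters

patterns = [
    ("subj", "acquire", "obj"),
    ("subj", "acquire", "obj", "for"),
    ("subj", "born", "in", "obj"),
]
groups = cluster_patterns(patterns)
print(len(groups))  # the two "acquire" patterns share a cluster
```

Real pattern representations would come from parse structures rather than flat token tuples, but the grouping idea is the same.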

    Chatbot de Suporte para Plataforma de Marketing Multicanal (Support Chatbot for a Multichannel Marketing Platform)

    E-goi is an organization which provides automated multichannel marketing solutions. Given its system’s complexity, it entails a not-so-smooth learning curve, which means that customers sometimes run into difficulties that direct them towards the appropriate Customer Support resources. With an increase in the number of users, these Customer Support requests have become frequent and demand greater availability on Customer Support channels, which become inundated with simple, easily resolvable requests. The organization envisaged automating a significant portion of customer-generated tickets, with the possibility of scaling to deal with other types of operations. This thesis aims to present a long-term solution to that request through the development of a chatbot system, fully integrated with the existing enterprise modules and data sources. To accomplish this, prototypes using several chatbot management and Natural Language Processing frameworks were developed. Their advantages and disadvantages were then weighed, followed by the implementation of the accompanying system and testing of the developed software and the Natural Language Processing results. Although the developed overarching system achieved its designed functionalities, the thesis could not offer a viable solution for the problem at hand, given that the available data could not yield an intent mining model usable in a real-world context.
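At the core of such a chatbot is intent detection: mapping a free-text customer message to a known support intent. The sketch below is a deliberately naive bag-of-words matcher, not any of the frameworks the thesis prototyped; the intents and example phrases are invented.

```python
# Toy intent matcher: score each intent by word overlap between the
# incoming message and that intent's example phrases, and return the
# best-scoring intent (or None when nothing matches).

def tokenize(text):
    return set(text.lower().split())

INTENT_EXAMPLES = {
    "reset_password": ["how do i reset my password", "forgot password"],
    "billing": ["where is my invoice", "billing question about my plan"],
}

def detect_intent(message, examples=INTENT_EXAMPLES):
    """Return the intent whose examples overlap most with the message."""
    words = tokenize(message)
    best_intent, best_score = None, 0
    for intent, phrases in examples.items():
        score = max(len(words & tokenize(p)) for p in phrases)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent

print(detect_intent("I forgot my password"))
```

Production frameworks replace the overlap score with a trained classifier over sentence embeddings, which is where the thesis found the available data insufficient.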

    XSS attack detection based on machine learning

    As the popularity of web-based applications grows, so does the number of individuals who use them. The vulnerabilities of those applications, however, remain a concern. Cross-site scripting is a very prevalent attack that is simple to launch but difficult to defend against, which is why it merits study. The current study focuses on artificial intelligence systems, such as machine learning, which can function without human interaction. As technology advances, the need for maintenance increases, while the systems to be maintained become more complex; this is why machine learning techniques are becoming increasingly important in our daily lives. This study uses supervised machine learning to protect against cross-site scripting, allowing the computer to find an algorithm that can identify vulnerabilities. A large collection of datasets serves as the foundation for this technique. The model is equipped with features extracted from the datasets, allowing it to learn the pattern of such an attack by filtering input using common JavaScript symbols and possible Document Object Model (DOM) syntax. As the research progresses, the best combinations of algorithms for defending against cross-site scripting are identified. The study makes multiple comparisons between different classification methods, on their own or in combination, to determine which performs best.
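The feature-extraction idea above can be sketched by counting occurrences of common JavaScript/DOM tokens in an input string. The token list and the threshold decision rule below are illustrative assumptions, not the study's trained model, which uses supervised classifiers over such count features.

```python
# Minimal sketch: turn a payload into a count vector over suspicious
# JavaScript/DOM tokens, then apply a stand-in threshold rule.

SUSPICIOUS_TOKENS = ["<script", "onerror", "onload", "javascript:",
                     "document.cookie", "eval(", "alert("]

def extract_features(payload):
    """Count vector over the suspicious-token vocabulary."""
    lowered = payload.lower()
    return [lowered.count(tok) for tok in SUSPICIOUS_TOKENS]

def looks_like_xss(payload, min_hits=1):
    # Stand-in decision rule; the study trains classifiers on such features.
    return sum(extract_features(payload)) >= min_hits

print(looks_like_xss('<script>alert(document.cookie)</script>'))  # True
print(looks_like_xss('hello world'))                              # False
```

In the study these count vectors would feed a learned classifier rather than a fixed threshold, letting the model weigh tokens by how predictive they actually are.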

    A Data Driven Approach to Identify Journalistic 5Ws from Text Documents

    Textual understanding is the process of automatically extracting accurate, high-quality information from text. The amount of textual data available from different sources such as news, blogs and social media is growing exponentially. These data encode significant latent information which, if extracted accurately, can be valuable in a variety of applications such as medical report analyses, news understanding and societal studies. Natural language processing techniques are often employed to develop customized algorithms to extract such latent information from text. The journalistic 5Ws refer to the basic information in news articles that describes an event: who, what, when, where and why. Extracting them accurately may facilitate better understanding of many social processes including social unrest, human rights violations, propaganda spread, and population migration. Furthermore, the 5Ws information can be combined with socio-economic and demographic data to analyze the state and trajectory of these processes. In this thesis, a data-driven pipeline has been developed to extract the 5Ws from text using syntactic and semantic cues. First, a classifier is developed to identify articles specifically related to social unrest; it has been trained on a dataset of over 80K news articles. We then use NLP algorithms to generate a set of candidates for the 5Ws, followed by a series of heuristic extraction algorithms. These heuristics leverage specific words and parts of speech, customized for the individual Ws, to compute candidate scores; they draw on the syntactic structure of the document as well as syntactic and semantic representations of individual words and sentences. The scores are then combined and ranked to obtain the best answers to the journalistic 5Ws. The classification accuracy of the algorithms is validated using a manually annotated dataset of news articles.
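The candidate-scoring step can be illustrated with a toy heuristic for the "who" slot: each candidate gets a score from simple cues (part-of-speech tag of the head, sentence position) and the top candidate wins. The candidates, cue weights, and tags below are invented, not the thesis's actual heuristics.

```python
# Toy heuristic scorer for "who" candidates: nominal heads and early
# sentences are favoured, mimicking the word/POS/position cues the
# pipeline combines. All values here are illustrative.

# Candidates: (phrase, POS tag of head word, sentence index)
WHO_CANDIDATES = [
    ("protesters", "NNS", 0),
    ("yesterday", "RB", 0),
    ("the police", "NN", 2),
]

def score_who(phrase, pos, sent_idx):
    score = 0.0
    if pos.startswith("NN"):        # nominal heads are plausible agents
        score += 1.0
    score += 1.0 / (1 + sent_idx)   # the lead sentence matters most in news
    return score

best = max(WHO_CANDIDATES, key=lambda c: score_who(*c))
print(best[0])  # "protesters": nominal and in the lead sentence
```

Each W would get its own cue set in the real pipeline, with scores combined and ranked across all candidates as the abstract describes.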

    How to Rank Answers in Text Mining

    In this thesis, we mainly focus on case studies about ranking answers. We present the CEW-DTW methodology and assess its ranking quality. We then improve this methodology by combining it with Kullback-Leibler divergence, since Kullback-Leibler divergence can measure the difference between the probability distributions of two sequences. However, CEW-DTW and KL-CEW-DTW do not account for the effect of noise and keywords from the viewpoint of probability distribution. Therefore, we develop a new methodology, the General Entropy, to see how the probabilities of noise and keywords affect answer quality. We first analyze some properties of the General Entropy, such as its value range. In particular, we try to find an objective goal that can serve as a standard for assessing answers, and to that end introduce the maximum general entropy. We use the general entropy methodology to construct, from a mathematical viewpoint, an imaginary answer with maximum entropy (though this answer may not exist), which can be regarded as an “ideal” answer. Comparing maximum entropy probabilities with global probabilities of noise and keywords respectively, we find that the maximum entropy probability of noise is smaller than the global probability of noise, while the maximum entropy probabilities of chosen keywords are larger than the corresponding global probabilities under some conditions. This allows us to select the maximum number of keywords in a determinate way. We also use an Amazon dataset and a small survey to assess the general entropy. Though these methodologies can analyze answer quality, they do not incorporate the inner connections among keywords and noise. Based on the Markov transition matrix, we therefore develop the Jump Probability Entropy, again using the Amazon dataset to compare maximum jump entropy probabilities with global jump probabilities of noise and keywords respectively.
    Finally, we describe the steps for obtaining answers from the Amazon dataset, including extracting the original answers and removing stop words and collinearity. We compare our developed methodologies to see whether they are consistent. We also introduce the Wald–Wolfowitz runs test and compare it with the developed methodologies to verify their relationships. Based on the results of these comparisons, we draw conclusions about the consistency of the methodologies and outline future plans.
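The Kullback-Leibler divergence used to compare the probability distributions of two answer sequences can be computed directly. The two distributions below are made-up keyword/noise proportions, purely to show the calculation.

```python
# D_KL(P || Q) for discrete distributions given as aligned probability
# lists. Terms with p_i = 0 contribute nothing by convention.
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

answer_a = [0.7, 0.2, 0.1]   # e.g. keyword1, keyword2, noise proportions
answer_b = [0.5, 0.3, 0.2]

d = kl_divergence(answer_a, answer_b)
print(round(d, 4))  # small positive value: the distributions are close
```

Note the asymmetry: D_KL(P||Q) and D_KL(Q||P) generally differ, which is one reason the thesis combines the divergence with DTW alignment rather than using it alone.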

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, available datasets are sometimes incomplete, noisy or affected by artifacts. In supervised scenarios, label information may be of low quality, with problems such as unbalanced training sets and noisy labels. Moreover, in practice, it is very common that the available data samples are not enough to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas for solving this challenging problem and to provide clear examples of application in real scenarios.

    A Survey on Text Classification Algorithms: From Text to Predictions

    In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies for encoding natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, whose description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods, both in how they function and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we outline new experimental results and discuss the open research challenges posed by deep learning-based language models.
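The "raw text to output labels" flow the survey traces can be compressed into a few lines: tokenize, build a vocabulary, encode documents as bag-of-words vectors, and apply a linear classifier. The corpus and the hand-set weights below are toy values standing in for a trained model.

```python
# End-to-end toy text-classification pipeline: raw text -> bag-of-words
# counts -> linear score -> label. Weights are hand-set for illustration.

docs = ["good film great acting", "terrible plot bad film"]
vocab = sorted({w for d in docs for w in d.split()})

def encode(text):
    """Bag-of-words count vector over the toy vocabulary."""
    counts = [0] * len(vocab)
    for w in text.split():
        if w in vocab:
            counts[vocab.index(w)] += 1
    return counts

# Hand-set per-word weights standing in for a trained linear classifier.
weight_by_word = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}
weights = [weight_by_word.get(w, 0.0) for w in vocab]

def classify(text):
    x = encode(text)
    score = sum(xi * wi for xi, wi in zip(x, weights))
    return "pos" if score >= 0 else "neg"

print(classify("great film"))  # pos
```

Deep learning-based methods covered in the survey replace the count vectors with learned dense representations, but the overall data flow is the same.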