1,909 research outputs found

    Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents

    The presence of huge volumes of unstructured data in the form of PDF documents poses a challenge to organizations trying to extract valuable information from it. In this thesis, we try to solve this problem, as required by DNB, by building an automatic information extraction system that extracts from PDF documents only the key information in which the company is interested. This is achieved by comparing the performance of named entity recognition (NER) models for automatic text extraction, built using a Bi-directional Long Short Term Memory (Bi-LSTM) network with a Conditional Random Field (CRF), in combination with three variations of word vectorisation techniques. The word vectorisation techniques compared in this thesis are randomly generated word embeddings from the Keras embedding layer, pre-trained static word embeddings (100-dimensional GloVe embeddings) and, finally, deep contextual ELMo word embeddings. Comparing these models helps us identify the advantages and disadvantages of different word embeddings by analysing their effect on NER performance. This study was performed on a data set provided by DNB. The comparative study showed that the NER system built using Bi-LSTM-CRF with GloVe embeddings gave the best results, with a micro F1 score of 0.868 and a macro F1 score of 0.872 on unseen data, compared to Bi-LSTM-CRF based NER using the Keras embedding layer and ELMo embeddings, which gave micro F1 scores of 0.858 and 0.796 and macro F1 scores of 0.848 and 0.776, respectively. This result is contrary to our assumption that NER using deep contextualised word embeddings performs better than NER using other word embeddings. We proposed that this contradictory performance is due to high dimensionality, and we analysed it by using a lower-dimensional word embedding. It was found that using 50-dimensional GloVe embeddings instead of 100-dimensional GloVe embeddings improved the overall micro and macro F1 scores from 0.87 to 0.88. Additionally, optimising the best model, the Bi-LSTM-CRF using 100-dimensional GloVe embeddings, by tuning over a small hyperparameter search space did not yield any improvement over the present micro F1 score of 0.87 and macro F1 score of 0.87.
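
    For orientation, a minimal sketch of the kind of tagger the thesis compares: a Keras Bi-LSTM fed by frozen pre-trained GloVe vectors. This is not the thesis code; all sizes and names are illustrative, and the CRF output layer is replaced here by a per-token softmax for brevity (a CRF layer, e.g. from the tensorflow-addons package, would take its place).

        import numpy as np
        import tensorflow as tf

        # Illustrative sizes; the thesis compares 100- and 50-dimensional GloVe vectors.
        VOCAB_SIZE, EMB_DIM, N_TAGS = 20000, 100, 9

        def load_glove(path, word_index):
            """Fill an embedding matrix from a GloVe text file ("word v1 v2 ...")."""
            matrix = np.random.normal(scale=0.1, size=(VOCAB_SIZE, EMB_DIM)).astype("float32")
            with open(path, encoding="utf-8") as f:
                for line in f:
                    parts = line.rstrip().split(" ")
                    idx = word_index.get(parts[0])
                    if idx is not None and idx < VOCAB_SIZE:
                        matrix[idx] = np.asarray(parts[1:], dtype="float32")
            return matrix

        def build_tagger(embedding_matrix):
            model = tf.keras.Sequential([
                tf.keras.layers.Embedding(
                    VOCAB_SIZE, EMB_DIM, mask_zero=True, trainable=False,  # frozen GloVe
                    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)),
                tf.keras.layers.Bidirectional(
                    tf.keras.layers.LSTM(128, return_sequences=True)),
                # A CRF layer would normally replace this independent softmax.
                tf.keras.layers.TimeDistributed(
                    tf.keras.layers.Dense(N_TAGS, activation="softmax")),
            ])
            model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
            return model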

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of the data. Feature extraction is one of the important techniques of data reduction for discovering the most important features. Processing the massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from it. The survey covers different text summarization, classification, and clustering methods for discovering useful features, as well as discovering query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with collections of text documents, it is also very important to filter out duplicate data, and once duplicates are deleted, it is recommended to replace the removed duplicates. Hence, we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data so as to deliver fine-grained knowledge to the user.
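
    As one concrete instance of the duplicate detection the survey reviews, a small pure-Python sketch of near-duplicate detection using word shingles and Jaccard similarity; the shingle size and threshold are arbitrary choices, not values taken from the survey.

        def shingles(text, k=3):
            """Return the set of k-word shingles of a document."""
            words = text.lower().split()
            return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

        def jaccard(a, b):
            """Jaccard similarity of two shingle sets."""
            return len(a & b) / len(a | b) if (a or b) else 1.0

        def find_duplicates(docs, threshold=0.8, k=3):
            """Index pairs of documents whose shingle similarity meets the threshold."""
            sets = [shingles(d, k) for d in docs]
            return [(i, j)
                    for i in range(len(docs))
                    for j in range(i + 1, len(docs))
                    if jaccard(sets[i], sets[j]) >= threshold]

    A production system would replace the quadratic pairwise comparison with MinHash/LSH; data fusion would then decide which surviving record replaces each removed duplicate.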

    Mining Twitter for crisis management: realtime floods detection in the Arabian Peninsula

    A thesis submitted to the University of Bedfordshire, in partial fulfilment of the requirements for the degree of Doctor of Philosophy. In recent years, large amounts of data have been made available on microblog platforms such as Twitter; however, it is difficult to filter and extract information and knowledge from such data because of its high volume and noise. On Twitter, the general public are able to report real-world events such as floods in real time, acting as social sensors. Consequently, it is beneficial to have a method that can detect flood events automatically in real time, to help governmental authorities, such as crisis management authorities, detect the event and make decisions during its early stages. This thesis proposes a real-time flood detection system that mines Arabic tweets using machine learning and data mining techniques. The proposed system comprises six main components: data collection, pre-processing, flooding event extraction, location inference, location named entity linking, and flooding event visualisation. An effective method of flood detection from Arabic tweets is presented and evaluated using supervised learning techniques. Furthermore, this work presents a location named entity inference method based on the Learning to Search method; the results show that the proposed method outperformed existing systems, with significantly higher accuracy in inferring flood locations from tweets written in colloquial Arabic. For location named entity linking, a method has been designed that utilises Google API services as a knowledge base to extract accurate geocode coordinates associated with the location named entities mentioned in tweets. The results show that the proposed location linking method locates 56.8% of tweets within 0–10 km of the actual location. Further analysis showed that the accuracy in locating tweets in the actual city and region is 78.9% and 84.2%, respectively.
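
    The flooding event extraction component is a supervised text classifier; below is a minimal scikit-learn sketch of that idea (not the thesis's actual features, labels or model). Character n-grams are used because they tend to be robust on colloquial, dialect-heavy text.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import Pipeline
        from sklearn.svm import LinearSVC

        # Toy labelled tweets: 1 = reports a flood, 0 = unrelated.
        tweets = ["heavy rain and the street is completely flooded",
                  "great match tonight",
                  "water is entering the houses near the valley",
                  "new phone review is up"]
        labels = [1, 0, 1, 0]

        clf = Pipeline([
            ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
            ("svm", LinearSVC()),
        ])
        clf.fit(tweets, labels)
        print(clf.predict(["roads are under water after the storm"]))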

    Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu

    With the development of international information technology, we are producing a huge amount of information all the time, and the scarcest resource is no longer information itself but the ability to process information in each language. How to obtain the most effective information from such a large and complex body of multilingual text is a major goal of multilingual information processing. Multilingual text classification helps users break the language barrier and accurately locate and triage the required information. At the same time, the rapid development of the Internet has accelerated communication among users of various languages, giving rise to a large number of multilingual texts, such as book and movie reviews, online chats, product introductions, and other forms, which contain a large amount of valuable implicit information and urgently need automated tools to categorize and process them. This work describes the Natural Language Processing (NLP) sub-task known as Multilingual Text Classification (MTC), performed within the context of Baidu, a leading Chinese AI company with a strong Internet base, whose NLP division led the industry in bringing deep learning technology online in Machine Translation (MT) and search. Multilingual text classification is an important module in NLP machine translation and a basic module in NLP tasks. It can be applied to many fields, such as fake review detection, news headline category classification, analysis of positive and negative reviews, and so on. In the following work, we first define the AI model paradigm of 'pre-training and fine-tuning' in deep learning at the Baidu NLP department, and then investigate the application scenarios of multilingual text classification. Most of the text classification systems currently available on the Chinese market are designed for a single language, such as Alibaba's text classification system. If users need to classify texts of the same category in multiple languages, they need to train multiple single-language text classification systems and then classify the texts one by one. However, many internationalized products do not have a single text language, such as the AliExpress cross-border e-commerce business, the Airbnb B&B business, etc. Industry needs to understand and classify users' reviews in various languages and conduct in-depth statistics and marketing strategy development, and multilingual text classification is particularly important in this scenario. Therefore, we focus on interpreting the methodology of the multilingual text classification model for machine translation in the Baidu NLP department; we capture sets of multilingual data from reviews, news headlines and other sources for manual classification and labeling, use the labeling results to fine-tune the multilingual text classification model, and output quality evaluation data for the fine-tuned Baidu multilingual text classification model. We discuss whether pre-training and fine-tuning of the large model can substantially improve the quality and performance of multilingual text classification.
Finally, based on the machine translation multilingual text classification model, we derive how the pre-training and fine-tuning paradigm applies to current cutting-edge deep learning AI models under the NLP system, and we verify the generality and state-of-the-art standing of the pre-training and fine-tuning paradigm in the field of deep learning and intelligent search.
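
    Baidu's internal models are not public, so as a generic illustration of the 'pre-training and fine-tuning' paradigm described above, here is a sketch that fine-tunes an open multilingual encoder for review classification with the Hugging Face transformers API; the checkpoint name, labels and toy data are placeholders.

        import torch
        from torch.utils.data import Dataset
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, Trainer, TrainingArguments)

        CHECKPOINT = "bert-base-multilingual-cased"  # placeholder multilingual encoder
        texts = ["Great product, fast shipping", "Qualité décevante", "质量很好"]
        labels = [1, 0, 1]  # 1 = positive review, 0 = negative review

        tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

        class ReviewDataset(Dataset):
            """Wraps tokenized texts and labels for the Trainer."""
            def __init__(self, texts, labels):
                self.enc = tokenizer(texts, truncation=True, padding=True)
                self.labels = labels
            def __len__(self):
                return len(self.labels)
            def __getitem__(self, i):
                item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
                item["labels"] = torch.tensor(self.labels[i])
                return item

        # Fine-tune the pre-trained encoder with a fresh classification head.
        model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="mtc-demo", num_train_epochs=1),
            train_dataset=ReviewDataset(texts, labels),
        )
        trainer.train()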

    Automatic extraction of mobility activities in microblogs

    Integrated Master's dissertation. Informatics and Computing Engineering. Faculty of Engineering. University of Porto. 201

    Unsupervised learning on social data


    New Approaches to Frequent and Incremental Frequent Pattern Mining

    Data Mining (DM) is a process for extracting interesting patterns from large volumes of data. It is one of the crucial steps in Knowledge Discovery in Databases (KDD). It involves various data mining methods that mainly fall into predictive and descriptive models. Descriptive models look for patterns, rules, relationships and associations within data. One of the descriptive methods is association rule analysis, which represents the co-occurrence of items or events. Association rules are commonly used in market basket analysis. An association rule is of the form X → Y, and it shows that X and Y co-occur with a given level of support and confidence. Association rule mining is a common technique for discovering interesting frequent patterns in the large datasets acquired in various application domains. With petabytes of data finding their way into data stores virtually every day, many researchers have looked for efficient methods of analyzing these large datasets. Many algorithms have been proposed for searching for frequent patterns, but the search space explodes combinatorially as the size of the source data increases, and simply using more powerful computers, or even supercomputers, to handle ever-increasing data sets is not sufficient. Hence, incremental algorithms have been developed and used to improve the efficiency of frequent pattern mining. One of the challenges of frequent itemset mining is the long running time of the algorithms, whose two major costs are the number of database scans and the number of candidates generated (the latter requires memory, and the more candidates there are, the more memory is needed; when the candidates do not fit in memory, page swapping occurs, which further increases the running time). In this dissertation we propose a new implementation of the Apriori algorithm, NCLAT (Near Candidate-less Apriori with Tidlists), which scans the database only once and creates candidates only for level one (1-itemsets), i.e., only as many as the number of unique items in the database. In addition, we show how the results depend on the choice of data structures (probabilistic or not), whether the datasets are laid out horizontally or vertically, how counting is done, and whether the algorithms are computed sequentially or in parallel. We implement, explore and extend the incremental algorithm UWEP with both sequential and parallel computation. We have also fixed a minor bug in UWEP and created a more efficient version, UWEP2, which reduces the number of candidates created and the number of database scans. We have run all of our tests against three datasets with different features at different minimum support levels, and we present test results for both frequent and incremental frequent itemset mining implementations and compare them with each other. While a lot of work has been done on frequent itemset mining over structured data, very little has been done on unstructured data. We have therefore created a new hybrid pattern search algorithm, Double-Hash, which performed better in all of our test scenarios than the known pattern search algorithms. Double-Hash can potentially be used in frequent itemset mining on unstructured data in the future, and we present our work and test results on this as well.
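
    NCLAT itself is not reproduced in the abstract; for reference, a textbook Apriori sketch in Python showing the candidate generation and support counting loop that such work optimises (the toy transactions and the minimum support are illustrative).

        from itertools import combinations

        def apriori(transactions, min_support):
            """Textbook Apriori: all itemsets with support >= min_support."""
            transactions = [frozenset(t) for t in transactions]
            n = len(transactions)
            support = lambda s: sum(s <= t for t in transactions) / n

            items = {i for t in transactions for i in t}
            frequent = {frozenset([i]) for i in items
                        if support(frozenset([i])) >= min_support}
            result = {s: support(s) for s in frequent}
            k = 2
            while frequent:
                # Join: merge frequent (k-1)-itemsets into k-item candidates.
                candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
                # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
                candidates = {c for c in candidates
                              if all(frozenset(s) in result for s in combinations(c, k - 1))}
                frequent = {c for c in candidates if support(c) >= min_support}
                result.update({c: support(c) for c in frequent})
                k += 1
            return result

        print(apriori([["milk", "bread"], ["milk", "eggs"],
                       ["milk", "bread", "eggs"], ["bread"]], min_support=0.5))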

    Big Data Drive: The Rhetoric of Biometric Big Data

    Get PDF
    In this essay, we seek to develop a concept of “big data drive.” Influenced in part by Lacan’s theory of drive, we study the drive toward biometric big data. Biometric big data (BBD) refers to the data collected through facial recognition, eye recognition, thumb prints, and other types of technology whose task is to identify a specific being through unique biological characteristics. “Big Data Drive” refers to the energies that pulsate around Big Data, as both a signifier and a fetishized object, promising a “something more” that may never be fulfilled.

    Detecting Abnormal Behavior in Web Applications

    The rapid advance of web technologies has made the Web an essential part of our daily lives. However, network attacks have exploited vulnerabilities in web applications and caused substantial damage to Internet users. Detecting network attacks is a first and important step in network security, and a major branch of this area is anomaly detection. This dissertation concentrates on detecting abnormal behaviors in web applications using the following methodology: for a web application, we conduct a set of measurements to reveal the existence of abnormal behaviors in it, observe the differences between normal and abnormal behaviors, and, by applying a variety of information extraction methods such as heuristic algorithms, machine learning, and information theory, extract features useful for building a classification system to detect abnormal behaviors. In particular, we have studied four detection problems in web security. The first is detecting the unauthorized hotlinking behavior that plagues hosting servers on the Internet. We analyze a group of common hotlinking attacks and the web resources targeted by them, and then present an anti-hotlinking framework for protecting materials on hosting servers. The second problem is detecting aggressive automation behavior on Twitter. Our work determines whether a Twitter user is a human, bot or cyborg based on its degree of automation. We observe the differences among the three categories in terms of tweeting behavior, tweet content, and account properties, and we propose a classification system that uses a combination of features extracted from an unknown user to determine the likelihood of its being a human, bot or cyborg. Furthermore, we shift the detection perspective from automation to spam and introduce the third problem, detecting social spam campaigns on Twitter. Evolved from individual spammers, spam campaigns manipulate and coordinate multiple accounts to spread spam on Twitter and display collective characteristics. We design an automatic classification system based on machine learning and apply multiple features to classifying spam campaigns; complementary to conventional spam detection methods, our work brings efficiency and robustness. Finally, we extend our detection research into the blogosphere to capture blog bots. In this problem, detecting the human presence is an effective defense against the automatic posting ability of blog bots. We introduce behavioral biometrics, mainly mouse and keyboard dynamics, to distinguish between human and bot. By passively monitoring user browsing activities, this detection method does not require any direct user participation and improves the user experience.
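
    As a schematic of the human/bot/cyborg classifier described above, a scikit-learn sketch over hand-crafted account features; the feature names and toy values are hypothetical stand-ins for the behavioral, content, and account-property features the dissertation extracts.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        # Hypothetical per-account features: tweets per day, entropy of
        # inter-tweet intervals, fraction of tweets containing URLs,
        # account age in days. Labels: 0 = human, 1 = bot, 2 = cyborg.
        X = np.array([[  3.0, 2.9, 0.10, 1500],
                      [480.0, 0.2, 0.95,   30],
                      [ 60.0, 1.1, 0.60,  700]])
        y = np.array([0, 1, 2])

        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        print(clf.predict([[200.0, 0.3, 0.9, 45]]))  # high automation -> bot-like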

    Strategic integration between startups: the Yougenio case.

    This work aims to analyze a new company, founded in June 2016, called Yougenio. As will become clear throughout the text, it is not a typical startup: it was not founded by a young and inexperienced team of “dreamers” but is an experiment pursued by an already established group called Manutencoop, based in Bologna (Italy), which is trying to exploit in a B2C market the knowledge and expertise it previously developed in the B2B market. The work is divided into four chapters. The first attempts to summarize the differences between the EU startup ecosystem and a more developed one, that of the USA. A brief overview of the Italian situation follows, covering both the legal framework created to facilitate the growth of this kind of new business and the latest results. New management practices for new ventures are then discussed to give the reader an idea of the best-known ways of dealing with startup management. Next, a general look is taken at how firms can integrate their businesses to achieve better synergies through M&A processes. The chapter closes with a focus on how difficult it is to determine such companies' values and on the main issues in the valuation process. The second chapter analyzes the environment in which Yougenio operates: the segments in which the startup is actually interested, the competitors developing in the same niche, etc. The third goes deeper into the company's business: all of the company's past moves to date are explained (with the reasons behind specific decisions), as are the results achieved. It then focuses on Yougenio's future strategy and on how top management is thinking about the next steps, knowing that this niche is a fast-moving environment. Taking the above into account, the chapter concludes with financial forecasts regarding the business and its future development. The text concludes with the study of a possible integration strategy that Yougenio could pursue with two other competitors in order to boost their growth paths together.