Deep Learning for Learning Representation and Its Application to Natural Language Processing
As the web evolves even faster than expected, the exponential growth of data becomes overwhelming. Textual data is being generated at an ever-increasing pace via emails, documents on the web, tweets, online user reviews, blogs, and so on. As the amount of unstructured text data grows, so does the need for intelligently processing and understanding it. The focus of this dissertation is on developing learning models that automatically induce representations of human language to solve higher level language tasks.
In contrast to most conventional learning techniques, which employ shallow-structured learning architectures, deep learning is a recently developed machine learning technique that uses supervised and/or unsupervised strategies to automatically learn hierarchical representations in deep architectures, and it has been applied to varied tasks such as classification and regression. Deep learning was inspired by biological observations of how the human brain processes natural signals, and in recent years it has attracted tremendous attention from both academia and industry due to its state-of-the-art performance in many research domains, such as computer vision, speech recognition, and natural language processing.
This dissertation focuses on how to represent unstructured text data and how to model it with deep learning models in different natural language processing applications, such as sequence tagging, sentiment analysis, and semantic similarity. Specifically, my dissertation addresses the following research topics:
In Chapter 3, we examine one of the fundamental problems in NLP, text classification, by leveraging contextual information [MLX18a];
In Chapter 4, we propose a unified framework for generating an informative map from review corpus [MLX18b];
Chapter 5 discusses tagging address queries in map search [Mok18]. This research was performed in collaboration with Microsoft; and
In Chapter 6, we discuss ongoing research on the neural sentence matching problem. We are working on extending this work to a recommendation system.
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Electronic mail has become deeply embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability, and its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be said of unsolicited bulk email, or spam. Various methods, as well as enabling architectures, are available to try to mitigate the spread of spam. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches, and architectures employed for spam filtering are appraised, critically assessing their respective strengths and weaknesses.
Velocity, volume, and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet-scale, data-intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) techniques have proven effective; SVM training is, however, computationally intensive. In this dissertation, an M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by this approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy of the distributed SVM beyond that of the original sequential counterpart.
Effectively exploiting large-scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneity-aware task-to-node matching and allocation scheme, is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the-box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based end-user feedback.
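The partition-and-combine idea behind the distributed training can be sketched in miniature. The toy below is not MRSMO (which distributes SMO-based SVM training over Hadoop); it swaps in a simple perceptron as the per-shard learner, with invented two-feature data, purely to illustrate training on data subsets and merging the per-node models.

```python
# Toy partition-and-combine sketch (NOT MRSMO): each "mapper" trains a
# simple perceptron on its data shard; the "reducer" averages the
# per-shard weight vectors into one global model.

def train_shard(shard, epochs=20):
    """Train a perceptron on one shard; bias folded in as last weight."""
    dim = len(shard[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in shard:                  # y is +1 (spam) or -1 (ham)
            xb = list(x) + [1.0]
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:              # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w

def reduce_average(weight_vectors):
    """Merge per-shard models by averaging, as a stand-in reducer."""
    n = len(weight_vectors)
    return [sum(ws) / n for ws in zip(*weight_vectors)]

def predict(w, x):
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1

# Two shards of an invented, linearly separable "spam vs ham" set.
shard_a = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1)]
shard_b = [([2.5, 0.5], 1), ([-2.0, -1.0], -1), ([-1.5, -2.5], -1)]

global_w = reduce_average([train_shard(shard_a), train_shard(shard_b)])
print([predict(global_w, x) for x, _ in shard_a + shard_b])
```

In a real M/R setting, `train_shard` would run in the map phase on each node's data partition and `reduce_average` in the reduce phase; the accuracy loss that such merging introduces is what the RDF feedback loop is meant to recover.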
Phishing detection: methods based on natural language processing
Doctoral thesis, Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Elétrica, 2020.
In phishing attempts, the attacker pretends to be a trusted person or entity and, through
this false impersonation, tries to obtain sensitive information from a target. A typical example is one in which a scammer tries to pass off as a known institution, claiming the need to update a register or take immediate action from the client side; for this, personal and financial data are requested. A variety of features, such as fake web pages, the installation of malicious code, or form filling, are employed along with the e-mail itself to perform this type of action. A phishing campaign usually starts with an e-mail; therefore, the detection of this type of e-mail is critical. Since phishing aims to appear to be a legitimate message, detection techniques based only on filtering rules, such as blacklisting and heuristics, have limited effectiveness, in addition to being potentially forged.
Therefore, with the use of data-driven techniques, mainly those focused on text processing, features can be extracted from the e-mail body and header that explain the similarity and significance of the words in a specific e-mail, as well as across the entire set of message samples. The most common approach for this type of feature engineering is based on Vector Space Models (VSM). However, since VSMs derived from the Document-Term Matrix (DTM) have as many dimensions as the number of terms used in the corpus, and since not all terms are present in each of the e-mails, the feature engineering step of the phishing e-mail detection process has to address issues related to the "Curse of Dimensionality", sparsity, and the information that can be obtained from the textual context (how to improve it, and reveal its latent features).
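A tiny, self-contained illustration (invented e-mails, not from the thesis) of the dimensionality and sparsity problem: every distinct corpus term becomes a DTM column, yet each message uses only a handful of them.

```python
# Minimal Document-Term Matrix built from a toy corpus, showing that
# the number of columns equals the vocabulary size and that most cells
# are zero (sparsity), which motivates dimensionality reduction.

emails = [
    "verify your account now",
    "meeting moved to friday",
    "your account needs urgent verification",
]

vocab = sorted({w for e in emails for w in e.split()})
dtm = [[e.split().count(t) for t in vocab] for e in emails]

n_cells = len(dtm) * len(vocab)
n_zero = sum(row.count(0) for row in dtm)
print(len(vocab), round(n_zero / n_cells, 2))  # vocabulary size, zero fraction
```

Even with three short messages the matrix is mostly zeros; with thousands of real e-mails, the vocabulary (and hence the dimension) grows far faster than the number of terms any single message contains.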
This thesis proposes an approach to detect phishing that consists of four methods. They use combined techniques to obtain more representative features from the e-mail texts, which feed ML classification algorithms to correctly detect phishing e-mails. The methods are based on natural language processing (NLP) and machine learning (ML), with feature engineering strategies that increase the precision, recall, and accuracy of the predictions of the adopted algorithms and that address the VSM/DTM problems.
Method 1 uses all the features obtained from the DTM in the classification algorithms, while the other methods use different dimensionality reduction strategies to deal with the posed issues. Method 2 uses feature selection through the Chi-Square and Mutual Information measures. Method 3 implements feature extraction through the Principal Components Analysis (PCA), Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA) techniques. Method 4 is based on word embeddings, whose representations are obtained with the Word2Vec, FastText, and Doc2Vec techniques.
Our approach was employed on three datasets (Dataset 1, the main dataset; Dataset 2; and Dataset 3).
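As a hedged sketch of Method 2's feature-selection idea, the snippet below scores a term with the chi-square statistic over a 2x2 term/class contingency table. The counts are invented, and the thesis's actual implementation may differ; the point is only that a term whose presence correlates with the phishing label scores high, while an evenly spread term scores near zero.

```python
# Chi-square scoring of a single term against the phishing/ham label,
# computed from a 2x2 observed-vs-expected contingency table.

def chi2_score(n11, n10, n01, n00):
    """Chi-square for a 2x2 term/class table.
    n11: term present & phishing, n10: present & ham,
    n01: term absent & phishing,  n00: absent & ham."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    chi2 = 0.0
    for obs, r, c in [(n11, row1, col1), (n10, row1, col0),
                      (n01, row0, col1), (n00, row0, col0)]:
        exp = r * c / n                  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    return chi2

# Invented counts: "verify" appears in 40/50 phishing e-mails but only
# 5/50 ham e-mails; a term spread evenly across classes scores zero.
print(round(chi2_score(40, 5, 10, 45), 2))   # discriminative term
print(round(chi2_score(25, 25, 25, 25), 2))  # uninformative term
```

Keeping only the top-k terms by this score (the thesis reports one hundred for its best Chi-Square perspective) is what shrinks the DTM to a manageable dimension.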
All four proposed methods had excellent marks. Using the main proposed dataset (Dataset 1), on the respective best results of the four methods, an F1 score of 99.74% was achieved by Method 1, whereas the other three methods attained a remarkable measure of 100% in all main utility measures, which is, to the best of our knowledge, the highest result obtained in phishing detection research for an accredited dataset based only on the body of the e-mails.
The methods/perspectives that obtained 100% on Dataset 1 (the Chi-Square perspective of Method 2, using one hundred features; the LSA perspective of Method 3, using twenty-five features; and the Word2Vec and FastText perspectives of Method 4) were evaluated in two different contexts. Considering both the e-mail bodies and headers, using the first additional proposed dataset (Dataset 2), a 99.854% F1 score was obtained with the Word2Vec perspective, its best mark, surpassing the current best result for this dataset. Using just the e-mail bodies, as done for Dataset 1, the evaluation employing Dataset 3 also reached the best marks for this data collection. All four perspectives outperformed the state-of-the-art results, with an F1 score of 98.43% through the FastText perspective as the best mark. Therefore, for both additional datasets, these results are, to the best of our knowledge, the highest in phishing detection research for these accredited datasets.
The results obtained by these measurements are due not only to the excellent performance of the classification algorithms, but also to the proposed combination of feature engineering techniques, such as text processing procedures (for instance, the lemmatization step), improved learning techniques for re-sampling and cross-validation, and hyper-parameter configuration estimation. Thus, the proposed methods, their perspectives, and the complete plan of action demonstrated relevant performance when distinguishing between ham and phishing e-mails. The methods also proved to contribute substantially to this area of research and to other natural language processing research that needs to address or avoid problems related to VSM/DTM representation, since they generate a dense, low-dimension representation of the evaluated texts.
From feature engineering and topics models to enhanced prediction rates in phishing detection
Phishing is a type of fraud attempt in which the attacker, usually by e-mail, pretends to be a trusted person or entity in order to obtain sensitive information from a target. Most recent phishing detection research has focused on obtaining highly distinctive features from the metadata and text of these e-mails. The obtained attributes are then used to feed classification algorithms in order to determine whether messages are phishing or legitimate. In this paper, an approach based on machine learning is proposed to detect phishing e-mail attacks. The methods that compose this approach are built on a feature engineering process based on natural language processing, lemmatization, topic modeling, improved learning techniques for resampling and cross-validation, and hyperparameter configuration. The first proposed method uses all the features obtained from the Document-Term Matrix (DTM) in the classification algorithms. The second one uses Latent Dirichlet Allocation (LDA) as an operation to deal with the problems of the “curse of dimensionality”, sparsity, and the portion of text context included in the obtained representation. The proposed approach reached an F1-measure of 99.95% using the XGBoost algorithm. It outperforms state-of-the-art phishing detection research for an accredited dataset, in applications based only on the body of the e-mails, without using other e-mail features such as the header, IP information, or number of links in the text.
Can human association norm evaluate latent semantic analysis?
This paper presents a comparison of a word association norm created in a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how these automatically generated lists reflect real semantic relations.
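The Church and Hanks association measure is pointwise mutual information (PMI) over co-occurrence counts. A minimal sketch with invented counts (not taken from the paper's corpora):

```python
# PMI between two words, estimated from co-occurrence counts:
# PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
# Positive PMI means the pair co-occurs more often than chance.

import math

def pmi(pair_count, count_x, count_y, total):
    """PMI from raw counts over `total` observation windows."""
    p_xy = pair_count / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Invented corpus statistics: in 10,000 windows, "strong" appears 300
# times, "tea" 100 times, and they co-occur 30 times.
print(round(pmi(30, 300, 100, 10_000), 2))
```

Ranking a word's neighbors by this score yields an association list that can then be compared against the human-produced norm.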
Domain adaptation in Natural Language Processing
Domain adaptation has received much attention in the past decade. It has been shown that domain knowledge is paramount for building successful Natural Language Processing (NLP) applications.
To investigate the domain adaptation problem, we conduct several experiments from different perspectives. First, we automatically adapt sentiment dictionaries for predicting the financial outcomes “excess return” and “volatility”. In these experiments, we compare manual adaptation of the domain-general dictionary with automatic adaptation, and manual adaptation with a combination consisting of first manual, then automatic adaptation. We demonstrate that automatic adaptation performs better than manual adaptation: the automatically adapted sentiment dictionary outperforms the previous state of the art in predicting excess return and volatility. Furthermore, we perform qualitative and quantitative analyses, finding that annotation based on an expert’s a priori belief about a word’s meaning is error-prone: the meaning of a word can only be recognized in the context in which it appears.
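Why dictionary adaptation matters can be shown with a purely illustrative sketch (the words and weights below are invented, not taken from the dictionaries used in the thesis): a domain-general lexicon may score a routine financial term as negative, while an adapted one re-weights it for the domain.

```python
# Dictionary-based sentiment scoring with a domain-general vs. an
# adapted lexicon. All entries are hypothetical; e.g. "liability" is
# negative in everyday language but often neutral in financial filings.

general = {"growth": 1.0, "loss": -1.0, "liability": -1.0}
adapted = {"growth": 1.0, "loss": -1.0, "liability": 0.0}

def score(text, lexicon):
    """Sum the lexicon weights of the words in `text` (0 if unlisted)."""
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())

doc = "growth offset by liability exposure"
print(score(doc, general), score(doc, adapted))
```

The same document flips from neutral to positive once the domain-inappropriate weight is corrected, which is exactly the kind of error the automatic adaptation is meant to remove at scale.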
Second, we develop a temporal transfer learning approach to account for language change in social media. The language of social media is changing rapidly: new words appear in the vocabulary, and new trends are constantly emerging. Temporal transfer learning allows us to model these temporal dynamics in the document collection. We show that this method significantly improves the prediction of movie sales from discussions on social media forums. In particular, we illustrate the success of parameter transfer and the importance of textual information for financial prediction, and we show that temporal transfer learning can capture temporal trends in the data by focusing on those features that are relevant in a particular time step, i.e., we obtain more robust models that prevent overfitting.
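A minimal sketch of the parameter-transfer idea, under the assumption that it resembles warm-starting: the model for period t+1 is initialized with period t's weights rather than from scratch, so stable structure carries over and only the drift needs to be relearned. The one-feature regression below is invented for illustration and is not the thesis's model.

```python
# Warm-started training across time steps: period 2's model starts
# from period 1's weight, so a few epochs suffice to track the drift.

def sgd_step(w, x, y, lr=0.1):
    """One squared-error gradient step for a 1-feature linear model."""
    return w - lr * 2 * (w * x - y) * x

def train(data, w0=0.0, epochs=50):
    w = w0
    for _ in range(epochs):
        for x, y in data:
            w = sgd_step(w, x, y)
    return w

period_1 = [(1.0, 2.0), (2.0, 4.0)]   # relation y = 2x
period_2 = [(1.0, 2.2), (2.0, 4.4)]   # slight drift: y = 2.2x

w1 = train(period_1)                   # period-1 model, from scratch
w2 = train(period_2, w0=w1, epochs=5)  # warm-started period-2 model
print(round(w1, 2), round(w2, 2))
```

With the warm start, five epochs are enough to move from the old slope to the drifted one; training period 2 from zero would need many more updates to reach the same point.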
Third, we compare the performance of various domain adaptation models in low-resource settings, i.e., when there is a lack of large amounts of high-quality training data. This is an important issue in computational linguistics since the success of NLP applications primarily depends on the availability of training data. In real-world scenarios, the data is often too restricted and specialized. In our experiments, we evaluate different domain adaptation methods under these assumptions and find the most appropriate techniques for such a low-data problem. Furthermore, we discuss the conditions under which one approach substantially outperforms the other.
Finally, we summarize our work on domain adaptation in NLP and discuss possible future work topics.
Rich and Scalable Models for Text
Topic models have become essential tools for uncovering hidden structures in big data. However, the most popular topic model algorithm, Latent Dirichlet Allocation (LDA), and its extensions suffer from sluggish performance on big datasets. Recently, the machine learning community has attacked this problem using spectral learning approaches such as the method of moments with tensor decomposition or matrix factorization. The anchor word algorithm by Arora et al. [2013] has emerged as a more efficient approach to solving a large class of topic modeling problems. The anchor word algorithm is fast, and it has a provable theoretical guarantee: it will converge to a global solution given a sufficiently large number of documents. In this thesis, we present a series of spectral models based on the anchor word algorithm to serve a broader class of datasets and to provide richer and more flexible modeling capacity.
First, we improve the anchor word algorithm by incorporating various rich priors in the form of appropriate regularization terms. Our new regularized anchor word algorithms produce higher topic quality and provide flexibility to incorporate informed priors, creating the ability to discover topics more suited for external knowledge.
Second, we enrich the anchor word algorithm with metadata-based word representation for labeled datasets. Our new supervised anchor word algorithm runs very fast and predicts better than supervised topic models such as Supervised LDA on three sentiment datasets. Also, sentiment anchor words, which play a vital role in generating sentiment topics, provide cues to understand sentiment datasets better than unsupervised topic models.
Lastly, we examine ALTO, an active learning framework with a static topic overview, and investigate the usability of supervised topic models for active learning. We develop a new, dynamic, active learning framework that combines the informativeness and representativeness of documents using dynamically updated topics from our fast supervised anchor word algorithm. Experiments using three multi-class datasets show that our new framework consistently improves classification accuracy over ALTO.
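The anchor-selection step can be loosely illustrated with a greedy farthest-point heuristic over word co-occurrence profiles. This is a simplification with invented data: the actual algorithm of Arora et al. works on normalized word co-occurrence rows and adds random projections plus a provable recovery step, all omitted here.

```python
# Simplified anchor selection: pick rows of the word co-occurrence
# matrix that are far apart; every other word is then modeled as a
# convex combination of these "anchor" rows.

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def pick_anchors(rows, k):
    """Greedy farthest-point heuristic for k anchor rows."""
    # Start from the row with the largest norm (distance from origin).
    anchors = [max(range(len(rows)),
                   key=lambda i: dist(rows[i], [0.0] * len(rows[i])))]
    while len(anchors) < k:
        # Add the row farthest from its nearest chosen anchor.
        next_i = max(range(len(rows)),
                     key=lambda i: min(dist(rows[i], rows[a]) for a in anchors))
        anchors.append(next_i)
    return anchors

# Invented 2-d co-occurrence profiles for five words: "game" and
# "stock" sit near opposite corners, so they emerge as the anchors.
words = ["game", "team", "stock", "market", "play"]
rows = [[0.95, 0.05], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]
print([words[i] for i in pick_anchors(rows, 2)])
```

The speed of the full algorithm comes from the same shape of computation: anchor selection touches only the co-occurrence matrix, never the per-document likelihoods that make Gibbs-sampled LDA slow.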
Learning Representations of Social Media Users
User representations are routinely used in recommendation systems by platform developers, in targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly: how does one extract a stable user representation, effective for many downstream tasks, from a medium as noisy and complicated as social media?
The quality of a user representation is ultimately task-dependent (e.g., does it improve classifier performance, or make more accurate recommendations in a recommendation system?), but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior?
In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them on three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into, and improve, existing classifiers in the multitask learning framework. We treat user representations (ground-truth gender and mental health features) as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message.
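As a rough, hypothetical sketch of fusing multiple behavior "views" into one user representation (a crude stand-in for generalized CCA, which properly maximizes correlation across views rather than averaging them): combine the views, then project each user onto the leading principal direction found by power iteration. All numbers below are invented.

```python
# Naive multi-view user representation: average two behavior views
# elementwise, then project users onto the top principal direction of
# the fused matrix, computed by power iteration on M^T M.

def top_direction(mat, iters=100):
    """Power iteration for the leading eigenvector of mat^T mat."""
    dim = len(mat[0])
    v = [1.0] * dim
    for _ in range(iters):
        mv = [sum(r[j] * v[j] for j in range(dim)) for r in mat]
        w = [sum(mat[i][j] * mv[i] for i in range(len(mat)))
             for j in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two tiny invented views (e.g. hashtag counts, friending counts)
# for three users; users 1-2 behave alike, user 3 differs.
view_tweets  = [[3.0, 0.0], [2.5, 0.5], [0.0, 3.0]]
view_friends = [[2.0, 0.2], [1.8, 0.4], [0.1, 2.2]]

fused = [[(a + b) / 2 for a, b in zip(u, v)]
         for u, v in zip(view_tweets, view_friends)]
direction = top_direction(fused)
users = [sum(f[j] * direction[j] for j in range(2)) for f in fused]
print([round(u, 2) for u in users])
```

The resulting scalar scores place the two similar users close together and the third apart, which is the qualitative behavior one wants from any shared representation; GCCA achieves this while explicitly accounting for how the views correlate with each other.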