Data Mining Techniques to Understand Textual Data
More than ever, online information delivery and storage rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental task that involves broad research topics and contributes to many applications, such as text summarization, search engines, recommendation systems, online advertising, and conversational bots. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on text understanding tasks derived from two domains, disaster management and IT service management, which mainly use textual data as an information carrier.
Improving situation awareness in disaster management and reducing the human effort involved in IT service management call for more intelligent and efficient solutions for understanding the textual data that acts as the main information carrier in the two domains. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline that summarizes the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system to time-correlated features derived from text; (4) efficiently learning distributed representations for short and noisy ticket symptom descriptions and resolutions. Given different types of textual data, the data mining techniques proposed in these four research directions successfully address our tasks of understanding and extracting valuable knowledge from the data.
My dissertation will address the research topics outlined above. Concretely, I will focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying, temporally correlated features derived from text; (4) a deep neural ranking model that not only recommends resolutions successfully but also efficiently outputs distributed representations for ticket descriptions and resolutions.
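As a rough illustration of direction (2), recommending resolutions from a ticket's textual symptom description can be sketched as a nearest-neighbor lookup over historical tickets. This is a minimal sketch and not the dissertation's method: the Jaccard similarity, the similarity-weighted voting, the function names, and the sample tickets in `HISTORY` are all assumptions made for illustration.

```python
def tokens(text):
    """Lowercase a ticket description and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(symptom, history, k=2):
    """Rank the resolutions of the k historical tickets most similar to `symptom`.

    `history` is a list of (description, resolution) pairs. Each of the k
    nearest tickets votes for its resolution with a weight equal to its
    similarity; resolutions are returned from highest to lowest total weight.
    """
    query = tokens(symptom)
    scored = [(jaccard(query, tokens(desc)), res) for desc, res in history]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = {}
    for similarity, resolution in nearest:
        votes[resolution] = votes.get(resolution, 0.0) + similarity
    return sorted(votes, key=votes.get, reverse=True)


# Hypothetical historical tickets, invented for this example.
HISTORY = [
    ("disk usage on /var above threshold", "clean up old log files"),
    ("file system full on database server", "extend the file system"),
    ("disk usage high on /tmp", "clean up old log files"),
    ("agent heartbeat lost", "restart monitoring agent"),
]

print(recommend("disk usage above threshold on /tmp", HISTORY, k=3))
```

A bag-of-words similarity like this breaks down on short, noisy descriptions that share few exact tokens, which is one motivation for learning distributed representations as in direction (4).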
Graph Learning and Its Applications: A Holistic Survey
Graph learning is a prevalent domain that endeavors to learn the intricate
relationships among nodes and the topological structure of graphs. These
relationships endow graphs with uniqueness compared to conventional tabular
data, as nodes rely on non-Euclidean space and encompass rich information to
exploit. Over the years, graph learning has transcended from graph theory to
graph data mining. With the advent of representation learning, it has attained
remarkable performance in diverse scenarios, including text, image, chemistry,
and biology. Owing to its extensive application prospects, graph learning
attracts copious attention from the academic community. Despite numerous works
proposed to tackle different problems in graph learning, there is a demand to
survey previous valuable works. While some researchers have perceived this
phenomenon and accomplished impressive surveys on graph learning, they failed
to connect related objectives, methods, and applications in a more coherent
way. As a result, they do not cover the many scenarios and challenging
problems that have emerged with the rapid expansion of graph learning. Different
from previous surveys on graph learning, we provide a holistic review that
analyzes current works from the perspective of graph structure, and discusses
the latest applications, trends, and challenges in graph learning.
Specifically, we commence by proposing a taxonomy from the perspective of the
composition of graph data and then summarize the methods employed in graph
learning. We then provide a detailed elucidation of mainstream applications.
Finally, based on the current trend of techniques, we propose future
directions.
Comment: 20 pages, 7 figures, 3 tables
Sentiment analysis with limited training data
Sentiments are positive and negative emotions, evaluations and stances. This dissertation focuses on learning based systems for automatic analysis of sentiments and comparisons in natural language text. The proposed approach consists of three contributions:
1. Bag-of-opinions model: For predicting document-level polarity and intensity, we proposed the bag-of-opinions model by modeling each document as a bag of sentiments, which can explore the syntactic structures of sentiment-bearing phrases for improved rating prediction of online reviews.
2. Multi-experts model: Due to the sparsity of manually-labeled training data, we designed the multi-experts model for sentence-level analysis of sentiment polarity and intensity by fully exploiting any available sentiment indicators, such as phrase-level predictors and sentence similarity measures.
3. LSSVMrae model: To understand the sentiments regarding entities, we proposed the LSSVMrae model for extracting sentiments and comparisons of entities at both the sentence and subsentential levels.
The granularity of analysis determines model complexity: the finer the granularity, the more complex the model. All proposed models aim to minimize the use of hand-labeled data by maximizing the use of freely available resources. These models also explore different feature representations to capture the compositional semantics inherent in sentiment-bearing expressions. Our experimental results on real-world data showed that all models significantly outperform the state-of-the-art methods on the respective tasks.
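The idea of treating a document as a bag of sentiment-bearing phrases can be illustrated with a small sketch. It assumes that each opinion is decomposed into a root word, optional intensity modifiers, and a negation flag, and that the document score is the average of its opinion scores. The decomposition, the hand-picked weights in `ROOT_POLARITY` and `MODIFIER_WEIGHT`, and all names are illustrative assumptions; the dissertation's model learns its parameters from data rather than using a fixed lexicon.

```python
from dataclasses import dataclass

# Hypothetical lexicon weights, hand-picked for illustration only.
ROOT_POLARITY = {"good": 1.0, "bad": -1.0, "helpful": 0.5}
MODIFIER_WEIGHT = {"very": 1.5, "slightly": 0.5}


@dataclass
class Opinion:
    """One sentiment-bearing phrase, e.g. 'not very good'."""
    root: str
    modifiers: tuple = ()
    negated: bool = False


def opinion_score(opinion):
    """Root polarity scaled by each modifier, with the sign flipped if negated."""
    score = ROOT_POLARITY.get(opinion.root, 0.0)
    for modifier in opinion.modifiers:
        score *= MODIFIER_WEIGHT.get(modifier, 1.0)
    return -score if opinion.negated else score


def document_score(opinions):
    """Average score over the document's bag of opinions (0.0 if empty)."""
    if not opinions:
        return 0.0
    return sum(opinion_score(o) for o in opinions) / len(opinions)


review = [Opinion("good", ("very",)), Opinion("bad", ("slightly",))]
print(document_score(review))
```

Because the order of opinions is ignored, the representation stays compact while the structure inside each phrase still affects the predicted intensity.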
Support Vector Machines (SVM) in Text Extraction
Text categorization is the process of grouping documents or words into predefined
categories. Each category consists of documents or words having similar attributes.
There exist numerous algorithms to address the need of text categorization including
Naive Bayes, k-nearest-neighbor classifier, and decision trees. In this project, Support
Vector Machines (SVM) are studied and evaluated through the implementation of a textual
extractor. This algorithm is used to extract important points from a lengthy document,
by which it classifies each word in the document under its relevant category and
constructs the structure of the summary with reference to the categorized words. The
performance of the extractor is evaluated using a similar corpus against an existing
summarizer, which uses a different kind of approach. Summarization is part of text
categorization whereby it is considered an essential part of today's information-led
society, and it has been a growing area of research for over 40 years. This project's
objective is to create a summarizer, or extractor, based on machine learning algorithms,
which are namely SVM and K-Means. Each word in the particular document is
processed by both algorithms to determine its actual occurrence in the document by
which it will first be clustered or grouped into categories based on parts of speech (verb,
noun, adjective) which is done by K-Means, then later processed by SVM to determine
the actual occurrence of each word in each cluster, taking into account whether
the words have similar meanings to other words in the subsequent cluster. The corpus
chosen to evaluate the application is the Reuters-21578 dataset, which comprises
newspaper articles. Evaluation of the applications is carried out against another
accompanying system-generated extract which is already in the market, as a means to
observe the amount of sentence overlap with the tested applications, in this case, the
Text Extractor and also Microsoft Word AutoSummarizer. Results show that the Text
Extractor has optimal results at compression rates of 10 - 20% and 35 - 45
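The evaluation described above, comparing extracts by sentence overlap at a given compression rate, can be sketched as follows. This is a minimal sketch under stated assumptions: the compression rate is taken to mean the percentage of the document's sentences kept in the extract, overlap is measured on sentence indices, and both function names are hypothetical rather than taken from the project.

```python
import math


def extract_size(num_sentences, compression_percent):
    """Number of sentences an extract keeps at the given compression rate.

    Assumes the rate is the percentage of sentences retained; at least one
    sentence is always kept.
    """
    return max(1, math.ceil(num_sentences * compression_percent / 100))


def sentence_overlap(extract_a, extract_b):
    """Fraction of the sentences in extract_a that also appear in extract_b.

    Extracts are given as collections of sentence indices (or sentence strings).
    """
    selected = set(extract_a)
    if not selected:
        return 0.0
    return len(selected & set(extract_b)) / len(selected)


# A 20-sentence document summarized at a 10% and a 45% compression rate.
print(extract_size(20, 10), extract_size(20, 45))
# Two hypothetical extracts that share two of three sentences.
print(sentence_overlap([0, 3, 7], [3, 7, 9]))
```

Note that this overlap is asymmetric: it is normalized by the size of the first extract, so swapping the arguments can change the result when the extracts differ in length.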
Coarse-to-Fine Contrastive Learning on Graphs
Inspired by the impressive success of contrastive learning (CL), a variety of
graph augmentation strategies have been employed to learn node representations
in a self-supervised manner. Existing methods construct the contrastive samples
by adding perturbations to the graph structure or node attributes. Although
impressive results are achieved, these methods are rather blind to the wealth of
prior information assumed: as the perturbation degree applied to
the original graph, 1) the similarity between the original graph and the
generated augmented graph gradually decreases; 2) the discrimination between
all nodes within each augmented view gradually increases. In this paper, we
argue that both kinds of prior information can be incorporated (differently) into
the contrastive learning paradigm following our general ranking framework. In
particular, we first interpret CL as a special case of learning to rank (L2R),
which inspires us to leverage the ranking order among positive augmented views.
Meanwhile, we introduce a self-ranking paradigm to ensure that the
discriminative information among different nodes is maintained and is less
affected by perturbations of different degrees. Experimental results on
various benchmark datasets verify the effectiveness of our algorithm compared
with supervised and unsupervised models.
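The link between contrastive learning and ranking can be made concrete with a small sketch. The first function is the standard InfoNCE loss for one anchor, which rewards scoring the positive sample above all negatives. The second is a generic pairwise hinge loss that encodes the first prior stated above: a view produced with a larger perturbation should be less similar to the original than a view produced with a smaller one. The hinge form, the margin, and the function names are assumptions for illustration and are not the paper's actual objective.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.5):
    """InfoNCE loss for one anchor: negative log-softmax of the positive
    similarity among the positive and all negative similarities."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


def view_ranking_loss(view_sims, margin=0.1):
    """Pairwise hinge loss over augmented views of one anchor.

    view_sims[i] is the similarity between the anchor and the view with the
    i-th smallest perturbation degree. Every pair (i, j) with i < j is
    penalized unless view i is at least `margin` more similar than view j.
    """
    loss = 0.0
    for i in range(len(view_sims)):
        for j in range(i + 1, len(view_sims)):
            loss += max(0.0, margin - (view_sims[i] - view_sims[j]))
    return loss


# Similarities that already respect the perturbation order incur no penalty.
print(view_ranking_loss([0.9, 0.7, 0.4]))
# A heavily perturbed view scoring higher than a lightly perturbed one is penalized.
print(view_ranking_loss([0.4, 0.9]))
print(info_nce(0.8, [0.1, 0.2]))
```

With a single positive, InfoNCE only asks that the positive outrank the negatives; the pairwise term adds an ordering among the positives themselves.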
NLP-Based Techniques for Cyber Threat Intelligence
In the digital era, threat actors employ sophisticated techniques for which,
often, digital traces in the form of textual data are available. Cyber Threat
Intelligence (CTI) covers all solutions for data collection,
processing, and analysis useful to understand a threat actor's targets and
attack behavior. Currently, CTI is assuming an increasingly crucial role in
identifying and mitigating threats and enabling proactive defense strategies.
In this context, NLP, an artificial intelligence branch, has emerged as a
powerful tool for enhancing threat intelligence capabilities. This survey paper
provides a comprehensive overview of NLP-based techniques applied in the
context of threat intelligence. It begins by describing the foundational
definitions and principles of CTI as a major tool for safeguarding digital
assets. It then undertakes a thorough examination of NLP-based techniques for
CTI data crawling from Web sources, CTI data analysis, Relation Extraction from
cybersecurity data, CTI sharing and collaboration, and security threats of CTI.
Finally, the challenges and limitations of NLP in threat intelligence are
exhaustively examined, including data quality issues and ethical
considerations. This survey draws a complete framework and serves as a valuable
resource for security professionals and researchers seeking to understand the
state-of-the-art NLP-based threat intelligence techniques and their potential
impact on cybersecurity.