971 research outputs found

    From the User to the Medium: Neural Profiling Across Web Communities

    Online communities provide a unique way for individuals to access information from those in similar circumstances, which can be critical for health conditions that require daily and personalized management. As these groups and topics often arise organically, identifying the types of topics discussed is necessary to understand their needs. These communities and the people in them can also be quite diverse, yet existing community detection methods have not been extended to evaluate this heterogeneity, because they have not focused on detecting communities from the semantic relations between textual features of user-generated content. We therefore develop an approach, NeuroCom, that optimally finds dense groups of users as communities in a latent space inferred from neural representations of the content users publish. By embedding words and messages, we show that NeuroCom demonstrates improved clustering and identifies more nuanced discussion topics than other common unsupervised learning approaches.

    User Multi-Interest Modeling for Behavioral Cognition

    Representation modeling based on user behavior sequences is an important direction in user cognition. In this study, we propose a novel framework called the Multi-Interest User Representation Model. Specifically, the model consists of two sub-modules. The first sub-module encodes user behaviors in any period into a super-high-dimensional sparse vector. The second sub-module is a self-supervised network that maps the vectors from the first sub-module to low-dimensional dense user representations by contrastive learning. With the help of a novel attention module that can learn a user's multiple interests, the second sub-module achieves almost lossless dimensionality reduction. Experiments on several benchmark datasets show that our approach works well and outperforms state-of-the-art unsupervised representation methods in different downstream tasks. Comment: during peer review

    Enriching Word Embeddings with Food Knowledge for Ingredient Retrieval

    Smart assistants and recommender systems must deal with large amounts of information coming from different sources and in different formats. This is more frequent in text data, which presents increased variability and complexity, and is rather common for conversational assistants or chatbots. Moreover, this issue is very evident in the food and nutrition lexicon, where the semantics present increased variability, namely due to hypernyms and hyponyms. This work describes the creation of a set of word embeddings based on the incorporation of information from a food thesaurus, LanguaL, through retrofitting. The ingredients were classified according to three different facet label groups. Retrofitted embeddings seem to properly encode food-specific knowledge, as shown by an increase in accuracy compared to generic embeddings (+23%, +10% and +31% per group). Moreover, a weighting mechanism based on TF-IDF was applied to embedding creation before retrofitting, also bringing an increase in accuracy (+5%, +9% and +5% per group). Finally, the approach has been tested with human users in an ingredient retrieval exercise, with very positive results (77.3% of the volunteer testers preferred this method over a string-based matching algorithm).
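
As a rough illustration of the retrofitting step mentioned in this abstract, the sketch below applies the commonly used iterative retrofitting update, in which each word vector is pulled towards the average of its thesaurus neighbours while staying anchored to its original value. This is an assumption about the general technique and may differ from the authors' exact procedure. The words, vectors, and links are made-up toy values, not LanguaL data, and the names `retrofit`, `cosine`, `toy_vectors`, and `toy_links` are hypothetical.

```python
import math


def retrofit(vectors, neighbours, iterations=10, alpha=1.0):
    """Pull each word vector towards its linked neighbours (toy retrofitting sketch).

    vectors:    dict mapping word -> list of floats (original embeddings)
    neighbours: dict mapping word -> list of linked words (assumed thesaurus links)
    alpha:      weight that anchors a word to its original vector
    """
    original = {w: list(v) for w, v in vectors.items()}
    updated = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, linked in neighbours.items():
            linked = [n for n in linked if n in updated]
            if word not in updated or not linked:
                continue  # nothing to pull this word towards
            beta = 1.0 / len(linked)  # each neighbour gets equal weight
            dims = len(original[word])
            new_vec = [alpha * original[word][d] for d in range(dims)]
            for n in linked:
                for d in range(dims):
                    new_vec[d] += beta * updated[n][d]
            total = alpha + beta * len(linked)
            updated[word] = [x / total for x in new_vec]
    return updated


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy data: "apple" and "pear" are linked, "salt" has no links.
toy_vectors = {"apple": [1.0, 0.0], "pear": [0.0, 1.0], "salt": [-1.0, 0.0]}
toy_links = {"apple": ["pear"], "pear": ["apple"]}

retrofitted = retrofit(toy_vectors, toy_links)
```

In this toy setup the two linked words end up closer to each other than they started, while the unlinked word keeps its original vector.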

    NLP-Based Techniques for Cyber Threat Intelligence

    In the digital era, threat actors employ sophisticated techniques for which digital traces in the form of textual data are often available. Cyber Threat Intelligence (CTI) covers all the solutions for data collection, processing, and analysis that are useful for understanding a threat actor's targets and attack behavior. CTI is playing an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, Relation Extraction from cybersecurity data, CTI sharing and collaboration, and security threats of CTI. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. This survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand the state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.

    Automatic stance detection on political discourse in Twitter

    The majority of opinion mining tasks in natural language processing (NLP) have focused on sentiment analysis of texts about products and services, while there is comparatively less research on automatic detection of political opinion. Almost all previous research has been done for English, while this thesis focuses on the automatic detection of stance (whether the author is in favor of or against an important political topic) from Twitter posts in Catalan, Spanish and English. The main objective of this work is to build and compare automatic stance detection systems using both classic supervised machine learning and deep learning techniques. We also study the influence of text normalization and perform experiments with different methods for word representations, such as TF-IDF measures for unigrams, word embeddings, tweet embeddings, and contextual character-based embeddings. We obtain state-of-the-art results in the stance detection task on the IberEval 2018 dataset. Our research shows that text normalization and feature selection are important for systems with unigram features, and do not affect performance when working with word vector representations. Classic methods such as unigrams with an SVM classifier still outperform deep learning techniques, but seem to be prone to overfitting. The classifiers trained using word vector representations and the neural network models encoded with contextual character-based vectors show greater robustness.

    Similarity and explanation for dynamic telecommunication engineer support.

    Understanding similarity between different examples is a crucial aspect of Case-Based Reasoning (CBR) systems, but learning representations optimised for similarity comparisons can be difficult. CBR systems typically rely on separate algorithms to learn representations for cases and to compare those representations, as symbolised by the vocabulary and similarity knowledge containers respectively. Deep Metric Learners (DMLs) are a branch of deep learning architectures which learn a representation optimised for similarity comparison by leveraging direct case comparisons during training. In this thesis we explore the symbiotic relationship between these two fields of research. First, we examine what can be learned from traditional CBR research to improve the training of DMLs through training strategies. We then examine how DMLs can fill the traditionally separate roles of the vocabulary and similarity knowledge containers. We perform this exploration on the real-world problem of experience transfer between experts and non-experts on service provisioning for telecommunication organisations. This problem also reveals the requirements for practical applications to be explainable to their intended user group. With that in mind, we conclude this thesis with work towards the development of an explanation framework designed to explain the recommendations of similarity-based classifiers. We support this practical contribution with an exploration of similarity knowledge to support autonomous measurement of explanation quality.

    Identifying Restaurants Proposing Novel Kinds of Cuisines: Using Yelp Reviews

    These days, with TV shows and starred chefs, new kinds of cuisines appear in the market. The main cuisines like French, Italian, Japanese, Chinese and Indian are always appreciated, but they are no longer the most popular. The new trend is fusion cuisine, which is obtained by combining different main cuisines. The opening of a new restaurant proposing new kinds of cuisine produces a lot of excitement in people. They feel the need to try it and be part of this new culture. Yelp is a platform which publishes crowd-sourced reviews about different businesses, in particular restaurants. For restaurants on Yelp where the kind of cuisine is available, there is usually a tag only for the main cuisines, and no information about fusion cuisine. There is a need to develop a system which is able to identify restaurants proposing fusion cuisine (novel or unknown cuisines). This proposal addresses the novelty detection task using Yelp reviews. The idea is that semi-supervised Machine Learning models trained only on the reviews of restaurants proposing the main cuisines will be able to discriminate between restaurants providing the main cuisines and restaurants providing the novel ones. We propose effective novelty detection approaches for the unknown cuisine type identification problem using Long Short-Term Memory (LSTM), autoencoder and Term Frequency-Inverse Document Frequency (TF-IDF). Our main idea is to obtain features from LSTM, autoencoder and TF-IDF and use these features with standard semi-supervised novelty detection algorithms like Gaussian Mixture Model, Isolation Forest and One-class Support Vector Machines (SVM) to identify the unknown cuisines. We conducted extensive experiments that demonstrate the effectiveness of our approaches. The score that we obtained has high discrimination power: the best AUROC value for the novelty detection problem is 0.85, obtained with LSTM. LSTM outperforms our baseline model of TF-IDF, and the main reason is its ability to retain only the useful parts of a sentence.