
    Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

    Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share many analogous topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems: (I) given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics are similar or not; and (II) given a piece of code of interest, determining whether it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system, INNEREYE, and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability, and the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis. Comment: Accepted by the Network and Distributed Systems Security (NDSS) Symposium 2019.
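The abstract does not spell out the model details, but the core idea of treating instructions as words and basic blocks as sentences can be illustrated with a small sketch. The snippet below is not the INNEREYE implementation: the instruction vocabularies, embedding tables, and similarity threshold are all hypothetical. It embeds each instruction, mean-pools the vectors into a block embedding, and compares blocks from two ISAs with cosine similarity; in a real system the embeddings would be learned so that semantically equivalent instructions across ISAs land near each other.

```python
# Minimal sketch (not INNEREYE): compare two basic blocks from different ISAs
# by embedding their instructions and measuring cosine similarity. The embedding
# tables are random stand-ins for vectors that would be learned from large
# disassembly corpora (e.g., with a word2vec-style objective).
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Hypothetical per-ISA vocabularies of normalized instructions.
x86_vocab = ["mov reg, reg", "add reg, imm", "cmp reg, imm", "jne addr"]
arm_vocab = ["mov reg, reg", "add reg, reg, imm", "cmp reg, imm", "bne addr"]

x86_emb = {ins: rng.normal(size=DIM) for ins in x86_vocab}
arm_emb = {ins: rng.normal(size=DIM) for ins in arm_vocab}
# Pretend training has aligned the cross-ISA space: equivalent instructions
# end up with (nearly) identical vectors.
arm_emb["mov reg, reg"] = x86_emb["mov reg, reg"]
arm_emb["cmp reg, imm"] = x86_emb["cmp reg, imm"]
arm_emb["bne addr"] = x86_emb["jne addr"]

def block_embedding(block, table):
    """Mean-pool instruction vectors into one basic-block vector."""
    return np.stack([table[ins] for ins in block]).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

block_x86 = ["mov reg, reg", "cmp reg, imm", "jne addr"]
block_arm = ["mov reg, reg", "cmp reg, imm", "bne addr"]

score = cosine(block_embedding(block_x86, x86_emb),
               block_embedding(block_arm, arm_emb))
print("similar" if score > 0.5 else "dissimilar", round(score, 3))
```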

    Research and analysis of hate and other emotions in social media

    Bachelor's thesis in Computer Engineering (Treballs Finals de Grau d'Enginyeria Informàtica), Facultat de Matemàtiques, Universitat de Barcelona, Year: 2022, Advisor: Maria Salamó Llorente. In the course of just a few years, with the massive introduction of social media, people have dramatically changed the way they communicate and share experiences. The global scale that this phenomenon has reached, combined with its rapid expansion, is a historic landmark. However, what do social networks represent in our day-to-day lives? The answer is a double life. Since their launch, a digital pseudo-reality has been created in which thoughts, emotions and privacy can be expressed in detail. This leads us to dump each of society's concerns into community applications, and if you add the factor of anonymity behind a screen, the result is incendiary. This work aims to identify, study and analyze the high level of emotion, mostly negative, that has been flooding social media thanks to the aforementioned anonymity. This process will be carried out within the Natural Language Processing field. For this purpose, a study of Hate Speech, Toxicity, Offensiveness and other emotions will be carried out on four datasets, one for each of these tasks. Using these datasets, three language models, based on Transformers and Deep Learning, will be trained and validated for their subsequent comparison. All of this is performed with the aim of finding the ideal framework for each of the featured tasks, which are based on true-to-life situations. Furthermore, it is intended to identify the causes of any shortcomings the models may present, in a concise and intuitive way for the reader.
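As a rough illustration of the kind of Transformer-based classifier the thesis compares, the sketch below scores short texts for toxicity with the Hugging Face transformers API. The checkpoint name, the two-label setup, and the "label 1 = toxic" convention are assumptions made for the example; the classification head is randomly initialized here and would first need to be fine-tuned on one of the labeled datasets mentioned above before the scores mean anything.

```python
# Illustrative sketch only: scoring short texts with a Transformer classifier.
# "distilbert-base-uncased" is a generic checkpoint; its 2-label head is randomly
# initialized, so a real study would fine-tune it on a toxicity/hate-speech dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

texts = ["I disagree with you.", "You are all worthless."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
probs = torch.softmax(logits, dim=-1)
for text, p in zip(texts, probs):
    # Label 1 is assumed to mean 'toxic' after fine-tuning.
    print(f"{text!r}: P(toxic)={p[1].item():.2f}")
```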

    Automatic detection of persuasion attempts on social networks

    The rise of social networks and the increasing amount of time people spend on them have created a perfect place for the dissemination of false narratives, propaganda, and manipulated content. In order to prevent the spread of disinformation, content moderation is needed; however, it is infeasible to do it manually due to the large number of daily posts. This dissertation aims at solving this problem by creating a system for the automatic detection of persuasion techniques, as proposed in a SemEval challenge. We start by reviewing classic machine learning and natural language processing approaches and go through more sophisticated deep learning approaches, which are better suited for this type of complex problem. The classic machine learning approaches are used to create a baseline for the problem. The proposed architecture, using deep learning techniques, is built on top of a DistilBERT transformer followed by Convolutional Neural Networks. We study how the use of different loss functions, pre-processing of the text, freezing of DistilBERT layers, and hyperparameter search impacts the performance of our system. We discovered that we could optimize our architecture by freezing the first two DistilBERT layers and using an asymmetric loss to tackle the class imbalance in the dataset presented. This study resulted in three final models with the same architecture but different parameters: the first showed signs of overfitting, one did not show signs of overfitting but did not seem to converge, and the other seemed to converge but yielded the worst performance of all three. They achieved micro F1-scores of 0.551, 0.526 and 0.509 and were placed 3rd, 6th and 11th, respectively, in the overall table. The models can only classify textual elements, as the multimodal component is not implemented in this iteration but only discussed.
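The exact implementation is not given here, but a plausible PyTorch sketch of the architecture described in the abstract, a DistilBERT encoder with its embeddings and first two Transformer layers frozen followed by a convolutional layer and a classifier, might look as follows. The number of labels, the convolution width, and the kernel size are illustrative assumptions.

```python
# Rough sketch of a DistilBERT + CNN classifier with the embeddings and the first
# two Transformer blocks frozen. Layer sizes and the label count are illustrative.
import torch
import torch.nn as nn
from transformers import DistilBertModel

class PersuasionClassifier(nn.Module):
    def __init__(self, num_labels=20, conv_channels=128, kernel_size=3):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Freeze the embedding layer and the first two Transformer blocks.
        for module in [self.encoder.embeddings, *self.encoder.transformer.layer[:2]]:
            for p in module.parameters():
                p.requires_grad = False
        self.conv = nn.Conv1d(self.encoder.config.dim, conv_channels,
                              kernel_size, padding=1)
        self.classifier = nn.Linear(conv_channels, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        feats = torch.relu(self.conv(hidden.permute(0, 2, 1)))  # (B, C, T)
        pooled = feats.max(dim=-1).values                       # global max pooling
        return self.classifier(pooled)                          # multi-label logits
```

Training such a model would pair these logits with a multi-label loss; the abstract notes that an asymmetric loss was used to tackle the class imbalance.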

    Visual Analytics for the Exploratory Analysis and Labeling of Cultural Data

    Cultural data can come in various forms and modalities, such as text traditions, artworks, music, crafted objects, or even intangible heritage such as biographies of people, performing arts, cultural customs and rites. The assignment of metadata to such cultural heritage objects is an important task that people working in galleries, libraries, archives, and museums (GLAM) do on a daily basis. These rich metadata collections are used to categorize, structure, and study collections, but can also be used to apply computational methods. Such computational methods are the focus of Computational and Digital Humanities projects and research. For the longest time, the digital humanities community has focused on textual corpora, including text mining and other natural language processing techniques, although some disciplines of the humanities, such as art history and archaeology, have a long history of using visualizations. In recent years, the digital humanities community has started to shift its focus to include other modalities, such as audio-visual data. In turn, methods in machine learning and computer vision have been proposed for the specificities of such corpora. Over the last decade, the visualization community has engaged in several collaborations with the digital humanities, often with a focus on exploratory or comparative analysis of the data at hand. This includes both methods and systems that support classical Close Reading of the material and Distant Reading methods that give an overview of larger collections, as well as methods in between, such as Meso Reading. Furthermore, a wider application of machine learning methods can be observed on cultural heritage collections. However, they are rarely applied together with visualizations to allow for further perspectives on the collections in a visual analytics or human-in-the-loop setting. Visual analytics can help in the decision-making process by guiding domain experts through the collection of interest. However, state-of-the-art supervised machine learning methods are often not applicable to the collection of interest due to missing ground truth. One form of ground truth is class labels, e.g., of entities depicted in an image collection, assigned to the individual images. Labeling all objects in a collection is an arduous task when performed manually, because cultural heritage collections contain a wide variety of different objects with plenty of details. A problem that arises with collections curated in different institutions is that a specific standard is not always followed, so the vocabularies used can drift apart from one another, making it difficult to combine the data from these institutions for large-scale analysis. This thesis presents a series of projects that combine machine learning methods with interactive visualizations for the exploratory analysis and labeling of cultural data. First, we define cultural data with regard to heritage and contemporary data; then we look at the state of the art of existing visualization, computer vision, and visual analytics methods and projects focusing on cultural data collections. After this, we present the problems addressed in this thesis and their solutions, starting with a series of visualizations to explore different facets of rap lyrics and rap artists with a focus on text reuse. Next, we engage in a more complex case of text reuse, the collation of medieval vernacular text editions. 
For this, a human-in-the-loop process is presented that applies word embeddings and interactive visualizations to perform textual alignments on under-resourced languages, supported by labeling of the relations between lines and the relations between words. We then switch the focus from textual data to another modality of cultural data by presenting a Virtual Museum that combines interactive visualizations and computer vision in order to explore a collection of artworks. With the lessons learned from the previous projects, we engage in the labeling and analysis of medieval illuminated manuscripts and so combine some of the machine learning methods and visualizations that were used for textual data with computer vision methods. Finally, we reflect on the interdisciplinary projects and the lessons learned, before discussing existing challenges when working with cultural heritage data from the computer science perspective, in order to outline potential research directions for machine learning and visual analytics of cultural heritage data.
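To make the embedding-based alignment step more concrete, here is a minimal sketch (not the system described above) of line alignment for collation: each line of two hypothetical text versions is represented as the mean of its word vectors, and every line of version A is matched to its most similar line of version B by cosine similarity. The random vectors and the mapping of spelling variants onto shared vectors stand in for embeddings actually trained on the editions; a production tool would also let the editor inspect and correct these matches interactively.

```python
# Toy embedding-based line alignment between two versions of a text.
import numpy as np

rng = np.random.default_rng(1)
DIM = 50
vocab = {w: rng.normal(size=DIM) for w in "the king rode to castle at dawn".split()}
# Pretend training has learned that (hypothetical) spelling variants are near-synonyms.
vocab.update({"ther": vocab["the"], "kyng": vocab["king"], "rod": vocab["rode"],
              "unto": vocab["to"], "þe": vocab["the"], "castel": vocab["castle"]})

def line_vec(line):
    words = [vocab[w] for w in line.split() if w in vocab]
    return np.mean(words, axis=0) if words else np.zeros(DIM)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

version_a = ["the king rode to the castle", "at dawn"]
version_b = ["ther kyng rod unto þe castel", "at dawn"]

# Similarity matrix between lines, then greedy best-match alignment.
sim = np.array([[cosine(line_vec(a), line_vec(b)) for b in version_b]
                for a in version_a])
for i, row in enumerate(sim):
    print(version_a[i], "->", version_b[int(row.argmax())])
```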

    Deep Learning Software Repositories

    Bridging the abstraction gap between artifacts and concepts is the essence of software engineering (SE) research problems. SE researchers regularly use machine learning to bridge this gap, but there are three fundamental issues with traditional applications of machine learning in SE research: traditional applications are too reliant on labeled data, they are too reliant on human intuition, and they are not capable of learning expressive yet efficient internal representations. Ultimately, SE research needs approaches that can automatically learn representations of massive, heterogeneous datasets in situ, apply the learned features to a particular task, and possibly transfer knowledge from task to task. Improvements in both computational power and the amount of memory in modern computer architectures have enabled new approaches to canonical machine learning tasks. Specifically, these architectural advances have enabled machines that are capable of learning deep, compositional representations of massive data depots. The rise of deep learning has ushered in tremendous advances in several fields. Given the complexity of software repositories, we presume deep learning has the potential to usher in new analytical frameworks and methodologies for SE research and the practical applications it reaches. This dissertation examines and enables deep learning algorithms in different SE contexts. We demonstrate that deep learners significantly outperform state-of-the-practice software language models at code suggestion on a Java corpus. Further, these deep learners for code suggestion automatically learn how to represent lexical elements. We use these representations to transmute source code into structures for detecting similar code fragments at different levels of granularity, without declaring features for how the source code is to be represented. Then we use our learning-based framework for encoding fragments to intelligently select and adapt statements in a codebase for automated program repair. In our work on code suggestion, code clone detection, and automated program repair, everything for representing lexical elements and code fragments is mined from the source code repository. Indeed, our work aims to move SE research from the art of feature engineering to the science of automated discovery.
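As a sketch of the code-suggestion idea, the toy language model below learns its own embeddings for Java-like tokens and predicts the next token with an LSTM. It is trained on a single hard-coded sequence purely for illustration; the dissertation's models are trained on a large Java corpus and their exact architecture is not reproduced here.

```python
# Toy token-level language model for code suggestion: the embedding table is the
# learned representation of lexical elements, the LSTM predicts the next token.
import torch
import torch.nn as nn

vocab = ["public", "static", "void", "main", "(", "String", "[", "]",
         "args", ")", "{", "}"]
stoi = {t: i for i, t in enumerate(vocab)}

class CodeLM(nn.Module):
    def __init__(self, vocab_size, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)     # learned token representations
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))
        return self.out(h)                           # next-token logits per position

tokens = "public static void main ( String [ ] args ) { }".split()
ids = torch.tensor([[stoi[t] for t in tokens]])
model = CodeLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):                                 # overfit the toy sequence
    logits = model(ids[:, :-1])
    loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)),
                                       ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

context = ids[:, :-1]                                # ends just after '{'
next_logits = model(context)[0, -1]
print("suggested next token:", vocab[int(next_logits.argmax())])  # should be '}'
```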

    Natural Language Processing for Under-resourced Languages: Developing a Welsh Natural Language Toolkit

    Language technology is becoming increasingly important across a variety of application domains which have become commonplace in large, well-resourced languages. However, there is a danger that small, under-resourced languages are being increasingly pushed to the technological margins. Under-resourced languages face significant challenges in delivering the underlying language resources necessary to support such applications. This paper describes the development of a natural language processing toolkit for an under-resourced language, Cymraeg (Welsh). Rather than creating the Welsh Natural Language Toolkit (WNLT) from scratch, the approach involved adapting and enhancing the language processing functionality provided for other languages within an existing framework and making use of external language resources where available. This paper begins by introducing the GATE NLP framework, which was used as the development platform for the WNLT. It then describes each of the core modules of the WNLT in turn, detailing the extensions and adaptations required for Welsh language processing. An evaluation of the WNLT is then reported. Following this, two demonstration applications are presented. The first is a simple text mining application that analyses wedding announcements. The second describes the development of a Twitter NLP application, which extends the core WNLT pipeline. As a relatively small-scale project, the WNLT makes use of existing external language resources where possible, rather than creating new resources. This approach of adaptation and reuse can provide a practical and achievable route to developing language resources for under-resourced languages.
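The WNLT itself is built on the Java-based GATE framework, so the following Python fragment is only a toy illustration of the pipeline shape described above: a sentence splitter, a tokenizer, and a lexicon-lookup tagger driven by an external word list. The five-word mini-lexicon and the sample sentence are hypothetical stand-ins for real Welsh language resources.

```python
# Toy three-stage pipeline: sentence splitting, tokenization, lexicon-lookup tagging.
import re

# Hypothetical mini-lexicon standing in for an external Welsh lexical resource.
LEXICON = {"mae": "VERB", "y": "DET", "gath": "NOUN", "yn": "PART", "cysgu": "VERB"}

def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(tokens):
    # Look each token up in the lexicon; punctuation gets PUNCT, unknowns get UNK.
    return [(tok, LEXICON.get(tok.lower(), "PUNCT" if not tok.isalnum() else "UNK"))
            for tok in tokens]

for sent in split_sentences("Mae y gath yn cysgu."):
    print(tag(tokenize(sent)))
```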

    Exploiting Cross-Lingual Representations For Natural Language Processing

    Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages. In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches of using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combine supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.
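A minimal sketch of the transfer setup described above, under the assumption that word vectors for both languages already live in a shared cross-lingual space: a document classifier is trained on labeled English documents only and then applied unchanged to documents in another language. The vocabulary, the toy "business vs. sport" labels, and the trick of giving translation pairs identical vectors are all illustrative simplifications.

```python
# Cross-lingual model transfer sketch: train on English, apply to Spanish.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
DIM = 100
# Random English word vectors; translations are mapped onto the same vectors to
# mimic an aligned cross-lingual embedding space.
en = {w: rng.normal(size=DIM) for w in ["economy", "market", "football", "team"]}
translation = {"economía": "economy", "mercado": "market",
               "fútbol": "football", "equipo": "team"}
shared_emb = {**en, **{es: en[w] for es, w in translation.items()}}

def doc_vec(words):
    return np.mean([shared_emb[w] for w in words], axis=0)

# Labeled English training documents (0 = business, 1 = sport).
X_en = np.stack([doc_vec(["economy", "market"]), doc_vec(["football", "team"])])
y_en = np.array([0, 1])
clf = LogisticRegression().fit(X_en, y_en)

# Unlabeled Spanish documents, classified by the English-trained model.
X_es = np.stack([doc_vec(["economía", "mercado"]), doc_vec(["fútbol", "equipo"])])
print(clf.predict(X_es))  # expected: [0 1]
```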

    Mono- and cross-lingual paraphrased text reuse and extrinsic plagiarism detection

    Text reuse is the act of borrowing text (either verbatim or paraphrased) from an earlier written text. It can occur within the same language (mono-lingual) or across languages (cross-lingual), where the reused text is in a different language than the original text. Text reuse and its related problem, plagiarism (the unacknowledged reuse of text), are becoming serious issues in many fields, and research shows that paraphrased and especially cross-lingual cases of reuse are much harder to detect. Moreover, the recent rise in readily available multi-lingual content on the Web and social media has increased the problem to an unprecedented scale. To develop, compare, and evaluate automatic methods for mono- and cross-lingual text reuse and extrinsic plagiarism detection (finding the portion(s) of text that are reused from the original text), standard evaluation resources are of utmost importance. However, previous efforts on developing such resources have mostly focused on English and some other languages. On the other hand, the Urdu language, which is widely spoken and has a large digital footprint, lacks resources in terms of core language processing tools and corpora. With this consideration in mind, this PhD research focuses on developing standard evaluation corpora, methods, and supporting resources to automatically detect mono-lingual (Urdu) and cross-lingual (English-Urdu) cases of text reuse and extrinsic plagiarism. This thesis contributes a mono-lingual (Urdu) text reuse corpus (COUNTER Corpus) that contains real cases of Urdu text reuse at the document level. Another contribution is the development of a mono-lingual (Urdu) extrinsic plagiarism corpus (UPPC Corpus) that contains simulated cases of Urdu paraphrase plagiarism. Evaluation results, obtained by applying a wide range of state-of-the-art mono-lingual methods to both corpora, show that it is easier to detect verbatim cases than paraphrased ones. Moreover, the performance of these methods decreases considerably on real cases of reuse. A couple of supporting resources are also created to assist the methods used in cross-lingual (English-Urdu) text reuse detection. A large-scale multi-domain English-Urdu parallel corpus (EUPC-20) that contains parallel sentences is mined from the Web, and several bi-lingual (English-Urdu) dictionaries are compiled using multiple approaches from different sources. Another major contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus (TREU Corpus). It contains real cases of English-to-Urdu text reuse at the document level. A diversified range of methods is applied to the TREU Corpus to evaluate its usefulness and to show how it can be utilised in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse. A new cross-lingual method is also proposed that uses bilingual word embeddings to estimate the degree of overlap between text documents by computing the maximum weighted cosine similarity between word pairs. The overall low evaluation results indicate that it is a challenging task to detect cross-lingual real cases of text reuse, especially when the language pair has unrelated scripts, as is the case for English-Urdu. However, an improvement in the results is observed using a combination of the methods used in the experiments. 
The research work undertaken in this PhD thesis contributes corpora, methods, and supporting resources for mono- and cross-lingual text reuse and extrinsic plagiarism detection for the significantly under-resourced Urdu language and the English-Urdu language pair. It highlights that paraphrased and cross-lingual, cross-script real cases of text reuse are harder to detect and remain an open issue. Moreover, it emphasises the need to develop standard evaluation and supporting resources for under-resourced languages to facilitate research in these languages. The resources that have been developed and the methods proposed could serve as a framework for future research in other languages and language pairs.
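The proposed cross-lingual method is described only at a high level, but a simplified reading of it can be sketched as follows: each word of one document is matched to its most similar word in the other document using bilingual word embeddings, and the weighted average of these maximum cosine similarities is taken as the overlap score. The tiny embedding table, the uniform weights, and the three-word documents below are placeholders; the thesis's method would use real English-Urdu bilingual embeddings and a proper weighting scheme.

```python
# Simplified cross-lingual overlap score via bilingual word embeddings:
# weighted average of each source word's maximum cosine similarity to any
# word of the other document.
import numpy as np

rng = np.random.default_rng(3)
DIM = 50
en = {w: rng.normal(size=DIM) for w in ["water", "crisis", "city"]}
pairs = {"پانی": "water", "بحران": "crisis", "شہر": "city"}  # hypothetical pairs
# Pretend translation pairs are aligned in the bilingual embedding space.
bwe = {**en, **{ur: en[w] for ur, w in pairs.items()}}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def overlap(doc_a, doc_b, weights=None):
    # Uniform weights here; idf-style weights would be a natural refinement.
    weights = weights or {w: 1.0 for w in doc_a}
    scores = [weights[w] * max(cosine(bwe[w], bwe[v]) for v in doc_b) for w in doc_a]
    return sum(scores) / sum(weights[w] for w in doc_a)

urdu_doc = ["شہر", "پانی", "بحران"]
english_doc = ["water", "crisis", "city"]
print(round(overlap(urdu_doc, english_doc), 3))  # near 1.0 for heavy reuse
```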

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Deep Open Representative Learning for Image and Text Classification

    Title from PDF of title page viewed November 5, 2020. Dissertation advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 257-289). Thesis (Ph.D.)--School of Computing and Engineering, University of Missouri--Kansas City, 2020. An essential goal of artificial intelligence is to support the knowledge discovery process from data to knowledge that is useful in decision making. The challenges in the knowledge discovery process are typically due to the following reasons: First, real-world data are typically noisy, sparse, or derived from heterogeneous sources. Second, it is neither easy to build robust predictive models nor to validate them with such real-world data. Third, the 'black-box' nature of deep learning models makes it hard to interpret what they produce. It is essential to bridge the gap between the models and their support in decisions with something potentially understandable and interpretable. To address the gap, we focus on designing critical representatives of the discovery process from data to knowledge that can be used to perform reasoning. In this dissertation, a novel model named Class Representative Learning (CRL) is proposed, a class-based classifier designed with the following unique contributions in machine learning, specifically for image and text classification: i) the unique design of a latent feature vector, i.e., the class representative, which represents the abstract embedding space projected from the features extracted by a deep neural network learned from either images or text; ii) parallel zero-shot learning (ZSL) algorithms with class representative learning; iii) a novel projection-based inferencing method that uses the vector space model to reconcile the dominant difference between the seen classes and unseen classes; iv) the relationships between CRs (Class Representatives), represented as a CR Graph, where a node represents a CR and an edge represents the similarity between two CRs. Furthermore, we designed the CR-Graph model, which aims to make the models explainable, which is crucial for decision-making. Although this CR-Graph does not have full reasoning capability, it is equipped with the class representatives and their inter-dependent network formed through similar neighboring classes. Additionally, semantic information and external information are added to the CR-Graph to make decisions more capable of dealing with real-world data. The automated addition of semantic information to the graph is illustrated with a case study of biomedical research through ontology generation from text and ontology-to-ontology mapping. Contents: Introduction -- CRL: Class Representative Learning for Image Classification -- Class Representatives for Zero-shot Learning using Purely Visual Data -- MCDD: Multi-class Distribution Model for Large Scale Classification -- Zero Shot Learning for Text Classification using Class Representative Learning -- Visual Context Learning with Big Data Analytics -- Transformation from Publications to Ontology using Topic-based Assertion Discovery -- Ontology Mapping Framework with Feature Extraction and Semantic Embeddings -- Conclusion -- Appendix A. A Comparative Evaluation with Different Similarity Measure
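A hedged sketch of the central CRL idea as the abstract presents it: a class representative (CR) is an aggregate vector in the embedding space of features produced by a deep network, and a new sample is assigned to the class whose CR it is most similar to. The random features, the two classes, and the mean-then-normalize aggregation below are illustrative assumptions, not the dissertation's exact formulation.

```python
# Toy class-representative classification: each CR is the normalized mean of its
# class's feature vectors; a test sample goes to the most cosine-similar CR.
import numpy as np

rng = np.random.default_rng(4)
DIM = 128

def l2norm(v):
    return v / np.linalg.norm(v)

# Placeholder features for a few training samples per seen class; in practice
# these would come from a CNN or text encoder.
train_feats = {"cat": rng.normal(size=(20, DIM)) + 1.0,
               "car": rng.normal(size=(20, DIM)) - 1.0}

# Class representatives: normalized mean feature vector per class.
crs = {cls: l2norm(f.mean(axis=0)) for cls, f in train_feats.items()}

def classify(feature, crs):
    """Assign the class whose representative has the highest cosine similarity."""
    f = l2norm(feature)
    return max(crs, key=lambda c: float(f @ crs[c]))

test_feature = rng.normal(size=DIM) + 1.0   # drawn to resemble a 'cat' sample
print(classify(test_feature, crs))
```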