
    Filtrage d'Arnaques dans un Corpus de Spams : Une application de Filtrar-S à la sécurité du citoyen

    This paper presents the results of testing the software designed and developed during the Filtrar-S project, funded by the ANR under the CSOSG 2008 programme (http://www.filtrar-s.fr). The semantic filtering module of Filtrar-S was used to identify, within a corpus of spam, the messages that are in fact frauds and are therefore called scams. This application responds to a need of the Division de Lutte Contre la Cybercriminalité of the gendarmerie nationale and was conducted in collaboration with the association Signal Spam (http://www.signal-spam.fr). Current performance is good and demonstrates the relevance of Filtrar-S for solving problems related to the security of the citizen.
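    To make the filtering step concrete, the sketch below shows a generic supervised text classifier that separates scam messages from other spam. It is only an illustration: Filtrar-S's actual semantic filtering module is not described here, and the corpus, labels, and example messages are hypothetical.

```python
# Illustrative sketch only: a generic supervised text classifier standing in for
# Filtrar-S's semantic filtering module. Corpus and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: message bodies labelled 1 = scam, 0 = other spam.
messages = [
    "You have won 1,000,000 EUR in our lottery, contact our lawyer to claim the funds.",
    "I am the widow of a late minister and need your help to transfer 12 million USD.",
    "Cheap pills online, no prescription needed, discreet shipping.",
    "Hot singles in your area are waiting to meet you tonight.",
]
labels = [1, 1, 0, 0]

# TF-IDF word features plus a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(messages, labels)

# Score an unseen message: 1 means it looks like a scam rather than ordinary spam.
print(model.predict(["Urgent business proposal: a bank transfer requires your account details."]))
```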

    Social aspects of collaboration in online software communities


    Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

    In the big data era, vast amounts of data have been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word-count vectors, whereas an image is represented by pixel intensities, and so on. Many Data Mining and Machine Learning techniques have been proposed in the literature and have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. Nevertheless, some challenging issues remain when tackling a new problem: how should the problem be represented? Which approach should be used among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed? This dissertation proposes possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using one instance per object to be classified (a document, a gene, a social post, etc.), it proposes using a pair of objects or an object-class pair, with the relationship between them as the label. This approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and to classify the biomedical literature based on the genomic features treated.
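    The pair-based representation described above can be sketched as follows: each (document, candidate-category) pair becomes one binary instance whose label says whether the pair matches, so a new category only needs a textual description rather than a change to the model's label space. The feature scheme, category descriptions, and toy data below are assumptions made for illustration, not the dissertation's exact setup.

```python
# Sketch (assumed details): multi-class text categorization re-cast as binary
# classification over (document, candidate-category) pairs.
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the genome sequence was annotated", "the striker scored twice in the final"]
doc_labels = ["genomics", "sports"]
# Hypothetical textual descriptions of the categories.
categories = {
    "genomics": "genes genome dna sequence protein",
    "sports": "match team player score goal",
}

vec = TfidfVectorizer().fit(docs + list(categories.values()))

def pair_features(doc, cat_desc):
    # One instance per (document, category) pair: concatenate the two TF-IDF vectors.
    return sp.hstack([vec.transform([doc]), vec.transform([cat_desc])])

X, y = [], []
for doc, gold in zip(docs, doc_labels):
    for cat, desc in categories.items():
        X.append(pair_features(doc, desc))
        y.append(1 if cat == gold else 0)  # label = "does this pair match?"

clf = LogisticRegression(max_iter=1000).fit(sp.vstack(X), y)

# A new category would only need a textual description; the binary label space is unchanged.
new_doc = "dna was extracted and the protein expression measured"
scores = {c: clf.predict_proba(pair_features(new_doc, d))[0, 1] for c, d in categories.items()}
print(max(scores, key=scores.get))
```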

    Semantically Enriched Text-Based Retrieval in Chemical Digital Libraries

    During the last decades, the information-gathering process has changed considerably in science, research and development, and in private life. Whereas Web pages for private information seeking are usually accessed using well-known text-based search engines, complex documents for scientific research are often stored in digital libraries and are usually accessed through domain-specific Web portals. In the specific domain of chemistry, portals usually rely on graphical user interfaces allowing for pictorial structure queries. The difficulty with purely text-based searches is that information seeking in chemical documents is generally focused on chemical entities, for which current standard search relies on complex and hard-to-extract structures. In this thesis, we introduce a retrieval workflow for chemical digital libraries enabling text-based searches. First, we explain how to automatically index chemical documents with high completeness by creating enriched index pages containing different entity representations and synonyms. Next, we analyze different similarity measures for chemical entities. We further describe how to model chemists' implicit knowledge to personalize the retrieval process. Furthermore, since users often search for chemical entities occurring in a specific context, we also show how to use contextual information to further enhance retrieval quality. Since the annotated context terms will not help contextual search if users use a different vocabulary, we present an approach that semantically enriches documents with Wikipedia concepts to overcome the vocabulary problem. Since most queries return a large number of potentially relevant hits, we further present an approach that summarizes the documents' content using Wikipedia categories. Finally, we present an architecture for a chemical digital library provider combining the different steps, enabling semantically enriched text-based retrieval for the chemical domain.
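    As a rough illustration of the enrichment idea, the sketch below adds canonical entity names, synonyms, and Wikipedia concept labels to a document's index entry so that text queries using a different vocabulary can still match. The synonym table, concept mapping, and example document are hypothetical; the thesis' actual indexing pipeline is more elaborate.

```python
# Illustrative sketch, not the thesis' actual pipeline: enrich an index entry with
# canonical entity names and (assumed) Wikipedia concept labels.
synonym_table = {
    # hypothetical entries: surface form -> canonical chemical entity
    "aspirin": "acetylsalicylic acid",
    "asa": "acetylsalicylic acid",
    "vitamin c": "ascorbic acid",
}
wikipedia_concepts = {
    # hypothetical mapping: canonical entity -> related Wikipedia concepts
    "acetylsalicylic acid": ["Analgesic", "Nonsteroidal anti-inflammatory drug"],
    "ascorbic acid": ["Vitamin", "Antioxidant"],
}

def enrich(document_text):
    """Build an enriched index entry: original text plus canonical names and concepts."""
    lowered = document_text.lower()
    canonical = {c for surface, c in synonym_table.items() if surface in lowered}
    concepts = [concept for c in canonical for concept in wikipedia_concepts.get(c, [])]
    return {"text": document_text, "entities": sorted(canonical), "concepts": concepts}

print(enrich("Aspirin (ASA) was dissolved and titrated."))
# A query for "acetylsalicylic acid" or the concept "Analgesic" can now hit this document.
```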

    Towards an Improved Understanding of Software Vulnerability Assessment Using Data-Driven Approaches

    Software Vulnerabilities (SVs) can expose software systems to cyber-attacks, potentially causing enormous financial and reputational damage for organizations. There have been significant research efforts to detect these SVs so that developers can promptly fix them. However, fixing SVs is complex and time-consuming in practice, and thus developers usually do not have sufficient time and resources to fix all SVs at once. As a result, developers often need SV information, such as exploitability, impact, and overall severity, to prioritize fixing the most critical SVs. Such information, required for fix planning and prioritization, is typically provided in the SV assessment step of the SV lifecycle. Recently, data-driven methods have been increasingly proposed to automate SV assessment tasks. However, there are still numerous shortcomings in the existing studies on data-driven SV assessment that hinder their application in practice. This PhD thesis aims to contribute to the growing literature on data-driven SV assessment by investigating and addressing the constant changes in SV data as well as the lack of consideration of source code and developers' needs for SV assessment, both of which impede the practical applicability of the field. In particular, we make the following five contributions in this thesis. (1) We systematize the knowledge of data-driven SV assessment to reveal the best practices of the field and the main challenges affecting its application in practice. Subsequently, we propose various solutions to tackle these challenges to better support the real-world applications of data-driven SV assessment. (2) We first demonstrate the existence of the concept drift (changing data) issue in the descriptions of SV reports that current studies have mostly used for predicting the Common Vulnerability Scoring System (CVSS) metrics. We augment report-level SV assessment models with subwords of terms extracted from SV descriptions to help the models more effectively capture the semantics of ever-increasing SVs. (3) We also identify that SV reports are usually released after SV fixing. Thus, we propose using vulnerable code to enable earlier SV assessment without waiting for SV reports. We are the first to use Machine Learning techniques to predict CVSS metrics on the function level, leveraging vulnerable statements that directly cause SVs and their context in code functions. The performance of our function-level SV assessment models is promising, opening up research opportunities in this new direction. (4) To facilitate today's continuous integration of software code, we present a novel deep multi-task learning model, DeepCVA, to simultaneously and efficiently predict multiple CVSS assessment metrics on the commit level, specifically using vulnerability-contributing commits. DeepCVA is the first work that enables practitioners to perform SV assessment as soon as vulnerable changes are added to a codebase, supporting just-in-time prioritization of SV fixing. (5) Besides code artifacts produced from a software project of interest, SV assessment tasks can also benefit from SV crowdsourcing information on developer Question and Answer (Q&A) websites. We automatically retrieve large-scale security/SV-related posts from these Q&A websites. We then apply a topic modeling technique to these posts to distill developers' real-world SV concerns that can be used for data-driven SV assessment.
    Overall, we believe that this thesis has provided evidence-based knowledge and useful guidelines for researchers and practitioners to automate SV assessment using data-driven approaches.
    Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
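    As a hedged illustration of report-level SV assessment with subword features (not the DeepCVA model itself), the sketch below predicts several CVSS metrics at once from SV descriptions using character n-gram features and one classifier per metric. The example descriptions and label columns are hypothetical.

```python
# Hedged sketch, not DeepCVA: predict several CVSS metrics from SV report descriptions
# using character n-gram ("subword") features; data and label columns are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

descriptions = [
    "Buffer overflow in the parser allows remote attackers to execute arbitrary code.",
    "Improper input validation allows local users to read sensitive configuration files.",
]
# One column per CVSS metric, e.g. [attack vector, confidentiality impact].
cvss_labels = [
    ["NETWORK", "HIGH"],
    ["LOCAL", "LOW"],
]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),   # subword-level features
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),  # one classifier per metric
)
model.fit(descriptions, cvss_labels)
print(model.predict(["Heap overflow reachable over the network leads to remote code execution."]))
```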

    Supporting Source Code Search with Context-Aware and Semantics-Driven Query Reformulation

    Software bugs and failures cost trillions of dollars every year and can even lead to deadly accidents (e.g., the Therac-25 accident). During maintenance, software developers fix numerous bugs and implement hundreds of new features by making necessary changes to the existing software code. Once an issue report (e.g., a bug report or change request) is assigned to a developer, she chooses a few important keywords from the report as a search query and then attempts to find the exact locations in the software code that need to be either repaired or enhanced. As part of this maintenance, developers also often construct ad hoc queries on the fly and attempt to locate reusable code from the Internet that could assist them either in bug fixing or in feature implementation. Unfortunately, even experienced developers often fail to construct the right search queries. Even if developers come up with a few ad hoc queries, most of them require frequent modification, which costs significant development time and effort. Thus, constructing an appropriate query for localizing software bugs, programming concepts, or even reusable code is a major challenge. In this thesis, we overcome this query construction challenge with six studies and develop a novel, effective code search solution (BugDoctor) that assists developers in localizing the software code of interest (e.g., bugs, concepts, and reusable code) during software maintenance. In particular, we reformulate a given search query (1) by designing novel keyword selection algorithms (e.g., CodeRank) that outperform traditional alternatives (e.g., TF-IDF), (2) by leveraging the bug report quality paradigm and source document structures, which were previously overlooked, and (3) by exploiting the crowd knowledge and word semantics derived from the Stack Overflow Q&A site, which were previously untapped. Our experiment using 5,000+ search queries (bug reports, change requests, and ad hoc queries) suggests that our proposed approach can improve the given queries significantly through automated query reformulation. Comparison with 10+ existing studies on bug localization, concept location, and Internet-scale code search suggests that our approach outperforms the state-of-the-art approaches by a significant margin.
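    To illustrate the graph-based keyword selection idea in the spirit of CodeRank (an assumed simplification, not the thesis' exact algorithm), the sketch below builds a co-occurrence graph over the terms of a bug report, ranks the terms with PageRank, and keeps the top-ranked terms as the reformulated search query.

```python
# Minimal sketch of graph-based keyword selection (assumed simplification of CodeRank):
# rank bug-report terms by PageRank over a term co-occurrence graph.
import itertools
import re
import networkx as nx

bug_report = """NullPointerException when saving user preferences.
The PreferenceManager fails to initialize the storage backend before saveUserPrefs is called."""

stopwords = {"the", "to", "is", "when", "before", "and", "a", "of"}
tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", bug_report) if t.lower() not in stopwords]

# Edges connect terms that co-occur within a sliding window of three tokens.
graph = nx.Graph()
for window in zip(tokens, tokens[1:], tokens[2:]):
    for a, b in itertools.combinations(set(window), 2):
        graph.add_edge(a, b)

scores = nx.pagerank(graph)
query = [term for term, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:5]]
print(" ".join(query))  # the top-ranked terms form the reformulated search query
```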

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.