
    Analysis and classification of privacy-sensitive content in social media posts

    User-generated content often contains private information, even when it is shared publicly on social media and on the web in general. Although many filtering and natural language approaches for automatically detecting obscenities or hate speech have been proposed, determining whether a shared post contains sensitive information is still an open issue. The problem has been addressed by assuming, for instance, that sensitive content is published anonymously, on anonymous social media platforms, or with more restrictive privacy settings, but these assumptions are far from realistic, since the authors of posts often underestimate or overlook their actual exposure to privacy risks. Hence, in this paper, we address the problem of content sensitivity analysis directly, by presenting and characterizing a new annotated corpus of around ten thousand posts, each one annotated as sensitive or non-sensitive by a pool of experts. We characterize our data with respect to the closely related problem of self-disclosure, pointing out the main differences between the two tasks. We also present the results of several deep neural network models that outperform previous naive attempts at classifying social media posts according to their sensitivity, and show that state-of-the-art approaches based on anonymity and lexical analysis do not work in realistic application scenarios.

    Positive and unlabeled learning in categorical data

    In common binary classification scenarios, the presence of both positive and negative examples in training data is needed to build an efficient classifier. Unfortunately, in many domains, this requirement is not satisfied and only one class of examples is available. To cope with this setting, classification algorithms have been introduced that learn from Positive and Unlabeled (PU) data. Originally, these approaches were exploited in the context of document classification. Only a few works address the PU problem for categorical datasets. Moreover, the available algorithms are mainly based on Naive Bayes classifiers. In this work we present a new distance-based PU learning approach for categorical data: Pulce. Our framework takes advantage of the intrinsic relationships between attribute values and goes beyond the independence assumption made by Naive Bayes. Pulce, in fact, leverages the statistical properties of the data to learn a distance metric employed during the classification task. We extensively validate our approach over real-world datasets and demonstrate that our strategy obtains statistically significant improvements w.r.t. state-of-the-art competitors.

    Identification of key films and personalities in the history of cinema from a Western perspective

    The success of a film is usually measured through its box-office revenue or through the opinion of professional critics; such measures, however, may be influenced by external factors, such as advertisement or trends, and are not able to capture the impact of a film over time. Thanks to the recent availability of data on references among movies, some researchers have started to use citation patterns as an alternative method for ranking movies. In this paper, we propose a novel ranking method for films based on the network of references among movies, calculated by combining four well-known centrality indexes: in-degree, closeness, harmonic and PageRank. Our objective is to measure the success of a movie by accounting for how much it has influenced other movies produced after its release, from both the artistic and the economic point of view. We apply our method to a subset of the IMDb (Internet Movie Database) citation network consisting of around 47,000 international movies, and we derive a list of films that can be considered milestones in the history of cinema. For each movie we also collect data on its year of release, genres and countries of production, to analyze trends and patterns in the film industry according to such features. We also collect data on 20,000 directors and almost 400,000 performers (actors and actresses), and we use the network of references and our score of movies for evaluating their careers, and for ranking them. Since the IMDb dataset we employ is highly biased toward European and North American movies and personalities, our findings can be considered relevant principally for Western culture.
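
    The idea of combining the four centrality indexes can be illustrated with a minimal Python sketch. The abstract does not say how the indexes are combined, so the min-max normalisation followed by a plain average used here is an assumption, as are the toy edge list, the function names, and the choice to measure closeness and harmonic centrality over incoming paths (from citing films to the cited one). It is not the authors' implementation.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable through adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def pagerank(nodes, out_adj, damping=0.85, iterations=100):
    """Power-iteration PageRank; dangling mass is spread uniformly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_adj[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            for v in out_adj[u]:
                new[v] += damping * rank[u] / len(out_adj[u])
        rank = new
    return rank


def min_max(scores):
    """Rescale scores to [0, 1]; a constant index maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


def combined_score(edges):
    """Average of four normalised centralities (assumed combination rule).

    edges: iterable of (citing_film, cited_film) pairs.
    """
    nodes = sorted({film for edge in edges for film in edge})
    out_adj = {v: [] for v in nodes}
    in_adj = {v: [] for v in nodes}
    for citing, cited in edges:
        out_adj[citing].append(cited)
        in_adj[cited].append(citing)

    in_degree = {v: len(in_adj[v]) for v in nodes}
    closeness, harmonic = {}, {}
    for v in nodes:
        # BFS over reversed edges: distance from each citing film to v.
        dist = bfs_distances(in_adj, v)
        others = [d for node, d in dist.items() if node != v]
        closeness[v] = len(others) / sum(others) if others else 0.0
        harmonic[v] = sum(1.0 / d for d in others)

    parts = [min_max(s) for s in (in_degree, closeness, harmonic,
                                  pagerank(nodes, out_adj))]
    return {v: sum(p[v] for p in parts) / len(parts) for v in nodes}


# Hypothetical toy network: B and C reference A, C references B, D references C.
toy_edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")]
scores = combined_score(toy_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

    In this toy network the oldest film, A, is referenced directly or indirectly by every other film and ranks first, while D, which nobody references, scores zero on every index. Averaging normalised indexes keeps any single measure from dominating, but other combination rules (for example, rank aggregation) would be equally plausible readings of the abstract.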

    A parameter-less algorithm for tensor co-clustering


    Comparing Transport Quality Perception among Different Travellers in European Cities through Co-Cluster Analysis

    The quality of the transport system offered at city level constitutes an important and challenging goal for society, for local authorities, and for transport operators. Therefore, appropriate evaluation of travellers’ satisfaction is required to support service performance monitoring, benchmarking, and market analysis. This implies collecting satisfaction levels for different groups of passengers, as they can help identify priority areas of action. To this end, this paper presents an original study aimed at understanding the main aspects affecting the common view of satisfaction among different kinds of travellers at European level. A specific survey investigating how travellers perceive the quality of their journey is administered to people living in cities of different sizes. Data are then analysed through a multi-view co-clustering algorithm, an innovative machine learning technique that highlights clusters of respondents grouped according to various categories of features. Such results could be used by local authorities and transport providers to identify the specific actions needed to improve the quality of the transport service offered, from a market segmentation perspective.