37 research outputs found

    Viewpoint Discovery and Understanding in Social Networks

    The Web has evolved into a dominant platform where everyone has the opportunity to express their opinions, to interact with other users, and to debate emerging events happening around the world. On the one hand, this has enabled the presence of different viewpoints and opinions about a (usually controversial) topic such as Brexit; at the same time, it has led to phenomena like media bias, echo chambers, and filter bubbles, where users are exposed to only one point of view on a topic. There is therefore a need for methods that can detect and explain the different viewpoints. In this paper, we propose a graph partitioning method that exploits social interactions to enable the discovery of the different communities (representing different viewpoints) discussing a controversial topic in a social network like Twitter. To explain the discovered viewpoints, we describe a method, called Iterative Rank Difference (IRD), which detects descriptive terms that characterize the different viewpoints and reveals how a specific term is related to a viewpoint (by detecting other related descriptive terms). The results of an experimental evaluation showed that our approach outperforms state-of-the-art methods on viewpoint discovery, while a qualitative analysis of the proposed IRD method on three different controversial topics showed that IRD provides comprehensive and deep representations of the different viewpoints.
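    The abstract does not give IRD's formula, so the following is only a minimal, hypothetical Python sketch of the underlying rank-difference intuition: a term is descriptive of a viewpoint when it ranks much higher in that community's word-frequency ranking than in the opposing community's. The helper names (rank_of, descriptive_terms) and the toy documents are illustrative assumptions, not the authors' code.

        # Minimal sketch of rank-difference term scoring for two communities.
        # NOT the authors' IRD algorithm; it only illustrates the intuition.
        from collections import Counter

        def rank_of(counter):
            """Map each term to its 1-based rank by descending frequency."""
            ordered = [t for t, _ in counter.most_common()]
            return {t: r for r, t in enumerate(ordered, start=1)}

        def descriptive_terms(docs_a, docs_b, top_k=10):
            freq_a = Counter(w for d in docs_a for w in d.split())
            freq_b = Counter(w for d in docs_b for w in d.split())
            ranks_a, ranks_b = rank_of(freq_a), rank_of(freq_b)
            vocab = set(ranks_a) | set(ranks_b)
            default = len(vocab) + 1  # unseen terms rank below everything
            # Positive score: term ranks far higher (smaller rank) in A than in B.
            scores = {t: ranks_b.get(t, default) - ranks_a.get(t, default) for t in vocab}
            return sorted(scores, key=scores.get, reverse=True)[:top_k]

        side_a = ["leave leave sovereignty borders", "leave control borders"]
        side_b = ["remain remain economy trade", "remain single market economy"]
        print(descriptive_terms(side_a, side_b))  # terms characterizing side A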

    Null models and complexity science: disentangling signal from noise in complex interacting systems

    The constantly increasing availability of fine-grained data has led to very detailed descriptions of many socio-economic systems (such as financial markets, interbank loans or supply chains), whose representation, however, quickly becomes too complex to allow for any meaningful intuition or insight about their functioning mechanisms. This, in turn, leads to the challenge of disentangling statistically meaningful information from noise without assuming any a priori knowledge of the particular system under study. The aim of this thesis is to develop unsupervised techniques for extracting relevant information from large complex interacting systems, and to test them on real-world data. The question I try to answer is the following: is it possible to disentangle statistically relevant information from noise without assuming any prior knowledge about the system under study? In particular, I tackle this challenge from the viewpoint of hypothesis testing by developing techniques based on so-called null models, i.e., partially randomised representations of the system under study. Given that complex systems can be analysed both from the perspective of their time evolution and of their time-aggregated properties, I have developed and tested one technique for each of these two purposes. The first technique is aimed at extracting “backbones” of relevant relationships in complex interacting systems represented as static weighted networks of pairwise interactions, and it is inspired by the well-known Pólya urn combinatorial process. The second technique is instead aimed at identifying statistically relevant events and temporal patterns in single or multiple time series by means of maximum-entropy null models based on ensemble theory. Both methodologies exploit the heterogeneity of complex systems data to design null models tailored to the systems under study, and therefore capable of identifying signals that are genuinely distinctive of the systems themselves.
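    As a concrete illustration of the Pólya-urn idea, here is a hedged sketch of an edge-significance filter for a weighted network. Under the assumed null model, a node's integer strength s is allocated among its k edges by a Pólya urn with reinforcement parameter a, so a single edge's weight follows a beta-binomial distribution; edges with small p-values form the backbone. The parameter convention (alpha = 1/a, beta = (k-1)/a) is an assumption for illustration, and this is not the thesis' implementation.

        # Sketch of a Polya-urn null model for edge significance; assumes
        # integer edge weights. Not the thesis code.
        from scipy.stats import betabinom

        def edge_pvalue(w, s, k, a=1.0):
            """P(edge weight >= w) under the assumed Polya-urn null."""
            if k <= 1:
                return 1.0  # a single edge trivially carries all the strength
            alpha, beta = 1.0 / a, (k - 1) / a
            return betabinom.sf(w - 1, s, alpha, beta)

        # Toy check: a node with 4 edges and total strength 100; one edge
        # carrying 70 units is unlikely under the null, a ~uniform share is not.
        print(edge_pvalue(70, s=100, k=4))  # small p-value -> backbone edge
        print(edge_pvalue(25, s=100, k=4))  # large p-value -> noise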

    What-is and How-to for Fairness in Machine Learning: A Survey, Reflection, and Perspective

    Algorithmic fairness has attracted increasing attention in the machine learning community. Various definitions have been proposed in the literature, but the differences and connections among them are not clearly addressed. In this paper, we review and reflect on the various fairness notions previously proposed in the machine learning literature, and attempt to draw connections to arguments in moral and political philosophy, especially theories of justice. We also consider fairness inquiries from a dynamic perspective and examine the long-term impact induced by current predictions and decisions. In light of the differences among the characterized notions of fairness, we present a flowchart that encompasses the implicit assumptions and expected outcomes of different types of fairness inquiries on the data-generating process, on the predicted outcome, and on the induced impact, respectively. This paper demonstrates the importance of matching the mission (which kind of fairness one would like to enforce) and the means (which spectrum of fairness analysis is of interest, and what the appropriate analysis scheme is) to fulfill the intended purpose.
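    Two of the group fairness notions that such surveys typically cover reduce to one-line statistics. The sketch below is not taken from the paper; it computes the demographic parity difference and the true-positive-rate gap used in equal opportunity, with illustrative variable names and toy data.

        # Minimal sketch (not from the paper) of two common group fairness metrics
        # for a binary classifier with a binary sensitive attribute A.
        import numpy as np

        def demographic_parity_diff(y_pred, group):
            """|P(Y_hat=1 | A=0) - P(Y_hat=1 | A=1)|."""
            y_pred, group = np.asarray(y_pred), np.asarray(group)
            return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

        def tpr_gap(y_true, y_pred, group):
            """|P(Y_hat=1 | Y=1, A=0) - P(Y_hat=1 | Y=1, A=1)|."""
            y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
            tpr = lambda g: y_pred[(y_true == 1) & (group == g)].mean()
            return abs(tpr(0) - tpr(1))

        y_true = [1, 1, 0, 0, 1, 0, 1, 0]
        y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
        group  = [0, 0, 0, 0, 1, 1, 1, 1]
        print(demographic_parity_diff(y_pred, group), tpr_gap(y_true, y_pred, group))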

    TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information

    With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to predefined categories. Article titles are concise descriptions of the articles’ content and carry valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms to these short texts, making their classification a challenging task. This study first explores the performance of our earlier method, TextNetTopics, on short texts. We then propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized into topics of words with the document-topic distributions extracted by a topic model, to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the biomedical field, and the other to computer science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
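    The general recipe of augmenting sparse lexical features with document-topic distributions can be sketched in a few lines. The following is a minimal illustration using scikit-learn's LDA, not the TextNetTopics Pro pipeline; the toy titles and labels are invented for the example.

        # Hedged sketch: combine bag-of-words features of short titles with
        # document-topic mixtures from a topic model, then train a classifier.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.linear_model import LogisticRegression
        from scipy.sparse import hstack, csr_matrix

        titles = ["deep learning for protein folding",
                  "graph databases for query optimization",
                  "liver injury prediction from drug data",
                  "transformer models for code search"]
        labels = [0, 1, 0, 1]  # toy labels: 0 = biomedical, 1 = computer science

        bow = CountVectorizer().fit(titles)
        X_lex = bow.transform(titles)                    # sparse lexical features
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_lex)
        X_topic = lda.transform(X_lex)                   # document-topic mixtures
        X = hstack([X_lex, csr_matrix(X_topic)])         # combined representation

        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        print(clf.predict(X))                            # sanity check on training set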

    The social components of innovation: from data analysis to mathematical modelling

    Novelties are a key driver of societal progress, yet we lack a comprehensive understanding of the factors that generate them. Recent evidence suggests that innovation emerges from the balance between exploiting past discoveries and exploring new possibilities, the so-called “adjacent possible”. This thesis aims to develop new analysis tools and models to study how people navigate the seemingly infinite space of possibilities. Firstly, I extend the notion of the adjacent possible to account for novelties as combinations of existing elements. In particular, I model innovation as a random walk on an expanding complex network of content, in which novelties correspond to the first visits not only of nodes but also of links. The model correctly reproduces how novelties emerge in empirical data, highlighting the importance of the exploration process in shaping the growth of the network. Secondly, since people continuously interact and exchange information with each other, I investigate the role of social interactions in enhancing discoveries. I therefore propose a model in which multiple agents extend their adjacent possible through the links of a complex social network, thereby exploiting opportunities coming from their contacts. By adding a social dimension to the adjacent possible, I prove that the discovery potential of individuals is influenced by their position in the social network. Finally, I combine the two notions of the adjacent possible, in the content and in the social dimension, to develop a data-driven model of music exploration on online platforms. In this model, multiple agents grow their individual space of possibilities by exploring a network of similarity between artists, while exploiting suggestions from their friends on the social network. The comparison with empirical data indicates that the adjacent possible, in both the content and the social space, plays a crucial role in determining an individual's propensity to innovate.
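    A minimal, hypothetical simulation in the spirit of the first model described above (not the thesis code): a random walker explores a network, first visits of nodes and of links count as novelties, and each first visit of a node expands its neighborhood, mimicking the growth of the adjacent possible. Seed graph, expansion rate, and walk length are arbitrary choices for the example.

        # Toy adjacent-possible walk: novelties are first visits of nodes/links,
        # and discovering a node opens up fresh neighbors. Illustration only.
        import random

        random.seed(0)
        adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}   # small seed triangle
        visited_nodes, visited_links = set(), set()
        next_id, current = 3, 0
        new_nodes, new_links = [], []             # cumulative novelty counts

        for step in range(1, 2001):
            nxt = random.choice(sorted(adj[current]))
            visited_links.add(frozenset((current, nxt)))
            if nxt not in visited_nodes:
                visited_nodes.add(nxt)
                # first visit triggers expansion: attach a few fresh nodes
                for _ in range(2):
                    adj.setdefault(nxt, set()).add(next_id)
                    adj[next_id] = {nxt}
                    next_id += 1
            new_nodes.append(len(visited_nodes))
            new_links.append(len(visited_links))
            current = nxt

        # Heaps'-law-style behavior: novelties accumulate sublinearly with time.
        print(new_nodes[-1], new_links[-1])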

    Modeling and Analyzing Collective Behavior Captured by Many-to-Many Networks

    The abstract is provided in the attachment.

    A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction

    A term in a corpus is said to be “bursty” (or overdispersed) when its occurrences are concentrated in few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic inspired by statistical significance testing for quantifying term burstiness. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared term burstiness scores are computed from the collection frequency statistic (i.e., the proportion that a specified term constitutes in relation to all terms within a corpus). However, the document frequency of a term (i.e., the proportion of documents within a corpus in which a specific term occurs) is exploited by certain other widely used term burstiness measures. RICF addresses this shortcoming of the chi-squared test by systematically incorporating both the collection frequency and document frequency statistics into its term burstiness scores. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF generally outperformed the chi-squared test in terms of precision-at-k, with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF's performance was competitive with that of other well-established measures of term burstiness. Based on these findings, we consider the contributions in this paper a promising starting point for future exploration in leveraging statistical significance testing in text analysis.
    Comment: 19 pages, 1 figure, 6 tables
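    The RICF formula is not reproduced in the abstract, so the sketch below instead illustrates the chi-squared burstiness baseline it compares against: a term's per-document counts are tested against the expectation that its occurrences spread across documents in proportion to document lengths. The function name and toy corpus are illustrative assumptions.

        # Sketch of a chi-squared-style burstiness score: large values mean the
        # term is concentrated in few documents (bursty); zero means even spread.
        import numpy as np

        def chi2_burstiness(term, docs):
            tokenized = [d.split() for d in docs]
            counts = np.array([t.count(term) for t in tokenized], dtype=float)
            lengths = np.array([len(t) for t in tokenized], dtype=float)
            expected = counts.sum() * lengths / lengths.sum()
            mask = expected > 0
            return float(((counts[mask] - expected[mask]) ** 2 / expected[mask]).sum())

        docs = ["gene gene gene expression assay",
                "the assay was repeated twice",
                "results of the expression assay"]
        print(chi2_burstiness("gene", docs))   # concentrated -> high score (6.0)
        print(chi2_burstiness("assay", docs))  # evenly spread -> low score (0.0)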