
    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are a popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success, in particular that of the most widely used variant, Latent Dirichlet Allocation (LDA), and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, using a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields. Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
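
    As a concrete illustration of the representation step described above, the sketch below builds the bipartite document-word network for a made-up toy corpus using networkx; the paper's actual non-parametric SBM inference (code at https://topsbm.github.io) is not reproduced here.

```python
# Minimal sketch: a corpus as a bipartite document-word network.
# Community detection on this network (e.g. a nested SBM) then plays the
# role that topic inference plays in LDA.
from collections import Counter
import networkx as nx

corpus = {  # toy corpus, purely illustrative
    "doc1": "networks reveal community structure in complex systems",
    "doc2": "topic models infer latent topics from document collections",
    "doc3": "stochastic block models detect communities in networks",
}

G = nx.Graph()
for doc_id, text in corpus.items():
    G.add_node(doc_id, kind="doc")
    for word, count in Counter(text.split()).items():
        G.add_node(word, kind="word")
        # edge weight = number of occurrences of the word in the document
        G.add_edge(doc_id, word, weight=count)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```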

    Mapping the Evolution of "Clusters": A Meta-analysis

    This paper presents a meta-analysis of the "cluster literature" contained in scientific journals from 1969 to 2007. Thanks to an original database, we study the evolution of a stream of literature focused on a research object that is both a theoretical puzzle and a widespread empirical phenomenon. We identify different growth stages, from take-off to development and maturity. We test for the existence of a life-cycle within the authorships and discover a substitutability relation between different collaborative behaviours. We study the relationships between a "spatial" and an "industrial" approach within the textual corpus of the cluster literature and show the existence of a "predatory" interaction. We detect the relevance of clustering behaviours in the location of authors working on clusters and measure the influence of geographical distance on co-authorship. Finally, we measure the extent of a convergence process in the vocabulary of scientists working on clusters. Keywords: Cluster, Life-Cycle, Cluster Literature, Textual Analysis, Agglomeration, Co-Authorship

    XML Matchers: approaches and challenges

    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research for many years. In the past it was investigated mainly for classical database models (e.g., E/R schemas, relational databases). In recent years, however, the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, which aim at finding semantic matches between concepts defined in DTDs and XSDs. XML Matchers do not simply take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs affect the Schema Matching task. We then introduce a template, called the XML Matcher Template, that describes the main components of an XML Matcher, their roles and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as a baseline for objectively comparing approaches that, at first glance, might appear unrelated, and its introduction can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers. Comment: 34 pages, 8 tables, 7 figures
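
    To make the idea of exploiting DTD/XSD structure concrete, here is a toy, hypothetical matcher sketch (not one of the systems surveyed in the paper): it extracts the ancestor path of every named element from two small XSDs and scores candidate correspondences by blending element-name similarity with ancestor-path overlap. The schemas and the 0.7/0.3 weights are made up for illustration.

```python
# Toy structure-aware matcher: score element correspondences between two
# schemas by combining name similarity with ancestor-path overlap.
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET

XSD_A = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order"><xs:complexType><xs:sequence>
    <xs:element name="customer"><xs:complexType><xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence></xs:complexType></xs:element>
  </xs:sequence></xs:complexType></xs:element>
</xs:schema>"""

XSD_B = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="purchase"><xs:complexType><xs:sequence>
    <xs:element name="client"><xs:complexType><xs:sequence>
      <xs:element name="fullName" type="xs:string"/>
    </xs:sequence></xs:complexType></xs:element>
  </xs:sequence></xs:complexType></xs:element>
</xs:schema>"""

def element_paths(xsd_text):
    """Collect the path of named ancestors for every named element."""
    root = ET.fromstring(xsd_text)
    paths = []
    def walk(node, ancestors):
        name = node.get("name")
        trail = ancestors + [name] if name else ancestors
        if name:
            paths.append(trail)
        for child in node:
            walk(child, trail)
    walk(root, [])
    return paths

def similarity(path_a, path_b):
    """Blend leaf-name similarity with ancestor-path overlap (toy weights)."""
    name_sim = SequenceMatcher(None, path_a[-1].lower(), path_b[-1].lower()).ratio()
    shared = len(set(path_a[:-1]) & set(path_b[:-1]))
    longest = max(len(path_a) - 1, len(path_b) - 1, 1)
    return 0.7 * name_sim + 0.3 * shared / longest

for pa in element_paths(XSD_A):
    for pb in element_paths(XSD_B):
        print("/".join(pa), "<->", "/".join(pb), round(similarity(pa, pb), 2))
```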

    Automatically extracting polarity-bearing topics for cross-domain sentiment classification

    The joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required for JST model learning is a set of domain-independent polarity word priors. In this paper, we modify the JST model to incorporate word polarity priors by modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, in-domain supervised classifiers learned from the augmented feature representation achieve state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criterion for cross-domain sentiment classification, our proposed approach performs better than or comparably to previous approaches. Moreover, our approach is much simpler and does not require difficult parameter tuning.
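
    The feature-augmentation step lends itself to a short sketch. The snippet below is only a hedged stand-in: plain LDA replaces the JST model, mutual information replaces the paper's information gain criterion, and the four toy reviews and labels are invented. It shows the mechanics of appending document-topic proportions to bag-of-words features before training an in-domain classifier.

```python
# Sketch of feature augmentation: append document-topic proportions to the
# bag-of-words features before training a supervised sentiment classifier.
# Plain LDA stands in for JST, mutual information for information gain.
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

docs = ["great movie loved the acting",          # toy reviews
        "terrible plot and poor acting",
        "excellent camera work and great cast",
        "boring poor and disappointing film"]
labels = [1, 0, 1, 0]                             # toy sentiment labels

bow = CountVectorizer().fit_transform(docs)                   # word features
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(bow)                               # topic features
augmented = hstack([bow, csr_matrix(topics)])                 # augmentation

selected = SelectKBest(mutual_info_classif, k=5).fit_transform(augmented, labels)
clf = LogisticRegression().fit(selected, labels)
print("training accuracy:", clf.score(selected, labels))
```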

    All the ties that bind. A socio-semantic network analysis of Twitter political discussions

    Social media play a crucial role in what contemporary sociological reflections define as a 'hybrid media system'. Online spaces created by social media platforms resemble global public squares hosting large-scale social networks populated by citizens, political leaders, parties and organizations, journalists, activists and institutions that establish direct interactions and exchange content in a disintermediated fashion. In the last decade, a growing number of studies by researchers from different disciplines has approached the manifold facets of citizen participation in online political spaces. In most cases these studies have focused on direct relationships amongst political actors. Conversely, relatively little attention has been paid to the contents that circulate during online discussions and to how their diffusion contributes to building political identities. Even more rarely has the study of social media contents been investigated in connection with the social interactions amongst online users.
    To fill this gap, my thesis proposes a methodological procedure consisting of a network-based, data-driven approach to both infer communities of users with a similar communication behavior and extract the most prominent contents discussed within those communities. More specifically, my work focuses on Twitter, a social media platform that is widely used during political debates. Groups of users with a similar retweeting behavior, hereafter referred to as discursive communities, are identified starting from the bipartite network of Twitter verified users retweeted by non-verified users. Once the discursive communities are obtained, the corresponding semantic networks are identified by considering the co-occurrences of the hashtags present in the tweets sent by their members. The identification of discursive communities and the study of the related semantic networks are the starting point for exploring in more detail two specific conversations that took place in the Italian Twittersphere: the first occurred during the electoral campaign before the 2018 Italian general elections and in the two weeks after Election day; the second centered on the issue of migration during the period May-November 2019.
    Regarding the social analysis, the main result of my work is the identification of a behavior-driven picture of discursive communities induced by the retweeting activity of Twitter users, rather than determined by prior information on their political affiliation. Although these communities do not necessarily match the political orientation of their users, they are closely related to the evolution of the Italian political arena. As for the semantic analysis, this work sheds light on the symbolic dimension of partisan dynamics: different discursive communities are characterized by peculiar conversational dynamics at both the daily and the monthly time-scale. From a purely methodological standpoint, the semantic networks have been analyzed by employing three increasingly restrictive benchmarks. The k-shell decomposition of both filtered and non-filtered semantic networks reveals a core-periphery structure that provides information on the most debated topics within each discursive community and characterizes the communication strategy of the corresponding political coalition.
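
    A minimal sketch of the semantic-network step follows, with invented tweets standing in for the hashtags of one discursive community: hashtags that co-occur in a tweet are linked, and networkx's core_number gives the k-shell decomposition that the abstract uses to expose the core-periphery structure of a community's conversation.

```python
# Minimal sketch of the semantic-network step: hashtags co-occurring in the
# same tweet become linked nodes; the k-shell (core) decomposition then
# exposes the core-periphery structure of a discursive community's vocabulary.
from itertools import combinations
import networkx as nx

# toy tweets of one discursive community (hashtags only, made up)
tweets = [
    {"#elections", "#italy", "#vote"},
    {"#elections", "#migration"},
    {"#migration", "#italy", "#borders"},
    {"#elections", "#italy"},
]

G = nx.Graph()
for tags in tweets:
    for a, b in combinations(sorted(tags), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1          # count co-occurrences
        else:
            G.add_edge(a, b, weight=1)

# core_number gives each hashtag's k-shell; high-core hashtags form the
# most densely debated core of the community's conversation
for tag, shell in sorted(nx.core_number(G).items(), key=lambda x: -x[1]):
    print(shell, tag)
```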

    Link communities reveal multiscale complexity in networks

    Networks have become a key approach to understanding systems of interacting objects, unifying the study of diverse phenomena including biological organisms and human society. One crucial step when studying the structure and dynamics of networks is to identify communities: groups of related nodes that correspond to functional subunits such as protein complexes or social spheres. Communities in networks often overlap such that nodes simultaneously belong to several groups. Meanwhile, many networks are known to possess hierarchical organization, where communities are recursively grouped into a hierarchical structure. However, the fact that many real networks have communities with pervasive overlap, where each and every node belongs to more than one group, has the consequence that a global hierarchy of nodes cannot capture the relationships between overlapping groups. Here we reinvent communities as groups of links rather than nodes and show that this unorthodox approach successfully reconciles the antagonistic organizing principles of overlapping communities and hierarchy. In contrast to the existing literature, which has entirely focused on grouping nodes, link communities naturally incorporate overlap while revealing hierarchical organization. We find relevant link communities in many networks, including major biological networks such as protein-protein interaction and metabolic networks, and show that a large social network contains hierarchically organized community structures spanning inner-city to regional scales while maintaining pervasive overlap. Our results imply that link communities are fundamental building blocks that reveal overlap and hierarchical organization in networks to be two aspects of the same phenomenon. Comment: Main text and supplementary information
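
    A heavily simplified sketch of the link-community idea is given below. It follows the similarity used in the paper (Jaccard index of the inclusive neighbourhoods of the two non-shared endpoints of edges that share a node), but cuts the edge clustering at an arbitrary fixed threshold rather than at the maximum of partition density, and the example graph is simply networkx's built-in karate club network.

```python
# Simplified sketch: cluster links, not nodes. Two edges sharing a node are
# similar when the inclusive neighbourhoods of their other endpoints overlap
# strongly; merging similar edges yields (possibly overlapping) node groups,
# since a node inherits every community of its edges.
from itertools import combinations
import networkx as nx

G = nx.karate_club_graph()  # any undirected graph works here

def inclusive_neighbourhood(node):
    return set(G[node]) | {node}

# similarity for every pair of edges that share an endpoint
edge_sim = {}
for k in G.nodes():
    for a, b in combinations(G[k], 2):
        shared = inclusive_neighbourhood(a) & inclusive_neighbourhood(b)
        union = inclusive_neighbourhood(a) | inclusive_neighbourhood(b)
        e1, e2 = frozenset((k, a)), frozenset((k, b))
        edge_sim[(e1, e2)] = len(shared) / len(union)

# single-linkage merging at a fixed threshold (the paper instead cuts the
# dendrogram at maximum partition density)
threshold = 0.5
parent = {frozenset(e): frozenset(e) for e in G.edges()}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for (e1, e2), s in edge_sim.items():
    if s >= threshold:
        parent[find(e1)] = find(e2)

clusters = {}
for e in G.edges():
    clusters.setdefault(find(frozenset(e)), set()).add(e)
print(len(clusters), "link communities at threshold", threshold)
```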

    Community Detection in Hypergraphen

    Many datasets can be interpreted as graphs, i.e. as elements (nodes) and binary relations between them (edges). Under the label of complex network analysis, a vast array of graph-based methods allows the exploration of datasets purely on the basis of such structural properties. Community detection, as a subfield of network analysis, aims to identify well-connected subparts of graphs. While the grouping of related elements is useful in itself, these groups can furthermore be collapsed into single nodes, creating a new graph of reduced complexity which may better reveal the original graph's macrostructure. Advances in community detection therefore improve the understanding of complex networks in general. However, not every dataset can be modelled properly with binary relations: higher-order relations give rise to so-called hypergraphs. This thesis explores the generalization of community detection approaches to hypergraphs.
    The focus is on social bookmarking datasets, created by users of online bookmarking services who assign freely chosen keywords, so-called "tags", to documents. This tagging creates, for each tag assignment, a ternary connection between the user, the document, and the tag, inducing structures called 3-partite, 3-uniform hypergraphs (henceforth 3,3- or, more generally, k,k-hypergraphs). The question pursued here is how to decompose these structures into communities in a formally adequate manner, and how this improves the understanding of these datasets, which are potentially rich in latent information.
    First, a generalization of connected components to k,k-hypergraphs is proposed. The standard definition of connected components rather uninformatively assigns almost all elements of the studied datasets to a single giant component. The generalized, so-called hyperincident connected components, however, show a characteristic size distribution on the social bookmarking datasets that is disrupted by, e.g., spamming activity, demonstrating a link between behavioural patterns and structural features that is explored further in the following. Next, the general topic of community detection in k,k-hypergraphs is introduced. Three challenges are posited that are not met by the naive application of standard techniques, and three families of synthetic hypergraphs are introduced containing increasingly complex community setups that a successful detection approach must be able to identify.
    The main methodical contribution of this thesis is the development of a multi-partite (i.e. suitable for k,k-hypergraphs) community detection algorithm. It is based on modularity optimization, a well-established approach to detecting communities in non-partite, i.e. "normal", graphs. Starting from the simplest approach possible, the method is successively refined to meet the previously defined as well as empirically encountered challenges, culminating in the definition of the "balanced multi-partite modularity". Finally, an interactive tool for exploring the obtained community assignments is introduced. Using this tool, the benefits of balanced multi-partite modularity can be shown: intricate patterns can be observed that are missed by the simpler approaches. These findings are confirmed by a more quantitative examination: unsupervised quality measures that consider, e.g., compression document the advantages of this approach on a larger number of samples.
    To conclude, the contributions of this thesis are twofold: it provides practical tools for the analysis of social bookmarking data, complemented by theoretical contributions, namely the generalization of connected components and of modularity from graphs to k,k-hypergraphs.
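
    The tagging data structure described above can be illustrated directly. The sketch below is only an approximation with made-up tag assignments, and the min_shared parameter is an illustrative stand-in: min_shared=1 yields the ordinary connected components that the thesis finds uninformative, larger values tighten the notion, and the thesis's precise definition of hyperincident connected components is the one given there.

```python
# Minimal illustration of the data structure: each tag assignment is a
# (user, document, tag) hyperedge of a 3-partite, 3-uniform hypergraph.
# Hyperedges are grouped into components when they share at least
# min_shared incident nodes.
from itertools import combinations

assignments = [            # toy tag assignments, purely illustrative
    ("alice", "doc1", "networks"),
    ("alice", "doc2", "networks"),
    ("bob",   "doc1", "graphs"),
    ("carol", "doc3", "cooking"),
]

def components(hyperedges, min_shared=2):
    parent = list(range(len(hyperedges)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(hyperedges)), 2):
        if len(set(hyperedges[i]) & set(hyperedges[j])) >= min_shared:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(hyperedges)):
        groups.setdefault(find(i), []).append(hyperedges[i])
    return list(groups.values())

print(components(assignments, min_shared=1))  # one large plus one isolated component
print(components(assignments, min_shared=2))  # stricter rule, more fragmented
```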
    • 

    corecore