58 research outputs found
Spectral Estimation of Conditional Random Graph Models for Large-Scale Network Data
Generative models for graphs have been typically committed to strong prior
assumptions concerning the form of the modeled distributions. Moreover, the
vast majority of currently available models are either only suitable for
characterizing some particular network properties (such as degree distribution
or clustering coefficient), or they are aimed at estimating joint probability
distributions, which is often intractable in large-scale networks. In this
paper, we first propose a novel network statistic, based on the Laplacian
spectrum of graphs, which makes it possible to dispense with any parametric assumption
concerning the modeled network properties. Second, we use the defined statistic
to develop the Fiedler random graph model, switching the focus from the
estimation of joint probability distributions to a more tractable conditional
estimation setting. After analyzing the dependence structure characterizing
Fiedler random graphs, we evaluate them experimentally in edge prediction over
several real-world networks, showing that they reach a much higher
prediction accuracy than various alternative statistical models.
Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty
in Artificial Intelligence (UAI2012)
Learning to Recommend Links using Graph Structure and Node Content
The link prediction problem for graphs is a binary classification task that estimates the presence or absence of a link between two nodes in the graph. Links absent from the training set, however, cannot be treated directly as negative examples, since they may be present at test time. Finding a hard decision boundary for link prediction is thus unnatural. This paper formalizes the link prediction problem from the flexible perspective of preference learning: the goal is to learn a preference score between any two nodes (whether observed in the network at training time or appearing only later at test time) by using the feature vectors of the nodes and the structure of the graph as side information. Our assumption is that the observed edges, and in general shortest paths between nodes in the graph, can reinforce an existing similarity between the nodes' feature vectors. We propose a model implemented by a simple neural network architecture and an objective function that can be optimized by stochastic gradient descent over appropriate triplets of nodes in the graph. Our preliminary experiments on small undirected graphs show that our learning algorithm outperforms baselines on real networks and is able to learn the correct distance function on synthetic networks.
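The abstract does not give the objective function, so the following is only a minimal sketch of one common way to train a preference score over node triplets: a margin-based hinge loss. The dot-product score, the margin value, and the function names are illustrative assumptions, not the authors' model.

```python
# Hypothetical sketch: hinge ranking loss over (anchor, positive, negative) triplets.
# The dot-product score and margin=1.0 are assumptions for illustration only.

def score(u, v):
    """Preference score between two node feature vectors (here: dot product)."""
    return sum(a * b for a, b in zip(u, v))

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive node outscores the negative one by at least `margin`."""
    return max(0.0, margin - score(anchor, positive) + score(anchor, negative))

# Linked node scores higher than the unlinked one: no loss.
loss_ranked_correctly = triplet_hinge_loss([1.0, 0.0], [2.0, 0.0], [0.0, 1.0])
# Unlinked node scores higher than the linked one: positive loss.
loss_ranked_wrongly = triplet_hinge_loss([1.0, 0.0], [0.0, 1.0], [2.0, 0.0])
```

A loss of this shape only constrains the relative order of the two scores, which fits the preference-learning framing better than a hard present/absent decision.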
Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling
Statistical models for networks have typically been committed to strong prior assumptions concerning the form of the modeled distributions. Moreover, the vast majority of currently available models are explicitly designed for capturing some specific graph properties (such as power-law degree distributions), which makes them unsuitable for application to domains where the behavior of the target quantities is not known a priori. The key contribution of this paper is twofold. First, we introduce the Fiedler delta statistic, based on the Laplacian spectrum of graphs, which makes it possible to dispense with any parametric assumption concerning the modeled network properties. Second, we use the defined statistic to develop the Fiedler random field model, which allows for efficient estimation of edge distributions over large-scale random networks. After analyzing the dependence structure involved in Fiedler random fields, we estimate them over several real-world networks, showing that they achieve a much higher modeling accuracy than other well-known statistical approaches.
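For orientation, the standard quantities behind a Laplacian-spectrum statistic can be written out as follows. The Laplacian and the Fiedler value are textbook definitions; the final line is only an assumed reading of what a "Fiedler delta" for a node pair could look like, since the abstract does not state the formula, and the symbols G^+ and G^- are introduced here for illustration.

```latex
% Graph Laplacian of an undirected graph G with adjacency matrix A and degree matrix D
L(G) = D - A
% Its eigenvalues are real and non-negative; the second smallest is the Fiedler value
0 = \lambda_1(G) \le \lambda_2(G) \le \dots \le \lambda_n(G)
% Assumed form of the delta for a pair {u, v}: G^+ contains the edge, G^- does not
\Delta\lambda_2(u, v; G) = \lambda_2(G^+) - \lambda_2(G^-)
```

The Fiedler value is positive exactly when the graph is connected, which is why a change in it can serve as a non-parametric signal of how much a single edge matters to the overall structure.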
A Tale of Two Laws of Semantic Change: Predicting Synonym Changes with Distributional Semantic Models
Lexical Semantic Change is the study of how the meaning of words evolves
through time. Another related question is whether and how lexical relations
over pairs of words, such as synonymy, change over time. There are currently
two competing, apparently opposite hypotheses in the historical linguistic
literature regarding how synonymous words evolve: the Law of Differentiation
(LD) argues that synonyms tend to take on different meanings over time, whereas
the Law of Parallel Change (LPC) claims that synonyms tend to undergo the same
semantic change and therefore remain synonyms. So far, there has been little
research using distributional models to assess to what extent these laws apply
on historical corpora. In this work, we take a first step toward detecting
whether LD or LPC operates for given word pairs. After recasting the problem
into a more tractable task, we combine two linguistic resources to propose the
first complete evaluation framework on this problem and provide empirical
evidence in favor of a dominance of LD. We then propose various computational
approaches to the problem using Distributional Semantic Models and grounded in
recent literature on Lexical Semantic Change detection. Our best approaches
achieve a balanced accuracy above 0.6 on our dataset. We discuss challenges
still faced by these approaches, such as polysemy or the potential confusion
between synonymy and hypernymy.
Comment: Accepted at The 12th Joint Conference on Lexical and Computational
Semantics (*SEM 2023)
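Balanced accuracy, the metric quoted above, is the mean of per-class recalls, so a classifier that always predicts the majority class scores 0.5 on a two-class problem. The sketch below shows the computation on made-up labels; the label counts are invented for illustration and are not the paper's data.

```python
# Illustrative sketch of balanced accuracy (mean per-class recall).
# The toy labels below are invented; they are not taken from the paper.

def balanced_accuracy(y_true, y_pred):
    """Average the recall obtained on each class present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)

y_true = ["LD"] * 8 + ["LPC"] * 2
always_ld = ["LD"] * 10  # majority-class baseline

plain_accuracy = sum(t == p for t, p in zip(y_true, always_ld)) / len(y_true)
majority_balanced = balanced_accuracy(y_true, always_ld)
```

On this toy split the majority baseline reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, which is the chance level against which a score above 0.6 should be read.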
Automated Vocabulary Discovery for Geo-Parsing Online Epidemic Intelligence
Background: Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, on a real-time basis, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of specific interest for national and international public health organizations and international travelers. A particular task that makes such surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts. This task is sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human.
Results: Given that human readers perform this kind of task by using both their lexical and contextual knowledge, we developed an approach which relies on a relatively small expert-built gazetteer, thus limiting the need for human input, but focuses on learning the context in which geographic references appear. We show in a set of experiments that this approach exhibits a substantial capacity to discover geographic locations outside of its initial lexicon.
Conclusion: The results of this analysis provide a framework for future automated global surveillance efforts that reduce manual input and improve timeliness of reporting.
A Multitask Learning Approach to Document Representation using Unlabeled Data
Text categorization is intrinsically a supervised learning task, which aims at relating a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painful task. We present in this paper a method that takes advantage of the huge amounts of unlabeled text documents available in digital format to counterbalance the relatively small amount of labeled text documents available. A Siamese MLP is trained in a multi-task framework in order to solve two concurrent tasks: using the unlabeled data, we search for a mapping from the documents' bag-of-words representation to a new feature space emphasizing similarities and dissimilarities among documents; simultaneously, this mapping is constrained to also give good text categorization performance over the labeled dataset. Experimental results on Reuters RCV1 suggest that, as expected, performance on the labeled task increases as the amount of unlabeled data increases.
Theme Topic Mixture Model: A Graphical Model for Document Representation
Documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a graphical model for representing documents: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other, and Themes gather documents with a particular distribution over the topics. This paper defines the TTMM, compares it to the related Latent Dirichlet Allocation (LDA) model (Blei, 2003) and reports some interesting empirical results.
Facing the facts of fake: a distributional semantics and corpus annotation approach
Fake is often considered the textbook example of a so-called 'privative' adjective, one which, in other words, allows the proposition that '(a) fake x is not (an) x'. This study tests the hypothesis that the contexts of an adjective-noun combination are more different from the contexts of the noun when the adjective is such a 'privative' one than when it is an ordinary (subsective) one. We here use 'embeddings', that is, dense vector representations based on word co-occurrences in a large corpus, which in our study is the entire English Wikipedia as it was in 2013. Comparing the cosine distance between the adjective-noun bigram and single noun embeddings across two sets of adjectives, privative and ordinary ones, we fail to find a noticeable difference. However, we contest that fake is an across-the-board privative adjective, since a fake article, for instance, is most definitely still an article. We extend a recent proposal involving the noun's qualia roles (how an entity is made, what it consists of, what it is used for, etc.) and propose several interpretational types of fake-noun combinations, some but not all of which are privative. These interpretations, which we assign manually to the 100 most frequent fake-noun combinations in the Wikipedia corpus, depend to a large extent on the meaning of the noun, as combinations with similar interpretations tend to involve nouns that are linked in a distributions-based network. When we restrict our focus to the privative uses of fake only, we do detect a slightly enlarged difference between fake + noun bigram and noun distributions compared to the previously obtained average difference between adjective + noun bigram and noun distributions. This result contrasts with negative or even opposite findings reported in the literature.
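The comparison described above rests on the cosine distance between a bigram embedding and a noun embedding. A minimal sketch of that measure follows; the two-dimensional vectors are made-up placeholders, not embeddings from the study's Wikipedia corpus.

```python
# Illustrative sketch of cosine distance between two embedding vectors.
# The toy vectors are placeholders, not data from the study.
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v); 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

noun_vec = [1.0, 2.0]          # hypothetical embedding of a noun
similar_bigram = [2.0, 4.0]    # same direction: contexts look alike
different_bigram = [2.0, -1.0] # orthogonal: contexts look unrelated

d_similar = cosine_distance(noun_vec, similar_bigram)
d_different = cosine_distance(noun_vec, different_bigram)
```

Because the measure ignores vector length, it compares the distribution of contexts rather than raw frequency, which is the property the privative-versus-ordinary hypothesis depends on.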