24 research outputs found

    Novel measures on directed graphs and applications to large-scale within-network classification

    No full text
    In recent years, networks have become a major data source in fields ranging from the social sciences to the mathematical and physical sciences. The size of available networks has also grown substantially. This has brought a number of new challenges, such as the need for precise and intuitive measures to characterize and analyze large-scale networks in a reasonable time. The first part of this thesis introduces a novel measure between two nodes of a weighted directed graph: the sum-over-paths covariance. It has a clear and intuitive interpretation: two nodes are considered highly correlated if they often co-occur on the same (preferably short) paths. This measure depends on a probability distribution over the countable (usually infinite) set of paths through the graph, obtained by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. The entropy parameter biases the probability distribution over a wide spectrum, from natural random walks (where all paths are equiprobable) to walks biased towards shortest paths. The measure is then applied to semi-supervised classification problems on medium-size networks and compared to state-of-the-art techniques. The second part introduces three novel algorithms for within-network classification in large-scale networks, i.e., classification of nodes in partially labeled graphs. The algorithms have a computing time that is linear in the number of edges, classes and steps, and hence can be applied to large-scale networks. They obtained competitive results in comparison to state-of-the-art techniques on the large-scale U.S. patents citation network and on eight other data sets. Furthermore, during the thesis, we collected a novel benchmark data set: the U.S. patents citation network. This data set is now available to the community for benchmarking purposes. The final part of the thesis concerns the combination of a citation graph with information on its nodes. We show empirically that citation-based data provide better classification results than content-based data. We also show empirically that combining both sources of information (content-based and citation-based) should be considered when facing a text categorization problem. For instance, when classifying journal papers, extracting an external citation graph may considerably boost performance. However, in another context, when the nodes of the citation network have to be classified directly, features on the nodes will not necessarily improve the results. The theory, algorithms and applications presented in this thesis provide interesting perspectives in various fields.
    Doctorat en Sciences
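The constrained problem described in this abstract (minimum expected cost at fixed relative entropy) can be written out as a short worked formula. This is a sketch under assumptions: the symbols \wp, c(\wp), P^{ref}, J_0 and \theta are notation introduced here for illustration, and the Gibbs-Boltzmann form on the right is the standard solution of this kind of entropy-constrained problem, not a formula quoted from the thesis.

```latex
% Assumed notation: \mathcal{P} = set of paths, c(\wp) = total cost of path \wp,
% P^{ref} = natural random-walk distribution over paths, J_0 = fixed relative entropy.
\begin{aligned}
\min_{P}\quad & \sum_{\wp \in \mathcal{P}} P(\wp)\, c(\wp) \\
\text{s.t.}\quad & \sum_{\wp \in \mathcal{P}} P(\wp) \ln \frac{P(\wp)}{P^{\mathrm{ref}}(\wp)} = J_0,
\qquad \sum_{\wp \in \mathcal{P}} P(\wp) = 1
\end{aligned}
\quad\Longrightarrow\quad
P(\wp) = \frac{P^{\mathrm{ref}}(\wp)\, e^{-\theta\, c(\wp)}}
              {\sum_{\wp' \in \mathcal{P}} P^{\mathrm{ref}}(\wp')\, e^{-\theta\, c(\wp')}}
```

Here \theta is the Lagrange parameter tied to J_0: as \theta approaches 0 the distribution falls back to the reference random walk, and as \theta grows the probability mass concentrates on the lowest-cost paths.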

    Item Cold-Start Recommendations: Learning Local Collective Embeddings

    No full text
    Recommender systems suggest to users items that they might like (e.g., news articles, songs, movies) and, in doing so, help users deal with information overload and enjoy a personalized experience. One of the main problems of these systems is the item cold-start: when a new item is introduced in the system and no past information is available, no effective recommendations can be produced. The item cold-start is a very common problem in practice: modern online platforms have hundreds of new items published every day. To address this problem, we propose to learn Local Collective Embeddings: a matrix factorization that exploits items' properties and past user preferences while enforcing the manifold structure exhibited by the collective embeddings. We present a learning algorithm based on multiplicative update rules that are efficient and easy to implement. The experimental results on two item cold-start use cases, news recommendation and email recipient recommendation, demonstrate the effectiveness of this approach and show that it significantly outperforms six state-of-the-art methods for item cold-start.
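To give a feel for what "multiplicative update rules" look like, here is a minimal sketch of the classic multiplicative updates for plain non-negative matrix factorization (Frobenius loss). It is deliberately not the Local Collective Embeddings algorithm: there is no item-property matrix and no manifold regularizer, and the function name `nmf_multiplicative` and its parameters are hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (n x m) as W @ H with W, H >= 0.

    Generic NMF with multiplicative updates; an illustration only,
    not the algorithm from the abstract above.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    errors = [np.linalg.norm(X - W @ H)]  # reconstruction error before any update
    for _ in range(n_iter):
        # Each factor is rescaled element-wise by a ratio of non-negative
        # terms, so it stays non-negative without any projection step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors
```

The appeal of this family of rules is visible in the loop: no step size to tune, and each iteration is a handful of matrix products.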

    Unsupervised domain adaptation with non-stochastic missing data

    No full text
    We consider unsupervised domain adaptation (UDA) for classification problems in the presence of missing data in the unlabelled target domain. More precisely, motivated by practical applications, we analyze situations where distribution shift exists between domains and where some components are systematically absent on the target domain, without available supervision for imputing the missing target components. We propose a generative approach for imputation. Imputation is performed in a domain-invariant latent space and leverages indirect supervision from a complete source domain. We introduce a single model performing joint adaptation, imputation and classification which, under our assumptions, minimizes an upper bound of its target generalization error and performs well under various representative divergence families (H-divergence, Optimal Transport). Moreover, we compare the target error of our Adaptation-imputation framework and the "ideal" target error of a UDA classifier without missing target components. Our model is further improved with self-training, to bring the learned source and target class posterior distributions closer. We perform experiments on three families of datasets of different modalities: a classical digit classification benchmark and the Amazon product reviews dataset, both commonly used in UDA, and real-world digital advertising datasets. We show the benefits of jointly performing adaptation, classification and imputation on these datasets.

    A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances

    No full text
    This work introduces a new family of link-based dissimilarity measures between nodes of a weighted directed graph. This measure, called the randomized shortest-path (RSP) dissimilarity, depends on a parameter θ and has the interesting property of reducing, on one end, to the standard shortest-path distance when θ is large and, on the other end, to the commute-time (or resistance) distance when θ is small (near zero). Intuitively, it corresponds to the expected cost incurred by a random walker in order to reach a destination node from a starting node while maintaining a constant entropy (related to θ) spread in the graph. The parameter θ therefore gradually biases the simple random walk on the graph towards the shortest-path policy. By adopting a statistical physics approach and computing a sum over all the possible paths (discrete path integral), it is shown that the RSP dissimilarity from every node to a particular node of interest can be computed efficiently by solving two linear systems of n equations, where n is the number of nodes. On the other hand, the dissimilarity between every pair of nodes is obtained by inverting an n × n matrix. The proposed measure can be used for various graph mining tasks such as computing betweenness centrality, finding dense communities, etc., as shown in the experimental section.
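The all-pairs computation mentioned at the end of this abstract (one n × n matrix inversion) can be sketched as follows. This is a hedged illustration, not the paper's own algorithm: it assumes the closed-form matrix expression commonly used for RSP-style expected costs, a strongly connected graph, positive edge costs, and a hypothetical function name `rsp_dissimilarity`.

```python
import numpy as np

def rsp_dissimilarity(A, C, theta):
    """All-pairs randomized shortest-path dissimilarity (illustrative sketch).

    A: (n, n) non-negative adjacency matrix of a strongly connected graph.
    C: (n, n) finite edge costs, positive wherever A > 0.
    theta: positive parameter; large values favor shortest paths.
    """
    A = np.asarray(A, dtype=float)
    C = np.asarray(C, dtype=float)
    n = A.shape[0]
    # Transition probabilities of the simple (reference) random walk.
    P_ref = A / A.sum(axis=1, keepdims=True)
    # Down-weight every transition exponentially in its cost.
    W = P_ref * np.exp(-theta * C)
    # The single n x n inversion: Z[i, j] sums the weights of all paths i -> j.
    Z = np.linalg.inv(np.eye(n) - W)
    # Expected cost over all paths i -> j, including those passing through j.
    S = (Z @ (C * W) @ Z) / Z
    # Remove the cost accumulated after first reaching j (keep hitting paths only).
    C_bar = S - np.diag(S)[None, :]
    # Symmetrize to obtain a dissimilarity.
    return 0.5 * (C_bar + C_bar.T)
```

On a three-node path graph with unit costs, a large θ should give values close to the shortest-path distances (2 between the end nodes), while a small θ should approach the average hitting cost of the simple random walk (4 between the end nodes), which is the behavior the abstract describes at the two ends of the spectrum.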

    Position-Aware Deep Character-Level CTR Prediction for Sponsored Search

    No full text

    HEADS: Headline Generation as Sequence Prediction Using an Abstract Feature-Rich Space

    No full text
    Automatic headline generation is a sub-task of document summarization with many reported applications. In this study we present a sequence-prediction technique for learning how editors title their news stories. The introduced technique models the problem as a discrete optimization task in a feature-rich space. In this space the global optimum can be found in polynomial time by means of dynamic programming. We train and test our model on an extensive corpus of financial news, and compare it against a number of baselines using standard metrics from the document summarization domain, as well as some new ones proposed in this work. We also assess the readability and informativeness of the generated titles through human evaluation. The obtained results are very appealing and substantiate the soundness of the approach.