Word Embeddings for Entity-annotated Texts

Author: A Das
A Spitz
CD Manning
D Nadeau
E Bruni
F Hill
H Abdi
H Rubenstein
J Mitchell
J Strötgen
JG Moreno
L Maaten
P Bojanowski
P Goyal
S Deerwester
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/02/2020

Learned vector representations of words are useful tools for many information retrieval and natural language processing tasks due to their ability to capture lexical semantics. However, while many such tasks involve or even rely on named entities as central components, popular word embedding models have so far failed to include entities as first-class citizens. While it seems intuitive that annotating named entities in the training corpus should result in more intelligent word features for downstream tasks, performance issues arise when popular embedding approaches are naively applied to entity-annotated corpora. Not only are the resulting entity embeddings less useful than expected, but the performance of the non-entity word embeddings also degrades in comparison to those trained on the raw, unannotated corpus. In this paper, we investigate approaches to jointly train word and entity embeddings on a large corpus with automatically annotated and linked entities. We discuss two distinct approaches to the generation of such embeddings, namely the training of state-of-the-art embeddings on raw-text and annotated versions of the corpus, as well as node embeddings of a co-occurrence graph representation of the annotated corpus. We compare the performance of annotated embeddings and classical word embeddings on a variety of word similarity, analogy, and clustering evaluation tasks, and investigate their performance in entity-specific tasks. Our findings show that it takes more than training popular word embedding models on an annotated corpus to create entity embeddings with acceptable performance on common test cases. Based on these results, we discuss how and when node embeddings of the co-occurrence graph representation of the text can restore the performance.

Comment: This paper is accepted at the 41st European Conference on Information Retrieval
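The co-occurrence graph mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors build or weight their graph, so the window size, the `ENT:` prefix for linked entities, and the function name `cooccurrence_graph` are all assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_graph(sentences, window=2):
    """Count undirected co-occurrence edges between tokens that appear
    at most `window` positions apart in the same sentence."""
    edges = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Look ahead at the next `window` tokens only.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    edges[tuple(sorted((left, right)))] += 1
    return edges

# Hypothetical toy corpus: entity mentions are replaced by linked IDs
# (the "ENT:" prefix is an assumption made for this illustration).
corpus = [
    ["ENT:Barack_Obama", "visited", "ENT:Berlin"],
    ["ENT:Barack_Obama", "spoke", "in", "ENT:Berlin"],
]
graph = cooccurrence_graph(corpus, window=2)
for (a, b), weight in sorted(graph.items()):
    print(a, b, weight)
```

Node embeddings would then be learned on this weighted graph rather than on the token sequence. Which node embedding method to use is left open here.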

arXiv.org e-Print Archive

Crossref

A Network Topology Approach to Bot Classification

Author: Aiello Luca Maria
Bezdek J.C.
Danezis George
Douceur John R
Ferguson Niall
Varol Onur
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/09/2018

Automated social agents, or bots, are increasingly becoming a problem on social media platforms. There is a growing body of literature and multiple tools to aid in the detection of such agents on online social networking platforms. We propose that the social network topology of a user would be sufficient to determine whether the user is an automated agent or a human. To test this, we use a publicly available dataset containing users on Twitter labelled as either automated social agents or humans. Using an unsupervised machine learning approach, we obtain a detection accuracy rate of 70%.
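The abstract does not list which topological features or which unsupervised algorithm the authors use. As a hedged illustration of per-user topology features in general, the sketch below computes two common ones, degree and local clustering coefficient, on a hypothetical undirected toy graph. The feature choice and all names are assumptions.

```python
def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbours that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count each connected neighbour pair once (u < v).
    links = sum(1 for u in neighbours for v in neighbours if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def topology_features(adj):
    """Map each node to a (degree, local clustering coefficient) pair."""
    return {node: (len(adj[node]), local_clustering(adj, node)) for node in adj}

# Hypothetical toy graph given as symmetric adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
features = topology_features(adj)
```

Feature vectors of this kind could be passed to any clustering method. The 70% accuracy reported above refers to the authors' own pipeline, not to this sketch.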

arXiv.org e-Print Archive

Crossref

Evolutionary constraints on the complexity of genetic regulatory networks allow predictions of the total number of genetic interactions

Author: Campos-González Adrian I.
Freyre-González Julio A.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/01/2019

Genetic regulatory networks (GRNs) have been widely studied, yet there is a lack of understanding with regard to the final size and properties of these networks, mainly because no network is currently complete. In this study, we analyzed the distribution of GRN structural properties across a large set of distinct prokaryotic organisms and found a set of constrained characteristics such as network density and number of regulators. Our results allowed us to estimate the number of interactions that complete networks would have, a valuable insight that could aid in the daunting task of network curation, prediction, and validation. Using state-of-the-art statistical approaches, we also provided new evidence to settle a previously stated controversy that raised the possibility of complete biological networks being random and therefore attributing the observed scale-free properties to an artifact emerging from the sampling process during network discovery. Furthermore, we identified a set of properties that enabled us to assess the consistency of the connectivity distribution for various GRNs against different alternative statistical distributions. Our results favor the hypothesis that highly connected nodes (hubs) are not a consequence of network incompleteness. Finally, an interaction coverage computed for the GRNs as a proxy for completeness revealed that high-throughput based reconstructions of GRNs could yield biased networks with a low average clustering coefficient, showing that classical targeted discovery of interactions is still needed.

Comment: 28 pages, 5 figures, 12 pages supplementary information
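Two of the structural properties named in the abstract, network density and number of regulators, can be computed from a list of regulator-target pairs. The sketch below uses one common definition of directed density that allows self-loops (edges divided by the square of the node count). Whether the authors use this exact definition is an assumption, and the gene names are hypothetical.

```python
def grn_summary(edges):
    """Summarise a directed network given as (regulator, target) pairs."""
    edges = set(edges)  # ignore repeated interactions
    genes = {gene for edge in edges for gene in edge}
    regulators = {regulator for regulator, _ in edges}
    n = len(genes)
    # Assumed definition: self-loops (autoregulation) count as possible edges.
    density = len(edges) / (n * n) if n else 0.0
    return {
        "genes": n,
        "regulators": len(regulators),
        "interactions": len(edges),
        "density": density,
    }

# Hypothetical toy network, not taken from any real organism.
toy = [("tfA", "tfB"), ("tfA", "geneX"), ("tfB", "tfB"), ("tfB", "geneY")]
summary = grn_summary(toy)
print(summary)
```

This only reproduces the descriptive statistics. The estimate of how many interactions a complete network would have relies on the authors' analysis and is not attempted here.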

arXiv.org e-Print Archive

Directory of Open Access Journals

University of Queensland eSpace

WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking

Author: Cifariello Paolo
Ferragina Paolo
Ponza Marco
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

We present WISER, a new semantic search engine for expert finding in academia. Our system is unsupervised and combines classical language modeling techniques, based on textual evidence, with the Wikipedia Knowledge Graph, via entity linking. WISER indexes each academic author through a novel profiling technique which models her expertise with a small, labeled and weighted graph drawn from Wikipedia. Nodes in this graph are the Wikipedia entities mentioned in the author's publications, whereas the weighted edges express the semantic relatedness among these entities computed via textual and graph-based relatedness functions. Every node is also labeled with a relevance score, which models the pertinence of the corresponding entity to the author's expertise and is computed by means of a random-walk calculation over that graph, and with a latent vector representation, which is learned via entity and other kinds of structural embeddings derived from Wikipedia. At query time, experts are retrieved by combining classic document-centric approaches, which exploit the occurrences of query terms in the author's documents, with a novel set of profile-centric scoring strategies, which compute the semantic relatedness between the author's expertise and the query topic via the above graph-based profiles. The effectiveness of our system is established over a large-scale experimental test on a standard dataset for this task. We show that WISER achieves better performance than all the other competitors, thus proving the effectiveness of modelling an author's profile via our "semantic" graph of entities. Finally, we comment on the use of WISER for indexing and profiling the whole research community within the University of Pisa, and its application to technology transfer in our University.
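The abstract says each entity's relevance score comes from a random-walk calculation over the author's weighted entity graph, without giving the details. The sketch below uses a PageRank-style walk with uniform teleportation as a stand-in. It is not WISER's actual formula, and the damping factor, the toy entities and their weights are assumptions.

```python
def random_walk_scores(weights, damping=0.85, iterations=100):
    """Power iteration for a PageRank-style walk on a weighted graph.

    `weights` maps node -> {neighbour: non-negative weight}; every
    neighbour must also appear as a key of `weights`.
    """
    nodes = sorted(weights)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    out_total = {node: sum(weights[node].values()) for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            if out_total[node] == 0:
                # Dangling node: spread its score uniformly over all nodes.
                for other in nodes:
                    new[other] += damping * scores[node] / n
                continue
            for neighbour, weight in weights[node].items():
                new[neighbour] += damping * scores[node] * weight / out_total[node]
        scores = new
    return scores

# Hypothetical author profile: entities with symmetric relatedness weights.
profile = {
    "Information retrieval": {"Search engine": 0.9, "Entity linking": 0.6},
    "Search engine": {"Information retrieval": 0.9, "Entity linking": 0.2},
    "Entity linking": {"Information retrieval": 0.6, "Search engine": 0.2},
}
scores = random_walk_scores(profile)
```

Entities that are strongly related to many other entities in the profile receive higher scores, which matches the intuition of pertinence to the author's expertise described above.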

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Correlated fragile site expression allows the identification of candidate fragile genes involved in immunity and associated with carcinogenesis

Author: A Caputo
A Matsuyama
A Musio
Alda Maria Puliti
AM Casper
Angela Re
CD Hou
CT Miller
D Corà
D Iliopoulos
Davide Cora
E Birney
H Ishii
I Sbrana
Isabella Sbrana
J Bartkova
J Hoshen
K Mimori
KA Cimprich
KA Nyberg
LV O'Keefe
M Ashburner
M Fabbri
M Schwartz
Michele Caselle
Newman MEJ
NS Chang
P Hoglund
RS Cha
S Corbin
S Gasser
SL Reiner
T Oyama
TW Glover
U Krummrei
VG Gorgoulis
Y Zhu
Publication date: 01/01/2006

Common fragile sites (CFS) are specific regions in the human genome that are particularly prone to genomic instability under conditions of replicative stress. Several investigations support the view that common fragile sites play a role in carcinogenesis. We discuss a genome-wide approach based on graph theory and Gene Ontology vocabulary for the functional characterization of common fragile sites and for the identification of genes that contribute to tumour cell biology. CFS were assembled in a network based on a simple measure of correlation among common fragile site patterns of expression. By applying robust measurements to capture in quantitative terms the non-triviality of the network, we identified several topological features clearly indicating departure from the Erdős-Rényi random graph model. The most important outcome was the presence of an unexpectedly large connected component far below the percolation threshold. Most of the best characterized common fragile sites belonged to this connected component. By filtering this connected component with Gene Ontology, statistically significant shared functional features were detected. Common fragile sites were found to be enriched for genes associated with the immune response and with mechanisms involved in tumour progression such as extracellular space remodeling and angiogenesis. Our results support the hypothesis that fragile sites serve a function; we propose that fragility is linked to a coordinated regulation of fragile gene expression.

Comment: 18 pages, accepted for publication in BMC Bioinformatics
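The percolation argument can be made concrete. In an Erdős-Rényi random graph, a giant connected component is expected only once the mean degree exceeds 1, so a large component at a lower mean degree signals non-random structure. The sketch below computes both quantities with a union-find structure. The toy edge list is hypothetical and is not derived from the fragile site data.

```python
def largest_component_size(num_nodes, edges):
    """Size of the largest connected component, using union-find."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = {}
    for node in range(num_nodes):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)

def mean_degree(num_nodes, edges):
    """Average number of edge endpoints per node in an undirected graph."""
    return 2.0 * len(edges) / num_nodes if num_nodes else 0.0

# Hypothetical toy network: 10 nodes, 4 edges forming one chain of 5 nodes.
num_nodes = 10
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(mean_degree(num_nodes, edges))             # 0.8, below the threshold of 1
print(largest_component_size(num_nodes, edges))  # 5, half of all nodes
```

In the study, the edges would come from thresholding correlations between fragile site expression patterns. That step is omitted because the abstract does not specify the correlation measure or threshold.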

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archivio della Ricerca - Università di Pisa

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Archivio istituzionale della ricerca - Università di Genova

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale