OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Models
Enriching language models with domain knowledge is crucial but difficult.
Based on the world's largest public academic graph Open Academic Graph (OAG),
we pre-train an academic language model, namely OAG-BERT, which integrates
massive heterogeneous entities including paper, author, concept, venue, and
affiliation. To better endow OAG-BERT with the ability to capture entity
information, we develop novel pre-training strategies including heterogeneous
entity type embedding, entity-aware 2D positional encoding, and span-aware
entity masking. For zero-shot inference, we design a special decoding strategy
to allow OAG-BERT to generate entity names from scratch. We evaluate
OAG-BERT on various downstream academic tasks, including NLP benchmarks,
zero-shot entity inference, heterogeneous graph link prediction, and author
name disambiguation. Results demonstrate the effectiveness of the proposed
pre-training approach to both comprehending academic texts and modeling
knowledge from heterogeneous entities. OAG-BERT has been deployed to multiple
real-world applications, such as reviewer recommendations and paper tagging in
the AMiner system. It is also available to the public through the CogDL
package.
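The span-aware entity masking mentioned above can be illustrated with a minimal sketch: instead of masking tokens independently, whole entity spans are masked together so the model must recover complete entity names. The tokens, spans, and mask rate below are hypothetical illustrations, not OAG-BERT's actual configuration.

```python
import random

def span_mask(tokens, entity_spans, mask_rate=0.15, mask_token="[MASK]"):
    """Mask whole entity spans rather than independent tokens,
    forcing the model to reconstruct the full entity name."""
    out = list(tokens)
    for start, end in entity_spans:      # spans are [start, end) token ranges
        if random.random() < mask_rate:
            for i in range(start, end):
                out[i] = mask_token
    return out

random.seed(0)
tokens = ["graph", "neural", "networks", "at", "Tsinghua", "University"]
spans = [(0, 3), (4, 6)]   # "graph neural networks", "Tsinghua University"
print(span_mask(tokens, spans, mask_rate=0.8))
```

Because a span is masked or kept as a unit, the pre-training signal is an entire entity name rather than isolated word pieces.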
edge2vec: Representation learning using edge semantics for biomedical knowledge discovery
Representation learning provides new and powerful graph analytical approaches
and tools for the highly valued data science challenge of mining knowledge
graphs. Since previous graph analytical methods have mostly focused on
homogeneous graphs, an important current challenge is extending this
methodology for richly heterogeneous graphs and knowledge domains. The
biomedical sciences are such a domain, reflecting the complexity of biology,
with entities such as genes, proteins, drugs, diseases, and phenotypes, and
relationships such as gene co-expression, biochemical regulation, and
biomolecular inhibition or activation. Therefore, the semantics of edges and
nodes are critical for representation learning and knowledge discovery in real
world biomedical problems. In this paper, we propose the edge2vec model, which
represents graphs considering edge semantics. An edge-type transition matrix is
trained by an Expectation-Maximization approach, and a stochastic gradient
descent model is employed to learn node embedding on a heterogeneous graph via
the trained transition matrix. edge2vec is validated on three biomedical domain
tasks: biomedical entity classification, compound-gene bioactivity prediction,
and biomedical information retrieval. Results show that by incorporating
edge types into node embedding learning in heterogeneous graphs,
edge2vec significantly outperforms state-of-the-art models on all
three tasks. We propose this method for its added value relative to existing
graph analytical methodology, and for its applicability to real-world biomedical
knowledge discovery.
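The core idea — walks biased by an edge-type transition matrix — can be sketched briefly. This is a simplified illustration with a fixed hand-set matrix, whereas edge2vec learns the matrix via Expectation-Maximization; the toy graph, edge-type names, and weights are invented for the example, and the resulting walks would then be fed to a skip-gram model trained by stochastic gradient descent.

```python
import random
from collections import defaultdict

# Toy heterogeneous graph: (node_u, node_v, edge_type)
# Hypothetical edge types: 0 = "binds", 1 = "regulates"
EDGES = [("geneA", "drugX", 0), ("drugX", "disease1", 1),
         ("geneA", "geneB", 1), ("geneB", "disease1", 0)]

def build_adjacency(edges):
    adj = defaultdict(list)
    for u, v, t in edges:
        adj[u].append((v, t))
        adj[v].append((u, t))
    return adj

def biased_walk(adj, start, length, trans, prev_type=0):
    """Random walk whose next step is weighted by the edge-type
    transition matrix trans[prev_type][next_type]."""
    walk = [start]
    node = start
    for _ in range(length - 1):
        nbrs = adj[node]
        weights = [trans[prev_type][t] for _, t in nbrs]
        node, prev_type = random.choices(nbrs, weights=weights)[0]
        walk.append(node)
    return walk

# Fixed 2x2 edge-type transition matrix for illustration
# (edge2vec would estimate this with EM instead):
TRANS = [[0.7, 0.3], [0.4, 0.6]]

random.seed(1)
adj = build_adjacency(EDGES)
walk = biased_walk(adj, "geneA", 5, TRANS)
print(walk)
```

The transition matrix makes some edge-type sequences more probable than others, so the walks — and hence the learned embeddings — respect edge semantics rather than treating all relations identically.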
Neural Graph Transfer Learning in Natural Language Processing Tasks
Natural language is essential in our daily lives as we rely on languages to communicate and exchange information. A fundamental goal for natural language processing (NLP) is to let the machine understand natural language to help or replace human experts in mining knowledge and completing tasks. Many NLP tasks deal with sequential data. For example, a sentence is considered a sequence of words. Very recently, deep learning-based language models (e.g., BERT \citep{devlin2018bert}) achieved significant improvement in many existing tasks, including text classification and natural language inference. However, not all tasks can be formulated using sequence models. Specifically, graph-structured data is also fundamental in NLP, including entity linking, entity classification, relation extraction, abstract meaning representation, and knowledge graphs \citep{santoro2017simple,hamilton2017representation,kipf2016semi}. In this scenario, BERT-based pretrained models may not be suitable. The Graph Convolutional Network (GCN) \citep{kipf2016semi} is a deep neural network model designed for graphs. It has shown great potential in text classification, link prediction, question answering, and more. This dissertation presents novel graph models for NLP tasks, including text classification, prerequisite chain learning, and coreference resolution. We focus on different perspectives of graph convolutional network modeling: for text classification, a novel graph construction method is proposed that allows interpretability of the prediction; for prerequisite chain learning, we propose multiple aggregation functions that utilize neighbors for better information exchange; for coreference resolution, we study how graph pretraining can help when labeled data is limited. Moreover, an important branch is to apply pretrained language models to the mentioned tasks.
This dissertation therefore also focuses on transfer learning methods that generalize pretrained models to other domains, including medical, cross-lingual, and web data. Finally, we propose a new task called unsupervised cross-domain prerequisite chain learning, and study novel graph-based methods to transfer knowledge over graphs.
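The GCN propagation rule underlying these models can be sketched in a few lines. This is a minimal, weight-free sketch of the symmetric-normalized propagation H' = D^{-1/2}(A + I)D^{-1/2}H; a real GCN layer additionally multiplies by a learned weight matrix and applies a nonlinearity.

```python
import math

def gcn_propagate(adj, feats):
    """One graph-convolution propagation step (Kipf & Welling style):
    H' = D^{-1/2} (A + I) D^{-1/2} H, without weights or nonlinearity."""
    n = len(adj)
    # Add self-loops so each node keeps its own features
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    # Symmetric normalization by sqrt of both endpoint degrees
    a_norm = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    # Multiply the normalized adjacency by the feature matrix
    return [[sum(a_norm[i][k] * feats[k][j] for k in range(n))
             for j in range(len(feats[0]))] for i in range(n)]

# Path graph 0-1-2 with one-hot node features
ADJ = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
FEATS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = gcn_propagate(ADJ, FEATS)
```

With one-hot features, the output is exactly the normalized adjacency, which makes the smoothing effect of one propagation step easy to inspect.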
Vector representation of Internet domain names using Word embedding techniques
Word embeddings are a well-known set of techniques widely used in
natural language processing (NLP). This thesis explores the use of word
embeddings in a new scenario. A vector space model (VSM) for Internet
domain names (DNS) is created by taking core ideas from NLP techniques
and applying them to real anonymized DNS log queries from a large
Internet Service Provider (ISP). The main goal is to find semantically
similar domains using only information from DNS queries, without any other
knowledge about the content of those domains.
A set of transformations through a detailed preprocessing pipeline
with eight specific steps is defined to move the original problem to a
problem in the NLP field. Once the preprocessing pipeline is applied and
the DNS log files are transformed to a standard text corpus, we show that
state-of-the-art techniques for word embeddings can be successfully
applied in order to build what we call a DNS-VSM (a vector space model
for Internet domain names).
Different word embedding techniques are evaluated in this work:
Word2Vec (with Skip-Gram and CBOW architectures), App2Vec (with a
CBOW architecture and adding time gaps between DNS queries), and
FastText (which includes sub-word information).
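The sub-word information that distinguishes FastText boils down to character n-grams: each token is wrapped in boundary markers and decomposed into overlapping n-grams whose vectors are summed into the token's representation. A minimal sketch of the extraction step follows; the n-gram range here is illustrative (FastText's documented defaults are 3 to 6 characters).

```python
def char_ngrams(token, n_min=3, n_max=4):
    """FastText-style sub-word features: wrap the token in boundary
    markers and extract all character n-grams of length n_min..n_max."""
    word = "<" + token + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

print(char_ngrams("bank"))
```

Because domain names are short, often unseen strings with meaningful fragments (brand names, TLD-like pieces), sharing vectors across such n-grams is a plausible reason FastText performed best in this setting.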
The obtained results are compared using various metrics from Information
Retrieval theory, and the quality of the learned vectors is validated against a
third-party source, namely the similar-sites service offered by Alexa Internet,
Inc.
Due to intrinsic characteristics of domain names, we found that FastText is
the best option for building a vector space model for DNS. Furthermore, its
performance (considering the top 3 most similar learned vectors to each
domain) is compared against two baseline methods: Random Guessing
(returning randomly any domain name from the dataset) and Zero Rule
(returning always the same most popular domains), outperforming both of
them considerably.
The results presented in this work can be useful in many
engineering activities, with practical application in many areas. Some
examples include websites recommendations based on similar sites,
competitive analysis, identification of fraudulent or risky sites,
parental-control systems, UX improvements (based on recommendations,
spell correction, etc.), click-stream analysis, representation and clustering
of users navigation profiles, optimization of cache systems in recursive
DNS resolvers (among others).
Finally, as a contribution to the research community a set of vectors
of the DNS-VSM trained on a similar dataset to the one used in this thesis
is released and made available for download through the GitHub page in
[1]. With this we hope that further work and research can be done using
these vectors.
Graph enabled cross-domain knowledge transfer
The world has never been more connected, led by the information technology revolution of the past decades that has fundamentally changed the way people interact with each other through social networks. Consequently, enormous amounts of human activity data are collected from the business world, and machine learning techniques are widely adopted to aid our decision processes. Despite the success of machine learning in various application scenarios, many questions remain open, such as how to optimize machine learning outcomes when the desired knowledge cannot be extracted from the available data. This naturally drives us to ponder whether one can leverage side information to populate the knowledge domain of interest, such that problems within that knowledge domain can be better tackled.
In this work, such problems are investigated and practical solutions are proposed. To leverage machine learning in any decision-making process, one must convert the given knowledge (for example, natural language or unstructured text) into representation vectors that can be understood and processed by machine learning models in a compatible language and data format. The frequently encountered difficulty, however, is that the given knowledge is not rich or reliable enough in the first place. In such cases, one seeks to fuse side information from a separate domain to bridge the gap between good representation learning and the scarce knowledge in the domain of interest. This approach is named Cross-Domain Knowledge Transfer. The problem is crucial to study because scarce knowledge is common in many scenarios, from online healthcare platform analyses to financial market risk quantification, and stands as an obstacle to benefiting from automated decision making. From the machine learning perspective, the paradigm of semi-supervised learning takes advantage of large amounts of data without ground truth and achieves impressive improvements in learning performance. It is adopted in this dissertation for cross-domain knowledge transfer.
Furthermore, graph learning techniques are indispensable given that networks commonly exist in the real world, such as taxonomy networks and scholarly article citation networks. These networks contain additional useful knowledge and ought to be incorporated in the learning process, where they serve as an important lever in solving the problem of cross-domain knowledge transfer. This dissertation proposes graph-based learning solutions and demonstrates their practical usage via empirical studies on real-world applications. Another line of effort in this work lies in leveraging the rich capacity of neural networks to improve learning outcomes, as we are in the era of big data.
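The semi-supervised paradigm on graphs can be illustrated with classic label propagation: labels on a few anchored nodes diffuse along edges to the unlabeled majority. This is a minimal sketch, not the dissertation's actual method; the toy graph and labels are invented for the example.

```python
def label_propagation(adj, labels, alpha=0.8, iters=50):
    """Semi-supervised label propagation: repeatedly average a node's
    score with its neighbors', keeping labeled nodes anchored.
    labels[i] is +1/-1 for labeled nodes, 0 for unlabeled."""
    n = len(adj)
    deg = [max(sum(row), 1) for row in adj]
    y = list(labels)
    for _ in range(iters):
        y = [alpha * sum(adj[i][j] * y[j] for j in range(n)) / deg[i]
             + (1 - alpha) * labels[i] for i in range(n)]
    return y

# Path graph 0-1-2-3: node 0 labeled +1, node 3 labeled -1
ADJ = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
scores = label_propagation(ADJ, [1, 0, 0, -1])
```

After convergence, unlabeled nodes score closer to whichever anchor is nearer in the graph — the essential mechanism by which graph structure lets scarce labels inform an entire domain.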
In contrast to many Graph Neural Networks that directly iterate on the graph adjacency matrix to approximate graph convolution filters, this work also proposes an efficient eigenvalue-learning method that directly optimizes the graph convolution in the spectral space. This work articulates the importance of the network spectrum and provides detailed analyses of the spectral properties of the proposed EigenLearn method, which aligns well with a series of models that attempt to give graph neural networks a meaningful spectral interpretation. The dissertation also addresses efficiency, in two respects. First, by adopting approximate solutions it mitigates the complexity of graph-related algorithms, which are naturally quadratic in most cases and do not scale to large datasets. Second, it reduces the storage and computation overhead of deep neural networks, so that they can be deployed on many light-weight devices, significantly broadening their applicability. Finally, the dissertation concludes with future directions.
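Optimizing the convolution directly in spectral space amounts to choosing (or learning) a response g(lambda) for each Laplacian eigenvalue and filtering a graph signal as y = U g(Lambda) U^T x. The sketch below uses a fixed heat-kernel filter on a two-node graph whose eigendecomposition is analytic; EigenLearn's actual parameterization of g is not reproduced here.

```python
import math

def spectral_filter(eigvals, eigvecs, signal, g):
    """Filter a graph signal in the spectral domain:
    y = U g(Lambda) U^T x, with U's columns the Laplacian eigenvectors."""
    n = len(signal)
    # Project onto the eigenbasis: c = U^T x
    coeffs = [sum(eigvecs[k][i] * signal[k] for k in range(n)) for i in range(n)]
    # Scale each component by the filter response g(lambda_i)
    coeffs = [g(eigvals[i]) * coeffs[i] for i in range(n)]
    # Transform back to the vertex domain: y = U c
    return [sum(eigvecs[k][i] * coeffs[i] for i in range(n)) for k in range(n)]

# Two-node graph: Laplacian [[1, -1], [-1, 1]] has eigenvalues 0 and 2
s = 1 / math.sqrt(2)
EIGVALS = [0.0, 2.0]
EIGVECS = [[s, s], [s, -s]]   # columns are the eigenvectors
y = spectral_filter(EIGVALS, EIGVECS, [1.0, 0.0], lambda lam: math.exp(-lam))
```

Learning the responses g(lambda_i) directly, instead of iterating on the adjacency, is the design choice the abstract attributes to the EigenLearn method.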