Sparse distributed representations as word embeddings for language understanding
Word embeddings are vector representations of words that capture semantic and syntactic similarities between them. Similar words tend to have closer vector representations in an N-dimensional space considering, for instance, the Euclidean distance between the points associated with the word vector representations in a continuous vector space. This property makes word embeddings valuable in several Natural Language Processing tasks, from word analogy and similarity evaluation to the more complex text categorization, summarization, or translation tasks.
Typically, state-of-the-art word embeddings are dense vector representations with low dimensionality, varying from tens to hundreds of floating-point dimensions, usually obtained from unsupervised learning on considerable amounts of text data by training a neural network and optimizing its objective function.
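As a concrete illustration of the similarity property described above (a minimal sketch with hypothetical toy vectors, not the embeddings evaluated in this work):

```python
import numpy as np

# Hypothetical toy embeddings; real dense embeddings have tens to
# hundreds of dimensions and are learned from large corpora.
embeddings = {
    "king":  np.array([0.8, 0.1, 0.3]),
    "queen": np.array([0.7, 0.2, 0.35]),
    "apple": np.array([0.1, 0.9, 0.6]),
}

def nearest(word, k=2):
    """Return the k words closest to `word` by Euclidean distance."""
    v = embeddings[word]
    dists = {w: np.linalg.norm(v - u) for w, u in embeddings.items() if w != word}
    return sorted(dists, key=dists.get)[:k]

print(nearest("king"))  # ['queen', 'apple']: similar words lie closer
```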
This work presents a methodology to derive word embeddings as binary sparse vectors, i.e., word vector representations with high dimensionality, sparse structure, and binary features (composed only of ones and zeros). The proposed methodology tries to overcome some disadvantages of state-of-the-art approaches, namely the size of the corpus needed to train the model, while achieving comparable performance in several Natural Language Processing tasks.
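A minimal sketch of what such a representation looks like and how similarity can be measured on it (the random construction below is a hypothetical stand-in, not the proposed methodology):

```python
import numpy as np

DIM = 2048     # high dimensionality
ACTIVE = 40    # few active bits, so the vector is sparse

rng = np.random.default_rng(0)

def sparse_binary_vector():
    """Hypothetical stand-in: a binary vector with ACTIVE ones out of DIM."""
    v = np.zeros(DIM, dtype=np.uint8)
    v[rng.choice(DIM, size=ACTIVE, replace=False)] = 1
    return v

def overlap(a, b):
    """Similarity as the number of shared active bits (the dot product)."""
    return int(a @ b)

a, b = sparse_binary_vector(), sparse_binary_vector()
print(overlap(a, b))  # higher overlap = more similar
```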
Results show that high-dimensional sparse binary vector representations, obtained from a very limited amount of training data, achieve comparable performance in intrinsic similarity and categorization tasks, whereas in analogy tasks good results are obtained only for noun categories. Our embeddings outperformed eight state-of-the-art word embeddings in word similarity tasks, and two word embeddings in categorization tasks.
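For reference, the word-analogy evaluation mentioned above answers "a is to b as c is to ?" with simple vector arithmetic (a minimal sketch over a hypothetical embeddings dict like the one earlier):

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Solve a:b :: c:? as the nearest neighbour of v(b) - v(a) + v(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    dists = {w: np.linalg.norm(target - v)
             for w, v in embeddings.items() if w not in (a, b, c)}
    return min(dists, key=dists.get)

# e.g. analogy("man", "king", "woman", embeddings) should ideally return "queen"
```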
Random Projection in Deep Neural Networks
This work investigates the ways in which deep learning methods can benefit
from random projection (RP), a classic linear dimensionality reduction method.
We focus on two areas where, as we have found, employing RP techniques can
improve deep models: training neural networks on high-dimensional data and
initialization of network parameters. Training deep neural networks (DNNs) on
sparse, high-dimensional data with no exploitable structure implies a network
architecture with an input layer that has a huge number of weights, which often
makes training infeasible. We show that this problem can be solved by
prepending the network with an input layer whose weights are initialized with
an RP matrix. We propose several modifications to the network architecture and
training regime that make it possible to efficiently train DNNs with a learnable
RP layer on data with as many as tens of millions of input features and
training examples. In comparison to the state-of-the-art methods, neural
networks with RP layer achieve competitive performance or improve the results
on several extremely high-dimensional real-world datasets. The second area
where the application of RP techniques can be beneficial for training deep
models is weight initialization. Setting the initial weights in DNNs to
elements of various RP matrices enabled us to train residual deep networks to
higher levels of performance.
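A minimal sketch of the first idea, an RP input layer (toy sizes and a hypothetical dense Gaussian RP matrix; the thesis studies several RP schemes, scales to tens of millions of features, and makes the layer learnable):

```python
import numpy as np
from scipy import sparse

d_in, d_proj = 100_000, 256   # toy sizes; the real setting is far larger
rng = np.random.default_rng(0)

# Sparse, unstructured input batch: almost all features are zero.
x = sparse.random(32, d_in, density=1e-4, format="csr", random_state=0)

# Gaussian RP matrix, scaled so projected vector norms are roughly preserved.
R = rng.standard_normal((d_in, d_proj)).astype(np.float32) / np.sqrt(d_proj)

# The RP "input layer" is just this product; downstream dense layers then
# operate on the 256-dimensional projection (and R itself may be fine-tuned).
h = x @ R
print(h.shape)  # (32, 256)
```

The same kind of matrix, used as the initial value of ordinary weight matrices, is what the second application (RP-based weight initialization) refers to.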
edge2vec: Representation learning using edge semantics for biomedical knowledge discovery
Representation learning provides new and powerful graph analytical approaches
and tools for the highly valued data science challenge of mining knowledge
graphs. Since previous graph analytical methods have mostly focused on
homogeneous graphs, an important current challenge is extending this
methodology for richly heterogeneous graphs and knowledge domains. The
biomedical sciences are such a domain, reflecting the complexity of biology,
with entities such as genes, proteins, drugs, diseases, and phenotypes, and
relationships such as gene co-expression, biochemical regulation, and
biomolecular inhibition or activation. Therefore, the semantics of edges and
nodes are critical for representation learning and knowledge discovery in real
world biomedical problems. In this paper, we propose the edge2vec model, which
represents graphs considering edge semantics. An edge-type transition matrix is
trained by an Expectation-Maximization approach, and a stochastic gradient
descent model is employed to learn node embeddings on a heterogeneous graph via
the trained transition matrix. edge2vec is validated on three biomedical domain
tasks: biomedical entity classification, compound-gene bioactivity prediction,
and biomedical information retrieval. Results show that, by incorporating
edge types into node embedding learning in heterogeneous graphs, edge2vec
significantly outperforms state-of-the-art models on all three tasks. We
propose this method for its added value relative to existing graph analytical
methodology, and for its applicability in the real-world context of biomedical
knowledge discovery.
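A minimal sketch of the walk step that the trained transition matrix biases (the graph, edge types, and matrix values below are hypothetical; edge2vec learns the matrix with EM and feeds the resulting walks to a skip-gram-style embedding model):

```python
import random

# Hypothetical heterogeneous graph: node -> [(neighbour, edge_type), ...]
graph = {
    "geneA":    [("drugX", "targets"), ("geneB", "coexpressed")],
    "drugX":    [("geneA", "targets"), ("disease1", "treats")],
    "geneB":    [("geneA", "coexpressed")],
    "disease1": [("drugX", "treats")],
}

# Hypothetical learned edge-type transition weights M[prev_type][next_type].
M = {
    "targets":     {"targets": 0.2, "treats": 0.7, "coexpressed": 0.1},
    "treats":      {"targets": 0.6, "treats": 0.1, "coexpressed": 0.3},
    "coexpressed": {"targets": 0.5, "treats": 0.2, "coexpressed": 0.3},
}

def walk(start, length, rng=random.Random(0)):
    """Random walk whose next edge is biased by the previous edge's type."""
    node, prev_type, path = start, None, [start]
    for _ in range(length):
        nbrs = graph[node]
        weights = [1.0 if prev_type is None else M[prev_type][t] for _, t in nbrs]
        node, prev_type = rng.choices(nbrs, weights=weights)[0]
        path.append(node)
    return path

print(walk("geneA", 5))
```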
Toward Concept-Based Text Understanding and Mining
There is a huge amount of text information in the world, written in natural languages. Most of the text information is hard to access compared with other well-structured information sources such as relational databases. This is because reading and understanding text requires the ability to disambiguate text fragments at several levels, syntactically and semantically, abstracting away details and using background knowledge in a variety of ways. One possible solution to these problems is to implement a framework of concept-based text understanding and mining, that is, a mechanism of analyzing and integrating segregated information, and a framework of organizing, indexing, accessing textual information centered around real-world concepts.
A fundamental difficulty toward this goal is caused by the concept ambiguity of natural language. In text, real-world entities are referred to using their names. The variability in writing a given concept, along with the fact that different concepts/entities may have very similar writings, poses a significant challenge to progress in text understanding and mining. Supporting concept-based natural language understanding requires resolving conceptual ambiguity, and in particular, identifying whether different mentions of real-world entities, within and across documents, actually represent the same concept.
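As a concrete (and deliberately naive) illustration of why this is hard, a pure string-similarity baseline like the sketch below fails in both directions described above; the learning techniques developed in the thesis are meant to do better:

```python
from difflib import SequenceMatcher

def same_concept(m1: str, m2: str, threshold: float = 0.8) -> bool:
    """Naive baseline: two mentions denote the same concept if their
    lower-cased strings are similar enough."""
    return SequenceMatcher(None, m1.lower(), m2.lower()).ratio() >= threshold

# Variability: one entity, different writings -> missed match.
print(same_concept("President Kennedy", "JFK"))               # False
# Similar writings, different entities -> spurious match.
print(same_concept("John F. Kennedy", "John F. Kennedy Jr.")) # True
```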
This thesis systematically studies this fundamental problem. We study and propose different machine learning techniques to address different aspects of this problem, and show that as more information can be exploited, the learning techniques developed accordingly can continuously improve the identification accuracy. In addition, we extend our global probabilistic model to address a significant application: semantic integration between text and databases.
Proximal Methods for Hierarchical Sparse Coding
Sparse coding consists in representing signals as sparse linear combinations
of atoms selected from a dictionary. We consider an extension of this framework
where the atoms are further assumed to be embedded in a tree. This is achieved
using a recently introduced tree-structured sparse regularization norm, which
has proven useful in several applications. This norm leads to regularized
problems that are difficult to optimize, and we propose in this paper efficient
algorithms for solving them. More precisely, we show that the proximal operator
associated with this norm is computable exactly via a dual approach that can be
viewed as the composition of elementary proximal operators. Our procedure has a
complexity linear, or close to linear, in the number of atoms, and allows the
use of accelerated gradient techniques to solve the tree-structured sparse
approximation problem at the same computational cost as traditional ones using
the L1-norm. Our method is efficient and scales gracefully to millions of
variables, which we illustrate in two types of applications: first, we consider
fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we
apply our optimization tools in the context of dictionary learning, where
learned dictionary elements naturally organize in a prespecified arborescent
structure, leading to better performance in the reconstruction of natural image
patches. When applied to text documents, our method learns hierarchies of
topics, thus providing a competitive alternative to probabilistic topic models.
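A minimal sketch of the composition property the abstract relies on: for tree-structured groups with l2 norms, the proximal operator of the regularizer can be computed by applying elementary group soft-thresholdings with child groups processed before the groups that contain them (the tiny tree below is hypothetical; see the paper for the exact ordering and the dual derivation):

```python
import numpy as np

def group_prox(v, idx, tau):
    """Elementary proximal operator: l2 soft-thresholding of coordinates
    idx, i.e. the prox of tau * ||v[idx]||_2."""
    norm = np.linalg.norm(v[idx])
    v[idx] *= max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
    return v

def tree_prox(v, groups, tau):
    """Prox of the tree-structured norm as a composition of elementary
    proxes; `groups` is assumed ordered leaves -> root."""
    v = v.astype(float).copy()
    for idx in groups:
        v = group_prox(v, idx, tau)
    return v

# Hypothetical tree over 4 variables: leaves {2} and {3}, their parent
# {1, 2, 3}, and the root {0, 1, 2, 3}.
groups = [[2], [3], [1, 2, 3], [0, 1, 2, 3]]
u = np.array([0.9, 0.1, 0.4, -0.05])
print(tree_prox(u, groups, tau=0.2))  # one pass per group: linear in atoms
```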