SubGram: Extending Skip-gram Word Representation with Substrings
Skip-gram (word2vec) is a recent method for creating vector representations
of words ("distributed word representations") using a neural network. The
representation gained popularity in various areas of natural language
processing, because it seems to capture syntactic and semantic information
about words without any explicit supervision in this respect. We propose
SubGram, a refinement of the Skip-gram model that also takes word structure
into account during training, achieving large gains on the original Skip-gram
test set.
Comment: Published at TSD 201
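The abstract does not spell out the exact decomposition, but the core idea can
be sketched in a few lines of Python: break each word into boundary-marked
character n-grams that a Skip-gram-style model could consume alongside the
word itself (the helper below is a hypothetical illustration, not the paper's
code).

    # Hypothetical illustration of the substring idea: each word contributes
    # its boundary-marked character n-grams as additional input features.
    def substring_features(word, min_n=2, max_n=4):
        """Return the word plus its boundary-marked character n-grams."""
        marked = f"^{word}$"  # boundary marks distinguish prefixes/suffixes
        feats = {word}
        for n in range(min_n, max_n + 1):
            for i in range(len(marked) - n + 1):
                feats.add(marked[i:i + n])
        return sorted(feats)

    print(substring_features("cats"))  # ['^c', '^ca', '^cat', ..., 'ts$']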
Acquiring Receptive Morphology: A Connectionist Model
This paper describes a modular connectionist model of the acquisition of
receptive inflectional morphology. The model takes inputs in the form of phones
one at a time and outputs the associated roots and inflections. Simulations
using artificial language stimuli demonstrate the capacity of the model to
learn suffixation, prefixation, infixation, circumfixation, mutation, template,
and deletion rules. Separate network modules responsible for syllables enable
the network to learn simple reduplication rules as well. The model also
embodies constraints against association-line crossing.
Comment: 8 pages, Postscript file; extract with Unix uudecode and uncompress
Transfer in a Connectionist Model of the Acquisition of Morphology
The morphological systems of natural languages are replete with examples of
the same devices used for multiple purposes: (1) the same type of morphological
process (for example, suffixation for both noun case and verb tense) and (2)
identical morphemes (for example, the same suffix for English noun plural and
possessive). These sorts of similarity would be expected to confer advantages
on language learners in the form of transfer from one morphological category
to another. Connectionist models of morphology acquisition have been faulted for
their supposed inability to represent phonological similarity across
morphological categories and hence to facilitate transfer. This paper describes
a connectionist model of the acquisition of morphology which is shown to
exhibit transfer of this type. The model treats the morphology acquisition
problem as one of learning to map forms onto meanings and vice versa. As the
network learns these mappings, it makes phonological generalizations which are
embedded in connection weights. Since these weights are shared by different
morphological categories, transfer is enabled. In a set of experiments with
artificial stimuli, networks were trained first on one morphological task
(e.g., tense) and then on a second (e.g., number). It is shown that in the
context of suffixation, prefixation, and template rules, the second task is
facilitated when the second category either makes use of the same forms or the
same general process type (e.g., prefixation) as the first.
Comment: 21 pages, uuencoded compressed Postscript
Phonology
Phonology is the systematic study of the sounds used in language, their
internal structure, and their composition into syllables, words and phrases.
Computational phonology is the application of formal and computational
techniques to the representation and processing of phonological information.
This chapter will present the fundamentals of descriptive phonology along with
a brief overview of computational phonology.
Comment: 27 pages
Modeling Order in Neural Word Embeddings at Scale
Natural Language Processing (NLP) systems commonly leverage bag-of-words
co-occurrence techniques to capture semantic and syntactic word relationships.
The resulting word-level distributed representations often ignore morphological
information, though character-level embeddings have proven valuable to NLP
tasks. We propose a new neural language model incorporating both word order and
character order in its embedding. The model produces several vector spaces with
meaningful substructure, as evidenced by its performance of 85.8% on a recent
word-analogy task, exceeding the best published syntactic word-analogy scores
by a 58% error margin. Furthermore, the model includes several parallel
training methods, most notably allowing a skip-gram network with 160 billion
parameters, 14x larger than the previous largest neural network, to be trained
overnight on 3 multi-core CPUs.
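For context, the word-analogy task referenced above scores embeddings by
vector arithmetic: for "a is to b as c is to ?", the prediction is the
vocabulary word whose vector lies closest to b - a + c. A toy sketch with
placeholder random vectors (not the paper's model):

    import numpy as np

    def analogy(emb, a, b, c):
        """Return the word whose vector is closest (cosine) to b - a + c."""
        target = emb[b] - emb[a] + emb[c]
        def cos(u, v):
            return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        candidates = (w for w in emb if w not in (a, b, c))
        return max(candidates, key=lambda w: cos(emb[w], target))

    # Toy random vectors; with trained embeddings the answer should be "queen".
    emb = {w: np.random.rand(50) for w in ["king", "queen", "man", "woman"]}
    print(analogy(emb, "man", "king", "woman"))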
Deriving Ontologies from XML Schema
In this paper, we present a method and a tool for deriving a skeleton of an
ontology from XML schema files. We first recall what an ontology is and how it
relates to XML schemas. Next, we focus on ontology building methodology
and associated tool requirements. Then, we introduce Janus, a tool for building
an ontology from various XML schemas in a given domain. We summarize the main
features of Janus and illustrate its functionalities through a simple example.
Finally, we compare our approach to other existing ontology-building tools.
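The abstract leaves Janus's mapping rules to the paper, but one common reading
of "skeleton of an ontology" can be sketched: named complex types become
candidate classes, and their child elements become candidate properties. A
minimal, hypothetical Python illustration:

    import xml.etree.ElementTree as ET

    XS = "{http://www.w3.org/2001/XMLSchema}"

    def skeleton_from_xsd(xsd_text):
        """Map named complexTypes to classes, child elements to properties."""
        root = ET.fromstring(xsd_text)
        classes = {}
        for ct in root.iter(XS + "complexType"):
            name = ct.get("name")
            if name:
                props = [el.get("name") for el in ct.iter(XS + "element")]
                classes[name] = [p for p in props if p]
        return classes

    xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="Book">
        <xs:sequence>
          <xs:element name="title" type="xs:string"/>
          <xs:element name="author" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>"""
    print(skeleton_from_xsd(xsd))  # {'Book': ['title', 'author']}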
Mined Semantic Analysis: A New Concept Space Model for Semantic Representation of Textual Data
Mined Semantic Analysis (MSA) is a novel concept space model which employs
unsupervised learning to generate semantic representations of text. MSA
represents textual structures (terms, phrases, documents) as a Bag of Concepts
(BoC) where concepts are derived from concept rich encyclopedic corpora.
Traditional concept space models exploit only target corpus content to
construct the concept space. MSA, alternatively, uncovers implicit relations
between concepts by mining for their associations (e.g., mining Wikipedia's
"See also" link graph). We evaluate MSA's performance on benchmark datasets for
measuring semantic relatedness of words and sentences. Empirical results show
competitive performance of MSA compared to prior state-of-the-art methods.
Additionally, we introduce the first analytical study to examine the
statistical significance of results reported by different semantic relatedness
methods. Our study shows that the differences among top-performing methods can
be statistically insignificant. The study positions MSA as one of the
state-of-the-art methods for measuring semantic relatedness, with the added
benefit of the interpretability and simplicity of its semantic representations.
Comment: 10 pages, 2 figures
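As a rough illustration of the Bag-of-Concepts idea (the toy graph below
stands in for Wikipedia's "See also" structure; MSA's actual mining and
weighting are richer), a term's representation is the set of concepts
reachable through mined associations, and relatedness is overlap between those
sets:

    # Toy stand-in for a mined "See also" association graph.
    see_also = {
        "jaguar": {"big cat", "panthera"},
        "tiger": {"big cat", "panthera", "striped fur"},
        "big cat": {"felidae"},
    }

    def bag_of_concepts(term, depth=2):
        """Collect concepts reachable from a term via the association graph."""
        frontier, concepts = {term}, set()
        for _ in range(depth):
            frontier = {c for t in frontier for c in see_also.get(t, set())}
            concepts |= frontier
        return concepts

    def relatedness(a, b):
        """Jaccard overlap between two Bags of Concepts."""
        ca, cb = bag_of_concepts(a), bag_of_concepts(b)
        return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

    print(relatedness("jaguar", "tiger"))  # 0.75 on this toy graph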
Digital Neural Networks in the Brain: From Mechanisms for Extracting Structure in the World To Self-Structuring the Brain Itself
In order to keep track of information, the brain has to resolve the problem of
where information is stored and how to index new information. We propose that
the neural mechanism used by the prefrontal cortex (PFC) to detect structure
in temporal sequences, based on the temporal order of incoming information,
has served a second purpose: the spatial ordering and indexing of brain
networks. We call this process, which amounts to manipulating neural
'addresses' to organize the brain's own network, the 'digitalization' of
information. Such a tool is important for information processing and
preservation, but also for memory formation and retrieval.
Words are not Equal: Graded Weighting Model for building Composite Document Vectors
Despite the success of distributional semantics, composing phrases from word
vectors remains an important challenge. Several methods have been tried for
benchmark tasks such as sentiment classification, including word vector
averaging, matrix-vector approaches based on parsing, and on-the-fly learning
of paragraph vectors. Most models usually omit stop words from the composition.
Instead of such a yes-no decision, we consider several graded schemes where
words are weighted according to their discriminatory relevance with respect to
their use in the document (e.g., idf). Some of these methods (particularly
tf-idf) are seen to result in a significant improvement in performance over the
prior state of the art. Further, combining such approaches into an ensemble
based on alternate classifiers such as the RNN model results in a 1.6%
performance improvement on the standard IMDB movie review dataset, and a 7.01%
improvement on Amazon product reviews. Since these models are
language-independent and can be obtained in an unsupervised manner, they are
also of interest for under-resourced languages such as Hindi. We demonstrate
the language-independent aspect by showing a gain of 12% for two review
datasets over earlier results, and also release a new larger dataset for
future testing (Singh, 2015).
Comment: 10 pages, 2 figures, 11 tables
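The graded weighting itself is straightforward to sketch: rather than dropping
stop words outright, weight every word vector by a relevance score such as idf
when averaging. The corpus and vectors below are toy placeholders:

    import math
    import numpy as np

    corpus = [["the", "movie", "was", "great"],
              ["the", "plot", "was", "dull"]]
    vecs = {w: np.random.rand(50) for doc in corpus for w in doc}

    def idf(word, docs):
        """Smoothed inverse document frequency as a discriminatory weight."""
        df = sum(word in d for d in docs)
        return math.log(len(docs) / (1 + df)) + 1.0

    def doc_vector(doc, docs):
        """Graded (idf-weighted) average of word vectors, not a yes-no cutoff."""
        weights = np.array([idf(w, docs) for w in doc])
        matrix = np.stack([vecs[w] for w in doc])
        return weights @ matrix / weights.sum()

    print(doc_vector(corpus[0], corpus).shape)  # (50,)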
Exploiting multilingual nomenclatures and language-independent text features as an interlingua for cross-lingual text analysis applications
We propose a simple but efficient approach for a number of multilingual and
cross-lingual language technology applications that are not limited to the
usual two or three languages, but can be applied with relatively little effort
to larger sets of languages. The approach consists of
using existing multilingual linguistic resources such as thesauri,
nomenclatures and gazetteers, as well as exploiting the existence of additional
more or less language-independent text items such as dates, currency
expressions, numbers, names and cognates. Mapping texts onto the multilingual
resources and identifying word token links between texts in different languages
are basic ingredients for applications such as cross-lingual document
similarity calculation, multilingual clustering and categorisation,
cross-lingual document retrieval, and tools to provide cross-lingual
information access.
Comment: The approach described in this paper is used to link related
documents across languages in the multilingual news analysis system
NewsExplorer, which is freely accessible at http://press.jrc.it/NewsExplorer.
11 pages
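The interlingua itself can be sketched compactly: each document is mapped onto
language-independent features (multilingual thesaurus descriptor IDs, numbers,
dates), and documents in different languages are compared in that shared
space. The tiny lexicon below is a hypothetical stand-in for resources such as
multilingual thesauri:

    import math
    from collections import Counter

    # Hypothetical multilingual lexicon mapping words to shared descriptor IDs.
    lexicon = {"climat": "C42", "climate": "C42",
               "énergie": "E07", "energy": "E07"}

    def interlingua_features(tokens):
        """Project tokens onto language-independent features."""
        feats = Counter()
        for t in tokens:
            if t in lexicon:
                feats[lexicon[t]] += 1   # shared thesaurus descriptor
            elif t.isdigit():
                feats[f"NUM:{t}"] += 1   # language-independent number
        return feats

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    fr = interlingua_features("le climat énergie 2005".split())
    en = interlingua_features("climate and energy 2005".split())
    print(cosine(fr, en))  # 1.0: identical features despite different languages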