
    SubGram: Extending Skip-gram Word Representation with Substrings

    Skip-gram (word2vec) is a recent method for creating vector representations of words ("distributed word representations") using a neural network. The representation gained popularity in various areas of natural language processing, because it seems to capture syntactic and semantic information about words without any explicit supervision in this respect. We propose SubGram, a refinement of the Skip-gram model that also considers word structure during training, achieving large gains on the original Skip-gram test set.
    Comment: Published at TSD 201
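    A short sketch may help fix the idea. The toy code below (hypothetical, not the authors' implementation) builds a word vector by averaging the whole-word embedding with embeddings of its character substrings; the embedding table, dimensionality and n-gram range are invented for illustration.

        # Substring-augmented word vectors: a minimal, invented sketch.
        import numpy as np

        DIM = 50
        rng = np.random.default_rng(0)
        emb = {}  # lazily initialised embedding table

        def vec(key):
            # Return (and cache) a random embedding for any string key.
            if key not in emb:
                emb[key] = rng.normal(scale=0.1, size=DIM)
            return emb[key]

        def substrings(word, n_min=3, n_max=6):
            # All character n-grams of the boundary-padded word, e.g. "<walking>".
            w = f"<{word}>"
            return [w[i:i + n] for n in range(n_min, n_max + 1)
                    for i in range(len(w) - n + 1)]

        def word_vector(word):
            # Combine the whole-word vector with its substring vectors.
            parts = [vec(word)] + [vec(s) for s in substrings(word)]
            return np.mean(parts, axis=0)

        print(word_vector("walking")[:5])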

    Acquiring Receptive Morphology: A Connectionist Model

    This paper describes a modular connectionist model of the acquisition of receptive inflectional morphology. The model takes inputs in the form of phones, one at a time, and outputs the associated roots and inflections. Simulations using artificial language stimuli demonstrate the capacity of the model to learn suffixation, prefixation, infixation, circumfixation, mutation, template, and deletion rules. Separate network modules responsible for syllables enable the network to learn simple reduplication rules as well. The model also embodies constraints against association-line crossing.
    Comment: 8 pages, Postscript file; extract with Unix uudecode and uncompress
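    As a rough illustration of such an architecture (not the paper's actual network), the sketch below feeds one-hot phone vectors through an Elman-style recurrence and reads a root and an inflection off two separate output heads; the phone inventory, roots, inflections and weights are all invented, and the weights are untrained.

        # Toy phone-by-phone reader with separate root/inflection heads.
        import numpy as np

        PHONES = list("abdegiknostu#")   # '#' marks the end of a word
        ROOTS = ["walk", "jump", "sing"]
        INFLECTIONS = ["PAST", "PLURAL", "NONE"]

        rng = np.random.default_rng(1)
        H = 16
        Wxh = rng.normal(scale=0.1, size=(H, len(PHONES)))
        Whh = rng.normal(scale=0.1, size=(H, H))
        Wroot = rng.normal(scale=0.1, size=(len(ROOTS), H))
        Winfl = rng.normal(scale=0.1, size=(len(INFLECTIONS), H))

        def one_hot(i, n):
            v = np.zeros(n)
            v[i] = 1.0
            return v

        def read_word(phones):
            # Elman-style recurrence: consume one phone per time step.
            h = np.zeros(H)
            for p in phones:
                x = one_hot(PHONES.index(p), len(PHONES))
                h = np.tanh(Wxh @ x + Whh @ h)
            # Two output heads share the same hidden state.
            return ROOTS[np.argmax(Wroot @ h)], INFLECTIONS[np.argmax(Winfl @ h)]

        print(read_word("tukas#"))  # untrained weights, so the output is arbitrary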

    Transfer in a Connectionist Model of the Acquisition of Morphology

    The morphological systems of natural languages are replete with examples of the same devices used for multiple purposes: (1) the same type of morphological process (for example, suffixation for both noun case and verb tense) and (2) identical morphemes (for example, the same suffix for English noun plural and possessive). These sorts of similarity would be expected to confer advantages on language learners in the form of transfer from one morphological category to another. Connectionist models of morphology acquisition have been faulted for their supposed inability to represent phonological similarity across morphological categories and hence to facilitate transfer. This paper describes a connectionist model of the acquisition of morphology which is shown to exhibit transfer of this type. The model treats the morphology acquisition problem as one of learning to map forms onto meanings and vice versa. As the network learns these mappings, it makes phonological generalizations which are embedded in connection weights. Since these weights are shared by different morphological categories, transfer is enabled. In a set of experiments with artificial stimuli, networks were trained first on one morphological task (e.g., tense) and then on a second (e.g., number); a sketch of this setup follows below. It is shown that in the context of suffixation, prefixation, and template rules, the second task is facilitated when the second category either makes use of the same forms or the same general process type (e.g., prefixation) as the first.
    Comment: 21 pages, uuencoded compressed Postscript
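    A minimal sketch of the transfer setup, on invented data: a shared hidden layer is trained on one task and then reused when a second head is trained, so the second task starts from whatever generalizations the shared weights already encode.

        # Two output heads over one shared layer, trained sequentially.
        import numpy as np

        rng = np.random.default_rng(2)
        X = rng.normal(size=(200, 20))                    # stand-in for phonological forms
        y1 = (X[:, :5].sum(axis=1) > 0) * 1.0             # task 1 labels (e.g., tense)
        y2 = (X[:, :5].sum(axis=1) + X[:, 5] > 0) * 1.0   # related task 2 labels (e.g., number)

        def train(W_shared, w_head, y, steps=200, lr=0.05):
            for _ in range(steps):
                h = np.tanh(X @ W_shared)                 # shared phonological layer
                p = 1 / (1 + np.exp(-(h @ w_head)))       # task-specific head
                err = p - y
                grad_w = h.T @ err / len(X)
                grad_W = X.T @ ((err[:, None] * w_head) * (1 - h ** 2)) / len(X)
                w_head -= lr * grad_w
                W_shared -= lr * grad_W
            return np.mean((p > 0.5) == y)

        W = rng.normal(scale=0.1, size=(20, 8))
        acc1 = train(W, rng.normal(scale=0.1, size=8), y1)
        # Task 2 reuses W and gets fewer steps; transfer shows up as a head start.
        acc2 = train(W, rng.normal(scale=0.1, size=8), y2, steps=50)
        print(f"task 1: {acc1:.2f}, task 2 after transfer: {acc2:.2f}")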

    Phonology

    Phonology is the systematic study of the sounds used in language, their internal structure, and their composition into syllables, words and phrases. Computational phonology is the application of formal and computational techniques to the representation and processing of phonological information. This chapter will present the fundamentals of descriptive phonology along with a brief overview of computational phonology.
    Comment: 27 pages
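    As a small taste of the computational side (an example chosen for this summary, not taken from the chapter), the rule below states a context-sensitive rewrite, nasal place assimilation, and applies it with a regular expression.

        # n -> m / _ [p, b]  (rewrite n as m when it precedes p or b)
        import re

        def nasal_assimilation(form):
            return re.sub(r"n(?=[pb])", "m", form)

        print(nasal_assimilation("inpossible"))  # -> "impossible"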

    Modeling Order in Neural Word Embeddings at Scale

    Natural Language Processing (NLP) systems commonly leverage bag-of-words co-occurrence techniques to capture semantic and syntactic word relationships. The resulting word-level distributed representations often ignore morphological information, though character-level embeddings have proven valuable to NLP tasks. We propose a new neural language model incorporating both word order and character order in its embedding. The model produces several vector spaces with meaningful substructure, as evidenced by its performance of 85.8% on a recent word-analogy task, exceeding the best published syntactic word-analogy scores by a 58% error margin. Furthermore, the model includes several parallel training methods, most notably allowing a skip-gram network with 160 billion parameters to be trained overnight on 3 multi-core CPUs, 14x larger than the previous largest neural network.
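    The word-analogy task mentioned here is easy to sketch: given "a is to b as c is to ?", pick the vocabulary word whose vector is closest to v_b - v_a + v_c. The snippet below uses random placeholder embeddings, so only the mechanics, not the answer, are meaningful.

        # Standard vector-offset analogy evaluation over toy embeddings.
        import numpy as np

        rng = np.random.default_rng(3)
        vocab = ["king", "queen", "man", "woman", "walked", "walk"]
        E = {w: rng.normal(size=50) for w in vocab}

        def cos(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

        def analogy(a, b, c):
            target = E[b] - E[a] + E[c]
            # Exclude the query words themselves, as analogy benchmarks do.
            candidates = [w for w in vocab if w not in (a, b, c)]
            return max(candidates, key=lambda w: cos(E[w], target))

        print(analogy("man", "king", "woman"))  # "queen" with real vectors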

    Deriving Ontologies from XML Schema

    In this paper, we present a method and a tool for deriving a skeleton of an ontology from XML schema files. We first recall what an ontology is and how it relates to XML schemas. Next, we focus on ontology building methodology and associated tool requirements. Then, we introduce Janus, a tool for building an ontology from various XML schemas in a given domain. We summarize the main features of Janus and illustrate its functionalities through a simple example. Finally, we compare our approach to other existing ontology building tools.
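    A hand-rolled sketch of the core mapping (not Janus itself): walk an XML Schema and emit an ontology skeleton, turning each named xs:complexType into a class and each nested xs:element into a property. The sample schema is invented.

        # Derive a class/property skeleton from an XML Schema fragment.
        import xml.etree.ElementTree as ET

        XS = "{http://www.w3.org/2001/XMLSchema}"
        schema = ET.fromstring("""
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:complexType name="Book">
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="author" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:schema>""")

        for ctype in schema.iter(f"{XS}complexType"):
            cls = ctype.get("name")
            print(f"Class: {cls}")
            for elem in ctype.iter(f"{XS}element"):
                print(f"  Property: {cls}.{elem.get('name')} : {elem.get('type')}")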

    Mined Semantic Analysis: A New Concept Space Model for Semantic Representation of Textual Data

    Mined Semantic Analysis (MSA) is a novel concept space model which employs unsupervised learning to generate semantic representations of text. MSA represents textual structures (terms, phrases, documents) as a Bag of Concepts (BoC) where concepts are derived from concept-rich encyclopedic corpora. Traditional concept space models exploit only the target corpus content to construct the concept space. MSA, alternatively, uncovers implicit relations between concepts by mining for their associations (e.g., mining Wikipedia's "See also" link graph). We evaluate MSA's performance on benchmark datasets for measuring semantic relatedness of words and sentences. Empirical results show competitive performance of MSA compared to prior state-of-the-art methods. Additionally, we introduce the first analytical study to examine the statistical significance of results reported by different semantic relatedness methods. Our study shows that the differences in results across the top-performing methods can be statistically insignificant. The study positions MSA as one of the state-of-the-art methods for measuring semantic relatedness, alongside the inherent interpretability and simplicity of the generated semantic representation.
    Comment: 10 pages, 2 figures
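    The Bag-of-Concepts idea can be sketched in a few lines: expand a term with associations mined from a "See also"-style link graph and score relatedness by concept overlap. The mini link graph and the Jaccard measure below are placeholders, not MSA's actual corpus or scoring.

        # Bag of Concepts from a toy "See also" graph, compared by overlap.
        see_also = {
            "Automobile": ["Car", "Vehicle", "Engine"],
            "Truck": ["Vehicle", "Cargo", "Engine"],
            "Banana": ["Fruit", "Plantain"],
        }

        def bag_of_concepts(term):
            # The term's own concept plus its mined associations.
            return {term} | set(see_also.get(term, []))

        def relatedness(t1, t2):
            a, b = bag_of_concepts(t1), bag_of_concepts(t2)
            return len(a & b) / len(a | b)  # Jaccard overlap

        print(relatedness("Automobile", "Truck"))   # shares Vehicle and Engine
        print(relatedness("Automobile", "Banana"))  # no overlap -> 0.0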

    Digital Neural Networks in the Brain: From Mechanisms for Extracting Structure in the World To Self-Structuring the Brain Itself

    In order to keep track of information, the brain has to resolve the problem of where information is stored and how to index new information. We propose that the neural mechanism used by the prefrontal cortex (PFC) to detect structure in temporal sequences, based on the temporal order of incoming information, has served a second purpose: the spatial ordering and indexing of brain networks. We call this process, which amounts to the manipulation of neural 'addresses' to organize the brain's own network, the 'digitalization' of information. Such a tool is important for information processing and preservation, but also for memory formation and retrieval.

    Words are not Equal: Graded Weighting Model for building Composite Document Vectors

    Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models simply omit stop words from the composition. Instead of such a yes-no decision, we consider several graded schemes where words are weighted according to their discriminatory relevance with respect to their use in the document (e.g., idf). Some of these methods (particularly tf-idf) yield a significant improvement in performance over the prior state of the art. Further, combining such approaches into an ensemble with alternate classifiers such as the RNN model results in a 1.6% performance improvement on the standard IMDB movie review dataset and a 7.01% improvement on Amazon product reviews. Since these are language-free models that can be obtained in an unsupervised manner, they are also of interest for under-resourced languages such as Hindi, among many others. We demonstrate the language-free aspect by showing a gain of 12% on two review datasets over earlier results, and we also release a new, larger dataset for future testing (Singh, 2015).
    Comment: 10 Pages, 2 Figures, 11 Tables
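    A minimal sketch of one graded scheme, using random stand-in word vectors: compose a document vector as an idf-weighted average, so frequent function words are down-weighted rather than dropped outright.

        # idf-weighted composition of word vectors into a document vector.
        import numpy as np
        from collections import Counter

        rng = np.random.default_rng(4)
        docs = [["the", "movie", "was", "great"],
                ["the", "plot", "was", "dull"],
                ["great", "acting", "dull", "plot"]]
        vocab = {w for d in docs for w in d}
        E = {w: rng.normal(size=50) for w in vocab}

        # idf(w) = log(N / df(w)); the more documents contain w, the lower its weight.
        df = Counter(w for d in docs for w in set(d))
        idf = {w: np.log(len(docs) / df[w]) for w in vocab}

        def doc_vector(doc):
            weights = np.array([idf[w] for w in doc])
            vecs = np.array([E[w] for w in doc])
            return weights @ vecs / (weights.sum() + 1e-9)

        print(doc_vector(docs[0])[:5])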

    Exploiting multilingual nomenclatures and language-independent text features as an interlingua for cross-lingual text analysis applications

    We propose a simple but efficient approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, and of exploiting additional, more or less language-independent text items such as dates, currency expressions, numbers, names and cognates. Mapping texts onto the multilingual resources and identifying word-token links between texts in different languages are the basic ingredients for applications such as cross-lingual document similarity calculation, multilingual clustering and categorisation, cross-lingual document retrieval, and tools providing cross-lingual information access.
    Comment: The approach described in this paper is used to link related documents across languages in the multilingual news analysis system NewsExplorer, freely accessible at http://press.jrc.it/NewsExplorer . 11 pages
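    The interlingua idea reduces to a small sketch: map each text onto language-neutral identifiers (thesaurus concepts, dates, numbers, names) and compare documents in that shared space. The mini-gazetteer and the overlap measure below are invented for illustration.

        # Cross-lingual similarity via a shared, language-neutral concept space.
        GAZETTEER = {  # surface form -> language-neutral concept id
            "european union": "ORG:EU",
            "union europeenne": "ORG:EU",
            "paris": "LOC:PARIS",
            "2024": "DATE:2024",
        }

        def to_concepts(text):
            t = text.lower()
            return {cid for surface, cid in GAZETTEER.items() if surface in t}

        def cross_lingual_similarity(doc_a, doc_b):
            a, b = to_concepts(doc_a), to_concepts(doc_b)
            return len(a & b) / max(len(a | b), 1)

        en = "The European Union met in Paris in 2024."
        fr = "L'Union europeenne s'est reunie a Paris en 2024."
        print(cross_lingual_similarity(en, fr))  # same concepts via the interlingua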