Language Modeling by Clustering with Word Embeddings for Text Readability Assessment
We present a clustering-based language model using word embeddings for text
readability prediction. Presumably, a Euclidean semantic space hypothesis
holds true for word embeddings whose training is done by observing word
co-occurrences. We argue that clustering with word embeddings in the metric
space should yield feature representations in a higher semantic space
appropriate for text regression. Also, by representing features in terms of
histograms, our approach can naturally address documents of varying lengths. An
empirical evaluation using the Common Core Standards corpus reveals that the
features formed on our clustering-based language model significantly improve
the previously known results for the same corpus in readability prediction. We
also evaluate the task of sentence matching based on semantic relatedness using
the Wiki-SimpleWiki corpus and find that our features lead to superior matching
performance.
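The histogram idea in the abstract above can be sketched as follows: cluster word embeddings in the Euclidean metric space, then represent each document as a normalized histogram of its words' cluster assignments, which naturally handles documents of varying lengths. The toy vocabulary, random embeddings, and cluster count below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of 100 words with random 8-dimensional "embeddings".
vocab = [f"w{i}" for i in range(100)]
embeddings = {w: rng.normal(size=8) for w in vocab}

def kmeans(points, k, iters=20):
    """Minimal k-means in the Euclidean metric space."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

points = np.stack([embeddings[w] for w in vocab])
k = 5
centers, labels = kmeans(points, k)
word_cluster = dict(zip(vocab, labels))

def doc_histogram(doc, k):
    """Normalized cluster histogram: a fixed-size feature vector
    regardless of document length."""
    h = np.zeros(k)
    for w in doc:
        h[word_cluster[w]] += 1
    return h / max(len(doc), 1)

features = doc_histogram(["w1", "w2", "w3", "w1"], k)
```

The resulting fixed-length `features` vector can then feed any standard regressor for readability prediction.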
RNN Language Model with Word Clustering and Class-based Output Layer
The recurrent neural network language model (RNNLM) has shown significant promise for statistical language modeling. In this work, a new class-based output layer method is introduced to further improve the RNNLM. In this method, word class information is incorporated into the output layer by utilizing the Brown clustering algorithm to estimate a class-based language model. Experimental results show that the new output layer with word clustering not only noticeably improves convergence but also reduces the perplexity and word error rate in large vocabulary continuous speech recognition.
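A minimal sketch of the class-based output factorization this abstract describes: P(w | h) = P(class(w) | h) · P(w | class(w), h), so the softmax over the full vocabulary is replaced by a softmax over classes plus a softmax over only the words in one class. The tiny vocabulary, hand-picked class assignment, and random weights below are assumptions; the paper derives classes with Brown clustering.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["the", "a", "cat", "dog", "runs", "sleeps"]
# Hypothetical class assignment standing in for Brown clusters.
word2class = {"the": 0, "a": 0, "cat": 1, "dog": 1, "runs": 2, "sleeps": 2}
n_classes, hidden = 3, 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_class = rng.normal(size=(n_classes, hidden))   # class output weights
W_word = {w: rng.normal(size=hidden) for w in vocab}  # word output weights

def word_prob(w, h):
    """P(w | h) = P(class | h) * P(w | class, h)."""
    c = word2class[w]
    p_class = softmax(W_class @ h)[c]
    # Softmax over only the words in w's class, not the whole vocabulary.
    members = [v for v in vocab if word2class[v] == c]
    logits = np.array([W_word[v] @ h for v in members])
    p_word = softmax(logits)[members.index(w)]
    return p_class * p_word

h = rng.normal(size=hidden)          # stand-in for an RNN hidden state
total = sum(word_prob(w, h) for w in vocab)  # a valid distribution sums to 1
```

Because each inner softmax only spans one class, the per-word cost drops from O(|V|) toward O(|classes| + |class members|), which is the speed motivation for class-based output layers.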
Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics
Next Generation Sequencing (NGS) technologies generate large amounts of short
read data for many different organisms. The fact that NGS reads are generally
short makes it challenging to assemble the reads and reconstruct the original
genome sequence. For clustering genomes using such NGS data, word-count based
alignment-free sequence comparison is a promising approach, but for this
approach, the underlying expected word counts are essential.
A plausible model for this underlying distribution of word counts is given
through modelling the DNA sequence as a Markov chain (MC). For single long
sequences, efficient statistics are available to estimate the order of MCs and
the transition probability matrix for the sequences. As NGS data do not provide
a single long sequence, inference methods on Markovian properties of sequences
based on single long sequences cannot be directly used for NGS short read data.
Here we derive a normal approximation for such word counts. We also show that
the traditional Chi-square statistic has an approximate gamma distribution,
using the Lander-Waterman model for physical mapping. We propose several
methods to estimate the order of the MC based on NGS reads and evaluate them
using simulations. We illustrate the applications of our results by clustering
genomic sequences of several vertebrate and tree species based on NGS reads
using alignment-free sequence dissimilarity measures. We find that the
estimated order of the MC has a considerable effect on the clustering results,
and that clustering with an MC of the estimated order yields a plausible
grouping of the species.
Comment: accepted by RECOMB-SEQ 201
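The word-count-based view of a Markov chain fitted from short reads can be sketched as below: pool k-mer counts across reads (since the reads cannot be concatenated into one long sequence) and take the maximum-likelihood transition probabilities. The toy reads and the plain order-1 MLE are simplifying assumptions, not the paper's estimators or its normal/gamma approximations.

```python
from collections import Counter

reads = ["ACGTAC", "CGTACG", "TACGTA"]  # toy NGS-style short reads

def kmer_counts(reads, k):
    """Pool k-mer (word) counts across all reads; each read is
    counted independently, never across read boundaries."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def transition_probs(reads, order=1):
    """MLE of P(next base | previous `order` bases) from pooled counts."""
    nxt = kmer_counts(reads, order + 1)
    probs = {}
    for word, n in nxt.items():
        prefix = word[:order]
        # Denominator: occurrences of this context that are followed by a base.
        denom = sum(m for w2, m in nxt.items() if w2[:order] == prefix)
        probs[word] = n / denom
    return probs

p = transition_probs(reads, order=1)  # e.g. p["AC"] = P(C | A)
```

Choosing `order` is exactly the model-selection problem the abstract addresses; here it is simply fixed to 1.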
Learning to Rank Question-Answer Pairs using Hierarchical Recurrent Encoder with Latent Topic Clustering
In this paper, we propose a novel end-to-end neural architecture for ranking
candidate answers that adapts a hierarchical recurrent neural network and a
latent topic clustering module. With our proposed model, a text is encoded into a
vector representation from the word level to the chunk level to effectively
capture its entire meaning. In particular, owing to the hierarchical
structure, our model degrades only slightly on longer text
comprehension while other state-of-the-art recurrent neural network models
suffer from it. Additionally, the latent topic clustering module extracts
semantic information from target samples. This clustering module is useful for
any text related tasks by allowing each data sample to find its nearest topic
cluster, thus helping the neural network model analyze the entire data. We
evaluate our models on the Ubuntu Dialogue Corpus and a consumer-electronics-domain
question answering dataset related to Samsung products. The
proposed model shows state-of-the-art results for ranking question-answer
pairs.
Comment: 10 pages, Accepted as a conference paper at NAACL 201
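The latent topic clustering step can be sketched as assigning each encoded text to its nearest latent topic vector and appending that topic as extra semantic information. The mean-of-random-vectors encoder, dimensionality, and topic count below are stand-in assumptions; the actual model learns both the encoder and the topic memory end-to-end.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_topics = 6, 3
topics = rng.normal(size=(n_topics, dim))  # latent topic memory (assumed fixed here)

def encode(tokens):
    """Toy text encoder: mean of per-token random vectors (an assumption;
    the paper uses a hierarchical recurrent encoder)."""
    vecs = [rng.normal(size=dim) for _ in tokens]
    return np.mean(vecs, axis=0)

def with_topic(tokens):
    """Find the nearest topic cluster and append it to the text encoding."""
    h = encode(tokens)
    t = int(np.linalg.norm(topics - h, axis=1).argmin())
    return np.concatenate([h, topics[t]]), t

feat, topic_id = with_topic(["how", "do", "I", "reboot"])
```

The enriched `feat` vector is what a downstream ranking layer would consume; each sample "finding its nearest topic cluster" is this argmin step.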
The Influence of the Phonological Neighborhood Clustering-Coefficient on Spoken Word Recognition
This article may not exactly replicate the final version published in the APA journal. It is not the copy of record.
The clustering coefficient, a measure derived from the new science of networks, refers to the proportion of phonological neighbors of a target word that are also neighbors of each other. Consider the words bat, hat, and can, all of which are neighbors of the word cat; the words bat and hat are also neighbors of each other. In a perceptual identification task, words with a low clustering coefficient (i.e., few neighbors are neighbors of each other) were more accurately identified than words with a high clustering coefficient (i.e., many neighbors are neighbors of each other). In a lexical decision task, words with a low clustering coefficient were responded to more quickly than words with a high clustering coefficient. These findings suggest that the structure of the lexicon, that is, the similarity relationships among neighbors of the target word measured by the clustering coefficient, influences lexical access in spoken word recognition. Simulations of the TRACE and Shortlist models of spoken word recognition failed to account for the present findings. A framework for a new model of spoken word recognition is proposed.
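The bat/hat/can example maps directly onto the clustering-coefficient computation: among the target word's neighbors, count what fraction of neighbor pairs are themselves neighbors. Treating "neighbor" as a one-letter substitution is a simplification of full phonological neighborhood (which also includes additions and deletions); the tiny lexicon is illustrative.

```python
from itertools import combinations

def is_neighbor(a, b):
    """One-substitution neighbors: same length, exactly one differing letter
    (a simplified stand-in for phonological neighborhood)."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def clustering_coefficient(target, lexicon):
    """Proportion of the target's neighbor pairs that are also neighbors."""
    neighbors = [w for w in lexicon if is_neighbor(target, w)]
    pairs = list(combinations(neighbors, 2))
    if not pairs:
        return 0.0
    linked = sum(is_neighbor(a, b) for a, b in pairs)
    return linked / len(pairs)

lexicon = ["bat", "hat", "can", "cot"]
cc = clustering_coefficient("cat", lexicon)  # only bat-hat link among 6 pairs
```

Here cat has four neighbors (bat, hat, can, cot) giving six pairs, of which only bat-hat are neighbors of each other, so the coefficient is 1/6, i.e. a relatively sparse neighborhood.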
Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags --- unique -bit numbers whose most
significant bit-patterns incorporate class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve model performance.
Comment: 17 Page Paper. Self-extracting PostScript Fil
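The average class mutual information criterion driving such clustering can be sketched as scoring a class assignment by I = Σ P(c₁,c₂) · log(P(c₁,c₂) / (P(c₁)P(c₂))) over adjacent-class bigrams, with the clustering searching for assignments that keep this score high. The toy corpus and the hand-picked two-class split below are illustrative assumptions; the actual system performs a binary top-down search.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(bigrams.values())

def class_mutual_information(assign):
    """Mutual information between the classes of adjacent words."""
    joint = Counter()
    for (w1, w2), n in bigrams.items():
        joint[(assign[w1], assign[w2])] += n
    left, right = Counter(), Counter()
    for (c1, c2), n in joint.items():
        left[c1] += n
        right[c2] += n
    mi = 0.0
    for (c1, c2), n in joint.items():
        p = n / total
        mi += p * math.log(p / ((left[c1] / total) * (right[c2] / total)))
    return mi

# Hypothetical split: function words vs. content words.
assign = {w: (0 if w in {"the", "on"} else 1) for w in set(corpus)}
mi = class_mutual_information(assign)
```

Merging everything into one class drives the score to zero, so informative splits (like separating function from content words) score strictly higher; that gap is what the clustering search exploits.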
Word-of-mouth interaction and the organization of behaviour
We present a discrete choice model based on agent interaction. The framework combines the features of two well-known models of word-of-mouth communication (Ellison and Fudenberg, 1995, and Bala and Goyal, 2001). The interaction structure is a regular periodic lattice with decision-makers interacting only with immediate neighbours. We investigate the long-run (equilibrium) behaviour of the resulting system and show that for a large range of initial conditions clustering in economic behaviour emerges and persists indefinitely. The setup allows for the analysis of multi-option environments. For these environments we derive the distribution of option popularity in equilibrium.
Keywords: word-of-mouth, inertia, clustering, choice.
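A minimal sketch of the lattice interaction described above: agents on a periodic ring choose between two options and imitate the majority among themselves and their immediate neighbours, which lets local clusters of behaviour form and then persist. The majority-update rule, ring size, and two-option restriction are simplifying assumptions, not the paper's actual dynamics.

```python
import random

random.seed(3)
n = 30
state = [random.randint(0, 1) for _ in range(n)]  # random initial choices

def step(state):
    """Each agent adopts the majority choice among itself and its two
    immediate neighbours on the periodic lattice (ring)."""
    new = []
    for i in range(len(state)):
        votes = state[i - 1] + state[i] + state[(i + 1) % len(state)]
        new.append(1 if votes >= 2 else 0)
    return new

for _ in range(50):
    nxt = step(state)
    if nxt == state:  # a persistent (clustered) configuration was reached
        break
    state = nxt
```

Any configuration made of runs of length two or more is a fixed point of this rule, so once contiguous clusters of like choices form, they persist indefinitely, matching the qualitative behaviour the abstract reports.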