Language Model Based on Word Clustering
PACLIC 20 / Wuhan, China / 1-3 November, 2006
Language Modeling by Clustering with Word Embeddings for Text Readability Assessment
We present a clustering-based language model using word embeddings for text
readability prediction. We presume that a Euclidean semantic-space hypothesis
holds for word embeddings trained by observing word co-occurrences. We argue
that clustering with word embeddings in the metric
space should yield feature representations in a higher semantic space
appropriate for text regression. Also, by representing features in terms of
histograms, our approach can naturally address documents of varying lengths. An
empirical evaluation using the Common Core Standards corpus reveals that the
features formed on our clustering-based language model significantly improve
the previously known results for the same corpus in readability prediction. We
also evaluate the task of sentence matching based on semantic relatedness using
the Wiki-SimpleWiki corpus and find that our features lead to superior matching
performance.
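The histogram representation described above can be sketched as follows (a minimal illustration with toy 2-D vectors and fixed centroids; the paper clusters real word embeddings, e.g. via k-means, so `histogram_features`, the centroids, and the vectors here are assumptions for demonstration):

```python
import numpy as np

def histogram_features(doc_vectors, centroids):
    """Assign each word vector to its nearest centroid (Euclidean metric)
    and return a normalized histogram of cluster memberships, so documents
    of any length map to a fixed-size feature vector."""
    # pairwise distances: shape (n_words, n_clusters)
    d = np.linalg.norm(doc_vectors[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

# toy 2-D "embeddings": two of the three words fall near the first centroid
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
doc = np.array([[0.1, -0.2], [0.3, 0.1], [9.8, 10.2]])
features = histogram_features(doc, centroids)
```

Because the histogram is normalized, a ten-word and a thousand-word document yield feature vectors of the same length and scale, which is what makes the representation suitable for text regression.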
Syntactic Clustering of Word Roots (Sözcük Köklerinin Sözdizimsel Olarak Kümelenmesi)
This is an accepted manuscript of an article published by IEEE in 2016 24th Signal Processing and Communication Application Conference (SIU) on 23/06/2016, available online: https://ieeexplore.ieee.org/document/7496026
The accepted version of the publication may differ from the final published version. Distributional representations of words are used for both syntactic and semantic tasks. In this paper, two different methods are presented for clustering word roots. In the first method, the distributional model word2vec [1] is used to cluster word roots, although distributional approaches are generally applied to whole words. For this purpose, the distributional similarities of roots are modeled and the roots are divided into syntactic categories (noun, verb, etc.). In the second method, two models are proposed: an information-theoretic model and a probabilistic model. Similarities between word roots are computed with a metric [8] based on mutual information and with another metric based on Jensen-Shannon divergence, and clustering is performed using these metrics. Clustering word roots plays a significant role in other natural language processing applications such as machine translation and question answering, and in applications that involve language generation. The resulting clusters achieve a purity of 0.92.
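The Jensen-Shannon divergence used as one of the similarity metrics above can be illustrated as follows (a minimal sketch; the distributions over which the paper computes the divergence, and its exact clustering procedure, are not reproduced here):

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions; it is symmetric and bounded in [0, 1], and its
    square root is a proper metric."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken to be 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical context distributions give a divergence of 0 and disjoint ones give 1, so the value can be fed directly to a distance-based clustering algorithm.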
The Impact of Arabic Diacritization on Word Embeddings
Word embeddings are used to represent words for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed hugely to the extraordinary developments in the field in the last few years. In Arabic, diacritic marks are a vital feature for the readability and understandability of the language, yet current Arabic word embeddings are non-diacritized. In this paper, we develop and compare word embedding models based on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four different ways: clustering of the nearest words; morphological-semantic analysis; part-of-speech tagging; and semantic analysis. For a sounder evaluation, we created three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model better captures syntactic and semantic relations and better clusters words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. We also found that the advantages of the diacritized model become more obvious as the number of target words increases, and that diacritic marks have more significance in POS tagging than in the other tasks.
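The "clustering of the nearest words" style of evaluation can be sketched as a cosine nearest-neighbor query (the romanized vocabulary and hand-made vectors below are purely illustrative assumptions, not trained embeddings):

```python
import numpy as np

def nearest_words(word, vocab, vectors, k=2):
    """Return the k nearest vocabulary words to `word` by cosine
    similarity -- the kind of nearest-word check used to compare
    what two embedding models have learned."""
    v = vectors[vocab.index(word)]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

# illustrative romanized vocabulary: "kataba" (wrote) should land near
# "kutub" (books) and "qaraa" (read), far from "sayyara" (car)
vocab = ["kataba", "kutub", "qaraa", "sayyara"]
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [-1.0, 0.5]])
```

Running the same query against a diacritized and a non-diacritized model and comparing the returned neighbor sets is one way to make the clustering comparison concrete.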
Strong correlations between text quality and complex networks features
Concepts of complex networks have been used to obtain metrics that were
correlated to text quality established by scores assigned by human judges.
Texts produced by high-school students in Portuguese were represented as
scale-free networks (word adjacency model), from which typical network features
such as the in/outdegree, clustering coefficient and shortest path were
obtained. Another metric was derived from the dynamics of the network growth,
based on the variation of the number of connected components. The scores
assigned by the human judges according to three text quality criteria
(coherence and cohesion, adherence to standard writing conventions and theme
adequacy/development) were correlated with the network measurements. Text
quality for all three criteria was found to decrease with increasing average
values of outdegrees, clustering coefficient and deviation from the dynamics of
network growth. Among the criteria employed, cohesion and coherence showed the
strongest correlation, which probably indicates that the network measurements
are able to capture how the text is developed in terms of the concepts
represented by the nodes in the networks. Though based on a particular set of
texts and specific language, the results presented here point to potential
applications in other instances of text analysis.
Comment: 8 pages, 8 figures
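The word adjacency model and clustering coefficient used above can be sketched as follows (a toy undirected illustration; the paper additionally uses directed in/out-degrees, shortest paths, and network-growth dynamics):

```python
from collections import defaultdict

def word_adjacency_graph(tokens):
    """Undirected word-adjacency network: each distinct word is a node,
    and consecutive words in the text are linked."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

adj = word_adjacency_graph("the cat saw the dog saw the cat".split())
# "the" has neighbors {cat, saw, dog}; 2 of the 3 pairs are linked -> 2/3
```

Averaging such per-node coefficients over the whole network yields the kind of global measurement that the study correlates with the human quality scores.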
Network analysis of named entity co-occurrences in written texts
The use of methods borrowed from statistics and physics to analyze written
texts has allowed the discovery of unprecedented patterns of human behavior and
cognition by establishing links between model features and language structure.
While current models have been useful to unveil patterns via analysis of
syntactic and semantic networks, only a few works have probed the relevance
of investigating the structure arising from the relationship between relevant
entities such as characters, locations and organizations. In this study, we
represent entities appearing in the same context as a co-occurrence network,
where links are established according to a null model based on random, shuffled
texts. Computational simulations performed in novels revealed that the proposed
model displays interesting topological features, such as the small-world
property, characterized by high values of the clustering coefficient. The
effectiveness of our model was verified in a practical pattern recognition task
in real networks. When compared with traditional word adjacency networks, our
model displayed optimized results in identifying unknown references in texts.
Because the proposed representation plays a complementary role in
characterizing unstructured documents via topological analysis of named
entities, we believe that it could be useful to improve the characterization of
written texts (and related systems), especially if combined with traditional
approaches based on statistical and deeper paradigms.
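The shuffled-text null model for linking co-occurring entities might be sketched like this (a simplified stand-in: a pair of entities is linked only if it co-occurs more often than in shuffled text; the entity list, window size, and acceptance threshold are illustrative assumptions, not the paper's exact procedure):

```python
import random
from collections import Counter
from itertools import combinations

def cooccurrences(entities, window=2):
    """Count unordered entity pairs appearing in the same sliding window."""
    counts = Counter()
    for i in range(len(entities) - window + 1):
        for pair in combinations(sorted(set(entities[i:i + window])), 2):
            counts[pair] += 1
    return counts

def null_model_edges(entities, window=2, shuffles=200, seed=0):
    """Keep an edge only if its observed co-occurrence count exceeds its
    mean count over randomly shuffled versions of the text."""
    rng = random.Random(seed)
    observed = cooccurrences(entities, window)
    baseline = Counter()
    for _ in range(shuffles):
        shuffled = entities[:]
        rng.shuffle(shuffled)
        baseline.update(cooccurrences(shuffled, window))
    return {pair for pair, c in observed.items()
            if c > baseline[pair] / shuffles}

# "Alice" and "Bob" co-occur far more often than shuffling would predict
entities = ["Alice", "Bob", "Alice", "Bob", "Alice", "Bob", "Carol", "Dave"]
edges = null_model_edges(entities)
```

Filtering against the shuffled baseline keeps only links that reflect genuine narrative structure rather than the raw frequency of the entities involved.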
Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation
We present an approach that reduces the performance disparity of automatic
speech recognition (ASR) between geographic regions without degrading
performance on the overall user population. A
popular approach is to fine-tune the model with data from regions where the ASR
model has a higher word error rate (WER). However, when the ASR model is
adapted to get better performance on these high-WER regions, its parameters
wander from the previous optimal values, which can lead to worse performance in
other regions. In our proposed method, we utilize the elastic weight
consolidation (EWC) regularization loss to identify directions in parameter
space along which the ASR weights can vary to improve on high-error regions,
while still maintaining performance on the speaker population overall. Our
results demonstrate that EWC can reduce the WER in the region
with highest WER by 3.2% relative while reducing the overall WER by 1.3%
relative. We also evaluate the role of language and acoustic models in ASR
fairness and propose a clustering algorithm to identify WER disparities based
on geographic region.
Comment: Accepted for publication at Interspeech 202
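The EWC regularization described above can be sketched as a quadratic penalty on parameter drift (a minimal NumPy illustration; `ewc_loss`, the toy Fisher values, and the weighting λ are illustrative assumptions, not the paper's actual model or training setup):

```python
import numpy as np

def ewc_loss(task_loss, params, old_params, fisher, lam=1.0):
    """New-task loss plus the EWC quadratic penalty, which anchors each
    parameter to its previous optimum in proportion to its estimated
    (diagonal) Fisher importance."""
    penalty = float(np.sum(fisher * (params - old_params) ** 2))
    return task_loss + 0.5 * lam * penalty

old = np.array([1.0, -2.0])
fisher = np.array([10.0, 0.1])  # parameter 0 mattered far more on old data

# moving the "important" parameter is penalized much more heavily than
# moving the unimportant one by the same amount
loss_important = ewc_loss(1.0, np.array([2.0, -2.0]), old, fisher)
loss_unimportant = ewc_loss(1.0, np.array([1.0, -1.0]), old, fisher)
```

This is what lets the adapted model improve on high-WER regions while parameters that were critical for the overall population stay near their previous optimum.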