Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death
We analyze the dynamic properties of 10^7 words recorded in English, Spanish
and Hebrew over the period 1800--2008 in order to gain insight into the
coevolution of language and culture. We report language independent patterns
useful as benchmarks for theoretical models of language evolution. A
significantly decreasing (increasing) trend in the birth (death) rate of words
indicates a recent shift in the selection laws governing word use. For new
words, we observe a peak in the growth-rate fluctuations around 40 years after
introduction, consistent with the typical entry time into standard dictionaries
and the human generational timescale. Pronounced changes in the dynamics of
language during periods of war show that word correlations, occurring across
time and between words, are largely influenced by coevolutionary social,
technological, and political factors. We quantify cultural memory by analyzing
the long-term correlations in the use of individual words using detrended
fluctuation analysis.
Comment: Version 1: 31 pages, 17 figures, 3 tables. Version 2 is streamlined,
eliminates substantial material and incorporates referee comments: 19 pages,
14 figures, 3 tables
TextGAIL: Generative Adversarial Imitation Learning for Text Generation
Generative Adversarial Networks (GANs) for text generation have recently
received much criticism, as they perform worse than their MLE counterparts. We
suspect previous text GANs' inferior performance is due to the lack of a
reliable guiding signal in their discriminators. To address this problem, we
propose a generative adversarial imitation learning framework for text
generation that uses large pre-trained language models to provide more reliable
reward guidance. Our approach uses a contrastive discriminator and proximal
policy optimization (PPO) to stabilize and improve text generation performance.
For evaluation, we conduct experiments on a diverse set of unconditional and
conditional text generation tasks. Experimental results show that TextGAIL
achieves better performance in terms of both quality and diversity than the MLE
baseline. We also validate our intuition that TextGAIL's discriminator
demonstrates the capability of providing reasonable rewards with an additional
task.
Comment: AAAI 202
Anchoring vignettes: can they make adolescent self-reports of social-emotional skills more reliable, discriminant, and criterion-valid?
Individuals differ in the way they use rating scales to describe themselves, and these differences are particularly pronounced in children and early adolescents. One promising remedy is to correct (or "anchor") an individual's responses according to the way they use the scale when they rate an anchoring vignette (a set of hypothetical targets differing on the attribute of interest). Studying adolescents' self-reports of their socio-emotional attributes, we compared traditional self-report scores with vignette-corrected scores in terms of reliability (internal consistency), discriminant validity (scale intercorrelations), and criterion validity (predicting achievement test scores in language and math). A large and representative sample of 12th grade Brazilian students (N = 8,582, 62% female, mean age 18.2) was administered a Portuguese-language self-report inventory assessing social-emotional skills related to the Big Five personality dimensions. Correcting scores according to vignette ratings led to increases in the reliability of scales measuring Conscientiousness and Openness, but discriminant validity and criterion validity increased only when each scale was corrected using its own corresponding vignette set. Moreover, accuracy in rating the vignettes was correlated with language achievement test scores, suggesting that verbal factors play a role in providing both normative vignette ratings of others and self-reports that are reliable and valid.
Diffusion of Lexical Change in Social Media
Computer-mediated communication is driving fundamental changes in the nature
of written language. We investigate these changes by statistical analysis of a
dataset comprising 107 million Twitter messages (authored by 2.7 million unique
user accounts). Using a latent vector autoregressive model to aggregate across
thousands of words, we identify high-level patterns in diffusion of linguistic
change over the United States. Our model is robust to unpredictable changes in
Twitter's sampling rate, and provides a probabilistic characterization of the
relationship of macro-scale linguistic influence to a set of demographic and
geographic predictors. The results of this analysis offer support for prior
arguments that focus on geographical proximity and population size. However,
demographic similarity -- especially with regard to race -- plays an even more
central role, as cities with similar racial demographics are far more likely to
share linguistic influence. Rather than moving towards a single unified
"netspeak" dialect, language evolution in computer-mediated communication
reproduces existing fault lines in spoken American English.
Comment: preprint of PLOS-ONE paper from November 2014; PLoS ONE 9(11) e11311
A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem
In this paper, we consider the sparse eigenvalue problem wherein the goal is
to obtain a sparse solution to the generalized eigenvalue problem. We achieve
this by constraining the cardinality of the solution to the generalized
eigenvalue problem and obtain sparse principal component analysis (PCA), sparse
canonical correlation analysis (CCA) and sparse Fisher discriminant analysis
(FDA) as special cases. Unlike the $\ell_1$-norm approximation to the
cardinality constraint, which previous methods have used in the context of
sparse PCA, we propose a tighter approximation that is related to the negative
log-likelihood of a Student's t-distribution. The problem is then framed as a
d.c. (difference of convex functions) program and is solved as a sequence of
convex programs by invoking the majorization-minimization method. The resulting
algorithm is proved to exhibit \emph{global convergence} behavior, i.e., for
any random initialization, the sequence (subsequence) of iterates generated by
the algorithm converges to a stationary point of the d.c. program. The
performance of the algorithm is empirically demonstrated on both sparse PCA
(finding few relevant genes that explain as much variance as possible in a
high-dimensional gene dataset) and sparse CCA (cross-language document
retrieval and vocabulary selection for music retrieval) applications.
Comment: 40 pages
Text authorship identified using the dynamics of word co-occurrence networks
The identification of authorship in disputed documents still requires human
expertise, which is now unfeasible for many tasks owing to the large volumes of
text and authors in practical applications. In this study, we introduce a
methodology based on the dynamics of word co-occurrence networks representing
written texts to classify a corpus of 80 texts by 8 authors. The texts were
divided into sections with an equal number of linguistic tokens, from which time
series were created for 12 topological metrics. The series were proven to be
stationary (p-value > 0.05), which permits the use of distribution moments as
learning attributes. With an optimized supervised learning procedure using a
Radial Basis Function Network, 68 out of 80 texts were correctly classified,
i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in
purely dynamic network metrics were found to characterize authorship, thus
opening the way for the description of texts in terms of small evolving
networks. Moreover, the approach introduced allows for comparison of texts with
diverse characteristics in a simple, fast fashion.
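The pipeline this abstract outlines (equal-size sections, one network per section, a time series of a topological metric, then distribution moments as features) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, words are linked only when adjacent, average degree stands in for the 12 metrics, and the classifier step is omitted.

```python
from collections import defaultdict
from statistics import mean, pvariance


def split_sections(tokens, size):
    """Split tokens into consecutive sections of exactly `size` tokens.

    Any trailing remainder shorter than `size` is dropped (an assumption).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def cooccurrence_graph(section):
    """Build an undirected graph linking each pair of distinct adjacent words."""
    neighbors = defaultdict(set)
    for a, b in zip(section, section[1:]):
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def average_degree(neighbors):
    """Mean number of neighbors per node; 0.0 for a graph with no edges."""
    return mean(len(n) for n in neighbors.values()) if neighbors else 0.0


def authorship_features(tokens, size):
    """Return (mean, population variance) of the per-section metric series."""
    series = [average_degree(cooccurrence_graph(s))
              for s in split_sections(tokens, size)]
    return mean(series), pvariance(series)
```

In a fuller version, each text would contribute one feature vector holding several moments for each metric, and those vectors would be passed to a supervised classifier.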
Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of language diatopic variation using
geotagged microblogging datasets. By collecting all Twitter messages written in
Spanish over more than two years, we build a corpus from which a carefully
selected list of concepts allows us to characterize Spanish varieties on a
global scale. A cluster analysis proves the existence of well defined
macroregions sharing common lexical properties. Remarkably enough, we find that
the Spanish language is split into two superdialects, namely, an urban speech used
across major American and Spanish cities and a diverse form that encompasses
rural areas and small towns. The latter can be further clustered into smaller
varieties with a stronger regional character.
Comment: 10 pages, 5 figures
Understanding Interest And Self-Efficacy In The Reading And Writing Of Students With Persisting Specific Learning Disabilities During Middle Childhood And Early Adolescence
Three methodological approaches were applied to understand the role of interest and self-efficacy in reading and/or writing in students without and with persisting specific learning disabilities (SLDs) in literacy. For each approach, students in grades 4 to 9 completed a survey in which they rated 10 reading items and 10 writing items on a scale of 1 to 5; all items were the same but the domain varied. The first approach applied Principal Component Analysis with Varimax Rotation to a sample that varied in specific kinds of literacy achievement. The second approach applied bidirectional multiple regressions in a sample of students with diagnosed SLDs-WL to (a) predict literacy achievement from ratings on interest and self-efficacy survey items; and (b) predict ratings on interest and self-efficacy survey items from literacy achievement. The third approach correlated ratings on the surveys with BOLD activation on an fMRI word reading/spelling task in a brain region associated with approach/avoidance and affect in a sample with diagnosed SLDs-WL. The first approach identified two components for the reading items (each correlated differently with reading skills) and two components for the writing items (each correlated differently with writing skills), but the components were not the same for both domains. Multiple regressions supported predicting interest and self-efficacy ratings from current reading achievement, rather than predicting reading achievement from interest and self-efficacy ratings, but also bidirectional relationships between interest or self-efficacy in writing and writing achievement. The third approach found negative correlations with amygdala connectivity for 2 reading items, but 5 positive and 2 negative correlations with amygdala connectivity for writing items; negative correlations may reflect avoidance and positive correlations approach.
Collectively, results show the relevance and domain-specificity of interest and self-efficacy in reading and writing for students with persisting SLDs in literacy.
Better communication research project : language and literacy attainment of pupils during early years and through KS2 : does teacher assessment at five provide a valid measure of children's current and future educational attainments?
It is well-established that language skills are amongst the best predictors of educational success. Consistent with this, findings from a population-based longitudinal study of parents and children in the UK indicate that language development at the age of two years predicts children's performance on entering primary school. Moreover, children who enter school with poorly developed speech and language are at risk of literacy difficulties and educational
underachievement is common in such children. Whatever the origin of children's problems with language and communication, the poor educational attainment of children with language learning difficulties is an important concern for educational policy
Reviews Matter: How Distributed Mentoring Predicts Lexical Diversity on Fanfiction.net
Fanfiction.net provides an informal learning space for young writers through
distributed mentoring, networked giving and receiving of feedback. In this
paper, we quantify the cumulative effect of feedback on lexical diversity for
1.5 million authors.
Comment: Connected Learning Summit 201