
    Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death

    We analyze the dynamic properties of 10^7 words recorded in English, Spanish and Hebrew over the period 1800--2008 in order to gain insight into the coevolution of language and culture. We report language-independent patterns useful as benchmarks for theoretical models of language evolution. A significantly decreasing (increasing) trend in the birth (death) rate of words indicates a recent shift in the selection laws governing word use. For new words, we observe a peak in the growth-rate fluctuations around 40 years after introduction, consistent with the typical entry time into standard dictionaries and the human generational timescale. Pronounced changes in the dynamics of language during periods of war show that word correlations, occurring across time and between words, are largely influenced by coevolutionary social, technological, and political factors. We quantify cultural memory by analyzing the long-term correlations in the use of individual words using detrended fluctuation analysis. Comment: Version 1: 31 pages, 17 figures, 3 tables. Version 2 is streamlined, eliminates substantial material and incorporates referee comments: 19 pages, 14 figures, 3 tables

    TextGAIL: Generative Adversarial Imitation Learning for Text Generation

    Generative Adversarial Networks (GANs) for text generation have recently received much criticism, as they perform worse than their MLE counterparts. We suspect previous text GANs' inferior performance is due to the lack of a reliable guiding signal in their discriminators. To address this problem, we propose a generative adversarial imitation learning framework for text generation that uses large pre-trained language models to provide more reliable reward guidance. Our approach uses a contrastive discriminator and proximal policy optimization (PPO) to stabilize and improve text generation performance. For evaluation, we conduct experiments on a diverse set of unconditional and conditional text generation tasks. Experimental results show that TextGAIL achieves better performance in terms of both quality and diversity than the MLE baseline. We also validate our intuition that TextGAIL's discriminator demonstrates the capability of providing reasonable rewards with an additional task. Comment: AAAI 202

    Anchoring vignettes: can they make adolescent self-reports of social-emotional skills more reliable, discriminant, and criterion-valid?

    Individuals differ in the way they use rating scales to describe themselves, and these differences are particularly pronounced in children and early adolescents. One promising remedy is to correct (or "anchor") an individual's responses according to the way they use the scale when they rate an anchoring vignette (a set of hypothetical targets differing on the attribute of interest). Studying adolescents' self-reports of their socio-emotional attributes, we compared traditional self-report scores with vignette-corrected scores in terms of reliability (internal consistency), discriminant validity (scale intercorrelations), and criterion validity (predicting achievement test scores in language and math). A large and representative sample of 12th grade Brazilian students (N = 8,582, 62% female, mean age 18.2) was administered a Portuguese-language self-report inventory assessing social-emotional skills related to the Big Five personality dimensions. Correcting scores according to vignette ratings led to increases in the reliability of scales measuring Conscientiousness and Openness, but discriminant validity and criterion validity increased only when each scale was corrected using its own corresponding vignette set. Moreover, accuracy in rating the vignettes was correlated with language achievement test scores, suggesting that verbal factors play a role in providing both normative vignette ratings of others and self-reports that are reliable and valid.

    Diffusion of Lexical Change in Social Media

    Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in the diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter's sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity -- especially with regard to race -- plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified "netspeak" dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English. Comment: preprint of PLOS-ONE paper from November 2014; PLoS ONE 9(11) e11311

    A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem

    In this paper, we consider the sparse eigenvalue problem wherein the goal is to obtain a sparse solution to the generalized eigenvalue problem. We achieve this by constraining the cardinality of the solution to the generalized eigenvalue problem and obtain sparse principal component analysis (PCA), sparse canonical correlation analysis (CCA) and sparse Fisher discriminant analysis (FDA) as special cases. Unlike the ℓ1-norm approximation to the cardinality constraint, which previous methods have used in the context of sparse PCA, we propose a tighter approximation that is related to the negative log-likelihood of a Student's t-distribution. The problem is then framed as a d.c. (difference of convex functions) program and is solved as a sequence of convex programs by invoking the majorization-minimization method. The resulting algorithm is proved to exhibit global convergence behavior, i.e., for any random initialization, the sequence (subsequence) of iterates generated by the algorithm converges to a stationary point of the d.c. program. The performance of the algorithm is empirically demonstrated on both sparse PCA (finding few relevant genes that explain as much variance as possible in a high-dimensional gene dataset) and sparse CCA (cross-language document retrieval and vocabulary selection for music retrieval) applications. Comment: 40 pages
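
For orientation, the contrast between the cardinality, the ℓ1 norm, and a log-based surrogate can be written out as below. The abstract does not state the surrogate itself, so the form shown, with an assumed smoothing parameter ε > 0, is only an illustration of the kind of approximation described and not necessarily the authors' exact formula.

```latex
\|x\|_0 \;=\; \sum_{i=1}^{n} \mathbf{1}\{x_i \neq 0\}
        \;=\; \lim_{\varepsilon \to 0^{+}} \sum_{i=1}^{n}
              \frac{\log\!\left(1 + |x_i|/\varepsilon\right)}
                   {\log\!\left(1 + 1/\varepsilon\right)},
\qquad
\|x\|_1 \;=\; \sum_{i=1}^{n} |x_i| .
```

For a fixed ε, each summand of the log-based surrogate is concave in |x_i|, which is the property that makes a difference-of-convex formulation natural.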

    Text authorship identified using the dynamics of word co-occurrence networks

    The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. The series were found to be stationary (p-value>0.05), which permits the use of distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks. Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion.
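
The pipeline described in this abstract (equal-size sections, one network metric per section, distribution moments as features) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the network links adjacent tokens only, and average degree stands in for the 12 topological metrics, which the abstract does not list.

```python
from collections import defaultdict
from statistics import mean, pvariance


def split_equal(tokens, size):
    """Split tokens into consecutive sections of exactly `size` tokens.

    A trailing remainder shorter than `size` is dropped (an assumption).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def average_degree(section):
    """Average degree of the network linking adjacent, distinct tokens.

    Average degree is a stand-in metric chosen for illustration only.
    """
    neighbors = defaultdict(set)
    for a, b in zip(section, section[1:]):
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    if not neighbors:
        return 0.0
    return sum(len(n) for n in neighbors.values()) / len(neighbors)


def moment_features(tokens, size):
    """Build the per-section metric series and summarize it by two moments.

    Raises statistics.StatisticsError if the text is shorter than `size`.
    """
    series = [average_degree(s) for s in split_equal(tokens, size)]
    return {"mean": mean(series), "variance": pvariance(series)}
```

In use, each text would yield one feature dictionary of this kind per metric, and those features would then be passed to a supervised classifier.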

    Crowdsourcing Dialect Characterization through Twitter

    We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character. Comment: 10 pages, 5 figures

    Understanding Interest And Self-Efficacy In The Reading And Writing Of Students With Persisting Specific Learning Disabilities During Middle Childhood And Early Adolescence

    Three methodological approaches were applied to understand the role of interest and self-efficacy in reading and/or writing in students without and with persisting specific learning disabilities (SLDs) in literacy. For each approach, students in grades 4 to 9 completed a survey in which they rated 10 reading items and 10 writing items on a scale of 1 to 5; all items were the same but domain varied. The first approach applied Principal Component Analysis with Varimax Rotation to a sample that varied in specific kinds of literacy achievement. The second approach applied bidirectional multiple regressions in a sample of students with diagnosed SLDs-WL to (a) predict literacy achievement from ratings on interest and self-efficacy survey items; and (b) predict ratings on interest and self-efficacy survey items from literacy achievement. The third approach correlated ratings on the surveys with BOLD activation on an fMRI word reading/spelling task in a brain region associated with approach/avoidance and affect in a sample with diagnosed SLDs-WL. The first approach identified two components for the reading items (each correlated differently with reading skills) and two components for the writing items (each correlated differently with writing skills), but the components were not the same for both domains. Multiple regressions supported predicting interest and self-efficacy ratings from current reading achievement, rather than predicting reading achievement from interest and self-efficacy ratings, but also bidirectional relationships between interest or self-efficacy in writing and writing achievement. The third approach found negative correlations with amygdala connectivity for 2 reading items, but 5 positive and 2 negative correlations with amygdala connectivity for writing items; negative correlations may reflect avoidance and positive correlations approach. Collectively, results show the relevance and domain-specificity of interest and self-efficacy in reading and writing for students with persisting SLDs in literacy.

    Better communication research project : language and literacy attainment of pupils during early years and through KS2 : does teacher assessment at five provide a valid measure of children's current and future educational attainments?

    It is well established that language skills are amongst the best predictors of educational success. Consistent with this, findings from a population-based longitudinal study of parents and children in the UK indicate that language development at the age of two years predicts children’s performance on entering primary school. Moreover, children who enter school with poorly developed speech and language are at risk of literacy difficulties, and educational underachievement is common in such children. Whatever the origin of children’s problems with language and communication, the poor educational attainment of children with language learning difficulties is an important concern for educational policy

    Reviews Matter: How Distributed Mentoring Predicts Lexical Diversity on Fanfiction.net

    Fanfiction.net provides an informal learning space for young writers through distributed mentoring, networked giving and receiving of feedback. In this paper, we quantify the cumulative effect of feedback on lexical diversity for 1.5 million authors. Comment: Connected Learning Summit 201