158 research outputs found
No evidence for an association between gender equality and pathogen prevalence – a comment on Varnum and Grossmann (2017)
In a previous study published in Nature Human Behaviour, Varnum and Grossmann claim that reductions in gender inequality are linked to reductions in pathogen prevalence in the United States between 1951 and 2013. Since the statistical methods used by Varnum and Grossmann are known to induce seemingly significant correlations between unrelated time series, so-called spurious or nonsense correlations, we test whether the reported statistical association between gender inequality and pathogen prevalence is likewise the result of mis-specified models that do not correctly account for the temporal structure of the data. Our analysis clearly suggests that this is the case. We then discuss and apply several standard approaches to modelling time-series processes in the data and show that there is, at least as of now, no support for a statistical association between gender inequality and pathogen prevalence.
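The core statistical point can be illustrated with a minimal sketch (ours, not the authors' code): two independent random walks typically show a sizeable correlation in levels, which vanishes once the series are differenced to remove their temporal structure.

```python
# Minimal illustration (not the authors' analysis): independent random
# walks often correlate strongly in levels, but not after differencing.
import numpy as np

rng = np.random.default_rng(0)
n = 63  # about as many annual observations as 1951-2013

x = np.cumsum(rng.normal(size=n))  # random walk 1 (trends by chance)
y = np.cumsum(rng.normal(size=n))  # random walk 2, independent of x

r_levels = np.corrcoef(x, y)[0, 1]                   # often far from 0
r_diffs = np.corrcoef(np.diff(x), np.diff(y))[0, 1]  # typically near 0

print(f"levels: r = {r_levels:.2f}; first differences: r = {r_diffs:.2f}")
```

Because the two walks are unrelated by construction, any strong levels correlation here is purely an artefact of shared trending, which is the mis-specification the comment tests for.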
Both the validity of the cultural tightness index and the association with creativity and order are spurious – a comment on Jackson et al.
A study recently published in Nature Human Behaviour suggested that the historical loosening of American culture was associated with a trade-off between higher creativity and lower order. To demonstrate this, Jackson et al. generated a linguistic index of cultural tightness based on the Google Books Ngram corpus and used this index to show that American norms loosened between 1800 and 2000. While we remain agnostic about a potential loosening of American culture and a statistical association with creativity/order, we show here that the methods used by Jackson et al. are suitable neither for testing the validity of the index nor for establishing possible relationships with creativity/order.
Leakage explains the apparent superiority of Bayesian random effect models – a preregistered comment on Claessens, Kyritsis and Atkinson (2023)
In a previous study, Claessens, Kyritsis, and Atkinson (CKA) demonstrated the importance of controlling for geographic proximity and cultural similarity in cross-national analyses. Based on a simulation study, CKA showed that methods commonly used to control for spatial and cultural non-independence are insufficient in reducing false positives while maintaining the ability to detect true effects. CKA strongly advocate the use of Bayesian random effect models in such situations, arguing that, among the studied model types, they are the only ones that reduced false positives while maintaining high statistical power. However, in this comment, we argue that CKA overstate the apparent superiority of such models due to a form of methodological circularity known in statistics and machine learning as 'leakage': the same proximity matrix is used both to generate the simulated data and as an input to only the Bayesian models under comparison. When this leakage is controlled for, we show that Bayesian models do not outperform most other methods.
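The leakage argument can be sketched in miniature (an illustrative analogue, not CKA's simulation): if the structure that generated the data is handed to only one of the models being compared, that model's advantage is built in rather than earned.

```python
# Illustrative analogue (not CKA's simulation): the same cluster labels
# generate the data AND are given to only one of the compared models.
import numpy as np

rng = np.random.default_rng(1)
n, k = 120, 6
cluster = rng.integers(0, k, size=n)        # 'cultural similarity' structure
effect = rng.normal(size=k)[cluster]        # cluster-level confound
x = rng.normal(size=n)
y = effect + rng.normal(scale=0.5, size=n)  # note: no true effect of x

def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# Model A receives the generating cluster structure; model B does not.
dummies = [(cluster == j).astype(float) for j in range(1, k)]
X_a = np.column_stack([np.ones(n), x] + dummies)
X_b = np.column_stack([np.ones(n), x])

rss_a, rss_b = rss(X_a, y), rss(X_b, y)
print(f"RSS with leaked structure: {rss_a:.1f}; without: {rss_b:.1f}")
```

Model A fits better by construction, so a comparison that gives only one candidate the generative structure cannot tell us which method would be superior on real data, where that structure is unknown.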
Tracking, exploring and analyzing recent developments in German-language online press in the face of the coronavirus crisis: cOWIDplus Analysis and cOWIDplus Viewer
The coronavirus pandemic may be the largest crisis the world has had to face since World War II. It is no surprise that it is also having an impact on language as our primary tool of communication. We present three interconnected resources that are designed to capture and illustrate these effects on a subset of the German language: an RSS corpus of German-language newsfeeds (with freely available untruncated unigram frequency lists), a static but continuously updated HTML page tracking the diversity of the vocabulary used, and a web application that enables other researchers and the broader public to explore these effects with little or no knowledge of corpus representation/exploration or statistical analyses.
How many people constitute a crowd and what do they do? Quantitative analyses of revisions in the English and German Wiktionary editions
Wiktionary is increasingly gaining influence in a wide variety of linguistic fields such as NLP and lexicography, and has great potential to become a serious competitor for publisher-based and academic dictionaries. However, little is known about the "crowd" that is responsible for the content of Wiktionary. In this article, we want to shed some light on selected questions concerning large-scale cooperative work in online dictionaries. To this end, we use quantitative analyses of the complete edit history files of the English and German Wiktionary language editions. Concerning the distribution of revisions over users, we show that, compared to the overall user base, only very few authors are responsible for the vast majority of revisions in the two Wiktionary editions. In the next step, we compare this distribution to the distribution of revisions over all the articles. The articles are subsequently analysed in terms of rigour and diversity, typical revision patterns through time, and novelty (the time since the last revision). We close with an examination of the relationship between corpus frequencies of headwords in articles, the number of article visits, and the number of revisions made to articles. Keywords: User-Generated Content, Online Dictionary, Wiktionary, Revision, Edit, Frequency, Collaboration, Wisdom of the Crowd
The comprehension and comprehensibility of popular science texts: the PopSci – Understanding Science project
The public acceptance and impact of research in the natural and engineering sciences fundamentally depend on whether its aims and findings can be communicated to the public. However, the contents of current research projects are often difficult for a lay audience to access and understand. With the goal of improving the public discussion of research in the natural and engineering sciences, we empirically investigate and evaluate an important sector of popular science discourse in Germany within our project PopSci – Understanding Science. To this end, we identify the linguistic features of German popular science texts using corpus-based methods and examine their effect on laypeople's cognitive processing of these texts. For this purpose, we use pre- and post-reading knowledge tests and also record readers' eye movements while they read popular science texts. From this combination of different methods, we aim to derive initial recommendations for improving the linguistic style and knowledge representation of popular science texts, with a view to optimising scientific publications in German print and online media.
Human languages trade off complexity against efficiency
From a cross-linguistic perspective, language models are interesting because they can be used as idealised language learners that learn to produce and process language by being trained on a corpus of linguistic input. In this paper, we train different language models, from simple statistical models to advanced neural networks, on a database of 41 multilingual text collections comprising a wide variety of text types, which together include nearly 3 billion words across more than 6,500 documents in over 2,000 languages. We use the trained models to estimate entropy rates, a complexity measure derived from information theory. To compare entropy rates across both models and languages, we develop a quantitative approach that combines machine learning with semiparametric spatial filtering methods to account for both language- and document-specific characteristics, as well as phylogenetic and geographical language relationships. We first establish that entropy rate distributions are highly consistent across different language models, suggesting that the choice of model may have minimal impact on cross-linguistic investigations. On the basis of a much broader range of language models than in previous studies, we confirm results showing systematic differences in entropy rates, i.e. text complexity, across languages. These results challenge the long-held notion that all languages are equally complex. We then show that higher entropy rates tend to co-occur with shorter text lengths, and argue that this inverse relationship between complexity and length implies a compensatory mechanism whereby increased complexity is offset by increased efficiency. Finally, we introduce a multi-model multilevel inference approach to show that this complexity-efficiency trade-off is partly influenced by the social environment in which languages are used: languages spoken by larger communities tend to have higher entropy rates while using fewer symbols to encode messages.
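As a toy illustration of the quantity involved (a simplified sketch, not the paper's pipeline, which relies on trained language models and spatial filtering), the unigram entropy of a text gives a crude upper bound on its entropy rate in bits per symbol:

```python
# Simplified sketch: unigram (zeroth-order) entropy of a text, an upper
# bound on the entropy rate that trained language models estimate more
# tightly by exploiting context.
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the unigram distribution."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(unigram_entropy("abab"), 3))  # two balanced symbols -> prints 1.0
```

Better language models lower this bound by conditioning on preceding symbols, which is why the paper compares entropy rate estimates across models of increasing sophistication.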
Less than one percent of words would be affected by gender-inclusive language in German press texts
Research on gender and language is closely tied to social debates on gender equality and non-discriminatory language use. Psycholinguistic scholars have made significant contributions in this field. However, corpus-based studies that investigate these matters within the context of language use are still rare. In our study, we address the question of how much textual material would actually have to be changed if non-gender-inclusive texts were rewritten to be gender-inclusive. This quantitative measure is an important empirical insight, as a recurring argument against the use of gender-inclusive German is that it supposedly makes written texts too long and complicated. It is also argued that gender-inclusive language has negative effects on language learners. However, such effects are only likely if gender-inclusive texts are very different from those that are not gender-inclusive. In our corpus-linguistic study, we manually annotated German press texts to identify the parts that would have to be changed. Our results show that, on average, less than 1% of all tokens would be affected by gender-inclusive language. This small proportion calls into question whether gender-inclusive German presents a substantial barrier to understanding and learning the language, particularly when we take into account the potential complexities of interpreting masculine generics.
Web-based exploration of results from a large European survey on dictionary use and culture: ESDexplorer
We present ESDexplorer (https://owid.shinyapps.io/ESDexplorer), a browser application which allows the user to explore the data from a large European survey on dictionary use and culture. We built ESDexplorer with several target groups in mind: our cooperation partners, other researchers, and a more general public interested in the results. We also present in detail the architecture and technological realisation of the application and discuss some legal aspects of data protection that motivated some architectural choices. Keywords: survey, data collection, data processing, data presentation, data analysis, technology and architecture, target group, plot, browser application, ESDexplorer
Accurate hypotheses and careful reading are essential: results of an observational study of learners using online language resources
In the past two decades, more and more dictionary usage studies have been published, but most of them deal with questions related to what users appreciate about dictionaries, which dictionaries they use and what type of information they need in specific situations, presupposing that users actually consult lexicographic resources. However, language teachers and lecturers in linguistics often have the impression that students do not use enough high-quality dictionaries in their everyday work. With this in mind, we launched an international cooperation project to collect empirical data to evaluate what it is that students actually do while attempting to solve language problems. To this end, we applied a new methodological setting: screen recording in conjunction with a thinking-aloud task. The collected empirical data offers a broad insight into what users really do while they attempt to solve language-related tasks online.