12,873 research outputs found

Cross correlations of the American baby names

Authors: Barucca Paolo; Marinari Enzo; Parisi Giorgio; Ricci-Tersenghi Federico; Rocchi Jacopo
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 13/05/2015

The quantitative description of cultural evolution is a challenging task. The most difficult part of the problem is probably to find the appropriate measurable quantities that can make more quantitative such evasive concepts as, for example, dynamics of cultural movements, behavior patterns and traditions of the people. A strategy to tackle this issue is to observe particular features of human activities, i.e. cultural traits, such as names given to newborns. We study the names of babies born in the United States of America from 1910 to 2012. Our analysis shows that groups of different correlated states naturally emerge in different epochs, and we are able to follow and decrypt their evolution. While these groups of states are stable across many decades, a sudden reorganization occurs in the last part of the twentieth century. We think that this kind of quantitative analysis can be possibly extended to other cultural traits: although databases covering more than one century (as the one we used) are rare, the cultural evolution on shorter time scales can be studied thanks to the fact that many human activities are usually recorded in the present digital era.

Comment: submitted for consideration to PNAS

arXiv.org e-Print Archive

CiteSeerX

Net generation culture

Author: Rettie Ruth
Publication date: 01/11/2002

Kingston University Research Repository

Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ

Author: Kessler Jason S.
Publication date: 01/01/2017

Scattertext is an open source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot, where each axis corresponds to the rank-frequency a term occurs in a category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible term-representing points and find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as a visualization for comparing the importance scores of bag-of-words features to univariate metrics.

Comment: ACL 2017 Demos. 6 pages, 5 figures. See the GitHub repo https://github.com/JasonKessler/scattertext for source code and documentation

arXiv.org e-Print Archive

Crossref
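The rank-frequency axes described in the Scattertext abstract above can be illustrated with a minimal sketch. This is not Scattertext's API or its exact scaling: the helper `dense_rank_scale`, the toy documents, and the choice of dense ranking scaled to [0, 1] are all assumptions made for illustration, and the tie-breaking strategy mentioned in the abstract is not reproduced.

```python
from collections import Counter


def dense_rank_scale(counts, vocabulary):
    """Map each term to its dense rank by count, scaled to [0, 1].

    Hypothetical helper: terms with equal counts share a rank, and
    terms absent from the category get a count of 0.
    """
    distinct = sorted({counts.get(term, 0) for term in vocabulary})
    rank_of = {count: i for i, count in enumerate(distinct)}
    top = max(len(distinct) - 1, 1)  # avoid dividing by zero
    return {term: rank_of[counts.get(term, 0)] / top for term in vocabulary}


# Two toy document categories (made-up text, for illustration only).
cat_a = "tax cut jobs jobs economy economy economy".split()
cat_b = "health care jobs economy care care".split()

counts_a, counts_b = Counter(cat_a), Counter(cat_b)
vocab = sorted(set(counts_a) | set(counts_b))

x = dense_rank_scale(counts_a, vocab)  # horizontal axis: category A
y = dense_rank_scale(counts_b, vocab)  # vertical axis: category B

# Each term becomes one point; terms far from the diagonal are the
# ones used much more in one category than in the other.
points = {term: (x[term], y[term]) for term in vocab}
for term, (px, py) in points.items():
    print(f"{term:8s} x={px:.2f} y={py:.2f}")
```

Ranking instead of plotting raw counts spreads the points more evenly across the plane, which is presumably what leaves room for labels.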

Quantitative Analysis of Genealogy Using Digitised Family Trees

Authors: Chesney Thomas; Elovici Yuval; Fire Michael
Publication date: 30/08/2014

Driven by the popularity of television shows such as Who Do You Think You Are? many millions of users have uploaded their family tree to web projects such as WikiTree. Analysis of this corpus enables us to investigate genealogy computationally. The study of heritage in the social sciences has led to an increased understanding of ancestry and descent but such efforts are hampered by difficult-to-access data. Genealogical research is typically a tedious process involving trawling through sources such as birth and death certificates, wills, letters and land deeds. Decades of research have developed and examined hypotheses on population sex ratios, marriage trends, fertility, lifespan, and the frequency of twins and triplets. These can now be tested on vast datasets containing many billions of entries using machine learning tools. Here we survey the use of genealogy data mining using family trees dating back centuries and featuring profiles on nearly 7 million individuals based in over 160 countries. These data are not typically created by trained genealogists and so we verify them with reference to third party censuses. We present results on a range of aspects of population dynamics. Our approach extends the boundaries of genealogy inquiry to precise measurement of underlying human phenomena.

arXiv.org e-Print Archive

CiteSeerX

Biased Embeddings from Wild Data: Measuring, Understanding and Removing

Authors: Cristianini Nello; Lansdall-Welfare Thomas; Sutton Adam
Publication date: 16/06/2018

Many modern Artificial Intelligence (AI) systems make use of data embeddings, particularly in the domain of Natural Language Processing (NLP). These embeddings are learnt from data that has been gathered "from the wild" and have been found to contain unwanted biases. In this paper we make three contributions towards measuring, understanding and removing this problem. We present a rigorous way to measure some of these biases, based on the use of word lists created for social psychology applications; we observe how gender bias in occupations reflects actual gender bias in the same occupations in the real world; and finally we demonstrate how a simple projection can significantly reduce the effects of embedding bias. All this is part of an ongoing effort to understand how trust can be built into AI systems.

Comment: Author's original version

arXiv.org e-Print Archive

Explore Bristol Research
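The "simple projection" mentioned in the abstract above for Biased Embeddings from Wild Data is not specified in this listing. One common form of the idea, sketched here under that assumption, is to estimate a bias direction from a pair of contrasting word vectors and subtract each vector's component along it. The three-dimensional vectors and the helper names (`dot`, `normalize`, `remove_direction`) are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length."""
    norm = sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def remove_direction(vec, direction):
    """Return vec with its component along `direction` projected out."""
    d = normalize(direction)
    scale = dot(vec, d)
    return [a - scale * b for a, b in zip(vec, d)]


# Toy 3-d "embeddings" (invented numbers, not real model output).
he = [1.0, 0.0, 0.2]
she = [-1.0, 0.0, 0.2]
nurse = [-0.6, 0.8, 0.1]

# Assumed bias direction: the difference between the two gendered words.
bias_direction = [a - b for a, b in zip(he, she)]

debiased_nurse = remove_direction(nurse, bias_direction)

# After projection the vector is orthogonal to the bias direction.
print(dot(nurse, normalize(bias_direction)))
print(dot(debiased_nurse, normalize(bias_direction)))
```

In practice the direction would more plausibly be estimated from many word pairs or word lists, not a single pair, and the effect would be checked with a bias measure before and after.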

Digital Entertainment to Support Toddlers' Language and Cognitive Development

Authors: Linuwih E. R. (Endar); Trihastutie N. (Nopita)
Publication venue: 'Universitas Teknokrat Indonesia'
Publication date: 01/01/2020

This research examined how English nursery rhymes and kids' songs, used as learning media, support toddlers in learning English when they do not live in an English-speaking country (here, Indonesia) but are exposed to English-language media during their normal baby-sitting times. To observe how two Indonesian toddlers learned English in their early critical period of language acquisition through co-watching activity, the Early Development Instrument, which focuses on the language and cognitive development domain with reading awareness and reciting memory subdomains, was applied to the two subjects after 15 months of treatment (from age 10-24 months). The results show that the media and the co-watching activity are able to support the toddlers' understanding of the English words spoken and their ability to produce intelligible pronunciation of those words. Interestingly, English, which is normatively learned merely as a foreign language by most Indonesian people, is no longer something far-off to toddlers who are exposed to it through online English nursery rhymes and kids' songs from a very young age. They naturally tend to be bilingual, since they learn their mother tongue at the same time.

Neliti

Smartphone picture organization: a hierarchical approach

Authors: Dimiccoli Mariella; Lonn Stefan; Radeva Petia
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to a tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which typically are pre-processed by the user who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and, because smartphones are ubiquitous, they present a larger variability compared to pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose here a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach successfully estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated by using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 persons. Experimental results demonstrate major user satisfaction with respect to state of the art solutions in terms of organization.

Peer Reviewed. Preprint

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

Authors: Manolopoulou Ioanna; Musolesi Mirco; O'Sullivan Jason; Prior Rosie; Vega-Carrasco Mariflor
Publication date: 04/05/2020

Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analyzing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping interpretable outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarizing the posterior distribution of an LDA model is challenging, while individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures which may not adequately capture the qualitative aspects such as interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarise the entire posterior distribution and identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we establish a more holistic definition for model evaluation, which assesses topic models based not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that the selection of recurrent topics through our clustering methodology not only improves model likelihood but also outperforms the qualitative aspects of LDA such as interpretability and stability. We illustrate our methods on an example from a large UK supermarket chain.

Comment: 20 pages, 9 figures

arXiv.org e-Print Archive

UCL Discovery

Nationality Classification Using Name Embeddings

Authors: Coskun Baris; Han Shuchu; Hu Yifan; Liu Meizhu; Qin Hong; Skiena Steven; Ye Junting
Publication date: 25/08/2017

Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification. We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality which is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantially better than our closest competitor Ethnea (0.580). To the best of our knowledge, this is the most accurate, fine-grained nationality classifier available. As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainments favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in.

Comment: 10 pages, 9 figures, 4 tables, accepted by CIKM 2017, Demo and free API: www.name-prism.co

arXiv.org e-Print Archive

Crossref