Cross correlations of the American baby names
The quantitative description of cultural evolution is a challenging task. The
most difficult part of the problem is probably to find the appropriate
measurable quantities that can capture such elusive concepts as,
for example, dynamics of cultural movements, behavior patterns and traditions
of the people. A strategy to tackle this issue is to observe particular
features of human activities, i.e. cultural traits, such as names given to
newborns. We study the names of babies born in the United States of America
from 1910 to 2012. Our analysis shows that groups of different correlated
states naturally emerge in different epochs, and we are able to follow and
decrypt their evolution. While these groups of states are stable across many
decades, a sudden reorganization occurs in the last part of the twentieth
century. We think that this kind of quantitative analysis can possibly be
extended to other cultural traits: although databases covering more than one
century (such as the one we used) are rare, cultural evolution on shorter time
scales can be studied thanks to the fact that many human activities are usually
recorded in the present digital era.
Comment: submitted for consideration to PNA
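The abstract above does not spell out how the correlations between states are computed. As a minimal sketch, assuming each state is represented by its vector of relative name frequencies and that states are compared with a plain Pearson correlation, the idea could look like the following. The state labels, names and counts are made up for illustration, and `state_correlations` is a hypothetical helper, not the authors' method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def state_correlations(counts_by_state):
    """Correlate every pair of states on their relative name frequencies."""
    names = sorted({name for counts in counts_by_state.values() for name in counts})
    vectors = {}
    for state, counts in counts_by_state.items():
        total = sum(counts.values())
        # Relative frequency of each name, 0 if the name is absent in this state.
        vectors[state] = [counts.get(name, 0) / total for name in names]
    states = sorted(vectors)
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for i, a in enumerate(states)
        for b in states[i + 1:]
    }


# Toy counts for three hypothetical states (illustrative numbers only).
toy_counts = {
    "A": {"Mary": 50, "John": 40, "Linda": 10},
    "B": {"Mary": 45, "John": 45, "Linda": 10},
    "C": {"Mary": 10, "John": 20, "Linda": 70},
}
correlations = state_correlations(toy_counts)
```

In this toy data, states A and B have similar naming habits and correlate strongly, while A and C do not; grouping states by such pairwise values is one way correlated groups could emerge.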
Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ
Scattertext is an open source tool for visualizing linguistic variation
between document categories in a language-independent way. The tool presents a
scatterplot, where each axis corresponds to the rank-frequency a term occurs in
a category of documents. Through a tie-breaking strategy, the tool is able to
display thousands of visible term-representing points and find space to legibly
label hundreds of them. Scattertext also lends itself to a query-based
visualization of how the use of terms with similar embeddings differs between
document categories, as well as a visualization for comparing the importance
scores of bag-of-words features to univariate metrics.
Comment: ACL 2017 Demos. 6 pages, 5 figures. See the GitHub repo
https://github.com/JasonKessler/scattertext for source code and documentatio
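The rank-frequency axes described above can be illustrated without the tool itself. The sketch below is not Scattertext's API; it is a stdlib-only approximation that assumes whitespace tokenization and a dense rank of term counts scaled to [0, 1], and the function names and example documents are hypothetical.

```python
from collections import Counter


def dense_rank_scale(counts, vocabulary):
    """Map each term to its dense frequency rank, scaled to the range [0, 1]."""
    distinct = sorted({counts.get(term, 0) for term in vocabulary})
    rank_of = {freq: rank for rank, freq in enumerate(distinct)}
    top = max(len(distinct) - 1, 1)  # avoid division by zero for a single rank
    return {term: rank_of[counts.get(term, 0)] / top for term in vocabulary}


def rank_frequency_coordinates(docs_a, docs_b):
    """Return {term: (x, y)}: x is the rank in category A, y the rank in category B."""
    counts_a = Counter(word for doc in docs_a for word in doc.lower().split())
    counts_b = Counter(word for doc in docs_b for word in doc.lower().split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    xs = dense_rank_scale(counts_a, vocabulary)
    ys = dense_rank_scale(counts_b, vocabulary)
    return {term: (xs[term], ys[term]) for term in vocabulary}


# Two tiny made-up document categories.
coords = rank_frequency_coordinates(
    ["tax cut tax jobs", "jobs tax"],
    ["health care jobs", "care health care"],
)
```

Terms that are frequent in only one category land near opposite corners of the plot (here `tax` and `care`), while terms shared by both categories fall toward the middle.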
Quantitative Analysis of Genealogy Using Digitised Family Trees
Driven by the popularity of television shows such as Who Do You Think You
Are? many millions of users have uploaded their family tree to web projects
such as WikiTree. Analysis of this corpus enables us to investigate genealogy
computationally. The study of heritage in the social sciences has led to an
increased understanding of ancestry and descent but such efforts are hampered
by difficult-to-access data. Genealogical research is typically a tedious
process involving trawling through sources such as birth and death
certificates, wills, letters and land deeds. Decades of research have developed
and examined hypotheses on population sex ratios, marriage trends, fertility,
lifespan, and the frequency of twins and triplets. These can now be tested on
vast datasets containing many billions of entries using machine learning tools.
Here we survey the use of genealogy data mining using family trees dating back
centuries and featuring profiles on nearly 7 million individuals based in over
160 countries. These data are not typically created by trained genealogists and
so we verify them with reference to third party censuses. We present results on
a range of aspects of population dynamics. Our approach extends the boundaries
of genealogy inquiry to precise measurement of underlying human phenomena
Biased Embeddings from Wild Data: Measuring, Understanding and Removing
Many modern Artificial Intelligence (AI) systems make use of data embeddings,
particularly in the domain of Natural Language Processing (NLP). These
embeddings are learnt from data that has been gathered "from the wild" and have
been found to contain unwanted biases. In this paper we make three
contributions towards measuring, understanding and removing this problem. We
present a rigorous way to measure some of these biases, based on the use of
word lists created for social psychology applications; we observe how gender
bias in occupations reflects actual gender bias in the same occupations in the
real world; and finally we demonstrate how a simple projection can
significantly reduce the effects of embedding bias. All this is part of an
ongoing effort to understand how trust can be built into AI systems.
Comment: Author's original versio
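The abstract does not detail the projection it uses. As a minimal sketch, assuming the bias direction is estimated from the averaged differences of a few definitional word pairs and then removed by orthogonal projection, the operation could look like this. The three-dimensional vectors, the word pairs and the helper names are made up for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def bias_direction(embeddings, pairs):
    """Unit vector along the summed differences of the given word pairs."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for first, second in pairs:
        for i in range(dim):
            total[i] += embeddings[first][i] - embeddings[second][i]
    return normalize(total)


def remove_component(vector, direction):
    """Subtract the projection of `vector` onto the unit vector `direction`."""
    scale = dot(vector, direction)
    return [a - scale * d for a, d in zip(vector, direction)]


# Toy 3-dimensional embeddings (illustrative values, not learnt from data).
toy_embeddings = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "man": [0.9, 0.0, 0.1],
    "woman": [-0.9, 0.0, 0.1],
    "nurse": [-0.5, 0.7, 0.3],
    "engineer": [0.6, 0.6, 0.2],
}
direction = bias_direction(toy_embeddings, [("he", "she"), ("man", "woman")])
debiased = {
    word: remove_component(vec, direction)
    for word, vec in toy_embeddings.items()
    if word in ("nurse", "engineer")
}
```

After the projection, the occupation vectors have no component along the estimated direction, while their remaining coordinates are unchanged.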
Digital Entertainment to Support Toddlers' Language and Cognitive Development
This research examined how English nursery rhymes and children's songs, used as learning media, support the English learning of toddlers who do not live in an English-speaking country (Indonesia) but are exposed to English-language media during their normal babysitting times. To observe how two Indonesian toddlers learned English during the early critical period of language acquisition through co-watching, the Early Development Instrument, focusing on the language and cognitive development domain with its reading awareness and reciting memory subdomains, was applied to the two subjects after 15 months of treatment (from age 10-24 months). The results show that the media and the co-watching activity support the toddlers' understanding of spoken English words and their ability to pronounce those words intelligibly. Notably, English, which most Indonesians learn merely as a foreign language, is no longer remote to toddlers who have been exposed to it through online English nursery rhymes and children's songs from a very young age. They naturally tend to become bilingual, since they learn their mother tongue at the same time
Smartphone picture organization: a hierarchical approach
We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting steadily cheaper, leading to tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which are typically pre-processed by the user, who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and, because smartphones are ubiquitous, they show greater variability than pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 persons. Experimental results demonstrate higher user satisfaction than state-of-the-art solutions in terms of organization.
Peer Reviewed Preprin
Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability
Understanding the shopping motivations behind market baskets has high
commercial value in the grocery retail industry. Analyzing shopping
transactions demands techniques that can cope with the volume and
dimensionality of grocery transactional data while keeping interpretable
outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to
process grocery transactions and to discover a broad representation of
customers' shopping motivations. However, summarizing the posterior
distribution of an LDA model is challenging, while individual LDA draws may not
be coherent and cannot capture topic uncertainty. Moreover, the evaluation of
LDA models is dominated by model-fit measures which may not adequately capture
the qualitative aspects such as interpretability and stability of topics.
In this paper, we introduce a clustering methodology that post-processes
posterior LDA draws to summarise the entire posterior distribution and identify
semantic modes represented as recurrent topics. Our approach is an alternative
to standard label-switching techniques and provides a single posterior summary
set of topics, as well as associated measures of uncertainty. Furthermore, we
establish a more holistic definition for model evaluation, which assesses topic
models based not only on their likelihood but also on their coherence,
distinctiveness and stability. By means of a survey, we set thresholds for the
interpretation of topic coherence and topic similarity in the domain of grocery
retail data. We demonstrate that the selection of recurrent topics through our
clustering methodology not only improves model likelihood but also outperforms
standard LDA on qualitative aspects such as interpretability and stability. We
illustrate our methods on an example from a large UK supermarket chain.
Comment: 20 pages, 9 figure
Nationality Classification Using Name Embeddings
Nationality identification unlocks important demographic information, with
many applications in biomedical and sociological research. Existing name-based
nationality classifiers use name substrings as features and are trained on
small, unrepresentative sets of labeled names, typically extracted from
Wikipedia. As a result, these methods achieve limited performance and cannot
support fine-grained classification.
We exploit the phenomenon of homophily in communication patterns to learn name
embeddings, a new representation that encodes gender, ethnicity, and
nationality which is readily applicable to building classifiers and other
systems. Through our analysis of 57M contact lists from a major Internet
company, we are able to design a fine-grained nationality classifier covering
39 groups representing over 90% of the world population. In an evaluation
against other published systems over 13 common classes, our F1 score (0.795) is
substantially better than our closest competitor Ethnea (0.580). To the best of
our knowledge, this is the most accurate, fine-grained nationality classifier
available.
As a social media application, we apply our classifiers to the followers of
major Twitter celebrities over six different domains. We demonstrate stark
differences in the ethnicities of the followers of Trump and Obama, and in the
sports and entertainments favored by different groups. Finally, we identify an
anomalous political figure whose presumably inflated following appears largely
incapable of reading the language he posts in.
Comment: 10 pages, 9 figures, 4 tables, accepted by CIKM 2017, Demo and free
API: www.name-prism.co