Discovering Periodic Patterns in Historical News
We address the problem of observing periodic changes in the behaviour of a large population, by analysing the daily contents of newspapers published in the United States and United Kingdom from 1836 to 1922. This is done by analysing the daily time series of the relative frequency of the 25K most frequent words for each country, resulting in the study of 50K time series for 31,755 days. Behaviours that are found to be strongly periodic include seasonal activities, such as hunting and harvesting. A strong connection with natural cycles is found, with a pronounced presence of fruits, vegetables, flowers and game. Periodicities dictated by religious or civil calendars are also detected and show a different wave-form than those provoked by weather. States that can be revealed include the presence of infectious disease, with clear annual peaks for fever, pneumonia and diarrhoea. Overall, 2% of the words are found to be strongly periodic, and the period most frequently found is 365 days. Comparisons between the UK and the US, and between modern and historical news, reveal how the fundamental cycles of life are shaped by the seasons, but also how this effect has been reduced in modern times.
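As a rough illustration of how a dominant period can be read off a daily relative-frequency series, the sketch below computes a plain discrete Fourier transform and picks the strongest component. It is not the authors' method, which the abstract does not specify; the function name `dominant_period` and the synthetic series are assumptions made purely for illustration.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest DFT component."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]  # remove the constant (DC) component
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2 + 1):  # k = number of full cycles over the series
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


# Synthetic relative frequency of a hypothetical seasonal word over three years
# (made-up values, not taken from the newspaper data).
days = 3 * 365
freq = [1e-4 * (1 + 0.5 * math.sin(2 * math.pi * d / 365)) for d in range(days)]
period = dominant_period(freq)
print(period)
```

On real data the series length is rarely an exact multiple of the period, so the peak spreads over neighbouring frequencies; an FFT routine and a significance test would normally replace this quadratic-time loop.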
A Dataset for Learning Graph Representations to Predict Customer Returns in Fashion Retail
We present a novel dataset collected by ASOS (a major online fashion
retailer) to address the challenge of predicting customer returns in a fashion
retail ecosystem. With the release of this substantial dataset we hope to
motivate further collaboration between research communities and the fashion
industry. We first explore the structure of this dataset with a focus on the
application of Graph Representation Learning in order to exploit the natural
data structure and provide statistical insights into particular features within
the data. In addition to this, we show examples of a return prediction
classification task with a selection of baseline models (i.e. with no
intermediate representation learning step) and a graph representation based
model. We show that in a downstream return prediction classification task, an
F1-score of 0.792 can be found using a Graph Neural Network (GNN), improving
upon other models discussed in this work. Alongside this increased F1-score, we
also present a lower cross-entropy loss by recasting the data into a graph
structure, indicating more robust predictions from a GNN based solution. These
results provide evidence that GNNs could provide more impactful and usable
classifications than other baseline models on the presented dataset and with
this motivation, we hope to encourage further research into graph-based
approaches using the ASOS GraphReturns dataset.
Comment: The ASOS GraphReturns dataset can be found at https://osf.io/c793h/. Accepted at FashionXRecSys 2022 workshop. Published version.
ASOS graph returns dataset
A graph dataset of anonymised customer returns in online fashion retail.
Représentation et apprentissage à partir de textes pour des informations émotionnelles et pour des informations dynamiques
Automatic knowledge extraction from texts consists in mapping low-level information, as carried by the words and phrases extracted from documents, to higher-level information. The choice of data representation for describing documents is thus essential, and the definition of a learning algorithm is subject to its specifics. This thesis addresses these two issues in the context of emotional information on the one hand and dynamic information on the other. In the first part, we consider the task of emotion extraction, for which the semantic gap is wider than it is with more traditional thematic information. We therefore propose to study representations aimed at modeling the many nuances of natural language used for describing emotional, hence subjective, information. Furthermore, we propose to study the integration of semantic knowledge which provides, from a characterization perspective, support for extracting the emotional content of documents and, from a prediction perspective, assistance to the learning algorithm. In the second part, we study information dynamics: any corpus of documents published over the Internet can be associated with sources in perpetual activity which exchange information in a continuous movement. We explore three main lines of work: automatically identified sources; the communities they form in a dynamic and very sparse description space; and the noteworthy themes they develop. For each, we propose original extraction methods, which we apply to a corpus of real data we have collected from information streams over the Internet.
Fusion anticipée de descripteurs bas niveau pour la détection d'émotions dans les textes
National audience
Data from: Circadian mood variations in Twitter content
Background: Circadian regulation of sleep, cognition, and metabolic state is driven by a central clock, which is in turn entrained by environmental signals. Understanding the circadian regulation of mood, which is vital for coping with day-to-day needs, requires large datasets and has classically utilised subjective reporting. Methods: In this study, we use a massive dataset of over 800 million Twitter messages collected over 4 years in the United Kingdom. We extract robust signals of the changes that happened during the course of the day in the collective expression of emotions and fatigue. We use methods of statistical analysis and Fourier analysis to identify periodic structures, extrema, and change-points, and compare the stability of these events across seasons and weekends. Results: We reveal strong, but different, circadian patterns for positive and negative moods. The cycles of fatigue and anger appear remarkably stable across seasons and weekend/weekday boundaries. Positive mood and sadness interact more in response to these changing conditions. Anger and, to a lesser extent, fatigue show a pattern that inversely mirrors the known circadian variation of plasma cortisol concentrations. Most quantities show a strong inflexion in the morning. Conclusion: Since circadian rhythm and sleep disorders have been reported across the whole spectrum of mood disorders, we suggest that analysis of social media could provide a valuable resource to the understanding of mental disorder.
Apprentissage de concepts émotionnels à partir de descripteurs bas niveau
National audience. This paper addresses the task of emotion recognition in unstructured textual documents. It first reviews existing representations of documents able to cope with the subjectivity of their emotional content. We then describe the proposed method: following an early fusion strategy, features defined as n-grams of several orders are combined. Moreover, dictionaries specific to each emotion label are automatically extracted. The proposed decision process is implemented as a two-level "one vs. all" strategy relying on linear SVMs. The resulting system was applied to the I2B2 track2 challenge and ranked well among systems relying on a low-level representation of the data. We detail the results obtained over the corpus of real data, comprising 4,241 sentences labeled with 12 emotion labels and 3 non-emotional labels.
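The early fusion of n-grams of several orders described above can be pictured as pooling all orders into a single feature bag before any classifier sees the data. The sketch below shows only that feature-combination step, under assumed whitespace tokenisation; the helper names `ngrams` and `early_fusion_features` are hypothetical, and the per-emotion dictionaries and the two-level "one vs. all" SVM stage from the paper are not reproduced.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of order n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def early_fusion_features(text, orders=(1, 2, 3)):
    """Count n-grams of every requested order in one shared feature bag."""
    tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
    features = Counter()
    for n in orders:
        for gram in ngrams(tokens, n):
            features[(n, gram)] += 1  # key keeps the order, so orders never collide
    return features


# Made-up example sentence, not taken from the I2B2 corpus.
feats = early_fusion_features("I am so so sorry")
print(feats[(1, "so")], feats[(2, "so so")], len(feats))
```

A linear classifier trained on such a bag weighs unigram and higher-order evidence jointly, which is the point of fusing early rather than combining the outputs of separate per-order models.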
FindMyPast Daily Words
Time series of daily frequencies of 25K words over 87 years in UK historical newspapers. Release of the daily frequency of the 25K most published words in news content in the United Kingdom between 1st January 1836 and 31st December 1922. The frequency was measured from a representative set of newspapers across the United Kingdom at the time.
Comparison between the variations of posemo, anger, sadness, and the two leading factors.
The x-axis indicates the hour.
Bristol UK Modern Daily Words
Time series of daily frequencies of 25K words over 6 years in modern UK online news. Release of the daily frequency of the 25K most published words in news content in the United Kingdom between 1st January 2010 and 31st December 2015. The frequency was measured from a set of online news outlets across the United Kingdom at the time.