4,476 research outputs found

    Applying machine learning algorithms for deriving personality traits in social network

    Get PDF

    Typical Phone Use Habits: Intense Use Does Not Predict Negative Well-Being

    Full text link
Not all smartphone owners use their device in the same way. In this work, we uncover broad, latent patterns of mobile phone use behavior. We conducted a study in which, via a dedicated logging app, we collected daily mobile phone activity data from a sample of 340 participants over a period of four weeks. Through an unsupervised learning approach and a methodologically rigorous analysis, we reveal five generic phone use profiles, each describing at least 10% of the participants: limited use, business use, power use, and personality-induced & externally induced problematic use. We provide evidence that intense mobile phone use alone does not predict negative well-being. Instead, our approach automatically revealed two groups with tendencies toward lower well-being, both characterized by nightly phone use sessions. Comment: 10 pages, 6 figures, conference paper
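The abstract describes its method only as an unsupervised learning approach. A minimal sketch of how such phone-use profiles could be recovered by clustering daily usage features might look as follows; the feature names, group means, and the choice of k-means are all invented for illustration and are not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-participant daily-usage features:
# [sessions/day, minutes/day, night-time minutes/day]
limited = rng.normal([10, 60, 2], [2, 10, 1], size=(40, 3))
power = rng.normal([80, 300, 10], [10, 30, 3], size=(40, 3))
nightly = rng.normal([50, 200, 90], [8, 25, 10], size=(40, 3))
X = np.vstack([limited, power, nightly])

# Standardize so no single feature dominates the distance metric,
# then cluster into as many profiles as we simulated groups.
Xs = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
```

On well-separated synthetic groups like these, each simulated group maps almost entirely onto one cluster; on real logging data, choosing the number of profiles (the paper finds five) would require model-selection criteria such as silhouette scores.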

    Applying data science and machine learning for psycho-demographic profiling of internet users

    Get PDF
Master's dissertation in Engenharia Informática (Informatics Engineering). There has always been huge interest in working with public data from online social media users, and with the exponential growth of social media usage, this interest and the research in the area keep increasing. This thesis addresses prediction and classification tasks on online social network data. The goal is to predict psycho-demographic traits - personality and demographics - by performing text emotion analysis on social networks such as Twitter and Facebook. Our main motivation was to raise awareness of what can be done with users' social media information and usual behaviours on the web: from text analysis alone we can trace their personality, learn their tastes, see how they behave, and so on. We also wanted to promote the study of the emotion-text relation in social networks, as it has only recently begun to be studied and there is a wealth of data and information available for it. To perform these tasks, we carried out an extensive review of the literature to define the state of the art of the project and to identify working strategies. Almost all past research based its results on vast samples of users and data; but because some frameworks and APIs have been shut down in recent years - such as Facebook's MyPersonality - and others are paid, we were left with a small sample of user data to analyse, which may harm the results. We started by gathering data from Twitter and Facebook with the users' consent. On Twitter we focused on tweets and retweets; on Facebook we focused on everything the user typed, using the DataSelfie plugin, which stores all that data on a server from which it can be retrieved later.
Our next step was to find emotions in the text data with the help of a lexicon that categorizes words by eight different emotions; two of these were set aside because we focused only on the six major emotions (this is explained later). We removed stopwords, applied stemming to all of the text, and matched every word in our data against every word in the lexicon. After this, we asked our participants to fill in a "Big-Five" personality questionnaire and to provide their age, so we added the Big-Five traits and age to each user's individual dataset. With the final versions ready, we applied machine-learning algorithms to find correlations between emotions and personality or demographic attributes, focusing on the practical and methodological aspects of the user-attribute prediction task. We used the techniques and algorithms we judged the best fit for our data and our goal. We gathered the data into two datasets for testing: the "Mixed Language Dataset", which contains all text entries from each user, and the "User Dataset", which contains one entry per user, produced by aggregating every text entry per user for a more general view of each one. For the first dataset, we achieved the best results with decision-tree algorithms, ranging from 58% on the agreeableness trait to 68% on the neuroticism trait. This dataset had a problem with how the data was distributed, so it was impossible to predict age and gender efficiently. For the latter dataset, all of the classifiers achieved a good classification percentage on the demographic characteristics, from 73% with K-nearest neighbours to 95% with Naive Bayes. The most solid classifier for personality traits used the CART decision-tree algorithm, ranging from 50% on the openness trait to 76% on agreeableness.
Some classifiers performed poorly, others were unremarkable, and some stood out, as noted above. Our small sample was a limitation: it was not consistent or solid in terms of data value, which may have affected our results; we believe the results would improve considerably if the same methods were applied to a much larger sample. In conclusion, we demonstrate how personality and demographic traits - the Big-Five traits, age, and gender - can be predicted by studying emotions in text. As stated above, we hope this thesis raises awareness of what can be done with people's online information; we focus only on psycho-demographic profiling, but much more is possible.
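The emotion-extraction pipeline described in the abstract (stopword removal, stemming, and word-matching against an emotion lexicon) can be sketched roughly as below. The lexicon, stopword list, and suffix-stripping stemmer here are toy stand-ins for the real resources a thesis like this would use (e.g. an NRC-style word-emotion lexicon and a Porter stemmer):

```python
import re
from collections import Counter

# Toy emotion lexicon keyed by word stems; a real one is far larger.
# Only the six major emotions are kept, as the thesis describes.
LEXICON = {
    "happi": "joy", "love": "joy", "fear": "fear", "afraid": "fear",
    "angri": "anger", "hate": "anger", "sad": "sadness", "cry": "sadness",
    "gross": "disgust", "shock": "surprise",
}
STOPWORDS = {"i", "am", "so", "the", "a", "and", "of", "to", "it"}

def stem(word: str) -> str:
    """Crude suffix-stripping stemmer (a real pipeline would use Porter)."""
    for suffix in ("ness", "ing", "ed", "ly", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + ("i" if suffix == "y" else "")
    return word

def emotion_counts(text: str) -> Counter:
    """Tokenize, drop stopwords, stem, and match stems against the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stems = [stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(LEXICON[s] for s in stems if s in LEXICON)
```

Per-user emotion counts produced this way, joined with the Big-Five questionnaire scores and age, would form the feature table on which the decision-tree and Naive Bayes classifiers are then trained.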

    The Profiling Potential of Computer Vision and the Challenge of Computational Empiricism

    Full text link
Computer vision and other biometrics data science applications have commenced a new project of profiling people. Rather than using 'transaction generated information', these systems measure the 'real world' and produce an assessment of the 'world state' - in this case an assessment of some individual trait. Instead of using proxies or scores to evaluate people, they increasingly deploy a logic of revealing the truth about reality and the people within it. While these profiling knowledge claims are sometimes tentative, they increasingly suggest that only through computation can these excesses of reality be captured and understood. This article explores the bases of those claims in the systems of measurement, representation, and classification deployed in computer vision. It asks if there is something new in this type of knowledge claim, sketches an account of a new form of computational empiricism being operationalised, and questions what kind of human subject is being constructed by these technological systems and practices. Finally, the article explores legal mechanisms for contesting the emergence of computational empiricism as the dominant knowledge platform for understanding the world and the people within it.

    Supporting lay users in privacy decisions when sharing sensitive data

    Get PDF
Especially after the recent privacy scandals in social networks, privacy is becoming ever more important to users. Although most users claim to value their privacy, they behave quite differently online: they leave most of the privacy settings of the services they use, such as social networks or location-sharing services, untouched and do not adapt them to their privacy requirements. This thesis presents an approach to this problem that rests on two pillars. The first part of the thesis focuses on assisting users in choosing their privacy settings, using machine learning to derive the optimal set of privacy settings for each user. In contrast to other work, our approach uses context factors as well as individual factors to provide a personalized set of privacy settings. The second part consists of a set of intelligent user interfaces that assist users throughout the complete privacy journey: from defining friend groups that allow targeted information sharing; through user interfaces for selecting information recipients and for finding and refining possible errors or unusual settings; up to mechanisms for gathering in-situ feedback on privacy incidents and investigating how to use this feedback to improve a user's privacy in the future. Our studies have shown that tailoring the privacy settings to the individual significantly increases the correctness of the predicted privacy settings, whereas the user interfaces significantly decrease the number of unwanted disclosures.
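The abstract does not name the learning algorithm used to derive privacy settings from context factors and individual factors. As a rough illustration only, a standard classifier can be trained on such features; the feature names, the synthetic sharing rule, and the choice of a random forest below are all assumptions, not the thesis's method:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600

# Hypothetical features: context (public place?, audience size) and an
# individual factor (privacy-concern score in [0, 1]). Labels follow a
# made-up rule: concerned users only share privately to small audiences.
is_public = rng.integers(0, 2, n)
audience = rng.integers(1, 100, n)
concern = rng.random(n)
share = ((concern < 0.5) | ((is_public == 0) & (audience < 20))).astype(int)

X = np.column_stack([is_public, audience, concern])
X_tr, X_te, y_tr, y_te = train_test_split(X, share, test_size=0.25,
                                          random_state=0)

# Fit the classifier on context + individual factors and measure how well
# it recovers the sharing decision on held-out cases.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The point of combining both feature families, as the thesis argues, is that neither the situation alone nor the person alone determines the right setting; the classifier above can only recover the rule because it sees both.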

    Concepts and experiments on psychoanalysis driven computing

    Get PDF
This research investigates the effective incorporation of the human factor and user perception in text-based interactive media. In such contexts, the reliability of user texts is often compromised by behavioural and emotional dimensions. To this end, several attempts have been made in the state of the art to introduce psychological approaches in such systems, including computational psycholinguistics, personality traits, and cognitive psychology methods. In contrast, our method is fundamentally different, since we employ a psychoanalysis-based approach; in particular, we use the notion of Lacanian discourse types to capture and deeply understand real (possibly elusive) characteristics, qualities, and contents of texts, and to evaluate their reliability. As far as we know, this is the first time computational methods have been systematically combined with psychoanalysis. We believe such a psychoanalytic framework is fundamentally more effective than standard methods, since it addresses deeper, quite primitive elements of human personality, behaviour, and expression which usually escape methods functioning at "higher", conscious layers. In fact, this research is a first attempt to form a new paradigm of psychoanalysis-driven interactive technologies, with broader impact and diverse applications. To exemplify this generic approach, we apply it to the case study of fake news detection; we first demonstrate certain limitations of the well-known Myers–Briggs Type Indicator (MBTI) personality type method, and then propose and evaluate our new method of analysing user texts and detecting fake news based on the Lacanian discourses psychoanalytic approach. This publication is part of the Spanish I+D+i project TRAINERA (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033. Peer reviewed. Postprint (published version).

EEG Data Analysis and Development of Data Partitions for Machine Learning Algorithms

    Get PDF
The electronic version of the dissertation does not contain the publications. This thesis presents a novel, more efficient data-handling method for machine learning. In classical statistics, models are simple enough that, together with some assumptions about the data, it is possible to say whether a given result is statistically significant - that is, whether the data contains any signal distinguishable from noise. Machine learning algorithms such as deep neural networks, on the other hand, can have hundreds of millions of model weights. Such models can always explain the data with 100% accuracy, regardless of whether a signal exists; in machine learning terms, this is overfitting, and it changes the rules of the game. The issue is solved by evaluating models on a separate test set: some data points are set aside and not used in the model-fitting phase, and once the best model has been found on the remaining data, its quality is evaluated on the held-out set. This method works well, but it has the problem that machine learning algorithms need a great deal of data, and everything set aside is wasted for training. Researchers have long sought ways to mitigate this, and several methods are in use, but all have drawbacks. Cross-validation, for example, uses all the data very efficiently but makes it impossible to interpret the model parameters; holding data out preserves that information but makes the model itself less effective. In this thesis, we invented a novel approach to data partitioning that we term "cross-validation and cross-testing". First, a test set is put aside and cross-validation on the remaining data is used to determine and lock the model parameters. Then the model is tested on the held-out set in a novel way: testing proceeds in several cycles, and in each cycle the part of the test set not currently being evaluated is reused for model training. All the data is thus used, while the parameters remain interpretable, so that in the end we know whether a linear or an exponential model won, or a three-layer or four-layer network. In the natural sciences, with complex data, this is often exactly what is needed: to be able to state at the end of a paper which model was best, while the individual model weights are rarely of interest. This gives us an improved system for using machine learning algorithms in cases where we need to interpret the model parameters but not the model weights. For example, it makes it possible to state that the data has a linear rather than a quadratic relationship, or that the best neural network has 5 hidden layers.
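The two-step scheme described above can be sketched as follows; the dataset, the logistic-regression model, and its regularisation strength standing in for "the locked model parameters" are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5,
                                                random_state=0)

# Step 1: cross-validation on the development part determines and locks
# the interpretable model parameter (here, regularisation strength C).
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5).fit(X_dev, y_dev)
best_C = search.best_params_["C"]

# Step 2: cross-testing - each test fold is scored by a model retrained on
# the development data plus the *other* test folds, so no data is wasted,
# yet no point is ever used to score itself.
scores = []
for train_idx, eval_idx in KFold(n_splits=5).split(X_test):
    model = LogisticRegression(C=best_C, max_iter=1000)
    model.fit(np.vstack([X_dev, X_test[train_idx]]),
              np.concatenate([y_dev, y_test[train_idx]]))
    scores.append(model.score(X_test[eval_idx], y_test[eval_idx]))

mean_score = np.mean(scores)
```

The locked choice (`best_C`) remains reportable at the end, which is the interpretability the thesis is after; only the fitted weights differ between testing cycles.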