Psychographic Traits Identification based on political ideology: An author analysis study on Spanish politicians' tweets posted in 2020
In general, people are more reluctant to follow advice and directions from politicians who do not share their ideology. In extreme cases, people can be heavily biased in favour of a political party while sharply disagreeing with others, which may lead to irrational decision making and can put people’s lives at risk when recommendations from the authorities are ignored. Therefore, considering political ideology as a psychographic trait can improve political micro-targeting by helping public authorities and local governments adopt better communication policies during crises. In this work, we explore the reliability of determining psychographic traits concerning political ideology. Our contribution is twofold. On the one hand, we release PoliCorpus-2020, a dataset composed of Spanish politicians’ tweets posted in 2020. On the other hand, we conduct two authorship analysis tasks with this dataset: an author profiling task to extract demographic and psychographic traits, and an authorship attribution task to determine the author of an anonymous text in the political domain. Both experiments are evaluated with several neural network architectures grounded on explainable linguistic features, statistical features, and state-of-the-art transformers. In addition, we test whether the neural network models can be transferred to detect the political ideology of ordinary citizens. Our results indicate that the linguistic features are good indicators for identifying fine-grained political affiliation, that they boost the performance of neural network models when combined with embedding-based features, and that they preserve relevant information when the models are tested on ordinary citizens. We also found that lexical and morphosyntactic features are more effective for author profiling, whereas stylometric features are more effective for authorship attribution.
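As an illustration of the kind of explainable linguistic features described above, the sketch below computes a few simple stylometric statistics from a text. The feature names and the example sentence are hypothetical, and the actual feature set used with PoliCorpus-2020 is far richer; this is only a minimal sketch of the general idea.

```python
import re
import string

def stylometric_features(text):
    """Compute a few simple, interpretable stylometric features.

    Illustrative stand-ins for the lexical/stylometric features
    discussed above; not the feature set of the original study.
    """
    tokens = re.findall(r"\w+", text.lower())
    n_tokens = len(tokens)
    n_types = len(set(tokens))            # distinct word forms
    n_chars = sum(len(t) for t in tokens)
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    return {
        "token_count": n_tokens,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "avg_word_length": n_chars / n_tokens if n_tokens else 0.0,
        "punctuation_rate": n_punct / len(text) if text else 0.0,
    }

feats = stylometric_features("Vote for us! Lower taxes, better schools.")
```

Feature vectors like this can then be concatenated with embedding-based features before being fed to a classifier, which is the combination the abstract reports as most effective.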
Explainable deep learning models for biological sequence classification
Biological sequences - DNA, RNA and proteins - orchestrate the behavior of all living cells, and understanding the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the use of machine learning in the field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs), has been shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex, which makes model application and interpretation hard; yet the ability to interpret which features a model has learned from the data is crucial to understand and to explain new biological mechanisms.
This work therefore presents pysster, our open-source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We implement and evaluate different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond pure sequences to improve biological feature interpretability. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by the models are then visualized as sequence and structure motifs, together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs, we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates.
Finally, we present a larger biological application by predicting RNA-binding of proteins for transcripts for which experimental protein-RNA interaction data are not yet available. Here, the comprehensive interpretation options of CNNs made us aware of a potential technical bias in the experimental eCLIP data (enhanced crosslinking and immunoprecipitation) that were used as a basis for the models. This allowed for subsequent tuning of the models and data to obtain more meaningful predictions in practice.
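CNNs of the kind trained with pysster consume one-hot encoded sequences. The sketch below shows that standard input encoding in plain Python; it is illustrative only (pysster performs this step internally, and this helper is not part of its API).

```python
ALPHABET = "ACGT"  # DNA alphabet; RNA or structure strings would use other alphabets

def one_hot(seq):
    """One-hot encode a DNA sequence into a length x 4 matrix,
    the standard input representation for sequence CNNs."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    mat = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        if base in idx:          # unknown bases (e.g. N) stay all-zero
            row[idx[base]] = 1.0
        mat.append(row)
    return mat

x = one_hot("ACGTN")
```

Additional data beyond the pure sequence, such as RNA secondary structure annotations, can be represented the same way and appended as extra input channels, which is the integration strategy the abstract describes.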
A machine learning personalization flow
This thesis describes a machine learning-based personalization flow for streaming platforms: we match users and content such as video or music, and monitor the results. We find that there are still many open questions in personalization, and especially in recommendation. When recommending an item to a user, how do we use unobservable data, e.g., intent, and user and content metadata as input? Can we optimize directly for non-differentiable metrics? What about diversity in recommendations? To answer these questions, this thesis proposes data, experimental designs, loss functions, and metrics. In the future, we hope these concepts are brought closer together via end-to-end solutions, where personalization models are directly optimized for the desired metric.
Learning person-specific face representations
Advisors: Alexandre Xavier Falcão, Anderson de Rezende Rocha. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Humans are natural face recognition experts, far outperforming current automated face recognition algorithms, especially in naturalistic, “in-the-wild” settings. However, a striking feature of human face recognition is that we are dramatically better at recognizing highly familiar faces, presumably because we can leverage large amounts of past experience with the appearance of an individual to aid future recognition. Researchers in psychology have even suggested that face representations might be partially tailored or optimized for familiar faces. Meanwhile, the analogous situation in automated face recognition, where a large number of training examples of an individual are available, has been largely underexplored, in spite of the increasing relevance of this setting in the age of social media. Inspired by these observations, we propose to explicitly learn enhanced face representations on a per-individual basis, and we present a collection of methods enabling this approach and progressively justifying our claim. By learning and operating within person-specific representations of faces, we are able to consistently improve performance in both the constrained and the unconstrained face recognition scenarios. In particular, we achieve state-of-the-art performance on the challenging PubFig83 familiar face recognition benchmark. We suggest that such person-specific representations introduce an intermediate form of regularization to the problem, allowing the classifiers to generalize better through the use of fewer, but more relevant, face features.
Text Classification: A Review, Empirical, and Experimental Evaluation
The explosive and widespread growth of data necessitates the use of text
classification to extract crucial information from vast amounts of data.
Consequently, there has been a surge of research in both classical and deep
learning text classification methods. Despite the numerous methods proposed in
the literature, there is still a pressing need for a comprehensive and
up-to-date survey. Existing survey papers categorize algorithms for text
classification into broad classes, which can lead to the misclassification of
unrelated algorithms and incorrect assessments of their qualities and behaviors
using the same metrics. To address these limitations, our paper introduces a
novel methodological taxonomy that classifies algorithms hierarchically into
fine-grained classes and specific techniques. The taxonomy includes methodology
categories, methodology techniques, and methodology sub-techniques. Our study
is the first survey to utilize this methodological taxonomy for classifying
algorithms for text classification. Furthermore, our study also conducts
empirical evaluation and experimental comparisons and rankings of different
algorithms that employ the same specific sub-technique, different
sub-techniques within the same technique, different techniques within the same
category, and different categories.
Recommendation Systems: An Insight Into Current Development and Future Research Challenges
Research on recommendation systems is swiftly producing an abundance of novel methods, constantly challenging the current state of the art. Inspired by advancements in many related fields, like Natural Language Processing and Computer Vision, many hybrid approaches based on deep learning are being proposed, making solid improvements over traditional methods. On the downside, this flurry of research activity, often focused on improving over a small number of baselines, makes it hard to identify reference methods and standardized evaluation protocols. Furthermore, the traditional categorization of recommendation systems into content-based, collaborative filtering, and hybrid systems lacks the informativeness it once had. With this work, we provide a gentle introduction to recommendation systems, describing the task they are designed to solve and the challenges faced in research. Building on previous work, an extension to the standard taxonomy is presented, to better reflect the latest research trends, including the diverse use of content and temporal information. To ease the approach toward the technical methodologies recently proposed in this field, we review several representative methods selected primarily from top conferences and systematically describe their goals and novelty. We formalize the main evaluation metrics adopted by researchers and identify the most commonly used benchmarks. Lastly, we discuss issues in current research practices by analyzing experimental results reported on three popular datasets.
Development of Machine Learning Models for Generation and Activity Prediction of Protein Tyrosine Kinase Inhibitors
The field of computational drug discovery and development continues to grow at a rapid pace, using generative machine learning approaches to tackle high-dimensional and complex problems in drug discovery and design. In this work, we present a platform of Machine Learning-based approaches for the generation and scoring of novel kinase inhibitor molecules. We utilized a binary Random Forest classification model to develop a Machine Learning-based scoring function that evaluates generated molecules on Kinase Inhibition Likelihood. By training the model on several chemical features of each known kinase inhibitor, we were able to create a metric that captures the differences between an SRC Kinase Inhibitor and a non-SRC Kinase Inhibitor. We implemented the scoring function in both a Biased and an Unbiased Bayesian Optimization framework to generate molecules based on features of SRC Kinase Inhibitors. We then used similarity metrics such as Tanimoto Similarity to assess their closeness to known SRC Kinase Inhibitors. The molecules generated in this experiment showed potential to belong to the SRC Kinase Inhibitor family, though chemical synthesis would be needed to confirm the results. The top molecules generated from the Unbiased and Biased Bayesian Optimization experiments had Tanimoto Similarity scores of 0.711 and 0.709, respectively, to known SRC Kinase Inhibitors. With calculated Kinase Inhibition Likelihood scores of 0.586 and 0.575, the top molecules generated from the Bayesian Optimization demonstrate a disconnect between similarity to known SRC Kinase Inhibitors and the calculated Kinase Inhibition Likelihood score. We found that introducing a bias into the Bayesian Optimization process had little effect on the quality of the generated molecules.
In addition, several molecules generated from the Bayesian Optimization process were sent to the School of Pharmacy for chemical synthesis, which gives the experiment more concrete results. The results of this study demonstrate that generating molecules through Bayesian Optimization techniques can aid in the generation of molecules for a specific kinase family, but further extensions of the techniques would be needed for substantial results.
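The Tanimoto Similarity used above to compare generated molecules with known SRC Kinase Inhibitors is the Jaccard index over fingerprint bits. A minimal sketch over hypothetical bit-position sets follows; real workflows typically compute it on molecular fingerprints (e.g. via RDKit), and the bit positions here are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    given as sets of 'on' bit positions: |A & B| / |A | B|."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(a | b)

# Toy fingerprints (hypothetical bit positions, not real molecules):
sim = tanimoto({1, 4, 7, 9}, {1, 4, 8, 9})
```

A score of 1.0 means identical fingerprints; values near 0.7, as reported above, indicate substantial but not complete bit overlap with the reference inhibitors.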
New scalable machine learning methods: beyond classification and regression
Programa Oficial de Doutoramento en Computación (5009V01)
[Abstract]
The recent surge in available data has spawned a new and promising age of machine learning. Success cases of machine learning are arriving at an increasing rate, as some algorithms are able to leverage immense amounts of data to produce remarkably accurate predictions. Still, many algorithms in the toolbox of the machine learning practitioner have been rendered useless in this new scenario due to the complications associated with large-scale learning. Handling large datasets entails logistical problems, limits the computational and spatial complexity of the algorithms used, favours methods with few or no hyperparameters to be configured, and exhibits specific characteristics that complicate learning. This thesis is centered on the scalability of machine learning algorithms, that is, their capacity to maintain their effectiveness as the scale of the data grows, and how it can be improved. We focus on problems for which the existing solutions struggle when the scale grows. Therefore, we skip classification and regression problems and focus on feature selection, anomaly detection, graph construction and explainable machine learning. We analyze four different strategies to obtain scalable algorithms. First, we explore distributed computation, which is used in all of the presented algorithms. Besides this technique, we also examine the use of approximate models to speed up computations, the design of new models that take advantage of a characteristic of the input data to simplify training, and the enhancement of simple models to enable them to manage large-scale learning. We have implemented four new algorithms and six versions of existing ones that tackle the mentioned problems, and for each one we report experimental results that show both their validity in comparison with competing methods and their capacity to scale to large datasets. All the presented algorithms have been made available for download and are being published in journals to enable
practitioners and researchers to use them.
An interpretable machine learning approach to multimodal stress detection in a simulated office environment
Background and objective:
Work-related stress affects a large part of today’s workforce and is known to have detrimental effects on physical and mental health. Continuous and unobtrusive stress detection may help prevent and reduce stress by providing personalised feedback and allowing for the development of just-in-time adaptive health interventions for stress management. Previous studies on stress detection in work environments have often struggled to adequately reflect real-world conditions in controlled laboratory experiments. To close this gap, in this paper, we present a machine learning methodology for stress detection based on multimodal data collected from unobtrusive sources in an experiment simulating a realistic group office environment (N=90).
Methods:
We derive mouse, keyboard and heart rate variability features to detect three levels of perceived stress, valence and arousal with support vector machines, random forests and gradient boosting models using 10-fold cross-validation. We interpret the contributions of features to the model predictions with SHapley Additive exPlanations (SHAP) value plots.
Results:
The gradient boosting models based on mouse and keyboard features obtained the highest average F1 scores of 0.625, 0.631 and 0.775 for the multiclass prediction of perceived stress, arousal and valence, respectively. Our results indicate that the combination of mouse and keyboard features may be better suited to detect stress in office environments than heart rate variability, despite physiological signal-based stress detection being more established in theory and research. The analysis of SHAP value plots shows that specific mouse movement and typing behaviours may characterise different levels of stress.
Conclusions:
Our study fills several methodological gaps in the research on the automated detection of stress in office environments, such as approximating real-life conditions in a laboratory and combining physiological and behavioural data sources. Implications for field studies on personalised, interpretable ML-based systems for the real-time detection of stress in real office environments are also discussed.
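The F1 scores reported in the Results section are averaged over the three predicted classes. Assuming macro averaging (the abstract does not state the averaging convention), the metric can be sketched in plain Python; the toy labels below are hypothetical, not data from the study.

```python
def f1_per_class(y_true, y_pred, label):
    """One-vs-rest F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

# Toy 3-class example mirroring the low/mid/high stress levels:
y_true = ["low", "low", "mid", "high"]
y_pred = ["low", "mid", "mid", "high"]
score = macro_f1(y_true, y_pred)
```

In practice one would use a library implementation (e.g. scikit-learn's `f1_score` with `average="macro"`); the sketch only makes the averaging explicit.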