Data-driven unsupervised clustering of online learner behaviour
The widespread adoption of online courses opens opportunities for analysing learner behaviour and optimising web-based learning adapted to observed usage. Here we introduce a mathematical framework for the analysis of time series of online learner engagement, which allows the identification of clusters of learners with similar online temporal behaviour directly from the raw data without prescribing a priori subjective reference behaviours. The method uses a dynamic time warping kernel to create a pairwise similarity between time series of learner actions, and combines it with an unsupervised multiscale graph clustering algorithm to identify groups of learners with similar temporal behaviour. To showcase our approach, we analyse task completion data from a cohort of learners taking an online post-graduate degree at Imperial Business School. Our analysis reveals clusters of learners with statistically distinct patterns of engagement, from distributed to massed learning, with different levels of regularity, adherence to the pre-planned course structure, and task completion. The approach also reveals outlier learners with highly sporadic behaviour. A posteriori comparison against student performance shows that, whereas high-performing learners are spread across clusters with diverse temporal engagement, low performers are significantly concentrated in the massed learning cluster, and our unsupervised clustering identifies low performers more accurately than common machine learning classification methods trained on temporal statistics of the data. Finally, we test the applicability of the method by analysing two additional datasets: a different cohort of the same course, and time series in a different format from another university.
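The pipeline described above (a dynamic time warping kernel producing pairwise similarities, which then feed a clustering step) can be sketched in miniature. This is only an illustration, not the authors' implementation: the toy task-completion series and the Gaussian-kernel bandwidth `sigma` are hypothetical, and the multiscale graph clustering is omitted.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping with
    absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def similarity_matrix(series, sigma=5.0):
    """Gaussian kernel on DTW distances -> pairwise similarity in (0, 1]."""
    k = len(series)
    s = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            dist = dtw_distance(series[i], series[j])
            s[i][j] = math.exp(-dist ** 2 / (2 * sigma ** 2))
    return s

# Three toy weekly task-completion series: two "distributed" learners
# and one "massed" learner who crams at the end of the course.
series = [
    [1, 1, 1, 1, 1, 1],   # distributed, regular
    [1, 2, 1, 1, 2, 1],   # distributed, slightly irregular
    [0, 0, 0, 0, 3, 6],   # massed
]
S = similarity_matrix(series)
```

On these toy series the two distributed learners come out markedly more similar to each other than either is to the massed learner, which is the structure a downstream clustering step would pick up.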
A Survey on Explainable Anomaly Detection
In the past two decades, most research on anomaly detection has focused on improving the accuracy of the detection, while largely ignoring the explainability of the corresponding methods and thus leaving the explanation of outcomes to practitioners. As anomaly detection algorithms are increasingly used in safety-critical domains, providing explanations for the high-stakes decisions made in those domains has become an ethical and regulatory requirement. Therefore, this work provides a comprehensive and structured survey on state-of-the-art explainable anomaly detection techniques. We propose a taxonomy based on the main aspects that characterize each explainable anomaly detection technique, aiming to help practitioners and researchers find the explainable anomaly detection method that best suits their needs.
Comment: Paper accepted by the ACM Transactions on Knowledge Discovery from Data (TKDD) for publication (preprint version).
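One common explainable-anomaly-detection idea the survey's taxonomy covers is feature attribution: not just flagging a point as anomalous, but reporting which feature drove the decision. A minimal sketch on a simple z-score detector follows; the feature names, data, and threshold are hypothetical and not taken from the survey.

```python
import statistics

def explain_anomalies(rows, feature_names, z_threshold=3.0):
    """Flag rows whose largest per-feature |z-score| exceeds the threshold,
    and report which feature drove the decision (a simple attribution)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    reports = []
    for r, row in enumerate(rows):
        zs = [abs(v - m) / s for v, m, s in zip(row, means, stdevs)]
        worst = max(range(len(zs)), key=lambda i: zs[i])
        if zs[worst] > z_threshold:
            reports.append((r, feature_names[worst], round(zs[worst], 2)))
    return reports

# Ten ordinary sessions and one with an extreme duration.
rows = [[1.0, 1.0]] * 10 + [[1.0, 50.0]]
reports = explain_anomalies(rows, ["logins", "duration"])
```

Here the detector flags row 10 and attributes the anomaly to the hypothetical "duration" feature, which is the kind of explanation that turns an opaque score into an actionable report.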
Human-in-the-Loop Learning From Crowdsourcing and Social Media
Computational social studies using public social media data have become more and more popular because of the large amount of user-generated data available. The richness of social media data, coupled with its noise and subjectivity, raises significant challenges for computationally studying social issues in a feasible and scalable manner. Machine learning problems are, as a result, often subjective or ambiguous when humans are involved. That is, humans solving the same problems might come to legitimate but completely different conclusions, based on their personal experiences and beliefs. When building supervised learning models, particularly when using crowdsourced training data, multiple annotations per data item are usually reduced to a single label representing ground truth. This inevitably hides a rich source of diversity and subjectivity of opinions about the labels.
Label distribution learning associates with each data item a probability distribution over the labels for that item, and can thus preserve the diversity of opinions and beliefs that conventional learning hides or ignores. We propose a human-in-the-loop learning framework to model and study large volumes of unlabeled subjective social media data with less human effort. We study various annotation tasks given to crowdsourced annotators and methods for aggregating their contributions in a manner that preserves subjectivity and disagreement. We introduce a strategy for learning label distributions with only five-to-ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. We conduct experiments using our learning framework on data related to two subjective social issues (work and employment, and suicide prevention) that touch many people worldwide. Our methods can be applied to a broad variety of problems, particularly social problems. Our experimental results suggest that specific label aggregation methods can help provide reliable representative semantics at the population level.
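The core aggregation idea above can be sketched in a few lines: rather than collapsing each item's handful of crowd labels into one "ground truth" label, pool the labels from several semantically related items and normalize them into a distribution. The label names and items below are hypothetical, and real aggregation methods in the framework are more elaborate.

```python
from collections import Counter

def label_distribution(label_lists):
    """Pool crowd labels from several semantically related items and
    return a probability distribution over the label set."""
    counts = Counter()
    for labels in label_lists:
        counts.update(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Three hypothetical related posts about job loss, each carrying a
# handful of crowdsourced sentiment labels (5-10 per pooled group).
related_items = [
    ["negative", "negative", "neutral"],
    ["negative", "neutral", "neutral"],
    ["negative", "negative", "negative", "neutral"],
]
dist = label_distribution(related_items)
# -> {"negative": 0.6, "neutral": 0.4}
```

A majority-vote reduction of the same data would output only "negative" and discard the 40% minority view; the distribution keeps that disagreement visible.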
Resolving anthropogenic aerosol pollution types - deconvolution and exploratory classification of pollution events
Mass spectrometric measurements commonly yield data on hundreds of variables over thousands of points in time. Refining and synthesizing this raw data into chemical information necessitates the use of advanced, statistics-based data analytical techniques. In the field of analytical aerosol chemistry, statistical, dimensionality-reductive methods have become widespread in the last decade, yet comparably advanced chemometric techniques for data classification and identification remain marginal. Here we present an example of combining data dimensionality reduction (factorization) with exploratory classification (clustering), and show that the results can not only reproduce and corroborate earlier findings, but also complement and broaden our current perspectives on aerosol chemical classification. We find that applying positive matrix factorization to extract spectral characteristics of the organic component of air pollution plumes, together with an unsupervised clustering algorithm, k-means++, for classification, reproduces classical organic aerosol speciation schemes. Applying appropriately chosen metrics for spectral dissimilarity along with optimized data weighting, the source-specific pollution characteristics can be statistically resolved even for spectrally very similar aerosol types, such as different combustion-related anthropogenic aerosol species and atmospheric aerosols with a similar degree of oxidation. In addition to the typical oxidation-level and source-driven aerosol classification, we were also able to classify and characterize outlier groups that would likely be disregarded in a more conventional analysis. Evaluating solution quality for the classification also provides means to assess the performance of mass spectral similarity metrics and optimize weighting for mass spectral variables.
This facilitates algorithm-based evaluation of aerosol spectra, which may prove invaluable for future development of automatic methods for spectra identification and classification. Robust, statistics-based results and data visualizations also provide important clues to a human analyst on the existence and chemical interpretation of data structures. Applying these methods to a test dataset of aerosol mass spectrometric measurements of organic aerosol from a boreal forest site yielded five to seven different recurring pollution types from various sources, including traffic, cooking, biomass burning and nearby sawmills. Additionally, three distinct, minor pollution types were discovered and identified as amine-dominated aerosols.
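The classification step above (clustering spectra under a spectral-dissimilarity metric) can be caricatured with cosine dissimilarity on unit-normalized spectra and a deterministic farthest-point seeding in the spirit of k-means++. The toy spectra, the choice of k, and the metric are illustrative assumptions only, not the paper's tuned weighting scheme.

```python
import math

def normalize(v):
    """Scale a spectrum to unit length (guarding against all-zero input)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine_dissim(a, b):
    """1 - cosine similarity; assumes unit-normalized inputs."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans(spectra, k=2, iters=20):
    data = [normalize(s) for s in spectra]
    # Deterministic farthest-point seeding (greedy, k-means++-like):
    # start from the first spectrum, then repeatedly add the spectrum
    # farthest from its nearest existing centroid.
    centroids = [data[0]]
    while len(centroids) < k:
        far = max(data, key=lambda x: min(cosine_dissim(x, c) for c in centroids))
        centroids.append(far)
    assign = []
    for _ in range(iters):
        # Assign each spectrum to its most similar centroid.
        assign = [min(range(k), key=lambda c: cosine_dissim(x, centroids[c]))
                  for x in data]
        # Recompute centroids as the normalized mean of members.
        for c in range(k):
            members = [x for x, a in zip(data, assign) if a == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return assign

# Four toy 4-channel spectra: two dominated by low-m/z peaks,
# two dominated by high-m/z peaks.
spectra = [[10, 1, 0, 0], [9, 2, 0, 0], [0, 0, 8, 3], [0, 1, 9, 2]]
labels = kmeans(spectra, k=2)
```

On these toy spectra the two low-m/z-dominated spectra end up in one cluster and the two high-m/z-dominated spectra in the other, regardless of which cluster gets which index.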