51 research outputs found
Efficient Failure Pattern Identification of Predictive Algorithms
Given a (machine learning) classifier and a collection of unlabeled data, how can we efficiently identify the misclassification patterns present in this dataset? To address this problem, we propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The recommendation algorithm is conceptualized as a stochastic sampler that, in each round, queries the annotators with a subset of samples for their true labels and receives feedback on whether those samples are misclassified. The sampling mechanism must balance discovering new patterns of misclassification (exploration) against confirming potential patterns of misclassification (exploitation). We construct a determinantal point process whose intensity balances this exploration-exploitation trade-off through a weighted update of the posterior at each round, and use it as the generator of the stochastic sampler. Numerical results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.
Comment: 19 pages, Accepted for UAI202
Discovering structure without labels
The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction.
The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine both the number of clusters and the assignment of points to clusters. Like Mean Shift, it iteratively moves points towards crisp clusters, but its loss function also ties it closely to the Multicut problem. As a result, it connects signed graph partitioning to clustering in Euclidean space.
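For orientation, the Mean Shift baseline that Mod Shift departs from fits in a few lines. This sketch uses a Gaussian kernel (a common but here assumed choice) and purely attractive updates; Mod Shift's contribution is precisely to replace the purely attractive rule with distance-based attraction and repulsion:

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=50):
    """Classic Mean Shift: each point repeatedly moves to the
    kernel-weighted mean of the data, converging to a density mode.
    Points that reach the same mode form one cluster."""
    x = points.copy()
    for _ in range(iters):
        # squared distances from current positions to all data points
        d2 = ((x[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * bandwidth**2))          # Gaussian kernel weights
        x = (w @ points) / w.sum(axis=1, keepdims=True)
    return x
```

After convergence, the number of distinct modes gives the number of clusters; unlike Mod Shift, nothing in this update pushes points belonging to different clusters apart.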
The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success.
Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between two contrastive loss functions: negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots because of increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global structure to more discrete, local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both with few noise samples.
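The negative-sampling loss at the heart of this relation can be sketched directly. The update below uses the common low-dimensional similarity q = 1/(1 + ||y_i - y_j||^2); the learning rate, the number of negatives, and the small epsilon in the repulsive gradient are illustrative choices, not values from the thesis:

```python
import numpy as np

def neg_sampling_step(emb, edges, lr=0.1, n_neg=5, rng=None):
    """One pass of a negative-sampling update on embedding `emb`:
    attract each high-dimensional neighbor pair (i, j), repel a few
    uniformly sampled non-neighbors. Similarity: q = 1/(1 + d^2)."""
    rng = rng or np.random.default_rng(0)
    n = len(emb)
    for i, j in edges:
        d = emb[i] - emb[j]
        q = 1.0 / (1.0 + d @ d)
        emb[i] -= lr * 2 * q * d               # gradient of -log q (attraction)
        emb[j] += lr * 2 * q * d
        for _ in range(n_neg):
            k = int(rng.integers(n))
            if k in (i, j):
                continue
            d = emb[i] - emb[k]
            dd = d @ d + 1e-3                  # epsilon avoids blow-up at d = 0
            q = 1.0 / (1.0 + dd)
            emb[i] -= lr * 2 * d * (q - 1.0 / dd)  # gradient of -log(1 - q) (repulsion)
    return emb
```

Scaling the attractive term up relative to the repulsive one moves the result toward the compact, UMAP-like end of the spectrum described above; scaling repulsion up moves it toward t-SNE-like behavior.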
Learning Representations for Novelty and Anomaly Detection
The problem of novelty or anomaly detection refers to the ability to automatically identify data samples that differ from a notion of normality. Techniques that address this problem are necessary in many applications, such as medical diagnosis, autonomous driving, fraud detection, and cyber-attack detection, to mention just a few. The problem is inherently challenging because of the openness of the space of distributions that characterize novelty or outlier data points. This is often compounded by the inability to adequately represent such distributions due to the lack of representative data.
In this dissertation we address the challenge above by making several contributions. (a) We introduce an unsupervised framework for novelty detection, which is based on deep learning techniques and does not require labeled data representing the distribution of outliers. (b) The framework is general and based on first principles: it detects anomalies by computing their probabilities according to the distribution representing normality. (c) The framework can handle high-dimensional data such as images by performing a non-linear dimensionality reduction of the input space into an isometric lower-dimensional space, leading to a computationally efficient method. (d) The framework is guarded against the potential inclusion of outlier distributions into the distribution of normality by encouraging the model to represent only inlier data well. (e) The methods are evaluated extensively on multiple computer vision benchmark datasets, where they are shown to compare favorably with the state of the art.
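Contribution (b), scoring anomalies by their probability under the distribution of normality, can be illustrated with a deliberately simple stand-in: fit a single Gaussian to the latent codes of inlier data and use the negative log-likelihood as the anomaly score. The dissertation learns both the representation and the density; the single Gaussian here is an assumption made for brevity:

```python
import numpy as np

def fit_normality(latents):
    """Fit a multivariate Gaussian to latent codes of inlier data
    (stand-in for a learned distribution of normality)."""
    mu = latents.mean(axis=0)
    cov = np.cov(latents, rowvar=False) + 1e-6 * np.eye(latents.shape[1])
    return mu, cov

def anomaly_score(x, mu, cov):
    """Negative Gaussian log-likelihood: higher score = more anomalous."""
    d = x - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d @ inv @ d + logdet + len(mu) * np.log(2 * np.pi))
```

Thresholding this score yields a detector; in the framework above, the latent codes would come from the learned isometric dimensionality reduction rather than being given directly.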
WiFi-Based Human Activity Recognition Using Attention-Based BiLSTM
Recently, significant efforts have been made to explore human activity recognition (HAR) techniques that use information gathered by existing indoor wireless infrastructures through WiFi signals, without requiring the monitored subject to carry a dedicated device. The key intuition is that different activities introduce different multipath effects in WiFi signals and generate different patterns in the time series of channel state information (CSI). In this paper, we propose and evaluate a full pipeline for a CSI-based human activity recognition framework covering 12 activities in three different spatial environments, using two deep learning models: ABiLSTM and CNN-ABiLSTM. Evaluation experiments have demonstrated that the proposed models outperform state-of-the-art models. The experiments also show that the proposed models can be applied to other environments with different configurations, albeit with some caveats. The proposed ABiLSTM model achieves an overall accuracy of 94.03%, 91.96%, and 92.59% across the three target environments, while the proposed CNN-ABiLSTM model reaches an accuracy of 98.54%, 94.25%, and 95.09% across those same environments.
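The attention component of such a model can be sketched independently of the recurrent part. The additive-attention pooling below over per-timestep BiLSTM outputs is a common construction and is assumed here, not taken from the paper; all parameter names are illustrative:

```python
import numpy as np

def attention_pool(hidden, w, b, u):
    """Additive attention pooling over per-timestep hidden states
    (the 'A' in an attention-based BiLSTM): score each timestep,
    softmax the scores, and return the weighted sum.

    hidden: (T, H) BiLSTM outputs for T timesteps
    w: (H, H), b: (H,), u: (H,) learned attention parameters
    """
    scores = np.tanh(hidden @ w + b) @ u           # (T,) per-timestep scores
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights
    return alpha @ hidden                          # (H,) context vector
```

The resulting context vector replaces naive last-timestep or mean pooling, letting the classifier weight the CSI timesteps most indicative of an activity; a dense softmax layer on top would produce the 12-class prediction.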
Correlating sparse sensing for large-scale traffic speed estimation: A Laplacian-enhanced low-rank tensor kriging approach
Traffic speed is central to characterizing the fluidity of the road network. Many transportation applications rely on it, such as real-time navigation, dynamic route planning, and congestion management. Rapid advances in sensing and communication techniques make traffic speed detection easier than ever. However, due to the sparse deployment of static sensors or the low penetration of mobile sensors, the detected speeds are incomplete and far from network-wide use. In addition, sensors are prone to errors and missing data for various reasons, so the speeds they report can be highly noisy. These drawbacks call for effective techniques to recover credible estimates from the incomplete data. In this work, we first cast the issue as a spatiotemporal kriging problem and propose a Laplacian enhanced low-rank tensor completion (LETC) framework featuring both low-rankness and multi-dimensional correlations for large-scale traffic speed kriging under limited observations. Specifically, three types of speed correlation (temporal continuity, temporal periodicity, and spatial proximity) are carefully chosen and simultaneously modeled by three different forms of graph Laplacian: the temporal graph Fourier transform, generalized temporal consistency regularization, and diffusion graph regularization. We then design an efficient solution algorithm via several effective numeric techniques to scale the proposed model up to network-wide kriging. Through experiments on two public million-level traffic speed datasets, we find that the proposed LETC achieves state-of-the-art kriging performance even under low observation rates, while more than halving computing time compared with baseline methods. Some insights into spatiotemporal traffic data modeling and kriging at the network level are provided as well.
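The role of a graph Laplacian regularizer is easiest to see in a scalar toy version: given noisy or partially observed segment speeds y on a road graph, solving (I + λL)x = y trades fidelity to y against smoothness across adjacent segments. This is a one-dimensional analogue of the paper's regularizers, not the LETC tensor model itself:

```python
import numpy as np

def laplacian_smooth(speeds, adj, lam=1.0):
    """Graph-Laplacian smoothing of noisy road-segment speeds:
    minimize ||x - y||^2 + lam * x^T L x, whose closed-form solution
    is x = (I + lam * L)^{-1} y.

    speeds: (n,) observed speeds (zeros where missing, as a crude proxy)
    adj:    (n, n) symmetric adjacency matrix of the road graph
    """
    deg = np.diag(adj.sum(axis=1))
    L = deg - adj                       # combinatorial graph Laplacian
    return np.linalg.solve(np.eye(len(speeds)) + lam * L, speeds)
```

On a three-segment chain with the middle observation missing, the solver pulls the middle estimate toward its neighbors while shrinking the observed values slightly; LETC applies the same principle along the temporal and spatial modes of a speed tensor simultaneously.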
Machine Learning Methods with Noisy, Incomplete or Small Datasets
In many machine learning applications, the available datasets are incomplete, noisy, or affected by artifacts. In supervised scenarios, the label information may be of low quality, owing to unbalanced training sets, noisy labels, and other problems. Moreover, in practice it is very common that the available data samples are not sufficient to derive useful supervised or unsupervised classifiers. All of these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to help disseminate new ideas for solving this challenging problem and to provide clear examples of their application in real scenarios.
Collected Papers (on Neutrosophics, Plithogenics, Hypersoft Set, Hypergraphs, and other topics), Volume X
This tenth volume of Collected Papers includes 86 papers in English and Spanish languages comprising 972 pages, written between 2014-2022 by the author alone or in collaboration with the following 105 co-authors (alphabetically ordered) from 26 countries: Abu Sufian, Ali Hassan, Ali Safaa Sadiq, Anirudha Ghosh, Assia Bakali, Atiqe Ur Rahman, Laura Bogdan, Willem K.M. Brauers, Erick González Caballero, Fausto Cavallaro, Gavrilă Calefariu, T. Chalapathi, Victor Christianto, Mihaela Colhon, Sergiu Boris Cononovici, Mamoni Dhar, Irfan Deli, Rebeca Escobar-Jara, Alexandru Gal, N. Gandotra, Sudipta Gayen, Vassilis C. Gerogiannis, Noel Batista Hernández, Hongnian Yu, Hongbo Wang, Mihaiela Iliescu, F. Nirmala Irudayam, Sripati Jha, Darjan Karabašević, T. Katican, Bakhtawar Ali Khan, Hina Khan, Volodymyr Krasnoholovets, R. Kiran Kumar, Manoranjan Kumar Singh, Ranjan Kumar, M. Lathamaheswari, Yasar Mahmood, Nivetha Martin, Adrian Mărgean, Octavian Melinte, Mingcong Deng, Marcel Migdalovici, Monika Moga, Sana Moin, Mohamed Abdel-Basset, Mohamed Elhoseny, Rehab Mohamed, Mohamed Talea, Kalyan Mondal, Muhammad Aslam, Muhammad Aslam Malik, Muhammad Ihsan, Muhammad Naveed Jafar, Muhammad Rayees Ahmad, Muhammad Saeed, Muhammad Saqlain, Muhammad Shabir, Mujahid Abbas, Mumtaz Ali, Radu I. Munteanu, Ghulam Murtaza, Munazza Naz, Tahsin Oner, Gabrijela Popović, Surapati Pramanik, R. Priya, S.P. Priyadharshini, Midha Qayyum, Quang-Thinh Bui, Shazia Rana, Akbara Rezaei, Jesús Estupiñán Ricardo, Rıdvan Sahin, Saeeda Mirvakili, Said Broumi, A. A. Salama, Flavius Aurelian Sârbu, Ganeshsree Selvachandran, Javid Shabbir, Shio Gai Quek, Son Hoang Le, Florentin Smarandache, Dragiša Stanujkić, S. Sudha, Taha Yasin Ozturk, Zaigham Tahir, The Houw Iong, Ayse Topal, Alptekin Ulutaș, Maikel Yelandi Leyva Vázquez, Rizha Vitania, Luige Vlădăreanu, Victor Vlădăreanu, Ștefan Vlăduțescu, J. Vimala, Dan Valeriu Voinea, Adem Yolcu, Yongfei Feng, Abd El-Nasser H. Zaied, Edmundas Kazimieras Zavadskas.
Statistical Data Modeling and Machine Learning with Applications
The modeling and processing of empirical data is one of the main subjects and goals of statistics. Nowadays, with the development of computer science, the extraction of useful and often hidden information and patterns from data sets of different volumes, and from complex data sets in warehouses, has been added to these goals. New and powerful statistical techniques with machine learning (ML) and data mining paradigms have been developed. To one degree or another, all of these techniques and algorithms originate from a rigorous mathematical basis, including probability theory and mathematical statistics, operational research, mathematical analysis, numerical methods, etc. Popular ML methods, such as artificial neural networks (ANN), support vector machines (SVM), decision trees, and random forests (RF), have generated models that can be considered straightforward applications of optimization theory and statistical estimation. The wide arsenal of classical statistical approaches, combined with powerful ML techniques, allows many challenging practical problems to be solved. This Special Issue belongs to the section “Mathematics and Computer Science”. Its aim is to establish a brief collection of carefully selected papers presenting new and original methods, data analyses, case studies, comparative studies, and other research on the topic of statistical data modeling and ML, as well as their applications. Particular attention is given, but not limited, to theories and applications in diverse areas such as computer science, medicine, engineering, banking, education, sociology, and economics. The resulting palette of methods, algorithms, and applications for statistical modeling and ML presented in this Special Issue is expected to contribute to the further development of research in this area.
We also believe that the new knowledge acquired here, as well as the applied results, will be attractive and useful to young scientists, doctoral students, and researchers from various scientific specialties.
Preference-based Representation Learning for Collections
In this thesis, I make several contributions to the development of representation learning in the setting of external constraints and noisy supervision. A setting of external constraints refers to the scenario in which the learner must output a latent representation of the given data points while enforcing particular conditions. These conditions can be geometrical constraints, for example forcing the vector embeddings to be close to each other based on particular relations, or forcing the embedding vectors to lie on a particular manifold, such as the manifold of vectors whose elements sum to 1, or even more complex constraints. The objects of interest in this thesis are elements of a collection X in an abstract space endowed with a similarity function that quantifies how similar two objects are. A collection is defined as a set of items in which the order is ignored but the multiplicity is relevant. Various types of collections are used as inputs or outputs in the machine learning field; the most common are perhaps sequences and sets.
Besides studying representation learning approaches in the presence of external constraints, in this thesis we tackle the case in which the evaluation of this similarity function is not directly possible. In recent years, the machine learning setting of having only binary answers to comparisons between tuples of elements has gained interest. Learning good representations in a scenario in which clear distance information cannot be obtained is of fundamental importance. This problem is the opposite of the standard machine learning setting, where the similarity function between elements can be directly evaluated. Moreover, we tackle the case in which the learner is given noisy supervision signals, with a certain probability of a label being incorrect. Another research question studied in this thesis is how to assess the quality of the learned representations and how a learner can convey its uncertainty about this representation.
After the introductory Chapter 1, the thesis is structured in three main parts. In the first part, I present the results of representation learning based on data points that are sequences. The focus in this part is on sentences and permutations, two particular types of sequences. The first contribution of this part consists in enforcing analogical relations between sentences; the second is learning appropriate representations for permutations, which are particular mathematical objects, using neural networks. The second part of this thesis tackles the question of learning perceptual embeddings from binary and noisy comparisons. In machine learning, this is referred to as the ordinal embedding problem. This part contains two chapters that elaborate on two different aspects of the problem: appropriately conveying the uncertainty of the representation, and learning the embeddings from aggregated and noisy feedback. Finally, the third part of the thesis contains applications of the findings of the previous part, namely unsupervised alignment of clouds of embedding vectors and entity set extension.
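The ordinal embedding setting of the second part can be sketched as gradient descent on a margin-based triplet loss, where a triplet (i, j, k) encodes "i is closer to j than to k". The loss, margin, and optimizer below are generic choices, not the thesis's actual method:

```python
import numpy as np

def ordinal_embed(n, triplets, dim=2, lr=0.05, epochs=200, seed=0):
    """Learn an embedding of n items from triplet comparisons by
    gradient descent on the hinge loss
    max(0, ||x_i - x_j||^2 + 1 - ||x_i - x_k||^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, dim))
    for _ in range(epochs):
        for i, j, k in triplets:
            d_ij = x[i] - x[j]
            d_ik = x[i] - x[k]
            # update only while the margin constraint is violated
            if d_ij @ d_ij + 1.0 > d_ik @ d_ik:
                x[i] -= lr * 2 * (d_ij - d_ik)
                x[j] += lr * 2 * d_ij
                x[k] -= lr * 2 * d_ik
    return x
```

Since only binary comparison outcomes enter the loss, the recovered embedding is identified at best up to similarity transformations, which is one reason conveying uncertainty about the representation (the first aspect studied in this part) matters.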
Learning Robust Low-Rank Approximation for Crowdsourcing on Riemannian Manifold
Recently, crowdsourcing has attracted substantial research interest due to its efficiency in collecting labels for machine learning and computer vision tasks. This paper proposes a Riemannian manifold optimization algorithm, ROLA (Robust Low-rank Approximation), that aggregates the labels from a novel perspective. Specifically, a novel low-rank approximation model is proposed to capture the underlying correlation among annotators while identifying annotator-specific noise. More significantly, ROLA defines the label noise in crowdsourcing as annotator-specific noise, which can be well regularized by the l2,1-norm. The proposed ROLA improves aggregation performance compared with state-of-the-art crowdsourcing methods.
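The effect of the l2,1-norm penalty can be seen from its proximal operator, which shrinks each annotator's noise column as a whole and zeroes columns with small norm. This is a standard building block of l2,1-regularized models, shown here as background rather than as ROLA's full algorithm:

```python
import numpy as np

def prox_l21(E, tau):
    """Proximal operator of tau * sum_j ||E[:, j]||_2 (the l2,1-norm):
    column-wise soft-thresholding. With one column per annotator,
    annotators whose noise column has norm <= tau are zeroed out,
    isolating the annotator-specific noise the penalty targets."""
    norms = np.linalg.norm(E, axis=0)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return E * scale
```

Inside an alternating scheme, such a step would be interleaved with a low-rank update of the aggregated label matrix, so that structured annotator noise is absorbed by E rather than by the low-rank component.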