Robust Machine Learning by Transforming and Augmenting Imperfect Training Data
Machine Learning (ML) is an expressive framework for turning data into
computer programs. Across many problem domains -- both in industry and policy
settings -- the types of computer programs needed for accurate prediction or
optimal control are difficult to write by hand. On the other hand, collecting
instances of desired system behavior may be relatively more feasible. This
makes ML broadly appealing, but also induces data sensitivities that often
manifest as unexpected failure modes during deployment. In this sense, the
training data available tend to be imperfect for the task at hand. This thesis
explores several data sensitivities of modern machine learning and how to
address them. We begin by discussing how to prevent ML from codifying prior
human discrimination measured in the training data, where we take a fair
representation learning approach. We then discuss the problem of learning from
data containing spurious features, which provide predictive fidelity during
training but are unreliable upon deployment. Here we observe that insofar as
standard training methods tend to learn such features, this propensity can be
leveraged to search for partitions of training data that expose this
inconsistency, ultimately promoting learning algorithms invariant to spurious
features. Finally, we turn our attention to reinforcement learning from data
with insufficient coverage over all possible states and actions. To address the
coverage issue, we discuss how causal priors can be used to model the
single-step dynamics of the setting where data are collected. This enables a
new type of data augmentation where observed trajectories are stitched together
to produce new but plausible counterfactual trajectories.
Comment: A thesis submitted in conformity with the requirements for the degree
of Doctor of Philosophy, Department of Computer Science, University of
Toront
Principles of Physical Layer Security in Multiuser Wireless Networks: A Survey
This paper provides a comprehensive review of the domain of physical layer
security in multiuser wireless networks. The essential premise of
physical-layer security is to enable the exchange of confidential messages over
a wireless medium in the presence of unauthorized eavesdroppers without relying
on higher-layer encryption. This can be achieved primarily in two ways: without
the need for a secret key by intelligently designing transmit coding
strategies, or by exploiting the wireless communication medium to develop
secret keys over public channels. The survey begins with an overview of the
foundations dating back to the pioneering work of Shannon and Wyner on
information-theoretic security. We then describe the evolution of secure
transmission strategies from point-to-point channels to multiple-antenna
systems, followed by generalizations to multiuser broadcast, multiple-access,
interference, and relay networks. Secret-key generation and establishment
protocols based on physical layer mechanisms are subsequently covered.
Approaches for secrecy based on channel coding design are then examined, along
with a description of inter-disciplinary approaches based on game theory and
stochastic geometry. The associated problem of physical-layer message
authentication is also introduced briefly. The survey concludes with
observations on potential research directions in this area.
Comment: 23 pages, 10 figures, 303 refs. arXiv admin note: text overlap with
arXiv:1303.1609 by other authors. IEEE Communications Surveys and Tutorials,
201
Information Theory and Machine Learning
The recent successes of machine learning, especially regarding systems based on deep neural networks, have encouraged further research activities and raised a new set of challenges in understanding and designing complex machine learning algorithms. New applications require learning algorithms to be distributed, have transferable learning results, use computation resources efficiently, converge quickly in online settings, have performance guarantees, satisfy fairness or privacy constraints, incorporate domain knowledge on model structures, etc. A new wave of developments in statistical learning theory and information theory has set out to address these challenges. This Special Issue, "Machine Learning and Information Theory", aims to collect recent results in this direction reflecting a diverse spectrum of visions and efforts to extend conventional theories and develop analysis tools for these complex machine learning systems.
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
Estimating Dependency, Monitoring and Knowledge Discovery in High-Dimensional Data Streams
Data Mining – known as the process of extracting knowledge from massive data sets – leads to phenomenal impacts on our society, and now affects nearly every aspect of our lives: from the layout in our local grocery store, to the ads and product recommendations we receive, the availability of treatments for common diseases, the prevention of crime, or the efficiency of industrial production processes.
However, Data Mining remains difficult when (1) data is high-dimensional, i.e., has many attributes, and when (2) data comes as a stream. Extracting knowledge from high-dimensional data streams is impractical because one must cope with two orthogonal sets of challenges. On the one hand, the effects of the so-called "curse of dimensionality" bog down the performance of statistical methods and lead to increasingly complex Data Mining problems. On the other hand, the statistical properties of data streams may evolve in unexpected ways, a phenomenon known in the community as "concept drift". Thus, one needs to update their knowledge about data over time, i.e., to monitor the stream.
While previous work addresses high-dimensional data sets and data streams to some extent, the intersection of both has received much less attention. Nevertheless, extracting knowledge in this setting is advantageous for many industrial applications: identifying patterns from high-dimensional data streams in real-time may lead to larger production volumes, or reduce operational costs. The goal of this dissertation is to bridge this gap.
We first focus on dependency estimation, a fundamental task of Data Mining. Typically, one estimates dependency by quantifying the strength of statistical relationships. We identify the requirements for dependency estimation in high-dimensional data streams and propose a new estimation framework, Monte Carlo Dependency Estimation (MCDE), that fulfils them all. We show that MCDE leads to efficient dependency monitoring.
Then, we generalise the task of monitoring by introducing the Scaling Multi-Armed Bandit (S-MAB) algorithms, extending the Multi-Armed Bandit (MAB) model. We show that our algorithms can efficiently monitor statistics by leveraging user-specific criteria.
Finally, we describe applications of our contributions to Knowledge Discovery. We propose an algorithm, Streaming Greedy Maximum Random Deviation (SGMRD), which exploits our new methods to extract patterns, e.g., outliers, in high-dimensional data streams. We also present a new approach, which we name kj-Nearest Neighbours (kj-NN), to detect outlying documents within massive text corpora.
We support our algorithmic contributions with theoretical guarantees, as well as extensive experiments against both synthetic and real-world data. We demonstrate the benefits of our methods against real-world use cases. Overall, this dissertation establishes fundamental tools for Knowledge Discovery in high-dimensional data streams, which help with many applications in the industry, e.g., anomaly detection, or predictive maintenance.
To facilitate the application of our results and future research, we publicly release our implementations, experiments, and benchmark data via open-source platforms.
Multi-facet graph mining with contextualized projections
The goal of my doctoral research is to develop a new generation of graph mining techniques, centered around my proposed idea of multi-facet contextualized projections, for more systematic, flexible, and scalable knowledge discovery around massive, complex, and noisy real-world context-rich networks across various domains. Traditional graph theories largely overlook network contexts, whereas state-of-the-art graph mining algorithms simply regard them as associative attributes and brutally employ machine learning models developed in individual domains (e.g., convolutional neural networks in computer vision, recurrent neural networks in natural language processing) to handle them jointly. As such, essentially different contexts (e.g., temporal, spatial, textual, visual) are mixed up in a messy, unstable, and uninterpretable way, while the correlations between graph topologies and contexts remain a mystery, which further renders the development of real-world mining systems less principled and ineffective. To overcome such barriers, my research harnesses the power of multi-facet context modeling and focuses on the principle of contextualized projections, which provides generic but subtle solutions to knowledge discovery over graphs with the mixtures of various semantic contexts
Domain Generalization in Computational Pathology: Survey and Guidelines
Deep learning models have exhibited exceptional effectiveness in
Computational Pathology (CPath) by tackling intricate tasks across an array of
histology image analysis applications. Nevertheless, the presence of
out-of-distribution data (stemming from a multitude of sources such as
disparate imaging devices and diverse tissue preparation methods) can cause
domain shift (DS). DS decreases the generalization of trained models to
unseen datasets with slightly different data distributions, prompting the need
for innovative domain generalization (DG) solutions. Recognizing the
potential of DG methods to significantly influence diagnostic and prognostic
models in cancer studies and clinical practice, we present this survey along
with guidelines on achieving DG in CPath. We rigorously define various DS
types, systematically review and categorize existing DG approaches and
resources in CPath, and provide insights into their advantages, limitations,
and applicability. We also conduct thorough benchmarking experiments with 28
cutting-edge DG algorithms to address a complex DG problem. Our findings
suggest that careful experiment design and a CPath-specific Stain Augmentation
technique can be very effective. However, there is no one-size-fits-all
solution for DG in CPath. Therefore, we establish clear guidelines for
detecting and managing DS depending on different scenarios. While most of the
concepts, guidelines, and recommendations are given for applications in CPath,
we believe that they are applicable to most medical image analysis tasks as
well.
Comment: Extended Versio
Machine learning in solar physics
The application of machine learning in solar physics has the potential to
greatly enhance our understanding of the complex processes that take place in
the atmosphere of the Sun. By using techniques such as deep learning, we are
now in the position to analyze large amounts of data from solar observations
and identify patterns and trends that may not have been apparent using
traditional methods. This can help us improve our understanding of explosive
events like solar flares, which can have a strong effect on Earth's
environment. Predicting hazardous events on Earth becomes crucial for our
technological society. Machine learning can also improve our understanding of
the inner workings of the sun itself by allowing us to go deeper into the data
and to propose more complex models to explain them. Additionally, the use of
machine learning can help to automate the analysis of solar data, reducing the
need for manual labor and increasing the efficiency of research in this field.
Comment: 100 pages, 13 figures, 286 references, accepted for publication as a
Living Review in Solar Physics (LRSP