A review of domain adaptation without target labels
Domain adaptation has become a prominent problem setting in machine learning
and related fields. This review asks the question: how can a classifier learn
from a source domain and generalize to a target domain? We present a
categorization of approaches, divided into what we refer to as sample-based,
feature-based and inference-based methods. Sample-based methods focus on
weighting individual observations during training based on their importance to
the target domain. Feature-based methods revolve around mapping, projecting,
and representing features such that a source classifier performs well on the
target domain. Inference-based methods incorporate adaptation into the
parameter estimation procedure, for instance through constraints on the
optimization procedure. Additionally, we review a number of conditions that
allow for formulating bounds on the cross-domain generalization error. Our
categorization highlights recurring ideas and raises questions important to
further research.
Comment: 20 pages, 5 figures
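As a concrete illustration of the sample-based family described above, importance weighting can be sketched in a few lines. The example below is a hypothetical one-dimensional setting in which the density ratio p_target(x)/p_source(x) is estimated with Gaussian fits; the estimator choice and the data are assumptions for illustration, not a method taken from the review.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and std sigma.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_weights(source, target):
    # Fit a 1-D Gaussian to each domain, then weight every source
    # point by the density ratio p_target(x) / p_source(x).
    def fit(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, math.sqrt(var)
    mu_s, sd_s = fit(source)
    mu_t, sd_t = fit(target)
    return [gaussian_pdf(x, mu_t, sd_t) / gaussian_pdf(x, mu_s, sd_s)
            for x in source]

random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(500)]
target = [random.gauss(1.0, 1.0) for _ in range(500)]
w = importance_weights(source, target)
# Source points that lie closer to the target mean receive larger
# weights, so a weighted training loss emphasizes them.
```

In practice the density ratio is usually estimated in higher dimensions, for instance with a domain-discriminating classifier, but the weighting principle is the same.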
Robust Algorithms for Detecting Hidden Structure in Biological Data
Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects the underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot of the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins' function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.
Clustering is an unsupervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuronal cell.
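The robust-clustering idea above, guarding against objects that co-cluster by chance, can be sketched with a simple consensus approach: cluster repeatedly from random initializations and keep only pairings that recur across runs. The tiny 1-D k-means and the data below are hypothetical stand-ins, not the thesis's actual pipeline.

```python
import random

def kmeans_1d(points, k, iters=20, rng=random):
    # Plain Lloyd's algorithm on 1-D points with random initial centers.
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]

def coassignment_matrix(points, k, runs=50, seed=0):
    # Fraction of runs in which each pair of points lands in the same
    # cluster; low values flag unstable, possibly spurious pairings.
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_1d(points, k, rng=rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[c / runs for c in row] for row in together]

# Two well-separated toy clusters: points 0-2 low, points 3-5 high.
data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9]
M = coassignment_matrix(data, k=2)
# Within-group pairs co-cluster in nearly every run; cross-group
# pairs rarely do, so thresholding M yields a robust partition.
```

Consensus methods in the literature typically also perturb the data itself (bootstrap resampling) rather than only the initialization; the co-assignment bookkeeping is the same.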
In the second part of this thesis I describe identifying hidden specificity
determining residues (SDPs) from alignments of protein sequences descended
through gene duplication from a common ancestor (paralogs) and apply the
approach to identify numerous putative SDPs in bacterial transcription
factors in the LacI family. Finally, I describe and demonstrate a new
algorithm for reconstructing the history of duplications by which paralogs
descended from their common ancestor. This algorithm addresses the
complexity of such reconstruction due to indeterminate or erroneous
homology assignments made by sequence alignment algorithms and to the
vast prevalence of divergence through speciation over divergence through
gene duplication in protein evolution.
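The SDP signature described above (alignment columns where each paralog subfamily is internally conserved, but for a different residue) can be sketched as a simple per-column score. The scoring rule and the toy sequences below are illustrative assumptions, not the thesis's method.

```python
from collections import Counter

def column_scores(group_a, group_b):
    # For each alignment column, score how conserved each subfamily is
    # around a *different* residue: high when both groups are internally
    # conserved yet disagree with each other, a putative SDP signature.
    scores = []
    for col in range(len(group_a[0])):
        ca = Counter(seq[col] for seq in group_a)
        cb = Counter(seq[col] for seq in group_b)
        res_a, freq_a = ca.most_common(1)[0]
        res_b, freq_b = cb.most_common(1)[0]
        conserved = (freq_a / len(group_a)) * (freq_b / len(group_b))
        scores.append(conserved if res_a != res_b else 0.0)
    return scores

# Toy paralog subfamilies: column 1 is the putative SDP, since each
# group is fully conserved there but for different residues (K vs R).
a = ["MKLV", "MKLV", "MKLI"]
b = ["MRLV", "MRLV", "MRLV"]
scores = column_scores(a, b)
```

Real SDP predictors use statistically grounded measures such as relative entropy between subfamily residue distributions, but the column-wise comparison is the same shape of computation.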
UTB/TSC Legacy Degree Programs and Courses 2010 – 2011
https://scholarworks.utrgv.edu/brownsvillelegacycatalogs/1026/thumbnail.jp
Towards the true tree: Bioinformatic approaches in the phylogenetics and molecular evolution of the Endopterygota
In this thesis, I use bioinformatic approaches to address new and existing issues
surrounding large-scale phylogenetic analysis. A phylogenetic analysis pipeline is
developed to aid an investigation of the suitability of integrating Cytochrome Oxidase
Subunit 1 (cox1) into phylogenetic supermatrices. In the first two chapters I assess the
effect of varying cox1 sample size within a large, variable phylogenetic context. As well
as the intuitive result of increased quality with greater taxon sampling, there are clear
monophyly patterns relating to local taxonomic sampling. Specifically, resampled taxa are
more often monophyletic when fewer consubfamilials are represented, and these tend to
remain unchanged in their degree of monophyly when rarefied. Sampling analyses
are extended in chapter two using a mined Scarabaeoidea multilocus dataset, where taxa
from given loci are used to improve existing matrices. Improvement in phylogenetic
signal is best achieved by targeting cox1 to existing taxa, which suggests minimum
parameters for cox1 adoption in large-scale phylogenetics.
In chapter 3 I address recently arisen issues related to phyloinformatic analysis of
sequence-delineated matrices. There is ongoing work on setting species boundaries by
sequence variation alone, but incongruence results in methodological issues upon
integrating multiple loci delineated in this way.
In the final chapter I assess the impact of heterogeneous substitution rates on large-scale
cox1 datasets. Although the number of heterogeneous sites in Coleoptera cox1 is
substantial, their presence is found to be beneficial, as their removal negatively impacts
the ability of the alignment to generate the 'known' topology. The homoplasy and
rate heterogeneity of cox1 have not substantially impacted its utility; thus,
cox1 datasets have potential to play a substantial role in the tree of life.
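The monophyly assessments discussed above rest on a simple predicate: a taxon set is monophyletic in a rooted tree exactly when it equals the leaf set of some clade. A minimal sketch, assuming trees encoded as nested tuples of tip labels (a toy representation, not the thesis's tooling):

```python
def leaves(tree):
    # Leaf labels of a tree given as nested tuples of strings.
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def is_monophyletic(tree, taxa):
    # True iff `taxa` is exactly the leaf set of some clade (subtree)
    # of the rooted tree, i.e. the taxa share an exclusive ancestor.
    taxa = set(taxa)
    def walk(node):
        if leaves(node) == taxa:
            return True
        if isinstance(node, str):
            return False
        return any(walk(child) for child in node)
    return walk(tree)

# Rooted tree ((A,B),(C,(D,E))): {D,E} forms a clade, {B,C} does not.
tree = (("A", "B"), ("C", ("D", "E")))
```

Phylogenetics libraries expose the same test directly (for example, Biopython's `Tree.is_monophyletic`), usually with more efficient bookkeeping than this recursive sketch.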
Dynamic Datasets and Market Environments for Financial Reinforcement Learning
The financial market is a particularly challenging playground for deep
reinforcement learning due to its unique feature of dynamic datasets. Building
high-quality market environments for training financial reinforcement learning
(FinRL) agents is difficult due to major factors such as the low
signal-to-noise ratio of financial data, survivorship bias of historical data,
and model overfitting. In this paper, we present FinRL-Meta, a data-centric and
openly accessible library that processes dynamic datasets from real-world
markets into gym-style market environments and has been actively maintained by
the AI4Finance community. First, following a DataOps paradigm, we provide
hundreds of market environments through an automatic data curation pipeline.
Second, we provide homegrown examples and reproduce popular research papers as
stepping stones for users to design new trading strategies. We also deploy the
library on cloud platforms so that users can visualize their own results and
assess the relative performance via community-wise competitions. Third, we
provide dozens of Jupyter/Python demos organized into a curriculum and a
documentation website to serve the rapidly growing community. The open-source
codes for the data curation pipeline are available at
https://github.com/AI4Finance-Foundation/FinRL-Meta
Comment: 49 pages, 15 figures. arXiv admin note: substantial text overlap with
arXiv:2211.0310
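The gym-style market environment mentioned above can be sketched as a class exposing reset/step semantics. This is an illustrative stand-in, not the FinRL-Meta API: prices follow a hypothetical random walk, the action is a target position in {-1, 0, +1}, and the reward is the position times the price change.

```python
import random

class ToyMarketEnv:
    # Minimal gym-style interface: reset() returns an observation;
    # step(action) returns (observation, reward, done, info).
    def __init__(self, n_steps=100, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.price = 100.0
        return self.price  # observation: current price

    def step(self, action):
        assert action in (-1, 0, 1)  # short, flat, or long
        change = self.rng.gauss(0.0, 1.0)  # toy random-walk increment
        self.price += change
        self.t += 1
        reward = action * change  # P&L of holding the chosen position
        done = self.t >= self.n_steps
        return self.price, reward, done, {}

env = ToyMarketEnv(n_steps=5)
obs = env.reset()
total = 0.0
done = False
while not done:
    obs, reward, done, info = env.step(1)  # always-long policy
    total += reward
# For an always-long policy, cumulative reward equals the net price move.
```

Any agent written against this interface shape can be swapped between environments, which is the property that lets libraries like the one described above train agents on many markets through one API.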