    Second April 2018 Faculty Senate Packet

    A review of domain adaptation without target labels

    Domain adaptation has become a prominent problem setting in machine learning and related fields. This review asks the question: how can a classifier learn from a source domain and generalize to a target domain? We present a categorization of approaches, divided into what we refer to as sample-based, feature-based, and inference-based methods. Sample-based methods focus on weighting individual observations during training based on their importance to the target domain. Feature-based methods revolve around mapping, projecting, and representing features such that a source classifier performs well on the target domain. Inference-based methods incorporate adaptation into the parameter estimation procedure, for instance through constraints on the optimization procedure. Additionally, we review a number of conditions that allow for formulating bounds on the cross-domain generalization error. Our categorization highlights recurring ideas and raises questions important to further research. Comment: 20 pages, 5 figures
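
As an illustration of the sample-based idea described in the abstract above, the following is a minimal sketch and not a method taken from the review. It assumes inputs have already been discretized into bins, estimates the weight of each source observation as the ratio of target to source bin frequencies, and uses those weights to re-estimate a classifier's error for the target domain. The function names `importance_weights` and `weighted_error` are hypothetical.

```python
from collections import Counter


def importance_weights(source_bins, target_bins):
    """Estimate w(x) = p_target(x) / p_source(x) for every source observation.

    Assumption: inputs are discrete bin labels, so both densities can be
    estimated by simple relative frequencies.
    """
    src_counts = Counter(source_bins)
    tgt_counts = Counter(target_bins)
    n_src, n_tgt = len(source_bins), len(target_bins)
    weights = []
    for b in source_bins:
        p_src = src_counts[b] / n_src          # never zero: b occurs in source
        p_tgt = tgt_counts.get(b, 0) / n_tgt   # zero if b is unseen in target
        weights.append(p_tgt / p_src)
    return weights


def weighted_error(losses, weights):
    """Importance-weighted average of per-observation source losses."""
    total = sum(weights)
    if total == 0:
        raise ValueError("source and target share no bins")
    return sum(l * w for l, w in zip(losses, weights)) / total


# Toy data: bin "b" is rare in the source domain but common in the target.
source_bins = ["a", "a", "a", "b"]
target_bins = ["a", "b", "b", "b"]
losses = [0, 0, 0, 1]  # the classifier errs only on the "b" observation

weights = importance_weights(source_bins, target_bins)
unweighted = sum(losses) / len(losses)
reweighted = weighted_error(losses, weights)
```

In this toy case the unweighted source error understates how often the classifier would fail on the target domain, while the reweighted estimate shifts emphasis toward the bin that the target domain favors.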

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects the underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity-determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.
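
The abstract above notes that a single partitioning can group objects by chance. One common way to probe this, sketched here under stated assumptions and not taken from the thesis, is to re-cluster noise-perturbed copies of the data and record how often each pair of objects lands in the same cluster. The clustering rule (a simple gap threshold on 1-D values), the Gaussian noise model, and the names `threshold_clusters` and `co_association` are all hypothetical choices made for illustration.

```python
import random


def threshold_clusters(values, gap):
    """Cluster 1-D values: sort them and start a new cluster whenever the
    distance between sorted neighbours exceeds `gap`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > gap:
            current += 1
        labels[nxt] = current
    return labels


def co_association(values, gap, noise_sd, n_rounds, seed=0):
    """Fraction of noisy re-clusterings in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_rounds):
        # Perturb every measurement with Gaussian noise (an assumed noise model).
        noisy = [v + rng.gauss(0, noise_sd) for v in values]
        labels = threshold_clusters(noisy, gap)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / n_rounds for count in row] for row in together]


# Two well-separated groups: pairs within a group should co-cluster reliably.
values = [0.0, 0.1, 5.0, 5.1]
stability = co_association(values, gap=1.0, noise_sd=0.05, n_rounds=50)
```

Pairs with a co-association fraction near 1 are grouped consistently across perturbations, while intermediate values flag associations that may be due to noise.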

    UTB/TSC Legacy Degree Programs and Courses 2010 – 2011

    Towards the true tree: Bioinformatic approaches in the phylogenetics and molecular evolution of the Endopterygota

    In this thesis, I use bioinformatic approaches to address new and existing issues surrounding large-scale phylogenetic analysis. A phylogenetic analysis pipeline is developed to aid an investigation of the suitability of integrating Cytochrome Oxidase Subunit 1 (cox1) into phylogenetic supermatrices. In the first two chapters I assess the effect of varying cox1 sample size within a large variable phylogenetic context. As well as intuitive results on increased quality with greater taxon sampling, there are clear monophyly patterns relating to local taxonomic sampling. Specifically, there are more monophyletic resampled taxa in cases where fewer consubfamilials are represented, with a tendency for these to remain unchanged in the degree of monophyly when rarefied. Sampling analyses are extended in chapter two using a mined Scarabaeoidea multilocus dataset, where taxa from given loci are used to improve existing matrices. Improvement in phylogenetic signal is best achieved by targeting cox1 to existing taxa, which suggests minimum parameters for cox1 adoption in large-scale phylogenetics. In chapter 3 I address recently arisen issues related to phyloinformatic analysis of sequence-delineated matrices. There is ongoing work on setting species boundaries by sequence variation alone, but incongruence results in methodological issues upon integrating multiple loci delineated in this way. In the final chapter I assess the impact of heterogeneous substitution rates on large-scale cox1 datasets. Although the number of heterogeneous sites in Coleoptera cox1 is substantial, their presence is found to be beneficial, as their removal negatively impacts the ability of the alignment to generate the 'known' topology. The homoplasy and heterogeneous characteristics of cox1 have not substantially impacted its utility, thus the cox1 datasets have potential to play a substantial role in the tree of life.

    King’s Cross: renaissance for whom?

    Dynamic Datasets and Market Environments for Financial Reinforcement Learning

    The financial market is a particularly challenging playground for deep reinforcement learning due to its unique feature of dynamic datasets. Building high-quality market environments for training financial reinforcement learning (FinRL) agents is difficult due to major factors such as the low signal-to-noise ratio of financial data, survivorship bias of historical data, and model overfitting. In this paper, we present FinRL-Meta, a data-centric and openly accessible library that processes dynamic datasets from real-world markets into gym-style market environments and has been actively maintained by the AI4Finance community. First, following a DataOps paradigm, we provide hundreds of market environments through an automatic data curation pipeline. Second, we provide homegrown examples and reproduce popular research papers as stepping stones for users to design new trading strategies. We also deploy the library on cloud platforms so that users can visualize their own results and assess the relative performance via community-wise competitions. Third, we provide dozens of Jupyter/Python demos organized into a curriculum and a documentation website to serve the rapidly growing community. The open-source code for the data curation pipeline is available at https://github.com/AI4Finance-Foundation/FinRL-Meta. Comment: 49 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2211.0310
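
To make the phrase "gym-style market environment" in the abstract above concrete, here is a minimal sketch of what such an interface can look like. It does not use or reproduce FinRL-Meta or any gym package. The class `ToyTradingEnv`, its three discrete actions, and its reward definition (change in portfolio value) are assumptions chosen for illustration, following the classic `reset()` / `step(action)` convention that returns an observation, a reward, a done flag, and an info dictionary.

```python
class ToyTradingEnv:
    """Hypothetical single-asset environment with a gym-like reset/step API."""

    def __init__(self, prices, initial_cash=1000.0):
        if len(prices) < 2:
            raise ValueError("need at least two prices to take one step")
        self.prices = list(prices)
        self.initial_cash = initial_cash
        self.reset()

    def reset(self):
        """Return to the first time step with all funds held as cash."""
        self.t = 0
        self.cash = self.initial_cash
        self.shares = 0.0
        return self._observation()

    def _observation(self):
        return (self.prices[self.t], self.cash, self.shares)

    def _value(self):
        return self.cash + self.shares * self.prices[self.t]

    def step(self, action):
        """Apply an action at the current price, then advance one time step.

        action: 0 = hold, 1 = buy with all cash, 2 = sell all shares.
        Must not be called again once `done` has been returned as True.
        """
        price = self.prices[self.t]
        if action == 1 and self.cash > 0:
            self.shares += self.cash / price
            self.cash = 0.0
        elif action == 2 and self.shares > 0:
            self.cash += self.shares * price
            self.shares = 0.0
        value_before = self._value()
        self.t += 1
        reward = self._value() - value_before  # change in portfolio value
        done = self.t == len(self.prices) - 1
        return self._observation(), reward, done, {}


# Usage: buy at the first price, then hold until the episode ends.
env = ToyTradingEnv([10.0, 11.0, 12.0], initial_cash=100.0)
first_obs = env.reset()
obs1, reward1, done1, _ = env.step(1)
obs2, reward2, done2, _ = env.step(0)
```

A real market environment would also need to handle transaction costs, multiple assets, and the data-quality issues the abstract lists; this sketch only shows the shape of the interaction loop.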