20 research outputs found

    Effectively Constructing Reliable Data for Cross-Domain Text Classification

    No full text
    Part 2: Machine Learning. International audience. Traditional classification algorithms often fail when the independent and identically distributed (i.i.d.) assumption does not hold, and cross-domain learning has recently emerged to deal with this problem. We observe that although a model trained on the training data may not perform well over all test data, it can give much better predictions on a subset of the test data with high prediction confidence. Because this subset is drawn from the test data set, its distribution may also be closer to that of the test data. In this study, we propose to construct a reliable data set from the predictions with high confidence, and to use this reliable data as training data. Furthermore, we develop an EM algorithm to refine the model trained from the reliable data. Extensive experiments on text classification verify the effectiveness and efficiency of our methods. It is worth mentioning that the model trained from the reliable data achieves a significant performance improvement over the one trained from the original training data, and our methods outperform all the baseline algorithms.
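The selection step described in this abstract can be sketched as below. This is only a minimal illustration, not the authors' implementation: the function name `select_reliable`, the fixed threshold of 0.9, and the toy probabilities are all assumptions, and the abstract does not say how confidence is measured or thresholded. The EM refinement step is not shown.

```python
from typing import List, Sequence, Tuple


def select_reliable(
    probs: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (example_index, predicted_label) pairs for confident predictions.

    `probs[i][k]` is the source-trained model's probability that test
    example i belongs to class k. The threshold is a hypothetical choice.
    """
    reliable = []
    for i, p in enumerate(probs):
        # Predicted label is the class with the highest probability.
        label = max(range(len(p)), key=lambda k: p[k])
        # Keep the example only if the model is confident enough.
        if p[label] >= threshold:
            reliable.append((i, label))
    return reliable


# Toy class probabilities for four test documents (made up for illustration).
probs = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92], [0.40, 0.60]]
reliable = select_reliable(probs, threshold=0.9)
```

The selected pairs could then serve as pseudo-labeled training data for a new model on the target domain.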

    BayCis: A Bayesian Hierarchical HMM for Cis-regulatory Module Decoding

    No full text
    Abstract. The transcriptional regulatory sequences in metazoan genomes often consist of multiple cis-regulatory modules (CRMs). Each CRM contains locally enriched occurrences of binding sites (motifs) for a certain array of regulatory proteins, capable of integrating, amplifying or attenuating multiple regulatory signals via combinatorial interaction with these proteins. The architecture of CRM organizations is reminiscent of the grammatical rules underlying a natural language, and presents a particular challenge to computational motif and CRM identification in metazoan genomes. In this paper, we present BayCis, a Bayesian hierarchical HMM that attempts to capture the stochastic syntactic rules of CRM organization. Under the BayCis model, all candidate sites are evaluated based on a posterior probability measure that takes into consideration their similarity to known BSs, their contrasts against local genomic context, their first-order dependencies on upstream sequence elements, as well as priors reflecting general knowledge of CRM structure. We compare our approach to five existing methods for the discovery of CRMs, and demonstrate competitive or superior prediction results evaluated against experimentally based annotations on a comprehensive selection of Drosophila regulatory regions. The software, database and Supplementary Materials will be availabl

    Feature Selection for Unsupervised Domain Adaptation using Optimal Transport

    No full text
    International audience. In this paper, we propose a new feature selection method for unsupervised domain adaptation based on the emerging optimal transportation theory. We build upon a recent theoretical analysis of optimal transport in domain adaptation and show that it can directly suggest a feature selection procedure leveraging the shift between the domains. Based on this, we propose a novel algorithm that aims to sort features by their similarity across the source and target domains, where the order is obtained by analyzing the coupling matrix representing the solution of the proposed optimal transportation problem. We evaluate our method on a well-known benchmark data set and illustrate its capability of selecting correlated features, leading to better classification performance. Furthermore, we show that the proposed algorithm can be used as a pre-processing step for existing domain adaptation techniques, ensuring an important speed-up in computational time while maintaining comparable results. Finally, we validate our algorithm on clinical imaging databases for a computer-aided diagnosis task, with promising results.

    Domain Adaptation Transfer Learning by Kernel Representation Adaptation

    No full text
    International audience. Domain adaptation, where no labeled target data is available, is a challenging task. To solve this problem, we first propose a new SVM-based approach with a supplementary Maximum Mean Discrepancy (MMD)-like constraint. With this heuristic, source and target data are projected onto a common subspace of a Reproducing Kernel Hilbert Space (RKHS) where both data distributions are expected to become similar. Therefore, a classifier trained on source data might perform well on target data if the conditional probabilities of labels are similar for source and target data, which is the main assumption of this paper. We demonstrate that adding this constraint does not change the quadratic nature of the optimization problem, so we can use common quadratic optimization tools. Secondly, using the same idea that rendering source and target data similar might ensure efficient transfer learning, and with the same assumption, a Kernel Principal Component Analysis (KPCA)-based transfer learning method is proposed. Unlike the first heuristic, this second method ensures that other higher-order moments are aligned in the RKHS, which leads to better performance. Here again, we select MMD as the similarity measure. Then, a linear transformation is also applied to further improve the alignment between source and target data. We finally compare both methods with other transfer learning methods from the literature to show their efficiency on synthetic and real datasets.
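For readers unfamiliar with MMD, the sketch below computes the standard biased empirical estimate of the squared MMD between two samples. It is not the paper's "MMD-like" SVM constraint or its KPCA method; the RBF kernel, the `gamma` value, and the toy points are assumptions chosen for illustration.

```python
import math
from typing import Sequence

Point = Sequence[float]


def rbf_kernel(x: Point, y: Point, gamma: float = 1.0) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(X: Sequence[Point], Y: Sequence[Point], gamma: float = 1.0) -> float:
    """Biased empirical estimate of squared MMD between samples X and Y.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(A: Sequence[Point], B: Sequence[Point]) -> float:
        total = sum(rbf_kernel(a, b, gamma) for a in A for b in B)
        return total / (len(A) * len(B))

    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)


# Toy source and target samples (made up): the target is a shifted copy.
source = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1)]
target = [(2.0, 2.0), (2.1, 1.9), (1.8, 2.1)]

same = mmd_squared(source, source)      # identical samples
shifted = mmd_squared(source, target)   # samples from shifted domains
```

A value near zero suggests the two samples look alike under the chosen kernel, while a larger value indicates a shift between source and target.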