
    Revisiting Self-Training with Regularized Pseudo-Labeling for Tabular Data

    Recent progress in semi- and self-supervised learning has challenged the long-held belief that machine learning needs an enormous amount of labeled data and that unlabeled data are irrelevant. Although these methods have succeeded on many kinds of data, there is no dominant semi- or self-supervised learning method that generalizes to tabular data; most existing methods require specific tabular datasets and architectures. In this paper, we revisit self-training, which can be applied to any kind of algorithm, including the most widely used tabular architecture, the gradient boosting decision tree, and we introduce curriculum pseudo-labeling, a state-of-the-art pseudo-labeling technique from the image domain, to the tabular domain. Furthermore, existing pseudo-labeling techniques do not ensure the cluster assumption when computing confidence scores for pseudo-labels generated from unlabeled data. To overcome this issue, we propose a novel pseudo-labeling approach that regularizes the confidence scores based on the likelihoods of the pseudo-labels, so that more reliable pseudo-labels lying in high-density regions can be obtained. We exhaustively validate the superiority of our approaches using various models and tabular datasets. Comment: 10 pages for the main part and 8 extra pages for the appendix; 2 figures and 3 tables for the main part.
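
    As a rough illustration of the idea described above, the following Python sketch weights each pseudo-label's confidence by a class-conditional density estimate and accepts a growing fraction of the highest-scoring samples per round. The KDE density model, bandwidth, and acceptance schedule are assumptions for illustration, not the paper's exact recipe.

        # Minimal sketch: likelihood-regularized pseudo-labeling with a curriculum schedule.
        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.neighbors import KernelDensity

        def self_train(X_lab, y_lab, X_unlab, rounds=3):
            """X_lab, y_lab, X_unlab: NumPy arrays of labeled/unlabeled tabular data."""
            X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
            for r in range(rounds):
                clf = GradientBoostingClassifier().fit(X, y)
                proba = clf.predict_proba(pool)
                pred = clf.classes_[proba.argmax(axis=1)]
                conf = proba.max(axis=1)

                # Regularize confidence with a class-conditional density estimate so
                # that pseudo-labels in low-density regions are down-weighted.
                density = np.zeros(len(pool))
                for c in np.unique(y):
                    kde = KernelDensity(bandwidth=1.0).fit(X[y == c])
                    mask = pred == c
                    if mask.any():
                        density[mask] = np.exp(kde.score_samples(pool[mask]))
                score = conf * (density / (density.max() + 1e-12))

                # Curriculum: accept a growing fraction of the highest-scoring samples.
                k = int(len(pool) * (r + 1) / (rounds + 1))
                keep = np.argsort(score)[::-1][:k]
                X = np.vstack([X, pool[keep]])
                y = np.concatenate([y, pred[keep]])
                pool = np.delete(pool, keep, axis=0)
            return GradientBoostingClassifier().fit(X, y)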

    CAST: Cluster-Aware Self-Training for Tabular Data

    Self-training has gained traction because of its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels. Several studies have proposed successful approaches to tackle this issue, but they diminish the advantages of self-training because they require specific modifications to self-training algorithms or model architectures. Furthermore, most of them are incompatible with gradient boosting decision trees, which dominate the tabular domain. To address this, we revisit the cluster assumption, which states that data samples close to each other tend to belong to the same class. Inspired by this assumption, we propose Cluster-Aware Self-Training (CAST) for tabular data. CAST is a simple and universally adaptable approach for enhancing existing self-training algorithms without significant modifications. Concretely, our method regularizes the confidence of the classifier, which represents the value of a pseudo-label, forcing pseudo-labels in low-density regions to have lower confidence by leveraging prior knowledge of each class within the training data. Extensive empirical evaluations on up to 20 real-world datasets confirm not only the superior performance of CAST but also its robustness across various self-training setups. Comment: 17 pages with appendix.
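
    A minimal sketch of the confidence-regularization step, assuming per-class Gaussian densities stand in for the "prior knowledge" about each class; CAST's actual regularizer may differ.

        # Sketch: cluster-aware confidence regularization via class-conditional densities.
        import numpy as np
        from scipy.stats import multivariate_normal

        def cluster_aware_confidence(proba, X_unlab, X_lab, y_lab):
            """Scale raw classifier confidences (columns ordered as np.unique(y_lab))
            by class-conditional densities, so that candidate pseudo-labels lying in
            low-density regions receive lower scores."""
            classes = np.unique(y_lab)
            dens = np.zeros((len(X_unlab), len(classes)))
            for i, c in enumerate(classes):
                Xc = X_lab[y_lab == c]
                cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(Xc.shape[1])
                dens[:, i] = multivariate_normal(Xc.mean(axis=0), cov).pdf(X_unlab)
            dens /= dens.sum(axis=1, keepdims=True) + 1e-12
            return proba * dens  # regularized confidences; threshold as usual afterwards

    The returned scores can be thresholded exactly as in plain self-training, so neither the self-training loop nor the underlying model (for example, a gradient boosting decision tree) needs to be modified.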

    Revealing mammalian evolutionary relationships by comparative analysis of gene clusters

    Many software tools for comparative analysis of genomic sequence data have been released in recent decades. Despite this, it remains challenging to determine evolutionary relationships in gene clusters due to their complex histories involving duplications, deletions, inversions, and conversions. One concept describing these relationships is orthology. Orthologs derive from a common ancestor by speciation, in contrast to paralogs, which derive from duplication. Discriminating orthologs from paralogs is a necessary step in most multispecies sequence analyses, but doing so accurately is impeded by the occurrence of gene conversion events. We propose a refined method of orthology assignment based on two paradigms for interpreting its definition: by genomic context or by sequence content. X-orthology (based on context) traces orthology resulting from speciation and duplication only, while N-orthology (based on content) includes the influence of conversion events.

    Correction: AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae

    The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced, at high quality with advanced sequencing technologies, the genomes of 25 strains that are commonly used in the yeast research community. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.

    Conversion events in gene clusters

    Background: Gene clusters containing multiple similar genomic regions in close proximity are of great interest for biomedical studies because of their associations with inherited diseases. However, such regions are difficult to analyze due to their structural complexity and their complicated evolutionary histories, reflecting a variety of large-scale mutational events. In particular, conversion events can mislead inferences about the relationships among these regions, as traced by traditional methods such as construction of phylogenetic trees or multi-species alignments. Results: To correct the distorted information generated by such methods, we have developed an automated pipeline called CHAP (Cluster History Analysis Package) for detecting conversion events. We used this pipeline to analyze the conversion events that affected two well-studied gene clusters (α-globin and β-globin) and three gene clusters for which comparative sequence data were generated from seven primate species: CCL (chemokine ligand), IFN (interferon), and CYP2abf (part of cytochrome P450 family 2). CHAP is freely available at http://www.bx.psu.edu/miller_lab. Conclusions: These studies reveal the value of characterizing conversion events in the context of studying gene clusters in complex genomes.

    Evaluation of methods for detecting conversion events in gene clusters

    Background: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of a cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies, owing to a lack of accurate evaluation methods. Results: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used these simulated data to evaluate several different programs for detecting gene conversion events. Conclusions: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution.
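
    A toy sketch of such a simulation, assuming equal-length gene sequences and arbitrary placeholder event rates; the published simulator models these events in far more detail.

        # Toy sketch: evolve a cluster of equal-length gene sequences through tandem
        # duplications, deletions, gene conversions, and point mutations.
        import random

        BASES = "ACGT"

        def mutate(seq, rate=0.01):
            return "".join(random.choice(BASES) if random.random() < rate else b for b in seq)

        def evolve(cluster, generations=100):
            for _ in range(generations):
                cluster = [mutate(g) for g in cluster]                  # small mutations
                r = random.random()
                if r < 0.05 and cluster:                                # tandem duplication
                    i = random.randrange(len(cluster))
                    cluster.insert(i, cluster[i])
                elif r < 0.08 and len(cluster) > 1:                     # deletion
                    del cluster[random.randrange(len(cluster))]
                elif r < 0.12 and len(cluster) > 1:                     # gene conversion
                    donor, acceptor = random.sample(range(len(cluster)), 2)
                    g = cluster[acceptor]
                    start = random.randrange(len(g) - 1)
                    end = random.randrange(start + 1, len(g))
                    cluster[acceptor] = g[:start] + cluster[donor][start:end] + g[end:]
            return cluster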

    FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model

    Motivation: Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Moreover, long-read sequencing has become dominant in genomics, whereas most existing compression methods focus only on short-read sequencing. Results: We designed a compression algorithm based on read reordering using a novel scoring model, reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provide it as a Docker image for easy installation and execution. We compared our method with existing major FASTQ compression tools on benchmark datasets, including new long-read sequencing data in the validation. FastqCLS outperformed the other tools in compression ratio for long-read sequencing data. Availability and implementation: FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. Supplementary information: Supplementary data are available at Bioinformatics online.
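
    To make the reordering idea concrete, here is a toy sketch that orders reads by a minimizer-like k-mer key so similar reads sit next to each other before general-purpose compression. The key function and gzip backend are placeholders rather than FastqCLS's actual scoring model, and headers, quality strings, and the original read order (which a lossless tool must also store) are omitted.

        # Toy sketch: read reordering before compression.
        import gzip

        def kmer_key(seq, k=8):
            """Order reads by their lexicographically smallest k-mer."""
            if len(seq) < k:
                return seq
            return min(seq[i:i + k] for i in range(len(seq) - k + 1))

        def reorder_and_compress(reads, out_path):
            ordered = sorted(reads, key=kmer_key)      # similar reads become neighbours
            with gzip.open(out_path, "wb") as fh:
                fh.write("\n".join(ordered).encode())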

    Learning a refinement model for variant analysis in non-human primate genomes

    Background: Accurate variant calling is essential for genomic studies but is highly dependent on sequence alignment (SA) quality. In non-human primates, the lack of well-curated variant resources limits alignment postprocessing, leading to suboptimal SA and increased miscalls. DeepVariant, a leading variant caller, demonstrates high accuracy on human genomes but degrades under suboptimal SA conditions. Results: To address this, we developed a decision-tree-based refinement model that integrates alignment quality metrics and DeepVariant confidence scores to filter miscalls effectively. We defined suboptimal SA and optimal SA based on the presence or absence of postprocessing steps and confirmed that suboptimal SA significantly increases miscalls in both human and rhesus macaque genomes. Applying the refinement model to human suboptimal SA reduced the miscalling ratio (MR) by 52.54%, demonstrating its effectiveness. When applied to rhesus macaque genomes, the model achieved a 76.20% MR reduction, showing its potential for non-human primate studies. Alternative base ratio (ABR) analysis further revealed that the model refines homozygous SNVs more effectively than heterozygous SNVs, improving the reliability of variant classification. Conclusions: Our refinement model significantly improves variant calling under suboptimal SA conditions, which is particularly beneficial for non-human primate studies where alignment postprocessing is often limited. We packaged our model into the Genome Variant Refinement Pipeline (GVRP), making it available to researchers working on population genetics and molecular evolution. This work establishes a framework for enhancing variant calling accuracy in species with limited genomic resources.
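
    A minimal sketch of a decision-tree miscall filter in this spirit; the feature names and labeling convention are assumptions for illustration, not GVRP's actual feature set or training procedure.

        # Sketch: filter variant calls with a decision tree over per-variant features.
        from sklearn.tree import DecisionTreeClassifier

        FEATURES = ["mapping_quality", "read_depth", "allele_balance", "deepvariant_gq"]

        def train_refiner(X_train, y_train):
            """X_train: per-variant feature matrix (columns as in FEATURES);
            y_train: 1 for true calls, 0 for miscalls, from a curated truth set."""
            return DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

        def refine_calls(model, X_calls, variant_records):
            keep = model.predict(X_calls).astype(bool)   # drop predicted miscalls
            return [rec for rec, k in zip(variant_records, keep) if k]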

    DNN-Boost: Somatic mutation identification of tumor-only whole-exome sequencing data using deep neural network and XGBoost

    Detection of somatic mutations in whole-exome sequencing data can help elucidate the mechanisms of tumor progression. Most computational approaches require exome sequencing of both tumor and normal samples. However, it is more common to sequence exomes for tumor samples only, without paired normal samples. To include these data in extensive studies of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations from tumor-only exome sequencing data. In this study, we designed a machine learning approach using a deep neural network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data, and we integrated it into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract features from the results of variant callers, and these features are then fed into the DNN model as input. The XGBoost step also mitigates issues of missing values and overfitting. We evaluated our proposed model and compared its performance with existing benchmark methods. DNN-Boost outperformed the benchmark methods in classifying somatic mutations from both paired tumor-normal exome data and tumor-only exome data.
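
    A rough sketch of the two-stage idea, interpreting "XGBoost extracts the features" as leaf-index encoding fed to a neural network classifier; DNN-Boost's exact wiring and architecture may differ.

        # Sketch: gradient-boosted trees derive features that a neural network classifies.
        from xgboost import XGBClassifier
        from sklearn.preprocessing import OneHotEncoder
        from sklearn.neural_network import MLPClassifier

        def fit_two_stage(X_train, y_train):
            xgb = XGBClassifier(n_estimators=100, max_depth=4).fit(X_train, y_train)
            leaves = xgb.apply(X_train)                      # per-tree leaf indices
            enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
            dnn = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300)
            dnn.fit(enc.transform(leaves), y_train)
            return xgb, enc, dnn

        def predict_two_stage(models, X):
            xgb, enc, dnn = models
            return dnn.predict(enc.transform(xgb.apply(X)))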