Search CORE

6 research outputs found

A benchmark of categorical encoders for binary classification

Author: Arzamasov Vadim
Boehm Klemens
Matteucci Federico
Publication venue
Publication date: 20/11/2023
Field of study

Categorical encoders transform categorical features into numerical representations that are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because of their limited choice of (1) encoders, (2) experimental factors, and (3) datasets. Additionally, inconsistencies arise from the adoption of varying aggregation strategies. This paper is the most comprehensive benchmark of categorical encoders to date, including an extensive evaluation of 32 configurations of encoders from diverse families, with 36 combinations of experimental factors, and on 50 datasets. The study shows the profound influence of dataset selection, experimental factors, and aggregation strategies on the benchmark's conclusions -- aspects disregarded in previous encoder benchmarks.Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmark

arXiv.org e-Print Archive

Towards Foundation Models for Learning on Tabular Data

Author: Bian Jiang
Wen Xumeng
Xu Wei
Zhang Han
Zheng Shun
Publication venue
Publication date: 22/10/2023
Field of study

Learning on tabular data underpins numerous real-world applications. Despite considerable efforts in developing effective learning models for tabular data, current transferable tabular models remain in their infancy, limited by either the lack of support for direct instruction following in new tasks or the neglect of acquiring foundational knowledge and capabilities from diverse tabular datasets. In this paper, we propose Tabular Foundation Models (TabFMs) to overcome these limitations. TabFMs harness the potential of generative tabular learning, employing a pre-trained large language model (LLM) as the base model and fine-tuning it using purpose-designed objectives on an extensive range of tabular datasets. This approach endows TabFMs with a profound understanding and universal capabilities essential for learning on tabular data. Our evaluations underscore TabFM's effectiveness: not only does it significantly excel in instruction-following tasks like zero-shot and in-context inference, but it also showcases performance that approaches, and in instances, even transcends, the renowned yet mysterious closed-source LLMs like GPT-4. Furthermore, when fine-tuning with scarce data, our model achieves remarkable efficiency and maintains competitive performance with abundant training data. Finally, while our results are promising, we also delve into TabFM's limitations and potential opportunities, aiming to stimulate and expedite future research on developing more potent TabFMs

arXiv.org e-Print Archive

Predictability of Machine Learning Algorithms and Related Feature Extraction Techniques

Author: Dong Yunbo
Publication venue
Publication date: 15/11/2021
Field of study

Pure OAI Repository

Algorithms for the description of molecular sequences

Author: Vis J.K.
Publication venue
Publication date: 21/12/2016
Field of study

Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for the extraction of HGVS descriptions from two DNA sequences. Our algorithm is able to compute the HGVS~descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time and its resulting descriptions are relatively small. Additional applications include updating of gene variant database contents and reference sequence liftovers. Next, we adapted our method for the extraction of descriptions for protein sequences in particular for describing frame shifted variants. We propose an addition to the HGVS nomenclature for accommodating the (complex) frame shifted variants that can be described with our method. Finally, we applied our method to generate descriptions for Short Tandem Repeats (STRs), a form of self-similarity. We propose an alternative repeat variant that can be added to the existing HGVS nomenclature. The final chapter takes an explorative approach to classification in large cohort studies. We provide a ``cross-sectional'' investigation on this data to see the relative power of the different groups. Algorithms and the Foundations of Software technolog

Leiden University Scholary Publications