6 research outputs found
A benchmark of categorical encoders for binary classification
Categorical encoders transform categorical features into numerical
representations that are indispensable for a wide range of machine learning
models. Existing encoder benchmark studies lack generalizability because of
their limited choice of (1) encoders, (2) experimental factors, and (3)
datasets. Additionally, inconsistencies arise from the adoption of varying
aggregation strategies. This paper is the most comprehensive benchmark of
categorical encoders to date, including an extensive evaluation of 32
configurations of encoders from diverse families, with 36 combinations of
experimental factors, and on 50 datasets. The study shows the profound
influence of dataset selection, experimental factors, and aggregation
strategies on the benchmark's conclusions -- aspects disregarded in previous
encoder benchmarks.Comment: To be published in the 37th Conference on Neural Information
Processing Systems (NeurIPS 2023) Track on Datasets and Benchmark
Towards Foundation Models for Learning on Tabular Data
Learning on tabular data underpins numerous real-world applications. Despite
considerable efforts in developing effective learning models for tabular data,
current transferable tabular models remain in their infancy, limited by either
the lack of support for direct instruction following in new tasks or the
neglect of acquiring foundational knowledge and capabilities from diverse
tabular datasets. In this paper, we propose Tabular Foundation Models (TabFMs)
to overcome these limitations. TabFMs harness the potential of generative
tabular learning, employing a pre-trained large language model (LLM) as the
base model and fine-tuning it using purpose-designed objectives on an extensive
range of tabular datasets. This approach endows TabFMs with a profound
understanding and universal capabilities essential for learning on tabular
data. Our evaluations underscore TabFM's effectiveness: not only does it
significantly excel in instruction-following tasks like zero-shot and
in-context inference, but it also showcases performance that approaches, and in
instances, even transcends, the renowned yet mysterious closed-source LLMs like
GPT-4. Furthermore, when fine-tuning with scarce data, our model achieves
remarkable efficiency and maintains competitive performance with abundant
training data. Finally, while our results are promising, we also delve into
TabFM's limitations and potential opportunities, aiming to stimulate and
expedite future research on developing more potent TabFMs
Algorithms for the description of molecular sequences
Unambiguous sequence variant descriptions are important in reporting
the outcome of clinical diagnostic DNA tests. The standard
nomenclature of the Human Genome Variation Society (HGVS) describes
the observed variant sequence relative to a given reference sequence.
We propose an efficient algorithm for the extraction of
HGVS descriptions from two DNA sequences.
Our algorithm is able to compute the HGVS~descriptions of complete
chromosomes or other large DNA strings in a reasonable amount of
computation time and its resulting descriptions are relatively small.
Additional applications include updating of gene variant database
contents and reference sequence liftovers.
Next, we adapted our method for the extraction of descriptions for protein sequences in particular for describing frame shifted variants. We propose an addition to the HGVS nomenclature for accommodating the (complex)
frame shifted variants that can be described with our method.
Finally, we applied our method to generate descriptions for Short Tandem Repeats (STRs), a form of self-similarity. We propose an alternative repeat variant that can be added to the existing
HGVS nomenclature.
The final chapter takes an explorative approach to classification in large cohort studies. We provide a ``cross-sectional'' investigation on this data to see the relative power of the different groups.
 Algorithms and the Foundations of Software technolog