GECKO: Generative Language Model for English, Code and Korean
We introduce GECKO, a bilingual large language model (LLM) optimized for
Korean and English, along with programming languages. GECKO is pretrained on
a balanced, high-quality corpus of Korean and English using the LLaMA
architecture. In this report, we share our experience from several efforts to
build a better data pipeline for the corpus and to train our model. GECKO shows
great efficiency in token generation for both Korean and English, despite its
small vocabulary size. We measure the performance on representative
benchmarks in terms of Korean, English and Code, and it exhibits great
performance on KMMLU (Korean MMLU) and modest performance in English and Code,
even with its smaller number of trained tokens compared to English-focused
LLMs. GECKO is available to the open-source community under a permissive
license. We hope our work offers a research baseline and practical insights for
Korean LLM research. The model can be found at:
https://huggingface.co/kifai/GECKO-7
Robust High-Dimensional Time-Varying Coefficient Estimation
In this paper, we develop a novel high-dimensional coefficient estimation
procedure based on high-frequency data. Unlike the usual high-dimensional
regression procedures such as LASSO, we additionally handle the heavy-tailedness
of high-frequency observations as well as time variations of coefficient
processes. Specifically, we employ the Huber loss and a truncation scheme to handle
heavy-tailed observations, while $\ell_1$-regularization is adopted to
overcome the curse of dimensionality. To account for the time-varying
coefficient, we estimate local coefficients which are biased due to the
$\ell_1$-regularization. Thus, when estimating integrated coefficients, we
propose a debiasing scheme to enjoy the law of large numbers property and employ
a thresholding scheme to further accommodate the sparsity of the coefficients.
We call this Robust thrEsholding Debiased LASSO (RED-LASSO) estimator. We show
that the RED-LASSO estimator can achieve a near-optimal convergence rate. In
the empirical study, we apply the RED-LASSO procedure to the high-dimensional
integrated coefficient estimation using high-frequency trading data.
Comment: 55 pages, 5 figures
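The building blocks named in this abstract (Huber loss, truncation, an $\ell_1$ penalty, and thresholding) can be illustrated with a minimal sketch. This is not the RED-LASSO estimator itself: the function names and the tuning parameters `delta`, `bound`, `lam`, and `tau` are assumptions made for illustration, and the debiasing step is omitted because the abstract does not specify it.

```python
def huber_loss(residual, delta):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def truncate(x, bound):
    """Clip an observation to [-bound, bound] to tame heavy tails."""
    return max(-bound, min(bound, x))


def l1_penalty(coefs, lam):
    """LASSO-style penalty: lam times the sum of absolute coefficients."""
    return lam * sum(abs(c) for c in coefs)


def hard_threshold(coefs, tau):
    """Set every coefficient with |c| <= tau to zero (hypothetical rule)."""
    return [c if abs(c) > tau else 0.0 for c in coefs]


# Illustrative values only.
small = huber_loss(0.5, delta=1.0)   # quadratic regime
large = huber_loss(3.0, delta=1.0)   # linear regime
clipped = truncate(10.0, bound=2.0)
sparse = hard_threshold([0.8, -0.05, 0.0, -1.2], tau=0.1)
```

The Huber loss grows linearly for large residuals, which is what limits the influence of heavy-tailed observations, while the thresholding step removes small estimated coefficients to reflect sparsity.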
Nonconvex High-Dimensional Time-Varying Coefficient Estimation for Noisy High-Frequency Observations
In this paper, we propose a novel high-dimensional time-varying coefficient
estimator for noisy high-frequency observations. In high-frequency finance, we
often observe that noise dominates the signal of the underlying true process.
Thus, we cannot apply usual regression procedures to analyze noisy
high-frequency observations. To handle this issue, we first employ a smoothing
method for the observed variables. However, the smoothed variables still
contain non-negligible noises. To manage these non-negligible noises and the
high dimensionality, we propose a nonconvex penalized regression method for
each local coefficient. This method produces consistent but biased local
coefficient estimators. To estimate the integrated coefficients, we propose a
debiasing scheme and obtain a debiased integrated coefficient estimator using
debiased local coefficient estimators. Then, to further account for the
sparsity structure of the coefficients, we apply a thresholding scheme to the
debiased integrated coefficient estimator. We call this scheme the Thresholded
dEbiased Nonconvex LASSO (TEN-LASSO) estimator. Furthermore, this paper
establishes the concentration properties of the TEN-LASSO estimator and
discusses a nonconvex optimization algorithm.
Comment: 54 pages, 5 figures
Note on Hamiltonicity of basis graphs of even delta-matroids
We show that the basis graph of an even delta-matroid is Hamiltonian if it
has more than two vertices. More strongly, we prove that for two distinct edges
$e$ and $f$ sharing a common end, it has a Hamiltonian cycle using $e$ and
avoiding $f$ unless it has at most two vertices or it is a cycle of length at
most four. We also prove that if the basis graph is not a hypercube graph, then
each vertex belongs to cycles of every length $\ell \ge 3$, and each edge
belongs to cycles of every length $\ell \ge 4$. For the last theorem, we
provide two proofs, one of which uses the result of Naddef (1984) on polytopes
and the result of Chepoi (2007) on basis graphs of even delta-matroids, and the
other is a direct proof using various properties of even delta-matroids. Our
theorems generalize the analogous results for matroids by Holzmann and Harary
(1972) and Bondy and Ingleton (1976).
Comment: 10 pages, 2 figures. Corrected a typ
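The Hamiltonicity claim can be checked by brute force on a small example. The sketch below assumes the usual definition of the basis graph (vertices are the feasible sets, and two are adjacent when their symmetric difference has exactly two elements) and uses the even-sized subsets of a 4-element ground set as an illustrative even delta-matroid. The helper names are hypothetical, and the backtracking search is a plain exhaustive check, not the proof technique of the paper.

```python
from itertools import combinations


def even_subsets(n):
    """All even-sized subsets of {0, ..., n-1}, used as the feasible sets."""
    return [frozenset(c)
            for k in range(0, n + 1, 2)
            for c in combinations(range(n), k)]


def basis_graph(feasible):
    """Join two feasible sets when their symmetric difference has size 2."""
    adj = {b: set() for b in feasible}
    for a, b in combinations(feasible, 2):
        if len(a ^ b) == 2:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def hamiltonian_cycle(adj):
    """Return a Hamiltonian cycle as a vertex list, or None (backtracking)."""
    vertices = list(adj)
    if len(vertices) < 3:
        return None
    start = vertices[0]
    path, visited = [start], {start}

    def extend():
        if len(path) == len(vertices):
            # The cycle closes only if the last vertex is adjacent to the start.
            return start in adj[path[-1]]
        for nxt in adj[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                if extend():
                    return True
                visited.remove(nxt)
                path.pop()
        return False

    return list(path) if extend() else None


feasible = even_subsets(4)
adj = basis_graph(feasible)
cycle = hamiltonian_cycle(adj)
```

In this example the graph has eight vertices, and each feasible set is adjacent to every other one except its complement, so the search finds a cycle quickly. Exhaustive search is only practical for very small ground sets.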
Large Global Volatility Matrix Analysis Based on Structural Information
In this paper, we develop a novel large volatility matrix estimation
procedure for analyzing global financial markets. Practitioners often use
lower-frequency data, such as weekly or monthly returns, to address the issue
of different trading hours in the international financial market. However, this
approach can lead to inefficiency due to information loss. To mitigate this
problem, our proposed method, called Structured Principal Orthogonal complEment
Thresholding (Structured-POET), incorporates structural information for both
global and national factor models. We establish the asymptotic properties of
the Structured-POET estimator, and also demonstrate the drawbacks of
conventional covariance matrix estimation procedures when using lower-frequency
data. Finally, we apply the Structured-POET estimator to an out-of-sample
portfolio allocation study using international stock market data.