Duncode Characters Shorter
This paper investigates the use of various encoders for text
transformation, that is, converting characters into bytes. It discusses local encoders
such as ASCII and GB-2312, which encode specific characters into shorter bytes,
and universal encoders like UTF-8 and UTF-16, which can encode the complete
Unicode set with greater space requirements and are gaining widespread
acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders,
however, lack self-synchronizing capabilities. Duncode is introduced as an
innovative encoding method that aims to encode the entire Unicode character set
with high space efficiency, akin to local encoders. It has the potential to
compress multiple characters of a string into a Duncode unit using fewer bytes.
Despite offering less self-synchronizing identification information, Duncode
surpasses UTF-8 in terms of space efficiency. The application is available at
\url{https://github.com/laohur/duncode}. Additionally, we have developed a
benchmark for evaluating character encoders across different languages. It
encompasses 179 languages and can be accessed at
\url{https://github.com/laohur/wiki2txt}.
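To make the space-efficiency comparison concrete, here is a minimal Python sketch (not from the paper; it uses only standard-library codecs and illustrative sample strings, and Duncode itself is not a built-in codec, so it is omitted) that prints the byte lengths of the same text under local and universal encoders:

```python
# Byte lengths of the same text under several standard encoders, illustrating
# the trade-off between local encoders (short bytes, limited repertoire) and
# universal encoders (full Unicode coverage, more bytes per character).
samples = {
    "English": "Hello",
    "Chinese": "汉字编码",
}

for name, text in samples.items():
    ascii_len = len(text.encode("ascii", errors="ignore"))    # local: Latin characters only
    gb2312_len = len(text.encode("gb2312", errors="ignore"))  # local: ASCII plus simplified Chinese
    utf8_len = len(text.encode("utf-8"))                      # universal: 1-4 bytes per character
    utf16_len = len(text.encode("utf-16-le"))                 # universal: 2 or 4 bytes per character
    print(f"{name}: ascii={ascii_len} gb2312={gb2312_len} utf8={utf8_len} utf16={utf16_len}")
```

For the Chinese sample, GB-2312 uses 2 bytes per character while UTF-8 uses 3; that gap is the kind of overhead a local-style encoding such as Duncode aims to avoid while still covering the full Unicode repertoire.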
Generate to Understand for Representation
In recent years, a significant number of high-quality pretrained models have
emerged, greatly impacting Natural Language Understanding (NLU), Natural
Language Generation (NLG), and Text Representation tasks. Traditionally, these
models are pretrained on custom domain corpora and finetuned for specific
tasks, resulting in high costs related to GPU usage and labor. Unfortunately,
recent trends in language modeling have shifted towards enhancing performance
through scaling, further exacerbating the associated costs.
We introduce GUR, a pretraining framework that combines language modeling and
contrastive learning objectives in a single training step. We select similar
text pairs based on their Longest Common Substring (LCS) from raw unlabeled
documents and train the model using masked language modeling and unsupervised
contrastive learning. The resulting model, GUR, achieves impressive results
without any labeled training data, outperforming all other pretrained baselines
as a retriever on the recall benchmark in a zero-shot setting. Additionally,
GUR maintains its language modeling ability, as demonstrated in our ablation
experiment. Our code is available at \url{https://github.com/laohur/GUR}.
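As a rough illustration of the pair-selection step described above (not the authors' implementation; the threshold, function names, and sample sentences are assumptions), the following sketch scores sentence pairs by their longest common substring and keeps sufficiently overlapping pairs as positives for contrastive learning:

```python
from difflib import SequenceMatcher

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size

def select_pairs(sentences, min_ratio=0.5):
    """Keep pairs whose LCS covers at least min_ratio of the shorter sentence;
    such pairs serve as positives for unsupervised contrastive learning."""
    pairs = []
    for i, a in enumerate(sentences):
        for b in sentences[i + 1:]:
            if lcs_length(a, b) >= min_ratio * min(len(a), len(b)):
                pairs.append((a, b))
    return pairs

docs = [
    "the model is pretrained with masked language modeling",
    "the model is pretrained with a contrastive objective",
    "the benchmark covers one hundred and seventy nine languages",
]
print(select_pairs(docs))  # only the first two sentences overlap enough to form a pair
```

To show how the two objectives can share a single training step, here is a toy sketch of the combined loss (tensor shapes, the temperature, and the use of in-batch negatives are assumptions; a real model would produce these tensors from the selected pairs rather than from random data):

```python
import torch
import torch.nn.functional as F

batch, vocab, dim, tau = 8, 1000, 128, 0.05
# Stand-ins for model outputs on one batch:
mlm_logits = torch.randn(batch, vocab, requires_grad=True)  # predictions at masked positions
labels = torch.randint(0, vocab, (batch,))                  # true token ids at those positions
emb_a = torch.randn(batch, dim, requires_grad=True)         # embeddings of one side of each pair
emb_b = torch.randn(batch, dim, requires_grad=True)         # embeddings of the other side

# Masked language modeling loss.
mlm_loss = F.cross_entropy(mlm_logits, labels)

# In-batch contrastive (InfoNCE) loss: each emb_a should match its own emb_b,
# with the other pairs in the batch acting as negatives.
sim = F.normalize(emb_a, dim=-1) @ F.normalize(emb_b, dim=-1).t() / tau
contrastive_loss = F.cross_entropy(sim, torch.arange(batch))

# One backward pass optimizes both objectives together.
(mlm_loss + contrastive_loss).backward()
```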