17 research outputs found
Derived-Term Automata of Multitape Expressions with Composition
Rational expressions are powerful tools to define automata, but often restricted to single-tape automata. Our goal is to unleash their expressive power for transducers, and more generally, any multitape automaton; for instance
$(a^{+} \mid x + b^{+} \mid y)^{*}$. We generalize the construction of the derived-term automaton by using expansions. This approach generates small automata, and even allows us to support a composition operator.
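As a rough illustration of what a derived-term automaton is, the following minimal sketch builds one for ordinary single-tape expressions using Antimirov-style partial derivatives. It does not implement the expansion-based multitape construction or the composition operator described in the abstract; the tuple encoding and all function names are assumptions made for this example.

```python
# Expressions are nested tuples; this encoding is hypothetical, for illustration only.
ZERO, ONE = ("zero",), ("one",)

def sym(a): return ("sym", a)
def add(e, f): return ("sum", e, f)
def star(e): return ("star", e)

def cat(e, f):
    # Simplify trivial products so that derived terms stay small.
    if e == ZERO or f == ZERO:
        return ZERO
    if e == ONE:
        return f
    if f == ONE:
        return e
    return ("cat", e, f)

def nullable(e):
    """True if the expression accepts the empty word."""
    tag = e[0]
    if tag in ("one", "star"):
        return True
    if tag == "sum":
        return nullable(e[1]) or nullable(e[2])
    if tag == "cat":
        return nullable(e[1]) and nullable(e[2])
    return False  # zero, sym

def partial_derivatives(e, a):
    """Set of derived terms of e with respect to the letter a."""
    tag = e[0]
    if tag == "sym":
        return {ONE} if e[1] == a else set()
    if tag == "sum":
        return partial_derivatives(e[1], a) | partial_derivatives(e[2], a)
    if tag == "cat":
        left = {cat(p, e[2]) for p in partial_derivatives(e[1], a)}
        right = partial_derivatives(e[2], a) if nullable(e[1]) else set()
        return left | right
    if tag == "star":
        return {cat(p, e) for p in partial_derivatives(e[1], a)}
    return set()  # zero, one

def derived_term_automaton(e, alphabet):
    """States are the derived terms reachable from e; e itself is the initial state."""
    states, todo, transitions = {e}, [e], {}
    while todo:
        s = todo.pop()
        for a in alphabet:
            targets = partial_derivatives(s, a)
            transitions[(s, a)] = targets
            for t in targets:
                if t not in states:
                    states.add(t)
                    todo.append(t)
    return states, transitions

def accepts(e, alphabet, word):
    # Simulate the nondeterministic automaton; a state is final if it is nullable.
    _, transitions = derived_term_automaton(e, alphabet)
    current = {e}
    for a in word:
        current = set().union(*(transitions.get((s, a), set()) for s in current))
    return any(nullable(s) for s in current)
```

For example, with `E = cat(star(add(sym("a"), sym("b"))), sym("a"))`, the call `accepts(E, "ab", "ba")` follows the transitions letter by letter and checks whether any reached term is nullable.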
Formalizing BPE Tokenization
In this paper, we formalize practical byte pair encoding tokenization as it
is used in large language models and other NLP systems. In particular, we
formally define and investigate the semantics of the SentencePiece and
HuggingFace tokenizers and how they relate to each other, depending
on how the tokenization rules are constructed. Beyond this, we consider how
tokenization can be performed in an incremental fashion, as well as
left-to-right using an amount of memory constant in the length of the string,
enabling, e.g., the use of a finite-state string-to-string transducer.
Comment: In Proceedings NCMA 2023, arXiv:2309.0733
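To make the notion of rule-based BPE tokenization concrete, here is a minimal sketch that applies an ordered list of merge rules to a string. The function name `bpe_tokenize` and the tie-breaking choice (best-ranked rule first, leftmost position among equals) are assumptions for illustration; the sketch is not claimed to match the exact semantics of either SentencePiece or HuggingFace as formalized in the paper.

```python
def bpe_tokenize(word, merges):
    """Tokenize `word` using merge rules listed from highest to lowest priority.

    Each rule is a pair (left, right) of adjacent tokens to be joined.
    Assumed semantics: repeatedly merge the leftmost occurrence of the
    highest-priority applicable pair until no rule applies.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while True:
        # Collect (rank, position) for every adjacent pair that has a rule.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)  # lowest rank wins, then leftmost position
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


merges = [("a", "b"), ("ab", "c"), ("b", "c")]
print(bpe_tokenize("abcbc", merges))
```

Rescanning the whole token list after every merge keeps the sketch short but is quadratic in the worst case; the incremental and constant-memory left-to-right variants mentioned in the abstract address exactly this kind of cost and are not shown here.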
Computer Aided Verification
This open access two-volume set LNCS 13371 and 13372 constitutes the refereed proceedings of the 34th International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers were organized in the following topical sections: Part I: Invited papers; formal methods for probabilistic programs; formal methods for neural networks; software verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: Probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book.