2 research outputs found
Character-Based Learning of Distributed Word Representations
Word embeddings are a set of widely used techniques in natural language processing. An important problem with word embeddings is their inability to handle words that were absent from the training data (out-of-vocabulary words). In this master's thesis, a character-level neural network is proposed that is able to learn from pretrained embeddings and extend them to arbitrary words. The model is evaluated on Russian-language and medical texts and applied to a named entity recognition task.
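The abstract above describes training a character-level network to reproduce pretrained word vectors so that embeddings can be produced for out-of-vocabulary words from their spelling alone. Below is a minimal PyTorch sketch of that idea; the architecture (a character BiLSTM with a linear projection), names, and hyperparameters are illustrative assumptions, not the thesis's actual model.

```python
# A minimal sketch (not the thesis's exact architecture): a character-level
# encoder regresses onto pretrained word embeddings of in-vocabulary words,
# and can then produce vectors for unseen words from their characters.
import torch
import torch.nn as nn


class CharToEmbedding(nn.Module):
    def __init__(self, n_chars: int, char_dim: int = 32,
                 hidden_dim: int = 128, emb_dim: int = 300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Bidirectional LSTM reads the word character by character.
        self.encoder = nn.LSTM(char_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Project the final states onto the pretrained embedding space.
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, max_word_len) integer character indices
        x = self.char_emb(char_ids)
        _, (h, _) = self.encoder(x)           # h: (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)   # concatenate both directions
        return self.proj(h)                   # (batch, emb_dim)


# Training step: regress onto pretrained vectors (stand-in tensors here).
model = CharToEmbedding(n_chars=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

char_ids = torch.randint(1, 100, (16, 12))   # batch of 16 words, 12 chars each
target_vectors = torch.randn(16, 300)        # stand-in pretrained embeddings
loss = loss_fn(model(char_ids), target_vectors)
loss.backward()
optimizer.step()
```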
Choose a Transformer: Fourier or Galerkin
In this paper, we apply the self-attention from the state-of-the-art
Transformer in Attention Is All You Need for the first time to a data-driven
operator learning problem related to partial differential equations. An effort
is made to explain the heuristics of, and to improve the efficacy of,
the attention mechanism. By employing the operator approximation theory in
Hilbert spaces, it is demonstrated for the first time that the softmax
normalization in the scaled dot-product attention is sufficient but not
necessary. Without softmax, the approximation capacity of a linearized
Transformer variant can be proved to be comparable to a Petrov-Galerkin
projection layer-wise, and the estimate is independent of the
sequence length. A new layer normalization scheme mimicking the Petrov-Galerkin
projection is proposed to allow a scaling to propagate through attention
layers, which helps the model achieve remarkable accuracy in operator learning
tasks with unnormalized data. Finally, we present three operator learning
experiments, including the viscid Burgers' equation, an interface Darcy flow,
and an inverse interface coefficient identification problem. The newly proposed
simple attention-based operator learner, Galerkin Transformer, shows
significant improvements in both training cost and evaluation accuracy over its
softmax-normalized counterparts.
Comment: 35 pages, 13 figures. Published as a conference paper at NeurIPS 2021.
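The abstract describes dropping the softmax, layer-normalizing the keys and values instead, and contracting K^T V first so that the cost is linear in the sequence length. The PyTorch sketch below illustrates such a Galerkin-type attention layer; the single-head layout, module names, and dimensions are simplifying assumptions rather than the paper's exact implementation.

```python
# A minimal sketch of softmax-free ("Galerkin-type") attention: LayerNorm is
# applied to K and V in place of softmax normalization, and K^T V is formed
# first, giving O(n * d^2) cost, linear in the sequence length n.
import torch
import torch.nn as nn


class GalerkinAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Normalization moves from softmax over attention rows
        # to layer normalization of the keys and values.
        self.norm_k = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        n = x.size(1)
        q = self.q(x)
        k = self.norm_k(self.k(x))
        v = self.norm_v(self.v(x))
        # (d_model x d_model) context matrix, averaged over the sequence.
        context = torch.einsum("bnd,bne->bde", k, v) / n
        return torch.einsum("bnd,bde->bne", q, context)


# Example: encode a discretized 1D field sampled on 512 grid points.
attn = GalerkinAttention(d_model=64)
out = attn(torch.randn(2, 512, 64))
print(out.shape)  # torch.Size([2, 512, 64])
```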