2 research outputs found

    Learning distributed word representations from characters

    Word embeddings are a set of widely used techniques in natural language processing. An important problem with word embeddings is their inability to handle out-of-vocabulary words, i.e. words that were absent from the training data. In this master thesis, a character-level neural network is proposed that is able to learn from pretrained embeddings and generalize them to arbitrary words. The model is evaluated on Russian-language and medical texts and applied to a named entity recognition task.
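    The abstract does not spell out the architecture, but one common way to realize such a character-level embedding predictor is a character-level recurrent encoder trained to regress onto pretrained word vectors, so that any character string, including out-of-vocabulary words, can be mapped to an embedding. The sketch below is a hypothetical PyTorch illustration under that assumption; module names and hyperparameters are not taken from the thesis.

    ```python
    # Hypothetical sketch: a character-level network regressing onto pretrained
    # word embeddings, so unseen (OOV) words still receive vectors.
    # Names and hyperparameters are illustrative, not from the thesis.
    import torch
    import torch.nn as nn

    class CharToEmbedding(nn.Module):
        def __init__(self, n_chars: int, char_dim: int = 64,
                     hidden_dim: int = 128, emb_dim: int = 300):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
            self.encoder = nn.LSTM(char_dim, hidden_dim, batch_first=True,
                                   bidirectional=True)
            self.proj = nn.Linear(2 * hidden_dim, emb_dim)

        def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
            # char_ids: (batch, max_word_len) integer character indices
            x = self.char_emb(char_ids)
            _, (h, _) = self.encoder(x)          # h: (2, batch, hidden_dim)
            h = torch.cat([h[0], h[1]], dim=-1)  # concat fwd/bwd final states
            return self.proj(h)                  # (batch, emb_dim)

    # Training step: match the network's output to pretrained vectors (MSE).
    model = CharToEmbedding(n_chars=100)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    char_ids = torch.randint(1, 100, (32, 12))   # batch of 32 words, 12 chars each
    target_vecs = torch.randn(32, 300)           # their pretrained embeddings
    optimizer.zero_grad()
    loss = loss_fn(model(char_ids), target_vecs)
    loss.backward()
    optimizer.step()
    ```

    After training, the same forward pass produces a vector for any word spelled out as characters, which is how the model can be plugged into a downstream task such as named entity recognition.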

    Choose a Transformer: Fourier or Galerkin

    In this paper, we apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need for the first time to a data-driven operator learning problem related to partial differential equations. An effort is put together to explain the heuristics of, and to improve the efficacy of, the attention mechanism. By employing the operator approximation theory in Hilbert spaces, it is demonstrated for the first time that the softmax normalization in the scaled dot-product attention is sufficient but not necessary. Without softmax, the approximation capacity of a linearized Transformer variant can be proved to be comparable to a Petrov-Galerkin projection layer-wise, and the estimate is independent of the sequence length. A new layer normalization scheme mimicking the Petrov-Galerkin projection is proposed to allow a scaling to propagate through attention layers, which helps the model achieve remarkable accuracy in operator learning tasks with unnormalized data. Finally, we present three operator learning experiments, including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem. The newly proposed simple attention-based operator learner, Galerkin Transformer, shows significant improvements in both training cost and evaluation accuracy over its softmax-normalized counterparts.
    Comment: 35 pages, 13 figures. Published as a conference paper at NeurIPS 2021.
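    The abstract describes dropping the softmax and instead applying layer normalization to the keys and values, which yields a linear-complexity, Petrov-Galerkin-style attention. Below is a minimal sketch of that idea, assuming the Galerkin-type form Q (LN(K)^T LN(V)) / n; the class name, shapes, and single-head layout are illustrative and not taken from the paper's released code.

    ```python
    # Minimal sketch of softmax-free, Galerkin-type attention, assuming
    # Attn(Q, K, V) = Q (LayerNorm(K)^T LayerNorm(V)) / n as described above.
    # Names and shapes are illustrative, not the paper's released code.
    import torch
    import torch.nn as nn

    class GalerkinAttention(nn.Module):
        def __init__(self, d_model: int):
            super().__init__()
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            # Layer norm on K and V replaces the softmax normalization.
            self.k_norm = nn.LayerNorm(d_model)
            self.v_norm = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n, d_model), where n is the sequence (grid) length
            n = x.shape[1]
            q = self.q_proj(x)
            k = self.k_norm(self.k_proj(x))
            v = self.v_norm(self.v_proj(x))
            # Form the (d x d) matrix K^T V first: linear in n, no softmax.
            kv = torch.einsum("bnd,bne->bde", k, v) / n
            return torch.einsum("bnd,bde->bne", q, kv)

    # Usage on a batch of input functions sampled on a 512-point grid:
    layer = GalerkinAttention(d_model=64)
    u = torch.randn(8, 512, 64)
    out = layer(u)                 # (8, 512, 64)
    ```

    Contracting K^T V before multiplying by Q is what makes the cost linear rather than quadratic in the grid size, and normalizing K and V (rather than the attention weights) is the scheme the abstract credits with handling unnormalized data.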