
    Natural Language Processing (Almost) from Scratch

    We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
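    The core idea of the abstract above is a single window-based network whose word embeddings are shared across tasks, with only a small task-specific output layer per task. The sketch below illustrates that structure; all vocabularies, dimensions, weights, and label sets are hypothetical stand-ins, not the paper's actual architecture.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical toy vocabulary and dimensions for illustration only.
    VOCAB = ["<pad>", "the", "cat", "sat", "on", "mat"]
    WORD_TO_ID = {w: i for i, w in enumerate(VOCAB)}
    EMB_DIM, HIDDEN, WINDOW = 4, 8, 3

    # Embeddings and hidden layer are shared by every task.
    embeddings = rng.standard_normal((len(VOCAB), EMB_DIM))
    w_hidden = rng.standard_normal((HIDDEN, WINDOW * EMB_DIM))

    # Task-specific output layers, e.g. a POS label set vs. a chunking label set.
    heads = {"pos": rng.standard_normal((5, HIDDEN)),
             "chunk": rng.standard_normal((3, HIDDEN))}

    def tag(words, task):
        """Predict one label per word from a fixed window of embeddings around it."""
        ids = [WORD_TO_ID.get(w, 0) for w in words]
        padded = [0] + ids + [0]                    # pad so edge words get a full window
        labels = []
        for i in range(len(words)):
            window = embeddings[padded[i:i + WINDOW]].ravel()
            h = np.tanh(w_hidden @ window)          # shared hidden representation
            labels.append(int((heads[task] @ h).argmax()))
        return labels

    print(tag(["the", "cat", "sat"], "pos"))
    ```

    With random, untrained weights the predicted labels are arbitrary; the point is the shape of the computation, where switching tasks swaps only the output head while the learned representations are reused.
    
    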

    Multilingual Language Processing From Bytes

    We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
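    The [start, length, label] output format described above indexes directly into the byte sequence rather than into word tokens. A minimal sketch of that representation, with a hypothetical example sentence and label names (not from the paper's data):

    ```python
    def text_to_bytes(text: str) -> list:
        """Encode text as a sequence of UTF-8 byte values (vocabulary of 256 symbols)."""
        return list(text.encode("utf-8"))

    def decode_span(byte_seq, start, length):
        """Recover the annotated surface string from a [start, length] byte span."""
        return bytes(byte_seq[start:start + length]).decode("utf-8", errors="replace")

    sentence = "Barack Obama visited Paris."
    byte_seq = text_to_bytes(sentence)

    # Hypothetical model output: (start, length, label) triples over bytes.
    spans = [(0, 12, "PER"), (21, 5, "LOC")]

    for start, length, label in spans:
        print(label, decode_span(byte_seq, start, length))
    # → PER Barack Obama
    # → LOC Paris
    ```

    Because spans are byte offsets, the same scheme works unchanged for any language and any script, which is what lets a single model cover many languages without tokenization.
    
    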

    Cross-language Text Classification with Convolutional Neural Networks From Scratch

    Cross-language classification is an important task in multilingual learning, where documents in different languages often share the same set of categories. The main goal is to reduce the cost of labeling training data for a classification model in each individual language. This article proposes a novel approach using Convolutional Neural Networks for multilingual text classification. It learns representations shared across languages. Moreover, the method also works for a new language that was not seen during training. The results of an empirical study on a large dataset of 21 languages demonstrate the robustness and competitiveness of the presented approach.
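    A "from scratch" convolutional classifier of the kind sketched in the abstract above typically reads raw characters rather than language-specific word features. The following is an illustrative sketch of that pipeline, not the paper's actual architecture; the alphabet, dimensions, and random (untrained) weights are all assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical shared alphabet; a real system might use bytes or a larger set.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
    CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

    def one_hot_encode(text, max_len=64):
        """Map raw characters to a (max_len, |alphabet|) one-hot matrix."""
        x = np.zeros((max_len, len(ALPHABET)))
        for i, c in enumerate(text.lower()[:max_len]):
            if c in CHAR_TO_ID:
                x[i, CHAR_TO_ID[c]] = 1.0
        return x

    def conv1d(x, filters, width):
        """Valid 1D convolution over the character axis."""
        steps = x.shape[0] - width + 1
        out = np.empty((steps, filters.shape[0]))
        for t in range(steps):
            out[t] = filters @ x[t:t + width].ravel()
        return out

    n_filters, width, n_classes = 8, 3, 2
    filters = rng.standard_normal((n_filters, width * len(ALPHABET)))
    w_out = rng.standard_normal((n_classes, n_filters))

    def classify(text):
        """Conv -> ReLU -> max-over-time pooling -> linear class scores."""
        h = np.maximum(conv1d(one_hot_encode(text), filters, width), 0.0)
        pooled = h.max(axis=0)
        return int((w_out @ pooled).argmax())

    print(classify("dies ist ein satz"))
    ```

    Because the input is raw characters, nothing in the network is tied to one language, which is the property that allows applying the trained model to a language absent from the training set.
    
    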

    Lexical descriptions for Vietnamese language processing

    Only very recently have Vietnamese researchers begun to be involved in the domain of Natural Language Processing. As there does not exist any published work in formal linguistics or any recognizable standard for Vietnamese word categories, the fundamental works in Vietnamese text analysis such as part-of-speech tagging, parsing, etc. are very difficult tasks for computer scientists. All necessary linguistic resources have to be built from scratch, and until now almost no resources are shared in public research. The aim of our project is to build a common linguistic database that is freely and easily exploitable for the automatic processing of Vietnamese. In this paper, we propose an extensible set of Vietnamese syntactic descriptions that can be used for tagset definition and corpus annotation. These descriptors are established in such a way as to be a reference set proposal for Vietnamese in the context of ISO subcommittee TC37/SC4 (Language Resource Management).