871 research outputs found

    Byte-based Language Identification with Deep Convolutional Networks

    Full text link
    We report on our system for the shared task on discriminating between similar languages (DSL 2016). The system uses only byte representations in a deep residual network (ResNet). The system, named ResIdent, is trained only on the data released with the task (closed training). We obtain 84.88% accuracy on subtask A, 68.80% accuracy on subtask B1, and 69.80% accuracy on subtask B2. A large difference in accuracy on development data can be observed with relatively minor changes in our network's architecture and hyperparameters. We therefore expect fine-tuning of these parameters to yield higher accuracies.Comment: 7 pages. Adapted reviewer comments. arXiv admin note: text overlap with arXiv:1609.0705

    Phonetic Temporal Neural Model for Language Identification

    Get PDF
    Deep neural models, particularly the LSTM-RNN model, have shown great potential for language identification (LID). However, the use of phonetic information has been largely overlooked by most existing neural LID methods, although this information has been used very successfully in conventional phonetic LID systems. We present a phonetic temporal neural model for LID, which is an LSTM-RNN LID system that accepts phonetic features produced by a phone-discriminative DNN as the input, rather than raw acoustic features. This new model is similar to traditional phonetic LID methods, but the phonetic knowledge here is much richer: it is at the frame level and involves compacted information of all phones. Our experiments conducted on the Babel database and the AP16-OLR database demonstrate that the temporal phonetic neural approach is very effective, and significantly outperforms existing acoustic neural models. It also outperforms the conventional i-vector approach on short utterances and in noisy conditions.Comment: Submitted to TASL

    Experiments in Language Variety Geolocation and Dialect Identification

    Get PDF
    Peer reviewe

    Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

    Get PDF
    We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.Non peer reviewe

    Authorship Authentication for Twitter Messages Using Support Vector Machine

    Get PDF
    With the rapid growth of internet usage, authorship authentication of online messages became challenging research topic in the last decades. In this paper, we used a team of support vector machines to authenticate 5 Twitter authors’ messages. SVM is one of the commonly used and strong classification algorithms in authorship attribution problems. SVM maps the linearly non separable input data to a higher dimensional space by a hyperplane via radial base functions. Firstly using the training data, 10 hyperplanes that separate pair wise five authors training data are built. Then the expertise of these SVMs combined to classify the testing data into five classes. 20 tweets with 16 features from each author were used for evaluation. In spite of the randomly choice of the features, one of the author accuracy around 75% is achieved

    Convolutional Neural Networks over Tree Structures for Programming Language Processing

    Full text link
    Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.Comment: Accepted at AAAI-1
    • …
    corecore