3,974 research outputs found

    Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities

    Breaking domain names such as "openresearch" into the component words "open" and "research" is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
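
    To illustrate how segmentation can be framed as character tagging, here is a minimal sketch. It assumes a two-label scheme ('B' for a character that begins a word, 'I' for one that continues it), which is a common convention in Chinese word segmentation; the abstract does not state which tag set the authors use. The function names are hypothetical, and the RNN that would predict the tags is not shown.

```python
def words_to_tags(words):
    """Encode a gold segmentation as one tag per character.

    Assumed scheme (not confirmed by the abstract):
    'B' = first character of a word, 'I' = any later character.
    """
    return ["B" if i == 0 else "I" for word in words for i in range(len(word))]


def tags_to_words(chars, tags):
    """Rebuild the word list from a character sequence and its tags."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words = []
    for ch, tag in zip(chars, tags):
        # Start a new word on 'B', or when no word has been opened yet.
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


tags = words_to_tags(["open", "research"])
print(tags_to_words("openresearch", tags))  # ['open', 'research']
```

    Under this framing, a tagging model only has to emit one label per input character, and the segmentation follows deterministically from the predicted labels.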

    Proceedings of the Morpho Challenge 2010 Workshop

    In natural language processing, many practical tasks, such as speech recognition, information retrieval and machine translation, depend on a large vocabulary and statistical language models. For morphologically rich languages, such as Finnish and Turkish, constructing a vocabulary and language models with sufficient coverage is particularly difficult because of the huge number of distinct word forms. In Morpho Challenge 2010, unsupervised and semi-supervised algorithms are suggested to provide morpheme analyses for words in different languages and are evaluated in various practical applications. As a research theme, unsupervised morphological analysis has received wide attention in conferences and scientific journals focused on computational linguistics and its applications. These are the proceedings of the Morpho Challenge 2010 Workshop, containing one introductory article that describes the tasks, evaluation and results, and six articles describing the participating unsupervised and supervised learning algorithms. The Morpho Challenge 2010 Workshop was held in Espoo, Finland, on 2-3 September 2010.

    Megaphylogenetic Specimen-Level Approaches to the Carex (Cyperaceae) Phylogeny Using ITS, ETS, and matK Sequences: Implications for Classification

    We present the first large-scale phylogenetic hypothesis for the genus Carex based on 996 of the 1983 accepted species (50.23%). We applied a supermatrix approach to three DNA regions: ETS, ITS and matK. Every concatenated sequence was derived from a single specimen. The topology of our phylogenetic reconstruction largely agreed with previous studies. We also gained new insights into the early divergence structure of the two largest clades, the core Carex and Vignea clades, challenging some previous evolutionary hypotheses about inflorescence structure. Most sections were recovered as non-monophyletic. Homoplasy of characters traditionally selected as relevant for classification, historical misunderstanding of how morphology varies across Carex, and regional rather than global views of Carex diversity seem to be the main reasons for the high levels of polyphyly and paraphyly in the current infrageneric classification.

    Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing

    Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we first use an SVM to separate biological terms from non-biological terms, and then use CRFs to determine the types of the biological terms. This makes full use of the power of the SVM as a binary classifier and the data-labeling capacity of CRFs. We then merge the results of the SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop an algorithm and apply two rules. To ensure that biological terms of maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach also gradually extends the context so that more contextual information can be included. We examined the performance of four approaches with the GENIA corpus and the JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.
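
    For readers unfamiliar with the macro-averaged metrics reported above, here is a minimal sketch of how they are commonly computed from per-type counts. It assumes macro-F1 is the unweighted mean of the per-type F1 scores; the abstract does not say which macro-F1 convention the authors follow, and the entity types and counts below are invented purely for illustration.

```python
def macro_scores(counts):
    """Return (macro-precision, macro-recall, macro-F1).

    counts maps an entity type to a (true_pos, false_pos, false_neg) tuple.
    Assumption: macro-F1 is the unweighted mean of per-type F1 scores.
    """
    if not counts:
        raise ValueError("counts must contain at least one entity type")
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical counts for two entity types (not taken from the paper).
example = {"protein": (8, 2, 2), "DNA": (3, 1, 3)}
macro_p, macro_r, macro_f1 = macro_scores(example)
print(f"P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
```

    Because every type contributes equally to the average, rare entity types weigh as much as frequent ones, which is why macro-averaged scores are sensitive to performance on rare classes.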

    Strings And Colorings Of Topological Coding Towards Asymmetric Topology Cryptography

    For anti-quantum computing, we discuss various number-based strings, such as number-based super-strings, parameterized strings, set-based strings, graph-based strings, integer-partitioned and integer-decomposed strings, and Hanzi-based strings, as well as algebraic operations on number-based strings. Moreover, we introduce number-based string-colorings, magic-constraint colorings, and vector-colorings and set-colorings related to strings. For the technique of encrypting an entire network at once, we propose graphic lattices related to number-based strings, Hanzi-graphic lattices, string groups, and all-tree-graphic lattices. We study some topics in asymmetric topology cryptography, such as topological signatures, Key-pair graphs, Key-pair strings, and one-encryption one-time and self-certification algorithms. Some of the topological techniques and algorithms introduced here are closely related to NP-complete or NP-hard problems. Comment: Asymmetric topology encryption is a new topic of topological coding towards the certificateless public key cryptograph