Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Breaking domain names such as openresearch into component words open and
research is important for applications like Text-to-Speech synthesis and web
search. We link this problem to the classic problem of Chinese word
segmentation and show the effectiveness of a tagging model based on Recurrent
Neural Networks (RNNs) using characters as input. To compensate for the lack of
training data, we propose a pre-training method on concatenated entity names in
a large knowledge database. Pre-training improves the model by 33% and brings
the sequence accuracy to 85%.
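The tagging formulation in this abstract can be illustrated with a minimal sketch. It assumes a two-label B/I scheme (B marks the first character of a word, I any other character), which is a common choice for Chinese word segmentation but is not confirmed by the abstract. The function names are hypothetical, and the pre-training helper assumes that an entity name is split on spaces and its words are joined into one unsegmented string. The RNN itself is omitted; only the data encoding is shown.

```python
def words_to_tags(words):
    """Encode a segmentation as one tag per character: B starts a word, I continues it."""
    tags = []
    for word in words:
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(text, tags):
    """Decode per-character tags back into words (inverse of words_to_tags)."""
    words = []
    for char, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(char)   # start a new word
        else:
            words[-1] += char    # extend the current word
    return words


def make_pretraining_example(entity_names):
    """Build (unsegmented string, gold tags) from entity names.

    Assumption: names are lower-cased and split on whitespace; the real
    pre-processing used in the paper is not described in the abstract.
    """
    words = [w.lower() for name in entity_names for w in name.split()]
    return "".join(words), words_to_tags(words)


text, tags = make_pretraining_example(["Open Research"])
print(text, tags)
print(tags_to_words(text, tags))
```

A tagger trained on such pairs predicts one label per character, and `tags_to_words` turns the predicted labels back into a segmentation.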
Proceedings of the Morpho Challenge 2010 Workshop
In natural language processing, many practical tasks, such as speech recognition, information retrieval, and machine translation, depend on a large vocabulary and statistical language models. For morphologically rich languages, such as Finnish and Turkish, constructing a vocabulary and language models with sufficient coverage is particularly difficult because of the huge number of different word forms. In Morpho Challenge 2010, unsupervised and semi-supervised algorithms are proposed to provide morpheme analyses for words in different languages and are evaluated in various practical applications. As a research theme, unsupervised morphological analysis has received wide attention in conferences and scientific journals focused on computational linguistics and its applications. These are the proceedings of the Morpho Challenge 2010 Workshop, which contain one introductory article describing the tasks, evaluation, and results, and six articles describing the participating unsupervised and supervised learning algorithms. The Morpho Challenge 2010 Workshop was held in Espoo, Finland, on 2-3 September 2010.
Megaphylogenetic Specimen-Level Approaches to the Carex (Cyperaceae) Phylogeny Using ITS, ETS, and matK Sequences: Implications for Classification
We present the first large-scale phylogenetic hypothesis for the genus Carex based on 996 of the 1983 accepted species (50.23%).
We used a supermatrix approach using three DNA regions: ETS, ITS and matK. Every concatenated sequence was derived from a single
specimen. The topology of our phylogenetic reconstruction largely agreed with previous studies. We also gained new insights into the early
divergence structure of the two largest clades, core Carex and Vignea clades, challenging some previous evolutionary hypotheses about
inflorescence structure. Most sections were recovered as non-monophyletic. Homoplasy of characters traditionally selected as relevant for
classification, historical misunderstanding of how morphology varies across Carex, and regional rather than global views of Carex diversity
seem to be the main reasons for the high levels of polyphyly and paraphyly in the current infrageneric classification.
Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing
Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use an SVM to separate biological terms from non-biological terms before using CRFs to determine the types of the biological terms, which makes full use of the power of the SVM as a binary classifier and the data-labeling capacity of CRFs. We then merge the results of the SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop an algorithm and apply two rules. To ensure that biological terms of maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach also gradually extends the context so that more contextual information can be included. We examined the performance of four approaches on the GENIA corpus and the JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% on the GENIA corpus and 84.04% on the JNLPBA04 data.
Strings And Colorings Of Topological Coding Towards Asymmetric Topology Cryptography
For anti-quantum computing, we discuss various number-based strings,
such as number-based super-strings, parameterized strings, set-based strings,
graph-based strings, integer-partitioned and integer-decomposed strings,
Hanzi-based strings, as well as algebraic operations based on number-based
strings. Moreover, we introduce number-based string-colorings, magic-constraint
colorings, and vector-colorings and set-colorings related to strings. For the
technique of encrypting the entire network at once, we propose graphic lattices
related to number-based strings, Hanzi-graphic lattices, string groups, and
all-tree-graphic lattices. We study some topics of asymmetric topology
cryptography, such as topological signatures, Key-pair graphs, Key-pair
strings, and one-encryption one-time and self-certification algorithms. Some of
the topological techniques and algorithms introduced here are closely related to
NP-complete or NP-hard problems.
Comment: Asymmetric topology encryption is a new topic of topological coding
towards certificateless public key cryptography.