Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities

Author: Ro Jae
Sproat Richard
Zhang Hao
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2020

Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
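The abstract frames segmentation as per-character tagging, which can be illustrated with a minimal sketch. The two-label B/I scheme and the function names below are assumptions made for illustration; the abstract does not state which tag set the authors use, and no RNN is involved here. The sketch only shows how a segmentation such as open + research maps to one label per character and back.

```python
def segmentation_to_tags(words):
    """Label each character 'B' (begins a word) or 'I' (inside a word).

    The B/I tag set is an assumed, illustrative scheme.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings so no stray 'B' is emitted
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_segmentation(text, tags):
    """Rebuild the word list from a string and its per-character tags."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


tags = segmentation_to_tags(["open", "research"])
print("".join(tags))
print(tags_to_segmentation("openresearch", tags))
```

A tagging model of the kind described would predict the label sequence from the characters; decoding the predicted labels back into words then reduces to the second function.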

arXiv.org e-Print Archive

Crossref

The University of Helsinki submissions to the WMT18 news task

Author: Hurskainen Arvi
Nieminen Tommi
Raganato Alessandro
Scherrer Yves
Tiedemann Jörg
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2018

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Archivio della ricerca- Università di Roma La Sapienza

Proceedings of the Morpho Challenge 2010 Workshop

Author: Kurimo Mikko
Turunen Ville T.
Virpioja Sami
Publication venue: Aalto-yliopiston teknillinen korkeakoulu
Publication date: 01/01/2010

In natural language processing, many practical tasks, such as speech recognition, information retrieval and machine translation, depend on a large vocabulary and statistical language models. For morphologically rich languages, such as Finnish and Turkish, constructing a vocabulary and language models with sufficient coverage is particularly difficult because of the huge number of different word forms. In Morpho Challenge 2010, unsupervised and semi-supervised algorithms are proposed to provide morpheme analyses for words in different languages and are evaluated in various practical applications. As a research theme, unsupervised morphological analysis has received wide attention in conferences and scientific journals focused on computational linguistics and its applications. These are the proceedings of the Morpho Challenge 2010 Workshop, which contain one introductory article describing the tasks, evaluation and results, and six articles describing the participating unsupervised and supervised learning algorithms. The Morpho Challenge 2010 Workshop was held in Espoo, Finland on 2-3 September 2010.

Peer reviewed

Aaltodoc Publication Archive

Megaphylogenetic Specimen-Level Approaches to the Carex (Cyperaceae) Phylogeny Using ITS, ETS, and matK Sequences: Implications for Classification

Author: Escudero Lirio Marcial
Global Carex Group
Hahn Marlene
Jiménez Mejías Pedro
Maguilla Salado Enrique
Roalson Eric H.
Villaverde Tamara
Waterway Marcia J.
Publication venue: American Society of Plant Taxonomists
Publication date: 01/01/2016

We present the first large-scale phylogenetic hypothesis for the genus Carex based on 996 of the 1983 accepted species (50.23%). We used a supermatrix approach using three DNA regions: ETS, ITS and matK. Every concatenated sequence was derived from a single specimen. The topology of our phylogenetic reconstruction largely agreed with previous studies. We also gained new insights into the early divergence structure of the two largest clades, core Carex and Vignea clades, challenging some previous evolutionary hypotheses about inflorescence structure. Most sections were recovered as non-monophyletic. Homoplasy of characters traditionally selected as relevant for classification, historical misunderstanding of how morphology varies across Carex, and regional rather than global views of Carex diversity seem to be the main reasons for the high levels of polyphyly and paraphyly in the current infrageneric classification.

idUS. Depósito de Investigación Universidad de Sevilla

Combined SVM-CRFs for Biological Named Entity Recognition with Maximal Bidirectional Squeezing

Author: A Dagar
AM Cohen
B Hjørland
Bairong Shen
C Chung
C Cortes
CJ Kuo
CM Bishop
CW Hsu
D Rebholz-Schuhmann
ET Jaynes
Fei Zhu
G Zhou
HJ Dai
HM Wallach
I Tsochantaridis
J Lafferty
J Rose
JD Kim
Ji Kazama
Jérémie Bourdon
L Li
L Tanabe
LJ Jensen
MS Habib
R Winnenburg
RT Tsai
S Ananiadou
TH Tsai
Y Ephraim
Y He
Y Li
YF Lin
Z Ju
Publication venue: Public Library of Science
Publication date: 26/06/2012

Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use SVM to separate biological terms from non-biological terms, before we use CRFs to determine the types of biological terms, which makes full use of the power of SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop a useful algorithm and apply two rules. To ensure biological terms with a maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach will also gradually extend the context so more contextual information can be included. We examined the performance of four approaches with the GENIA corpus and JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed those of conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.
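For readers unfamiliar with the macro-averaged metrics reported above, the following is a minimal sketch of one common way to compute them. It assumes macro-F1 is the unweighted mean of per-class F1 scores; the abstract does not say which definition the authors use (some papers instead take the F1 of macro-precision and macro-recall). The entity types and counts are hypothetical and are not taken from the paper.

```python
def macro_scores(counts):
    """Return (macro-precision, macro-recall, macro-F1).

    counts maps each class name to a (true_pos, false_pos, false_neg) tuple.
    Macro-F1 here is the mean of per-class F1 (an assumed definition).
    """
    if not counts:
        raise ValueError("counts must contain at least one class")
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical per-class counts, for illustration only.
example = {"protein": (8, 2, 2), "DNA": (3, 1, 3)}
macro_p, macro_r, macro_f1 = macro_scores(example)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Because every class contributes equally regardless of its frequency, macro averaging gives rare entity types the same weight as common ones.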

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Strings And Colorings Of Topological Coding Towards Asymmetric Topology Cryptography

Author: Liu Xia
Ma Fei
Mu Yarong
Su Jing
Sun Hui
Yang Chao
Yao Bing
Zhang Xiaohui
Publication date: 30/09/2022

With anti-quantum computing in mind, we discuss various number-based strings, such as number-based super-strings, parameterized strings, set-based strings, graph-based strings, integer-partitioned and integer-decomposed strings, and Hanzi-based strings, as well as algebraic operations based on number-based strings. Moreover, we introduce number-based string-colorings, magic-constraint colorings, and vector-colorings and set-colorings related to strings. For the technique of encrypting the entire network at once, we propose graphic lattices related to number-based strings, Hanzi-graphic lattices, string groups, and all-tree-graphic lattices. We study some topics of asymmetric topology cryptography, such as topological signatures, Key-pair graphs, Key-pair strings, and one-encryption one-time and self-certification algorithms. Some of the topological techniques and algorithms introduced here are closely related to NP-complete or NP-hard problems.

Comment: Asymmetric topology encryption is a new topic of topological coding towards the certificateless public key cryptography

arXiv.org e-Print Archive