36,298 research outputs found
A cascaded approach to normalising gene mentions in biomedical literature
Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and to the ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, in which pre-processing is first applied both to gene synonyms in the databases and to gene mentions in text. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. Experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results, with an F-measure of up to 81.20%.
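The cascade described in this abstract (exact, then exact-like, then token-based approximate matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionary, the identifiers `GENE:1` and `GENE:2`, the function names, the use of Jaccard overlap for the approximate stage, and the 0.6 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary: identifier -> synonyms. The identifiers and
# names are illustrative only, not taken from the paper's resources.
SYNONYMS = {
    "GENE:1": ["tumor protein p53", "TP53", "p53"],
    "GENE:2": ["insulin-like growth factor 1", "IGF1", "IGF-1"],
}


def tokens(name):
    """Lower-case the name and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]


def squash(name):
    """Exact-like key: lower-case, with punctuation and spaces removed."""
    return "".join(tokens(name))


def jaccard(a, b):
    """Overlap of two token collections, ignoring order and repeats."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def normalise(mention, threshold=0.6):
    """Return (identifier, stage) from the first stage that matches, else None."""
    # Stage 1: exact string match against the dictionary.
    for gene_id, names in SYNONYMS.items():
        if mention in names:
            return gene_id, "exact"
    # Stage 2: exact-like match on the squashed representation.
    key = squash(mention)
    for gene_id, names in SYNONYMS.items():
        if key in {squash(n) for n in names}:
            return gene_id, "exact-like"
    # Stage 3: token-based approximate match. Comparing token sets also
    # tolerates permutation of the components of a name.
    best_id, best_score = None, 0.0
    mention_tokens = tokens(mention)
    for gene_id, names in SYNONYMS.items():
        for n in names:
            score = jaccard(mention_tokens, tokens(n))
            if score > best_score:
                best_id, best_score = gene_id, score
    if best_score >= threshold:
        return best_id, "token"
    return None
```

Ordering the stages from strictest to loosest means a mention is only handed to a more permissive matcher when the stricter ones have failed, which is one way such a cascade can trade precision against recall.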
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since XML became popular for data representation and exchange over the Web, the number of XML documents in circulation has increased rapidly. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm, PCXSS, that places heterogeneous XML documents into groups according to their structural and semantic similarity. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and existing clusters, eliminating the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate.
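The idea of comparing each incoming document with existing clusters, rather than with every other document, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define CPSim, so a plain Jaccard overlap of root-to-element tag paths stands in for it, semantic similarity of tag names is not modelled, and the function names and the 0.5 threshold are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths, e.g. {'book', 'book/title'}."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def similarity(doc_paths, cluster_paths):
    """Stand-in for CPSim (an assumption): Jaccard overlap of tag paths."""
    return len(doc_paths & cluster_paths) / len(doc_paths | cluster_paths)


def progressive_cluster(documents, threshold=0.5):
    """Assign each document, in order, to the most similar existing cluster,
    or open a new cluster when no cluster reaches the threshold."""
    clusters = []  # each cluster: {"paths": set of paths, "members": [indices]}
    for index, text in enumerate(documents):
        paths = tag_paths(text)
        best = max(clusters, key=lambda c: similarity(paths, c["paths"]), default=None)
        if best is not None and similarity(paths, best["paths"]) >= threshold:
            best["members"].append(index)
            best["paths"] |= paths  # the cluster summary grows with each member
        else:
            clusters.append({"paths": set(paths), "members": [index]})
    return [c["members"] for c in clusters]
```

Because each document is compared only with one summary per cluster, the number of comparisons grows with the number of clusters instead of the number of document pairs.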
A Legal Perspective on Training Models for Natural Language Processing
A significant concern in processing natural language data is the often unclear legal status of the input and output data/resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning model from an annotated corpus. We examine which legal rules apply at relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.
Towards shared datasets for normalization research
In this paper we present a Dutch and an English dataset that can serve as a gold standard for evaluating text normalization approaches. Combining text messages, message board posts and tweets, these datasets represent a variety of user-generated content. All data was manually normalized to its standard form using newly developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We focus on both the word and character level and find that we can improve the BLEU score by ca. 20% for both languages. Before this user-generated content can be released publicly to the research community, some issues first need to be resolved. These are discussed in closer detail by focussing on the current legislation and by investigating previous similar data collection projects. With this discussion we hope to shed some light on the various difficulties researchers face when trying to share social media data.
ARPA Whitepaper
We propose a secure computation solution for blockchain networks. The
correctness of computation is verifiable even under a malicious-majority
condition using an information-theoretic Message Authentication Code (MAC), and
privacy is preserved using secret sharing. With a state-of-the-art multiparty
computation protocol and a layer2 solution, our privacy-preserving computation
guarantees data security on blockchain, cryptographically, while reducing the
heavy-lifting computation job to a few nodes. This breakthrough has several
implications on the future of decentralized networks. First, secure computation
can be used to support Private Smart Contracts, where consensus is reached
without exposing the information in the public contract. Second, it enables
data to be shared and used in a trustless network without disclosing the raw
data while it is in use, so that data ownership and data usage are safely
separated. Last but not least, computation and verification processes are
separated, which can be perceived as computational sharding; this effectively
makes the transaction processing speed scale linearly with the number of
participating nodes. Our objective is to deploy our secure computation network as a layer2
solution to any blockchain system. Smart contracts will be
used as a bridge to link the blockchain and computation networks. Additionally,
they will be used as a verifier to ensure that outsourced computation is
completed correctly. In order to achieve this, we first develop a general MPC
network with advanced features, such as: 1) Secure Computation, 2) Off-chain
Computation, 3) Verifiable Computation, and 4) Support for dApps' needs such as
privacy-preserving data exchange.
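How secret sharing and an information-theoretic MAC fit together can be sketched in a few lines of Python. The abstract does not name a specific protocol, so this sketch assumes additive sharing over a prime field with a single global MAC key, in the style of SPDZ-like protocols. The modulus, the function names, and the trusted-dealer setup are assumptions, and the check below uses the key in the clear, which a real protocol would avoid.

```python
import secrets

P = 2**61 - 1  # a Mersenne prime used as the field modulus (an assumption)


def share(value, n):
    """Split value into n additive shares that sum to value modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % P)
    return parts


def reconstruct(parts):
    """Recombine additive shares."""
    return sum(parts) % P


def authenticated_share(value, alpha, n):
    """Return (value shares, MAC shares), where the MAC is alpha * value mod P."""
    return share(value, n), share(alpha * value % P, n)


def add(a, b):
    """Addition is local: each party adds its own value shares and MAC shares."""
    (xa, ma), (xb, mb) = a, b
    return ([(u + v) % P for u, v in zip(xa, xb)],
            [(u + v) % P for u, v in zip(ma, mb)])


def open_and_check(shared, alpha):
    """Open a shared value and verify its MAC. Simplified: alpha is used
    directly here, whereas a real protocol keeps the key itself shared."""
    values, macs = shared
    x = reconstruct(values)
    if reconstruct(macs) != alpha * x % P:
        raise ValueError("MAC check failed: a share was modified")
    return x
```

As a usage example, sharing 20 and 22 among three parties with a nonzero key, adding the shares locally, and opening the result yields 42. If any single value share is altered beforehand, the MAC check raises an error, because the adversary cannot adjust the MAC shares consistently without knowing the key.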