Soft edit distance for differentiable comparison of symbolic sequences
Edit distance, also known as Levenshtein distance, is an essential way to
compare two strings that proved to be particularly useful in the analysis of
genetic sequences and natural language processing. However, edit distance is a
discrete function that is known to be hard to optimize. This fact hampers the
use of this metric in machine learning. Even an algorithm as simple as K-means
fails to cluster a set of sequences using edit distance if they are of variable
length and abundance. In this paper we propose a novel metric, soft edit
distance (SED), which is a smooth approximation of edit distance. It is
differentiable and therefore it is possible to optimize it with gradient
methods. Like the original edit distance, SED and its derivatives can
be calculated with recurrence formulas in polynomial time. We demonstrate the usefulness
of the proposed metric on synthetic datasets and clustering of biological
sequences
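
To make the recurrence mentioned in the abstract above concrete, here is a minimal sketch of the classic Levenshtein dynamic program, followed by a variant in which the hard `min` is replaced by a smooth soft-min. The soft-min variant only illustrates the general idea of smoothing the recurrence; it is an assumption made for illustration and is not the SED definition from the paper. The names `softmin`, `smoothed_edit_distance`, and the parameter `tau` are hypothetical.

```python
import math


def edit_distance(a, b):
    """Classic Levenshtein distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def softmin(values, tau):
    """Smooth lower bound on min(values); approaches min as tau grows."""
    m = min(values)  # shift by the minimum for numerical stability
    return m - math.log(sum(math.exp(-tau * (v - m)) for v in values)) / tau


def smoothed_edit_distance(a, b, tau=10.0):
    """Same recurrence with min replaced by softmin (illustrative assumption)."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            cost = 0.0 if ca == cb else 1.0
            curr.append(softmin([prev[j] + 1.0,
                                 curr[j - 1] + 1.0,
                                 prev[j - 1] + cost], tau))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` returns 3, and `smoothed_edit_distance` with a large `tau` gives a value slightly below that, since the soft-min never exceeds the true minimum.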
Representation learning of drug and disease terms for drug repositioning
Drug repositioning (DR) refers to the identification of novel indications for
approved drugs. The huge investment of time and money required, together with
the risk of failure in clinical trials, has led to a surge of interest in drug
repositioning. DR exploits two major aspects associated with drugs and
diseases: the existence of similarity among drugs and among diseases due to their
shared genes or pathways or common biological effects. Existing
methods of identifying drug-disease associations rely mainly on the information
available in structured databases. On the other hand, the abundant
information available as free text in biomedical research articles is
not being fully exploited. Word embedding, i.e. obtaining vector representations of
words from a large corpus of free text using neural network methods, has been
shown to give strong performance on several natural language processing
tasks. In this work we propose a novel way of representation learning to obtain
features of drugs and diseases by combining complementary information available
in unstructured texts and structured datasets. Next, we apply a matrix completion
approach to these feature vectors to learn a projection matrix between the drug and
disease vector spaces. The proposed method has shown performance competitive
with state-of-the-art methods. Further, case studies on Alzheimer's disease and
hypertension have shown that the predicted associations match
existing knowledge.
Comment: Accepted to appear in 3rd IEEE International Conference on
Cybernetics (Spl Session: Deep Learning for Prediction and Estimation
Artificial Intelligence in the Context of Human Consciousness
Artificial intelligence (AI) can be defined as the ability of a machine to learn and make decisions based on acquired information. AI’s development has incited rampant public speculation regarding the singularity theory: a futuristic phase in which intelligent machines are capable of creating increasingly intelligent systems. Its implications, combined with the close relationship between humanity and its machines, make it imperative to understand both natural and artificial intelligence. Researchers are continuing to discover natural processes responsible for essential human skills like decision-making, understanding language, and performing multiple processes simultaneously. Artificial intelligence attempts to simulate these functions through techniques like artificial neural networks, Markov Decision Processes, Human Language Technology, and Multi-Agent Systems, which rely upon a combination of mathematical models and hardware
Segmenting DNA sequence into words based on statistical language model
This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that the length of most DNA “words” is 12 to 15 bp. The bound of the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised ‘probability approach to word segmentation’ method to segment the DNA sequences. A benchmark for the segmentation method is also proposed. In cross-segmentation tests, we find that different genomes may use similar languages that belong to different branches, just like English and French/Latin. Finally, we present some possible applications of this method
Parts-of-Speech Tagger Errors Do Not Necessarily Degrade Accuracy in Extracting Information from Biomedical Text
A recent study reported the development of Muscorian, a generic text processing
tool for extracting protein-protein interactions from text that achieved
performance comparable to biomedical-specific text processing tools. This
result was unexpected, since potential errors from a series of text analysis
processes are likely to adversely affect the outcome of the entire process. Most
biomedical entity relationship extraction tools have used biomedical-specific
parts-of-speech (POS) taggers, as errors in POS tagging are likely to affect
subsequent semantic analysis of the text, such as shallow parsing. This study
aims to evaluate the POS tagging accuracy and to
explore whether comparable performance is obtained when a generic POS tagger,
MontyTagger, is used in place of MedPost, a tagger trained on biomedical text.
Our results demonstrated that MontyTagger, Muscorian's POS tagger, has a POS
tagging accuracy of 83.1% when tested on biomedical text. Replacing MontyTagger
with MedPost did not result in a significant improvement in entity relationship
extraction from text; precision of 55.6% from MontyTagger versus 56.8% from
MedPost on directional relationships and 86.1% from MontyTagger compared to
81.8% from MedPost on nondirectional relationships. This is unexpected as the
potential for poor POS tagging by MontyTagger is likely to affect the outcome
of the information extraction. An analysis of POS tagging errors demonstrated
that 78.5% of tagging errors are compensated for by shallow parsing. Thus,
despite 83.1% tagging accuracy, MontyTagger has a functional tagging accuracy
of 94.6%
A Categorical Compositional Distributional Modelling for the Language of Life
The Categorical Compositional Distributional (DisCoCat) Model is a powerful
mathematical model for composing the meaning of sentences in natural languages.
Since we can think of biological sequences as the "language of life", it is
tempting to apply the DisCoCat model to the language of life to see if we can
obtain new insights and a better understanding of the latter. In this work, we
take an initial step in that direction. In particular, we choose to focus
on proteins, as the linguistic features of proteins are the most prominent
compared with other macromolecules such as DNA or RNA. Concretely, we treat
each protein as a sentence and its constituent domains as words. The meaning of
a word or the sentence is just its biological function, and the arrangement of
domains in a protein corresponds to the syntax. Putting all of these into the
DisCoCat framework, we can "compute" the function of a protein based on the
functions of its domains with the grammar rules that combine them together.
Since the functions of both the protein and its domains are represented in
vector spaces, we provide a novel way to formalize the functional
representation of proteins.
Comment: 20 pages, 15 figure
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge Discovery
The correlation and interactions among different biological entities comprise
the biological system. Although the interactions already revealed contribute to the
understanding of different existing systems, researchers face many questions
every day regarding inter-relationships among entities. Their queries have a
potential role in exploring new relations, which may open up new areas of
investigation. In this paper, we introduce a text-mining-based method for
answering biological queries in terms of statistical computation, so that
researchers can arrive at new knowledge. It allows users to
submit their queries in natural language form, which can be treated as
hypotheses. Our proposed approach analyzes the hypothesis and measures the
p-value of the hypothesis with respect to the existing literature. Based on the
measured value, the system either accepts or rejects the hypothesis from a
statistical point of view. Moreover, even if it does not find any direct
relationship among the entities of the hypothesis, it presents a network to
give an integral overview of all the entities through which the entities might
be related. This also helps researchers widen their view and
think of new hypotheses for further investigation. It gives researchers
a quantitative evaluation of their assumptions so that they can reach
a logical conclusion, and thus aids relevant research in biological
knowledge discovery. The system also provides researchers with a graphical
interactive interface for submitting their hypotheses for assessment in a more
convenient way.
Comment: 9 pages, published in International Journal on Computational Sciences
& Applications (IJCSA) Vol.3, No.6, December 201
Ontologies and Information Extraction
This report argues that, even in the simplest cases, IE is an ontology-driven
process. It is not a mere text filtering method based on simple pattern
matching and keywords, because the extracted pieces of text are interpreted
with respect to a predefined partial domain model. This report shows that
depending on the nature and the depth of the interpretation to be done for
extracting the information, more or less knowledge must be involved. This
report is mainly illustrated in biology, a domain in which there are critical
needs for content-based exploration of the scientific literature and which
is becoming a major application domain for IE
Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models
Neural language models (NLMs) have recently gained renewed interest by
achieving state-of-the-art performance across many natural language processing
(NLP) tasks. However, NLMs are computationally demanding, largely due to
the cost of the softmax layer over a large vocabulary. We observe
that, in decoding for many NLP tasks, only the probabilities of the top-K
hypotheses need to be calculated precisely, and K is often much smaller than
the vocabulary size. This paper proposes a novel softmax layer approximation
algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a
given context, a set of K words that are most likely to occur according to an
NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude
while attaining close to the full softmax baseline accuracy on neural machine
translation and language modeling tasks. We also prove a theoretical
guarantee on the softmax approximation quality
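
As a point of reference for the abstract above, the following minimal sketch shows the exact, brute-force baseline that a top-K approximation aims to avoid: computing the softmax over every vocabulary entry and then keeping the K largest probabilities. It is not the FGD algorithm, whose graph-based search is not described in the abstract. The function names and the toy logits are assumptions made for illustration.

```python
import heapq
import math


def full_softmax(logits):
    """Numerically stable softmax over the whole vocabulary."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_softmax(logits, k):
    """Return the k most probable (index, probability) pairs, best first.

    The cost is linear in the vocabulary size, which is the expense
    that an approximate top-K method tries to avoid.
    """
    probs = full_softmax(logits)
    return heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])


# Toy usage with a hypothetical four-word vocabulary.
toy_logits = [1.0, 3.0, 2.0, 0.5]
top2 = top_k_softmax(toy_logits, 2)
```

Because softmax is monotonic in the logits, the top-K indices can be found from the logits alone; the normalization constant is what still requires a pass over the full vocabulary in this exact version.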
Event-based hyperspace analogue to language for query expansion
Bag-of-words approaches to information retrieval (IR) are effective but assume independence between words. The Hyperspace Analogue to Language (HAL) is a cognitively motivated and validated semantic space model that captures statistical dependencies between words by considering their co-occurrences in a surrounding window of text. HAL has been successfully applied to query expansion in IR, but has several limitations, including high processing cost and use of distributional statistics that do not exploit syntax. In this paper, we pursue two methods for incorporating syntactic-semantic information from textual ‘events’ into HAL. We build the HAL space directly from events to investigate whether processing costs can be reduced through more careful definition of word co-occurrence, and improve the quality of the pseudo-relevance feedback by applying event information as a constraint during HAL construction. Both methods significantly improve performance results in comparison with original HAL, and interpolation of HAL and relevance model expansion outperforms either method alone
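
The windowed co-occurrence idea behind HAL, as described in the abstract above, can be sketched as follows. This is a minimal illustration of the commonly described HAL weighting, in which a word at distance d inside a window of size L contributes L - d + 1; that weighting scheme, the function name `build_hal`, and the toy sentence are assumptions, and the event-based variants proposed in the paper are not shown.

```python
from collections import defaultdict


def build_hal(tokens, window=5):
    """Build a HAL-style co-occurrence table from a token list.

    hal[w][c] accumulates a weight each time c appears within `window`
    positions before w; closer words get larger weights.
    """
    hal = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break  # ran off the start of the text
            hal[word][tokens[j]] += window - d + 1
    return hal


# Toy usage: with window=2, adjacent words score 2 and words two apart score 1.
hal = build_hal("the quick brown fox".split(), window=2)
```

Here `hal["brown"]["quick"]` is 2 and `hal["brown"]["the"]` is 1, so each row can be read as a sparse vector of weighted preceding-context counts.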