32,003 research outputs found
The study of probability model for compound similarity searching
Information Retrieval or IR system main task is to retrieve relevant documents according to the users query. One of IR most popular retrieval model is the Vector Space Model. This model assumes relevance based on similarity, which is defined as the distance between query and document in the concept space. All currently existing chemical compound database systems have adapt the vector space model to calculate the similarity of a database entry to a query compound. However, it assumes that fragments represented by the bits are independent of one another, which is not necessarily true. Hence, the possibility of applying another IR model is explored, which is the Probabilistic Model, for chemical compound searching. This model estimates the probabilities of a chemical structure to have the same bioactivity as a target compound. It is envisioned that by ranking chemical structures in decreasing order of their probability of relevance to the query structure, the effectiveness of a molecular similarity searching system can be increased. Both fragment dependencies and independencies assumption are taken into consideration in achieving improvement towards compound similarity searching system. After conducting a series of simulated similarity searching, it is concluded that PM approaches really did perform better than the existing similarity searching. It gave better result in all evaluation criteria to confirm this statement. In terms of which probability model performs better, the BD model shown improvement over the BIR model
Extreme Value Statistics and Traveling Fronts: Various Applications
An intriguing connection between extreme value statistics and traveling
fronts has been found recently in a number of diverse problems. In this brief
review we outline a few such problems and consider their various applications.Comment: A brief review (6 pages, 2 figures) to appear in Physica A as part of
the proceedings of Statphys-Kolkata IV (2002
A survey on the use of relevance feedback for information access systems
Users of online search engines often find it difficult to express their need for information in the form of a query. However, if the user can identify examples of the kind of documents they require then they can employ a technique known as relevance feedback. Relevance feedback covers a range of techniques intended to improve a user's query and facilitate retrieval of information relevant to a user's information need. In this paper we survey relevance feedback techniques. We study both automatic techniques, in which the system modifies the user's query, and interactive techniques, in which the user has control over query modification. We also consider specific interfaces to relevance feedback systems and characteristics of searchers that can affect the use and success of relevance feedback systems
Fast Tree Search for Enumeration of a Lattice Model of Protein Folding
Using a fast tree-searching algorithm and a Pentium cluster, we enumerated
all the sequences and compact conformations (structures) for a protein folding
model on a cubic lattice of size . We used two types of amino
acids -- hydrophobic (H) and polar (P) -- to make up the sequences, so there
were different sequences. The total number
of distinct structures was 84,731,192. We made use of a simple solvation model
in which the energy of a sequence folded into a structure is minus the number
of hydrophobic amino acids in the ``core'' of the structure. For every
sequence, we found its ground state or ground states, i.e., the structure or
structures for which its energy is lowest. About 0.3% of the sequences have a
unique ground state. The number of structures that are unique ground states of
at least one sequence is 2,662,050, about 3% of the total number of structures.
However, these ``designable'' structures differ drastically in their
designability, defined as the number of sequences whose unique ground state is
that structure. To understand this variation in designability, we studied the
distribution of structures in a high dimensional space in which each structure
is represented by a string of 1's and 0's, denoting core and surface sites,
respectively.Comment: 18 pages, 10 figure
On the Bayes-optimality of F-measure maximizers
The F-measure, which has originally been introduced in information retrieval,
is nowadays routinely used as a performance metric for problems such as binary
classification, multi-label classification, and structured output prediction.
Optimizing this measure is a statistically and computationally challenging
problem, since no closed-form solution exists. Adopting a decision-theoretic
perspective, this article provides a formal and experimental analysis of
different approaches for maximizing the F-measure. We start with a Bayes-risk
analysis of related loss functions, such as Hamming loss and subset zero-one
loss, showing that optimizing such losses as a surrogate of the F-measure leads
to a high worst-case regret. Subsequently, we perform a similar type of
analysis for F-measure maximizing algorithms, showing that such algorithms are
approximate, while relying on additional assumptions regarding the statistical
distribution of the binary response variables. Furthermore, we present a new
algorithm which is not only computationally efficient but also Bayes-optimal,
regardless of the underlying distribution. To this end, the algorithm requires
only a quadratic (with respect to the number of binary responses) number of
parameters of the joint distribution. We illustrate the practical performance
of all analyzed methods by means of experiments with multi-label classification
problems
Parallel Implementation of Efficient Search Schemes for the Inference of Cancer Progression Models
The emergence and development of cancer is a consequence of the accumulation
over time of genomic mutations involving a specific set of genes, which
provides the cancer clones with a functional selective advantage. In this work,
we model the order of accumulation of such mutations during the progression,
which eventually leads to the disease, by means of probabilistic graphic
models, i.e., Bayesian Networks (BNs). We investigate how to perform the task
of learning the structure of such BNs, according to experimental evidence,
adopting a global optimization meta-heuristics. In particular, in this work we
rely on Genetic Algorithms, and to strongly reduce the execution time of the
inference -- which can also involve multiple repetitions to collect
statistically significant assessments of the data -- we distribute the
calculations using both multi-threading and a multi-node architecture. The
results show that our approach is characterized by good accuracy and
specificity; we also demonstrate its feasibility, thanks to a 84x reduction of
the overall execution time with respect to a traditional sequential
implementation
- âŠ