Efficient Regularized Least-Squares Algorithms for Conditional Ranking on Relational Data
In domains like bioinformatics, information retrieval and social network
analysis, one can find learning tasks where the goal consists of inferring a
ranking of objects, conditioned on a particular target object. We present a
general kernel framework for learning conditional rankings from various types
of relational data, where rankings can be conditioned on unseen data objects.
We propose efficient algorithms for conditional ranking by optimizing squared
regression and ranking loss functions. We show theoretically that learning
with the ranking loss is likely to generalize better than with the regression
loss. Further, we prove that symmetry or reciprocity properties of relations
can be efficiently enforced in the learned models. Experiments on synthetic and
real-world data illustrate that the proposed methods deliver state-of-the-art
performance in terms of predictive power and computational efficiency.
Moreover, we show empirically that incorporating symmetry or reciprocity
properties can improve generalization performance.
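The distinction between the two loss functions can be sketched with a toy numpy example (illustrative only, not the authors' regularized least-squares solvers): a squared regression loss penalizes any miscalibration of the predicted relation values, while a squared pairwise ranking loss only fits score differences, so it is invariant to a constant shift of the scores.

```python
import numpy as np

def regression_loss(scores, labels):
    # Squared regression loss: fit each relation value directly.
    return np.sum((scores - labels) ** 2)

def ranking_loss(scores, labels):
    # Squared ranking loss: fit pairwise score differences to
    # pairwise label differences within one conditional ranking.
    ds = scores[:, None] - scores[None, :]
    dy = labels[:, None] - labels[None, :]
    return 0.5 * np.sum((ds - dy) ** 2)

labels = np.array([3.0, 1.0, 2.0])
scores = labels + 5.0                   # constant shift: order preserved
print(regression_loss(scores, labels))  # 75.0 -- penalizes the shift
print(ranking_loss(scores, labels))     # 0.0  -- invariant to the shift
```

The example makes the generalization argument concrete: the ranking loss ignores calibration errors that are irrelevant to the induced ordering.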
Improving average ranking precision in user searches for biomedical research datasets
The availability of research datasets is a cornerstone of health and life science
study reproducibility and scientific progress. Due to the heterogeneity and
complexity of these data, a main challenge to be overcome by research data
management systems is to provide users with the best answers for their search
queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we
investigate a novel ranking pipeline to improve the search of datasets used in
biomedical experiments. Our system comprises a query expansion model based on
word embeddings, a similarity measure algorithm that takes into consideration
the relevance of the query terms, and a dataset categorisation method that
boosts the rank of datasets matching query constraints. The system was
evaluated using a corpus with 800k datasets and 21 annotated user queries. Our
system provides competitive results when compared to the other challenge
participants. In the official run, it achieved the highest infAP among the
participants, +22.3% higher than the median infAP of the participants'
best submissions. Overall, it ranks second when an aggregated metric using
the best official measures per participant is considered. The query expansion
method had a positive impact on the system's performance, increasing our
baseline by up to +5.0% and +3.4% on the infAP and infNDCG metrics, respectively.
Our similarity measure algorithm appears robust, in particular compared to the
Divergence From Randomness framework, showing smaller performance variations
under different training conditions. Finally, the result categorization did not
have significant impact on the system's performance. We believe that our
solution could be used to enhance biomedical dataset management systems. In
particular, the use of data-driven query expansion methods could be an
alternative to the complexity of biomedical terminologies.
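Embedding-based query expansion of the kind described above can be sketched in a few lines; the vocabulary and vectors below are toy values, not the system's biomedical embeddings:

```python
import numpy as np

# Toy embedding table; a real system would load vectors trained on
# biomedical text (terms and numbers here are illustrative only).
vocab = ["gene", "genome", "protein", "dataset", "weather"]
E = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.1, 0.9],
])
E = E / np.linalg.norm(E, axis=1, keepdims=True)

def expand(term, k=2):
    # Return the k nearest vocabulary terms by cosine similarity,
    # which are then appended to the original query.
    i = vocab.index(term)
    sims = E @ E[i]
    sims[i] = -1.0                      # exclude the query term itself
    best = np.argsort(sims)[::-1][:k]
    return [vocab[j] for j in best]

print(expand("gene"))  # ['genome', 'protein']
```

In practice the expanded terms are typically down-weighted relative to the original query terms so they broaden recall without dominating the match score.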
Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts
The recently proposed concept of molecular connectivity maps enables researchers to integrate experimental measurements of genes, proteins, metabolites, and drug compounds under similar biological conditions. The study of these maps provides opportunities for future toxicogenomics and drug discovery applications. We developed a computational framework to build disease-specific drug-protein connectivity maps. We integrated gene/protein and drug connectivity information based on protein interaction networks and literature mining, without requiring gene expression profile information derived from drug perturbation experiments on disease samples. We described the development and application of this computational framework using Alzheimer's Disease (AD) as a primary example in three steps. First, molecular interaction networks were incorporated to reduce bias and improve the relevance of AD seed proteins. Second, PubMed abstracts were used to retrieve enriched drug terms that are indirectly associated with AD through molecular mechanistic studies. Third, a comprehensive AD connectivity map was created by relating the enriched drugs and their associated proteins in the literature. We showed that this molecular connectivity map development approach outperformed both curated drug target databases and conventional information retrieval systems. Our initial explorations of the AD connectivity map yielded a new hypothesis that diltiazem and quinidine may be investigated as candidate drugs for AD treatment. Molecular connectivity maps derived computationally can help study molecular signature differences between different classes of drugs in specific disease contexts. To achieve overall good data coverage and quality, a series of statistical methods have been developed to overcome high levels of data noise in biological networks and literature mining results.
Further development of computational molecular connectivity maps to cover major disease areas will likely establish a new model for drug development, in which the therapeutic/toxicological profiles of candidate drugs can be checked computationally before costly clinical trials begin.
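Enrichment of drug terms in disease-related abstracts, as in the second step above, is commonly scored with a hypergeometric tail test; here is a stdlib-only sketch with hypothetical counts (the paper's actual statistics and thresholds may differ):

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    # P(X >= k): probability that, of the n abstracts mentioning a
    # drug drawn from N total, at least k also mention disease seed
    # proteins, when K of the N abstracts mention those proteins.
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# Hypothetical counts: 40 of 1000 abstracts mention AD seed proteins;
# a candidate drug appears in 20 abstracts, 8 of which co-mention them.
p = hypergeom_pvalue(k=8, n=20, K=40, N=1000)
print(p < 1e-4)  # True: far more overlap than the ~0.8 expected by chance
```

Drugs whose co-mention counts fall in this extreme tail are the ones retained as "enriched" candidates for the connectivity map.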
Knowledge Discovery Through Large-Scale Literature-Mining of Biological Text-Data
The aim of this study is to develop a scalable and efficient literature-mining framework for knowledge discovery in the fields of medical and biological sciences. Using this scalable framework, customized disease-disease interaction networks can be constructed. The features that differentiate the proposed network from existing networks are its 1) flexibility in the level of abstraction, 2) broad coverage, and 3) domain specificity. Empirical results for two neurological diseases have shown the utility of the proposed framework. The second goal of this study is to design and implement a bottom-up information retrieval approach to facilitate literature-mining in the specialized field of medical genetics. Experimental results are currently being corroborated.
Computing Network of Diseases and Pharmacological Entities through the Integration of Distributed Literature Mining and Ontology Mapping
The proliferation of -omics fields (such as Genomics and Proteomics) and -ology fields (such as Systems Biology, Cell Biology, and Pharmacology) has spawned new frontiers of research in drug discovery and personalized medicine. A vast number (21 million) of published research results are archived in PubMed, and the archive is continually growing in size. To improve the accessibility and utility of such a large body of literature, it is critical to develop a suite of semantically sensitive technologies capable of discovering knowledge and inferring possible new relationships based on statistical co-occurrences of meaningful terms or concepts. In this context, this thesis presents a unified framework to mine large literature collections through the integration of latent semantic analysis (LSA) and ontology mapping. In particular, a parameter-optimized, robust, scalable, and distributed LSA (DiLSA) technique was designed and implemented on a carefully selected 7.4 million PubMed records related to pharmacology. The DiLSA model was integrated with MeSH to make the model effective and efficient for a specific domain. An optimized multi-gram dictionary was customized by mapping MeSH terms to build the DiLSA model. A fully integrated web-based application, called PharmNet, was developed to bridge the gap between biological knowledge and clinical practice. Preliminary analysis using PharmNet shows improved performance over a global LSA model. A limited expert evaluation was performed to validate the retrieved results and the network against the biological literature. A thorough performance evaluation and validation of results is in progress.
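At its core, LSA is a truncated SVD of a term-document matrix; the toy numpy sketch below shows the idea (the distributed computation and the MeSH-based multi-gram dictionary of DiLSA are omitted):

```python
import numpy as np

# Tiny term-document count matrix (terms x documents); a real DiLSA
# run would build this from millions of PubMed records.
X = np.array([
    [2, 1, 0, 0],   # "drug"
    [1, 2, 0, 0],   # "receptor"
    [0, 0, 2, 1],   # "kinase"
    [0, 0, 1, 2],   # "pathway"
], dtype=float)

# Rank-k truncated SVD yields the latent semantic space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs = (np.diag(s[:k]) @ Vt[:k]).T      # documents in latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 1 share pharmacology terms, so they stay close in
# the latent space, while the kinase documents end up orthogonal.
print(cos(docs[0], docs[1]), cos(docs[0], docs[2]))  # ~1.0 and ~0.0
```

Query vectors are folded into the same k-dimensional space, so retrieval matches on latent topics rather than exact term overlap.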
Mapping Subsets of Scholarly Information
We illustrate the use of machine learning techniques to analyze, structure,
maintain, and evolve a large online corpus of academic literature. An emerging
field of research can be identified as part of an existing corpus, permitting
the implementation of a more coherent community structure for its
practitioners.
Comment: 10 pages, 4 figures; presented at the Arthur M. Sackler Colloquium on
"Mapping Knowledge Domains", 9-11 May 2003, Beckman Center, Irvine, CA;
proceedings to appear in PNAS.
A kernel-based framework for learning graded relations from data
Driven by a large number of potential applications in areas like
bioinformatics, information retrieval and social network analysis, the problem
setting of inferring relations between pairs of data objects has recently been
investigated quite intensively in the machine learning community. To this end,
current approaches typically consider datasets containing crisp relations, so
that standard classification methods can be adopted. However, relations between
objects like similarities and preferences are often expressed in a graded
manner in real-world applications. A general kernel-based framework for
learning relations from data is introduced here. It extends existing approaches
because both crisp and graded relations are considered, and it unifies existing
approaches because different types of graded relations can be modeled,
including symmetric and reciprocal relations. This framework establishes
important links between recent developments in fuzzy set theory and machine
learning. Its usefulness is demonstrated through various experiments on
synthetic and real-world data.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
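The pairwise (Kronecker product) kernel at the heart of such relation-learning frameworks, and a symmetrized variant that enforces R(a,b) = R(b,a), can be sketched as follows (a minimal illustration on random features, not the authors' implementation):

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Gaussian kernel between rows of object feature matrices.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def pair_kernel(a, b, c, d, K):
    # Kronecker pairwise kernel comparing relation (a,b) to (c,d).
    return K[a, c] * K[b, d]

def symmetric_pair_kernel(a, b, c, d, K):
    # Symmetrized variant: invariant to swapping the arguments of a
    # pair, which enforces R(a,b) = R(b,a) in the learned model.
    return K[a, c] * K[b, d] + K[a, d] * K[b, c]

X = np.random.default_rng(0).normal(size=(4, 3))  # 4 objects, 3 features
K = rbf(X, X)

# Swapping (a, b) leaves the symmetrized kernel value unchanged.
assert np.isclose(symmetric_pair_kernel(0, 1, 2, 3, K),
                  symmetric_pair_kernel(1, 0, 2, 3, K))
```

An anti-symmetrized difference of the same two products plays the analogous role for reciprocal relations.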
Development of a framework for the classification of antibiotics adjuvants
Master's dissertation in Bioinformatics.
Throughout the last few decades, bacteria have become increasingly resistant to available
antibiotics, leading to a growing need for new antibiotics and new drug development
methodologies. In the last 40 years, there have been no records of the development of new
antibiotics, which has begun to narrow the possible alternatives. Therefore, finding new
antibiotics and bringing them to market is increasingly challenging. One approach is finding
compounds that restore or leverage the activity of existing antibiotics against biofilm bacteria.
As the information in this field is very limited and no database covers this theme,
machine learning models were used to predict the relevance of documents about
adjuvants.
In this project, the BIOFILMad - Catalog of antimicrobial adjuvants to tackle biofilms
application was developed to help researchers save time in their daily research. This
application was constructed using Django and Django REST Framework for the backend
and React for the frontend.
As for the backend, a database needed to be constructed since no database entirely
focuses on this topic. For that, a machine learning model was trained to help us classify
articles. Three algorithms were used: Support-Vector Machine (SVM), Random
Forest (RF), and Logistic Regression (LR), each combined with one of two
feature-set sizes, 945 or 1890. Considering all metrics, model LR-1 performed
best at classifying relevant documents, with an accuracy of 0.8461, a recall of
0.6170, an F1-score of 0.6904, and a precision of 0.7837. This model is the best
at correctly identifying relevant documents, as shown by its higher recall
compared to the other models. With this model, our database was populated with
relevant information.
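As a quick sanity check, the reported precision and recall determine the F1 score, and they are consistent with the quoted value:

```python
# F1 is the harmonic mean of precision and recall; plugging in the
# LR-1 metrics reported above reproduces the quoted F1-score.
precision, recall = 0.7837, 0.6170
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6904, matching the reported F1-score
```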
Our backend has a distinctive feature: an aggregation capability built with Named
Entity Recognition (NER). The goal is to identify specific entity types, in our
case CHEMICAL and DISEASE. Associations between these entities are then made and
delivered to the user, saving researchers time. For example, thanks to this
aggregation feature, a researcher can see which compounds "Pseudomonas
aeruginosa" has already been tested with.
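The aggregation described above amounts to linking every CHEMICAL to every DISEASE entity recognized in the same document; a minimal sketch (entity strings and labels are illustrative stand-ins for real NER output):

```python
from collections import defaultdict

# Hypothetical NER output: (entity text, entity type) per document.
annotated = [
    [("ciprofloxacin", "CHEMICAL"), ("Pseudomonas aeruginosa", "DISEASE")],
    [("tobramycin", "CHEMICAL"), ("Pseudomonas aeruginosa", "DISEASE")],
    [("aspirin", "CHEMICAL")],      # no disease mention: nothing to link
]

def aggregate(docs):
    # Associate every CHEMICAL with every DISEASE found in the same
    # document, mirroring the aggregation feature described above.
    links = defaultdict(set)
    for ents in docs:
        chems = {t for t, kind in ents if kind == "CHEMICAL"}
        dis = {t for t, kind in ents if kind == "DISEASE"}
        for d in dis:
            links[d] |= chems
    return links

links = aggregate(annotated)
print(sorted(links["Pseudomonas aeruginosa"]))  # ['ciprofloxacin', 'tobramycin']
```

Document-level co-occurrence is a deliberately simple linking rule; sentence-level windows or relation classifiers would trade recall for precision.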
The frontend was implemented so the user could access this aggregation feature, see the
articles present in the database, use the machine learning models to classify new documents,
and insert them into the database if they are relevant.