3,637 research outputs found
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
Prediction of Alternative Splice Sites in Human Genes
This thesis addresses the problem of predicting alternative splice sites in human genes. The most common way to identify alternative splice sites are the use of expressed sequence tags and microarray data. Since genes only produce alternative proteins under certain conditions, these methods are limited to detecting only alternative splice sites in genes whose alternative protein forms are expressed under the tested conditions. I have introduced three multiclass support vector machines that predict upstream and downstream alternative 3â splice sites, upstream and downstream alternative 5â splice sites, and the 3â splice site of skipped and cryptic exons. On a test set extracted from the Alternative Splice Annotation Project database, I was able to correctly classify about 68% of the splice sites in the alternative 3â set, about 62% of the splice sites in the alternative 5â set, and about 66% in the exon skipping set
Uni-QSAR: an Auto-ML Tool for Molecular Property Prediction
Recently deep learning based quantitative structure-activity relationship
(QSAR) models has shown surpassing performance than traditional methods for
property prediction tasks in drug discovery. However, most DL based QSAR models
are restricted to limited labeled data to achieve better performance, and also
are sensitive to model scale and hyper-parameters. In this paper, we propose
Uni-QSAR, a powerful Auto-ML tool for molecule property prediction tasks.
Uni-QSAR combines molecular representation learning (MRL) of 1D sequential
tokens, 2D topology graphs, and 3D conformers with pretraining models to
leverage rich representation from large-scale unlabeled data. Without any
manual fine-tuning or model selection, Uni-QSAR outperforms SOTA in 21/22 tasks
of the Therapeutic Data Commons (TDC) benchmark under designed parallel
workflow, with an average performance improvement of 6.09\%. Furthermore, we
demonstrate the practical usefulness of Uni-QSAR in drug discovery domains
Knowledge-Informed Machine Learning for Cancer Diagnosis and Prognosis: A review
Cancer remains one of the most challenging diseases to treat in the medical
field. Machine learning has enabled in-depth analysis of rich multi-omics
profiles and medical imaging for cancer diagnosis and prognosis. Despite these
advancements, machine learning models face challenges stemming from limited
labeled sample sizes, the intricate interplay of high-dimensionality data
types, the inherent heterogeneity observed among patients and within tumors,
and concerns about interpretability and consistency with existing biomedical
knowledge. One approach to surmount these challenges is to integrate biomedical
knowledge into data-driven models, which has proven potential to improve the
accuracy, robustness, and interpretability of model results. Here, we review
the state-of-the-art machine learning studies that adopted the fusion of
biomedical knowledge and data, termed knowledge-informed machine learning, for
cancer diagnosis and prognosis. Emphasizing the properties inherent in four
primary data types including clinical, imaging, molecular, and treatment data,
we highlight modeling considerations relevant to these contexts. We provide an
overview of diverse forms of knowledge representation and current strategies of
knowledge integration into machine learning pipelines with concrete examples.
We conclude the review article by discussing future directions to advance
cancer research through knowledge-informed machine learning.Comment: 41 pages, 4 figures, 2 table
Metric Selection and Metric Learning for Matching Tasks
A quarter of a century after the world-wide web was born, we have grown accustomed to having easy access to a wealth of data sets and open-source software. The value of these resources is restricted if they are not properly integrated and maintained. A lot of this work boils down to matching; finding existing records about entities and enriching them with information from a new data source. In the realm of code this means integrating new code snippets into a code base while avoiding duplication.
In this thesis, we address two different such matching problems. First, we leverage the diverse and mature set of string similarity measures in an iterative semisupervised learning approach to string matching. It is designed to query a user to make a sequence of decisions on specific cases of string matching. We show that we can find almost optimal solutions after only a small amount of such input. The low labelling complexity of our algorithm is due to addressing the cold start problem that is inherent to Active Learning; by ranking queries by variance before the arrival of enough supervision information, and by a self-regulating mechanism that counteracts initial biases.
Second, we address the matching of code fragments for deduplication. Programming code is not only a tool, but also a resource that itself demands maintenance. Code duplication is a frequent problem arising especially from modern development practice. There are many reasons to detect and address code duplicates, for example to keep a clean and maintainable codebase. In such more complex data structures, string similarity measures are inadequate. In their stead, we study a modern supervised Metric Learning approach to model code similarity with Neural Networks. We find that in such a model representing the elementary tokens with a pretrained word embedding is the most important ingredient. Our results show both qualitatively (by visualization) that relatedness is modelled well by the embeddings and quantitatively (by ablation) that the encoded information is useful for the downstream matching task.
As a non-technical contribution, we unify the common challenges arising in supervised learning approaches to Record Matching, Code Clone Detection and generic Metric Learning tasks. We give a novel account to string similarity measures from a psychological standpoint and point out and document one longstanding naming conflict in string similarity measures. Finally, we point out the overlap of latest research in Code Clone Detection with the field of Natural Language Processing
- âŠ