Linking Data Across Universities: An Integrated Video Lectures Dataset
This paper presents our work and experience interlinking educational information across universities using Linked Data principles and technologies. More specifically, it focuses on selecting, extracting, structuring and interlinking information about video lectures produced by 27 different educational institutions. For this purpose, selected information from several websites and YouTube channels has been scraped and structured according to well-known vocabularies, such as FOAF and the W3C Ontology for Media Resources. To integrate this information, the extracted videos have been categorized under a common classification space, the taxonomy defined by the Open Directory Project. An evaluation of this categorization process found 98% coverage and 89% correctness. As a result, a new Linked Data dataset has been released containing more than 14,000 video lectures from 27 different institutions, categorized under a common classification scheme.
Automatic Assignment of Protein Function with Supervised Classifiers
High-throughput genome sequencing and sequence analysis technologies have
created the need for automated annotation and analysis of large sets of genes. The
Gene Ontology (GO) provides a common controlled vocabulary for describing gene
function. However, annotating proteins with GO terms is usually a tedious manual
curation process carried out by trained professional annotators. With the wealth of
genomic data now available, there is a need for accurate automated annotation methods.
The overall objective of my research is to improve our ability to automatically
annotate proteins with GO terms. The first method, Automatic Annotation of Protein
Functional Class (AAPFC), learns an independent Support Vector Machine classifier
for each GO term. This approach relies only on protein functional domains as features,
and demonstrates that statistical pattern recognition can outperform expert-curated
mappings of protein functional domain features to protein functions. The second method,
Predict of Gene Ontology (PoGO), is a meta-classification method that integrates
multiple heterogeneous data sources. It achieves better performance than the protein
domain method alone.
Apart from these two methods, several systems have been developed that employ pattern
recognition to assign gene function using a variety of features, such as sequence
similarity, the presence of protein functional domains, and gene expression patterns.
Most of these approaches have not considered the hierarchical relationships among GO
terms, which form a directed acyclic graph (DAG). The DAG represents the functional
relationships between the GO terms and should therefore be an important
component of an automated annotation system. I describe a Bayesian network used as
a multi-layered classifier that incorporates the relationships among GO terms found in
the GO DAG. I also describe an inference algorithm for quickly assigning GO terms
to unlabeled proteins. A comparative analysis of the method against previously
described annotation systems shows that it provides improved annotation accuracy
when performance on individual GO terms is compared. More importantly, this method
enables the assignment of significantly more GO terms to more proteins than was
previously possible.
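
To illustrate why the DAG matters for automated annotation, the sketch below closes a set of predicted terms under the DAG's ancestor relation, so that a protein assigned a specific term is also assigned every more general term above it. This is not the Bayesian network or the inference algorithm described in the abstract. It is only a minimal, assumed example of how independent per-term predictions can be made consistent with a term hierarchy. The term names, the `PARENTS` mapping, and the functions `ancestors` and `propagate` are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy DAG: each term maps to its direct parents.
# These are placeholder names, not real GO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "term_a": {"root"},
    "term_b": {"root"},
    "term_c": {"term_a"},
    "term_d": {"term_b", "term_c"},  # a term may have several parents
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(predicted: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Add all ancestors of each predicted term to the prediction set."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


if __name__ == "__main__":
    # A single specific prediction implies all of its more general ancestors.
    print(sorted(propagate({"term_d"}, PARENTS)))
```

Classifiers that treat each term independently can predict a specific term without its ancestors. A hierarchy-aware method avoids that inconsistency by construction, whereas this sketch only repairs it after the fact.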