Rapid Feature Learning with Stacked Linear Denoisers
We investigate unsupervised pre-training of deep architectures as feature
generators for "shallow" classifiers. Stacked Denoising Autoencoders (SdA),
when used as feature pre-processing tools for SVM classification, can lead to
significant improvements in accuracy - however, at the price of a substantial
increase in computational cost. In this paper we create a simple algorithm
which mimics the layer by layer training of SdAs. However, in contrast to SdAs,
our algorithm requires no training through gradient descent as the parameters
can be computed in closed form. It can be implemented in fewer than 20 lines of
MATLAB™ and reduces the computation time from several hours to mere seconds. We
show that our feature transformation reliably and significantly improves the
results of SVM classification on all our data sets, often outperforming SdAs
and even deep neural networks in three out of four deep learning benchmarks.
Comment: 10 pages
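
The abstract does not state the formulas, so the following is only a minimal sketch of how a layer of a linear denoiser could be computed in closed form. It assumes masking noise (each feature is zeroed independently with probability `p`), a squared reconstruction loss, a `tanh` nonlinearity between layers, and a small ridge term `reg` for numerical stability. The function names and all of these choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def linear_denoiser_layer(X, p, reg=1e-5):
    """One closed-form layer. X has shape (d, n): features by examples.

    Assumption: every feature is zeroed independently with probability p,
    and the expected squared reconstruction error is minimised exactly.
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])       # constant bias feature
    S = Xb @ Xb.T                              # (d+1, d+1) scatter matrix
    q = np.full(d + 1, 1.0 - p)                # survival probabilities
    q[-1] = 1.0                                # the bias is never corrupted
    Q = S * np.outer(q, q)                     # E[x~ x~^T], off-diagonal part
    np.fill_diagonal(Q, q * np.diag(S))        # diagonal: a feature survives once
    P = S[:d, :] * q                           # E[x x~^T], shape (d, d+1)
    # W = P Q^{-1}; Q is symmetric, so solve Q W^T = P^T instead of inverting.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    return W, np.tanh(W @ Xb)

def stack_linear_denoisers(X, p, n_layers):
    """Stack layers greedily and concatenate the input with all layer outputs."""
    feats, H = [X], X
    for _ in range(n_layers):
        _, H = linear_denoiser_layer(H, p)
        feats.append(H)
    return np.vstack(feats)                    # shape (d * (n_layers + 1), n)
```

Under these assumptions, the stacked output could be passed to any shallow classifier such as an SVM. No gradient descent is involved: each layer costs one scatter-matrix product and one linear solve.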
An alternative text representation to TF-IDF and Bag-of-Words
In text mining, information retrieval, and machine learning, text documents
are commonly represented through variants of sparse Bag of Words (sBoW) vectors
(e.g. TF-IDF). Although simple and intuitive, sBoW style representations suffer
from their inherent over-sparsity and fail to capture word-level synonymy and
polysemy. Especially when labeled data is limited (e.g. in document
classification), or the text documents are short (e.g. emails or abstracts),
many features are rarely observed within the training corpus. This leads to
overfitting and reduced generalization accuracy. In this paper we propose Dense
Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW
document features. dCoT explicitly models absent words by removing and
reconstructing random sub-sets of words in the unlabeled corpus. With this
approach, dCoT learns to reconstruct frequent words from co-occurring
infrequent words and maps the high dimensional sparse sBoW vectors into a
low-dimensional dense representation. We show that the feature removal can be
marginalized out and that the reconstruction can be solved for in closed-form.
We demonstrate empirically, on several benchmark datasets, that dCoT features
significantly improve the classification accuracy across several document
classification tasks.
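
The claim that feature removal can be marginalized out and the reconstruction solved in closed form can be illustrated with a short derivation. This is a sketch under stated assumptions, not the paper's exact formulation: documents x_1, ..., x_n are sBoW vectors in R^d, each term is removed independently with probability p to give the corrupted vector x̃_i, the loss is squared error, and T is a hypothetical r-by-d selector matrix that picks the r most frequent terms as reconstruction targets.

```latex
% Assumptions (not taken from the abstract): squared loss, independent
% removal of each term with probability p, targets = r most frequent terms.
% diag(S) denotes S with all off-diagonal entries set to zero.
\begin{align*}
\mathcal{L}(W) &= \frac{1}{n}\sum_{i=1}^{n}
  \mathbb{E}_{\tilde{x}_i \mid x_i}\!\left[\lVert T x_i - W \tilde{x}_i \rVert^2\right], \\
S &= \sum_{i=1}^{n} x_i x_i^\top, \\
P &= \sum_{i=1}^{n} \mathbb{E}\!\left[T x_i \tilde{x}_i^\top\right] = (1-p)\, T S, \\
Q &= \sum_{i=1}^{n} \mathbb{E}\!\left[\tilde{x}_i \tilde{x}_i^\top\right]
   = (1-p)^2 S + p(1-p)\,\operatorname{diag}(S), \\
W^\star &= P\, Q^{-1}.
\end{align*}
```

Setting the gradient of the expected loss to zero gives W Q = P, hence the last line whenever Q is invertible. The expectations replace explicit sampling of corrupted documents, and the resulting W maps a d-dimensional sparse vector to an r-dimensional dense one.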
Machine Learning for Protein Function
Systematic identification of protein function is a key problem in current
biology. Most traditional methods fail to identify functionally equivalent
proteins if they lack similar sequences, structural data or extensive manual
annotations. In this thesis, I focused on feature engineering and machine
learning methods for identifying diverse classes of proteins that share
functional relatedness but little sequence or structural similarity, notably,
Neuropeptide Precursors (NPPs).
I aim to identify functional protein classes solely using unannotated protein
primary sequences from any organism. This thesis focuses on feature
representations of whole protein sequences, sequence-derived engineered
features, their extraction, frameworks for their usage by machine learning (ML)
models, and the application of ML models to biological tasks, focusing on high
level protein functions. I implemented the ideas of feature engineering to
develop a platform (called NeuroPID) that extracts meaningful features for
classification of overlooked NPPs. The platform allows mass discovery of new
neuropeptides (NPs) and NPPs. It was later extended into a web server.
I extended this approach to other challenging protein classes. This is
implemented as a novel bioinformatics toolkit called ProFET (Protein Feature
Engineering Toolkit). ProFET extracts hundreds of biophysical and
sequence-derived attributes, allowing the application of machine learning
methods to proteins. ProFET was applied to many protein benchmark datasets with
state-of-the-art performance. The success of ProFET applies to a wide range of
high-level functions such as metagenomic analysis, subcellular localization,
structure and unique functional properties (e.g. thermophiles, nucleic acid
binding).
These methods and frameworks represent a valuable resource for using ML and
data science methods on proteins.
Comment: MSc Thesis
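
To make the idea of sequence-derived features concrete, here is a small illustrative sketch that turns a primary sequence into a fixed-length feature dictionary. It is not ProFET's API: the function name, the hydrophobic residue grouping, and the crude charge estimate are all assumptions chosen for brevity.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard residues
HYDROPHOBIC = set("AILMFVWY")             # one simplified grouping (assumption)

def sequence_features(seq):
    """Return a dict of simple features computed from a primary sequence.

    Hypothetical helper for illustration only; real toolkits compute
    hundreds of attributes with more careful biophysical models.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    features = {"length": n}
    # Amino acid composition: fraction of each standard residue.
    for aa in AMINO_ACIDS:
        features["frac_" + aa] = counts[aa] / n
    features["frac_hydrophobic"] = sum(counts[a] for a in HYDROPHOBIC) / n
    # Crude net charge: K/R count +1, D/E count -1 (ignores histidine and pH).
    features["net_charge"] = counts["K"] + counts["R"] - counts["D"] - counts["E"]
    return features
```

Because every sequence maps to the same set of keys regardless of its length, the resulting vectors can be fed directly to standard ML classifiers.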