7 research outputs found
Bacteriophage-host determinants: identification of bacteriophage receptors through machine learning techniques
Master's dissertation in Bioinformatics
Bacterial resistance to antibiotics is becoming a major concern. Several reports indicate
that bacteria are developing resistance mechanisms to various antibiotics. Moreover, the processes involved
in the development of new antibiotics are lengthy and expensive. Therefore, an alternative to antibiotics
is needed. One promising alternative is bacteriophages, viruses that specifically infect bacteria,
causing their lysis. Hence, it would be interesting to discover which bacteria a specific phage recognizes.
Phage specificity is determined by the bacterial receptors a phage can recognize: phages use tail
spikes/fibres as receptor binding proteins to detect carbohydrates or proteins on the bacterial surface.
Studying interactions between phage tail spikes/fibres and bacterial receptors can allow the
identification of interaction pairs. Machine learning algorithms
can be used to find patterns in these interactions and build models to make predictions.
In this work, PhageHost, a tool that predicts hosts at the strain level for three species (E. coli, K.
pneumoniae and A. baumannii), was developed. Data were extracted from GenBank, retrieving
general, protein and coding information for both phages and bacteria. The protein data were used to
build an important phage protein function database, which allowed the classification of protein functions,
namely phage tail spikes/fibres. Finally, several machine learning models with relevant protein features
were created to predict phage-host strain interactions. Compared with previous work, these
models show better predictive power and the ability to perform strain-level predictions. For the best model,
a Matthews correlation coefficient (MCC) of 96.6% and an F-score of 98.3% were obtained. The best
predictive models were implemented online, on a server under the name PhageHost
(https://galaxy.bio.di.uminho.pt).
Bacterial resistance to antibiotics is becoming a concern nowadays. Several bacteria
have been reported to develop resistance mechanisms to various antibiotics. Added to this are the
long and costly processes involved in the development of antibiotics. There is therefore a need
to look for an alternative to antibiotics. One promising alternative is bacteriophages, viruses
that specifically infect bacteria and lead to their lysis. It would thus be interesting to discover which
bacterium a given phage recognizes. Phage specificity is given by the bacterial surface receptors
that the phages are able to recognize. They use spike/fibre proteins to recognize
protein or carbohydrate receptors on bacteria. Studying the interactions between phage tail
spikes/fibres and bacterial receptors can allow the identification of interaction pairs. Machine
learning algorithms can be used to discover patterns in these interactions and to build
models that make predictions.
In this work, the PhageHost tool was developed. It predicts hosts at the strain
level for three species: E. coli, K. pneumoniae and A. baumannii. Data were extracted
from GenBank, namely general, protein and coding information, for phages and bacteria. With
all the protein data, an important database was built, which allowed the classification
of protein functions, namely phage tail spikes/fibres. Finally, several machine learning
models with relevant protein features were created, capable of predicting
phage-host interactions at the strain level. Compared with other similar works, these models
showed better predictive power, as well as the ability to predict interactions at the strain level.
For the best model, a Matthews correlation coefficient of 96.6% and an F-score
of 98.3% were obtained. The best models were implemented online, on a server named PhageHost
(https://galaxy.bio.di.uminho.pt).
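The Matthews correlation coefficient and F-score reported for the best PhageHost model are both derived from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not code from the dissertation; the function name `mcc_and_f1` and the example counts are hypothetical, and the F-score is assumed to be the usual F1 (harmonic mean of precision and recall).

```python
import math


def mcc_and_f1(tp, fp, fn, tn):
    """Return (MCC, F1) for a binary confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives.
    """
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F1 = 2*TP / (2*TP + FP + FN)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return mcc, f1


# Hypothetical counts for illustration only (not results from the dissertation).
mcc, f1 = mcc_and_f1(tp=90, fp=5, fn=10, tn=95)
print(f"MCC = {mcc:.3f}, F1 = {f1:.3f}")
```

Unlike the F-score, the MCC also takes true negatives into account, which is one reason the two are often reported together for phage-host interaction data where non-interacting pairs may dominate.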
Shape Representations Using Nested Descriptors
The problem of shape representation is a core problem in computer vision. It can be argued that shape representation is the most central representational problem for computer vision, since unlike texture or color, shape alone can be used for perceptual tasks such as image matching, object detection and object categorization.
This dissertation introduces a new shape representation called the nested descriptor. A nested descriptor represents shape both globally and locally by pooling salient scaled and oriented complex gradients in a large nested support set. We show that this nesting property introduces a nested correlation structure that enables a new local distance function called the nesting distance, which provides a provably robust similarity function for image matching. Furthermore, the nesting property suggests an elegant flower-like normalization strategy called a log-spiral difference. We show that this normalization enables a compact binary representation and is equivalent to a form of bottom-up saliency. This suggests that the representational power of the nested descriptor is due to representing salient edges, which makes a fundamental connection between the saliency and local feature descriptor literature. In this dissertation, we introduce three examples of shape representation using nested descriptors: nested shape descriptors for imagery, nested motion descriptors for video and nested pooling for activities. We show evaluation results for these representations that demonstrate state-of-the-art performance for image matching, wide-baseline stereo and activity recognition tasks.
An automated method mapping parametric features between computer aided design software
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Enterprise efficiency is limited by data exchange. A product designer might specify the geometry of a product with a Computer Aided Design program, while an engineer might re-use that geometry data to calculate physical properties of the product using a Finite Element Analysis program. These different domains place different requirements on the product representation. The representation of product data required for a given task depends on the vendor software associated with that task, and sharing data between different vendor programs is limited by the incompatibility of the vendor formats used. In the case of Computer Aided Design, where the virtual form of an object is modelled, no standard data format captures complete model data. Common data standards transfer model surface geometry without capturing the topological elements from which these geometries are constructed. There are prescriptive data representations that allow these features to be specified in a neutral format, but there is little incentive for vendors to adopt these schemes. Recent efforts instead focus on identifying similar feature elements between different vendor CAD programs; however, this approach relies on onerous manual identification that requires frequent revision.
This research develops methods to automate the task of mapping relationships between different data format representations. Two independent matching techniques identify similar CAD feature functions between heterogeneous programs. Text similarity and object geometry matching techniques are combined to match the data formats associated with CAD programs. An efficient search for matching function parameters is performed using a genetic algorithm that incorporates semantic data matching and geometry data matching. A greedy semantic matching algorithm is developed that compares with the Doc2vec short text matching technique over the API dataset tested. An SVD geometric surface registration technique is developed that requires fewer calculations than an equivalent Iterative Closest Point method.
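The abstract does not describe how the greedy matching is implemented, so the following is only an illustrative sketch of greedy one-to-one matching between two lists of API function names. It swaps in plain character-level string similarity (`difflib.SequenceMatcher` from the Python standard library) as a stand-in for a semantic similarity measure; the function `greedy_match`, the `threshold` parameter and the example names are all hypothetical.

```python
from difflib import SequenceMatcher


def greedy_match(names_a, names_b, threshold=0.5):
    """Greedily pair names from two lists by descending string similarity.

    Assumption: character-level similarity stands in for the semantic
    similarity a real system would use. Each name is matched at most once.
    """
    # Score every candidate pair.
    scored = [
        (SequenceMatcher(None, a.lower(), b.lower()).ratio(), a, b)
        for a in names_a
        for b in names_b
    ]
    # Highest similarity first; the sort is stable for ties.
    scored.sort(key=lambda item: item[0], reverse=True)

    used_a, used_b, pairs = set(), set(), []
    for score, a, b in scored:
        if score < threshold:
            break  # all remaining pairs are below the threshold
        if a not in used_a and b not in used_b:
            pairs.append((a, b, round(score, 3)))
            used_a.add(a)
            used_b.add(b)
    return pairs


# Hypothetical function names from two CAD APIs.
cad_a = ["CreateExtrude", "AddFillet", "MakeHole"]
cad_b = ["ExtrudeFeature", "FilletEdge", "HoleWizard"]
for a, b, score in greedy_match(cad_a, cad_b, threshold=0.3):
    print(a, "<->", b, score)
```

A greedy pass like this commits to the best-scoring pair first and never revisits it, which is cheap but can miss the globally best assignment; that trade-off is why such a method would normally be benchmarked against alternatives.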
Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes
About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type are statistical features based on the Fourier transform used to assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, are generated by finite state machines augmented with counters that track the number of times the state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TE's. 
In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, giving biologists information useful for forming hypotheses, to be tested experimentally, about the role of these TE's and the mechanisms that govern them.
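As a rough illustration of the side effect machine (SEM) idea described above, the sketch below runs a small finite state machine over a DNA sequence and records how often each state is visited, then uses the normalized counts as a feature vector. The three-state transition table, the function name `sem_features`, the normalization and the choice to skip symbols outside A/C/G/T are assumptions made for this example; they are not taken from the thesis.

```python
def sem_features(sequence, transitions, start=0):
    """Run a finite state machine over `sequence` and count state visits.

    `transitions[state][base]` gives the next state. The visit counters,
    normalized by the number of processed symbols, form the feature vector.
    """
    counts = [0] * len(transitions)
    state = start
    for base in sequence.upper():
        nxt = transitions[state].get(base)
        if nxt is None:
            continue  # skip symbols outside the alphabet, e.g. 'N'
        state = nxt
        counts[state] += 1  # the counter is the machine's "side effect"
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical 3-state machine over the DNA alphabet.
TRANSITIONS = [
    {"A": 1, "C": 0, "G": 2, "T": 0},
    {"A": 1, "C": 2, "G": 0, "T": 1},
    {"A": 0, "C": 2, "G": 2, "T": 1},
]

print(sem_features("ACGTTGCAAC", TRANSITIONS))
```

Each distinct transition table yields a different feature vector for the same sequence, which is why the space of possible SEM features grows so quickly with the number of states and why a search method such as a genetic algorithm is used to pick useful machines.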
Machine Learning
Machine learning can be defined in various ways, all relating to a scientific domain concerned with the design and development of theoretical and implementation tools for building systems with some human-like intelligent behavior. More specifically, machine learning addresses the ability to improve automatically through experience.