Entropy in Image Analysis III
Image analysis can be applied to rich and assorted scenarios; therefore, the aim of this recent research field is not only to mimic the human vision system. Image analysis is one of the main methods that computers use today, and there is a body of knowledge that they will be able to manage in a totally unsupervised manner in the future, thanks to their artificial intelligence. The articles published in the book clearly show such a future.
Entropy in Image Analysis II
Image analysis is a fundamental task for any application where extracting information from images is required. The analysis requires highly sophisticated numerical and analytical methods, particularly for applications in medicine, security, and other fields where the results of the processing consist of data of vital importance. This is evident from all the articles composing the Special Issue "Entropy in Image Analysis II", in which the authors used widely tested methods to verify their results. In reading the present volume, the reader will appreciate the richness of their methods and applications, in particular for medical imaging and image security, and a remarkable cross-fertilization among the proposed research areas.
Human protein function prediction: application of machine learning for integration of heterogeneous data sources
Experimental characterisation of protein cellular function can be prohibitively expensive and
take years to complete. To address this problem, this thesis focuses on the development of computational
approaches to predict function from sequence. For sequences with well characterised
close relatives, annotation is trivial; orphans or distant homologues present a greater challenge.
The use of a feature-based method employing ensemble support vector machines to predict individual
Gene Ontology classes is investigated. It is found that different combinations of feature
inputs are required to recognise different functions. Although the approach is applicable to any
human protein sequence, it is restricted to broadly descriptive functions. The method is well
suited to prioritising candidate functions for novel proteins rather than to making highly accurate
class assignments.
Signatures of common function can be derived from different biological characteristics: interactions
and binding events, as well as expression behaviour. To investigate the hypothesis that
common function can be derived from expression information, public domain human microarray
datasets are assembled. The questions of how best to integrate these datasets and derive
features that are useful in function prediction are addressed. Both co-expression and abundance
information are represented between and within experiments and investigated for correlation with
function. It is found that features derived from expression data serve as a weak but significant
signal for recognising functions. This signal is stronger for biological processes than for molecular
function categories and is independent of homology information.
The protein domain has historically been regarded as a modular evolutionary unit of protein function.
The occurrence of domains that can be linked by ancestral fusion events serves as a signal
for domain-domain interactions. To exploit this information for function prediction, novel domain
architecture and fused architecture scores are developed. Architecture scores correlate more
strongly with function than single-domain scores do, and both architecture and fusion
scores correlate more strongly with molecular functions than with biological processes.
The final study details the development of a novel heterogeneous function prediction approach
designed to target the annotation of both homologous and non-homologous proteins. Support
vector regression is used to combine pair-wise sequence features with expression scores and
domain architecture scores to rank protein pairs in terms of their functional similarities. The
target of the regression models represents the continuum of protein function space empirically
derived from the Gene Ontology molecular function and biological process graphs. The merit
and performance of the approach are demonstrated using homologous and non-homologous test
datasets, on which it significantly improves upon classical nearest-neighbour annotation transfer
by sequence methods. The final model represents a method that achieves a compromise between
high specificity and sensitivity for all human proteins regardless of their homology status. It is
expected that this strategy will allow for more comprehensive and accurate annotations of the
human proteome.
Modulation of Protein Dynamics by Ligand Binding and Solvent Composition
Many proteins undergo conformational switching in order to perform their cellular functions. A multitude of factors may shift the energy landscape and alter protein dynamics, with varying effects on the conformations they explore. We apply atomistic molecular dynamics simulations to a variety of biomolecular systems in order to investigate how factors such as pressure, the chemical environment, and ligand binding at distant binding pockets affect the structure and dynamics of these protein systems. Further, we examine how such changes should be characterized. We first investigate how pressure and solvent modulate ligand access to the active site of a bacterial lipase by probing the dynamics at a variety of pressures and in DMSO-water solvent mixtures. By measuring the gorge leading to the binding pocket, we find that small amounts of DMSO and high atmospheric pressure optimize the ability of lipids to reach the catalytic interior. Next, we examine the allosteric mechanism behind cooperative and anti-cooperative binding of the nuclear hormone receptor RXR and two of its binding partners (TR and CAR). We detail why ligands of the RXR:TR complex (9c and t3) bind anti-cooperatively while ligands of the RXR:CAR complex (9c and tcp) bind cooperatively. Finally, we describe how an intrinsically disordered protein, α-synuclein, alters its conformational dynamics in a pH-dependent manner, increasing the likelihood of pathogenic aggregation and neurodegenerative disease at low pH. In each case, we apply contact analysis to uncover the collective motions underlying conformational change triggered by environmental factors or ligand binding.
Compressão e análise de dados genómicos (Compression and analysis of genomic data)
Doctorate in Informatics (Doutoramento em Informática)
Genomic sequences are large codified messages describing most of the structure
of all known living organisms. Since the presentation of the first genomic
sequence, a huge amount of genomics data has been generated,
with diversified characteristics, rendering the data deluge phenomenon a
serious problem in most genomics centers. As such, most of the data are
discarded (when possible), while others are compressed using general purpose
algorithms, often attaining modest data reduction results.
Several specific algorithms have been proposed for the compression of genomic
data, but unfortunately only a few of them have been made available
as usable and reliable compression tools. Of those, most have been developed
for some specific purpose. In this thesis, we propose a compressor
for genomic sequences of multiple natures, able to function in a reference
or reference-free mode. In addition, it is very flexible and can cope with diverse
hardware specifications. It uses a mixture of finite-context models (FCMs)
and eXtended FCMs. The results show improvements over state-of-the-art
compressors.
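As a rough illustration of what a finite-context model does, the sketch below estimates the cost in bits of encoding a DNA string with a single adaptive order-k model. It is not the compressor proposed in the thesis, which mixes several FCMs and extended FCMs; the function name `fcm_bits`, the smoothing parameter `alpha`, and the default order are assumptions made for this example only.

```python
import math
from collections import defaultdict


def fcm_bits(sequence, order=3, alpha=1.0, alphabet="ACGT"):
    """Estimate the bits needed to encode `sequence` with one adaptive
    order-`order` finite-context model (hypothetical helper)."""
    # counts[context][symbol] = times `symbol` followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        # The context is the (up to) `order` symbols preceding position i.
        context = sequence[max(0, i - order):i]
        seen = counts[context]
        total = sum(seen.values())
        # Additive smoothing so that unseen symbols keep a nonzero probability.
        p = (seen[symbol] + alpha) / (total + alpha * len(alphabet))
        total_bits += -math.log2(p)
        # Update the model only after "encoding" the symbol (adaptive coding).
        seen[symbol] += 1
    return total_bits


repetitive = "ACGT" * 50
bits_per_base = fcm_bits(repetitive) / len(repetitive)
print(bits_per_base)
```

A highly repetitive sequence should cost far less than the 2 bits per base of a uniform code, because the model quickly learns which symbol follows each context.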
Since the compressor can be seen as an unsupervised alignment-free method
to estimate the algorithmic complexity of genomic sequences, it is the ideal
candidate to perform analysis of and between sequences. Accordingly, we
define a way to approximate directly the Normalized Information Distance (NID),
aiming to identify evolutionary similarities within and between species. Moreover,
we introduce a new concept, the Normalized Relative Compression (NRC),
that is able to quantify and infer new characteristics of the data, previously
undetected by other methods. We also investigate local measures that are
able to locate specific events, using complexity profiles. Furthermore, we
present and explore a method based on complexity profiles to detect and
visualize genomic rearrangements between sequences, yielding several insights
into the genomic evolution of humans.
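A common compressor-based stand-in for the Normalized Information Distance is the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed size. The sketch below uses Python's general-purpose `zlib` as C purely for illustration; the thesis uses its own genomic compressor, and its direct approximation of the NID may differ from the NCD shown here. The helper names and the random test sequences are assumptions.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Two unrelated pseudo-random DNA-like sequences (fixed seed for repeatability).
rng = random.Random(0)
a = bytes(rng.choice(b"ACGT") for _ in range(2000))
b = bytes(rng.choice(b"ACGT") for _ in range(2000))

print(ncd(a, a))  # identical inputs: expected to be close to 0
print(ncd(a, b))  # unrelated inputs: expected to be close to 1
```

Values near 0 suggest that one sequence is almost fully described by the other, while values near 1 suggest little shared information.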
Finally, we introduce the concept of relative uniqueness and apply it to the
Ebolavirus, identifying three regions that appear in all the virus sequences of the
outbreak but nowhere in the human genome. In fact, we show that these
sequences are sufficient to classify different sub-species. Also, we identify
regions in human chromosomes that are absent from the DNA of closely related primates,
specifying novel traits in human uniqueness.
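The idea of relative uniqueness, regions present in every target sequence but absent from a reference, can be sketched with plain k-mer sets. This is a simplification for illustration and not the method of the thesis, which builds on its compression models; the function names and the toy sequences below are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def relatively_unique_kmers(targets, reference, k):
    """k-mers shared by every target sequence and absent from `reference`."""
    shared = set.intersection(*(kmers(t, k) for t in targets))
    return shared - kmers(reference, k)


# Toy data: two "virus" sequences and one "host" reference.
targets = ["AAACCCGGGTTT", "TTTCCCGGGAAA"]
reference = "AAACCCTTTGGGAAA"
unique = relatively_unique_kmers(targets, reference, 4)
print(sorted(unique))  # → ['CCCG', 'CCGG', 'CGGG']
```

On real genomes the sets would be far too large to hold naively in memory, which is one reason a model-based approach is attractive.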
- …