Multivariate NIR studies of seed-water interaction in Scots Pine Seeds (Pinus sylvestris L.)
This thesis describes seed-water interaction using near infrared (NIR) spectroscopy, multivariate regression models and Scots pine seeds. The presented research covers classification of seed viability, prediction of seed moisture content, selection of NIR wavelengths and interpretation of seed-water interaction modelled and analysed by principal component analysis, ordinary least squares (OLS), partial least squares (PLS), bi-orthogonal least squares (BPLS) and genetic algorithms. The potential of using multivariate NIR calibration models for seed classification was demonstrated using filled viable and non-viable seeds that could be separated with an accuracy of 98-99%. It was also shown that multivariate NIR calibration models gave low errors (0.7% and 1.9%) in prediction of seed moisture content for bulk seed and single seeds, respectively, using either NIR reflectance or transmittance spectroscopy. Genetic algorithms selected three to eight wavelength bands in the NIR region and these narrow bands gave about the same prediction of seed moisture content (0.6% and 1.7%) as using the whole NIR interval in the PLS regression models. The selected regions were simulated as NIR filters in OLS regression, resulting in predictions of the same quality (0.7% and 2.1%). This finding opens possibilities to apply NIR sensors in fast and simple spectrometers for the determination of seed moisture content. Near infrared (NIR) radiation interacts with overtones of vibrating bonds in polar molecules. The resulting spectra contain chemical and physical information. This offers good possibilities to measure seed-water interactions, but also to interpret processes within seeds. It is shown that seed-water interaction involves both transitions and changes mainly in covalent bonds of O-H, C-H, C=O and N-H emanating from ongoing physiological processes like seed respiration and protein metabolism.
I propose that BPLS analysis, which has orthonormal loadings and orthogonal scores and gives the same predictions as conventional PLS regression, should be used as a standard to harmonise the interpretation of NIR spectra.
Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed.
Comment: 61 pages
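As a small illustration of why Euclidean summaries can mislead on the circle, the sketch below compares the arithmetic mean of two compass headings with their mean direction. It is not taken from the reviewed paper; the function name `circular_mean_deg` and the example headings are assumptions made for illustration only.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, wrapped to the 0-360 range.

    Each angle is treated as a unit vector; the mean direction is the
    angle of the summed vector. If the vectors cancel out exactly, the
    mean direction is undefined and the returned value is not meaningful.
    """
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


# Hypothetical headings lying just either side of north.
headings = [350.0, 10.0]
arithmetic = sum(headings) / len(headings)  # treats degrees as plain numbers
circular = circular_mean_deg(headings)      # respects the wrap-around at 360

print("arithmetic mean:", arithmetic)
print("circular mean:  ", circular)
```

The arithmetic mean of these two headings is 180 degrees (due south), whereas the mean direction is numerically indistinguishable from north (0, or equivalently 360, degrees), which is the kind of discrepancy that motivates dedicated directional methods.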
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
Sparse reduced-rank regression for imaging genetics studies: models and applications
We present a novel statistical technique, the sparse reduced-rank regression (sRRR) model,
which is a strategy for multivariate modelling of high-dimensional imaging responses and
genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity
in the regression coefficients, identifying subsets of genetic markers that best explain
the variability observed in subsets of the phenotypes. To properly exploit the rich structure
present in each of the imaging and genetics domains, we additionally propose the use of
several structured penalties within the sRRR model. Using simulation procedures that accurately
reflect realistic imaging genetics data, we present detailed evaluations of the sRRR
method in comparison with the more traditional univariate linear modelling approach. In
all settings considered, we show that sRRR possesses better power to detect the deleterious
genetic variants. Moreover, using a simple genetic model, we demonstrate the potential
benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to
extracting averages over regions of interest in the brain. Since this entails the use of phenotypic
vectors of enormous dimensionality, we suggest the use of a sparse classification
model as a de-noising step, prior to the imaging genetics study. Finally, we present the
application of a data re-sampling technique within the sRRR model for model selection.
Using this approach we are able to rank the genetic markers in order of importance of association
to the phenotypes, and similarly rank the phenotypes in order of importance to
the genetic markers. Lastly, we illustrate the practical application of the proposed
statistical models in three real imaging genetics datasets and highlight some potential
associations.
Doctor of Philosophy dissertation
Rapidly evolving technologies such as chip arrays and next-generation sequencing are uncovering human genetic variants at an unprecedented pace. Unfortunately, this ever-growing collection of gene sequence variation has limited clinical utility without clear association to disease outcomes. As electronic medical records begin to incorporate genetic information, gene variant classification and accurate interpretation of gene test results play a critical role in customizing patient therapy. To verify the functional impact of a given gene variant, laboratories rely on confirming evidence such as previous literature reports, patient history and disease segregation in a family. By definition, variants of uncertain significance (VUS) lack this supporting evidence, and in such cases computational tools are often used to evaluate the predicted functional impact of a gene mutation. This study evaluates leveraging high-quality genotype-phenotype disease variant data from 20 genes and 3986 variants to develop gene-specific predictors utilizing a combination of changes in primary amino acid sequence, amino acid properties as descriptors of mutation severity, and Naïve Bayes classification. A Primary Sequence Amino Acid Properties (PSAAP) prediction algorithm was then combined with well-established predictors in a weighted Consensus sum in the context of gene-specific reference intervals for known phenotypes. PSAAP and Consensus were also used to evaluate known variants of uncertain significance in the RET proto-oncogene as a model gene. The PSAAP algorithm was successfully extended to many genes and diseases. Gene-specific algorithms typically outperform generalized prediction tools. Characteristic mutation properties of a given gene and disease may be lost when diluted into genome-wide data sets.
A reliable computational phenotype classification framework with quantitative metrics and disease-specific reference ranges allows objective evaluation of novel or uncertain gene variants and augments decision making when confirming clinical information is limited.
Chemical Similarity and Threshold of Toxicological Concern (TTC) Approaches: Report of an ECB Workshop held in Ispra, November 2005
There are many national, regional and international programmes – either regulatory or voluntary – to assess the hazards or risks of chemical substances to humans and the environment. The first step in making a hazard assessment of a chemical is to ensure that there is adequate information on each of the endpoints. If adequate information is not available then additional data is needed to complete the dataset for this substance. For reasons of resources and animal welfare, it is important to limit the number of tests that have to be conducted, where this is scientifically justifiable. One approach is to consider closely related chemicals as a group, or chemical category, rather than as individual chemicals. In a category approach, data for chemicals and endpoints that have been already tested are used to estimate the hazard for untested chemicals and endpoints. Categories of chemicals are selected on the basis of similarities in biological activity which is associated with a common underlying mechanism of action.
A homologous series of chemicals exhibiting a coherent trend in biological activity can be rationalised on the basis of a constant change in structure. This type of grouping is relatively straightforward. The challenge lies in identifying the relevant chemical structural and physicochemical characteristics that enable more sophisticated groupings to be made on the basis of similarity in biological activity and hence purported mechanism of action. Linking two chemicals together and rationalising their similarity with reference to one or more endpoints has largely been carried out on an ad hoc basis. Even with larger groups, the process and approach are ad hoc and based on expert judgement. There still appears to be very little guidance about the tools and approaches for grouping chemicals systematically.
In November 2005, the ECB Workshop on Chemical Similarity and Thresholds of Toxicological Concern (TTC) Approaches was convened to identify the approaches currently available to encode similarity and how these can be used to facilitate the grouping of chemicals. This report aims to capture the main themes that were discussed.
In particular, it outlines a number of different approaches that can facilitate the formation of chemical groupings in terms of the context under consideration and the likely information that would be required. Grouping methods were divided into four classes: knowledge-based, analogue-based, unsupervised, and supervised. A flowchart was constructed to capture a possible workflow and to highlight where and how these approaches might best be applied.
JRC.I.3-Toxicology and chemical substance
Development of a deep learning-based computational framework for the classification of protein sequences
Master's dissertation in Bioinformatics
Proteins are one of the more important biological structures in living organisms, since they
perform multiple biological functions. Each protein has different characteristics and properties,
which can be employed in many industries, such as industrial biotechnology, clinical applications,
among others, demonstrating a positive impact.
Modern high-throughput methods allow protein sequencing, which provides the protein
sequence data. Machine learning methodologies are applied to characterize proteins using
information of the protein sequence. However, a major problem associated with this method
is how to properly encode the protein sequences without losing the biological relationship
between the amino acid residues. The transformation of the protein sequence into a numeric
representation is done by encoder methods. In this sense, the main objective of this project is to
study different encoders and identify the methods which yield the best biological representation
of the protein sequences, when used in machine learning (ML) models to predict different labels
related to their function.
The methods were analyzed in two case studies. The first is related to enzymes, since
they are a well-established case in the literature. The second used transporter sequences, a
less-studied case in the literature. In both cases, the data was collected from the curated
database Swiss-Prot. The encoders that were tested include: calculated protein descriptors;
matrix substitution methods; position-specific scoring matrices; and encoding by pre-trained
transformer methods. The use of state-of-the-art pretrained transformers to encode protein
sequences proved to be a good biological representation for subsequent application in state-of-the-art ML methods. Namely, the ESM-1b transformer achieved a Matthews correlation coefficient
above 0.9 for any multiclass classification task of the transporter classification system.
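For readers unfamiliar with the metric reported in this abstract, the following is a minimal sketch of the multiclass Matthews correlation coefficient computed from label counts. It is not the dissertation's code; the function name `multiclass_mcc` and the example labels are hypothetical, and the convention of returning 0.0 when the denominator vanishes is an assumption of this sketch.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from two label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                   # total number of samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)  # correct predictions
    t_counts = Counter(y_true)                        # how often each class occurs
    p_counts = Counter(y_pred)                        # how often each class is predicted
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    spread_true = s * s - sum(v * v for v in t_counts.values())
    spread_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # Assumed convention: report 0.0 when only one class appears on either side.
    return numerator / denominator if denominator else 0.0


# Hypothetical labels for illustration; not data from the study.
true_labels = ["channel", "carrier", "pump", "channel"]
pred_labels = ["channel", "carrier", "pump", "carrier"]
print(multiclass_mcc(true_labels, pred_labels))
```

A value of 1 corresponds to perfect agreement, values near 0 to predictions no better than chance, and negative values to systematic disagreement. Unlike plain accuracy, the coefficient accounts for how often each class occurs and is predicted, which matters when classes are imbalanced.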