60 research outputs found

    Min–Max Hyperellipsoidal Clustering for Anomaly Detection in Network Security

    Get PDF
    A novel hyperellipsoidal clustering technique is presented for an intrusion-detection system in network security. Hyperellipsoidal clusters with maximum intracluster similarity and minimum intercluster similarity are generated from training data sets. The novelty of the technique lies in the fact that the parameters needed to construct higher-order data models in general multivariate Gaussian functions are incrementally derived from the data sets using accretive processes. The technique is implemented in a feedforward neural network that uses a Gaussian radial basis function as the model generator. An evaluation based on the inclusiveness and exclusiveness of samples with respect to specific criteria is applied to accretively learn the output clusters of the neural network. One significant advantage of this approach is its ability to detect individual anomaly types that are hard to detect with other anomaly-detection schemes. Applying this technique, several feature subsets of the tcptrace network-connection records that give above 95% detection at false-positive rates below 5% were identified.
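
    A minimal sketch of the core idea above, assuming Gaussian (hyperellipsoidal) cluster models fitted to normal traffic and a sample flagged as anomalous when no cluster claims it; the feature matrix, cluster count and threshold are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_hyperellipsoids(X_train, n_clusters=5):
    """Fit simple hyperellipsoidal cluster models (mean and inverse covariance per cluster)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_train)
    models = []
    for k in range(n_clusters):
        pts = X_train[km.labels_ == k]
        mu = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(X_train.shape[1])  # regularise
        models.append((mu, np.linalg.inv(cov)))
    return models

def anomaly_score(x, models):
    """Smallest squared Mahalanobis distance to any cluster; large values are anomalous."""
    return min(float((x - mu) @ prec @ (x - mu)) for mu, prec in models)

# Hypothetical usage: rows of X_train would be tcptrace-style connection features.
X_train = np.random.rand(500, 8)
models = fit_hyperellipsoids(X_train)
print(anomaly_score(np.random.rand(8), models) > 20.0)  # 20.0 is an illustrative threshold
```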

    Theoretical Interpretations and Applications of Radial Basis Function Networks

    Get PDF
    Medical applications have usually treated Radial Basis Function Networks simply as Artificial Neural Networks. However, RBFNs are Knowledge-Based Networks that can be interpreted in several ways: Artificial Neural Networks, Regularization Networks, Support Vector Machines, Wavelet Networks, Fuzzy Controllers, Kernel Estimators, and Instance-Based Learners. A survey of these interpretations and of their corresponding learning algorithms is provided, as well as a brief survey of dynamic learning algorithms. RBFN interpretations can suggest applications that are particularly interesting in medical domains.
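
    As a concrete illustration of the kernel-estimator reading of an RBFN mentioned above, here is a minimal sketch: Gaussian basis functions centred on a subset of the training inputs with a linear readout fitted by least squares. The toy data, centre selection and bandwidth are illustrative assumptions.

```python
import numpy as np

def rbf_design(X, centers, gamma=1.0):
    """Gaussian RBF features: phi[i, j] = exp(-gamma * ||x_i - c_j||^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Toy regression with centres chosen as a random subset of the training inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
centers = X[rng.choice(len(X), size=20, replace=False)]

Phi = rbf_design(X, centers, gamma=0.5)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear readout weights
y_hat = rbf_design(X, centers, gamma=0.5) @ w    # predictions on the training inputs
```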

    An Automated Pipeline for Variability Detection and Classification for the Small Telescopes Installed at the Liverpool Telescope

    Get PDF
    The Small Telescopes Installed at the Liverpool Telescope (STILT) is an almost decade-old project to add a number of wide-field optical instruments, named Skycams, to the Liverpool Telescope to monitor weather conditions and yield useful photometry on bright astronomical sources. The motivation behind this thesis is the development of algorithms and techniques which can automatically exploit the data generated during the first 1200 days of Skycam operation to catalogue variable sources in the La Palma sky. A previously developed pipeline reduces the Skycam images and produces photometric time-series data, called light curves, for millions of objects. 590,492 of these objects have 100 or more data points of sufficient quality to attempt a variability analysis. The large volume and relatively high noise of these data necessitated the use of machine learning and sophisticated optimisation techniques to successfully extract this information. The Skycam instruments have no control over the orientation and pointing of the Liverpool Telescope and therefore resample areas of the sky highly irregularly. The term used for this sampling pattern in astronomy is 'cadence'. The unusually irregular Skycam cadence places increased strain on algorithms designed for the detection of periodicity in light curves. This thesis details the development of a period estimation method based on a novel implementation of a genetic algorithm combined with a generational clustering method. Named GRAPE (Genetic Routine for Astronomical Period Estimation), this algorithm deconstructs the space of possible periods for a light curve into regions in which the genetic population clusters. These regions are then fine-tuned using a k-means clustering algorithm to return a set of independent period candidates, which are then analysed using a Vuong closeness test to discriminate between aliased and true periods. This thesis demonstrates the capability of GRAPE on a set of synthetic light curves built using traditional regular cadence sampling and Skycam-style cadence for four different shapes of periodic light curve. The performance of GRAPE on these light curves is compared to a more traditional periodogram, which returns a set of peaks that are then analysed using Vuong closeness tests. GRAPE obtains similar performance to the periodogram on all the light curve shapes but with less computational complexity, allowing for more efficient light curve analysis. Automated classification of variable light curves has been explored over the last decade. Multiple features have been engineered to identify patterns in the light curves of different classes of variable star. Within the last few years, deep learning has come to prominence as a method of automatically generating informative representations of the data for the solution of a desired problem, such as a classification task. A set of models using Random Forests, Support Vector Machines and Neural Networks was trained using a set of variable Skycam light curves of five classes. Using 16 features engineered from previous methods, an Area under the Curve (AUC) of 0.8495 was obtained. Replacing these features with pixel intensities from a 100 by 20 pixel image representation produced an AUC of 0.6348, which improved to 0.7952 when the models were provided with additional context about the dimensionality of the image.
Despite the inferior performance, the importance of the different pixels in the trained models demonstrated that they had learned features based on well-understood patterns in the different classes of light curve. Using features produced by Richards et al. and Kim & Bailer-Jones, a set of features to train machine learning classification models was constructed. In addition to this set of features, a semi-supervised set of novel features was designed to describe the shape of light curves phased around the GRAPE candidate period. This thesis investigates the performance of the PolyFit algorithm of Prsa et al., a technique to fit four piecewise polynomials with discontinuous knots capable of connecting across the phase boundary at phases of zero and one. This method was designed to fit eclipsing binary phased light curves; however, it was also described as fully capable on other variable star types. The optimisation method used by PolyFit is replaced by a novel genetic algorithm optimisation routine to fit the model to Skycam data, with substantial improvement in performance. The PolyFit model is applied at the candidate period and at twice this period for every classified light curve. This interpolation produces novel features which describe similar statistics to the previously developed methods but which appear significantly more resilient to the Skycam noise and are often preferred by the trained models. In addition, Principal Component Analysis (PCA) is used to investigate a set of 6897 variable light curves, showing that the first ten principal components are sufficient to describe 95% of the variance of the fitted models. This trained PCA model is retained and used to generate twenty novel shape features. Whilst these features are not dominant in their importance to the learned models, they have above-average importance and help distinguish some objects in the light curve classification task. The second principal component in particular is an important feature in the discrimination of short-period pulsating and eclipsing variables, as it appears to be an automatically learned robust skewness measure. The method described in this thesis produces 112 features of the Skycam light curves: 38 variability indices which are quickly obtainable and 74 which require the computation of a candidate period using GRAPE. A number of machine learning classifiers are investigated to produce high-performance models for the detection and classification of variable light curves from the Skycam dataset. A Random Forest classifier uses a training set of 859 light curves of 12 object classes to produce a classifier with a multi-class F1 score of 0.533. It would be computationally infeasible to produce all the features for every Skycam light curve; therefore, an automated pipeline has been developed which combines a Skycam trend removal pipeline, GRAPE and our machine-learned classifiers. It initialises with a set of Skycam light curves from objects cross-matched from the American Association of Variable Star Observers (AAVSO) Variable Star Index (VSI), one of the most comprehensive catalogues of variable stars available. The learned models classify these cross-matched light curves using the full 112 features, and confident matches are selected to produce a training set for a binary variability detection model. This model utilises only the 38 variability indices to identify variable light curves rapidly without the use of GRAPE.
This variability model, trained using a Random Forest classifier, obtains an F1 score of 0.702. Applying this model to the 590,492 Skycam light curves yields 103,790 variable candidates, of which 51,129 have been classified and are available for further analysis.
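
    A much-simplified sketch of the GRAPE idea described above: a genetic population of trial periods is evolved against a fitness score and the surviving population is grouped with k-means to yield independent period candidates. The phase-dispersion fitness, population sizes and mutation scheme are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def phase_dispersion(t, y, period, nbins=10):
    """Smaller is better: sum of the variance of y within each phase bin."""
    phase = (t / period) % 1.0
    bins = np.minimum((phase * nbins).astype(int), nbins - 1)
    return sum(y[bins == b].var() for b in range(nbins) if np.any(bins == b))

def grape_like_candidates(t, y, pmin=0.1, pmax=10.0, pop=200, gens=30, k=5, seed=0):
    rng = np.random.default_rng(seed)
    periods = rng.uniform(pmin, pmax, pop)
    for _ in range(gens):
        fit = np.array([phase_dispersion(t, y, p) for p in periods])
        keep = periods[np.argsort(fit)[: pop // 2]]                # selection
        children = keep * rng.normal(1.0, 0.01, size=keep.size)    # mutation
        periods = np.clip(np.concatenate([keep, children]), pmin, pmax)
    # cluster the converged population into k independent period candidates
    centres = KMeans(n_clusters=k, n_init=10).fit(periods.reshape(-1, 1)).cluster_centers_
    return np.sort(centres.ravel())

# Illustrative use on a synthetic sinusoid with irregular (Skycam-like) sampling.
t = np.sort(np.random.default_rng(1).uniform(0, 100, 300))
y = np.sin(2 * np.pi * t / 2.7) + 0.1 * np.random.default_rng(2).normal(size=t.size)
print(grape_like_candidates(t, y))
```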

    An Integrated Fuzzy Inference Based Monitoring, Diagnostic, and Prognostic System

    Get PDF
    To date, the majority of the research related to the development and application of monitoring, diagnostic, and prognostic systems has been exclusive in the sense that only one of the three areas is the focus of the work. While previous research advances each of the respective fields, the end result is a varied grab bag of techniques that address each problem independently. Also, the new field of prognostics is lacking in the sense that few methods have been proposed that produce estimates of the remaining useful life (RUL) of a device or can be realistically applied to real-world systems. This work addresses both problems by developing the nonparametric fuzzy inference system (NFIS), which is adapted for monitoring, diagnosis, and prognosis, and then proposing the path classification and estimation (PACE) model, which can be used to predict the RUL of a device whether or not it has a well-defined failure threshold. To test and evaluate the proposed methods, they were applied to detect, diagnose, and prognose faults and failures in the hydraulic steering system of a deep oil exploration drill. The monitoring system implementing an NFIS predictor and sequential probability ratio test (SPRT) detector produced detection rates comparable to a monitoring system implementing an autoassociative kernel regression (AAKR) predictor and SPRT detector, specifically 80% vs. 85% for the NFIS and AAKR monitors respectively. It was also found that the NFIS monitor produced fewer false alarms. Next, the monitoring system outputs were used to generate symptom patterns for k-nearest neighbor (kNN) and NFIS classifiers that were trained to diagnose different fault classes. The NFIS diagnoser was shown to significantly outperform the kNN diagnoser, with overall accuracies of 96% vs. 89% respectively. Finally, the PACE model implementing the NFIS was used to predict the RUL for different failure modes. The errors of the RUL estimates produced by the PACE-NFIS prognosers ranged from 1.2-11.4 hours with 95% confidence intervals (CI) from 0.67-32.02 hours, which are significantly better than the population-based prognoser estimates with errors of ~45 hours and 95% CIs of ~162 hours.
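
    A minimal sketch of the monitoring step described above: a predictor produces residuals (observed minus predicted) and a sequential probability ratio test decides between a "normal" and a "degraded" mean-shift hypothesis. The Gaussian residual assumption, shift size and error rates are illustrative, not the parameters used in the work.

```python
import numpy as np

def sprt_mean_shift(residuals, sigma=1.0, shift=1.0, alpha=0.01, beta=0.01):
    """Wald SPRT for H0: residual mean = 0 vs H1: mean = shift (Gaussian residuals).
    Returns the sample index at which a fault is declared, or None."""
    upper = np.log((1 - beta) / alpha)   # declare a fault above this bound
    lower = np.log(beta / (1 - alpha))   # accept normal behaviour below this, then restart
    llr = 0.0
    for i, r in enumerate(residuals):
        llr += (shift * r - 0.5 * shift**2) / sigma**2  # log-likelihood ratio increment
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0  # reset and keep monitoring
    return None

# Illustrative residual stream: normal behaviour, then a mean drift after sample 200.
r = np.r_[np.random.default_rng(3).normal(0, 1, 200),
          np.random.default_rng(4).normal(1.5, 1, 100)]
print(sprt_mean_shift(r))
```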

    Regularized model learning in EDAs for continuous and multi-objective optimization

    Get PDF
    Probabilistic modeling is the defining characteristic of estimation of distribution algorithms (EDAs), which determines their behavior and performance in optimization. Regularization is a well-known statistical technique used for obtaining an improved model by reducing the generalization error of estimation, especially in high-dimensional problems. ℓ1-regularization is a type of this technique with the appealing variable selection property, which results in sparse model estimations. In this thesis, we study the use of regularization techniques for model learning in EDAs. Several methods for regularized model estimation in continuous domains based on a Gaussian distribution assumption are presented and analyzed from different aspects when used for optimization in a high-dimensional setting, where the population size of the EDA has a logarithmic scale with respect to the number of variables. The optimization results obtained for a number of continuous problems with an increasing number of variables show that the proposed EDA based on regularized model estimation performs a more robust optimization, and is able to achieve significantly better results for larger dimensions than other Gaussian-based EDAs. We also propose a method for learning a marginally factorized Gaussian Markov random field model using regularization techniques and a clustering algorithm. The experimental results show notable optimization performance on continuous additively decomposable problems when using this model estimation method. Our study also covers multi-objective optimization, and we propose joint probabilistic modeling of variables and objectives in EDAs based on Bayesian networks, specifically models inspired by multi-dimensional Bayesian network classifiers. It is shown that with this approach to modeling, two new types of relationships are encoded in the estimated models in addition to the variable relationships captured in other EDAs: objective-variable and objective-objective relationships. An extensive experimental study shows the effectiveness of this approach for multi- and many-objective optimization. With the proposed joint variable-objective modeling, in addition to the Pareto set approximation, the algorithm is also able to obtain an estimation of the multi-objective problem structure. Finally, the study of multi-objective optimization based on joint probabilistic modeling is extended to noisy domains, where the noise in objective values is represented by intervals. A new version of the Pareto dominance relation for ordering the solutions in these problems, namely α-degree Pareto dominance, is introduced and its properties are analyzed. We show that ranking methods based on this dominance relation can result in competitive performance of EDAs with respect to the quality of the approximated Pareto sets. This dominance relation is then used together with a method for joint probabilistic modeling based on ℓ1-regularization for multi-objective feature subset selection in classification, where six different measures of accuracy are considered as objectives with interval values. The individual assessment of the proposed joint probabilistic modeling and solution ranking methods on datasets with small-medium dimensionality, when using two different Bayesian classifiers, shows that comparable or better Pareto sets of feature subsets are approximated in comparison to standard methods.
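
    As a rough illustration of the regularized Gaussian model learning step described above, here is a sketch of one EDA generation that fits a sparse-precision Gaussian to the selected individuals with the graphical lasso and samples new candidates from it; the objective, selection ratio and regularization strength are illustrative assumptions, not the methods proposed in the thesis.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def eda_generation(population, objective, select_frac=0.3, alpha=0.1, rng=None):
    """One EDA step: select elites, fit an l1-regularized Gaussian, resample."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.apply_along_axis(objective, 1, population)
    elite = population[np.argsort(scores)[: int(select_frac * len(population))]]
    model = GraphicalLasso(alpha=alpha).fit(elite)      # sparse precision matrix
    return rng.multivariate_normal(model.location_, model.covariance_,
                                   size=len(population))

# Illustrative run on a separable quadratic (sphere) objective.
rng = np.random.default_rng(0)
pop = rng.normal(0, 5, size=(100, 20))
for _ in range(15):
    pop = eda_generation(pop, lambda x: np.sum(x**2), rng=rng)
print(round(float(np.sum(pop.mean(axis=0) ** 2)), 3))  # should shrink toward 0
```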

    From Data to Software to Science with the Rubin Observatory LSST

    Full text link
    The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) dataset will dramatically alter our understanding of the Universe, from the origins of the Solar System to the nature of dark matter and dark energy. Much of this research will depend on the existence of robust, tested, and scalable algorithms, software, and services. Identifying and developing such tools ahead of time has the potential to significantly accelerate the delivery of early science from LSST. Developing these collaboratively, and making them broadly available, can enable more inclusive and equitable collaboration on LSST science. To facilitate such opportunities, a community workshop entitled "From Data to Software to Science with the Rubin Observatory LSST" was organized by the LSST Interdisciplinary Network for Collaboration and Computing (LINCC) and partners, and held at the Flatiron Institute in New York, March 28-30th 2022. The workshop included over 50 in-person attendees invited from over 300 applications. It identified seven key software areas of need: (i) scalable cross-matching and distributed joining of catalogs, (ii) robust photometric redshift determination, (iii) software for determination of selection functions, (iv) frameworks for scalable time-series analyses, (v) services for image access and reprocessing at scale, (vi) object image access (cutouts) and analysis at scale, and (vii) scalable job execution systems. This white paper summarizes the discussions of this workshop. It considers the motivating science use cases, identified cross-cutting algorithms, software, and services, their high-level technical specifications, and the principles of inclusive collaborations needed to develop them. We provide it as a useful roadmap of needs, as well as to spur action and collaboration between groups and individuals looking to develop reusable software for early LSST science. Comment: White paper from the "From Data to Software to Science with the Rubin Observatory LSST" workshop

    Multi-view learning and data integration for omics data

    Get PDF
    2015 - 2016. In recent years, the advancement of high-throughput technologies, combined with the constant decrease of data-storage costs, has led to the production of large amounts of data from different experiments that characterise the same entities of interest. This information may relate to specific aspects of a phenotypic entity (e.g. gene expression), or can include the comprehensive and parallel measurement of multiple molecular events (e.g. DNA modifications, RNA transcription and protein translation) in the same samples. Exploiting such complex and rich data is needed in the frame of systems biology for building global models able to explain complex phenotypes. For example, the use of genome-wide data in cancer research, for the identification of groups of patients with similar molecular characteristics, has become a standard approach for applications in therapy response, prognosis prediction, and drug development. Moreover, the integration of gene expression data regarding cell treatment by drugs with information regarding the chemical structure of the drugs has allowed scientists to perform more accurate drug repositioning tasks. Unfortunately, there is a big gap between the amount of information and the knowledge into which it is translated, and there is a huge need for computational methods able to integrate and analyse data to fill this gap. Current research in this area follows two different integrative approaches: one uses the complementary information of different measurements for the study of complex phenotypes on the same samples (multi-view learning); the other tends to infer knowledge about the phenotype of interest by integrating and comparing the experiments relating to it with those of different phenotypes already known, through comparative methods (meta-analysis). Meta-analysis can be thought of as an integrative study of previous results, usually performed by aggregating the summary statistics from different studies. Due to its nature, meta-analysis usually involves homogeneous data. On the other hand, multi-view learning is a more flexible approach that considers the fusion of different data sources to get more stable and reliable estimates. Based on the type of data and the stage of integration, new methodologies have been developed spanning a landscape of techniques comprising graph theory, machine learning and statistics. Depending on the nature of the data and on the statistical problem to address, the integration of heterogeneous data can be performed at different levels: early, intermediate and late. Early integration consists in concatenating data from different views in a single feature space. Intermediate integration consists in transforming all the data sources into a common feature space before combining them. In late integration methodologies, each view is analysed separately and the results are then combined. The purpose of this thesis is twofold: the first objective is the definition of a data integration methodology for patient sub-typing (MVDA) and the second is the development of a tool for the phenotypic characterisation of nanomaterials (INSIdE nano). In this PhD thesis, I present the methodologies and the results of my research. MVDA is a multi-view methodology that aims to discover new statistically relevant patient sub-classes. Identifying patient subtypes of a specific disease is a challenging task, especially for early diagnosis.
This is a crucial point for treatment, because not all the patients affected by the same disease will have the same prognosis or need the same drug treatment. This problem is usually solved by using transcriptomic data to identify groups of patients that share the same gene patterns. The main idea underlying this research work is to combine multiple omics data sets for the same patients to obtain a better characterisation of their disease profile. The proposed methodology is a late integration approach based on clustering. It works by evaluating the patient clusters in each single view and then combining the clustering results of all the views by factorising the membership matrices in a late integration manner. The effectiveness and the performance of our method were evaluated on six multi-view cancer datasets related to breast cancer, glioblastoma, prostate and ovarian cancer. The omics data used for the experiments are gene and miRNA expression, RNASeq and miRNASeq, protein expression and copy number variation. In all the cases, patient sub-classes with statistical significance were found, identifying novel sub-groups not previously emphasised in the literature. The experiments were also conducted by using prior information, as a new view in the integration process, to obtain higher accuracy in patients' classification. The method outperformed single-view clustering on all the datasets; moreover, it performs better when compared with other multi-view clustering algorithms and, unlike other existing methods, it can quantify the contribution of single views to the results. The method has also been shown to be stable when perturbation is applied to the datasets by removing one patient at a time and evaluating the normalized mutual information between all the resulting clusterings. These observations suggest that integration of prior information with genomic features in sub-typing analysis is an effective strategy for identifying disease subgroups. INSIdE nano (Integrated Network of Systems bIology Effects of nanomaterials) is a novel tool for the systematic contextualisation of the effects of engineered nanomaterials (ENMs) in the biomedical context. In recent years, omics technologies have been increasingly used to thoroughly characterise the molecular mode of action of ENMs. It is possible to contextualise the molecular effects of different types of perturbations by comparing their patterns of alterations. While this approach has been successfully used for drug repositioning, a comprehensive contextualisation of the ENM mode of action is still missing to date. The idea behind the tool is to use analytical strategies to contextualise or position the ENM with respect to relevant phenotypes that have been studied in the literature (such as diseases, drug treatments, and other chemical exposures) by comparing their patterns of molecular alteration. This could greatly increase the knowledge of ENM molecular effects and in turn contribute to the definition of relevant pathways of toxicity, as well as help in predicting the potential involvement of ENMs in pathogenetic events or in novel therapeutic strategies. The main hypothesis is that suggestive patterns of similarity between sets of phenotypes could be an indication of a biological association to be further tested in toxicological or therapeutic frames.
Based on the expression signature associated with each phenotype, the strength of similarity between each pair of perturbations has been evaluated and used to build a large network of phenotypes. To ensure the usability of INSIdE nano, a robust and scalable computational infrastructure has been developed to scan this large phenotypic network, and an effective web-based graphical user interface has been built. In particular, INSIdE nano was scanned to search for clique sub-networks: quadruplet structures of heterogeneous nodes (a disease, a drug, a chemical and a nanomaterial) completely interconnected by strong patterns of similarity (or anti-similarity). The predictions have been evaluated for a set of known associations between diseases and drugs, based on drug indications in clinical practice, and between diseases and chemicals, based on literature-based causal exposure evidence, with a focus on the possible involvement of nanomaterials in the most robust cliques. The evaluation of INSIdE nano confirmed that it highlights known disease-drug and disease-chemical connections. Moreover, disease similarities agree with the information based on their clinical features, as do drugs and chemicals, mirroring their resemblance based on chemical structure. Altogether, the results suggest that INSIdE nano can also be successfully used to contextualise the molecular effects of ENMs and infer their connections to other better-studied phenotypes, speeding up their safety assessment as well as opening new perspectives concerning their usefulness in biomedicine. [edited by author]
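
    A minimal sketch of the late-integration idea behind MVDA described above: each omic view is clustered independently, the per-view membership matrices are stacked, and a non-negative matrix factorisation of the stacked matrix yields the final multi-view patient meta-clusters. The choice of k-means and NMF, the cluster count and the synthetic data are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

def mvda_like(views, k=3, seed=0):
    """Late integration: cluster each view, stack membership matrices, factorise."""
    memberships = []
    for X in views:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        memberships.append(np.eye(k)[labels])   # patients x k one-hot membership matrix
    stacked = np.hstack(memberships)            # patients x (k * n_views)
    W = NMF(n_components=k, init="nndsvda", random_state=seed).fit_transform(stacked)
    return W.argmax(axis=1)                     # final meta-cluster label per patient

# Illustrative run: three synthetic "omic" views for the same 90 patients.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1, 2], 30)
views = [rng.normal(truth[:, None] * 2.0, 1.0, size=(90, d)) for d in (50, 30, 20)]
print(mvda_like(views))
```
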
    • …