7 research outputs found

    A direct proof of significant directed random walk

    This paper examines the relationship between the weight and the connectivity of nodes. An equation is formulated to enhance the connectivity of nodes in a directed graph via weights. Using reference data, the adjacency matrix is further enhanced to increase the accessibility of nodes via a vector. The evolution of the random walk is also described. Significant directed random walk is used to demonstrate the importance of weight.
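
As an illustration of how edge weights can change node accessibility in a directed random walk, here is a minimal sketch. It is not the paper's equation: it assumes the common random-walk-with-restart update p ← (1 − r)·Mᵀp + r·p₀, where M is the row-normalised weighted adjacency matrix. The function name, the restart value, and the toy graph are all hypothetical.

```python
def directed_random_walk(weights, p0, restart=0.7, tol=1e-10, max_iter=1000):
    """Random walk with restart on a weighted directed graph (illustrative only).

    weights[i][j] is the assumed weight of the directed edge i -> j
    (0 means no edge); p0 is the initial probability vector.
    """
    n = len(weights)
    # Row-normalise: each node distributes its mass in proportion to edge weight.
    m = []
    for row in weights:
        total = sum(row)
        m.append([w / total if total > 0 else 0.0 for w in row])
    p = list(p0)
    for _ in range(max_iter):
        # Mass flowing into node j along incoming edges, plus the restart term.
        nxt = [
            (1 - restart) * sum(m[i][j] * p[i] for i in range(n)) + restart * p0[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy graph: edges 0->1, 0->2, 1->2, 2->0; only the weights on node 0's edges differ.
uniform = [1 / 3, 1 / 3, 1 / 3]
scores_a = directed_random_walk([[0, 1, 3], [0, 0, 1], [1, 0, 0]], uniform)
scores_b = directed_random_walk([[0, 3, 1], [0, 0, 1], [1, 0, 0]], uniform)
```

In this toy example, raising the weight of the edge 0 -> 1 raises the stationary score of node 1, which is the kind of weight-to-accessibility effect the abstract describes.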

    Specific Tuning Parameter for Directed Random Walk Algorithm Cancer Classification

    Accurate classification of cancerous genes is a central challenge in clinical cancer research. Microarray-based gene biomarkers have demonstrated better performance than traditional clinical parameters. However, individual gene biomarkers are less robust because of poor reproducibility between different patient cohorts. Several methods that incorporate pathway information, such as directed random walk, have been proposed to infer pathway activity. This paper discusses the implementation of a group-specific tuning parameter in the directed random walk algorithm. Gene expression data and pathway data are used as input. In this experiment, more significant pathway activities are identified, which increases the accuracy of cancer classification. A lung cancer gene expression dataset is used as the experimental dataset, on which sDRW is applied to determine significant pathways. More risk-active pathways are identified in this experiment.

    An effective pre-processing phase for gene expression classification

    A raw dataset prepared by researchers comes with a lot of information. Whether that information is useful depends entirely on the requirements and purpose. In machine learning, data pre-processing is the first stage, and it must ensure that the dataset suits the requirements. In significant directed random walk (sDRW), the data pre-processing stage has three steps. First, unwanted attributes and missing values are removed and the data are properly arranged; next, the expression values are normalized; lastly, a filtering method is applied. The first two steps are completed with a Bioconductor package, while the last step is performed within sDRW.
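
The three pre-processing steps can be sketched as follows. This is an illustrative Python stand-in, not the Bioconductor or sDRW code: the abstract does not say which normalization or filter is used, so per-sample z-scoring and a variance filter are assumptions, as are the function name and the `min_variance` threshold.

```python
import statistics


def preprocess(expression, min_variance=0.1):
    """Three-step sketch: drop missing values, normalise, filter.

    expression maps a gene name to its per-sample values; None marks a
    missing value. The threshold and both methods are assumptions.
    """
    # Step 1: remove genes that have any missing value.
    complete = {g: vals for g, vals in expression.items() if None not in vals}
    if not complete:
        return {}
    genes = sorted(complete)
    n_samples = len(complete[genes[0]])

    # Step 2 (assumed method): z-score each sample (column) across genes.
    normalised = {g: [] for g in genes}
    for s in range(n_samples):
        column = [complete[g][s] for g in genes]
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        for g in genes:
            z = (complete[g][s] - mean) / sd if sd > 0 else 0.0
            normalised[g].append(z)

    # Step 3 (assumed method): keep genes whose normalised values vary enough.
    return {
        g: vals
        for g, vals in normalised.items()
        if statistics.pvariance(vals) >= min_variance
    }


toy = {
    "G1": [1.0, 5.0, 9.0],
    "G2": [2.0, None, 3.0],   # dropped in step 1 (missing value)
    "G3": [4.0, 4.0, 4.0],    # dropped in step 3 (no variation)
    "G4": [7.0, 3.0, -1.0],
}
kept = preprocess(toy)
```

The filter runs last so that it sees normalized values, mirroring the order given in the abstract.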

    Identification of pathway and gene markers using enhanced directed random walk for multiclass cancer expression data

    Cancer markers play a significant role in diagnosing the origin of cancers and in detecting cancers from initial treatments. This is a challenging task owing to the heterogeneous nature of cancers. Identifying these markers could help improve the survival rate of cancer patients, because dedicated treatment, or even prevention, can be provided according to the diagnosis. Previous investigations show that pathway topology information could help in detecting cancer markers from gene expression. Such analysis reduces the complexity from thousands of genes to a few hundred pathways. However, most existing methods group different cancer subtypes into a single class of disease samples and assume that all pathways contribute equally to the analysis. Meanwhile, several other methods ignore the interaction between multiple genes and genes with missing edges, which could lead to poor performance in identifying cancer markers from gene expression. Thus, this research proposes an enhanced directed random walk to identify pathway and gene markers for multiclass cancer gene expression data. Firstly, an improved pathway selection with analysis of variance (ANOVA), which allows multiple cancer subtypes to be considered, is performed; subsequently, k-means clustering and the average silhouette method are integrated into the directed random walk to account for the interaction of multiple genes. The proposed methods are tested on benchmark gene expression datasets (breast, lung, and skin cancers) and biological pathways. Their performance is then measured and compared in terms of classification accuracy and area under the receiver operating characteristic curve (AUC). The results indicate that the proposed methods are able to identify a list of pathway and gene markers from the datasets with better classification accuracy and AUC. The proposed methods improve classification performance by between 1% and 35% compared with existing methods. The cell cycle and p53 signaling pathways were found to be significantly associated with breast, lung, and skin cancers, while the cell cycle was highly enriched in squamous cell carcinoma and adenocarcinoma.
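
To make the ANOVA-based pathway selection concrete, the sketch below computes the standard one-way ANOVA F statistic for each pathway's activity scores across cancer subtypes and ranks pathways by it. It illustrates the general technique only; the paper's exact selection procedure, the pathway names, and the activity values used here are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation between group means versus variation inside each group.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_pathways(activity, labels):
    """Rank pathways by F statistic, most subtype-discriminative first.

    activity maps a pathway name to per-sample activity scores;
    labels gives the subtype of each sample in the same order.
    """
    scores = {}
    for pathway, values in activity.items():
        by_class = {}
        for value, label in zip(values, labels):
            by_class.setdefault(label, []).append(value)
        scores[pathway] = anova_f(list(by_class.values()))
    return sorted(scores, key=scores.get, reverse=True)


labels = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
activity = {
    "pathway_flat": [1, 2, 3, 1, 2, 3, 1, 2, 3],      # same in every subtype
    "pathway_shifted": [1, 2, 3, 2, 3, 4, 5, 6, 7],   # differs by subtype
}
ranking = rank_pathways(activity, labels)
```

Because the F statistic compares more than two groups at once, it lets subtypes be kept separate instead of being pooled into a single disease class.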

    Optimisation approaches for data mining in biological systems

    Advances in data acquisition technologies have generated massive amounts of data that present a considerable challenge for analysis. How to mine the data efficiently and automatically, and extract the maximum value by identifying hidden patterns, is an active research area called data mining. This thesis tackles several problems in data mining, including data classification, regression analysis and community detection in complex networks, with considerable applications in various biological systems. First, the problem of data classification is investigated. An existing classifier is adopted from the literature and two novel solution procedures are proposed, which are shown to improve the predictive accuracy of the original method and significantly reduce the computational time. Disease classification using high-throughput genomic data is also addressed. To tackle the problem of analysing a large number of genes against a small number of samples, a new approach is proposed that incorporates extra biological knowledge and constructs higher-level composite features for classification. A novel model is introduced to optimise the construction of composite features. Subsequently, regression analysis is considered, and two piece-wise linear regression methods are presented. The first method partitions one feature into multiple complementary intervals and fits each with a distinct linear function. The other method is a more generalised variant of the first and performs recursive binary partitioning, which permits partitioning of multiple features. Lastly, community detection in complex networks is investigated, and a new optimisation framework is introduced to identify the modular structure hidden in directed networks via optimisation of modularity. A non-linear model is first proposed before its linearised variant is presented. The optimisation framework consists of two major steps: solving the non-linear model to identify a coarse initial partition, and then repeatedly solving the linearised models to refine the network partition.
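
For readers unfamiliar with the objective being optimised, the sketch below evaluates modularity for a given partition of a directed network. It assumes the commonly used directed form Q = (1/m) Σᵢⱼ [Aᵢⱼ − kᵢ(out)·kⱼ(in)/m] δ(cᵢ, cⱼ), and only scores a partition; it does not reproduce the thesis's non-linear or linearised optimisation models. The function name and toy network are hypothetical.

```python
def directed_modularity(edges, community):
    """Modularity of a partition of an unweighted directed graph.

    edges is a list of (source, target) pairs; community maps every
    node to a community label.
    """
    m = len(edges)
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1

    # Observed number of edges that stay inside a community.
    q = sum(1.0 for u, v in edges if community[u] == community[v])

    # Subtract the number expected if edges were rewired at random
    # while keeping every node's in-degree and out-degree.
    nodes = list(community)
    for i in nodes:
        for j in nodes:
            if community[i] == community[j]:
                q -= out_deg.get(i, 0) * in_deg.get(j, 0) / m
    return q / m


# Two directed triangles joined by a single edge 2 -> 3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}
q_split = directed_modularity(edges, split)
q_merged = directed_modularity(edges, merged)
```

Separating the two triangles scores higher than merging all nodes into one community, which is the signal a modularity-maximising method searches for.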

    Integrated analysis of omics data in the impact of diet on cardiometabolic health

    In Canada, cardiovascular disease (CVD) is the second leading cause of death after cancer and one of the leading causes of hospitalization. CVD management is based on the assessment and treatment of several cardiometabolic risk factors, which include metabolic syndrome, physical activity, and diet. A healthy lifestyle, including a balanced diet, remains the cornerstone of CVD prevention. A diet rich in fruits and vegetables is inversely associated with CVD incidence. Biomarkers of dietary exposure are used to study the impact of dietary factors on the development of CVD. Plasma carotenoids, which are biomarkers of fruit and vegetable consumption, are associated with cardiometabolic health. Diet also influences a myriad of omics factors, thus modulating CVD risk. Omics sciences study the complex set of molecules that make up the body. Among these sciences, genomics, epigenomics, transcriptomics, and metabolomics address the large-scale study of genes, DNA methylation, gene expression, and metabolites, respectively. Given that a single type of omics data usually does not capture the complexity of biological processes, an integrative approach combining multiple omics data is ideal for elucidating the pathophysiology of complex traits. Systems biology studies the complex interactions of different omics data among themselves and with the environment, as well as their influence on a trait of interest such as health. There are several methods for analyzing and integrating omics data. Quantitative genetics estimates the contributions of genetic and environmental effects to the variance of complex traits. Weighted correlation network analysis relates a large number of interrelated omics data to a trait, for example a set of risk factors for complex diseases. The general objective of this thesis is to study the impact of omics determinants on the relationship between diet and cardiometabolic health. The first specific objective, using a quantitative genetics approach, is to characterize the heritability of omics data and plasma carotenoids and to check whether their link with cardiometabolic risk factors can be explained by genetic and environmental factors. The second specific objective, using a weighted correlation network approach, is to assess the role of individual and combined omics data in the relationship between plasma carotenoids and lipid profile. This doctoral project is based on the GENERATION observational study, which includes 48 healthy subjects from 16 families. All omics data studied and plasma carotenoids showed familial resemblances due, to varying degrees, to genetic and shared environmental effects. Genetics and environment are also involved in the link between DNA methylation and gene expression, as well as between metabolites, carotenoids, and cardiometabolic risk factors. Moreover, weighted correlation network analysis has provided insight into the interactive molecular system that links carotenoids, DNA methylation, gene expression, and lipid profile. In conclusion, this work, based on individual and combined omics data analyzed with approaches from quantitative genetics and weighted correlation network analysis, has shed light on the relationship between diet and cardiometabolic health.