286 research outputs found

Discovery of error-tolerant biclusters from noisy gene expression data

Author: A Ben-Dor
A Gyenesei
A Poernomo
A Prelic
A Subramanian
A Tanay
C Becquet
C Creighton
C Yang
G Pandey
H Cheng
I Dhillon
J Besson
J Han
J Liu
J Seppänen
M Ashburner
M Zhang
Navneet Rao
R Gupta
R Rastogi
R Srikant
Rohit Gupta
S Bergmann
S Hanhijärvi
SC Madeira
T Calders
T Fukuda
T Hughes
T Mcintosh
Vipin Kumar
Y Cheng
Publication venue: BioMed Central
Publication date: 01/01/2011

An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as an inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, whic
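
To make the distinction between exact and error-tolerant biclusters concrete, here is a minimal Python sketch. It assumes the expression data have already been binarized (1 = gene expressed under a condition), and it uses a simple density threshold as the tolerance rule. The function names, the `max_error` parameter, and the toy matrix are hypothetical illustrations, not the definitions used in the paper.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def bicluster_density(matrix: Matrix, genes: Sequence[int], conditions: Sequence[int]) -> float:
    """Fraction of 1-entries in the submatrix selected by genes x conditions."""
    cells = [matrix[g][c] for g in genes for c in conditions]
    return sum(cells) / len(cells)


def is_exact_bicluster(matrix: Matrix, genes: Sequence[int], conditions: Sequence[int]) -> bool:
    """Exact bicluster: every selected cell is 1."""
    return all(matrix[g][c] == 1 for g in genes for c in conditions)


def is_error_tolerant_bicluster(
    matrix: Matrix, genes: Sequence[int], conditions: Sequence[int], max_error: float = 0.2
) -> bool:
    """Hypothetical tolerance rule: at most max_error of the selected cells may be 0."""
    return 1.0 - bicluster_density(matrix, genes, conditions) <= max_error


# Toy binarized expression matrix (rows = genes, columns = conditions).
expr = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],  # a single "noisy" 0 under condition 1
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
genes, conditions = [0, 1, 2], [0, 1, 2]

print(is_exact_bicluster(expr, genes, conditions))
print(is_error_tolerant_bicluster(expr, genes, conditions))
```

In this toy case a single noisy cell is enough to break the exact pattern, while the tolerant rule still accepts the 3 x 3 block.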

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Big Data Analytics for Complex Systems

Author: Abou Tabl Ashraf Mohamed
Publication venue: University of Windsor Leddy Library
Publication date: 10/09/2019

The evolution of technology in all fields has led modern systems to generate vast amounts of data. Using data to extract information, make predictions, and make decisions is the current trend in artificial intelligence. Advances in big data analytics tools have made accessing and storing data easier and faster than ever, and machine learning algorithms help to identify patterns in data and extract information from it. Current tools and machines in health, computer technologies, and manufacturing can generate massive amounts of raw data about their products or samples. The author of this work proposes a modern integrative system that can utilize big data analytics, machine learning, supercomputer resources, and industrial health machines’ measurements to build a smart system that can mimic the human intelligence skills of observation, detection, prediction, and decision-making. Applications of the proposed smart systems are included as case studies to highlight the contributions of each system. The first contribution is the ability to utilize revolutionary big data and deep learning technologies on production lines to diagnose incidents and take proper action. In the current era of digital industrial transformation, Industry 4.0 has been receiving researchers' attention because it can be used to automate production-line decisions. Reconfigurable manufacturing systems (RMS) have been widely used to reduce the setup cost of restructuring production lines. However, current RMS modules are not linked to the cloud for online decision-making; these modules must connect to an online server (supercomputer) that has big data analytics and machine learning capabilities. Here, "online" means that data is centralized in the cloud (supercomputer) and accessible in real time.
In this study, deep neural networks are utilized to detect the decisive features of a product and build a prediction model with which the iFactory makes the necessary decision for defective products. The Spark ecosystem is used to manage the access, processing, and storage of the streaming big data. This contribution is implemented as a closed cycle; to the best of our knowledge, no one in the literature has introduced big data analysis using deep learning for real-time applications in manufacturing systems. The code achieves a high accuracy of 97% in classifying normal versus defective items. The second contribution, which is in bioinformatics, is the ability to build supervised machine learning approaches based on the gene expression of patients to predict proper treatment for breast cancer. In the trial, to personalize treatment, the machine learns the genes that are active in the patient cohort with a five-year survival period. The initial condition here is that each group must undergo only one specific treatment. After learning about each group (or class), the machine can personalize the treatment of a new patient by analyzing that patient's gene expression. The proposed model will help in the diagnosis and treatment of the patient. Future work in this area involves building a protein-protein interaction network with the selected genes for each treatment, to first analyze the motives of the genes and then target them with the proper drug molecules. In the learning phase, a couple of feature-selection techniques and standard supervised classifiers are used to build the prediction model. Most of the nodes show high performance, with accuracy, sensitivity, specificity, and F-measure close to 100%. The third contribution is the ability to build a semi-supervised learning approach for breast cancer survival treatment that extends the second contribution.
By understanding the relations between the classes, we can design the machine learning phase based on the similarities between classes. In the proposed research, the researcher used a Euclidean distance matrix among the survival treatment classes to build the hierarchical learning model. The distance information, which is learned through an unsupervised approach, can help the prediction model select classes that are far from each other, maximizing the distance between classes and yielding wider class groups. The performance of this approach shows a slight improvement over the second model. Moreover, this model reduced the number of discriminative genes from 47 to 37. The model in the second contribution studies each class individually, while this model focuses on the relationships between the classes and uses this information in the learning phase. Hierarchical clustering is performed to draw the borders between groups of classes before building the classification models. Several distance measures are tested to identify the best linkages between classes. Most of the nodes show high performance, with accuracy, sensitivity, specificity, and F-measure ranging from 90% to 100%. All the case study models showed high performance in the prediction phase. These modern models can be replicated for different problems within different domains. The comprehensive models of the newer technologies are reconfigurable and modular; any newer learning phase can be plugged in at both ends of the learning phase. Therefore, the output of the system can be an input for another learning system, and a newer feature can be added to the input to be considered for the learning phase.
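
The hierarchical step described above (Euclidean distances between classes, followed by hierarchical clustering to group them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each class is represented by a hypothetical centroid vector, single linkage is used as one of several possible linkage choices, and the class names and values are invented for the example rather than taken from the thesis.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(
    centroids: Dict[str, List[float]], a: FrozenSet[str], b: FrozenSet[str]
) -> float:
    """Distance between two clusters = distance of their closest members."""
    return min(euclidean(centroids[x], centroids[y]) for x in a for y in b)


def agglomerate(centroids: Dict[str, List[float]]) -> List[Tuple[List[str], List[str], float]]:
    """Merge the two closest clusters repeatedly; return the merges in order."""
    clusters = [frozenset([name]) for name in centroids]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(centroids, pair[0], pair[1]),
        )
        merges.append((sorted(a), sorted(b), single_linkage(centroids, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical class centroids (e.g. mean expression of two genes per class).
class_centroids = {
    "A": [0.0, 0.0],
    "B": [1.0, 0.0],
    "C": [5.0, 0.0],
    "D": [5.0, 2.0],
}

merges = agglomerate(class_centroids)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

The order of the merges gives the class groupings at each level of the hierarchy; a classifier could then be trained per split, which is one way to read the "hierarchical learning model" in the abstract.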

Scholarship at UWindsor

EgoNet: Identification of human disease ego-network modules

Author: Bai Yun
Qin Zhaohui
Yang Rendong
Yu Tianwei
Publication date: 01/01/2014

Background: Mining novel biomarkers from gene expression profiles for accurate disease classification is challenging due to small sample size and high noise in gene expression measurements. Several studies have proposed integrated analyses of microarray data and protein-protein interaction (PPI) networks to find diagnostic subnetwork markers. However, the neighborhood relationship among network member genes has not been fully considered by those methods, leaving many potential gene markers unidentified. The main idea of this study is to take full advantage of the biological observation that genes associated with the same or similar diseases commonly reside in the same neighborhood of molecular networks. Results: We present EgoNet, a novel method based on egocentric network-analysis techniques, to exhaustively search and prioritize disease subnetworks and gene markers from a large-scale biological network. When applied to a triple-negative breast cancer (TNBC) microarray dataset, the top selected modules contain both known gene markers in TNBC and novel candidates, such as RAD51 and DOK1, which play a central role in their respective ego-networks by connecting many differentially expressed genes. Conclusions: Our results suggest that EgoNet, which is based on the ego network concept, allows the identification of novel biomarkers and provides a deeper understanding of their roles in complex diseases.
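
The ego-network idea can be illustrated with a short Python sketch: the ego network of a gene is the gene plus its direct neighbors, and a gene can be central even when it is not differentially expressed itself. The scoring rule below (count of differentially expressed neighbors) and all gene names are hypothetical stand-ins chosen for illustration; they are not EgoNet's actual search or prioritization procedure.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets from an edge list."""
    adj: Dict[str, Set[str]] = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def ego_network(adj: Dict[str, Set[str]], ego: str) -> Set[str]:
    """Radius-1 ego network: the ego gene plus its direct neighbors."""
    return {ego} | adj.get(ego, set())


def rank_egos(adj: Dict[str, Set[str]], de_genes: Set[str]) -> List[Tuple[str, int]]:
    """Rank genes by how many differentially expressed neighbors they connect."""
    scores = [(gene, len(neighbors & de_genes)) for gene, neighbors in adj.items()]
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical interaction network and differentially expressed (DE) genes.
edges = [("HUB", "g1"), ("HUB", "g2"), ("HUB", "g3"), ("g3", "g4"), ("g4", "g5")]
de_genes = {"g1", "g2", "g4"}

adj = build_adjacency(edges)
ranking = rank_egos(adj, de_genes)
print(sorted(ego_network(adj, "HUB")))
print(ranking[0])
```

Here `HUB` is not differentially expressed itself, yet it ranks first because its ego network contains the most differentially expressed genes.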

Crossref

Springer - Publisher Connector

PubMed Central

Philadelphia College of Osteopathic Medicine: DigitalCommons@PCOM

Artificial neural network techniques to investigate potential interactions between biomarkers

Author: Lemetre C
Publication date: 01/01/2010

High-throughput technologies in biomedical sciences, including gene microarrays, were supposed to revolutionise the post-genomic era but have barely met the great expectations they initially inspired in the biomedical community. Current efforts are still focused on improving the technology, its reproducibility and accuracy. In the meantime, computational techniques for analysing the data from these technologies have made great progress and show encouraging results. New approaches have been developed to extract relevant information from these results. However, important work remains to be done to extract even more meaningful and relevant information. These techniques offer great possibilities to explore the overall dynamics within a living organism. The potential information contained in their output can reveal important leads for deciphering the interconnections, interactions or regulatory influences that can exist between several molecules. Given the increasing interest of the scientific community in exploring these dynamics, some groups have started to develop solutions based on different technologies to extract this interaction-related information. Here we present an Artificial Neural Network-based methodology for the study of interactions in gene transcriptomic data. This will be applied and validated in a breast cancer context.

Nottingham Trent Institutional Repository (IRep)

Computational methods to analyze molecular determinants behind phenotypes

Author: Karinen Sirkku
Publication venue: University of Helsinki Libraries
Publication date: 31/05/2013

A phenotype is a collection of an organism's observable features that can be characterized both at the individual level and at the single-cell level. Phenotypes are largely determined by molecular processes, which also explain their inheritance and plasticity. Some of the molecular background of phenotypes can be characterized by inherited genetic variations and alterations in gene expression. High-throughput measurement technologies enable the measurement of molecular determinants in cells. However, these technologies produce remarkably large data sets, and the research questions have become increasingly complex. Thus, computational methods are needed to discover the molecular mechanisms behind phenotypes. In many cases, analysis of the molecular determinants that contribute to a phenotype proceeds by first identifying putative candidates using a priori information and high-throughput measurements. Further analysis can then focus on the most promising molecules. In many cases, the aim is to identify relevant markers or targets from a set of candidate molecules. Biomedical studies often result in a long list of candidate genes, and interpreting these candidates requires information on their context in cell functions. This context information can give insight into synergistic effects of the molecular machinery in cells when the functions of individual molecules do not explain the observed phenotype. In addition, the context information can be used to generate candidates. One of the methods in this thesis is a computational data integration method that provides a link between candidate genes from molecular pathways and genetic variants. It uses publicly available biological knowledge bases to systematically create a functional context for candidate genes. This approach is especially important when studying cancer, which depends on complex molecular signaling.
Genotypes associated with inherited disease predispositions have been studied successfully in the past; however, traditional methods are not applicable in a wide variety of analysis conditions. Thus, this thesis introduces a method that uses haplotype sharing to identify genetic loci inherited by multiple distantly related individuals. It is flexible and can be used in various settings, even with a very limited number of samples. Increasing the number of biological replicates in gene expression analysis increases the reliability of the results. In many cases, however, the number of samples is limited. Therefore, pooling gene expression data from multiple published studies can increase the understanding of the molecular background behind cell types. This is shown in this thesis by an analysis that identifies gene expression differences between two cell types using publicly available gene expression samples from previous studies. Finally, when candidate molecules are available to characterize phenotypes, they can be compiled into biomarkers. In many cases, a combination of multiple molecules serves as a better biomarker than a single molecule. This thesis also includes a machine learning approach that is used to discover a classifier that predicts the phenotype.

Helsingin yliopiston digitaalinen arkisto

Systems Analytics and Integration of Big Omics Data

Author: Hardiman Gary
Publication venue: MDPI AG
Publication date: 01/01/2020

A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

Directory of Open Access Books (DOAB)

Evaluation of proteomic and transcriptomic biomarker discovery technologies in ovarian cancer

Author: Coveney CRE
Publication date: 01/10/2016

Novel, specific and sensitive biomarkers are a prerequisite for improving the diagnosis and prognosis of patients with ovarian cancer. Firstly, a proteomic bottom-up MALDI-TOF mass spectrometric profiling analysis was conducted on a cohort of sixty serum samples specifically collected for this purpose. An in-house stepwise Artificial Neural Network (ANN) algorithm generated a biomarker panel of m/z peaks that differentiated cancer from age-matched controls with an accuracy of 91% and an error of 9%; identities were inferred where possible, and validation was conducted using ELISA on the same cohort. The lack of complete verification, or of the ability to verify the full panel, led to an in-depth evaluation of the strategy used, with the aim of repeating it with an improved methodology. Following this, a feasibility analysis and evaluation was performed on the next generation of equipment for sample fractionation prior to analysis, using multiple replicates of stock human serum collected in the same way as the ovarian cohort. These results, combined with the limited availability of ovarian cancer cohort samples, altered the trajectory of the project towards the mining of transcriptomic data acquired from an online data repository. A meta-analysis approach was applied to two carefully selected gene expression microarray data sets. ANNs, Cox univariate survival analyses and t-tests were used to filter genes whose expression was consistently and significantly associated with patient survival times. A list of 56 genes was refined from a potential 37000 gene probes to be taken forward for verification, for which further freely available online resources such as STRING, Kaplan Meier Plotter and KEGG were utilised. The list of 56 genes of interest was refined to seven using a larger cohort of transcriptomic data; of these seven, one, EDNRA, was selected for translational verification using immunohistochemistry on a tissue microarray of ovarian cancer specimens.
Significant association is seen with cancer stage, grade and histology. The merits and flaws of the verification are discussed, and future work and directions for research are suggested.

Nottingham Trent Institutional Repository (IRep)

Discovering cancer-associated transcripts by RNA sequencing

Author: Iyer Matthew Kalahasty
Publication date: 01/01/2013

High-throughput sequencing of poly-adenylated RNA (RNA-Seq) in human cancers shows remarkable potential to identify uncharacterized aspects of tumor biology, including gene fusions with therapeutic significance and disease markers such as long non-coding RNA (lncRNA) species. However, the analysis of RNA-Seq data places unprecedented demands upon computational infrastructures and algorithms, requiring novel bioinformatics approaches. To meet these demands, we present two new open-source software packages - ChimeraScan and AssemblyLine - designed to detect gene fusion events and novel lncRNAs, respectively. RNA-Seq studies utilizing ChimeraScan led to discoveries of new families of recurrent gene fusions in breast cancers and solitary fibrous tumors. Further, ChimeraScan was one of the key components of the repertoire of computational tools utilized in data analysis for MI-ONCOSEQ, a clinical sequencing initiative to identify potentially informative and actionable mutations in cancer patients’ tumors. AssemblyLine, by contrast, reassembles RNA sequencing data into full-length transcripts ab initio. In head-to-head analyses AssemblyLine compared favorably to existing ab initio approaches and unveiled abundant novel lncRNAs, including antisense and intronic lncRNAs disregarded by previous studies. Moreover, we used AssemblyLine to define the prostate cancer transcriptome from a large patient cohort and discovered myriad lncRNAs, including 121 prostate cancer-associated transcripts (PCATs) that could potentially serve as novel disease markers. Functional studies of two PCATs - PCAT-1 and SChLAP1 - revealed cancer-promoting roles for these lncRNAs. PCAT-1, a lncRNA expressed from chromosome 8q24, promotes cell proliferation and represses the tumor suppressor BRCA2. SChLAP1, located in a chromosome 2q31 ‘gene desert’, independently predicts poor patient outcomes, including metastasis and cancer-specific mortality.
Mechanistically, SChLAP1 antagonizes the genome-wide localization and regulatory functions of the SWI/SNF chromatin-modifying complex. Collectively, this work demonstrates the utility of ChimeraScan and AssemblyLine as open-source bioinformatics tools. Our applications of ChimeraScan and AssemblyLine led to the discovery of new classes of recurrent and clinically informative gene fusions, and established a prominent role for lncRNAs in coordinating aggressive prostate cancer, respectively. We expect that the methods and findings described herein will establish a precedent for RNA-Seq-based studies in cancer biology and assist the research community at large in making similar discoveries.

PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/120814/1/mkiyer_1.pd

Deep Blue Documents at the University of Michigan

Role of network topology based methods in discovering novel gene-phenotype associations

Author: Güney Emre, 1983-
Publication venue: Universitat Pompeu Fabra
Publication date: 01/01/2012

The cell is governed by complex interactions among various types of biomolecules. Coupled with environmental factors, variations in DNA can cause alterations in normal gene function and lead to a disease condition. Often, such disease phenotypes involve coordinated dysregulation of multiple genes that implicate interconnected pathways. Towards a better understanding and characterization of the mechanisms underlying human diseases, I present GUILD, a network-based disease-gene prioritization framework. GUILD associates genes with diseases using the global topology of the protein-protein interaction network and an initial set of genes known to be implicated in the disease. Furthermore, I investigate the mechanistic relationships between disease genes and explain the robustness emerging from these relationships. I also introduce GUILDify, a user-friendly online tool that prioritizes genes by their association with any user-provided phenotype. Finally, I describe current state-of-the-art systems biology approaches in which network modeling has helped extend our view of diseases such as cancer.

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

Candidate gene prioritization by network analysis of differential expression using machine learning approaches

Author: A Subramanian
A Zanzoni
AJ Smola
AP Francisco
B Aranda
B Harr
Bart de Moor
C Saunders
C Stark
C von Mering
D Nitsch
D Zieker
Daniela Nitsch
F Chung
F Fouss
Fabian Ojeda
GC Cawley
GD Bader
H Yang
HY Chuang
J Chen
JA Hanley
Joana P Gonçalves
JW Park
K Lage
KR Brown
L Franke
L Gautier
L Salwinski
LC Tranchevent
M Liu
P Baldi
P Pagel
R Gupta
RA Irizarry
RI Kondor
RK Nibbe
S Aerts
S Köhler
S Mirkin
S Razick
S Vardhanabhuti
SE Choe
T Fawcett
WK Lim
Y Saad
Yves Moreau
Z Wu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Discovering novel disease genes is still challenging for diseases for which no prior knowledge - such as known disease genes or disease-related pathways - is available. Performing genetic studies frequently results in large lists of candidate genes, of which only a few can be followed up for further investigation. We have recently developed a computational method for constitutional genetic disorders that identifies the most promising candidate genes by replacing prior knowledge with experimental data of differential gene expression between affected and healthy individuals. To improve the performance of our prioritization strategy, we have extended our previous work by applying different machine learning approaches that identify promising candidate genes by determining whether a gene is surrounded by highly differentially expressed genes in a functional association or protein-protein interaction network. Results: We have proposed three strategies for scoring disease candidate genes that rely on network-based machine learning approaches, such as kernel ridge regression, heat kernel, and Arnoldi kernel approximation. For comparison purposes, a local measure based on the expression of the direct neighbors is also computed. We have benchmarked these strategies on 40 publicly available knockout experiments in mice, and performance was assessed against results obtained using a standard procedure in genetics that ranks candidate genes based solely on their differential expression levels (Simple Expression Ranking). Our results showed that our four strategies could outperform this standard procedure and that the best results were obtained using the Heat Kernel Diffusion Ranking, leading to an average ranking position of 8 out of 100 genes, an AUC value of 92.3% and an error reduction of 52.8% relative to the standard procedure approach, which ranked the knockout gene on average at position 17 with an AUC value of 83.7%.
Conclusion: In this study we could identify promising candidate genes using network-based machine learning approaches even if no knowledge is available about the disease or phenotype.
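
The idea of scoring a candidate by whether it is "surrounded by highly differentially expressed genes" can be sketched with a generic heat-kernel diffusion on a graph. The Python example below is a minimal illustration, assuming the unnormalized graph Laplacian L = D - A and the kernel exp(-beta * L), approximated by a truncated Taylor series. The abstract does not state the paper's exact kernel, normalization, or parameters, so `beta`, `terms`, the toy network, and the expression values are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def laplacian(adj: Matrix) -> Matrix:
    """Unnormalized graph Laplacian L = D - A of an undirected graph."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def heat_diffuse(adj: Matrix, signal: Vector, beta: float = 0.5, terms: int = 30) -> Vector:
    """Approximate exp(-beta * L) applied to signal via a truncated Taylor series."""
    lap = laplacian(adj)
    term = list(signal)   # k = 0 term of the series
    total = list(signal)
    for k in range(1, terms + 1):
        # term_k = (-beta / k) * L applied to term_{k-1}
        term = [-beta * x / k for x in matvec(lap, term)]
        total = [t + x for t, x in zip(total, term)]
    return total


# Toy network: gene 0 touches genes 1, 2 and 3; gene 4 touches only gene 3.
adj = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
# Hypothetical differential expression levels; both candidates start at zero.
diff_expr = [0.0, 2.0, 2.0, 2.0, 0.0]
candidates = [0, 4]

scores = heat_diffuse(adj, diff_expr)
ranked = sorted(candidates, key=lambda g: scores[g], reverse=True)
print(ranked)
```

Neither candidate is differentially expressed on its own, so a ranking based only on expression could not separate them; after diffusion, gene 0 scores higher because three differentially expressed neighbors feed into it. For larger networks a library routine for the matrix exponential would be preferable to the hand-rolled series used here.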

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central