1,280 research outputs found

Data mining in bioinformatics using Weka

Author: Frank Eibe
Hall Mark A.
Holmes Geoffrey
Trigg Leonard E.
Witten Ian H.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2004

The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and supports data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it.

CiteSeerX

Research Commons@Waikato

Weka: A machine learning workbench for data mining

Author: Frank Eibe
Hall Mark A.
Holmes Geoffrey
Kirkby Richard Brendon
Pfahringer Bernhard
Witten Ian H.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user interfaces are provided for data exploration, for setting up large-scale experiments on distributed computing platforms, and for designing configurations for streamed data processing. These interfaces constitute an advanced environment for experimental data mining. The system is written in Java and distributed under the terms of the GNU General Public License.

Research Commons@Waikato

Comparative Analysis of Data Mining Tools and Classification Techniques using WEKA in Medical Bioinformatics

Author: David Satish Kumar
Rubeaan Khalid Al
Saeb Amr T.M.
Publication venue: The International Institute for Science, Technology and Education (IISTE)
Publication date: 29/12/2013

The availability of huge amounts of data has created a great need for data mining techniques to generate useful knowledge. In the present study we provide detailed information about data mining techniques, with a focus on classification as an important supervised learning technique. We also discuss the WEKA software as a tool of choice for classification analysis of different kinds of available data. A detailed methodology is provided to help a wide range of users work with the software. The main features of WEKA are 49 data preprocessing tools, 76 classification/regression algorithms, 8 clustering algorithms, 3 algorithms for finding association rules, and 15 attribute/subset evaluators plus 10 search algorithms for feature selection. WEKA extracts useful information from data and enables a suitable algorithm for generating an accurate predictive model from it to be identified. Moreover, medical bioinformatics analyses have been performed to illustrate the usage of WEKA in the diagnosis of leukemia. Keywords: Data mining, WEKA, Bioinformatics, Knowledge discovery, Gene Expression

International Institute for Science, Technology and Education (IISTE): E-Journals

Interpretation of Mutations, Expression, Copy Number in Somatic Breast Cancer: Implications for Metastasis and Chemotherapy

Author: Dorman Stephanie
Publication venue: Scholarship@Western
Publication date: 15/09/2015

Breast cancer (BC) patient management has been transformed over the last two decades due to the development and application of genome-wide technologies. The vast amounts of data generated by these assays, however, create new challenges for accurate and comprehensive analysis and interpretation. This thesis describes novel methods for fluorescence in-situ hybridization (FISH), array comparative genomic hybridization (aCGH), and next generation DNA- and RNA-sequencing, to improve upon current approaches used for these technologies. An ab initio algorithm was implemented to identify genomic intervals of single copy and highly divergent repetitive sequences that were applied to FISH and aCGH probe design. FISH probes with higher resolution than commercially available reagents were developed and validated on metaphase chromosomes. An aCGH microarray was developed that had improved reproducibility compared to the standard Agilent 44K array, which was achieved by placing oligonucleotide probes distant from conserved repetitive sequences. Splicing mutations are currently underrepresented in genome-wide sequencing analyses, and there are limited methods to validate genome-wide mutation predictions. This thesis describes Veridical, a program developed to statistically validate aberrant splicing caused by a predicted mutation. Splicing mutation analysis was performed on a large subset of BC patients previously analyzed by the Cancer Genome Atlas. This analysis revealed an elevated number of splicing mutations in genes involved in NCAM pathways in basal-like and HER2-enriched lymph node positive tumours. Genome-wide technologies were leveraged further to develop chemosensitivity models that predict BC response to paclitaxel and gemcitabine. A type of machine learning, called support vector machines (SVM), was used to create predictive models from small sets of genes that are biologically relevant to drug disposition or resistance. The SVM models generated were able to predict sensitivity in two groups of independent patient data. High variability between individuals requires more accurate and higher resolution genomic data. However, the data themselves are insufficient; also needed are more insightful analytical methods to fully exploit these data. This dissertation presents both improvements in data quality and accuracy as well as analytical procedures, with the aim of detecting and interpreting critical genomic abnormalities that are hallmarks of BC subtypes, metastasis and therapy response.

Scholarship@Western

Identification of genome wide host RNA biomarkers for infectious diseases

Author: Barral Arca Ruth
Publication date: 01/01/2020

There is a genetic predisposition in humans to susceptibility to, and severity of, infectious diseases. Not everyone in close contact with pathogens becomes infected and develops disease; in general, most patients show mild or moderate symptoms, and only a minority develop severe disease. In this thesis we focus on the study of gene expression signatures, since the transcriptome is a bridge between the information contained in our genes and the phenotype. Our results demonstrate the potential of using host transcriptomic signatures in clinical practice as clinical tests for diagnosis, prognosis or risk assessment.

Repositorio Institucional da Universidade de Santiago de Compostela

Non-Unique oligonucleotide probe selection heuristics

Author: Wang Lili
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2008

The non-unique probe selection problem consists of selecting both unique and non-unique oligonucleotide probes for oligonucleotide microarrays, which are widely used tools to identify viruses or bacteria in biological samples. The non-unique probes, designed to hybridize to at least one target, are used as alternatives when the design of unique probes is particularly difficult for closely related target genes. The goal of the non-unique probe selection problem is to determine a smallest set of probes able to identify all targets present in a biological sample. This problem is known to be NP-hard. In this thesis, several novel heuristics are presented, based on a greedy strategy, genetic algorithms and an evolutionary strategy respectively, for the minimization problem arising from non-unique probe selection using the best-known ILP formulation. Experimental results show that our methods are capable of reducing the number of probes required compared with state-of-the-art methods.
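
To make the greedy idea in the abstract above concrete, the following is a minimal sketch of a greedy heuristic for a simplified version of the problem. It only enforces that every target is hit by at least one selected probe; the separation constraints of the full ILP formulation are not modelled, and this is not the thesis's own heuristic. The function name `greedy_probe_cover` and the toy probe-target data are hypothetical.

```python
def greedy_probe_cover(probe_targets):
    """Greedily pick probes until every target is hit by at least one probe.

    probe_targets maps a probe name to the set of targets it hybridizes to.
    Assumption: only the coverage constraint is handled here, not the
    separation constraints of the full non-unique probe selection problem.
    """
    uncovered = set().union(*probe_targets.values()) if probe_targets else set()
    chosen = []
    while uncovered:
        # Pick the probe hitting the most still-uncovered targets;
        # iterating in sorted order makes ties resolve to the first name.
        best = max(sorted(probe_targets),
                   key=lambda p: len(probe_targets[p] & uncovered))
        chosen.append(best)
        uncovered -= probe_targets[best]
    return chosen


# Hypothetical toy incidence data (not taken from the thesis).
probes = {
    "p1": {"t1", "t2"},
    "p2": {"t2", "t3", "t4"},
    "p3": {"t4"},
    "p4": {"t1"},
}
selected = greedy_probe_cover(probes)
print(selected)
```

A greedy pass like this gives no optimality guarantee, which is one reason heuristics such as genetic algorithms are explored for the same minimization problem.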

Scholarship at UWindsor

Sparse graphical models for cancer signalling

Author: Hill Steven M. (Mark)

Protein signalling networks play a key role in cellular function, and their dysregulation is central to many diseases, including cancer. Recent advances in biochemical technology have begun to allow high-throughput, data-driven studies of signalling. In this thesis, we investigate multivariate statistical methods, rooted in sparse graphical models, aimed at probing questions in cancer signalling. First, we propose a Bayesian variable selection method for identifying subsets of proteins that jointly influence an output of interest, such as drug response. Ancillary biological information is incorporated into inference using informative prior distributions. Prior information is selected and weighted in an automated manner using an empirical Bayes formulation. We present examples of informative pathway and network-based priors, and illustrate the proposed method on both synthetic and drug response data. Second, we use dynamic Bayesian networks to perform structure learning of context-specific signalling network topology from proteomic time-course data. We exploit a connection between variable selection and network structure learning to efficiently carry out exact inference. Existing biology is incorporated using informative network priors, weighted automatically by an empirical Bayes approach. The overall approach is computationally efficient and essentially free of user-set parameters. We show results from an empirical investigation, comparing the approach to several existing methods, and from an application to breast cancer cell line data. Hypotheses are generated regarding novel signalling links, some of which are validated by independent experiments. Third, we describe a network-based clustering approach for the discovery of cancer subtypes that differ in terms of subtype-specific signalling network structure. Model-based clustering is combined with penalised likelihood estimation of undirected graphical models to allow simultaneous learning of cluster assignments and cluster-specific network structure. Results are shown from an empirical investigation comparing several penalisation regimes, and an application to breast cancer proteomic data.

Warwick Research Archives Portal Repository

Data Mining of Biomedical Databases

Author: MELONI ANTONELLA
Publication venue: 'Pisa University Press'
Publication date: 10/04/2011

Data mining can be defined as the nontrivial extraction of implicit, previously unknown and potentially useful information from data. This thesis is focused on Data Mining in Biomedicine, representing one of the most interesting fields of application. Different kinds of biomedical data sets would require different data mining approaches. Two approaches are treated in this thesis, divided into two separate and independent parts. The first part deals with Bayesian Networks, representing one of the most successful tools for medical diagnosis and therapy follow-up. Formally, a Bayesian Network (BN) is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. An algorithm for Bayesian network structure learning that is a variation of the standard search-and-score approach has been developed. The proposed approach overcomes the creation of redundant network structures that may include non-significant connections between variables. In particular, the algorithm finds which relationships between the variables must be prevented, by exploiting the binarization of a square matrix containing the mutual information (MI) among all pairs of variables. Four different binarization methods are implemented. The MI binary matrix is exploited as a pre-conditioning step for the subsequent greedy search procedure that optimizes the network score, reducing the number of possible search paths in the greedy search procedure. This approach has been tested on four different datasets and compared against the standard search-and-score algorithm as implemented in the DEAL package, with successful results. Moreover, a comparison among different network scores has been performed. The second part of this thesis is focused on data mining of microarray databases. An algorithm able to perform the analysis of Illumina microRNA microarray data in a systematic and easy way has been developed. The algorithm includes two parts. The first part is the pre-processing, characterized by two steps: variance stabilization and normalization. Variance stabilization has to be performed to abrogate or at least reduce the heteroskedasticity while normalization has to be performed to minimize systematic effects that are not constant among different samples of an experiment and that are not due to the factors under investigation. Three alternative variance stabilization strategies and three alternative normalization approaches are included. So, considering all the possible combinations between variance stabilization and normalization strategies, 9 different ways to pre-process the data are obtained. The second part of the algorithm deals with the statistical analysis for the differential expression detection. Linear models and empirical Bayes methods are used. The final result is the list of the microRNAs significantly differentially-expressed in two different conditions. The algorithm has been tested on three different real datasets and partially validated with an independent approach (quantitative real time PCR). Moreover, the influence of the use of different preprocessing methods on the discovery of differentially expressed microRNAs has been studied and a comparison among the different normalization methods has been performed. This is the first study comparing normalization techniques for Illumina microRNA microarray data.
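
The mutual-information pre-conditioning step mentioned in the abstract above can be illustrated with a small sketch. It computes pairwise MI between discrete variables and binarizes the resulting matrix so that only sufficiently dependent pairs remain candidates for an edge. The abstract does not specify the four binarization methods, so the fixed threshold used here is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def binarized_mi_matrix(columns, threshold):
    """Return a 0/1 matrix: 1 where an edge between two variables stays allowed.

    Assumption: a single fixed threshold stands in for the binarization
    methods of the thesis, which the abstract does not describe.
    """
    names = list(columns)
    return {a: {b: int(a != b and
                       mutual_information(columns[a], columns[b]) >= threshold)
                for b in names}
            for a in names}


# Hypothetical toy data: B copies A, C is independent of both.
data = {
    "A": [0, 0, 1, 1],
    "B": [0, 0, 1, 1],
    "C": [0, 1, 0, 1],
}
allowed = binarized_mi_matrix(data, threshold=0.5)
print(allowed)
```

A subsequent greedy search would then only consider edges whose entry is 1, which is how such a matrix can reduce the number of search paths.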

Electronic Thesis and Dissertation Archive - Università di Pisa

Adapted Boolean network models for extracellular matrix formation

Author: A Barchowsky
A Schwachula
A Trabandt
AE Postlethwaite
AM Abeles
B Ganter
BM Bolstad
C Buttner
C Ritchlin
Dirk Koczan
Dirk Pohlers
E Dimitrova
E Karouzakis
EM Gravallese
EP Newberry
ET Andreakos
F Gaultier
F Verrecchia
FC Arnett
G Karsenty
G Kervizic
G Rogler
G Stumme
GR Burmester
GS Firestein
H Asahara
H Hacker
HG Welgus
I Berger
J Gebert
J Wollbold
JA Hartigan
JH Ward
JJ Wu
JM Hernandez
Johannes Wollbold
JS Smolen
KM Stuhlmeier
LA White
LC Huber
LJ Steggles
M Hecker
M Kaytoue
M Mizui
M Xue
ML Handel
MP Bombara
P Angel
P Shannon
PF Lambert
PG Conaghan
R Altman
R Huber
R Laubenbacher
R Rossignol
Raimund W Kinne
Reinhard Guthke
René Huber
RW Kinne
S Klamt
S Martin
S Ross
SA Kauffman
SG Pereira
SI Hirai
SO Kuznetsov
SS McCachren
T Schlitt
T Zimmermann
Ulrike Gausmann
V Bours
VC Foletta
Y Sun
Y Yamanishi
Z Han
Z Werb
Publication venue: BioMed Central
Publication date: 01/01/2009

Crossref

Springer - Publisher Connector

PubMed Central

Arabidopsis Coexpression Tool: a tool for gene coexpression analysis in Arabidopsis thaliana

Author: Angelopoulou Antonia
Daras Gerasimos
Duddy William
Georgia Saxami
Hatzopoulos Polydefkis
Jen Chih-Hung
Malatras Apostolos
Michalopoulos Ioannis
Westhead David
Zogopoulos Vasileios
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

Gene coexpression analysis refers to the discovery of sets of genes which exhibit similar expression patterns across multiple transcriptomic data sets, such as microarray experiment data in public repositories. Arabidopsis Coexpression Tool (ACT), a gene coexpression analysis web tool for Arabidopsis thaliana, identifies genes which are correlated to a driver gene. Primary microarray data from the ATH1 Affymetrix platform were processed with the Single-Channel Array Normalization algorithm and combined to produce a coexpression tree which contains ∼21,000 A. thaliana genes. ACT was developed to present subclades of coexpressed genes, as well as to perform gene set enrichment analysis, being unique in revealing enriched transcription factors targeting coexpressed genes. ACT offers a simple and user-friendly interface producing working hypotheses which can be experimentally verified for the discovery of gene partnership, pathway membership, and transcriptional regulation. ACT analyses have been successful in identifying not only genes with coordinated ubiquitous expression but also genes with tissue-specific expression.
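
As an illustration of what "genes correlated to a driver gene" can mean in practice, here is a minimal sketch that ranks genes by Pearson correlation with a chosen driver across samples. The abstract does not state which correlation measure or tree-building method ACT uses, so Pearson correlation is an assumption, and the gene names, expression values and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_coexpressed(expression, driver):
    """Rank all other genes by correlation with the driver, highest first."""
    ref = expression[driver]
    scores = {g: pearson(ref, v) for g, v in expression.items() if g != driver}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical expression values across four samples (not real ATH1 data).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
ranking = rank_coexpressed(expression, "geneA")
for gene, r in ranking:
    print(gene, round(r, 3))
```

Genes at the top of such a ranking are candidates for shared pathway membership or regulation, which would still need experimental verification.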

Directory of Open Access Journals

HAL-Inserm

PubMed Central

Ulster University's Research Portal

HAL-CEA

White Rose Research Online