A graph-based representation of Gene Expression profiles in DNA microarrays
This paper proposes a new and very flexible data model, called the gene expression graph (GEG), for gene expression analysis and classification. Three features differentiate GEGs from other available microarray data representation structures: (i) the memory occupation of a GEG is independent of the number of samples used to build it; (ii) a GEG expresses relationships among expressed and non-expressed genes more clearly, in experiments on both healthy and diseased tissues; (iii) GEGs make very efficient classifiers easy to implement. The paper also presents a simple classifier for sample-based classification to show the flexibility and user-friendliness of the proposed data structure.
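The sample-count independence claimed in (i) follows from accumulating co-expression counts into a fixed graph over genes as samples stream in. A minimal sketch of this idea in Python (the gene names, threshold and count-based weighting are illustrative assumptions, not the paper's exact GEG definition):

```python
from itertools import combinations

def build_geg(samples, threshold=1.0):
    """Build a toy gene expression graph: nodes are genes, and an edge links
    two genes whenever both are expressed (above threshold) in the same
    sample.  Edge weights count co-expression events, so the structure's
    size depends only on the number of genes, not the number of samples."""
    edges = {}  # (gene_a, gene_b) -> co-expression count
    for sample in samples:  # sample: dict mapping gene -> expression level
        expressed = sorted(g for g, v in sample.items() if v >= threshold)
        for a, b in combinations(expressed, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Three toy samples; adding more samples grows the counts, not the graph.
samples = [
    {"TP53": 2.1, "BRCA1": 1.5, "EGFR": 0.2},
    {"TP53": 1.8, "BRCA1": 0.4, "EGFR": 1.9},
    {"TP53": 2.5, "BRCA1": 1.1, "EGFR": 1.2},
]
graph = build_geg(samples)
# edge ("BRCA1", "TP53") is co-expressed in samples 1 and 3 -> weight 2
```

A classifier can then compare a new sample's expressed-gene set against such graphs built for healthy and diseased cohorts.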
A data mining framework based on boundary-points for gene selection from DNA-microarrays: Pancreatic Ductal Adenocarcinoma as a case study
Gene selection (or feature selection) from DNA-microarray data can be approached with different techniques, generally involving statistical tests, data mining and machine learning. In recent years there has been increasing interest in using hybrid sets of techniques to face the problem of meaningful gene selection; nevertheless, this issue remains a challenge. In an effort to address the situation, this paper proposes a novel hybrid framework based on data mining techniques and tuned to select gene subsets that are meaningfully related to the target disease studied in DNA-microarray experiments. For this purpose, the framework combines approaches such as statistical significance tests, cluster analysis, evolutionary computation, visual analytics and boundary points. The latter is the core technique of our proposal, allowing the framework to define two methods of gene selection. Another novelty of this work is the inclusion of patient age as an additional factor in the analysis, which can lead to more insight into the disease. In fact, the results reached in this research have been very promising and have shown their biological validity. Hence, our proposal has resulted in a methodology that can be followed in the gene selection process from DNA-microarray data
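The statistical-significance stage of such a framework can be illustrated with a simple Welch t-statistic filter that ranks genes by differential expression between two groups. This is only a sketch of one ingredient; the boundary-point methods at the core of the paper are not reproduced here, and the gene names and data are invented for illustration:

```python
import math

def t_statistic(a, b):
    """Welch t-statistic between two expression vectors (e.g. healthy vs disease)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def rank_genes(expr, labels, top=2):
    """Rank genes by |t|: expr maps gene -> per-sample values, labels marks groups."""
    scored = []
    for gene, values in expr.items():
        group0 = [v for v, l in zip(values, labels) if l == 0]
        group1 = [v for v, l in zip(values, labels) if l == 1]
        scored.append((abs(t_statistic(group0, group1)), gene))
    return [g for _, g in sorted(scored, reverse=True)[:top]]

# Toy data: geneA is clearly differential, geneB is not.
expr = {
    "geneA": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],
    "geneB": [2.0, 2.1, 1.9, 2.05, 1.95, 2.0],
}
labels = [0, 0, 0, 1, 1, 1]
top_genes = rank_genes(expr, labels, top=1)
```

In the paper's framework this kind of filter would be one input among several (clustering, evolutionary search, boundary points) rather than the final selection.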
Data analysis methods for copy number discovery and interpretation
Copy number variation (CNV) is an important type of genetic variation that can give rise to a wide variety of phenotypic traits. Differences in copy number are thought to play major roles in processes that involve dosage-sensitive genes, providing beneficial, deleterious or neutral modifications to individual phenotypes. Copy number analysis has long been a standard in clinical cytogenetic laboratories. Gene deletions and duplications can often be linked with genetic syndromes such as the 7q11.23 deletion of Williams-Beuren syndrome, the 22q11 deletion of DiGeorge syndrome and the 17p11.2 duplication of Potocki-Lupski syndrome. Interestingly, copy-number-based genomic disorders often display reciprocal deletion/duplication syndromes, with the latter frequently exhibiting milder symptoms. Moreover, the study of chromosomal imbalances plays a key role in cancer research. The datasets used for the development of analysis methods during this project were generated as part of the cutting-edge translational project Deciphering Developmental Disorders (DDD). The DDD project is the first of its kind and directly applies state-of-the-art technologies, in the form of ultra-high-resolution microarrays and next generation sequencing (NGS), to real-time genetic clinical practice. It is a collaboration between the Wellcome Trust Sanger Institute (WTSI) and the National Health Service (NHS) involving the 24 regional genetic services across the UK and Ireland. Although the application of DNA microarrays for the detection of CNVs is well established, individual change-point detection algorithms often display variable performance. The definition of an optimal set of parameters for achieving a certain level of performance is rarely straightforward, especially where data qualities vary ... [cont.]
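The change-point detection step mentioned above can be illustrated with a minimal single-breakpoint search: fit a two-segment piecewise-constant model to array log2-ratio data and pick the split minimising the residual error. This is a generic sketch, not any of the specific algorithms evaluated in the project, and the probe values are invented:

```python
def best_change_point(signal, min_seg=2):
    """Locate a single change point by minimising the summed squared error
    of a two-segment piecewise-constant fit -- the scoring idea behind many
    segmentation methods for array-based copy number data."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)
    best_k, best_cost = None, float("inf")
    for k in range(min_seg, len(signal) - min_seg + 1):
        cost = sse(signal[:k]) + sse(signal[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Toy probe log2 ratios: a jump after probe 4 suggests a duplication.
log2_ratios = [0.0, 0.1, -0.1, 0.05, 0.9, 1.1, 1.0, 0.95]
breakpoint_index = best_change_point(log2_ratios)
```

Real CNV callers extend this idea recursively (binary segmentation) or with HMMs, and the parameter sensitivity the text describes shows up here as choices like `min_seg` and noise thresholds.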
Assessment of gene order computing methods for Alzheimer’s disease
This article was originally published by BMC Medical Genomics in 2013. doi:10.1186/1755-8794-6-S1-S8. Background: Computational genomics of Alzheimer's disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order, which generates higher quality gene clustering patterns than most other clustering methods. However, few gene order computing methods are available, such as the Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set forth to evaluate the performance of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation.
Methods: Using different distance formulas (Pearson distance, Euclidean distance and the squared Euclidean distance) and other conditions, gene orders were calculated by the ACO and GA (including standard GA and improved GA) methods, respectively. The qualities of the resulting gene orders were compared, and new features of the calculated gene orders were identified.
Results: Compared to the GA methods tested in this study, ACO fits the AD microarray data the best when calculating gene order. In addition, the following features were revealed: different distance formulas generated a different quality of gene order, and the commonly used Pearson distance was not the best distance formula when used with both GA and ACO methods for AD microarray data.
Conclusion: Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by the GA and ACO methods. The work was supported by the BWH Radiology and MGH Psychiatry research funds (to X. Huang) and the Technology Innovation fund (No. 09zz028) of the Key Developing Program from the Education Department of Sichuan Province, China.
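The three distance formulas compared in the study have straightforward definitions, and a small sketch makes their relationships explicit: the squared Euclidean distance is simply the Euclidean distance without the square root, and the Pearson distance is one minus the correlation coefficient. The toy vectors are invented:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Euclidean distance without the square root; penalises large gaps more."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for perfectly correlated profiles,
    2 for perfectly anti-correlated ones; ignores scale and offset."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)

# [1,2,3] and [2,4,6] are perfectly correlated, so their Pearson
# distance is 0 even though their Euclidean distance is not.
d_pearson = pearson_distance([1, 2, 3], [2, 4, 6])
```

The study's finding that these formulas yield different-quality gene orders is plausible from the definitions alone: Pearson distance groups profiles by shape, while the Euclidean variants group them by magnitude.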
optCluster: an R package for determining the optimal clustering algorithm and optimal number of clusters.
Determining the best clustering algorithm and the ideal number of clusters for a particular dataset is a fundamental difficulty in unsupervised clustering analysis. In biological research, data generated by Next Generation Sequencing technology and microarray gene expression data are becoming more and more common, so new tools and resources are needed to group such high-dimensional data using clustering analysis. Different clustering algorithms can group data very differently. Therefore, there is a need to determine the best groupings in a given dataset using the most suitable clustering algorithm for that data. This paper presents the R package optCluster as an efficient way for users to evaluate up to ten clustering algorithms, ultimately determining the optimal algorithm and optimal number of clusters for a given set of data. The selected clustering algorithms are evaluated by as many as nine validation measures classified as “biological”, “internal”, or “stability”, and the final result is obtained through a weighted rank aggregation algorithm based on the calculated validation scores. Two examples using this package are presented, one with a microarray dataset and the other with an RNA-Seq dataset. These two examples highlight the capabilities of the optCluster package and demonstrate its usefulness as a tool in cluster analysis.
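The weighted rank aggregation idea can be sketched with a simple Borda-style scheme: rank the candidate algorithms under each validation measure, then combine the weighted ranks. Note that this is a simplified illustration, not the aggregation algorithm actually used by optCluster, and the measure names and scores below are invented:

```python
def aggregate_ranks(scores, weights, higher_is_better=True):
    """Borda-style weighted rank aggregation: rank candidates under each
    validation measure, sum weighted ranks, and return the candidate with
    the lowest total (i.e. the best overall ranking)."""
    algorithms = list(next(iter(scores.values())).keys())
    totals = {alg: 0.0 for alg in algorithms}
    for measure, per_alg in scores.items():
        ordered = sorted(per_alg, key=per_alg.get, reverse=higher_is_better)
        for rank, alg in enumerate(ordered, start=1):
            totals[alg] += weights.get(measure, 1.0) * rank
    return min(totals, key=totals.get)

# Hypothetical validation scores for two algorithms under two measures.
scores = {
    "silhouette": {"kmeans": 0.7, "hierarchical": 0.6},
    "dunn":       {"kmeans": 0.4, "hierarchical": 0.5},
}
weights = {"silhouette": 2.0, "dunn": 1.0}
best_algorithm = aggregate_ranks(scores, weights)
```

In practice each candidate is really an (algorithm, k) pair, so the same aggregation simultaneously selects the algorithm and the number of clusters.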
Gene regulatory network modelling with evolutionary algorithms - an integrative approach
Building models for gene regulation has been an important aim of Systems Biology over the past years, driven by the large amount of gene expression data that has become available. Models represent regulatory interactions between genes and transcription factors and can provide better understanding of biological processes, and means of simulating both natural and perturbed systems (e.g. those associated with disease). Gene regulatory network (GRN) quantitative modelling is still limited, however, due to data issues such as noise and the restricted length of the time series typically used for GRN reverse engineering. These issues create an under-determination problem, with many models possibly fitting the data. However, large amounts of other types of biological data and knowledge are available, such as cross-platform measurements, knockout experiments, annotations, binding site affinities for transcription factors and so on. It has been postulated that integrating these can improve the quality of the models obtained, by facilitating further filtering of possible models. However, integration is not straightforward, as the different types of data can provide contradictory information and are intrinsically noisy; hence large-scale integration has not been fully explored to date. Here, we present an integrative parallel framework for GRN modelling, which employs evolutionary computation and different types of data to enhance model inference. Integration is performed at different levels. (i) An analysis of cross-platform integration of time series microarray data is presented, discussing the effects on the resulting models and exploring cross-platform normalisation techniques. This shows that time-course data integration is possible, and results in models more robust to noise and parameter perturbation, as well as reduced noise over-fitting. (ii) Other types of measurements and knowledge, such as knock-out experiments, annotated transcription factors, binding site affinities and promoter sequences, are integrated within the evolutionary framework to obtain more plausible GRN models. This is performed by customising initialisation, mutation and evaluation of candidate model solutions. The different data types are investigated and both qualitative and quantitative improvements are obtained. Results suggest that caution is needed in order to obtain improved models from combined data, and the case study presented here provides an example of how this can be achieved. Furthermore, (iii) RNA-seq data is studied in comparison to microarray experiments, to identify overlapping features and possibilities of integration within the framework. The extension of the framework to this data type is straightforward, and qualitative improvements are obtained when combining predicted interactions from single-channel and RNA-seq datasets
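The evolutionary-computation core of such a framework can be sketched as a small (mu+lambda)-style loop that mutates real-valued parameter vectors and keeps the fitter half each generation, here applied to a toy one-gene "network" whose single regulatory weight is recovered from a short time series. The model form, population sizes and data are illustrative assumptions, not the actual framework described above:

```python
import random

def evolve(fitness, n_params, pop_size=30, generations=60, sigma=0.3, seed=1):
    """Minimal (mu+lambda) evolutionary loop: each generation, every parent
    spawns a Gaussian-mutated child, and the best pop_size individuals out
    of parents + children survive (so the best fitness never worsens)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        children = [[g + rng.gauss(0, sigma) for g in p] for p in pop]
        pop = sorted(pop + children, key=fitness)[:pop_size]
    return pop[0]

# Toy "GRN" inference: recover w in x[t+1] = w * x[t] from a short series.
series = [1.0]
for _ in range(5):
    series.append(0.8 * series[-1])

def fitness(params):
    """Squared prediction error of the candidate regulatory weight."""
    w = params[0]
    return sum((series[t + 1] - w * series[t]) ** 2 for t in range(len(series) - 1))

best = evolve(fitness, n_params=1)
```

Integrating extra data types, as described above, amounts to customising the initialisation of `pop`, the mutation step, and the `fitness` function with additional penalty or prior terms.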
Optimizing parameters in fuzzy k-means for clustering microarray data.
Rapid advances in microarray technologies are making it possible to analyze and manipulate large amounts of gene expression data. Clustering algorithms, such as hierarchical clustering, self-organizing maps, k-means clustering and fuzzy k-means clustering, have become important tools for expression analysis of microarray data. However, the need for prior knowledge of the number of clusters, k, and the fuzziness parameter, b, limits the usage of fuzzy clustering. Few approaches have been proposed for assigning the best possible values to such parameters. In this thesis, we use simulated annealing and fuzzy k-means clustering to determine the optimal parameters, namely the number of clusters, k, and the fuzziness parameter, b. To assess the performance of our method, we have used synthetic and real gene expression data sets. To improve our approach, two methods, searching with a tabu list and shrinking the scope of randomization, are applied. Our results show that a nearly optimal pair of k and b can be obtained without exploring the entire search space. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .Y37. Source: Masters Abstracts International, Volume: 44-03, page: 1419. Thesis (M.Sc.)--University of Windsor (Canada), 2005
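The two parameters being tuned enter the standard fuzzy k-means updates directly: k fixes the number of centres, while b > 1 controls how sharply memberships concentrate on the nearest centre. A minimal one-dimensional sketch of one update iteration (the points and centres are toy values, and the annealing search over k and b is not shown):

```python
def fuzzy_memberships(points, centers, b=2.0):
    """Fuzzy k-means membership update: u[i][j] is the degree to which point i
    belongs to cluster j; rows sum to 1, and larger b gives fuzzier rows."""
    u = []
    for p in points:
        dists = [abs(p - c) for c in centers]
        if 0.0 in dists:  # point sits exactly on a centre
            u.append([1.0 if d == 0.0 else 0.0 for d in dists])
        else:
            u.append([1.0 / sum((dj / dk) ** (2.0 / (b - 1)) for dk in dists)
                      for dj in dists])
    return u

def update_centers(points, u, b=2.0):
    """Centre update: membership-weighted mean of the points."""
    k = len(u[0])
    return [sum(u[i][j] ** b * p for i, p in enumerate(points)) /
            sum(u[i][j] ** b for i in range(len(points)))
            for j in range(k)]

points = [0.0, 10.0, 5.0]
centers = [0.0, 10.0]          # here k = 2
u = fuzzy_memberships(points, centers)
new_centers = update_centers(points, u)
# the midpoint 5.0 gets membership 0.5 in each cluster
```

A tuning method like the one in the thesis would wrap these updates in a loop and use simulated annealing to search over (k, b) pairs, scoring each with a cluster validity measure.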