ASTErIsM - Application of topometric clustering algorithms in automatic galaxy detection and classification
We present a study on galaxy detection and shape classification using
topometric clustering algorithms. We first use the DBSCAN algorithm to extract,
from CCD frames, groups of adjacent pixels with significant fluxes and we then
apply the DENCLUE algorithm to separate the contributions of overlapping
sources. The DENCLUE separation is based on the localization of patterns of
local maxima, through an iterative algorithm which associates each pixel to the
closest local maximum. Our main classification goal is to separate elliptical
from spiral galaxies. We introduce new sets of features derived from the
computation of geometrical invariant moments of the pixel group shape and from
the statistics of the spatial distribution of the DENCLUE local maxima
patterns. Ellipticals are characterized by a single group of local maxima,
related to the galaxy core, while spiral galaxies have additional ones related
to segments of spiral arms. We use two different supervised ensemble
classification algorithms, Random Forest and Gradient Boosting. Using a sample
of ~24000 galaxies taken from the Galaxy Zoo 2 main sample with spectroscopic
redshifts, we test our classification against the Galaxy Zoo 2 catalog. We
find that features extracted from our pipeline give on average an accuracy of
~93% when testing on a test set with a size of 20% of our full data set, with
features deriving from the angular distribution of density attractors ranking at
the top in discrimination power.
Comment: 20 pages, 13 Figures, 8 Tables, Accepted for publication in the
Monthly Notices of the Royal Astronomical Society
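The step of associating each pixel with a local maximum can be illustrated with a small sketch. This is not the authors' pipeline and not DENCLUE itself (which climbs the gradient of a kernel density estimate); it is a simplified hill-climbing stand-in on a toy flux grid, and the function name `assign_to_local_maxima` and the sample frame are hypothetical.

```python
def assign_to_local_maxima(flux):
    """Map every pixel (row, col) to the local maximum reached by repeatedly
    stepping to the brightest 8-connected neighbour (simplified hill climbing)."""
    rows, cols = len(flux), len(flux[0])

    def brightest_neighbour(r, c):
        # Return the strictly brighter neighbour with the highest flux,
        # or the pixel itself if no neighbour is brighter.
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    if flux[nr][nc] > flux[best[0]][best[1]]:
                        best = (nr, nc)
        return best

    labels = {}
    for r in range(rows):
        for c in range(cols):
            pos = (r, c)
            while True:
                nxt = brightest_neighbour(*pos)
                if nxt == pos:  # no brighter neighbour: local maximum reached
                    break
                pos = nxt
            labels[(r, c)] = pos
    return labels


# Toy frame with two blended sources (made-up values, not real CCD data).
frame = [
    [1, 2, 1, 0, 1, 2, 1],
    [2, 9, 2, 0, 2, 7, 2],
    [1, 2, 1, 0, 1, 2, 1],
]
labels = assign_to_local_maxima(frame)
attractors = set(labels.values())
print(sorted(attractors))
```

Each step moves to a strictly brighter pixel, so the loop always terminates. Pixels on a flat plateau are treated as their own maxima in this sketch, which a real implementation would need to handle more carefully. The number and spatial arrangement of the resulting maxima is the kind of quantity the abstract describes using as classification features.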
Transcription Factor-DNA Binding Via Machine Learning Ensembles
We present ensemble methods in a machine learning (ML) framework combining
predictions from five known motif/binding site exploration algorithms. For a
given TF the ensemble starts with position weight matrices (PWMs) for the
motif, collected from the component algorithms. Using dimension reduction, we
identify significant PWM-based subspaces for analysis. Within each subspace a
machine classifier is built for identifying the TF's gene (promoter) targets
(Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool.
Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string)
feature PWM-based subspaces that stand out in identifying gene targets. We
approach Problem 3 (binding sites) with a novel machine learning approach that
uses promoter string features and ML importance scores in a classification
algorithm locating binding sites across the genome. For target gene
identification this method improves performance (measured by the F1 score) by
about 10 percentage points over (a) the motif scanning method and (b) the
coexpression-based association method. The top motif outperformed the five
component algorithms as well as two other common algorithms (BEST and DEME).
For identifying individual binding sites on a benchmark cross-species database
(Tompa et al., 2005) we match the best performer without much human
intervention. The method also improves performance on mammalian TFs.
The ensemble can integrate orthogonal information from different weak
learners (potentially using entirely different types of features) into a
machine learner that can perform consistently better for more TFs. The TF gene
target identification component (problem 1 above) is useful in constructing a
transcriptional regulatory network from known TF-target associations. The
ensemble is easily extendable to include more tools as well as future PWM-based
information.
Comment: 33 pages
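For readers unfamiliar with PWMs, the motif scanning baseline mentioned above can be sketched generically as follows. This is an illustration of plain PWM scanning, not the ensemble method of the paper; the function names, the pseudocount, the uniform background of 0.25, and the toy counts are all assumptions made for the example.

```python
import math


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores
    against a uniform background (an assumption of this sketch)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; returns (start, score) pairs.
    The sequence is assumed to contain only A, C, G and T."""
    width = len(pwm)
    return [
        (start, sum(pwm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical counts from eight aligned sites that all read "TATA".
counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
pwm = log_odds_pwm(counts)
hits = scan("GGTATACC", pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 3))
```

In practice a score threshold would decide which windows count as putative binding sites; the choice of that threshold is outside the scope of this sketch.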
Phenotypic landscape inference reveals multiple evolutionary paths to C4 photosynthesis
C4 photosynthesis has independently evolved from the ancestral C3
pathway in at least 60 plant lineages, but, as with other complex traits, how
it evolved is unclear. Here we show that the polyphyletic appearance of C4
photosynthesis is associated with diverse and flexible evolutionary paths that
group into four major trajectories. We conducted a meta-analysis of 18 lineages
containing species that use C3, C4, or intermediate C3-C4 forms of
photosynthesis to parameterise a 16-dimensional phenotypic landscape. We then
developed and experimentally verified a novel Bayesian approach based on a
hidden Markov model that predicts how the C4 phenotype evolved. The
alternative evolutionary histories underlying the appearance of C4
photosynthesis were determined by ancestral lineage and initial phenotypic
alterations unrelated to photosynthesis. We conclude that the order of C4
trait acquisition is flexible and driven by non-photosynthetic drivers. This
flexibility will have facilitated the convergent evolution of this complex
trait.