Clustering of gene expression data: performance and similarity analysis

Author: B Everitt
Chun-Hsi Huang
D Botstein
J Dopazo
J Hartigan
J Herrero
J Tamames
Joaquín Dopazo
Jun Ni
K Alsabti
KY Yeung
Longde Yin
MB Eisen
MF Ramoni
P Tamayo
PT Spellman
R Cho
RL Stears
Sneath
T Kanungo
T Kohonen
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: DNA microarray technology is an innovative methodology in experimental molecular biology that has produced large amounts of valuable gene expression profile data. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The evaluation of feasible and applicable clustering algorithms is becoming an important issue in today's bioinformatics research. RESULTS: In this paper we first experimentally study three major clustering algorithms, Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self-Organizing Tree Algorithm (SOTA), using yeast (Saccharomyces cerevisiae) gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to conduct the similarity analysis of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The results of the similarity analysis show that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. Therefore, it is an effective approach for evaluating different clustering algorithms. CONCLUSION: HC methods allow a visual, convenient representation of genes. However, they are neither robust nor efficient. The SOM is more robust against noise. A disadvantage of SOM is that the number of clusters has to be fixed beforehand. The SOTA combines the advantages of both hierarchical and SOM clustering. It allows a visual representation of the clusters and their structure and is not sensitive to noise. The SOTA is also more flexible than the other two clustering methods. Using our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby compare different clustering methods.
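The closest-match idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say how Cluster Diff measures similarity, so this example assumes Jaccard similarity over gene sets as a stand-in; the function names, cluster labels, and gene identifiers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_match(target: Set[str],
                  candidates: Dict[str, Set[str]]) -> Tuple[str, float]:
    """Return the (label, similarity) of the candidate cluster most similar to target.

    Assumption: similarity is Jaccard overlap of gene membership. This is an
    illustrative choice, not the measure used by Cluster Diff.
    Raises ValueError if candidates is empty.
    """
    scored = ((label, jaccard(target, genes)) for label, genes in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: one cluster from algorithm 1, three clusters from algorithm 2.
hc_cluster = {"g1", "g2", "g3", "g4"}
sota_clusters = {
    "A": {"g1", "g2", "g3"},
    "B": {"g4", "g5"},
    "C": {"g6"},
}

best_label, best_score = closest_match(hc_cluster, sota_clusters)
print(best_label, best_score)  # A 0.75
```

Here cluster "A" shares three of the four genes in the union with the target, so it scores 0.75 and is reported as the closest match. Any other set-overlap or distance measure could be substituted for `jaccard` without changing the search itself.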

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Data Driven Discovery in Astrophysics

Author: Brescia M.
Cavuoti S.
Djorgovski S. G.
Donalek C.
Longo G.
Publication date: 01/01/2014

We review some aspects of the current state of data-intensive astronomy, its methods, and some outstanding data analysis challenges. Astronomy is at the forefront of "big data" science, with exponentially growing data volumes and data rates and ever-increasing complexity, now entering the Petascale regime. Telescopes and observatories, both on the ground and in space, cover the full range of wavelengths and feed data via processing pipelines into dedicated archives, where they can be accessed for scientific analysis. Most of the large archives are connected through the Virtual Observatory framework, which provides interoperability standards and services and effectively constitutes a global data grid for astronomy. Making discoveries in this overabundance of data requires the application of novel machine learning tools. We describe some recent examples of such applications. Comment: Keynote talk in the proceedings of ESA-ESRIN Conference: Big Data from Space 2014, Frascati, Italy, November 12-14, 2014, 8 pages, 2 figure

arXiv.org e-Print Archive

Archivio della ricerca - Università degli studi di Napoli Federico II

From Social Simulation to Integrative System Design

Author: Balietti Stefano
Helbing Dirk
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

As the recent financial crisis showed, there is a strong need today to gain an "ecological perspective" on all relevant interactions in socio-economic-techno-environmental systems. To this end, we suggest setting up a network of Centers for Integrative Systems Design, which would be able to run all potentially relevant scenarios, identify causality chains, explore feedback and cascading effects for a number of model variants, and determine the reliability of their implications (given the validity of the underlying models). They would be able to detect possible negative side effects of policy decisions before they occur. Each Center in this network would focus on a particular field, but together they would be part of an attempt to eventually cover all relevant areas of society and the economy and integrate them within a "Living Earth Simulator". The results of all research activities of such Centers would be turned into informative input for political Decision Arenas. For example, Crisis Observatories (for financial instabilities, shortages of resources, environmental change, conflict, spreading of diseases, etc.) would be connected with such Decision Arenas for the purpose of visualization, in order to make complex interdependencies understandable to scientists, decision-makers, and the general public. Comment: 34 pages, Visioneer White Paper, see http://www.visioneer.ethz.c

arXiv.org e-Print Archive

Repository for Publications and Research Data

CiteSeerX

EDP Sciences OAI-PMH repository (1.2.0)

Flow-based Influence Graph Visual Summarization

Author: Lin Chuang
Shi Lei
Tang Jie
Tong Hanghang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/10/2014

Visually mining a large influence graph is appealing yet challenging. People are amazed by pictures of news-spreading graphs on Twitter and engaged by hidden citation networks in academia, yet they are often troubled by the poor readability of the underlying visualization. Existing summarization methods enhance the graph visualization with blocked views, but have an adverse effect on the latent influence structure. How can we visually summarize a large graph to maximize influence flows? In particular, how can we illustrate the impact of an individual node through the summarization? Can we maintain the appealing graph metaphor while preserving both the overall influence pattern and fine readability? To answer these questions, we first formally define the influence graph summarization problem. Second, we propose an end-to-end framework to solve the new problem. Our method not only highlights the flow-based influence patterns in the visual summarization, but also inherently supports rich graph attributes. Last, we present a theoretical analysis and report our experimental results. Both lines of evidence demonstrate that our framework can effectively approximate the proposed influence graph summarization objective while outperforming previous methods in a typical scenario of visually mining academic citation networks. Comment: to appear in IEEE International Conference on Data Mining (ICDM), Shen Zhen, China, December 201

arXiv.org e-Print Archive

Crossref

WordSup: Exploiting Word Annotations for Character based Text Detection

Author: Ding Errui
Han Junyu
Hu Han
Luo Yuxuan
Wang Yuzhuo
Zhang Chengquan
Publication date: 22/08/2017

Text in images is usually organized as a hierarchy of several visual elements, i.e. characters, words, text lines and text blocks. Among these elements, the character is the most basic one across various kinds of text, such as Western, Chinese, and Japanese scripts and mathematical expressions. It is natural and convenient to construct a common text detection engine based on character detectors. However, training character detectors requires a vast number of location-annotated characters, which are expensive to obtain. In practice, existing real text datasets are mostly annotated at the word or line level. To resolve this dilemma, we propose a weakly supervised framework that can utilize word annotations, either in tight quadrangles or in looser bounding boxes, for character detector training. When applied to scene text detection, we are thus able to train a robust character detector by exploiting word annotations in the rich, large-scale real scene text datasets, e.g. ICDAR15 and COCO-text. The character detector plays a key role in the pipeline of our text detection engine. It achieves state-of-the-art performance on several challenging scene text detection benchmarks. We also demonstrate the flexibility of our pipeline in various scenarios, including deformed text detection and math expression recognition. Comment: 2017 International Conference on Computer Visio

arXiv.org e-Print Archive

Crossref