Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On the one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottlenecks, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and of how these
features drive paradigm shifts in statistical and computational methods as
well as in computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in high-confidence sets and point out that
the exogeneity assumptions made by most statistical methods for Big Data cannot be
validated due to incidental endogeneity; they can lead to wrong statistical
inferences and, consequently, wrong scientific conclusions.
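To make the spurious-correlation challenge concrete, here is a minimal simulation (an illustration, not code from the paper): with a fixed sample size, the largest sample correlation between a response and a growing set of completely independent predictors becomes large purely by chance.

```python
# Illustrative sketch: spurious correlation in high dimensions.
# With n samples and d >> n independent predictors, the maximum absolute
# sample correlation with an independent response grows with d.
import numpy as np

rng = np.random.default_rng(0)
n = 60  # sample size

for d in (10, 1_000, 100_000):
    X = rng.standard_normal((n, d))   # predictors, independent of y by construction
    y = rng.standard_normal(n)        # response
    # Pearson correlation of y with every column of X.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = Xc.T @ yc / n
    print(f"d={d:>7}: max |corr| = {np.abs(corr).max():.2f}")
# The maximum correlation increases with d even though every predictor is pure noise.
```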
Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling
The lattice thermal conductivity ($\kappa_\ell$) is a key property for
many potential applications of compounds. Discovery of materials with very low
or high $\kappa_\ell$ remains an experimental challenge due to high costs
and time-consuming synthesis procedures. High-throughput computational
pre-screening is a valuable approach for significantly reducing the set of
candidate compounds. In this article, we introduce efficient methods for
reliably estimating the bulk $\kappa_\ell$ for a large number of compounds.
The algorithms are based on a combination of machine-learning algorithms,
physical insights, and automatic ab-initio calculations. We scanned
approximately 79,000 half-Heusler entries in the AFLOWLIB.org database. Among
the 450 mechanically stable ordered semiconductors identified, we find that
$\kappa_\ell$ spans more than two orders of magnitude, a much larger range
than previously thought. $\kappa_\ell$ is lowest for compounds whose
elements in equivalent positions have large atomic radii. We then perform a
thorough screening of thermodynamic stability that allows us to reduce the list
to 77 systems, and we provide a quantitative estimate of $\kappa_\ell$
for this selected set. Three semiconductors with
$\kappa_\ell$ < 5 W/(m K) are proposed for further experimental study.
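The pre-screening idea can be sketched as follows (hypothetical data and descriptors, not the authors' actual model): a regressor trained on cheap compositional features predicts log $\kappa_\ell$, so that expensive ab-initio runs are spent on the most promising candidates first.

```python
# Illustrative sketch of ML-based pre-screening with placeholder inputs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Placeholder descriptors for compounds already computed ab initio,
# e.g. mean atomic radius, mean atomic mass, radius mismatch between sites.
X_known = rng.random((300, 3))
# Placeholder targets: log10 of the computed lattice thermal conductivity.
y_known = 1.5 - 2.0 * X_known[:, 0] + 0.3 * rng.standard_normal(300)

model = RandomForestRegressor(n_estimators=200, random_state=0)
print("CV R^2:", cross_val_score(model, X_known, y_known, cv=5).mean())

model.fit(X_known, y_known)
X_candidates = rng.random((5000, 3))               # descriptors of unscreened entries
ranking = np.argsort(model.predict(X_candidates))  # lowest predicted kappa first
print("top candidates for ab-initio follow-up:", ranking[:10])
```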
Statistical Workflow for Feature Selection in Human Metabolomics Data.
High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation, and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for greater standardization of, as well as advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.
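One common instance of such a feature-selection workflow can be sketched as follows (an illustration of the kind of pipeline discussed, on simulated data, not the authors' prescribed code): univariate screening with false-discovery-rate control, followed by a sparse multivariable model.

```python
# Minimal sketch of a cohort-metabolomics feature-selection pipeline.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 500, 200                       # subjects x metabolite features (simulated)
X = rng.standard_normal((n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(n)  # outcome tied to 2 features

# Step 1: univariate association tests with Benjamini-Hochberg FDR correction.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
keep = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
print("features passing FDR:", np.flatnonzero(keep))

# Step 2: sparse multivariable model on standardized features.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
print("features selected by LASSO:", np.flatnonzero(lasso.coef_))
```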
Machine learning for automatic prediction of the quality of electrophysiological recordings
The quality of electrophysiological recordings varies considerably due to technical and biological variability, and neuroscientists inevitably have to select “good” recordings for further analyses. This procedure is time-consuming and prone to selection biases. Here, we investigate replacing human decisions with a machine learning approach. We define 16 features, such as spike height and width, select the most informative ones using a wrapper method, and train a classifier to reproduce the judgement of one of our expert electrophysiologists. Generalisation performance is then assessed on unseen data, classified by the same or by another expert. We observe that the learning machine can be at least as consistent in its judgements as individual experts are with one another. Best performance is achieved for a limited number of informative features, with the optimal feature set differing from one data set to another. With 80–90% correct judgements, the performance of the system is very promising within the data sets of each expert, but judgements are less reliable when it is used across sets of recordings from different experts. We conclude that the proposed approach is relevant to the selection of electrophysiological recordings, provided parameters are adjusted to different types of experiments and to individual experimenters.
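The overall scheme can be sketched as follows (synthetic data; the 16 real features such as spike height and width are replaced by random stand-ins): wrapper-style feature selection plus a classifier trained on one expert's labels and tested on recordings labelled by another expert.

```python
# Minimal sketch of the recording-quality classification approach.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
n_feat = 16                                  # stand-ins for spike height, width, etc.
X_a = rng.standard_normal((400, n_feat))     # recordings labelled by expert A
y_a = (X_a[:, 0] + 0.5 * X_a[:, 3] > 0).astype(int)  # synthetic "good" labels
X_b = rng.standard_normal((200, n_feat))     # recordings labelled by expert B
y_b = (X_b[:, 0] + 0.5 * X_b[:, 3] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Wrapper method: greedily keep the most informative subset of features.
selector = SequentialFeatureSelector(clf, n_features_to_select=5, cv=5)
pipeline = make_pipeline(selector, clf).fit(X_a, y_a)

print("within-expert accuracy :", pipeline.score(X_a, y_a))  # optimistic (train set)
print("across-expert accuracy :", pipeline.score(X_b, y_b))  # the harder test
```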