Exploiting Intrinsic Clustering Structure in Discrete-Valued Data Sets for Efficient Knowledge Discovery in the Presence of Missing Data
Scalable algorithm design has become central in the era of large-scale data analysis. The vast amounts of data pouring in from a diverse set of application domains, such as bioinformatics, recommender systems, sensor systems, and social networks, cannot be analyzed efficiently using many data mining and statistical tools that were designed for a small-scale setting. It is an ongoing challenge to the data mining, machine learning, and statistics communities to design new methods for efficient data analysis. Compounding this challenge is the noisy and incomplete nature of real-world data sets. Research scientists as well as practitioners in industry need to find meaningful patterns in data with missing value rates often as high as 99%, in addition to errors in the data that can obstruct accurate analyses. My contribution to this line of research is the design of new algorithms for scalable clustering, data reduction, and similarity evaluation by exploiting inherent clustering structure in the input data to overcome the challenges of significant amounts of missing entries. I demonstrate that, by focusing on underlying clustering properties of the data, we can improve the efficiency of several data analysis methods on sparse, discrete-valued data sets. I highlight new methods that I have developed with my collaborators for three diverse knowledge discovery tasks: (1) clustering genetic markers into linkage groups, (2) reducing large-scale genetic data to a much smaller, more accurate representative data set, and (3) computing similarity between users in recommender systems. In each case, I point out how the underlying clustering structure can be used to design more efficient algorithms, even when high missing value rates are present.
A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome
Citation: Chapman, J. A., Mascher, M., Buluç, A., Barry, K., Georganas, E., Session, A., . . . Rokhsar, D. S. (2015). A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biology, 16(1). doi:10.1186/s13059-015-0582-8
Polyploid species have long been thought to be recalcitrant to whole-genome assembly. By combining high-throughput sequencing, recent developments in parallel computing, and genetic mapping, we derive, de novo, a sequence assembly representing 9.1 Gbp of the highly repetitive 16 Gbp genome of hexaploid wheat, Triticum aestivum, and assign 7.1 Gb of this assembly to chromosomal locations. The genome representation and accuracy of our assembly is comparable to or even exceeds that of a chromosome-by-chromosome shotgun assembly. Our assembly and mapping strategy uses only short read sequencing technology and is applicable to any species where it is possible to construct a mapping population.
© 2015 Chapman et al.; licensee BioMed Central.
Additional Authors: Muehlbauer, G. J.; Stein, N.; Rokhsar, D. S.
LiRa: A New Likelihood-Based Similarity Score for Collaborative Filtering
Recommender system data presents unique challenges to the data mining,
machine learning, and algorithms communities. The high missing data rate, in
combination with the large scale and high dimensionality that is typical of
recommender systems data, requires new tools and methods for efficient data
analysis. Here, we address the challenge of evaluating similarity between two
users in a recommender system, where for each user only a small set of ratings
is available. We present a new similarity score, which we call LiRa, based on a
statistical model of user similarity, for large-scale, discrete-valued data
with many missing values. We show that this score, based on a ratio of
likelihoods, is more effective at identifying similar users than traditional
similarity scores in user-based collaborative filtering, such as the Pearson
correlation coefficient. We argue that our approach has significant potential
to improve both accuracy and scalability in collaborative filtering.
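The idea of a similarity score built from a ratio of likelihoods can be illustrated with a minimal sketch. This is not the LiRa model from the paper, whose details the abstract does not give. It assumes a simplified setup: ratings are integers on a 1..5 scale, only items rated by both users are compared (missing ratings are ignored), and the score is the log10 ratio between the likelihood of the observed rating differences under a hypothetical "similar users" distribution and under independent uniform ratings. The function names and the `same_cluster_dist` values are illustrative assumptions.

```python
import math
from collections import Counter


def random_diff_distribution(num_levels):
    """P(|a - b| = d) for a, b independent and uniform on 1..num_levels."""
    counts = Counter(
        abs(a - b)
        for a in range(1, num_levels + 1)
        for b in range(1, num_levels + 1)
    )
    total = num_levels ** 2
    return {d: counts[d] / total for d in range(num_levels)}


def likelihood_ratio_similarity(ratings_u, ratings_v, same_cluster_dist, num_levels=5):
    """Log10 likelihood ratio over co-rated items (illustrative sketch only).

    ratings_u, ratings_v: dicts mapping item id -> integer rating.
    same_cluster_dist: assumed P(|difference| = d) for similar users.
    """
    random_dist = random_diff_distribution(num_levels)
    shared_items = ratings_u.keys() & ratings_v.keys()  # skip missing ratings
    score = 0.0
    for item in shared_items:
        d = abs(ratings_u[item] - ratings_v[item])
        score += math.log10(same_cluster_dist[d] / random_dist[d])
    return score


# Hypothetical distribution: similar users mostly agree on ratings.
same_cluster_dist = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.0625, 4: 0.0625}

alice = {"item1": 5, "item2": 3, "item3": 1}
bob = {"item1": 5, "item2": 3, "item9": 4}
carol = {"item1": 1, "item7": 2}

print(likelihood_ratio_similarity(alice, bob, same_cluster_dist))    # positive
print(likelihood_ratio_similarity(alice, carol, same_cluster_dist))  # negative
```

A positive score means the observed agreement is more likely under the "similar" hypothesis than by chance, and each additional co-rated item adds evidence in one direction or the other. Users with no items in common score 0.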
Stakeholders and practices in the field of senior safety and mobility : Work package 4 Report
CONSOL - Concerns and solution
14-3-3 Protein Masks the DNA Binding Interface of Forkhead Transcription Factor FOXO4
The role of 14-3-3 proteins in the regulation of FOXO forkhead transcription factors is at least 2-fold. First, the 14-3-3 binding inhibits the interaction between the FOXO and the target DNA. Second, the 14-3-3 proteins prevent nuclear reimport of FOXO factors by masking their nuclear localization signal. The exact mechanisms of these processes are still unclear, mainly due to the lack of structural data. In this work, we used fluorescence spectroscopy to investigate the mechanism of the 14-3-3 protein-dependent inhibition of FOXO4 DNA-binding properties. Time-resolved fluorescence measurements revealed that the 14-3-3 binding affects fluorescence properties of the 5-(((acetylamino)ethyl)amino) naphthalene-1-sulfonic acid moiety attached at four sites within the forkhead domain of FOXO4 that represent important parts of the DNA binding interface. Observed changes in 5-(((acetylamino)ethyl)amino) naphthalene-1-sulfonic acid fluorescence strongly suggest physical contacts between the 14-3-3 protein and labeled parts of the FOXO4 DNA binding interface. The 14-3-3 protein binding, however, does not cause any dramatic conformational change of FOXO4 as documented by the results of tryptophan fluorescence experiments. To build a realistic model of the FOXO4·14-3-3 complex, we measured six distances between 14-3-3 and FOXO4 using Förster resonance energy transfer time-resolved fluorescence experiments. The model of the complex suggests that the forkhead domain of FOXO4 is docked within the central channel of the 14-3-3 protein dimer, consistent with our hypothesis that 14-3-3 masks the DNA binding interface of FOXO4.
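The abstract does not say how the six distances were computed from the time-resolved data. As a rough illustration of how Förster resonance energy transfer measurements are commonly turned into distances, the sketch below uses the standard relations E = 1 - τ_DA/τ_D and r = R0 · (1/E - 1)^(1/6). The function names, lifetimes, and Förster radius are hypothetical values chosen for the example, not data from the study.

```python
def fret_efficiency(tau_donor_acceptor, tau_donor):
    """Transfer efficiency from donor lifetimes with and without acceptor."""
    if tau_donor <= 0 or tau_donor_acceptor < 0:
        raise ValueError("lifetimes must be positive")
    return 1.0 - tau_donor_acceptor / tau_donor


def fret_distance(efficiency, forster_radius):
    """Donor-acceptor distance, in the same units as forster_radius."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return forster_radius * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


# Hypothetical example: the donor lifetime halves when the acceptor is present.
efficiency = fret_efficiency(tau_donor_acceptor=2.0, tau_donor=4.0)  # 0.5
distance = fret_distance(efficiency, forster_radius=50.0)  # equals R0 at E = 0.5
print(efficiency, distance)
```

Because efficiency falls off with the sixth power of distance, the estimate is most sensitive when the separation is close to the Förster radius.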