4 research outputs found

SocialTrove: A Self-summarizing Storage Service for Social Sensing

Author: Abdelzaher Tarek F.
Ahmed Reaz
Amin Md Tanvir Al
Ganti Raghu K.
Gupta Indranil
Le Hieu
Li Shen
Rahman Muntasir Raihan
Seetharamu Panindra Tumkur
Srivatsa Mudhakar
Wang Shiguang
Publication date: 03/02/2015

The increasing availability of smartphones, cameras, and wearables with instant data sharing capabilities, and the exploitation of social networks for information broadcast, herald a future of real-time information overload. With the growing excess of worldwide streaming data, such as images, geotags, text annotations, and sensory measurements, data summarization will become an increasingly common service. The objective of such a service will be to obtain a representative sampling of large data streams at a configurable granularity, in real time, for subsequent consumption by a range of data-centric applications. This paper describes a general-purpose self-summarizing storage service, called SocialTrove, for social sensing applications. The service summarizes data streams from human sources, or sensors in their possession, by hierarchically clustering received information in accordance with an application-specific distance metric. It then serves a sampling of produced clusters at a configurable granularity in response to application queries. While SocialTrove is a general service, we illustrate its functionality and evaluate it in the specific context of workloads collected from Twitter. Results show that SocialTrove supports a high query throughput, while maintaining a low access latency to the produced real-time application-specific data summaries. As a specific application case study, we implement a fact-finding service on top of SocialTrove.

Funding: Army Research Laboratory, Cooperative Agreement W911NF-09-2-0053; DTRA grant HDTRA1-10-1-0120; NSF grants CNS 13-29886, CNS 09-58314, CNS 10-35736
Ope
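
The summarization idea in the abstract above (hierarchically cluster items under an application-specific distance metric, then serve one sample per cluster at a chosen granularity) can be illustrated with a minimal sketch. This is not SocialTrove's implementation: the function names, the token-set Jaccard distance, the average-linkage rule, and the use of a medoid as the cluster sample are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Token-set Jaccard distance; an assumed stand-in for an application-specific metric."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cluster(items, distance, k):
    """Agglomerative average-linkage clustering, stopped once k clusters remain."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        total = sum(distance(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(k, 1):
        # Merge the closest pair; a < b, so deleting b leaves index a valid.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def medoid(members, items, distance):
    """Index of the member with the smallest total distance to the rest of its cluster."""
    return min(members,
               key=lambda i: sum(distance(items[i], items[j]) for j in members))


def summarize(items, distance, k):
    """Return one representative item per cluster; k is the granularity knob."""
    return [items[medoid(c, items, distance)] for c in cluster(items, distance, k)]


posts = [
    "fire on main street",
    "fire reported on main street",
    "main street fire",
    "traffic jam on highway 5",
    "highway 5 traffic jam",
]
summary = summarize(posts, jaccard_distance, 2)
```

Raising `k` yields a finer-grained summary and lowering it a coarser one. A real service would build the hierarchy once and cut it at different levels per query rather than re-clustering each time, as this toy version does.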

Illinois Digital Environment for Access to Learning and Scholarship Repository

Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences

Author: Caporaso
Caporaso
Costello
Cutting
DeSantis
Edgar
Eren
Faith
Gilbert
Jensen
Lauber
Li
Lozupone
Mantel
McDonald
McDonald
Navas-Molina
Schloss
Yatsunenko
Publication venue: PeerJ Inc.
Publication date: 01/01/2014

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required by “classic” open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
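
The abstract above describes the algorithm only at a high level, so the following toy sketch is an illustration under stated assumptions, not QIIME's code. The four-step structure (closed-reference pass, de novo clustering of a random subsample of the failures, a second closed-reference pass against the new centroids, and a final de novo clean-up) is an assumed reading of the workflow. The position-wise `identity` function is a stand-in for a real aligner such as uclust, and all function names and OTU prefixes are hypothetical.

```python
import random


def identity(a, b):
    """Fraction of matching positions; a toy stand-in for a real sequence aligner."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def closed_reference(reads, refs, threshold):
    """Assign each read to its best reference if it meets the threshold.

    Each read is handled independently, which is why this step parallelizes well.
    """
    assigned, failures = {}, []
    for read in reads:
        best = max(refs, key=lambda r: identity(read, refs[r]), default=None)
        if best is not None and identity(read, refs[best]) >= threshold:
            assigned.setdefault(best, []).append(read)
        else:
            failures.append(read)
    return assigned, failures


def de_novo(reads, threshold, prefix):
    """Greedy centroid clustering: join the first matching centroid, else seed a new one."""
    centroids, clusters = {}, {}
    for read in reads:
        for otu, seed_seq in centroids.items():
            if identity(read, seed_seq) >= threshold:
                clusters[otu].append(read)
                break
        else:
            otu = f"{prefix}{len(centroids)}"
            centroids[otu] = read
            clusters[otu] = [read]
    return centroids, clusters


def merge(target, source):
    """Append the reads of each OTU in source to the same OTU in target."""
    for otu, members in source.items():
        target.setdefault(otu, []).extend(members)


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    otus = {}
    # Step 1: closed-reference pass against the reference collection.
    hits, failures = closed_reference(reads, refs, threshold)
    merge(otus, hits)
    # Step 2: cluster a random subsample of the failures de novo (serial, but small).
    rng = random.Random(seed)
    n = max(1, round(len(failures) * fraction)) if failures else 0
    chosen = set(rng.sample(range(len(failures)), n))
    subsample = [failures[i] for i in sorted(chosen)]
    rest = [failures[i] for i in range(len(failures)) if i not in chosen]
    new_refs, clusters = de_novo(subsample, threshold, "New.")
    merge(otus, clusters)
    # Step 3: closed-reference pass of the remaining failures against the new centroids.
    hits, failures = closed_reference(rest, new_refs, threshold)
    merge(otus, hits)
    # Step 4: de novo clean-up of whatever still failed.
    _, clusters = de_novo(failures, threshold, "NewCleanUp.")
    merge(otus, clusters)
    return otus


reads = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "CCCCCCCCCC", "GGGGGGGGGG"]
otus = subsampled_open_reference(reads, {"ref1": "AAAAAAAAAA"}, threshold=0.9)
```

In this sketch the serial de novo work is confined to the subsample in step 2 and the clean-up in step 4, while steps 1 and 3 treat every read independently. This is the property the abstract credits for the reduced runtime on very large data sets.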

Crossref

OpenKnowledge@NAU

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Similarity Search in Document Collections

Author: Jordanov Dimitar Dimitrov
Publication venue: Vysoké učení technické v Brně. Fakulta informačních technologií
Publication date: 01/01/2009

The main objective of this work is to estimate the efficiency of freely available software for similarity search in document collections, focusing on two tools in particular: the Semantic Vectors package and the MoreLikeThis class from Apache Lucene. The thesis provides a comparison of these two approaches and introduces methods that can lead to improving the quality of the results generated by a search.
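
For readers unfamiliar with the task, a minimal sketch of similarity search over a small document collection follows. It is neither Lucene's MoreLikeThis nor the Semantic Vectors package: it ranks documents by cosine similarity of TF-IDF vectors, a common baseline chosen here as an assumption, and every function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so that terms present in every document keep a positive weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def more_like(docs, query_index, top_n=3):
    """Indices of the documents most similar to docs[query_index], best first."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[query_index], vec), i)
              for i, vec in enumerate(vectors) if i != query_index]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for _, i in ranked[:top_n]]


docs = [
    "apache lucene search index",
    "lucene index search engine",
    "cooking pasta recipes",
]
similar = more_like(docs, 0, top_n=1)
```

A comparison like the one the thesis describes would run each tool on the same query documents and compare the ranked lists against relevance judgments.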

Digital library of Brno University of Technology

National Repository of Grey Literature