
    Subset Sampling and Its Extensions

    This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)/B+(\mu_{\mathcal{S}}/B)\log_{M/B}(n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(\cdot)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time. Comment: 17 pages
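    The $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected query bound is the interesting part: a naive query that flips a coin for every record costs $\mathcal{O}(n)$ even when all probabilities are tiny. Below is a minimal static sketch, in Python, of the standard bucketing-plus-geometric-skipping idea behind bounds of this shape; the names (subset_sample, records, p) are illustrative, and the paper's dynamic structure additionally maintains the buckets under insertions and deletions, which this sketch omits.

        import math
        import random

        def subset_sample(records, p):
            """Return a random subset in which each record v is included
            independently with probability p[v]."""
            # Bucket j holds records with p(v) in (2^-(j+1), 2^-j].
            buckets = {}
            for v in records:
                if p[v] > 0:
                    j = max(0, math.floor(-math.log2(p[v])))
                    buckets.setdefault(j, []).append(v)

            sample = []
            for j, bucket in buckets.items():
                q = 2.0 ** (-j)  # upper bound on p(v) within this bucket
                i = 0
                while i < len(bucket):
                    if q < 1.0:
                        # Geometric skip: number of Bernoulli(q) failures
                        # before the next candidate position.
                        i += math.floor(math.log(1.0 - random.random())
                                        / math.log(1.0 - q))
                        if i >= len(bucket):
                            break
                    # Rejection step: the candidate survives with probability
                    # p(v)/q, so v is kept with probability q * (p(v)/q) = p(v).
                    v = bucket[i]
                    if random.random() < p[v] / q:
                        sample.append(v)
                    i += 1
            return sample

    Within a bucket every probability is at least half the bucket's upper bound $q$, so the expected number of candidates examined is at most twice the expected number of records sampled, which is where the $\mathcal{O}(1+\mu_{\mathcal{S}})$-style behavior comes from.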

    A novel artificial bee colony based clustering algorithm for categorical data

    Funding: This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 21127010 and 61202309 (http://www.nsfc.gov.cn/), the China Postdoctoral Science Foundation under Grant No. 2013M530956 (http://res.chinapostdoctor.org.cn), the UK Economic & Social Research Council (ESRC) under award reference ES/M001628/1 (http://www.esrc.ac.uk/), the Science and Technology Development Plan of Jilin Province under Grant No. 20140520068JH (http://www.jlkjt.gov.cn), the Fundamental Research Funds for the Central Universities under Grant No. 14QNJJ028 (http://www.nenu.edu.cn), and the open project program of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, under Grant No. 93K172014K07 (http://www.jlu.edu.cn). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    High throughput photonic time stretch optical coherence tomography with data compression

    Photonic time stretch enables real-time, high-throughput optical coherence tomography (OCT), but the massive data volume it generates is a real challenge. In this paper, data compression in high-throughput optical time stretch OCT is explored and experimentally demonstrated. This is made possible by exploiting the spectral sparsity of the encoded optical pulse spectrum using a compressive sensing (CS) approach. Both randomization and integration are implemented in the optical domain, avoiding an electronic bottleneck. A data compression ratio of 66% has been achieved in high-throughput OCT measurements with a 1.51 MHz axial scan rate, using a greatly reduced data sampling rate of 50 MS/s. The potential to further improve the compression ratio is also explored. In addition, using a dual-pulse integration method, the capability of improving frequency measurement resolution in the proposed system has been demonstrated. A number of optimization algorithms for the reconstruction of the frequency-domain OCT signals have been compared in terms of reconstruction accuracy and efficiency. Our results show that the L1-Magic implementation of the primal-dual interior point method offers the best compromise between accuracy and reconstruction time for the time-stretch OCT signals tested.
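    As a rough illustration of the reconstruction step, the sketch below recovers a sparse spectrum from compressive measurements by iterative soft-thresholding (ISTA), which solves the same L1-regularized least-squares problem that solvers such as L1-Magic's primal-dual interior point method target. The sensing matrix A, measurement vector y, and parameters lam and n_iters are assumptions for a toy setup, not values from the paper.

        import numpy as np

        def ista(A, y, lam=0.05, n_iters=500):
            """Recover a sparse x from y ~ A @ x by minimizing
            0.5 * ||A @ x - y||^2 + lam * ||x||_1 with ISTA."""
            L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
            x = np.zeros(A.shape[1])
            for _ in range(n_iters):
                z = x - A.T @ (A @ x - y) / L                          # gradient step
                x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
            return x

        # Toy usage: sense a k-sparse spectrum with a random matrix
        # (all sizes here are illustrative, not the paper's).
        rng = np.random.default_rng(0)
        n, m, k = 256, 96, 8
        x_true = np.zeros(n)
        x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
        A = rng.standard_normal((m, n)) / np.sqrt(m)
        x_hat = ista(A, A @ x_true, lam=0.01)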