
    Asymptotic Optimality of Antidictionary Codes

    An antidictionary code is a lossless compression algorithm using an antidictionary, which is the set of minimal words that do not occur as substrings of an input string. The code was proposed by Crochemore et al. in 2000, and its asymptotic optimality has been proved only with respect to a specific information source, called the balanced binary source: a binary Markov source in which every state transition occurs with probability 1/2 or 1. In this paper, we prove the optimality of both static and dynamic antidictionary codes with respect to a stationary ergodic Markov source on a finite alphabet such that a state transition occurs with probability p (0 < p ≤ 1). Comment: 5 pages, to appear in the proceedings of the 2010 IEEE International Symposium on Information Theory (ISIT2010).
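
    A brute-force sketch may make the central object concrete. The function below enumerates the minimal absent words (the antidictionary) of a string up to a chosen length, using the standard characterization that a word w = a·u·b is minimal absent iff w never occurs while its maximal proper prefix a·u and suffix u·b both do. This is illustration only, not the coding algorithm of the paper, and the `max_len` cap is an arbitrary assumption.

```python
from itertools import product

def minimal_absent_words(s: str, alphabet: str, max_len: int = 6) -> set[str]:
    """Minimal absent words (the antidictionary) of s, up to max_len symbols.

    w = a·u·b is minimal absent iff w does not occur in s while both its
    maximal proper prefix a·u and maximal proper suffix u·b do occur.
    """
    maws = {a for a in alphabet if a not in s}  # length-1 case
    for length in range(2, max_len + 1):
        for letters in product(alphabet, repeat=length):
            w = "".join(letters)
            # 'x in s' is Python's substring test on strings
            if w not in s and w[:-1] in s and w[1:] in s:
                maws.add(w)
    return maws

print(sorted(minimal_absent_words("11010011", "01")))  # includes "000", "111"
```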

    Multi-class Arrhythmia Classification Using an Antidictionary Probabilistic Model

    Populations around the world are aging rapidly, and for heart disease, the second leading cause of death after malignant neoplasms, there is a correspondingly growing need for early detection of the various irregularities and abnormalities of cardiac pathology, and for means to achieve it. This study proposes a new method for detecting and classifying arrhythmias based on antidictionary coding, a technique recently proposed in the field of data compression, and examines a continuous electrocardiogram (ECG) monitoring system built on a wireless wearable device incorporating the method. Two representative arrhythmia patterns, premature ventricular contraction (PVC) and premature atrial contraction (PAC), are classified by examining the statistical properties of characteristic patterns that do not occur in the data. Such a non-occurring characteristic pattern has the property that shortening it at all makes it occur in the original data; it is also called a minimal forbidden word. Minimal forbidden words are extracted from the publicly available MIT-BIH ECG database, which is annotated with specialists' diagnoses, so it is known exactly which waveforms are arrhythmic. The minimal forbidden words extracted from normal ECG data can be expected to contain arrhythmia patterns, including PVC and PAC. We therefore construct, from the extracted minimal forbidden words, a probabilistic finite-state-transition model that represents the variation of normal ECG data. Because this model not only moves through its internal states while tracking the observed data but also computes the probability of each state transition, the appearance of an arrhythmia pattern can be captured as the occurrence of a state transition of very small probability. To build the model, the ECG data, represented at 11 bits per sample, must first be quantized; we experimentally study shaping the frequency distribution by differencing the data and choosing the number of quantization levels, so as to simplify the model and reduce the memory needed for implementation. To evaluate the proposed method, detection experiments for premature ventricular contractions were run on MIT-BIH ECG data from multiple subjects with arrhythmias, yielding on average a sensitivity of 97.53% and a specificity of 93.89%, roughly comparable to existing arrhythmia detection methods. Furthermore, a simulation embedding the proposed scheme in an iPhone emulator measured the required computational resources: CPU utilization was 65% and memory usage was 30.5 MB, showing that the proposed scheme can readily be implemented on a wearable device. Finally, in addition to the binary PVC classification problem above, we consider the multi-class problem of classifying PVC together with PAC, and analyze the minimal forbidden words that characterize each pattern. The University of Electro-Communications, 201
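
    The pipeline the abstract describes (difference the signal, quantize it, model state-transition probabilities, flag improbable transitions) can be sketched briefly. This is an illustrative stand-in, not the thesis's antidictionary-built model: the quantile quantizer, the first-order Markov model, the synthetic signals, and every threshold below are assumptions.

```python
import numpy as np

def make_quantizer(train, levels=8):
    """Difference the signal (shapes the value distribution around zero,
    as the abstract describes), then quantize on training quantiles."""
    edges = np.quantile(np.diff(train), np.linspace(0, 1, levels + 1)[1:-1])
    return lambda x: np.digitize(np.diff(x), edges)  # symbols 0..levels-1

def fit_transitions(symbols, levels=8, alpha=1e-6):
    """First-order transition probabilities, smoothed so that unseen
    symbol pairs get a tiny but nonzero probability."""
    counts = np.full((levels, levels), alpha)
    for a, b in zip(symbols[:-1], symbols[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def anomalies(symbols, P, threshold=1e-3):
    """Positions whose observed transition is improbably rare -- the
    'state transition of very small probability' reading of an arrhythmia."""
    return [i for i, (a, b) in enumerate(zip(symbols[:-1], symbols[1:]))
            if P[a, b] < threshold]

# Usage pattern: fit on annotated-normal data, then scan a recording.
# Synthetic stand-in: a clean oscillation as 'normal', a noisy burst
# standing in for an abnormal segment.
rng = np.random.default_rng(0)
normal = np.sin(np.linspace(0, 100 * np.pi, 5000)) + rng.normal(0, 0.02, 5000)
test = np.concatenate([normal[:500], rng.normal(0, 1.0, 100)])
q = make_quantizer(normal)
P = fit_transitions(q(normal))
print(anomalies(q(test), P)[:10])
```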

    Internal shortest absent word queries

    Given a string T of length n over an alphabet Σ ⊂ {1, 2, ..., n^O(1)} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] ⋯ T[j] of T. For any positive integer k ∈ [1, log log_σ n], we present an O((n/k) · log log_σ n)-size data structure, which can be constructed in O(n log_σ n) time, and answers queries in O(log log_σ k) time.
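
    For intuition, such a query can be answered naively by growing the candidate length until some word is absent from the fragment. The brute-force baseline below is nothing like the paper's data structure (it redoes all the work on every query), but it pins down what a query returns; since a fragment of length m has at most m − ℓ + 1 substrings of each length ℓ, the answer has length at most ⌊log_σ m⌋ + 1, so the loop stays shallow.

```python
from itertools import product

def shortest_absent_word(T: str, i: int, j: int, alphabet: str) -> str:
    """Shortest string over alphabet absent from the fragment T[i..j]."""
    frag = T[i:j + 1]
    length = 1
    while True:
        # All substrings of the fragment with the current length.
        present = {frag[p:p + length] for p in range(len(frag) - length + 1)}
        for letters in product(alphabet, repeat=length):  # lexicographic
            w = "".join(letters)
            if w not in present:
                return w
        length += 1

print(shortest_absent_word("abaababa", 0, 7, "ab"))  # prints "bb"
```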

    Internal Shortest Absent Word Queries in Constant Time and Linear Space

    Given a string T of length n over an alphabet Σ ⊂ {1, 2, ..., n^O(1)} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] ⋯ T[j] of T. We present an O(n)-space data structure that answers such queries in constant time and can be constructed in O(n log_σ n) time.
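
    A standard counting argument (not specific to either paper) explains why shortest absent words are short enough to admit compact representations: a fragment F of length m has at most m − ℓ + 1 distinct substrings of each length ℓ, so taking ℓ = ⌊log_σ m⌋ + 1 forces some length-ℓ word to be absent.

```latex
\[
  \sigma^{\ell} > m \;\ge\; m - \ell + 1
  \;\ge\; \#\{\text{length-}\ell\text{ substrings of } F\}
  \quad\Longrightarrow\quad
  \exists\, w \in \Sigma^{\ell} \text{ absent from } F,
  \qquad \ell = \lfloor \log_\sigma m \rfloor + 1 .
\]
```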

    String Sanitization Under Edit Distance: Improved and Generalized

    Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_ED such that: (i) no string of S occurs in X_ED; (ii) the order of all other length-k substrings over Σ is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n^2 k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^(2−δ)) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an O(n^2 log^2 k)-time algorithm to solve ETFS; and (ii) an O(n^2 log^2 n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
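
    To make conditions (i) and (ii) concrete, here is a hedged checker for a candidate output; it verifies feasibility, it does not construct X_ED (that is what the O(n^2 log^2 k) algorithm is for), and it ignores condition (iii). Following the usual setup, the candidate may use a separator '#' outside Σ, and condition (ii) is read as concerning length-k substrings over Σ only; W, X, and S below are invented toys.

```python
def kmers_over_sigma(s: str, k: int, sigma: set[str]) -> list[str]:
    """Length-k substrings of s drawn entirely from sigma, in order."""
    return [s[p:p + k] for p in range(len(s) - k + 1)
            if set(s[p:p + k]) <= sigma]

def satisfies_etfs(W: str, X: str, k: int, S: set[str],
                   sigma: set[str]) -> bool:
    ok_i = all(w not in X for w in S)              # (i) no sensitive pattern
    want = [w for w in kmers_over_sigma(W, k, sigma) if w not in S]
    ok_ii = kmers_over_sigma(X, k, sigma) == want  # (ii) order preserved
    return ok_i and ok_ii

# Toy usage: hide "aba" from W with the separator '#', then check.
W, k, S, sigma = "aabab", 3, {"aba"}, set("ab")
print(satisfies_etfs(W, "aab#bab", k, S, sigma))  # True
```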

    Querying and Efficiently Searching Large, Temporal Text Corpora


    Hide and mine in strings: Hardness, algorithms, and experiments

    Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.
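
    For a flavour of the "hide" side in the missing-value setting, below is a toy greedy in the same spirit as, though not identical to, the paper's heuristic: fill each missing symbol '#' so that no sensitive length-k pattern is created, preferring the letter that adds the fewest new k-mer occurrences (fewer chances to create spurious patterns). The cost function and all inputs are assumptions for illustration.

```python
from collections import Counter

def fill_gaps(X: str, k: int, sensitive: set[str], alphabet: str) -> str:
    s = list(X)
    # Occurrence counts of fully-known k-mers in the current string.
    counts = Counter(X[p:p + k] for p in range(len(X) - k + 1)
                     if '#' not in X[p:p + k])
    for i, c in enumerate(s):
        if c != '#':
            continue
        best, best_cost = None, None
        for a in alphabet:
            s[i] = a
            # Only k-mers whose window covers position i are affected.
            window = [''.join(s[p:p + k])
                      for p in range(max(0, i - k + 1),
                                     min(len(s) - k, i) + 1)]
            if any(w in sensitive for w in window if '#' not in w):
                continue  # this letter would create a sensitive pattern
            cost = sum(counts[w] for w in window if '#' not in w)
            if best is None or cost < best_cost:
                best, best_cost = a, cost
        if best is None:
            s[i] = '#'  # no safe letter; leave the gap (hardness in action)
        else:
            s[i] = best
            for p in range(max(0, i - k + 1), min(len(s) - k, i) + 1):
                w = ''.join(s[p:p + k])
                if '#' not in w:
                    counts[w] += 1
    return ''.join(s)

print(fill_gaps("ab#aab#b", 2, {"bb"}, "ab"))  # "abaaabab": no "bb" created
```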

    Resource efficient on-node spike sorting

    Current implantable brain-machine interfaces record multi-neuron activity using multi-channel micro-electrode arrays. The rapid increase in recording capability has brought more stringent constraints on implantable system power consumption and size, all the more so with the increasing demand for wireless systems that monitor more channels while overcoming the communication bottleneck (in transmitting raw data) via transcutaneous bio-telemetries. For systems observing unit activity, real-time spike sorting within an implantable device offers a unique solution to this problem. However, achieving such data compression prior to transmission via an on-node spike sorting system poses several challenges. The inherent complexity of the spike sorting problem, arising from factors such as signal variability, local field potentials, and background and multi-unit activity, has required computationally intensive algorithms (e.g. PCA, wavelet transforms, superparamagnetic clustering). Hence spike sorting systems have traditionally been implemented off-line, usually on workstations. Owing to their complexity and poor scalability, these algorithms cannot simply be transformed into resource-efficient hardware. Conversely, although there have been several attempts at implantable hardware, an implementation matching off-line accuracy within the power and area budgets required for future BMIs has yet to be proposed. Within this context, this research aims to fill the gaps in the design of a resource-efficient implantable real-time spike sorter whose performance is comparable to off-line methods. The research covered in this thesis targets: 1) identifying and quantifying the trade-offs between subsequent signal-processing performance and hardware resource utilisation for the parameters of the analogue front-end; following the development of a behavioural model of the analogue front-end and an optimisation tool, the sensitivity of spike sorting accuracy to different front-end parameters is quantified; 2) identifying and quantifying the trade-offs associated with a two-stage hybrid solution to real-time on-node spike sorting; the first part of this work focuses on template matching only, while the second considers the parameters of the whole system, including detection, sorting, and off-line training (template building), establishing a set of minimum requirements that ensure robust, accurate, and resource-efficient operation; 3) developing new feature extraction and spike sorting algorithms for highly scalable systems; based on the waveform dynamics of the observed action potentials, a derivative-based feature extraction and a spike sorting algorithm are proposed and compared with the most commonly used spike sorting methods under varying noise levels on realistic datasets to confirm their merits; the latter is implemented and demonstrated in real time on an MCU-based platform.
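
    The abstract names two ingredients, derivative-based feature extraction and template matching; the sketch below pairs them minimally. It is not the thesis's algorithm or its hardware mapping: the particular features, the synthetic spikes, and the Euclidean nearest-template rule are all assumptions.

```python
import numpy as np

def fd_features(spike: np.ndarray) -> np.ndarray:
    """Features from the first difference (discrete derivative) of an
    aligned spike waveform: its extrema and their sample positions."""
    d = np.diff(spike)
    return np.array([d.max(), d.min(), d.argmax(), d.argmin()], dtype=float)

def build_templates(train_spikes, labels):
    """Off-line 'training': mean feature vector per unit."""
    feats = np.array([fd_features(s) for s in train_spikes])
    return {u: feats[labels == u].mean(axis=0) for u in np.unique(labels)}

def sort_spike(spike, templates):
    """On-node step: assign the nearest template (Euclidean distance)."""
    f = fd_features(spike)
    return min(templates, key=lambda u: np.linalg.norm(f - templates[u]))

# Toy usage with two synthetic units that differ in amplitude (and hence
# in derivative extrema).
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 48)
unit = lambda a: a * np.exp(-((t - 0.3) / 0.05) ** 2)  # crude spike shape
train = np.array([unit(a) + rng.normal(0, 0.02, 48)
                  for a in [1.0] * 20 + [2.0] * 20])
labels = np.array([0] * 20 + [1] * 20)
templates = build_templates(train, labels)
print(sort_spike(unit(2.0) + rng.normal(0, 0.02, 48), templates))  # likely 1
```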