Accelerating Continuous Normalizing Flow with Trajectory Polynomial Regularization
In this paper, we propose an approach to effectively accelerate the computation of continuous normalizing flows (CNF), which have proven to be a powerful tool for tasks such as variational inference and density estimation. The training cost of CNF can be extremely high because the number of function evaluations (NFE) required to solve the corresponding ordinary differential equations (ODEs) is very large. We argue that the high NFE results from large truncation errors in solving the ODEs. To address this problem, we propose to add a regularization term that penalizes the difference between the ODE trajectory and a polynomial regression fitted to it. The trajectory then approximates a polynomial function, so the truncation error becomes smaller. Furthermore, we provide two proofs that the additional regularization does not harm training quality. Experimental results show that our proposed method reduces NFE by 42.3% to 71.3% on density estimation and by 19.3% to 32.1% on variational auto-encoders, while the test losses are unaffected.
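As a rough illustration of the regularization idea, here is a minimal NumPy sketch (function names and the default degree are our own assumptions, not the paper's code): sampled points of the ODE trajectory are fitted with a least-squares polynomial in t, and the penalty is the mean squared residual of that fit.

```python
import numpy as np

def trajectory_poly_penalty(ts, zs, degree=2):
    """Hypothetical sketch of trajectory polynomial regularization:
    penalize the gap between sampled ODE trajectory points and their
    least-squares polynomial fit.

    ts : (T,) array of time points
    zs : (T, D) array of trajectory states z(t_i)
    Returns the mean squared residual of the per-dimension fits.
    """
    # Vandermonde design matrix for a degree-`degree` polynomial in t.
    V = np.vander(ts, degree + 1)                 # (T, degree+1)
    # Least-squares polynomial coefficients for every state dimension.
    coef, *_ = np.linalg.lstsq(V, zs, rcond=None)
    residual = zs - V @ coef                      # deviation from the fit
    return float(np.mean(residual ** 2))

# A trajectory that is exactly quadratic in t incurs a (near-)zero penalty,
# while a highly oscillatory one is penalized.
ts = np.linspace(0.0, 1.0, 8)
zs = np.stack([ts ** 2, 3 * ts + 1], axis=1)
assert trajectory_poly_penalty(ts, zs, degree=2) < 1e-10
```

In a CNF training loop, this penalty would be added to the usual likelihood loss with some weight; the abstract's claim is that trajectories pushed toward polynomials incur smaller ODE-solver truncation error, hence fewer function evaluations.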
Real-world Effectiveness and Tolerability of Interferon-free Direct-acting Antiviral for 15,849 Patients with Chronic Hepatitis C: A Multinational Cohort Study
BACKGROUND AND AIMS: As practice patterns and hepatitis C virus (HCV) genotypes (GT) vary geographically, a global real-world study from both East and West covering all GTs can help inform practice policy toward the 2030 HCV elimination goal. This study aimed to assess the effectiveness and tolerability of direct-acting antiviral (DAA) treatment in routine clinical practice in a multinational cohort of patients infected with all HCV GTs, focusing on GT3 and GT6.
METHODS: We analyzed the sustained virological response (SVR12) of 15,849 chronic hepatitis C patients treated at 39 clinical sites of the Real-World Evidence from the Asia Liver Consortium for HCV in Asia Pacific, North America, and Europe between 07/01/2014 and 07/01/2021.
RESULTS: The mean age was 62±13 years, with 49.6% male. The demographic breakdown was 91.1% Asian (52.9% Japanese, 25.7% Chinese/Taiwanese, 5.4% Korean, 3.3% Malaysian, and 2.9% Vietnamese), 6.4% White, 1.3% Hispanic/Latino, and 1% Black/African-American. Additionally, 34.8% had cirrhosis, 8.6% had hepatocellular carcinoma (HCC), and 24.9% were treatment-experienced (20.7% with interferon, 4.3% with direct-acting antivirals). The largest group was GT1 (10,246 [64.6%]), followed by GT2 (3,686 [23.2%]), GT3 (1,151 [7.2%]), GT6 (457 [2.8%]), GT4 (47 [0.3%]), GT5 (1 [0.006%]), and untyped GTs (261 [1.6%]). The overall SVR12 was 96.9%, with rates over 95% for GT1/2/3/6 but 91.5% for GT4. SVR12 for GT3 was 95.1% overall, 98.2% for GT3a, and 94.0% for GT3b. SVR12 was 98.3% overall for GT6, lower for patients with cirrhosis and treatment-experienced (TE) (93.8%) but ≥97.5% for treatment-naive patients regardless of cirrhosis status. On multivariable analysis, advanced age, prior treatment failure, cirrhosis, active HCC, and GT3/4 were independent predictors of lower SVR12, while being Asian was a significant predictor of achieving SVR12.
CONCLUSIONS: In this diverse multinational real-world cohort of patients with various GTs, the overall cure rate was 96.9%, despite large numbers of patients with cirrhosis, HCC, TE, and GT3/6. SVR12 for GT3/6 with cirrhosis and TE was lower but still excellent (>91%).
Clustering by Correlations and Similarity Search over Multiple Data Streams
Processing data streams has become increasingly important as more and more emerging applications must handle large amounts of data in the form of rapidly arriving streams. Their huge volume and evolving nature make them challenging to process. Moreover, in many cases more than one data stream needs to be analyzed simultaneously. To discover knowledge from multiple data streams, it is useful to first know the cross relationships among them. This dissertation therefore focuses on finding the relationships between streams, including clustering by correlations and similarity search over many streams. First, we devise COMET-CORE, a framework for Clustering Over Multiple Evolving sTreams by CORrelations and Events, which monitors the distribution of clusters over multiple data streams based on their correlations.
In the multiple data stream environment, where streams evolve as time advances, some streams might act similarly at this moment but dissimilarly at the next. The information of evolving clusters is valuable for supporting online decisions. Instead of re-clustering the multiple data streams periodically, COMET-CORE applies efficient cluster split and merge processes only when significant cluster evolution happens. Accordingly, we devise an event detection mechanism to signal the cluster adjustments. Incoming streams are smoothed into sequences of end points by piecewise linear approximation. Whenever end points are generated, weighted correlations between streams are updated. End points are good indicators of significant change in a stream, which is a main cause of cluster evolution events. When an event occurs, split and merge operations let us report the latest clustering results. In many real cases, streams are collected independently in a decentralized manner. Given a reference stream, searching for its most similar streams, which might exist in more than one distributed database, is helpful for many applications. We therefore present LEEWAVE, a bandwidth-efficient approach to searching range-specified k-nearest neighbors among distributed streams by LEvEl-wise distribution of WAVElet coefficients. This work focuses on the case where all streams are summarized using wavelet-based synopses. To find the k streams most similar to a range-specified reference stream, the relevant wavelet coefficients of the reference stream can be sent to the peer sites to compute similarities. However, bandwidth is wasted unnecessarily if all relevant coefficients are sent at once. Instead, we present a level-wise approach that leverages the multi-resolution property of the wavelet coefficients.
Starting from the top and moving down one level at a time, the query initiator sends only the single-level coefficients to a progressively shrinking set of candidates. In addition, we derive and maintain a similarity range for each candidate and gradually tighten its bounds as we move from one level to the next. The increasingly tight similarity ranges enable the query initiator to prune candidates effectively without causing any false dismissal. Finally, we discuss the case in which each stream is composed of uncertain values. We present PROUD, a PRObabilistic approach to processing similarity queries over Uncertain Data streams. In contrast to streams with certainty, an uncertain stream is an ordered sequence of random variables, and the distance between two uncertain streams is also a random variable. We adopt a general uncertain data model in which only the means and deviations of the random variables in an uncertain stream are available; their probability distributions need not be known. Under this model, we first derive mathematical conditions for progressively pruning candidates to reduce the computation cost. We then apply PROUD to a streaming environment where only sketches of streams, such as wavelet synopses, are available. PROUD offers a flexible trade-off between false positives and false negatives by controlling a probability threshold, while maintaining a similar computation cost. This trade-off is important because in some applications false negatives are more costly, while in others it is more critical to keep false positives low.
Contents
1 Introduction
  1.1 Motivation
  1.2 Overview of the Dissertation
  1.3 Organization of the Dissertation
2 Clustering over Multiple Evolving Streams by Events and Correlations
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Related Work
    2.2.2 Problem Model
  2.3 Data Summarization
  2.4 Similarity Measurement on Summary Structure
  2.5 The COMET-CORE Framework
    2.5.1 Event Detection
    2.5.2 Split Clusters
    2.5.3 Update Inter-Cluster Similarity
    2.5.4 Merge Clusters
    2.5.5 Analysis of COMET-CORE
  2.6 Empirical Studies
    2.6.1 Evaluation of COMET-CORE on Real Data
    2.6.2 Evaluation of COMET-CORE on Synthetic Data
  2.7 Summary
3 LEEWAVE: Level-Wise Distribution of Wavelet Coefficients for Processing kNN Queries over Distributed Streams
  3.1 Introduction
  3.2 Related Work
  3.3 Preliminaries
    3.3.1 Wavelet Decomposition
    3.3.2 Coefficient Maintenance
  3.4 The LEEWAVE Approach to Processing Distributed kNN Queries
    3.4.1 Computing Similarities Using Wavelet Coefficients
    3.4.2 LEEWAVE for a kNN Query
  3.5 Performance Study
    3.5.1 Experiments with Real Data
    3.5.2 Evaluation with Synthetic Data
  3.6 Summary
4 PROUD: A Probabilistic Approach to Processing Similarity Queries over Uncertain Data Streams
  4.1 Introduction
  4.2 Problem Statement
  4.3 Similarity Query Processing
    4.3.1 Statistics Computation of Uncertain Distance
    4.3.2 Candidate Selection
    4.3.3 Progressive Pruning
    4.3.4 The PROUD Algorithm
  4.4 Applying PROUD to Wavelet Synopses
    4.4.1 Wavelet Summarization for Uncertain Data Streams
    4.4.2 Statistics Computation Using Wavelet Coefficients
    4.4.3 Pruning Strategy
  4.5 Performance Study
    4.5.1 Experiments with Real Data
    4.5.2 Experiments with Synthetic Data
  4.6 Summary
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Bibliography
Appendix
A Proof of Theorem 3.1
B Proof of PROUD
  B.1 Proof of Non-increasing rnorm in the Raw Uncertain Series Case
  B.2 Proof of Non-increasing rnorm in the Wavelet Synopses Case
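The candidate-selection idea behind PROUD, reasoning about an uncertain distance from means and deviations alone, can be illustrated with a small sketch. This is our own Markov-inequality illustration under an independence assumption, not the dissertation's exact bound:

```python
def expected_sq_distance(mu_x, sd_x, mu_y, sd_y):
    """Expected squared Euclidean distance between two uncertain streams
    whose values are independent random variables with known means and
    standard deviations (the only statistics PROUD's model assumes):
        E[(X - Y)^2] = (mu_x - mu_y)^2 + sd_x^2 + sd_y^2  per position.
    """
    return sum((mx - my) ** 2 + sx ** 2 + sy ** 2
               for mx, sx, my, sy in zip(mu_x, sd_x, mu_y, sd_y))

def match_probability_lower_bound(mu_x, sd_x, mu_y, sd_y, eps_sq):
    """Hypothetical filter in the spirit of PROUD's probabilistic
    candidate selection (our illustration, not the paper's derivation):
    since Markov's inequality gives P(D > eps_sq) <= E[D] / eps_sq,
    we get P(D <= eps_sq) >= 1 - E[D] / eps_sq. A stream whose bound
    already exceeds the user's probability threshold is a sure match."""
    mean_d = expected_sq_distance(mu_x, sd_x, mu_y, sd_y)
    return max(0.0, 1.0 - mean_d / eps_sq)

# Identical certain streams match with probability 1 under any eps.
assert match_probability_lower_bound([1, 2], [0, 0], [1, 2], [0, 0], 4.0) == 1.0
# A distant stream cannot be certified as a match by this bound.
assert match_probability_lower_bound([0], [0], [10], [0], 4.0) == 0.0
```

Candidates that can be neither certified nor discarded by such cheap bounds would proceed to the progressive refinement steps the abstract describes.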
On Shortest Unique Substring Queries
Abstract—In this paper, we tackle a novel type of query — the shortest unique substring query. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest unique substring queries have many potential applications, such as information retrieval, bioinformatics, and event context analysis. We develop efficient algorithms for online query answering. First, we present an algorithm that answers a shortest unique substring query in O(n) time using a suffix tree index, where n is the length of string S. Second, we show that, using O(n·h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is theoretically in O(n) but on real data sets is often much smaller than n and can be treated as a constant. Once the shortest unique substrings are pre-computed, shortest unique substring queries can be answered online in constant time. In addition to these algorithmic results, we empirically demonstrate the effectiveness and efficiency of shortest unique substring queries on real data sets.
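To pin down the problem definition, here is a brute-force reference implementation of our own (deliberately simple, not the paper's O(n) suffix-tree algorithm):

```python
def shortest_unique_substring(s, q):
    """Return a shortest substring s[i:i+length] that covers position q
    (i <= q < i + length) and occurs exactly once in s.

    Brute force: try lengths in increasing order, so the first unique
    window found is a shortest answer. The paper achieves O(n) per query
    with a suffix tree; this version only fixes the semantics.
    """
    n = len(s)
    for length in range(1, n + 1):
        # Every window of this length that still covers position q.
        for i in range(max(0, q - length + 1), min(q, n - length) + 1):
            cand = s[i:i + length]
            # Count (possibly overlapping) occurrences of cand in s.
            occ = sum(1 for k in range(n - length + 1)
                      if s[k:k + length] == cand)
            if occ == 1:
                return cand
    return None

assert shortest_unique_substring("abcbc", 0) == "a"    # "a" occurs once
assert shortest_unique_substring("abcbc", 4) == "cbc"  # "c" and "bc" repeat
```

The pre-computation result in the abstract amounts to running such a query for every position at once, which the paper does in O(n·h) total time rather than position by position.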
PASSLEAF: A Pool-bAsed Semi-Supervised LEArning Framework for Uncertain Knowledge Graph Embedding
In this paper, we study the problem of embedding uncertain knowledge graphs, where each relation between entities is associated with a confidence score. Observing that existing embedding methods may discard uncertainty information, incorporate only a specific type of score function, or produce many false-negative samples during training, we propose the PASSLEAF framework to address these issues. PASSLEAF consists of two parts: a model that can incorporate different types of scoring functions to predict relation confidence scores, and a semi-supervised learning model that exploits both positive and negative samples together with their estimated confidence scores. Furthermore, PASSLEAF leverages a sample pool as a relay for generated samples to further augment semi-supervised learning. Experimental results show that our proposed framework learns better embeddings, achieving higher accuracy in both confidence score prediction and tail entity prediction.
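For intuition, here is a minimal sketch of the confidence-regression component, using a DistMult-style score as one plug-in choice; the names, shapes, and loss below are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult-style triple score <h, r, t> = sum(h * r * t): one
    concrete choice among the plug-in scoring functions a PASSLEAF-like
    framework could accept."""
    return float(np.sum(h * r * t))

def predicted_confidence(h, r, t):
    """Squash the raw score into (0, 1) with a sigmoid so it can be
    regressed against the graph's relation confidence labels."""
    return 1.0 / (1.0 + np.exp(-distmult_score(h, r, t)))

def confidence_mse(batch, ent, rel):
    """Mean squared error between predicted and labeled confidences.
    `batch` holds (head, relation, tail, confidence) index tuples; in
    the semi-supervised stage, samples drawn from the pool would carry
    the model's own estimated confidences as labels instead of ground
    truth."""
    errs = [(predicted_confidence(ent[h], rel[r], ent[t]) - c) ** 2
            for h, r, t, c in batch]
    return sum(errs) / len(errs)

# Zero embeddings give sigmoid(0) = 0.5 confidence for every triple.
ent = np.zeros((3, 4))
rel = np.zeros((2, 4))
assert predicted_confidence(ent[0], rel[0], ent[1]) == 0.5
```

Training would minimize such a loss over both labeled triples and pool-relayed generated samples, which is the semi-supervised augmentation the abstract describes.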