
346 research outputs found

A Cover-Merging-Based Algorithm for the Longest Increasing Subsequence in a Sliding Window Problem

Author: Deorowicz Sebastian
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 24/01/2013

The longest increasing subsequence (LIS) problem is a well-known combinatorial problem with applications mainly in bioinformatics, where it is used in various projects on DNA sequences. Recently, a number of generalisations of this problem have been proposed. One of them is to find an LIS among all fixed-size windows of the input sequence (LISW). We propose an algorithm for the LISW problem, based on a cover representation of the sequence, that outperforms the existing methods for some classes of input sequences.
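
To make the LISW problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the cover-merging algorithm from the paper: it simply recomputes a strictly increasing LIS for every window with the standard patience-sorting technique. The function names `lis_length` and `lisw_naive` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest possible tail of an increasing subsequence of length k + 1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[i] = x      # x gives a smaller tail for length i + 1
    return len(tails)


def lisw_naive(seq, w):
    """Return (best_length, window_start) over all windows of size w.

    Brute-force baseline: O(n * w log w) time. Ties are resolved in favour
    of the leftmost window.
    """
    if w <= 0 or w > len(seq):
        raise ValueError("window size must satisfy 0 < w <= len(seq)")
    best_len, best_start = 0, 0
    for start in range(len(seq) - w + 1):
        length = lis_length(seq[start:start + w])
        if length > best_len:
            best_len, best_start = length, start
    return best_len, best_start
```

A baseline of this kind recomputes everything for each window, which is the redundancy that dedicated LISW algorithms aim to avoid.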

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Lempel-Ziv Data Compression on Parallel and Distributed Systems

Author: Sergio De Agostino
Publication date: 01/01/2011

We present a survey of results concerning Lempel-Ziv data compression on parallel and distributed systems, starting from the theoretical approach to parallel time complexity and concluding with the practical goal of designing distributed algorithms with low communication cost. An extension by Storer to image compression is also discussed.

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Open Access Repository

Archivio della ricerca- Università di Roma La Sapienza

The user attribution problem and the challenge of persistent surveillance of user activity in complex networks

Author: Cannady James D., Jr.
Taglienti Claudio
Publication venue: NSUWorks
Publication date: 19/04/2016

In telecommunication networks, the user attribution problem refers to the challenge of recognizing communication traffic as belonging to a given user when the information needed to identify the user is missing. This problem becomes more difficult to tackle as users move across many mobile networks (complex networks) owned and operated by different providers. The traditional approach of using the source IP address as a tracking identifier does not work for mobile users. Recent efforts to address this problem by relying exclusively on web browsing behavior to identify users brought to light the challenges faced by solutions that try to link multiple user sessions together using only the frequency of web sites visited by the user. This study tackles the problem by using behavior-based identification while accounting for time and the sequential order of web visits by a user. Hierarchical Temporal Memories (HTM) were used to classify historical navigational patterns for different users. This approach enables linking multiple user sessions together without the need for a tracking identifier such as the source IP address. Results are promising. HTMs outperform traditional Markov-chain-based approaches and can provide high levels of identification accuracy.

Crossref

NSU Works

Accelerating Event Stream Processing in On- and Offline Systems

Author: Körber Michael
Publication venue: Philipps-Universität Marburg
Publication date: 01/01/2021

Due to a growing number of data producers and their ever-increasing data volume, the ability to ingest, analyze, and store potentially never-ending streams of data is a mission-critical task in today's data processing landscape. A widespread form of data stream is the event stream, which consists of continuously arriving notifications about some real-world phenomenon. For example, a temperature sensor naturally generates an event stream by periodically measuring the temperature and reporting it, together with the measurement time, whenever it differs substantially from the previous measurement. In this thesis, we consider two kinds of event stream processing: online and offline. Online refers to processing events solely in main memory as soon as they arrive, while offline means processing event data previously persisted to non-volatile storage. Both modes are supported by widely used scale-out general-purpose stream processing engines (SPEs) like Apache Flink or Spark Streaming. However, such engines suffer from two significant deficiencies that severely limit their processing performance. First, for offline processing, they load the entire stream from non-volatile secondary storage and replay all data items into the associated online engine in order of their original arrival. While this naturally ensures unified query semantics for on- and offline processing, the costs for reading the entire stream from non-volatile storage quickly dominate the overall processing costs. Second, modern SPEs focus on scaling out computations across the nodes of a cluster, but use only a fraction of the available resources of individual nodes. This thesis tackles those problems with three different approaches. First, we present novel techniques for the offline processing of two important query types (windowed aggregation and sequential pattern matching). Our methods utilize well-understood indexing techniques to reduce the total amount of data to read from non-volatile storage. We show that this improves the overall query runtime significantly. In particular, this thesis develops the first index-based algorithms for pattern queries expressed with the Match_Recognize clause, a new and powerful language feature of SQL that has received little attention so far. Second, we show how to maximize resource utilization of single nodes by exploiting the capabilities of modern hardware. To this end, we develop a prototypical shared-memory CPU-GPU-enabled event processing system. The system provides implementations of all major event processing operators (filtering, windowed aggregation, windowed join, and sequential pattern matching). Our experiments reveal that, regarding resource utilization and processing throughput, such a hardware-enabled system is superior to hardware-agnostic general-purpose engines. Finally, we present TPStream, a new operator for pattern matching over temporal intervals. TPStream achieves low processing latency and, in contrast to sequential pattern matching, is easily parallelizable even for unpartitioned input streams. This results in maximized resource utilization, especially for modern CPUs with multiple cores.
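
As an illustration of one of the query types named in the abstract, windowed aggregation, here is a minimal Python sketch that averages sensor readings over tumbling (fixed-width, non-overlapping) windows. It does not reproduce the index-based or GPU-based techniques of the thesis. The event layout `(timestamp, value)` and the function name `tumbling_window_avg` are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_avg(events, width):
    """Average the values of (timestamp, value) events per tumbling window.

    Each event is assigned to the window [start, start + width) with
    start = (timestamp // width) * width. Returns {window_start: average},
    ordered by window start.
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    acc = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in events:
        start = (ts // width) * width
        acc[start][0] += value
        acc[start][1] += 1
    return {start: total / count for start, (total, count) in sorted(acc.items())}


# Hypothetical temperature events: (measurement time in seconds, degrees Celsius)
readings = [(0, 20.0), (4, 22.0), (10, 30.0), (19, 10.0), (25, 5.0)]
averages = tumbling_window_avg(readings, width=10)
```

This sketch scans every event once, which mirrors the full-replay cost the abstract describes for offline processing; an index over the persisted stream is one way to avoid reading all of the data.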

Publikations- und Dokumentenserver der Universitätsbibliothek Marburg

Periodic Pattern Mining - Algorithms and Applications

Author: G.N.V.G. Sirisha
G.V. Padma Raju
Shashi Mogalla
Publication venue: Global Journals Inc. (US)
Publication date: 15/07/2013

Owing to a large number of applications, periodic pattern mining has been extensively studied for over a decade. A periodic pattern is a pattern that repeats itself with a specific period in a given sequence. Periodic patterns can be mined from datasets such as biological sequences, continuous and discrete time series data, spatiotemporal data, and social networks. Periodic patterns are classified based on different criteria. Based on the frequency of occurrence, they are categorized as frequent periodic patterns and statistically significant patterns. Frequent periodic patterns are in turn classified as perfect and imperfect periodic patterns, full and partial periodic patterns, synchronous and asynchronous periodic patterns, dense periodic patterns, and approximate periodic patterns. This paper presents a survey of state-of-the-art research on periodic pattern mining algorithms and their application areas. A discussion of the merits and demerits of these algorithms is given. The paper also presents a brief overview of algorithms that can be applied to specific types of datasets, such as spatiotemporal data and social networks.
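
The notion of a pattern that repeats with a specific period can be illustrated with a small Python sketch. It measures how consistently a single symbol recurs at positions `offset, offset + period, offset + 2 * period, ...` of a sequence: a ratio of 1.0 corresponds to a perfect periodic pattern, and a lower ratio to an imperfect one. The function name `periodicity_confidence` and this particular confidence measure are assumptions made for illustration, not a definition taken from the survey.

```python
def periodicity_confidence(seq, symbol, period, offset=0):
    """Fraction of positions offset, offset + period, ... that hold `symbol`."""
    if period <= 0:
        raise ValueError("period must be positive")
    positions = range(offset, len(seq), period)
    if not positions:  # no position to check, e.g. offset beyond the sequence
        return 0.0
    hits = sum(1 for i in positions if seq[i] == symbol)
    return hits / len(positions)


sequence = "abcabcabd"
perfect = periodicity_confidence(sequence, "a", period=3, offset=0)    # 'a' at 0, 3, 6
imperfect = periodicity_confidence(sequence, "c", period=3, offset=2)  # 'c' at 2, 5 but not 8
```

A mining algorithm would additionally have to search over candidate symbols, periods, and offsets and apply a threshold to the ratio; the sketch only evaluates one candidate.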

Global Journal of Computer Science and Technology (GJCST)

ENVIRONMENTAL MODEL ACCURACY IMPROVEMENT FRAMEWORK USING STATISTICAL TECHNIQUES AND A NOVEL TRAINING APPROACH

Author: Matta Rekesh
Publication venue: East Carolina University
Publication date: 22/06/2020

It is challenging to predict environmental behaviors because of extreme events, such as heatwaves, typhoons, droughts, tsunamis, torrential downpours, wind ramps, or hurricanes. In this thesis, we propose a framework that improves environmental model accuracy with a novel training approach. Extreme event detection algorithms are surveyed, selected, and applied in our proposed framework. The application of statistics in extreme event detection is quite diverse and leads to diverse formulations, each of which needs to be tailored to a specific problem and to the data available in the given situation. This diversity is one of the driving forces of this research towards identifying the most common mixture of components used in the analysis of extreme event detection. Besides the extreme event detection algorithm, we also integrated the sliding window approach to see how well our models predict future events. To test the proposed framework, we collected coastal data from various sources; with our approach, the predictive accuracy of various machine learning models improved by a 20% to 25% increase in R2 value. In addition, we organized the discussion by extreme event detection type, presented a few outlier definitions, and briefly introduced the corresponding techniques. We also summarized the statistical methods involved in the detection of environmental extremes, such as wind ramps and climatic events.

ScholarShip

Pattern mining under different conditions

Author: Lu Yifeng
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 12/05/2021

New requirements and demands on pattern mining arise in modern applications, which cannot be fulfilled using conventional methods. For example, in scientific research, scientists are more interested in unknown knowledge, which usually hides in significant but not frequent patterns. However, existing itemset mining algorithms are designed for very frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility. A series of datasets is generated at once, waiting for clustering, and each can contain an unknown number of clusters with various densities and shapes. Using existing clustering algorithms is time-consuming because parameter tuning is necessary for each dataset. Many scientific datasets are extremely noisy. They contain considerably more noise points than in-cluster data points. Most existing clustering algorithms can only handle noise up to a moderate level. Temporal pattern mining is also important in scientific research. Existing temporal pattern mining algorithms only consider point-based events. However, most activities in the real world are interval-based, with a starting and an ending timestamp. This thesis develops novel pattern mining algorithms for various data mining tasks under different conditions. The first part of this thesis investigates the problem of mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, this part focuses on itemsets that occur less frequently. The algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such itemsets efficiently. NIIMiner utilizes the negative itemset tree to extract all patterns that occur less often than a given support threshold in a top-down, depth-first manner. RaCloMiner combines existing bottom-up frequent itemset mining algorithms with a top-down itemset mining algorithm to achieve better performance in mining less frequent patterns. LSCMiner investigates the problem of mining less frequent closed patterns. The second part of this thesis studies the problem of interval-based temporal pattern mining in the stream environment. Interval-based temporal patterns are sequential patterns in which each event is associated with a starting and an ending time. The ability to handle interval-based events and stream data is lacking in existing approaches. A novel interval-based temporal pattern mining algorithm for stream data is described in this part. The last part of this thesis studies new problems in clustering on numeric datasets. The first problem tackled in this part is shape alternation adaptability in clustering. In applications such as scientific data analysis, scientists need to deal with a series of datasets generated from one experiment. Cluster sizes and shapes differ across those datasets. A kNN density-based clustering algorithm, kadaClus, is proposed to provide shape alternation adaptability so that users do not need to tune parameters for each dataset. The second problem studied in this part is clustering in an extremely noisy dataset. Many real-world datasets contain considerably more noise points than in-cluster data points. A novel clustering algorithm, kenClus, is proposed to identify clusters of arbitrary shape in extremely noisy datasets. Both clustering algorithms are kNN-based and require only one parameter, k. In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed. Extensive experiments on synthetic and real-world datasets are conducted to show the benefits of the proposed algorithms over conventional approaches.
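
To illustrate why interval-based events are richer than point-based ones, the Python sketch below classifies the temporal relation between two intervals using the thirteen relations of Allen's interval algebra. The abstract does not state which relation model the thesis uses, so the choice of Allen's relations, the `(start, end)` tuple layout, and the function name `allen_relation` are assumptions made for this example.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Intervals are (start, end) tuples with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("each interval needs start < end")
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if a_start < b_start:
        # a begins first and reaches into b
        return "overlaps" if a_end < b_end else "contains"
    # From here on a begins strictly after b begins
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    return "during" if a_end < b_end else "overlapped-by"
```

Two point events can only be ordered as before, equal, or after, whereas two intervals admit thirteen distinct arrangements, which is one reason point-based mining algorithms do not carry over directly.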

Digitale Hochschulschriften der LMU

28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland

Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/07/2017

Peer reviewed

Helsingin yliopiston digitaalinen arkisto