Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high-performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU"
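The frontier-centric, bulk-synchronous idea described in this abstract can be illustrated with a small CPU-only sketch of breadth-first search. This is not Gunrock's API: the function names `advance` and `filter_frontier`, the adjacency-dict graph format, and the toy graph are all assumptions made for illustration.

```python
def advance(graph, frontier, depth):
    """Expand every frontier vertex to its unvisited neighbors.

    The result may contain duplicates, as a parallel expansion would.
    """
    candidates = []
    for u in frontier:
        for v in graph[u]:
            if v not in depth:
                candidates.append(v)
    return candidates


def filter_frontier(candidates, depth, level):
    """Drop duplicates, label each new vertex, and return the next frontier."""
    next_frontier = []
    for v in candidates:
        if v not in depth:
            depth[v] = level
            next_frontier.append(v)
    return next_frontier


def bfs(graph, source):
    """Run BFS as a loop of bulk steps over the current vertex frontier."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        candidates = advance(graph, frontier, depth)
        frontier = filter_frontier(candidates, depth, level)
    return depth


# Hypothetical toy graph: vertex -> list of out-neighbors.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
depths = bfs(graph, 0)
```

Each loop iteration is one bulk step: the whole frontier is expanded, then filtered, before the next step begins. On a GPU, both steps would run in parallel over the frontier.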
Dealing with Complexity in Process Model Discovery Through Segmentation
The foundation of every successful organization is proper business process management, as it allows the organization's production processes and employed resources to be maintained in the best possible way. Many notable problems that arise in production can be analyzed using process mining techniques and event logs obtained from tasks executed during the organization's lifetime. One common way to do this is to generate process models in order to study existing operations and to explore the organization's processes with the aim of changing them. As process mining techniques have developed and expanded, many methods and tools have appeared that can help solve this kind of task. However, most existing tools, when applied to real-life event logs, produce spaghetti-like models that are difficult to understand without explanation.
In this thesis we try to address this issue by filtering and sorting the logs before mining, as well as by adjusting model complexity, thus obtaining process models that we measure and revise until they reach the desired complexity. The final result is a tool that produces a set of simple and understandable process models from which the user can select according to his or her preferences.
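The effect of filtering a log before mining can be shown with a minimal sketch. The abstract does not specify the thesis's filters or complexity measures, so everything here is an assumption: traces are filtered by how often their exact activity sequence occurs, the mined model is a plain directly-follows graph, and its edge count stands in for model complexity.

```python
from collections import Counter


def filter_log(traces, min_count):
    """Keep traces whose exact activity sequence occurs at least min_count times."""
    variants = Counter(tuple(trace) for trace in traces)
    return [trace for trace in traces if variants[tuple(trace)] >= min_count]


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


# Hypothetical event log: each trace is the ordered list of activities of one case.
log = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]

full_model = directly_follows(log)
simple_model = directly_follows(filter_log(log, min_count=2))
```

Dropping the single rare variant removes two of the four edges, which is the kind of simplification the abstract aims for, at the cost of no longer describing the rare behavior.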
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
Comment: 37 pages
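One simple way to embed a translation model in bag-of-words retrieval is to replace each query term with its most probable translations, weighted by probability. The sketch below assumes this approach, which is not necessarily any of the integration strategies the paper evaluates. The translation table, its probabilities, and the function names are hypothetical.

```python
def translate_query(query_terms, translation_table, top_k=2):
    """Map each source term to its top_k target terms, weighted by probability."""
    weighted = {}
    for term in query_terms:
        candidates = sorted(
            translation_table.get(term, {}).items(),
            key=lambda pair: pair[1],
            reverse=True,
        )[:top_k]
        for target, prob in candidates:
            weighted[target] = weighted.get(target, 0.0) + prob
    return weighted


def score(document_terms, weighted_query):
    """Bag-of-words score: weighted count of translated query terms in the document."""
    return sum(
        weight * document_terms.count(term)
        for term, weight in weighted_query.items()
    )


# Hypothetical French-to-English translation probabilities.
table = {
    "maison": {"house": 0.7, "home": 0.2, "building": 0.1},
    "rouge": {"red": 0.9, "ruddy": 0.1},
}
query = translate_query(["maison", "rouge"], table, top_k=2)
relevance = score(["red", "house", "garden"], query)
```

Because word order is ignored, a word-level probability table is enough here, which is why the abstract describes the translation functionality CLIR needs as relatively unsophisticated.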
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions.
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with
HPC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count-based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
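Two of the ingredients named in this abstract, minwise hashing and reservoir sampling, can be combined in a small sketch of an LSH-style index with fixed-size buckets. This is not FLASH's implementation: it uses plain minwise hashing with several linear hash functions rather than the one-pass variant the abstract mentions, and all names, parameters, and data are assumptions for illustration.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; feature indices are assumed to be smaller


def make_hashes(k, seed=0):
    """Draw k random linear hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash(features, hashes):
    """Signature of a non-empty set of nonzero indices: one minimum per hash."""
    return tuple(min((a * x + b) % PRIME for x in features) for a, b in hashes)


def reservoir_insert(bucket, item, seen, capacity, rng):
    """Keep a uniform sample of at most `capacity` of the `seen` items so far."""
    if len(bucket) < capacity:
        bucket.append(item)
    else:
        slot = rng.randrange(seen)  # `seen` counts the current item as well
        if slot < capacity:
            bucket[slot] = item


def build_index(data, hashes, capacity, seed=0):
    """Hash every item into a bucket keyed by its signature, bounding bucket size."""
    rng = random.Random(seed)
    buckets, counts = {}, {}
    for item_id, features in data.items():
        key = minhash(features, hashes)
        counts[key] = counts.get(key, 0) + 1
        reservoir_insert(buckets.setdefault(key, []), item_id, counts[key], capacity, rng)
    return buckets


# Hypothetical sparse binary vectors, given as sets of nonzero indices.
hashes = make_hashes(k=4)
data = {"a": {1, 5, 9}, "b": {9, 5, 1}, "c": {2, 7}}
index = build_index(data, hashes, capacity=2)
```

Items that land in the same bucket are candidate neighbors, so a lookup never computes a similarity. Capping each bucket with a reservoir keeps memory bounded even when many items collide.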
Improving Hypernymy Extraction with Distributional Semantic Classes
In this paper, we show how distributionally-induced semantic classes can be
helpful for extracting hypernyms. We present methods for inducing sense-aware
semantic classes using distributional semantics and using these induced
semantic classes for filtering noisy hypernymy relations. Denoising of
hypernyms is performed by labeling each semantic class with its hypernyms. On
the one hand, this allows us to filter out wrong extractions using the global
structure of distributionally similar senses. On the other hand, we infer
missing hypernyms via label propagation to cluster terms. We conduct a
large-scale crowdsourcing study showing that processing of automatically
extracted hypernyms using our approach improves the quality of the hypernymy
extraction in terms of both precision and recall. Furthermore, we show the
utility of our method in the domain taxonomy induction task, achieving the
state-of-the-art results on a SemEval'16 task on taxonomy induction.
Comment: In Proceedings of the 11th Conference on Language Resources and
Evaluation (LREC 2018). Miyazaki, Japan
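The denoising idea in this abstract, labeling each semantic class with its hypernyms and then using those labels both to filter and to fill in, can be sketched as follows. The voting threshold `min_share`, the function names, and the toy data are assumptions. Sense induction and the paper's actual scoring are not reproduced.

```python
from collections import Counter


def label_classes(classes, hypernyms, min_share=0.5):
    """Label each class with hypernyms extracted for at least min_share of its terms."""
    labels = {}
    for name, terms in classes.items():
        votes = Counter(h for term in terms for h in hypernyms.get(term, set()))
        labels[name] = {h for h, n in votes.items() if n / len(terms) >= min_share}
    return labels


def denoise(classes, hypernyms, labels):
    """Per term: report hypernyms that are kept, removed as noise, or inferred."""
    report = {}
    for name, terms in classes.items():
        for term in terms:
            extracted = hypernyms.get(term, set())
            report[term] = {
                "kept": extracted & labels[name],
                "removed": extracted - labels[name],
                "inferred": labels[name] - extracted,
            }
    return report


# Hypothetical induced class and noisy extractions.
classes = {"fruits": ["apple", "mango", "pear"]}
hypernyms = {"apple": {"fruit", "company"}, "mango": {"fruit"}, "pear": set()}

labels = label_classes(classes, hypernyms)
report = denoise(classes, hypernyms, labels)
```

In this toy case "company" is removed from "apple" because the rest of the class does not support it, while "pear" gains "fruit" by propagation from its class.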
An Enhanced Web Data Learning Method for Integrating Item, Tag and Value for Mining Web Contents
The proposed system analyses the scope introduced by Web 2.0 and collaborative tagging systems. Several challenges have to be addressed too, notably the problem of information overload. Recommender systems are among the most successful approaches for increasing the level of relevant content over the “noise.” Traditional recommender systems fail to address the requirements presented in collaborative tagging systems. This paper considers the problem of item recommendation in collaborative tagging systems. It is proposed to model data from collaborative tagging systems with three-mode tensors, in order to capture the three-way correlations between users, tags, and items. By applying multiway analysis, latent correlations are revealed, which help to improve the quality of recommendations. Moreover, a hybrid scheme is proposed that additionally considers content-based information that is extracted from items. We propose an advanced data mining method using SVD that combines tag and value similarity with item and user preferences. SVD automatically extracts data from query result pages by first identifying and segmenting the query result records in the query result pages and then aligning the segmented query result records into a table, in which the data values from the same attribute are put into the same column. Specifically, we propose new techniques to handle the case when the query result records are based on user preferences, which may be due to the presence of auxiliary information, such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the query result records.
Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction
Topic distillation is the process of finding authoritative Web pages and comprehensive “hubs” which reciprocally endorse each other and are relevant to a given query. Hyperlink-based topic distillation has traditionally been applied to a macroscopic Web model where documents are nodes in a directed graph and hyperlinks are edges. Macroscopic models miss valuable clues such as banners, navigation panels, and template-based inclusions, which are embedded in HTML pages using markup tags. Consequently, results of macroscopic distillation algorithms have been deteriorating in quality as Web pages become more complex. We propose a uniform fine-grained model for the Web in which pages are represented by their tag trees (also called their Document Object Models or DOMs) and these DOM trees are interconnected by ordinary hyperlinks. Surprisingly, macroscopic distillation algorithms do not work in the fine-grained scenario. We present a new algorithm suitable for the fine-grained model. It can disaggregate hubs into coherent regions by segmenting their DOM trees. Mutual endorsement between hubs and authorities involves these regions, rather than single nodes representing complete hubs. Anecdotes and measurements using a 28-query, 366,000-document benchmark suite, used in earlier topic distillation research, reveal two benefits from the new algorithm: distillation quality improves, and a by-product of distillation is the ability to extract relevant snippets from hubs which are only partially relevant to the query.
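The mutual endorsement between hubs and authorities in the macroscopic model can be sketched with the classic hub/authority iteration, commonly known as HITS. This shows only the page-level baseline the abstract contrasts against. The paper's DOM-segmenting algorithm is not reproduced, and the link graph and names are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Iterate hub and authority scores over a directed page-level link graph."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = normalize(auth)
        # A page is a good hub if it links to good authorities.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = normalize(hub)
    return hub, auth


# Hypothetical link graph: (source page, target page).
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

Because each page is a single node here, a hub that is only partly relevant passes its full score to every page it links to. The fine-grained model in the abstract addresses this by letting regions of a hub's DOM tree endorse authorities separately.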