
    Transductive Learning with String Kernels for Cross-Domain Text Classification

    For many text classification tasks, the lack of labeled data in the target domain poses a major problem. Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as native language identification and automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification.
    Comment: Accepted at ICONIP 2018. arXiv admin note: substantial text overlap with arXiv:1808.0840
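
    As an illustration of the kind of similarity measure the abstract builds on, the following minimal Python sketch computes a plain character p-spectrum string kernel between two texts. It is a generic textbook example, not the authors' blended kernel or their transductive adaptation, and all function names are illustrative.

        # Minimal p-spectrum string kernel sketch (illustrative, not the paper's code).
        from collections import Counter

        def p_grams(text, p):
            """Count all contiguous character p-grams of `text`."""
            return Counter(text[i:i + p] for i in range(len(text) - p + 1))

        def spectrum_kernel(x, y, p=3):
            """Unnormalized p-spectrum kernel: dot product of p-gram count vectors."""
            gx, gy = p_grams(x, p), p_grams(y, p)
            if len(gx) > len(gy):          # iterate over the smaller spectrum
                gx, gy = gy, gx
            return sum(count * gy[gram] for gram, count in gx.items())

        def normalized_kernel(x, y, p=3):
            """Cosine-normalized kernel, so that k(x, x) = 1."""
            kxx, kyy = spectrum_kernel(x, x, p), spectrum_kernel(y, y, p)
            return spectrum_kernel(x, y, p) / (kxx * kyy) ** 0.5 if kxx and kyy else 0.0

        print(normalized_kernel("the movie was great", "the film was great"))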

    Parallel text index construction

    The focus of this dissertation is the parallel construction of text indices. Text indices provide additional information about a text that allows queries on that text to be answered faster. Full-text indices, for example, are used to answer phrase queries efficiently, i.e., whether and where a phrase occurs in a text. The research in this dissertation focuses mainly, but not exclusively, on parallel construction algorithms for text indices in both shared and distributed memory. In the first part, we consider wavelet trees: compact indices that generalize rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part, we consider the suffix array, the most thoroughly studied text index. The suffix array of a text contains the starting positions of the text's lexicographically sorted suffixes, i.e., constructing it amounts to sorting all suffixes of the text. The suffix array is often augmented with the longest-common-prefix array (LCP array), which stores the length of the longest common prefix of each pair of lexicographically consecutive suffixes. Finally, we use distributed suffix and LCP arrays to construct the distributed Patricia trie, which allows us to answer various phrase queries more efficiently than with the suffix array alone.
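
    To make the two central objects concrete, the following Python sketch builds a suffix array naively and then derives the LCP array with Kasai's algorithm. It only illustrates the definitions; it is not the parallel shared- or distributed-memory construction studied in the dissertation.

        # Suffix array by naive sorting and LCP array via Kasai's algorithm (illustrative).
        def suffix_array(text):
            """Starting positions of all suffixes of `text` in lexicographic order."""
            return sorted(range(len(text)), key=lambda i: text[i:])

        def lcp_array(text, sa):
            """lcp[r] = length of the longest common prefix of the suffixes
            at ranks r-1 and r of the suffix array (lcp[0] = 0)."""
            n = len(text)
            rank = [0] * n
            for r, i in enumerate(sa):
                rank[i] = r
            lcp, h = [0] * n, 0
            for i in range(n):
                if rank[i] > 0:
                    j = sa[rank[i] - 1]
                    while i + h < n and j + h < n and text[i + h] == text[j + h]:
                        h += 1
                    lcp[rank[i]] = h
                    h = max(h - 1, 0)
                else:
                    h = 0
            return lcp

        sa = suffix_array("banana")
        print(sa)                        # [5, 3, 1, 0, 4, 2]
        print(lcp_array("banana", sa))   # [0, 1, 3, 0, 0, 2]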

    GPU accelerating distributed succinct de Bruijn graph construction

    Research and methods in the field of computational biology have grown over the last decades, thanks to the increasing availability of biological data. Applications in computational biology include genome sequencing and sequence alignment, the latter being a method of arranging sequences of, for example, DNA or RNA to determine regions of similarity between them. Sequence alignment serves public health purposes such as monitoring antimicrobial resistance. The demand for fast sequence alignment has led to the use of data structures, such as the de Bruijn graph, that store large amounts of information efficiently. De Bruijn graphs are currently among the most widely used data structures for indexing genome sequences, and different methods of representing them have been explored. One of these is the BOSS data structure, a special case of the Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph. Because genomes can occupy a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and graphics processing units (GPUs) for genome data processing. This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide speedups of up to 8x for specific operations, but for other operations it has severe limitations that constrain its usefulness for succinct de Bruijn graph index construction.
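
    For readers unfamiliar with the underlying structure, the sketch below shows the basic, non-distributed idea of de Bruijn graph construction from reads: every (k+1)-mer of a read contributes an edge from its prefix k-mer to its suffix k-mer. It does not reproduce the succinct BOSS/Wheeler-graph encoding or the Spark RAPIDS pipeline evaluated in the thesis.

        # Plain de Bruijn graph from reads (illustrative; not the succinct BOSS encoding).
        from collections import defaultdict

        def de_bruijn_edges(reads, k):
            """Map each k-mer node to the set of k-mer nodes it has an edge to."""
            graph = defaultdict(set)
            for read in reads:
                for i in range(len(read) - k):
                    graph[read[i:i + k]].add(read[i + 1:i + k + 1])
            return graph

        for node, targets in sorted(de_bruijn_edges(["ACGTACG", "GTACGTT"], k=3).items()):
            print(node, "->", sorted(targets))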

    Compressing and Performing Algorithms on Massively Large Networks

    Networks are represented as a set of nodes (vertices) and the arcs (links) connecting them. Such networks can model various real-world structures such as social networks (e.g., Facebook), information networks (e.g., citation networks), technological networks (e.g., the Internet), and biological networks (e.g., gene-phenotype networks). Analysis of such structures is a heavily studied area with many applications. However, in this era of big data, we find ourselves with networks so massive that their space requirements inhibit analysis. Since many of these networks have numbers of nodes and arcs on the order of billions to trillions, even basic data structures such as adjacency lists could cost petabytes to zettabytes of storage. Storing these networks in secondary memory would require I/O access (i.e., disk access) during analysis, drastically slowing analysis time. To perform analysis efficiently on such extensive data, we either need enough main memory for the data structures and algorithms, or we need to develop compressions that require much less space while still being able to answer queries efficiently. In this dissertation, we develop several compression techniques that succinctly represent these real-world networks while still allowing the network to be queried efficiently (e.g., checking whether an arc exists between two nodes). Furthermore, since many of these networks continue to grow over time, our compression techniques also support adding and removing nodes and edges directly on the compressed structure. We also provide a way to compress the data quickly without any intermediate structure, giving minimal memory overhead. We provide detailed analysis and prove that our compression is indeed succinct (i.e., achieves the information-theoretic lower bound). We also show empirically that our compression rates outperform or match existing compression algorithms on many benchmark datasets. We then extend our technique to time-evolving networks; that is, we store the entire state of the network at each time frame. Studying time-evolving networks allows us to find patterns over time that would not be visible in regular, static network analysis. A succinct representation is arguably even more important for time-evolving networks than for static graphs, because the extra dimension inflates the space requirements of basic data structures even further. Again, we achieve succinctness while also providing fast encoding, minimal memory overhead during encoding, fast queries, and fast, direct modification. We also compare against several benchmarks and show empirically that we achieve compression rates better than or equal to the best-performing benchmark for each dataset. Finally, we develop both static and time-evolving algorithms that run directly on our compressed structures. Using our static graph compression combined with our differential technique, we find that we can speed up matrix-vector multiplication by reusing previously computed products. We compare our results against a similar technique using the Webgraph Framework and see that not only are our base query speeds faster, but we also gain a more significant speed-up from reusing products. We then use our time-evolving compression to solve the earliest-arrival-paths problem and time-evolving transitive closure. We found that not only were we the first to run such algorithms directly on compressed data, but also that our technique was particularly efficient at doing so.
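
    The product-reuse idea mentioned above can be illustrated with a small, uncompressed sketch: assuming consecutive time frames differ only in a few arcs, the matrix-vector product for the new frame can be obtained from the previous product by touching only those differences. The Python below is a simplified illustration of that idea, not the dissertation's compressed structure or its exact algorithm.

        # Reusing a matrix-vector product across time frames (simplified illustration).
        def matvec(arcs, x, n):
            """y = A x for a graph given as (source, target, weight) arcs."""
            y = [0.0] * n
            for u, v, w in arcs:
                y[u] += w * x[v]
            return y

        def updated_product(y_prev, added, removed, x):
            """Turn y_prev = A_t x into A_{t+1} x using only the arc differences."""
            y = list(y_prev)
            for u, v, w in added:        # arcs present at t+1 but not at t
                y[u] += w * x[v]
            for u, v, w in removed:      # arcs present at t but not at t+1
                y[u] -= w * x[v]
            return y

        x = [1.0, 2.0, 3.0, 4.0]
        y_t = matvec([(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)], x, n=4)
        y_t1 = updated_product(y_t, added=[(3, 0, 1.0)], removed=[(1, 2, 1.0)], x=x)
        print(y_t, y_t1)                 # [2.0, 3.0, 4.0, 0.0] [2.0, 0.0, 4.0, 1.0]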