Search CORE

4,292 research outputs found

De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space

Author: Robertson David L.
Tapinos Avraam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/08/2017
Field of study

Sequencing technologies allow for an in-depth analysis of biological species but the size of the generated datasets introduce a number of analytical challenges. Recently, we demonstrated the application of numerical sequence representations and data transformations for the alignment of short reads to a reference genome. Here, we expand out approach for de novo assembly of short reads. Our results demonstrate that highly compressed data can encapsulate the signal suffi- ciently to accurately assemble reads to big contigs or complete genomes

Crossref

Enlighten

A 2D based Partition Strategy for Solving Ranking under Team Context (RTP)

Author: Feng Ling
Li Dongxu
Li Xiang
Lu Xiaolu
Publication venue
Publication date: 14/04/2014
Field of study

In this paper, we propose a 2D based partition method for solving the problem of Ranking under Team Context(RTC) on datasets without a priori. We first map the data into 2D space using its minimum and maximum value among all dimensions. Then we construct window queries with consideration of current team context. Besides, during the query mapping procedure, we can pre-prune some tuples which are not top ranked ones. This pre-classified step will defer processing those tuples and can save cost while providing solutions for the problem. Experiments show that our algorithm performs well especially on large datasets with correctness

arXiv.org e-Print Archive

CiteSeerX

HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

Author: Arora Akhil
Bhattacharya Arnab
Kumar Piyush
Sinha Sakshi
Publication venue: 'VLDB Endowment'
Publication date: 23/04/2018
Field of study

Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. A flavor of approximation is, therefore, necessary to practically solve the problem of nearest neighbor search. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to triangular inequality, we also use Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable.Comment: PVLDB 11(8):906-919, 201

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Noise-tolerant approximate blocking for dynamic real-time entity resolution

Author: Christen Peter
Gayler Ross
Liang Huizhi
Wang Yanzhe
Publication venue
Publication date: 01/01/2014
Field of study

Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain for example wrong pronunciation or spelling errors. Many real world applications require rapid responses for entity queries on dynamic datasets. This brings challenges to existing approaches which are mainly aimed at the batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real-time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach to index records based on their distance ranges using LSH and sorting trees within large sized hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach

Central Archive at the University of Reading

Crossref

Large Spatial Database Indexing with aX-tree

Author: Hadeel Hadeel Jazzaa
Joan Lu
Samson Grace
Showole Aminat A.
Usman Mistura M.
Publication venue
Publication date: 05/04/2018
Field of study

Spatial databases are optimized for the management of data stored based on their geometric space. Researchers through high degree scalability have proposed several spatial indexing structures towards this effect. Among these indexing structures is the X-tree. The existing X-trees and its variants are designed for dynamic environment, with the capability for handling insertions and deletions. Notwithstanding, the X-tree degrades on retrieval performance as dimensionality increases and brings about poor worst-case performance than sequential scan. We propose a new X-tree packing techniques for static spatial databases which performs better in space utilization through cautious packing. This new improved structure yields two basic advantage: It reduces the space overhead of the index and produces a better response time, because the aX-tree has a higher fan-out and so the tree always ends up shorter. New model for super-node construction and effective method for optimal packing using an improved str bulk-loading technique is proposed. The study reveals that proposed system performs better than many existing spatial indexing structure

University of Huddersfield Repository

Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy

Author: Gieseke Fabian
Igel Christian
Kremer Jan
Pedersen Kim Steenstrup
Stensbo-Smidt Kristoffer
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

Astrophysics and cosmology are rich with data. The advent of wide-area digital cameras on large aperture telescopes has led to ever more ambitious surveys of the sky. Data volumes of entire surveys a decade ago can now be acquired in a single night and real-time analysis is often desired. Thus, modern astronomy requires big data know-how, in particular it demands highly efficient machine learning and image analysis algorithms. But scalability is not the only challenge: Astronomy applications touch several current machine learning research questions, such as learning from biased data and dealing with label and measurement noise. We argue that this makes astronomy a great domain for computer science research, as it pushes the boundaries of data analysis. In the following, we will present this exciting application area for data scientists. We will focus on exemplary results, discuss main challenges, and highlight some recent methodological advancements in machine learning and image analysis triggered by astronomical applications

arXiv.org e-Print Archive

Copenhagen University Research Information System