Extreme Scale De Novo Metagenome Assembly
Metagenome assembly is the process of transforming a set of short,
overlapping, and potentially erroneous DNA segments from environmental samples
into an accurate representation of the underlying microbiomes' genomes.
State-of-the-art tools require big shared memory machines and cannot handle
contemporary metagenome datasets that exceed Terabytes in size. In this paper,
we introduce the MetaHipMer pipeline, a high-quality and high-performance
metagenome assembler that employs an iterative de Bruijn graph approach.
MetaHipMer leverages a specialized scaffolding algorithm that produces long
scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is
end-to-end parallelized using the Unified Parallel C language and therefore can
run seamlessly on shared and distributed-memory systems. Experimental results
show that MetaHipMer matches or outperforms the state-of-the-art tools in terms
of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and
is able to assemble previously intractable grand challenge metagenomes. We
demonstrate the unprecedented capability of MetaHipMer by computing the first
full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion
reads - size 2.6 TBytes. Comment: Accepted to SC1
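As a rough illustration of the de Bruijn graph idea named in the abstract above, the sketch below builds a graph from short reads for a single k-mer length. It is a toy, single-process example and not MetaHipMer's distributed implementation; the function name `build_de_bruijn` and the sample reads are assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, not from the paper).

    Nodes are (k-1)-mers; each k-mer found in a read adds an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping reads; the shared k-mer "CGT" yields a single edge.
    for node, successors in sorted(build_de_bruijn(["ACGT", "CGTA"], 3).items()):
        print(node, "->", sorted(successors))
```

An iterative assembler would repeat a construction like this for several values of k; how MetaHipMer combines the iterations is not described in the abstract, so that step is left out here.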
Correcting Knowledge Base Assertions
The usefulness and usability of knowledge bases (KBs) are often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB.
Improving Link Reliability through Network Coding in Cooperative Cellular Networks
The paper proposes an XOR-based network coded cooperation protocol for the uplink transmission of relay-assisted cellular networks and an algorithm for selection and assignment of the relay nodes. The performance of the cooperation protocol is expressed in terms of the network decoder outage probability and the Block Error Rate of the cooperating users. These performance indicators are analyzed theoretically and by computer simulations. The relay node assignment is based on the optimization, according to several criteria, of the graph that describes the cooperation cluster formed after an initial selection of the relay nodes. The graph optimization is performed using Genetic Algorithms adapted to the topology of the cooperation cluster and the optimization criteria considered.
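The core idea behind XOR-based network coded cooperation can be sketched as follows: a relay transmits the bitwise XOR of two users' packets, and the destination recovers a packet it lost on the direct link from the other user's packet and the relay's packet. This is a minimal sketch under the assumption of equal-length packets and an error-free relay link; the function names are hypothetical, and the paper's actual protocol and relay assignment are not reproduced.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def relay_encode(packet_u1: bytes, packet_u2: bytes) -> bytes:
    """Relay combines the two users' packets into one coded packet."""
    return xor_bytes(packet_u1, packet_u2)


def recover_missing(received_packet: bytes, coded_packet: bytes) -> bytes:
    """Destination recovers the lost packet from the one it received
    directly and the relay's coded packet (XOR is its own inverse)."""
    return xor_bytes(received_packet, coded_packet)


if __name__ == "__main__":
    p1, p2 = b"\x01\x02\x03", b"\x10\x20\x30"
    coded = relay_encode(p1, p2)
    # Suppose p2 was lost on the direct link but p1 and the coded packet arrived.
    print(recover_missing(p1, coded) == p2)
```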
Tiny Codes for Guaranteeable Delay
Future 5G systems will need to support ultra-reliable low-latency
communications scenarios. From a latency-reliability viewpoint, it is
inefficient to rely on average utility-based system design. Therefore, we
introduce the notion of guaranteeable delay, which is the average delay plus
three standard deviations of the mean. We investigate the trade-off between
guaranteeable delay and throughput for point-to-point wireless erasure links
with unreliable and delayed feedback, by bringing together signal flow
techniques to the area of coding. We use tiny codes, i.e. sliding window by
coding with just 2 packets, and design three variations of selective-repeat ARQ
protocols, by building on the baseline scheme, i.e. uncoded ARQ, developed by
Ausavapattanakun and Nosratinia: (i) Hybrid ARQ with soft combining at the
receiver; (ii) cumulative feedback-based ARQ without rate adaptation; and (iii)
Coded ARQ with rate adaptation based on the cumulative feedback. Contrasting
the performance of these protocols with uncoded ARQ, we demonstrate that HARQ
performs only slightly better, cumulative feedback-based ARQ does not provide
significant throughput while it has better average delay, and Coded ARQ can
provide gains up to about 40% in terms of throughput. Coded ARQ also provides
delay guarantees, and is robust to various challenges such as imperfect and
delayed feedback, burst erasures, and round-trip time fluctuations. This
feature may be preferable for meeting the strict end-to-end latency and
reliability requirements of future use cases of ultra-reliable low-latency
communications in 5G, such as mission-critical communications and industrial
control for critical control messaging. Comment: to appear in IEEE JSAC Special Issue on URLLC in Wireless Network
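The guaranteeable delay defined in the abstract above (average delay plus three standard deviations) can be computed from measured per-packet delays as in the sketch below. The function name and sample values are illustrative assumptions, and the population standard deviation of the delay samples is used as one plausible reading of the definition.

```python
import statistics


def guaranteeable_delay(delays):
    """Mean delay plus three standard deviations.

    Assumption: the population standard deviation of the samples is used;
    the abstract does not specify the estimator.
    """
    mean = statistics.fmean(delays)
    std = statistics.pstdev(delays)
    return mean + 3 * std


if __name__ == "__main__":
    sample_delays_ms = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up measurements
    print(guaranteeable_delay(sample_delays_ms))
```

Unlike the average alone, this metric penalizes a protocol whose delay varies widely between packets, which matches the latency-reliability viewpoint described above.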
Fault-tolerance techniques for hybrid CMOS/nanoarchitecture
The authors propose two fault-tolerance techniques for hybrid CMOS/nanoarchitecture implementing logic functions as look-up tables. The authors compare the efficiency of the proposed techniques with recently reported methods that use single coding schemes in tolerating high fault rates in nanoscale fabrics. Both proposed techniques are based on error-correcting codes to tackle different fault rates. In the first technique, the authors implement a combined two-dimensional coding scheme using Hamming and Bose-Chaudhuri-Hocquenghem (BCH) codes to address fault rates greater than 5%. In the second technique, Hamming coding is complemented with a bad line exclusion technique to tolerate fault rates higher than the first proposed technique (up to 20%). The authors have also estimated the improvement that can be achieved in the circuit reliability in the presence of Don't Care Conditions. The area, latency and energy costs of the proposed techniques were also estimated in the CMOS domain.
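To make the role of Hamming coding in the abstract above concrete, the sketch below encodes 4 data bits with a Hamming(7,4) code and corrects a single flipped bit, as might happen to one stored look-up table entry. The code parameters and function names are assumptions chosen for illustration; the abstract does not state which Hamming code or two-dimensional layout the authors use.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as
    [p1, p2, d1, p4, d2, d3, d4] (codeword positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the faulty bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # inject a single-bit fault
    print(hamming74_decode(word))
```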
Automatic refinement of large-scale cross-domain knowledge graphs
Knowledge graphs are a way to represent complex structured and unstructured information
integrated into an ontology, with which one can reason about the existing
information to deduce new information or highlight inconsistencies. Knowledge
graphs are divided into the terminology box (TBox), also known as ontology, and
the assertions box (ABox). The former consists of a set of schema axioms defining
classes and properties which describe the data domain, whereas the ABox consists
of a set of facts describing instances in terms of the TBox vocabulary.
In recent years, there have been several initiatives for creating large-scale
cross-domain knowledge graphs, both free and commercial, with DBpedia, YAGO,
and Wikidata being amongst the most successful free datasets. Those graphs are
often constructed with the extraction of information from semi-structured knowledge,
such as Wikipedia, or unstructured text from the web using NLP methods. It
is unlikely, in particular when heuristic methods are applied and unreliable sources
are used, that the knowledge graph is fully correct or complete. There is a tradeoff
between completeness and correctness, which is addressed differently in each
knowledge graph’s construction approach.
There is a wide variety of applications for knowledge graphs, e.g. semantic
search and discovery, question answering, recommender systems, expert systems
and personal assistants. The quality of a knowledge graph is crucial for its applications.
In order to further increase the quality of such large-scale knowledge graphs,
various automatic refinement methods have been proposed. Those methods try to
infer and add missing knowledge to the graph, or detect erroneous pieces of information.
In this thesis, we investigate the problem of automatic knowledge graph
refinement and propose methods that address the problem from two directions: automatic
refinement of the TBox and of the ABox.
In Part I we address the ABox refinement problem. We propose a method for
predicting missing type assertions using hierarchical multilabel classifiers and
ingoing/outgoing links as features. We also present an approach to detection of relation
assertion errors which exploits type and path patterns in the graph. Moreover,
we propose an approach to correction of relation errors originating from confusions
between entities. Also in the ABox refinement direction, we propose a knowledge
graph model and process for synthesizing knowledge graphs for benchmarking
ABox completion methods.
In Part II we address the TBox refinement problem. We propose methods for inducing flexible relation constraints from the ABox, which are expressed using
SHACL. We introduce an ILP refinement step which exploits correlations between
numerical attributes and relations in order to efficiently learn Horn rules with
numerical attributes. Finally, we investigate the introduction of lexical information
from textual corpora into the ILP algorithm in order to improve the quality of induced
class expressions.