MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities
Entity Resolution (ER) aims to identify different descriptions in various
Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the
Variety, Volume and Veracity of entity descriptions published in the Web of
Data. To address them, we propose the MinoanER framework that simultaneously
fulfills full automation, support of highly heterogeneous entities, and massive
parallelization of the ER process. MinoanER leverages a token-based similarity
of entities to define a new metric that derives the similarity of neighboring
entities from the most important relations, as they are indicated only by
statistics. A composite blocking method is employed to capture different
sources of matching evidence from the content, neighbors, or names of entities.
The search space of candidate pairs for comparison is compactly abstracted by a
novel disjunctive blocking graph and processed by a non-iterative, massively
parallel matching algorithm that consists of four generic, schema-agnostic
matching rules that are quite robust with respect to their internal
configuration. We demonstrate that the effectiveness of MinoanER is comparable
to existing ER tools over real KBs exhibiting low Variety, but it outperforms
them significantly when matching KBs with high Variety.

Comment: Presented at EDBT 2019
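The token-based similarity that MinoanER builds on can be illustrated with a minimal sketch. The function names and the plain Jaccard measure below are illustrative assumptions, not the framework's exact metric:

```python
# Hedged sketch: token-based similarity of two schema-agnostic entity
# descriptions, treating every attribute value as a bag of word tokens.
import re

def tokenize(description: dict) -> set:
    """Collect lower-cased word tokens from all attribute values."""
    tokens = set()
    for value in description.values():
        tokens.update(re.findall(r"\w+", str(value).lower()))
    return tokens

def jaccard(e1: dict, e2: dict) -> float:
    """Jaccard similarity of the two descriptions' token sets."""
    t1, t2 = tokenize(e1), tokenize(e2)
    if not t1 and not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

# Two descriptions of the same city, with different schemas:
a = {"name": "Heraklion", "country": "Greece", "island": "Crete"}
b = {"label": "Iraklio, Crete, Greece"}
print(round(jaccard(a, b), 2))  # → 0.5
```

Because only token overlap is used, the comparison needs no schema alignment, which is the property that makes the approach robust to high Variety.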
Optimal joint path computation and rate allocation for real-time traffic
Computing network paths under worst-case delay constraints has been the subject of abundant literature over the past two decades. Assuming Weighted Fair Queueing scheduling at the nodes, this translates to computing paths and reserving rates at each link. The problem is NP-hard in general, even for a single path; hence polynomial-time heuristics have been proposed that either assume equal rates at each node, or compute the path heuristically and then allocate the rates optimally on the given path. In this paper we show that the above heuristics, albeit finding optimal solutions quite often, can fail to find feasible paths even at very low loads, and that this can be avoided by solving the problem, i.e., path computation and rate allocation, jointly at optimality. This is possible by modeling the problem as a mixed-integer second-order cone program and solving it optimally in split-second times for relatively large networks on commodity hardware; this approach can also easily be turned into a heuristic one, trading a negligible increase in blocking probability for an order-of-magnitude reduction in computation time. Extensive simulations show that these methods are feasible in today's ISP networks and that they significantly outperform existing schemes in terms of blocking probability.
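As background for the delay constraint involved, here is a minimal sketch of the classic Parekh–Gallager end-to-end delay bound for a leaky-bucket-constrained flow crossing WFQ schedulers. This is standard network-calculus material that such formulations constrain, not the paper's MISOCP model, and all names and numbers are illustrative:

```python
# Hedged sketch: worst-case end-to-end delay of a flow with burst size
# sigma (bits) and reserved rate R (bit/s) across n WFQ-scheduled links
# with maximum packet size L (bits) and capacities C_j (bit/s):
#   D <= sigma/R + (n-1)*L/R + sum_j L/C_j
def wfq_delay_bound(sigma, R, L, link_capacities):
    n = len(link_capacities)
    return sigma / R + (n - 1) * L / R + sum(L / C for C in link_capacities)

# A 3-hop path: 16 kbit burst, 1 Mbit/s reserved rate, 12 kbit packets,
# 10 Mbit/s links.
bound = wfq_delay_bound(sigma=16e3, R=1e6, L=12e3,
                        link_capacities=[10e6, 10e6, 10e6])
print(f"{bound * 1e3:.1f} ms")  # prints "43.6 ms"
```

The bound decreases as the reserved rate R grows, which is why path choice and rate allocation interact: reserving more rate on one link leaves less for other flows, affecting blocking probability.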
Populating a Linked Data Entity Name System
Resource Description Framework (RDF) is a graph-based data model used to publish data as a Web of Linked Data. RDF is an emergent foundation for large-scale data integration, the problem of providing a unified view over multiple data sources. An Entity Name System (ENS) is a thesaurus for entities, and is a crucial component in a data integration architecture. Populating a Linked Data ENS is equivalent to solving an Artificial Intelligence problem called instance matching, which concerns identifying pairs of entities referring to the same underlying entity. This dissertation presents an instance matcher with four properties, namely automation, heterogeneity, scalability and domain independence. Automation is addressed by employing inexpensive but well-performing heuristics to automatically generate a training set, which is employed by other machine learning algorithms in the pipeline. Data-driven alignment algorithms are adapted to deal with structural heterogeneity in RDF graphs. Domain independence is established by actively avoiding prior assumptions about input domains, and through evaluations on ten RDF test cases. The full system is scaled by implementing it on cloud infrastructure using MapReduce algorithms.
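The automation idea, a cheap heuristic that bootstraps a training set for the learned matcher, can be sketched as follows. The near-exact-label rule, the threshold, and all names are illustrative assumptions, not the dissertation's actual heuristics:

```python
# Hedged sketch: weak supervision for instance matching. Near-identical
# labels yield confident positives; random non-matching pairs serve as
# (likely) negatives. A downstream learned matcher trains on the result.
import random
import difflib

def heuristic_training_set(entities_a, entities_b, n_neg=2, threshold=0.9):
    pairs = []
    for a in entities_a:
        for b in entities_b:
            sim = difflib.SequenceMatcher(None, a["label"].lower(),
                                          b["label"].lower()).ratio()
            if sim >= threshold:
                pairs.append((a["id"], b["id"], 1))  # confident positive
    positives = {(aid, bid) for aid, bid, _ in pairs}
    # Randomly sampled pairs are overwhelmingly non-matches at scale.
    for _ in range(n_neg * len(pairs)):
        a, b = random.choice(entities_a), random.choice(entities_b)
        if (a["id"], b["id"]) not in positives:
            pairs.append((a["id"], b["id"], 0))
    return pairs

A = [{"id": "a1", "label": "Berlin"}]
B = [{"id": "b1", "label": "berlin"}, {"id": "b2", "label": "Munich"}]
print(heuristic_training_set(A, B)[0])  # → ('a1', 'b1', 1)
```

The point of such a heuristic is that it trades a little label noise for zero manual annotation, which is what makes the pipeline fully automatic.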
Spatial variability of soil structure and its impact on transport processes and some associated land qualities
This thesis treats the impact of soil spatial variability on the spatial variability of simulated land qualities. The sequence of procedures used to determine this impact is described in chapters 2 and 3; the subchapters correspond to seven manuscripts that have either appeared in or been submitted to peer-reviewed journals.

In chapter 2, attention is paid to methods for inventorying the spatial variability of soil characteristics related to soil structure. A method was developed to construct confidence intervals for point-count results in the case of spatial dependency of the point observations on a soil thin section. It was concluded that confidence intervals obtained following the traditional method, which assumes all observations are independent, will be much narrower than those in which the spatial dependency structure is taken into account. Two other papers in chapter 2 describe a method to translate soil profile descriptions into soil physical input data for computer models that simulate solute flow. The concept of functional layers is introduced: a functional layer is a combination of soil layers showing comparable soil physical behaviour related to water flow. The functional layer approach was tested and accepted for examples of disturbed and thinly stratified soils by calculating functional properties of the layer under defined hydrological conditions. Once functional layers are established, mapping the thickness, starting depth and type of functional layers provides spatial information about soil physical characteristics. In one paper in chapter 2, the number of necessary observations in this mapping procedure is optimized by applying geostatistical methods and a sequential sampling test.

In chapter 3, the impact of variability of soil structure on the variability of crop yields and nitrate leaching is investigated. One paper describes a field-scale empirical study in which barley grain yield variability is correlated with variability of soil characteristics and simulated transpiration deficits. Simulation model inputs were obtained using the functional layer approach described in chapter 2. Regression functions based on simulated transpiration deficits alone could explain 43% of the variance in yields, which suggested that variability of transpiration may be an important factor causing yield variability. This hypothesis was tested in a subsequent paper, in which remote sensing estimates of the leaf area index were used to estimate the potential transpiration with high spatial accuracy. Incorporating space and time series of the leaf area index into a crop growth model resulted in a prediction of yield variability that could explain 39% of the measured variability. Variability of plant-available water, expressed by the actual transpiration, is an important factor causing yield variability. Two papers in chapter 3 describe how a combined solute flow and crop growth model was used to evaluate the spatially varying effect of fertilizing scenarios. The spatial interpolation method Disjunctive Kriging was used to translate spatial variability of simulated nitrate leaching into maps of the probability that a threshold leaching concentration is exceeded. It was also investigated whether the number of simulations could be minimized using Disjunctive CoKriging and available spatial information. It was concluded that different soil units within one agricultural field showed different leaching and crop yield responses to identical fertilizer treatments, and that yield variability will increase when fertilizer levels approach the level for maximal production.
Named Entity Resolution in Personal Knowledge Graphs
Entity Resolution (ER) is the problem of determining when two entities refer
to the same underlying entity. The problem has been studied for over 50 years,
and most recently, has taken on new importance in an era of large,
heterogeneous 'knowledge graphs' published on the Web and used widely in
domains as wide-ranging as social media, e-commerce and search. This chapter
will discuss the specific problem of named ER in the context of personal
knowledge graphs (PKGs). We begin with a formal definition of the problem, and
the components necessary for doing high-quality and efficient ER. We also
discuss some challenges that are expected to arise for Web-scale data. Next, we
provide a brief literature review, with a special focus on how existing
techniques can potentially apply to PKGs. We conclude the chapter by covering
some applications, as well as promising directions for future research.

Comment: To appear as a book chapter by the same name in an upcoming (Oct.
2023) book `Personal Knowledge Graphs (PKGs): Methodology, tools and
applications' edited by Tiwari et al.
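A minimal blocking-plus-matching pass, the two components any efficient named ER system needs, can be sketched as follows. The key function and all names are illustrative assumptions; real systems layer phonetic keys, attribute evidence and learned scoring on top:

```python
# Hedged sketch: named ER over person mentions in a personal knowledge
# graph. Blocking restricts comparisons to mentions sharing a cheap key;
# here the key is the sorted, lower-cased set of name tokens, so token
# reordering ("John A. Smith" vs "smith john a") lands in one block.
from collections import defaultdict

def name_key(name: str) -> str:
    """Blocking key: sorted lower-cased name tokens, punctuation stripped."""
    return " ".join(sorted(name.lower().replace(".", "").split()))

def resolve(mentions):
    """Group mention ids whose names share a blocking key."""
    blocks = defaultdict(list)
    for mid, name in mentions:
        blocks[name_key(name)].append(mid)
    return [ids for ids in blocks.values() if len(ids) > 1]

mentions = [("m1", "John A. Smith"), ("m2", "smith john a"),
            ("m3", "Jane Doe")]
print(resolve(mentions))  # → [['m1', 'm2']]
```

Blocking is what keeps ER sub-quadratic: only mentions inside a block are ever compared, which matters even at PKG scale and is essential at Web scale.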
Working notes of the 1991 spring symposium on constraint-based reasoning
POLIS: a probabilistic summarisation logic for structured documents
As the availability of structured documents, formatted in markup languages such as SGML, RDF,
or XML, increases, retrieval systems increasingly focus on the retrieval of document-elements,
rather than entire documents. Additionally, abstraction layers in the form of formalised retrieval
logics have allowed developers to include search facilities in numerous applications without
needing detailed knowledge of retrieval models.
Although automatic document summarisation has been recognised as a useful tool for reducing
the workload of information system users, very few such abstraction layers have been developed
for the task of automatic document summarisation. This thesis describes the development
of an abstraction logic for summarisation, called POLIS, which provides users (such as developers
or knowledge engineers) with a high-level access to summarisation facilities. Furthermore,
POLIS allows users to exploit the hierarchical information provided by structured documents.
The development of POLIS is carried out in a step-by-step way. We start by defining a series
of probabilistic summarisation models, which provide weights to document-elements at a user
selected level. These summarisation models are those accessible through POLIS. The formal
definition of POLIS is performed in three steps. We start by providing a syntax for POLIS,
through which users/knowledge engineers interact with the logic. This is followed by a definition
of the logic's semantics. Finally, we provide details of an implementation of POLIS.
The final chapters of this dissertation are concerned with the evaluation of POLIS, which is
conducted in two stages. Firstly, we evaluate the performance of the summarisation models by
applying POLIS to two test collections, the DUC AQUAINT corpus, and the INEX IEEE corpus.
This is followed by application scenarios for POLIS, in which we discuss how POLIS can be used in specific IR tasks.
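The core idea of weighting document elements at a user-selected structural level can be sketched as follows. The simple term-frequency weight and all names are illustrative assumptions in the spirit of POLIS, not the thesis's actual probabilistic models:

```python
# Hedged sketch: score the elements of a structured document (e.g. the
# sections of an XML file) by the normalised frequency of query terms,
# so a summariser can keep the highest-weighted elements at that level.
import re
from collections import Counter

def element_weights(elements, query_terms):
    """Map element id -> query-term frequency normalised by element length."""
    weights = {}
    for eid, text in elements.items():
        tokens = re.findall(r"\w+", text.lower())
        tf = Counter(tokens)
        weights[eid] = sum(tf[t] for t in query_terms) / max(len(tokens), 1)
    return weights

doc = {"sec1": "Retrieval of structured documents with XML elements.",
       "sec2": "Summarisation reduces user workload in retrieval systems."}
w = element_weights(doc, {"retrieval", "summarisation"})
print(max(w, key=w.get))  # → sec2
```

Because the weights attach to elements rather than whole documents, the same scoring can be reapplied at any level of the markup hierarchy, which is the hierarchical information an abstraction logic over structured documents exposes.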