5 research outputs found
Enabling entity-based aggregators for web 2.0 data
Selecting and presenting content culled from multiple heterogeneous and physically distributed sources is a challenging task. The exponential growth of the web data in modern times has brought new requirements to such integration systems. Data is not any more produced by content providers alone, but also from regular users through the highly popular Web 2.0 social and semantic web applications. The plethora of the available web content, increased its demand by regular users who could not any more wait the development of advanced integration tools. They wanted to be able to build in a short time their own specialized integration applications. Aggregators came to the risk of these users. They allowed them not only to combine distributed content, but also to process it in ways that generate new services available for further consumption. To cope with the heterogeneous data, the Linked Data initiative aims at the creation and exploitation of correspondences across data values. In this work, although we share the Linked Data community vision, we advocate that for the modern web, linking at the data value level is not enough. Aggregators should base their integration tasks on the concept of an entity, i.e., identifying whether different pieces of information correspond to the same real world entity, such as an event or a person. We describe our theory, system, and experimental results that illustrate the approach’s effectiveness
Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence
The Web has enabled the availability of a huge amount of useful information,
but has also eased the ability to spread false information and rumors across
multiple sources, making it hard to distinguish between what is true and what
is not. Recent examples include the premature Steve Jobs obituary, the second
bankruptcy of United airlines, the creation of Black Holes by the operation of
the Large Hadron Collider, etc. Since it is important to permit the expression
of dissenting and conflicting opinions, it would be a fallacy to try to ensure
that the Web provides only consistent information. However, to help in
separating the wheat from the chaff, it is essential to be able to determine
dependence between sources. Given the huge number of data sources and the vast
volume of conflicting data available on the Web, doing so in a scalable manner
is extremely challenging and has not been addressed by existing work yet.
In this paper, we present a set of research problems and propose some
preliminary solutions on the issues involved in discovering dependence between
sources. We also discuss how this knowledge can benefit a variety of
technologies, such as data integration and Web 2.0, that help users manage and
access the totality of the available information from various sources.Comment: CIDR 200
Mining Butterflies in Streaming Graphs
This thesis introduces two main-memory systems sGrapp and sGradd for performing the fundamental analytic tasks of biclique counting and concept drift detection over a streaming graph. A data-driven heuristic is used to architect the systems. To this end, initially, the growth patterns of bipartite streaming graphs are mined and the emergence principles of streaming motifs are discovered. Next, the discovered principles are (a) explained by a graph generator called sGrow; and (b) utilized to establish the requirements for efficient, effective, explainable, and interpretable management and processing of streams. sGrow is used to benchmark stream analytics, particularly in the case of concept drift detection.
sGrow displays robust realization of streaming growth patterns independent of initial conditions, scale and temporal characteristics, and model configurations. Extensive evaluations confirm the simultaneous effectiveness and efficiency of sGrapp and sGradd. sGrapp achieves mean absolute percentage error up to 0.05/0.14 for the cumulative butterfly count in streaming graphs with uniform/non-uniform temporal distribution and a processing throughput of 1.5 million data records per second. The throughput and estimation error of sGrapp are 160x higher and 0.02x lower than baselines. sGradd demonstrates an improving performance over time, achieves zero false detection rates when there is not any drift and when drift is already detected, and detects sequential drifts in zero to a few seconds after their occurrence regardless of drift intervals