2,992 research outputs found
Distributed enterprise search using software agents
In this paper we introduce a distributed information retrieval
system using agent-based technology. In this multiagent
system, each agent has its own specific task and can
be used to handle a specific document repository. The system
is designed to automatically comply with access restriction
rules that are normally enforced in companies. It is
used in the administration offices of the German capital city
Berlin where it serves as a testbed for further research on
aggregated search in an enterprise environment with roughly
50,000 employees
Hypermedia-based discovery for source selection using low-cost linked data interfaces
Evaluating federated Linked Data queries requires consulting multiple sources on the Web. Before a client can execute queries, it must discover data sources, and determine which ones are relevant. Federated query execution research focuses on the actual execution, while data source discovery is often marginally discussed-even though it has a strong impact on selecting sources that contribute to the query results. Therefore, the authors introduce a discovery approach for Linked Data interfaces based on hypermedia links and controls, and apply it to federated query execution with Triple Pattern Fragments. In addition, the authors identify quantitative metrics to evaluate this discovery approach. This article describes generic evaluation measures and results for their concrete approach. With low-cost data summaries as seed, interfaces to eight large real-world datasets can discover each other within 7 minutes. Hypermedia-based client-side querying shows a promising gain of up to 50% in execution time, but demands algorithms that visit a higher number of interfaces to improve result completeness
Fame for sale: efficient detection of fake Twitter followers
are those Twitter accounts specifically created to
inflate the number of followers of a target account. Fake followers are
dangerous for the social platform and beyond, since they may alter concepts
like popularity and influence in the Twittersphere - hence impacting on
economy, politics, and society. In this paper, we contribute along different
dimensions. First, we review some of the most relevant existing features and
rules (proposed by Academia and Media) for anomalous Twitter accounts
detection. Second, we create a baseline dataset of verified human and fake
follower accounts. Such baseline dataset is publicly available to the
scientific community. Then, we exploit the baseline dataset to train a set of
machine-learning classifiers built over the reviewed rules and features. Our
results show that most of the rules proposed by Media provide unsatisfactory
performance in revealing fake followers, while features proposed in the past by
Academia for spam detection provide good results. Building on the most
promising features, we revise the classifiers both in terms of reduction of
overfitting and cost for gathering the data needed to compute the features. The
final result is a novel classifier, general enough to thwart
overfitting, lightweight thanks to the usage of the less costly features, and
still able to correctly classify more than 95% of the accounts of the original
training set. We ultimately perform an information fusion-based sensitivity
analysis, to assess the global sensitivity of each of the features employed by
the classifier. The findings reported in this paper, other than being supported
by a thorough experimental methodology and interesting on their own, also pave
the way for further investigation on the novel issue of fake Twitter followers
From RESTful Services to RDF: Connecting the Web and the Semantic Web
RESTful services on the Web expose information through retrievable resource
representations that represent self-describing descriptions of resources, and
through the way how these resources are interlinked through the hyperlinks that
can be found in those representations. This basic design of RESTful services
means that for extracting the most useful information from a service, it is
necessary to understand a service's representations, which means both the
semantics in terms of describing a resource, and also its semantics in terms of
describing its linkage with other resources. Based on the Resource Linking
Language (ReLL), this paper describes a framework for how RESTful services can
be described, and how these descriptions can then be used to harvest
information from these services. Building on this framework, a layered model of
RESTful service semantics allows to represent a service's information in
RDF/OWL. Because REST is based on the linkage between resources, the same model
can be used for aggregating and interlinking multiple services for extracting
RDF data from sets of RESTful services
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
Web archives are a valuable resource for researchers of various disciplines.
However, to use them as a scholarly source, researchers require a tool that
provides efficient access to Web archive data for extraction and derivation of
smaller datasets. Besides efficient access we identify five other objectives
based on practical researcher needs such as ease of use, extensibility and
reusability.
Towards these objectives we propose ArchiveSpark, a framework for efficient,
distributed Web archive processing that builds a research corpus by working on
existing and standardized data formats commonly held by Web archiving
institutions. Performance optimizations in ArchiveSpark, facilitated by the use
of a widely available metadata index, result in significant speed-ups of data
processing. Our benchmarks show that ArchiveSpark is faster than alternative
approaches without depending on any additional data stores while improving
usability by seamlessly integrating queries and derivations with external
tools.Comment: JCDL 2016, Newark, NJ, US
- …