43,113 research outputs found
An Expressive Language and Efficient Execution System for Software Agents
Software agents can be used to automate many of the tedious, time-consuming
information processing tasks that humans currently have to complete manually.
However, to do so, agent plans must be capable of representing the myriad of
actions and control flows required to perform those tasks. In addition, since
these tasks can require integrating multiple sources of remote information ?
typically, a slow, I/O-bound process ? it is desirable to make execution as
efficient as possible. To address both of these needs, we present a flexible
software agent plan language and a highly parallel execution system that enable
the efficient execution of expressive agent plans. The plan language allows
complex tasks to be more easily expressed by providing a variety of operators
for flexibly processing the data as well as supporting subplans (for
modularity) and recursion (for indeterminate looping). The executor is based on
a streaming dataflow model of execution to maximize the amount of operator and
data parallelism possible at runtime. We have implemented both the language and
executor in a system called THESEUS. Our results from testing THESEUS show that
streaming dataflow execution can yield significant speedups over both
traditional serial (von Neumann) as well as non-streaming dataflow-style
execution that existing software and robot agent execution systems currently
support. In addition, we show how plans written in the language we present can
represent certain types of subtasks that cannot be accomplished using the
languages supported by network query engines. Finally, we demonstrate that the
increased expressivity of our plan language does not hamper performance;
specifically, we show how data can be integrated from multiple remote sources
just as efficiently using our architecture as is possible with a
state-of-the-art streaming-dataflow network query engine
Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have `long tails'
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18\% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment.Comment: 10 pages, ACM WWW 201
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite is a foundational software framework that provides query
processing, optimization, and query language support to many popular
open-source data processing systems such as Apache Hive, Apache Storm, Apache
Flink, Druid, and MapD. Calcite's architecture consists of a modular and
extensible query optimizer with hundreds of built-in optimization rules, a
query processor capable of processing a variety of query languages, an adapter
architecture designed for extensibility, and support for heterogeneous data
models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an
attractive choice for adoption in big-data frameworks. It is an active project
that continues to introduce support for the new types of data sources, query
languages, and approaches to query processing and optimization.Comment: SIGMOD'1
Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network
Language in social media is extremely dynamic: new words emerge, trend and
disappear, while the meaning of existing words can fluctuate over time. Such
dynamics are especially notable during a period of crisis. This work addresses
several important tasks of measuring, visualizing and predicting short term
text representation shift, i.e. the change in a word's contextual semantics,
and contrasting such shift with surface level word dynamics, or concept drift,
observed in social media streams. Unlike previous approaches on learning word
representations from text, we study the relationship between short-term concept
drift and representation shift on a large social media corpus - VKontakte posts
in Russian collected during the Russia-Ukraine crisis in 2014-2015. Our novel
contributions include quantitative and qualitative approaches to (1) measure
short-term representation shift and contrast it with surface level concept
drift; (2) build predictive models to forecast short-term shifts in meaning
from previous meaning as well as from concept drift; and (3) visualize
short-term representation shift for example keywords to demonstrate the
practical use of our approach to discover and track meaning of newly emerging
terms in social media. We show that short-term representation shift can be
accurately predicted up to several weeks in advance. Our unique approach to
modeling and visualizing word representation shifts in social media can be used
to explore and characterize specific aspects of the streaming corpus during
crisis events and potentially improve other downstream classification tasks
including real-time event detection
- …