Automated schema matching techniques: an exploratory study
Manual schema matching is a problem for many database applications that use multiple data sources, including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach using an external dictionary facilitates automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques with the proposed approach, called SemMA. The proposed approach and results are compared with two existing semi-automated schema-matching approaches, and suggestions for future research are made.
Feedback Generation for Performance Problems in Introductory Programming Assignments
Providing feedback on programming assignments manually is a tedious,
error-prone, and time-consuming task. In this paper, we motivate and address the
problem of generating feedback on performance aspects in introductory
programming assignments. We studied a large number of functionally correct
student solutions to introductory programming assignments and observed: (1)
There are different algorithmic strategies, with varying levels of efficiency,
for solving a given problem. These different strategies merit different
feedback. (2) The same algorithmic strategy can be implemented in countless
different ways, which are not relevant for reporting feedback on the student
program.
We propose a light-weight programming language extension that allows a
teacher to define an algorithmic strategy by specifying certain key values that
should occur during the execution of an implementation. We describe a dynamic
analysis based approach to test whether a student's program matches a teacher's
specification. Our experimental results illustrate the effectiveness of both
our specification language and our dynamic analysis. On one of our benchmarks
consisting of 2316 functionally correct implementations to 3 programming
problems, we identified 16 strategies that we were able to describe using our
specification language (in 95 minutes after inspecting 66, i.e., around 3%,
implementations). Our dynamic analysis correctly matched each implementation
with its corresponding specification, thereby automatically producing the
intended feedback. Comment: Tech report/extended version of FSE 2014 paper
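The idea of recognising a strategy by the key values it produces at run time can be illustrated with a small sketch. This is not the paper's specification language or its dynamic analysis. It assumes, for illustration only, that a specification is an ordered list of values and that a program matches when those values occur in order in its recorded trace. The names `matches_strategy`, `sum_by_loop`, and `sum_by_formula` are hypothetical.

```python
from typing import Iterable, List, Sequence


def matches_strategy(trace: Iterable[int], key_values: Sequence[int]) -> bool:
    """Return True if key_values occur, in order, somewhere in trace.

    Assumption: a 'strategy' is an ordered list of values; other values
    may be interleaved in the trace without affecting the match.
    """
    it = iter(trace)
    # Each key value must be found in the part of the trace not yet consumed.
    return all(any(v == k for v in it) for k in key_values)


def sum_by_loop(n: int, trace: List[int]) -> int:
    """Iterative strategy: records every partial sum."""
    total = 0
    for i in range(1, n + 1):
        total += i
        trace.append(total)
    return total


def sum_by_formula(n: int, trace: List[int]) -> int:
    """Closed-form strategy: records only the final result."""
    total = n * (n + 1) // 2
    trace.append(total)
    return total


# Hypothetical teacher specification for the iterative strategy with n = 4.
loop_spec = [1, 3, 6, 10]

loop_trace: List[int] = []
formula_trace: List[int] = []
sum_by_loop(4, loop_trace)
sum_by_formula(4, formula_trace)

loop_matches = matches_strategy(loop_trace, loop_spec)
formula_matches = matches_strategy(formula_trace, loop_spec)
```

Both functions are functionally correct, yet only the loop-based one produces the partial sums named in the specification, so the two would receive different feedback.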
Polygraph: Automatically generating signatures for polymorphic worms
It is widely believed that content-signature-based intrusion detection systems (IDSes) are easily evaded by polymorphic worms, which vary their payload on every infection attempt. In this paper, we present Polygraph, a signature generation system that successfully produces signatures that match polymorphic worms. Polygraph generates signatures that consist of multiple disjoint content sub-strings. In doing so, Polygraph leverages our insight that for a real-world exploit to function properly, multiple invariant substrings must often be present in all variants of a payload; these substrings typically correspond to protocol framing, return addresses, and in some cases, poorly obfuscated code. We contribute a definition of the polymorphic signature generation problem; propose classes of signature suited for matching polymorphic worm payloads; and present algorithms for automatic generation of signatures in these classes. Our evaluation of these algorithms on a range of polymorphic worms demonstrates that Polygraph produces signatures for polymorphic worms that exhibit low false negatives and false positives. © 2005 IEEE
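A signature made of several disjoint substrings can be matched in more than one way. The sketch below shows two plausible readings, assumed here for illustration and not taken from the abstract: all substrings present in any order, and all substrings present in a fixed order without overlap. It covers matching only, not signature generation, and the sample payload and tokens are invented.

```python
from typing import Sequence


def matches_conjunction(payload: bytes, tokens: Sequence[bytes]) -> bool:
    """True if every token occurs somewhere in the payload, in any order."""
    return all(t in payload for t in tokens)


def matches_ordered(payload: bytes, tokens: Sequence[bytes]) -> bool:
    """True if the tokens occur in the given order, without overlapping."""
    pos = 0
    for t in tokens:
        idx = payload.find(t, pos)
        if idx < 0:
            return False
        # Continue searching after the end of this token.
        pos = idx + len(t)
    return True


# Invented example: protocol framing plus a two-byte invariant.
signature = [b"GET ", b"HTTP/1.1\r\n", b"\xff\xbf"]
variant_a = b"GET /x1 HTTP/1.1\r\nHost: a\xff\xbf\x90\x90\r\n"
variant_b = b"GET /zz9 HTTP/1.1\r\nHost: bbb\x41\xff\xbf\r\n"
benign = b"GET /index.html HTTP/1.1\r\nHost: example\r\n"

results = [matches_ordered(p, signature) for p in (variant_a, variant_b, benign)]
```

The ordered form is stricter than the conjunction form, so it accepts fewer payloads for the same set of substrings.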
Automatic Recognition of Public Transport Trips from Mobile Device Sensor Data and Transport Infrastructure Information
Automatic detection of public transport (PT) usage has important applications
for intelligent transport systems. It is crucial for understanding the
commuting habits of passengers at large and over longer periods of time. It
also enables compilation of door-to-door trip chains, which in turn can assist
public transport providers in improved optimisation of their transport
networks. In addition, predictions of future trips based on past activities can
be used to assist passengers with targeted information. This article documents
a dataset compiled from a day of active commuting by a small group of people
using different means of PT in the Helsinki region. Mobility data was collected
by two means: (a) manually written details of each PT trip during the day, and
(b) measurements using sensors of travellers' mobile devices. The manual log is
used to cross-check and verify the results derived from automatic measurements.
The mobile client application used for our data collection provides a fully
automated measurement service and implements a set of algorithms for decreasing
battery consumption. The live locations of some of the public transport
vehicles in the region were made available by the local transport provider and
sampled with a 30-second interval. The stopping times of local trains at
stations during the day were retrieved from the railway operator. The static
timetable information of all the PT vehicles operating in the area is made
available by the transport provider, and linked to our dataset. The challenge
is to correctly detect as many manually logged trips as possible by using the
automatically collected data. This paper includes an analysis of challenges due
to missing or partially sampled information in the data, and initial results
from automatic recognition using a set of algorithms. Improvement of correct
recognitions is left as an ongoing challenge. Comment: 22 pages, 7 figures, 10 tables
Highly Scalable Algorithms for Robust String Barcoding
String barcoding is a recently introduced technique for genomic-based
identification of microorganisms. In this paper we describe the engineering of
highly scalable algorithms for robust string barcoding. Our methods enable
distinguisher selection based on whole genomic sequences of hundreds of
microorganisms of up to bacterial size on a well-equipped workstation, and can
be easily parallelized to further extend the applicability range to thousands
of bacterial size genomes. Experimental results on both randomly generated and
NCBI genomic data show that whole-genome based selection results in a number of
distinguishers nearly matching the information theoretic lower bounds for the
problem.
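Distinguisher selection can be pictured with a toy greedy procedure. This sketch is not the scalable algorithm described in the abstract. It assumes that a distinguisher is a fixed-length substring, that a genome's barcode records which distinguishers it contains, and that the goal is to give every genome a different barcode. The function names and the tiny input strings are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def greedy_distinguishers(genomes: Sequence[str], k: int) -> List[str]:
    """Greedily pick length-k substrings until every genome pair differs."""
    candidates = sorted(
        {g[i:i + k] for g in genomes for i in range(len(g) - k + 1)}
    )
    unresolved = set(combinations(range(len(genomes)), 2))
    chosen: List[str] = []

    def gain(s: str) -> int:
        # Number of still-unresolved pairs that substring s separates.
        present = [s in g for g in genomes]
        return sum(present[a] != present[b] for a, b in unresolved)

    while unresolved:
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # remaining pairs cannot be separated with length-k substrings
        chosen.append(best)
        present = [best in g for g in genomes]
        unresolved = {(a, b) for a, b in unresolved if present[a] == present[b]}
    return chosen


def barcode(genome: str, distinguishers: Sequence[str]) -> Tuple[bool, ...]:
    """Presence/absence vector of the chosen distinguishers in one genome."""
    return tuple(d in genome for d in distinguishers)


toy_genomes = ["ACGT", "ACGA", "TTGA"]
picked = greedy_distinguishers(toy_genomes, k=2)
barcodes = [barcode(g, picked) for g in toy_genomes]
```

Because each binary distinguisher splits the genomes into at most two groups, at least ceil(log2 n) distinguishers are needed for n genomes, which is the kind of information-theoretic lower bound the abstract refers to.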
Dictionary matching in a stream
We consider the problem of dictionary matching in a stream. Given a set of
strings, known as a dictionary, and a stream of characters arriving one at a
time, the task is to report each time some string in our dictionary occurs in
the stream. We present a randomised algorithm which takes O(log log(k + m))
time per arriving character and uses O(k log m) words of space, where k is the
number of strings in the dictionary and m is the length of the longest string
in the dictionary.
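For contrast with the randomised algorithm summarised above, the problem statement itself can be illustrated with the classical Aho-Corasick automaton, fed one character at a time. This is a deterministic baseline and not the paper's method: its space grows with the total length of the dictionary strings, not with O(k log m) words. The class name `StreamMatcher` and its `feed` method are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List


class StreamMatcher:
    """Aho-Corasick automaton that reports matches per arriving character."""

    def __init__(self, dictionary: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie edges per node
        self.fail: List[int] = [0]              # failure links
        self.out: List[List[str]] = [[]]        # words ending at each node
        for word in dictionary:
            node = 0
            for ch in word:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(word)
        # Breadth-first pass computes failure links from shallow to deep nodes.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)
        self.state = 0

    def feed(self, ch: str) -> List[str]:
        """Consume one character; return dictionary words ending here."""
        while self.state and ch not in self.goto[self.state]:
            self.state = self.fail[self.state]
        self.state = self.goto[self.state].get(ch, 0)
        return list(self.out[self.state])


matcher = StreamMatcher(["he", "she", "his", "hers"])
# Map each stream position to the words that end at that position.
hits = {i: sorted(matcher.feed(c)) for i, c in enumerate("ushers")}
```

Each call to `feed` keeps only the current automaton state between characters, which is what makes the matcher usable on a stream.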