23,032 research outputs found
Towards shared datasets for normalization research
In this paper we present a Dutch and an English dataset that can serve as a gold standard for evaluating text normalization approaches. With the combination of text messages, message board posts and tweets, these datasets represent a variety of user-generated content. All data was manually normalized to its standard form using newly developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We focus on both the word and the character level and find that we can improve the BLEU score by approximately 20% for both languages. In order for this user-generated content to be released publicly to the research community, some issues first need to be resolved. These are discussed in closer detail by focusing on current legislation and by investigating previous, similar data collection projects. With this discussion we hope to shed some light on the various difficulties researchers face when trying to share social media data.
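The word-level side of this approach can be pictured with a toy lookup-based normalizer scored by a from-scratch sentence-level BLEU. The lexicon, example messages, and add-one smoothing below are illustrative assumptions, not the paper's SMT models or its exact BLEU configuration.

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU: geometric mean of add-one
    n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c & r).values())
        total = max(sum(c.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec / max_n)

# Hypothetical normalization lexicon, standing in for one learned from data.
LEXICON = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks", "2moro": "tomorrow"}

def normalize(message):
    """Word-level lexical normalization: replace known noisy tokens."""
    return " ".join(LEXICON.get(tok, tok) for tok in message.split())

raw = "u r gr8 thx see you 2moro"
gold = "you are great thanks see you tomorrow"
before, after = sentence_bleu(raw, gold), sentence_bleu(normalize(raw), gold)
```

Scoring the raw message against the gold standard, then scoring its normalized form, shows the same kind of BLEU gain the paper measures, here on a single invented sentence.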
UNH Law Alumni Magazine, Spring/Summer 2001
Conversational Sensing
Recent developments in sensing technologies, mobile devices and context-aware user interfaces have made it possible to represent information fusion and situational awareness as a conversational process among actors - human and machine agents - at or near the tactical edges of a network. Motivated by use cases in the domain of security, policing and emergency response, this paper presents an approach to information collection, fusion and sense-making based on the use of natural language (NL) and controlled natural language (CNL) to support richer forms of human-machine interaction. The approach uses a conversational protocol to facilitate a flow of collaborative messages from NL to CNL and back again in support of interactions such as: turning eyewitness reports from human observers into actionable information (from both trained and untrained sources); fusing information from humans and physical sensors (with associated quality metadata); and assisting human analysts to make the best use of available sensing assets in an area of interest (governed by management and security policies). CNL is used as a common formal knowledge representation for both machine and human agents to support reasoning, semantic information fusion and generation of rationale for inferences, in ways that remain transparent to human users. Examples are provided of various alternative styles for user feedback, including NL, CNL and graphical feedback. A pilot experiment with human subjects shows that a prototype conversational agent is able to gather usable CNL information from untrained human subjects.
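The CNL end of such a protocol can be sketched with a single controlled sentence pattern parsed into machine-usable facts. The sentence syntax and field names below are invented for illustration; they are not the actual CNL grammar used in the paper.

```python
import re

# Hypothetical Controlled-English pattern: "there is a <type> named <id>
# that has '<value>' as <attribute>" -- one unambiguous form per fact.
FACT = re.compile(
    r"there is an? (?P<type>\w+) named (?P<id>\w+)"
    r"(?: that has '(?P<value>[^']+)' as (?P<attr>\w+))?\.?$"
)

def parse_cnl(sentence):
    """Return a fact dict if the sentence matches the CNL form, else None."""
    m = FACT.match(sentence.strip().lower())
    return m.groupdict() if m else None

report = "There is a vehicle named v1 that has 'red' as colour."
fact = parse_cnl(report)
```

A free-text eyewitness report would not match this pattern; the conversational protocol's job is to guide the human toward a sentence that does, so the machine agent can reason over the resulting structured fact.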
An Empirical Study on Android for Saving Non-shared Data on Public Storage
With millions of apps available for download from official or third-party markets, Android has become one of the most popular mobile platforms today. These apps help people in all kinds of ways and thus have access to a great deal of users' data, which in general falls into three categories: sensitive data, data to be shared with other apps, and non-sensitive data not to be shared with others. For the first and second types of data, Android provides good storage models: an app's private sensitive data is saved to its private folder, which can only be accessed by the app itself, and data to be shared is saved to public storage (either the external SD card or the emulated SD card area on internal flash memory). But for the last type, i.e., an app's non-sensitive and non-shared data, there is a big problem in Android's current storage model, which essentially encourages an app to save its non-sensitive data to shared public storage that can be accessed by other apps. At first glance this seems harmless, as those data are non-sensitive after all, but it implicitly assumes that app developers can correctly identify all sensitive data and prevent all possible information leakage from private-but-non-sensitive data. In this paper, we demonstrate that this is an invalid assumption through a thorough survey of information leaks in apps that followed Android's recommended storage model for non-sensitive data. Our studies show that highly sensitive information from billions of users can be easily hacked by exploiting the problematic storage model described above. Although our empirical studies are based on a limited set of apps, the identified problems are not isolated or accidental bugs of the apps investigated. On the contrary, the problem is rooted in the vulnerable storage model recommended by Android. To mitigate the threat, we also propose a defense framework.
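The private-versus-public distinction at the heart of this storage model can be mimicked on a POSIX system with plain file modes. The sketch below is an analogy only: owner-only permissions stand in for an app's internal private folder, and world-readable permissions for files on shared external storage; it is not Android code.

```python
import os
import stat
import tempfile

os.umask(0o022)  # fix the umask so the example behaves the same everywhere
workdir = tempfile.mkdtemp()

# "Internal storage" analogue: owner-only, like an app's private folder.
private_path = os.path.join(workdir, "private.db")
fd = os.open(private_path, os.O_CREAT | os.O_WRONLY, 0o600)
os.write(fd, b"session token")
os.close(fd)

# "Public storage" analogue: world-readable, like files on the SD card.
public_path = os.path.join(workdir, "shared.txt")
fd = os.open(public_path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"exported document")
os.close(fd)

def others_can_read(path):
    """True if any other user (standing in for 'other apps') could read it."""
    return bool(stat.S_IMODE(os.stat(path).st_mode) & stat.S_IROTH)
```

The paper's point, in these terms, is that a storage model nudging developers to write "non-sensitive" files with the second mode fails as soon as a developer misjudges what is sensitive.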
Reinforcement learning for efficient network penetration testing
Penetration testing (also known as pentesting or PT) is a common practice for actively assessing the defenses of a computer network by planning and executing all possible attacks to discover and exploit existing vulnerabilities. Current penetration testing methods are increasingly becoming non-standard, composite and resource-consuming, despite the use of evolving tools. In this paper, we propose and evaluate an AI-based pentesting system that uses machine learning techniques, namely reinforcement learning (RL), to learn and reproduce average and complex pentesting activities. The proposed system, named the Intelligent Automated Penetration Testing System (IAPTS), consists of a module that integrates with industrial PT frameworks to enable them to capture information, learn from experience, and reproduce tests in future similar testing cases. IAPTS aims to save human resources while producing much-enhanced results in terms of time consumption, reliability and frequency of testing. IAPTS models PT environments and tasks as a partially observable Markov decision process (POMDP), which is solved with a POMDP solver. Although the scope of this paper is limited to PT planning for network infrastructures rather than the entire practice, the obtained results support the hypothesis that RL can enhance PT beyond the capabilities of any human PT expert in terms of time consumed, attack vectors covered, and accuracy and reliability of the outputs. In addition, this work tackles the complex problem of capturing and re-using expertise by allowing the IAPTS learning module to store and re-use PT policies in the same way a human PT expert would learn, but more efficiently.
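The RL framing can be illustrated with tabular Q-learning on a toy, fully observed attack chain (foothold, two pivots, then domain admin). The states, actions, and rewards below are invented for illustration, and the simplification is deliberate: the paper itself formulates PT as a POMDP and uses a POMDP solver, not this plain Q-learning loop.

```python
import random

random.seed(0)

GOAL = 3                       # state 3: domain admin compromised
ACTIONS = ["exploit", "scan"]  # exploit advances the chain; scan only costs time

def step(state, action):
    """Toy deterministic environment: one working exploit per pivot."""
    if action == "exploit":
        nxt = state + 1
        return nxt, 10.0 if nxt == GOAL else -1.0
    return state, -1.0  # scanning burns a step without progress

Q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(300):  # training episodes
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.choice(ACTIONS)                  # explore
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])   # act greedily
        nxt, reward = step(s, a)
        future = 0.0 if nxt == GOAL else max(Q[(nxt, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * future - Q[(s, a)])
        s = nxt

# Learned policy: the preferred action at each non-terminal state.
policy = {s: max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(GOAL)}
```

After training, the greedy policy chooses the exploit at every pivot, which is the tabular analogue of IAPTS storing and re-using a PT policy instead of re-planning from scratch each engagement.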