5,339 research outputs found
A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False-positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) in a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes.Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
XML Schema-based Minification for Communication of Security Information and Event Management (SIEM) Systems in Cloud Environments
XML-based communication governs most of today's systems communication, due to
its capability of representing complex structural and hierarchical data.
However, XML document structure is considered a huge and bulky data that can be
reduced to minimize bandwidth usage, transmission time, and maximize
performance. This contributes to a more efficient and utilized resource usage.
In cloud environments, this affects the amount of money the consumer pays.
Several techniques are used to achieve this goal. This paper discusses these
techniques and proposes a new XML Schema-based Minification technique. The
proposed technique works on XML Structure reduction using minification. The
proposed technique provides a separation between the meaningful names and the
underlying minified names, which enhances software/code readability. This
technique is applied to Intrusion Detection Message Exchange Format (IDMEF)
messages, as part of Security Information and Event Management (SIEM) system
communication hosted on Microsoft Azure Cloud. Test results show message size
reduction ranging from 8.15% to 50.34% in the raw message, without using
time-consuming compression techniques. Adding GZip compression to the proposed
technique produces 66.1% shorter message size compared to original XML
messages.Comment: XML, JSON, Minification, XML Schema, Cloud, Log, Communication,
Compression, XMill, GZip, Code Generation, Code Readability, 9 pages, 12
figures, 5 tables, Journal Articl
XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as the data representation language, collections of the XML data are exploded in numbers. The methods are required to manage and discover the useful information from them for the improved document handling. We present a schema clustering process by organising the heterogeneous XML schemas into various groups. The methodology considers not only the linguistic and the context of the elements but also the hierarchical structural similarity. We support our findings with experiments and analysis
HUDDL for description and archive of hydrographic binary data
Many of the attempts to introduce a universal hydrographic binary data format have failed or have been only partially successful. In essence, this is because such formats either have to simplify the data to such an extent that they only support the lowest common subset of all the formats covered, or they attempt to be a superset of all formats and quickly become cumbersome. Neither choice works well in practice. This paper presents a different approach: a standardized description of (past, present, and future) data formats using the Hydrographic Universal Data Description Language (HUDDL), a descriptive language implemented using the Extensible Markup Language (XML). That is, XML is used to provide a structural and physical description of a data format, rather than the content of a particular file. Done correctly, this opens the possibility of automatically generating both multi-language data parsers and documentation for format specification based on their HUDDL descriptions, as well as providing easy version control of them. This solution also provides a powerful approach for archiving a structural description of data along with the data, so that binary data will be easy to access in the future. Intending to provide a relatively low-effort solution to index the wide range of existing formats, we suggest the creation of a catalogue of format descriptions, each of them capturing the logical and physical specifications for a given data format (with its subsequent upgrades). A C/C++ parser code generator is used as an example prototype of one of the possible advantages of the adoption of such a hydrographic data format catalogue
BSML: A Binding Schema Markup Language for Data Interchange in Problem Solving Environments (PSEs)
We describe a binding schema markup language (BSML) for describing data
interchange between scientific codes. Such a facility is an important
constituent of scientific problem solving environments (PSEs). BSML is designed
to integrate with a PSE or application composition system that views model
specification and execution as a problem of managing semistructured data. The
data interchange problem is addressed by three techniques for processing
semistructured data: validation, binding, and conversion. We present BSML and
describe its application to a PSE for wireless communications system design
Huddl: the Hydrographic Universal Data Description Language
Since many of the attempts to introduce a universal hydrographic data format have failed or have been only partially successful, a different approach is proposed. Our solution is the Hydrographic Universal Data Description Language (HUDDL), a descriptive XML-based language that permits the creation of a standardized description of (past, present, and future) data formats, and allows for applications like HUDDLER, a compiler that automatically creates drivers for data access and manipulation. HUDDL also represents a powerful solution for archiving data along with their structural description, as well as for cataloguing existing format specifications and their version control. HUDDL is intended to be an open, community-led initiative to simplify the issues involved in hydrographic data access
Type-Based Detection of XML Query-Update Independence
This paper presents a novel static analysis technique to detect XML
query-update independence, in the presence of a schema. Rather than types, our
system infers chains of types. Each chain represents a path that can be
traversed on a valid document during query/update evaluation. The resulting
independence analysis is precise, although it raises a challenging issue:
recursive schemas may lead to infer infinitely many chains. A sound and
complete approximation technique ensuring a finite analysis in any case is
presented, together with an efficient implementation performing the chain-based
analysis in polynomial space and time.Comment: VLDB201
Information Integration - the process of integration, evolution and versioning
At present, many information sources are available wherever you are. Most of the time, the information needed is spread across several of those information sources. Gathering this information is a tedious and time consuming job. Automating this process would assist the user in its task. Integration of the information sources provides a global information source with all information needed present. All of these information sources also change over time. With each change of the information source, the schema of this source can be changed as well. The data contained in the information source, however, cannot be changed every time, due to the huge amount of data that would have to be converted in order to conform to the most recent schema.\ud
In this report we describe the current methods to information integration, evolution and versioning. We distinguish between integration of schemas and integration of the actual data. We also show some key issues when integrating XML data sources
Recommended from our members
XML-based genetic rules for scene boundary detection in a parallel processing environment
Genetic programming is based on Darwinian evolutionary theory that suggests that the best solution for a problem can be evolved by methods of natural selection of the fittest organisms in a population. These principles are translated into genetic programming by populating the solution space with an initial number of computer programs that can possibly solve the problem and then evolving the programs by means of mutation, reproduction and crossover until a candidate solution can be found that is close to or is the optimal solution for the problem. The computer programs are not fully formed source code but rather a derivative that is represented as a parse tree. The initial solutions are randomly generated and set to a certain population size that the system can compute efficiently. Research has shown that better solutions can be obtained if 1) the population size is increased and 2) if multiple runs are performed of each experiment. If multiple runs are initiated on many machines the probability of finding an optimal solution are increased exponentially and computed more efficiently. With the proliferation of the web and high speed bandwidth connections genetic programming can take advantage of grid computing to both increase population size and increasing the number of runs by utilising machines connected to the web. Using XML-Schema as a global referencing mechanism for defining the parameters and syntax of the evolvable computer programs all machines can synchronise ad-hoc to the ever changing environment of the solution space. Another advantage of using XML is that rules are constructed that can be transformed by XSLT or DOM tree viewers so they can be understood by the GP programmer. This allows the programmer to experiment by manipulating rules to increase the fitness of a rule and evaluate the selection of parameters used to define a solution
- …