109 research outputs found

    Tokenisation of class files for an embedded Java processor


    Text Augmentation: Inserting markup into natural language text with PPM Models

    This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. The other corpora are the ROCLING Chinese text segmentation corpus, the Computists' Communique corpus and the Reuters corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
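    The abstract does not spell out how the PPM models drive markup, so the Python sketch below is only a toy illustration of the underlying idea: character-level PPM-style models with escape to shorter contexts, one model trained per tag, with a segment assigned to whichever tag's model fits it best. The PPMC-like escape blending, the training strings and the "author" versus "title" example are all assumptions; none of this reproduces CEM's hierarchical tag models or heuristics.

```python
import math
from collections import defaultdict

class PPMModel:
    """Character-level PPM-style model with escape to shorter contexts
    (a rough PPMC-like sketch, not the escape methods used by CEM)."""

    def __init__(self, order: int = 3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count

    def train(self, text: str) -> None:
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                self.counts[text[max(0, i - k):i]][ch] += 1

    def prob(self, context: str, ch: str, alphabet: int = 256) -> float:
        """Blend counts from the longest matching context down to order 0,
        passing leftover probability mass to shorter contexts via escapes."""
        p_escape = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                continue
            total, distinct = sum(seen.values()), len(seen)
            if ch in seen:
                return p_escape * seen[ch] / (total + distinct)
            p_escape *= distinct / (total + distinct)
        return p_escape / alphabet  # order -1: uniform over the alphabet

def log_score(model: PPMModel, text: str) -> float:
    """Total log-probability of a segment under the model."""
    return sum(math.log(model.prob(text[max(0, i - model.order):i], ch))
               for i, ch in enumerate(text))

# Toy use: one model per tag, assign a segment to the better-fitting tag.
author_model, title_model = PPMModel(), PPMModel()
author_model.train("Knuth, D. E. Lamport, L. Aho, A. V. ")
title_model.train("The Art of Computer Programming. Compilers: Principles. ")
segment = "Lamport, L."
print("author" if log_score(author_model, segment) > log_score(title_model, segment) else "title")
```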

    Novel Approaches to the Delivery of XML and Schemas

    Typically, XML documents are delivered in their entirety, without considering whether all of the data is actually relevant to the user. This results in inefficiencies in terms of both bandwidth (transferring unnecessary data) and computing resources (extra memory and processing to handle the entire XML document). Through exploitation of XML's tree-like structure, a simple and lightweight protocol, referred to as RXPP, is introduced. Designed with mobile devices in mind, RXPP provides users with the ability to navigate and retrieve data from remote documents on a node-by-node or branch-by-branch basis, allowing users to retrieve only the fragments of interest. By skipping unwanted XML nodes, the client avoids having to maintain a full local copy of the XML document, as processing of the document is performed remotely. When only partial views of XML documents are maintained, the processing and memory requirements placed on mobile devices are reduced. Furthermore, time and money can be saved when using mobile devices in bandwidth-limited environments, where data is often charged per kilobyte, as only the relevant data is retrieved when the user selects the next node or branch. Through extension of RXPP, a two-way exchange of XML documents, called RXEP, is introduced. RXEP allows users to receive XML fragments and also to update remote XML documents. In addition to the navigation features of RXPP, RXEP further allows users to construct queries (e.g., using the XPath language), requesting many XML nodes from a remote XML document. In some cases, users can construct well-crafted queries that retrieve all the relevant XML fragments using only a single request. RXEP locators are introduced, which extend the path features of XPath to provide the precise location of received XML fragments within the client's own local version. RXEP locators provide extra information such as a node's absolute location and the total number of sibling nodes, and thus allow clients to retrieve fragments of XML whilst replicating the exact structure of the original XML document. Through exploitation of RXEP locators and RXEP's two-way exchange, office suites that use XML as a document format (such as MS Office and OpenOffice) become an ideal target for collaborative editing amongst many users. This allows users to download only the relevant parts of a document and upload corrections or modifications without the need to upload the entire document. To further increase the efficiency of RXEP, a binarised (i.e., compressed) version of the protocol is explored. By utilising well-established tree-based binarisation techniques, significant savings can be achieved through compression of the RXEP structure and the requested XML data. A new technique called SDOM is introduced, which merges the structural information from XML Schemas with the requested XML document. SDOM allows users to request XML fragments using RXEP techniques, where the requested XML data can be compressed on the fly using the information contained within SDOM. BinRXEP thus allows users to perform queries or navigation on remote XML documents and receive the results in a compact and compressed form. In many cases, the overhead added by RXEP is reduced to less than a byte when using binRXEP. Techniques for the transmission of both XML and XML Schema fragments within a single RXEP packet are proposed.
    Utilising RXEP, a user can request fragments of XML data from a remote document, with a further option to request the XML Schema fragment required for validation of that fragment. In this way, the user can avoid retrieving all XML Schemas associated with an XML document and may retrieve only the relevant XML Schema fragments. Finally, the collaborative creation of XML Schemas is introduced. Utilising RXEP XML and Schema techniques, users can all contribute to the creation of a schema in real time while seeing the progress of other users. This collaborative approach can lead to quicker creation of XML Schemas. Users may then extend the current set of descriptors or generate new descriptors using ideas from previous schema updates, thus resulting in a richer set of descriptors.
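    The abstract does not give RXEP's wire format, so the Python sketch below only illustrates the general fragment-on-demand idea: a "server" evaluates a simple path query against a remote document and returns just the matching node together with crude locator information, and the client splices that fragment into its partial local copy. The element names, locator attributes and the use of ElementTree's limited XPath subset are assumptions, not the RXEP specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical remote document held on the "server" side.
REMOTE_DOC = """<library>
  <book id="b1"><title>XML Delivery</title><year>2006</year></book>
  <book id="b2"><title>Schema Design</title><year>2007</year></book>
</library>"""

def serve_fragment(path_query: str) -> str:
    """Server side: evaluate a simple path query and return only the matching
    node, wrapped with crude locator information (query path, sibling count)."""
    root = ET.fromstring(REMOTE_DOC)
    node = root.find(path_query)           # ElementTree's limited XPath subset
    if node is None:
        return '<fragment path="{}"/>'.format(path_query)
    locator = 'path="{}" siblings="{}"'.format(path_query, len(list(root)))
    return "<fragment {}>{}</fragment>".format(
        locator, ET.tostring(node, encoding="unicode"))

def merge_fragment(local_root: ET.Element, fragment_xml: str) -> ET.Element:
    """Client side: splice the received fragment into a partial local copy."""
    for child in ET.fromstring(fragment_xml):
        local_root.append(child)
    return local_root

# Usage: fetch only the second book instead of transferring the whole document.
partial = ET.Element("library")
merge_fragment(partial, serve_fragment("./book[2]"))
print(ET.tostring(partial, encoding="unicode"))
```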

    CORLEONE - Core Linguistic Entity Online Extraction

    This report presents CORLEONE (Core Linguistic Entity Online Extraction) - a pool of loosely coupled, general-purpose, lightweight linguistic processing resources which can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer which performs fine-grained token classification, (c) a component for performing morphological analysis, (d) a memory-efficient, database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented. JRC.G.2 - Support to External Security
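    CORLEONE's own string distance library is written in Java and its exact metrics are not enumerated in the abstract; the short Python sketch below merely illustrates the kind of normalised edit-distance similarity that is typically used for name variant matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1], usable for name variant matching."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

print(name_similarity("Mohammed", "Muhammad"))   # 0.75 for these two variants
```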

    Scalable Honeypot Monitoring and Analytics

    Honeypot systems with a large number of instances pose new challenges in terms of monitoring and analytics. They produce a significant amount of data and require the analyst to monitor every new honeypot instance in the system. Specifically, current approaches require each honeypot instance to be monitored and analysed individually, and therefore cannot scale to scenarios in which a large number of honeypots are used. Furthermore, amalgamating data from a large number of honeypots presents new opportunities to analyse trends. This thesis proposes a scalable monitoring and analytics system designed to address this challenge. It consists of three components: monitoring, analysis and visualisation. The system automatically monitors each new honeypot, reduces the amount of collected data and stores it centrally. All gathered data is analysed in order to identify patterns of attacker behaviour, and visualisation conveniently displays the analysed data to an analyst. A user study was performed to evaluate the system. It shows that the solution meets the requirements for a scalable monitoring and analytics system. In particular, the monitoring and analytics can be implemented using only open-source software and do not noticeably impact the performance of individual honeypots or the scalability of the overall honeypot system. The thesis also discusses several variations and extensions, including the detection of new patterns and the possibility of providing feedback when the system is used in an educational setting to monitor attacks by information-security students.
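    The abstract does not describe the system's schema or toolchain, so the Python sketch below only illustrates the general pattern it outlines: reduce per-honeypot events to a few fields, store them centrally, and aggregate them to surface recurring attacker behaviour. All field names, sensor identifiers and sample log lines are invented for illustration.

```python
import json
from collections import Counter

def reduce_event(raw: dict) -> dict:
    """Keep only the fields needed for cross-honeypot analysis; the field
    names are assumptions, not the schema used in the thesis."""
    return {"honeypot": raw.get("sensor"),
            "src_ip": raw.get("src_ip"),
            "command": (raw.get("input") or "")[:200]}

def analyse(events):
    """Aggregate reduced events from every honeypot instance and surface
    recurring attacker behaviour."""
    by_ip = Counter(e["src_ip"] for e in events)
    by_cmd = Counter(e["command"] for e in events if e["command"])
    return {"top_sources": by_ip.most_common(5),
            "top_commands": by_cmd.most_common(5)}

# Two invented log lines standing in for events collected from two honeypots.
raw_lines = [
    '{"sensor": "hp-01", "src_ip": "203.0.113.7", "input": "wget http://203.0.113.9/x.sh"}',
    '{"sensor": "hp-02", "src_ip": "203.0.113.7", "input": "wget http://203.0.113.9/x.sh"}',
]
events = [reduce_event(json.loads(line)) for line in raw_lines]
print(analyse(events))
```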

    Ensemble Morphosyntactic Analyser for Classical Arabic

    Classical Arabic (CA) is an influential language in the lives of Muslims around the world. It is the language of two sources of Islamic law: the Quran and the Sunnah, the collection of traditions and sayings attributed to the prophet Mohammed. However, Classical Arabic in general, and the Sunnah in particular, is underexplored and under-resourced in the field of computational linguistics. This study examines possible directions for adapting existing tools, specifically morphological analysers, designed for Modern Standard Arabic (MSA) to Classical Arabic. Morphological analysers for CA are limited, as is the data for evaluating them. In this study, we adapt existing analysers and create a validation dataset from the Sunnah books. Inspired by advances in deep learning and the promising results of ensemble methods, we developed a systematic method for transferring morphological analysis that is capable of handling different labelling systems and various sequence lengths. We hand-picked the four best open-access MSA morphological analysers. Data generated by these analysers is evaluated, before and after adaptation, against the existing Quranic Corpus and the Sunnah Arabic Corpus. The findings are as follows. First, it is feasible to analyse under-resourced languages using existing comparable-language resources, given a small but sufficient set of annotated text. Second, the analysers typically make different errors, and this can be exploited. Third, explicit alignment of sequences and mapping of labels are not necessary to achieve comparable accuracies, given a sufficiently large training dataset. Adapting existing tools is easier than creating tools from scratch, and the resulting quality depends on the size of the training data and on the number and quality of the input taggers. A pipeline architecture performs less well than an end-to-end neural network architecture due to error propagation and limitations on the output format. A valuable tool and dataset for annotating Classical Arabic are made freely available.
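    The thesis itself uses an end-to-end neural architecture rather than simple voting, but the core ensemble intuition, combining the outputs of several analysers so that their differing errors cancel out, can be illustrated with a minimal per-token majority vote in Python. The tag set and the three analyser outputs below are hypothetical.

```python
from collections import Counter

def ensemble_tag(analyser_outputs):
    """Per-token majority vote over the tags proposed by several analysers."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*analyser_outputs)]

# Hypothetical outputs of three MSA analysers for the same four-token sentence.
analyser_a = ["NOUN", "VERB", "PREP", "NOUN"]
analyser_b = ["NOUN", "VERB", "PREP", "ADJ"]
analyser_c = ["ADJ",  "VERB", "PREP", "NOUN"]

print(ensemble_tag([analyser_a, analyser_b, analyser_c]))
# -> ['NOUN', 'VERB', 'PREP', 'NOUN']
```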

    An information extraction model for recommending the most applied case

    The amount of information produced by different domains is constantly increasing. One domain that produces particularly large amounts of information is the legal domain, where information is mainly used for research purposes. However, legal researchers spend too much time searching for useful information, which is found using specialised search engines or by consulting hard copies of legal literature. The main research question addressed by this study is: "What techniques can be incorporated into a model that recommends the most applied case for a field of law?". The Design Science Research (DSR) methodology was used to address the research objectives, and the model developed is the theoretical contribution produced by following the DSR methodology. A case study organisation, LexisNexis, was used to help investigate the real-world problem. The initial investigation revealed that too much time is spent searching for the Most Applied Case (MAC) and that no formal or automated processes were used. An analysis of an informal process followed by legal researchers enabled the identification of different concepts that could be combined to create a prescriptive model for recommending the MAC. A critical analysis of the literature was conducted to obtain a better understanding of the legal domain and of the techniques that can be applied to problems in this domain related to information retrieval and extraction. This resulted in the creation of an IE Model based only on theory. Questionnaires were sent to experts to obtain a further understanding of the legal domain, highlight problems faced, and identify which attributes of a legal case can be used to help recommend the MAC. During the Design and Development activity of the DSR methodology, a prescriptive MAC Model for recommending the MAC was created based on the findings from the literature review and the questionnaires. The MAC Model consists of processes concerning information retrieval (IR), information extraction (IE), information storage, and query-independent ranking. Analysis of IR and IE helped to identify problems experienced when processing text, and appropriate techniques and algorithms were identified that can process legal documents and extract specific facts. The extracted facts are then further processed to allow for storage and processing by query-independent ranking algorithms. The processes incorporated into the model were used to create a proof-of-concept prototype called the IE Prototype. The IE Prototype implements two processes, the IE process and the Database process: the IE process analyses different sections of a legal case to extract specific facts, and the Database process then stores the extracted facts in a document database for future querying. The IE Prototype was evaluated using the technical risk and efficacy strategy from the Framework for Evaluation of Design Science. Both formative and summative evaluations were conducted: formative evaluations identified functional issues of the prototype, whilst summative evaluations used real-world legal cases to test it. Multiple experiments were conducted on legal cases, known as source cases, which resulted in facts from the source cases being extracted. For the purpose of the experiments, the term "source case" is used to distinguish between a legal case in its entirety and a legal case's list of referred-to cases.
    Two types of NoSQL database were investigated for implementation, namely a graph database and a document database. Setting up the graph database required little time; however, development issues prevented it from being successfully implemented in the proof-of-concept prototype. A document database was successfully implemented as an alternative. Analysis of the source cases used to evaluate the IE Prototype revealed that 96% of them were categorised as partially extracted. The results also showed that the IE Prototype is capable of processing large numbers of source cases at a time.
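    The abstract does not specify the IE Prototype's extraction rules or database, so the Python sketch below only mirrors the shape of the two processes it describes: an IE step that pulls a few facts out of a case with regular expressions, and a Database step that keeps them in a document-style store (here an in-memory dict standing in for a real document database). The patterns, field names and sample case are assumptions made for illustration.

```python
import re

def extract_facts(case_text: str) -> dict:
    """Pull a few illustrative fields from a case; the real IE Prototype targets
    specific sections and a richer fact set, and these patterns are assumptions."""
    patterns = {
        "case_name": r"^(.+?\sv\s.+?)$",
        "citation":  r"(\[\d{4}\]\s+\S+\s+\d+)",
        "judge":     r"(?:Judge|Justice)\s+([A-Z][a-z]+)",
    }
    facts = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, case_text, re.MULTILINE)
        facts[field] = match.group(1) if match else None
    return facts

# In-memory stand-in for the document database used by the Database process.
document_store = {}

def store_case(case_id: str, case_text: str) -> None:
    document_store[case_id] = extract_facts(case_text)

# Invented sample case text, used only to show the flow.
sample = "Smith v Jones\n[2015] ZAECG 42\nBefore Justice Nkosi\n..."
store_case("source-001", sample)
print(document_store["source-001"])
```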