43 research outputs found

    Tag based models of English text

    The problem of compressing English text is important both because of the ubiquity of English as a target for compression and because of the light that compression can shed on the structure of English. English text is examined in conjunction with additional information about the part of speech of each word in the text (referred to as “tags”). It is shown that the tags plus the text can be compressed more than the text alone: in effect, the tags can be encoded at no extra cost, or even with a small net saving in size. A number of different ways of integrating the compression of tags and text, using an escape mechanism similar to that of PPM, are compared with one another and with standard word based and character based compression programs. The result is that the tag character and word based schemes always outperform the character based schemes. Overall, the tag based schemes outperform the word based schemes. We conclude by conjecturing that tags chosen for compression rather than linguistic purposes would perform even better.
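
    The escape mechanism mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's tag-based scheme: it is a character-level toy that assumes an order-1 context, escape estimation in the style of PPM method C, no symbol exclusion, and counts updated in every context after each symbol. The function name `ppm_code_length` and the 256-symbol alphabet are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ppm_code_length(text: str, alphabet_size: int = 256) -> float:
    """Adaptive code length of `text` in bits (hypothetical helper).

    Contexts are tried from longest to shortest: order 1 (previous
    character), then order 0 (no context), then a uniform fallback.
    """
    order1 = defaultdict(Counter)  # previous char -> counts of next char
    order0 = Counter()             # context-free character counts
    bits = 0.0
    prev = None
    for ch in text:
        contexts = ([order1[prev]] if prev is not None else []) + [order0]
        prob = 1.0
        coded = False
        for counts in contexts:
            total = sum(counts.values())
            distinct = len(counts)
            if total == 0:
                continue  # nothing seen in this context: fall through for free
            if ch in counts:
                # Symbol seen before in this context: code it here.
                prob *= counts[ch] / (total + distinct)
                coded = True
                break
            # Symbol is novel here: pay for an escape to a shorter context.
            prob *= distinct / (total + distinct)
        if not coded:
            # Assumed last resort: uniform over the whole alphabet.
            prob *= 1.0 / alphabet_size
        bits += -math.log2(prob)
        # Simplification: update every context, not only the ones used.
        if prev is not None:
            order1[prev][ch] += 1
        order0[ch] += 1
        prev = ch
    return bits
```

    Dividing the result by `len(text)` gives bits per character, the usual figure for comparing models; repetitive input should come out well below the 8 bits per character of the uniform fallback.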

    Correcting English text using PPM models

    An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized, while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or omitted. This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 95.9% to 96.6%, a decrease of about 10 errors per page.
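
    Word segmentation driven by a language model can be sketched as a dynamic program over split points. The abstract uses a PPM model; the sketch below swaps in a plain unigram word model to stay short, so it illustrates the search only. The function `segment`, the `word_probs` dictionary, and the per-character penalty `unknown_prob` are hypothetical.

```python
import math


def segment(text, word_probs, max_len=20, unknown_prob=1e-10):
    """Return the most probable split of `text` into words.

    `word_probs` maps known words to probabilities (assumed input).
    Unknown substrings are penalised per character so that long
    unknown words are never preferred over known ones.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_probs:
                log_p = math.log(word_probs[word])
            else:
                log_p = len(word) * math.log(unknown_prob)
            score = best[j] + log_p
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

    The same search would work with any model that can score a candidate word given its history; only the `log_p` computation would change.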

    Adaptive models of English text

    High quality models of English text with performance approaching that of humans are important for many applications including spelling correction, speech recognition, OCR, and encryption. A number of different statistical models of English are compared with each other and with previous estimates from human subjects. It is concluded that the best current models are word based with part of speech tags. Given sufficient training text, they are able to attain performance comparable to humans.

    Differentiating code from data in x86 binaries

    Robust, static disassembly is an important part of achieving high coverage for many binary code analyses, such as reverse engineering, malware analysis, reference monitor in-lining, and software fault isolation. However, one of the major difficulties current disassemblers face is differentiating code from data when they are interleaved. This paper presents a machine learning-based disassembly algorithm that segments an x86 binary into subsequences of bytes and then classifies each subsequence as code or data. The algorithm builds a language model from a set of pre-tagged binaries using a statistical data compression technique. It sequentially scans a new binary executable and sets a breaking point at each potential code-to-code and code-to-data/data-to-code transition. Each segment is then classified as code or data according to the minimum cross-entropy. Experimental results are presented to demonstrate the effectiveness of the algorithm.
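
    Classification by minimum cross-entropy can be sketched as follows: one model is trained per class, and a segment is assigned to the class whose model encodes it in the fewest bits per byte. The paper builds its models with a statistical compression technique; this sketch substitutes a byte-bigram model with add-one smoothing, and the names `ByteBigramModel` and `classify` are hypothetical.

```python
import math
from collections import Counter


class ByteBigramModel:
    """Toy stand-in for a compression model: smoothed byte bigrams."""

    def __init__(self, training: bytes):
        # Counts of (previous byte, current byte) pairs and of contexts.
        self.pair_counts = Counter(zip(training, training[1:]))
        self.context_counts = Counter(training[:-1])

    def cross_entropy(self, segment: bytes) -> float:
        """Average bits per byte transition in `segment` under this model."""
        if len(segment) < 2:
            raise ValueError("segment must contain at least two bytes")
        bits = 0.0
        for prev, cur in zip(segment, segment[1:]):
            # Add-one smoothing over the 256 possible byte values.
            p = (self.pair_counts[(prev, cur)] + 1) / (self.context_counts[prev] + 256)
            bits -= math.log2(p)
        return bits / (len(segment) - 1)


def classify(segment: bytes, models: dict) -> str:
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(segment))
```

    In use, `models` would map labels such as `"code"` and `"data"` to models trained on pre-tagged bytes, and `classify` would be called once per candidate segment.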

    Token identification using HMM and PPM models

    Hidden Markov models (HMMs) and prediction by partial matching (PPM) models have been successfully used in language processing tasks including learning-based token identification. Most existing systems are domain- and language-dependent, which limits their retargetability and applicability. This paper investigates the effect of combining HMMs and PPM on token identification. We implement a system that bridges the two well-known methods through words new to the identification model. The system is fully domain- and language-independent: no changes to the code are necessary when applying it to other domains or languages, and the only required input is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 69.02% for TCC and 76.59% for BIB. Although the performance is not as good as that obtained from a system with language-dependent components, the proposed system is able to deal with a wide range of domain- and language-independent problems. Identification of dates gives the best result: 73% and 92% of tokens are correctly identified for the two corpora respectively. The system also performs reasonably well on people's names, with 68% of tokens correctly identified for TCC and 76% for BIB.
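
    For reference, an F-measure combines precision and recall into one score. The abstract does not say which weighting was used, so the sketch below assumes the common balanced form (beta = 1) by default; the function name and the count arguments are hypothetical.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure from token counts; beta=1 weights precision and recall equally."""
    predicted = true_positives + false_positives  # tokens the system marked
    actual = true_positives + false_negatives     # tokens in the annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

    Because the score is a weighted harmonic mean, it stays low unless precision and recall are both reasonably high.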

    The ecology of the European badger (Meles meles) in Ireland: a review

    The badger is an ecologically and economically important species. Detailed knowledge of aspects of the ecology of this animal in Ireland has only emerged through research over recent decades. Here, we review what is known about the species' Irish populations and compare these findings with populations in Britain and Europe. As elsewhere, setts are preferentially constructed on south or southeast facing sloping ground in well-drained soil types. Compared with those in Britain, Irish badger main setts are less complex and are most commonly found in hedgerows. Badgers utilise many habitat types, but greater badger densities have been associated with landscapes with high proportions of pasture and broadleaf woodlands. Badgers in Ireland tend to have seasonally varied diets, with less dependence on earthworms than some other populations in northwest Europe. Recent research suggests that females exhibit later onset and timing of reproductive events, smaller litter sizes and lower loss of blastocysts than those in populations studied in Britain. Adult social groups in Ireland tend to be smaller than in Britain, though significantly larger than social groups from continental Europe. Although progress has been made in estimating the distribution and density of badger populations, national population estimates have varied widely in the Republic of Ireland. Future research should concentrate on filling gaps in our knowledge, including population models and predictive spatial modelling that will contribute to vaccine delivery, management and conservation strategies. Peer-reviewed. Funders: Department of Agriculture, Fisheries and Food; Teagasc Walsh Fellowship Programme.