
    Identification of headers and footers in noisy documents

    Full text link
    Optical Character Recognition (OCR) technology is typically used to convert hard-copy printed material into electronic form. Many presentational artifacts, such as end-of-line hyphenations and running headers and footers, are converted verbatim. These artifacts can hinder proximity and exact-match searching. This thesis develops an algorithm to extract running headers and footers from electronic documents generated by OCR. The method associates each page of the document with its neighboring pages and detects headers and footers by comparing the page with those neighbors. Experiments are conducted to test the effectiveness of the algorithm.
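    The neighbor-comparison idea can be sketched as follows. This is a rough illustration, not the thesis's actual algorithm: the function name, window size, and similarity threshold are assumptions, and only the first line of each page is checked (a real implementation would also examine the last lines for footers).

```python
from difflib import SequenceMatcher

def detect_running_header(pages, window=2, threshold=0.8):
    """Flag a page's first line as a running header when it closely
    matches the first lines of neighboring pages (within +/- window).
    Returns one entry per page: the header line, or None."""
    headers = []
    for i, page in enumerate(pages):
        lines = page.splitlines()
        first_line = lines[0] if lines else ""
        neighbors = [pages[j]
                     for j in range(max(0, i - window),
                                    min(len(pages), i + window + 1))
                     if j != i]
        votes = 0
        for other in neighbors:
            other_lines = other.splitlines()
            other_first = other_lines[0] if other_lines else ""
            # Fuzzy match tolerates OCR noise and changing page numbers.
            if SequenceMatcher(None, first_line, other_first).ratio() >= threshold:
                votes += 1
        is_header = bool(neighbors) and votes >= len(neighbors) / 2
        headers.append(first_line if is_header else None)
    return headers
```

    Fuzzy rather than exact matching matters here because OCR noise and varying page numbers mean the "same" header is rarely byte-identical across pages.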

    Forensic Data Properties of Digital Signature BDOC and ASiC-E Files on Classic Disk Drives

    Get PDF
    This thesis reviews the contents and observes certain properties of digitally signed documents in the BDOC and ASiC-E container formats. After reviewing a set of sample containers, the author proposes a header and footer combination (signature) that significantly improves pinpointed carving-based recovery of such files from a deleted state in contiguous clusters on NTFS-formatted uncompressed volumes, taking into account the geometry of classic disk drives. The author also describes forensically meaningful attributive data found in the ZIP local file headers and central directory, the XML signatures, and the embedded ASN.1-encoded data of the sample files, and suggests an algorithm for the extraction of such data. Based on these findings, the author creates Python scripts and executes a series of tests for file carving and extraction of attributive data. These tests are run over the samples placed into unallocated clusters, and the results are compared to several mainstream commercial forensic examination suites as well as some popular data recovery tools. Finally, the author web-scrapes a large number of real-life documents from a government agency's public document registry. The carving signature and the data-extraction algorithm are thereafter applied on a larger scale and in an environment competitively supplemented with structurally similar containers.
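    BDOC and ASiC-E containers are ZIP archives whose first entry is a file named `mimetype`, so a carving pass of the kind described can be sketched as a signature scan over a raw byte image. This is a minimal illustration, not the author's actual script: it assumes contiguous storage and an empty ZIP comment field, and the function name is invented for this sketch.

```python
HEADER = b"PK\x03\x04"  # ZIP local file header magic
EOCD = b"PK\x05\x06"    # End of central directory record magic

def carve_containers(image: bytes, mimetype: bytes = b"mimetype"):
    """Carve candidate ZIP-based signature containers from a raw image.
    A hit starts at a local file header whose first entry is named
    'mimetype' (as required for ASiC containers) and ends at the next
    end-of-central-directory record."""
    carved = []
    pos = 0
    while True:
        start = image.find(HEADER, pos)
        if start == -1:
            break
        # In a ZIP local file header, the entry name begins at offset 30.
        if image[start + 30:start + 30 + len(mimetype)] == mimetype:
            end = image.find(EOCD, start)
            if end != -1:
                # The EOCD record is 22 bytes when the comment is empty.
                carved.append(image[start:end + 22])
        pos = start + 1
    return carved
```

    A production carver would also parse the EOCD comment-length field instead of assuming it is zero, and validate the carved candidate by actually opening it as a ZIP archive.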

    AN ML BASED DIGITAL FORENSICS SOFTWARE FOR TRIAGE ANALYSIS THROUGH FACE RECOGNITION

    Get PDF
    Over the past few years, the complexity and heterogeneity of digital crimes have increased exponentially, making digital evidence and digital forensics paramount for both criminal investigations and civil litigation. Some routine digital forensic analysis tasks are cumbersome and can increase the backlog of pending cases, especially when domain experts are in short supply. While the work is not very complex, the sheer scale can be taxing. Given current trends and future predictions, crimes will only become more complex, and the need to collect and examine digital evidence will only grow. In this research, we propose an ML-based digital forensics software for triage analysis called Synthetic Forensic Omnituens (SynFO) that can automate evidence acquisition, extract relevant files, perform automated triage analysis, and generate a basic report for the analyst. The results of this research show a promising future for automation with the help of machine learning.

    Intelligent Web Crawler for Semantic Search Engine

    Get PDF
    A Semantic Search Engine (SSE) is a program that produces semantic-oriented concepts from the Internet. A web crawler is the front end of our SSE; its primary goal is to supply important and necessary information to the data analysis component of the SSE. The main function of the analysis component is to produce concepts (moderately frequent finite sequences of keywords) from the input; it uses variants of TF-IDF as the primary tool to remove stop words. However, filtering out stop words with TF-IDF is very expensive. The goal of this project is to improve the efficiency of the SSE by avoiding feeding junk data (stop words) into it. In this project, we formally define three classes of stop words: English-grammar-based stop words, metadata stop words, and topic-specific stop words. To remove English-grammar-based stop words, we simply use a stop word list available on the Internet. For metadata stop words, we create a simple web crawler and add a modified HTML parser to it; the parser identifies and removes metadata stop words, so our web crawler can remove most of them and reduce the processing time of the SSE. Topic-specific stop words are less well understood, so they are identified from a randomly selected sample of documents instead of classifying every term in the whole document set as a keyword (at or above a threshold) or a stop word (below it). MapReduce is applied to reduce the complexity and find topic-specific stop words such as "acm" (Association for Computing Machinery), which we find in IEEE data mining papers. We then create a topic-specific stop word list and use it to reduce the processing time of the SSE.
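    The sampling-based detection of topic-specific stop words can be sketched with document frequency over a random sample. The function name and threshold below are illustrative assumptions, and the real system distributes this counting with MapReduce rather than a single loop:

```python
from collections import Counter

def topic_specific_stop_words(sample_docs, df_threshold=0.8,
                              general_stop_words=frozenset()):
    """Flag terms that appear in almost every document of a sample as
    topic-specific stop words (e.g. 'acm' across data-mining papers).
    Document frequency is used rather than raw term frequency, so a
    term must be spread across the sample, not just repeated once."""
    doc_freq = Counter()
    for doc in sample_docs:
        # set() counts each term at most once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] += 1
    n = len(sample_docs)
    return {t for t, df in doc_freq.items()
            if df / n >= df_threshold and t not in general_stop_words}
```

    The per-document counting step maps naturally onto MapReduce: each mapper emits `(term, 1)` once per document, and reducers sum the counts into document frequencies.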

    Document boundary determination using structural and lexical analysis

    Full text link
    This thesis proposes a method for determining boundaries between sequentially presented documents, using parallel analyses drawn from structural document understanding and information retrieval. Specifically, the method is intended to serve as a trainable system for determining where one document ends and another begins. Content analysis methods include the Vector Space Model as well as targeted analysis of content at the margins of document fragments. Structural analysis in this implementation is limited to simple and ubiquitous entities, such as software-generated zones, simple format-specific lines, and the appearance of page numbers. The analysis focuses on changes in similarity between comparisons, emphasizing that the extremities of documents tend to contain significant structural and lexical changes that can be observed and quantified. We combine the various features using nonlinear approximation (a neural network) and experimentally test the usefulness of the combinations.
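    The similarity-dip idea can be sketched with a bag-of-words Vector Space Model over consecutive pages. The function names and threshold are illustrative assumptions; the thesis combines lexical signals like this with structural features in a neural network rather than using a fixed cutoff:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundary_candidates(pages, threshold=0.2):
    """Return indices i where a new document likely begins at pages[i]:
    the vector-space similarity between page i-1 and page i drops
    below the threshold, signaling a lexical change."""
    vectors = [Counter(p.lower().split()) for p in pages]
    return [i for i in range(1, len(pages))
            if cosine(vectors[i - 1], vectors[i]) < threshold]
```

    Pages within one document tend to share vocabulary, so their cosine similarity stays high; a sharp dip between adjacent pages marks a candidate boundary.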

    DutchParl: A corpus of parliamentary documents in Dutch

    Get PDF

    Towards a digitised process-wheel for historic building repair and maintenance projects in Scotland

    Get PDF
    Purpose – With the increasing demand for high-quality, economical, and sustainable historic building repair and maintenance (R&M), allied with the perennial problem of skills shortages (in project management and on-site practice), investment in new technologies becomes paramount for modernising training and practice. Yet the historic R&M industry, in particular small and medium-sized enterprises (SMEs), has yet to benefit from digital technologies (such as laser scanning, virtual reality (VR) and cloud computing) that have the potential to enhance performance and productivity. Design/methodology/approach – A qualitative participatory action research approach was adopted. One demonstration project (Project A) exhibiting critical disrepair is reported, showcasing the piloting of a five-phase digitised 'process-wheel' intended to provide a common framework for facilitating collaboration among project stakeholders and thereby aiding successful project delivery. Five semi-structured interviews were conducted with industry employers to inform the development of the process-wheel concept. Findings – Implementing only Phase 1 of the digitised 'process-wheel' (e-condition surveying incorporating laser scanning) resulted in an estimated 25-30% cost and time savings compared to conventional methods. The accrued benefits are two-fold: (1) a structured, standardised data-capturing approach whose output is shared in a common project repository among relevant stakeholders; (2) guidance for applying digital technologies to attain efficiencies across the various phases of the process-wheel. Originality/value – This paper provides original and valuable information on the benefits of modernising R&M practice, highlighting the importance of continued investment in innovative processes and new technologies for historic building R&M to enhance existing practice and inform current training provision. Future work will focus on further piloting and validating the process-wheel in its entirety on selected demonstration projects, with a view to supporting the industry in digitising its workflows and going fully digital to realise optimum process efficiencies.