171 research outputs found

    Identification of headers and footers in noisy documents

    Optical recognition technology is typically used to convert hard-copy printed material into electronic form. Many presentational artifacts, such as end-of-line hyphenations and running headers and footers, are converted literally. These artifacts can hinder proximity and exact-match searching. This thesis develops an algorithm to extract running headers and footers from electronic documents generated by OCR. The method associates each page of the document with its neighboring pages and detects headers and footers by comparing the page with those neighbors. Experiments are also conducted to test the effectiveness of the algorithm.
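    The neighbor-comparison idea in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function names, the window of neighboring pages, the number of edge lines inspected (`depth`), the digit masking, and the similarity threshold are all assumptions chosen for illustration. A line at the top or bottom of a page is flagged when a nearby page carries a near-identical line in the same region, with fuzzy matching to tolerate OCR noise.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Lower-case, trim, and mask digit runs so page numbers compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def similar(a, b, threshold):
    """Fuzzy match that tolerates a few OCR character errors (assumed metric)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def find_running_lines(pages, window=2, depth=2, threshold=0.8):
    """Return {(page_index, line_index)} for suspected headers and footers.

    pages: list of pages, each a list of text lines.
    window, depth, threshold: hypothetical tuning parameters.
    """
    flagged = set()
    for i, page in enumerate(pages):
        lo, hi = max(0, i - window), min(len(pages), i + window + 1)
        neighbors = [pages[j] for j in range(lo, hi) if j != i]
        top = range(min(depth, len(page)))
        bottom = range(max(depth, len(page) - depth), len(page))
        for k in list(top) + list(bottom):
            if not page[k].strip():
                continue
            for nb in neighbors:
                # Compare top lines with neighbor tops, bottom lines with bottoms.
                region = nb[:depth] if k < depth else nb[-depth:]
                if any(o.strip() and similar(page[k], o, threshold) for o in region):
                    flagged.add((i, k))
                    break
    return flagged


def strip_running_lines(pages, **kwargs):
    """Return copies of the pages with the flagged lines removed."""
    flagged = find_running_lines(pages, **kwargs)
    return [[ln for k, ln in enumerate(pg) if (i, k) not in flagged]
            for i, pg in enumerate(pages)]
```

    Masking digits lets "Page 1" and "Page 2" match, and the fuzzy ratio lets a misrecognized header such as "Annual Rep0rt" still match its clean neighbors. Both choices are illustrative, not taken from the thesis.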

    AN EVOLUTIONARY APPROACH TO BIBLIOGRAPHIC CLASSIFICATION

    This dissertation is research in the domain of information science and, specifically, the organization and representation of information. The research has implications for the classification of scientific books, especially as dissemination of information becomes more rapid and science becomes more diverse due to increases in multi-, inter-, and trans-disciplinary research, which focuses on phenomena, in contrast to traditional library classification schemes based on disciplines. The literature review indicates that 1) human socio-cultural groups have many of the same properties as biological species, 2) output from human socio-cultural groups can be and has been the subject of evolutionary relationship analyses (i.e., phylogenetics), 3) library and information science theorists believe the most favorable and scientific classification for information packages is one based on common origin, but 4) library and information science classification researchers have not demonstrated a book classification based on evolutionary relationships of common origin. The research project supports the assertion that a sensible book classification method can be developed using a contemporary biological classification approach based on common origin, which has not been applied to a collection of books until now. Using a sample from a collection of digitized earth-science books, the method developed includes a text-mining step to extract important terms, which were converted into a dataset for input into the second step: the phylogenetic analysis. Three classification trees were produced and are discussed. Parsimony analysis, in contrast to distance and likelihood analyses, produced a sensible book classification tree. Also included is a comparison with a classification tree based on a well-known contemporary library classification scheme (the Library of Congress Classification). Final discussions connect this research with knowledge organization and information retrieval, information needs beyond science, and this type of research in the context of a unified science of cultural evolution.

    Retrieval effectiveness for OCR text using thesauri

    This thesis reports on the effects of automatic query expansion with a subject-specific thesaurus on retrieval effectiveness for a document collection consisting of OCR text. The investigation encompasses several experiments with a modern retrieval engine based on the probabilistic model. Each experiment is performed on two versions of the document collection. The first version consists of raw OCR output. The second consists of the ground-truth version of the same collection (retyped from hard copy). It is shown that using the thesaurus as a source for query expansion can significantly improve recall for Boolean queries, for both the OCR and the manually corrected document collections. In the case of weighted queries, the expansion has no effect on average precision and recall. Nevertheless, some individual queries benefit from query expansion.
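    A toy sketch can show why thesaurus expansion tends to raise recall for Boolean queries. It does not reproduce the thesis's retrieval engine or thesaurus: the function names, the dictionary-based thesaurus, and the AND-of-OR-groups query form are assumptions made for illustration. Each query term is widened into an OR-group of itself and its thesaurus entries, so a document that uses a related term instead of the original one can still satisfy the conjunction.

```python
def expand_boolean_query(terms, thesaurus):
    """Turn each query term into an OR-group: the term plus its thesaurus entries.

    terms: list of query words (implicitly ANDed together).
    thesaurus: hypothetical dict mapping a term to a list of related terms.
    """
    groups = []
    for term in terms:
        key = term.lower()
        group = [key]
        for related in thesaurus.get(key, []):
            if related not in group:
                group.append(related)
        groups.append(group)
    return groups


def matches(groups, doc_tokens):
    """A document matches when every OR-group has at least one term present."""
    tokens = set(doc_tokens)
    return all(any(term in tokens for term in group) for group in groups)


def retrieve(groups, docs):
    """Return the ids of all documents that satisfy the Boolean query."""
    return {doc_id for doc_id, tokens in docs.items() if matches(groups, tokens)}


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0
```

    Passing an empty thesaurus gives the unexpanded baseline, so the same functions can compare recall before and after expansion on a small test collection.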

    DARIAH and the Benelux


    Our Digital Legacy: an Archival Perspective

    Our digital memories are threatened by archival hubris, technical misdirection, and simplistic application of rules to protect privacy rights. The obsession with the technical challenge of digital preservation has blinded much of the archival community to the challenges that the digital transition creates for the other core principles of archival science, namely appraisal (what to keep), sensitivity review (identifying material that cannot yet be disclosed for ethical or legal reasons) and access. The essay will draw on the considerations of appraisal and sensitivity review to project a vision of some aspects of access to the Digital Archive. This essay will argue that only by careful scrutiny of these three challenges and the introduction of appropriate practices and procedures will it be possible to prevent the precautionary closure of digital memories for long periods or, worse still, their destruction. We must ensure that our digital memories can be captured, kept, and recalled, and that they remain faithful to the events and circumstances that created them.

    Novel approaches to applied cybersecurity in privacy, encryption, security systems, web credentials, and education

    Applied cybersecurity is a domain that interconnects people, processes, technologies, usage environments and vulnerabilities in a complex manner. As a cybersecurity expert at CTI Renato Archer, a research institute of the Brazilian Ministry of Science, Technology and Innovations, the author developed novel approaches to help solve practical and practice-based problems in applied cybersecurity over the last ten years. The needs of government, industry and customers, and real-life problems in five categories (Privacy, Encryption, Web Credentials, Security Systems and Education), were the research stimuli. Based on prior outputs, this thesis presents a cohesive narrative of the novel approaches in these categories, consolidating fifteen research publications. Customers and society in general expect that companies, universities, and the government will protect them from cyber threats. The fifteen research papers that compose this thesis elucidate a broader context of cyber threats, errors in security software and gaps in cybersecurity education. The research points out that a large number of organisations are vulnerable to cyber threats and that procedures and practices around cybersecurity are questionable. Therefore, society expects a periodic reassessment of cybersecurity systems, practices and policies. Privacy has been extensively debated in many countries because of its personal implications, with civil liberties and citizenship at stake. Since 2018, GDPR has been in force in the EU and has been a milestone for the privacy of people and institutions. The novel work in privacy, supported by four research papers, discusses private-mode navigation in several browsers and shows how privacy is a fragile feeling. The secrets of companies, countries and armed forces are entrusted to encryption technologies. Three research papers support the encryption element discussed in this thesis. It explores vulnerabilities in the most used encryption software. It provides data exposure scenarios showing how companies, government and universities are vulnerable and proposes best practices. Credentials are data that give someone the right to access a location or a system. They usually involve a login, a username, an email address, an access code and a password. Rigorous security requirements for credentials are customary in sensitive information systems. The work on web credentials in this thesis, supported by one research paper, examines a novel experiment that permits an intruder to extract user credentials from home banking and e-commerce websites, revealing common cyber flaws and vulnerabilities. Antimalware systems are complex software engineering systems purposely designed to be safe and reliable despite numerous operational idiosyncrasies. Antimalware systems have been deployed to protect information systems for decades. The novel work on security systems presented in the thesis, supported by five research papers, explores antimalware attacks and software engineering structure problems. Primary awareness of cybersecurity is expected to come through school and university education, but the academic discourse is often dissociated from practice. The discussion, based on two research papers, presents a new insight into cybersecurity education and proposes an Index of Relevance in Cybersecurity (IRCS) to classify the computer science courses offered in UK universities by the relevance of cybersecurity in their curricula. In a nutshell, the thesis presents a coherent and novel narrative of applied cybersecurity in five categories spanning software, systems, and education.

    A design for a practical soil bed disposal system for pesticide contaminated wastewater

    Pesticide rinsate disposal is an ongoing problem for today's farmers. The legal and environmental repercussions resulting from mismanagement of pesticide wastewater can be great. A simple, inexpensive, and legal means of dealing with this issue would be of benefit to many producers. The goal of the current study is to develop a facility design that incorporates a mixing and loading facility with a practical, effective pesticide wastewater disposal system. The first component in any comprehensive pesticide wastewater management plan is to minimize production of rinsate. Proper calibration, in-field rinse systems, and rinsate recycling can greatly reduce the amount of rinsate created during chemical application procedures. Unfortunately, some rinsate production is inevitable. Containment and collection of this waste is the next step in the safe handling of rinsate. Pertinent research with regard to chemical containment and storage was reviewed in detail. A rinsepad structure was chosen as the most effective and economical means for rinsate containment. Having decided on a containment technology, a number of disposal options were considered and criteria were developed for choosing an appropriate system. The most effective and practical disposal option was the Soil Bed Bioreactor System (SBBR) developed by researchers at The University of Tennessee. Having chosen the mechanism for containment and disposal, a facility design was developed that integrated both of these functions. The Plant Science Unit at the Knoxville Experiment Station was used as a case study for this investigation, but the basic design elements of this facility can be applied to any operation in which rinsate is produced including golf courses, nurseries, and lawn care companies. A rinsepad structure was designed for use during the loading and rinsing of spray equipment. A full-scale SBBR system was designed to dispose of all pesticide wastewater generated at the Plant Science Unit. All pertinent regulations were investigated and complied with. Environmental protection was a major concern in the development of a design. Finally, the practicality of the system and the possibility of use by producers as well as other agricultural end users such as nurseries and landscaping companies were considered.

    Towards understanding and mitigating attacks leveraging zero-day exploits

    Zero-day vulnerabilities are unknown and therefore not addressed, with the result that they can be exploited by attackers to gain unauthorised system access. In order to understand and mitigate attacks leveraging zero-days or unknown techniques, it is necessary to study the vulnerabilities, exploits and attacks that make use of them. In recent years there have been a number of leaks publishing such attacks, which use various methods to exploit vulnerabilities. This research seeks to understand what types of vulnerabilities exist, why and how these are exploited, and how to defend against such attacks by mitigating either the vulnerabilities or the method or process of exploiting them. By moving beyond merely remedying the vulnerabilities to defences that are able to prevent or detect the actions taken by attackers, the security of the information system will be better positioned to deal with future unknown threats. An interesting finding is how attackers move beyond the observable bounds to circumvent security defences, for example by compromising syslog servers or going down to lower system rings to gain access. However, defenders can counter this by employing defences that are external to the system, preventing attackers from disabling them or removing collected evidence after gaining system access. Attackers are able to defeat air-gaps via the leakage of electromagnetic radiation, and to misdirect attribution by planting false artefacts for forensic analysis and attacking from third-party information systems. They analyse the methods of other attackers to learn new techniques. An example of this is the Umbrage project, whereby malware is analysed to decide whether it should be implemented as a proof of concept. Another important finding is that attackers respect defence mechanisms such as remote syslog (e.g. firewall), core dump files, database auditing, and Tripwire (e.g. SlyHeretic). These defences all have the potential to result in the attacker being discovered. Attackers must either negate the defence mechanism or find unprotected targets. Defenders can use technologies such as encryption to defend against interception and man-in-the-middle attacks. They can also employ honeytokens and honeypots to alarm, misdirect, slow down and learn from attackers. By employing various tactics, defenders are able to increase their chance of detecting attacks and the time they have to react to them, even for attacks exploiting hitherto unknown vulnerabilities. To summarize the information presented in this thesis and to show its practical importance, an examination is presented of the NSA's network intrusion of the SWIFT organisation. It shows that the firewalls were exploited with remote code execution zero-days. This attack has a striking parallel in the approach used in the recent VPNFilter malware. If nothing else, the leaks provide information to other actors on how to attack and what to avoid. However, by studying state actors, we can gain insight into what other actors with fewer resources can do in the future.