    Automatic Validation of Student Grade Transcript Documents Using the Optical Character Recognition Method

    At the Ambon State Polytechnic, students' semester grade reports are still typed manually. This causes frequent typographical errors that can invalidate the document, including incorrect grades, student identification numbers, and many other label values. A Java application has been implemented to detect these errors. It is primarily intended for the Head of Study Program and Head of Department officials before they sign and validate a report; the officials who legalize the reports are greatly assisted because tedious validation work can be delegated to a computer. The validation process uses optical character recognition from the open-source Tesseract-OCR library. Experimental results show that verification improves when OCR is applied only to specific regions of interest (ROIs) located with OpenCV's template matching, and that comparing label values against the reference database using the Levenshtein distance further increases the algorithm's success rate. The approach was tested on about 800 grade report documents, with a successful verification rate above 90%.
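
    A minimal sketch of the validation step described above, assuming pytesseract, OpenCV template matching, and a plain-Python Levenshtein distance; the file names, the value-field width, the reference value, and the edit-distance tolerance are illustrative assumptions, not the paper's code.

```python
# Locate a field label by template matching (OpenCV), OCR only that region of
# interest (Tesseract), then compare the result to a reference database value
# using the Levenshtein distance.
import cv2
import pytesseract

def levenshtein(a: str, b: str) -> int:
    """Plain-Python edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

page = cv2.imread("transcript.png", cv2.IMREAD_GRAYSCALE)        # hypothetical scan
template = cv2.imread("label_template.png", cv2.IMREAD_GRAYSCALE)  # hypothetical label crop

# Template matching to find the label, then OCR the value field to its right.
res = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
_, _, _, (x, y) = cv2.minMaxLoc(res)
h, w = template.shape
roi = page[y:y + h, x + w:x + w + 300]            # assumed 300 px wide value field
value = pytesseract.image_to_string(roi).strip()

reference = "199901234"                           # hypothetical reference database value
ok = levenshtein(value, reference) <= 2           # small tolerance for OCR noise
print("valid" if ok else "mismatch", value)
```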

    Canary in Twitter Mine: Collecting Phishing Reports from Experts and Non-experts

    The rise in phishing attacks via e-mail and short message service (SMS) has not slowed down at all. The first step in combating the ever-increasing number of phishing attacks is to collect and characterize more of the phishing cases that reach end users; without understanding these characteristics, anti-phishing countermeasures cannot evolve. In this study, we propose an approach that uses Twitter as a new observation point to immediately collect and characterize phishing cases, delivered via e-mail and SMS, that evade countermeasures and reach users. Specifically, we propose CrowdCanary, a system capable of structurally and accurately extracting phishing information (e.g., URLs and domains) from tweets posted by users who have actually discovered or encountered phishing. In three months of live operation, CrowdCanary identified 35,432 phishing URLs out of 38,935 phishing reports. We confirmed that 31,960 (90.2%) of these phishing URLs were later detected by an anti-virus engine, demonstrating that CrowdCanary is superior to existing systems in both accuracy and volume of threat extraction. We also analyzed the users who shared phishing threats, using the extracted phishing URLs, and categorized them into two distinct groups, experts and non-experts. We found that CrowdCanary can collect information that appears specifically in non-expert reports, such as attacks identified only by the targeted company's brand name in the tweet, attacks visible only in the tweet's image, and information about the landing page before the redirect.
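
    This is not the CrowdCanary implementation (which the abstract does not show), but a minimal sketch of the basic extraction step such a system automates: normalising commonly "defanged" notations in phishing-report tweets and pulling out URLs and domains. The defanging rules and the regular expression are illustrative assumptions.

```python
# Extract URLs and domains from phishing-report tweets, after undoing common
# defanging conventions such as "hxxp" and "[.]".
import re

DEFANG = [("hxxp", "http"), ("[.]", "."), ("[:]", ":")]
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_urls(tweet: str) -> list[str]:
    text = tweet
    for bad, good in DEFANG:
        text = text.replace(bad, good)
    return URL_RE.findall(text)

def domain_of(url: str) -> str:
    return re.sub(r"^https?://", "", url, flags=re.IGNORECASE).split("/")[0]

tweet = "Phishing alert: hxxps://login-example[.]com/verify impersonating a bank"
urls = extract_urls(tweet)
print(urls, [domain_of(u) for u in urls])
```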

    Processing Pre-Existing Connect-The-Dots Puzzles For Educational Repurposing Applications

    Connect-the-Dots puzzles contain labeled dots in a sequence. They are mostly designed as a fun way for children to hone their counting skills. The same puzzles, which are available online in abundance, can with modification be used to aid students in other areas of education such as spelling; research shows that the addition of visual imagery significantly improves spelling performance. The objective of this research is to develop an algorithm for processing Connect-the-Dots puzzles that replaces the original numbers in a puzzle with characters that facilitate an alternative educational purpose. In particular, the use of Optical Character Recognition (OCR) and image processing algorithms to process pre-existing Connect-the-Dots puzzles is explored. An algorithm was developed to locate and identify the numbers in the puzzles. The system comprises five components: an Image Preprocessing component, a Dot Locator component, a Number Locator component, a Number Recognition component, and a Post-Processing component. To test the accuracy of the algorithm, an experiment was conducted using 20 hand-selected puzzles from an online source. Accuracy was evaluated component by component, as well as overall, by visually capturing the make-up of the puzzles and comparing them to the results generated by the algorithm. Results show that the algorithm performed at an overall accuracy rate of 66%; the Dot Locator component performed at 100%, the Number Locator at 86%, and the Number Recognition at 76%. This research will aid in the development of an application that may provide educational benefits to children who are exposed to using technology for learning at a young age.
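
    A minimal sketch of the five-component pipeline the abstract describes (preprocessing, dot location, number location, number recognition, post-processing), assuming OpenCV and Tesseract; the blob-size thresholds, crop window, and file name are illustrative assumptions rather than the thesis's actual parameters.

```python
# Preprocess the puzzle image, locate the dots as small blobs, OCR the digit
# labels next to each dot, and order the dots by their recognized numbers.
import cv2
import pytesseract

img = cv2.imread("puzzle.png", cv2.IMREAD_GRAYSCALE)            # hypothetical file
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dot Locator: small, roughly dot-sized connected components.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
dots = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if 3 <= w <= 15 and 3 <= h <= 15:                           # assumed dot size range
        dots.append((x + w // 2, y + h // 2))

# Number Locator + Number Recognition: OCR a small window around each dot.
labels = {}
for cx, cy in dots:
    roi = img[max(cy - 20, 0):cy + 20, max(cx - 40, 0):cx + 40]
    text = pytesseract.image_to_string(
        roi, config="--psm 7 -c tessedit_char_whitelist=0123456789").strip()
    if text.isdigit():
        labels[int(text)] = (cx, cy)

# Post-Processing: order the dots by their recognized label.
sequence = [labels[k] for k in sorted(labels)]
print(len(dots), "dots located,", len(sequence), "labels recognized")
```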

    Experimental Approach Based on Ensemble and Frequent Itemsets Mining for Image Spam Filtering

    Excessive amounts of image spam cause many problems for e-mail users. Since image spam is difficult to detect using conventional text-based spam approaches, various image processing techniques have been proposed. In this paper, we present an ensemble method using frequent itemset mining (FIM) for filtering image spam. Although FIM techniques are well established in data mining, they are not commonly used in ensemble methods. To obtain good filtering performance, the SIFT descriptor is used, since it is widely known as an effective image descriptor. K-means clustering is applied to the SIFT keypoints to produce a visual codebook, and a bag-of-words (BOW) feature vector for each image is generated using a hard bag-of-features (HBOF) approach. FIM descriptors are obtained from the frequent itemsets of the BOW feature vectors. We combine BOW and FIM with three other feature selections, namely Information Gain (IG), Symmetrical Uncertainty (SU), and Chi-Square (CS), together with a Spatial Pyramid, in an ensemble method. We performed experiments on the Dredze and SpamArchive datasets. The results show that our ensemble using frequent itemset mining significantly outperforms the traditional BOW approach and a naive approach that concatenates all descriptors directly into a single, very large input vector.
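
    A minimal sketch of the bag-of-visual-words stage described above: SIFT keypoints, a k-means visual codebook, and hard-assignment (HBOF) histograms per image. The FIM, feature-selection, and ensemble stages are not shown; the file names and codebook size are illustrative assumptions.

```python
# Build a visual codebook from SIFT descriptors with k-means, then represent
# each image as a normalized histogram of its nearest visual words.
import cv2
import numpy as np
from sklearn.cluster import KMeans

paths = ["spam1.png", "spam2.png"]                 # hypothetical training images
sift = cv2.SIFT_create()

per_image = []
for p in paths:
    gray = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)
    per_image.append(desc)

all_desc = np.vstack(per_image)
k = 200                                            # codebook size (assumed)
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

# Hard bag-of-features: histogram of nearest visual words for each image.
bow = []
for desc in per_image:
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=k).astype(float)
    bow.append(hist / hist.sum())
print(np.array(bow).shape)                         # (n_images, k)
```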

    Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents

    The presence of huge volumes of unstructured data in the form of PDF documents poses a challenge to organizations trying to extract valuable information from it. In this thesis, we address this problem, as required by DNB, by building an automatic information extraction system that extracts only the key information the company is interested in from PDF documents. This is achieved by comparing the performance of named entity recognition (NER) models for automatic text extraction, built using a Bi-directional Long Short-Term Memory (Bi-LSTM) network with a Conditional Random Field (CRF), in combination with three word vectorisation techniques. The word vectorisation techniques compared in this thesis are randomly generated word embeddings from the Keras embedding layer, pre-trained static word embeddings (100-dimensional GloVe embeddings), and deep contextual ELMo word embeddings. Comparing these models helps us identify the advantages and disadvantages of different word embeddings by analysing their effect on NER performance. The study was performed on a dataset provided by DNB. The comparison showed that the NER system built using Bi-LSTM-CRF with GloVe embeddings gave the best results, with a micro-F1 score of 0.868 and a macro-F1 score of 0.872 on unseen data, compared with Bi-LSTM-CRF NER systems using the Keras embedding layer and ELMo embeddings, which gave micro-F1 scores of 0.858 and 0.796 and macro-F1 scores of 0.848 and 0.776, respectively. This result is contrary to our assumption that NER using deep contextualised word embeddings would outperform NER using other word embeddings. We proposed that this contradictory performance is due to high dimensionality and analysed it by using a lower-dimensional word embedding: replacing the 100-dimensional GloVe embeddings with 50-dimensional GloVe embeddings improved the overall micro and macro F1 scores from 0.87 to 0.88. Additionally, optimising the best model, the Bi-LSTM-CRF with 100-dimensional GloVe embeddings, by tuning within a small hyperparameter search space did not improve on the present micro-F1 score of 0.87 and macro-F1 score of 0.87.
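
    A minimal sketch of one of the compared configurations: a Bi-LSTM tagger whose Keras embedding layer is initialised with pre-trained GloVe vectors. To keep the sketch self-contained, the CRF output layer is replaced by a per-token softmax, and the toy vocabulary, tag count, and GloVe file path are illustrative assumptions rather than the thesis's setup.

```python
# Bi-LSTM sequence tagger with a GloVe-initialised embedding layer (CRF omitted).
import numpy as np
from tensorflow.keras import initializers, layers, models

vocab = {"<pad>": 0, "the": 1, "bank": 2}          # toy vocabulary (assumed)
num_tags, max_len, dim = 5, 50, 100

# Load pre-trained GloVe vectors into the embedding matrix; words not found
# keep a random initialisation.
emb = np.random.normal(size=(len(vocab), dim)).astype("float32")
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if parts[0] in vocab:
            emb[vocab[parts[0]]] = np.asarray(parts[1:], dtype="float32")

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(len(vocab), dim,
                     embeddings_initializer=initializers.Constant(emb),
                     mask_zero=True),
    layers.Bidirectional(layers.LSTM(100, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```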

    Holistic recommender systems for software engineering

    The knowledge possessed by developers is often not sufficient to overcome a programming problem. Short of talking to teammates, when available, developers often gather additional knowledge from development artifacts (e.g., project documentation), as well as online resources. The web has become an essential component in the modern developer's daily life, providing a plethora of information from sources like forums, tutorials, Q&A websites, API documentation, and even video tutorials. Recommender Systems for Software Engineering (RSSEs) assist developers in navigating the information space, automatically suggest useful items, and reduce the time required to locate the needed information. Current RSSEs treat development artifacts as containers of homogeneous information in the form of pure text. However, text is a means of representing heterogeneous information such as natural language, source code, interchange formats (e.g., XML, JSON), and stack traces. Interpreting the information from a purely textual point of view misses the intrinsic heterogeneity of the artifacts, leading to a reductionist approach. We propose the concept of Holistic Recommender Systems for Software Engineering (H-RSSE), i.e., RSSEs that go beyond the textual interpretation of the information contained in development artifacts. Our thesis is that modeling and aggregating information in a holistic fashion enables novel and advanced analyses of development artifacts. To validate our thesis, we developed a framework to extract, model, and analyze information contained in development artifacts in a reusable meta-information model. We show how RSSEs benefit from a meta-information model, since it enables customized and novel analyses built on top of our framework. The information can thus be reinterpreted from a holistic point of view, preserving its multi-dimensionality and opening the path towards holistic recommender systems for software engineering.

    PROACTIVE BIOMETRIC-ENABLED FORENSIC IMPRINTING SYSTEM

    Insider threats are a significant security issue. The last decade has witnessed countless instances of data loss and exposure in which leaked data have become publicly available and easily accessible. Losing or disclosing sensitive data or confidential information may cause substantial financial and reputational damage to a company, so preventing or responding to such incidents has become a challenging task. Whilst more recent research has focused explicitly on the problem of insider misuse, it has tended to concentrate on the information itself, either through its protection or through approaches to detecting leakage. Although digital forensics has become a de facto standard in the investigation of criminal activities, a fundamental problem is the inability to associate a specific person with particular electronic evidence, especially when stolen credentials and the Trojan defence are two commonly cited arguments. There is therefore an urgent requirement for a more innovative and robust technique that can more inextricably link the use of information (e.g., images and documents) to the users who access and use it. This research project investigates the role that transparent and multimodal biometrics could play in providing this link, leveraging individuals' biometric information for the attribution of insider misuse. This thesis examines the existing literature in the domain of data loss prevention, detection, and proactive digital forensics, including traceability techniques, with the aim of advancing the current state of the art: a gap was identified in the literature, which this research investigates and for which it proposes a possible solution. Although most of the existing methods and tools used by investigators to examine digital crime help significantly in collecting, analysing, and presenting digital evidence, it is essential that investigators establish a link between the notable/stolen digital object and the identity of the individual who used it, as opposed to merely relying on an electronic record or log indicating that the user interacted with the object in question (the evidence). The proposed approach therefore provides a novel technique for capturing individuals' biometric identifiers/signals (e.g., face or keystroke dynamics) and embedding them into the digital objects users are interacting with. This is achieved in one of two modes: centralised or decentralised. The centralised approach stores the mapped information alongside digital object identifiers in a central storage repository; the decentralised approach avoids the need for centralised storage by embedding all the necessary information within the digital object itself. Moreover, no explicit biometric information is stored; only the correlation that points to the relevant locations within the imprinted object is preserved. Comprehensive experiments show that it is highly possible to establish this correlation even when the original version of the examined object has undergone significant modification. In many scenarios, such as changing or removing part of an image or document, including words and sentences, it was possible to extract and reconstruct the correlated biometric information from a modified object with a high success rate.
A reconstruction of the feature vector from unmodified images was possible using the generated imprints with 100% accuracy, achieved simply by reversing the imprinting process. Under a modification attack, in which the imprinted object is manipulated, at least one imprinted feature vector was successfully retrieved from an average of 97 out of 100 images, even when the modification percentage was as high as 80%. For the decentralised approach, initial experimental results showed that the embedded biometric signals could be retrieved successfully even when 75% of the original file (i.e., image) had been modified. The research has proposed and validated a number of approaches to embedding biometric data within digital objects to enable successful user attribution of information leakage attacks.
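
    The thesis's actual imprinting scheme is not detailed in the abstract. As an illustrative stand-in only, the sketch below embeds a short feature vector into an image's least-significant bits and recovers it by reversing the process, mirroring the idea of exact reconstruction from an unmodified image; the vector length and 8-bit quantisation are assumptions.

```python
# Embed a feature vector into an image's least-significant bits and recover it
# by reversing the embedding (illustrative stand-in, not the thesis's scheme).
import numpy as np

def embed(image: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Write each feature byte, bit by bit, into the LSBs of the flattened image."""
    bits = np.unpackbits(features.astype(np.uint8))
    flat = image.flatten().copy()
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
    return flat.reshape(image.shape)

def extract(image: np.ndarray, n_features: int) -> np.ndarray:
    """Reverse the embedding: read LSBs back into feature bytes."""
    flat = image.flatten()
    bits = flat[:n_features * 8] & 1
    return np.packbits(bits.astype(np.uint8))

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)    # stand-in image
vec = np.random.randint(0, 256, 16, dtype=np.uint8)          # stand-in feature vector
stego = embed(img, vec)
assert np.array_equal(extract(stego, 16), vec)               # exact recovery when unmodified
```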

    Geographic information extraction from texts

    A large volume of unstructured text containing valuable geographic information is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop will therefore provide a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.