    Role of images on World Wide Web readability

    The growth of the Internet and the World Wide Web has brought many benefits. Anyone with access to a computer can find large amounts of information quickly and easily; electronic devices can store and retrieve vast amounts of data in seconds; products and services that once required a trip in person can now be obtained without leaving home; and documents can be converted almost instantly, from English to Urdu or from text to speech, easing communication between people of different cultures and abilities. As technology improves, web developers and website visitors expect more animation, colour, and interactivity, and as computers become faster at processing images and other graphics, web developers use them more and more. For users who can see them, colour, pictures, animation, and images can support understanding and reading of the Web and improve the overall experience; images can also benefit people who have trouble reading or whose first language is not the one used on the website. Not all images, however, help people understand and read the text they accompany: purely decorative images, or images chosen arbitrarily by the site's creators, should be avoided. Several other factors can reduce the readability of graphical content, such as low image resolution, a poor aspect ratio, a poor colour combination within the image, or a small font size, and the Web Content Accessibility Guidelines (WCAG) give specific rules for each of these problems, recommending alternative text, appropriate colour combinations, sufficient contrast, and higher resolution. One of the biggest remaining problems is that images unrelated to the text of a web page can make that text harder to read, whereas relevant images can make the page easier to read. This thesis proposes a method to determine, from the point of view of web readability, how relevant the images on a website are. The method combines several ways of extracting information from images, using the Cloud Vision API and Optical Character Recognition (OCR), with the text extracted from the website itself in order to measure the relevancy between them. Data-preprocessing techniques are applied to the extracted information, and a Natural Language Processing (NLP) technique is used to determine how the images and the text on a web page relate to each other. The tool was applied to the images of fifty educational websites to assess their relevance; the results show that images unrelated to a page's content, and images of poor quality, produce lower relevancy scores. A user study evaluated the hypothesis that relevant images can enhance web readability through two assessments: an evaluation by 1,024 end users of the page and a heuristic evaluation by 32 accessibility experts, with questions covering what users know, how they feel, and what they can do. The results support the idea that images relevant to the page make it easier to read. This method should help web designers make pages easier to read by analysing only the essential parts of a page rather than relying on their own judgment.
    Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. Committee: President: José Luis López Cuadrado; Secretary: Divakar Yadav; Member: Arti Jai
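    The heart of the pipeline described above, comparing what is extracted from an image (labels and OCR text) against the page text, can be sketched minimally as follows. TF-IDF cosine similarity is an illustrative choice of NLP relevancy measure, not necessarily the exact scoring model of the thesis, and the relevancy_score function and sample inputs are hypothetical.

    # Minimal sketch: score image-to-page relevancy by comparing an image's
    # textual descriptors (e.g. Cloud Vision labels plus OCR output) with the
    # page text, using TF-IDF cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def relevancy_score(image_descriptors, page_text):
        """Return a 0..1 similarity between an image's textual descriptors
        (labels + OCR'd text) and the text of the surrounding page."""
        image_doc = " ".join(image_descriptors)
        tfidf = TfidfVectorizer(stop_words="english").fit_transform([image_doc, page_text])
        return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

    # Hypothetical usage: labels from the Cloud Vision API, text from OCR.
    labels = ["whiteboard", "lecture", "classroom", "teacher"]
    page = "This page introduces classroom lecture strategies for new teachers."
    print(relevancy_score(labels, page))  # closer to 1.0 = more relevant image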

    Offline printed Arabic character recognition

    Optical Character Recognition (OCR) shows great potential for rapid data entry but has had limited success when applied to the Arabic language. The usual OCR problems are compounded by the right-to-left nature of Arabic and by its largely connected script. This research investigates current approaches to the Arabic character recognition problem and introduces a new one. The main work involves a Haar-Cascade Classifier (HCC) approach, adapted here for the first time to Arabic character recognition. This technique eliminates the problematic steps in the pre-processing and recognition phases, in addition to the character segmentation stage. A classifier was produced for each of the 61 Arabic glyphs that remain after the removal of diacritical marks; these 61 classifiers were trained and tested on an average of about 2,000 images each. A Multi-Modal Arabic Corpus (MMAC) has also been developed to support this work. MMAC makes innovative use of the new concept of connected segments of Arabic words (PAWs), with and without diacritical marks; these new tokens have significance for linguistic as well as OCR research and applications, and are applied here in the post-processing phase. A complete Arabic OCR application has been developed to process the scanned images and extract a list of detected words. It consists of the HCC to extract glyphs, systems for parsing and correcting those glyphs, and the MMAC to apply linguistic constraints. The HCC achieves a recognition rate of 87% for Arabic glyphs. MMAC is based on 6 million words, is published on the web, and has been applied and validated in both research and commercial use.
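    The abstract does not include code, but the per-glyph detection step could look something like the sketch below, which uses OpenCV's standard Haar-cascade API. The cascade file layout and parameter values are assumptions, and training the 61 per-glyph cascades is taken as already done.

    # Minimal sketch: run one trained Haar cascade per Arabic glyph over a
    # scanned image and collect (glyph, bounding box) candidates for the
    # parsing/correction and MMAC post-processing stages described above.
    import cv2

    def detect_glyphs(image_path, cascade_paths):
        """cascade_paths maps a glyph label to a trained cascade XML file."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        hits = []
        for glyph, path in cascade_paths.items():
            cascade = cv2.CascadeClassifier(path)
            boxes = cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=4)
            hits.extend((glyph, tuple(box)) for box in boxes)
        # Order candidates right-to-left to follow Arabic reading direction.
        return sorted(hits, key=lambda hit: -hit[1][0])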

    Biometrics Writer Recognition for Arabic language: Analysis and Classification techniques using Subwords Features

    Handwritten text in any language is believed to convey a great deal of information about the writer's personality and identity. Indeed, the handwritten signature has long been accepted as an authentication of the writer's physical stamp on financial and legal deals as well as on official/personal documents and works of art, and handwritten documents are frequently used as evidence in forensic tasks. Handwriting skill is learnt and developed from the early schooling stages. Research interest in behavioural biometrics was the main driving force behind the growth in research into Writer Identification (WI) from handwritten text, but the recent rise in terrorism associated with extreme religious ideologies spreading primarily, but not exclusively, from the Middle East has led to a surge of interest in WI from handwritten text in Arabic and similar languages. This thesis is the main outcome of extensive research investigations conducted with the aim of developing automatic identification of a person from handwritten Arabic text samples. My motivations and interests, as an Iraqi researcher, emanate from my multi-faceted desire to provide scientific support for my people in their fight against terrorism by providing forensic evidence, and to contribute to the ongoing digitisation of the Iraqi National Archive as well as the wealth of religious and historical archives in Iraq and the Middle East. Good knowledge of the underlying language is invaluable in this project. Despite the rising interest in this recognition modality worldwide, Arabic writer identification has not been addressed as extensively as Latin writer identification, although in recent years some new Arabic writer identification approaches have been proposed, some of which are reviewed in this thesis. Arabic is a cursive language when handwritten, which means that every writer of the language develops unique features that demonstrate the writer's habits and style; these habits and styles are considered unique WI features and determining factors. The dominant existing approaches to WI are based on the premise that handwriting habits/styles are embedded in certain parts/components of the written text. Although the appearance of these components within long text contains rich information and clues to writer identity, the most common approaches to Arabic WI in the literature are based on features extracted from paragraph(s), line(s), word(s), character(s), and/or parts of a character. Generally, Arabic words are made up of one or more subwords, at the end of each of which there is a connected stroke with a certain style, and these strokes seem to be most representative of a writer's habits. Another feature of Arabic writing is its diacritics, which are added to written words/subwords to convey meaning and pronunciation. Subwords are more frequent in written Arabic text and appear as parts of several different words or as full individual words. We therefore propose a new approach based on the plausible hypothesis that subword-based WI yields a significant increase in accuracy over existing approaches. The thesis's most significant contributions can be summarized as follows:
    - Developed a high-performing segmentation of scanned text images that combines threshold-based binarisation, morphological operations and an active shape model (a sketch of this front end follows the list below).
    - Defined digital measures and formed 15-dimensional feature-vector representations of subwords that implicitly cover their diacritics and strokes.
    - Conducted a pilot study that incrementally added features according to their writer-discriminating power, reducing the subword feature-vector dimension to 8, two of which were modelled as time series.
    - For the text-dependent 8-dimensional WI scheme, identified the best-performing set of subwords (the best 22 subwords out of 49, then the best 11 out of those 22).
    - Established the validity of our hypothesis for different versions of subword-based WI schemes by providing empirical evidence when testing on a number of existing text-dependent and text-independent databases plus a simulated text-independent DB. The text-dependent results exhibited the possible presence of the Doddington Zoo phenomenon.
    - The final optimal subword-based WI scheme not only removes the need to include diacritics as part of the subword but also demonstrates that including diacritics within subwords impairs the WI discriminating power of subwords; this should not be taken to discredit research based on diacritics for WI. This subword-body (without diacritics) WI scheme also eliminated the presence of the Doddington Zoo effect.
    - Finally, a significant but unintended consequence of using subwords for WI is that there is no difference between a text-independent scenario and a text-dependent one. In fact, we demonstrate that the text-dependent database of 27 words can be used to simulate testing the scheme on a text-independent database without the need to record such a DB.
    Finally, we discuss ways of optimising the performance of our last scheme by complementing it with various image-texture analysis features extracted from subwords, lines, paragraphs or the entire scanned image, including Local Binary Patterns (LBP) and Gabor filters, and we suggest the possible addition of a few more features.
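    As referenced in the first contribution above, the segmentation front end (threshold-based binarisation plus a morphological operation) can be sketched minimally in OpenCV. The active-shape-model stage and the 15 subword features are thesis-specific and omitted here, and all parameter values are illustrative assumptions.

    # Minimal sketch: binarise a scanned handwriting image and return bounding
    # boxes of connected ink components as subword candidates.
    import cv2

    def extract_subword_regions(image_path, min_area=20):
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # Otsu's method chooses the threshold automatically; ink becomes white.
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Closing bridges small gaps so a subword's strokes form one component.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
        n, _, stats, _ = cv2.connectedComponentsWithStats(closed)
        # Row i of stats is [x, y, w, h, area]; label 0 is the background.
        return [tuple(stats[i][:4]) for i in range(1, n)
                if stats[i][4] >= min_area]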

    BD 5 2022 Complete


    Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu

    University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages.
    Search is not a solved problem, even in the world of Google's and Bing's state-of-the-art engines. Google and similar search engines are keyword-based, and keyword-based searching suffers from the vocabulary mismatch problem: the terms in a document and in the user's information request do not overlap, as with "cars" and "automobiles". This phenomenon is called synonymy. Similarly, the user's term may be polysemous: a user inquiring about a river's bank is matched with documents about financial institutions. Vocabulary mismatch is exacerbated when the search occurs in a Morphologically Rich Language (MRL), and concept-search techniques such as dimensionality reduction do not improve search in MRLs. Names occur frequently in news text and determine the "what," "where," "when," and "who" of the news. Named Entity Recognition (NER) attempts to recognize names in text automatically, but these techniques are far from mature in MRLs, especially in Arabic-script languages. Urdu, alongside Arabic, Farsi, Hindi, and Russian, is the focus MRL of this dissertation, but it lacks the enabling technologies for NER and search. A corpus, a stop-word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, it highlights the challenges of research in low-resource MRLs.
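    The dissertation's ranking method is not reproduced here, but the idea of NER-aware search can be sketched minimally: query terms that an upstream NER tagger has marked as names in a document are weighted above plain keyword matches. The score function, its entity_boost weight, and the example data are all illustrative assumptions, not the dissertation's algorithm.

    # Minimal sketch: keyword-overlap scoring with a boost for query terms
    # that NER has marked as named entities in the document.
    def score(query_terms, doc_tokens, doc_entities, entity_boost=2.0):
        entity_set = {e.lower() for e in doc_entities}
        token_set = {t.lower() for t in doc_tokens}
        total = 0.0
        for term in (q.lower() for q in query_terms):
            if term in entity_set:
                total += entity_boost   # a "who/where" match counts for more
            elif term in token_set:
                total += 1.0            # an ordinary keyword match
        return total

    # Hypothetical usage: the entity list would come from the NER algorithm.
    print(score(["minnesota", "river", "bank"],
                ["the", "minnesota", "river", "bank", "flooded"],
                ["minnesota"]))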

    Recognizable units in Pashto language for OCR

    Atomic segmentation of cursive scripts into constituent characters is one of the most challenging problems in pattern recognition. To avoid segmentation of cursive script, concrete shapes are considered as recognizable units. The objective of this work is therefore to find alternative recognizable units in the Pashto cursive script; these alternatives are ligatures and primary ligatures. However, sound statistical analysis is needed to find the appropriate numbers of ligatures and primary ligatures in Pashto script. In this work, a corpus of 2,313,736 Pashto words is extracted from large-scale, diversified web sources, and a total of 19,268 unique ligatures are identified in Pashto cursive script. Analysis shows that only 7,000 ligatures cover 91% of the overall corpus of unique Pashto words. Similarly, about 7,681 primary ligatures, which represent the basic shapes of all the ligatures, are also identified.
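    The coverage statistic reported above (7,000 ligatures covering 91% of the corpus) is a cumulative-frequency computation that can be sketched as follows; ligatures_of stands in for the paper's (unspecified) word-to-ligature segmenter and is an assumption.

    # Minimal sketch: count ligature frequencies over a word corpus and ask
    # what fraction of all ligature occurrences the top-N ligatures cover.
    from collections import Counter

    def ligature_coverage(words, ligatures_of, top_n=7000):
        counts = Counter(lig for word in words for lig in ligatures_of(word))
        total = sum(counts.values())
        covered = sum(c for _, c in counts.most_common(top_n))
        return covered / total  # ~0.91 for the top 7,000 per this study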

    Translating Islamic Law: the postcolonial quest for minority representation

    This research sets out to investigate how culture-specific or signature concepts are rendered in English-language discourse on Islamic, or ‘shariʿa’, law, which has Arabic roots. A large body of literature has investigated Islamic law from a technical perspective. However, from the perspective of linguistics and translation studies, little attention has been paid to the lexicon that makes up this specialised discourse. Much of the commentary has so far been prescriptive, with limited empirical evidence. This thesis aims to bridge this gap by exploring how ‘culturalese’ (i.e., ostensive cultural discourse) travels through language, as evidenced in the self-built Islamic Law Corpus (ILC), a 9-million-word monolingual English corpus covering diverse genres on Islamic finance and family law. Using a mixed-methods design, the study first quantifies the different linguistic strategies used to render shariʿa-based concepts in English, in order to explore ‘translation’ norms based on linguistic frequency in the corpus. This quantitative analysis employs two models: profile-based correspondence analysis, which considers the probability of lexical variation in expressing a conceptual category, and logistic regression (using MATLAB programming software), which measures the influence of the explanatory variables ‘genre’, ‘legal function’ and ‘subject field’ on the choice between an Arabic loanword and an endogenous English lexeme, i.e., a close English equivalent. The findings are then interpreted qualitatively in the light of postcolonial translation agendas, which aim to preserve intangible cultural heritage and promote the representation of minoritised groups. The research finds that the English-language discourse on Islamic law is characterised by linguistic borrowing and glossing, implying an ideologically driven variety of English that can be usefully labelled as a kind of ‘Islamgish’ (blending ‘Islamic’ and ‘English’) aimed at retaining symbols of linguistic hybridity. The regression analysis confirms the influence of the above-mentioned contextual factors on the use of an Arabic loanword versus English alternatives.
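    The regression step described above, modelling a binary choice (Arabic loanword vs. a close English equivalent) from categorical predictors, can be illustrated outside MATLAB; the sketch below uses scikit-learn instead, and the feature values and coding are hypothetical, not drawn from the ILC.

    # Minimal sketch: logistic regression of loanword choice on the categorical
    # predictors 'genre', 'subject field' and 'legal function'.
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    X = [["contract", "finance", "obligation"],    # [genre, subject field, legal function]
         ["academic", "family_law", "definition"],
         ["sermon", "finance", "ruling"]]
    y = [1, 0, 1]                                  # 1 = Arabic loanword chosen

    model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                          LogisticRegression())
    model.fit(X, y)
    # Predicted probability that a new context takes the loanword.
    print(model.predict_proba([["contract", "finance", "ruling"]])[0, 1])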