178 research outputs found
An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach
This thesis brings together an overview of Text Readability (TR) and Text Simplification (TS) with an application of both to Modern Standard Arabic (MSA). It presents our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations for enhancing the TR and TS models.
Reading is one of the most vital tasks that provide language input for communication and comprehension skills. It has been shown that long sentences, connected sentences, embedded phrases, passive voice, non-standard word orders, and infrequent words can increase text difficulty for people with low literacy levels, as well as for second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. On the 3-way CEFR (Common European Framework of Reference for Languages proficiency levels) classification task, Arabic-BERT and XLM-R reach F-1 scores of 0.80 and 0.75, respectively, and the regression task reaches a Spearman correlation of 0.71. The binary difficulty classifier reaches an F-1 of 0.94, and the sentence-pair semantic similarity classifier reaches an F-1 of 0.98.
TS is an NLP task aiming to reduce the linguistic complexity of a text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented with two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F-1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieved P 0.97, R 0.97 and F-1 0.97.
To reiterate, this research demonstrated the effectiveness of a corpus-based method combined with the extraction of extensive linguistic features via the latest NLP techniques. It provided insights which can be of use in various Arabic corpus studies and NLP tasks such as translation for educational purposes.
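As a quick consistency check on the scores quoted above, F-1 can be computed as the harmonic mean of precision and recall. This is only a minimal sketch: the function name `f1_score` is chosen here for illustration, and BERTScore itself computes F per sentence pair before averaging, so the harmonic mean of the averaged P and R is an approximation of the reported figure rather than the thesis's own computation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R values as quoted in the abstract.
mt5_f1 = f1_score(0.72, 0.68)        # generative approach (mT5)
combined_f1 = f1_score(0.97, 0.97)   # Arabic-BERT combined with fastText

print(round(mt5_f1, 2), round(combined_f1, 2))
```

Rounded to two decimals, both values agree with the F-1 figures given in the abstract.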
Role of images on World Wide Web readability
As the Internet and World Wide Web have grown, many good things have come. If you have access to a computer, you can find a lot of information quickly and easily. Electronic devices can store and retrieve vast amounts of data in seconds. You no longer have to leave your house to get products and services that you could once only get in person. Documents can be changed from English to Urdu or from text to speech almost instantly, making it easy for people from different cultures and with different abilities to talk to each other. As technology improves, web developers and website visitors want more animation, colour, and technology. As computers get faster at processing images and other graphics, web developers use them more and more. Colour, pictures, animation, and images can help users understand and read the Web and can improve the Web experience. People who have trouble reading or whose first language is not used on the website can also benefit from pictures.
But not all images help people understand and read the text they go with. For example, images that are just for decoration or were picked by the people who made the website should not be used. Also, different factors could affect how easy it is to read graphical content, such as a low image resolution, a bad aspect ratio, a bad colour combination in the image itself, a small font size, etc., and the WCAG gives different rules for each of these problems. The rules suggest using alternative text, the right combination of colours, sufficient contrast, and a higher resolution. But one of the biggest problems is that images that don't go with the text on a web page can make it hard to read the text. On the other hand, relevant pictures could make the page easier to read.
A method has been suggested to figure out how relevant the images on websites are from the point of view of web readability. This method combines different ways of getting information from images, using the Cloud Vision API and Optical Character Recognition (OCR), with reading text from websites, in order to find the relevancy between them. Data preprocessing techniques have been applied to the extracted information. A Natural Language Processing (NLP) technique has been used to determine what the images and text on a web page have to do with each other. The tool looks at the pictures on fifty educational websites and assesses their relevance. Results show that images that have nothing to do with the page's content and images of poor quality cause lower relevancy scores. A user study was done to evaluate the hypothesis that relevant images could enhance web readability, based on two evaluations: the evaluation by 1024 end users of the page and the heuristic evaluation, which was done by 32 experts in accessibility. The user study was done with questions about what the user knows, how they feel, and what they can do. The results back up the idea that images that are relevant to the page make it easier to read. This method will help web designers make pages easier to read by looking at only the essential parts of a page and not relying on their own judgment.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. Chair: José Luis Lépez Cuadrado. Secretary: Divakar Yadav. Member: Arti Jai
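The abstract does not say which NLP technique scores image-to-text relevance. The sketch below is therefore only one plausible illustration, not the thesis's method: it assumes the image has already been reduced to a string of terms (for example labels and OCR text) and scores it against the page text with bag-of-words cosine similarity. The names `tokenize` and `cosine_relevance`, and the sample strings, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_relevance(image_terms: str, page_text: str) -> float:
    """Cosine similarity between term-count vectors of image terms and page text."""
    a = Counter(tokenize(image_terms))
    b = Counter(tokenize(page_text))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical page text and image descriptions, for illustration only.
page_text = "The water cycle describes evaporation condensation and rainfall"
relevant = cosine_relevance("water cycle diagram evaporation", page_text)
irrelevant = cosine_relevance("smiling people office", page_text)

print(relevant > irrelevant)
```

An image whose terms share no vocabulary with the page scores 0.0, which mirrors the reported finding that unrelated images lower the relevancy score.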
THE ROLE OF SHORT VOWELS AND CONTEXT IN THE READING OF ARABIC, COMPREHENSION AND WORD RECOGNITION OF HIGHLY SKILLED READERS
The purpose of this study was to investigate the role of short vowels in reading Arabic for skilled adult Arab readers. Previous studies claimed that the presence of short vowels (and diacritics) has a facilitative role in the reading of Arabic. That is, adding short vowels to the consonants facilitates the reading comprehension and reading accuracy of both children and skilled adult Arab readers. Further, those studies claimed that the absence of short vowels (and diacritics) and context makes reading Arabic impossible. But these studies did not manipulate the short vowels and diacritics to the degree that would isolate the effect of short vowels. Nor did they take into account the level of reading involved: text, sentence, and word. That is, on a text level, assessing the role of short vowels should take into account the text level in terms of word frequency; on a sentence level, the structure of the sentence (garden-path versus non-garden-path); and finally, on a word level, the type of word (homographic versus nonhomographic). Thus, the study described in the following pages was designed with three tasks to assess the role of short vowels in relation to each level: the text frequency, the garden-path structure, and the homography aspect of the word. In general, the results showed that the presence or absence of short vowels and diacritics in combination does not affect the reading process, comprehension, and accuracy of skilled adult Arab readers. However, in a word-naming task only, the absence of short vowels and context prevented the skilled adult Arab reader from choosing the right form of the heterophonic homographic word. Further, according to the findings, in the absence of short vowels and diacritics in combination, the role of context in Arabic is still limited to the heterophonic homographic words. In sum, the results demonstrated that the only variable that affects the reading process of skilled adult Arab readers is word frequency.
Justification for such effects and recommendations for pedagogical purposes and future research are suggested.
Karttatypografia: luettavuuden parantaminen kirjainmuotoilun keinoin topografisissa kartoissa (Map typography: improving legibility through type design in topographic maps)
This thesis examines the legibility of type on maps and aims to find ways to improve it through type design. As type often is an integral part of maps – something that helps the map user navigate, understand, and perceive a wide range of information in an effective way – type design and legibility must be regarded as important design elements. However, even though cartography and typography have extensive theoretical bases, the subject of legibility has not been comprehensively researched in a cartographic context. Thus, by combining type design theory and scientific legibility studies with cartographic theory, the legibility of type on maps could be improved.
The topic is first studied through an extensive literature review covering existing concepts and theories of cartography, cartographic typography, and typography. Once a sufficient knowledge base of these concepts and theories is acquired, the findings are utilised in the design component. The design component is a type family designed specifically to be used with topographic maps: it consists of two elements, a project description that follows the design process of the type family, relating design choices to the theoretical findings and perspectives presented in the literature review, and the finished type family. To conclude the design component, several visual studies are made both to compare the design component (type family) to other relevant typefaces, and to validate the possible functionality of the design component in the chosen cartographic application (topographic map).
A broad understanding of the topics of the literature review was formed. Cartographic theory observed the overall nature of maps and specified the various map elements and their intended uses. Cartographic typography deepened the understanding of type on maps – it highlighted the specific needs that must be taken into consideration, demonstrated the diversity of typographic situations that might occur, and presented a large set of guidelines to help the mapmaker achieve better results. Typography and type design focused on the micro-level of type: how the minor design choices affect the whole, and furthermore, through legibility studies, validated certain views and brought new topics into consideration. By combining theoretical literature from these domains, this thesis helped to form a foundation for an improved framework for type design for (topographic) maps. Furthermore, the domains of cartographic typography and of typography and type design gave clear suggestions on how the legibility of type on topographic maps can be improved: legibility of type in this context consists of multiple components that must both be taken into consideration and be applied to the processes of mapmaking and type design.
Latinate Word Parts And Vocabulary: Contrasts Among Three Groups Comprising the Community College Preparatory Reading Class
Students enrolled in a college preparatory reading class at one particular community college were categorized based on language origin. Native English-speaking students comprised one group and foreign students formed two additional groups: students whose language origin was Latin-based (i.e., Romance languages) and students whose language origin was not Latin-based (e.g., Japanese). A pretest assessment measure was used to quantify the extent to which pre-existing knowledge of Latinate word parts and morphologically complex vocabulary differed among groups based on language origin. The identical instrument served as a posttest to measure the extent to which direct instruction in morphological analysis resulted in change among the same groups after one semester of instruction. Two sections on both the pretest and posttest yielded a total of four distinct mean scores that formed the primary basis for comparison. Categorizing students within the college preparatory reading class based on language origin revealed distinctive strengths and weaknesses relative to group identity when learning Latin-based word parts and vocabulary. Results of a one-way fixed-factor analysis of variance, in conjunction with multiple comparison procedures, indicated that the Latin-based group performed the strongest. This group had the greatest mean score on all four measurements; however, only for the word part section of the pretest was the difference statistically significant. The non-Latin-based group performed the poorest, as evidenced by scoring the lowest on three of the four measures, with a statistically significant difference for the vocabulary pretest. Additionally, a disproportionately large number of students within the native English-speaking group had difficulty mastering word parts. Though the lower group mean was statistically significant for the word part section of the posttest, practical significance was not observable from the descriptive data.
A follow-up frequency tabulation revealed a dichotomization within the native English-speaking group between those who proceeded to master word parts and those who did not. Furthermore, results from a pretest/posttest comparison for each respective group indicated that all three groups made significant gains on both sections of the test instrument as a result of direct instruction in Latinate word parts and vocabulary. However, there was an incongruity between word part and vocabulary mastery, as all three group means were markedly better on the word part section of the instrument. The results of this study suggest that college preparatory students, regardless of their language origin, enter higher education with limited knowledge of Latinate word parts and vocabulary. The results further suggest that students comprising the heterogeneously populated college preparatory reading class can profit from direct instruction in morphological analysis, regardless of language origin. Prior research has demonstrated that college-level content words tend to be morphologically complex, singular in meaning, and likely to be Latinate in origin. Reading is the salient skill utilized across the curriculum and often the primary means of content dissemination. Reading, in turn, is principally linked to the extent of one's vocabulary. Consequently, teaching morphologically complex vocabulary at the college preparatory level, along with providing a working knowledge of morphemes, can assist students toward college readiness.
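For readers unfamiliar with the one-way fixed-factor analysis of variance mentioned above, the following minimal sketch shows how its F statistic is formed from between-group and within-group variability. The function name `one_way_anova_f` and all scores are hypothetical; the numbers are illustrative placeholders and are not data from the study.

```python
def one_way_anova_f(groups: list[list[float]]) -> float:
    """F statistic: mean square between groups over mean square within groups.

    Assumes at least two groups and some within-group variation.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Illustrative (made-up) pretest scores for three language-origin groups.
latin_based = [8.0, 9.0, 7.0]
native_english = [6.0, 7.0, 5.0]
non_latin_based = [4.0, 5.0, 3.0]

f_stat = one_way_anova_f([latin_based, native_english, non_latin_based])
print(f_stat)
```

A large F indicates that the group means differ by more than the within-group spread would suggest; deciding which pairs of groups differ still requires multiple comparison procedures, as the abstract notes.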
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
Effects of Two Prereading Activities on Comprehending Science Text: Reading Abridged Text and Learning Vocabulary Words
The present study examined the effects of two prereading activities designed to improve fifth-grade students’ vocabulary learning and comprehension of science textbook content containing those words. Ninety-three fifth grade students participated in this study. The prereading activities consisted of students reading an abridged version of the text or receiving instruction on vocabulary words drawn from the text before reading the full text once. Students receiving these treatments were compared to a control condition in which students reread the full text passage twice but did not receive any prereading treatment. Students were grouped by reading ability levels into above average, average, and below average readers. ANOVAs confirmed that the treatment/control groups did not differ on any of the pretests. ANOVAs were performed to examine the effects of the prereading treatments on measures of students’ vocabulary learning and reading comprehension of the science text. Results showed that students in the vocabulary training condition and the abridged text condition performed similarly in defining the vocabulary words and generating sentences containing the words, and both groups outperformed the control group on these measures. In addition, the vocabulary trained group outperformed the other two groups on a prompted recall measure of text comprehension. Treatment effects conditioned by reader ability were found on the sentence generation measure. The difference favoring the vocabulary group over the control group was evident for above-average and average readers but not for below average readers. The difference favoring the abridged group over the control group was evident for average and below average readers but not for above average readers. 
Students in the abridged text condition performed similarly across all reading levels, whereas students in the vocabulary and the control conditions differed across reading levels, with performance declining linearly as reading level declined. Better readers outperformed poorer readers on all the vocabulary measures and all but one of the reading comprehension measures. Results of this study suggest that having students read an abridged version of a difficult science text can help students learn vocabulary words in the text. Teaching students vocabulary words contained in a difficult science text prior to reading the text can help students learn the vocabulary words and improve their comprehension of the text.
Information-theoretic causal inference of lexical flow
This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, for example with confidence values for each directionality decision.