232 research outputs found

    In-place Update of Suffix Array while Recoding Words

    Motivated by grammatical inference and data compression applications, we propose an algorithm to update a suffix array after some occurrences of a given word in the indexed text are replaced by a new character. Compared to other published index update methods, the problem addressed here may require modifying a large number of distinct positions in the original text. The proposed algorithm exploits the internal order of suffix arrays to update groups of entries simultaneously, and ensures that only the entries to be modified are visited. Experiments confirm a significant speed-up in execution time compared to constructing the suffix array from scratch at each step of the application.
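The from-scratch baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the paper's in-place update algorithm: the function names `build_suffix_array` and `recode` are hypothetical, the construction is the naive sort-based one, and for simplicity every occurrence of the word is replaced, whereas the abstract allows replacing only some of them.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def recode(text: str, word: str, new_char: str) -> str:
    """Replace every non-overlapping occurrence of `word` by `new_char`.

    Assumption for illustration: all occurrences are recoded.
    """
    return text.replace(word, new_char)


text = "abracadabra"
sa_before = build_suffix_array(text)

# Recode the word "abra" as the new character "X", then rebuild from scratch.
recoded = recode(text, "abra", "X")
sa_after = build_suffix_array(recoded)

print(sa_before)  # suffix array of the original text
print(recoded)    # text after recoding
print(sa_after)   # suffix array rebuilt from scratch
```

Rebuilding this way touches every suffix after each substitution, which is the cost an in-place update aims to avoid.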

    Reverse-Safe Data Structures for Text Indexing

    We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.

    Searching for Smallest Grammars on Large Sequences and Application to DNA

    Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10%) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms.

    A stemming algorithm for Latvian

    The thesis covers the construction, application and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine the following two questions: Is it possible to apply to Latvian a suffix removal algorithm originally designed for English? Can stemming in Latvian produce the same or better information retrieval results than manual truncation? To answer these questions, the role and importance of automatic word conflation for both document indexing and information retrieval are characterised. A review of the literature, which analyzes and evaluates different types of stemming techniques and the historical development of stemming algorithms, justifies the need to apply this advanced IR method to Latvian as well. A comparative analysis of the morphological structure of English and Latvian led to the selection of Porter's suffix removal algorithm as the basis for the Latvian stemmer. An extensive list of Latvian stopwords, including conjunctions, particles and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of modifications specific to the Latvian language were made to the structure and rules of the original stemming algorithm. Analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can be successfully applied to Latvian as well. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.

    An Investigation of Reading Development Through Sensitivity to Sublexical Units

    The present dissertation provides a novel perspective to the study of reading, focusing on sensitivity to sublexical units across reading development. Work towards this thesis has been conducted at SISSA and Macquarie University. The first study is an eye tracking experiment on natural reading, with 140 developing readers and 33 adult participants, who silently read multiline passages from story books in Italian. A developmental database of eye tracking during natural reading was created, filling a gap in the literature. We replicated well-documented developmental trends of reading behavior (e.g., reading rate and skipping rate increasing with age) and effects of word length and frequency on eye tracking measures. The second study, in collaboration with Dr Jon Carr, is a methodological paper presenting algorithms for accuracy enhancement of eye tracking recordings in multiline reading. Using the above-mentioned dataset and computational simulations, we assessed the performance of several algorithms (including two novel methods that we proposed) on the correction of vertical drift, the progressive displacement of fixation registrations on the vertical axis over time. We provided guidance for eye tracking researchers in the application of these methods, and one of the novel algorithms (based on Dynamic Time Warping) proved particularly promising in realigning fixations, especially in child recordings. This manuscript has recently been accepted for publication in Behavior Research Methods. In the third study, I examined sensitivity to statistical regularities in letter co-occurrence throughout reading development, by analysing the effects of n-gram frequency metrics on eye-tracking measures. To this end, the EyeReadIt eye-tracking corpus (presented in the first study) was used. 
Our results suggest that n-gram frequency effects (in particular those related to maximum/average frequency metrics) are present even in developing readers, indicating that sensitivity to sublexical orthographic regularities in reading is present as soon as the developing reading system can pick it up; in the case of this study, as early as third grade. The results bear relevant implications for extant theories of learning to read, which largely overlook the contribution of statistical learning to reading acquisition. The fourth study is a magnetoencephalography experiment conducted at Macquarie University, in collaboration with Dr Lisi Beyersmann, Prof Paul Sowman, and Prof Anne Castles, on 28 adults and 17 children (5th and 6th grade). We investigated selective neural responses to morphemes at different stages of reading development, using Fast Periodic Visual Stimulation (FPVS) combined with an oddball design. Participants were presented with rapid sequences (6 Hz) of pseudoword combinations of stem/nonstem and suffix/nonsuffix components. Interleaved in this stream, oddball stimuli appeared periodically every 5 items (1.2 Hz) and were specifically designed to examine stem or suffix detection (e.g., stem+suffix oddballs, such as softity, were embedded in a sequence of nonstem+suffix base items, such as terpity). We predicted that neural responses at the oddball stimulation frequency (1.2 Hz) would reflect the detection of morphemes in the oddball stimuli. Sensor-level analysis revealed a selective response in a left occipito-temporal region of interest when the oddball stimuli were fully decomposable pseudowords. This response emerged for adults and children alike, showing that automatic morpheme identification occurs at relatively early stages of reading development, in line with major accounts of morphological decomposition. Critically, these findings also suggest that morpheme identification is modulated by the context in which the morphemes appear.

    Development of a stemmer for the isiXhosa language

    IsiXhosa is one of the eleven official languages of South Africa and the second most widely spoken language in the country. However, it has received little attention in computational linguistics, and natural language processing work on it is almost non-existent. Document retrieval using unstructured queries requires some kind of language processing, and efficient retrieval of documents can be achieved with a technique called stemming. The area that involves document storage and retrieval is called Information Retrieval (IR). IR systems use a Stemmer to index document representations and also the terms in users' queries in order to retrieve matching documents. In this dissertation, we present a Stemmer that can be used for both purposes. The Stemmer is used in IR systems such as Google to retrieve documents written in isiXhosa. In the Eastern Cape Province of South Africa many public schools take isiXhosa as a subject, and a number of universities in South Africa teach isiXhosa. For a language as important as this, it is therefore important to make valuable information that is available online accessible to users through IR systems. In our efforts to develop a Stemmer for the isiXhosa language, we investigated how others have developed Stemmers for other languages. From this investigation we found that the Porter stemming algorithm in particular was the main algorithm that many other Stemmers use as a reference. We found that Porter's algorithm could not be used in its entirety in the development of the isiXhosa Stemmer because of the morphological complexity of the language. We developed an affix removal algorithm with embedded rules that determine the order in which the affixes are stripped. 
The rule is as follows: the word under consideration is checked against the exceptions list; if it is not in the list, stripping proceeds in the following order: prefix removal, then suffix removal, and finally the result is saved as the stem. The Stemmer was successfully developed, then tested and evaluated on sample data randomly collected from isiXhosa textbooks and an isiXhosa dictionary. From the results obtained we concluded that the Stemmer can be used in IR systems, as it showed 91 percent accuracy. The error rate of 9 percent is within the accepted range, so the Stemmer can be used to help retrieve isiXhosa documents. This is only a noun Stemmer; in the future it can be extended to stem verbs as well. The Stemmer can also be used in the development of spell-checkers for isiXhosa.
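The stripping order described in this abstract (exceptions check, then prefix removal, then suffix removal) can be sketched as below. This is a rough illustration only: the affix lists, the exceptions set, the `MIN_STEM` length guard and the function name `stem` are placeholders chosen for the example, not the rule set developed in the dissertation.

```python
# Placeholder data for illustration; not the dissertation's actual rules.
EXCEPTIONS = {"ewe"}
PREFIXES = ("aba", "ama", "isi", "izi", "um")  # checked in this order
SUFFIXES = ("ana", "ini")
MIN_STEM = 2  # assumed guard: never strip a word down to fewer characters


def stem(word: str) -> str:
    """Exceptions check, then prefix removal, then suffix removal."""
    w = word.lower()
    if w in EXCEPTIONS:
        return w  # exception words are returned unchanged

    # Step 1: strip at most one matching prefix.
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) - len(prefix) >= MIN_STEM:
            w = w[len(prefix):]
            break

    # Step 2: strip at most one matching suffix.
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM:
            w = w[:-len(suffix)]
            break

    # Step 3: what remains is saved as the stem.
    return w


for noun in ("abantu", "umntu", "ewe"):
    print(noun, "->", stem(noun))
```

Checking the exceptions first keeps irregular words out of the affix rules entirely, which matches the order stated in the abstract.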

    Interweaving letters and sounds : the impact of phonics instruction in English on the oral production and symbolic representation of sounds among university-level L2 English learners

    The study described in this thesis was conducted with a number of L1 Spanish learners of L2 English who were students of English Pronunciation Practice (EPP), an undergraduate pronunciation course taught in the English Teaching, Translation and Research programs at Facultad de Lenguas (FL), Universidad Nacional de Córdoba (UNC). It aimed to investigate whether explicit phonics instruction contributes positively to the oral production and phonemic transcription of unfamiliar words by university-level L1 Spanish learners of L2 English. A quasi-experimental research design was used and the data obtained were analyzed with a quantitative method. The participating students were divided into experimental and control groups. The total number of students whose performance was analyzed was 62 (experimental = 33 and control = 29). Both groups were pretested on oral production and phonemic transcription of unfamiliar words. Next, the experimental group received six lessons of phonics instruction focusing on the pronunciation and transcription of six specific orthographic combinations. After that, both groups were posttested under conditions similar to the pretest. All the data collected were analyzed using the dependent t test (also known as the paired t test) to assess the difference between the averages obtained in the pretest and posttest conditions by each group. This was complemented with a variability analysis conducted to determine the degree of difficulty that the different combinations caused the participating students. The results obtained from this study confirm the hypothesis that students who received explicit phonics instruction performed better in terms of oral production and phonemic transcription of unfamiliar words containing the chosen orthographic combinations than did students who did not receive such instruction. Pedagogical implications, practical applications and directions for future research are given.
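For readers unfamiliar with the dependent (paired) t test mentioned in this abstract, the statistic can be computed as in the sketch below. The scores are invented for illustration and are not data from the study; the function name `paired_t` is likewise hypothetical.

```python
import math
import statistics


def paired_t(pre: list[float], post: list[float]) -> tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the per-student
    posttest minus pretest differences and sd is the sample standard deviation.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Invented scores for five hypothetical students (not the study's data).
pretest = [10, 12, 9, 14, 11]
posttest = [12, 15, 10, 17, 13]

t_stat, dof = paired_t(pretest, posttest)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The test works on per-student differences, which is why it suits a pretest/posttest design in which the same students are measured twice.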

    Implications of differences of echoic and iconic memory for the design of multimodal displays

    It has been well documented that dual-task performance is more accurate when each task is based on a different sensory modality. It is also well documented that sensory memory differs in duration across the senses, particularly between visual (iconic) and auditory (echoic) sensory memory. In this dissertation I address whether differences in sensory memory duration (e.g. iconic vs. echoic) have implications for the design of a multimodal display. Since echoic memory persists for seconds, in contrast to iconic memory, which persists only for milliseconds, one of my hypotheses was that in a visual-auditory dual-task condition, performance will be better if the visual task is completed before the auditory task than vice versa. In Experiment 1 I investigated whether the ability to recall multi-modal stimuli is affected by recall order, with each mode being responded to separately. In Experiment 2 I investigated the effects of stimulus order and recall order on the ability to recall information from a multi-modal presentation. In Experiment 3 I investigated the effect of presentation order using a more realistic task. In Experiment 4 I investigated whether manipulating the presentation order of stimuli of different modalities improves humans' ability to combine the information from the two modalities in order to make decisions based on pre-learned rules. As hypothesized, accuracy was greater when visual stimuli were responded to first and auditory stimuli second. Also as hypothesized, performance was improved by not presenting both sequences at the same time, which limits the perceptual load. Contrary to my expectations, overall performance was better when a visual sequence was presented before the audio sequence. Though presenting a visual sequence prior to an auditory sequence lengthens the visual retention interval, it also provides time for visual information to be recoded to a more robust form without disruption. 
Experiment 4 demonstrated that decision making requiring the integration of visual and auditory information is enhanced by reducing workload and promoting a strategic use of echoic memory. A framework for predicting the results of Experiments 1-4 is proposed and evaluated.