
    Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

    In computing, spell checking is the process of detecting, and sometimes suggesting corrections for, incorrectly spelled words in a text. A spell checker is a computer program that uses a dictionary of words to perform spell checking; the bigger the dictionary, the higher the error detection rate. Because spell checkers are based on regular dictionaries, they suffer from a data sparseness problem: they cannot capture a large vocabulary that includes proper names, domain-specific terms, technical jargon, special acronyms, and terminology. As a result, they exhibit a low error detection rate and often fail to catch major errors in the text. This paper proposes a new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents. The approach hinges on statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web. The proposed method comprises an error detector that detects misspellings, a candidate spellings generator based on a character 2-gram model that generates correction suggestions, and an error corrector that performs contextual error correction. Experiments conducted on a set of text documents from different domains and containing misspellings showed an outstanding spelling error correction rate and a drastic reduction of both non-word and real-word errors. In a further study, the proposed algorithm is to be parallelized so as to lower the computational cost of the error detection and correction processes. Comment: LACSC - Lebanese Association for Computational Sciences - http://www.lacsc.or
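The three stages named in this abstract (candidate generation from character 2-grams, then contextual selection from word n-gram counts) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Dice-overlap ranking, the function names, and the tiny `TOY_COUNTS` table standing in for the Google Web 1T data are all assumptions made for the example.

```python
from collections import Counter

def char_bigrams(word):
    """Return the set of character 2-grams of a word, padded with '#' boundary marks."""
    padded = f"#{word}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def candidates(misspelling, vocabulary, top_k=3):
    """Rank vocabulary words by Dice overlap of character 2-grams (assumed similarity measure)."""
    target = char_bigrams(misspelling)

    def dice(word):
        grams = char_bigrams(word)
        return 2 * len(target & grams) / (len(target) + len(grams))

    # Sort by descending similarity, breaking ties alphabetically for determinism.
    return sorted(vocabulary, key=lambda w: (-dice(w), w))[:top_k]

def correct(left_context, misspelling, vocabulary, ngram_counts):
    """Pick the candidate whose (left context + candidate) n-gram has the highest count."""
    cands = candidates(misspelling, vocabulary)
    return max(cands, key=lambda c: ngram_counts[tuple(left_context) + (c,)])

# Hypothetical vocabulary and 2-gram counts; a real system would query Web 1T 5-grams.
VOCAB = ["piece", "peace", "pace", "place"]
TOY_COUNTS = Counter({
    ("a", "piece"): 120,
    ("a", "peace"): 15,
    ("world", "peace"): 300,
    ("world", "piece"): 2,
})

if __name__ == "__main__":
    print(candidates("peice", VOCAB))
    print(correct(["a"], "peice", VOCAB, TOY_COUNTS))
    print(correct(["world"], "peice", VOCAB, TOY_COUNTS))
```

The same misspelling resolves to different words depending on its left context, which is the point of making the corrector context-sensitive.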

    Genome-Wide Transposon Screen of a Pseudomonas syringae mexB Mutant Reveals the Substrates of Efflux Transporters.

    Bacteria express numerous efflux transporters that confer resistance to diverse toxicants present in their environment. Due to a high level of functional redundancy of these transporters, it is difficult to identify those that are of most importance in conferring resistance to specific compounds. The resistance-nodulation-division (RND) protein family is one such example of redundant transporters that are widespread among Gram-negative bacteria. Within this family, the MexAB-OprM protein complex is highly expressed and conserved among Pseudomonas species. We exposed barcoded transposon mutant libraries in isogenic wild-type and ΔmexB backgrounds in P. syringae B728a to diverse toxic compounds in vitro to identify mutants with increased susceptibility to these compounds. Mutants with mutations in genes encoding both known and novel redundant transporters but with partially overlapping substrate specificities were observed in a ΔmexB background. Psyr_0228, an uncharacterized member of the major facilitator superfamily of transporters, preferentially contributes to tolerance of acridine orange and acriflavine. Another transporter located in the inner membrane, Psyr_0541, contributes to tolerance of acriflavine and berberine. The presence of multiple redundant, genomically encoded efflux transporters appears to enable bacterial strains to tolerate a diversity of environmental toxins. This genome-wide screen performed in a hypersusceptible mutant strain revealed numerous transporters that would otherwise be dispensable under these conditions. Bacterial strains such as P. syringae that likely encounter diverse toxins in their environment, such as in association with many different plant species, probably benefit from possessing multiple redundant transporters that enable versatility with respect to toleration of novel toxicants. IMPORTANCE: Bacteria use protein pumps to remove toxic compounds from the cell interior, enabling survival in diverse environments. These protein pumps can be highly redundant, making their targeted examination difficult. In this study, we exposed mutant populations of Pseudomonas syringae to diverse toxicants to identify pumps that contributed to survival in those conditions. In parallel, we examined pump redundancy by testing mutants of a population lacking the primary efflux transporter responsible for toxin tolerance. We identified partial substrate overlap for redundant transporters, as well as several pumps that appeared more substrate specific. For bacteria that are found in diverse environments, having multiple, partially redundant efflux pumps likely allows flexibility in habitat colonization.

    Protein Delivery of an Artificial Transcription Factor Restores Widespread Ube3a Expression in an Angelman Syndrome Mouse Brain.

    Angelman syndrome (AS) is a neurological genetic disorder caused by loss of expression of the maternal copy of UBE3A in the brain. Due to brain-specific genetic imprinting at this locus, the paternal UBE3A is silenced by a long antisense transcript. Inhibition of the antisense transcript could lead to unsilencing of paternal UBE3A, thus providing a therapeutic approach for AS. However, widespread delivery of gene regulators to the brain remains challenging. Here, we report an engineered zinc finger-based artificial transcription factor (ATF) that, when injected i.p. or s.c., crossed the blood-brain barrier and increased Ube3a expression in the brain of an adult mouse model of AS. The factor displayed widespread distribution throughout the brain. Immunohistochemistry of both the hippocampus and cerebellum revealed an increase in Ube3a upon treatment. An ATF containing an alternative DNA-binding domain did not activate Ube3a. We believe this to be the first report of an injectable engineered zinc finger protein that can cause widespread activation of an endogenous gene in the brain. These observations have important implications for the study and treatment of AS and other neurological disorders

    Pattern matching in compilers

    In this thesis we develop tools for effective and flexible pattern matching. We introduce a new pattern matching system called amethyst. Amethyst is not only a generator of parsers for programming languages, but can also serve as an alternative to tools for matching regular expressions. Our framework also produces dynamic parsers, which are intended for use in an IDE (accurate syntax highlighting and error detection on the fly). Amethyst offers pattern matching of general data structures. This makes it a useful tool for implementing compiler optimizations such as constant folding and instruction scheduling, and dataflow analysis in general. The parsers produced are essentially top-down parsers. Linear time complexity is obtained by introducing the novel notions of structured grammars and regularized regular expressions. Amethyst uses techniques known from compiler optimizations to produce effective parsers. Comment: master thesis

    Picture languages generated by splicing and assembling tiles

    The extension of the study of formal languages from the string case to two-dimensional languages, or picture languages, has long been of interest because of its many applications. The most common two-dimensional object studied in this thesis is a picture, which is a rectangular array of symbols taken from a finite alphabet. The thesis concentrates on the generation of picture language classes by two bio-inspired operations from DNA computing, namely `splicing' and `self-assembly', through two formalisms: H-Array Splicing Systems and Tiling Rule Systems. The H-Array Splicing System is a bio-inspired formalism extended from H-splicing in the string case, a widely investigated topic introduced by T. Head. It is structured as a mechanism built on two-dimensional right linear grammars (2D-RLG). In detail, the formalism is applied to a finite number of pictures, called initial pictures, with a given set of column and row domino splicing rules. The context sites where two pictures are cut in columns and rows are decided by the sequence of adjacent dominoes in the set of rules. The `pasting' of the first part (or sub-array) of one picture to the second part of the other is then done by column and row concatenation respectively. H-Array Splicing Systems are applied over 2D-RLG languages, generating H-Array Splicing languages (HASL). The restricted classes defined and studied are Self Cross-Over Arrays and Simple Array Splicing languages, and incomparability results are proved with respect to 2D-RLG and HASL. They are not disjoint, and they intersect the recognizable languages. The second main formalism introduced and studied in the thesis is the Tiling Rule System (TRuS). This formalism generates pictures by assembling tiles according to a set of tiling rules. We prove that the class L(TS) (Tiling System, recognizable languages) is contained in TRuS. We also prove that there exists a construct of the formalism, based on generating pictures in rows or columns, that is equivalent to L(TS). The equivalence is proved using Wang systems. This leads to an interesting notion of a bio-inspired (self-assembling) operation for picture generation.
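The cut-and-paste step described in this abstract rests on column and row concatenation of pictures. The sketch below shows those two operations and a simplified column splice on pictures stored as lists of equal-length strings. It is only an illustration: the domino splicing rules that select the cut sites are not modelled, and the cut positions `i` and `j` are passed in directly as an assumption.

```python
def col_concat(p, q):
    """Column concatenation: place picture q to the right of p (same number of rows)."""
    if len(p) != len(q):
        raise ValueError("column concatenation needs pictures with equal row counts")
    return [a + b for a, b in zip(p, q)]

def row_concat(p, q):
    """Row concatenation: place picture q below p (same number of columns)."""
    if len(p[0]) != len(q[0]):
        raise ValueError("row concatenation needs pictures with equal column counts")
    return p + q

def column_splice(p, i, q, j):
    """Cut p after column i and q after column j, then paste p's left part to q's right part.

    In an H-array splicing system the cut sites would be determined by domino
    rules; here they are supplied by the caller.
    """
    left = [row[:i] for row in p]
    right = [row[j:] for row in q]
    return col_concat(left, right)

if __name__ == "__main__":
    p = ["ab", "cd"]
    q = ["xy", "zw"]
    print(col_concat(p, q))
    print(row_concat(p, q))
    print(column_splice(p, 1, q, 1))
```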

    Stereoscopic Sketchpad: 3D Digital Ink

    --Context-- This project looked at the development of a stereoscopic 3D environment in which a user is able to draw freely in all three dimensions. The main focus was on the storage and manipulation of the ‘digital ink’ with which the user draws. For a drawing and sketching package to be effective, it must not only have an easy-to-use user interface, it must also handle all input data quickly and efficiently so that the user is able to focus fully on their drawing.
    --Background-- When it comes to sketching in three dimensions, the majority of applications currently available rely on vector-based drawing methods. This is primarily because the applications are designed to take a user's two-dimensional input and transform it into a three-dimensional model. Having the sketch represented as vectors makes it simpler for the program to act upon its geometry and thus convert it to a model. There are a number of methods to achieve this aim, including Gesture Based Modelling, Reconstruction and Blobby Inflation. Other vector-based applications focus on the creation of curves, allowing the user to draw within or on existing 3D models. They also allow the user to create wire-frame models. These stroke-based applications bring the user closer to traditional sketching than the more structured modelling methods detailed above. While at present the field is inundated with vector-based applications mainly focused upon sketch-based modelling, there are significantly fewer voxel-based applications. The majority of these focus on the deformation and sculpting of voxmaps, almost the opposite of drawing and sketching, and on the creation of three-dimensional voxmaps from standard two-dimensional pixmaps. How to actually sketch freely within a scene represented by a voxmap has rarely been explored. This comes as a surprise when so many of the standard 2D drawing programs in use today are pixel based.
    --Method-- As part of this project a simple three-dimensional drawing program was designed and implemented using C and C++. This tool is known as Sketch3D and was created using a Model View Controller (MVC) architecture. Due to the modular nature of Sketch3D's system architecture, it is possible to plug a range of different data structures into the program to represent the ink in a variety of ways. A series of data structures were implemented and tested for efficiency: a simple list, a 3D array, and an octree. They were tested for the time it takes to insert or remove points from the structure, how easy it is to manipulate points once they are stored, and how the number of points stored affects the draw and rendering times. One of the key issues brought up by this project was devising a means by which a user is able to draw in three dimensions while using only two-dimensional input devices. The method settled upon and implemented involves using the mouse or a digital pen to sketch as one would in a standard 2D drawing package, but also linking the up and down keyboard keys to the current depth. This allows the user to move in and out of the scene as they draw. A couple of user interface tools were also developed to assist the user: a 3D cursor, and a toggle which, when on, highlights all of the points intersecting the depth plane on which the cursor currently resides. These tools allow the user to see exactly where they are drawing in relation to previously drawn lines.
    --Results-- The tests conducted on the data structures clearly revealed that the octree was the most effective data structure. While not the most efficient in every area, it manages to avoid the major pitfalls of the other structures. The list was extremely quick to render and draw to the screen but suffered severely when finding and manipulating points already stored. In contrast, the three-dimensional array was able to erase or manipulate points effectively, but its draw time rendered the structure effectively useless, taking huge amounts of time to draw each frame. The focus of this research was on how a 3D sketching package would go about storing and accessing the digital ink. This is just a basis for further research in this area, and many issues touched upon in this paper will require a more in-depth analysis. The primary area of this future research would be the creation of an effective user interface and the introduction of regular sketching package features such as the saving and loading of images.
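The trade-off reported in these results (a list is cheap to draw but slow to search, a dense 3D array is quick to edit but expensive to traverse) is what a sparse octree tries to balance. Sketch3D itself was written in C and C++ and its code is not shown here, so the following Python sketch is only an assumed illustration of the data structure: an octree over an integer voxel cube in which insert, lookup and removal each take a number of steps proportional to the tree depth, and empty regions use no memory.

```python
class Octree:
    """Sparse octree over the integer cube [0, size)^3, where size is a power of two >= 2."""

    def __init__(self, size):
        if size < 2 or size & (size - 1):
            raise ValueError("size must be a power of two and at least 2")
        self.size = size
        self.root = {}  # nested dicts keyed by octant index 0..7

    def _path(self, x, y, z):
        """Yield the octant index chosen at each level, from the root down to the voxel."""
        if not all(0 <= c < self.size for c in (x, y, z)):
            raise ValueError("point outside the octree bounds")
        half = self.size >> 1
        while half:
            yield (int(bool(x & half)) << 2) | (int(bool(y & half)) << 1) | int(bool(z & half))
            half >>= 1

    def insert(self, x, y, z):
        node = self.root
        for octant in self._path(x, y, z):
            node = node.setdefault(octant, {})

    def contains(self, x, y, z):
        node = self.root
        for octant in self._path(x, y, z):
            node = node.get(octant)
            if node is None:
                return False
        return True

    def remove(self, x, y, z):
        """Erase a voxel and prune branches that become empty; return False if it was absent."""
        trail, node = [], self.root
        for octant in self._path(x, y, z):
            if octant not in node:
                return False
            trail.append((node, octant))
            node = node[octant]
        for parent, octant in reversed(trail):
            del parent[octant]
            if parent:  # parent still has other children, so stop pruning
                break
        return True

if __name__ == "__main__":
    ink = Octree(8)
    ink.insert(1, 2, 3)
    print(ink.contains(1, 2, 3), ink.contains(1, 2, 4))
```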

    Two Refinements of the Template-Guided DNA Recombination Model of Ciliate Computing

    To solve the mystery of the intricate gene unscrambling mechanism in ciliates, various theoretical models for this process have been proposed from the point of view of computation. Two main models are the reversible guided recombination system by Kari and Landweber and the template-guided recombination (TGR) system by Prescott, Ehrenfeucht and Rozenberg, based on two categories of DNA recombination: the pointer guided and the template directed recombination respectively. The latter model has been generalized by Daley and McQuillan. In this thesis, we propose a new approach to generate regular languages using the iterated TGR system with a finite initial language and a finite set of templates, that reduces the size of the template language and the alphabet compared to that of the Daley-McQuillan model. To achieve computational completeness using only finite components we also propose an extension of the contextual template-guided recombination system (CTGR system) by Daley and McQuillan, by adding an extra control called permitting contexts on the usage of templates. Then we prove that our proposed system, the CTGR system using permitting contexts, has the capability to characterize the family of recursively enumerable languages using a finite initial language and a finite set of templates. Lastly, we present a comparison and analysis of the computational power of the reversible guided recombination system and the TGR system. Keywords: ciliates, gene unscrambling, in vivo computing, DNA computing, cellular computing, reversible guided recombination, template-guided recombination

    Shelfaware: Accelerating Collaborative Awareness with Shelf CRDT

    Collaboration has become a key feature of modern software, allowing teams to work together effectively in real-time while in different locations. In order for a user to communicate their intention to several distributed peers, computing devices must exchange high-frequency updates with transient metadata like mouse position, text range highlights, and temporary comments. Current peer-to-peer awareness solutions have high time and space complexity due to the ever-expanding logs that each client must maintain in order to ensure robust collaboration in eventually consistent environments. This paper proposes an awareness Conflict-Free Replicated Data Type (CRDT) library that provides the tooling to support an eventually consistent, decentralized, and robust multi-user collaborative environment. Our library is tuned for rapid iterative updates that communicate fine-grained user actions across a network of collaborators. Our approach holds memory constant for subsequent writes to an existing key on a shared resource and completely prunes stale data from shared documents. These features allow us to keep the CRDT's memory footprint small, making it a feasible solution for memory-constrained applications. Results show that our CRDT implementation is comparable to or exceeds the performance of similar data structures in high-frequency read/write scenarios.
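The property of holding memory constant for repeated writes to an existing key can be illustrated with a generic last-writer-wins map. This is not the Shelf CRDT algorithm, which this abstract does not specify; the class name, the `(clock, client_id)` ordering, and the update tuple format are assumptions chosen for a minimal sketch. Pruning of stale keys is left out.

```python
class AwarenessMap:
    """Last-writer-wins map: one stored entry per key, no operation log."""

    def __init__(self, client_id):
        self.client_id = client_id
        self.clock = 0       # logical clock, advanced on local writes and merges
        self.entries = {}    # key -> (clock, client_id, value)

    def set(self, key, value):
        """Write locally and return the update to broadcast to peers."""
        self.clock += 1
        update = (key, self.clock, self.client_id, value)
        self.apply(update)
        return update

    def apply(self, update):
        """Merge a local or remote update; the highest (clock, client_id) pair wins."""
        key, clock, client, value = update
        self.clock = max(self.clock, clock)
        current = self.entries.get(key)
        if current is None or (clock, client) > current[:2]:
            # Overwrite in place, so repeated writes to a key never grow the map.
            self.entries[key] = (clock, client, value)

    def get(self, key):
        entry = self.entries.get(key)
        return None if entry is None else entry[2]

if __name__ == "__main__":
    a, b = AwarenessMap("a"), AwarenessMap("b")
    u1 = a.set("cursor", (1, 1))
    u2 = b.set("cursor", (5, 5))
    a.apply(u2)
    b.apply(u1)
    print(a.get("cursor"), b.get("cursor"))
```

Because `apply` is commutative and idempotent under this ordering, two peers that receive the same updates in different orders end with the same value for each key.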

    Word hypothesis from undifferentiated, errorful phonetic strings

    This thesis investigates a dynamic programming approach to word hypothesis in the context of a speaker independent, large vocabulary, continuous speech recognition system. Using a method known as Dynamic Time Warping, an undifferentiated phonetic string (one without word boundaries) is parsed to produce all possible words contained in a domain specific lexicon. Dynamic Time Warping is a common method of sequence comparison used in matching the acoustic feature vectors representing an unknown input utterance and some reference utterance. The cumulative least cost path, when compared with some threshold can be used as a decision criterion for recognition. This thesis attempts to extend the DTW technique using strings of phonetic symbols, instead. Three variables that were found to affect the parsing process include: (1) minimum distance threshold, (2) the number of word candidates accepted at any given phonetic index, and (3) the lexical search space used for reference pattern comparisons. The performance of this parser as a function of these variables is discussed. Also discussed is the performance of the parser at a variety of input error conditions
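The parsing idea in this abstract, a cumulative least-cost alignment of lexicon entries against an unsegmented phonetic string with a distance threshold as the acceptance criterion, can be sketched with a small dynamic program. This is an assumed simplification, not the thesis's parser: unit insertion, deletion and substitution costs stand in for the DTW local distances, and the lexicon, phone symbols and function names are hypothetical.

```python
def best_match(word, phones, start):
    """Least-cost alignment of `word` to some span phones[start:end].

    Returns (cost, end). Costs are unit edit costs (an assumption).
    """
    span = phones[start:]
    # prev[j] = cost of aligning the word prefix processed so far with span[:j]
    prev = list(range(len(span) + 1))
    for i, w in enumerate(word, 1):
        cur = [i]
        for j, p in enumerate(span, 1):
            cur.append(min(prev[j] + 1,               # phone of the word missing in the input
                           cur[j - 1] + 1,            # extra phone in the input
                           prev[j - 1] + (w != p)))   # match or substitution
        prev = cur
    cost, end = min((c, j) for j, c in enumerate(prev))
    return cost, start + end

def hypothesize(phones, lexicon, threshold):
    """List (start, end, word, cost) for every lexicon word matched within the threshold."""
    hypotheses = []
    for start in range(len(phones)):
        for word, pronunciation in lexicon.items():
            cost, end = best_match(pronunciation, phones, start)
            if cost <= threshold and end > start:
                hypotheses.append((start, end, word, cost))
    return hypotheses

# Hypothetical two-word lexicon with made-up phone symbols.
LEXICON = {"the": ["dh", "ah"], "cat": ["k", "ae", "t"]}

if __name__ == "__main__":
    print(hypothesize(["dh", "ah", "k", "ae", "t"], LEXICON, threshold=0))
```

Raising the threshold admits more word candidates at each phonetic index, which tolerates errorful input at the price of a larger set of hypotheses to resolve later.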