
    Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

    In computing, spell checking is the process of detecting, and sometimes providing suggestions for, incorrectly spelled words in a text. Basically, a spell checker is a computer program that uses a dictionary of words to perform spell checking. The bigger the dictionary, the higher the error detection rate. Because spell checkers are based on regular dictionaries, they suffer from a data sparseness problem: they cannot capture a large vocabulary of words including proper names, domain-specific terms, technical jargon, special acronyms, and terminology. As a result, they exhibit a low error detection rate and often fail to catch major errors in the text. This paper proposes a new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents. The approach hinges on statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web. Fundamentally, the proposed method comprises an error detector that detects misspellings, a candidate spellings generator based on a character 2-gram model that generates correction suggestions, and an error corrector that performs contextual error correction. Experiments conducted on a set of text documents from different domains and containing misspellings showed an outstanding spelling error correction rate and a drastic reduction of both non-word and real-word errors. In a further study, the proposed algorithm is to be parallelized so as to lower the computational cost of the error detection and correction processes.
    Comment: LACSC - Lebanese Association for Computational Sciences - http://www.lacsc.or
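    To make the pipeline concrete, here is a minimal sketch of the candidate-generation and contextual-scoring steps, assuming a toy in-memory word-bigram count table as a stand-in for the Google Web 1T 5-gram data; the Dice-coefficient ranking and one-word context window are illustrative assumptions, not the paper's exact formulation:

```python
def char_bigrams(word):
    """Character 2-grams of a word, e.g. 'cat' -> {'ca', 'at'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def candidates(misspelling, vocabulary, top_k=5):
    """Rank dictionary words by character-bigram overlap (Dice coefficient)."""
    src = char_bigrams(misspelling)

    def dice(word):
        tgt = char_bigrams(word)
        return 2 * len(src & tgt) / ((len(src) + len(tgt)) or 1)

    return sorted(vocabulary, key=dice, reverse=True)[:top_k]

def best_correction(prev_word, misspelling, bigram_counts, vocabulary):
    """Pick the candidate whose bigram with the left context word is most frequent."""
    return max(candidates(misspelling, vocabulary),
               key=lambda c: bigram_counts.get((prev_word, c), 0))

# Toy stand-ins for Web 1T counts and a dictionary.
bigram_counts = {("a", "piece"): 900, ("a", "peace"): 40}
vocab = ["piece", "peace", "place"]
print(best_correction("a", "peice", bigram_counts, vocab))  # -> piece
```

    In the paper's setting, the scoring step would consult 5-gram counts around the error position rather than a single left neighbour, but the shape of the computation is the same.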

    Predicting the confusion level of text excerpts with syntactic, lexical and n-gram features

    Distance learning, offline presentations (presentations that are not delivered live but are instead pre-recorded) and similar activities whose main goal is to convey information are becoming increasingly relevant with digital media such as Virtual Reality (VR) and Massive Online Open Courses (MOOCs). While MOOCs are a well-established reality in the learning environment, VR is also being used to promote learning in virtual rooms, be it in academia or in industry. Oftentimes these methods are based on written scripts that take the learner through the content, making them critical components of these tools. Given such an important role, it is essential to ensure the efficiency of these scripts. Confusion is a non-basic emotion associated with learning; it often leads to a cognitive disequilibrium caused either by the content itself or by the way it is conveyed in terms of its syntactic and lexical features. We hereby propose a supervised model that can predict the likelihood of confusion that an input text excerpt can cause in the learner. To achieve this, we performed syntactic and lexical analyses over 300 text excerpts and collected five confusion-level ratings (on a 0-6 scale) per excerpt from 51 annotators, using their respective means as labels. The examples that compose the dataset were collected from random presentation transcripts across various fields of knowledge. The learning model was trained on this data, with the results included in the body of the paper. This model allows the design of clearer scripts for offline presentations and similar approaches, and we expect it to improve the efficiency of these speeches. While this model is applied to this specific case, we hope to pave the way to generalizing this approach to other contexts where clarity of text is critical, such as the scripts of MOOCs or academic abstracts.
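    As a rough illustration of this kind of supervised pipeline, the sketch below trains a regressor on a few surface lexical features; the feature set, the model choice (ridge regression) and the toy excerpts and labels are assumptions for illustration, not the paper's actual features or learner:

```python
import numpy as np
from sklearn.linear_model import Ridge

def surface_features(text):
    """Crude lexical stand-ins: avg word length, sentence length, type-token ratio."""
    words = text.split()
    sents = max(text.count(".") + text.count("!") + text.count("?"), 1)
    return [
        np.mean([len(w) for w in words]),  # average word length
        len(words) / sents,                # words per sentence
        len(set(words)) / len(words),      # type-token ratio
    ]

# Hypothetical excerpts with mean annotator confusion on the 0-6 scale.
excerpts = [
    "The cat sat on the mat.",
    "Notwithstanding antecedent stipulations, the heretofore unexamined corollary obtains.",
]
labels = [0.8, 4.9]

X = np.array([surface_features(t) for t in excerpts])
model = Ridge(alpha=1.0).fit(X, labels)
print(model.predict([surface_features("A moderately complex example sentence.")]))
```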

    The effect of word similarity on N-gram language models in Northern and Southern Dutch

    In this paper we examine several combinations of classical N-gram language models with more advanced, well-known techniques based on word similarity, such as cache models and Latent Semantic Analysis. We compare the efficiency of these combined models to a model that combines N-grams with the recently proposed, state-of-the-art neural network-based continuous skip-gram. We discuss the strengths and weaknesses of each of these models, based on their predictive power for the Dutch language, and find that a linear interpolation of a 3-gram, a cache model and a continuous skip-gram is capable of reducing perplexity by up to 18.63%, compared to a 3-gram baseline. This is three times the reduction achieved with a 5-gram. In addition, we investigate whether and in what way the effect of Southern Dutch training material on these combined models differs when evaluated on Northern and Southern Dutch material. Experiments on Dutch newspaper and magazine material suggest that N-grams are mostly influenced by the register and not so much by the language (variety) of the training material. Word similarity models, on the other hand, seem to perform best when they are trained on material in the same language (variety).
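    The combination scheme itself is a simple linear interpolation of per-word probabilities. A minimal sketch follows, with hypothetical component probabilities and mixture weights standing in for the trained 3-gram, cache and skip-gram models:

```python
import math

def interpolate(p_ngram, p_cache, p_skipgram, lambdas=(0.6, 0.1, 0.3)):
    """Linearly interpolate component probabilities; lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_ngram + l2 * p_cache + l3 * p_skipgram

def perplexity(probs):
    """Perplexity over a sequence of per-word probabilities (lower is better)."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-word probabilities for a short test sequence.
p3gram = [0.05, 0.02, 0.10]
pcache = [0.08, 0.01, 0.20]
pskip  = [0.06, 0.04, 0.15]
mixed = [interpolate(a, b, c) for a, b, c in zip(p3gram, pcache, pskip)]
print(perplexity(p3gram), perplexity(mixed))
```

    In practice the interpolation weights would be tuned on held-out data rather than fixed by hand.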

    Molecule Generation by Principal Subgraph Mining and Assembling

    Molecule generation is central to a variety of applications. Recent attention has been paid to approaching the generation task as subgraph prediction and assembling. Nevertheless, these methods usually rely on hand-crafted or external subgraph construction, and the subgraph assembling depends solely on local arrangement. In this paper, we define a novel notion, the principal subgraph, that is closely related to the informative patterns within molecules. Interestingly, our proposed merge-and-update subgraph extraction method can automatically discover frequent principal subgraphs from the dataset, which previous methods are incapable of doing. Moreover, we develop a two-step subgraph assembling strategy, which first predicts a set of subgraphs in a sequence-wise manner and then assembles all generated subgraphs globally into the final output molecule. Built upon a graph variational auto-encoder, our model is demonstrated to be effective in terms of several evaluation metrics and efficiency, compared with state-of-the-art methods on distribution learning and (constrained) property optimization tasks.
    Comment: Accepted by NeurIPS 202
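    The merge-and-update extraction operates on molecular graphs; as a rough intuition only, the sketch below runs the analogous BPE-style loop over token sequences, repeatedly merging the most frequent adjacent fragment pair. This is a stand-in for the idea, not the paper's graph algorithm:

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent fragment pairs across all sequences."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seqs, pair):
    """Replace every occurrence of `pair` with a single merged fragment."""
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1]); i += 2
            else:
                out.append(seq[i]); i += 1
        merged.append(out)
    return merged

# Toy "molecules" as fragment sequences (SMILES-like tokens).
data = [["C", "C", "O"], ["C", "C", "N"], ["C", "C", "O"]]
for _ in range(2):  # two merge-and-update rounds
    pair = most_frequent_pair(data)
    if pair is None:
        break
    data = merge_pair(data, pair)
print(data)  # the frequent pattern "CC", then "CCO", emerges as a unit
```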

    Improving Robustness and Scalability of Available NER Systems

    The focus of this research is to study and develop techniques to adapt existing NER resources to serve the needs of a broad range of organizations without expert NLP manpower. My methods emphasize the usability, robustness and scalability of existing NER systems to ensure maximum functionality for a broad range of organizations. Usability is facilitated by ensuring that the methodologies are compatible with any available open-source NER tagger or data set, thus allowing organizations to choose resources that are easy to deploy and maintain and that fit their requirements. One way of making use of available tagged data would be to aggregate a number of different tagged sets in an effort to increase the coverage of the NER system. Though more tagged data can generally mean a more robust NER model, extra data also introduces a significant amount of noise and complexity into the model. Because adding additional training data to scale up an NER system presents a number of challenges in terms of scalability, this research aims to address these difficulties and provide a means for multiple available training sets to be aggregated while reducing noise, model complexity and training times. In an effort to maintain usability, increase robustness and improve scalability, I designed an approach that merges document clustering of the training data with open-source or available NER software packages and tagged data that can be easily acquired and implemented. Here, a tagged training set is clustered into smaller data sets, and models are then trained on these smaller clusters. This is designed not only to reduce noise by creating more focused models, but also to increase scalability and robustness. Document clustering is used extensively in information retrieval, but has never been used in conjunction with NER.
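    A minimal sketch of the clustering step, assuming TF-IDF document vectors and k-means (the thesis does not specify these choices here); training one tagger per cluster is left to whichever open-source NER toolkit is plugged in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_training_docs(docs, k=3):
    """Cluster tagged training documents so each cluster trains a focused model."""
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    clusters = {}
    for doc, lab in zip(docs, labels):
        clusters.setdefault(lab, []).append(doc)
    return clusters

# Hypothetical aggregated training documents from different tagged sets.
docs = ["stock markets fell", "the senate passed a bill",
        "quarterly earnings rose", "parliament debated the law",
        "the striker scored twice", "the team won the cup"]
for lab, members in cluster_training_docs(docs, k=3).items():
    print(lab, members)  # train one NER model per cluster (toolkit-specific)
```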

    Virtual environments promoting interaction

    Virtual reality (VR) has been widely researched in academia and is now breaking into industry. Regular companies do not have access to this technology as a collaboration tool because these solutions usually require specific devices that are not at hand for the common user in offices. There are other collaboration platforms based on video, speech and text, but VR allows users to share the same 3D space. In this 3D space there can be added functionalities or information that would not be possible in a real-world environment, something intrinsic to VR. This dissertation has produced a 3D framework that promotes nonverbal communication, which plays a fundamental role in human interaction and is mostly based on emotion. In academia, confusion is known to influence learning gains if it is properly managed. We designed a study to evaluate how lexical, syntactic and n-gram features influence perceived confusion and found results (not statistically significant) suggesting that it is possible to build a machine learning model that can predict the level of confusion based on these features. This model was used to manipulate the script of a given presentation, and user feedback shows a trend that manipulating these features, and theoretically lowering the level of confusion in the text, not only reduces reported confusion but also increases the reported sense of presence. Another contribution of this dissertation comes from the intrinsic features of a 3D environment, where one can carry out actions that are not possible in the real world. We designed an automatic adaptive lighting system that reacts to the perceived user's engagement. This hypothesis was partially rejected, as the results run counter to what we hypothesized but lack statistical significance. Three lines of research may stem from this dissertation. First, more complex features, such as syntax trees, could be used to train the machine learning model. Also, in an Intelligent Tutoring System, this model could adjust the avatar's speech in real time if fed by a real-time confusion detector. In a social scenario, the set of basic emotions is well suited and can enrich it. Facial emotion recognition can extend this effect to the avatar's body to fuel this synchronization and increase the sense of presence. Finally, we based this dissertation on the premise of using ubiquitous devices, but with the rapid evolution of technology we should consider that new devices will be present in offices. This opens new possibilities for other modalities.
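    As an illustration of the adaptive lighting idea, here is a minimal control-loop sketch; the engagement signal, the brighten-when-disengaged mapping and the smoothing gain are all assumptions for illustration, not the dissertation's implementation:

```python
def update_light_intensity(current, engagement, gain=0.2, low=0.3, high=1.0):
    """Nudge room lighting toward a target derived from estimated engagement.

    Assumed mapping: lower engagement -> brighter light to regain attention.
    An exponential moving average keeps changes gradual and unobtrusive.
    """
    target = high - engagement * (high - low)
    return (1 - gain) * current + gain * target

intensity = 0.6
for engagement in [0.9, 0.7, 0.3, 0.2]:  # hypothetical detector readings in [0, 1]
    intensity = update_light_intensity(intensity, engagement)
    print(round(intensity, 3))
```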

    A Neural Network Approach to Aircraft Performance Model Forecasting

    Performance models used in the aircraft development process are dependent on the assumptions and approximations associated with the engineering equations used to produce them. The design and implementation of these highly complex engineering models are typically associated with a long development process. This study proposes a non-deterministic approach in which machine learning techniques using Artificial Neural Networks are used to predict specific aircraft parameters from available data. The approach yields results that are independent of the equations used in conventional aircraft performance modeling methods and relies on stochastic data and its distribution to extract useful patterns. To test the viability of the approach, a case study is performed comparing a conventional performance model describing the takeoff ground roll distance with the values generated by a neural network using readily available flight data. The neural network receives as input, and is trained using, aircraft performance parameters including atmospheric conditions (air temperature, air pressure, air density), performance characteristics (flap configuration, thrust setting, MTOW, etc.) and runway conditions (wet, dry, slope angle, etc.). The proposed predictive modeling approach can be tailored for use with a wider range of flight mission profiles such as climb, cruise, descent and landing.
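    A minimal sketch of this kind of data-driven performance model, using synthetic data and scikit-learn's MLPRegressor as a stand-in for the study's network; the features, the response function and the architecture are illustrative assumptions, not the study's flight data or model:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in features: [air temp C, pressure hPa, flap deg, weight t, slope %].
X = rng.uniform([-10,  950,  0, 50, -2],
                [ 40, 1040, 25, 80,  2], size=(500, 5))
# Hypothetical ground-roll response (metres): heavier and hotter -> longer roll.
y = 800 + 12 * X[:, 3] + 5 * X[:, 0] - 0.5 * (X[:, 1] - 1000) + rng.normal(0, 30, 500)

scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                     random_state=0).fit(scaler.transform(X), y)
sample = [[25, 1013, 15, 70, 0]]  # hot day, standard pressure, 70 t
print(model.predict(scaler.transform(sample)))
```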

    Comparative Evaluation of Translation Memory (TM) and Machine Translation (MT) Systems in Translation between Arabic and English

    In general, advances in translation technology tools have enhanced translation quality significantly. Unfortunately, however, it seems that this is not the case for all language pairs. A concern arises when the users of translation tools want to work between different language families such as Arabic and English. The main problems facing Arabic-English translation tools lie in Arabic’s characteristic free word order, richness of word inflection – including orthographic ambiguity – and optionality of diacritics, in addition to a lack of data resources. The aim of this study is to compare the performance of translation memory (TM) and machine translation (MT) systems in translating between Arabic and English. The research evaluates the two systems based on specific criteria relating to needs and expected results. The first part of the thesis evaluates the performance of a set of well-known TM systems when retrieving a segment of text that includes an Arabic linguistic feature. As it is widely known that TM matching metrics are based solely on edit-distance string measurements, it was expected that the aforementioned issues would lead to a low match percentage. The second part of the thesis evaluates the translation quality of multiple MT systems that use the mainstream neural machine translation (NMT) approach. Due to the lack of training data resources and Arabic’s rich morphology, it was anticipated that Arabic features would reduce the translation quality of this corpus-based approach. The systems’ output was evaluated using both automatic evaluation metrics, including BLEU and hLEPOR, and TAUS human quality ranking criteria for adequacy and fluency. The study employed a black-box testing methodology to experimentally examine the TM systems through a test suite instrument and to translate Arabic-English sentences to collect the MT systems’ output. A translation threshold was used to evaluate the fuzzy matches of the TM systems, while an online survey was used to collect participants’ responses to the quality of the MT systems’ output. The experimental input for both systems was extracted from Arabic-English corpora and examined by means of quantitative data analysis. The results show that, when retrieving translations, the current TM matching metrics are unable to recognise Arabic features and score them appropriately. In terms of automatic translation, MT produced good results for adequacy, especially when translating from Arabic to English, but the systems’ output appeared to need post-editing for fluency. Moreover, when retrieving from Arabic, it was found that short sentences were handled much better by MT than by TM. The findings may be given as recommendations to software developers.
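    To illustrate why edit-distance matching penalises Arabic inflection, here is a minimal sketch of a character-level fuzzy match score of the kind TM systems rely on; the Arabic sentence pair and the length normalisation are illustrative assumptions, not any specific TM product's metric:

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(segment, tm_entry):
    """Character-level similarity, a stand-in for a TM match percentage."""
    dist = levenshtein(segment, tm_entry)
    return 1 - dist / max(len(segment), len(tm_entry), 1)

# A few inflectional changes (masculine -> feminine forms) drop the score
# even though the sentences are near-paraphrases of each other.
print(fuzzy_match("كتب الولد الدرس", "كتبت البنت الدرس"))
```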