Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information
In computing, spell checking is the process of detecting, and sometimes
providing spelling suggestions for, incorrectly spelled words in a text.
Basically, a spell checker is a computer program that uses a dictionary of
words to perform spell checking. The larger the dictionary, the higher the
error detection rate. Because spell checkers are based on regular
dictionaries, they suffer from a data sparseness problem, as they cannot
capture a large vocabulary of words including proper names, domain-specific
terms, technical jargon and special acronyms. As a result, they
exhibit a low error detection rate and often fail to catch major errors in the
text. This paper proposes a new context-sensitive spelling correction method
for detecting and correcting non-word and real-word errors in digital text
documents. The approach hinges on statistics from the Google Web 1T 5-gram
data set, a large collection of n-gram word sequences extracted from the
World Wide Web. Fundamentally, the proposed method comprises an error
detector that detects misspellings, a candidate spellings generator based on a
character 2-gram model that generates correction suggestions, and an error
corrector that performs contextual error correction. Experiments conducted on a
set of text documents from different domains containing misspellings showed an
outstanding spelling error correction rate and a drastic reduction of both
non-word and real-word errors. In a further study, the proposed algorithm is to
be parallelized so as to lower the computational cost of the error detection
and correction processes.
Comment: LACSC - Lebanese Association for Computational Sciences - http://www.lacsc.or
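The candidate-generation step described above can be illustrated with a short sketch. This is a minimal toy, assuming a character 2-gram (bigram) overlap score and a small made-up lexicon; the paper's actual generator and its Web 1T statistics are not reproduced here.

```python
# Toy candidate-spellings generator based on character 2-grams.
# The lexicon, the threshold, and the Jaccard scoring are illustrative
# assumptions, not the paper's exact method.

def char_bigrams(word):
    """Set of character 2-grams of a word, with boundary markers."""
    padded = "#" + word + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def candidates(misspelling, dictionary, threshold=0.4):
    """Rank dictionary words by Jaccard similarity of character 2-grams."""
    source = char_bigrams(misspelling)
    scored = []
    for word in dictionary:
        target = char_bigrams(word)
        sim = len(source & target) / len(source | target)
        if sim >= threshold:
            scored.append((sim, word))
    return [w for _, w in sorted(scored, reverse=True)]

lexicon = ["spelling", "spilling", "speaking", "checker", "selling"]
print(candidates("speling", lexicon))  # "spelling" ranks first
```

In the full method, a contextual corrector would then re-rank these candidates using n-gram statistics around the misspelled token.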
Predicting the confusion level of text excerpts with syntactic, lexical and n-gram features
Distance learning, offline presentations (presentations that are pre-recorded rather than delivered live) and other activities whose main goal is to convey information are becoming increasingly relevant with digital media such as Virtual Reality (VR) and Massive Open Online Courses (MOOCs). While MOOCs are a well-established reality in the learning environment, VR is also being used to promote learning in virtual rooms, be it in academia or in industry. Oftentimes these methods are based on written scripts that take the learner through the content, making the scripts critical components of these tools. Given such an important role, it is essential to ensure the effectiveness of these scripts.
Confusion is a non-basic emotion associated with learning. It often stems from a cognitive disequilibrium caused either by the content itself or by the way it is conveyed, in terms of its syntactic and lexical features. We hereby propose a supervised model that can predict the likelihood of confusion an input text excerpt will cause in the learner. To achieve this, we performed syntactic and lexical analyses over 300 text excerpts and collected 5 confusion-level ratings (on a 0 – 6 scale) per excerpt from 51 annotators, using their respective means as labels. The excerpts that compose the dataset were collected from random presentation transcripts across various fields of knowledge. The learning model was trained on this data, with the results included in the body of the paper.
This model allows the design of clearer scripts for offline presentations and similar approaches, and we expect it to improve the effectiveness of these speeches. While this model is applied to this specific case, we hope to pave the way to generalizing this approach to other contexts where clarity of text is critical, such as the scripts of MOOCs or academic abstracts.
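The labeling and feature-extraction steps above can be sketched in a few lines. This is a minimal stand-in, assuming the label is the mean annotator rating and using a few cheap lexical features of our own choosing; the paper's actual feature set is richer.

```python
# Sketch of the labeling step: each excerpt's label is the mean of its
# annotators' 0-6 confusion ratings. The lexical features below are
# illustrative assumptions, not the paper's exact feature set.
import statistics

def label(ratings):
    """Mean of the annotators' confusion ratings for one excerpt."""
    return statistics.mean(ratings)

def lexical_features(text):
    """A few cheap lexical features a confusion model might use."""
    words = text.split()
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

ratings = [3, 4, 2, 5, 3]  # five annotators, 0-6 scale
print(label(ratings))      # 3.4
print(lexical_features("the gradient of the loss is computed per batch"))
```

A regressor trained on such feature vectors against the mean labels would then predict a confusion score for unseen excerpts.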
The effect of word similarity on N-gram language models in Northern and Southern Dutch
In this paper we examine several combinations of classical N-gram language models with more advanced and well-known techniques based on word similarity, such as cache models and Latent Semantic Analysis. We compare the efficiency of these combined models to a model that combines N-grams with the recently proposed, state-of-the-art neural-network-based continuous skip-gram. We discuss the strengths and weaknesses of each of these models based on their predictive power for the Dutch language, and find that a linear interpolation of a 3-gram, a cache model and a continuous skip-gram is capable of reducing perplexity by up to 18.63% compared to a 3-gram baseline. This is three times the reduction achieved with a 5-gram.
In addition, we investigate whether and in what way the effect of Southern Dutch training material on these combined models differs when evaluated on Northern and Southern Dutch material. Experiments on Dutch newspaper and magazine material suggest that N-grams are mostly influenced by the register and not so much by the language (variety) of the training material. Word similarity models, on the other hand, seem to perform best when they are trained on material in the same language (variety).
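The linear interpolation above has a simple form: the combined probability of a word is a weighted sum of the component models' probabilities. The sketch below shows that combination; the component probabilities and weights are invented for illustration, not taken from the paper.

```python
# Linear interpolation of language-model probabilities, as used above to
# combine a 3-gram, a cache model and a skip-gram model. The numbers here
# are made-up placeholders.

def interpolate(probs, weights):
    """P(w | h) as a weighted sum of component model probabilities."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(lam * p for lam, p in zip(weights, probs))

# Hypothetical probabilities for one word from the three component models:
p_trigram, p_cache, p_skipgram = 0.010, 0.030, 0.020
p = interpolate([p_trigram, p_cache, p_skipgram], [0.5, 0.2, 0.3])
print(round(p, 4))  # 0.017
```

In practice the interpolation weights are tuned on held-out data to minimize perplexity.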
Molecule Generation by Principal Subgraph Mining and Assembling
Molecule generation is central to a variety of applications. Recent work has
approached the generation task as subgraph prediction and assembly.
Nevertheless, these methods usually rely on hand-crafted or external subgraph
construction, and the subgraph assembly depends solely on local arrangement.
In this paper, we define a novel notion, the principal subgraph, that is
closely related to the informative patterns within molecules. Interestingly,
our proposed merge-and-update subgraph extraction method can automatically
discover frequent principal subgraphs from the dataset, which previous methods
are incapable of. Moreover, we develop a two-step subgraph assembling
strategy, which first predicts a set of subgraphs in a sequence-wise manner
and then assembles all generated subgraphs globally into the final output
molecule. Built upon a graph variational auto-encoder, our model is
demonstrated to be effective in terms of several evaluation metrics and
efficiency, compared with state-of-the-art methods on distribution learning
and (constrained) property optimization tasks.
Comment: Accepted by NeurIPS 202
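The merge-and-update extraction operates on molecular graphs; as a rough one-dimensional analogy (our illustration, not the paper's algorithm), the same idea applied to token sequences is BPE-style pair merging: repeatedly fuse the most frequent adjacent pair into a new unit.

```python
# 1-D analogy to merge-and-update subgraph extraction: one merge step fuses
# the most frequent adjacent pair across all sequences. The toy "molecules"
# below are invented atom-symbol sequences, not real chemistry.
from collections import Counter

def merge_step(sequences):
    """Find the most frequent adjacent pair and merge it everywhere."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)  # fuse the pair into one unit
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, a + b

seqs = [["C", "C", "O"], ["C", "C", "N"], ["C", "C", "O"]]
seqs, unit = merge_step(seqs)
print(unit, seqs)
```

Iterating this step grows a vocabulary of frequent units; the paper's contribution is doing the analogous merging directly on graph fragments, where adjacency is structural rather than sequential.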
Improving Robustness and Scalability of Available NER Systems
The focus of this research is to study and develop techniques to adapt existing NER resources to serve the needs of a broad range of organizations without expert NLP manpower. My methods emphasize the usability, robustness and scalability of existing NER systems to ensure maximum functionality for a broad range of organizations. Usability is facilitated by ensuring that the methodologies are compatible with any available open-source NER tagger or data set, thus allowing organizations to choose resources that are easy to deploy and maintain and that fit their requirements. One way of making use of available tagged data is to aggregate a number of different tagged sets in an effort to increase the coverage of the NER system. Although more tagged data can generally mean a more robust NER model, extra data also introduces a significant amount of noise and complexity into the model. Because adding additional training data to scale up an NER system presents a number of challenges in terms of scalability, this research aims to address these difficulties and provide a means for multiple available training sets to be aggregated while reducing noise, model complexity and training times.
In an effort to maintain usability, increase robustness and improve scalability, I designed an approach that merges document clustering of the training data with open-source or otherwise available NER software packages and tagged data that can be easily acquired and implemented. Here, a tagged training set is clustered into smaller data sets, and models are then trained on these smaller clusters. This is designed not only to reduce noise by creating more focused models, but also to increase scalability and robustness. Document clustering is used extensively in information retrieval, but has never been used in conjunction with NER.
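The cluster-then-train idea above can be sketched minimally: split tagged documents into clusters, train one model per cluster, and route a new document to its nearest cluster's model. The similarity measure, the centroids and the cluster names below are placeholder assumptions; in practice any open-source tagger would be trained per cluster.

```python
# Toy routing step for cluster-specific NER models. Jaccard similarity of
# bags of words stands in for a real clustering algorithm; the two
# "centroids" are invented example clusters.

def bow(text):
    return set(text.lower().split())

def nearest_cluster(doc, centroids):
    """Route a document to the cluster whose centroid it overlaps most."""
    d = bow(doc)
    scores = {name: len(d & c) / len(d | c) for name, c in centroids.items()}
    return max(scores, key=scores.get)

# Placeholder centroids built from two clusters of tagged training documents:
centroids = {
    "finance": bow("bank shares market quarterly earnings report"),
    "sports": bow("match goal season league coach player"),
}
print(nearest_cluster("the coach praised the player after the match", centroids))
```

Each cluster's smaller, more homogeneous training set is what the research argues reduces noise and training time relative to one monolithic model.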
Virtual environments promoting interaction
Virtual reality (VR) has been widely researched in the academic environment and is now breaking
into industry. Regular companies do not have access to this technology as a collaboration tool
because these solutions usually require specific devices that are not at hand for the common
office user. There are other collaboration platforms based on video, speech and text, but VR
allows users to share the same 3D space. In this 3D space, functionalities or information can be
added that would not be possible in a real-world environment, something intrinsic to VR.
This dissertation has produced a 3D framework that promotes nonverbal communication, which
plays a fundamental role in human interaction and is mostly based on emotion. In academia,
confusion is known to influence learning gains if it is properly managed. We designed a study to
evaluate how lexical, syntactic and n-gram features influence perceived confusion and found
results (not statistically significant) suggesting that it is possible to build a machine learning
model that predicts the level of confusion from these features. This model was used to manipulate
the script of a given presentation, and user feedback shows a trend that manipulating these
features, and theoretically lowering the level of confusion in the text, not only reduces the
reported confusion but also increases the reported sense of presence. Another contribution of this
dissertation comes from the intrinsic features of a 3D environment, where one can carry out
actions that are not possible in the real world. We designed an automatic adaptive lighting system
that reacts to the user's perceived engagement. This hypothesis was partially rejected: the results
go against what we hypothesized but do not have statistical significance.
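What an adaptive lighting rule reacting to engagement might look like can be sketched as follows; the linear mapping and its endpoints are our own assumptions for illustration, not the system the dissertation describes.

```python
# Hypothetical adaptive lighting rule: map a perceived engagement score in
# [0, 1] to a lamp intensity. The linear form and the 0.3-1.0 range are
# invented for illustration.

def light_intensity(engagement, low=0.3, high=1.0):
    """Map an engagement score in [0, 1] linearly to a lamp intensity."""
    engagement = min(max(engagement, 0.0), 1.0)  # clamp out-of-range scores
    return low + (high - low) * engagement

print(round(light_intensity(0.5), 2))  # 0.65
```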
Three lines of research may stem from this dissertation. First, more complex features, such as
syntax trees, could be used to train the machine learning model. Also, in an Intelligent Tutoring
System, this model could adjust the avatar's speech in real time if fed by a real-time confusion
detector. In a social scenario, the set of basic emotions is well suited and can enrich it.
Facial emotion recognition can extend this effect to the avatar's body to fuel this synchronization
and increase the sense of presence. Finally, we based this dissertation on the premise of using
ubiquitous devices, but with the rapid evolution of technology we should consider that new devices
will be present in offices. This opens new possibilities for other modalities.
A Neural Network Approach to Aircraft Performance Model Forecasting
Performance models used in the aircraft development process depend on the assumptions and approximations associated with the engineering equations used to produce them. The design and implementation of these highly complex engineering models are typically associated with a long development process. This study proposes a non-deterministic approach in which machine learning techniques using Artificial Neural Networks predict specific aircraft parameters from available data. The approach yields results that are independent of the equations used in conventional aircraft performance modeling methods and relies on stochastic data and its distribution to extract useful patterns. To test the viability of the approach, a case study is performed comparing a conventional performance model describing the takeoff ground roll distance with the values generated from a neural network using readily available flight data. The neural network receives as input, and is trained using, aircraft performance parameters including atmospheric conditions (air temperature, air pressure, air density), performance characteristics (flap configuration, thrust setting, MTOW, etc.) and runway conditions (wet, dry, slope angle, etc.). The proposed predictive modeling approach can be tailored for use with a wider range of flight mission profiles such as climb, cruise, descent and landing.
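The core idea, fitting a predictor to flight data rather than deriving it from engineering equations, can be shown with a deliberately simplified stand-in: a single linear neuron fit by gradient descent to synthetic takeoff-roll data. The data, the single feature and the linear form are our own simplifications, not the study's network or data set.

```python
# Simplified stand-in for the study's neural network: one linear neuron
# fit by gradient descent on mean squared error. The synthetic data plants
# roll = 2 * weight (both normalized), so training should recover w ≈ 2, b ≈ 0.

def train(xs, ys, lr=0.1, epochs=2000):
    """Fit y ≈ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Invented normalized data: takeoff ground roll grows with takeoff weight.
weights = [0.5, 0.7, 0.9, 1.1, 1.3]
rolls = [1.0, 1.4, 1.8, 2.2, 2.6]
w, b = train(weights, rolls)
print(round(w, 2), round(b, 2))
```

The study's actual model is a multi-layer network over many inputs (atmospheric, configuration and runway parameters), but the training loop follows the same gradient-descent principle.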
Comparative Evaluation of Translation Memory (TM) and Machine Translation (MT) Systems in Translation between Arabic and English
In general, advances in translation technology tools have significantly enhanced translation quality. Unfortunately, however, it seems that this is not the case for all language pairs. A concern arises when the users of translation tools want to work between different language families such as Arabic and English. The main problems facing Arabic-English translation tools lie in Arabic's characteristic free word order, richness of word inflection – including orthographic ambiguity – and optionality of diacritics, in addition to a lack of data resources. The aim of this study is to compare the performance of translation memory (TM) and machine translation (MT) systems in translating between Arabic and English. The research evaluates the two systems based on specific criteria relating to needs and expected results. The first part of the thesis evaluates the performance of a set of well-known TM systems when retrieving a segment of text that includes an Arabic linguistic feature. As TM matching metrics are widely known to be based solely on edit-distance string measurements, it was expected that the aforementioned issues would lead to a low match percentage. The second part of the thesis evaluates the translation quality of multiple MT systems that use the mainstream neural machine translation (NMT) approach. Due to a lack of training data resources and Arabic's rich morphology, it was anticipated that Arabic features would reduce the translation quality of this corpus-based approach. The systems' output was evaluated using both automatic evaluation metrics, including BLEU and hLEPOR, and TAUS human quality ranking criteria for adequacy and fluency. The study employed a black-box testing methodology to experimentally examine the TM systems through a test suite instrument and to translate Arabic-English sentences to collect the MT systems' output.
A translation threshold was used to evaluate the fuzzy matches of TM systems, while an online survey was used to collect participants' responses to the quality of the MT systems' output. The experiments' input for both systems was extracted from Arabic-English corpora and examined by means of quantitative data analysis. The results show that, when retrieving translations, the current TM matching metrics are unable to recognise Arabic features and score them appropriately. In terms of automatic translation, MT produced good results for adequacy, especially when translating from Arabic to English, but the systems' output appeared to need post-editing for fluency. Moreover, when retrieving from Arabic, it was found that short sentences were handled much better by MT than by TM. The findings may serve as recommendations for software developers.
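The edit-distance fuzzy matching that the TM evaluation rests on can be sketched directly: the match percentage is 100% minus the Levenshtein distance normalized by the longer segment's length. The two example segments are invented; commercial TM systems use variations of this formula.

```python
# Standard normalized-Levenshtein fuzzy match, the metric family the TM
# evaluation above targets. Example segments are made up.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(segment, tm_entry):
    """Match percentage: 100% minus normalized edit distance."""
    dist = levenshtein(segment, tm_entry)
    return 100 * (1 - dist / max(len(segment), len(tm_entry)))

print(round(fuzzy_match("the contract was signed", "the contract is signed"), 1))  # 91.3
```

Because this metric operates on surface characters only, morphologically rich variants of the same Arabic word can score far apart, which is the thesis's central criticism of current TM matching.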