A semi-automated FAQ retrieval system for HIV/AIDS

Author: Thuma Edwin
Publication date: 01/01/2015

This thesis describes a semi-automated FAQ retrieval system that can be queried by users through short text messages on low-end mobile phones to provide answers on HIV/AIDS related queries. First we address the issue of result presentation on low-end mobile phones by proposing an iterative interaction retrieval strategy where the user engages with the FAQ retrieval system in the question answering process. At each iteration, the system returns only one question-answer pair to the user and the iterative process terminates after the user's information need has been satisfied. Since the proposed system is iterative, this thesis attempts to reduce the number of iterations (search length) between the users and the system so that users do not abandon the search process before their information need has been satisfied. Moreover, we conducted a user study to determine the number of iterations that users are willing to tolerate before abandoning the iterative search process. We subsequently used the bad abandonment statistics from this study to develop an evaluation measure for estimating the probability that any random user will be satisfied when using our FAQ retrieval system. In addition, we used a query log and its click-through data to address three main FAQ document collection deficiency problems in order to improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. 
We draw conclusions about whether information from previous searches can reduce the rate at which users abandon their search before their information need has been satisfied. Specifically, we use this information to address the term mismatch problem between the users' SMS queries and the relevant FAQ documents in the collection; to selectively rank the FAQ documents according to how often users have previously identified them as relevant for a particular query term; and to identify those queries that do not have a relevant FAQ document in the collection. In particular, we proposed a novel template-based approach that uses queries from a query log for which the true relevant FAQ documents are known to enrich the FAQ documents with additional terms in order to alleviate the term mismatch problem. These terms are added as a separate field in a field-based model using two different proposed enrichment strategies, namely the Term Frequency and the Term Occurrence strategies. This thesis thoroughly investigates the effectiveness of these FAQ document enrichment strategies using three different field-based models. Our findings suggest that we can improve the overall recall and the probability that any random user will be satisfied by enriching the FAQ documents with additional terms from queries in our query log. Moreover, our investigation suggests that it is important to use an FAQ document enrichment strategy that takes into consideration the number of times a term occurs in the query when enriching the FAQ documents. We subsequently show that our proposed enrichment approach for alleviating the term mismatch problem generalises well to other datasets.
Through the evaluation of our proposed approach for selectively ranking the FAQ documents, we show that we can improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by incorporating the click popularity score of a query term t on an FAQ document d into the scoring and ranking process. Our results generalised well to a new dataset. However, when we deployed the click popularity score of a query term t on an FAQ document d on an enriched FAQ document collection, we saw a decrease in the retrieval performance and in the probability that any random user will be satisfied when using our FAQ retrieval system. Furthermore, we used our query log to build a binary classifier for detecting those queries that do not have a relevant FAQ document in the collection (Missing Content Queries (MCQs)). Before building such a classifier, we empirically evaluated several feature sets in order to determine the combination of features that yields the best classification accuracy in identifying the MCQs and the non-MCQs. Using a different dataset, we show that we can improve the overall retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by deploying an MCQ detection subsystem in our FAQ retrieval system to filter out the MCQs. Finally, this thesis demonstrates that correcting spelling errors can help improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. We tested our FAQ retrieval system with two different testing sets, one containing the original SMS queries and the other containing the SMS queries which were manually corrected for spelling errors. Our results show a significant improvement in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system.

Glasgow Theses Service

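Temu Kembali Informasi Berbasis Pemodelan Topik Menggunakan Kombinasi LSI dan VSM Pada Sistem Tanya-Jawab (Topic-Modeling-Based Information Retrieval Using a Combination of LSI and VSM in a Question-Answering System)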

Author: Bahri Syamsul
Publication date: 01/07/2018

To achieve good governance through e-government, central and local governments provide online question-answering services. These services are essential because they make information requests easier and accessible at any time, without waiting for office hours. The services are still operated manually, so a computerized question-answering system (QAS) needs to be developed. A QAS is made up of several elements or modules. One important element is information retrieval (IR), which is responsible for retrieving the documents relevant to the user's question (query). Two widely used methods for building an information retrieval system are the Vector Space Model (VSM) and Latent Semantic Indexing (LSI), both of which represent documents as vectors in a vector space. However, each method has its own limitations. This research therefore proposes a model that combines VSM and LSI to address some of the limitations of both. To find documents relevant to a query, the combined model first retrieves the documents that share a topic with the query using topic modeling, in this case LSI, and then sorts them by term similarity using VSM so that the documents with the highest similarity values are returned. To evaluate the performance of the combined model in finding relevant documents for a question-answering system, this research uses question-answer data from the Electronic Procurement System (SPSE) as experimental data.
The experimental results show that the proposed model improves on the precision of its base methods, the stand-alone LSI and VSM. The combined model (LSI+VSM) obtained precision at 1 (P@1)=0.7 with Mean Average Precision (MAP)=0.579, whereas the base methods obtained P@1=0.5 with MAP=0.237 for LSI, P@1=0.38 with MAP=0.247 for the traditional VSM, and P@1=0.44 with MAP=0.258 for VSM with professional weighting (VSM+PP).

ITS Repository

Cross-language Information Retrieval

Author: Galuščáková Petra
Nair Suraj
Oard Douglas W.
Publication date: 08/06/2022

Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions. Comment: 49 pages, 0 figures

arXiv.org e-Print Archive

An enhanced sequential exception technique for semantic-based text anomaly detection

Author: Taiye Mohammed Ahmed
Publication date: 01/01/2019

The detection of semantic-based text anomalies is a research area that has gained considerable attention from the data mining community. Text anomaly detection identifies information that deviates from the general information contained in documents. Text data are characterized by problems related to ambiguity, high dimensionality, sparsity and text representation. If these challenges are not properly resolved, identifying semantic-based text anomalies will be less accurate. This study proposes an Enhanced Sequential Exception Technique (ESET) to detect semantic-based text anomalies by achieving five objectives: (1) to modify the Sequential Exception Technique (SET) for processing unstructured text; (2) to optimize Cosine Similarity for identifying similar and dissimilar text data; (3) to hybridize the modified SET with Latent Semantic Analysis (LSA); (4) to integrate the Lesk and Selectional Preference algorithms for disambiguating senses and identifying the canonical form of text; and (5) to represent semantic-based text anomalies using First Order Logic (FOL) and Concept Network Graph (CNG). ESET performs text anomaly detection by employing optimized Cosine Similarity, hybridizing LSA with the modified SET, and integrating it with Word Sense Disambiguation algorithms, specifically Lesk and Selectional Preference. FOL and CNG are then proposed to represent the detected semantic-based text anomalies. To demonstrate the feasibility of the technique, experiments were run on four selected datasets, namely NIPS data, ENRON, Daily Kos blog, and 20Newsgroups. The experimental evaluation revealed that ESET significantly improved the accuracy of detecting semantic-based text anomalies in documents. ESET outperformed the benchmarked methods, with improved F1-scores on all datasets: NIPS data 0.75, ENRON 0.82, Daily Kos blog 0.93 and 20Newsgroups 0.97.
The results generated from ESET have proven to be significant and support a growing notion of semantic-based text anomaly that is increasingly evident in the existing literature. Practically, this study contributes to topic modelling and concept coherence for the purpose of visualizing information, knowledge sharing and optimized decision making.

Universiti Utara Malaysia: UUM eTheses

Applying latent semantic analysis to computer assisted assessment in the Computer Science domain: a framework, a tool, and an evaluation

Author: Haley Debra Trusso
Publication date: 01/01/2009

This dissertation argues that automated assessment systems can be useful for both students and educators provided that the results correspond well with human markers. Thus, evaluating such a system is crucial. I present an evaluation framework and show how and why it can be useful for both producers and consumers of automated assessment systems. The framework is a refinement of a research taxonomy that came out of the effort to analyse the literature review of systems based on Latent Semantic Analysis (LSA), a statistical natural language processing technique that has been used for automated assessment of essays. The evaluation framework can help developers publish their results in a format that is comprehensive, relatively compact, and useful to other researchers. The thesis claims that, in order to see a complete picture of an automated assessment system, certain pieces must be emphasised. It presents the framework as a jigsaw puzzle whose pieces join together to form the whole picture. The dissertation uses the framework to compare the accuracy of human markers and EMMA, the LSA-based assessment system I wrote as part of this dissertation. EMMA marks short, free text answers in the domain of computer science. I conducted a study of five human markers and then used the results as a benchmark against which to evaluate EMMA. An integral part of the evaluation was the success metric. The standard inter-rater reliability statistic was not useful; I located a new statistic and applied it to the domain of computer assisted assessment for the first time, as far as I know. Although EMMA exceeds human markers on a few questions, overall it does not achieve the same level of agreement with humans as humans do with each other. The last chapter maps out a plan for further research to improve EMMA

Open Research Online (The Open University)

OpenGrey Repository

EVALITA Evaluation of NLP and Speech Tools for Italian Proceedings of the Final Workshop

Author: Basile Pierpaolo
Cutugno Franco
Nissim Malvina
Patti Viviana
Sprugnoli Rachele
Publication venue: Torino
Publication date: 01/01/2016

Editor of the proceedings of EVALITA 2016

Archivio istituzionale della Ricerca - Università degli Studi di Parma

PubliCatt

Question Answering using Syntactic Patterns in a Contextual Search Engine

Author: Sand Kim Andre
Publication date: 01/01/2006

Question Answering (QA) systems promise to enhance both usability and accuracy when searching for knowledge. This thesis presents a prototype QA system built to leverage the extraction capabilities of a modern, context-aware search platform, Fast ESP. Questions in plain English are transformed into queries that target specific entities in the text corresponding to the identified answer types. A small set of unified patterns is shown to be adequate for classifying a wide variety of syntactic constructs. To verify the answers, a semantic lexicon is compiled using an automated procedure. The whole solution is based on pattern matching, which is presented as a viable alternative to deeper linguistic methods.

NORA - Norwegian Open Research Archives

A software based mentor system

Author: Marriott Andrew
Publication venue: Curtin University
Publication date: 01/01/2008

This thesis describes the architecture, implementation issues and evaluation of Mentor - an educational support system designed to mentor students in their university studies. Students can ask (by typing) natural language questions and Mentor will use several educational paradigms to present information from its Knowledge Base or from data-mined online Web sites to respond. Typically the questions focus on the student’s assignments or in their preparation for their examinations. Mentor is also pro-active in that it prompts the student with questions such as "Have you started your assignment yet?". If the student responds and enters into a dialogue with Mentor, then, based upon the student’s questions and answers, it guides them through a Directed Learning Path planned by the lecturer, specific to that assessment. The objectives of the research were to determine if such a system could be designed, developed and applied in a large-scale, real-world environment and to determine if the resulting system was beneficial to students using it. The study was significant in that it provided an analysis of the design and implementation of the system as well as a detailed evaluation of its use. This research integrated the Computer Science disciplines of network communication, natural language parsing, user interface design and software agents, together with pedagogies from the Computer Aided Instruction and Intelligent Tutoring System fields of Education. Collectively, these disciplines provide the foundation for the two main thesis research areas of Dialogue Management and Tutorial Dialogue Systems. The development and analysis of the Mentor System required the design and implementation of an easy to use text based interface as well as a hyper- and multi-media graphical user interface, a client-server system, and a dialogue management system based on an extensible kernel. 
The multi-user Java-based client-server system used Perl-5 Regular Expression pattern matching for Natural Language Parsing along with a state-based Dialogue Manager and a Knowledge Base marked up using the XML-based Virtual Human Markup Language. The kernel was also used in other Dialogue Management applications, such as computer-generated Talking Heads. The system also enabled a user to easily program their own knowledge into the Knowledge Base as well as to program new information retrieval or management tasks so that the system could grow with the user. The overall framework for integrating and managing the above components in a usable system employed suitable educational pedagogies that helped in the student’s learning process. The thesis outlines the learning paradigms used in three course-based Case Studies of university students’ perception of the system, and summarises their evaluation, to see how effective and useful the system was and whether students benefited from using it. This thesis demonstrates that Mentor met its objectives and was very successful in helping students with their university studies. As one participant indicated: ‘I couldn’t have done without it.’

espace@Curtin

On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

Author: Barrón Cedeño Luis Alberto
Publication venue: Universitat Politecnica de Valencia
Publication date: 08/06/2012

Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

RiuNet