    A semi-automated FAQ retrieval system for HIV/AIDS

    This thesis describes a semi-automated FAQ retrieval system that users can query through short text messages on low-end mobile phones to obtain answers to HIV/AIDS-related queries. First, we address the issue of result presentation on low-end mobile phones by proposing an iterative interaction retrieval strategy in which the user engages with the FAQ retrieval system in the question answering process. At each iteration, the system returns only one question-answer pair to the user, and the iterative process terminates once the user's information need has been satisfied. Since the proposed system is iterative, this thesis attempts to reduce the number of iterations (search length) between the users and the system so that users do not abandon the search process before their information need has been satisfied. Moreover, we conducted a user study to determine the number of iterations that users are willing to tolerate before abandoning the iterative search process. We subsequently used the bad abandonment statistics from this study to develop an evaluation measure for estimating the probability that any random user will be satisfied when using our FAQ retrieval system. In addition, we used a query log and its click-through data to address three main FAQ document collection deficiency problems in order to improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system.
Conclusions are derived concerning whether we can reduce the rate at which users abandon their search before their information need has been satisfied by using information from previous searches to: address the term mismatch problem between the users' SMS queries and the relevant FAQ documents in the collection; selectively rank the FAQ documents according to how often they have previously been identified as relevant by users for a particular query term; and identify those queries that do not have a relevant FAQ document in the collection. In particular, we proposed a novel template-based approach that uses queries from a query log, for which the true relevant FAQ documents are known, to enrich the FAQ documents with additional terms in order to alleviate the term mismatch problem. These terms are added as a separate field in a field-based model using two different proposed enrichment strategies, namely the Term Frequency and the Term Occurrence strategies. This thesis thoroughly investigates the effectiveness of these FAQ document enrichment strategies using three different field-based models. Our findings suggest that we can improve the overall recall and the probability that any random user will be satisfied by enriching the FAQ documents with additional terms from queries in our query log. Moreover, our investigation suggests that it is important to use an FAQ document enrichment strategy that takes into consideration the number of times a term occurs in the query when enriching the FAQ documents. We subsequently show that our proposed enrichment approach for alleviating the term mismatch problem generalises well to other datasets.
Through the evaluation of our proposed approach for selectively ranking the FAQ documents, we show that we can improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by incorporating the click popularity score of a query term t on an FAQ document d into the scoring and ranking process. Our results generalised well to a new dataset. However, when we deployed the click popularity score of a query term t on an FAQ document d on an enriched FAQ document collection, we saw a decrease in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. Furthermore, we used our query log to build a binary classifier for detecting those queries that do not have a relevant FAQ document in the collection (Missing Content Queries, MCQs). Before building such a classifier, we empirically evaluated several feature sets in order to determine the combination of features that yields the best classification accuracy in identifying the MCQs and the non-MCQs. Using a different dataset, we show that we can improve the overall retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by deploying an MCQ detection subsystem in our FAQ retrieval system to filter out the MCQs. Finally, this thesis demonstrates that correcting spelling errors can help improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. We tested our FAQ retrieval system with two different testing sets, one containing the original SMS queries and the other containing the SMS queries manually corrected for spelling errors. Our results show a significant improvement in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system.
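
The two enrichment strategies named in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the interpretation here is an assumption: Term Frequency adds a logged query term once per occurrence, while Term Occurrence records each distinct term only once. The function names, the whitespace tokenizer, and the sample query log are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real SMS normalisation."""
    return text.lower().split()


def enrich(faq_ids, query_log, strategy="tf"):
    """Build an extra enrichment field (term -> weight) for each FAQ document.

    query_log is an iterable of (query_text, relevant_faq_id) pairs, i.e.
    logged queries whose relevant FAQ document is known.
    strategy "tf" (assumed Term Frequency) counts every occurrence of a term;
    strategy "to" (assumed Term Occurrence) records each distinct term once.
    """
    if strategy not in ("tf", "to"):
        raise ValueError("strategy must be 'tf' or 'to'")
    fields = {faq_id: Counter() for faq_id in faq_ids}
    for query, faq_id in query_log:
        if faq_id not in fields:
            continue  # the log refers to an FAQ outside the collection
        for term in tokenize(query):
            if strategy == "tf":
                fields[faq_id][term] += 1
            else:
                fields[faq_id][term] = 1
    return fields


# Hypothetical query log: noisy SMS queries with their known relevant FAQ.
log = [
    ("hw is hiv spread", "faq1"),
    ("can hiv spread by kissing", "faq1"),
    ("what is arv", "faq2"),
]
tf_fields = enrich(["faq1", "faq2"], log, strategy="tf")
to_fields = enrich(["faq1", "faq2"], log, strategy="to")
print(tf_fields["faq1"]["hiv"], to_fields["faq1"]["hiv"])
```

In a field-based retrieval model, the resulting term weights would be indexed as a separate field alongside the original question and answer text, so that a misspelled query term such as "hw" can still match the FAQ it was previously clicked for.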

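    Temu Kembali Informasi Berbasis Pemodelan Topik Menggunakan Kombinasi LSI dan VSM Pada Sistem Tanya-Jawab (Topic-Modeling-Based Information Retrieval Using a Combination of LSI and VSM in a Question-Answering System)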

    To achieve good governance through e-government, central and local governments provide question-answering services through online systems. These services are essential because they make information requests easier and accessible at any time, without waiting for office service hours. The services are still run manually, so a computerized question-answering system (QAS) needs to be developed. A QAS is made up of several elements/modules. One important element is information retrieval (IR), which is responsible for retrieving the documents relevant to the user's question (query). Widely used methods for building information retrieval are the Vector Space Model (VSM) and Latent Semantic Indexing (LSI), both of which represent documents as vectors in a vector space. However, each method has its own limitations. This research therefore proposes a model that combines VSM and LSI to address some of the limitations of both. When searching for documents relevant to a query, the combination model first retrieves the documents that share the query's topic using topic modeling, in this case LSI, and then sorts them by term similarity using VSM so that the documents with the highest similarity values are returned. To evaluate how well the combination model finds relevant documents in a question-answering system, this research uses question-answer data from the Electronic Procurement System (Sistem Pengadaan Secara Elektronik, SPSE) as experimental data. The experimental results show that the proposed model improves on the precision of its base methods, i.e. stand-alone LSI and VSM. The combination model (LSI+VSM) obtained precision at 1 (P@1) = 0.7 with Mean Average Precision (MAP) = 0.579, whereas the base methods obtained P@1 = 0.5 with MAP = 0.237 for LSI, P@1 = 0.38 with MAP = 0.247 for the traditional VSM, and P@1 = 0.44 with MAP = 0.258 for VSM with professional weighting (VSM+PP).
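
The two metrics reported in this abstract, P@1 and MAP, can be computed as in the following minimal sketch. It uses the standard textbook definitions, which is an assumption, since the abstract does not say how the authors implemented their evaluation. The function names and the toy rankings are hypothetical and are not data from the study.

```python
def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked document is relevant, otherwise 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant document appears.

    Assumes the ranked list contains no duplicate document ids.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate(runs):
    """Return (P@1, MAP) averaged over a list of (ranked_list, relevant_set) pairs."""
    n = len(runs)
    p1 = sum(precision_at_1(r, rel) for r, rel in runs) / n
    mean_ap = sum(average_precision(r, rel) for r, rel in runs) / n
    return p1, mean_ap


# Toy example with two queries (illustrative only).
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5", "d6"], {"d5"}),        # AP = 1/2
]
p1, map_score = evaluate(runs)
print(p1, map_score)
```

P@1 only rewards getting the first answer right, which matters for a question-answering system that shows one answer, while MAP also credits relevant documents further down the ranking.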

    Cross-language Information Retrieval

    Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions. Comment: 49 pages, 0 figures

    An enhanced sequential exception technique for semantic-based text anomaly detection

    The detection of semantic-based text anomalies is a research area that has gained considerable attention from the data mining community. Text anomaly detection identifies information that deviates from the general information contained in documents. Text data are characterized by problems related to ambiguity, high dimensionality, sparsity and text representation. If these challenges are not properly resolved, identifying semantic-based text anomalies will be less accurate. This study proposes an Enhanced Sequential Exception Technique (ESET) to detect semantic-based text anomalies by achieving five objectives: (1) to modify the Sequential Exception Technique (SET) for processing unstructured text; (2) to optimize Cosine Similarity for identifying similar and dissimilar text data; (3) to hybridize the modified SET with Latent Semantic Analysis (LSA); (4) to integrate the Lesk and Selectional Preference algorithms for disambiguating senses and identifying the canonical form of the text; and (5) to represent semantic-based text anomalies using First Order Logic (FOL) and Concept Network Graph (CNG). ESET performs text anomaly detection by employing the optimized Cosine Similarity, hybridizing LSA with the modified SET, and integrating it with Word Sense Disambiguation algorithms, specifically Lesk and Selectional Preference. FOL and CNG are then used to represent the detected semantic-based text anomalies. To demonstrate the feasibility of the technique, experiments were run on four selected datasets, namely NIPS data, ENRON, Daily Kos blog, and 20Newsgroups. The experimental evaluation revealed that ESET significantly improved the accuracy of detecting semantic-based text anomalies in documents. ESET outperformed the benchmarked methods, with improved F1-scores on all datasets: NIPS data 0.75, ENRON 0.82, Daily Kos blog 0.93 and 20Newsgroups 0.97.
The results generated by ESET have proven to be significant and support a growing notion of semantic-based text anomaly that is increasingly evident in the existing literature. Practically, this study contributes to topic modelling and concept coherence for the purpose of visualizing information, knowledge sharing and optimized decision making.

    Question Answering using Syntactic Patterns in a Contextual Search Engine

    Question Answering (QA) systems promise to enhance both usability and accuracy when searching for knowledge. This thesis presents a prototype QA system built to leverage the extraction capabilities of a modern, context-aware search platform, Fast ESP. Questions in plain English are transformed into queries that target specific entities in the text corresponding to the identified answer types. A small set of unified patterns is shown to be adequate for classifying a wide variety of syntactic constructs. For the purpose of verifying the answers, a semantic lexicon is compiled using an automated procedure. The whole solution is based on pattern matching, which is presented as a viable alternative to deeper linguistic methods.
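
The idea of classifying a question by pattern and turning it into an entity-targeted query can be sketched as follows. This is only an illustration under stated assumptions: the patterns, answer-type labels, stopword list, and function names are hypothetical, and the sketch does not use or reproduce any Fast ESP API or the thesis's actual pattern set.

```python
import re

# Hypothetical surface patterns mapping question forms to expected answer types.
PATTERNS = [
    (re.compile(r"^who\b", re.IGNORECASE), "person"),
    (re.compile(r"^where\b", re.IGNORECASE), "location"),
    (re.compile(r"^(when|what year)\b", re.IGNORECASE), "date"),
    (re.compile(r"^how (many|much)\b", re.IGNORECASE), "quantity"),
]

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "who", "where", "when", "what", "how", "many", "much", "year",
    "is", "was", "did", "do", "does", "the", "a", "an", "of", "in",
}


def classify(question):
    """Return the answer type of the first matching pattern, or 'unknown'."""
    q = question.strip()
    for pattern, answer_type in PATTERNS:
        if pattern.search(q):
            return answer_type
    return "unknown"


def to_query(question):
    """Turn a question into (answer_type, keywords) for an entity-aware search."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords = [t for t in tokens if t not in STOPWORDS]
    return classify(question), keywords


print(to_query("Where is Oslo located?"))
```

A search platform with entity extraction could then restrict matches to passages that contain the keywords near an entity of the predicted type, which is the general shape of the pattern-matching approach the abstract describes.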

    A software based mentor system

    This thesis describes the architecture, implementation issues and evaluation of Mentor, an educational support system designed to mentor students in their university studies. Students can ask (by typing) natural language questions, and Mentor uses several educational paradigms to present information from its Knowledge Base or from data-mined online Web sites in response. Typically the questions focus on the student’s assignments or on preparation for their examinations. Mentor is also pro-active in that it prompts the student with questions such as "Have you started your assignment yet?". If the student responds and enters into a dialogue with Mentor, then, based upon the student’s questions and answers, it guides them through a Directed Learning Path planned by the lecturer, specific to that assessment. The objectives of the research were to determine whether such a system could be designed, developed and applied in a large-scale, real-world environment and whether the resulting system was beneficial to students using it. The study was significant in that it provided an analysis of the design and implementation of the system as well as a detailed evaluation of its use. This research integrated the Computer Science disciplines of network communication, natural language parsing, user interface design and software agents, together with pedagogies from the Computer Aided Instruction and Intelligent Tutoring System fields of Education. Collectively, these disciplines provide the foundation for the two main thesis research areas of Dialogue Management and Tutorial Dialogue Systems. The development and analysis of the Mentor System required the design and implementation of an easy-to-use text-based interface as well as a hyper- and multi-media graphical user interface, a client-server system, and a dialogue management system based on an extensible kernel.
The multi-user Java-based client-server system used Perl-5 Regular Expression pattern matching for Natural Language Parsing along with a state-based Dialogue Manager and a Knowledge Base marked up using the XML-based Virtual Human Markup Language. The kernel was also used in other Dialogue Management applications such as computer-generated Talking Heads. The system also enabled a user to easily program their own knowledge into the Knowledge Base as well as to program new information retrieval or management tasks so that the system could grow with the user. The overall framework for integrating and managing the above components into a usable system employed suitable educational pedagogies that helped in the student’s learning process. The thesis outlines the learning paradigms used in three course-based Case Studies of university students’ perception of the system, and summarises their evaluation, to see how effective and useful the system was and whether students benefited from using it. This thesis demonstrates that Mentor met its objectives and was very successful in helping students with their university studies. As one participant indicated: ‘I couldn’t have done without it.’

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012