5 research outputs found

    Extracting Semantics of Documents Using Semantic Header Generator

    Accurate representation of electronic information on the Internet provides a solid foundation for precise information retrieval. However, existing search systems tend to generate misses and false hits because they attempt to match the specified search terms without considering their context in the target information resource. Traditional keyword-based methods for representing the semantics of information items have become a major obstacle to high precision. In this paper, we propose the Semantic Header, which explicitly marks the logical structure of a document, as a replacement for keyword indexing when extracting the meaning of information resources. A search system could use the information from the Semantic Header to help locate appropriate documents with minimum effort. We also introduce an automatic tool, called the Automatic Semantic Header Generator (ASHG), for generating the meta-information for some significant fields of the Semantic Header.

    Identifikasi Kesamaan Pola Dokumen Teks Berdasarkan Kemunculan Term Dalam Kalimat (Identifying Pattern Similarity of Text Documents Based on Term Occurrence in Sentences)

    This dissertation aims to develop a tool for detecting pattern similarity between text documents based on the order in which terms appear in each sentence. Three term-occurrence patterns are examined: the first term, the first two terms, and the first three terms of each sentence in a text document. The results are methods for identifying and comparing document patterns: for the first term, the Kolmogorov-Smirnov test (K-S test) is used to distinguish patterns; for the first two terms, the Euclidean distance between the term pairs of the two text documents is used; and for the first three terms, a Bayesian Network (BN) approach with a likelihood ratio test is used. The first-term pattern with the K-S test yielded a pattern similarity of 66.67% according to the test document scenario. The first-term-pair pattern with the Euclidean distance yielded a pattern similarity of 93.33% according to the test document scenario. The first-three-term pattern with the Bayesian Network (BN) approach and the likelihood ratio test was 100% the same as the scenario. The three detection methods were shown to distinguish between several standard test documents.
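
The two-term comparison described in this abstract can be sketched roughly as follows. This is only an illustrative sketch, not the dissertation's implementation: the sentence splitting, the tokenisation, the use of relative frequencies, and the function names `first_pair_profile` and `euclidean_distance` are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def first_pair_profile(text):
    """Relative frequency of the first two terms of each sentence (assumed representation)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    pairs = Counter()
    for sentence in sentences:
        terms = re.findall(r"\w+", sentence.lower())
        if len(terms) >= 2:
            pairs[(terms[0], terms[1])] += 1
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two profiles; a pair missing from one profile counts as 0."""
    keys = set(profile_a) | set(profile_b)
    return math.sqrt(
        sum((profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) ** 2 for k in keys)
    )


doc_a = "The cat sat. The cat ran. A dog barked."
doc_c = "A dog barked. A dog slept."
distance = euclidean_distance(first_pair_profile(doc_a), first_pair_profile(doc_c))
```

A smaller distance means the two documents open their sentences with more similar term pairs; the threshold at which two patterns count as "the same" is not given in the abstract and is left open here.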

    InfoCrystal, a visual tool for information retrieval

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 1995. Includes bibliographical references (p. 227-232). By Anselm Spoerri.

    A framework for enhancing the query and medical record representations for patient search

    Electronic medical records (EMRs) are digital documents stored by medical institutions that detail the observed symptoms, the conducted diagnostic tests, the identified diagnoses and the prescribed treatments. These EMRs are being increasingly used worldwide to improve healthcare services. For example, when a doctor compiles the possible treatments for a patient showing some particular symptoms, it is advantageous to consult the information about patients who were previously treated for those same symptoms. However, finding patients with particular medical conditions is challenging, due to the implicit knowledge inherent within the patients' medical records and queries; such knowledge may be known by medical practitioners, but may be hidden from an information retrieval (IR) system. For instance, the mention of a treatment such as a drug may indicate to a practitioner that a particular diagnosis has been made for the patient, but this diagnosis may not be explicitly mentioned in the patient's medical records. Moreover, the use of negated language (e.g. 'without', 'no') to describe a medical condition of a patient (e.g. the patient has no fever) may cause a search system to erroneously retrieve that patient for a query when searching for patients with that medical condition (e.g. find patients with fever). This thesis focuses on enhancing the search of EMRs, with the aim of identifying patients with medical histories relevant to the medical conditions stated in a text query. During retrieval, a healthcare practitioner indicates a number of inclusion criteria describing the medical conditions of the patients of interest. To attain effective retrieval performance, we hypothesise that, in a patient search system, both the information needs and patients' histories should be represented based upon the medical decision process. 
In particular, this thesis argues that since the medical decision process typically encompasses four aspects (symptom, diagnostic test, diagnosis and treatment), a patient search system should take into account these aspects and apply inferences to recover the possible implicit knowledge. We postulate that considering these aspects and their derived implicit knowledge at three different levels of the retrieval process (namely, sentence, medical record and inter-record levels) enhances the retrieval performance. Indeed, we propose a novel framework that can gain insights from EMRs and queries, by modelling and reasoning upon information during retrieval in terms of the four aforementioned aspects at the three levels of the retrieval process, and can use these insights to enhance patient search. Firstly, at the sentence level, we extract the medical conditions in the medical records and queries. In particular, we propose to represent only the medical conditions related to the four medical aspects in order to improve the accuracy of our search system. In addition, we identify the context (negative/positive) of terms, which leads to an accurate representation of the medical conditions both in the EMRs and queries. In particular, we aim to prevent patients whose EMRs state the medical conditions in contexts different from those of the query from being ranked highly. For example, patients whose EMRs state "no history of dementia" should not be retrieved for a query searching for patients with dementia. Secondly, at the medical record level, using external knowledge-based resources (e.g. ontologies and health-related websites), we leverage the relationships between medical terms to infer the wider medical history of the patient in terms of the four medical aspects. 
In particular, we estimate the relevance of a patient to the query by exploiting association rules that we extract from the semantic relationships between medical terms using the four aspects of the medical process. For example, patients with a medical history involving a CABG surgery (treatment) can be inferred as relevant to a query searching for a patient suffering from heart disease (diagnosis), since a CABG surgery is a treatment of heart disease. Thirdly, at the inter-record level, we enhance the retrieval of patients in two ways. First, we exploit knowledge about how the four medical aspects are handled by different hospital departments to gain a better understanding of the appropriateness of EMRs created by different departments for a given query. We propose to aggregate EMRs at the department level (i.e. inter-record level) to extract implicit knowledge (i.e. the expertise of each department) and to model this department's expertise while ranking patients. For instance, patients having EMRs from the cardiology department are likely to be relevant to a query searching for patients who suffered from a heart attack. Second, as a medical query typically contains several medical conditions that the relevant patients should satisfy, we propose to explicitly model the relevance towards multiple query medical conditions in the EMRs related to a particular patient during retrieval. In particular, we rank highly those patients that match all the stated medical conditions in the query by adapting coverage-based diversification approaches originally proposed for the web search domain. Finally, we examine the combination of our aforementioned approaches that exploit the implicit knowledge at the three levels of the retrieval process to further improve the retrieval performance by adapting techniques from the fields of data fusion and machine learning. 
In particular, data fusion techniques, such as CombSUM and CombMNZ, are used to combine the relevance scores computed by the different approaches of the proposed framework. On the other hand, we deploy state-of-the-art learning to rank approaches (e.g. LambdaMART and AdaRank) to learn from a set of training data an effective combination of the relevance scores computed by the approaches of the framework. In addition, we introduce a novel selective ranking approach that uses a classifier to effectively apply one of the approaches of the framework on a per-query basis. This thesis draws insights from a thorough evaluation and analysis of the proposed framework using a standard test collection provided by the TREC Medical Records track. The experimental results show the effectiveness of the framework. In particular, the results demonstrate the importance of dealing with the implicit knowledge in patient search by focusing on the medical decision criteria aspects at the three levels of the retrieval process.
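
The CombSUM and CombMNZ fusion techniques named in this abstract can be illustrated with a minimal sketch. It follows the commonly used definitions (CombSUM adds a patient's normalised scores across runs; CombMNZ multiplies that sum by the number of runs that retrieved the patient). The min-max normalisation, the dictionary layout of a run, and the names `min_max_normalise`, `comb_sum` and `comb_mnz` are assumptions for this example, not details taken from the thesis.

```python
from collections import defaultdict


def min_max_normalise(scores):
    """Rescale one run's scores to [0, 1]; normalisation choice is an assumption."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {patient: 1.0 for patient in scores}
    return {patient: (s - lo) / (hi - lo) for patient, s in scores.items()}


def comb_sum(runs):
    """CombSUM: sum of a patient's normalised scores over all runs."""
    fused = defaultdict(float)
    for run in runs:
        for patient, score in min_max_normalise(run).items():
            fused[patient] += score
    return dict(fused)


def comb_mnz(runs):
    """CombMNZ: CombSUM score times the number of runs that retrieved the patient."""
    sums = comb_sum(runs)
    hits = defaultdict(int)
    for run in runs:
        for patient in run:
            hits[patient] += 1
    return {patient: sums[patient] * hits[patient] for patient in sums}


# Hypothetical relevance scores from two approaches of a framework.
runs = [
    {"p1": 3.0, "p2": 2.0, "p3": 1.0},
    {"p1": 0.8, "p3": 0.2},
]
ranking = sorted(comb_mnz(runs).items(), key=lambda item: item[1], reverse=True)
```

CombMNZ rewards patients retrieved by several approaches, whereas CombSUM treats a missing score simply as zero; which of the two works better is an empirical question that the abstract does not settle.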

    A probabilistic approach for automatic text filtering.

    Low Kon Fan. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 165-168). Abstract also in Chinese.

    Contents:
    Abstract
    Acknowledgment
    Chapter 1. Introduction
        1.1 Overview of Information Filtering
        1.2 Contributions
        1.3 Organization of this thesis
    Chapter 2. Existing Approaches
        2.1 Representational issues
            2.1.1 Document Representation
            2.1.2 Feature Selection
        2.2 Traditional Approaches
            2.2.1 NewsWeeder
            2.2.2 NewT
            2.2.3 SIFT
            2.2.4 InRoute
            2.2.5 Motivation of Our Approach
        2.3 Probabilistic Approaches
            2.3.1 The Naive Bayesian Approach
            2.3.2 The Bayesian Independence Classifier Approach
        2.4 Comparison
    Chapter 3. Our Bayesian Network Approach
        3.1 Backgrounds of Bayesian Networks
        3.2 Bayesian Network Induction Approach
        3.3 Automatic Construction of Bayesian Networks
    Chapter 4. Automatic Feature Discretization
        4.1 Predefined Level Discretization
        4.2 Lloyd's algorithm
        4.3 Class Dependence Discretization
    Chapter 5. Experiments and Results
        5.1 Document Collections
        5.2 Batch Filtering Experiments
        5.3 Batch Filtering Results
        5.4 Incremental Session Filtering Experiments
        5.5 Incremental Session Filtering Results
    Chapter 6. Conclusions and Future Work
    Appendix A
    Appendix B
    Appendix C
    Appendix D
    Appendix E