656 research outputs found

    Fisher's exact test explains a popular metric in information retrieval

    Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a term of interest in one out of many documents. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the one-tailed version of Fisher's exact test, also known as the hypergeometric test, corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument that suggests the tf-idf variant approximates the negative logarithm of the one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution tail probability). The Fisher's exact test interpretation of this common tf-idf variant furnishes the working statistician with a ready explanation of tf-idf's long-established effectiveness. Comment: 26 pages, 4 figures, 1 table, minor revision
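The comparison described in this abstract can be illustrated with a small sketch. It is not the authors' code: the token-level urn model (a document of `n` tokens drawn from a collection of `N` tokens, of which `K` are the term), the function names, and the choice of `tf * log(n_docs / df)` as the tf-idf variant are all assumptions made here for illustration, since the abstract does not spell them out.

```python
import math

def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws).

    This tail probability is the one-tailed Fisher's exact test P-value
    under the assumed urn model.
    """
    upper = min(K, n)
    total = math.comb(N, n)
    favourable = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total

def neg_log_p(k, K, n, N):
    """Negative natural logarithm of the one-tailed P-value."""
    return -math.log(hypergeom_tail(k, K, n, N))

def tf_idf(tf, df, n_docs):
    """One common tf-idf variant (an assumption): raw tf times log(n_docs / df)."""
    return tf * math.log(n_docs / df)
```

Both scores grow as the term becomes more frequent in the document relative to the collection, which is the qualitative behaviour the abstract relates. The sketch does not reproduce the paper's approximation argument.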

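    Developing a Packaging Design Concept for Hand Sanitizer Products with a Kansei Engineering Approach (original title: Pengembangan Konsep Desain Kemasan Produk Handsanitizer dengan Pendekatan Kansei Engineering)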

    Packaging plays a role in increasing consumer interest in purchasing a product. A packaging design that matches consumer preferences can be obtained by designing the packaging around consumers' emotions and feelings, which can also increase the product's competitiveness in the market. This study was conducted to identify consumer needs and to determine a visual design concept for a hand sanitizer product using the Kansei engineering approach. The methods used in this study are Term Frequency-Inverse Document Frequency (TF-IDF) and Principal Component Analysis (PCA). Identifying consumer needs for the hand sanitizer product yielded 25 Kansei words that represent the product, and extracting Kansei words for the hand sanitizer packaging with PCA indicated two design concepts: Eye catching and Practical.

    Detecting of a Patient's Condition From Clinical Narratives Using Natural Language Representation

    The rapid progress in clinical data management systems and artificial intelligence approaches enables the era of personalized medicine. Intensive care units (ICUs) are the ideal clinical research environment for such development because they collect large amounts of clinical data and are highly computerized environments. We designed a retrospective clinical study on a prospective ICU database using clinical natural language to help in the early diagnosis of heart failure in critically ill children. The methodology consisted of empirical experiments of a learning algorithm to learn the hidden interpretation and presentation of the French clinical note data. This study included 1386 patients' clinical notes with 5444 single lines of notes. There were 1941 positive cases (36% of total) and 3503 negative cases classified by two independent physicians using a standardized approach. The multilayer perceptron neural network outperforms other discriminative and generative classifiers. Consequently, the proposed framework yields an overall classification performance with 89% accuracy, 88% recall, and 89% precision. Furthermore, a generative autoencoder learning algorithm was proposed to leverage the sparsity reduction, which achieved 91% accuracy, 91% recall, and 91% precision. This study successfully applied learning representation and machine learning algorithms to detect heart failure from clinical natural language in a single French institution. Further work is needed to use the same methodology in other institutions and other languages. Comment: Submitting to IEEE Transactions on Biomedical Engineering. arXiv admin note: text overlap with arXiv:2104.0393

    Probabilistic retrieval models - relationships, context-specific application, selection and implementation

    PhD. Retrieval models are the core components of information retrieval systems, which guide the document and query representations, as well as the document ranking schemes. TF-IDF, the binary independence retrieval (BIR) model and language modelling (LM) are three of the most influential contemporary models due to their stability and performance. The BIR model and LM have probabilistic theory as their basis, whereas TF-IDF is viewed as a heuristic model, whose theoretical justification always fascinates researchers. This thesis firstly investigates the parallel derivation of the BIR model, LM and the Poisson model, with respect to event spaces, relevance assumptions and ranking rationales. It establishes a bridge between the BIR model and LM, and derives TF-IDF from the probabilistic framework. Then, the thesis presents the probabilistic logical modelling of the retrieval models. Various ways of estimating and aggregating probability, and alternative implementations of non-probabilistic operators, are demonstrated. Typical models have been implemented. The next contribution concerns the usage of context-specific frequencies, i.e., the frequencies counted based on assorted element types or within different text scopes. The hypothesis is that they can help to rank the elements in structured document retrieval. The thesis applies context-specific frequencies to term weighting schemes in these models, and the outcome is a generalised retrieval model with regard to both element and document ranking. The retrieval models behave differently on the same query set: for some queries, one model performs better; for other queries, another model is superior. Therefore, one idea to improve the overall performance of a retrieval system is to choose for each query the model that is likely to perform the best. This thesis proposes and empirically explores a model selection method based on the correlation between query features and query performance, which contributes to the methodology of dynamically choosing a model. In summary, this thesis contributes a study of probabilistic models and their relationships, the probabilistic logical modelling of retrieval models, the usage and effect of context-specific frequencies in models, and the selection of retrieval models.

    Weighting Passages Enhances Accuracy

    We observe that in curated documents the distribution of the occurrences of salient terms, e.g., terms with a high Inverse Document Frequency, is not uniform, and such terms are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of the document. We study a multiplicity of partitioning schemes of document content into passages and compute the collection-dependent weights associated with them on the basis of the distribution of occurrences of salient terms in documents. Moreover, we tune BM25P hyperparameters and investigate their impact on ad hoc document retrieval through fully reproducible experiments conducted using four publicly available datasets. Our findings demonstrate that our BM25P weighting model markedly and consistently outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1, and up to 21% in MRR
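The idea of scoring with "a linear combination of term statistics in the different portions of the document" can be sketched roughly as follows. This is one plausible reading, not the authors' implementation: it assumes equal-sized contiguous passages, replaces the term frequency with a weighted sum of per-passage frequencies, and plugs that into a standard BM25 term score with one common IDF variant. The function names and the example weights are hypothetical.

```python
import math

def split_passages(tokens, n_passages):
    """Split a token list into n_passages contiguous, nearly equal parts."""
    size = math.ceil(len(tokens) / n_passages) if tokens else 1
    return [tokens[i * size:(i + 1) * size] for i in range(n_passages)]

def weighted_tf(term, tokens, weights):
    """Linear combination of per-passage term frequencies (assumed form)."""
    passages = split_passages(tokens, len(weights))
    return sum(w * p.count(term) for w, p in zip(weights, passages))

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 score contribution of a single query term."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

# Hypothetical usage: weight the first and last passages more heavily,
# reflecting the observation that salient terms cluster at the document edges.
doc = "a b c d e f g h a a".split()
tf_p = weighted_tf("a", doc, weights=[2, 1, 1, 1, 2])
score = bm25_term(tf_p, df=10, n_docs=1000, doc_len=len(doc), avg_len=12.0)
```

With uniform weights of 1 the weighted frequency reduces to the ordinary term count, so this sketch falls back to plain BM25 in that case.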

    Automated image tagging through tag propagation

    Work presented as part of the Mestrado em Engenharia Informática (Master's in Computer Engineering), in partial fulfilment of the requirements for the degree of Master in Computer Engineering. Today, more and more data is becoming available on the Web. In particular, we have recently witnessed an exponential increase of multimedia content within various content sharing websites. While this content is widely available, great challenges have arisen to effectively search and browse such a vast amount of content. A solution to this problem is to annotate information, a task that without computer aid requires a large-scale human effort. The goal of this thesis is to automate the task of annotating multimedia information with machine learning algorithms. We propose the development of a machine learning framework capable of doing automated image annotation in large-scale consumer photos. To this end, a study on state-of-the-art algorithms was conducted, which concluded with a baseline implementation of a k-nearest neighbor algorithm. This baseline was used to implement a more advanced algorithm capable of annotating images in situations with limited training images and a large set of test images, that is, a semi-supervised approach. Further studies were conducted on the feature spaces used to describe images towards a successful integration in the developed framework. We first analyzed the semantic gap between the visual feature spaces and concepts present in an image, and how to avoid or mitigate this gap. Moreover, we examined how users perceive images by performing a statistical analysis of the image tags inserted by users. A linguistic and statistical expansion of image tags was also implemented. The developed framework withstands uneven data distributions that occur in consumer datasets, and scales accordingly, requiring little previously annotated data. The principal mechanism that allows easier scaling is the propagation of information between the annotated data and un-annotated data.

    Implications of Computational Cognitive Models for Information Retrieval

    This dissertation explores the implications of computational cognitive modeling for information retrieval. The parallel between information retrieval and human memory is that the goal of an information retrieval system is to find the set of documents most relevant to the query whereas the goal for the human memory system is to assess the relevance of items stored in memory given a memory probe (Steyvers & Griffiths, 2010). The two major topics of this dissertation are desirability and information scent. Desirability is the context independent probability of an item receiving attention (Recker & Pitkow, 1996). Desirability has been widely utilized in numerous experiments to model the probability that a given memory item would be retrieved (Anderson, 2007). Information scent is a context dependent measure defined as the utility of an information item (Pirolli & Card, 1996b). Information scent has been widely utilized to predict the memory item that would be retrieved given a probe (Anderson, 2007) and to predict the browsing behavior of humans (Pirolli & Card, 1996b). In this dissertation, I proposed the theory that desirability observed in human memory is caused by preferential attachment in networks. Additionally, I showed that documents accessed in large repositories mirror the observed statistical properties in human memory and that these properties can be used to improve document ranking. Finally, I showed that the combination of information scent and desirability improves document ranking over existing well-established approaches.