10 research outputs found

    Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study

    Get PDF
    Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task

    Evaluating Information Retrieval and Access Tasks

    Get PDF
    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one

    Mono- and cross-lingual paraphrased text reuse and extrinsic plagiarism detection

    Get PDF
    Text reuse is the act of borrowing text (either verbatim or paraphrased) from an earlier written text. It could occur within the same language (mono-lingual) or across languages (cross-lingual) where the reused text is in a different language than the original text. Text reuse and its related problem, plagiarism (the unacknowledged reuse of text), are becoming serious issues in many fields and research shows that paraphrased and especially the cross-lingual cases of reuse are much harder to detect. Moreover, the recent rise in readily available multi-lingual content on the Web and social media has increased the problem to an unprecedented scale. To develop, compare, and evaluate automatic methods for mono- and crosslingual text reuse and extrinsic (finding portion(s) of text that is reused from the original text) plagiarism detection, standard evaluation resources are of utmost importance. However, previous efforts on developing such resources have mostly focused on English and some other languages. On the other hand, the Urdu language, which is widely spoken and has a large digital footprint, lacks resources in terms of core language processing tools and corpora. With this consideration in mind, this PhD research focuses on developing standard evaluation corpora, methods, and supporting resources to automatically detect mono-lingual (Urdu) and cross-lingual (English-Urdu) cases of text reuse and extrinsic plagiarism This thesis contributes a mono-lingual (Urdu) text reuse corpus (COUNTER Corpus) that contains real cases of Urdu text reuse at document-level. Another contribution is the development of a mono-lingual (Urdu) extrinsic plagiarism corpus (UPPC Corpus) that contains simulated cases of Urdu paraphrase plagiarism. Evaluation results, by applying a wide range of state-of-the-art mono-lingual methods on both corpora, shows that it is easier to detect verbatim cases than paraphrased ones. Moreover, the performance of these methods decreases considerably on real cases of reuse. A couple of supporting resources are also created to assist methods used in the cross-lingual (English-Urdu) text reuse detection. A large-scale multi-domain English-Urdu parallel corpus (EUPC-20) that contains parallel sentences is mined from the Web and several bi-lingual (English-Urdu) dictionaries are compiled using multiple approaches from different sources. Another major contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus (TREU Corpus). It contains English to Urdu real cases of text reuse at the document-level. A diversified range of methods are applied on the TREU Corpus to evaluate its usefulness and to show how it can be utilised in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse. A new cross-lingual method is also proposed that uses bilingual word embeddings to estimate the degree of overlap amongst text documents by computing the maximum weighted cosine similarity between word pairs. The overall low evaluation results indicate that it is a challenging task to detect crosslingual real cases of text reuse, especially when the language pairs have unrelated scripts, i.e., English-Urdu. However, an improvement in the result is observed using a combination of methods used in the experiments. The research work undertaken in this PhD thesis contributes corpora, methods, and supporting resources for the mono- and cross-lingual text reuse and extrinsic plagiarism for a significantly under-resourced Urdu and English-Urdu language pair. It highlights that paraphrased and cross-lingual cross-script real cases of text reuse are harder to detect and are still an open issue. Moreover, it emphasises the need to develop standard evaluation and supporting resources for under-resourced languages to facilitate research in these languages. The resources that have been developed and methods proposed could serve as a framework for future research in other languages and language pairs

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Get PDF
    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT

    Tune your brown clustering, please

    Get PDF
    Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parametre tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal

    Arabic named entity recognition

    Full text link
    En esta tesis doctoral se describen las investigaciones realizadas con el objetivo de determinar las mejores tecnicas para construir un Reconocedor de Entidades Nombradas en Arabe. Tal sistema tendria la habilidad de identificar y clasificar las entidades nombradas que se encuentran en un texto arabe de dominio abierto. La tarea de Reconocimiento de Entidades Nombradas (REN) ayuda a otras tareas de Procesamiento del Lenguaje Natural (por ejemplo, la Recuperacion de Informacion, la Busqueda de Respuestas, la Traduccion Automatica, etc.) a lograr mejores resultados gracias al enriquecimiento que a~nade al texto. En la literatura existen diversos trabajos que investigan la tarea de REN para un idioma especifico o desde una perspectiva independiente del lenguaje. Sin embargo, hasta el momento, se han publicado muy pocos trabajos que estudien dicha tarea para el arabe. El arabe tiene una ortografia especial y una morfologia compleja, estos aspectos aportan nuevos desafios para la investigacion en la tarea de REN. Una investigacion completa del REN para elarabe no solo aportaria las tecnicas necesarias para conseguir un alto rendimiento, sino que tambien proporcionara un analisis de los errores y una discusion sobre los resultados que benefician a la comunidad de investigadores del REN. El objetivo principal de esta tesis es satisfacer esa necesidad. Para ello hemos: 1. Elaborado un estudio de los diferentes aspectos del arabe relacionados con dicha tarea; 2. Analizado el estado del arte del REN; 3. Llevado a cabo una comparativa de los resultados obtenidos por diferentes tecnicas de aprendizaje automatico; 4. Desarrollado un metodo basado en la combinacion de diferentes clasificadores, donde cada clasificador trata con una sola clase de entidades nombradas y emplea el conjunto de caracteristicas y la tecnica de aprendizaje automatico mas adecuados para la clase de entidades nombradas en cuestion. Nuestros experimentos han sido evaluados sobre nueve conjuntos de test.Benajiba, Y. (2009). Arabic named entity recognition [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318Palanci
    corecore