3 research outputs found

    Accessing Information Based on a Combination of Document Structure and Content: Exploiting XML tags in indexing and searching to enhance content retrieval of online document-centric XML encoded texts

    This study explores the challenges of using traditional information retrieval methods to retrieve document-centric XML encoded text. It demonstrates how coupling structure and content in query and index formulation improves retrieval performance. Native XML database (NXD) and search engine technologies were evaluated in a baseline experiment, and in a second test after alterations were made to their respective indexes. Documents were retrieved for simple and complex forms of 30 XPath and keyword queries from a corpus of 95 XML/TEI encoded texts. Overall results indicated that query augmentation using document structure improves retrieval performance. Complex queries submitted to the NXD produced the most satisfying results, with an average precision of 93.3% and an average recall of 86.3%. Performance improvements were also achieved using complex, structured queries and indexes in the search engine. Study findings suggest that effective XML retrieval models might result from a combination of unstructured and structured retrieval techniques.
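The contrast between content-only and structure-plus-content querying described in this abstract can be illustrated with a minimal sketch. It is not the study's method: the three-document corpus, the `div`/`type` markup, the relevance judgment, and the function names are all hypothetical, and Python's standard `xml.etree.ElementTree` stands in for the NXD and search engine that the study evaluated.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like corpus, invented for illustration (not from the study).
CORPUS = {
    "doc1": "<text><div type='letter'><p>The storm delayed our ship.</p></div></text>",
    "doc2": "<text><div type='poem'><p>A storm of petals fell.</p></div></text>",
    "doc3": "<text><div type='letter'><p>We arrived in calm weather.</p></div></text>",
}

def keyword_search(term):
    """Content-only retrieval: match the term anywhere in the document text."""
    hits = set()
    for doc_id, xml in CORPUS.items():
        root = ET.fromstring(xml)
        text = " ".join(root.itertext()).lower()
        if term.lower() in text:
            hits.add(doc_id)
    return hits

def structured_search(term, div_type):
    """Structure plus content: match the term only inside <div type=...>."""
    hits = set()
    for doc_id, xml in CORPUS.items():
        root = ET.fromstring(xml)
        # ElementTree supports this limited XPath attribute predicate.
        for div in root.findall(f".//div[@type='{div_type}']"):
            text = " ".join(div.itertext()).lower()
            if term.lower() in text:
                hits.add(doc_id)
    return hits

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Assumed information need: letters that mention a storm.
RELEVANT = {"doc1"}

if __name__ == "__main__":
    print(precision_recall(keyword_search("storm"), RELEVANT))
    print(precision_recall(structured_search("storm", "letter"), RELEVANT))
```

In this toy setting the keyword query also retrieves the poem, while the structured query restricts matching to letters. That is the kind of precision gain the abstract attributes to query augmentation with document structure.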

    Traducción automática de conjuntos de datos para la construcción de sistemas de pregunta/respuesta mediante aprendizaje automático [Machine translation of datasets for building question-answering systems with machine learning]

    Question-answering (QA) systems built with machine learning techniques require datasets. Today such datasets exist for widely spoken languages such as English or Chinese, but not for more local languages such as Basque. The aim of this work is to start from English datasets and create their respective versions in Spanish, Basque and Catalan. Once these datasets have been created using machine translation techniques, a predictive QA model is trained for each language. The work considers not only models trained on a single language but also those that use multilingual methods (for example, training in one language and validating in another), since the latter currently carry considerable weight in the state of the art of natural language processing and, specifically, in QA tasks.
    López Ramírez, JA. (2020). Traducción automática de conjuntos de datos para la construcción de sistemas de pregunta/respuesta mediante aprendizaje automático. http://hdl.handle.net/10251/151719 (TFG)

    Question Answering Using XML-Tagged Documents

    The official submission for CL Research's question-answering system (DIMAP-QA) for TREC-11 only slightly extends its semantic relation triple (logical form) technology in which documents are fully parsed and databases built around discourse entities. We were unable to complete the planned revision of our system based on a fuller discourse analysis of the texts. We have since implemented many of these changes and can now report preliminary and encouraging results of basing our system on XML markup of texts with syntactic and semantic attributes and use of XML stylesheet functionality (specifically, XPath expressions) to answer questions. The official confidence-weighted score for the main TREC-11 QA task was 0.049, based on processing 20 of the top 50 documents provided by NIST. Our estimated mean reciprocal rank was 0.128 for the exact answers and 0.227 for sentence answers, comparable to our results from previous years. With our revised XML-based system, using a 20 percent sample of the TREC questions, we have an estimated confidence-weighted score of 0.869 and mean reciprocal rank of 0.828. We describe our system and examine the results from XML tagging in terms of question-answering and other applications such as information extraction, text summarization, novelty studies, and investigation of linguistic phenomena.
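The two evaluation measures quoted in this abstract can be sketched as follows. The sketch assumes the usual definitions: mean reciprocal rank averages 1/rank of the first correct answer per question, and the TREC-11 confidence-weighted score averages, over questions sorted from most to least confident, the fraction answered correctly up to each position. The function names and sample inputs are hypothetical and are not taken from DIMAP-QA.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank of the first correct answer per question.

    Each entry is a 1-based rank, or None when no correct answer was returned
    (such a question contributes 0).
    """
    total = sum(1.0 / rank for rank in first_correct_ranks if rank)
    return total / len(first_correct_ranks)

def confidence_weighted_score(correct_flags):
    """Confidence-weighted score for one answer per question.

    correct_flags holds one boolean per question, ordered from the answer the
    system is most confident in to the one it is least confident in.
    """
    running_correct = 0
    total = 0.0
    for position, is_correct in enumerate(correct_flags, start=1):
        running_correct += int(is_correct)
        total += running_correct / position  # precision within the first `position` answers
    return total / len(correct_flags)

if __name__ == "__main__":
    # Hypothetical judgments, for illustration only.
    print(mean_reciprocal_rank([1, 2, None, 4]))
    print(confidence_weighted_score([True, True, False, False]))
    print(confidence_weighted_score([False, False, True, True]))
```

Both sample runs of the confidence-weighted score have two correct answers out of four, but the run that ranks its correct answers first scores higher. The measure rewards systems whose confidence estimates track correctness, not just those that answer more questions.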