Search CORE

32 research outputs found

A Hybrid Approach Towards Two Stage Bengali Question Classification Utilizing Smart Data Balancing Technique

Author: A Aouichat
A Mohasseb
AM Hasan
F Mohammed
S Sekine
X Li
Y Li
Y Liu
Y Zheng-tao
Z Yu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/03/2020
Field of study

Question classification (QC) is the primary step of the Question Answering (QA) system. Question Classification (QC) system classifies the questions in particular classes so that Question Answering (QA) System can provide correct answers for the questions. Our system categorizes the factoid type questions asked in natural language after extracting features of the questions. We present a two stage QC system for Bengali. It utilizes one dimensional convolutional neural network for classifying questions into coarse classes in the first stage. Word2vec representation of existing words of the question corpus have been constructed and used for assisting 1D CNN. A smart data balancing technique has been employed for giving data hungry convolutional neural network the advantage of a greater number of effective samples to learn from. For each coarse class, a separate Stochastic Gradient Descent (SGD) based classifier has been used in order to differentiate among the finer classes within that coarse class. TF-IDF representation of each word has been used as feature for the SGD classifiers implemented as part of second stage classification. Experiments show the effectiveness of our proposed method for Bengali question classification

arXiv.org e-Print Archive

Crossref

Tratamiento lingüístico de las preguntas en español en los sistemas de búsqueda de respuestas / Linguistic treatment of questions in Spanish for question classification in question answering systems

Author: Olvera-Lobo María-Dolores
Robinson-García Nicolás
Publication venue: 'Ediciones Profesionales de la Informacion SL'
Publication date: 01/07/2009
Field of study

We propose a procedure for the linguistic treatment of Spanish questions as a step prior to their classification in question answering systems. The main types of question answering systems and their basic architecture are described. We review the principal question classification taxonomies used to date and the different fields from which they have been derived. Finally, we present the stages of linguistic analysis that the text of questions in question answering systems should be subject to in order to facilitate the location of appropriate answers

E-LIS

Bootstrapping named entity resources for adaptive question answering systems

Author: Pablo Sánchez César de
Publication venue
Publication date: 22/10/2010
Field of study

Los Sistemas de Búsqueda de Respuestas (SBR) amplían las capacidades de un buscador de información tradicional con la capacidad de encontrar respuestas precisas a las preguntas del usuario. El objetivo principal es facilitar el acceso a la información y disminuir el tiempo y el esfuerzo que el usuario debe emplear para encontrar una información concreta en una lista de documentos relevantes. En esta investigación se han abordado dos trabajos relacionados con los SBR. La primera parte presenta una arquitectura para SBR en castellano basada en la combinación y adaptación de diferentes técnicas de Recuperación y de Extracción de Información. Esta arquitectura está integrada por tres módulos principales que incluyen el análisis de la pregunta, la recuperación de pasajes relevantes y la extracción y selección de respuestas. En ella se ha prestado especial atención al tratamiento de las Entidades Nombradas puesto que, con frecuencia, son el tema de las preguntas o son buenas candidatas como respuestas. La propuesta se ha encarnado en el SBR del grupo MIRACLE que ha sido evaluado de forma independiente durante varias ediciones en la tarea compartida CLEF@QA, parte del foro de evaluación competitiva Cross-Language Evaluation Forum (CLEF). Se describen aquí las participaciones y los resultados obtenidos entre 2004 y 2007. El SBR de MIRACLE ha obtenido resultados moderados en el desempeño de la tarea con tasas de respuestas correctas entre el 20% y el 30%. Entre los resultados obtenidos destacan los de la tarea principal de 2005 y la tarea piloto de Búsqueda de Respuestas en tiempo real de 2006, RealTimeQA. Esta última tarea, además de requerir respuestas correctas incluía el tiempo de respuesta como un factor adicional en la evaluación. Estos resultados respaldan la validez de la arquitectura propuesta como una alternativa viable para los SBR sobre colecciones textuales y también corrobora resultados similares para el inglés y otras lenguas. Por otro lado, el análisis de los resultados a lo largo de las diferentes ediciones de CLEF así como la comparación con otros SBR apunta nuevos problemas y retos. Según nuestra experiencia, los sistemas de QA son más complicados de adaptar a otros dominios y lenguas que los sistemas de Recuperación de Información. Este problema viene heredado del uso de herramientas complejas de análisis de lenguaje como analizadores morfológicos, sintácticos y semánticos. Entre estos últimos se cuentan las herramientas para el Reconocimiento y Clasificación de Entidades Nombradas (NERC en inglés) así como para la Detección y Clasificación de Relaciones (RDC en inglés). Debido a la di cultad de adaptación del SBR a distintos dominios y colecciones, en la segunda parte de esta tesis se investiga una propuesta diferente basada en la adquisición de conocimiento mediante métodos de aprendizaje ligeramente supervisado. El objetivo de esta investigación es adquirir recursos semánticos útiles para las tareas de NERC y RDC usando colecciones de textos no anotados. Además, se trata de eliminar la dependencia de herramientas de análisis lingüístico con el fin de facilitar que las técnicas sean portables a diferentes dominios e idiomas. En primer lugar, se ha realizado un estudio de diferentes algoritmos para NERC y RDC de forma semisupervisada a partir de unos pocos ejemplos (bootstrapping). Este trabajo propone primero una arquitectura común y compara diferentes funciones que se han usado en la evaluación y selección de resultados intermedios, tanto instancias como patrones. La principal propuesta es un nuevo algoritmo que permite la adquisición simultánea e iterativa de instancias y patrones asociados a una relación. Incluye también la posibilidad de adquirir varias relaciones de forma simultánea y mediante el uso de la hipótesis de exclusividad obtener mejores resultados. Como característica distintiva el algoritmo explora la colección de textos con una estrategia basada en indización, que permite adquirir conocimiento de grandes colecciones. La estrategia de selección de candidatos y la evaluación se basan en la construcción de un grafo de instancias y patrones, que justifica nuestro método para la selección de candidatos. Este procedimiento es semejante al frente de exploración de una araña web y permite encontrar las instancias más parecidas a las semillas con las evidencias disponibles. Este algoritmo se ha implementado en el sistema SPINDEL y para su evaluación se ha comenzado con el caso concreto de la adquisición de recursos para las clases de Entidades Nombradas más comunes, Persona, Lugar y Organización. El objetivo es adquirir nombres asociados a cada una de las categorías así como patrones contextuales que permitan detectar menciones asociadas a una clase. Se presentan resultados para la adquisición de dos idiomas distintos, castellano e inglés, y para el castellano, en dos dominios diferentes, noticias y textos de una enciclopedia colaborativa, Wikipedia. En ambos casos el uso de herramientas de análisis lingüístico se ha limitado de acuerdo con el objetivo de avanzar hacia la independencia de idioma. Las listas adquiridas mediante bootstrapping parten de menos de 40 semillas por clase y obtienen del orden de 30.000 instancias de calidad variable. Además se obtienen listas de patrones indicativos asociados a cada clase de entidad. La evaluación indirecta confirma la utilidad de ambos recursos en la clasificación de Entidades Nombradas usando un enfoque simple basado únicamente en diccionarios. La mejor configuración obtiene para la clasificación en castellano una medida F de 67,17 y para inglés de 55,99. Además se confirma la utilidad de los patrones adquiridos que en ambos casos ayudan a mejorar la cobertura. El módulo requiere menor esfuerzo de desarrollo que los enfoques supervisados, si incluimos la necesidad de anotación, aunque su rendimiento es inferior por el momento. En definitiva, esta investigación constituye un primer paso hacia el desarrollo de aplicaciones semánticas como los SBR que requieran menos esfuerzo de adaptación a un dominio o lenguaje nuevo.-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Question Answering (QA) systems add new capabilities to traditional search engines with the ability to find precise answers to user questions. Their objective is to enable easier information access by reducing the time and effort that the user requires to find a concrete information among a list of relevant documents. In this thesis we have carried out two works related with QA systems. The first part introduces an architecture for QA systems for Spanish which is based on the combination and adaptation of different techniques from Information Retrieval (IR) and Information Extraction (IE). This architecture is composed by three modules that include question analysis, relevant passage retrieval and answer extraction and selection. The appropriate processing of Named Entities (NE) has received special attention because of their importance as question themes and candidate answers. The proposed architecture has been implemented as part of the MIRACLE QA system. This system has taken part in independent evaluations like the CLEF@QA track in the Cross-Language Evaluation Forum (CLEF). Results from 2004 to 2007 campaigns as well as the details and the evolution of the system have been described in deep. The MIRACLE QA system has obtained moderate performance with a first answer accuracy ranging between 20% and 30%. Nevertheless, it is important to highlight the results obtained in the 2005 main QA task and the RealTimeQA pilot task in 2006. The last one included response time as an important additional variable of the evaluation. These results back the proposed architecture as an option for QA from textual collection and confirm similar findings obtained for English and other languages. On the other hand, the analysis of the results along evaluation campaigns and the comparison with other QA systems point problems with current systems and new challenges. According to our experience, it is more dificult to tailor QA systems to different domains and languages than IR systems. The problem is inherited by the use of complex language analysis tools like POS taggers, parsers and other semantic analyzers, like NE Recognition and Classification (NERC) and Relation Detection and Characterization (RDC) tools. The second part of this thesis tackles this problem and proposes a different approach to adapting QA systems for di erent languages and collections. The proposal focuses on acquiring knowledge for the semantic analyzers based on lightly supervised approaches. The goal is to obtain useful resources that help to perform NERC or RDC using as few annotated resources as possible. Besides, we try to avoid dependencies from other language analysis tools with the purpose that these methods apply to different languages and domains. First of all, we have study previous work on building NERC and RDC modules with few supervision, particularly bootstrapping methods. We propose a common framework for different bootstrapping systems that help to unify different evaluation functions for intermediate results. The main proposal is a new algorithm that is able to simultaneously acquire instances and patterns associated to a relation of interest. It also uses mutual exclusion among relations to reduce concept drift and achieve better results. A distinctive characteristic is that it uses a query based exploration strategy of the text collection which enables their use for larger collections. Candidate selection and evaluation are based on incrementally building a graph of instances and patterns which also justifies our evaluation function. The discovery approach is analogous to the front of exploration in a web crawler and it is able to find the most similar instances to the available seeds. This algorithm has been implemented in the SPINDEL system. We have selected for evaluation the task of acquiring resources for the most common NE classes, Person, Location and Organization. The objective is to acquire name instances that belong to any of the classes as well as contextual patterns that help to detect mentions of NE that belong to that class. We present results for the acquisition of resources from raw text from two different languages, Spanish and English. We also performed experiments for Spanish in two different collections, news and texts from a collaborative encyclopedia, Wikipedia. Both cases are tackled with limited language analysis tools and resources. With an initial list of 40 instance seeds, the bootstrapping process is able to acquire large name lists containing up to 30.000 instances with a variable quality. Besides, large lists of indicative patterns are obtained too. Our indirect evaluation confirms the utility of both resources to classify NE using a simple dictionary recognition approach. Best results for Spanish obtained a F-score of 67,17 and for English this value is 55,99. The module requires much less development effort than annotation for supervised algorithms although the performance is not in pair yet. This research is a first step towards the development of semantic applications like QA for a new language or domain with no annotated corpora that requires less adaptation effort

Universidad Carlos III de Madrid e-Archivo

Linguistic treatment of questions in Spanish for question classification in question answering systems

Author: Olvera Lobo María Dolores
Robinson García Nicolás
Publication venue: 'Ediciones Profesionales de la Informacion SL'
Publication date: 01/01/2009
Field of study

Se propone un procedimiento para el tratamiento lingüístico de las preguntas en español como paso previo a su clasificación en los sistemas de búsqueda de respuestas. Se mencionan los principales tipos de sistemas de búsqueda de respuestas y su arquitectura básica. Se revisan las principales taxonomías utilizadas hasta el momento para la clasificación de preguntas y las distintas perspectivas desde las que se enfocan. Finalmente, se presentan las etapas de análisis lingüístico a las que ha de someterse el texto de las preguntas en estos sistemas para facilitar la localización de las respuestas adecuadas.We propose a procedure for the linguistic treatment of Spanish questions as a step prior to their classification in question answering systems. The main types of question answering systems and their basic architecture are described. We review the principal question classification taxonomies used to date and the different fields from which they have been derived. Finally, we present the stages of linguistic analysis that the text of questions in question answering systems should be subject to in order to facilitate the location of appropriate answers

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Granada

Scipedia

Distributional semantics and machine learning for statistical machine translation

Author: Artetxe Zurutuza Mikel
Publication venue
Publication date: 17/05/2016
Field of study

[EU]Lan honetan semantika distribuzionalaren eta ikasketa automatikoaren erabilera aztertzen dugu itzulpen automatiko estatistikoa hobetzeko. Bide horretan, erregresio logistikoan oinarritutako ikasketa automatikoko eredu bat proposatzen dugu hitz-segiden itzulpen- probabilitatea modu dinamikoan modelatzeko. Proposatutako eredua itzulpen automatiko estatistikoko ohiko itzulpen-probabilitateen orokortze bat dela frogatzen dugu, eta testuinguruko nahiz semantika distribuzionaleko informazioa barneratzeko baliatu ezaugarri lexiko, hitz-cluster eta hitzen errepresentazio bektorialen bidez. Horretaz gain, semantika distribuzionaleko ezagutza itzulpen automatiko estatistikoan txertatzeko beste hurbilpen bat lantzen dugu: hitzen errepresentazio bektorial elebidunak erabiltzea hitz-segiden itzulpenen antzekotasuna modelatzeko. Gure esperimentuek proposatutako ereduen baliagarritasuna erakusten dute, emaitza itxaropentsuak eskuratuz oinarrizko sistema sendo baten gainean. Era berean, gure lanak ekarpen garrantzitsuak egiten ditu errepresentazio bektorialen mapaketa elebidunei eta hitzen errepresentazio bektorialetan oinarritutako hitz-segiden antzekotasun neurriei dagokienean, itzulpen automatikoaz haratago balio propio bat dutenak semantika distribuzionalaren arloan.[EN]In this work, we explore the use of distributional semantics and machine learning to improve statistical machine translation. For that purpose, we propose the use of a logistic regression based machine learning model for dynamic phrase translation probability mod- eling. We prove that the proposed model can be seen as a generalization of the standard translation probabilities used in statistical machine translation, and use it to incorporate context and distributional semantic information through lexical, word cluster and word embedding features. Apart from that, we explore the use of word embeddings for phrase translation probability scoring as an alternative approach to incorporate distributional semantic knowledge into statistical machine translation. Our experiments show the effectiveness of the proposed models, achieving promising results over a strong baseline. At the same time, our work makes important contributions in relation to bilingual word embedding mappings and word embedding based phrase similarity measures, which go be- yond machine translation and have an intrinsic value in the field of distributional semantics

Archivo Digital para la Docencia y la Investigación

Recommended from our members

Cross-Lingual and Low-Resource Sentiment Analysis

Author: Farra Noura
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019
Field of study

Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment

Columbia University Academic Commons

La família humana: Perspectives multidisciplinàries de la investigació en Ciències Humanes i Socials

Author: Bellés-Calvera Lucía
Pallarés Renau María
Publication venue: 'Universitat Jaume I'
Publication date: 01/01/2023
Field of study

Recull part de les ponències presentades en les XXVII Jornades de Foment de la Investigació en Ciències Humanes i Socials, celebrades en Castelló el 6 de maig de 2022 organitzades per la Universitat Jaume

Repositori Institucional de la Universitat Jaume I

The Proceedings of the 23rd Annual International Conference on Digital Government Research (DGO2022) Intelligent Technologies, Governments and Citizens June 15-17, 2022

Author
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/09/2022
Field of study

The 23rd Annual International Conference on Digital Government Research theme is “Intelligent Technologies, Governments and Citizens”. Data and computational algorithms make systems smarter, but should result in smarter government and citizens. Intelligence and smartness affect all kinds of public values - such as fairness, inclusion, equity, transparency, privacy, security, trust, etc., and is not well-understood. These technologies provide immense opportunities and should be used in the light of public values. Society and technology co-evolve and we are looking for new ways to balance between them. Specifically, the conference aims to advance research and practice in this field. The keynotes, presentations, posters and workshops show that the conference theme is very well-chosen and more actual than ever. The challenges posed by new technology have underscored the need to grasp the potential. Digital government brings into focus the realization of public values to improve our society at all levels of government. The conference again shows the importance of the digital government society, which brings together scholars in this field. Dg.o 2022 is fully online and enables to connect to scholars and practitioners around the globe and facilitate global conversations and exchanges via the use of digital technologies. This conference is primarily a live conference for full engagement, keynotes, presentations of research papers, workshops, panels and posters and provides engaging exchange throughout the entire duration of the conference

DSpace at Tartu University Library

Diversity in the East-Central European Borderlands : Memories, Cityscapes, People

Author: Fedor Julie
Narvselius Eleonora
Publication venue: 'ibidem-Verlag'
Publication date: 01/11/2021
Field of study

Built on up-to-date field material, this edited volume suggests an anthropological approach to the palimpsest-like milieus of Wrocław, Lviv, Chernivtsi, and Chişinău. In these East-Central European borderline cities, the legacies of Nazism, Marxism-Leninism, and violent ethno-nationalism have been revisited in recent decades in search of profound moral reckoning and in response to the challenges posed by the (post-)transitional period. Present shapes and contents of these urban settings derive from combinations of fragmented material environments, cultural continuities and political ruptures, present-day heritage industries and collective memories about the contentious past, expressive architectural forms and less conspicuous meaning-making activities of human actors. In other words, they evolve from perpetual tensions between choices of the past and the burden of the past. A novel feature of this book is its multi-level approach to the analysis of engagements with the lost diversity in historical urban milieus full of post-war voids and ruptures. In particular, the collected studies test the possibility of combining the theoretical propositions of Memory Studies with broader conceptualizations of borderlands, cosmopolitan sociality, urban mythologies, and hybridity

Lund University Publications

Research on Phraseology Across Continents

Author
Publication venue: University of Bialystok Publishing House
Publication date: 01/01/2013
Field of study

The second volume of the IDP series contains papers by phraseologists from five continents: Europe, Australia, North America, South America and Asia, which were written within the framework of the project Intercontinental Dialogue on Phraseology, prepared and coordinated by Joanna Szerszunowicz, conducted by the University of Bialystok in cooperation with Kwansei Gakuin University in Japan. The book consists of the following parts: Dialogue on Phraseology, General and Corpus Linguistics & Phraseology, Lexicography & Phraseology, Contrastive Linguistics, Translation & Phraseology, Literature, Cultural Studies, Education & Phraseology. Dialogue contains two papers written by widely recognised phraseologists: professor Anita Naciscione from Latvia and professor Irine Goshkheteliani.The volume has been financed by the Philological Department of the University of Bialysto

Repozytorium Uniwersytetu w Białymstoku (University of Bialystok Repository)