120 research outputs found

    New Perspectives in Sinographic Language Processing Through the Use of Character Structure

    Full text link
    Chinese characters have a complex and hierarchical graphical structure carrying both semantic and phonetic information. We use this structure to enhance the text model and obtain better results in standard NLP operations. First of all, to tackle the problem of graphical variation we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a characters, provides us with a directed graph of allographic classes. We provide this graph with two weights: semanticity (semantic relation between subcharacter and character) and phoneticity (phonetic relation) and calculate "most semantic subcharacter paths" for each character. Finally, adding the information contained in these paths to unigrams we claim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) of a total of 18 million characters and get an improvement of 3% on an already high baseline of 89.6% precision, obtained by a linear SVM classifier. Other possible applications and perspectives of the system are discussed.Comment: 17 pages, 5 figures, presented at CICLing 201

    Automatic recognition of schwa variants in spontaneous Hungarian speech

    Get PDF
    This paper analyzes the nature of the process involved in optional vowel reduction in Hungarian, and the acoustic structure of schwa variants in spontaneous speech. The study focuses on the acoustic patterns of both the basic realizations of Hungarian vowels and their realizations as neutral vowels (schwas), as well as on the design, implementation, and evaluation of a set of algorithms for the recognition of both types of realizations from the speech waveform. The authors address the question whether schwas form a unified group of vowels or they show some dependence on the originally intended articulation of the vowel they stand for. The acoustic study uses a database consisting of over 4,000 utterances extracted from continuous speech, and recorded from 19 speakers. The authors propose methods for the recognition of neutral vowels depending on the various vowels they replace in spontaneous speech. Mel-Frequency Cepstral Coefficients are calculated and used for the training of Hidden Markov Models. The recognition system was trained on 2,500 utterances and then tested on 1,500 utterances. The results show that a neutral vowel can be detected in 72% of all occurrences. Stressed and unstressed syllables can be distinguished in 92% of all cases. Neutralized vowels do not form a unified group of phoneme realizations. The pronunciation of schwa heavily depends on the original articulation configuration of the intended vowel

    Enhanced Web Search Engines with Query-Concept Bipartite Graphs

    Get PDF
    With rapid growth of information on the Web, Web search engines have gained great momentum for exploiting valuable Web resources. Although keywords-based Web search engines provide relevant search results in response to users’ queries, future enhancement is still needed. Three important issues include (1) search results can be diverse because ambiguous keywords in queries can be interpreted to different meanings; (2) indentifying keywords in long queries is difficult for search engines; and (3) generating query-specific Web page summaries is desirable for Web search results’ previews. Based on clickthrough data, this thesis proposes a query-concept bipartite graph for representing queries’ relations, and applies the queries’ relations to applications such as (1) personalized query suggestions, (2) long queries Web searches and (3) query-specific Web page summarization. Experimental results show that query-concept bipartite graphs are useful for performance improvement for the three applications

    Standardization of the formal representation of lexical information for NLP

    Get PDF
    International audienceA survey of dictionary models and formats is presented as well as a presentation of corresponding recent standardisation activities

    Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

    Get PDF
    Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advancing the research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented

    Unsupervised Extractive Summarization of Emotion Triggers

    Full text link
    Understanding what leads to emotions during large-scale crises is important as it can provide groundings for expressed emotions and subsequently improve the understanding of ongoing disasters. Recent approaches trained supervised models to both detect emotions and explain emotion triggers (events and appraisals) via abstractive summarization. However, obtaining timely and qualitative abstractive summaries is expensive and extremely time-consuming, requiring highly-trained expert annotators. In time-sensitive, high-stake contexts, this can block necessary responses. We instead pursue unsupervised systems that extract triggers from text. First, we introduce CovidET-EXT, augmenting (Zhan et al. 2022)'s abstractive dataset (in the context of the COVID-19 crisis) with extractive triggers. Second, we develop new unsupervised learning models that can jointly detect emotions and summarize their triggers. Our best approach, entitled Emotion-Aware Pagerank, incorporates emotion information from external sources combined with a language understanding module, and outperforms strong baselines. We release our data and code at https://github.com/tsosea2/CovidET-EXT.Comment: ACL 2023 Camera-Read

    An ontology-based approach toward the configuration of heterogeneous network devices

    Get PDF
    Despite the numerous efforts of standardization, semantic issues remain in effect in many subfields of networking. The inability to exchange data unambiguously between information systems and human resources is an issue that hinders technology implementation, semantic interoperability, service deployment, network management, technology migration, among many others. In this thesis, we will approach the semantic issues in two critical subfields of networking, namely, network configuration management and network addressing architectures. The fact that makes the study in these areas rather appealing is that in both scenarios semantic issues have been around from the very early days of networking. However, as networks continue to grow in size and complexity current practices are becoming neither scalable nor practical. One of the most complex and essential tasks in network management is the configuration of network devices. The lack of comprehensive and standard means for modifying and controlling the configuration of network elements has led to the continuous and extended use of proprietary Command Line Interfaces (CLIs). Unfortunately, CLIs are generally both, device and vendor-specific. In the context of heterogeneous network infrastructures---i.e., networks typically composed of multiple devices from different vendors---the use of several CLIs raises serious Operation, Administration and Management (OAM) issues. Accordingly, network administrators are forced to gain specialized expertise and to continuously keep knowledge and skills up to date as new features, system upgrades or technologies appear. Overall, the utilization of proprietary mechanisms allows neither sharing knowledge consistently between vendors' domains nor reusing configurations to achieve full automation of network configuration tasks---which are typically required in autonomic management. Due to this heterogeneity, CLIs typically provide a help feature which is in turn an useful source of knowledge to enable semantic interpretation of a vendor's configuration space. The large amount of information a network administrator must learn and manage makes Information Extraction (IE) and other forms of natural language analysis of the Artificial Intelligence (AI) field key enablers for the network device configuration space. This thesis presents the design and implementation specification of the first Ontology-Based Information Extraction (OBIE) System from the CLI of network devices for the automation and abstraction of device configurations. Moreover, the so-called semantic overload of IP addresses---wherein addresses are both identifiers and locators of a node at the same time---is one of the main constraints over mobility of network hosts, multi-homing and scalability of the routing system. In light of this, numerous approaches have emerged in an effort to decouple the semantics of the network addressing scheme. In this thesis, we approach this issue from two perspectives, namely, a non-disruptive (i.e., evolutionary) solution to the current Internet and a clean-slate approach for Future Internet. In the first scenario, we analyze the Locator/Identifier Separation Protocol (LISP) as it is currently one of the strongest solutions to the semantic overload issue. However, its adoption is hindered by existing problems in the proposed mapping systems. Herein, we propose the LISP Redundancy Protocol (LRP) aimed to complement the LISP framework and strengthen feasibility of deployment, while at the same time, minimize mapping table size, latency time and maximize reachability in the network. In the second scenario, we explore TARIFA a Next Generation Internet architecture and introduce a novel service-centric addressing scheme which aims to overcome the issues related to routing and semantic overload of IP addresses.A pesar de los numerosos esfuerzos de estandarización, los problemas de semántica continúan en efecto en muchas subáreas de networking. La inabilidad de intercambiar data sin ambiguedad entre sistemas es un problema que limita la interoperabilidad semántica. En esta tesis, abordamos los problemas de semántica en dos áreas: (i) la gestión de configuración y (ii) arquitecturas de direccionamiento. El hecho que hace el estudio en estas áreas de interés, es que los problemas de semántica datan desde los inicios del Internet. Sin embargo, mientras las redes continúan creciendo en tamaño y complejidad, los mecanismos desplegados dejan de ser escalabales y prácticos. Una de las tareas más complejas y esenciales en la gestión de redes es la configuración de equipos. La falta de mecanismos estándar para la modificación y control de la configuración de equipos ha llevado al uso continuado y extendido de interfaces por líneas de comando (CLI). Desafortunadamente, las CLIs son generalmente, específicos por fabricante y dispositivo. En el contexto de redes heterogéneas--es decir, redes típicamente compuestas por múltiples dispositivos de distintos fabricantes--el uso de varias CLIs trae consigo serios problemas de operación, administración y gestión. En consecuencia, los administradores de red se ven forzados a adquirir experiencia en el manejo específico de múltiples tecnologías y además, a mantenerse continuamente actualizados en la medida en que nuevas funcionalidades o tecnologías emergen, o bien con actualizaciones de sistemas operativos. En general, la utilización de mecanismos propietarios no permite compartir conocimientos de forma consistente a lo largo de plataformas heterogéneas, ni reutilizar configuraciones con el objetivo de alcanzar la completa automatización de tareas de configuración--que son típicamente requeridas en el área de gestión autonómica. Debido a esta heterogeneidad, las CLIs suelen proporcionar una función de ayuda que fundamentalmente aporta información para la interpretación semántica del entorno de configuración de un fabricante. La gran cantidad de información que un administrador debe aprender y manejar, hace de la extracción de información y otras formas de análisis de lenguaje natural del campo de Inteligencia Artificial, potenciales herramientas para la configuración de equipos en entornos heterogéneos. Esta tesis presenta el diseño y especificaciones de implementación del primer sistema de extracción de información basada en ontologías desde el CLI de dispositivos de red, para la automatización y abstracción de configuraciones. Por otra parte, la denominada sobrecarga semántica de direcciones IP--en donde, las direcciones son identificadores y localizadores al mismo tiempo--es una de las principales limitaciones sobre mobilidad, multi-homing y escalabilidad del sistema de enrutamiento. Por esta razón, numerosas propuestas han emergido en un esfuerzo por desacoplar la semántica del esquema de direccionamiento de las redes actuales. En esta tesis, abordamos este problema desde dos perspectivas, la primera de ellas una aproximación no-disruptiva (es decir, evolucionaria) al problema del Internet actual y la segunda, una nueva propuesta en torno a futuras arquitecturas del Internet. En el primer escenario, analizamos el protocolo LISP (del inglés, Locator/Identifier Separation Protocol) ya que es en efecto, una de las soluciones con mayor potencial para la resolucion del problema de semántica. Sin embargo, su adopción está limitada por problemas en los sistemas de mapeo propuestos. En esta tesis, proponemos LRP (del inglés, LISP Redundancy Protocol) un protocolo destinado a complementar LISP e incrementar la factibilidad de despliegue, a la vez que, reduce el tamaño de las tablas de mapeo, tiempo de latencia y maximiza accesibilidad. En el segundo escenario, exploramos TARIFA una arquitectura de red de nueva generación e introducimos un novedoso esquema de direccionamiento orientado a servicios
    corecore