98 research outputs found

    A Survey on Semantic Processing Techniques

    Full text link
    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the research depth and breadth of computational semantic processing can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks: word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing in the published version due to publication policies; please contact Prof. Erik Cambria for details.
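    One of the surveyed tasks, word sense disambiguation, can be illustrated with the classic simplified Lesk baseline (an illustrative example, not a method taken from the survey itself): choose the sense whose dictionary gloss shares the most words with the surrounding context. The glosses below are toy stand-ins.

```python
def simplified_lesk(word, context, glosses):
    """Pick the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy sense inventory for the ambiguous word "bank".
glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}
sense = simplified_lesk("bank", "he sat on the bank of the river fishing", glosses)
print(sense)  # bank/river
```

    Real systems replace the bag-of-words overlap with contextual embeddings, but the sense-inventory-plus-context structure of the task is the same.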

    Topic Distiller: distilling semantic topics from documents

    Get PDF
    Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics in textual documents. The design of this system, named Topic Distiller, is inspired by research on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics. The Topic Distiller is evaluated using methods and datasets from information retrieval and automatic keyphrase extraction. In addition to the common datasets used in the literature, three new datasets are created to evaluate the system. The evaluation reveals that the Topic Distiller is able to find relevant and latent topics in textual documents, beating state-of-the-art automatic keyphrase extraction methods in performance on news articles and social media posts.
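    The entity-linking idea behind Topic Distiller can be sketched minimally: map known surface forms in the text to knowledge-base concepts and rank the concepts by frequency. The dictionary and concept names below are illustrative stand-ins, not the thesis's actual resources or linker.

```python
from collections import Counter

# Toy surface-form dictionary standing in for an entity-linking step against
# a knowledge base (assumption: not the thesis's exact tooling).
SURFACE_FORMS = {
    "machine learning": "Machine_learning",
    "neural network": "Artificial_neural_network",
}

def distill_topics(text, top_k=2):
    """Link known surface forms to concepts and rank concepts by frequency."""
    text = text.lower()
    counts = Counter()
    for form, concept in SURFACE_FORMS.items():
        hits = text.count(form)
        if hits:
            counts[concept] += hits
    return [concept for concept, _ in counts.most_common(top_k)]

doc = ("Machine learning with neural networks is popular; "
       "a neural network can approximate many functions.")
print(distill_topics(doc))  # ['Artificial_neural_network', 'Machine_learning']
```

    A production linker would disambiguate mentions against the knowledge base rather than use a flat dictionary, but the reduce-then-rank shape is the same.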

    Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages

    Get PDF
    Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages contains 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4–6 October 2010.

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Get PDF
    Peer reviewed

    A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4

    Full text link
    Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus, and computation. Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities which allow them to achieve remarkable performance without any task-specific training in many natural language processing tasks. The era of LLMs started with OpenAI's GPT-3 model, and the popularity of LLMs has increased rapidly since the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey which summarizes the recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models, and large language models. We then present a brief overview of GLLMs and discuss their performance in various downstream tasks, specific domains, and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, the robustness of GLLMs, and the effectiveness of GLLMs as evaluators, and finally conclude with multiple insightful future research directions. To summarize, this comprehensive survey will serve as a good resource for both academic and industry people to stay updated with the latest research related to GPT-3 family large language models. Comment: Preprint under review, 58 pages.

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing

    Get PDF

    Ontology Learning from the Arabic Text of the Qur’an: Concepts Identification and Hierarchical Relationships Extraction

    Get PDF
    Recent developments in ontology learning have highlighted the growing role ontologies play in linguistic and computational research areas such as language teaching and natural language processing. The ever-growing availability of annotations for the Qur’an text has made the acquisition of ontological knowledge promising. However, the availability of resources and tools for Arabic ontology is not comparable with that for other languages. Manual ontology development is labour-intensive and time-consuming, and it requires the knowledge and skills of domain experts. This thesis aims to develop new methods for ontology learning from the Arabic text of the Qur’an, including concept identification and hierarchical relationship extraction. The thesis presents a methodology for reducing human intervention in building an ontology from the Classical Arabic language of the Qur’an text. The set of concepts, whose identification is a crucial step in ontology learning, was generated based on a set of patterns made of lexical and inflectional information. The concepts were identified using an adapted weighting schema that exploits a combination of knowledge sources to learn the relevance degree of a term: statistical knowledge, domain-specific knowledge, and the internal information of Multi-Word Terms (MWTs) were combined to learn the relevance of generated terms. This methodology, which represents the major contribution of the thesis, was experimentally investigated using different term generation methods. As a result, we provide the Arabic Qur’anic Terms (AQT) as a training resource for machine-learning-based term extraction. This thesis also introduces a new approach for hierarchical relation extraction from the Arabic text of the Qur’an. A set of hierarchical relations occurring between identified concepts is extracted using hybrid methods, including head-modifier analysis, a set of markers for the copula construct in Arabic text, and referents.
    We also compared a number of ontology alignment methods for matching ontological bilingual Qur’anic resources. In addition, a multi-dimensional resource about the Qur’an, named the Arabic Qur’anic Database (AQD), is built for Arabic computational researchers, allowing regular-expression query search over the included annotations. The search tool was successfully applied to find instances of a given complex rule made of different combined resources.
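    The head-modifier heuristic mentioned above can be illustrated on English examples (a simplification: the thesis applies it to Arabic, where compounds are head-initial rather than head-final): a multi-word term is taken to be a narrower concept than its syntactic head, yielding candidate hierarchy edges.

```python
def head_modifier_relations(terms, head_initial=False):
    """Derive candidate (narrower-term, broader-term) pairs from multi-word
    terms, assuming each term is a hyponym of its syntactic head.
    English compounds are head-final; Arabic ones are head-initial."""
    pairs = []
    for term in terms:
        words = term.split()
        if len(words) < 2:
            continue  # single words carry no modifier
        head = words[0] if head_initial else words[-1]
        pairs.append((term, head))
    return pairs

print(head_modifier_relations(["ontology learning", "concept extraction"]))
# [('ontology learning', 'learning'), ('concept extraction', 'extraction')]
```

    In the thesis's setting, such candidate pairs would be filtered against the term relevance weighting and the copula markers before being accepted as hierarchical relations.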

    Methods for Building Semantic Portals

    Get PDF
    Semantic portals are information systems which collect information from several sources and combine it, using semantic web technologies, into a user interface that meets users' information needs. Creating such portals requires methods and tools from multiple disciplines, including knowledge representation, information retrieval, information extraction, and user interface design. This thesis explores methods for building and improving semantic portals and other semantic web applications, with contributions in three areas. The studies included in the thesis draw from the design science methodology in information systems research. First, a method is presented for creating faceted search user interfaces for semantic portals utilizing controlled vocabularies with a complex hierarchical structure. The results show that the method allows the creation of user-centric search facets that hide the complex hierarchies from the user, resulting in a user-friendly faceted search interface. Second, the creation of structured metadata from text documents is enhanced by adapting a state-of-the-art automatic subject indexing system to Finnish-language texts. The results show that, using a suitable combination of existing tools, automatic subject indexing quality comparable to that of human indexers can be attained in a highly inflected language such as Finnish. Finally, the quality of controlled vocabularies such as thesauri and lightweight ontologies is examined by developing a set of quality criteria for vocabularies expressed using the SKOS standard, and methods for correcting structural problems in SKOS vocabularies are presented. The results show that most published SKOS vocabularies suffer from quality issues and violate the SKOS integrity conditions. However, the great majority of such problems were corrected by the methods presented in this dissertation.
    The methods have been implemented in several real-world applications, including the HealthFinland health information portal, the ARPA information extraction toolkit, and the ONKI ontology library system.
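    The SKOS integrity checking described above can be sketched with plain data structures (a simplification: the dissertation's methods operate on actual RDF graphs): verify that the broader hierarchy is acyclic and that skos:related never links a concept to one of its broader-transitive ancestors (SKOS integrity condition S27).

```python
def skos_violations(broader, related):
    """Report two kinds of SKOS integrity problems in a vocabulary:
    cycles in the broader hierarchy, and skos:related pairs that also
    appear in the broader-transitive closure (condition S27).
    `broader` maps each concept to its direct broader concepts;
    `related` is a set of unordered associative pairs (frozensets)."""
    def ancestors(concept, seen=None):
        # Depth-first transitive closure of the broader relation.
        seen = seen if seen is not None else set()
        for b in broader.get(concept, ()):
            if b not in seen:
                seen.add(b)
                ancestors(b, seen)
        return seen

    problems = []
    for concept in broader:
        anc = ancestors(concept)
        if concept in anc:
            problems.append(("cycle", concept))
        for other in anc:
            if frozenset((concept, other)) in related:
                problems.append(("related-clash", concept, other))
    return problems

# Toy vocabulary: "poodle" is related to its transitive ancestor "animal".
broader = {"poodle": ["dog"], "dog": ["animal"]}
related = {frozenset(("poodle", "animal"))}
print(skos_violations(broader, related))
# [('related-clash', 'poodle', 'animal')]
```

    The dissertation's correction methods go further than detection, e.g. by removing the offending associative link; the check above only locates the violations.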