A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyze five semantic processing tasks, namely word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.

Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.
Topic Distiller:distilling semantic topics from documents
Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics from textual documents. The design of this system, named Topic Distiller, is inspired by research conducted on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics.
The Topic Distiller is evaluated using methods and datasets from information retrieval and automatic keyphrase extraction. On top of the common datasets used in the literature, three additional datasets are created to evaluate the system.
The evaluation reveals that the Topic Distiller is able to find relevant and latent topics from textual documents, beating the state-of-the-art automatic keyphrase extraction methods in performance when used on news articles and social media posts.

Distilling semantic topics from documents. Abstract. This thesis examines a system that can find relevant and latent semantic topics in text, together with the design and implementation of that system. The design of this system, called Topic Distiller, draws inspiration from research on automatic keyphrase extraction and automatic topic labeling, and it uses automatic semantic annotation and knowledge bases to find the topics of a text.

The performance of the Topic Distiller system is measured using evaluation methods and datasets widely used in the automatic keyphrase extraction literature. In addition to these common datasets, we introduce three new datasets created for evaluating the Topic Distiller system.

The evaluation shows that the Topic Distiller is able to find relevant and latent topics in text. It outperforms the latest automatic keyphrase extraction methods in the literature when used to analyze news articles and social media posts.
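The entity-linking idea behind the Topic Distiller can be sketched in very simplified form: link surface mentions in a document to knowledge-base entries, then rank the linked topics by how often they are mentioned. The toy knowledge base and all function names below are invented for illustration; this is not the thesis's actual implementation.

```python
from collections import Counter

# Toy knowledge base: surface form -> canonical topic (illustrative only).
KNOWLEDGE_BASE = {
    "neural network": "Artificial neural network",
    "neural networks": "Artificial neural network",
    "language model": "Language model",
    "transformer": "Transformer (deep learning)",
}

def link_entities(text: str) -> list[str]:
    """Emit the canonical topic once per knowledge-base surface-form occurrence."""
    lowered = text.lower()
    hits = []
    for surface, topic in KNOWLEDGE_BASE.items():
        hits.extend([topic] * lowered.count(surface))
    return hits

def distill_topics(text: str, top_k: int = 3) -> list[str]:
    """Rank linked topics by match count and keep the top_k."""
    counts = Counter(link_entities(text))
    return [topic for topic, _ in counts.most_common(top_k)]

doc = "Transformer language models outperform older neural networks."
print(distill_topics(doc))
```

A real system would replace the substring lookup with a proper entity linker and a large knowledge base, but the distillation step, ranking linked entities to summarize a document, keeps the same shape.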
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages contains 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4–6 October 2010.
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
Peer reviewed
A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4
Large language models (LLMs) are a special class of pretrained language
models obtained by scaling model size, pretraining corpus and computation.
LLMs, because of their large size and pretraining on large volumes of text
data, exhibit special abilities that allow them to achieve remarkable
performance without any task-specific training on many natural language
processing tasks. The era of LLMs started with OpenAI's GPT-3 model, and the
popularity of LLMs has grown rapidly since the introduction of models
like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models,
including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With
the ever-rising popularity of GLLMs, especially in the research community,
there is a strong need for a comprehensive survey which summarizes the recent
research progress in multiple dimensions and can guide the research community
with insightful future research directions. We start the survey paper with
foundation concepts like transformers, transfer learning, self-supervised
learning, pretrained language models and large language models. We then present
a brief overview of GLLMs and discuss the performances of GLLMs in various
downstream tasks, specific domains and multiple languages. We also discuss the
data labelling and data augmentation abilities of GLLMs, the robustness of
GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with
multiple insightful future research directions. To summarize, this
comprehensive survey paper will serve as a good resource for both academic and
industry people to stay updated with the latest research related to GPT-3
family large language models.

Comment: Preprint under review, 58 pages.
Ontology Learning from the Arabic Text of the Qur'an: Concepts Identification and Hierarchical Relationships Extraction
Recent developments in ontology learning have highlighted the growing role ontologies play in linguistic and computational research areas such as language teaching and natural language processing. The ever-growing availability of annotations for the Qur'an text has made the acquisition of ontological knowledge promising. However, the availability of resources and tools for Arabic ontology is not comparable with that for other languages. Manual ontology development is labour-intensive and time-consuming, and it requires the knowledge and skills of domain experts.
This thesis aims to develop new methods for ontology learning from the Arabic text of the Qur'an, including concepts identification and hierarchical relationships extraction. The thesis presents a methodology for reducing human intervention in building an ontology from the Classical Arabic language of the Qur'an text. The set of concepts, whose generation is a crucial step in ontology learning, was produced based on a set of patterns made of lexical and inflectional information. The concepts were identified based on an adapted weighting schema that exploits a combination of knowledge sources to learn the relevance degree of a term. Statistical knowledge, domain-specific knowledge, and the internal information of Multi-Word Terms (MWTs) were combined to learn the relevance of generated terms. This methodology, which represents the major contribution of the thesis, was experimentally investigated using different term generation methods. As a result, we provide the Arabic Qur'anic Terms (AQT) as a training resource for machine-learning-based term extraction.
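The combination of statistical and domain-specific evidence for term relevance could be sketched roughly as follows. The weights, the domain list, and the function name are invented for illustration and are not the thesis's actual weighting schema.

```python
import math

def term_relevance(term: str, term_freq: int, doc_count: int,
                   docs_with_term: int, domain_terms: set[str]) -> float:
    """Combine a TF-IDF-style statistic with simple domain knowledge.

    term_freq:      occurrences of the candidate term in the corpus
    doc_count:      total documents in the corpus
    docs_with_term: documents containing the term
    domain_terms:   known domain-specific vocabulary (illustrative)
    """
    idf = math.log(doc_count / (1 + docs_with_term))
    statistical = term_freq * idf
    # Bonus when any component word of the multi-word term is a known
    # domain term (a stand-in for domain-specific knowledge).
    domain_bonus = 1.5 if any(w in domain_terms for w in term.split()) else 1.0
    # Longer multi-word terms tend to be more specific (internal MWT info).
    length_bonus = 1.0 + 0.1 * (len(term.split()) - 1)
    return statistical * domain_bonus * length_bonus

domain = {"prophet", "revelation"}
print(term_relevance("prophet story", 4, 100, 10, domain))
```

The point is only the shape of the combination: statistical salience, a domain-knowledge factor, and a factor derived from the internal structure of the multi-word term, multiplied into one relevance score.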
This thesis also introduces a new approach for hierarchical relations extraction from the Arabic text of the Qur'an. A set of hierarchical relations occurring between identified concepts is extracted based on hybrid methods, including head-modifier analysis, a set of markers for the copula construct in Arabic text, and referents. We also compared a number of ontology alignment methods for matching ontological bilingual Qur'anic resources.
In addition, a multi-dimensional resource named the Arabic Qur'anic Database (AQD) about the Qur'an was built for Arabic computational researchers, allowing regular expression query search over the included annotations. The search tool was successfully applied to find instances for a given complex rule made of different combined resources.
Methods for Building Semantic Portals
Semantic portals are information systems which collect information from several sources and combine them using semantic web technologies into a user interface that solves information needs of users. Creating such portals requires methods and tools from multiple disciplines, including knowledge representation, information retrieval, information extraction, and user interface design.
This thesis explores methods for building and improving semantic portals and other semantic web applications with contributions in three areas. The studies included in the thesis draw from the design science methodology in information systems research.
First, a method is presented for creating faceted search user interfaces for semantic portals that utilize controlled vocabularies with a complex hierarchical structure. The results show that the method allows the creation of user-centric search facets that hide the complex hierarchies from the user, resulting in a user-friendly faceted search interface.
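One way to hide a deep vocabulary hierarchy behind a flat facet is to project every concept onto a designated top-level ancestor, so the user only sees a handful of facet values. The vocabulary and function names below are invented for illustration; this is a sketch of the general idea, not the dissertation's implementation.

```python
# Toy vocabulary: concept -> its broader (parent) concept.
BROADER = {
    "asthma": "respiratory diseases",
    "respiratory diseases": "diseases",
    "influenza": "infectious diseases",
    "infectious diseases": "diseases",
}
# Concepts chosen to appear as user-facing facet values.
TOP_LEVEL = {"respiratory diseases", "infectious diseases"}

def facet_value(concept: str) -> str:
    """Walk up the broader hierarchy until a designated top-level concept."""
    while concept not in TOP_LEVEL and concept in BROADER:
        concept = BROADER[concept]
    return concept

print(facet_value("asthma"))  # projects onto its top-level ancestor
```

A document indexed with the specific concept "asthma" would then be counted under the user-visible facet value "respiratory diseases", keeping the intermediate hierarchy out of the interface.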
Second, the creation of structured metadata from text documents is enhanced by adapting a state-of-the-art automatic subject indexing system to Finnish-language texts. The results show that, using a suitable combination of existing tools, automatic subject indexing quality comparable to that of human indexers can be attained in a highly inflected language such as Finnish.
Finally, the quality of controlled vocabularies such as thesauri and lightweight ontologies is examined by developing a set of quality criteria for vocabularies expressed using the SKOS standard, and methods for correcting structural problems in SKOS vocabularies are presented. The results show that most published SKOS vocabularies suffer from quality issues and violate the SKOS integrity conditions. However, the great majority of such problems were corrected by the methods presented in this dissertation.
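One of the SKOS integrity conditions requires that skos:related be disjoint from the transitive broader hierarchy, so a concept must not be marked as "related" to its own ancestor. A minimal sketch of such a check, on a hand-made vocabulary rather than parsed RDF, might look like this (all names are illustrative):

```python
# Toy SKOS-like vocabulary: concept -> broader concept, plus related pairs.
BROADER = {"dogs": "mammals", "mammals": "animals"}
RELATED = {("dogs", "animals")}  # violates the integrity condition

def ancestors(concept: str) -> set[str]:
    """All concepts reachable via the transitive broader relation."""
    out = set()
    while concept in BROADER:
        concept = BROADER[concept]
        out.add(concept)
    return out

def related_hierarchy_violations(related, broader_ancestors):
    """Pairs that are both skos:related and in a broader/narrower chain."""
    return {(a, b) for a, b in related
            if b in broader_ancestors(a) or a in broader_ancestors(b)}

print(related_hierarchy_violations(RELATED, ancestors))
```

A real validator would load the vocabulary with an RDF library and check the full set of SKOS integrity conditions, but each check reduces to a graph query of this kind.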
The methods have been implemented in several real world applications, including the HealthFinland health information portal, the ARPA information extraction toolkit, and the ONKI ontology library system.

Semantic portals are information systems that collect information from several sources and combine it, using semantic web technologies, into a user interface that supports users' information needs. Building such portals requires methods and tools from several disciplines, including knowledge representation, information retrieval, information extraction, and user interface design.

This dissertation examines methods for building semantic portals and other semantic web applications. The results of the dissertation fall into three areas. The research methods used are based on the design science methods employed in information systems research.

First, the dissertation presents a method for creating faceted user interfaces for semantic portals on top of complex controlled vocabularies. The results show that the method enables the creation of user-centric search views that hide complex hierarchies from the user, thus helping to create a user-friendly faceted search interface.

Second, the production of structured metadata from text documents is improved by adapting a modern automatic subject indexing system to Finnish-language text material. The results show that, using a suitable combination of existing tools, automatic subject indexing quality comparable to human indexing can be achieved even for material in an agglutinative language such as Finnish.

Third, the quality of controlled vocabularies such as thesauri and lightweight ontologies is examined by developing a set of quality criteria for vocabularies expressed with the SKOS standard and by presenting methods for correcting structural problems in SKOS vocabularies. The results show that most published SKOS vocabularies suffer from quality problems and do not follow the integrity rules of the SKOS standard. A large share of these problems could be corrected with the methods presented in this dissertation.

The methods have been implemented in several systems in use, such as the TerveSuomi health information portal, the ARPA information extraction tool, and the ONKI ontology library.