
    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing from the published version due to publication policies. Please contact Prof. Erik Cambria for details.

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis. In the age of the internet, email, and social media, there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria. Egyptian Government

    The acquisition and use of Mandarin relative clauses by monolingual and bilingual children and adults

    Children have been found to understand and use relative clauses (RCs) at an early age. However, not all types of RCs are acquired at the same time, nor are they used with the same frequency (e.g., Diessel & Tomasello, 2000, 2005). Using corpus-based and experimental methodologies, the three studies presented in this thesis investigate the acquisition and processing of different types of RCs in Mandarin, aiming to understand the mechanisms involved in the acquisition and processing of RCs involving varying degrees of complexity. The first study (Chapter 3) presents a corpus analysis examining the naturalistic production of Mandarin RCs by Mandarin-speaking monolingual and heritage Mandarin-English bilingual children (1;00-5;00). The results show that both monolingual and bilingual children produce more object RCs than subject RCs in Mandarin. This is because Mandarin object RCs resemble simple Subject-Verb-Object (SVO) sentences the children had previously acquired, and occur more frequently than subject RCs in their input. Compared to monolingual children, bilingual children produce more object RCs, suggesting that the acquisition of Mandarin RCs is not only facilitated by SVO transitives in Mandarin, but also SVO transitives in English. In contrast to the first study, the second study (Chapter 4) reports a subject RC advantage by looking at the comprehension of Mandarin subject and object RCs in heritage Mandarin-English bilingual children (4;00-10;11) and their vocabulary-matched monolingual peers (4;00-5;09). Using a character-sentence matching task, the results reveal that simple SVO transitives hinder children’s comprehension of Mandarin object RCs by misleading them to interpret the noun phrase occurring first as the head noun.
Compared to monolingual children, bilingual children who are more English dominant make this type of error more frequently for Mandarin object RCs, suggesting that both English SVO transitives and language dominance contribute to cross-linguistic influence. However, unlike either the subject or object RC advantage shown in children, mixed results are found in the writing of adult Mandarin native speakers (L1) and advanced second language learners (L2) in the third study (Chapter 5). Using conditional inference trees and random forests, the results show that both adult Mandarin L1 and L2 speakers’ selection of subject and object RCs heavily depends on the discourse context that RCs are situated in. The first and second studies (Chapters 3 and 4) are novel in taking Mandarin RCs with omitted head nouns into account. In spontaneous speech (Chapter 3), the results indicate that monolingual and bilingual children as young as two can produce Mandarin RCs with omitted head nouns, and the omission of a head noun does not influence the subject-object asymmetry. Similarly, the absence of a head noun does not influence monolingual and bilingual children’s comprehension of Mandarin RCs (Chapter 4), suggesting that they are able to recover omitted head nouns from the context provided. In addition, the first and third studies (Chapters 3 and 5) also examine the matrix-clause positions in which Mandarin RCs tend to occur. RCs that occur in the non-centre-embedded matrix-clause position (e.g., The goat saw the horse [that hugged the pig]) are expected to be easier to process than RCs in the centre-embedded matrix-clause position (e.g., The horse [that hugged the pig] saw the goat), as they require lower working memory load (e.g., Gibson, 1998, 2000). Supporting this assumption, in adult Mandarin L1 and L2 speakers’ writing (Chapter 5), non-centre-embedded RCs occur more often than centre-embedded RCs.
Moreover, the longer the RCs, the higher the possibility they are placed in the non-centre-embedded matrix-clause position. However, in children’s spontaneous speech (Chapter 3), neither monolingual nor bilingual children show a tendency to prefer non-centre-embedded over centre-embedded RCs, which may relate to the short length of the RCs they produce. The shorter the RCs, the less memory load is needed to process centre-embedded RCs, and therefore the disadvantage of centre-embedded RCs may diminish. The three studies of this thesis present mixed findings regarding Mandarin RC processing, but consistently provide evidence to support the usage-based account. That is, the processing of RCs is shaped by an individual’s age and language experience, including input frequency, the related structures that have been acquired, language dominance and the discourse contexts that RCs tend to appear in.

    The evolution of language: Proceedings of the Joint Conference on Language Evolution (JCoLE)


    The future of dialects: Selected papers from Methods in Dialectology XV

    Traditional dialects have been encroached upon by the increasing mobility of their speakers and by the onslaught of national languages in education and mass media. Typically, older dialects are “leveling” to become more like national languages. This is regrettable when the last articulate traces of a culture are lost, but it also promotes a complex dynamics of interaction as speakers shift from dialect to standard and to intermediate compromises between the two in their forms of speech. Varieties of speech thus live on in modern communities, where they still function to mark provenance, but increasingly cultural and social provenance as opposed to pure geography. They arise at times from the need to function throughout the different groups in society, but they also may have roots in immigrants’ speech, and just as certainly from the ineluctable dynamics of groups wishing to express their identity to themselves and to the world. The future of dialects is a selection of the papers presented at Methods in Dialectology XV, held in Groningen, the Netherlands, 11-15 August 2014. While the focus is on methodology, the volume also includes specialized studies on varieties of Catalan, Breton, Croatian, (Belgian) Dutch, English (in the US, the UK and in Japan), German (including Swiss German), Italian (including Tyrolean Italian), Japanese, and Spanish as well as on heritage languages in Canada