
    Steps for Creating two Persian Specialized Corpora

    Currently, most linguistic studies benefit from valid linguistic data available in corpora, and compiling corpora is a common practice in linguistic research. The present study introduces two specialized corpora in Persian; a specialized corpus is used to study a particular type of language or language variety. To build such corpora, a set of texts was first compiled according to pre-established sampling criteria (including the mode, type, domain, language or language variety, and date of the texts). The corpora are specialized because they include technical terms from information processing and management, librarianship, linguistics, computational linguistics, thesaurus building, management, policy-making, natural language processing, information technology, information retrieval, ontology, and other related interdisciplinary domains. After the data and metadata were compiled, the texts were preprocessed (normalized and tokenized) and annotated (automatic POS tagging); finally, the tags were checked manually. Each corpus includes more than four million words. Since few specialized corpora have been built for Persian, these corpora can be considered valuable resources for researchers interested in studying linguistic variation in Persian interdisciplinary texts. https://dorl.net/dor/20.1001.1.20088302.2022.20.4.14.
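
    As a rough illustration of the preprocessing and annotation steps described above (not the study's own toolchain), the sketch below normalizes, tokenizes, and automatically POS-tags a set of sampled texts, then writes the token/tag pairs out for manual checking. It assumes the open-source hazm toolkit for Persian; the model path and directory layout are hypothetical.

    # Sketch of the preprocessing/annotation pipeline: normalize, tokenize,
    # POS-tag automatically, then dump token/tag pairs for manual checking.
    # Assumes the hazm Persian NLP toolkit; the model path and the
    # corpus_texts/ layout are illustrative assumptions.
    from pathlib import Path
    from hazm import Normalizer, POSTagger, sent_tokenize, word_tokenize

    normalizer = Normalizer()
    tagger = POSTagger(model="postagger.model")  # pretrained tagger model (assumed path)

    def preprocess_and_tag(raw_text):
        """Normalize a document, split it into sentences, tokenize, and POS-tag."""
        normalized = normalizer.normalize(raw_text)
        return [tagger.tag(word_tokenize(sentence))
                for sentence in sent_tokenize(normalized)]

    # Tag every sampled text and write the automatic tags out for manual review.
    for path in Path("corpus_texts").glob("*.txt"):
        tagged = preprocess_and_tag(path.read_text(encoding="utf-8"))
        with open(f"tagged_{path.stem}.tsv", "w", encoding="utf-8") as out:
            for sentence in tagged:
                for token, tag in sentence:
                    out.write(f"{token}\t{tag}\n")
                out.write("\n")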

    Word Sense Disambiguation Focusing on POS Tag Disambiguation in Persian

    The present study deals with ambiguity at the word level, focusing on homographs. In different languages, homographs may cause ambiguity in text processing. In Persian, the number of homographs is high because of its orthographic structure as well as its complex derivational and inflectional morphology. In this study, a broad list of homographs was first extracted from several Persian corpora. The list indicates that the number of homographs in Persian corpora is high and that the most frequent homographs arise from the identical orthographic representation of certain inflectional and derivational morphemes. Based on the list, the most frequent homographs are nouns and adjectives ending in <ی> /i/. POS tag disambiguation of such homographs would make word sense disambiguation easier and lead to better text processing. In this study, a list of noun and adjective homographs ending in <ی> was extracted in order to decide their correct POS tags. The result was studied to extract context-sensitive rules for assigning the right POS tag to a homograph in syntactic structures. The accuracy of the rules was checked, and the results showed that most rules have high accuracy, indicating that most of them are valid.
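
    To make the idea of context-sensitive rules concrete, here is a minimal sketch that chooses between the noun and adjective readings of a <ی>-final homograph from its neighbouring tags. The specific rules and tag names are illustrative assumptions, not the rule set extracted in the study.

    # Illustrative sketch of context-sensitive POS disambiguation for Persian
    # noun/adjective homographs ending in <ی>. The rules and tag names below
    # are assumptions for demonstration, not the study's rules.
    AMBIGUOUS = "N|ADJ"  # placeholder tag left by the initial tagger

    def disambiguate(tagged_sentence):
        """tagged_sentence: list of (token, tag) pairs; returns a new list."""
        result = list(tagged_sentence)
        for i, (token, tag) in enumerate(result):
            if tag != AMBIGUOUS:
                continue
            prev_tag = result[i - 1][1] if i > 0 else None
            next_tag = result[i + 1][1] if i + 1 < len(result) else None
            if prev_tag == "N":
                # After a noun head (e.g. an ezafe construction), prefer the adjective reading.
                result[i] = (token, "ADJ")
            elif next_tag in ("V", "POSTP"):
                # Immediately before a verb or a postposition, prefer the noun reading.
                result[i] = (token, "N")
            else:
                result[i] = (token, "N")  # fallback: noun as the default class (assumption)
        return result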

    Introducing a machine-based approach for Word Sense Disambiguation: using Lesk algorithm and Part Of Speech tagging

    The present study introduces a machine-based approach to Word Sense Disambiguation (WSD). Persian, a morphologically complex language, produces many homographs; one way to perform WSD is to assign the right Part-Of-Speech (POS) tags to words prior to WSD. Since the frequency of noun and adjective homographs in different Persian text corpora is high, POS disambiguation of such homographs seems necessary for WSD. This paper introduces an approach in which POS tagging is done first; the output (tagged sentences) then enters the next step, POS disambiguation of Persian noun and adjective homographs; the output of this step enters the final step, in which the Lesk algorithm (a kind of unsupervised learning) is applied for WSD. The proposed approach speeds up the WSD procedure by filtering only the relevant glosses (those present in the dictionary) and also increases the accuracy of the WSD procedure.
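
    A minimal sketch of the final step, assuming a simple sense inventory keyed by surface form: a simplified Lesk overlap score is computed only over senses whose POS matches the tag already assigned in the previous step, which is what filters out irrelevant dictionary glosses. The sense-inventory format is an assumption for illustration.

    # Simplified Lesk with a POS filter: only senses whose POS matches the
    # already-disambiguated tag are scored, so irrelevant glosses are skipped.
    # The sense-inventory format is an illustrative assumption.
    def pos_filtered_lesk(word, pos, context_tokens, sense_inventory):
        """Return the sense of `word` whose gloss overlaps the context most.

        sense_inventory: dict mapping a word to a list of senses, each a dict
        like {"pos": "N", "gloss": ["token", ...]}.
        """
        context = set(context_tokens)
        best_sense, best_overlap = None, -1
        for sense in sense_inventory.get(word, []):
            if sense["pos"] != pos:      # POS filter from the previous step
                continue
            overlap = len(context & set(sense["gloss"]))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense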

    Introducing a probabilistic–structural method for grapheme-to-phoneme conversion in Persian

    The Persian writing system deviates from an ideal one because there is no one-to-one correspondence between graphemes and phonemes. The present study addresses the question of how Persian speakers can read out-of-vocabulary words despite the absence of short vowels in the Persian writing system and the one-to-many and many-to-one relationships between graphemes and phonemes. This study introduces a probabilistic-structural method that Persian speakers use to read out-of-vocabulary words, in which structural information (including Persian morphology and morphophonemic rules) as well as Arabic morphological templates are taken into account. To test how the introduced method works, Persian speakers were asked to read a list of out-of-vocabulary words. The same list was used as input to ID3 and MLP (two methods used in machine learning); then the outputs of the introduced method and those of ID3 and MLP were compared with the Persian speakers' pronunciations. The results showed that the introduced method behaves similarly to Persian speakers when reading out-of-vocabulary words.
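
    For the machine-learning side of the comparison, a baseline grapheme-to-phoneme learner in the spirit of the ID3 baseline can be sketched as a per-grapheme classifier over a window of neighbouring graphemes. The toy training pairs, window size, and phoneme notation below are illustrative assumptions, and scikit-learn's entropy-criterion decision tree stands in for ID3.

    # Baseline G2P sketch in the spirit of the ID3 comparison: predict each
    # grapheme's phoneme string from a window of neighbouring graphemes.
    # Training pairs, window size, and phoneme notation are assumptions.
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeClassifier

    WINDOW = 2  # graphemes of context on each side (assumption)

    def windows(word):
        padded = "#" * WINDOW + word + "#" * WINDOW
        return [list(padded[i:i + 2 * WINDOW + 1]) for i in range(len(word))]

    # Toy aligned data: each grapheme maps to zero or more phonemes, so unwritten
    # short vowels are attached to the preceding grapheme's output string.
    training = [
        ("کتاب", ["ke", "t", "A", "b"]),   # ketAb 'book'
        ("درس", ["da", "r", "s"]),         # dars 'lesson'
    ]

    X, y = [], []
    for word, phonemes in training:
        for feats, ph in zip(windows(word), phonemes):
            X.append(feats)
            y.append(ph)

    encoder = OneHotEncoder(handle_unknown="ignore")
    clf = DecisionTreeClassifier(criterion="entropy")  # entropy splits, as in ID3
    clf.fit(encoder.fit_transform(X), y)

    def transcribe(word):
        """Read an out-of-vocabulary word with the trained per-grapheme classifier."""
        return "".join(clf.predict(encoder.transform(windows(word))))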