SCyDia: OCR for Serbian Cyrillic with Diacritics
In the currently ongoing process of retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of paper editions. Therefore, one essential step is needed before venturing into the dictionary-making process in the digital environment: OCRing the pages with the highest possible accuracy. OCR processing is not a new technology, and many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. However, when documents contain accented text and each character with diacritics must be extracted precisely, such solutions are not efficient enough. This paper presents the OCR software "SCyDia", developed to overcome this issue. We describe the organizational structure of "SCyDia" and report the first results. "SCyDia" is a web-based solution that relies on the open-source engine "Tesseract" in the background and also contains a module for semi-automatic text correction. We have already processed over 15,000 pages across 13 dialectal dictionaries and five dialectal monographs. At this point in the project, we have analyzed the accuracy of "SCyDia" on the 13 dialectal dictionaries. The results were analyzed manually by an expert who examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.
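The abstract does not spell out SCyDia's internals, but since it delegates recognition to Tesseract, a minimal sketch of the kind of underlying OCR call might look as follows. This assumes Python with the pytesseract wrapper and Tesseract's standard Serbian traineddata ("srp"); SCyDia's own diacritics-tuned model and its correction module are not public, so this is an illustration only:

```python
# Minimal sketch of the kind of OCR call SCyDia delegates to Tesseract.
# Assumes pytesseract and Pillow are installed and that Tesseract has the
# standard Serbian traineddata ("srp"); a model fine-tuned for dialectal
# diacritics would be substituted here.
import pytesseract
from PIL import Image

def ocr_page(path: str, lang: str = "srp") -> str:
    """OCR one scanned dictionary page and return plain text."""
    image = Image.open(path)
    # --psm 6: treat the page as a single uniform block of text
    return pytesseract.image_to_string(image, lang=lang, config="--psm 6")

print(ocr_page("page_0001.png"))
```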
The number of senses effect in polysemous adjective recognition
Previous research revealed a significant polysemy effect: words with multiple related senses (polysemous words) are recognised faster than words with multiple unrelated meanings (homonymous words) and words with only one meaning/sense (unambiguous words; Rodd et al., 2002). The measure of ambiguity in polysemous words was the number of senses (NoS), derived from the meanings/senses provided by native speakers. NoS was a significant predictor of reaction time in visual lexical decision task (VLDT) experiments (Filipović Đurđević, 2007). This is in accordance with various models of lexical ambiguity processing: some attribute the effect to increased semantic activation due to facilitation among the related senses (Armstrong & Plaut, 2016; Rodd et al., 2004), whereas others attribute it to differences at the level of responding (Hino & Lupker, 1996), but they agree in predicting a processing advantage in polysemous word recognition. Research in Serbian revealed this effect in noun and verb processing (Filipović Đurđević & Kostić, 2008; Mišić & Filipović Đurđević, 2019; 2020). The aim of this research was to further generalize the findings and to test whether the NoS effect is present in polysemous adjective recognition. The prediction was that an increase in NoS and word frequency would be followed by faster adjective recognition. The participants were presented with a VLDT consisting of 107 polysemous Serbian adjectives. The adjectives were presented in all three grammatical genders using a Latin square design between participants, so that each participant saw only one form of the same adjective. Multiple regression revealed that NoS and frequency were significant predictors of reaction time: polysemous adjectives with higher NoS and higher frequency were processed faster (NoS: β = -.199, S.E. = .093, df = 106, t = -2.143, p < .05; frequency: β = -.281, S.E. = .093, df = 106, t = -3.036, p < .05). These findings are in accordance with our hypothesis and concur with the previous findings from experiments with nouns and verbs (Filipović Đurđević & Kostić, 2008; Mišić & Filipović Đurđević, 2019; 2020), as well as with various models of word ambiguity processing (Armstrong & Plaut, 2016; Hino & Lupker, 1996; Rodd et al., 2004). Together they converge on the conclusion that NoS facilitates recognition of polysemous words in the VLDT across different parts of speech.
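For illustration, an analysis of the kind reported above (reaction time regressed on standardized NoS and frequency) could be sketched as follows. The CSV file and column names are hypothetical; the actual stimuli and data are not public:

```python
# Sketch of the reported analysis: reaction time regressed on number of
# senses (NoS) and word frequency. The file and column names are
# hypothetical; predictors and RT are z-scored so that the coefficients
# are standardized betas, as in the abstract.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("adjective_vldt.csv")  # one row per adjective (N = 107)

for col in ("rt", "nos", "log_freq"):
    df[col] = (df[col] - df[col].mean()) / df[col].std()

model = smf.ols("rt ~ nos + log_freq", data=df).fit()
print(model.summary())  # negative betas indicate faster recognition
```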
The number of senses effect in polysemous noun recognition: expanding the database
Words with multiple related senses (polysemous words) are recognised faster than words with multiple unrelated meanings (homonymous words) and unambiguous words (Rodd et al., 2002). The measures of ambiguity in polysemous words were the number of senses (NoS), derived from the meanings/senses provided by native speakers, and the information theory measures of entropy (sense uncertainty) and redundancy (the balance of sense probabilities). These measures were significant predictors of reaction time in visual lexical decision task (VLDT) experiments (Filipović Đurđević & Kostić, 2021). In spite of their differences, multiple models agree in predicting the observed facilitation. Research in Serbian revealed these effects in noun, adjective, and verb processing (Anđelić, Ilić, Mišić, & Filipović Đurđević, 2021; Filipović Đurđević & Kostić, 2008; 2021; Mišić & Filipović Đurđević, 2021). The aim of this research was to conceptually replicate and further generalise the NoS effect in the processing of nouns. An additional goal was to collect native speakers' intuitions about the senses of a novel set of Serbian nouns and thus expand the existing database (Filipović Đurđević & Kostić, 2016). A novel set of 100 polysemous nouns was selected from the dictionary and included in a normative study in which 36 participants were instructed to write down all of the senses they could recall. The senses obtained from the participants were categorised according to the dictionary, and the NoS, together with the entropy and redundancy of senses, was calculated. The same nouns were then presented in a visual lexical decision task to a novel group of 87 native speakers. The results indicated that polysemous nouns with a higher number of senses were processed faster (β = -.02, CI [-.03, -.00], t = -2.78, p = .005), which is in accordance with our hypothesis. The effects of the information theory measures, entropy (H) and redundancy (T), showed a non-significant trend in the predicted direction (H: β = -.00, CI [-.02, .01], t = -.597, p = .557; T: β = .01, CI [-.00, .03], t = 1.66, p = .097). These findings concur with the previous findings from the noun, adjective, and verb experiments and with the SSD model (Armstrong & Plaut, 2016); together they converge on the conclusion that the number of senses facilitates recognition of polysemous words in the visual lexical decision task.
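The information theory measures named above can be computed directly from a word's sense probability distribution. A minimal sketch, assuming the standard Shannon definitions (entropy over sense probabilities; redundancy as the complement relative to a uniform distribution; the papers' exact operationalization may differ in detail):

```python
# Sketch of the information theory measures: entropy H (sense uncertainty)
# and redundancy T (balance of sense probabilities), computed from the
# frequencies with which participants produced each sense of a word.
# Standard Shannon definitions are assumed.
import math

def sense_measures(counts: list[int]) -> tuple[int, float, float]:
    """Return (NoS, entropy, redundancy) for one word's sense counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    nos = len(probs)                               # number of senses
    h = -sum(p * math.log2(p) for p in probs)      # entropy in bits
    if nos > 1:
        t = 1 - h / math.log2(nos)                 # 0 = balanced, 1 = skewed
    else:
        t = 1.0                                    # one sense: fully predictable
    return nos, h, t

# e.g. a noun whose three senses were produced 20, 10, and 2 times
print(sense_measures([20, 10, 2]))
```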
Open database of senses for 308 Serbian polysemous nouns, verbs, and adjectives
The majority of words can denote multiple related objects/phenomena, i.e. can have multiple related senses; such words are called polysemes. Understanding this linguistic phenomenon is therefore of high importance both for linguistic inquiry and for psychological studies of cognitive mechanisms. Previous research demonstrated that, in addition to the number of senses, processing is also influenced by the balance of sense probabilities (Filipović Đurđević & Kostić, 2021). However, resources for the study of lexical ambiguity are very sparse (e.g. a database of 150 polysemous Serbian nouns; Filipović Đurđević & Kostić, 2017). Additionally, most of these effects were demonstrated either within a single part-of-speech category (typically nouns) or for ambiguous words with senses that span various parts of speech (e.g. a record / to record; as pointed out by Eddington & Tokowicz, 2015). Therefore, the goal of this paper is to present a new open database containing raw and categorized native speakers' semantic intuitions for 308 Serbian polysemous nouns (100), verbs (100), and adjectives (108), together with multiple quantifications representing an array of ambiguity-level indices.
For each of the polysemous words, we collected the semantic intuitions of native speakers using the total meaning metric (Azuma, 1997). We then categorized the collected descriptions using three strategies: a) relying solely on semantic intuition, b) relying solely on dictionary descriptions, and c) combining semantic intuitions and dictionary descriptions. Within each strategy, we also monitored and investigated the effect of the coder (the researcher performing the categorization) in order to explore the robustness of each approach. We then generated the sense probability distribution for each word by counting response frequencies across the created categories. To quantify the level of ambiguity, we calculated the number of senses, redundancy, and entropy of the obtained sense probability distributions (Shannon, 1948; Filipović Đurđević & Kostić, 2017). Each measure, within each approach, was also corrected for the effects of idiosyncratic senses, reflexive verbs, etc. The database will be openly available and will provide a useful resource for ambiguity research. In the future, it should be expanded with measures derived from word embeddings (e.g. BERT; Wiedemann et al., 2019) that separate different word senses. This would allow the level of ambiguity to be quantified on large-scale samples of text, which may yield a more precise estimation of sense numbers and sense probabilities and would allow the counting-of-senses approach to be abandoned (as suggested by Filipović Đurđević et al., 2009). Adding such measures to the database in the future, and thereby allowing comparison with the existing measures, may provide another validation point for measures derived from human participants.
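As a sketch of the suggested embedding-based extension, contextual BERT vectors of a target word could be clustered into putative senses and the cluster proportions read as a sense probability distribution. The model name, the fixed cluster count, and the token-matching heuristic below are illustrative assumptions, not the authors' method:

```python
# Hypothetical sketch of the embedding-based extension: cluster contextual
# BERT vectors of a target word across sentences and treat the cluster
# proportions as a sense probability distribution. Model, k, and the crude
# subtoken-matching heuristic are illustrative assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def target_vectors(sentences: list[str], target: str) -> torch.Tensor:
    """One contextual vector per sentence: the first subtoken of `target`."""
    vecs = []
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]
        tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
        for i, t in enumerate(tokens):
            if t.lower().startswith(target.lower()[:4]):  # crude match
                vecs.append(hidden[i])
                break
    return torch.stack(vecs)

def sense_distribution(sentences: list[str], target: str, k: int = 3) -> list[float]:
    """Cluster occurrences into k putative senses; return their proportions."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(
        target_vectors(sentences, target).numpy())
    return [float((labels == i).sum()) / len(labels) for i in range(k)]
```

The resulting distribution could then be fed to the same entropy and redundancy quantifications described above, allowing corpus-derived and participant-derived measures to be compared directly.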