
    A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics

    This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation, http://dx.doi.org/10.1145/2824864.2824876
    In this paper, we describe a hybrid approach for word-level language (WLL) identification of Bangla words written in Roman script and mixed with English words, developed as part of our participation in the shared task on transliterated search at the Forum for Information Retrieval Evaluation (FIRE) in 2014. A CRF-based machine learning model and post-processing heuristics are employed for the WLL identification task. In addition to language identification, two transliteration systems were built to transliterate detected Bangla words written in Roman script into native Bangla script. The system demonstrated an overall token-level language identification accuracy of 0.905. The token-level Bangla and English language identification F-scores are 0.899 and 0.920, respectively. The two transliteration systems achieved accuracies of 0.062 and 0.037. The word-level language identification system presented in this paper achieved the best scores across almost all metrics among all the participating systems for the Bangla-English language pair.
    We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project “CLIA System Phase II”. The research work of the last author was carried out in the framework of WIQ-EI IRSES (Grant No. 269180) within the FP 7 Marie Curie, DIANA-APPLICATIONS (TIN2012-38603-C02-01) projects and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Banerjee, S.; Kuila, A.; Roy, A.; Naskar, SK.; Rosso, P.; Bandyopadhyay, S. (2014). A hybrid approach for transliterated word-level language identification: CRF with post processing heuristics. In FIRE '14 Proceedings of the Forum for Information Retrieval Evaluation. ACM. 170-173. https://doi.org/10.1145/2824864.2824876
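For readers unfamiliar with the token-level metrics reported in this abstract, the following is a minimal sketch of how accuracy and a per-language F-score are commonly computed from gold and predicted labels. The function names, the label strings "B" and "E", and the toy data are illustrative assumptions; they are not taken from the paper or from the shared task's evaluation script.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tokens whose predicted language label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one token is required")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_score(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """Balanced F-score (harmonic mean of precision and recall) for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: "B" = Bangla, "E" = English (toy data, not from the paper).
gold_labels = ["B", "E", "B", "E"]
pred_labels = ["B", "E", "E", "E"]
accuracy = token_accuracy(gold_labels, pred_labels)
bangla_f = f_score(gold_labels, pred_labels, "B")
english_f = f_score(gold_labels, pred_labels, "E")
```

On this toy input the accuracy is 0.75, and the Bangla F-score is lower than the English one because one Bangla token was mislabelled as English.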

    Developing Speech-Language Pathology Students’ Grammatical Identification Skills Through Gamification

    Background: Speech-Language Pathologists (SLPs) are communication experts required to analyze and interpret a variety of language components (Schuele, 2010). Language sampling is a form of communication analysis and is used with adult and pediatric populations. SLPs collect and analyze language samples in an effort to make evidence-based diagnostic and intervention decisions. When analyzing a language sample, sentences must be deconstructed along a variety of parameters. At Old Dominion University (ODU), the undergraduate Communication Sciences and Disorders program requires students to identify broad and specific grammatical categories during language sample analysis in preparation for clinical experiences. This research involves the design and implementation of a gaming application using spaced retrieval practice and principles of gaming theory to facilitate grammatical identification skills in undergraduate and graduate SLP students. Purpose: The primary aim of this project is to generate pilot data determining the utility of a gaming application (designed by the course instructor) for teaching grammatical category identification. The gaming application has been developed with an ODU undergraduate student and Information Technology specialists, and it is in the prototyping phase. There are three planned phases of application design in the pursuit of creating a generalizable and individualized tool for instruction at the elementary level and for other SLP college programs. Research Questions: 1) Do students who use the gaming application identify auxiliary verbs, main verbs, secondary verbs, subjective pronouns, objective pronouns, personal pronouns, and conjunctions more accurately than students who do not use the gaming application?
    2) Over time, do students who use the gaming application perform better on accurately identifying auxiliary verbs, main verbs, secondary verbs, subjective pronouns, objective pronouns, personal pronouns, and conjunctions than students who do not use the gaming application?

    Talker identification is not improved by lexical access in the absence of familiar phonology

    Listeners identify talkers more accurately when they are familiar with both the sounds and words of the language being spoken. It is unknown whether lexical information alone can facilitate talker identification in the absence of familiar phonology. To dissociate the roles of familiar words and phonology, we developed English-Mandarin “hybrid” sentences, spoken in Mandarin, which can be convincingly coerced to sound like English when presented with corresponding subtitles (e.g., “wei4 gou3 chi1 kao3 li2 zhi1” becomes “we go to college”). Across two experiments, listeners learned to identify talkers in three conditions: listeners' native language (English), an unfamiliar, foreign language (Mandarin), and a foreign language paired with subtitles that primed native language lexical access (subtitled Mandarin). In Experiment 1 listeners underwent a single session of talker identity training; in Experiment 2 listeners completed three days of training. Talkers in a foreign language were identified no better when native language lexical representations were primed (subtitled Mandarin) than from foreign-language speech alone, regardless of whether they had received one or three days of talker identity training. These results suggest that the facilitatory effect of lexical access on talker identification depends on the availability of familiar phonological forms.

    Evaluation of Automatic Vehicle Specific Identification (AVSI) in a traffic signal control system

    Automatic Vehicle Specific Identification (AVSI) is a generic name for advanced vehicle detection systems. By automating the identification of vehicles through roadside detection sites or readers that sense their presence, AVSI is assumed to provide vehicle-specific information in traffic signal control systems. In the application of AVSI to traffic signal control systems, as a vehicle passes a reader site, the reader records the arrival time and type of the detected vehicle. The reader would then send the information received to a local microprocessor-based traffic signal controller. The controller's built-in signal control logic would then use the information to adjust traffic signal timing to reflect the present traffic stream's characteristics. The purpose of this research is to evaluate the potential benefits of AVSI at an isolated intersection. The evaluation of the applicability of AVSI at an intersection is accomplished by using a newly developed microscopic simulation model. This simulation model is coded in the SIMAN simulation language. For the purpose of validating the simulation model, a delay study is conducted at an actual intersection. The validation of the model has established a level of confidence in the obtained simulation results. An important element of this simulation model is the development of a new Vehicle Specific Adaptive (VSA) traffic signal control strategy. The VSA control strategy adjusts the signal timing based on AVSI traffic information; that is, it examines individual vehicle performance characteristics before extending a phase green time or implementing a new cycle split. Using the simulation model, the incorporated VSA control strategy is tested against a pretimed control system. The simulation results indicate that, through the use of AVSI traffic information, the VSA control logic can improve intersection performance by reducing vehicles' stopped delay at an intersection.
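The data flow described in this abstract (a reader reports each vehicle's arrival time and type, and the controller decides whether to extend the green phase) can be pictured with a small sketch. Everything below is a hypothetical illustration: the `Detection` record, the per-type clearance times, and the extension rule are assumptions made for this example, not the VSA logic or the SIMAN model from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Detection:
    """One reader report: when a vehicle passed the reader and what type it is."""
    arrival_time: float  # seconds on the controller clock
    vehicle_type: str


# Assumed travel time (seconds) from the reader site to clearing the stop line.
# These values are placeholders, not measurements from the study.
CLEARANCE_TIME = {"car": 4.0, "truck": 7.0}
DEFAULT_CLEARANCE = 5.0


def extended_green_end(
    detections: Iterable[Detection],
    now: float,
    green_end: float,
    max_green_end: float,
) -> float:
    """Return a green end time late enough for detected vehicles to clear.

    The result is never earlier than the scheduled green end and never later
    than the maximum allowed green end.
    """
    needed = green_end
    for det in detections:
        clear_at = det.arrival_time + CLEARANCE_TIME.get(
            det.vehicle_type, DEFAULT_CLEARANCE
        )
        if clear_at > now:  # vehicle has not yet cleared the intersection
            needed = max(needed, clear_at)
    return min(needed, max_green_end)


reports = [Detection(10.0, "car"), Detection(12.0, "truck")]
new_end = extended_green_end(reports, now=12.0, green_end=15.0, max_green_end=30.0)
```

In this toy run the truck needs until second 19 to clear, so the green is extended from 15 to 19; with a tighter cap the extension would be limited to the cap.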

    ASE@DPIL-FIRE2016: Hindi Paraphrase Detection using Natural Language Processing Techniques & Semantic Similarity Computations

    Abstract: The paper reports the approaches used and the results achieved by our system in the FIRE-2016 shared task on paraphrase identification in Indian languages (DPIL). Because Indian languages are inherently complex, paraphrase identification in these languages is a challenging task. In the DPIL task, the challenge is to determine whether a given sentence pair is a paraphrase or not. In the proposed work, natural language processing with semantic concept extraction is explored for paraphrase detection in Hindi. Stop word removal, stemming, and part-of-speech tagging are employed. Similarity between the sentences in each pair is then computed by extracting semantic concepts using the WordNet lexical database. Initially, the proposed approach is evaluated on the given training sets using different machine learning classifiers. The testing phase then predicts the classes using the proposed features. The results are promising, which shows the potential of natural language processing techniques and semantic concept extraction in detecting paraphrases. CCS Concepts: Computing methodologies - Natural language processing; Information systems - Document analysis and feature selection; Near-duplicate and paraphrase detection
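The preprocessing-then-similarity pipeline mentioned in this abstract can be illustrated with a much simpler stand-in. The sketch below removes stop words and scores a sentence pair by Jaccard overlap of the remaining tokens; it deliberately swaps the paper's WordNet-based semantic concepts for plain token overlap, and the stop-word list, the threshold, and the English example sentences are all assumptions chosen for readability.

```python
# Tiny illustrative stop-word list (an assumption; a real system would use a
# language-specific list, e.g. for Hindi).
STOP_WORDS = frozenset({"the", "a", "is", "of", "in"})


def content_tokens(sentence: str) -> set[str]:
    """Lower-case, split on whitespace, and drop stop words."""
    return {tok for tok in sentence.lower().split() if tok not in STOP_WORDS}


def jaccard_similarity(first: str, second: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    a, b = content_tokens(first), content_tokens(second)
    if not a and not b:
        return 1.0  # two empty token sets are treated as identical
    return len(a & b) / len(a | b)


def is_paraphrase(first: str, second: str, threshold: float = 0.5) -> bool:
    """Label a pair as a paraphrase when overlap reaches a hypothetical threshold."""
    return jaccard_similarity(first, second) >= threshold


same = jaccard_similarity("the cat is in the garden", "a cat is in garden")
different = jaccard_similarity("the cat sleeps", "the dog sleeps")
```

A score like this would be only one feature among several; the paper feeds its features to machine learning classifiers rather than applying a fixed threshold.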