102 research outputs found

    Text Extraction and Web Searching in a Non-Latin Language

    Recent studies of queries submitted to Internet search engines show that non-English and unclassifiable queries have nearly tripled during the last decade. Most search engines were originally engineered for English: they do not take full account of inflectional semantics, nor of features such as diacritics or capitalization conventions that are common in languages other than English. The literature concludes that searching with non-English and non-Latin queries is less successful and requires additional user effort to achieve acceptable precision. The primary aim of this research is to develop an evaluation methodology for identifying the shortcomings and measuring the effectiveness of search engines with non-English queries; it also proposes a number of solutions to the existing situation. A Greek query log is analysed with regard to the morphological features of the Greek language, and a text extraction experiment reveals problems related to encoding and to the morphological and grammatical differences among semantically equivalent Greek terms. A first stopword list for Greek, based on a domain-independent collection, has been produced and its application to Web searching studied. The effect of lemmatizing query terms and the factors influencing text-based image retrieval in Greek are also examined. Finally, an instructional strategy is presented for teaching non-English-speaking students how to use search engines effectively. The evaluation of search engine capabilities showed that international and nationwide search engines ignore most of the linguistic idiosyncrasies of Greek and other complex European languages, and that there is a lack of freely available non-English resources (test corpora, linguistic resources, etc.) to work with. The research showed that applying standard IR techniques, such as stopword removal, stemming, lemmatization and query expansion, to Greek Web searching increases precision.
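    The query preprocessing the abstract describes can be illustrated with a short sketch. This is a minimal, hypothetical Python example: the stopword set is a tiny illustrative sample, not the domain-independent Greek list produced in the study, and normalization covers only case folding and diacritic stripping.

    ```python
    # Minimal sketch of Greek query preprocessing: case folding, diacritic
    # stripping, and stopword removal. The stopword set below is a toy sample,
    # not the study's domain-independent Greek stopword list.
    import unicodedata

    SAMPLE_GREEK_STOPWORDS = {"και", "το", "η", "ο", "σε", "με", "για", "απο"}

    def normalize(term: str) -> str:
        """Lower-case a Greek term and strip diacritics such as the tonos."""
        decomposed = unicodedata.normalize("NFD", term.lower())
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    def preprocess_query(query: str) -> list[str]:
        """Normalize query terms and drop stopwords before searching."""
        return [t for t in (normalize(t) for t in query.split())
                if t not in SAMPLE_GREEK_STOPWORDS]

    print(preprocess_query("πληροφορίες για την Ελλάδα"))
    # ['πληροφοριες', 'την', 'ελλαδα'] -- a fuller stopword list would drop 'την'
    ```

    Stemming or lemmatization, which the study found to increase precision, would slot in after normalization; both are omitted here because they require language-specific resources.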

    Mining Twitter for crisis management: realtime floods detection in the Arabian Peninsula

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. In recent years, large amounts of data have been made available on microblog platforms such as Twitter; however, it is difficult to filter and extract information and knowledge from such data because of its high volume and noise. On Twitter, the general public can report real-world events such as floods in real time, acting as social sensors. It is therefore beneficial to have a method that detects flood events automatically in real time, helping governmental authorities, such as crisis management agencies, to recognise an event and make decisions during its early stages. This thesis proposes a real-time flood detection system that mines Arabic tweets using machine learning and data mining techniques. The proposed system comprises six main components: data collection, pre-processing, flood event extraction, location inference, location named entity linking, and flood event visualisation. An effective method of flood detection from Arabic tweets is presented and evaluated using supervised learning techniques. Furthermore, this work presents a location named entity inference method based on the Learning to Search approach; the results show that the proposed method outperforms existing systems, with significantly higher accuracy when inferring flood locations from tweets written in colloquial Arabic. For location named entity linking, a method has been designed that uses Google API services as a knowledge base to extract geocode coordinates associated with the location named entities mentioned in tweets. The results show that the proposed linking method locates 56.8% of tweets within 0-10 km of the actual location. Further analysis shows that the accuracy of locating tweets in the correct city and region is 78.9% and 84.2% respectively.
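    As a rough illustration of the supervised flood-detection step, the sketch below trains a toy classifier on labelled Arabic tweets. It is a minimal sketch under stated assumptions: the example tweets, labels, and the TF-IDF-plus-logistic-regression pipeline are illustrative stand-ins, not the thesis's actual data or model.

    ```python
    # Minimal sketch of supervised flood detection on Arabic tweets
    # (1 = reports flooding, 0 = does not). Data and model choice are
    # illustrative, not the thesis's actual setup.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    tweets = ["سيول في جدة الآن", "الطقس جميل اليوم"]  # toy labelled examples
    labels = [1, 0]

    # Character n-grams cope better with the spelling variation of colloquial Arabic.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(),
    )
    model.fit(tweets, labels)
    print(model.predict(["أمطار غزيرة وسيول في الرياض"]))  # likely [1]: shares flood vocabulary
    ```

    The downstream linking step would then pass extracted location names to a geocoding service (the thesis uses Google API services) to obtain coordinates for visualisation.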

    Automatic Vocalization of Arabic Text


    The Future of Information Sciences: INFuture2009: Digital Resources and Knowledge Sharing


    Building the Arabic Learner Corpus and a System for Arabic Error Annotation

    Recent developments in learner corpora have highlighted the growing role they play in linguistic and computational research areas such as language teaching and natural language processing. However, there is a lack of a well-designed Arabic learner corpus that can be used for studies in these areas. This thesis introduces a detailed and original methodology for developing a new learner corpus. This methodology, which represents the major contribution of the thesis, combines resources, proposed standards and tools developed for the Arabic Learner Corpus project. The resources include the Arabic Learner Corpus (ALC), the largest learner corpus for Arabic based on systematic design criteria, and the Error Tagset of Arabic, designed for annotating errors in Arabic and covering 29 error types under five broad categories. The Guide on Design Criteria for Learner Corpus is an example of the proposed standards; created from a review of previous work, it focuses on 11 aspects of corpus design criteria. The tools include the Computer-aided Error Annotation Tool for Arabic, which provides functions that facilitate error annotation, such as smart selection and auto-tagging, and the ALC Search Tool, developed to enable searching the ALC and downloading its source files according to a number of criteria. The project successfully recruited 992 people, including language learners, data collectors, evaluators, annotators and collaborators, from more than 30 educational institutions in Saudi Arabia and the UK. The data of the Arabic Learner Corpus has been used in a number of projects for different purposes, including error detection and correction, native language identification, evaluation of Arabic analysers, applied linguistics studies and data-driven Arabic learning. This uptake of the ALC highlights the importance of developing the project.
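    To make the annotation scheme concrete, here is a minimal sketch of how a single error annotation might be represented. The field names, the example tags ("orthography"/"hamza") and the offset convention are hypothetical illustrations; the actual Error Tagset of Arabic defines 29 types under five categories and is not reproduced here.

    ```python
    # Hypothetical sketch of one error annotation over a learner text.
    # Tag names are illustrative, not the actual Error Tagset of Arabic.
    from dataclasses import dataclass

    @dataclass
    class ErrorAnnotation:
        start: int        # character offset where the erroneous span begins
        end: int          # character offset where it ends (inclusive)
        original: str     # the learner's form
        correction: str   # the suggested target form
        category: str     # one of the five broad categories (illustrative name)
        subtype: str      # one of the 29 finer-grained types (illustrative name)

    text = "ذهبت الى المدرسه"  # learner sentence with an orthographic slip
    ann = ErrorAnnotation(5, 7, "الى", "إلى", "orthography", "hamza")
    assert text[ann.start:ann.end + 1] == ann.original  # sanity-check the offsets
    ```

    A computer-aided tool like the one described above would produce such records through its smart-selection and auto-tagging functions rather than by hand.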

    Union Catalogs at the Crossroad

    The Andrew W. Mellon Foundation and the National Library of Estonia organized a conference on union catalogs, held at the National Library of Estonia in Tallinn on October 17-19, 2002. The conference presented and discussed analytical papers on various aspects of designing and implementing union catalogs and shared cataloging systems, as revealed through the experiences of Eastern European, Baltic and South African research libraries. Here you can find the texts of the conference papers and the list of contributors and participants.

    Rapid Generation of Pronunciation Dictionaries for New Domains and Languages

    This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for a range of conditions, from the straightforward scenario in which the target language is available in written form on the Internet and the mapping between speech and written language is close, to the difficult scenario in which no written form of the target language exists.
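    For the straightforward end of that spectrum, where the grapheme-to-phoneme mapping is close, dictionary generation can be as simple as applying letter-to-phone rules. The sketch below is a toy illustration under that assumption; the rule table is invented and the dissertation's actual methods are more elaborate.

    ```python
    # Toy sketch of rule-based pronunciation generation for a language with a
    # close grapheme-to-phoneme mapping. The rule table is invented for
    # illustration, not taken from the dissertation.
    G2P_RULES = {"a": "AH", "b": "B", "n": "N", "o": "OW", "t": "T"}

    def pronounce(word: str) -> list[str]:
        """Map each grapheme to a phone; unknown letters pass through upper-cased."""
        return [G2P_RULES.get(ch, ch.upper()) for ch in word.lower()]

    def build_dictionary(words: list[str]) -> dict[str, str]:
        """Produce one pronunciation-dictionary entry per word."""
        return {w: " ".join(pronounce(w)) for w in words}

    print(build_dictionary(["baton", "nab"]))
    # {'baton': 'B AH T OW N', 'nab': 'N AH B'}
    ```

    For the difficult scenario with no written form, no such rule table exists, and pronunciations would have to be derived from speech data instead.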