164 research outputs found

    Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection

    Get PDF

    Mixed-Language Arabic- English Information Retrieval

    Get PDF
    Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries

    Are You Finding the Right Person? A Name Translation System Towards Web 2.0

    Get PDF
    In a multilingual world, information available in global information systems is increasing rapidly. Searching for proper names in foreign language becomes an important task in multilingual search and knowledge discovery. However, these names are the most difficult to handle because they are often unknown words that cannot be found in a translation dictionary and even human experts cannot handle the variation generated during translation. Furthermore, existing research on name translation have focused on translation algorithms. However, user experience during name translation and name search are often ignored. With the Web technology moving towards Web 2.0, creating a platform that allow easier distributed collaboration and information sharing, we seek methods to incorporate Web 2.0 technologies into a name translation system. In this research, we review challenges in name translation and propose an interactive name translation and search system: NameTran. This system takes English names and translates them into Chinese using a combined hybrid Hidden Markov Model-based (HMM-based) transliteration approach and a web mining approach. Evaluation results showed that web mining consistently boosted the performance of a pure HMM approach. Our system achieved top-1 accuracy of 0.64 and top-8 accuracy of 0.96. To cope with changing popularity and variation in name translations, we demonstrated the feasibility of allowing users to rank translations and the new ranking serves as feedback to the original trained HMM model. We believe that such user input will significantly improve system usability

    Domain adaptation for statistical machine translation of corporate and user-generated content

    Get PDF
    The growing popularity of Statistical Machine Translation (SMT) techniques in recent years has led to the development of multiple domain-specic resources and adaptation scenarios. In this thesis we address two important and industrially relevant adaptation scenarios, each suited to different kinds of content. Initially focussing on professionally edited `enterprise-quality' corporate content, we address a specic scenario of data translation from a mixture of different domains where, for each of them domain-specific data is available. We utilise an automatic classifier to combine multiple domain-specific models and empirically show that such a configuration results in better translation quality compared to both traditional and state-of-the-art techniques for handling mixed domain translation. In the second phase of our research we shift our focus to the translation of possibly `noisy' user-generated content in web-forums created around products and services of a multinational company. Using professionally edited translation memory (TM) data for training, we use different normalisation and data selection techniques to adapt SMT models to noisy forum content. In this scenario, we also study the effect of mixture adaptation using a combination of in-domain and out-of-domain data at different component levels of an SMT system. Finally we focus on the task of optimal supplementary training data selection from out-of-domain corpora using a novel incremental model merging mechanism to adapt TM-based models to improve forum-content translation quality

    Multilingual sentiment analysis in social media.

    Get PDF
    252 p.This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language a tool providing only analysis for Basque would not be enough for a real world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish.The thesis addresses the following challenges to build such a system:- Analysing methods for creating Sentiment lexicons, suitable for less resourced languages.- Analysis of social media (specifically Twitter): Tweets pose several challenges in order to understand and extract opinions from such messages. Language identification and microtext normalization are addressed.- Research the state of the art in polarity classification, and develop a supervised classifier that is tested against well known social media benchmarks.- Develop a social media monitor capable of analysing sentiment with respect to specific events, products or organizations
    corecore