31 research outputs found

    Null Element Restoration

    Understanding the syntactic structure of a sentence is a necessary preliminary to understanding its semantics and therefore for many practical applications. The field of natural language processing has achieved a high degree of accuracy in parsing, at least in English. However, the syntactic structures produced by the most commonly used parsers are less detailed than those structures found in the treebanks the parsers were trained on. In particular, these parsers typically lack the null elements used to indicate wh-movement, control, and other phenomena. This thesis presents a system for inserting these null elements into parse trees in English. It then examines the problem in Arabic, which motivates a second, joint-inference system which has improved performance on English as well. Finally, it examines the application of information derived from the Google Web 1T corpus as a way of reducing certain data sparsity issues related to wh-movement.

    Arabic and English Relative Clauses and Machine Translation Challenges

    The study aims to perform an error analysis and to evaluate the quality of neural machine translation (NMT), represented by Google Translate, when translating relative clauses. The study uses two test suites composed of sentences that contain relative clauses. The first test suite consists of 108 sentence pairs translated from English into Arabic, whereas the second consists of 72 Arabic sentences translated into English. Error annotation is performed by six professional annotators. The study presents a list of the annotated errors, divided into accuracy and fluency errors based on MQM. Manual evaluation is also performed by the six professionals, along with a BLEU automatic evaluation using the Tilde Me platform. The results show that fluency errors are more frequent than accuracy errors. They also show that both the frequency of errors and MT quality are lower when translating from English into Arabic than when translating from Arabic into English. Based on the error analysis and on both the manual and automatic evaluations, the study points out that the gap between MT and professional human translation is still large.

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis. In the age of the internet, email, and social media, there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria. Egyptian Government

    Lexical selection for machine translation

    Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon and at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. 
With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer) whose task is to extract translational equivalents from the parallel corpus. The framework starts with automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again. The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89. EThOS - Electronic Theses Online Service; Egyptian Government; GB; United Kingdo

    Head-Driven Phrase Structure Grammar

    Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism)

    Source and revision in the narratives of David's transfer of the Ark: text, language and story in 2 Samuel 6 and 1 Chronicles 13, 15-16

    The aim of this thesis is to evaluate the relationship between Samuel and Chronicles in a single synoptic story: David's transfer of Israel's sacred ark to Jerusalem in 2 Samuel 6 and 1 Chronicles 13, 15-16. Chapter one surveys areas of research related to Samuel and Chronicles. First, the writer summarises research and perspectives on these books and their stories of David's ark transfer. The review highlights competing approaches to Samuel which centre on either sources or composition and on either a diachronic or synchronic methodology. The literary history of Samuel is inadequate in conventional perspective, and must be freshly unfolded, and consequently the relationship of Samuel and Chronicles must also be re-evaluated. Second, the writer reviews the textual evidence for both books, focusing on the received versions, the Greek translations, and in the case of Samuel, on the Dead Sea Scrolls. The witnesses to Chronicles are relatively uniform, and it is suggested that the pluriformity among witnesses to Samuel, and the character of the MT of this book, are related to Samuel's editorial history. In particular, revisers reshaped the story of David's ark transfer in Chronicles and Samuel. Chapter two surveys issues related to synchronic and diachronic approaches to Samuel and Chronicles. The writer suggests that the impasse between these competing approaches may be resolved by the textual-exegetical approach, that is, by using text-critical controls on redactional arguments. The versional evidence substantiates the validity of the diachronic approach (there are earlier and later forms of biblical texts and editions of biblical stories), and scholars can use this evidence to discern literary origins and developments, developments in the versions whose special features, and the reasons for them, may be perceived and appreciated through holistic or final-form readings.
Related to this, the writer points out that the issues of text, language (grammar, vocabulary, style) and story are interconnected. Textual variation and grammatical and stylistic incongruities and lexical discrepancies frequently signal editorial developments in biblical compositions. Three helpful models for understanding this developmental process are considered: McKane's rolling corpus, Tov's and Ulrich's literary layers, and Fishbane's inner-biblical exegesis. Finally, it is stated that the principal text-critical aim in this thesis is the detection of earlier and later forms of biblical texts or stories, or to state it differently, the discovery of earlier and later stages in their editorial histories. Using the aforementioned insights and methodologies, chapters three through six closely examine 2 Samuel 6 and the synoptic portions of 1 Chronicles 13, 15-16. The latter has one short and two lengthy pluses (13.1-4; 15.1-24; 16.4-42) but the text and story in its synoptic material are more primitive than in synoptic MT Samuel. 2 Samuel 6 has one short plus (vv. 20b-23) but the text and story in its synoptic material have developed in MT Samuel beyond LXX Samuel and beyond synoptic Chronicles. In other words, 2 Samuel 6 is a shorter version on the whole, yet in many particulars the MT is a later version of the story of David's ark transfer. The text's 'poor condition' is evidence of its editorial history. Overall, 2 Samuel 6 shows greater textual variation and fluidity, more doublets, and more interpretative difficulties than does 1 Chronicles 13, 15-16. Specifically, the MT reflects much literary creativity and ideological bias. The readings special to this text relate to an apology of Davidic kingship, an apology of Davidic and Yahwistic character, and cultic practice. In addition, many textual manipulations in MT 2 Samuel 6 connect to the language of stories in 1 Samuel, especially chapters 2, 10-15, 17 and 25.
All these interconnected adjustments point to successive editorial interventions over a substantial period of time, and their cumulative appearance and objective may be labelled a literary layer. The thesis concludes with observations regarding the implications of the present investigation for the theories of A. G. Auld.
