
    Object-oriented engineering of visual languages

    Visual languages are notations that employ graphics (icons, diagrams) to present information in a space of two or more dimensions. This work focuses on diagrammatic visual languages, as found in software engineering, and their computer implementations. Implementation here means developing processors that automatically analyze diagrams and graphical editors for constructing them. We propose a rigorous implementation technique that uses a formal grammar to specify the syntax of a visual language and parsing to automatically analyze the visual sentences the grammar generates. The theoretical contributions of our work are an original treatment of error handling (error detection, reporting, and recovery) in off-line visual language parsing, and the source-to-source translation of visual languages. We have also substantially extended an existing grammatical model for multidimensional languages, called atomic relational grammars. We have added support for meta-language expressions that denote optional and repetitive right-hand-side elements. We hav
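
    The production-rule idea can be pictured with a small sketch. The encoding below is a hypothetical Python rendering, not the thesis's actual formalism: a relational-grammar production pairs right-hand-side symbols, some of which may be marked optional or repetitive, with spatial relations over them.

```python
# Illustrative sketch only: a hypothetical encoding of a relational-grammar
# production whose right-hand side may contain optional and repetitive
# elements, in the spirit of the meta-language extensions described above.
from dataclasses import dataclass, field

@dataclass
class RHSElement:
    symbol: str                # e.g. "State", "Arrow"
    optional: bool = False     # element may be absent
    repetitive: bool = False   # element may occur one or more times

@dataclass
class Production:
    lhs: str                                        # nonterminal being defined
    rhs: list = field(default_factory=list)         # RHSElement instances
    relations: list = field(default_factory=list)   # spatial relations over RHS indices

# Toy production for a state-transition diagram: an Arrow connects a source
# State to a target State; a textual Label near the arrow is optional.
transition = Production(
    lhs="Transition",
    rhs=[RHSElement("Arrow"),
         RHSElement("State"),                 # source
         RHSElement("State"),                 # target
         RHSElement("Label", optional=True)],
    relations=[("touches", 0, 1), ("touches", 0, 2), ("near", 3, 0)],
)
print(transition)
```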

    Finding structure in language

    Since the Chomskian revolution, it has become apparent that natural language is richly structured: it is naturally represented hierarchically and requires complex context-sensitive rules to define regularities over these representations. It is widely assumed that the richness of the posited structure has strong nativist implications for mechanisms which might learn natural language, since it seemed unlikely that such structures could be derived directly from the observation of linguistic data (Chomsky 1965).

    This thesis investigates the hypothesis that simple statistics of a large, noisy, unlabelled corpus of natural language can be exploited to automatically discover some of the structure which exists in natural language. The strategy is to initially assume no knowledge of the structures present in natural language, save that they might be found by analysing statistical regularities which pertain between a word and the words which typically surround it in the corpus.

    To achieve this, various statistical methods are applied to define similarity between statistical distributions, and to infer a structure for a domain given knowledge of the similarities which pertain within it. Using these tools, it is shown that it is possible to form a hierarchical classification of many domains, including words in natural language. When this is done, it is shown that all the major syntactic categories can be obtained, and that the classification is both relatively complete and very much in accord with a standard linguistic conception of how words are classified in natural language.

    Once this has been done, the categorisation derived is used as the basis of a similar classification of short sequences of words. If these are analysed in a similar way, then several syntactic categories can be derived, including simple noun phrases, various tensed forms of verbs, and simple prepositional phrases. The same technique can then be applied one level higher, and at this level simple sentences and verb phrases, as well as more complicated noun phrases and prepositional phrases, are shown to be derivable.
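
    A minimal sketch of the distributional idea follows, using an assumed toy corpus rather than the large corpora used in the thesis: each word is represented by the relative frequencies of its immediate neighbours, and words are clustered hierarchically by the similarity of those distributions (here cosine distance with average-link clustering via SciPy, both of which are illustrative choices, not the thesis's methods).

```python
# Sketch only: the toy corpus, the distance metric, and the clustering method
# are illustrative assumptions.
from collections import Counter, defaultdict
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

corpus = "the cat sat on the mat the dog sat on the rug a cat saw a dog".split()

# Count the immediate left and right neighbours of every word.
contexts = defaultdict(Counter)
for i, w in enumerate(corpus):
    if i > 0:
        contexts[w][("L", corpus[i - 1])] += 1
    if i < len(corpus) - 1:
        contexts[w][("R", corpus[i + 1])] += 1

words = sorted(contexts)
features = sorted({f for c in contexts.values() for f in c})
matrix = np.array([[contexts[w][f] for f in features] for w in words], dtype=float)
matrix /= matrix.sum(axis=1, keepdims=True)   # per-word context distribution

# Agglomerative (average-link) clustering over cosine distances between distributions.
tree = linkage(pdist(matrix, metric="cosine"), method="average")
print(words)
print(tree)   # on real corpora, words of the same syntactic category tend to merge early
```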

    Example-based machine translation using the marker hypothesis

    The development of large-scale rules and grammars for a Rule-Based Machine Translation (RBMT) system is labour-intensive, error-prone and expensive. Current research in Machine Translation (MT) tends to focus on the development of corpus-based systems which can overcome the problem of knowledge acquisition. Corpus-Based Machine Translation (CBMT) can take the form of Statistical Machine Translation (SMT) or Example-Based Machine Translation (EBMT). Despite the benefits of EBMT, SMT is currently the dominant paradigm, and many systems classified as example-based integrate additional rule-based and statistical techniques. The benefits of an EBMT system which does not require extensive linguistic resources and can produce reasonably intelligible and accurate translations cannot be overlooked. We show that our linguistics-lite EBMT system can outperform an SMT system trained on the same data.

    The work reported in this thesis describes the development of a linguistics-lite EBMT system which does not have recourse to extensive linguistic resources. We apply the Marker Hypothesis (Green, 1979), a psycholinguistic theory which states that all natural languages are ‘marked’ for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. We use this technique in different environments to segment aligned (English, French) phrases and sentences. We then apply an alignment algorithm which can deduce smaller aligned chunks and words. Following a process similar to that of Block (2000), we generalise these alignments by replacing certain function words with an associated tag. In so doing, we cluster on marker words and add flexibility to our matching process. In a post hoc stage we treat the World Wide Web as a large corpus and validate and correct instances of determiner-noun and noun-verb boundary friction.

    We have applied our marker-based EBMT system to different bitexts and have explored its applicability in various environments. We have developed a phrase-based EBMT system (Gough et al., 2002; Way and Gough, 2003). We show that, despite the perceived low quality of on-line MT systems, our EBMT system can produce good-quality translations when such systems are used to seed its memories. Carl (2003a) and Schaler et al. (2003) suggest that EBMT is more suited to controlled translation than RBMT, as it has been known to overcome the ‘knowledge acquisition bottleneck’. To this end, we developed the first controlled EBMT system (Gough and Way, 2003; Way and Gough, 2004). Given the lack of controlled bitexts, we used an on-line MT system, Logomedia, to translate a set of controlled English sentences. We performed experiments using controlled analysis and generation and assessed the performance of our system at each stage. We made a number of improvements to our sub-sentential alignment algorithm, and following some minimal adjustments to our system, we show that our controlled EBMT system can outperform an RBMT system.

    We applied the Marker Hypothesis to a more scalable data set. We trained our system on 203,529 sentences extracted from a Sun Microsystems Translation Memory, thus reducing problems of data sparseness and limiting our dependence on Logomedia. We show that scaling up data in a marker-based EBMT system improves the quality of our translations. We also report on the benefits of extracting lexical equivalences from the corpus using Mutual Information.
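
    To make the segmentation step concrete, here is a minimal sketch of Marker-Hypothesis-style chunking. The marker classes and the rule that a chunk closes only once it contains a non-marker word are illustrative assumptions, not the exact sets or logic of the system described above.

```python
# Illustrative marker sets; the thesis uses its own closed-class lists.
MARKERS = {
    "DET":  {"the", "a", "an", "this", "that", "these", "those"},
    "PREP": {"in", "on", "at", "with", "from", "to", "of", "for"},
    "CONJ": {"and", "or", "but"},
    "PRON": {"he", "she", "it", "they", "we", "you", "i"},
}
MARKER_WORDS = {w for ws in MARKERS.values() for w in ws}

def marker_segment(sentence):
    """Open a new chunk at each marker word; a chunk closes only once it
    contains at least one non-marker word, so runs like "to the" stay together."""
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        if word in MARKER_WORDS and has_content:
            chunks.append(" ".join(current))
            current, has_content = [], False
        current.append(word)
        if word not in MARKER_WORDS:
            has_content = True
    if current:
        chunks.append(" ".join(current))
    return chunks

print(marker_segment("The president gave a speech to the delegates in Paris"))
# ['the president gave', 'a speech', 'to the delegates', 'in paris']
```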

    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This thesis explores the potential of using textual patterns for Information Extraction from the World Wide Web.
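
    As a hedged illustration of what a textual pattern is, the sketch below applies a single "X such as Y" pattern with a regular expression. Systems of the kind described here use many such patterns, typically bootstrapped from seed facts and run over web-scale text; this only shows the core idea on a toy string.

```python
# Sketch only: one hard-coded pattern over a toy string, not a web-scale extractor.
import re

PATTERN = re.compile(
    r"(?P<cls>[A-Z][\w-]*)\s+such as\s+"
    r"(?P<inst>[A-Z][\w-]*(?:\s(?:and\s)?[A-Z][\w-]*)*)"
)

text = "Composers such as Mozart and Haydn shaped the classical style."

for m in PATTERN.finditer(text):
    for inst in re.split(r"\s+and\s+|,\s*", m.group("inst")):
        print((inst, "is-a", m.group("cls")))
# ('Mozart', 'is-a', 'Composers')
# ('Haydn', 'is-a', 'Composers')
```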

    Learning logic rules from text using statistical methods for natural language processing

    The field of Natural Language Processing (NLP) examines how computers can be made to do beneficial tasks by understanding natural language. The foundations of NLP are diverse and include scientific fields such as electrical and electronic engineering, linguistics, and artificial intelligence. Some popular NLP applications are information extraction, machine translation, text summarization, and question answering.

    This dissertation proposes a new methodology using Answer Set Programming (ASP) as our main formalism to predict Interpretable Semantic Textual Similarity (iSTS) with a rule-based approach focusing on hard-coded rules for our system, Inspire. We next propose an intelligent rule-learning methodology using Inductive Logic Programming (ILP) and modify the ILP tool eXtended Hybrid Abductive Inductive Learning (XHAIL) in order to test whether we can learn the ASP-based rules that were previously hard-coded for the chunking subtask of the Inspire system. Chunking is the identification of short phrases such as noun phrases, relying mainly on Part-of-Speech (POS) tags. We then evaluate our results using real data sets obtained from the SemEval2016 Task-2 iSTS competition, so as to work with a real application which can be evaluated objectively using the test sets provided by experts.

    The Inspire system participated in the SemEval2016 Task-2 iSTS competition in the subtasks of predicting chunk similarity alignments for gold chunks and system-generated chunks on three different datasets. The Inspire system extended the basic ideas of the SemEval2015 iSTS participant NeRoSim by realising the rules in logic programming and obtaining the result with an Answer Set solver. To prepare the input for the logic program, the PunktTokenizer, Word2Vec, and WordNet APIs of NLTK, and the Part-of-Speech (POS) and Named-Entity-Recognition (NER) taggers from Stanford CoreNLP were used. For the chunking subtask, a joint POS-tagger and dependency parser was used, on whose output an Answer Set program determined the chunks. The Inspire system ranked third place overall and first place on one of the competition datasets in the gold chunk subtask.

    For the above-mentioned system, we decided to automate the sentence chunking process by learning the ASP rules using ILP, a statistical logical method which combines rule-based and statistical artificial intelligence methods. ILP has been applied to a variety of NLP problems, including parsing, information extraction, and question answering. XHAIL, the ILP tool we used, aims at generating a hypothesis, which is a logic program, from given background knowledge and examples of structured knowledge based on information provided by the POS tags. One of the main challenges was to extend the XHAIL algorithm for ILP, which is based on ASP. With respect to processing natural language, ILP can cater for the constant change in how language is used on a daily basis. At the same time, ILP does not require huge amounts of training examples, unlike other statistical methods, and produces interpretable results, that is, a set of rules which can be analysed and tweaked if necessary. As contributions, XHAIL was extended with (i) a pruning mechanism within the hypothesis generalisation algorithm which enables learning from larger datasets, (ii) better usage of modern solver technology using recently developed optimisation methods, and (iii) a time budget that permits the usage of suboptimal results.

    These improvements were evaluated on the subtask of sentence chunking using the same three datasets obtained from the SemEval2016 Task-2 competition. Results show that these improvements allow for learning on bigger datasets with results of similar quality to state-of-the-art systems on the same task. Moreover, the hypotheses obtained from individual datasets were compared to each other to gain insights into the structure of each dataset. Using ILP to extend our Inspire system not only automates the process of chunking the sentences but also provides us with interpretable models that are useful for a deeper understanding of the data being used and how it can be manipulated, a feature that is absent in popular Machine Learning methods.
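
    For readers unfamiliar with POS-based chunking, the sketch below shows the kind of rule involved, written as an NLTK regexp-chunk grammar over a hand-tagged toy sentence rather than as the ASP rules or learned XHAIL hypotheses used in the thesis; the grammar and the example input are illustrative assumptions.

```python
# Sketch only: an NLTK regexp chunker standing in for the ASP/ILP machinery
# described above, applied to a hand-tagged toy sentence.
import nltk

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}    # noun phrase: optional determiner, adjectives, nouns
  PP: {<IN><NP>}             # prepositional phrase: preposition followed by an NP
"""
chunker = nltk.RegexpParser(grammar)

# Hand-tagged toy input (the thesis obtains tags from a joint POS-tagger/parser).
tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("dog", "NN")]

print(chunker.parse(tagged))
# roughly: (S (NP the/DT quick/JJ fox/NN) jumped/VBD (PP over/IN (NP the/DT dog/NN)))
```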

    Computer Aided Verification

    This open access two-volume set, LNCS 10980 and 10981, constitutes the refereed proceedings of the 30th International Conference on Computer Aided Verification, CAV 2018, held in Oxford, UK, in July 2018. The 52 full and 13 tool papers presented together with 3 invited papers and 2 tutorials were carefully reviewed and selected from 215 submissions. The papers cover a wide range of topics and techniques, from algorithmic and logical foundations of verification to practical applications in distributed, networked, cyber-physical, and autonomous systems. They are organized in topical sections on model checking, program analysis using polyhedra, synthesis, learning, runtime verification, hybrid and timed systems, tools, probabilistic systems, static analysis, theory and security, SAT, SMT and decision procedures, concurrency, and CPS, hardware, and industrial applications.