Search CORE

22 research outputs found

Hindi CCGbank: CCG Treebank from the Hindi Dependency Treebank

Author: A Bharati
A Joshi
A Mahajan
B Kumari
Bharat Ram Ambati
C Shastri
D Hays
J Hockenmaier
J Nivre
J Robinson
M Kuhlmann
M Lewis
M Palmer
M Steedman
Mark Steedman
MP Marcus
N Xue
S Clark
S Reddy
S Uematsu
T Mohanan
Tejaswini Deoskar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Crossref

Springer - Publisher Connector

Edinburgh Research Explorer

An Annotation Scheme for English Language using Paninian Framework

Author: Ajay Jangra
Amita
Publication venue
Publication date: 06/03/2020
Field of study

Abstract This paper presents a comprehensive study about the Panini's karaka relations for English. Paninian framework is suitable to all Indian language but some issues occur when applied to English languages. This paper discuss what are these issues and different approaches that were used in past

CiteSeerX

Automatic summarization of Malayalam documents using clause identification method

Author: C Sunitha
Ganesh Amal
Jaya A
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/12/2019
Field of study

Text summarization is an active research area in the field of natural language processing. Huge amount of information in the internet necessitates the development of automatic summarization systems. There are two types of summarization techniques: Extractive and Abstractive. Extractive summarization selects important sentences from the text and produces summary as it is present in the original document. Abstractive summarization systems will provide a summary of the input text as is generated by human beings. Abstractive summary requires semantic analysis of text. Limited works have been carried out in the area of abstractive summarization in Indian languages especially in Malayalam. Only extractive summarization methods are proposed in Malayalam. In this paper, an abstractive summarization system for Malayalam documents using clause identification method is proposed. As part of this research work, a POS tagger and a morphological analyzer for Malayalam words in cricket domain are also developed. The clauses from input sentences are identified using a modified clause identification algorithm. The clauses are then semantically analyzed using an algorithm to identify semantic triples - subject, object and predicate. The score of each clause is then calculated by using feature extraction and the important clauses which are to be included in the summary are selected based on this score. Finally an algorithm is used to generate the sentences from the semantic triples of the selected clauses which is the abstractive summary of input documents

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Institute of Advanced Engineering and Science

Automatic grammar induction from free text using insights from cognitive grammar

Author: Muralidaran Vigneshwaran
Publication venue
Publication date
Field of study

Automatic identification of the grammatical structure of a sentence is useful in many Natural Language Processing (NLP) applications such as Document Summarisation, Question Answering systems and Machine Translation. With the availability of syntactic treebanks, supervised parsers have been developed successfully for many major languages. However, for low-resourced minority languages with fewer digital resources, this poses more of a challenge. Moreover, there are a number of syntactic annotation schemes motivated by different linguistic theories and formalisms which are sometimes language specific and they cannot always be adapted for developing syntactic parsers across different language families. This project aims to develop a linguistically motivated approach to the automatic induction of grammatical structures from raw sentences. Such an approach can be readily adapted to different languages including low-resourced minority languages. We draw the basic approach to linguistic analysis from usage-based, functional theories of grammar such as Cognitive Grammar, Computational Paninian Grammar and insights from psycholinguistic studies. Our approach identifies grammatical structure of a sentence by recognising domain-independent, general, cognitive patterns of conceptual organisation that occur in natural language. It also reflects some of the general psycholinguistic properties of parsing by humans - such as incrementality, connectedness and expectation. Our implementation has three components: Schema Definition, Schema Assembly and Schema Prediction. Schema Definition and Schema Assembly components were implemented algorithmically as a dictionary and rules. An Artificial Neural Network was trained for Schema Prediction. By using Parts of Speech tags to bootstrap the simplest case of token level schema definitions, a sentence is passed through all the three components incrementally until all the words are exhausted and the entire sentence is analysed as an instance of one final construction schema. The order in which all intermediate schemas are assembled to form the final schema can be viewed as the parse of the sentence. Parsers for English and Welsh (a low-resource minority language) were developed using the same approach with some changes to the Schema Definition component. We evaluated the parser performance by (a) Quantitative evaluation by comparing the parsed chunks against the constituents in a phrase structure tree (b) Manual evaluation by listing the range of linguistic constructions covered by the parser and by performing error analysis on the parser outputs (c) Evaluation by identifying the number of edits required for a correct assembly (d) Qualitative evaluation based on Likert scales in online surveys

Online Research @ Cardiff

Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

Author: EHRMANN MAUD
TURCHI MARCO
Publication venue: Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional, Mexico
Publication date: 09/08/2011
Field of study

Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.JRC.G.2-Global security and crisis managemen

JRC Publications Repository