Search CORE

3,268 research outputs found

Discovery of Linguistic Relations Using Lexical Attraction

Author: Yuret Deniz
Publication venue
Publication date: 01/01/1998
Field of study

This work has been motivated by two long term goals: to understand how humans learn language and to build programs that can understand language. Using a representation that makes the relevant features explicit is a prerequisite for successful learning and understanding. Therefore, I chose to represent relations between individual words explicitly in my model. Lexical attraction is defined as the likelihood of such relations. I introduce a new class of probabilistic language models named lexical attraction models which can represent long distance relations between words and I formalize this new class of models using information theory. Within the framework of lexical attraction, I developed an unsupervised language acquisition program that learns to identify linguistic relations in a given sentence. The only explicitly represented linguistic knowledge in the program is lexical attraction. There is no initial grammar or lexicon built in and the only input is raw text. Learning and processing are interdigitated. The processor uses the regularities detected by the learner to impose structure on the input. This structure enables the learner to detect higher level regularities. Using this bootstrapping procedure, the program was trained on 100 million words of Associated Press material and was able to achieve 60% precision and 50% recall in finding relations between content-words. Using knowledge of lexical attraction, the program can identify the correct relations in syntactically ambiguous sentences such as ``I saw the Statue of Liberty flying over New York.''Comment: dissertation, 56 page

arXiv.org e-Print Archive

CiteSeerX

Property Rights and Gender Bias: Evidence from Land Reform in West Bengal

Author: Bhalotra Sonia R
Chakravarty Abhishek
Mookherjee Dilip
Pino Francisco J
Publication venue
Publication date
Field of study

While land reforms are typically pursued in order to raise productivity and reduce inequality across households, an unintended consequence may be increased within-household gender inequality. We analyse a tenancy registration programme in West Bengal, and find that it increased child survival and reduced fertility. However, we also find that it intensified son preference in families without a first-born son to inherit the land title. These families exhibit no reduction in fertility, an increase in the probability that a subsequent birth is male, and a substantial increase in the survival advantage of subsequent sons over daughters

Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition

Author: Kilgour Kevin
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2015
Field of study

This Thesis tackles the problems of modularity in Large-Vocabulary Continuous Speech Recognition with use of Neural Network

Beyond Extractive: Advancing Abstractive Automatic Text Summarization in Norwegian with Transformers

Author: Korsvik Jon-Mikkel Ryen
Navjord Jørgen Johnsen
Publication venue: Norwegian University of Life Sciences
Publication date: 01/01/2023
Field of study

Automatic summarization is a key area in natural language processing (NLP) and machine learning which attempts to generate informative summaries of articles and documents. Despite its evolution since the 1950s, research on automatically summarising Norwegian text has remained relatively underdeveloped. Though there have been some strides made in extractive systems, which generate summaries by selecting and condensing key phrases directly from the source material, the field of abstractive summarization remains unexplored for the Norwegian language. Abstractive summarization is distinct as it generates summaries incorporating new words and phrases not present in the original text. This Master's thesis revolves around one key question: Is it possible to create a machine learning system capable of performing abstractive summarization in Norwegian? To answer this question, we generate and release the first two Norwegian datasets for creating and evaluating Norwegian summarization models. One of these datasets is a web scrape of Store Norske Leksikon (SNL), and the other is a machine-translated version of CNN/Daily Mail. Using these datasets, we fine-tune two Norwegian T5 language models with 580M and 1.2B parameters to create summaries. To assess the quality of the models, we employed both automatic ROUGE scores and human evaluations on the generated summaries. In an effort to better understand the model's behaviour, we measure how a model generates summaries with various metrics, including our own novel contribution which we name "Match Ratio" which measures sentence similarities between summaries and articles based on Levenshtein distances. The top-performing models achieved ROUGE-1 scores of 35.07 and 34.02 on SNL and CNN/DM, respectively. In terms of human evaluation, the best model yielded an average score of 3.96/5.00 for SNL and 4.64/5.00 for CNN/Daily Mail across various criteria. Based on these results, we conclude that it is possible to perform abstractive summarization of Norwegian with high-quality summaries. With this research, we have laid a foundation that hopefully will facilitate future research, empowering others to build upon our findings and contribute further to the development of Norwegian summarization models