
    Statistical Function Tagging and Grammatical Relations of Myanmar Sentences

    This paper describes a context-free grammar (CFG) based approach to deriving the grammatical relations of Myanmar sentences, combined with a corpus-based function tagging system. Part of the challenge of statistical function tagging for Myanmar comes from the fact that the language has free phrase order and a complex morphological system. Function tagging is a pre-processing step that exposes the grammatical relations of Myanmar sentences. In the function tagging task, which labels the function of each word in a Myanmar sentence given correct segmentation, POS (part-of-speech) tagging, and chunking information, we use Naive Bayes to disambiguate the possible function tags of a word. We then apply the CFG to derive the grammatical relations among the function tags. We also create a function-annotated corpus for Myanmar and propose grammar rules for Myanmar sentences. Experiments show that our analysis achieves good results on both simple and complex sentences.
    Comment: 16 pages, 7 figures, 8 tables, AIAA-2011 (India). arXiv admin note: text overlap with arXiv:0912.1820 by another author.
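    The tagging step above is standard Naive Bayes disambiguation: pick the function tag that maximizes the prior times the product of per-feature likelihoods. The sketch below illustrates that pattern with add-one smoothing; the feature encodings (e.g., pos=..., postpos=...) and the tags SUBJ/OBJ are hypothetical stand-ins, not the paper's actual feature set.

        import math
        from collections import Counter, defaultdict

        class NaiveBayesFunctionTagger:
            def __init__(self):
                self.tag_counts = Counter()                 # tag -> count
                self.feat_counts = defaultdict(Counter)     # tag -> feature -> count

            def train(self, examples):
                # examples: (features, tag) pairs; features are strings emitted by
                # the segmentation/POS/chunking pipeline, e.g. "pos=NN"
                for feats, tag in examples:
                    self.tag_counts[tag] += 1
                    for f in feats:
                        self.feat_counts[tag][f] += 1

            def predict(self, feats):
                total = sum(self.tag_counts.values())
                vocab = {f for c in self.feat_counts.values() for f in c}
                best_tag, best_score = None, float("-inf")
                for tag, n in self.tag_counts.items():
                    score = math.log(n / total)             # log prior P(tag)
                    denom = sum(self.feat_counts[tag].values()) + len(vocab)
                    for f in feats:                         # add-one smoothed P(f | tag)
                        score += math.log((self.feat_counts[tag][f] + 1) / denom)
                    if score > best_score:
                        best_tag, best_score = tag, score
                return best_tag

        tagger = NaiveBayesFunctionTagger()
        tagger.train([(["pos=NN", "chunk=NP", "postpos=ka"], "SUBJ"),
                      (["pos=NN", "chunk=NP", "postpos=ko"], "OBJ")])
        print(tagger.predict(["pos=NN", "postpos=ko"]))     # -> "OBJ"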

    Boosting Applied to Word Sense Disambiguation

    In this paper Schapire and Singer's AdaBoost.MH boosting algorithm is applied to the Word Sense Disambiguation (WSD) problem. Initial experiments on a set of 15 selected polysemous words show that the boosting approach surpasses Naive Bayes and exemplar-based approaches, which represent state-of-the-art accuracy on supervised WSD. In order to make boosting practical for a real learning domain of thousands of words, several ways of accelerating the algorithm by reducing the feature space are studied. The best variant, which we call LazyBoosting, is tested on the largest sense-tagged corpus available, containing 192,800 examples of the 191 most frequent and ambiguous English words. Again, boosting compares favourably to the other benchmark algorithms.
    Comment: 12 pages.
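    The core setup, one boosted classifier per ambiguous word trained on local context features, can be sketched with scikit-learn. Note this uses plain multiclass AdaBoost over decision stumps rather than the paper's AdaBoost.MH, and the toy contexts for "bank" are invented.

        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.tree import DecisionTreeClassifier

        # Toy training contexts for the ambiguous word "bank"
        contexts = ["deposit money in the bank", "fish along the river bank",
                    "the bank approved the loan", "the bank of the stream eroded"]
        senses = ["finance", "river", "finance", "river"]

        clf = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),  # unigram/bigram context features
            AdaBoostClassifier(                   # requires scikit-learn >= 1.2
                estimator=DecisionTreeClassifier(max_depth=1),  # weak rule = stump
                n_estimators=50),
        )
        clf.fit(contexts, senses)
        print(clf.predict(["she opened an account at the bank"]))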

    TONGS: TLDR; Opinion Network Guide System

    In the modern world, huge amounts of text are being generated every minute. For example, Twitter users post their current emotions in tweets, while Facebook users vent about their experiences in posts. In just one minute, Twitter users upload 350,000 tweets, and Facebook users publish between 2.5 million and 3 million posts. To keep up with this growth in data, almost all of this information goes through automated text processing. To extract features such as opinion and subjectivity from text, sentiment analysis is applied to the corpus. In this thesis, we present the TONGS library for conducting sentiment analysis. TONGS uses Word2Vec within the TensorFlow library to convert words into vector-space representations. The TONGS library contains four different methods built upon previous research in sentiment analysis and Word2Vec. We experiment with and analyze these methods using the IMDB dataset. Finally, we introduce and test a new sentiment dataset from government hearings obtained through Digital Democracy, challenging the accuracy of the TONGS library on a previously unseen topic.
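    The general pattern described here, word vectors feeding a sentiment classifier, can be sketched as follows. This is not the TONGS API: gensim's Word2Vec stands in for the TensorFlow implementation, document vectors are simple averages, and the toy reviews and labels are invented.

        import numpy as np
        from gensim.models import Word2Vec
        from sklearn.linear_model import LogisticRegression

        # Toy tokenized reviews; 1 = positive, 0 = negative
        docs = [["great", "movie", "loved", "it"], ["terrible", "boring", "film"],
                ["loved", "the", "acting"], ["boring", "and", "terrible"]]
        labels = [1, 0, 1, 0]

        w2v = Word2Vec(docs, vector_size=50, min_count=1, seed=0)

        def doc_vector(tokens):
            # average the vectors of in-vocabulary tokens; zeros if none are known
            vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

        X = np.vstack([doc_vector(d) for d in docs])
        clf = LogisticRegression().fit(X, labels)
        print(clf.predict([doc_vector(["loved", "this", "film"])]))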

    Supervised Learning for Multi-Domain Text Classification

    Digital information available on the Internet is increasing day by day. As a result, the demand for tools that help people find and analyze all these resources is also growing. Text classification, in particular, has been very useful in managing this information. Text classification is the process of assigning natural-language text to one or more categories based on its content, and it has many important real-world applications; determining the sentiment of reviews that people post about restaurants, movies, and similar things is one of them. This project focuses on sentiment analysis, which identifies the opinions expressed in a piece of text by categorizing them into classes like 'positive' or 'negative'. Existing work in sentiment analysis has focused on determining the polarity (positive or negative) of a sentence, which is binary classification: assigning each element of a set to one of two groups. The purpose of this research is to address a different approach, multi-class sentiment classification, in which sentences are classified under multiple sentiment classes such as positive, negative, and neutral. Classifiers are built on a predictive model that consists of multiple phases. Different feature sets on the data, such as stemmers, n-grams, and tf-idf, are considered for classifying the data. Different classification models, namely a Bayesian classifier, Random Forest, and an SGD classifier, are used to classify the data and their results are compared. Frameworks such as Weka, Apache Mahout, and Scikit are used for building the classifiers.
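    As a concrete illustration of this setup, the following scikit-learn sketch feeds tf-idf n-gram features into the three model families named above and compares their predictions; the review snippets and labels are invented toy data, not the project's dataset.

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import SGDClassifier
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        reviews = ["the food was wonderful", "awful service, never again",
                   "it was fine, nothing special", "wonderful and friendly staff",
                   "nothing special at all", "awful, just awful"]
        labels = ["positive", "negative", "neutral",
                  "positive", "neutral", "negative"]

        for model in (MultinomialNB(), RandomForestClassifier(), SGDClassifier()):
            pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), model)
            pipe.fit(reviews, labels)
            print(type(model).__name__, pipe.predict(["the service was wonderful"]))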

    SecREP : A Framework for Automating the Extraction and Prioritization of Security Requirements Using Machine Learning and NLP Techniques

    Gathering and extracting security requirements adequately requires extensive effort, experience, and time, as large amounts of data need to be analyzed. While many manual and academic approaches have been developed to tackle the discipline of Security Requirements Engineering (SRE), a need still exists for automating the SRE process. This need stems mainly from the difficult, error-prone, and time-consuming nature of traditional and manual frameworks. Machine learning techniques have been widely used to facilitate and automate the extraction of useful information from software requirements documents and artifacts, and such approaches can yield beneficial results in automating the extraction and elicitation of security requirements. However, extraction alone leaves software engineers with yet another tedious task: prioritizing the most critical security requirements. The competitive and fast-paced nature of software development, in addition to resource constraints, makes security requirements prioritization crucial for software engineers to make educated decisions in risk analysis and trade-off analysis. To that end, this thesis presents an automated framework/pipeline for extracting and prioritizing security requirements. The proposed framework, called the Security Requirements Extraction and Prioritization Framework (SecREP), consists of two parts. Part 1 proposes a machine learning approach for identifying and extracting security requirements from natural-language software requirements artifacts (e.g., the Software Requirements Specification, or SRS, document). Part 2 proposes a scheme for prioritizing the security requirements identified in the previous step. For the first part, three machine learning models (SVM, Naive Bayes, and Random Forest) were trained on an enhanced dataset, the “SecREP Dataset”, created as a result of this work. Each model was validated using resampling (80% for training and 20% for validation) and 5-fold cross-validation. For the second part, a prioritization scheme was established with the aid of NLP techniques: each security requirement is analyzed with part-of-speech (POS) tagging and named entity recognition to extract its assets, security attributes, and threats, and is additionally compared, via a text similarity method, to a super-sentence defined based on the STRIDE threat model. This prioritization scheme was applied to the list of security requirements extracted in the case study of part one, and the priority score for each requirement was calculated and showcased.
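    The evaluation protocol described for Part 1 (three models, an 80/20 hold-out split, and 5-fold cross-validation) follows a standard scikit-learn pattern, sketched below; the requirement sentences and labels are invented, not drawn from the SecREP Dataset.

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import cross_val_score, train_test_split
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Invented requirements, alternating security / non-security labels
        reqs = ["the system shall encrypt passwords at rest",
                "the UI shall display the order history",
                "sessions shall expire after 15 minutes of inactivity",
                "the report page shall support CSV export",
                "all admin actions shall be logged for audit",
                "users can change their display name"] * 5
        labels = ["security", "functional"] * 15

        X_tr, X_te, y_tr, y_te = train_test_split(reqs, labels,
                                                  test_size=0.2, random_state=0)
        for model in (LinearSVC(), MultinomialNB(), RandomForestClassifier()):
            pipe = make_pipeline(TfidfVectorizer(), model)
            holdout = pipe.fit(X_tr, y_tr).score(X_te, y_te)       # 80/20 resampling
            cv = cross_val_score(pipe, reqs, labels, cv=5).mean()  # 5-fold CV
            print(f"{type(model).__name__}: holdout={holdout:.2f}, cv={cv:.2f}")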

    Classification of protein interaction sentences via Gaussian processes

    The increase in the availability of protein interaction studies in textual format, coupled with the demand for easier access to the key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for extracting small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier that has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the absence of the margin parameter, which requires costly tuning, together with the principled multiclass extensions enabled by the probabilistic framework, makes GPs an appealing alternative worthy of further adoption.
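    A minimal sketch of GP classification over sentence features, using scikit-learn's GaussianProcessClassifier (the paper's own GP formulation and kernel may well differ); the sentences and labels are invented toy data.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.gaussian_process import GaussianProcessClassifier
        from sklearn.gaussian_process.kernels import RBF

        sentences = ["RAD51 interacts with BRCA2 in vivo",
                     "the patients were monitored for six weeks",
                     "TP53 binds directly to MDM2",
                     "samples were stored at -80 degrees"]
        labels = [1, 0, 1, 0]  # 1 = describes a protein-protein interaction

        vec = TfidfVectorizer()
        X = vec.fit_transform(sentences).toarray()  # GPs need dense inputs
        gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
        gpc.fit(X, labels)

        test = vec.transform(["MDM2 interacts with TP53"]).toarray()
        print(gpc.predict_proba(test))  # probabilistic output; no margin to tune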

    A New Open Information Extraction System Using Sentence Difficulty Estimation

    The World Wide Web has a considerable amount of information expressed using natural language. While unstructured text is often difficult for machines to understand, Open Information Extraction (OIE) is a relation-independent extraction paradigm designed to extract assertions directly from massive and heterogeneous corpora. Keeping computational cost low is a main demand on Open Relation Extraction (ORE) systems. A large number of ORE methods have been proposed recently, covering a wide range of NLP tools, from "shallow" (e.g., part-of-speech tagging) to "deep" (e.g., semantic role labeling), and there is a trade-off between the depth of the NLP tools and the efficiency (computational cost) of the ORE system. This paper describes a novel approach called the Sentence Difficulty Estimator for Open Information Extraction (SDE-OIE) for automatically estimating relation extraction difficulty by training difficulty classifiers. These classifiers route each input sentence to an appropriate OIE extractor in order to decrease the overall computational cost. Our evaluations show that intelligently selecting the proper depth of ORE system significantly improves the effectiveness and scalability of SDE-OIE: it avoids wasting resources and achieves almost the same performance as its constituent deep extractor in a more reasonable time.
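    The routing idea reduces to a cheap classifier sitting in front of two extractors of different depths. The sketch below is purely conceptual: the difficulty features, gold labels, and placeholder extractors are hypothetical, not SDE-OIE's actual components.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy difficulty-labelled sentences (gold labels are invented)
        train_sents = [
            "Paris is the capital of France.",
            "Although the merger, announced in 2019, was contested, regulators "
            "who had reviewed it for months eventually approved the deal.",
            "Cats chase mice.",
            "The committee that the senator, despite objections, had convened "
            "rejected the proposal drafted earlier.",
        ]
        difficulty = ["easy", "hard", "easy", "hard"]

        router = make_pipeline(TfidfVectorizer(), LogisticRegression())
        router.fit(train_sents, difficulty)

        def shallow_extract(s):   # placeholder for a cheap, POS-pattern extractor
            return ("shallow", s)

        def deep_extract(s):      # placeholder for an expensive, SRL-based extractor
            return ("deep", s)

        def extract(sentence):
            # route to the cheapest extractor the sentence is predicted to need
            if router.predict([sentence])[0] == "easy":
                return shallow_extract(sentence)
            return deep_extract(sentence)

        print(extract("Berlin is the capital of Germany."))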