1,363 research outputs found

    Learning to Resolve Natural Language Ambiguities: A Unified Approach

    Full text link
    We analyze a few of the commonly used statistics based and machine learning algorithms for natural language disambiguation tasks and observe that they can be re-cast as learning linear separators in the feature space. Each of the methods makes a priori assumptions, which it employs, given the data, when searching for its hypothesis. Nevertheless, as we show, it searches a space that is as rich as the space of all linear separators. We use this to build an argument for a data driven approach which merely searches for a good linear separator in the feature space, without further assumptions on the domain or a specific problem. We present such an approach - a sparse network of linear separators, utilizing the Winnow learning algorithm - and show how to use it in a variety of ambiguity resolution problems. The learning approach presented is attribute-efficient and, therefore, appropriate for domains having very large number of attributes. In particular, we present an extensive experimental comparison of our approach with other methods on several well studied lexical disambiguation tasks such as context-sensitive spelling correction, prepositional phrase attachment and part of speech tagging. In all cases we show that our approach either outperforms other methods tried for these tasks or performs comparably to the best

    Distinguishing Word Senses in Untagged Text

    Full text link
    This paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text. The methods described in this paper, McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm, assign each instance of an ambiguous word to a known sense definition based solely on the values of automatically identifiable features in text. These methods and feature sets are found to be more successful in disambiguating nouns rather than adjectives or verbs. Overall, the most accurate of these procedures is McQuitty's similarity analysis in combination with a high dimensional feature set.Comment: 11 pages, latex, uses aclap.st

    Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation

    Get PDF
    BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSION: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM

    New kernel functions and learning methods for text and data mining

    Get PDF
    Recent advances in machine learning methods enable increasingly the automatic construction of various types of computer assisted methods that have been difficult or laborious to program by human experts. The tasks for which this kind of tools are needed arise in many areas, here especially in the fields of bioinformatics and natural language processing. The machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question. However, their learning performance can often be improved by taking advantage of deeper insight of the application domain or the learning problem at hand. This thesis considers developing kernel-based learning algorithms incorporating this kind of prior knowledge of the task in question in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In the context of kernel-based learning methods, the incorporation of prior knowledge is often done by designing appropriate kernel functions. Another well-known way is to develop cost functions that fit to the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take account of the positional information and the mutual similarities of words. It is shown that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suitable for the task of information retrieval and for more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of the kernel-based learning algorithms such as text categorization, and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions. We also design a fast cross-validation algorithm for regularized least-squares type of learning algorithm. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and novel advanced kernels and cost functions can be used in algorithms efficiently.Siirretty Doriast

    Effective Unsupervised Author Disambiguation with Relative Frequencies

    Full text link
    This work addresses the problem of author name homonymy in the Web of Science. Aiming for an efficient, simple and straightforward solution, we introduce a novel probabilistic similarity measure for author name disambiguation based on feature overlap. Using the researcher-ID available for a subset of the Web of Science, we evaluate the application of this measure in the context of agglomeratively clustering author mentions. We focus on a concise evaluation that shows clearly for which problem setups and at which time during the clustering process our approach works best. In contrast to most other works in this field, we are sceptical towards the performance of author name disambiguation methods in general and compare our approach to the trivial single-cluster baseline. Our results are presented separately for each correct clustering size as we can explain that, when treating all cases together, the trivial baseline and more sophisticated approaches are hardly distinguishable in terms of evaluation results. Our model shows state-of-the-art performance for all correct clustering sizes without any discriminative training and with tuning only one convergence parameter.Comment: Proceedings of JCDL 201

    Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

    Get PDF
    This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have grown in interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publically available. Furthermore, before developing a part-of-speech tagging system, a suitable tagset is required for that language. In this thesis, we make the following contributions to bridge this gap: Firstly, we propose a method for diacritic restoration based on naive Bayes classifiers that act at word-level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed. Secondly, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system in Māori, that of a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and was the result of in-depth analysis of the Māori grammar

    Predicting Scientific Success Based on Coauthorship Networks

    Full text link
    We address the question to what extent the success of scientific articles is due to social influence. Analyzing a data set of over 100000 publications from the field of Computer Science, we study how centrality in the coauthorship network differs between authors who have highly cited papers and those who do not. We further show that a machine learning classifier, based only on coauthorship network centrality measures at time of publication, is able to predict with high precision whether an article will be highly cited five years after publication. By this we provide quantitative insight into the social dimension of scientific publishing - challenging the perception of citations as an objective, socially unbiased measure of scientific success.Comment: 21 pages, 2 figures, incl. Supplementary Materia

    Cyber bullying identification and tackling using natural language processing techniques

    Get PDF
    Abstract. As offensive content has a detrimental influence on the internet and especially in social media, there has been much research identifying cyberbullying posts from social media datasets. Previous works on this topic have overlooked the problems for cyberbullying categories detection, impact of feature choice, negation handling, and dataset construction. Indeed, many natural language processing (NLP) tasks, including cyberbullying detection in texts, lack comprehensive manually labeled datasets limiting the application of powerful supervised machine learning algorithms, including neural networks. Equally, it is challenging to collect large scale data for a particular NLP project due to the inherent subjectivity of labeling task and man-made effort. For this purpose, this thesis attempts to contribute to these challenges by the following. We first collected and annotated a multi-category cyberbullying (10K) dataset from the social network platform (ask.fm). Besides, we have used another publicly available cyberbullying labeled dataset, ’Formspring’, for comparison purpose and ground truth establishment. We have devised a machine learning-based methodology that uses five distinct feature engineering and six different classifiers. The results showed that CNN classifier with Word-embedding features yielded a maximum performance amidst all state-of-art classifiers, with a detection accuracy of 93\% for AskFm and 92\% for FormSpring dataset. We have performed cyberbullying category detection, and CNN architecture still provide the best performance with 81\% accuracy and 78\% F1-score on average. Our second purpose was to handle the problem of lack of relevant cyberbullying instances in the training dataset through data augmentation. For this end, we developed an approach that makes use of wordsense disambiguation with WordNet-aided semantic expansion. The disambiguation and semantic expansion were intended to overcome several limitations of the social media (SM) posts/comments, such as unstructured content, limited semantic content, among others, while capturing equivalent instances induced by the wordsense disambiguation-based approach. We run several experiments and disambiguation/semantic expansion to estimate the impact of the classification performance using both original and the augmented datasets. Finally, we have compared the accuracy score for cyberbullying detection with some widely used classifiers before and after the development of datasets. The outcome supports the advantage of the data-augmentation strategy, which yielded 99\% of classifier accuracy, a 5\% improvement from the base score of 93\%. Our third goal related to negation handling was motivated by the intuitive impact of negation on cyberbullying statements and detection. Our proposed approach advocates a classification like technique by using NegEx and POS tagging that makes the use of a particular data design procedure for negation detection. Performances using the negation-handling approach and without negation handling are compared and discussed. The result showed a 95\% of accuracy for the negated handed dataset, which corresponds to an overall accuracy improvement of 2\% from the base score of 93\%. Our final goal was to develop a software tool using our machine learning models that will help to test our experiments and provide a real-life example of use case for both end-users and research communities. To achieve this objective, a python based web-application was developed and successfully tested
    • 

    corecore