20 research outputs found

    Reporting Statistical Validity and Model Complexity in Machine Learning based Computational Studies

    Background: Statistical validity and model complexity are both important for understanding computational models and assessing their correctness. However, information about them is often missing from publications applying machine learning. Aim: The aim of this study is to show the importance of providing details that can indicate the statistical validity and complexity of models in publications. This is explored in the context of citation screening automation using machine learning techniques. Method: We built 15 Support Vector Machine (SVM) models, each developed using word2vec (average word) features and data for 15 review topics from the Drug Evaluation Review Program (DERP) of the Agency for Healthcare Research and Quality (AHRQ). Results: The word2vec features were found to be sufficiently linearly separable by the SVM, and consequently we used linear kernels. In 11 of the 15 models, the negative (majority) class used over 80% of its training data as support vectors (SVs), and the positive class used approximately 45% of its training data. Conclusions: In this context, exploring the SVs revealed that the models are overly complex against ideal expectations of not more than 2%-5% (and preferably much less) of the training vectors

    Exploring multinomial naïve Bayes for Yorùbá text document classification

    The recent increase in Nigerian-language text online motivates this paper, which considers the problem of classifying text documents written in the Yorùbá language into one of a few pre-designated classes. Text document classification/categorization research is well established for English and many other languages; this is not so for Nigerian languages. This paper evaluated the performance of a multinomial naïve Bayes (NB) model learned on a research dataset consisting of 100 samples of text each from the business, sporting, entertainment, technology and political domains, separately on unigram, bigram and trigram features obtained using the bag-of-words (BoW) representation approach. Results show that the performance of the model on unigram and bigram features is comparable, and significantly better than that of a model learned on trigram features. The results generally indicate a possibility for the practical application of the NB algorithm to the classification of text documents written in the Yorùbá language. Keywords: Supervised learning, text classification, Yorùbá language, text mining, BoW Representatio
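The setup described in this abstract (a multinomial naïve Bayes model trained separately on unigram and bigram bag-of-words features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `TinyMultinomialNB`, the Laplace smoothing parameter `alpha`, the choice to skip unseen n-grams at prediction time, and the English toy documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class TinyMultinomialNB:
    """Hypothetical minimal multinomial naive Bayes over word n-gram counts."""

    def __init__(self, n=1, alpha=1.0):
        self.n = n          # n-gram order: 1 = unigram, 2 = bigram, ...
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            feats = ngrams(doc, self.n)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(count / total_docs)
            total = sum(self.feature_counts[label].values())
            for feat in ngrams(doc, self.n):
                if feat not in self.vocab:
                    continue  # n-grams never seen in training are ignored
                numerator = self.feature_counts[label][feat] + self.alpha
                score += math.log(numerator / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy English data, purely illustrative (not the paper's Yorùbá dataset).
docs = [
    "the team won the match",
    "the striker scored a goal",
    "the bank raised interest rates",
    "the market closed higher today",
]
labels = ["sports", "sports", "business", "business"]

unigram_model = TinyMultinomialNB(n=1).fit(docs, labels)
bigram_model = TinyMultinomialNB(n=2).fit(docs, labels)
print(unigram_model.predict("the striker won"))
print(bigram_model.predict("the bank raised rates"))
```

Higher-order n-grams are sparser, so with a small corpus fewer test n-grams match the training vocabulary; that sparsity is one plausible reason a trigram model could lag behind unigram and bigram models on a dataset of this size.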

    Delineating Knowledge Domains in Scientific Domains in Scientific Literature using Machine Learning (ML)

    Recent years have witnessed an upsurge in the number of published documents. Organizations are showing an increased interest in text classification for effective use of the information. Manual procedures for text classification can be fruitful for a handful of documents, but they lose credibility as the number of documents increases, besides being laborious and time-consuming. Text mining techniques facilitate assigning text strings to categories, rendering the process of classification fast, accurate, and hence reliable. This paper classifies chemistry documents using machine learning and statistical methods. The procedure of text classification is described in chronological order: data preparation, followed by processing, transformation, and application of classification techniques, culminating in the validation of the results.

    Using Semi-supervised Learning for the Creation of Medical Systematic Review: An exploratory Analysis

    In this research, we explore semi-supervised learning based classifiers to identify articles that can be included when creating medical systematic reviews (SRs). Specifically, we perform a comparative study of various semi-supervised learning algorithms and identify the technique best suited to SR creation. We also aim to determine whether semi-supervised learning techniques with few labeled samples produce meaningful work savings for SR creation. Through an empirical study, we demonstrate that semi-supervised classifiers are viable for selecting articles for systematic reviews in situations where only a few training samples are available.

    Active Learning for the Automation of Medical Systematic Review Creation

    While systematic reviews (SRs) are positioned as an essential element of modern evidence-based medical practice, the creation of these reviews is resource intensive. To mitigate this problem, there have been some attempts to leverage supervised machine learning to automate the article triage procedure. This approach has proved helpful for updating existing SRs. However, it holds very little promise for creating new SRs because training data is rarely available at SR creation time. In this research we propose an active machine learning approach to overcome this labeling bottleneck and develop a classifier for supporting the creation of systematic reviews. The results indicate that active learning based sample selection could significantly reduce the human effort and is a viable technique for automating medical systematic review creation with very little training data.

    Managing Civility in News and Information Organizations

    Purpose: Explore media managers’ perspectives on covering issues of conflict, as well as analyze news coverage of issues related to civility in the Black Lives Matter conflict. Methodology: Summary translations were conducted on panel discussions of civility in media to identify issues of importance to media managers. A computational content analysis of newspaper articles on conflict was conducted to identify key words and phrases used in actual coverage. Findings/Contribution: Media managers are concerned with bias in news reporting and covering differing viewpoints especially in stories of conflict. Four themes emerged: Values, Practices, Sectors, Story Topics and Legal Considerations. In Black Lives Matter coverage, reporting themes related to violence prevailed leaving open more in-depth coverage of issues related to values, sectors and legal concerns

    A Taxonomy of Text Mining

    With a rapid increase in the volume of textual data on the Internet, extracting useful information through innovative text mining techniques has become crucial. In this context, terminology jargon in the text-mining literature creates ambiguity and has made it very difficult for researchers to focus on a specific direction and bring innovation. For example, review mining and opinion mining may have different applications; however, from a technical perspective, they are very similar. In this paper, we propose a classification of text mining terminologies from the perspectives of technical and text-mining processes. The classification is based on a comprehensive literature survey and analysis. This research study presents a clear classification of text mining terminologies based on technical and text mining processes to resolve the issue of terminology jargon. By utilizing the proposed classification, researchers will be able to easily choose a specific direction instead of diverging amongst similar research problems, thereby driving innovation. Further, the proposed classification will help advance and improve the overall research progress in all text-mining related fields.

    Automatic Complaint Classification System Using Classifier Ensembles

    Sambat Online is an online complaint system run by the city government of Malang, Indonesia. Because most citizens do not know to which work units (Satuan Kerja Pemerintah Daerah [SKPDs]) their complaints should be sent, the system administrator must manually sort and classify all of the incoming complaints with respect to the appropriate SKPDs. This study empirically evaluated the application of an automated system to replace the manual classification process. The experiments, which used Sambat Online data, involved five individual classification algorithms (Naïve Bayes, Maximum Entropy, K-Nearest Neighbors, Random Forest, and Support Vector Machines) and two ensemble strategies (hard voting and soft voting). The results show that, of the five individual classifiers, the Multinomial Naïve Bayes classifier achieved the best performance, with an accuracy of 80.7%. The results also indicate that the ensemble methods generally performed better than the individual classifiers; almost all of them had the same accuracy level of 81.2%. In addition, the soft voting strategy had slightly higher accuracy than the hard one when all five classifiers were used. However, when the three best classifier combinations were used, both had the same level of accuracy.
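The difference between the two ensemble strategies named in this abstract can be illustrated with a small sketch. It assumes the common definitions (hard voting takes the majority of predicted labels, soft voting averages predicted class probabilities); the function names, the tie-breaking rule, the class labels, and the probability values are hypothetical and are not taken from the study.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label; ties go to the label that appears first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def soft_vote(probabilities):
    """Average per-class probabilities across classifiers and take the argmax.

    `probabilities` is a list of dicts, one per classifier, each mapping
    every class label to that classifier's predicted probability.
    """
    classes = list(probabilities[0].keys())
    n = len(probabilities)
    averaged = {c: sum(p[c] for p in probabilities) / n for c in classes}
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of three classifiers for a single complaint.
probs = [
    {"roads": 0.55, "health": 0.45},
    {"roads": 0.60, "health": 0.40},
    {"roads": 0.10, "health": 0.90},
]
# Each classifier's hard prediction is its most probable class.
hard_labels = [max(p, key=p.get) for p in probs]

print(hard_vote(hard_labels))   # majority of labels
print(soft_vote(probs)[0])      # argmax of averaged probabilities
```

In this made-up case two weakly confident classifiers outvote one highly confident classifier under hard voting, while soft voting lets the confident one prevail, which is one way the two strategies can disagree.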

    Culture and Conflict: The Framing of News in Three National U.S. Newspapers

    Overview: This research addressed how corporate political leanings of media organizations impacted journalistic coverage of issues of conflict and culture. Purpose: The purpose of this study was to identify how national newspapers with different editorial stances framed protest news coverage of the cultural issue of Black Lives Matter in order to attract audiences and differentiate their products. Journalists are influenced not only by what they see and hear at the scene of a news story but by the work practices and management decisions of their news organizations and parent companies. Methodology: Three national newspapers were chosen for analysis. Computational and manual content analyses of news stories were conducted to identify differences in word usage, story bias, and source usage. Newspaper stories on Black Lives Matter were collected at the height of coverage in Spring 2020 following the death of George Floyd and again in Spring 2021 surrounding the trial of Derek Chauvin, the police officer held responsible for the death. This timeframe provided an opportunity to measure differences in institutional and journalistic content decision-making in news stories during the heat of cultural exchanges. Findings: Analysis of newspaper coverage of the cultural movement indicated that coverage differed among the newspapers: the liberal-leaning newspaper was more likely to engage in sensational coverage, while the conservative newspaper engaged in more contextual coverage.