914 research outputs found

A New Pairwise Ensemble Approach for Text Classification

Author: L. Breiman
M.I. Jordan
R. Schapire
T. Hastie
T. Joachims
T.G. Dietterich
Y. Yang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2003
Convolutional Neural Networks for Survey Response Classification

Author: Oberdorf Felix
Pirner Jonas
Stein Nikolai
Publication venue: AIS Electronic Library (AISeL)
Publication date: 10/08/2020
Artificial Intelligence offers great potential for enterprises, e.g., through intelligent services. However, small and medium enterprises struggle with Artificial Intelligence due to limited resources. Tasks such as survey response classification, in particular, have not yet been investigated. We address this research gap by means of a data science study. Specifically, we compare several baseline classification pipelines leveraging logistic regression, random forests, and linear support vector machines against wide-headed CNN architectures with one-hot encoding or character embedding inputs. We find that the SVM model outperforms all other evaluated models in the setting at hand. In addition, we analyze the different predictions of the models and show typical prediction errors by means of a chord diagram of commonly misclassified brands.

AIS Electronic Library (AISeL)

Automated classification of web contents in B2B marketing

Author: Guragain Nischal
Publication date: 06/08/2021
Recent growth in digitization has affected how customers seek the information they need to make a purchase decision. The trend of customers basing their purchase decisions on information they collect online is increasing. To accommodate this change in purchase behavior, companies tend to share as much information as possible about themselves and their products online, which in turn drives the amount of unstructured data produced. To get value from this huge amount of data, the unstructured data needs to be processed before being used in digital marketing applications. For companies serving business to customers (B2C), plenty of research exists on how digital content could be used for marketing, but for companies serving business to business (B2B) a huge research gap persists. B2C marketing and B2B marketing might share some analytical concepts, but they are different domains. Not much research has been done on using machine learning in B2B digital marketing. Although several methods have been proposed and used to classify unstructured text data in marketing and other domains, the lack of labeled text data from the B2B domain makes it challenging for researchers to experiment with text classification models. This thesis reviews previous work on text classification, both in general and in the marketing domain, and compares those methods on the dataset available for this research. Text classification methods such as Random Forest, Linear SVM, KNN, Multinomial Naïve Bayes, and Multinomial Logistic Regression dominate the research field, hence these methods are tested in this research. Surprisingly, on the dataset used, the Random Forest classifier performed best, with an average accuracy of 0.85 in the designed five-class classification task.

UTUPub

Empirical Analysis and Automated Classification of Security Bug Reports

Author: Tyo Jacob P.
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2016
With the ever-expanding amount of sensitive data being placed into computer systems, the need for effective cybersecurity is of utmost importance. However, there is a shortage of detailed empirical studies of security vulnerabilities from which cybersecurity metrics and best practices could be determined. This thesis has two main research goals: (1) to explore the distribution and characteristics of security vulnerabilities based on the information provided in bug tracking systems and (2) to develop data analytics approaches for automatic classification of bug reports as security or non-security related. This work uses three NASA datasets as case studies. The empirical analysis showed that the majority of software vulnerabilities belong to only a small number of types. Addressing these types of vulnerabilities will consequently lead to cost-efficient improvement of software security. Since this analysis requires labeling each bug report in the bug tracking system, we explored using machine learning to automate the classification of each bug report as security or non-security related (two-class classification), as well as of each security-related bug report as a specific security type (multiclass classification). In addition to using supervised machine learning algorithms, a novel unsupervised machine learning approach is proposed. Of the machine learning algorithms tested, Naive Bayes was the most consistent, well-performing classifier across all datasets. The novel unsupervised approach did not perform as well as the supervised methods but still performed well, reaching a G-Score of 0.715 in its best case, whereas the supervised approach achieved a G-Score of 0.903 in its best case.

The Research Repository @ WVU (West Virginia University)

Reusing clinical data to improve health care: Challenge accepted!

Author: Wiegersma Sytske
Publication venue: University of Twente
Publication date: 17/11/2022
University of Twente Research Information

Natural language content evaluation system for multiclass detection of hate speech in tweets using transformers

Author: Marrugo-Tobón Duván Andres
Martinez-Santos Juan Carlos
Puertas Edwin
Publication venue: Maestría en Ingeniería
Publication date: 05/12/2023
In natural language processing, accurate categorization of tweets, including detecting hate speech, plays a pivotal role in efficient information organization and analysis. This paper presents a Natural Language Contents Evaluation System specifically tailored for multi-class tweet categorization, focusing on hate speech detection. Our system enhances classification accuracy and efficiency by harnessing Transformers, namely BERT and DistilBERT. By leveraging feature extraction techniques, we capture pertinent information from tweets, enabling practical analysis, categorization, and identification of hate speech instances. During training, we also tackle imbalanced corpora by employing techniques to ensure fair representation of different tweet categories, including hate speech. After extensive training, our system achieves a training accuracy of 95%, showcasing Transformers' effectiveness in comprehending and categorizing tweets, including identifying hate speech. Furthermore, our system maintains an accuracy of 83% during testing, highlighting the robustness and generalizability of the trained models for hate speech detection. This system contributes to advancing automated tweet categorization, specifically in hate speech detection, providing a reliable and efficient solution for organizing and analyzing diverse tweet datasets.

Universidad Tecnología de Bolíva

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Enhanced ontology-based text classification algorithm for structurally organized documents

Author: Oleiwi Suha Sahib
Publication date: 01/01/2015
Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to a set of tags given in advance. Most TC algorithms represent a document by its terms without considering the relations among those terms. These algorithms represent documents in a space where every word is assumed to be a dimension. As a result, such representations generate high dimensionality, which has a negative effect on classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating a suitable feature vector and reducing the dimension of the data, which will enhance classification accuracy. This research combines ontology and text representation for classification by developing five algorithms. The first and second algorithms, namely Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent the document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify the document into its related set of classes. The proposed algorithms were tested on five different scientific paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed algorithms, CFV_TC and SFV_TC, showed better average results in terms of precision, recall, f-measure, and accuracy compared against SVM and RSS approaches. The work in this study contributes to exploring related documents in information retrieval and text mining research by using ontology in TC.

Universiti Utara Malaysia: UUM eTheses

Journalistic image access: description, categorization and searching

Author: Westman Stina
Publication venue: Aalto University, School of Arts, Design and Architecture, Department of Arts
Publication date: 01/01/2011
The quantity of digital imagery continues to grow, creating a pressing need to develop efficient methods for organizing and retrieving images. Knowledge on user behavior in image description and search is required for creating effective and satisfying searching experiences. The nature of visual information and journalistic images creates challenges in representing and matching images with user needs. The goal of this dissertation was to understand the processes in journalistic image access (description, categorization, and searching), and the effects of contextual factors on preferred access points. These were studied using multiple data collection and analysis methods across several studies. Image attributes used to describe journalistic imagery were analyzed based on description tasks and compared to a typology developed through a meta-analysis of literature on image attributes. Journalistic image search processes and query types were analyzed through a field study and multimodal image retrieval experiment. Image categorization was studied via sorting experiments leading to a categorization model. Advances to research methods concerning search tasks and categorization procedures were implemented. Contextual effects on image access were found related to organizational contexts, work, and search tasks, as well as publication context. Image retrieval in a journalistic work context was contextual at the level of image needs and search process. While text queries, together with browsing, remained the key access mode to journalistic imagery, participants also used visual access modes in the experiment, constructing multimodal queries. Assigned search task type and searcher expertise had an effect on query modes utilized. Journalistic images were mostly described and queried for on the semantic level but also syntactic attributes were used. Constraining the description led to more abstract descriptions. Image similarity was evaluated mainly based on generic semantics. However, functionally oriented categories were also constructed, especially by domain experts. Availability of page context promoted thematic rather than object-based categorization. The findings increase our understanding of user behavior in image description, categorization, and searching, as well as have implications for future solutions in journalistic image access. The contexts of image production, use, and search merit more interest in research as these could be leveraged for supporting annotation and retrieval. Multiple access points should be created for journalistic images based on image content and function. Support for multimodal query formulation should also be offered. The contributions of this dissertation may be used to create evaluation criteria for journalistic image access systems.

Aaltodoc Publication Archive