A Comparison of Algorithms for Text Classification of Albanian News Articles
Text classification is an essential task in text mining and information retrieval. Many algorithms have been developed to classify computational data, and most of them have been extended to classify textual data. We used several of these algorithms to train classifiers on part of our crawled Albanian news articles and to classify the remaining part with the trained classifiers. The categories used are: latest news, economy, sport, showbiz, technology, culture, and world. First, we remove all stop words from the collected articles; the output of this step is a separate text file for each category. All these files are then split into sentences, and each sentence is assigned the appropriate category. The sentences are then projected into a single list of sentence/category tuples. This list is shuffled to randomize the sequence of categories and is then used to train (80% of the list) and to test (the remaining 20%) different classifiers. We measured the accuracy of each classifier separately and also analysed the training and testing times.
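The sentence-level pipeline this abstract describes (stop-word removal, sentence splitting, shuffling, an 80/20 split, then training a classifier) can be sketched roughly as follows. The Albanian articles and the stop-word list below are toy stand-ins, and scikit-learn's MultinomialNB serves as one example classifier; the paper's actual corpus and classifier set are not reproduced here.

```python
# Minimal sketch of the described pipeline (hypothetical data; the paper's
# Albanian stop-word list and crawled articles are not reproduced here).
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

STOP_WORDS = {"dhe", "ne", "me"}  # placeholder Albanian stop words

articles = {  # category -> raw article text (toy stand-ins)
    "sport": "Skuadra fitoi ndeshjen. Trajneri ishte i lumtur.",
    "economy": "Banka uli normat. Tregu u rrit dje.",
}

pairs = []  # single list of (sentence, category) tuples
for category, text in articles.items():
    cleaned = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    for sentence in cleaned.split(". "):
        if sentence:
            pairs.append((sentence.rstrip("."), category))

random.seed(0)
random.shuffle(pairs)  # randomize the sequence of categories
split = int(0.8 * len(pairs))
train, test = pairs[:split], pairs[split:]

vec = CountVectorizer()
X_train = vec.fit_transform(s for s, _ in train)
clf = MultinomialNB().fit(X_train, [c for _, c in train])
preds = clf.predict(vec.transform(s for s, _ in test))
print(accuracy_score([c for _, c in test], preds))
```

With a real corpus, each category's text file would feed thousands of sentences into `pairs` before the shuffle and split.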
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
CABLE NEWS NETWORK (CNN) ARTICLES CLASSIFICATION USING RANDOM FOREST ALGORITHM WITH HYPERPARAMETER OPTIMIZATION
News articles on the internet grow quickly and in large volumes, so they need to be grouped into several categories for easy access. One method for grouping news articles is classification. One classification method is the random forest, which is built on decision trees. This research discusses the application of a random forest to classify news articles into six categories: business, entertainment, health, politics, sport, and news. The data are Cable News Network (CNN) articles from 2011 to 2022. The data are textual and large in volume, so careful handling is needed to avoid overfitting and underfitting. A random forest is well suited to these data because the algorithm works very well on large datasets. However, a random forest is difficult to interpret and tune if the combination of parameters is not appropriate for the data. Therefore, hyperparameter optimization is needed to discover the best combination of parameters for the random forest. This research uses the search cross-validation (SearchCV) method to optimize the random forest's hyperparameters by testing parameter combinations one by one and validating each. We obtain a classification of news articles into six categories with an accuracy of 0.81 on training and 0.76 on testing.
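The SearchCV step the abstract describes, trying parameter combinations one by one and cross-validating each, can be sketched with scikit-learn's `GridSearchCV` as one SearchCV variant. The corpus and parameter grid below are illustrative stand-ins, not the paper's CNN dataset or tuned grid.

```python
# Hedged sketch: random-forest text classification with grid-searched
# hyperparameters (toy corpus and grid; not the paper's actual setup).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

texts = ["stocks rise today", "markets fall sharply", "team wins match",
         "player scores goal", "new vaccine trial", "hospital study results",
         "election vote count", "senate passes bill"]
labels = ["business", "business", "sport", "sport",
          "health", "health", "politics", "politics"]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("rf", RandomForestClassifier(random_state=0))])
grid = {"rf__n_estimators": [50, 100],   # illustrative parameter grid
        "rf__max_depth": [None, 10]}
search = GridSearchCV(pipe, grid, cv=2)  # test each combination, 2-fold validated
search.fit(texts, labels)
print(search.best_params_)
```

On a real corpus the grid would cover more forest parameters (e.g. `max_features`, `min_samples_split`) and more folds, at a correspondingly higher search cost.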
Machine Learning-based Automatic Annotation and Detection of COVID-19 Fake News
COVID-19 impacted every part of the world, although the misinformation about
the outbreak traveled faster than the virus. Misinformation spread through
online social networks (OSN) often misled people from following correct medical
practices. In particular, OSN bots have been a primary source of disseminating
false information and initiating cyber propaganda. Existing work neglects the
presence of bots that act as a catalyst in the spread and focuses on fake news
detection in 'articles shared in posts' rather than the post (textual) content.
Most work on misinformation detection uses manually labeled datasets that are
hard to scale for building their predictive models. In this research, we
overcome this challenge of data scarcity by proposing an automated approach for
labeling data using verified fact-checked statements on a Twitter dataset. In
addition, we combine textual features with user-level features (such as
followers count and friends count) and tweet-level features (such as number of
mentions, hashtags, and URLs in a tweet) to act as additional indicators to
detect misinformation. Moreover, we analyzed the presence of bots in tweets and
show that bots change their behavior over time and are most active during the
misinformation campaign. We collected 10.22 million COVID-19 related tweets and
used our annotation model to build an extensive and original ground truth
dataset for classification purposes. We utilize various machine learning models
to detect misinformation accurately, and our best classification model achieves
82% precision, 96% recall, and a 3.58% false-positive rate. Also, our bot
analysis indicates that bots generated approximately 10% of misinformation
tweets. Our methodology results in substantial exposure of false information,
thus improving the trustworthiness of information disseminated through social
media platforms.
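The feature-combination step described above, joining textual features with user-level and tweet-level features, can be sketched with scikit-learn's `ColumnTransformer`. The column names, labels, and classifier below are hypothetical; the paper's 10.22 million-tweet dataset and its exact model are not reproduced.

```python
# Illustrative sketch: fuse TF-IDF text features with numeric user/tweet
# features in one model (toy data; column names are hypothetical).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tweets = pd.DataFrame({
    "text": ["5g causes covid", "wash your hands often",
             "vaccine microchip claim", "wear a mask indoors"],
    "followers_count": [10, 5000, 25, 12000],   # user-level feature
    "num_hashtags": [4, 0, 5, 1],               # tweet-level feature
    "label": [1, 0, 1, 0],                      # 1 = misinformation (toy labels)
})

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),                          # textual features
    ("meta", "passthrough", ["followers_count", "num_hashtags"]),  # metadata features
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(tweets, tweets["label"])
print(model.predict(tweets))
```

In practice the numeric columns would be scaled rather than passed through raw, and the classifier swapped for whichever model performs best.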
Exploring Text Mining and Analytics for Applications in Public Security: An in-depth dive into a systematic literature review
Text mining and related analytics emerge as a technological approach to support human activities in extracting useful knowledge from texts in several formats. From a managerial point of view, it can help organizations in planning and decision-making processes, providing information that was not previously evident in textual materials produced internally or even externally. In this context, within the public/governmental scope, public security agencies are great beneficiaries of the tools associated with text mining, in several respects, from applications in the criminal area to the collection of people's opinions and sentiments about the actions taken to promote their welfare. This article reports details of a systematic literature review focused on identifying the main areas of text mining application in public security, the most recurrent technological tools, and future research directions. The searches covered four major article databases (Scopus, Web of Science, IEEE Xplore, and ACM Digital Library), selecting 194 materials published between 2014 and the first half of 2021, among journals, conferences, and book chapters. There were several findings concerning the targets of the literature review, as presented in the results of this article.
Multi-Modal Deep Learning for Credit Rating Prediction Using Text and Numerical Data Streams
Knowing which factors are significant in credit rating assignment leads to
better decision-making. However, the focus of the literature thus far has been
mostly on structured data, and fewer studies have addressed unstructured or
multi-modal datasets. In this paper, we present an analysis of the most
effective architectures for the fusion of deep learning models for the
prediction of company credit rating classes, by using structured and
unstructured datasets of different types. In these models, we tested different
combinations of fusion strategies with different deep learning models,
including CNN, LSTM, GRU, and BERT. We studied data fusion strategies in terms
of level (including early and intermediate fusion) and techniques (including
concatenation and cross-attention). Our results show that a CNN-based
multi-modal model with two fusion strategies outperformed other multi-modal
techniques. In addition, by comparing simple architectures with more complex
ones, we found that more sophisticated deep learning models do not necessarily
produce the highest performance; however, if attention-based models are
producing the best results, cross-attention is necessary as a fusion strategy.
Finally, our comparison of rating agencies on short-, medium-, and long-term
performance shows that Moody's credit ratings outperform those of other
agencies like Standard &amp; Poor's and Fitch Ratings.
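The distinction between the fusion levels studied above (early versus intermediate) can be illustrated in a few lines of NumPy. The dimensions and random weights are arbitrary, and simple linear maps stand in for the paper's CNN/LSTM/GRU/BERT encoders; this is a shape-level sketch, not the authors' architecture.

```python
# Toy contrast of early vs. intermediate fusion of a text embedding with
# numerical features (arbitrary dimensions; linear maps stand in for encoders).
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)   # stand-in for a text-encoder output
num_feat = rng.normal(size=4)   # stand-in for structured numerical features

# Early fusion: concatenate raw modalities first, then encode jointly.
W_early = rng.normal(size=(5, 12))
early = np.tanh(W_early @ np.concatenate([text_emb, num_feat]))

# Intermediate fusion: encode each modality separately, then concatenate
# the intermediate representations for a downstream head.
W_text, W_num = rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
intermediate = np.concatenate([np.tanh(W_text @ text_emb),
                               np.tanh(W_num @ num_feat)])

print(early.shape, intermediate.shape)
```

Cross-attention, the other fusion technique compared in the paper, would replace the plain concatenation with attention between the two modality representations.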
A Review on Computer Aided Diagnosis of Acute Brain Stroke.
Amongst the most common causes of death globally, stroke is one of the top three, affecting over 100 million people worldwide annually. There are two classes of stroke, namely ischemic stroke (due to impairment of blood supply, accounting for ~70% of all strokes) and hemorrhagic stroke (due to bleeding), both of which can result, if untreated, in permanently damaged brain tissue. The discovery that the affected brain tissue (i.e., the 'ischemic penumbra') can be salvaged from permanent damage, together with the burgeoning growth in computer aided diagnosis, has led to major advances in stroke management. Abiding by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we surveyed a total of 177 research papers published between 2010 and 2021 to highlight the current status of, and challenges faced by, computer aided diagnosis (CAD), machine learning (ML), and deep learning (DL) based techniques for CT and MRI as the prime modalities for stroke detection and lesion region segmentation. This work concludes by showcasing the current requirements of this domain, the preferred modalities, and prospective research areas.
A multi-channel cross-residual deep learning framework for news-oriented stock movement prediction
Stock market movement prediction remains challenging due to
random walk characteristics. Yet through a potent blend of input
parameters, a prediction model can learn sequential features more
intelligently. In this paper, a multi-channel news-oriented prediction
system is developed to capture intricate moving patterns of
the stock market index. Specifically, the system adopts the temporal
causal convolution to process historical index values due to
its capability in learning long-term dependencies. Concurrently, it
employs the Transformer Encoder for qualitative information
extraction from financial news headlines and corresponding preview
texts. A notable configuration to our multi-channel system is
an integration of cross-residual learning between different channels,
thereby allowing an earlier and closer information fusion. The
proposed architecture is validated to be more efficient in trend
forecasting compared to independent learning, by which channels
are trained separately. Furthermore, we also demonstrate the
effectiveness of involving news content previews, improving the
prediction accuracy by as much as 3.39%.
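The cross-residual integration described above, where information flows between channels before the final fusion, can be sketched at the tensor level in NumPy. The shapes and weights are arbitrary, and single linear layers stand in for the temporal-convolution and Transformer-encoder channels; this is a conceptual sketch, not the paper's network.

```python
# Minimal sketch of a cross-residual step between two channels: each channel's
# next layer receives a residual from the *other* channel, fusing information
# earlier than a late concatenation would (illustrative shapes only).
import numpy as np

rng = np.random.default_rng(1)
price_ch = rng.normal(size=16)  # stand-in for the historical-index channel
news_ch = rng.normal(size=16)   # stand-in for the news-headline channel

W_p, W_n = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
price_next = np.tanh(W_p @ price_ch) + news_ch  # cross-residual from news
news_next = np.tanh(W_n @ news_ch) + price_ch   # cross-residual from prices

print(price_next.shape, news_next.shape)
```

Independent learning, the baseline the paper compares against, would drop the cross terms and only merge the channels at the output.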
Review Paper on Enhancing COVID-19 Fake News Detection With Transformer Model
The growing propagation of disinformation about the COVID-19 epidemic calls for powerful fake news detection technologies. This review provides an in-depth examination of existing techniques, including traditional machine learning methods such as Random Forest and Naive Bayes; deep learning models such as Bi-GRU, CNN, LSTM, and RNN; and transformer-based architectures such as BERT and XLM-RoBERTa. One noticeable development is the merging of traditional algorithms with sophisticated transformers, reflecting the quest for improved accuracy and flexibility. However, important research gaps have been identified. There has been little research on cross-lingual detection algorithms, revealing a substantial gap in multilingual fake news detection, which is critical in the global context of COVID-19 information spread. The research also emphasizes the need for flexible methodologies, in particular appropriate preprocessing strategies for various content types. Furthermore, the lack of common assessment measures is a barrier, underlining the need for unified frameworks for successfully benchmarking and comparing models. This analysis sheds light on the changing COVID-19 fake news detection environment, emphasizing the need for novel, adaptive, and internationally relevant approaches to address the ubiquitous dissemination of disinformation during the current pandemic.
Network problems detection and classification by analyzing syslog data
Network troubleshooting is an important process with a wide research field. The first step in troubleshooting procedures is to collect information in order to diagnose the problems. Syslog messages, which are sent by almost all network devices, contain a massive amount of data related to network problems. Many previous studies have analyzed syslog data as a guideline for identifying network problems and their causes. Detecting network problems can be more efficient if the detected problems are classified in terms of network layers. Classifying syslog data requires identifying the syslog messages that describe the network problems for each layer, taking into account the different syslog formats of various vendors' devices. This study provides a method to classify syslog messages that indicate network problems in terms of network layers. The method used a data mining tool to classify the syslog messages, with the description part of each syslog message used for the classification process. Related syslog messages were identified, and features were then selected to train the classifiers. Six classification algorithms were evaluated: LibSVM, SMO, KNN, Naïve Bayes, J48, and Random Forest. A real dataset obtained from Universiti Utara Malaysia's (UUM) network devices was used for the prediction stage. Results indicate that SVM shows the best performance during both the training and prediction stages. This study contributes to the field of network troubleshooting and to the field of text data classification.
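The classification step described above, vectorizing the description part of each syslog message and training an SVM to assign a network-layer label, can be sketched as follows. The sample messages and layer labels are illustrative, not UUM's dataset, and scikit-learn's `LinearSVC` stands in for the LibSVM/SMO implementations used in the study.

```python
# Hedged sketch: classify syslog messages by network layer using the free-text
# description part and an SVM (toy messages and labels; not the real dataset).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

messages = [
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan10, changed state to up",
    "%OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on Gi0/0 from FULL to DOWN",
    "%BGP-5-ADJCHANGE: neighbor 192.0.2.1 Down BGP Notification sent",
]
layers = ["physical", "data-link", "network", "network"]  # toy layer labels

# Use only the description part after the facility/mnemonic prefix,
# as the study does.
descriptions = [m.split(": ", 1)[1] for m in messages]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(descriptions), layers)
new_msg = "Line protocol on Interface Gi0/2, changed state to down"
print(clf.predict(vec.transform([new_msg])))
```

Handling multiple vendors would mainly change the prefix-stripping step, since each vendor formats the header differently while the description stays free text.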