
    Machine learning NLP-based recommendation system on production issues

    Natural Language Processing (NLP) techniques such as information extraction are increasingly popular in media, e-commerce, and online games. However, such techniques have yet to be established for production quality control in the manufacturing industry. The goal of this research is to build a recommendation system based on production issue descriptions in textual form. The data were extracted from a manufacturing control system in which issue reports have been collected in Finnish on a reasonably large scale for years. Five different NLP methods (TF-IDF, Word2Vec, spaCy, Sentence Transformers and SBERT) are used for modelling, converting human-written digital texts into numerical feature vectors. The most relevant issue cases can be retrieved by calculating the cosine distance between the query sentence vector and the corpus embedding matrix that represents the whole dataset. The TurkuNLP-based Sentence Transformer achieves the best result, with a Mean Average Precision@10 of 0.67, indicating that the initial dataset is large enough for deep learning algorithms to compete with classical machine learning methods. Although a categorical variable was chosen as the target for computing evaluation metrics, this research is not a classification problem with a single target variable for model training. Additionally, the selected metric measures performance for every issue case, so it is not necessary to balance or split the dataset. This work achieves relatively good results with less data than is typically used in other business domains. The recommendation system can be further optimized by feeding in more data and implementing online testing. It could also be transformed into collaborative filtering, finding patterns among users instead of focusing only on items, provided comprehensive user information is included.
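    To make the retrieval step concrete, the following is a minimal sketch of embedding-based lookup with the sentence-transformers library: issue descriptions are encoded into an embedding matrix and ranked against a query by cosine similarity. The model name and the toy issue texts are illustrative assumptions, not the Finnish data or the exact TurkuNLP model used in the paper.

    from sentence_transformers import SentenceTransformer, util

    # Any sentence-embedding model can stand in here; the paper instead uses a
    # Finnish TurkuNLP-based Sentence Transformer.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Toy corpus of production issue descriptions (English placeholders)
    corpus = [
        "Hydraulic press leaking oil at station 3",
        "Conveyor belt misalignment causing jams",
        "Paint booth temperature out of tolerance",
    ]
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

    query = "Oil leak detected on the press line"
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Rank the corpus by cosine similarity to the query and keep the top 10
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=10)[0]
    for hit in hits:
        print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")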

    USING DEEP LEARNING AND LINGUISTIC ANALYSIS TO PREDICT FAKE NEWS WITHIN TEXT

    The spread of information about current events is how people around the world learn about and understand what is happening. In essence, news is an important and powerful tool that various groups of people can use to spread awareness and facts for the good of mankind. However, as information becomes easily and readily available for public access, the rise of deceptive news becomes an increasing concern, because it misleads people and can affect their livelihoods or those of others. The term coined for such deliberately false information is fake news. Mitigating this issue is of the utmost importance, so this study examines technological techniques being used to prevent the spread of dishonest and propagandized information. Since there is an abundance of websites and articles for internet users to read, automated technology is the only practical option for dealing with fake news. The techniques used in this study were based on linguistic analysis and deep learning. The end objective was to create a classifier able to judge an article based on the amount of fake news within it. Experiments were performed on these classifiers to test whether applied linguistic analysis improves their accuracy. The results showed that applied linguistic analysis did not have a substantial impact, whereas deep learning and dataset improvements did.
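    As a rough illustration of the setup evaluated above, the sketch below combines a learned text representation with a couple of hand-crafted linguistic features in a single binary classifier. The dataset, feature choices, architecture, and hyperparameters are illustrative assumptions rather than the study's actual configuration.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    # Toy data: article texts, binary labels (1 = fake, 0 = real), and two simple
    # linguistic features per article (e.g. exclamation count, mean word length)
    texts = np.array([["shocking secret cure that doctors hide"],
                      ["city council approves new transit budget"]])
    labels = np.array([1, 0])
    ling_feats = np.array([[1.0, 5.2], [0.0, 6.1]], dtype="float32")

    vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
    vectorizer.adapt(texts.ravel())

    text_in = tf.keras.Input(shape=(1,), dtype=tf.string)
    feat_in = tf.keras.Input(shape=(2,), dtype=tf.float32)

    x = vectorizer(text_in)
    x = layers.Embedding(20000, 64)(x)
    x = layers.Bidirectional(layers.LSTM(32))(x)
    x = layers.concatenate([x, feat_in])   # fuse linguistic features with the text encoding
    out = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model([text_in, feat_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit([texts, ling_feats], labels, epochs=2, verbose=0)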

    Harnessing the Power of Hugging Face Transformers for Predicting Mental Health Disorders in Social Networks

    Early diagnosis of mental disorders and early intervention can help prevent severe injuries and improve treatment results. Using social media and pre-trained language models, this study explores how user-generated data can be used to predict symptoms of mental disorders. Our study compares four different BERT models from Hugging Face with the standard machine learning techniques used for automatic depression diagnosis in recent literature. The results show that the new models outperform the previous approaches with an accuracy of up to 97%. Analyzing the results while complementing past findings, we find that even tiny amounts of data (such as users' bio descriptions) have the potential to predict mental disorders. We conclude that social media data is an excellent source for mental health screening, and pre-trained models can effectively automate this critical task. Comment: 19 pages, 5 figures.
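    A minimal sketch of the kind of Hugging Face fine-tuning setup compared above is shown below: a BERT checkpoint is fine-tuned as a binary classifier on short user texts. The base model, the toy examples, and the labels are assumptions for illustration; the study itself compares four specific BERT variants on a larger social-media dataset.

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Toy dataset of user texts with binary screening labels (1 = at risk, 0 = control)
    data = Dataset.from_dict({
        "text": ["feeling hopeless and exhausted lately", "loving my new job and the team"],
        "label": [1, 0],
    })

    model_name = "bert-base-uncased"   # placeholder; the paper compares several BERT variants
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="mh-bert", num_train_epochs=1,
                             per_device_train_batch_size=2, logging_steps=1)
    Trainer(model=model, args=args, train_dataset=data).train()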

    A Practical and Empirical Comparison of Three Topic Modeling Methods Using a COVID-19 Corpus: LSA, LDA, and Top2Vec

    This study was prepared as a practical guide for researchers interested in using topic modeling methodologies, and especially for those who have difficulty determining which methodology to use. Many topic modeling methods have been developed since the 1980s, namely latent semantic indexing or analysis (LSI/LSA), probabilistic LSI/LSA (pLSI/pLSA), naïve Bayes, the Author-Recipient-Topic model (ART), Latent Dirichlet Allocation (LDA), Topics over Time (TOT), Dynamic Topic Models (DTM), Word2Vec, Top2Vec, and variations and combinations of these techniques. Researchers from disciplines other than computer science may find it challenging to select a topic modeling methodology. We compared a recently developed topic modeling algorithm, Top2Vec, with two of the most conventional and frequently used methodologies: LSA and LDA. As a study sample, we used a corpus of 65,292 COVID-19-focused abstracts. Across the 11 topics we identified with each methodology, we found the highest levels of correlation between the LDA and Top2Vec results, followed by LSA and LDA, and then Top2Vec and LSA. We also provide information on the computational resources used to perform the analyses, along with practical guidelines and recommendations for researchers.
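    For orientation, a minimal sketch of the two classical baselines using gensim follows. The four toy token lists stand in for the 65,292 COVID-19 abstracts; Top2Vec (from the top2vec package) is invoked similarly with Top2Vec(list_of_raw_documents), but its UMAP/HDBSCAN steps need a much larger corpus than this toy example.

    from gensim import corpora
    from gensim.models import LdaModel, LsiModel

    # Toy tokenized corpus standing in for the COVID-19 abstracts
    docs = [
        ["covid", "vaccine", "trial", "efficacy"],
        ["lockdown", "economy", "unemployment", "policy"],
        ["virus", "transmission", "aerosol", "mask"],
        ["vaccine", "dose", "immune", "response"],
    ]

    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    lsa = LsiModel(bow, id2word=dictionary, num_topics=2)                              # LSA / LSI
    lda = LdaModel(bow, id2word=dictionary, num_topics=2, passes=10, random_state=0)   # LDA

    print(lsa.print_topics())
    print(lda.print_topics())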

    Toxic Comment Classification using Deep Learning

    Online conversation media serve as a means for individuals to engage, cooperate, and exchange ideas; however, they are also platforms that facilitate the spread of hateful and offensive comments, which can significantly affect one's emotional and mental health. The rapid growth of online communication makes it impractical to manually identify and filter out hateful tweets. Consequently, there is a pressing need for a method or strategy to eliminate toxic and abusive comments and ensure the safety and cleanliness of social media platforms. This toxicity analysis uses LSTM, character-level CNN, word-level CNN, and a hybrid (LSTM + CNN) model to classify comments and identify the different toxicity classes through a comparative analysis of the models. The neural network models take in comments extracted from online platforms, including both toxic and non-toxic comments. The results of this study can contribute to the development of a web interface that identifies toxic and hateful comments within a given sentence or phrase and categorizes them into their respective toxicity classes.
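    The sketch below shows what one of the compared architectures could look like: a hybrid word-level CNN + LSTM multi-label classifier over six toxicity classes. The class count, layer sizes, and random toy batch are illustrative assumptions, not the study's exact configuration.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_CLASSES = 6          # e.g. toxic, severe_toxic, obscene, threat, insult, identity_hate
    VOCAB, SEQ_LEN = 20000, 150

    inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB, 128)(inputs)
    x = layers.Conv1D(64, 5, activation="relu")(x)        # word-level convolution
    x = layers.MaxPooling1D(2)(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)          # sequence modelling over pooled features
    outputs = layers.Dense(NUM_CLASSES, activation="sigmoid")(x)   # independent per-class probabilities

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])

    # Toy batch: integer-encoded comments and multi-hot toxicity labels
    X = np.random.randint(1, VOCAB, size=(8, SEQ_LEN))
    y = np.random.randint(0, 2, size=(8, NUM_CLASSES)).astype("float32")
    model.fit(X, y, epochs=1, verbose=0)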

    Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

    This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several architectural improvements. Compared to existing NLP tools for Hungarian, all of our pipelines cover the basic text processing steps, including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition, with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools, and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license. Comment: Submitted to the TSD 2023 Conference.
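    As a usage illustration, the following minimal sketch loads a HuSpaCy Hungarian pipeline and prints per-token analyses and named entities through the standard spaCy API. It assumes the huspacy package and its default model have been installed (e.g. pip install huspacy followed by huspacy.download()); the example sentence is an arbitrary placeholder.

    import huspacy

    # Load the default Hungarian pipeline (tokenizer, tagger, lemmatizer, parser, NER)
    nlp = huspacy.load()

    doc = nlp("A kormány új támogatási programot jelentett be Budapesten.")

    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_, sep="\t")

    for ent in doc.ents:
        print(ent.text, ent.label_)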

    Knowledge Discovery from CVs: A Topic Modeling Procedure

    With a huge number of CVs available online, recruiting via the web has become an integral part of human resource management for companies. Automated text mining methods can be used to analyze large databases of CVs. We present a topic modeling procedure consisting of five steps, with the aim of identifying competences in CVs in an automated manner. Both the procedure and its exemplary application to CVs from IT experts are described in detail. The specific characteristics of CVs are considered in each step for optimal results. The exemplary application suggests that clearly interpretable topics describing fine-grained competences (e.g., Java programming, web design) can be discovered. This information can be used to rapidly assess the contents of a CV, categorize CVs, and identify candidates for job offers. Furthermore, a topic-based search technique is evaluated to provide helpful decision support.
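    A hedged sketch of such topic-based search is given below: CVs are represented as LDA topic mixtures and ranked against a job-offer query in topic space. The toy CVs, the topic count, and the whitespace preprocessing are illustrative assumptions and do not reproduce the paper's five-step procedure.

    from gensim import corpora, models, similarities

    # Toy CV snippets (whitespace tokenization only, for brevity)
    cvs = [
        "java spring backend development rest apis",
        "web design html css javascript responsive layouts",
        "python data analysis pandas machine learning",
    ]
    tokens = [cv.split() for cv in cvs]
    dictionary = corpora.Dictionary(tokens)
    bow = [dictionary.doc2bow(t) for t in tokens]

    # Topic model over the CVs and a similarity index in topic space
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=3, passes=20, random_state=1)
    index = similarities.MatrixSimilarity(lda[bow], num_features=3)

    # Rank CVs against a job-offer query by topic similarity
    query = "looking for a java backend developer"
    query_topics = lda[dictionary.doc2bow(query.split())]
    for cv_id, score in sorted(enumerate(index[query_topics]), key=lambda h: -h[1]):
        print(f"{score:.2f}  {cvs[cv_id]}")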

    Understanding the structure and meaning of Finnish texts: From corpus creation to deep language modelling

    Natural Language Processing (NLP) is a cross-disciplinary field combining elements of computer science, artificial intelligence, and linguistics, with the objective of developing means for the computational analysis, understanding, or generation of human language. The primary aim of this thesis is to advance natural language processing in Finnish by providing more resources and investigating the most effective machine-learning-based practices for their use. The thesis focuses on NLP topics related to understanding the structure and meaning of written language, concentrating mainly on structural analysis (syntactic parsing) and on the semantic equivalence of statements that vary in their surface realization (paraphrase modelling). While the new resources presented in the thesis are developed for Finnish, most of the methodological contributions are language-agnostic, and the accompanying papers demonstrate the application and evaluation of these methods across multiple languages.

    The first set of contributions revolves around the development of a state-of-the-art Finnish dependency parsing pipeline. Firstly, the necessary Finnish training data was converted to the Universal Dependencies scheme, integrating Finnish into this important treebank collection and establishing the foundations for Finnish UD parsing. Secondly, a novel word lemmatization method based on deep neural networks is introduced and assessed across a diverse set of over 50 languages. Finally, the overall dependency parsing pipeline is evaluated on a large number of languages, securing top ranks in two competitive shared tasks focused on multilingual dependency parsing. The outcome of this line of research is a parsing pipeline reaching state-of-the-art accuracy in Finnish dependency parsing, with the parsing scores obtained using the latest pre-trained language models approaching, or at least nearing, human-level performance.

    The success of large language models in dependency parsing, as well as in many other structured prediction tasks, raises the hope that large pre-trained language models genuinely comprehend language rather than merely relying on simple surface cues. However, datasets designed to measure semantic comprehension in Finnish have been non-existent or, at best, very scarce. To address this limitation, and to reflect the field's general shift of emphasis towards tasks more semantic in nature, the second part of the thesis focuses on language understanding through an exploration of paraphrase modelling. The second contribution of the thesis is the creation of a novel, large-scale, manually annotated corpus of Finnish paraphrases. A unique aspect of this corpus is that its examples have been manually extracted from pairs of related text documents, with the objective of obtaining non-trivial paraphrase pairs valuable for training and evaluating language understanding models on paraphrasing. We show that manual paraphrase extraction can yield a corpus featuring pairs that are both notably longer and less lexically overlapping than those produced through automated candidate selection, the currently prevailing practice in paraphrase corpus construction. Another distinctive feature of the corpus is that the paraphrases are identified and distributed within their document context, allowing for richer modelling and for novel tasks to be defined.
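    To ground the parsing discussion, the sketch below runs Universal Dependencies-style analysis (tokenization, lemmatization, POS tagging, dependency parsing) on a Finnish sentence using Stanza, a readily available UD parser with Finnish models; the thesis itself builds and evaluates the TurkuNLP pipeline, which exposes a comparable workflow. The example sentence is an arbitrary placeholder.

    import stanza

    stanza.download("fi")                 # one-time download of the Finnish models
    nlp = stanza.Pipeline("fi")           # tokenize, lemmatize, tag, and parse by default

    doc = nlp("Turun yliopisto kehittää suomen kielen jäsentämistä.")
    for sent in doc.sentences:
        for word in sent.words:
            head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
            print(word.text, word.lemma, word.upos, word.deprel, head, sep="\t")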