
    Named entity recognition based on Verbs Associated with Human Activities (VAHA)

    Motivated by the hypothesis that there always exist verbs that specifically describe human conduct, this thesis proposes a NER system that identifies NEs performing human activities through verb analysis, in an autonomous manner. In other words, this research attempts to identify NEs by observing the presence of Verbs Associated with Human Activities (VAHA). The VAHA architecture exhibits five significant characteristics: (i) it requires no training data for NER, eliminating human intervention (manually labelling data sets or creating gazetteers); (ii) it is applicable to small texts, even a single text; (iii) it is adaptable across different domains (fiction and non-fiction); (iv) it requires no gazetteer (person-name list) during NER; and (v) it requires no anaphora resolution prior to NER.
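
    To make the core idea concrete, here is a minimal sketch of the VAHA heuristic: a capitalized token immediately followed by a verb from a human-activity verb lexicon is flagged as a person entity. The seed lexicon and the adjacency heuristic are illustrative assumptions, not the thesis' actual resources.

```python
import re

# Hypothetical seed lexicon of Verbs Associated with Human Activities.
VAHA_LEXICON = {"said", "walked", "wrote", "sang", "apologised", "laughed"}

def tag_entities(text):
    """Return tokens judged to be person NEs by the VAHA heuristic."""
    entities = set()
    for sentence in re.split(r"[.!?]", text):
        tokens = sentence.split()
        for i, tok in enumerate(tokens[:-1]):
            following = tokens[i + 1].lower().strip(",;")
            # A capitalized word directly followed by a VAHA verb is
            # treated as the actor performing a human activity.
            if tok[:1].isupper() and following in VAHA_LEXICON:
                entities.add(tok.strip(",;"))
    return entities

print(tag_entities("Alice laughed at the joke. Later Bob wrote a reply."))
# {'Alice', 'Bob'} (set order may vary)
```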

    Hybrid Deep Neural Networks for Industrial Text Scoring

    Academic scoring is mainly explored through the pedagogical fields of Automated Essay Scoring (AES) and Short Answer Scoring (SAS), but text scoring in other domains has received limited attention. This paper focuses on industrial text scoring, namely the processing and adherence checking of long annual reports against regulatory requirements. To lay the foundations for non-academic scoring, a pioneering corpus of company annual reports is scraped and segmented into sections, and domain experts score the relevant sections for adherence. Subsequently, deep neural non-hierarchical attention-based LSTMs, hierarchical attention networks, and longformer-based models are refined and evaluated. Since the longformer outperformed the LSTM-based models, we embed it in a hybrid scoring framework that employs lexicon and named-entity features, with rubric injection via word-level attention, culminating in Kappa scores of 0.9670 and 0.820 on our two corpora, respectively. Though scoring is fundamentally subjective, our proposed models show significant results when navigating thin rubric boundaries and handling adversarial responses. As our work proposes a novel industrial text scoring engine, we hope to validate our framework on more official documentation covering a broader range of regulatory practices.
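
    The following skeleton illustrates the hybrid idea under stated assumptions: a pooled document embedding from a long-context encoder (e.g. a longformer, not loaded here) is concatenated with handcrafted lexicon and named-entity count features before a small regression head. All dimensions and the feature set are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HybridScorer(nn.Module):
    """Illustrative hybrid scorer: encoder embedding + handcrafted features."""

    def __init__(self, doc_dim=768, feat_dim=16, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(doc_dim + feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one adherence score per document
        )

    def forward(self, doc_emb, handcrafted):
        # doc_emb: (batch, doc_dim) pooled long-context encoder output.
        # handcrafted: (batch, feat_dim) lexicon / named-entity counts.
        return self.mlp(torch.cat([doc_emb, handcrafted], dim=-1))

scorer = HybridScorer()
score = scorer(torch.randn(2, 768), torch.randn(2, 16))
print(score.shape)  # torch.Size([2, 1])
```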

    Towards automated financial market knowledge graph construction

    The prevalence of financial news on the internet has made it easier for investors to access information. However, the fast-changing nature of the financial market and the time-consuming task of sifting through articles can be overwhelming. To address this issue, a framework is proposed to automatically construct and update a knowledge graph (KG) of financial market information. The KG stores relational information between entities in a directed graph format, providing a graphical visualization that allows investors to examine the complex relationships between entities that play a role in the stock market. The framework involves five main phases: scraping online articles, triple extraction, coreference resolution, predicate linking, and entity linking. The framework achieves a precision of 27.69%, a recall of 7.14%, and an F1 score of 0.1136 in extracting correct information from articles and integrating it properly into the KG.
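
    As a sketch of the framework's final integration step, the snippet below inserts already-extracted and already-linked (subject, predicate, object) triples into a directed graph with networkx. The triples are made up, and the upstream phases (scraping, triple extraction, coreference resolution, predicate and entity linking) are assumed to have run.

```python
import networkx as nx

# Invented triples standing in for the pipeline's linked output.
triples = [
    ("AcmeCorp", "acquires", "BetaSoft"),
    ("BetaSoft", "listed_on", "NASDAQ"),
]

kg = nx.DiGraph()
for subj, pred, obj in triples:
    # Edge direction encodes subject -> object; the predicate is
    # stored as an edge attribute.
    kg.add_edge(subj, obj, predicate=pred)

for u, v, data in kg.edges(data=True):
    print(f"{u} --{data['predicate']}--> {v}")
```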

    Construction of Part of Speech Tagger for Malay Language: A Review

    Part-of-Speech (POS) tagging is one of the fundamental tasks in Natural Language Processing (NLP) for analyzing human languages. It is the process of identifying how words are used in a sentence by assigning the proper POS to each word. Thus far, most well-researched POS tagging targets European languages, which are considered rich-resource languages due to abundant linguistic resources such as research studies and large standard corpora. POS tagging is arduous for low-resource languages because of the scarcity of such resources, and the Malay language is considered a low-resource language. Most POS tagging studies for Malay use rule-based and stochastic methods, while exploration of Deep Learning (DL) for Malay remains limited; studies that implement DL-based POS tagging for other low-resource languages within South East Asia are therefore also included. The aim of this study is to identify the state of the art, challenges, and future work for Malay POS taggers, and it provides a review of the different methods, datasets, and performance measures used in POS tagging studies.
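
    For readers unfamiliar with the DL methods the review surveys, a minimal BiLSTM tagger skeleton is sketched below; the vocabulary and tag-set sizes are placeholders and not tied to any particular Malay corpus.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Toy BiLSTM sequence tagger: one tag distribution per token."""

    def __init__(self, vocab_size=5000, tagset_size=15, emb=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, tagset_size)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)  # per-token tag logits

tagger = BiLSTMTagger()
logits = tagger(torch.randint(0, 5000, (1, 7)))  # one 7-token sentence
print(logits.shape)  # torch.Size([1, 7, 15])
```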

    Forum Text Processing and Summarization

    Frequently Asked Questions (FAQs) are extensively studied in general domains like the medical field, but such frameworks are lacking in domains such as software engineering and open-source communities. This research aims to bridge this gap by establishing the foundations of an automated FAQ Generation and Retrieval framework specifically tailored to the software engineering domain. The framework applies analysis, ranking, sentiment analysis, and summarization techniques to open forums like Stack Overflow and GitHub issues. A corpus of Stack Overflow post data is collected to evaluate the proposed framework and the selected models. Integrating state-of-the-art string-matching, sentiment analysis, and summarization models with the proprietary ranking formula proposed in this paper forms a robust automatic FAQ Generation and Retrieval framework that facilitates developers' work. The string matching, sentiment analysis, and summarization models are evaluated, achieving F1 scores of 71.31%, 74.90%, and 53.4%, respectively. Given the subjective nature of evaluations in this context, a human review further validates the effectiveness of the overall framework, with assessments of relevancy, preferred ranking, and preferred summarization. Future work includes improving the summarization models by incorporating text classification and summarizing each class individually (Kou et al., 2023), as well as proposing feedback-loop systems based on human reinforcement learning. Furthermore, efforts will be made to optimize the framework by utilizing knowledge graphs for dimension reduction, enabling it to handle larger corpora effectively.
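
    Since the paper's ranking formula is proprietary, the sketch below only illustrates the general shape of such a ranking stage: candidate questions are ordered by a weighted combination of duplicate frequency and negative-sentiment intensity. The weights and record fields are invented for the example.

```python
# Invented candidate FAQ entries; "dupes" counts duplicate posts and
# "neg_sentiment" is an assumed score in [0, 1] from a sentiment model.
posts = [
    {"question": "How do I fix a merge conflict?", "dupes": 42, "neg_sentiment": 0.7},
    {"question": "Why does pip install fail?",     "dupes": 17, "neg_sentiment": 0.9},
]

def faq_score(post, w_freq=0.6, w_sent=0.4):
    """Toy ranking: frequent, frustration-heavy questions rank first."""
    return w_freq * post["dupes"] + w_sent * 100 * post["neg_sentiment"]

for post in sorted(posts, key=faq_score, reverse=True):
    print(round(faq_score(post), 1), post["question"])
# 53.2 How do I fix a merge conflict?
# 46.2 Why does pip install fail?
```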

    What Modality Matters? Exploiting Highly Relevant Features for Video Advertisement Insertion

    Video advertising is a thriving industry that has recently turned its attention to the use of intelligent algorithms for automating tasks. In advertisement insertion, contextual relevance is essential in shaping the viewer’s experience. Despite the wide spectrum of audio-visual semantic modalities available, there is a lack of research that systematically analyzes their individual and complementary strengths. In this paper, we propose an ad-insertion framework that maximizes the contextual relevance between the advertisement and the content video by employing high-level multi-modal semantic features. Prediction vectors are derived via clip-level and image-level extractors and then matched to yield relevance scores. We also establish a new user-study methodology that produces gold-standard annotations based on multiple expert selections. Through comprehensive human-centered approaches and analysis, we demonstrate that automatic ad insertion can be improved by exploiting effective combinations of semantic modalities.
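
    A minimal sketch of the matching step, assuming the clip-level content vectors and ad vectors live in the same semantic label space: cosine similarity between prediction vectors yields a relevance score, and the highest-scoring ad is selected for the insertion point. The vectors and labels are invented for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two prediction vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical semantic predictions over [sports, food, fashion].
content_clip = np.array([0.7, 0.2, 0.1])   # a sports-heavy scene
ads = {"sports_drink": np.array([0.8, 0.1, 0.1]),
       "perfume":      np.array([0.1, 0.1, 0.8])}

best = max(ads, key=lambda name: cosine(content_clip, ads[name]))
print(best)  # sports_drink
```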

    Remodeling Numerical Representation for Text Generation on Small Corpus

    Data-to-text generation aims to generate natural language descriptions from non-linguistic data. Recent research on data-to-text generation often uses a neural encoder-decoder architecture due to its simplicity and ability to work across multiple domains. In this study, we investigate two input encoding strategies for financial data-to-text systems: (1) numeral encoding as the baseline, and (2) encoding each numeral as a sequence of character tokens as the proposed solution. An empirical study on the financial dataset validates our initial hypothesis that the character-based representation yields comparable results in content selection and diversity of the generated text descriptions.
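
    The two encoding strategies can be illustrated on a toy numeral: the baseline treats the number as a single opaque token, while the proposed representation splits it into a sequence of character tokens.

```python
def numeral_token(value):
    return [str(value)]       # baseline: the numeral as one opaque token

def char_tokens(value):
    return list(str(value))   # proposed: a sequence of character tokens

price = 1843.27
print(numeral_token(price))  # ['1843.27']
print(char_tokens(price))    # ['1', '8', '4', '3', '.', '2', '7']
```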

    Detecting At-Risk and Withdrawal Students in STEM and Social Science Courses using Predictive and Association Rules Mining

    This research aims to identify potential at-risk and withdrawal students in order to help them in their studies. Interactions consisting of surfing behaviour in the Virtual Learning Environment (VLE) among two groups of students, namely disabled and non-disabled students, in Social Science and STEM courses are analysed. Predictive analytics is performed to predict students’ likelihood of withdrawing from their registered courses and, among the students who pursue their registered courses, to predict those at risk. Six predictive algorithms, namely Decision Tree (DT), Logistic Regression (LR), Naive Bayes (NB), K Nearest Neighbour (KNN), Random Forest (RF), and Support Vector Machine (SVM), are compared, and the FP-Growth algorithm is applied in the association rule mining (ARM) analysis. Predictive results show that DT is superior, with accuracy scores reaching 0.91. Most association rules are positively correlated, and they represent the set of pages commonly surfed by potential at-risk and withdrawal students. The predictive results can help VLE developers determine the algorithms to be used in an intelligent VLE to make accurate predictions based on students’ interactions, and the ARM results show that FP-Growth can also be included in such a VLE. An intelligent VLE can assist the relevant staff in an education institution in providing timely and personalized support to students who are struggling in their studies. This research contributes to precision education through learning analytics.
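
    A minimal sketch of the predictive step on synthetic data (not the study's VLE logs): a decision tree, the study's best performer, is trained on per-student interaction counts to flag at-risk students. The features and label rule are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
clicks = rng.integers(0, 200, size=(500, 4))      # per-page visit counts
at_risk = (clicks.sum(axis=1) < 250).astype(int)  # toy labelling rule

X_tr, X_te, y_tr, y_te = train_test_split(clicks, at_risk, random_state=0)
model = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
```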

    Click Analysis: How E-commerce Companies Benefit from Exploratory and Association Rule Mining

    Electronic commerce (henceforth referred to as e-commerce) has attracted many people to buy things online because of its convenience, and its popularity increased further during the COVID-19 pandemic as many people worked from home. The ability to understand customers' surfing and buying behavior on an e-commerce platform gives e-commerce companies a competitive advantage: they can devise specific marketing plans to increase their market coverage and, subsequently, revenues from online sales of products. This paper discusses how the results derived from both exploratory data analysis (EDA) and association rule mining (ARM) can assist e-commerce companies in designing specific marketing plans. The methodology consists of data understanding, data pre-processing, EDA, ARM, and analysis of results. A public dataset released in 2020, consisting of clickstream data collected in 2018 from a popular fashion e-commerce website, is used as a case study to demonstrate the viability of the methodology. This study shows that it is possible to apply the methodology to clickstream data consisting of customers’ surfing and buying behavior in order to derive analyses and devise better marketing plans.
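
    The ARM step can be sketched as follows on made-up clickstream sessions, using the mlxtend implementation of FP-Growth; the page categories and thresholds are illustrative only, not drawn from the case-study dataset.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Invented sessions: page categories viewed together in one visit.
sessions = [["trousers", "shirts", "sale"],
            ["shirts", "shoes"],
            ["trousers", "shirts", "shoes"],
            ["trousers", "sale"]]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(sessions).transform(sessions),
                      columns=encoder.columns_)

# Itemsets appearing in at least half of all sessions, then rules.
itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "confidence"]])
```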

    Acquiring Input Features from Stock Market Summaries: A NLG Perspective

    Generating text from structured data is challenging because it requires bridging the gap between the data and natural language. In financial data-to-text generation, stock market summaries written by experts require long-term analysis of market prices, so it is often not suitable to formulate the problem as an end-to-end generation task. In this work, we focus on generating input features that can be aligned with stock market summaries. In particular, we introduce a new corpus for the task and define a rule-based approach to automatically identify salient market features from market prices. We obtain baseline results using state-of-the-art pre-trained models. Experimental results show that these models can produce fluent text and fairly accurate descriptions. We end with a discussion of the limitations and challenges of the proposed task.
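
    In the spirit of the rule-based approach (though not the paper's actual rules), the sketch below derives a few salient, alignable features from a short window of closing prices; the thresholds are illustrative assumptions.

```python
def market_features(prices):
    """Toy rule-based extraction of salient features from a price window."""
    change = (prices[-1] - prices[0]) / prices[0] * 100
    swings = [abs(b - a) / a for a, b in zip(prices, prices[1:])]
    return {
        "direction": "rise" if change > 0 else "fall",
        "pct_change": round(change, 2),
        "volatile": max(swings) > 0.02,   # any single-day move above 2%
    }

week = [1500.0, 1512.5, 1498.0, 1530.0, 1545.2]
print(market_features(week))
# {'direction': 'rise', 'pct_change': 3.01, 'volatile': True}
```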