An Improved Bank Credit Scoring Model: A Naïve Bayesian Approach
Credit scoring is a decision tool used by organizations to grant or reject credit requests from their customers. A series of artificial-intelligence and traditional approaches have been used to build credit scoring models and evaluate credit risk. Despite being ranked among the top 10 algorithms in data mining, the Naïve Bayesian algorithm has not been used extensively to build credit scorecards. Using demographic and material indicators as input variables, this paper investigates the ability of the Bayesian classifier to build a credit scoring model for the banking sector.
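As a sketch of the general technique (not the paper's actual model or data), a Gaussian Naïve Bayes scorer over two hypothetical applicant features can be written in a few lines:

```python
import numpy as np

# Toy applicant data (hypothetical): columns = [age, monthly_income]
X = np.array([[25, 1200], [40, 5200], [35, 4100], [22, 900],
              [50, 6000], [30, 1500], [45, 4800], [28, 1100]], float)
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # 1 = grant credit, 0 = reject

def fit_gaussian_nb(X, y):
    """Per-class mean, variance, and prior for Gaussian Naive Bayes."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(y))
    return params

def predict(params, x):
    """Pick the class with the highest log joint probability."""
    scores = {}
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

model = fit_gaussian_nb(X, y)
print(predict(model, np.array([42, 5000.0])))  # high income -> class 1
```

The log-space sum avoids underflow when many features are multiplied, which is part of why Naïve Bayes remains practical on high-dimensional scorecard data.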
Design and Implementation of a Student Attendance System Using Iris Biometric Recognition
Attendance taking is a standard practice in every educational system. The methods used to take class attendance are numerous, but emphasis keeps shifting towards automating the process. The use of biometrics in taking class attendance is fast gaining ground, and the traditional way of taking attendance is losing ground, especially when the class is very large and time is of the essence. The iris was used as the biometric in this paper. After enrolling all attendees by storing their particulars along with their unique iris templates, the designed system automatically took class attendance by capturing the eye image of each attendee, recognizing their iris, and searching for a match in the created database. The designed prototype is also web based. This paper proposes an alternative and accurate method of taking attendance that is both spoof-proof and relatively cheap to implement.
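The abstract does not specify the matching algorithm, but iris systems commonly compare binary iris codes by fractional Hamming distance; below is a minimal sketch with made-up names, random stand-in templates, and an assumed decision threshold of 0.32:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming(a, b):
    """Fraction of differing bits between two iris codes."""
    return np.count_nonzero(a != b) / a.size

# Hypothetical enrolment database: name -> 2048-bit iris template
db = {name: rng.integers(0, 2, 2048, dtype=np.uint8)
      for name in ["ada", "bayo", "chidi"]}

def identify(probe, db, threshold=0.32):
    """Return the enrolled identity whose template is closest to the
    probe, provided the distance falls below the decision threshold."""
    name, dist = min(((n, hamming(probe, t)) for n, t in db.items()),
                     key=lambda p: p[1])
    return name if dist < threshold else None

# A live capture is noisy: flip about 5% of the enrolled bits for "bayo".
probe = db["bayo"].copy()
noise = rng.random(2048) < 0.05
probe[noise] ^= 1
print(identify(probe, db))  # -> bayo
```

Unrelated iris codes differ on roughly half their bits, so a threshold well below 0.5 rejects impostors while tolerating capture noise.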
The Investigation of Multiple Product Rating Based on Data Mining Approaches
Ratings and product reviews can be considered among the main features determining the quality of a product in online store systems, especially when deciding whether to place a product in an online store's inventory. Online vendors are drawn to product reviews and ratings in order to study potential products and make related predictions. To this end, several machine learning algorithms, such as Support Vector Machine, Bayesian Networks, Random Forests and Logistic Regression, are investigated. The performance of each model is evaluated using accuracy, sensitivity and F1 score on data from the Amazon online store, 1996 to 2014. It is noteworthy that the results of this paper can be used as an initial input to long-term product rating predictions. Keywords: Rating, Machine Learning Algorithm, Text Mining, Classification, Resampling. DOI: 10.7176/CEIS/10-5-03. Publication date: June 30th 201
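The metrics named above (accuracy, sensitivity, F1) follow directly from confusion-matrix counts; the labels below are invented for illustration, not drawn from the Amazon data:

```python
def scores(y_true, y_pred):
    """Accuracy, sensitivity (recall) and F1 for binary labels."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return accuracy, sensitivity, f1

# Hypothetical model output: 1 = "high-rated product"
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(scores(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Reporting sensitivity alongside accuracy matters for rating data, where one class (e.g. highly rated products) is often much rarer than the other.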
Feature Selection for Text and Image Data Using Differential Evolution with SVM and Naïve Bayes Classifiers
Classification problems are increasing in various important applications, such as text categorization, image analysis, medical imaging diagnosis and biomolecular analysis, due to large attribute sets. With large datasets, feature extraction methods play an important role in removing irrelevant features and thereby increasing classifier performance. Various machine-learning-based methods exist for text and image classification. These approaches are used for dimensionality reduction, which aims to filter out less informative and outlier data, and therefore provide a compact representation and computationally more tractable accuracy. At the same time, these methods can be challenged when the search space doubles multiple times. To address such challenges, a hybrid approach is suggested in this paper. The proposed approach uses differential evolution (DE) for feature selection with naïve Bayes (NB) and support vector machine (SVM) classifiers to enhance the performance of the selected classifier. The results are verified using text and image data and reflect improved accuracy compared with other conventional techniques. Twenty-five benchmark (UCI) datasets from different domains are used to test the proposed algorithms, and a comparative study between the proposed hybrid classification algorithms is presented. Finally, the experimental results show that differential evolution with the NB classifier outperforms the alternatives and produces better estimation of probability terms. The proposed technique is also feasible in terms of computational time.
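A DE-driven feature-selection loop of the kind the abstract describes can be sketched as follows; the synthetic data, the nearest-centroid stand-in for the NB/SVM wrapper classifier, and the DE settings (F, CR, population size) are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 2 informative features + 8 noise features.
n = 120
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 10))
X[:, 0] += 3 * y          # informative
X[:, 1] -= 3 * y          # informative

def fitness(mask):
    """Holdout accuracy of a nearest-centroid classifier restricted
    to the selected features (stand-in for the wrapped NB/SVM)."""
    if not mask.any():
        return 0.0
    Xm = X[:, mask]
    tr, te = slice(0, n // 2), slice(n // 2, n)
    c0 = Xm[tr][y[tr] == 0].mean(axis=0)
    c1 = Xm[tr][y[tr] == 1].mean(axis=0)
    d0 = np.linalg.norm(Xm[te] - c0, axis=1)
    d1 = np.linalg.norm(Xm[te] - c1, axis=1)
    return np.mean((d1 < d0) == y[te])

# DE/rand/1/bin over continuous vectors; a feature mask is (v > 0.5).
pop = rng.random((20, 10))
F, CR = 0.8, 0.9
for _ in range(40):
    for i in range(len(pop)):
        a, b, c = pop[rng.choice(len(pop), 3, replace=False)]
        trial = np.where(rng.random(10) < CR, a + F * (b - c), pop[i])
        if fitness(trial > 0.5) >= fitness(pop[i] > 0.5):
            pop[i] = trial  # greedy selection keeps the better mask
best = max(pop, key=lambda v: fitness(v > 0.5)) > 0.5
print(best, fitness(best))
```

The thresholding trick lets standard continuous DE search the binary mask space; each candidate is scored by the downstream classifier, which is what makes this a wrapper method.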
Appearance of Corporate Innovation in Financial Reports: A Text-Based Analysis
Innovations are important drivers of economic growth and firm profitability. Firms need funding to generate profitable innovations, which is why it is important to reliably distinguish innovative firms. Innovation indicators are used to measure this innovativeness, and consequently, it is important that the indicator used is reliable and measures innovation as intended.
Patents, research and development expenditure, and innovation surveys are examples of popular innovation indicators in the research literature. However, these indicators have weaknesses, which is why new innovation indicators have been developed. This thesis studies the text-based innovation indicator developed by Bellstam et al. (2019) with a new type of data. Bellstam et al. (2019) created a text-based innovation indicator that compares corporations' analyst reports with an innovation textbook; the similarity between these texts forms the measure of innovativeness. Analyst reports are usually available only for a fee. However, the 10-K reports used as data for this study are publicly available, and their functioning as the basis of the innovation indicator would mean good availability for the indicator.
The study begins by training a Latent Dirichlet Allocation (LDA) model on a sample of 10-K documents from 2008-2018. The LDA model is an unsupervised machine learning method: it finds topics in text documents based on the probabilities of different words. The model was trained to find 15 topics in the data, and its output is the distribution of these topics for each document. The same topic distributions were also computed for eight samples from innovation textbooks. Once the topic distributions were available, the Kullback-Leibler divergence (KL divergence) was calculated between each text sample and 10-K document. The calculated KL divergence is thus lowest for the reports most similar to the innovation text, and it serves as the text-based innovation indicator.
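Given the topic distributions, the divergence step described above is a short calculation; the 15-topic vectors below are randomly generated stand-ins for a trained LDA model's output, not the thesis's data:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between topic distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 15-topic distributions, as a trained LDA model would emit.
rng = np.random.default_rng(3)
innovation_text = rng.dirichlet(np.ones(15))   # textbook sample

# A report that mostly shares the textbook's topics, and an unrelated one.
report_a = 0.9 * innovation_text + 0.1 * rng.dirichlet(np.ones(15))
report_b = rng.dirichlet(np.ones(15))

# The report closest to the innovation sample gets the lower divergence.
print(kl(innovation_text, report_a) < kl(innovation_text, report_b))
```

KL divergence is asymmetric, so the direction of the comparison (textbook sample versus report) must be fixed consistently across all documents for the resulting scores to be comparable.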
Finally, the text-based innovation indicator was validated with regression analysis; in other words, it was confirmed that the indicator measures innovation. The text-based indicator was compared with research and development costs and the balance-sheet value of brands and patents in different linear regressions. Of the eight innovation measurements, most had a statistically significant correlation with one or both of the other innovation indicators. The ability of the text-based indicator to predict the development of sales in the following year was also studied with regression analysis, and all of the measurements had a significant effect. The most significant findings of this thesis are the relationship between the text-based innovation indicator and other indicators, and its ability to predict firms' sales.
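The validation step described above amounts to ordinary least squares with a significance test on the indicator's coefficient; the panel below is simulated under assumed effect sizes, purely to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical panel: text-based indicator score, an R&D control, and
# next-year sales growth generated with assumed (made-up) coefficients.
n = 200
indicator = rng.normal(size=n)
r_and_d = 0.5 * indicator + rng.normal(scale=0.5, size=n)
sales_growth = 0.4 * indicator + 0.2 * r_and_d + rng.normal(scale=0.3, size=n)

# OLS with an intercept: sales_growth ~ indicator + r_and_d.
X = np.column_stack([np.ones(n), indicator, r_and_d])
beta, *_ = np.linalg.lstsq(X, sales_growth, rcond=None)

# Classical standard errors and t-statistics for each coefficient.
resid = sales_growth - X @ beta
se = np.sqrt(np.sum(resid ** 2) / (n - 3) * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se
print(t_stats[1])  # t-statistic on the indicator coefficient
```

A t-statistic above roughly 2 in absolute value corresponds to significance at the 5% level, which is the kind of evidence the thesis reports for its indicator.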
Simultaneous modelling and clustering of visual field data
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. In the health-informatics and biomedical domains, clinicians produce an enormous amount of data, which can be complex and high-dimensional. This includes visual field data, which are used for managing the second leading cause of blindness in the world: glaucoma. Visual field data are the most common type of data collected to diagnose glaucoma in patients, and usually consist of 54 or 76 variables (referred to as visual field locations). Due to the large number of variables, the six nerve fibre bundles (6NFB), a grouping of visual field locations, are the standard clusters used in visual field data to represent the physiological traits of the retina. However, with regard to classification accuracy, this research proposes a technique to find other significant spatial clusters of the visual field with higher classification accuracy than the 6NFB.
This thesis presents a novel clustering technique, Simultaneous Modelling and Clustering (SMC). SMC clusters data based on classification accuracy using heuristic search techniques, searching for a collection of significant clusters of visual field locations that indicate progression of visual field loss. The aim of this research is two-fold. Firstly, SMC algorithms are developed and tested on data to investigate the effectiveness and efficiency of the method using optimisation and classification methods. Secondly, a significant clustering arrangement of the visual field, grouping highly interrelated visual field locations to represent progression of visual field loss with high classification accuracy, is sought to complement the 6NFB. Such a clustering arrangement can be used by medical practitioners alongside the 6NFB in diagnosing glaucoma in patients.
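The abstract does not detail the search procedure, so the following is only a generic sketch of the core idea of scoring a cluster arrangement by held-out classification accuracy and improving it by local search; the synthetic 54-location data, the nearest-centroid classifier, and the hill-climbing moves are all assumptions, not the thesis's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for visual field data: 54 locations, two patient
# groups whose signal lives in locations 0-17 (purely hypothetical).
n, loc, k = 100, 54, 3
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, loc))
X[:, :18] += 1.5 * y[:, None]

def accuracy(assign):
    """Nearest-centroid holdout accuracy using cluster means as features."""
    if len(np.unique(assign)) < k:
        return 0.0  # an empty cluster yields no feature
    feats = np.stack([X[:, assign == c].mean(axis=1) for c in range(k)],
                     axis=1)
    tr, te = slice(0, n // 2), slice(n // 2, n)
    c0 = feats[tr][y[tr] == 0].mean(axis=0)
    c1 = feats[tr][y[tr] == 1].mean(axis=0)
    pred = (np.linalg.norm(feats[te] - c1, axis=1)
            < np.linalg.norm(feats[te] - c0, axis=1))
    return np.mean(pred == y[te])

# Hill climbing: move one location to another cluster, keep improvements.
assign = rng.integers(0, k, loc)
best = accuracy(assign)
for _ in range(400):
    i, c = rng.integers(loc), rng.integers(k)
    old = assign[i]
    assign[i] = c
    score = accuracy(assign)
    if score >= best:
        best = score
    else:
        assign[i] = old  # revert moves that hurt accuracy
print(best)
```

The point of the sketch is the coupling: the clustering is not scored by within-cluster variance but by how well the resulting cluster-level features classify patients, which is what "simultaneous modelling and clustering" refers to.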
This research conducts extensive experimental work on both visual field and simulated data to evaluate the proposed method. The results obtained suggest the proposed method is effective and efficient in clustering visual field data and improving classification accuracy. The key contributions of this work are the novel model-based clustering of visual field data, effective and efficient algorithms for SMC, practical knowledge of visual field data in the diagnosis of glaucoma, and the presentation of a generic framework for modelling and clustering that is highly applicable to many other dataset/model combinations.