7 research outputs found

    Information Retrieval for Early Detection of Disease Using Semantic Similarity

    Get PDF
     The growth of medical records continues, and these records can be used to improve doctors' performance in diagnosing disease. A retrieval method returns proposed information to provide diagnostic recommendations based on symptoms from a medical-record dataset, applying the TF-IDF and cosine similarity methods. The challenge in this study was that the symptoms in the medical-record dataset were dirty data, obtained from patients who were not familiar with biological terms. Therefore, the symptoms in the medical-record data were matched against the symptom terms used in the system, and from the results data augmentation was carried out to increase the amount of data by roughly a factor of three. With TF-IDF, the highest accuracy with  is only , while after augmentation of the test data the accuracy becomes . The highest accuracy with the same  value using the cosine similarity method is , and with the augmented test data the accuracy increases to . This study concluded that a system given sufficient and relevant symptom input provides a more accurate disease prediction. Predictions using the TF-IDF method with  are more accurate than predictions using the cosine similarity method.
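The retrieval step described above can be sketched in plain Python: build smoothed TF-IDF vectors over per-disease symptom records, then return the record with the highest cosine similarity to the query. The diseases, symptom terms, and the smoothing variant of IDF are invented here for illustration; the paper's actual dataset and parameters are not shown in the abstract.

```python
import math
from collections import Counter

# Hypothetical symptom "documents" per disease, standing in for the
# medical-record dataset described in the abstract (names are invented).
records = {
    "flu":      "fever cough headache fatigue",
    "dengue":   "fever rash joint_pain headache",
    "migraine": "headache nausea light_sensitivity",
}

def build_idf(docs):
    """Smoothed IDF so terms present in every record keep a small weight."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

def vectorize(text, idf):
    tf = Counter(text.split())
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def diagnose(symptoms, docs):
    """Return the disease whose symptom record is most similar to the query."""
    idf = build_idf(docs)
    q = vectorize(symptoms, idf)
    return max(docs, key=lambda d: cosine(q, vectorize(docs[d], idf)))
```

A query such as `diagnose("fever rash joint_pain", records)` ranks diseases by similarity of their symptom profiles; augmenting the records with patient-language synonyms, as the study does, would enlarge each document before vectorization.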

    Application of Numerical Measure Variations in K-Means Clustering for Grouping Data

    Get PDF
    The K-Means Clustering algorithm is commonly used by researchers for grouping data. The main problem in this study was that it had yet to be discovered how optimal the grouping is under variations in the distance calculation used by K-Means Clustering. The purpose of this research was to compare distance calculation methods in K-Means, namely Euclidean Distance, Canberra Distance, Chebyshev Distance, Cosine Similarity, Dynamic Time Warping Distance, Jaccard Similarity, and Manhattan Distance, to find out which distance calculation is optimal in the K-Means method. The best distance calculation was determined by the smallest Davies-Bouldin Index (DBI) value. The data used in this study were cosmetics sales at Devi Cosmetics, consisting of sales from January to April 2022 with 56 product items. The result of this study was a comparison of numerical measures in the K-Means Clustering algorithm. The optimal clustering used the Euclidean distance with 9 clusters and a DBI value of 0.224; the Euclidean Distance also gave the best average DBI value, 0.265.
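The comparison procedure can be sketched as K-Means with a pluggable distance function, scored by the Davies-Bouldin Index. This is a minimal illustration on invented 2-D toy data, not the paper's 56-item sales dataset; centroids are updated with the plain mean regardless of the distance used, which is a common simplification.

```python
import math
import random

def euclidean(a, b): return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
def manhattan(a, b): return sum(abs(x - y) for x, y in zip(a, b))
def chebyshev(a, b): return max(abs(x - y) for x, y in zip(a, b))

def kmeans(points, k, dist, iters=50, seed=0):
    """K-Means with a pluggable distance for the assignment step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        new = [tuple(sum(col) / len(col) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return clusters, centroids

def davies_bouldin(clusters, centroids, dist):
    """Smaller DBI = tighter, better-separated clusters."""
    # intra-cluster scatter: mean distance of members to their centroid
    s = [sum(dist(p, c) for p in cl) / len(cl) if cl else 0.0
         for cl, c in zip(clusters, centroids)]
    k = len(centroids)
    return sum(max((s[i] + s[j]) / dist(centroids[i], centroids[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k

# Two well-separated toy blobs; compare three of the seven distances.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
results = {}
for name, dist in [("euclidean", euclidean), ("manhattan", manhattan),
                   ("chebyshev", chebyshev)]:
    clusters, cents = kmeans(points, 2, dist)
    results[name] = davies_bouldin(clusters, cents, dist)
best = min(results, key=results.get)
sizes = sorted(len(c) for c in kmeans(points, 2, euclidean)[0])
```

Running each distance variant and keeping the smallest DBI mirrors the study's selection criterion.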

    Um método voltado à recomendação de examinadores para a análise de patentes

    Get PDF
    Undergraduate thesis (TCC) - Universidade Federal de Santa Catarina, Campus Araranguá, Computer Engineering. The patent examination process is a labor-intensive task for patent examiners. Examining a patent correctly reduces, for example, the risk of patent litigation. Although patent examination is indispensable, the continuous growth in the volume of patents makes the evaluation process increasingly costly. Among the tools with potential to help with this challenge are recommender systems, which are used in different areas and can be employed to suggest evaluators in the context of patent examination. This work therefore proposes a method that, given a patent of interest, recommends a ranking of evaluators based on previously examined patents. The recommendation step adopts a content-based approach using document embeddings produced with the Doc2vec framework. For the evaluation, five scenarios were established such that, for each input patent in each scenario, an ordered list of possible examiners was recommended. The results in these scenarios were mostly consistent, i.e., the method was able to suggest the correct examiners. Finally, it is concluded that the proposed method enables examiner recommendation with the potential to assist decision-makers in settings where a given patent must be assigned to a specific examiner.
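The content-based ranking step can be sketched as follows. Here a plain bag-of-words vector stands in for the Doc2vec document embedding (training Doc2vec needs a corpus and a library such as Gensim); the examiner names and patent texts are invented. Each examiner is represented by the patents they previously examined, and examiners are ranked by their best-matching patent.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a Doc2vec embedding: a plain bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical examiner -> previously examined patent texts.
examined = {
    "alice": ["battery cell electrode lithium", "battery charging circuit"],
    "bob":   ["neural network training method", "image classification model"],
}

def recommend(query_patent, examined, top_n=2):
    """Rank examiners by similarity of the query to their past patents."""
    q = embed(query_patent)
    scores = {ex: max(cosine(q, embed(p)) for p in pats)
              for ex, pats in examined.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The ordered list returned by `recommend` corresponds to the ranking of evaluators the method produces for each input patent.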

    Appearance of Corporate Innovation in Financial Reports: A Text-Based Analysis

    Get PDF
    Innovations are important drivers of economic growth and firm profitability. Firms need funding to generate profitable innovations, which is why it is important to reliably distinguish innovative firms. Innovation indicators are used to measure this innovativeness, and consequently, it is important that the indicator used is reliable and measures innovation as desired. Patents, research and development expenditure, and innovation surveys are examples of popular innovation indicators in the research literature. However, these indicators have weaknesses, which is why new innovation indicators have been developed. This thesis studies the text-based innovation indicator developed by Bellstam et al. (2019) with a new type of data. Bellstam et al. (2019) created a text-based innovation indicator that compares corporations' analyst reports with an innovation textbook; the similarity between these texts forms the measurement of innovativeness. Analyst reports are usually subject to charge. However, the 10-K reports used as data for this study are publicly available, and their functioning as the basis of the innovation indicator would mean good availability for the indicator. The study begins by training a Latent Dirichlet allocation (LDA) model with a sample of 10-K documents from 2008-2018. The LDA model is an unsupervised machine learning method that finds topics in text documents based on the probabilities of different words. The model was trained to find 15 topic allocations in the data, and its output is the distribution of these topics for each document. The same topic distributions were also computed for eight samples from innovation textbooks. Once the topic distributions were obtained, a Kullback-Leibler divergence (KL-divergence) was calculated between each text sample and 10-K document.
Thus, the calculated KL-divergence is lowest for the reports that are most similar to the innovation text, and it serves as the text-based innovation indicator. Finally, the text-based innovation indicator was validated with regression analysis; in other words, it was confirmed that the indicator measures innovation. The text-based indicator was compared with research and development costs and with the balance-sheet value of brands and patents in different linear regressions. Of the eight innovation measurements, most had a statistically significant correlation with one or both of the other innovation indicators. The ability of the text-based indicator to predict the development of sales in the next year was also studied with regression analysis, and all of the measurements had a significant effect on this. The most significant findings of this thesis are the relationship between the text-based innovation indicator and other indicators and its ability to predict firms' sales.
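The indicator computation itself reduces to comparing topic distributions with KL-divergence and ranking reports by the result. The sketch below uses invented 5-topic distributions (the thesis uses 15 topics) and a small smoothing constant to avoid division by zero; lower divergence from the textbook distribution means a more "innovation-like" report.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), with smoothing for zeros."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical topic distributions over 5 topics (the thesis uses 15).
textbook = [0.05, 0.60, 0.20, 0.10, 0.05]        # innovation-textbook sample
reports = {
    "firm_A": [0.06, 0.55, 0.22, 0.12, 0.05],    # close to the textbook
    "firm_B": [0.70, 0.05, 0.05, 0.10, 0.10],    # far from the textbook
}

# Rank 10-K reports: lowest divergence = most similar to the innovation text.
ranked = sorted(reports, key=lambda r: kl_divergence(textbook, reports[r]))
```

The resulting per-report divergence is the raw value that the thesis then validates against R&D costs and sales growth in the regressions.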

    A Methodology Combining Cosine Similarity with Classifier for Text Classification

    No full text
    Text classification has received significant attention in recent years because of the proliferation of digital documents and is widely used in applications such as filtering and recommendation. Consequently, many approaches, including those based on statistical theory, machine learning, and classifier performance improvement, have been proposed for improving text classification performance. Among these approaches, centroid-based classifiers, multinomial naïve Bayes (MNB), support vector machines (SVM), and convolutional neural networks (CNN) are commonly used. In this paper, we introduce a cosine similarity-based methodology for improving performance. The methodology combines cosine similarity (between a test document and fixed categories) with conventional classifiers such as MNB, SVM, and CNN to improve their accuracy; we call the conventional classifiers combined with cosine similarity "enhanced classifiers". We applied the enhanced classifiers to well-known datasets - 20NG, R8, R52, Cade12, and WebKB - and evaluated their performance in terms of the confusion matrix's accuracy; we obtained outstanding results in that the enhanced classifiers show significant increases in accuracy. Moreover, through experiments, we identified which of two considered knowledge representation techniques (word count and term frequency-inverse document frequency (TF-IDF)) is more suitable in terms of classifier performance.
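One simple way to combine a base classifier with document-category cosine similarity is a weighted blend of the two scores; the paper's exact combination rule is not given in the abstract, so the rule, categories, and training texts below are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def category_vectors(train):
    """One aggregate bag-of-words vector per category (a simple centroid)."""
    vecs = {}
    for label, text in train:
        vecs.setdefault(label, Counter()).update(text.split())
    return vecs

def enhanced_scores(base_probs, doc, cat_vecs, alpha=0.5):
    """Blend a base classifier's probabilities with document-category
    cosine similarity (one plausible combination; hypothetical rule)."""
    d = Counter(doc.split())
    return {c: alpha * base_probs.get(c, 0.0) + (1 - alpha) * cosine(d, v)
            for c, v in cat_vecs.items()}

# Toy example: the base classifier is undecided, similarity breaks the tie.
train = [("sports", "ball game team score"),
         ("tech",   "computer code software bug")]
base_probs = {"sports": 0.5, "tech": 0.5}
scores = enhanced_scores(base_probs, "team score ball", category_vectors(train))
predicted = max(scores, key=scores.get)
```

In practice `base_probs` would come from the MNB, SVM, or CNN model the paper enhances; the similarity term pulls ambiguous documents toward the category whose vocabulary they share.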

    Code authorship attribution using content-based and non-content-based features

    Get PDF
    Machine learning approaches are widely used in natural language analysis. Previous research has shown that similar techniques can be applied to the analysis of computer programming (artificial) languages. In this thesis, we focus on identifying the authors of computer programs using machine learning techniques. We extend these techniques to determine which features capture the writing style of authors when classifying a computer program according to the author's identity. We then propose a novel approach to computer program author identification in which program features from the text documents are combined with authors' sociological features (gender and region) to develop the classification model. Several experiments were conducted on two datasets composed of computer programs written in C++, and the results are encouraging. According to the experimental results, the author's identity can be predicted with a 75% accuracy rate.
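The feature-combination idea can be sketched as a vector of simple stylometric (content-based) features concatenated with one-hot-encoded sociological features. The specific features, gender/region categories, and the C++ snippet below are invented for illustration; real authorship-attribution systems use far richer feature sets.

```python
def style_features(code):
    """A few simple stylometric features extracted from source text."""
    lines = code.splitlines() or [""]
    n = len(lines)
    return [
        sum(len(l) for l in lines) / n,                       # mean line length
        sum(l.startswith("\t") for l in lines) / n,           # tab-indent ratio
        sum(l.lstrip().startswith("//") for l in lines) / n,  # comment-line ratio
    ]

def combined_features(code, gender, region,
                      genders=("F", "M"), regions=("EU", "NA", "AS")):
    # content-based features concatenated with one-hot sociological features
    return style_features(code) + [float(gender == g) for g in genders] \
                                + [float(region == r) for r in regions]

# Hypothetical C++ sample with author metadata.
sample = "int main() {\n\treturn 0; // exit\n}"
vec = combined_features(sample, "F", "EU")
```

A vector like `vec` would then be fed to any standard classifier (SVM, random forest, etc.) trained on programs with known authors.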

    Communities in new media. Inclusive digital: Forming community in an open way. Self-determined participation in the digital transformation. Proceedings of 26th conference GeNeMe

    Get PDF
    The annual GeNeMe conference, "Communities in New Media", discusses online communities in particular from an integrated perspective across several disciplines, such as computer science, media technology, economics, education and information science, as well as social and communication science. As a forum for transdisciplinary dialogue, GeNeMe enables the exchange of experience and knowledge among participants from a wide range of fields, organizations, and institutions, with a focus on both research and practice. GeNeMe 2023 opened up in particular to the discussion of questions around inclusion and participation in the context of digital formats and innovations. Among others, the following questions were reflected on: How can inclusion be implemented through digitalization, and which possibilities are emerging for it? How can participation in and through digitalization succeed? What is the state of architectures and professional skills in the context of specific target groups? (DIPF/Orig.)