
    Approximate TF–IDF based on topic extraction from massive message stream using the GPU

    The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage large amounts of streaming data, especially in specific application domains such as critical infrastructure systems, sensor networks, log file analysis, search engines and, more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data streams, i.e., process data as it becomes available and provide an accurate response based solely on the portion of the stream already seen. Data retrieval techniques often rely on a traditional storage-and-processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which evaluates how important a word is in a collection of documents and requires a priori knowledge of the whole dataset. To address this problem, we propose an approximate version of the TF–IDF measure suitable for continuous data streams (such as exchanged messages, tweets and sensor-based log files). The algorithm for calculating this measure makes two assumptions: a fast response is required, and memory is both limited and vastly smaller than the size of the data stream. In addition, to meet the great computational power required to process massive data streams, we also present a parallel implementation of the approximate TF–IDF calculation using Graphics Processing Units (GPUs). The implementation was tested on synthetic and real data streams and was able to capture the most frequent terms. Our results demonstrate that the approximate TF–IDF measure performs at a level comparable to the exact TF–IDF measure.
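
    The abstract does not spell out the paper's data structures, but its two assumptions (bounded memory, single pass) suggest a heavy-hitters-style summary. The sketch below is a minimal, CPU-only illustration assuming a Misra-Gries counter for document frequencies; the class name and the capacity parameter are hypothetical, and the paper's GPU-parallel version is not reproduced here.

```python
import math
from collections import defaultdict

class ApproxStreamingTfIdf:
    """Sketch of an approximate streaming TF-IDF. Assumption (not from the
    paper): document frequencies are kept in a fixed-size Misra-Gries
    summary, so memory stays bounded no matter how long the stream runs."""

    def __init__(self, capacity=1000):
        self.capacity = capacity          # max tracked terms (memory budget)
        self.doc_freq = defaultdict(int)  # approximate document frequency
        self.n_docs = 0                   # documents seen so far

    def _update_counter(self, counter, term):
        # Misra-Gries update: track at most `capacity` distinct terms.
        if term in counter or len(counter) < self.capacity:
            counter[term] += 1
        else:
            # Over budget: decrement all counts, evicting any that hit zero.
            for t in list(counter):
                counter[t] -= 1
                if counter[t] == 0:
                    del counter[t]

    def add_document(self, tokens):
        """Consume one message from the stream and return the approximate
        TF-IDF weights of its terms, based only on the stream seen so far."""
        self.n_docs += 1
        tf = defaultdict(int)
        for term in tokens:
            tf[term] += 1
        for term in tf:
            self._update_counter(self.doc_freq, term)
        return {
            term: (count / len(tokens))
                  * math.log((1 + self.n_docs)
                             / (1 + self.doc_freq.get(term, 0)))
            for term, count in tf.items()
        }

# Usage on a toy message stream.
stream = [["gpu", "stream", "tfidf"], ["gpu", "kernel"], ["stream", "data"]]
model = ApproxStreamingTfIdf(capacity=100)
for doc in stream:
    print(model.add_document(doc))
```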

    Support Vector Machine Method for Sentiment Analysis on a Marketplace with Feature Comparison at the Aspect Level

    Sentiment analysis is an interdisciplinary field spanning natural language processing, artificial intelligence, and text mining. Its central task is polarity classification: determining whether a sentiment is positive or negative. This study applies a support vector machine (SVM) classifier to 648 consumer reviews collected from a marketplace selling mobile phones. Three aspects were found to indicate sentiment in the marketplace reviews: service, delivery, and product. A slang dictionary of 552 words was used for the normalization step. The study compares feature analysis methods to obtain the best classification result, since classification accuracy is influenced by the feature analysis process. Comparing n-gram and TF-IDF features with the SVM classifier, unigram features achieved the highest accuracy, at 80.87%. These results indicate that, for aspect-level sentiment analysis with feature comparison, the combination of unigram features and SVM classification is the best model. Keywords: sentiment analysis, e-commerce, marketplace, feature extraction, TF-IDF, n-gram, support vector machine
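
    As an illustration of the winning configuration (unigram TF-IDF features feeding an SVM), the following minimal scikit-learn sketch uses a hypothetical four-review dataset; the study's 648 reviews, its slang-normalization step, and its exact SVM kernel are not public, so the linear kernel here is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical mini-dataset standing in for the study's marketplace reviews.
reviews = [
    "delivery was fast and the product arrived intact",
    "great service, the seller replied quickly",
    "the phone arrived broken and the seller never answered",
    "slow delivery and poor packaging",
]
labels = ["positive", "positive", "negative", "negative"]

# Unigram TF-IDF features plus an SVM, mirroring the abstract's best model.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),  # unigrams only
    SVC(kernel="linear"),                 # kernel choice is an assumption
)
model.fit(reviews, labels)
print(model.predict(["fast delivery and friendly service"]))
```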

    Analysis of Life Context of On-Line Group-Buying Population by Dynamic Decision

    While it is difficult to avoid uncertainties when shopping on the Internet, trust can reduce customers’ perceived uncertainties and enhance their willingness and frequency to buy products and services. Differences in time and space information transparency between customers and on-line sellers, as well as the complex unpredictability of the network structure, result in frequent uncertainty in on-line transactions. Therefore, through text mining and by integrating a Genetic Algorithm (GA) with a Support Vector Machine (SVM), this project classifies on-line group-buying community complaints based on posts left on Facebook and the three major group-buying websites of Taiwan. Terms are selected based on term frequency, document frequency, uniformity, and conformity, while document classification effectiveness is measured using precision, recall, and the F-measure. Community complaints are classified into the uncertainty-related performance indicators that influence on-line group buying and aggregated, so that specific performance indicators of community group-buying websites can be generated. Afterwards, based on the on-line group-buying community performance indicator sequence, integrated according to the dynamic Multicriteria Optimization and Compromise Solution (VIKOR) method and prosperity countermeasure signals, grey relational ranking is applied to analyze the dynamic performance indicator sequences of different communities, in order to determine the life context of different populations as a reference for on-line group-buying providers.
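
    The abstract names VIKOR as the compromise-ranking step. Below is a minimal sketch of standard VIKOR on a hypothetical indicator matrix; the paper's dynamic variant, prosperity countermeasure signals, and grey relational sorting are not reproduced, and the weights and scores are invented for illustration.

```python
import numpy as np

def vikor(matrix, weights, v=0.5):
    """Minimal VIKOR ranking sketch. `matrix` is alternatives x criteria
    with benefit criteria (higher is better); `weights` sum to 1; `v`
    balances group utility against individual regret."""
    f_best = matrix.max(axis=0)   # ideal value per criterion
    f_worst = matrix.min(axis=0)  # anti-ideal value per criterion
    norm = (f_best - matrix) / (f_best - f_worst)
    s = (weights * norm).sum(axis=1)   # group utility S_i
    r = (weights * norm).max(axis=1)   # individual regret R_i
    q = (v * (s - s.min()) / (s.max() - s.min())
         + (1 - v) * (r - r.min()) / (r.max() - r.min()))
    return q  # lower Q = better compromise ranking

# Three hypothetical communities scored on three performance indicators.
scores = np.array([[0.8, 0.6, 0.9],
                   [0.5, 0.9, 0.7],
                   [0.7, 0.7, 0.6]])
print(vikor(scores, np.array([0.4, 0.3, 0.3])))
```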

    Exposing knowledge: providing a real-time view of the domain under study for students

    With the amount of information that exists online, it is difficult for a student to find relevant information or stay focused on the domain under study. Research has shown that search engines have deficiencies that may prevent students from finding relevant information. To this end, this research proposes a technical solution that takes a student's personal search history into consideration and provides a holistic view of the domain under study. Based on algorithmic approaches to assessing semantic similarity, the proposed framework uses a user interface to dynamically assist students through aggregated results and word-cloud visualizations. Finally, the effectiveness of our approach is evaluated on commonly used datasets and compared against existing research.
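
    The abstract does not name its similarity measure; a common baseline for this kind of history-aware filtering is cosine similarity over TF-IDF vectors, sketched below with a hypothetical search history and candidate results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical search history defining the domain under study.
history = [
    "introduction to supervised machine learning",
    "gradient descent for neural networks",
]
candidates = [
    "how to bake sourdough bread",
    "backpropagation and gradient descent explained",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(history + candidates)
# Similarity of each candidate result to the student's search history.
sims = cosine_similarity(matrix[len(history):], matrix[:len(history)])
print(sims.max(axis=1))  # keep results close to the domain under study
```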

    Mining a Small Medical Data Set by Integrating the Decision Tree and t-test

    Although several researchers have used statistical methods to show that aspiration followed by the injection of 95% ethanol left in situ (retention) is an effective treatment for ovarian endometriomas, very few discuss the different conditions that could produce different recovery rates for patients. Therefore, this study combines statistical methods with decision tree techniques to analyze the postoperative status of ovarian endometriosis patients under different conditions. Since our collected data set is small, containing only 212 records, we use all of the data as training data. Therefore, instead of using the resulting tree to generate rules directly, we first use the value of each node as a cut point to generate all possible rules from the tree. Then, using the t-test, we verify the rules to discover useful descriptive rules. Experimental results show that our approach can find new and interesting knowledge about recurrent ovarian endometriomas under different conditions.
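
    The rule-verification step can be illustrated with a small sketch: a decision-tree node value acts as a cut point, and an independent-samples t-test checks whether the two patient groups it induces differ significantly in outcome. The data below are synthetic stand-ins, not the study's 212 records, and the cut point is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for patient records (age vs. recovery score).
age = rng.uniform(20, 45, 212)
recovery = np.where(age < 32,
                    rng.normal(0.8, 0.1, 212),   # younger group, hypothetical
                    rng.normal(0.6, 0.1, 212))   # older group, hypothetical

# A tree-node value (here, age = 32) serves as the cut point of a candidate
# rule; the t-test checks whether the two induced groups really differ.
cut = 32
left, right = recovery[age < cut], recovery[age >= cut]
t_stat, p_value = ttest_ind(left, right)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print(f"rule 'age < {cut}' separates recovery outcomes significantly")
```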

    IMPROVE - Innovative Modelling Approaches for Production Systems to Raise Validatable Efficiency

    This open access work presents selected results from the European research and innovation project IMPROVE, which yielded novel data-based solutions to enhance machine reliability and efficiency in the fields of simulation and optimization, condition monitoring, alarm management, and quality prediction.