5 research outputs found

    Detecting tax fraud in tourist apartment rentals using positive-unlabeled classification techniques

    The main objective of this master's thesis is to identify fraudulent tourist accommodations using data extracted from tourist accommodation websites. This is a semi-supervised classification problem or, more specifically, a problem of learning from positive and unlabeled data. In addition to a model capable of detecting tax fraud, a reliable evaluation method suited to this particular type of classification is also needed.

    Techniques for text classification: Literature review and current trends

    Automated classification of text into predefined categories has long been considered a vital method for managing and processing the vast and continuously growing number of documents in digital form. This kind of web information, popularly known as digital or electronic information, takes the form of documents, conference material, publications, journals, editorials, web pages, e-mail, etc. People largely access information from these online sources rather than being limited to paper sources such as books, magazines, and newspapers. The main problem is that this enormous body of information lacks organization, which makes it difficult to manage. Text classification is recognized as one of the key techniques for organizing this kind of digital data. In this paper we study the existing work in the area of text classification, which allows a fair evaluation of the progress made in this field to date. We have investigated the papers to the best of our knowledge and have tried to summarize all existing information in a comprehensive and succinct manner. The studies are summarized in tabular form by publication year, considering numerous key perspectives. The main emphasis is on the steps involved in the text classification process, viz. document representation methods, feature selection methods, data mining methods, and the evaluation technique used by each study to obtain results on a particular dataset.

    Dynamic classifier ensemble for positive unlabeled text stream classification

    Most studies on streaming data classification assume that the data can be fully labeled. However, in real-life applications, it is impractical and time-consuming to manually label the entire stream for training. It is very common that only a small amount of positive data and a large amount of unlabeled data are available in data stream environments. In this case, applying traditional streaming algorithms with straightforward adaptation to a positive unlabeled stream may not work well or may lead to poor performance. In this paper, we propose a Dynamic Classifier Ensemble method for Positive and Unlabeled text stream (DCEPU) classification scenarios. We address the problem of classifying positive and unlabeled text streams with various kinds of concept drift by constructing an appropriate validation set and designing a novel dynamic weighting scheme in the classification phase. Experimental results on the benchmark dataset RCV1-v2 demonstrate that the proposed method DCEPU outperforms the existing LELC (Li et al. 2009b), DVS (with necessary adaptation) (Tsymbal et al. in Inf Fusion 9(1):56-68, 2008), and Stacking style ensemble-based algorithm (Zhang et al. 2008b). © 2011 Springer-Verlag London Limited

    Process-Oriented Stream Classification Pipeline: A Literature Review

    Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis), or video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology, and give a broad overview of the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse, ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient train and test procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.