    A Scalable and Effective Rough Set Theory based Approach for Big Data Pre-processing

    A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on a large amount of data with a high-dimensional attribute set. A variety of techniques have been proposed in the literature to deal with this challenge, with different degrees of success, as most of these techniques need further information about the given input data for thresholding, need noise levels to be specified, or use some feature ranking procedure. To overcome these limitations, rough set theory (RST) can be used to discover the dependency within the data and reduce the number of attributes in an input data set while using the data alone and requiring no supplementary information. However, when it comes to massive data sets, RST reaches its limits as it is highly computationally expensive. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes have been considered, revealing that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, thus making it relevant to big data.
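
The dependency that RST discovers "using the data alone" is commonly measured by the dependency degree, the fraction of objects whose equivalence class under the chosen condition attributes maps to a single decision value. The following is a minimal single-machine sketch of that textbook definition, not the paper's Spark-based method; the function names and the toy table are illustrative assumptions.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(index)
    return list(blocks.values())


def dependency_degree(rows, cond_attrs, decision_attr):
    """Return |POS| / |U|: the share of rows lying in decision-consistent classes."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, cond_attrs):
        decisions = {rows[i][decision_attr] for i in block}
        if len(decisions) == 1:  # the whole class agrees on the decision
            positive += len(block)
    return positive / len(rows)


# Hypothetical toy decision table: condition attributes "a", "b"; decision "d".
rows = [
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 1, "d": "no"},
]

gamma_ab = dependency_degree(rows, ["a", "b"], "d")
gamma_b = dependency_degree(rows, ["b"], "d")
gamma_a = dependency_degree(rows, ["a"], "d")
```

In this toy table, dropping either attribute lowers the dependency degree, so neither attribute could be removed without losing information about the decision. Evaluating this measure over many attribute subsets is what becomes expensive on large data sets.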

    Reduct-based ranking of attributes

    The paper is dedicated to the area of feature selection, in particular to the notion of attribute rankings, which allow the importance of variables to be estimated. In the presented research, a new weighting factor based on relative reducts was defined for ranking construction. A reduct constitutes an embedded mechanism of feature selection, specific to rough set theory. The proposed factor takes into account the number of reducts in which a given attribute occurs, as well as the lengths of those reducts. Two approaches to reduct generation were employed and compared, with the search executed by a genetic algorithm. To validate the usefulness of the reduct-based rankings in the process of feature reduction, sets of decision rules were induced in the classical rough set approach for gradually decreasing subsets of attributes selected through the rankings. The performance of all rule classifiers was evaluated, and experimental results showed that, for reduced sets of features, the proposed rankings led to classification accuracy at least equal to, and in some cases higher than, that obtained with the entire set of condition attributes. The experiments were performed on datasets from the stylometry domain, treating authorship attribution as a classification task and stylometric descriptors as characteristic features defining writing styles.
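
The abstract does not state the weighting factor itself, only that it depends on how many reducts contain an attribute and on the lengths of those reducts. As a rough illustration of that idea, the sketch below assumes one plausible form: each reduct contributes 1/|R| to every attribute it contains, so membership in many short reducts yields a high weight. This formula, the function names and the sample reducts are assumptions, not the paper's definition.

```python
from collections import defaultdict


def reduct_weights(reducts):
    """Assumed weighting: sum of 1/len(R) over all reducts R containing the attribute."""
    weights = defaultdict(float)
    for reduct in reducts:
        if not reduct:  # ignore empty sets, which carry no attributes
            continue
        for attr in reduct:
            weights[attr] += 1.0 / len(reduct)
    return dict(weights)


def rank_attributes(reducts):
    """Order attributes by decreasing weight, breaking ties alphabetically."""
    weights = reduct_weights(reducts)
    return sorted(weights, key=lambda attr: (-weights[attr], attr))


# Hypothetical reducts, as a genetic search over attribute subsets might return.
reducts = [{"a", "b"}, {"a", "c", "d"}, {"b"}]

weights = reduct_weights(reducts)
ranking = rank_attributes(reducts)
```

A ranking of this kind can then drive feature reduction by discarding the lowest-ranked attributes step by step and re-inducing the decision rules at each step.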

    Analysis of the potentials of multi criteria decision analysis methods to conduct sustainability assessment

    Sustainability assessments require the management of a wide variety of information types, parameters and uncertainties. Multi criteria decision analysis (MCDA) has been regarded as a suitable set of methods to perform sustainability evaluations as a result of its flexibility and the possibility of facilitating the dialogue between stakeholders, analysts and scientists. However, it has been reported that researchers do not usually properly define the reasons for choosing a certain MCDA method instead of another. Familiarity and affinity with a certain approach seem to be the drivers for the choice of a certain procedure. This review paper presents the performance of five MCDA methods (i.e. MAUT, AHP, PROMETHEE, ELECTRE and DRSA) with respect to ten crucial criteria that sustainability assessment tools should satisfy, among which are a life cycle perspective, thresholds and uncertainty management, software support and ease of use. The review shows that MAUT and AHP are fairly simple to understand and have good software support, but they are cognitively demanding for the decision makers, and can only embrace a weak sustainability perspective as trade-offs are the norm. Mixed information and uncertainty can be managed by all the methods, while robust results can only be obtained with MAUT. ELECTRE, PROMETHEE and DRSA are non-compensatory approaches which allow the use of a strong sustainability concept and accept a variety of thresholds, but suffer from rank reversal. DRSA is less demanding in terms of preference elicitation, is very easy to understand and provides a straightforward set of decision rules expressed in the form of elementary “if … then …” conditions. Dedicated software is available for all the approaches, with a medium to wide range of capabilities for representing results. DRSA emerges as the easiest method, followed by AHP, PROMETHEE and MAUT, while ELECTRE is regarded as fairly difficult.
Overall, the analysis has shown that most of the requirements are satisfied by the MCDA methods, although to different extents, whereas management of mixed data types and adoption of a life cycle perspective are covered by all the considered approaches.

    Partitioning Clustering Based on Support Vector Ranking

    Postprin