
    A Scalable and Effective Rough Set Theory based Approach for Big Data Pre-processing

    A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on a large amount of data and a high-dimensional attribute set. A variety of techniques have been proposed in the literature to deal with this challenge, with different degrees of success, as most of these techniques need further information about the given input data for thresholding, need noise levels to be specified, or rely on feature ranking procedures. To overcome these limitations, rough set theory (RST) can be used to discover the dependency within the data and reduce the number of attributes enclosed in an input data set, using the data alone and requiring no supplementary information. However, when it comes to massive data sets, RST reaches its limits as it is highly computationally expensive. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes have been considered, revealing that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, thus making it relevant to big data applications.
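
    The key RST notion behind this approach is the dependency degree: the fraction of objects whose equivalence class under a candidate attribute subset is consistent with the decision attribute. A minimal, single-machine sketch of greedy, QuickReduct-style feature selection driven by this measure is given below; the function names and toy data are illustrative assumptions, not the paper's Spark implementation, which distributes this kind of computation so it scales to thousands of attributes.

```python
# Minimal sketch of rough-set, dependency-driven feature selection
# (QuickReduct-style greedy search). Illustrative only; not the paper's
# distributed Spark implementation.
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Gamma(attrs): fraction of rows whose equivalence class is label-pure."""
    if not attrs:
        return 0.0
    pure = sum(len(block) for block in partition(rows, attrs)
               if len({labels[i] for i in block}) == 1)
    return pure / len(rows)

def quick_reduct(rows, labels, all_attrs):
    """Greedily add the attribute that most increases the dependency degree."""
    reduct, best = [], 0.0
    target = dependency(rows, labels, all_attrs)
    while best < target:
        remaining = [a for a in all_attrs if a not in reduct]
        if not remaining:
            break
        gain, attr = max((dependency(rows, labels, reduct + [a]), a)
                         for a in remaining)
        if gain <= best:  # no single attribute improves the dependency further
            break
        best, reduct = gain, reduct + [attr]
    return reduct

# Toy usage: rows hold attribute values, labels the decision column.
rows = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
labels = [1, 0, 0, 1]
print(quick_reduct(rows, labels, all_attrs=[0, 1, 2]))  # e.g. [2]
```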

    A Detailed Study of the Distributed Rough Set Based Locality Sensitive Hashing Feature Selection Technique

    In the context of big data, granular computing has recently been implemented with several mathematical tools, especially Rough Set Theory (RST). As a key topic of rough set theory, feature selection has been investigated to adapt the related granular concepts of RST to deal with large amounts of data, leading to the development of a distributed RST version. However, despite its scalability, the distributed RST version faces a key challenge tied to the partitioning of the feature search space in the distributed environment while guaranteeing data dependency. Therefore, in this manuscript, we propose a new distributed RST version based on Locality Sensitive Hashing (LSH), named LSH-dRST, for big data feature selection. LSH-dRST uses LSH to match similar features into the same bucket and maps the generated buckets into partitions to enable the splitting of the universe in a more efficient way. More precisely, in this paper, we perform a detailed analysis of the performance of LSH-dRST by comparing it to the standard distributed RST version, which is based on a random partitioning of the universe. We demonstrate that our LSH-dRST is scalable when dealing with large amounts of data. We also demonstrate that LSH-dRST ensures the partitioning of the high-dimensional feature search space in a more reliable way, hence better preserving data dependency in the distributed environment and ensuring a lower computational cost. This work is part of a project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 702527.
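
    To make the bucketing idea concrete, the sketch below hashes each feature (a column of the data matrix) with random-hyperplane LSH so that similar features tend to share a bucket key, and buckets can then be mapped to partitions. The hashing scheme, parameters and names are assumptions chosen for illustration and are not taken from the LSH-dRST implementation.

```python
# Hedged sketch of LSH-style feature bucketing: similar columns get the same
# sign pattern against random hyperplanes, hence the same bucket key.
import numpy as np

def lsh_bucket_features(X, n_hyperplanes=8, seed=0):
    """X: (n_samples, n_features). Returns {bucket_key: [feature indices]}."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_hyperplanes, X.shape[0]))  # one hyperplane per hash bit
    buckets = {}
    for j in range(X.shape[1]):
        key = tuple((planes @ X[:, j] > 0).astype(int))  # sign pattern = bucket key
        buckets.setdefault(key, []).append(j)
    return buckets

# Toy usage: two near-identical features typically land in the same bucket.
rng = np.random.default_rng(1)
base = rng.standard_normal(6)
X = np.column_stack([base,
                     base + 0.01 * rng.standard_normal(6),  # near-copy of the first feature
                     rng.standard_normal((6, 3))])          # three unrelated features
for key, feats in lsh_bucket_features(X).items():
    print(key, feats)
```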

    Algorithms for Learning Similarity Relations from High-Dimensional Data Sets

    The notion of similarity plays an important role in machine learning and artificial intelligence. It is widely used in tasks related to supervised classification, clustering, outlier detection, and planning. Moreover, in domains such as information retrieval or case-based reasoning, the concept of similarity is essential, as it is used at every phase of the reasoning cycle. Similarity itself, however, is a very complex concept that eludes formal definition. The similarity of two objects can differ depending on the context under consideration. In many practical situations it is difficult even to evaluate the quality of similarity assessments without considering the task for which they were performed. For this reason, similarity should be learnt from data, specifically for the task at hand. In this dissertation a similarity model, called Rule-Based Similarity, is described, and an algorithm for constructing this model from available data is proposed. The model utilizes notions from rough set theory to derive a similarity function that allows the similarity relation to be approximated in a given context. The construction of the model starts with the extraction of sets of higher-level features, which can be interpreted as important aspects of the similarity. Having defined such features, it is possible to utilize the idea of Tversky's feature contrast model in order to design an accurate and psychologically plausible similarity function for a given problem. Additionally, the dissertation presents two extensions of Rule-Based Similarity designed to deal efficiently with high-dimensional data by incorporating a broader array of similarity aspects into the model. The first extension does so by constructing many heterogeneous sets of features from multiple decision reducts; to ensure their diversity, a randomized reduct computation heuristic is proposed. This approach is particularly well suited to the few-objects-many-attributes problem, e.g. the analysis of DNA microarray data. A similar idea can be utilized in the text mining domain, and the second of the proposed extensions serves this particular purpose. It uses a combination of a semantic indexing method and an information bireduct computation technique to represent texts by sets of meaningful concepts. The similarity function of the proposed model can be used to perform an accurate classification of previously unseen objects in a case-based fashion or to facilitate the clustering of textual documents into semantically homogeneous groups. Experiments, whose results are also presented in the dissertation, show that the proposed models can successfully compete with state-of-the-art algorithms.
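
    Tversky's feature contrast model, which Rule-Based Similarity builds on, scores similarity as a weighted contrast between shared and distinctive features: sim(A, B) = θ·f(A ∩ B) − α·f(A \ B) − β·f(B \ A). A minimal sketch, assuming set-valued higher-level features and salience f measured by set cardinality, is given below; the feature names and weights are made up for the example.

```python
# Minimal sketch of Tversky's feature contrast model over sets of
# higher-level features; weights and feature names are illustrative.
def tversky_similarity(features_a, features_b, theta=1.0, alpha=0.5, beta=0.5):
    """sim(A, B) = theta*|A & B| - alpha*|A - B| - beta*|B - A|."""
    a, b = set(features_a), set(features_b)
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)

# Toy usage with made-up higher-level features (e.g. fired decision rules).
doc1 = {"matches_rule_3", "matches_rule_7", "high_term_density"}
doc2 = {"matches_rule_3", "low_term_density"}
print(tversky_similarity(doc1, doc2))  # 1*1 - 0.5*2 - 0.5*1 = -0.5
```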

    Knowledge Discovery and Monotonicity

    The monotonicity property is ubiquitous in our lives and appears in different roles: as domain knowledge, as a requirement, as a property that reduces the complexity of a problem, and so on. It is present in various domains: economics, mathematics, languages, operations research and many others. Monotonicity in knowledge discovery can be treated as available background information that can facilitate and guide the knowledge extraction process; while in some sub-areas methods have already been developed for taking this additional information into account, in most methodologies it has not been extensively studied, or has not been addressed at all. This thesis is a contribution to changing that. It focuses on the monotonicity property in knowledge discovery, and more specifically in classification, attribute reduction, function decomposition, frequent pattern generation and missing-value handling. Four specific problems are addressed within four different methodologies, namely rough set theory, monotone decision trees, function decomposition and frequent pattern generation. In the first three parts, monotonicity is domain knowledge and a requirement on the outcome of the classification process; the three methodologies are extended to deal with monotone data so that the outcome is guaranteed to satisfy the monotonicity requirement as well. In the last part, monotonicity is a property that helps reduce the computation involved in frequent pattern generation; here the focus is on two of the best algorithms and their comparison, both theoretically and experimentally.

    About the author: Viara Popova was born in Bourgas, Bulgaria, in 1972. She completed her secondary education at the Mathematics High School "Nikola Obreshkov" in Bourgas. In 1996 she finished her higher education at Sofia University, Faculty of Mathematics and Informatics, where she graduated with a major in Informatics and a specialization in Information Technologies in Education. She then joined the Department of Information Technologies, first as an associated member and from 1997 as an assistant professor. In 1999 she became a PhD student at Erasmus University Rotterdam, Faculty of Economics, Department of Computer Science. In 2004 she joined the Artificial Intelligence Group within the Department of Computer Science, Faculty of Sciences, at Vrije Universiteit Amsterdam as a postdoc researcher.
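
    The pruning role of monotonicity in frequent pattern generation can be illustrated with the downward-closure (anti-monotonicity) property of support: every superset of an infrequent itemset is itself infrequent, so such candidates never need to be counted. The Apriori-style sketch below is a generic illustration of this property, not a reconstruction of the two specific algorithms compared in the thesis.

```python
# Generic Apriori-style frequent-itemset miner illustrating how the
# anti-monotonicity of support prunes the candidate space.
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets with count >= min_support."""
    items = sorted({i for t in transactions for i in t})
    frequent, level, k = {}, [frozenset([i]) for i in items], 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Monotonicity-based pruning: join only frequent k-itemsets, and keep a
        # (k+1)-candidate only if every one of its k-subsets is frequent.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k))]
        k += 1
    return frequent

# Toy usage.
transactions = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
print(apriori(transactions, min_support=2))
```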