
3 research outputs found

Algorytmy uczenia się relacji podobieństwa z wielowymiarowych zbiorów danych (Algorithms for Similarity Relation Learning from High Dimensional Data)

Author: Janusz Andrzej

The notion of similarity plays an important role in machine learning and artificial intelligence. It is widely used in tasks related to supervised classification, clustering, outlier detection, and planning. Moreover, in domains such as information retrieval and case-based reasoning, the concept of similarity is essential because it is used at every phase of the reasoning cycle. Similarity itself, however, is a very complex concept that eludes formal definition. The similarity of two objects can differ depending on the context considered. In many practical situations it is difficult even to evaluate the quality of similarity assessments without considering the task for which they were performed. For this reason, similarity should be learnt from data, specifically for the task at hand.

In this dissertation a similarity model, called Rule-Based Similarity, is described and an algorithm for constructing this model from available data is proposed. The model uses notions from rough set theory to derive a similarity function that approximates the similarity relation in a given context. The construction of the model starts with the extraction of sets of higher-level features, which can be interpreted as important aspects of the similarity. Once such features are defined, the idea of Tversky’s feature contrast model can be used to design an accurate and psychologically plausible similarity function for a given problem.

Additionally, the dissertation presents two extensions of Rule-Based Similarity designed to deal efficiently with high-dimensional data. They incorporate a broader array of similarity aspects into the model. In the first one, this is done by constructing many heterogeneous sets of features from multiple decision reducts. To ensure their diversity, a randomized reduct computation heuristic is proposed, which combines a greedy heuristic with random elements. This approach is particularly well suited to the few-objects-many-attributes problem, e.g. the analysis of DNA microarray data. A similar idea can be used in the text mining domain, which is the purpose of the second extension. It combines a semantic indexing method with an information bireducts computation technique to represent texts by sets of meaningful concepts. The similarity function of the proposed model can be used to classify previously unseen objects accurately in a case-based fashion or to facilitate clustering of textual documents into semantically homogeneous groups. Experiments, whose results are also presented in the dissertation, show that the proposed models can successfully compete with state-of-the-art algorithms.
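As a rough illustration of Tversky's feature contrast model mentioned in the abstract above, the sketch below scores two objects by rewarding shared features and penalizing distinctive ones. It is not the dissertation's Rule-Based Similarity: the feature sets here are hand-written labels rather than features extracted with rough set methods, salience is assumed to be plain set cardinality, and the function name and the weights `theta`, `alpha`, and `beta` are hypothetical.

```python
def tversky_contrast(a, b, theta=1.0, alpha=0.5, beta=0.5):
    """Contrast-model score: theta*|A & B| - alpha*|A - B| - beta*|B - A|.

    Assumption: the salience of a feature set is its cardinality.
    """
    a, b = set(a), set(b)
    common = len(a & b)   # features shared by both objects
    only_a = len(a - b)   # features distinctive to the first object
    only_b = len(b - a)   # features distinctive to the second object
    return theta * common - alpha * only_a - beta * only_b


# Hypothetical high-level features of two objects.
doc_a = {"mentions_genes", "clinical_study", "has_tables"}
doc_b = {"mentions_genes"}

symmetric = tversky_contrast(doc_a, doc_b)                    # 1 - 0.5*2 - 0 = 0.0
a_to_b = tversky_contrast(doc_a, doc_b, alpha=1.0, beta=0.0)  # 1 - 2 = -1.0
b_to_a = tversky_contrast(doc_b, doc_a, alpha=1.0, beta=0.0)  # 1 - 0 = 1.0
```

With unequal `alpha` and `beta` the score is asymmetric, which is the property usually cited as making the contrast model psychologically plausible.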

Repozytorium UW

Complexity Analysis of Electroencephalogram Dynamics in Patients with Parkinson’s Disease

Publication venue: 'Hindawi Limited'
Publication date: 01/01/2017

Crossref

Efficient Optimization of F

Author: Fan Cheng, Jian Gao, Shuangqiu Zheng, Yuan Zhou
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2016

F-measure is one of the most commonly used performance metrics in classification, particularly when the classes are highly imbalanced. Direct optimization of this measure is often challenging, since no closed-form solution exists. Current algorithms design classifiers using approximations to the F-measure. These algorithms are not efficient and do not scale well to large datasets. To fill this gap, we propose a novel algorithm that efficiently optimizes the F-measure with a cost-sensitive SVM. First, we present an explicit transformation from the optimization of the F-measure to a cost-sensitive SVM. We then adopt a bundle method to solve the inner optimization. Because the existing bundle method may exhibit fluctuations in the primal objective during iterations, we add a line search procedure, which alleviates these fluctuations and makes our algorithm more efficient. Empirical studies on large-scale datasets demonstrate that our algorithm provides significant speedups over current state-of-the-art F-measure based learners while obtaining solutions of better (or comparable) precision.
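To make the metric in the abstract above concrete, the following sketch computes the F-measure from confusion counts and picks the decision threshold that maximizes it on a small imbalanced sample. It is only a naive baseline for illustration, not the paper's cost-sensitive SVM or bundle method; the function names, labels, and scores are hypothetical.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denom if denom else 0.0


def best_threshold(y_true, scores, beta=1.0):
    """Try every observed score as a threshold; return (best F, threshold)."""
    best = (0.0, None)
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        f = f_measure(y_true, y_pred, beta)
        if f > best[0]:
            best = (f, thr)
    return best


# Hypothetical imbalanced sample: 2 positives out of 10 examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.35, 0.8, 0.6, 0.7]

# Predicting all negatives is 80% accurate but has an F-measure of 0.
all_negative_f = f_measure(y_true, [0] * len(y_true))
best_f, best_thr = best_threshold(y_true, scores)
```

The all-negative predictor shows why accuracy can mislead on imbalanced classes. The F-measure depends on the counts over the whole sample rather than on a sum of per-example losses, which is one reason it is hard to optimize directly.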

Crossref

Directory of Open Access Journals