
    From patterned response dependency to structured covariate dependency: categorical-pattern-matching

    Data generated from a system of interest typically consists of measurements from an ensemble of subjects across multiple response and covariate features, and is naturally represented by one response matrix against one covariate matrix. Each of these two matrices is likely to contain heterogeneous data types simultaneously: continuous, discrete, and categorical. Here a matrix is used as a practical platform that ideally keeps hidden dependency among and between subjects and features intact on its lattice. Response and covariate dependency is computed individually and expressed through multiscale blocks via a newly developed computing paradigm named Data Mechanics. We propose a categorical pattern-matching approach to establish causal linkages in the form of information flows from patterned response dependency to structured covariate dependency. The strength of an information flow is evaluated by applying combinatorial information theory. This unified platform for system knowledge discovery is illustrated through five data sets. In each illustrative case, an information flow is demonstrated as an organization of discovered knowledge loci via emergent, visible, and readable heterogeneity. This unified approach fundamentally resolves many long-standing issues in data analysis, including statistical modeling, multiple responses, renormalization, and feature selection, without involving man-made structures and distribution assumptions. The results reported here reinforce the idea that linking patterns of response dependency to structures of covariate dependency is the true philosophical foundation underlying data-driven computing and learning in the sciences. (Comment: 32 pages, 10 figures, 3 box pictures)
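The abstract scores "information flows" between categorical patterns with information theory. A minimal sketch of one such score, assuming the standard mutual information between two categorical columns (the data and function name are illustrative, not the paper's code):

```python
# Hypothetical sketch: mutual information in bits between two categorical
# feature columns, one plausible way to score the strength of a dependency
# between a response pattern and a covariate structure.
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits for two equal-length categorical sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# A column is perfectly informative about itself (1 bit for two balanced
# categories) and carries no information about a constant column.
xs = ["a", "a", "b", "b"]
assert abs(mutual_information(xs, xs) - 1.0) < 1e-9
assert abs(mutual_information(xs, ["c", "c", "c", "c"])) < 1e-9
```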

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
    (iv) For (5), we propose several column-reduction methods, such as multi-time-scale time-series analysis for fraud forecasting using binary-labeled imbalanced datasets, and hybrid filter-wrapper feature selection approaches.
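As a rough illustration of the row-wise reduction idea in (iii), the sketch below pairs SMOTE-style interpolation with a k-NN consistency filter that drops samples whose neighbourhood mostly disagrees with their label. This is only an assumed reading of KSMOTE, not the dissertation's code; all names, the toy data, and the choice k=3 are illustrative.

```python
# Hypothetical KSMOTE-style sketch: synthesize minority samples by pairwise
# interpolation, then filter out points whose k nearest neighbours mostly
# carry the opposite label (applies to both raw and synthetic rows).
import math
import random

def interpolate(a, b, t):
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

def smote_like(minority, n_new, rng):
    """Generate n_new synthetic minority samples by pairwise interpolation."""
    out = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        out.append(interpolate(a, b, rng.random()))
    return out

def knn_filter(points, labels, k=3):
    """Return indices of points whose k nearest neighbours mostly agree."""
    kept = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: math.dist(p, points[j]))[:k]
        agree = sum(labels[j] == lab for j in nearest)
        if agree * 2 >= k:
            kept.append(i)
    return kept

# A minority sample sitting inside the majority cluster is treated as noise.
pts = [(0, 0), (0.1, 0), (0, 0.1),              # clean minority
       (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1),  # majority
       (5.05, 5.05)]                            # noisy minority sample
labs = [1, 1, 1, 0, 0, 0, 0, 1]
assert knn_filter(pts, labs) == [0, 1, 2, 3, 4, 5, 6]
```

Synthetic samples stay inside the convex hull of the minority pairs they interpolate, so the same filter can be applied to them before training.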

    Data mining using concepts of independence, unimodality and homophily

    With the widespread use of information technologies, more and more complex data is generated and collected every day. Such complex data varies in structure, size, type and format, e.g. time series, texts, images, videos and graphs. Complex data is often high-dimensional and heterogeneous, which makes separating the wheat (knowledge) from the chaff (noise) more difficult. Clustering is a main mode of knowledge discovery from complex data: it groups objects in such a way that intra-group objects are more similar than inter-group objects. Traditional clustering methods such as k-means, Expectation-Maximization clustering (EM), DBSCAN and spectral clustering are either deceived by "the curse of dimensionality" or spoiled by heterogeneous information. So how can complex data be explored effectively? In some cases, people may have only partial information about the complex data. For example, in social networks, not every user provides profile information such as personal interests. Can we leverage the limited user information and the friendship network wisely to infer the likely labels of the unlabeled users, so that advertisers can advertise accurately? This is the problem of learning from labeled and unlabeled data, commonly referred to as semi-supervised classification. To gain insights into these problems, this thesis focuses on developing clustering and semi-supervised classification methods that are driven by the concepts of independence, unimodality and homophily. The proposed methods leverage techniques from diverse areas, such as statistics, information theory, graph theory, signal processing, optimization and machine learning. Specifically, this thesis develops four methods, i.e. FUSE, ISAAC, UNCut, and wvGN. FUSE and ISAAC are clustering techniques to discover statistically independent patterns from high-dimensional numerical data.
    UNCut is a clustering technique to discover unimodal clusters in attributed graphs in which not all the attributes are relevant to the graph structure. wvGN is a semi-supervised classification technique that uses the theory of homophily to infer the labels of the unlabeled vertices in graphs. We have verified our clustering and semi-supervised classification methods on various synthetic and real-world data sets. The results are superior to those of the state-of-the-art.
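The homophily idea behind the semi-supervised method can be illustrated with a basic label-propagation loop: unlabeled vertices repeatedly adopt the majority label among their neighbours. This is a generic sketch, not wvGN itself; the graph, labels and iteration count are made-up toy inputs.

```python
# Illustrative homophily-driven label propagation (generic, not wvGN):
# each unlabeled vertex takes the majority label of its neighbours,
# iterated synchronously until the budget of rounds is spent.
from collections import Counter

def propagate(adj, seeds, rounds=10):
    """adj: node -> list of neighbours; seeds: node -> label for labeled nodes."""
    labels = dict(seeds)
    for _ in range(rounds):
        updated = dict(labels)
        for node, nbrs in adj.items():
            if node in seeds:            # ground-truth labels stay fixed
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updated[node] = votes.most_common(1)[0][0]
        labels = updated
    return labels

# Two triangles joined by one bridge edge; one seed per community.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
result = propagate(adj, {0: "A", 5: "B"})
assert all(result[n] == "A" for n in (0, 1, 2))
assert all(result[n] == "B" for n in (3, 4, 5))
```

Homophily is exactly the assumption that makes this converge to the community structure: connected vertices tend to share a label.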

    Sentiment Analysis Using Machine Learning Techniques

    Before buying a product, people usually visit various shops in the market, ask about the product, its cost, and its warranty, and then finally buy it based on the opinions they have gathered about cost and quality of service. This process is time-consuming, and the chances of being cheated by the seller are higher because there is nobody to guide the buyer to an authentic product at a proper price. Nowadays, however, many people rely on the online market to buy the products they need, because information about the products is available from multiple sources; it is comparatively cheap and also offers home delivery. Before placing an order for any product, customers very often refer to the comments or reviews of the product's current users, which help them judge the quality of the product as well as the service provided by the seller. Similarly, there are quite a few specialists in the field of movies who watch a movie and then give a comment about its quality, i.e., whether to watch the movie or not, or a five-star rating. These reviews are mainly in text format and sometimes hard to understand, so they need to be processed appropriately to obtain meaningful information. Classifying these reviews is one approach to extracting knowledge from them. In this thesis, different machine learning techniques are used to classify the reviews, and simulations and experiments are carried out to evaluate the performance of the proposed classification methods. Researchers have often considered two review datasets for sentiment classification, namely the aclIMDb and Polarity datasets. The IMDb dataset is divided into training and testing data:
    the training data are used to train the machine learning algorithms, and the testing data are used to evaluate them based on the training information. The Polarity dataset, on the other hand, does not have separate training and testing data, so the k-fold cross-validation technique is used to classify the reviews. Four machine learning techniques (MLTs), viz. Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and Linear Discriminant Analysis (LDA), are used for the classification of these movie reviews, and different performance evaluation parameters are used to evaluate them. Among these four algorithms, the RF technique yields the most accurate classification results. Secondly, n-gram based classification of reviews is carried out on the aclIMDb dataset.
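The two preprocessing steps the abstract leans on, n-gram features and k-fold splitting for the Polarity dataset, can be sketched as below. This is a generic illustration under standard definitions, not the thesis's code; function names and the toy inputs are assumptions.

```python
# Hypothetical sketch of two standard preprocessing steps: word n-gram
# extraction for feature building, and k-fold index partitioning for a
# dataset that ships without a train/test split.
def ngrams(tokens, n):
    """All contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def k_fold_indices(n_samples, k):
    """Partition range(n_samples) into k contiguous, near-equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

toks = "the movie was not bad".split()
assert ngrams(toks, 2) == [("the", "movie"), ("movie", "was"),
                           ("was", "not"), ("not", "bad")]
# Each fold in turn serves as the test set; the rest is training data.
assert [len(f) for f in k_fold_indices(10, 3)] == [4, 3, 3]
```

Bigrams like ("not", "bad") show why n-grams help sentiment classification: the pair carries a polarity the unigrams "not" and "bad" individually invert.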

    Hybrid Multi Attribute Relation Method for Document Clustering for Information Mining

    Text clustering has been widely utilized with the aim of partitioning a specific document collection into different subsets using homogeneity/heterogeneity criteria. It has also become a very complicated area of research, spanning pattern recognition, information retrieval, and text mining. In enterprise applications, information mining faces challenges due to the complex distribution of data across an enormous number of different sources. Most of these information sources come from different domains, which makes it difficult to identify the relationships among the information. In this case, a single clustering method limits the related information while increasing computational overheads and processing times. Hence, identifying suitable clustering models for unsupervised learning is a challenge, specifically in the case of multiple attributes in data distributions. In recent work, attribute-relation-based solutions have been given significant importance for document clustering. To go further, in this paper Hybrid Multi Attribute Relation Methods (HMARs) are presented for attribute selection and relation analysis in the co-clustering of datasets. The proposed HMARs allow the analysis of distributed attributes in documents in the form of probabilistic attribute relations using modified Bayesian mechanisms, and provide solutions for accurately identifying the most related attribute model for clustering multiple-attribute documents. An experimental evaluation of clustering purity and normalization, carried out on the UCI Data repository, shows 25% better results when compared with previous techniques.
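The evaluation reports clustering purity; under its standard definition (the fraction of documents that fall in the majority true class of their cluster), it can be computed as follows. This is a generic sketch of the metric, not the paper's evaluation code, and the toy labels are made up.

```python
# Minimal sketch of the standard cluster-purity measure: each cluster is
# credited with its majority true class, and purity is the credited
# fraction of all documents.
from collections import Counter, defaultdict

def purity(cluster_ids, true_labels):
    by_cluster = defaultdict(list)
    for c, t in zip(cluster_ids, true_labels):
        by_cluster[c].append(t)
    majority = sum(Counter(ts).most_common(1)[0][1]
                   for ts in by_cluster.values())
    return majority / len(true_labels)

# 5 of 6 documents sit with their cluster's majority class -> purity 5/6.
assert abs(purity([0, 0, 0, 1, 1, 1],
                  ["x", "x", "y", "y", "y", "y"]) - 5 / 6) < 1e-9
```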

    Characterization of a preclinical tumor xenograft model using multiparametric MRI

    Introduction: In small animal studies, multiple imaging modalities can be combined to complement each other in providing information on anatomical structure and function. Non-invasive imaging studies on animal models are used to monitor progressive tumor development, which helps to better understand the efficacy of new medicines and to predict the clinical outcome. The aim was to construct a framework based on a longitudinal, multi-modal, parametric in vivo imaging approach to perform tumor tissue characterization in mice. Materials and Methods: The multi-parametric in vivo MRI dataset consisted of T1-, T2-, diffusion- and perfusion-weighted images. An image set of mice (n=3) imaged weekly for 6 weeks was used. Multimodal image registration was performed by maximizing mutual information. The tumor region of interest was delineated in weeks 2 to 6. These regions were stacked together, and all modalities combined were used in unsupervised segmentation. Clustering methods such as K-means and fuzzy C-means, together with the blind source separation technique of non-negative matrix factorization (NMF), were tested. Results were visually compared with histopathological findings. Results: Clusters obtained with the K-means and fuzzy C-means algorithms coincided with the T2 and ADC maps per observed intensity level. Fuzzy C-means clusters and NMF abundance maps gave the most promising results compared to the histological findings and appear to be a complementary way to assess the tumor microenvironment. Conclusions: A workflow for multimodal MR parametric map generation, image registration and unsupervised tumor segmentation was constructed. Good segmentation results were achieved, but they require further extensive histological validation.
    Introduction: Animal experiments within preclinical studies form one of the important pillars of scientific research in medical diagnostics. In these studies, experiments are performed to discover and test new therapeutic methods for treating human diseases. Ovarian cancer is one of the leading causes of cancer-related death, and new, more effective methods are needed to combat this disease more successfully. The time window in which a new therapeutic is applied is a key factor in the success of the investigated therapy, since tumor physiology evolves as the disease progresses. One goal of preclinical studies is therefore to monitor the development of the tumor microenvironment and thus determine the optimal time window for administering the developed therapeutic so that it achieves maximal efficacy. Imaging modalities have become extremely popular research tools in biomedical and pharmacological research because of their non-invasive nature. Preclinical imaging has several advantages over the traditional approach: in line with research regulations, animals need not be sacrificed at intermediate time points to follow tumor development over a longer period, and its non-destructive, non-invasive nature provides molecular and functional descriptions of the studied subject in addition to anatomical information. To achieve this, different imaging modalities are typically used, and several are often combined, since they complement one another in providing the desired information. This thesis presents a framework for processing different magnetic resonance modalities of preclinical models with the aim of characterizing tumor tissue.
    Methodology: In the study by Belderbos, Govaerts, Croitor Sava et al. [1], magnetic resonance imaging was used to determine the optimal time window for the successful application of a newly developed therapeutic. In addition to conventional MR imaging methods (T1- and T2-weighted imaging), perfusion- and diffusion-weighted techniques were also used. Images were acquired weekly over a period of six weeks, and the datasets used in the present work were obtained as part of that study. The processing framework is implemented in Matlab (MathWorks, version R2019b) and supports both automatic and manual processing of the image data. In the first step, before the parametric maps of the used modalities are generated, the protocol parameters must be extracted from the accompanying text files and the acquired images correctly sorted by anatomy. At this stage the images are also filtered and masked. Filtering improves the ratio between the useful signal (the imaged animal model) and the background, since the scanner is subject to various sources of image noise; the non-local means filter from the Matlab image processing library was used. Masking pays off in the next step, the generation of the parametric maps, since with a properly masked subject the procedure is substantially accelerated by mapping only the region of interest. The parametric maps are produced with the nonlinear least squares method: by modeling the physical phenomena of the used modalities, the investigated animal model is described with biological parameters, which complement one another in describing the physiological properties of the studied model at the level of individual image elements. A key ingredient in successfully combining the information of the individual modalities is proper registration of the parametric maps. The individual modalities are acquired sequentially, at different times, and scanning all modalities of a single animal takes more than an hour in total; despite the use of anesthetics, small movements of the animal occur during acquisition. If these movements are not properly accounted for, the joint information of multiple modalities is misinterpreted. Animal movements within a modality were modeled as rigid transformations, and those between modalities as affine ones.
    Image registration is performed with custom Matlab functions or with functions from the open-source image processing framework Elastix. Unsupervised segmentation methods were used to characterize the tumor tissue. The essence of segmentation is grouping individual image elements into segments whose members are sufficiently similar to one another by a chosen criterion while differing from the elements of other segments. Three methods were chosen: K-means, as one of the simplest; fuzzy C-means, with the advantage of soft assignment; and, finally, non-negative matrix factorization, which views the tissue decomposition as a product of typical multi-modal signatures and their abundances for each individual image element. To validate the segmentations obtained with these methods, a visual comparison with the results of histopathological analysis was performed. Results: Registration within the individual modalities had a large influence on the generated parametric maps. Because the acquisition of T1-weighted images takes a long time, animal movements frequently occur and, without proper registration, negatively affect the modality maps and the subsequent image segmentation. The generated maps deviate only slightly from those produced with commonly used open-source programs. The clusters obtained with the K-means and fuzzy C-means methods coincide well with intensity-based partitions of the T2 and ADC maps. Compared with the histological findings, the fuzzy C-means method and non-negative matrix factorization give the most promising results, and their segmentations complement each other in explaining the tumor microenvironment. Conclusion: With the construction of a framework for MR image processing and tumor tissue segmentation, the goal of this master's thesis was achieved. The framework's design allows other modalities and other animal models to be added at will.
    The tumor tissue segmentation results are promising, but further comparison with the results of histopathological analysis is needed. A possible extension is to improve the robustness of the image registration using a non-rigid (elastic) transformation model. It would also be worthwhile to test additional unsupervised segmentation methods and compare their results with those presented here.
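The unsupervised segmentation step can be illustrated with Lloyd's K-means applied to per-voxel feature vectors, e.g. stacked parametric-map values. This is a generic sketch, not the thesis's Matlab pipeline; the toy voxels, the initial centres, and the two-cluster setup are made up.

```python
# Illustrative Lloyd's K-means on per-voxel feature vectors (e.g. one value
# per parametric map): assign each voxel to its nearest centre, then
# recompute each centre as the mean of its assigned voxels.
import math

def kmeans(points, centers, rounds=20):
    """Return (final centers, cluster index per point)."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [[sum(dim) / len(g) for dim in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    assign = [min(range(len(centers)),
                  key=lambda j: math.dist(p, centers[j])) for p in points]
    return centers, assign

# Two well-separated "tissue types" in a 2-feature space.
voxels = [(0, 0), (0.2, 0.1), (0.1, 0.2),
          (3, 3), (3.1, 2.9), (2.9, 3.1)]
centers, assign = kmeans(voxels, [(0, 0), (3, 3)])
assert assign == [0, 0, 0, 1, 1, 1]
```

Fuzzy C-means differs only in replacing the hard assignment with per-voxel membership weights, which is what makes its soft tumor boundaries complementary to NMF abundance maps.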