35 research outputs found

    Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering

    In this paper we compare the usefulness of statistical dimensionality-reduction techniques for improving the clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to the Vector Space Model. Then we investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The obtained results show an advantage of the Latent Semantic Analysis technique over the probabilistic model. We also analyse the time and memory consumption of these transformations and present runtime details for an IBM BladeCenter HS21 machine.
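
    The abstract stops at the method names; as a rough, hedged illustration of the kind of pipeline it describes, the sketch below builds a TF-IDF Vector Space Model, applies truncated SVD (the usual computational form of Latent Semantic Analysis) and clusters the reduced vectors with k-means. The toy documents, the component count and the use of scikit-learn are illustrative assumptions, not details taken from the paper.

    # Minimal, illustrative LSA-then-cluster pipeline (not the authors' code).
    # The documents, n_components and n_clusters are placeholder values.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    documents = [
        "dimensionality reduction improves document clustering",
        "latent semantic analysis maps documents to a low-dimensional space",
        "probabilistic latent semantic analysis is a generative alternative",
        "vector space model represents documents as term frequency vectors",
    ]

    # Vector Space Model: TF-IDF weighted term-document matrix.
    tfidf = TfidfVectorizer().fit_transform(documents)

    # Latent Semantic Analysis: truncated SVD of the term-document matrix.
    reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # Partitional clustering (k-means) of the reduced representations.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
    print(labels)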

    Design of Module for Plagiarism Detection

    This thesis is concerned with plagiarism as a concept: it discusses the theoretical questions surrounding plagiarism and its types, and then describes the methods used to search for plagiarized theses, documents, articles and similar works. Current approaches to processing large data collections for text comparison are described in detail; these are methods that suitably split individual texts into smaller parts while removing misleading information. The individual models and methods for searching are discussed, and the thesis also surveys existing ready-made solutions for plagiarism detection. The aim of this work was to design and renew the interface of the Amphora search engine and to implement a web application for visualizing the search results that properly represents the similarities found in the individual texts. Finally, experiments were carried out that illustrate and evaluate the results achieved.
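
    The abstract stays at a high level; as a hedged illustration of the kind of comparison it describes (splitting texts into small parts and measuring their overlap), the sketch below shingles two texts into character n-grams and computes their Jaccard similarity. The n-gram size, the sample strings and the helper names are assumptions introduced for the example and are not taken from the Amphora system.

    # Illustrative sketch only: character n-gram shingling plus Jaccard
    # similarity, a common building block of plagiarism detection. Not the
    # thesis' code; the shingle size and sample texts are arbitrary.
    def shingles(text: str, n: int = 5) -> set[str]:
        """Split a text into overlapping character n-grams (shingles)."""
        cleaned = " ".join(text.lower().split())  # crude normalisation
        return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}

    def jaccard(a: set[str], b: set[str]) -> float:
        """Jaccard similarity of two shingle sets (0 = disjoint, 1 = identical)."""
        return len(a & b) / len(a | b) if (a or b) else 0.0

    original = "The aim of this work was to design a plagiarism detection module."
    suspect = "The aim of the work is to design a module for plagiarism detection."
    print(f"similarity: {jaccard(shingles(original), shingles(suspect)):.2f}")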

    Investigating the link between southern African droughts and global atmospheric teleconnections using regional climate models

    Drought is one of the natural hazards that threaten the economies of many nations, especially in Southern Africa, where many socio-economic activities depend on rain-fed agriculture. This study evaluates the capability of Regional Climate Models (RCMs) in simulating Southern African droughts. It uses the Standardized Precipitation-Evapotranspiration Index (SPEI, computed from rainfall and temperature data) to identify 3-month droughts over Southern Africa, and compares the observed and simulated drought patterns. The observation data are from the Climate Research Unit (CRU), while the simulation data are from 10 RCMs (ARPEGE, CCLM, HIRHAM, RACMO, REMO, PRECIS, RegCM3, RCA, WRF, and CRCM) that participated in the Coordinated Regional Climate Downscaling Experiment (CORDEX) project. The study also categorizes drought patterns over Southern Africa, examines the persistence and transition of these patterns, and investigates the roles of atmospheric teleconnections in the drought patterns. The results show that drought patterns can occur in any season, but each shows a preference for particular seasons. Some drought patterns may persist for up to three seasons, while others are transient. Only about 20% of the drought patterns are induced solely by the El Niño-Southern Oscillation (ENSO); the other drought patterns are caused by complex interactions among the atmospheric teleconnections. The study also reveals that the Southern African drought regime is generally shifting from wet to dry conditions, and that this shift can only be captured with a drought-monitoring index that accounts for the influence of temperature on drought. Only a few CORDEX RCMs simulate the Southern African droughts as observed; of these, the ARPEGE model gives the best simulation. Its performance may be explained by its grid-stretching capability, which eliminates the boundary-condition problems present in the other RCMs. In ARPEGE simulations, the stretching allows a better interaction between large-scale and small-scale features, and may lead to a better representation of the rain-producing systems in Southern Africa. The results of the study may be applied to improve the monitoring and prediction of regionally extensive drought over Southern Africa, and to reduce the socio-economic impacts of drought in the region.
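
    The abstract names the 3-month SPEI but gives no formula; as a rough, hedged sketch of the underlying idea, the code below accumulates a 3-month climatic water balance (precipitation minus potential evapotranspiration) and standardizes it. A full SPEI computation fits a log-logistic distribution to the accumulated balance rather than taking a plain z-score, and the synthetic input series here is purely illustrative.

    # Simplified, illustrative 3-month drought index in the spirit of the SPEI.
    # The real SPEI fits a log-logistic distribution per calendar month; the
    # z-score below is a crude stand-in, and the inputs are synthetic.
    import numpy as np

    rng = np.random.default_rng(0)
    months = 240                                            # 20 years, synthetic
    precip = rng.gamma(shape=2.0, scale=40.0, size=months)  # rainfall, mm/month
    pet = rng.normal(loc=80.0, scale=10.0, size=months)     # potential ET, mm/month

    balance = precip - pet                                   # climatic water balance
    window = np.convolve(balance, np.ones(3), mode="valid")  # 3-month accumulation

    index = (window - window.mean()) / window.std()          # standardized index
    drought = np.flatnonzero(index < -1.0)                   # e.g. moderate drought
    print(f"{drought.size} of {window.size} 3-month periods fall below -1")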

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold evolves through time, a joint spatio-temporal model is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a supernova.
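
    The abstract describes propagating a spatial probabilistic model between adjacent snapshots but does not name the model; as a hedged sketch of such a first-order propagation, the code below fits a Gaussian mixture to the particles of one snapshot and uses its parameters to initialize the fit at the next snapshot. The mixture model, the component count and the synthetic particle data are assumptions, not the paper's actual method.

    # Illustrative first-order propagation of a spatial probabilistic model:
    # the mixture fitted at snapshot t seeds the fit at snapshot t+1.
    # Gaussian mixtures, 3 components and synthetic particles are assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    snapshots = [rng.normal(loc=i * 0.1, scale=1.0, size=(500, 3)) for i in range(4)]

    models, previous = [], None
    for particles in snapshots:
        if previous is None:
            gmm = GaussianMixture(n_components=3, random_state=0)
        else:
            # Markovian step: start from the previous snapshot's parameters.
            gmm = GaussianMixture(
                n_components=3,
                means_init=previous.means_,
                weights_init=previous.weights_,
                random_state=0,
            )
        gmm.fit(particles)
        models.append(gmm)
        previous = gmm

    print([np.round(m.means_[:, 0], 2).tolist() for m in models])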

    Using random projections for dimensionality reduction in identifying rogue applications

    In general, the consumer must depend on others to provide their software solutions. However, this outsourcing of software development has made it increasingly unclear where the software is actually being developed and by whom, and it poses a potentially large security problem for the consumer, as it opens up the possibility for rogue functionality to be injected into an application without the consumer's knowledge or consent. This raises the questions 'How do we know that the software we use can be trusted?' and 'How can we have assurance that the software we use is doing only the tasks that we ask it to do?' Traditional methods for thwarting such activities, such as virus detection engines, are far too antiquated for today's adversary. More sophisticated research needs to be conducted in this area to combat these more technically advanced enemies. To combat the ever-increasing problem of rogue applications, this dissertation has successfully applied and extended the information retrieval techniques of n-gram analysis and document similarity and the data mining techniques of dimensionality reduction and attribute extraction. This combination of techniques has produced a more effective tool suite for detecting Trojan horses and other rogue applications, capable of detecting not only standalone rogue applications but also those embedded within other applications. This research provides several major contributions to the field, including a unique combination of techniques that gives administrators a new tool in their multi-pronged defense against the infestation of rogue applications. Another contribution is a unique method of slicing the potential rogue applications that has proven to yield a more robust rogue application classifier. Through experimental research, this effort has shown that a viable and worthy rogue application detection tool suite can be developed. Experimental results have shown that in some cases as much as a 28% increase in overall accuracy can be achieved when comparing the accepted feature selection practice of mutual information with the feature extraction method presented in this effort, called randomized projection.
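
    The abstract pairs n-gram analysis with dimensionality reduction by random projection but does not show the mechanics; the sketch below turns toy byte sequences into n-gram count vectors, reduces them with a sparse random projection and compares the projected samples by cosine similarity. The feature construction, the projection dimension and the hex-encoded toy inputs are illustrative assumptions, not the dissertation's pipeline.

    # Hedged sketch: byte n-gram features reduced by random projection, then
    # compared by cosine similarity. The n-gram length, output dimension and
    # toy inputs are arbitrary choices for the example.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.random_projection import SparseRandomProjection
    from sklearn.metrics.pairwise import cosine_similarity

    samples = [
        "55 8b ec 83 ec 20 53 56 57",  # hex-encoded byte sequences (toy data)
        "55 8b ec 83 ec 20 53 56 90",
        "90 90 90 cc cc cc cc cc cc",
    ]

    # n-gram analysis over byte tokens (here: token bigrams).
    features = CountVectorizer(analyzer="word", ngram_range=(2, 2)).fit_transform(samples)

    # Dimensionality reduction via a sparse random projection.
    reduced = SparseRandomProjection(n_components=4, random_state=0).fit_transform(features)

    # Document-similarity step: cosine similarity between projected samples.
    print(np.round(cosine_similarity(reduced), 2))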

    Novelty, distillation, and federation in machine learning for medical imaging

    The practical application of deep learning methods in the medical domain has many challenges. Pathologies are diverse and very few examples may be available for rare cases. Where data is collected it may lie in multiple institutions and cannot be pooled for practical and ethical reasons. Deep learning is powerful for image segmentation problems but ultimately its output must be interpretable at the patient level. Although clearly not an exhaustive list, these are the three problems tackled in this thesis. To address the rarity of pathology I investigate novelty detection algorithms to find outliers from normal anatomy. The problem is structured as first finding a low-dimension embedding and then detecting outliers in that embedding space. I evaluate for speed and accuracy several unsupervised embedding and outlier detection methods. Data consist of Magnetic Resonance Imaging (MRI) for interstitial lung disease for which healthy and pathological patches are available; only the healthy patches are used in model training. I then explore the clinical interpretability of a model output. I take related work by the Canon team — a model providing voxel-level detection of acute ischemic stroke signs — and deliver the Alberta Stroke Programme Early CT Score (ASPECTS, a measure of stroke severity). The data are acute head computed tomography volumes of suspected stroke patients. I convert from the voxel level to the brain region level and then to the patient level through a series of rules. Due to the real world clinical complexity of the problem, there are at each level — voxel, region and patient — multiple sources of “truth”; I evaluate my results appropriately against these truths. Finally, federated learning is used to train a model on data that are divided between multiple institutions. I introduce a novel evolution of this algorithm — dubbed “soft federated learning” — that avoids the central coordinating authority, and takes into account domain shift (covariate shift) and dataset size. I first demonstrate the key properties of these two algorithms on a series of MNIST (handwritten digits) toy problems. Then I apply the methods to the BraTS medical dataset, which contains MRI brain glioma scans from multiple institutions, to compare these algorithms in a realistic setting
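
    The thesis names federated learning and a "soft" variant that weights by dataset size and domain shift, but the abstract gives no algorithmic detail; the sketch below shows plain federated averaging of model parameters weighted by each institution's dataset size. The parameter vectors, the weighting scheme and the institution names are illustrative assumptions and do not reproduce the thesis's "soft federated learning".

    # Minimal sketch of federated averaging (FedAvg-style), weighting each
    # institution's locally trained parameters by its dataset size. An
    # assumption-laden illustration, not the thesis's "soft federated learning".
    import numpy as np

    def federated_average(local_params: dict, sizes: dict) -> np.ndarray:
        """Average parameter vectors from several sites, weighted by dataset size."""
        total = sum(sizes.values())
        weighted = [sizes[site] / total * params for site, params in local_params.items()]
        return np.sum(weighted, axis=0)

    # Toy "model parameters" from three hypothetical institutions.
    local_params = {
        "hospital_a": np.array([0.10, 0.20, 0.30]),
        "hospital_b": np.array([0.12, 0.18, 0.33]),
        "hospital_c": np.array([0.08, 0.25, 0.28]),
    }
    sizes = {"hospital_a": 500, "hospital_b": 1200, "hospital_c": 300}

    print(np.round(federated_average(local_params, sizes), 3))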

    Biological investigation and predictive modelling of foaming in anaerobic digester

    Anaerobic digestion (AD) of waste has been identified as a leading technology for greener renewable energy generation as an alternative to fossil fuels. AD reduces waste through biochemical processes, converting it to biogas that can be used as a source of renewable energy, while the residual bio-solids can be used to enrich the soil. A problem with AD, however, is foaming and the associated loss of biogas. Tackling this problem effectively requires identifying and controlling the factors that trigger and promote foaming. In this research, laboratory experiments were initially carried out to distinguish the causal factors of foaming from the exacerbating ones. The impact of the identified causal factors, organic loading rate (OLR) and volatile fatty acids (VFA), on foaming occurrence was then monitored and recorded. Further analysis of foaming and non-foaming sludge samples by metabolomic techniques confirmed that OLR and VFA are the prime causes of foaming in AD. In addition, metagenomic analysis showed that the phyla Bacteroidetes and Proteobacteria were predominant, with relative abundances of 30% and 29% respectively, while the phylum Actinobacteria, which contains the most prominent filamentous foam-causing bacteria such as Nocardia amarae and Microthrix parvicella, had a consistently low relative abundance of 0.9%, indicating that foaming in the digesters studied was not triggered by the presence of filamentous bacteria. Consequently, data-driven models to predict foam formation were developed from the experimental data, with OLR and VFA in the feed as inputs and foaming occurrence as the output. The models were extensively validated and assessed using the mean squared error (MSE), root mean squared error (RMSE), R² and mean absolute error (MAE). A Levenberg-Marquardt neural network model proved to be the best model for foaming prediction in AD, with RMSE = 5.49, MSE = 30.19 and R² = 0.9435. The significance of this study is the development of a parsimonious and effective modelling tool that enables AD operators to proactively avert foaming, since the two model input variables (OLR and VFA) can easily be adjusted through a simple programmable logic controller.
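
    The abstract lists MSE, RMSE, R² and MAE as the validation metrics; as a small, hedged illustration of how such an assessment might be computed, the snippet below evaluates a set of predictions against observations. The observed and predicted values are invented for the example and are unrelated to the thesis's results.

    # Illustrative computation of the validation metrics named in the abstract
    # (MSE, RMSE, R^2, MAE). The observed/predicted values are toy numbers only.
    import numpy as np

    observed = np.array([12.0, 30.0, 45.0, 20.0, 55.0])   # e.g. foam measurements
    predicted = np.array([14.0, 27.0, 48.0, 19.0, 51.0])  # model outputs (toy)

    errors = predicted - observed
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((observed - observed.mean()) ** 2)

    print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")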

    The anonymous 1821 translation of Goethe's Faust: a cluster analytic approach

    This study tests the hypothesis proposed by Frederick Burwick and James McKusick in 2007 that Samuel Taylor Coleridge was the author of the anonymous translation of Goethe's Faust published by Thomas Boosey in 1821. The approach to hypothesis testing is stylometric. Specifically, function word usage is selected as the stylometric criterion, and 80 function words are used to define an 80-dimensional function word frequency profile vector for each text in the corpus of Coleridge's literary works and for a selection of works by a range of contemporary English authors. Each profile vector is a point in 80-dimensional vector space, and cluster analytic methods are used to determine the distribution of profile vectors in the space. If the hypothesis being tested is valid, then the profile for the 1821 translation should be closer in the space to works known to be by Coleridge than to works by the other authors. The cluster analytic results show, however, that this is not the case, and the conclusion is that the Burwick and McKusick hypothesis is falsified relative to the stylometric criterion and analytic methodology used.
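
    The abstract describes building function-word frequency profile vectors and clustering them; the sketch below computes relative frequencies of a handful of function words for a few toy texts and applies agglomerative (hierarchical) clustering. The short word list, the invented texts and the Ward linkage are illustrative assumptions, not the study's 80-word criterion or its corpus.

    # Hedged sketch of the stylometric pipeline described in the abstract:
    # function-word relative frequencies per text, then hierarchical clustering.
    # The tiny function-word list and toy texts are placeholders only.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    function_words = ["the", "and", "of", "to", "in", "that", "it", "not"]
    texts = {
        "coleridge_sample": "the spirit of the deep and of the night that it knew not",
        "faust_1821_sample": "to the world and to the spirit that in it not all is known",
        "other_author": "in that to of and the it not in the to of and",
    }

    def profile(text: str) -> np.ndarray:
        """Relative frequency of each function word in the text."""
        tokens = text.lower().split()
        counts = np.array([tokens.count(w) for w in function_words], dtype=float)
        return counts / len(tokens)

    vectors = np.vstack([profile(t) for t in texts.values()])

    # Agglomerative clustering of the profile vectors (Ward linkage).
    labels = fcluster(linkage(vectors, method="ward"), t=2, criterion="maxclust")
    print(dict(zip(texts.keys(), labels.tolist())))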

    Statistical Inference through Data Compression
