12 research outputs found

    TSVD as a Statistical Estimator in the Latent Semantic Analysis Paradigm

    The aim of this paper is to present a new point of view that makes it possible to give a statistical interpretation of the traditional latent semantic analysis (LSA) paradigm based on the truncated singular value decomposition (TSVD) technique. We show how the TSVD can be interpreted as a statistical estimator derived from the LSA co-occurrence relationship matrix by mapping probability distributions on Riemannian manifolds. Moreover, the quality of the estimator model can be expressed by introducing a figure of merit arising from the Solomonoff approach. This figure of merit takes into account both the adherence to the sample data and the simplicity of the model. In our model, the simplicity parameter of the proposed figure of merit depends on the number of singular values retained after the truncation process, while the TSVD estimator, according to the Hellinger distance, guarantees the minimal distance between the sample probability distribution and the inferred probabilistic model.
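
    As a rough illustration of the truncation step described above (not the paper's estimator or its Hellinger-distance analysis), the following sketch computes a rank-k TSVD of a toy co-occurrence matrix with NumPy; the matrix and the choice of k are assumptions for illustration.

```python
import numpy as np

# Toy term-document co-occurrence matrix (assumption: any nonnegative
# count matrix; the paper works with an LSA co-occurrence matrix).
X = np.random.default_rng(0).poisson(1.0, size=(100, 40)).astype(float)

k = 10  # number of singular values retained after truncation
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k TSVD estimate

# Frobenius reconstruction error; the paper instead measures closeness
# to the sample distribution with the Hellinger distance.
print(np.linalg.norm(X - X_k, "fro"))
```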

    Time Efficiency on Computational Performance of PCA, FA and TSVD on Ransomware Detection

    Ransomware is able to attack and take over access to the targeted user's computer, after which the hackers demand a ransom to restore the user's access rights. The ransomware detection process, especially on big data, has problems in terms of computational processing time or detection speed; it therefore requires a dimensionality reduction method for computational efficiency. This research work investigates the efficiency of three dimensionality reduction methods: Principal Component Analysis (PCA), Factor Analysis (FA) and Truncated Singular Value Decomposition (TSVD). Experimental results on the CICAndMal2017 dataset show that PCA is the fastest and most significant method in the computational process, with an average detection time of 34.33 s. Furthermore, results for accuracy, precision and recall also show that PCA is superior to FA and TSVD.
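
    A minimal sketch of the kind of timing comparison described above, using scikit-learn's implementations of the three methods; the feature matrix and component count are stand-ins, not the CICAndMal2017 features used in the paper.

```python
import time
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis, TruncatedSVD

# Stand-in feature matrix (assumption: the paper uses features extracted
# from the CICAndMal2017 dataset, not random data).
X = np.random.default_rng(0).normal(size=(5000, 80))

for reducer in (PCA(n_components=10),
                FactorAnalysis(n_components=10),
                TruncatedSVD(n_components=10)):
    t0 = time.perf_counter()
    reducer.fit_transform(X)  # time the dimensionality reduction step
    print(type(reducer).__name__, f"{time.perf_counter() - t0:.3f}s")
```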

    A Framework Based on Semantic Spaces and Glyphs for Social Sensing on Twitter

    In this paper we present a framework aimed at detecting emotions and sentiments in a Twitter stream. The approach uses the well-founded Latent Semantic Analysis technique, which can be seen as a bio-inspired cognitive architecture, to induce a semantic space where tweets are mapped and analysed by soft sensors. The measurements of the soft sensors are then used by a visualisation module which exploits glyphs to present them graphically. The result is an interactive map which makes it easy to explore reactions and opinions across the globe regarding tweets retrieved from specific queries.
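
    A hedged sketch of the general LSA pattern the abstract describes: inducing a semantic space and scoring tweets against emotion anchors. The toy tweets, anchor texts, and the idea of representing a "soft sensor" as an anchor document are illustrative assumptions, not the paper's framework.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

tweets = ["what a wonderful day", "this is terrible news", "so happy right now"]
anchors = ["joy happy wonderful", "anger terrible awful"]  # hypothetical sensors

vec = TfidfVectorizer()
X = vec.fit_transform(tweets + anchors)
lsa = TruncatedSVD(n_components=2).fit_transform(X)  # induced semantic space

# Each tweet's similarity to each emotion anchor in the LSA space.
print(cosine_similarity(lsa[:len(tweets)], lsa[len(tweets):]))
```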

    Neuroinformatics in Functional Neuroimaging

    This Ph.D. thesis proposes methods for information retrieval in functional neuroimaging through automatic computerized authority identification, and searching and cleaning in a neuroscience database. Authorities are found through co-citation analysis of the citation pattern among scientific articles. Based on data from a single scientific journal, it is shown that multivariate analyses are able to determine group structure that is interpretable as particular “known” subgroups in functional neuroimaging. Methods for text analysis are suggested that use a combination of content and links, in the form of the terms in scientific documents and scientific citations, respectively. These include context-sensitive author ranking and automatic labeling of axes and groups in connection with multivariate analyses of link data. Talairach foci from the BrainMap™ database are modeled with conditional probability density models useful for exploratory functional volume modeling. A further application is shown with conditional outlier detection, where abnormal entries in the BrainMap™ database are spotted using kernel density modeling and the redundancy between anatomical labels and spatial Talairach coordinates. This represents a combination of simple term and spatial modeling. The specific outliers found in the BrainMap™ database included, among others, entry errors, errors in the articles, and unusual terminology.
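
    The conditional outlier detection step lends itself to a short sketch. The following is a minimal kernel-density version, assuming stand-in 3-D coordinates for a single anatomical label; it is not the thesis's exact model of the BrainMap™ data.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Stand-in 3-D coordinates for one anatomical label (assumption: the
# thesis models Talairach foci conditioned on BrainMap labels).
rng = np.random.default_rng(0)
coords = rng.normal(loc=[40.0, -20.0, 50.0], scale=3.0, size=(200, 3))
coords[-1] = [-40.0, 60.0, -30.0]  # planted abnormal entry

kde = KernelDensity(bandwidth=3.0).fit(coords)
log_density = kde.score_samples(coords)

# Entries with unusually low density under the label's model are flagged.
threshold = np.quantile(log_density, 0.01)
print(np.where(log_density <= threshold)[0])
```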

    Development of a Recommender System for National Open-Access Scientific Publications for SINACYT-Qualified Researchers

    Scientific production worldwide is growing steadily. This scientific production is preserved in digital open-access repositories, which are created as support tools for the development of scientific output. However, these repositories show deficiencies as tools for increasing the visibility, use and impact of the scientific production they host. Peru is no stranger to the worldwide growth of scientific production. As it advanced, new platforms (ALICIA and DINA) were implemented to disseminate and promote the exchange of information among local institutions and universities. Nevertheless, these platforms remain isolated within the scientific research system, since they are not integrated with researchers' tools and processes. The objective of this project is to present an alternative solution to the lack of adequate mechanisms for the visibility of Peruvian scientific production, through the implementation of a recommender system for national open-access scientific publications aimed at SINACYT-qualified researchers. The approach generates personalized recommendations of publications in ALICIA through content-based filtering against a researcher profile. This profile is built from relevant information about the researcher's scientific production published in Scopus and Orcid. Recommendation generation relies on LSA (Latent Semantic Analysis), to uncover hidden semantic structure in a set of scientific publications, and on cosine similarity, to find the scientific publications with the highest level of similarity. The project implements four modules: extraction, which collects the data of publications in ALICIA and the publications in Scopus and Orcid for each researcher registered in DINA through web scraping; preprocessing, which improves the quality of the extracted data for later use in the analytic model within the text-mining framework; recommendation, which trains an LSA model and generates recommendations about which scientific publications may interest users based on their scientific publications in Scopus and Orcid; and service, which allows other applications to consume the recommendations generated by the system.
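
    A minimal sketch of the recommendation core (LSA plus cosine similarity), with hypothetical stand-in texts in place of the scraped ALICIA/Scopus/Orcid data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins: open-access abstracts (ALICIA) and one
# researcher's own publications (Scopus/Orcid), which the real system scrapes.
catalog = ["machine learning for mining safety",
           "andean water resource management",
           "neural networks for crop yield prediction"]
profile_docs = ["deep learning applied to agriculture"]

vec = TfidfVectorizer()
X = vec.fit_transform(catalog + profile_docs)
lsa = TruncatedSVD(n_components=2).fit_transform(X)  # LSA semantic space

profile = lsa[len(catalog):].mean(axis=0, keepdims=True)  # researcher profile
scores = cosine_similarity(profile, lsa[:len(catalog)]).ravel()
print(np.argsort(scores)[::-1])  # catalog items ranked for recommendation
```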

    Graphical Model approaches for Biclustering

    In many scientific areas, it is crucial to group (cluster) a set of objects based on a set of observed features. This operation is widely known as clustering, and it has been exploited in the most diverse scenarios, ranging from Economics to Biology passing through Psychology. Going a step further, there exist contexts where it is crucial to group objects and simultaneously identify the features that allow one to distinguish those objects from the others. In gene expression analysis, for instance, the identification of subsets of genes showing a coherent pattern of expression in subsets of objects/samples can provide crucial information about active biological processes. Such information, which cannot be retrieved by classical clustering approaches, can be extracted with so-called Biclustering, a class of approaches which aim at simultaneously clustering both rows and columns of a given data matrix (where each row corresponds to a different object/sample and each column to a different feature). The problem of biclustering, also known as co-clustering, has recently been exploited in a wide range of scenarios such as Bioinformatics, market segmentation, data mining, text analysis and recommender systems. Many approaches have been proposed to address the biclustering problem, each one characterized by different properties such as interpretability, effectiveness or computational complexity. A recent trend involves the exploitation of sophisticated computational models (Graphical Models) to face the intrinsic complexity of biclustering and to retrieve very accurate solutions. Graphical Models represent the decomposition of a global objective function into a set of smaller/local functions defined over subsets of variables. The advantage of using Graphical Models lies in the fact that the graphical representation can highlight useful hidden properties of the considered objective function; moreover, the analysis of smaller local problems requires less computational effort. Since biclustering is a complex and challenging problem, and given the difficulty of obtaining a representative and solvable model, few promising approaches based on Graphical Models exist in the literature. This thesis is inserted in the above-mentioned scenario, and it investigates the exploitation of Graphical Models to face the biclustering problem. We explored different types of Graphical Models, in particular Factor Graphs and Bayesian Networks. We present three novel algorithms (with extensions) and evaluate these techniques using available benchmark datasets. All the models have been compared with state-of-the-art competitors, and the results show that Factor Graph approaches lead to solid and efficient solutions for datasets of contained dimensions, whereas Bayesian Networks can manage huge datasets, with the drawback that setting the parameters can be non-trivial. As another contribution of the thesis, we widen the range of biclustering applications by studying the suitability of these approaches in some Computer Vision problems where biclustering had never been adopted before. Summarizing, with this thesis we provide evidence that Graphical Model techniques can have a significant impact in the biclustering scenario. Moreover, we demonstrate that biclustering techniques are versatile and can produce effective solutions in the most diverse fields of application.
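
    As a runnable point of reference, the sketch below applies a standard spectral co-clustering baseline to a synthetic matrix with planted biclusters; this is a classical approach, not the Factor Graph or Bayesian Network methods proposed in the thesis.

```python
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Synthetic data matrix with planted biclusters
# (rows = objects/samples, columns = features).
X, rows, cols = make_biclusters(shape=(60, 40), n_clusters=3,
                                noise=0.5, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(X)

# Row and column labels jointly define the recovered biclusters.
print(model.row_labels_[:10], model.column_labels_[:10])
```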

    Sound Object Recognition

    Humans are constantly exposed to a variety of acoustic stimuli, ranging from music and speech to more complex acoustic scenes like a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information, suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high-dimensional sound object representation framework that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition. We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems in noise robustness at both automatic speech recognition and speaker verification tasks. The proposed analysis scheme, with its ability to analyze temporal modulations, was used to capture musical sound objects. We show that, using a model of cortical processing, we were able to accurately replicate human perceptual similarity judgments and to obtain good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we were able to extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings. Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes, but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module which is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We show that such an adaptation mechanism is able to tremendously improve the performance of the system at detecting the target source in the presence of various distracting background sources.
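
    A toy sketch of the multi-resolution idea, assuming plain spectrograms at several window lengths as a stand-in for the thesis's cortical multi-resolution mapping:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(fs) / fs
# Amplitude-modulated tone: a simple signal with temporal modulations.
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))

# Multi-resolution mapping: the same sound analyzed at several window
# sizes, trading spectral detail against temporal (modulation) detail.
for nperseg in (128, 512, 2048):
    f, times, S = spectrogram(x, fs=fs, nperseg=nperseg)
    print(nperseg, S.shape)  # (frequency bins, time frames)
```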

    From Points to Probability Measures: Statistical Learning on Distributions with Kernel Mean Embedding

    The dissertation presents a novel learning framework on probability measures which has abundant real-world applications. In the classical setup, it is assumed that the data are points drawn independently and identically distributed (i.i.d.) from some unknown distribution. In many scenarios, however, representing data as distributions may be preferable. For instance, when the measurement is noisy, we may tackle the uncertainty by treating the data themselves as distributions, which is often the case for microarray data and astronomical data, where the measurement process is imprecise and replication is often required. Distributions not only embody individual data points, but also constitute information about their interactions, which can be beneficial for structural learning in high-energy physics, cosmology, causality, and so on. Moreover, classical problems in statistics, such as statistical estimation, hypothesis testing, and causal inference, may be interpreted in a decision-theoretic sense as machine learning problems on empirical distributions. Rephrasing these problems as such leads to novel approaches for statistical inference and estimation. Hence, allowing learning algorithms to operate directly on distributions opens up a wide range of future applications. To work with distributions, the key methodology adopted in this thesis is the kernel mean embedding of distributions, which represents each distribution as a mean function in a reproducing kernel Hilbert space (RKHS). In particular, the kernel mean embedding has been applied successfully in two-sample testing, graphical models, and probabilistic inference. This thesis focuses mainly on predictive learning on distributions, i.e., when the observations are distributions and the goal is to make predictions about previously unseen distributions. More importantly, the thesis investigates kernel mean estimation, which is one of the most fundamental problems of kernel methods. Probability distributions, as opposed to data points, constitute information at a higher level, such as the aggregate behavior of data points, how the underlying process evolves over time and domains, and complex concepts that cannot be described merely by individual points. Intelligent organisms have the ability to recognize and exploit such information naturally. Thus, this work may shed light on the future development of intelligent machines and, most importantly, may provide clues on the true meaning of intelligence.
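
    The empirical kernel mean embedding admits a compact sketch: each sample's embedding is the average of its kernel features, and the RKHS distance between two embeddings is the maximum mean discrepancy (MMD). The data below are synthetic stand-ins.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))  # sample from distribution P
Y = rng.normal(0.5, 1.0, size=(200, 2))  # sample from distribution Q

# Squared RKHS distance between the empirical kernel mean embeddings of
# P and Q, i.e. the (biased) squared MMD with an RBF kernel.
mmd2 = (rbf_kernel(X, X).mean()
        - 2 * rbf_kernel(X, Y).mean()
        + rbf_kernel(Y, Y).mean())
print(mmd2)
```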