10 research outputs found

    High-dimensional tests for spherical location and spiked covariance

    Get PDF
    Rotationally symmetric distributions on the p-dimensional unit hypersphere, extremely popular in directional statistics, involve a location parameter theta that indicates the direction of the symmetry axis. The most classical way of addressing the spherical location problem H_0:theta=theta_0, with theta_0 a fixed location, is the so-called Watson test, which is based on the sample mean of the observations. This test enjoys many desirable properties, but its implementation requires the sample size n to be large compared to the dimension p. This is a severe limitation, since more and more problems nowadays involve high-dimensional directional data (e.g., in genetics or text mining). In this work, we therefore introduce a modified Watson statistic that can cope with high dimensionality. We derive its asymptotic null distribution as both n and p go to infinity. This is achieved in a universal asymptotic framework that allows p to go to infinity arbitrarily fast (or slowly) as a function of n. We further show that our results also provide high-dimensional tests for a problem that has recently attracted much attention, namely that of testing that the covariance matrix of a multinormal distribution has a "theta_0-spiked" structure. Finally, a Monte Carlo simulation study corroborates our asymptotic results.
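    As a point of reference for the modification described above, the classical (fixed-p) Watson statistic can be sketched in a few lines of pure Python. This is a hedged illustration of the standard form only; the function names are ours, and the paper's high-dimensional correction is not reproduced here:

```python
import math
import random

def watson_statistic(X, theta0):
    """Classical Watson statistic for H0: theta = theta0 (standard fixed-p form):
        W = n (p-1) ||(I - theta0 theta0') xbar||^2 / (1 - (1/n) sum_i (X_i' theta0)^2)
    X is a list of n unit p-vectors, theta0 a unit p-vector."""
    n, p = len(X), len(theta0)
    xbar = [sum(x[j] for x in X) / n for j in range(p)]
    proj = sum(xbar[j] * theta0[j] for j in range(p))
    # component of the sample mean orthogonal to the hypothesised axis
    orth = [xbar[j] - proj * theta0[j] for j in range(p)]
    denom = 1.0 - sum(sum(x[j] * theta0[j] for j in range(p)) ** 2 for x in X) / n
    return n * (p - 1) * sum(v * v for v in orth) / denom

def random_unit_vector(p, rng):
    """Uniform direction on the unit sphere: normalise a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]
```

    For fixed p and large n, W is approximately chi-square with p-1 degrees of freedom under H_0; the statistic above is exactly the one whose (n, p)-asymptotics the paper modifies.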

    Data Stream Clustering: Challenges and Issues

    Full text link
    Very large databases are required to store the massive amounts of data that are continuously inserted and queried. Analyzing huge data sets and extracting valuable patterns from them is of interest to researchers in many applications. Two main groups of techniques for mining huge databases can be identified: one applies mining techniques to streaming data, while the other attempts to solve the problem directly with efficient algorithms. Recently, many researchers have focused on data streams as an efficient strategy for mining huge databases, instead of mining the entire database. The main problem in data stream mining is that evolving data is more difficult to handle, so unsupervised methods should be applied; clustering techniques, in particular, can lead us to discover hidden information. In this survey, we try to clarify: first, the different problem definitions related to data stream clustering in general; second, the specific difficulties encountered in this field of research; third, the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and finally, how several prominent solutions tackle different problems.
    Index Terms: Data Stream, Clustering, K-Means, Concept Drift. Comment: IMECS201

    Internal and collective interpretation for improving human interpretability of multi-layered neural networks

    Get PDF
    The present paper proposes a new type of information-theoretic method to interpret the inference mechanism of neural networks. We interpret the internal inference mechanism itself, without external aids such as symbolic or fuzzy rules. In addition, we make the interpretation process as stable as possible: we interpret the inference mechanism considering all internal representations created under different conditions and patterns. To make this internal interpretation possible, we compress multi-layered neural networks into the simplest networks, without hidden layers. The information loss that naturally occurs during compression is then compensated by introducing a mutual information augmentation component. The method was applied to two data sets, namely the glass data set and the pregnancy data set. In both data sets, the information augmentation and compression methods improved generalization performance. In addition, the compressed or collective weights from the multi-layered networks tended, somewhat ironically, to be similar to the linear correlation coefficients between inputs and targets, while conventional methods such as logistic regression analysis failed to achieve this.

    Testing uniformity on high-dimensional spheres against monotone rotationally symmetric alternatives

    Full text link
    We consider the problem of testing uniformity on high-dimensional unit spheres. We are primarily interested in non-null issues. We show that rotationally symmetric alternatives lead to two Local Asymptotic Normality (LAN) structures. The first is for fixed modal location θ and allows us to derive locally asymptotically most powerful tests under specified θ. The second, which addresses the Fisher-von Mises-Langevin (FvML) case, relates to the unspecified-θ problem and shows that the high-dimensional Rayleigh test is locally asymptotically most powerful invariant. Under mild assumptions, we derive the asymptotic non-null distribution of this test, which allows us to extend beyond the FvML case the asymptotic powers obtained there from Le Cam's third lemma. Throughout, we allow the dimension p to go to infinity in an arbitrary way as a function of the sample size n. Some of our results also strengthen the local optimality properties of the Rayleigh test in low dimensions. We perform a Monte Carlo study to illustrate our asymptotic results. Finally, we treat an application related to testing for sphericity in high dimensions.
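    The classical Rayleigh statistic discussed above is simple enough to sketch directly. This is a hedged pure-Python illustration of the standard form R = n p ||x̄||², together with the standardised variant relevant when p grows (the names are ours):

```python
import math

def rayleigh_statistic(X):
    """Classical Rayleigh statistic for testing uniformity on the unit sphere:
        R = n * p * ||xbar||^2,
    asymptotically chi-square with p degrees of freedom for fixed p.  Also
    returns the standardised form (R - p) / sqrt(2p), the version whose
    behaviour is studied as p grows with n.  X is a list of n unit p-vectors."""
    n, p = len(X), len(X[0])
    xbar = [sum(x[j] for x in X) / n for j in range(p)]
    R = n * p * sum(v * v for v in xbar)
    return R, (R - p) / math.sqrt(2.0 * p)
```

    For a perfectly balanced sample (e.g. antipodal pairs) the resultant vanishes and R = 0, its minimum; large values of R indicate a preferred direction.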

    Dimensionality reduction for smart IoT sensors

    Get PDF
    Smart IoT sensors are characterized by their ability to sense and process signals, producing high-level information that is usually sent wirelessly while minimising energy consumption and maximising communication efficiency. Systems are getting smarter, meaning that they provide ever richer information from the same raw data. This increasing intelligence can occur at various levels: in the sensor itself, at the edge, and in the cloud. As sending one byte of data is several orders of magnitude more energy-expensive than processing it, data must be handled as near as possible to where it is generated. The intelligence should therefore be located in the sensor; nevertheless, this is not always possible, because real data are not always available when designing the algorithms, or the hardware capacity is limited. Smart devices processing data from inertial sensors are a good example: they generate hundreds of bytes per second (100 Hz, 12-bit sampling of a triaxial accelerometer), but the useful information amounts to just a few bytes per minute (number of steps, type of activity, and so forth). We propose a lossy compression method to reduce the dimensionality of raw data from accelerometers, gyroscopes, and magnetometers, while maintaining high-quality information in the signal reconstructed from the embedded device's output. The implemented method uses an adaptive vector-quantisation algorithm that represents the input data with a limited set of codewords. The adaptive process generates a codebook that evolves to become highly specific to the input data while providing high compression rates. The codebook's reconstruction quality is measured with a peak signal-to-noise ratio (PSNR) above 40 dB for a 12-bit representation.
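    A minimal sketch of the kind of adaptive vector quantisation described above, assuming a plain competitive-learning update (the nearest codeword moves toward each sample) rather than the authors' exact algorithm; the function names and parameters are illustrative:

```python
import math
import random

def train_codebook(samples, k, lr=0.05, epochs=5, seed=0):
    """Adaptive vector quantisation via competitive learning: for each sample,
    the nearest codeword moves a fraction lr of the way toward it, so the
    codebook evolves to become specific to the input data."""
    rng = random.Random(seed)
    book = [list(rng.choice(samples)) for _ in range(k)]
    for _ in range(epochs):
        for s in samples:
            j = min(range(k),
                    key=lambda i: sum((book[i][d] - s[d]) ** 2 for d in range(len(s))))
            for d in range(len(s)):
                book[j][d] += lr * (s[d] - book[j][d])
    return book

def psnr(samples, book, peak):
    """Peak signal-to-noise ratio (dB) of the quantised reconstruction."""
    err, cnt = 0.0, 0
    for s in samples:
        q = min(book, key=lambda c: sum((c[d] - s[d]) ** 2 for d in range(len(s))))
        err += sum((q[d] - s[d]) ** 2 for d in range(len(s)))
        cnt += len(s)
    mse = err / cnt
    return float('inf') if mse == 0 else 10.0 * math.log10(peak * peak / mse)
```

    Each raw sample is then transmitted as a codeword index instead of its full value, which is where the compression comes from.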

    Real-time extensive livestock monitoring using lpwan smart wearable and infrastructure

    Get PDF
    Extensive unsupervised livestock farming is a common practice in many places around the globe. Animals can be released for months at a time, over large areas, with different species packing and behaving very differently. Nevertheless, the farmers' needs are similar: where the livestock is (and where it has been) and how healthy the animals are. The geographical areas involved are usually difficult to access, with harsh orography and a lack of communications infrastructure. This paper presents the design of a solution for extensive livestock monitoring in such areas. Our proposal is based on a wearable equipped with inertial sensors, a global positioning system, and wireless communications, together with a Low-Power Wide Area Network infrastructure that can run with or without an internet connection. Using adaptive analysis and data compression, we provide real-time monitoring and logging of the cattle's position and activities. The hardware and firmware design achieves very low energy consumption, allowing months of battery life. We have thoroughly tested the devices in different laboratory setups and evaluated the system's performance in real scenarios in the mountains and in the forest.

    Reducción de dimensionalidad y técnicas de inferencia de estado para sensores inteligentes

    Get PDF
    This project arises from the attempt to reduce the amount of data sent by smart sensors, thereby extending their useful life and providing them with a degree of security, since abstracting the raw data makes its interpretation impossible. To this end, we use a technique developed years ago, vector quantization, adding a series of improvements that recompute the placement of the centroids in order to minimise the reconstruction error. This technique has been implemented in Python and is evaluated on an existing database, varying the sampling parameters, the use of mean-and-standard-deviation encoding, the composition with one or all three components of the inertial sensor, and the size of the training set. With this system defined, a classification of the reduced data is proposed to achieve an even higher compression ratio compared with the raw data sent by the sensor. Once the model's behaviour has been verified, it is implemented in MicroPython on a smart sensor to evaluate the compression system against its conventional use.

    Vacuum ultraviolet laser induced breakdown spectroscopy (VUV-LIBS) for pharmaceutical analysis

    Get PDF
    Laser induced breakdown spectroscopy (LIBS) allows quick analysis to determine the elemental composition of a target material. Samples need little or no preparation, removing the risk of contamination or loss of analyte. The technique is minimally ablative, so a negligible amount of the sample is destroyed, while allowing quantitative and qualitative results. Vacuum ultraviolet (VUV) LIBS, owing to the abundance of transitions at shorter wavelengths, offers improvements over LIBS in the visible region, such as lower limits of detection for trace elements, and extends LIBS to elements and samples not suited to visible LIBS. These qualities also make VUV-LIBS attractive for pharmaceutical analysis. Owing to success in the pharmaceutical sector, the molecules representing active pharmaceutical ingredients (APIs) have become increasingly complex. These organic compounds yield spectra densely populated with carbon and oxygen lines in the visible and infrared regions, making it increasingly difficult to identify an inorganic analyte. The VUV region offers a solution, as the spacing between spectral lines is much better. VUV-LIBS experiments were carried out on pharmaceutical samples. This work is a proof of principle that VUV-LIBS in conjunction with machine learning can tell pharmaceuticals apart via classification, and it tests this principle in two ways. Firstly, by classifying pharmaceuticals that are very different from one another, i.e., that have different APIs. This first test gauges the efficacy of separating, using their VUV emission spectra, analytes that are essentially carbohydrates with distinctly different APIs. Secondly, by classifying two different brands of the same pharmaceutical, i.e., paracetamol. The second test investigates the ability of machine learning to abstract and identify the differences in the spectra of two pharmaceuticals with the same API and to separate them.
This second test presents the application of VUV-LIBS combined with machine learning as a solution for at-line analysis of similar analytes, e.g., in quality control. The machine learning techniques explored in this thesis were convolutional neural networks (CNNs), support vector machines, self-organizing maps, and competitive learning. The motivation for applying principal component analysis (PCA) and machine learning is the classification of analytes, allowing us to distinguish pharmaceuticals from one another based on their spectra. PCA and the machine learning techniques are compared against one another in this thesis. Several innovations were made: this work is the first in LIBS to use a short-time Fourier transform (STFT) to generate input images for a CNN from VUV-LIBS spectra. It is also believed to be the first work in LIBS to develop and apply an ellipsoidal classifier based on PCA. The results of this work show that, by lowering the pulse energy, it is possible to gather more useful spectra over the surface of a sample. Although this yields spectra with a poorer signal-to-noise ratio, the samples can still be classified using the machine learning analytics. The results in this thesis indicate that, of all the machine learning techniques evaluated, CNNs offer the best classification accuracy combined with the fastest run time. Prudent data augmentation can significantly reduce experimental workloads without reducing classification rates.
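    The STFT preprocessing idea, turning a 1-D spectrum into a 2-D time-frequency magnitude image that a CNN can consume, can be sketched in pure Python. This is an illustrative reconstruction under our own choices of window and hop, not the thesis's actual pipeline:

```python
import cmath
import math

def stft_image(signal, win=32, hop=16):
    """Short-time Fourier transform magnitudes: slide a Hann-windowed frame
    along the 1-D signal and take the DFT magnitude of each frame, producing
    a 2-D array of shape (num_frames, win//2 + 1) usable as a CNN input."""
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = [signal[start + i] * (0.5 - 0.5 * math.cos(2 * math.pi * i / (win - 1)))
               for i in range(win)]  # Hann window suppresses edge leakage
        mags = []
        for k in range(win // 2 + 1):  # keep the non-negative frequency bins
            acc = sum(seg[m] * cmath.exp(-2j * math.pi * k * m / win)
                      for m in range(win))
            mags.append(abs(acc))
        frames.append(mags)
    return frames
```

    In practice one would use an FFT-based routine for speed; the point here is only the shape of the transformation from spectrum to image.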

    Statistical Analysis of Spherical Data: Clustering, Feature Selection and Applications

    Get PDF
    In the light of interdisciplinary applications, the data to be studied and analyzed have witnessed a growth in volume and a change in their intrinsic structure and type. In practice, the diversity of resources generating objects imposes several challenges on decision makers seeking to determine informative data in terms of time, model capability, scalability, and knowledge discovery. It is therefore highly desirable to be able to extract patterns of interest that support data-management decisions. Clustering, among other machine learning approaches, is an important data engineering technique that enables the automatic discovery of clusters of similar objects and the consequent assignment of new unseen objects to appropriate clusters. In this context, the majority of current research does not completely address the true structure and nature of the data for the particular application at hand. In contrast to most previous research, our proposed work focuses on the modeling and classification of spherical data that are naturally generated in many data mining and knowledge discovery applications. Thus, in this thesis we propose several estimation and feature selection frameworks based on the Langevin distribution, devoted to spherical patterns in offline and online settings. We first formulate a unified probabilistic framework in which we build probabilistic kernels, based on the Fisher score and on information divergences from finite Langevin mixtures, for Support Vector Machines. We are motivated by the fact that blending generative and discriminative approaches has prevailed by exploring and adopting the distinct characteristics of each approach toward constructing a complementary system combining the best of both.
Due to the high demand for compact and accurate statistical models that adjust automatically to dynamic changes, we next propose probabilistic frameworks for high-dimensional spherical data modeling based on finite Langevin mixtures that allow simultaneous clustering and feature selection in offline and online settings. To this end, we adopt finite mixture models, which have long relied heavily on deterministic learning approaches such as maximum likelihood estimation. Despite their successful utilization in a wide spectrum of areas, these approaches have several drawbacks, as we discuss in this thesis. An alternative is Bayesian inference, which naturally addresses data uncertainty while ensuring good generalization. To address this issue, we also propose a Bayesian approach for finite Langevin mixture model estimation and selection. When data change dynamically and grow drastically, a finite mixture is not always a feasible solution. In contrast with the previous approaches, which suppose an unknown but finite number of mixture components, we finally propose a nonparametric Bayesian approach which assumes an infinite number of components. We further enhance our model by simultaneously detecting informative features during clustering. Through extensive empirical experiments, we demonstrate the merits of the proposed learning frameworks on diverse high-dimensional datasets and challenging real-world applications.
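    The finite Langevin (von Mises-Fisher) mixtures underlying these frameworks can be illustrated with a much simpler hard-assignment variant, essentially spherical k-means. This sketch is our own simplification and omits the concentration parameters, feature selection, and Bayesian machinery of the thesis:

```python
import math

def spherical_kmeans(X, k, iters=20):
    """Hard-assignment clustering of unit vectors by cosine similarity, a
    simplified stand-in for fitting a finite Langevin (von Mises-Fisher)
    mixture: each component's mean direction is the normalised resultant of
    its members, which is also the maximum-likelihood mean direction of a
    Langevin component under hard assignments."""
    mus = [list(X[i]) for i in range(k)]  # naive init: the first k points
    labels = [0] * len(X)
    for _ in range(iters):
        # assign each point to the component with the largest dot product
        labels = [max(range(k), key=lambda j: sum(m * v for m, v in zip(mus[j], x)))
                  for x in X]
        # re-estimate each mean direction as the normalised resultant vector
        for j in range(k):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:
                resultant = [sum(col) for col in zip(*members)]
                norm = math.sqrt(sum(v * v for v in resultant)) or 1.0
                mus[j] = [v / norm for v in resultant]
    return labels, mus
```

    A full Langevin mixture would additionally estimate per-component concentrations and mixing weights (by EM or, as in the thesis, Bayesian inference) and use soft responsibilities rather than hard labels.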

    Deep Machine Learning with Spatio-Temporal Inference

    Get PDF
    Deep Machine Learning (DML) refers to methods which utilize hierarchies of more than one or two layers of computational elements to achieve learning. DML may draw upon biomimetic models, or may be simply biologically inspired. Regardless, these architectures seek to employ hierarchical processing as a means of mimicking the ability of the human brain to process a myriad of sensory data and make meaningful decisions based on those data. In this dissertation we present a novel DML architecture which is biologically inspired in that (1) all processing is performed hierarchically; (2) all processing units are identical; and (3) processing captures both spatial and temporal dependencies in the observations to organize and extract features suitable for supervised learning. We call this architecture the Deep Spatio-Temporal Inference Network (DeSTIN). In this framework, patterns observed in pixel data at the lowest layer of the hierarchy are organized and fit to generalizations using decomposition algorithms. Subsequent spatial layers draw upon previous layers, their own temporal observations and beliefs, and the observations and beliefs of parent nodes to extract features suitable for supervised learning using standard classifiers such as feedforward neural networks. Hence, DeSTIN is viewed as an unsupervised feature extraction scheme in the sense that, rather than relying on human engineering to determine features for a particular problem, DeSTIN naturally constructs features of interest by representing salient regularities in the patterns observed. A detailed discussion and analysis of the DeSTIN framework is provided, with a focus on its key components of generalization through online clustering and temporal inference. We present a variety of implementation details, including static and dynamic learning formulations and function approximation methods.
    Results on standardized datasets of handwritten digits, as well as on face and optic nerve detection, are presented, illustrating the efficacy of the proposed approach.