Graphing of E-Science Data with varying user requirements
Based on our experience in the Swiss Experiment, exploring experimental scientific data is often done visually: starting from a global overview, users zoom in on interesting events. For huge data volumes, special data structures have to be introduced to provide fast and easy access to the data. Since it is hard to predict how users will work with the data, a generic approach requires self-adaptation of these data structures. In this paper we describe the underlying NP-hard problem and present several approaches, with varying properties, to address it. The approaches are illustrated with a small example and evaluated on a synthetic data set and user queries.
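As context for the kind of structure involved, here is a minimal sketch of a precomputed aggregation pyramid for overview-to-detail zooming; the factor-of-4 reduction, the mean-only aggregate and all names are our own illustrative assumptions, not the paper's data structure:

```python
import numpy as np

def build_pyramid(values, factor=4):
    """Precompute coarser and coarser per-block means so that an
    overview never has to scan the raw measurements."""
    levels = [np.asarray(values, float)]
    while len(levels[-1]) > factor:
        v = levels[-1]
        n = len(v) // factor * factor           # drop the ragged tail
        levels.append(v[:n].reshape(-1, factor).mean(axis=1))
    return levels

def zoom(levels, start, end, budget=500, factor=4):
    """Serve a range query from the finest level that still fits the
    plotting budget, so response time is bounded at every zoom level."""
    for depth, level in enumerate(levels):      # finest level first
        scale = factor ** depth
        lo, hi = start // scale, (end + scale - 1) // scale
        if hi - lo <= budget:
            return level[lo:hi], scale
    return levels[-1], factor ** (len(levels) - 1)
```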
Plant image retrieval using color, shape and texture features
We present a content-based image retrieval system for plant images, intended especially for the house plant identification problem. A plant image consists of a collection of overlapping leaves and possibly flowers, which makes the problem challenging. We studied the suitability of various well-known color, shape and texture features for this problem, and introduce some new texture matching techniques and shape features. Feature extraction is applied after segmenting the plant region from the background using the max-flow min-cut technique. Results on a database of 380 plant images belonging to 78 different types of plants show the promise of the proposed techniques and the overall system: in 55% of the queries, the correct plant image is retrieved among the top 15 results. Furthermore, the accuracy rises to 73% when a 132-image subset of well-segmented plant images is considered.
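As an illustration of the segment-then-describe pipeline, here is a minimal sketch using OpenCV's GrabCut, a graph-cut segmenter from the same max-flow/min-cut family, followed by a joint color histogram; the rectangle initialization, histogram size and all names are our assumptions, not the authors' system:

```python
import cv2
import numpy as np

def plant_features(path):
    """Segment the plant from the background with GrabCut, then
    describe the foreground with a joint RGB color histogram."""
    img = cv2.imread(path)
    mask = np.zeros(img.shape[:2], np.uint8)
    rect = (10, 10, img.shape[1] - 20, img.shape[0] - 20)  # assumed ROI
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    hist = cv2.calcHist([img], [0, 1, 2], fg.astype(np.uint8),
                        [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).ravel()  # compare via cv2.compareHist
```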
Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain
Real-world data typically contain repeated and periodic patterns. This
suggests that they can be effectively represented and compressed using only a
few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.).
However, distance estimation when the data are represented using different sets
of coefficients is still a largely unexplored area. This work studies the
optimization problems related to obtaining the \emph{tightest} lower/upper
bound on Euclidean distances when each data object is potentially compressed
using a different set of orthonormal coefficients. Our technique leads to
tighter distance estimates, which translates into more accurate search,
learning and mining operations \textit{directly} in the compressed domain.
We formulate the problem of estimating lower/upper distance bounds as an
optimization problem. We establish the properties of optimal solutions, and
leverage the theoretical analysis to develop a fast algorithm to obtain an
\emph{exact} solution to the problem. The suggested solution provides the
tightest estimation of the $\ell_2$-norm or the correlation. We show that typical
data-analysis operations, such as k-NN search or k-Means clustering, can
operate more accurately using the proposed compression and distance
reconstruction technique. We compare it with many other prevalent compression
and reconstruction techniques, including random projections and PCA-based
techniques. We highlight a surprising result, namely that when the data are
highly sparse in some basis, our technique may even outperform PCA-based
compression.
The contributions of this work are generic, as our methodology is applicable
to any sequential or high-dimensional data as well as to any orthogonal data
transformation used for the underlying data compression scheme.
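To make the setting concrete, here is a minimal sketch of compressed-domain distance bounding under an orthonormal transform. It uses the simple triangle-inequality bounds (via Parseval's theorem) rather than the paper's optimal, tightest solution, and every name in it is our own:

```python
import numpy as np

def compress(x, k):
    """Keep the k largest-magnitude coefficients of an orthonormal DFT,
    plus the total energy of the discarded ones (Parseval)."""
    X = np.fft.fft(x, norm="ortho")
    keep = np.argsort(np.abs(X))[-k:]
    e = np.sum(np.abs(X) ** 2) - np.sum(np.abs(X[keep]) ** 2)
    return keep, X[keep], max(e, 0.0)

def distance_bounds(n, cx, cy):
    """Lower/upper bounds on the true Euclidean distance when x and y
    may keep *different* coefficient sets."""
    (ix, vx, ex), (iy, vy, ey) = cx, cy
    Xh, Yh = np.zeros(n, complex), np.zeros(n, complex)
    Xh[ix], Yh[iy] = vx, vy                # zero-filled reconstructions
    d = np.linalg.norm(Xh - Yh)
    slack = np.sqrt(ex) + np.sqrt(ey)      # ||x - x_hat|| = sqrt(ex), etc.
    return max(d - slack, 0.0), d + slack

x, y = np.random.randn(128), np.random.randn(128)
lb, ub = distance_bounds(128, compress(x, 16), compress(y, 16))
assert lb <= np.linalg.norm(x - y) <= ub
```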
Differentially Private Publication of Sparse Data
The problem of privately releasing data is to provide a version of a dataset
without revealing sensitive information about the individuals who contribute to
the data. The model of differential privacy allows such private release while
providing strong guarantees on the output. A basic mechanism achieves
differential privacy by adding noise to the frequency counts in the contingency
tables (or, a subset of the count data cube) derived from the dataset. However,
when the dataset is sparse in its underlying space, as is the case for most
multi-attribute relations, then the effect of adding noise is to vastly
increase the size of the published data: it implicitly creates a huge number of
dummy data points to mask the true data, making it almost impossible to work
with.
We present techniques to overcome this roadblock and allow efficient private
release of sparse data, while maintaining the guarantees of differential
privacy. Our approach is to release a compact summary of the noisy data.
Generating the noisy data and then summarizing it would still be very costly,
so we show how to shortcut this step, and instead directly generate the summary
from the input data, without materializing the vast intermediate noisy data. We
instantiate this outline for a variety of sampling and filtering methods, and
show how to use the resulting summary for approximate, private, query
answering. Our experimental study shows that this is an effective, practical
solution, with comparable and occasionally improved utility over the costly
materialization approach.
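To make the shortcut concrete, here is a minimal sketch of one filtering instantiation: noisy counts below a threshold t are dropped, and the surviving "dummy" cells among the untouched empty ones are sampled directly rather than materialized (for Laplace noise of scale 1/eps and t >= 0, an empty cell survives with probability 0.5*exp(-eps*t)). All names are ours, and this illustrates the idea rather than the paper's exact method:

```python
import numpy as np

def private_sparse_summary(counts, domain_size, eps, t):
    """Release a compact differentially private summary of a sparse
    histogram: noise the nonzero cells, then *sample* which of the many
    empty cells would have survived the threshold, without ever
    materializing them."""
    rng = np.random.default_rng()
    out = {}
    for cell, c in counts.items():          # nonzero cells: noise + filter
        noisy = c + rng.laplace(scale=1.0 / eps)
        if noisy > t:
            out[cell] = noisy
    # Each empty cell survives with p = P(Lap(1/eps) > t) = 0.5*exp(-eps*t)
    p = 0.5 * np.exp(-eps * t)
    n_empty = domain_size - len(counts)
    survivors = rng.binomial(n_empty, p)
    for cell in rng.choice(n_empty, size=survivors, replace=False):
        # A Laplace variable conditioned on exceeding t is t + Exponential
        out[("dummy", int(cell))] = t + rng.exponential(scale=1.0 / eps)
    return out
```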
Wavelet based similarity measurement algorithm for seafloor morphology
Thesis (S.M. in Naval Architecture and Marine Engineering and S.M. in Mechanical Engineering)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 2006. Includes bibliographical references (leaves 71-73). By Ilkay Darilmaz.

The recent expansion of systematic seafloor exploration programs such as geophysical research, seafloor mapping, search and survey, resource assessment, and other scientific, commercial and military applications has created a need for rapid and robust methods of processing seafloor imagery. Given the existence of a large library of seafloor images, a fast automated image classifier is needed to determine changes in seabed morphology over time. The focus of this work is the development of a robust Similarity Measurement (SM) algorithm to address this problem. Our work uses a side-scan sonar image library for experimentation and testing. Variations in an underwater vehicle's height above the seafloor and in its pitch and roll angles distort the data obtained, so transformations to align the data must include rotation, translation, anisotropic scaling and skew. To deal with these problems, we propose to use the wavelet transform for similarity detection. Wavelets have been widely used in image processing over the last three decades. Since the wavelet transform allows a multi-resolution decomposition, it is easier to identify the similarities between two images by examining the energy distribution at each decomposition level. The energy distribution in the frequency domain at the output of the high-pass and low-pass filter banks identifies the texture discrimination. Our approach uses a statistical framework in which the wavelet coefficients are fitted to a generalized Gaussian density distribution. The Kullback-Leibler entropy metric is then used to measure the distance between wavelet coefficient distributions. To select the top N most likely matching images, the database images are ranked by minimum Kullback-Leibler distance. The statistical approach is effective in eliminating rotation, mis-registration and skew problems by working in the wavelet domain. It is recommended that further work focus on choosing the best wavelet packet to increase the robustness of the algorithm developed in this thesis.
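The statistical pipeline the thesis describes, fitting each wavelet subband to a generalized Gaussian and comparing images by Kullback-Leibler distance, can be sketched compactly. Moment matching and the closed-form divergence below follow Do and Vetterli's well-known texture-retrieval formulation; the wavelet, decomposition level and names are our assumptions:

```python
import numpy as np
import pywt
from scipy.special import gamma
from scipy.optimize import brentq

def fit_ggd(band):
    """Moment-matching estimate of a zero-mean generalized Gaussian's
    scale (alpha) and shape (beta) for one wavelet subband."""
    c = np.ravel(band)
    m1, m2 = np.mean(np.abs(c)), np.mean(c ** 2)
    r = m1 ** 2 / m2
    f = lambda b: gamma(2 / b) ** 2 / (gamma(1 / b) * gamma(3 / b)) - r
    beta = brentq(f, 0.05, 10.0)            # invert the moment ratio
    alpha = m1 * gamma(1 / beta) / gamma(2 / beta)
    return alpha, beta

def kld_ggd(p, q):
    """Closed-form Kullback-Leibler divergence between two zero-mean
    generalized Gaussian densities (Do & Vetterli, 2002)."""
    (a1, b1), (a2, b2) = p, q
    return (np.log((b1 * a2 * gamma(1 / b2)) / (b2 * a1 * gamma(1 / b1)))
            + (a1 / a2) ** b2 * gamma((b2 + 1) / b1) / gamma(1 / b1)
            - 1 / b1)

def signature(image, wavelet="db4", level=3):
    """One (alpha, beta) pair per detail subband of the image."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    return [fit_ggd(band) for triple in coeffs[1:] for band in triple]

def distance(sig_a, sig_b):
    """Rank database images by the sum of per-subband divergences."""
    return sum(kld_ggd(p, q) for p, q in zip(sig_a, sig_b))
```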
Application of Wavelets and Principal Component Analysis in Image Query and Mammography
Breast cancer is currently one of the major causes of death for women in the U.S. Mammography is currently the most effective method for detecting breast cancer, and early detection has proven to be an efficient tool for reducing the number of deaths. Mammography is the most demanding of all clinical imaging applications, as it requires high contrast, high signal-to-noise ratio and high resolution with minimal x-radiation. According to studies [36], 10% to 30% of women having breast cancer and undergoing mammography have negative mammograms, i.e. are misdiagnosed. Furthermore, only 20%-40% of the women who undergo biopsy have cancer. Biopsies are expensive, invasive and traumatic to the patient. The high rate of false positives is partly because of the difficulties in the diagnosis process and partly due to the fear of missing a cancer. These facts motivate research aimed at enhancing mammogram images (e.g. by enhancing features such as clustered calcification regions, which have been found to be associated with breast cancer), at providing CAD (Computer-Aided Diagnostics) tools that can alert the radiologist to potentially malignant regions in the mammograms, and at developing tools for automated classification of mammograms into benign and malignant classes. In this paper we apply wavelet and Principal Component analysis, including the approximate Karhunen-Loève transform, to mammographic images to derive feature vectors used for classifying mammographic images from an early stage of malignancy. Another area where wavelet analysis has been found useful is image query. Image query of large databases must provide a fast and efficient search for the query image. Recently, a group of researchers developed an algorithm based on wavelet analysis that provides a fast and efficient search in large databases. Their method overcomes some of the difficulties associated with previous approaches, but the search algorithm is sensitive to displacement and rotation of the query image, because wavelet analysis is not invariant under displacement and rotation. In this study we propose the integration of the Hotelling transform to remedy this sensitivity, and we provide some experimental results in the context of standard alphabetic characters.
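A minimal sketch of the proposed Hotelling (principal-axes) normalization: rotate each image into a canonical orientation before the wavelet decomposition, so the wavelet signature no longer depends on the query image's rotation. The foreground threshold and all names are our assumptions:

```python
import numpy as np
from scipy import ndimage

def hotelling_align(img, frac=0.1):
    """Rotate an image to the orientation of its principal axes
    (the Hotelling transform), so the subsequent wavelet-based query
    is insensitive to the original rotation."""
    ys, xs = np.nonzero(img > frac * img.max())   # foreground pixels
    pts = np.stack([xs, ys]).astype(float)
    pts -= pts.mean(axis=1, keepdims=True)        # center on the centroid
    w, v = np.linalg.eigh(pts @ pts.T / pts.shape[1])
    # Angle of the major axis; the sign convention depends on the image
    # coordinate system, which is enough for a canonical pose.
    angle = np.degrees(np.arctan2(v[1, -1], v[0, -1]))
    return ndimage.rotate(img, angle, reshape=False)
```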
Fast retrieval of weather analogues in a multi-petabyte meteorological archive
The European Centre for Medium-Range Weather Forecasts (ECMWF) manages
the largest archive of meteorological data in the world. At the time of writing,
it holds around 300 petabytes and grows at a rate of 1 petabyte per week. This
archive is now mature, and contains valuable datasets such as several reanalyses,
providing a consistent view of the weather over several decades.
Weather analogue is the term used by meteorologists to refer to similar weather situations.
Looking for analogues in an archive using a brute force approach requires
data to be retrieved from tape and then compared to a user-provided weather
pattern, using a chosen similarity measure. Such an operation would be slow and
costly.
In this work, a wavelet-based fingerprinting scheme is proposed to index all weather
patterns from the archive, over a selected geographical domain. The system answers
search queries by computing the fingerprint of the query pattern and looking
for close matches in the index. Searches are fast enough to be perceived as
instantaneous.
A web-based application is provided, allowing users to express their queries interactively
in a friendly and straightforward manner by sketching weather patterns
directly in their web browser. Matching results are then presented as a series of
weather maps, labelled with the date and time at which they occur.
The system has been deployed as part of the Copernicus Climate Data Store and
allows the retrieval of weather analogues from ERA5, a 40-year hourly reanalysis
dataset.
Some preliminary results of this work have been presented at the International
Conference on Computational Science 2018 (Raoult et al., 2018).
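The fingerprinting idea can be sketched in a few lines: reduce each archived field over the chosen domain to a tiny vector of coarse wavelet coefficients, index those vectors, and answer a query by nearest-neighbour search on the fingerprints alone. The wavelet choice, level and names below are our assumptions, not ECMWF's implementation:

```python
import numpy as np
import pywt

def fingerprint(field, level=3):
    """Compact fingerprint of a 2-D weather field: keep only the
    coarsest wavelet approximation over the selected domain."""
    coeffs = pywt.wavedec2(np.asarray(field, float), "haar", level=level)
    return coeffs[0].ravel()

def build_index(archive):
    """archive: iterable of (timestamp, 2-D field). Run once, offline."""
    stamps, prints = zip(*((t, fingerprint(f)) for t, f in archive))
    return list(stamps), np.stack(prints)

def search(index, sketch, k=10):
    """Return the k timestamps whose fingerprints best match the
    user-drawn pattern (Euclidean distance; brute force is fine at
    fingerprint size, a KD-tree would scale further)."""
    stamps, prints = index
    d = np.linalg.norm(prints - fingerprint(sketch), axis=1)
    return [stamps[i] for i in np.argsort(d)[:k]]
```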