Hashing for large-scale structured data classification
University of Technology Sydney, Faculty of Engineering and Information Technology. With the rapid development of the information society and the wide application of networks, extremely large volumes of data are generated every day from social networks, business transactions and other sources. In such settings, hashing technology, applied successfully, can greatly improve the performance of data management. The goal of this thesis is to develop hashing methods for large-scale structured data classification.
First, this work categorizes and reviews the current progress on hashing from a data classification perspective.
Second, new hashing schemes are proposed, each addressing different data characteristics and challenges. Because of the popularity and importance of graph and text data, this research mainly focuses on these two kinds of structured data:
1) The first method is a fast graph stream classification method using Discriminative Clique Hashing (DICH). The main idea is to employ a fast algorithm to decompose a compressed graph into a number of cliques, so that clique-patterns can be sequentially extracted over the graph stream as features. Two random hashing schemes are employed: one compresses the original edge set of the graph stream, and the other maps the unboundedly growing set of clique-patterns onto a fixed-size feature space (see the feature-hashing sketch after this list). DICH essentially speeds up the discriminative clique-pattern mining process and addresses the unbounded expansion of clique-patterns in graph stream mining;
2) The second method is an adaptive hashing approach for real-time graph stream classification (ARC-GS), built on DICH. In order to adapt to concept drifts in the graph stream, we partition the whole stream into consecutive graph chunks. A differential hashing scheme is used to map the unboundedly increasing features (cliques) onto a fixed-size feature space. At the final stage, a chunk-level weighting mechanism forms an ensemble classifier for graph stream classification. Experiments demonstrate that our method significantly outperforms existing methods;
3) The last method is Recursive Min-wise Hashing (RMH) for structured text. This study aims to quickly compute similarities between texts while preserving context information. To take the semantic hierarchy into account, it introduces a notion of “multi-level exchangeability” and employs nested sets to represent multi-level exchangeable objects. To fingerprint nested sets for fast comparison, a Recursive Min-wise Hashing (RMH) algorithm is proposed at the same computational cost as the standard min-wise hashing algorithm (a sketch of RMH also follows the list). Theoretical study and bound analysis confirm that RMH is a highly concentrated estimator.
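The fixed-size feature space in DICH and ARC-GS relies on the feature-hashing idea: each clique-pattern is hashed to one of a fixed number of buckets, so the feature vector never grows even though new patterns keep appearing in the stream. The following is a minimal sketch of that idea, assuming clique-patterns are represented as strings; the hash function, bucket count and names are illustrative rather than taken from the thesis.

```python
import hashlib

def hash_clique_features(clique_patterns, num_buckets=1024):
    """Map an unbounded stream of clique-pattern strings onto a
    fixed-size feature vector via the hashing trick (illustrative)."""
    vec = [0] * num_buckets
    for pattern in clique_patterns:
        digest = hashlib.md5(pattern.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % num_buckets  # fixed-size index, regardless of how many patterns exist
        vec[bucket] += 1
    return vec

# Example: three cliques extracted from one graph in the stream (hypothetical data)
print(hash_clique_features(["A-B-C", "B-C-D", "A-B"], num_buckets=8))
```

Collisions between distinct patterns are the price paid for the fixed dimensionality; with enough buckets they remain rare enough for classification to stay accurate.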
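For the text method, a rough picture of recursive min-wise hashing is: fingerprint the innermost sets with standard min-wise hashing, then treat those fingerprints as elements of the enclosing set and min-hash again, level by level. The sketch below illustrates this on two-level objects (documents as sets of sentences, sentences as sets of words); the hash family, signature length and example data are hypothetical, not the thesis's implementation.

```python
import hashlib

def _h(x, seed):
    """One member of a family of hash functions, indexed by seed (illustrative)."""
    return int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)

def minhash_signature(items, num_hashes=16):
    """Standard min-wise hashing: keep the minimum hash value per hash function."""
    return tuple(min(_h(it, seed) for it in items) for seed in range(num_hashes))

def rmh(obj, num_hashes=16):
    """Recursive min-wise hashing sketch: fingerprint nested sets bottom-up.
    Leaves are plain tokens; inner nodes are (frozen)sets of children."""
    if isinstance(obj, (set, frozenset)):
        children = [rmh(child, num_hashes) for child in obj]
        return minhash_signature(children, num_hashes)
    return obj  # leaf token, hashed directly inside the parent's signature

def similarity(sig_a, sig_b):
    """Estimate resemblance as the fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two-level exchangeable objects: documents as sets of sentences,
# sentences as sets of words (hypothetical example data).
doc1 = frozenset([frozenset({"data", "hashing"}), frozenset({"graph", "stream"})])
doc2 = frozenset([frozenset({"data", "hashing"}), frozenset({"text", "stream"})])
print(similarity(rmh(doc1), rmh(doc2)))
```

Because each level is itself a min-hash over fixed-length fingerprints, the total cost stays proportional to that of flat min-wise hashing, which is the point of the recursive construction.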
Real-time data analysis at the LHC: present and future
The Large Hadron Collider (LHC), which collides protons at an energy of 14
TeV, produces hundreds of exabytes of data per year, making it one of the
largest sources of data in the world today. At present it is not possible to
even transfer most of this data from the four main particle detectors at the
LHC to "offline" data facilities, much less to permanently store it for future
processing. For this reason the LHC detectors are equipped with real-time
analysis systems, called triggers, which process this volume of data and select
the most interesting proton-proton collisions. The LHC experiment triggers
reduce the data produced by the LHC to between 1/1000 and 1/100000 of its raw volume, i.e. tens of
petabytes per year, allowing its economical storage and further analysis. The
bulk of the data-reduction is performed by custom electronics which ignores
most of the data in its decision making, and is therefore unable to exploit the
most powerful known data analysis strategies. I cover the present status of
real-time data analysis at the LHC, before explaining why the future upgrades
of the LHC experiments will increase the volume of data which can be sent off
the detector and into off-the-shelf data processing facilities (such as CPU or
GPU farms) to tens of exabytes per year. This development will simultaneously
enable a vast expansion of the physics programme of the LHC's detectors, and
make it mandatory to develop and implement a new generation of real-time
multivariate analysis tools in order to fully exploit this new potential of the
LHC. I explain what work is ongoing in this direction and motivate why more
effort is needed in the coming years.
Comment: Contribution to the proceedings of the HEPML workshop NIPS 2014. 20 pages, 5 figures
Why It Takes So Long to Connect to a WiFi Access Point
Today's WiFi networks deliver a large fraction of traffic. However, the
performance and quality of WiFi networks are still far from satisfactory. Among
many popular quality metrics (throughput, latency), the probability of
successfully connecting to WiFi APs and the time cost of the WiFi connection
set-up process are two of the most critical metrics that affect WiFi users'
experience. To understand the WiFi connection set-up process in real-world
settings, we carry out measurement studies on million mobile users from
representative cities associating with million APs in billion WiFi
sessions, collected from a mobile "WiFi Manager" App that tops the Android/iOS
App market. To the best of our knowledge, we are the first to conduct such a
large-scale study on: how large the WiFi connection set-up time cost is, what factors
affect the WiFi connection set-up process, and what can be done to reduce the
WiFi connection set-up time cost. Based on the measurement analysis, we develop
a machine learning based AP selection strategy that can significantly improve
WiFi connection set-up performance, against the conventional strategy purely
based on signal strength, by reducing the connection set-up failures from
to and reducing time costs of the connection set-up
processes by more than times.
Comment: 11 pages, conference
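The "machine learning based AP selection strategy" mentioned above can be pictured as ranking candidate APs by a predicted probability of a fast, successful connection rather than by signal strength alone. Below is a minimal sketch of that idea; the feature set, training data and model choice (a logistic regression) are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-AP features: [signal strength (dBm), recent success rate,
# hour of day, number of associated clients]. Labels: 1 = set-up succeeded.
X_train = np.array([
    [-45, 0.95, 9, 3],
    [-70, 0.60, 18, 25],
    [-55, 0.80, 12, 10],
    [-80, 0.30, 21, 40],
])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

def pick_ap(candidate_aps):
    """Rank candidate APs by predicted probability of a successful
    connection set-up, instead of choosing by signal strength alone."""
    feats = np.array([ap["features"] for ap in candidate_aps])
    probs = model.predict_proba(feats)[:, 1]
    best = int(np.argmax(probs))
    return candidate_aps[best]["ssid"], probs[best]

candidates = [
    {"ssid": "CafeWiFi", "features": [-50, 0.9, 9, 5]},
    {"ssid": "MallFreeAP", "features": [-48, 0.4, 9, 60]},
]
print(pick_ap(candidates))
```

In practice such a model would be trained on large-scale session logs like those described above, rather than on a handful of hand-written rows.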
A new approach to understanding the frequency response of mineral oil
Dielectric spectroscopy is a non-invasive diagnostic method that can give information about dipole relaxation, electrical conduction and the structure of molecules. Since charge carriers in mineral oil are created not only by dissociation but also by injection from the electrodes, the injection current cannot simply be ignored. The polarization caused by charge injection has been studied in this paper. Based on our research, if the mobility of the injected charge carriers is high enough that they can reach the opposite electrode, the current caused by the injection contributes only to the imaginary part of the complex permittivity, and this part of the complex permittivity decreases with frequency with a slope of -1, which is in good agreement with the experimental results. The classic ionic drift and diffusion model and this injection model are combined into an improved model. In this paper, the frequency responses of three different kinds of mineral oil have been measured, and this modified model has been used to simulate the experimental results. Since there is only one unknown parameter in this improved model, a better understanding of the frequency response of mineral oil can be achieved.
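The slope of -1 can be seen by treating the injection current as an effective dc-like conduction term added to the dielectric response; the notation below (an injection conductivity σ_inj, vacuum permittivity ε0, angular frequency ω) is assumed for illustration and is not taken from the paper.

```latex
% Complex permittivity with an assumed dc-like injection conduction term:
\varepsilon^{*}(\omega)
  = \varepsilon'(\omega)
  - i\left[\varepsilon''_{\mathrm{dip}}(\omega)
  + \frac{\sigma_{\mathrm{inj}}}{\varepsilon_{0}\,\omega}\right]
% On a log-log plot the injection contribution to the imaginary part
% therefore falls with slope -1:
\log\!\left(\frac{\sigma_{\mathrm{inj}}}{\varepsilon_{0}\,\omega}\right)
  = -\log\omega + \log\!\left(\frac{\sigma_{\mathrm{inj}}}{\varepsilon_{0}}\right)
```

This is consistent with the statement that, when injected carriers cross the gap fast enough, their contribution appears only in the imaginary part and falls as 1/f.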
The discrete dynamics of small-scale spatial events: agent-based models of mobility in carnivals and street parades
Small-scale spatial events are situations in which elements or objects vary in such a way that temporal dynamics is intrinsic to their representation and explanation. Some of the clearest examples involve local movement, from conventional traffic modeling to disaster evacuation, where congestion, crowding, panic, and related safety issues are key features of such events. We propose that such events can be simulated using new variants of pedestrian models, which embody ideas about how behavior emerges from the accumulated interactions between small-scale objects. We present a model in which the event space is first explored by agents using “swarm intelligence”. Armed with information about the space, agents then move in an unobstructed fashion to the event. Congestion and problems over safety are then resolved by introducing controls in an iterative fashion and rerunning the model until a “safe solution” is reached. The model has been developed to simulate the effect of changing the route of the Notting Hill Carnival, an annual event held in west central London over 2 days in August each year. One of the key issues in using such simulation is how the process of modeling interacts with those who manage and control the event. As such, this changes the nature of the modeling problem from one where control and optimization is external to the model to one where it is intrinsic to the simulation.
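The iterative "rerun until a safe solution is reached" loop can be sketched in a few lines: run the crowd model, measure congestion, introduce a control, and repeat. The toy code below is only a schematic of that control loop, not the Notting Hill Carnival model itself; the agents, route segments and capacity figures are invented for illustration.

```python
import random

def run_event(route, num_agents=200, capacity=50):
    """One pass of a toy crowd model: agents pick a route segment to occupy;
    congestion is the worst ratio of segment load to segment capacity."""
    loads = {seg: 0 for seg in route}
    for _ in range(num_agents):
        loads[random.choice(route)] += 1
    return max(load / capacity for load in loads.values())

def iterate_until_safe(route, max_rounds=10, threshold=1.0):
    """Rerun the model, introducing a control (here: an extra relief segment)
    after each unsafe run, until congestion falls below the threshold."""
    route = list(route)
    for round_no in range(max_rounds):
        congestion = run_event(route)
        if congestion <= threshold:
            return route, round_no, congestion
        route.append(f"relief-{round_no}")  # the introduced control
    return route, max_rounds, congestion

final_route, rounds, congestion = iterate_until_safe(["A", "B", "C"])
print(rounds, round(congestion, 2), final_route)
```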
Regularization strategy for the layered inversion of airborne TEM data: application to VTEM data acquired over the basin of Franceville (Gabon)
Airborne transient electromagnetic (TEM) is a cost-effective method to image
the distribution of electrical conductivity in the ground. We consider layered
earth inversion to interpret large data sets covering hundreds of kilometres.
Different strategies can be used to solve this inverse problem; each consists
in managing the a priori information to avoid mathematical instability and
provide the most plausible model of conductivity at depth. In order to obtain
a fast and realistic inversion program, we tested three kinds of regularization:
two are based on the standard Tikhonov procedure, which consists in minimizing
not the data misfit function alone but a balanced optimization function with
additional terms constraining the lateral and the vertical smoothness of the
conductivity; another kind of regularization is based on reducing the condition
number of the kernel by changing the layout of layers before minimizing the
data misfit function. Finally, in order to get a more realistic distribution of
conductivity, notably by removing negative conductivity values, we suggest an
additional recursive filter based upon the inversion of the logarithm of the
conductivity. All these methods are tested on synthetic and real data sets.
Synthetic data have been calculated by 2.5D modelling; they are used to
demonstrate that these methods provide equivalent quality in terms of data
misfit and accuracy of the resulting image; the main limitation arises for
targets with sharp 2D geometries. The real data case is from
helicopter-borne TEM data acquired in the basin of Franceville (Gabon), where
borehole conductivity logs are used to show the good accuracy of the
inverted models in most areas, and some biased depths in areas where strong
lateral changes may occur.
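The "balanced optimization function with additional terms constraining the lateral and the vertical smoothness" has the generic Tikhonov form below; the symbols (forward operator F, data weighting W_d, roughness operators L_v and L_h, weights λ_v and λ_h) are standard notation assumed here rather than the paper's own, and the log-conductivity substitution reflects the recursive filter described above for keeping conductivities positive.

```latex
% Generic Tikhonov-style objective (notation assumed, not taken from the paper):
% d = observed TEM data, F = layered-earth forward operator,
% m = log(conductivity) per layer and sounding, W_d = data weighting,
% L_v, L_h = vertical and lateral roughness operators,
% \lambda_v, \lambda_h = regularization weights.
\Phi(m) \;=\; \lVert W_d\,\bigl(d - F(m)\bigr) \rVert^{2}
  \;+\; \lambda_v \lVert L_v\, m \rVert^{2}
  \;+\; \lambda_h \lVert L_h\, m \rVert^{2},
\qquad m = \log \sigma .
```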