488,385 research outputs found

    Hashing for large-scale structured data classification

    Full text link
    University of Technology Sydney. Faculty of Engineering and Information Technology. With the rapid development of the information society and the wide application of networks, enormous volumes of data are generated every day from social networks, business transactions and so on. In such settings, hashing technology, applied successfully, can greatly improve the performance of data management. The goal of this thesis is to develop hashing methods for large-scale structured data classification. First, this work categorizes and reviews current progress on hashing from a data classification perspective. Second, new hashing schemes are proposed to address different data characteristics and challenges. Because of the popularity and importance of graph and text data, this research focuses on these two kinds of structured data: 1) The first method is a fast graph stream classification method using Discriminative Clique Hashing (DICH). The main idea is to employ a fast algorithm to decompose a compressed graph into a number of cliques and to sequentially extract clique-patterns over the graph stream as features. Two random hashing schemes are employed to compress the original edge set of the graph stream and to map the ever-growing set of clique-patterns onto a fixed-size feature space, respectively. DICH essentially speeds up the discriminative clique-pattern mining process and solves the unbounded clique-pattern expansion problem in graph stream mining; 2) The second method is an adaptive hashing method for real-time graph stream classification (ARC-GS) based on DICH. To adapt to concept drifts in the graph stream, we partition the whole graph stream into consecutive graph chunks. A differential hashing scheme maps the unbounded, growing feature set (cliques) onto a fixed-size feature space. At the final stage, a chunk-level weighting mechanism forms an ensemble classifier for graph stream classification. Experiments demonstrate that our method significantly outperforms existing methods; 3) The last method is Recursive Min-wise Hashing (RMH) for structured text. Here the aim is to compute similarities between texts quickly while also preserving context information. To take semantic hierarchy into account, this study considers a notion of “multi-level exchangeability” and employs nested sets to represent multi-level exchangeable objects. To fingerprint nested sets for fast comparison, the Recursive Min-wise Hashing (RMH) algorithm is proposed at the same computational cost as the standard min-wise hashing algorithm. Theoretical study and bound analysis confirm that RMH is a highly concentrated estimator.
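    To make the nested-set fingerprinting idea concrete, here is a minimal two-level min-wise hashing sketch. The hash family, the signature length, and the choice to treat inner fingerprints as atomic elements of the outer set are illustrative assumptions, not the construction analysed in the thesis.

```python
import hashlib

NUM_HASHES = 64  # signature length; an illustrative choice, not the thesis's setting

def _h(i, x):
    """i-th hash function of the family, simulated by salting a stable hash."""
    digest = hashlib.sha1(f"{i}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(items):
    """Standard min-wise hashing: fingerprint of a flat set."""
    return tuple(min(_h(i, x) for x in items) for i in range(NUM_HASHES))

def recursive_minhash(nested):
    """Two-level sketch: fingerprint each inner set first, then min-hash the
    inner fingerprints as if they were atomic elements of the outer set."""
    inner_fps = [minhash(inner) for inner in nested]
    return tuple(min(_h(i, fp) for fp in inner_fps) for i in range(NUM_HASHES))

def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing signature components estimates the set similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Example: two documents as sets of sentences, each sentence a set of words.
doc1 = [{"hashing", "graph", "stream"}, {"text", "similarity"}]
doc2 = [{"hashing", "graph", "chunk"}, {"text", "similarity"}]
print(estimated_similarity(recursive_minhash(doc1), recursive_minhash(doc2)))
```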

    Real-time data analysis at the LHC: present and future

    Full text link
    The Large Hadron Collider (LHC), which collides protons at an energy of 14 TeV, produces hundreds of exabytes of data per year, making it one of the largest sources of data in the world today. At present it is not possible to even transfer most of this data from the four main particle detectors at the LHC to "offline" data facilities, much less to permanently store it for future processing. For this reason the LHC detectors are equipped with real-time analysis systems, called triggers, which process this volume of data and select the most interesting proton-proton collisions. The LHC experiment triggers reduce the data produced by the LHC by between 1/1000 and 1/100000, to tens of petabytes per year, allowing its economical storage and further analysis. The bulk of the data-reduction is performed by custom electronics which ignores most of the data in its decision making, and is therefore unable to exploit the most powerful known data analysis strategies. I cover the present status of real-time data analysis at the LHC, before explaining why the future upgrades of the LHC experiments will increase the volume of data which can be sent off the detector and into off-the-shelf data processing facilities (such as CPU or GPU farms) to tens of exabytes per year. This development will simultaneously enable a vast expansion of the physics programme of the LHC's detectors, and make it mandatory to develop and implement a new generation of real-time multivariate analysis tools in order to fully exploit this new potential of the LHC. I explain what work is ongoing in this direction and motivate why more effort is needed in the coming years. Comment: Contribution to the proceedings of the HEPML workshop NIPS 2014. 20 pages, 5 figures
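    As a rough consistency check of the rates quoted above (the 300 EB/year figure below is an assumed round number standing in for "hundreds of exabytes", not a value taken from the text):

```python
# Assumed raw rate: "hundreds of exabytes" taken as ~300 EB/year = 300,000 PB/year.
raw_pb_per_year = 300 * 1000
for reduction in (1e-3, 1e-4, 1e-5):        # the quoted 1/1000 .. 1/100000 range
    kept = raw_pb_per_year * reduction
    print(f"reduction 1/{int(round(1 / reduction))}: ~{kept:,.0f} PB/year kept")
# Output spans ~300 down to ~3 PB/year, bracketing the quoted "tens of petabytes".
```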

    Why It Takes So Long to Connect to a WiFi Access Point

    Full text link
    Today's WiFi networks deliver a large fraction of traffic. However, the performance and quality of WiFi networks are still far from satisfactory. Among the many popular quality metrics (throughput, latency), the probability of successfully connecting to WiFi APs and the time cost of the WiFi connection set-up process are two of the most critical metrics affecting WiFi users' experience. To understand the WiFi connection set-up process in real-world settings, we carry out measurement studies on 5 million mobile users from 4 representative cities associating with 7 million APs in 0.4 billion WiFi sessions, collected from a mobile "WiFi Manager" App that tops the Android/iOS App market. To the best of our knowledge, we are the first to conduct such a large-scale study of: how large the WiFi connection set-up time cost is, what factors affect the WiFi connection set-up process, and what can be done to reduce the WiFi connection set-up time cost. Based on the measurement analysis, we develop a machine learning based AP selection strategy that significantly improves WiFi connection set-up performance over the conventional strategy based purely on signal strength, reducing connection set-up failures from 33% to 3.6% and reducing the time costs of 80% of the connection set-up processes by more than 10 times. Comment: 11 pages, conference
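    The contrast between the conventional strategy and a learned one can be sketched as follows; the CandidateAP record, the context features, and the predict_success callable are hypothetical placeholders rather than the paper's actual features or classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class CandidateAP:
    bssid: str
    signal_dbm: float
    context: Dict[str, float] = field(default_factory=dict)  # hypothetical features
                                                              # (hour, AP load, past failures, ...)

def select_by_signal(candidates: List[CandidateAP]) -> CandidateAP:
    """Conventional strategy: connect to the AP with the strongest signal."""
    return max(candidates, key=lambda ap: ap.signal_dbm)

def select_by_model(candidates: List[CandidateAP],
                    predict_success: Callable[[Dict[str, float]], float]) -> CandidateAP:
    """Learned strategy: connect to the AP with the highest predicted probability of a
    fast, successful connection set-up. predict_success stands for a trained classifier
    supplied by the caller; the features it consumes here are assumptions."""
    return max(candidates,
               key=lambda ap: predict_success({**ap.context, "signal_dbm": ap.signal_dbm}))
```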

    A new approach to understanding the frequency response of mineral oil

    No full text
    Dielectric spectroscopy is a non-invasive diagnostic method that can give information about dipole relaxation, electrical conduction and the structure of molecules. Since charge carriers in mineral oil are created not only by dissociation but also by injection from the electrodes, the injection current cannot simply be ignored. The polarization caused by charge injection has been studied in this paper. Based on our research, if the mobility of the injected charge carriers is high enough that they can reach the opposite electrode, the current caused by the injection contributes only to the imaginary part of the complex permittivity, and this part of the complex permittivity decreases with frequency with a slope of -1, in good agreement with the experimental result. The classic ionic drift and diffusion model and this injection model are combined into an improved model. In this paper, the frequency responses of three different kinds of mineral oil have been measured, and the modified model has been used to simulate the experimental results. Since there is only one unknown parameter in this improved model, a better understanding of the frequency response of mineral oil can be achieved.
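    The slope of -1 mentioned above is what a conduction-like contribution predicts. Written out explicitly (a standard dielectric relation consistent with the description here; the symbol σ_inj for the effective conductivity of the injected carriers is introduced only for illustration):

```latex
\[
% Injection current behaving like a conduction term adds only to the
% imaginary permittivity and falls off as 1/\omega:
\varepsilon''(\omega) = \frac{\sigma_{\mathrm{inj}}}{\varepsilon_0\,\omega},
\qquad
\log \varepsilon''(\omega) = \log\frac{\sigma_{\mathrm{inj}}}{\varepsilon_0} - \log\omega ,
\]
```

    so on a log-log plot the imaginary part decreases with frequency with a slope of -1 while the real part is left unchanged.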

    The discrete dynamics of small-scale spatial events: agent-based models of mobility in carnivals and street parades

    Get PDF
    Small-scale spatial events are situations in which elements or objects vary in such a way that temporal dynamics is intrinsic to their representation and explanation. Some of the clearest examples involve local movement, from conventional traffic modeling to disaster evacuation, where congestion, crowding, panic, and related safety issues are key features of such events. We propose that such events can be simulated using new variants of pedestrian models, which embody ideas about how behavior emerges from the accumulated interactions between small-scale objects. We present a model in which the event space is first explored by agents using "swarm intelligence". Armed with information about the space, agents then move in an unobstructed fashion to the event. Congestion and problems over safety are then resolved by introducing controls in an iterative fashion and rerunning the model until a "safe solution" is reached. The model has been developed to simulate the effect of changing the route of the Notting Hill Carnival, an annual event held in west central London over 2 days in August each year. One of the key issues in using such simulation is how the process of modeling interacts with those who manage and control the event. As such, this changes the nature of the modeling problem from one where control and optimization is external to the model to one where it is intrinsic to the simulation.
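    The iterative "run, measure congestion, add a control, rerun" procedure described above can be sketched as follows; the callables, the density measure, and the safety threshold are placeholders, not the published model.

```python
def find_safe_configuration(event_space, agents, simulate, add_control, max_density):
    """Rerun the agent-based pedestrian simulation, adding crowd controls
    (e.g. closing or rerouting a street segment) until no location exceeds
    the assumed safe-density threshold. All arguments are placeholders:
    simulate(space, agents, controls) -> {location: peak_density}."""
    controls = []
    while True:
        densities = simulate(event_space, agents, controls)
        hotspots = [loc for loc, d in densities.items() if d > max_density]
        if not hotspots:
            return controls                       # a "safe solution" has been reached
        worst = max(hotspots, key=lambda loc: densities[loc])
        controls.append(add_control(worst))       # intervene at the worst hotspot, then rerun
```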

    Regularization strategy for the layered inversion of airborne TEM data: application to VTEM data acquired over the basin of Franceville (Gabon)

    Full text link
    Airborne transient electromagnetic (TEM) surveying is a cost-effective method to image the distribution of electrical conductivity in the ground. We consider layered-earth inversion to interpret large data sets covering hundreds of kilometres. Different strategies can be used to solve this inverse problem; they consist in managing the a priori information so as to avoid mathematical instability and to provide the most plausible model of conductivity at depth. In order to obtain a fast and realistic inversion program, we tested three kinds of regularization: two are based on the standard Tikhonov procedure, which consists in minimizing not only the data misfit function but a balanced objective function with additional terms constraining the lateral and vertical smoothness of the conductivity; the third reduces the condition number of the kernel by changing the layout of the layers before minimizing the data misfit function. Finally, in order to obtain a more realistic distribution of conductivity, notably by removing negative conductivity values, we suggest an additional recursive filter based on inverting the logarithm of the conductivity. All these methods are tested on synthetic and real data sets. The synthetic data were calculated by 2.5D modelling; they are used to demonstrate that these methods provide equivalent quality in terms of data misfit and accuracy of the resulting image, the main limitation arising for targets with sharp 2D geometries. The real data case uses helicopter-borne TEM data acquired in the basin of Franceville (Gabon), where borehole conductivity logs show the good accuracy of the inverted models in most areas, and some biased depths in areas where strong lateral changes may occur.
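    The Tikhonov-type strategies described above amount to minimizing an objective of the generic form below (the weighting matrix and roughness operators are the usual generic choices, not necessarily the authors' exact parameterisation), with the logarithmic change of variable that enforces positive conductivities:

```latex
\[
% Layered-earth inversion: data misfit balanced against vertical and lateral
% smoothness of the conductivity model m (Tikhonov-type regularization).
\Phi(m) = \left\| W_d\,\bigl(d - F(m)\bigr) \right\|^2
        + \lambda_v \left\| L_v\, m \right\|^2
        + \lambda_h \left\| L_h\, m \right\|^2 ,
\qquad
m_i = \log \sigma_i \;\Rightarrow\; \sigma_i = e^{m_i} > 0 .
\]
```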