3,465 research outputs found

    Signature File Hashing Using Term Occurrence and Query Frequencies

    Get PDF
    Signature files act as a filter on retrieval to discard a large number of non-qualifying data items. Linear hashing with superimposed signatures (LHSS) provides an effective retrieval filter to process queries in dynamic databases. This study is an analysis of the effects of reflecting the term occurrence and query frequencies to signatures in LHSS. This approach relaxes the unrealistic uniform frequency assumption and lets the terms with high discriminatory power set more bits in signatures. The simulation experiments based on the derived formulas explore the amount of page savings with different occurrence and query frequency combinations at different hashing levels. The results show that the performance of LHSS improves with the hashing level and the larger is the difference between the term discriminatory power values of the terms, the higher is the retrieval efficiency. The paper also discusses the benefits of this approach to alleviate the imbalance between the levels of efficiency and relevancy in unrealistic uniform frequency assumption case

    Dynamic Signature File Partitioning Based on Term Characteristics

    Get PDF
    Signature files act as a filter on retrieval to discard a large number of non-qualifying data items. Linear hashing with superimposed signatures (LHSS) provides an effective retrieval filter to process queries in dynamic databases. This study is an analysis of the effects of reflecting the term query and occurrence characteristics to signatures in LHSS. This approach relaxes the unrealistic uniform frequency assumption and lets the terms with high discriminatory power set more bits in signatures. The simulation experiments based on the derived formulas show that incorporating the term characteristics in LHSS improves retrieval efficiency. The paper also discusses the further benefits of this approach to alleviate the potential imbalance between the levels of efficiency and relevancy

    Analysis of Multiterm Queries in Partitioned Signature File Environments

    Get PDF
    The concern of this study is the signature files which are used for information storage and retrieval in both formatted and unformatted databases. The analysis combines the concerns of signature extraction and signature file organization which have usually been treated as separate issues. Both the uniform frequency and single term query assumptions are relaxed and a comprehensive analysis is presented for multiterm query environments where terms can be classified based on their query and database occurrence frequencies. The performance of three superimposed signature generation schemes is explored as they are applied to a dynamic signature file organization based on linear hashing: Linear Hashing with Superimposed Signatures (LHSS). First scheme (SM) allows all terms set the same number of bits regardless of their discriminatory power whereas the second and third methods (MMS and MMM) emphasize the terms with high query and low database ooccurrence frequencies. Of these three schemes, only MMM takes the probability distribution of the number of query terms into account in finding the optimal mapping strategy. The main contribution of the study is the derivation of the performance evaluation formulas which is provided together with the analysis of various experimental settings. Results indicate that MMM outperforms the other methods as the gap between the discriminatory power of the terms gets larger. The absolute value of the savings provided by MMM reaches a maximum for the high query weight case. However, the extra savings decline sharply for high weight and moderately for the low weight queries with the increase in database size. The applicability of the derivations to other partitioned signature organizations is discussed and a detailed analysis of Fixed Prefix Partitioning (FPP) is provided as an example. An approximate formula that is shown to estimate the performance of both FPP and LHSS within an acceptable margin of error is also modified to account for the multiterm case

    Analysis of Signature Generation Schemes for Multiterm Queries In Linear Hashing with Superimposed Signatures

    Get PDF
    Signature files provide efficient retrieval of data by reflecting the essence of the data objects into bit patterns. Our analysis explores the performance of three superimposed signature generation schemes as they are applied to a dynamic signature file organization based on linear hashing: Linear Hashing with Superimposed Signatures (LHSS). The first scheme (SM) allows all terms set the same number of bits whereas the second and third schemes (MMS aid MMM) emphasize the terms with high discriminatory power. In addition, MMM considers the probability distribution of the number of query terms. The main contribution of the study is a detailed analysis of LHSS in multiterm query environments by incorporating the term discrimination values based on document and query frequencies. The approach of the study can also be extended to other signature file access methods based on partitioning. The derivation of the performance evaluation formulas, the simulation results based on these formulas for various experimental settings, and the implementation results based on INSPEC and NPL text databases are provided. Results indicate that MMM and MMS outperform SM in all cases in terms of access savings, especially when terms become more distinctive. MMM slightly outperforms MMS in high weight and low weight query cases. The performance gap among all three schemes decreases as the database size increases, and as the signature size increases the performances of MMM and MMS decrease and converge to that of the SM scheme when the hashing level is fixed

    Analysis of Signature Generation Schemes for Multiterm Queries In Partitioned Signature File Environments

    Get PDF
    Our analysis explores the performance of three superimposed signature generation schemes as they are applied to a dynamic sigrtature file organization based on linear hashing: Linear Hashing with Superinzposed Signatures (LHSS). First scheme (SM) allows all terms set the same number of bits whereas the second and third methods (MMS and MMM) emphasize the terms with hlgh discriminatory power. In addition, M Mco nsiders the probaOiZity distribution of the number of query terms. The main contribution of the study is the combination of signature generation and signature file organization concepts together with the relaxation of the single term query and uniform frequency assumptions. The derivation of the performance evaluation formulas are provided as well as the analysis of various experimental settings. Results indicate that MMM outperforms the others as terms become more distinctive in their discriminatory power. MMM accomplishes the highest savings in retrieval eficiency for the high query weight case. We also discuss the applicability of the derivations to other partitioned signature organizations providing a detailed analysis of Fixed Prefix Partitioning (FPP) as an example. Finally, an appro.ximate perfortnance evaluation formula that works for both FPP and LHSS is modijied to account for the multiterm case

    Signature Files: An Integrated Access Method for Formatted and Unformatted Databases

    Get PDF
    The signature file approach is one of the most powerful information storage and retrieval techniques which is used for finding the data objects that are relevant to the user queries. The main idea of all signature based schemes is to reflect the essence of the data items into bit pattern (descriptors or signatures) and store them in a separate file which acts as a filter to eliminate the non aualifvine data items for an information reauest. It provides an integrated access method for both formattid and formatted databases. A complative overview and discussion of the proposed signatnre generation methods and the major signature file organization schemes are presented. Applications of the signature techniques to formatted and unformatted databases, single and multiterm query cases, serial and paratlei architecture. static and dynamic environments are provided with a special emphasis on the multimedia databases where the pioneering prototype systems using signatnres yield highly encouraging results

    Service Discovery from Uniform Resource Locators of Monitored Web Applications

    Get PDF
    Käesolev magistritöö käsitleb veebiserveri sissetulevate HTTP päringute URL-ide analüüsimist veebiteenuste tuvastamise eesmärgiga.\n\rProbleem on aktuaalne veebirakenduste monitooringutööriistade seisukohalt, kuna sissetulevad HTTP päringud on vaja omavahel loogiliselt grupeerida selleks, et edasihinnata ning jälgida teenuse vasteaega.\n\rUurimistöö keskendub Java veebirakendustele ning analüüsib URLide andmehulka, mis on saadudud monitoorimistarkvarast Plumbr.\n\rKui monitooritav veebirakendus on realiseeritud mõne Plumbri jaoks tuntud veebiraamistiku abil (näiteks Spring), siis on võimalik seda rakendust instrumenteerida selliselt, et teenuse nimi on üheselt määratletav. Kui aga tegemist on Plumbri jaoks tundmatu veebiraamistikuga, siis ainuke sissetuleva päringu kirjeldus on selle päringu URL.\n\rKui URLis sisalduvad dünaamilised parameetrid, siis sama teenust kasutavad päringud on erinevad ja neid ei ole võimalik ainult URLi põhiselt grupeerida.\n\rKäesolev magistritöö pakub välja URLi analüüsil baseeruva grupeerimise lahenduse. Lahendus tükeldab URLi, eraldab sealt sõnede ahelad, mida seejärel analüüsib kasutades loomuliku keele töötlemise ning graafitransformeerimise tehnikaid.\n\rPakutav teenuste tuvastuse lahendus on teostatud kasutades Java ja Groovy programmeerimiskeeli, hinnatud andmehulgal mis koosneb üle 400 000 URList ning on integreeritud Plumbr monitooringutarkvarasse.This thesis addresses the problem of analyzing Uniform Resource Locators (URLs) of incoming Hypertext Transfer Protocol (HTTP) requests in a Web application server in order to discover the services provided by the applications hosted by the application server, and to group these applications according to the services they provide. The thesis investigates this problem in the context of the Plumbr Java performance monitoring tool. When the hosted applications are implemented using a known web framework (e.g. Spring), the service name and associated data, such as URL parameters, can be extracted directly from the controller. However, this controller-based service discovery approach, which is currently implemented in Plumbr, is not applicable when the hosted applications use unknown framework. This research addresses the problem in this latter more general setting.\n\rThe thesis proposes a pure URL-based approach, where the observed URLs are parsed, leading to sequences of tokens, which are then analyzed using natural language processing techniques and graph transformations. The proposed service discovery technique has been implemented in Groovy and Java, integrated into the Plumbr tool and evaluated on data extracted from production server covering over 400K URLs

    Image Annotation and Topic Extraction Using Super-Word Latent Dirichlet

    Get PDF
    This research presents a multi-domain solution that uses text and images to iteratively improve automated information extraction. Stage I uses local text surrounding an embedded image to provide clues that help rank-order possible image annotations. These annotations are forwarded to Stage II, where the image annotations from Stage I are used as highly-relevant super-words to improve extraction of topics. The model probabilities from the super-words in Stage II are forwarded to Stage III where they are used to refine the automated image annotation developed in Stage I. All stages demonstrate improvement over existing equivalent algorithms in the literature

    A Multi-view Context-aware Approach to Android Malware Detection and Malicious Code Localization

    Full text link
    Existing Android malware detection approaches use a variety of features such as security sensitive APIs, system calls, control-flow structures and information flows in conjunction with Machine Learning classifiers to achieve accurate detection. Each of these feature sets provides a unique semantic perspective (or view) of apps' behaviours with inherent strengths and limitations. Meaning, some views are more amenable to detect certain attacks but may not be suitable to characterise several other attacks. Most of the existing malware detection approaches use only one (or a selected few) of the aforementioned feature sets which prevent them from detecting a vast majority of attacks. Addressing this limitation, we propose MKLDroid, a unified framework that systematically integrates multiple views of apps for performing comprehensive malware detection and malicious code localisation. The rationale is that, while a malware app can disguise itself in some views, disguising in every view while maintaining malicious intent will be much harder. MKLDroid uses a graph kernel to capture structural and contextual information from apps' dependency graphs and identify malice code patterns in each view. Subsequently, it employs Multiple Kernel Learning (MKL) to find a weighted combination of the views which yields the best detection accuracy. Besides multi-view learning, MKLDroid's unique and salient trait is its ability to locate fine-grained malice code portions in dependency graphs (e.g., methods/classes). Through our large-scale experiments on several datasets (incl. wild apps), we demonstrate that MKLDroid outperforms three state-of-the-art techniques consistently, in terms of accuracy while maintaining comparable efficiency. In our malicious code localisation experiments on a dataset of repackaged malware, MKLDroid was able to identify all the malice classes with 94% average recall

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Get PDF
    Given the dynamic nature of the World Wide Web, missing web pages, or 404 Page not Found responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost, it is just missing. In whole or in part, content often moves from one URI to another and hence it just needs to be (re-)discovered. We evaluate several methods for a \justin- time approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages as well as the most salient terms derived from a page\u27s link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a work ow including a set of parameters that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this work ow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user\u27s information need. Synchronicity depends on user interaction which enables it to provide results in real time
    corecore