
    ROLE OF DATA MINING IN E-GOVERNMENT FRAMEWORK

    In e-government, mining techniques are considered a procedure for extracting data from related web applications and converting it into useful knowledge. In addition, different mining methods can be applied to different kinds of government data. The main aim of this paper is to provide a comprehensive study of previous research work on improving the speed of queries that access the database and on obtaining specific predictions. The provided study compares data mining methods, database management, and types of data. Moreover, a proposed model is introduced that puts these different methods together to improve online applications. These applications provide the ability to retrieve information, match keywords, index the database, and perform predictions over vast amounts of data.

    Frequent itemset mining on multiprocessor systems

    Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web-mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, there have been many frequent-itemset mining algorithms proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads as well as other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism.

    In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data's size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined.

    For coping with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even when single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms used for mining other types of itemsets.
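    The fourth hot spot named above, intersecting sorted lists of integers, is central to support counting over a vertical (TID-list) layout. As a minimal Python sketch of that building block (ours, not the thesis's optimized implementation; all names and the toy database below are illustrative):

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted integer lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def support(tidlists, itemset):
    """Support of an itemset = size of the intersection of its TID-lists."""
    lists = sorted((tidlists[item] for item in itemset), key=len)
    acc = lists[0]                    # start from the shortest list
    for lst in lists[1:]:
        acc = intersect_sorted(acc, lst)
    return len(acc)

# Vertical database: item -> sorted list of transaction IDs containing it.
tidlists = {"a": [0, 1, 3, 4], "b": [1, 2, 3], "c": [0, 1, 3]}
print(support(tidlists, ("a", "b", "c")))  # transactions {1, 3} -> 2
```

    Intersecting the shortest lists first keeps intermediate results small; these tight, data-parallel loops are also what makes the intersection hot spot a natural target for vectorization and multithreading.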

    Efficient query processing for scalable web search

    Search engines are exceptionally important tools for accessing information in today's world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also covering the latest trends in the literature on efficient query processing, including coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists, as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trend of applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query trade-offs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time search, and energy-efficient, modern hardware and software architectures.
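    As a minimal sketch of the document-at-a-time strategy the survey reviews (the toy in-memory index, additive scoring, and all names here are our own illustrative assumptions, not the survey's code):

```python
import heapq

# Toy impact-scored index: term -> list of (docid, score) sorted by docid.
index = {
    "web":    [(1, 0.4), (3, 0.9), (7, 0.2)],
    "search": [(1, 0.5), (2, 0.1), (3, 0.6)],
}

def daat_topk(query_terms, k=2):
    """Merge posting lists by docid, score each doc once, keep a top-k heap."""
    cursors = {t: iter(index[t]) for t in query_terms if t in index}
    heads = {t: next(c, None) for t, c in cursors.items()}
    topk = []  # min-heap of (score, docid)
    while any(h is not None for h in heads.values()):
        doc = min(h[0] for h in heads.values() if h is not None)
        score = 0.0
        for t, h in heads.items():
            if h is not None and h[0] == doc:      # term occurs in this doc
                score += h[1]
                heads[t] = next(cursors[t], None)  # advance this cursor
        heapq.heappush(topk, (score, doc))
        if len(topk) > k:
            heapq.heappop(topk)                    # drop current lowest score
    return sorted(topk, reverse=True)

print(daat_topk(["web", "search"]))  # [(1.5, 3), (0.9, 1)]
```

    This exhaustive loop scores every matching document; WAND-style dynamic pruning extends it by tracking a per-term score upper bound and skipping ahead past documents whose bound cannot beat the current top-k threshold.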

    Frequent itemset mining in big data with effective single scan algorithms

    This paper considers frequent itemset mining in transactional databases. It introduces a new accurate single-scan approach for frequent itemset mining (SSFIM), a heuristic as an alternative approach (EA-SSFIM), as well as a parallel implementation on Hadoop clusters (MR-SSFIM). EA-SSFIM and MR-SSFIM target sparse and big databases, respectively. The proposed approach (in all its variants) requires only one scan to extract the candidate itemsets, and it has the advantage of generating a fixed number of candidate itemsets independently of the value of the minimum support. This accelerates the scan process compared with existing approaches when dealing with sparse and big databases. Numerical results show that SSFIM outperforms the state-of-the-art FIM approaches when dealing with medium and large databases. Moreover, EA-SSFIM provides similar performance to SSFIM while considerably reducing the runtime for large databases. The results also reveal the superiority of MR-SSFIM compared with existing HPC-based solutions for FIM on sparse and big databases.
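    As we read the abstract, the single-scan idea can be illustrated with a short sketch (ours, not the paper's SSFIM implementation): enumerate each transaction's itemsets during the one scan, count them in a hash table, and apply the minimum support only afterwards, so candidate generation does not depend on the minsup value:

```python
from itertools import combinations
from collections import Counter

def single_scan_fim(transactions, minsup):
    """One pass over the database; minsup is applied only after the scan.
    Per-transaction subset enumeration is only practical for short transactions."""
    counts = Counter()
    for t in transactions:                      # exactly one database scan
        items = sorted(set(t))
        for r in range(1, len(items) + 1):
            for itemset in combinations(items, r):
                counts[itemset] += 1
    # Support filtering happens here, independently of candidate generation.
    return {s: c for s, c in counts.items() if c >= minsup}

db = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "b", "c"]]
print(single_scan_fim(db, minsup=3))
# {('a',): 3, ('b',): 4, ('c',): 3, ('a', 'b'): 3, ('b', 'c'): 3}
```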

    Intersection caching techniques in search engines

    Search engines process huge amounts of data (i.e. web pages) to build sophisticated structures that support search. The ever-growing number of users and amount of information available on the web impose performance challenges, and query processing consumes a significant amount of computational resources. In this context, many optimization techniques are required to cope with heavy query traffic. One of the most important mechanisms to address this kind of problem is caching, which basically consists of keeping in memory previously used items, based on their frequency, recency or cost patterns.

    A typical search engine architecture is composed of a front-end node (broker) that provides the user interface and a cluster of a large number of search nodes that store the data and support search. In this context, caching can be implemented at several levels. At the broker level, a results cache is generally maintained; it stores the final lists of results corresponding to the most frequent or recent queries. A posting list cache is implemented at each search node, which simply stores in a memory buffer the posting lists of popular or valuable terms, to avoid accessing disk. Complementarily, an intersection cache may be implemented to obtain additional performance gains. This cache attempts to exploit frequently co-occurring pairs of terms by keeping the results of intersecting the corresponding inverted lists in the memory of the search node (saving not only disk access time but CPU time as well). All cache types are managed using a so-called "replacement policy" that decides whether to evict some entry from the cache in the case of a cache miss. This policy ideally evicts entries that are unlikely to be hit again or that are expected to provide less benefit.

    In this thesis we focus on the intersection cache level using cost-aware caching policies, that is, policies that take into account the cost of the queries when evicting an item from the cache. We carefully design, analyze and evaluate static, dynamic and hybrid cost-aware eviction policies, adapting some cost-aware policies from other domains to ours. These policies are combined with different query-evaluation strategies, some of them originally designed to take advantage of the existence of an intersection cache by reordering the query terms in a particular way. We also explore the possibility of reserving the whole memory space allocated to caching at the search nodes for a new integrated cache, and propose a static cache (namely, the Integrated Cache) that replaces both the posting list and intersection caches. We design a specific cache management strategy that avoids the duplication of cached terms, and we also consider different strategies to populate the integrated cache. Finally, we design and evaluate an admission policy for the intersection cache based on Machine Learning principles that minimizes the caching of pairs of terms that do not provide enough benefit. We model this as a classification problem with the aim of identifying those pairs of terms that appear infrequently in the query stream. The experimental results from applying all the proposed approaches using real datasets in a simulation framework show that we obtain useful improvements over the baseline, even in this already hyper-optimized field.
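    As a rough illustration of a cost-aware replacement policy of the kind the thesis adapts (this is generic GreedyDual-style bookkeeping under our own simplified assumptions, not one of the thesis's specific policies), entries compete on a priority derived from the cost of recomputing them:

```python
import heapq

class CostAwareCache:
    """Evict the entry with the lowest priority, where
    priority = global inflation + cost of recomputing the entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}     # key -> [value, cost, priority]
        self.inflation = 0.0  # rises to the victim's priority on eviction
        self.heap = []        # (priority, key); stale entries skipped lazily

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[2] = self.inflation + entry[1]   # re-credit full cost on a hit
        heapq.heappush(self.heap, (entry[2], key))
        return entry[0]

    def put(self, key, value, cost):
        while len(self.entries) >= self.capacity:
            prio, victim = heapq.heappop(self.heap)
            if victim in self.entries and self.entries[victim][2] == prio:
                del self.entries[victim]       # true lowest-priority entry
                self.inflation = prio          # age the remaining entries
        prio = self.inflation + cost
        self.entries[key] = [value, cost, prio]
        heapq.heappush(self.heap, (prio, key))

# Cache intersections keyed by term pairs; cost ~ the time a hit would save.
cache = CostAwareCache(capacity=2)
cache.put(("new", "york"), [2, 5, 9], cost=8.0)
cache.put(("data", "mining"), [1, 4], cost=3.0)
cache.put(("web", "search"), [7], cost=5.0)    # evicts the cheapest pair
print(("data", "mining") in cache.entries)     # False
```

    The inflation term ages resident entries so that a once-expensive but never-reused intersection cannot occupy memory forever, which is precisely the failure mode a purely cost-based policy would have.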

    Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence

    Energy efficiency is critical for running computer vision on battery-powered systems, such as mobile phones or UAVs (unmanned aerial vehicles, or drones). This book collects the methods that have won the annual IEEE Low-Power Computer Vision Challenges since 2015. The winners share their solutions and provide insight on how to improve the efficiency of machine learning systems.

    Introspective knowledge acquisition for case retrieval networks in textual case base reasoning.

    Textual Case Based Reasoning (TCBR) aims at effective reuse of the information contained in unstructured documents. The key advantage of TCBR over traditional Information Retrieval systems is its ability to incorporate domain-specific knowledge to facilitate case comparison beyond simple keyword matching. However, substantial human intervention is needed to acquire and transform this knowledge into a form suitable for a TCBR system. In this research, we present automated approaches that exploit statistical properties of document collections to alleviate this knowledge acquisition bottleneck. We focus on two important knowledge containers: relevance knowledge, which shows the relatedness of features to cases, and similarity knowledge, which captures the relatedness of features to each other. The terminology is derived from the Case Retrieval Network (CRN) retrieval architecture in TCBR, which is used as the underlying formalism in this thesis, applied to text classification.

    Concepts generated by Latent Semantic Indexing (LSI) are a useful resource for relevance knowledge acquisition for CRNs. This thesis introduces a supervised LSI technique called sprinkling that exploits class knowledge to bias LSI's concept generation. An extension of this idea, called Adaptive Sprinkling (AS), has been proposed to handle inter-class relationships in complex domains like hierarchical (e.g. Yahoo directory) and ordinal (e.g. product ranking) classification tasks. Experimental evaluation results show the superiority of CRNs created with sprinkling and AS, not only over LSI on its own, but also over state-of-the-art classifiers like Support Vector Machines (SVM). Current statistical approaches based on feature co-occurrences can be utilized to mine similarity knowledge for CRNs. However, related words often do not co-occur in the same document, though they co-occur with similar words. We introduce an algorithm to efficiently mine such indirect associations, called higher-order associations. Empirical results show that CRNs created with the acquired similarity knowledge outperform both LSI and SVM.

    Incorporating the acquired knowledge into the CRN transforms it into a densely connected network. While improving retrieval effectiveness, this has the unintended effect of slowing down retrieval. We propose a novel retrieval formalism called the Fast Case Retrieval Network (FCRN) which eliminates redundant run-time computations to improve retrieval speed. Experimental results show FCRN's ability to scale up over high-dimensional textual casebases. Finally, we investigate novel ways of visualizing and estimating the complexity of textual casebases that can help explain performance differences across casebases. Visualization provides a qualitative insight into the casebase, while complexity is a quantitative measure that characterizes the classification or retrieval hardness intrinsic to a dataset. We study correlations of the experimental results from the proposed approaches against complexity measures over diverse casebases.
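    As a small sketch of the sprinkling idea described above (the matrix sizes, sprinkle weight, and function names are our own illustrative assumptions, not the thesis's implementation): class-label pseudo-terms are appended to the term-document matrix before the SVD, biasing LSI's latent dimensions toward class structure.

```python
import numpy as np

def sprinkle_lsi(tdm, labels, n_classes, k, weight=1.0):
    """tdm: terms x docs matrix; labels: class id per (training) doc."""
    # One sprinkled pseudo-term row per class, "occurring" in that class's docs.
    sprinkle = np.zeros((n_classes, tdm.shape[1]))
    sprinkle[labels, np.arange(tdm.shape[1])] = weight
    augmented = np.vstack([tdm, sprinkle])
    # Rank-k LSI on the augmented matrix.
    U, S, Vt = np.linalg.svd(augmented, full_matrices=False)
    doc_vecs = (S[:k, None] * Vt[:k]).T    # documents in the latent space
    term_vecs = U[:tdm.shape[0], :k]       # drop the sprinkled rows afterwards
    return term_vecs, doc_vecs

# 4 terms x 3 docs, with the docs labelled into 2 classes.
tdm = np.array([[2., 0., 1.], [0., 3., 0.], [1., 0., 2.], [0., 1., 0.]])
term_vecs, doc_vecs = sprinkle_lsi(tdm, labels=np.array([0, 1, 0]),
                                   n_classes=2, k=2)
print(doc_vecs.shape)  # (3, 2)
```

    The sprinkled rows are discarded after the decomposition; their only purpose is to pull documents of the same class together in the reduced space that the CRN then uses for case comparison.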