144 research outputs found

    CACHE DATA REPLACEMENT POLICY BASED ON RECENTLY USED ACCESS DATA AND EUCLIDEAN DISTANCE

    Data access management in web-based applications that use relational databases must be well thought out because the data grows every day. A Relational Database Management System (RDBMS) has relatively slow access speeds because the data is stored on disk, which degrades database server performance and slows response times. One strategy to overcome this is to implement caching at the application level. This paper proposes the SIMGD framework, which models Application Level Caching (ALC) to speed up relational data access in web applications. The ALC strategy maps each controller and model that accesses the database to a node-data in an in-memory database (IMDB). Not all node-data can be kept in the IMDB because of its limited capacity; the SIMGD framework therefore uses the Euclidean distance between each node-data and its top access data as the cache replacement policy. Node-data whose Euclidean distance to their top access data is smaller have higher priority to remain on the caching server. Simulation results show that, with a 25 KB cache configuration, the SIMGD framework achieves a higher hit ratio than the LRU algorithm: 6.46% versus 6.01%, respectively.
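
    The abstract leaves the exact feature vector unspecified. The sketch below is one plausible reading in Python, where each node-data is summarised by a hypothetical (recency, frequency) vector, the "top access data" is taken to be the most frequently accessed node, and the eviction victim is the node farthest from it; class and field names are illustrative, not from the paper.

```python
import math

class EuclideanDistanceCache:
    """Sketch of a replacement policy in the spirit of SIMGD: evict the
    node-data whose access-feature vector lies farthest (by Euclidean
    distance) from that of the top access data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}     # key -> cached value
        self.features = {}  # key -> (last_access, frequency); assumed features
        self.clock = 0      # logical time used for recency

    def _distance(self, a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def _touch(self, key):
        self.clock += 1
        _, freq = self.features.get(key, (0, 0))
        self.features[key] = (self.clock, freq + 1)

    def get(self, key):
        if key in self.store:
            self._touch(key)
            return self.store[key]
        return None  # miss: caller falls back to the RDBMS

    def put(self, key, value):
        if key not in self.store and len(self.store) >= self.capacity:
            # Top access data: the most frequently accessed node-data.
            top = max(self.features, key=lambda k: self.features[k][1])
            # Evict the node-data farthest from the top node's features;
            # nodes close to the hot spot keep their place in the cache.
            victim = max(self.store,
                         key=lambda k: self._distance(self.features[k],
                                                      self.features[top]))
            del self.store[victim], self.features[victim]
        self.store[key] = value
        self._touch(key)
```

    Under this reading, plain LRU is the special case that looks only at recency; folding frequency into the distance computation is what could let such a policy retain hot node-data longer, consistent with the reported hit-ratio gain.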

    Handling imperfect information in criterion evaluation, aggregation and indexing


    Keyword Search in Relational Databases: Architecture, Approaches and Considerations

    This thesis presents the various solutions proposed in the literature for applying the keyword-search paradigm to relational databases, and outlines a general architecture for defining and developing such systems. To that end, the solutions put forward by the scientific community are analysed with a focus on the individual components of the search pipeline. Finally, the experimental evaluation processes for these systems are examined.
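
    As a concrete illustration of one pipeline component, the sketch below implements keyword matching as an inverted index from terms to the tuples that contain them, with AND semantics over the query keywords. Full systems then join the matching tuples across tables into candidate networks; the schema and data here are purely illustrative.

```python
# Minimal first stage of a keyword-search pipeline over relational data:
# an inverted index from terms to (table, primary key) pairs.
from collections import defaultdict

tables = {
    "author": [(1, "Jim Gray"), (2, "Michael Stonebraker")],
    "paper": [(10, "Transaction processing"), (11, "One size fits all")],
}

index = defaultdict(set)  # term -> set of (table, primary key)
for table, rows in tables.items():
    for pk, text in rows:
        for term in text.lower().split():
            index[term].add((table, pk))

def keyword_search(query):
    """Return the tuples that match every query keyword (AND semantics)."""
    hits = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(keyword_search("transaction processing"))  # {('paper', 10)}
```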

    A Data-driven Methodology Towards Mobility- and Traffic-related Big Spatiotemporal Data Frameworks

    Human population is increasing at unprecedented rates, particularly in urban areas. This increase, along with the rise of a more economically empowered middle class, brings new and complex challenges to the mobility of people within urban areas. To tackle such challenges, transportation and mobility authorities and operators are trying to adopt innovative, Big Data-driven mobility- and traffic-related solutions. Such solutions will help decision-making processes that aim to ease the load on an already overloaded transport infrastructure. The information collected from day-to-day mobility and traffic can help to mitigate some of these mobility challenges in urban areas. Road infrastructure and traffic management operators (RITMOs) face several limitations in effectively extracting value from the exponentially growing volumes of mobility- and traffic-related Big Spatiotemporal Data (MobiTrafficBD) that are being acquired and gathered. Research on the topics of Big Data, Spatiotemporal Data and especially MobiTrafficBD is scattered, and the existing literature does not offer a concrete, common methodological approach to set up, configure, deploy and use a complete Big Data-based framework to manage the lifecycle of mobility-related spatiotemporal data, mainly focused on geo-referenced time series (GRTS) and spatiotemporal events (ST Events), extract value from them and support the decision-making processes of RITMOs. This doctoral thesis proposes a data-driven, prescriptive methodological approach towards the design, development and deployment of MobiTrafficBD Frameworks focused on GRTS and ST Events. Besides a thorough literature review on Spatiotemporal Data, Big Data and the merging of these two fields through MobiTrafficBD, the methodological approach comprises a set of general characteristics, technical requirements, logical components, data flows and technological infrastructure models, as well as guidelines and best practices that aim to guide researchers, practitioners and stakeholders, such as RITMOs, throughout the design, development and deployment phases of any MobiTrafficBD Framework. This work is intended to be a supporting methodological guide, based on widely used Reference Architectures and guidelines for Big Data, but enriched with the inherent characteristics and concerns brought about by Big Spatiotemporal Data, such as GRTS and ST Events. The proposed methodology was evaluated and demonstrated in various real-world use cases that deployed MobiTrafficBD-based data management, processing, analytics and visualisation methods, tools and technologies, under the umbrella of several research projects funded by the European Commission and the Portuguese Government.
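
    As an illustration of the two data shapes the methodology centres on, the sketch below models a GRTS sample and an ST Event as plain records. The field names are assumptions for illustration, not taken from the thesis.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class GRTSPoint:
    """One sample of a geo-referenced time series, e.g. a GPS probe reading."""
    series_id: str      # vehicle or sensor identifier
    timestamp: datetime
    lat: float
    lon: float
    value: float        # measured quantity, e.g. speed in km/h

@dataclass
class STEvent:
    """A spatiotemporal event, e.g. an accident or a congestion episode."""
    event_type: str
    start: datetime
    end: datetime
    lat: float
    lon: float

# A single probe reading and an event it might contribute evidence to.
probe = GRTSPoint("bus-42", datetime(2020, 5, 1, 8, 30), 38.736, -9.142, 4.2)
jam = STEvent("congestion", datetime(2020, 5, 1, 8, 15),
              datetime(2020, 5, 1, 9, 0), 38.736, -9.142)
```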

    Text mining for metabolic reaction extraction from scientific literature

    Science relies on data in all its different forms. In molecular biology and bioinformatics in particular, large-scale data generation has taken centre stage in the form of high-throughput experiments. In line with this exponential increase of experimental data has been the near-exponential growth of scientific publications. Yet where classical data mining techniques are still capable of coping with this deluge of structured data (Chapter 2), access to information found in scientific literature is still limited to search engines allowing searches at the level of keywords, titles and abstracts. However, large amounts of knowledge about biological entities and their relations are held within the body of articles. When extracted, this data can be used as evidence for existing knowledge or for hypothesis generation, making scientific literature a valuable scientific resource. Unlocking the information inside the articles requires a dedicated set of techniques and approaches tailored to the unstructured nature of free text. Analogous to the field of data mining for the analysis of structured data, the field of text mining has emerged for unstructured text, and a number of applications have been developed in that field. This thesis is about text mining in the field of metabolomics. The work focusses on strategies for accessing large collections of scientific text and on the text mining steps required to extract metabolic reactions and their constituents, enzymes and metabolites, from scientific text. Metabolic reactions are important for our understanding of metabolic processes within cells, and that information provides an important link between genotype and phenotype. Furthermore, information about metabolic reactions stored in databases is far from complete, making it an excellent target for our text mining application. In order to access the scientific publications for further analysis, they can be used as flat text or loaded into database systems. In Chapter 2 we assessed and discussed the capabilities and performance of XML-type database systems in storing and accessing very large collections of XML-type documents in the form of the Medline corpus, a collection of more than 20 million scientific abstracts. XML data formats are common in the field of bioinformatics and are also at the core of most web services; with the increasing amount of data stored in XML comes the need for storing and accessing that data. The database systems were evaluated on a number of aspects, broadly ranging from technical requirements to ease of use and performance. The performance of the different XML-type database systems was measured on Medline abstract collections of increasing size and with a number of different queries. One of the queries assessed the capability of each database system to search the full text of each abstract, which would allow access to the information within the text without further text analysis. The results show that all database systems cope well with the small and medium datasets, but that the full dataset remains a challenge. The query possibilities also varied greatly across the studied databases. This led us to conclude that the performance and possibilities of the different database types vary greatly, depending in part on the type of research question; there is no single system that outperforms the others, and different circumstances can lead to different optimal solutions. Some of these scenarios are presented in the chapter.
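
    One of the benchmark queries searched the full text of each abstract. The sketch below shows that kind of query with plain lxml and XPath rather than the database systems actually evaluated, and the element layout is a simplified stand-in for the real Medline DTD.

```python
# Full-text predicate over Medline-style XML, using lxml + XPath.
from lxml import etree

medline = etree.fromstring("""
<MedlineCitationSet>
  <MedlineCitation>
    <PMID>12345</PMID>
    <AbstractText>Hexokinase phosphorylates glucose.</AbstractText>
  </MedlineCitation>
</MedlineCitationSet>
""")

# Find citations whose abstract mentions a term (case-sensitive here).
hits = medline.xpath(
    '//MedlineCitation[contains(AbstractText, "glucose")]/PMID/text()')
print(hits)  # ['12345']
```
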
    Among the conclusions of Chapter 2 is that conventional data mining techniques do not work for the natural language part of a publication beyond simple retrieval queries based on pattern matching. The natural language used in written text is too unstructured for that purpose and requires dedicated text mining approaches, the main research topic of this thesis. Two major tasks of text mining are named entity recognition, the identification of relevant entities in the text, and relation extraction, the identification of relations between those named entities. For both text mining tasks many different techniques and approaches have been developed. For the named entity recognition of enzymes and metabolites we used a dictionary-based approach (Chapter 3), and for metabolic reaction extraction a full grammar approach (Chapter 4). In Chapter 3 we describe the creation of two thesauri, one for enzymes and one for metabolites, with the specific goal of enabling named entity identification, the mapping of identified synonyms to a common identifier, for metabolic reaction extraction. In the case of the enzyme thesaurus these identifiers are Enzyme Nomenclature numbers (EC numbers); in the case of the metabolite thesaurus, KEGG metabolite identifiers. These thesauri are applied to the identification of enzymes and metabolites in the text mining approach of Chapter 4. Both were created from existing data sources by a series of automated steps followed by manual curation. Compared to a previously published chemical thesaurus, created entirely with automated steps, our much smaller metabolite thesaurus performed at the same level for F-measure, with a slightly higher precision. The enzyme thesaurus produced results equal to our metabolite thesaurus. The compactness of our thesauri permits the manual curation step that is important in guaranteeing the accuracy of the thesaurus contents, whereas creation from existing resources by automated means limits the effort required. We concluded that our thesauri are compact and of high quality, and that this compactness does not greatly impact recall. In Chapter 4 we studied the applicability and performance of a full parsing approach that uses the two thesauri described in Chapter 3 to extract metabolic reactions from scientific full-text articles. For this we developed a text mining pipeline built around a modified dependency parser from the AGFL grammar lab, using a pattern-based approach to extract metabolic reactions from the parsing output. A comparison to a modified rule-based approach by Czarnecki et al., using three previously described metabolic pathways from the EcoCyc database, shows a slightly lower recall than the rule-based approach, but higher precision. We concluded that, despite its current recall, our full parsing approach to metabolic reaction extraction has high precision and the potential to be used to (re-)construct metabolic pathways in an automated setting. Future improvements to the grammar and relation extraction rules should allow reactions to be extracted with even higher specificity. To identify potential improvements to the recall, the effect of a number of text pre-processing steps on the performance was tested in a series of experiments. The experiment that had the most effect on performance was the conversion of schematic chemical formulas into syntactically complete sentences, allowing them to be analysed by the parser.
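
    The thesauri themselves are not reproduced in the abstract, but the dictionary-based tagging it describes can be sketched as a longest-match lookup that maps recognised synonyms to EC numbers and KEGG compound identifiers. The entries and the matching strategy below are illustrative.

```python
# Dictionary-based named entity recognition: longest-match lookup
# of thesaurus synonyms, mapped to common identifiers.
enzyme_thesaurus = {
    "hexokinase": "EC 2.7.1.1",
    "glucokinase": "EC 2.7.1.2",
}
metabolite_thesaurus = {
    "glucose": "C00031",               # KEGG compound identifiers
    "glucose 6-phosphate": "C00092",
}

def tag_entities(text, thesaurus):
    """Return (synonym, identifier) pairs found in the text, preferring
    longer synonyms so 'glucose 6-phosphate' beats 'glucose'."""
    found, lowered = [], text.lower()
    covered = [False] * len(text)
    for syn in sorted(thesaurus, key=len, reverse=True):
        start = lowered.find(syn)
        while start != -1:
            end = start + len(syn)
            if not any(covered[start:end]):    # skip nested matches
                found.append((syn, thesaurus[syn]))
                covered[start:end] = [True] * len(syn)
            start = lowered.find(syn, end)
    return found

sentence = "Hexokinase converts glucose to glucose 6-phosphate."
print(tag_entities(sentence, enzyme_thesaurus))      # [('hexokinase', 'EC 2.7.1.1')]
print(tag_entities(sentence, metabolite_thesaurus))
```
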
    In addition to the improvements to the text mining approach described in Chapter 4, I make suggestions in Chapter 5 for potential improvements and extensions to our full parsing approach for metabolic reaction extraction. The core focus here is increasing recall by optimising each of the steps required for the final goal of extracting metabolic reactions from the text. Some of the discussed improvements are to increase the coverage of the thesauri used, possibly with specialist thesauri depending on the analysed literature. Another potential target is the grammar, where there is still room to increase parsing success by taking into account the characteristics of biomedical language. On a different level are suggestions to include some form of anaphora resolution and across-sentence-boundary search to increase the amount of information extracted from the literature. In the second part of Chapter 5, I make suggestions on how to maximise the information gained from the text mining results. One of the first steps should be integration with other biomedical databases, to allow integration with existing knowledge about metabolic reactions and other biological entities. Another aspect is some form of ranking or weighting of the results, to be able to distinguish between high-quality results useful for automated analyses and lower-quality results still useful for manual approaches. Furthermore, I provide a perspective on the necessity of computational literature analysis in the form of text mining. The main reasoning here is that human annotators cannot keep up with the number of publications, so some form of automated analysis is unavoidable. Lastly, I discuss the role of text mining in bioinformatics, and with that the accessibility of both text mining results and the literature resources necessary to create them. An important requirement for the future of text mining is that the barriers around high-throughput access to literature for text mining applications be removed. With regard to accessing text mining results, there is a long way to go for many applications, including ours, before they can be used directly by biologists; a major factor is that these applications rarely feature a suitable user interface and easy-to-use setup. To conclude, I see the main role of a text mining system like ours in gathering evidence for existing knowledge and giving insights into the nuances of the research landscape of a given topic. When using the results of our reaction extraction system for the identification of ‘new’ reactions, it is important to go back to the actual evidence presented for extra validation and to cross-validate the predictions with other resources or experiments. Ideally, text mining will be used for the generation of hypotheses, in which the researcher uses text mining findings to get ideas on, in our case, new connections between metabolites and enzymes; subsequently the researcher needs to go back to the original texts for further study. In this role text mining is an essential tool on the workbench of the molecular biologist.

    Content And Multimedia Database Management Systems

    A database management system is a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications. The main characteristic of the ‘database approach’ is that it increases the value of data through its emphasis on data independence. DBMSs, and in particular those based on the relational data model, have been very successful at the management of administrative data in the business domain. This thesis has investigated data management in multimedia digital libraries and its implications for the design of database management systems. The main problem of multimedia data management is providing access to the stored objects. The content structure of administrative data is easily represented in alphanumeric values, so database technology has primarily focused on handling the objects’ logical structure. In the case of multimedia data, however, representation of content is far from trivial, and it is not supported by current database management systems.

    Database support for large-scale multimedia retrieval

    With the increasing proliferation of recording devices and the resulting abundance of multimedia data available nowadays, searching and managing these ever-growing collections becomes more and more difficult. In order to support retrieval tasks within large multimedia collections, not only the sheer size but also the complexity of the data and their associated metadata pose great challenges, in particular from a data management perspective. Conventional approaches to this task have been shown to have only limited success, particularly due to the lack of support for the given data and the required query paradigms. In the area of multimedia research, the missing support for efficiently and effectively managing multimedia data and metadata has recently been recognised as a stumbling block that constrains further developments in the field. In this thesis, we bridge the gap between the database and multimedia retrieval research areas. We approach the problem of providing a data management system geared towards large collections of multimedia data and the corresponding query paradigms. To this end, we identify the necessary building blocks for a multimedia data management system that adopts the relational data model and the vector-space model. In essence, we make the following main contributions towards a holistic model of a database system for multimedia data. We introduce an architectural model describing a data management system for multimedia data from a system architecture perspective. We further present a data model which supports the storage of multimedia data and the corresponding metadata, and provides similarity-based search operations. This thesis describes an extensive query model for a very broad range of different query paradigms, specifying both the logical and the executional aspects of a query. Moreover, we consider the efficiency and scalability of the system in a distribution model and a storage model, and provide a large and diverse set of index structures for high-dimensional data coming from the vector-space model. The developed models crystallise into the scalable multimedia data management system ADAMpro, which has been implemented within the iMotion/vitrivr retrieval stack. We quantitatively evaluate our concepts on collections that exceed the current state of the art. The results underline the benefits of our approach and assist in understanding the role of the introduced concepts. Moreover, the findings provide important implications for future research in the field of multimedia data management.
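
    The similarity-based search operations at the heart of this data model reduce to nearest-neighbour queries over feature vectors. The sketch below shows the operation as a brute-force scan; it stands in for the high-dimensional index structures ADAMpro actually provides, and all feature values are made up.

```python
# k-nearest-neighbour search over feature vectors in the vector-space model.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, collection, k=2):
    """Return the k object ids whose feature vectors are closest to the query."""
    ranked = sorted(collection.items(), key=lambda kv: euclidean(query, kv[1]))
    return [obj_id for obj_id, _ in ranked[:k]]

features = {                  # object id -> extracted feature vector
    "img-001": [0.1, 0.9, 0.3],
    "img-002": [0.8, 0.2, 0.5],
    "img-003": [0.3, 0.7, 0.4],
}
print(knn([0.15, 0.85, 0.35], features))  # ['img-001', 'img-003']
```

    Real systems replace the linear scan with index structures so that query time does not grow with collection size, which is exactly the scalability concern the thesis addresses.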

    10381 Summary and Abstracts Collection -- Robust Query Processing

    Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise for the purpose of fostering discussion and collaboration regarding causes, opportunities, and solutions for achieving robust query processing. The seminar strove to build a unified view across the loosely coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. In order to pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes from some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers, contributed by some participants.