    Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques

    Research on similarity join techniques is becoming one of the growing practical areas of study, especially with the increasing electronic availability of vast amounts of digital data from an ever-larger number of source systems. This research focuses on pre-processing, clustering-based techniques that improve existing similarity join approaches. Identifying and extracting the same real-world entities from different data sources remains a significant challenge in the digital information era: dissimilar extracts may represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Therefore, discovering efficient and accurate approaches to determining the similarity of data objects or values is of both theoretical and practical significance. Even the concept of similarity raises semantic problems regarding its usage and foundation. Existing similarity join approaches often take a very specific view of similarity measures and rely on pre-defined predicates, reflecting a narrow notion of similarity for a given scenario; the predicates are assumed to be a group of clustering-related attributes on the join [MSW 72]. Identifying such entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system. This study focuses on string similarity joins, namely those based on the Levenshtein (edit) distance and on Q-grams, and proposes effective and efficient pre-processing, clustering-based techniques that identify clustering-related predicates, based on either attribute values or data values, to improve existing similarity join techniques in enterprise data integration scenarios.
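
    The two string measures named above can be made concrete with a short sketch. The function names and the Jaccard-style Q-gram overlap below are illustrative assumptions rather than the thesis's implementation; they simply show how edit distance and Q-gram similarity score a pair of strings.

        def levenshtein(a: str, b: str) -> int:
            """Classic dynamic-programming edit distance between two strings."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                # deletion
                                    curr[j - 1] + 1,            # insertion
                                    prev[j - 1] + (ca != cb)))  # substitution
                prev = curr
            return prev[-1]

        def qgrams(s: str, q: int = 2) -> set:
            """Overlapping q-grams of a string padded with q-1 sentinel characters."""
            padded = "#" * (q - 1) + s + "#" * (q - 1)
            return {padded[i:i + q] for i in range(len(padded) - q + 1)}

        def qgram_similarity(a: str, b: str, q: int = 2) -> float:
            """Jaccard overlap of the two q-gram sets, in [0, 1]."""
            ga, gb = qgrams(a, q), qgrams(b, q)
            return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

        print(levenshtein("Jon Smith", "John Smyth"))                 # 2 edits
        print(round(qgram_similarity("Jon Smith", "John Smyth"), 2))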

    Vision systems with the human in the loop

    The emerging cognitive vision paradigm deals with vision systems that apply machine learning and automatic reasoning in order to learn from what they perceive. Cognitive vision systems can rate the relevance and consistency of newly acquired knowledge, can adapt to their environment, and thus exhibit high robustness. This contribution presents vision systems that aim at flexibility and robustness. One is tailored for content-based image retrieval; the others are cognitive vision systems that constitute prototypes of visual active memories, which evaluate, gather, and integrate contextual knowledge for visual analysis. All three systems are designed to interact with human users. After discussing adaptive content-based image retrieval and object and action recognition in an office environment, the issue of assessing cognitive systems is raised. Experiences from psychologically evaluated human-machine interactions are reported, and the promising potential of psychologically based usability experiments is stressed.
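
    One simple way to keep a human in the loop of content-based image retrieval is relevance feedback; the sketch below (plain NumPy, with made-up parameter values, and not the systems described above) nudges the query feature vector toward images the user marks as relevant and re-ranks the collection.

        import numpy as np

        def refine_query(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
            """Rocchio-style update: move the query toward relevant examples, away from irrelevant ones."""
            q = alpha * query
            if relevant:
                q = q + beta * np.mean(relevant, axis=0)
            if irrelevant:
                q = q - gamma * np.mean(irrelevant, axis=0)
            return q

        def rank(query, corpus):
            """Rank image ids in `corpus` (id -> feature vector) by cosine similarity to the query."""
            sims = {name: float(query @ v / (np.linalg.norm(query) * np.linalg.norm(v) + 1e-12))
                    for name, v in corpus.items()}
            return sorted(sims, key=sims.get, reverse=True)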

    TREE-D-SEEK: A Framework for Retrieving Three-Dimensional Scenes

    In this dissertation, a strategy and framework for retrieving 3D scenes are proposed. The strategy is to retrieve 3D scenes based on a unified approach to indexing content from disparate information sources and information levels. The TREE-D-SEEK framework implements the proposed strategy and is capable of indexing content from a variety of corpora at distinct information levels. A semantic annotation model for indexing 3D scenes in the TREE-D-SEEK framework is also proposed; it is based on an ontology for rapid prototyping of 3D virtual worlds. With ongoing improvements in computer hardware and 3D technology, the cost associated with the acquisition, production and deployment of 3D scenes is decreasing. As a consequence, there is a need for efficient 3D retrieval systems for the increasing number of 3D scenes in corpora. An efficient 3D retrieval system provides several benefits, such as enhanced sharing and reuse of 3D scenes and 3D content. Existing 3D retrieval systems are closed systems that provide search solutions based on a predefined set of indexing and matching algorithms; they cannot be customized for specific requirements, types of information source or information levels. In this research, TREE-D-SEEK, an open, extensible framework for retrieving 3D scenes, is proposed. The TREE-D-SEEK framework is capable of retrieving 3D scenes by indexing content ranging from low-level features to high-level semantic metadata. The framework is discussed from a software architecture perspective; the architecture is based on a common process flow derived from indexing disparate information sources. Several indexing and matching algorithms are implemented, and experiments are conducted to evaluate the usability and performance of the framework. Retrieval performance is evaluated using benchmarks and manually collected corpora. A generic semantic annotation model is proposed for indexing a 3D scene. The primary objective of using the semantic annotation model in the TREE-D-SEEK framework is to improve retrieval relevance and to support richer queries within a 3D scene. The semantic annotation model is driven by an ontology derived from a 3D rapid prototyping framework. The TREE-D-SEEK framework supports query-by-example, keyword-based and semantic-annotation-based queries for retrieving 3D scenes.
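
    The "open, extensible" claim can be pictured as a plug-in architecture in which indexing and matching algorithms sit behind a common interface; the class and method names below are illustrative assumptions, not TREE-D-SEEK's actual API.

        from abc import ABC, abstractmethod

        class Indexer(ABC):
            @abstractmethod
            def index(self, scene) -> dict:
                """Extract index entries (low-level features or semantic metadata) from a 3D scene."""

        class Matcher(ABC):
            @abstractmethod
            def score(self, query, entry: dict) -> float:
                """Score a query against one index entry; higher means more relevant."""

        class RetrievalFramework:
            """The core stays fixed; new corpora or information levels plug in new Indexer/Matcher pairs."""
            def __init__(self, indexer: Indexer, matcher: Matcher):
                self.indexer, self.matcher, self._index = indexer, matcher, {}

            def add_scene(self, scene_id: str, scene) -> None:
                self._index[scene_id] = self.indexer.index(scene)

            def retrieve(self, query, top_k: int = 10) -> list:
                scored = {sid: self.matcher.score(query, entry) for sid, entry in self._index.items()}
                return sorted(scored, key=scored.get, reverse=True)[:top_k]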

    Context-aware OLAP for textual data warehouses

    Decision Support Systems (DSS) that leverage business intelligence are based on numerical data, and On-line Analytical Processing (OLAP) is often used to implement them. However, business decisions increasingly depend on textual data as well. Existing work on textual data warehouses is limited in that it captures contextual relationships only when comparing strongly related documents. This paper proposes an Information System (IS) based, context-aware model that uses word embeddings in conjunction with agglomerative hierarchical clustering algorithms to dynamically categorize documents and form the concept hierarchy. The results of the experimental evaluation provide evidence of the effectiveness of integrating textual data into a data warehouse and of improving decision making through various OLAP operations.
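
    The clustering step lends itself to a short sketch: assuming documents have already been embedded as vectors (by any word-embedding model), agglomerative hierarchical clustering at a few granularities yields candidate levels for a concept hierarchy. The level sizes below are arbitrary assumptions, not the paper's configuration.

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering

        def build_concept_levels(doc_embeddings: np.ndarray, level_sizes=(4, 16)) -> dict:
            """Cluster the same embeddings at several granularities; coarser clusterings
            sit higher in the concept hierarchy (e.g. 4 broad topics over 16 narrower ones)."""
            levels = {}
            for n_clusters in level_sizes:
                model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
                levels[n_clusters] = model.fit_predict(doc_embeddings)
            return levels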

    Effective Instance Matching for Heterogeneous Structured Data

    One main problem in the effective usage of structured data is instance matching, where the goal is to find instance representations that refer to the same real-world entity. In this book we investigate how to effectively match heterogeneous structured data. We evaluate our approaches against the latest baselines; the results show advances beyond the state of the art.
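
    The book's own techniques are not detailed in this abstract, so the sketch below is only a generic instance-matching baseline for orientation: score two structured instances attribute by attribute and declare a match above a similarity threshold.

        from difflib import SequenceMatcher

        def attribute_similarity(a: str, b: str) -> float:
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def is_match(inst_a: dict, inst_b: dict, threshold: float = 0.8) -> bool:
            """Average string similarity over the attributes both instances share."""
            shared = set(inst_a) & set(inst_b)
            if not shared:
                return False
            score = sum(attribute_similarity(str(inst_a[k]), str(inst_b[k])) for k in shared) / len(shared)
            return score >= threshold

        print(is_match({"name": "J. Smith", "city": "Berlin"},
                       {"name": "John Smith", "city": "berlin"}))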

    Multi-modal multi-semantic image retrieval

    The rapid growth in the volume of visual information, e.g. images and video, can overwhelm users' ability to find and access the specific visual information of interest to them. In recent years, ontology knowledge-based (KB) image information retrieval techniques have been adopted in order to extract knowledge from these images and enhance retrieval performance. A KB framework is presented to promote semi-automatic annotation and semantic image retrieval using multimodal cues (visual features and text captions). In addition, a hierarchical structure for the KB allows metadata to be shared and supports multiple semantics (polysemy) for concepts. The framework builds up an effective knowledge base pertaining to a domain-specific image collection, e.g. sports, and is able to disambiguate and assign high-level semantics to 'unannotated' images. Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, has been deployed in the 'Bag of Visual Words' (BVW) model as an effective method to represent visual content information and to enhance its classification and retrieval. Local features are more useful than global features, e.g. colour, shape or texture, as they are invariant to image scale, orientation and camera angle. An innovative approach is proposed for the representation, annotation and retrieval of visual content using a hybrid technique based upon unstructured visual words and upon a (structured) hierarchical ontology KB model. The structural model facilitates the disambiguation of unstructured visual words and a more effective classification of visual content, compared to a vector space model, by exploiting local conceptual structures and their relationships. The key contributions of this framework in using local features for image representation are: first, a method to generate visual words using the semantic local adaptive clustering (SLAC) algorithm, which takes term weights and the spatial locations of keypoints into account, so that semantic information is preserved; second, a technique to detect domain-specific 'non-informative visual words' which are ineffective at representing the content of visual data and degrade its categorisation; and third, a method to combine an ontology model with a visual word model to resolve synonym (visual heterogeneity) and polysemy problems. The experimental results show that this approach can efficiently discover semantically meaningful visual content descriptions and recognise specific events, e.g. sports events, depicted in images. Since discovering the semantics of an image is an extremely challenging problem, one promising approach to enhance visual content interpretation is to use any associated textual information that accompanies an image as a cue to predict its meaning, by transforming this textual information into a structured annotation, e.g. using XML, RDF, OWL or MPEG-7. Although text and images are distinct types of information representation and modality, there are strong, invariant, implicit connections between images and any accompanying text. Semantic analysis of image captions can be used by image retrieval systems to retrieve selected images more precisely. To do this, Natural Language Processing (NLP) is first exploited in order to extract concepts from image captions. Next, an ontology-based knowledge model is deployed in order to resolve natural language ambiguities. To deal with the accompanying text, two methods to extract knowledge from textual information have been proposed. First, metadata can be extracted automatically from text captions and restructured with respect to a semantic model. Second, the use of LSI in relation to a domain-specific, ontology-based knowledge model enables the combined framework to tolerate ambiguities and variations (incompleteness) in metadata. The use of the ontology-based knowledge model allows the system to find indirectly relevant concepts in image captions and thus leverage these to represent the semantics of images at a higher level. Experimental results show that the proposed framework significantly enhances image retrieval and leads to a narrowing of the semantic gap between lower-level machine-derived and higher-level human-understandable conceptualisation.
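
    The bag-of-visual-words representation at the heart of the framework can be sketched in a few lines; this is the standard SIFT + k-means pipeline, not the SLAC variant proposed in the thesis, and the vocabulary size is an arbitrary assumption.

        import cv2
        import numpy as np
        from sklearn.cluster import KMeans

        def sift_descriptors(image_gray: np.ndarray) -> np.ndarray:
            """Local SIFT descriptors (128-dimensional) for a grayscale image."""
            sift = cv2.SIFT_create()
            _, desc = sift.detectAndCompute(image_gray, None)
            return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

        def build_vocabulary(all_descriptors: np.ndarray, n_words: int = 200) -> KMeans:
            """Visual vocabulary: k-means centroids over descriptors pooled from training images."""
            return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(all_descriptors)

        def bvw_histogram(image_gray: np.ndarray, vocab: KMeans) -> np.ndarray:
            """Represent an image as a normalised histogram of visual-word occurrences."""
            desc = sift_descriptors(image_gray)
            words = vocab.predict(desc) if len(desc) else np.array([], dtype=int)
            hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
            return hist / hist.sum() if hist.sum() else hist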

    Multi-paradigm frameworks for scalable intrusion detection

    Research in network security and intrusion detection systems (IDSs) has typically focused on small or artificial data sets. Tools are developed that work well on these data sets but have trouble meeting the demands of real-world, large-scale network environments. In addressing this problem, improvements must be made to the foundations of intrusion detection systems, including data management, IDS accuracy and alert volume. We address data management of network security and intrusion detection information by presenting a database mediator system that provides single-query access via a domain-specific query language. Results are returned in the form of XML using web services, allowing analysts to access information from remote networks in a uniform manner. The system also provides scalable data capture of log data for multi-terabyte datasets. Next, we address IDS alert accuracy by building an agent-based framework that utilizes web services to make the system easy to deploy and capable of spanning network boundaries. Agents in the framework process IDS alerts managed by a central alert broker. The broker can define processing hierarchies by assigning dependencies on agents to achieve scalability. The framework can also be used for the task of event correlation, or gathering information relevant to an IDS alert. Lastly, we address alert volume by presenting an approach to alert correlation that is IDS-independent. Using correlated events gathered in our agent framework, we build a feature vector for each IDS alert representing the network traffic profile of the internal host at the time of the alert. This feature vector is used as a statistical fingerprint in a clustering algorithm that groups related alerts. We analyze our results with a combination of domain-expert evaluation and feature selection.
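
    The alert-grouping step reads naturally as a clustering problem over per-alert fingerprints; the sketch below uses assumed feature names and scikit-learn's DBSCAN rather than the dissertation's exact feature set or algorithm.

        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.cluster import DBSCAN

        def group_alerts(alert_features: np.ndarray, eps: float = 1.0) -> np.ndarray:
            """Each row is one alert's traffic-profile fingerprint for the internal host
            (e.g. bytes in/out, distinct ports, distinct peers, connection rate);
            alerts that receive the same cluster label are treated as related."""
            X = StandardScaler().fit_transform(alert_features)
            return DBSCAN(eps=eps, min_samples=2).fit_predict(X)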

    WAQS: a web-based approximate query system

    The Web is often viewed as a gigantic database holding vast stores of information and providing ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and in the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language, the Structured Query Language (SQL), is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow more detailed retrieval and hence to reduce the number of matches returned by typical search engines. The main objective of this technique is to allow queries to be based not just on keywords but also on the location of the keywords within the logical structure of a document. In addition, the technique provides approximate search capabilities based on the notions of Distance and Variable-Length Don't Cares. The proposed techniques have been implemented in a system called the Web-Based Approximate Query System, which contains an SQL-like query language called the Web-Based Approximate Query Language. The Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine, providing EnviroDaemon with more detailed searching capabilities than keyword-based search alone. Implementation details, technical results and future work are presented in this dissertation.
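
    The approximate-matching idea can be illustrated briefly: a pattern containing variable-length don't cares (written here as '*') is translated into a regular expression, and candidates that miss the pattern can still be admitted when their overall distance to it is small. The pattern syntax and scoring below are assumptions, not the actual grammar of the Web-Based Approximate Query Language.

        import re
        from difflib import SequenceMatcher

        def vldc_to_regex(pattern: str) -> re.Pattern:
            """'*' matches any (possibly empty) substring; everything else is literal."""
            parts = [re.escape(p) for p in pattern.split("*")]
            return re.compile(".*".join(parts), re.IGNORECASE)

        def approximate_matches(pattern: str, candidates: list, min_ratio: float = 0.6) -> list:
            """Keep candidates that match the VLDC pattern, or are close to it as a whole string."""
            rx = vldc_to_regex(pattern)
            literal = pattern.replace("*", "")
            return [c for c in candidates
                    if rx.search(c) or SequenceMatcher(None, literal.lower(), c.lower()).ratio() >= min_ratio]

        print(approximate_matches("enviro*daemon", ["EnviroDaemon", "environment daemon", "envirodemon"]))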

    Doctor of Philosophy

    Biomedical data are a rich source of information and knowledge. Not only are they useful for direct patient care, but they may also offer answers to important population-based questions. Creating an environment where advanced analytics can be performed against biomedical data is nontrivial, however. Biomedical data are currently scattered across multiple systems with heterogeneous data, and integrating these data is a bigger task than humans can realistically do by hand; therefore, automatic biomedical data integration is highly desirable but has never been fully achieved. This dissertation introduces new algorithms that were devised to support automatic and semiautomatic integration of heterogeneous biomedical data. The new algorithms incorporate both data mining and biomedical informatics techniques to create "concept bags" that are used to compute similarity between data elements in the same way that "word bags" are compared in data mining. Concept bags are composed of controlled medical vocabulary concept codes that are extracted from text using named-entity recognition software. To test the new algorithms, three biomedical text similarity use cases were examined: automatically aligning data elements between heterogeneous data sets, determining degrees of similarity between medical terms using a published benchmark, and determining similarity between ICU discharge summaries. The method is highly configurable, and five different versions were tested. The concept bag method performed particularly well at aligning data elements and outperformed the compared algorithms by more than 5%. Another configuration that included hierarchical semantics performed particularly well at matching medical terms, meeting or exceeding 30 of 31 other published results using the same benchmark. Results for the third scenario of computing ICU discharge summary similarity were less successful: correlations between multiple methods were low, including between terminologists. The concept bag algorithms performed consistently and comparatively well and appear to be viable options for multiple scenarios. New applications of the method and ideas for improving the algorithm are being discussed for future work, including several performance enhancements, configuration-based enhancements, and concept vector weighting using the TF-IDF formulas.
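
    The core comparison is easy to sketch: once named-entity recognition has reduced each text to a bag of controlled-vocabulary concept codes, bags are compared the way word bags are. The concept codes below are hypothetical UMLS-style identifiers, and the Jaccard and TF-IDF cosine measures are generic choices rather than the dissertation's exact configurations.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def jaccard(bag_a: set, bag_b: set) -> float:
            """Set-overlap similarity between two concept bags."""
            union = bag_a | bag_b
            return len(bag_a & bag_b) / len(union) if union else 1.0

        # Hypothetical concept-code bags for two data elements.
        bag_a = {"C0020538", "C0013227", "C0011900"}
        bag_b = {"C0020538", "C0011900", "C0232804"}
        print(round(jaccard(bag_a, bag_b), 2))   # 0.5

        # TF-IDF weighted variant: treat each bag as a "document" of concept codes.
        docs = [" ".join(sorted(bag_a)), " ".join(sorted(bag_b))]
        tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)
        print(float(cosine_similarity(tfidf[0], tfidf[1])[0, 0]))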