
    Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques

    Research on similarity join techniques is a growing practical area of study, especially given the increasing electronic availability of vast amounts of digital data from more and more source systems. This research focuses on clustering-based pre-processing techniques to improve existing similarity join approaches. Identifying and extracting the same real-world entities from different data sources remains a major challenge and a significant task in the digital information era. Dissimilar extracts may in fact represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Discovering efficient and accurate approaches to determine the similarity of data objects or values is therefore of theoretical as well as practical significance. Semantic problems arise even with the concept of similarity itself, regarding its usage and foundation. Existing similarity join approaches often take a very specific view of similarity measures and pre-defined predicates, which represents a narrow focus on the context of similarity for a given scenario. The predicates have been assumed to be a group of clustering [MSW 72] related attributes on the join. Identifying those entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system. This study focused on string similarity joins, namely those based on the Levenshtein (edit) distance and on q-grams. It proposes effective and efficient clustering-based pre-processing techniques that identify clustering-related predicates, based on either attribute values or data values, to improve existing similarity join techniques in enterprise data integration scenarios.
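The two string measures named in this abstract can be illustrated with a minimal sketch. It shows only a baseline similarity join (edit distance with a q-gram count filter), not the clustering-based pre-processing the study proposes. The function names, the padding characters `#` and `$`, and the sample names are assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Bag of q-grams of s, padded so a string of length n yields n + q - 1 grams."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)  # padding symbols are an assumption
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def shared_qgrams(a: str, b: str, q: int = 2) -> int:
    """Size of the bag intersection of the two q-gram multisets."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def similarity_join(left, right, max_dist: int = 1, q: int = 2):
    """Naive nested-loop join: keep pairs whose edit distance is <= max_dist."""
    pairs = []
    for x in left:
        for y in right:
            # Count filter: strings within edit distance k must share at least
            # max(|x|, |y|) + q - 1 - k * q padded q-grams.
            needed = max(len(x), len(y)) + q - 1 - max_dist * q
            if shared_qgrams(x, y, q) < needed:
                continue  # cheap rejection, no dynamic programme needed
            if levenshtein(x, y) <= max_dist:
                pairs.append((x, y))
    return pairs


matches = similarity_join(["Jon Smith"], ["John Smith", "Jane Doe"], max_dist=1)
```

The count filter only prunes candidate pairs; the edit distance check still decides the final result, so the filter does not change which pairs are returned.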

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    Clustering Time Series from Mixture Polynomial Models with Discretised Data

    Clustering time series is an active research area with applications in many fields. One common feature of time series is the likely presence of outliers. These uncharacteristic data can significantly affect the quality of the clusters formed. This paper evaluates a method of overcoming the detrimental effects of outliers. We describe some of the alternative approaches to clustering time series, then specify a particular class of model for experimentation with k-means clustering and a correlation-based distance metric. For data derived from this class of model, we demonstrate that discretising the data into a binary series of values above and below the median improves the clustering when the data contain outliers. More specifically, we show that, firstly, discretisation does not significantly affect the accuracy of the clusters when there are no outliers and, secondly, it significantly increases the accuracy in the presence of outliers, even when the probability of an outlier is very low.
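The discretisation step described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not define its correlation-based metric, so `1 - Pearson correlation` is assumed here, and the function names and sample series are hypothetical.

```python
from statistics import median


def discretise(series):
    """Map each value to 1 if it lies above the series median, else 0."""
    m = median(series)
    return [1 if x > m else 0 for x in series]


def correlation_distance(x, y):
    """Assumed metric: 1 minus the Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # a constant series has no defined correlation
    return 1.0 - cov / (var_x * var_y) ** 0.5


clean = [1, 2, 3, 4, 5, 6]
noisy = [1, 2, 3, 4, 50, 6]  # same trend with a single outlier

raw_distance = correlation_distance(clean, noisy)
binary_distance = correlation_distance(discretise(clean), discretise(noisy))
```

In this toy case the outlier pulls the raw series apart, while the two binary series are identical, so the distance between them is zero. The distances could then be passed to any clustering routine.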

    A Similarity Measure for Material Appearance

    We present a model to measure the similarity in appearance between different materials, which correlates with human similarity judgments. We first create a database of 9,000 rendered images depicting objects with varying materials, shape and illumination. We then gather data on perceived similarity from crowdsourced experiments; our analysis of over 114,840 answers suggests that a shared perception of appearance similarity does indeed exist. We feed this data to a deep learning architecture with a novel loss function, which learns a feature space for materials that correlates with such perceived appearance similarity. Our evaluation shows that our model outperforms existing metrics. Finally, we demonstrate several applications enabled by our metric, including appearance-based search for material suggestions, database visualization, clustering and summarization, and gamut mapping. Comment: 12 pages, 17 figures.

    Effective retrieval and new indexing method for case based reasoning: Application in chemical process design

    In this paper we try to improve the retrieval step of case-based reasoning (CBR) for preliminary design. This improvement deals with three major parts of our CBR system. First, in the preliminary design step, some uncertainties such as imprecise or unknown values remain in the description of the problem, because resolving them requires deeper analysis. To deal with this issue, the description of the problem at hand is softened using fuzzy set theory. Features are described with a central value, a percentage of imprecision and a relation with respect to the central value. These additional data allow us to build a domain of possible values for each attribute. This representation affects the calculation of the similarity function, so the characteristic function is used to calculate the local similarity between two features. Second, we focus on the main goal of the retrieval step in CBR: finding relevant cases for adaptation. In this second part, we discuss the assumption that similarity identifies the most appropriate case. We highlight that in some situations this classical similarity must be complemented with further knowledge to facilitate case adaptation. To avoid failure during the adaptation step, we implement a method that couples a similarity measure with an adaptability measure, in order to approximate the utility of cases more accurately. The latter gives deeper information for the reuse of cases. In the last part, we present a generic indexing technique for the case base and a new algorithm for searching for relevant cases in memory. The sphere indexing algorithm is a domain-independent index whose performance is equivalent to that of decision trees. Its main strength, however, is that it puts the current problem at the center of the search area, avoiding boundary issues. All these points are discussed and exemplified through the preliminary design of a chemical engineering unit operation.

    Context-Specific Preference Learning of One Dimensional Quantitative Geospatial Attributes Using a Neuro-Fuzzy Approach

    Change detection is a topic of great importance for modern geospatial information systems. Digital aerial imagery provides an excellent medium to capture geospatial information. Rapidly evolving environments and the availability of increasing amounts of diverse, multiresolutional imagery bring forward the need for frequent updates of these datasets. Analysis and query of spatial data using potentially outdated data may yield results that are sometimes invalid. Due to measurement errors (systematic, random) and incomplete knowledge of information (uncertainty), it is ambiguous whether a change in a spatial dataset has really occurred. Therefore we need to develop reliable, fast, and automated procedures that will effectively report, based on information from a new image, whether a change has actually occurred or whether this change is simply the result of uncertainty. This thesis introduces a novel methodology for change detection in spatial objects using aerial digital imagery. The uncertainty of the extraction is used as a quality estimate in order to determine whether change has occurred. For this goal, we develop a fuzzy-logic system to estimate uncertainty values from the results of automated object extraction using active contour models (a.k.a. snakes). The differential snakes change detection algorithm is an extension of traditional snakes that incorporates previous information (i.e., shape of object and uncertainty of extraction) as energy functionals. This process is followed by a procedure in which we examine the improvement of the uncertainty in the absence of change (versioning). We also introduce a post-extraction method for improving the object extraction accuracy. In addition to linear objects, in this thesis we extend differential snakes to track deformations of areal objects (e.g., lake flooding, oil spills). From the polygonal description of a spatial object we can track its trajectory and areal changes. Differential snakes can also be used as the basis for similarity indices for areal objects. These indices are based on areal moments that are invariant under general affine transformation. Experimental results of the differential snakes change detection algorithm demonstrate their performance. More specifically, we show that the differential snakes minimize the false positives in change detection and reliably track object deformations.

    On Supporting Wide Range of Attribute Types for Top-K Search

    Searching for top-k objects on behalf of many users faces the problem of differing user preferences. The family of Threshold algorithms computes top-k objects using sorted access to ordered lists. Each list is ordered with respect to the user's preference for one of the objects' attributes. This paper presents index-based methods that simulate sorted access for different user preferences in parallel. The simulation is presented for different domain types: ordinal, nominal, metric and hierarchical.
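The sorted-access idea behind the Threshold family can be sketched with a basic Threshold Algorithm. This is a simplified illustration, not the index-based simulation the paper presents. It assumes that every object has a non-negative score in every list, that the aggregate is a plain sum, and that the lists fit in memory; the function name and sample data are hypothetical.

```python
import heapq


def threshold_topk(lists, k):
    """Basic Threshold Algorithm over per-attribute score dictionaries.

    lists: one dict per attribute mapping object id -> preference score.
    Returns up to k (object id, total score) pairs, best first.
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(scores.items(), key=lambda kv: -kv[1])
                    for scores in lists]
    seen = {}  # object id -> aggregated score
    max_depth = max(len(sl) for sl in sorted_lists)
    for depth in range(max_depth):
        last_scores = []
        for sl in sorted_lists:
            if depth < len(sl):
                obj, score = sl[depth]
                last_scores.append(score)
                if obj not in seen:
                    # Random access: look the object up in every list.
                    seen[obj] = sum(s.get(obj, 0.0) for s in lists)
            else:
                last_scores.append(0.0)  # exhausted list contributes nothing
        # No unseen object can score higher than this threshold.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # early stop: the top k can no longer change
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical preference scores for two attributes of four objects.
price_pref = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.3}
distance_pref = {"a": 0.2, "b": 0.7, "c": 0.9, "d": 0.4}

best_two = threshold_topk([price_pref, distance_pref], k=2)
best_ids = [obj for obj, _ in best_two]
```

With these scores the scan stops after the third row of sorted access, before reading the final entries of either list, which is the saving the early-stop condition is meant to provide.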