40 research outputs found

    Detection of Communities within the Multibody System Dynamics Network and Analysis of Their Relations

    Multibody system dynamics is already a well-developed branch of theoretical, computational and applied mechanics. Thousands of documents can be found in any of the well-known scientific databases. In this work it is demonstrated that multibody system dynamics is built of many thematic communities. Using Elsevier’s abstract and citation database Scopus, a massive amount of data is collected and analyzed with the open-source visualization tool Gephi. The information is represented as a large set of nodes with connections in order to study their graphical distribution and explore geometry and symmetries. A randomized radial symmetry is found in the graphical representation of the collected information. Furthermore, the concept of modularity is used to demonstrate that community structures are present in the field of multibody system dynamics. In particular, twenty-four different thematic communities have been identified. The scientific production of each community is analyzed, which allows its growth rate over the coming years to be predicted. The journals and conference proceedings mainly used by the authors belonging to each community, as well as the cooperation between them by country, are also analyzed.
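
    As an illustration of the modularity concept mentioned in this abstract, the sketch below computes Newman's modularity Q for a given partition of a small undirected graph. It is not the authors' pipeline (the abstract only says that Gephi was used); the function name `modularity`, the toy edge list and the community labels are all assumptions made for the example, and the graph is assumed to be unweighted, non-empty and free of self-loops.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    intra = defaultdict(int)       # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

q = modularity(edges, community_of)
print(f"Q = {q:.4f}")
```

    A partition that puts every node in a single community gives Q = 0, so a clearly positive Q for some partition is the usual evidence that community structure is present.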

    WEB-BASED DUPLICATE RECORDS DETECTION WITH ARABIC LANGUAGE ENHANCEMENT

    Sharing data between organizations is of growing importance in many data mining projects. Data from various heterogeneous sources often has to be linked and aggregated in order to improve data quality. The importance of data accuracy and quality has increased with the explosion of data size. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset, a task called Duplicate Record Detection (DRD). Data inaccuracy problems arise from several factors, including spelling, typographical and pronunciation variation, dialects, special vowel and consonant distinctions, and other linguistic characteristics, especially in non-Latin languages such as Arabic. In this paper, an English/Arabic-enabled web-based framework is designed and implemented. It treats user interaction (adding new rules, enriching the dictionary and evaluating results) as an important step in improving the system’s behavior. The proposed framework allows processing of both single-language and bilingual datasets. The proposed framework is implemented and verified empirically in several case studies. The comparison results showed that the proposed system offers substantial improvements over known tools.

    INVESTIGATION OF TECHNIQUES FOR EFFICIENT & ACCURATE INDEXING FOR SCALABLE RECORD LINKAGE & DEDUPLICATION

    Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today’s databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
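
    The role of a blocking key can be sketched as follows. This is a minimal example of standard blocking, one common indexing approach, and is not taken from the paper itself: the record fields (`surname`, `postcode`), the key definition in `blocking_key` and the sample records are all hypothetical. Only records that share a key are paired for comparison, which is how indexing reduces the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: upper-cased surname initial plus first two postcode digits.
    return (record["surname"][:1].upper(), record["postcode"][:2])


def candidate_pairs(records):
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = set()
    for ids in blocks.values():
        # Compare records only within a block, never across blocks.
        pairs.update(combinations(ids, 2))
    return pairs


# Made-up records for illustration only.
records = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Smyth", "postcode": "2601"},
    {"surname": "Jones", "postcode": "2600"},
    {"surname": "smith", "postcode": "2615"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

    The choice of key is a trade-off: a coarser key keeps more true matches in the same block but leaves more pairs to compare, while a finer key prunes more pairs at the risk of separating true duplicates (for example, a surname misspelled in its first letter would land in a different block here).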

    Cloud-Scale Entity Resolution: Current State and Open Challenges

    Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volume has grown so large over the last two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges, and discuss potential solutions and further research potential in this field.

    Information integration for graph databases

    Philip S. Yu, Jiawei Han, Christos Faloutsos