1,202 research outputs found

    Distributed Holistic Clustering on Linked Data

    Full text link
    Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We here propose a distributed holistic approach to link many data sources based on a clustering of entities that represent the same real-world object. Our clustering approach provides a compact and fused representation of entities, and can identify errors in existing links as well as many new links. We support a distributed execution of the clustering approach to achieve faster execution times and scalability for large real-world data sets. We provide a novel gold standard for multi-source clustering, and evaluate our methods with respect to effectiveness and efficiency for large data sets from the geographic and music domains

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Get PDF
    Text mining, also known as Intelligent Text Analysis is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Proce- ssing massive amount of data stored in a unstructured form is a challenging task. Several pre-processing methods and algo- rithms are needed to extract useful features from huge amount of data. The survey covers different text summarization, classi- fication, clustering methods to discover useful features and also discovering query facets which are multiple groups of words or phrases that explain and summarize the content covered by a query thereby reducing time taken by the user. Dealing with collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (remove and replace duplicates).The survey provides existing text mining techniques to extract relevant features, detect duplicates and to replace the duplicate data to get fine grained knowledge to the user

    Multi-Source Data Fusion for Cyberattack Detection in Power Systems

    Full text link
    Cyberattacks can cause a severe impact on power systems unless detected early. However, accurate and timely detection in critical infrastructure systems presents challenges, e.g., due to zero-day vulnerability exploitations and the cyber-physical nature of the system coupled with the need for high reliability and resilience of the physical system. Conventional rule-based and anomaly-based intrusion detection system (IDS) tools are insufficient for detecting zero-day cyber intrusions in the industrial control system (ICS) networks. Hence, in this work, we show that fusing information from multiple data sources can help identify cyber-induced incidents and reduce false positives. Specifically, we present how to recognize and address the barriers that can prevent the accurate use of multiple data sources for fusion-based detection. We perform multi-source data fusion for training IDS in a cyber-physical power system testbed where we collect cyber and physical side data from multiple sensors emulating real-world data sources that would be found in a utility and synthesizes these into features for algorithms to detect intrusions. Results are presented using the proposed data fusion application to infer False Data and Command injection-based Man-in- The-Middle (MiTM) attacks. Post collection, the data fusion application uses time-synchronized merge and extracts features followed by pre-processing such as imputation and encoding before training supervised, semi-supervised, and unsupervised learning models to evaluate the performance of the IDS. A major finding is the improvement of detection accuracy by fusion of features from cyber, security, and physical domains. Additionally, we observed the co-training technique performs at par with supervised learning methods when fed with our features

    Scalable Data Integration for Linked Data

    Get PDF
    Linked Data describes an extensive set of structured but heterogeneous datasources where entities are connected by formal semantic descriptions. In thevision of the Semantic Web, these semantic links are extended towards theWorld Wide Web to provide as much machine-readable data as possible forsearch queries. The resulting connections allow an automatic evaluation to findnew insights into the data. Identifying these semantic connections betweentwo data sources with automatic approaches is called link discovery. We derivecommon requirements and a generic link discovery workflow based on similaritiesbetween entity properties and associated properties of ontology concepts. Mostof the existing link discovery approaches disregard the fact that in times ofBig Data, an increasing volume of data sources poses new demands on linkdiscovery. In particular, the problem of complex and time-consuming linkdetermination escalates with an increasing number of intersecting data sources.To overcome the restriction of pairwise linking of entities, holistic clusteringapproaches are needed to link equivalent entities of multiple data sources toconstruct integrated knowledge bases. In this context, the focus on efficiencyand scalability is essential. For example, reusing existing links or backgroundinformation can help to avoid redundant calculations. However, when dealingwith multiple data sources, additional data quality problems must also be dealtwith. This dissertation addresses these comprehensive challenges by designingholistic linking and clustering approaches that enable reuse of existing links.Unlike previous systems, we execute the complete data integration workflowvia a distributed processing system. At first, the LinkLion portal will beintroduced to provide existing links for new applications. These links act asa basis for a physical data integration process to create a unified representationfor equivalent entities from many data sources. We then propose a holisticclustering approach to form consolidated clusters for same real-world entitiesfrom many different sources. At the same time, we exploit the semantic typeof entities to improve the quality of the result. The process identifies errorsin existing links and can find numerous additional links. Additionally, theentity clustering has to react to the high dynamics of the data. In particular,this requires scalable approaches for continuously growing data sources withmany entities as well as additional new sources. Previous entity clusteringapproaches are mostly static, focusing on the one-time linking and clustering ofentities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of newentities and data sources. To cope with the ever-increasing number of LinkedData sources, efficient and scalable methods based on distributed processingsystems are required. Thus we propose distributed holistic approaches to linkmany data sources based on a clustering of entities that represent the samereal-world object. The implementation is realized on Apache Flink. In contrastto previous approaches, we utilize efficiency-enhancing optimizations for bothdistributed static and dynamic clustering. An extensive comparative evaluationof the proposed approaches with various distributed clustering strategies showshigh effectiveness for datasets from multiple domains as well as scalability on amulti-machine Apache Flink cluster

    Algorithms, applications and systems towards interpretable pattern mining from multi-aspect data

    Get PDF
    How do humans move around in the urban space and how do they differ when the city undergoes terrorist attacks? How do users behave in Massive Open Online courses~(MOOCs) and how do they differ if some of them achieve certificates while some of them not? What areas in the court elite players, such as Stephen Curry, LeBron James, like to make their shots in the course of the game? How can we uncover the hidden habits that govern our online purchases? Are there unspoken agendas in how different states pass legislation of certain kinds? At the heart of these seemingly unconnected puzzles is this same mystery of multi-aspect mining, i.g., how can we mine and interpret the hidden pattern from a dataset that simultaneously reveals the associations, or changes of the associations, among various aspects of the data (e.g., a shot could be described with three aspects, player, time of the game, and area in the court)? Solving this problem could open gates to a deep understanding of underlying mechanisms for many real-world phenomena. While much of the research in multi-aspect mining contribute broad scope of innovations in the mining part, interpretation of patterns from the perspective of users (or domain experts) is often overlooked. Questions like what do they require for patterns, how good are the patterns, or how to read them, have barely been addressed. Without efficient and effective ways of involving users in the process of multi-aspect mining, the results are likely to lead to something difficult for them to comprehend. This dissertation proposes the M^3 framework, which consists of multiplex pattern discovery, multifaceted pattern evaluation, and multipurpose pattern presentation, to tackle the challenges of multi-aspect pattern discovery. Based on this framework, we develop algorithms, applications, and analytic systems to enable interpretable pattern discovery from multi-aspect data. Following the concept of meaningful multiplex pattern discovery, we propose PairFac to close the gap between human information needs and naive mining optimization. We demonstrate its effectiveness in the context of impact discovery in the aftermath of urban disasters. We develop iDisc to target the crossing of multiplex pattern discovery with multifaceted pattern evaluation. iDisc meets the specific information need in understanding multi-level, contrastive behavior patterns. As an example, we use iDisc to predict student performance outcomes in Massive Open Online Courses given users' latent behaviors. FacIt is an interactive visual analytic system that sits at the intersection of all three components and enables for interpretable, fine-tunable, and scrutinizable pattern discovery from multi-aspect data. We demonstrate each work's significance and implications in its respective problem context. As a whole, this series of studies is an effort to instantiate the M^3 framework and push the field of multi-aspect mining towards a more human-centric process in real-world applications

    A hybrid semantic approach to building dynamic maps of research communities

    Get PDF
    In the last ten years, ontology-based recommender systems have been shown to be effective tools for predicting user preferences and suggesting items. There are however some issues associated with the ontologies adopted by these approaches, such as: 1) their crafting is not a cheap process, being time consuming and calling for specialist expertise; 2) they may not represent accurately the viewpoint of the targeted user community; 3) they tend to provide rather static models, which fail to keep track of evolving user perspectives. To address these issues, we propose Klink UM, an approach for extracting emergent semantics from user feedbacks, with the aim of tailoring the ontology to the users and improving the recommendations accuracy. Klink UM uses statistical and machine learning techniques for finding hierarchical and similarity relationships between keywords associated with rated items and can be used for: 1) building a conceptual taxonomy from scratch, 2) enriching and correcting an existing ontology, 3) providing a numerical estimate of the intensity of semantic relationships according to the users. The evaluation shows that Klink UM performs well with respect to handcrafted ontologies and can significantly increase the accuracy of suggestions in content-based recommender systems

    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    Get PDF
    Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system with new protocols and applications that arise at a constant pace. All these characteristics designate the Internet a valuable and challenging data source and application domain for a research activity, both looking at Transport layer, analyzing network tra c flows, and going up to Application layer, focusing on the ever-growing next generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.). In this thesis work we focus on the study, design and development of novel algorithms and frameworks to support large scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as data source and targeting network tra c classification, on-line social network analysis, recommendation systems and cloud services and Big data

    Feature extraction and duplicate detection for text mining: A survey

    Get PDF
    Text mining, also known as Intelligent Text Analysis is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Proce- ssing massive amount of data stored in a unstructured form is a challenging task. Several pre-processing methods and algo- rithms are needed to extract useful features from huge amount of data. The survey covers different text summarization, classi- fication, clustering methods to discover useful features and also discovering query facets which are multiple groups of words or phrases that explain and summarize the content covered by a query thereby reducing time taken by the user
    • …
    corecore