
    From Data Fusion to Knowledge Fusion

    The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [LDL+12] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which focuses only on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show the great promise of data fusion approaches in solving the knowledge fusion problem, and suggest interesting research directions through a detailed error analysis of the methods. Comment: VLDB'201
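    The paper adapts existing data fusion techniques to the knowledge fusion setting. As an illustrative sketch of the general family of methods involved (not the authors' exact algorithms), the snippet below implements a simple iterative, source-accuracy-weighted voting baseline; the sources, data items, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical (source, data_item, value) claims; names are illustrative only.
claims = [
    ("site_a", "tom_cruise_dob", "1962-07-03"),
    ("site_b", "tom_cruise_dob", "1962-07-03"),
    ("site_c", "tom_cruise_dob", "1961-07-03"),
    ("site_a", "capital_of_australia", "Canberra"),
    ("site_c", "capital_of_australia", "Sydney"),
]

def weighted_vote(claims, iterations=10):
    """Iteratively estimate source accuracy and re-vote on each data item."""
    accuracy = {source: 0.8 for source, _, _ in claims}   # uniform prior
    truth = {}
    for _ in range(iterations):
        # Vote: weight each claimed value by the accuracy of its source.
        scores = defaultdict(lambda: defaultdict(float))
        for source, item, value in claims:
            scores[item][value] += accuracy[source]
        truth = {item: max(vals, key=vals.get) for item, vals in scores.items()}
        # Re-estimate accuracy: fraction of each source's claims matching the consensus.
        correct, total = defaultdict(int), defaultdict(int)
        for source, item, value in claims:
            total[source] += 1
            correct[source] += value == truth[item]
        accuracy = {source: correct[source] / total[source] for source in total}
    return truth, accuracy

truth, accuracy = weighted_vote(claims)
print(truth)      # consensus value per data item
print(accuracy)   # estimated reliability per source
```

    At the scale reported in the paper (1.6B triples from over 1B pages), the same idea would be expressed as a distributed aggregation rather than in-memory dictionaries.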

    From Big Data To Knowledge – Good Practices From Industry

    Recent advancements in data gathering technologies have led to the rise of a large amount of data through which useful insights and ideas can be derived. These data sets are typically too large to process using traditional data processing tools and applications and are thus known in the popular press as ‘big data’. It is essential to extract the hidden meanings in the available data sets by aggregating big data into knowledge, which may then positively contribute to decision making. One way to engage in data-driven strategy is to gather contextually relevant data on specific customers, products, and situations, and determine optimised offerings that are most appealing to the target customers based on sound analytics. Corporations around the world have been increasingly applying analytics, tools and technologies to capture, manage and process such data, and derive value out of the huge volumes of data generated by individuals. The detailed intelligence on consumer behaviour, user patterns and other hidden knowledge that was not possible to derive via traditional means can now be used to facilitate important business processes such as real-time control and demand forecasting. The aim of our research is to understand and analyse the significance and impact of big data in today’s industrial environment and identify the good practices that can help us derive useful knowledge out of this wealth of information, based on content analysis of 34 firms that have initiated big data analytical projects. Our descriptive and network analysis shows that the goals of a big data initiative are extensible and highlights the importance of data representation. We also find that the data analytical techniques adopted are heavily dependent on the project goals.

    The Policy Infrastructure for Big Data: From Data to Knowledge to Action


    Knowledge extraction from raw data in water networks: application to the Barcelona supramunicipal water transport network

    Critical Infrastructure Systems (CIS), such as potable water transport networks, are complex large-scale systems, geographically distributed and decentralized with a hierarchical structure, requiring highly sophisticated supervisory and real-time control (RTC) schemes to ensure high performance when conditions are unfavorable due to, e.g., sensor malfunctions (drifts, offsets, battery problems, communication problems, ...). Once the data are reliable, a process that transforms these validated data into useful information and knowledge is key for the real-time operating plan (RTC). Moreover, and no less important, it allows extracting useful knowledge about the assets and instrumentation of the network (sectors of pipes and reservoirs, flowmeters, level sensors, ...) for short-, medium- and long-term management plans. In this work, we describe an overall analysis of the results of applying a methodology for sensor data validation/reconstruction to the ATLL water network in the city of Barcelona and the surrounding metropolitan area from 2008 to 2013. This methodology is very important for assessing the economic and hydraulic efficiency of the network.
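    The abstract refers to a sensor data validation/reconstruction methodology without detailing it. Purely as an illustrative sketch (not the method applied to the ATLL network), the snippet below flags implausible flowmeter readings against assumed physical bounds and reconstructs the invalidated samples by interpolation; the readings and limits are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly flowmeter readings (m3/h); values and limits are illustrative.
readings = pd.Series([120.0, 118.5, -5.0, 119.2, 4000.0, 121.0, np.nan, 122.3])
LOW, HIGH = 0.0, 500.0   # assumed physical plausibility bounds for this sensor

# Validation: flag readings outside the plausible bounds as invalid.
valid = readings.between(LOW, HIGH)
cleaned = readings.where(valid)

# Reconstruction: fill invalidated/missing samples by linear interpolation in time.
reconstructed = cleaned.interpolate(method="linear", limit_direction="both")
print(reconstructed)
```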

    Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data.

    A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that capture the article’s main topics. Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are conceptually related to keyphrase-frequency and I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive. The new features are generated by issuing queries to a Web search engine, based on the candidate phrases in the input document. The feature values are calculated from the number of hits for the queries (the number of matching Web pages). In essence, these new features are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases.
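    The new features are computed from search engine hit counts. As a hedged sketch of how one such hit-count feature could be derived (the exact feature definitions are given in the paper), the function below computes a pointwise-mutual-information-style association score; hit_count is a caller-supplied stand-in for a real search engine query, and the toy counts are invented.

```python
import math

def web_association_score(phrase, context_word, hit_count, total_pages=350_000_000):
    """PMI-style association score computed from Web hit counts.

    `hit_count` maps a query string to the number of matching pages;
    `total_pages` approximates the size of the collection being searched.
    """
    p_phrase = hit_count(f'"{phrase}"') / total_pages
    p_word = hit_count(context_word) / total_pages
    p_joint = hit_count(f'"{phrase}" {context_word}') / total_pages
    if min(p_phrase, p_word, p_joint) == 0.0:
        return 0.0
    return math.log(p_joint / (p_phrase * p_word))

# Toy stand-in for a search engine: fixed hit counts for a few queries.
fake_hits = {
    '"neural network"': 1_200_000,
    'training': 9_000_000,
    '"neural network" training': 600_000,
}
print(web_association_score("neural network", "training",
                            lambda q: fake_hits.get(q, 0)))
```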

    Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge

    Distributional models provide a convenient way to model semantics using dense embedding spaces derived from unsupervised learning algorithms. However, the dimensions of dense embedding spaces are not designed to resemble human semantic knowledge. Moreover, embeddings are often built from a single source of information (typically text data), even though neurocognitive research suggests that semantics is deeply linked to both language and perception. In this paper, we combine multimodal information from both text and image-based representations derived from state-of-the-art distributional models to produce sparse, interpretable vectors using Joint Non-Negative Sparse Embedding. Through in-depth analyses comparing these sparse models to human-derived behavioural and neuroimaging data, we demonstrate their ability to predict interpretable linguistic descriptions of human ground-truth semantic knowledge. Comment: Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 260-270. Brussels, Belgium, October 31 - November 1, 2018. Association for Computational Linguistics
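    Joint Non-Negative Sparse Embedding itself is not reproduced here. As a rough stand-in under stated assumptions, the sketch below concatenates hypothetical text and image embeddings for the same vocabulary, shifts them to be non-negative, and factorises them with scikit-learn's NMF to obtain non-negative, lower-dimensional codes; the embedding matrices are random placeholders, and plain NMF does not enforce the same sparsity structure as the paper's method.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical dense embeddings for the same 1,000 words (rows = words).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(1000, 300))    # e.g. word2vec-style text vectors
image_emb = rng.normal(size=(1000, 128))   # e.g. CNN-derived visual vectors

# Concatenate modalities and shift to be non-negative (a crude stand-in for
# the joint non-negative formulation used in the paper).
joint = np.hstack([text_emb, image_emb])
joint -= joint.min()

# Factorise into non-negative, lower-dimensional (more interpretable) codes.
model = NMF(n_components=50, init="nndsvd", max_iter=300, random_state=0)
codes = model.fit_transform(joint)   # one non-negative code vector per word
print(codes.shape)                   # (1000, 50)
```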

    Classification of Data to Extract Knowledge from Neural Networks

    A major drawback of artificial neural networks is their black-box character. Therefore, rule extraction algorithms are becoming more and more important for explaining the rules learned by neural networks. In this paper, we use a method for symbolic knowledge extraction from neural networks once they have been trained on the desired function. The basis of this method is the weights of the trained neural network. The method allows knowledge extraction from neural networks with continuous inputs and outputs, as well as rule extraction. An example of the application is shown, based on the extraction of the average load demand of a power plant.
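    The abstract describes weight-based rule extraction only in general terms. As a toy illustration (not the algorithm used in the paper), the sketch below reads simple IF-THEN antecedents off the weights of a single trained linear unit; the feature names, weights, and threshold are hypothetical.

```python
import numpy as np

def extract_rule(weights, feature_names, threshold=0.5):
    """Turn the weights of one trained linear unit into a crude IF-THEN rule.

    Features whose weight magnitude exceeds `threshold` become antecedents,
    signed by the direction of the weight.
    """
    antecedents = [
        f"{name} is {'HIGH' if w > 0 else 'LOW'}"
        for w, name in zip(weights, feature_names)
        if abs(w) >= threshold
    ]
    return "IF " + " AND ".join(antecedents) + " THEN output is HIGH"

# Hypothetical weights of a unit predicting average load demand (illustrative).
weights = np.array([1.2, -0.9, 0.1])
print(extract_rule(weights, ["ambient_temperature", "humidity", "hour_of_day"]))
# -> IF ambient_temperature is HIGH AND humidity is LOW THEN output is HIGH
```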