13 research outputs found

    Event-Driven Duplicate Detection: A Probability-based Approach

    The importance of probability-based approaches for duplicate detection has been recognized in both research and practice. However, existing approaches do not aim to consider the underlying real-world events resulting in duplicates (e.g., that a relocation may lead to the storage of two records for the same customer, once before and once after the relocation). Duplicates resulting from real-world events exhibit specific characteristics. For instance, duplicates resulting from relocations tend to have significantly different attribute values for all address-related attributes. Hence, existing approaches focusing on high similarity with respect to attribute values are hardly able to identify possible duplicates resulting from such real-world events. To address this issue, we propose an approach for event-driven duplicate detection based on probability theory. Our approach assigns to each analysed pair of records the probability of being a duplicate resulting from real-world events, while avoiding the limiting assumptions of existing approaches. We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analysing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform well-known state-of-the-art approaches for duplicate detection.
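
    The abstract gives no implementation details, but the core idea, assigning to a record pair the probability that it is a duplicate caused by a specific real-world event such as a relocation, can be illustrated with a small Bayesian sketch. Everything below (the record fields, the likelihood values, and the relocation_duplicate_probability helper) is an illustrative assumption, not the authors' model.

```python
# Hypothetical sketch (not the authors' implementation): combine per-attribute
# evidence with an event prior via Bayes' rule to score a record pair as a
# relocation-induced duplicate.

from dataclasses import dataclass

@dataclass
class Record:
    name: str
    birth_date: str
    street: str
    city: str

def attr_match(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()

def relocation_duplicate_probability(r1: Record, r2: Record,
                                     prior: float = 0.01) -> float:
    """Posterior probability that (r1, r2) is a duplicate caused by a relocation.

    Likelihoods are illustrative placeholders: identity attributes should agree,
    while address attributes are *expected* to differ after a relocation.
    """
    p_evidence_dup, p_evidence_non = 1.0, 1.0
    for holds, p_if_dup, p_if_non in [
        (attr_match(r1.name, r2.name),             0.95, 0.001),  # names agree
        (attr_match(r1.birth_date, r2.birth_date), 0.98, 0.01),   # birth dates agree
        (not attr_match(r1.street, r2.street),     0.90, 0.80),   # street differs
        (not attr_match(r1.city, r2.city),          0.60, 0.70),   # city differs
    ]:
        p_evidence_dup *= p_if_dup if holds else (1 - p_if_dup)
        p_evidence_non *= p_if_non if holds else (1 - p_if_non)
    numerator = prior * p_evidence_dup
    return numerator / (numerator + (1 - prior) * p_evidence_non)

if __name__ == "__main__":
    before = Record("Anna Schmidt", "1980-04-02", "Hauptstr. 1", "Augsburg")
    after = Record("Anna Schmidt", "1980-04-02", "Ringweg 7", "Munich")
    print(f"P(duplicate via relocation) = "
          f"{relocation_duplicate_probability(before, after):.3f}")
```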

    A conceptual framework for the adoption of big data analytics by e-commerce startups: a case-based approach

    E-commerce start-ups have ventured into emerging economies and are growing at a rapid pace. Big data has acted as a catalyst in their growth story. Big data analytics (BDA) has attracted e-commerce firms to invest in the tools and gain a cutting edge over their competitors. The process of adoption of these BDA tools by e-commerce start-ups has been an area of interest, as successful adoption would lead to better results. The present study aims to develop an interpretive structural model (ISM) which would act as a framework for efficient implementation of BDA. The study uses hybrid multi-criteria decision-making processes to develop the framework and tests it using a real-life case study. A systematic review of the literature and discussion with experts identified 11 enablers of the adoption of BDA tools. Primary data was collected from industry experts to develop an ISM framework, and fuzzy MICMAC analysis is used to categorize the enablers of the adoption process. The framework is then tested using a case study. Thematic clustering is performed to develop a simple ISM framework, followed by a fuzzy analytical network process (ANP) to discuss the association and ranking of enablers. The results indicate that access to relevant data forms the base of the framework and would act as the strongest enabler in the adoption process, while the case company rates the technical skillset of employees as the most important enabler. It was also found that there is a positive correlation between the rankings of enablers emerging from ISM and ANP. The framework helps simplify the strategies any e-commerce company would follow to adopt BDA in future. © 2019, Springer-Verlag GmbH Germany, part of Springer Nature.
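
    As a rough illustration of the ISM and MICMAC steps mentioned above, the sketch below derives a final reachability matrix by transitive closure of a binary initial reachability matrix and computes the driving and dependence power used to categorize enablers. The 4x4 matrix and the enabler names are made-up placeholders, not the study's data.

```python
# Illustrative ISM/MICMAC sketch: transitive closure of an initial reachability
# matrix, then driving and dependence power per enabler. Data is made up.

import numpy as np

enablers = ["Access to relevant data", "Technical skillset",
            "Top management support", "IT infrastructure"]

# initial reachability matrix: M[i][j] = 1 if enabler i influences enabler j
M = np.array([[1, 1, 0, 1],
              [0, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 1, 0, 1]])

def transitive_closure(m: np.ndarray) -> np.ndarray:
    """Add indirect links until the matrix stabilizes."""
    r = m.copy()
    while True:
        nxt = ((r @ r) + r > 0).astype(int)
        if np.array_equal(nxt, r):
            return r
        r = nxt

R = transitive_closure(M)
driving = R.sum(axis=1)     # row sums: how many enablers each one drives
dependence = R.sum(axis=0)  # column sums: how many enablers drive it

for name, drv, dep in zip(enablers, driving, dependence):
    print(f"{name:28s} driving={drv} dependence={dep}")
```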

    Big Data Fusion Model for Heterogeneous Financial Market Data (FinDF)

    The dawn of big data has seen the volume, variety, and velocity of data sources increase dramatically. Enormous amounts of structured, semi-structured and unstructured heterogeneous data can be garnered at a rapid rate, making analysis of such big data a herculean task. This has never been truer for data relating to financial stock markets, the biggest challenge being the 7 Vs of big data, which relate to the collection, pre-processing, storage and real-time processing of such huge quantities of disparate data sources. Data fusion techniques have been adopted in a wide range of fields to cope with such vast amounts of heterogeneous data from multiple sources and fuse them together in order to produce a more comprehensive view of the data and its underlying relationships. Research into the fusing of heterogeneous financial data is scant within the literature, with existing work only considering the fusing of text-based financial documents. The lack of integration between financial stock market data, social media comments, financial discussion board posts and broker agencies means that the benefits of data fusion are not being realised to their full potential. This paper proposes a novel data fusion model, inspired by the data fusion model introduced by the Joint Directors of Laboratories, for the fusing of disparate data sources relating to financial stocks. Data with a diverse set of features from different data sources will supplement each other in order to obtain a Smart Data Layer, which will assist in scenarios such as irregularity detection and prediction of stock prices.
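
    The paper itself describes a layered fusion model; purely as a toy reading of the Smart Data Layer idea, the sketch below aligns heterogeneous records (tweets, broker ratings, prices) into buckets keyed by company ticker and time window. The field names and the 15-minute window are assumptions, not part of the FinDF model.

```python
# Toy fusion step: group records from disparate sources by (ticker, time window).
# Field names and window size are illustrative assumptions.

from collections import defaultdict
from datetime import datetime

def window_start(ts: datetime) -> datetime:
    """Snap a timestamp to the start of its 15-minute window."""
    minute = (ts.minute // 15) * 15
    return ts.replace(minute=minute, second=0, microsecond=0)

def fuse(*sources):
    """Group records from any number of named sources by (ticker, window)."""
    fused = defaultdict(list)
    for source_name, records in sources:
        for rec in records:
            key = (rec["ticker"], window_start(rec["timestamp"]))
            fused[key].append((source_name, rec))
    return fused

tweets = [{"ticker": "TSCO", "timestamp": datetime(2024, 1, 5, 9, 7),
           "text": "$TSCO looking strong this morning"}]
ratings = [{"ticker": "TSCO", "timestamp": datetime(2024, 1, 5, 9, 12),
            "broker": "ExampleBroker", "rating": "Buy"}]
prices = [{"ticker": "TSCO", "timestamp": datetime(2024, 1, 5, 9, 14),
           "price": 291.4}]

smart_data_layer = fuse(("tweet", tweets), ("broker_rating", ratings),
                        ("price", prices))
for (ticker, start), items in smart_data_layer.items():
    print(ticker, start.isoformat(), [name for name, _ in items])
```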

    Assessing Data Quality - A Probability-based Metric for Semantic Consistency

    We present a probability-based metric for semantic consistency using a set of uncertain rules. As opposed to existing metrics for semantic consistency, our metric allows rules that are expected to be fulfilled with specific probabilities to be taken into account. The resulting metric values represent the probability that the assessed dataset is free of internal contradictions with regard to the uncertain rules and thus have a clear interpretation. The theoretical basis for determining the metric values consists of statistical tests and the concept of the p-value, allowing the interpretation of the metric value as a probability. We demonstrate the practical applicability and effectiveness of the metric in a real-world setting by analyzing a customer dataset of an insurance company. Here, the metric was applied to identify semantic consistency problems in the data and to support decision-making, for instance, when offering individual products to customers.
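
    As a minimal sketch of the general idea (not the authors' exact metric), each uncertain rule can be checked with a one-sided binomial test: given the probability with which the rule is expected to hold, the test asks whether the observed number of violations is still plausible, and the resulting p-value is read as a rule-level consistency score. The rules, probabilities, counts, and the naive product used to combine the rules below are illustrative assumptions.

```python
# Illustrative sketch: one-sided binomial test per uncertain rule, with the
# p-value interpreted as a rule-level consistency score.

from scipy.stats import binomtest

def rule_consistency(n_records: int, n_violations: int,
                     expected_hold_prob: float) -> float:
    """p-value for 'violations are no more frequent than the rule allows'."""
    result = binomtest(n_violations, n_records,
                       p=1.0 - expected_hold_prob, alternative="greater")
    return result.pvalue

# e.g. "customers younger than 18 do not hold a life-insurance policy",
# expected to hold for 99% of records; 14 violations observed in 1,000 records
rules = [
    ("minor_without_life_policy", 1000, 14, 0.99),
    ("postcode_matches_city",     1000, 55, 0.95),
]

metric = 1.0
for name, n, k, p_hold in rules:
    pv = rule_consistency(n, k, p_hold)
    print(f"{name}: p-value = {pv:.3f}")
    metric *= pv  # simplistic combination, assuming independent rules

print(f"combined consistency score (illustrative): {metric:.3f}")
```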

    Design principles for digital value co-creation networks—a service-dominant logic perspective

    Information systems (IS) increasingly expand actor-to-actor networks beyond their temporal, organizational, and spatial boundaries. In such networks and through digital technology, IS enable distributed economic and social actors not only to exchange but also to integrate their resources in materializing value co-creation processes. To account for such IS-enabled value co-creation processes in multi-actor settings, this research introduces the phenomenon of digital value co-creation networks (DVNs). In designing DVNs, it is necessary to consider not only the underpinning value co-creation processes but also the characteristics of the business environments in which DVNs evolve. To this end, our study guides the design of DVNs by employing service-dominant logic, a theoretical lens that conceptualizes both value co-creation and business environments. Through an iterative research process, this study derives design requirements and design principles for DVNs, and eventually discusses how these design principles can be illustrated by expository design features for DVNs.

    A Smart Data Ecosystem for the Monitoring of Financial Market Irregularities

    Investments made on the stock market depend on timely and credible information being made available to investors. Such information can be sourced from online news articles, broker agencies, and discussion platforms such as financial discussion boards and Twitter. The monitoring of such discussion is a challenging yet necessary task to support the transparency of the financial market. Although financial discussion boards are typically monitored by administrators who respond to other users reporting posts for misconduct, actively monitoring social media such as Twitter remains a difficult task. Users sharing news about stock-listed companies on Twitter can embed cashtags in their tweets that mimic a company’s stock ticker symbol (e.g. TSCO on the London Stock Exchange refers to Tesco PLC). A cashtag is simply the ticker characters prefixed with a ‘$’ symbol, which then becomes a clickable hyperlink – similar to a hashtag. Twitter, however, does not distinguish between companies with identical ticker symbols that belong to different exchanges. TSCO, for example, refers to Tesco PLC on the London Stock Exchange but also refers to the Tractor Supply Company listed on the NASDAQ. This research refers to such scenarios as a ‘cashtag collision’. Investors who wish to capitalise on the fast dissemination that Twitter provides may become susceptible to tweets containing colliding cashtags. Further exacerbating this issue is the presence of tweets referring to cryptocurrencies, which also feature cashtags that can be identical to the cashtags used for stock-listed companies. A system that is capable of identifying stock-specific tweets by resolving such collisions, and assessing the credibility of such messages, would be of great benefit to a financial market monitoring system by filtering out non-significant messages. This project involved the design and development of a novel, multi-layered, smart data ecosystem to monitor potential irregularities within the financial market. This ecosystem is primarily concerned with the behaviour of participants’ communicative practices on discussion platforms and the activity surrounding company events (e.g. a broker rating being issued for a company). A wide array of data sources – such as tweets, discussion board posts, broker ratings, and share prices – is collected to support this process. A novel data fusion model fuses these data sources together to synchronise the data and allow easier analysis by combining data sources for a given time window (based on the company the data refers to and the date and time). This data fusion model, located within the data layer of the ecosystem, utilises supervised machine learning classifiers (chosen because of the domain expertise needed to accurately describe the origin of a tweet in a binary way) that are trained on a novel set of features to classify tweets as being related to a London Stock Exchange-listed company or not. Experiments involving the training of such classifiers have achieved accuracy scores of up to 94.9%. The ecosystem also adopts supervised learning to classify tweets concerning their credibility. Credibility classifiers are trained on both general features found in all tweets and a novel set of features only found within financial stock tweets. The experiments in which these credibility classifiers were trained have yielded AUC scores of up to 94.3.
Once the data has been fused and irrelevant tweets have been identified, unsupervised clustering algorithms are used within the detection layer of the ecosystem to cluster tweets and posts for a specific time window or event and flag them as potentially irregular. The results are then presented to the user within the presentation and decision layer, where the user may wish to perform further analysis or additional clustering.
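
The cashtag-collision step described above lends itself to a small supervised-learning illustration. The sketch below trains a classifier to decide whether a $TSCO tweet refers to the LSE-listed Tesco PLC or to Tractor Supply Company on the NASDAQ; the tiny hand-written training set, the TF-IDF features and the logistic regression are stand-ins, not the thesis's actual feature set or models.

```python
# Toy cashtag-collision classifier: which exchange does a $TSCO tweet refer to?
# Training data, features, and model are illustrative stand-ins only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_tweets = [
    "$TSCO grocery sales up again, strong UK retail numbers",    # Tesco PLC (LSE)
    "Tesco $TSCO announces new store openings across Britain",   # Tesco PLC (LSE)
    "$TSCO farm and ranch demand keeps growing in the US",       # Tractor Supply (NASDAQ)
    "Tractor Supply $TSCO beats earnings, rural retail strong",  # Tractor Supply (NASDAQ)
]
labels = ["LSE", "LSE", "NASDAQ", "NASDAQ"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(train_tweets, labels)

new_tweet = "$TSCO shares climb after upbeat UK supermarket results"
print(model.predict([new_tweet])[0])  # likely "LSE" on this toy data
```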

    A data-driven decision-making model for the third-party logistics industry in Africa

    Third-party logistics (3PL) providers have continued to be key players in the supply chain network and have witnessed a growth in the usage of information technology. This growth has increased the volume of structured and unstructured data that is collected at high velocity and is of rich variety, sometimes described as “Big Data”. Leaders in the 3PL industry are constantly seeking to mature their ability to exploit this data effectively and efficiently to gain business value through data-driven decision-making (DDDM). DDDM helps leaders reduce the reliance they place on observation and intuition when making crucial business decisions in a volatile business environment. The aim of this research was to develop a prescriptive model for DDDM in 3PLs. The model consists of iterative elements that prescribe guidelines to decision-makers in the 3PL industry on how to adopt DDDM. A literature review of existing theoretical frameworks and models for DDDM was conducted to determine the extent to which they contribute towards DDDM for 3PLs. The Design-Science Research Methodology (DSRM) was followed to address the aim of the research and applied to pragmatically and iteratively develop and evaluate the artefact (the model for DDDM) in the real-world context of a 3PL. The literature findings revealed that the challenges with DDDM in organisations fall into three main categories, relating to data quality, data management, and vision and capabilities. Once the challenges with DDDM were established, a prescriptive model for DDDM in 3PLs was designed and developed. Qualitative data was collected from semi-structured interviews to gain an understanding of the problems and possible solutions in the real-world context of 3PLs. An As-Is analysis in the real-world case 3PL company confirmed the challenges identified in the literature and showed that data is still used in the company only for descriptive and diagnostic analytics to aid the decision-making processes. This highlights that there is still room to mature towards using data for predictive and prescriptive analytics, which will, in turn, improve the decision-making process. An improved second version of the model was demonstrated to the participants (the targeted users), who had the opportunity to evaluate the model. The findings revealed that the model provided clear guidelines on how to make data-driven decisions and that the feedback loop and the data culture aspects highlighted in the design were among the important features of the model. Some improvements were suggested by participants. A field study of three data analytics tools was conducted to identify the advantages and disadvantages of each, as well as to highlight the status of DDDM at the real-world case 3PL. The limitations of the second version of the model, together with the recommendations from the participants, were used to inform the improved and revised third version of the model.
    Thesis (MSc) -- Faculty of Science, School of Computer Science, Mathematics, Physics and Statistics, 202