
    BIG DATA AND ANALYTICS AS A NEW FRONTIER OF ENTERPRISE DATA MANAGEMENT

    Big Data and Analytics (BDA) promises significant value generation opportunities across industries. Even though companies are increasing their investments, their BDA initiatives fall short of expectations and they struggle to secure a return on those investments. In order to create business value from BDA, companies must build and extend their data-related capabilities. While the BDA literature has emphasized the capabilities needed to analyze the increasing volumes of data from heterogeneous sources, enterprise data management (EDM) researchers have suggested organizational capabilities to improve data quality. However, to date, little is known about how companies actually orchestrate the allocated resources, especially regarding the quality and use of data, to create value from BDA. Considering these gaps, this thesis investigates, through five interrelated essays, how companies adapt their EDM capabilities to create additional business value from BDA. The first essay lays the foundation of the thesis by investigating how companies extend their Business Intelligence and Analytics (BI&A) capabilities to build more comprehensive enterprise analytics platforms. The second and third essays contribute to fundamental reflections on how organizations are changing and designing data governance in the context of BDA. The fourth and fifth essays look at how companies provide high-quality data to an increasing number of users with innovative EDM tools, namely machine learning (ML) and enterprise data catalogs (EDCs). The thesis outcomes show that BDA has profound implications for EDM practices. In the past, operational data processing and analytical data processing were two "worlds" that were managed separately from each other. With BDA, these "worlds" are becoming increasingly interdependent, and organizations must manage the lifecycles of data and analytics products in close coordination. Also, with BDA, data have become the long-expected, strategically relevant resource. As such, data must now be viewed as a value driver distinct from IT, since it requires specific mechanisms to foster value creation from BDA. BDA thus extends data governance goals: in addition to data quality and regulatory compliance, governance should facilitate data use by broadening data availability and enabling data monetization. Accordingly, companies establish comprehensive data governance designs, including structural, procedural, and relational mechanisms, to enable a broad network of employees to work with data. Existing EDM practices therefore need to be rethought to meet the emerging BDA requirements. While ML is a promising solution for improving data quality in a scalable and adaptable way, EDCs help companies democratize data for a broader range of employees.
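    The idea of using ML to improve data quality at scale, explored in the fourth essay, can be illustrated with a minimal, hedged sketch: an unsupervised anomaly detector flags implausible records so that data stewards review a short candidate list rather than an entire table. The table, the column names, and the choice of scikit-learn's IsolationForest are illustrative assumptions, not the thesis's actual implementation.

        # Illustrative sketch only: flag suspect rows in a (hypothetical) master-data
        # extract with an unsupervised anomaly detector, so that data stewards review
        # a small candidate list instead of the full table.
        import pandas as pd
        from sklearn.ensemble import IsolationForest

        records = pd.DataFrame({
            "annual_revenue_keur": [120, 95, 130, 110, 99000, 105, 118],
            "employee_count":      [12,  9,  14,  11,  10,    13,  1200],
        })

        detector = IsolationForest(contamination=0.2, random_state=0)
        records["suspect"] = detector.fit_predict(records) == -1  # -1 marks outliers

        print(records[records["suspect"]])  # candidates for manual data-quality review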

    Responsible AI and Analytics for an Ethical and Inclusive Digitized Society


    Innovative Financing for Urban Rail in Indian Cities: Land-based Strategic Value Capture Mechanisms

    Emerging cities are seeking urban rail but have difficulty funding it. This research uses the Bangalore Metro rail to develop an innovative land-based 'strategic value capture' (VC) financing system suitable for Indian cities and other emerging cities. It shows significant land value uplift that could be used for VC funding. The four frameworks and strategic interventions developed in this research are novel contributions in India and apply to other emerging cities as well.

    New Multidisciplinary Approaches for Reducing Food Waste in Agribusiness Supply Chains

    This reprint is a collection of research articles that highlight the achievements of the team of the European project REAMIT, funded by Interreg North-West Europe and the ERDF. REAMIT stands for "Improving Resource Efficiency of Agribusiness supply chains by Minimising waste using Big Data and Internet of Things sensors." The main aim of the REAMIT project was to reduce food waste in agrifood supply chains by using the power of modern digital technologies (e.g., the Internet of Things (IoT), sensors, big data, cloud computing, and analytics). The chapters in this reprint provide detailed information on the activities of the project team. They were published as articles in the Special Issue titled "New Multidisciplinary Approaches for Reducing Food Waste in Agribusiness Supply Chains" in the journal Sustainability. For ease of readability and flow, the book is divided into four distinct parts.
    In Part 1, the project members provide a comprehensive review of the existing literature. Part 2 is devoted to in-depth discussions of the development, adaptation, and application of these technologies for specific food companies. While the project team worked with a number of food companies, spanning human milk, fresh fruits and vegetables, and meat production, this part discusses four different applications.
    Part 3 presents a detailed analysis of our case studies. A general life-cycle analysis tool for implementing technology to reduce food waste (REAMIT-type activities) is presented in Chapter 7. A specific application of this tool to the case study of a human milk bank is presented in Chapter 8. In Chapter 9, we develop a novel mathematical programming model to identify the conditions under which food businesses will prefer to use modern technologies to help reduce food waste.
    The final part, Part 4, is devoted to summarising the learnings from the project and developing some policy-oriented guidelines. Chapter 10 reviews the current state of corporate reporting guidelines for reporting on food waste. Chapter 11 presents the important learnings from the REAMIT project on food companies' motivations for reducing waste and the associated challenges. Business models are discussed, and some policy guidelines are developed.
    We gratefully acknowledge the generous funding received from Interreg North-West Europe for carrying out our activities. The content of Chapter 10 was supported by additional funding from the University of Essex. We believe that the reprint and its individual chapters will be of interest to a wide and varied audience and will kindle interest among food companies, technology companies, business support organisations, policy-makers, and members of the academic community in finding ways to reduce food waste with and without the use of technology.
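    Chapter 9's mathematical programming model is not reproduced in this summary; the following is only a minimal, hedged sketch of the kind of adoption condition such a model weighs: a food business prefers IoT-based monitoring when the expected value of avoided waste exceeds the annualised cost of the technology. All names and figures are illustrative assumptions.

        # Hedged sketch (not Chapter 9's actual model): adopt IoT monitoring when the
        # expected value of avoided food waste exceeds the annualised technology cost.
        def adopt_technology(annual_waste_tonnes: float,
                             expected_waste_reduction: float,  # avoidable fraction, e.g. 0.25
                             value_per_tonne_eur: float,
                             annualised_tech_cost_eur: float) -> bool:
            """Return True if expected savings outweigh the technology cost."""
            expected_savings = (annual_waste_tonnes * expected_waste_reduction
                                * value_per_tonne_eur)
            return expected_savings > annualised_tech_cost_eur

        # Hypothetical example: 40 t of waste per year, 25% avoidable with sensors,
        # 800 EUR/t product value, 6,000 EUR/year for sensors, connectivity and analytics.
        print(adopt_technology(40, 0.25, 800, 6000))  # True: 8,000 EUR savings > 6,000 EUR cost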

    Cross-view Embeddings for Information Retrieval

    In this dissertation, we deal with cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space, and CAE provides state-of-the-art performance. We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval, and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses. In this dissertation, we also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on the context available in the source sentence, calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over strong baselines for English-to-Spanish and English-to-Hindi translation tasks. Finally, we explore methods to evaluate the quality of autoencoder-generated representations of text data and analyse their architectural properties. For this, we propose two metrics based on the reconstruction capabilities of the autoencoders: the structure preservation index (SPI) and the similarity accumulation index (SAI). We also introduce the concept of critical bottleneck dimensionality (CBD), below which structural information is lost, and present analyses linking CBD and language perplexity.
    Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457
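    The abstract does not give implementation details for the mixed-script feature space, so the following is only a minimal sketch of the character n-gram representation it mentions, using scikit-learn as an assumed stand-in and romanised variants for simplicity; the cross-view autoencoder itself is not reproduced here.

        # Illustrative sketch of a character n-gram feature space for mixed-script terms.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        terms = ["dhanyavad", "dhanyawad", "thanks"]  # hypothetical transliteration variants
        vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
        features = vectorizer.fit_transform(terms)

        # Variants of the same word overlap in n-grams far more than unrelated terms do.
        print(cosine_similarity(features[0], features[1]))  # relatively high
        print(cosine_similarity(features[0], features[2]))  # near zero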

    Big data-driven multimodal traffic management: trends and challenges


    Cultural Heritage Storytelling, Engagement and Management in the Era of Big Data and the Semantic Web

    The current Special Issue was launched with the aim of further enlightening important CH areas, inviting researchers to submit original/featured multidisciplinary research works related to heritage crowdsourcing, documentation, management, authoring, storytelling, and dissemination. Audience engagement is considered very important at both sides of the CH production–consumption chain (i.e., the push and pull ends). At the same time, sustainability factors are placed at the center of the envisioned analysis. A total of eleven (11) contributions were published within this Special Issue, enlightening various aspects of contemporary heritage strategies in today's ubiquitous society. The published papers relate to, but are not limited to, the following multidisciplinary topics: digital storytelling for cultural heritage; audience engagement in cultural heritage; sustainability impact indicators of cultural heritage; cultural heritage digitization, organization, and management; collaborative cultural heritage archiving, dissemination, and management; cultural heritage communication and education for sustainable development; semantic services of cultural heritage; big data of cultural heritage; smart systems for historical cities (smart cities); and smart systems for cultural heritage sustainability.

    Benchmarking operation readiness of the high-speed rail (HSR) network

    At present, HSR networks have been significantly extended to accommodate increased passenger demand, because the service is believed to unleash social benefits. Nevertheless, the investment in an HSR project is substantially higher than in other transportation projects. Also, most HSR networks have faced unavoidable issues during operation, such as a lack of passenger demand, low operating profit, and safety issues. Despite efforts to address these issues, HSR organisations have been unable to maintain performance in line with international standards and global directions, especially on the sustainability pillar. These shortcomings reduce the effectiveness of HSR organisations and affect passengers' quality of life and socio-economic outcomes. This thesis aims to develop a systems-based benchmarking framework for HSR networks covering six KPIs: operating costs, punctuality, productivity, risk and uncertainty, sustainability, and urbanisation efficiency. These six KPIs are necessary for the sustainable development of upcoming HSR networks. The thesis makes several significant contributions to developing a benchmarking framework for long-term improvement. First, this thesis is the world's first to integrate Bayesian distributions and Python programming to improve safety across the railway network. The resulting model shows higher accuracy than previous models because it combines long-term data sets. Moreover, this thesis develops decision tree and Petri net models to identify risk levels, which helps rail authorities evaluate and enhance safety performance. Next, the thesis focuses on life cycle assessment (LCA) and life cycle cost (LCC) frameworks. The LCA model reflects the environmental perspective of each rail network: an in-depth analysis of each life cycle stage shows the energy consumption and CO2 emission rates and thus reveals each network's energy and emission performance. In addition, this thesis is the world's first study to consider cost uncertainty during HSR operation within the LCC analysis: a net present value calculation with a discount rate is combined with Monte Carlo simulation. The developed model allows HSR authorities to manage budgets under uncertain conditions, especially during the operating stage. Lastly, this thesis concentrates on the social impacts of HSR service, particularly quality of life, educational benefits, and economic opportunities. Long-term datasets are analysed using K-nearest neighbour and Pearson correlation techniques, and the results indicate each operator's performance in delivering social benefits. By adopting these models in practice, people can obtain more benefits from the HSR service. To put this novel framework into practice, a diverse set of current HSR networks is benchmarked. The routes and networks are selected using a range of factors: for example, each network must be stable and trustworthy, as demonstrated by at least ten years of operation, and the selected HSR lines vary in geography, technology, and relevant conditions to avoid bias. The five networks and routes are Beijing-Shanghai (China), Paris-Lyon (France), Tokyo-Osaka (Japan), Madrid-Barcelona (Spain), and Seoul-Busan (South Korea). The analysis indicates that none of the HSR networks achieves high performance in all pillars. Overall, the CR networks perform best, followed by the Renfe, SNCF, JR Central, and Korail networks. In addition, the thesis provides policy implications for long-term development, in particular for safety services, social impacts, environmental impacts, and technology and innovation. These suggestions can be applied practically to both existing and upcoming HSR networks.
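    The LCC approach described above can be illustrated with a minimal, hedged sketch: life cycle cost as a discounted net present value with Monte Carlo sampling of uncertain annual operating costs. All figures, the cost structure, and the distributional choice are illustrative assumptions, not the thesis's actual model.

        # Hedged sketch: NPV-based life cycle cost with Monte Carlo sampling of
        # uncertain operating costs. Figures are assumptions for illustration only.
        import numpy as np

        rng = np.random.default_rng(seed=1)

        def npv(cash_flows: np.ndarray, discount_rate: float) -> float:
            """Discount yearly cash flows (year 0 first) to present value."""
            years = np.arange(cash_flows.shape[0])
            return float(np.sum(cash_flows / (1 + discount_rate) ** years))

        horizon_years = 30
        capex = 8_000e6        # assumed construction cost in year 0, EUR
        base_opex = 120e6      # assumed annual operating cost, EUR
        discount_rate = 0.04

        simulations = []
        for _ in range(10_000):
            # Annual operating costs vary around the base estimate (15% std. dev.).
            opex = rng.normal(loc=base_opex, scale=0.15 * base_opex, size=horizon_years)
            simulations.append(npv(np.concatenate(([capex], opex)), discount_rate))

        simulations = np.array(simulations)
        print(f"median LCC: {np.median(simulations) / 1e9:.2f} bn EUR")
        print(f"95th percentile LCC: {np.percentile(simulations, 95) / 1e9:.2f} bn EUR")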

    Automobile Insurance Fraud Detection Using Data Mining: A Systematic Literature Review

    Insurance is a pivotal element in modern society, but insurers face a persistent challenge from fraudulent behaviour performed by policyholders. This behaviour could be detrimental to both insurance companies and their honest customers, but the intricate nature of insurance fraud severely complicates its efficient, automated detection. This study surveys fifty recent publications on automobile insurance fraud detection, published between January 2019 and March 2023, and presents both the most commonly used data sets and methods for resampling and detection, as well as interesting, novel approaches. The study adopts the highly cited Systematic Literature Review (SLR) methodology for software engineering research proposed by Kitchenham and Charters and collects studies from four online databases. The findings indicate limited public availability of automobile insurance fraud data sets. In terms of detection methods, the prevailing approach involves supervised machine learning methods that utilise structured, intrinsic features of claims or policies and that do not consider an example-dependent cost of misclassification. However, alternative techniques are also explored, including the use of graph-based methods, unstructured textual data, and cost-sensitive classifiers. The most common resampling approach was found to be oversampling. This SLR identifies methods commonly used in recent automobile insurance fraud detection research, as well as interesting directions for future research. It adds value over a related review by also including studies published from 2021 onward and by detailing the methodology used. Limitations of this SLR include its restriction to a small number of publication years and limited validation of the choices made during the process.
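    Two of the themes identified by the review, class imbalance handling and cost-sensitive classification, can be illustrated with a minimal, hedged sketch on a synthetic "claims" dataset; it is not drawn from any of the reviewed studies, and the class weights shown are class-level rather than example-dependent costs.

        # Hedged sketch: class-weighted (cost-sensitive) classification on a synthetic,
        # heavily imbalanced stand-in for structured claim features (~2% fraud).
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import classification_report
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=20_000, n_features=10,
                                   weights=[0.98, 0.02], random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                            random_state=0)

        # Penalise missed fraud (class 1) more heavily than false alarms.
        model = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000)
        model.fit(X_train, y_train)

        print(classification_report(y_test, model.predict(X_test), digits=3))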

    Extracting and Cleaning RDF Data

    The RDF data model has become a prevalent format for representing heterogeneous data because of its versatility. The capability of dismantling information from its native formats and representing it in triple format offers a simple yet powerful way of modelling data obtained from multiple sources. In addition, the triple format and the schema constraints of the RDF model make RDF data easy to process as labeled, directed graphs. This graph representation of RDF data supports higher-level analytics by enabling querying using different techniques and query languages, e.g., SPARQL. Analytics that require structured data are supported by transforming the graph data on-the-fly to populate the target schema needed for downstream analysis. These target schemas are defined by downstream applications according to their information needs. The flexibility of RDF data brings two main challenges. First, the extraction of RDF data is a complex task that may require domain expertise about the information to be extracted for different applications. Second, a significant aspect of analyzing RDF data is its quality, which depends on multiple factors, including the reliability of the data sources and the accuracy of the extraction systems. The quality of the analysis depends mainly on the quality of the underlying data; therefore, evaluating and improving the quality of RDF data has a direct effect on the correctness of downstream analytics. This work presents multiple approaches related to the extraction and quality evaluation of RDF data. To cope with the large amounts of data that need to be extracted, we present DSTLR, a scalable framework to extract RDF triples from semi-structured and unstructured data sources. For rare entities that fall on the long tail of information, there may not be enough signals to support high-confidence extraction. To address this problem, we present an approach to estimate property values for long-tail entities. We also present multiple algorithms and approaches that focus on the quality of RDF data, including discovering quality constraints from RDF data and utilizing machine learning techniques to repair errors in RDF data.
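    The triple format and SPARQL querying described above can be illustrated with a minimal sketch using rdflib; the graph content is an illustrative assumption and not output of the DSTLR framework.

        # Minimal sketch: RDF triples as a labeled, directed graph, queried with SPARQL.
        from rdflib import Graph, Literal, Namespace, RDF

        EX = Namespace("http://example.org/")
        g = Graph()

        # Each statement is a (subject, predicate, object) triple.
        g.add((EX.alice, RDF.type, EX.Person))
        g.add((EX.alice, EX.worksFor, EX.acme))
        g.add((EX.acme, EX.locatedIn, Literal("Waterloo")))

        # SPARQL query over the graph: in which city is Alice's employer located?
        results = g.query("""
            PREFIX ex: <http://example.org/>
            SELECT ?city WHERE {
                ex:alice ex:worksFor ?org .
                ?org ex:locatedIn ?city .
            }
        """)
        for row in results:
            print(row.city)  # -> Waterloo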