797 research outputs found

    Big Data and Regional Science: Opportunities, Challenges, and Directions for Future Research

    Recent technological, social, and economic trends and transformations are contributing to the production of what is usually referred to as Big Data. Big Data, which is typically defined by four dimensions -- Volume, Velocity, Veracity, and Variety -- changes the methods and tactics for using, analyzing, and interpreting data, requiring new approaches for data provenance, data processing, data analysis and modeling, and knowledge representation. The use and analysis of Big Data involves several distinct stages, from "data acquisition and recording" through "information extraction" and "data integration" to "data modeling and analysis" and "interpretation", each of which introduces challenges that need to be addressed. There are also cross-cutting challenges that underlie many, sometimes all, of the stages of the data analysis pipeline. These relate to "heterogeneity", "uncertainty", "scale", "timeliness", "privacy" and "human interaction". Using the Big Data analysis pipeline as a guiding framework, this paper examines the challenges arising in the use of Big Data in regional science. The paper concludes with some suggestions for future activities to realize the possibilities and potential for Big Data in regional science. Series: Working Papers in Regional Science

    Exploring Unconventional Sources in Big Data: A Data Lifecycle Approach for Social and Economic Analysis with Machine Learning

    This study examines how unconventional Big Data sources can support social and economic analysis. Employing a Data Lifecycle Approach, the research applies linear regression, random forest, and XGBoost techniques to extract insights from such sources. The study follows a structured methodology involving data collection, preprocessing, feature engineering, model selection, and iterative refinement. By applying these techniques to diverse datasets, including social media content, sensor data, and satellite imagery, the study aims to provide a comprehensive understanding of social and economic trends. The results contribute to a better understanding of the intricate relationships within societal and economic systems and highlight the importance of unconventional data sources for decision-makers and researchers alike.

    A MapReduce Algorithm for Finding Hotspots of Topics from Time Stamped Documents

    Hotspots of a word or topic are time periods with a burst of activity in a time-stamped document set. Identifying and analyzing hotspots of topics has been an important area of research. Finding hotspots of topics requires processing the contents of documents, which is often time consuming. In this thesis, we explore MapReduce-style algorithms for computing hotspots of topics. MapReduce is a distributed parallel programming model and an associated implementation for processing and analyzing large datasets. The user specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, and this thesis explores the feasibility of implementing the hotspot algorithm using MapReduce. We design map and reduce functions appropriate for the preprocessing of documents and for the hotspot computation. We implement the functions in Hadoop (a MapReduce framework from the Apache Software Foundation) and conduct several experiments to assess the benefits of a MapReduce-style implementation versus a simple sequential implementation.
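
    The map/reduce contract described in this abstract can be illustrated with a minimal single-process sketch in Python. This is not the thesis's Hadoop implementation: the function names (`map_doc`, `shuffle`, `reduce_counts`, `find_hotspots`), the sample documents, and the burst rule (a period counts as a hotspot when the word's count exceeds `factor` times its mean count per period) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_doc(doc):
    """Map: emit ((word, period), 1) for each word in a (period, text) document."""
    period, text = doc
    return [((word, period), 1) for word in text.lower().split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: merge all values for one (word, period) key into a total."""
    return key, sum(values)


def find_hotspots(docs, word, factor=2.0):
    """Return periods where `word` occurs more than `factor` times its mean count.

    The threshold rule is an illustrative assumption, not the thesis's definition.
    """
    pairs = chain.from_iterable(map_doc(d) for d in docs)
    counts = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
    periods = sorted({period for period, _ in docs})
    if not periods:
        return []
    series = [counts.get((word, p), 0) for p in periods]
    mean = sum(series) / len(series)
    return [p for p, c in zip(periods, series) if c > factor * mean]


# Hypothetical time-stamped documents: (period, text).
docs = [
    ("2020-01", "election news"),
    ("2020-02", "election election election debate election"),
    ("2020-03", "weather election"),
    ("2020-04", "sports"),
]
hotspots = find_hotspots(docs, "election")
```

    In a real MapReduce framework the shuffle step is performed by the runtime across machines; here it is a local dictionary so the data flow between the map and reduce functions stays visible.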

    Improving Academic Natural Language Processing Infrastructures Utilizing Cluster Computation

    In light of widespread digitization endeavors and ever-growing textual data generation, developing efficient academic Natural Language Processing (NLP) infrastructures, which can deal with large amounts of data, is of particular importance. Novel computation technologies allow tools that support big data and heavy computation while performing timely and cost-effective data processing. This development has led researchers to demand that knowledge be extracted from ever-increasing textual data before it is outdated. Cluster computation is a modern technology for handling big data efficiently. It provides distribution of computing and data over a number of machines in a cluster, as well as efficient use of resources, which are key requirements to process big data in a timely manner. It also assures applications’ high availability and fault tolerance, which are fundamental concerns when dealing with vast amounts of data. In addition, it provides load balancing of data during the execution of tasks, which results in optimal use of resources and enhances efficiency. Data-oriented parallelization is an effective solution to enable the currently available academic NLP infrastructures to process big data. This approach offers a solution to parallelize the NLP tools which comprise identical non-complicated tasks without the expense of changing NLP algorithms. This thesis presents the adaption of cluster computation technology to academic NLP infrastructures to address the notable features that are essential to process vast quantities of text materials efficiently, in terms of both resources and time. Apache Spark on top of Apache Hadoop and its ecosystem have been utilized to develop a set of NLP tools that provide a distributed environment to execute the NLP tasks. Many experiments were conducted to assess the functionality of the designated strategy. 
This thesis shows that using cluster computation technology and data-oriented parallelization enables academic NLP infrastructures to process large amounts of textual data in a timely manner while improving the performance of the NLP tools. Moreover, these experiments provide a more realistic and transparent estimate of workflows' costs (required hardware resources) and execution time, along with the fastest, optimum, or feasible resource configuration for each individual workflow. Users can employ this knowledge to trade off run-time, size of data, and hardware, and to design a strategy for data storage, duration of data retention, and delivery time. This has the potential to enhance researchers' satisfaction when using academic NLP infrastructures. The thesis also shows that a cluster computation approach provides the capacity to adapt NLP services to JIT delivery systems. The proposed strategy assures the reliability and predictability of the services, which are the main characteristics of the services in JIT delivery systems. Defining the relevant parameters, recording the behavior of the services, and analyzing the generated data yielded knowledge that can be used to create a service catalog (a fundamental requirement for the services in JIT delivery systems) for each service offered. This knowledge also helps to generate the performance profiles for each item mentioned in the service catalog and to update them continuously to cover new experiments and improve service quality.
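
    The idea of data-oriented parallelization, where the corpus is split and the same unmodified tool runs on each part, can be sketched as follows. This is only an illustration using Python's standard-library thread pool, not the Apache Spark on Hadoop setup the thesis describes; `tokenize` is a hypothetical stand-in for an existing NLP tool, and `partition`, `process_chunk`, and `run_parallel` are names assumed for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Hypothetical stand-in for an existing NLP tool; its algorithm is untouched."""
    return text.lower().split()


def partition(items, n_parts):
    """Split items into at most n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Run the unchanged tool on every document in one partition."""
    return [tokenize(doc) for doc in chunk]


def run_parallel(docs, n_workers=2):
    """Apply the tool to all documents, one partition per task, preserving order."""
    chunks = partition(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    return [tokens for chunk_result in results for tokens in chunk_result]


corpus = ["Big data needs clusters", "Spark runs on Hadoop", "NLP tools stay unchanged"]
parallel_result = run_parallel(corpus)
```

    Only the data is divided; the tool itself is called exactly as in a sequential run, which is why the output matches the sequential result. Threads are used here purely to keep the sketch self-contained; a cluster framework would distribute the partitions across processes and machines.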

    Some Contribution of Statistical Techniques in Big Data: A Review

    Big Data is a popular research topic, and it is widely believed that science, business, industry, government, and society will undergo thorough change under its impact. The term refers to very large data sets that are complex, contain hidden patterns, and mix structured and unstructured data, which makes them difficult to collect, store, and analyze. Advanced techniques are therefore needed to gain knowledge from big data. Big data research faces major challenges in storage, processing, search, sharing, transfer, analysis, and visualization. This paper introduces big data, its issues, its management, and the techniques used to handle it. It also reviews various advanced statistical techniques for key big data applications involving large data sets. These techniques handle both structured and unstructured big data in different areas.

    Monitoring E-commerce Adoption from Online Data

    The purpose of this paper is to propose an intelligent system to automatically monitor the firms' engagement in e-commerce by analyzing online data retrieved from their corporate websites. The design of the proposed system combines web content mining and scraping techniques with learning methods for Big Data. Corporate websites are scraped to extract more than 150 features related to the e-commerce adoption, such as the presence of some keywords or a private area. Then, these features are taken as input by a classification model that includes dimensionality reduction techniques. The system is evaluated with a data set consisting of 426 corporate websites of firms based in France and Spain. The system successfully classified most of the firms into those that adopted e-commerce and those that did not, reaching a classification accuracy of 90.6%. This demonstrates the feasibility of monitoring e-commerce adoption from online data. Moreover, the proposed system represents a cost-effective alternative to surveys as method for collecting e-commerce information from companies, and is capable of providing more frequent information than surveys and avoids the non-response errors. This is the first research work to design and evaluate an intelligent system to automatically detect e-commerce engagement from online data. This proposal opens up the opportunity to monitor e-commerce adoption at a large scale, with highly granular information that otherwise would require every firm to complete a survey. In addition, it makes it possible to track the evolution of this activity in real time, so that governments and institutions could make informed decisions earlier. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness with Grant TIN2013-43913-R, and by the Spanish Ministry of Education with Grant FPU14/02386. Blazquez, D.; Domenech, J.; Gil, JA.; Pont Sanjuan, A. (2018). Monitoring E-commerce Adoption from Online Data. Knowledge and Information Systems. 1-19. https://doi.org/10.1007/s10115-018-1233-7
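
    The feature-extraction step mentioned in this abstract (keyword presence, existence of a private area) might look roughly like the sketch below. It is an assumption-laden illustration, not the paper's system: the keyword list, the feature names, and the use of a password field as a proxy for a "private area" are all hypothetical, and the classification and dimensionality-reduction stages are omitted.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; the paper's 150+ features are not enumerated in the abstract.
KEYWORDS = ("cart", "checkout", "add to basket", "payment")


class PageScanner(HTMLParser):
    """Collect page text and note whether the page contains a password field."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.has_password_input = False

    def handle_starttag(self, tag, attrs):
        # Assumption: a password input is treated as a sign of a private area.
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.has_password_input = True

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_features(html):
    """Turn one page's HTML into a dict of binary features for a classifier."""
    scanner = PageScanner()
    scanner.feed(html)
    text = " ".join(scanner.text_parts).lower()
    features = {f"kw_{kw.replace(' ', '_')}": int(kw in text) for kw in KEYWORDS}
    features["private_area"] = int(scanner.has_password_input)
    return features


# Hypothetical page content for demonstration.
page = (
    '<html><body><a href="/cart">View cart</a>'
    '<form><input type="password" name="pw"></form></body></html>'
)
features = extract_features(page)
```

    Each website would yield one such feature vector, and the vectors for all firms would then be passed to a classification model; which model and which dimensionality reduction technique to use is left open here because the abstract does not specify them.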

    Concepts and Methods from Artificial Intelligence in Modern Information Systems – Contributions to Data-driven Decision-making and Business Processes

    Today, organizations are facing a variety of challenging, technology-driven developments, three of the most notable being the surge in uncertain data, the emergence of unstructured data, and a complex, dynamically changing environment. These developments require organizations to transform in order to stay competitive. Artificial Intelligence, with its fields of decision-making under uncertainty, natural language processing, and planning, offers valuable concepts and methods to address these developments. The dissertation at hand utilizes and furthers these contributions in three focal points to address research gaps in existing literature and to provide concrete concepts and methods that support organizations in the transformation and improvement of data-driven decision-making, business processes, and business process management. In particular, the focal points are the assessment of data quality, the analysis of textual data, and the automated planning of process models. In regard to data quality assessment, probability-based approaches for measuring consistency and identifying duplicates as well as requirements for data quality metrics are suggested. With respect to the analysis of textual data, the dissertation proposes a topic modeling procedure to gain knowledge from CVs as well as a model based on sentiment analysis to explain ratings from customer reviews. Regarding automated planning of process models, concepts and algorithms for an automated construction of parallelizations in process models, an automated adaptation of process models, and an automated construction of multi-actor process models are provided.