594 research outputs found

    Improved CHAID Algorithm for Document Structure Modelling

    This paper proposes a technique for the logical labelling of document images. It uses a decision-tree-based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR system provides the physical features needed by the system. Each block of text is extracted during layout analysis, and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules, extracted from logical layout knowledge, to support the decision tree. Two setups have been tested: the first uses one tree per logical element; the second uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%

    Can Passive Mobile Application Traffic be Identified using Machine Learning Techniques

    Mobile phone applications (apps) can generate background traffic when the end-user is not actively using the app. If this background traffic could be accurately identified, network operators could de-prioritise it and free up bandwidth for priority network traffic. The background app traffic should have IP packet features that a machine learning algorithm could use to distinguish app-generated (passive) traffic from user-generated (active) traffic. Previous research in IP traffic classification focused on classifying high-level network traffic types originating on a PC. This research was concerned with classifying low-level app traffic originating on a mobile phone. An innovative experimental setup was designed to answer the research question. A mobile phone running Android OS was configured to capture app network data. Three specific data trace procedures were then designed to comprehensively capture sample active and passive app traffic data. Previous research recommends computing new features from IP packet data. This research proposes a different approach: feature generation was enabled by exposing inherent IP packet attributes rather than computing new features. Specific evaluation metrics were also designed to quantify the accuracy of the machine learning models at classifying active and passive app traffic. Three decision tree models were implemented: C5.0, C&R tree and CHAID tree. Each model was built using a standard implementation and with boosting. The findings indicate that passive app network traffic can be classified with an accuracy of up to 84.8% using a CHAID decision tree algorithm with model boosting enabled. The findings also suggest that features derived from the inherent IP packet attributes, such as time frame delta and bytes in flight, had significant predictive value

    European tourist perspective on destination satisfaction: a business analytics approach

    For many years, tourism information has been collected and stored, attracting growing interest from the data mining (DM) field. This creates a need for research into new patterns and for automated procedures that improve tourism knowledge management. The relationship between tourist characteristics and preferences and tourist satisfaction has never been studied extensively in a way that provides useful knowledge to the tourism industry. There was therefore a need to investigate the explanatory factors of tourist satisfaction with the destination, so that tourism companies can make correct assumptions about a given trip. This dissertation used data from Flash Eurobarometer 414 “Preferences of Europeans towards tourism 2015”, covering the 28 countries of the European Union (EU). A predictive model of tourist satisfaction was obtained by discovering existing patterns in the tourist's travel process, using DM techniques on these data. The resulting explanatory model yielded useful knowledge for tourism agencies, enabling them to develop marketing strategies according to the tourist profile and promotional messages for products and experiences, and ensuring that correct assumptions are made about their customers

    The distributed assembly permutation flowshop scheduling problem

    Nowadays, improving the management of complex supply chains is key to becoming competitive in the twenty-first century global market. Supply chains are composed of multi-plant facilities that must be coordinated and synchronised to cut waste and lead times. This paper proposes a Distributed Assembly Permutation Flowshop Scheduling Problem (DAPFSP) with two stages to model and study complex supply chains. This problem is a generalisation of the Distributed Permutation Flowshop Scheduling Problem (DPFSP). The first stage of the DAPFSP is composed of f identical production factories. Each one is a flowshop that produces jobs to be assembled into final products in a second assembly stage. The objective is to minimise the makespan. We first present a Mixed Integer Linear Programming (MILP) model. Three constructive algorithms are proposed. Finally, a Variable Neighbourhood Descent (VND) algorithm has been designed and tested by a comprehensive ANOVA statistical analysis. The results show that the VND algorithm offers good performance to solve this scheduling problem.
    Ruben Ruiz is partially supported by the Spanish Ministry of Science and Innovation, under the project 'RESULT - Realistic Extended Scheduling Using Light Techniques' with reference DPI2012-36243-C02-01. Carlos Andres-Romano is partially supported by the Spanish Ministry of Science and Innovation, under the project 'INSAMBLE - Scheduling at assembly/disassembly synchronised supply chains' with reference DPI2011-27633.
    Hatami, S.; Ruiz García, R.; Andrés Romano, C. (2013). The distributed assembly permutation flowshop scheduling problem. International Journal of Production Research. 51(17):5292-5308. https://doi.org/10.1080/00207543.2013.807955
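The two-stage objective described in the DAPFSP abstract can be illustrated with a small makespan evaluator. This is only a sketch under stated assumptions, not the authors' MILP model or VND algorithm: it assumes a single assembly machine in the second stage, that a product can be assembled once all of its jobs have left their factories, and that the job-to-factory assignment, job sequences, and product order are already given. All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def flowshop_completion_times(sequence: Sequence[int],
                              proc: Dict[int, List[int]]) -> Dict[int, int]:
    """Completion time of each job on the last machine of one permutation flowshop."""
    if not sequence:
        return {}
    n_machines = len(proc[sequence[0]])
    prev = [0] * n_machines  # completion times of the previous job on each machine
    done: Dict[int, int] = {}
    for job in sequence:
        cur = [0] * n_machines
        for m in range(n_machines):
            # A job starts on machine m once the machine is free and the job
            # has finished on machine m - 1.
            ready = cur[m - 1] if m > 0 else 0
            cur[m] = max(prev[m], ready) + proc[job][m]
        done[job] = cur[-1]
        prev = cur
    return done


def dapfsp_makespan(factory_sequences: Sequence[Sequence[int]],
                    proc: Dict[int, List[int]],
                    product_jobs: Dict[str, List[int]],
                    assembly_time: Dict[str, int],
                    product_order: Sequence[str]) -> int:
    """Makespan: completion of the last product on the (assumed single) assembly machine."""
    job_done: Dict[int, int] = {}
    for seq in factory_sequences:  # identical factories, evaluated independently
        job_done.update(flowshop_completion_times(seq, proc))
    t = 0  # time at which the assembly machine becomes free
    for product in product_order:
        ready = max(job_done[j] for j in product_jobs[product])
        t = max(t, ready) + assembly_time[product]
    return t


# Toy instance (made-up numbers): 4 jobs, 2 machines per factory, 2 factories, 2 products.
proc = {1: [2, 3], 2: [4, 1], 3: [3, 2], 4: [1, 4]}
factory_sequences = [[1, 2], [3, 4]]
product_jobs = {"A": [1, 3], "B": [2, 4]}
assembly_time = {"A": 3, "B": 2}
makespan = dapfsp_makespan(factory_sequences, proc, product_jobs,
                           assembly_time, ["A", "B"])
print(makespan)
```

A constructive heuristic or local search would call an evaluator like this repeatedly while changing the factory assignment, the job sequences, or the product order.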

    Futures Studies in the Interactive Society

    This book consists of papers which were prepared within the framework of the research project (No. T 048539) entitled Futures Studies in the Interactive Society (project leader: Éva Hideg) and funded by the Hungarian Scientific Research Fund (OTKA) between 2005 and 2009. Some discuss the theoretical and methodological questions of futures studies and foresight; others present new approaches to or procedures of certain questions which are very important and topical from the perspective of forecast and foresight practice. Each study was conducted in pursuit of improvement in futures fields

    What factors influence whether politicians’ tweets are retweeted? Using CHAID to build an explanatory model of the retweeting of politicians’ tweets during the 2015 UK General Election campaign

    Twitter is ever-present in British political life and many politicians use it as part of their campaign strategies. However, little is known about whether their tweets engage people, for example by being retweeted. This research addresses that gap, examining tweets sent by MPs during the 2015 UK General Election campaign to identify which were retweeted and why. A conceptual model proposes three factors which are most likely to influence retweets: the characteristics of (1) the tweet’s sender, (2) the tweet and (3) its recipients. This research focuses on the first two of these. Content and sentiment analysis are used to develop a typology of the politicians’ tweets, followed by CHAID analysis to identify the factors that best predict which tweets are retweeted. The research shows that the characteristics of a tweet and its sender do influence whether the tweet is retweeted. Of the sender’s characteristics, number of followers is the most important: more followers lead to more retweets. Of the tweet characteristics, the tweet’s sentiment is the most influential. Negative tweets are retweeted more than positive or neutral tweets. Tweets attacking opponents or using fear appeals are also highly likely to be retweeted. The research makes a methodological contribution by demonstrating how CHAID models can be used to accurately predict retweets. This method has not been used to predict retweets before and has broad application to other contexts. The research also contributes to our understanding of how politicians and the public interact on Twitter, an area little studied to date, and proposes some practical recommendations regarding how MPs can improve the effectiveness of their Twitter campaigning. The finding that negative tweets are more likely to be retweeted also contributes to the ongoing debate regarding whether people are more likely to pass on positive or negative information online
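The core idea behind CHAID's choice of splitting factor can be sketched in a few lines: cross-tabulate each categorical predictor against the outcome, run a chi-squared test of independence, and split on the predictor with the smallest p-value. The sketch below is a simplification and not the study's model: full CHAID also merges similar categories and applies a Bonferroni adjustment, both omitted here. The function names and the toy tweet data are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def chi2_sf(x: float, dof: int) -> float:
    """Upper-tail probability of a chi-squared variable with integer dof >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed forms Q(1, y) = exp(-y) and Q(1/2, y) = erfc(sqrt(y)),
    # then apply Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < dof / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_test(xs: Sequence[Hashable],
              ys: Sequence[Hashable]) -> Tuple[float, int, float]:
    """Pearson chi-squared statistic, degrees of freedom and p-value for xs vs ys."""
    n = len(xs)
    cells, rows, cols = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    dof = (len(rows) - 1) * (len(cols) - 1)
    if dof == 0:  # a constant variable carries no information
        return 0.0, 0, 1.0
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            stat += (cells[(r, c)] - expected) ** 2 / expected
    return stat, dof, chi2_sf(stat, dof)


def best_split(predictors: Dict[str, Sequence[Hashable]],
               target: Sequence[Hashable]) -> str:
    """Name of the predictor with the smallest chi-squared p-value."""
    return min(predictors, key=lambda name: chi2_test(predictors[name], target)[2])


# Toy data (made up): eight tweets, two candidate predictors.
retweeted = [1, 1, 1, 1, 0, 0, 0, 0]
predictors = {
    "sentiment": ["neg", "neg", "neg", "neg", "pos", "pos", "neu", "neu"],
    "has_link": [1, 0, 1, 0, 1, 0, 1, 0],
}
print(best_split(predictors, retweeted))
```

A tree is then grown by repeating the same selection inside each branch until no predictor passes a chosen significance threshold.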

    Tonometry: a study in biomechanical modelling. Appraisal and utility of measurable biomechanical markers.

    Goldmann Applanation Tonometry (GAT) is the recognised ‘Gold Standard’ tonometer. However, this status is refuted by eminent authors. These contradictory views have driven the initial goal to assess, from first principles, the evolution of GAT and to experimentally evaluate its utility and corrections. Subsequently, an important caveat became the evaluation of Corneal Hysteresis and Corneal Resistance Factor. Chapter 1. Biomechanical building blocks are defined and constitutive principles incorporated into continuum modelling. The Imbert-Fick construct is re-interpreted as a simple biomechanical model. GAT corrections are also appraised within a continuum framework: CCT, geometry and stiffness. These principles enable evaluation of alternative tonometer theory and the evolving biomechanical markers, Corneal Hysteresis (ORA-CH) and Corneal Resistance Factor (ORA-CRF). Chapter 2 appraises corneal biomechanical markers, CCT, curvature, ORA-CH and ORA-CRF in 91 normal eyes and the impact these have on three tonometers: GAT, Tonopen and Ocular Response Analyser (ORA). Tonopen was the sole tonometer not affected by biomechanics. CCT was confirmed the sole measurable parameter affecting GAT. ORA did not demonstrate improved utility. ORA-CH and ORA-CRF do not appear robust biomechanical measures. Chapter 3 assessed agreement between GAT, the ORA measures and Tonopen. Tonopen is found to measure highest and raises the question: should a development goal emphasise GAT agreement or improvement? Chapter 4 assessed repeatability of the three tonometers and biomechanical measures keratometry, pachymetry, ORA-CH and ORA-CRF on 35 eyes. Coefficients of Repeatability (CoR) of all tonometers are wide. Effects assessed in Chapter 5 may be masked by general noise. ORA does not appear to enhance utility over GAT. Isolation of corneal shape change via Orthokeratology (Chapter 5) demonstrates that ORA-CH and ORA-CRF reflect, predominantly, a response to corneal flattening. It is proposed they do not significantly reflect corneal biomechanics. After reviewing models for tear forces (Chapter 6), a refined mathematical model is presented. Tear bridge attraction is minimal and cannot explain under-estimation of IOP by GAT in thin corneas. CCT corrections and the Imbert-Fick rules are incompatible. Chapter 7 summarises findings. The supremacy of GAT is likely to remain for some time, reflecting the sheer magnitude of overturning 60 years of convention, historical precedent, expert opinion as well as the logistical and educational difficulties of redefining standards and statistical norms

    Spatial prediction of flood susceptible areas using machine learning approach: a focus on west african region

    Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies. The constant change in the environment due to increasing urbanization and climate change has led to recurrent floods with a devastating impact on lives and property. It is therefore essential to identify the factors that drive flood occurrence and the locations prone to flooding. This can be achieved through Flood Susceptibility Modelling (FSM) using stand-alone and hybrid machine learning models, to obtain accurate and sustainable results that can instigate mitigation measures and flood risk control. In this research, novel hybridizations of Index of Entropy (IOE) with Decision Tree (DT), Support Vector Machine (SVM) and Random Forest (RF) were built, the same models were also run stand-alone, and the results of each model were compared. First, feature selection and multi-collinearity analysis were performed to identify the predictive ability of the factors and the inter-relationships among them. Subsequently, IOE was applied as a bivariate and multivariate statistical analysis to assess the correlation of each flood-influencing factor's classes with flooding and the overall influence (weight) of each factor on flooding. The weights generated were then used in training the machine learning models. The performance of the proposed models was assessed using the Area Under Curve (AUC) and statistical metrics. The results reveal that the DT-IOE hybrid model had the highest prediction accuracy, 87.1%, while the stand-alone DT had the lowest, 77.0%. Among the other models, the proposed hybrids of machine learning and statistical models outperformed the stand-alone models, which reflects the detailed assessment performed by the hybrid models. The final susceptibility maps revealed that about 21% of the study area is highly prone to flooding and that human-induced factors have a large influence on flooding in the region
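The flood-susceptibility abstract reports model quality via the Area Under Curve (AUC). As a rough illustration of what that metric measures, the sketch below computes the ROC AUC from ranks (the Mann-Whitney formulation): the probability that a randomly chosen flooded location receives a higher susceptibility score than a randomly chosen non-flooded one. The function name, labels and scores are hypothetical and are not taken from the dissertation.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via average ranks; labels are 1 (flooded) or 0 (not flooded)."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    # Subtract the smallest possible rank sum of the positives, then normalise.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example (made-up values): four locations with model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of flooded from non-flooded locations.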