343 research outputs found

    Dimensionality Reduction via Matrix Factorization for Predictive Modeling from Large, Sparse Behavioral Data

    Get PDF
    Matrix factorization is a popular technique for engineering features for use in predictive models; it is viewed as a key part of the predictive analytics process and is used in many different domain areas. The purpose of this paper is to investigate matrix-factorization-based dimensionality reduction as a design artifact in predictive analytics. With the rise in availability of large amounts of sparse behavioral data, this investigation comes at a time when traditional techniques must be reevaluated. Our contribution is based on two lines of inquiry: we survey the literature on dimensionality reduction in predictive analytics, and we undertake an experimental evaluation comparing predictive modeling with and without dimensionality reduction on large, sparse behavioral data. Our survey of the dimensionality reduction literature reveals that, despite mixed empirical evidence as to the benefit of computing dimensionality reduction, it is frequently applied in predictive modeling research and application without either comparing to a model built using the full feature set or utilizing state-of-the-art predictive modeling techniques for complexity control. This presents a concern, as the survey reveals complexity control as one of the main reasons for employing dimensionality reduction. This lack of comparison is troubling in light of our empirical results. We experimentally evaluate the efficacy of dimensionality reduction in the context of a collection of predictive modeling problems from a large-scale published study. We find that utilizing dimensionality reduction improves predictive performance only under certain, rather narrow, conditions. Specifically, under default regularization (complexity control) settings, dimensionality reduction helps for the more difficult predictive problems (where the predictive performance of a model built using the original feature set is relatively lower), but it actually decreases the performance on the easier problems. More surprisingly, employing state-of-the-art methods for selecting regularization parameters actually eliminates any advantage that dimensionality reduction has! Since the value of building accurate predictive models for business analytics applications has been well established, the resulting guidelines for the application of dimensionality reduction should lead to better research and managerial decisions. NYU Stern School of Business
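    As a rough illustration of the comparison described above, the sketch below (synthetic data and scikit-learn; not the paper's actual pipeline or dataset) pits a regularized linear model trained on the full sparse feature set against the same model trained on SVD-reduced features, with the regularization strength selected by cross-validation in both cases.

# Illustrative sketch only: compare predictive performance with and without
# matrix-factorization-style dimensionality reduction on sparse synthetic data.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = sparse_random(5000, 2000, density=0.01, random_state=rng, format="csr")
w = rng.randn(2000)
y = (X @ w + 0.5 * rng.randn(5000) > 0).astype(int)   # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Model 1: full sparse feature set, L2 regularization chosen by cross-validation.
full = LogisticRegressionCV(Cs=10, cv=5, max_iter=2000).fit(X_tr, y_tr)

# Model 2: SVD-reduced features (an analogue of matrix-factorization features),
# followed by the same cross-validated classifier.
svd = TruncatedSVD(n_components=100, random_state=0).fit(X_tr)
reduced = LogisticRegressionCV(Cs=10, cv=5, max_iter=2000).fit(svd.transform(X_tr), y_tr)

print("AUC, full features:", roc_auc_score(y_te, full.predict_proba(X_te)[:, 1]))
print("AUC, SVD-reduced  :", roc_auc_score(y_te, reduced.predict_proba(svd.transform(X_te))[:, 1]))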

    Detecting user demographics in twitter to inform health trends in social media

    Get PDF
    The widespread and popular use of social media and social networking applications offers a promising opportunity for gaining knowledge and insights regarding population health conditions, thanks to the diversity and abundance of online user-generated health information (UGHI) relating to healthcare and well-being. However, users on social media and social networking sites often do not supply their complete demographic information, which greatly undermines the value of the aforementioned information for health 2.0 research, e.g., for discerning disparities across population groups in certain health conditions. To recover the missing user demographic information, existing methods observe a limited scope of user behaviors, such as word frequencies exhibited in a user's messages, leading to sub-optimal results. To address the above limitation and improve the performance of inferring missing user demographic information for health 2.0 research, this work proposes a new algorithmic method for extracting a social media user's gender by exploring and exploiting a comprehensive set of the user's behaviors on Twitter, including the user's conversational topic choices, account profile information, and personal information. In addition, this work explores the use of synonym expansion for detecting social media users' ethnicities. To better capture a user's conversational topic choices using standardized hashtags for consistent comparison, this work additionally introduces a new method that automatically generates standardized hashtags for tweets. Even though Twitter is selected as the experimental platform in this study due to its leading position among today's social networking sites, the proposed method is in principle generically applicable to other social media sites and applications as long as there is a way to access user-generated content on those platforms. When comparing the multi-perspective learning method with state-of-the-art approaches for gender classification, the proposed approach achieves a gender classification accuracy of 88.6%, compared with 63.4% for bag-of-words and 61.4% for the peer method. Additionally, the topical approach introduced in this work outperforms the vocabulary-based approach with a smaller dimensionality, at 69.4% accuracy. Furthermore, observable usage patterns of cancer terms are analyzed across the ethnic groups inferred by the proposed algorithmic approaches. Variations among demographic groups are seen in the frequency of term usage during months designated as cancer awareness months. This work introduces methods that have the potential to serve as a powerful tool in disseminating critical prevention, screening, and treatment messages to the community in real time. Study findings highlight the potential benefits of social media as a tool for detecting demographic differences in cancer-related discussions on social media.
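    The multi-perspective idea (combining conversational topics and profile information rather than relying on word counts alone) can be sketched as follows. The toy rows, the profile field used, and the classifiers are hypothetical stand-ins, not the paper's dataset, feature set, or models.

# Illustrative sketch (made-up data): a bag-of-words baseline versus a classifier
# that also uses a profile field, combined with scikit-learn's ColumnTransformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy rows: tweet text, a profile "name" field, and a gender label (all invented).
df = pd.DataFrame({
    "tweets": ["loving the new stadium #football", "baking cookies with the kids",
               "game night with the squad", "new nail art tutorial is up"],
    "profile_name": ["mike", "sarah", "james", "emily"],
    "gender": ["m", "f", "m", "f"],
})

# Baseline: bag-of-words over tweet text only.
baseline = Pipeline([("bow", CountVectorizer()), ("clf", LogisticRegression())])
baseline.fit(df["tweets"], df["gender"])

# Multi-perspective variant: tweet topics (TF-IDF) plus profile-name character n-grams.
features = ColumnTransformer([
    ("topics", TfidfVectorizer(), "tweets"),
    ("name", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)), "profile_name"),
])
multi = Pipeline([("features", features), ("clf", LogisticRegression())])
multi.fit(df[["tweets", "profile_name"]], df["gender"])

print(multi.predict(pd.DataFrame({"tweets": ["match day #football"], "profile_name": ["tom"]})))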

    Inferential Modeling and Independent Component Analysis for Redundant Sensor Validation

    Get PDF
    The calibration of redundant safety-critical sensors in nuclear power plants is a manual task that consumes valuable time and resources. Automated, data-driven techniques to monitor the calibration of redundant sensors have been developed over the last two decades, but have not been fully implemented. Parity space methods such as the Instrumentation and Calibration Monitoring Program (ICMP) method developed by the Electric Power Research Institute, as well as other empirically based inferential modeling techniques, have been developed but have not become viable options. Existing solutions to the redundant sensor validation problem have several major flaws that restrict their applications. Parity space methods, such as ICMP, are not robust under low redundancy conditions, and their operation becomes invalid when there are only two redundant sensors. Empirically based inferential modeling is only valid when the intrinsic correlations between predictor variables and response variables remain static during the model training and testing phases. These methods also commonly produce high-variance results and are not the optimal solution to the problem. This dissertation develops and implements independent component analysis (ICA) for redundant sensor validation. The ICA algorithm produces parameter estimates with sufficiently low residual variance when compared to simple averaging, ICMP, and principal component regression (PCR) techniques. For stationary signals, it can detect and isolate sensor drifts for as few as two redundant sensors. It is fast and can be embedded into a real-time system; this is demonstrated on a water level control system. Additionally, ICA has been merged with inferential modeling techniques such as PCR to reduce the prediction error and spillover effects from data anomalies. ICA is easy to use, with only the window size needing specification. The effectiveness and robustness of the ICA technique are shown through the use of actual nuclear power plant data. A bootstrap technique is used to estimate the prediction uncertainties and validate its usefulness. Bootstrap uncertainty estimates incorporate uncertainties from both the data and the model; thus, the uncertainty estimation is robust and varies from data set to data set. The ICA-based system is proven to be accurate and robust; however, classical ICA algorithms commonly fail when distributions are multi-modal, which most likely occurs during highly non-stationary transients. This research also developed a unity check technique that indicates such failures and applies other, more robust techniques during transients. For linearly trending signals, a rotation transform is found useful, while standard averaging techniques are used during general transients.
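    A minimal sketch of the core idea, using scikit-learn's FastICA on synthetic data rather than plant data or the dissertation's own implementation, is shown below: one of three redundant channels slowly drifts, and the mixing-matrix loadings point to the suspect channel.

# Illustrative sketch only: ICA over redundant sensor channels, one of which drifts.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
n = 5000
t = np.arange(n)
process = 50 + 0.5 * np.sin(t / 200.0)               # common process variable

sensors = np.column_stack([process + 0.05 * rng.randn(n) for _ in range(3)])
sensors[:, 2] += 0.001 * t                           # channel 2 drifts ~5 units over the window

ica = FastICA(n_components=2, random_state=0)
sources = ica.fit_transform(sensors)                 # estimated independent components
loadings = np.abs(ica.mixing_)                       # shape (3 channels, 2 components)

# A component shared by all channels has roughly uniform loadings, whereas a
# drift component concentrates its loading on the faulty channel.
concentration = loadings.max(axis=0) / loadings.mean(axis=0)
drift_component = int(np.argmax(concentration))
suspect_channel = int(np.argmax(loadings[:, drift_component]))
print("loadings:\n", np.round(loadings, 3))
print("suspected drifting channel:", suspect_channel)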

    New Approach for Market Intelligence Using Artificial and Computational Intelligence

    Get PDF
    Small and medium sized retailers are central to the private sector and a vital contributor to economic growth, but they often face enormous challenges in unleashing their full potential. Financial pitfalls, lack of adequate access to markets, and difficulties in exploiting technology have prevented them from achieving optimal productivity. Market Intelligence (MI) is the knowledge extracted from numerous internal and external data sources, aimed at providing a holistic view of the state of the market and at influencing marketing-related decision-making processes in real time. A related, burgeoning phenomenon and crucial topic in the field of marketing is Artificial Intelligence (AI), which entails fundamental changes to the skill sets marketers require. A vast amount of knowledge is stored in retailers' point-of-sales databases, but the format of this data often makes the knowledge it stores hard to access and identify. As a powerful AI technique, Association Rules Mining helps to identify frequently associated patterns stored in large databases to predict customers' shopping journeys. Consequently, the method has emerged as a key driver of cross-selling and upselling in the retail industry. At the core of this approach is Market Basket Analysis, which captures knowledge from heterogeneous customer shopping patterns and examines the effects of marketing initiatives. Apriori, which enumerates frequent itemsets purchased together (as market baskets), is the central algorithm in the analysis process. Problems occur because Apriori lacks computational speed and has weaknesses in providing intelligent decision support: as the number of simultaneous database scans grows, the computational cost increases and performance decreases dramatically. Moreover, there are gaps in decision support, especially in methods for finding rarely occurring events and for identifying trending brand popularity before it peaks. As the objective of this research is to find intelligent ways to help small and medium sized retailers grow with an MI strategy, we demonstrate the effects of AI with algorithms for data preprocessing, market segmentation, and market trend detection. Using the sales database of a small, local retailer, we show how our Åbo algorithm increases mining performance and intelligence, and how it helps to extract valuable marketing insights to assess demand dynamics and product popularity trends. We also show how this results in commercial advantage and a tangible return on investment. Additionally, an enhanced normal distribution method assists data pre-processing and helps to explore different types of potential anomalies.
    Small and medium sized retailers are central actors in the private sector and contribute strongly to economic growth, but they often face enormous challenges in reaching their full potential. Financial difficulties, lack of market access, and difficulties in exploiting technology have often prevented them from achieving optimal productivity. Market Intelligence (MI) consists of knowledge gathered from various internal and external data sources and aims to offer a holistic view of the market situation and to enable decision-making in real time. A related and growing phenomenon, and an important theme in marketing, is Artificial Intelligence (AI), which places new demands on marketers' skills. Enormous amounts of knowledge are stored in databases of transactions collected from retailers' points of sale. Yet the format of these data is often such that the knowledge is not easy to access and exploit. As an AI tool, affinity analysis offers an effective technique for identifying repeated patterns as statistical associations in data stored in large sales databases. The discovered patterns can then be used as rules that predict customers' purchasing behavior. In retail, affinity analysis has become a key factor behind cross-selling and upselling. The central method in this process is Market Basket Analysis, which captures knowledge from the heterogeneous purchasing behaviors in the data and helps to assess how effective marketing plans are. Apriori, which enumerates the frequently occurring product combinations purchased together (the market basket), is the central algorithm in the analysis process. Nevertheless, Apriori has shortcomings as an algorithm in terms of low computational speed and weak intelligence. As the number of parallel database scans rises, the computational cost also increases, which has negative effects on performance. In addition, there are gaps in decision support, especially regarding methods for finding rarely occurring product combinations and for identifying growing brand popularity from trend data and exploiting it before it peaks. Since the goal of this research is to help small and medium sized retailers grow with the help of MI strategies, the effects of AI are demonstrated with algorithms for data preparation, market segmentation, and trend analysis. Using sales data from a small, local retailer, we show how the Åbo algorithm increases performance and intelligence in the data mining process and helps to reveal valuable marketing insights, above all regarding demand dynamics and trends in product popularity. We further show how this results in commercial advantages and a concrete return on investment. In addition, the enhanced normal distribution method assists in data preparation and in finding different kinds of anomalies.
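    Since the abstract centers on Apriori-style frequent itemset mining, a minimal pure-Python sketch on toy baskets is included below. It illustrates the general level-wise search and rule generation; it is not the Åbo algorithm or the retailer's data described above.

# Minimal Apriori sketch on toy baskets (illustrative only).
from itertools import combinations

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk", "coffee"},
    {"bread", "butter", "coffee"},
]
min_support = 0.4   # fraction of baskets an itemset must appear in
min_conf = 0.6      # minimum rule confidence

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

# Level-wise search: extend frequent k-itemsets to (k+1)-itemsets.
items = sorted({i for b in baskets for i in b})
frequent = {frozenset([i]): support({i}) for i in items if support({i}) >= min_support}
current = list(frequent)
while current:
    candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == len(a) + 1}
    current = []
    for c in candidates:
        s = support(c)
        if s >= min_support and c not in frequent:
            frequent[c] = s
            current.append(c)

# Derive rules X -> Y with confidence = support(X and Y together) / support(X).
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, k)):
            conf = s / frequent.get(lhs, support(lhs))
            if conf >= min_conf:
                print(f"{set(lhs)} -> {set(itemset - lhs)}  support={s:.2f} conf={conf:.2f}")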

    Crowdsourcing for Engineering Design: Objective Evaluations and Subjective Preferences

    Full text link
    Crowdsourcing enables designers to reach out to large numbers of people who may not have been previously considered when designing a new product and to listen to their input by aggregating their preferences and evaluations over potential designs, with the aim of reinforcing "good" and catching "bad" design decisions during the early-stage design process. This approach puts human designers (be they industrial designers, engineers, marketers, or executives) at the forefront, with computational crowdsourcing systems on the backend to aggregate subjective preferences (e.g., which next-generation Brand A design best competes stylistically with next-generation Brand B designs?) or objective evaluations (e.g., which military vehicle design has the best situational awareness?). These crowdsourcing aggregation systems are built using probabilistic approaches that account for the irrationality of human behavior (i.e., violations of reflexivity, symmetry, and transitivity), approximated by modern machine learning algorithms and optimization techniques as necessitated by the scale of the data (millions of data points, hundreds of thousands of dimensions). This dissertation presents research findings suggesting that current off-the-shelf crowdsourcing aggregation algorithms are unsuitable for real engineering design tasks due to the sparsity of expertise in the crowd, along with methods that mitigate this limitation by incorporating appropriate information for expertise prediction. Next, we introduce and interpret a number of new probabilistic models for crowdsourced design to provide large-scale preference prediction and full design space generation, building on statistical and machine learning techniques such as sampling methods, variational inference, and deep representation learning. Finally, we show how these models and algorithms can advance crowdsourcing systems by abstracting the appropriate yet unwieldy underlying mathematics behind easier-to-use visual interfaces practical for engineering design companies and governmental agencies engaged in complex engineering systems design. PhD, Design Science, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/133438/1/aburnap_1.pd
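    As a generic illustration of probabilistic preference aggregation, the sketch below fits a Bradley-Terry-style pairwise comparison model by gradient ascent on made-up votes over four candidate designs. This is a common building block for such systems, not the dissertation's own models, and the vote data are invented.

# Generic Bradley-Terry-style aggregation of pairwise design preferences (toy data).
import numpy as np

n_designs = 4
# Each tuple (i, j) means a crowd member preferred design i over design j.
votes = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 1), (0, 2), (2, 3), (0, 1)]

scores = np.zeros(n_designs)           # latent "quality" score per design
lr = 0.1
for _ in range(500):                   # simple gradient ascent on the log-likelihood
    grad = np.zeros(n_designs)
    for winner, loser in votes:
        p_win = 1.0 / (1.0 + np.exp(scores[loser] - scores[winner]))
        grad[winner] += 1.0 - p_win    # d log P / d score_winner
        grad[loser] -= 1.0 - p_win     # d log P / d score_loser
    scores += lr * (grad - 0.01 * scores)   # small L2 prior keeps undefeated designs finite
    scores -= scores.mean()                 # remove the translation invariance

print("inferred scores:", np.round(scores, 2))
print("crowd ranking (best first):", np.argsort(-scores).tolist())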

    Building a Strong Undergraduate Research Culture in African Universities

    Get PDF
    Africa had a late start in the race to set up universities with research-quality fundamentals. According to Mamdani [5], the first colonial universities were few and far between: Makerere in East Africa, Ibadan and Legon in West Africa. This last place in the race, compared to other continents, has had tremendous implications for the continent's development plans. For Africa, the race has been difficult, from a late start to an insurmountable litany of problems that include difficulty in equipment acquisition, lack of capacity, limited research and development resources, and lack of investment in local universities. In fact, most of these universities are very recent, with all but a few less than 50 years old. To reduce the labor costs of shipping Europeans to Africa to do mere clerical jobs, the colonial masters started training "workshops," calling them technical or business colleges. According to Mamdani, meeting colonial needs was to be achieved while avoiding the "Indian disease" in Africa -- that is, the development of an educated middle class, a group most likely to carry the virus of nationalism. Upon independence, most of these "workshops" were turned into national "universities," but with no clear role in national development. These national "universities" catered to the children of the new African political elites. Through the seventies and eighties, most African universities were still without development agendas and were still doing business as usual. Meanwhile, governments strapped for money saw no need to put more scarce resources into big white elephants. By the mid-eighties, even the UN and IMF were calling for a limit on funding African universities. In today's African university, the traditional curiosity-driven research model has been replaced by a market-driven model dominated by a consultancy culture (Mamdani, Mail and Guardian Online). The prevailing research culture has reduced intellectual life in universities to bare-bones classroom activity; seminars and workshops have migrated to hotels, and workshop attendance comes with transport allowances and per diems (Mamdani, Mail and Guardian Online). There is a need to remedy this situation, and that is the focus of this paper.

    Workshops of the Sixth International Brain–Computer Interface Meeting: brain–computer interfaces past, present, and future

    Get PDF
    Brain–computer interfaces (BCIs), also referred to as brain–machine interfaces (BMIs), are, by definition, interfaces between the human brain and a technological application. Brain activity for interpretation by the BCI can be acquired with either invasive or non-invasive methods. The key point is that the signals that are interpreted come directly from the brain, bypassing sensorimotor output channels that may or may not have impaired function. This paper provides a concise glimpse of the breadth of BCI research and development topics covered by the workshops of the 6th International Brain–Computer Interface Meeting.

    Research and Development of a General Purpose Instrument DAQ-Monitoring Platform applied to the CLOUD/CERN experiment

    Get PDF
    The current scientific environment has experimentalists and system administrators allocating large amounts of time to data access, parsing, and gathering, as well as to instrument management. This is a growing challenge, since there is an increasing number of large collaborations with significant amounts of instrument resources, remote instrumentation sites, and continuously improved and upgraded scientific instruments. DAQBroker is a new software platform designed to monitor networks of scientific instruments while also providing simple data access methods for any user. Data can be stored in one or several local or remote databases running on any of the most popular relational database engines (MySQL, PostgreSQL, Oracle). It also provides the necessary tools for creating and editing the metadata associated with different instruments, performing data manipulation, and generating events based on instrument measurements, regardless of the user's knowledge of individual instruments. Time series stored in a DAQBroker database also benefit from several statistical methods for time series classification, comparison, and event detection, as well as multivariate time series analysis methods that determine the most statistically relevant time series, rank the most influential time series, and identify the periods of greatest activity within specific experimental periods. This thesis presents the architecture behind the framework, assesses its performance under controlled conditions, and presents a use case from the CLOUD experiment at CERN, Switzerland. The univariate and multivariate time series statistical methods applied to this framework are also studied.
    The modern scientific research process requires both experimentalists and system administrators to devote a significant part of their time to devising strategies for accessing, storing, and manipulating scientific instruments and the data they produce. This is a growing challenge considering the increase in collaborations that require multiple instruments, research in remote areas, and constantly changing scientific instruments. DAQBroker is a new platform designed for monitoring scientific instruments while providing simple methods for any user to access their data. Data can be stored in one or several local or remote databases using the most common database management systems (MySQL, PostgreSQL, Oracle). The platform also provides the tools needed to create and edit virtual versions of scientific instruments and to manipulate the data collected from instruments, regardless of the user's degree of familiarity with the instrument(s) used. Time series stored in a DAQBroker database benefit from a set of statistical methods for classification, comparison, and event detection, as well as for determining the most influential series and the experimental sub-periods with the greatest activity. This thesis presents the platform's architecture, the results of several stress tests carried out in controlled environments, and a real case of its use in the CLOUD experiment at CERN, Switzerland. The univariate and multivariate time series analysis methods applied in the platform are also studied.
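    The workflow described above (instrument readings stored in a relational database, plus statistical event detection on the stored series) can be sketched roughly as follows. SQLite stands in for the supported database engines, the instrument name is made up, and the rolling z-score check is a generic technique, not DAQBroker's actual schema or API.

# Rough sketch only: store instrument readings in a relational database and
# flag samples that depart from their recent rolling statistics.
import sqlite3
import numpy as np

conn = sqlite3.connect(":memory:")     # stand-in for MySQL/PostgreSQL/Oracle
conn.execute("CREATE TABLE readings (instrument TEXT, ts REAL, value REAL)")

# Insert a synthetic signal with a step change partway through.
rng = np.random.RandomState(0)
values = np.concatenate([rng.normal(10.0, 0.2, 500), rng.normal(12.0, 0.2, 100)])
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("chamber_pressure", float(t), float(v)) for t, v in enumerate(values)],
)

# Pull one instrument's series back out and run a rolling z-score event check.
rows = conn.execute(
    "SELECT ts, value FROM readings WHERE instrument = ? ORDER BY ts",
    ("chamber_pressure",),
).fetchall()
series = np.array([v for _, v in rows])

window = 50
events = []
for i in range(window, len(series)):
    ref = series[i - window:i]
    z = (series[i] - ref.mean()) / (ref.std() + 1e-9)
    if abs(z) > 5.0:
        events.append(i)

print("first flagged sample:", events[0] if events else None)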