5 research outputs found

    Three Methods for Occupation Coding Based on Statistical Learning

    Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
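    The hybrid idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the underlying statistical learning algorithm, so a simple Jaccard nearest-neighbor lookup stands in for it here. The use of character trigrams, the function names, and the example codes are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace/case-normalized answer."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return frozenset(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_duplicate_index(training):
    """Map each n-gram set to a tally of the codes assigned to answers with that set."""
    index = defaultdict(Counter)
    for answer, code in training:
        index[char_ngrams(answer)][code] += 1
    return index


def jaccard_nearest_neighbor(training):
    """Stand-in fallback model (an assumption): code of the most similar training answer."""
    items = [(char_ngrams(answer), code) for answer, code in training]

    def predict(answer):
        query = char_ngrams(answer)

        def similarity(grams):
            union = len(query | grams)
            return len(query & grams) / union if union else 0.0

        return max(items, key=lambda item: similarity(item[0]))[1]

    return predict


def hybrid_code(answer, index, fallback):
    """Use the duplicate's most frequent code if one exists, otherwise the fallback model."""
    key = char_ngrams(answer)
    if key in index:
        return index[key].most_common(1)[0][0], "duplicate"
    return fallback(answer), "model"


# Hypothetical training answers and codes, for illustration only.
training = [("Nurse", "2221"), ("nurse ", "2221"), ("Truck driver", "8332")]
index = build_duplicate_index(training)
fallback = jaccard_nearest_neighbor(training)
```

    With this setup, `hybrid_code("NURSE", index, fallback)` is resolved as a duplicate because its n-gram set matches a training answer despite differing in case, while an unseen answer such as "truck drivers" falls through to the stand-in model.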

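    Characterization of exposure measurements collected by the U.S. federal agency OSHA for estimating occupational exposures in North America (original title: Caractérisation des mesures d'exposition recueillies par l'agence fédérale américaine OSHA pour l'estimation des expositions professionnelles en Amérique du Nord)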

    The Integrated Management Information System (IMIS) contains exposure measurements taken by U.S. Occupational Safety and Health Administration (OSHA) inspectors to verify compliance with permissible exposure limits. Supplementary data containing the analytical results of the field samples are available in the Chemical Exposure Health Data (CEHD) databank. These databanks represent a major potential source of information on exposure conditions in North American workplaces. However, the degree to which they represent the actual distribution of exposure levels found in the workplace is largely unknown. The objective of this thesis is to examine the extent to which exposure data collected by OSHA can be used for estimating occupational exposure in North America. Analyses focused on 511,047 and 588,818 exposure measurements in IMIS and CEHD, respectively, for the period 1979-2011. First, generalized additive models were used to explore associations between exposure levels in IMIS and ancillary variables reflecting characteristics of establishments and inspections for 77 chemical agents (90% of IMIS content). Second, modified Poisson regression was used to identify the determinants of whether CEHD samples were recorded in IMIS, by linking the two databanks for 78 agents. Finally, Classification And Regression Tree (CART) models were applied to predict which non-detected (ND) results stored in IMIS are 8-hour time-weighted average (TWA) samples and which are short-term samples, based on variables common to the IMIS and CEHD databanks.

    In the first analysis, statistical modelling showed that measurements collected under federal OSHA plans were more likely to exceed the threshold limit value (TLV) than measurements collected under state OSHA plans (odds ratio (OR) of 1.22 across agents). An increase in the total amount of penalties assessed to a company, regardless of the nature of the violations, was associated with higher odds of a sample result exceeding the TLV (OR of 1.54 across agents for "high" vs. "none"). Follow-up inspections were more likely than planned inspections to have a sample result exceed the TLV (OR of 1.61 across agents). In the second analysis, linkage between CEHD and IMIS showed that 38% of CEHD samples overall were recorded in IMIS. Non-detects (especially ND records corresponding to analytical panels, e.g., a panel of metals) were less likely to be recorded in IMIS (relative risk of approximately 0.6). Finally, CART models predicted which IMIS ND results were TWA or short-term samples more accurately than simple methods of assignment (e.g., assigning the most frequent category among detected values) for the most relevant agents (i.e., those with high proportions of ND, short-term, and TWA results).

    Our findings showed the presence of several selection mechanisms in the process leading up to the recording of a sample in IMIS, which suggests that systematic differences exist between OSHA measurements and actual occupational exposures in the general U.S. working population. These biases can be partially controlled by using ancillary information on exposure measurements together with predictive methods, thus helping to draw more accurate portraits of exposure levels from OSHA data.
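    The "simple method of assignment" that the abstract uses as a baseline for the CART models can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say whether the most frequent category is computed per agent or overall, so the per-agent grouping, the function names, and the example records are all hypothetical.

```python
from collections import Counter


def most_frequent_type(detected):
    """For each agent, find the most common sample type among detected results."""
    counts_by_agent = {}
    for agent, sample_type in detected:
        counts_by_agent.setdefault(agent, Counter())[sample_type] += 1
    return {
        agent: counts.most_common(1)[0][0]
        for agent, counts in counts_by_agent.items()
    }


def assign_types(nd_agents, detected):
    """Give each non-detected record its agent's modal type (None if the agent is unseen)."""
    modal = most_frequent_type(detected)
    return [(agent, modal.get(agent)) for agent in nd_agents]


# Hypothetical detected records as (agent, sample type); not taken from IMIS or CEHD.
detected = [
    ("lead", "TWA"),
    ("lead", "TWA"),
    ("lead", "short-term"),
    ("toluene", "short-term"),
]
assigned = assign_types(["lead", "toluene", "xylene"], detected)
```

    A baseline of this kind ignores every variable except the agent, which is why a tree model using the variables shared by both databanks can plausibly improve on it.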

    Opportunities and challenges in new survey data collection methods using apps and images.

    Surveys are well established as an effective way of collecting social science data. However, they may lack the detail, or not measure the concepts, necessary to answer a wide array of social science questions. Supplementing survey data with data from other sources offers opportunities to overcome this. The use of mobile technologies offers many such new opportunities for data collection. New types of data might be collected, or it may be possible to collect existing data types in new and innovative ways. Alongside these new opportunities, there are new challenges. Again, these can be either unique to mobile data collection or existing data collection challenges that are altered by using mobile devices to collect the data. The data used come from a study that uses an app for mobile devices to collect data about household spending, the Understanding Society Spending Study One. Participants were asked to report their spending by submitting a photo of a receipt, entering information about a purchase manually, or reporting that they had not spent anything that day. Each substantive chapter offers a piece of research exploring a different challenge posed by this particular research context. Chapter one explores the challenge presented by respondent burden in the context of mobile data collection. Chapter two considers the challenge of device effects. Chapter three examines the challenge of coding large volumes of organic data. The thesis concludes by reflecting on how the lessons learnt throughout might inform survey practice moving forward. Whilst this research focuses on one particular application, it is hoped that it serves as a microcosm that contributes to the discussion of the wider opportunities and challenges faced by survey research as a field.

    Computer-Based Coding of Occupation Codes for Epidemiological Analyses

    No full text

    New methods for job and occupation classification

    This dissertation addresses the measurement of occupation in surveys. Many surveys ask respondents about their occupation with open-ended questions. The verbal answers are typically coded after the interview into official classifications (e.g., the 2008 International Standard Classification of Occupations or the 2010 German Classification of Occupations). This process is known to be time-consuming and prone to errors. To counter both issues, the first paper of the dissertation develops and tests a software prototype that searches for candidate job titles at the time of the interview. A small set of relevant jobs is suggested based on the respondents' initial verbal input, allowing respondents to select the most appropriate job on their own. A second paper compares various statistical learning algorithms to optimize the suggestions. A novel algorithm employing Bayesian principles was developed, improving the suggestions further. In a third paper, 1226 work activity descriptions were created based on close inspection of the official occupational classifications. These work activity descriptions can be used as answer options in an improved version of the prototype.