
    A regression surprise resolved

    In this note we explore the following surprising fact: in regression with trend and seasonality, the prediction risk is constant across all seasons of a new cycle, despite the fact that it increases with time when the seasonal terms are left out. Awareness of this may be useful both to the practicing statistician and to teachers of statistics. The challenge of resolving the issue may also be given to students of statistics as a research project.
    Keywords: Trend and seasonality; Prediction risk; Paradox

    Matched Dispatching in Randomized Settings

    This paper examines some problems of matched dispatching in several different random settings. The presentation context is that of a reality show with a lineup of participants where, according to some probabilistic selection rule, some participants are pairwise matched into teams while others are excluded. We consider mainly two cases of induced randomness, one based on random ordering of the participants and one based on coin-flipping, and consider both linear and circular lineups. Questions of fairness are discussed, and some alternative schemes are examined.

    Statistisk prosess styring : å forstå og kunne reagere på variasjon [Statistical process control: understanding and being able to react to variation]


    Patient allocations in general practice in case of patients' preferences for gender of doctor and their unavailability

    Background: In some countries every citizen has the right to obtain a designated general practitioner. However, each individual may have preferences that cannot be fulfilled due to shortages of some kind. The questions raised in this paper are: To what extent can we expect preferences to be fulfilled when the patients "compete" for entry on the lists of practitioners? What changes can we expect under changing conditions? A particular issue explored in the paper is the case where the majority of women prefer a female doctor and there is a shortage of female doctors.
    Findings: The analysis is done on the macro level by the so-called gravity model and on the micro level by recent theories of benefit-efficient population behaviour, partly developed by two of the authors. A major finding is that, as a determining factor for the benefit-efficient allocation, the number of patients wanting a doctor of the underrepresented gender is less important than the strength of their preferences.
    Conclusions: We were able to generate valuable insights into the questions asked and into the dynamics of benefit-efficient allocations. The approach is quite general and can be applied in a variety of contexts.

    9 - Breakfast Cereal

    Topic: Cluster analysis and alternatives
    Context: The following data (units omitted) on the content of 23 different brands of breakfast cereal were read from the packages found in two food stores in Bergen in the fall of 2001:

    Product              Energy  Protein  Car.hyd.  Sugar  Starch  Fat   Fiber  Sodium  Brand
    Corn Flakes          1500    8        82        10     72      1     3      1.1     Kellogs
    Special              1500    15       75        17     58      1     2.5    0.9     Kellogs
    All-Bran             1350    10       66        22     44      2     15     0.8     Kellogs
    Frosties             1600    6        84        36     48      0.5   2      0.7     Kellogs
    Choko Korn Smacks    1650    8        81        45     36      6     5      0.05    Kellogs
    Chocos Frokost       1600    8        81        36     45      2     4      0.4     Kellogs
    Honey Crunch         1600    7        83        36     47      2.5   2.5    0.7     Kellogs
    Honey Korn Smacks    1600    7        84        48     36      2     3      0       Kellogs
    Loops                1550    8        77        36     41      3     7      0.6     Kellogs
    Cheerios             1580    8        76        21     55      4     6.5    0.8     Nestle
    Fitness              1530    7.5      80        17     63      1.3   6.7    0.5     Nestle
    Apple Minis          1580    4.5      84        43     41      2.4   4.5    0.7     Nestle
    Nesquick             1680    5        84        38     46      4.5   2.4    0.3     Nestle
    Havre Fras           1650    9.5      72        12     60      7     5.5    1.1     Quaker
    Crusli Sol frokost   1810    7        67        31     36      15    5.5    0       Quaker
    Crusli Fiber         1840    7.5      68        28     40      16    10     0       Quaker
    Crusli Choko         1920    7.5      76        32     44      18    5.5    0       Quaker
    Energi Mix           1750    8        73        23     50      10    4.5    0.4     Quaker
    4 korn               1370    11       61        1      60      3     11     0.004   Regal
    Go' Dag w/raisins    1440    12       59        14     45      6.5   11     0.03    Regal
    Weetos               1629    6.2      78.4      36.3   42.1    5     5.6    0.3     Weetabix
    Weetabix             1440    11.2     67.6      4.7    62.9    2.7   10.5   0.3     Weetabix
    Frutibix             1498    8        71.2      27     44.2    3.8   8.1    0.2     Weetabix

    File: Breakfast_Cereal.XLS (Note that Carbohydrates is the sum of Sugar and Starch.)
    In marketing it is often of interest to group competing products by similarity, in order to reveal close competitors and possible niches. This may be achieved by cluster analysis, which is a technique for stepwise joining of items; the result is often presented graphically by a so-called dendrogram. Various types of cluster analysis are available in standard statistical software; the type relevant here is observation cluster analysis.
    Task: Perform a cluster analysis using the variables Energy, Protein, Carbohydrates, Fat and Fibres. What are your conclusions, and how can they be used? What are the limitations? Are there alternatives? A variety of options may be offered concerning the way of measuring distance between items and clusters and the approach for joining items and clusters. Typical distance choices are Euclidean or correlation distance, and typical linkage methods are average linkage or single linkage. The different choices will sometimes lead to fairly different solutions, and the results should be interpreted with care. Try first the underlined ones and then some of the others. (A sketch of such an analysis is given after the case.)
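    The following is a minimal sketch in Python (with NumPy/SciPy/Matplotlib, as an alternative to the menu-driven packages the task has in mind) of how such an observation cluster analysis might look. The four-product matrix is only an excerpt of the table above, and standardizing the variables first is an assumption made here, not something the task prescribes:

```python
# Hierarchical (observation) cluster analysis on a four-product excerpt
# of the cereal table; a sketch only, assuming numpy/scipy/matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Columns: Energy, Protein, Carbohydrates, Fat, Fiber (excerpt of the data above)
products = ["Corn Flakes", "All-Bran", "Frosties", "Havre Fras"]
X = np.array([
    [1500,  8.0, 82, 1.0,  3.0],
    [1350, 10.0, 66, 2.0, 15.0],
    [1600,  6.0, 84, 0.5,  2.0],
    [1650,  9.5, 72, 7.0,  5.5],
])

# Standardize each variable so Energy (measured in the thousands)
# does not dominate the Euclidean distances.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance with average linkage, one of the suggested choices;
# try e.g. method="single" to see how sensitive the dendrogram is.
tree = linkage(Z, method="average", metric="euclidean")
dendrogram(tree, labels=products)
plt.ylabel("Distance")
plt.show()
```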

    17 - Tax Audit

    Topic: Comparisons, regression and outliers
    Context: The establishment "Nels" has live music every Friday and Saturday. The guests have to pay a cover charge of 50 kr and, if the cloakroom is used, a wardrobe fee of 10 kr. Value added tax shall be paid on the wardrobe fees, but the cover charge is payment for entertainment and is exempt from VAT. Numbered tickets are used, and accounting data with vouchers exist for every evening, covering the number of cover charge tickets (C), the number of wardrobe tickets (W) and the sales turnover in the bar (S). After a book review at "Nels", the tax revenue service asserted its own appraisal of the income from entrance fees for the year 1997. From the conclusion of the tax auditor's report we read: "The registration sheets show that on many evenings (22 out of 104) more wardrobe tickets than entrance tickets were sold. Furthermore, the average bar sales per entrance ticket sold are often very high (more than 300 kr, which amounts to about 7 glasses of beer) and vary greatly (from 161 kr to 386 kr). Evenings with high registered beer consumption are frequently evenings when more wardrobe tickets than entrance tickets were sold (among others, all 5 evenings where the bar sales per entrance ticket were above 300 kr). This indicates that there must have been far more guests than reported in the books."
    N. Nelson claims that the tax auditor completely misinterprets the data and does not account for a number of obvious circumstances in running a restaurant: The number of guests, and how much money they leave behind in the bar and the wardrobe, depends on the time of the year, the weekday, the weather and the clientele; if attendance is low early in the evening, the doorman will waive the cover charge. Moreover, every evening has a number of guests free of charge (X), i.e. guests with so-called VIP cards or audience left over from special arrangements (concerts, shows, cabarets etc.) prior to the opening for regular guests (31 evenings out of 104 in 1997). N. Nelson claims that on regular evenings more than 100 free guests is not uncommon, and on special arrangements more than 300. On one occasion in February 1998, which can be documented (by video surveillance of a newly installed ticket system), the numbers were: paying guests C = 792, free guests X = 535, wardrobe tickets W = 881. This was an exceptionally high number of free guests. N. Nelson says that he reckons that 60-70% of the guests use the wardrobe midwinter, but only 20-30% midsummer, depending on the weather that evening. N. Nelson also refers to independent market research, which reports that this kind of clientele on average consumes about 6-7 glasses of beer per evening. Typically, more is consumed on Fridays than on Saturdays.
    Task: Suppose that you, as the external auditor, have accepted the income statement, and are about to assist the owner Nels Nelson in disproving the claim of the tax revenue service.
    A-version: You have no guidance on how to proceed to analyze the data.
    B-version: You have obtained the following suggestions on how to proceed: By reading the report of the tax auditor you will see that the argument is mainly connected to a comparison of C with W, and to judging S/C by what is regarded as an uncommon consumption, and when this occurs.
    You realise that the role of the number of free guests (X) is largely neglected or underestimated, and that a more proper measure for the average consumption ought to be R = S/(C + X).
    (a) Let S = 200 000 and compute R for the four cases. Compare this with what the tax auditor would get if he used a fixed X = 0 or X = 300. Discuss the following claim of the tax auditor: "We regard here the number of free guests as constant, as this will not affect the judgement of the differences between high and low bar sales per guest". (A small numerical sketch follows the case.)
    (b) Compute descriptive statistics for the variables S, C, W, S/C, W/C. Use these results, your calculation in (a) and the information given, and try to argue that the registered number of guests C is reasonable compared to S and W.
    (c) A simple regression analysis may be performed, where C is explained by S. One could consider the outliers and try to find arguments that they are not more frequent or larger than expected. Ask yourself also how extraordinary bar sales may affect the regression.
    (d) To examine more closely the use of the wardrobe during the year, a regression may be performed where W/C is explained by month (a 0-1 variable for each month, with January as the base). One could provide a table with estimated percentages for each month and try to explain why they are misleading. These numbers may be corrected using the facts in the case description; then decide whether the result supports the claim of N. Nelson concerning the use of the wardrobe.
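    As a quick numerical illustration of the effect in (a), here is a small sketch in Python. C = 792 and X = 535 are the documented February 1998 figures; the other X values are chosen freely here for illustration and are not necessarily the case's "four cases":

```python
# Average consumption per guest, R = S / (C + X), for a fixed bar
# turnover S and paying-guest count C, at several free-guest counts X.
# C = 792 and X = 535 are the documented February 1998 figures; the
# other X values are illustrative.
S = 200_000   # bar sales in kr, as given in task (a)
C = 792       # paying guests on the documented occasion

for X in (0, 100, 300, 535):
    R = S / (C + X)
    print(f"X = {X:3d}: R = {R:5.1f} kr per guest")

# With X = 0 the apparent consumption is about 253 kr per guest, but
# with the documented X = 535 it drops to about 151 kr, showing how
# neglecting free guests inflates the auditor's per-guest measure.
```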

    Statistisk forsøksplanlegging og analyse : for forbedring av produkter og prosesser [Statistical design of experiments and analysis: for improvement of products and processes]


    Preface

    This is a case collection in data analysis containing cases mostly relevant for business studies. The first part consists of ten mini-cases, that is, cases with a limited issue and a small dataset. The second part contains wider issues and most often larger datasets. The cases within each part are numbered in the order topics typically occur in an introductory course in statistics. For some topics that are not commonly treated in textbooks, we have supplied some theory.

    All cases are based on problems and data collected over many years. A few datasets are slightly modified in order to bring forward an issue more clearly and/or to make the data anonymous. In some cases the years are changed to "bring the problem into the current century". The name of the case may indicate the problem area. Each case starts with a topic statement, indicating the data type and modes of analysis. Then the context of the case is given, together with the problem to be solved, finalized as a task statement. For some cases this is detailed in two versions, giving the teacher and student a choice between two levels of challenge. The A-version is open with respect to the approach to the problem, while the B-version is structured, with specific and itemized questions, often with an indication of the recommended method of analysis.

    One possibility is that the students read the A-version first, make up their minds on how the problem may be attacked, and then take a look at the B-version to see whether it contains additional elements not thought of. The solution may then be done according to the B-version. This gives the students an opportunity to reflect on the choice of method, an important ability in practice. The students may alternatively be given the challenge of attacking the problem without the help of the B-version (as in business practice), or go directly to the B-version (as final exams are often structured). The fourth possibility, of going directly to our solution, is not recommended. Many problems have no unique best solution, and there are often several roads leading to the same solution or an equally good one. Our solution is maybe just one of the possible ways to attack the problem, and student creativity should not be restrained by it. The ability to attack new problems is vital in business practice, and an approach which deprives the student of such training is inferior.

    Some cases can be attacked at different levels of theoretical knowledge and sophistication, and the level chosen in practice has to match the knowledge at hand. In such cases it is worthwhile to judge whether spending time on learning a more sophisticated method is worth the effort. In some cases we present a solution which may not seem completely satisfactory from a strict statistician's viewpoint. This may give a statistics teacher an opportunity to comment and indicate a better way.

    This collection of cases can be used in conjunction with any elementary or intermediate textbook in statistics, typically one with some emphasis on probability. It can also be used alone, if the reader is willing to consult available resources when needed, for instance on the internet. Suitable software is needed for actually doing the cases. A lot can be accomplished in spreadsheet programs like Excel, but they have limited analytical capabilities. Add-ons like XLSTAT may remedy this, but a sound statistical package like Minitab is recommended. Such packages often have good help functions and tutorials, which may be used to learn some of the theory behind the methods as they are applied. The output shown in the solutions is taken from Minitab exclusively. If students are given access to solutions, a possibility is to require them to repeat the analysis in Excel (if possible) or an add-on (if available). The data are supplied for each case as an Excel worksheet file.

    I owe thanks to many individuals for sharing their problems and data. In order to secure anonymity, no one is mentioned here. I also acknowledge financial support from the "Faglitterære fond".

    4 - Accident Risks

    Topic: Frequency counts, risk analysis
    Context: Many companies and organizations observe and keep records of work accidents or some other unwanted event over time, and report the numbers monthly, quarterly or yearly. Here are two examples, one from a large corporation and one from the police records of a medium-sized city:

    Month        1    2    3    4    5    6
    #Accidents   2    0    1    3    2    4

    Year        2000  2001  2002  2003  2004  2005  2006
    #Assaults    959   989  1052  1001  1120  1087  1105

    In both examples the numbers show a tendency to increase over time. However, the question is whether this is just due to chance. If we add one to each accident count, the numbers do not look much different from those coming from throwing a fair die. Taking action on pure randomness, believing that something special is going on, is at best a waste of time. Randomness, or not, may be revealed by observing a longer period of time, but there may be no room for that. Pressure will rapidly build up to do something; in the case of accidents, to find someone responsible. However, this may lead to blaming someone for something that is inherent variation "within the system". In the case of assaults, media attention, where often just this year is compared with the last or with some favourable year in the past, typically leads to demands for more resources or new priorities.

    Can we get some help from a probabilist or a statistician? The probabilist may say: "If accidents occur randomly at the same rate, the probability distribution of the number of accidents in a given period of time is Poisson. If the expected number of accidents per month is 2, then the probabilities of a given number of accidents in a month are as follows:

    #Accidents    0       1       2       3       4       >4
    Probability   0.1353  0.2707  0.2707  0.1804  0.0902  0.0527

    Thus the pattern of observations is not at all unlikely. There is even about a 5% chance of getting more than four accidents, and this will surely show up sometimes, even if the system as such is unchanged." The statistician may perhaps add: "The average number of accidents over the six months is 2, but this is just an estimate of the true expected number of accidents in a month. Typical error margins are plus/minus the square root of 2 (knowing that the standard deviation of the Poisson distribution is the square root of the expectation), which is about 1.4. Thus the expectation may be anywhere in a wide region around 2, so any claim that the 4 accidents last month are due to increased hazards is even more unjustified."

    You have taken a course in statistics in school, where you were taught about hypothesis testing, and you wonder whether this can be used. There are some problems: (i) it assumes a "true" expected rate and independent random variation; (ii) it typically compares expectations in one group of data with a hypothesized value, or with other groups of data, whereas here we have just a single short series; (iii) it may neglect the order of the data and may not pick up trends; and (iv) it does not handle monitoring over time in an attractive manner. These objections may be remedied by more sophisticated modelling, but that becomes unattractive for common use.

    You may also have heard about statistical process control, and the use of control charts to point out deviant observations. This keeps track of the order of observations and may also react to certain patterns in the data, like level shifts and increasing trends. The drawbacks are: (i) it assumes at the outset a stable system (a process in "statistical control") whose deviant observations and patterns are then monitored; (ii) the data requirements are in many cases demanding. In practice we often do not have a stable system; things change over time, from year to year, and we do not have enough data to establish the control limits, even if the system were stable. Avoiding reactions to deviant observations that are just random is a big issue in quality management, since such reactions typically lead to more variation and turbulence in the system. This is justified in a context where you ideally have a system that is stable over time and runs according to fixed procedures (until they are changed), e.g. in a production context and for some administrative processes. You may also have this in some contexts closely related to human hazards, say monitoring birth defects. We probably have to realize that in many situations related to hazards there is no stable system, and cannot be, and that some kind of reaction is required before we know the truth. What then to do?

    If we ask a risk analyst, he or she may be stuck in traditional statistical theory, while others have dismissed traditional statistical theory as a basis for active risk management (see Aven: Risk Analysis, 2003). Data of the kind above occur frequently, and there is room for a more pragmatic attitude. Otherwise the data will either be neglected, because no one can tell how to deal with them in a systematic manner (and particularly so the statisticians), or they will be misused for opportunistic purposes. One possibility for monitoring with the aim of picking up trends in short series is described in Kvaløy & Aven (2004). It is slightly modified here, and it is easily programmed in Excel (a Python sketch is also given after the tasks below):

    Theory: A sequence of hazard counts for r consecutive periods is given, and the objective is to point out a worsening trend or individual hazards that are aberrant. Do the following:
    1. Calculate the average m_j of the observed hazards up to and including period j.
    2. Take e_j = (r - j) * m_j as the expected number of hazards for the remaining r - j periods.
    3. Use the Poisson distribution with expectation e_j to calculate the probability that the number of hazards for the r - j remaining periods is at least as large as observed.
    4. If this probability is small, say less than 5%, then initiate a warning or an alarm.
    Repeat 1-4 for all or some of j = r-1, r-2, ..., 1.
    Note: If the counts are very high, we may instead use the normal distribution with expectation e_j and standard deviation equal to the square root of e_j.

    The first example above gives this calculation scheme for judgment at month six:

    Month               1       2       3       4       5       6
    #Accidents          2       0       1       3       2       4
    Average till now    2.0     1.0     1.0     1.5     1.6     2.0
    Expected ahead      10.0    4.0     3.0     3.0     1.6     -
    Observed ahead      10      10      9       6       4       -
    Probability (tail)  0.5421  0.0081  0.0038  0.0840  0.0788  -

    We see that an alarm is given at this month, due to the small tail probabilities looking ahead from months two and three. The 4 in the sixth month (observed ahead from the fifth) is not that surprising, judged from the average of the preceding months. Neither is the combined 6 of the last two months, judged from the average of the preceding. However, the combined 9 of the last three months is dubious compared with the average of the preceding, and so is the case moving one additional month backward. Going all the way back to month 1, we have just the single observation 2 and then "expect" 10 altogether for the 5 months ahead, which is exactly what we got, thus leading to a high tail probability.

    Task 1: Replace the number 4 of the sixth month with 3 and repeat the analysis.
    Task 2: Analyse the data available after five months, i.e. omit month 6.
    Task 3: Use the described method to analyse the second example.
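    The calculation scheme is also easy to program outside Excel. Here is a minimal sketch in Python (assuming SciPy is available; the 5% alarm level matches the theory above) that reproduces the scheme for the first example:

```python
# Trend monitoring for short series of hazard counts, following the
# scheme above: for each look-back point j, compare the observed total
# for the remaining periods with a Poisson tail probability based on
# the average rate up to and including period j.
from scipy.stats import poisson

def monitor(counts, alarm_level=0.05):
    r = len(counts)
    rows = []
    for j in range(1, r):                    # look-back points j = 1, ..., r-1
        m_j = sum(counts[:j]) / j            # average hazards up to period j
        e_j = (r - j) * m_j                  # expected count for remaining r-j periods
        observed = sum(counts[j:])           # observed count for remaining periods
        tail = poisson.sf(observed - 1, e_j) # P(Poisson(e_j) >= observed)
        rows.append((j, m_j, e_j, observed, tail, tail < alarm_level))
    return rows

for j, m, e, obs, p, alarm in monitor([2, 0, 1, 3, 2, 4]):
    print(f"from month {j}: avg={m:.1f} expected={e:.1f} "
          f"observed={obs} tail={p:.4f} alarm={alarm}")

# Reproduces (up to rounding) the tail probabilities in the calculation
# scheme above, with alarms looking ahead from months two and three.
```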

    Weibull Wind Worth: Wait and Watch?

    This paper considers a decision problem concerning the worth of a windmill project whose profitability depends on the average wind speed. This is only partly known, and the issue is whether to go ahead with the project now or, at an additional cost, put up a test mill, observe it for, say, a year, and then decide. The problem is studied within a Bayesian framework and given a general analytic solution for a specific loss function of linear type, with the normal case as illustration. Explicit formulas are then derived for the case when the wind speed distribution is Weibull with known shape parameter, and the sensitivity with respect to the specification of this parameter is explored. Based on Norwegian wind speed data we then give a justification of the Weibull model. This also provides some insight into parameter stability. Finally, a complete numerical scheme for the Bayesian two-parameter Weibull model is given, illustrated with an implementation of pre-posterior Weibull analysis in R.