Survey data often comes as a plain table containing cryptic variable names, numbers, and letters. To make sense of the data, the researcher is given a questionnaire or a code book that contains a list of variable names, their descriptions, and an interpretation of the values (either a number or a string) that each variable can take. Code books are commonly provided as plain text or in PDF format. Hence, the researcher is left “free” to type variable labels and value labels one by one. This often leads to bad research habits, such as “cutting” and “processing” the piece of the survey the researcher needs in the short run and leaving the rest for future processing. This is boring and time-consuming, and it eventually leads to the creation of various versions of the same survey, an inability to track important changes, and an inability to reproduce research results, because the researcher cannot recreate the analyzed dataset step by step from the original source. In this talk, I will discuss how to recover the information that is contained in questionnaires or code books and how to process this information in a clean, fast, and efficient way with Mata.

Alfonso Miranda

Research Papers in Economics

Motivation
Agenda
Live demonstration
Strings and Mata
The code
Appendix
Dealing with the cryptic survey: Processing labels and value labels with Mata
Alfonso Miranda
Institute of Education, University of London
(A.Miranda@ioe.ac.uk)
ADMIN node · Institute of Education · University of London
Data Management
  Research is done on the basis of complex survey data
  Putting together data in a format that is ready for analysis is often a nontrivial exercise
  Researchers put a lot of effort into solving their data administration problems, yet often make the wrong decisions and end up analysing badly built data
  This may lead to strange results and significant bias
  However, most people would say that cleaning and preparing data is a boring, mostly mechanical, and unrewarding activity
The problem
  Survey data often comes as a plain table containing cryptic variable names, numbers, and letters
  To make sense of the data, the researcher is given a questionnaire or a code book that contains a list of variable names, their descriptions, and an interpretation of the values (either a number or a string) that each variable can take
  Code books are commonly provided as plain text or in PDF format. Hence, the researcher is left “free” to type variable labels and value labels one by one
Bad research habits...
“There are two things you are better off not watching in the making: sausages and econometric estimates”
Edward Leamer
Bad research habits...
  Cutting and processing the piece of the survey that is needed in the short run and leaving the rest for future processing
  Never fully understanding how the survey is structured
  Reducing sample size more than strictly needed
  Creating false missing values and/or item non-response
  Not taking the sample design into account
  Introducing potential selection bias
  This leads to the creation of various versions of the same data
  Inability to track changes
  Inability to reproduce research results
This talk...
  Here I discuss only one relatively small issue that arises when preparing data for analysis
  Namely, I will show how to recover the information that is contained in questionnaires or code books that are in PDF format (not copy protected) and how to process this information in a clean, fast, and efficient way with Mata
The Agenda
We have two pieces of information:
  Data in Stata format with variable names but no description (i.e., no variable labels)
----------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------
k3_ac           str9    %9s
k3_pmr          str18   %18s
k3_dob          str19   %19s
k3_age          byte    %8.0g
k3_mth          byte    %8.0g
k3_schid        long    %12.0g
k3_land         str1    %9s
k3_lang         str1    %9s
k3_ma           str1    %9s
k3_sc           str1    %9s
k3_engta        str1    %9s
  A list of variable names and their descriptions in a PDF file
The Agenda
Variable  Description                                                        NPD
k3_ac     Academic year
k3_bcref  Matching candidate reference number
k3_pmr    Pupil matching reference - Anonymous
k3_pmr    Pupil matching reference - Non Anonymous
k3_pup    Pupil matching reference
k3_cand   Pupil serial number
k3_ncand  NDCA reference number
k3_upn    Unique Pupil Number
k3_sname  Full legal surname
k3_fname  forenames in full
k3_dob    Date of birth
k3_age    Age at start of the academic year
k3_mth    Month part of age at start of the academic year
k3_yob    year the pupil was born.
k3_mob    month pupil was born.
k3_yrgrp  Year group - derived from date of birth
k3_gend   Gender
k3_refug  Refugee Indicator
k3_la     Local Authority (LA)
k3_estab  Establishment number of the school
k3_laest  LA and ESTAB together.
k3_urn    School's Unique Reference Number
k3_stype  Type of establishment
k3_nftyp  Institution type
k3_land   Source Country
k3_lang   Language of School
k3_langm  Language of Maths Teacher Assessment
k3_langs  Language of Science Teacher Assessment
k3_en     English examination year
k3_ma     Maths examination year
k3_sc     Science examination year
k3_schrs  Pupil in school level averages
k3_lars   Pupil in LA averages
k3_natrs  Pupil in national averages
k3_elige  Pupil in eligible pupil number English
k3_eligm  Pupil in eligible pupil number Maths
k3_eligs  Pupil in eligible pupil number Science
k3_vale   Pupil in eligible pupil number English + no missing/unmatched/ lost results
k3_valm   Pupil in eligible pupil number Maths + no missing/unmatched/ lost results
k3_vals   Pupil in eligible pupil number Science + no missing/unmatched/ lost results
k3_cflag  FFT Correction Flag for 2003/2004
k3_welta  Overall level for Welsh Teacher Assessment Level
k3_levwe  Overall Welsh Test Level
k3_tiere  English paper sat by pupil.
k3_pap1e  English Paper 1 Test Mark
k3_pap2e  English Paper 2 Test Mark
k3_erm    Marks achieved in English reading test
k3_ersm   Marks achieved in Shakespeare reading test
k3_ewm    Marks awarded in English longer writing test
k3_ewsm   Marks awarded in English shorter writing test
AIM: To create variable labels using the information contained in the PDF
Current Stata capabilities to deal with variable labels
  Can use Stata’s official label command
label variable varname ["label"]
For instance, we could type:
. label variable k3_ac "Academic year"
. label variable k3_bcref "Matching candidate reference number"
  But that requires typing one label at a time... not very efficient
  It would be nice if one could write a program that takes two long strings, one containing the variable names and the other containing all the variable descriptors, and processes all variable labels at the stroke of a single Return
The general idea
I seek to write a program that will be invoked as follows:
#delimit ;
local varnames "k3_ac # k3_bcref # k3_pmr # k3_pmr # k3_pup ";
local vardes "Academic year # Matching candidate reference number
# Pupil matching reference - Anonymous
# Pupil matching reference - Non Anonymous # Pupil matching reference";
#delimit cr
mata: Labelvar("varnames","vardes")
The program will exploit the ability, which I assume I have, of copying the data from the PDF document as plain text into a text editor (your favourite) and from the text editor into a spreadsheet (your favourite)
Live demonstration
Time for a live demonstration. Hope everything goes well...
Live demonstration
  Now, in the rest of the talk I will give details on the programming of Labelvar in Mata.
  So, those who are not that interested in the technical details, please bear with me...
Mata: An overview
Mata is a full-fledged matrix programming language. Mata can be used interactively or called from Stata, and a large number of functions (matrix, scalar, mathematical, statistical, equation solvers, optimisers) are provided. Mata can access Stata’s variables and can work with virtual matrices (views) of the data in memory. Mata code is automatically compiled into byte-code and usually runs significantly faster than the equivalent Stata (ado) code
Mata can do strings...
Mata handles matrices that contain either numeric or string elements, though a single matrix may not mix strings and numbers. Here are some examples:
. mata
: A = (1,2 \ 3,4)
: A
       1   2
    +---------+
  1 |  1   2  |
  2 |  3   4  |
    +---------+
: B = ("This","That" \ "These","Those")
: B
           1       2
    +-----------------+
  1 |   This    That  |
  2 |  These   Those  |
    +-----------------+
: end
Mata can do strings...
The sum of two string matrices is defined as:
: B = ("This","That" \ "These","Those")
: C = ("Hola","Si" \ "NO","QUE")
: C
          1      2
    +---------------+
  1 |  Hola     Si  |
  2 |    NO    QUE  |
    +---------------+
: D = B + C
: D
              1          2
    +-----------------------+
  1 |  ThisHola     ThatSi  |
  2 |   TheseNO   ThoseQUE  |
    +-----------------------+
Here I used the assignment operator (the equals sign = in the code) to define a new matrix D. Note that the sum follows the same conformability rule as the usual numeric sum: both matrices must have the same dimensions
Mata can do strings...
To summarize,
  In Mata “This” + “Hola” returns “ThisHola”
  This definition of the sum operator for strings may not sound that intuitive... But it does make sense given that the product of two strings is not defined
  So, “This” * “Hola” produces an error message
  The usual conformability rule of the sum operator applies
Hence, the idea is to exploit these capabilities of Mata and its ability to communicate with Stata to solve our labels problem
The code I
The code is written in a text editor and saved as a do-file, Labelvar.mata, which will be compiled once it is ready
The first thing we need to do is call Mata and define the function we are programming
mata:
mata clear
void function Labelvar(string scalar listvar, string scalar listdes)
{
The void tells Mata that the function returns nothing. There are two arguments, one named listvar and the other named listdes. Both arguments are string scalars (i.e., 1 x 1 matrices that contain a string value)
    /* Parsing relevant strings */
    t = tokeninit("", "#", (`""""', `"`""'"'), 0, 0)
tokeninit() sets up advanced parsing. The first argument lists the characters that will be treated as white space (here none). The second argument lists the parsing characters, that is, the characters that mark where one token ends and the next begins, here # (this is what we are after for splitting our variable names and descriptors). The third argument lists the quote characters. The last two arguments control whether numbers, in ordinary and in hexadecimal format, are kept together as single tokens; we do not need any special treatment of numbers, hence the zeroes
The code II
Next, tokenset() is used to specify that our newly defined parsing environment t will process the contents of the Stata locals whose names are stored in listvar and listdes
    tokenset(t, st_local(listvar))
    listvarT = tokengetall(t)
    tokenset(t, st_local(listdes))
    descriptorT = tokengetall(t)
The function tokengetall() puts all the tokens found in the local into the cells of a row vector, including the parsing character #
    /* get variables */
    for (i=1;i<=cols(listvarT);i++) {
        if (i==1) variables = strtrim(listvarT[i])
        if (i>1 & listvarT[i]!="#") variables = (variables,strtrim(listvarT[i]))
    }
The lines above loop over the columns of listvarT to define a new matrix variables that contains only the names of our variables, trimmed of surrounding blanks and without the parsing characters that were still present in listvarT. We do the same with the variable descriptors
    /* get descriptors */
    for (i=1;i<=cols(descriptorT);i++) {
        if (i==1) descriptor = strtrim(descriptorT[i])
        if (i>1 & descriptorT[i]!="#") descriptor = (descriptor,strtrim(descriptorT[i]))
    }
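The parsing step can be tried on its own, outside Labelvar(). What follows is only a minimal sketch: the input string is made up for illustration, and select() is used here in place of the loops above as one possible shortcut, not as the talk's method.

```mata
mata:
// Parsing environment: no white-space characters, "#" as the only parsing character
t = tokeninit("", "#", (`""""', `"`""'"'), 0, 0)
tokenset(t, "k3_ac # k3_bcref # k3_pmr")   // illustrative input, not read from a local
tokens = tokengetall(t)
tokens      // row vector; the "#" separators are returned as tokens too
// Drop the separators and trim the blanks left around each name
names = strtrim(select(tokens, tokens :!= "#"))
names
end
```

Because blanks are not treated as white space, each raw token keeps its surrounding spaces, which is why strtrim() is still needed after the separators are dropped.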
The code III
And this is a trick to make the quotation marks part of the strings that are stored in descriptor (despite its name, comma holds a double-quote character):
    comma = `"""'
    for (i=1;i<=cols(descriptor);i++) {
        descriptor[i] = comma+descriptor[i]+comma
    }
So, for instance, if we were to apply the same trick to matrix C we would get something like this:
            1        2
    +-------------------+
  1 |  "Hola"     "Si"  |
  2 |    "NO"    "QUE"  |
    +-------------------+
Now, matrix variables contains the variable names and matrix descriptor contains the variable descriptors, with the quotation marks “ ” being part of the descriptions. We are almost done... Now we only need to manipulate these matrices to create our labels
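The same quoting can also be done without a loop. A minimal sketch, assuming Mata's colon operator :+ (the element-by-element sum, which is not used in the code of this talk) and two made-up descriptors; the names quote and quoted are hypothetical:

```mata
mata:
descriptor = ("Academic year", "Date of birth")   // illustrative values
quote  = `"""'                          // a string holding one double-quote character
quoted = quote :+ descriptor :+ quote   // the scalar is added to every element
quoted
end
```

The plain + operator would refuse to add a scalar to a 1 x 2 vector, because it requires both operands to have the same dimensions; the colon version relaxes that rule.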
The code IV
Next, we use the function stata() to interact with Stata. We loop over the elements of matrix variables and summarize variable by variable. Stata's return code _rc records whether the variable was found in the data, in which case _rc equals zero. We store _rc in a Stata scalar and bring it into Mata using the st_numscalar() function
    /* Create labels definitions in Stata */
    for (i=1;i<=cols(variables);i++) {
        stata("capture su" + " " + variables[i])
        stata("scalar inlist=_rc")
        inlist=st_numscalar("inlist")
        if (inlist==0) {
            stata("label var" +" "+ variables[i]+" "+ descriptor[i])
        }
    }
Finally, if the variable is found in the current data, we use stata() to interact with Stata and create the needed variable labels. Notice how the definition of the sum operator in Mata is used to build up, in each iteration, a string that contains the information in the relevant cells of variables and descriptor plus a set of “fixed” strings, one of which is a blank space. The resulting string makes sense as a command once it is issued to the Stata prompt
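To see what is actually sent to Stata, the command strings can be printed instead of executed. This is only a sketch with two made-up entries; the descriptors are typed already wrapped in double quotes, as they would be after the quoting step, and cmd is a hypothetical name.

```mata
mata:
variables  = ("k3_ac", "k3_bcref")
// Descriptors with the double quotes already part of each string
descriptor = (`""Academic year""', `""Matching candidate reference number""')
for (i=1; i<=cols(variables); i++) {
    cmd = "label var" + " " + variables[i] + " " + descriptor[i]
    printf("%s\n", cmd)     // display the command rather than passing it to stata()
}
end
```

The first line printed is label var k3_ac "Academic year", which is exactly what one would otherwise type by hand.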
Last one on programming, I promise...
Now we just need to close the initial curly bracket and save the compiled function into a mo-file:
}
mata mosave Labelvar(), dir(PERSONAL) replace
mata clear
end
OK, the do-file with the source code is ready. The only thing we still must do is to run Labelvar.do to compile the code. Now the new Mata function Labelvar() will be available for use.
  Very similar code deals with the problem of defining value labels. The code is given in the appendix
  This code is also available from SSC:
. ssc install labelutil
  Many thanks!
  The End
Labels_v2() function
mata:
mata clear
void function Labels_v2(string scalar labelsS, string scalar valuesS,
                        string scalar lname, string scalar vtype)
{
    /* declarations */
    string matrix labels
    real matrix values       /* numeric: filled by strtoreal() further on */
    string scalar comma
    /* Parsing relevant strings */
    t = tokeninit("", "#", (`""""', `"`""'"'), 0, 0)
    tokenset(t, st_local(labelsS))
    labelsT = tokengetall(t)
    tokenset(t, st_local(valuesS))
    valuesT = tokengetall(t)
    /* get labels (the first token is skipped; collection starts at the second) */
    labels = J(1,1,"")
    for (i=1;i<=cols(labelsT);i++) {
        if (i==2) labels = strtrim(labelsT[i])
        if (i>2 & labelsT[i]!="#") labels = (labels,strtrim(labelsT[i]))
    }
    /* wrap each label in double quotes */
    comma = `"""'
    for (i=1;i<=cols(labels);i++) {
        labels[i] = comma+labels[i]+comma
    }
Labels_v2() function II
    /* get values */
    valuesR = J(1,1,"")
    for (i=1;i<=cols(valuesT);i++) {
        if (i==2) valuesR = strtrim(valuesT[i])
        if (i>2 & valuesT[i]!="#") valuesR = (valuesR,strtrim(valuesT[i]))
    }
    values = strtoreal(valuesR)
    /* non-numeric codes become missing; give them the codes 8801, 8802, ... */
    for (i=1;i<=cols(valuesR);i++) {
        if (values[i]==.) values[i] = 8800 + i
    }
    for (i=1;i<=cols(valuesR);i++) {
        valuesR[i] = comma+valuesR[i]+comma
    }
    /* Create a vector with new values as strings */
    valuesNS = strofreal(values)
    for (i=1;i<=cols(valuesNS);i++) {
        valuesNS[i] = comma+valuesNS[i]+comma
    }
    /* Replace values in data */
    if (vtype=="s") {
        /* trim string values in data */
        stata("qui replace "+lname+" = "+"rtrim("+lname+")")
        /* deal with blank records */
        stata("qui replace "+lname+" = "+comma+"9985"+comma+" if "+lname+"=="+comma+comma)
        stata("label def "+" "+lname+" "+"9985"+" "+comma+"Blank in data"+comma+", add")
Labels_v2() function III
        /* replace new values in data */
        for (i=1;i<=cols(valuesR);i++) {
            stata("qui replace"+" "+lname+"="+valuesNS[i]+" if "+" "+lname+"=="+valuesR[i])
        }
        stata("qui destring "+lname+ ", replace")
    }
    /* reverse substitution --- variable is written in data as label string description */
    if (vtype=="rev") {
        /* trim string values in data */
        stata("qui replace "+lname+" = "+"rtrim("+lname+")")
        /* deal with blank records */
        stata("qui replace "+lname+" = "+comma+"9985"+comma+" if "+lname+"=="+comma+comma)
        stata("label def "+" "+lname+" "+"9985"+" "+comma+"Blank in data"+comma+", add")
        /* replace new values in data */
        for (i=1;i<=cols(valuesR);i++) {
            stata("qui replace"+" "+lname+"="+valuesNS[i]+" if "+" "+lname+"=="+labels[i])
        }
        stata("qui destring "+lname+ ", replace")
    }
    /* Create labels definitions in Stata */
    for (i=1;i<=cols(labels);i++) {
        stata("label def" +" "+lname+" "+strofreal(values[i])+" "+ labels[i]+", add")
    }
Labels_v2() function IV
    /* label values: attach the value label named lname to the variable lname */
    stata("label val "+lname+" "+lname)
}
mata mosave Labels_v2(), dir(PERSONAL) replace
mata clear
end
  NB. Labels_v2() will code all blank records as 9985. This can be changed as needed/preferred

Dealing with the cryptic survey: Processing labels and value labels with Mata

http://repec.org/msug2009/mex09sug_am.pdf
