-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathupdate_microdata.Rmd
330 lines (255 loc) · 14.3 KB
/
update_microdata.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
# Update microdata in the P drive {#update-microdata}
Before explaining how to update our P drive, you need to make sure you
understand what the P drive is as it is explained in section
\@ref(network-drive).
The welfare data, most of it microdata but also bin and group data, reside in
the folder `P:/01.PovcalNet/01.Vintage_control`. Inside, there are two kinds of
folders, [1] country folders and [2] auxiliary folder. The former are identified
by the tree-letter ISO3 code of each country, whereas the latter start with
underscore "*".* You must be careful of never creating a folder inside
"01.vintage\_control" unless it starts with an underscore. Otherwise, you might
be breaking the functionality of many scripts that relies in this structure.
The internal structure of the country folders is the same and it follows the
[International Household Survey
Network](%5Bhttp://ihsn.org/)](<http://ihsn.org/>)) standards. It looks
something like this,
```{r}
fs::dir_tree("p:/01.PovcalNet/01.Vintage_control/COL/",
regexp = "COL/COL_2015_GEIH")
```
Within the Colombian surveys (COL), there are as many folders as surveys
available in this country. Each folder has three components, `CCC_YYYY_SSSS`,
where `CCC` stands for the country code, `yyyy` for the year, and `SSSS` for the
survey acronym. In this case, it is `COL_2015_GEIH`. Inside this folder we have
the different version available. This is represented by the convention
`CCC_YYYY_SSSS_Vmm_M_Vaa_A_TTT`, where `mm` stands for the version of the master
data (i.e., the one released by the Government of the country), and `aa` stands
for the version of the adaptation of the survey. Finally, `TTT` refers to the
collection of the adaptation. In PovcalNet we only have the `GMD` collection.
Finally, inside each of the vintage control folder, you will find the `data`
folder and within you will find the different modules of the survey. This module
is represented by the last acronym of the name of the file, say `GPWG` or `PCN`.
This two modules are explained further below.
## Get PovcalNet inventory up to date
Once everything has been approved in PRIMUS (see section \@ref(approve-primus)),
we can download the most recent data into our system using the directive
`pcn download gpwg`. The process of downloading the data can be done in several
ways, but we suggest two.
### Update all the years of selected countries
This method is easy but highly inefficient because it could be the case that
only one year of a particular country was updated or added, but the code below
will check if all the years of that country have changed as well. You may think
that it does not make sense to download the data in this way, and you might be
right. However, downloading only the data that you need is not as
straightforward as you may think, and thus the code is a little more complex
than usual. We recommend this inefficient way because it is easy to implement,
and `pcn` is smart enough to verify whether or not the microdata has changed in
a particular year, so in case it has not changed, `pcn` won't update the
microdata file in the P drive. It is just slow because it needs to load the
microdata from `datalibweb`.
```{stata, eval = FALSE}
primus query, overallstatus(approved)
local filtdate "2020-04-01" // change this date YYYY-MM-DD
keep if date_modified >= clock("`filtdate'", "YMD")
levelsof country, local(countries) clean // code of countries that changed
pcn download gpwg, countr(`countries')
```
The code above is simple. First, you query all the approved transactions in
PRIMUS using the `primus` command and keep only those transactions that were
created after the date in the local `filtdate`. Then, save in the local
`countries` the code of the countries that changed and parse it into the `pcn`
call. From there, `pcn` will take care of the update and will provide you with a
file summarizing what data was updated, added, skipped, or failed.
### Update file by file
The code below may seem complex, but it is actually very simple. The MATA
function only reads the observation `i` in the `while` loop of the
already-filtered data obtained with `primus query`. You can see that it is very
similar to the code above but the main difference is that it goes file by file
instead of checking all the files of one country.
```{stata, eval = FALSE}
//========================================================
// Mata function
//========================================================
cap mata: mata drop get_ind()
mata:
void get_ind(string matrix R) {
i = strtoreal(st_local("i"))
vars = tokens(st_local("varlist"))
for (j =1; j<=cols(vars); j++) {
st_local(vars[j], R[i,j] )
}
} // end of IDs variables
end
//========================================================
// PRIMUS query
//========================================================
primus query, overallstatus(approved)
local filtdate "2020-04-01" // change this date
keep if date_modified >= clock("`filtdate'", "YMD")
levelsof country, local(countries) clean // code of countries that changed
tostring _all, replace
local varlist = "country year"
mata: R = st_sdata(.,tokens(st_local("varlist")))
local n = _N
//========================================================
// Loop over surveys
//========================================================
local i = 0
while (`i' < `n') {
local ++i
mata: get_ind(R)
disp "`country' - `year'"
pcn download gpwg, countr(`country') year(`year')
}
```
Keep in mind the following,
1. This code works fine if you copy it and paste it in your do-file editor and
then run it directly in Stata. If you save it and then execute it from a
different file using either `do` or `run` it will fail because the `end`
command at the end of the MATA code will finish the process.
2. This code could be implemented in another subcommand of `pcn` without
running into the problem above. However, we don't have time to do this
change. The MATA function already exists in the `pcn` repository and it is
being used by other `pcn` subroutines like `pcn_download_gpwg`. If you want
to contribute to the `pcn` by adding this feature, you're more than welcome!
`r emo::ji("wink")`.
3. The only problem with this approach is that you won't get at the end of the
process a nice file summarizing what happened with each file. This could
also be fixed if this routine is included as part of `pcn`.
## Create the \_PCN files
The final step before having the microdata ready to be converted to .pcb files
and ingested by the PovcalNet system is to create the "PCN" files. These files
are just an adaptation of the GPWG microdata in the P drive. They are called
"PCN" because that is their suffix in the naming convention. For instance, the
"\_PCN" file of *COL\_2017\_GEIH\_V01\_M\_V01\_A\_GMD**\_GPWG**.dta* is
*COL\_2017\_GEIH\_v01\_M\_v01\_A\_GMD[\_PCN]{.red}.dta.* This conversion is done
through the `pcn create` directive. In the case above, for instance, you only
need to execute the directive,
```{stata, eval = FALSE}
pcn create, countries(COL) year(2017) clear
```
Also, you can create the "PCN" files of all surveys, by just typing,
```{stata, eval = FALSE}
pcn create, countries(ALL)
```
You may need to use the option `replace` in the directive above, in case you
want to replace an existing "PCN" file. In similar fashion as
`pcn download gpwg`, `pcn create` compares current data in the P drive and does
not replace anything unless you make this option explicit.
Also, if you want to understand what the `pcn create` directive does, you can
look at the file `pcn_create.ado`, but it general it makes sure to load the data
in the P drive and standardize the output to make it ready for PovcalNet. For
instance, it replace zeros and missing values of the welfare and weight
variables like this,
```{stata, eval = FALSE}
* drop missing values
drop if welfare < 0 | welfare == .
drop if weight <= 0 | weight == .
```
Or make sure the `weight` variable is exactly the same for all the files,
```{stata, eval = FALSE}
cap confirm var weight, exact
if (_rc) {
cap confirm var weight_p, exact
if (_rc == 0) rename weight_p weight
else {
cap confirm var weight_h, exact
if (_rc == 0) rename weight_h weight
else {
noi disp in red "no weight variable found for country(`country') year(`year') veralt(`veralt') "
local status "error. cleaning"
local dlwnote "no weight variable found for country(`country') year(`year') "
mata: P = pcn_info(P)
noi _dots `i' 1
continue
}
}
}
```
This command also makes sure to do the proper adjustments to the India and
Indonesia datasets.
[Daniel, could you please provide a short explanation of the IND and IDN
adjustment?]{.red .bg-light-blue}
::: {.rmdbox .rmdcaution}
Make sure that for all the new GPWG that you download from `datalibweb` based on
the PRIMUS catalog, you create all the corresponding PCN files.
:::
## China synthetic files
For the early years of Indonesia and India, and a large part of China's, we
calculate the national inequality measures using separate urban and rural
grouped data. This data is synthetic data created following a simple process:
1. Query the fitted Lorenz curve parameters for these country-years.
For this kind of data (grouped data), Povcalnet fits a Lorenz curve using
two functional forms: a linear approximation and the other using a quadratic one.
By Querying Povcalnet, one can recover the critical parameters used to fit the curve
over the data. These estimated parameters and the poverty estimates of each country-year
are provided for both urban and rural areas following a simple query:
"http://iresearch.worldbank.org/PovcalNet/Detail.aspx?Format=Detail&C0=CHN_2&PPP0=3.0392219&PL0=1.90&Y0=1981&NumOfCountries=1"
Checking the query in some detail shows that it is composed of a few key components:
- The server root: This is the first part of the query. In the example, it corresponds
to "http://iresearch.worldbank.org/PovcalNet". This root indicates the Povacalnet server
to be used to query the data. Keep in mind that there are a few servers: the production
and the dev or internal. The example uses the production server; in case you intend to
use the dev server, change the root to "http://wbgmsrech001/povcalnet".
- The country: Following the server definition, the API query deepens on details,
just after "Format=Detail". The first one to appear is "CO", which refers to the country.
To query a given country, following the "=" symbols, add the country's three-letter code.
In our case, add CHN to query data for China.
- The coverage level: Just after the country code and the "_" follows a number. This number
indicates the data coverage level, urban or rural. If the desired level is rural, this
number must be one (1); otherwise, if the sought level is urban, the number should set
to two (2). The example queries data at the urban level.
- The PPP: The next component is the PPP. The PPP is set after "&PPP0=".
In the example, the PPP to be used is 3.0392219.
- The poverty line: The poverty line is set following the characters "&PL0".
In the example, the line is set to the extreme poverty line of 1.90.
- The year: Is set using "&Y0=". In the example query, the requested year is 1981.
Once a given observation is queried from Povcalnet, the estimated parameters are recovered and stored. An auxiliary ado file carries out all this, "lorenz_query.ado". The extraction is close to a web scrapping process. Using regular expressions, each of the values of interest is stored and returned as a scalar. For example, the piece of code that recovers the Lorenz estimates is as follows:
```{stata, eval = FALSE}
foreach sec in GQ beta final {
if regexm(`sec'," A[ ]+([0-9|\.|-]+)") scalar `sec'c_A = regexs(1)
if regexm(`sec'," B[ ]+([0-9|\.|-]+)") scalar `sec'c_B = regexs(1)
if regexm(`sec'," C[ ]+([0-9|\.|-]+)") scalar `sec'c_C = regexs(1)
if ("`sec'" == "beta"){
if regexm(`sec'," Theta:[ ]+([0-9|\.|-]+)") scalar `sec'_theta = regexs(1)
if regexm(`sec'," Gamma:[ ]+([0-9|\.|-]+)") scalar `sec'_gamma = regexs(1)
if regexm(`sec'," Delta:[ ]+([0-9|\.|-]+)") scalar `sec'_delta = regexs(1)
}
```
2. Estimate the slope parameters
The following code estimates the slope parameters:
```{stata, eval = FALSE}
foreach cv of local cover{
loc cvl = substr("`cv'", 1,1) // level_sufix
// - Create parameters to calculate the slope of the GQ lorenz curve
scalar e_`cvl' = -(`=scalar(GQcoeffA_`cvl')' + `=scalar(GQcoeffB_`cvl')' + `=scalar(GQcoeffC_`cvl')' + 1)
scalar m_`cvl' = (`=scalar(GQcoeffB_`cvl')')^2 - 4*`=scalar(GQcoeffA_`cvl')'
scalar n_`cvl' =2*`=scalar(GQcoeffB_`cvl')'*`=scalar(e_`cvl')'-4*`=scalar(GQcoeffC_`cvl')'
scalar r_`cvl' =((`=scalar(n_`cvl')')^2-4*`=scalar(m_`cvl')'*((`=scalar(e_`cvl')')^2))^(0.5)
// - Convert GQ mean from monthly to daily
scalar GQmean_`cvl' = `=scalar(GQmean_`cvl')' * 12 / 365
scalar betamean_`cvl' = `=scalar(betamean_`cvl')' * 12 / 365
}
```
Notice that these parameters are created using the parameters delivered by the API. Namely, all the scalars, which name starts by GQcoeff, in the previous chunck of code.
3. Simulate the data
For each coverage level, urban and rural, 100 000 observations are simulated; and the observations' weights are simply the population by coverage level.
```{stata, eval = FALSE}
clear
set type double
set seed 12345
scalar nobs=100000
scalar first=1/(2*`=nobs')
scalar last=1-(1/(2*`=nobs'))
set obs `=nobs'
range _F `=first' `=last'
```
Now using the slope parameters previously calculated the Lorenz curve, the simulated incomes are calculated in the following lines:
```{stata, eval = FALSE}
** Calculate the slope of GQ lorenz curve and income (= mu * slope of LC)
gen double x_F_GQ = `=scalar(GQmean_`cvl')'*(-(`=scalar(GQcoeffB_`cvl')')/2 -(2*`=scalar(m_`cvl')'*(_F)+`=scalar(n_`cvl')')*(`=scalar(m_`cvl')'*(_F)^2+`=scalar(n_`cvl')'*(_F)+(`=scalar(e_`cvl')')^2)^(-0.5)/4)
** Calculate the slope of beta lorenz curve and income (= mu * slope of LC)
gen double x_F_beta = `=scalar(betamean_`cvl')'*(1-`=scalar(betatheta_`cvl')'*((_F)^`=scalar(betagamma_`cvl')')*((1-(_F))^`=scalar(betadelta_`cvl')')*((`=scalar(betagamma_`cvl')'/(_F)) - (`=scalar(betadelta_`cvl')'/(1-(_F)))))
```
The simlated incomes are just the mean times the Lorenz curve slope, for both the linear and cuadratic approach.
Once this is calculated, only the weights (population) and simulated income values are kept in memory. This is it, that's the simulated data used we use.