---
output: html_document
---
# Microdata {#chapter05}
<center>
![](./images/DDI.JPG){width=60%}
</center>
<br>
## Definition of microdata
When surveys or censuses are conducted, or when administrative data are recorded, information is collected on each unit of observation. The unit of observation can be a person, a household, a firm, an agricultural holding, a facility, or other. Microdata are the data files resulting from these data collection activities, which contain the <u>unit-level</u> information (as opposed to aggregated data in the form of counts, means, or other). Information on each unit is stored in *variables*, which can be of different types (e.g. numeric or alphanumeric, discrete or continuous). These variables may contain data reported by the respondent (e.g., the marital status of a person), obtained by observation or measurement (e.g., the GPS location of a dwelling), or generated by calculation, recoding or derivation (e.g., the sample weight in a survey).
For efficiency reasons, variables are often stored in numeric format (i.e. as coded values), even when they contain qualitative information. For example, the sex of a respondent may be stored in a variable named ‘Q_01’, and include values 1, 2 and 9 where 1 represents "male", 2 represents "female", and 9 represents "unreported". Microdata must therefore be provided at a minimum with a data dictionary containing the variable and value labels and, for derived variables, information on the derivation process. But many other features of a micro-dataset should also be described, such as the objectives and the methodology of data collection (including a description of the sampling design for sample surveys), the period of data collection, the identification of the primary investigator and other contributors, the scope and geographic coverage of the data, and much more. This information will make the data usable and discoverable.
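To illustrate why value labels are needed, the following minimal R sketch attaches labels to a coded variable. The variable name `q_01` and its values are made up for the example; they follow the coding described above (1 = male, 2 = female, 9 = unreported).

```r
# Hypothetical coded variable, following the 'Q_01' example above.
q_01 <- c(1, 2, 2, 9, 1)

# Value labels that a data dictionary would need to provide.
value_labels <- c("male" = 1, "female" = 2, "unreported" = 9)

# Attach the labels by converting the codes to a factor.
sex <- factor(q_01, levels = value_labels, labels = names(value_labels))

# Frequencies by category, readable without consulting the code list.
sex_counts <- table(sex)
print(sex_counts)
```

Without the `value_labels` mapping, the codes 1, 2 and 9 could not be interpreted, which is why the data dictionary is a minimum requirement.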
## The Data Documentation Initiative (DDI) metadata standard
The DDI metadata standard provides a structured and comprehensive list of hundreds of elements and attributes which may be used to document microdata. It is unlikely that any one study would ever require using them all, but this list provides a convenient solution to foster completeness of the information, and to generate documentation that will meet the needs of users.
The Data Documentation Initiative (DDI) metadata standard originated in the [Inter-university Consortium for Political and Social Research (ICPSR)](https://www.icpsr.umich.edu/web/pages/), a membership-based organization with more than 500 member colleges and universities worldwide. The DDI is now the project of an alliance of North American and European institutions. Member institutions comprise many of the largest data producers and data archives in the world. The DDI standard is used by a large community of data archivists, including data librarians from academia, data managers in national statistical agencies and other official data producing agencies, and international organizations. The standard has two branches: the [DDI-Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) (version 2.x) and the [DDI-Lifecycle](https://ddialliance.org/Specification/DDI-Lifecycle/) (version 3.x). These two branches serve different purposes and audiences. For the purpose of data archiving and cataloguing, the schema we recommend in this Guide is the DDI-Codebook. We use a slightly simplified version of version 2.5 of the standard, to which we add a few elements (including the `tags` element common to all schemas described in the Guide). A mapping between the elements included in our schema and the DDI Codebook metadata tags is provided in annex 2.
The DDI standard is published under the terms of the [GNU General Public License](http://www.gnu.org/licenses) (version 3 or later).
### DDI-Codebook
The DDI Alliance developed the [DDI-Codebook](https://ddialliance.org/Specification/DDI-Codebook/2.5/) for organizing the content, presentation, transfer, and preservation of metadata in the social and behavioral sciences. It enables documenting microdata files in a simultaneously flexible and rigorous way. The DDI-Codebook aims to provide a straightforward means of recording and communicating all the salient characteristics of a micro-dataset.
The DDI-Codebook is designed to encompass the kinds of data resulting from surveys, censuses, administrative records, experiments, direct observation and other systematic methodology for generating empirical measurements. The unit of observation can be individual persons, households, families, business establishments, transactions, countries or other subjects of scientific interest.
The DDI Alliance publishes the DDI-Codebook as an XML schema. We present in this Guide a JSON implementation of the schema, which is used in our R package *NADAR* and Python Library *PyNADA*. The [NADA cataloguing](https://nada.ihsn.org/) application works with both the XML and the JSON version. A DDI-compliant metadata file can be converted from the JSON schema to the XML or from XML to JSON.
### DDI-Lifecycle
As indicated by the [DDI Alliance website](https://ddialliance.org/Specification/DDI-Lifecycle/3.3/), **DDI-Lifecycle** is "designed to document and manage data across the entire life cycle, from conceptualization to data publication, analysis and beyond. It encompasses all of the DDI-Codebook specification and extends it. Based on XML Schemas, DDI-Lifecycle is modular and extensible." DDI-Lifecycle can be used to "populate variable and question banks to explore available data and question structures for reuse in new surveys". As this is not our objective, and because using the DDI-Lifecycle adds significant complexity, we do not make use of it and this chapter only covers the DDI-Codebook.
## Some practical considerations
The DDI is a comprehensive schema that provides metadata elements to document a **study** (e.g., a survey or an administrative dataset), the related **data files**, and the **variables** they contain. A separate schema is used to document the **related resources** (questionnaires, reports, and others); see Chapter 13.
Some datasets may contain hundreds or even thousands of variables. For each variable, the DDI can include not only the variable name, label and description, but also summary statistics like the count of valid and missing observations, weighted and unweighted frequencies, means, and others. Generating a DDI file manually, in particular the variable-level metadata, can be a tedious and time-consuming task. But variable names, summary statistics, and (when available) variable and value labels can be extracted directly from the data files. User-friendly solutions (specialized metadata editors) are available to automate a large part of this work. DDI can also be generated programmatically using R or Python. Section 5.5 provides examples of the use of specialized DDI metadata editors and programming languages to generate DDI-compliant metadata.
Documenting microdata is more complex than documenting publications or other types of data like tables or indicators. The production of microdata often involves experts in survey design, sampling, data processing, and analysis. Generating the metadata should thus be a collective responsibility and will ideally be done in real time ("document as you survey"). Data documentation should be implemented during the whole lifecycle of data production, not as an *ex post* task. This is in line with what the [Generic Statistical Business Process Model (GSBPM)](https://statswiki.unece.org/display/GSBPM/VI.+Overarching+Processes) recommends: "Good metadata management is essential for the efficient operation of statistical business processes. Metadata are present in every phase, either created, updated or carried forward from a previous phase or reused from another business process. In the context of this model, the emphasis of the overarching process of metadata management is on the creation/revision, updating, use and archiving of statistical metadata, though metadata on the different sub-processes themselves are also of interest, including as an input for quality management. The key challenge is to ensure that these metadata are captured as early as possible, and stored and transferred from phase to phase alongside the data they refer to." Too often, microdata are documented after completion of the data collection, sometimes by a team that was not directly involved in the production of the data. In such cases, some information may not have been captured and will be difficult to find or reconstruct.
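As a rough illustration of extracting variable-level information directly from a data file, the base R sketch below derives variable names, counts of valid and missing observations, and means from a small made-up data frame. The data and the element names (`name`, `n_valid`, `n_missing`, `mean`) are assumptions chosen for the example; they are not the element names of the schema.

```r
# Small made-up data frame standing in for a survey data file.
survey <- data.frame(
  hh_size = c(4, 6, NA, 3),
  region  = c("North", "South", "South", NA),
  stringsAsFactors = FALSE
)

# Build one metadata entry per variable. The element names used here
# are illustrative only, not those of the DDI schema.
describe_variable <- function(name, data) {
  x <- data[[name]]
  entry <- list(
    name      = name,
    n_valid   = sum(!is.na(x)),  # count of non-missing observations
    n_missing = sum(is.na(x))    # count of missing observations
  )
  # A mean is only meaningful for numeric variables.
  if (is.numeric(x)) {
    entry$mean <- mean(x, na.rm = TRUE)
  }
  entry
}

variables <- lapply(names(survey), describe_variable, data = survey)
str(variables)
```

The same loop scales to files with thousands of variables, which is where automated extraction saves the most time.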
:::idea
**Suggestions and recommendations to data curators**
- Generating detailed metadata at the variable level (including elements like the formulation of the questions, variable and value labels, interviewer instructions, universe, derivation procedures, etc.) may seem to be a tedious exercise, but it adds considerable value to the metadata. Indeed, it will (i) provide a detailed data dictionary, required to make the data usable, (ii) provide the necessary information for making the data more discoverable and to enable variable comparison tools, and (iii) guarantee the preservation of institutional memory. The cost of generating such metadata will be very small relative to the cost of generating the data.<br>
- To make the data more discoverable, care should be taken to provide a detailed description of the scope and objectives of the data collection. When a survey (or other microdataset) is used to generate statistical indicators, a list of these indicators should be provided in the metadata.<br>
- The `keywords` metadata element provides a flexible solution to improve the discoverability of data. For example, a survey that collects data on children's age, weight and height will be relevant for measuring malnutrition and generating indicators like prevalence of stunting or wasting, overweight and underweight. The variable description alone would not make the data discoverable in keyword-based search engines, hence the importance of adding relevant terms and phrases in the `keywords` section.<br>
- The DDI metadata will be saved as an XML or JSON file, i.e. as plain text. This means that the DDI metadata cannot include complex formulas. The description of some variables, as well as the description of a survey sample design, may require the use of formulas. In such case, the recommendation is to provide as much of the information as possible in the DDI, and to provide links to documents where the formulas can be found (these documents would be published with the metadata as *external resources*).
- Typically, the variables in the DDI are organized by data file. The DDI provides an option --the `variable groups`-- to organize variables differently, for example thematically. These variable groupings are virtual, in the sense that they do not impact the way variables are stored. Not all variables have to be mapped to such groups, and the same variable can belong to more than one group. This option provides the possibility to organize the variables based on a thematic or topical classification. Machine learning (AI) tools make it possible to automate the process of mapping variables to a pre-defined list of groups (each one of them described by a label and a short description). By doing this, and by generating embeddings at the group level, it becomes possible to add semantic search and to implement a recommender system that applies to microdata.
:::
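The "virtual" nature of variable groups can be sketched in a few lines of base R. The variable names and group names below are hypothetical, and the structure is a plain named list used for illustration, not the `variable_groups` structure of the schema.

```r
# Hypothetical variable names and thematic groups.
variables <- c("hh_size", "child_age", "child_weight",
               "child_height", "interview_date")

variable_groups <- list(
  demographics = c("hh_size", "child_age"),
  nutrition    = c("child_age", "child_weight", "child_height")
)

# A variable may belong to several groups...
groups_of <- function(v) {
  names(variable_groups)[sapply(variable_groups, function(g) v %in% g)]
}
groups_of("child_age")

# ...and some variables may belong to none.
ungrouped <- setdiff(variables, unlist(variable_groups))
ungrouped
```

The grouping is stored separately from the data, so adding, removing or renaming a group never changes the data files themselves.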
## Schema description: DDI-Codebook 2.5
The DDI-Codebook is a comprehensive, structured list of elements to be used to document microdata of any source. The standard contains five main sections:
- **Document description** (`doc_desc`), with elements used to <u>describe the metadata</u> (not the data); the term "document" refers here to the XML (or JSON) file that contains the metadata.
- **Study description** (`study_desc`), which contains the elements used to describe the study itself (the survey, the administrative process, or the other activity that resulted in the production of the microdata). This section will contain information on the primary investigator, scope and coverage of the data, sampling, etc.
- **File description** (`data_files`), which provides elements to document each data file that composes the dataset (this is thus a repeatable block of elements).
- **Variable description** (`variables`), with elements used to describe each variable contained in the data files, including the variable names, the variable and value labels, summary statistics for each variable, interviewers' instructions, description of recoding or derivation procedure, and more.
- **Variable groups** (`variable_groups`), an optional section that allows organizing variables by thematic or other groups, independently from the data file they belong to. Variable groups are "virtual"; the grouping of variables does not affect the data files.
The other sections in the schema are not part of the DDI Codebook itself. Some are used for catalog administration purposes when the NADA cataloguing application is used (`repositoryid`, `access_policy`, `published`, `overwrite`, and `provenance`).
- **`repositoryid`** identifies the data catalog/collection in which the metadata will be published.
- **`access_policy`** indicates the access policy to be applied to the microdata (open access, public use files, licensed access, no access, etc.)
- **`published`**: Indicates whether the metadata will be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.
- **`overwrite`**: Indicates whether metadata that may have been previously uploaded for the same dataset can be overwritten. By default, the value is "no". It must be set to "yes" to overwrite existing information. Note that a dataset will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element `study_desc > title_statement > idno` is the same.
- **`provenance`** is used to store information on the source and time of harvesting, for metadata that were extracted automatically from external data catalogs.
Other sections are provided to allow additional metadata to be collected and stored, including metadata generated by machine learning models (`tags`, `lda_topics`, `embeddings`, and `additional`). The `tags` is a section common to all schemas (with the exception of the *external resources* schema), which provides a flexible solution to generate customized facets in data catalogs. The `additional` section allows data curators to supplement the DDI standard with their own metadata elements, without breaking compliance with the DDI.
```json
{
"repositoryid": "string",
"access_policy": "data_na",
"published": 0,
"overwrite": "no",
"doc_desc": {},
"study_desc": {},
"data_files": [],
"variables": [],
"variable_groups": [],
"provenance": [],
"tags": [],
"lda_topics": [],
"embeddings": [],
"additional": { }
}
```
<br>
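The administrative elements and their defaults can be sketched in R as follows. This is only an illustration of the rules described above (unpublished by default, no overwrite by default, datasets matched on `idno`); the function names `new_ddi_skeleton` and `upload_action` are hypothetical, the repository name is made up, and the "refuse" outcome is an assumption about what a catalog would do, not a description of the NADA implementation.

```r
# Create an empty metadata structure with the default administrative values.
new_ddi_skeleton <- function(repositoryid, idno) {
  list(
    repositoryid    = repositoryid,
    published       = 0,     # default: not visible to catalog visitors
    overwrite       = "no",  # default: do not replace existing metadata
    doc_desc        = list(),
    study_desc      = list(title_statement = list(idno = idno)),
    data_files      = list(),
    variables       = list(),
    variable_groups = list()
  )
}

# Illustrative rule: two entries describe the same dataset when their
# study-level idno is identical; replacing requires overwrite = "yes".
upload_action <- function(ddi, existing_idnos) {
  idno <- ddi$study_desc$title_statement$idno
  if (!(idno %in% existing_idnos)) return("create")
  if (identical(ddi$overwrite, "yes")) "overwrite" else "refuse"
}

ddi <- new_ddi_skeleton("central", "MDA_UGA_DHS_2005_v01")
existing <- c("MDA_UGA_DHS_2005_v01")
upload_action(ddi, existing)

ddi$overwrite <- "yes"
upload_action(ddi, existing)
```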
The DDI-Codebook also provides a solution to describe OLAP cubes, which we do not make use of as our purpose is to use the standard to document and catalog datasets, not to manage data.
:::note
Each metadata element in the DDI standard has a name. In our JSON version of the standard, we do not make use of the exact same names. We adapted some of them for clarity. For example, we renamed the DDI element `titlStmt` as `title_statement`. The mapping between the DDI Codebook 2.5 standard and the elements in our schema is provided in appendix. JSON files created using our adapted version of the DDI can be exported as a DDI 2.5 compliant and validated XML file using R or Python scripts provided in the NADAR package and PyNADA library.
:::
### Document description
**`doc_desc`** *[Optional ; Not repeatable]* <br>
Documenting a study using the DDI-Codebook standard consists of generating a metadata file in XML or JSON format. This file is what is referred to as the metadata *document*. The `doc_desc` or **document description** is thus a description of the metadata file, and consists of bibliographic information describing the DDI-compliant document as a whole. As the same dataset can be documented by more than one organization, and because metadata can be automatically harvested by on-line catalogs, traceability of the metadata is important. This section, which only contains five main elements, should be as complete as possible, and should at least contain the `producers` and `prod_date` information.
```json
"doc_desc": {
"title": "string",
"idno": "string",
"producers": [
{
"name": "string",
"abbr": "string",
"affiliation": "string",
"role": "string"
}
],
"prod_date": "string",
"version_statement": {
"version": "string",
"version_date": "string",
"version_resp": "string",
"version_notes": "string"
}
}
```
<br>
- **`title`** *[Optional ; Not repeatable ; String]* <br>
The title of the metadata document (which may be the title of the study itself). The metadata document is the DDI metadata file (XML or JSON file) that is being generated. The "Document title" should mention the geographic scope of the data collection as well as the time period covered. For example: “DDI 2.5: Albania Living Standards Study 2012”.
- **`idno`** *[Optional ; Not repeatable ; String]* <br>
A unique identifier for the metadata document. This identifier must be unique in the catalog where the metadata are intended to be published. Ideally, the identifier should also be unique globally. This is different from the unique identifier `idno` found in section `study_desc > title_statement`, although it is good practice to generate identifiers that establish a clear connection between the two. The Document ID could also include the metadata document version identifier. For example, if the "Primary identifier" of the study is “ALB_LSMS_2012”, the "Document ID" in the Metadata information could be “IHSN_DDI_v01_ALB_LSMS_2012” if the DDI metadata are produced by the IHSN. Each organization should establish systematic rules to generate such IDs. A validation rule can be set (using a regular expression) in user templates to enforce a specific ID format. The identifier should not contain blank spaces.
- **`producers`** *[Optional ; Repeatable]* <br>
The metadata producer is the person or organization with the financial and/or administrative responsibility for the processes whereby the metadata document was created. This is a "Recommended" element. For catalog administration purposes, information on the producer and on the date of metadata production is useful.
- **`name`** *[Optional ; Not repeatable ; String]* <br>
The name of the person or organization in charge of the production of the DDI metadata. If the name of individuals cannot be provided due to an organization's data protection rules, the title of the person, or an anonymized identifier, can be provided (or this field can be left blank if no other option is available).
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The initials of the person, or the abbreviation of the organization's name mentioned in `name`.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the person or organization mentioned in `name`.
- **`role`** *[Optional ; Not repeatable ; String]* <br>
The specific role of the person or organization mentioned in `name` in the production of the DDI metadata. <br><br>
- **`prod_date`** *[Optional ; Not repeatable ; String]* <br>
The date the DDI metadata document was produced (not the date it was distributed or archived), preferably entered in ISO 8601 format (YYYY-MM-DD or YYYY-MM). This is a "Recommended" element, as information on the producer and on the date of metadata production is useful for catalog administration purposes.
- **`version_statement`** *[Optional ; Not repeatable]* <br>
A version statement for the metadata (DDI) document. Documenting a dataset is not a trivial exercise. It may happen that, having identified errors or gaps in a DDI document, or after receiving suggestions for improvement or additional input, the DDI metadata are modified. The `version_statement` describes the version of the metadata document. It is good practice to provide a version number and date, and information on what distinguishes the current version from the previous one(s).
- **`version`** *[Optional ; Not repeatable ; String]* <br>
The label of the version, also known as release or edition. For example, *Version 1.2*
- **`version_date`** *[Optional ; Not repeatable ; String]* <br>
The date when this version of the metadata document (DDI file) was produced, preferably identifying an exact date. This will usually correspond to the `prod_date` element. It is recommended to enter the date in the ISO 8601 date format (YYYY-MM-DD or YYYY-MM or YYYY).
- **`version_resp`** *[Optional ; Not repeatable ; String]* <br>
The organization or person responsible for this version of the metadata document.
- **`version_notes`** *[Optional ; Not repeatable ; String]* <br>
This element can be used to provide notes on this version of the metadata document, for example to indicate what is new or specific in this version compared with a previous version.
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
title = "Albania Living Standards Study 2012",
idno = "DDI_WB_ALB_2012_LSMS_v02",
producers = list(
list(name = "Development Data Group",
abbr = "DECDG",
affiliation = "World Bank",
role = "Production of the DDI-compliant metadata"
)
),
prod_date = "2021-02-16",
version_statement = list(
version = "Version 2.0",
version_date = "2021-02-16",
version_resp = "OD",
version_notes = "Version identical to Version 1.0 except for the Data Appraisal section which was added."
)
),
# ... (other sections of the DDI)
)
```
<br>
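The recommendations on identifiers and dates in the document description can be checked with regular expressions. The sketch below uses base R only; the ID rule (letters, digits and underscores) is a hypothetical house rule chosen for the example, and each organization would define its own.

```r
# Hypothetical house rule for IDs: letters, digits and underscores only,
# which also guarantees that the ID contains no blank spaces.
idno_pattern <- "^[A-Za-z0-9_]+$"

is_valid_idno <- function(idno) grepl(idno_pattern, idno)

# Accept the ISO 8601 forms YYYY, YYYY-MM or YYYY-MM-DD.
# This checks the format only; it does not verify that the day exists
# in the given month (e.g. "2021-02-31" would pass).
is_iso_date <- function(x) {
  grepl("^[0-9]{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?$", x)
}

is_valid_idno("DDI_WB_ALB_2012_LSMS_v02")
is_valid_idno("DDI WB ALB 2012")
is_iso_date(c("2021-02-16", "2021-02", "2021", "16/02/2021"))
```

Running such checks before publishing helps keep identifiers and dates consistent across a catalog.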
### Study description
**`study_desc`** *[Required ; Not repeatable]* <br>
The `study_desc` or **study description** consists of information about the data collection or study that the DDI-compliant documentation file describes. This section includes study-level information such as scope and coverage, objectives, producers, sampling, data collection dates and methods, etc.
```json
"study_desc": {
"title_statement": {},
"authoring_entity": [],
"oth_id": [],
"production_statement": {},
"distribution_statement": {},
"series_statement": {},
"version_statement": {},
"bib_citation": "string",
"bib_citation_format": "string",
"holdings": [],
"study_notes": "string",
"study_authorization": {},
"study_info": {},
"study_development": {},
"method": {},
"data_access": {}
}
```
<br>
#### Title statement
**`title_statement`** *[Required ; Not repeatable]* <br>
The title statement for the study.
```json
"title_statement": {
"idno": "string",
"identifiers": [
{
"type": "string",
"identifier": "string"
}
],
"title": "string",
"sub_title": "string",
"alternate_title": "string",
"translated_title": "string"
}
```
<br>
- **`idno`** *[Required ; Not repeatable ; String]* <br>
`idno` is the primary identifier of the dataset. It is a unique identification number used to identify the study (survey, census or other). A unique identifier is required for cataloguing purposes, so this element is declared as "Required". The identifier will allow users to cite the dataset properly. The identifier must be unique within the catalog. Ideally, it should also be globally unique; the recommended option is to obtain a Digital Object Identifier (DOI) for the study. Alternatively, the `idno` can be constructed by an organization using a consistent scheme. The scheme could for example be “catalog-country-study-year-version”, where catalog is the abbreviation of the catalog or archive, country is the 3-letter ISO country code, study is the study acronym, year is the reference year (or the year the study started), and version is a version number. Using that scheme, the Uganda 2005 Demographic and Health Survey for example would have the following `idno` (where “MDA” stands for “My Data Archive”): MDA_UGA_DHS_2005_v01. Note that the schema allows you to provide more than one identifier for the same study (in element `identifiers`); a catalog-specific identifier is thus not incompatible with a globally unique identifier like a DOI. The identifier should not contain blank spaces.
- **`identifiers`** *[Optional ; Repeatable]* <br>
This repeatable element is used to enter identifiers (IDs) other than the `idno` entered in the Title statement. It can for example be a Digital Object Identifier (DOI). The `idno` can be repeated here (the `idno` element does not provide a `type` parameter; if a DOI or other standard reference ID is used as `idno`, it is recommended to repeat it here with the identification of its `type`).
- **`type`** *[Optional ; Not repeatable ; String]* <br>
The type of unique ID, e.g. "DOI".
- **`identifier`** *[Required ; Not repeatable ; String]* <br>
The identifier itself. <br><br>
- **`title`** *[Required ; Not repeatable ; String]* <br>
This element is "Required". Provide here the full authoritative title for the study. Make sure to use a unique name for each distinct study. The title should indicate the time period covered. For example, in a country conducting monthly labor force surveys, the title of a study would be like “Labor Force Survey, December 2020”. When a survey spans two years (for example, a household income and expenditure survey conducted over a period of 12 months from June 2020 to June 2021), the range of years can be provided in the title, for example “Household Income and Expenditure Survey 2020-2021”. The title of a survey should be its official name as stated on the survey questionnaire or in other study documents (report, etc.). Including the country name in the title is optional (another metadata element is used to identify the reference countries). Pay attention to the consistent use of capitalization in the title.
- **`sub_title`** *[Optional ; Not repeatable ; String]* <br>
The `sub_title` is a secondary title used to amplify or state certain limitations on the main title, for example to add information usually associated with a sequential qualifier for a survey. For example, we may have “[country] Universal Primary Education Project, Impact Evaluation Survey 2007” as `title`, and “Baseline dataset” as `sub_title`. Note that this information could also be entered as a Title with no Subtitle: “[country] Universal Primary Education Project, Impact Evaluation Survey 2007 - Baseline dataset”.
- **`alternate_title`** *[Optional ; Not repeatable ; String]* <br>
The `alternate_title` will typically be used to capture the abbreviation of the survey title. Many surveys are known and referred to by their acronym. The survey reference year(s) may be included. For example, the "Demographic and Health Survey 2012" would be abbreviated as "DHS 2012", or the "Living Standards Measurement Study 2020-2021" as "LSMS 2020-2021".
- **`translated_title`** *[Optional ; Not repeatable ; String]* <br>
In countries with more than one official language, a translation of the title may be provided here. Likewise, the translated title may simply be a translation into English from a country’s own language. Special characters should be properly displayed, such as accents and other stress marks or different alphabets.
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
# ... ,
study_desc = list(
title_statement = list(
idno = "ML_ALB_2012_LSMS_v02",
identifiers = list(
list(type = "DOI", identifier = "XXX-XXXX-XXX")
),
title = "Living Standards Study 2012",
alternate_title = "LSMS 2012",
translated_title = "Anketa e Matjes së Nivelit të Jetesës (AMNJ) 2012"
)
),
# ...
)
```
<br>
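An identifier following the “catalog-country-study-year-version” scheme described for `idno` can be generated consistently with a small helper. The function name `build_idno` and the two-digit version format are assumptions made for this sketch; the underscore separator follows the MDA_UGA_DHS_2005_v01 example.

```r
# Assemble an idno from its components, joined by underscores.
# `version` is a whole number, formatted with two digits (e.g. 1 -> "v01").
build_idno <- function(catalog, country, study, year, version) {
  paste(catalog, country, study, year, sprintf("v%02d", version), sep = "_")
}

idno <- build_idno("MDA", "UGA", "DHS", 2005, 1)
idno

# The identifier should not contain blank spaces.
has_no_spaces <- !grepl(" ", idno)
has_no_spaces
```

Generating identifiers with a function rather than by hand reduces the risk of inconsistent formats within a catalog.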
#### Authoring entity
**`authoring_entity`** *[Optional ; Repeatable]* <br>
The name and affiliation of the person, corporate body, or agency responsible for the study’s substantive and intellectual content (the "authoring entity" or “primary investigator”). Generally, in a survey, the authoring entity will be the institution implementing the survey. Repeat the element for each authoring entity, and enter the `affiliation` when relevant. If various institutions have been equally involved as main investigators, they should all be listed. This only includes the agencies responsible for the implementation of the study, not sponsoring agencies or entities providing technical assistance (for which other metadata elements are available). The order in which authoring entities are listed is discretionary. It can be alphabetic or by significance of contribution. Individual persons can also be mentioned, if not prohibited by privacy protection rules.
```json
"authoring_entity": [
{
"name": "string",
"affiliation": "string"
}
]
```
<br>
- **`name`** *[Optional ; Not repeatable ; String]* <br>
The name of the person, corporate body, or agency responsible for the work's substantive and intellectual content. The primary investigator will in most cases be an institution, but could also be an individual in the case of small-scale academic surveys. If persons are mentioned, use the appropriate format of *Surname, First name*.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the person, corporate body, or agency mentioned in `name`.
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
# ... ,
study_desc = list(
# ... ,
authoring_entity = list(
list(name = "National Statistics Office of Popstan (NSOP)",
affiliation = "Ministry of Planning"),
list(name = "Department of Public Health of Popstan (DPH)",
affiliation = "Ministry of Health")
),
# ...
)
)
```
<br>
#### Other entity
**`oth_id`** *[Optional ; Repeatable]* <br>
This element is used to acknowledge any other people and organizations that have contributed to the study in some form. It does not include other producers, which should be listed in `producers`, or financial sponsors, which should be listed in `funding_agencies`.
```json
"oth_id": [
{
"name": "string",
"role": "string",
"affiliation": "string"
}
]
```
<br>
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the person or organization.
- **`role`** *[Optional ; Not repeatable ; String]* <br>
A brief description of the specific role of the person or organization mentioned in `name`.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the person or organization mentioned in `name`.
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
# ... ,
study_desc = list(
# ... ,
oth_id = list(
list(name = "John Doe",
role = "Technical advisor in sample design",
affiliation = "World Bank Group"
)
),
# ...
)
)
```
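Since `name` is the only required sub-element of `oth_id`, it can be worth checking entries before publishing the metadata. The following is a minimal sketch in base R. The object `contributors` and its second, deliberately incomplete entry are hypothetical and only serve to illustrate the check.

```r
# Hypothetical list of contributors; the second entry lacks the required `name`.
contributors <- list(
  list(name = "John Doe",
       role = "Technical advisor in sample design",
       affiliation = "World Bank Group"),
  list(role = "Reviewer")
)

# TRUE for entries whose `name` is absent or an empty string.
missing_name <- vapply(
  contributors,
  function(entry) is.null(entry[["name"]]) || !nzchar(entry[["name"]]),
  logical(1)
)

# Positions of the entries that still need a name.
which(missing_name)
```

The same check would apply to any other repeatable element in which `name` is marked as required.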
<br>
#### Production statement
**`production_statement`** *[Optional ; Not repeatable]* <br>
A production statement for the work at the appropriate level.
```json
"production_statement": {
"producers": [
{
"name": "string",
"abbr": "string",
"affiliation": "string",
"role": "string"
}
],
"copyright": "string",
"prod_date": "string",
"prod_place": "string",
"funding_agencies": [
{
"name": "string",
"abbr": "string",
"grant": "string",
"role": "string"
}
]
}
```
<br>
- **`producers`** *[Optional ; Repeatable]* <br>
This field is used to list the parties and persons that played a significant, but not the leading, technical role in implementing and producing the data. It covers neither the lead agencies (listed in `authoring_entity`) nor the financial sponsors (listed in `funding_agencies`).
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the person or organization.
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The official abbreviation of the organization mentioned in `name`.
- **`affiliation`** *[Optional ; Not repeatable ; String]*<br>
The affiliation of the person or organization mentioned in `name`.
- **`role`** *[Optional ; Not repeatable ; String]* <br>
A succinct description of the specific contribution by the person or organization in the production of the data. <br>
- **`copyright`** *[Optional ; Not repeatable ; String]* <br>
A copyright statement for the study at the appropriate level.
- **`prod_date`** *[Optional ; Not repeatable ; String]* <br>
This is the date (preferably entered in ISO 8601 format: YYYY-MM-DD or YYYY-MM or YYYY) of the actual and final production of the version of the dataset being documented. At least the month and year should be provided. A regular expression can be entered in user templates to validate the information captured in this field.
- **`prod_place`** *[Optional ; Not repeatable ; String]* <br>
The address of the organization that produced the study.
- **`funding_agencies`** *[Optional ; Repeatable]* <br>
The source(s) of funds for the production of the study. If different funding agencies sponsored different stages of the production process, use the `role` attribute to distinguish them.
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the funding agency.
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The abbreviation (acronym) of the funding agency mentioned in `name`.
- **`grant`** *[Optional ; Not repeatable ; String]* <br>
The grant number. If an agency has provided more than one grant, list them all separated with a ";".
- **`role`** *[Optional ; Not repeatable ; String]* <br>
The specific contribution of the funding agency mentioned in `name`. This element is used when multiple funding agencies are listed to distinguish their specific contributions.<br><br>
This example shows the production statement for the Bangladesh 2018-2019 Demographic and Health Survey (DHS):
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
# ... ,
study_desc = list(
# ... ,
production_statement = list(
producers = list(
list(name = "National Institute of Population Research and Training",
abbr = "NIPORT",
role = "Primary investigator"),
list(name = "Medical Education and Family Welfare Division",
role = "Advisory"),
list(name = "Ministry of Health and Family Welfare",
abbr = "MOHFW",
role = "Advisory"),
list(name = "Mitra and Associates",
role = "Data collection - fieldwork"),
list(name = "ICF (consulting firm)",
role = "Technical assistance / DHS Program")
),
prod_date = "2019",
prod_place = "Dhaka, Bangladesh",
funding_agencies = list(
list(name = "United States Agency for International Development",
abbr = "USAID")
)
),
# ...,
)
# ...
)
```
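The description of `prod_date` mentions that a regular expression can be used to validate the date. As an illustration only, the sketch below checks a value against the three ISO 8601 forms listed above (YYYY, YYYY-MM, YYYY-MM-DD). The pattern and the helper name `is_iso8601_date` are assumptions made for this example, not part of the schema or of any template. The pattern checks the format only and does not verify that the day exists in the given month.

```r
# Assumed pattern: four-digit year, optionally followed by -MM, optionally followed by -DD.
iso_date_pattern <- "^[0-9]{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?$"

# Hypothetical helper: TRUE for each string that matches one of the three forms.
is_iso8601_date <- function(x) {
  grepl(iso_date_pattern, x)
}

# The first three values are accepted, the last two are rejected.
is_iso8601_date(c("2019", "2019-06", "2019-06-30", "30/06/2019", "2019-13"))
```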
<br>
#### Distribution statement
**`distribution_statement`** *[Optional ; Not repeatable]* <br>
A distribution statement for the study.
```json
"distribution_statement": {
"distributors": [
{
"name": "string",
"abbr": "string",
"affiliation": "string",
"uri": "string"
}
],
"contact": [
{
"name": "string",
"affiliation": "string",
"email": "string",
"uri": "string"
}
],
"depositor": [
{
"name": "string",
"abbr": "string",
"affiliation": "string",
"uri": "string"
}
],
"deposit_date": "string",
"distribution_date": "string"
}
```
<br>
- **`distributors`** *[Optional ; Repeatable]* <br>
The organization(s) designated by the author or producer to generate copies of the study output including any necessary editions or revisions.
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the distributor. It can be an individual or an organization.
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The official abbreviation of the organization mentioned in `name`.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the person or organization mentioned in `name`.<br>
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
A URL to the ordering service or download facility on a Web site.<br><br>
- **`contact`** *[Optional ; Repeatable]* <br>
Names and addresses of individuals responsible for the study. Individuals listed as contact persons will be used as resource persons regarding problems or questions raised by users.<br>
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the person or organization that can be contacted.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the person or organization mentioned in `name`.
- **`email`** *[Optional ; Not repeatable ; String]* <br>
An email address for the contact mentioned in `name`. <br>
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
A URL to the contact mentioned in `name`.<br><br>
- **`depositor`** *[Optional ; Repeatable]* <br>
The name of the person (or institution) who provided this study to the archive storing it. <br>
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the depositor. It can be an individual or an organization.
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The official abbreviation of the organization mentioned in `name`.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the person or organization mentioned in `name`.
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
A URL to the depositor<br><br>
- **`deposit_date`** *[Optional ; Not repeatable ; String]* <br>
The date that the study was deposited with the archive that originally received it. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible. <br>
- **`distribution_date`** *[Optional ; Not repeatable ; String]* <br>
The date that the study was made available for distribution/presentation. The date should be entered in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The exact date should be provided when possible. <br><br>
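The following example shows the structure of a distribution statement. Elements left empty are to be filled in with information specific to the study.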
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...
),
study_desc = list(
# ... ,
distribution_statement = list(
distributors = list(
list(name = "World Bank Microdata Library",
abbr = "WBML",
affiliation = "World Bank Group",
uri = "http://microdata.worldbank.org")
),
contact = list(
list(name = "",
affiliation = "",
email = "",
uri = "")
),
depositor = list(
list(name = "",
abbr = "",
affiliation = "",
uri = "")
),
deposit_date = "",
distribution_date = ""
),
# ...
)
# ...
)
```
<br>
#### Series statement
**`series_statement`** *[Optional ; Not repeatable]* <br>
A study may be repeated at regular intervals (such as an annual labor force survey), or be part of an international survey program (such as the MICS, DHS, LSMS and others). The series statement provides information on the series.
```json
"series_statement": {
"series_name": "string",
"series_info": "string"
}
```
<br>
- **`series_name`** *[Optional ; Not repeatable ; String]* <br>
The name of the series to which the study belongs. For example, "Living Standards Measurement Study (LSMS)" or "Demographic and Health Survey (DHS)" or "Multiple Indicator Cluster Survey VII (MICS7)". A description of the series can be provided in the element `series_info`.<br>
- **`series_info`** *[Optional ; Not repeatable ; String]* <br>
A brief description of the characteristics of the series, including when it started, how many rounds have already been implemented, and who is in charge. <br>
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...
),
study_desc = list(
# ... ,
series_statement = list(
series_name = "Multiple Indicator Cluster Survey (MICS) by UNICEF",
series_info = "The Multiple Indicator Cluster Survey, Round 3 (MICS3) is the third round of MICS surveys, previously conducted around 1995 (MICS1) and 2000 (MICS2). MICS surveys are designed by UNICEF, and implemented by national agencies in participating countries. MICS was designed to monitor various indicators identified at the World Summit for Children and the Millennium Development Goals. Many questions and indicators in MICS3 are consistent and compatible with the prior round of MICS (MICS2) but less so with MICS1, although there have been a number of changes in definition of indicators between rounds. Round 1 covered X countries, round 2 covered Y countries, and Round 3 covered Z countries."
),
# ...
),
# ...
)
```
<br>
#### Version statement
**`version_statement`** *[Optional ; Not repeatable]* <br>
Version statement for the study.
```json
"version_statement": {
"version": "string",
"version_date": "string",
"version_resp": "string",
"version_notes": "string"
}
```
<br>
The version statement should contain a version number followed by a version label. The version number should follow a standard convention to be adopted by the data repository. We recommend that larger series be defined by a number to the left of a decimal and iterations of the same series by a sequential number that identifies the release. The left number could for example be (0) for the raw, unedited dataset; (1) for the edited, non-anonymized dataset, available for internal use at the data producing agency; and (2) for the edited dataset, prepared for dissemination to secondary users (possibly anonymized). Example:
v0: Basic raw data, resulting from the data capture process, before any data editing is implemented.<br>
v1.0: Edited data, first iteration, for internal use only. <br>
v1.1: Edited data, second iteration, for internal use only.<br>
v2.1: Edited data, anonymized and packaged for public distribution. <br>
- **`version`** *[Optional ; Not repeatable ; String]* <br>
The version number, also known as release or edition.
- **`version_date`** *[Optional ; Not repeatable ; String]* <br>
The date of this version of the dataset, preferably entered in ISO 8601 format (YYYY-MM-DD).
- **`version_resp`** *[Optional ; Not repeatable ; String]* <br>
The person(s) or organization(s) responsible for this version of the study.
- **`version_notes`** *[Optional ; Not repeatable ; String]* <br>
Version notes should provide a brief report on the changes made through the versioning process. The note should indicate how this version differs from other versions of the same dataset.<br>
<br>
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
# ...
study_desc = list(
# ... ,
version_statement = list(
version = "Version 1.1",
version_date = "2021-02-09",
version_resp = "National Statistics Office, Data Processing unit",
version_notes = "This dataset contains the edited version of the data that were used to produce the Final Survey Report. It is equivalent to version 1.0 of the dataset, except for the addition of a variable (variable weight2) containing a calibrated version of the original sample weights (variable weight)."
),
# ...
),
# ...
)
```
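The numbering convention suggested above can be made explicit in code. The sketch below assumes version strings of the form `v0`, `v1.1`, `v2.1` as in the example list, and maps the number to the left of the decimal to its stage. The helper name `version_stage` and the label wording are hypothetical; a repository adopting another convention would adapt both.

```r
# Stage labels, paraphrasing the convention suggested in the text.
stage_labels <- c("0" = "raw data",
                  "1" = "edited data, internal use",
                  "2" = "edited data, prepared for dissemination")

# Hypothetical helper: extract the number left of the decimal and look up its stage.
# Strings that do not follow the assumed pattern yield NA.
version_stage <- function(version) {
  major <- sub("^v([0-9]+)(\\.[0-9]+)?$", "\\1", version)
  unname(stage_labels[major])
}

version_stage(c("v0", "v1.1", "v2.1"))
```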
<br>
#### Bibliographic citation
**`bib_citation`** *[Optional ; Not repeatable ; String]* <br>
Complete bibliographic reference containing all of the standard elements of a citation that can be used to cite the study. The `bib_citation_format` (see below) is provided to enable specification of the particular citation style used, e.g., APA, MLA, or Chicago.
#### Bibliographic citation format
**`bib_citation_format`** *[Optional ; Not repeatable ; String]* <br>
This element is used to specify the particular citation style used in the field `bib_citation` described above, e.g., APA, MLA, or Chicago. <br>
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...
),
study_desc = list(
# ... ,
bib_citation = "",
bib_citation_format = ""
# ...
),
# ...
)
```
<br>
#### Holdings
**`holdings`** *[Optional ; Repeatable]* <br>
Information concerning either the physical or electronic holdings of the study being described.
```json
"holdings": [
{
"name": "string",
"location": "string",
"callno": "string",
"uri": "string"
}
]
```
<br>
- **`name`** *[Optional ; Not repeatable ; String]* <br>
Name of the physical or electronic holdings of the cited study.<br>
- **`location`** *[Optional ; Not repeatable ; String]* <br>
The physical location where a copy of the study is held.<br>
- **`callno`** *[Optional ; Not repeatable ; String]* <br>
The call number at the location specified in `location`.<br>
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
A URL for accessing the electronic copy of the cited study from the location mentioned in `name`.<br>
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...
),
study_desc = list(
# ... ,
holdings = list(
list(name = "World Bank Microdata Library",
location = "World Bank, Development Data Group",
uri = "http://microdata.worldbank.org")
),
# ...
),
# ...
)
```
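Elements marked as repeatable appear as arrays of objects in the JSON schema. In R, this corresponds to an unnamed list whose entries are named lists, whereas a non-repeatable element is a single named list. The following sketch, in base R, tests for that shape. The helper name `is_array_of_objects` is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE when `x` is an unnamed list whose entries are all named lists,
# which is the shape that corresponds to a JSON array of objects.
is_array_of_objects <- function(x) {
  is.list(x) &&
    is.null(names(x)) &&
    all(vapply(x, function(e) is.list(e) && !is.null(names(e)), logical(1)))
}

single_entry <- list(name = "World Bank Microdata Library",
                     uri = "http://microdata.worldbank.org")

is_array_of_objects(single_entry)        # a bare named list: FALSE
is_array_of_objects(list(single_entry))  # wrapped in an outer list: TRUE
```

Wrapping a single entry in an outer `list()` keeps the structure valid when further entries are added later.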
<br>
#### Study notes
**`study_notes`** *[Optional ; Not repeatable]* <br>
This element can be used to provide additional information on the study which cannot be accommodated in the specific metadata elements of the schema, in the form of a free text field.
#### Study authorization
**`study_authorization`** *[Optional ; Not repeatable]* <br>
```json
"study_authorization": {
"date": "string",
"agency": [
{
"name": "string",
"affiliation": "string",
"abbr": "string"
}
],
"authorization_statement": "string"
}
```
<br>
Provides structured information on the agency that authorized the study, the date of authorization, and an authorization statement. This element will be used when special legislation is required to conduct the data collection (for example a Census Act) or when the approval of an Ethics Board or other body is required to collect the data.
- **`date`** *[Optional ; Not repeatable ; String]* <br>
The date, preferably entered in ISO 8601 format (YYYY-MM-DD), when the authorization to conduct the study was granted.<br>
- **`agency`** *[Optional ; Repeatable]* <br>
Identification of the agency that authorized the study.
- **`name`** *[Optional ; Not repeatable ; String]* <br>
Name of the agent or agency that authorized the study.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The institutional affiliation of the authorizing agent or agency mentioned in `name`.
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The abbreviation of the authorizing agent's or agency's name.<br><br>
- **`authorization_statement`** *[Optional ; Not repeatable ; String]* <br>
The text of the authorization (or a description and link to a document or other resource containing the authorization statement). <br>
```{r, indent="", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...
),
study_desc = list(
# ... ,
study_authorization = list(
date = "2018-02-23",
agency = list(
list(name = "Institutional Review Board of the University of Popstan",
abbr = "IRB-UP")
),
authorization_statement = "The required documentation covering the study purpose, disclosure information, questionnaire content, and consent statements was delivered to the IRB-UP on 2017-12-27 and was reviewed by the compliance officer. Statement of authorization for the described study was issued on 2018-02-23."
),
# ...
),
# ...
)
```
<br>
#### Study information
**`study_info`** *[Required ; Not repeatable]* <br>
This section contains the metadata elements needed to describe the core elements of a study including the dates of data collection and reference period, the country and other geographic coverage information, and more. These elements are not required in the DDI standard, but documenting a study without providing at least some of this information would make the metadata mostly irrelevant.
```json
"study_info": {
"study_budget": "string",
"keywords": [],
"topics": [],
"abstract": "string",
"time_periods": [],
"coll_dates": [],
"nation": [],
"bbox": [],
"bound_poly": [],
"geog_coverage": "string",
"geog_coverage_notes": "string",
"geog_unit": "string",
"analysis_unit": "string",
"universe": "string",
"data_kind": "string",
"notes": "string",
"quality_statement": {},
"ex_post_evaluation": {}
}
```
<br>
- **`study_budget`** *[Optional ; Not repeatable ; String]* <br>
This is a free-text field, not a structured element. The budget of a study will ideally be described by budget line. The currency used to describe the budget should be specified. This element can also be used to document issues related to the budget (e.g., documenting possible under-run and over-run).<br>
```{r, indent=" ", eval=F, echo=T}
my_ddi <- list(
# ... ,
study_desc = list(
# ... ,
study_info = list(
study_budget = "The study had a total budget of 500,000 USD allocated as follows:
By type of expense:
- Staff: 150,000 USD
- Consultants (incl. interviewers): 180,000 USD
- Travel: 50,000 USD
- Equipment: 90,000 USD
- Other: 30,000 USD
By activity
- Study design (questionnaire design and testing, sampling, piloting): 100,000 USD
- Data collection: 250,000 USD
- Data processing and tabulation: 80,000 USD
- Analysis and dissemination: 50,000 USD
- Evaluation: 20,000 USD
By source of funding:
- Government budget: 300,000 USD
- External sponsors
- Grant ABC001 - 150,000 USD
- Grant XYZ987 - 50,000 USD",
# ...
),
# ...
),
# ...
)
```
<br>
- **`keywords`** *[Optional ; Repeatable]* <br>
```json
"keywords": [
{
"keyword": "string",
"vocab": "string",
"uri": "string"
}
]
```
<br>
Keywords are words or phrases that describe salient aspects of a data collection's content. The addition of keywords can significantly improve the discoverability of data. Keywords can summarize and improve the description of the content or subject matter of a study. For example, keywords "poverty", "inequality", "welfare", and "prosperity" could be attached to a household income survey used to generate poverty and inequality indicators (for which these keywords may not appear anywhere else in the metadata). A controlled vocabulary can be employed. Keywords can be selected from a standard thesaurus, preferably an international, multilingual thesaurus. <br>
- **`keyword`** *[Required ; Not repeatable ; String]* <br>
A keyword (or phrase).
- **`vocab`** *[Optional ; Not repeatable ; String]* <br>
The controlled vocabulary from which the keyword is extracted, if any.
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
The URI of the controlled vocabulary used, if any.<br>
```{r, indent=" ", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...
),
study_desc = list(
# ... ,
study_info = list(
# ... ,
keywords = list(
list(keyword = "poverty",
vocab = "UNESCO Thesaurus",
uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"),
list(keyword = "income distribution",
vocab = "UNESCO Thesaurus",
uri = "http://vocabularies.unesco.org/browser/thesaurus/en/"),
list(keyword = "inequality",
vocab = "UNESCO Thesaurus",
uri = "http://vocabularies.unesco.org/browser/thesaurus/en/")
),
# ...
),
# ...
),
# ...
)
```
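When all keywords come from the same controlled vocabulary, the repeated `vocab` and `uri` values do not need to be typed for each entry. The sketch below builds the same `keywords` structure as the example above from a character vector, using base R only. The object name `terms` is an assumption made for this illustration.

```r
# Keywords taken from the example above.
terms <- c("poverty", "income distribution", "inequality")

# One named list per keyword, all sharing the same vocabulary and URI.
keywords <- lapply(terms, function(term) {
  list(keyword = term,
       vocab = "UNESCO Thesaurus",
       uri = "http://vocabularies.unesco.org/browser/thesaurus/en/")
})

length(keywords)        # number of keyword entries
keywords[[2]]$keyword   # the second keyword
```

The resulting object can be assigned to the `keywords` element of `study_info` in place of the hand-written list.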
<br>
- **`topics`** *[Optional ; Repeatable]* <br>
The `topics` field indicates the broad substantive topic(s) that the study covers. A topic classification facilitates referencing and searches in on-line data catalogs.
```json
"topics": [
{
"topic": "string",
"vocab": "string",
"uri": "string"
}
]
```
<br>
- **`topic`** *[Required ; Not repeatable]* <br>
The label of the topic. Topics should be selected from a standard controlled vocabulary such as the [Council of European Social Science Data Archives (CESSDA) Topic Classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification).<br>
- **`vocab`** *[Required ; Not repeatable]* <br>
The specification (name including the version) of the controlled vocabulary in use.<br>
- **`uri`** *[Required ; Not repeatable]* <br>
A link (URL) to the controlled vocabulary website. <br>
```{r, indent=" ", eval=F, echo=T}
my_ddi <- list(
doc_desc = list(
# ...