IMPACT OF WEIGHTING SYSTEMS

CHINTEX Working Paper #18 Work-package 3 Date: September 2003

Sébastien Badina, Uwe Warner (CEPS/INSTEAD)

Impact of weighting systems on panel surveys (ECHP and SOEP, PSELL2)

CHINTEX - The Change from Input Harmonisation to Ex-post Harmonisation in National Samples of the European Community Household Panel – Implications on Data Quality

Financed by the European Commission under contract number IST-1999-11101

CHINTEX WP 3

IMPACT OF WEIGHTING SYSTEMS ON PANEL SURVEYS (ECHP and SOEP, PSELL2) by Sébastien Badina Uwe Warner September 2002

Contents

The Chintex Project
  Presentation of the Chintex Project
  Presentation of the Work-package
  Weighting of the German ECHP Data Bases
Weighting of the various waves of the PSELL2 survey
  The PSELL2
  Introduction to the survey
  Sampling weights of the first wave
    The first stage
    The second stage
  Treatment of non-response
  Weighting of the estimator
  Adjustment of weights according to an information (CALMAR)
    The aim of CALMAR
    The CALMAR principle
    Theoretical aspect of CALMAR
  Subsequent waves of the survey
  Longitudinal weighting of subsequent waves
  Cross-sectional weighting of subsequent waves
    Adopted method for the waves 2t (t≥1)
    Method adopted for the waves 2t+1 (t≥1)
The CHINTEX project: wave 1
  Difference between the weighting procedures of the PSELL2 and the ECHP at wave 1
  Application of the calmar version ECHP on the PSELL2 data
  Impact of weighting systems
    At the level of individual variables
    At household level
  Conclusion
  Distribution of some demographic variables
  Looking on the central tendencies of household variables
    Choice of variables studied at household level
    Processing of the studied variables
    Calculating the ratio of means
    Processing for the Mratio
    Aim of this ratio
    Variables whose mean exhibits a difference of less than 5%
    Variables whose mean exhibits a difference of less than 10%
    Variables whose mean exhibits a difference of more than 10%
  Looking on the central tendencies of individual variables
    Choice of variables studied at individual level
    Processing of the studied variables
    Calculating the ratio of means
    Processing for the Mratio
    Aim of this ratio
    Variables whose mean exhibits a difference of less than 5%
    Variables whose mean exhibits a difference of less than 10%
    Variables whose mean exhibits a difference of more than 10%
    Conclusion from looking at the central tendencies
Comparing the means of some income variables after the ECHP weight and the PSELL2 weight
  Means
Comparing the variances of some income variables after the ECHP weight and the PSELL2 weight
  Variances
Comparing the variation coefficients of some income variables after the ECHP weight and the PSELL2 weight
  Coefficient of variation
References

The Chintex Project

Presentation of the Chintex Project

In 1994, EUROSTAT launched the European Community Household Panel (ECHP) in order to monitor the effects of the completion of the internal market on living and working conditions. It is a longitudinal study of about 65000 households resident in 14 member states of the European Union. These households are interviewed each year about their present and previous employment and about the income that all household members receive from various sources. In addition to this main subject, other information concerning living conditions, health, education etc. is collected. One of the main strengths of the ECHP lies in the comparability of the results obtained in each of the 14 member states: the questionnaire and the statistical procedures applied to the different national sub-samples have been defined within a common methodological framework.

Because of financial restrictions, the project was stopped in three countries: Germany, Luxembourg and the UK. In these countries a similar household panel already existed. These panels are run by academic institutions and started long before the ECHP. Consequently, three ECHP sub-samples were discontinued and replaced by already existing national panels: the German Socio-Economic Panel (SOEP), the Luxembourg household panel (PSELL2), which we examine here, and the British Household Panel Survey (BHPS). These three panels all differ from the original study in various respects: they use different questionnaires and have different rules for following up households from year to year. Consequently, each national file has to be converted into a substitute for the original ECHP file. This conversion, called the ex-post harmonization procedure, raises the problem of harmonizing the data. Data harmonization does not only mean converting the data but also harmonizing the imputation and weighting rules that are applied to them.
Here we analyze the impact of weighting systems. The change from input harmonization to ex-post harmonization in national samples of the European Community Household Panel has implications for data quality. Alongside the “Labour Force Survey”, the “Household Budget Survey”, the “Time Use Survey” and other surveys not based on households as the observation unit, the ECHP is part of the European Statistical System (ESS). Seven dimensions of data quality are applied to data of the European Statistical System:

1. Relevance
2. Accuracy
3. Timeliness and punctuality
4. Accessibility and clarity
5. Comparability
6. Coherence
7. Completeness

Relevance indicates how well the data answer the demands and needs of users. Accuracy concerns the closeness of the observed and estimated survey values to the unknown true values of the population. Timeliness describes the punctuality of the statistics produced by the ESS. Accessibility and clarity concern the availability of the statistical data in the form users desire. Comparability over time, between geographical areas and between domains is an important issue for the ECHP. Coherence of panel data statistics with estimates from other data sources is important for describing the social and economic situation of the EU and its member states. Completeness refers to the availability of all the information needed to use and analyze the ECHP data (cf. Eurostat 2000).

Presentation of the Work-package

The concept of the input-harmonized ECHP includes not only variable and value conversion but also the harmonization of the sample design and the weighting rules. The work-package (WP) describes empirically the quantitative effects of harmonizing these statistical tools. To this end, it applies the common recommendations to the different data sets and compares the results. The variability of population estimates across different data sources for one country is investigated. The German national panel (SOEP) is chosen here because it shows the largest methodological differences from the ECHP with respect to weighting. With comparability as the highest priority, we keep the weighting procedures constant and calculate the population estimates using the SOEP data. The work-package contributes to the methodological work on weighting household panels. Different weighting schemes for longitudinal tables are tested; the results are intended to improve not only the weighting practice of household panels but also the work on enterprise panels, which face similar problems in comparing nationally harmonized data. In the course of the work-package we expect to quantify how far the differences between the D-ECHP and the SOEP estimates are reduced by using the same imputation strategy and/or the same weighting rules.
General suggestions regarding the debate on the harmonisation of imputation rules and weighting schemes are presented at the end of our study. To start our work, we compiled documentation on the differences in weighting rules between the D-ECHP, the SOEP and the PSELL2 at the different stages of harmonisation. Population estimates that are considerably influenced by the weights (e.g. income and labour force participation) are compared between the D-ECHP, the SOEP and the PSELL2:

1. SOEP with design weights versus D-ECHP with design weights,
2. SOEP with SOEP weights versus D-ECHP with ECHP design, non-response and final weights,
3. SOEP with ECHP weights (= converted SOEP) versus D-ECHP with ECHP weights.

The following graph presents the use of the surveys and clarifies the names and terms used in this paper. D-ECHP and LUX-ECHP stand for the input-harmonized national surveys based on the common sample design and survey instruments of the ECHP. C-SOEP and C-PSELL refer to the data converted into the common ECHP file, variable and value layout.

SOEP and PSELL are the national panel surveys using the national sample designs and survey instruments. ECHP-PDB is the common data format of the producer data base. ECHP-UDB is the “final product”; this user data base is distributed by EUROSTAT for scientific and official use. BHPS, C-BHPS and UK-ECHP are not used for this work-package, but are shown in the diagram.

Graphical Presentation of the Surveys Contributing to the ECHP

[Figure: diagram with the boxes UK-ECHP, D-ECHP, LUX-ECHP; UDB-ECHP; PDB-ECHP; C-BHPS, C-SOEP, C-PSELL; BHPS, SOEP, PSELL]

Weighting of the German ECHP Data Bases

The calculation of the weights for individual members of a household takes into account the sampling probability, the non-response correction and the adjustment of the sample distribution of specific household and person characteristics to the population distribution of the same items. The design weight is the initial step: it is the inverse of the inclusion probability of a unit, as generated by the sample design. Normalizing this design weight to a mean of 1 gives:

$$\text{household design weight}_i = \frac{1/\pi_i}{\left(\sum_{j=1}^{n} 1/\pi_j\right)/n}$$

where $\pi_i$ is the probability of selection of household $i$ and $n$ is the number of households in the sample.
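The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the ECHP; the function name `normalized_design_weights` and the example inclusion probabilities are assumptions made for the example.

```python
def normalized_design_weights(inclusion_probs):
    """Return design weights 1/pi_i rescaled so that their mean equals 1."""
    if not inclusion_probs:
        raise ValueError("at least one household is required")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    raw = [1.0 / p for p in inclusion_probs]   # inverse inclusion probabilities
    mean_raw = sum(raw) / len(raw)             # (sum_j 1/pi_j) / n
    return [w / mean_raw for w in raw]


# Hypothetical inclusion probabilities for four sampled households.
example_probs = [0.01, 0.02, 0.02, 0.04]
design_weights = normalized_design_weights(example_probs)
```

Households with a lower probability of selection receive a larger weight, and the weights average to 1 by construction.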

The non-response adjustment is defined by a correction factor, assuming that all units in the same stratum k respond with an equal probability.

$$\text{non-response correction}_k = \frac{\text{number of sampled households in stratum } k}{\text{number of responding households in stratum } k}$$
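As a rough sketch of this correction factor, the ratio can be computed per stratum from two lists of stratum labels. The function name, the stratum labels "A" and "B" and the counts are hypothetical and only serve the illustration.

```python
from collections import Counter


def nonresponse_corrections(sampled_strata, responding_strata):
    """Return, per stratum k, sampled households / responding households."""
    sampled = Counter(sampled_strata)
    responding = Counter(responding_strata)
    corrections = {}
    for stratum, n_sampled in sampled.items():
        n_responding = responding.get(stratum, 0)
        if n_responding == 0:
            # The ratio is undefined when nobody in the stratum responds.
            raise ValueError(f"no respondents in stratum {stratum!r}")
        corrections[stratum] = n_sampled / n_responding
    return corrections


# Hypothetical sample: 10 households drawn in stratum A, 8 in stratum B.
sampled = ["A"] * 10 + ["B"] * 8
responding = ["A"] * 5 + ["B"] * 8
factors = nonresponse_corrections(sampled, responding)
```

In this example half of stratum A responds, so each responding household there counts double, while stratum B with full response keeps a factor of 1.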

The final weight includes the design weight, the non-response correction and the calibration of the sample to the population. This post-stratification makes the sample reflect the population distribution of the classification variables. The impact of the various weights is illustrated on:

- Household size (total number of household members at the time of the interview)
- Household type (sociological typology)
- Household type (economical typology)
- Household type (economical typology – focused on persons aged 65 or more)
- Sex
- ILO main activity status at the time of interview
- Main source of personal income

Comparing the final-weighted D-ECHP with the final-weighted C-SOEP, the distributions of the household characteristics barely differ.
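The text names the calibration step without specifying it. As a rough illustration only, assuming calibration on a single categorical variable (simple post-stratification) rather than the procedure actually applied to the ECHP, the adjustment could look as follows. The function name, the category labels and the target shares are hypothetical.

```python
def poststratify(weights, categories, population_shares):
    """Rescale weights so weighted category shares equal population_shares."""
    if len(weights) != len(categories):
        raise ValueError("weights and categories must have the same length")
    total = sum(weights)
    # Weighted total currently carried by each category.
    weighted_totals = {}
    for w, c in zip(weights, categories):
        weighted_totals[c] = weighted_totals.get(c, 0.0) + w
    adjusted = []
    for w, c in zip(weights, categories):
        target_total = population_shares[c] * total  # total the category should carry
        adjusted.append(w * target_total / weighted_totals[c])
    return adjusted


# Hypothetical case: one-person households are 25% of the sample
# but 50% of the population.
sizes = ["1 person", "2+ persons", "2+ persons", "2+ persons"]
start_weights = [1.0, 1.0, 1.0, 1.0]
targets = {"1 person": 0.5, "2+ persons": 0.5}
final_weights = poststratify(start_weights, sizes, targets)
```

The sum of the weights is unchanged; only their distribution across categories moves to the population shares, which is the kind of shift visible between the non-response weighted and the final weighted columns of the tables.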


Comparing the different weighting steps applied to the D-ECHP, we see minor changes from the unweighted distribution to the design-weighted sample, and also a small impact of the non-response correction. The substantial adjustment is made by the calibration.

The distributions of income items are demonstrated using the following household income items:

- TOTAL NET HOUSEHOLD INCOME (DETAILED, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- TOTAL NET INCOME FROM WORK (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- WAGE AND SALARY EARNINGS
- WAGE AND SALARY EARNINGS (REGULAR)
- WAGE AND SALARY EARNINGS (LUMP SUM)
- SELF-EMPLOYMENT EARNINGS (NET)
- NON-WORK PRIVATE INCOME (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- CAPITAL INCOME
- PROPERTY/RENTAL INCOME
- PROPERTY/RENTAL INCOME, GROSS, YEAR PRIOR TO THE SURVEY
- PRIVATE TRANSFERS RECEIVED
- TOTAL SOCIAL TRANSFER RECEIPTS (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- UNEMPLOYMENT RELATED BENEFITS
- OLD-AGE/SURVIVORS' BENEFITS
- FAMILY-RELATED ALLOWANCES
- SICKNESS/INVALIDITY BENEFITS
- EDUCATION-RELATED ALLOWANCES
- HOUSING ALLOWANCE
- CURRENT TOTAL MONTHLY NET HOUSEHOLD INCOME (SUMMARY QUESTION, NC, YEAR OF THE SURVEY)
- CURRENT WAGE AND SALARY EARNINGS, NET (MONTHLY, NC, YEAR OF THE SURVEY)
- CURRENT WAGE AND SALARY EARNINGS, GROSS (MONTHLY, NC, YEAR OF THE SURVEY)

At the individual level, the following personal income items are used:

- TOTAL NET PERSONAL INCOME (DETAILED, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- TOTAL NET INCOME FROM WORK (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- WAGE AND SALARY EARNINGS (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- WAGE AND SALARY EARNINGS (REGULAR)
- WAGE AND SALARY EARNINGS (LUMP SUM)
- SELF-EMPLOYMENT INCOME (NET)
- NON-WORK PRIVATE INCOME (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY)
- CAPITAL INCOME
- ASSIGNED PROPERTY/RENTAL INCOME
- PRIVATE TRANSFERS RECEIVED
- TOTAL SOCIAL/SOCIAL INSURANCE RECEIPTS (NET, NC, YEAR PRIOR TO THE SURVEY)
- UNEMPLOYMENT RELATED BENEFITS
- OLD-AGE/SURVIVORS' BENEFITS
- OLD-AGE RELATED BENEFITS
- SURVIVORS' BENEFITS
- FAMILY-RELATED ALLOWANCES
- SICKNESS/INVALIDITY BENEFITS
- EDUCATION-RELATED ALLOWANCES
- ANY OTHER (PERSONAL) BENEFITS
- ASSIGNED SOCIAL ASSISTANCE
- ASSIGNED HOUSING ALLOWANCE
- CURRENT WAGE AND SALARY EARNINGS, NET (MONTHLY)
- CURRENT WAGE AND SALARY EARNINGS, GROSS (MONTHLY)

The final-weighted D-ECHP and the final-weighted C-SOEP show larger differences for income items where the number of valid observations also differs. The different weighting steps applied to the D-ECHP data do not produce significant changes in the income items.

HD001 Household size (total number of household members at present)

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Design % | Nonres % | Final %
1 One Person Household | 1162 | 23,4 | 22,7 | 23,0 | 34,0
2 | 1756 | 35,3 | 35,4 | 35,5 | 31,7
3 | 906 | 18,2 | 18,4 | 18,2 | 16,6
4 | 810 | 16,3 | 16,5 | 16,3 | 12,9
5 5 and more persons | 334 | 6,7 | 6,9 | 7,0 | 4,7
Total | 4968 | 100,0 | 100,0 | 100,0 | 100,0

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
1 One Person Household | 1333 | 21,5 | 34,0
2 | 1963 | 31,6 | 31,7
3 | 1330 | 21,4 | 16,9
4 | 1092 | 17,6 | 12,5
5 5 and more persons | 489 | 7,9 | 4,9
Total | 6207 | 100,0 | 100,0

Mikrozensus April 1994

Category | N | %
1 One Person Household | 12747 | 34,74
2 | 11624 | 31,68
3 | 5902 | 16,08
4 | 4669 | 12,72
5 5 and more persons | 1753 | 4,78
Total | 36695 | 100,00

Source: Leben und Arbeiten in Deutschland. Ergebnisse des Mikrozensus 2001, p. 55
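The percentage columns of these tables are category shares, unweighted or weighted. A minimal sketch of that computation, with a hypothetical function name `weighted_shares`: each category's share is its summed weight divided by the total weight. Passing the unweighted D-ECHP counts of HD001 as the weights reproduces the unweighted percentages reported for the D-ECHP.

```python
def weighted_shares(categories, weights):
    """Return the percentage share of total weight falling in each category."""
    if len(categories) != len(weights):
        raise ValueError("categories and weights must have the same length")
    totals = {}
    for c, w in zip(categories, weights):
        totals[c] = totals.get(c, 0.0) + w
    grand_total = sum(totals.values())
    return {c: 100.0 * t / grand_total for c, t in totals.items()}


# Unweighted D-ECHP counts by household size (HD001), taken from the table.
size_codes = [1, 2, 3, 4, 5]
counts = [1162, 1756, 906, 810, 334]
shares = weighted_shares(size_codes, counts)
rounded = {code: round(share, 1) for code, share in shares.items()}
```

With one record per household and the design, non-response or final weight as `weights`, the same function would give the corresponding weighted column.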

HD006 Household type (sociol. typology)

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Design % | Nonres % | Final %
Single adults
1 One person aged 65 or more | 398 | 8,1 | 8,0 | 7,9 | 14,1
2 One person aged 30-64 | 555 | 11,4 | 11,0 | 11,3 | 15,1
3 One person aged less than 30 | 209 | 4,3 | 4,1 | 4,2 | 5,5
4 Single parent with one or more children (all children aged less than 16) | 119 | 2,4 | 2,4 | 2,4 | 2,3
5 Single parent with one or more children (at least one child aged 16 or more) | 151 | 3,1 | 3,0 | 3,0 | 3,0
Couples
6 Couple without children (at least one person aged 65 or more) | 524 | 10,7 | 10,7 | 10,6 | 10,2
7 Couple without children (both persons aged less than 65) | 1002 | 20,5 | 20,8 | 20,9 | 17,3
8 Couple with one child (child aged less than 16) | 420 | 8,6 | 8,6 | 8,5 | 7,6
9 Couple with two children (all children aged less than 16) | 465 | 9,5 | 9,7 | 9,7 | 7,3
10 Couple with three children or more (all children aged less than 16) | 152 | 3,1 | 3,2 | 3,3 | 2,0
11 Couple with one or more children (at least one child aged 16 or more) | 809 | 16,6 | 16,8 | 16,6 | 13,9
Others
12 Other households | 81 | 1,7 | 1,7 | 1,6 | 1,7
valid observations | 4885 | 100,0 | 100,0 | 100,0 | 100,0
-9 Missing | 83
Total | 4968

HD006 Household type (sociol. typology)

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
Single adults
1 One person aged 65 or more | 522 | 8,8 | 14,6
2 One person aged 30-64 | 554 | 9,3 | 14,9
3 One person aged less than 30 | 257 | 4,3 | 5,6
4 Single parent with one or more children (all children aged less than 16) | 135 | 2,3 | 2,3
5 Single parent with one or more children (at least one child aged 16 or more) | 212 | 3,6 | 3,1
Couples
6 Couple without children (at least one person aged 65 or more) | 466 | 7,9 | 10,2
7 Couple without children (both persons aged less than 65) | 1196 | 20,2 | 17,2
8 Couple with one child (child aged less than 16) | 602 | 10,1 | 8,1
9 Couple with two children (all children aged less than 16) | 576 | 9,7 | 6,8
10 Couple with three children or more (all children aged less than 16) | 183 | 3,1 | 1,9
11 Couple with one or more children (at least one child aged 16 or more) | 1168 | 19,7 | 14,3
Others
12 Other households | 62 | 1,0 | 1,0
valid observations | 5933 | 100,0 | 100,0
-9 Missing | 274
Total | 6207

HD006A Household type (econom. typology)

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Design % | Nonres % | Final %
Households without dependent children
1 1-person household: Male under 30 | 105 | 2,1 | 2,0 | 2,0 | 2,8
2 1-person household: Male aged 30-64 | 286 | 5,8 | 5,7 | 5,9 | 8,1
3 1-person household: Male aged 65 or more | 66 | 1,3 | 1,3 | 1,3 | 2,1
4 1-person household: Female under 30 | 104 | 2,1 | 2,1 | 2,1 | 2,7
5 1-person household: Female aged 30-64 | 269 | 5,5 | 5,3 | 5,4 | 6,9
6 1-person household: Female aged 65 or more | 332 | 6,8 | 6,7 | 6,6 | 11,9
7 2 adults without dependent child with at least one person aged 65 or more | 580 | 11,9 | 11,8 | 11,7 | 11,4
8 2 adults without dependent child with both under 65 | 1086 | 22,2 | 22,4 | 22,6 | 18,9
9 Other household without dependent children | 387 | 7,9 | 8,1 | 8,0 | 7,4
Households with dependent children
10 Single parents with 1+ dependent child | 141 | 2,9 | 2,8 | 2,9 | 2,8
11 2 adults with 1 dependent child | 543 | 11,1 | 11,1 | 11,0 | 9,9
12 2 adults with 2 dependent children | 558 | 11,4 | 11,6 | 11,5 | 8,9
13 2 adults with 3 or more dependent children | 193 | 3,9 | 4,0 | 4,1 | 2,6
14 Other household with dependent children | 240 | 4,9 | 5,0 | 4,8 | 3,7
valid observations | 4890 | 100,0 | 100,0 | 100,0 | 100,0
-9 Missing | 78
Total | 4968

HD006A Household type (econom. typology)

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
Households without dependent children
1 1-person household: Male under 30 | 126 | 2,1 | 2,8
2 1-person household: Male aged 30-64 | 284 | 4,7 | 8,2
3 1-person household: Male aged 65 or more | 70 | 1,1 | 2,2
4 1-person household: Female under 30 | 131 | 2,2 | 2,8
5 1-person household: Female aged 30-64 | 270 | 4,4 | 6,5
6 1-person household: Female aged 65 or more | 452 | 7,4 | 12,3
7 2 adults without dependent child with at least one person aged 65 or more | 525 | 8,6 | 11,3
8 2 adults without dependent child with both under 65 | 1303 | 21,4 | 18,4
9 Other household without dependent children | 609 | 10,0 | 7,8
Households with dependent children
10 Single parents with 1+ dependent child | 166 | 2,7 | 2,8
11 2 adults with 1 dependent child | 790 | 13,0 | 10,1
12 2 adults with 2 dependent children | 715 | 11,7 | 8,3
13 2 adults with 3 or more dependent children | 258 | 4,2 | 2,8
14 Other household with dependent children | 390 | 6,4 | 3,7
valid observations | 6089 | 100,0 | 100,0
-9 Missing | 118
Total | 6207

HD006B Household type (economical typology – focused on persons aged 65 or more)

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Design % | Nonres % | Final %
Households without dependent children
1 1-person household: Male under 65 | 391 | 8,0 | 7,8 | 7,9 | 10,9
2 1-person household: Male aged 65 or more | 66 | 1,3 | 1,3 | 1,3 | 2,1
3 1-person household: Female under 65 | 373 | 7,6 | 7,4 | 7,6 | 9,6
4 1-person household: Female aged 65 or more | 332 | 6,8 | 6,7 | 6,6 | 11,9
5 2 adults without dependent child with both under 65 | 1086 | 22,2 | 22,4 | 22,6 | 18,9
6 2 adults without dependent child with one person aged 65 or more | 208 | 4,3 | 4,1 | 4,2 | 3,8
7 2 adults without dependent child with both aged 65 or more | 372 | 7,6 | 7,7 | 7,5 | 7,6
9 Other household without dependent children | 387 | 7,9 | 8,1 | 8,0 | 7,4
Households with dependent children
10 Single parents with 1 or more dependent child | 141 | 2,9 | 2,8 | 2,9 | 2,8
11 2 adults with 1 dependent child | 543 | 11,1 | 11,1 | 11,0 | 9,9
12 2 adults with 2 dependent children | 558 | 11,4 | 11,6 | 11,5 | 8,9
13 2 adults with 3 or more dependent children | 193 | 3,9 | 4,0 | 4,1 | 2,6
14 Other household with dependent children | 240 | 4,9 | 5,0 | 4,8 | 3,7
valid observations | 4890 | 100,0 | 100,0 | 100,0 | 100,0
-9 Missing | 78
Total | 4968

HD006B Household type (economical typology – focused on persons aged 65 or more)

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
Households without dependent children
1 1-person household: Male under 65 | 410 | 6,7 | 11,0
2 1-person household: Male aged 65 or more | 70 | 1,1 | 2,2
3 1-person household: Female under 65 | 401 | 6,6 | 9,3
4 1-person household: Female aged 65 or more | 452 | 7,4 | 12,3
5 2 adults without dependent child with both under 65 | 1303 | 21,4 | 18,4
6 2 adults without dependent child with one person aged 65 or more | 181 | 3,0 | 3,8
7 2 adults without dependent child with both aged 65 or more | 344 | 5,6 | 7,5
9 Other household without dependent children | 609 | 10,0 | 7,8
Households with dependent children
10 Single parents with 1 or more dependent child | 166 | 2,7 | 2,8
11 2 adults with 1 dependent child | 790 | 13,0 | 10,1
12 2 adults with 2 dependent children | 715 | 11,7 | 8,3
13 2 adults with 3 or more dependent children | 258 | 4,2 | 2,8
14 Other household with dependent children | 390 | 6,4 | 3,7
valid observations | 6089 | 100,0 | 100,0
-9 Missing | 118
Total | 6207

PD004 Sex

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Final %
1 Male | 4632 | 48,81 | 47,69
2 Female | 4858 | 51,19 | 52,31
Total | 9490 | 100,00 | 100,00

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
1 Male | 5886 | 48,12 | 47,02
2 Female | 6347 | 51,88 | 52,98
Total | 12233 | 100,00 | 100,00

Mikrozensus April 1991

Category | N | %
Male | 38052 | 48,15
Female | 40978 | 51,85
Total | 79030 | 100,00

Source: Leben und Arbeiten in Deutschland. Ergebnisse des Mikrozensus 2001, p. 64

PE003 ILO main activity status at the time of interview

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Final %
1 normally working (working 15+ hours / week) | 5185 | 54,67 | 50,93
2 currently working (working less than 15 hours / week) | 465 | 4,90 | 3,64
3 unemployed | 447 | 4,71 | 4,88
4 discouraged worker | 26 | 0,27 | 0,26
5 economically inactive | 3362 | 35,45 | 40,28
valid observations | 9485 | 100,00 | 100,00
-9 missing | 5
Total | 9490

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
1 normally working (working 15+ hours / week) | 7077 | 58,57 | 52,58
2 currently working (working less than 15 hours / week) | 581 | 4,81 | 3,43
3 unemployed | 635 | 5,25 | 4,62
4 discouraged worker | | |
5 economically inactive | 3791 | 31,37 | 39,37
valid observations | 12084 | 100,00 | 100,00
-9 missing | 149
Total | 12233

PI001 Main source of personal income

D-ECHP, wave 1

Category | Unweighted N | Unweighted % | Final %
0 Person has no income from any source | 611 | 6,44 | 6,94
1 Wages and salaries | 5132 | 54,08 | 50,68
2 Income from self-employment or farming | 324 | 3,41 | 3,06
3 Pensions | 1629 | 17,17 | 20,59
4 Unemployment / redundancy benefits | 392 | 4,13 | 4,00
5 Any other social benefits or grants | 852 | 8,98 | 8,49
6 Private income | 550 | 5,80 | 6,24
Total | 9490 | 100,00 | 100,00

C-SOEP, wave 1

Category | Unweighted N | Unweighted % | Final %
0 Person has no income from any source | 1048 | 8,57 | 8,35
1 Wages and salaries | 7084 | 57,91 | 50,54
2 Income from self-employment or farming | 486 | 3,97 | 4,30
3 Pensions | 1894 | 15,48 | 23,10
4 Unemployment / redundancy benefits | 619 | 5,06 | 4,53
5 Any other social benefits or grants | 647 | 5,29 | 4,45
6 Private income | 455 | 3,72 | 4,73
Total | 12233 | 100,00 | 100,00

Coefficient of variation, D-ECHP, wave 1

Variable | Income item | Unweighted N excl. imputed values | Unweighted N incl. imputed values | Unweighted variation | Design variation | Nonres variation | Final variation
HI100 | TOTAL NET HOUSEHOLD INCOME (DETAILED, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 3832 | 4897 | 0,818 | 0,812 | 0,812 | 0,850
HI110 | TOTAL NET INCOME FROM WORK (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 3722 | 3722 | 0,892 | 0,883 | 0,881 | 0,907
HI111 | WAGE AND SALARY EARNINGS | 3256 | 3604 | 0,773 | 0,776 | 0,770 | 0,792
HI112 | SELF-EMPLOYMENT EARNINGS (NET) | 221 | 442 | 2,036 | 1,997 | 2,010 | 2,037
HI120 | NON-WORK PRIVATE INCOME (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 2293 | 2293 | 1,849 | 1,943 | 1,948 | 1,887
HI122 | PROPERTY/RENTAL INCOME | 482 | 482 | 1,580 | 1,594 | 1,556 | 1,551
HI122G | PROPERTY/RENTAL INCOME, GROSS, YEAR PRIOR TO THE SURVEY | 482 | 482 | 1,555 | 1,567 | 1,540 | 1,504
HI130 | TOTAL SOCIAL TRANSFER RECEIPTS (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 3422 | 3422 | 1,135 | 1,134 | 1,141 | 1,086
HI131 | UNEMPLOYMENT RELATED BENEFITS | 568 | 644 | 0,878 | 0,877 | 0,865 | 0,871
HI132 | OLD-AGE/SURVIVORS' BENEFITS | 1177 | 1379 | 0,770 | 0,768 | 0,774 | 0,791
HI137 | SOCIAL ASSISTANCE | 167 | 167 | 1,176 | 1,159 | 1,165 | 1,199
HI138 | HOUSING ALLOWANCE | 284 | 284 | 1,041 | 1,010 | 1,016 | 1,085
HI200 | CURRENT TOTAL MONTHLY NET HOUSEHOLD INCOME (SUMMARY QUESTION, NC, YEAR OF THE SURVEY) | 4968 | 4968 | 0,542 | 0,538 | 0,539 | 0,568
HI211M | CURRENT WAGE AND SALARY EARNINGS - NET (MONTHLY, NC, YEAR OF THE SURVEY) | 2408 | 3379 | 0,569 | 0,567 | 0,566 | 0,571
HI211MG | CURRENT WAGE AND SALARY EARNINGS - GROSS (MONTHLY, NC, YEAR OF THE SURVEY) | 1974 | 3379 | 0,587 | 0,585 | 0,585 | 0,587

Coefficient of variation, C-SOEP, wave 1

Variable | Income item | Unweighted N excl. imputed values | Unweighted N incl. imputed values | Unweighted variation | Final variation
HI100 | TOTAL NET HOUSEHOLD INCOME (DETAILED, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 3420 | 6188 | 0,639 | 0,709
HI110 | TOTAL NET INCOME FROM WORK (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 4803 | 4803 | 0,636 | 0,678
HI111 | WAGE AND SALARY EARNINGS | 4223 | 4648 | 0,615 | 0,662
HI1111 | WAGE AND SALARY EARNINGS (REGULAR) | 4648 | 4648 | 0,610 | 0,658
HI1112 | WAGE AND SALARY EARNINGS (LUMP SUM) | 2879 | 2879 | 1,657 | 1,726
HI112 | SELF-EMPLOYMENT EARNINGS (NET) | 460 | 549 | 1,136 | 1,042
HI120 | NON-WORK PRIVATE INCOME (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 5331 | 5331 | 3,815 | 3,662
HI121 | CAPITAL INCOME | 1420 | 5254 | 4,107 | 4,103
HI122 | PROPERTY/RENTAL INCOME | 472 | 472 | 2,017 | 1,706
HI122G | PROPERTY/RENTAL INCOME, GROSS, YEAR PRIOR TO THE SURVEY | 472 | 472 | 1,847 | 1,794
HI123 | PRIVATE TRANSFERS RECEIVED | 166 | 194 | 1,124 | 1,021
HI130 | TOTAL SOCIAL TRANSFER RECEIPTS (NET, NC, TOTAL YEAR PRIOR TO THE SURVEY) | 4603 | 4603 | 1,184 | 1,079
HI131 | UNEMPLOYMENT RELATED BENEFITS | 900 | 1060 | 0,889 | 0,860
HI132 | OLD-AGE/SURVIVORS' BENEFITS | 979 | 1614 | 0,653 | 0,644
HI133 | FAMILY-RELATED ALLOWANCES | 2439 | 2559 | 1,221 | 1,236
HI134 | SICKNESS/INVALIDITY BENEFITS | 225 | 231 | 0,648 | 0,658
HI135 | EDUCATION-RELATED ALLOWANCES | 129 | 141 | 0,749 | 0,876
HI138 | HOUSING ALLOWANCE | 434 | 434 | 0,913 | 0,816
HI200 | CURRENT TOTAL MONTHLY NET HOUSEHOLD INCOME (SUMMARY QUESTION, NC, YEAR OF THE SURVEY) | 6207 | 6207 | 0,543 | 0,605
HI211M | CURRENT WAGE AND SALARY EARNINGS - NET (MONTHLY, NC, YEAR OF THE SURVEY) | 3692 | 4380 | 0,578 | 0,638
HI211MG | CURRENT WAGE AND SALARY EARNINGS - GROSS (MONTHLY, NC, YEAR OF THE SURVEY) | 3528 | 4380 | 0,600 | 0,653
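The tables report coefficients of variation without stating the formula. A minimal sketch, assuming the usual definition (weighted standard deviation divided by weighted mean, with the weights normalized by their sum and no degrees-of-freedom correction); whether the authors used exactly this variant is not stated in the text. The function name and the example values are hypothetical.

```python
import math


def weighted_cv(values, weights):
    """Return the weighted standard deviation divided by the weighted mean."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("the weights must sum to a positive number")
    mean = sum(w * x for x, w in zip(values, weights)) / total
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return math.sqrt(variance) / mean


# Hypothetical incomes, first with equal weights, then with the highest income upweighted.
incomes = [10.0, 20.0, 30.0]
cv_unweighted = weighted_cv(incomes, [1.0, 1.0, 1.0])
cv_weighted = weighted_cv(incomes, [1.0, 1.0, 2.0])
```

Comparing the two results shows how a change in weights alone moves the coefficient, which is the effect the columns of the tables above trace across weighting steps.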

Weighting of the various waves of the PSELL2 survey

The PSELL2

We were asked to identify the differences between the weighting systems adopted by the PSELL2 and the ECHP (see paragraph 2), and to apply the ECHP weighting methods to the results of the first wave of the PSELL2 survey. The aim of this work is to test whether applying both weighting procedures to the PSELL2 data leads to equivalent results.

The PSELL programme (socio-economic panel “Liewen zu Letzebuerg”), carried out by the division “Population and Households”, is an exceptional instrument for studying the evolution and dynamics of the living conditions of people and households resident in the Grand Duchy of Luxembourg. Within this programme, a great deal of information is collected each year on the main aspects of the life of the country's population: mainly on income and employment, but also on housing conditions, household equipment and household composition, main expenses, precariousness, debt, the school situation of children and the socio-professional position of adults. The PSELL1 programme started in 1985 with interviews of a sample of 6110 people in 2012 households. In 1994 this sample comprised 4966 people living in 1809 households. The PSELL2 succeeded it in 1995. The PSELL survey is continued each year and the same sample is followed from year to year. Naturally, this sample changes as the population of the country does.

Introduction to the survey

The target population of the PSELL is limited to people connected directly or indirectly to the Luxembourg social security system. Consequently, the conclusions of studies relating to income or poverty do not under any circumstances cover “the whole of persons living in Luxembourg”: foreign civil servants and agents of foreign corporations who live temporarily in Luxembourg and are linked to the social security system of their own country, as well as Luxembourg residents who cross the border to work, are not included in this survey. The PSELL2 sample was selected from an exhaustive population of “main income holders” as recorded in the data of the Department of Social Security, that is to say 154534 main holders. The sample comprises 5713 main income holders chosen “at random”.

The main income holders are only sampling units; they lead to the observation units, which are households and individuals. The sample ( s ) extracted from a population ( U ) is formed by all selected units (main income holders) as they stand before observation, regardless of the status of these units after observation.

s is formed by m persons (m main income holders) belonging to U, which is formed by M persons. For the m addresses selected, s includes, after the survey:

m1 “non-sample” units: deaths, persons who moved out, collective households, and various errors linked to the nature of the population;

m2 “completed surveys”;

m3 “non-response”: refusal, wrong or unknown address, impossible to carry out (illness, handicap, …), impossible to contact (absence, …), and other impossibilities concerning people of the areas.

The sample is composed of the m selected units. The respondents are counted as m1 + m2: the non-sample units have to be taken into account because they have influenced the sampling procedure. Officially, the sampling procedure is simple random sampling from the frame of social security data (1). In addition, all redundant addresses were removed from the data beforehand, so that the same household is not considered twice at the moment of sampling. It would therefore be more exact to talk about two-stage sampling with selection probabilities and a posteriori adjustment by weighting. A main income holder leads to an address and the address leads to a household.

(1) One does not actually know the sampling procedure because it is carried out by the Luxembourg social security.

The 5713 households form a sample distributed as follows:

Status of household        Frequencies   Percent
realized interviews               2978      52,1
non-sample (total)                 269       4,7
  died                              16
  moved out                         61
  collective household             192
non-response (total)              2466      43,2
  refused                         1922
  absence                          410
  wrong address                    134
total                             5713     100,0

Within each household, all members are covered. These 10967 individuals belonging to 5713 households form a sample distributed as follows:

Status of individual       Frequencies   Percent
realized interviews               8232      75,1
non-sample                         269       2,4
non-response                      2466      22,5
total                            10967     100,0

A main income holder leads to an address and an address relates to a household; on the other hand, a household can lead to several reference persons, which introduces a bias into the sample.

Sampling weights of the first wave

The first stage

Taking the main income holder as sampling unit in a simple random sample, the estimate of the total of the variable of interest $y_i$ is:

$$\hat{T} = \frac{M}{m} \sum_{i \in s} y_i$$

where M represents the total population of reference persons and m represents the size of the sample of reference persons. Moving from the unit “main income holder” to the unit “household” raises a problem. As all redundant addresses were removed from the data beforehand, a reference person leads to a household but the reverse is not possible.

Consequently, the selection probability is not the same for all households: the more reference persons there are in a household, the higher the probability that the household is selected, which means the sample is biased. The total of a household variable of interest therefore has to be estimated from the variables of interest of the reference persons.

The second stage

The second stage moves from the main income holder unit to the household unit and applies an ex-post stratification to correct the sample bias. The same rule has been applied for PSELL1 and for PSELL2. Given $C_i$: there are $n_i$ main income holders in household $i$, with $1 \le n_i \le 6$ and

$$P(C_i) = \frac{1}{n_i}.$$

Let us estimate the total number of households:

$$\hat{M} = \frac{M}{m} \sum_{i=1}^{m} P(C_i)$$

It remains to estimate the total and the variance for the variables of interest of the household unit. As estimator of the total we may take:

$$\hat{T} = \frac{M}{m} \sum_{i=1}^{m} \frac{1}{n_i}\, y_i$$

in order to estimate

$$T = \sum_{i=1}^{M} y_i$$

This estimator is unbiased. Proof:

$$E(\hat{T}) = E\left[\frac{M}{m} \sum_{i \in s} \frac{1}{n_i}\, y_i\right] = \frac{M}{m} \sum_{i=1}^{M} \frac{1}{n_i}\, y_i\, E(\delta_i)$$

where

$$\delta_i = \begin{cases} 1 & \text{if household } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases}$$

so that

$$E(\hat{T}) = \frac{M}{m} \sum_{i=1}^{M} \frac{1}{n_i}\, y_i\, P(\delta_i = 1) \quad \text{and} \quad P(\delta_i = 1) = \frac{m}{M}\, n_i$$

$$E(\hat{T}) = \frac{M}{m} \sum_{i=1}^{M} y_i\, \frac{1}{n_i}\, \frac{m}{M}\, n_i = \sum_{i=1}^{M} y_i$$

therefore

$$E(\hat{T}) = T$$

In practice some bias will nevertheless remain, because the selection probabilities based on the number of main income holders are not an exact representation of reality but an estimate. This choice was unavoidable given the lack of information on the data structure and therefore the complete absence of information on address redundancies. Therefore

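$$\hat{T} = \frac{M}{m} \sum_{i \in s} \frac{1}{n_i}\, y_i$$

Conclusion: setting

$$w_i = \frac{M}{m}\, \frac{1}{n_i}$$

we get

$$\hat{T} = \sum_{i \in s} w_i\, y_i$$

and for the estimator of the mean we have:

$$\hat{Y} = \frac{1}{M} \sum_{i \in s} w_i\, y_i$$

Now the sample units remain to be weighted.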
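As an illustration of the first-wave weights, the following minimal Python sketch computes $w_i = \frac{M}{m}\frac{1}{n_i}$ and the weighted total. The values of M and m are those quoted in the text; the household data (`n_holders`, `y`) and the function names are hypothetical and only serve the example.

```python
def design_weights(n_holders, M, m):
    """Return w_i = (M / m) * (1 / n_i) for each sampled household i."""
    return [M / (m * n_i) for n_i in n_holders]


def weighted_total(weights, y):
    """Estimator of the total: sum over the sample of w_i * y_i."""
    return sum(w * v for w, v in zip(weights, y))


M, m = 154534, 5713                     # population and sample sizes from the text
n_holders = [1, 2, 1, 3]                # hypothetical: main income holders per household
y = [3000.0, 5200.0, 2100.0, 6900.0]    # hypothetical household incomes

w = design_weights(n_holders, M, m)
total = weighted_total(w, y)
```

A household with two main income holders receives half the weight of a household with one, which compensates for its doubled chance of selection.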

Definition: Weighting is used to improve the quality of estimates by compensating for disturbances caused by sources of bias such as non-response.

Treatment of non-response

We have an estimator for the total and for the mean, but we still have to deal with non-response.

The scope

After sampling we observe a number of failures ( m3 ); the total of these failures is called the “initial non-response”. Two cases arise. Either the non-response is random, in which case no treatment is necessary and the respondents can be considered as selected. Or the non-response is not random, that is, it is strongly connected to some variables of interest. In this case the estimator has to be weighted in order to share out the non-response among the respondents, by post-stratifying the sample on criteria (determined by logistic regression) chosen so that non-response can be regarded as random within the strata. Inference from the respondents to the whole sample then becomes possible.

For a given variable Y, non-response affects the estimators in two ways: it introduces a bias, and it reduces precision.

The ideal case would be to know, for each individual i, his probability of response $R_i$ (these would have to be strictly positive). In this case we would get the following estimator:

$$\hat{T}_R = \sum_{i \in r} \frac{1}{P_i R_i}\, y_i$$

which is an unbiased estimator of T, where $P_i$ is the probability of inclusion of i in the sample s and r is the set of selected individuals who respond.

Proof: let

$$\delta_i = \begin{cases} 1 & i \in s \\ 0 & i \notin s \end{cases} \quad \text{and} \quad \varepsilon_i = \begin{cases} 1 & \text{if } i \text{ answered} \\ 0 & \text{otherwise} \end{cases}$$

$$\hat{T}_R = \sum_{i=1}^{M} \frac{y_i}{P_i R_i}\, \delta_i \varepsilon_i \;\Rightarrow\; E(\hat{T}_R) = \sum_{i=1}^{M} \frac{y_i}{P_i R_i}\, E(\delta_i \varepsilon_i)$$

As, by hypothesis, the sampling of individuals and their response behaviour are completely independent of each other, we have:

$$E(\delta_i \varepsilon_i) = E(\delta_i)\, E(\varepsilon_i) = P[i \in s]\, P[i \text{ answered}] = P_i R_i$$

therefore:

$$E(\hat{T}_R) = T$$

End of proof. Unfortunately it is practically impossible to know the $R_i$.

The principle

We try to get as close as possible to the preceding ideal estimator $\hat{T}_R$ by taking only the responding individuals and by modifying their initial weight $w_i(s)$ to take the non-response into account. The aim is not a direct inference from the sample to the total population (which the $P_i$ alone would give in the absence of non-response) but an inference in two stages.

First stage: inference from the sample of respondents to the total sample, achieved with the $R_i$, or more exactly with the estimated $R_i$.

Second stage: inference from the sample s to the entire population, achieved with the inclusion probabilities.
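The unbiasedness argument for the ideal estimator can be checked numerically on a toy population. The sketch below is only an illustration under a stated assumption: each unit is sampled and responds through independent Bernoulli draws (which is not the PSELL design), so that $E(\delta_i \varepsilon_i) = P_i R_i$ holds exactly. The values of `y`, `P` and `R` are hypothetical.

```python
from itertools import product


def expected_estimator(y, P, R):
    """Exact expectation of sum_i y_i / (P_i R_i) * delta_i * eps_i,
    enumerating every sampling pattern (delta) and response pattern (eps)
    under independent Bernoulli draws (assumption of this sketch)."""
    n = len(y)
    expectation = 0.0
    for delta in product([0, 1], repeat=n):
        for eps in product([0, 1], repeat=n):
            prob = 1.0
            for i in range(n):
                prob *= P[i] if delta[i] else 1.0 - P[i]
                prob *= R[i] if eps[i] else 1.0 - R[i]
            estimate = sum(
                y[i] / (P[i] * R[i]) for i in range(n) if delta[i] and eps[i]
            )
            expectation += prob * estimate
    return expectation


y = [10.0, 20.0, 30.0]   # hypothetical variable of interest
P = [0.5, 0.25, 0.8]     # hypothetical inclusion probabilities
R = [0.6, 0.9, 0.5]      # hypothetical response probabilities

expected = expected_estimator(y, P, R)
true_total = sum(y)
```

Under these assumptions `expected` coincides with `true_total` up to rounding, whatever strictly positive values are chosen for `P` and `R`.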

Weighting of the estimator

In full-scale surveys one does not estimate a global response rate but rather response rates by population category, whenever auxiliary information allows it. One has to define categories of the population within which the response or non-response of every individual can be assumed independent of that individual's own value $y_i$ of the variable of interest.

The categories chosen for PSELL2 are the districts, because this is the only exhaustive information available both at sample level and at population level.

Districts          Population (%)   Sample (%)
Luxembourg                   19,6         20,2
Capellen                      7,7          7,4
Esch/Alzette                 32,2         31,7
Lux. Campagne                 9,3          9,3
Mersch                        5,1          4,9
Clervaux                      2,6          2,5
Diekirch                      5,9          5,7
Redange                       2,8          3,1
Vianden                       0,6          0,6
Wiltz                         2,7          2,6
Echternach                    3,4          3,6
Grevenmacher                  4,6          5,0
Remich                        3,5          3,3
Missing data                  0,0          0,0
Total in N                 154534         5713

Let us formalize the unbiased estimator obtained under the homogeneous response mechanism for the sample s. We note:

$p_i$: the weight derived from the probability of inclusion of $i$ (its inverse), equal to $\frac{M}{m}\,\frac{1}{n_i}$

$h = 1, \dots, H$ with $H = 13$: the 13 “cantons” (regions) of Luxembourg; h = 1 is the City of Luxembourg, h = 2 is Capellen, etc.

$n_h$: sample size in the region h

$n_{h,R}$: number of answers in the region h

$\frac{n_{h,R}}{n_h}$: response rate in the region h, that is, the corrective cross-sectional response rate, used here to estimate the response probabilities $R_i$ of all individuals of the region h.

Response rates of main income holders by region

Districts          Response rate
Luxembourg                  .599
Capellen                    .574
Esch/Alzette                .583
Lux. Campagne               .617
Mersch                      .523
Clervaux                    .549
Diekirch                    .421
Redange                     .588
Vianden                     .500
Wiltz                       .583
Echternach                  .504
Grevenmacher                .528
Remich                      .534
Total                       .568

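Conclusion

We get for the estimation of the total:

$$\hat{T}_R = \sum_{h=1}^{H} \sum_{i \in r_h} p_i \cdot \frac{1}{n_{h,R} / n_h} \cdot y_i = \sum_{h=1}^{H} \sum_{i \in r_h} \frac{M\, n_h}{m\, n_{h,R}\, n_i}\, y_i$$

and for the estimation of the mean:

$$\hat{Y}_R = \frac{1}{M} \sum_{h=1}^{H} \sum_{i \in r_h} p_i \cdot \frac{1}{n_{h,R} / n_h} \cdot y_i = \frac{1}{M} \sum_{h=1}^{H} \sum_{i \in r_h} \frac{M\, n_h}{m\, n_{h,R}\, n_i}\, y_i$$

$$\hat{Y}_R = \sum_{h=1}^{H} \frac{n_h}{m}\, \frac{1}{n_{h,R}} \sum_{i \in r_h} \frac{y_i}{n_i} = \sum_{h=1}^{H} \frac{n_h}{m}\, \bar{y}_{h,R}$$

where $\bar{y}_{h,R} = \frac{1}{n_{h,R}} \sum_{i \in r_h} \frac{y_i}{n_i}$ is the mean over the respondents of region h.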

These estimators are estimators of the total and of the mean for the household unit. To move from households to individuals, the household weight is allocated to all its members. Thus we arrive at the same estimators:

$$\hat{T}_R = \sum_{h=1}^{H} \sum_{i \in r_h} \frac{M\, n_h}{m\, n_{h,R}\, n_i}\, y_i$$

$$\hat{Y} = \frac{1}{N} \sum_{h=1}^{H} \sum_{i \in r_h} \frac{M\, n_h}{m\, n_{h,R}\, n_i}\, y_i$$

where N is the number of individuals affiliated to the IGSS. Of course $y_i$ is then the variable of interest at individual level. In order to simplify the notation, let us consider the weight of each responding individual to be known, with value $W_{init}$, and let

$$R = \sum_{h=1}^{H} \sum_{i \in r_h} 1$$

be the total number of respondents, so that:

$$\hat{T}_R = \sum_{r=1}^{R} W_{init}\, y_r$$

$$\hat{Y} = \frac{1}{R} \sum_{r=1}^{R} W_{init}\, y_r$$

The structure of the sample before and after weighting is shown in the following table.

                      before weighting   after weighting
realized interviews               8232             10497
non-sample                         269               470
non-response                      2466
total                            10967             10967
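The construction of the initial weight $W_{init}$ (design weight divided by the regional response rate) can be sketched as follows. This is a minimal illustration, not the CEPS program: the function names, the toy sample and the values of M and m are hypothetical.

```python
from collections import Counter


def response_rates(sample_regions, respondent_regions):
    """Response rate n_{h,R} / n_h for every region h present in the sample."""
    n_h = Counter(sample_regions)
    n_hR = Counter(respondent_regions)
    return {h: n_hR[h] / n_h[h] for h in n_h}


def initial_weights(respondents, rates, M, m):
    """W_init = (M / m) * (1 / n_i) * (n_h / n_{h,R}) for each respondent.

    respondents: list of (region h, number of main income holders n_i).
    """
    return [M / (m * n_i) / rates[h] for h, n_i in respondents]


# Hypothetical toy sample: 5 units drawn in one region, 4 in another.
sample_regions = ["Luxembourg"] * 5 + ["Capellen"] * 4
respondents = [
    ("Luxembourg", 1), ("Luxembourg", 2), ("Luxembourg", 1),
    ("Capellen", 1), ("Capellen", 3),
]
M, m = 900, 9   # hypothetical population and sample sizes

rates = response_rates(sample_regions, [h for h, _ in respondents])
w_init = initial_weights(respondents, rates, M, m)
```

Respondents in a region with a low response rate carry a larger weight, since each of them also stands for the non-respondents of the same region.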

Adjustment of weights according to auxiliary information (CALMAR)

The aim of CALMAR

The fundamental principle of weighting is that whenever auxiliary information correlated with the variables to be observed is available, one should try to use it in order to obtain more precise estimators. Auxiliary information is information available for the whole population, thanks to a census or a large survey that is sufficiently precise to be used. It can be of different types, such as the distribution of the population by sex or age. CALMAR is a method that helps to achieve this. CALMAR makes it possible to adjust the sample, that is, to bring the sample ( s ) as close as possible to the population ( U ) by using auxiliary information on ( U ).

The CALMAR principle

A new (final) weight is determined by multiplying the sample weight of a statistical unit by a correction factor. CALMAR allows us to determine new estimators (i.e. new weights), described as “calibration estimators”. Their construction relies on the following principle: the gap between the initial sample weights and the final weights is measured by a distance function, and the final weights are obtained by minimizing this function under certain calibration constraints.

Theoretical aspect of CALMAR

We work at the individual level because the auxiliary information is considered at the individual level. The estimator of the total calculated before is:

$$\hat{T}_R = \sum_{k=1}^{R} W_{init}\, y_k$$

We note

$$\pi_k = \frac{1}{W_{init}}$$

and $y_k$ the value taken by a variable of interest Y for the k-th individual of the population U. We try to estimate the total for the population:

$$T_y = \sum_{k=1}^{N} y_k$$

The estimator of the total calculated before can also be written as

$$\hat{T}_{Y/\pi} = \sum_{k=1}^{R} d_k\, y_k \quad \text{with} \quad d_k = \frac{1}{\pi_k} \quad \forall\, k = 1, \dots, R$$

Let $x^1, x^2, \dots, x^J$ be J auxiliary variables. We assume the following.

The values taken by these auxiliary variables on the sample s of size m are available. Each element k of the sample s is associated with a vector of values taken by these J auxiliary variables:

$$\forall\, k = 1, \dots, R \qquad x_k = \left(x_k^1, x_k^2, \dots, x_k^J\right)$$

The totals of these J variables over the population U of size N are known:

$$T_X = \sum_{k=1}^{N} x_k \;\Leftrightarrow\; \forall\, j = 1, \dots, J, \quad X^j = \sum_{k=1}^{N} x_k^j \qquad \text{with} \quad T_X = \left(X^1, \dots, X^j, \dots, X^J\right)$$

Deville and Särndal (1992) propose to define the estimator

$$\hat{T}_{Y/W} = \sum_{k=1}^{R} W_k\, y_k$$

by minimizing, under a certain number of constraints imposed through $x^1, x^2, \dots, x^J$, a distance separating the new weights $\{W_k,\ k = 1, \dots, R\}$ from the initial sample weights $\{d_k,\ k = 1, \dots, R\}$, the distance being measured by functions of the form:

$$G^*(W_k, d_k) = d_k\, q_k\, G\!\left(\frac{W_k}{d_k}\right)$$

where G is a function such that $G^*$ is a distance and which verifies the following properties:

h1) G is a function (differentiable, since its derivative g is used below)
h2) G is strictly convex
h3) $G(1) = G'(1) = 0$
h4) $G''(1) = 1$

h1 and h2 entail the existence of an inverse function F of the derivative g:

$$F(u) = g^{-1}(u) \quad \text{and} \quad g(x) = \frac{dG(x)}{dx}$$

h3 and h4 are defined so that $W_k = d_k$ corresponds to a local minimum. They imply:

h5) $F(0) = 1$
h6) $F'(0) = 1$

From a series of observations $(d_k, x_k)$ made on the sample s and from a previously chosen function G, the weights $W_k$ are determined by minimizing the weighted mean distance between the new weights $W_k$, calibrated on the population U, and the initial weights $d_k$, determined from the sample s:

$$\min_{W_k} \sum_{k=1}^{R} d_k\, q_k\, G\!\left(\frac{W_k}{d_k}\right)$$

under the constraint

$$\sum_{k=1}^{R} W_k\, x_k = T_X \;\Leftrightarrow\; \forall\, j = 1, \dots, J \quad \sum_{k=1}^{R} W_k\, x_k^j = X^j$$

This is a constrained optimization problem. Introducing a vector of Lagrange multipliers $\lambda' = (\lambda_1, \lambda_2, \dots, \lambda_J)$, we get:

$$L(W_k, \lambda) = \sum_{k=1}^{R} d_k\, q_k\, G\!\left(\frac{W_k}{d_k}\right) - \lambda' \left(\sum_{k=1}^{R} W_k\, x_k - T_X\right)$$

If a solution exists it has to verify:

$$\frac{\partial L}{\partial W_k} = q_k\, g\!\left(\frac{W_k}{d_k}\right) - \lambda' x_k = 0 \;\Rightarrow\; W_k = d_k\, F\!\left(\frac{x_k' \lambda}{q_k}\right) \quad \forall\, k \in s$$

$$\frac{\partial L}{\partial \lambda} = T_X - \sum_{k=1}^{R} W_k\, x_k = 0$$

With $q_k = 1$ the first condition reduces to $W_k = d_k\, F(x_k' \lambda)$.

Hence λ is the only unknown of the problem. This system is solved by Newton's algorithm. The solution of this non-linear system depends on the data observed for the sample s as well as on the chosen distance function G(x). The hypotheses h1 and h2 guarantee the uniqueness of the solution if it exists.

Examples of the distance $G^*$ available in the CALMAR macro (written here with $q_k = 1$):

for $\frac{W_k}{d_k} \in\ ]-\infty, +\infty[$ we can take

$$d_k\, G\!\left(\frac{W_k}{d_k}\right) = \frac{(W_k - d_k)^2}{2\, d_k}, \quad \text{that is} \quad G(r) = \frac{(r - 1)^2}{2}$$

(the linear method);

for $\frac{W_k}{d_k} \in\ ]0, +\infty[$ we can take

$$d_k\, G\!\left(\frac{W_k}{d_k}\right) = W_k \log\!\left(\frac{W_k}{d_k}\right) - W_k + d_k, \quad \text{that is} \quad G(r) = r \log r - r + 1$$

(the so-called multiplicative method).

By applying this method you get the following estimator:

$$\hat{T}_{y/W} = \sum_{k=1}^{R} W_k\, y_k$$
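The calibration step can be sketched in a few lines of Python. This is not the CALMAR macro itself but a minimal illustration, assuming $q_k = 1$, of solving the calibration equations $\sum_k d_k F(x_k'\lambda)\, x_k = T_X$ by Newton's method, with $F(u) = 1 + u$ for the linear method and $F(u) = e^u$ for the multiplicative method. The function names and the toy data are hypothetical.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def calibrate(d, X, totals, method="linear", tol=1e-10, max_iter=50):
    """Return weights W_k = d_k * F(x_k' lambda) with sum_k W_k x_k = totals."""
    F = (lambda u: 1.0 + u) if method == "linear" else math.exp
    dF = (lambda u: 1.0) if method == "linear" else math.exp
    J = len(totals)
    lam = [0.0] * J
    for _ in range(max_iter):
        u = [sum(xk[j] * lam[j] for j in range(J)) for xk in X]
        W = [dk * F(uk) for dk, uk in zip(d, u)]
        resid = [totals[j] - sum(Wk * xk[j] for Wk, xk in zip(W, X))
                 for j in range(J)]
        if max(abs(r) for r in resid) < tol:
            break
        # Jacobian of the calibration equations with respect to lambda.
        H = [[sum(dk * dF(uk) * xk[a] * xk[b] for dk, uk, xk in zip(d, u, X))
              for b in range(J)] for a in range(J)]
        lam = [l + s for l, s in zip(lam, solve(H, resid))]
    return W


# Hypothetical data: initial weights and two indicator variables (men, women).
d = [10.0, 10.0, 20.0, 20.0, 15.0]
X = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
totals = [50.0, 40.0]   # hypothetical known population margins

W_lin = calibrate(d, X, totals, "linear")
W_mult = calibrate(d, X, totals, "multiplicative")
```

With the linear distance the Newton iteration converges in a single step, because the calibration equations are linear in λ; the multiplicative distance keeps all weights positive.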

Subsequent waves of the survey

The sample drawn in 1994 is meant for carrying out the panel study in Luxembourg.

Definition: A panel is a survey that, on the one hand, follows the paths of a certain number of individuals over a certain number of years and, on the other hand, allows the yearly production of unbiased estimates of quantities relating to the population in which it evolves. A panel survey therefore involves two types of weighting at every wave.

Following the paths of individuals over several years is a longitudinal operation carried out on a sample of longitudinal units. Each year the same individuals are surveyed, and the quantity of information available for each of them increases progressively. The longitudinal sample is composed solely of individuals selected during the first year. This sample is representative, but only of the year in which it was drawn. The evolution of the population over the years has strictly no influence on the longitudinal sample or on its evolution. This first aspect calls for a so-called longitudinal weighting at each wave.

Indeed PSELL, like all other panels, is subject to erosion of the sample as a result of repeated observations in the same households. This erosion, called “attrition”, results from the weariness of individuals who are surveyed several times and who refuse to continue their collaboration in a long-term survey. This erosion, linked to the annual non-response of longitudinal members, reduces the number of respondents and may be a source of bias if the non-response is substantial. The annual reweighting of the units of the longitudinal sample is meant to reduce the bias and to improve the quality of estimates.

Producing each year unbiased estimates of quantities relating to the population in which the sample evolves is a strictly cross-sectional, or annual, operation (if surveys take place every year). This operation is carried out on the basis of cross-sectional samples. A cross-sectional sample is a sample whose function is to represent the state of a population at the moment of the survey. This cross-sectional representativeness presupposes that each year new individuals join the longitudinal sample. Cross-sectional samples are thus composed of longitudinal individuals; of individuals who were present in the country at the moment of the first sample, were not selected, and join the longitudinal individuals in their households; and finally of individuals absent from the country at the moment of the first sample, who enter either through households formed by longitudinal individuals or by means of a deliberate annual sub-sample representative of these “new births” (new-born and immigrants). This second aspect calls for a cross-sectional weighting at each wave.

Each new wave of the survey corresponds not only to a new stage in the elaboration of the longitudinal sample but also to the creation of a new cross-sectional sample. This new cross-sectional sample presents two problems. On the one hand, each cross-sectional sample has to remain representative of the population at the moment of the new survey. On the other hand, it has to take into account the adjustment of the weights of the longitudinal individuals according to the longitudinal response rate.

Note that the longitudinal units are individuals, whereas the cross-sectional units are households, together with the individuals present in the country at the moment of the first sample who join the longitudinal individuals in their households and the individuals born in the country since the first sample was drawn.

Longitudinal weighting of subsequent waves

The response rates of the longitudinal individuals are analysed in detail. At each wave one checks whether the variables of most interest affect non-response in a substantial way. The analysis carried out by CEPS shows that the combination of these variables explains only 6% of the response rate for all subsequent years. The hypothesis is made that the unexplained part of the non-response is insignificant, in the sense that it is not linked to the variables that are important for the survey.

In order to compensate for the bias that the non-random distribution of non-response induces in the weights of longitudinal individuals, we proceed as follows. As for the initial sample, we try to come close to the ideal estimator:

$$\hat{T}_R = \sum_{i \in r} \frac{1}{P_i R_i}\, y_i$$

As none of the variables in the survey is assumed to have influenced the response of the surveyed individuals, we may suppose that the response probability is the same for all individuals and is calculated as:

$$\phi = \frac{\text{number of respondent units in wave } t}{\text{number of units in the sample in wave } t}$$

Hence $R_i$ is taken equal to $\phi$, and in order to weight the wave longitudinally we can take as estimator of the total for individuals:

$$\hat{T}_{vl} = \sum_{i=1}^{R} \frac{W_i}{\phi}\, y_i$$

which is an unbiased estimator of the total. The weighted results are shown in the following table. The “CALMAR” stage has not yet been applied to PSELL2.

                       1994    1995    1996    1997
realized interviews   10497   10177    9992    9738
non-sample              470     790     975    1229
Total                 10967   10967   10967   10967
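The longitudinal adjustment amounts to dividing the weight of every respondent by the common response rate $\phi$. A minimal sketch, with hypothetical weights and response indicators:

```python
def longitudinal_weights(weights, responded):
    """Return (phi, adjusted weights W_i / phi of the respondents).

    phi is the share of sample units that respond in the wave, assumed
    identical for all individuals as in the text.
    """
    phi = sum(responded) / len(responded)
    adjusted = [w / phi for w, r in zip(weights, responded) if r]
    return phi, adjusted


weights = [100.0, 120.0, 80.0, 100.0]    # hypothetical previous-wave weights
responded = [True, True, False, True]    # hypothetical response in wave t

phi, adjusted = longitudinal_weights(weights, responded)
```

Because every respondent is scaled by the same factor, the relative structure of the weights is unchanged; only their level is raised to compensate for attrition.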

Cross-sectional weighting of subsequent waves

The question arises whether the demographic phenomena that have occurred in the population are mirrored in the new cross-sectional sample. The new cross-sectional sample generates:

births: new households are formed (marriages), new-born children come into households, cohabitants join households formed by longitudinal individuals;

deaths: households split up, persons die, others leave the country or enter a collective household (a retirement home, for instance).

But it hardly mirrors births by immigration. In order to take births into account one would have to introduce a new sample of births, but this is not necessary because the panel follows households that themselves generate births. Immigrants, on the other hand, are not generated by the sample: they form new households and only rarely enter households composed of longitudinal individuals. Therefore, in order to follow the demographic evolution of the population, it is necessary to insert new households into the sample. For practical reasons this operation is only carried out every two years. It is carried out for the third wave (i.e. in 96 but not in 95), that is, for every wave 2t+1 with t ≠ 0.

The composition of the cross-sectional sample without weightings

                                     1994   1995   1996*   1997*
longitudinal “panel” individuals     8232   6597    6119    5627
new cross-sectional individuals              210     658    1027
Total                                8232   6807    6777    6654

* including new immigrants entering the country in 1995 and 1996

From now on, “initial weight” refers to the annual weight of longitudinal individuals (present in the cross-sectional data). We consider complete households and not only individuals. This procedure is effective for estimation at household level. On the other hand, this approach confronts us with the problem of new members entering households. These new members may have been present in the population when the longitudinal sample was drawn without having been selected. Under these circumstances, how is the weighting to be done?

Adopted method for the waves 2t (t ≥ 1)

We use the weight share approach. Let $U^1$ be the population of individuals already present at the first wave, composed of $N^1$ individuals. We drew a sample of $m^1$ different addresses from the IGSS, that is, $m^1$ households out of a population of $M^1$ households, and from these $m^1$ households a sample $s^1$ of $n^1$ individuals ensues. We note $M_i^1$ the size of household $i$, so that

$$N^1 = \sum_{i=1}^{M^1} M_i^1$$

Every individual $j$ has a weight equal to the inverse of the quantity

$$\pi_j = \frac{\phi}{W_j}.$$

Let $U^2$ be the population formed by the individuals $U^1$ present at the first wave together with the population $U^*$ of the $N^*$ individuals absent at the first wave. We get $U^2 = U^1 \cup U^*$, which contains $N^2 = N^1 + N^*$ individuals. The individuals of the year 2t belong to M households of sizes $M_i$, $i = 1, \dots, M$.

Let $s^2$ be the sample of $n^2$ individuals drawn in the population $U^2$ and belonging to m households; then $s^2 = s^1 \cup s^*$, where $s^*$ is the sample of new entries (immigrants and new-born) into households of which at least one member is a longitudinal individual. In order to estimate the total T for individuals we can use the estimator

$$\hat{T} = \sum_{i=1}^{m} \sum_{k=1}^{M_i} w_{ik}\, y_{ik}$$

where m represents the number of households sampled in the population $U^2$ and $w_{ik}$ represents the weight attached to the unit k of household i. We may then establish a link between the populations $U^1$ and $U^2$ by considering the individuals that are part of both populations. Let $l_{jk}$ be the link between the individuals k of $U^2$ and the individuals j of $U^1$:

$$l_{jk} = \begin{cases} 1 & \text{if individual } j \text{ of the population } U^1 \text{ is the same as individual } k \text{ of the population } U^2 \\ 0 & \text{otherwise} \end{cases}$$

For each individual who is neither a new-born nor an immigrant we get

$$L_{ik} = \sum_{j=1}^{N^1} l_{j,ik} = 1$$

where $l_{j,ik}$ is the link between the individual k in household i of $U^2$ and the individual j of $U^1$. For each individual who is a new-born or an immigrant,

$$L_{ik} = \sum_{j=1}^{N^1} l_{j,ik} = 0.$$

That means that each individual of the wave 2t is linked to the initial wave if he was present in the population at the first wave.

We get:

$$L_i = \sum_{k=1}^{M_i} L_{ik} = M_i^*$$

where $M_i^* \le M_i$ represents the size of household i excluding the new-born and the immigrants. The weight share method implies that an initial weight $w'_{ik}$ is attributed to each unit k of household i of the sample in the following way:

$$w'_{ik} = \sum_{j=1}^{N^1} l_{j,ik}\, \frac{t_j}{\pi_j} \qquad \text{where} \quad t_j = \begin{cases} 1 & \text{if } j \in s^1 \\ 0 & \text{otherwise} \end{cases}$$

Therefore each individual of the first-wave sample keeps his weight, and all new individuals, that is, cohabitants (new-born, cohabiting immigrants), receive a weight of zero.

The base weight $w_i$ of the household is obtained from

$$w_i = \frac{\sum_{k=1}^{M_i} w'_{ik}}{\sum_{k=1}^{M_i} L_{ik}} = \frac{1}{M_i^*} \sum_{k=1}^{M_i} w'_{ik}.$$

This means that the weight of a household is the average of the weights of its individuals. Finally, $w_{ik} = w_i \ \forall\, k \in i$. In order to estimate the total for the households

$$T = \sum_{i=1}^{M} \sum_{k=1}^{M_i} y_{ik} = \sum_{i=1}^{M} Y_i$$

we use the following estimator:

$$\hat{T} = \sum_{i=1}^{m} w_i\, Y_i$$
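The weight share rule for one household can be sketched as follows. The representation of a household member as a dictionary with the keys `linked`, `in_s1` and `pi` is a hypothetical convention chosen for this example: `linked` plays the role of $L_{ik}$, `in_s1` the role of $t_j$, and `pi` the role of $\pi_j$.

```python
def household_weight(members):
    """Base weight w_i = sum_k w'_ik / sum_k L_ik of one household.

    members: list of dicts with keys
      'linked' : True if the person belonged to the first-wave population (L_ik = 1)
      'in_s1'  : True if the person was in the first-wave sample (t_j = 1)
      'pi'     : inverse of the person's weight, used only when in_s1 is True
    """
    initial = [
        1.0 / p["pi"] if p["linked"] and p["in_s1"] else 0.0 for p in members
    ]
    links = sum(1 for p in members if p["linked"])
    if links == 0:
        raise ValueError("household has no member linked to the first wave")
    return sum(initial) / links


# Hypothetical household: one longitudinal individual (weight 100), one
# cohabitant who was in the first-wave population but not sampled, one new-born.
household = [
    {"linked": True, "in_s1": True, "pi": 0.01},
    {"linked": True, "in_s1": False, "pi": None},
    {"linked": False, "in_s1": False, "pi": None},
]

w_i = household_weight(household)
member_weights = [w_i] * len(household)   # w_ik = w_i for every member k
```

In this example the weight of the longitudinal individual is shared between the two members linked to the first wave, and the new-born inherits the household weight without contributing to it.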

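Let us prove that this estimator is unbiased:

$$\hat{T} = \sum_{i=1}^{m} w_i\, Y_i = \sum_{i=1}^{m} Y_i\, \frac{\sum_{k=1}^{M_i} w'_{ik}}{\sum_{k=1}^{M_i} L_{ik}} = \sum_{i=1}^{m} \frac{Y_i}{L_i} \sum_{k=1}^{M_i} w'_{ik}$$

Setting $z_{ik} = \frac{Y_i}{L_i}$ for all $k \in i$, that is, for all individuals of a household:

$$\hat{T} = \sum_{i=1}^{m} \sum_{k=1}^{M_i} w'_{ik}\, z_{ik}$$

We have $n^2 = \sum_{i=1}^{m} M_i$, and we let k index all individuals of the m sampled households:

$$\hat{T} = \sum_{k=1}^{n^2} w'_k\, z_k = \sum_{k=1}^{n^2} \left( \sum_{j=1}^{N^1} l_{jk}\, \frac{t_j}{\pi_j} \right) z_k$$

Since $t_j \neq 0$ only for the sampled individuals, the sum over k can be extended to the whole population $U^2$. Therefore, writing $w_j = \frac{1}{\pi_j}$ and $Z_j = \sum_{k=1}^{N^2} l_{jk}\, z_k$:

$$\hat{T} = \sum_{k=1}^{N^2} \left( \sum_{j=1}^{N^1} l_{jk}\, \frac{t_j}{\pi_j} \right) z_k = \sum_{j=1}^{N^1} t_j\, w_j \sum_{k=1}^{N^2} l_{jk}\, z_k = \sum_{j=1}^{N^1} t_j\, w_j\, Z_j$$

$$\Rightarrow\; E(\hat{T}) = \sum_{j=1}^{N^1} E(t_j)\, w_j\, Z_j = \sum_{j=1}^{N^1} Z_j = Z$$

because

$$E(t_j) = \frac{1}{w_j}.$$

Let us show that Z = T:

$$Z = \sum_{j=1}^{N^1} Z_j = \sum_{j=1}^{N^1} \sum_{k=1}^{N^2} l_{jk}\, z_k = \sum_{k=1}^{N^2} z_k \sum_{j=1}^{N^1} l_{jk}$$

For M households we get:

$$Z = \sum_{i=1}^{M} \sum_{k=1}^{M_i} z_{ik} \sum_{j=1}^{N^1} l_{j,ik} = \sum_{i=1}^{M} \sum_{k=1}^{M_i} z_{ik}\, L_{ik} = \sum_{i=1}^{M} \sum_{k=1}^{M_i} \frac{Y_i}{L_i}\, L_{ik} = \sum_{i=1}^{M} Y_i = T$$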

Method adopted for the waves 2t+1 (t ≥ 1)

We again use the weight share method, but this time including a new sample of immigrants. Let $U^1$ be the population of individuals already present at the first wave, composed of $N^1$ individuals. These individuals j have a weight equal to the inverse of

$$\pi_j = \frac{\phi}{W_j}$$

Let $U^*$ be the population of the $N^*$ individuals who are immigrants in the year 2t+1. We note $U^A = U^1 \cup U^*$ the population containing $M^A = M^1 + M^*$ households that lead to $N^A = N^1 + N^*$ individuals. The sample $s^A$ of $m^A = m^1 + m^*$ households that lead to $n^A = n^1 + n^*$ individuals corresponds to the union of the two separate samples $s^1$ and $s^*$ drawn respectively in $U^1$ and $U^*$.

We introduced in $U^A$ a new sample of immigrants drawn from the social security register, with the aim of following the demographic movements of the population more effectively by taking into account the appearance of immigrant households; this contrasts with wave 2t, where we do not take these new households into account but only the cohabiting immigrants. In order to obtain the weight for new immigrants we use the same procedure as for the initial sample; thus we get for each household a weight equal to $\frac{1}{\pi_j^*}$, inherited by each member of the household.

We note $U^B$ the population present at wave 2t+1: this population includes the individuals present at the first wave but also the new-born and the immigrants. The population $U^B$ is formed by M households that lead to $N^B$ individuals, and each household i is composed of $M_i$ members. The sample drawn from this population comprises m households that lead to $n^B$ individuals.

We may then establish a link between the populations $U^A$ and $U^B$ by considering the individuals that are part of both populations. Let $l_{jk}$ be the link between the individuals k of $U^B$ and the individuals j of $U^A$:

$$l_{jk} = \begin{cases} 1 & \text{if the individual } j \text{ of the population } U^A \text{ corresponds to the individual } k \text{ of the population } U^B \\ 0 & \text{otherwise} \end{cases}$$

For each individual who is not a new-born we get

$$L_{ik} = \sum_{j=1}^{N^A} l_{j,ik} = 1$$

where $l_{j,ik}$ is the link between the individual k in household i of $U^B$ and the individual j of $U^A$. For each individual who is a new-born,

$$L_{ik} = \sum_{j=1}^{N^A} l_{j,ik} = 0.$$

We then get

$$L_i = \sum_{k=1}^{M_i} L_{ik} = M_i^*$$

where $M_i^*$ represents the size of household i without the new-born. The weight share method implies that each unit k of household i of the sample is attributed an initial weight $w'_{ik}$ in the following way:

$$w'_{ik} = \frac{t_{ik}}{\pi_{ik}} + \frac{t_{ik}^*}{\pi_{ik}^*} \qquad t_{ik} = \begin{cases} 1 & \text{if } k \in s^1 \\ 0 & \text{otherwise} \end{cases} \qquad t_{ik}^* = \begin{cases} 1 & \text{if } k \in s^* \\ 0 & \text{otherwise} \end{cases}$$

This means that each individual present in the initial wave (longitudinal individual) keeps his original weight, and each immigrant present in the sample of immigrants keeps his weight. Finally, the new-born, the new cohabitants (if the household is a longitudinal one and is not part of the supplementary sample) or the individuals present at the start (if the household is part of the supplementary sample) receive a weight of zero.

The base weight $w_i$ of the household is obtained from

$$w_i = \frac{\sum_{k=1}^{M_i} w'_{ik}}{\sum_{k=1}^{M_i} L_{ik}} = \frac{1}{M_i^*} \sum_{k=1}^{M_i} w'_{ik}.$$

This means that the weight of a household is the average of the weights of its individuals. Finally, $w_{ik} = w_i \ \forall\, k \in i$. Using the base weights obtained by the weight share method, we may now estimate the total of the variable of interest y for the household unit by:

$$\hat{T} = \sum_{i=1}^{m} w_i\, Y_i \qquad \text{and so} \qquad \hat{T} = \sum_{k=1}^{n^1} \frac{z_k^*}{\pi_k} + \sum_{k=1}^{n^*} \frac{z_k^*}{\pi_k^*}$$

where $z_k^* = Y_i^*$ for all $k \in i$, with

$$Y_i^* = \frac{\sum_{k=1}^{M_i} y_{ik}}{M_i^*}.$$

This estimator is the sum of two unbiased estimators, by the demonstration given in the cross-sectional treatment of the waves 2t.

The CHINTEX project: wave 1 Difference between the weighting procedures of the PSELL2 and the ECHP at wave 1

After studying the weighting procedures applicated on the samples of the PSELL2, we managed to recover the various documentation relating to the methodological 46

recommendations imposed by EUROSTAT in the framework of the ECHP project as well as the programs (Note: rather difficult stage), it was fairly easy to discern the differences in weighting procedures. The main differences lie in the using conditions of the CALMAR procedure. Actually, the procedure is applied at household level in the framework of the ECHP whereas it is applied at individual level in the PSELL. Moreover, the auxiliary data serving for the “calage sur marge” do not come from the same sources. On the one hand, the Ceps uses information coming from central data system of Luxembourg social security (IGSS). This data file does not include Luxembourg residents that are not affiliated to the Luxembourg social security system. But this is not important for the studies of Ceps because they have been excluded from the framework of the survey On the other hand the ECHP is referring to data coming from the 1991 census which takes into account the non-affiliated. Application of the calmar version ECHP on the PSELL2 data After having analyzed the programs of wave 1 we could reconstruct, with the SPSS, from the household data PSELL the variables used by EUROSTAT for the calibration. We could then apply the CALMAR macro (using the method or distance G “ranking ratio” function), under SAS, to end up with the creation of a new variable weight, called w94echpr (weight resulting from the method ECHP applied to PSELL). Once this weight created we just had to connect the household data of the first wave of PSELL (mwund9408) with the new variable. Finally, we got the data m942w, a data file containing all household variables but also the weight calculated by Ceps and the weight calculated with the ECHP methodology, which respectively are mwtcal94 and w94echpr. In the future the m942w data thus elaborated will allow us to compare estimations of interest variables weighted respectively by the w94echp weight and the mwtcal94 one. 
A household panel is also a longitudinal sample of individuals, so similar work was done on the individual data. The individual data file was constructed as follows. The household weight w94echpr was attributed to each member of the household, so that the individual ECHP weight is also w94echpr. The individual weight in the CEPS version already exists for each individual, because the CEPS weighting was done at the level of the individual. The result is the data file i942w, which contains all individual variables together with the weight calculated by CEPS (wcal94) and the weight obtained with the ECHP methodology (w94echpr). The i942w file allows us to compare estimates of individual-level variables of interest weighted by w94echpr and by wcal94 respectively. Both data files were set up in order to bring out the impact of the weighting systems on survey variables.

Impact of weighting systems

At the level of individual variables

To measure the impact of the weighting systems, we compare estimates of variables of interest weighted by the PSELL weights (wcal94) with estimates of the same variables weighted by the ECHP weights (w94echpr). The variables of interest are: the distribution of the population by canton; the nationality of residents in Luxembourg; the orientation table for the different income sources (whose main information is the labor force status of the respondent in each month of the previous calendar year); sex; age group; and five variables on the monthly income of individuals.

The results at individual level show only slight differences, except for the variables “canton” and nationality. The variable “canton” is a calibration variable in both weighting systems. The ECHP considers only 12 cantons instead of the 13 official ones present in PSELL; why the ECHP method merges two cantons (Luxembourg-city and Clervaux) remains unexplained. A difference also appears for the nationality of residents. The ECHP method finds a share of residents with EC nationality of 27.1 %, which is 2.9 percentage points more than the PSELL method. The explanation is as follows: the calibration margins for the PSELL methodology were drawn from a file coming from the IGSS, so non-affiliated individuals such as EC officials are not taken into account, whereas the margins for the ECHP method were drawn from the 1991 labor force survey and therefore include EC officials. We can conclude that the weighting systems used by PSELL and by the ECHP influence the individual variables of interest only slightly.

At household level

The choice of variables of interest for comparing the two systems at household level is still in progress; the analysis will be carried out later.
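The individual-level comparison described above can be sketched as follows. The records, the household identifier `hid` and the weight values are hypothetical; only the weight names `wcal94` and `w94echpr` come from the text.

```python
from collections import defaultdict


def weighted_shares(records, category_key, weight_key):
    """Return the weighted percentage distribution of a categorical variable."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[category_key]] += rec[weight_key]
    grand_total = sum(totals.values())
    return {cat: 100.0 * t / grand_total for cat, t in totals.items()}


# Hypothetical household file: one ECHP-style weight per household.
households = {1: {"w94echpr": 120.0}, 2: {"w94echpr": 80.0}}

# Hypothetical individual file: the CEPS weight wcal94 already exists per person.
individuals = [
    {"hid": 1, "nationality": "Luxembourg", "wcal94": 60.0},
    {"hid": 1, "nationality": "Luxembourg", "wcal94": 70.0},
    {"hid": 2, "nationality": "EU", "wcal94": 90.0},
]

# Each household member inherits the household weight w94echpr.
for person in individuals:
    person["w94echpr"] = households[person["hid"]]["w94echpr"]

shares_psell = weighted_shares(individuals, "nationality", "wcal94")
shares_echp = weighted_shares(individuals, "nationality", "w94echpr")

# Difference in percentage points between the two weighting systems.
difference = {cat: shares_echp[cat] - shares_psell[cat] for cat in shares_psell}
```

Because both distributions sum to 100 %, the differences in percentage points across categories sum to zero; a gain for one category under one weighting system is always offset elsewhere.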

Conclusion

The study carried out so far at individual level shows that the choice of the weighting system for wave 1 does change the results. Nevertheless, the differences obtained with the Luxembourg Household Panel are not large enough to matter or to warrant discussion.

We still have to see whether this also holds at household level. A few tests already carried out on various variables suggest that it does not. The next step will be to test this hypothesis (by a variance calculation) and, if it is confirmed, to determine why the various weights give almost the same results at individual level but not at household level. This study is limited to the weighting of wave 1. We should underline that the differences between weighting systems for later waves will not be limited to CALMAR, so it is quite possible that these differences will be larger. We also expect the weighting scheme to have a stronger influence on the results when other data are used. Here we used the small PSELL2 sample, but the discussion with the external evaluator during the first CHINTEX meeting gave good reasons to expect the results to change when the ECHP weighting is applied to a large national sample such as the German Socio-Economic Panel (SOEP) (see footnote 2).

Distribution of some demographic variables

The following set of tables shows the distribution of the main demographic indicators. The column titled “ECHP weighted” gives the percentages obtained after the ECHP method is applied. The “PSELL2 weighted” column shows the percentages obtained after the PSELL2 weighting scheme is used. For some tables we also report the distribution given by the 1991 census in Luxembourg.

ECHP weighted vs PSELL2 weighted distribution of the population by district in 1994

District          ECHP weighted   PSELL2 weighted
Luxembourg             18,2            19,9
Capellen                8,4             8,4
Esch/Alzette           32,1            30,8
Lux. Campagne           9,9            10,7
Mersch                  0,5             4,9
Clervaux                2,4             2,4
Diekirch                5,9             5,8
Redange                 3,6             3,0
Vianden                 0,4             0,5
Wiltz                   2,6             2,7
Echternach              3,7             2,9
Grevenmacher            4,7             4,3
Remich                  3,1             3,7
Total                 100,0           100,0

Footnote 2: Because of German data protection regulations, the SOEP data are not available outside the Federal Republic of Germany at the moment.

Nationality of residents in Luxembourg, 1994

Nationality            ECHP weighted   PSELL2 weighted
Luxembourg                  71,6            68,8
EU member states            24,2            27,1
other Europeans              2,7             2,6
Non-Europeans                1,4             1,5
without nationality          0,1             0,0

Nationality of residents in Luxembourg, compared with the 1991 census

Nationality            ECHP weighted   PSELL2 weighted   Census 1991
Luxembourg                  71,6            68,8             70,3
EU member states            24,2            27,1             26,8
other Europeans              2,7             2,6              1,3
Non-Europeans                1,4             1,5              1,4
without nationality          0,1             0,0              0,1

Persons' main income source in 1994

Type of main income                         ECHP weighted   PSELL2 weighted
invalidity pension                                2,9             2,9
old age pension, incl. early retirement          13,5            12,6
survivor's pension                                6,5             5,7
working at least 10 hours/week                   53,0            51,0
sickness benefits                                 0,5             0,5
apprenticeship and training                       0,8             1,3
registered as looking for work                    2,1             2,3
not registered as looking for work                0,2             0,2
doing house work                                 19,6            22,3
working less than 10 hours/week                   0,3             0,3
others                                            0,6             0,6

Distribution by gender

Gender     ECHP weighted   PSELL2 weighted   Census 1991
male            49,1            49,3             49,0
female          50,1            50,7             51,0

Distribution by age class

Age class              ECHP weighted   PSELL2 weighted
up to 15 years old          18,8            19,7
16-19 years old              4,2             4,1
20-24                        6,0             6,7
25-34                       17,4            17,4
35-44                       15,9            15,7
45-54                       12,4            12,5
55-59                        5,4             5,4
60-64                        5,1             5,4
65-74                        8,7             8,4
75 years and older           6,1             4,7

Distribution by age class, compared with the 1991 census

Age class              ECHP weighted   PSELL2 weighted   Census 1991
up to 18 years old          23,0            23,8             23,1
20-24 years old              6,0             6,7              7,4
25-34                       17,4            17,4             17,5
35-44                       15,9            15,7             15,4
45-54                       12,4            12,5             12,1
55-59                        5,4             5,4              5,6
60-64                        5,1             5,4              5,6
65-74                        8,7             8,4              7,3
75 years and older           6,1             4,7              5,9

The main interest of ECHP is the economic situation of people in Europe. For individuals the overview presents the central tendencies and variations of selected income types.

Income type              Weight method   Valid N   Mean       Median   Mode     Std. dev.   Range      Min       Max
wage                     ECHP            135905    82759.51   75000    75000    46301.7     397083.3   2916.67   400000
                         PSELL2          136899    80885.41   70000    60000    47772.72    397083.3   2916.67   400000
farm incomes             ECHP              3028    78818.32   65000    100000   65985.09    374000     1000      375000
                         PSELL2            2894    79673.69   66667    100000   64917.71    374000     1000      375000
old age pensions         ECHP             39357    65435.63   65000    65000    32650.9     209500     500       210000
                         PSELL2           38049    66096.29   65000    65000    32720.57    209500     500       210000
social minimum incomes   ECHP              2641    29445.1    30000    15000    13175.28    52600      3400      56000
                         PSELL2            2322    27900.18   29800    10000    13099.67    52600      3400      56000
unemployment benefits    ECHP              2628    41210.43   38000    30000    17298.84    78500      11500     90000
                         PSELL2            2701    40742      38000    28000    17357.28    78500      11500     90000
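As a rough illustration of how weighted central tendencies such as those in the table above can be obtained, the following sketch computes a weighted mean and a weighted median. The data, the function names and the median convention (the lowest value at which the cumulative weight reaches half of the total weight) are assumptions made for illustration; the statistical package used for the original tables may apply a different convention.

```python
def weighted_mean(values, weights):
    """Mean of the values, each counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Lowest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value


# Hypothetical incomes and weights for four respondents.
incomes = [30000.0, 60000.0, 75000.0, 120000.0]
weights = [50.0, 40.0, 45.0, 15.0]

mean_income = weighted_mean(incomes, weights)
median_income = weighted_median(incomes, weights)
```

With these toy values the respondent with the highest income carries the smallest weight, so the weighted mean lies below the unweighted mean of the four incomes.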

Looking at the central tendencies of household variables

The aim of this part is to contribute an element of response to the question of the impact of weighting systems within the Chintex project. The weighting systems studied are those adopted by the ECHP and by PSELL. Both systems are applied to PSELL2 data. The problem that then arises is how to compare these weighting systems. Calculating the variance would be the most judicious solution, but a more basic approach shows in a simple way which variables are influenced most and what percentage of variables is substantially influenced: calculating the ratio of the estimated means.

Choice of variables studied at household level

The variables studied are all categorical variables and all income and expenditure variables. Variables constructed from the income and expenditure variables (for instance the number of months during which the employment allowance was received) have been excluded, as have household numbering variables and all other variables whose average would not make sense, for instance dates.

Processing of the studied variables

The problem of non-response arises for all of these variables. It divides into two sub-problems. The first concerns income and expenditure variables, for which non-responses are coded as 0. Because we want to calculate a mean of weighted income, these 0 values must be recoded as missing. The second concerns categorical variables, for which missing values are coded by a negative number (-5, -9, -7). These numbers must not enter the mean, so all negative values are recoded as missing.
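The two recoding rules can be sketched as follows. The helper names and the toy data are hypothetical; the only elements taken from the text are the coding conventions (0 for non-response on income and expenditure variables, negative codes for missing values on categorical variables).

```python
def recode_income(value):
    """Income and expenditure: a 0 stands for non-response and becomes missing."""
    return None if value == 0 else value


def recode_categorical(value):
    """Categorical variables: negative codes (e.g. -5, -7, -9) become missing."""
    return None if value < 0 else value


def weighted_mean(values, weights):
    """Weighted mean over the non-missing values only."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    if not pairs:
        return None
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)


# Hypothetical raw data for four households.
raw_income = [0, 50000, 70000, 0]
raw_category = [1, -9, 2, -5]
weights = [10.0, 20.0, 30.0, 40.0]

clean_income = [recode_income(v) for v in raw_income]
clean_category = [recode_categorical(v) for v in raw_category]
mean_income = weighted_mean(clean_income, weights)
```

Leaving the zeros in place would pull the weighted mean income towards zero, which is why the recoding has to happen before any mean is computed.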


Calculating the ratio of means

Each variable, denoted var, is weighted by the ECHP weights, giving a new variable denoted varwechp, and by the PSELL weights, giving a new variable denoted varwpsell. A mean is calculated for each varwechp and each varwpsell, denoted Mvarechp and Mvarpsell respectively. Finally, for each variable the quotient of the two means (i.e. Mvarechp and Mvarpsell) is calculated and denoted Mratio.

Processing of the Mratio

The ratio gives the relative variation between the two estimates of the mean, which is (1-Mratio). For each variable we thus obtain the percentage variation ([1-Mratio]*100) between the means of the variable weighted by each method.

Aim of this ratio

With this ratio it is possible to calculate the percentage of variables from the PSELL household file whose mean varies by less than x % depending on whether they are weighted by the PSELL or the ECHP method. We take the values 5, 10, 15 and 20 for x.

Variables whose mean exhibits a difference of less than 5%

The number of variables studied is 914. Of these, approximately 80 % are categorical variables of all kinds; the remaining 20 % are income and expenditure variables. The following table gives the number and the percentage of variables with a variation of less than 5 % and of those with a variation of more than 5 %.

Percentage and frequency of variables having a variation of less than 5 % between their PSELL weighted mean and their ECHP weighted mean

Difference < 5%   Number of variables   Percent of variables
yes                      841                   92.0
no                        73                    8.0
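The ratio-of-means procedure and the tabulation by threshold can be sketched as follows. Two points are assumptions rather than statements from the text: the quotient is taken as Mvarechp divided by Mvarpsell, and the variation is taken in absolute value so that increases and decreases are treated alike. All data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean, skipping values recoded as missing (None)."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)


def percent_variation(values, w_echp, w_psell):
    """Percentage variation |1 - Mratio| * 100 for one variable."""
    m_var_echp = weighted_mean(values, w_echp)
    m_var_psell = weighted_mean(values, w_psell)
    m_ratio = m_var_echp / m_var_psell  # assumed direction of the quotient
    return abs(1.0 - m_ratio) * 100.0


def share_below(variations, thresholds=(5, 10, 15, 20)):
    """Percentage of variables whose variation is below each threshold x."""
    n = len(variations)
    return {x: 100.0 * sum(1 for v in variations if v < x) / n for x in thresholds}


# Hypothetical variable observed on two households, with two sets of weights.
example = percent_variation([10.0, 20.0], w_echp=[1.0, 1.0], w_psell=[1.0, 3.0])

# Hypothetical percentage variations for four variables.
shares = share_below([1.0, 4.0, 7.0, 12.0])

# Arithmetic check of the table above: 841 of 914 variables vary by less than 5 %.
share_841 = round(100.0 * 841 / 914, 1)
```

The last line reproduces the 92.0 % reported in the table from the two counts, which is a quick way to check the published percentages.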

We may thus conclude that 92 % of household-level variables are hardly influenced by the weighting system adopted. Of those that vary by more than 5 %, 90 % are income and expenditure variables. Weighting systems therefore have a stronger influence on numerical variables such as income and expenditure than on categorical variables, which remain relatively stable.

Variables whose mean exhibits a difference of less than 10%

The number of variables is still 914. The following table gives the number and the percentage of variables with a variation of less than 10 % and of those with a variation of more than 10 %.


Percentage and frequency of variables having a variation of less than 10 % between their PSELL weighted mean and their ECHP weighted mean

Difference < 10%   Number of variables   Percent of variables
yes                       888                   97.2
no                         26                    2.8

We may conclude that 97.2 % of variables at household level are hardly influenced by the weighting system adopted. The 26 variables with a variation above 10 % are all numerical income and expenditure variables. Note: this percentage necessarily increases as the variation margin is widened, but a variation above 10 % is not acceptable. Given these results, it is worth looking more closely at the variables that react most strongly to a change of weighting system.

Variables whose mean exhibits a difference of more than 10%

Percentage and frequency of variables having a difference of more than 10 % between their PSELL weighted mean and their ECHP weighted mean

Difference > 10%   Number of variables   Percent of variables
yes                         8                    0.9
no                        906                   99.1

There are 8 variables with very large differences. They concern the number of unemployed persons, scholarships, public policy grants and public social allowances. All of these variables concern small minorities, so the number of cases in the sample is very small. Applying the weighting systems to these variables therefore logically gives poor results, because sensitivity to the weighting system increases as the size of the surveyed population decreases.

Looking at the central tendencies of individual variables

Choice of variables studied at individual level

The variables studied are all categorical variables and all personal income variables. Variables constructed from the income and expenditure variables (for instance the number of months during which the employment allowance was received) have been excluded, as have individual numbering variables, name variables and all other variables whose values are dates.

Processing of the studied variables

The problem of non-response arises for all of these variables. It divides into two sub-problems. The first concerns income and expenditure variables, for which non-responses are coded as 0. Because we want to calculate a mean of weighted income, these 0 values must be recoded as missing. The second concerns categorical variables, for which missing values are coded by a negative number (-5, -9, -7). These numbers must not enter the mean, so all negative values are recoded as missing.

Calculating the ratio of means

Each variable, denoted var, is weighted by the ECHP weights, giving a new variable denoted varwechp, and by the PSELL weights, giving a new variable denoted varwpsell. A mean is calculated for each varwechp and each varwpsell, denoted Mvarechp and Mvarpsell respectively. Finally, for each variable the quotient of the two means (i.e. Mvarechp and Mvarpsell) is calculated and denoted Mratio.

Processing of the Mratio

The ratio gives the relative variation between the two estimates of the mean, which is (1-Mratio). For each variable we thus obtain the percentage variation ([1-Mratio]*100) between the means of the variable weighted by each method.

Aim of this ratio

With this ratio it is possible to calculate the percentage of variables from the PSELL individual file whose mean varies by less than x % depending on whether they are weighted by the PSELL or the ECHP method. We take the values 5, 10, 15 and 20 for x.

Variables whose mean exhibits a difference of less than 5%

The number of variables studied is 150. Of these, about 25 % are categorical variables of all types; the remaining 75 % are variables that concern income and expenditure.
The following table gives the number and the percentage of variables with a variation of less than 5 % and of those with a variation of more than 5 %.

Percentage and frequency of person variables having a variation of less than 5 % between their PSELL weighted mean and their ECHP weighted mean

Difference < 5%   Number of variables   Percent of variables
yes                      112                   74.7
no                        38                   25.3

We may conclude that 74.7 % of variables at individual level are hardly influenced by the weighting system adopted. The 25.3 % of variables that vary by more than 5 % are all income and expenditure variables. Weighting systems thus have a stronger influence on numerical variables such as income and expenditure than on categorical variables, which remain fairly stable. The results for the individual data file might seem much worse than those for the household file, but this overlooks the fact that the variables influenced most, the numerical variables, represent only 20 % of the variables studied in the household file whereas they represent 75 % of the variables studied in the individual data file. It therefore seems reasonable to say that the results obtained for the individual file and for the household file are rather similar.

Variables whose mean exhibits a difference of less than 10%

Using the same variables as above, the number of variables is still 150. The following table gives the number and the percentage of variables with a variation of less than 10 % and of those with a variation of more than 10 %.

Percentage and frequency of variables having a variation of less than 10 % between their PSELL weighted mean and their ECHP weighted mean

Difference < 10%   Number of variables   Percent of variables
yes                       127                   84.7
no                         23                   15.3

We may thus conclude that 84.7 % of variables at individual level are hardly influenced by the weighting system adopted. This result confirms those of the preceding paragraph 4.1: the 23 variables with a variation above 10 % are all numerical income and expenditure variables. Note: this percentage necessarily increases as the variation margin is widened, but a difference above 10 % is not acceptable. Given these results, it is worth looking more closely at the variables that react most strongly to a change of weighting system.

Variables whose mean exhibits a difference of more than 10%

Let us study the variables whose mean changes by more than 10 % depending on the weighting system chosen. The following table gives the number and the percentage of variables concerned.

Percentage and frequency of variables having a variation of more than 10 % between their PSELL weighted mean and their ECHP weighted mean

Difference > 10%   Number of variables   Percent of variables
yes                        23                   15.3
no                        127                   84.7

There are 23 such variables. They are exclusively income and expenditure variables, for instance apprenticeship salary, agricultural benefits, industrial and commercial benefits, alimony, pensions, permanent accident allowance and scholarships. As at household level, all these variables concern small minorities of persons, so the number of cases in the sample is very small. Applying weights to these variables therefore logically gives poor results, because sensitivity to the weighting system increases as the size of the surveyed population decreases.

Conclusion from looking at the central tendencies

The weighting systems adopted by PSELL and ECHP hardly influence the results for categorical variables, at household level and at individual level alike; the income and expenditure variables, however, generally show a slight variation. The income and expenditure variables that are most affected are those concerning minorities. We may say that it is not the chosen weighting system that causes this variation but the size of the sub-sample considered. We may adopt the hypothesis that the weighting system influences the results only slightly.

Comparing the means of some income variables after the ECHP weight and the PSELL2 weight

Means

Table: not weighted means and PSELL2 weighted means

                                    NONE (1)               WCAL94 (2)
TOTAL                               valid N   mean         valid N   mean         Dif in % of NONE
i94138m  wage                        2761     77424.435    135905    82759.510     6.89
i94148m  farm income                   76     73217.583      3028    78818.320     7.65
i94154m  old age pension              833     64673.791     39357    65435.631     1.18
i94168m  social minimum income         50     27978.660      2641    29445.105     5.24
i94184m  unemployment benefits         55     37641.800      2628    41210.434     9.48

(1) not weighted; (2) weighted by the PSELL2 method

Table: not weighted means and ECHP weighted means

                                    NONE (1)               W94ECHPR (2)
TOTAL                               valid N   mean         valid N   mean         Dif in % of NONE
i94138m  wage                        2761     77424.435    135905    80885.410     4.47
i94148m  farm income                   76     73217.583      3028    79673.687     8.82
i94154m  old age pension              833     64673.791     39357    66096.288     2.20
i94168m  social minimum income         50     27978.660      2641    27900.185    -0.28
i94184m  unemployment benefits         55     37641.800      2628    40742.388     8.24

(1) not weighted; (2) weighted by the ECHP method

Table: PSELL2 weighted means and ECHP weighted means

                                    WCAL94 (1)             W94ECHPR (2)
TOTAL                               valid N   mean         valid N   mean         Dif in % of WCAL94
i94138m  wage                       135905    82759.510    136899    80885.410    -2.26
i94148m  farm income                  3028    78818.320      2894    79673.687     1.09
i94154m  old age pension             39357    65435.631     38049    66096.288     1.01
i94168m  social minimum income        2641    29445.105      2322    27900.185    -5.25
i94184m  unemployment benefits        2628    41210.434      2701    40742.388    -1.14

(1) weighted by the PSELL2 method; (2) weighted by the ECHP method
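The “Dif in %” columns of the tables above appear to be the relative difference of the second mean with respect to the first, expressed in percent. This reading is an inference from the reported figures, not a definition given in the text. The short sketch below checks it against the wage row, with a hypothetical helper `pct_diff`.

```python
def pct_diff(reference, other):
    """Relative difference of `other` with respect to `reference`, in percent."""
    return 100.0 * (other - reference) / reference


# Mean wage as reported in the three tables above.
unweighted_wage = 77424.435
wcal94_wage = 82759.510      # weighted by the PSELL2 method
w94echpr_wage = 80885.410    # weighted by the ECHP method

dif_psell_vs_none = round(pct_diff(unweighted_wage, wcal94_wage), 2)
dif_echp_vs_none = round(pct_diff(unweighted_wage, w94echpr_wage), 2)
dif_echp_vs_psell = round(pct_diff(wcal94_wage, w94echpr_wage), 2)
```

The three results agree with the 6.89, 4.47 and -2.26 reported for the wage variable.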

Comparing the variances of some income variables after the ECHP weight and the PSELL2 weight

Variances

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
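The variance formula above translates directly into code. The weighted variant shown next to it treats each weight as the number of population units a respondent represents; this is an assumption about how the weighted variances were obtained, not something the text states.

```python
def population_variance(values):
    """Variance with divisor n, as in the formula above."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def weighted_variance(values, weights):
    """Same formula, with each observation counted `weight` times (assumption)."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    return sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total


# Hypothetical data: a weight of 3 is equivalent to repeating the value 3 times.
var_repeated = population_variance([1.0, 1.0, 1.0, 3.0])
var_weighted = weighted_variance([1.0, 3.0], [3.0, 1.0])
```

With integer weights the two functions give the same result, which makes the frequency interpretation of the weights explicit.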

Table: not weighted and PSELL2 weighted variances

                                    NONE (1)                    WCAL94 (2)
TOTAL                               valid N   variances         valid N   variances         Dif in % of NONE
i94138m  wage                        2761     1900918909.058    135905    2143847226.877    12.78
i94148m  farm income                   76     3840740937.052      3028    4354031864.783    13.36
i94154m  old age pension              833     1035292640.687     39357    1066081494.001     2.97
i94168m  social minimum income         50      165628919.739      2641     173587921.050     4.81
i94184m  unemployment benefits         55      285032290.422      2628     299249899.237     4.99

(1) not weighted; (2) weighted by the PSELL2 method

Table: not weighted and ECHP weighted variances

                                    NONE (1)                    W94ECHPR (2)
TOTAL                               valid N   variances         valid N   variances         Dif in % of NONE
i94138m  wage                        2761     1900918909.058    136899    2282232309.243    20.06
i94148m  farm income                   76     3840740937.052      2894    4214308839.477     9.73
i94154m  old age pension              833     1035292640.687     38049    1070635803.109     3.41
i94168m  social minimum income         50      165628919.739      2322     169772413.329     2.50
i94184m  unemployment benefits         55      285032290.422      2701     301275032.949     5.70

(1) not weighted; (2) weighted by the ECHP method

Table: PSELL2 weighted and ECHP weighted variances

                                    WCAL94 (1)                  W94ECHPR (2)
TOTAL                               valid N   variances         valid N   variances         Dif in % of WCAL94
i94138m  wage                       135905    2143847226.877    136899    2282232309.243     6.45
i94148m  farm income                  3028    4354031864.783      2894    4214308839.477    -3.21
i94154m  old age pension             39357    1066081494.001     38049    1070635803.109     0.43
i94168m  social minimum income        2641     173587921.050      2322     169772413.329    -2.20
i94184m  unemployment benefits        2628     299249899.237      2701     301275032.949     0.68

(1) weighted by the PSELL2 method; (2) weighted by the ECHP method

Comparing the variation coefficients of some income variables after the ECHP weight and the PSELL2 weight

Coefficient of variation

$$\text{coefficient of variation} = \frac{\sigma}{\bar{x}}$$

TOTAL                               NONE (1)   W94CAL (2)   W94ECHPR (3)
i94138m  wage                       0.56312    0.55947      0.59062
i94148m  farm income                0.84643    0.83718      0.81479
i94154m  old age pension            0.49751    0.49898      0.49504
i94168m  social minimum income      0.45998    0.44745      0.46701
i94184m  unemployment benefits      0.44851    0.41977      0.42603

(1) not weighted; (2) weighted by the PSELL2 method; (3) weighted by the ECHP method
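The coefficients of variation can be recomputed from the means and variances reported in the preceding tables. The sketch below does this for the wage variable; the function name is hypothetical and the input figures are copied from the tables.

```python
import math


def coefficient_of_variation(variance, mean):
    """Standard deviation divided by the mean."""
    return math.sqrt(variance) / mean


# Wage: variances and means as reported in the preceding tables.
cv_none = coefficient_of_variation(1900918909.058, 77424.435)   # not weighted
cv_psell = coefficient_of_variation(2143847226.877, 82759.510)  # PSELL2 weights
cv_echp = coefficient_of_variation(2282232309.243, 80885.410)   # ECHP weights
```

Rounded to five decimals, the three values match the wage row of the table (0.56312, 0.55947 and 0.59062).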


References

Arminger, G.; Clogg, C. C.; Sobel, M. E. (eds.) 1995: Handbook of Statistical Modeling for the Social and Behavioral Sciences, New York, Plenum Press

Ardilly, P. 1994: Les techniques de sondage, Paris, Editions Technip

Berger, Y. G. 1998: Variance Estimation Using List Sequential Scheme for Unequal Probability Sampling. In: Journal of Official Statistics, Vol. 14, No. 3, pp. 315-323

Cases, C. 1996: Méthodologie du Panel européen de ménage: Exploitation des données de la vague 1 du fichier français, INSEE, Série des documents de travail de la Direction des Statistiques Démographiques et Sociales, F9705

Chambaz, C.; Saunier, J.M.; Valdelievre, H. 1997: Méthodologie du Panel européen de ménage: Exploitation des données de la vague 2 du fichier français, INSEE, Série des documents de travail de la Direction des Statistiques Démographiques et Sociales, F9715

Cochran, W. G. 1977: Sampling techniques, New York, John Wiley

Colloque francophone sur les sondages 1997, Université Rennes 2, 19-20 juin 1997

Deming, W. E. 1943: Statistical Adjustment of Data, New York, John Wiley

Deming, W. E.; Stephan, F. F. 1940: On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known. In: Annals of Mathematical Statistics, 11, pp. 427-444

Deville, J.-C. 1997: Estimation de variance pour des statistiques complexes: techniques de résidus et de linéarisation. In: Colloque francophone sur les sondages 1997, pp. 134133

Deville, J.-C.; Särndal, C.-E. 1992: Calibration estimators in survey sampling. In: Journal of the American Statistical Association, 87 (418), pp. 376-382

Deville, J.-C.; Särndal, C.-E.; Sautory, O. 1993: Generalized raking procedures in survey sampling. In: Journal of the American Statistical Association, 88 (423), pp. 1013-1020

Ernst, L. R. 1989: Weighting Issues for Longitudinal Household and Family Estimates. In: Kasprzyk, D.; Duncan, G.; Kalton, G.; Singh, M.P. (1989) pp. 139

EUROSTAT 2000: Assessment of the Quality in Statistics. Definition of quality in statistics. Doc. Eurostat/A4/Quality/00/General/Definition. Luxembourg

EUROSTAT 2001: Construction of Weights in the ECHP. Doc. Pan. 165/2001-12. Luxembourg

EUROSTAT 2002: ECHP Weighting - Review after Wave 5. Doc. Pan. 183/02. Luxembourg

Gabler, S.; Hoffmeyer-Zlotnik, J.H-P.; Krebs, D. (eds.) 1994: Gewichtung in der Umfragepraxis, Opladen, Westdeutscher Verlag

Gabler, S.; Hoffmeyer-Zlotnik, J.H-P. (eds.) 1997: Stichproben in der Umfragepraxis, Opladen, Westdeutscher Verlag

Haisken-DeNew, J. P.; Frick, J. R. 2001: DTC Desktop Companion to the German Socio-Economic Panel Study (GSOEP). Berlin, DIW http://www.diw.de/deutsch/sop/service/dtc/ accessed Aug. 2, 2002

Kasprzyk, D.; Duncan, G.; Kalton, G.; Singh, M.P. 1989: Panel Surveys, New York, John Wiley

Kellerer, H. 1963: Theorie und Technik des Stichprobenverfahrens. Eine Einführung unter besonderer Berücksichtigung der Anwendung auf soziale und wirtschaftliche Massenerscheinungen, Einzelschriften der Deutschen Statistischen Gesellschaft Nr. 5, München

Kish, L. 1965: Survey Sampling, New York, John Wiley

Kish, L. 1990: Weighting: Why, when, and how? In: ASA Proceedings of the Section on Survey Research Methods, pp. 121-130

Koch, A.; Porst, R. (eds.) 1998: Nonresponse in Survey Research. Proceedings of the Eighth International Workshop on Household Survey Nonresponse, 24-26 September 1997, ZUMA Nachrichten Spezial Nr. 4, ZUMA, Mannheim

Kovacevic, M. S.; Binder, D. A. 1997: Variance Estimation for Measures of Income Inequality and Polarization - Estimating Equations Approach. In: Journal of Official Statistics, Vol. 13, No. 1, pp. 41-58

Lavallee, P. 1995: Ponderation transversale des enquêtes longitudinales menées auprès des individus et des ménages à l’aide de la méthode du partage des poids. In: Statistiques Canada 1995: Techniques d’enquêtes Vol. 21, no. 1, pp. 27-35

Lazzeroni, L. C.; Little, R. J. A. 1998: Random-effects Models for Smoothing Poststratification Weights. In: Journal of Official Statistics, Vol. 14, No. 1, pp. 61-78

Legendre, N. 2000: Méthodologie du Panel européen de ménage: Exploitation des données de la vague 3 du fichier français, INSEE, Série des documents de travail de la Direction des Statistiques Démographiques et Sociales, F2001

Levy, P. S.; Lemeshow, S. 1991: Sampling of Populations: Methods and Applications, New York, John Wiley

Little, R. J. A. 1991: Inference with Survey Weights. In: Journal of Official Statistics, Vol. 7, pp. 405-424

Lundström, S.; Särndal, C.-E. 1999: Calibration as a Standard Method for Treatment of Nonresponse. In: Journal of Official Statistics, Vol. 15, No. 2, pp. 305-327

Merz, J. 1994: ADJUST - A Program Package to Adjust Microdata by the Minimum Information Loss Principle, Program-Manual, FFB-Documentation No. 1e, Department of Economics and Social Sciences, University of Lüneburg, Lüneburg

Rendtel, U. 1995: Lebenslagen im Wandel: Panelausfälle und Panelrepräsentativität, Frankfurt a.M., Campus Verlag

Rothe, G.; Wiedenbeck, M. 1987: Stichprobengewichtung: Ist Repräsentativität machbar? ZUMA-Nachrichten 21

Rothe, G. 1990: Wie (un)wichtig sind Gewichtungen? ZUMA-Nachrichten 26

Särndal, C.-E.; Swensson, B.; Wretman, J. 1992: Model Assisted Survey Sampling, New York, Springer-Verlag

Sautory, O. 1992: Redressement d'échantillons d'enquêtes auprès des ménages par calage sur marges, Insee Méthodes, 29-30-31, pp. 299-326

Sautory, O. 1993: La macro CALMAR. Redressement d'échantillons par calage sur marges, Document no F9310. INSEE Direction Générale, Série des documents de travail de la Direction des Statistiques Démographiques et Sociales

Skinner, C. J.; Holt, D.; Smith, T. M. F. (eds.) 1989: Analysis of Complex Surveys, New York, John Wiley

Spiess, M.: Derivation of Design Weights: The case of the German Socio-Economic Panel (GSOEP), revised version. Berlin, DIW http://www.diw.de/deutsch/sop/service/doku/paper197.pdf accessed Aug. 2, 2002

Wolter, K. M. 1985: Introduction to Variance Estimation, New York, Springer-Verlag