Along with interval and ordinal variables we can use nominal level variables that are dichotomous, such as gender, in multiple regression analysis. In previous labs we have used a dichotomous variable for age to define subsets of cases. We can also use dichotomous variables as independent variables in regression. When scored as either a 0 or 1, dichotomies are often referred to as "dummy" variables. They indicate either the absence or presence of a characteristic or trait. Hence they function as a "dummy" for the variable in question. The most obvious use is when a variable either already has or has been recoded into two categories. However, the logic of dummy variables can also be extended to enable us to include nominal level variables with more than two categories in our multiple regressions. Examples of such variables include region, province, country, Canadian party identification, occupation and marital status.
Consider the hypothesis that income depends on gender, education, and region of residence. Working once again with our example of predicting income level using the CRIC2003 data, we could develop an equation of the form Y = A + B_{1}X_{1} + B_{2}X_{2} + B_{3}X_{3} + B_{4}X_{4} + B_{5}X_{5} + B_{6}X_{6} + B_{7}X_{7}, where A is the constant, each B is the coefficient for the corresponding variable, and:
X_{1} = Female (by coding female respondents as 1 and males as 0);
X_{2} = Education (by maintaining the original coding);
X_{3} = BC residence (by coding BC respondents as 1 and everyone else as 0);
X_{4} = Alberta residence (by coding Alberta respondents as 1 and everyone else as 0);
X_{5} = Manitoba/Saskatchewan residence (by coding Man/Sask respondents as 1 and everyone else as 0);
X_{6} = Quebec residence (by coding Quebec respondents as 1 and everyone else as 0);
X_{7} = Atlantic residence (by coding all Atlantic respondents as 1 and everyone else as 0).
You will notice in both the equation and the syntax that the most populous province, Ontario, has been left out of the list.
At least one category must always be omitted, which leaves a group scored zero on every dummy with which to compare each of the other categories. In this case, we need a group that is not BC, not Alberta, not Man/Sask, not Quebec and not Atlantic. If, along with all the other regions, we were to enter Ontario as a predictor in the equation, its values would be perfectly predicted by the combination of the other regional dummy variables: a respondent lives in Ontario exactly when all five other dummies equal 0. This would create a situation of perfect multicollinearity. So we intentionally leave out at least one of the categories. The omitted category becomes the reference category against which the effects of the other categories are assessed. We can interpret the results as the difference between each category and this omitted category.
You can choose any category to be omitted. However, consider your options carefully and omit the category that makes the most sensible reference point for all the other categories. Typically, this is the most common or largest category.
It is important that, with the exception of the reference category, you include each of the other categories of your variable in the regression.
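The coding scheme described above can be sketched in a few lines of Python. This is only an illustration, not the lab's Webstats syntax: the region labels and the `region_dummies` helper are assumptions made for the example, and the actual CRIC2003 codes may differ.

```python
# Hypothetical region labels; the real CRIC2003 coding may differ.
REGIONS = ["BC", "Alberta", "Man/Sask", "Ontario", "Quebec", "Atlantic"]
REFERENCE = "Ontario"  # the omitted (reference) category


def region_dummies(region):
    """Return 0/1 dummies for every region except the reference category."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return {name: int(region == name) for name in REGIONS if name != REFERENCE}


respondents = ["Ontario", "Quebec", "BC", "Ontario", "Atlantic"]
rows = [region_dummies(r) for r in respondents]

for region, row in zip(respondents, rows):
    # An Ontario dummy would always equal 1 minus the sum of the other five,
    # which is why entering all six dummies would be perfectly collinear.
    implied_ontario = 1 - sum(row.values())
    print(region, row, "implied Ontario dummy:", implied_ontario)
```

Ontario respondents score 0 on all five dummies, so they form the comparison group for every regional coefficient.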
Example Syntax
Syntax Legend
The usual missing values are declared
Income, gender, and education are recoded
A set of regional dummy variables is created through another set of recodes
The regression command lists the dependent and all seven independent variables.
The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance.
The descriptives subcommand asks for the number of cases used in the regression.
The dependent variable is declared with the dependent subcommand.
The method subcommand indicates that education, gender, and the five regional dummies should be entered together in the regression.
Example Output
N of Cases = 1830
* * * * M U L T I P L E R E G R E S S I O N * * * *
Equation Number 1 Dependent Variable.. INCOME
Descriptive Statistics are printed on Page 4
Block Number 1. Method: Enter EDUCATE FEMALE BC ALBERTA PRAIRIE QUEBEC ATLANTIC
1.. FEMALE
2.. EDUCATE
3.. PRAIRIE
4.. ATLANTIC
5.. ALBERTA
6.. BC
7.. QUEBEC
Multiple R .39553
R Square .15644
Adjusted R Square .15320
Standard Error 2.52800
F = 48.26685 Signif F = .0000
------------------------- Variables in the Equation --------------------------
Variable | B | SE B | Beta | Tolerance | VIF | T | Sig T |
EDUCATE | .517381 | .036775 | .305824 | .979913 | 1.020 | 14.069 | .0000 |
FEMALE | -.707809 | .118295 | -.128854 | .998401 | 1.002 | -5.983 | .0000 |
BC | -.635267 | .190322 | -.077849 | .851211 | 1.175 | -3.338 | .0009 |
ALBERTA | .303136 | .213140 | .032671 | .877450 | 1.140 | 1.422 | .1551 |
PRAIRIE | -.934037 | .245900 | -.085847 | .906485 | 1.103 | -3.798 | .0002 |
QUEBEC | -.914617 | .152956 | -.144970 | .787751 | 1.269 | -5.980 | .0000 |
ATLANTIC | -1.023321 | .232552 | -.100351 | .890308 | 1.123 | -4.400 | .0000 |
(Constant) | 4.185041 | .200155 | | | | 20.909 | .0000 |
Interpretation of Output
The R-square and adjusted R-square figures indicate that approximately 15% of the variance in incomes is explained by gender, education, and region of residence. Compared to the regression model without the dummies, the R-square is higher by 3 percentage points. NOTE: this does NOT mean that region of residence on its own explains only a small amount of the variance in incomes, because the effects of education and gender are simultaneously controlled in the regression equation. If we wanted to examine the impact of only region of residence on the R-square, we could run a regression with only the regional dummies in the equation.
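As a rough check on how these summary figures relate to one another, the sketch below recomputes the adjusted R-square and the F statistic from the printed R Square, the number of cases, and the number of predictors, using the standard textbook formulas. The function names are illustrative, and small differences from the printed output are expected because the R Square is rounded.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Shrink R-square to account for the number of predictors."""
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)


def f_statistic(r_square, n_cases, n_predictors):
    """Ratio of explained to unexplained variance, each per degree of freedom."""
    explained = r_square / n_predictors
    unexplained = (1 - r_square) / (n_cases - n_predictors - 1)
    return explained / unexplained


# Values taken from the example output: R Square .15644, 1830 cases, 7 predictors.
adj = adjusted_r_square(0.15644, 1830, 7)
f = f_statistic(0.15644, 1830, 7)
print(f"Adjusted R Square is about {adj:.4f}; F is about {f:.2f}")
```

Both results should land very close to the Adjusted R Square and F values printed in the example output.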
The B values for education and gender indicate the direction and number of units (as coded) of change in the dependent variable due to a one unit change in each independent variable. In this example, the indicators for education and income are ordinal measures. The results show that earning more income depends in part on obtaining more education. Controlling for the effects of gender and region, a one category increase in education produces about a half category increase (+.52) in income. A university graduate, for example, could earn $10-20,000 more annual income than someone who has not yet finished university. The multiple regression results also indicate that women have lower incomes than men. Controlling for the effects of education and region, females earn about seven-tenths of an income category less (-.71) than males.
The B values for the regional dummies indicate that residents of the Prairies (Manitoba/Saskatchewan), Quebec, and Atlantic Canada have incomes that are about one income category lower than residents of Ontario. British Columbians earn about .6 of an income category less than Ontarians, whereas Albertans earn about .3 of an income category more than Ontarians. However, the T-test results, which tell us whether each of the five regions is significantly different from Ontario, show that the difference between Ontarians and Albertans is not statistically significant.
We should also note that because Ontario is the same reference group for each of the other regional dummies, we can directly compare each of the regions to one another: British Columbians, for example, earn higher incomes on average than Atlantic Canadians (-.64 versus -1.02), but lower incomes than Albertans (-.64 versus .30).
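The T values in the output are simply each B divided by its standard error. The following sketch recomputes them for the regional dummies from the example output and flags which differences from Ontario pass the conventional 1.96 cut-off; the dictionary layout and the cut-off constant are choices made for this illustration.

```python
# (B, SE B) pairs copied from the example output.
REGION_COEFFICIENTS = {
    "BC": (-0.635267, 0.190322),
    "ALBERTA": (0.303136, 0.213140),
    "PRAIRIE": (-0.934037, 0.245900),
    "QUEBEC": (-0.914617, 0.152956),
    "ATLANTIC": (-1.023321, 0.232552),
}
CRITICAL_T = 1.96  # two-tailed 5% cut-off for a large sample

t_ratios = {name: b / se for name, (b, se) in REGION_COEFFICIENTS.items()}

for name, t in t_ratios.items():
    verdict = "significant" if abs(t) > CRITICAL_T else "not significant"
    print(f"{name:9s} T = {t:7.3f}  difference from Ontario is {verdict}")
```

Only Alberta falls short of the cut-off, which agrees with its Sig T of .1551 in the output.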
We can also determine whether or not each of these regional differences is statistically significant by examining the standard errors of the region coefficients. For each regional dummy, take the interval from B minus 1.96 standard errors to B plus 1.96 standard errors. If the intervals for a pair of regional dummies do not overlap, we can be 95% confident the difference between them is not due to sampling error. In this example, Albertans have significantly higher incomes than residents of the Prairies, Quebec, and Atlantic Canada, since the interval for Alberta does not overlap with the interval for any of these three regions. This task can be made easier by asking Webstats to produce confidence intervals with your output. You can do this by including the "ci" keyword in the "/statistics" line of your regression syntax. However, it is useful to know how to calculate those confidence intervals yourself by examining the standard errors of the B values, since some of the research you read may not report confidence intervals.
The constant (or y-intercept) indicates the value of A in the regression equation, that is, the predicted value of income when every independent variable equals zero.
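The hand calculation described above can be sketched as follows. The B and SE B values are copied from the example output; the helper names `confidence_interval` and `intervals_overlap` are made up for this illustration and are not Webstats commands.

```python
# (B, SE B) pairs copied from the example output.
REGION_COEFFICIENTS = {
    "BC": (-0.635267, 0.190322),
    "ALBERTA": (0.303136, 0.213140),
    "PRAIRIE": (-0.934037, 0.245900),
    "QUEBEC": (-0.914617, 0.152956),
    "ATLANTIC": (-1.023321, 0.232552),
}


def confidence_interval(b, se, z=1.96):
    """Return the (lower, upper) bounds of B +/- z standard errors."""
    return (b - z * se, b + z * se)


def intervals_overlap(first, second):
    """True when two (lower, upper) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


intervals = {
    name: confidence_interval(b, se)
    for name, (b, se) in REGION_COEFFICIENTS.items()
}

for name, (low, high) in intervals.items():
    print(f"{name:9s} 95% CI: {low:7.3f} to {high:7.3f}")

for other in ("BC", "PRAIRIE", "QUEBEC", "ATLANTIC"):
    overlap = intervals_overlap(intervals["ALBERTA"], intervals[other])
    print(f"ALBERTA vs {other}: intervals overlap = {overlap}")
```

Run on these figures, the Alberta interval does not overlap with the Prairie, Quebec, or Atlantic intervals. By the same rule it also appears not to overlap with the BC interval, while the BC and Atlantic intervals do overlap.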
From the information in the regression output we can write the following regression equation:
income = 4.19 + .52 educate - .71 female - .64 BC + .30 Alberta - .93 Prairies - .92 Quebec - 1.02 Atlantic.
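To see how the equation is used, here is a minimal sketch that plugs values into it. The coefficients are the rounded ones from the equation above; the function name, the region labels, and the education score of 4 are assumptions chosen only for illustration.

```python
CONSTANT = 4.19
B_EDUCATE = 0.52
B_FEMALE = -0.71

# Ontario is the reference category, so its regional effect is zero.
REGION_EFFECTS = {
    "Ontario": 0.0,
    "BC": -0.64,
    "Alberta": 0.30,
    "Prairies": -0.93,
    "Quebec": -0.92,
    "Atlantic": -1.02,
}


def predicted_income(educate, female, region):
    """Predicted income category for one respondent (female is coded 0 or 1)."""
    return CONSTANT + B_EDUCATE * educate + B_FEMALE * female + REGION_EFFECTS[region]


# Hypothetical respondents who share the same education score.
man_ontario = predicted_income(educate=4, female=0, region="Ontario")
woman_quebec = predicted_income(educate=4, female=1, region="Quebec")
print(round(man_ontario, 2), round(woman_quebec, 2))
```

The gap between the two predictions is the sum of the female coefficient and the Quebec coefficient, since education is held constant.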
Instructions
Part 2: Interaction Terms
 | <30 (0) | >30 (1) |
Male (0) | 0 | 0 |
Female (1) | 0 | 1 |
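The table is consistent with an interaction term formed by multiplying the two dummies together. The sketch below assumes that is how FEM30 is built from FEMALE and AGEDUMMY (the variable names come from the output that follows; the function itself is illustrative).

```python
def fem30(female, agedummy):
    """Interaction term: 1 only for respondents who are female AND over 30."""
    return female * agedummy


labels_gender = {0: "Male", 1: "Female"}
labels_age = {0: "<30", 1: ">30"}

for female in (0, 1):
    for agedummy in (0, 1):
        value = fem30(female, agedummy)
        print(f"{labels_gender[female]:6s} {labels_age[agedummy]}: FEM30 = {value}")
```

Only the female, over-30 cell receives a 1, matching the table above.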
Output
* * * * M U L T I P L E R E G R E S S I O N * * * *
Listwise Deletion of Missing Data
N of Cases = 1835
Equation Number 1 Dependent Variable.. INCOME
Descriptive Statistics are printed on Page 2
Block Number 1. Method: Enter AGEDUMMY FEMALE
Variable(s) Entered on Step Number
1.. FEMALE
2.. AGEDUMMY
Multiple R .18121
R Square .03284
Adjusted R Square .03178
Standard Error 2.70153
F = 31.10496 Signif F = .0000
------------------------- Variables in the Equation --------------------------
Variable | B | SE B | Beta | Tolerance | VIF | T | Sig T |
AGEDUMMY | .679421 | .135058 | .115613 | .999351 | 1.001 | 5.031 | .0000 |
FEMALE | -.782345 | .126165 | -.142512 | .999351 | 1.001 | -6.201 | .0000 |
(Constant) | 5.545508 | .126958 | | | | 43.680 | .0000 |
----------- Variables not in the Equation ------------
Variable | Tolerance | VIF |
FEM30 | .240225 | 4.163 |
End Block Number 1 All requested variables entered.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Block Number 2. Method: Enter AGEDUMMY FEMALE FEM30
Variable(s) Entered on Step Number
3.. FEM30
Multiple R .19095
R Square .03646
Adjusted R Square .03488
Standard Error 2.69720
F = 23.09946 Signif F = .0000
------------------------- Variables in the Equation --------------------------
Variable | B | SE B | Beta | Tolerance | VIF | T | Sig T |
AGEDUMMY | 1.029219 | .189592 | .175137 | .505511 | 1.978 | 5.429 | .0000 |
FEMALE | -.302044 | .222161 | -.055020 | .321268 | 3.113 | -1.360 | .1741 |
FEM30 | -.707862 | .269702 | -.122831 | .240225 | 4.163 | -2.625 | .0087 |
(Constant) | 5.312372 | .154780 | | | | 34.322 | .0000 |
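One way to read the Block 2 coefficients is to work out the predicted income category for each of the four gender-by-age groups. The sketch below does this with the B values from the output, assuming AGEDUMMY is coded 0 for under 30 and 1 for over 30 and that FEM30 is the product of FEMALE and AGEDUMMY; the function name is illustrative.

```python
# B values copied from Block Number 2 of the output.
CONSTANT = 5.312372
B_AGEDUMMY = 1.029219
B_FEMALE = -0.302044
B_FEM30 = -0.707862


def predicted_income(female, agedummy):
    """Predicted income category; the interaction is female * agedummy."""
    interaction = female * agedummy
    return (CONSTANT + B_AGEDUMMY * agedummy
            + B_FEMALE * female + B_FEM30 * interaction)


for female, gender in ((0, "Male"), (1, "Female")):
    for agedummy, age in ((0, "<30"), (1, ">30")):
        print(f"{gender:6s} {age}: {predicted_income(female, agedummy):.3f}")

# Gender gap (female minus male) within each age group.
gap_under_30 = predicted_income(1, 0) - predicted_income(0, 0)
gap_over_30 = predicted_income(1, 1) - predicted_income(0, 1)
print(f"Gap under 30: {gap_under_30:.3f}; gap over 30: {gap_over_30:.3f}")
```

The gap among those under 30 equals the FEMALE coefficient alone, while the gap among those over 30 adds the FEM30 coefficient to it, which is what the interaction term is meant to capture.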
Instructions