POL242 LAB MANUAL:  EXERCISE 9B

Regressions with Dummy Variables and Interaction Terms

Part 1: Dummy Variables

PURPOSE

MAIN POINTS

Along with interval and ordinal variables we can use nominal level variables that are dichotomous, such as gender, in multiple regression analysis. In previous labs we have used a dichotomous variable for age to define subsets of cases. We can also use dichotomous variables as independent variables in regression. When scored as either a 0 or 1, dichotomies are often referred to as "dummy" variables. They indicate either the absence or presence of a characteristic or trait. Hence they function as a "dummy" for the variable in question. The most obvious use is when a variable either already has or has been recoded into two categories. However, the logic of dummy variables can also be extended to enable us to include nominal level variables with more than two categories in our multiple regressions. Examples of such variables include region, province, country, Canadian party identification, occupation and marital status. 

 

An Example of Dummy Variables in Multiple Regression

Consider the hypothesis that income depends on gender, education, and region of residence. Working once again with our example of predicting income level using the CRIC2003 data, we could develop an equation of the form Y = A + B1(X1) + B2(X2) + B3(X3) + B4(X4) + B5(X5) + B6(X6) + B7(X7), where each B is the coefficient for the corresponding X and:

X1 = Female (by coding female respondents as 1 and males as 0);

X2 = Education (by maintaining the original coding);

X3 = BC residence (by coding BC respondents as 1 and everyone else as 0);

X4 = Alberta residence (by coding Alberta respondents as 1 and everyone else as 0);

X5 = Manitoba/Saskatchewan residence (by coding Man/Sask respondents as 1 and everyone else as 0);

X6 = Quebec residence (by coding Quebec respondents as 1 and everyone else as 0);

X7 = Atlantic residence (by coding all Atlantic respondents as 1 and everyone else as 0).

You will notice in both the equation and the syntax that the most populous province, Ontario, has been left out of the list.

At least one category must always be omitted, which leaves a group scored zero on every dummy with which to compare each of the other categories. In this case, we need something to be not BC, not Alberta, not Man/Sask, not Quebec and not Atlantic. If, along with all the other regions, we were to enter Ontario as a predictor in the equation, its values would be perfectly (negatively) correlated with the combination of the other regional dummy variables. This would create a situation of perfect multicollinearity. So we intentionally leave out at least one of the categories. The omitted category becomes the reference category against which the effects of the other categories are assessed. We can interpret each result as the difference between that category and the omitted category.

You can choose any category to be omitted. However, carefully consider your options and exclude the category that makes the most sensible reference point for all the other categories. Typically, this is the most common or largest category.

It is important that, with the exception of the reference category, you include each of the other categories of your variable in the regression.
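The coding scheme described above can be sketched outside SPSS. The Python below is only an illustration: the function name `make_dummies` and the six sample respondents are assumptions made for the example, not part of the CRIC2003 syntax. It builds one 0/1 variable for every category except the reference category, and shows why adding a dummy for the reference category would be redundant.

```python
def make_dummies(values, reference):
    """Return {category: [0/1, ...]} for every category except `reference`."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference category not found in data")
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }

# Hypothetical region codes for six respondents (not real CRIC2003 cases).
regions = ["Ontario", "BC", "Quebec", "Alberta", "Atlantic", "Man/Sask"]
dummies = make_dummies(regions, reference="Ontario")

# Ontario respondents score 0 on every dummy; everyone else scores 1 on exactly one.
row_sums = [sum(d[i] for d in dummies.values()) for i in range(len(regions))]

# An Ontario dummy would carry no new information: it always equals
# 1 minus the sum of the other dummies (perfect collinearity).
ontario = [1 - s for s in row_sums]

print(sorted(dummies))
print(row_sums)
print(ontario)
```

Changing the `reference` argument changes which group every coefficient is compared against, but not the overall fit of the model.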

Example Syntax

Syntax Legend

The usual missing values are declared.

Income, gender, and education are recoded.

A set of regional dummy variables is created through another set of recodes.

The regression command lists the dependent and all seven independent variables.

The statistics subcommand asks for regression coefficients, explained variance (r), and tolerance.

The descriptives subcommand asks for the number of cases used in the regression.

The dependent variable is declared with the dependent subcommand.

The method subcommand indicates that education, gender, and the five regional dummies should be entered together in the regression.

Example Output

N of Cases =  1830



           * * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Equation Number 1    Dependent Variable..   INCOME

  Descriptive Statistics are printed on Page    4

Block Number  1.  Method:  Enter    EDUCATE  FEMALE BC ALBERTA  PRAIRIE  QUEBEC  ATLANTIC

 

   1..    FEMALE
   2..    EDUCATE
   3..    PRAIRIE
   4..    ATLANTIC
   5..    ALBERTA
   6..    BC
   7..    QUEBEC


Multiple R           .39553
R Square             .15644
Adjusted R Square    .15320
Standard Error      2.52800

F =      48.26685       Signif F =  .0000


------------------------- Variables in the Equation --------------------------

Variable              B        SE B       Beta  Tolerance        VIF         T    Sig T

EDUCATE         .517381     .036775    .305824    .979913      1.020    14.069    .0000
FEMALE         -.707809     .118295   -.128854    .998401      1.002    -5.983    .0000
BC             -.635267     .190322   -.077849    .851211      1.175    -3.338    .0009
ALBERTA         .303136     .213140    .032671    .877450      1.140     1.422    .1551
PRAIRIE        -.934037     .245900   -.085847    .906485      1.103    -3.798    .0002
QUEBEC         -.914617     .152956   -.144970    .787751      1.269    -5.980    .0000
ATLANTIC      -1.023321     .232552   -.100351    .890308      1.123    -4.400    .0000
(Constant)     4.185041     .200155                                     20.909    .0000

 

Interpretation of Output

The R-square and adjusted R-square figures indicate that approximately 15% of the variance in incomes is explained by gender, education, and region of residence. Compared to the regression model without the regional dummies, the R-square is higher by 3 percentage points. NOTE: this does NOT mean that region of residence on its own explains only a small amount of the variance in incomes, because the effects of education and gender are simultaneously controlled in the regression equation. If we wanted to examine the impact of region of residence alone on the R-square, we could run a regression with only the regional dummies in the equation.
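The summary figures in the output are tied together by standard formulas, which can be checked with a few lines of Python. This is a sketch under the assumption that the reported N of 1830 is the number of cases actually used and that k = 7 predictors were entered; the function names are illustrative.

```python
def adjusted_r_square(r2, n, k):
    """Adjusted R-square for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F for the regression, computed from R-square."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Values copied from the example output above.
r2, n, k = 0.15644, 1830, 7

adj = adjusted_r_square(r2, n, k)
f = f_statistic(r2, n, k)
print(adj, f)
```

Small differences from the printed Adjusted R Square and F are expected, because the R Square in the output is itself rounded to five decimals.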

The B values for education and gender indicate the direction and number of units (as coded) of change in the dependent variable due to a one unit change in each independent variable. In this example, the indicators for education and income are ordinal measures. The results show that earning more income depends in part on obtaining more education. Controlling for the effects of gender and region, a one category increase in education produces about a half category increase (+.52) in income. A university graduate, for example, could earn $10-20,000 more annual income than someone not yet finished university. The multiple regression results also indicate that women have lower incomes than men. Controlling for the effects of education and region, females earn about seven-tenths of an income category less (-.71) than males.

The B values for the regional dummies indicate that residents of the Prairies, Quebec, and Atlantic Canada have incomes that are about one income category lower than residents of Ontario. British Columbians earn about .6 of an income category less than Ontarians, whereas Albertans earn about .3 of an income category more than Ontarians. However, the T-test results, which tell us whether each of the five regions is significantly different from Ontario, show that the difference between Ontarians and Albertans is not statistically significant (Sig T = .1551).

We should also note that because Ontario is the same reference group for each of the other regional dummies, we can directly compare each of the regions to one another: British Columbians, for example, earn higher incomes on average than Atlantic Canadians (-.64 versus -1.02), but lower incomes than Albertans (-.64 versus .30).
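Because every regional B is measured against Ontario, the gap between any two regions is simply the difference between their coefficients. A minimal Python sketch, using the B values from the output above (the dictionary and function names are illustrative, and Ontario is entered as 0 because it is the reference category):

```python
# Regional B values from the example output; Ontario is the reference (0).
REGION_B = {
    "ONTARIO": 0.0,
    "BC": -0.635267,
    "ALBERTA": 0.303136,
    "PRAIRIE": -0.934037,
    "QUEBEC": -0.914617,
    "ATLANTIC": -1.023321,
}

def gap(region_a, region_b):
    """Estimated income-category difference: region_a minus region_b."""
    return REGION_B[region_a] - REGION_B[region_b]

print(round(gap("BC", "ATLANTIC"), 2))   # BC relative to Atlantic Canada
print(round(gap("ALBERTA", "BC"), 2))    # Alberta relative to BC
```

These are point estimates only; whether a given gap is statistically significant is a separate question.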

We can also determine whether each of these regional differences is statistically significant by examining the standard errors of the region coefficients. If the intervals of +/- 1.96 standard errors around the B values of two regional dummies do not overlap, we can be 95% confident that the difference between them is not due to sampling error. In this example, Albertans have significantly higher incomes than residents of the Prairies, Quebec, and Atlantic Canada, since the interval for Alberta does not overlap with the interval for any of these three regions. This task can be made easier by asking Webstats to produce confidence intervals with your output. You can do this by including the "ci" command in the "/statistics" line of your regression syntax. However, it is useful to know how to calculate those confidence intervals yourself by examining the standard errors of the B values, since some of the research you read may not report confidence intervals.
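The hand calculation described above can be sketched in Python. The B and SE B values are copied from the example output; the helper names `ci95` and `intervals_overlap` are illustrative assumptions, and the check follows the overlap rule of thumb given in this manual.

```python
Z = 1.96  # multiplier for a 95% confidence interval

# (B, SE B) for each regional dummy, from the example output.
COEFS = {
    "BC": (-0.635267, 0.190322),
    "ALBERTA": (0.303136, 0.213140),
    "PRAIRIE": (-0.934037, 0.245900),
    "QUEBEC": (-0.914617, 0.152956),
    "ATLANTIC": (-1.023321, 0.232552),
}

def ci95(b, se):
    """Return (lower, upper) bounds of B +/- 1.96 standard errors."""
    return (b - Z * se, b + Z * se)

def intervals_overlap(first, second):
    """True if two (lower, upper) intervals share any values."""
    return first[0] <= second[1] and second[0] <= first[1]

intervals = {region: ci95(b, se) for region, (b, se) in COEFS.items()}

for region, interval in intervals.items():
    if region != "ALBERTA":
        overlaps = intervals_overlap(intervals["ALBERTA"], interval)
        print(region, [round(x, 3) for x in interval], "overlaps Alberta:", overlaps)
```

An interval that includes zero, as Alberta's does, corresponds to a coefficient that is not significantly different from the Ontario reference group.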

The constant (or y-intercept) indicates the value of  'a' in the regression equation.

From the information in the regression output we can write the following regression equation:

income = 4.19 + .52 educate - .71 female - .64 BC + .30 Alberta - .93 Prairies - .91 Quebec - 1.02 Atlantic.
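To see how the equation is used, the following Python sketch plugs values into it. The coefficients come from the output above; the function name `predict_income` and the education score of 3 used in the example are hypothetical, since the education coding is not reproduced in this section.

```python
CONSTANT = 4.185041
B_EDUCATE = 0.517381
B_FEMALE = -0.707809
# Ontario is the reference category, so its coefficient is 0.
B_REGION = {
    "ONTARIO": 0.0,
    "BC": -0.635267,
    "ALBERTA": 0.303136,
    "PRAIRIE": -0.934037,
    "QUEBEC": -0.914617,
    "ATLANTIC": -1.023321,
}

def predict_income(educate, female, region):
    """Predicted income category for one respondent.

    educate: education score as coded in the data
    female:  1 for women, 0 for men
    region:  a key of B_REGION (exactly one regional dummy equals 1)
    """
    return CONSTANT + B_EDUCATE * educate + B_FEMALE * female + B_REGION[region]

# Hypothetical respondent: a woman in Quebec with an education score of 3.
print(round(predict_income(3, 1, "QUEBEC"), 2))
# Same education and gender in Ontario, the reference region.
print(round(predict_income(3, 1, "ONTARIO"), 2))
```

The two predictions differ by exactly the Quebec coefficient, which is what "difference from the reference category" means in practice.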

 

Instructions

  1. Perform a multiple regression analysis including a set of dummy variables built from a multi-category nominal variable.
  2. You can begin by using the syntax from the above example working with the CRIC 2003 data set.
  3. Identify a multi-category nominal variable in the data set to "dummy up" and use it as a control in your analysis. Potential variables include Q27 (marital status), q28 (work status), and q37 (community size).
  4. Include your dummies in the multiple regression equation.
  5. Examine your output to determine the influence of your dummy variables.

QUESTIONS FOR REFLECTION

DISCUSSION

Part 2: Interaction Terms

PURPOSE

MAIN POINTS

EXAMPLE

 

Output

           * * * *   M U L T I P L E   R E G R E S S I O N   * * * *


Listwise Deletion of Missing Data

N of Cases =  1835



           * * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Equation Number 1    Dependent Variable..   INCOME

  Descriptive Statistics are printed on Page    2

Block Number  1.  Method:  Enter      AGEDUMMY FEMALE


Variable(s) Entered on Step Number
   1..    FEMALE
   2..    AGEDUMMY


Multiple R           .18121
R Square             .03284
Adjusted R Square    .03178
Standard Error      2.70153

F =      31.10496       Signif F =  .0000


------------------------- Variables in the Equation --------------------------

Variable              B        SE B       Beta  Tolerance        VIF         T    Sig T

AGEDUMMY        .679421     .135058    .115613    .999351      1.001     5.031    .0000
FEMALE         -.782345     .126165   -.142512    .999351      1.001    -6.201    .0000
(Constant)     5.545508     .126958                                     43.680    .0000


----------- Variables not in the Equation ------------

Variable   Tolerance        VIF

FEM30        .240225      4.163


End Block Number   1   All requested variables entered.



           * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Block Number  2.  Method:  Enter      AGEDUMMY FEMALE   FEM30


Variable(s) Entered on Step Number
   3..    FEM30


Multiple R           .19095
R Square             .03646
Adjusted R Square    .03488
Standard Error      2.69720

F =      23.09946       Signif F =  .0000


------------------------- Variables in the Equation --------------------------

Variable              B        SE B       Beta  Tolerance        VIF         T    Sig T

AGEDUMMY       1.029219     .189592    .175137    .505511      1.978     5.429    .0000
FEMALE         -.302044     .222161   -.055020    .321268      3.113    -1.360    .1741
FEM30          -.707862     .269702   -.122831    .240225      4.163    -2.625    .0087
(Constant)     5.312372     .154780                                     34.322    .0000
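One way to read the Block 2 coefficients is to compute the predicted income for each combination of the two dummies. The Python sketch below does this under a loudly stated assumption: that FEM30 is the product FEMALE x AGEDUMMY, which its name suggests but which this section does not confirm. What a score of 1 on AGEDUMMY means is also not stated here, so the groups are labelled only by their codes.

```python
# Block 2 coefficients from the output above.
CONSTANT = 5.312372
B_AGEDUMMY = 1.029219
B_FEMALE = -0.302044
B_FEM30 = -0.707862

def predict_income(agedummy, female):
    """Predicted income category; ASSUMES FEM30 = FEMALE * AGEDUMMY."""
    fem30 = female * agedummy
    return CONSTANT + B_AGEDUMMY * agedummy + B_FEMALE * female + B_FEM30 * fem30

def gender_gap(agedummy):
    """Female minus male predicted income within one AGEDUMMY group."""
    return predict_income(agedummy, 1) - predict_income(agedummy, 0)

for agedummy in (0, 1):
    for female in (0, 1):
        print("AGEDUMMY =", agedummy, "FEMALE =", female,
              "predicted income =", round(predict_income(agedummy, female), 2))

print("gender gap when AGEDUMMY = 0:", round(gender_gap(0), 2))
print("gender gap when AGEDUMMY = 1:", round(gender_gap(1), 2))
```

Under that assumption, the FEMALE coefficient is the gender gap only where AGEDUMMY is 0, and the FEM30 coefficient is the additional gap where AGEDUMMY is 1.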


 

Interpretation

Instructions

  1. For this exercise you will perform a multiple regression analysis with an interaction term.
  2. Create an interaction term: As always, it is best to work with an explicit hypothesis in mind.
  3. Multiply two independent variables together to create the interaction term.
  4. Create a new variable (with a new variable name) to measure the interaction. You can name it "inter" or something more descriptive.
    compute inter=IVar1*IVar2.
  5. Once you have made all necessary recodes, declared missing values and created an interaction term, enter all variables (including the variables used to create the interaction) into a multiple regression equation with two steps, or blocks.
    regression variables= DVar IVar1 IVar2 inter
          /statistics coeff outs r tol
          /dependent=DVar/method=enter IVar1 IVar2 /method=enter inter.
  6. Based on the output, determine whether the dummy variables and the interaction term are significant. If neither is, repeat the process until you find an appropriate set.
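The multiplication in steps 3 and 4 can be illustrated with a short Python sketch. The variable names and the five cases are made up for the example and are not drawn from the data set.

```python
# Hypothetical 0/1 scores for five respondents on two dummy variables.
ivar1 = [0, 0, 1, 1, 1]
ivar2 = [0, 1, 0, 1, 1]

# The interaction term is the case-by-case product of the two variables.
inter = [a * b for a, b in zip(ivar1, ivar2)]

for a, b, product in zip(ivar1, ivar2, inter):
    print(a, b, product)
```

When both variables are dummies, the product equals 1 only for cases that score 1 on both, so the interaction term picks out the group that has both characteristics.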

DISCUSSION