Multiple linear regression, Part 1

Dwaine Studios, Inc. operates portrait studios in \(21\) medium-sized cities. These studios specialize in portraits of children. The company is considering an expansion into other medium-sized cities and wishes to investigate whether sales (\(Y\), in thousands of dollars) in a community can be predicted from the number of persons aged \(16\) or younger in the community (\(X_1\), in thousands of persons) and the per capita disposable personal income in the community (\(X_2\), in thousands of dollars). Data on these variables for the most recent year for the \(21\) cities in which Dwaine Studios is now operating are shown below:

city  \(X_1\)  \(X_2\)  \(Y\)
1     68.5     16.7     174.4
2     45.2     16.8     164.4
3     91.3     18.2     244.2
4     47.8     16.3     154.6
5     46.9     17.3     181.6
6     66.1     18.2     207.5
7     49.5     15.9     152.8
8     52.0     17.2     163.2
9     48.9     16.6     145.4
10    38.4     16.0     137.2
11    87.9     18.3     241.9
12    72.8     17.1     191.1
13    88.4     17.4     232.0
14    42.9     15.8     145.3
15    52.5     17.8     161.1
16    85.7     18.4     209.7
17    41.3     16.5     146.4
18    51.7     16.3     144.0
19    89.6     18.1     232.6
20    82.7     19.1     224.1
21    52.3     16.0     166.5

Questions

  1. Find the multiple linear regression model and interpret \(b_1\) and \(b_2\)
  2. Test for significance of regression and test on individual regression coefficients
  3. Find \(95\%\) confidence interval on \(\beta_j\)
  4. Find \(95\%\) confidence interval on mean response where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars
  5. Find \(95\%\) prediction interval where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars
  6. Find \(R^2\) and \(R^2_{adj}\)

Solution

  1. Find the multiple linear regression model and interpret \(b_1\) and \(b_2\)

    The estimated regression function is obtained from the least squares estimates \(b\), given by

    \(b = [X^TX]^{-1}[X^TY]\)



    For this example, the matrices \(X\) and \(Y\) are

    \( X = \begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix} \;\;\;\;\; Y = \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \)
    Compute \(X^TX\)

    \( \begin{align*} X^TX &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 68.5 & 45.2 & \cdots &52.3 \\ 16.7 & 16.8 & \cdots & 16.0 \end{bmatrix} \;\; \begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix} \\ \\ &= \begin{bmatrix} 21.0 & 1,302.4 & 360.0 \\ 1,302.4 & 87,707.9 & 22,609.2 \\ 360.0 & 22,609.2 & 6,190.3 \end{bmatrix} \end{align*} \)

    \(X^TX\) can also be obtained from its algebraic equivalent

    \( \begin{align*} X^TX &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \end{bmatrix} \;\; \begin{bmatrix} 1 & X_{11} & X_{12} \\ 1 & X_{21} & X_{22} \\ \vdots & \vdots & \vdots \\ 1 & X_{n1} & X_{n2} \end{bmatrix} \\ \\ &= \begin{bmatrix} n & \sum X_{i1} & \sum X_{i2} \\ \sum X_{i1} & \sum X^2_{i1} & \sum X_{i1} X_{i2} \\ \sum X_{i2} & \sum X_{i2} X_{i1} & \sum X^2_{i2} \end{bmatrix} \end{align*} \)

    where, in this example,

    \( n=21 \\ \sum X_{i1} = 68.5 + 45.2 + \cdots + 52.3 = 1,302.4 \\ \sum X_{i1} X_{i2} = 68.5(16.7) + 45.2(16.8) + \cdots + 52.3(16.0) = 22,609.2 \)

    The same applies to \(X^TY\)

    \( \begin{align*} X^TY &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \end{bmatrix} \;\; \begin{bmatrix} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{bmatrix} \\ \\ &= \begin{bmatrix} \sum Y_{i} \\ \sum X_{i1} Y_{i} \\ \sum X_{i2} Y_{i} \end{bmatrix} \end{align*} \)

    where

    \( \sum Y_{i} = 174.4 + 164.4 + \cdots + 166.5 = 3,820 \\ \sum X_{i1} Y_{i} = 68.5(174.4) + 45.2(164.4) + \cdots + 52.3(166.5) = 249,643 \\ \sum X_{i2} Y_{i} = 16.7(174.4) + 16.8(164.4) + \cdots + 16.0(166.5) = 66,073 \)


    Next, compute \([X^TX]^{-1}\)

    \( [X^TX]^{-1} = \begin{bmatrix} 29.7289 & 0.0722 & -1.9926 \\ 0.0722 & 0.00037 & -0.0056 \\ -1.9926 & -0.0056 & 0.1363 \end{bmatrix} \)

    and compute \(X^TY\)

    \(\begin{align*} X^TY &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 68.5 & 45.2 & \cdots &52.3 \\ 16.7 & 16.8 & \cdots & 16.0 \end{bmatrix} \; \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \\ &= \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \end{align*} \)
    This gives \(b\)

    \(\begin{align*} b = [X^TX]^{-1}[X^TY] &= \begin{bmatrix} 29.7289 & 0.0722 & -1.9926 \\ 0.0722 & 0.00037 & -0.0056 \\ -1.9926 & -0.0056 & 0.1363 \end{bmatrix} \; \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \\ \\ &= \begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} \end{align*}\)

    \(\therefore\) The estimated regression function is:
    \(\hat{Y} = -68.857 + 1.455X_1 + 9.366X_2\)

    Interpret \(b_1\) and \(b_2\)

    This estimated regression function indicates that mean sales are expected to increase by 1.455 thousand dollars when the target population increases by 1 thousand persons aged 16 years or younger, holding per capita disposable personal income constant, and that mean sales are expected to increase by 9.366 thousand dollars when per capita income increases by 1 thousand dollars, holding the target population constant.
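The matrix steps above can be reproduced numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`x1`, `x2`, `y`, `XtX`, `XtY`) are illustrative and not part of the original solution. It solves the normal equations directly rather than forming \([X^TX]^{-1}\) by hand.

```python
import numpy as np

# Dwaine Studios data from the table above:
# x1 = persons aged 16 or younger (thousands), x2 = per capita disposable
# income (thousand dollars), y = sales (thousand dollars)
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X   # 3 x 3 matrix of sums, sums of squares and cross products
XtY = X.T @ y   # 3-vector: sum of Y, sum of X1*Y, sum of X2*Y

# Solve (X^T X) b = X^T Y; numerically safer than multiplying by an explicit inverse
b = np.linalg.solve(XtX, XtY)

print(np.round(XtX, 1))
print(np.round(XtY, 0))
print(np.round(b, 3))   # b0, b1, b2
```

Because \(X^TX\) is nearly singular here (its entries span several orders of magnitude), working from the rounded inverse printed in the text can lose accuracy, which is why the sketch starts from the raw data.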

  2. Test for significance of regression and test on individual regression coefficients

    These tests can be carried out by building an ANOVA table, starting with the following quantities.
    Compute \(Y^TY\)

    \( \begin{align*} Y^TY &= \begin{bmatrix} 174.4 & 164.4 & \cdots & 166.5 \end{bmatrix} \;\; \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \\ \\ &= 721,072.40 \end{align*} \)

    and compute \((\frac{1}{n})Y^TJY\), where \(J\) is the \(n \times n\) matrix of ones

    \( \begin{align*} (\frac{1}{n})Y^TJY &= \frac{1}{21} \begin{bmatrix} 174.4 & 164.4 & \cdots & 166.5 \end{bmatrix} \; \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} \; \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \\ \\ &= \frac{(3,820.0)^2}{21} = 694,876.19 \end{align*} \)

    \(SSTO, SSR, SSE\) are then obtained as follows

    \( \begin{align*} SSTO &= Y^TY - (\frac{1}{n})Y^TJY = 721,072.40 - 694,876.19 = 26,196.21 \\ \\ SSE &= Y^TY - b^TX^TY \\ &= 721,072.40 - \begin{bmatrix} -68.857 & 1.455 & 9.366 \end{bmatrix} \; \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \\ &= 721,072.40 - 718,891.47 = 2,180.93 \\ \\ SSR &= SSTO - SSE = 26,196.21 - 2,180.93 = 24,015.28 \end{align*} \)

    This gives the following ANOVA table

    Source of variation   df              Sum of squares   Mean square
    Regression            \(p-1 = 2\)     \(24,015.28\)    \(12,007.64\)
    Error                 \(n-p = 18\)    \(2,180.93\)     \(121.1626\)
    Total                 \(n-1 = 20\)    \(26,196.21\)

    1. Test of Regression Relation

      To test whether sales are related to the target population and per capita disposable income

      Hypotheses
      \( H_0: \beta_1 = 0 \; \text{and} \; \beta_2 = 0 \\ H_1: \text{at least one} \; \beta_j \neq 0 \; ; \; j = 1,2 \)

      Test statistic
      \(F_{cal} = \frac{MSR}{MSE} = \frac{12,007.64}{121.1626} = 99.1\)

      We reject \(H_0\) when \(F_{cal} > F_{1-\alpha,p-1,n-p}\)
      Let \(\alpha = 0.05\),
      \( \begin{align*} F_{1-\alpha,p-1,n-p} &= F_{0.95,2,18} \\ &= 3.55 \end{align*} \)

      \(F_{cal} = 99.1 > 3.55\)
      \(\therefore\) Reject \(H_0\)
      There is sufficient evidence, at the \(5\%\) significance level, to conclude that sales are related to at least one of the two covariates, target population and per capita disposable income.
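The ANOVA arithmetic and the \(F\) test above can be checked with a few lines of plain Python. This is a sketch only: it starts from the sums reported in the text (\(Y^TY\), \(\sum Y_i\), \(b^TX^TY\)) rather than the raw data, and the critical value \(3.55\) is taken from the text instead of being computed.

```python
# Quantities reported in the worked example
n, p = 21, 3
yty = 721072.40        # Y'Y
sum_y = 3820.0         # sum of Y_i
btxty = 718891.47      # b'X'Y

# Sums of squares
ssto = yty - sum_y**2 / n    # total sum of squares
sse = yty - btxty            # error sum of squares
ssr = ssto - sse             # regression sum of squares

# Mean squares and F statistic
msr = ssr / (p - 1)
mse = sse / (n - p)
f_cal = msr / mse

f_crit = 3.55   # F(0.95; 2, 18), value taken from the text
decision = "reject H0" if f_cal > f_crit else "fail to reject H0"

print(f"SSTO = {ssto:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"MSR = {msr:.2f}, MSE = {mse:.4f}, F = {f_cal:.1f} -> {decision}")
```

The MSE computed this way can differ from the tabulated \(121.1626\) in the last decimal place because the inputs are already rounded.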
    2. Test partial regression coefficient

      To test whether \(\beta_1\) and \(\beta_2\) are individually equal to 0

      First, we need the estimated variance-covariance matrix \(s^2\{b\}\), obtained from

      \( \begin{align*} s^2\{b\} &= MSE(X^TX)^{-1} \\ \\ &= 121.1626 \; \begin{bmatrix} 29.7289 & 0.0722 & -1.9926 \\ 0.0722 & 0.00037 & -0.0056 \\ -1.9926 & -0.0056 & 0.1363 \end{bmatrix} \\ \\ &= \begin{bmatrix} 3,602.0 & 8.748 & -241.43 \\ 8.748 & 0.0448 & -0.679 \\ -241.43 & -0.679 & 16.514 \end{bmatrix} \end{align*} \)

      The diagonal entries are the variances, so
      \( \begin{align*} s^2(b_0) &= 3,602.0 \\ s^2(b_1) &= 0.0448 \Rightarrow s(b_1) = 0.212 \\ s^2(b_2) &= 16.514 \Rightarrow s(b_2) = 4.06 \end{align*} \)
      • Test whether \(\beta_1\) equals \(0\). Hypotheses
        \( H_0: \beta_1 = 0 \\ H_1: \beta_1 \neq 0 \)

        Test statistic
        \(\begin{align*} t_{cal} &= \frac{b_1}{s(b_1)} \\ &= \frac{1.455}{0.212} \\ &= 6.868 \end{align*}\)

        We reject \(H_0\) when \(|t_{cal}| > t_{1-\frac{\alpha}{2},n-p}\)
        Let \(\alpha = 0.05\),
        \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)

        \(t_{cal} = 6.868 > 2.101\)
        \(\therefore\) Reject \(H_0\)
        There is sufficient evidence, at the \(5\%\) significance level, to conclude that the number of persons aged 16 or younger contributes significantly to explaining sales, given that per capita disposable income is also in the model.
      • Test whether \(\beta_2\) equals \(0\). Hypotheses
        \( H_0: \beta_2 = 0 \\ H_1: \beta_2 \neq 0 \)

        Test statistic
        \(\begin{align*} t_{cal} &= \frac{b_2}{s(b_2)} \\ &= \frac{9.366}{4.06} \\ &= 2.305 \end{align*}\)

        We reject \(H_0\) when \(|t_{cal}| > t_{1-\frac{\alpha}{2},n-p}\)
        Let \(\alpha = 0.05\),
        \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)

        \(t_{cal} = 2.305 > 2.101\)
        \(\therefore\) Reject \(H_0\)
        There is sufficient evidence, at the \(5\%\) significance level, to conclude that per capita disposable income contributes significantly to the model, given that the number of persons aged 16 or younger is also in the model.
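The two \(t\) tests above can be sketched in NumPy as follows. This assumes NumPy is available and recomputes \(b\), \(MSE\) and \(s^2\{b\}\) from the raw data, so the statistics use unrounded standard errors; the critical value \(2.101\) is taken from the text, and all variable names are illustrative.

```python
import numpy as np

# Dwaine Studios data (x1, x2 in thousands; y in thousand dollars)
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

# Least squares fit and error mean square
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
mse = resid @ resid / (n - p)

# Estimated variance-covariance matrix s^2{b} = MSE (X'X)^{-1}
s2_b = mse * np.linalg.inv(X.T @ X)
se_b = np.sqrt(np.diag(s2_b))   # standard errors s(b0), s(b1), s(b2)

t_stats = b / se_b
t_crit = 2.101                  # t(0.975; 18), value taken from the text

for j in (1, 2):
    decision = "reject H0" if abs(t_stats[j]) > t_crit else "fail to reject H0"
    print(f"b{j} = {b[j]:.3f}, s(b{j}) = {se_b[j]:.3f}, t = {t_stats[j]:.3f} -> {decision}")
```

Dividing the rounded values by hand (for example \(1.455 / 0.212\)) gives a slightly different third decimal than the unrounded computation, which explains small discrepancies when checking the figures with a calculator.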
  3. Find \(95\%\) confidence interval on \(\beta_j\)

    The confidence interval for \(\beta_j\) is obtained from the formula

    \(b_j \pm t_{1-\frac{\alpha}{2},n-p} s(b_j)\)

    Therefore, the \(95\%\) confidence intervals for \(\beta_1\) and \(\beta_2\) are

    \(95\%\) CI on \(\beta_1\)
    \( \begin{align*} &= b_1 \pm t_{1-\frac{\alpha}{2},n-p} s(b_1) \\ &= 1.455 \pm t_{1-\frac{0.05}{2},18} (0.212) \\ &= 1.455 \pm (2.101) (0.212) \end{align*}\)
    \(\therefore\) \(1.01 \lt \beta_1 \lt 1.90\)
    With \(95\%\) confidence, we conclude that \(\beta_1\) falls between \(1.01\) and \(1.90\)

    \(95\%\) CI on \(\beta_2\)
    \( \begin{align*} &= b_2 \pm t_{1-\frac{\alpha}{2},n-p} s(b_2) \\ &= 9.366 \pm t_{1-\frac{0.05}{2},18} (4.06) \\ &= 9.366 \pm (2.101) (4.06) \end{align*}\)
    \(\therefore\) \(0.84 \lt \beta_2 \lt 17.9\)
    With \(95\%\) confidence, we conclude that \(\beta_2\) falls between \(0.84\) and \(17.9\)
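As a quick check of the two intervals above, the formula \(b_j \pm t_{1-\frac{\alpha}{2},n-p}\, s(b_j)\) can be evaluated in plain Python. The sketch uses the rounded estimates and standard errors from the text; the dictionary layout and names are illustrative only.

```python
t_crit = 2.101   # t(0.975; 18), value taken from the text

# (estimate b_j, standard error s(b_j)) as reported in the worked example
estimates = {
    "beta_1": (1.455, 0.212),
    "beta_2": (9.366, 4.06),
}

intervals = {}
for name, (b_j, s_bj) in estimates.items():
    half_width = t_crit * s_bj
    intervals[name] = (b_j - half_width, b_j + half_width)
    lower, upper = intervals[name]
    print(f"95% CI on {name}: ({lower:.2f}, {upper:.2f})")
```

Neither interval contains \(0\), which agrees with the two-sided \(t\) tests at \(\alpha = 0.05\).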

  4. Find \(95\%\) confidence interval on mean response where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars

    The confidence interval for the mean response at the point \(X_h\) is obtained from the formula

    \(\hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(\hat{Y}_h)\)

    So we first need \(s(\hat{Y}_h)\)

    Define

    \( X_h = \begin{bmatrix} 1 \\ 65.4 \\ 17.6 \end{bmatrix} \)

    The point estimate of mean sales is

    \( \hat{Y}_h = X_h^T b = \begin{bmatrix} 1 & 65.4 & 17.6 \end{bmatrix} \begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix} = 191.10 \)

    and the estimated variance is

    \( \begin{align*} s^2\{\hat{Y}_h\} &= X_h^T s^2\{b\} X_h \\ \\ &= \begin{bmatrix} 1 & 65.4 & 17.6 \end{bmatrix} \; \begin{bmatrix} 3,602.0 & 8.748 & -241.43 \\ 8.748 & 0.0448 & -0.679 \\ -241.43 & -0.679 & 16.514 \end{bmatrix} \; \begin{bmatrix} 1 \\ 65.4 \\ 17.6 \end{bmatrix} \\ \\ &= 7.656 \\ \\ s\{\hat{Y}_h\} &= 2.77 \end{align*} \)

    Let \(\alpha = 0.05\),
    \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)

    The \(95\%\) CI on the mean response is
    \( \begin{align*} =& \; \hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(\hat{Y}_h) \\ =& \; 191.10 \pm (2.101) (2.77) \end{align*} \)

    \(\therefore\) \(185.3 \lt E(Y_h) \lt 196.9\)
    With confidence coefficient \(0.95\), we estimate that mean sales in cities with the target population of \(65.4\) thousand persons aged 16 years or younger and per capita disposable income of \(17.6\) thousand dollars are somewhere between \(185.3\) and \(196.9\) thousand dollars.
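The mean-response interval above can be sketched in NumPy as shown here, assuming NumPy is available. The quadratic form \(X_h^T s^2\{b\} X_h\) is evaluated from the raw data rather than from the rounded \(s^2\{b\}\) matrix, because the large positive and negative entries nearly cancel and the rounded matrix is not accurate enough for this product. The critical value \(2.101\) is taken from the text; variable names are illustrative.

```python
import numpy as np

# Dwaine Studios data (x1, x2 in thousands; y in thousand dollars)
x1 = np.array([68.5, 45.2, 91.3, 47.8, 46.9, 66.1, 49.5, 52.0, 48.9, 38.4, 87.9,
               72.8, 88.4, 42.9, 52.5, 85.7, 41.3, 51.7, 89.6, 82.7, 52.3])
x2 = np.array([16.7, 16.8, 18.2, 16.3, 17.3, 18.2, 15.9, 17.2, 16.6, 16.0, 18.3,
               17.1, 17.4, 15.8, 17.8, 18.4, 16.5, 16.3, 18.1, 19.1, 16.0])
y = np.array([174.4, 164.4, 244.2, 154.6, 181.6, 207.5, 152.8, 163.2, 145.4, 137.2,
              241.9, 191.1, 232.0, 145.3, 161.1, 209.7, 146.4, 144.0, 232.6, 224.1,
              166.5])

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape
XtX = X.T @ X

b = np.linalg.solve(XtX, X.T @ y)
resid = y - X @ b
mse = resid @ resid / (n - p)

# Point of interest: intercept term, X1 = 65.4, X2 = 17.6
x_h = np.array([1.0, 65.4, 17.6])

y_hat_h = x_h @ b                                    # point estimate of mean sales
s2_yhat = mse * (x_h @ np.linalg.solve(XtX, x_h))    # X_h' s^2{b} X_h
s_yhat = np.sqrt(s2_yhat)

t_crit = 2.101   # t(0.975; 18), value taken from the text
lower = y_hat_h - t_crit * s_yhat
upper = y_hat_h + t_crit * s_yhat

print(f"Y_hat_h = {y_hat_h:.2f}, s(Y_hat_h) = {s_yhat:.2f}")
print(f"95% CI on E(Y_h): ({lower:.1f}, {upper:.1f})")
```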



  5. Find \(95\%\) prediction interval where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars

    From the formula

    \(\hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(pred)\)

    we first need \(s(pred)\)

    \( \begin{align*} s^2(pred) &= MSE + s^2(\hat{Y}_h) \\ &= 121.1626 + 7.656 \\ &= 128.82 \\ \\ s(pred) &= 11.35 \end{align*} \)

    Let \(\alpha = 0.05\),
    \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)

    The \(95\%\) prediction interval is
    \( \begin{align*} =& \; \hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(pred) \\ =& \; 191.10 \pm (2.101) (11.35) \end{align*} \)

    \(\therefore\) \(167.3 \lt Y_{h(new)} \lt 214.9\)
    With \(95\%\) confidence, we predict that sales in the new city will be somewhere between \(167.3\) and \(214.9\) thousand dollars.
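The prediction interval differs from the mean-response interval only by the extra \(MSE\) term in the variance. A small plain-Python sketch, using the values already reported in the text (\(MSE\), \(s^2(\hat{Y}_h)\), \(\hat{Y}_h\) and the critical value), illustrates this:

```python
import math

# Values reported in the worked example
mse = 121.1626      # error mean square
s2_yhat = 7.656     # estimated variance of Y_hat_h
y_hat_h = 191.10    # point estimate at X1 = 65.4, X2 = 17.6
t_crit = 2.101      # t(0.975; 18), value taken from the text

# Variance of a new observation = variance of the fitted mean + error variance
s2_pred = mse + s2_yhat
s_pred = math.sqrt(s2_pred)

lower = y_hat_h - t_crit * s_pred
upper = y_hat_h + t_crit * s_pred

print(f"s(pred) = {s_pred:.2f}")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

The interval is roughly four times as wide as the interval for the mean response, since \(MSE\) dominates \(s^2(\hat{Y}_h)\) here.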



  6. Find \(R^2\) and \(R^2_{adj}\)

    From the formula

    \(R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}\)


    and
    \( \begin{align*} R^2_{adj} &= 1 - \frac{\frac{SSE}{n - p}}{\frac{SSTO}{n - 1}} \\ \\ &= 1 - \frac{n - 1}{n - p} \frac{SSE}{SSTO} \end{align*}\)

    we obtain

    \(R^2 = \frac{SSR}{SSTO} = \frac{24,015.28}{26,196.21} = 0.917\)

    \(R^2_{adj} = 1 - \frac{\frac{2,180.93}{18}}{\frac{26,196.21}{20}} = 0.9075\)

    \(\therefore 91.7\%\) of the variation in sales can be explained by the target population and per capita disposable income.
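Both coefficients follow directly from the ANOVA sums of squares. A minimal plain-Python check, using the values from the table in question 2 (variable names are illustrative):

```python
# Sums of squares and dimensions from the ANOVA table
n, p = 21, 3
ssr = 24015.28
sse = 2180.93
ssto = 26196.21

r2 = ssr / ssto
# Adjusted R^2 penalizes the number of parameters via the degrees of freedom
r2_adj = 1 - (sse / (n - p)) / (ssto / (n - 1))

print(f"R^2 = {r2:.3f}")
print(f"Adjusted R^2 = {r2_adj:.4f}")
```

With only two predictors and \(n = 21\), the adjustment is small, so \(R^2_{adj}\) stays close to \(R^2\).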
