Dwaine Studios, Inc. operates portrait studios in \(21\) medium-sized cities. These studios specialize in portraits of children. The company is considering an expansion into other medium-sized cities and wishes to investigate whether sales (\(Y\), in thousands of dollars) in a community can be predicted from the number of persons aged \(16\) or younger in the community (\(X_1\), in thousands of persons) and the per capita disposable personal income in the community (\(X_2\), in thousands of dollars). Data on these variables for the most recent year for the \(21\) cities in which Dwaine Studios is now operating are shown below:
city | \(X_1\) | \(X_2\) | \(Y\) |
1 | 68.5 | 16.7 | 174.4 |
2 | 45.2 | 16.8 | 164.4 |
3 | 91.3 | 18.2 | 244.2 |
4 | 47.8 | 16.3 | 154.6 |
5 | 46.9 | 17.3 | 181.6 |
6 | 66.1 | 18.2 | 207.5 |
7 | 49.5 | 15.9 | 152.8 |
8 | 52 | 17.2 | 163.2 |
9 | 48.9 | 16.6 | 145.4 |
10 | 38.4 | 16 | 137.2 |
11 | 87.9 | 18.3 | 241.9 |
12 | 72.8 | 17.1 | 191.1 |
13 | 88.4 | 17.4 | 232 |
14 | 42.9 | 15.8 | 145.3 |
15 | 52.5 | 17.8 | 161.1 |
16 | 85.7 | 18.4 | 209.7 |
17 | 41.3 | 16.5 | 146.4 |
18 | 51.7 | 16.3 | 144 |
19 | 89.6 | 18.1 | 232.6 |
20 | 82.7 | 19.1 | 224.1 |
21 | 52.3 | 16 | 166.5 |
Questions
- Find the multiple linear regression model and interpret \(b_1\) and \(b_2\)
- Test for significance of regression and test on individual regression coefficients
- Find \(95\%\) confidence interval on \(\beta_j\)
- Find \(95\%\) confidence interval on mean response where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars
- Find \(95\%\) prediction interval where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars
- Find \(R^2\) and \(R^2_{adj}\)
Solution
- Find the multiple linear regression model and interpret \(b_1\) and \(b_2\)
The estimated regression function is obtained from the least squares estimates \(b\), given by
\(b = [X^TX]^{-1}[X^TY]\)
From the data, the matrices \(X\) and \(Y\) are
\( X = \begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix} \;\;\;\;\; Y = \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \)
Compute \(X^TX\):
\( \begin{align*} X^TX &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 68.5 & 45.2 & \cdots & 52.3 \\ 16.7 & 16.8 & \cdots & 16.0 \end{bmatrix} \;\; \begin{bmatrix} 1 & 68.5 & 16.7 \\ 1 & 45.2 & 16.8 \\ \vdots & \vdots & \vdots \\ 1 & 52.3 & 16.0 \end{bmatrix} \\ \\ &= \begin{bmatrix} 21.0 & 1,302.4 & 360.0 \\ 1,302.4 & 87,707.9 & 22,609.2 \\ 360.0 & 22,609.2 & 6,190.3 \end{bmatrix} \end{align*} \)
Equivalently, \(X^TX\) can be written in terms of sums:
\( \begin{align*} X^TX &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \end{bmatrix} \;\; \begin{bmatrix} 1 & X_{11} & X_{12} \\ 1 & X_{21} & X_{22} \\ \vdots & \vdots & \vdots \\ 1 & X_{n1} & X_{n2} \end{bmatrix} \\ \\ &= \begin{bmatrix} n & \sum X_{i1} & \sum X_{i2} \\ \sum X_{i1} & \sum X^2_{i1} & \sum X_{i1} X_{i2} \\ \sum X_{i2} & \sum X_{i2} X_{i1} & \sum X^2_{i2} \end{bmatrix} \end{align*} \)
where, in this example,
\( \begin{align*} n &= 21 \\ \sum X_{i1} &= 68.5 + 45.2 + \cdots + 52.3 = 1,302.4 \\ \sum X_{i1} X_{i2} &= 68.5(16.7) + 45.2(16.8) + \cdots + 52.3(16.0) = 22,609.2 \end{align*} \)
Similarly, for \(X^TY\):
\( \begin{align*} X^TY &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_{11} & X_{21} & \cdots & X_{n1} \\ X_{12} & X_{22} & \cdots & X_{n2} \end{bmatrix} \;\; \begin{bmatrix} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{bmatrix} \\ \\ &= \begin{bmatrix} \sum Y_{i} \\ \sum X_{i1} Y_{i} \\ \sum X_{i2} Y_{i} \end{bmatrix} \end{align*} \)
where
\( \begin{align*} \sum Y_{i} &= 174.4 + 164.4 + \cdots + 166.5 = 3,820 \\ \sum X_{i1} Y_{i} &= 68.5(174.4) + 45.2(164.4) + \cdots + 52.3(166.5) = 249,643 \\ \sum X_{i2} Y_{i} &= 16.7(174.4) + 16.8(164.4) + \cdots + 16.0(166.5) = 66,073 \end{align*} \)
Next, compute \([X^TX]^{-1}\):
\( [X^TX]^{-1} = \begin{bmatrix} 29.7289 & 0.0722 & -1.9926 \\ 0.0722 & 0.00037 & -0.0056 \\ -1.9926 & -0.0056 & 0.1363 \end{bmatrix} \)
and \(X^TY\):
\(\begin{align*} X^TY &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 68.5 & 45.2 & \cdots & 52.3 \\ 16.7 & 16.8 & \cdots & 16.0 \end{bmatrix} \; \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \\ &= \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \end{align*} \)
This gives \(b\):
\(\begin{align*} b = [X^TX]^{-1}[X^TY] &= \begin{bmatrix} 29.7289 & 0.0722 & -1.9926 \\ 0.0722 & 0.00037 & -0.0056 \\ -1.9926 & -0.0056 & 0.1363 \end{bmatrix} \; \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \\ \\ &= \begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} \end{align*}\)
\(\therefore\) the estimated regression function is:
\(\hat{Y} = -68.857 + 1.455X_1 + 9.366X_2\)
Interpret \(b_1\) and \(b_2\)
This estimated regression function indicates that mean sales are expected to increase by 1.455 thousand dollars when the target population increases by 1 thousand persons aged 16 years or younger, holding per capita disposable personal income constant, and that mean sales are expected to increase by 9.366 thousand dollars when per capita income increases by 1 thousand dollars, holding the target population constant.
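The matrix steps above can be reproduced numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`data`, `XtX`, `XtY`, `b`) are chosen here for illustration. It solves the normal equations directly rather than multiplying by the rounded inverse shown above.

```python
import numpy as np

# Dwaine Studios data from the table above: (X1, X2, Y) for the 21 cities.
data = np.array([
    (68.5, 16.7, 174.4), (45.2, 16.8, 164.4), (91.3, 18.2, 244.2),
    (47.8, 16.3, 154.6), (46.9, 17.3, 181.6), (66.1, 18.2, 207.5),
    (49.5, 15.9, 152.8), (52.0, 17.2, 163.2), (48.9, 16.6, 145.4),
    (38.4, 16.0, 137.2), (87.9, 18.3, 241.9), (72.8, 17.1, 191.1),
    (88.4, 17.4, 232.0), (42.9, 15.8, 145.3), (52.5, 17.8, 161.1),
    (85.7, 18.4, 209.7), (41.3, 16.5, 146.4), (51.7, 16.3, 144.0),
    (89.6, 18.1, 232.6), (82.7, 19.1, 224.1), (52.3, 16.0, 166.5),
])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
Y = data[:, 2]

XtX = X.T @ X   # 3 x 3 matrix of sums and cross products
XtY = X.T @ Y   # vector: sum Y, sum X1*Y, sum X2*Y

# Solve the normal equations (X'X) b = X'Y without forming the inverse.
b = np.linalg.solve(XtX, XtY)

print(np.round(XtX, 1))
print(np.round(XtY, 1))
print(np.round(b, 3))   # b0, b1, b2
```

Solving the linear system is generally preferred to inverting \(X^TX\) by hand, because the inverse printed to four decimals loses enough precision to change the coefficients.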
- Test for significance of regression and test on individual regression coefficients
These tests can be carried out by constructing an ANOVA table. First, compute the following quantities.
Compute \(Y^TY\):
\( \begin{align*} Y^TY &= \begin{bmatrix} 174.4 & 164.4 & \cdots & 166.5 \end{bmatrix} \;\; \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \\ \\ &= 721,072.40 \end{align*} \)
and \((\frac{1}{n})Y^TJY\):
\( \begin{align*} (\frac{1}{n})Y^TJY &= \frac{1}{21} \begin{bmatrix} 174.4 & 164.4 & \cdots & 166.5 \end{bmatrix} \; \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} \; \begin{bmatrix} 174.4 \\ 164.4 \\ \vdots \\ 166.5 \end{bmatrix} \\ \\ &= \frac{(3,820.0)^2}{21} = 694,876.19 \end{align*} \)
Then \(SSTO\), \(SSE\), and \(SSR\) are obtained as follows:
\( \begin{align*} SSTO &= Y^TY - (\frac{1}{n})Y^TJY = 721,072.40 - 694,876.19 = 26,196.21 \\ \\ SSE &= Y^TY - b^TX^TY \\ &= 721,072.40 - \begin{bmatrix} -68.857 & 1.455 & 9.366 \end{bmatrix} \; \begin{bmatrix} 3,820 \\ 249,643 \\ 66,073 \end{bmatrix} \\ &= 721,072.40 - 718,891.47 = 2,180.93 \\ \\ SSR &= SSTO - SSE = 26,196.21 - 2,180.93 = 24,015.28 \end{align*} \)
Note that \(b^TX^TY = 718,891.47\) is evaluated with unrounded coefficients; the three-decimal values of \(b\) shown here give a noticeably different product (about \(719,036.5\)).
This gives the following ANOVA table:
Source of variation | df | Sum of Squares | Mean Square |
Regression | \(p-1 = 2\) | \(24,015.28\) | \(12,007.64\) |
Error | \(n-p = 18\) | \(2,180.93\) | \(121.1626\) |
Total | \(n-1 = 20\) | \(26,196.21\) | |
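The sums of squares in the ANOVA table can be checked from the raw data. This is a minimal sketch, assuming NumPy; it computes \(SSE\) as the sum of squared residuals, which is algebraically the same as \(Y^TY - b^TX^TY\) but less sensitive to rounding. The variable names are illustrative.

```python
import numpy as np

# Dwaine Studios data from the table above: (X1, X2, Y) for the 21 cities.
data = np.array([
    (68.5, 16.7, 174.4), (45.2, 16.8, 164.4), (91.3, 18.2, 244.2),
    (47.8, 16.3, 154.6), (46.9, 17.3, 181.6), (66.1, 18.2, 207.5),
    (49.5, 15.9, 152.8), (52.0, 17.2, 163.2), (48.9, 16.6, 145.4),
    (38.4, 16.0, 137.2), (87.9, 18.3, 241.9), (72.8, 17.1, 191.1),
    (88.4, 17.4, 232.0), (42.9, 15.8, 145.3), (52.5, 17.8, 161.1),
    (85.7, 18.4, 209.7), (41.3, 16.5, 146.4), (51.7, 16.3, 144.0),
    (89.6, 18.1, 232.6), (82.7, 19.1, 224.1), (52.3, 16.0, 166.5),
])
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
Y = data[:, 2]
n, p = X.shape  # n = 21 observations, p = 3 parameters

# Least squares fit, then residuals e = Y - Xb.
b = np.linalg.lstsq(X, Y, rcond=None)[0]
residuals = Y - X @ b

ssto = np.sum((Y - Y.mean()) ** 2)  # total sum of squares, df = n - 1
sse = residuals @ residuals         # error sum of squares, df = n - p
ssr = ssto - sse                    # regression sum of squares, df = p - 1

msr = ssr / (p - 1)
mse = sse / (n - p)

print(f"SSR  = {ssr:.2f}, df = {p - 1}, MSR = {msr:.2f}")
print(f"SSE  = {sse:.2f}, df = {n - p}, MSE = {mse:.4f}")
print(f"SSTO = {ssto:.2f}, df = {n - 1}")
```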
- Test of Regression Relation
To test whether sales are related to target population and per capita disposable income:
Hypotheses
\( \begin{align*} H_0 &: \beta_1 = 0 \text{ and } \beta_2 = 0 \\ H_1 &: \text{at least one } \beta_j \neq 0, \; j = 1,2 \end{align*} \)
Test statistic: \(F_{cal} = \frac{MSR}{MSE} = \frac{12,007.64}{121.1626} = 99.1\)
Reject \(H_0\) when \(F_{cal} > F_{1-\alpha,p-1,n-p}\)
With \(\alpha = 0.05\), \( \begin{align*} F_{1-\alpha,p-1,n-p} &= F_{0.95,2,18} \\ &= 3.55 \end{align*} \)
\(F_{cal} = 99.1 > 3.55\) \(\therefore\) Reject \(H_0\)
There is sufficient evidence at the \(5\%\) significance level to conclude that sales are related to at least one of the two predictors, target population and per capita disposable income.
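The overall F test is a short calculation once the ANOVA sums of squares are known. The sketch below uses only the numbers quoted in the text; the critical value \(3.55\) is copied from the F table as given above rather than computed.

```python
# Sums of squares and dimensions taken from the ANOVA table above.
ssr, sse = 24015.28, 2180.93
n, p = 21, 3

msr = ssr / (p - 1)   # mean square for regression
mse = sse / (n - p)   # mean square error
f_cal = msr / mse

# F(0.95; 2, 18), the table value quoted in the text (not computed here).
f_crit = 3.55
reject_h0 = f_cal > f_crit

print(f"F_cal = {f_cal:.1f}, critical value = {f_crit}, reject H0: {reject_h0}")
```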
- Test of partial regression coefficients
To test whether \(\beta_1\) and \(\beta_2\) are each equal to 0, first compute the estimated variance-covariance matrix \(s^2\{b\}\):
\( \begin{align*} s^2\{b\} &= MSE(X^TX)^{-1} \\ \\ &= 121.1626 \; \begin{bmatrix} 29.7289 & 0.0722 & -1.9926 \\ 0.0722 & 0.00037 & -0.0056 \\ -1.9926 & -0.0056 & 0.1363 \end{bmatrix} \\ \\ &= \begin{bmatrix} 3,602.0 & 8.748 & -241.43 \\ 8.748 & 0.0448 & -0.679 \\ -241.43 & -0.679 & 16.514 \end{bmatrix} \end{align*} \)
The diagonal entries are the estimated variances, so
\( \begin{align*} s^2(b_0) &= 3,602.0 \\ s^2(b_1) &= 0.0448 \Rightarrow s(b_1) = 0.212 \\ s^2(b_2) &= 16.514 \Rightarrow s(b_2) = 4.06 \\ \end{align*} \)
- Test whether \(\beta_1 = 0\)
Hypotheses
\( \begin{align*} H_0 &: \beta_1 = 0 \\ H_1 &: \beta_1 \neq 0 \end{align*} \)
Test statistic: \(\begin{align*} t_{cal} &= \frac{b_1}{s(b_1)} \\ &= \frac{1.455}{0.212} \\ &= 6.868 \end{align*}\)
Reject \(H_0\) when \(|t_{cal}| > t_{1-\frac{\alpha}{2},n-p}\)
With \(\alpha = 0.05\), \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)
\(|t_{cal}| = 6.868 > 2.101\) \(\therefore\) Reject \(H_0\)
There is sufficient evidence at the \(5\%\) significance level to conclude that the number of persons aged 16 or younger contributes significantly to explaining sales, given that per capita disposable income is also in the model.
- Test whether \(\beta_2 = 0\)
Hypotheses
\( \begin{align*} H_0 &: \beta_2 = 0 \\ H_1 &: \beta_2 \neq 0 \end{align*} \)
Test statistic: \(\begin{align*} t_{cal} &= \frac{b_2}{s(b_2)} \\ &= \frac{9.366}{4.06} \\ &= 2.305 \end{align*}\)
Reject \(H_0\) when \(|t_{cal}| > t_{1-\frac{\alpha}{2},n-p}\)
With \(\alpha = 0.05\), \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)
\(|t_{cal}| = 2.305 > 2.101\) \(\therefore\) Reject \(H_0\)
There is sufficient evidence at the \(5\%\) significance level to conclude that per capita disposable income contributes significantly to the model, given that the number of persons aged 16 or younger is also in the model.
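The standard errors and t statistics used in both tests can be recomputed from the raw data. This is a minimal sketch, assuming NumPy; the critical value \(2.101\) is the table value quoted in the text, and the variable names are illustrative.

```python
import numpy as np

# Dwaine Studios data from the table above: (X1, X2, Y) for the 21 cities.
data = np.array([
    (68.5, 16.7, 174.4), (45.2, 16.8, 164.4), (91.3, 18.2, 244.2),
    (47.8, 16.3, 154.6), (46.9, 17.3, 181.6), (66.1, 18.2, 207.5),
    (49.5, 15.9, 152.8), (52.0, 17.2, 163.2), (48.9, 16.6, 145.4),
    (38.4, 16.0, 137.2), (87.9, 18.3, 241.9), (72.8, 17.1, 191.1),
    (88.4, 17.4, 232.0), (42.9, 15.8, 145.3), (52.5, 17.8, 161.1),
    (85.7, 18.4, 209.7), (41.3, 16.5, 146.4), (51.7, 16.3, 144.0),
    (89.6, 18.1, 232.6), (82.7, 19.1, 224.1), (52.3, 16.0, 166.5),
])
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
Y = data[:, 2]
n, p = X.shape

# The inverse is needed explicitly here because s^2{b} = MSE * (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y

residuals = Y - X @ b
mse = (residuals @ residuals) / (n - p)

s2_b = mse * XtX_inv            # estimated variance-covariance matrix of b
se_b = np.sqrt(np.diag(s2_b))   # standard errors s(b0), s(b1), s(b2)
t_cal = b / se_b                # t statistics for H0: beta_j = 0

t_crit = 2.101  # t(0.975; 18), table value quoted in the text
for j in (1, 2):
    print(f"b{j} = {b[j]:.3f}, s(b{j}) = {se_b[j]:.3f}, "
          f"t = {t_cal[j]:.3f}, reject H0: {abs(t_cal[j]) > t_crit}")
```

Working from unrounded values explains why the reported \(t_{cal}\) values differ slightly from what the rounded ratios \(1.455/0.212\) and \(9.366/4.06\) give.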
- Find \(95\%\) confidence interval on \(\beta_j\)
The confidence interval for \(\beta_j\) is given by
\(b_j \pm t_{1-\frac{\alpha}{2},n-p} s(b_j)\) so the \(95\%\) confidence intervals for \(\beta_1\) and \(\beta_2\) are
\(95\%\) CI on \(\beta_1\)
\( \begin{align*} &= b_1 \pm t_{1-\frac{\alpha}{2},n-p} s(b_1) \\ &= 1.455 \pm t_{1-\frac{0.05}{2},18} (0.212) \\ &= 1.455 \pm (2.101) (0.212) \end{align*}\) \(\therefore\) \(1.01 \lt \beta_1 \lt 1.90\)
With \(95\%\) confidence, we conclude that \(\beta_1\) falls between \(1.01\) and \(1.90\)
\(95\%\) CI on \(\beta_2\)
\( \begin{align*} &= b_2 \pm t_{1-\frac{\alpha}{2},n-p} s(b_2) \\ &= 9.366 \pm t_{1-\frac{0.05}{2},18} (4.06) \\ &= 9.366 \pm (2.101) (4.06) \end{align*}\) \(\therefore\) \(0.84 \lt \beta_2 \lt 17.9\)
With \(95\%\) confidence, we conclude that \(\beta_2\) falls between \(0.84\) and \(17.9\)
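Both intervals follow the same pattern, so the arithmetic can be sketched in a few lines of plain Python. The estimates, standard errors, and critical value are the rounded figures from the text; the dictionary names are illustrative.

```python
# t(0.975; 18), table value quoted in the text.
t_crit = 2.101

# (estimate b_j, standard error s(b_j)) as reported above.
estimates = {"beta_1": (1.455, 0.212), "beta_2": (9.366, 4.06)}

intervals = {}
for name, (b_j, s_bj) in estimates.items():
    half_width = t_crit * s_bj
    intervals[name] = (b_j - half_width, b_j + half_width)
    lower, upper = intervals[name]
    print(f"95% CI for {name}: ({lower:.2f}, {upper:.2f})")
```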
- Find \(95\%\) confidence interval on mean response where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars
The confidence interval for the mean response at \(X_h\) is given by
\(\hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(\hat{Y}_h)\) so we first need \(s(\hat{Y}_h)\).
Let
\( X_h = \begin{bmatrix} 1 \\ 65.4 \\ 17.6 \end{bmatrix} \)
The point estimate of mean sales is
\( \hat{Y}_h = X_h^T b = \begin{bmatrix} 1 & 65.4 & 17.6 \end{bmatrix} \begin{bmatrix} -68.857 \\ 1.455 \\ 9.366 \end{bmatrix} = 191.10 \)
and the estimated variance is
\( \begin{align*} s^2\{\hat{Y}_h\} &= X_h^T s^2\{b\} X_h \\ \\ &= \begin{bmatrix} 1 & 65.4 & 17.6 \end{bmatrix} \; \begin{bmatrix} 3,602.0 & 8.748 & -241.43 \\ 8.748 & 0.0448 & -0.679 \\ -241.43 & -0.679 & 16.514 \end{bmatrix} \; \begin{bmatrix} 1 \\ 65.4 \\ 17.6 \end{bmatrix} \\ \\ &= 7.656 \\ \\ s\{\hat{Y}_h\} &= 2.77 \end{align*} \)
With \(\alpha = 0.05\), \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)
The \(95\%\) CI on the mean response is
\( \begin{align*} =& \; \hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(\hat{Y}_h) \\ =& \; 191.10 \pm (2.101) (2.77) \end{align*} \)
\(\therefore\) \(185.3 \lt E(Y_h) \lt 196.9\)
With confidence coefficient \(0.95\), we estimate that mean sales in cities with the target population of \(65.4\) thousand persons aged 16 years or younger and per capita disposable income of \(17.6\) thousand dollars are somewhere between \(185.3\) and \(196.9\) thousand dollars.
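The quadratic form \(X_h^T s^2\{b\} X_h\) involves heavy cancellation between large terms, so the rounded entries of \(s^2\{b\}\) printed above are too coarse to reproduce it by hand. The sketch below, assuming NumPy, recomputes everything from the raw data at full precision; the critical value \(2.101\) is the table value from the text and the variable names are illustrative.

```python
import numpy as np

# Dwaine Studios data from the table above: (X1, X2, Y) for the 21 cities.
data = np.array([
    (68.5, 16.7, 174.4), (45.2, 16.8, 164.4), (91.3, 18.2, 244.2),
    (47.8, 16.3, 154.6), (46.9, 17.3, 181.6), (66.1, 18.2, 207.5),
    (49.5, 15.9, 152.8), (52.0, 17.2, 163.2), (48.9, 16.6, 145.4),
    (38.4, 16.0, 137.2), (87.9, 18.3, 241.9), (72.8, 17.1, 191.1),
    (88.4, 17.4, 232.0), (42.9, 15.8, 145.3), (52.5, 17.8, 161.1),
    (85.7, 18.4, 209.7), (41.3, 16.5, 146.4), (51.7, 16.3, 144.0),
    (89.6, 18.1, 232.6), (82.7, 19.1, 224.1), (52.3, 16.0, 166.5),
])
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
Y = data[:, 2]
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
residuals = Y - X @ b
mse = (residuals @ residuals) / (n - p)
s2_b = mse * XtX_inv            # estimated variance-covariance matrix of b

# Point of interest: intercept term, X1 = 65.4, X2 = 17.6.
x_h = np.array([1.0, 65.4, 17.6])

y_hat_h = x_h @ b               # point estimate of the mean response
s2_yhat = x_h @ s2_b @ x_h      # estimated variance of Y-hat at x_h
s_yhat = np.sqrt(s2_yhat)

t_crit = 2.101  # t(0.975; 18), table value quoted in the text
ci = (y_hat_h - t_crit * s_yhat, y_hat_h + t_crit * s_yhat)

print(f"Y_hat = {y_hat_h:.2f}, s^2 = {s2_yhat:.3f}, s = {s_yhat:.2f}")
print(f"95% CI for the mean response: ({ci[0]:.1f}, {ci[1]:.1f})")
```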
- Find \(95\%\) prediction interval where \(X_1 = 65.4\) thousand persons and \(X_2 = 17.6\) thousand dollars
From the formula
\(\hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(pred)\)
we first need \(s(pred)\):
\( \begin{align*} s^2(pred) &= MSE + s^2(\hat{Y}_h) \\ &= 121.1626 + 7.656 \\ &= 128.82 \\ \\ s(pred) &= 11.35 \end{align*} \)
With \(\alpha = 0.05\), \( \begin{align*} t_{1-\frac{\alpha}{2},n-p} &= t_{1-\frac{0.05}{2},18} \\ &= 2.101 \end{align*} \)
The \(95\%\) prediction interval is
\( \begin{align*} =& \; \hat{Y}_h \pm t_{1-\frac{\alpha}{2},n-p} s(pred) \\ =& \; 191.10 \pm (2.101) (11.35) \end{align*} \)
\(\therefore\) \(167.3 \lt Y_{h(new)} \lt 214.9\)
With \(95\%\) confidence, we predict that sales in the new city will be somewhere between \(167.3\) and \(214.9\) thousand dollars.
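The prediction interval differs from the confidence interval on the mean response only by the extra \(MSE\) term in the variance. A minimal sketch using the rounded figures reported in the text (the variable names are illustrative):

```python
import math

# Values reported above for X1 = 65.4 and X2 = 17.6.
mse = 121.1626       # mean square error from the ANOVA table
s2_yhat = 7.656      # estimated variance of the fitted mean response
y_hat_h = 191.10     # point estimate
t_crit = 2.101       # t(0.975; 18), table value quoted in the text

# A new observation varies around its mean, so MSE is added to s^2(Y_hat).
s2_pred = mse + s2_yhat
s_pred = math.sqrt(s2_pred)

pred_interval = (y_hat_h - t_crit * s_pred, y_hat_h + t_crit * s_pred)

print(f"s(pred) = {s_pred:.2f}")
print(f"95% PI: ({pred_interval[0]:.1f}, {pred_interval[1]:.1f})")
```

Because \(MSE\) dominates \(s^2(\hat{Y}_h)\) here, the prediction interval is roughly four times as wide as the interval for the mean response.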
- Find \(R^2\) and \(R^2_{adj}\)
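From the formulas
\(R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}\)
and \( \begin{align*} R^2_{adj} &= 1 - \frac{\frac{SSE}{n - p}}{\frac{SSTO}{n - 1}} \\ \\ &= 1 - \frac{n - 1}{n - p} \frac{SSE}{SSTO} \end{align*}\)
we obtain
\( \begin{align*} R^2 &= \frac{24,015.28}{26,196.21} = 0.917 \\ \\ R^2_{adj} &= 1 - \frac{20}{18} \cdot \frac{2,180.93}{26,196.21} = 0.907 \end{align*} \)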
\(\therefore 91.7\%\) of the variation in sales can be explained by the target population and per capita disposable income.
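Both coefficients follow directly from the ANOVA sums of squares. A short sketch in plain Python, using the values from the ANOVA table (variable names are illustrative):

```python
# Sums of squares and dimensions from the ANOVA table above.
ssr, sse, ssto = 24015.28, 2180.93, 26196.21
n, p = 21, 3

r2 = 1 - sse / ssto                              # coefficient of determination
r2_adj = 1 - (n - 1) / (n - p) * sse / ssto      # adjusted for degrees of freedom

print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```

The adjusted value is slightly smaller because it penalizes the two predictors through the ratio \((n-1)/(n-p)\).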