Introduction

The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games.At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014 Chapter 2).

include_graphics(c("los_angeles-dodgers-stadium.jpg",
                 "Los-Angeles-Dodgers-Promo.jpg",
                 "adrian_bobble.jpg"))
56,000-seat Dodgers (left), stadium  (middle), shirts and caps  (right) *bobblehead*56,000-seat Dodgers (left), stadium  (middle), shirts and caps  (right) *bobblehead*56,000-seat Dodgers (left), stadium  (middle), shirts and caps  (right) *bobblehead*

56,000-seat Dodgers (left), stadium (middle), shirts and caps (right) bobblehead

Visualizing the whole dataset Table.

library(RSQLite)
library(dplyr)
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(pander)
library(knitr)
con <- dbConnect(SQLite(), "/sqlite/dodgers.sqlite")
## this will let sqlite do all jobs for us
events <-  tbl(con, "events")
d <- dbReadTable(con, "events")
events

Q1 . Tabular and Box Plot analysis with Fireworks

Is there an association between attendance and if there were fireworks or not ?

d2 <- d %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")))
d2 %>%
select(day_of_week, month) %>%
summary() %>%
pander(caption = "Month and week day names now follow time order.")
Month and week day names now follow time order.
day_of_week month
Monday :12 APR:12
Tuesday :13 MAY:18
Wednesday:12 JUN: 9
Thursday : 5 JUL:12
Friday :13 AUG:15
Saturday :13 SEP:12
Sunday :13 OCT: 3
d2 %>%
count(day_of_week, fireworks, name = "cnt") %>%
pivot_wider(names_from = fireworks, values_from = cnt) %>%
rename(`Weekday` = day_of_week) %>%
kable(format = "html", caption = "Number of times Firework took place in games played different weekdays", booktabs=TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "=1, "Firework"=2))
Number of times Firework took place in games played different weekdays
Firework
Weekday NO YES
Monday 12 .
Tuesday 13 .
Wednesday 11 1
Thursday 5 .
Friday . 13
Saturday 13 .
Sunday 13 .

Fireworks were done only 14 times in 81 games. All were done on friday with only 1 done on wednesday. 13 games were played on friday which is same number of games played every day of the week. This is important as now number of games is same we can focus on the attendance variation between the days. Did attendance vary between friday and other days? How advantageuos were the fireworks ?

Q1. Box and scatter plots analysis between attendance and fireworks, cap, shirt , bobblehead

d2 %>%
ggplot(aes(fireworks, attend)) +
geom_boxplot(aes(fill=fireworks)) +
theme(legend.position = "none")

d2 %>%
ggplot(aes(cap, attend)) +
geom_boxplot(aes(fill=cap)) +
theme(legend.position = "none")

d2 %>%
ggplot(aes(shirt, attend)) +
geom_boxplot(aes(fill=shirt)) +
theme(legend.position = "none")

d2 %>%
ggplot(aes(bobblehead, attend)) +
geom_boxplot(aes(fill=bobblehead)) +
theme(legend.position = "none")

Fridays attendance average is same as any other day. Hence there is no relation seen between the fireworks and the attendance thru visualization ! Cap promotion also doesnt effect th attendance as the medians are the same when no caps were given. Shirt promotion seems to increase attendance. There seems a positive relation between the shirt promotion and attendance. Attendance median increases by 10,000.

Q1. Box plot and scatter plot of relation between attendance and Opponents.

Is there a relation between attendance and any particular opponent?

dk <- d %>%
mutate( opponent = factor( opponent, levels = c("Pirates", "Braves", "Nationals", "Giants", "Rockies", "Snakes", "Cardinals","Astros","Brewers", "Angels" , "White Sox", "Mets" , "Reds", "Padres", "Phillies" , "Cubs" , "Marlins")))

 dk %>% count( opponent , name = "Number_of_games") %>%
rename(`Opponent`= opponent) %>% 
pander(caption = "Number of games with each opponent");
Number of games with each opponent
Opponent Number_of_games
Pirates 3
Braves 3
Nationals 3
Giants 9
Rockies 9
Snakes 9
Cardinals 7
Astros 3
Brewers 4
Angels 3
White Sox 3
Mets 4
Reds 3
Padres 9
Phillies 3
Cubs 3
Marlins 3
d2 %>%
ggplot(aes(opponent, attend, group=1)) +
geom_point() +
scale_y_continuous(labels=scales::comma) +
geom_smooth(se=FALSE, method="loess")
## `geom_smooth()` using formula 'y ~ x'

d2 %>%
ggplot(aes(opponent, attend)) +
geom_boxplot(aes(fill=opponent)) +
theme(legend.position = "none")

There are different number og games played against each opponent hence a median value will better give us an understanding to compare the attendances. Angels have the highest average attendance while Brave have the lowest average attendance. Maximum attendance was however recorded in a Giants Game and a lowest attendance was recorded in a Sankes games. We can say that Angels bought in more crowd on average as if we compare it with other opponents against whom 3 games were played, angels bought in much higher crowds.

Q1. Chi quare test between fireworks , opponents, shirt, cap.

we try to see if the relation is due to chance or not.

fireworks_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + fireworks, .)
fireworks_tbl %>%
as_tibble() %>% pivot_wider(names_from= fireworks, values_from = n) %>%
kable(caption = "", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Fireworks" = 2))
Fireworks
attend_cut NO YES
(0,3.66e+04] 24 3
(3.66e+04,4.39e+04] 19 8
(4.39e+04,Inf] 24 3
chisq.test(fireworks_tbl) %>% pander()
## Warning in chisq.test(fireworks_tbl): Chi-squared approximation may be incorrect
Pearson’s Chi-squared test: fireworks_tbl
Test statistic df P value
4.318 2 0.1155
opponents_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + opponent, .)
opponents_tbl %>%
as_tibble() %>% pivot_wider(names_from= opponent, values_from = n) %>%
kable(caption = " ", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Opponents" = 17))
Opponents
attend_cut Angels Astros Braves Brewers Cardinals Cubs Giants Marlins Mets Nationals Padres Phillies Pirates Reds Rockies Snakes White Sox
(0,3.66e+04] 0 3 2 2 1 0 4 0 0 0 2 1 2 2 4 4 0
(3.66e+04,4.39e+04] 1 0 0 1 5 2 4 3 0 0 4 1 0 0 3 2 1
(4.39e+04,Inf] 2 0 1 1 1 1 1 0 4 3 3 1 1 1 2 3 2
chisq.test(opponents_tbl) %>% pander()
## Warning in chisq.test(opponents_tbl): Chi-squared approximation may be incorrect
Pearson’s Chi-squared test: opponents_tbl
Test statistic df P value
47.07 32 0.04178 *
shirt_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + shirt, .)
shirt_tbl %>%
as_tibble() %>% pivot_wider(names_from= shirt, values_from = n) %>%
kable(caption = " ", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Shirt" = 2))
Shirt
attend_cut NO YES
(0,3.66e+04] 27 0
(3.66e+04,4.39e+04] 26 1
(4.39e+04,Inf] 25 2
chisq.test(shirt_tbl) %>% pander()
## Warning in chisq.test(shirt_tbl): Chi-squared approximation may be incorrect
Pearson’s Chi-squared test: shirt_tbl
Test statistic df P value
2.077 2 0.354
cap_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + cap, .)
cap_tbl %>%
as_tibble() %>% pivot_wider(names_from= cap, values_from = n) %>%
kable(caption = " ", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Cap" = 2))
Cap
attend_cut NO YES
(0,3.66e+04] 26 1
(3.66e+04,4.39e+04] 26 1
(4.39e+04,Inf] 27 0
chisq.test(cap_tbl) %>% pander()
## Warning in chisq.test(cap_tbl): Chi-squared approximation may be incorrect
Pearson’s Chi-squared test: cap_tbl Null Hypothesis: Attendance and the other explanatory variable are independant.
Test statistic df P value
1.025 2 0.5989

Attendance and Firework : Chi square P value is 11.5%. It tells that fireworks and attendance may be independant, but we need to perform more accurate analysis with other tests, as no firm assurance is given by Chi test.

Attendance and Opponent: Chi square value is 4% which is less then 5%, it means that both variables are dependant. It rejects the null hypothesis.

Attendance and Cap: Null hypothesis accepted as p value is greater then 5%.

Attendance and shirt: Null hypothesis accepted as p value is greater then 5%.

Q1. Linear Regression Analysis of all variables.

a. How good does your model fit to data?

lmod <- lm(attend ~. , d2)
lmod
## 
## Call:
## lm(formula = attend ~ ., data = d2)
## 
## Coefficients:
##          (Intercept)              monthMAY              monthJUN  
##            48182.034              4292.168             -3401.847  
##             monthJUL              monthAUG              monthSEP  
##             4195.667              8003.271              3505.864  
##             monthOCT                   day    day_of_weekTuesday  
##             5434.108               133.637              8674.107  
## day_of_weekWednesday   day_of_weekThursday     day_of_weekFriday  
##               13.236               525.394            -18253.565  
##  day_of_weekSaturday     day_of_weekSunday        opponentAstros  
##             3688.891               830.087            -20871.309  
##       opponentBraves       opponentBrewers     opponentCardinals  
##           -18977.726            -22827.232            -12986.020  
##         opponentCubs        opponentGiants       opponentMarlins  
##           -10795.005            -17278.999            -19335.738  
##         opponentMets     opponentNationals        opponentPadres  
##            -4492.426             -6543.709            -11480.107  
##     opponentPhillies       opponentPirates          opponentReds  
##           -13922.323            -12835.956            -16654.400  
##      opponentRockies        opponentSnakes     opponentWhite Sox  
##           -17348.587            -20268.767              -971.859  
##                 temp           skiesCloudy        day_nightNight  
##                9.083              -108.115             -3520.446  
##               capYES              shirtYES          fireworksYES  
##            -6534.522               987.693             20183.543  
##        bobbleheadYES  
##             9349.214
summary(lmod)
## 
## Call:
## lm(formula = attend ~ ., data = d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9300.2 -3166.2   -60.2  1387.1 12599.9 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           48182.034  18476.340   2.608  0.01240 * 
## monthMAY               4292.168   6171.238   0.696  0.49039   
## monthJUN              -3401.847  12098.485  -0.281  0.77989   
## monthJUL               4195.667   6228.541   0.674  0.50408   
## monthAUG               8003.271   7716.681   1.037  0.30534   
## monthSEP               3505.864   7740.306   0.453  0.65282   
## monthOCT               5434.108   9589.486   0.567  0.57382   
## day                     133.637    141.282   0.946  0.34937   
## day_of_weekTuesday     8674.107   2943.997   2.946  0.00513 **
## day_of_weekWednesday     13.236   2765.422   0.005  0.99620   
## day_of_weekThursday     525.394   3906.853   0.134  0.89364   
## day_of_weekFriday    -18253.565   9216.711  -1.980  0.05392 . 
## day_of_weekSaturday    3688.891   3341.257   1.104  0.27558   
## day_of_weekSunday       830.087   4819.110   0.172  0.86403   
## opponentAstros       -20871.309  14215.141  -1.468  0.14915   
## opponentBraves       -18977.726  13742.734  -1.381  0.17428   
## opponentBrewers      -22827.232  15163.652  -1.505  0.13937   
## opponentCardinals    -12986.020  13157.200  -0.987  0.32904   
## opponentCubs         -10795.005  12633.898  -0.854  0.39749   
## opponentGiants       -17278.999  13263.050  -1.303  0.19942   
## opponentMarlins      -19335.738  13901.010  -1.391  0.17123   
## opponentMets          -4492.426   6422.733  -0.699  0.48795   
## opponentNationals     -6543.709  13941.956  -0.469  0.64113   
## opponentPadres       -11480.107  11321.487  -1.014  0.31612   
## opponentPhillies     -13922.323  12598.965  -1.105  0.27515   
## opponentPirates      -12835.956  13339.190  -0.962  0.34117   
## opponentReds         -16654.400  11914.662  -1.398  0.16918   
## opponentRockies      -17348.587  12957.532  -1.339  0.18749   
## opponentSnakes       -20268.767  12731.376  -1.592  0.11854   
## opponentWhite Sox      -971.859   5734.643  -0.169  0.86620   
## temp                      9.083    245.933   0.037  0.97071   
## skiesCloudy            -108.115   2387.211  -0.045  0.96408   
## day_nightNight        -3520.446   3582.791  -0.983  0.33118   
## capYES                -6534.522   5873.332  -1.113  0.27193   
## shirtYES                987.693   4600.000   0.215  0.83098   
## fireworksYES          20183.543   8355.164   2.416  0.01992 * 
## bobbleheadYES          9349.214   3198.320   2.923  0.00545 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5979 on 44 degrees of freedom
## Multiple R-squared:  0.7144, Adjusted R-squared:  0.4808 
## F-statistic: 3.057 on 36 and 44 DF,  p-value: 0.0002493
  1. The P-value of the F static is 2.4*10^(-4). It is closer to zero hence it means that some predictor variables are significantly related to the outcome variables.

Q1. Checking the Beta Coefficients of all variables.

summary(lmod)$coefficient
##                           Estimate Std. Error      t value    Pr(>|t|)
## (Intercept)           48182.033935 18476.3401  2.607769374 0.012398944
## monthMAY               4292.167885  6171.2382  0.695511624 0.490392013
## monthJUN              -3401.847066 12098.4853 -0.281179583 0.779890752
## monthJUL               4195.667129  6228.5412  0.673619550 0.504076578
## monthAUG               8003.271072  7716.6813  1.037138993 0.305337437
## monthSEP               3505.864112  7740.3055  0.452936139 0.652820395
## monthOCT               5434.107798  9589.4860  0.566673523 0.573815500
## day                     133.637178   141.2815  0.945892674 0.349369137
## day_of_weekTuesday     8674.106877  2943.9967  2.946371108 0.005125012
## day_of_weekWednesday     13.236429  2765.4217  0.004786405 0.996202651
## day_of_weekThursday     525.394146  3906.8531  0.134480137 0.893636194
## day_of_weekFriday    -18253.565455  9216.7107 -1.980485878 0.053922611
## day_of_weekSaturday    3688.890647  3341.2573  1.104042691 0.275575978
## day_of_weekSunday       830.086517  4819.1101  0.172248920 0.864031688
## opponentAstros       -20871.308931 14215.1407 -1.468244980 0.149150782
## opponentBraves       -18977.726425 13742.7342 -1.380928003 0.174275238
## opponentBrewers      -22827.232386 15163.6520 -1.505391467 0.139369094
## opponentCardinals    -12986.019767 13157.1999 -0.986989622 0.329044369
## opponentCubs         -10795.005358 12633.8985 -0.854447690 0.397485404
## opponentGiants       -17278.999097 13263.0501 -1.302792265 0.199423082
## opponentMarlins      -19335.738034 13901.0102 -1.390959196 0.171232313
## opponentMets          -4492.425611  6422.7329 -0.699457022 0.487947926
## opponentNationals     -6543.709408 13941.9555 -0.469353771 0.641134385
## opponentPadres       -11480.107122 11321.4870 -1.014010540 0.316121473
## opponentPhillies     -13922.322838 12598.9648 -1.105037044 0.275149635
## opponentPirates      -12835.956194 13339.1897 -0.962274054 0.341170728
## opponentReds         -16654.399987 11914.6622 -1.397807145 0.169178758
## opponentRockies      -17348.587246 12957.5322 -1.338880509 0.187486118
## opponentSnakes       -20268.767405 12731.3763 -1.592032696 0.118537132
## opponentWhite Sox      -971.858538  5734.6435 -0.169471483 0.866202566
## temp                      9.083097   245.9329  0.036933233 0.970705300
## skiesCloudy            -108.115090  2387.2108 -0.045289294 0.964081646
## day_nightNight        -3520.445505  3582.7913 -0.982598553 0.331177425
## capYES                -6534.522388  5873.3315 -1.112575094 0.271932734
## shirtYES                987.692712  4599.9995  0.214715828 0.830980954
## fireworksYES          20183.542503  8355.1641  2.415696727 0.019924658
## bobbleheadYES          9349.214089  3198.3204  2.923163724 0.005454829
  1. If a bobblehead is given the attendance increases by + 9349.
  2. The t-values of fireworks and Bobblehead are the highest which means that we can believe in the beta cofficients with more confidence.
  3. We should trust the beta coefficients of bobblehead and day_of_weekTuesday because they have the least pr>(|t|) value, which means coefficients are significant.
  4. It seems from the analysis that Bobbleheadyes, day_of_weekTuesday and fireworks are important predictors of attendance. It is however contradictory that first in initial data visualizaion fireworks seemed insignificant.

This doesnt seem to be a good model because when we look at the adjusted R square value of the model it says that only 48% of the variance in the attendance can be predicted by the given explanatory variables.

Q1. Linear Regression Model versus the Mean Model.

–Checking the Null Hypothesis

Now, we are comparing the linear regression model and the intercept only model : which is also called the mean model. This comparision is used to test the null hypothesis for regression analysis.

small <- update(lmod, . ~ 1 )
anova(small, lmod)

Null hypothesis : The linear regression model and the mean model of the variable are the same. if these models are same then the variable doesnt affect attendance.

P-value is small <= 0.05. We conclude that at least one of variables on the right has some relation to attendance and reject null hypothesis.

Q1. F-Tests for Statistical Significance.

Now we check that if bobblehead,day_of_weekTuesday and fireworks are statiscally affecting the attendance.

small <- update(lmod, . ~ . - bobblehead)
anova(small, lmod)
drop1(lmod, test="F")

The above results show that: 1. Bobblehead is most significant with the least p value and gives almost 100% confidence. Null hyp rejected. 2. Day of week is next most important as it gives 99% confidence . Null hyp rejected. 3. Fireworks is third most important as it gives 99% confidence. Null hyp rejected. 4. All other variables have high p vales. Null hyp accepted.

AIC analysis: 1. day_of_week is most significant as it will reduce future overestimation 2. bobbehead is second most important 3. Fireworks is the third most important.

All explanatory variables are not required because they are not significant for predicting attendance.

AIC Tests

opt_drop1 <- lmod %>% step(direction='backward')
## Start:  AIC=1433.33
## attend ~ month + day + day_of_week + opponent + temp + skies + 
##     day_night + cap + shirt + fireworks + bobblehead
## 
##               Df Sum of Sq        RSS    AIC
## - opponent    16 482404341 2055399107 1423.0
## - month        6 107253092 1680247858 1426.7
## - temp         1     48765 1573043531 1431.3
## - skies        1     73327 1573068093 1431.3
## - shirt        1   1648173 1574642938 1431.4
## - day          1  31985882 1604980647 1433.0
## - day_night    1  34516507 1607511273 1433.1
## <none>                     1572994766 1433.3
## - cap          1  44252037 1617246803 1433.6
## - fireworks    1 208621672 1781616438 1441.4
## - bobblehead   1 305478664 1878473429 1445.7
## - day_of_week  6 622226954 2195221720 1448.3
## 
## Step:  AIC=1422.99
## attend ~ month + day + day_of_week + temp + skies + day_night + 
##     cap + shirt + fireworks + bobblehead
## 
##               Df Sum of Sq        RSS    AIC
## - day_night    1   1939503 2057338610 1421.1
## - skies        1   3025454 2058424560 1421.1
## - temp         1  18990413 2074389520 1421.7
## - day          1  28387321 2083786427 1422.1
## <none>                     2055399107 1423.0
## - cap          1  60563235 2115962342 1423.3
## - shirt        1  83891144 2139290251 1424.2
## - month        6 484099010 2539498117 1428.1
## - fireworks    1 197806771 2253205878 1428.4
## - day_of_week  6 663720865 2719119972 1433.7
## - bobblehead   1 574485229 2629884336 1441.0
## 
## Step:  AIC=1421.07
## attend ~ month + day + day_of_week + temp + skies + cap + shirt + 
##     fireworks + bobblehead
## 
##               Df Sum of Sq        RSS    AIC
## - skies        1   3091785 2060430395 1419.2
## - temp         1  24263844 2081602454 1420.0
## - day          1  28120244 2085458854 1420.2
## <none>                     2057338610 1421.1
## - cap          1  61589936 2118928546 1421.5
## - shirt        1  82681702 2140020311 1422.3
## - fireworks    1 196113155 2253451764 1426.4
## - month        6 540707943 2598046553 1428.0
## - day_of_week  6 733147115 2790485724 1433.8
## - bobblehead   1 627860542 2685199152 1440.6
## 
## Step:  AIC=1419.19
## attend ~ month + day + day_of_week + temp + cap + shirt + fireworks + 
##     bobblehead
## 
##               Df Sum of Sq        RSS    AIC
## - day          1  30394696 2090825091 1418.4
## - temp         1  35300686 2095731080 1418.6
## <none>                     2060430395 1419.2
## - cap          1  69611450 2130041845 1419.9
## - shirt        1  89066301 2149496696 1420.6
## - fireworks    1 212995729 2273426124 1425.2
## - month        6 557853596 2618283991 1426.6
## - day_of_week  6 742897542 2803327936 1432.1
## - bobblehead   1 625482393 2685912788 1438.7
## 
## Step:  AIC=1418.38
## attend ~ month + day_of_week + temp + cap + shirt + fireworks + 
##     bobblehead
## 
##               Df Sum of Sq        RSS    AIC
## - temp         1  21642127 2112467217 1417.2
## <none>                     2090825091 1418.4
## - cap          1  54908713 2145733804 1418.5
## - shirt        1  90079538 2180904629 1419.8
## - fireworks    1 231720432 2322545523 1424.9
## - month        6 541557658 2632382748 1425.0
## - day_of_week  6 765145220 2855970311 1431.6
## - bobblehead   1 615401407 2706226498 1437.3
## 
## Step:  AIC=1417.21
## attend ~ month + day_of_week + cap + shirt + fireworks + bobblehead
## 
##               Df Sum of Sq        RSS    AIC
## <none>                     2112467217 1417.2
## - cap          1  55325426 2167792643 1417.3
## - shirt        1  88943911 2201411129 1418.5
## - fireworks    1 223427284 2335894502 1423.3
## - month        6 549143913 2661611130 1423.9
## - day_of_week  6 800793351 2913260569 1431.2
## - bobblehead   1 686366179 2798833397 1438.0

AIC analysis: 1. day_of_week is most significant as it will reduce future overestimation 2. bobbehead is second most important 3. Fireworks is the third most important.

Cross Validation tests : For bobblehead

This tests are done by taking each variable out from the model and then comparing the prediction results with the all variables model

set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)

lmodnew <-  update(lmod, . ~ . )
lmod_interaction <- update(lmod, . ~ . - bobblehead)

for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]

# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()

# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
## Warning in predict.lm(lmod_train, newdata = test): prediction from a rank-
## deficient fit may be misleading

## Warning in predict.lm(lmod_train, newdata = test): prediction from a rank-
## deficient fit may be misleading

## Warning in predict.lm(lmod_train, newdata = test): prediction from a rank-
## deficient fit may be misleading
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
##             lmod lmod_interaction         dif_rmse 
##         9995.594        11044.493        -1048.899
options(warn=-1)

Rule for Analysis: If the Diff RMSE value is negative, it means that Variable increases prediction accuracy. If positive , it means that variable decreaes prediction accuracy. If the diff rmse value is really less either positive or negative, it means that including or excluding the variable doesnt effect the model very much.

RMSE value when bobblehead variable is taken out is more as compared to the model where we included Bobblehead. It shows that bobblehead is significant.

Cross Validation tests : For day_of_week

This tests are done by taking each variable out from the model and then comparing the prediction results with the all variables model

set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)

lmodnew <-  update(lmod, . ~ .)
lmod_interaction <- update(lmod, . ~ . - day_of_week)

for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]

# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()

# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
##             lmod lmod_interaction         dif_rmse 
##        9995.5940        9674.9562         320.6378
options(warn=-1)

RMSE value when day_of_week variable is taken out is more as compared to the model where we included Bobblehead. It shows that day_of_week is significant.

Cross Validation tests : For fireworks

This tests are done by taking each variable out from the model and then comparing the prediction results with the all variables mode

set.seed(461)
options(warn=-1)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)

lmodnew <-  update(lmod, . ~ . - month)
lmod_interaction <- update(lmod, . ~ . - month - fireworks)

for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]

# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()

# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
##             lmod lmod_interaction         dif_rmse 
##        9046.7892        9355.4778        -308.6886

RMSE value when fireowrks variable is taken out is more as compared to the model where we included Bobblehead. It shows that fireworks is significant. The highest RMSE is when we removed the bobblehead variable. A low RMSE means the model is a better fit. Hence the model fits better when we include bobblehead in prediction.

c. Is bobblehead still significant? What is expected additional attendance due bobblehead? What is 80% confidence interval?

temp <- subset(d2, bobblehead=="YES")
newdata <- data.frame(temp)
result <- predict(lmod, newdata=newdata, level=0.80, 
        interval = "prediction")
colMeans(result)
##      fit      lwr      upr 
## 53144.64 43602.29 62686.98
confint(lmod, level = 0.80)["bobbleheadYES",]
##      10 %      90 % 
##  5187.911 13510.517

The additional number of fans only due to bobblehead promotion is 9345. The 80% confidence interval is computed above.

Yes, bobblehead is the most significant predictor of attendance.

d. Check model diagnostics.

i. Does any quantitative explanatory variable need a nonlinear transformation?

Ans.

Now we look at which variables would need nonlinear transformations. We can guess this from looking at the data visualizations of the variables.

 d2 %>%
ggplot(aes( temp, attend, group=1)) +
geom_point() +
scale_y_continuous(labels=scales::comma) +
geom_smooth(se=FALSE, method="loess")
## `geom_smooth()` using formula 'y ~ x'

The temp variable seems like can be modeled by a non linear transformation because of the non linear curve it represents.

ii. Cross Validation on Two interaction terms.

We will now see how the interaction of two variable terms can modify our model accuray :

Cross Validation on Temp:Day_night:

set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)

lmodnew <-  update(lmod, . ~ . )
lmod_interaction <- update(lmod, . ~ . + temp: day_night )

for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]

# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()

# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
##             lmod lmod_interaction         dif_rmse 
##        9995.5940        9604.4459         391.1481
options(warn=-1)

Analysis Rule: As we are adding a new variable term hence now if the diff rmse value is more positive it means that the term is significant.

I chose temperature and day_night as naturally people would like to watch a match in a cosy temperature or in day time or night. As temperature and day night are closely linked, generally higher temperature in days and lower in nights. Here we try to see if we can predict attendance by looking at their relation

Temp and day_night variable interaction reduces the RMSE value and increases diff RMSE. Hence this interaction improves the model.

Cross Validation on Temp:Day_night:

set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)

lmodnew <-  update(lmod, . ~ . )
lmod_interaction <- update(lmod, . ~ . + temp:skies)

for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]

# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()

# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
##             lmod lmod_interaction         dif_rmse 
##       9995.59396      10029.84337        -34.24942
options(warn=-1)

I chose temperature and skies as naturally people would like to watch a match in a cosy temperature plus if there are clouds or not. As temperature and skies are not so much closely linked, generally higher temperature in clear skies and lower in clouded ones. Here we try to see if we can predict attendance by looking at their relation

Temp and day_night variable interaction increases the RMSE value by very small amount. Hence this term is not significant.

Miller, T. W. 2014. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. FT Press Analytics. Pearson Education. https://books.google.com.tr/books?id=PU6nBAAAQBAJ.