The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games.At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014 Chapter 2).
include_graphics(c("los_angeles-dodgers-stadium.jpg",
"Los-Angeles-Dodgers-Promo.jpg",
"adrian_bobble.jpg"))
56,000-seat Dodgers (left), stadium (middle), shirts and caps (right) bobblehead
library(RSQLite)
library(dplyr)
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(pander)
library(knitr)
con <- dbConnect(SQLite(), "/sqlite/dodgers.sqlite")
## this will let sqlite do all jobs for us
events <- tbl(con, "events")
d <- dbReadTable(con, "events")
events
d2 <- d %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")))
d2 %>%
select(day_of_week, month) %>%
summary() %>%
pander(caption = "Month and week day names now follow time order.")
day_of_week | month |
---|---|
Monday :12 | APR:12 |
Tuesday :13 | MAY:18 |
Wednesday:12 | JUN: 9 |
Thursday : 5 | JUL:12 |
Friday :13 | AUG:15 |
Saturday :13 | SEP:12 |
Sunday :13 | OCT: 3 |
d2 %>%
count(day_of_week, fireworks, name = "cnt") %>%
pivot_wider(names_from = fireworks, values_from = cnt) %>%
rename(`Weekday` = day_of_week) %>%
kable(format = "html", caption = "Number of times Firework took place in games played different weekdays", booktabs=TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "=1, "Firework"=2))
Weekday | NO | YES |
---|---|---|
Monday | 12 | . |
Tuesday | 13 | . |
Wednesday | 11 | 1 |
Thursday | 5 | . |
Friday | . | 13 |
Saturday | 13 | . |
Sunday | 13 | . |
Fireworks were done only 14 times in 81 games. All were done on friday with only 1 done on wednesday. 13 games were played on friday which is same number of games played every day of the week. This is important as now number of games is same we can focus on the attendance variation between the days. Did attendance vary between friday and other days? How advantageuos were the fireworks ?
d2 %>%
ggplot(aes(fireworks, attend)) +
geom_boxplot(aes(fill=fireworks)) +
theme(legend.position = "none")
d2 %>%
ggplot(aes(cap, attend)) +
geom_boxplot(aes(fill=cap)) +
theme(legend.position = "none")
d2 %>%
ggplot(aes(shirt, attend)) +
geom_boxplot(aes(fill=shirt)) +
theme(legend.position = "none")
d2 %>%
ggplot(aes(bobblehead, attend)) +
geom_boxplot(aes(fill=bobblehead)) +
theme(legend.position = "none")
Fridays attendance average is same as any other day. Hence there is no relation seen between the fireworks and the attendance thru visualization ! Cap promotion also doesnt effect th attendance as the medians are the same when no caps were given. Shirt promotion seems to increase attendance. There seems a positive relation between the shirt promotion and attendance. Attendance median increases by 10,000.
dk <- d %>%
mutate( opponent = factor( opponent, levels = c("Pirates", "Braves", "Nationals", "Giants", "Rockies", "Snakes", "Cardinals","Astros","Brewers", "Angels" , "White Sox", "Mets" , "Reds", "Padres", "Phillies" , "Cubs" , "Marlins")))
dk %>% count( opponent , name = "Number_of_games") %>%
rename(`Opponent`= opponent) %>%
pander(caption = "Number of games with each opponent");
Opponent | Number_of_games |
---|---|
Pirates | 3 |
Braves | 3 |
Nationals | 3 |
Giants | 9 |
Rockies | 9 |
Snakes | 9 |
Cardinals | 7 |
Astros | 3 |
Brewers | 4 |
Angels | 3 |
White Sox | 3 |
Mets | 4 |
Reds | 3 |
Padres | 9 |
Phillies | 3 |
Cubs | 3 |
Marlins | 3 |
d2 %>%
ggplot(aes(opponent, attend, group=1)) +
geom_point() +
scale_y_continuous(labels=scales::comma) +
geom_smooth(se=FALSE, method="loess")
## `geom_smooth()` using formula 'y ~ x'
d2 %>%
ggplot(aes(opponent, attend)) +
geom_boxplot(aes(fill=opponent)) +
theme(legend.position = "none")
There are different number og games played against each opponent hence a median value will better give us an understanding to compare the attendances. Angels have the highest average attendance while Brave have the lowest average attendance. Maximum attendance was however recorded in a Giants Game and a lowest attendance was recorded in a Sankes games. We can say that Angels bought in more crowd on average as if we compare it with other opponents against whom 3 games were played, angels bought in much higher crowds.
fireworks_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + fireworks, .)
fireworks_tbl %>%
as_tibble() %>% pivot_wider(names_from= fireworks, values_from = n) %>%
kable(caption = "", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Fireworks" = 2))
attend_cut | NO | YES |
---|---|---|
(0,3.66e+04] | 24 | 3 |
(3.66e+04,4.39e+04] | 19 | 8 |
(4.39e+04,Inf] | 24 | 3 |
chisq.test(fireworks_tbl) %>% pander()
## Warning in chisq.test(fireworks_tbl): Chi-squared approximation may be incorrect
Test statistic | df | P value |
---|---|---|
4.318 | 2 | 0.1155 |
opponents_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + opponent, .)
opponents_tbl %>%
as_tibble() %>% pivot_wider(names_from= opponent, values_from = n) %>%
kable(caption = " ", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Opponents" = 17))
attend_cut | Angels | Astros | Braves | Brewers | Cardinals | Cubs | Giants | Marlins | Mets | Nationals | Padres | Phillies | Pirates | Reds | Rockies | Snakes | White Sox |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(0,3.66e+04] | 0 | 3 | 2 | 2 | 1 | 0 | 4 | 0 | 0 | 0 | 2 | 1 | 2 | 2 | 4 | 4 | 0 |
(3.66e+04,4.39e+04] | 1 | 0 | 0 | 1 | 5 | 2 | 4 | 3 | 0 | 0 | 4 | 1 | 0 | 0 | 3 | 2 | 1 |
(4.39e+04,Inf] | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 4 | 3 | 3 | 1 | 1 | 1 | 2 | 3 | 2 |
chisq.test(opponents_tbl) %>% pander()
## Warning in chisq.test(opponents_tbl): Chi-squared approximation may be incorrect
Test statistic | df | P value |
---|---|---|
47.07 | 32 | 0.04178 * |
shirt_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + shirt, .)
shirt_tbl %>%
as_tibble() %>% pivot_wider(names_from= shirt, values_from = n) %>%
kable(caption = " ", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Shirt" = 2))
attend_cut | NO | YES |
---|---|---|
(0,3.66e+04] | 27 | 0 |
(3.66e+04,4.39e+04] | 26 | 1 |
(4.39e+04,Inf] | 25 | 2 |
chisq.test(shirt_tbl) %>% pander()
## Warning in chisq.test(shirt_tbl): Chi-squared approximation may be incorrect
Test statistic | df | P value |
---|---|---|
2.077 | 2 | 0.354 |
cap_tbl <- d2 %>%
mutate(attend_cut = cut(attend, breaks = c(0, quantile(attend, prob=(1:2)/3), Inf))) %>%
xtabs(~ attend_cut + cap, .)
cap_tbl %>%
as_tibble() %>% pivot_wider(names_from= cap, values_from = n) %>%
kable(caption = " ", booktabs = TRUE) %>%
kable_styling(full_width = FALSE) %>%
add_header_above(c(" "= 1, "Cap" = 2))
attend_cut | NO | YES |
---|---|---|
(0,3.66e+04] | 26 | 1 |
(3.66e+04,4.39e+04] | 26 | 1 |
(4.39e+04,Inf] | 27 | 0 |
chisq.test(cap_tbl) %>% pander()
## Warning in chisq.test(cap_tbl): Chi-squared approximation may be incorrect
Test statistic | df | P value |
---|---|---|
1.025 | 2 | 0.5989 |
Attendance and Firework : Chi square P value is 11.5%. It tells that fireworks and attendance may be independant, but we need to perform more accurate analysis with other tests, as no firm assurance is given by Chi test.
Attendance and Opponent: Chi square value is 4% which is less then 5%, it means that both variables are dependant. It rejects the null hypothesis.
Attendance and Cap: Null hypothesis accepted as p value is greater then 5%.
Attendance and shirt: Null hypothesis accepted as p value is greater then 5%.
lmod <- lm(attend ~. , d2)
lmod
##
## Call:
## lm(formula = attend ~ ., data = d2)
##
## Coefficients:
## (Intercept) monthMAY monthJUN
## 48182.034 4292.168 -3401.847
## monthJUL monthAUG monthSEP
## 4195.667 8003.271 3505.864
## monthOCT day day_of_weekTuesday
## 5434.108 133.637 8674.107
## day_of_weekWednesday day_of_weekThursday day_of_weekFriday
## 13.236 525.394 -18253.565
## day_of_weekSaturday day_of_weekSunday opponentAstros
## 3688.891 830.087 -20871.309
## opponentBraves opponentBrewers opponentCardinals
## -18977.726 -22827.232 -12986.020
## opponentCubs opponentGiants opponentMarlins
## -10795.005 -17278.999 -19335.738
## opponentMets opponentNationals opponentPadres
## -4492.426 -6543.709 -11480.107
## opponentPhillies opponentPirates opponentReds
## -13922.323 -12835.956 -16654.400
## opponentRockies opponentSnakes opponentWhite Sox
## -17348.587 -20268.767 -971.859
## temp skiesCloudy day_nightNight
## 9.083 -108.115 -3520.446
## capYES shirtYES fireworksYES
## -6534.522 987.693 20183.543
## bobbleheadYES
## 9349.214
summary(lmod)
##
## Call:
## lm(formula = attend ~ ., data = d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9300.2 -3166.2 -60.2 1387.1 12599.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48182.034 18476.340 2.608 0.01240 *
## monthMAY 4292.168 6171.238 0.696 0.49039
## monthJUN -3401.847 12098.485 -0.281 0.77989
## monthJUL 4195.667 6228.541 0.674 0.50408
## monthAUG 8003.271 7716.681 1.037 0.30534
## monthSEP 3505.864 7740.306 0.453 0.65282
## monthOCT 5434.108 9589.486 0.567 0.57382
## day 133.637 141.282 0.946 0.34937
## day_of_weekTuesday 8674.107 2943.997 2.946 0.00513 **
## day_of_weekWednesday 13.236 2765.422 0.005 0.99620
## day_of_weekThursday 525.394 3906.853 0.134 0.89364
## day_of_weekFriday -18253.565 9216.711 -1.980 0.05392 .
## day_of_weekSaturday 3688.891 3341.257 1.104 0.27558
## day_of_weekSunday 830.087 4819.110 0.172 0.86403
## opponentAstros -20871.309 14215.141 -1.468 0.14915
## opponentBraves -18977.726 13742.734 -1.381 0.17428
## opponentBrewers -22827.232 15163.652 -1.505 0.13937
## opponentCardinals -12986.020 13157.200 -0.987 0.32904
## opponentCubs -10795.005 12633.898 -0.854 0.39749
## opponentGiants -17278.999 13263.050 -1.303 0.19942
## opponentMarlins -19335.738 13901.010 -1.391 0.17123
## opponentMets -4492.426 6422.733 -0.699 0.48795
## opponentNationals -6543.709 13941.956 -0.469 0.64113
## opponentPadres -11480.107 11321.487 -1.014 0.31612
## opponentPhillies -13922.323 12598.965 -1.105 0.27515
## opponentPirates -12835.956 13339.190 -0.962 0.34117
## opponentReds -16654.400 11914.662 -1.398 0.16918
## opponentRockies -17348.587 12957.532 -1.339 0.18749
## opponentSnakes -20268.767 12731.376 -1.592 0.11854
## opponentWhite Sox -971.859 5734.643 -0.169 0.86620
## temp 9.083 245.933 0.037 0.97071
## skiesCloudy -108.115 2387.211 -0.045 0.96408
## day_nightNight -3520.446 3582.791 -0.983 0.33118
## capYES -6534.522 5873.332 -1.113 0.27193
## shirtYES 987.693 4600.000 0.215 0.83098
## fireworksYES 20183.543 8355.164 2.416 0.01992 *
## bobbleheadYES 9349.214 3198.320 2.923 0.00545 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5979 on 44 degrees of freedom
## Multiple R-squared: 0.7144, Adjusted R-squared: 0.4808
## F-statistic: 3.057 on 36 and 44 DF, p-value: 0.0002493
summary(lmod)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48182.033935 18476.3401 2.607769374 0.012398944
## monthMAY 4292.167885 6171.2382 0.695511624 0.490392013
## monthJUN -3401.847066 12098.4853 -0.281179583 0.779890752
## monthJUL 4195.667129 6228.5412 0.673619550 0.504076578
## monthAUG 8003.271072 7716.6813 1.037138993 0.305337437
## monthSEP 3505.864112 7740.3055 0.452936139 0.652820395
## monthOCT 5434.107798 9589.4860 0.566673523 0.573815500
## day 133.637178 141.2815 0.945892674 0.349369137
## day_of_weekTuesday 8674.106877 2943.9967 2.946371108 0.005125012
## day_of_weekWednesday 13.236429 2765.4217 0.004786405 0.996202651
## day_of_weekThursday 525.394146 3906.8531 0.134480137 0.893636194
## day_of_weekFriday -18253.565455 9216.7107 -1.980485878 0.053922611
## day_of_weekSaturday 3688.890647 3341.2573 1.104042691 0.275575978
## day_of_weekSunday 830.086517 4819.1101 0.172248920 0.864031688
## opponentAstros -20871.308931 14215.1407 -1.468244980 0.149150782
## opponentBraves -18977.726425 13742.7342 -1.380928003 0.174275238
## opponentBrewers -22827.232386 15163.6520 -1.505391467 0.139369094
## opponentCardinals -12986.019767 13157.1999 -0.986989622 0.329044369
## opponentCubs -10795.005358 12633.8985 -0.854447690 0.397485404
## opponentGiants -17278.999097 13263.0501 -1.302792265 0.199423082
## opponentMarlins -19335.738034 13901.0102 -1.390959196 0.171232313
## opponentMets -4492.425611 6422.7329 -0.699457022 0.487947926
## opponentNationals -6543.709408 13941.9555 -0.469353771 0.641134385
## opponentPadres -11480.107122 11321.4870 -1.014010540 0.316121473
## opponentPhillies -13922.322838 12598.9648 -1.105037044 0.275149635
## opponentPirates -12835.956194 13339.1897 -0.962274054 0.341170728
## opponentReds -16654.399987 11914.6622 -1.397807145 0.169178758
## opponentRockies -17348.587246 12957.5322 -1.338880509 0.187486118
## opponentSnakes -20268.767405 12731.3763 -1.592032696 0.118537132
## opponentWhite Sox -971.858538 5734.6435 -0.169471483 0.866202566
## temp 9.083097 245.9329 0.036933233 0.970705300
## skiesCloudy -108.115090 2387.2108 -0.045289294 0.964081646
## day_nightNight -3520.445505 3582.7913 -0.982598553 0.331177425
## capYES -6534.522388 5873.3315 -1.112575094 0.271932734
## shirtYES 987.692712 4599.9995 0.214715828 0.830980954
## fireworksYES 20183.542503 8355.1641 2.415696727 0.019924658
## bobbleheadYES 9349.214089 3198.3204 2.923163724 0.005454829
This doesnt seem to be a good model because when we look at the adjusted R square value of the model it says that only 48% of the variance in the attendance can be predicted by the given explanatory variables.
Now, we are comparing the linear regression model and the intercept only model : which is also called the mean model. This comparision is used to test the null hypothesis for regression analysis.
small <- update(lmod, . ~ 1 )
anova(small, lmod)
Null hypothesis : The linear regression model and the mean model of the variable are the same. if these models are same then the variable doesnt affect attendance.
P-value is small <= 0.05. We conclude that at least one of variables on the right has some relation to attendance and reject null hypothesis.
Now we check that if bobblehead,day_of_weekTuesday and fireworks are statiscally affecting the attendance.
small <- update(lmod, . ~ . - bobblehead)
anova(small, lmod)
drop1(lmod, test="F")
The above results show that: 1. Bobblehead is most significant with the least p value and gives almost 100% confidence. Null hyp rejected. 2. Day of week is next most important as it gives 99% confidence . Null hyp rejected. 3. Fireworks is third most important as it gives 99% confidence. Null hyp rejected. 4. All other variables have high p vales. Null hyp accepted.
AIC analysis: 1. day_of_week is most significant as it will reduce future overestimation 2. bobbehead is second most important 3. Fireworks is the third most important.
All explanatory variables are not required because they are not significant for predicting attendance.
opt_drop1 <- lmod %>% step(direction='backward')
## Start: AIC=1433.33
## attend ~ month + day + day_of_week + opponent + temp + skies +
## day_night + cap + shirt + fireworks + bobblehead
##
## Df Sum of Sq RSS AIC
## - opponent 16 482404341 2055399107 1423.0
## - month 6 107253092 1680247858 1426.7
## - temp 1 48765 1573043531 1431.3
## - skies 1 73327 1573068093 1431.3
## - shirt 1 1648173 1574642938 1431.4
## - day 1 31985882 1604980647 1433.0
## - day_night 1 34516507 1607511273 1433.1
## <none> 1572994766 1433.3
## - cap 1 44252037 1617246803 1433.6
## - fireworks 1 208621672 1781616438 1441.4
## - bobblehead 1 305478664 1878473429 1445.7
## - day_of_week 6 622226954 2195221720 1448.3
##
## Step: AIC=1422.99
## attend ~ month + day + day_of_week + temp + skies + day_night +
## cap + shirt + fireworks + bobblehead
##
## Df Sum of Sq RSS AIC
## - day_night 1 1939503 2057338610 1421.1
## - skies 1 3025454 2058424560 1421.1
## - temp 1 18990413 2074389520 1421.7
## - day 1 28387321 2083786427 1422.1
## <none> 2055399107 1423.0
## - cap 1 60563235 2115962342 1423.3
## - shirt 1 83891144 2139290251 1424.2
## - month 6 484099010 2539498117 1428.1
## - fireworks 1 197806771 2253205878 1428.4
## - day_of_week 6 663720865 2719119972 1433.7
## - bobblehead 1 574485229 2629884336 1441.0
##
## Step: AIC=1421.07
## attend ~ month + day + day_of_week + temp + skies + cap + shirt +
## fireworks + bobblehead
##
## Df Sum of Sq RSS AIC
## - skies 1 3091785 2060430395 1419.2
## - temp 1 24263844 2081602454 1420.0
## - day 1 28120244 2085458854 1420.2
## <none> 2057338610 1421.1
## - cap 1 61589936 2118928546 1421.5
## - shirt 1 82681702 2140020311 1422.3
## - fireworks 1 196113155 2253451764 1426.4
## - month 6 540707943 2598046553 1428.0
## - day_of_week 6 733147115 2790485724 1433.8
## - bobblehead 1 627860542 2685199152 1440.6
##
## Step: AIC=1419.19
## attend ~ month + day + day_of_week + temp + cap + shirt + fireworks +
## bobblehead
##
## Df Sum of Sq RSS AIC
## - day 1 30394696 2090825091 1418.4
## - temp 1 35300686 2095731080 1418.6
## <none> 2060430395 1419.2
## - cap 1 69611450 2130041845 1419.9
## - shirt 1 89066301 2149496696 1420.6
## - fireworks 1 212995729 2273426124 1425.2
## - month 6 557853596 2618283991 1426.6
## - day_of_week 6 742897542 2803327936 1432.1
## - bobblehead 1 625482393 2685912788 1438.7
##
## Step: AIC=1418.38
## attend ~ month + day_of_week + temp + cap + shirt + fireworks +
## bobblehead
##
## Df Sum of Sq RSS AIC
## - temp 1 21642127 2112467217 1417.2
## <none> 2090825091 1418.4
## - cap 1 54908713 2145733804 1418.5
## - shirt 1 90079538 2180904629 1419.8
## - fireworks 1 231720432 2322545523 1424.9
## - month 6 541557658 2632382748 1425.0
## - day_of_week 6 765145220 2855970311 1431.6
## - bobblehead 1 615401407 2706226498 1437.3
##
## Step: AIC=1417.21
## attend ~ month + day_of_week + cap + shirt + fireworks + bobblehead
##
## Df Sum of Sq RSS AIC
## <none> 2112467217 1417.2
## - cap 1 55325426 2167792643 1417.3
## - shirt 1 88943911 2201411129 1418.5
## - fireworks 1 223427284 2335894502 1423.3
## - month 6 549143913 2661611130 1423.9
## - day_of_week 6 800793351 2913260569 1431.2
## - bobblehead 1 686366179 2798833397 1438.0
AIC analysis: 1. day_of_week is most significant as it will reduce future overestimation 2. bobbehead is second most important 3. Fireworks is the third most important.
This tests are done by taking each variable out from the model and then comparing the prediction results with the all variables model
set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)
lmodnew <- update(lmod, . ~ . )
lmod_interaction <- update(lmod, . ~ . - bobblehead)
for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]
# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()
# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
## Warning in predict.lm(lmod_train, newdata = test): prediction from a rank-
## deficient fit may be misleading
## Warning in predict.lm(lmod_train, newdata = test): prediction from a rank-
## deficient fit may be misleading
## Warning in predict.lm(lmod_train, newdata = test): prediction from a rank-
## deficient fit may be misleading
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
## lmod lmod_interaction dif_rmse
## 9995.594 11044.493 -1048.899
options(warn=-1)
Rule for Analysis: If the Diff RMSE value is negative, it means that Variable increases prediction accuracy. If positive , it means that variable decreaes prediction accuracy. If the diff rmse value is really less either positive or negative, it means that including or excluding the variable doesnt effect the model very much.
RMSE value when bobblehead variable is taken out is more as compared to the model where we included Bobblehead. It shows that bobblehead is significant.
This tests are done by taking each variable out from the model and then comparing the prediction results with the all variables model
set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)
lmodnew <- update(lmod, . ~ .)
lmod_interaction <- update(lmod, . ~ . - day_of_week)
for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]
# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()
# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
## lmod lmod_interaction dif_rmse
## 9995.5940 9674.9562 320.6378
options(warn=-1)
RMSE value when day_of_week variable is taken out is more as compared to the model where we included Bobblehead. It shows that day_of_week is significant.
This tests are done by taking each variable out from the model and then comparing the prediction results with the all variables mode
set.seed(461)
options(warn=-1)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)
lmodnew <- update(lmod, . ~ . - month)
lmod_interaction <- update(lmod, . ~ . - month - fireworks)
for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]
# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()
# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
## lmod lmod_interaction dif_rmse
## 9046.7892 9355.4778 -308.6886
RMSE value when fireowrks variable is taken out is more as compared to the model where we included Bobblehead. It shows that fireworks is significant. The highest RMSE is when we removed the bobblehead variable. A low RMSE means the model is a better fit. Hence the model fits better when we include bobblehead in prediction.
temp <- subset(d2, bobblehead=="YES")
newdata <- data.frame(temp)
result <- predict(lmod, newdata=newdata, level=0.80,
interval = "prediction")
colMeans(result)
## fit lwr upr
## 53144.64 43602.29 62686.98
confint(lmod, level = 0.80)["bobbleheadYES",]
## 10 % 90 %
## 5187.911 13510.517
The additional number of fans only due to bobblehead promotion is 9345. The 80% confidence interval is computed above.
Yes, bobblehead is the most significant predictor of attendance.
Now we look at which variables would need nonlinear transformations. We can guess this from looking at the data visualizations of the variables.
d2 %>%
ggplot(aes( temp, attend, group=1)) +
geom_point() +
scale_y_continuous(labels=scales::comma) +
geom_smooth(se=FALSE, method="loess")
## `geom_smooth()` using formula 'y ~ x'
The temp variable seems like can be modeled by a non linear transformation because of the non linear curve it represents.
We will now see how the interaction of two variable terms can modify our model accuray :
set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)
lmodnew <- update(lmod, . ~ . )
lmod_interaction <- update(lmod, . ~ . + temp: day_night )
for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]
# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()
# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
## lmod lmod_interaction dif_rmse
## 9995.5940 9604.4459 391.1481
options(warn=-1)
Analysis Rule: As we are adding a new variable term hence now if the diff rmse value is more positive it means that the term is significant.
I chose temperature and day_night as naturally people would like to watch a match in a cosy temperature or in day time or night. As temperature and day night are closely linked, generally higher temperature in days and lower in nights. Here we try to see if we can predict attendance by looking at their relation
Temp and day_night variable interaction reduces the RMSE value and increases diff RMSE. Hence this interaction improves the model.
set.seed(500)
nfolds <- 5
folds <- rep(seq(5), nrow(d2), len=nrow(d2)) %>% sample()
rmse_lmod <- rep(NA, nfolds)
rmse_lmod_interaction <- rep(NA, nfolds)
lmodnew <- update(lmod, . ~ . )
lmod_interaction <- update(lmod, . ~ . + temp:skies)
for (i in seq(nfolds)){
train <- d2[folds!=i,]
test <- d2[folds==i,]
# train lm without interaction model
lmod_train <- update(lmodnew, data = train)
lmod_test <- predict(lmod_train, newdata = test)
rmse_lmod[i] <- (test$attend - lmod_test)^2 %>% mean() %>% sqrt()
# train lm with interaction model
lmod_interaction_train <- update(lmod_interaction, data = train)
lmod_interaction_test <- suppressWarnings(predict(lmod_interaction_train, newdata = test))
rmse_lmod_interaction[i] <- (test$attend - lmod_interaction_test)^2 %>% mean() %>% sqrt()
}
cv <- tibble(lmod = rmse_lmod, lmod_interaction = rmse_lmod_interaction) %>%
mutate(dif_rmse = rmse_lmod - rmse_lmod_interaction)
cv %>%
apply(2,mean)
## lmod lmod_interaction dif_rmse
## 9995.59396 10029.84337 -34.24942
options(warn=-1)
I chose temperature and skies as naturally people would like to watch a match in a cosy temperature plus if there are clouds or not. As temperature and skies are not so much closely linked, generally higher temperature in clear skies and lower in clouded ones. Here we try to see if we can predict attendance by looking at their relation
Temp and day_night variable interaction increases the RMSE value by very small amount. Hence this term is not significant.
Miller, T. W. 2014. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. FT Press Analytics. Pearson Education. https://books.google.com.tr/books?id=PU6nBAAAQBAJ.