Feature Scaling

This is an explanation for the question:
  • Should we also scale the predicted variable (usually denoted y) when the predictor variables (usually denoted Xs) are scaled?
To speed up gradient descent, we can sometimes apply feature scaling, or in statistical terms the z-score: \( \frac{x - \bar{x}}{sd} \). Specifically, the question is: do we need to scale y too when we have scaled the Xs to estimate the thetas (called coefficients in linear regression)?
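As a quick illustrative sketch (my own addition, not part of the original walkthrough), the z-score can be computed by hand and checked against R's built-in scale():

x <- c(2104, 1600, 2400, 1416, 3000)   # e.g. a few square-footage values
z_manual  <- (x - mean(x)) / sd(x)     # (x - x_bar) / sd
z_builtin <- as.numeric(scale(x))      # scale() centres and divides by sd by default
all.equal(z_manual, z_builtin)         # TRUE: both give the same z-scores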
Load an example data set and separate the predicted variable (y) from the features (the X matrix).
rm(list = ls())
dat <- read.table("ex1data2.txt", header = FALSE, sep = ",")
X <- dat[, 1:2]
y <- dat[, 3]


# show the first 6 lines of the data
head(dat)
## V1 V2 V3
## 1 2104 3 399900
## 2 1600 3 329900
## 3 2400 3 369000
## 4 1416 2 232000
## 5 3000 4 539900
## 6 1985 4 299900
First, let us try estimating the thetas without scaling any variables. Here I use R's lm() to perform the linear regression.
mod1 <- lm(V3 ~ V1 + V2, data = dat)
coef(mod1) # thetas
## (Intercept) V1 V2 
## 89597.9 139.2 -8738.0
Also print out the predicted price when the square footage of the house is 1650 (V1) and the number of floors is 3 (V2).
predict(mod1, newdata = data.frame(V1 = 1650, V2 = 3))
## 1 
## 293081
# 293081
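As a quick sanity check (my addition, not in the original post), this prediction is just the linear combination \( \theta_0 + \theta_1 \times 1650 + \theta_2 \times 3 \):

sum(coef(mod1) * c(1, 1650, 3))   # ~ 293081, same as predict()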
So the predicted price is $293,081. Now let's try scaling the Xs only.
# ------------------------------------------
# Feature scaling and mean normalisation
# ------------------------------------------
dat2 <- as.data.frame(cbind(scale(dat[, 1:2]), dat[, 3]))

# y is in original form and Xs are scaled
head(dat2)
## V1 V2 V3
## 1 0.13001 -0.2237 399900
## 2 -0.50419 -0.2237 329900
## 3 0.50248 -0.2237 369000
## 4 -0.73572 -1.5378 232000
## 5 1.25748 1.0904 539900
## 6 -0.01973 1.0904 299900

# Build the linear model
mod2 <- lm(V3 ~ V1 + V2, data = dat2)
coef(mod2) # theta
## (Intercept) V1 V2 
## 340413 110631 -6649
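A side note I am adding here (not in the original post): the change in magnitude of the thetas is expected, because the theta on a scaled feature is the unscaled theta times that feature's standard deviation, and the intercept of the scaled model is simply the mean of y (the prediction at the mean Xs):

coef(mod1)["V1"] * sd(dat$V1)   # ~ 110631, the V1 theta from mod2
coef(mod1)["V2"] * sd(dat$V2)   # ~ -6649,  the V2 theta from mod2
mean(dat$V3)                    # ~ 340413, the intercept from mod2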
Do the prediction using mod2 (Xs are scaled, but not y). The new Xs are supplied in their original, unscaled form.
predict(mod2, newdata = data.frame(V1 = 1650, V2 = 3))
## 1 
## 182861697
# 182861697
Hmm, the number is not the same as before. Clearly, this procedure is incorrect. Let's try scaling the new Xs (1650 and 3) and using them as the inputs.
# Scale and normalise the new inputs using the training set means and sds
V1 <- (1650 - colMeans(X)[1])/apply(X, 2, sd)[1]
V2 <- (3 - colMeans(X)[2])/apply(X, 2, sd)[2]
predict(mod2, newdata = data.frame(V1 = V1, V2 = V2))
## V1 
## 293081
# 293081.5
The answer is that it is OK not to scale the y variable when using the training set to estimate the thetas. However, if the thetas were estimated from scaled training-set Xs, the test-set Xs must be scaled with the same training-set means and standard deviations before predicting y. Otherwise, the predicted y won't be right.
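One way to make this harder to get wrong (a sketch of my own; the helper name scale_like_training is not from the post) is to store the training means and standard deviations and reuse them whenever new inputs are scaled:

X_means <- colMeans(X)       # training-set means of the features
X_sds   <- apply(X, 2, sd)   # training-set standard deviations

scale_like_training <- function(newX) {
  # apply the *training* centre and spread to the new observations
  as.data.frame(sweep(sweep(newX, 2, X_means, "-"), 2, X_sds, "/"))
}

predict(mod2, newdata = scale_like_training(data.frame(V1 = 1650, V2 = 3)))
# ~ 293081.5, matching the unscaled model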
Below is a note on using the normal equation to estimate the thetas.

Normal equation

X <- dat[, 1:2]
y <- dat[, 3]
X <- as.matrix(cbind(rep(1, nrow(X)), X))
colnames(X) <- paste("theta", 0:2, sep = "")
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## theta0 89597.9
## theta1 139.2
## theta2 -8738.0
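As a sanity check (my own addition), the normal-equation thetas should match lm()'s coefficients on the unscaled data, up to floating-point error:

theta_ne <- solve(t(X) %*% X) %*% t(X) %*% y
all.equal(as.numeric(theta_ne), as.numeric(coef(mod1)))   # TRUE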
