Feature Scaling

This is an explanation for the question:
  • Should we also scale the predicted variable (usually denoted y) when the predictor variables (usually denoted Xs) are scaled?
To speed up gradient descent, we can sometimes apply feature scaling, or in statistical terms the z-score: \( \frac{x - \bar{x}}{sd} \). Specifically, the question is: do we need to scale y too when we have scaled the Xs to estimate the thetas (called coefficients in linear regression)?
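As a quick illustrative sketch (my own addition, not part of the original walkthrough), the z-score can be computed by hand and checked against R's built-in scale():

x <- c(2104, 1600, 2400, 1416, 3000)   # e.g. a few square-footage values
z_manual  <- (x - mean(x)) / sd(x)     # (x - x_bar) / sd
z_builtin <- as.numeric(scale(x))      # scale() centres and divides by sd by default
all.equal(z_manual, z_builtin)         # TRUE: both give the same z-scores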
Load an example data set and separate the predicted variable (y) from the features (the X matrix).
rm(list = ls())
dat <- read.table("ex1data2.txt", header = FALSE, sep = ",")
X <- dat[, 1:2]
y <- dat[, 3]


# show the first 6 lines of the data
head(dat)
## V1 V2 V3
## 1 2104 3 399900
## 2 1600 3 329900
## 3 2400 3 369000
## 4 1416 2 232000
## 5 3000 4 539900
## 6 1985 4 299900
First, let us try estimating the thetas without scaling any variables. Here I use R's lm() to perform the linear regression.
mod1 <- lm(V3 ~ V1 + V2, data = dat)
coef(mod1) # thetas
## (Intercept) V1 V2 
## 89597.9 139.2 -8738.0
Also print out the predicted price when the square footage of the house is 1650 (V1) and the number of floors is 3 (V2).
predict(mod1, newdata = data.frame(V1 = 1650, V2 = 3))
## 1 
## 293081
# 293081
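As a quick sanity check (my addition, not in the original post), this prediction is just the linear combination \( \theta_0 + \theta_1 \times 1650 + \theta_2 \times 3 \):

sum(coef(mod1) * c(1, 1650, 3))   # ~ 293081, same as predict()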
So the predicted price is $293,081. Now let's try scaling the Xs only.
# ------------------------------------------
# Feature scaling and mean normalisation
# ------------------------------------------
dat2 <- as.data.frame(cbind(scale(dat[, 1:2]), dat[, 3]))

# y is in original form and Xs are scaled
head(dat2)
## V1 V2 V3
## 1 0.13001 -0.2237 399900
## 2 -0.50419 -0.2237 329900
## 3 0.50248 -0.2237 369000
## 4 -0.73572 -1.5378 232000
## 5 1.25748 1.0904 539900
## 6 -0.01973 1.0904 299900

# Build the linear model
mod2 <- lm(V3 ~ V1 + V2, data = dat2)
coef(mod2) # theta
## (Intercept) V1 V2 
## 340413 110631 -6649
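A side note I am adding here (not in the original post): the change in magnitude of the thetas is expected, because the theta on a scaled feature is the unscaled theta times that feature's standard deviation, and the intercept of the scaled model is simply the mean of y (the prediction at the mean Xs):

coef(mod1)["V1"] * sd(dat$V1)   # ~ 110631, the V1 theta from mod2
coef(mod1)["V2"] * sd(dat$V2)   # ~ -6649,  the V2 theta from mod2
mean(dat$V3)                    # ~ 340413, the intercept from mod2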
Do the prediction using mod2 (Xs are scaled, but not y). The new Xs are supplied in their original, unscaled form.
predict(mod2, newdata = data.frame(V1 = 1650, V2 = 3))
## 1 
## 182861697
# 182861697
Hmm, the number is not the same as before. Clearly, this procedure is incorrect. Let's try scaling the new Xs (1650 and 3) and using them as the inputs.
# Scale and normalise the new inputs using the training set means and sds
V1 <- (1650 - colMeans(X)[1])/apply(X, 2, sd)[1]
V2 <- (3 - colMeans(X)[2])/apply(X, 2, sd)[2]
predict(mod2, newdata = data.frame(V1 = V1, V2 = V2))
## V1 
## 293081
# 293081.5
The answer is that it is OK not to scale the y variable when using the training set to estimate the thetas. However, if the thetas were estimated from scaled training-set Xs, the test-set Xs must be scaled with the same training-set means and standard deviations before predicting y. Otherwise, the predicted y won't be right.
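One way to make this harder to get wrong (a sketch of my own; the helper name scale_like_training is not from the post) is to store the training means and standard deviations and reuse them whenever new inputs are scaled:

X_means <- colMeans(X)       # training-set means of the features
X_sds   <- apply(X, 2, sd)   # training-set standard deviations

scale_like_training <- function(newX) {
  # apply the *training* centre and spread to the new observations
  as.data.frame(sweep(sweep(newX, 2, X_means, "-"), 2, X_sds, "/"))
}

predict(mod2, newdata = scale_like_training(data.frame(V1 = 1650, V2 = 3)))
# ~ 293081.5, matching the unscaled model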
Below is a note on using the normal equation to estimate the thetas.

Normal equation

X <- dat[, 1:2]
y <- dat[, 3]
X <- as.matrix(cbind(rep(1, nrow(X)), X))
colnames(X) <- paste("theta", 0:2, sep = "")
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## theta0 89597.9
## theta1 139.2
## theta2 -8738.0
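As a sanity check (my own addition), the normal-equation thetas should match lm()'s coefficients on the unscaled data, up to floating-point error:

theta_ne <- solve(t(X) %*% X) %*% t(X) %*% y
all.equal(as.numeric(theta_ne), as.numeric(coef(mod1)))   # TRUE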
