As my research moves into territory with a more statistical flavor, I would like to share a powerful tool that I learned about by chance.

Motivation

To study a complicated system, the data-driven approach focuses on the data. By analyzing the data, scientists try to make observations, discover patterns and regularities, draw conclusions, and so on. Then, scientists want to use these findings to make predictions. This approach is different from physics-based methods, where people try to deduce or simulate the system's behavior from clearly defined principles.

A common way of looking at a problem is to reformulate the complex system as a black box with input and output.
\{x_1,x_2,\dots,x_p \} \longrightarrow y
Here, \{x_1,x_2,\dots,x_p\} are p parameters (all scalar) as input, and y is the scalar (univariate) output.

A couple of examples make this concrete:

  • Cancer detection in a population. Input: measured physiological indicators, such as blood sugar, specific protein levels, and body weight. Output: cancer volume.
  • Fuel efficiency study of different cars. Input: (Acceleration, Displacement, Horsepower, Car weight). Output: MPG (miles per gallon).

Generally speaking, people need to think up and construct possible input parameters (sometimes called “predictors”) that are likely relevant to the output of interest (e.g. smoking history might correlate with cancer level). Designing such a system is a major effort when we study very complicated phenomena. For example, in sociology, when someone starts to think about the possible factors affecting human well-being, he/she may have to design something like the “Human Development Index” as an input parameter.

Now the situation is: we know there are some factors possibly influencing the outcome of interest, and we have the observed data in hand. We hope there are tools that can help us:

  • To build a model relating the input parameters to the output.
  • To understand the relative importance of different parameters.
  • To filter out parameters that are less relevant or redundant.

The Lasso method

Surprisingly, the Lasso (least absolute shrinkage and selection operator) method meets all these needs in a single run, and it is simple and elegant. The definition of Lasso is (partly taken from the MATLAB documentation page about Lasso):

\text{min}_{\beta_0,\beta}\left(\frac{1}{2N}\sum_{i=1}^N\left(y_i-\beta_0-x_i^T{\beta}\right)^2+\lambda\sum_{j=1}^p|\beta_j|\right)

where,

  • N is the number of observations.
  • y_i is the response at observation i.
  • x_i is the data, a vector of p values at observation i.
  • λ is a positive regularization parameter.
  • The parameters β_0 and β are a scalar and a p-vector, respectively.

Obviously, the first term in the Lasso definition corresponds to the usual linear (least-squares) regression. What makes it different is the regularization or penalty term. The purpose of this term is to address the issue of “overfitting”, where the model is fitted to the noise rather than to meaningful information, leading to decreased predictive ability when applied to new data.

Here, the penalty is the 1-norm of the regression coefficient vector β. Equivalently, the idea is that you want to minimize the squared error under the constraint that you only have a limited budget to allocate among the entries of β. Therefore, the minimization automatically spends that budget on the most important input parameters. What is special about the 1-norm compared with the 2-norm (ridge regression in that case) is that the 1-norm penalty drives β coefficients to exactly zero once λ is large enough. This feature makes Lasso regression very powerful for parameter selection.
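For reference, this “budget” picture corresponds to the standard equivalent constrained form of the same problem, where each λ maps to some budget t:

\text{min}_{\beta_0,\beta}\frac{1}{2N}\sum_{i=1}^N\left(y_i-\beta_0-x_i^T\beta\right)^2\quad\text{subject to}\quad\sum_{j=1}^p|\beta_j|\le t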

In practice, the Lasso algorithm gives a solution path as λ varies, namely \beta_0(\lambda),\beta(\lambda) as functions of λ. As λ increases, the number of nonzero components of β decreases. As a result, the Lasso algorithm produces a plot similar to the following figure. Here, we are using the fuel efficiency study as an example.

Lasso trace plot example.

The x-axis shows λ on a log scale, and the y-axis shows the β coefficients for each λ value. As λ approaches zero (right end of the plot), the coefficients approach those of ordinary linear regression. As we can see, as λ decreases (from left to right), an increasing number of parameters are selected (more nonzero coefficients). Here, the order in which the parameters are selected naturally reflects their rank of importance (high to low) with respect to the MPG value:

  1. car weight
  2. horsepower
  3. engine displacement
  4. acceleration
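A minimal MATLAB sketch that produces this kind of trace plot, assuming the carbig example dataset that ships with the Statistics and Machine Learning Toolbox:

    % Compute the Lasso solution path for the fuel-efficiency data
    load carbig                                   % provides Acceleration, Displacement, Horsepower, Weight, MPG
    X = [Acceleration Displacement Horsepower Weight];
    y = MPG;
    ok = all(~isnan([X y]), 2);                   % keep only complete observations
    X = X(ok,:);  y = y(ok);
    [B, FitInfo] = lasso(X, y);                   % B is p-by-NumLambda: one column of coefficients per lambda
    lassoPlot(B, FitInfo, 'PlotType', 'Lambda', 'XScale', 'log');   % coefficient traces versus lambda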

We may also want to know what λ value should be used when we apply the model to new data. To address this, the cross-validation technique comes into play. The green vertical line in the plot marks the λ value with the smallest cross-validation error. The blue vertical line marks the λ value chosen by the one-standard-error rule (the largest λ whose cross-validation error is within one standard error of that minimum). These two values give us an idea of which model (or which λ) we should use in prediction tasks. This page gives a nice introduction to cross-validation in Lasso regression.
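Continuing the sketch above, the lasso function can run the cross-validation itself; the ten folds below are just an assumed choice:

    % Choose lambda by 10-fold cross-validation
    [B, FitInfo] = lasso(X, y, 'CV', 10);
    lamMin = FitInfo.LambdaMinMSE;                % lambda with the smallest cross-validation error (green line)
    lam1SE = FitInfo.Lambda1SE;                   % one-standard-error choice (blue line)
    coef   = B(:, FitInfo.Index1SE);              % coefficients of the more parsimonious 1-SE model
    selected = find(coef ~= 0)                    % which predictors survive at that lambda
    yhat = X*coef + FitInfo.Intercept(FitInfo.Index1SE);   % predictions from the selected model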

Important Remarks

1. How does the Lasso result vary when we re-scale the data?

It is necessary to standardize the input/output data (zero mean, unit standard deviation) before applying Lasso regression. In this way, the result of Lasso will not change if one re-scales the data, e.g. switches the unit system. This is the default behavior of the lasso function in MATLAB.
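A small sketch of this invariance, reusing X and y from the earlier sketch and an assumed pounds-to-kilograms conversion of the weight column; the set of selected predictors should not change:

    % 'Standardize' is true by default, so a unit change only rescales the fitted coefficient
    X_kg = X;
    X_kg(:,4) = X_kg(:,4) * 0.4536;               % weight column (column 4 here) in kg instead of lbs
    B1 = lasso(X,    y, 'Lambda', 0.1);           % an arbitrary fixed lambda for comparison
    B2 = lasso(X_kg, y, 'Lambda', 0.1);
    isequal(find(B1 ~= 0), find(B2 ~= 0))         % same predictors selected in both unit systems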

2. Why/when do we use linear regression? How do we deal with nonlinearity?

It turns out that linear regression is far more useful and practical than nonlinear regression. A linear system is easy to compute, and the resulting coefficients are relatively easy to interpret.

When the system involves nonlinear effects, one can add nonlinear terms as input parameters, such as horsepower squared or cubed. Or, if one thinks about the caster angle of a motorcycle, it might be helpful to try both the angle itself and its sine/cosine. Or, one can design whatever combinations might be important to the problem. In this way, we can carry out regression studies with nonlinear effects included in a controlled manner, as sketched below.
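A sketch of this feature-engineering step, again reusing the fuel-efficiency X and y; the extra terms here (horsepower squared and cubed, and a displacement-times-weight interaction) are only illustrative choices:

    % Augment the design matrix with hand-crafted nonlinear terms, then let Lasso decide what to keep
    Xaug = [X, X(:,3).^2, X(:,3).^3, X(:,2).*X(:,4)];
    [Baug, FitAug] = lasso(Xaug, y, 'CV', 10);
    Baug(:, FitAug.Index1SE)                      % nonzero entries show which original and engineered terms survive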

3. How about neural network techniques?

Neural network techniques assume a full level of nonlinearity. From my observation, the limitations are:

  • We can hardly draw conclusions from the study after the neural network is trained; we can only use it as a black box. If we face a different situation and the difference is hard to represent as a few parameters beforehand, we need to retrain the neural network.
  • We need lots of data samples for neural network methods to work, and the rules of thumb are extremely obscure. In one example I tried, there were 8 input parameters, and I had to expect about 100 observations for the network to work properly. However, after selecting the parameters from a Lasso study, 42 observations gave a great result (R > 0.9), as opposed to the neural network (R ~ 0.5).

4. Other variants of Lasso regression

When we want to “group” different parameters together, we can use the “grouped Lasso” method or “elastic net” regression. In both cases, some 2-norm terms are blended into the penalty. In this way, grouped/correlated parameters tend to become nonzero (selected) at the same λ value rather than being treated separately, as in the pure 1-norm penalty case. The plot below shows an elastic net regression result. We can see that the coefficients of horsepower, displacement, and weight reach zero at about the same λ value, as they are correlated with each other. This can be understood if we think of a model where x_1 + x_2 = y = 2 (so x_1 and x_2 are linearly dependent): minimizing the 2-norm of x gives x_1=1, x_2=1 instead of x_1=2, x_2=0, since the former has a lower 2-norm but the same 1-norm as the latter. This method is useful when the input parameters have a logically grouped structure.

Solution trace given by Elastic Net Regression
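In MATLAB, the same lasso function covers elastic net through its 'Alpha' parameter (Alpha = 1 is pure Lasso, smaller values blend in the 2-norm); the value 0.5 below is just an assumed middle ground:

    % Elastic net: mix the 1-norm and 2-norm penalties via 'Alpha'
    [Ben, FitEN] = lasso(X, y, 'Alpha', 0.5, 'CV', 10);
    lassoPlot(Ben, FitEN, 'PlotType', 'Lambda', 'XScale', 'log');   % correlated predictors now tend to enter and leave together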

Another related method, stepwise regression, is built on the idea of “automatically choosing parameters, step by step, to gradually fit the data”. Interestingly, the solution path given by Lasso naturally resembles such a stepwise procedure: as λ decreases, coefficients are added bit by bit, gradually fitting the data.

Conclusion

The Lasso algorithm is a very simple and powerful way to do regression studies. Combined with cross-validation, it solves the regression problem, the parameter-selection problem, and the importance-ranking problem simultaneously. As a result, this method really deserves attention.
