Lasso算法

佇統計學佮機器學習內底，Lasso 算法（英語：least absolute shrinkage and selection operator，閣譯上細絕對值的收斂佮選擇算子、套索算法）是一種同時進行特徵選擇佮正則化（數學）的迴歸分析方法，旨咧增強統計模型的預測準確性佮可解說性，上蓋起初由史丹福大學統計學教授 Robert Tibshirani 佇咧一九九六年因為 Leo Breiman 的非負參數推斷 ( Nonnegative Garrote , NNG ) 提出。Lasso 算法上原仔用佇計算上細平方法模型，這个簡單的算法公示矣足濟估計量的重要性質，如估計量佮嶺迴歸（Ridge regression，嘛叫吉洪諾夫正則化）佮最佳子集選擇的關係，Lasso 係數估計值 ( estimate ) 佮軟更值（soft thresholding）之間的聯絡。伊嘛搤出做協變數共線的時，Lasso 緊數估計值無一定唯一（類似標準線性迴歸）。

雖然上早是為著應用上小平方法來定義的算法，lasso 正則化會當簡單直接共湠開應用佇真濟統計學模型上，包括廣義線性模型，廣義估計方程式，做比例災難模型佮 M-估計。Lasso 選擇子集的形式依賴佇咧限制條件的形式並且有足濟種表現形式，包括幾何學，貝氏統計，佮噗分析。

Lasso 算法和基追蹤降噪音纏牢牢。

歷史的來源

Robert Tibshirani 頭仔就使用 Lasso 來提懸預測的準確性佮迴歸模型的通解說性，伊修改模型擬合的過程，佇協變數當中干焦選一个子集應用到最終模型中，非用上全部協變數。這是有的著相仝目的，但是方法有無仝款的 Breiman 的非負參數推斷。

佇咧 Lasso 進前，選擇模型中協變數上捷用的方法是移步選擇，這種方法佇咧某一寡情況下是有影，譬如講一寡協變數佮模型輸出值有強相關性的情況。毋過佇咧另外一寡狀況之下，這種方法會予預測結果閣較䆀。佇咧彼陣，嶺迴歸是提懸模型預測準確性上捷用的方法。嶺迴歸會當通過縮細大的迴歸係數來減少過擬合對而且改善模型預測偏差。但是伊並無選擇協變數，所以對模型的準確構建和解說無幫助。

Lasso 敆做伙講的兩種方法，伊通過強制予迴歸係數絕對值之佮小於某固定值，強制一寡回歸的係數變做零，有效地選擇矣無包括遮的回歸係數對應的協變數的閣較簡單的模型。這種方法佮嶺迴歸類似，佇嶺迴歸中間，迴歸係數平方佮被強制小於某定值，無仝款點佇嶺迴歸干焦改變係數的值，毋過共任何值設做零。

基本的形式

Lasso 上蓋起初為著上蓋細方法被設計出來，Lasso 的上細平方法應用會當簡單明矣地展示 Lasso 的真濟特性。

上細平方

準講一个樣本包括 N 種事件，逐个事件包括 _ p _ 個協變數佮一个輸出值。予 $ y _ { i } $ 為輸出值，並且 $ x _ { i } :=( x _ { 一 } , x _ { 二 } , \ ldots , x _ { p } ) ^ { T } $ 為第 i 種情況的協變數向量，遐爾 Lasso 愛計算的目標程式就是：

嘿所有 $ \ sum _ { j=一 } ^ { p } | \ beta _ { j } | \ leq t $，計算 $ \ min _ { \ beta _ { 零 } , \ beta } \ left \ { { \ frac { 一 } { N } } \ sum _ { i=一 } ^ { N } ( y _ { i }-\ beta _ { 零 }-x _ { i } ^ { T } \ beta ) ^ { 二 } \ right \ } $

遮 $ t $ 決定規則化程度的預定的自由參數。設 $ X $ 為協變數矩陣，遐爾 $ X _ { ij }=( x _ { i } ) _ { j } $，其中 $ x _ { i } ^ { T } $ 是 $ X $ 的第 _ i _ 走，遐爾仔頂式會當寫閣較絚的形式：

嘿所有 $ \ | \ beta \ | _ { 一 } \ leq t $，計算 $ \ min _ { \ beta _ { 零 } , \ beta } \ left \ { { \ frac { 一 } { N } } \ left \ | y-\ beta _ { 零 }-X \ beta \ right \ | _ { 二 } ^ { 二 } \ right \ } $

遮 $ \ | \ beta \ | _ { p }=\ left ( \ sum _ { i=一 } ^ { N } | \ beta _ { i } | ^ { p } \ right ) ^ { 一 / p } $ 是標準 $ \ ell ^ { p } $ 範數，$ 一 _ { N } $ 是 $ N \ times 一 $ 維的一个向量。

因為乎 $ { \ hat { \ beta } } _ { 零 }={ \ bar { y } }-{ \ bar { x } } ^ { T } \ beta $，所以有

$ y _ { i }-{ \ hat { \ beta } } _ { 零 }-x _ { i } ^ { T } \ beta=y _ { i }-( { \ bar { y } }-{ \ bar { x } } ^ { T } \ beta )-x _ { i } ^ { T } \ beta=( y _ { i }-{ \ bar { y } } )-( x _ { i }-{ \ bar { x } } ) ^ { T } \ beta , $

嘿變數進行中心化是捷用的數據處理方法。並且共變異數一般規範化做 $ \ textstyle \ left ( \ sum _ { i=一 } ^ { N } x _ { ij } ^ { 二 }=一 \ right ) $，按呢得著的解就袂依賴測量的規模。

伊的目標程式閣會當寫為：

$ \ min _ { \ beta \ in \ mathbb { R } ^ { p } } \ left \ { { \ frac { 一 } { N } } \ left \ | y-X \ beta \ right \ | _ { 二 } ^ { 二 } \ right \ } { \ text { subject to } } \ | \ beta \ | _ { 一 } \ leq t . $

其拉格朗日形式做：

$ \ min _ { \ beta \ in \ mathbb { R } ^ { p } } \ left \ { { \ frac { 一 } { N } } \ left \ | y-X \ beta \ right \ | _ { 二 } ^ { 二 } + \ lambda \ | \ beta \ | _ { 一 } \ right \ } $

其中 $ t $ 和 $ \ lambda $ 的關係就決定欲特徵。

Orthonormal covariates

Some basic properties of the lasso estimator can now be considered .

Assuming first that the covariates are orthonormal so that $ ( x _ { i } \ mid x _ { j } )=\ delta _ { ij } $ , where $ ( \ cdot \ mid \ cdot ) $ is the inner product and $ \ delta _ { ij } $ is the Kronecker delta , or , equivalently , $ X ^ { T } X=I $ , then using subgradient methods it can be shown that

$ { \ begin { aligned } { \ hat { \ beta } } _ { j }={ } & S _ { N \ lambda } ( { \ hat { \ beta } } _ { j } ^ { \ text { OLS } } )={ \ hat { \ beta } } _ { j } ^ { \ text { OLS } } \ max \ left ( 零 , 一-{ \ frac { N \ lambda } { | { \ hat { \ beta } } _ { j } ^ { \ text { OLS } } | } } \ right ) \ \ & { \ text { where } } { \ hat { \ beta } } ^ { \ text { OLS } }=( X ^ { T } X ) ^ { 影一 } X ^ { T } y \ end { aligned } } $

$ S _ { \ alpha } $ is referred to as the soft thresholding operator , since it translates values towards zero ( making them exactly zero if they are small enough ) instead of setting smaller values to zero and leaving larger ones untouched as the hard thresholding operator , often denoted $ H _ { \ alpha } $ , would .

This can be compared to ridge regression , where the objective is to minimize

$ \ min _ { \ beta \ in \ mathbb { R } ^ { p } } \ left \ { { \ frac { 一 } { N } } \ | y-X \ beta \ | _ { 二 } ^ { 二 } + \ lambda \ | \ beta \ | _ { 二 } ^ { 二 } \ right \ } $

yielding

$ { \ hat { \ beta } } _ { j }=( 一 + N \ lambda ) ^ { 影一 } { \ hat { \ beta } } _ { j } ^ { \ text { OLS } } . $

So ridge regression shrinks all coefficients by a uniform factor of $ ( 一 + N \ lambda ) ^ { 影一 } $ and does not set any coefficients to zero .

It can also be compared to regression with best subset selection , in which the goal is to minimize

$ \ min _ { \ beta \ in \ mathbb { R } ^ { p } } \ left \ { { \ frac { 一 } { N } } \ left \ | y-X \ beta \ right \ | _ { 二 } ^ { 二 } + \ lambda \ | \ beta \ | _ { 零 } \ right \ } $

where $ \ | \ cdot \ | _ { 零 } $ is the " $ \ ell ^ { 零 } $ norm " , which is defined as $ \ | z \ |=m $ if exactly m components of z are nonzero . In this case , it can be shown that

$ { \ hat { \ beta } } _ { j }=H _ { \ sqrt { N \ lambda } } \ left ( { \ hat { \ beta } } _ { j } ^ { \ text { OLS } } \ right )={ \ hat { \ beta } } _ { j } ^ { \ text { OLS } } \ mathrm { I } \ left ( \ left | { \ hat { \ beta } } _ { j } ^ { \ text { OLS } } \ right | \ geq { \ sqrt { N \ lambda } } \ right ) $

where $ H _ { \ alpha } $ is the so-called hard thresholding function and $ \ mathrm { I } $ is an indicator function ( it is 一 if its argument is true and 零 otherwise ) .

Therefore , the lasso estimates share features of the estimates from both ridge and best subset selection regression since they both shrink the magnitude of all the coefficients , like ridge regression , but also set some of them to zero , as in the best subset selection case . Additionally , while ridge regression scales all of the coefficients by a constant factor , lasso instead translates the coefficients towards zero by a constant value and sets them to zero if they reach it .

Correlated covariates

Returning to the general case , in which the different covariates may not be independent , a special case may be considered in which two of the covariates , say _ j _ and _ k _ , are identical for each case , so that $ x _ { ( j ) }=x _ { ( k ) } $ , where $ x _ { ( j ) , i }=x _ { ij } $ . Then the values of $ \ beta _ { j } $ and $ \ beta _ { k } $ that minimize the lasso objective function are not uniquely determined . In fact , if there is some solution $ { \ hat { \ beta } } $ in which $ { \ hat { \ beta } } _ { j } { \ hat { \ beta } } _ { k } \ geq 零 $ , then if $ s \ in [零 , 一] $ replacing $ { \ hat { \ beta } } _ { j } $ by $ s ( { \ hat { \ beta } } _ { j } + { \ hat { \ beta } } _ { k } ) $ and $ { \ hat { \ beta } } _ { k } $ by $ ( 一-s ) ( { \ hat { \ beta } } _ { j } + { \ hat { \ beta } } _ { k } ) $ , while keeping all the other $ { \ hat { \ beta } } _ { i } $ fixed , gives a new solution , so the lasso objective function then has a continuum of valid minimizers . Several variants of the lasso , including the Elastic Net , have been designed to address this shortcoming , which are discussed below .

一般的形式

算法共你解說

應用

LASSO 已經應用佇經濟佮金融領域，會當改善預測結果並且選擇有時予人忽視的變數。比如講：公司破產預測佮懸增長公司預測。

參見

降維
特徵選擇

參考文獻