feature importance¶

The methods, like PFI or SHAP importance, are not designed for feature selection. They are for feature importance - explain features

Purpose of feature selection:

Improve predictive performance
Speed up model training and prediction
Reduce feature space for comprehensibility
Cost reduction

feature importance example¶

import xgboost as xgb
import lightgbm as lgb

# XGBoost feature importance
import xgboost as xgb
# model = model = xgb.XGBClassifier(
#     objective='binary:logistic',  # Default for binary classification
#     eval_metric='logloss',        # Metric for evaluation
# )
model = xgb.XGBRegressor(
    objective='reg:squarederror',  # Default for regression, minimizes squared error
    eval_metric='rmse',            # Metric for evaluation
)
model.fit(X, y)
importance = model.feature_importances_
print(importance)

# LightGBM feature importance
train_data = lgb.Dataset(data=X.to_arrow(), label=y.to_arrow())
# for small number of features, reduce `num_leaves` to avoid warnings
# for small number of data points, reduce `min_data_in_leave` to avoid warnings
# warnings #1: [Warning] No further splits with positive gain, best gain: -inf
# warnings #2: There are no meaningful features which satisfy the provided configuration.
# Decreasing Dataset parameters min_data_in_bin or min_data_in_leaf and re-constructing Dataset might resolve this warning.
params = {'objective': 'regression', 'metric': 'rmse', 'num_leaves': 2, 'min_data_in_leaf': 2}
model = lgb.train(params, train_data, num_boost_round=100)
importances = model.feature_importance(importance_type='gain')
print(importance)

# Catboost feature importance, lgb: 35x faster, xgb: 15x faster
# model = ctb.CatBoostClassifier(verbose=0)
model = ctb.CatBoostRegressor(verbose=0)
model.fit(X.to_pandas(), y.to_pandas())
importance = model.get_feature_importance()
print(importance)

shap feature explainer¶

```py import shap import numpy as np import matplotlib.pyplot as plt from statsmodels.graphics.tsaplots import ( plot_acf, plot_pacf, )