feature selection¶
Summary of all feature selection methods (including python code):
https://github.com/Yimeng-Zhang/feature-engineering-and-feature-selection
A Short Guide for Feature Engineering and Feature Selection.pdf
Feature selection implementations:
- https://github.com/jundongl/scikit-feature
pros and cons of different methods:
https://www.paypalobjects.com/ecm_assets/Feature%20Selection%20WP-PP-v1.pdf
voting: time consuming
union: leads to too many features when combining many methods
union of xgb, lgb and ctb: best result with low computation time
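A minimal sketch of the union approach, assuming scikit-learn ensembles (RandomForest, GradientBoosting, ExtraTrees) as stand-ins for xgb/lgb/ctb — those libraries' scikit-learn wrappers expose the same `feature_importances_` attribute, so the loop is identical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
)

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

# stand-ins for xgb, lgb, ctb (assumption: any model with feature_importances_ works)
models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
]

top_k = 10
selected = set()
for model in models:
    model.fit(X, y)
    # indices of the top-k features by built-in importance
    top = np.argsort(model.feature_importances_)[::-1][:top_k]
    selected |= set(top)

print(sorted(selected))  # union of each model's top-k feature indices
```

The union is cheap (one fit per model) and keeps any feature that at least one model rates highly, at the cost of a somewhat larger feature set than voting.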
links:
https://neptune.ai/blog/feature-selection-methods
https://www.kaggle.com/code/prashant111/comprehensive-guide-on-feature-selection/notebook
https://www.kaggle.com/code/kanncaa1/feature-selection-and-data-visualization
methods: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-024-00905-w
good discussion:
https://www.reddit.com/r/datascience/comments/1gsa6aj/lightgbm_feature_selection_methods_that_operate
suggestions from the thread:
- Permutation Importance; others suggest NFE and then RFE
- Correlation-based feature selection with Mutual Information is a good starting point; LightGBM can handle it efficiently
three widely used feature selection methods:
ANOVA
Mutual Information
Recursive Feature Elimination
Tree-based methods¶
Limitations:
- correlated features show similar importance
- a correlated feature's importance is lower than its true importance (i.e., the importance it would show if the tree were built without its correlated counterparts)
- high-cardinality variables tend to show higher importance
Effect of Correlated Features:
Lasso tends to pick one feature out of a correlated group and zero out the rest.
Boosted trees often split on multiple correlated features, sharing importance among them.
use the model's built-in feature importance:
xgboost
lightgbm
random forest
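A sketch of the idea, shown with RandomForest (the XGBoost/LightGBM scikit-learn wrappers expose the same `feature_importances_` attribute):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# impurity-based importances (normalized to sum to 1);
# beware the correlation/cardinality caveats listed above
importances = pd.Series(
    model.feature_importances_,
    index=[f'f{i}' for i in range(X.shape[1])],
)
print(importances.sort_values(ascending=False).head(5))
```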
Simple methods¶
Variance Threshold¶
Remove constant/near-constant features:
from sklearn.feature_selection import VarianceThreshold
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)
Correlation Filter¶
Remove features with low correlation to the target:
- only captures linear relationships
- simple but less effective than univariate selection
- reducing features helps explainability, but dropping them based on correlation alone can discard potentially useful features
Remove highly correlated features (>0.95):
import numpy as np
import pandas as pd
corr_matrix = X.corr().abs()
upper = corr_matrix.where(
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
X_filtered = X.drop(to_drop, axis=1)
VIF for numeric features¶
- Measures multicollinearity: how much a feature is linearly correlated with the other features
- Useful for linear models (e.g., linear regression), where multicollinearity can harm interpretability or inflate variance
- For tree-based models (like LightGBM, XGBoost), multicollinearity is not usually a problem, since trees can handle correlated features
- In the real world, the single best variable could be extremely highly correlated with all the other variables -- you might wrongly drop it based on VIF scores
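VIF can be computed directly from the definition VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing feature i on all the other features. A minimal sketch with scikit-learn (statsmodels' `variance_inflation_factor` computes the same quantity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared of
    regressing feature i on all the other features."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# synthetic example: b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)  # highly collinear with a -> large VIF
c = rng.normal(size=200)            # independent -> VIF near 1
X = np.column_stack([a, b, c])
print(vif(X))
```

A common rule of thumb is to flag features with VIF above 5-10, but (per the caveat above) a high-VIF feature can still be the most predictive one.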
Pearson corr vs VIF¶
corr().abs(), using Pearson correlation, is faster and simpler; great for quick pairwise filtering
VIF is more robust for detecting multicollinearity, especially in linear models
neither is particularly useful for tree-based models like XGBoost
Filter methods¶
f_classif for numerical features and a categorical target; chi-squared for categorical features and a categorical target
Univariate Selection¶
- for regression problems: f_regression
- for classification problems: f_classif
- for categorical features: chi2
- doesn't consider feature interactions
select the best features:
from sklearn.feature_selection import SelectKBest, f_classif
# select top 500 features for classification using ANOVA F-value between each feature and target
selector = SelectKBest(score_func=f_classif, k=500)
selected = selector.fit_transform(X_train, y_train)
calculate scores for all features (chi2 requires non-negative feature values):
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=10)
fit = selector.fit(X, y)
f_scores = pd.DataFrame({'feature': X.columns, 'score': fit.scores_})
print(f_scores.nlargest(10, 'score'))
Nonlinear Univariate¶
Mutual information is slow: https://github.com/scikit-learn/scikit-learn/issues/6904
It has an inherent n·log(n) cost per feature as long as it's using exact nearest neighbors.
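For a nonlinear univariate score, `mutual_info_classif` plugs into the same `SelectKBest` interface used above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

# score each feature by its estimated mutual information with the target
# (captures nonlinear dependence, unlike the ANOVA F-value)
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_top = selector.fit_transform(X, y)
print(X_top.shape)        # top-5 features kept
print(selector.scores_)   # one MI score per feature
```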
Wrapper methods¶
- use a machine learning model to evaluate the performance of different subsets of features
Permutation Importance (Feature Shuffling)¶
Randomly shuffle the values of a single feature and measure how that permutation affects the performance metric (measures the drop in model performance when each feature is shuffled)
Permute the values of each feature, one at a time
If a variable is important, randomly permuting its values will dramatically decrease the metric
Non-important variables should have little to no effect on the model performance metric
Can be slow
# Permutation Importance example
import polars as pl
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
# Train a model, Random Forest is a great choice as it's a powerful tree-based model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X.to_pandas(), y.to_pandas())
# Calculate permutation importance (ideally on a held-out test set for a more reliable estimate)
perm = permutation_importance(model, X.to_pandas(), y.to_pandas(), n_repeats=10, random_state=13)
# Get and sort the `perm.importances_mean` attribute that gives the average importance score
df = pl.DataFrame({
'feature': X.columns,
'perm_score': perm.importances_mean,
'perm_score_std': perm.importances_std,
}).sort('perm_score', descending=True)
print(df)
Recursive Feature Elimination (RFE)¶
https://machinelearningmastery.com/rfe-feature-selection-in-python/
RFE works by searching for a subset of features: it starts with all features in the training dataset and successively removes features until the desired number of features is reached.
- Too slow for a huge amount of features
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create a sample dataset
X, y = make_classification(
n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42
)
# Initialize the model and RFE selector
model = LogisticRegression()
rfe_selector = RFE(estimator=model, n_features_to_select=5, step=1)
# Fit RFE
rfe_selector = rfe_selector.fit(X, y)
# Get the selected features
selected_features = rfe_selector.support_
print(f'Selected features: {selected_features}')
Embedded methods¶
force the coefficients of less important features to become exactly zero
a larger alpha leads to a heavier penalty: 0 = no regularization, 0.01 = light, 1 = strong, 100 = very strong
Lasso (L1) regularization:
A linear model with linear relationships between features and target.
Shrinks some coefficients to zero, performing explicit feature selection based on linear correlations.
Sensitive to multicollinearity — often picks one feature among correlated groups.
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
# Create a sample dataset with some irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=42)
# Initialize and train the Lasso model
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
# Features with a coefficient of 0 are not important
print(lasso.coef_)
Auto feature selection: BorutaPy: A Robust Feature Selection Algorithm
- can be computationally expensive
Time series feature generation:
- TSFresh can automatically extract and filter useful time series features based on statistical tests
- Featuretools can help with automated feature generation + selection (more for relational data, but flexible)
create SHAP explainer (for tree-based models)¶
import numpy as np
import shap
import matplotlib.pyplot as plt
explainer = shap.TreeExplainer(model.regressor)
sample 50% of the data to speed up the calculation¶
rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train) * 0.5), replace=False)
X_train_sample = X_train.loc[sample, :]
shap_values = explainer.shap_values(X_train_sample)
shap summary report (top 10 features only)¶
shap.initjs()
shap.summary_plot(shap_values, X_train_sample, max_display=10, show=False)
fig, ax = plt.gcf(), plt.gca()
ax.set_title('SHAP Summary plot')
ax.tick_params(labelsize=8)
fig.set_size_inches(10, 4.5)