
feature selection

Summary of all feature selection methods (including python code):

  • https://github.com/Yimeng-Zhang/feature-engineering-and-feature-selection

  • A Short Guide for Feature Engineering and Feature Selection.pdf

Feature selection implementations:

  • https://github.com/jundongl/scikit-feature

pros and cons of different methods:

  • https://www.paypalobjects.com/ecm_assets/Feature%20Selection%20WP-PP-v1.pdf

  • voting: time-consuming

  • union: leads to too many features when many methods are combined

  • union of xgb, lgb and ctb: best result with low computation time (see the sketch below)
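
A minimal sketch of that union approach, assuming X is a pandas DataFrame, y is the target, and the xgboost, lightgbm and catboost packages are installed; the top-k cutoff of 50 is an arbitrary illustration:

# Union of the top-k features by built-in importance from XGBoost, LightGBM and CatBoost
# (illustrative sketch; X is a pandas DataFrame, y the target, k=50 is arbitrary)
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

models = {
    'xgb': XGBClassifier(random_state=42).fit(X, y),
    'lgb': LGBMClassifier(random_state=42).fit(X, y),
    'ctb': CatBoostClassifier(random_state=42, verbose=0).fit(X, y),
}

k = 50
selected = set()
for name, model in models.items():
    imp = pd.Series(model.feature_importances_, index=X.columns)
    selected |= set(imp.nlargest(k).index)

X_selected = X[sorted(selected)]
print(f'{len(selected)} features kept in the union')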

links:

  • https://neptune.ai/blog/feature-selection-methods

  • https://www.kaggle.com/code/prashant111/comprehensive-guide-on-feature-selection/notebook

  • https://www.kaggle.com/code/kanncaa1/feature-selection-and-data-visualization

  • methods: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-024-00905-w

good discussion:

  • https://www.reddit.com/r/datascience/comments/1gsa6aj/lightgbm_feature_selection_methods_that_operate

  • suggest Permutation Importance

  • others: NFE and then RFE

  • Correlation-based feature selection with Mutual Information is a good starting point; LightGBM can handle it efficiently

three widely used feature selection methods:

  • ANOVA

  • Mutual Information

  • Recursive Feature Elimination

Tree-based methods

Limitations:

  • correlated features show similar importance

  • a correlated feature's importance is lower than its real importance when the tree is built without its correlated counterparts

  • high-cardinality variables tend to show higher importance

Effect of Correlated Features:

  • Lasso tends to pick one feature out of a correlated group and zero out the rest.

  • Boosted trees often split on multiple correlated features, sharing importance among them.

use a model's built-in feature importance (a sketch follows this list):

  • xgboost

  • lightgbm

  • random forest
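
A minimal sketch using LightGBM's built-in feature_importances_ attribute (assuming X is a pandas DataFrame and y the target; XGBoost and random forest models expose the same attribute):

# Rank features by LightGBM's built-in importance
# (xgboost's XGBClassifier and sklearn's RandomForestClassifier expose
# the same feature_importances_ attribute)
import pandas as pd
from lightgbm import LGBMClassifier

model = LGBMClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

importances = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)
print(importances.head(20))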

Simple methods

Variance Threshold

Remove constant/near-constant features:

from sklearn.feature_selection import VarianceThreshold
X_reduced = VarianceThreshold(threshold=0.01).fit_transform(X)

Correlation Filter

Remove features with low target correlation:

  • only applies to linear relationships

  • simple, but less effective than univariate selection

  • reducing features helps explainability, but dropping them based on correlation alone can discard potentially useful features

    # Calculate correlations
    corr = df.corr()['target'].sort_values(ascending=False)
    print(corr)
    

Remove highly correlated features (>0.95):

import numpy as np
import pandas as pd

corr_matrix = X.corr().abs()
# keep only the upper triangle so each feature pair is considered once
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
X_filtered = X.drop(to_drop, axis=1)

vif for numeric features

  • Measure multicollinearity: how much a feature is linearly correlated with other features
  • Useful for linear models (e.g., linear regression), where multicollinearity can harm interpretability or inflate variance
  • For tree-based models (like LightGBM, XGBoost), multicollinearity is not usually a problem, since trees can handle correlated features
  • In the real world, the single best variable can be highly correlated with all the other variables, so you might wrongly drop it based on VIF scores alone (a VIF sketch follows this list)
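
A minimal sketch of the VIF computation using statsmodels (assuming X is a numeric pandas DataFrame without missing values; a constant column is added so the VIFs are not distorted by a missing intercept):

# Compute VIF per numeric feature with statsmodels
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # add intercept column
vif = pd.DataFrame({
    'feature': X_const.columns,
    'vif': [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
})
print(vif[vif['feature'] != 'const'].sort_values('vif', ascending=False))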

pearson corr vs vif

  • corr().abs(), using pearson correlation, is faster and simpler, great for quick pairwise filtering

  • vif is more robust for detecting multicollinearity, especially in linear models.

  • neither is particularly useful for tree-based models like xgboost

Filter methods

  • f_classif for numerical features and a categorical target

  • chi-squared for categorical features and a categorical target

Univariate Selection

  • for regression problems: f_regression

  • for classification problems: f_classif

  • for categorical features: chi2

  • doesn't consider feature interactions

select the best features:

from sklearn.feature_selection import SelectKBest, f_classif
# select top 500 features for classification using ANOVA F-value between each feature and target
selector = SelectKBest(score_func=f_classif, k=500)
selected = selector.fit_transform(X_train, y_train)

calculate scores for all features:

from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd

# chi2 requires non-negative feature values
selector = SelectKBest(score_func=chi2, k=10)
fit = selector.fit(X, y)
f_scores = pd.DataFrame({'feature': X.columns, 'score': fit.scores_})
print(f_scores.nlargest(10, 'score'))

Nonlinear Univariate

  • It's slow: https://github.com/scikit-learn/scikit-learn/issues/6904

  • It has an inherent O(n log n) cost per feature as long as it uses exact nearest neighbors.

    from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
    # n_jobs requires a recent scikit-learn release
    mi_scores = mutual_info_regression(X, y, n_jobs=-1, discrete_features=False, random_state=42)
    mi_scores = mutual_info_classif(X, y, n_jobs=-1, discrete_features=False, random_state=42)
    

Wrapper methods

  • use a machine learning model to evaluate the performance of different subsets of features

Permutation Importance (Feature Shuffling)

Randomly shuffle the values of a single variable and measure how much that permutation degrades the model's performance metric.

  • Permute the values of each feature, one at the time

  • If a variable is important, randomly permuting its values will dramatically decrease the performance metric

  • Non-important variables should have little to no effect on the model performance metric

  • Can be slow

# Permutation Importance example (X and y are assumed to be polars objects)
import polars as pl
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Train a model; Random Forest is a good choice as it's a powerful tree-based model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X.to_pandas(), y.to_pandas())

# Calculate Permutation Importance (ideally on a held-out set for a more reliable estimate)
perm = permutation_importance(model, X.to_pandas(), y.to_pandas(), n_repeats=10, random_state=13)

# `perm.importances_mean` gives the average importance score across repeats
df = pl.DataFrame({
    'feature': X.columns,
    'perm_score': perm.importances_mean,
    'perm_score_std': perm.importances_std,
}).sort('perm_score', descending=True)
print(df)

Recursive Feature Elimination (RFE)

https://machinelearningmastery.com/rfe-feature-selection-in-python/

RFE works by searching for a subset of features: it starts with all features in the training dataset and successively removes features until the desired number of features is reached.

  • Too slow for a huge number of features

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create a sample dataset
X, y = make_classification(
  n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42
)
# Initialize the model and RFE selector
model = LogisticRegression()
rfe_selector = RFE(estimator=model, n_features_to_select=5, step=1)
# Fit RFE
rfe_selector = rfe_selector.fit(X, y)
# Get the selected features
selected_features = rfe_selector.support_
print(f'Selected features: {selected_features}')

Embedded methods

  • force the coefficients of less important features to become exactly zero

  • larger alpha leads to a heavier penalty: 0 = no regularization, 0.01 = light, 1 = strong, 100 = very strong

Lasso (L1) regularization:

  • A linear model, so it assumes linear relationships between features and target.

  • Shrinks some coefficients to zero, performing explicit feature selection based on linear correlations.

  • Sensitive to multicollinearity — often picks one feature among correlated groups.

from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
# Create a sample dataset with some irrelevant features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=42)
# Initialize and train the Lasso model
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
# Features with a coefficient of 0 are not important
print(lasso.coef_)

Auto feature selection: BorutaPy: A Robust Feature Selection Algorithm

  • can be computationally expensive
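
A minimal sketch, assuming the boruta package (BorutaPy) with a random forest base estimator; BorutaPy works on numpy arrays, so pandas inputs are converted first:

# Boruta feature selection with a random forest base estimator (illustrative sketch)
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=42)
boruta = BorutaPy(estimator=rf, n_estimators='auto', random_state=42)
boruta.fit(np.asarray(X), np.asarray(y))

# confirmed and tentative features
print('Selected:', list(X.columns[boruta.support_]))
print('Tentative:', list(X.columns[boruta.support_weak_]))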

Time series feature generation:

  • TSFresh can automatically extract and filter useful time series features based on statistical tests

  • Featuretools can help with automated feature generation + selection (more for relational data, but flexible)
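
A minimal sketch of the tsfresh extract-and-filter workflow (assuming a long-format DataFrame named timeseries with 'id' and 'time' columns and a target Series y indexed by 'id'; these names are illustrative):

# Extract time series features and keep only those relevant to the target
# (hypothetical column names 'id' and 'time')
from tsfresh import extract_relevant_features

features = extract_relevant_features(timeseries, y, column_id='id', column_sort='time')
print(features.shape)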

create SHAP explainer (for tree-based models)

import shap

explainer = shap.TreeExplainer(model.regressor)

sample 50% of the data to speed up the calculation

import numpy as np

rng = np.random.default_rng(seed=785412)
sample = rng.choice(X_train.index, size=int(len(X_train) * 0.5), replace=False)
X_train_sample = X_train.loc[sample, :]
shap_values = explainer.shap_values(X_train_sample)

shap summary report (top 10 features only)

import matplotlib.pyplot as plt

shap.initjs()
shap.summary_plot(shap_values, X_train_sample, max_display=10, show=False)
fig, ax = plt.gcf(), plt.gca()
ax.set_title('SHAP Summary plot')
ax.tick_params(labelsize=8)
fig.set_size_inches(10, 4.5)