[python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279) #6310

Merged

Conversation

Contributor

@nicklamiller nicklamiller commented Feb 12, 2024

Collaborator

@jameslamb jameslamb left a comment

Thanks for this!

But please add some unit tests in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_sklearn.py covering the following:

  • what happens when you try to access that attribute on an unfitted estimator
  • that that attribute returns the exact expected values in the following situations:
    • trained with feature names (in each of the ways feature names can be provided, e.g. do you get them automatically using pandas as input?)
    • trained without feature names


@property
def feature_names_in_(self) -> List[str]:
    """:obj:`list` of shape = [n_features]: Sklearn-style property for feature names."""
Collaborator

Please update this with the following:

  • remove "sklearn-style property for" and instead just say what it is, something like "names for features"
  • this should only be available in a fitted model, right? If so, please guard it like this:

if not self.__sklearn_is_fitted__():
    raise LGBMNotFittedError('No best_score found. Need to call fit beforehand.')

  • explain in the docs what will happen when accessing this attribute if you never provided feature names (e.g. just passed raw numpy arrays as training data)
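
A minimal sketch of the guard requested above, using a hypothetical `TinyEstimator` and a stand-in `NotFittedError` rather than LightGBM's actual classes. The error message is also an assumption; it only illustrates wording that names the attribute being accessed.

```python
from typing import List, Optional


class NotFittedError(ValueError):
    """Stand-in for LGBMNotFittedError (hypothetical, for illustration only)."""


class TinyEstimator:
    """Hypothetical estimator showing a property that requires a fitted model."""

    def __init__(self) -> None:
        self._feature_names: Optional[List[str]] = None

    def __sklearn_is_fitted__(self) -> bool:
        # Treated as fitted once fit() has stored the feature names.
        return self._feature_names is not None

    def fit(self, feature_names: List[str]) -> "TinyEstimator":
        self._feature_names = list(feature_names)
        return self

    @property
    def feature_names_in_(self) -> List[str]:
        """Names for features; only available after fit()."""
        if not self.__sklearn_is_fitted__():
            raise NotFittedError("No feature_names_in_ found. Need to call fit beforehand.")
        return self._feature_names
```

Accessing `TinyEstimator().feature_names_in_` before `fit()` raises the stand-in error, which is the behavior the first requested unit test would check.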

@jameslamb jameslamb changed the title from "Expose feature_name_ via sklearn consistent attribute feature_names_in_" to "[python] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)" Feb 12, 2024
@jameslamb
Collaborator

In scikit-learn/scikit-learn#28337 (comment), I noticed someone said

this feature comes for free if you inherit from BaseEstimator

lightgbm's scikit-learn estimators do inherit from BaseEstimator

class LGBMRegressor(_LGBMRegressorBase, LGBMModel):

class LGBMModel(_LGBMModelBase):

from .compat import (SKLEARN_INSTALLED, LGBMNotFittedError, _LGBMAssertAllFinite, _LGBMCheckArray,
_LGBMCheckClassificationTargets, _LGBMCheckSampleWeight, _LGBMCheckXY, _LGBMClassifierBase,
_LGBMComputeSampleWeight, _LGBMCpuCount, _LGBMLabelEncoder, _LGBMModelBase, _LGBMRegressorBase,

_LGBMModelBase = BaseEstimator

from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR! Those tests I mentioned would still be very valuable to catch changes to that support in the future and to be sure that lightgbm's integration with it has the expected behavior.

@nicklamiller
Contributor Author

nicklamiller commented Feb 20, 2024

@jameslamb Thank you for the great feedback! I'm working on adding these suggestions in.

Is there a way you recommend recreating the development environment? I couldn't find info on this in CONTRIBUTING.md, so I started to mimic the logic in .ci/test.sh, but having to set the various global variables that script expects keeps this from being a quick way to set up the environment. Just want to make sure I'm not missing a quicker way.

Thanks in advance!

@jameslamb
Collaborator

jameslamb commented Feb 20, 2024

Thanks! There isn't a well-documented way to set up a local development environment for the Python package today; it's something I'd like to add soon.

Here's how I develop on LightGBM:

  1. Create a conda environment (I use miniforge, to prefer conda-forge)
conda create \
    --name lgb-dev \
    cloudpickle \
    dask \
    distributed \
    joblib \
    matplotlib \
    numpy \
    python-graphviz \
    pytest \
    pytest-cov \
    python=3.11 \
    scikit-learn \
    scipy
  2. build the C++ library one time (assuming you're making Python-only changes)
rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j4 _lightgbm
  3. make changes to the Python code
  4. install the Python package in the conda environment
source activate lgb-dev
sh build-python.sh install --precompile
  5. run the tests
pytest tests/python_package_test
  6. repeat steps 3-5 until you're confident in your changes
  7. run the auto-formatting and some of the linting stuff (this is a work in progress, see [RFC] [python-package] use black for formatting Python code? #6304)
pre-commit run --all-files

@nicklamiller
Contributor Author

nicklamiller commented Mar 28, 2024

If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR!

It turns out sklearn only adds the feature_names_in_ attribute if the input data has feature names, while LightGBM will add column names of the format "Column_{i}" if the input data doesn't have column names. I've added a comment to a test to highlight this difference with sklearn.
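
The difference described above can be sketched in plain Python. Both `lightgbm_style_names` and `SklearnStyleHolder` are hypothetical stand-ins written for illustration; they mirror the behavior reported in this comment and are not code from either library.

```python
from typing import List, Optional, Sequence


def lightgbm_style_names(columns: Optional[Sequence[str]], n_features: int) -> List[str]:
    """Fall back to generated "Column_{i}" names when the input carries none."""
    if columns is None:
        return [f"Column_{i}" for i in range(n_features)]
    return list(columns)


class SklearnStyleHolder:
    """Hypothetical object that only gains the attribute for named input."""

    def record_names(self, columns: Optional[Sequence[str]]) -> None:
        if columns is not None:
            self.feature_names_in_ = list(columns)
        elif hasattr(self, "feature_names_in_"):
            # Unnamed input: make sure no stale attribute is left behind.
            del self.feature_names_in_
```

With unnamed input, `lightgbm_style_names(None, 3)` yields generated names, while the holder ends up with no `feature_names_in_` attribute at all, so a test has to account for whichever convention the estimator follows.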

@nicklamiller
Contributor Author

@microsoft-github-policy-service agree

Collaborator

@jameslamb jameslamb left a comment

Thanks for this!

But this does not look like it's meeting the expectations described in https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html.

I re-read that tonight, and saw the following

Input Feature Names

The input feature names are stored in a fitted estimator in a feature_names_in_ attribute, and are taken from the given input data, for instance a pandas data frame.
This attribute will be None if the input provides no feature names. The feature_names_in_ attribute is a 1d NumPy array with object dtype and all elements in the array are strings.

Output Feature Names
A fitted estimator exposes the output feature names through the get_feature_names_out method. The output of get_feature_names_out is a 1d NumPy array with object dtype and all elements in the array are strings. Here we discuss more in detail how these feature names are generated. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them. It is instead a guideline on how they could generally be generated.

So I think the following needs to be done:

  • feature_names_in_ should return a 1D numpy array, not a list
  • get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)
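
For the first point, a small sketch of the shape and dtype the SLEP text describes, assuming NumPy is installed. The feature names are made up for illustration.

```python
import numpy as np

# Hypothetical names, as an estimator might hold them internally in a list.
names_as_list = ["age", "income", "Column_2"]

# SLEP007 describes a 1D array with object dtype whose elements are strings.
names_as_array = np.array(names_as_list, dtype=object)
```

Passing `dtype=object` keeps each element a plain Python `str` instead of letting NumPy pick a fixed-width unicode dtype.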

There is also still something that's really bothering me about this in general, that I think we need to get a clear answer on before going further.

This comment claims that you get these things for free if you inherit from BaseEstimator: scikit-learn/scikit-learn#28337 (comment)

But lightgbm.sklearn.LGBMModel and everything inheriting from it do inherit from BaseEstimator. I've asked about this here: scikit-learn/scikit-learn#28337 (comment).

Up to you if you'd like to wait for scikit-learn maintainers to respond there before working on the other things I've requested here.

def test_getting_feature_names_in_pd_input():
    # as_frame=True means input has column names and these should propagate to fitted model
    X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
    est = lgb.LGBMModel(n_estimators=5, objective="binary")
Collaborator

Can you please extend these tests to cover all 4 estimators (LGBMModel, LGBMClassifier, LGBMRegressor, LGBMRanker)? I know that those last 3 inherit from LGBMModel, but if someone were to change how this attribute works for, say, LGBMClassifier only, in a way that breaks this behavior, we'd want a failing test to alert us to that.

Follow the same pattern used in the existing test right above these, test_check_is_fitted(), using the same data for all of the estimators.

python-package/lightgbm/sklearn.py Show resolved Hide resolved
@jameslamb jameslamb changed the title from "[python] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)" to "[python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)" Mar 29, 2024
@nicklamiller
Contributor Author

nicklamiller commented Apr 11, 2024

@jameslamb given that _validate_data needs to be called in order to get these attributes for free from BaseEstimator, would it make sense to call this method within the LGBM estimators' fit methods (like many other sklearn estimators, one example: scikit-learn/scikit-learn#27907 (comment))?

One different behavior between LGBM and sklearn is that LGBM assigns artificial names to features if the features are unnamed, whereas sklearn doesn't create artificial names, and also doesn't create the feature_names_in_ attribute. So for numpy arrays, even calling _validate_data within fit wouldn't make this attribute accessible.

I wanted to confirm that we want to add _validate_data, but to also keep the behavior of setting names when they're not present.

@nicklamiller
Contributor Author

nicklamiller commented Apr 11, 2024

feature_names_in_ should return a 1D numpy array, not a list

Sounds good, will fix.

get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)

I have less of an opinion on this one, but based on the SLEP, it does look like it should be specifically for estimators with the transform method:

Scope

The API for input and output feature names includes a feature_names_in_ attribute for all estimators, and a get_feature_names_out method for any estimator with a transform method, i.e. they expose the generated feature names via the get_feature_names_out method.

@jameslamb
Collaborator

I wanted to confirm that we want to add _validate_data

That method being prefixed with a _ suggests to me that it's an internal implementation detail of scikit-learn that could be changed in a future release of that library.

Can you find me some authoritative source saying that projects implementing their own estimators are encouraged to call that method? The comment you linked above is a specific recommendation from a scikit-learn maintainer about what to do for 2 estimators within scikit-learn... I don't interpret that as encouragement that other projects should call it.

xgboost does not: https://github.com/search?q=repo%3Admlc%2Fxgboost%20%22_validate_data%22&type=code

but catboost does: https://github.com/catboost/catboost/blob/19b60a20b2b1733c528b40c6c9ebe2f3d1f5dbde/contrib/python/scikit-learn/py3/sklearn/base.py#L537

Let's please pause on this work until some scikit-learn maintainer gives an authoritative answer on scikit-learn/scikit-learn#28337.

@jameslamb
Collaborator

Alright @nicklamiller I've read through the response at scikit-learn/scikit-learn#28337 (comment) carefully... I think we have enough information to proceed.

I think we should do the following here in LightGBM:

  • add feature_names_in_
  • add get_feature_names_out()
  • do not directly call _validate_data() or any other scikit-learn methods prefixed with _

Let's try to as closely as possible follow the way that xgboost has handled this in their Python package.

Are you interested in continuing this?

@nicklamiller nicklamiller force-pushed the add-sklearn-feature-attributes branch from 10d5301 to 9e168f2 Compare June 1, 2024 03:02
@nicklamiller
Contributor Author

nicklamiller commented Jun 1, 2024

@jameslamb I tried to closely follow xgboost w.r.t feature_names_in_, which is similar in its implementation in that feature names are obtained from another attribute from a _Booster object which is only returned if the model is fitted. If feature names can’t be accessed because the model is not fitted, a NotFittedError is raised (so similar to the LGBMNotFittedError that we raise here).

xgboost doesn’t implement the get_feature_names_out method, so I followed scikit-learn’s approach for this. Most objects that have this method are subclasses of TransformerMixin. Some examples:

The few classes that inherit from an estimator-like mixin (ClassifierMixin or RegressorMixin) appear to be meta-estimators that specify features for each individual estimator that the meta estimator is comprised of:

So it looks like get_feature_names_out is in fact defined almost exclusively for transformers. I’ve still added it here but since LGBMModel and its subclasses don’t apply transformations to features and aren’t meta-estimators, I simply return feature_names_in_ within the method.
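
The pass-through approach described above could look roughly like this. `PassThroughNames` is a hypothetical stand-in, and the method is simplified to take no arguments; it is a sketch of the idea, not the code from this PR.

```python
from typing import List


class PassThroughNames:
    """Hypothetical stand-in for an estimator that does not transform features."""

    def __init__(self, feature_names: List[str]) -> None:
        self.feature_names_in_ = list(feature_names)

    def get_feature_names_out(self) -> List[str]:
        # No transformation happens, so output names equal input names.
        return self.feature_names_in_
```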

Please let me know how this looks!

@nicklamiller nicklamiller requested a review from jameslamb June 1, 2024 03:21
@nicklamiller nicklamiller force-pushed the add-sklearn-feature-attributes branch from 8698523 to 574d9ce Compare June 3, 2024 17:54
Collaborator

@jameslamb jameslamb left a comment

Thanks so much for your persistence with this!

The implementations look good to me. I just have some recommendations on the tests.

So it looks like get_feature_names_out is in fact defined almost exclusively for transformers. I’ve still added it here but since LGBMModel and its subclasses don’t apply transformations to features and aren’t meta-estimators, I simply return feature_names_in_ within the method.

I asked for it to be added based on this statement from @adrinjalali in scikit-learn/scikit-learn#28337 (comment): "scikit-learn estimators all add ... get_feature_names_out()".

I think the choice to just return feature_names_in_ makes sense for lightgbm, thank you.


Please note... I am going to try to push this week to get LightGBM v4.4.0 out, so we can have a release up before NumPy 2.0 is released June 16th (#6439 (comment)). So if you don't have time to address this feedback in the next couple days, then this will be in the next release of LightGBM after that.

Comment on lines 1315 to 1316
# as_frame=True means input has column names and these should propagate to fitted model
X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
Collaborator

Suggested change
- # as_frame=True means input has column names and these should propagate to fitted model
- X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
+ X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
+ col_names = X.columns.to_list()
+ assert isinstance(col_names, list) and all(isinstance(c, str) for c in col_names), "input data must have feature names for this test to cover the expected functionality"

Instead of using a code comment, could you please test for this directly? That'd ensure that if load_digits() behavior around feature names ever changes, this test will fail and alert us instead of silently passing or maybe failing in some other hard-to-understand way.

    model.fit(X, y, group=group)
else:
    model.fit(X, y)
np.testing.assert_array_equal(est.feature_names_in_, X.columns)
Collaborator

Suggested change
- np.testing.assert_array_equal(est.feature_names_in_, X.columns)
+ np.testing.assert_array_equal(model.feature_names_in_, X.columns)

Instead of doing this for loop approach, could you please change these tests to parameterize over classes, like this?

@pytest.mark.parametrize("estimator_class", [lgb.LGBMModel, lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker])

That'd reduce the risk of mistakes like this one (where only the LGBMModel instance, est, is being tested).

est = lgb.LGBMModel(n_estimators=5, objective="binary")
clf = lgb.LGBMClassifier(n_estimators=5)
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
Collaborator

Since the thing being tested in this PR isn't really dependent on the content of the learned model, could you please use n_estimators=2 and num_leaves=7 in all the tests? That'd make the tests slightly faster and cheaper without reducing their effectiveness in detecting issues.

reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
models = (est, clf, reg, rnk)
group = np.full(shape=(X.shape[0] // 2,), fill_value=2) # Just an example group
Collaborator

Suggested change
- group = np.full(shape=(X.shape[0] // 2,), fill_value=2) # Just an example group
+ group = [X.shape[0]]

For simplicity, please just treat all samples in X as part of a single query group. LightGBM supports that, and it won't materially change the effectiveness of these tests.
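
To illustrate why both forms are valid inputs, here is a plain-Python sketch of the two group layouts. The row count is a made-up value, and lists stand in for the NumPy array used in the original test.

```python
n_rows = 360  # hypothetical number of samples in X

# Original approach: many query groups of two rows each.
groups_of_two = [2] * (n_rows // 2)

# Suggested approach: every row belongs to one query group.
single_group = [n_rows]

# In both cases the group sizes must add up to the number of rows.
assert sum(groups_of_two) == n_rows
assert sum(single_group) == n_rows
```

The single-group form also avoids relying on the row count being even, which the groups-of-two layout needs for the sizes to add up.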

@@ -1290,6 +1290,90 @@ def test_max_depth_warning_is_never_raised(capsys, estimator_class, max_depth):
assert "Provided parameters constrain tree depth" not in capsys.readouterr().out


def test_getting_feature_names_in_np_input():
Collaborator

There is quite a lot of repetition in these tests. Generally I'm supportive of repetition in tests in favor of making it easier to diagnose issues or selectively include / exclude tests... but for the cases of .feature_names_in_ and .get_feature_names_out(), I think the tests should be combined.

So can you please reduce these 4 tests down to 2? One for numpy input without feature names, one for pandas input with feature names?

Ending with assertions like this:

expected_col_names = np.array([f"Column_{i}" for i in range(X.shape[1])])
np.testing.assert_array_equal(model.feature_names_in_, expected_col_names)
np.testing.assert_array_equal(model.get_feature_names_out(), expected_col_names)

@adrinjalali

So it looks like get_feature_names_out is in fact defined almost exclusively for transformers. I’ve still added it here but since LGBMModel and its subclasses don’t apply transformations to features and aren’t meta-estimators, I simply return feature_names_in_ within the method.

I asked for it to be added based on this statement from @adrinjalali in scikit-learn/scikit-learn#28337 (comment): "scikit-learn estimators all add ... get_feature_names_out()".

I think the choice to just return feature_names_in_ makes sense for lightgbm, thank you.

I think for lightgbm's classifiers and regressors, it makes sense to not implement get_feature_names_out() at all. They're indeed only available on transformers in scikit-learn.

@jameslamb
Collaborator

Alright, thanks for that clarification.

@nicklamiller please remove that method then.

@nicklamiller
Contributor Author

@jameslamb Thanks very much for the suggestions! I removed get_feature_names_out() and fixed up the tests as specified above, please let me know how this looks.

@nicklamiller nicklamiller requested a review from jameslamb June 11, 2024 22:01
@jameslamb
Collaborator

Thanks very much for keeping this up to date with master.

I'll review again soon, but it's too late for this to make it into the v4.4.0 release (#6439). Sorry about that, but we are rushing to get that release up before numpy 2.0 comes out 2 days from now.

Collaborator

@jameslamb jameslamb left a comment

Looks great!

Thank you VERY much for your patience and persistence on this. We really appreciate it and would love to have you come back and contribute more in the future!

@jameslamb
Collaborator

@borchero could you re-run the failing Azure check and merge this if it passes? I'll be away from my laptop for the next day or 2

@jameslamb
Collaborator

I just restarted that failing CI job, will check back in a bit.

@borchero
Collaborator

Sorry @jameslamb, I saw your comment only now! Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄

@jameslamb
Collaborator

Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄

No problem! I know, I also sometimes have to log into Azure DevOps multiple times and in different ways to finally get in. I'll send you a private message with some tips to try.

@jameslamb jameslamb merged commit f811c82 into microsoft:master Jul 3, 2024
41 checks passed
@jameslamb
Collaborator

Thanks so much @nicklamiller ! I'm sorry this was such a long process for you. We hope you'll consider coming back and contributing more in the future.

If you're interested in helping with the Python package specifically, this would be a good next place to go: #6361

@nicklamiller
Contributor Author

@jameslamb thank you for all the help and guidance, I'm looking forward to contributing more!

@nicklamiller nicklamiller deleted the add-sklearn-feature-attributes branch July 5, 2024 17:15
Development

Successfully merging this pull request may close these issues.

[python-package] Support feature_names_in_ attribute via sklearn API
4 participants