[python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279) #6310
Conversation
Thanks for this!
But please add some unit tests in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_sklearn.py covering the following:
- what happens when you try to access that attribute on an unfitted estimator
- that the attribute returns the exact expected values in the following situations:
  - trained with feature names (in each of the ways feature names can be provided, e.g. do you get them automatically when using pandas as input?)
  - trained without feature names
python-package/lightgbm/sklearn.py (Outdated)

```python
@property
def feature_names_in_(self) -> List[str]:
    """:obj:`list` of shape = [n_features]: Sklearn-style property for feature names."""
```
Please update this with the following:
- remove "sklearn-style property for" and instead just say what it is, something like "names for features"
- this should only be available in a fitted model, right? If so, please guard it like this (LightGBM/python-package/lightgbm/sklearn.py, lines 993 to 994 in cc733f8):

```python
if not self.__sklearn_is_fitted__():
    raise LGBMNotFittedError('No best_score found. Need to call fit beforehand.')
```

- explain in the docs what will happen when accessing this attribute if you never provided feature names (e.g. just passed raw numpy arrays as training data)
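Putting those suggestions together, a minimal sketch of the guarded property might look like this (the error message, docstring wording, and import path are assumptions, not the final implementation):

```python
from typing import List

from lightgbm.compat import LGBMNotFittedError

@property
def feature_names_in_(self) -> List[str]:
    """:obj:`list` of shape = [n_features]: Names of the features."""
    if not self.__sklearn_is_fitted__():
        raise LGBMNotFittedError('No feature_names_in_ found. Need to call fit beforehand.')
    # With unnamed input (e.g. raw numpy arrays), LightGBM falls back to
    # auto-generated names like "Column_0", "Column_1", ...
    return self.feature_name_
```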
In scikit-learn/scikit-learn#28337 (comment), I noticed someone said …

Referenced code:
- LightGBM/python-package/lightgbm/sklearn.py, line 1072 in cc733f8
- LightGBM/python-package/lightgbm/sklearn.py, line 430 in cc733f8
- LightGBM/python-package/lightgbm/sklearn.py, lines 15 to 17 in cc733f8
- LightGBM/python-package/lightgbm/compat.py, line 106 in cc733f8

If you get into this and find that …
@jameslamb Thank you for the great feedback! I'm working on adding these suggestions in. Is there a way you recommend recreating the development environment? I couldn't find info on this in the docs. Thanks in advance!
Thanks! There isn't a well-documented way to set up a local development environment for the Python package today, it's something I'd like to add soon. Here's how I develop on LightGBM:
```sh
# create the development conda environment
conda create \
    --name lgb-dev \
    cloudpickle \
    dask \
    distributed \
    joblib \
    matplotlib \
    numpy \
    python-graphviz \
    pytest \
    pytest-cov \
    python=3.11 \
    scikit-learn \
    scipy

# compile the lib_lightgbm shared library
rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j4 _lightgbm
cd ..  # back to the repo root

# install the Python package into the environment
source activate lgb-dev
sh build-python.sh install --precompile

# run the tests and linters
pytest tests/python_package_test
pre-commit run --all-files
```
(branch force-pushed from c481290 to 10d5301)
It turns out sklearn only adds the feature_names_in_ attribute when the input data actually has feature names.
@microsoft-github-policy-service agree
Thanks for this!
But this does not look like it's meeting the expectations described in https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html.
I re-read that tonight, and saw the following:

> **Input Feature Names**
> The input feature names are stored in a fitted estimator in a feature_names_in_ attribute, and are taken from the given input data, for instance a pandas data frame. This attribute will be None if the input provides no feature names. The feature_names_in_ attribute is a 1d NumPy array with object dtype and all elements in the array are strings.
>
> **Output Feature Names**
> A fitted estimator exposes the output feature names through the get_feature_names_out method. The output of get_feature_names_out is a 1d NumPy array with object dtype and all elements in the array are strings. Here we discuss more in detail how these feature names are generated. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them. It is instead a guideline on how they could generally be generated.
So I think the following needs to be done (see the sketch after this list):
- feature_names_in_ should return a 1D numpy array, not a list
- a get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)
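To make the SLEP007 contract concrete, here is an illustrative check of the expected invariants, using sklearn's own LogisticRegression as a stand-in (a sketch, not part of this PR):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit on a DataFrame so the estimator records input feature names.
X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# SLEP007 contract: a 1-D NumPy array with object dtype, all strings.
names = model.feature_names_in_
assert isinstance(names, np.ndarray) and names.ndim == 1
assert names.dtype == object
assert all(isinstance(n, str) for n in names)
```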
There is also still something that's really bothering me about this in general, that I think we need to get a clear answer on before going further.

This comment claims that you get these things for free if you inherit from BaseEstimator: scikit-learn/scikit-learn#28337 (comment). But lightgbm.sklearn.LGBMModel and everything inheriting from it do inherit from BaseEstimator. I've asked about this here: scikit-learn/scikit-learn#28337 (comment).

Up to you if you'd like to wait for scikit-learn maintainers to respond there before working on the other things I've requested here.
```python
def test_getting_feature_names_in_pd_input():
    # as_frame=True means input has column names and these should propagate to fitted model
    X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
    est = lgb.LGBMModel(n_estimators=5, objective="binary")
```
Can you please extend these tests to cover all 4 estimators (LGBMModel, LGBMClassifier, LGBMRegressor, LGBMRanker)? I know that those last 3 inherit from LGBMModel, but if someone were to make a change in how this attribute works for, say, LGBMClassifier only that breaks this behavior, we'd want a failing test to alert us to that.

Follow the same pattern used in the existing test right above these, test_check_is_fitted(), using the same data for all of the estimators.
@jameslamb given that …, one difference in behavior between LightGBM and sklearn is that LightGBM assigns artificial names to features if the features are unnamed, whereas sklearn doesn't create artificial names, and also doesn't create the feature_names_in_ attribute in that case. I wanted to confirm that we want to add feature_names_in_ even when no feature names are provided.
Sounds good, will fix.

I have less of an opinion on this one, but based on the SLEP, it does look like it should be specifically for estimators with the .transform() method.
That method being prefixed with a _ suggests it's internal to scikit-learn. Can you find me some authoritative source saying that projects implementing their own estimators are encouraged to call that method? The comment you linked above is a specific recommendation from a scikit-learn maintainer, but …

Let's please pause on this work until some scikit-learn maintainers respond there.
Alright @nicklamiller I've read through the response at scikit-learn/scikit-learn#28337 (comment) carefully... I think we have enough information to proceed. I think we should do the following here in LightGBM: …

Let's try to follow, as closely as possible, the way that scikit-learn itself does this. Are you interested in continuing this?
(branch force-pushed from 10d5301 to 9e168f2)
@jameslamb I tried to closely follow … The few classes that inherit from an estimator-like mixin (…) … So it looks like get_feature_names_out is in fact defined almost exclusively for transformers. I've still added it here but since LGBMModel and its subclasses don't apply transformations to features and aren't meta-estimators, I simply return feature_names_in_ within the method.

Please let me know how this looks!
(branch force-pushed from 8698523 to 574d9ce)
Thanks so much for your persistence with this!
The implementations look good to me. I just have some recommendations on the tests.
> So it looks like get_feature_names_out is in fact defined almost exclusively for transformers. I've still added it here but since LGBMModel and its subclasses don't apply transformations to features and aren't meta-estimators, I simply return feature_names_in_ within the method.
I asked for it to be added based on this statement from @adrinjalali in scikit-learn/scikit-learn#28337 (comment): "scikit-learn estimators all add ... get_feature_names_out()".

I think the choice to just return feature_names_in_ makes sense for lightgbm, thank you.
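For illustration, the delegation described there might look roughly like this (the signature follows the usual scikit-learn convention; this is a sketch, not necessarily the PR's exact code):

```python
def get_feature_names_out(self, input_features=None):
    """Return output feature names; LGBMModel applies no feature
    transformations, so output names equal the input names."""
    return self.feature_names_in_
```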
Please note... I am going to try to push this week to get LightGBM v4.4.0 out, so we can have a release up before NumPy 2.0 is released June 16th (#6439 (comment)). So if you don't have time to address this feedback in the next couple days, then this will be in the next release of LightGBM after that.
```python
# as_frame=True means input has column names and these should propagate to fitted model
X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
```
Suggested change:

```python
X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
# X.columns is a pandas Index; convert to a list so the isinstance check holds
col_names = X.columns.to_list()
assert isinstance(col_names, list) and all(isinstance(c, str) for c in col_names), "input data must have feature names for this test to cover the expected functionality"
```
Instead of using a code comment, could you please test for this directly? That'd ensure that if load_digits() behavior around feature names ever changes, this test will fail and alert us instead of silently passing or maybe failing in some other hard-to-understand way.
```python
        model.fit(X, y, group=group)
    else:
        model.fit(X, y)
    np.testing.assert_array_equal(est.feature_names_in_, X.columns)
```
Suggested change:

```python
np.testing.assert_array_equal(model.feature_names_in_, X.columns)
```
Instead of doing this for-loop approach, could you please change these tests to parametrize over classes, like this?

```python
@pytest.mark.parametrize("estimator_class", [lgb.LGBMModel, lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker])
```
That'd reduce the risk of mistakes like this one (where only the LGBMModel instance, est, is being tested).
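For reference, a parametrized version of the pandas-input test might look like this sketch (it folds in the reviewer's later suggestions about n_estimators=2, num_leaves=7, and a single query group; details are assumptions, not the final merged code):

```python
import lightgbm as lgb
import numpy as np
import pytest
from sklearn.datasets import load_digits

@pytest.mark.parametrize(
    "estimator_class", [lgb.LGBMModel, lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker]
)
def test_getting_feature_names_in_pd_input(estimator_class):
    # as_frame=True means input has column names and these should propagate to fitted model
    X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
    params = {"n_estimators": 2, "num_leaves": 7}
    if estimator_class is lgb.LGBMModel:
        model = estimator_class(**params, objective="binary")
    else:
        model = estimator_class(**params)
    if isinstance(model, lgb.LGBMRanker):
        model.fit(X, y, group=[X.shape[0]])
    else:
        model.fit(X, y)
    np.testing.assert_array_equal(model.feature_names_in_, X.columns)
```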
```python
est = lgb.LGBMModel(n_estimators=5, objective="binary")
clf = lgb.LGBMClassifier(n_estimators=5)
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
```
Since the thing being tested in this PR isn't really dependent on the content of the learned model, could you please use n_estimators=2 and num_leaves=7 in all the tests? That'd make the tests slightly faster and cheaper without reducing their effectiveness in detecting issues.
```python
reg = lgb.LGBMRegressor(n_estimators=5)
rnk = lgb.LGBMRanker(n_estimators=5)
models = (est, clf, reg, rnk)
group = np.full(shape=(X.shape[0] // 2,), fill_value=2)  # Just an example group
```
Suggested change:

```python
group = [X.shape[0]]
```
For simplicity, please just treat all samples in X as part of a single query group. LightGBM supports that, and it won't materially change the effectiveness of these tests.
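For context, the group parameter lists the sizes of consecutive query groups and must sum to the number of rows, so a single-element list puts every row in one group. A toy illustration (the data values here are arbitrary):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(30, 4)
y = np.random.randint(0, 3, size=30)  # relevance labels
ranker = lgb.LGBMRanker(n_estimators=2, num_leaves=7, min_child_samples=5)
ranker.fit(X, y, group=[X.shape[0]])  # one query group containing all 30 rows
```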
```diff
@@ -1290,6 +1290,90 @@ def test_max_depth_warning_is_never_raised(capsys, estimator_class, max_depth):
     assert "Provided parameters constrain tree depth" not in capsys.readouterr().out


 def test_getting_feature_names_in_np_input():
```
There is quite a lot of repetition in these tests. Generally I'm supportive of repetition in tests in favor of making it easier to diagnose issues or selectively include / exclude tests... but for the cases of .feature_names_in_ and .get_feature_names_out(), I think the tests should be combined.

So can you please reduce these 4 tests down to 2? One for numpy input without feature names, one for pandas input with feature names?
Ending with assertions like this:
```python
expected_col_names = np.array([f"Column_{i}" for i in range(X.shape[1])])
np.testing.assert_array_equal(model.feature_names_in_, expected_col_names)
np.testing.assert_array_equal(model.get_feature_names_out(), expected_col_names)
```
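Put together, the combined numpy-input test might look something like this sketch (assuming LightGBM's auto-generated Column_<i> names for unnamed features; not the final merged code):

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_digits

def test_getting_feature_names_in_np_input():
    # without as_frame=True, load_digits returns plain numpy arrays with no names
    X, y = load_digits(n_class=2, return_X_y=True)
    model = lgb.LGBMClassifier(n_estimators=2, num_leaves=7).fit(X, y)
    expected_col_names = np.array([f"Column_{i}" for i in range(X.shape[1])])
    np.testing.assert_array_equal(model.feature_names_in_, expected_col_names)
```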
I think for lightgbm's classifiers and regressors, it makes sense to not implement get_feature_names_out.
Alright, thanks for that clarification. @nicklamiller please remove that method then.
@jameslamb Thanks very much for the suggestions! I removed get_feature_names_out.
Thanks very much for keeping this up to date with master. I'll review again soon, but it's too late for this to make it into the v4.4.0 release (#6439). Sorry about that, but we are rushing to get that release up before numpy 2.0 is released.
Looks great!
Thank you VERY much for your patience and persistence on this. We really appreciate it and would love to have you come back and contribute more in the future!
@borchero could you re-run the failing Azure check and merge this if it passes? I'll be away from my laptop for the next day or 2.
I just restarted that failing CI job, will check back in a bit.
Sorry @jameslamb, I saw your comment only now! Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄
No problem! I know, I also sometimes have to log into Azure DevOps multiple times and in different ways to finally get in. I'll send you a private message with some tips to try.
Thanks so much @nicklamiller! I'm sorry this was such a long process for you. We hope you'll consider coming back and contributing more in the future. If you're interested in helping with the Python package specifically, this would be a good next place to go: #6361
@jameslamb thank you for all the help and guidance, I'm looking forward to contributing more!
Fixes: #6279
Related: scikit-learn/scikit-learn#28337