Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enhancement] remove contiguous check from _check_array #2185

Merged
merged 24 commits into from
Nov 27, 2024

Conversation

icfaust
Copy link
Contributor

@icfaust icfaust commented Nov 23, 2024

Description

This PR makes to_table handle non-contiguous arrays natively. The checks in _check_array are hardcoded for numpy inputs, and are applied regardless of the circumstance. This makes the check seamless, will remove a blocker for other non-numpy data types, and will help in the rollout of the new finite checker, where a future refactor will remove _check_array entirely and enforce the use of validate_data and _check_sample_weight in sklearnex. This also removes three TODOs listed in the codebase.


PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, PR with docs update doesn't require checkboxes for performance while PR with any change in actual code should have checkboxes and justify how this code change is expected to affect performance (or justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added a respective label(s) to PR if I have a permission for that.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@icfaust
Copy link
Contributor Author

icfaust commented Nov 24, 2024

/intelci: run

@@ -153,15 +153,6 @@ def _check_array(

if sp.issparse(array):
return array

# TODO: Convert this kind of arrays to a table like in daal4py
if not array.flags.aligned and not array.flags.writeable:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From numpy documentation: https://numpy.org/devdocs/dev/alignment.html only structured arrays will cause this, meaning that this is a non issue based on the dtypes of use for oneDAL (float32, float64, int32, and int64), and would failt to convert in table.cpp anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think numpy also allows creating non-aligned arrays out of custom non-owned pointers, for example through PyArray_SimpleNewFromData, so in theory there could be a non-aligned float32 array or similar. But it'd be a very unlikely input, since default allocators in most platforms are aligned.

Also an easier conversion could be through np.require.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll probably because of this too I will make a stricter check on C++ side https://github.com/intel/scikit-learn-intelex/blob/main/onedal/decomposition/pca.py#L150 probably using https://numpy.org/devdocs/reference/c-api/array.html#c.PyArray_FromArray where I can specify the requirements to be aligned and owned (rather than use PyArray_GETCONTIGUOUS)

Copy link
Contributor Author

@icfaust icfaust Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-cortes-intel I looked into it, the checks PyArray_ISFARRAY_RO and PyArray_ISCARRAY_RO check for alignment natively (https://numpy.org/devdocs/reference/c-api/array.html#c.PyArray_ISFARRAY_RO), and then PyArray_GETCONTIGUOUS also returns an aligned (well-behaved) copy regardless of the ownership. The change in the checks was to make sure that numpy objects wouldn't cause an infinite recursion. So the issue was taken care of, even if I didn't realize it. Just as a note, I am trying to minimize the number of numpy calls because of upcoming array_api support which we cannot rely on direct numpy calls on non-numpy arrays (say dpctl tensors).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With respect to non-ownership, I don't want to step on anyone's toes, as changes to PCA are occuring here which may inpact the ownership check there: #2106

@icfaust icfaust changed the title [experiment] remove contiguous check [enhancement] remove contiguous check Nov 24, 2024
@icfaust icfaust changed the title [enhancement] remove contiguous check [enhancement] remove contiguous check from _check_array Nov 24, 2024
@icfaust
Copy link
Contributor Author

icfaust commented Nov 25, 2024

/intelci: run

@icfaust icfaust marked this pull request as ready for review November 25, 2024 09:09
@icfaust icfaust added the enhancement New feature or request label Nov 25, 2024
if (!PyArray_ISCARRAY_RO(ary) && !PyArray_ISFARRAY_RO(ary)) {
// NOTE: this will make a C-contiguous deep copy of the data
// this is expected to be a special case
ary = PyArray_GETCONTIGUOUS(ary);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, could you explain what would happen if an error occurs here? Additionally, is there a risk of a memory leak in this scenario?
For example, is there some error flag that should be checked?
Wouldn't it be safer to use the Python NumPy API for such checks and conversions?

Copy link
Contributor Author

@icfaust icfaust Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Decref is necessary to prevent a memory leak (otherwise leaks in BasicStatistics occur in all CIs), the memory is effectively owned by the table object, who will call its destructor with the pycapsule. This work makes to_table support non-contiguous arrays, which will simplify the python coding occurring before to_table. This work will enable _check_array to be moved from onedal to sklearnex simplifying the rollout of the new finite checker to SVM and neighbors algorithms. There is no error handling available (that I can find) in the numpy C-api as also numpy generally doesn't raise errors like that. The checks that it is 1) a numpy array and 2) of certain aligned and type characteristics will prevent errors, and the use of the numpy c-api is very standardized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh thank you for doing that research @david-cortes-intel , that would mean in the case of a failure in allocation, it will fail to decref. I will add a check there.

Comment on lines 70 to 85
if (layout == dal::data_layout::unknown){
py::object copy;
if (py::hasattr(obj, "copy")){
copy = obj.attr("copy")();
}
else if (py::hasattr(obj, "__array_namespace__")){
const auto space = obj.attr("__array_namespace__")();
copy = space.attr("asarray")(obj, "copy"_a = true);
}
else {
throw std::runtime_error("Wrong strides");
}
res = convert_to_homogen_impl<Type>(copy);
copy.dec_ref();
return res;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments here for other reviewers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@icfaust
Copy link
Contributor Author

icfaust commented Nov 25, 2024

/intelci: run

Copy link
Contributor

@ahuber21 ahuber21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is touching shared code we should see at least one performance measurement that shows no impact on runtime or accuracy.

@samir-nasibli
Copy link
Contributor

@icfaust please add proper description to the PR and share benchmark validation results

Copy link
Contributor

@Alexsandruss Alexsandruss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, provide detailed explanation of changes in next time.

@icfaust
Copy link
Contributor Author

icfaust commented Nov 26, 2024

Sorry about that, updated the description.

@icfaust
Copy link
Contributor Author

icfaust commented Nov 26, 2024

/intelci: run

// NOTE: this will make a C-contiguous deep copy of the data
// if possible, this is expected to be a special case
py::object copy;
if (py::hasattr(obj, "copy")){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed that the copy will be C-contiguous? It will always be the case with numpy, but what about other packages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as of now only dpctl and dpnp use this standard:
https://github.com/IntelPython/dpnp/blob/master/dpnp/dpnp_container.py#L137 will specify "K" to dpctl.tensor's copy: https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/_copy_utils.py#L574 which if its not F aligned (which is the case) it will default to C alignment. asarray in dpctl https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/_ctors.py#L483 also has "K" as default. We test this circumstance in the test that is modified in this PR for dpnp and dpctl. If a new sycl_usm_namespace array type comes out, we can come back to this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other libraries that implement the array protocol? XArray also has "copy" for example:
https://docs.xarray.dev/en/latest/generated/xarray.DataArray.copy.html

Copy link
Contributor

@ahuber21 ahuber21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please wait for green CI

@icfaust icfaust merged commit 7dd395b into uxlfoundation:main Nov 27, 2024
26 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants