[enhancement] remove contiguous check from _check_array
#2185
/intelci: run
@@ -153,15 +153,6 @@ def _check_array(

    if sp.issparse(array):
        return array

    # TODO: Convert this kind of arrays to a table like in daal4py
    if not array.flags.aligned and not array.flags.writeable:
From the numpy documentation (https://numpy.org/devdocs/dev/alignment.html), only structured arrays will cause this, so it is a non-issue for the dtypes used by oneDAL (float32, float64, int32, and int64), and such arrays would fail to convert in table.cpp anyway.
I think numpy also allows creating non-aligned arrays out of custom non-owned pointers, for example through PyArray_SimpleNewFromData, so in theory there could be a non-aligned float32 array or similar. But it would be a very unlikely input, since default allocators on most platforms return aligned memory.
Also an easier conversion could be through np.require.
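To make the point above concrete, here is a minimal sketch (not code from this PR) of how a non-aligned float32 array can arise without structured dtypes, and how np.require can return an aligned copy. The byte-buffer trick and the variable names are illustrative assumptions; it assumes a platform where float32 requires 4-byte alignment.

```python
import numpy as np

n = 4
itemsize = np.dtype(np.float32).itemsize  # 4 bytes

# Over-allocate a byte buffer, then choose a start offset whose address
# is 1 (mod 4), so a float32 view on it cannot be aligned.
raw = np.zeros(n * itemsize + itemsize, dtype=np.uint8)
offset = (-raw.ctypes.data) % itemsize + 1
unaligned = raw[offset:offset + n * itemsize].view(np.float32)
unaligned[:] = [1.0, 2.0, 3.0, 4.0]

# np.require returns the input unchanged when it already satisfies the
# requirements, and a copy otherwise. The copy lives in freshly
# allocated (aligned) memory.
fixed = np.require(unaligned, requirements=["ALIGNED"])

print(unaligned.flags.aligned, fixed.flags.aligned)
```

The same call accepts further requirements (for example "C" for C-contiguity), so one call can cover both layout and alignment.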
Good point. Probably because of this too, I will make a stricter check on the C++ side (https://github.com/intel/scikit-learn-intelex/blob/main/onedal/decomposition/pca.py#L150), probably using https://numpy.org/devdocs/reference/c-api/array.html#c.PyArray_FromArray, where I can specify the requirements to be aligned and owned (rather than use PyArray_GETCONTIGUOUS).
@david-cortes-intel I looked into it: the checks PyArray_ISFARRAY_RO and PyArray_ISCARRAY_RO check for alignment natively (https://numpy.org/devdocs/reference/c-api/array.html#c.PyArray_ISFARRAY_RO), and PyArray_GETCONTIGUOUS also returns an aligned (well-behaved) copy regardless of ownership. The change in the checks was to make sure that numpy objects wouldn't cause an infinite recursion. So the issue was taken care of, even if I didn't realize it. Just as a note, I am trying to minimize the number of numpy calls because of upcoming array_api support, where we cannot rely on direct numpy calls on non-numpy arrays (say dpctl tensors).
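For readers less familiar with the C-API macros named above, the following is a rough Python-level analogue, offered as a sketch rather than as what the C++ code does. The helper names is_carray_ro, is_farray_ro and to_supported_layout are hypothetical; the flag combination (contiguous, aligned, native byte order, writeability not required) follows the numpy documentation of the two macros.

```python
import numpy as np

def is_carray_ro(a):
    # Analogue of PyArray_ISCARRAY_RO: C-contiguous, aligned, native byte order.
    return bool(a.flags.c_contiguous and a.flags.aligned and a.dtype.isnative)

def is_farray_ro(a):
    # Analogue of PyArray_ISFARRAY_RO: Fortran-contiguous, aligned, native byte order.
    return bool(a.flags.f_contiguous and a.flags.aligned and a.dtype.isnative)

def to_supported_layout(a):
    # Pass well-behaved arrays through untouched; otherwise fall back to a
    # C-contiguous, aligned copy in native byte order.
    if is_carray_ro(a) or is_farray_ro(a):
        return a
    native = a.dtype.newbyteorder("=")
    return np.require(a, dtype=native, requirements=["C", "ALIGNED"])

base = np.arange(12, dtype=np.float64).reshape(3, 4)
strided = base[:, ::2]          # neither C- nor F-contiguous
converted = to_supported_layout(strided)
print(is_carray_ro(strided), is_carray_ro(converted))
```

Note that the copy happens only for inputs that fail both checks, which matches the "special case" wording in the diff comment.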
With respect to non-ownership, I don't want to step on anyone's toes, as changes to PCA are occurring here which may impact the ownership check there: #2106
_check_array
/intelci: run
if (!PyArray_ISCARRAY_RO(ary) && !PyArray_ISFARRAY_RO(ary)) {
    // NOTE: this will make a C-contiguous deep copy of the data
    // this is expected to be a special case
    ary = PyArray_GETCONTIGUOUS(ary);
For clarity, could you explain what would happen if an error occurs here? Additionally, is there a risk of a memory leak in this scenario?
For example, is there some error flag that should be checked?
Wouldn't it be safer to use the Python NumPy API for such checks and conversions?
The Decref is necessary to prevent a memory leak (otherwise leaks in BasicStatistics occur in all CIs); the memory is effectively owned by the table object, which will call its destructor with the pycapsule. This work makes to_table support non-contiguous arrays, which will simplify the Python code that runs before to_table. It will enable _check_array to be moved from onedal to sklearnex, simplifying the rollout of the new finite checker to SVM and neighbors algorithms. There is no error handling available (that I can find) in the numpy C-API, as numpy generally doesn't raise errors like that. The checks that it is 1) a numpy array and 2) of certain alignment and type characteristics will prevent errors, and the use of the numpy C-API is very standardized.
It looks like it can fail to allocate, in which case it will return NULL:
https://github.com/numpy/numpy/blob/cf9598572528318a54489b3c9ed5f65ef042e8c8/numpy/_core/src/multiarray/convert.c#L495
Ahh, thank you for doing that research @david-cortes-intel. That would mean that if the allocation fails, it will fail to decref. I will add a check there.
if (layout == dal::data_layout::unknown){
    py::object copy;
    if (py::hasattr(obj, "copy")){
        copy = obj.attr("copy")();
    }
    else if (py::hasattr(obj, "__array_namespace__")){
        const auto space = obj.attr("__array_namespace__")();
        copy = space.attr("asarray")(obj, "copy"_a = true);
    }
    else {
        throw std::runtime_error("Wrong strides");
    }
    res = convert_to_homogen_impl<Type>(copy);
    copy.dec_ref();
    return res;
}
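The fallback order in the C++ fragment above can be read more easily in Python. This is only a sketch of the same dispatch, assuming the three branches shown in the diff; the function name copy_for_conversion is hypothetical and is not part of the PR.

```python
import numpy as np

def copy_for_conversion(obj):
    # 1) Prefer the object's own copy() method (numpy, dpnp, dpctl all have one).
    if hasattr(obj, "copy"):
        return obj.copy()
    # 2) Otherwise ask the array API namespace for an explicit copy.
    if hasattr(obj, "__array_namespace__"):
        xp = obj.__array_namespace__()
        return xp.asarray(obj, copy=True)
    # 3) Nothing usable: mirror the error raised in the C++ code.
    raise RuntimeError("Wrong strides")

strided = np.arange(12.0).reshape(3, 4)[:, ::2]
copied = copy_for_conversion(strided)
print(strided.flags.c_contiguous, copied.flags.c_contiguous)
```

For numpy inputs the first branch already yields a C-contiguous array, because ndarray.copy defaults to order="C"; whether other libraries give the same guarantee is exactly what the review discussion further down asks.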
Please add comments here for other reviewers
done!
/intelci: run
As this is touching shared code we should see at least one performance measurement that shows no impact on runtime or accuracy.
@icfaust please add a proper description to the PR and share benchmark validation results
Please provide a detailed explanation of the changes next time.
Sorry about that, updated the description.
/intelci: run
// NOTE: this will make a C-contiguous deep copy of the data
// if possible, this is expected to be a special case
py::object copy;
if (py::hasattr(obj, "copy")){
Is it guaranteed that the copy will be C-contiguous? It will always be the case with numpy, but what about other packages?
As of now, only dpctl and dpnp use this standard:
https://github.com/IntelPython/dpnp/blob/master/dpnp/dpnp_container.py#L137 will specify "K" to dpctl.tensor's copy: https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/_copy_utils.py#L574 which, if the array is not F-aligned (which is the case here), will default to C alignment. asarray in dpctl https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/_ctors.py#L483 also has "K" as its default. We test this circumstance for dpnp and dpctl in the test that is modified in this PR. If a new sycl_usm_namespace array type comes out, we can come back to this.
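The "K" order semantics referred to above can be illustrated with numpy, which uses the same order keyword. This is a sketch using numpy only; it does not exercise dpctl or dpnp, whose behaviour is described in the comment but not verified here.

```python
import numpy as np

c_base = np.arange(12.0).reshape(3, 4)   # C-ordered
f_base = np.asfortranarray(c_base)       # F-ordered

c_view = c_base[:, ::2]   # strided view, neither C- nor F-contiguous
f_view = f_base[::2, :]   # strided view, neither C- nor F-contiguous

# order="K" keeps the layout as close to the source as possible, so the
# result follows whichever ordering the source strides resemble.
k_from_c = np.copy(c_view, order="K")
k_from_f = np.copy(f_view, order="K")

# ndarray.copy() defaults to order="C", so it is always C-contiguous.
default_from_f = f_view.copy()

print(k_from_c.flags.c_contiguous, k_from_f.flags.f_contiguous,
      default_from_f.flags.c_contiguous)
```

In numpy, a "K" copy of a view taken from an F-ordered array stays F-ordered, so a C-contiguous result under "K" depends on the layout of the source rather than being unconditional.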
What about other libraries that implement the array protocol? XArray also has "copy" for example:
https://docs.xarray.dev/en/latest/generated/xarray.DataArray.copy.html
LGTM, please wait for green CI
Description
This PR makes to_table handle non-contiguous arrays natively. The checks in _check_array are hardcoded for numpy inputs, and are applied regardless of the circumstance. This makes the check seamless, will remove a blocker for other non-numpy data types, and will help in the rollout of the new finite checker, where a future refactor will remove _check_array entirely and enforce the use of validate_data and _check_sample_weight in sklearnex. This also removes three TODOs listed in the codebase.
PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, PR with docs update doesn't require checkboxes for performance while PR with any change in actual code should have checkboxes and justify how this code change is expected to affect performance (or justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance