-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
WIP Add function to convert array namespace and device to reference array #31829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
try: | ||
# Note will copy if required | ||
array_converted = xp_ref.from_dlpack(array, device=device_ref) | ||
except AttributeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to only except AttributeError
, which I think occurs if input or output namespace does not support dlpack.
from_dlpack
can give 2 (or more) other errors
- BufferError - The dlpack and dlpack_device methods on the input array may raise BufferError when the data cannot be exported as DLPack (e.g., incompatible dtype, strides, or device). It may also raise other errors when export fails for other reasons (e.g., not enough memory available to materialize the data). from_dlpack must propagate such exceptions.
- I thought that if dlpack fails to convert due to one of the above errors, it would not make sense to try ourselves manually.
- ValueError - If data exchange is possible via an explicit copy but copy is set to False.
- I've left
copy=None
, allowing it to copy if need be, so this error is not relevant. I am not sure about thecopy=None
setting though, it is a lot of memory usage.
- I've left
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking to Evgeni about what could possibly cause a BufferError
- here are some ideas, but but I don't know if any of these will cause an error:
- numpy supports negative strides but torch does not (https://discuss.pytorch.org/t/negative-strides-in-tensor-error/134287) (I don't think torch supports DLPack atm, so hard to figure out)
- numpy structured array?
- numpy memmapped array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we could simplify _convert_to_reference
a bit when xp_ref
is NumPy:
if _is_numpy_namespace(xp_ref):
return tuple([_convert_to_numpy(array, get_namespace(array)) for array in arrays])
However, I'm not sure this offers much benefit beyond readability, since in most cases there are only two arrays to convert: y_true
and sample_weight
.
I'll have to give it some more thought.
At the moment that would be simpler, but I think
For metrics I think this would mostly be it, but for estimators there could be other arrays to convert. |
@@ -466,6 +466,34 @@ def get_namespace_and_device( | |||
return xp, False, arrays_device | |||
|
|||
|
|||
def _convert_to_reference(*, reference, arrays): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit/pet peeve: sklearn.utils._array_api
is a private module, we don't need to make functions private here. It feels like repeating ourselves. I've made this argument before and you can tell I've not convinced everyone because there are many functions here that start with a _
but I will keep trying :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also tried at some point. It's a lost battle 😄
def _convert_to_reference(*, reference, arrays): | ||
"""Convert `arrays` to `reference` array's namespace and device.""" | ||
xp_ref, _, device_ref = get_namespace_and_device(reference) | ||
arrays_converted_list = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about calling it converted_arrays
? arrays_converted_list
feels backwards and we can use the plural "s" to let people know it is a sequence/collection of more than one thing
xp, _, device = get_namespace_and_device(y_pred) | ||
y_true, sample_weight = _convert_to_reference( | ||
reference=y_pred, arrays=(y_true, sample_weight) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that returning xp
and device
from convert_to_reference
feels like too much. But it also feels weird to have this "duplication" of get_namespace_and_device
and convert_to_reference
. The only alternative I can think of, though I'm not sure I like it, is to have convert_to_reference(arrays, namespace=xp, device=device)
(but then maybe rename it to move_to
or some such.
@@ -466,6 +466,34 @@ def get_namespace_and_device( | |||
return xp, False, arrays_device | |||
|
|||
|
|||
def _convert_to_reference(*, reference, arrays): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a great fan of the *
. Let people use or not use keyword arguments. You can foo(bar=42)
with def foo(bar):
People are grownups :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, not using named arguments can make the code very hard to understand _convert_to_reference(some_array, other_arrays)
.
Forcing _convert_to_reference(reference=some_array, arrays=other_arrays)
would ensure that we always call this particular function in an explicit way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that wouldn't be needed if we instead implement the move_to
API suggested in #31829 (comment).
# Note will copy if required | ||
array_converted = xp_ref.from_dlpack(array, device=device_ref) | ||
except AttributeError: | ||
# Convert to numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert to numpy | |
# Converting to numpy is tricky, handle this via a dedicated function |
# Convert to numpy | ||
if _is_numpy_namespace(xp_ref): | ||
array_converted = _convert_to_numpy(array, xp_array) | ||
# Convert from numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert from numpy | |
# Convert from numpy, all array libraries know how to use a Numpy array |
elif _is_numpy_namespace(xp_array): | ||
array_converted = xp_ref.asarray(array, device=device_ref) | ||
else: | ||
# Convert to numpy then to reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert to numpy then to reference | |
# There is no generic way to convert from namespace A to B | |
# So we first convert from A to numpy and then from numpy to B | |
# The way to avoid this round trip is to lobby for DLpack support | |
# in libraries A and B |
What's the difference with |
I think the goal of both is the same. I also think that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would be in favor of renaming this method to move_to(arrays, namespace, device)
since the device and namespace will already be inspected in the caller most of the time for other purposes.
arrays_converted_list.append(array) | ||
else: | ||
try: | ||
# Note will copy if required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Note will copy if required | |
# The dlpack protocol is the future proof and library agnostic method | |
# to transfer arrays across namespace and device boundaries hence | |
# this method is attempted first and going through NumPy is only | |
# used as fallback in case of failure. | |
# Note: copy=None is the default since 2023.12. Namespace libraries | |
# should trigger a copy automatically only if needed. |
Reference Issues/PRs
Towards #28668 and #31274
What does this implement/fix? Explain your changes.
Adds a function that converts arrays to the namespace and device of the reference array.
Tries DLPack first, and if either array does not support it, tries to convert manually.
Any other comments?
This is an initial attempt, and what it would look like in a simple metric. Feedback welcome. (Tests to come)
I thought about also outputting the namespace and device of the reference array, to avoid the second call to
get_namespace_and_device
, but I thought it would make the outputs too messy.cc @ogrisel @betatim @StefanieSenger @virchan @lesteve