-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
Fix uncomparable values in SimpleImputer tie-breaking strategy #31820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
When encountering uncomparable types, SimpleImputer's 'min' strategy previously raised an error. This commit resolves the issue by selecting the value with the smallest hash of its pickled representation as a fallback. Fix scikit-learn#31717
sklearn/impute/_base.py
Outdated
return min(items) | ||
except TypeError as e: | ||
if "'<' not supported between" in str(e): | ||
return min(items, key=lambda x: murmurhash3_32(pickle.dumps(x))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's not a bit overkill. For the use case of SimpleImputer, I believe that using the string representation of objects would be fine. It's deterministic across versions, os, envs... for usual types (int, float, str, None). It's not on more complex objects but I don't think that we expect complex objects to be passed to SimpleImputer.
return min(items, key=lambda x: murmurhash3_32(pickle.dumps(x))) | |
return min(items, key=str) |
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of that too but I am bugged by the 1 out of a million case where there is a tie between None
and 'None'
that is shuffle-sensitive:
In [1]: min([None, 'None'], key=str)
In [2]: min(['None', None], key=str)
Out[2]: 'None'
repr
does not fail in that case but it may in others.
Actually, we do not really need the hash, the pickle dumped string should be unique.
return min(items, key=lambda x: pickle.dumps(x))
Is that better? Still needs pickle though...
Actually, that's where the other PR comes in handy, using the type breaks the tie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually the same for 1
and "1"
and so on which is more likely to happen.
I was undecided between pickle.dumps(x)
and (str(type(x), str(x))
, but a quick benchmark showed that the latter is much faster, so I'm in favor of choosing this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe another reason to not rely on pickle
is that IIRC there are very little promises about the contents of a pickle with respect to the Python version (I think you'd have to select an explicit version of the protocol?). Generally it feels like using pickle
is somehow "too complicated"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comments. Please also add a changelog entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I directly pushed a commit to use (str(type(x)), str(x))
as key. LGTM.
@betatim is this solution fine for you ?
Sorry, I was off, I could not implement the requests. let me know if there is anything more I can do. |
Fixes #31717
When encountering uncomparable types, SimpleImputer's 'min' strategy previously raised an error. This commit resolves the issue by selecting the value with the smallest hash of its pickled representation as a fallback.
The pickled value is used because some special values such as
None
are not handled natively by murmurhash.