Skip to content

Fix uncomparable values in SimpleImputer tie-breaking strategy #31820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

AlexandreAbraham
Copy link
Contributor

Fixes #31717

When encountering uncomparable types, SimpleImputer's 'min' strategy previously raised an error. This commit resolves the issue by selecting the value with the smallest hash of its pickled representation as a fallback.

The pickled value is used because some special values such as None are not handled natively by murmurhash.

When encountering uncomparable types, SimpleImputer's 'min'
strategy previously raised an error. This commit resolves the
issue by selecting the value with the smallest hash of its pickled
representation as a fallback.

Fix scikit-learn#31717
Copy link

github-actions bot commented Jul 22, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 51b98c1. Link to the linter CI: here

return min(items)
except TypeError as e:
if "'<' not supported between" in str(e):
return min(items, key=lambda x: murmurhash3_32(pickle.dumps(x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's not a bit overkill. For the use case of SimpleImputer, I believe that using the string representation of objects would be fine. It's deterministic across versions, os, envs... for usual types (int, float, str, None). It's not on more complex objects but I don't think that we expect complex objects to be passed to SimpleImputer.

Suggested change
return min(items, key=lambda x: murmurhash3_32(pickle.dumps(x)))
return min(items, key=str)

What do you think ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of that too but I am bugged by the 1 out of a million case where there is a tie between None and 'None' that is shuffle-sensitive:

In [1]: min([None, 'None'], key=str)

In [2]: min(['None', None], key=str)
Out[2]: 'None'

repr does not fail in that case but it may in others.

Actually, we do not really need the hash, the pickle dumped string should be unique.

            return min(items, key=lambda x: pickle.dumps(x))

Is that better? Still needs pickle though...
Actually, that's where the other PR comes in handy, using the type breaks the tie.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually the same for 1 and "1" and so on which is more likely to happen.

I was undecided between pickle.dumps(x) and (str(type(x), str(x)), but a quick benchmark showed that the latter is much faster, so I'm in favor of choosing this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe another reason to not rely on pickle is that IIRC there are very little promises about the contents of a pickle with respect to the Python version (I think you'd have to select an explicit version of the protocol?). Generally it feels like using pickle is somehow "too complicated"

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments. Please also add a changelog entry

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I directly pushed a commit to use (str(type(x)), str(x)) as key. LGTM.
@betatim is this solution fine for you ?

@jeremiedbb jeremiedbb added the Waiting for Second Reviewer First reviewer is done, need a second one! label Aug 1, 2025
@jeremiedbb jeremiedbb added this to the 1.7.2 milestone Aug 1, 2025
@AlexandreAbraham
Copy link
Contributor Author

Sorry, I was off, I could not implement the requests. let me know if there is anything more I can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:impute Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SimpleImputer fails in "most_frequent" if incomparable types only if ties
3 participants