MNT Refactor `_average_weighted_percentile` to avoid double sort #31775
Conversation
```python
result = xp.where(
    is_fraction_above,
    array[percentile_in_sorted, col_indices],
```
I initially thought this should be `percentile_plus_one_in_sorted`, as in the paper when `g > 0`, but `searchsorted` defaults to `side='left'` (the equality is on the right), whereas the paper defines `j <= pn < j+1`. `searchsorted` effectively gives `i-1 < pn <= i`, whereas the paper has `j <= pn < j+1`. This means that when `pn` is strictly greater than the left-hand side, `searchsorted`'s `i` equals the paper's `j+1`. When the quantile exactly matches an index, `searchsorted`'s `i` equals the paper's `j` (as the equality is on opposite sides in the paper vs `searchsorted`).
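A small NumPy sketch of this boundary convention (the cumulative-weight values below are illustrative, not from the PR):

```python
import numpy as np

# Illustrative non-decreasing cumulative weights.
cum_weights = np.array([0.2, 0.5, 0.7, 0.9, 1.0])

# side="left" returns the first index i with cum_weights[i] >= pn,
# i.e. i satisfies cum_weights[i-1] < pn <= cum_weights[i].
print(np.searchsorted(cum_weights, 0.5, side="left"))  # 1: exact match, so i == j
print(np.searchsorted(cum_weights, 0.6, side="left"))  # 2: pn strictly above, so i == j + 1
```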
Thanks very much @lucyleeow for diving into this. I pushed a commit to make the randomized NumPy equivalence tests stronger and checked locally that they pass with all random seeds:
```bash
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -vl sklearn/utils/tests/test_stats.py
```
It seems that this PR fixes a rare edge case bug found in another PR (in addition to the CPU speed-up and memory improvements): #29641. We could write a changelog entry to document this. However, doing so would require crafting a minimal reproducer to precisely characterize the conditions under which this edge case can be triggered. Maybe we could have a generic changelog entry such as "improve CPU and memory usage in estimators and metric functions that rely on weighted percentiles and better handle edge cases" or something similar.
Please fix the conflicts and feel free to ping me again for the final review.
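For reference, here is a minimal sketch of the kind of NumPy equivalence check being described, assuming NumPy >= 2.0 (which accepts `weights` with `method="inverted_cdf"`); the actual parametrization in `test_stats.py` differs:

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.default_rng(0)
x = rng.normal(size=100)
w = rng.integers(1, 5, size=100).astype(np.float64)

# NumPy >= 2.0 supports weights for method="inverted_cdf".
expected = np.quantile(x, 0.3, method="inverted_cdf", weights=w)
actual = _weighted_percentile(x, w, 30)
np.testing.assert_allclose(actual, expected)
```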
Suggested change:

```diff
-When `g=0` and `percentile_indices` is at max index, quantile is perfectly at 100
-and take the average of 2x the max index.
+When `g=0` and `percentile_indices` is at max index, percentile rank is perfectly
+at 100 and take the average of 2x the max index.
```
Suggested change:

```diff
-# Note for both spercentile_rank`s`,`percentile_indices` is already at max index
+# Note for both percentile_rank`s`,`percentile_indices` is already at max index
```
Suggested change:

```diff
-Uses 'inverted_cdf' method when `average=False` (default) and
-'averaged_inverted_cdf' when `average=True`.
+Implement an array API compatible (weighted version) of NumPy's 'inverted_cdf'
+method when `average=False` (default) and 'averaged_inverted_cdf' when
+`average=True`.
```
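For context, the two NumPy methods referenced here differ only when the percentile rank lands exactly on a cumulative-probability boundary; a quick unweighted illustration (method names available since NumPy 1.22):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# q=0.5 lands exactly on a boundary (q * n == 2), where the methods differ:
print(np.quantile(x, 0.5, method="inverted_cdf"))           # 2.0
print(np.quantile(x, 0.5, method="averaged_inverted_cdf"))  # 2.5, average of 2.0 and 3.0
```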
Reference Issues/PRs
Supersedes #30945
What does this implement/fix? Explain your changes.
Refactor `_average_weighted_percentile` so we are not just performing `_weighted_percentile` twice, which avoids sorting and computing the cumulative sum twice.

#30945 essentially uses the sorted indices and calculates `_weighted_percentile(-array, 100 - percentile_rank)`. This was verbose and required computing the cumulative sum again on the negated array (you could have used symmetry to avoid computing the cumulative sum in cases when the fraction above is greater than 0, i.e., `g > 0` from Hyndman and Fan).

I've followed the Hyndman and Fan computation more closely: calculate `g` and just use `j+1` (since we already know `j`). This did make handling the case where `j+1` has a sample weight of 0 (or when there is a sample weight of 0 at the end of the array) more complex.

Any other comments?
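To make the single-sort idea concrete, here is a hedged NumPy sketch of the averaged inverted-CDF logic described above. This is my own simplification, not the PR's code: the helper name is hypothetical, and it ignores the zero-sample-weight edge cases the PR handles.

```python
import numpy as np

def averaged_weighted_percentile_sketch(x, w, percentile_rank):
    # Sort once; everything below reuses this single sort.
    order = np.argsort(x)
    x_sorted = np.asarray(x, dtype=float)[order]
    w_sorted = np.asarray(w, dtype=float)[order]
    cum_w = np.cumsum(w_sorted)
    pn = percentile_rank / 100.0 * cum_w[-1]

    # First index i with cum_w[i] >= pn (searchsorted's "left" convention,
    # as discussed in the review thread above).
    i = np.searchsorted(cum_w, pn, side="left")

    # g == 0 means pn falls exactly on a boundary: average with the next
    # value, unless i is already the max index (pn exactly at 100).
    if np.isclose(cum_w[i], pn) and i + 1 < len(x_sorted):
        return 0.5 * (x_sorted[i] + x_sorted[i + 1])
    return x_sorted[i]

# Unweighted sanity check against NumPy's reference method:
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.ones_like(x)
print(averaged_weighted_percentile_sketch(x, w, 50))        # 2.5
print(np.quantile(x, 0.5, method="averaged_inverted_cdf"))  # 2.5
```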