gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster #137628
Conversation
Title changed: csv.Sniffer._guess_delimiter() 2x faster → csv.Sniffer.sniff() 2x faster
Are the benchmarks done with a PGO+LTO build?
I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.
@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with optimizations enabled, and ideally with pyperformance.
Benchmarks without optimizations are not relevant, so just run those with optimizations.
Title changed: csv.Sniffer.sniff() 2x faster → csv.Sniffer.sniff() delimiter detection 1.5x faster
@picnixz @AA-Turner @ZeroIntensity Thank you for all the comments.
Lib/csv.py (outdated):

```python
seen += 1
charCounts = Counter(line)
for char, count in charCounts.items():
    if ord(char) < 127:
```
Is it faster to do `ord(char)` or `char.isascii()`? Just run `python -m pyperf timeit -s "x='x'" "x.isascii()"` and compare it to `python -m pyperf timeit -s "char='x'" "ord(char) < 127"`. (Ensure that we use a name lookup, `ord(char)` or `x.*`, to avoid inlining the checks.)
```
20:52:34.106551089PM CEST maurycy@gunnbjorn ~/cpython (main) % ./python -m pyperf timeit -s "x='x'" "x.isascii()"
.....................
Mean +- std dev: 13.8 ns +- 0.0 ns
20:52:46.818268693PM CEST maurycy@gunnbjorn ~/cpython (main) % ./python -m pyperf timeit -s "char='x'" "ord(char) < 127"
.....................
Mean +- std dev: 17.4 ns +- 0.0 ns
```
The reason for having `ord` in the first place is preserving the original off-by-one error and not changing the fragile behavior.
Oh. Wait. The check is `< 127`, not `<= 127`. Was it intentional?
Well, there's an easy way to use `isascii()` and still be efficient: after processing the lines, just do `charFrequency.pop('\x7f', None)`. But I don't think we should do it, because counting control characters (e.g., `0x1`) is still as nonsensical as counting `\x7f`.
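(Editorial aside, not from the thread: the only character the two checks disagree on is DEL, `0x7f`. A quick REPL check:)

```python
>>> "\x7f".isascii()   # DEL (0x7f) is 7-bit ASCII, so isascii() accepts it...
True
>>> ord("\x7f") < 127  # ...while the original check excludes it: the off-by-one in question
False
```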
Yes, intentional (Line 370 in b07a267):

```python
ascii = [chr(c) for c in range(127)]  # 7-bit ASCII
```

I agree it doesn't make sense to keep this off-by-one.
`str.isascii()` is much faster than `ord()`, confirmed with the new benchmark:

Geometric mean: 1.60x faster
Lib/csv.py (outdated):

```python
presentCount = sum(counts.values())
zeroCount = seen - presentCount
if zeroCount > 0:
    items = list(counts.items()) + [(0, zeroCount)]
else:
    items = list(counts.items())
```
Suggested change:

```python
items = list(counts.items())
missed_lines = seen - sum(counts.values())
if missed_lines:
    # charFrequency[char][0] can only be deduced now
    # as it cannot be obtained when parsing the lines.
    assert 0 not in counts.keys()
    # Store the number of lines 'char' was missing from.
    items.append((0, missed_lines))
```

I think this is what you want, right?
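(An illustrative equivalence check of the two variants, not from the thread; here `counts` maps a per-line frequency to the number of lines it occurred on:)

```python
from collections import Counter

def variant_original(counts, seen):
    # The PR's version: compute the zero bucket up front.
    presentCount = sum(counts.values())
    zeroCount = seen - presentCount
    if zeroCount > 0:
        return list(counts.items()) + [(0, zeroCount)]
    return list(counts.items())

def variant_suggested(counts, seen):
    # The reviewer's version: append the zero bucket only when lines were missed.
    items = list(counts.items())
    missed_lines = seen - sum(counts.values())
    if missed_lines:
        items.append((0, missed_lines))
    return items

# A char that appeared once on 3 lines and twice on 1 line, out of 6 lines seen.
counts = Counter({1: 3, 2: 1})
assert variant_original(counts, 6) == variant_suggested(counts, 6) == [(1, 3), (2, 1), (0, 2)]
```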
I'm not entirely sure of my suggestion, so someone else needs to check this (I'm sick, hence tired...).
@picnixz It’s OK. I’ve spent way too much time thinking about this code and I believe they’re the same; yours is a bit more readable. :-)

I’m running the benchmark as we speak.

There are only three gotchas:

- Not sure if I agree with the comment (the zeros can be obtained, but it’s inefficient).
- Are we OK with leaving the `assert`?
- There is a very subtle difference between the original code and my code, that is the tie break. I don’t think it matters, though.

Get better soon!
> Not sure if I agree with the comment (the zeros can be obtained, but it’s inefficient).

Yes, if we change the construction. But with the usage of `Counter(line)` it's not possible to obtain them, as we... don't count chars that are missing.

> There is a very subtle difference between the original code and my code, that is the tie break. It should not matter, though.

Oh. Can you give us an example if possible? Then we can add a test as well.

> Are we OK with leaving the `assert`?

This specific assert should be cheap, but we can remove it (along with the comment, which can be misread). It's good that benchmarks are performed with that assert, though.
> > There is a very subtle difference between the original code and my code, that is the tie break. It should not matter, though.
>
> Oh. Can you give us an example if possible? Then we can add a test as well.

Definitely!

FYI, the first test caught my attention because of Line 368 in b07a267:

```python
data = list(filter(None, data.split('\n')))
```

which should be just `str.splitlines()`, but I'm still afraid of touching too much here. :-)
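(Editorial note: the two are not drop-in equivalents, which is presumably why touching it is scary. A quick illustration, not from the thread:)

```python
data = "a,b\n\nc,d\n"
print(list(filter(None, data.split("\n"))))  # ['a,b', 'c,d']      -- empty lines dropped
print(data.splitlines())                     # ['a,b', '', 'c,d']  -- empty line kept
# str.splitlines() also splits on \r, \v, \f, etc., not just \n.
```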
> > Are we OK with leaving the `assert`?
>
> This specific assert should be cheap, but we can remove it (along with the comment, which can be misread). It's good that benchmarks are performed with that assert, though.

I've removed it. The new benchmark ran over 4b62c84 and included the `assert`.
Co-authored-by: Bénédikt Tran <[email protected]>
Title changed: csv.Sniffer.sniff() delimiter detection 1.5x faster → csv.Sniffer.sniff() delimiter detection 1.6x faster
Thank you for yesterday's review.

Thank you!
Love the enthusiasm, but please try to avoid continuously rebasing (pressing "update branch"). It wastes CI time and also puts a ding in all of our inboxes. Updating the branch should generally only be done to resolve merge conflicts.
The basic idea is to not iterate over all ASCII characters and count their frequency on each line in `_guess_delimiter`, but only over the characters actually present, and just backfill zeros.
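A simplified before/after sketch (editorial illustration, not the exact patch; helper names are approximations):

```python
from collections import Counter

def frequencies_old(lines):
    # Old approach: scan every line once per ASCII character,
    # even for characters that never occur in the data.
    ascii_chars = [chr(c) for c in range(127)]  # 7-bit ASCII (the off-by-one excludes DEL)
    charFrequency = {}
    for line in lines:
        for char in ascii_chars:
            freq = line.count(char)
            metaFrequency = charFrequency.get(char, {})
            metaFrequency[freq] = metaFrequency.get(freq, 0) + 1
            charFrequency[char] = metaFrequency
    return charFrequency

def frequencies_new(lines):
    # New approach: count only the characters actually present,
    # then backfill the "appeared 0 times" buckets from the line count.
    charFrequency = {}
    seen = 0
    for line in lines:
        seen += 1
        for char, count in Counter(line).items():
            if char.isascii():  # note: isascii() also accepts DEL (0x7f), unlike range(127)
                metaFrequency = charFrequency.setdefault(char, {})
                metaFrequency[count] = metaFrequency.get(count, 0) + 1
    for counts in charFrequency.values():
        missed = seen - sum(counts.values())
        if missed:
            counts[0] = missed  # lines this character was absent from
    return charFrequency
```

The old inner loop makes 127 passes over each line via `line.count()`; the new one makes a single pass per line, and characters that never occur get no entry at all.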
Benchmark

There is no `csv.Sniffer` benchmark in `pyperformance`, so I created a simple benchmark instead, using all 149 files from CSVSniffer (MIT License), reading only the sample, as recommended in the docs.python.org example. That's what real users do, too. Unfortunately, it takes a few hours to run.
Results

The full results (`compare_to --table --table-format=md`, and JSON files):

Environment
`sudo ./python -m pyperf system tune` ensured.

Notes
`csv.Sniffer()._guess_delimiter()` runs only if the regular expressions in `csv.Sniffer()._guess_quote_and_delimiter()` failed, so there's no guarantee that `csv.Sniffer().sniff()` will always be faster.

csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627