gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster #137628
Conversation
Title changed: csv.Sniffer._guess_delimiter() 2x faster → csv.Sniffer.sniff() 2x faster
Are the benchmarks done with a PGO+LTO build?
I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.
@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with optimizations enabled, and ideally with pyperformance.
Benchmarks without optimizations are not relevant, so just run those with optimizations.
Title changed: csv.Sniffer.sniff() 2x faster → csv.Sniffer.sniff() delimiter detection 1.5x faster
@picnixz @AA-Turner @ZeroIntensity Thank you for all the comments.
Lib/csv.py (outdated):

```python
seen += 1
charCounts = Counter(line)
for char, count in charCounts.items():
    if ord(char) < 127:
```
Is it faster to do `ord(char)` or `char.isascii()`? Just run `python -m pyperf timeit -s "x='x'" "x.isascii()"` and compare it to `python -m pyperf timeit -s "char='x'" "ord(char) < 127"`. (Ensure that we use a name lookup, `ord(char)` or `x.*`, to avoid inlining the checks.)
```
20:52:34.106551089PM CEST maurycy@gunnbjorn ~/cpython (main) % ./python -m pyperf timeit -s "x='x'" "x.isascii()"
.....................
Mean +- std dev: 13.8 ns +- 0.0 ns
20:52:46.818268693PM CEST maurycy@gunnbjorn ~/cpython (main) % ./python -m pyperf timeit -s "char='x'" "ord(char) < 127"
.....................
Mean +- std dev: 17.4 ns +- 0.0 ns
```
The reason for having `ord` in the first place is preserving the original off-by-one error and not changing the fragile behavior.
Oh. Wait. The check is `< 127`, not `<= 127`. Was it intentional?
Well, there's an easy way to use `isascii()` and still be efficient: after processing the lines, just do `charFrequency.pop('\x7f', None)`. But I don't think we should do it, because counting control characters (e.g., `0x1`) is still as nonsensical as counting `\x7f`.
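(Editorial aside, not from the thread: the only character the two checks disagree on is DEL, `0x7f`. A quick REPL check:)

```python
>>> "\x7f".isascii()   # DEL (0x7f) is 7-bit ASCII, so isascii() accepts it...
True
>>> ord("\x7f") < 127  # ...while the original check excludes it: the off-by-one in question
False
```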
Yes, intentional (Line 370 in b07a267):

```python
ascii = [chr(c) for c in range(127)]  # 7-bit ASCII
```

I agree it doesn't make sense to keep this off-by-one.
`str.isascii()` is much faster than `ord()`, confirmed with the new benchmark:

Geometric mean: 1.60x faster
Lib/csv.py (outdated):

```python
presentCount = sum(counts.values())
zeroCount = seen - presentCount
if zeroCount > 0:
    items = list(counts.items()) + [(0, zeroCount)]
else:
    items = list(counts.items())
```
Suggested change:

```python
items = list(counts.items())
missed_lines = seen - sum(counts.values())
if missed_lines:
    # charFrequency[char][0] can only be deduced now
    # as it cannot be obtained when parsing the lines.
    assert 0 not in counts.keys()
    # Store the number of lines 'char' was missing from.
    items.append((0, missed_lines))
```

I think this is what you want, right?
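(An illustrative equivalence check of the two variants, not from the thread; here `counts` maps a per-line frequency to the number of lines it occurred on:)

```python
from collections import Counter

def variant_original(counts, seen):
    # The PR's version: compute the zero bucket up front.
    presentCount = sum(counts.values())
    zeroCount = seen - presentCount
    if zeroCount > 0:
        return list(counts.items()) + [(0, zeroCount)]
    return list(counts.items())

def variant_suggested(counts, seen):
    # The reviewer's version: append the zero bucket only when lines were missed.
    items = list(counts.items())
    missed_lines = seen - sum(counts.values())
    if missed_lines:
        items.append((0, missed_lines))
    return items

# A char that appeared once on 3 lines and twice on 1 line, out of 6 lines seen.
counts = Counter({1: 3, 2: 1})
assert variant_original(counts, 6) == variant_suggested(counts, 6) == [(1, 3), (2, 1), (0, 2)]
```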
I'm not entirely sure of my suggestion, so someone else needs to check this (I'm sick, hence tired...).
@picnixz It’s OK. I’ve spent way too much time thinking about this code and I believe they’re the same; yours is a bit more readable. :-)

I’m running the benchmark as we speak.

There are only three gotchas:

- Not sure if I agree with the comment (the zeros can be obtained, but it’s inefficient).
- Are we OK with leaving the `assert`?
- There is a very subtle difference between the original code and my code, that is the tie break. I don’t think it matters, though.

Get better soon!
> Not sure if I agree with the comment (the zeros can be obtained, but it’s inefficient).

Yes, if we change the construction. But with the usage of `Counter(line)` it's not possible to obtain them, as we... don't count chars that are missing.

> There is a very subtle difference between the original code and my code, that is the tie break. It should not matter, though.

Oh. Can you give us an example if possible? Then we can add a test as well.

> Are we OK with leaving the `assert`?

This specific assert should be cheap, but we can remove it (along with the comment, which can be misread). It's good that benchmarks are performed with that assert, though.
> > There is a very subtle difference between the original code and my code, that is the tie break. It should not matter, though.
>
> Oh. Can you give us an example if possible? Then we can add a test as well.

Definitely!

FYI, the first test caught my attention because of Line 368 in b07a267:

```python
data = list(filter(None, data.split('\n')))
```

which should be just `str.splitlines()`, but I'm still afraid of touching too much here. :-)
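(Editorial note: the two are not drop-in equivalents, which is presumably why touching it is scary. A quick illustration, not from the thread:)

```python
data = "a,b\n\nc,d\n"
print(list(filter(None, data.split("\n"))))  # ['a,b', 'c,d']      -- empty lines dropped
print(data.splitlines())                     # ['a,b', '', 'c,d']  -- empty line kept
# str.splitlines() also splits on \r, \v, \f, etc., not just \n.
```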
> > Are we OK with leaving the `assert`?
>
> This specific assert should be cheap, but we can remove it (along with the comment, which can be misread). It's good that benchmarks are performed with that assert, though.

I've removed it. The new benchmark ran over 4b62c84 and included the `assert`.
Co-authored-by: Bénédikt Tran <[email protected]>
Title changed: csv.Sniffer.sniff() delimiter detection 1.5x faster → csv.Sniffer.sniff() delimiter detection 1.6x faster
Thank you for yesterday's review.

Thank you!
Love the enthusiasm, but please try to avoid continuously rebasing (pressing "update branch"). It wastes CI time and also puts a ding in all of our inboxes. Updating the branch should generally only be done to resolve merge conflicts.
The basic idea is to not iterate over all ASCII characters and count their frequency on each line in `_guess_delimiter`, but only over the characters actually present, and just backfill zeros.
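A simplified before/after sketch (editorial illustration, not the exact patch; helper names are approximations):

```python
from collections import Counter

def frequencies_old(lines):
    # Old approach: scan every line once per ASCII character,
    # even for characters that never occur in the data.
    ascii_chars = [chr(c) for c in range(127)]  # 7-bit ASCII (the off-by-one excludes DEL)
    charFrequency = {}
    for line in lines:
        for char in ascii_chars:
            freq = line.count(char)
            metaFrequency = charFrequency.get(char, {})
            metaFrequency[freq] = metaFrequency.get(freq, 0) + 1
            charFrequency[char] = metaFrequency
    return charFrequency

def frequencies_new(lines):
    # New approach: count only the characters actually present,
    # then backfill the "appeared 0 times" buckets from the line count.
    charFrequency = {}
    seen = 0
    for line in lines:
        seen += 1
        for char, count in Counter(line).items():
            if char.isascii():  # note: isascii() also accepts DEL (0x7f), unlike range(127)
                metaFrequency = charFrequency.setdefault(char, {})
                metaFrequency[count] = metaFrequency.get(count, 0) + 1
    for counts in charFrequency.values():
        missed = seen - sum(counts.values())
        if missed:
            counts[0] = missed  # lines this character was absent from
    return charFrequency
```

The old inner loop makes 127 passes over each line via `line.count()`; the new one makes a single pass per line, and characters that never occur get no entry at all.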
Benchmark

There is no `csv.Sniffer` benchmark in `pyperformance`, so I created a simple benchmark instead, using all 149 files from CSVSniffer (MIT License), reading only the sample, as recommended in the docs.python.org example. That's what real users do, too. Unfortunately, it takes a few hours to run.
Results

The full results (`compare_to --table --table-format=md`, and JSON files):

Environment
`sudo ./python -m pyperf system tune` ensured.

Notes
`csv.Sniffer()._guess_delimiter()` runs only if the regular expressions in `csv.Sniffer()._guess_quote_and_delimiter()` failed, so there's no guarantee that `csv.Sniffer().sniff()` will always be faster.

csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627