gh-116738: Make _json module safe in the free-threading build #119438
You need to include the file that defines that macro.
Revert newlines
Co-authored-by: Nice Zombies <[email protected]>
The Python implementation first checks whether the list is empty and then iterates over it, instead of making a shallow copy of the list, checking the length of the copy, and iterating over the copy. A different thread could probably empty the list between these two statements (which is what the subclass is simulating):
    if not lst:
        yield "[]"
        return
    time.sleep(10)  # allow thread to modify the list
    for value in lst:
        ...
My question is: do we fix just the broken subclass or also this?
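For comparison, the "shallow copy first" alternative described above could look roughly like the sketch below. This is only an illustration, not the code in json.encoder: the function name iter_list_snapshot is hypothetical, and it assumes that copying the list with list() is itself safe against concurrent mutation.

```python
import json


def iter_list_snapshot(lst):
    """Yield JSON chunks for a list, working on a shallow copy.

    Hypothetical helper for illustration only; it is not part of the
    json module. The emptiness check and the iteration both use the
    same snapshot, so another thread cannot empty the sequence
    between the two steps.
    """
    snapshot = list(lst)  # shallow copy taken once, up front
    if not snapshot:
        yield "[]"
        return
    yield "["
    for i, value in enumerate(snapshot):
        if i:
            yield ", "
        yield json.dumps(value)
    yield "]"
```

The cost is one extra list allocation per encoded list, which is the performance trade-off raised in the discussion.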
In my opinion there is nothing to fix: when different threads are mutating the underlying data, we give no guarantees on the output. But we do guarantee that we will not crash the Python interpreter. The Python implementation will not crash (since all individual Python statements are safe). In this PR we modify the C implementation so that no crashes can occur. On the C side we want to make sure that if the underlying list is emptied, we do not index into deallocated memory (this would crash the interpreter). (note: for the json encoder the C method that is unsafe for the list access is
There are some other PRs addressing safety under the free-threading builds and the feedback there was similar: address the crashes, but don't make guarantees on correct output (at the cost of performance). See
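A small sketch of that contract: one thread keeps mutating a list while another encodes it. The helper names mutate and dump_while_mutating are made up for this illustration. On a build where the encoder is safe, every call should return some syntactically valid JSON list, but which elements end up in it is not defined.

```python
import json
import threading


def mutate(data, stop):
    """Keep appending to and clearing the shared list until told to stop."""
    while not stop.is_set():
        data.append(1)
        if len(data) > 100:
            data.clear()


def dump_while_mutating(rounds=200):
    """Encode a list repeatedly while another thread mutates it.

    Returns the encoded strings. Their contents vary from run to run;
    the only expectation is that the interpreter does not crash.
    """
    data = [0]
    stop = threading.Event()
    thread = threading.Thread(target=mutate, args=(data, stop))
    thread.start()
    try:
        return [json.dumps(data) for _ in range(rounds)]
    finally:
        stop.set()
        thread.join()
```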
There's a precedent for guarding against a broken |
Misc/NEWS.d/next/Core_and_Builtins/2024-06-04-20-26-21.gh-issue-116738.q_hPYq.rst
@colesbury @mpage Would one of you be able to review the PR? Thanks
                         Py_ssize_t indent_level, PyObject *indent_cache, PyObject *separator)
{
    for (Py_ssize_t i = 0; i < PySequence_Fast_GET_SIZE(s_fast); i++) {
        PyObject *obj = PySequence_Fast_GET_ITEM(s_fast, i);
Using a borrowed reference is not safe here, because if the critical section on the sequence gets suspended, another thread can decref or free the object.
@eendebakpt Can you update the PR to use strong references?
@kumaraditya303 Yes I will (probably with a macro to continue using borrowed ones in the normal build).
If I understand the mechanism correctly, then _encoder_iterate_dict_lock_held also needs to be changed. There PyDict_Next is used, which also returns borrowed references. (Note: the docs for PyDict_Next mention that Py_BEGIN_CRITICAL_SECTION should be used in the free-threaded build, but make no mention of borrowed references; perhaps an additional note in the documentation is in order.)
I tried creating a minimal example to add to the tests, but have not succeeded so far. The test script is included below. Suggestions to get a minimal example are welcome.
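Borrowed references are a C-level concept, but a loose Python analogy may help to show why a strong reference matters. In the sketch below a weak reference stands in for the borrowed pointer; the class Item and both function names are hypothetical, and the behaviour shown relies on CPython's immediate reference-count deallocation.

```python
import weakref


class Item:
    """Plain object that supports weak references."""


def borrowed_style():
    """Observe an element without owning it (analogy for a borrowed reference)."""
    lst = [Item()]
    ref = weakref.ref(lst[0])  # does not keep the element alive
    lst.clear()                # the list held the only strong reference
    return ref() is None       # the element is gone


def strong_style():
    """Take a strong reference before the container is mutated."""
    lst = [Item()]
    ref = weakref.ref(lst[0])
    held = lst[0]              # strong reference, like Py_NewRef in C
    lst.clear()
    return ref() is held       # the element is still alive
```

In C there is no equivalent of ref() returning None: a borrowed pointer to a freed object simply dangles, which is why the review asks for strong references.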
Test script
import gc
import random
import time
import sys
from threading import Barrier, Thread
from itertools import count
import json


def single_reference_int():
    # Return int that has a single reference (with very high probability)
    return random.randint(2**80, 2**82)


class EvilMapping(dict):
    cnt = count()

    def __init__(self, data):
        mapping = {next(self.cnt): single_reference_int()}  # generate a mapping with no outside references
        super().__init__(mapping)
        self.data = data

    def keys(self):
        return list(self)

    def items(self):
        # this is called in encoder_listencode_dict which is called from encoder_listencode_obj
        # try to get this thread to suspend on the outer lock
        self.data.clear()  # this will remove self from the list, leaving no references to self (except cyclic references)
        gc.collect()  # the EvilMapping without refcounts (e.g. this one) should be cleared
        repr(self)  # do something with self
        return super().items()

    def __repr__(self):
        return f'{self.__class__.__name__} {super().__repr__()}'


run = True


def worker(barrier, data, index):
    global run
    barrier.wait()
    while run:
        # worker clears the list to generate borrowed references with refcount 0
        #print(f'worker {index} {len(data)=}')
        data.append(single_reference_int())
        data.append(EvilMapping(data))  # inject more evil mappings
        if len(data) > 10:
            data.clear()
        #print(f'worker {index} done')
    print(f'worked {index=} {run=}')


# we want a list where encoding one of the elements clears elements from the list that have refcount 1
data = []
data.append(EvilMapping(data))
data.append(EvilMapping(data))
print(f'{data=}')
j = json.dumps(data)
print(j)
print(f'{data=}')

#%%
print('----- go! -------')
data = []
data.append(EvilMapping(data))
number_of_threads = 2
number_of_json_encodings = 3
worker_threads = []
barrier = Barrier(number_of_threads)
run = True
for index in range(number_of_threads):
    worker_threads.append(
        Thread(target=worker, args=[barrier, data, index])
    )
for t in worker_threads:
    t.start()
for ii in range(number_of_json_encodings):
    print(f'dump {ii}')
    data.extend([EvilMapping(data)] * 10)
    json.dumps(data)
run = False
for t in worker_threads:
    t.join()
(updated description)
Writing JSON files (or encoding to a string) is not thread-safe in the sense that when encoding data to JSON while another thread is mutating the data, the result is not well-defined (this is true for both the normal and free-threading build). But the free-threading build can crash the interpreter while writing JSON because of the usage of methods like PySequence_Fast_GET_ITEM. In this PR we make the free-threading build safe by adding locks in three places in the JSON encoder.
Reading from a JSON file is safe: objects constructed are only known to the executing thread. Encoding data to JSON needs a bit more care: mutable Python objects such as a list or a dict could be modified by another thread during encoding.
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST to protect against mutation of the list
PyDict_Next is used there). The non-exact dicts use PyMapping_Items to create a list of tuples. PyMapping_Items itself is assumed to be thread safe, but the resulting list is not a copy and can be mutated.
Update 2025-02-10: refactored to avoid using Py_EXIT_CRITICAL_SECTION_SEQUENCE_FAST
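The point about non-exact dicts can be illustrated from the Python side. In the sketch below (the class SharedItemsMapping is hypothetical), a dict subclass returns a list from items() that it also keeps a reference to. The encoder works on whatever items() hands back, so code holding the same list could mutate it while encoding is in progress.

```python
import json


class SharedItemsMapping(dict):
    """Dict subclass whose items() returns a list that is not a private copy.

    Hypothetical class for illustration: the list in self.shared is
    handed to the caller of items() as-is, so anything else holding a
    reference to it can change what gets encoded.
    """

    def __init__(self, shared):
        super().__init__({"placeholder": 0})  # non-empty, so the encoder does not short-circuit to "{}"
        self.shared = shared

    def items(self):
        return self.shared  # same list object on every call


shared = [("a", 1)]
mapping = SharedItemsMapping(shared)
encoded = json.dumps(mapping)  # encodes the pairs from `shared`, not the dict contents
```

This is why the PR description treats the list returned for non-exact dicts as something that still needs protection during iteration.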
Test script
t=JsonThreadingTest(number_of_json_dumps=102, number_of_threads=8)
is a factor 25 faster using free-threading. Nice!