Skip to content

Conversation

KJTsanaktsidis
Copy link
Contributor

(this is #8903 again - I just accidently merged it too soon before)


This fixes some of the issues with ASAN I outlined in https://bugs.ruby-lang.org/issues/20001

With this merged, make now works on my machine with the following:

CC=clang ./configure cppflags="-fsanitize=address -fno-omit-frame-pointer" optflags=-O2 LDFLAGS="-fsanitize=address -fno-omit-frame-pointer" --prefix="$(realpath installed)"
ASAN_OPTIONS=use_sigaltstack=0:detect_leaks=0:abort_on_error=1 make

Without these fixes, I get crashes in various steps which use the built miniruby (e.g. parsing rdoc or runing extconf files) because VALUE's on the machine stack are not getting properly marked (because they are instead living in ASAN fake stacks).

This seems to build fine with ASAN both enabled and disabled on linux/mac, and builds without asan on windows (i wasn't brave enough to try turning asan on with msvc heh).

[Bug #20001]

@KJTsanaktsidis
Copy link
Contributor Author

@eightbitraptor to pick you more or less at random amongst people who know about the GC ... :) Would you mind giving a quick once-over of this PR? I would like a second opinion on the changes to each_location & stack marking in it. Thanks!

Copy link
Contributor

@eightbitraptor eightbitraptor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick once over as requested and some feedback inline 👍

I'm not at all familiar with how asan works, so it might also be useful to get @tenderlove or @peterzhu2118 to take a look at this.

eval.c Outdated
return 0;

ruby_init_stack((void *)&state);
ruby_init_stack(&state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using the ruby_tag_type enum here right? and then treating it like an address, and writing to it inside ruby_init_stack. Why is that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale here for the main thread is the same as for non-main threads:

https://github.com/ruby/ruby/pull/9505/files/09c83f44d2c66e9e432742d76df95d60afa5ebef#diff-28c449b5ef066b9e7b6e6f149211ba67bf320a633d32acac4117184c02f429e8R2190-R2195

I agree the type and variable name is a little strange. I renamed all of this punching through as local_in_parent_frame for the non-main thread case, perhaps I should do the same thing here as well (and duplicate the contents of that comment).

eval.c Outdated
prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
#endif
Init_BareVM();
Init_BareVM(&state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here. Do we have to wind the address all the way down to native_thread_init_stack?

gc.c Outdated
{
gc_mark_maybe(objspace, obj);
#ifdef RUBY_ASAN_ENABLED
rb_execution_context_t *ec = ctx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to each_location and the callback signature make sense to me. However, It seems a shame to introduce quite a lot of extra machinery in order to support drilling down a pointer that will only ever be used when RUBY_ASAN_ENABLED.

Have you looked at whether this has any impact on marking performance? Is there a way we could make this zero overhead if asan is not enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safe to just stash a pointer to the execution context we're currently marking on objspace. That would be zero-cost if ASAN is not enabled, and avoid a lot of the messy diff here. I'll do that instead.

gc.c Outdated
void (*cb)(rb_objspace_t *, void *, VALUE));

static void
gc_mark_machine_stack_location_maybe(rb_objspace_t *objspace, void *ctx, VALUE obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be helpful for me if the intentional uses for ctx were documented in the sites where they were used. It's not immediately obvious what the intention of it is given that in the majority of callbacks it's unused.

ruby_init_stack(volatile void *addr)
{
native_main_thread.id = pthread_self();
addr = asan_get_real_stack_addr((void *)addr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me why we're using functions directly referencing asan in places where we don't know whether asan is available or enabled or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pattern for most of the existing ASAN macros is for them to be no-ops if ASAN is disabled, and to elide the checks where they're used. I guess the motivation for this is to avoid spraying a lot of conditional compilation checks all throughout the codebase, like:

#if RUBY_ASAN_ENABLED
    void *poisoned = asan_unpoison_object_temporary(val)
#endif
   /* use val */
#if RUBY_ASAN_ENABLED
  if (poisoned) { asan_poison_object(val); }
#endif

For these more specialised macros only used in a couple of places though, perhaps it's better to be explicit and put the check here.

stack beginning from the pthread API in native_thread_init_stack, because
glibc stores some of its own data on the stack before calling into user code
on a new thread, and replacing that data on fiber-switch would break it (see
bug #13887) */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice explanation! Thanks 👍

@KJTsanaktsidis
Copy link
Contributor Author

Thanks for the review 🙏 I'll make some adjustments.

@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_asan_pt_1_again branch from 09c83f4 to 99bbb74 Compare January 17, 2024 23:08
if (GET_VM())
return 0;

ruby_init_stack((void *)&state);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these calls to ruby_init_stack were doing absolutely nothing, because the initial call to INIT_STACK in main.c already called ruby_init_stack with a higher address.

This commit changes how stack extents are calculated for both the main
thread and other threads. Ruby uses the address of a local variable as
part of the calculation for machine stack extents:

* pthreads uses it as a lower-bound on the start of the stack, because
  glibc (and maybe other libcs) can store its own data on the stack
  before calling into user code on thread creation.
* win32 uses it as an argument to VirtualQuery, which gets the extent of
  the memory mapping which contains the variable

However, the local being used for this is actually too low (too close to
the leaf function call) in both the main thread case and the new thread
case.

In the main thread case, we have the `INIT_STACK` macro, which is used
for pthreads to set the `native_main_thread->stack_start` value. This
value is correctly captured at the very top level of the program (in
main.c). However, this is _not_ what's used to set the execution context
machine stack (`th->ec->machine_stack.stack_start`); that gets set as
part of a call to `ruby_thread_init_stack` in `Init_BareVM`, using the
address of a local variable allocated _inside_ `Init_BareVM`. This is
too low; we need to use a local allocated closer to the top of the
program.

In the new thread case, the lolcal is allocated inside
`native_thread_init_stack`, which is, again, too low.

In both cases, this means that we might have VALUEs lying outside the
bounds of `th->ec->machine.stack_{start,end}`, which won't be marked
correctly by the GC machinery.

To fix this,

* In the main thread case: We already have `INIT_STACK` at the right
  level, so just pass that local var to `ruby_thread_init_stack`.
* In the new thread case: Allocate the local one level above the call to
  `native_thread_init_stack` in `call_thread_start_func2`.

[Bug #20001]

fix
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_asan_pt_1_again branch 2 times, most recently from 8d46003 to c21a0c1 Compare January 17, 2024 23:21
@KJTsanaktsidis
Copy link
Contributor Author

@tenderlove or @peterzhu2118 - would either of you mind having a look at this as well as matt suggested? Thank you!

Copy link
Member

@peterzhu2118 peterzhu2118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense 👍

thread_pthread.c Outdated
native_thread_init_main_thread_stack(void *addr)
{
native_main_thread.id = pthread_self();
#ifdef RUBYASAN_ENABLED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
#ifdef RUBYASAN_ENABLED
#ifdef RUBY_ASAN_ENABLED

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, thank you.

@KJTsanaktsidis
Copy link
Contributor Author

Awesome, thanks so much for your reviews.

KJTsanaktsidis and others added 3 commits January 19, 2024 08:46
Where a local variable is used as part of the stack bounds detection, it
has to actually be on the stack. ASAN can put local variable on "fake
stacks", however, with addresses in different memory mappings. This
completely destroys the stack bounds calculation, and can lead to e.g.
things not getting GC marked on the machine stack or stackoverflow
checks that always fail.

The __asan_addr_is_in_fake_stack helper can be used to get the _real_
stack address of such variables, and thus perform the stack size
calculation properly

[Bug #20001]
__has_feature is a clang-ism, and GCC has a different way to tell if
sanitizers are enabled. For this reason, I don't want to spray
__has_feature all over the codebase for other places where conditional
compilation based on sanitizers is required.

[Bug #20001]
ASAN leaves a pointer to the fake frame on the stack; we can use the
__asan_addr_is_in_fake_stack API to work out the extent of the fake
stack and thus mark any VALUEs contained therein.

[Bug #20001]
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_asan_pt_1_again branch from c21a0c1 to 8cae62a Compare January 18, 2024 21:46
@KJTsanaktsidis KJTsanaktsidis merged commit 61da90c into ruby:master Jan 18, 2024
@KJTsanaktsidis KJTsanaktsidis deleted the ktsanaktsidis/fix_asan_pt_1_again branch January 18, 2024 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants