Skip to content

Let Meson use a system call to detect the page size #3174

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 4, 2023

Conversation

kcgen
Copy link
Member

@kcgen kcgen commented Dec 4, 2023

Description

Previously we ran local binaries to get the page size, however some systems don't have them (such as RetroPie, which is missing the pagesize binary).

The good news is both Microsoft and POSIX both have long-lived extremely standardized calls to get the page size:

By using a C syscall, Meson will build and run it using the cross-compiler's runnable environment (if one is configured).

This is particularly important when cross compiling, as we want the destination host's page size and not the local build machine's page size.

(If a cross-compiler isn't configured Meson just runs it on the local build machine)

Finally, this PR also lets users (optionally) configure the page size via Meson's setup options, such as: meson setup -Dpagesize=16384 ...

Some machines support multiple page sizes - so there's no harm in letting them override it or experiment.

Why page size is important

The dynamic core converts DOS instructions on the fly into a sequence of host machine instructions, packed into a "page", and passes it to the host as if it's native compiled code. The host then executes that page of machine byte code, one page at a time.

Some processors have boot time switches that can control the page size, like the pi5.

So there's no wiggle room; we either get it correct or DOSBox just crashes straight away as soon as the dynamic recompiler passes its first generated page of instructions in, like we saw here.

Also, hardware and operating systems that support write-or-execute flagging (W^X) require flagging of whole pages. That's probably what crashed the Pi5: we asked to flag only a 4k portion of a 16K page (and were denied). So dynrec didn't even get off the ground to fill its first page.

This is going to become a bigger problem with arm64 and aarch64 chips now becoming very common.

Related issues

Fixes #3160.

Manual testing

Tested on x86-64 Linux, the Pi4 w/ Linux, and Apple's M1.

Checklist

I have:

  • followed the project's contributing guidelines and code of conduct.
  • performed a self-review of my code.
  • commented on the particularly hard-to-understand areas of my code.
  • split my work into well-defined, bisectable commits, and I named my commits well.
  • applied the appropriate labels (bug, enhancement, refactoring, documentation, etc.)
  • checked that all my commits can be built.
  • confirmed that my code does not cause performance regressions (e.g., by running the Quake benchmark).
  • added unit tests where applicable to prove the correctness of my code and to avoid future regressions.
  • made corresponding changes to the documentation or the website according to the documentation guidelines.
  • locally verified my website or documentation changes.

kcgen added 2 commits December 4, 2023 09:12
This is a white-space only change made with "muon fmt".
Previously we ran local binaries to get the page size, however
some systems don't have them (such as RetroPie, which is missing
the 'pagesize' binary).

The good news is both Microsoft and POSIX both have long lived
extremely standardized calls to get the page size:

- https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsysteminfo
  Supported on Windows 2000 and greater.

- https://pubs.opengroup.org/onlinepubs/9699919799/
  POSIX: supported since 2001

By using a C syscall, Meson will build and run it using the
(potential) cross-compiler's runnable environment (if one is
configured).

This is particularly important when cross compiling, as we want
the destination host's page size and not the local build
machine's page size.
@kcgen kcgen added the build system Build system related issues label Dec 4, 2023
@kcgen kcgen self-assigned this Dec 4, 2023
Copy link
Collaborator

@weirddan455 weirddan455 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code all looks good and this certainly looks like a win for native compilation. I don't understand how this would work for cross compilation though. If I'm compiling for ARM on my x86 machine, I compile that small program but then I can't execute the resulting binary unless I have qemu or something installed (which is not typically part of a cross compile environment).

The old method probably isn't any better though (running some command may just return host-side value instead of target value). Plus you've introduced an option to manually set page size so someone can always use that if they're compiling for some esoteric architecture.

@shermp
Copy link
Collaborator

shermp commented Dec 4, 2023

As an addition to what @weirddan455 mentioned, if it is a syscall, isn't it also getting the info from the running kernel? Which might be the host kernel even with qemu-user?

@kklobe
Copy link
Collaborator

kklobe commented Dec 4, 2023

I like this idea. How about we just go all the way and modify the code to compute and use this at runtime instead of compile-time? I think the need for this is largely (totally?) confined to dyn_cache.h, and we could just run your nifty utility functions as part of cache_init(), and optionally add a config parameter to let it be overridden.

@weirddan455
Copy link
Collaborator

I like this idea. How about we just go all the way and modify the code to compute and use this at runtime instead of compile-time? I think the need for this is largely (totally?) confined to dyn_cache.h, and we could just run your nifty utility functions as part of cache_init(), and optionally add a config parameter to let it be overridden.

Yeah, maybe this makes sense to be a runtime value. It's probably out of scope for what @kcgen was planning here though (unless he feels like doing it). IIRC there's some compile time #ifdef's relating to page size that would need to be refactored. And then we would need to test for bugs and performance regressions since this is in the core of the emulator.

@kcgen
Copy link
Member Author

kcgen commented Dec 4, 2023

If I'm compiling for ARM on my x86 machine, I compile that small program but then I can't execute the resulting binary unless I have qemu or something

This is where Meson leaves the other build systems in dust; it has first-class handling for cross compiling, including three-way cross compiling (where you cross-compile your cross-compiler as an intermediary).

When setting it up, you populate a cross build definition file, and one of the fields is an exe_wrapper that's able to run cross-compiled output binaries.

So that's where you'd plug in qemu / wine / etc.. or even a script that does something like ssh actual-machine $1, and you get stdout, stderr, and even the rcode back from the SSH client.

Great walk through here: https://mesonbuild.com/Cross-compilation.html

If it is a syscall, isn't it also getting the info from the running kernel? Which might be the host kernel even with qemu-user?

The exe_wrapper's environment will have emulation-side system libraries that should insulate the client from the host. For example, x86-32 binaries running on Wine should get a 4k page size, just like DOS executables running on DOSBox; they only see and use a 4k page sizes, even when running on the 16K-page Apple M1.

@kklobe
Copy link
Collaborator

kklobe commented Dec 4, 2023

I like this idea. How about we just go all the way and modify the code to compute and use this at runtime instead of compile-time? I think the need for this is largely (totally?) confined to dyn_cache.h, and we could just run your nifty utility functions as part of cache_init(), and optionally add a config parameter to let it be overridden.

Yeah, maybe this makes sense to be a runtime value. It's probably out of scope for what @kcgen was planning here though (unless he feels like doing it). IIRC there's some compile time #ifdef's relating to page size that would need to be refactored. And then we would need to test for bugs and performance regressions since this is in the core of the emulator.

Certainly up to @kcgen. I've already got a version working, it's a very small change, and can't see any place else in the codebase that uses PAGESIZE or host_pagesize, but understandably makes me a bit nervous for 0.81 at this stage.

@kcgen
Copy link
Member Author

kcgen commented Dec 4, 2023

I like this idea. How about we just go all the way and modify the code to compute and use this at runtime instead of compile-time? I think the need for this is largely (totally?) confined to dyn_cache.h, and we could just run your nifty utility functions as part of cache_init(), and optionally add a config parameter to let it be overridden.

Yeah, maybe this makes sense to be a runtime value. It's probably out of scope for what @kcgen was planning here though (unless he feels like doing it). IIRC there's some compile time #ifdef's relating to page size that would need to be refactored.

It's a really nice idea!

And then we would need to test for bugs and performance regressions since this is in the core of the emulator.

Yeah, this is my biggest worry.

Currently, the compiler see these pefect-power-of-2 hard-coded page-size quantities in all the CPU and cache array sizes. So if there's anywhere it can lay down some fast SSE or NEON, it would be there (I hope!).

But ultimately the benchmarks will prove it out. Unfortunately I'm not able to take on that level of benchmarking or do some finer-grained inspection of these differences right now though (of course: 100% welcome for anyone to follow up and make it a runtime value).

@kcgen
Copy link
Member Author

kcgen commented Dec 4, 2023

@kklobe: maybe we can merge this, and then follow up w/ yours? We've got plenty of time to hit regressions and compare performance (and it would be a small revert in the worst cast).

@kklobe
Copy link
Collaborator

kklobe commented Dec 4, 2023

I like this idea. How about we just go all the way and modify the code to compute and use this at runtime instead of compile-time? I think the need for this is largely (totally?) confined to dyn_cache.h, and we could just run your nifty utility functions as part of cache_init(), and optionally add a config parameter to let it be overridden.

Yeah, maybe this makes sense to be a runtime value. It's probably out of scope for what @kcgen was planning here though (unless he feels like doing it). IIRC there's some compile time #ifdef's relating to page size that would need to be refactored.

It's a really nice idea!

And then we would need to test for bugs and performance regressions since this is in the core of the emulator.

Yeah, this is my biggest worry.

Currently, the compiler see these pefect-power-of-2 hard-coded page-size quantities in all the CPU and cache array sizes. So if there's anywhere it can lay down some fast SSE or NEON, it would be there (I hope!).

But ultimately the benchmarks will prove it out. Unfortunately I'm not able to take on that level of benchmarking or do some finer-grained inspect of these differences right now though (of course: 100% welcome for anyone to follow up and make it a runtime value).

Sounds good; preliminary QTD and PERFDOS benchmarks show no change (I don't see any array sizing using that value, only the DOS page-size values, which of course wouldn't be affected by this).

I'll throw it in a PR and let the team decide!

@kcgen kcgen changed the title Use a system call to detect the page size Let Meson use a system call to detect the page size Dec 4, 2023
@kcgen kcgen merged commit 7e20f6e into main Dec 4, 2023
@kcgen
Copy link
Member Author

kcgen commented Dec 4, 2023

Sounds good; preliminary QTD and PERFDOS benchmarks show no change (I don't see any array sizing using that value, only the DOS page-size values, which of course wouldn't be affected by this).

I'll throw it in a PR and let the team decide!

Right on, @kklobe!

@johnnovak
Copy link
Member

johnnovak commented Dec 4, 2023

@kklobe @kcgen Um, seems like massive overkill? Has anyone proven not just using 4k has any measurable benefit?

I think assuming 64 byte cache lines and 4k page sizes gets you 99% there on modern hardware, so why complicate life? 😄

At least, I'd measure first, then if the gains are significant, then complicate.

Especially if the likelihood for regressions and bugs is nonzero, but the likelihood for performance improvements is unproven...

@kcgen
Copy link
Member Author

kcgen commented Dec 4, 2023

Page size needs to be correct for the dynamic cores.
(See my expanded comment here: #3160 (comment))

We definitely want runtime detection to handle aarch64 processors that support setting either 4k or 16k page size at boot time.

@kklobe
Copy link
Collaborator

kklobe commented Dec 4, 2023

@kklobe @kcgen Um, seems like massive overkill?

Probably!

Now go away and leave us alone 😄

@kcgen
Copy link
Member Author

kcgen commented Dec 4, 2023

Also, hardware and operating systems that support write-or-execute flagging (W^X) require flagging of whole pages. That's probably what crashed the Pi5: we asked to flag only a 4k portion of a 16K page (and were denied). So dynrec didn't even get off the ground to fill its first page.

@kcgen kcgen deleted the kc/common-pagesize-block-1 branch December 4, 2023 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build system Build system related issues
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Crashing on raspberry pi 5 when auto switching to max 100% CPU cycles
5 participants