Skip to content

Move dynamic FPU synchronization inside preprocessor defines #2252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 29, 2023

Conversation

kcgen
Copy link
Member

@kcgen kcgen commented Jan 27, 2023

This is a follow-up adjustment mentioned by @finalpatch; thanks!

This PR also:

  • Makes non-functional adjustments to the previously-touched FPU additions (uses more const and auto arguments, replaces C casts with C++ casts, uses default initializes, and uses C++ struct definition style ("struct Type {... }", just like classes).
  • Fixes a couple dangling memset/memcpy warnings by using the unaligned accessor functions.
  • Uses the Get and Set verbs in the PReg function names (to make the match the others).
  • The last two commits are functional changes (deliberately independent and the last in this set) to be easy bisection points if either cause problems down the road.

Hoping you can test this with your Turbo C debugger @finalpatch (and anything else you can throw at it like large and tiny values / overflow / div-by-zero / NaN / etc).

Suggest reviewing commit-by-commit.

DOS Navigator is still OK, as are UBSAN builds.

@kcgen kcgen self-assigned this Jan 27, 2023
@kcgen kcgen added CPU/FPU emulation Issues related to CPU or FPU emulation cleanup Non-functional changes that simplify, improve maintainability, or squash warnings backport Important fixes that should be backported labels Jan 27, 2023
@finalpatch
Copy link
Contributor

finalpatch commented Jan 27, 2023

Honestly I would recommend just do these 2 changes I suggested earlier and minimize the change to the original code. This way the risk is kept to the minimum (this code is very delicate and I would be very cautious to make any change that I'm not 100% confident with) and it's easy for other dosbox forks to pick up this fix rather than having to deal with a large diff with lots of unrelated changes.

@Grounded0
Copy link
Collaborator

Grounded0 commented Jan 27, 2023

I'm for minimising the risk since this is an extremely delicate part of the codebase.

@kcgen kcgen force-pushed the kc/conditional-fpu-sync-1 branch from caff04c to d6ed92d Compare January 27, 2023 19:34
@kcgen kcgen requested a review from finalpatch January 27, 2023 22:21
@kcgen kcgen force-pushed the kc/conditional-fpu-sync-1 branch 2 times, most recently from 2d9b956 to 0b28734 Compare January 27, 2023 22:50
@johnnovak
Copy link
Member

johnnovak commented Jan 28, 2023

Honestly I would recommend just do these 2 changes I suggested earlier and minimize the change to the original code. This way the risk is kept to the minimum (this code is very delicate and I would be very cautious to make any change that I'm not 100% confident with) and it's easy for other dosbox forks to pick up this fix rather than having to deal with a large diff with lots of unrelated changes.

I'm for minimising the risk since this is an extremely delicate part of the codebase.

I agree with both of you on the need to test this very thoroughly. There must be some benchmark apps out there that test the FPU for correctness. I'd also be uncomfortable merging this just based on a "visual" code review — there's so many things that can go wrong.

Having said that, we shouldn't be afraid of changing the code; that's a bad situation to be in. So instead of never touching these parts of the codebase, the question we should be asking instead is how can we regressions test this to gain 100% confidence that it still works as expected? Like I said, someone must have written some FPU test benchmarks in the DOS days that check for the correctness of calculations... Maybe we can even inject "synthetic" dynamic -> normal core switching for testing purposes? Maybe that way this will get tested a lot more thoroughly than the original code ever was! So we shouldn't just shy out, that's all I'm saying.

Well, and other DOSBox forks can pick up the initial patch... or also choose to clean up their code 😏

(Related: I rewrote the whole DOSBox mixer last year to use floats instead of integers, gradually, over time. Plus replaced the built-in resampling method with a library. It was somewhat less risky, but arguably every game uses sound, while the FPU is only used by a few. Not "bragging", just saying that these things can be done, they're not "impossible"; you just need to be careful and approach it systematically).

@finalpatch
Copy link
Contributor

I agree with both of you on the need to test this very thoroughly. There must be some benchmark apps out there that test the FPU for correctness. I'd also be uncomfortable merging this just based on a "visual" code review — there's so many things that can go wrong.

What's difficult is that this code deals with the case when FPU is used cross core boundary, which AFAIK happens when code modifies itself or when debugger/OS doing system level operations. A benchmark suite normally stays in the dynamic core happy path and won't see anything wrong. That's why DOSBOX has been able to run games for so many years without people noticing this.

@finalpatch
Copy link
Contributor

I think in order to properly test this, we need to create a guest program in assembly code that deliberately exercises this core transition logic.

@johnnovak
Copy link
Member

I think in order to properly test this, we need to create a guest program in assembly code that deliberately exercises this core transition logic.

Ok fair point.

@kcgen kcgen force-pushed the kc/conditional-fpu-sync-1 branch from 0b28734 to 8c2de4c Compare January 28, 2023 03:56
@kcgen
Copy link
Member Author

kcgen commented Jan 28, 2023

We've got two bisectable points now:

  • the prior PR, where the normal and dh FPU's are synchronized every call (no matter what)
  • This PR where they're only syn'd if the opposite core was last used

That's really all we need to track down strange bugs.

The other good thing is that somehow all of eXoDOS and eXoWin3.1 are running fine with this synchronization not even done at all! Granted.. those are all games/edutainment, but still, that's about 15,000 executables.

With @finalpatch's Turbo Debugger tests showing proper operation, hopefully this improvement fixes more application corner cases.

Oh: and we're aware of DOSBox being used for civil engineering work, like bridge and elevated load bearing calculations. Likewise, there are German manufacturers using DOSBox to run software that drives their fabrication equipment.

So hopefully the numbers generated are still OK! 🚀

@kcgen kcgen force-pushed the kc/conditional-fpu-sync-1 branch from 8c2de4c to b4ab7fc Compare January 28, 2023 05:41
@finalpatch
Copy link
Contributor

I'll try to put together a test executable that deterministically exercises this logic. Give me some time.

@Grounded0
Copy link
Collaborator

Grounded0 commented Jan 28, 2023

Oh: and we're aware of DOSBox being used for civil engineering work, like bridge and elevated load bearing calculations. Likewise, there are German manufacturers using DOSBox to run software that drives their fabrication equipment.

We're currently offering them 64-bit precision instead of an actual IEEE 754 80-bit extended precision format so it should be all good. 😁

But yeah, many people are depending that we get the math correct that probably shouldn't which is yet another factor in mitigating risk.

Maybe we should deploy the safer option per @finalpatch first and this extended option could be deployed later once we're satisfied everything still works with the safer option first and a test program for x87 have been written to validate this bit of the code.

@rderooy
Copy link
Collaborator

rderooy commented Jan 28, 2023

You can try Intel's MCP Diag.
https://archive.org/details/msdos_MCPDIAG_shareware

But be aware that it will likely fail by default, as it did originally on DOSBox-X.
joncampbell123/dosbox-x#3415

@finalpatch
Copy link
Contributor

finalpatch commented Jan 28, 2023

Okay, I managed to create a program that can trigger the bug
https://gist.github.com/finalpatch/91ae4449875c3cf217327bfdcbeab0c5

prints "ok" with normal core
prints "ok" with dyn x86 after fix
prints "ok" with dyn x86 core but X86_DYNFPU_DH_ENABLED undefined
prints "bad" with original dosbox on dyn x86 core

@finalpatch
Copy link
Contributor

This branch in its current state produces "ok" from my test program.

@finalpatch
Copy link
Contributor

The core switch logic as far as I can see, is correct (I still believe 1 memcpy is better than 3 in this performance sensitive code)

However there are way too many unrelated changes in this PR that I don't have the bandwidth/motivation to review. I'll leave it to the other folks to approve the whole PR, the part in in core_dyn_x86.cpp looks good to me.

@Grounded0
Copy link
Collaborator

core_dyn_x86.cpp looks good to me.

@kcgen kcgen force-pushed the kc/conditional-fpu-sync-1 branch 2 times, most recently from 438b7d8 to 1af5cc0 Compare January 28, 2023 23:06
@johnnovak
Copy link
Member

Can't really do a useful review of this myself, but the approach @kcgen mentioned with the easy bisecting, plus the asm test code by @finalpatch has probably raised our confidence levels enough to get this merged?

@kcgen
Copy link
Member Author

kcgen commented Jan 28, 2023

This branch in its current state produces "ok" from my test program.

Thanks. Do you have the compiled .EXE handy? (I tried w/ fasm but it complains about not entering 32-bit mode). It would be nice to add this to our unit tests that run on the Apple M1, x86-64, ARM v7 and v8, as well as on Windows w/ MSVC.

(To unit test it, I would launch it with the compiled dosbox-staging executable, redirect to a text file, and check for OK).

The core switch logic as far as I can see, is correct

Thanks for the review.

(I still believe 1 memcpy is better than 3 in this performance sensitive code)

Yes, I explained in the resolved comments that the type-specific accessors and assigning directly to the struct members is still more self-documenting and explicit versus the bulk memcpy.

But to wrap this up, I dropped the two routines in godbolt and quick bench:

Here's the assembly at -O2 (ARM middle-pane, x86-64 right-pane):

2023-01-28_15-24

At -O3, the memcpy approach stays a loop while the approach used in this PR goes branchless (eliminating the loop):

2023-01-28_15-26

https://godbolt.org/z/seGvx8e5M

The loopless optimized version makes it very hard to benchmark at -O3 in Quick Bench. Here's Clang 15 and GCC 12.x, at -O3:

2023-01-28_15-28

I tried flagging variables as do-not-optimize, accumulating intermediate p_reg m1/m2/m3, randomizing the dyn_reg array, etc, to no avail.

However, Clang 14 atleast is measuring /something/ :laughing:

2023-01-28_14-49

https://quick-bench.com/q/LkhSkKnQLmvE_amiu7ixUd5j-kQ

All that to say - we're deep in the noise (a couple instructions either way, and for something that's only called during transitions from core types). These explorations are always fun though!

@kcgen kcgen force-pushed the kc/conditional-fpu-sync-1 branch from 1af5cc0 to c027ea3 Compare January 28, 2023 23:42
@kcgen
Copy link
Member Author

kcgen commented Jan 29, 2023

Can't really do a useful review of this myself, but the approach @kcgen mentioned with the easy bisecting, plus the asm test code by @finalpatch has probably raised our confidence levels enough to get this merged?

Yes - I think we're set to merge, thanks to @finalpatch's test results and everything else that has been tested (all running fine).

Thanks again for this, @finalpatch - and very interested in any more additions you want to help with! ❤️

@kcgen kcgen merged commit 30c76e4 into main Jan 29, 2023
@delete-merged-branch delete-merged-branch bot deleted the kc/conditional-fpu-sync-1 branch January 29, 2023 00:07
@finalpatch
Copy link
Contributor

finalpatch commented Jan 29, 2023

test program binary and source
https://1drv.ms/u/s!As8r3moTN9P9gZcnYwrGAy9fAOjlUg?e=5Y2XIV
the test program source is modified a little to work with MASM too
https://gist.github.com/finalpatch/91ae4449875c3cf217327bfdcbeab0c5

@kcgen
Copy link
Member Author

kcgen commented Jan 29, 2023

Nice! thanks for these, @finalpatch.

Anyone know of a CLI test framework (or system)?

This FPU tester will be our first (of probably many) executables we run using the compiled DOSBox Staging and inspect the results.

This is essentially black-box executable to go alongside our current white-box (GoogleTest and GoogleMock) internal test framework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Important fixes that should be backported cleanup Non-functional changes that simplify, improve maintainability, or squash warnings CPU/FPU emulation Issues related to CPU or FPU emulation
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants