
Encapsulate the Cache's write mask functions as members #2990


Merged
kcgen merged 4 commits into main from kc/reduce-wmapmask-dereferencing-1 on Oct 12, 2023

Conversation

kcgen
Member

@kcgen kcgen commented Oct 11, 2023

Description

This PR is a small refactoring that encapsulates the Cache's write-map mask functionality as member functions, which not only simplifies their code but also reduces pointer dereferences (they no longer drill through two levels of structs).

Additionally, this eliminates the duplicate decode_increase_wmapmask functions.
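
For illustration, here's a minimal before/after sketch of the encapsulation idea (simplified, hypothetical field and function names and allocation; not the PR's exact code):

#include <cstdint>

struct CacheBlock {
	struct Cache {
		uint8_t* wmapmask = nullptr;
		uint16_t masklen  = 0;

		// After: the mask helper lives with the fields it manages.
		void delete_wmapmask()
		{
			delete[] wmapmask;
			wmapmask = nullptr;
			masklen  = 0;
		}
	} cache = {};
};

// Before: a free function had to drill through the block for every access
// (and near-duplicate helpers like this are what the PR de-dupes).
void cache_delete_wmapmask(CacheBlock* block)
{
	delete[] block->cache.wmapmask;
	block->cache.wmapmask = nullptr;
	block->cache.masklen  = 0;
}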

Suggest reviewing commit-by-commit.

@weirddan455 , this repurposes the dereferencing cleanup done in the std::vector write-map branch, but now on top of the current main code. I've tossed that vector branch :-)

Manual testing

Tested release and sanitizer builds of the benchmarks, Alone in the Dark, Windows 3.1 applications, as well as builds with the dynamic-core disabled (-Ddynamic_core=none) and -Ddynamic_core=dynrec.

Performance is unaffected and the branch's release-build dynrec ASM files are just a bit smaller (693064 bytes for main, 692828 for the branch; that said, the PR's goal is just to de-dupe and move a bit of this mask functionality closer to the variables it operates on).

Checklist

Please tick the items as you have addressed them. Don't remove items; leave the ones that are not applicable unchecked.

I have:

  • followed the project's contributing guidelines and code of conduct.
  • performed a self-review of my code.
  • commented on the particularly hard-to-understand areas of my code.
  • split my work into well-defined, bisectable commits, and I named my commits well.
  • applied the appropriate labels (bug, enhancement, refactoring, documentation, etc.)
  • checked that all my commits can be built.
  • confirmed that my code does not cause performance regressions (e.g., by running the Quake benchmark).
  • added unit tests where applicable to prove the correctness of my code and to avoid future regressions.
  • made corresponding changes to the documentation or the website according to the documentation guidelines.
  • locally verified my website or documentation changes.

kcgen added 4 commits October 11, 2023 11:44
The delete and grow functions only operate on the cache write map,
so moving them into the cache struct now lets them operate directly
on the members.
The type addition code only operates on the cache write map members,
so moving it into the cache's struct now lets it operate directly
on the members, avoiding multiple levels of dereferencing.

The type addition function is split at the type level for bytes,
words, and dwords, to make it clearer what it's doing.
(Previously, the function took a generic 'size' value, implying
arbitrary scalars could be handled, when really it only needs to
handle the three type sizes.)
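
A minimal sketch of the byte/word/dword split described in the commit message above (the names and mask layout here are hypothetical, not the PR's actual code):

#include <array>
#include <cstddef>
#include <cstdint>

struct Cache {
	std::array<uint8_t, 4096> wmapmask = {};

	// One entry point per type size instead of a generic 'size' argument.
	void add_byte_to_wmapmask(const size_t offset)  { mark(offset, 1); }
	void add_word_to_wmapmask(const size_t offset)  { mark(offset, 2); }
	void add_dword_to_wmapmask(const size_t offset) { mark(offset, 4); }

private:
	void mark(const size_t offset, const size_t num_bytes)
	{
		// Only 1, 2, and 4 byte accesses ever reach this point.
		for (size_t i = 0; i < num_bytes && offset + i < wmapmask.size(); ++i) {
			wmapmask[offset + i] = 1;
		}
	}
};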
@kcgen kcgen added the refactoring Code refactoring without any functional changes label Oct 11, 2023
@kcgen kcgen self-assigned this Oct 11, 2023
@johnnovak
Member

johnnovak commented Oct 11, 2023

Can't test right now, but in the meantime here's a demo pack that should give the dynarec stuff a good workout. Especially Blasphemy-Kkowboy and Blasphemy-MoralHardCandy; these do CPU-based 3D rendering at 640x480. I prepared this for @weirddan455 to stress test ffmpeg, btw.

demo-pack.zip

Will test with the demos this evening.

@kcgen
Member Author

kcgen commented Oct 11, 2023

Nice; thanks for the test pack, @johnnovak -- will give it a try and report back.

@johnnovak
Member

johnnovak commented Oct 12, 2023

Nice; thanks for the test pack, @johnnovak -- will give it a try and report back.

Also, I suggest testing for performance regressions with Screamer, Tomb Raider, and System Shock in 640x480 software rendering mode, in addition to Quake.

@johnnovak
Member

I've tested it with a few demos on macOS and it's fine. Haven't tried Windows and I'm a bit busy right now, so feel free to merge @kcgen once you've gained enough confidence about this.

@kcgen
Member Author

kcgen commented Oct 12, 2023

Thanks for those tests on your side @johnnovak.

I tested all of the demos with the branch versus main, using the fixed cycle count in your confs (instead of the percentage), and measured the CPU time needed to achieve that cycle count; they're all coming in at par.

(will post some measurements in a bit)

@weirddan455
Collaborator

From your commit message:

The type addition code only operates on the cache write map members,
so moving it into the cache's struct now lets it operate directly
on the members, avoiding multiple levels of dereferencing.

Unless I'm very wrong in my understanding of how C++ works (which is always possible), this is not true. Whenever you have a member function, it gets called with a hidden this argument. Then when you access a data member, for example wmapmask, it's the same as doing this->wmapmask. It's still doing the same amount of indirection even though you aren't explicitly typing it out.
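
A toy example of the hidden this argument (made-up names, just to illustrate the point):

#include <cstdint>

struct Cache {
	uint8_t wmapmask[16] = {};

	void mark(const int i)
	{
		// 'wmapmask[i]' is shorthand for 'this->wmapmask[i]': the access
		// still goes through the hidden 'this' pointer, exactly as an
		// explicit cache_ptr->wmapmask[i] would.
		wmapmask[i] = 1;
	}
};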

I would expect this PR to have no significant difference in performance one way or the other. However, if you think this improves readability, then go for it. I'm not familiar enough with this part of the code to have a strong opinion.

@kcgen
Member Author

kcgen commented Oct 12, 2023

I'm using time and operf (see script at bottom) to get a measure of efficiency:

  • operf samples process events.
  • time records the elapsed wall, user-space, and kernel time.

Tested with Blasphemy-MoralHardCandy using:

[cpu]
cputype = pentium_slow
cycles = 250000

operf, 30 seconds of sampling

Main:

cpu_clk_unhalt...|
  samples|      %|
------------------
   632292 100.000 dosbox
	cpu_clk_unhalt...|
	  samples|      %|
	------------------
	   399611 63.2004 anon (tgid:1444042 range:0x7f09027f9000-0x7f0902ffdfff)
	   204980 32.4186 dosbox
	    14494  2.2923 no-vmlinux
	     4676  0.7395 libc.so.6
	     2360  0.3732 [vdso] (tgid:1444042 range:0x7ffc20ae0000-0x7ffc20ae1fff)
	     1585  0.2507 iris_dri.so
	     1175  0.1858 libSDL2-2.0.so.0.2600.3

Branch:

cpu_clk_unhalt...|
  samples|      %|
------------------
   626970 100.000 dosbox
	cpu_clk_unhalt...|
	  samples|      %|
	------------------
	   395469 63.0762 anon (tgid:1443719 range:0x7f4978ff6000-0x7f49797fafff)
	   204914 32.6832 dosbox
	    14164  2.2591 no-vmlinux
	     4254  0.6785 libc.so.6
	     2338  0.3729 [vdso] (tgid:1443719 range:0x7ffd16ddd000-0x7ffd16ddefff)
	     1437  0.2292 iris_dri.so
	     1187  0.1893 libSDL2-2.0.so.0.2600.3

time, 30 seconds of runtime

Main:

real	0m30.043s
user	0m15.961s
sys	0m0.393s

Branch:

real	0m30.060s
user	0m15.787s
sys	0m0.384s

Results

Both runs measured Blasphemy-MoralHardCandy at fixed 250k cycles for 30 seconds before being killed.

For the branch to be acceptable, it shouldn't consume more CPU time than main over the same period when running DOSBox at 250k cycles.

  1. Operf recorded 632292 samples from main and 626970 samples from the branch over 30s of runtime; however, the sampling timeframe isn't exact, so it's possible that main's sampling run simply ran for ~0.8% more time (which is roughly in the noise).

  2. time recorded main consuming 15.961s of user-time across 30.043s of wall-time, or ~53.1% CPU usage. The branch consumed 15.787s of user-time across 30.060s of wall-time, or ~52.5% CPU usage.

So the results seem OK, and are probably a wash.


Script:

#!/bin/bash

# Check if the command is provided
if [[ "$#" -eq 0 ]]; then
    echo "Usage: $0 <command> [args...]"
    exit 1
fi

if [[ -d oprofile_data ]]; then
  rm -rf oprofile_data
fi

# Launch the command
time "$@" &
command_pid=$(pidof "$1")

# Start profiling via the PID
operf --pid "$command_pid" &
operf_pid=$!

# Profile for 30 seconds
sleep 30

# Stop the command and profiler
kill -15 $command_pid
kill -15 $operf_pid
killall operf

opreport

@kcgen kcgen merged commit 7ae966f into main Oct 12, 2023
@johnnovak
Member

Thanks for the very detailed performance testing @kcgen. 🎖️

@weirddan455 I think you're right in principle; there is an implicit this pointer when dealing with method calls. The objective here was to fix the bug without causing performance regressions, so this is fit for purpose. Plus it improves readability while keeping the same performance, which is another win 🚀

@kcgen
Member Author

kcgen commented Oct 12, 2023

this->wmapmask: It's still doing the same amount of indirection even though you aren't explicitly typing it out.

Yes, that's right; all members are accessed through this. The prior code had one extra layer, activecb->cache.<variable>, and these accesses were repeated.

Ideally the optimizer hoists the repeated dereferences into a temporary, but that only happens when the compiler can be confident that no side effects alter those dereferenced variables (like active_cb and cache).
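
A minimal sketch of the hoisting point (illustrative names and a stand-in global, not the actual decoder code; assumes activecb points at a live block when the functions run):

#include <cstdint>

struct CacheBlock {
	struct Cache {
		uint8_t* wmapmask = nullptr;
	} cache = {};
};

CacheBlock* activecb = nullptr; // stand-in for the decoder's active block

void mark_repeated()
{
	// Each statement goes back through the activecb pointer; because the
	// stores are through a byte pointer they may alias activecb itself,
	// so the compiler can't always keep the dereference in a register.
	activecb->cache.wmapmask[0] = 1;
	activecb->cache.wmapmask[1] = 1;
	activecb->cache.wmapmask[2] = 1;
}

void mark_hoisted()
{
	// Hoisting the outer dereference into a local reference performs it
	// once; a member function on Cache gets the same effect via 'this'.
	auto& cache = activecb->cache;
	cache.wmapmask[0] = 1;
	cache.wmapmask[1] = 1;
	cache.wmapmask[2] = 1;
}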

So yeah, this is likely just a wash, but I didn't want to throw away the work given it baked out a lot of ugly hops in the code :-)

@kcgen kcgen deleted the kc/reduce-wmapmask-dereferencing-1 branch October 14, 2023 18:38
@Grounded0 Grounded0 added the CPU/FPU emulation Issues related to CPU or FPU emulation label Jun 28, 2024