
Encapsulate the Cache's write mask functions as members #2990


Merged
kcgen merged 4 commits into main from kc/reduce-wmapmask-dereferencing-1 on Oct 12, 2023

Conversation

kcgen
Member

@kcgen kcgen commented Oct 11, 2023

Description

This PR is a small refactoring that encapsulates the Cache's write-map mask functionality as member functions, which not only simplifies their code but also reduces pointer dereferences (they no longer drill through two levels of structs).

Additionally, this eliminates the duplicate decode_increase_wmapmask functions.
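
For illustration, here's a minimal before/after sketch of the encapsulation idea (simplified, hypothetical field and function names and allocation; not the PR's exact code):

#include <cstdint>

struct CacheBlock {
	struct Cache {
		uint8_t* wmapmask = nullptr;
		uint16_t masklen  = 0;

		// After: the mask helper lives with the fields it manages.
		void delete_wmapmask()
		{
			delete[] wmapmask;
			wmapmask = nullptr;
			masklen  = 0;
		}
	} cache = {};
};

// Before: a free function had to drill through the block for every access
// (and near-duplicate helpers like this are what the PR de-dupes).
void cache_delete_wmapmask(CacheBlock* block)
{
	delete[] block->cache.wmapmask;
	block->cache.wmapmask = nullptr;
	block->cache.masklen  = 0;
}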

Suggest reviewing commit-by-commit.

@weirddan455 , this repurposes the dereferencing cleanup done in the std::vector write-map branch, but now on top of the current main code. I've tossed that vector branch :-)

Manual testing

Tested release and sanitizer builds of the benchmarks, Alone in the Dark, Windows 3.1 applications, as well as builds with the dynamic-core disabled (-Ddynamic_core=none) and -Ddynamic_core=dynrec.

Performance is unaffected and the branch's release-build dynrec ASM files are just a bit smaller (693064 bytes for main, 692828 for the branch; that said, the PR's goal is just to de-dupe and move a bit of this mask functionality closer to the variables it operates on).

Checklist

Please tick the items as you have addressed them. Don't remove items; leave the ones that are not applicable unchecked.

I have:

  • followed the project's contributing guidelines and code of conduct.
  • performed a self-review of my code.
  • commented on the particularly hard-to-understand areas of my code.
  • split my work into well-defined, bisectable commits, and I named my commits well.
  • applied the appropriate labels (bug, enhancement, refactoring, documentation, etc.)
  • checked that all my commits can be built.
  • confirmed that my code does not cause performance regressions (e.g., by running the Quake benchmark).
  • added unit tests where applicable to prove the correctness of my code and to avoid future regressions.
  • made corresponding changes to the documentation or the website according to the documentation guidelines.
  • locally verified my website or documentation changes.

kcgen added 4 commits October 11, 2023 11:44
The delete and grow functions only operate on the cache write map,
so moving them into the cache struct now lets them operate directly
on the members.
The type addition code only operates on the cache write map members,
so moving it into the cache's struct now lets it operate directly
on the members, avoiding multiple levels of dereferencing.

The type addition function is split at the type level for bytes,
words, and dwords, to make it clearer what it's doing.
(Previously, the function took a generic 'size' value, implying
arbitrary scalars could be handled, when really it only needs to
handle the three type sizes.)
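
A minimal sketch of the byte/word/dword split described in the commit message above (the names and mask layout here are hypothetical, not the PR's actual code):

#include <array>
#include <cstddef>
#include <cstdint>

struct Cache {
	std::array<uint8_t, 4096> wmapmask = {};

	// One entry point per type size instead of a generic 'size' argument.
	void add_byte_to_wmapmask(const size_t offset)  { mark(offset, 1); }
	void add_word_to_wmapmask(const size_t offset)  { mark(offset, 2); }
	void add_dword_to_wmapmask(const size_t offset) { mark(offset, 4); }

private:
	void mark(const size_t offset, const size_t num_bytes)
	{
		// Only 1, 2, and 4 byte accesses ever reach this point.
		for (size_t i = 0; i < num_bytes && offset + i < wmapmask.size(); ++i) {
			wmapmask[offset + i] = 1;
		}
	}
};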
@kcgen kcgen added the refactoring Code refactoring without any functional changes label Oct 11, 2023
@kcgen kcgen self-assigned this Oct 11, 2023
@johnnovak
Member

johnnovak commented Oct 11, 2023

Can't test right now, but in the meantime here's a demo pack that should give the dynarec stuff a good workout. Especially Blasphemy-Kkowboy and Blasphemy-MoralHardCandy; these do CPU-based 3D rendering at 640x480. I prepared this for @weirddan455 to stress test ffmpeg, btw.

demo-pack.zip

Will test with the demos this evening.

@kcgen
Member Author

kcgen commented Oct 11, 2023

Nice; thanks for the test pack, @johnnovak -- will give it a try and report back.

@johnnovak
Member

johnnovak commented Oct 12, 2023

Nice; thanks for the test pack, @johnnovak -- will give it a try and report back.

Also, I suggest testing for performance regressions with Screamer, Tomb Raider, and System Shock in 640x480 software rendering mode, in addition to Quake.

@johnnovak
Member

I've tested it with a few demos on macOS and it's fine. Haven't tried Windows and I'm a bit busy right now, so feel free to merge @kcgen once you've gained enough confidence about this.

@kcgen
Member Author

kcgen commented Oct 12, 2023

Thanks for those tests on your side @johnnovak.

I tested all of the demos with the branch versus main, using the fixed cycle count in your confs (instead of the percentage), and measured the CPU time needed to achieve that cycle count; they're all coming in at par.

(will post some measurements in a bit)

@weirddan455
Collaborator

From your commit message:

The type addition code only operates on the cache write map members,
so moving it into the cache's struct now lets it operate directly
on the members, avoiding multiple levels of dereferencing.

Unless I'm very wrong in my understanding of how C++ works (which is always possible), this is not true. Whenever you have a member function, it gets called with a hidden this argument. Then when you access a data member, for example wmapmask, it's the same as doing this->wmapmask. It's still doing the same amount of indirection even though you aren't explicitly typing it out.
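
A toy example of the hidden this argument (made-up names, just to illustrate the point):

#include <cstdint>

struct Cache {
	uint8_t wmapmask[16] = {};

	void mark(const int i)
	{
		// 'wmapmask[i]' is shorthand for 'this->wmapmask[i]': the access
		// still goes through the hidden 'this' pointer, exactly as an
		// explicit cache_ptr->wmapmask[i] would.
		wmapmask[i] = 1;
	}
};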

I would expect this PR to have no significant difference in performance one way or the other. However, if you think this improves readability, then go for it. I'm not familiar enough with this part of the code to have a strong opinion.

@kcgen
Member Author

kcgen commented Oct 12, 2023

I'm using time and operf (see script at bottom) to get a measure of efficiency:

  • operf samples process events.
  • time records the elapsed wall, user-space, and kernel time.

Tested with Blasphemy-MoralHardCandy using:

[cpu]
cputype = pentium_slow
cycles = 250000

operf, 30 seconds of sampling

Main:

cpu_clk_unhalt...|
  samples|      %|
------------------
   632292 100.000 dosbox
	cpu_clk_unhalt...|
	  samples|      %|
	------------------
	   399611 63.2004 anon (tgid:1444042 range:0x7f09027f9000-0x7f0902ffdfff)
	   204980 32.4186 dosbox
	    14494  2.2923 no-vmlinux
	     4676  0.7395 libc.so.6
	     2360  0.3732 [vdso] (tgid:1444042 range:0x7ffc20ae0000-0x7ffc20ae1fff)
	     1585  0.2507 iris_dri.so
	     1175  0.1858 libSDL2-2.0.so.0.2600.3

Branch:

cpu_clk_unhalt...|
  samples|      %|
------------------
   626970 100.000 dosbox
	cpu_clk_unhalt...|
	  samples|      %|
	------------------
	   395469 63.0762 anon (tgid:1443719 range:0x7f4978ff6000-0x7f49797fafff)
	   204914 32.6832 dosbox
	    14164  2.2591 no-vmlinux
	     4254  0.6785 libc.so.6
	     2338  0.3729 [vdso] (tgid:1443719 range:0x7ffd16ddd000-0x7ffd16ddefff)
	     1437  0.2292 iris_dri.so
	     1187  0.1893 libSDL2-2.0.so.0.2600.3

time, 30 seconds of runtime

Main:

real	0m30.043s
user	0m15.961s
sys	0m0.393s

Branch:

real	0m30.060s
user	0m15.787s
sys	0m0.384s

Results

Both runs measured Blasphemy-MoralHardCandy at fixed 250k cycles for 30 seconds before being killed.

For the branch to be acceptable, it shouldn't consume more CPU time than main over the same period when running DOSBox at 250k cycles.

  1. Operf recorded 632292 samples from main and 626970 samples from the branch over 30s of runtime; however, the sampling timeframe isn't exact, so it's possible that main's sampling run simply ran for ~0.8% more time (which is roughly in the noise).

  2. time recorded main consuming 15.961s of user-time across 30.043s of wall-time, or ~53.1% CPU usage. The branch consumed 15.787s of user-time across 30.060s of wall-time, or ~52.5% CPU usage.

So the results seem OK, and are probably a wash.


Script:

#!/bin/bash

# Check if the command is provided
if [[ "$#" -eq 0 ]]; then
    echo "Usage: $0 <command> [args...]"
    exit 1
fi

if [[ -d oprofile_data ]]; then
  rm -rf oprofile_data
fi

# Launch the command
time "$@" &
command_pid=$(pidof "$1")

# Start profiling via the PID
operf --pid "$command_pid" &
operf_pid=$!

# Profile for 30 seconds
sleep 30

# Stop the command and profiler
kill -15 $command_pid
kill -15 $operf_pid
killall operf

opreport

@kcgen kcgen merged commit 7ae966f into main Oct 12, 2023
@johnnovak
Member

Thanks for the very detailed performance testing @kcgen. 🎖️

@weirddan455 I think you're right in principle; there is an implicit this pointer when dealing with method calls. The objective here was to fix the bug without causing performance regressions, so this is fit for purpose. Plus it improves readability while keeping the same performance, which is another win 🚀

@kcgen
Member Author

kcgen commented Oct 12, 2023

this->wmapmask: It's still doing the same amount of indirection even though you aren't explicitly typing it out.

Yes, that's right; all members are accessed through this. The prior code had one extra layer, activecb->cache.<variable>, and these accesses were repeated.

Ideally the optimizer hoists the repeated dereferences into a temporary, but that only happens when the compiler can be confident that no side effects alter those dereferenced variables (like active_cb and cache).
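
A minimal sketch of the hoisting point (illustrative names and a stand-in global, not the actual decoder code; assumes activecb points at a live block when the functions run):

#include <cstdint>

struct CacheBlock {
	struct Cache {
		uint8_t* wmapmask = nullptr;
	} cache = {};
};

CacheBlock* activecb = nullptr; // stand-in for the decoder's active block

void mark_repeated()
{
	// Each statement goes back through the activecb pointer; because the
	// stores are through a byte pointer they may alias activecb itself,
	// so the compiler can't always keep the dereference in a register.
	activecb->cache.wmapmask[0] = 1;
	activecb->cache.wmapmask[1] = 1;
	activecb->cache.wmapmask[2] = 1;
}

void mark_hoisted()
{
	// Hoisting the outer dereference into a local reference performs it
	// once; a member function on Cache gets the same effect via 'this'.
	auto& cache = activecb->cache;
	cache.wmapmask[0] = 1;
	cache.wmapmask[1] = 1;
	cache.wmapmask[2] = 1;
}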

So yeah, this is likely just a wash, but I didn't want to throw away the work given it baked out a lot of ugly hops in the code :-)

@kcgen kcgen deleted the kc/reduce-wmapmask-dereferencing-1 branch October 14, 2023 18:38
@Grounded0 Grounded0 added the CPU/FPU emulation Issues related to CPU or FPU emulation label Jun 28, 2024