Developing

Implemented guest buffer manager

by admin in

What problem does this solve

Up until this point, the main branch used a host memory stream buffer to make all resources accessible to the GPU. This is because AMD hardware has very small alignment requirements for both uniform and storage buffers (only 4 bytes), while NVIDIA requires 64 and 16 bytes respectively. The device-local, host-visible part of memory is also too small on most systems to serve this purpose, and we probably need it for better things.

This adds a ton of overhead to GPU emulation, as every buffer binding needs to memcpy a (sometimes large) chunk of memory to the stream buffer. It was also slow because the GPU was not using VRAM for fast access. It also didn’t work for storage buffers, where writes would be lost because there was no way of preserving them in the volatile buffer.

This PR aims to solve most of these issues by keeping a GPU-side mirror of the guest address space for the GPU to access. It uses write protection to track modifications and re-syncs the affected ranges on demand. It’s still a bit incomplete, in ways I will cover below. It seems to fix the flicker in RDR on AMD GPUs, but the flicker persists on NVIDIA, which needs more investigation.

Basic design

In terms of operations, the most important one is searching for buffers, as this is done multiple times per draw. Tracking page dirtiness must be fast as well. Insertion and deletion of buffers should also be fast, but they happen more rarely than the other operations.

For page tracking we employ a bit-based tracker with 4KB granularity, the same as the host page size we target. It works on two levels: each WordManager is responsible for tracking 4MB of virtual address space and is created on demand when a particular region is invalidated. The MemoryTracker iterates over each manager that touches the region and gathers the dirty ranges from each one. All of this uses bit operations and avoids heap allocations, so it’s quite fast compared to an interval set.
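To make the idea more concrete, here is a minimal sketch of what one 4MB manager could look like. It is not the code from the PR: the class name RegionDirtyBits, its methods and the constants are assumptions made for illustration. It keeps one bit per 4KB page and reports runs of dirty pages through a callback, so no heap allocation is involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageBits = 12;          // 4KB pages
constexpr std::size_t kPagesPerRegion = 1024;  // 1024 * 4KB = 4MB per manager
constexpr std::size_t kWordsPerRegion = kPagesPerRegion / 64;

// Portable trailing-zero count; C++20 offers std::countr_zero for this.
inline std::size_t CountTrailingZeros(std::uint64_t v) {
    std::size_t n = 0;
    while (n < 64 && ((v >> n) & 1) == 0) {
        ++n;
    }
    return n;
}

// Hypothetical stand-in for one 4MB manager: one bit per page, set means CPU dirty.
class RegionDirtyBits {
public:
    // Marks every page touched by [offset, offset + size) as dirty.
    void MarkDirty(std::size_t offset, std::size_t size) {
        assert(size > 0 && offset + size <= (kPagesPerRegion << kPageBits));
        const std::size_t first = offset >> kPageBits;
        const std::size_t last = (offset + size - 1) >> kPageBits;
        for (std::size_t page = first; page <= last; ++page) {
            words_[page / 64] |= std::uint64_t{1} << (page % 64);
        }
    }

    // Calls func(begin_page, end_page) for every run of consecutive dirty
    // pages (end is exclusive) and clears the bits as it goes.
    template <typename Func>
    void ForEachDirtyRange(Func&& func) {
        std::size_t run_begin = 0;
        bool in_run = false;
        for (std::size_t w = 0; w < kWordsPerRegion; ++w) {
            const std::uint64_t word = words_[w];
            words_[w] = 0;
            std::size_t bit = 0;
            while (bit < 64) {
                if (!in_run) {
                    const std::uint64_t rest = word >> bit;
                    if (rest == 0) {
                        break;  // no more dirty pages in this word
                    }
                    bit += CountTrailingZeros(rest);  // skip clean pages
                    run_begin = w * 64 + bit;
                    in_run = true;
                } else {
                    // Length of the run of set bits starting at 'bit'.
                    bit += CountTrailingZeros(~(word >> bit));
                    if (bit < 64) {
                        func(run_begin, w * 64 + bit);
                        in_run = false;
                    }
                    // Otherwise the run may continue into the next word.
                }
            }
        }
        if (in_run) {
            func(run_begin, kPagesPerRegion);
        }
    }

private:
    std::array<std::uint64_t, kWordsPerRegion> words_{};
};
```

A caller would mark pages from the write-fault path and later walk the ranges to build the list of copies. Runs that cross a 64-bit word boundary are reported as a single range.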

For the buffer cache, we cache buffers with host page size granularity. This makes things easier, as we can avoid having to manage buffer overlaps; each page is exclusively owned by one buffer at a time. Buffers are stored in a multi-level page table that covers most of the virtual address space and has performance comparable to a flat array access while using far less memory. While at it, I’ve also switched the texture cache to use the same page table, as it should be faster than the existing hash map.
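As a rough illustration of why such a table is cheap, here is a two-level sketch. The name TwoLevelPageTable, the 14/14-bit split and the 40-bit address range are assumptions chosen for this example, not the layout used in the PR. A lookup is two indexed loads, and leaves are only allocated for regions that are actually touched.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr std::size_t kPageShift = 12;   // 4KB pages
constexpr std::size_t kLevelShift = 14;  // 16384 entries per level (assumed split)
constexpr std::size_t kLevelEntries = std::size_t{1} << kLevelShift;
constexpr std::size_t kAddressBits = kPageShift + 2 * kLevelShift;  // 40 bits covered

// Hypothetical two-level table mapping a page to a value such as a buffer id.
template <typename T>
class TwoLevelPageTable {
    using Leaf = std::array<T, kLevelEntries>;

public:
    TwoLevelPageTable() : top_(kLevelEntries) {}

    // Returns the entry for the page containing 'address',
    // allocating the leaf table on first use.
    T& At(std::uint64_t address) {
        assert((address >> kAddressBits) == 0);
        const std::uint64_t page = address >> kPageShift;
        std::unique_ptr<Leaf>& leaf = top_[page >> kLevelShift];
        if (!leaf) {
            leaf = std::make_unique<Leaf>();  // value-initialized entries
        }
        return (*leaf)[page & (kLevelEntries - 1)];
    }

    // Read-only lookup; returns nullptr if the leaf was never allocated.
    const T* Find(std::uint64_t address) const {
        assert((address >> kAddressBits) == 0);
        const std::uint64_t page = address >> kPageShift;
        const std::unique_ptr<Leaf>& leaf = top_[page >> kLevelShift];
        return leaf ? &(*leaf)[page & (kLevelEntries - 1)] : nullptr;
    }

private:
    std::vector<std::unique_ptr<Leaf>> top_;
};
```

With 32-bit entries each leaf is 64KB, so a sparse address space costs far less than a flat array covering the same range.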

Every time we fetch a buffer, we check whether the region is CPU dirty and build a list of the copies needed to validate the buffer from CPU data. The data is copied to a staging buffer and the buffer is validated. I’ve added a small optimization in this area, especially for small uniform buffers whose page has not been modified by the GPU. For those we can skip the cached path and copy the data directly into the device-local, host-visible stream buffer, avoiding a potential renderpass break in games that update uniforms often. Buffer upload reordering is also a potential future optimization, but I imagine that will matter more on tiled GPUs.

GPU modification tracking is partially implemented. The switch to cached buffer objects also raises the issue of alignment again. An easy solution would be to force SSBOs in most cases, but there are still cases where the alignment of 16 is not satisfied. Switching to buffer device addresses is also possible, but would probably result in performance degradation on NVIDIA hardware, as it has fixed-function binding points for UBOs/SSBOs and the hardware probably prefers that you use them.
So on each buffer bind we check the offset and align it down if necessary, putting the remainder into a push constant block that gets added to every buffer access. This results in a bit more overhead on each buffer access, but I believe it’s the simplest approach at the moment that doesn’t sacrifice much performance.
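The arithmetic behind that is small enough to sketch. The struct and function names below are hypothetical; the only assumption about the device is that the required alignment is a power of two, which Vulkan guarantees for minUniformBufferOffsetAlignment and minStorageBufferOffsetAlignment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of preparing one buffer binding.
struct BufferBinding {
    std::uint64_t aligned_offset;  // offset handed to the descriptor/bind call
    std::uint32_t adjust;          // remainder the shader adds back, via push constants
};

// Aligns 'offset' down to 'alignment' and returns the leftover bytes separately.
inline BufferBinding AlignBinding(std::uint64_t offset, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uint64_t aligned = offset & ~(alignment - 1);
    return {aligned, static_cast<std::uint32_t>(offset - aligned)};
}
```

For example, a guest offset of 0x1234 with a 64-byte requirement binds at 0x1200, and the shader adds 0x34 to every access to that buffer.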

Some notes on potential expansions

The current design should work on any modern GPU. However, we could also take advantage of ReBAR here in many ways. The simplest is to allocate all buffers in device-local, host-visible memory and perform as many updates inline as possible.
A more advanced way to make use of it would be a Linux-only technique that also uses the extremely new extension VK_EXT_map_memory_placed. This allows us to tell vkMapMemory the exact virtual address at which to map our buffer. We can use this to map GPU memory directly into our virtual address space, avoiding the need for page dirty tracking almost entirely.

The cache also makes no attempt to preserve GPU-modified memory regions when a CPU write occurs to an unrelated part of the same page. This means that the next time the buffer is used, part or all of it will get trashed by CPU memory. This is a complex problem to solve, as the guest gives us little indication of when it wants to sync, so it is left for later.

More v0.0.4 progress

by admin in

More interesting PRs came to our git these days.

First, we got “Rewrite thread local storage implementation” from The Turtle.

It’s not uncommon for PS4 guest applications to launch and use many threads, which also necessitates handling thread local storage properly. On x86, thread local accesses are performed by loading the pointer through the fs segment register. This is a problem, as Windows doesn’t allow you to change the value of this register to what the guest expects. (Not quite true, see first reply.)

On master, this is handled with a simple exception handler that patches the value of the destination register with a thread_local buffer. This works fine for now but will be a problem later on. Obviously, the performance impact is pretty large for any access. In addition, the new texture cache that does fault tracking also needs a custom exception handler, so the two end up conflicting. Also, guest apps can use negative offsets when accessing the buffer, so the current implementation would trigger UB in these cases.

This PR attempts to fix all of the above by using assembly trampolines instead of the exception handler. For storing the TLS image pointer, a new TLS slot is allocated from the parent process and the logic from Wine’s TlsGetValue is used to retrieve the value. This means we also don’t have to rely on undefined/unused spaces in the TEB structure to store our data. Each mov instruction from the FS segment is patched with a jump to a trampoline that loads the actual pointer.

While at it, I also fixed a problem with fault tracking that caused crashes in the pngdec demo. The tracking was being performed at the texture cache page size, when it should be on 4KB boundaries like the host/guest. I also bumped the cache page size to vastly reduce the number of page table accesses.

Second comes the PR “gnmdriver: basic functionality extension” from psucien.

This adds implementations for the following commonly used driver functions:

  • sceGnmComputeWaitOnAddress
  • sceGnmDispatchDirect
  • sceGnmDispatchIndirect
  • sceGnmDrawIndexOffset
  • sceGnmInsertPopMarker
  • sceGnmInsertPushMarker
  • sceGnmUpdatePsShader350
  • sceGnmUpdateVsShader

Functions related to HW state initialization and indirect draw calls will be covered in the next updates to this PR.
Submission-related functionality will be reworked in a separate PR, as it requires changes in the GPU frontend.

Another PR, “Psf info + stack allocation”, comes from shadow.

Fix stack allocation: Currently we have a lot of crashes with the default stack allocation. The /stack flag increases the stack reserve and commit sizes, so let’s hope it will solve all related crash issues.

Print param.sfo at startup: We can print the game ID, title, required firmware version and app version at the start of the log file. We will also need this info for the savedata and sceAppContent modules in the future (a savedata PR is on its way).

One more PR, “Sonicmania work” from shadow, addresses the following issues:

Flexible memory: a mostly dummy implementation of flexible memory mapping, but it allows games to go further.

CreateThread: it appears that threads are sometimes nameless.

sceUserServiceGetEvent: implemented a fake login event, which should be enough for now.

And lastly, one more PR, “dummy np* modules and screenshot module” from shadow, which adds stubs for np* functions.

Stay tuned for more updates soon 🙂

Shadps4 v0.0.4 progress (continued)

by admin in

We have some interesting PRs this week.

First, we got “Rewrite videoout library and bringup new vulkan backend” from The Turtle.

On master, the video/graphics code is relatively hard to understand, as it’s split into multiple folders and directories without much cohesion as to what each one does. In addition, the texture cache is extremely basic and works based on hashing: it tracks changes to memory regions by computing their hash. This is fine for simple demos, but when real games are put to the test, hashing large blocks of memory every draw call isn’t going to be fun. The Vulkan code was also a bit fragile and broken under Wayland, needing a hack to make it synchronize properly.

So this PR does three things: it reworks video_out to be more accurate based on my reverse engineering, fully reworks the Vulkan backend to have better abstractions that will make the 3D engine implementation easier, and fully reworks the texture caching system to be based on fault tracking.

The first part is mostly self-explanatory. The implementation has been split into a separate class for easier state management and some additional error codes have been added, but the result isn’t all that different from before. Presentation now occurs in the game thread instead of the window thread, which makes things a bit easier. In the future there should be a separate GPU thread to handle all the extra work, but that isn’t needed here.

The new Vulkan backend is based on the Citra one and uses Vulkan-Hpp instead of raw Vulkan, as it’s a little less verbose and solves the previous license problem: the C headers are licensed under Apache, which is incompatible with GPLv2. Vulkan-Hpp, on the other hand, is licensed under MIT as well, which is OK. Like in Citra, initialization is handled by the Instance class, where all extensions are also loaded. The Scheduler has been ported as well, as it will prove useful for parallel shader building in the future and makes validation layer performance a bit less miserable.

The main change here, however, is the texture cache. When a new image is stored in the cache, the region it owns is marked as protected using an mprotect call. This means that any reads or writes from the guest will go through the texture cache’s exception handler, which allows it to decide on the appropriate action (either invalidation or flush). When the image is requested again, it is validated with an upload and reprotected. In general, that’s a relatively clean way to handle the readbacks and related accesses that emulating a UMA system entails. In the future this can be tuned to be better suited to the PS4’s memory model, but it’s a good base.
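The mechanism can be sketched in a few lines on Linux. This is a simplified illustration, not the cache from the PR: it tracks a single hypothetical region through globals, only traps writes (the region stays readable), and the function names TrackRegion and WasModified are made up for the example.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

namespace {
// Hypothetical single tracked region; a real cache would look the faulting
// address up in its table of cached images.
std::uint8_t* g_base = nullptr;
std::size_t g_size = 0;
volatile std::sig_atomic_t g_write_seen = 0;

void OnFault(int, siginfo_t* info, void*) {
    auto* addr = static_cast<std::uint8_t*>(info->si_addr);
    if (g_base == nullptr || addr < g_base || addr >= g_base + g_size) {
        // Not our region: restore the default action so the fault is fatal again.
        std::signal(SIGSEGV, SIG_DFL);
        return;
    }
    g_write_seen = 1;  // this is where the cache would invalidate the image
    // Unprotect so the faulting write is retried and succeeds.
    mprotect(g_base, g_size, PROT_READ | PROT_WRITE);
}
}  // namespace

// Write-protects [base, base + size) and installs the fault handler.
// 'base' must be page aligned, e.g. memory returned by mmap.
bool TrackRegion(void* base, std::size_t size) {
    assert(base != nullptr && size > 0);
    struct sigaction action {};
    action.sa_sigaction = OnFault;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGSEGV, &action, nullptr) != 0) {
        return false;
    }
    g_base = static_cast<std::uint8_t*>(base);
    g_size = size;
    g_write_seen = 0;
    return mprotect(base, size, PROT_READ) == 0;
}

// True once a write to the tracked region has faulted.
bool WasModified() {
    return g_write_seen != 0;
}
```

After TrackRegion, the first write to the region faults once, gets recorded and then proceeds normally. Reprotecting on the next use is what makes the scheme repeatable.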

Second, we got “video_core: Add basic command list processing” from The Turtle.

Implements a few PM4 commands and the gnm submit call. This means that guest applications will no longer be stuck waiting for a command buffer label. Right now the commands don’t do anything; actual functionality will be added in future PRs. Gnmdriver functions that write private packets have been implemented as close to the real module disassembly as possible.

What does all of the above mean in general? Shadps4 is getting ready to process some real graphics from the GPU (currently we have only framebuffer demos working). So what are the next steps?
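For readers unfamiliar with PM4, here is a minimal sketch of walking a command list. It assumes the type-3 header layout from AMD’s public PM4 documentation (packet type in bits 31:30, body dword count minus one in bits 29:16, opcode in bits 15:8). The function names are hypothetical and this is not the processor from the PR. Like the PR at this stage, the handler only observes packets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Builds a type-3 header under the layout assumed above.
constexpr std::uint32_t MakeType3Header(std::uint32_t opcode, std::uint32_t body_dwords) {
    return (3u << 30) | (((body_dwords - 1) & 0x3FFF) << 16) | ((opcode & 0xFF) << 8);
}

// Walks the list and calls handler(opcode, body, body_dwords) per packet.
// Stops at the first non-type-3 or truncated packet in this sketch.
// Returns the number of packets handed to the handler.
template <typename Handler>
std::size_t ProcessCommandList(const std::uint32_t* dwords, std::size_t num_dwords,
                               Handler&& handler) {
    assert(dwords != nullptr || num_dwords == 0);
    std::size_t pos = 0;
    std::size_t packets = 0;
    while (pos < num_dwords) {
        const std::uint32_t header = dwords[pos];
        if ((header >> 30) != 3) {
            break;  // other packet types are not handled here
        }
        const std::size_t body = ((header >> 16) & 0x3FFF) + 1;
        if (pos + 1 + body > num_dwords) {
            break;  // packet runs past the end of the buffer
        }
        handler((header >> 8) & 0xFF, dwords + pos + 1, body);
        pos += 1 + body;
        ++packets;
    }
    return packets;
}
```

A real processor would switch on the opcode inside the handler and decode each packet body.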

Probably the next to follow will be:

  • Shader compiler
  • Rendering code

Stay tuned for more updates soon!

July 2023 progress report

by admin in

Hi to all,

Although progress is quite slow due to my primary job, there is some progress on the shadps4 emulator.

The main focus of development is on getting a simple PS4 SDK demo to work (videoout_basic.elf). This appears to be quite a simple graphical demo.

So what is the progress on it so far?

  • It loads using the ELF loader.
  • Resolves, patches and relocates all necessary libraries and functions in HLE. Most of the functions are dummies at the moment.
  • Executes code up to the first unimplemented HLE function (at the time of writing this report, that is the sceKernelMapDirectMemory function).

So what’s next?

Implementing some more HLE functions and getting the demo to progress further. The goal is to get it running by the end of August.

Can it run/load other demos or commercial games?

Well, shadps4 supports ELF and SELF loading, but since development is focused on that particular demo, it probably won’t do anything interesting. Most probably it will get stuck resolving unimplemented HLE functions.