Progress on the mystery SoC in the kid cam (that's actually what the firmware calls it): the trace that had the non-cached reads was actually super informative. It looks like the instruction size is indeed 4 bytes, so the SoC reads the instructions (and presumably data) straight from the SPI memory. That means my LA trace effectively documents the program flow until the cache is turned on! This is super helpful because, although the firmware has strings and stuff, I don't know what address the flash contents live at in the CPU's address space, so I can't find the instructions that refer to them.
However, with the trace I can find call and jump instructions, because the address that the CPU reads suddenly changes. They have one thing in common: in the instruction one before the last one fetched before the change (presumably because of pipelining), the MSB is 6 or 7 and the rest is the relative address to call. I could also find the return instruction, since after it execution continued at the address following a call. Conditional jump instructions were formatted the same way, so I found those too. It's enough to create a very barebones Ghidra processor implementation that can at least show me bits of control flow.
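A rough sketch of that discontinuity hunt, assuming the LA capture has already been decoded into a list of (address, 32-bit word) instruction fetches. That list format, the name `find_flow_changes` and the example words are all made up for illustration; the only things taken from the trace are the 4-byte instruction size and the "one before the last" position of the branch. It deliberately doesn't try to decode the relative address.

```python
from typing import List, Tuple

INSN_SIZE = 4  # bytes per instruction, as seen in the trace

# One fetch = (flash address, 32-bit instruction word). Hypothetical format.
Fetch = Tuple[int, int]


def find_flow_changes(fetches: List[Fetch]) -> List[Tuple[int, int, int]]:
    """Return (branch_addr, branch_word, target_addr) for every place
    where the fetch address stops being sequential."""
    changes = []
    for i in range(1, len(fetches) - 1):
        addr, _ = fetches[i]
        next_addr, _ = fetches[i + 1]
        if next_addr != addr + INSN_SIZE:
            # The instruction responsible sits one before the last
            # instruction fetched ahead of the jump (pipelining).
            branch_addr, branch_word = fetches[i - 1]
            changes.append((branch_addr, branch_word, next_addr))
    return changes


if __name__ == "__main__":
    # Placeholder words, not real opcodes from the firmware.
    example = [
        (0x00, 0xAAAAAAAA),
        (0x04, 0xBBBBBBBB),  # candidate call/jump
        (0x08, 0xCCCCCCCC),  # still fetched before the address changes
        (0x100, 0xDDDDDDDD),
        (0x104, 0xEEEEEEEE),
    ]
    for branch_addr, word, target in find_flow_changes(example):
        print(f"{branch_addr:#06x}: {word:#010x} -> {target:#06x}")
```

Data reads from flash would show up as discontinuities too, so anything this reports is only a candidate until the instruction word is checked by hand.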
The issue is that there just isn't that much code executed before the cache turns on. But I now know which subroutine turns on the cache, and I know how calls work... so I can rewrite the flash with that subroutine call nerfed. As far as I can tell, there's no checksum or CRC over the XiP program (I don't see any reads anywhere before the CPU actually executes the code), so it should work.
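For the patching step, something along these lines would do, as a minimal sketch. `patch_word` is a hypothetical helper, and the offset, the original bytes and the replacement bytes are all inputs you'd have to supply yourself: what a harmless replacement instruction looks like on this CPU isn't something I'm claiming here. Checking the expected bytes first guards against patching the wrong spot in the dump.

```python
def patch_word(image: bytes, offset: int, expected: bytes, replacement: bytes) -> bytes:
    """Return a copy of the flash image with one 4-byte instruction swapped.

    Raises ValueError if the bytes at `offset` are not what we expect,
    so a wrong offset doesn't silently corrupt the image.
    """
    if len(expected) != 4 or len(replacement) != 4:
        raise ValueError("instructions are 4 bytes wide")
    if offset < 0 or offset + 4 > len(image):
        raise ValueError("offset outside of image")
    if image[offset:offset + 4] != expected:
        raise ValueError(f"unexpected bytes at {offset:#x}")
    return image[:offset] + replacement + image[offset + 4:]


if __name__ == "__main__":
    # Dummy 16-byte image; real use would read the flash dump from a file.
    image = bytes(range(16))
    patched = patch_word(image, 4, bytes([4, 5, 6, 7]), b"\x00\x00\x00\x00")
    print(patched.hex())
```

The patched image has the same length as the original, so everything after the patched call stays at the same flash offset.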
So I flipped a few bytes, reprogrammed the flash... and now I have a huuuuuge trace file to sift through. (One of the reasons is that with the cache disabled, the camera takes about 5 seconds to start instead of starting nearly instantaneously, so time-wise the LA dump is pretty long to begin with.) Can share it if anyone still feels like looking along with me.