If you've already got a base in MMD and can render that out, here's how I'd go about it in After Effects. Unfortunately AE is the only video editor I'm any good with, so the steps might change depending on your software setup. I'm also not testing this as I write it, so if something isn't working, I can help troubleshoot.
Render out separate passes: one of the characters with no background, one of the background (both using the same camera movement), and one of the dummies in the corner for the still poses, using a camera that doesn't move. For the background pass, it might be ideal to leave the characters in it, so they still show up in the floor reflection if there is one. That should be okay here, since the characters with effects get composited on top and should cover the originals perfectly.
Faces would need to be removed in 3D, then exported faceless from MMD. I'm not sure if you can make/find a 'mask' prop that just covers up the faces straight in MMD, with its shader set to luminance only (idk MMD-specific shaders, but that's how I'd do it in C4D; it should give it no shading, just a flat color). Try to get the rendering as close as you can in MMD (especially no shadows and the colored glove/hand, with the colors as close to accurate as you can get them), then use AE to finish it up. I'm not too familiar with MMD, so I can't help there.
If the video file isn't transparent, you'll need to Roto Brush it so it is. Personally, I'd then render it out as a transparent MOV just so it's easier to preview (it can get laggy). Also, I'm not familiar with how MMD exports stuff, so if it's an image sequence (frame stack) instead of, like, an mp4/mov/other video file, double check that AE imported the sequence at the right frame rate: right click it in the Project panel, go to Interpret Footage > Main, and set the frame rate to whatever you exported it at, otherwise it'll be off time.
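To see why the interpreted frame rate matters, here's a rough Python sketch. The function name and the numbers are made up for illustration, and nothing here is an AE or MMD API. It only shows that the same image sequence runs for a different length depending on the frame rate AE assumes.

```python
def sequence_duration(frame_count, fps):
    """Length in seconds of an image sequence played back at `fps`."""
    return frame_count / fps

# Hypothetical example: 1800 frames rendered out at 60 fps.
exported = sequence_duration(1800, 60)  # the length you intended
misread = sequence_duration(1800, 30)   # the length if AE assumes 30 fps instead

print(exported, misread)
```

If the two numbers differ, the footage will play at the wrong speed and won't line up with the music until the interpreted frame rate is fixed.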
Then, probably some combination of Levels (to get rid of any lingering shadows, set the contrast you want, and push the light colors to pure white), Hue/Saturation for minor color correction, and maybe Posterize if you want it more cel-shaded. If Posterize gives edges that are too jagged, duplicate the layer, set the top copy to use the copy below it as its track matte (so the blur can't spill past the original silhouette), then add some Gaussian Blur to the top copy before Posterize. You can also set the Posterize/blur layer (still track matte'd to the base layer below it) to a lower opacity if you want just a tiny bit of cel-shading rather than the full effect.
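If it helps to picture what those two effects are doing to the pixel values, here's a minimal Python sketch on a single brightness value from 0 to 1. This is the generic textbook version of a levels remap and a posterize step, not AE's actual implementation, and the function names and parameters are my own.

```python
def levels(value, black=0.0, white=1.0, gamma=1.0):
    """Generic levels remap: clip to [black, white], stretch to 0..1, apply gamma."""
    t = (value - black) / (white - black)
    t = min(1.0, max(0.0, t))
    return t ** (1.0 / gamma)

def posterize(value, steps):
    """Snap a 0..1 value to the nearest of `steps` evenly spaced levels."""
    return round(value * (steps - 1)) / (steps - 1)

# Pulling the white point down to 0.8 turns anything at 0.8 or brighter pure white.
print(levels(0.9, white=0.8))
# With 3 steps, every value lands on 0.0, 0.5, or 1.0, which gives the banded look.
print(posterize(0.4, 3))
```

The blur-before-Posterize trick works on the same idea: softening the values first means the snap points fall along smoother contours.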
Then comes the outline. I still hate that AE doesn't have a proper Outline effect. Here's a tutorial for that part because I can never remember it myself:
You can then layer the characters on top of the background.
Starting with the setup of "poses slide into the flashing thing on time", you'll need to look up a BPM to FPS calculator to figure out how often the bar needs to flash. Then animate it flashing once, using about 4 keyframes that span x frames, where x is the number the calculator gives you (the last two keyframes should have the same value at different times, to account for the brief pause between flashes). After that, just put loopOut() as an expression on whatever property needs to loop (probably just Opacity). Expressions are case-sensitive, so it's loopOut with a lowercase l. This all assumes there are no tempo changes.
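If you'd rather not hunt for a calculator, the math is just fps * 60 / BPM. Here's a small Python sketch of it. The function names are mine and the tempos are only example values.

```python
def frames_per_beat(bpm, fps):
    """How many frames one beat lasts at a given tempo and frame rate."""
    return fps * 60.0 / bpm

def beat_frames(bpm, fps, count):
    """Nearest whole frame for each of the first `count` beats.

    Each beat is rounded from its exact position, so the rounding
    error never accumulates from one beat to the next.
    """
    step = frames_per_beat(bpm, fps)
    return [round(i * step) for i in range(count)]

print(frames_per_beat(120, 30))   # whole number of frames per beat
print(frames_per_beat(128, 30))   # fractional number of frames per beat
print(beat_frames(128, 30, 5))
```

One thing to watch: if the result isn't a whole number, a fixed-length loop slowly slides off the beat. At 128 BPM and 30 fps a beat is 14.0625 frames, so a 14-frame loop ends up about one frame early after 16 beats. For a short clip that's probably fine, but for a long one you may want to place the flashes by hand at the rounded frames instead of looping.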
I think the arrows will need to be drawn manually, and if you want to edit some of the poses to be closer together, you'd do that in Photoshop/etc too. You can bring the footage of the dummies into AE, then there should be some way to export every x frames to a PNG (you may need to set the comp's frames per second down to like 1 instead; I don't think it can go lower than that, unfortunately). You'll need to Roto Brush this first too if it's not on a transparent background.
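For working out which frames you'd end up with, here's a quick Python sketch. The names are made up, and it assumes that lowering the comp frame rate simply keeps one frame out of every x source frames, which I haven't verified in AE.

```python
def still_frames(total_frames, every):
    """Source frame numbers you'd keep when grabbing one frame out of every `every`."""
    return list(range(0, total_frames, every))

def reduced_fps(source_fps, every):
    """Comp frame rate that corresponds to one frame per `every` source frames."""
    return source_fps / every

# Hypothetical example: 90 frames of 30 fps footage, one still per second.
print(still_frames(90, 30))
print(reduced_fps(30, 30))
```

So if the dummies hold each pose for a second at 30 fps, a 1 fps comp should give one PNG per pose.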
Bring these new frames back into AE once you've drawn your arrows on them (in Photoshop/Illustrator/etc). Make sure they come in as individual stills and haven't been imported as footage, i.e. as one image sequence. 'Cause they're stills, it'll save on render time if you just do the outlines in Photoshop/Illustrator/etc rather than make AE calculate them, but you can also just copy and paste the outline effect from earlier, reduce the size and change the color, if that's easier.
There are a handful of ways to animate the poses sliding in, whether they're all parented to a null, or you individually control each still. But either way, I'd pre-comp them so it's not just a million layers haha. There's also the free Animation Composer plugin that has a "stagger layers by x frames" feature that could be pretty useful for timing.
Anyway, not sure if that all makes sense, so let me know if I can clarify anything!