I'm just going to give you general tips, because I have a hard time telling exactly what your UTAU is pronouncing wrong, and whether the cause is that it's mixed with the instrumental or that the oto is off. My ear for that stuff is bad, but hopefully this helps.
Vowels:
あ (a) - You want an "a" sound that's just the right darkness so that it's between the "a" in "apple" and the "a" in "father".
い (i) - Pronounced like the letter "E", pretty simple.
え (e) - This is something I've had a problem with previously. DON'T pronounce it like the letter "A", because that's not right. It's an "eh" noise, like the one in the word "feather".
う (u) - I can't really think of how to tell you to pronounce it, because I'm not 100% sure if I'm pronouncing it right. I -think- it's in between "whoo" and "uh".
お (o) - This is tricky because it's similar to how you'd pronounce an "o", but much deeper back in the throat and darker. Avoid making it sound like the "o" in "toe"; aim for something more like the "o" in "organism".
A few of the trickier consonants, for someone who doesn't know to look out for them, are the ones in sounds starting with "t", "r", "f", and "w". (I probably forgot a few, but I can't recall them right now.)
Pronounce "t" with less breathiness than in English. It should sound similar to a "d", without the "duh" sound at the beginning.
"r" is just... I'm not gonna get into that, because I can't really pronounce it right. It's the perfect blend of "r", "l", and "d", although I've heard mild "l"s used in its stead.
How to pronounce "f" depends on how you're using it. If you're recording "fu", it should sound more like "hu". I think the others should be fine as an "f" sound tho, so long as you give it less breathiness, similar to what you did with the "t". That's more of a personal thing, because I think "fa", "fi", "fe", and "fo" are only for Engrish, but the lack of breathiness works better in UTAU imo.
"w" is kind of like a slight "u" sound before the following vowel. And when you're recording "wo", disregard the "w" altogether.
This is my understanding, anyways. If someone more experienced in speaking Japanese sees something wrong, feel free to correct me.
Also, you might want to DL Japanese voicebanks and listen to their samples to see how other things are pronounced.