Kanru Hua has formally released the Arpasing standard for English UTAU voicebanks. You can read about it on his website at these links.
This tutorial will explain how to use an Arpasing voicebank in UTAU.
It is in your best interest to have a fully OTO'd voicebank before making USTs with it, because adjusting the timing of notes depends on preutterance values.
With Arpasing Assistant
If you don't yet have the Arpasing Assistant plugin, you can download it from here.
Put plain English words into the UST, select the notes, and run the Arpasing Assistant plugin to convert them. Alternatively, you can enter plain Arpabet phonemes into the note, separated by spaces. If you put "/u" at the end of a word, the note will be divided evenly among its diphones, instead of each diphone getting a different length.
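To picture what an even division looks like, here is a minimal sketch. It assumes note lengths are counted in UST ticks (480 per quarter note), and the function name split_evenly is made up for this example. How the plugin itself handles leftover ticks is not documented here, so the rounding below is only an assumption.

```python
def split_evenly(total_ticks: int, diphone_count: int) -> list:
    """Divide a note length into near-equal integer parts that sum to the total."""
    if diphone_count <= 0:
        raise ValueError("diphone_count must be positive")
    base, remainder = divmod(total_ticks, diphone_count)
    # Assumption: leftover ticks go to the earliest parts so the total is preserved.
    return [base + (1 if i < remainder else 0) for i in range(diphone_count)]


print(split_evenly(480, 4))  # [120, 120, 120, 120]
print(split_evenly(480, 7))  # [69, 69, 69, 69, 68, 68, 68]
```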
The notes you select for conversion MUST have another note after them. The best way to ensure this is by putting a rest at the end of the UST.
To do multi-syllable words via word input, you will have to delete every note of the word except the first syllable's, and extend that first note to cover the length of all the notes combined. Put the entire word in the note, and convert as normal. From there you will have to put the notes back at the correct pitch and fix the timing.
However, doing multi-syllable words via phoneme input is a lot simpler, as you can simply break the phonemes across multiple existing notes before conversion.
FOR MULTIPITCH: If the voicebank you're using is multipitch, there are issues with Arpasing Assistant. The best thing to do is to temporarily remove all pitches with suffixes from the voicebank. The pitch without a suffix should be the only one remaining, and should not be inside a subfolder. Then, run Arpasing Assistant as usual. Once the UST has been converted, put the other pitches back in.
When converting words to diphone notes, Arpasing Assistant will select the best copy of a particular diphone based on context. For example, if you input the word "stand", it will choose a [t ae] note from a "stae" sample rather than from a "tae" sample. The context of word-like recordings creates natural pronunciation. The numeric suffixes exist to distinguish these copies from one another, so don't delete the numbers.
While it is recommended to use a completely untuned UST, you can adjust the volume level of notes and flags prior to conversion, and they will be retained. Pitchbends and vibrato get messed up and aligned to the wrong notes, so it's recommended to remove them entirely first.
Listen to the UST as it is right now to get an idea of how it sounds, and what sections you need to fix. Chances are that some notes will be off time, and there will be pronunciations that you are unhappy with.
Find the words or syllables where the pronunciation isn't what you want it to be. If you're not familiar with Arpabet, you can reference this article when editing. https://en.wikipedia.org/wiki/Arpabet
Let's fix the timing. The most glaring problem is with multi-syllable words. Because they had to be combined into a single note prior to conversion, the timing is no longer the same as intended. Hold down ctrl and drag the ends of the notes to change the lengths without pushing the notes around. Align them so that the core CVs of each syllable start at the intended position. Fix the pitches while you're at it, by selecting all the notes of one syllable and moving it up/down.
Look for phrases that start with consonants. (They would be the first syllable after what used to be a rest.) The [- C] note is often very short, so hold ctrl and drag the left side to make it longer that way. Adjust the length so that it starts where the envelope of the next note starts.
From there you can go through note by note fixing the timing of all the small notes, by changing them to the length of the preutterance envelope of the next note. This is why you want your bank to already be fully OTO'd. You probably want to set your Quantize to 64th notes.
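If you want a starting number instead of eyeballing the envelope, the preutterance can be converted from milliseconds to ticks. The sketch below assumes 480 ticks per quarter note, a preutterance value in milliseconds as stored in oto.ini, and a 30-tick grid (a 64th note at that resolution). The function names are made up for this example, and the result is only a first guess to adjust by ear.

```python
TICKS_PER_QUARTER = 480  # assumed UST resolution
SIXTY_FOURTH_NOTE = TICKS_PER_QUARTER // 16  # 30 ticks


def ms_to_ticks(ms: float, tempo_bpm: float) -> float:
    """Convert a duration in milliseconds to ticks at the given tempo."""
    ms_per_quarter = 60000.0 / tempo_bpm
    return ms / ms_per_quarter * TICKS_PER_QUARTER


def transition_length(preutterance_ms: float, tempo_bpm: float,
                      grid: int = SIXTY_FOURTH_NOTE) -> int:
    """Suggest a length for the small note before a note with this preutterance."""
    ticks = ms_to_ticks(preutterance_ms, tempo_bpm)
    # Snap to the quantize grid, but never go below one grid step.
    return max(grid, int(round(ticks / grid)) * grid)


print(transition_length(100, 120))  # 90
print(transition_length(60, 150))   # 60
```

At 120 BPM a quarter note lasts 500 ms, so a 100 ms preutterance is about 96 ticks, which snaps to 90 on a 64th-note grid.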
As you go along you may also be modifying pronunciation. For example, "don't want" will become [d ow][ow n][n t][t w][w aa] etc., but this is far too many notes for a short space, and in practice the phrase often sounds like "don want" when sung. So, remove the [n t] and change the [t w] to [n w].
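Another way to think about this edit is that you are dropping one phoneme and re-pairing its neighbours. A small sketch of that idea follows; the helper name diphones is made up for this example and is not part of the plugin.

```python
def diphones(phonemes):
    """Pair each phoneme with the one that follows it."""
    return [f"{a} {b}" for a, b in zip(phonemes, phonemes[1:])]


full = ["d", "ow", "n", "t", "w", "aa"]
# Drop the "t" at index 3, as in the "don want" pronunciation.
reduced = full[:3] + full[4:]

print(diphones(full))     # ['d ow', 'ow n', 'n t', 't w', 'w aa']
print(diphones(reduced))  # ['d ow', 'ow n', 'n w', 'w aa']
```

Removing one phoneme always removes one note and rewrites the note after it, which is why [n t] disappears and [t w] turns into [n w].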
From here, you can tune as normal.
Without Arpasing Assistant
If you aren't using the plugin, you can bypass all of the problems involved in adjusting the timing and incorrect phonemes. However, you miss out on all the convenience of having base conversions.
If your computer is able to use Arpasing Assistant but you choose not to use it, you can open the exe as a standalone program to look up word-to-Arpabet conversions. However, if you're completely unable to use it (for example, Mac users) you can use this website to find conversions. http://www.speech.cs.cmu.edu/cgi-bin/cmudict The dictionary may not be as robust as the one built into the plugin, but after familiarizing yourself with the phonetic system you should be able to transcribe any missing words yourself.
Enter the plain English lyrics into the UST as you normally would, for reference.
Change each note to the main CV of the syllable, separating each phoneme with a space. For example, the word "cat" is "K AE T" in Arpabet, so change the lyric to say [k ae]. Likewise, "steps" is "S T EH P S", so put [t eh].
The other essential notes we need are phrase-final notes. Wherever there is a rest after a note, you need a phoneme-to-silence note. As a rule of thumb, if the rest is less than 1 quarter note long, change the whole rest into that note. If the rest is longer, then only the first quarter note needs to be changed. For "cat" that would be a [t -] note, and for "steps" it would be [s -].
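If it helps to see the rule spelled out, here is a minimal sketch that picks the main CV out of one syllable's transcription. It assumes the 15 vowel symbols used by the CMU dictionary, strips the stress digits (0, 1, 2) that the dictionary attaches to vowels, and expects one syllable at a time. The function name core_cv is made up for this example.

```python
# Vowel symbols used by the CMU Pronouncing Dictionary, in lowercase.
ARPABET_VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
    "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}


def core_cv(syllable: str) -> str:
    """Return the vowel of a syllable plus the consonant right before it, if any."""
    # Lowercase and strip stress digits such as the 1 in "AE1".
    phonemes = [p.rstrip("012") for p in syllable.lower().split()]
    for i, phoneme in enumerate(phonemes):
        if phoneme in ARPABET_VOWELS:
            return " ".join(phonemes[max(i - 1, 0):i + 1])
    raise ValueError(f"no vowel found in {syllable!r}")


print(core_cv("K AE T"))      # k ae
print(core_cv("S T EH P S"))  # t eh
print(core_cv("AY"))          # ay
```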
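The rule of thumb can be written down as a small calculation. This sketch assumes lengths are in UST ticks with 480 ticks per quarter note; the function name phrase_final_note is made up for this example.

```python
QUARTER_NOTE_TICKS = 480  # assumed UST resolution


def phrase_final_note(last_phoneme: str, rest_ticks: int):
    """Return (lyric, note length, remaining rest length) for a phrase ending."""
    lyric = f"{last_phoneme.lower()} -"
    # Short rests are converted entirely; long rests only give up a quarter note.
    length = min(rest_ticks, QUARTER_NOTE_TICKS)
    return lyric, length, rest_ticks - length


print(phrase_final_note("T", 240))  # ('t -', 240, 0)
print(phrase_final_note("S", 960))  # ('s -', 480, 480)
```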
The rest of it is entering transitional notes. This list works on diphones, so every lyric note has two phonemes.
Silence counts as one, and is necessary whenever transitioning between sung lyrics and rests. We already covered notes with rests AFTER them, so let's consider notes with rests before. If it starts with a vowel, you can simply add a hyphen. For example, "eye" [ay] becomes [- ay]. If it starts with a consonant or consonant cluster, you want to work backwards from the innermost consonant. Take "strand" for example. Make a note just before [r ae] that has [t r]. Make a note just before [t r] that has [s t]. Make a note just before [s t] that has [- s]. This gets you a full transition from silence, through the consonants, to the nucleus of the syllable.
For transitions between syllables, use this same principle of full transitions from diphone to diphone.
"stranded on this island"
[- s][s t][t r][r ae][ae n][n d][d eh][eh d][d aa][aa n][n dh][dh ih][ih s][s ay][ay l][l eh][eh n][n d][d -]
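The chain above can be reproduced mechanically: put silence at both ends of the phrase, then pair every phoneme with the one after it. Here is a minimal sketch, using the same phonemes as the example. The function name diphone_chain is made up for this example.

```python
def diphone_chain(words):
    """Turn space-separated phoneme strings into a silence-to-silence diphone list."""
    sequence = ["-"]
    for word in words:
        sequence.extend(word.split())
    sequence.append("-")
    # Each note is a phoneme paired with the phoneme that follows it.
    return [f"{a} {b}" for a, b in zip(sequence, sequence[1:])]


phrase = ["s t r ae n d eh d", "aa n", "dh ih s", "ay l eh n d"]
notes = diphone_chain(phrase)
print("".join(f"[{note}]" for note in notes))
# [- s][s t][t r][r ae][ae n][n d][d eh][eh d][d aa][aa n][n dh][dh ih][ih s][s ay][ay l][l eh][eh n][n d][d -]
```

A phrase of N phonemes always produces N + 1 notes this way, which is a quick check that no transition was skipped.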
The length of the transitional notes depends on the preutterance of the note that follows it. However, you can use your personal judgement to make them longer or shorter.
If any particular note sounds strange or off, you can switch to using an alternate recording of the same diphone by adding a numerical suffix, like changing [f ey] to [f ey1] and so on.
From here, you can tune as normal.