Arpasing with single phonemes/letters

Discussion in 'UTAU Discussion' started by Rushur, Mar 23, 2019.

    Would it work with moresampler?
    I know it'd be choppy, but how choppy?
    Will it even oto correctly?
    Has anyone done it before? If so, what were the results?

    If anyone knows anything about this I'd like to know!

    EDIT: I just realized I posted this in the wrong thread
    Single phonemes or letters for arpasing would not work at all for a number of reasons.

    -Arpasing assistant would not be able to help you much. A lot of people who do Arpasing use it, so going without the tool you need to make proper-sounding phonetics with your samples is kind of a bad idea.
    -Memory-wise it's incredibly inefficient and the bank would be needlessly big because you have 1000+ samples.
    -Otoing would be at another level of hell.
    -If you're someone who struggles with VCV/CVVC otoing as it is, then otoing single syllables like this for Arpasing will be even harder, and the bank would sound like bird chirping from how short the samples are. Either that or your bank will sound like it's speaking binary.
    Most of the people I know don't actually use the assistant to begin with. The reason I consider compatibility to be a critical factor is that it decides whether a reclist can be counted as Arpasing or not. I make a few exceptions to this rule because diphone samples using Arpabet phonemes that weren't included in Arpasing are consistent with the design of the method. But a reclist using strictly single phoneme samples would no longer be an Arpasing voicebank; it would be a single phoneme English voicebank using Arpabet encoding.

    Strangely enough, Moresampler's OTO generator already generates single phoneme entries for vowels and for the consonant "n". I have to wonder why we need 300 "n" samples.

    This is incorrect. If you were to record individual phonemes for an English voicebank, you would only have 35 or so. The reason that voicebanks like VCV Japanese have a thousand OTO entries is that, by combining multiple phonemes into a single sample, you are multiplying the number against itself several times. Let's design a reclist for a hypothetical language that has these phonemes: a i k s
    A single phoneme reclist would look like this.
    a i k s
    A CVVC reclist would look like this.
    a i
    ka ak ki ik
    sa as si is
    And a VCV reclist would look like this.
    a i aa ai ia ii
    ka ki aka aki ika iki ak ik
    sa si asa asi isa isi as is
    The more phonemes we need to combine into a single sample, the more samples we need to cover all the possible combinations. OP is on the right track with recording single phonemes to reduce the size of the reclist.
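The counting above can be sketched in a few lines of Python. This is only a rough illustration for the toy language with a i k s: the function names are made up here, and the patterns simply follow the three example lists, not any official reclist generator.

```python
from itertools import product

def single_phoneme(vowels, consonants):
    # One sample per phoneme, nothing combined.
    return list(vowels) + list(consonants)

def cvvc(vowels, consonants):
    # Plain vowels, plus a CV and a VC sample for every consonant/vowel pair.
    samples = list(vowels)
    for c in consonants:
        for v in vowels:
            samples += [c + v, v + c]
    return samples

def vcv(vowels, consonants):
    # Plain vowels and every vowel-to-vowel pair...
    pairs = list(product(vowels, repeat=2))
    samples = list(vowels) + [v1 + v2 for v1, v2 in pairs]
    # ...then, per consonant: CV, every VCV, and VC samples.
    for c in consonants:
        samples += [c + v for v in vowels]
        samples += [v1 + c + v2 for v1, v2 in pairs]
        samples += [v + c for v in vowels]
    return samples

vowels, consonants = ["a", "i"], ["k", "s"]
print(len(single_phoneme(vowels, consonants)))  # 4
print(len(cvvc(vowels, consonants)))            # 10
print(len(vcv(vowels, consonants)))             # 22
```

With V vowels and C consonants, these patterns give V + C samples for single phonemes, V + 2VC for CVVC, and V + V² + C(2V + V²) for VCV, which is why the VCV list grows so much faster as phonemes are added.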

    Given that there are only a handful of samples, it would be fairly simple to OTO by hand. I suppose that running such a voicebank through Moresampler would actually result in entries like [- k] and [k -], which would only double the number of OTO entries, to roughly that of an average CV Japanese voicebank.

    We're starting to get to the actual reason that the single phoneme method would not work. There are no transitions between the phonemes. By recording everything entirely separately, you must rely solely on the crossfade between envelopes to connect the sounds. The reason that CV often sounds "choppy" compared to VCV is that there is nothing connecting the end of one CV to the beginning of the next CV. The method described by OP takes this to a new extreme. I'm now interested in trying it out just to see if I could manage to get a decent result with it, but it's likely to be much more trouble than it's worth.
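To make "relying solely on the crossfade" concrete, here is a minimal sketch of a linear crossfade between two lists of sample values. It is an assumption-laden toy, not how UTAU or any resampler actually applies envelopes, and the function name `crossfade` is made up for illustration. The point is that the overlap region only mixes the two recordings; it cannot create the articulation that a recorded transition would contain.

```python
def crossfade(a, b, overlap):
    """Join two sample lists, linearly fading a out and b in over `overlap` samples."""
    if overlap < 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both samples")
    if overlap == 0:
        # No overlap at all: a hard cut from one phoneme to the next.
        return list(a) + list(b)
    head = list(a[:len(a) - overlap])   # part of a that plays alone
    tail = list(b[overlap:])            # part of b that plays alone
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)     # weight of b rises from near 0 to near 1
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + mixed + tail

# Two constant "phonemes": the overlap is just a weighted blend of the two.
joined = crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 2)
print(len(joined))  # 6
```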
    A single phoneme voicebank where you record consonants and vowels separately would not work well in UTAU and would be very user-unfriendly.

    You would need to edit the length and volume of every sound manually, and it would no longer be competitive with the Arpasing assistant plug-in. Also, semivowels (w and y) are difficult to record on their own. I also bet that making it say consonant clusters somewhat decently will be hell.

    Also, I have big concerns about the voice's intelligibility because of the lack of *any* sound transitions at all. It wouldn't be as good as a CVVC or even CV style English voicebank (which still has the needed consonant clusters and ending consonants recorded).

    So because of these factors, I don't want to encourage this method at all.

