Arpasing VB Tutorial 1.1

How to record and OTO an Arpasing voicebank

  1. Kiyoteru
    Kanru Hua has formally released Moresampler 0.8.0 and the Arpasing standard for English UTAU voicebanks. You can read about it on his website at these links.
    https://webhost.engr.illinois.edu/~khua5/index.php/2017/02/24/moresampler-0-8-0-release/
    https://webhost.engr.illinois.edu/~...6/introducing-arpasing-for-english-utauloids/

    This tutorial will explain how to record and OTO an Arpasing bank yourself.

    While you can record with any software you like, it will be much, much easier to use OREMO. If you're unfamiliar with it, see this as an opportunity to learn. If you plan to do sample processing, do it after recording. If you're on Mac, download the Wine-wrapped version here: http://utaforum.net/resources/utau-dev-tools-for-mac.165/

    To begin, go to this directory, and download the zip for Kanru Hua's default reclist: http://utaforum.net/resources/arpasing-reclist-directory.299/
    This folder contains the reclist, OREMO comment file, and index file.
    The reclist is simply for the file names, which are numbers 0 through 219.
    The OREMO comment file allows you to see the actual phonemes that correspond with each number.
    The index file is what Moresampler references when generating the OTO.

    In OREMO, open reclist.txt as the reclist. Click on the bottom bar to set the destination folder as "arpasing". This will load up the corresponding comment file so that you can record.

    You can record with or without guideBGM. If you want to use guideBGM, I recommend using a short one made for CVVC reclists, such as the CVVChinese BGMs or the VCCV English BGMs.

    The comment file tells you how to pronounce each string precisely in Arpabet, and approximately in words. This short article explains how to read and pronounce Arpabet: https://en.wikipedia.org/wiki/Arpabet It's actually pretty straightforward!
    The words are just a rough guide. Remember, other than the first few vowel strings, each string uses only one vowel, so all three syllables will rhyme.

    Sing the 3 syllables consecutively, as if recording VCV. A "q" in the phonetics (or an apostrophe in the words) marks a brief pause (a glottal stop).
    For reference, you can download this voicebank to hear how it's recorded. https://vocalsynth.cloud/t0
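
    Once recording is finished, a quick script can confirm that no string was skipped. This is a minimal Python sketch, assuming OREMO saved each take as "<number>.wav" with the numbers 0 through 219 from the default reclist. The .wav extension and the helper name missing_recordings are assumptions for illustration, not part of the reclist package.

```python
from pathlib import Path

# Assumption: takes are saved as "<number>.wav", numbered 0 through 219,
# matching the default reclist described above.
EXPECTED_NAMES = [f"{n}.wav" for n in range(220)]


def missing_recordings(folder, present=None):
    """Return the expected file names that are not in the folder.

    `present` lets you pass a list of names directly instead of reading disk.
    """
    if present is None:
        present = [p.name for p in Path(folder).iterdir()]
    have = set(present)
    return [name for name in EXPECTED_NAMES if name not in have]


if __name__ == "__main__":
    # "arpasing" is the destination folder chosen in OREMO earlier.
    if Path("arpasing").is_dir():
        print(missing_recordings("arpasing"))
```

    An empty list means every numbered take is present.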

    FOR MULTIPITCH: The main voicebank folder must contain one pitch whose OTO entries have no suffix. Every other pitch gets a suffix and goes in its own subfolder. Don't add suffixes to the file names themselves, or else they will no longer match the index.csv.
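
    To catch accidentally suffixed file names before generating the OTO, something like the following could be run on the file list of each pitch folder. It is only a sketch: it assumes the recordings are .wav files named with plain numbers, and suffixed_names is a made-up helper name.

```python
def suffixed_names(file_names):
    """Return .wav names whose stem is not a plain number (e.g. '12A#3.wav')."""
    bad = []
    for name in file_names:
        stem, dot, ext = name.rpartition(".")
        # Only look at .wav files; oto.ini, index.csv, etc. are ignored.
        if dot and ext.lower() == "wav" and not stem.isdigit():
            bad.append(name)
    return bad


# Hypothetical folder listing: one file was renamed with a pitch suffix.
print(suffixed_names(["0.wav", "1.wav", "12A#3.wav", "oto.ini"]))
```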

    On to OTOing! Simply drag and drop the voicebank folder onto moresampler.exe.
    Enter 3 to choose Arpasing, then y or yes to keep duplicates. You can also choose whether or not to include a suffix. Suffixes can't contain characters such as arrows or kanji, so you will have to use something like "S" or "A#3".

    This is where I have to say sorry to the fellow Mac users out there: to do the same thing, you must run Wine through the terminal, and this process is far too troublesome for the average user. Transfer your files to a Windows computer, or ask a friend to generate the OTO for you.

    Now that your base OTO is generated, it's time to refine it. Every entry of the OTO is a diphone, meaning that it contains exactly two phonemes, or two sounds. In general, the first one is a connector to the previous note, while the second one is the main phoneme for the current note. To OTO an entry, first set the parameters for the section corresponding to the first phoneme, then the ones for the second phoneme.
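
    If it helps to see what an entry looks like outside the editor, here is a small Python sketch that splits one oto.ini line into its fields and its two phonemes. It assumes the usual oto.ini layout of "file=alias,offset,consonant,cutoff,preutterance,overlap" and a space between the two phonemes in the alias. The sample line and its numbers are made up, and OtoEntry and parse_oto_line are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    wav: str
    alias: str
    offset: float        # left blue area
    consonant: float     # pink area
    cutoff: float        # right blue area
    preutterance: float
    overlap: float

    @property
    def phonemes(self):
        """Split the diphone alias into (first, second)."""
        return tuple(self.alias.split())


def parse_oto_line(line):
    """Parse one 'file=alias,offset,consonant,cutoff,preutterance,overlap' line."""
    wav, _, rest = line.strip().partition("=")
    alias, *numbers = rest.split(",")
    offset, consonant, cutoff, preutterance, overlap = (float(x) for x in numbers)
    return OtoEntry(wav, alias, offset, consonant, cutoff, preutterance, overlap)


# Made-up example line, not taken from a real voicebank.
entry = parse_oto_line("12.wav=k ae,350,120,-400,80,30")
print(entry.phonemes)
```

    If you chose a suffix during generation, it will be part of the alias, so the second item will include it.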

    First Phoneme

    This section covers the blue offset and the overlap.

    [-]
    The amount of overlap really doesn't matter for this one, because these notes always come at the beginning of a phrase, right after a rest. The only important thing is that it covers an area of silence.
    [​IMG]

    [c]
    Unvoiced plosives (p t k)
    If this is the first phoneme in the string, move the offset such that the overlap ends up about 15msec before the consonant.
    If there are other phonemes before this one, move the offset to where the previous one ended. Make sure you can't hear the previous one. Place the overlap about 15msec before the consonant.
    [​IMG]
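
    For anyone editing the numbers by hand rather than dragging the lines, the 15 msec rule works out as simple arithmetic. This sketch assumes that the overlap value in oto.ini is measured from the offset, and that you have read the position of the consonant (in msec from the start of the file) off the waveform. The function name and the example positions are hypothetical.

```python
def overlap_before_consonant(offset_ms, consonant_start_ms, lead_ms=15.0):
    """Overlap value (relative to the offset) that lands lead_ms before the consonant."""
    return (consonant_start_ms - lead_ms) - offset_ms


# Made-up positions: previous phoneme ends at 1200 msec, the consonant starts at 1260 msec.
print(overlap_before_consonant(1200, 1260))
```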

    Voiced plosives and affricates (b d g ch jh)
    If this is the first phoneme in the string, move the offset such that the overlap ends up where the consonant begins.
    If there are other phonemes before this one, move the offset to where the previous one ended. Make sure you can't hear the previous one. Place the overlap where the consonant begins.
    [​IMG]

    Fricatives, nasals, and liquids (f v th dh s z sh zh hh m n ng l r)
    Move the offset to where the consonant begins. For 'r' in particular, you may want to refer to the glides section for help.
    [​IMG]

    Glides (y w)
    These consonants can be difficult to see on a normal waveform. By clicking on the S button, you can switch to spectrogram view, which gives you another way to visualize the audio. The bright areas are the loudest frequencies. These consonants show up as a change in frequencies over time.
    Move the offset to where the consonant begins, then place the overlap where the sound is still consistent, before the frequencies start to change. The preutterance will end up after the change.
    [​IMG]
    [​IMG]

    [v]
    By default, the overlap for these samples should already be fairly large. If it's absurdly small, moving it to around 50 msec should be good.
    Move the initial offset so that the area between it and the overlap is at a consistent level.
    [​IMG]

    Second Phoneme

    In all cases, the preutterance should be placed where the first phoneme ends and the second phoneme begins. This section also covers the pink area, the white area, and the blue cutoff.

    [c]
    Stops (p b t d k g ch jh)
    There should be a small bit of silence or near silence just before the consonant. Move the pink to where the silence begins, and the cutoff to where the silence ends. Yes, we're not including the consonant itself. That's because, in a UST, this note would be followed by another note that DOES have the consonant. This allows for a smooth transition without an awkward double consonant sound.
    [​IMG]

    Fricatives (f v th dh s z sh zh hh)
    We don't want these consonants to be stretched, so cover the entire consonant with pink until just before the end. Bring the cutoff to the same place, leaving a tiny gap between the pink and the cutoff. Without this gap, resamplers won't be able to render the sample.
    [​IMG]
    If there is silence after the consonant, let the white area cover that silence instead.
    [​IMG]
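
    The fricative case can also be written as arithmetic. The sketch below assumes that the pink (consonant) value is measured from the offset and that a negative cutoff value is measured from the offset as well, which is how oto.ini is usually read. The 10 msec gap is only a placeholder for "tiny", and fricative_fields is a made-up name.

```python
def fricative_fields(offset_ms, consonant_end_ms, gap_ms=10.0):
    """Return (consonant, cutoff) values for a fricative as the second phoneme.

    The cutoff sits at the end of the consonant and the pink area stops
    gap_ms earlier, so a small white area is left for the resampler.
    """
    pink_end = consonant_end_ms - gap_ms
    consonant = pink_end - offset_ms
    cutoff = -(consonant_end_ms - offset_ms)  # negative: measured from the offset
    return consonant, cutoff


# Made-up positions: offset at 1000 msec, fricative ends at 1180 msec.
print(fricative_fields(1000, 1180))
```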

    Nasals, liquids and glides (m n ng l r y w)
    Move the pink to where the consonant starts being stable and consistent. Use the cutoff to remove where the consonant ends or fades out. These consonants are safe to stretch.
    [​IMG]

    [v]
    Move the pink to where the vowel starts being stable and consistent. Use the cutoff to remove where the vowel fades out. The white area will be the sustained part of the note, which ensures that it will sound good.
    [​IMG]

    [-]
    Cover everything with pink, such that everything in white is silence.
    [​IMG]
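
    As a recap, the two sections above boil down to a lookup from each phoneme to the rule that applies to it. The Python sketch below encodes the consonant groups exactly as listed in this tutorial. The vowel set is an assumption (the standard Arpabet vowel symbols), anything else is reported as "unlisted", and rules_for is a hypothetical helper name.

```python
UNVOICED_PLOSIVES = {"p", "t", "k"}
VOICED_PLOSIVES_AFFRICATES = {"b", "d", "g", "ch", "jh"}
FRICATIVES = {"f", "v", "th", "dh", "s", "z", "sh", "zh", "hh"}
NASALS_LIQUIDS = {"m", "n", "ng", "l", "r"}
GLIDES = {"y", "w"}
# Assumption: standard Arpabet vowel symbols.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}


def first_rule(p):
    """Which 'First Phoneme' rule applies (offset and overlap)."""
    if p == "-":
        return "rest"
    if p in UNVOICED_PLOSIVES:
        return "unvoiced plosive"
    if p in VOICED_PLOSIVES_AFFRICATES:
        return "voiced plosive or affricate"
    if p in FRICATIVES or p in NASALS_LIQUIDS:
        return "fricative, nasal, or liquid"
    if p in GLIDES:
        return "glide"
    if p in VOWELS:
        return "vowel"
    return "unlisted"


def second_rule(p):
    """Which 'Second Phoneme' rule applies (pink, white, and cutoff)."""
    if p == "-":
        return "rest"
    if p in UNVOICED_PLOSIVES or p in VOICED_PLOSIVES_AFFRICATES:
        return "stop"
    if p in FRICATIVES:
        return "fricative"
    if p in NASALS_LIQUIDS or p in GLIDES:
        return "nasal, liquid, or glide"
    if p in VOWELS:
        return "vowel"
    return "unlisted"


def rules_for(alias):
    """Map a diphone alias such as 'k ae' to its (first, second) rules."""
    first, second = alias.split()
    return first_rule(first), second_rule(second)


print(rules_for("k ae"))
```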

    And just like that, your voicebank is done. Have fun, and good luck! Feel free to ask any questions in the discussion thread.