Disclaimer: sorry if this comes across as scatterbrained or too long, but it's hard to really answer this question without writing an essay of sorts.
Ease of use is very much dependent on several factors:
- Version: going from hardest to easiest for Vocaloid, V1 is the hardest. Using it is probably closer to using UTAU: you have your piano roll, your voice, and your parameters, and that's it. V2, V3, and V4 have similar levels of difficulty unless you factor in Growl/Evec/XSyn, in which case V3 is easier than V4. V2 works similarly to V3 but, iirc, lacks the ability to import BGM (it does ReWire to most DAWs, however). Vocaloid is a program unto itself that doesn't need 3rd party program support outside of making original instrumentals. It supports multiple tracks unless you're talking about V3 TinyEditor (you only have 1 track and 17 bars to work with). I would say the hardest part about Vocaloid is that parameters are edited by drawing, which isn't too user friendly if you don't have a graphics tablet or optical mouse. Note input is easy since it's practically unrestricted.
UTAU works mostly the same regardless of version, although there are more bugs in earlier versions. Shareware has a few more native features than free (plugins, however, have made this a moot point if you're willing to do more clicks). UTAU relies more on 3rd party software: either make a vsqx in Vocaloid, export it as vsq via plugin, import it to UTAU and edit from there; OR use a DAW capable of creating and exporting MIDI. It's possible to make a ust without 3rd party programs, but how much harder that is depends on whether you're on PC or Mac. PC-UTAU is crippled by rest notes, but UTAU-Synth is crippled by lack of options (no resamplers outside of the default, no plugins). Both suffer from lack of BGM import and neither is capable of ReWiring to a DAW. PC-UTAU lacks multi-track (UTAU-Synth has this, but it's not perfect). UTAU is made with mouse/trackpad users in mind, so you don't have to draw parameters. It's slower, but more precise.
- Voice: this is a factor of difficulty too. Vocaloid is proprietary, goes through QA/QC (not 100% always *cough*Sonika*cough*Rin and Len ACT1*cough*), and is generally recorded in a professional environment with professional singers (again, exceptions exist, such as Sonika, Ruby, Dex, Daina). Getting the desired output without edits will, theoretically, be less tedious. UTAU users are grassroots; they don't have studio budgets, so quality is very much a crapshoot depending on an individual's budget as well as skill. Getting a HQ result will, theoretically, be more arduous. That said, while Vocaloid's raw output is higher quality, the bugs that come with a voice can only be worked around, not fixed by the end user. UTAU allows the user to go into the configuration of any given voice, pinpoint the problem, and fix it, if possible. Whether a Vocaloid voice receives an update depends on whether or not it makes its money back, and sometimes the voicer simply can't record new samples due to aging (Yuki, Oliver), unavailability for one reason or another, undesirable timbre differences, or the engine not being capable of the desired results (some voices get scrapped or delayed). UTAU voices can be updated at any time as long as the user has equivalent or better recording equipment, has the time, can reproduce the desired results, and can keep their voicebank online (we've all heard it once or twice: a voicebank isn't released because the user didn't upload it to a cloud service/save it to an external device and their machine crashed... or mediafire/4shared/axfc deleted it...).
- Language: Japanese is the easiest language for synths to reproduce (but you need to actually know the language if you want to be more than a cover artist). English is the hardest. Korean is about the same difficulty as Japanese. Chinese is a little harder. Spanish is in the middle.
UTAU English is more convoluted than Vocaloid English: you're dependent on how the oto is set up (CVVC or its more complex cousin VCCV) and on its aliasing scheme (CZ, Arpabet, or X-SAMPA). Ust creation is more complex because it's based on phonemes rather than actual words. With Vocaloid, you just type in a word and, if it's not working, either break it into 2 syllables or edit the dictionary/phonemes.
e.g. the word 'winters': [- wi] [i n] [t3] [3z] in UTAU vs. [win] [terse] in Vocaloid
The same can be said for Romance languages: UTAU is more difficult. On the other hand, because UTAU breaks lyrics up more, it does allow for more control over results.
Japanese is the same regardless of software: use either romaji or hiragana. UTAU, however, has 3 ways of entering lyrics depending on the bank: CV, VCV, or CVVC. Use the wrong one and the voice doesn't play. A VCV bank configured for all 3 styles with both romaji and hiragana eliminates this, but you can't expect every voicer to create a VCV bank and configure it for every possible situation. Vocaloid lyrics can be entered the same way as CV and you'll get a result, regardless of the voice.
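To make the CV vs. VCV difference concrete, here's a minimal sketch of how a plain CV lyric line turns into VCV-style aliases. The function name `cv_to_vcv` is made up for illustration, and it assumes the common "previous vowel + space + syllable" alias pattern with "-" for a phrase start; individual banks may alias things differently, so check the bank's oto before relying on it.

```python
def cv_to_vcv(syllables):
    """Turn romaji CV syllables into VCV-style aliases (hypothetical helper).

    Assumes the alias pattern "<previous vowel> <syllable>", with "-" marking
    the start of a phrase. Real banks may use other conventions.
    """
    vowels = "aiueo"
    aliases = []
    prev = "-"  # nothing precedes the first note
    for syl in syllables:
        aliases.append(f"{prev} {syl}")
        # The last letter of a romaji syllable is its vowel,
        # except for the syllabic "n", which is carried over as "n".
        prev = syl[-1] if syl[-1] in vowels else "n"
    return aliases


print(cv_to_vcv(["ka", "sa", "ne"]))  # ['- ka', 'a sa', 'a ne']
print(cv_to_vcv(["ko", "n", "ni"]))   # ['- ko', 'o n', 'n ni']
```

A CV bank would take "ka", "sa", "ne" as they are; a VCV-only bank needs the aliased forms, which is why entering the wrong style gets you silence.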
Chinese: pinyin. Idk if Vocaloid can read pinyin tone marks, but UTAU doesn't, so alias workarounds are required for certain sounds such as the vowels a, i, e, or u. I'm also not sure if Vocaloid reads Hanzi, but I know UTAU can utilize a few Kanji (idk enough Japanese to be certain if all Kanji can be used).
Korean: SeeU (and presumably UNI) only read Hangeul for their Korean libraries. It's not like Japanese, where you type using an English keyboard. Typing "ka" gives you か in Japanese, but with Hangeul input those same keys produce ㅏㅁ (the vowel a + the consonant m, which is unusable as is without the placeholder consonant ㅇ). Ka in Korean requires typing rk for plain, Rk for tense, or zk for aspirated; this is with an American keyboard with Hangeul input, mind. UTAU requires romaja for this language, and there are NK and SK standards, which can be confusing or need aliasing workarounds. e.g. plain ㄱ can be romanized 2 different ways, k or g, but k is also used for aspirated k/ㅋ. Tense k/ㄲ is written as kk. The reason for the 2 romanizations of plain ㄱ is that it (along with other consonants that aren't N/M/NG) sounds different depending on its placement within a word/phrase.
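The rk/Rk/zk thing is easier to see in code. This is only a sketch: the key map is a small slice of the standard 2-set (Dubeolsik) layout, the function name `type_keys` is made up, and it only handles a single consonant + vowel pair (no final consonants). The syllable arithmetic is the standard Unicode Hangul composition formula.

```python
# Partial 2-set (Dubeolsik) key map; only a handful of keys for illustration.
KEY_TO_JAMO = {"r": "ㄱ", "R": "ㄲ", "z": "ㅋ", "d": "ㅇ", "a": "ㅁ", "k": "ㅏ"}

# Initial consonants and vowels in Unicode composition order.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"


def type_keys(keys):
    """Simulate typing 'keys' with Hangeul input (hypothetical helper).

    A consonant followed by a vowel is composed into one syllable block;
    anything else is returned as loose jamo.
    """
    jamo = [KEY_TO_JAMO[k] for k in keys]
    if len(jamo) == 2 and jamo[0] in INITIALS and jamo[1] in VOWELS:
        # Unicode formula: 0xAC00 + (initial * 21 + vowel) * 28 + final (0 here)
        code = 0xAC00 + (INITIALS.index(jamo[0]) * 21 + VOWELS.index(jamo[1])) * 28
        return chr(code)
    return "".join(jamo)


print(type_keys("rk"))  # 가 (plain)
print(type_keys("Rk"))  # 까 (tense)
print(type_keys("zk"))  # 카 (aspirated)
print(type_keys("ka"))  # ㅏㅁ (loose jamo, not a usable syllable)
print(type_keys("dk"))  # 아 (placeholder ㅇ + a)
```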
To answer the question of which voicebanks are "easy"...
- English:
V4 voices are generally clear and understandable without subtitles (though YMMV), but they have less expression as a result. They'll require more parameter edits to not sound monotone.
V3 voices have more expression, but can lack clarity either due to accent or softness - Macne Nana ENG, Miku V3 ENG, Oliver, Avanna... they're not very clear to me - or at least require a second listen, but I can usually understand Yohioloid and Gumi ENG just fine.
V2 voices can be easy to use... if you stick to their intended genres. Almost all of them are specialized for a particular genre, so use outside of that can be awkward.
V1 voices aren't very clear due to the engine, but they're still considered among the best voices in Vocaloid due to the way they were programmed and the fact that their voices aren't based purely on sampling. They'll be harder to use due to the engine UI and aren't officially compatible with systems newer than Windows XP (so crashing is more likely).
Voices that can be easy for beginners (for voices that have trials, use them first before making a decision - note, this is not based on personal usage):
English: Dex, Daina, Avanna (more specialized for folk and Celtic), Yohioloid, Sweet Ann, Miriam.
English voices to avoid: Sonika (LQ and glitchy), Tonio and Prima (highly specialized for opera), any Japanese to English voice sans GUMI ENG - unless you actually want a heavier Japanese accent that's present in most of the current lineup.
Tossup: Cyber Diva and Ruby - if you want nasal, you'll get it with these 2. Depending on who you ask, Ruby is harder to control and gets pitchy. Oliver is specialized for choir and is very British... hard to say whether he's a suitable beginner's voice.
Japanese: Generally any Cryptonloid that isn't Rin/Len ACT1 or ACT2. They're voice acted, but that makes them more malleable to different genres and they hold pitch well. Internet Co, 1st Place, and AHS have more realistic voices if you don't want voice acted vocals and are generally well done without many complaints concerning usage (the exception to this would be Lily and Miki).
Japanese voices to avoid: Rin and Len V2 era, Luka (V2 English had a lot missing and V4x... the less said about that, the better), Gacha... unless you like this kind of voice? It's hard to pinpoint V3 voices because a lot of them became quite generic and same-y.
Because there are so many to choose from, I'd say perhaps go against the grain: instead of getting the most popular voice, get something that could use a little more appreciation, such as Mew, Mayu, Sachiko, Chika, Tone Rion, Rana, or any male voice regardless of language. This is of course down to your preferences and your plans for music creation.