Writing a program to convert "mojibaked" Shift-JIS filenames to UTF-8 for US machines. Need insight.

KNΞMΛTCS

Just an UtaForum user
Defender of Defoko
Okay, I'm writing a little program that will convert the random gobbledygook of misinterpreted Shift-JIS encoding (seen when a computer with a UTF-8 or similar default locale reads Japanese-encoded text) to proper UTF-8 encoding (that would be compatible with US locale computers). There are a few things that I know about the issue:
  • The file names misread by Windows's file browser are *NOT* the same as mojibaked Shift-JIS. The only workaround I've seen for converting file names manually is to put the file inside a .ZIP archive and edit it with a hex editor.
  • Mojibaked Shift-JIS can indeed be converted to UTF-8 (readable on US locale computers). Notepad++ has a text converter that does this built in, and there is a website that does it too, I think. However, as per point 1, Windows filenames undergo another layer of misreading, and afaik can't be converted.
  • Using the .NET Framework's libraries to read file contents yields the same Windows file names as Explorer does, and those can't be converted either.
  • Upon installation on a US computer, Defoko's .FRQ files have no mojibake issue. Their file names are definitely in proper Japanese.
I know that this is a pretty technical topic, but if you have any ideas on how to get through it (or some insight on the Explorer problem) please reply.
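For reference, the text-level conversion from point 2 can be sketched in a few lines of Python. This is only a sketch under assumptions: the function name `unmojibake` and the list of candidate "wrong" code pages are made up for illustration and are not taken from Notepad++ or .NET, and `cp932` is Python's name for Microsoft's variant of Shift-JIS. The idea is to re-encode the garbled string with the code page that misread it, which recovers the original bytes, and then decode those bytes as Shift-JIS. Python strings are Unicode, so saving the result as UTF-8 afterwards is the easy part.

```python
def unmojibake(text, wrong_codecs=("cp437", "cp1252", "latin-1"), right_codec="cp932"):
    """Return {wrong_codec: repaired_text} for every codec whose round trip succeeds.

    wrong_codecs is a guess at which Western code page misread the bytes
    (an assumption; adjust it for the machine that produced the mojibake).
    """
    candidates = {}
    for codec in wrong_codecs:
        try:
            raw = text.encode(codec)                      # undo the wrong decoding
            candidates[codec] = raw.decode(right_codec)   # decode as intended
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this code page cannot have produced the garbled text
    return candidates


# Simulate the damage: Shift-JIS bytes misread as code page 437.
garbled = "あいう".encode("cp932").decode("cp437")
for codec, fixed in unmojibake(garbled).items():
    print(codec, "->", fixed)
```

More than one code page can round-trip without raising an error, so the function returns every candidate instead of guessing. Whichever entry reads as sensible Japanese tells you which code page did the misreading.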
 

Agatechlo

Specified.
Supporter
Defender of Defoko
I actually looked at doing this a while back. I found the problem is really just that many zipped hiragana USTs & voice banks have the filenames Shift-JIS-encoded within the ZIP file (as opposed to Unicode-encoded), presumably to save a few bytes of space. So if you unzip them on a U.S. locale system, the unzip tool assumes a Western character set (ASCII plus an extended code page), hence mojibake. What I've been doing is unzipping the files on another computer that's set to Japanese locale, then copying them to my main system, which is U.S. locale & can't be changed. Once the files are unzipped, the filenames are stored as Unicode within Windows, so they can be copied to any system & the filenames appear correctly. I can even re-zip them if I want, since my zip program stores the filenames as Unicode as well.
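As a rough illustration of the ZIP situation described above, here is a Python sketch that recovers the intended entry names without switching locale. It relies on two things about Python's standard `zipfile` module: entries whose UTF-8 flag (general-purpose bit 11, value 0x800) is not set have their names decoded as code page 437, and each entry exposes that flag as `ZipInfo.flag_bits`. The helper names `recover_name` and `list_recovered_names` are hypothetical, and treating every unflagged name as Shift-JIS (`cp932`) is an assumption that only holds for archives made on Japanese systems.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the entry name is stored as UTF-8


def recover_name(name, flag_bits, legacy_codec="cp932"):
    """Return the intended entry name for one ZIP member."""
    if flag_bits & UTF8_FLAG:
        return name  # already Unicode-encoded inside the archive
    try:
        # zipfile decoded the raw bytes as cp437; reverse that, then
        # decode the original bytes as Shift-JIS (assumed encoding).
        return name.encode("cp437").decode(legacy_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # not valid Shift-JIS after all; leave it untouched


def list_recovered_names(zip_path):
    """List the member names of an archive with legacy names repaired."""
    with zipfile.ZipFile(zip_path) as zf:
        return [recover_name(info.filename, info.flag_bits) for info in zf.infolist()]
```

To extract under the repaired names, one option is to read each member with `ZipFile.read()` and write the bytes to a path built from `recover_name(...)`, instead of calling `extractall()`.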

Now if you want to take the already-mojibake'd ASCII (now actually mojibake'd Unicode) & convert each character to the corresponding hiragana Unicode character, it should be possible. I started to put together a translation table to do this, but decided it would be rather difficult to implement in Visual Basic, so I decided to set up one PC around here that can switch to Japanese locale & do as I describe above.
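The translation table mentioned above doesn't have to be typed in by hand. A minimal Python sketch, assuming the filenames were misread as code page 437 (an assumption; pass a different `wrong_codec` if the garbling came from another code page) and using the hypothetical helper names `build_kana_table` and `translate`: each hiragana character is two bytes in Shift-JIS, so it shows up as a pair of garbled characters, and the table maps each pair back to one kana.

```python
def build_kana_table(wrong_codec="cp437", right_codec="cp932"):
    """Map each two-character mojibake pair to the hiragana it stands for."""
    table = {}
    for code_point in range(0x3041, 0x3094):  # U+3041 (small a) through U+3093 (n)
        kana = chr(code_point)
        try:
            garbled = kana.encode(right_codec).decode(wrong_codec)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # skip anything this pair of codecs cannot represent
        table[garbled] = kana
    return table


def translate(text, table):
    """Replace every known mojibake pair in text; copy other characters as-is."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


table = build_kana_table()
garbled = "_あいう.wav".encode("cp932").decode("cp437")
print(translate(garbled, table))
```

This table covers hiragana only. Names that also contain katakana or kanji would need a bigger table, at which point re-encoding the whole string and decoding it as Shift-JIS is simpler than a lookup.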

BTW if you find that some hiragana voice banks unzip & work fine on U.S. locale systems, it's probably because the filenames were Unicode-encoded within the zip file.
 

KNΞMΛTCS

Just an UtaForum user
Defender of Defoko
Thread starter
This is actually a big issue in UTAU. You see, when UTAU was first created, most/all of its users were from Japan, so their computers would obviously have Japanese locale. Then it started to pick up traction in the West, and since everything had been created on computers with Japanese locale, Western users were forced to switch to Japanese locale too. Those Western users then create voicebanks on computers that are also set to Japanese locale, so their voicebanks are likewise only compatible with Japanese-locale computers. It's a Catch-22: nobody can use English locale because most voicebanks won't work on it without a lot of modification, and nobody creates voicebanks free of that restriction because everyone is already on Japanese locale.