My Korean XML cannot be parsed.

Ravl(Posted 2014) [#1]
I have this structure of my level files:


  <Hotspot>
	<PosX>358</PosX>
	<PosY>306</PosY>
	<Width>121</Width>
	<Height>242</Height>
	<Text>L &#1710;&#1401;&#1449;; &#1192;&#1278; &#1397;&#1144;&#1057; &#522;&#2084;&#536;&#2084;.</Text>
	<Sound>null</Sound>
	<Type>1</Type>
	<LogicID>0</LogicID>
	<MoveTo>null</MoveTo>
	<MoveToMinigame>null</MoveToMinigame>
	<NavIcon>null</NavIcon>	
  </Hotspot>



The only special characters are inside the <Text> tags (even though the forum shows them here as &# codes), but when I load the file with the bah.xml module I get this error:

"Document not parsed successfully." error message.

I assume it's about encoding or something like that, but I don't have any experience with such things.


GfK(Posted 2014) [#2]
Try putting this at the top:
<?xml version="1.0" encoding="utf-8"?>


[edit] By "at the top", I mean right at the very top - the first line - of your XML file, before anything else.


GfK(Posted 2014) [#3]
double post.


Ravl(Posted 2014) [#4]
used the second one with the ? at the end..

still the same error.


GfK(Posted 2014) [#5]
Hmm... you might try utf-16. I don't think I've ever done Korean, but I've done Japanese and I'm sure I used utf-8. :/


Ravl(Posted 2014) [#6]
not a chance:

levels/level0.xml:5: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0xD4 0xC0 0xCC
<Name>°Ô&#340;&#282;&#262;®</Name>
^


Name is the first tag with special characters in my file.

I also tried UTF-16, but the issue persists, and the debug console still reports the file as UTF-8.


GfK(Posted 2014) [#7]
Bit stumped then! If nobody's solved it by tomorrow I'll have a look what I did for my Japanese stuff.


Ravl(Posted 2014) [#8]
Appreciate this. I really need the help to solve this one.

Thanks


Derron(Posted 2014) [#9]
What does not work exactly?

I just took the content of <text>...</text> and placed it in my database.xml (replacing some other titles). Even the console output was nearly correct (only the last characters stayed unidentified).


And this is what you are doing wrong: libxml says that you did not encode properly... and hence you did not encode properly:
all non-ascii-characters must be encoded (this is done with &#CODENUMBER;)

Do not forget to encode the "&" if used as normal character ("derron is dumb & lazy" must be converted to "derron is dumb &amp; lazy").
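
As a rough illustration of the &#CODENUMBER; encoding described above, here is a minimal Python sketch. Python is only used for illustration (the thread itself is about BlitzMax), and the helper name to_char_refs is hypothetical. It escapes the reserved "&" first and then turns every non-ASCII character into a numeric character reference; "<" and ">" are not handled here.

```python
def to_char_refs(text: str) -> str:
    """Return text as pure ASCII using XML numeric character references."""
    # Escape the reserved ampersand first, so the "&" of the generated
    # "&#...;" references is not escaped a second time.
    escaped = text.replace("&", "&amp;")
    # The "xmlcharrefreplace" error handler swaps each non-ASCII
    # character for its decimal reference "&#N;".
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("환영 & more"))  # &#54872;&#50689; &amp; more
```

A file produced this way contains only ASCII bytes, so the parser no longer depends on the declared encoding being right.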




bye
Ron


Ravl(Posted 2014) [#10]
Hi Derron,

the actual xml looks like this:



not like the text you see in the forums.

when I am trying to load the xml using the default "parseDoc" method I receive:

levels/level0.xml:5: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0xD4 0xC0 0xCC
<Name>°Ô&#340;&#282;&#262;®</Name>


Ravl(Posted 2014) [#11]
wait, you say I must encode every special character? how should I do that in Korean?


ziggy(Posted 2014) [#12]
Are you sure the XML document is saved as UTF-8 and ALSO that the BOM is stored in the XML text file? http://en.wikipedia.org/wiki/Byte_order_mark


Ravl(Posted 2014) [#13]
I see in Notepad++ the document is "Encode in UTF-8".

About the BOM.. I can't understand exactly what it is all about. Where should those characters be present? :|
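
For reference, a UTF-8 BOM is the three bytes EF BB BF at the very start of the file, before the <?xml ...?> line. A minimal Python sketch (the function names are hypothetical, and Python is only used for illustration) that checks for the BOM and whether the bytes are valid UTF-8 at all, using the four bytes from the parser error as sample data:

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return raw.startswith(codecs.BOM_UTF8)


def is_valid_utf8(raw: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The bytes reported by the parser error in this thread.
sample = bytes([0xB0, 0xD4, 0xC0, 0xCC])
print(has_utf8_bom(sample), is_valid_utf8(sample))  # False False
```

To test a real file, read it in binary mode (for example the bytes of levels/level0.xml) and pass them to both functions. If is_valid_utf8 returns False, the editor's "UTF-8" label is only its guess and the file is really in some other encoding.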


Ravl(Posted 2014) [#14]
guys,

I just made a quick test. I pasted some text from Google Translate: 환영


In my XML this text is shown as actual Korean characters. The translation I received still looks like in my images. Anyway, using the XML with the characters from Google Translate works fine.


Ravl(Posted 2014) [#15]
And it's me again. I solved the issue.

Steps:
1. Open the XML from my partners
2. In Notepad++, change the character set to Korean (at this point all those black characters disappear and proper Korean characters show up)
3. From Notepad++ choose: Encode to UTF-8
4. Save

and it's working...
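
The same steps can be sketched as a script for converting many files at once. This is a minimal Python sketch under assumptions: the function names are hypothetical, and "cp949" (the Windows Korean codepage) is a guess at what the partners' editor used; the thread does not confirm the exact codepage.

```python
from pathlib import Path


def convert_bytes_to_utf8(raw: bytes, source_encoding: str = "cp949") -> bytes:
    """Decode bytes from the assumed source codepage and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file_to_utf8(src: str, dst: str, source_encoding: str = "cp949") -> None:
    """Read src as raw bytes, convert, and write the UTF-8 result to dst."""
    converted = convert_bytes_to_utf8(Path(src).read_bytes(), source_encoding)
    Path(dst).write_bytes(converted)
```

If the file has an <?xml ...?> line naming another encoding, that declaration would also need to be changed to utf-8 after the conversion.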


Derron(Posted 2014) [#16]
That is one of the things concerning UTF8 and BlitzMax.

By default the MaxIDE saves non-ISO-8859-1 pages as UTF16-LE.
For "normal characters" (West European, most non-Asian) UTF8 is enough, but BlitzMax needs the BOM indicator to load it correctly. If you use characters like "äöü" (the German umlauts), you may run into problems because they can also be found in the ASCII tables, so they get converted incorrectly and you end up with garbled text.

To get rid of that trouble (at least in libxml), the "HTML entities" encoding (&#code;) should work.


Keep in mind that special chars still must be escaped properly:
& &amp;
" &quot;
' &apos;
< &lt;
> &gt;

That is why escaping the data on output can be used to create valid XML files.
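
A minimal sketch of escaping on output, using Python's standard xml.sax.saxutils.escape for illustration (the wrapper name escape_xml is hypothetical). escape() handles &, < and > by itself; the two quote characters are passed in as extra entities.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; quotes are added explicitly.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_xml(text: str) -> str:
    """Escape the five reserved XML characters in text."""
    return escape(text, QUOTES)


print(escape_xml("derron is dumb & lazy"))  # derron is dumb &amp; lazy
```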


bye
Ron


Brucey(Posted 2014) [#17]
It's a problem when getting third parties to do translations: if they are using Windows in their native language, they are more likely to be typing in their local codepage than in Unicode (UTF8/16/etc).

So as you've found, you need to assume the file you receive is in the local codepage and convert it as appropriate (into say, UTF-8).

On other platforms - which properly support native unicode (that would be Linux and OS X), you shouldn't have this problem.


Ravl(Posted 2014) [#18]
@Derron: well, I cannot write an &# code for every character I need in my game... it would take years :))

@Brucey: I will convert the files this way, because it is the only solution for now.

Thanks all for supporting me!


Derron(Posted 2014) [#19]
You still have to convert all & ' < > ... else you run into trouble as they are reserved XML characters.


bye
Ron


Brucey(Posted 2014) [#20]
The other alternative is to use a "standard" localisation format (like .po), of which there are editors available.


Derron(Posted 2014) [#21]
That "standard localisation format" is useful for GUIs, but imagine a database containing several thousand localized entries... I doubt .po files and their editors are helpful then.

It depends on the source of the textual data.


bye
Ron