Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 18:32 #21

Home away from home

@broadblues

The problem mostly comes from comparing raw bytes against a char * string instead of decoding the UTF-8 and comparing the string glyph by glyph; badly formed UTF-8 is used this way to get around the URL verification process of a web server.

I think in this case the web server has the security problem, not the UTF-8 format. However, it is not complicated to check whether the decoded value of a multi-byte sequence is in the range 0 to 127 and, if so, return a space char instead.

Well, returning the wrong number of chars can result in skipping the end-of-string value; that might be a problem one should think about when decoding a full string.

In my UTF-8 decoding routine, if the byte count is wrongly set to 6 to try to skip the string termination, the decoder will detect the 0 value and return a length of 0 and a value of 0.

While it's not a major problem, I should check that the value is not 0 to 127 when it's a UTF-8 glyph using 2 bytes or more.
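The checks described above could look roughly like the following C sketch. It is not the decoder from the post; the name `utf8_decode_strict` and its return convention (0 length and 0 value on error) are assumptions made for illustration. It follows RFC 3629, which limits UTF-8 to 4 bytes, so the older 5- and 6-byte lead bytes are simply rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (a 0-terminated string).
 * Stores the code point in *out and returns the number of bytes consumed.
 * Returns 0 (and sets *out to 0) at the end of the string or on malformed
 * input. Hypothetical helper, not the routine discussed in the thread. */
size_t utf8_decode_strict(const unsigned char *s, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    *out = 0;
    if (s[0] == 0) return 0;                      /* end of string */
    if (s[0] < 0x80) { *out = s[0]; return 1; }   /* plain 7-bit ASCII */

    if      ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;     /* stray continuation byte or 5/6-byte lead byte */

    for (i = 1; i < len; i++) {
        /* A 0 terminator fails this test, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_value[len]) return 0;            /* overlong, e.g. 0..127 in 2 bytes */
    if (cp > 0x10FFFF) return 0;                  /* outside Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogates */

    *out = cp;
    return len;
}
```

A caller that wants the "return a space" behaviour can treat a 0 return on a non-zero byte as an error, emit 0x20, and advance by one byte.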

Buffer overflow is not an issue with my encoder; I estimate the size of each glyph in bytes before allocating the buffer.
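As an illustration of the measure-first approach, here is a minimal two-pass sketch in C. The names `utf8_encoded_len` and `utf8_encode_alloc` are hypothetical, and replacing values that cannot be encoded with a space is an assumption borrowed from the decoder behaviour discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed to encode one code point as UTF-8 (0 if not encodable). */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Encode n code points into a newly allocated, 0-terminated buffer.
 * The caller frees the result. Returns NULL if allocation fails.
 * Values that cannot be encoded are written as a single space. */
char *utf8_encode_alloc(const uint32_t *cps, size_t n)
{
    size_t total = 0, i;
    char *buf, *p;

    /* Pass 1: measure, so the buffer is exactly large enough. */
    for (i = 0; i < n; i++) {
        size_t len = utf8_encoded_len(cps[i]);
        total += len ? len : 1;
    }

    buf = (char *)malloc(total + 1);
    if (buf == NULL) return NULL;

    /* Pass 2: write the bytes. */
    p = buf;
    for (i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        switch (utf8_encoded_len(cp)) {
        case 1:  *p++ = (char)cp;
                 break;
        case 2:  *p++ = (char)(0xC0 | (cp >> 6));
                 *p++ = (char)(0x80 | (cp & 0x3F));
                 break;
        case 3:  *p++ = (char)(0xE0 | (cp >> 12));
                 *p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
                 *p++ = (char)(0x80 | (cp & 0x3F));
                 break;
        case 4:  *p++ = (char)(0xF0 | (cp >> 18));
                 *p++ = (char)(0x80 | ((cp >> 12) & 0x3F));
                 *p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
                 *p++ = (char)(0x80 | (cp & 0x3F));
                 break;
        default: *p++ = ' ';
                 break;
        }
    }
    *p = '\0';
    return buf;
}
```

Because both passes use the same length function, the write pass cannot produce more bytes than the measuring pass reserved.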

Edited by LiveForIt on 2014/3/5 19:02:53
Edited by LiveForIt on 2014/3/5 19:12:28
Edited by LiveForIt on 2014/3/5 19:13:50
Edited by LiveForIt on 2014/3/5 22:38:09
Edited by LiveForIt on 2014/3/5 22:39:06

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

Belxjander

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/6 8:05 #22

Just popping in

@LiveForIT

I only tried for a rough explanation, thank you for filling in the extra details.

@all: Does anyone else recognise how some parts of UTF8 in AmigaOS are practical while others are impractical due to existing design constraints?

Personally I see display of each glyph by codepoint as practical...

The limitation I am currently trying to work out is how to "enter" a UTF8 valid glyph string without requiring any large number of changes.

"Extended Dead Key" entry seems to be the most viable option, allowing entry of between 1 and 6 bytes following the encoded form of a UTF8 character.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/6 9:14 #23

Home away from home

@Belxjander

Quote:

The limitation I am currently trying to work out is how to "enter" a UTF8 valid glyph string without requiring any large number of changes.

Can you show me what you're trying to change?


Belxjander

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/6 9:26 #24

Just popping in

@LiveForIt: replied by PM, as it is outside this thread's scope afaik.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/6 18:19 #25

Home away from home

@Amigo1

To answer the other question about UTF-16.

Well, basically it does the same as UTF-8: it preserves international symbols in their original form.
The biggest difference is that UTF-16 is a 16-bit format, not an 8-bit one.

UTF-8 and UTF-16 are popular encoding formats used for text files, data transferred over TCP/IP, web pages, and many other things.

Unlike a UTF-8 file, a UTF-16 file opened in a text editor that does not support it will be hard to read. With UTF-8 you can still read the file, but some symbols will be unreadable.

For English and other European languages, UTF-16 is overkill:

most of the glyphs in European languages have a value below 128 (hex 0x80), the standard Latin range; in addition, most languages have maybe 4 special symbols.

But go to the Middle East or Asia and their glyphs are different.

A full list of the glyphs and the numbers they use can be found here:

http://unicode-table.com/en/#control-character

Generally speaking, UTF-8 is faster to decode (uses less CPU power) if one glyph uses one byte; if a glyph needs two bytes, UTF-16 is faster (uses less CPU power).

So in other words, if most glyphs need 1 byte, UTF-8 is the best option; if most glyphs need 2 bytes, then UTF-16 is the best format.
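To make the trade-off concrete, the sketch below counts how many bytes a run of code points takes in each encoding. It only measures size, not decoding speed, and the function names `utf8_size`, `utf16_size` and `text_sizes` are made up for this example. It assumes the input contains only valid code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes used by one valid code point in UTF-8. */
size_t utf8_size(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Bytes used by one valid code point in UTF-16
 * (one 16-bit unit, or a surrogate pair above U+FFFF). */
size_t utf16_size(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Total size of n code points in both encodings. */
void text_sizes(const uint32_t *cps, size_t n,
                size_t *utf8_bytes, size_t *utf16_bytes)
{
    size_t i;
    *utf8_bytes = 0;
    *utf16_bytes = 0;
    for (i = 0; i < n; i++) {
        *utf8_bytes  += utf8_size(cps[i]);
        *utf16_bytes += utf16_size(cps[i]);
    }
}
```

For the five letters of "Hello" this gives 5 bytes in UTF-8 and 10 in UTF-16. For the three code points U+65E5 U+672C U+8A9E it gives 9 bytes in UTF-8 and 6 in UTF-16, because most Asian glyphs need 3 bytes in UTF-8 rather than 2.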

While UTF-8 and UTF-16 have many advantages, 7-bit ASCII or 8-bit ASCII with a translation codepage remains popular when coding in C/ASM because it is easier to work with: you do not need to decode anything, a byte is a symbol, that's it.

Therefore, in my opinion UTF-8 is the best option because of its legacy compatibility with 7-bit ASCII. Older programs might in fact work even if the raw data is UTF-8; just as we might have a problem understanding one or two symbols, a program might not understand every symbol in the string.

This goes back to the web server that someone decided to feed badly formatted UTF-8 strings to. The web server did not detect the parent directory ".." and other things because the symbols were hidden in the UTF-8 encoding. The real problem was not that UTF-8 was broken; the problem, I expect, was that the web server did not even try to decode the string, because some of the program code was old and not updated. Other parts were somehow able to decode the string when it accessed the filesystem; I assume the string format was auto-detected by the filesystem/IO layer in the OS.
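The mismatch described above can be reproduced in a few lines of C. This is a toy model, not the code of any real web server: `naive_is_safe` stands in for a check on raw bytes, `lenient_decode` stands in for a later layer that decodes 2-byte sequences without validating them, and `has_overlong_ascii` is one possible guard. All three names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Check on the raw bytes only: looks for a literal "..". */
int naive_is_safe(const char *path)
{
    return strstr(path, "..") == NULL;
}

/* Deliberately sloppy decoder: folds any 2-byte sequence into one byte
 * without checking for overlong forms or values above 255.
 * out must have room for strlen(in) + 1 bytes. */
void lenient_decode(const char *in, char *out)
{
    const unsigned char *p = (const unsigned char *)in;
    while (*p) {
        if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            *out++ = (char)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            *out++ = (char)*p++;
        }
    }
    *out = '\0';
}

/* Lead bytes 0xC0 and 0xC1 can only encode values 0..127 in 2 bytes,
 * so they never appear in well-formed UTF-8. */
int has_overlong_ascii(const char *path)
{
    const unsigned char *p = (const unsigned char *)path;
    for (; *p; p++)
        if (*p == 0xC0 || *p == 0xC1) return 1;
    return 0;
}
```

With the input bytes `/docs/` `C0 AE C0 AE` `/secret`, the raw check passes because there is no literal `..`, yet the sloppy decoder turns the same bytes into `/docs/../secret`. Rejecting the request as soon as `has_overlong_ascii` reports a hit, or validating on the decoded string, closes that gap.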

Edited by LiveForIt on 2014/3/7 0:11:25
Edited by LiveForIt on 2014/3/7 0:16:36
Edited by LiveForIt on 2014/3/7 0:19:38


angelheart

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/8 4:23 #26

Just popping in

@LiveForIt

According to the wiki, the possible error-handling actions are:

'The replacement character "�" (U+FFFD)
The invalid Unicode code points U+DC80..U+DCFF where the low 8 bits are the byte's value.
The Unicode code points U+0080..U+00FF with the same value as the byte, thus interpreting the bytes according to ISO-8859-1.
The Unicode code point for the character represented by the byte in CP1252. This is similar to using ISO-8859-1, except that some characters in the range 0x80..0x9F are mapped into different Unicode code points. For example, 0x80 becomes the Euro sign, U+20AC.
'

'Many programs are specified to allow input in one of several encodings, for example UTF-8, UTF-16 or ISO-8859-1. In that case, the software would first check for UTF-8 correctness. If incorrect then it would check if it is UTF-16, and if not interpret it as entirely ISO-8859-1.'

Any unrecognised code just skip it.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/8 11:07 #27

Home away from home

@angelheart

Quote:

The invalid Unicode code points U+DC80..U+DCFF where the low 8 bits are the byte's value. The Unicode code points U+0080..U+00FF with the same value as the byte, thus interpreting the bytes according to ISO-8859-1.

You should not assume that you can cast 32-bit Unicode to 8 bits (ISO-8859-1): people might be using a different codeset, and if the 32-bit number gets truncated to 8 bits, it enables people to send UTF-8 values that will be interpreted as other glyphs in ASCII.

Or you should at least limit the range to below 256; the code is dirty, but it should work safely.

*c = (Glyph < 256) ? (char) Glyph : 0x20;

My UTF-8 decoder replaces glyphs that are not found in the codeset table with 0x20, the SPACE char, so this is not a problem.

Something like:

*c = 0x20; // default: SPACE if the glyph is not in the codeset
for (n = 0; n < 256; n++)
{
    if (codeset_page[n] == Glyph) { *c = n; break; }
}
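Wrapped up as a self-contained function, the lookup might look like this. The name `unicode_to_codeset` and the table type are assumptions; the table is expected to hold, for each of the 256 byte values, the Unicode code point that byte stands for in the user's codeset.

```c
#include <stdint.h>

/* Map a Unicode code point to a byte in an 8-bit codeset.
 * codeset_page[n] is the code point that byte n represents.
 * Returns 0x20 (SPACE) when the glyph is not in the codeset. */
unsigned char unicode_to_codeset(uint32_t glyph,
                                 const uint32_t codeset_page[256])
{
    int n;
    for (n = 0; n < 256; n++) {
        if (codeset_page[n] == glyph)
            return (unsigned char)n;
    }
    return 0x20;
}
```

Unlike a plain cast, this respects codesets that differ from ISO-8859-1. In ISO-8859-15, for example, byte 0xA4 is the Euro sign U+20AC, so U+20AC maps to 0xA4 and U+00A4 falls back to a space.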

Quote:

The Unicode code point for the character represented by the byte in CP1252. This is similar to using ISO-8859-1, except that some characters in the range 0x80..0x9F are mapped into different Unicode code points

On AmigaOS4, ISO-8859-1 maps 1 to 1, 2 to 2, and so on up to 255. If this is correct, then ISO-8859-1 is more like Unicode than CP1252 is, but it should not matter whether it's one or the other, as it will depend on which codeset the user is using.

http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

But if it's a text file you want to convert into UTF-8, you will need to know which codeset/codepage was used when the file was written.

Quote:

'Many programs are specified to allow input in one of several encodings, for example UTF-8, UTF-16 or ISO-8859-1. In that case, the software would first check for UTF-8 correctness. If incorrect then it would check if it is UTF-16, and if not interpret it as entirely ISO-8859-1.'

Any unrecognised code just skip it.

Well, if you allocated one byte in memory, you should not poke at it with 16 bits: the first byte is the null terminator, and the second byte might not be set to any value and may hold some random value. If you're really unlucky you might get a DSI error.

Quote:

and if not interpret it as entirely ISO-8859-1.

If you don't know the ISO codeset, you should default to what the user has selected, or let the user select a codeset.

Edited by LiveForIt on 2014/3/8 11:22:31
Edited by LiveForIt on 2014/3/8 11:24:09
Edited by LiveForIt on 2014/3/8 15:45:07
Edited by LiveForIt on 2014/3/8 16:27:32


Amigo1

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/13 12:37 #28

Quite a regular

@LiveForIt

Thanks a lot for starting this thread, I have some spare time to read through it today.

@all

Thank you all for all the information you added.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/13 22:04 #29

Home away from home

@Chris

Do you mind if I take some of the code from NetSurf/amiga/font.c and stick it into the UTF8.library?

Edited by LiveForIt on 2014/3/13 22:21:18


LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/13 22:19 #30

Home away from home

@Amigo1

No problem, just did not want to get too many different topics going in the same thread, so I started a new one.


Chris

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/13 23:42 #31

Amigans Defender

@LiveForIt

Quote:

Do you mind if I take some of the code from NetSurf/amiga/font.c and stick it into the UTF8.library?

It's GPL so as long as you stick to that licence help yourself.
