Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w


LiveForIt

Posted on: 2014/3/4 12:41 #1

Home away from home

@Amigo1

It's a bit off topic, so I'm starting a new thread.

Quote:

Would Unicode (UTF-8 or UTF-16) improve the situation?

Yes because:

* People like me won't need to reinvent the wheel to do something that takes 2 seconds in .NET on Windows.

UTF8 vs ASCII:

* Displays text strings of any language correctly, even if you are of a different nationality and use a different language.

* Preserves the correct symbols.

* Possibly improves support for Asian languages and Klingon.

Quote:

Can someone explain in layman terms if and why it would be difficult to make the whole OS use Unicode?

I think it's going to take a lot of time and beta testing to make the whole OS UTF8 friendly. Given the lack of any improvement in UTF8 support in the OS since 2004, I do not have high hopes.

Some parts of the OS do support it, Reaction I believe, but that is a high-level GUI toolkit; it is the low-level stuff that is missing.

In UTF8 you need to decode and encode the string. Each symbol in the string has a variable length; it can take one or more bytes. This affects the way you change a string, find its length, convert between upper and lower case, and so on.
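As a rough illustration of how the variable length affects something as basic as finding the length, here is a minimal sketch. The name `utf8_strlen` is made up for this example and is not an OS function, and the sketch assumes the input is already valid UTF8.

```c
#include <stddef.h>

/* Hypothetical helper: count symbols (code points) in a NUL-terminated
   UTF-8 string by skipping continuation bytes (10xxxxxx).
   Assumes the input is already valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Every byte that is not a continuation byte starts a new symbol. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a string such as "\xC3\xA5" (the letter å), `strlen()` reports 2 bytes while this sketch reports 1 symbol, which is exactly the mismatch that trips up code written for 1 character = 1 byte.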

To display a UTF8 string you can't simply convert it to ASCII, as some data might be lost: the ASCII format cannot hold more than one language, while in UTF8 all symbols have unique values (no language barrier).

The first step is to provide a way to encode/decode/read/change symbols.
Then the OS will need a routine to display UTF8 and maybe other encodings.

There are possibly different ways UTF8 can be integrated into the OS. For example, you might add new functions to graphics.library, or you might extend struct RastPort with an encoding parameter. Extending the RastPort structure might have unknown side effects, as applications sometimes initialize the structure themselves.

UTF8 can be auto-detected by checking the total length of the symbols in bytes against the number of bytes in the string; if any symbol cannot be decoded, the string is not valid UTF8. (Adding auto-detection will slow down displaying text.)
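A minimal sketch of that kind of detection, assuming a structural check is enough: each lead byte announces how many bytes its symbol needs, and the lengths must add up to exactly the buffer length. `looks_like_utf8` is a hypothetical name, not an existing OS call, and overlong forms and surrogates are deliberately not rejected here.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is structurally
   well-formed UTF-8, else 0. Structural check only: overlong forms and
   surrogates are NOT rejected here. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t k;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;

        if (c < 0x80)                need = 1;  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) need = 2;  /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) need = 3;  /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) need = 4;  /* 11110xxx */
        else return 0;          /* stray continuation or invalid lead byte */

        if (need > len - i)
            return 0;           /* symbol runs past the end of the buffer */

        for (k = 1; k < need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;       /* continuation bytes must be 10xxxxxx */
        }
        i += need;
    }
    return 1;
}
```

Note that a pure 7-bit ASCII string passes this check too, since it is valid in both encodings, so the check only separates UTF8 from 8-bit text that actually uses bytes above 0x7F.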

Or you can maybe add some kind of ESC code to force some functions into UTF8 mode.

Then you have the filesystem, windows, screens, menus and so on, which all have to support UTF8. The filesystem is the tricky part, because so many programs interact with it. There is also the issue of needing a shell in which you can type file names when the names are not in your language.

REF:
http://www.amigans.net/modules/xforum ... t_id=87926#forumpost87926

Edited by LiveForIt on 2014/3/4 13:01:20
Edited by LiveForIt on 2014/3/4 13:07:43
Edited by LiveForIt on 2014/3/4 14:54:34

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

ssolie

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 15:50 #2

Amigans Defender

No, Unicode will not magically fix anything.

ExecSG Team Lead

Chris

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 16:44 #3

Amigans Defender

@ssolie

It would however make certain things much more convenient.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 16:59 #4

Home away from home

@ssolie

Quote:

No, Unicode will not magically fix anything.

It's a bit like saying 3D won't magically fix anything.
Nothing is going to use something you don't have.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

broadblues

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 17:02 #5

Home away from home

@Chris

At quite some cost though.

Fully working UTF-8 (or whichever encoding) support would be quite desirable, but would require rewrites of many components, and could break older programs that expect 1 character = 1 byte.

Blender For OS4.x : Blues : Walker Broad

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 17:06 #6

Home away from home

@broadblues

Or you will need to keep a UTF8 and ASCII copy of the string, but some parts will be tricky I expect.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

joerg

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 17:30 #7

Just can't stay away

@LiveForIt

Quote:

filesystem is the tricky part, because so many programs interact with it

Even if there is probably no single program yet which can display them correctly the file names are already UTF-8, simply because there is no other possibility.

Either that, or it's just some random, undefined garbage (different and changeable depending on the current 8 bit charset set in Prefs/Locale) instead of file names.

Ok, there is another possible solution, US-ASCII, and it's much easier to implement and doesn't need any changes in other software: Add some checks in dos.library and return ERROR_INVALID_COMPONENT_NAME for all file names with chars > 0x7E. But adding such a limit wouldn't make sense ...
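For illustration only, the US-ASCII check described above could be sketched as below. `name_is_us_ascii` is a hypothetical helper, not part of dos.library; a real implementation would live inside dos.library and report ERROR_INVALID_COMPONENT_NAME when the check fails.

```c
/* Hypothetical helper: returns 1 if every byte of the name is <= 0x7E,
   else 0. A real dos.library check would fail the call with
   ERROR_INVALID_COMPONENT_NAME instead of returning 0. */
int name_is_us_ascii(const char *name)
{
    for (; *name != '\0'; name++) {
        if ((unsigned char)*name > 0x7E)
            return 0;
    }
    return 1;
}
```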

Chris

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 18:36 #8

Amigans Defender

@broadblues

I'd be quite happy with UTF-8 versions of Text(), TextExtent() etc to start with. The chances are these are used internally in the OS for most text printing, so the other components can then slowly be updated for UTF-8.

It would be much simpler for application developers than having to think "hmm, which 8 bit non-Unicode character set should I use?". Especially when the system default one doesn't fit the language the user is running the application in.

And if there's a requirement to actually print characters that don't exist in the current 8 bit non Unicode character set, you're on your own.

MickJT

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 20:29 #9

Just can't stay away

@Chris

Off-topic. Is there some reason this site won't let me go to your user profile?

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 20:59 #10

Home away from home

@Chris

I agree, and here is some code I just wrote.

It should be able to get the glyph and/or the byte length of a UTF8 glyph symbol.

The only things missing are a built-in API to draw the glyph,
and a simple loop to read each glyph.

If some OS4 devs want to make me and Chris happy.


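#include <exec/types.h>    /* ULONG */

/* Decode one UTF-8 glyph starting at data.
   Returns the symbol value and stores the number of bytes used in *len.
   If the string ends inside a glyph, returns 0 and sets *len to 0.
   Note: this does not check that the byte sequence is valid UTF-8. */
ULONG utf8_get_glyph(unsigned char *data, int *len)
{
    unsigned char *c = data;
    ULONG ret = 0;
    int bytes = 1;
    int bits;
    int n;
    int mask;

    if ( ! (*c & 0x80))
    {
        ret = *c;    /* plain 7-bit ASCII */
    }
    else        /* multi-byte glyph */
    {
        /* Count the 1 bits below bit 7: each one adds a byte
           and removes a data bit from the first byte. */
        bits = 6;
        for (n = 6; (*c & (1 << n)) && (n > 0); n--)
        {
            bytes++;
            bits--;
        }

        mask = (1 << bits) - 1;
        ret = *c & mask;

        for (n = 1; n < bytes; n++)
        {
            c++;

            if (*c == 0)    /* string ended inside the glyph */
            {
                *len = 0;
                return 0;
            }

            ret = (ret << 6) | (*c & 0x3F);
        }
    }

    *len = bytes;
    return ret;
}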
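The "simple loop to read each glyph" mentioned above might look something like the sketch below. To keep it self-contained and portable it uses a stand-in decoder, `decode_one`, with the same contract as utf8_get_glyph() (value returned, byte length in *len, length 0 when the string ends inside a glyph) and `uint32_t` instead of ULONG. Both `decode_one` and `utf8_to_codepoints` are names invented for this example.

```c
#include <stdint.h>

/* Stand-in decoder with the same contract as utf8_get_glyph():
   returns the symbol value and stores the byte length in *len;
   *len == 0 means the string ended inside a glyph.
   Like the original, it performs no other validity checks. */
static uint32_t decode_one(const unsigned char *s, int *len)
{
    int bytes = 1;
    int n;
    uint32_t cp = s[0];

    if (s[0] >= 0xF0)      { bytes = 4; cp = s[0] & 0x07; }
    else if (s[0] >= 0xE0) { bytes = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xC0) { bytes = 2; cp = s[0] & 0x1F; }

    for (n = 1; n < bytes; n++) {
        if (s[n] == 0) { *len = 0; return 0; }
        cp = (cp << 6) | (s[n] & 0x3F);
    }
    *len = bytes;
    return cp;
}

/* Walk a NUL-terminated string and store up to max symbol values in out.
   Returns the number stored, or -1 if the string ends inside a glyph. */
int utf8_to_codepoints(const unsigned char *s, uint32_t *out, int max)
{
    int count = 0;

    while (*s != 0 && count < max) {
        int len;
        uint32_t cp = decode_one(s, &len);

        if (len == 0)
            return -1;
        out[count++] = cp;
        s += len;           /* advance by the glyph's byte length */
    }
    return count;
}
```

In a drawing routine, the body of the loop would hand each value to whatever glyph-rendering call the OS eventually provides instead of storing it in an array.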

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

Chris

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 22:09 #11

Amigans Defender

@LiveForIt

Actually I already have code to do all that (the pixel counting stuff is horrid though), my point was more that I shouldn't be forced to write this myself, and if the OS did it, then the OS could then use that functionality elsewhere (eg. WB could print UTF-8 filenames).

@MickJT

No idea, must be private!

ssolie

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/4 22:48 #12

Amigans Defender

@Chris
Quote:

It would however make certain things much more convenient.

That is true but it still won't fix anything. I am very, very well aware of the problems.

ExecSG Team Lead

Belxjander

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 0:39 #13

Just popping in

@LiveForIt

under "plain text" conditions... the following are true,

Character Codes 0 -> 127 (0x7F hex) are valid in both ANSI/ASCII within ISO Latin 1 encoding which is the AmigaOS default *and* UTF-8 concurrently.

UTF-8 is encoded with the High Bit set for initial modifiers until an Octet(8bit value) contains a High bit that is 0.

the Encoding is actually quite simple as the initial Octet includes a "count" until a clear bit and subsequent bytes have a high bit set with 6 bits of data per octet until a normal 0-127 octet is emitted by the encoder.

I know what I just said may not be clear enough for everyone.

The FileSystems and other elements of the existing OS will actually require an alternate encoding.

there is also a URLencode() form using a prefix escape character.

I am currently playing around with an InputHandler to deal with UTF-8 encoded strings that are also AmigaOS "safe".

this is actually not as easy as I first considered however I am currently trying to build this as a system pluggable subsystem expanding on the current Locale settings.

I have also seen the ignorance of mixing the keyboard layout with the codepage information.

codepages were specifically for MS-DOS as a workaround for not having UTF-8 or an equivalent (this is also a very large mess in some ways based on who you ask).

I have also sent S.Solie a Japanese keyboard (paid for from my own pocket) which has allowed some step towards a more capable OS for actually making all this work.

I'm still waiting to hear back as to any final decision made with regard to the 4 extra keymappings required for IME support.

Language-Cycle, Mode-Cycle, Mode-Exclusion, Mode-Release are the four functions to deal with an overlaid keymap setup.

I also have a *functional* layout for my Japanese keyboard (used to write this message)

Follow the video link in my first edit to see the keyboard and this message partially written...
EDIT[1]:Here

EDIT2: I've written this in the middle of a headache while sick... so it will be somewhat braindump'ish in nature...
But until I have a migraine or something else incapacitating...I don't get any breaks for anything.

Until later...

Edited by Belxjander on 2014/3/5 0:59:15
Edited by Belxjander on 2014/3/5 1:20:40

Gazelle

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 9:10 #14

Just popping in

@LiveForIt
Quote:

I agree, and here is some code I just wrote.

Your code could be seen as a security risk because it does not filter out illegal UTF-8 sequences.

see http://tools.ietf.org/html/rfc3629#section-10

(edit: added url)
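As a sketch of what filtering illegal sequences can mean in practice, the decoder below rejects the cases RFC 3629 rules out: bad continuation bytes, overlong forms (such as C0 AF standing in for "/"), UTF-16 surrogates, and values above U+10FFFF. `utf8_decode_strict` is a hypothetical name, and this is an illustration rather than a replacement for an established library such as iconv.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict decoder: decode one code point from s[0..n).
   Returns the number of bytes consumed (1..4), or 0 if the sequence is
   truncated or ill-formed according to RFC 3629. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest value that legitimately needs 2, 3 and 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t v;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else return 0;                  /* continuation byte or invalid lead */

    if (n < need)
        return 0;                   /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a 10xxxxxx continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min_cp[need])
        return 0;                   /* overlong encoding */
    if (v >= 0xD800 && v <= 0xDFFF)
        return 0;                   /* UTF-16 surrogate */
    if (v > 0x10FFFF)
        return 0;                   /* beyond the Unicode range */

    *cp = v;
    return need;
}
```

The overlong check is the one that matters for security: without it, a filter that looks for the byte 0x2F can be bypassed by a sequence that decodes to the same value.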

ssolie

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 16:07 #15

Amigans Defender

@Belxjander
Quote:

I have also sent S.Solie a Japanese keyboard (paid for from my own pocket) which has allowed some step towards a more capable OS for actually making all this work.

Note we have another Japanese fan on the beta testing team (hi Valiant) who is also helping with this work.

Rendering the actual Unicode strings via some Text() function is rather trivial. As you already alluded to, the real problem is integrating Unicode with the rest of the system in a way that is still backwards compatible and doesn't require too much effort either.

ExecSG Team Lead

Elwood

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 16:24 #16

Just can't stay away

We are so late compared to mainstream systems that I'm sad we have to keep backward compatibility, as it slows down OS development even more.

Philippe 'Elwood' FERRUCCI
Sam460ex 1.10 Ghz
http://elwoodb.free.fr

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 16:51 #17

Home away from home

@Gazelle

Quote:

Your code could be seen as a security risk because it does not filter out illegal UTF-8 sequences.

It's no security risk.

Well, I'm testing whether I'm at the end of the string, so if the UTF8 string is corrupt (cut off in the middle of a glyph), the decoder will return 0 and a length of 0.

However, it does not verify that the byte sequence is valid. But if the byte sequence is not valid, then there is no way you can display the string anyway.

Anyway, I just wrote it, so it might not be 100% perfect. For example, I assume the data bits should be right aligned, not left aligned. In other words, it has to be tested to see if it returns the right values. I don't think left aligned data bits would work, so I'm 99.99% sure it's correct.

Edited by LiveForIt on 2014/3/5 17:24:05
Edited by LiveForIt on 2014/3/5 17:25:19

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 17:00 #18

Home away from home

@ssolie

Quote:

Note we have another Japanese fan on the beta testing team (hi Valiant) who is also helping with this work.

Rendering the actual Unicode strings via some Text() function is rather trivial. As you already alluded to,

Yes, that's a good start.

Quote:

the real problem is integrating Unicode with the rest of the system in a way that is still backwards compatible and doesn't require too much effort either.

How about taking this in baby steps? You don't need 100% UTF8 support tomorrow; we need 10% UTF8 support, so that new programs can be written to use it. Not even MS Windows has 100% Unicode support.

Then you can upgrade support in steps; for example, supporting UTF8 window titles should not be too hard.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

LiveForIt

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 17:13 #19

Home away from home

@Belxjander

Quote:

under "plain text" conditions... the following are true,

Character Codes 0 -> 127 (0x7F hex) are valid in both ANSI/ASCII within ISO Latin 1 encoding which is the AmigaOS default *and* UTF-8 concurrently.

True.

Quote:

UTF-8 is encoded with the High Bit set for initial modifiers until an Octet(8bit value) contains a High bit that is 0.

Bit 7 indicates a glyph that needs more bytes.
The next high bits, up to the first 0 bit, give the number of bytes needed.

Byte 1: 110X XXXX - Byte 2: 10XX XXXX

byte 1:

2 bits = 11, indicates the glyph needs 2 bytes.
1 bit = 0, indicates a stop bit.
5 bits = XXXXX, of data.

byte 2:

2 bits = 10, indicates a valid byte.
6 bits = XXXXXX, of data.

total:
2 bytes, 11 bits of data in total, overhead 5 bits.

Byte 1: 1110 XXXX - Byte 2: 10XX XXXX - Byte 3: 10XX XXXX

byte 1:

3 bits = 111, indicates the glyph needs 3 bytes.
1 bit = 0, indicates a stop bit.
4 bits = XXXX, of data.

byte 2:

2 bits = 10, indicates a valid byte.
6 bits = XXXXXX, of data.

byte 3:

2 bits = 10, indicates a valid byte.
6 bits = XXXXXX, of data.

total:
3 bytes, 16 bits of data in total, overhead 8 bits.
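The bit layouts above can be turned around into an encoder. This is a minimal sketch with a made-up name, `utf8_encode`; it also covers the 1-byte and 4-byte forms, which follow the same pattern (0XXX XXXX, and 1111 0XXX followed by three 10XX XXXX bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode symbol value cp into out, which must have
   room for 4 bytes. Returns the number of bytes written, or 0 if cp is a
   UTF-16 surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the value 0xE5 (å) has 8 significant bits, so it lands in the 2-byte form as C3 A5, and 0x20AC (the euro sign) needs the 3-byte form, E2 82 AC.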

Quote:

the Encoding is actually quite simple as the initial Octet includes a "count" until a clear bit

the rest of the bits are data bits.

Quote:

and subsequent bytes have a high bit set with 6 bits of data per octet until a normal 0-127 octet is emitted by the encoder.

That will not work. What if you have two glyphs next to each other, both with symbol values of 128 or above? You will need to do as I do: count down the number of bytes needed by the glyph, and terminate the decoding if a byte is invalid.

EDIT: correct some info

Edited by LiveForIt on 2014/3/5 19:56:47
Edited by LiveForIt on 2014/3/5 20:02:04
Edited by LiveForIt on 2014/3/5 20:16:40

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

broadblues

Re: Would Unicode (UTF-8 or UTF-16) improve the situation? Can someone explain in layman terms if and w

Posted on: 2014/3/5 17:30 #20

Home away from home

@LiveForIt

Quote:

how ever it does not verify if byte sequence is valid, but what if the byte sequence is not valid, then there is no way you can display the string anyway.

Read the linked article more carefully. UTF-8 is weird: it's quite possible to generate apparently valid codes from technically invalid sequences. Although writing your own decoder is interesting as an aid to understanding, I wouldn't write my own decoding routines, but would use established routines from iconv and the like.

Blender For OS4.x : Blues : Walker Broad
