Where is the problem in displaying Swedish characters using the Swedish charset, once the UTF-8 string has been converted to the normal Swedish Amiga charset?
On AmigaOS4 with ReAction there is no problem. On OS4 with MUI you have to use workarounds (fontname_MIME-charset-name.font). If Swedish doesn't use ISO-8859-1, it's not possible in MUI programs on AmigaOS 3.x: on AmigaOS <= 3.9 the charset is always ISO-8859-1. You can use something else only if you use Unicode and the bullet API and render all text yourself, not in any GUI system.
I've heard that the IRC protocol doesn't include any MIME specification of the charset in use. The user is responsible for knowing which charset the other user uses, and for sending the text he typed in the charset the other user expects.
Yes, and it's even much worse than that: originally IRC used the 7 bit IBM national charset of Norway (IIRC, or wherever else it was invented).
But since there is no charset information in IRC, using anything but UTF-8 (or US-ASCII) doesn't make sense.
Quote:
Or in other words, you have to tell the other users that they shall send ISO-8859-1 or -15.
Except for private chats you usually can't do that, the networks and/or channels define a charset all users should use.
Quote:
Or you use an IRC client which is able to decode UTF-8 and to convert it to the current OS4 system default charset before displaying the text.
And which can convert everything you type from the system default charset to UTF-8 before sending it. It should support converting from/to other 8 bit charsets as well, since UTF-8 isn't used everywhere. All IRC clients, except for the AmigaOS one (AFAIK WookieChat is the only one still developed), do support that, and most try to auto-detect UTF-8 when receiving text even if you have configured them to use an 8 bit charset.
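As a rough illustration of the client behaviour described above, here is a minimal Python sketch. The function names and the default charsets are assumptions made for the example; they are not taken from WookieChat or any other real client.

```python
def decode_incoming(raw: bytes, fallback: str = "iso8859_15") -> str:
    """Try strict UTF-8 first; otherwise use the configured 8 bit charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so assume the configured 8 bit charset.
        return raw.decode(fallback)


def encode_outgoing(text: str, channel_charset: str = "utf-8") -> bytes:
    """Convert typed text to whatever charset the channel expects."""
    return text.encode(channel_charset, errors="replace")


def to_system_charset(text: str, system_charset: str = "iso8859_1") -> bytes:
    """Convert decoded text to the system default 8 bit charset for display.

    Characters the charset cannot represent are replaced by '?'.
    """
    return text.encode(system_charset, errors="replace")
```

With these assumptions, a Swedish line arrives intact whether the sender used UTF-8 or ISO-8859-15, and anything outside the system charset degrades to a question mark instead of breaking the display.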
joerg wrote: @TetiSoft Yes, and it's even much worse than that: originally IRC used the 7 bit IBM national charset of Norway (IIRC, or wherever else it was invented).
UTF-8 is not magic. It's easily detectable because of the way it is encoded: you can check whether the data is in a valid format or not, and if not, it is ASCII. For storage it only requires a zero-terminated text string. The problem is that a program which requires ASCII (8 bit) needs to read attributes from a UTF-8 aware gadget class, and you can't do that with the existing tags; therefore explicit UTF-8 support tags must be added.
UTF-8 support can be extended one class at a time.
(NutsAboutAmiga)
Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps.
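The validity check mentioned in the post above can be sketched in a few lines of Python. The function name `classify` and its three return values are hypothetical, chosen only for this example. Note that bytes which fail the check are not ASCII but text in some 8 bit charset.

```python
def classify(raw: bytes) -> str:
    """Return 'ascii', 'utf-8' or '8-bit' for a received byte string."""
    if all(b < 0x80 for b in raw):
        return "ascii"          # pure 7 bit data is valid in every charset here
    try:
        raw.decode("utf-8")     # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"          # some 8 bit charset; which one is unknown


# Detection is a heuristic: the Latin-1 text "Ã¥" is the byte pair C3 A5,
# which also happens to be well-formed UTF-8 for "å".
ambiguous = "Ã¥".encode("iso8859_1")
```

Ordinary Latin-1 text such as "smörgås" fails the UTF-8 check, which is why the heuristic works well in practice, but the ambiguous case shows why it cannot replace an explicit charset tag.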
a simple conversion from UTF-8 to the current system default charset is enough to talk Swedish with Swedish friends.
Well, in that process you're losing lots of information; better to convert 64 bit UTF-8 and display the chars directly.
Your option is somewhat of a hack if you ask me. If you'd like to display the text in the right format, you first need to analyze the UTF-8 data to detect what language was typed, then switch the code page to the correct one and convert the text to 8 bit; then you can render the data and switch the code page back to the one you were using. Does that make sense from a developer's point of view?
(NutsAboutAmiga)
I would not even think about allowing AmigaOS to display UTF-8 until most parts of AmigaOS are charset aware
Because you and I both know how big a task adding UTF-8 support to AmigaOS is, it is unrealistic to think that UTF-8 will be added to all corners of the OS without first having some basic UTF-8 support.
In no way is it appropriate to add UTF-8 support as a hack that conflicts with major parts of the OS. Therefore UTF-8 must be added as an optional feature that can be extended to supported components one by one: something that can be used by anything that supports it, but is not used by anything that doesn't. Support for UTF-8 would therefore start with displaying UTF-8 text in a simple way; we are not talking about the bullet API.
(NutsAboutAmiga)
it's easily detectable because of the way it is encoded: you can check whether the data is in a valid format or not, and if not, it is ASCII. For storage it only requires a zero-terminated text string. The problem is that a program which requires ASCII (8 bit) needs to read attributes from a UTF-8 aware gadget class, and you can't do that with the existing tags; therefore explicit UTF-8 support tags must be added.
OS4 has used explicit charset tags for years, and you suggest we should drop that and try to use UTF-8 autodetection? No, thanks. Please read the docs again. The existing tag value for UTF-8 is 106, as specified by IANA in L:charsets/character-sets.
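To illustrate what an explicit charset tag buys you, here is a small Python sketch that maps a few IANA MIBenum values to codecs and decodes text according to the tag. The table and the function `decode_tagged` are made up for this example and are not part of any AmigaOS API; only the MIBenum numbers themselves come from the IANA registry.

```python
# IANA MIBenum values for a few charsets mentioned in this thread,
# mapped to the names of the corresponding Python codecs.
MIBENUM_TO_CODEC = {
    3: "ascii",         # US-ASCII
    4: "iso8859_1",     # ISO-8859-1
    10: "iso8859_7",    # ISO-8859-7 (Greek)
    106: "utf-8",       # UTF-8
    111: "iso8859_15",  # ISO-8859-15
    2084: "koi8_r",     # KOI8-R
}


def decode_tagged(raw: bytes, mibenum: int) -> str:
    """Decode bytes according to an explicit charset tag, with no guessing."""
    return raw.decode(MIBENUM_TO_CODEC[mibenum])
```

The same byte 0xE1 then means "á" under tag 4 and "α" under tag 10, while tag 106 treats the text as UTF-8; no autodetection is involved.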
Quote:
UTF-8 support can be extended one class at a time.
I don't see a reason why any class should handle UTF-8 differently from ISO-8859-7. OK, when it tries to interpret the text, e.g. to force underlining of shortcut characters in labels, it needs some adjustments, but for simply displaying text there should IMHO be no difference between any 8bit charset and UTF-8.
Of course I've been sitting here for years doing nothing, enjoying the Ferrari I bought from the OS4 sales, and formulating lame excuses when somebody asks for "UTF-8 support in OS4" without giving details of where and why he needs exactly what. In real life the Ferrari is an Opel, and I don't know how to pay its insurance.
Quote:
because you know UTF-8 will need to be extended system wide, and you know it's a big job to add it.
UTF-8 shall be used system wide? And it's my job to do that? Is it your job to staff the user support hotline when the users can't read their existing text files anymore?
Quote:
It's not like you can't convert UTF-8 back to 8bit using the codepage before sending it over the serial port.
You don't know what you are talking about, sorry. I am talking about adding support for UTF-8 keymaps to keymap.library. Yes, it can already read UTF-8 keymap text files, just in case you missed it. But if it created UTF-8 keymaps in memory, this would break nearly every existing shell/console/terminal/KingCON/whatever. The UTF-8 encoded strings for non-ASCII characters would be no problem in most cases, BUT the special keys (cursor, function keys, ESC etc.) are used unchanged by most software. Do you volunteer to create new shells/consoles/con-handlers etc. which can handle ESC[ or UTF-8 escape sequences, not only CSI escape sequences? No? Any excuse? Well, if you don't, why should I?
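The escape sequence problem can be shown with a short Python sketch. The single-byte CSI (0x9B) used in 8 bit charsets becomes the two bytes C2 9B in UTF-8, and the raw byte 0x9B also occurs inside unrelated UTF-8 characters, so software that scans for it byte by byte breaks. The helper `normalize_csi` is hypothetical and only illustrates the idea of mapping every form to the 7 bit ESC[ introducer.

```python
ESC_BRACKET = "\x1b["   # 7 bit form of the control sequence introducer
CSI = "\x9b"            # C1 control character CSI (U+009B)


def normalize_csi(data: bytes, charset: str) -> str:
    """Decode key data in the given charset and rewrite CSI as ESC[.

    Decoding first matters: in UTF-8 the byte 0x9B is also a continuation
    byte of ordinary characters, so it must not be matched on raw bytes.
    """
    return data.decode(charset).replace(CSI, ESC_BRACKET)


cursor_up_8bit = b"\x9bA"                    # CSI A in an 8 bit charset
cursor_up_utf8 = (CSI + "A").encode("utf-8")  # the same sequence in UTF-8
```

Here `cursor_up_utf8` is three bytes long instead of two, and a character such as "ě" (bytes C4 9B) contains a 0x9B byte that is not a CSI at all, which is the kind of breakage existing console software would run into.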
Quote:
Again, I don't see how this is relevant. SSH and other shells can be supported with simple char conversions from UTF-8.
SSH is a shell and either already knows that the next generation OS4 will use UTF-8 encoded escape sequences or doesn't interpret escape sequences at all? And you already modified its source code to tell keymap.library which charset shall be used?
Well, in that process you're losing lots of information; better to convert 64 bit UTF-8 and display the chars directly.
64 bit UTF-8? Which RFC is that exactly? When typing Swedish to a Swedish friend, you can't lose any information even if you don't support more than ISO-8859-1 or -15, because those ISO standards were created for Swedish people. If you lose something, it must have been something which was definitely not Swedish...
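The claim that Swedish survives the conversion can be checked with a quick round trip. This is only a sketch in Python; the sample sentence and the helper name `is_lossless` are made up for the example.

```python
SWEDISH = "Räksmörgås och Ålandsfärja"


def is_lossless(text: str, charset: str) -> bool:
    """True if text survives encoding to charset and decoding back unchanged."""
    try:
        return text.encode(charset).decode(charset) == text
    except UnicodeEncodeError:
        return False   # at least one character does not exist in the charset
```

Swedish text round-trips through both ISO-8859-1 and ISO-8859-15, whereas Cyrillic text does not, and the Euro sign is present only in ISO-8859-15.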
Quote:
Your option is somewhat of a hack if you ask me. If you'd like to display the text in the right format, you first need to analyze the UTF-8 data to detect what language was typed, then switch the code page to the correct one and convert the text to 8 bit; then you can render the data and switch the code page back to the one you were using. Does that make sense from a developer's point of view?
We are NOT talking about word processors or text layout engines or lexical applications here; the context was an IRC client. Be assured that I have never typed a single character in an IRC client, but I've seen some IRC logs, and I'm sure that the average IRC user doesn't care much about spelling errors, upper or lower case, apostrophe or single quotation mark or double quotation mark etc., that in at least 99% of all cases the received UTF-8 text can be displayed in at least one 8bit charset, and that the experienced user will have no problem either switching his system to Greek before chatting with Greeks or telling his IRC client to use a Greek font with the already mentioned charset hack.
What you are proposing is IMHO too complicated for the average IRC client software. If you really want to handle the full Unicode repertoire, don't forget the combining characters, the non-spacing characters, the bidirectional writing direction, the Unicode characters which require a fixed-width font, etc. It's really enough if it decodes to the current system default 8bit charset, and it would be fine if it allowed choosing any 8bit charset supported by OS4, but full Unicode is IMHO not needed (yet) for an IRC client. For an office application, maybe.
In no way is it appropriate to add UTF-8 support as a hack that conflicts with major parts of the OS. Therefore UTF-8 must be added as an optional feature that can be extended to supported components one by one: something that can be used by anything that supports it, but is not used by anything that doesn't. Support for UTF-8 would therefore start with displaying UTF-8 text in a simple way; we are not talking about the bullet API.
Just in case you missed it, before OS4 nearly no AmigaOS application supported Greek, Cyrillic, Czech etc. With OS4 most applications support it AFAIK; even those broken word processors which were unable to speak Unicode when using the bullet API are magically fixed by ft2.library, and in the meantime even the PostScript and most PCL printer drivers support Greek. So I still think we should try to make it possible to use UTF-8 with pre-OS4 applications.
Especially the fact that many OS4 applications are not charset aware anyway would lead to the conclusion that UTF-8 support which needs a new API would be used by only three new programs per year so it would not be worth the effort to implement it at all...
At the moment my interest in adding more UTF-8 support is rather low, one blocker is the missing ESC[ support in console handlers (according to the standards both ESC[ and CSI should be supported).
OS4 uses explicit charset tags for charsets since years and you suggest we should drop that and try to use UTF-8
Yes, because then you can copy any mixture of languages into the string textbox gadget without caring what is what.
Quote:
autodetection? No, thanks
It's a bit useless, is it not? But let's say you have some text in UTF-8 format and the old crappy program doesn't know it's UTF-8, but uses an old class tag for the string gadget: then SetAttrsA() won't accidentally convert the UTF-8 string to UTF-8 again and leave it corrupted.
Quote:
The existing tag value for UTF-8 is 106, as specified by IANA in L:charsets/character-sets.
Well, I was reading the sdk:documentations/Autodocs /#?.gc or something like that before my computer stopped working, and did not notice any tags for UTF-8 support for reading / setting string attributes of the gadgets.
Quote:
OS4 uses explicit charset tags for charsets since years and you suggest we should drop that and try to use UTF-8
Yes and no. We can't drop legacy, can we? But it's not possible to preserve Unicode characters in ASCII, so UTF-8 must be the default format. Any program that requests ASCII using the old tags will get an ASCII string converted from the original UTF-8 string. All buffers are freed when classes are disposed of. The ASCII buffer is provided only when needed: it is simply a zero pointer until the UTF-8 must be converted to ASCII and the buffer must be created.
(NutsAboutAmiga)
My IRC client defaults to latin0 (or whatever I set as default charset), but recognises UTF-8 as well when it sees it. Of course it can be tricked, but I've never seen that happen unintentionally.
And yes, the need for mixing old charsets is there. I frequently mix latin0 and Cyrillic, for example, writing both Russian and Norwegian at once. "8bit-apps" and codeset translation are not very useful, since you're not able to, for example, transcode UTF-8 to both latin0 and koi8-r at once.
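The mixing problem is easy to reproduce. In this Python sketch (the sample text and the helper `can_encode` are invented for illustration), a line containing both Norwegian and Russian cannot be represented in either 8 bit charset on its own.

```python
MIXED = "Blåbær og черника"   # Norwegian and Russian in one line


def can_encode(text: str, charset: str) -> bool:
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False
```

Each half of the line fits its own charset (latin0 is ISO-8859-15), but the whole line only fits UTF-8, so a client that converts to a single 8 bit charset has to drop or replace one of the two languages.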
So I still think we should try to make it possible to use UTF-8 with pre-OS4 applications
UTF-8 is detectable, so to some degree it is possible. Some programs will break; if older programs do need UTF-8 and they are still in use today, I'm quite sure they will be replaced or updated at some time.
Quote:
API would be used by only three new programs per year so it would not be worth the effort to implement it at all...
Well, some like to use IRC, some need it for web browsing and e-mail, and some just like to preserve their original filenames. Then we have Japan and China and a few other countries that do not have support for their symbols.
(NutsAboutAmiga)
Well, I was reading the sdk:documentations/Autodocs /#?.gc or something like that before my computer stopped working, and did not notice any tags for UTF-8 support for reading / setting string attributes of the gadgets.
Read again: Quote:
LAYOUT_CharSet (ULONG) (V51) The character set the layout group and all its members should display their text in, regardless of the particular font used. If zero, no character set will be explicitly enforced.
Defaults to zero.
Applicability is (OM_NEW, OM_SET, OM_GET)
Quote:
REQ_CharSet (ULONG) (V51.11) The character set for the requester's text and gadgets.
Defaults to 0, meaning no character set is required.
Applicability is (OM_NEW, OM_SET, OM_GET, RM_OPENREQ)
...
Quote:
WINDOW_CharSet (ULONG) The charset of the WINDOW_NewMenu menu strings and the WINDOW_HintInfo help strings. Should be specified with e.g. the cat_CodeSet value of the catalog your application opened. (V51.11)
OS4 uses explicit charset tags for charsets since years and you suggest we should drop that and try to use UTF-8
Yes, because then you can copy any mixture of languages into the string textbox gadget without caring what is what.
This implies that somebody created a charset aware clipboard.device, which doesn't exist yet. No, it can't imply that the user is able to type multiple languages with one keyboard and one keymap but only when he is able to use more than an 8bit charset. All keymaps I'm aware of can be described with an 8bit charset, minus the special cases where both Euro and currency sign are present (nobody ever needed the currency sign) and minus the special cases where a special input method is needed to handle the keyboard anyway (Japanese, for example).
But back on topic: of course you need a charset tag to specify the font that shall be used to display the text typed in the string gadget, and of course you need a charset tag to specify the charset of the keymap of the string gadget. Currently both keymap.library and diskfont.library don't accept UTF-8 as charset, but I've avoided writing anywhere that this limitation exists now or will still exist in future.
Quote:
It's a bit useless, is it not? But let's say you have some text in UTF-8 format and the old crappy program doesn't know it's UTF-8, but uses an old class tag for the string gadget: then SetAttrsA() won't accidentally convert the UTF-8 string to UTF-8 again and leave it corrupted.
Any crappy old program will not specify any charset tag at all. Ergo it will continue to work with UTF-8. Switch your system to Greek and your old text editor still works (but in Greek, though it was written for Latin), so why should it behave differently with UTF-8? If it tries to interpret C1 control sequences, it's broken (IMHO).
Of course the user will be responsible to ensure the old program is running in a latin environment before trying to feed it a latin text and in an UTF-8 environment before trying to feed it an UTF-8 text. Exactly the same as with greek and cyrillic.
Quote:
Well, I was reading the sdk:documentations/Autodocs /#?.gc or something like that before my computer stopped working, and did not notice any tags for UTF-8 support for reading / setting string attributes of the gadgets.
The word "charset" appears about 140 times in my SDK:Autodocs directory, the word "codeset" about 12 times.
Yes and no. We can't drop legacy, can we? But it's not possible to preserve Unicode characters in ASCII, so UTF-8 must be the default format. Any program that requests ASCII using the old tags will get an ASCII string converted from the original UTF-8 string. All buffers are freed when classes are disposed of. The ASCII buffer is provided only when needed: it is simply a zero pointer until the UTF-8 must be converted to ASCII and the buffer must be created.
It's absolutely unusual to change the language or charset or font of an already existing gadget, and until now nobody wanted that IIRC. If you change the text and have no idea about charsets, you also had no idea about charsets when creating the gadget; ergo the default has not changed and no conversion is necessary.
My IRC client defaults to latin0 (or whatever I set as default charset), but recognises UTF-8 as well when it sees it. Of course it can be tricked, but I've never seen that happen unintentionally.
And yes, the need for mixing old charsets is there. I frequently mix latin0 and Cyrillic, for example, writing both Russian and Norwegian at once. "8bit-apps" and codeset translation are not very useful, since you're not able to, for example, transcode UTF-8 to both latin0 and koi8-r at once.
You are typing russian and norwegian in the exact same IRC session?
With two sessions, two windows, two fonts and two keymaps you could do it in OS4 (when there would exist an IRC client which is charset and UTF-8 aware and supports specifying the keymap and charset).
With one session you have to wait until either the IRC client supports bullet API to display full Unicode or OS4 supports displaying UTF-8 in Text(), and you'd need support in keymap.library to create UTF-8 keymaps.