UTF-8 is reasonably well supported at the DOS and file system level, but graphics.library can't display UTF-8 characters and needs an extensive rewrite in that area. Graphics also needs to support L-R and R-L text drawing.
AFAIK no one has taken on the job yet. The question arises - who would you be working for?
Is there really a need anymore to convert a text file to/from ANSI to/from UTF8?
Sure, lots of reasons. I do it all the time. Since we can't directly read a UTF-8 text under AmigaOS, it needs converting.
Quote:
Going from UTF8 to ANSI would just strip out any extended characters, yes?
That would be a very poor and destructive conversion (for anything except very plain English text). What you should do is convert to the ISO Latin variant corresponding to the current system charset (e.g. check ENV:Charset), and only for characters not found in it fall back to, well, not stripping, but some sort of replacement, e.g. the U-code of the character.
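To make the fallback idea concrete, here is a minimal sketch in plain C. It assumes the text has already been decoded to Unicode code points and that the target charset is ISO 8859-1, whose 256 characters coincide with the first 256 Unicode code points; for any other Latin variant you would need a lookup table instead. The function name and the "U+XXXX" replacement format are my own choices for illustration, not anything the OS provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Convert n code points to an ISO 8859-1 string (assumed target charset).
 * Code points above 0xFF have no Latin-1 byte, so they are written as
 * the text "U+XXXX" instead of being stripped.
 * Returns the number of bytes written, not counting the final NUL.
 * Stops early if the output buffer would overflow. */
size_t codepoints_to_latin1(const uint32_t *cps, size_t n,
                            char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {
            if (used + 1 >= outsize)
                break;
            out[used++] = (char)(unsigned char)cps[i];
        } else {
            char tmp[16];
            int len = snprintf(tmp, sizeof tmp, "U+%04lX",
                               (unsigned long)cps[i]);
            if (used + (size_t)len >= outsize)
                break;
            memcpy(out + used, tmp, (size_t)len);
            used += (size_t)len;
        }
    }
    out[used] = '\0';
    return used;
}
```

With the input 'a', U+00E9, U+20AC this gives the bytes 'a', 0xE9 and then the six characters "U+20AC", so nothing is silently lost.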
Also check the C:CharSetConvert command and its documentation, as well as the files Charsets.doc and Fonts.doc (and maybe Keyboards.doc and Keymaps.doc). They are all found in SYS:Documentation/. Oh, and they are plain text files, not some fancy Windoze Word format, just in case you thought so ...
Are there functions in some library to perform the conversions on text strings that the CharsetConvert command makes? I thought there might be in locale.library, but I didn't see any in the autodoc. I just wouldn't have thought that CharsetConvert was "hand-coded" to do it internally.
If there are, mritter could still offer to save files as UTF-8 or import UTF-8 files, while sticking to ANSI/ISO (whichever variant matches the current system charset setting) as the internal and default representation. Heck, if you wanted, mritter, you could even be forward thinking and perform everything internally as UTF-8, but just save and load ISO as default - with whatever limitations that would entail (e.g. having to fallback to U codes for characters outside the character set you're saving to, and something similar when trying to display such a character).
Best regards,
Niels
Edit: If not, there is an iconv_lib on OS4Depot which might help?
Are there functions in some library to perform the conversions on text strings that the CharsetConvert command makes? I thought there might be in locale.library, but I didn't see any in the autodoc. I just wouldn't have thought that CharsetConvert was "hand-coded" to do it internally.
Yes and no. There is an API in diskfont.library to access the charset information in L:Charsets (see IDiskfont->ObtainCharsetInfo()), so converting between a given charset and Unicode can be done via that function.
I use this in SketchBlock to find glyphs in fonts for the text engine.
There aren't any public OS functions to encode/decode UTF-8 (or UTF-7, UTF-16, UTF-32 or any other variant) that I'm aware of. These would need to be implemented by the coder. [edit] There is a bunch of stuff in recent betas of utility.library
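For anyone who does end up implementing it by hand, the encoding side is small. This is only a sketch in portable C, not an OS function; the name utf8_encode is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * out must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not encodable
 * (a UTF-16 surrogate or a value above 0x10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+00E9 (é) becomes the two bytes C3 A9 and U+20AC (€) becomes E2 82 AC.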
Quote:
Edit: If not, there is an iconv_lib on OS4Depot which might help?
Or, for a much more Amiga-friendly way, there is the de facto standard codesets.library, which wraps the various OS variants' functions with a higher-level API. It is used by YAM, AWeb and a few others.
I'm actually using utility.library functions for character conversion, and I have a normal AOS4.1Final with update1, not any betas. As my program uses UCS-4 internally as its storage format, these examples convert into UCS-4 and from UCS-4 to the current charset, but maybe one gets the clue from this.
/*
** Convert source UCS-4 string to C string in current default charset
** Note! The caller MUST free the resulting resstring !
*/
    resstring = NULL;
    if (ucsstr)
    {
        len = IUtility->UCS4Count(ucsstr,FALSE);
        bufsize = len + 1;
        if ((resstring = IExec->AllocVecPooled(MemBase,bufsize)))
        {
            /* table indexed by charset byte, giving the Unicode code point */
            maptable = (UCS4CHAR *)IDiskfont->ObtainCharsetInfo(DFCS_NUMBER,DefLocale->loc_CodeSet,DFCS_MAPTABLE);
            if (maptable)
            {
                for (i = 0;i<len;i++)
                {
                    /* reverse lookup: find the byte that maps to this code point */
                    for (j=0;j<MAPTABLESIZE;j++)
                        if (maptable[j] == ucsstr[i])
                        {
                            resstring[i] = (unsigned char)j;
                            break;
                        }
                    /* unknown char if not available */
                    if (j == MAPTABLESIZE)
                        resstring[i] = '?';
                }
                resstring[i] = '\0';
            }
            else
            {
                /* no map table: return an empty string, not uninitialised memory */
                resstring[0] = '\0';
            }
        }
    }
    return(resstring);
}
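The opposite direction (8-bit charset to UCS-4) is just a direct table lookup. A rough sketch in portable C, assuming a 256-entry table indexed by the charset byte like the map table used above; here the table is an ordinary array passed in by the caller, and the name bytes_to_ucs4 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256   /* assumption: one entry per 8-bit character */

/* Expand a NUL-terminated 8-bit string into code points using a table
 * that maps each charset byte to its Unicode code point.
 * dst must have room for strlen(src) + 1 entries.
 * Returns the number of code points written, not counting the final 0. */
size_t bytes_to_ucs4(const char *src, const uint32_t table[TABLE_SIZE],
                     uint32_t *dst)
{
    size_t i;

    for (i = 0; src[i] != '\0'; i++)
        dst[i] = table[(unsigned char)src[i]];
    dst[i] = 0;
    return i;
}
```

With a table for ISO 8859-15, for instance, byte 0xA4 would come out as U+20AC, the Euro sign.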
I was thinking about adding menu options to "Load as UTF-8" and "Paste as UTF-8". I don't know if saving involves anything extra. I have never used a UTF-8 text file. I wouldn't be looking to do anything overly complicated like finding a correct font (we don't have many). If the user's locale can display the characters then great.
Like you said, looking ahead, but nothing complicated.
I did: UTF8.library on OS4Depot.net, so I did the work.
It's probably better to use 32-bit Unicode, as there is no need to convert to and from it, which should be better for a text editor. No need to worry about inserting different char lengths into the middle of strings.
Anyway, the work was partially unnecessary, as C++ has support for UTF-8 strings.
Yes, the issue is like this: UTF-8 sucks because there are few text editors that support it, so if he writes a text editor that supports reading and writing UTF-8, that would be really nice.
And no, saving as 8-bit ASCII (using codesets encoding) is no replacement for UTF-8. The internet and XML files are UTF-8, and editing web pages on the Amiga will just trash them.
Anyway I believe all QT programs support UTF8.
As for the general problem with supporting Unicode strings: fonts are limited to certain languages and do not include the glyphs of all languages. So if you're editing a file that has Chinese, Arabic and some Russian in it, you need to render the text with different fonts for different parts of the text.
Anyway, rendering the text is one problem; being able to type Chinese is a different problem.
(NutsAboutAmiga)
Basilisk II for AmigaOS4 AmigaInputAnywhere Excalibur and other tools and apps.
So if you'd like to work with UTF-8, one thing you should know is that in UTF-8 one char can be 1 byte, 2 bytes, 3 bytes or 4 bytes.
The encoding is simple.
If bit 7 of the first byte is 0, it's a 7-bit (ASCII) char; if bit 7 is 1, the byte belongs to a multi-byte char.
If bits 7,6 are set it's a two-byte char, if bits 7,6,5 are set it's a 3-byte char, and if bits 7,6,5,4 are set it's a 4-byte char.
The next bit after that leading run of 1 bits is always 0, marking where the length prefix stops. The remaining bits of the first byte, plus the low 6 bits of each following byte, are masked and shifted into place until you have a 32-bit char that can represent all possible characters.
If you find a byte with bit 7 set and bit 6 not set where a char should start, you have an illegal char: such bytes are only valid as the following bytes of a multi-byte char. The text is then probably 8-bit ISO/ANSI, not UTF-8.
Because you do not have a fixed length in UTF-8, it's not easy to work with. Think of it as having to work with an RLE-encoded image: it's possible, but not really practical. Decoding is relatively fast, a lot faster than rendering the text, so the better option is storing strings in RAM as 32-bit chars. Then you have a fixed length, and it's more or less as easy as working with plain chars.
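The bit rules above can be sketched as a small decoder in portable C that turns one UTF-8 sequence into a 32-bit value. The function names are made up for the example, and the extra checks (overlong forms, surrogates, values above 0x10FFFF) go a little beyond what is described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the sequence announced by a lead byte, or 0 if the byte
 * cannot start a character (10xxxxxx continuation byte or 11111xxx). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Decode one character from s, where len bytes are available.
 * On success stores the code point in *cp and returns the number of
 * bytes consumed; returns 0 if the input is malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* smallest code point that legitimately needs n bytes */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t v;

    if (len == 0)
        return 0;
    n = utf8_seq_len(s[0]);
    if (n == 0 || n > len)
        return 0;
    if (n == 1) {
        *cp = s[0];
        return 1;
    }

    v = s[0] & (0xFFu >> (n + 1));       /* payload bits of the lead byte */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)       /* must be 10xxxxxx */
            return 0;
        v = (v << 6) | (s[i] & 0x3Fu);
    }
    if (v < min_cp[n] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;                        /* overlong, too large, surrogate */
    *cp = v;
    return n;
}
```

Calling this in a loop and advancing by the return value fills a 32-bit array. A return of 0 on a file is a reasonable hint that the text is not UTF-8 at all.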
The other issue you need to work out is the alphabet of different languages, as you need to be able to convert between lower and upper case. In ASCII it's easy: just subtract a value from the char to get the upper-case char and add it to get the lower-case char, but that does not work in UTF-8.
As for typing text into UTF-8: when you get a key press, you need to convert the char into a Unicode code point. This can be done with codesets.library. Instead of storing the 8-bit value in the UTF-8 string you store the encoded code point, using the codepage to look it up.
Edited by LiveForIt on 2019/7/24 23:01:01
The other issue you need to work out is the alphabet of different languages, as you need to be able to convert between lower and upper case. In ASCII it's easy: just subtract a value from the char to get the upper-case char and add it to get the lower-case char, but that does not work in UTF-8.
It certainly doesn't work in ASCII (ISO) either, once national letters are involved. You need to use the functions available for this in Locale, which respect the rules for upper/lower case national letters for the specific language/charset (which may be implemented in the language driver and/or the charset driver).
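To show why the plain "subtract 32" trick breaks even in an 8-bit ISO charset, here is a hand-written upper-casing routine for ISO 8859-1 only. It is purely an illustration of the exceptions; in a real AmigaOS program the locale.library functions mentioned above are the right tool, and other charsets have different exceptions.

```c
/* Upper-case one ISO 8859-1 character (illustration only). */
unsigned char latin1_toupper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 32);

    /* 0xE0..0xFE are lower-case accented letters, except 0xF7,
     * which is the division sign and must not be changed. */
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (unsigned char)(c - 32);

    /* Everything else stays as it is. That includes 0xDF (sharp s) and
     * 0xFF (y with diaeresis), whose upper-case forms do not exist as
     * single characters in ISO 8859-1. */
    return c;
}
```

A blind c - 32 would turn the division sign (0xF7) into the multiplication sign (0xD7) and ÿ (0xFF) into ß (0xDF).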
It's probably better to use 32-bit Unicode, as there is no need to convert to and from it, which should be better for a text editor. No need to worry about inserting different char lengths into the middle of strings.
If working in C, using UTF-8 internally allows you to work with the existing string.h functionality, as null-terminated strings are still valid. With 4-byte Unicode values the strings are full of nulls, so none of the string functions work. Conversion for rendering is slower though, of course, but I don't think it would be a huge bottleneck.
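A small sketch of that point in portable C: the normal string.h calls keep working on UTF-8 because no multi-byte sequence ever contains a zero byte, but they count and compare bytes, not characters. The helper utf8_count is made up for the example and assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string
 * by skipping continuation bytes, which always look like 10xxxxxx.
 * Assumes the string is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For "café" stored as UTF-8, strlen() reports 5 bytes while utf8_count() reports 4 characters, and strstr() still finds the two-byte é at offset 3.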
Quote:
and no saving it as 8bit ascii (using codesets encoding), is no replacement for utf8, the internet and XML files are utf8 and if your editing a web pages on Amiga will just trash this.
Saving out as UTF-8 is always useful, but you should use character references for all non-ASCII code points even in a UTF-8 encoded web page.
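Writing a numeric character reference is a one-liner once you have the code point. A minimal sketch in portable C; the function name is hypothetical, and it deliberately does not escape markup characters such as & or <, which a real HTML writer would also have to handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write one code point into out: ASCII as itself, anything else as a
 * decimal numeric character reference such as "&#8364;".
 * Returns the number of characters written (not counting the NUL),
 * or 0 if the buffer is too small. */
size_t html_escape_cp(uint32_t cp, char *out, size_t outsize)
{
    int n;

    if (cp < 0x80)
        n = snprintf(out, outsize, "%c", (int)cp);
    else
        n = snprintf(out, outsize, "&#%lu;", (unsigned long)cp);

    return (n < 0 || (size_t)n >= outsize) ? 0 : (size_t)n;
}
```

For U+20AC this writes the seven characters "&#8364;".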
Quote:
Anyway I believe all QT programs support UTF8.
Who cares
Quote:
As for general problem with supporting unicode strings, is that the fonts are limited the languages and does not include all glyphs of all languages. so if your editing a string that chinese and Arabic and some Russian in the same file, you need to render the text with different fonts, for different parts of the text.
If you've got to the point where you are worrying about that, you have solved a lot of problems! And the OP is making a programming text editor, which is unlikely to be using kanji, Cyrillic, Arabic or Chinese characters that often.
Quote:
Anyway rendering the text is one problem, being able to type chinese is different problem.
Quite. TBH you need a dedicated keyboard for that anyway, even on Windows (or an awful lot of obscure key sequences).
>> There is a bunch of stuff in recent betas of utility.library
Aha, so it's not available for normal people.
You quoted me and blmara out of order. It seems from his post the functions are public. You need version 54 of utility.library; I mistakenly assumed that was still beta. I might be wrong, I haven't had a chance to verify.
It looks like UTF8 is still far from a friendly, usable state. More work than I want to put in at this time. And if it will slow things down processing the extra bits, then no. I want it to be as fast/smooth as possible.
The syntax highlighting slows things down a little (depending on language), and it is not as accurate as I would like. I would rather spend my time fixing that.
Thanks for the input.
Workbench Explorer - A better way to browse drawers
@mritter0 If it is for Struct then isn’t performance better using ASCII, at least for C/C++ ... Any plan to do hardware scrolling using the GPU, a bit like what Cygnus did with the Blitter? Are you optimizing your code to limit cache misses?
Performance first. My thought for UTF8 was for when doing locale strings, not so much for everyday programming.
GPU scrolling, no idea how to do that.
I have not looked into optimizing yet. Still working on some of the core functions. I am working on getting to a point where I can use it to edit its own code without too much hassle. Not too far to go.......