Text file encoding

Posted on: 2019/7/17 23:31 #1

Just popping in

Is there really a need anymore to convert a text file to/from ANSI to/from UTF8?

Going from UTF8 to ANSI would just strip out any extended characters, yes?

Going from ANSI to UTF8 would do nothing until you add your new extended characters, yes?

So just always stay in UTF8 encoding since it is a global Amiga scene, not just English. Yes?

Re: Text file encoding

Posted on: 2019/7/17 23:53 #2

Home away from home

For a development text editor I'm not sure UTF-8 is a good idea, especially as UTF-8 is so badly supported on AmigaOS anyway.

Some programming languages support UTF-8 strings, some don't.

Blender For OS4.x : Blues : Walker Broad

tonyw

Re: Text file encoding

Posted on: 2019/7/18 1:47 #3

Quite a regular

@mritter0

UTF-8 is reasonably well supported at the DOS and file system level, but graphics.library can't display UTF-8 characters and needs an extensive rewrite in that area. Graphics also needs to support L-R and R-L text drawing.

AFAIK no one has taken on the job yet. The question arises - who would you be working for?

cheers
tony

Amigo1

Re: Text file encoding

Posted on: 2019/7/18 5:27 #4

Quite a regular

@mritter0

If you’re going to look into this, mind that there are sub-variants such as UTF-8 NFD and UTF-8 NFC (different Unicode normalization forms of the same text).

Why oh why this big mess!?!? :~|
UTF-8 Text at www.ietf.org
UTF-8 at Wikipedia

nbache

Re: Text file encoding

Posted on: 2019/7/19 0:08 #5

Just can't stay away

@mritter0

Quote:

Is there really a need anymore to convert a text file to/from ANSI to/from UTF8?

Sure, lots of reasons. I do it all the time. Since we can't directly read a UTF-8 text under AmigaOS, it needs converting.

Quote:

Going from UTF8 to ANSI would just strip out any extended characters, yes?

That would be a very poor and destructive conversion (for anything except very plain English text). What you should do is convert to the ISO Latin variant corresponding to the current system charset (e.g. check ENV:Charset), and only for characters not found in that charset do you have to fall back to - well, not stripping, but some sort of replacement, e.g. the U-code of the character.
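
A minimal sketch of that kind of fallback, assuming the target charset is ISO-8859-1 (where byte values equal Unicode code points) and that the text has already been decoded to code points. Other charsets would need a real mapping table. The function name ucs4_to_latin1 and the "<U+XXXX>" tag format are illustrative choices, not anything provided by the OS:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the code points in `src` (terminated by 0) to
 * `dst` as ISO-8859-1. Anything above U+00FF is not dropped but replaced by
 * a visible "<U+XXXX>" tag. Output is truncated if `dst` is too small.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t ucs4_to_latin1(const uint32_t *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0)
        return 0;

    for (; *src != 0; src++) {
        if (*src <= 0xFF) {
            if (out + 1 >= dstsize)
                break;                      /* no room for char + NUL */
            dst[out++] = (char)(unsigned char)*src;
        } else {
            char tag[16];
            int n = snprintf(tag, sizeof tag, "<U+%04lX>", (unsigned long)*src);

            if (out + (size_t)n >= dstsize)
                break;                      /* no room for tag + NUL */
            memcpy(dst + out, tag, (size_t)n);
            out += (size_t)n;
        }
    }
    dst[out] = '\0';
    return out;
}
```

The tag keeps the information about which character was lost, so a later conversion back could in principle restore it.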

Also check the C:CharSetConvert command and its documentation, as well as the files Charsets.doc and Fonts.doc (and maybe Keyboards.doc and Keymaps.doc). They are all found in SYS:Documentation/. Oh, and they are plain text files, not some fancy Windoze Word format, just in case you thought so ...

Quote:

So just always stay in UTF8 encoding

Bad idea. As Tony mentions above:

Quote:

graphics.library can't display UTF-8 characters

.
Best regards,

Niels

Re: Text file encoding

Posted on: 2019/7/19 0:24 #6

Just popping in

OK, sounds like UTF-8 is out. I didn't know how (in)complete OS4's handling of it was.

I just had some code that does UTF-8 to Latin1 conversion on a string. But I don't have any save code, if it is any different.
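
For the saving direction, a rough sketch assuming the in-memory text is ISO-8859-1 (Latin1), where every byte value is also its Unicode code point. Bytes below 0x80 are copied unchanged and every other byte becomes a two-byte UTF-8 sequence. The function name latin1_to_utf8 is made up for this example:

```c
#include <stddef.h>

/* Encode a NUL-terminated ISO-8859-1 string as UTF-8.
 * `dst` needs at most 2 * strlen(src) + 1 bytes.
 * Returns the number of bytes written (excluding the NUL), or 0 with an
 * empty string in `dst` if the buffer is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t out = 0;

    if (dstsize == 0)
        return 0;

    for (; *s != 0; s++) {
        size_t need = (*s < 0x80) ? 1 : 2;

        if (out + need >= dstsize) {        /* keep room for the NUL */
            dst[0] = '\0';
            return 0;
        }
        if (*s < 0x80) {
            dst[out++] = (char)*s;                      /* 0xxxxxxx */
        } else {
            dst[out++] = (char)(0xC0 | (*s >> 6));      /* 110000xx */
            dst[out++] = (char)(0x80 | (*s & 0x3F));    /* 10xxxxxx */
        }
    }
    dst[out] = '\0';
    return out;
}
```

For any other system charset the byte would first have to be mapped to its code point through that charset's table before encoding.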

Workbench Explorer - A better way to browse drawers

nbache

Re: Text file encoding

Posted on: 2019/7/19 23:49 #7

Just can't stay away

@tony, broadblues or whoever might know:

Are there functions in some library to perform the conversions on text strings that the CharsetConvert command makes? I thought there might be in locale.library, but I didn't see any in the autodoc. I just wouldn't have thought that CharsetConvert was "hand-coded" to do it internally.

If there are, mritter could still offer to save files as UTF-8 or import UTF-8 files, while sticking to ANSI/ISO (whichever variant matches the current system charset setting) as the internal and default representation. Heck, if you wanted, mritter, you could even be forward thinking and perform everything internally as UTF-8, but just save and load ISO as default - with whatever limitations that would entail (e.g. having to fallback to U codes for characters outside the character set you're saving to, and something similar when trying to display such a character).

Best regards,

Niels

Edit: If not, there is an iconv_lib on OS4Depot which might help?

Re: Text file encoding

Posted on: 2019/7/20 10:19 #8

Home away from home

@nbache

Quote:

Are there functions in some library to perform the conversions on text strings that the CharsetConvert command makes? I thought there might be in locale.library, but I didn't see any in the autodoc. I just wouldn't have thought that CharsetConvert was "hand-coded" to do it internally.

Yes and no. There is an API in diskfont.library to access the charset information in L:Charsets (see IDiskfont->ObtainCharsetInfo()), so converting between a given charset and Unicode can be done via that function.

I use this in SketchBlock to find glyphs in fonts for the text engine.

There aren't any public OS functions to encode/decode UTF-8 (or UTF-7, 16, 32 or any other variant) that I'm aware of. These would need to be implemented by the coder. [edit] There is a bunch of stuff in recent betas of utility.library.

Quote:

Edit: If not, there is an iconv_lib on OS4Depot which might help?

Or, for a much more Amiga-friendly way, there is the de facto standard of codesets.library, which wraps the various OS variants' functions with a higher-level API. Used by YAM, AWeb and a few others.

Blender For OS4.x : Blues : Walker Broad

blmara

Re: Text file encoding

Posted on: 2019/7/20 18:26 #9

Just popping in

@broadblues

I'm actually using utility.library functions for character conversion, and I have a normal AOS4.1 Final with Update 1, not any betas. As my program uses UCS-4 internally as a storage format, these examples convert from UTF-8 into UCS-4 and from UCS-4 to the current charset, but maybe one gets the idea from this.


...
typedef int32 * UCS4STRING;
typedef STRPTR UTF8STRING;
typedef int32 UCS4CHAR;
...
/* MemBase, DefLocale and MAPTABLESIZE are defined elsewhere in the program (not shown) */

UCS4STRING ConvertUTF8ToUCS4(UTF8STRING srcstr)
{
    UCS4STRING resstring;
    uint32 bufsize;    /* bufsize should accommodate UCS4STRING */

    /*
    ** Convert source UTF-8 C string to internal UCS-4 string
    ** Note! The caller MUST free the resulting resstring !
    */
    resstring = NULL;
    if (srcstr)
    {
        /* one UCS4CHAR per character plus one for the terminator */
        bufsize = sizeof(UCS4CHAR) * (IUtility->UTF8Count(srcstr,FALSE) + 1);
        if ((resstring = IExec->AllocVecPooled(MemBase,bufsize)))
        {
            IUtility->UTF8toUCS4(srcstr,resstring,bufsize,UTF_INVALID_SUBST_FFFD);
        }
    }
    return(resstring);
}

...

STRPTR ConvertUCS4ToString(UCS4STRING ucsstr)
{
    STRPTR resstring;
    UCS4CHAR *maptable;
    uint32 i,j,len,bufsize;    /* bufsize should accommodate the 8-bit result string */

    /*
    ** Convert source UCS-4 string to C string in current default charset
    ** Note! The caller MUST free the resulting resstring !
    */
    resstring = NULL;
    if (ucsstr)
    {
        len = IUtility->UCS4Count(ucsstr,FALSE);
        bufsize = len + 1;
        if ((resstring = IExec->AllocVecPooled(MemBase,bufsize)))
        {
            maptable = (UCS4CHAR *)IDiskfont->ObtainCharsetInfo(DFCS_NUMBER,DefLocale->loc_CodeSet,DFCS_MAPTABLE);
            if (maptable)
            {
                for (i = 0;i<len;i++)
                {
                    /* search the map table for the byte that maps to this code point */
                    for (j=0;j<MAPTABLESIZE;j++)
                        if (maptable[j] == ucsstr[i])
                        {
                            resstring[i] = (unsigned char)j;
                            break;
                        }
                    /* unknown char if not available */
                    if (j == MAPTABLESIZE)
                        resstring[i] = '?';
                }
                resstring[i] = '\0';
            }
            else
            {
                /* no map table: return an empty string, not uninitialized memory */
                resstring[0] = '\0';
            }
        }
    }
    return(resstring);
}

Edit: added the typedefs.

Marko

Re: Text file encoding

Posted on: 2019/7/20 21:19 #10

Just popping in

@nbache

I was thinking about adding menu options to "Load as UTF-8" and "Paste as UTF-8". I don't know if saving involves anything extra. I have never used a UTF-8 text file. I wouldn't be looking to do anything overly complicated like finding a correct font (we don't have many). If the user's locale can display the characters then great.

Like you said, looking ahead, but nothing complicated.

Re: Text file encoding

Posted on: 2019/7/24 21:47 #11

Home away from home

@tonyw

I did: UTF8.library on OS4Depot.net, so I did the work.

It's probably better to use 32-bit Unicode, as there is no need to convert to and from it, which should be better for a text editor. Then there is no need to worry about inserting chars of different lengths into the middle of strings.

Anyway, the work was partially unnecessary, as in C++ there is support for UTF-8 strings.

Yes, the issue is like this: UTF-8 sucks because there are few text editors that support it, so if he writes a text editor that supports reading and writing UTF-8, that would be really nice.

And no, saving it as 8-bit ASCII (using codesets encoding) is no replacement for UTF-8. The internet and XML files are UTF-8, and if you're editing web pages on Amiga, that will just trash them.

Anyway, I believe all Qt programs support UTF-8.

As for the general problem with supporting Unicode strings: fonts are limited to certain languages and do not include all glyphs of all languages. So if you're editing a file that has Chinese, Arabic and some Russian in it, you need to render the text with different fonts for different parts of the text.

Anyway, rendering the text is one problem; being able to type Chinese is a different problem.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

Re: Text file encoding

Posted on: 2019/7/24 22:08 #12

Home away from home

@blmara

So there is some support in AmigaOS4.x, that's nice.
If I had known, I wouldn't have wasted time on my own library, but anyway I learned a lot from doing the work.

So now you only need to render the glyphs one by one using the bullet API, or using a TrueType library.

So each char in UCS-4 should map directly to a glyph in the fonts.

Re: Text file encoding

Posted on: 2019/7/24 22:11 #13

Home away from home

@broadblues

>> There is a bunch of stuff in recent betas of utility.library

Aha, so it's not available for normal people.

Re: Text file encoding

Posted on: 2019/7/24 22:28 #14

Home away from home

@mritter0

So if you'd like to work with UTF-8, one thing you should know is that in UTF-8 one char can be 1 byte, 2 bytes, 3 bytes or 4 bytes.

The encoding is simple.

Look at the first byte of a char: if bit 7 is 1, it is a multi-byte char; if bit 7 is 0, it's a 7-bit char.

If bits 7 and 6 are set (and bit 5 is clear), it's a 2-byte char.
If bits 7, 6 and 5 are set (and bit 4 is clear), it's a 3-byte char.
If bits 7, 6, 5 and 4 are set (and bit 3 is clear), it's a 4-byte char.

The bit after the last set bit of that prefix is always 0, as an indication that the length prefix stops. The remaining bits, together with the low 6 bits of each following byte (these bytes all have bit 7 set and bit 6 clear), are masked and shifted into place until you have a 32-bit char that can represent all possible glyphs.

If you find a byte with bit 7 set and bit 6 clear where a new char should start, you have an illegal char, and the text is probably 8-bit ASCII, not UTF-8.
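
The bit rules above can be sketched as a small decoder. This is only an illustration, not the code from UTF8.library; the name utf8_decode is made up, and the extra checks for overlong sequences, surrogates and values above U+10FFFF follow the UTF-8 definition in RFC 3629:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from s[0..len). Stores the code point in *cp and
 * returns the number of bytes consumed, or 0 if the bytes are not valid
 * UTF-8 (stray continuation byte, truncated or overlong sequence, surrogate,
 * or value above U+10FFFF). */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* smallest code point that legitimately needs n bytes */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }    /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }  /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }  /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }  /* 11110xxx */
    else
        return 0;           /* 10xxxxxx or 11111xxx cannot start a char */

    if (len < n)
        return 0;           /* sequence is cut off */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;       /* following bytes must be 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return n;
}
```

A return value of 0 is the hint that the file is probably in an 8-bit charset: a Latin-1 "é" (0xE9) followed by plain letters looks like the start of a 3-byte sequence whose following bytes are wrong.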

Because you do not have a fixed length in UTF-8, it's not easy to work with. Think of it as having to work with an RLE-encoded image: it's possible, but not really practical. Decoding is relatively fast, a lot faster than rendering the text. A better option is storing strings in RAM as 32-bit; then you have a fixed length, and it's more or less as easy as working with chars.
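
To illustrate the cost of the variable length: with 32-bit chars, character number n is simply str[n], while in UTF-8 you have to scan from the start to find where it begins. A small sketch, assuming the string is valid UTF-8 (the name utf8_offset is hypothetical):

```c
#include <stddef.h>

/* Return the byte offset of character number `index` (0-based) in a
 * NUL-terminated, valid UTF-8 string. If the string has fewer characters
 * than that, the offset of the terminating NUL is returned. */
size_t utf8_offset(const char *s, size_t index)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t off = 0;

    while (p[off] != 0) {
        if ((p[off] & 0xC0) != 0x80) {  /* not 10xxxxxx: a new char starts here */
            if (index == 0)
                break;
            index--;
        }
        off++;
    }
    return off;
}
```

An editor that keeps UTF-8 internally would typically cache such offsets per line rather than rescanning on every cursor move.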

The other issue you need to work out is the ABC of different languages, as you need to be able to convert from lower case to upper case. In ASCII it's easy: just subtract a value from the char to convert to upper case, or add a value to get lower case, but that does not work in UTF-8.

As for typing text into UTF-8: when you get a key press, you need to convert the char into a glyph code. This can be done with codesets.library: instead of storing the ASCII value into the UTF-8 string, you store the glyph value, using the code page to look it up.


nbache

Re: Text file encoding

Posted on: 2019/7/24 23:48 #15

Just can't stay away

@LiveForIt

Quote:

The other issue you need to work out is the ABC of different languages, as you need to be able to convert from lower case to upper case. In ASCII it's easy: just subtract a value from the char to convert to upper case, or add a value to get lower case, but that does not work in UTF-8.

It certainly doesn't work in ASCII (ISO) either. You need to use the functions available for this in Locale, which respect the rules for upper/lower case national letters for the specific language/charset (which may be implemented in the language driver and/or the charset driver).
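
A small illustration of why the arithmetic alone is not enough, assuming ISO-8859-1 as the charset. Both function names are made up for this example, and it only shows the problem; on AmigaOS the Locale functions remain the proper way, since they also know the rules of other charsets and languages:

```c
/* The usual ASCII-only rule: only a-z are touched, so national letters
 * such as 0xE6 (ae ligature) or 0xE9 (e acute) stay lower case. */
unsigned char ascii_upper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 0x20) : c;
}

/* An ISO-8859-1 aware rule needs explicit exceptions:
 * 0xF7 is the division sign, not a letter, and 0xDF (sharp s) and
 * 0xFF (y diaeresis) have no upper-case form inside Latin-1. */
unsigned char latin1_upper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (unsigned char)(c - 0x20);
    return c;
}
```

Blindly subtracting 0x20 from every byte above 0xE0 would turn the division sign (0xF7) into the multiplication sign (0xD7), which is why the exceptions are needed.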

Best regards,

Niels

Re: Text file encoding

Posted on: 2019/7/25 10:38 #16

Home away from home

@LiveForIt

Quote:

It's probably better to use 32-bit Unicode, as there is no need to convert to and from it, which should be better for a text editor. Then there is no need to worry about inserting chars of different lengths into the middle of strings.

If working in C, using UTF-8 internally allows you to work with the existing string.h functionality, as null-terminated strings are still valid. With 4-byte Unicode values the strings are full of nulls, so none of the string functions work. Conversion for rendering is slower of course, but I don't think it would be a huge bottleneck.
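
As a sketch of what that means in practice: strlen() and strstr() keep working on UTF-8, but they count and return bytes, so a separate helper is needed when the number of characters matters. The helper name utf8_length is made up, and it assumes the string is valid UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Count the characters (code points) in a NUL-terminated, valid UTF-8
 * string by counting every byte that is not a 10xxxxxx continuation byte. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Byte-wise search still works: no valid UTF-8 sequence can match starting
 * in the middle of another character, so strstr() is safe here. */
const char *utf8_find(const char *haystack, const char *needle)
{
    return strstr(haystack, needle);
}
```

For the UTF-8 text "naïve café", strlen() reports 12 bytes while utf8_length() reports 10 characters.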

Quote:

And no, saving it as 8-bit ASCII (using codesets encoding) is no replacement for UTF-8. The internet and XML files are UTF-8, and if you're editing web pages on Amiga, that will just trash them.

Saving out as UTF-8 is always useful, but you should use character references for all non-ASCII code points even in a UTF-8 encoded webpage.
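
A minimal sketch of producing such references, assuming the text is already available as code points. The function name format_char_ref is hypothetical, and markup characters like '<' and '&' are deliberately not handled here:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Format one code point for an HTML/XML page: plain ASCII is written as-is,
 * everything else as a hexadecimal numeric character reference.
 * Returns what snprintf() returns: the length the text needs. */
int format_char_ref(uint32_t cp, char *buf, size_t size)
{
    if (cp < 0x80)
        return snprintf(buf, size, "%c", (int)cp);
    return snprintf(buf, size, "&#x%lX;", (unsigned long)cp);
}
```

With this, U+00E9 comes out as "&#xE9;", which survives being edited in an 8-bit charset.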

Quote:

Anyway, I believe all Qt programs support UTF-8.

Who cares

Quote:

As for the general problem with supporting Unicode strings: fonts are limited to certain languages and do not include all glyphs of all languages. So if you're editing a file that has Chinese, Arabic and some Russian in it, you need to render the text with different fonts for different parts of the text.

If you've got to the point where you are worrying about that, you have solved a lot of problems! And the OP is making a programming text editor, which is unlikely to be using kanji, Cyrillic, Arabic or Chinese characters that often.

Quote:

Anyway, rendering the text is one problem; being able to type Chinese is a different problem.

Quite. TBH you need a dedicated keyboard for that anyway, even on Windows (or an awful lot of obscure key sequences!).

Blender For OS4.x : Blues : Walker Broad

Re: Text file encoding

Posted on: 2019/7/25 10:40 #17

Home away from home

@LiveForIt

Quote:

@broadblues

>> There is a bunch of stuff in recent betas of utility.library

Aha, so it's not available for normal people.

You quoted me and blmara out of order; it seems from his post that the functions are public. You need version 54 of utility.library. I mistakenly assumed that was still beta. I might be wrong, I haven't had a chance to verify.

Blender For OS4.x : Blues : Walker Broad

Re: Text file encoding

Posted on: 2019/7/25 15:51 #18

Just popping in

It looks like UTF8 is still far from a friendly, usable state. More work than I want to put in at this time. And if it will slow things down processing the extra bits, then no. I want it to be as fast/smooth as possible.

The syntax highlighting slows things down a little (depending on language), and it is not as accurate as I would like. I would rather spend my time fixing that.

Thanks for the input.

Workbench Explorer - A better way to browse drawers

Kamelito

Re: Text file encoding

Posted on: 2019/7/25 17:46 #19

Just popping in

@mritter0
If it is for Struct then isn’t performance better using ASCII at least for C/C++ ...
Any plan to do hardware scrolling using the GPU a bit like what Cygnus did with the Blitter?
Are you optimizing your code to limit cache misses?