Character encoding substitution
I have a file loader that I use for reading proprietary format files generated on Windows and on OS/2. It works, but occasionally I run into a character encoding problem.
Text fields in the file header can contain user-entered text. A problem arises if a user enters the Greek letter mu in one of these fields. The character encoding used for mu on OS/2 produces a byte that is illegal in UTF-8, and, unless I strip it out, SplitString chokes on the illegal byte as I parse the rest of the text.
As a quick fix for this particular issue, I can simply strip out any suspicious-looking bytes (checking for char2num(byte) < 0 seems to do the trick), which at least prevents user-entered text from breaking the reading of the rest of the file. But I suspect there must be a better way to deal with this.
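For illustration (in Python rather than Igor, with a hypothetical helper name), the quick fix amounts to dropping every byte whose top bit is set, which is what char2num(byte) < 0 detects, since char2num returns a signed value:

```python
def strip_high_bytes(raw: bytes) -> str:
    """Drop any byte >= 0x80 (the bytes that test negative via
    char2num in Igor) and decode the remainder as plain ASCII."""
    return bytes(b for b in raw if b < 0x80).decode("ascii")

# Made-up header field containing mu encoded as 0xE6:
field = b"Resolution 4, \xe6m scale"
print(strip_high_bytes(field))  # the 0xE6 byte is simply dropped
```

This keeps the parser alive, but it silently loses the character rather than translating it.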
So, does anyone know the encoding most likely to be used by an OS/2 application? It looks like a single-byte character set. Is there an existing tool for converting between character sets? SplitString can handle valid Unicode, right?
The loader is for Opus files created by Bruker infrared spectrometers.
This page suggests it is probably Windows-1252.
You can use the ConvertTextEncoding function with sourceTextEncoding=3 (Windows-1252) and destTextEncoding=1 (UTF-8).
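As a Python sketch of that conversion (the sample bytes are made up; in Igor, ConvertTextEncoding does the equivalent re-encoding in one call): decode the raw bytes as Windows-1252, then re-encode the resulting Unicode string as UTF-8.

```python
# 0xB5 is the micro sign in Windows-1252:
raw = b"2.5 \xb5m"
text = raw.decode("cp1252")   # bytes -> Unicode string
utf8 = text.encode("utf-8")   # Unicode string -> UTF-8 bytes
print(utf8)                   # 0xB5 becomes the two-byte sequence 0xC2 0xB5
```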
February 19, 2022 at 03:07 pm - Permalink
Thanks, that is helpful.
On further inspection, it seems that the text uses an extended ASCII character set (something like Code Page 437), where the mu or micro symbol is character 0xE6.
I don't know how I missed ConvertTextEncoding!
Unfortunately, it doesn't look like I can use ConvertTextEncoding to map extended-ASCII variants like CP437 to UTF-8. However, I can use ConvertTextEncoding to clean up any illegal bytes by passing 1 (UTF-8) for both the source and destination encodings. That's probably the best option, because I don't know how to differentiate files that could have contemporary character encodings from these files created by a legacy system.
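A Python sketch of both points, with made-up sample bytes: when a file is known to be CP437, decoding it directly recovers mu, and when the encoding is unknown, decoding as UTF-8 with replacement is one analogue of the UTF-8-to-UTF-8 cleanup pass described above.

```python
# Known CP437 input: 0xE6 decodes to the micro sign (U+00B5).
assert b"\xe6".decode("cp437") == "\u00b5"

# Unknown input: decode as UTF-8 and replace any illegal bytes
# with U+FFFD so downstream parsing never sees a bad byte.
dirty = b"width \xe6m"
clean = dirty.decode("utf-8", errors="replace")
print(clean)  # width �m
```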
February 20, 2022 at 03:03 am - Permalink
If you are on Mac you should be able to use something like
iconv --from-code=CP437 --to-code=UTF8 ...
to convert the encoding.
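For a complete example of that iconv invocation (assuming the micro symbol is CP437 byte 0xE6, written here in octal; the short -f/-t flags are equivalent to --from-code/--to-code):

```shell
# Feed a CP437-encoded micro sign through iconv and convert to UTF-8:
printf 'width \346m\n' | iconv -f CP437 -t UTF-8
# prints: width µm
```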
CP437 is supported by ICU which Igor uses, at least if I understand https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings… correctly.
February 28, 2022 at 02:16 am - Permalink
In reply to If you are on Mac you should… by thomas_braun
Thanks, Thomas, that's good to know.
For my application, this was sufficient to deal with the unlikely occurrence of such characters in legacy files:
str = ConvertTextEncoding(str, 1, 1, 2, 1)
February 28, 2022 at 03:12 am - Permalink