From rodney.bates at gmx.com Tue Nov 9 16:02:43 2021
From: rodney.bates at gmx.com (Rodney Bates)
Date: Tue, 9 Nov 2021 09:02:43 -0600
Subject: [M3devel] cross compiler, widows memory protection
Message-ID: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>

Jay, can you remind me again:

1) How to build and run a cross compiler

2) What is the way that Windows is more restrictive on access to
   different regions of memory?

From jayk123 at hotmail.com Tue Nov 9 17:19:03 2021
From: jayk123 at hotmail.com (Jay K)
Date: Tue, 9 Nov 2021 16:19:03 +0000
Subject: [M3devel] cross compiler, widows memory protection
In-Reply-To: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
References: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
Message-ID:

1. Set CM3_TARGET mainly.
2. VirtualProtect?

________________________________
From: M3devel on behalf of Rodney Bates
Sent: Tuesday, November 9, 2021 3:02 PM
To: m3devel
Subject: [M3devel] cross compiler, widows memory protection

Jay, can you remind me again:

1) How to build and run a cross compiler

2) What is the way that Windows is more restrictive on access to
   different regions of memory?

_______________________________________________
M3devel mailing list
M3devel at elegosoft.com
https://m3lists.elegosoft.com/mailman/listinfo/m3devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vvm at tut.by Tue Nov 9 17:48:15 2021
From: vvm at tut.by (vvm at tut.by)
Date: Tue, 09 Nov 2021 19:48:15 +0300
Subject: [M3devel] cross compiler, wiNdows memory protection
In-Reply-To:
References: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
Message-ID: <1609661636476304@mail.yandex.by>

An HTML attachment was scrubbed...
URL:

From jayk123 at hotmail.com Tue Nov 9 18:12:33 2021
From: jayk123 at hotmail.com (Jay K)
Date: Tue, 9 Nov 2021 17:12:33 +0000
Subject: [M3devel] cross compiler, wiNdows memory protection
In-Reply-To: <1609661636476304@mail.yandex.by>
References: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
 <1609661636476304@mail.yandex.by>
Message-ID:

Please focus on AMD64 or ARM64 first.
But yes, maybe someone (Eric) can (re)write my documentation here.

The rough outline is:
  boot1.py c target to cross build a .tar.gz for other system
  boot2.sh on target to finish

I would really like boot1.py to produce autoconf/automake input for not-Windows.
Then we'll have one package for all little endian 64bit not-Windows, and possibly extend that.
(Think: {Linux,Mac,Solaris,BSD}{amd64,arm64,riscv64,alpha,sparc64} all in one, like any other C code.)
I almost had the C backend producing code that works on 32bit and 64bit.
Endian can be fixed.

 - Jay

________________________________
From: vvm at tut.by
Sent: Tuesday, November 9, 2021 4:48 PM
To: Jay K; m3devel; Rodney Bates; Eric Sessoms
Subject: Re: [M3devel] cross compiler, wiNdows memory protection

+ coder5506@

Hi!

I have some interesting (half-successful) results with I386_MINGW.
It's very much like "cross-compiling".

Best regards, Victor Miasnikov

09.11.2021, 19:19, "Jay K":
1. Set CM3_TARGET mainly.
2. VirtualProtect?
________________________________
From: M3devel on behalf of Rodney Bates
Sent: Tuesday, November 9, 2021 3:02 PM
To: m3devel
Subject: [M3devel] cross compiler, widows memory protection

Jay, can you remind me again:

1) How to build and run a cross compiler

2) What is the way that Windows is more restrictive on access to
   different regions of memory?

_______________________________________________
M3devel mailing list
M3devel at elegosoft.com
https://m3lists.elegosoft.com/mailman/listinfo/m3devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vvm at tut.by Tue Nov 9 18:54:10 2021
From: vvm at tut.by (vvm at tut.by)
Date: Tue, 09 Nov 2021 20:54:10 +0300
Subject: [M3devel] m3gdb , Solarite.exe And Co Re: cross compiler, wiNdows memory protection
In-Reply-To:
References: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
 <1609661636476304@mail.yandex.by>
Message-ID: <8448111636478522@mail.yandex.by>

An HTML attachment was scrubbed...
URL:

From jayk123 at hotmail.com Tue Nov 9 19:57:14 2021
From: jayk123 at hotmail.com (Jay K)
Date: Tue, 9 Nov 2021 18:57:14 +0000
Subject: [M3devel] cross compiler, wiNdows memory protection
In-Reply-To:
References: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
 <1609661636476304@mail.yandex.by>
Message-ID:

Combining them yes, good idea, thank you.
And omit m3gdb and m3cc maybe? They don't work on many platforms, and they do work on many platforms.
- Jay

________________________________
From: Eric Sessoms
Sent: Tuesday, November 9, 2021 6:53 PM
To: Jay K; vvm; m3devel; Rodney Bates
Subject: Re: [M3devel] cross compiler, wiNdows memory protection

On Tue, Nov 9, 2021, at 12:12 PM, Jay K wrote:
> The rough outline is:
>   boot1.py c target to cross build a .tar.gz for other system
>   boot2.sh on target to finish
>
> I would really like boot1.py to produce autoconf/automake input for
> not-Windows.
> Then we'll have one package for all little endian 64bit not-Windows,
> and possibly extend that.

That is the plan.

The only thing I'm thinking, in addition to the above, is that I'd like to package up a source distribution that includes the C bootstrap. So this distribution would be the source tree, with an added "/bootstrap" directory at the top level. Instead of our current two-step, I'd like the user to just type "install", however we end up spelling it, to build and install the bootstrap, and then build and install the sources, all at once. Shamelessly stolen from SBCL and no doubt many others.

It is a minor detail, I'm only bringing it up in case it gives anybody heartburn, so I can find out sooner rather than later.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vvm at tut.by Tue Nov 9 20:06:19 2021
From: vvm at tut.by (vvm at tut.by)
Date: Tue, 09 Nov 2021 22:06:19 +0300
Subject: [M3devel] cross compiler, wiNdows memory protection
In-Reply-To:
References: <70ab5048-5c6f-6848-8646-2260fda783d5@gmx.com>
 <1609661636476304@mail.yandex.by>
Message-ID: <1526191636484624@mail.yandex.by>

An HTML attachment was scrubbed...
URL:

From rodney.bates at gmx.com Thu Nov 11 01:13:48 2021
From: rodney.bates at gmx.com (Rodney Bates)
Date: Wed, 10 Nov 2021 18:13:48 -0600
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To:
References:
Message-ID: <67bd5978-18f1-fb09-3bcc-864c41a2b11a@gmx.com>

>
>
> On 11/10/21 12:03 PM, Eric Sessoms wrote:
>>
>> reference: https://modula3.github.io/cm3/reference/complete/html/2_6_5Text_character.html
>>
>> As I understand it, |CHAR| is 8-bit Latin-1, |TEXT| is a sequence of |CHAR|. |WIDECHAR| though not specified, must be UCS-2 because at the time Unicode was 16-bits, UTF-16 didn't exist, and UCS-2 didn't need a name to distinguish it because that's all there was.
>>
>> However I'm finding that in practice |CHAR|, |TEXT|, and |WIDECHAR| are all UTF-8.
>>
> CHAR is 8-bit ISO latin-1, per Modula3.  WIDECHAR got added by Critical Mass
> in their day, and was 16-bit, no doubt UCS-2 in current language. The language definition only says it has 65536 values and these are the first of the Unicode
> code points.
>
> I changed it to be full Unicode a few years back, but never altered the language
> reference.  It occupies 32 bits but is like a subrange going only up to
> 16_10FFFF.  So WIDECHAR and arrays thereof amount to UTF-32.  For internal use,
> there are many places where a fixed-length encoding is essential, including
> a number of places in the scanner.
>
> There are a number of encoders, decoders, and Rd/Wr counterparts in
> m3-libs/libunicode, for reading and writing streams.  They include some
> extra ones for original Critical Mass WIDECHAR, etc.
>
> None of this is currently used in the compiler.  It is entirely ISO Latin 1,
> as you have seen.  I have often thought of changing this, but got stopped by
> questions about where to stop.  Do we accept additional letters in identifiers?
> How far to go, how many languages?  Maybe only those in Latin-1? What about the compiler's internal files like .M3EXPORTS, etc. Debug info?
>
> A pretty conservative compromise would be to accept additional code points
> only in comments and literals.  I think this already happens within Latin1.
>
>> 'é' (* is a syntax error *)
>> "é" (* two bytes, 195 and 169 *)
>> W'é' (* is again a syntax error *)
>> W"é" (* is again two bytes, 195 and 169 *)
>
> This surprised me at first, because this is a Latin-1 character. But if the
> source code were in UTF-8, it would occupy two bytes.  We would need a compiler
> option to read source in either Latin-1 or UTF-8, and users would have to
> get this in sync with what their editor was writing.
>
> There are already escape sequences in character and text, wide and narrow, that
> allow other code points to be specified, if you know their numeric codes.
>
> Note that TEXT values are internally a potential dynamic mix of fragments in either
> 8-bit or 32-bit representation, depending on what you built the TEXT value from.
> Unless you go inside, to the TEXT subtypes in m3-libs/m3core/src/text/Text*,
> this is all abstracted and hidden behind the Text interface and language
> builtins on TEXT.  This came from Critical Mass and I just expanded it to
> 32-bit characters.
>
> Pickles also have support for Unicode WIDECHAR.  The procedures in Rd and Wr
> are as Critical Mass made them, i.e., although the program variables are
> WIDECHAR, the values read/written in the stream are always exactly two bytes.
> But UniRd and UniWr handle streams in any of a variety of representations.
>
> Of course, you can always stuff any values into a CHAR or WIDECHAR and do
> your own variable-length encoding/decoding.
>
>> It seems pretty clear that the scanner simply has no knowledge of character encodings
>>
>> https://github.com/modula3/cm3/blob/master/m3-sys/m3front/src/misc/Scanner.m3#L666
>>
>> such that if the source text is UTF-8, everything is going to be UTF-8. And I can't find anywhere that Modula-3 prescribes the input character set for source files.
>>
>
> Latin-1.  See the last sentence in 2.8.12.  Perhaps it needs to be more explicit.
>>
>> Historically this has probably not been an issue for Windows users, with the default CP-1252 being identical to ISO-8859-1 for printable characters, but as of Windows 10 the default encoding is changing to UTF-8, so this is going to become an issue for everybody.
>>
>> You are receiving this because you are subscribed to this thread.
>> Reply to this email directly, view it on GitHub, or unsubscribe.
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rodney.bates at gmx.com Thu Nov 11 01:19:30 2021
From: rodney.bates at gmx.com (Rodney Bates)
Date: Wed, 10 Nov 2021 18:19:30 -0600
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To:
References:
Message-ID: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>

On 11/10/21 2:54 PM, Rodney Bates wrote:
>
>
> On 11/10/21 12:03 PM, Eric Sessoms wrote:
>>
>> reference: https://modula3.github.io/cm3/reference/complete/html/2_6_5Text_character.html
>>
>> As I understand it, |CHAR| is 8-bit Latin-1, |TEXT| is a sequence of |CHAR|. |WIDECHAR| though not specified, must be UCS-2 because at the time Unicode was 16-bits, UTF-16 didn't exist, and UCS-2 didn't need a name to distinguish it because that's all there was.
>>
>> However I'm finding that in practice |CHAR|, |TEXT|, and |WIDECHAR| are all UTF-8.
>>
> CHAR is 8-bit ISO latin-1, per Modula3.  WIDECHAR got added by Critical Mass
> in their day, and was 16-bit, no doubt UCS-2 in current language. The language definition only says it has 65536 values and these are the first of the Unicode
> code points.
>
> I changed it to be full Unicode a few years back, but never altered the language
> reference.  It occupies 32 bits but is like a subrange going only up to
> 16_10FFFF.  So WIDECHAR and arrays thereof amount to UTF-32.  For internal use,
> there are many places where a fixed-length encoding is essential, including
> a number of places in the scanner.
>
> There are a number of encoders, decoders, and Rd/Wr counterparts in
> m3-libs/libunicode, for reading and writing streams.  They include some
> extra ones for original Critical Mass WIDECHAR, etc.
>
> None of this is currently used in the compiler.  It is entirely ISO Latin 1,
> as you have seen.  I have often thought of changing this, but got stopped by
> questions about where to stop.  Do we accept additional letters in identifiers?
> How far to go, how many languages?  Maybe only those in Latin-1? What about the compiler's internal files like .M3EXPORTS, etc. Debug info?
>
> A pretty conservative compromise would be to accept additional code points
> only in comments and literals.  I think this already happens within Latin1.
>
>> 'é' (* is a syntax error *)
>> "é" (* two bytes, 195 and 169 *)
>> W'é' (* is again a syntax error *)
>> W"é" (* is again two bytes, 195 and 169 *)
>
> This surprised me at first, because this is a Latin-1 character. But if the
> source code were in UTF-8, it would occupy two bytes.  We would need a compiler
> option to read source in either Latin-1 or UTF-8, and users would have to
> get this in sync with what their editor was writing.
>
> There are already escape sequences in character and text, wide and narrow, that
> allow other code points to be specified, if you know their numeric codes.
>
> Note that TEXT values are internally a potential dynamic mix of fragments in either
> 8-bit or 32-bit representation, depending on what you built the TEXT value from.
> Unless you go inside, to the TEXT subtypes in m3-libs/m3core/src/text/Text*,
> this is all abstracted and hidden behind the Text interface and language
> builtins on TEXT.  This came from Critical Mass and I just expanded it to
> 32-bit characters.
>
> Pickles also have support for Unicode WIDECHAR.  The procedures in Rd and Wr
> are as Critical Mass made them, i.e., although the program variables are
> WIDECHAR, the values read/written in the stream are always exactly two bytes.
> But UniRd and UniWr handle streams in any of a variety of representations.
>
> Of course, you can always stuff any values into a CHAR or WIDECHAR and do
> your own variable-length encoding/decoding.
>
>> It seems pretty clear that the scanner simply has no knowledge of character encodings
>>
>> https://github.com/modula3/cm3/blob/master/m3-sys/m3front/src/misc/Scanner.m3#L666
>>
>> such that if the source text is UTF-8, everything is going to be UTF-8. And I can't find anywhere that Modula-3 prescribes the input character set for source files.
>>
>
> Latin-1.  See the last sentence in 2.8.12.  Perhaps it needs to be more explicit.

Thinking about this a bit more, in light of Unicode's distinction between
code points as an abstraction and multiple alternative representations,
it makes sense to me to interpret the language's statement to mean the
source character set consists of the code points that are in Latin-1,
leaving the representation up to implementations.

>> Historically this has probably not been an issue for Windows users, with the default CP-1252 being identical to ISO-8859-1 for printable characters, but as of Windows 10 the default encoding is changing to UTF-8, so this is going to become an issue for everybody.
>>
>> You are receiving this because you are subscribed to this thread.
>> Reply to this email directly, view it on GitHub, or unsubscribe.
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hendrik at topoi.pooq.com Thu Nov 11 02:19:34 2021
From: hendrik at topoi.pooq.com (Hendrik Boom)
Date: Wed, 10 Nov 2021 20:19:34 -0500
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
Message-ID: <20211111011934.GB11073@topoi.pooq.com>

On Wed, Nov 10, 2021 at 06:19:30PM -0600, Rodney Bates wrote:
>
>
> On 11/10/21 2:54 PM, Rodney Bates wrote:
> >
> >
> > On 11/10/21 12:03 PM, Eric Sessoms wrote:
> > >
> > > reference: https://modula3.github.io/cm3/reference/complete/html/2_6_5Text_character.html
> > >
> > > As I understand it, |CHAR| is 8-bit Latin-1, |TEXT| is a sequence of |CHAR|. |WIDECHAR| though not specified, must be UCS-2 because at the time Unicode was 16-bits, UTF-16 didn't exist, and UCS-2 didn't need a name to distinguish it because that's all there was.
> > >
> > > However I'm finding that in practice |CHAR|, |TEXT|, and |WIDECHAR| are all UTF-8.
> > >
> > CHAR is 8-bit ISO latin-1, per Modula3.  WIDECHAR got added by Critical Mass
> > in their day, and was 16-bit, no doubt UCS-2 in current language. The language definition only says it has 65536 values and these are the first of the Unicode
> > code points.
> >
> > I changed it to be full Unicode a few years back, but never altered the language
> > reference.  It occupies 32 bits but is like a subrange going only up to
> > 16_10FFFF.  So WIDECHAR and arrays thereof amount to UTF-32.  For internal use,
> > there are many places where a fixed-length encoding is essential, including
> > a number of places in the scanner.
> >
> > There are a number of encoders, decoders, and Rd/Wr counterparts in
> > m3-libs/libunicode, for reading and writing streams.  They include some
> > extra ones for original Critical Mass WIDECHAR, etc.
> >
> > None of this is currently used in the compiler.  It is entirely ISO Latin 1,
> > as you have seen.  I have often thought of changing this, but got stopped by
> > questions about where to stop.  Do we accept additional letters in identifiers?
> > How far to go, how many languages?  Maybe only those in Latin-1? What about the compiler's internal files like .M3EXPORTS, etc. Debug info?
> >
> > A pretty conservative compromise would be to accept additional code points
> > only in comments and literals.  I think this already happens within Latin1.
> >
> > > 'é' (* is a syntax error *)
> > > "é" (* two bytes, 195 and 169 *)
> > > W'é' (* is again a syntax error *)
> > > W"é" (* is again two bytes, 195 and 169 *)
> >
> > This surprised me at first, because this is a Latin-1 character. But if the
> > source code were in UTF-8, it would occupy two bytes.  We would need a compiler
> > option to read source in either Latin-1 or UTF-8, and users would have to
> > get this in sync with what their editor was writing.
> >
> > There are already escape sequences in character and text, wide and narrow, that
> > allow other code points to be specified, if you know their numeric codes.
> >
> > Note that TEXT values are internally a potential dynamic mix of fragments in either
> > 8-bit or 32-bit representation, depending on what you built the TEXT value from.
> > Unless you go inside, to the TEXT subtypes in m3-libs/m3core/src/text/Text*,
> > this is all abstracted and hidden behind the Text interface and language
> > builtins on TEXT.  This came from Critical Mass and I just expanded it to
> > 32-bit characters.
> >
> > Pickles also have support for Unicode WIDECHAR.  The procedures in Rd and Wr
> > are as Critical Mass made them, i.e., although the program variables are
> > WIDECHAR, the values read/written in the stream are always exactly two bytes.
> > But UniRd and UniWr handle streams in any of a variety of representations.
> >
> > Of course, you can always stuff any values into a CHAR or WIDECHAR and do
> > your own variable-length encoding/decoding.
> >
> > > It seems pretty clear that the scanner simply has no knowledge of character encodings
> > >
> > > https://github.com/modula3/cm3/blob/master/m3-sys/m3front/src/misc/Scanner.m3#L666
> > >
> > > such that if the source text is UTF-8, everything is going to be UTF-8. And I can't find anywhere that Modula-3 prescribes the input character set for source files.
> > >
> >
> > Latin-1.  See the last sentence in 2.8.12.  Perhaps it needs to be more explicit.
>
> Thinking about this a bit more, in light of Unicode's distinction between
> code points as an abstraction and multiple alternative representations,
> it makes sense to me to interpret the language's statement to mean the
> source character set consists of the code points that are in Latin-1,
> leaving the representation up to implementations.

A source-code vulnerability has recently been discovered that probably
exists in all programming languages accepting the full Unicode.

It consists of using the direction-alternating code points -- the ones
that are used to embed pieces of right-to-left text in a left-to-right
context.

These code points can be used in a sneaky way so that what's seen in an
editor is a different sequence of characters than are in the file.

Look at the source code all you want, and it looks clean.

But the compiler sees the text in a different order, with possibly a
completely different meaning.

If we are redefining the character set used in programs, we'll have to
take measures against this kind of attack.

-- hendrik

>
> > > Historically this has probably not been an issue for Windows users, with the default CP-1252 being identical to ISO-8859-1 for printable characters, but as of Windows 10 the default encoding is changing to UTF-8, so this is going to become an issue for everybody.
> > >
> > > You are receiving this because you are subscribed to this thread.
> > > Reply to this email directly, view it on GitHub, or unsubscribe.
> > >
> >
>
> _______________________________________________
> M3devel mailing list
> M3devel at elegosoft.com
> https://m3lists.elegosoft.com/mailman/listinfo/m3devel

From vvm at tut.by Thu Nov 11 07:59:35 2021
From: vvm at tut.by (vvm at tut.by)
Date: Thu, 11 Nov 2021 09:59:35 +0300
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <20211111011934.GB11073@topoi.pooq.com>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
 <20211111011934.GB11073@topoi.pooq.com>
Message-ID: <1814561636613238@mail.yandex.by>

An HTML attachment was scrubbed...
URL:

From coder5506 at pobox.com Thu Nov 11 14:20:03 2021
From: coder5506 at pobox.com (Eric Sessoms)
Date: Thu, 11 Nov 2021 08:20:03 -0500
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <1814561636613238@mail.yandex.by>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
 <20211111011934.GB11073@topoi.pooq.com>
 <1814561636613238@mail.yandex.by>
Message-ID: <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>

Hi list, and thanks for adding me.

WRT my recent bug report:

> Thinking about this a bit more, in light of Unicode's distinction between
> code points as an abstraction and multiple alternative representations,
> it makes sense to me to interpret the language's statement to mean the
> source character set consists of the code points that are in Latin-1,
> leaving the representation up to implementations.

That's really all I'm going for. What we have now is a situation where the scanner was written with two flaws: (1) it implicitly assumes that source files are encoded in Latin-1, which hasn't been true on most platforms for 20 years, and (2) no thought whatsoever was given to WIDECHAR.
Which means we can do this:

VAR
  a := W"é";
  b: ARRAY [0..1] OF WIDECHAR;
BEGIN
  Text.SetWideChars(b, a);
  IO.PutInt(ORD(b[0])); IO.PutChar('\n');
  IO.PutInt(ORD(b[1])); IO.PutChar('\n');
END

And get as output:

195
169

Which of course makes no sense in any world.

It is unreasonable in CURRENT_YEAR to try to force all source files to be encoded in Latin-1, because the configuration for this varies by editor, and all it takes is for someone to open a file in the "wrong" editor and *poof* now it's in another encoding.

I'm just proposing to make the scanner smart enough to notice (above example) that "oh, this source file is in UTF-8, that sequence of bytes 195, 169 is an e-acute, which means this text literal contains one character with the Unicode value 233" so that we get the same result as if the file had been in Latin-1, as the language was originally designed.

From vvm at tut.by Thu Nov 11 16:44:04 2021
From: vvm at tut.by (vvm at tut.by)
Date: Thu, 11 Nov 2021 18:44:04 +0300
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
 <20211111011934.GB11073@topoi.pooq.com>
 <1814561636613238@mail.yandex.by>
 <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>
Message-ID: <3321636644223@mail.yandex.by>

An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 66325 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 57300 bytes
Desc: not available
URL:

From rodney.bates at gmx.com Thu Nov 11 17:00:19 2021
From: rodney.bates at gmx.com (Rodney Bates)
Date: Thu, 11 Nov 2021 10:00:19 -0600
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
 <20211111011934.GB11073@topoi.pooq.com>
 <1814561636613238@mail.yandex.by>
 <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>
Message-ID: <99772528-aeae-b50d-7d03-5b360c851caa@gmx.com>

On 11/11/21 7:20 AM, Eric Sessoms wrote:
> Hi list, and thanks for adding me.
Sorry, Eric, I didn't realize you were not on m3devel.
>
> WRT my recent bug report:
>
>> Thinking about this a bit more, in light of Unicode's distinction between
>> code points as an abstraction and multiple alternative representations,
>> it makes sense to me to interpret the language's statement to mean the
>> source character set consists of the code points that are in Latin-1,
>> leaving the representation up to implementations.
>
> That's really all I'm going for. What we have now is a situation where the scanner was written with two flaws: (1) it implicitly assumes that source files are encoded in Latin-1, which hasn't been true on most platforms for 20 years, and (2) no thought whatsoever was given to WIDECHAR. Which means we can do this:
>
> VAR
>   a := W"é";
>   b: ARRAY [0..1] OF WIDECHAR;
> BEGIN
>   Text.SetWideChars(b, a);
>   IO.PutInt(ORD(b[0])); IO.PutChar('\n');
>   IO.PutInt(ORD(b[1])); IO.PutChar('\n');
> END
>
> And get as output:
>
> 195
> 169
>
> Which of course makes no sense in any world.
>
> It is unreasonable in CURRENT_YEAR to try to force all source files to be encoded in Latin-1, because the configuration for this varies by editor, and all it takes is for someone to open a file in the "wrong" editor and *poof* now it's in another encoding.
>
> I'm just proposing to make the scanner smart enough to notice (above example) that "oh, this source file is in UTF-8, that sequence of bytes 195, 169 is an e-acute, which means this text literal contains one character with the Unicode value 233" so that we get the same result as if the file had been in Latin-1, as the language was originally designed.

But if the input really is in Latin-1, then the e-acute would be one byte,
value 233, which, if decoded via UTF-8 rules would be something else.
Without slogging thru' (I once had the decoding rules in my head, but it
has been a while), the UTF-8 decoder would consume the 233 and the double-quote
following as one code point, which also makes no sense.

I am all for automatic detection if it can be done so as to always be
correct, but I am skeptical.  Maybe the assumption that everything should
decode into [0..255] is enough.  Otherwise, we at least should have
compiler options to force it to read in one encoding or the other.

> _______________________________________________
> M3devel mailing list
> M3devel at elegosoft.com
> https://m3lists.elegosoft.com/mailman/listinfo/m3devel

From rodney.bates at gmx.com Thu Nov 11 18:17:06 2021
From: rodney.bates at gmx.com (Rodney Bates)
Date: Thu, 11 Nov 2021 11:17:06 -0600
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <99772528-aeae-b50d-7d03-5b360c851caa@gmx.com>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
 <20211111011934.GB11073@topoi.pooq.com>
 <1814561636613238@mail.yandex.by>
 <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>
 <99772528-aeae-b50d-7d03-5b360c851caa@gmx.com>
Message-ID:

On 11/11/21 10:00 AM, Rodney Bates wrote:
>
>
> On 11/11/21 7:20 AM, Eric Sessoms wrote:
>> Hi list, and thanks for adding me.
> Sorry, Eric, I didn't realize you were not on m3devel.
>
>>
>> WRT my recent bug report:
>>
>>> Thinking about this a bit more, in light of Unicode's distinction between
>>> code points as an abstraction and multiple alternative representations,
>>> it makes sense to me to interpret the language's statement to mean the
>>> source character set consists of the code points that are in Latin-1,
>>> leaving the representation up to implementations.
>>
>> That's really all I'm going for.  What we have now is a situation where the scanner was written with two flaws: (1) it implicitly assumes that source files are encoded in Latin-1, which hasn't been true on most platforms for 20 years, and (2) no thought whatsoever was given to WIDECHAR.  Which means we can do this:
>>
>> VAR
>>   a := W"é";
>>   b: ARRAY [0..1] OF WIDECHAR;
>> BEGIN
>>   Text.SetWideChars(b, a);
>>   IO.PutInt(ORD(b[0])); IO.PutChar('\n');
>>   IO.PutInt(ORD(b[1])); IO.PutChar('\n');
>> END
>>
>> And get as output:
>>
>> 195
>> 169
>>
>> Which of course makes no sense in any world.
>>
>> It is unreasonable in CURRENT_YEAR to try to force all source files to be encoded in Latin-1, because the configuration for this varies by editor, and all it takes is for someone to open a file in the "wrong" editor and *poof* now it's in another encoding.
>>
>> I'm just proposing to make the scanner smart enough to notice (above example) that "oh, this source file is in UTF-8, that sequence of bytes 195, 169 is an e-acute, which means this text literal contains one character with the Unicode value 233" so that we get the same result as if the file had been in Latin-1, as the language was originally designed.
> But if the input really is in Latin-1, then the e-acute would be one byte,
> value 233, which, if decoded via UTF-8 rules would be something else.
> Without slogging thru' (I once had the decoding rules in my head, but it
> has been a while), the UTF-8 decoder would consume the 233 and the double-quote
> following as one code point, which also makes no sense.
>
> I am all for automatic detection if it can be done so as to always be
> correct, but I am skeptical.  Maybe the assumption that everything should
> decode into [0..255] is enough.  Otherwise, we at least should have
> compiler options to force it to read in one encoding or the other.

Even a BOM (U+FEFF) only distinguishes among the 5 Unicode-defined encodings.
The plain Latin-1, identity encoding, so to speak, is not one of them.
In Latin-1, this sequence is two characters, thorn (16_FE), then y-umlaut (16_FF).

Well, OK, these characters are pretty unlikely to occur in M3 source code,
and if it's lexically correct, never as the first two.  Still, a compiler
should really handle this as Latin-1, especially the rest of the file.

>> _______________________________________________
>> M3devel mailing list
>> M3devel at elegosoft.com
>> https://m3lists.elegosoft.com/mailman/listinfo/m3devel
>

From coder5506 at pobox.com Thu Nov 11 23:56:11 2021
From: coder5506 at pobox.com (Eric Sessoms)
Date: Thu, 11 Nov 2021 17:56:11 -0500
Subject: [M3devel] [modula3/cm3] character and text literals are utf-8 (Issue #774)
In-Reply-To: <99772528-aeae-b50d-7d03-5b360c851caa@gmx.com>
References: <4a3ef25b-1333-bbf3-e545-8da74f5c634e@gmx.com>
 <20211111011934.GB11073@topoi.pooq.com>
 <1814561636613238@mail.yandex.by>
 <88a6b6bc-7064-4f54-bec8-f0095f11c598@www.fastmail.com>
 <99772528-aeae-b50d-7d03-5b360c851caa@gmx.com>
Message-ID: <198b056f-4621-44f6-a089-bc996b44b0be@www.fastmail.com>

On Thu, Nov 11, 2021, at 11:00 AM, Rodney Bates wrote:
> But if the input really is in Latin-1, then the e-acute would be one byte,
> value 233, which, if decoded via UTF-8 rules would be something else.
> Without slogging thru' (I once had the decoding rules in my head, but it
> has been a while), the UTF-8 decoder would consume the 233 and the double-quote
> following as one code point, which also makes no sense.

I agree that's a concern, but the situation is not so dire. For a valid encoding, the continuation bytes have to be in the range 128-191, so it can't eat the quote or run the parser off the rails.
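
The point about continuation bytes can be checked with a minimal sketch. This is Python, not Modula-3, and nothing here comes from the CM3 scanner: Python's strict UTF-8 decoder stands in for the decoding rules, and `try_utf8` is a hypothetical helper name used only for illustration.

```python
def try_utf8(data: bytes):
    """Return the decoded code points, or None if data is not valid UTF-8."""
    try:
        return [ord(ch) for ch in data.decode("utf-8")]
    except UnicodeDecodeError:
        return None


QUOTE = 0x22  # the double-quote character, ASCII 34

# UTF-8 source: the literal "é" is the bytes 195, 169 between two quotes.
utf8_literal = bytes([QUOTE, 195, 169, QUOTE])

# Latin-1 source: the same literal is the single byte 233 between two quotes.
latin1_literal = bytes([QUOTE, 233, QUOTE])

print(try_utf8(utf8_literal))    # [34, 233, 34]
print(try_utf8(latin1_literal))  # None
```

In the second case byte 233 announces a multi-byte sequence, but the following quote (34) is outside 128-191, so decoding fails instead of swallowing the quote.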
The presence of the quote would break the encoding and demonstrate that the input is not UTF-8. But it is possible to read the interior of the string incorrectly, you're right.

We could tighten it up. I had suggested truncating the characters within literals, but by not truncating we can detect a larger class of errors. I.e., in an 8-bit CHAR literal or text, any encoding beginning with "233" is de facto out of range (will be a 16-bit value, decoded), so it must be Latin-1. Most encodings will be out of range even for 16-bit WIDECHAR. Some possible encodings (apparently valid sequences of UTF-8 bytes) yield code points above the maximum 16_10FFFF, so are invalid even for 32-bit WIDECHAR.

We can also reject surrogate pairs. For those not in on the lingo, it is sometimes the case that a 21-bit character is encoded as a sequence of two 16-bit characters, and in some strange parts of the universe (i.e., JSON) those two 16-bit characters are then encoded separately in UTF-8. Despite the fact that it is done, there's no need for it, so we can safely kick it out.

> I am all for automatic detection if it can be done so as to always be
> correct, but I am skeptical.

You're right to be skeptical. I'm not going to go dig for sources, but it is known to be impossible in the general case.

OK, a revised proposal for safer character set detection:

- Lacking a BOM or other indicator of encoding, assume UTF-8 for a first attempt.
- The scanner is buffered, so on loading a buffer review it for anything that is not valid UTF-8. (I'll have to make sure characters don't get broken at buffer boundaries, but that's an implementation problem.)
- If there's anything that's not valid UTF-8, the input must be ISO-8859-1.
- While scanning outside of a character or text literal, anything >126 is going to be a parse error anyway, but we'll call it ISO-8859-1 in an effort to keep the error messages useful.
- Inside a character or text literal, anything that decodes to outside the range of the current character type is invalid, so again we revert to ISO-8859-1. We have to buffer the content of a text literal anyway, so even if the invalid byte that reveals we're not really UTF-8 occurs at the very end of the string, no damage is done.
- And of course, once any check determines "not UTF-8", we continue in the "not UTF-8" state for the rest of that source file.

In theory this won't catch everything, but in practice I think it will catch most things. The e-acute example, with these rules, will work correctly for CHAR or WIDECHAR, in either Latin-1 or UTF-8 source.

Now that I think of it, the worst thing is going to be getting accurate source positions (line and column) for error messages. But I guess I've already signed on.

From dabenavidesd at yahoo.es  Thu Nov 25 04:23:37 2021
From: dabenavidesd at yahoo.es (Daniel Alejandro Benavides D.)
Date: Thu, 25 Nov 2021 03:23:37 +0000 (UTC)
Subject: [M3devel] Modula-3 bytecode vs Java Bytecode
References: <1143760345.7317683.1637810617881.ref@mail.yahoo.com>
Message-ID: <1143760345.7317683.1637810617881@mail.yahoo.com>

Hi all:

"The languages and browsers discussed so far are similar to Java in that the applet language matches the language its browser is written in. This ties the two closely together..." [1]

"There are also a number of proposals for implementing applets that look at the underlying machine representation instead of focussing on a particular language. For example, the Modula-3 byte code used in the Visual Obliq browser [6] presents a possibly more efficient and mature architecture to Java's byte code..." [1]

I would say I would run my programs on my VAX9000 mainframe, but implemented on ALPHA it would be much speedier. What about Modula-3 in the JVM or M3 byte code? What about you? What about both?

Thanks in advance

[1] D. Hackborn, "Interactive HTML", p. 370.
(https://core.ac.uk/download/pdf/10194226.pdf)