Flutterby™! : I've been playing around with parsing


I've been playing around with parsing

2024-05-03 17:55:02.699105+02 by Dan Lyke 2 comments

I've been playing around with parsing recently, and I've been trying to figure out how to deal with UTF-8 data. The various C/C++ "is...()" tests all seem locale-specific, which feels completely wrong when the question is "does this character start an identifier?" Maybe the answer is (!(isspace(x) || ispunct(x) || isdigit(x)))?

I guess I can assume anything with the high bit set qualifies.
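One way to sidestep the locale question is to skip the "is...()" functions and compare bytes directly. The sketch below is only an illustration: the name is_ident_start and the decision to accept '_' are assumptions, not something from the post. Note that it differs from the expression above, since ispunct('_') is true and the expression would therefore reject underscores. Taking an unsigned char also avoids the undefined behavior of handing a negative char to the <ctype.h> functions.

```c
#include <stdbool.h>

/* Locale-independent check: can byte c begin an identifier?
 * Assumptions for this sketch: ASCII letters and '_' qualify, and so does
 * any byte with the high bit set (part of a UTF-8 multi-byte sequence). */
static bool is_ident_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           c == '_' ||
           c >= 0x80;
}
```

Because it never consults the locale, the result is the same no matter what $LANG says.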

Comments in ascending chronological order:

#Comment Re: I've been playing around with parsing made: 2024-05-04 00:14:52.772272+02 by: spc476

The only way I've found (at least under Linux) to get the locale stuff working properly is to ensure that $LANG is set correctly (e.g. "LANG=en_US.UTF-8") and to call setlocale(LC_ALL,"") early in the program so that it takes effect. Then you can use the various wide character I/O routines.
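A rough sketch of that approach, using only standard C calls (setlocale, mbrtowc, iswalpha). The helper names init_locale and first_char_is_alpha are made up for this example, and what counts as alphabetic beyond ASCII depends entirely on the locale that actually gets selected.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Adopt the locale named by the environment ($LANG, $LC_ALL, ...).
 * Returns false if the request could not be honored; the program then
 * stays in the default "C" locale. */
static bool init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Decode the first multibyte character of s using the current locale's
 * encoding and report whether that locale considers it alphabetic.
 * Returns false for an empty string or an invalid/incomplete sequence. */
static bool first_char_is_alpha(const char *s)
{
    mbstate_t state;
    wchar_t wc;
    memset(&state, 0, sizeof state);
    size_t n = mbrtowc(&wc, s, strlen(s), &state);
    if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
        return false;
    return iswalpha((wint_t)wc) != 0;
}
```

Call init_locale() once near the top of main(), before any classification. Without it, the program runs in the "C" locale, where non-ASCII input will generally fail to decode or classify as alphabetic.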

Another thing to consider before just checking the high bit is the C1 control set at codepoints 128-159, which appear (byte-wise) in UTF-8 as 194,128 to 194,159.
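If the C1 controls need to be singled out, the check is a two-byte comparison, since UTF-8 encodes U+0080 through U+009F as the lead byte 0xC2 (194) followed by 0x80 through 0x9F (128 to 159). A minimal sketch, with a function name invented for this example:

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the UTF-8 bytes at p (len bytes available) begin with a C1
 * control character, U+0080..U+009F: lead byte 0xC2 followed by a
 * continuation byte in the range 0x80..0x9F. */
static bool starts_with_c1_control(const unsigned char *p, size_t len)
{
    return len >= 2 && p[0] == 0xC2 && p[1] >= 0x80 && p[1] <= 0x9F;
}
```

The same lead byte with a second byte of 0xA0 is U+00A0 (no-break space), which is not a C1 control, so the upper bound matters.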

Personally, I'd like to stick to ASCII only, but I admit that's just sticking my head in the sand.

#Comment Re: I've been playing around with parsing made: 2024-05-04 00:32:55.520294+02 by: Dan Lyke

The answer I'm coming to, after much conversation on the Fediverse (including being pointed to the Unicode Consortium's discussion of identifier syntax, the Unicode Derived Core Properties (UCD), and the Go source code, plus discussion about character composition and equivalence), is:

Treat every byte with the high bit set as an okay identifier character, treat the whole darned thing as a stream of bytes, and only worry about ASCII whitespace, operators, and numbers.

If, in the future, I decide to treat particular non-ASCII UTF-8 characters as something meaningful, handle those individually.

Otherwise anything with the high bit set matches \w and that's fine.
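As a sketch of what that conclusion might look like in a byte-oriented scanner: the function names below are hypothetical, and the rule that a digit may continue but not start an identifier is an assumption borrowed from common language conventions rather than anything stated above.

```c
#include <stddef.h>

/* Roughly a byte-wise \w: ASCII letters, digits, '_', plus any byte with
 * the high bit set. No UTF-8 validation or decoding is attempted. */
static int is_ident_byte(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '_' ||
           c >= 0x80;
}

/* Length in bytes of the identifier at the start of the NUL-terminated
 * string s, or 0 if s does not begin with one. */
static size_t ident_length(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    if (*p >= '0' && *p <= '9')
        return 0;               /* assumed: digits cannot start an identifier */
    while (is_ident_byte(p[n])) /* stops at NUL, whitespace, operators */
        n++;
    return n;
}
```

For the input "café = 1" encoded as UTF-8, this returns 5: three ASCII bytes plus the two bytes of "é". The scanner never needs to know where one character ends and the next begins.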



Flutterby™ is a trademark claimed by Dan Lyke for the web publications at www.flutterby.com and www.flutterby.net.