comp.unix.shell: Using and programming the Unix shell.
#1
pk wrote:
> On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>
>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>>
>>>Is there a way to explicitly print out that information (or, better, the
>>>entire collating sequence in use)? I've been looking for a method to do
>>>that for long time, but I have found no complete answer.
>>
>>I expect you could use the ord() and chr() functions described here:
>>
>>http://www.gnu.org/software/gawk/man...inal-Functions
>>
>>[snip]
>
>[snip]
>
> It seems that the function you point out use the mere numeric character
> values and don't take locale into account.

Yes, indeed. And the quoted passage in the GNU manual says about ord()...
"the numeric value for that character in the machine's character set".

While this matches the interpretation I am comfortable with - in this
case and also with the ranges of characters in regexps, the quoted
link about "Character-Lists" explicitly mentions a locale dependency.

man regex(7) also doesn't enlighten the topic; in one sentence they
talk about a _character set_ ("`[0-9]' in ASCII matches any decimal
digit"), and shortly after that they talk about _collating sequences_
("Ranges are very collating-sequence-dependent").

I'm still puzzled about that.

Janis

[snip]
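A minimal sketch of the difference discussed in this post, written in Python rather than gawk: sorting by numeric code (what ord() reports) versus sorting by the collation of a named locale. It assumes Python's standard `locale` module; the helper names `collation_order` and `code_order` are made up for illustration. It only orders single characters and therefore does not show multi-character collating elements or characters that collate as equal, so it is not a complete answer to the question of dumping the whole collating sequence.

```python
import locale


def collation_order(chars, locale_name="C"):
    """Return the distinct characters of `chars` sorted by a locale's collation."""
    # Remember the current LC_COLLATE setting so it can be restored afterwards.
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        # Raises locale.Error if the named locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm() maps a string to a key that compares in collation order.
        return sorted(set(chars), key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


def code_order(chars):
    """Return the distinct characters sorted by numeric code, as ord() sees them."""
    return sorted(set(chars), key=ord)


sample = "aAbBcCdDxXyYzZ"
print("by code :", "".join(code_order(sample)))
print("C locale:", "".join(collation_order(sample, "C")))
```

In the C locale both lines print the same order, because collation there follows the numeric codes. Passing another locale name (for example `"en_US.UTF-8"`, if it happens to be installed) or `""` for the environment's locale may interleave upper and lower case instead.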
#2
On 5/7/2008 1:46 PM, Janis Papanagnou wrote:
> pk wrote:
>
>>On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>>
>>
>>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>>>
>>>>Is there a way to explicitly print out that information (or, better, the
>>>>entire collating sequence in use)? I've been looking for a method to do
>>>>that for long time, but I have found no complete answer.
>>>
>>>I expect you could use the ord() and chr() functions described here:
>>>
>>>http://www.gnu.org/software/gawk/man...inal-Functions
>>>
>>>[snip]
>>
>>[snip]
>>
>>It seems that the function you point out use the mere numeric character
>>values and don't take locale into account.
>
>
> Yes, indeed. And the quoted passage in the GNU manual says about ord()...
> "the numeric value for that character in the machine's character set".
>
> While this matches the interpretation I am comfortable with - in this
> case and also with the ranges of characters in regexps, the quoted
> link about "Character-Lists" explicitly mentions a locale dependency.
>
> man regex(7) also doesn't enlighten the topic; in one sentence they
> talk about a _character set_ ("`[0-9]' in ASCII matches any decimal
> digit"), and shortly after that they talk about _collating sequences_
> ("Ranges are very collating-sequence-dependent").
>
> I'm still puzzled about that.
>

Here's my perhaps simplistic view, which gets me through the day:

A character set is the set of all possible characters on a given machine and is
fixed (e.g. ASCII or EBCDIC).

A collating sequence is the order of those characters in a specified locale.

A character list is a specific list of characters. In the context of REs, the
order of characters in the list is inconsequential, so the list behaves more
like a set, hence it is also, confusingly, sometimes referred to as a character set.

A character class is a specific list of characters grouped according to some
common characteristic (e.g. "upper case" or "alphabetic" or ...).

A character list can be derived by specifying a start and end character and a
range operator, and the resulting list depends on the collating sequence of the
characters in the current character set in the current locale.

A character list can also be derived by specifying a character class and the
resulting list depends on the characters designated to be part of that class in
the current character set, independent of collating sequences or locale.

Hope that helps.

	Ed.
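The range definition above can be sketched as follows: the same start and end characters yield different lists depending on the ordering they are taken from. This is an illustration in Python, not how any regexp engine is implemented. `CODE_ORDER` follows ASCII codes, while `DICT_ORDER` is a hypothetical dictionary-style collating sequence (AaBbCc...Zz) invented for the example; real locales define their own sequences.

```python
import string


def expand_range(start, end, ordering):
    """List the characters from `start` to `end` inclusive, as ordered in `ordering`."""
    lo, hi = ordering.index(start), ordering.index(end)
    return ordering[lo:hi + 1]


# Ordering by numeric code, as in ASCII: all upper case, then all lower case.
CODE_ORDER = "".join(sorted(string.ascii_letters))

# HYPOTHETICAL collating sequence for illustration only: AaBbCc...Zz
DICT_ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)

print(expand_range("a", "d", CODE_ORDER))  # prints abcd
print(expand_range("a", "d", DICT_ORDER))  # prints aBbCcDd
```

Under the code ordering `[a-d]` is just `[abcd]`; under the assumed dictionary-style ordering it picks up `B`, `C` and `D` as well, which is the effect described in the quoted manual passage.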
#3
On 8 May, 06:25, Ed Morton <mor...@lsupcaemnt.com> wrote:
> On 5/7/2008 1:46 PM, Janis Papanagnou wrote:
> > pk wrote:
> >>On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>
> >>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
> >>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>
> >>>>Is there a way to explicitly print out that information (or, better, the
> >>>>entire collating sequence in use)? I've been looking for a method to do
> >>>>that for long time, but I have found no complete answer.
>
> >>>I expect you could use the ord() and chr() functions described here:
>
> >>>http://www.gnu.org/software/gawk/man...inal-Functions
>
> >>It seems that the function you point out use the mere numeric character
> >>values and don't take locale into account.
>
> > Yes, indeed. And the quoted passage in the GNU manual says about ord()...
> > "the numeric value for that character in the machine's character set".
>
> > While this matches the interpretation I am comfortable with - in this
> > case and also with the ranges of characters in regexps, the quoted
> > link about "Character-Lists" explicitly mentions a locale dependency.
>
> > man regex(7) also doesn't enlighten the topic; in one sentence they
> > talk about a _character set_ ("`[0-9]' in ASCII matches any decimal
> > digit"), and shortly after that they talk about _collating sequences_
> > ("Ranges are very collating-sequence-dependent").
>
> > I'm still puzzled about that.
>
> Here's my perhaps simplistic view, which gets me through the day:

Thanks for assembling this set of definitions.

>
> A character set is the set of all possible characters on a given machine and is
> fixed (e.g. ASCII or EBCDIC).

...and characters have a fixed code (code position) in the set, yes.

>
> A collating sequence is the order of those characters in a specified locale.

A collating sequence is the order of characters in a specified locale.
Yes.

>
> A character list is a specific list of characters. In the context of REs, the
> order of characters in the list is inconsequential, so the list behaves more
> like a set, hence it is also, confusingly, sometimes referred to as a character set.

Here by "character list" you mean (for example) [0-9ABCDEF], I suppose?
(No, I am not confused by the term "set".) I'd prefer calling that a
[regexp] "set of characters" (with or without ranges).

>
> A character class is a specific list of characters grouped according to some
> common characteristic (e.g. "upper case" or "alphabetic" or ...).

Well, it's rather a set of characters than a list; per se unordered.
And it doesn't depend on the character set ("code page"), given the
same locale settings. [[:upper:]] defines the same characters in ASCII
or EBCDIC, but the locale settings influence the contents of the set
because it may contain Ä Ö and Ü in one locale but not in another one.

OTOH, [A-Z] seems to me to be always depending on the "code page" in
use; you *have* to expect different results in ASCII and EBCDIC. And
since it depends on a low-level _character code_ (a number) whether
a character is in the set or not, it makes no sense (to me) to apply
presentation-level locale interpretations and assume a locale ordering
in addition to the "code page" ordering. Those are two different concepts,
on different abstraction levels; _that_ is "my problem" with the
interpretation of ranges depending on locales.

>
> A character list can be derived by specifying a start and end character and a
> range operator, and the resulting list depends on the collating sequence of the
> characters in the current character set in the current locale.

(See above.) Start and end assume an ordering; the ordering is defined
by the character set. How can the locale setting yield a coherent
definition when there may be code gaps? Given the EBCDIC code positions
a(129), ..., g(135), h(136), i(137), j(145), k(146), l(147), ...
what does [a-j] mean (mind the code gap between i and j)? Once it
was defined as all characters coded within the range 129 to 145,
including the "characters" with codes 138-144. Now how would an 'ä' be
represented? (Well, not a perfect example I guess, because EBCDIC has no
'ä'.) But I hope my point became clear; ranges and locales don't match
well. Ranges seem to be low-level abbreviations for code intervals,
while locales define sets of characters, best applied with character
classes.

>
> A character list can also be derived by specifying a character class and the
> resulting list depends on the characters designated to be part of that class in
> the current character set, independent of collating sequences or locale.

Not sure about your wording. The set of characters defined by the
character class [[:upper:]] depends on the locale.

>
> Hope that helps.

Well, not really. I've tried to describe the problem I see above.
The consequence is to prefer the use of character classes instead
of regexp ranges; we already agreed on that.

Janis

>
> Ed
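The EBCDIC code gap mentioned above can be checked with a short Python sketch. It assumes `cp037`, one of several EBCDIC code pages and the one used here only because Python ships a codec for it; other EBCDIC variants may assign the non-letter positions differently. The helpers `code_of` and `interval` are illustrative names, and `interval` models a range as a plain interval of codes, which is the low-level reading described in the post.

```python
CODEC = "cp037"  # assumption: one common EBCDIC code page available in Python


def code_of(ch):
    """Numeric code position of a single character in the chosen code page."""
    return ch.encode(CODEC)[0]


def interval(start, end):
    """All characters whose codes lie between those of `start` and `end`, inclusive."""
    return bytes(range(code_of(start), code_of(end) + 1)).decode(CODEC)


# Code positions around the gap between i and j.
for ch in "aijl":
    print(ch, code_of(ch))

chars = interval("a", "j")
extras = [c for c in chars if c not in "abcdefghij"]
print(len(chars), "codes in the interval;", len(extras), "are not among a-j")
```

Read as a code interval, `[a-j]` covers 17 code positions in this code page, 7 of which (138 to 144) are not the letters a through j.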