comp.unix.shell Using and programming the Unix shell.

Re: Gawk match() and numbers in scientific notation

07/05/2008, 19:46   #1
Janis Papanagnou

pk wrote:
> On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>
>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>>
>>>Is there a way to explicitly print out that information (or, better, the
>>>entire collating sequence in use)? I've been looking for a method to do
>>>that for a long time, but I have found no complete answer.
>>
>>I expect you could use the ord() and chr() functions described here:
>>
>>http://www.gnu.org/software/gawk/man...inal-Functions
>>
>>[snip]
>
>[snip]
>
> It seems that the functions you point out use the mere numeric character
> values and don't take locale into account.


Yes, indeed. And the quoted passage in the GNU manual says about ord()...
"the numeric value for that character in the machine's character set".
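
To make the quoted sentence concrete, here is a minimal sketch in Python (not gawk, purely for illustration). It assumes Python's bundled `cp037` codec as a stand-in for one EBCDIC variant, and the helper name `code_position` is made up for this example. It only shows that the number reported for a character is a property of the character set, with no locale involved.

```python
def code_position(char, encoding):
    """Return the numeric code of a single character in the given character set."""
    encoded = char.encode(encoding)
    if len(encoded) != 1:
        raise ValueError("not a single-byte character in " + encoding)
    return encoded[0]

for ch in "aAz":
    # 'ascii' and 'cp037' (one EBCDIC code page) are codecs shipped with Python.
    print(ch, code_position(ch, "ascii"), code_position(ch, "cp037"))
```

The same letter gets a different number in each character set, and nothing in the lookup consults the locale.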

While this matches the interpretation I am comfortable with (both in this
case and for ranges of characters in regexps), the quoted link about
"Character-Lists" explicitly mentions a locale dependency.

man regex(7) doesn't shed much light on the topic either; in one sentence
it talks about a _character set_ ("`[0-9]' in ASCII matches any decimal
digit"), and shortly after that it talks about _collating sequences_
("Ranges are very collating-sequence-dependent").

I'm still puzzled about that.
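
As for printing the collating order, one rough approach is to sort a set of characters with the locale's own comparison and look at the result. The Python sketch below does that with `locale.strxfrm`; the function name `collating_order` and the locale name "en_US.UTF-8" are assumptions of this example, and the latter may not be installed on a given system. It is only an approximation: it orders single characters, says nothing about multi-character collating elements or equivalence classes, and a regexp engine is not guaranteed to use the same ordering for its ranges.

```python
import locale

def collating_order(chars, locale_name):
    """Sort chars by the collation rules of locale_name (which must be installed)."""
    saved = locale.setlocale(locale.LC_COLLATE)       # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(chars, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)    # restore it

letters = "abcdABCD"
print("C:", "".join(collating_order(letters, "C")))

try:
    # "en_US.UTF-8" is an assumption; substitute any locale listed by `locale -a`.
    print("en_US.UTF-8:", "".join(collating_order(letters, "en_US.UTF-8")))
except locale.Error:
    print("en_US.UTF-8 is not installed here")
```

In the "C" locale the result follows the code positions; in other locales upper and lower case letters are typically interleaved, which is the effect the gawk manual warns about.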

Janis

[snip]
08/05/2008, 05:25   #2
Ed Morton



On 5/7/2008 1:46 PM, Janis Papanagnou wrote:
> [snip]
>
> I'm still puzzled about that.
>


Here's my perhaps simplistic view, which gets me through the day:

A character set is the set of all possible characters on a given machine and is
fixed (e.g. ASCII or EBCDIC).

A collating sequence is the order of those characters in a specified locale.

A character list is a specific list of characters. In the context of REs, the
order of characters in the list is inconsequential, so the list behaves more
like a set, hence it is also, confusingly, sometimes referred to as a character set.

A character class is a specific list of characters grouped according to some
common characteristic (e.g. "upper case" or "alphabetic" or ...).

A character list can be derived by specifying a start and end character and a
range operator, and the resulting list depends on the collating sequence of the
characters in the current character set in the current locale.
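
A small Python sketch of that last definition, under stated assumptions: `expand_range` is a made-up helper, and `dictionary_order` is a hypothetical collating sequence (AaBb...Zz) chosen for illustration, not one read from any real locale. It expands the same range against two different orderings.

```python
import string

def expand_range(start, end, sequence):
    """Return the characters from start to end, inclusive, in the order given by sequence."""
    lo, hi = sequence.index(start), sequence.index(end)
    if lo > hi:
        raise ValueError("range start sorts after range end")
    return sequence[lo:hi + 1]

# Code-position order as in ASCII: all upper case letters, then all lower case.
codepoint_order = string.ascii_uppercase + string.ascii_lowercase

# Hypothetical dictionary-style collating sequence: AaBbCc...Zz
dictionary_order = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)

print(expand_range("a", "d", codepoint_order))    # abcd
print(expand_range("a", "d", dictionary_order))   # aBbCcDd
```

With the interleaved ordering, [a-d] picks up B, C and D but not A, which is the kind of surprise the manual's `[a-dx-z]' example describes.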

A character list can also be derived by specifying a character class and the
resulting list depends on the characters designated to be part of that class in
the current character set, independent of collating sequences or locale.

Hope that helps.

Ed.

08/05/2008, 10:00   #3
Janis

On 8 May, 06:25, Ed Morton <mor...@lsupcaemnt.com> wrote:
> On 5/7/2008 1:46 PM, Janis Papanagnou wrote:
> > [snip]
>
> > I'm still puzzled about that.

>
> Here's my perhaps simplistic view, which gets me through the day:


Thanks for assembling this set of definitions.

>
> A character set is the set of all possible characters on a given machine and is
> fixed (e.g. ASCII or EBCDIC).


...and characters have a fixed code (code position) in the set, yes.

>
> A collating sequence is the order of those characters in a specified locale.


A collating sequence is the order of characters in a specified locale.
Yes.

>
> A character list is a specific list of characters. In the context of REs, the
> order of characters in the list is inconsequential, so the list behaves more
> like a set, hence it is also, confusingly, sometimes referred to as a character set.


By "character list" you mean, for example, [0-9ABCDEF], I suppose?
(No, I am not confused by the term "set".)
I'd prefer calling that a [regexp] "set of characters" (with or
without ranges).

>
> A character class is a specific list of characters grouped according to some
> common characteristic (e.g. "upper case" or "alphabetic" or ...).


Well, it's a set of characters rather than a list; per se unordered.
And it doesn't depend on the character set ("code page"), given the
same locale settings.
[[:upper:]] defines the same characters in ASCII or EBCDIC, but the
locale settings influence the contents of the set, because it may
contain Ä, Ö and Ü in one locale but not in another.

OTOH, [A-Z] seems to me to always depend on the "code page" in use;
you *have* to expect different results in ASCII and EBCDIC.
And since whether a character is in the set or not depends on a
low-level _character code_ (a number), it makes no sense (to me)
to apply presentation-level locale interpretations and assume a
locale ordering in addition to the "code page" ordering.
Those are two different concepts, on different abstraction levels;
_that_ is "my problem" with the interpretation of ranges depending
on locales.

>
> A character list can be derived by specifying a start and end character and a
> range operator, and the resulting list depends on the collating sequence of the
> characters in the current character set in the current locale.


(See above.)

Start and end assume an ordering; the ordering is defined by the
character set.
How can a locale setting give a coherent definition when there may be
gaps in the codes?

Given the EBCDIC code positions
a(129), ..., g(135), h(136), i(137), j(145), k(146), l(147), ...
what does [a-j] mean (mind the code gap between i and j)?
It used to be defined as all characters coded in the range 129 to 145,
including the "characters" with codes 138-144.
Now how would an 'ä' be represented? (Well, not a perfect example, I
guess, because EBCDIC has no 'ä'.)
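
The code gap can be checked with a short Python sketch. It assumes Python's `cp037` codec as the EBCDIC variant (other EBCDIC code pages assign the gap positions differently), and `chars_in_code_range` is a made-up helper that reads a range literally as "every code position from start to end".

```python
def chars_in_code_range(start, end, encoding):
    """List (code, character) pairs for every code position from start to end, inclusive."""
    lo = start.encode(encoding)[0]
    hi = end.encode(encoding)[0]
    return [(code, bytes([code]).decode(encoding)) for code in range(lo, hi + 1)]

for name in ("ascii", "cp037"):
    pairs = chars_in_code_range("a", "j", name)
    letters = [ch for _, ch in pairs if "a" <= ch <= "j"]
    print(name, len(pairs), "positions,", len(letters), "letters")
```

Under ASCII the interval holds exactly the ten letters; under cp037 it spans seventeen positions, so seven non-letters ride along if the range is taken as a plain code interval.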

But I hope my point has become clear: ranges and locales don't match
well. Ranges seem to be low-level abbreviations for code intervals,
while locales define sets of characters, best applied with character
classes.

>
> A character list can also be derived by specifying a character class and the
> resulting list depends on the characters designated to be part of that class in
> the current character set, independent of collating sequences or locale.


Not sure about your wording.
The set of characters defined by the character class [[:upper:]]
depends on the locale.

>
> Hope that helps.


Well, not really. I've tried to describe the problem I see above.
The consequence is to prefer the use of character classes instead
of regexp ranges; we already agreed on that.

Janis

>
> Ed
