comp.unix.shell Using and programming the Unix shell.

Re: Gawk match() and numbers in scientific notation

07/05/2008, 17:25   #1
Ed Morton

On 5/7/2008 11:04 AM, pk wrote:
> On Wednesday 7 May 2008 17:37, Ed Morton wrote:
>
>
>>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>>
>>>
>>>Is there a way to explicitly print out that information (or, better, the
>>>entire collating sequence in use)? I've been looking for a method to do
>>>that for a long time, but I have found no complete answer.
>>>

>>
>>I expect you could use the ord() and chr() functions described here:
>>
>>http://www.gnu.org/software/gawk/man...inal-Functions
>>
>>to do something like:
>>
>>for (i=ord("a");i<=ord("z");i++) {
>>    print chr(i)
>>}

>
>
> Take this scenario:
>
> $ cat file
> 100e3
> $ echo $LC_ALL
> en_GB
> $ awk '/[A-Z]/' file
> 100e3
> $ LC_ALL=C awk '/[A-Z]/' file
> $
>
> (or, perhaps more elegant,
> $ awk '/[[:upper:]]/' file
> $ )
>
> It seems that the functions you point out use the mere numeric character
> values and don't take locale into account. Using the proposed code for the
> ord() and chr() functions, a loop to print the sequence from "A" to "Z"
> always yields
>
> A
> B
> C
> ...
> Z
>
> under many different locales, even en_GB which, as seen above, clearly
> expands [A-Z] differently.


Good point. In that case, you could do something like this:

range="[a-z]"
for (i=low;i<=high;i++)
    if (chr(i) ~ range)
        print chr(i)

where "low" and "high" are set by the _ord_init() function in the above link. It
won't necessarily tell you the order in which the characters appear within the
given range, but that shouldn't matter.

Ed.

07/05/2008, 17:46   #2
pk

On Wednesday 7 May 2008 18:25, Ed Morton wrote:

> Good point. In that case, you could do something like this:
>
> range="[a-z]"
> for (i=low;i<=high;i++)
> if (chr(i) ~ range)
> print chr(i)
>
> where "low" and "high" are set by the _ord_init() function in the above
> link. It won't necessarily tell you the actual order each character
> appears in in the given range, but that shouldn't matter.


This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
are printed. But it seems that those functions are ascii-oriented anyway,
so something more general, which works with the actual charset in use (with
accented and special characters, etc), would be great.
And moreover, I'm looking for some method that works in the reverse way, ie,
given the expression, print the expansion.

However, many thanks for your help and suggestions!

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.
07/05/2008, 18:16   #3
Ed Morton



On 5/7/2008 11:46 AM, pk wrote:
> On Wednesday 7 May 2008 18:25, Ed Morton wrote:
>
>
>>Good point. In that case, you could do something like this:
>>
>>range="[a-z]"
>>for (i=low;i<=high;i++)
>>if (chr(i) ~ range)
>>print chr(i)
>>
>>where "low" and "high" are set by the _ord_init() function in the above
>>link. It won't necessarily tell you the actual order each character
>>appears in in the given range, but that shouldn't matter.

>
>
> This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
> a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
> are printed. But it seems that those functions are ascii-oriented anyway,
> so something more general, which works with the actual charset in use (with
> accented and special characters, etc), would be great.


I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All you
really need is a max value for all character sets (or pick some ridiculously
high value if you don't know) then this:

$ cat showrange.awk
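BEGIN{
    # one array key per numeric code 0..1000 (duplicates collapse)
    for (i=0;i<=1000;i++)
        chars[sprintf("%c",i)]
    # keep the characters that match the RE passed in "range"
    for (c in chars)
        if (c ~ range)
            s=s c
    print range":"s
}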
$ awk -v range="[a-z]" -f showrange.awk
[a-z]:abcdefghijklmnopqrstuvwxyz
$ awk -v range="[A-Z0-9]" -f showrange.awk
[A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

should work.
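
A side note on the script above: `for (c in chars)` visits array elements in an unspecified order, so the sorted-looking output is not guaranteed. Here is a minimal sketch that appends matches while walking the codes in increasing order instead. The wrapper name expand_re and its default upper bound of 127 are assumptions for illustration, not part of the original script:

```shell
# Sketch only: expand_re is a hypothetical name, not from the thread.
# $1 = locale, $2 = regular expression, $3 = highest code to try (default 127)
expand_re() {
    LC_ALL="$1" awk -v re="$2" -v max="${3:-127}" 'BEGIN {
        for (i = 1; i <= max + 0; i++) {   # start at 1 to skip the NUL byte
            c = sprintf("%c", i)           # character for numeric code i
            if (c ~ re)
                s = s c                    # appended in increasing code order
        }
        print re ":" s
    }'
}

expand_re C '[a-z]'
expand_re C '[A-Z0-9]'
```

Under LC_ALL=C these two calls should print the same two result lines as the runs shown above. Under other locales the result depends on how that locale expands the range.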

> And moreover, I'm looking for some method that works in the reverse way, ie,
> given the expression, print the expansion.


That is what it does - given the expression contained in the variable "range",
it'll print every character that is in that expression, i.e. the expansion.

> However, many thanks for your help and suggestions!
>


You're welcome.

Ed.

07/05/2008, 18:31   #4
Ed Morton



On 5/7/2008 12:16 PM, Ed Morton wrote:
>
> On 5/7/2008 11:46 AM, pk wrote:
>
>>On Wednesday 7 May 2008 18:25, Ed Morton wrote:
>>
>>
>>
>>>Good point. In that case, you could do something like this:
>>>
>>>range="[a-z]"
>>>for (i=low;i<=high;i++)
>>>if (chr(i) ~ range)
>>>print chr(i)
>>>
>>>where "low" and "high" are set by the _ord_init() function in the above
>>>link. It won't necessarily tell you the actual order each character
>>>appears in in the given range, but that shouldn't matter.

>>
>>
>>This is actually somewhat better. With LC_ALL=en_GB, low=0, high=127, both
>>a...z and A...Z are printed, while with LC_ALL=C only a...z (as expected)
>>are printed. But it seems that those functions are ascii-oriented anyway,
>>so something more general, which works with the actual charset in use (with
>>accented and special characters, etc), would be great.

>
>
> I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All you
> really need is a max value for all character sets (or pick some ridiculously
> high value if you don't know) then this:
>
> $ cat showrange.awk
> BEGIN{
> for (i=0;i<=1000;i++)
> chars[sprintf("%c",i)]
> for (c in chars)
> if (c ~ range)
> s=s c
> print range":"s
> }
> $ awk -v range="[a-z]" -f showrange.awk
> [a-z]:abcdefghijklmnopqrstuvwxyz
> $ awk -v range="[A-Z0-9]" -f showrange.awk
> [A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> should work.
>
>
>>And moreover, I'm looking for some method that works in the reverse way, ie,
>>given the expression, print the expansion.

>
>
> That is what it does - given the expression contained in the variable "range",
> it'll print every character that is in that expression, i.e. the expansion.
>
>
>>However, many thanks for your help and suggestions!
>>

>
>
> You're welcome.
>
> Ed.
>


Of course, I should've used "re" instead of "range" in the above since the
script outputs every character that matches an RE, not just those in a specific
range. "set" or "list" would also have been more appropriate for the original
intent.

Ed.

07/05/2008, 19:01   #5
pk

On Wednesday 7 May 2008 19:16, Ed Morton wrote:

> I think only ord() is ASCII/EBCDIC-oriented and you don't use that. All
> you really need is a max value for all character sets (or pick some
> ridiculously high value if you don't know) then this:
>
> $ cat showrange.awk
> BEGIN{
> for (i=0;i<=1000;i++)
> chars[sprintf("%c",i)]
> for (c in chars)
> if (c ~ range)
> s=s c
> print range":"s
> }
> $ awk -v range="[a-z]" -f showrange.awk
> [a-z]:abcdefghijklmnopqrstuvwxyz
> $ awk -v range="[A-Z0-9]" -f showrange.awk
> [A-Z0-9]:0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
>
> should work.


Well...sort of. Currently, I'm using the en_GB.utf8 locale, and I get this:

$ echo "aa" | wc -c
3
$ echo "ÃÃ" | wc -c
5
$ echo "€€" | wc -c
7

so, it seems I'm really using utf-8. The [a-z] re does match accented
characters:

$ echo 'è' | grep '[a-z]'
è
$ echo 'ò' | awk '/[a-z]/'
ò

etc.

However, running your script gives:

$ awk -v range='[a-z]' -f sr.awk
[a-z]:ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxyz

I modified the script slightly, like this:

for (i=0;i<=1000;i++) {
    chars[sprintf("%c",i)]
    printf "%c\n", i
}

to see what goes into chars[], and I saw this:

^@
^A
^B
^C
^D
........
^Z

^\
^]
^^
^_

!
"
#
$
%
........
u
v
w
x
y
z
{
|
}
~
^?
<80>
<81>
<82>
........
<FC>
<FD>
<FE>
<FF>
^@
^A
^B
^C
^D
......etc.

So it seems awk (or perhaps the "%c" specifier) only produces single-byte
values, wrapping around after <FF>, rather than the characters of the charset
in use. I'm not sure where the problem is here (maybe PEBKAC as well, of
course). I know I'm throwing many elements into the picture at once here,
but...where can I look to check whether awk is behaving properly, or where
any other problem exists?
I must say that, while I understand the general ideas behind locales and
utf-8, I've never dug terribly deep into those concepts, so I may very well
be missing something here.
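
One way to check what awk really emits for a given code is to dump the bytes rather than trust the terminal. A minimal sketch, assuming od is available; bytes_of is a hypothetical helper name. As far as I recall, the current gawk documentation says that in a multibyte locale a numeric "%c" argument is converted to the multibyte encoding of that code, falling back to the low eight bits if the conversion fails. Whether the awk build used in this thread does that is not something the output above settles.

```shell
# Sketch only: bytes_of is a hypothetical helper name.
# Print, in hex, the bytes awk emits for printf "%c" with numeric code $1.
bytes_of() {
    # n + 0 forces a numeric value, so "%c" treats it as a character code
    awk -v n="$1" 'BEGIN { printf "%c", n + 0 }' |
        od -An -tx1 | tr -d ' \n'
    echo
}

bytes_of 65     # a plain ASCII code
bytes_of 233    # one byte: raw byte; several bytes: a multibyte encoding
```

Running it under both LC_ALL=C and the UTF-8 locale and comparing the hex output shows whether the locale changes what "%c" produces.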

>> And moreover, I'm looking for some method that works in the reverse way,
>> ie, given the expression, print the expansion.

>
> That is what it does - given the expression contained in the variable
> "range", it'll print every character that is in that expression, i.e. the
> expansion.


Well yes, but using a kind of "brute force" method. I was thinking about a
program that, by reading "something" "somewhere" (for some values
of "something" and "somewhere", perhaps some locale-definition file or the
like) automagically produces the requested collating sequence.
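
For the order itself (still brute force, not read from any locale-definition file), one option is to let sort arrange candidate characters under the locale in question. A minimal sketch; collate_order is a hypothetical helper name, and it assumes that sort and the regex engine draw on the same collation data, which nothing in this thread establishes:

```shell
# Sketch only: collate_order is a hypothetical helper name.
# $1 = locale; remaining arguments = characters to put in collation order.
collate_order() {
    loc=$1
    shift
    # one character per line, sorted under the requested locale, rejoined
    printf '%s\n' "$@" | LC_ALL="$loc" sort | tr -d '\n'
    echo
}

collate_order C b A a B
```

In the C locale this prints ABab, since uppercase letters have the lower codes. Replacing C with en_GB.utf8 shows how that locale interleaves the same characters.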

Thanks for any help.
