comp.unix.shell Using and programming the Unix shell.

sed: How to avoid making changes within a literal string?

Posted 25/08/2006, 01:05   #1
Dave S.

As I mentioned in another posting, I am writing a rudimentary
Pascal-to-C++ translator using a hodgepodge of sed scripts and C
programs. One of the few important remaining tasks is to change one of
my sed scripts so that it will not do what it does now, which is to
recklessly replace all occurrences of "and", "or", and "not" with "&&",
"||", and "!".

The translator does need to make those conversions, but only if the
word "and", "or", or "not" is not part of a literal string.

Example:
This:
if not ( ( a and b ) or ( c and d ) )
Should become:
if ( ! ( ( a && b ) || ( c && d )) )

However, this:
write(output) << "Send receipt and sample of blood to rebateshq.com
and expect not to get a reply." << NL;
Should NOT become this:
write(output) << "Send receipt && sample of blood to rebateshq.com
&& expect ! to get a reply." << NL;

Thoughts, anyone?

Thanks!

Posted 25/08/2006, 01:24   #2
Xicheng Jia

Dave S. wrote:
> As I mentioned in another posting, I am writing a rudimentary
> Pascal-to-C++ translator using a hodgepodge of sed scripts and C
> programs. One of the few important remaining tasks is to change one of
> my sed scripts so that it will not do what it does now, which is to
> recklessly replace all occurrences of "and", "or", and "not" with "&&",
> "||", and "!".
>
> The translator does need to make those conversions, but only if the
> word "and", "or", or "not" is not part of a literal string.
>
> Example:
> This:
> if not ( ( a and b ) or ( c and d ) )
> Should become:
> if ( ! ( ( a && b ) || ( c && d )) )
>
> However, this:
> write(output) << "Send receipt and sample of blood to rebateshq.com
> and expect not to get a reply." << NL;
> Should NOT become this:
> write(output) << "Send receipt && sample of blood to rebateshq.com
> && expect ! to get a reply." << NL;


This is where Perl-compatible regexes rock. Here is a simple example of
how to change 'and' to &&; the others are similar.

perl -0777pe 's/("[^"]*")|and/ $1 or "&&" /eg' file.txt

More robust forms can also handle escaped quotes, but they look a
little noisier.
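
For readers more comfortable with Python, here is a minimal sketch of the same idea: match a quoted string first and put it back unchanged, otherwise swap the keyword. The \b word boundaries, the OPERATORS table and the translate helper are assumptions of this sketch, not part of the one-liner above, and escaped quotes are not handled here.

```python
import re

# Group 1: a double-quoted string without escapes. Group 2: one of the keywords.
# The \b anchors (an addition of this sketch) keep words like "band" intact.
TOKEN = re.compile(r'("[^"]*")|\b(and|or|not)\b')
OPERATORS = {"and": "&&", "or": "||", "not": "!"}


def translate(source: str) -> str:
    """Replace and/or/not outside double-quoted strings (hypothetical helper)."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)          # a string literal: copy it verbatim
        return OPERATORS[match.group(2)]   # a bare keyword: swap in the operator
    return TOKEN.sub(repl, source)


print(translate('if not ( ( a and b ) or ( c and d ) )'))
print(translate('write(output) << "Send receipt and sample" << NL;'))
```

This only swaps the keywords; it does not add the extra parentheses shown in the original question.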

Xicheng

Posted 25/08/2006, 04:36   #3
Dave S.

Xicheng,

I guess I should be investing my time in learning more about Perl
instead of sed!

Anyway, just to throw a wrench into the fan, the literal string can
have escaped quotation marks in it, as in this:
"Hello \"Mr. Jia\". How are you?"

What do you think about that?
Can your Perl that rocks handle that, huh?

Xicheng Jia wrote:
> This is where Perl-compatible regexes rock. Here is a simple example of
> how to change 'and' to &&; the others are similar.
>
> perl -0777pe 's/("[^"]*")|and/ $1 or "&&" /eg' file.txt


Posted 25/08/2006, 07:23   #4
Xicheng Jia

Dave S. wrote:
> Xicheng,
>
> I guess I should be investing my time in learning more about Perl
> instead of sed!
>
> Anyway, just to throw a wrench into the fan, the literal string can
> have escaped quotation marks in it, as in this:
> "Hello \"Mr. Jia\". How are you?"
>
> What do you think about that?
> Can your Perl that rocks handle that, huh?


There are fairly standard ways to handle this kind of case, for
example:
_______________________________
bash: ~$ echo '
if not ( ( a and b ) or ( c and d ) )
write(output) << "Send receipt\" and sample \"of blood to
rebateshq.com
and expect not to get a reply."
' | perl -0777pe 's/("[^\\"]*(?:\\.[^\\"]*)*")|and/$1 or "&&"/eg'

if not ( ( a && b ) or ( c && d ) )
write(output) << "Send receipt\" and sample \"of blood to
rebateshq.com
and expect not to get a reply."
_______________________________
Just change

"[^"]*"

to

"[^\\"]*(?:\\.[^\\"]*)*"

or change it to

"(?:\\.|[^\\"]*)*"

The latter is much easier to understand but less efficient in terms of
how the regex engine applies it.
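
As a cross-check, the two string patterns can be compared in Python, whose re module is assumed here to treat these constructs the same way Perl does. This is only a sketch: replace_and, UNROLLED, ALTERNATION and SAMPLE are hypothetical names, and the \b anchors around "and" are an addition.

```python
import re

# The two ways of matching a string literal that may contain \" escapes.
UNROLLED = r'"[^\\"]*(?:\\.[^\\"]*)*"'
ALTERNATION = r'"(?:\\.|[^\\"]*)*"'


def replace_and(source: str, string_pattern: str) -> str:
    """Swap 'and' for '&&' outside strings matched by string_pattern."""
    token = re.compile("(" + string_pattern + r")|\band\b")
    # Group 1 is set only when a whole string literal was matched.
    return token.sub(
        lambda m: m.group(1) if m.group(1) is not None else "&&", source
    )


SAMPLE = r'''if not ( ( a and b ) or ( c and d ) )
write(output) << "Send receipt\" and sample \"of blood to
rebateshq.com
and expect not to get a reply."
'''

for pattern in (UNROLLED, ALTERNATION):
    print(replace_and(SAMPLE, pattern))
```

On well-formed input like this, both patterns should leave the literal alone and change only the first line.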

Good luck,
Xicheng

Posted 25/08/2006, 16:24   #5
Dave S.

Xicheng,

You are brilliant!

Thanks.
Dave

Posted 25/08/2006, 17:05   #6
Stephane Chazelas

On 24 Aug 2006 23:23:40 -0700, Xicheng Jia wrote:
[...]
> bash: ~$ echo '
> if not ( ( a and b ) or ( c and d ) )
> write(output) << "Send receipt\" and sample \"of blood to
> rebateshq.com
> and expect not to get a reply."
> ' | perl -0777pe 's/("[^\\"]*(?:\\.[^\\"]*)*")|and/$1 or "&&"/eg'
>
> if not ( ( a && b ) or ( c && d ) )
> write(output) << "Send receipt\" and sample \"of blood to
> rebateshq.com
> and expect not to get a reply."
> _______________________________
> Just change
>
> "[^"]*"
>
> to
>
> "[^\\"]*(?:\\.[^\\"]*)*"
>
> or change it to
>
> "(?:\\.|[^\\"]*)*"
>
> The latter is much easier to understand but less efficient in terms of
> how the regex engine applies it.


Why? They look mostly equivalent to me.

Or

"(?:\\.|.)*?"

$ echo 'a and b "c and \"and\" d" and e' |
perl -0777 -pe 's/("(?:\\.|.)*?")|and/$1 or "&&"/ge'
a && b "c and \"and\" d" && e

If a quote is unmatched (as in 'foo "and'), the substitution will
still happen with either solution.
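
Both observations can be reproduced with a small Python sketch (LAZY and replace_and are hypothetical names; Python's re is assumed to agree with Perl on this pattern). One thing to keep in mind: "." does not match a newline by default in either language, so a literal spanning several lines would need the /s modifier in Perl or re.DOTALL in Python.

```python
import re

# Lazy variant: after the opening quote, consume "\x" pairs or single
# characters until the first unescaped closing quote.
LAZY = re.compile(r'("(?:\\.|.)*?")|and')


def replace_and(source: str) -> str:
    """Keep matched string literals, turn every other 'and' into '&&'."""
    return LAZY.sub(lambda m: m.group(1) or "&&", source)


# Balanced quotes: the 'and' inside the literal is left alone.
print(replace_and(r'a and b "c and \"and\" d" and e'))

# Unmatched quote: no literal can be matched, so 'and' is still replaced.
print(replace_and('foo "and'))
```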

--
Stephane
Posted 26/08/2006, 18:06   #7
Xicheng Jia

Stephane Chazelas wrote:
> On 24 Aug 2006 23:23:40 -0700, Xicheng Jia wrote:
> [...]
> > bash: ~$ echo '
> > if not ( ( a and b ) or ( c and d ) )
> > write(output) << "Send receipt\" and sample \"of blood to
> > rebateshq.com
> > and expect not to get a reply."
> > ' | perl -0777pe 's/("[^\\"]*(?:\\.[^\\"]*)*")|and/$1 or "&&"/eg'
> >
> > if not ( ( a && b ) or ( c && d ) )
> > write(output) << "Send receipt\" and sample \"of blood to
> > rebateshq.com
> > and expect not to get a reply."
> > _______________________________
> > Just change
> >
> > "[^"]*"
> >
> > to
> >
> > "[^\\"]*(?:\\.[^\\"]*)*"
> >
> > or change it to
> >
> > "(?:\\.|[^\\"]*)*"
> >
> > The latter is much easier to understand but less efficient in terms of
> > how the regex engine applies it.

>
> Why? They look mostly equivalent to me.
>


Well, with the latter version you risk using an (A*)* construct, which
may cause seemingly *endless* (exponential) backtracking in
non-matching cases.

See J. Friedl's book "Mastering Regular Expressions"; its section
"Unrolling the Loop" discusses this issue.
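
A rough way to see why a nested quantifier can blow up: when the closing quote is missing, a naive backtracking engine may try every way of cutting the run of ordinary characters into chunks for the inner loop before giving up. The Python sketch below only counts those cuts. It is a counting argument under that assumption, not a measurement of Perl, whose engine may prune much of this work; ways_to_split is a hypothetical helper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_to_split(n: int) -> int:
    """Count the ways to cut a run of n characters into non-empty chunks."""
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) way
    # Choose the length k of the first chunk, then split the remainder.
    return sum(ways_to_split(n - k) for k in range(1, n + 1))


for n in (5, 10, 20, 30):
    print(n, ways_to_split(n))
```

The count doubles with every extra character, which is the kind of growth the unrolled form avoids by giving the engine only one way to consume each run.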

> Or
>
> "(?:\\.|.)*?"


Although this construct avoids the pitfalls of (A*)*, it is less
efficient than "(?:\\.|[^\\"]*)*" in matching cases.

Alternation in regex patterns is often a major factor in slowing down
matching, not to mention that here the alternation is tried on every
character.

Xicheng

Posted 26/08/2006, 20:00   #8
Stephane CHAZELAS

2006-08-26, 10:06(-07), Xicheng Jia:
[...]
>> > "[^\\"]*(?:\\.[^\\"]*)*"
>> >
>> > or change it to
>> >
>> > "(?:\\.|[^\\"]*)*"
>> >
>> > The latter is much easier to understand but less efficient in terms of
>> > how the regex engine applies it.

>>
>> Why? They look mostly equivalent to me.
>>

>
> Well, with the latter version you risk using an (A*)* construct, which
> may cause seemingly *endless* (exponential) backtracking in
> non-matching cases.


In this case that only applies to unmatched quotes, which are not
meant to occur and which this solution doesn't handle correctly
anyway.

> See J. Friedl's book "Mastering Regular Expressions"; its section
> "Unrolling the Loop" discusses this issue.
>
>> Or
>>
>> "(?:\\.|.)*?"

>
> Although this construct avoids the pitfalls of (A*)*, it is less
> efficient than "(?:\\.|[^\\"]*)*" in matching cases.
>
> Alternation in regex patterns is often a major factor in slowing down
> matching, not to mention that here the alternation is tried on every
> character.

[...]

That doesn't apply to Perl alternations, which are much simpler
than normal regexp alternation. In Perl the alternatives are
tried left to right and matching stops as soon as one matches.

So, if you watch perl trying to match

"(?:\\.|.)*?"

Once it's found the opening ", it will first check for every
character whether it's the closing quote (because of the
non-greedy matching). If not, it will check whether it's a "\".
If yes it will advance 2 chars, if not 1 char, and loop.
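
The walk described above can be written out by hand. The Python sketch below is a plain scanner rather than a regex (find_string_end is a hypothetical helper), and unlike a regex engine it never backtracks.

```python
def find_string_end(text: str, start: int) -> int:
    """Return the index just past the closing quote of the literal that
    opens at text[start], or -1 if the quote is never closed."""
    if text[start] != '"':
        raise ValueError("start must point at an opening quote")
    i = start + 1
    while i < len(text):
        if text[i] == '"':                  # closing quote: done
            return i + 1
        # A backslash escapes the next character, so skip two; otherwise one.
        i += 2 if text[i] == "\\" else 1
    return -1


line = r'a and b "c and \"and\" d" and e'
begin = line.index('"')
end = find_string_end(line, begin)
print(line[begin:end])
```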

"(?:\\.|[^\\"]*)*"

would be slightly more efficient because it checks first for \
(which may occur several times) before checking for " (which will
occur only once), and doesn't check for " after a \. But on the
other hand, it will check non-\ chars against \ twice, and
checking against a set of characters may be less efficient than
against a single one.

"(?:\\.|[^"]*)*"

might be more efficient. But as we don't know what internal
optimisations perl might do, or the details of its implementation, it
may be just the same, or differ for reasons that have little
to do with the actions needed for each
character processed.
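
One caveat with this last variant may be worth checking before relying on it: because [^"]* can itself consume a backslash, the greedy run can swallow the "\" of an escaped quote and then stop at that quote as if it closed the string. Dropping the inner star, so that \\. gets first try at every character, appears to avoid this. The sketch below shows the difference in Python's re, which is assumed to agree with Perl here; WITH_STAR and ONE_CHAR are hypothetical names.

```python
import re

# Variant with the inner star: [^"]* may eat the backslash of an escape.
WITH_STAR = re.compile(r'"(?:\\.|[^"]*)*"')
# Variant without it: at each character, \\. is tried before [^"].
ONE_CHAR = re.compile(r'"(?:\\.|[^"])*"')

text = r'"a\"b" and c'

# Stops at the escaped quote, so the literal is cut short.
print(WITH_STAR.match(text).group())

# Reaches the real closing quote.
print(ONE_CHAR.match(text).group())
```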

--
Stéphane