comp.unix.shell: Using and programming the Unix shell.

#1

As I mentioned in another posting, I am writing a rudimentary Pascal-to-C++ translator using a hodgepodge of sed scripts and C programs. One of the few important remaining tasks is to change one of my sed scripts so that it will not do what it does now, which is to recklessly replace all occurrences of "and", "or", and "not" with "&&", "||", and "!".

The translator does need to make those conversions, but only if the word "and", "or", or "not" is not part of a literal string.

Example:
This:
    if not ( ( a and b ) or ( c and d ) )
Should become:
    if ( ! ( ( a && b ) || ( c && d )) )

However, this:
    write(output) << "Send receipt and sample of blood to rebateshq.com and expect not to get a reply." << NL;
Should NOT become this:
    write(output) << "Send receipt && sample of blood to rebateshq.com && expect ! to get a reply." << NL;

Thoughts, anyone?

Thanks!

#2

Dave S. wrote:
> As I mentioned in another posting, I am writing a rudimentary
> Pascal-to-C++ translator using a hodgepodge of sed scripts and C
> programs. One of the few important remaining tasks is to change one of
> my sed scripts so that it will not do what it does now, which is to
> recklessly replace all occurrences of "and", "or", and "not" with "&&",
> "||", and "!".
>
> The translator does need to make those conversions, but only if the
> word "and", "or", or "not" is not part of a literal string.
>
> Example:
> This:
> if not ( ( a and b ) or ( c and d ) )
> Should become:
> if ( ! ( ( a && b ) || ( c && d )) )
>
> However, this:
> write(output) << "Send receipt and sample of blood to rebateshq.com
> and expect not to get a reply." << NL;
> Should NOT become this:
> write(output) << "Send receipt && sample of blood to rebateshq.com
> && expect ! to get a reply." << NL;

This is where Perl-compatible regex rocks. Here is a simple example of how to change "and" to "&&"; the others are similar:

    perl -0777pe 's/("[^"]*")|and/ $1 or "&&" /eg' file.txt

More robust forms can also handle escaped quotes, but they may look a little noisier.

Xicheng

#3

Xicheng,

I guess I should be investing my time in learning more about Perl instead of sed!

Anyway, just to throw a wrench into the fan, the literal string can have escaped quotation marks in it, as in this:

    "Hello \"Mr. Jia\". How are you?"

What do you think about that?
Can your Perl that rocks handle that, huh?

Xicheng Jia wrote:
> This is where Perl-compatible regex rocks. Here is a simple example of
> how to change "and" to "&&"; the others are similar:
>
> perl -0777pe 's/("[^"]*")|and/ $1 or "&&" /eg' file.txt

#4

Dave S. wrote:
> Xicheng,
>
> I guess I should be investing my time in learning more about Perl
> instead of sed!
>
> Anyway, just to throw a wrench into the fan, the literal string can
> have escaped quotation marks in it, as in this:
> "Hello \"Mr. Jia\". How are you?"
>
> What do you think about that?
> Can your Perl that rocks handle that, huh?

There are some fairly standard ways to handle this kind of case, like:
_______________________________
bash: ~$ echo '
if not ( ( a and b ) or ( c and d ) )
write(output) << "Send receipt\" and sample \"of blood to rebateshq.com
and expect not to get a reply."
' | perl -0777pe 's/("[^\\"]*(?:\\.[^\\"]*)*")|and/$1 or "&&"/eg'

if not ( ( a && b ) or ( c && d ) )
write(output) << "Send receipt\" and sample \"of blood to rebateshq.com
and expect not to get a reply."
_______________________________
Just change

    "[^"]*"

to

    "[^\\"]*(?:\\.[^\\"]*)*"

or change it to

    "(?:\\.|[^\\"]*)*"

The latter is much easier to understand but less efficient from the regex engine's point of view.

Good luck,
Xicheng

#5

Xicheng,

You are brilliant!

Thanks.
Dave

#6

On 24 Aug 2006 23:23:40 -0700, Xicheng Jia wrote:
[...]
> bash: ~$ echo '
> if not ( ( a and b ) or ( c and d ) )
> write(output) << "Send receipt\" and sample \"of blood to
> rebateshq.com
> and expect not to get a reply."
> ' | perl -0777pe 's/("[^\\"]*(?:\\.[^\\"]*)*")|and/$1 or "&&"/eg'
>
> if not ( ( a && b ) or ( c && d ) )
> write(output) << "Send receipt\" and sample \"of blood to
> rebateshq.com
> and expect not to get a reply."
> _______________________________
> Just change
>
> "[^"]*"
>
> to
>
> "[^\\"]*(?:\\.[^\\"]*)*"
>
> or change it to
>
> "(?:\\.|[^\\"]*)*"
>
> The latter is much easier to understand but less efficient from the
> regex engine's point of view.

Why? They look mostly equivalent to me.

Or

    "(?:\\.|.)*?"

    $ echo 'a and b "c and \"and\" d" and e' | perl -0777 -pe 's/("(?:\\.|.)*?")|and/$1 or "&&"/ge'
    a && b "c and \"and\" d" && e

If the quote is not matched (as in 'foo "and'), there will be a substitution with either solution.

--
Stephane

#7

Stephane Chazelas wrote:
> On 24 Aug 2006 23:23:40 -0700, Xicheng Jia wrote:
> [...]
> > bash: ~$ echo '
> > if not ( ( a and b ) or ( c and d ) )
> > write(output) << "Send receipt\" and sample \"of blood to
> > rebateshq.com
> > and expect not to get a reply."
> > ' | perl -0777pe 's/("[^\\"]*(?:\\.[^\\"]*)*")|and/$1 or "&&"/eg'
> >
> > if not ( ( a && b ) or ( c && d ) )
> > write(output) << "Send receipt\" and sample \"of blood to
> > rebateshq.com
> > and expect not to get a reply."
> > _______________________________
> > Just change
> >
> > "[^"]*"
> >
> > to
> >
> > "[^\\"]*(?:\\.[^\\"]*)*"
> >
> > or change it to
> >
> > "(?:\\.|[^\\"]*)*"
> >
> > The latter is much easier to understand but less efficient from the
> > regex engine's point of view.
>
> Why? They look mostly equivalent to me.
>

Well, by using the latter version, you are at risk of using the (A*)* construct, which may introduce *endless* backtracking for non-matching cases.

You may check J. Friedl's book "Mastering Regular Expressions"; there is a special section, "Unrolling the Loop", that discusses this issue.

> Or
>
> "(?:\\.|.)*?"

Although this construct eliminates the pitfalls of (A*)*, it is less efficient than "(?:\\.|[^\\"]*)*" for matching cases.

Alternation in regex patterns is often a main factor in lowering the speed, not to mention that here it is a per-character alternation.

Xicheng

#8

2006-08-26, 10:06(-07), Xicheng Jia:
[...]
>> > "[^\\"]*(?:\\.[^\\"]*)*"
>> >
>> > or change it to
>> >
>> > "(?:\\.|[^\\"]*)*"
>> >
>> > The latter is much easier to understand but less efficient from the
>> > regex engine's point of view.
>>
>> Why? They look mostly equivalent to me.
>>
>
> Well, by using the latter version, you are at risk of using the (A*)*
> construct, which may introduce *endless* backtracking for
> non-matching cases.

Which in this case only applies to unmatched quotes, which are not meant to happen and which this solution doesn't treat correctly anyway.

> You may check J. Friedl's book "Mastering Regular Expressions"; there
> is a special section, "Unrolling the Loop", that discusses this issue.
>
>> Or
>>
>> "(?:\\.|.)*?"
>
> Although this construct eliminates the pitfalls of (A*)*, it is less
> efficient than "(?:\\.|[^\\"]*)*" for matching cases.
>
> Alternation in regex patterns is often a main factor in lowering the
> speed, not to mention that here it is a per-character alternation.
[...]

That doesn't apply to perl alternations, which are much simpler than normal regexp alternation. In perl the alternatives are tried left to right and it stops as soon as one matches.

So, if you watch perl trying to match

    "(?:\\.|.)*?"

once it has found the opening ", it will first check for every character whether it's the closing quote (because of the non-greedy matching). If not, it will check whether it's a "\". If yes, it will advance 2 chars; if not, 1 char; and loop.

    "(?:\\.|[^\\"]*)*"

would be slightly more efficient because it checks first for \ (which may occur several times) before checking for " (which will occur only once), and doesn't check for " if it finds \. But on the other hand, it will check non-\ chars against \ twice, and checking against a set of characters may be less efficient than against a single one.

    "(?:\\.|[^"]*)*"

might be more efficient. But as we don't know what internal optimisations perl might do, or the details of the implementation, it may be just the same, or different for reasons that have not much to do with the actual actions that need to be done for each character processed.

--
Stéphane

#9

Stephane CHAZELAS wrote:
> 2006-08-26, 10:06(-07), Xicheng Jia:
> [...]
> >> > "[^\\"]*(?:\\.[^\\"]*)*"
> >> >
> >> > or change it to
> >> >
> >> > "(?:\\.|[^\\"]*)*"
> >> >
> >> > The latter is much easier to understand but less efficient from the
> >> > regex engine's point of view.
> >>
> >> Why? They look mostly equivalent to me.
> >>
> >
> > Well, by using the latter version, you are at risk of using the (A*)*
> > construct, which may introduce *endless* backtracking for
> > non-matching cases.
>
> Which in this case only applies to unmatched quotes, which are
> not meant to happen and which this solution doesn't treat
> correctly anyway.

Yeah, sometimes it doesn't hurt to trade readability for robustness and speed. :-) AFAIK, the unrolling pattern has long been established as a way to handle this kind of case. :-)

> > You may check J. Friedl's book "Mastering Regular Expressions"; there
> > is a special section, "Unrolling the Loop", that discusses this issue.
> >
> >> Or
> >>
> >> "(?:\\.|.)*?"
> >
> > Although this construct eliminates the pitfalls of (A*)*, it is less
> > efficient than "(?:\\.|[^\\"]*)*" for matching cases.
> >
> > Alternation in regex patterns is often a main factor in lowering the
> > speed, not to mention that here it is a per-character alternation.
> [...]
>
> That doesn't apply to perl alternations, which are much simpler
> than normal regexp alternation. In perl the alternatives are
> tried left to right and it stops as soon as one matches.

You forget backtracking, which often makes Perl's regex matching less simple than we imagine.

> So, if you watch perl trying to match
>
> "(?:\\.|.)*?"
>
> once it has found the opening ", it will first check for every
> character whether it's the closing quote (because of the
> non-greedy matching). If not, it will check whether it's a "\".
> If yes, it will advance 2 chars; if not, 1 char; and loop.
>
> "(?:\\.|[^\\"]*)*"

Actually, after it finds the opening ", it will skip (?:\\.|.)* and try to find the closest ", and then go backward character by character to check whether (?:\\.|.)* is satisfied. The reluctant form of the quantifier may not work exactly as you said above. :-)

> would be slightly more efficient because it checks first for \
> (which may occur several times) before checking for " (which will
> occur only once), and doesn't check for " if it finds \. But on the
> other hand, it will check non-\ chars against \ twice, and
> checking against a set of characters may be less efficient than
> against a single one.
>
> "(?:\\.|[^"]*)*"

It's hard to say, since they approach the final destination from different directions. :-) Maybe you want something like (?:\\.|[^"])*, but as far as I understand, it's better to enter the alternation construct as few times as possible.

> might be more efficient. But as we don't know what internal
> optimisations perl might do, or the details of the implementation, it
> may be just the same, or different for reasons that have not much
> to do with the actual actions that need to be done for each
> character processed.

In fact, in the book I mentioned in my previous post, there is a whole chapter on how to construct an efficient Perl-compatible regex. There are also 4-5 papers in "Computer Science & Perl Programming: Best of TPJ" that discuss how Perl's regex engine works and compare patterns like A.*B, A.*?B, A[^c]*B....

Xicheng :-)