#1

I'd like to have a set of "allowed characters", and strip a string of everything besides those.

I've tried and tried, but so far every time I enter strings containing unicode, it goes mad and the output makes no sense. I'm sure I'm missing something but have no idea what.

$ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß";
$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß";
$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
$SYMBOLS_NAME=".'- ";

The first time, I tried using something like this:

$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);

(bear in mind I do *NOT* know regexp, a friend wrote this line)

Now I tried using str_split instead:

function clear_name_complex ($name, $ok_chars) {
    $ass=str_split($ok_chars);
    $al=array();
    foreach ($ass as $a) {
        $al[$a]=TRUE;
    }
    $s=str_split($name);
    $ret="";
    foreach ($s as $c) {
        if (!$al[$c]) continue;
        $ret.=$c;
    }
    return $ret;
}

Still nothing: with unicode it goes mad and the output makes no sense.

I believe that's because in both cases it splits unicode characters into single bytes, but still, I'm clueless about what I'm supposed to do.
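
The byte-splitting suspicion in the post above can be checked directly. What follows is only a minimal sketch, assuming PHP 7 or later and UTF-8 input; `clear_name_utf8` is a hypothetical name that does not appear in the thread, and the `\u{...}` escapes are used just to keep the example independent of the file's own encoding.

```php
<?php
// "Zoë" written with an escape so the bytes are unambiguous UTF-8.
$name = "Zo\u{EB}";

// str_split() cuts the string into bytes; "ë" is two bytes in UTF-8.
$bytes = str_split($name);

// preg_split() with the "u" modifier splits between whole UTF-8 characters.
$chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // 4
echo count($chars), "\n"; // 3

// Hypothetical character-aware variant of the poster's function.
function clear_name_utf8(string $name, string $ok_chars): string
{
    $allowedChars = preg_split('//u', $ok_chars, -1, PREG_SPLIT_NO_EMPTY);
    $nameChars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);
    if ($allowedChars === false || $nameChars === false) {
        return ''; // one of the inputs was not valid UTF-8
    }
    // Flip so each allowed character becomes an array key for fast lookup.
    $allowed = array_flip($allowedChars);
    $ret = '';
    foreach ($nameChars as $c) {
        if (isset($allowed[$c])) {
            $ret .= $c;
        }
    }
    return $ret;
}

echo clear_name_utf8("Zo\u{EB}!", "Zo\u{EB}"), "\n";
```

The design choice here is simply to keep the original lookup-table idea and swap the byte-wise split for a character-wise one.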

#2

"Lo'oris" <looris@gmail.com> wrote in message news:f01bb9b2-0ec1-451d-8a1e-438803b1ad3d@r60g2000hsc.googlegroups.com...
> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.
>
> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.
>
> I'm sure I'm missing something but no idea what.
>
> $ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß";
> $ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß";
> $ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
> $ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
> $ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
> $ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
> $SYMBOLS_NAME=".'- ";
>
> first time I tried using something like this:
> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
> (bear in mind I do *NOT* know regexp, a friend wrote this line)

preg_replace is your best bet.

$notAllowed = "/[^a-zaaeeiioouu?]/si";
$newString = preg_replace($notAllowed, '', $stringToSearch);

or something to that effect...it tested fine in 'the regulator' - a regex tester.
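
A sketch of the negated-character-class approach suggested above, under the assumption that the input is UTF-8. The `u` modifier and the `preg_quote()` call are additions that the post does not mention, and `strip_disallowed` is a hypothetical helper name.

```php
<?php
// Hypothetical helper, not from the thread. Assumes $input and
// $allowedChars are valid UTF-8.
function strip_disallowed(string $input, string $allowedChars): string
{
    if ($allowedChars === '') {
        return ''; // nothing is allowed, so nothing survives
    }
    // preg_quote() escapes characters such as "-", "]" and "." so that
    // they are taken literally inside the character class.
    $class = preg_quote($allowedChars, '/');
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[^' . $class . ']/u', '', $input);
    // preg_replace() returns null on error, for example on invalid UTF-8.
    return $result ?? '';
}

$allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
         . "\u{E8}\u{E9}\u{EB}" // è é ë, a small stand-in for $ACCENTED_ALL
         . ".'- ";              // the same characters as $SYMBOLS_NAME

echo strip_disallowed("Zo\u{EB} O'Brien <b>#1</b>", $allowed), "\n";
```

Building the class from a plain string of allowed characters keeps the allow-list readable and avoids having to hand-escape regex metacharacters.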

#3

"Lo'oris" <looris@gmail.com> wrote:
>I'd like to have a set of "allowed characters", and strip a string
>from everything besides those.
>
>I've tried and tried but so far every time I enter strings containing
>unicode, it goes mad and output makes no sense.

How are you entering "strings containing unicode"? Browsers don't send Unicode.

>I'm sure I'm missing something but no idea what.
>
>$ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß";
>$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß";
>$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>$SYMBOLS_NAME=".'- ";
>
>first time I tried using something like this:
>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>(bear in mind I do *NOT* know regexp, a friend wrote this line)

This matches one of 4 things:
    [a-zA-Z]   an upper or lower case letter,
    |          or
    -          a hyphen
    |          or
    [$al]      whatever is contained in $al
    |          or
    .          any character,

and replaces it with whatever was matched. Clearly, this is a very expensive no-op. You do NOT want the '.' in there.

I would suggest this:

    preg_match_all("/[$al]+/", $name, $out);
    $result = implode('', $out[0]);
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
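
The `preg_match_all` suggestion can be tried out as follows. This is a sketch only: the contents of `$al` and `$name` are made-up sample values, and the `u` modifier and `preg_quote()` are assumptions added so that the character class works on UTF-8 characters rather than bytes.

```php
<?php
// Hypothetical stand-in for the thread's $al: lower-case letters,
// è, é and the characters from $SYMBOLS_NAME. preg_quote() escapes
// "." and "-" so they are literal inside the class.
$al = preg_quote("abcdefghijklmnopqrstuvwxyz\u{E8}\u{E9}.'- ", '/');

// Sample input: "renée <script>"
$name = "ren\u{E9}e <script>";

// Collect every run of allowed characters, then glue the runs together.
preg_match_all('/[' . $al . ']+/u', $name, $out);
$result = implode('', $out[0]);

echo $result, "\n"; // renée script
```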

#4

"Tim Roberts" <timr@probo.com> wrote in message news:gtasl39ff8fmcb5i84k51i0sdb4v14bn15@4ax.com...
> "Lo'oris" <looris@gmail.com> wrote:
>
>>I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>>
>>I've tried and tried but so far every time I enter strings containing
>>unicode, it goes mad and output makes no sense.
>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.
>
>>I'm sure I'm missing something but no idea what.
>>
>>$ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß";
>>$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß";
>>$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>>$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>>$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>>$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>>$SYMBOLS_NAME=".'- ";
>>
>>first time I tried using something like this:
>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>(bear in mind I do *NOT* know regexp, a friend wrote this line)
>
> This matches one of 4 things:
>     [a-zA-Z]   an upper or lower case letter,
>     |          or
>     -          a hyphen
>     |          or
>     [$al]      whatever is contained in $al
>     |          or
>     .          any character,
>
> and replaces it with whatever was matched. Clearly, this is a very
> expensive no-op. You do NOT want the '.' in there.
>
> I would suggest this:
>
>     preg_match_all("/[$al]+/", $name, $out);
>     $result = implode('', $out[0]);

it's not expensive at all. and a dot is any single character...not a greedy wild card. the only reason he wouldn't want a dot is because it could be an 'illegal' character that he's trying to get rid of anyway. as it is, he just didn't escape the dot so that it is the character (period) and not the directive (any single character).

#5

"Tim Roberts" <timr@probo.com> wrote in message news:gtasl39ff8fmcb5i84k51i0sdb4v14bn15@4ax.com...
> "Lo'oris" <looris@gmail.com> wrote:
>
>>I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>>
>>I've tried and tried but so far every time I enter strings containing
>>unicode, it goes mad and output makes no sense.
>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.

ummm, what planet are you from? asian people browse the internet too...amazingly enough. i wonder why unicode was invented? probably has NOTHING to do with the asian alphabet. then there are those fussy farsi people. oh yeah, and the fucking russians! ok, germans too. then there's the...damnit if everyone in the world won't speak and use the english language! plus, we have to genuflect for them? hell no!

i agree with you...there should be no browsers developed to ever allow characters outside the NORMUL UMMERIKUN language that gawd intended awl men to speak and write!

#6

Steve wrote:
> damnit if everyone in the world won't speak and use the english
> language!

Some English words use non-ASCII characters too.

There are obviously those adopted from other languages such as résumé, café and crèche. Many such words will lose their accents when they come to English, but the examples given in the previous sentence typically retain them.

The forenames Chloë and Zoë use a diaeresis mark to indicate that the 'e' should be pronounced independently from the 'o' sound. The surname Brontë has a similar mark, though that was a fanciful addition by their father. They were originally from Ireland and thought that the English might have a hard time correctly pronouncing Proinntigh, so anglicised it. One might wonder why he thought adding an 'ë' counted as anglicising, but at that time, the mark was quite common, used in words such as coöperate, coördinate, reënact and noöne. (Now that the mark has become less common, these words are somewhat more awkward to spell. Using a hyphen looks wrong, but the words look even worse without!) It is still retained in naïve.

Also, 'æ' and to a lesser extent, 'œ' are used in many words. (Try looking up pre-mediæval diseases of the œsophagus in an encyclopædia, and you might find that many of them have names which are an onomatopœia.)

All of these are still readable when transliterated into a purely ASCII alphabet, though 'resume' is ambiguous.
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 3 days, 21:11.]
Sharing Music with Apple iTunes
http://tobyinkster.co.uk/blog/2007/1...tunes-sharing/

#7

Tim Roberts wrote:
> "Lo'oris" <looris@gmail.com> wrote:
>
>> I'd like to have a set of "allowed characters", and strip a string
>> from everything besides those.
>> I've tried and tried but so far every time I enter strings containing
>> unicode, it goes mad and output makes no sense.
>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.
>

Excuse me? They sure can, depending on the language being used.

So the rest of your post is immaterial. Steve's suggestion is a lot closer.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

#8

Greetings, Lo'oris.
In reply to your message dated Tuesday, December 11, 2007, 05:47:08,

> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.
> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.

I think we should stop here and decide what we can do to be sure we have proper user input before dealing with it.

Typical case: you're working with a posting form and forget to set the accept-charset attribute on it, leaving the user agent to decide for itself how to send the data. In most cases it returns the data in the encoding your server uses when sending your pages to the user. But that is what a fair browser, like Opera, does, and only if your server supplies an encoding at all.

Sorry, I can't add anything beyond this explanation, as it is not really a PHP question.

The worst case is when you cannot take any action to make the user input suitable for you. Then you can try to detect the encoding of the data passed to your application. Hopefully there are some articles on the Net covering this task.
--
Sincerely Yours, AnrDaemon <anrdaemon@freemail.ru>
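
One way to check the input before filtering it is to test whether it is well-formed UTF-8. The sketch below is an assumption about how that could look, not something proposed in the thread: `is_valid_utf8` is a hypothetical helper, and it relies on `preg_match()` returning `false` when the `u` modifier is used on a malformed subject. It only validates UTF-8; it does not detect which other encoding the data might be in.

```php
<?php
// Hypothetical helper: true when $input is well-formed UTF-8.
function is_valid_utf8(string $input): bool
{
    // With the "u" modifier PCRE validates the subject before matching.
    // The empty pattern matches any valid string, so the result is 1 for
    // valid input and false for malformed input.
    return preg_match('//u', $input) === 1;
}

// "café" encoded as UTF-8 (é is the two bytes C3 A9).
var_dump(is_valid_utf8("caf\u{E9}")); // bool(true)

// "café" encoded as ISO-8859-1 (é is the single byte E9).
var_dump(is_valid_utf8("caf\xE9"));   // bool(false)
```

Input that fails the check could be rejected, or converted once its real encoding is known, before any allow-list filtering is applied.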

#9

"Steve" <no.one@example.com> wrote:
>
>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>
>it's not expensive at all. and a dot is any single character...not a greedy
>wild card. the only reason he wouldn't want a dot is because it could be an
>'illegal' character that he's trying to get rid of anyway. as it is, he just
>didn't escape the dot so that it is the character (period) and not the
>directive (any single character).

That statement as written will replace each character with itself, one by one, repeatedly, for each character in $name.

It is an expensive no-op.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
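
For readers who want to check the behaviour of the disputed statement themselves, here is a small sketch. The value of `$al` is a made-up stand-in (è and é), and the inputs are sample strings. The point it illustrates is that the replacement is `$1`, and a group that did not take part in the match expands to an empty string, so characters consumed by the `.` branch are dropped rather than kept.

```php
<?php
// Hypothetical stand-in for the thread's $al: è and é as UTF-8.
$al = "\u{E8}\u{E9}";

// The pattern exactly as posted, without the "u" modifier.
$pattern = "/([a-zA-Z]|-|[$al])|./";

// ASCII input: letters and the hyphen are captured and kept, the rest
// is matched by "." and replaced with the empty group $1.
$ascii = preg_replace($pattern, '$1', "Anne-Marie 42!");
echo $ascii, "\n"; // Anne-Marie

// UTF-8 input without "u": the class holds the bytes C3 A8 C3 A9, so the
// first byte of "ë" (C3 AB) is kept and the second is dropped.
$broken = preg_replace($pattern, '$1', "Zo\u{EB}");
echo bin2hex($broken), "\n"; // 5a6fc3

// The same pattern with "u" works on whole characters.
$fixed = preg_replace($pattern . 'u', '$1', "Zo\u{EB} Ren\u{E9}");
echo $fixed, "\n"; // ZoRené
```

The second case leaves a lone lead byte in the output, which matches the garbled results described in the first post.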

#10

Jerry Stuckle <jstucklex@attglobal.net> wrote:
>Tim Roberts wrote:
>> "Lo'oris" <looris@gmail.com> wrote:
>>
>>> I'd like to have a set of "allowed characters", and strip a string
>>> from everything besides those.
>>> I've tried and tried but so far every time I enter strings containing
>>> unicode, it goes mad and output makes no sense.
>>
>> How are you entering "strings containing unicode"? Browsers don't send
>> Unicode.
>
>Excuse me? They sure can, depending on the language being used.

Yes, I know better. That was not the sentiment I intended to convey.

>So the rest of your post is immaterial. Steve's suggestion is a lot closer.

Damn you, Stuckle. How can you see anything at all from up there on your high horse?

Despite my faux pas, my suggestion was also correct, your invective notwithstanding.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

#11

"Tim Roberts" <timr@probo.com> wrote in message news:dpf1m3hicifgnclafaplh3jtibdheacfvk@4ax.com...
> "Steve" <no.one@example.com> wrote:
>>
>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>
>>it's not expensive at all. and a dot is any single character...not a
>>greedy wild card. the only reason he wouldn't want a dot is because it
>>could be an 'illegal' character that he's trying to get rid of anyway.
>>as it is, he just didn't escape the dot so that it is the character
>>(period) and not the directive (any single character).
>
> That statement as written will replace each character with itself, one by
> one, repeatedly, for each character in $name.
>
> It is an expensive no-op.

that is true, however each character is analyzed *as a single character*. there is no marker being set and a pattern being sought beyond that marker to see if there is another pattern match. markers are set, the replacement is made to those characters marked, the process is done. one of the least expensive operations one could ask of preg.

may be a good idea to write a pattern you think would be less expensive that does similar things...see if you can time-test compare the two. you can also measure memory consumption too. i don't think you'll find any significant consumption of resources running the above, esp. comparatively.
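
A time-test of the kind suggested above could be sketched like this. The sample input, the stand-in value of `$al`, the loop count and the choice of a negated character class as the alternative pattern are all assumptions made for illustration; actual timings depend on the machine and PHP version, so none are quoted here.

```php
<?php
// Hypothetical stand-in for the thread's $al: è and é as UTF-8.
$al = "\u{E8}\u{E9}";

// Pattern from the thread (with "u" added) and a negated-class alternative.
$original    = "/([a-zA-Z]|-|[$al])|./u";
$alternative = '/[^a-zA-Z' . $al . '-]/u';

// Sample input without newlines, since "." does not match "\n" by default.
$input = str_repeat("Anne-Marie 42! ", 1000);

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    $a = preg_replace($original, '$1', $input);
}
$timeOriginal = microtime(true) - $start;

$start = microtime(true);
for ($i = 0; $i < 100; $i++) {
    $b = preg_replace($alternative, '', $input);
}
$timeAlternative = microtime(true) - $start;

// Both patterns should produce the same string for this input.
var_dump($a === $b);
printf("original:    %.4f s\n", $timeOriginal);
printf("alternative: %.4f s\n", $timeAlternative);
printf("peak memory: %d bytes\n", memory_get_peak_usage());
```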

#12

"Tim Roberts" <timr@probo.com> wrote in message news:i2g1m3983khduu7qghbp1vuitc8offjrj9@4ax.com...
> Jerry Stuckle <jstucklex@attglobal.net> wrote:
>>Tim Roberts wrote:
>>> "Lo'oris" <looris@gmail.com> wrote:
>>>
>>>> I'd like to have a set of "allowed characters", and strip a string
>>>> from everything besides those.
>>>> I've tried and tried but so far every time I enter strings containing
>>>> unicode, it goes mad and output makes no sense.
>>>
>>> How are you entering "strings containing unicode"? Browsers don't send
>>> Unicode.
>>
>>Excuse me? They sure can, depending on the language being used.
>
> Yes, I know better. That was not the sentiment I intended to convey.
>
>>So the rest of your post is immaterial. Steve's suggestion is a lot
>>closer.
>
> Damn you, Stuckle. How can you see anything at all from up there on your
> high horse?

with jerry, it's a matter of people in glass houses. except when you start throwing rocks at his, he will claim you have no rock and that, in fact, you've not broken any windows.

> Despite my faux pas, my suggestion was also correct, your invective
> notwithstanding.

good word, invective...he likes doing that apparently. at least we encounter it often in his posts.

cheers.

#13

"Steve" <no.one@example.com> wrote in message news:HIb8j.33$ts4.16@newsfe07.lga...
>
> "Tim Roberts" <timr@probo.com> wrote in message
> news:dpf1m3hicifgnclafaplh3jtibdheacfvk@4ax.com...
>> "Steve" <no.one@example.com> wrote:
>>>
>>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>>
>>>it's not expensive at all. and a dot is any single character...not a
>>>greedy wild card. the only reason he wouldn't want a dot is because it
>>>could be an 'illegal' character that he's trying to get rid of anyway.
>>>as it is, he just didn't escape the dot so that it is the character
>>>(period) and not the directive (any single character).
>>
>> That statement as written will replace each character with itself, one by
>> one, repeatedly, for each character in $name.
>>
>> It is an expensive no-op.
>
> that is true, however each character is analyzed *as a single character*.
> there is no marker being set and a pattern being sought beyond that marker
> to see if there is another pattern match. markers are set, the replacement
> is made to those characters marked, the process is done. one of the least
> expensive operations one could ask of preg.
>
> may be a good idea to write a pattern you think would be less expense that
> does similar things...see if you can time-test compare the two. you can
> also measure memory consumption too. i don't think you'll find any
> significant consumption of resources running the above, esp.
> comparitively.

sorry tim, i needed to make it clear - as i'd mentioned in one of my first responses to you - that i think the dot in his preg is just mistakenly not escaped. i don't think he means "any single character", rather, "a period". anyway, my comments above are made under that assumption. otherwise you are more right than before, however more in the line of "that's dumb to put or'ed patterns when one of those will basically make the other conditions/patterns moot".

still, in this case, the expense is nominal since all conditions/patterns work over a single character. just thought i'd clarify.

cheers