Newsgroup: comp.lang.php
allowed characters in a string (stripping it)

Posted 11 Dec 2007, 03:47   #1
Lo'oris

I'd like to have a set of "allowed characters", and strip a string
from everything besides those.

I've tried and tried but so far every time I enter strings containing
unicode, it goes mad and output makes no sense.

I'm sure I'm missing something but no idea what.

$ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß" ;
$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß" ;
$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
$SYMBOLS_NAME=".'- ";

first time I tried using something like this:
$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
(bear in mind I do *NOT* know regexp, a friend wrote this line)

now I tried instead using str_split:
function clear_name_complex ($name, $ok_chars) {
    $ass=str_split($ok_chars);
    $al=array();
    foreach ($ass as $a) {
        $al[$a]=TRUE;
    }

    $s=str_split($name);
    $ret="";
    foreach ($s as $c) {
        if (!$al[$c]) continue;
        $ret.=$c;
    }

    return $ret;
}

Still nothing: as soon as the input contains unicode, it goes mad and the
output makes no sense.
I believe that's because in both cases it treats unicode characters by
splitting them into single bytes, but still, I'm clueless about what I am
supposed to do.
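
For reference, a minimal sketch of the same whitelist idea that walks the string by code point rather than by byte. It assumes both strings are valid UTF-8. The function name `clear_name_utf8` is made up for this illustration, and splitting with `preg_split` and the `u` modifier is only one option (newer PHP versions also offer `mb_str_split`).

```php
<?php
// Hypothetical helper: keep only the characters of $name that also appear
// in $okChars. Assumption: both strings are valid UTF-8.
function clear_name_utf8(string $name, string $okChars): string
{
    // With the u modifier the empty pattern splits between code points,
    // not between bytes, so 'ë' stays in one piece.
    $allowed = array_flip(preg_split('//u', $okChars, -1, PREG_SPLIT_NO_EMPTY));

    $ret = '';
    foreach (preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (isset($allowed[$char])) {
            $ret .= $char;
        }
    }
    return $ret;
}

// Example call with made-up input.
echo clear_name_utf8("Zoë <b>!", "Zoeë "), "\n";
```

The lookup table is keyed by whole characters, which is what the `str_split` version could not do for multibyte input.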

Posted 11 Dec 2007, 06:44   #2
Steve


"Lo'oris" <looris@gmail.com> wrote in message
news:f01bb9b2-0ec1-451d-8a1e-438803b1ad3d@r60g2000hsc.googlegroups.com...
> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.
>
> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.
>
> I'm sure I'm missing something but no idea what.
>
> $ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß" ;
> $ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß" ;
> $ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
> $ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
> $ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
> $ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
> $SYMBOLS_NAME=".'- ";
>
> first time I tried using something like this:
> $name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
> (bear in mind I do *NOT* know regexp, a friend wrote this line)

preg_replace is your best bet.

$notAllowed = "/[^a-zaaeeiioouu?]/si";
$newString = preg_replace($notAllowed, '', $stringToSearch);

or something to that effect...it tested fine in 'the regulator' - a regex
tester.
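
A hedged sketch of the negated-character-class approach with UTF-8 input. The helper name `strip_disallowed`, the sample string and the list of extra characters are all invented for this example. The `u` modifier is an addition to the pattern above and assumes that both the script and the input are UTF-8.

```php
<?php
// Hypothetical helper: remove every character that is neither an ASCII
// letter nor listed in $extraAllowed. Assumption: valid UTF-8 throughout.
function strip_disallowed(string $stringToSearch, string $extraAllowed): string
{
    // preg_quote() escapes characters that are special in a pattern
    // (such as '.' and '-') and the '/' delimiter passed as second argument.
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    $notAllowed = '/[^a-z' . preg_quote($extraAllowed, '/') . ']/iu';
    return preg_replace($notAllowed, '', $stringToSearch);
}

// Example call with made-up data.
$newString = strip_disallowed("Chloë O'Néill <3", "àèìòùáéíóú.'- ");
echo $newString, "\n";
```

Without the `u` modifier the character class would be built from individual bytes, which is the likely source of the garbage output described in the question.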


Posted 11 Dec 2007, 07:37   #3
Tim Roberts

"Lo'oris" <looris@gmail.com> wrote:

>I'd like to have a set of "allowed characters", and strip a string
>from everything besides those.
>
>I've tried and tried but so far every time I enter strings containing
>unicode, it goes mad and output makes no sense.


How are you entering "strings containing unicode"? Browsers don't send
Unicode.

>I'm sure I'm missing something but no idea what.
>
>$ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß" ;
>$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß" ;
>$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>$SYMBOLS_NAME=".'- ";
>
>first time I tried using something like this:
>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>(bear in mind I do *NOT* know regexp, a friend wrote this line)


This matches one of 4 things:
    [a-zA-Z]   an upper or lower case letter,
    |          or
    -          a hyphen,
    |          or
    [$al]      whatever is contained in $al,
    |          or
    .          any character,

and replaces it with whatever was matched. Clearly, this is a very
expensive no-op. You do NOT want the '.' in there.
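
One way to check what the quoted pattern does is to run it on a small ASCII sample. In the sketch below, the contents of `$al` and the sample name are made up. Characters matched only by the trailing `.` leave group 1 unset, so `'$1'` expands to an empty string for them.

```php
<?php
// Hypothetical stand-in for the poster's $al: a few extra allowed characters.
$al = ".' ";
$name = "Al-Bob 42!";

// Group 1 captures an allowed character. The final '.' consumes any other
// character without capturing it, so the replacement '$1' is empty there.
$result = preg_replace("/([a-zA-Z]|-|[$al])|./", '$1', $name);

var_dump($result);
```

On single-byte input the statement therefore behaves as a filter. With multibyte characters inside `[$al]` and no `u` modifier, the class is made of individual bytes and the result is no longer reliable.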

I would suggest this:

preg_match_all("/[$al]+/", $name, $out);
$result = implode('', $out[0]);
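
A small sketch of this suggestion applied to UTF-8 input. The contents of `$al` and `$name` are invented, and the `u` modifier is an addition that assumes the script and the input are both UTF-8.

```php
<?php
// Hypothetical allow-list, written as the inside of a character class.
// The hyphen is escaped so that it is not read as a range.
$al = "a-zA-Z.'\\- àèéëï";
$name = "Zoë <script>";

// The u modifier (not in the suggestion above) makes PCRE treat pattern and
// subject as UTF-8, so 'ë' counts as one character instead of two bytes.
preg_match_all("/[$al]+/u", $name, $out);

// $out[0] holds every run of allowed characters; gluing the runs together
// drops whatever sat between them.
$result = implode('', $out[0]);
echo $result, "\n";
```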
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Posted 11 Dec 2007, 08:12   #4
Steve


"Tim Roberts" <timr@probo.com> wrote in message
news:gtasl39ff8fmcb5i84k51i0sdb4v14bn15@4ax.com...
> "Lo'oris" <looris@gmail.com> wrote:
>
>>I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>>
>>I've tried and tried but so far every time I enter strings containing
>>unicode, it goes mad and output makes no sense.

>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.
>
>>I'm sure I'm missing something but no idea what.
>>
>>$ACCENTED_ALL_LOW="èìòùáéóúâêîôû ëïöüÿãõñçæøåăāĕēīŏōūəß" ;
>>$ACCENTED_ALL_BIG="ÀÈÌÒÙÁÉÍÓÚÂÊÎÔÛ ËÏÖÜŸÃÕÑÇÆØÅĂĀĔĒĬĪŎŌŬŪƏß" ;
>>$ACCENTED_ALL=$ACCENTED_ALL_LOW.$ACCENTED_ALL_BIG;
>>$ALPHABET_LOW="qwertyuiopasdfghjklzxcvbnm";
>>$ALPHABET_BIG="QWERTYUIOPASDFGHJKLZXCVBNM";
>>$ALPHABET_ALL=$ALPHABET_LOW.$ALPHABET_BIG;
>>$SYMBOLS_NAME=".'- ";
>>
>>first time I tried using something like this:
>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>(bear in mind I do *NOT* know regexp, a friend wrote this line)

>
> This matches one of 4 things:
> [a-zA-Z] an upper or lower case letter,
> | or
> - a hyphen
> | or
> [$al] whatever is contained in $al
> | or
> . any character,
>
> and replaces it with whatever was matched. Clearly, this is a very
> expensive no-op. You do NOT want the '.' in there.
>
> I would suggest this:
>
> preg_match_all("/[$al]+/", $name, $out);
> $result = implode('', $out[0]);


it's not expensive at all. and a dot is any single character...not a greedy
wild card. the only reason he wouldn't want a dot is because it could be an
'illegal' character that he's trying to get rid of anyway. as it is, he just
didn't escape the dot so that it is the character (period) and not the
directive (any single character).


Posted 11 Dec 2007, 08:18   #5
Steve


"Tim Roberts" <timr@probo.com> wrote in message
news:gtasl39ff8fmcb5i84k51i0sdb4v14bn15@4ax.com...
> "Lo'oris" <looris@gmail.com> wrote:
>
>>I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>>
>>I've tried and tried but so far every time I enter strings containing
>>unicode, it goes mad and output makes no sense.

>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.


ummm, what planet are you from? asian people browse the internet
too...amazingly enough. i wonder why unicode was invented? probably has
NOTHING to do with the asian alphabet. then there are those fussy farsi
people. oh yeah, and the fucking russians! ok, germans too. then there's
the...damnit if everyone in the world won't speak and use the english
language! plus, we have to genuflect for them? hell no! i agree with
you...there should be no browsers developed to ever allow characters outside
the NORMUL UMMERIKUN language that gawd intended awl men to speak and write!


Posted 11 Dec 2007, 12:08   #6
Toby A Inkster

Steve wrote:

> damnit if everyone in the world won't speak and use the english
> language!


Some English words use non-ASCII characters too.

There are obviously those adopted from other languages such as résumé,
café and crèche. Many such words will lose their accents when they come to
English, but the examples given in the previous sentence typically retain
them.

The forenames Chloë and Zoë use a diaeresis mark to indicate that the 'e'
should be pronounced independently from the 'o' sound.

The surname Brontë has a similar mark, though that was a fanciful addition
by their father. They were originally from Ireland and thought that the
English might have a hard time correctly pronouncing Proinntigh, so
anglicised it. One might wonder why he thought adding an 'ë' counted as
anglicising, but at that time, the mark was quite common, used in words
such as coöperate, coördinate, reënact and noöne. (Now that the mark has
become less common, these words are somewhat more awkward to spell. Using
a hyphen looks wrong, but the words look even worse without!) It is still
retained in naïve.

Also, 'æ' and to a lesser extent, 'œ' are used in many words. (Try looking
up pre-mediæval diseases of the œsophagus in an encyclopædia, and you
might find that many of them have names which are an onomatopœia.)

All of these are still readable when transliterated into a purely ASCII
alphabet, though 'resume' is ambiguous.

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 3 days, 21:11.]

Sharing Music with Apple iTunes
http://tobyinkster.co.uk/blog/2007/1...tunes-sharing/

Posted 11 Dec 2007, 12:11   #7
Jerry Stuckle

Tim Roberts wrote:
> "Lo'oris" <looris@gmail.com> wrote:
>
>> I'd like to have a set of "allowed characters", and strip a string
>>from everything besides those.
>> I've tried and tried but so far every time I enter strings containing
>> unicode, it goes mad and output makes no sense.

>
> How are you entering "strings containing unicode"? Browsers don't send
> Unicode.
>


Excuse me? They sure can, depending on the language being used.

So the rest of your post is immaterial. Steve's suggestion is a lot closer.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Posted 12 Dec 2007, 19:41   #8
AnrDaemon

Greetings, Lo'oris.
In reply to Your message dated Tuesday, December 11, 2007, 05:47:08,

> I'd like to have a set of "allowed characters", and strip a string
> from everything besides those.


> I've tried and tried but so far every time I enter strings containing
> unicode, it goes mad and output makes no sense.


I think we should stop here and decide what we can do to be sure we have
proper user input before dealing with it.

Typical case: You're working with a posting form and forgot to set the
accept-charset attribute on it, leaving the user agent to decide for itself
how to send the data. In most cases it returns the data in the encoding Your
server used to send Your pages to the user. But that holds only for a fair
browser, like Opera, and only if Your server declares an encoding at all.

Sorry, I can't add anything beyond this explanation, as it is not really a
PHP question.

The worst case is when You cannot take any action to get user input in a form
You can handle. Then You can try to detect the encoding of the data passed to
Your application. Hopefully there are some articles on the Net covering this
task.
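
As a rough illustration of that last resort, here is a sketch that validates incoming data as UTF-8 and otherwise tries a short list of candidate encodings. It relies on PHP's mbstring extension being available. The function name `normalize_to_utf8` and the candidate list are assumptions made for this example, and encoding detection is a guess by nature, so declaring the form's encoding up front remains the safer route.

```php
<?php
// Hypothetical helper: return $input as UTF-8, or null if no candidate fits.
// Requires the mbstring extension.
function normalize_to_utf8(string $input, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string
{
    // Already valid UTF-8: nothing to do.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Strict mode rejects a candidate as soon as one byte sequence is invalid.
    $detected = mb_detect_encoding($input, $candidates, true);
    if ($detected === false) {
        return null;
    }

    return mb_convert_encoding($input, 'UTF-8', $detected);
}

// "caf\xE9" is 'café' encoded as ISO-8859-1 (made-up sample input).
var_dump(normalize_to_utf8("caf\xE9"));
```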


--
Sincerely Yours, AnrDaemon <anrdaemon@freemail.ru>

Posted 13 Dec 2007, 06:16   #9
Tim Roberts

"Steve" <no.one@example.com> wrote:
>
>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);

>
>it's not expensive at all. and a dot is any single character...not a greedy
>wild card. the only reason he wouldn't want a dot is because it could be an
>'illegal' character that he's trying to get rid of anyway. as it is, he just
>didn't escape the dot so that it is the character (period) and not the
>directive (any single character).


That statement as written will replace each character with itself, one by
one, repeatedly, for each character in $name.

It is an expensive no-op.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Posted 13 Dec 2007, 06:21   #10
Tim Roberts

Jerry Stuckle <jstucklex@attglobal.net> wrote:
>Tim Roberts wrote:
>> "Lo'oris" <looris@gmail.com> wrote:
>>
>>> I'd like to have a set of "allowed characters", and strip a string
>>>from everything besides those.
>>> I've tried and tried but so far every time I enter strings containing
>>> unicode, it goes mad and output makes no sense.

>>
>> How are you entering "strings containing unicode"? Browsers don't send
>> Unicode.

>
>Excuse me? They sure can, depending on the language being used.


Yes, I know better. That was not the sentiment I intended to convey.

>So the rest of your post is immaterial. Steve's suggestion is a lot closer.


Damn you, Stuckle. How can you see anything at all from up there on your
high horse?

Despite my faux pas, my suggestion was also correct, your invective
notwithstanding.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Posted 13 Dec 2007, 15:54   #11
Steve


"Tim Roberts" <timr@probo.com> wrote in message
news:dpf1m3hicifgnclafaplh3jtibdheacfvk@4ax.com...
> "Steve" <no.one@example.com> wrote:
>>
>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);

>>
>>it's not expensive at all. and a dot is any single character...not a
>>greedy
>>wild card. the only reason he wouldn't want a dot is because it could be
>>an
>>'illegal' character that he's trying to get rid of anyway. as it is, he
>>just
>>didn't escape the dot so that it is the character (period) and not the
>>directive (any single character).

>
> That statement as written will replace each character with itself, one by
> one, repeatedly, for each character in $name.
>
> It is an expensive no-op.


that is true, however each character is analyzed *as a single character*.
there is no marker being set and a pattern being sought beyond that marker
to see if there is another pattern match. markers are set, the replacement
is made to those characters marked, the process is done. one of the least
expensive operations one could ask of preg.

may be a good idea to write a pattern you think would be less expensive that
does similar things...see if you can time-test compare the two. you can also
measure memory consumption too. i don't think you'll find any significant
consumption of resources running the above, esp. comparatively.
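
A rough sketch of such a time test, comparing the alternation pattern with a single negated character class. The helper `time_pattern`, the sample data and the iteration count are invented for this example, and the numbers it prints depend entirely on the machine and PHP build, so no result is claimed here.

```php
<?php
// Hypothetical helper: run $fn repeatedly and return the elapsed seconds.
function time_pattern(callable $fn, int $iterations = 2000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Made-up ASCII sample, repeated to give the regex engine some work.
$name = str_repeat("John O'Neil-Smith 123 ", 50);

// Pattern from the thread, with a literal character list in place of $al.
$alternation = function () use ($name) {
    return preg_replace("/([a-zA-Z]|-|[.' ])|./", '$1', $name);
};

// Alternative: delete everything outside one negated character class.
$negated = function () use ($name) {
    return preg_replace("/[^a-zA-Z\\-.' ]/", '', $name);
};

printf("alternation:   %.4f s\n", time_pattern($alternation));
printf("negated class: %.4f s\n", time_pattern($negated));
printf("peak memory:   %d bytes\n", memory_get_peak_usage());
```

Both closures return the same string for this sample, so the comparison is between two ways of doing the same job.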


Posted 13 Dec 2007, 15:58   #12
Steve


"Tim Roberts" <timr@probo.com> wrote in message
news:i2g1m3983khduu7qghbp1vuitc8offjrj9@4ax.com...
> Jerry Stuckle <jstucklex@attglobal.net> wrote:
>>Tim Roberts wrote:
>>> "Lo'oris" <looris@gmail.com> wrote:
>>>
>>>> I'd like to have a set of "allowed characters", and strip a string
>>>>from everything besides those.
>>>> I've tried and tried but so far every time I enter strings containing
>>>> unicode, it goes mad and output makes no sense.
>>>
>>> How are you entering "strings containing unicode"? Browsers don't send
>>> Unicode.

>>
>>Excuse me? They sure can, depending on the language being used.

>
> Yes, I know better. That was not the sentiment I intended to convey.
>
>>So the rest of your post is immaterial. Steve's suggestion is a lot
>>closer.

>
> Damn you, Stuckle. How can you see anything at all from up there on your
> high horse?


with jerry, it's a matter of people in glass houses. except when you start
throwing rocks at his, he will claim you have no rock and that, in fact,
you've not broken any windows.

> Despite my faux pas, my suggestion was also correct, your invective
> notwithstanding.


good word, invective...he likes doing that apparently. at least we encounter
it often in his posts.

cheers.


Posted 13 Dec 2007, 16:12   #13
Steve


"Steve" <no.one@example.com> wrote in message
news:HIb8j.33$ts4.16@newsfe07.lga...
>
> "Tim Roberts" <timr@probo.com> wrote in message
> news:dpf1m3hicifgnclafaplh3jtibdheacfvk@4ax.com...
>> "Steve" <no.one@example.com> wrote:
>>>
>>>>$name=preg_replace("/([a-zA-Z]|-|[$al])|./",'$1',$name);
>>>
>>>it's not expensive at all. and a dot is any single character...not a
>>>greedy
>>>wild card. the only reason he wouldn't want a dot is because it could be
>>>an
>>>'illegal' character that he's trying to get rid of anyway. as it is, he
>>>just
>>>didn't escape the dot so that it is the character (period) and not the
>>>directive (any single character).

>>
>> That statement as written will replace each character with itself, one by
>> one, repeatedly, for each character in $name.
>>
>> It is an expensive no-op.

>
> that is true, however each character is analyzed *as a single character*.
> there is no marker being set and a pattern being sought beyond that marker
> to see if there is another pattern match. markers are set, the replacement
> is made to those characters marked, the process is done. one of the least
> expensive operations one could ask of preg.
>
> may be a good idea to write a pattern you think would be less expensive that
> does similar things...see if you can time-test compare the two. you can
> also measure memory consumption too. i don't think you'll find any
> significant consumption of resources running the above, esp.
> comparatively.


sorry tim, i needed to make it clear - as i'd mentioned in one of my first
responses to you - that i think the dot in his preg is just mistakenly not
escaped. i don't think he means "any single character", rather, "a period".
anyway, my comments above are made under that assumption. otherwise you are
more right than before, however more in the line of "that's dumb to put
or'ed patterns when one of those will basically make the other
conditions/patterns moot". still, in this case, the expense is nominal since
all conditions/patterns work over a single character.

just thought i'd clarify.

cheers

