testing stdin for bad encoding, ruby 1.9

Posted 11/05/2008, 16:49   #1
Ben Crowell

I have some existing ruby 1.9 code that broke recently with a new build
of ruby. It looks like the problem was that my preexisting text input
files, which I'd been reading from stdin, contained some characters that
were not valid UTF-8 or US-ASCII. The latest version throws an error in
this situation:

$ ruby --version
ruby 1.9.0 (2008-04-26 revision 0) [x86_64-linux]
$ cat a.rb
#!/usr/bin/ruby

t = $stdin.gets(nil)

t.gsub!(/a/) {'b'}

$ ruby -e 'print "\332"' | a.rb
./a.rb:5:in `gsub!': broken UTF-8 string (ArgumentError)
from ./a.rb:5:in `<main>'

I'm happy to change the input files, because it is an error that they
aren't properly encoded. However, I'd also like to find some way to test
for this type of error more gracefully, and I can't seem to figure out
how to do it. I was originally thinking of something like this:

#!/usr/bin/ruby

t = $stdin.gets(nil)
if t=~/([^\n]*[^\000-\177][^\n]*)/ then
  $stderr.print "Bad ASCII character detected in this line:\n#{$1}\n"
end

(In my application, the string t may be thousands of lines long.)
However, this doesn't work, because the attempt to test t against a
regex fails with an ArgumentError.

Googling turns up some references to magic comments, but I haven't
been able to find any information on what magic comments are.
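For context: in Ruby 1.9 a magic comment is a specially formatted comment on the first line of a source file (or the second, after a shebang) that declares the encoding of the file itself. A minimal sketch, assuming a UTF-8 source file; the variable names are only illustrative:

```ruby
# encoding: utf-8
# The first line is a magic comment: it declares the encoding of this source
# file, which the string literals written in the file then inherit.

source_encoding  = __ENCODING__     # encoding declared for this file
literal_encoding = "abc".encoding   # string literals take the source encoding

puts "source: #{source_encoding}, literal: #{literal_encoding}"
```

Note that this only describes the source file. It does not change how data read from $stdin is tagged, so on its own it is unlikely to fix the error above.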

Thanks in advance!

Posted 11/05/2008, 19:05   #2
Alex Fenton

Ben Crowell wrote:
> I have some existing ruby 1.9 code that broke recently with a new build
> of ruby. It looks like the problem was that my preexisting text input
> files, which I'd been reading from stdin, contained some characters that
> were not valid UTF-8 or US-ASCII.


....

> I'm happy to change the input files, because it is an error that they
> aren't properly encoded. However, I'd also like to find some way to test
> for this type of error more gracefully, and I can't seem to figure out
> how to do it.


I use Iconv in the standard library to convert from UTF-8 to UTF-8 to test
whether files being imported by a user are in fact in the right
encoding. This otherwise redundant recoding will raise an
Iconv::IllegalSequence error if there's a problem. This can be caught and
reported.
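
Iconv was later dropped from the standard library (in Ruby 2.0), so the same round-trip idea can be sketched with String#encode instead. This is only an illustration under that assumption, not the code described above; the helper name utf8_problem is hypothetical. It transcodes to UTF-16BE because encoding a string to the encoding it already carries is a no-op in Ruby and would not validate anything.

```ruby
# Hypothetical helper (not from the original post): returns nil when the
# bytes are valid UTF-8, or the error message for the first bad sequence.
def utf8_problem(bytes)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  # Transcoding to a different encoding forces every character to be decoded.
  text.encode(Encoding::UTF_16BE)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.message
end
```

A caller could run utf8_problem($stdin.read) and print the returned message to $stderr when it is not nil.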

a


Posted 11/05/2008, 23:14   #3
Ben Crowell

Alex Fenton wrote:
> Ben Crowell wrote:
>> I have some existing ruby 1.9 code that broke recently with a new build
>> of ruby. It looks like the problem was that my preexisting text input
>> files, which I'd been reading from stdin, contained some characters that
>> were not valid UTF-8 or US-ASCII.

>
> ...
>
>> I'm happy to change the input files, because it is an error that they
>> aren't properly encoded. However, I'd also like to find some way to test
>> for this type of error more gracefully, and I can't seem to figure out
>> how to do it.

>
> I use Iconv in the standard library to convert from UTF-8 to UTF-8 to test
> whether files being imported by a user are in fact in the right
> encoding. This otherwise redundant recoding will raise an
> Iconv::IllegalSequence error if there's a problem. This can be caught and
> reported.


Thanks for the suggestion. However, I already have an error that I can
catch and report. The problem is that it's not very helpful to the user
to say, "hey, somewhere in your 100-page text file, there are illegal
characters." That's why I was trying to do this:
if t=~/([^\n]*[^\000-\177][^\n]*)/ then
  $stderr.print "Bad ASCII character detected in this line:\n#{$1}\n"
end
It seems to me that I need some way to convince Ruby that the string t
is in an encoding where all characters are a single byte, and it's ok
to have the high bit set. Then I could go ahead and use regexes to test
whether it contains any characters with the high bit set, and report
them properly. It just seems like the string, once I read it in, is
like the Medusa -- my program doesn't even dare take a peek at it for
fear of being turned to stone.
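
One way to get the single-byte view described above might be to retag the string as binary with String#force_encoding, which changes only the encoding label and leaves the bytes untouched. The following is a rough sketch under that assumption; the helper name lines_with_high_bit is hypothetical, and it uses String#ascii_only? rather than a regex to look for bytes with the high bit set.

```ruby
# Hypothetical helper: returns [line_number, line] pairs for every line that
# contains at least one byte with the high bit set (a byte above 0x7F).
def lines_with_high_bit(text)
  # Retag a copy as binary: one byte per character, and any byte value is legal.
  raw = text.dup.force_encoding(Encoding::BINARY)
  bad = []
  raw.each_line.with_index(1) do |line, number|
    # ascii_only? is false as soon as a single byte is outside 0x00-0x7F.
    bad << [number, line.chomp] unless line.ascii_only?
  end
  bad
end

# Possible use with the whole of stdin (left commented out so that loading
# this sketch does not block waiting for input):
#   lines_with_high_bit($stdin.read).each do |number, line|
#     $stderr.puts "Non-ASCII byte on line #{number}: #{line.inspect}"
#   end
```

Printing with inspect shows the offending bytes as escapes such as \xDA, which avoids writing the broken bytes back to the terminal.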