invalid byte sequence in US-ASCII (ArgumentError)

Old 02/16/09, 00:15   #1
Luther

I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
code:

text.gsub! "\C-m", ''

...which generates this error:

/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
US-ASCII (ArgumentError)

The purpose is to strip out any ^M characters from the string. I've
tried a couple of different magic comments with utf-8, but the error
message still shows the same "US-ASCII". I also tried changing the \C-m
to 13.chr, but I still got the same error, suggesting that control
characters aren't even allowed in strings anymore.

I'm sure this must be a common migration problem, but I can't find a
solution no matter how hard I search the web. Any help would be greatly
appreciated.

Luther


Old 02/16/09, 00:19   #2
Tim Hunter

Luther wrote:
> I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
> code:
>
> text.gsub! "\C-m", ''
>
> ...which generates this error:
>
> /home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
> US-ASCII (ArgumentError)
>
> The purpose is to strip out any ^M characters from the string. I've
> tried a couple of different magic comments with utf-8, but the error
> message still shows the same "US-ASCII". I also tried changing the \C-m
> to 13.chr, but I still got the same error, suggesting that control
> characters aren't even allowed in strings anymore.
>
> I'm sure this must be a common migration problem, but I can't find a
> solution no matter how hard I search the web. Any help would be greatly
> appreciated.
>
> Luther
>
>
>


Since Ruby is claiming the source file is US-ASCII it seems likely that
it's not noticing the magic comment. Make sure your magic comment is the
first line in the script, or if you're using a shebang line, the second
line. That is, either

# encoding: utf-8

or

#! /usr/local/bin/ruby
# encoding: utf-8

--
RMagick: http://rmagick.rubyforge.org/

Old 02/16/09, 02:18   #3
Luther

On Mon, 2009-02-16 at 09:19 +0900, Tim Hunter wrote:
> Luther wrote:
> > I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
> > code:
> >
> > text.gsub! "\C-m", ''
> >
> > ...which generates this error:
> >
> > /home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
> > US-ASCII (ArgumentError)
> >

> Since Ruby is claiming the source file is US-ASCII it seems likely that
> it's not noticing the magic comment. Make sure your magic comment is the
> first line in the script, or if you're using a shebang line, the second
> line. That is, either
>
> # encoding: utf-8
>
> or
>
> #! /usr/local/bin/ruby
> # encoding: utf-8
>


I put the encoding line right after my shebang line, but it had no
effect.

In further investigation, I tried running my program on a different text
file, and it worked fine. The original text file had some very odd
characters at the beginning and the end of the file. Once I deleted that
metadata, my program worked fine.

This means the problem was with the "text" variable rather than the
arguments. This seems very wrong to me since it threw an ArgumentError.
Or maybe I don't know anything about exceptions.

So, my problem is partially solved, but now I know my program will puke
on any text file with multibyte characters.

Luther


Old 02/16/09, 05:20   #4
Tom Link

> but now I know my program will puke
> on any text file with multibyte characters.


Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writing...-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In your case, you should probably read the files as ASCII-8BIT
or binary, I guess.
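
To make that concrete, here is a minimal sketch. The sample bytes are made up (the real file's contents are unknown) and the helper name `strip_carriage_returns` is only for illustration. The same bytes are invalid when labelled US-ASCII but always valid when labelled binary, so the ^M stripping becomes a plain byte operation:

```ruby
# Strip carriage returns (^M) from raw bytes without caring what else is in there.
# (Hypothetical helper, not from the original script.)
def strip_carriage_returns(bytes)
  # Relabel a copy as binary: every byte sequence is valid in that encoding.
  bytes.dup.force_encoding(Encoding::BINARY).gsub("\r", "")
end

# Hypothetical sample: mostly ASCII, plus two stray high bytes.
raw = "Helvetica\xFF\xFE\r\nplain text\r\n"

puts raw.dup.force_encoding(Encoding::US_ASCII).valid_encoding? # high bytes are not ASCII
puts raw.dup.force_encoding(Encoding::BINARY).valid_encoding?   # binary accepts any bytes
puts strip_carriage_returns(raw).inspect
```

The result stays tagged as binary, so anything that later treats it as text has to relabel or transcode it explicitly.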

Old 02/16/09, 09:44   #5
Brian Candler

Tom Link wrote:
>> but now I know my program will puke
>> on any text file with multibyte characters.

>
> Not necessarily.
>
> Here is a useful summary of encodings in 1.9:
> http://blog.nuclearsquid.com/writing...-1-9-encodings
>
> Basically, you have script encoding, internal encoding, and external
> encoding. In your case, you should probably read the files as ASCII8BIT
> or binary, I guess.


Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

So if you deal with data which is not text (as I do all the time), you
need to put

File.open("....", :encoding => "BINARY")

everywhere. And even then, if you ask the open File object what its
encoding is, it will say ASCII8BIT, even though you explicitly told it
that it's BINARY.

This is because "BINARY" is just a synonym for "ASCII8BIT" in ruby. Of
course, there is plenty of data out there which is not encoded using the
American Standard Code for Information Interchange. MIME distinguishes
clearly between 8BIT (text with high bit set) and BINARY (non-text). In
terms of Ruby's processing it makes no difference, but it's annoying for
Ruby to tell me that my data is text, when it is not.

Note: in more recent 1.9's, I believe that

File.open("....", "rb")

has the effect of doing two things:
1. Disabling line-ending translation under Windows
2. Setting encoding to ASCII8BIT

So this may be sufficient for your needs, and it has the advantage that
the same code will run under ruby <1.9.
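
A small sketch of the "rb" approach, assuming a throwaway temp file with made-up bytes (`read_bytes` is a hypothetical helper name). It also checks the point above that BINARY and ASCII-8BIT are one and the same encoding object:

```ruby
require "tempfile"

# Read a file as raw bytes; the "rb" mode string is also accepted by Ruby 1.8.
def read_bytes(path)
  File.open(path, "rb") { |f| f.read }
end

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write("abc\xFF\r\n") # hypothetical content with one non-ASCII byte
  tmp.flush

  data = read_bytes(tmp.path)
  puts data.encoding == Encoding::ASCII_8BIT          # was the data tagged as binary?
  puts Encoding::BINARY.equal?(Encoding::ASCII_8BIT)  # are the two names one encoding?
end
```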
--
Posted via http://www.ruby-forum.com/.

Old 02/16/09, 09:50   #6
Brian Candler

Brian Candler wrote:
> Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
> all external data is text, unless explicitly told otherwise.


And worse: the encoding chosen comes from the environment. So your
program which you developed on one system and runs correctly there may
fail totally on another.

I'm not saying that Ruby shouldn't handle encodings and conversions;
I'm just saying you should ask for them. For example:

File.open("....", :encoding => "UTF-8") # Use this encoding
File.open("....", :encoding => "ENV") # Follow the environment
File.open("....") # No idea, treat as binary

I'm not going to use 1.9 without wrapper scripts to invoke Ruby with
appropriate flags to force the external encoding to a fixed value. And
that's a pain.
--
Posted via http://www.ruby-forum.com/.

Old 02/16/09, 11:45   #7
Stefan Lang

2009/2/16 Brian Candler <b.candler@pobox.com>:
> Brian Candler wrote:
>> Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
>> all external data is text, unless explicitly told otherwise.


Ruby must choose between treating all external data as
text unless told otherwise or treat everything as binary
unless told otherwise, because there is no general way
to know if a file is binary or text.

Given that Ruby is mostly used to work with text, it's a
sensible decision to use text mode by default.

Also, if you open a file with the "b" flag, it sets the file's
encoding to binary. You should use that flag in 1.8, too,
otherwise Windows will do line ending conversion, corrupting
your binary data.

> And worse: the encoding chosen comes from the environment. So your
> program which you developed on one system and runs correctly there may
> fail totally on another.


It has to default to some encoding. Your OS installation has a
default encoding. It's a sane decision to use that, because
otherwise many scripts wouldn't work by default on your machine.

> I'm not saying that Ruby shouldn't handle encodings and conversions;
> I'm just saying you should ask for them. For example:
>
> File.open("....", :encoding => "UTF-8") # Use this encoding


Well, you can do exactly that...

> File.open("....", :encoding => "ENV") # Follow the environment


This is the default.

> File.open("....") # No idea, treat as binary


Use "b" flag, which you should do on 1.8 anyway.

> I'm not going to use 1.9 without wrapper scripts to invoke Ruby with
> appropriate flags to force the external encoding to a fixed value. And
> that's a pain.


You can set it with Encoding.default_external= at the top of your script.

Stefan

Old 02/16/09, 12:22   #8
Brian Candler

Stefan Lang wrote:
> Ruby must choose between treating all external data as
> text unless told otherwise or treat everything as binary
> unless told otherwise, because there is no general way
> to know if a file is binary or text.


Yes (and I wouldn't want it to try to guess)

> Given that Ruby is mostly used to work with text, it's a
> sensible decision to use text mode by default.


That's where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshal, DRb... Furthermore, as the OP
demonstrated, there are plenty of use cases where files are presented
which are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.

I don't want my programs to crash in these cases.

> It has to default to some encoding.


That's where I also disagree. It can default to stream of bytes.

>> File.open("....", :encoding => "ENV") # Follow the environment

>
> This is the default.


That's what I don't want. Given this default, I must either:

(1) Force all my source to have the correct encoding flag set
everywhere. If I don't test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they'd have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check no exception is raised.

(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.

> Encoding.default_external=


I guess I can use that at the top of everything in bin/ directory. It
may be sufficient, but it's annoying to have to remember that too.
--
Posted via http://www.ruby-forum.com/.

Old 02/16/09, 14:12   #9
Stefan Lang

2009/2/16 Brian Candler <b.candler@pobox.com>:
> Stefan Lang wrote:
>> Ruby must choose between treating all external data as
>> text unless told otherwise or treat everything as binary
>> unless told otherwise, because there is no general way
>> to know if a file is binary or text.

>
> Yes (and I wouldn't want it to try to guess)
>
>> Given that Ruby is mostly used to work with text, it's a
>> sensible decision to use text mode by default.

>
> That's where I disagree. There are tons of non-text applications:
> images, compression, PDFs, Marshal, DRb...


Point taken.

> Furthermore, as the OP
> demonstrated, there are plenty of use cases where files are presented
> which are almost ASCII, but not quite. The default behaviour now is to
> crash, rather than to treat these as streams of bytes.
>
> I don't want my programs to crash in these cases.


Let's compare. Situation: I'm reading binary files and
forget to specify the "b" flag when opening the file(s).

Result in Ruby 1.8:
* My stuff works fine on Linux/Unix. Somebody else runs
the script on Windows, the script corrupts data because
Windows does line ending conversion.

Result in Ruby 1.9:
* On the first run on my Linux machine, I get an EncodingError.
I fix the problem by specifying the "b" flag on open. Done.

I definitely prefer Ruby 1.9 behavior.

>> It has to default to some encoding.

>
> That's where I also disagree. It can default to stream of bytes.
>
>>> File.open("....", :encoding => "ENV") # Follow the environment

>>
>> This is the default.

>
> That's what I don't want. Given this default, I must either:


Assuming the default is always wrong.

> (1) Force all my source to have the correct encoding flag set
> everywhere. If I don't test for this, my programs will fail in
> unexpected ways. Tests for this are awkward; they'd have to set the
> environment to a certain locale (e.g. UTF-8), pass in data which is not
> valid in that locale, and check no exception is raised.
>
> (2) Use a wrapper script either to call Ruby with the correct
> command-line flags, or to sanitise the environment.
>
>> Encoding.default_external=

>
> I guess I can use that at the top of everything in bin/ directory. It
> may be sufficient, but it's annoying to have to remember that too.


Here's why the default is good, IMO. The cases where I really
don't want to explicitly specify encodings is when I write one liners
(-e) and short throwaway scripts. If the default encoding were binary,
string operations would deal incorrectly with German (my native language)
accents. Using the locale encoding does the right thing here.

If I write a longer program, explicitly setting the default external
encoding isn't an effort worth mentioning. Set it to ASCII_8BIT
and it behaves like 1.8.
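
As a rough sketch of that last point, assuming a temp file with arbitrary non-UTF-8 bytes: `with_default_external` is a made-up helper that restores the previous default afterwards, whereas a real script would more likely just assign `Encoding.default_external` once at the top.

```ruby
require "tempfile"

# Run a block with Encoding.default_external temporarily changed.
# (Hypothetical helper for this sketch only.)
def with_default_external(encoding)
  saved = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = saved
end

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write("gr\xFC\xDF\r\n") # arbitrary bytes that are not valid UTF-8
  tmp.flush

  with_default_external(Encoding::ASCII_8BIT) do
    text = File.read(tmp.path)        # no encoding given, so the default applies
    puts text.encoding                # reflects the default set above
    puts text.gsub("\r", "").bytesize # byte-wise substitution, no encoding check
  end
end
```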

Stefan

Old 02/16/09, 15:33   #10
Tom Link

> Result in Ruby 1.8:
> * My stuff works fine on Linux/Unix. Somebody else runs
>   the script on Windows, the script corrupts data because
>   Windows does line ending conversion.
>
> Result in Ruby 1.9:
> * On the first run on my Linux machine, I get an EncodingError.
>   I fix the problem by specifying the "b" flag on open. Done.


Actually, this is a point I have never quite understood. Why does only
the windows version convert line endings? Is it out of the question that
somebody could want to process a text file created under windows on a
linux box or virtual machine? Regular expressions that check only for
\n but not \r won't work then. Now you could of course take the stance
that you simply have to check for \r too, but then why automatically
convert line separators under windows? Or did I miss something
obvious?

This is also the reason why I think opening text files as binary isn't
really a solution. It leads to either convoluted regexps or non-
portable code. (Unless I missed something obvious, which is quite
possible.)
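
For what it's worth, the regexp need not be very convoluted. A minimal sketch (`normalize_newlines` is a hypothetical helper name) that normalizes line endings up front, so later patterns only have to deal with \n:

```ruby
# Normalize CRLF (Windows) and lone CR line endings to LF.
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

puts normalize_newlines("one\r\ntwo\rthree\n").inspect

# String#chomp with no argument also removes a trailing "\r\n".
puts "last line\r\n".chomp.inspect
```

This only addresses the line-ending half of the problem; the string still has to be valid in its encoding for the regexp match to go through.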

I personally find it somewhat confusing having to juggle with
different encodings. IMHO it would have been preferable to define a
fixed internal encoding (UTF-16 or whatever) and to transcode every
string that is known to be text and identifiers to that canonical/
uniform encoding and to deal with everything else as a sequence of
bytes.

BTW I recently skimmed through the python3000 user guide. From what I
understand, they seem to distinguish between strings as (binary) data
and strings as text (encoded as utf).

Old 02/16/09, 15:44   #11
Gary Wright


On Feb 16, 2009, at 10:33 AM, Tom Link wrote:
> Actually, this is a point I have never quite understood. Why does only
> the windows version convert line endings?


Because at the operating system level Windows distinguishes between
text and binary files and Unix doesn't.

The "b" option has been part of the standard C library for decades and
Windows is not the only operating system that distinguishes between
text and binary files.

Proper handling of line termination requires the library to know if it
is working with a binary or text file. On Unix it doesn't matter if
you fail to give the library correct information (i.e. omit the b flag
for binary files) but your code becomes non-portable. It will fail on
systems that treat text and binary files differently.

Gary Wright




Old 02/16/09, 15:57   #12
Stefan Lang

2009/2/16 Tom Link <micathom@gmail.com>:
>> Result in Ruby 1.8:
>> * My stuff works fine on Linux/Unix. Somebody else runs
>> the script on Windows, the script corrupts data because
>> Windows does line ending conversion.
>>
>> Result in Ruby 1.9:
>> * On the first run on my Linux machine, I get an EncodingError.
>> I fix the problem by specifying the "b" flag on open. Done.

>
> Actually, this is a point I have never quite understood. Why does only
> the windows version convert line endings? Is it out of the question that
> somebody could want to process a text file created under windows on a
> linux box or virtual machine? Regular expressions that check only for
> \n but not \r won't work then. Now you could of course take the stance
> that you simply have to check for \r too, but then why automatically
> convert line separators under windows? Or did I miss something
> obvious?


It's the underlying C API that does the line ending conversion.
Ruby inherited that behavior.

> This is also the reason why I think opening text files as binary isn't
> really a solution. It leads to either convoluted regexps or non-
> portable code. (Unless I missed something obvious, which is quite
> possible.)
>
> I personally find it somewhat confusing having to juggle with
> different encodings. IMHO it would have been preferable to define a
> fixed internal encoding (UTF-16 or whatever) and to transcode every
> string that is known to be text and identifiers to that canonical/
> uniform encoding and to deal with everything else as a sequence of
> bytes.


Ruby does that when you set the internal encoding with
Encoding.default_internal=
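
A per-file variant of the same idea, without touching the global default: the mode string can name an external and an internal encoding. This is only a sketch; the file content and the helper name `read_latin1_as_utf8` are made up for illustration.

```ruby
require "tempfile"

# Read a file stored as ISO-8859-1 and hand back a UTF-8 string.
# The mode string "r:EXTERNAL:INTERNAL" asks Ruby to transcode on read.
def read_latin1_as_utf8(path)
  File.open(path, "r:ISO-8859-1:UTF-8") { |f| f.read }
end

Tempfile.create("latin1") do |tmp|
  tmp.binmode
  tmp.write("gr\xFC\xDFe\n".b) # the German word "grüße" as ISO-8859-1 bytes
  tmp.flush

  text = read_latin1_as_utf8(tmp.path)
  puts text.encoding
  puts text == "gr\u00FC\u00DFe\n"
end
```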

> BTW I recently skimmed through the python3000 user guide. From what I
> understand, they seem to distinguish between strings as (binary) data
> and strings as text (encoded as utf).


There were many and long discussions about the encoding API,
mostly on Ruby core. If you search the archives you can find
why the current API is how it is.

IIRC, these were important issues:

* We don't have a single internal string encoding (like Java and Python)
because there are many Ruby users, especially Asian, that still
have to work with legacy encodings for which a lossless Unicode
round-trip is not possible. They'd be forced to use the
binary API.

* Because Ruby already has a rich String API, and because it simplifies
porting of 1.8 code, there is no separate data type for binary strings.

Stefan

Old 02/16/09, 16:34   #13
Tom Link

> It's the underlying C API that does the line ending conversion.
> Ruby inherited that behavior.


Thanks for the clarification (also thanks to Gary).

> Ruby does that when you set the internal encoding with
> Encoding.default_internal=


Unfortunately this isn't entirely true -- see:
http://groups.google.com/group/ruby-...ee9f336?hl=en#

It doesn't convert strings & identifiers in scripts. Of course, there
are good reasons for that (see the responses in the thread) if you
don't define a canonical internal encoding.

> There were many and long discussions about the encoding API,
> mostly on Ruby core.


I know that people who understand the issues at hand much better than
I do discussed this subject extensively. I still struggle to fully
understand their conclusions though. But this probably is only a
matter of time.

Regards,
Thomas.

Old 02/17/09, 01:33   #14
Luther

On Mon, 2009-02-16 at 14:20 +0900, Tom Link wrote:
> > but now I know my program will puke
> > on any text file with multibyte characters.

>
> Not necessarily.
>
> Here is a useful summary of encodings in 1.9:
> http://blog.nuclearsquid.com/writing...-1-9-encodings
>
> Basically, you have script encoding, internal encoding, and external
> encoding. In your case, you should probably read the files as ASCII8BIT
> or binary, I guess.
>


Thank you. I've put 'r:binary' in the line where I open the file, and
now it seems to work fine. Although if I didn't want to be lazy, I would
probably read it with the default encoding, catch the ArgumentError,
then reread the file.
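
A sketch of that less lazy version, with the caveat that it checks `valid_encoding?` up front instead of rescuing the ArgumentError, and relabels the bytes in place rather than rereading the file. The helper name and the sample files are hypothetical.

```ruby
require "tempfile"

# Read a file as text in the given encoding (the default external one unless
# told otherwise); if the bytes are not valid in it, treat them as binary.
def read_text_or_bytes(path, encoding = Encoding.default_external)
  data = File.read(path, encoding: encoding)
  data.valid_encoding? ? data : data.force_encoding(Encoding::BINARY)
end

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write("Helvetica\xFF\r\n".b) # made-up content with a stray high byte
  tmp.flush

  data = read_text_or_bytes(tmp.path, Encoding::UTF_8)
  puts data.encoding
  puts data.gsub("\r", "").inspect
end
```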

Thanks again...
Luther


Old 02/17/09, 13:55   #15
Luther Thompson

Tom Link wrote:
> When I recently stumbled over not so different problems (one of which
> is described here [1]) it was because the external encoding (see
> Encoding.default_external) defaulted to US-ASCII on cygwin because
> ruby191RC0 ignored the windows locale and the value of the LANG
> variable -- the part with the windows locale was fixed in the
> meantime. AFAIK if ruby 191 cannot determine the environment's locale,
> it defaults to US-ASCII which causes the described problem if a
> character is > 127.


Actually, I always set my LANG to C. Since my original post, I found
that I had forgotten to set my LC_CTYPE to en_US.UTF-8, which is
Ubuntu's default. After fixing that, I still got the same error, but
with "UTF-8" instead of "US-ASCII".
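
For anyone wanting to check what their own environment resolves to, a quick sketch (the output depends entirely on the local LANG / LC_ALL / LC_CTYPE settings, so none is shown here):

```ruby
# Inspect where Ruby's default for external data comes from.
locale_encoding   = Encoding.find("locale")   # derived from the process locale
external_encoding = Encoding.default_external # applied to file reads unless overridden

puts "locale:   #{locale_encoding}"
puts "external: #{external_encoding}"
```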

I believe the metadata in that text file must be binary code that was
put there by some word processor, because I remember seeing "Helvetica"
somewhere in there.

Luther
--
Posted via http://www.ruby-forum.com/.

Old 11/10/10, 18:40   #16
Jason O.

The magic encoding comment didn't cut it for me. I found the answer in
my case by adding the following to my environment.rb (I run a mixed 1.8
and 1.9 environment):

# Anchor the match and escape the dot so only 1.9.x versions qualify.
if RUBY_VERSION =~ /\A1\.9/
  Encoding.default_external = Encoding::UTF_8
  Encoding.default_internal = Encoding::UTF_8
end

--
Posted via http://www.ruby-forum.com/.
