Unicode

Posted 15/09/2007, 02:59   #1
Zephyr Pellerin
Subject: Unicode

I hate to discuss something related to the development timeline, I know
it's tenable, but when will it be reasonable to expect Unicode support
from Ruby?

Posted 15/09/2007, 03:29   #2
James Edward Gray II
Subject: Re: Unicode

On Sep 14, 2007, at 9:05 PM, Zephyr Pellerin wrote:

> I hate to discuss something related to the development timeline, I
> know its tenable, but When will it be reasonable to expect Unicode
> support from Ruby?


Ruby has some UTF-8 support today. Support will increase with the
m17n support though.

See last question and answer here:

http://blog.grayproductions.net/arti..._vm_episode_iv

James Edward Gray II



Posted 15/09/2007, 03:39   #3
Phlip
Subject: Re: Unicode

Zephyr Pellerin wrote:

> I hate to discuss something related to the development timeline, I know
> its tenable, but When will it be reasonable to expect Unicode support from
> Ruby?


"Unicode" is not an encoding. Are you asking for UTF-8, UTF-16, or something
else?

--
Phlip



Posted 17/09/2007, 13:44   #4
Todd Burch
Subject: Re: Unicode

Zephyr Pellerin wrote:
> I hate to discuss something related to the development timeline, I know
> its tenable, but When will it be reasonable to expect Unicode support
> from Ruby?


I was just looking at the source code for 1.8.6 this weekend. The C
syntax that's being used is pre-ANSI-C (which means in 1988, it was
"old" syntax).

Rotsa Ruck.

Todd
--
Posted via http://www.ruby-forum.com/.


Posted 17/09/2007, 13:52   #5
Phlip
Subject: Re: Unicode

>> I hate to discuss something related to the development timeline, I know
>> its tenable, but When will it be reasonable to expect Unicode support
>> from Ruby?

>
> I was just looking at the source code for 1.8.6 this weekend. The C
> syntax that's being used is pre-ANSI-C (which means in 1988, it was
> "old" syntax).


Apples and oranges. Unicode libraries like iconv use C linkage, so they can
bond with most C implementations regardless of their compliance. (C linkage
is very weak and simplistic.) All Cs can handle 8-bit strings, and can be
programmed to use 16-bit strings, which are the requirements for UTF-8 and
UTF-16.

Like most languages, Ruby's source is in a primitive form of C to maximize
the number of compilers, and hence the number of platforms and hardwares,
that it runs on. I would suspect - unless Matz is an even greater genius
than average - that Ruby's C style has been carefully retrofitted, after the
language passed its first few version ticks.

> Rotsa Ruck.


Racial slur noted.

--
Phlip



Posted 17/09/2007, 14:50   #6
Todd Burch
Subject: Re: Unicode

Phlip wrote:

> Racial slur noted.


You got a problem with Scooby Doo?

For the record, this was NOT intended to slur anything. It was not my
intent, nor is my nature, to slur. However, reading this in hindsight,
it certainly could be taken this way. Please accept my apologies.

Now, I'll rephrase.

Lotsa luck getting something like Unicode implemented when the
underlying C constructs are using such an outdated syntax as Ruby's.

But, as Phlip implies, it's just a simple matter of programming.

Todd
--
Posted via http://www.ruby-forum.com/.


Posted 17/09/2007, 15:48   #7
Phlip
Subject: Re: Unicode

Todd Burch wrote:

> For the record, this was NOT intended to slur anything. It was not my
> intent, nor is my nature, to slur. However, reading this in hindsight,
> it certainly could be taken this way. Please accept my apologies.


Oh my apologies too - Scooby Doo is quite over my head. All I could
imagine was Matz in a kimono serving Sake.

--
Phlip


Posted 17/09/2007, 16:50   #8
Todd Burch
Subject: Re: Unicode

Yukihiro Matsumoto wrote:
> Hi,
>
> Old K&R style has nothing related to Unicode support of the language.
> If you think it does, please elaborate.
>
> It just reflects the history of the language. When I started
> developing Ruby, old Sun CC compiler does not understand new style,
> and I wanted Ruby to run on that platform, which I was using then.
>
> For your information, the next release (1.9) finally abandoned the old
> style.
>
> matz.


Thanks Matz.

I'm new to C programming, but not new to programming. Therefore, my
assumption (yes, assumption) was that using whatever compiler switches
were necessary to accept the old-style syntax would rule out
bringing in "modern" libraries with Unicode support, and/or
prohibit those aspects of the language that would enable the use of
Unicode features.

So, apparently, they ("they" being Unicode support and the
syntax/compiler switches) are not related, and that's great.

By the way, as an aside, I really like the language you developed and
have made available. I primarily use Ruby with SketchUp (a 3D modeling
program - http://www.sketchup.com) for extending the functionality of
the product. (SketchUp has a Ruby API.) I was looking at the source to
see what it would take to implement a debugger that would work with Ruby
while running under SketchUp. I would like to step through expression
evaluation as the script runs.

(Big aspirations for a new C programmer like myself!)

Todd
--
Posted via http://www.ruby-forum.com/.


Posted 21/09/2007, 10:19   #9
Michal Suchanek
Subject: Re: Unicode

On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
> I hate to discuss something related to the development timeline, I know
> its tenable, but When will it be reasonable to expect Unicode support
> from Ruby?


Ruby has Unicode support. Sort of. Regexes work in UTF-8 when $KCODE
is set to "U" (the default is "N" even in UTF-8 locales, and if
you specify the -K option in the .rb file it overrides the option
specified on the command line, heh).
The non-regex methods do not work, but you can convert the string with
str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse,
[], ...
You have to remember to convert the string back, though.
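
A minimal sketch of that round trip, assuming nothing beyond the core String#unpack, Array#pack and String#scan methods. The sample string is made up for illustration and is built from code points so the source stays pure ASCII; on Ruby 1.8 you would normally also set $KCODE = "U" as described above.

```ruby
# "h", "e with acute accent", "l", "l", "o" as Unicode code points.
str = [0x68, 0xE9, 0x6C, 0x6C, 0x6F].pack("U*")

# Decode the UTF-8 bytes into an array of code points ...
codepoints = str.unpack("U*")

# ... use ordinary Array methods on it (reverse, [], each, ...) ...
reversed_points = codepoints.reverse

# ... and remember to convert back to a UTF-8 string afterwards.
reversed = reversed_points.pack("U*")

# The regex route: the /u option makes "." match a whole UTF-8 character.
chars = str.scan(/./u)
```

Reversing the raw bytes instead would split the two-byte character in the middle, which is why the conversion step matters.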

Thanks

Michal


Posted 22/09/2007, 12:04   #10
Jimmy Kofler
Subject: Re: Unicode

> Michal Suchanek wrote:
> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
>> I hate to discuss something related to the development timeline, I know
>> its tenable, but When will it be reasonable to expect Unicode support
>> from Ruby?

>
> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
> is set to "U" (and the default is "N" even in UTF-8 locales, and if
> you specify the -K option in the .rb file it overrides the option
> specified on the command line, heh).
> The non-regex methods do not work but you can convert the string with
> str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse,
> [], ...
> You have to remember to convert the string back, though.
>
> Thanks
>
> Michal


... or you may use the /re/u regex option to handle UTF-8 encoded
strings (cf. http://snippets.dzone.com/posts/show/4527 ).

Cheers,

j.k.
--
Posted via http://www.ruby-forum.com/.


Posted 28/09/2007, 22:49   #11
Felipe Contreras
Subject: Re: Unicode

On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:
> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
> > I hate to discuss something related to the development timeline, I know
> > its tenable, but When will it be reasonable to expect Unicode support
> > from Ruby?

>
> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
> is set to "U" (and the default is "N" even in UTF-8 locales, and if
> you specify the -K option in the .rb file it overrides the option
> specified on the command line, heh).
> The non-regex methods do not work but you can convert the string with
> str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse,
> [], ...
> You have to remember to convert the string back, though.


What about UTF-16?

http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/

--
Felipe Contreras


Posted 29/09/2007, 02:07   #12
John Joyce
Subject: Re: Unicode


On Sep 28, 2007, at 4:49 PM, Felipe Contreras wrote:

> On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:
>> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
>>> I hate to discuss something related to the development timeline,
>>> I know
>>> its tenable, but When will it be reasonable to expect Unicode
>>> support
>>> from Ruby?

>>
>> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
>> is set to "U" (and the default is "N" even in UTF-8 locales, and if
>> you specify the -K option in the .rb file it overrides the option
>> specified on the command line, heh).
>> The non-regex methods do not work but you can convert the string with
>> str.scan(/./)[0] or str.unpack "U*", and use stuff like each,
>> reverse,
>> [], ...
>> You have to remember to convert the string back, though.

>
> What about UTF-16?
>
> http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/
>
> --
> Felipe Contreras
>

Go to unicode.org.
There you can read a full explanation (or a brief one) about why you
don't need to worry about UTF-16.
UTF-8 is all you need.
Unicode is something everyone needs to read up on at some point.
I have to read up on it every now and then because my brain leaks.


Posted 29/09/2007, 13:07   #13
Felipe Contreras
Subject: Re: Unicode

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:
>
> On Sep 28, 2007, at 4:49 PM, Felipe Contreras wrote:
>
> > On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:
> >> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
> >>> I hate to discuss something related to the development timeline,
> >>> I know
> >>> its tenable, but When will it be reasonable to expect Unicode
> >>> support
> >>> from Ruby?
> >>
> >> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
> >> is set to "U" (and the default is "N" even in UTF-8 locales, and if
> >> you specify the -K option in the .rb file it overrides the option
> >> specified on the command line, heh).
> >> The non-regex methods do not work but you can convert the string with
> >> str.scan(/./)[0] or str.unpack "U*", and use stuff like each,
> >> reverse,
> >> [], ...
> >> You have to remember to convert the string back, though.

> >
> > What about UTF-16?
> >
> > http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/
> >
> > --
> > Felipe Contreras
> >

> Go to unicode.org
> There you can read a full explanation (or a brief one) about why you
> don't need to worry about UTF-16
> UTF-8 is all you need.
> Unicode is something everyone needs to read up on at some point.
> I have to read up on every now and then because my brain leaks.


Yes but what about stuff already encoded in UTF-16?

--
Felipe Contreras


Posted 29/09/2007, 20:13   #14
John Joyce
Subject: Re: Unicode

>
> Yes but what about stuff already encoded in UTF-16?


That's why I said read up on Unicode!
After you read that stuff you'll understand why it's no problem.
I'm not going to explain it. Many people understand it but might make
mistakes when explaining it.
Read the Unicode stuff carefully. It's vital for many things.

The only thing you might run into is the BOM or endianness, but it's
doubtful it will be an issue in most cases.

This might get you started.
http://www.unicode.org/faq/utf_bom.html#37


Even Joel Spolsky wrote a brief bit on Unicode, mostly trumpeting
how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2
bytes for the low code points (the stuff also known as the ASCII
range) where UTF-8 uses 1.

You really need to spend an afternoon reading about Unicode. It
should be required in any computer science program as part of an
encoding course. Americans in particular are often the ones who know
the least about it....


Posted 29/09/2007, 20:29   #15
James Edward Gray II
Subject: Re: Unicode

On Sep 29, 2007, at 2:13 PM, John Joyce wrote:

> The short version is that UTF-16 is basically wasteful.


That's not always accurate:

Firefly:~/Desktop$ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
14 66 5921 japanese_prose_in_utf8.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
16 45 3968 japanese_prose_in_utf16.txt
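
The same comparison can be sketched in Ruby itself. This assumes String#encode, which belongs to Ruby 1.9 and later rather than the 1.8 series discussed in this thread; the sample strings and the byte_sizes helper are made up for illustration.

```ruby
# Returns [bytes in UTF-8, bytes in UTF-16] for a UTF-8 string.
# UTF-16LE is used so that no byte order mark is counted.
def byte_sizes(text)
  [text.bytesize, text.encode("UTF-16LE").bytesize]
end

ascii    = "hello"                               # 5 ASCII characters
japanese = [0x65E5, 0x672C, 0x8A9E].pack("U*")   # 3 kanji, built from code points

ascii_sizes    = byte_sizes(ascii)      # => [5, 10]
japanese_sizes = byte_sizes(japanese)   # => [9, 6]
```

ASCII text doubles in size in UTF-16, while these kanji take 3 bytes each in UTF-8 but only 2 in UTF-16, which matches the direction of the wc byte counts above.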

James Edward Gray II



Posted 30/09/2007, 00:34   #16
John Joyce
Subject: Re: Unicode


On Sep 29, 2007, at 2:29 PM, James Edward Gray II wrote:

> On Sep 29, 2007, at 2:13 PM, John Joyce wrote:
>
>> The short version is that UTF-16 is basically wasteful.

>
> That's not always accurate:
>
> $ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt >
> japanese_prose_in_utf16.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
> 14 66 5921 japanese_prose_in_utf8.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
> 16 45 3968 japanese_prose_in_utf16.txt
>
> James Edward Gray II
>
>

Interesting that you would get more lines, fewer words, and
fewer bytes (probably explained by fewer words...).
wc defines words as whitespace-delimited, which is extremely interesting
considering that Japanese uses no whitespace except in page layout.
Grammar does not dictate any whitespace at all. At most, in Japanese
prose you might have one whitespace between sentences, perhaps only
between "paragraphs".

I don't know how iconv handles things. man iconv says it uses iswspace(3),
which is in wctype.h, but I always hate reading those headers.
I tried using iconv on a file, from UTF-8 to UTF-16 and then back again.
Results are similar, but interestingly, it's no indication of file
size. Files are the same size.
I then tried the same with some code in C++, and similar results occurred.
It would seem to be a whitespace issue. I didn't realize this, but it
does look like UTF-8 is generating fewer whitespace characters while
generating a bigger file...?
I'm curious what the deal is there.

In theory UTF-8 should do better than UTF-16 for characters in the
ASCII range...
at least that was my understanding. And assuming code files are
largely ASCII character sets...
hmm...!?


Posted 30/09/2007, 00:50   #17
John Joyce
Subject: Re: Unicode


On Sep 29, 2007, at 2:29 PM, James Edward Gray II wrote:

> On Sep 29, 2007, at 2:13 PM, John Joyce wrote:
>
>> The short version is that UTF-16 is basically wasteful.

>
> That's not always accurate:
>
> $ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt >
> japanese_prose_in_utf16.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
> 14 66 5921 japanese_prose_in_utf8.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
> 16 45 3968 japanese_prose_in_utf16.txt
>
> James Edward Gray II
>
>

Scratch that! I must've gone cross-eyed!
My C++ code was indeed a smaller file in UTF-8 than in UTF-16, as I
expected!
Interestingly, *nixes apparently use UTF-32 internally regardless of
the source encoding... very interesting.


Posted 30/09/2007, 03:47   #18
Felipe Contreras
Subject: Re: Unicode

Hi,

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:
> >
> > Yes but what about stuff already encoded in UTF-16?

>
> That's why I said read up on unicode!
> After you read that stuff you'll understand why it's no problem.
> I'm not going to explain it. Many people understand it, but when
> explaining it might make mistakes.
> Read the unicode stuff carefully. It's vital for many things.
>
> The only thing you might run into is BOM or Endian-ness, but it's
> doubtful it will be an issue in most cases.
>
> This might get you started.
> http://www.unicode.org/faq/utf_bom.html#37
>
>
> Even Joel Spoelsky wrote a brief bit on unicode... mostly trumpeting
> how programmers need to know it and how few actually do.
> The short version is that UTF-16 is basically wasteful. It uses 2
> bytes for lower-level code-points (the stuff also known as ASCII
> range) where UTF-8 does not.


As you suggested I read the article:
http://www.joelonsoftware.com/articles/Unicode.html

I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 8 bits, so it can store 256
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (that means more
characters in the world). UTF-8 also allows more characters, but it doesn't
necessarily need 2 bytes: it uses 1, and if the character is beyond
127 then it will use 2 bytes. This whole thing can be extended up to 6
bytes.

So what exactly am I looking for here?

> You really need to spend an afternoon reading about unicode. It
> should be required in any computer science program as part of an
> encoding course, Americans in particular are often the ones who know
> the least about it....


What is there to know about Unicode? There's a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?

I'm sorry if I'm being rude, but I really don't like it when people tell
me to read stuff I already know.

My question is still there:

Let's say I want to rename a file "fooobar", and remove the third "o",
but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
of course there will still be a 0x00 in there. That's if the string is
recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

I don't mind reading some more if I can actually find the answer.

Best regards.

--
Felipe Contreras


Posted 30/09/2007, 05:55   #19
Terry Poulin
Subject: Re: Unicode

Felipe Contreras wrote:
>
> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII uses 8 bits, so it can store 256
> characters, so it can't store all the characters in the world, so
> other character sets are needed (really? I would have never guessed
> that). UTF-16 basically stores characters in 2 bytes (that means more
> characters in the world), UTF-8 also allows more characters it doesn't
> necessarily needs 2 bytes, it uses 1, and if the character is beyond
> 127 then it will use 2 bytes. This whole thing can be extended up to 6
> bytes.
>
> So what exactly am I looking for here?


ASCII is a 7-bit encoding with 128 characters in the set.

Most PCs these days use an 8-bit byte. I'm no rocket scientist when it comes
to CPU architectures or character encodings, but I would think the machine's
byte or word size would be the most logical choices....

Most of my files are in UTF-8 or ISO 8859-1 (and probably some Windows-1252).
As far as I know, UTF-8 and Latin-1 are compatible in the first 128 characters
because of how widespread ASCII is.


Since I may have missed the original message.... What is the problem again?
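
That compatibility is easy to check with a short sketch. It assumes Ruby 1.9 or later for String#force_encoding and String#encode; the variable names are only illustrative.

```ruby
# For every code point in the ASCII range, the UTF-8 encoding and the
# ISO 8859-1 (Latin-1) encoding are the same single byte.
same_in_ascii_range = (0..127).all? do |cp|
  utf8   = [cp].pack("U")                                # UTF-8 encoded
  latin1 = [cp].pack("C").force_encoding("ISO-8859-1")   # one raw byte
  utf8.bytes.to_a == latin1.bytes.to_a
end

# Above 127 the two diverge: U+00E9 (e with acute accent) is one byte
# in Latin-1 but two bytes in UTF-8.
e_acute_utf8   = [0xE9].pack("U")
e_acute_latin1 = e_acute_utf8.encode("ISO-8859-1")
```

So a file that only contains ASCII characters is byte-for-byte identical in all three encodings; the differences only show up once accented or non-Latin characters appear.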

TerryP.





Posted 30/09/2007, 06:22   #20
John Joyce
Subject: Re: Unicode


On Sep 29, 2007, at 9:47 PM, Felipe Contreras wrote:

> Hi,
>
> On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:
>>>
>>> Yes but what about stuff already encoded in UTF-16?

>>
>> That's why I said read up on unicode!
>> After you read that stuff you'll understand why it's no problem.
>> I'm not going to explain it. Many people understand it, but when
>> explaining it might make mistakes.
>> Read the unicode stuff carefully. It's vital for many things.
>>
>> The only thing you might run into is BOM or Endian-ness, but it's
>> doubtful it will be an issue in most cases.
>>
>> This might get you started.
>> http://www.unicode.org/faq/utf_bom.html#37
>>
>>
>> Even Joel Spoelsky wrote a brief bit on unicode... mostly trumpeting
>> how programmers need to know it and how few actually do.
>> The short version is that UTF-16 is basically wasteful. It uses 2
>> bytes for lower-level code-points (the stuff also known as ASCII
>> range) where UTF-8 does not.

>
> As you suggested I read the article:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII uses 8 bits, so it can store 256
> characters, so it can't store all the characters in the world, so
> other character sets are needed (really? I would have never guessed
> that). UTF-16 basically stores characters in 2 bytes (that means more
> characters in the world), UTF-8 also allows more characters it doesn't
> necessarily needs 2 bytes, it uses 1, and if the character is beyond
> 127 then it will use 2 bytes. This whole thing can be extended up to 6
> bytes.
>
> So what exactly am I looking for here?
>
>> You really need to spend an afternoon reading about unicode. It
>> should be required in any computer science program as part of an
>> encoding course, Americans in particular are often the ones who know
>> the least about it....

>
> What is there to know about Unicode? There's a couple of character
> sets, use UTF-8, and remember that one character != one byte. Is there
> anything else for practical purposes?
>
> I'm sorry if I'm being rude, but I really don't like when people tell
> me to read stuff I already know.
>
> My question is still there:
>
> Let's say I want to rename a file "fooobar", and remove the third "o",
> but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
> of course there will still be a 0x00 in there. That's if the string is
> recognized at all.
>
> Why is there no issue with UTF-16 if only UTF-8 is supported?
>
> I don't mind reading some more if I can actually find the answer.
>
> Best regards.
>
> --
> Felipe Contreras
>

Hmm... you should consider converting it to UTF-8 via iconv.
There is a gem for iconv.
This will keep your data intact, but you might need to convert it
back to UTF-16 later.

I believe filenames on Windows are actually UTF-8;
files' contents are generally written in UTF-16.

Could be wrong on this...
but test it and see!
Try to open a file with non-ASCII-range characters in irb and see
what happens.
If it fails, no harm done.



Posted 30/09/2007, 06:28   #21
John Joyce
Subject: Re: Unicode

oh, and Mr. Contreras,
I did not mean to say RTFM to you. Sorry if it seemed like that.


Posted 01/10/2007, 14:51   #22
Michal Suchanek
Subject: Re: Unicode

On 30/09/2007, Felipe Contreras <felipe.contreras@gmail.com> wrote:
> Hi,
>
> On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:
> > >
> > > Yes but what about stuff already encoded in UTF-16?

> >
> > That's why I said read up on unicode!
> > After you read that stuff you'll understand why it's no problem.
> > I'm not going to explain it. Many people understand it, but when
> > explaining it might make mistakes.
> > Read the unicode stuff carefully. It's vital for many things.
> >
> > The only thing you might run into is BOM or Endian-ness, but it's
> > doubtful it will be an issue in most cases.
> >
> > This might get you started.
> > http://www.unicode.org/faq/utf_bom.html#37
> >
> >
> > Even Joel Spoelsky wrote a brief bit on unicode... mostly trumpeting
> > how programmers need to know it and how few actually do.
> > The short version is that UTF-16 is basically wasteful. It uses 2
> > bytes for lower-level code-points (the stuff also known as ASCII
> > range) where UTF-8 does not.

>
> As you suggested I read the article:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII uses 8 bits, so it can store 256
> characters, so it can't store all the characters in the world, so
> other character sets are needed (really? I would have never guessed
> that). UTF-16 basically stores characters in 2 bytes (that means more
> characters in the world), UTF-8 also allows more characters it doesn't
> necessarily needs 2 bytes, it uses 1, and if the character is beyond
> 127 then it will use 2 bytes. This whole thing can be extended up to 6
> bytes.
>
> So what exactly am I looking for here?


UTF-8 and UTF-16 are pretty much the same. They encode a single
character using one or more units, where these units are 8-bit or
16-bit respectively. The only thing you buy by converting to utf-16 is
space efficiency for codepoints that require nearly 16 bits to
represent (such as Japanese characters) and endianness issues. Note
that some characters may (and some must) be composed of multiple
codepoints (a character codepoint, and additional accent
codepoint(s)).
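
A rough sketch of the "one or more units" point, assuming Ruby 1.9 or later for String#encode. The unit_counts helper is hypothetical, written only for this illustration.

```ruby
# For a list of code points, returns
# [number of code points, UTF-8 units (bytes), UTF-16 units (16-bit words)].
def unit_counts(codepoints)
  text = codepoints.pack("U*")
  [codepoints.length,
   text.bytesize,
   text.encode("UTF-16LE").bytesize / 2]
end

latin_a      = unit_counts([0x41])            # "A"                  => [1, 1, 1]
kanji        = unit_counts([0x65E5])          # a Japanese character => [1, 3, 1]
g_clef       = unit_counts([0x1D11E])         # outside the BMP      => [1, 4, 2]
e_with_acute = unit_counts([0x65, 0x0301])    # "e" + combining acute => [2, 3, 2]
```

The last two rows are the cases that break "one unit per character" assumptions in either encoding: a surrogate pair in UTF-16, and a single visible character made of a base code point plus an accent code point.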

>
> > You really need to spend an afternoon reading about unicode. It
> > should be required in any computer science program as part of an
> > encoding course, Americans in particular are often the ones who know
> > the least about it....

>
> What is there to know about Unicode? There's a couple of character
> sets, use UTF-8, and remember that one character != one byte. Is there
> anything else for practical purposes?
>
> I'm sorry if I'm being rude, but I really don't like when people tell
> me to read stuff I already know.
>
> My question is still there:
>
> Let's say I want to rename a file "fooobar", and remove the third "o",
> but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
> of course there will still be a 0x00 in there. That's if the string is
> recognized at all.
>
> Why is there no issue with UTF-16 if only UTF-8 is supported?


If you handle UTF-16 as something else, you break it regardless of the
language support. If you know (or have a way to find out) it's UTF-16,
you can convert it. If there is no way to find out, all language
support is moot.
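
For the "fooobar" case, a sketch of both outcomes, assuming Ruby 1.9 or later (String#encode and String#force_encoding); on 1.8 the conversion step would go through the iconv library instead. It assumes the data is known to be little-endian UTF-16 without a byte order mark.

```ruby
utf16 = "fooobar".encode("UTF-16LE")   # 7 characters, 14 bytes

# Handling it as something else: delete one "o" byte from the raw bytes.
# The 0x00 that belonged to that "o" stays behind, so the byte count is
# odd and every following 16-bit unit is shifted.
broken = utf16.dup.force_encoding("BINARY")
broken.sub!("o", "")

# Knowing it is UTF-16: convert, edit as text, convert back.
text  = utf16.encode("UTF-8")
fixed = text.sub("ooo", "oo").encode("UTF-16LE")
```

The second route leaves a well-formed 12-byte UTF-16 string for "foobar"; the first leaves 13 bytes that no longer decode as the intended name.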

Thanks
Michal


Posted 01/10/2007, 15:08   #23
James Edward Gray II
Subject: Re: Unicode

On Sep 30, 2007, at 12:22 AM, John Joyce wrote:

> There is a gem for iconv


The iconv library is a standard library shipped with Ruby.

James Edward Gray II


Posted 01/10/2007, 15:21   #24
John Joyce
Subject: Re: Unicode


On Oct 1, 2007, at 9:08 AM, James Edward Gray II wrote:

> On Sep 30, 2007, at 12:22 AM, John Joyce wrote:
>
>> There is a gem for iconv

>
> The iconv library is a standard library shipped with Ruby.
>
> James Edward Gray II
>

Sure enough!
Just got so used to require rubygems with nearly everything...
