|
|
|
#1 |
|
Posts: n/a
Host: |
I hate to discuss something related to the development timeline, I know
it's tenable, but when will it be reasonable to expect Unicode support from Ruby?
|
|
|
#2 |
|
On Sep 14, 2007, at 9:05 PM, Zephyr Pellerin wrote:

> I hate to discuss something related to the development timeline, I
> know its tenable, but When will it be reasonable to expect Unicode
> support from Ruby?

Ruby has some UTF-8 support today. Support will increase with the m17n support though. See the last question and answer here:

http://blog.grayproductions.net/arti..._vm_episode_iv

James Edward Gray II
|
|
|
#3 |
|
Zephyr Pellerin wrote:

> I hate to discuss something related to the development timeline, I know
> its tenable, but When will it be reasonable to expect Unicode support from
> Ruby?

"Unicode" is not an encoding. Are you asking for UTF-8, UTF-16, or something else?

--
Phlip
|
|
|
#4 |
|
Zephyr Pellerin wrote:

> I hate to discuss something related to the development timeline, I know
> its tenable, but When will it be reasonable to expect Unicode support
> from Ruby?

I was just looking at the source code for 1.8.6 this weekend. The C syntax that's being used is pre-ANSI-C (which means in 1988, it was "old" syntax).

Rotsa Ruck.

Todd

--
Posted via http://www.ruby-forum.com/.
|
|
|
#5 |
|
>> I hate to discuss something related to the development timeline, I know
>> its tenable, but When will it be reasonable to expect Unicode support
>> from Ruby?
>
> I was just looking at the source code for 1.8.6 this weekend. The C
> syntax that's being used is pre-ANSI-C (which means in 1988, it was
> "old" syntax).

Apples and oranges. Unicode libraries like iconv use C linkage, so they can bond with most C implementations regardless of their compliance. (C linkage is very weak and simplistic.) All Cs can handle 8-bit strings, and can be programmed to use 16-bit strings, which are the requirements for UTF-8 and UTF-16.

Like most languages, Ruby's source is in a primitive form of C to maximize the number of compilers, and hence the number of platforms and hardware, that it runs on. I would suspect - unless Matz is an even greater genius than average - that Ruby's C style has been carefully retrofitted, after the language passed its first few version ticks.

> Rotsa Ruck.

Racial slur noted.

--
Phlip
|
|
|
#6 |
|
Phlip wrote:

> Racial slur noted.

You got a problem with Scooby Doo?

For the record, this was NOT intended to slur anything. It was not my intent, nor is it my nature, to slur. However, reading this in hindsight, it certainly could be taken this way. Please accept my apologies.

Now, I'll rephrase. Lotsa luck getting something like Unicode implemented when the underlying C constructs are using such an outdated syntax as Ruby's does.

But, as Phlip implies, it's just a simple matter of programming.

Todd
|
|
|
#7 |
|
Todd Burch wrote:

> For the record, this was NOT intended to slur anything. It was not my
> intent, nor is my nature, to slur. However, reading this in hindsight,
> it certainly could be taken this way. Please accept my apologies.

Oh my apologies too - Scooby Doo is quite over my head. All I could imagine was Matz in a kimono serving Sake.

--
Phlip
|
|
|
#8 |
|
Yukihiro Matsumoto wrote:

> Hi,
>
> Old K&R style has nothing related to Unicode support of the language.
> If you think it does, please elaborate.
>
> It just reflects the history of the language. When I started
> developing Ruby, old Sun CC compiler does not understand new style,
> and I wanted Ruby to run on that platform, which I was using then.
>
> For your information, the next release (1.9) finally abandoned the old
> style.
>
> matz.

Thanks Matz. I'm new to C programming, but not new to programming. Therefore, my assumption (yes, assumption) was that using whatever compiler switches were necessary to accept the old-style syntax would obviate the opportunity to bring in "modern" libraries with Unicode support, and/or prohibit those aspects of the language that would enable the use of Unicode features.

So, apparently, they ("they" being Unicode support and the syntax/compiler switches) are not related, and that's great.

By the way, as an aside, I really like the language you developed and have made available. I primarily use Ruby with SketchUp (a 3D modeling program - http://www.sketchup.com) for extending the functionality of the product. (SketchUp has a Ruby API.) I was looking at the source to see what it would take to implement a debugger that would work with Ruby while running under SketchUp. I would like to step through expression evaluation as the script runs. (Big aspirations for a new C programmer like myself!)

Todd
|
|
|
#9 |
|
On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:

> I hate to discuss something related to the development timeline, I know
> its tenable, but When will it be reasonable to expect Unicode support
> from Ruby?

Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE is set to "U" (and the default is "N" even in UTF-8 locales, and if you specify the -K option in the .rb file it overrides the option specified on the command line, heh).

The non-regex methods do not work but you can convert the string with str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse, [], ...

You have to remember to convert the string back, though.

Thanks

Michal
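
The unpack/convert-back round trip Michal describes can be sketched as follows. This is only an illustration: `reverse_by_codepoints` is a hypothetical helper name, not something from the thread, and the sketch relies only on `String#unpack("U*")` and `Array#pack("U*")`, which treat the string as UTF-8 regardless of `$KCODE`.

```ruby
# Reverse a UTF-8 string code point by code point instead of byte by byte.
# reverse_by_codepoints is a made-up name used for illustration only.
def reverse_by_codepoints(str)
  codepoints = str.unpack("U*")   # UTF-8 bytes -> array of integer code points
  codepoints.reverse.pack("U*")   # array of code points -> UTF-8 string again
end

# "h\u00E9llo" contains one non-ASCII character (e with acute accent).
puts reverse_by_codepoints("h\u00E9llo")
```

The array in the middle is where `each`, `reverse`, `[]` and friends operate on whole characters; forgetting the final `pack` leaves you with an array of integers rather than a string.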
|
|
|
#10 |
|
Michal Suchanek wrote:

> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
>> I hate to discuss something related to the development timeline, I know
>> its tenable, but When will it be reasonable to expect Unicode support
>> from Ruby?
>
> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
> is set to "U" (and the default is "N" even in UTF-8 locales, and if
> you specify the -K option in the .rb file it overrides the option
> specified on the command line, heh).
> The non-regex methods do not work but you can convert the string with
> str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse,
> [], ...
> You have to remember to convert the string back, though.
>
> Thanks
>
> Michal

... or you may use the /re/u regex option to handle UTF-8 encoded strings (cf. http://snippets.dzone.com/posts/show/4527 ).

Cheers,

j.k.
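
A minimal sketch of the `/re/u` approach mentioned above, assuming the input string really is UTF-8. `utf8_chars` is a hypothetical helper name introduced here for illustration; the `m` flag is added only so that `.` also matches newlines.

```ruby
# Split a UTF-8 string into characters using a regex with the /u option.
# utf8_chars is a made-up helper name, not part of Ruby or the thread.
def utf8_chars(str)
  str.scan(/./mu)   # /u makes the regex treat the subject as UTF-8
end

# Three Japanese characters, written with escapes to keep this file ASCII.
chars = utf8_chars("\u65E5\u672C\u8A9E")
puts chars.length
```

Once the string is split this way, array methods such as `reverse` or indexing work per character; `join` turns the array back into a string.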
|
|
|
#11 |
|
On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:

> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
> > I hate to discuss something related to the development timeline, I know
> > its tenable, but When will it be reasonable to expect Unicode support
> > from Ruby?
>
> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
> is set to "U" (and the default is "N" even in UTF-8 locales, and if
> you specify the -K option in the .rb file it overrides the option
> specified on the command line, heh).
> The non-regex methods do not work but you can convert the string with
> str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse,
> [], ...
> You have to remember to convert the string back, though.

What about UTF-16?

http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/

--
Felipe Contreras
|
|
|
#12 |
|
On Sep 28, 2007, at 4:49 PM, Felipe Contreras wrote:

> On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:
>> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
>>> I hate to discuss something related to the development timeline,
>>> I know its tenable, but When will it be reasonable to expect Unicode
>>> support from Ruby?
>>
>> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
>> is set to "U" (and the default is "N" even in UTF-8 locales, and if
>> you specify the -K option in the .rb file it overrides the option
>> specified on the command line, heh).
>> The non-regex methods do not work but you can convert the string with
>> str.scan(/./)[0] or str.unpack "U*", and use stuff like each,
>> reverse, [], ...
>> You have to remember to convert the string back, though.
>
> What about UTF-16?
>
> http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/
>
> --
> Felipe Contreras

Go to unicode.org

There you can read a full explanation (or a brief one) about why you don't need to worry about UTF-16.
UTF-8 is all you need.
Unicode is something everyone needs to read up on at some point.
I have to read up on it every now and then because my brain leaks.
|
|
|
#13 |
|
On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:

> On Sep 28, 2007, at 4:49 PM, Felipe Contreras wrote:
>
> > On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:
> >> On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
> >>> I hate to discuss something related to the development timeline,
> >>> I know its tenable, but When will it be reasonable to expect Unicode
> >>> support from Ruby?
> >>
> >> Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
> >> is set to "U" (and the default is "N" even in UTF-8 locales, and if
> >> you specify the -K option in the .rb file it overrides the option
> >> specified on the command line, heh).
> >> The non-regex methods do not work but you can convert the string with
> >> str.scan(/./)[0] or str.unpack "U*", and use stuff like each,
> >> reverse, [], ...
> >> You have to remember to convert the string back, though.
> >
> > What about UTF-16?
> >
> > http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/
> >
> > --
> > Felipe Contreras
>
> Go to unicode.org
> There you can read a full explanation (or a brief one) about why you
> don't need to worry about UTF-16
> UTF-8 is all you need.
> Unicode is something everyone needs to read up on at some point.
> I have to read up on every now and then because my brain leaks.

Yes but what about stuff already encoded in UTF-16?

--
Felipe Contreras
|
|
|
#14 |
|
> Yes but what about stuff already encoded in UTF-16?

That's why I said read up on unicode!
After you read that stuff you'll understand why it's no problem.
I'm not going to explain it. Many people understand it, but might make mistakes when explaining it.
Read the unicode stuff carefully. It's vital for many things.

The only thing you might run into is BOM or endianness, but it's doubtful it will be an issue in most cases.

This might get you started.
http://www.unicode.org/faq/utf_bom.html#37

Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2 bytes for lower-level code points (the stuff also known as the ASCII range) where UTF-8 does not.

You really need to spend an afternoon reading about unicode. It should be required in any computer science program as part of an encoding course. Americans in particular are often the ones who know the least about it....
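
To make the BOM and endianness remark concrete, here is a small sketch. It assumes a Ruby with `String#encode` (1.9 or later, so newer than the 1.8.6 discussed in this thread); `utf16_bytes` is a hypothetical helper name.

```ruby
# Show how the same text is laid out in the two UTF-16 byte orders.
# utf16_bytes is a made-up helper name used for illustration only.
def utf16_bytes(str, order)
  encoding = (order == :le) ? "UTF-16LE" : "UTF-16BE"
  str.encode(encoding).bytes
end

# "A" is U+0041: one 16-bit unit, stored low byte first or high byte first.
p utf16_bytes("A", :le)
p utf16_bytes("A", :be)

# U+FEFF is the byte order mark; its byte layout tells a reader which
# order the rest of the data uses.
p utf16_bytes("\uFEFF", :le)
p utf16_bytes("\uFEFF", :be)
```

The extra zero byte for "A" is the 2-bytes-per-ASCII-character cost mentioned above, and the mirrored BOM bytes are what a decoder inspects when the byte order is not known in advance.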
|
|
|
#15 |
|
On Sep 29, 2007, at 2:13 PM, John Joyce wrote:

> The short version is that UTF-16 is basically wasteful.

That's not always accurate:

    $ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
    Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
          14      66    5921 japanese_prose_in_utf8.txt
    Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
          16      45    3968 japanese_prose_in_utf16.txt

James Edward Gray II
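
The same size comparison can be sketched in Ruby without iconv. This assumes `String#encode` (Ruby 1.9 or later, not the 1.8.6 of this thread); `encoded_sizes` is a hypothetical helper, and UTF-16LE is used so that no BOM is counted, whereas `iconv -t utf-16` typically adds one.

```ruby
# Compare how many bytes a string needs in UTF-8 and in UTF-16.
# encoded_sizes is a made-up helper name used for illustration only.
def encoded_sizes(str)
  {
    utf8:  str.encode("UTF-8").bytesize,
    utf16: str.encode("UTF-16LE").bytesize   # LE variant: no BOM prepended
  }
end

japanese = "\u65E5\u672C\u8A9E"   # three kanji, written as escapes
ascii    = "hello"

p encoded_sizes(japanese)   # kanji: 3 bytes each in UTF-8, 2 in UTF-16
p encoded_sizes(ascii)      # ASCII: 1 byte each in UTF-8, 2 in UTF-16
```

Which encoding is "wasteful" therefore depends on the text: mostly-ASCII source code favors UTF-8, while Japanese prose favors UTF-16, matching the `wc` byte counts above.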
|
|
|
#16 |
|
On Sep 29, 2007, at 2:29 PM, James Edward Gray II wrote:

> On Sep 29, 2007, at 2:13 PM, John Joyce wrote:
>
>> The short version is that UTF-16 is basically wasteful.
>
> That's not always accurate:
>
> $ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
>       14      66    5921 japanese_prose_in_utf8.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
>       16      45    3968 japanese_prose_in_utf16.txt
>
> James Edward Gray II

Interesting that you would generate more lines, fewer words, and fewer bytes (probably explained by fewer words). wc defines words as whitespace-delimited. Extremely interesting considering that Japanese uses no whitespace except in page layout. Grammar does not dictate any whitespace at all. At most in Japanese prose you might have one whitespace between sentences, perhaps only between "paragraphs".

I don't know how iconv handles things. man iconv says it uses iswspace(3), which is in wctype.h, but I always hate reading those headers.

I tried using iconv on a file in utf-8 to utf-16 and then back again. Results are similar, but interestingly, it's no indication of file size. Files are the same size.

I then tried the same with some code in C++ and similar results occurred. It would seem to be a whitespace issue. I didn't realize this, but it does look like utf-8 is generating fewer whitespace characters while generating a bigger file...? I'm curious what the deal is there. In theory utf-8 should do better than utf-16 for characters in the ASCII range... at least that was my understanding. And assuming code files are largely ASCII character sets... hmm...!?
|
|
|
#17 |
|
On Sep 29, 2007, at 2:29 PM, James Edward Gray II wrote:

> On Sep 29, 2007, at 2:13 PM, John Joyce wrote:
>
>> The short version is that UTF-16 is basically wasteful.
>
> That's not always accurate:
>
> $ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
>       14      66    5921 japanese_prose_in_utf8.txt
> Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
>       16      45    3968 japanese_prose_in_utf16.txt
>
> James Edward Gray II

Scratch that! I must've gone cross-eyed! My C++ code was indeed a smaller file size in utf-8 than in utf-16, as I expected!

Interestingly, *nix's apparently use utf-32 internally regardless of the source encoding... very interesting
|
|
|
#18 |
|
Hi,

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:

> > Yes but what about stuff already encoded in UTF-16?
>
> That's why I said read up on unicode!
> After you read that stuff you'll understand why it's no problem.
> I'm not going to explain it. Many people understand it, but when
> explaining it might make mistakes.
> Read the unicode stuff carefully. It's vital for many things.
>
> The only thing you might run into is BOM or Endian-ness, but it's
> doubtful it will be an issue in most cases.
>
> This might get you started.
> http://www.unicode.org/faq/utf_bom.html#37
>
> Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting
> how programmers need to know it and how few actually do.
> The short version is that UTF-16 is basically wasteful. It uses 2
> bytes for lower-level code-points (the stuff also known as ASCII
> range) where UTF-8 does not.

As you suggested I read the article:
http://www.joelonsoftware.com/articles/Unicode.html

I didn't find anything new. It's just explaining character sets in a rather non-specific way. ASCII uses 8 bits, so it can store 256 characters, so it can't store all the characters in the world, so other character sets are needed (really? I would have never guessed that). UTF-16 basically stores characters in 2 bytes (that means more characters in the world). UTF-8 also allows more characters, but it doesn't necessarily need 2 bytes: it uses 1, and if the character is beyond 127 then it will use 2 bytes. This whole thing can be extended up to 6 bytes.

So what exactly am I looking for here?

> You really need to spend an afternoon reading about unicode. It
> should be required in any computer science program as part of an
> encoding course, Americans in particular are often the ones who know
> the least about it....

What is there to know about Unicode? There's a couple of character sets, use UTF-8, and remember that one character != one byte. Is there anything else for practical purposes?

I'm sorry if I'm being rude, but I really don't like it when people tell me to read stuff I already know.

My question is still there:

Let's say I want to rename a file "fooobar", and remove the third "o", but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and of course there will still be a 0x00 in there. That's if the string is recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

I don't mind reading some more if I can actually find the answer.

Best regards.

--
Felipe Contreras
|
|
|
#19 |
|
Felipe Contreras wrote:

> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII uses 8 bits, so it can store 256
> characters, so it can't store all the characters in the world, so
> other character sets are needed (really? I would have never guessed
> that). UTF-16 basically stores characters in 2 bytes (that means more
> characters in the world), UTF-8 also allows more characters it doesn't
> necessarily needs 2 bytes, it uses 1, and if the character is beyond
> 127 then it will use 2 bytes. This whole thing can be extended up to 6
> bytes.
>
> So what exactly am I looking for here?

ASCII is a 7-bit encoding with 128 characters in the set. Most PCs these days use an 8-bit byte. I'm no rocket scientist when it comes to CPU architectures or character encodings, but I would think the machine's byte or word size would be the most logical choices....

Most of my files are in UTF-8 or ISO 8859-1 (and probably some Windows-1252). As far as I know UTF-8 and Latin 1 are compatible in the first 128 characters because of how widespread ASCII is.

Since I may have missed the original message.... What is the problem again?

TerryP.
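
The compatibility of UTF-8 and Latin 1 within the ASCII range, and their divergence above it, can be checked with a short sketch. It assumes `String#encode` (Ruby 1.9 or later); `bytes_in` is a hypothetical helper name.

```ruby
# Return the bytes a string occupies in a given encoding.
# bytes_in is a made-up helper name used for illustration only.
def bytes_in(str, encoding)
  str.encode(encoding).bytes
end

# Inside the 7-bit ASCII range both encodings produce the same single byte.
p bytes_in("A", "ISO-8859-1")
p bytes_in("A", "UTF-8")

# Above 127 they differ: U+00E9 (e with acute accent) is one byte in
# Latin 1 but a two-byte sequence in UTF-8.
p bytes_in("\u00E9", "ISO-8859-1")
p bytes_in("\u00E9", "UTF-8")
```

This is why a pure-ASCII file is valid in either encoding, while a file with accented characters must be read with the encoding it was actually written in.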
|
|
|
#20 |
|
On Sep 29, 2007, at 9:47 PM, Felipe Contreras wrote:

> Hi,
>
> On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:
>>> Yes but what about stuff already encoded in UTF-16?
>>
>> That's why I said read up on unicode!
>> After you read that stuff you'll understand why it's no problem.
>> I'm not going to explain it. Many people understand it, but when
>> explaining it might make mistakes.
>> Read the unicode stuff carefully. It's vital for many things.
>>
>> The only thing you might run into is BOM or Endian-ness, but it's
>> doubtful it will be an issue in most cases.
>>
>> This might get you started.
>> http://www.unicode.org/faq/utf_bom.html#37
>>
>> Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting
>> how programmers need to know it and how few actually do.
>> The short version is that UTF-16 is basically wasteful. It uses 2
>> bytes for lower-level code-points (the stuff also known as ASCII
>> range) where UTF-8 does not.
>
> As you suggested I read the article:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII uses 8 bits, so it can store 256
> characters, so it can't store all the characters in the world, so
> other character sets are needed (really? I would have never guessed
> that). UTF-16 basically stores characters in 2 bytes (that means more
> characters in the world), UTF-8 also allows more characters it doesn't
> necessarily needs 2 bytes, it uses 1, and if the character is beyond
> 127 then it will use 2 bytes. This whole thing can be extended up to 6
> bytes.
>
> So what exactly am I looking for here?
>
>> You really need to spend an afternoon reading about unicode. It
>> should be required in any computer science program as part of an
>> encoding course, Americans in particular are often the ones who know
>> the least about it....
>
> What is there to know about Unicode? There's a couple of character
> sets, use UTF-8, and remember that one character != one byte. Is there
> anything else for practical purposes?
>
> I'm sorry if I'm being rude, but I really don't like when people tell
> me to read stuff I already know.
>
> My question is still there:
>
> Let's say I want to rename a file "fooobar", and remove the third "o",
> but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
> of course there will still be a 0x00 in there. That's if the string is
> recognized at all.
>
> Why is there no issue with UTF-16 if only UTF-8 is supported?
>
> I don't mind reading some more if I can actually find the answer.
>
> Best regards.
>
> --
> Felipe Contreras

Hmm... you should consider converting it to utf-8 via iconv.
There is a gem for iconv.
This will keep your data intact, but you might need to convert it back to utf-16 later.
I believe filenames on Windows are actually utf-8, and files' contents are generally written in utf-16.
Could be wrong on this... but test it and see!
Try to open a file with non-ASCII range characters in irb and see what happens.
If it fails, no harm done.
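
The convert, edit, convert-back workflow suggested here can be sketched for the "fooobar" example. This sketch uses `String#encode` rather than iconv, so it assumes Ruby 1.9 or later; it also assumes the input is known to be UTF-16LE without a BOM. `edit_utf16le` is a hypothetical helper name.

```ruby
# Edit UTF-16LE data by converting to UTF-8, changing it, and converting back.
# edit_utf16le is a made-up helper; the block receives the UTF-8 text.
def edit_utf16le(raw)
  # Label the raw bytes as UTF-16LE (an assumption about the input),
  # then transcode them to UTF-8 so ordinary string methods are safe.
  utf8 = raw.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  edited = yield(utf8)
  edited.encode(Encoding::UTF_16LE)   # transcode back for storage
end

raw = "fooobar".encode("UTF-16LE")    # stand-in for a name read from disk
result = edit_utf16le(raw) { |name| name.sub("ooo", "oo") }

puts result.bytesize                  # two bytes per remaining character
puts result.encode("UTF-8")
```

Because the removal happens on characters rather than bytes, the 0x00 half of the deleted "o" goes away with it instead of being left behind.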
|
|
|
#21 |
|
Oh, and Mr. Contreras,
I did not mean to say RTFM to you. Sorry if it seemed like that.
|
|
|
#22 |
|
On 30/09/2007, Felipe Contreras <felipe.contreras@gmail.com> wrote:

> Hi,
>
> On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:
> > > Yes but what about stuff already encoded in UTF-16?
> >
> > That's why I said read up on unicode!
> > After you read that stuff you'll understand why it's no problem.
> > I'm not going to explain it. Many people understand it, but when
> > explaining it might make mistakes.
> > Read the unicode stuff carefully. It's vital for many things.
> >
> > The only thing you might run into is BOM or Endian-ness, but it's
> > doubtful it will be an issue in most cases.
> >
> > This might get you started.
> > http://www.unicode.org/faq/utf_bom.html#37
> >
> > Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting
> > how programmers need to know it and how few actually do.
> > The short version is that UTF-16 is basically wasteful. It uses 2
> > bytes for lower-level code-points (the stuff also known as ASCII
> > range) where UTF-8 does not.
>
> As you suggested I read the article:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII uses 8 bits, so it can store 256
> characters, so it can't store all the characters in the world, so
> other character sets are needed (really? I would have never guessed
> that). UTF-16 basically stores characters in 2 bytes (that means more
> characters in the world), UTF-8 also allows more characters it doesn't
> necessarily needs 2 bytes, it uses 1, and if the character is beyond
> 127 then it will use 2 bytes. This whole thing can be extended up to 6
> bytes.
>
> So what exactly am I looking for here?

UTF-8 and UTF-16 are pretty much the same. They encode a single character using one or more units, where these units are 8-bit or 16-bit respectively.

The only thing you buy by converting to utf-16 is space efficiency for codepoints that require nearly 16 bits to represent (such as Japanese characters) and endianness issues.

Note that some characters may (and some must) be composed of multiple codepoints (a character codepoint, and additional accent codepoint(s)).

> > You really need to spend an afternoon reading about unicode. It
> > should be required in any computer science program as part of an
> > encoding course, Americans in particular are often the ones who know
> > the least about it....
>
> What is there to know about Unicode? There's a couple of character
> sets, use UTF-8, and remember that one character != one byte. Is there
> anything else for practical purposes?
>
> I'm sorry if I'm being rude, but I really don't like when people tell
> me to read stuff I already know.
>
> My question is still there:
>
> Let's say I want to rename a file "fooobar", and remove the third "o",
> but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
> of course there will still be a 0x00 in there. That's if the string is
> recognized at all.
>
> Why is there no issue with UTF-16 if only UTF-8 is supported?

If you handle UTF-16 as something else you break it regardless of the language support. If you know (or have a way to find out) it's UTF-16 you can convert it. If there is no way to find out, all language support is moot.

Thanks

Michal
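
Michal's note about characters composed of multiple codepoints can be illustrated with a short sketch. It assumes a Ruby that provides `String#unicode_normalize` (2.2 or later, well after this thread); `codepoint_count` is a hypothetical helper name.

```ruby
# Count the code points in a string (not bytes, not visible characters).
# codepoint_count is a made-up helper name used for illustration only.
def codepoint_count(str)
  str.codepoints.length
end

decomposed = "e\u0301"                          # "e" + combining acute accent
composed   = decomposed.unicode_normalize(:nfc) # precomposed U+00E9 if one exists

puts codepoint_count(decomposed)   # base letter plus accent
puts codepoint_count(composed)     # single precomposed code point
puts decomposed == composed        # same visible text, different code points
```

So "one character != one byte" is only part of the story: one visible character is not always one code point either, which matters for comparing, reversing, or truncating strings.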
|
|
|
#23 |
|
On Sep 30, 2007, at 12:22 AM, John Joyce wrote:

> There is a gem for iconv

The iconv library is a standard library shipped with Ruby.

James Edward Gray II
|
|
|
#24 |
|
On Oct 1, 2007, at 9:08 AM, James Edward Gray II wrote:

> On Sep 30, 2007, at 12:22 AM, John Joyce wrote:
>
>> There is a gem for iconv
>
> The iconv library is a standard library shipped with Ruby.
>
> James Edward Gray II

Sure enough!
Just got so used to require rubygems with nearly everything...
|
|
|