Lost in encoding stuff

16/01/2008, 09:49   #1
Alexander Adam

Hi,

I am a bit lost in encoding related stuff. Let me explain what I am
doing (yes it's C++):
I am getting some input content via the Expat XML parser. I've set up Expat
to use wchar_t.
First question is this -- what is the difference of unsigned short,
wchar_t and char?
Okay, wchar_t is a built-in type of C++ and it's two bytes in size
whereas char is always one byte.
But what's the real difference when storing Text into those types i.e.
ASCII, UTF-8, UTF-16 or UTF-32 encoded text?
Afaik, UTF-8 is 2 bytes, UTF-16 is 2 bytes and UTF-32 is up to four
bytes? Well anyway, my issue is how to correctly work with those
types. Internally I am using wchar_t for all my representations but
depending on the encoding I need to shift a current char value
bitwise, right?
Okay next one -- I am storing everything of my wchar_t array into a
stream of type char, doing so by a simple memcpy. Now how could I read
it back in? Say I have char* buffer where my wchar_t string is saved
in. I could surely do a simple memcpy(myWcharVar, buffer,
sizeof(wchar_t)) to get two bytes but this doesn't seem to be very
efficient as I'd like to read it char by char (like wchar_t nx =
buffer.next(), know what I mean?).
And then after having read such a char, I must be able to correctly
encode it. I know the encoding whether it's ASCII, UTF-8, 16 or
anything but how would I go about it *without* using any big
libraries?

Thanks for *any* clarifications you could come up with on this topic,
Alex
16/01/2008, 14:31   #2
Victor Bazarov

Alexander Adam wrote:
> I am a bit lost in encoding related stuff. Let me explain what I am
> doing (yes it's C++):
> I am getting some input content via the Expat XML parser. I've set up Expat
> to use wchar_t.
> First question is this -- what is the difference of unsigned short,
> wchar_t and char?


Those are usually three different types. 'unsigned short' is at least
as big as 'char', and 'wchar_t' is also at least as big as 'char', but
no other guarantees exist.

> Okay, wchar_t is a built-in type of C++ and it's two bytes in size
> whereas char is always one byte.


No guarantee about the size of 'wchar_t' is given except that it is
at least as big as 'char'.

> But what's the real difference when storing Text into those types i.e.
> ASCII, UTF-8, UTF-16 or UTF-32 encoded text?


A different amount of work needs to be done to store each of those
encodings in the types you mention, but only 'ASCII'->'char' is
trivial, AFAIUI.

> Afaik, UTF-8 is 2 bytes, UTF-16 is 2 bytes and UTF-32 is up to four
> bytes?


Whatever they are, it's not really on topic here. Google 'unicode'
and read about them.

> Well anyway, my issue is how to correctly work with those
> types. Internally I am using wchar_t for all my representations but
> depending on the encoding I need to shift a current char value
> bitwise, right?


Sounds similar to my experience. But it all depends on the size of
'wchar_t'. You may not need to shift anything at all in some cases.

> Okay next one -- I am storing everything of my wchar_t array into a
> stream of type char, doing so by a simple memcpy. Now how could I read
> it back in?


You can use memcpy, just switch the order of the first two arguments.
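
A minimal sketch of that round trip, assuming the bytes are written and read by the same program on the same machine (so size and byte order of wchar_t do not change in between). The names to_bytes and next_wchar are invented for illustration; next_wchar plays the role of the buffer.next() asked about above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Store step: copy the raw bytes of a wide string into a char buffer.
std::vector<char> to_bytes(const std::wstring& text) {
    std::vector<char> bytes(text.size() * sizeof(wchar_t));
    if (!bytes.empty())
        std::memcpy(&bytes[0], text.data(), bytes.size());
    return bytes;
}

// Read step: same memcpy with source and destination swapped.
// 'offset' advances by sizeof(wchar_t), whatever that is on this platform.
wchar_t next_wchar(const std::vector<char>& bytes, std::size_t& offset) {
    assert(offset + sizeof(wchar_t) <= bytes.size());
    wchar_t value;
    std::memcpy(&value, &bytes[offset], sizeof(wchar_t));
    offset += sizeof(wchar_t);
    return value;
}
```

Going through memcpy rather than casting the char* to wchar_t* avoids alignment problems when the offset is not a multiple of sizeof(wchar_t).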

> Say I have char* buffer where my wchar_t string is saved
> in. I could surely do a simple memcpy(myWcharVar, buffer,
> sizeof(wchar_t)) to get two bytes but this doesn't seem to be very
> efficient as I'd like to read it char by char (like wchar_t nx =
> buffer.next(), know what I mean?).
> And then after having read such a char, I must be able to correctly
> encode it. I know the encoding whether it's ASCII, UTF-8, 16 or
> anything but how would I go about it *without* using any big
> libraries?


You would have to roll your own, I guess.

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask


16/01/2008, 15:21   #3
Phil Endecott

Alexander Adam wrote:
> Hi,
>
> I am a bit lost in encoding related stuff. Let me explain what I am
> doing (yes it's C++):
> I am getting some input content via the Expat XML parser. I've set up Expat
> to use wchar_t.
> First question is this -- what is the difference of unsigned short,
> wchar_t and char?


On my compiler, they all have different sizes....

> Okay, wchar_t is a built-in type of C++ and it's two bytes in size


I believe that that's the case on Windows. On Linux, wchar_t is 4
bytes. You should not rely on it having any particular size.

> whereas char is always one byte.
> But what's the real difference when storing Text into those types i.e.
> ASCII, UTF-8, UTF-16 or UTF-32 encoded text?
> Afaik, UTF-8 is 2 bytes,


No. It's a variable length encoding. Look it up, e.g. on the Unicode
web site. I bet Wikipedia has a good description too.

> UTF-16 is 2 bytes


No. It's a variable length encoding. For the vast majority of cases,
it will use two bytes per character, but you shouldn't rely on that.
Look it up.

> and UTF-32 is up to four
> bytes?


It's always exactly four bytes per character. Look it up.

> Well anyway, my issue is how to correctly work with those
> types. Internally I am using wchar_t for all my representations but
> depending on the encoding I need to shift a current char value
> bitwise, right?


Err, I'm not sure what you mean, but no I don't think that's the right
thing to do. What do you mean by "work with" these types? What are you
actually trying to do?

> Okay next one -- I am storing everything of my wchar_t array into a
> stream of type char,


Why?

> doing so by a simple memcpy. Now how could I read
> it back in? Say I have char* buffer where my wchar_t string is saved
> in. I could surely do a simple memcpy(myWcharVar, buffer,
> sizeof(wchar_t)) to get two bytes but this doesn't seem to be very
> efficient as I'd like to read it char by char (like wchar_t nx =
> buffer.next(), know what I mean?).


Beware that there are endianness issues to worry about here.
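
One common way around the byte-order problem, sketched here under the assumption that the data is a sequence of 16-bit code units, is to write each unit in a fixed order with shifts instead of copying raw memory. The names put_u16_be and get_u16_be are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one 16-bit code unit in big-endian order (high byte first),
// independent of the byte order of the machine running the code.
void put_u16_be(std::vector<unsigned char>& out, std::uint16_t unit) {
    out.push_back(static_cast<unsigned char>(unit >> 8));
    out.push_back(static_cast<unsigned char>(unit & 0xFF));
}

// Read back the 16-bit code unit stored at bytes[pos] and bytes[pos + 1].
std::uint16_t get_u16_be(const std::vector<unsigned char>& bytes, std::size_t pos) {
    assert(pos + 1 < bytes.size());
    return static_cast<std::uint16_t>((bytes[pos] << 8) | bytes[pos + 1]);
}
```

This is the only place where bit shifting is really needed for storage: it fixes the byte order, not the encoding.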

Your compiler will possibly optimise a memcpy() into efficient inline code.

But if you kept it in a wchar_t buffer, you wouldn't need to worry about
this.

> And then after having read such a char, I must be able to correctly
> encode it. I know the encoding whether it's ASCII, UTF-8, 16 or
> anything but how would I go about it *without* using any big
> libraries?


Why the prohibition of libraries? POSIX systems have iconv(), which
will do it all for you. I think Windows has something similar.
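
For reference, a rough sketch of what calling iconv() looks like. It assumes a system where iconv() takes a char** input pointer (as glibc does; some platforms declare const char**) and where the encoding names used are supported. The wrapper name convert and the output buffer sizing are assumptions made for this example, not part of the iconv API.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Convert 'input' from encoding 'from' to encoding 'to' in one call.
std::string convert(const std::string& input, const char* from, const char* to) {
    assert(from && to);
    iconv_t cd = iconv_open(to, from);  // note: target encoding comes first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Assumption: four output bytes per input byte (plus room for a BOM)
    // is enough for the conversions of interest here.
    std::vector<char> out(input.size() * 4 + 16);
    char* in_ptr = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t in_left = input.size();
    char* out_ptr = out.data();
    std::size_t out_left = out.size();

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed (invalid input or buffer too small)");
    return std::string(out.data(), out.size() - out_left);
}
```

Naming the byte order explicitly ("UTF-16LE" rather than "UTF-16") avoids a byte order mark being written at the start of the output.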

If you want to write the code yourself, you should find enough
description in the definitions of the encodings.
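
As an illustration of how little code that can be, here is a sketch of a UTF-8 encoder for a single code point, following the bit layout in the UTF-8 definition. The function name append_utf8 is invented for this example; it assumes the input is already a Unicode code point (for instance one wchar_t on a platform where wchar_t holds UTF-32).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding (1 to 4 bytes) of one code point to 'out'.
void append_utf8(std::string& out, std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid code points to encode.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

This is where the bit shifting from the original question comes in: it depends on the target encoding, not on the size of wchar_t.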


I have done some work on strings tagged with their character sets which
I may propose for Boost at some point in the future. You'll find my
first attempt if you look for my name in the Boost list archives from
last September and October. I'm currently revising it, and my first
step has been to define char8_t, char16_t and char32_t. These types are
guaranteed to have exactly the indicated number of bits, and to be char
or wchar_t when that type is the right size. Here's the code:

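#include <boost/integer.hpp>  // boost::uint_t

// Primary template: the smallest unsigned integer type with at least 'bits' bits.
template <int bits>
struct char_t {
    typedef typename boost::uint_t<bits>::least type;
};

// Prefer char when it has exactly the requested width.
template <>
struct char_t<8*sizeof(char)> {
    typedef char type;
};

// Prefer wchar_t when it has exactly the requested width.
// (Assumes sizeof(wchar_t) != sizeof(char); otherwise this specialization
// would clash with the one above.)
template <>
struct char_t<8*sizeof(wchar_t)> {
    typedef wchar_t type;
};

// Note: since C++11, char16_t and char32_t are built-in types (char8_t since
// C++20), so these typedefs only compile with older compilers.
typedef char_t<8>::type char8_t;
typedef char_t<16>::type char16_t;
typedef char_t<32>::type char32_t;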


I suggest using something like char16_t, rather than wchar_t, as the
basis for a UTF-16 string, for portability. I'm currently not sure how
this can work with string literals, though.

Regards, Phil.
17/01/2008, 09:10   #4
James Kanze

On Jan 16, 4:21 pm, Phil Endecott <spam_from_usenet_0...@chezphil.org>
wrote:
> Alexander Adam wrote:


> > I am a bit lost in encoding related stuff. Let me explain what I am
> > doing (yes it's C++):
> > I am getting some input content via the Expat XML parser. I've set up Expat
> > to use wchar_t.
> > First question is this -- what is the difference of unsigned short,
> > wchar_t and char?


> On my compiler, they all have different sizes....


On most of mine too, but there are also systems where they all
have the same size. Signedness also varies, at least for char
and wchar_t.
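
A small sketch for checking what a given compiler actually does, using only standard facilities. The helper names bits_of, char_is_signed, wchar_is_signed and check_guarantees are made up for this example.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <limits>

// Width in bits of a type on the platform this is compiled for.
template <typename T>
std::size_t bits_of() { return sizeof(T) * CHAR_BIT; }

// Signedness of char and wchar_t is left to the implementation.
inline bool char_is_signed()  { return std::numeric_limits<char>::is_signed; }
inline bool wchar_is_signed() { return std::numeric_limits<wchar_t>::is_signed; }

// The few things that hold everywhere; the actual sizes vary.
inline void check_guarantees() {
    assert(sizeof(char) == 1);
    assert(bits_of<unsigned short>() >= 16);
    assert(sizeof(wchar_t) >= sizeof(char));
}
```

Printing bits_of<wchar_t>() on Windows and on Linux is a quick way to see the 2-byte versus 4-byte difference discussed in this thread.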

> > Okay, wchar_t is a built-in type of C++ and it's two bytes in size


> I believe that that's the case on Windows. On Linux, wchar_t is 4
> bytes. You should not rely on it having any particular size.


I think it's also 2 bytes under AIX, and 4 bytes under most
other Unix. It's definitely 4 bytes under Solaris, but it isn't
Unicode. (In the string L"été", the é is encoded 0x30000069!)

> > whereas char is always one byte.
> > But what's the real difference when storing Text into those types i.e.
> > ASCII, UTF-8, UTF-16 or UTF-32 encoded text?
> > Afaik, UTF-8 is 2 bytes,


> No. It's a variable length encoding. Look it up, e.g. on the Unicode
> web site. I bet Wikipedia has a good description too.


For reference, the Unicode web site (http://www.unicode.org) has
a lot of valuable information. Another useful site is Markus
Kuhn's FAQ (http://www.cl.cam.ac.uk/~mgk25/unicode.html); the
title says "for Unix/Linux", but there's actually very little
which is platform specific.

> > UTF-16 is 2 bytes


> No. It's a variable length encoding. For the vast majority of cases,
> it will use two bytes per character, but you shouldn't rely on that.
> Look it up.


The keyword is surrogate. Start with the vocabulary at the
Unicode site. You'll also want to check out composing
characters and canonical forms.

UTF-16 is either 2 or 4 bytes for a given code point. A
character may require more than one code point, however, so it
can be even longer.

What's true is that it is always a multiple of 2 bytes.
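
To make the surrogate mechanism concrete, here is a sketch of encoding one code point as UTF-16 code units. The function name append_utf16 is hypothetical; std::uint16_t is used for the code units so that the example does not depend on the size of wchar_t.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append the UTF-16 code units for one code point: a single unit for
// U+0000..U+FFFF, a surrogate pair for anything above.
void append_utf16(std::vector<std::uint16_t>& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 across the pair
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));  // low surrogate
    }
}
```

For example, U+00E9 gives the single unit 0x00E9, while U+1F600 gives the pair 0xD83D 0xDE00, which is the 2 versus 4 bytes mentioned above.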

> > and UTF-32 is up to four bytes?


> It's always exactly four bytes per character. Look it up.


Not at all. It's always a multiple of 4 bytes, but depending on
how many compositional elements are involved, it can be up to 16
bytes, or even more. In normal use, Vietnamese will require up
to 12 bytes, and I don't think any other language will require
more than 4 or 8, depending on the canonical form being used.
(Special alphabets, like the IPA or other phonetic
representations, might require more.)

It's a code point that is always 4 bytes in UTF-32, not a character.
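
The Vietnamese case can be illustrated with the letter "ệ" (e with circumflex and dot below), written out as explicit code points. This is only a sketch; the array and function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decomposed form (NFD): base letter plus two combining marks.
const std::uint32_t e_nfd[] = {
    0x0065,  // LATIN SMALL LETTER E
    0x0323,  // COMBINING DOT BELOW
    0x0302,  // COMBINING CIRCUMFLEX ACCENT
};

// Precomposed form (NFC): the same character as a single code point.
const std::uint32_t e_nfc[] = { 0x1EC7 };

// Bytes occupied in UTF-32: four per code point, not per character.
template <std::size_t N>
std::size_t utf32_bytes(const std::uint32_t (&)[N]) { return N * 4; }

// Sanity check of the figures quoted in the thread.
inline void check_sizes() {
    assert(utf32_bytes(e_nfd) == 12);
    assert(utf32_bytes(e_nfc) == 4);
}
```

Both arrays represent one character to the reader, which is why counting code points is not the same as counting characters.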

> > Well anyway, my issue is how to correctly work with those
> > types. Internally I am using wchar_t for all my
> > representations but depending on the encoding I need to
> > shift a current char value bitwise, right?


> Err, I'm not sure what you mean, but no I don't think that's
> the right thing to do. What do you mean by "work with" these
> types? What are you actually trying to do?


It sounds like he's thinking of some sort of state dependent
encoding. Which Unicode isn't (even if it requires multi-byte
or multi-code-point encodings).

> > Okay next one -- I am storing everything of my wchar_t array
> > into a stream of type char,


> Why?


To make life difficult:-).

Seriously, I don't use wchar_t ever, because of portability
concerns. Internally, it's almost always char, in UTF-8. Which
works well for what I do, but I can imagine applications where
having at least the code points a fixed length would make things
simpler.

> > doing so by a simple memcpy. Now how could I read
> > it back in? Say I have char* buffer where my wchar_t string is saved
> > in. I could surely do a simple memcpy(myWcharVar, buffer,
> > sizeof(wchar_t)) to get two bytes but this doesn't seem to be very
> > efficient as I'd like to read it char by char (like wchar_t nx =
> > buffer.next(), know what I mean?).


> Beware that there are endianness issues to worry about here.


As long as he is in memory, there shouldn't be any real problem
(unless it is shared memory, between two different CPUs with
different byte orders---but that's rare enough that I just
ignore the possibility).

> Your compiler will possibly optimise a memcpy() into efficient
> inline code.


Alternatively, it might do a better job if you use std::copy
(which is typically an inline function).

> But if you kept it in a wchar_t buffer, you wouldn't need to
> worry about this.


> > And then after having read such a char, I must be able to correctly
> > encode it. I know the encoding whether it's ASCII, UTF-8, 16 or
> > anything but how would I go about it *without* using any big
> > libraries?


> Why the prohibition of libraries? POSIX systems have iconv(), which
> will do it all for you. I think Windows has something similar.


libiconv is available from GNU. There's a port to Windows. On
the other hand, it may be overkill, and with the possibilities
of automatic memory management offered by C++, you could
certainly design something easier to use.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34