comp.unix.shell Using and programming the Unix shell.

Determine the charset in a file

26/10/2006, 09:30   #1
Stein Arne Storslett


Hi ppl.

I need to convert flat text files from UTF-8 to ISO-8859-1, and
for that purpose I use iconv.
I do this automatically with a script (which has other purposes
as well, as sending the file to a remote recipient).

The problem arises when the odd file is already converted, or
is not in UTF-8.
I have no control over the input files, so I can't make sure that
the files are in the correct format when they arrive.

The platform I run is Solaris 9, and the shell is bash, but any
shell specific tool should be available.

Is there any tool I can use to determine the charset of a file?


--
Stein Arne Storslett

26/10/2006, 14:32   #2
Janis

Stein Arne Storslett wrote:
>
> Hi ppl.
>
> I need to convert flat text files from UTF-8 to ISO-8859-1, and
> for that purpose I use iconv.
> I do this automatically with a script (which has other purposes
> as well, as sending the file to a remote recipient).
>
> The problem arises when the odd file is already converted, or
> is not in UTF-8.
> I have no control over the input files, so I can't make sure that
> the files are in the correct format when they arrive.
>
> The platform I run is Solaris 9, and the shell is bash, but any
> shell specific tool should be available.
>
> Is there any tool I can use to determine the charset of a file?


Without any additional information (or meta-information) that's
not possible. After all, both ISO 8859-1 and UTF-8 are codings
that use 8-bit entities, and files usually don't carry any encoding
scheme with them (which is an interface representation issue).

Though, you may apply heuristics, if that suffices, to rule out some
(or even most) candidates.

The first 128 character positions have the same coding in both
representations: 0xxxxxxx. A UTF-8 coding of the upper 128
character positions in ISO 8859-1 is required to use a two-byte
coding: 110xxxxx 10xxxxxx. And given that only 8 bits are
required for ISO 8859-1, it's constrained further to the two-byte
sequence: 1100001x 10xxxxxx (lead byte 0xC2 or 0xC3).

To summarize: If there are only either 0xxxxxxx characters or
1100001x 10xxxxxx character sequences in your file, it is
likely that it's a UTF-8 encoding of ISO 8859-1 text (but be aware
that this statement depends highly on the statistical properties
of the original text in the files). Otherwise, of the two possible
choices, it can only be the second, an ISO 8859-1 coding.

And no, I don't know of any tool that checks files based on this
heuristic.

Disclaimer: All that said off the top of my head. (CMIIW.)
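
A minimal sketch of the heuristic described above, assuming od and awk are available (both are specified by POSIX, but this is untested on Solaris 9). The function name guess_latin1_utf8 is made up for illustration. It accepts only the lead bytes 0xC2 and 0xC3 for the two-byte form, since those are the only lead bytes that encode code points 128 to 255.

```shell
#!/bin/sh
# guess_latin1_utf8 FILE   (hypothetical helper name)
# Prints one word:
#   ascii        only bytes below 128
#   utf8-latin1  ASCII plus two-byte sequences 1100001x 10xxxxxx
#   other        anything else (raw ISO 8859-1, wider UTF-8, binary...)
guess_latin1_utf8() {
    # od prints every byte as an unsigned decimal number; awk walks them.
    od -An -v -tu1 "$1" | awk '
        BEGIN { need = 0; high = 0; bad = 0 }
        {
            for (i = 1; i <= NF; i++) {
                b = $i + 0
                if (need) {
                    # A continuation byte 10xxxxxx must follow a lead byte.
                    if (b < 128 || b > 191) bad = 1
                    need = 0
                } else if (b == 194 || b == 195) {
                    # Lead byte 0xC2 or 0xC3.
                    need = 1; high = 1
                } else if (b >= 128) {
                    bad = 1
                }
            }
        }
        END {
            if (bad || need) print "other"
            else if (high)   print "utf8-latin1"
            else             print "ascii"
        }'
}
```

Called as guess_latin1_utf8 somefile, it prints a single word. A result of "other" only says that the file is not a UTF-8 encoding of Latin-1 text; it does not say what the file actually is.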

Janis

>
>
> --
> Stein Arne Storslett


27/10/2006, 12:30   #3
Janis Papanagnou

Stein Arne Storslett wrote:
> Hi ppl.
>
> I need to convert flat text files from UTF-8 to ISO-8859-1, and
> for that purpose I use iconv.
> I do this automatically with a script (which has other purposes
> as well, as sending the file to a remote recipient).
>
> The problem arises when the odd file is already converted, or
> is not in UTF-8.
> I have no control over the input files, so I can't make sure that
> the files are in the correct format when they arrive.
>
> The platform I run is Solaris 9, and the shell is bash, but any
> shell specific tool should be available.
>
> Is there any tool I can use to determine the charset of a file?
>
>


An existing Unix tool that also uses heuristics to guess the file
type is the command file(1). On Linux, 'file' was able to identify
a UTF-8 ad-hoc test file, so you may try that on Solaris, too...

$ file abc*
abc: ISO-8859 text
abc.utf: UTF-8 Unicode text

That works even if you change the file extension.

Your code then might be something like...

filetype=$( file "${yourfile}" )
case ${filetype} in
*UTF-8*) printf "UTF-8 file\n" ;;
*) printf "other file\n" ;;
esac
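
One way the test above could be tied back to the original iconv job, as a sketch only. It assumes that file(1) on the target system reports UTF-8 files with the string "UTF-8" (as in the Linux output shown) and that iconv knows the names UTF-8 and ISO-8859-1. The function name to_latin1 is made up for illustration.

```shell
#!/bin/sh
# to_latin1 FILE   (hypothetical helper name)
# Writes FILE to standard output as ISO-8859-1: converted with iconv when
# file(1) guesses UTF-8, copied unchanged otherwise.
to_latin1() {
    desc=$(file "$1")
    # Drop the leading "FILE:" so a file name containing "UTF-8"
    # cannot trigger a false match.
    desc=${desc#"$1":}
    case $desc in
        *UTF-8*) iconv -f UTF-8 -t ISO-8859-1 "$1" ;;
        *)       cat "$1" ;;
    esac
}
```

For example, to_latin1 incoming.txt > outgoing.txt, where both file names are placeholders. Pure ASCII files fall into the second branch, which is harmless because ASCII is identical in both encodings.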


Janis

08/11/2006, 20:31   #4
Chris Mattern

In article <MN-dnVl98Z_c793YRVnzvQ@telenor.com>, Stein Arne Storslett wrote:
>
>Hi ppl.
>
>I need to convert flat text files from UTF-8 to ISO-8859-1, and
>for that purpose I use iconv.
>I do this automatically with a script (which has other purposes
>as well, as sending the file to a remote recipient).
>
>The problem arises when the odd file is already converted, or
>is not in UTF-8.
>I have no control over the input files, so I can't make sure that
>the files are in the correct format when they arrive.
>
>The platform I run is Solaris 9, and the shell is bash, but any
>shell specific tool should be available.
>
>Is there any tool I can use to determine the charset of a file?
>

No, because a valid ISO-8859-1 file is a valid UTF-8 file and
vice versa. You just get different contents depending on
which decoding you use. You could try decoding both ways
to see which way makes sense, if you have a usable "makes
sense" criterion you can apply.

--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"

08/11/2006, 20:47   #5
Stephane CHAZELAS

2006-11-08, 20:31(-00), Chris Mattern:
[...]
> No, because a valid ISO-8859-1 file is a valid UTF-8 file and
> vice versa. You just get different contents depending on
> which decoding you use. You could try decoding both ways
> to see which way makes sense, if you have a usable "makes
> sense" criterion you can apply.


Well, there are some bytes (for iso8859-1) and byte sequences (for
UTF8) that are not allowed in those charsets.

For instance a file that contains any byte between 128 and 159
would not be a valid iso8859-1 file. A file with bytes 254 or
255 can't be utf8 and the only sequences of bytes allowed for
utf8 are (in binary)

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

So that the sequence 01010101 10101010 (valid in iso8859-1: Uª)
would not be correct in utf8 for instance.
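
A strict decoder enforces exactly these rules, so one way to test a file is to let iconv try to decode it as UTF-8 and look at the exit status. A minimal sketch, assuming an iconv that exits non-zero on invalid input (GNU iconv does) and that offers a UTF-8 to UTF-8 conversion (not verified for Solaris 9; another target charset would serve the same purpose). The name is_valid_utf8 is made up for illustration.

```shell
#!/bin/sh
# is_valid_utf8 FILE   (hypothetical helper name)
# Exit status 0 only if iconv can decode all of FILE as UTF-8.
# The converted output is discarded; only the status matters.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

It can be used as: if is_valid_utf8 somefile; then ...; fi. A file holding the two bytes from the example above (01010101 10101010) fails the check, while pure ASCII passes, since ASCII is valid UTF-8.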

--
Stéphane

09/11/2006, 19:55   #6
Martin Jørgensen

syscjm@sumire.eng.sun.com (Chris Mattern) writes:

> In article <MN-dnVl98Z_c793YRVnzvQ@telenor.com>, Stein Arne Storslett wrote:
>>
>>Hi ppl.
>>
>>I need to convert flat text files from UTF-8 to ISO-8859-1, and
>>for that purpose I use iconv.
>>I do this automatically with a script (which has other purposes
>>as well, as sending the file to a remote recipient).
>>
>>The problem arises when the odd file is already converted, or
>>is not in UTF-8.
>>I have no control over the input files, so I can't make sure that
>>the files are in the correct format when they arrive.
>>
>>The platform I run is Solaris 9, and the shell is bash, but any
>>shell specific tool should be available.
>>
>>Is there any tool I can use to determine the charset of a file?
>>

> No, because a valid ISO-8859-1 file is a valid UTF-8 file and
> vice versa. You just get different contents depending on
> which decoding you use. You could try decoding both ways
> to see which way makes sense, if you have a usable "makes
> sense" criterion you can apply.


As I was told in another group: How about file -i?

file -i tom_opgave.tex
tom_opgave.tex: text/plain; charset=utf-8

file -i determine_charset.txt
determine_charset.txt: text/plain; charset=us-ascii

file -i Alf_Drews_opg8_del1.tex
Alf_Drews_opg8_del1.tex: text/plain; charset=iso-8859-1

Seems to work for me, but if there are any better ways, I would also like
to hear...
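
For use in a script, the charset can be cut out of that output. A rough sketch, assuming a file(1) whose -i option prints "charset=..." as in the examples above (other implementations may use a different option or not support this at all). The names detect_charset and describe_charset are made up for illustration.

```shell
#!/bin/sh
# detect_charset FILE   (hypothetical helper name)
# Prints only the charset guessed by "file -i", for example "utf-8".
detect_charset() {
    file -i "$1" | sed -n 's/.*charset=\([^ ;]*\).*/\1/p'
}

# describe_charset FILE   (hypothetical helper name)
# Pure ASCII needs no conversion: it is byte-for-byte identical in
# UTF-8 and ISO-8859-1.
describe_charset() {
    case $(detect_charset "$1") in
        us-ascii) echo "ASCII only, nothing to convert" ;;
        utf-8)    echo "UTF-8, convert with iconv" ;;
        *)        echo "something else, leave alone or inspect" ;;
    esac
}
```

The result is still only the guess made by file, so the catch-all branch should be treated with the same caution as any other heuristic.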


Best regards
Martin Jørgensen

--
---------------------------------------------------------------------------
Home of Martin Jørgensen - http://www.martinjoergensen.dk

09/11/2006, 20:27   #7
Stephane CHAZELAS

2006-11-09, 20:55(+01), Martin Jørgensen:
[...]
> file -i tom_opgave.tex
> tom_opgave.tex: text/plain; charset=utf-8
>
> file -i determine_charset.txt
> determine_charset.txt: text/plain; charset=us-ascii
>
> file -i Alf_Drews_opg8_del1.tex
> Alf_Drews_opg8_del1.tex: text/plain; charset=iso-8859-1
>
> Seems to work for me, but if there are any better ways, I would also like
> to hear...

[...]

The iso-8859-1 one is a lie. There's no way to differentiate any
of the iso-8859-xx charsets from one another, as the same byte
values are allowed.

iso8859-1 is being replaced by iso8859-15. There are few
differences between the two, the biggest being that
character 164 is a euro sign in -15 and a "currency sign" in -1.

So if you find a file with a byte 164 in it, then it's more
likely to be an iso8859-15 file than an iso8859-1 one.

A file containing:

« Ça coûte 3 ¤. »

That is, "it's worth 3 <x>" in French, where <x> is character
164. It is more likely to be written in iso8859-15 than in iso8859-1
because it's less likely that the author would say "it's worth 3
currency signs".

Still, file -i reports it as iso8859-1.
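
Checking for byte 164 by hand is simple enough. A small sketch, assuming a tr that works on single bytes in the C locale and understands octal escapes (164 decimal is 244 octal). The name count_byte_164 is made up for illustration.

```shell
#!/bin/sh
# count_byte_164 FILE   (hypothetical helper name)
# Prints how many bytes with value 164 (octal 244, hex A4) FILE contains.
count_byte_164() {
    # tr -cd deletes every byte except 164; wc -c counts what is left.
    # The final tr strips the padding some wc implementations add.
    LC_ALL=C tr -cd '\244' < "$1" | wc -c | tr -d ' '
}
```

A non-zero count is only a hint, and only for files already known not to be UTF-8: in UTF-8 the byte 164 also occurs as a continuation byte inside multi-byte sequences.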


--
Stéphane