comp.unix.shell: Using and programming the Unix shell.

#1
Hi ppl.

I need to convert flat text files from UTF-8 to ISO-8859-1, and for that purpose I use iconv. I do this automatically with a script (which has other purposes as well, such as sending the file to a remote recipient).

The problem arises when the odd file has already been converted, or is not in UTF-8. I have no control over the input files, so I can't make sure that the files are in the correct format when they arrive.

The platform I run is Solaris 9, and the shell is bash, but any shell-specific tool should be available.

Is there any tool I can use to determine the charset of a file?

--
Stein Arne Storslett

#2
Stein Arne Storslett wrote:
> Hi ppl.
>
> I need to convert flat text files from UTF-8 to ISO-8859-1, and
> for that purpose I use iconv.
> I do this automatically with a script (which has other purposes
> as well, such as sending the file to a remote recipient).
>
> The problem arises when the odd file has already been converted, or
> is not in UTF-8.
> I have no control over the input files, so I can't make sure that
> the files are in the correct format when they arrive.
>
> The platform I run is Solaris 9, and the shell is bash, but any
> shell-specific tool should be available.
>
> Is there any tool I can use to determine the charset of a file?

Without any additional information (or meta-information) that's not possible. After all, both ISO 8859-1 and UTF-8 are codings built from 8-bit units, and files usually don't carry any encoding label with them (which is an interface representation issue).

You may apply heuristics, though, to rule out some (or even most) candidates.

The first 128 character positions have the same coding in both representations:

    0xxxxxxx

A UTF-8 coding of the upper 128 character positions in ISO 8859-1 is required to use a two-byte coding:

    110xxxxx 10xxxxxx

And given that only 8 bits are required for ISO 8859-1, with the top bit set for those positions, it's constrained further to the two-byte sequence (lead byte 0xC2 or 0xC3):

    1100001x 10xxxxxx

To summarize: if there are only 0xxxxxxx bytes or 1100001x 10xxxxxx byte sequences in your file, it is likely a UTF-8 encoding of ISO 8859-1 text (but be aware that this statement depends highly on the statistical properties of the original text in the files). Otherwise it is guaranteed to be the second of the two possible choices, an ISO 8859-1 coding.

And no, I don't know of any tool that checks files based on this heuristic.

Disclaimer: all that said off the top of my head. (CMIIW.)

Janis

> --
> Stein Arne Storslett
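
One way to approximate this heuristic without a dedicated tool is to let iconv itself do the checking: a trial conversion from UTF-8 to ISO-8859-1 should only succeed when the input decodes as UTF-8 and every character fits in ISO 8859-1. The sketch below assumes an iconv that exits non-zero on invalid or unconvertible input (GNU iconv does) and that accepts the encoding names `UTF-8` and `ISO-8859-1`; substitute whatever names your existing script already passes. The function names are made up for illustration.

```shell
# Succeeds if FILE decodes as UTF-8 and every character fits in ISO-8859-1.
# Assumption: iconv returns a non-zero status on invalid or unconvertible input.
looks_like_utf8_latin1() {
    iconv -f UTF-8 -t ISO-8859-1 < "$1" > /dev/null 2>&1
}

# Hypothetical wrapper: convert files that pass the check, pass others through.
to_latin1() {
    if looks_like_utf8_latin1 "$1"; then
        iconv -f UTF-8 -t ISO-8859-1 < "$1"
    else
        cat < "$1"
    fi
}
```

The same caveat applies as for the bit-pattern rule: an ISO 8859-1 file whose high bytes happen to form valid two-byte sequences passes the check and would be converted a second time.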

#3
Stein Arne Storslett wrote:
> Hi ppl.
>
> I need to convert flat text files from UTF-8 to ISO-8859-1, and
> for that purpose I use iconv.
> I do this automatically with a script (which has other purposes
> as well, such as sending the file to a remote recipient).
>
> The problem arises when the odd file has already been converted, or
> is not in UTF-8.
> I have no control over the input files, so I can't make sure that
> the files are in the correct format when they arrive.
>
> The platform I run is Solaris 9, and the shell is bash, but any
> shell-specific tool should be available.
>
> Is there any tool I can use to determine the charset of a file?

An existing Unix tool that also uses heuristics to guess the file type is the command file(1). On Linux, 'file' was able to identify a UTF-8 ad-hoc test file, so you may try that on Solaris, too...

    $ file abc*
    abc:     ISO-8859 text
    abc.utf: UTF-8 Unicode text

That works even if you change the file extension. Your code might then be something like...

    # Classify by matching on the description that file(1) prints.
    filetype=$( file "${yourfile}" )
    case ${filetype} in
    *UTF-8*) printf "UTF-8 file\n" ;;
    *)       printf "other file\n" ;;
    esac

Janis

#4
In article <MN-dnVl98Z_c793YRVnzvQ@telenor.com>, Stein Arne Storslett wrote:
>Hi ppl.
>
>I need to convert flat text files from UTF-8 to ISO-8859-1, and
>for that purpose I use iconv.
>I do this automatically with a script (which has other purposes
>as well, such as sending the file to a remote recipient).
>
>The problem arises when the odd file has already been converted, or
>is not in UTF-8.
>I have no control over the input files, so I can't make sure that
>the files are in the correct format when they arrive.
>
>The platform I run is Solaris 9, and the shell is bash, but any
>shell-specific tool should be available.
>
>Is there any tool I can use to determine the charset of a file?

No, because a valid ISO-8859-1 file is a valid UTF-8 file and vice versa. You just get different contents depending on which decoding you use. You could try decoding both ways to see which way makes sense, if you have a usable "makes sense" criterion you can apply.

--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"

#5
2006-11-08, 20:31(-00), Chris Mattern:
[...]
> No, because a valid ISO-8859-1 file is a valid UTF-8 file and
> vice versa. You just get different contents depending on
> which decoding you use. You could try decoding both ways
> to see which way makes sense, if you have a usable "makes
> sense" criteria you can apply.

Well, there are some bytes (for iso8859-1) or byte sequences (for UTF8) that are not allowed in those charsets.

For instance, a file that contains any byte between 128 and 159 would not be a valid iso8859-1 file.

A file with bytes 254 or 255 can't be utf8, and the only sequences of bytes allowed for utf8 are (in binary):

    0xxxxxxx
    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

So the sequence 01010101 10101010 (valid in iso8859-1: Uª) would not be correct in utf8, for instance.

--
Stéphane
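
The two byte-range rules above can be checked with standard tools. This is a rough sketch, assuming an od that accepts the POSIX options `-A n -t u1 -v` and a POSIX awk with `-v` (on Solaris that may mean nawk or /usr/xpg4/bin/awk); the function names are invented for the example. It only applies the two range tests and does not validate the multi-byte sequence table.

```shell
# has_byte_in_range FILE LOW HIGH
# Succeeds if FILE contains at least one byte whose value is in LOW..HIGH.
has_byte_in_range() {
    od -An -v -tu1 < "$1" |
    awk -v lo="$2" -v hi="$3" '
        { for (i = 1; i <= NF; i++) if ($i + 0 >= lo + 0 && $i + 0 <= hi + 0) found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Print which of the two charsets the byte-range rules exclude for FILE.
rule_out() {
    if has_byte_in_range "$1" 128 159; then
        echo "cannot be iso8859-1"
    fi
    if has_byte_in_range "$1" 254 255; then
        echo "cannot be utf8"
    fi
}
```

A file for which `rule_out` prints nothing has merely survived these two tests; it may still be invalid UTF-8, as the Uª example shows.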

#6
syscjm@sumire.eng.sun.com (Chris Mattern) writes:
> In article <MN-dnVl98Z_c793YRVnzvQ@telenor.com>, Stein Arne Storslett wrote:
>>
>>Hi ppl.
>>
>>I need to convert flat text files from UTF-8 to ISO-8859-1, and
>>for that purpose I use iconv.
>>I do this automatically with a script (which has other purposes
>>as well, such as sending the file to a remote recipient).
>>
>>The problem arises when the odd file has already been converted, or
>>is not in UTF-8.
>>I have no control over the input files, so I can't make sure that
>>the files are in the correct format when they arrive.
>>
>>The platform I run is Solaris 9, and the shell is bash, but any
>>shell-specific tool should be available.
>>
>>Is there any tool I can use to determine the charset of a file?
>>
> No, because a valid ISO-8859-1 file is a valid UTF-8 file and
> vice versa. You just get different contents depending on
> which decoding you use. You could try decoding both ways
> to see which way makes sense, if you have a usable "makes
> sense" criteria you can apply.

As I was told in another group: how about file -i?

    file -i tom_opgave.tex
    tom_opgave.tex: text/plain; charset=utf-8

    file -i determine_charset.txt
    determine_charset.txt: text/plain; charset=us-ascii

    file -i Alf_Drews_opg8_del1.tex
    Alf_Drews_opg8_del1.tex: text/plain; charset=iso-8859-1

Seems to work for me, but if there are any better ways, I would also like to hear...

Best regards
Martin Jørgensen

--
---------------------------------------------------------------------------
Home of Martin Jørgensen - http://www.martinjoergensen.dk
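
If a `file` that prints this `charset=` form is available, a script could pick the charset out of the report with plain parameter expansion. The sketch below only parses a report line in the format shown above; the function names are hypothetical, and it assumes `-i` has the MIME-type meaning seen in this thread, which is not guaranteed for every `file` implementation, so check the man page on the target system first.

```shell
# Print the charset named in a "name: type; charset=xxx" report line.
# Returns non-zero if the line carries no charset field.
charset_of_report() {
    case $1 in
        *charset=*) printf '%s\n' "${1##*charset=}" ;;
        *)          return 1 ;;
    esac
}

# Succeeds only when the report says utf-8. A us-ascii file needs no
# conversion, since ASCII is encoded identically in both charsets.
needs_conversion() {
    case $(charset_of_report "$1") in
        utf-8) return 0 ;;
        *)     return 1 ;;
    esac
}

# Intended use (not run here; "$f" is a placeholder for the input file):
#   if needs_conversion "$(file -i "$f")"; then
#       iconv -f UTF-8 -t ISO-8859-1 < "$f" > "$f.latin1"
#   fi
```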

#7
2006-11-09, 20:55(+01), Martin Jørgensen:
[...]
> file -i tom_opgave.tex
> tom_opgave.tex: text/plain; charset=utf-8
>
> file -i determine_charset.txt
> determine_charset.txt: text/plain; charset=us-ascii
>
> file -i Alf_Drews_opg8_del1.tex
> Alf_Drews_opg8_del1.tex: text/plain; charset=iso-8859-1
>
> Seem to work for me but if there are any better ways, I would also like
> to hear...
[...]

The iso-8859-1 one is a lie. There's no way to differentiate any of the iso-8859-xx charsets from one another, as the same byte values are allowed in all of them.

iso8859-1 is being replaced by iso8859-15. There are few differences between the two, the biggest being that character 164 is a euro sign in -15 and a "currency sign" in -1. So if you find a file with a byte 164 in it, then it's more likely to be iso8859-15 than iso8859-1.

A file containing:

    « Ça coûte 3 ¤. »

That is, "it's worth 3 <x>" in French, where <x> is character 164. It is more likely to be written in iso8859-15 than in iso8859-1, because it's less likely that the author would say "it's worth 3 currency signs".

Still, file -i reports it as iso8859-1.

--
Stéphane
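
A script could at least flag files that contain byte 164 so that a human can decide. This is a small sketch with an invented function name; it assumes a tr that accepts octal escapes (164 decimal is 244 octal) and runs it in the C locale so that it works on raw bytes.

```shell
# Print how many times byte 164 (octal 244) occurs in FILE.
count_byte_164() {
    # tr -cd deletes every byte except 164; wc -c counts what is left.
    # The arithmetic expansion strips any padding wc puts around the number.
    echo $(( $(LC_ALL=C tr -cd '\244' < "$1" | wc -c) ))
}
```

A non-zero count in a file already known to be some single-byte iso8859 variant is only a hint in favour of iso8859-15, for the reason given above.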