| comp.unix.shell Using and programming the Unix shell. |
#1 |
|
Posts: n/a
Host:
Hello, can someone please help.

I think I need a way of detecting stdin end of file with bash...

I have a large file, big.txt, that I want to process in chunks
*preferably* using bash. I want to take each chunk, process it and
write the output to a file.

The following works:-

    dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
    dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt
    dd if=big.txt ibs=98765 skip=2 count=1 | afilter > out.2.txt
    dd if=big.txt ibs=98765 skip=3 count=1 | afilter > out.3.txt
    dd if=big.txt ibs=98765 skip=4 count=1 | afilter > out.4.txt
    dd if=big.txt ibs=98765 skip=5 count=1 | afilter > out.5.txt
    dd if=big.txt ibs=98765 skip=6 count=1 | afilter > out.6.txt
    dd if=big.txt ibs=98765 skip=7 count=1 | afilter > out.7.txt
    dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
    dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt

(n.b. 'afilter' is a program which works fine)

I could hack it into some sort of loop, something like:

    for i in `seq 0 9`;
    do
      dd if=big.txt ibs=98765 skip=$i count=1 | afilter > out.$i.txt
    done

However I would prefer that it scanned big.txt once rather than 10
times and I would prefer that it was more generic, detecting the end of
the file, perhaps something like this:

    set x=0

    cat big.txt | \
    while [ not eof ] ; do
      dd ibs=98765 count=1 | afilter > out.$x.txt ;
      $x++;
    done

I don't think the 'while [ not eof ]' will work.

Is it possible to massage the above to work?
Perhaps something other than dd is more appropriate?

Thanks in advance.

Hal
|
|
|
#2 |
|
2006-12-2, 03:55(-08), sillyhat@yahoo.com:
[...]
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt
> dd if=big.txt ibs=98765 skip=2 count=1 | afilter > out.2.txt
> dd if=big.txt ibs=98765 skip=3 count=1 | afilter > out.3.txt
> dd if=big.txt ibs=98765 skip=4 count=1 | afilter > out.4.txt
> dd if=big.txt ibs=98765 skip=5 count=1 | afilter > out.5.txt
> dd if=big.txt ibs=98765 skip=6 count=1 | afilter > out.6.txt
> dd if=big.txt ibs=98765 skip=7 count=1 | afilter > out.7.txt
> dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
> dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt
>
> (n.b. 'afilter' is a program which works fine)
> I could hack it into some sort of loop, something like:
>
> for i in `seq 0 9`;
> do
> dd if=big.txt ibs=98765 skip=$i count=1 | afilter > out.$i.txt
> done

No need to open and skip every time.

    {
      repeat 10 dd ibs=98765 count=1 | afilter > out.$((i++)).txt
    } < big.txt

(zsh syntax above)

>
> However I would prefer that it scanned big.txt once rather than 10
> times and I would prefer that it was more generic, detecting the end of
> the file, perhaps something like this:
>
> set x=0
>
> cat big.txt | \
> while [ not eof ] ; do
> dd ibs=98765 count=1 | afilter > out.$x.txt ;
> $x++;
> done

To detect end-of-file, you need to check dd's stderr and verify that
the first 4 characters are "0+0 ".

If the file is a text file, you can use bash's read -n or zsh's read -k
(zsh can cope with binary files as well though).

    while read -u3 -k98765; do
      print -rn -- $REPLY | afilter > out.$((x++)).txt
    done 3< big.txt
    if (($#REPLY)); then
      # deal with extra characters if necessary
    fi

POSIXly you'd do:

    TAB=`printf '\t'`
    read_chunk() {
      {
        LC_ALL=C dd ibs="$1" count=1 2>&1 >&3 3>&- | {
          IFS=" $TAB+" read -r a b rest || return 3
          if [ "$a" -eq 0 ]; then
            if [ "$b" -eq 0 ]; then
              ret=1 # end-of-file
            else
              ret=2 # fewer than $1 bytes returned
            fi
          else
            ret=0
          fi
          cat > /dev/null
          return "$ret"
        }
      } 3>&1
    }

You'd do

    ret=$(
      exec 4>&1
      {
        read_chunk 98765; echo "$?" >&4
      } | afilter > "out.$x.txt" 4>&-
    )

$ret would indicate if eof was reached, but afilter would have been
called in that case as well.

--
Stéphane
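For the text-file case, the bash `read -n` idea mentioned above could be sketched roughly as follows. This is only a sketch under stated assumptions: `chunk_text` is a made-up function name, `filter` is a stand-in for afilter, and the input must not contain NUL bytes. Note that `-n` counts characters rather than bytes, so the two differ in a multibyte locale.

```shell
#!/usr/bin/env bash
# Stand-in for the poster's afilter (hypothetical): upper-cases its input.
filter() { tr 'a-z' 'A-Z'; }

# chunk_text FILE NCHARS PREFIX   (hypothetical helper)
# Writes PREFIX.0.txt, PREFIX.1.txt, ... one filtered chunk per file.
chunk_text() {
  local file=$1 n=$2 prefix=$3 i=0 chunk
  # -d '' makes NUL the delimiter, so newlines stay inside the chunk.
  # IFS= and -r keep whitespace and backslashes untouched.
  # read fails at end of file; the -n test still processes a short last chunk.
  while IFS= read -r -d '' -n "$n" -u 3 chunk || [ -n "$chunk" ]; do
    printf '%s' "$chunk" | filter > "$prefix.$i.txt"
    i=$((i + 1))
  done 3< "$file"
}
```

Called as `chunk_text big.txt 98765 out`, the loop opens big.txt once and stops on its own at end of file, without running the filter on an empty chunk.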
|
|
|
#3 |
|
sillyhat@yahoo.com <sillyhat@yahoo.com> wrote:
> Hello, can someone please help.
>
> I think I need a way of detecting stdin end of file with bash...
>
> I have a large file, big.txt, that I want to process in chunks
> *preferably* using bash. I want to take each chunk, process it and
> write the output to a file.
>
> The following works:-
>
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt
[...]
> dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
> dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt

    split -b 98765 -d big.txt out.
    for f in out* ; do afilter < "$f" > "${f}.txt" && rm "$f" ; done
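Where GNU split is available (an assumption: `--filter` is a GNU coreutils extension, not POSIX), the intermediate pieces need not touch the disk at all. split pipes each chunk into a command and exports the would-be file name as `$FILE`. A minimal sketch, with `tr` standing in for afilter and `split_filter` a made-up wrapper name:

```shell
# split_filter FILE BYTES PREFIX   (hypothetical wrapper)
# Produces PREFIX.00.txt, PREFIX.01.txt, ... without storing the raw pieces.
split_filter() {
  # Single quotes matter: $FILE must be expanded by the shell that split
  # starts for each chunk, not by the calling shell.
  # tr is a stand-in for afilter here.
  split -b "$2" -d --filter='tr a-z A-Z > "$FILE.txt"' "$1" "$3."
}
```

Only one chunk is in flight at a time, which may suit a machine that is short on space.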
|
|
|
#4 |
|
That's very helpful :O)

Testing dd's stderr does seem to do the trick.
Now I need to cater for binary files as well.

Using some of your suggestions, here's what I have arrived at in bash -
which I have to use for now:-

    i=0
    rc=1

    {
      while true
      do
        dd ibs=98765 count=1 2> emsg.txt | afilter > out.$i.bin
        grep "0+0 " emsg.txt >/dev/null
        if [ $? = 0 ]
        then
          rm emsg.txt out.$i.bin
          break
        fi
        ((i++))
      done
    } < big.bin

It would be nice to be able to avoid using the emsg.txt file but I
couldn't work out how to redirect stderr to grep (and then get stdout
into afilter!).

Hal
|
|
|
#5 |
|
The split command certainly works well if you have lots of space and
was the way I was working initially.

However, I am actually restricted on space on my main machine and want
to literally process the big file a chunk at a time, shipping that
chunk off to another machine as I go along.

My initial query should have mentioned this. Apologies.

Hal
|
|
|
#6 |
|
2006-12-2, 07:37(-08), sillyhat@yahoo.com:
> That's very helpful :O)
>
> Testing dd's stderr does seem to do the trick.
> Now I need to cater for binary files as well.
>
> Using some of your suggestions, here's what I have arrived at in bash -
> which I have to use for now:-
>
> i=0
> rc=1
>
> {
> while true
> do
> dd ibs=98765 count=1 2> emsg.txt | afilter > out.$i.bin
> grep "0+0 " emsg.txt >/dev/null
> if [ $? = 0 ]
> then
> rm emsg.txt out.$i.bin
> break
> fi
> ((i++))
> done
> } < big.bin
>
>
> It would be nice to be able to avoid using the emsg.txt file but I
> couldn't work out how to redirect stderr to grep (and then get stdout
> into afilter!).
[...]

The read_chunk() function I posted did just that.

You can do:

    dd_stderr=$(
      {
        LC_ALL=C dd ibs=98765 count=1 2>&3 |
          afilter > "out.$i.bin" 3>&-
      } 3>&1
    )
    case $dd_stderr in
      "0+0 "*) break;;
    esac

To be POSIX compliant, you should also take into account dd
implementations that would output " 0 + 0\t" instead of "0+0 ", for
instance, hence my use of

    IFS=" $TAB+" read -r a b c

--
Stéphane
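Put into a complete loop, the stderr-capture idea above might look like the sketch below. It assumes a dd whose first stderr line has the `0+0 records in` form (as GNU dd prints under `LC_ALL=C`). `chunk_bytes` is a made-up name and `filter` is a stand-in for afilter.

```shell
# Stand-in for the poster's afilter (hypothetical): upper-cases its input.
filter() { tr 'a-z' 'A-Z'; }

# chunk_bytes FILE BYTES PREFIX   (hypothetical helper)
chunk_bytes() {
  i=0
  while :; do
    # Inside $( ... ), fd 1 is the capture pipe. "3>&1" copies it to fd 3,
    # dd's stderr goes to fd 3, and dd's stdout goes down the pipe to filter.
    dd_stderr=$(
      {
        LC_ALL=C dd ibs="$2" count=1 2>&3 |
          filter > "$3.$i.bin" 3>&-
      } 3>&1
    )
    case $dd_stderr in
      "0+0 "*)
        # Nothing was read: drop the empty output file and stop.
        rm -f "$3.$i.bin"
        break ;;
    esac
    i=$((i + 1))
  done < "$1"
}
```

No emsg.txt is needed, and every dd shares the one open file descriptor, so the file is scanned only once. As noted in the thread, the filter still runs once on empty input at end of file.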
|
|
|
#7 |
|
2006-12-2, 07:37(-08), sillyhat@yahoo.com:
[...]
> {
> while true
> do
> dd ibs=98765 count=1 2> emsg.txt | afilter > out.$i.bin
> grep "0+0 " emsg.txt >/dev/null
> if [ $? = 0 ]
> then
> rm emsg.txt out.$i.bin
> break
> fi
> ((i++))
> done
> } < big.bin
>
>
> It would be nice to be able to avoid using the emsg.txt file but I
> couldn't work out how to redirect stderr to grep (and then get stdout
> into afilter!).
[...]

It may prove easier to use perl:

    perl -pe 'BEGIN{$/=\98765}{open STDOUT, "|afilter > out.$..txt"}' < big.bin

--
Stéphane
|
|
|
#8 |
|
OK, my i/o redirection and bash skills in general are being stretched a
bit!

Posix and my afilter command aside, is the following correct/safe?

    i=0
    rc=1

    {
      while true
      do
        {
          dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
        } 3>&1 | grep "0+0 " >/dev/null
        if [ $? = 0 ]
        then
          rm out.$i.bin
          break
        fi
        ((i++))
      done
    } < big.bin

Is it possible to get rid of the if command with something like:

    ...
    {
      dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
    } 3>&1 | grep "0+0 " >/dev/null && (
      rm out.$i.bin
      break
    )
    ...

I need to do some swotting!
|
|
|
#9 |
|
2006-12-2, 10:07(-08), sillyhat@yahoo.com:
> OK, my i/o redirection and bash skills in general are being stretched a
> bit!
>
> Posix and my afilter command aside, is the following correct/safe?
>
> i=0
> rc=1
>
> {
> while true
> do
> {
> dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
> } 3>&1 | grep "0+0 " >/dev/null

    dd ibs=98765 count=1 2>&1 > "out.$i.bin" | grep -q '0+0 '

would have been enough. But with afilter:

    {
      dd ibs=98765 count=1 2>&3 | afilter > "out.$i.bin" 3>&-
    } 3>&1 | grep -q '0+0 '

> if [ $? = 0 ]

[ $? = 0 ] (or more correctly [ "$?" -eq 0 ]) is a no-op. The "if"
structure in shells uses the command's exit status.

    if dd ibs=98765 count=1 2>&1 > "out.$i.bin" | grep -q '0+0 '
    then
      rm ...

> then
> rm out.$i.bin
> break
> fi
> ((i++))

i=$(($i + 1)) is the portable equivalent.

> done
> } < big.bin
>
> Is it possible to get rid of the if command with something like:
> ...
> {
> dd ibs=98765 count=1 2>&3 3>&- > out.$i.bin
> } 3>&1 | grep "0+0 " >/dev/null && (
> rm out.$i.bin
> break
> )

Not sure about the break within a subshell, but you can use a brace
group, which runs in the current shell:

    dd ibs=98765 count=1 2>&1 > "out.$i.bin" | grep -q '0+0 ' && {
      rm "out.$i.bin"
      break
    }

> ...
>
> I need to do some swotting!
>

Please note variable expansions must be quoted: "$var" instead of $var.

--
Stéphane
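The points above (testing the pipeline's exit status directly in `if`, the `2>&1 > file` redirection order, quoted expansions and the portable increment) can be combined into one loop. This is a sketch only, for the case without afilter. `chunk_raw` is a made-up name, and the anchored pattern `^0+0 ` is an assumption that dd's first stderr line starts with the record counts.

```shell
# chunk_raw FILE BYTES PREFIX   (hypothetical helper)
chunk_raw() {
  i=0
  while :; do
    # "2>&1" first points stderr at the pipe; "> file" then moves only
    # stdout to the chunk file, so grep sees nothing but dd's statistics.
    # "if" tests the pipeline's exit status itself, with no "$?" check.
    if LC_ALL=C dd ibs="$2" count=1 2>&1 > "$3.$i.bin" | grep -q '^0+0 '
    then
      rm -f "$3.$i.bin"   # empty file created by the read at end-of-file
      break
    fi
    i=$((i + 1))          # portable replacement for ((i++))
  done < "$1"
}
```

A call such as `chunk_raw big.bin 98765 out` leaves out.0.bin, out.1.bin and so on, with a shorter final piece if the size is not a multiple of the chunk length.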
|
|
|
#10 |
|
sillyhat@yahoo.com wrote:
>
> I have a large file, big.txt, that I want to process in chunks
> *preferably* using bash. I want to take each chunk, process it and
> write the output to a file.
>
> The following works:-
>
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> dd if=big.txt ibs=98765 skip=1 count=1 | afilter > out.1.txt
> dd if=big.txt ibs=98765 skip=2 count=1 | afilter > out.2.txt
> dd if=big.txt ibs=98765 skip=3 count=1 | afilter > out.3.txt
> dd if=big.txt ibs=98765 skip=4 count=1 | afilter > out.4.txt
> dd if=big.txt ibs=98765 skip=5 count=1 | afilter > out.5.txt
> dd if=big.txt ibs=98765 skip=6 count=1 | afilter > out.6.txt
> dd if=big.txt ibs=98765 skip=7 count=1 | afilter > out.7.txt
> dd if=big.txt ibs=98765 skip=8 count=1 | afilter > out.8.txt
> dd if=big.txt ibs=98765 skip=9 count=1 | afilter > out.9.txt
>
> (n.b. 'afilter' is a program which works fine)

    perl -pe'
      BEGIN { $/ = \98765 }
      open STDOUT, "| afilter > out." . $a++ . ".txt"
    ' big.txt

John
--
Perl isn't a toolbox, but a small machine shop where you can
special-order certain sorts of tools at low cost and in short order.
-- Larry Wall
|
|
|
#11 |
|
sillyhat@yahoo.com wrote:
> Hello, can someone please help.
>
> I think I need a way of detecting stdin end of file with bash...
>
> I have a large file, big.txt, that I want to process in chunks
> *preferably* using bash. I want to take each chunk, process it and
> write the output to a file.
>
> The following works:-
>
> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
> ...

In general, it's difficult to process arbitrary binary files in shell.
I'm not sure whether lseek(2) and read(2) are available as user commands.

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
|
|
|
#12 |
|
2006-12-03, 15:52(-05), William Park:
> sillyhat@yahoo.com wrote:
>> Hello, can someone please help.
>>
>> I think I need a way of detecting stdin end of file with bash...
>>
>> I have a large file, big.txt, that I want to process in chunks
>> *preferably* using bash. I want to take each chunk, process it and
>> write the output to a file.
>>
>> The following works:-
>>
>> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
>> ...
>
> In general, it's difficult to process arbitrary binary files in shell.
> I'm not sure whether lseek(2) and read(2) are available as user commands.

Yes, that's dd. Some dd implementations even provide ftruncate(2).

    exec 3<> some-file

    dd count=0 bs=1 skip=1234 <&3

But only zsh can cope with binary files, as all the other shells
can't cope with the NUL character (you may be able to work
around it using intermediary text formats such as uuencode's
though).

zsh has a sysread builtin, and the mapfile associative array to
access the content of a file as you would that of a variable (uses
mmap internally).

--
Stéphane
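As a rough illustration of dd standing in for lseek(2) and read(2), the sketch below seeks on a shared file descriptor and then reads from the new position. It relies on dd still performing the `skip=` when `count=0`, as in the example above. That is behaviour reported for GNU dd and should be treated as an assumption elsewhere. `read_at` is a made-up name.

```shell
# read_at FILE OFFSET LENGTH   (hypothetical helper)
# Prints LENGTH bytes of FILE starting at byte OFFSET (0-based).
read_at() {
  {
    # count=0 copies nothing; skip= only moves the offset of fd 3,
    # which both dd processes share.
    dd bs=1 skip="$2" count=0 <&3 2>/dev/null
    # One read of LENGTH bytes from wherever the offset now is.
    dd bs="$3" count=1 <&3 2>/dev/null
  } 3< "$1"
}
```

For example, `read_at some-file 1234 16` would print the 16 bytes that follow the first 1234. The output is only safe to capture in a shell variable if it contains no NUL bytes, for the reason given above.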
|
|
|
#13 |
|
2006-12-3, 21:07(+00), Stephane CHAZELAS:
> 2006-12-03, 15:52(-05), William Park:
>> sillyhat@yahoo.com wrote:
>>> Hello, can someone please help.
>>>
>>> I think I need a way of detecting stdin end of file with bash...
>>>
>>> I have a large file, big.txt, that I want to process in chunks
>>> *preferably* using bash. I want to take each chunk, process it and
>>> write the output to a file.
>>>
>>> The following works:-
>>>
>>> dd if=big.txt ibs=98765 skip=0 count=1 | afilter > out.0.txt
>>> ...
>>
>> In general, it's difficult to process arbitrary binary files in shell.
>> I'm not sure whether lseek(2) and read(2) are available as user commands.
>
> Yes, that's dd. Some dd implementations even provide
> ftruncate(2).
>
> exec 3<> some-file
>
> dd count=0 bs=1 skip=1234 <&3
[...]

But dd doesn't provide any way to seek backward. You need to
reopen the file to start from 0.

See skip/iseek to seek on input (and read), and seek to seek on
output and write (and truncate or not, depending on whether
conv=notrunc is given).

--
Stéphane
|
|
|
#14 |
|
Stephane CHAZELAS <this.address@is.invalid> wrote:
> zsh has a sysread builtin, and the mapfile associative array to
> access the content of a file as you would of a variable (uses
> mmap internally).

Does the file grow and shrink, as you manipulate the variable? That is,

    var=abc
    var=qwerty

what happens to the file?

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
|
|
|
#15 |
|
2006-12-03, 17:41(-05), William Park:
> Stephane CHAZELAS <this.address@is.invalid> wrote:
>> zsh has a sysread builtin, and the mapfile associative array to
>> access the content of a file as you would of a variable (uses
>> mmap internally).
>
> Does the file grow and shrink, as you manipulate the variable? That is,
> var=abc
> var=qwerty
> what happens to the file?

What you expect should happen:

    ~$ ls -ld a
    ls: a: No such file or directory
    ~$ zmodload zsh/mapfile
    ~$ mapfile[a]=foo
    ~$ ls -ld a
    -rw-r--r-- 1 chazelas chazelas 3 Dec 3 22:53 a
    ~$ mapfile[a]=foobar
    ~$ ls -ld a
    -rw-r--r-- 1 chazelas chazelas 6 Dec 3 22:53 a
    ~$ mapfile[a]=$'bar\n'
    ~$ ls -ld a
    -rw-r--r-- 1 chazelas chazelas 4 Dec 3 22:54 a

Though you can get the 12th to 23rd bytes of "a" with
print -rn -- ${mapfile[a][12,23]}, I'm not sure how to assign
something to a byte range. That would be a flaw in zsh if you
couldn't, as you can do:

    scalar[12,23]=text

but can't seem to be able to do:

    associative_array[key][12,23]=text

nor

    array[3][12,23]=text

I'll ask on the zsh mailing list.

--
Stéphane
|
|
|
#16 |
|
Stephane CHAZELAS <this.address@is.invalid> wrote:
> 2006-12-03, 17:41(-05), William Park:
> > Stephane CHAZELAS <this.address@is.invalid> wrote:
> >> zsh has a sysread builtin, and the mapfile associative array to
> >> access the content of a file as you would of a variable (uses
> >> mmap internally).
> >
> > Does the file grow and shrink, as you manipulate the variable? That is,
> > var=abc
> > var=qwerty
> > what happens to the file?
>
> What you expect should happen:
>
> ~$ ls -ld a
> ls: a: No such file or directory
> ~$ zmodload zsh/mapfile
> ~$ mapfile[a]=foo
> ~$ ls -ld a
> -rw-r--r-- 1 chazelas chazelas 3 Dec 3 22:53 a
> ~$ mapfile[a]=foobar
> ~$ ls -ld a
> -rw-r--r-- 1 chazelas chazelas 6 Dec 3 22:53 a
> ~$ mapfile[a]=$'bar\n'
> ~$ ls -ld a
> -rw-r--r-- 1 chazelas chazelas 4 Dec 3 22:54 a

Interesting. I recently added a 'vfile' command to read/write a file of
the same name as a variable (code not yet submitted).

    http://home.eol.ca/~parkw/index.html#vfile

It is not as automatic as Zsh's mapfile, but the data can still survive
logout/reboot. Thanks for the reference. I'll look into what Zsh does.

The motivation is to manipulate a table of data, where each field is a
file, and each row is a directory. So, to get the fields in row 1,

    cd row1
    vfile -r a b c ...

or

    vfile -r -d row1 a b c ...

My main problem is array variables. At the moment, they are handled like

    vfile -[rw] a b 'c[0]' 'c[1]'

Generating such an element list is another painful scripting exercise,
which should be eliminated.

--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
|
|
|
#17 |
|
On Dec 2, 7:56 am, silly...@yahoo.com wrote:
> The split command certainly works well if you have lots of space and
> was the way I was working initially.
>
> However, I am actually restricted on space on my main machine and want
> to literally process the big file a chunk at a time, shipping that
> chunk off to another machine as I go along.

If you had mentioned that at the outset, it would have been obvious
that what you are asking for is silly.

Since you don't actually need to retain the pieces, the pieces can
actually be stored temporarily in memory only.

Q: What is a piece of memory called which stores successive pieces of a
file as it is being processed?

A: A buffer.

That is to say, what is the difference between buffered I/O and reading
chunks of a file, processing them, and shipping them off?

It sounds like the ``afilter'' program in the example that you gave:

> for i in `seq 0 9`;
> do
> dd if=big.txt ibs=98765 skip=$i count=1 | afilter > out.$i.txt
> done

is broken. Otherwise you could just do this:

    afilter < big.txt | ship-off

where ship-off is some command that stores the results on the other
machine, for instance, using Secure Shell:

    afilter < big.txt | ssh me@other-machine cat \> out-big.txt

Fix the problem in afilter, if you can.
|
|
|
#18 |
|
On Dec 2, 3:55 am, silly...@yahoo.com wrote:
> (n.b. 'afilter' is a program which works fine)

ROFL.
|