|
|
|
|
||||||
| comp.unix.shell Using and programming the Unix shell. |
![]() |
|
|
LinkBack | Outils de la discussion |
|
|
#1 |
|
Messages: n/a
Hébergeur: |
Hi all, I have two files thus:
File 1: Col_A Col_B a1 b1 a2 b2 ..... File 2: Col_A Col_B a4 a2 a7 a32 ....... So in the second file, Col_B is empty. How do I output the corresponding second column for the second file based on the mapping in the first file? I think I can do something like this: for line in $(cat f2) do echo -n $line; echo -n '\t'; grep $line f1 | awk '{print $2;}' done Looks a little ugly and is probably not a very smart thing to do. Any better suggestions? Thanks, Sashi |
|
|
|
#2 |
|
Messages: n/a
Hébergeur: |
Sashi wrote:
> Hi all, I have two files thus: > File 1: > > Col_A Col_B > a1 b1 > a2 b2 > .... > > > File 2: > > Col_A Col_B > a4 > a2 > a7 > a32 > ...... > > So in the second file, Col_B is empty. > > How do I output the corresponding second column for the second file > based on the mapping in the first file? > > I think I can do something like this: > > for line in $(cat f2) > do echo -n $line; echo -n '\t'; grep $line f1 | awk '{print $2;}' > done (Looks very inefficient.) > > Looks a little ugly and is probably not a very smart thing to do. > Any better suggestions? Modulo formatting something like... awk 'BEGIN{OFS="\t"} NR==FNR{m[$1]=$2;next} NF==1{$2=m[$1]} 1' File1 File2 Janis > > Thanks, > Sashi |
|
|
|
#3 |
|
Messages: n/a
Hébergeur: |
On Fri, 02 May 2008 07:59:53 -0700, Sashi wrote:
> Hi all, I have two files thus: > File 1: > > Col_A Col_B > a1 b1 > a2 b2 > .... > > > File 2: > > Col_A Col_B > a4 > a2 > a7 > a32 > ...... > > So in the second file, Col_B is empty. > > How do I output the corresponding second column for the second file > based on the mapping in the first file? > > I think I can do something like this: > > for line in $(cat f2) > do echo -n $line; echo -n '\t'; grep $line f1 | awk '{print $2;}' done > > Looks a little ugly and is probably not a very smart thing to do. Any > better suggestions? > > Thanks, > Sashi Isn't this what join(1) is for? |
|
|
|
#4 |
|
Messages: n/a
Hébergeur: |
On Saturday 3 May 2008 18:51, Dan Stromberg wrote:
> Isn't this what join(1) is for? AFAIK join operates on sorted files, which doesn't seem to be the case here. -- D. |
|
|
|
#5 |
|
Messages: n/a
Hébergeur: |
On Sat, 03 May 2008 19:26:04 +0200, Dave B wrote:
> On Saturday 3 May 2008 18:51, Dan Stromberg wrote: > >> Isn't this what join(1) is for? > > AFAIK join operates on sorted files, which doesn't seem to be the case > here. It seems to me either you accept the nlogn sort, or you end up with an n^2 pair of nested loops, like what the OP had. |
|
|
|
#6 |
|
Messages: n/a
Hébergeur: |
On Sunday 4 May 2008 04:15, Dan Stromberg wrote:
> On Sat, 03 May 2008 19:26:04 +0200, Dave B wrote: > >> On Saturday 3 May 2008 18:51, Dan Stromberg wrote: >> >>> Isn't this what join(1) is for? >> >> AFAIK join operates on sorted files, which doesn't seem to be the case >> here. > > It seems to me either you accept the nlogn sort, or you end up with an > n^2 pair of nested loops, like what the OP had. Not if you use awk. With awk, you have a linear solution on the number of input records (assuming awk's hash lookups are efficient). -- D. |
|
|
|
#7 |
|
Messages: n/a
Hébergeur: |
On Sunday 4 May 2008 04:15, Dan Stromberg wrote:
> On Sat, 03 May 2008 19:26:04 +0200, Dave B wrote: > >> On Saturday 3 May 2008 18:51, Dan Stromberg wrote: >> >>> Isn't this what join(1) is for? >> >> AFAIK join operates on sorted files, which doesn't seem to be the case >> here. > > It seems to me either you accept the nlogn sort, or you end up with an > n^2 pair of nested loops, like what the OP had. Not if you use awk. With awk, you have a linear solution on the number of input records (assuming awk's hash lookups are efficient). And, furthermore, I think it cannot be assumed that the files can be reordered at will. -- D. |
|
|
|
#8 |
|
Messages: n/a
Hébergeur: |
On Sunday 4 May 2008 10:32, Dave B wrote:
>> It seems to me either you accept the nlogn sort, or you end up with an >> n^2 pair of nested loops, like what the OP had. > > Not if you use awk. With awk, you have a linear solution on the number of > input records (assuming awk's hash lookups are efficient). > And, furthermore, I think it cannot be assumed that the files can be > reordered at will. Actually, there is another reason against sorting here: the OP showed the input files with a heading line (Col_A Col_B). If that line is really part of the input files and must be preserved, then sorting becomes problematic (not that it can't be done of course, but not in a simple way). -- D. |
|
|
|
#9 |
|
Messages: n/a
Hébergeur: |
Dave B wrote:
> On Sunday 4 May 2008 04:15, Dan Stromberg wrote: > > >>On Sat, 03 May 2008 19:26:04 +0200, Dave B wrote: >> >> >>>On Saturday 3 May 2008 18:51, Dan Stromberg wrote: >>> >>> >>>>Isn't this what join(1) is for? >>> >>>AFAIK join operates on sorted files, which doesn't seem to be the case >>>here. >> >>It seems to me either you accept the nlogn sort, or you end up with an >>n^2 pair of nested loops, like what the OP had. > > > Not if you use awk. With awk, you have a linear solution on the number of > input records (assuming awk's hash lookups are efficient). Are you sure awk use hash lookups? Occasionally done efficiency analysis makes me suspect that a "standard" awk will use a tree representation for associative arrays. The consequence would be a O(N logN) complexity for accessing every element individually. A full traversal, OTOH, could at least in theory be done in O(N), even with a tree representation when additional measures would be taken. I'd be interested whether anyone knows about the implementation of associative arrays in the various existing awks. (The POSIX document seems not to specify any complexity requirements.) Janis > And, furthermore, I think it cannot be assumed that the files can be > reordered at will. > |
|
|
|
#10 |
|
Messages: n/a
Hébergeur: |
On Sunday 4 May 2008 13:26, Janis Papanagnou wrote:
>> Not if you use awk. With awk, you have a linear solution on the number of >> input records (assuming awk's hash lookups are efficient). > > Are you sure awk use hash lookups? Occasionally done efficiency > analysis makes me suspect that a "standard" awk will use a tree > representation for associative arrays. The consequence would be > a O(N logN) complexity for accessing every element individually. Not sure, but I always thought that that would be a reasonable implementation of associative arrays (or maps). Now that you mention trees, however, I'm not so sure anymore, since it's true that implementing maps with trees has its advantages too (see eg http://en.wikipedia.org/wiki/Associative_array for a not-too-technical discussion). > A full traversal, OTOH, could at least in theory be done in O(N), > even with a tree representation when additional measures would > be taken. > > I'd be interested whether anyone knows about the implementation > of associative arrays in the various existing awks. (The POSIX > document seems not to specify any complexity requirements.) FWIW, GNU awk (the only awk I have access to the sources ATM) seems to use hash tables. According to http://www.gnu.org/manual/gawk/html_node/internals.html, lookups are performed using the assoc_lookup() function. Looking at the sources (gawk 3.1.5), the function is in array.c, lines 483-562. The hash function itself is at lines 331-414 in the same file. -- D. |
|
|
|
#11 |
|
Messages: n/a
Hébergeur: |
On Sunday 4 May 2008 19:35, Dave B wrote:
> According to http://www.gnu.org/manual/gawk/html_node/internals.html, Sorry, the link is http://www.gnu.org/manual/gawk/html_node/Internals.html (uppercase I in "Internals.html"). -- D. |
|
|
|
#12 |
|
Messages: n/a
Hébergeur: |
2008-05-04, 19:35(+02), Dave B:
[...] > FWIW, GNU awk (the only awk I have access to the sources ATM) seems to use > hash tables. > According to http://www.gnu.org/manual/gawk/html_node/internals.html, > lookups are performed using the assoc_lookup() function. Looking at the > sources (gawk 3.1.5), the function is in array.c, lines 483-562. The hash > function itself is at lines 331-414 in the same file. [..] From reading the code in the heirloom toolchest, looks like the original awk from 1979 (and nawk) by A, W and K also used hash tables. So does mawk. -- Stéphane |
|
|
|
#13 |
|
Messages: n/a
Hébergeur: |
Stephane CHAZELAS wrote:
> 2008-05-04, 19:35(+02), Dave B: > [...] > >>FWIW, GNU awk (the only awk I have access to the sources ATM) seems to use >>hash tables. >>According to http://www.gnu.org/manual/gawk/html_node/internals.html, >>lookups are performed using the assoc_lookup() function. Looking at the >>sources (gawk 3.1.5), the function is in array.c, lines 483-562. The hash >>function itself is at lines 331-414 in the same file. > > [..] > > From reading the code in the heirloom toolchest, looks like the > original awk from 1979 (and nawk) by A, W and K also used hash > tables. > > So does mawk. > And some brief timing tests with gawk seem to confirm linear complexity. Janis |
|
|
|
#14 |
|
Messages: n/a
Hébergeur: |
On Sun, 04 May 2008 10:32:33 +0200, Dave B wrote:
> On Sunday 4 May 2008 04:15, Dan Stromberg wrote: > >> On Sat, 03 May 2008 19:26:04 +0200, Dave B wrote: >> >>> On Saturday 3 May 2008 18:51, Dan Stromberg wrote: >>> >>>> Isn't this what join(1) is for? >>> >>> AFAIK join operates on sorted files, which doesn't seem to be the case >>> here. >> >> It seems to me either you accept the nlogn sort, or you end up with an >> n^2 pair of nested loops, like what the OP had. > > Not if you use awk. With awk, you have a linear solution on the number > of input records (assuming awk's hash lookups are efficient). And, > furthermore, I think it cannot be assumed that the files can be > reordered at will. Whether awk uses hashing or not, I realized shortly after I posted that, and shortly before I canceled the post, that hashing would be an interesting 3rd kind of solution. |
|
![]() |
| Outils de la discussion | |
|
|