comp.unix.shell: Using and programming the Unix shell.
#1
I have two MD5SUMS files sorted by MD5 digest, a.md5 and b.md5.
I'm trying to get a list of files in b.md5 for which there is at least one file in a.md5 with the same MD5 digest. I thought that

    join -o 2.1,2.2 a.md5 b.md5

would do it. I forgot that join(1) will output *all* pairs, even if only one of the files appears in the format...

Is there a way to tell join(1) to list all the records from b.md5 that can be paired, instead of all N x M possible pairs?

In SQL parlance, join(1) gives me

    SELECT b.pathname
    FROM a INNER JOIN b USING (digest)

while I would like

    SELECT DISTINCT b.digest, b.pathname
    FROM a INNER JOIN b USING (digest)

Piping into uniq(1) won't work because the duplicate lines are not contiguous.

Sorting before piping into uniq is probably not practical as the output is more than 10 GB.

--
André Majorel <URL:http://www.teaser.fr/~amajorel/>
(Counterfeit: imipefos@inverse.com kepakaq@camelot.com)
"Duty, honor, country" -- Douglas MacArthur
"Travail, famille, patrie" -- Philippe Pétain
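To make the duplication concrete, here is a minimal sketch with made-up data. The digests (d1, d2, d3) and the path names are placeholders, not real MD5 sums. Two files in a.md5 share digest d1 with two files in b.md5, so join emits each of those b.md5 records twice, and the repeats are not adjacent.

```shell
# Hypothetical sample data; digests and paths are placeholders.
tmp=$(mktemp -d)
printf '%s\n' 'd1 a/x' 'd1 a/y' 'd2 a/z' > "$tmp/a.md5"
printf '%s\n' 'd1 b/p' 'd1 b/q' 'd3 b/r' > "$tmp/b.md5"

# join pairs every d1 line of a.md5 with every d1 line of b.md5: 2 x 2 = 4 lines.
pairs=$(join -o 2.1,2.2 "$tmp/a.md5" "$tmp/b.md5")
printf '%s\n' "$pairs"

# The result the question asks for: each matching b.md5 record exactly once.
# sort -u is fine at this size; it is the step that does not scale to 10 GB.
wanted=$(printf '%s\n' "$pairs" | sort -u)
printf '%s\n' "$wanted"

rm -rf "$tmp"
```

The first listing should show the d1 records of b.md5 interleaved (p, q, p, q), which is why a plain uniq does not collapse them.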
#2
On Sat, 14 Jul 2007 08:11:26 +0000, Andre Majorel wrote:
> I have two MD5SUMS files sorted by MD5 digest, a.md5 and b.md5. I'm
> trying to get a list of files in b.md5 for which there is at least one
> file in a.md5 with the same MD5 digest. I thought that
>
>     join -o 2.1,2.2 a.md5 b.md5
>
> would do it. I forgot that join(1) will output *all* pairs, even if only
> one of the files appears in the format...
>
> Is there a way to tell join(1) to list all the records from b.md5 that
> can be paired, instead of all N x M possible pairs?
>
> In SQL parlance, join(1) gives me
>
>     SELECT b.pathname
>     FROM a INNER JOIN b USING (digest)
>
> while I would like
>
>     SELECT DISTINCT b.digest, b.pathname
>     FROM a INNER JOIN b USING (digest)
>
> Piping into uniq(1) won't work because the duplicate lines are not
> contiguous.
>
> Sorting before piping into uniq is probably not practical as the output
> is more than 10 GB.

If I understand the problem, the filenames from a.md5 are not used, so remove them, and run uniq over the resulting file, and then join with that.

    awk '{print $1}' a.md5 | uniq | join - b.md5

This way instead of N*M output you get 1*M.
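As a quick check of that pipeline, the sketch below runs it on made-up data (d1, d2, d3 and the path names are placeholders, not real MD5 sums). It relies on a.md5 being sorted by digest, as stated in the question, so that uniq sees repeated digests on adjacent lines.

```shell
# Hypothetical sample data; digests and paths are placeholders.
tmp=$(mktemp -d)
printf '%s\n' 'd1 a/x' 'd1 a/y' 'd2 a/z' > "$tmp/a.md5"
printf '%s\n' 'd1 b/p' 'd1 b/q' 'd3 b/r' > "$tmp/b.md5"

# Keep only the digest column of a.md5, collapse adjacent repeats,
# then join the resulting one-column list against b.md5.
matched=$(awk '{print $1}' "$tmp/a.md5" | uniq | join - "$tmp/b.md5")
printf '%s\n' "$matched"

rm -rf "$tmp"
```

Each d1 record of b.md5 should appear once. One caveat, stated as an expectation rather than something tested here: join's default output separates fields with single blanks, so a path name containing a run of blanks would probably not be reproduced byte for byte.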
#3
On Sat, 14 Jul 2007 08:11:26 +0000, Andre Majorel wrote:
> I have two MD5SUMS files sorted by MD5 digest, a.md5 and b.md5. I'm
> trying to get a list of files in b.md5 for which there is at least one
> file in a.md5 with the same MD5 digest. I thought that
>
>     join -o 2.1,2.2 a.md5 b.md5
>
> would do it. I forgot that join(1) will output *all* pairs, even if only
> one of the files appears in the format...
>
> Is there a way to tell join(1) to list all the records from b.md5 that
> can be paired, instead of all N x M possible pairs?
>
> In SQL parlance, join(1) gives me
>
>     SELECT b.pathname
>     FROM a INNER JOIN b USING (digest)
>
> while I would like
>
>     SELECT DISTINCT b.digest, b.pathname
>     FROM a INNER JOIN b USING (digest)
>
> Piping into uniq(1) won't work because the duplicate lines are not
> contiguous.

I'm not sure what you want to achieve, but in case I understood the idea, here is a possibility. I assumed your description meant, for example:

    $ cat a.md5
    111 a1
    222 a2
    333 a3
    444 a4
    555 a5
    $ cat b.md5
    111 b701
    111b b1
    222 b2
    333 b3
    444b b4
    555 b5
    666 b6

And you wanted only the digests that have a match:

    $ awk '{v[$1]=v[$1]" "$2} END{for(x in v){print x,v[x]}}' a.md5 b.md5 | awk 'NF>2'
    111 a1 b701
    333 a3 b3
    555 a5 b5
    222 a2 b2

Or simply all of them:

    $ awk '{v[$1]=v[$1]" "$2} END{for(x in v){print x,v[x]}}' a.md5 b.md5
    444 a4
    666 b6
    444b b4
    111 a1 b701
    111b b1
    333 a3 b3
    555 a5 b5
    222 a2 b2

> Sorting before piping into uniq is probably not practical as the output
> is more than 10 GB.

Your list of unique md5sums is over 10GB? You're actually indexing a minimum of 250 million files?
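A variant of the awk approach, offered only as a sketch and not as what the poster wrote: read a.md5 first to remember its digests, then print the b.md5 lines whose digest was seen. Unlike the NF>2 filter, it does not report a digest that has two files in one input and none in the other, and it keeps b.md5 in its original order. It uses the sample data from this post and holds every distinct a.md5 digest in memory, which may or may not be acceptable at the sizes discussed.

```shell
# Sample data copied from the post above.
tmp=$(mktemp -d)
printf '%s\n' '111 a1' '222 a2' '333 a3' '444 a4' '555 a5' > "$tmp/a.md5"
printf '%s\n' '111 b701' '111b b1' '222 b2' '333 b3' '444b b4' '555 b5' '666 b6' > "$tmp/b.md5"

# First pass (NR == FNR is true only while reading a.md5): remember each digest.
# Second pass: print each b.md5 line whose digest was seen in a.md5.
semi=$(awk 'NR == FNR { seen[$1]; next } $1 in seen' "$tmp/a.md5" "$tmp/b.md5")
printf '%s\n' "$semi"

rm -rf "$tmp"
```

Each matching b.md5 line is printed once, however many a.md5 files share its digest. Note that the NR == FNR idiom assumes a.md5 is not empty.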