comp.unix.shell Using and programming the Unix shell.

join outputs duplicate lines

Posted 14/07/2007, 09:11   #1
Andre Majorel
join outputs duplicate lines

I have two MD5SUMS files sorted by MD5 digest, a.md5 and b.md5.
I'm trying to get a list of files in b.md5 for which there is at
least one file in a.md5 with the same MD5 digest. I thought that

join -o 2.1,2.2 a.md5 b.md5

would do it. I forgot that join(1) will output *all* pairs, even
if only one of the files appears in the format...

Is there a way to tell join(1) to list all the records from
b.md5 that can be paired, instead of all N x M possible pairs ?

In SQL parlance, join(1) gives me

SELECT b.pathname
FROM a INNER JOIN b USING (digest)

while I would like

SELECT DISTINCT b.digest, b.pathname
FROM a INNER JOIN b USING (digest)

Piping into uniq(1) won't work because the duplicate lines are
not contiguous.

Sorting before piping into uniq is probably not practical as the
output is more than 10 GB.
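
A minimal reproduction of the behaviour described above, as a sketch only: the three-digit digests are made up to stand in for real MD5 sums, and the files live in a temporary directory created with mktemp. Two records on each side share a digest, so join(1) emits 2 x 2 = 4 lines for only two distinct b.md5 records.

```shell
tmp=$(mktemp -d)

# Two files in a.md5 and two in b.md5 share the (made-up) digest 111.
printf '%s\n' '111 a1' '111 a2' > "$tmp/a.md5"
printf '%s\n' '111 b1' '111 b2' > "$tmp/b.md5"

# Every a.md5 record is paired with every b.md5 record of the same digest,
# so each of the two b.md5 records is printed once per matching a.md5 record.
join -o 2.1,2.2 "$tmp/a.md5" "$tmp/b.md5"
```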

--
André Majorel <URL:http://www.teaser.fr/~amajorel/>
(Counterfeit: imipefos@inverse.com kepakaq@camelot.com)
"Duty, honor, country" -- Douglas MacArthur
"Travail, famille, patrie" -- Philippe Pétain
Posted 14/07/2007, 18:04   #2
Icarus Sparry
Re: join outputs duplicate lines

On Sat, 14 Jul 2007 08:11:26 +0000, Andre Majorel wrote:

> I have two MD5SUMS files sorted by MD5 digest, a.md5 and b.md5. I'm
> trying to get a list of files in b.md5 for which there is at least one
> file in a.md5 with the same MD5 digest. I thought that
>
> join -o 2.1,2.2 a.md5 b.md5
>
> would do it. I forgot that join(1) will output *all* pairs, even if only
> one of the files appears in the format...
>
> Is there a way to tell join(1) to list all the records from b.md5 that
> can be paired, instead of all N x M possible pairs ?
>
> In SQL parlance, join(1) gives me
>
> SELECT b.pathname
> FROM a INNER JOIN b USING (digest)
>
> while I would like
>
> SELECT DISTINCT b.digest, b.pathname
> FROM a INNER JOIN b USING (digest)
>
> Piping into uniq(1) won't work because the duplicate lines are not
> contiguous.
>
> Sorting before piping into uniq is probably not practical as the output
> is more than 10 GB.


If I understand the problem, the filenames from a.md5 are not used, so
strip them, run uniq over the remaining digests (they are already
sorted, so duplicates are adjacent), and join the result with b.md5.

awk '{print $1}' a.md5 | uniq | join - b.md5

This way, each digest produces 1*M output lines instead of N*M.
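
The pipeline above can be tried on a small input. This is only a sketch: the function name semi_join is hypothetical, the three-digit digests are made up, and the files are written to a temporary directory. Both inputs must already be sorted on the digest field, as join(1) requires.

```shell
# semi_join A B: print each record of B whose digest (field 1) occurs in A.
# Assumes both files are sorted on field 1.
semi_join() {
    awk '{print $1}' "$1" | uniq | join - "$2"
}

tmp=$(mktemp -d)

# Digest 111 occurs twice in each file; 222 only in a.md5; 333 only in b.md5.
printf '%s\n' '111 a1' '111 a1copy' '222 a2' > "$tmp/a.md5"
printf '%s\n' '111 b1' '111 b2' '333 b3' > "$tmp/b.md5"

semi_join "$tmp/a.md5" "$tmp/b.md5"
# prints:
# 111 b1
# 111 b2
```

With the uniq step removed, the same input would produce four lines, because both a.md5 records with digest 111 would be paired with both b.md5 records.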
Posted 14/07/2007, 19:33   #3
loki harfagr
Re: join outputs duplicate lines

On Sat, 14 Jul 2007 08:11:26 +0000, Andre Majorel wrote:

> I have two MD5SUMS files sorted by MD5 digest, a.md5 and b.md5. I'm
> trying to get a list of files in b.md5 for which there is at least one
> file in a.md5 with the same MD5 digest. I thought that
>
> join -o 2.1,2.2 a.md5 b.md5
>
> would do it. I forgot that join(1) will output *all* pairs, even if only
> one of the files appears in the format...
>
> Is there a way to tell join(1) to list all the records from b.md5 that
> can be paired, instead of all N x M possible pairs ?
>
> In SQL parlance, join(1) gives me
>
> SELECT b.pathname
> FROM a INNER JOIN b USING (digest)
>
> while I would like
>
> SELECT DISTINCT b.digest, b.pathname
> FROM a INNER JOIN b USING (digest)
>
> Piping into uniq(1) won't work because the duplicate lines are not
> contiguous.


I'm not sure what you want to achieve, but in case I understood
the idea, here's a possibility.

I assumed your description meant something like:
$ cat a.md5
111 a1
222 a2
333 a3
444 a4
555 a5

$ cat b.md5
111 b701
111b b1
222 b2
333 b3
444b b4
555 b5
666 b6

And you wanted only the digests that have more than one file:

$ awk '{v[$1]=v[$1]" "$2} END{for(x in v){print x,v[x]}}' a.md5 b.md5 | awk 'NF>2'
111 a1 b701
333 a3 b3
555 a5 b5
222 a2 b2

Or simply every digest, matched or not:

$ awk '{v[$1]=v[$1]" "$2} END{for(x in v){print x,v[x]}}' a.md5 b.md5
444 a4
666 b6
444b b4
111 a1 b701
111b b1
333 a3 b3
555 a5 b5
222 a2 b2
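
As a related sketch (not something proposed in this thread), the same awk array technique can answer the original question directly: remember every digest seen in a.md5, then print the b.md5 records whose digest is known. The function name semi_join_awk is hypothetical and the data is the sample above. It assumes a.md5 is not empty (the NR == FNR test relies on that) and that the distinct digests of a.md5 fit in memory; the inputs do not need to be sorted.

```shell
# semi_join_awk A B: print records of B whose field 1 also appears in A.
semi_join_awk() {
    awk '
        NR == FNR { seen[$1]; next }   # first file: remember each digest
        $1 in seen                     # second file: print records with a known digest
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '111 a1' '222 a2' '333 a3' '444 a4' '555 a5' > "$tmp/a.md5"
printf '%s\n' '111 b701' '111b b1' '222 b2' '333 b3' '444b b4' '555 b5' '666 b6' > "$tmp/b.md5"

semi_join_awk "$tmp/a.md5" "$tmp/b.md5"
# prints:
# 111 b701
# 222 b2
# 333 b3
# 555 b5
```

Each b.md5 record is printed at most once and in its original order, so no uniq or sort pass over the output is needed.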

> Sorting before piping into uniq is probably not practical as the output
> is more than 10 GB.


Your list of unique md5sums is over 10 GB?
Are you actually indexing a minimum of 250 million files?

All times are GMT +1.