comp.unix.shell: Using and programming the Unix shell.
#1 |
Hi All,
Hopefully a nice easy question. Apologies for the cross-posting: I made a newbie mistake by posting in comp.os.linux, and I'm trying to attain salvation by posting here instead.

I have 8x16gb files that are basically just giant lists. Each row contains only one word.

I need to compare each file against every other file and list duplicates.

I've done a sort -u on the file and I know that each one is unique, I just don't know how to list duplicates amongst the files. It's a case of needing to know which duplicates there are, rather than just cat'ing them together and running a sort -u.

I've written a perl script that will do it... eventually, but it's going to take about 2 weeks to finish and tbh, I'm not sure my disks will last that long!

I have no access to a database, and it's an isolated network.

Any suggestions?

P.S. Apologies if I've broken any rules, it's the first time I've tried Usenet and I couldn't find a comp.os.linux.fiddlingwithbigfiles forum!
#2 |
On 2007-12-20, my0373@googlemail.com <my0373@googlemail.com> wrote:
> I have 8x16gb files that are basically just giant lists.

Sixteen-gigabyte text files????? Jeezus H. Tapdancing Christ... What kind of application is this?

> each row contains only one word.
> I need to compare each file against every other file and list
> duplicates.

Right.

> I've done a sort -u on the file and I know that each one is unique, I
> just don't know how to list duplicates amongst the files. It's a case
> of needing to know which duplicates there are, rather than just
> cat'ing them together and running a sort -u.

I'm amazed that sort would even handle files this big, especially in a reasonable time.

> I've written a perl script that will do it... eventually but it's going to
> take about 2 weeks to finish and tbh, I'm not sure my disks will last that
> long!

Can you give us a clue as to what you came up with? How about sharing some info on the files? How long are these words?

> Any suggestions ?

I can think of one approach. I must think it through, though. Let us know a bit more about the problem, please.

--
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
Douglas Adams
#3 |
On Thu, 20 Dec 2007 12:06:12 -0800, my0373@googlemail.com wrote:
> Hi All,
>
> Hopefully a nice easy question, apologies for the cross posting I made a
> newbie mistake by posting in comp.os.linux, I'm trying to attain
> salvation by posting here instead
>
> I have 8x16gb files that are basically just giant lists.
>
> each row contains only one word.
>
> I need to compare each file against every other file and list
> duplicates.
>
> I've done a sort -u on the file and I know that each one is unique, I
> just don't know how to list duplicates amongst the files. It's a case of
> needing to know which duplicates there are, rather than just cat'ing
> them together and running a sort -u.

Sorry, I don't understand exactly what you are asking for. Can you give us an example of the input and the output you are looking for? For example, if you had 3 input files

file1:
cat
dog
horse
pig

file2:
dog
man

file3:
bird
dog
pig

Are the files sorted?

> I've written a perl script that will do it... eventually but it's going
> to take about 2 weeks to finish and tbh, I'm not sure my disks will last
> that long!
>
> I have no access to a database, and it's an isolated network.
>
> Any suggestions ?
>
> P.S. Apologies if I've broken any rules, it's the first time I've tried
> usenet and I couldn't find a comp.os.linux.fiddlingwithbigfiles
> forum !

If the files are sorted, then you can find the duplicates between a pair of files using the "comm" command. To find all the duplicates you will need to run comm 28 times (once for each pair of the 8 files), which is not too bad, but will read about 900Gb.

If the files are sorted, however, you can use

    sort -m file1 file2 file3 file4 file5 file6 file7 file8 | uniq -d > dups

which will "only" read 256Gb (assuming that dups is fairly small). This will give you a list of duplicated words. Then you can use "comm -12 dups file1", "comm -12 dups file2" etc. for another 256Gb to tell you in which files the duplicated words are (if that is what you want).

Do you expect there to be a lot of duplicates, or only a few?
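A minimal sketch of the two approaches described above, assuming every input file is already sorted in the C locale (as `sort -u` run with `LC_ALL=C` would leave it). The function names `list_dupes` and `common_pair` are made up for illustration. `comm -12` suppresses columns 1 and 2, so it prints only the lines common to both files.

```shell
#!/bin/sh
# list_dupes (hypothetical name): print every word that occurs in more
# than one of the given sorted files. sort -m only merges, it does not
# re-sort, so each file is read exactly once.
list_dupes() {
    LC_ALL=C sort -m "$@" | LC_ALL=C uniq -d
}

# common_pair (hypothetical name): print the words two sorted files share.
common_pair() {
    LC_ALL=C comm -12 "$1" "$2"
}
```

With the three example files from the post, `list_dupes file1 file2 file3` prints `dog` and `pig`, and `common_pair file1 file2` prints `dog`.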
#4 |
On 2007-12-21, Icarus Sparry <usenet@icarus.freeuk.com> wrote:
> Sorry, I don't understand what you are asking for exactly. Can you give
> us an example of the input and output you are looking for? For example if
> you had 3 input files
<snip>
> Are the files sorted?

He did say the files were sorted with sort -u. So they're sorted and each one doesn't contain doubles.

>> I've written a perl script that will do it... eventually but it's going
>> to take about 2 weeks to finish and tbh, I'm not sure my disks will last
>> that long!
<snip>
> If the files are sorted, then you can find the duplicates between a pair
> of files using the "comm" command. To find all the duplicates you will
> need to run comm 28 times, which is not too bad, but will read about
> 900Gb.
>
> If the files are sorted however you can use
>
> sort -m file1 file2 file3 file4 file5 file6 file7 file8 | uniq -d > dups
>
> which will "only" read 256Gb (assuming that dups is fairly small).

You might be right, but I don't get it. Why would you read 256GB? If you read all 8 files of 16GB each, you only read 128GB. They're cat'ed together for the sort, perhaps? Mmm.

> This
> will give you a list of duplicated words. Then you can use "comm -12 dups
> file1", "comm -12 dups file2" etc for another 256Gb to tell you in which
> files the duplicated words are (if that is what you want).

Surely, here we'd only read 128 GB, plus 8x the size of the dupes?

> Do you expect there to be a lot of duplicates, or only a few?

That's what I'd like to know. I'd like a few examples of such 'words', too.

I'm still working on this, but I think it can be done by reading the files only once, therefore limiting the reads to 8x16GB=128GB. The origin of the doubles would be included.

I wish the OP would please post a few lines of example words...

--
There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
Douglas Adams
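The read volumes under discussion follow from simple counting. A small sketch, assuming 8 files of 16 GB each as stated in the original post (the variable names are illustrative only):

```shell
#!/bin/sh
files=8       # number of input files, from the original post
size_gb=16    # size of each file in GB, from the original post

# Pairwise comm: one run per unordered pair, each run reads two files.
pairs=$((files * (files - 1) / 2))
pairwise_gb=$((pairs * 2 * size_gb))

# Single sort -m pass: every file is read exactly once.
merge_gb=$((files * size_gb))

echo "pairwise comm: $pairs runs, $pairwise_gb GB read"
echo "sort -m pass: $merge_gb GB read"
```

This gives 28 runs and 896 GB for the pairwise approach (the "about 900Gb" quoted above) against 128 GB for a single merge pass.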
#5 |
On Fri, 21 Dec 2007 22:43:01 +0100, Rikishi 42 wrote:
> On 2007-12-21, Icarus Sparry <usenet@icarus.freeuk.com> wrote:
>> Sorry, I don't understand what you are asking for exactly. Can you give
>> us an example of the input and output you are looking for? For example
>> if you had 3 input files
> <snip>
>> Are the files sorted?
>
> He did say the files were sorted with sort -u. So they're sorted and
> each one doesn't contain doubles.

He said that he had run sort -u on the files, so he knew the files had no duplicates within themselves. However, he did not say that he had kept the sorted output.

>>> I've written a perl script that will do it... eventually but it's going
>>> to take about 2 weeks to finish and tbh, I'm not sure my disks will
>>> last that long!
> <snip>
>> If the files are sorted, then you can find the duplicates between a
>> pair of files using the "comm" command. To find all the duplicates you
>> will need to run comm 28 times, which is not too bad, but will read
>> about 900Gb.
>>
>> If the files are sorted however you can use
>>
>> sort -m file1 file2 file3 file4 file5 file6 file7 file8 | uniq -d > dups
>>
>> which will "only" read 256Gb (assuming that dups is fairly small).
> You might be right, but I don't get it. Why would you read 256GB? If you
> read all 8 files of 16GB each, you only read 128GB. They're cat'ed
> together for the sort, perhaps? Mmm.

No, this is me not being able to do mental arithmetic correctly. You do *NOT* want to cat the files together if you are using the "sort -m" approach.

>> This
>> will give you a list of duplicated words. Then you can use "comm -12
>> dups file1", "comm -12 dups file2" etc for another 256Gb to tell you in
>> which files the duplicated words are (if that is what you want).
> Surely, here we'd only read 128 GB, plus 8x the size of the dupes?

Yes. Having made the mistake once, I reused the same mental apparatus to get the same wrong result a second time.

>> Do you expect there to be a lot of duplicates, or only a few?
> What I'd like to know.
>
> I'd like a few examples of such 'words', too.
>
> I'm still working on this, but I think it can be done by reading the files
> only once, therefore limiting the reads to 8x16GB=128GB. The origin of
> the doubles would be included.

Of course it can, but not with standard unix utilities. The problem was underspecified, but essentially all you need to do is read one record from each file and sort these 8 into order. If the first record is duplicated, then write whatever output you want about the duplicates (it is not clear from the description whether the OP wanted to know which files they come from:

::>I've done a sort -u on the file and I know that each one is
::>unique, I just don't know how to list duplicates amongst the
::>files. It's a case of needing to know which duplicates there
::>are, rather than just cat'ing them together and running a
::>sort -u.

), then discard the first record and its duplicates (if any) and replace them with the next records from the files they came from. This is one linear pass over the 8 input files.

The question you have to decide is how the cost of writing this program compares to the cost of writing the shell script I outlined. The cost has many factors: the time taken to write the program, the time it takes to run, how often it must be run, what deadlines have to be met, etc.

    #!/bin/sh
    # Merge the already-sorted inputs; keep each word seen more than once.
    sort -m "$@" | uniq -d > dups
    for i
    do
        # Print the duplicated words found in this file, prefixed with its name.
        comm -12 "$i" dups | sed "s|^|$i:|"
    done

takes very little time to write, and one can be happy that it is reasonably bug free. (OK, there are problems if one of the input files is called "dups", if any of the input files has a pipe symbol in its name, if someone sets IFS to a weird value, if there are other programs called "sort", "uniq", "comm" or "sed" in the PATH that don't do what they should, if /bin/sh is not roughly a Bourne or POSIX shell...)

> I wish the OP would please post a few lines of example words...
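As a possible middle ground between the two options weighed above, here is a sketch that stays in shell but reads each input only once and still reports the origin of every duplicate. It is an illustration under stated assumptions, not a tested tool for 16 GB inputs: the function name `dupes_with_origin` is made up, every file is assumed to be sorted in the C locale, words and file names are assumed to contain no tab characters, and `mktemp -d` and `mkfifo` are assumed to be available. Each line is tagged with its file name through a named pipe, the tagged streams are merged on the word, and awk groups equal neighbours.

```shell
#!/bin/sh
# dupes_with_origin (hypothetical name): one merge pass over sorted files,
# printing "word: fileA,fileB,..." for each word found in two or more files.
dupes_with_origin() (
    tmp=$(mktemp -d) || exit 1
    tab=$(printf '\t')
    n=0
    for f in "$@"; do
        n=$((n + 1))
        mkfifo "$tmp/$n" || exit 1
        # Tag every word with the name of the file it came from.
        awk -v name="$f" '{ print $0 "\t" name }' "$f" > "$tmp/$n" &
    done
    # Merge the tagged streams on the word only; equal words become adjacent.
    LC_ALL=C sort -m -t "$tab" -k1,1 "$tmp"/* |
    awk -F '\t' '
        # Same word as the previous line: remember one more origin file.
        NR > 1 && ($1 "") == prev { files = files "," $2; count++; next }
        # New word: report the previous group if it had several origins.
        { if (count > 1) print prev ": " files
          prev = $1 ""; files = $2; count = 1 }
        END { if (count > 1) print prev ": " files }'
    wait
    rm -rf "$tmp"
)
```

On the three-file example from earlier in the thread, `dupes_with_origin file1 file2 file3` prints `dog: file1,file2,file3` and `pig: file1,file3`. The named pipes avoid writing tagged copies of the inputs to disk, which matters when the inputs total 128 GB.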
#6 |
"my0373@googlemail.com" <my0373@googlemail.com> writes:
> I've done a sort -u on the file and I know that each one is unique, I
> just don't know how to list duplicates amongst the files. It's a case
> of needing to know which duplicates there are, rather than just
> cat'ing them together and running a sort -u.
> I've written a perl script that will do it... eventually but it's
> going to take about 2 weeks to finish and tbh, I'm not sure my disks
> will last that long!

If each file is sorted, then write a program to read one word from each of the three files. Look for dups. Then find the word that is first in the list (lexically), and read the next word from the corresponding file. If the same word is in two files (a duplicate in files A and B), and these words sort before the third word (in C), then advance by reading one word from both A and B.

This way it only takes one pass through the files. No sorting. No large memory issues. One pass through each of the three files.

--
Posted via a free Usenet account from http://www.teranews.com
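The procedure described above can be sketched as a small awk program wrapped in a shell function. This is only an illustration: the name `merge_dupes` is made up, the files are assumed to be sorted in the C locale, and it works for any number of files rather than just three. It holds one current word per file in memory, finds the lexically smallest, reports it if more than one file holds it, and advances exactly those files.

```shell
#!/bin/sh
# merge_dupes (hypothetical name): hand-written k-way merge over sorted
# files, printing "word: fileA,fileB,..." for words present in 2+ files.
merge_dupes() {
    LC_ALL=C awk '
    # Read the next word from file number i, or mark that file finished.
    function advance(i) {
        if ((getline line < ARGV[i]) > 0) { word[i] = line; live[i] = 1 }
        else delete live[i]
    }
    BEGIN {
        for (i = 1; i < ARGC; i++) advance(i)
        for (;;) {
            # Find the lexically smallest current word among unfinished files.
            have = 0
            for (i in live)
                if (!have || (word[i] "") < (min "")) { min = word[i]; have = 1 }
            if (!have) break
            # Every file currently holding that word is noted and advanced.
            names = ""; count = 0
            for (i = 1; i < ARGC; i++)
                if ((i in live) && (word[i] "") == (min "")) {
                    names = names (count++ ? "," : "") ARGV[i]
                    advance(i)
                }
            if (count > 1) print min ": " names
        }
    }' "$@"
}
```

Run as `merge_dupes file1 file2 file3` on the example files from earlier in the thread, it prints `dog: file1,file2,file3` and `pig: file1,file3`. Appending an empty string to each word forces awk to compare as strings, so words that look like numbers are not compared numerically.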