#1
Hello

I am new to ruby and trying to do a diff on two csv files. I am putting each row into an array and then subtracting the arrays into a new array, then taking that array and printing out the records.

The problem I am running into is I would like the badRecords method to return the actual record lines that are not present in csv2.csv but instead it is returning all records in csv1.csv. The other problem I see is the code is running rather slow, cpu usage spikes up to 99% when running. Any insight on improvements would be appreciated.

Thanks!

    #!/usr/bin/env ruby -wKU

    require 'rubygems'
    require 'faster_csv'

    def Array1

      getNumber = FCSV.open("csv1.csv")
      getNumber.collect do |row|
        return row[1]
      end
    end

    def Array2

      getNumber = FCSV.open("csv2.csv")
      getNumber.collect do |row| if (row[5].include?("Originating")) &&
        (row[41].include?("y"))
        return row[20]
      end
      end
    end

    def SumArray

      SumArray = Array1 - Array2
      if SumArray.empty?
        puts "records have been validated"
        Process.exit!(0)
      else
        return SumArray
      end
    end

    def badRecords

      my_file = File.open('badRecords.csv','w')
      records = FCSV.open("csv1.csv")
      records.collect do |row| row[1].eql?(SumArray)
        my_file.puts row.inspect.gsub(/\[|[a-z]*\]$/, "")
      end
      my_file.close
    end
    end
    badRecords

--
Posted via http://www.ruby-forum.com/.
#2
-------- Original Message --------
> Date: Tue, 13 May 2008 03:52:21 +0900
> From: James Dechiaro <jdechiaro@coherecomm.com>
> To: ruby-talk@ruby-lang.org
> Subject: array diff

> I am new to ruby and trying to do a diff on two csv files.
> I am putting each row into an array and then subtracting the arrays into
> a new array, then taking that array and printing out the records.
>
> The problem I am running into is I would like the badRecords method to
> return the actual record lines that are not present in csv2.csv but
> instead it is returning all records in csv1.csv. The other problem I see
> is the code is running rather slow, cpu usage spikes up to 99% when
> running. Any insight on improvements would be appreciated.
> [...]

James,

welcome to Ruby! You'll like it.

You could do something like this:

    csv_1_array = IO.readlines("csv1.txt")
    csv_2_array = IO.readlines("csv2.txt")
    result_array = csv_2_array - csv_1_array   # (also removes duplicates, triplicates, etc. of any matching line)

Best regards,

Axel
#3
Same thing basically, but nonetheless this would also work:

    csv1 = []
    csv2 = []

    IO.foreach("csv1.csv") {|lines| csv1 << lines}
    IO.foreach("csv2.csv") {|lines| csv2 << lines}

OR

    csv1 = []
    csv2 = []

    FCSV.foreach("csv1.csv", :headers => true) do |row|
      csv1 << row["headerName"]
    end

    FCSV.foreach("csv2.csv", :headers => true) do |row|
      csv2 << row["headerName"]
    end

    # this will give you the elements of csv1 that are not in csv2
    differenceArray = csv1-csv2

    # this will give you the elements they DO share in common
    commonArray = csv1&csv2

Regards,

- Mac
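To make the behaviour of the two operators concrete, here is a minimal sketch on made-up arrays (the values are hypothetical and only stand in for the lines or fields read from the files). It shows that `-` is not symmetric and drops every copy of a matching element, while `&` returns each shared element once.

```ruby
# Hypothetical sample data standing in for values read from the two files.
csv1 = ["a", "b", "b", "c"]
csv2 = ["b", "d"]

# Elements of csv1 that do not occur in csv2 (both copies of "b" are dropped).
only_in_csv1 = csv1 - csv2

# The reverse difference is a different result: elements of csv2 not in csv1.
only_in_csv2 = csv2 - csv1

# Elements present in both arrays, each listed once.
in_both = csv1 & csv2
```

So for "records in csv1 that are missing from csv2", the operand order matters: csv1 has to be on the left.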
#4
Thanks for your responses.

The files are formatted completely differently, so I need to break up each field into an array and then call that element. This part is working correctly (it just takes a while).

The problem I'm running into is when I try to open csv1 again and compare the SumArray against it in order to get the entire line (not just the element). It instead prints all the lines, even the ones not contained in the array.

So it looks as though the .eql? method is not working correctly.

csv1.csv format:

    "3105551212","01133555615771","BEVERLYHLS","CA","INTL","ON","Apr 28 2008 1:10PM","300","256","0.0250","0.0000","0.0013",

csv2.csv format:

    3067483e7538520080325105439.8971-040000,ABCCompany,Normal,438,+13105551212,Originating,438,Anonymous,01133555615771,20080325105439.897,1-040000,Yes,20080325105500.333,20080325105716.252,016,,,01133555615771,internat,in,10020101155542615771,,local,,,,,,ABCCompany,,,,,,,,,,y,public,,16519912:0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,user32@domain.com,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,153
#5
On 5/12/08, James Dechiaro <jdechiaro@coherecomm.com> wrote:
> the problem i'm running into is when i try to open back the csv1 and
> compare the SumArray against it in order to get the entire line (not
> just the element). It instead prints all the lines, even the ones not
> contained in the array.
>
> So it looks as though the .eql? method is not working correctly.

There are a few potential problems with this line:

    records.collect do |row| row[1].eql?(SumArray)

First, I think you need an 'if'. Otherwise you are calling eql? but not using the result.

Second, it looks like you are comparing a single field with the whole SumArray. They will never be equal. You probably want something like

    if SumArray.include?(row[1])

-Adam
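The two points above can be sketched on a few made-up rows. The data and the names `missing_keys` and `rows` are hypothetical stand-ins for SumArray and the rows of csv1.csv. The sketch shows why the eql? comparison can never succeed and how a membership test selects only the wanted rows.

```ruby
# Hypothetical stand-ins: keys missing from csv2, and csv1 rows whose
# second field (row[1]) holds the key.
missing_keys = ["222", "333"]
rows = [
  ["a", "111"],
  ["b", "222"],
  ["c", "333"],
]

# A String is never eql? to an Array, so this is false for every row.
never_equal = rows.map { |row| row[1].eql?(missing_keys) }

# Testing membership with include? keeps only the rows whose key is listed.
bad_rows = rows.select { |row| missing_keys.include?(row[1]) }
```

In the original script the result of eql? was also discarded, so the `puts` ran for every row regardless.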
#6
if SumArray.include?(row[1]) did the trick... thanks, Adam!
#7
2008/5/13 James Dechiaro <jdechiaro@coherecomm.com>:
> if SumArray.include?(row[1]) did the trick...thanks adam!

You might also want to consider using Set or Hash for more efficient lookups. The way I would probably do it is this: define a record type that covers all the relevant information (easily done with Struct). Then read the file to test against (csv2 I believe) and convert records to the record type which you then add to the Set / Hash. Then read the other file line by line, convert it and print it if it is not contained in the Set / Hash. That way you do not need to keep both files in memory and the Set / Hash lookups are much faster than Array based lookups.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end
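A minimal sketch of the Set-based approach described above, under some assumptions: it uses Ruby's standard `csv` library (the successor to FasterCSV) instead of `FCSV`, the column indexes 5, 41, 20 and 1 are taken from the original script, and the names `CallKey`, `load_keys` and `each_missing_row` are hypothetical. The record type here holds only the one field used for matching.

```ruby
require 'csv'
require 'set'

# Hypothetical record type holding only the field used for matching.
# Struct instances with equal members are eql? and hash alike,
# so they work as Set members.
CallKey = Struct.new(:number)

# Read the file to test against and collect its keys in a Set.
def load_keys(path)
  keys = Set.new
  CSV.foreach(path) do |row|
    # Same filter as the original script; to_s guards against empty fields.
    next unless row[5].to_s.include?("Originating") && row[41].to_s.include?("y")
    keys << CallKey.new(row[20])
  end
  keys
end

# Stream the other file row by row and yield each row whose key is
# not in the Set. Only one row of this file is held in memory at a time.
def each_missing_row(path, keys)
  CSV.foreach(path) do |row|
    yield row unless keys.include?(CallKey.new(row[1]))
  end
end

# Possible usage with the thread's file names:
#   keys = load_keys("csv2.csv")
#   File.open("badRecords.csv", "w") do |out|
#     each_missing_row("csv1.csv", keys) { |row| out.puts row.to_csv }
#   end
```

Each Set lookup takes roughly constant time, whereas Array#include? scans the array for every row, which is the likely cause of the long run time reported in the thread.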
#8
Thanks for the tip, Robert. I will give it a go... as the script has been running for over 14 hours now after finding 13k matches =(
#9
On 14.05.2008 13:29, James Dechiaro wrote:
> Thanks for the tip Robert I will give it a go...as the script has been
> running for over 14 hours now after finding 13k matches =(

Yet another approach would be to use a relational database for this. If the volume is large it may pay off to import your CSV data into two tables, create appropriate indexes and get your result via a SELECT.

Kind regards

robert
#10
On Wed, May 14, 2008 at 2:50 PM, Robert Klemme <shortcutter@googlemail.com> wrote:
> Yet another approach would be to use a relational database for this. If the
> volume is large it may pay off to import your CSV data into two tables,
> create appropriate indexes and get your result via a SELECT.

I agree that a large dataset like this probably doesn't belong in the high-level programming domain. Put the burden where it belongs. For this type of data, the model would be simple in a database, as would be the queries.

2c,
Todd