#1
Recently I've attempted to push a huge csv into arrays via code that
looks along the lines of this:

    csvFile = FasterCSV.read('data.csv', :headers => true)

    array = []

    csvFile.each do |row|
      array << row['column_name']
    end

The problem arises when the csv file is some odd 2 million lines or more. Normally I would comment about how long it took, but I decided to call it quits after 9 hours of waiting lol.

Any ideas on how to handle columns in CSV docs the same way FasterCSV does? (And yes, theoretically I could split the 80 MB csv into 20 4 MB files, but what's the accomplishment in that!)

Thanks,

- Mac

--
Posted via http://www.ruby-forum.com/.

#2
On Apr 2, 9:08 am, Michael Linfield <globyy3...@hotmail.com> wrote:
> Recently I've attempted to push a huge csv into arrays via code that
> looks along the lines of this:
>
> csvFile = FasterCSV.read('data.csv', :headers => true)
>
> array = []
>
> csvFile.each do |row|
>   array << row['column_name']
> end
>
> The problem arises when the csv file is some odd 2 million lines or more.

How many fields in a row? You're appending values to an array that many times (2 million or more), which I suspect is where your performance problem lies.

You could probably check by

    csvFile = FasterCSV.read('data.csv', :headers => true)
    count = 0
    csvFile.each do |row|

    end

#3
On Apr 2, 1:26 pm, Mike Woodhouse <mikewoodho...@gmail.com> wrote:
> How many fields in a row? You're appending that many times (2 million
> or more) values to an array, which I suspect is where your performance
> problem lies.
>
> You could probably check by
>
> csvFile = FasterCSV.read('data.csv', :headers => true)
> count = 0
> csvFile.each do |row|
>
> end

Hmph. I must have hit some unknown "send" key combination...

I meant to say, before I interrupted myself:

    csvFile = FasterCSV.read('data.csv', :headers => true)
    count = 0
    csvFile.each do |row|
      count += 1
    end

...which replaces the array append with a lightweight operation. (I don't know if Ruby is "smart" and likely to skip the iteration with an empty block - probably not, but adding 1 shouldn't impose a heavy load.)

Mike
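
One way to run the check Mike describes is to time each stage on its own, so it is clear whether parsing, iterating, or appending dominates. This is only a rough sketch: it uses Ruby's standard `csv` library (FasterCSV became the standard `CSV` class in Ruby 1.9, so the calls mirror the ones in this thread), and the helper name `time_csv_stages`, the header names, and the sample data are all made up for the example.

```ruby
require 'csv'
require 'benchmark'
require 'tempfile'

# Hypothetical helper (not part of FasterCSV/CSV): times the three stages
# separately and returns the elapsed seconds for each.
def time_csv_stages(path, column_name)
  table = nil
  timings = {}

  # Stage 1: parse the whole file into memory, as FasterCSV.read does.
  timings[:read] = Benchmark.realtime do
    table = CSV.read(path, headers: true)
  end

  # Stage 2: iterate with a cheap operation instead of the append.
  timings[:iterate] = Benchmark.realtime do
    count = 0
    table.each { |_row| count += 1 }
  end

  # Stage 3: iterate and append one field per row.
  timings[:append] = Benchmark.realtime do
    values = []
    table.each { |row| values << row[column_name] }
  end

  timings
end

# Small made-up sample; point the helper at the real file instead.
Tempfile.create(['sample', '.csv']) do |file|
  file.puts 'col1,col2,col3'
  10_000.times { |i| file.puts "#{i},#{i * 2},#{i * 3}" }
  file.flush

  time_csv_stages(file.path, 'col1').each do |stage, seconds|
    puts format('%-8s %.4f s', stage, seconds)
  end
end
```

Trying this on a slice of the real file first (say, the first 100,000 lines) should show which stage grows fastest before committing to a full run.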

#4
On Apr 2, 2008, at 3:08 AM, Michael Linfield wrote:
> Recently I've attempted to push a huge csv into arrays via code that
> looks along the lines of this:
>
> csvFile = FasterCSV.read('data.csv', :headers => true)
>
> array = []
>
> csvFile.each do |row|
>   array << row['column_name']
> end

That code is pretty inefficient, since it reads the entire file into memory only to walk over it row by row. Let's just read it row by row, instead.

    column = [ ]
    FCSV.foreach('data.csv', :headers => true) do |row|
      column << row['column_name']
    end

> The problem arises when the csv file is some odd 2 million lines or
> more.
> Normally I would comment about how long it took but I decided to
> call it quits after 9 hours of waiting lol.

FasterCSV could be choking on the CSV data, if it's not valid. It sometimes has to read to the end of the document to know that, which could take a long while with that much data.

James Edward Gray II
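
To rule out the invalid-data case James mentions, the file can be streamed once with nothing kept in memory, just to see whether the parser complains. The sketch below assumes Ruby's standard `csv` library (the successor to FasterCSV) and its `CSV::MalformedCSVError`; the helper name `csv_parse_error` and the sample rows are hypothetical. The broken sample has an opening quote that is never closed, which is the kind of input that forces the parser to keep reading in search of the closing quote.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: streams the file and returns the parser's error
# message if the data is malformed, or nil if every row parses cleanly.
def csv_parse_error(path)
  CSV.foreach(path, headers: true) { |_row| } # parse only, keep nothing
  nil
rescue CSV::MalformedCSVError => e
  e.message
end

# Made-up sample with a deliberately broken third line.
Tempfile.create(['broken', '.csv']) do |file|
  file.puts 'col1,col2,col3'
  file.puts '1,2,3'
  file.puts '4,"5,6' # the opening quote is never closed
  file.puts '7,8,9'
  file.flush

  puts csv_parse_error(file.path).inspect
end
```

A `nil` result only says the file parses; it does not say the run will be fast.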

#5
Mike Woodhouse wrote:
> I meant to say, before I interrupted myself:
>
> csvFile = FasterCSV.read('data.csv', :headers => true)
> count = 0
> csvFile.each do |row|
>   count += 1
> end
>
> ...which replaces the array append with a lightweight operation. (I
> don't know if Ruby is "smart" and likely to skip the iteration with an
> empty block - probably not, but adding 1 shouldn't impose a heavy
> load)
>
> Mike

James wrote:
> Let's just read it row by row, instead.
>
> column = [ ]
> FCSV.foreach('data.csv', :headers => true) do |row|
>   column << row['column_name']
> end

Well firstly, I did count the rows already in the csv via

    file = File.readlines('data.csv')
    file.length

The output is about 2,500,300.

To answer your other question, Mike: the number of columns is 3. All integers.

Thanks James for that snippet. It might be more efficient and likely cuts the time in half (really nice); however, since I gave up after 9 hours, I don't know what half even is!

The data from each column is being written into its own array, i.e.:

    column1Array = []
    column2Array = []

etc.

Usually if the numbers were a defined length in each column I could just use regexps to pull them out; however, the numbers are mostly random.

I'll give your snippet a shot James and let you know how the results turn out. Till then any additional thoughts are much appreciated.

Thanks,
Mac
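
Since each of the three columns ends up in its own array, all of them can be filled in a single pass over the file. The sketch below is an illustration under stated assumptions, not a tested fix for the 9-hour run: it uses Ruby's standard `csv` library with its built-in `:integer` converter, and the header names (`col1` to `col3`) and both helper names are invented. The second helper is a shortcut that only holds if the data really is plain integers with no quoted fields or embedded commas; in that case splitting each line on commas sidesteps the CSV parser entirely, and varying number lengths are not a problem.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: one streaming pass, one array per header.
def columns_by_csv(path)
  columns = Hash.new { |hash, header| hash[header] = [] }
  CSV.foreach(path, headers: true, converters: :integer) do |row|
    row.each { |header, value| columns[header] << value }
  end
  columns
end

# Hypothetical shortcut: assumes unquoted, comma-separated integers only.
def columns_by_split(path)
  headers = nil
  columns = {}
  File.foreach(path) do |line|
    fields = line.chomp.split(',')
    if headers.nil?
      # The first line holds the header names.
      headers = fields
      columns = headers.to_h { |header| [header, []] }
    else
      headers.each_with_index do |header, index|
        columns[header] << Integer(fields[index])
      end
    end
  end
  columns
end

# Made-up sample data standing in for data.csv.
Tempfile.create(['columns', '.csv']) do |file|
  file.puts 'col1,col2,col3'
  file.puts '1,20,300'
  file.puts '4,50,600'
  file.flush

  p columns_by_csv(file.path)
  p columns_by_split(file.path)
end
```

Either helper returns a hash keyed by header name, so `columns['col1']` plays the role of `column1Array`.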