|
|
|
#1 (permalink) |
|
Messages: n/a
Hébergeur: |
I have a very large text file that gets dumped into a directoory every now and then. It is typically around 750MB long, at least, and my question is: What is the best method to parse this thing and insert the data into a postgres db? I have tried using file(), fget*() and some others, all with limited success. It goes through OK (I am sending it to a background process on the server and using a callback function to email me when done) but it is really knocking the machine hard, besides taking a real long time to finish. Is there a better way of approaching this? Any would be greatly appreciated. --Paul All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm |
|
|
|
#2 (permalink) |
|
Messages: n/a
Hébergeur: |
Paul Scott wrote:
> I have a very large text file that gets dumped into a directoory every > now and then. It is typically around 750MB long, at least, and my > question is: > > What is the best method to parse this thing and insert the data into a > postgres db? > > I have tried using file(), fget*() and some others, all with limited > success. It goes through OK (I am sending it to a background process on > the server and using a callback function to email me when done) but it > is really knocking the machine hard, besides taking a real long time to > finish. > > Is there a better way of approaching this? Any would be greatly > appreciated. First, which is your approach? I suspect that you are doing this with a cron job through php-cli. Now, to avoid using to many resources, try with fopen() and fgets(). Also work with persistent connections, so you don't have that overhead. The problem with file() is that it will load all the file to memory, and you don't want 700+MB in memory. -- 21:50:04 up 2 days, 9:07, 0 users, load average: 0.92, 0.37, 0.18 --------------------------------------------------------- Lic. Martín Marqués | SELECT 'mmarques' || Centro de Telemática | '@' || 'unl.edu.ar'; Universidad Nacional | DBA, Programador, del Litoral | Administrador --------------------------------------------------------- |
|
|
|
#3 (permalink) |
|
Messages: n/a
Hébergeur: |
better split that file in small parts. it will take hours to parse a
750 mb file, it's not a good idea to allow a script to run for such a long time. if you're on a multiprocessor machine you may use a single processor for that job. is that file updated from day to day and you need to insert the changes into your database? |
|
|
|
#4 (permalink) |
|
Messages: n/a
Hébergeur: |
> Paul Scott wrote: > > I have a very large text file that gets dumped into a directoory every > > now and then. It is typically around 750MB long, at least, and my > > question is: > > > > What is the best method to parse this thing and insert the data into a > > postgres db? > > > > I have tried using file(), fget*() and some others, all with limited > > success. It goes through OK (I am sending it to a background process on > > the server and using a callback function to email me when done) but it > > is really knocking the machine hard, besides taking a real long time to > > finish. > > > > Is there a better way of approaching this? Any would be greatly > > appreciated. > > First, which is your approach? I suspect that you are doing this with a > cron job through php-cli. > > Now, to avoid using to many resources, try with fopen() and fgets(). > Also work with persistent connections, so you don't have that overhead. > > The problem with file() is that it will load all the file to memory, and > you don't want 700+MB in memory. In addition to Martin's good suggestions (and also assuming you're running php-cli via cron), you could use nice to stop it consuming too many resources: http://en.wikipedia.org/wiki/Nice_%28Unix%29 Edward |
|
|
|
#5 (permalink) |
|
Messages: n/a
Hébergeur: |
On Thu, 2007-09-20 at 12:50 +0100, Edward Kay wrote: > In addition to Martin's good suggestions (and also assuming you're running > php-cli via cron), you could use nice to stop it consuming too many > resources: > This is the current approach that I am taking, was just really wondering if there was some kind of voodoo that would speed things up a bit. Thanks both for your responses, appreciate it! --Paul All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm |
|
|
|
#6 (permalink) |
|
Messages: n/a
Hébergeur: |
On Thu, 2007-09-20 at 13:55 +0200, Paul Scott wrote:
> On Thu, 2007-09-20 at 12:50 +0100, Edward Kay wrote: > > In addition to Martin's good suggestions (and also assuming you're running > > php-cli via cron), you could use nice to stop it consuming too many > > resources: > > This is the current approach that I am taking, was just really wondering > if there was some kind of voodoo that would speed things up a bit. > > Thanks both for your responses, appreciate it! Post some samples of the data you are parsing and a sample of the code you've written to parse them. If you're parsing 750 megs of data then it's quite likely you could squeeze some performance out of the parse routines themselves. Cheers, Rob. -- .................................................. .......... SwarmBuy.com - http://www.swarmbuy.com Leveraging the buying power of the masses! .................................................. .......... |
|
|
|
#7 (permalink) |
|
Messages: n/a
Hébergeur: |
On Thu, 2007-09-20 at 08:03 -0400, Robert Cummings wrote: > Post some samples of the data you are parsing and a sample of the code > you've written to parse them. If you're parsing 750 megs of data then > it's quite likely you could squeeze some performance out of the parse > routines themselves. Today's dataset is in a CSV (tab separated) , so I am using fgetcsv, it looks like this (geo data): 936374 Roodepoort Roodepoort Roodeport-Maraisburg -26.1666667 27.8666667 P PPL ZA ZA 06 0 1759 Africa/Johannesburg 2004-05-11 Code: [SNIP] $row = 1; $handle = fopen($csvfile, "r"); while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) { $num = count($data); $row++; $insarr = array('userid' => $userid, 'geonameid' => $data[0], 'name' => $data[1], 'asciiname' => $data[2], 'alternatenames' => $data[3], 'latitude' => $data[4], 'longitude' => $data[5], 'featureclass' => $data[6], 'featurecode' => $data[7], 'countrycode' => $data[8], 'cc2' => $data[9], 'admin1code' => $data[10], 'admin2code' => $data[11], 'population' => $data[12], 'elevation' => $data[13], 'gtopo30' => $data[14], 'timezoneid' => $data[15], 'moddate' => $data[16] ); $this->objDbGeo->insertRecord($insarr); //$arr[] = $data; } fclose($handle); --Paul All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm |
|
|
|
#8 (permalink) |
|
Messages: n/a
Hébergeur: |
Robert Cummings wrote:
> On Thu, 2007-09-20 at 13:55 +0200, Paul Scott wrote: >> On Thu, 2007-09-20 at 12:50 +0100, Edward Kay wrote: >>> In addition to Martin's good suggestions (and also assuming you're running >>> php-cli via cron), you could use nice to stop it consuming too many >>> resources: >> This is the current approach that I am taking, was just really wondering >> if there was some kind of voodoo that would speed things up a bit. >> >> Thanks both for your responses, appreciate it! > > Post some samples of the data you are parsing and a sample of the code > you've written to parse them. If you're parsing 750 megs of data then > it's quite likely you could squeeze some performance out of the parse > routines themselves. Adding to Robert's comment, try not to copy variables, but to reference them, so you will use less resources. Also free DB results after query. -- 21:50:04 up 2 days, 9:07, 0 users, load average: 0.92, 0.37, 0.18 --------------------------------------------------------- Lic. MartÃn Marqués | SELECT 'mmarques' || Centro de Telemática | '@' || 'unl.edu.ar'; Universidad Nacional | DBA, Programador, del Litoral | Administrador --------------------------------------------------------- |
|
![]() |
| Outils de la discussion | |
|
|