#1

Can a website block the use of file_get_contents?

Example: file_get_contents("http://www.google.com") works fine, but file_get_contents("http://www.petitscailloux.com/Follow.aspx? sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.

Any clues or ways to circumvent this? Thanks a lot!

#2

postseb wrote:
> Can a website block the use of file_get_contents ?

Usually it cannot. However, the site may set a cookie when you log into it, open a session, and a lot of other stuff. Since file_get_contents isn't exactly a browser replacement, it can very well be that things that work in the browser do not work when you just call file_get_contents.

You would have to analyze the requests and responses, look out for cookies being set, session IDs etc., and then replicate this in your PHP call. You will have to use fsockopen for this kind of stuff. Look at the PHP manual page for fsockopen: there is an example right there of how to download an HTTP page with this function.

Jan

--
insOMnia - We never sleep...
http://www.insOMnia-hq.de
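
As a rough illustration of the "look out for cookies and replicate them" step described above, here is a minimal sketch. The function names extract_cookies and cookie_header are hypothetical, and the sample headers are made up; the input is assumed to have the same shape as PHP's $http_response_header (a list of raw header lines).

```php
<?php
// Collect "name=value" pairs from Set-Cookie response headers.
// $headers is assumed to be a list of raw header lines,
// like the $http_response_header array PHP fills in.
function extract_cookies(array $headers): array
{
    $cookies = [];
    foreach ($headers as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue; // not a cookie header
        }
        // Keep only the first "name=value" part; attributes such as
        // path or expires come after the first ';'.
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        if (strpos($pair, '=') !== false) {
            [$name, $value] = explode('=', $pair, 2);
            $cookies[trim($name)] = trim($value);
        }
    }
    return $cookies;
}

// Turn the collected cookies into a Cookie request header line.
function cookie_header(array $cookies): string
{
    $parts = [];
    foreach ($cookies as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $parts);
}

// Made-up sample response headers, for illustration only.
$sample = [
    'HTTP/1.1 200 OK',
    'Set-Cookie: ASP.NET_SessionId=abc123; path=/; HttpOnly',
    'Set-Cookie: lang=fr; path=/',
];
$header = cookie_header(extract_cookies($sample));
echo $header, "\n";
```

The resulting line could then be sent with the next request, for example through the 'header' option of an 'http' stream context or as part of a hand-written socket request.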

#3

"postseb" <postseb@gmail.com> wrote in message
news:2806192b-4a79-4238-9c6b-83977b270813@s50g2000hsb.googlegroups.com...
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?

http://scriptasy.com/php_11/tutorial-curl-login_44.html

// Log in by POSTing $data to $url; cookies are kept in cookie.txt.
function curl_login($url, $data, $proxy, $proxystatus) {
    // Create (or empty) the cookie file.
    $fp = fopen("cookie.txt", "w");
    fclose($fp);

    $login = curl_init();
    curl_setopt($login, CURLOPT_COOKIEJAR, "cookie.txt");   // write cookies here
    curl_setopt($login, CURLOPT_COOKIEFILE, "cookie.txt");  // read cookies from here
    curl_setopt($login, CURLOPT_TIMEOUT, 40);
    curl_setopt($login, CURLOPT_RETURNTRANSFER, TRUE);      // return the page instead of printing it
    if ($proxystatus == 'on') {
        curl_setopt($login, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($login, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($login, CURLOPT_PROXY, $proxy);
    }
    curl_setopt($login, CURLOPT_URL, $url);
    curl_setopt($login, CURLOPT_HEADER, TRUE);
    // Pass on the visitor's user agent; fall back to a fixed one (e.g. on the command line).
    $ua = isset($_SERVER['HTTP_USER_AGENT'])
        ? $_SERVER['HTTP_USER_AGENT']
        : "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    curl_setopt($login, CURLOPT_USERAGENT, $ua);
    curl_setopt($login, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($login, CURLOPT_POST, TRUE);
    curl_setopt($login, CURLOPT_POSTFIELDS, $data);

    $result = curl_exec($login);  // response as a string, or FALSE on failure
    curl_close($login);           // close before returning, otherwise this is never reached
    return $result;
}

// Fetch $site, sending the cookies saved by curl_login().
function curl_grab_page($site, $proxy, $proxystatus) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    if ($proxystatus == 'on') {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_URL, $site);

    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

This is utterly brilliant, and got me screen scraping in no time.

Paul

#4

On 28 Mar, 10:03, postseb <post...@gmail.com> wrote:
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?
>

Well, it's not a valid URL for starters: you should urlencode everything after the 'sUrl=' and lose the white space in front of it.

If that still does not work, try using curl with a faked user agent; maybe they serve up different content to different browsers.

But beware: if the remote site has anti-leech functionality, you should respect the publisher's constraints.

C.
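
A minimal sketch of the URL fix suggested above, using PHP's urlencode(). The helper name build_follow_url is hypothetical; the base URL and the sUrl parameter come from the original question.

```php
<?php
// Hypothetical helper: wraps a target URL in the redirect URL from the question.
function build_follow_url(string $base, string $target): string
{
    // urlencode() escapes ':' and '/' (and spaces, '?', '&', ...) so the
    // target URL survives as a single query-string value.
    return $base . '?sUrl=' . urlencode($target);
}

$url = build_follow_url(
    'http://www.petitscailloux.com/Follow.aspx',
    'http://www.seloger.com/199986/16271207/detail.htm'
);
echo $url, "\n";
// prints http://www.petitscailloux.com/Follow.aspx?sUrl=http%3A%2F%2Fwww.seloger.com%2F199986%2F16271207%2Fdetail.htm
```

Building the URL from its parts also avoids the stray white space after "Follow.aspx?" that appears in the posted example.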

#5

> This is utterly brilliant, and got me screen scraping in no time.
>
> Paul

Thanks Paul and C. I tried it with curl as well, using curl_grab_page, and curl with an ini_set of a generic user agent, but I got the error below. Thanks also to Jan, I will also have to try fsockopen.

Runtime Error

Description: An application error occurred on the server. The current custom error settings for this application prevent the details of the application error from being viewed remotely (for security reasons). It could, however, be viewed by browsers running on the local server machine.

Details: To enable the details of this specific error message to be viewable on remote machines, please create a <customErrors> tag within a "web.config" configuration file located in the root directory of the current web application. This <customErrors> tag should then have its "mode" attribute set to "Off".

<!-- Web.Config Configuration File -->
<configuration>
    <system.web>
        <customErrors mode="Off"/>
    </system.web>
</configuration>

Notes: The current error page you are seeing can be replaced by a custom error page by modifying the "defaultRedirect" attribute of the application's <customErrors> configuration tag to point to a custom error page URL.

<!-- Web.Config Configuration File -->
<configuration>
    <system.web>
        <customErrors mode="RemoteOnly" defaultRedirect="mycustompage.htm"/>
    </system.web>
</configuration>

#6

Hi,

This site has user agent detection. Change your UA string to a well-known one:

ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11');

Then you can download the page.

Regards,

John Peters

On Mar 28, 6:03 am, postseb <post...@gmail.com> wrote:
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?
>
> Thanks a lot !
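
An alternative to the global ini_set is to set the user agent per request through a stream context. The sketch below assumes a hypothetical helper name (browser_context) and an arbitrary timeout; the UA string is the one quoted in the post, and the example URL in the comment is a placeholder.

```php
<?php
// Build a stream context that sends a browser-like User-Agent header.
// The 20-second timeout is an arbitrary assumption.
function browser_context(string $userAgent, int $timeout = 20)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeout,
        ],
    ]);
}

$ua = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11';
$context = browser_context($ua);

// Usage (performs a real HTTP request, so it is left commented out):
// $html = file_get_contents('http://www.example.com/', false, $context);
```

Unlike ini_set('user_agent', ...), this only affects calls that are given the context, which can be handy when a script talks to several sites.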

#7

On Mar 28, 3:03 am, postseb <post...@gmail.com> wrote:
>
> Can a website block the use of file_get_contents ?

I've seen this happen (in particular when trying to read data off of ASP-based Web sites), although I don't know why it happens. Either PHP file system functions generate weird HTTP request headers or some HTTP servers generate weird response headers...

> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?

Use cURL or write a data retrieval function using sockets:

http://groups.google.com/group/comp....ae1757ad369ace

Cheers,
NC
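
For the socket route mentioned above, a data retrieval function might look roughly like the following sketch. It is not the code from the linked post; the function names (build_get_request, socket_fetch), the HTTP/1.0 request and the example URL are assumptions made for illustration.

```php
<?php
// Build a minimal HTTP/1.0 GET request for a URL.
function build_get_request(string $url, string $userAgent): string
{
    $parts = parse_url($url);
    $path  = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$parts['host']}\r\n"
         . "User-Agent: {$userAgent}\r\n"
         . "Connection: close\r\n\r\n";
}

// Send the request over a plain socket and return the raw response
// (status line, headers and body), or false if the connection fails.
function socket_fetch(string $url, string $userAgent, int $timeout = 20)
{
    $parts = parse_url($url);
    $fp = @fsockopen($parts['host'], $parts['port'] ?? 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return false;
    }
    fwrite($fp, build_get_request($url, $userAgent));
    $response = stream_get_contents($fp); // read until the server closes the connection
    fclose($fp);
    return $response;
}

// Inspect the request that would be sent (no network access needed).
$request = build_get_request('http://www.example.com/page.aspx?id=1', 'ExampleAgent/1.0');
echo $request;
```

Because the whole request is written by hand, every header the server sees is under your control, which makes it easier to compare against what a browser sends.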

#8

@petersprc: I did indeed also try with a generic user agent, and I managed to download the page, BUT some values on the retrieved page were different from the values seen when simply browsing the page. Take a look at the value to the right of "Nombre de jours": it seems to be randomly generated when retrieving the page, yet it is a static value when browsing the page. How can that be? I am surprised that the contents could be retrieved, but with a random modification of particular values within the page. Thank you already for your help.

@NC: yes, I did try curl but got the error message mentioned above. I will try sockets as well. Thank you already for your help as well!