Forum: comp.lang.php
Can a website block the use of file_get_contents ?

28/03/2008, 11:03   #1
postseb

Can a website block the use of file_get_contents ?

Example : file_get_contents("http://www.google.com") works fine, but
file_get_contents("http://www.petitscailloux.com/Follow.aspx?
sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.

Any clues or ways to circumvent ?

Thanks a lot !

28/03/2008, 11:54   #2
Jan Thomä

postseb wrote:
> Can a website block the use of file_get_contents ?


Usually it cannot. However, the site may set a cookie when you log into it
and do a lot of other stuff, like opening sessions etc. Since
file_get_contents isn't exactly a browser replacement, it can very well be
that things that work in the browser do not work when just calling
file_get_contents. You would have to analyze the requests and responses,
look out for set cookies, session IDs etc., and then replicate this in your
PHP call. You can use fsockopen for this kind of stuff. Look at
the PHP manual for fsockopen on how to download an HTTP page with this
function; there is an example right there.
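
A minimal sketch of replicating cookies between two requests, assuming PHP's
plain HTTP stream wrapper is enough for the site in question.
cookie_header_from_response() is a hypothetical helper introduced for
illustration, and the example.com URLs in the comments are placeholders;
$http_response_header is the variable PHP fills in after an HTTP stream call.

```php
<?php
// Collect the name=value pairs from Set-Cookie response headers and join
// them into a single value suitable for a "Cookie:" request header.
// (Hypothetical helper; attributes such as path and expires are dropped.)
function cookie_header_from_response(array $responseHeaders): string
{
    $pairs = [];
    foreach ($responseHeaders as $line) {
        if (stripos($line, 'Set-Cookie:') === 0) {
            // Keep only the part before the first ';'.
            $value = trim(substr($line, strlen('Set-Cookie:')));
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Usage sketch (needs network access, so it is left commented out):
// file_get_contents('http://example.com/login');          // first request
// $cookies = cookie_header_from_response($http_response_header);
// $ctx = stream_context_create(['http' => ['header' => "Cookie: $cookies\r\n"]]);
// $page = file_get_contents('http://example.com/members', false, $ctx);
```

This only covers simple session cookies; a site that also checks form tokens
or redirects through several pages would need more of the exchange replayed.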

Jan

--
__________________________________________________ _______________________
insOMnia - We never sleep...
http://www.insOMnia-hq.de

28/03/2008, 12:09   #3
PaulB

"postseb" <postseb@gmail.com> wrote in message
news:2806192b-4a79-4238-9c6b-83977b270813@s50g2000hsb.googlegroups.com...
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?


http://scriptasy.com/php_11/tutorial-curl-login_44.html

function curl_login($url, $data, $proxy, $proxystatus) {
    // Create (or empty) the cookie jar file.
    $fp = fopen("cookie.txt", "w");
    fclose($fp);
    $login = curl_init();
    curl_setopt($login, CURLOPT_COOKIEJAR, "cookie.txt");  // cookies are saved here on close
    curl_setopt($login, CURLOPT_COOKIEFILE, "cookie.txt"); // cookies are read from here
    curl_setopt($login, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    curl_setopt($login, CURLOPT_TIMEOUT, 40);
    curl_setopt($login, CURLOPT_RETURNTRANSFER, TRUE);
    if ($proxystatus == 'on') {
        curl_setopt($login, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($login, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($login, CURLOPT_PROXY, $proxy);
    }
    curl_setopt($login, CURLOPT_URL, $url);
    curl_setopt($login, CURLOPT_HEADER, TRUE);
    // Pass on the visitor's user agent when there is one (not set on the CLI).
    if (isset($_SERVER['HTTP_USER_AGENT'])) {
        curl_setopt($login, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    }
    curl_setopt($login, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($login, CURLOPT_POST, TRUE);
    curl_setopt($login, CURLOPT_POSTFIELDS, $data);
    // CURLOPT_RETURNTRANSFER already prevents output, so no buffering is needed.
    $result = curl_exec($login); // execute the curl command
    curl_close($login);          // clean up before returning; also writes cookie.txt
    return $result;
}

function curl_grab_page($site, $proxy, $proxystatus) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    if ($proxystatus == 'on') {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt"); // reuse cookies saved by curl_login
    curl_setopt($ch, CURLOPT_URL, $site);
    $result = curl_exec($ch); // execute the curl command
    curl_close($ch);
    return $result;
}

This is utterly brilliant, and got me screen scraping in no time.

Paul


28/03/2008, 13:56   #4
C. (http://symcbean.blogspot.com/)

On 28 Mar, 10:03, postseb <post...@gmail.com> wrote:
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?
>



Well, it's not a valid URL for starters - you should urlencode
everything after the 'sUrl=' and lose the white space in front.
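
A short sketch of that encoding step, using the two URLs from the original
question; the variable names are only illustrative.

```php
<?php
// Build the outer URL with the inner URL percent-encoded as the value of sUrl.
$inner = 'http://www.seloger.com/199986/16271207/detail.htm';
$url   = 'http://www.petitscailloux.com/Follow.aspx?sUrl=' . urlencode($inner);

// The result is a single line with no white space after "sUrl=".
echo $url, "\n";
```

Building the string this way also avoids the stray line break that appears
when the URL is wrapped inside a quoted string.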

If that still does not work, try using curl with a faked user agent -
maybe they serve up different content to different browsers.

But beware - if the remote site has anti-leech functionality you
should respect the publisher's constraints.

C.

28/03/2008, 22:24   #5
postseb

>
> This is utterly brilliant, and got me screen scraping in no time.
>
> Paul



Thanks Paul and C. - I tried it with curl as well, using
curl_grab_page and curl with an ini_set of a generic user agent, but I
got the error below.
Thanks also to Jan, I will also have to try fsockopen.

Runtime Error
Description: An application error occurred on the server. The current
custom error settings for this application prevent the details of the
application error from being viewed remotely (for security reasons).
It could, however, be viewed by browsers running on the local server
machine.

Details: To enable the details of this specific error message to be
viewable on remote machines, please create a <customErrors> tag within
a "web.config" configuration file located in the root directory of the
current web application. This <customErrors> tag should then have its
"mode" attribute set to "Off".

<!-- Web.Config Configuration File -->
<configuration>
<system.web>
<customErrors mode="Off"/>
</system.web>
</configuration>

Notes: The current error page you are seeing can be replaced by a
custom error page by modifying the "defaultRedirect" attribute of the
application's <customErrors> configuration tag to point to a custom
error page URL.

<!-- Web.Config Configuration File -->
<configuration>
<system.web>
<customErrors mode="RemoteOnly" defaultRedirect="mycustompage.htm"/>
</system.web>
</configuration>

29/03/2008, 01:58   #6
petersprc

Hi,

This site has user agent detection. Change your UA string to a well-known
one:

ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11');

Then you can download the page.
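
As a per-request alternative to the global ini_set, the same user agent can
be passed through a stream context. This is only a sketch: browser_context()
is a hypothetical helper name and the 40-second timeout is an assumed value;
the 'user_agent' and 'timeout' keys are options of PHP's http wrapper.

```php
<?php
// Return a stream context that sends the given User-Agent on HTTP requests.
// (Hypothetical helper name; the timeout value is an assumption.)
function browser_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 40, // seconds
        ],
    ]);
}

$ua  = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11';
$ctx = browser_context($ua);

// Usage sketch (needs network access, so it is left commented out):
// $page = file_get_contents('http://example.com/page', false, $ctx);
```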

Regards,

John Peters

On Mar 28, 6:03 am, postseb <post...@gmail.com> wrote:
> Can a website block the use of file_get_contents ?
>
> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?
>
> Thanks a lot !


29/03/2008, 03:27   #7
NC

On Mar 28, 3:03 am, postseb <post...@gmail.com> wrote:
>
> Can a website block the use of file_get_contents ?


I've seen this happen (in particular when trying to read data off of
ASP-based Web sites), although I don't know why it happens. Either
PHP file system functions generate weird HTTP request headers or some
HTTP servers generate weird response headers...

> Example : file_get_contents("http://www.google.com") works fine, but
> file_get_contents("http://www.petitscailloux.com/Follow.aspx?
> sUrl=http://www.seloger.com/199986/16271207/detail.htm") does not.
>
> Any clues or ways to circumvent ?


Use cURL or write a data retrieval function using sockets:

http://groups.google.com/group/comp....ae1757ad369ace
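
For readers who cannot follow the link, here is a rough sketch of what a
socket-based retrieval function might look like. It is not the code from the
linked post: build_get_request() and socket_fetch() are hypothetical names,
and the sketch assumes plain HTTP on port 80 unless the URL says otherwise.

```php
<?php
// Build a minimal HTTP/1.0 GET request for the given URL.
// (Hypothetical helper; HTTP/1.0 avoids having to decode chunked responses.)
function build_get_request(string $url, string $userAgent): string
{
    $parts = parse_url($url);
    $path  = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    return "GET $path HTTP/1.0\r\n"
         . "Host: {$parts['host']}\r\n"
         . "User-Agent: $userAgent\r\n"
         . "Connection: close\r\n\r\n";
}

// Send the request over a plain TCP socket and return the raw response
// (status line, headers and body), or false if the connection fails.
function socket_fetch(string $url, string $userAgent, int $timeout = 40)
{
    $parts = parse_url($url);
    $fp = fsockopen($parts['host'], $parts['port'] ?? 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return false;
    }
    fwrite($fp, build_get_request($url, $userAgent));
    $response = stream_get_contents($fp);
    fclose($fp);
    return $response;
}
```

Because the request is written by hand, every header the server sees is
under your control, which makes it easier to compare against what a browser
sends.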

Cheers,
NC

29/03/2008, 10:54   #8
postseb

@petersprc : I did indeed also try with a generic user agent, and I
managed to download the page, BUT some values on the retrieved page
were different from the values seen on the webpage itself when simply
browsing it. Take a look at the value to the right of "Nombre de jours",
which seems to be randomly generated when retrieving the page but is in
fact a static value when browsing the page. How can that be? Very
strange. I am surprised the contents could be retrieved but with a
random modification of particular values within the page.
Thank you already for your help.

@NC : yes I did try curl but got the error message mentioned above. I
will try sockets as well.
Thank you already for your help as well !