alt.internet.seo Internet search engines and related topics.

Robots.txt

#1   18/09/2006, 17:05
danish
Robots.txt

I have a site that uses PHP session IDs. I know that eliminating them
from the URL entirely is what is recommended for optimal bot crawling,
and I am working on that, but is there any way, for now, to include a
line in robots.txt that would ignore the "PHPSESSID" parameter?

For example, the site works just fine when you visit this page:


http://fixmyfamily.com/search_details.php?cid=41


But by default it generates a URL like this:
http://fixmyfamily.com/search_detail...SID=0d8ff46dbd...



What can be done right now so that Google doesn't crawl these session
IDs, store them, and want to come back to them? Thanks in advance for
your help. BTW, I don't want to disallow all "search_details.php" URLs.

#2   18/09/2006, 17:29
Roy Schestowitz
Re: Robots.txt

Hi, this would probably be handled well by altering the way the CMS
generates URLs, either by omitting these duplicates or by moving them to a
(virtual) directory structure so that robots.txt can exclude them (it
can't/shouldn't do wildcards, but Google is pushing towards
breaking/'extending' the standards and conventions).

Session IDs are tricky. Are you sure bots are being assigned a cookie? I
know that spyware-type tools will be passed such URLs, but I don't think
search engines will browse (crawl) with a cookie. There were similar
questions before in this newsgroup (sessionid and duplicates), so it's
definitely worth browsing the archive. It's also worth looking at the logs,
filtering by crawler type (or IP address), to see what is going on underneath
the surface. Another possibility is to view the cache, e.g. using
"site:yoursite.suffix".

Best wishes,

Roy

--
Roy S. Schestowitz | /earth: file system full
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
5:20pm up 60 days 5:32, 7 users, load average: 0.40, 0.54, 0.64
http://iuron.com - Open Source knowledge engine project
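
For anyone who wants to try the log check suggested in the post above, here is a minimal Python sketch. It assumes an Apache "combined" access log and picks out crawlers by a few User-Agent substrings. The regular expression, the marker list, the function name and the sample lines are all illustrative assumptions, not anything taken from the thread, so adjust them to your own server.

```python
import re

# Assumed layout: Apache "combined" log format. Change this to match your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical list of User-Agent substrings used to spot crawlers.
CRAWLER_MARKERS = ("googlebot", "slurp", "msnbot")


def crawler_session_hits(lines):
    """Return (agent, url) pairs for crawler requests whose URL carries PHPSESSID."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # line is not in the assumed format
        agent = match.group("agent")
        url = match.group("url")
        is_crawler = any(marker in agent.lower() for marker in CRAWLER_MARKERS)
        if is_crawler and "PHPSESSID=" in url:
            hits.append((agent, url))
    return hits


# Made-up sample lines, only to show the expected input shape.
SAMPLE_LOG = [
    '66.249.66.1 - - [18/Sep/2006:17:05:00 +0100] '
    '"GET /search_details.php?cid=41&PHPSESSID=abc123 HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.7 - - [18/Sep/2006:17:06:00 +0100] '
    '"GET /search_details.php?cid=41&PHPSESSID=def456 HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (X11; Linux i686) Firefox/1.5"',
]

if __name__ == "__main__":
    for agent, url in crawler_session_hits(SAMPLE_LOG):
        print(agent, url)
```

If this prints nothing for your real log, crawlers are probably not being handed session IDs in URLs. To run it over a real log, pass the open file object instead of SAMPLE_LOG.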
#3   18/09/2006, 18:21
Roy Schestowitz
Re: Robots.txt

Addendum: the following has just been published.

http://www.webpronews.com/expertarti...Difficult.html
http://tinyurl.com/zsfzo

Session ID's Make Ecommerce Difficult

It might help.

#4   18/09/2006, 19:36
z
Re: Robots.txt

danish wrote:

> What can be done right now so that Google doesn't crawl these session
> IDs, store them, and want to come back to them? Thanks in advance for
> your help. BTW, I don't want to disallow all "search_details.php" URLs.


Why not get rid of the session IDs ASAP?

Get rid of session IDs with your .htaccess (if using mod_php, I believe), or
with a php.ini file (i.e., session.use_only_cookies).

Google allows you to block spidering of dynamic URLs with robots.txt, but I
don't know if the other search engines obey it. I don't think that would
work here, though, because your normal URLs are dynamic too.
#5   18/09/2006, 19:45
Roy Schestowitz
Re: Robots.txt

Google's Guidelines for Webmasters mention a possible substitution of
symbols that avoids replication (maybe the ampersand?). But it is surely not
standardised, and the requirements of different SEs can differ. That's why,
as you say, it's better to go for a universal solution.

#6   18/09/2006, 22:46
Nikita the Spider
Re: Robots.txt

In article <1158595502.140567.168050@h48g2000cwc.googlegroups.com>,
"danish" <danishiqbal@gmail.com> wrote:

> I have a site that uses PHP session IDs. I know that eliminating them
> from the URL entirely is what is recommended for optimal bot crawling,
> and I am working on that, but is there any way, for now, to include a
> line in robots.txt that would ignore the "PHPSESSID" parameter?
>
> What can be done right now so that Google doesn't crawl these session
> IDs, store them, and want to come back to them? Thanks in advance for
> your help. BTW, I don't want to disallow all "search_details.php" URLs.


Hi Danish Iqbal,
Yes and no. No for a standard robots.txt file, but yes if you're only
worried about Googlebot. Googlebot supports wildcards in robots.txt:
http://www.google.com/support/webmas...er=40367&topic=8846

Note that this is their own extension to the robots.txt standard, so it's
pretty likely that at least some, if not all, of the other major Web
spiders do *not* respect wildcards.
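
To see what such a wildcard rule would and would not catch, here is a rough Python sketch of Googlebot-style pattern matching. It assumes the commonly documented semantics: '*' matches any run of characters, a trailing '$' anchors the end of the URL, and a rule is matched from the start of the path. The function names, the rule strings and the session value "abc123" are made up for illustration, and the sketch ignores Allow lines and rule precedence, so treat it as a way to reason about a rule, not as a reimplementation of any crawler.

```python
import re

# The robots.txt fragment being modelled (an assumption, Googlebot only):
#
#   User-agent: Googlebot
#   Disallow: /*PHPSESSID


def rule_to_regex(rule):
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_rules):
    """True if any non-empty Disallow pattern matches the path plus query string."""
    return any(
        rule_to_regex(rule).match(path)  # match() anchors at the start of the path
        for rule in disallow_rules
        if rule  # an empty Disallow line blocks nothing
    )


if __name__ == "__main__":
    clean = "/search_details.php?cid=41"
    with_session = "/search_details.php?cid=41&PHPSESSID=abc123"  # placeholder ID

    session_rule = ["/*PHPSESSID"]
    print(is_blocked(clean, session_rule))         # False
    print(is_blocked(with_session, session_rule))  # True

    # Blocking every dynamic URL also blocks the normal pages, as z pointed out.
    dynamic_rule = ["/*?"]
    print(is_blocked(clean, dynamic_rule))         # True
    print(is_blocked(with_session, dynamic_rule))  # True
```

Under those assumptions, a rule keyed on PHPSESSID leaves the plain search_details.php?cid=41 URL crawlable, whereas a blanket rule on "?" would block it as well.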

HTH

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more