alt.internet.seo: Internet search engines and related topics.
#1
Posts: n/a
Host:
I have a site that uses PHP session IDs. I know that eliminating them from the URL entirely is what is recommended for optimal bot crawling, and I am working on that, but is there any way, for now, to include a line in robots.txt that would ignore the "PHPSESSID" parameter?

For example, the site works just fine when you visit this page:

http://fixmyfamily.com/search_details.php?cid=41

But by default it generates a URL like this:

http://fixmyfamily.com/search_detail...SID=0d8ff46dbd...

What can be done right now so that Google doesn't crawl these session IDs, store them, and want to come back to them? Thanks in advance for your help. BTW, I don't want to disallow all "search_details.php" URLs.

#2
__/ [ danish ] on Monday 18 September 2006 17:05 \__
> I have a site that uses PHP session IDs. I know that total
> elimination of these from the URL is what is recommended for optimal
> bot crawling and I am working on that, but is there any way, for now, to
> include a line in robots.txt that would ignore the "PHPSESSID"
> parameter?
>
> For example, the site works just fine when you visit this page:
>
> http://fixmyfamily.com/search_details.php?cid=41
>
> But by default it generates a URL like this:
> http://fixmyfamily.com/search_detail...SID=0d8ff46dbd...
>
> What can be done right now so that Google doesn't crawl these session
> IDs and then store them and want to come back to them? Thanks in
> advance for your help. BTW, I don't want to disallow all
> "search_details.php" URLs.

Hi, this would probably be handled well by altering the generation of URLs in the CMS, either by omitting these duplicates or by moving them to a (virtual) directory structure so that robots.txt can exclude them (it can't/shouldn't do wildcards, but Google is pushing towards breaking/'extending' the standards and conventions).

Session IDs are tricky. Are you sure bots are being assigned a [...]? I know that spyware-type tools will be passed such URLs, but I don't think search engines will browse (crawl) with a [...]. There were similar questions before in this newsgroup (sessionid and duplicates), so it's definitely worth browsing the archive. It's also worth looking at the logs, filtering by crawler type (or IP address), to see what is going on underneath the surface. Another possibility is to view the cache, e.g. using "site:yoursite.suffix".

Best wishes,

Roy

--
Roy S. Schestowitz | /earth: file system full
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
5:20pm up 60 days 5:32, 7 users, load average: 0.40, 0.54, 0.64
http://iuron.com - Open Source knowledge engine project
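
The log check Roy suggests can be sketched roughly as follows. This is a minimal Python sketch, not anything posted in the thread: it assumes the server writes the common Apache/Nginx "combined" log format, and the crawler markers, the function name `session_hits_by_crawler`, and the sample log lines (including the session values) are all made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" access-log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical list of user-agent substrings to treat as crawlers.
CRAWLER_MARKERS = ("googlebot", "slurp", "msnbot")


def session_hits_by_crawler(lines, param="PHPSESSID"):
    """Count, per crawler marker, the requests whose URL carries the session parameter."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if param + "=" not in match.group("path"):
            continue  # request URL has no session parameter
        agent = match.group("agent").lower()
        for marker in CRAWLER_MARKERS:
            if marker in agent:
                counts[marker] += 1
    return counts


# Made-up sample lines: a crawler with a session ID, a crawler without, a browser with.
sample = [
    '66.249.66.1 - - [18/Sep/2006:17:05:00 +0000] "GET /search_details.php?cid=41&PHPSESSID=abc123 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [18/Sep/2006:17:06:00 +0000] "GET /search_details.php?cid=41 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [18/Sep/2006:17:07:00 +0000] "GET /search_details.php?cid=41&PHPSESSID=def456 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux) Firefox/1.5"',
]
print(session_hits_by_crawler(sample))
```

A non-empty result for a crawler would suggest that it really is being handed session IDs in URLs; an empty one would point to the duplicates coming from somewhere else. Matching on the user-agent string is only a heuristic, since any client can claim to be a crawler.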

#3
__/ [ Roy Schestowitz ] on Monday 18 September 2006 17:29 \__
> __/ [ danish ] on Monday 18 September 2006 17:05 \__
>
>> I have a site that uses PHP session IDs. I know that total
>> elimination of these from the URL is what is recommended for optimal
>> bot crawling and I am working on that, but is there any way, for now, to
>> include a line in robots.txt that would ignore the "PHPSESSID"
>> parameter?
>>
>> For example, the site works just fine when you visit this page:
>>
>> http://fixmyfamily.com/search_details.php?cid=41
>>
>> But by default it generates a URL like this:
>> http://fixmyfamily.com/search_detail...SID=0d8ff46dbd...
>>
>> What can be done right now so that Google doesn't crawl these session
>> IDs and then store them and want to come back to them? Thanks in
>> advance for your help. BTW, I don't want to disallow all
>> "search_details.php" URLs.
>
> Hi, this would probably be handled well by altering the generation of
> URLs in the CMS, either by omitting these duplicates or moving them to a
> (virtual) directory structure so that robots.txt can exclude them (it
> can't/shouldn't do wildcards, but Google is pushing towards
> breaking/'extending' the standards and conventions).
>
> Session IDs are tricky. Are you sure bots are being assigned a [...]? I
> know that spyware-type tools will be passed such URLs, but I don't think
> search engines will browse (crawl) with a [...]. There were similar
> questions before in this newsgroup (sessionid and duplicates), so it's
> definitely worth browsing the archive. It's also worth looking at the logs,
> filtering by crawler type (or IP address) to see what is going on underneath
> the surface. Another possibility is to view the cache, e.g. using
> "site:yoursite.suffix".

Addendum: the following has just been published.

http://www.webpronews.com/expertarti...Difficult.html
http://tinyurl.com/zsfzo

Session ID's Make Ecommerce Difficult

It might help.

--
Roy S. Schestowitz | Community is code, code is community
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
6:20pm up 60 days 6:32, 7 users, load average: 1.06, 0.82, 0.77
http://iuron.com - Open Source knowledge engine project

#4
danish wrote:
> What can be done right now so that Google doesn't crawl these session
> IDs and then store them and want to come back to them? Thanks in
> advance for your help. BTW, I don't want to disallow all
> "search_details.php" URLs.

Why not get rid of the session IDs ASAP?

Get rid of session IDs with your .htaccess (if using mod_php, I believe), or with a php.ini file (i.e., use_only_).

Google allows you to block spidering of dynamic URLs with robots.txt, but I don't know if the other search engines obey it. I don't think that would work in your case, though, because your normal URLs are dynamic too.
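
The fix described above belongs in PHP's session configuration. PHP does have `session.use_only_cookies` and `session.use_trans_sid` directives, though treating either as the setting whose name is cut off in this post is an assumption. While that change is pending, it can be useful to see which crawled URLs collapse to the same page once the session parameter is dropped. Below is a minimal sketch of that normalisation in Python (used here purely for illustration, not PHP); the function name `strip_session_id` and the session value `abc123` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="PHPSESSID"):
    """Return the URL with the session parameter removed from its query string."""
    parts = urlsplit(url)
    # Keep every query parameter except the session one, preserving order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# "abc123" is a made-up session value for illustration.
print(strip_session_id(
    "http://fixmyfamily.com/search_details.php?cid=41&PHPSESSID=abc123"
))
```

Running every URL from a crawl or log through such a function and grouping by the result gives a rough count of how many session-ID duplicates exist per page.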

#5
__/ [ z ] on Monday 18 September 2006 19:36 \__
> danish wrote:
>
>> What can be done right now so that Google doesn't crawl these session
>> IDs and then store them and want to come back to them? Thanks in
>> advance for your help. BTW, I don't want to disallow all
>> "search_details.php" URLs.
>
> Why not get rid of the session IDs ASAP?
>
> Get rid of session IDs with your .htaccess (if using mod_php, I believe),
> or with a php.ini file (i.e., use_only_).
>
> Google allows you to block spidering of dynamic URLs with robots.txt, but I
> don't know if the other search engines obey it. I don't think that would
> work because your normal URLs are dynamic.

In Google's Guidelines for Webmasters they specify a possible substitution of symbols that avoids replication (maybe the ampersand?). But surely it's not standardised, and the requirements of different SEs can differ. That's why, as you say, it's better to go for a universal solution.

--
Roy S. Schestowitz | "Double your drive space - delete Windows"
http://Schestowitz.com | Open Prospects ¦ PGP-Key: 0x74572E8E
Tasks: 111 total, 2 running, 108 sleeping, 0 stopped, 1 zombie
http://iuron.com - knowledge engine, not a search engine

#6
In article <1158595502.140567.168050@h48g2000cwc.googlegroups.com>,
"danish" <danishiqbal@gmail.com> wrote:

> I have a site that uses PHP session IDs. I know that total
> elimination of these from the URL is what is recommended for optimal
> bot crawling and I am working on that, but is there any way, for now, to
> include a line in robots.txt that would ignore the "PHPSESSID"
> parameter?
>
> What can be done right now so that Google doesn't crawl these session
> IDs and then store them and want to come back to them? Thanks in
> advance for your help. BTW, I don't want to disallow all
> "search_details.php" URLs.

Hi Danish Iqbal,

Yes and no. No for a standard robots.txt file, but yes if you're only worried about Googlebot. Googlebot supports wildcards in robots.txt:

http://www.google.com/support/webmas...er=40367&topic=8846

Note that this is their own extension to the robots.txt standard, so it's pretty likely that some, if not all, of the other major Web spiders do *not* respect wildcards.

HTH

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
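
To make the wildcard idea concrete, here is a rough Python sketch of how a Google-style pattern could be matched against URL paths. It is an illustration only, not Googlebot's actual implementation: it assumes the extension's `*` matches any run of characters and a trailing `$` anchors the end of the URL, and it ignores Allow rules and rule precedence. The rule `Disallow: /*PHPSESSID=` (under `User-agent: Googlebot`), the helper names, and the session value `abc123` are hypothetical.

```python
import re


def google_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    Without '$', the pattern only has to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the path (Allow rules are ignored)."""
    return any(
        google_pattern_to_regex(pattern).match(path) is not None
        for pattern in disallow_patterns
    )


# Hypothetical rule for illustration: block any URL that carries a PHPSESSID parameter.
rules = ["/*PHPSESSID="]
print(is_blocked("/search_details.php?cid=41&PHPSESSID=abc123", rules))
print(is_blocked("/search_details.php?cid=41", rules))
```

Under these assumptions the session-ID URL is blocked while the plain `search_details.php?cid=41` URL stays crawlable, which is the behaviour the original question asks for. Crawlers that do not implement the extension would presumably treat the `*` as a literal character and ignore the rule.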