SEOdamian's Blog

Helping others understand & share about technology, search

Google Spiders-I’m arachnophobic, and I don’t think I want spiders in my website

A reader asks:

Ewww! I’m arachnophobic, and I don’t think I want spiders in my website. What are they anyway, what do they do, and how do they work?

But you do want spiders all over your website. You want Search engine spiders crawling all over your website. While real-life spiders eat bugs, Internet (or WWW) Search engine spiders bring visitors to your website. You do want visitors, don’t you?  Otherwise, why post on the web? (Well, actually there are some good reasons, but that is a later post.) Back to Search engine spiders.

Search engine spiders are computer programs that look at web pages, lots and lots of web pages. And they create the building blocks for the results we all see when we enter a search phrase at Google, Yahoo, Bing, GoodSearch or other search engines.

So how do Search engine spiders work?  Well, a lot of the process is something of a black box: something goes in, magic happens, something different comes out.  The process is often referred to as ‘crawling a site’, as the spider seemingly wanders the web trying to understand what each web page is about.  But I will try to shed some light on it.

  • The search engines have a ‘sign up page’ where you can register one page of your website.  Google’s is at Google. And Yahoo’s free submission is at Yahoo (there is paid submission, but that is another post).
  • The search engine then makes a list of all the registrations.
  • It then gives the list to a spider.  Again remember the spider is just a computer program, so this list is in essence a batch file of  ‘your work for today is to look at these websites’.
  • Starting at the top of the list the spider ‘goes’ to the 1st page in the list.  Just like you can surf all day and never leave your chair, but still travel the world, a spider never leaves Googleplex or YahooVille or the Bada-Bing.
  • It loads up the page from the list, just like your browser does.  Only rather than looking at how it displays, it looks at the code that creates the page. You can see the code for this page (or most pages) by going to View Source in most browsers.
  • The spider then collects all the words on the page, along with the meta tags, ALT text and TITLE tags.
  • It then runs another quick program (remember the search engine is trying to look at the ENTIRE web as often as possible).  This program boils the page down to the keyword phrases it is about. It also assigns a strength or rating to each phrase. So a website that is for a business in Wisconsin gets some rating for ‘Wisconsin’ because the office address is there (2349 E. Ohio Ave, Milwaukee, WI 53207). But a site that is about tourism and the history of Wisconsin gets a much higher rating for ‘Wisconsin’.
  • It then files all these keyword phrases and ratings about them for later.
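For those who like to see things in code, here is a rough sketch in Python of the "collect the words and rate them" steps above. It is only my illustration of the general idea: the real search engines do not publish how they rate pages. The class name `PageReader`, the function `rate_keywords`, the weights, and the two sample pages are all made up for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: how much a word counts depending on where it appears.
TITLE_WEIGHT = 5
META_WEIGHT = 3
ALT_WEIGHT = 2


def words_in(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PageReader(HTMLParser):
    """Reads the page code and collects the title, meta tags, ALT text and body words."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alt_texts = []
        self.body_words = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.body_words.extend(words_in(data))


def rate_keywords(html):
    """Toy rating: one point per word in the body, extra points for title, meta and ALT words."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    ratings = Counter(reader.body_words)
    for word in words_in(reader.title):
        ratings[word] += TITLE_WEIGHT
    for content in reader.meta.values():
        for word in words_in(content):
            ratings[word] += META_WEIGHT
    for alt in reader.alt_texts:
        for word in words_in(alt):
            ratings[word] += ALT_WEIGHT
    return ratings


# Two invented pages: a business that happens to be in Wisconsin, and a site about Wisconsin.
BUSINESS_PAGE = (
    "<html><head><title>Acme Plumbing</title></head>"
    "<body><p>Acme Plumbing serves Milwaukee, Wisconsin.</p></body></html>"
)
TOURISM_PAGE = (
    "<html><head><title>Wisconsin Tourism and History</title>"
    '<meta name="keywords" content="Wisconsin, tourism, history"></head>'
    "<body><h1>Visit Wisconsin</h1><p>Learn the history of Wisconsin.</p>"
    '<img src="capitol.jpg" alt="Wisconsin State Capitol"></body></html>'
)

if __name__ == "__main__":
    print("business:", rate_keywords(BUSINESS_PAGE)["wisconsin"])
    print("tourism: ", rate_keywords(TOURISM_PAGE)["wisconsin"])
```

In this toy version the tourism page scores far higher for "wisconsin" than the plumber does, simply because the word shows up in more places and in places that are weighted more heavily.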

After it makes a list of all the keyword phrases and their ratings, the spider then:

  • Looks for all the links to other pages.
  • It adds these links to its list of To Do’s (‘your work for today is to look at these websites’), with two additional pieces of information:
  1. What was the page that had the link on it about, in keyword phrases?
  2. What does the information about the link say about the new page?
      • Is there text that is linked, or is it just the URL?
      • Is there a linked image?
      • What is the ALT text about that image?
      • What is the image name?
      • What is the text around the link?
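Here is a small Python sketch of how a program might gather those link clues from a page. It is a simplified illustration, not how any particular search engine does it: the class `LinkClues`, the function `find_link_clues` and the sample HTML are my own inventions, and it skips the "text around the link" clue to stay short.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlparse


class LinkClues(HTMLParser):
    """Collects one record of clues for every link (<a href=...>) on a page."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished records, one per link
        self._current = None  # the link we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._current = {
                "url": attrs["href"],
                "text": "",
                "image_alt": None,
                "image_name": None,
            }
        elif tag == "img" and self._current is not None:
            # A linked image: remember its ALT text and its file name.
            self._current["image_alt"] = attrs.get("alt")
            src = attrs.get("src") or ""
            self._current["image_name"] = basename(urlparse(src).path) or None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            link = self._current
            link["text"] = link["text"].strip()
            # Is the linked text just the URL itself, or something descriptive?
            link["text_is_just_url"] = link["text"] == link["url"]
            self.links.append(link)
            self._current = None


def find_link_clues(html):
    parser = LinkClues()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample: a text link, an image link, and a bare URL link.
SAMPLE_HTML = (
    '<p>Read our <a href="/history.html">history of Wisconsin</a>, '
    'see the <a href="/map.html"><img src="/images/wisconsin-map.png" '
    'alt="Map of Wisconsin"></a>, or visit '
    '<a href="http://example.com/">http://example.com/</a>.</p>'
)

if __name__ == "__main__":
    for clue in find_link_clues(SAMPLE_HTML):
        print(clue)
```

Notice how much more the first two links tell the spider than the third one does. A bare URL gives it almost nothing to ‘prejudge’ the new page with.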

These are the clues that we as humans and the Search engine spiders use to determine what this linked page is about.  The spider collects all that information and uses it to ‘prejudge’ what the new page is about.

The search engine spider has now ‘crawled’ one page.

After building a list of all the linked pages on this page, it starts to go look at all of these new pages, one by one. If you have 5 pages it may look at all 5 pages; if you have 5,000 pages it may look at them all (of course it may get tired or ‘bored’, but again, that is another post).  If you think of a line being drawn to each new page, including some being drawn ‘backwards’ to previous pages, you can start to envision a web of lines to all the different pages with all sorts of connections.  This web is where the Search engine spider gets its name.

You can see that if there are other websites pointing to your site, the spider should eventually find you. But if you are an island, and no one is linking to your site, the search engine may never find you unless you register with it. The spider is not like an airplane that is going around the ocean looking for islands.  It needs to be pointed to an island at least once, either by someone registering the site or by another site (that the spider is visiting regularly) pointing to you.
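The link-following and the ‘island’ problem can be sketched in a few lines of Python. This is a toy: the ‘web’ here is just a made-up dictionary (`TOY_WEB`) of page names and the pages they link to, the function name `crawl` is mine, and `max_pages` is my stand-in for the spider running out of time.

```python
from collections import deque

# An invented miniature web: each page and the pages it links to.
TOY_WEB = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "widget"],
    "widget": ["products"],
    "island": ["home"],  # links out, but nothing links in to it
}


def crawl(registered_pages, web, max_pages=1000):
    """Visit pages starting from the registered list, following links as they are found."""
    to_do = deque(registered_pages)  # the spider's To Do list
    seen = set(registered_pages)     # pages already on the list, so none is added twice
    visited = []
    while to_do and len(visited) < max_pages:
        page = to_do.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                to_do.append(link)
    return visited


if __name__ == "__main__":
    # Only "home" was registered: the island is never found.
    print(crawl(["home"], TOY_WEB))
    # Register the island too, and the spider visits it.
    print(crawl(["home", "island"], TOY_WEB))
```

The `seen` set is what keeps the spider from going around in circles when pages link ‘backwards’ to pages it already knows about.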

Of course at some point the spider runs out of time for the day, and needs to return the results back to the nest to be merged with those of the many spiders looking at other websites.  There the web page ratings from the different spiders are all merged together and rankings are updated.  This merging will also take into consideration when other websites link to your website: if a 3rd party felt your site was important enough to link to, then it is usually more important than a page that no one has linked to.
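To make the merging idea concrete, here is a deliberately simple Python sketch. The formula is entirely made up (a keyword rating plus a bonus for each other site that links in); the real search engines keep their formulas to themselves. The names `count_inbound_links`, `rank_sites` and `link_weight`, and the example sites, are hypothetical.

```python
from collections import Counter


def count_inbound_links(links):
    """Count how many different sites link to each site.

    links is a collection of (from_site, to_site) pairs reported by the spiders.
    """
    inbound = Counter()
    for source, target in set(links):  # a repeated report counts only once
        if source != target:           # linking to yourself does not count
            inbound[target] += 1
    return inbound


def rank_sites(keyword_ratings, links, link_weight=2):
    """Order sites for one keyword phrase: best first, ties broken by name."""
    inbound = count_inbound_links(links)
    scores = {
        site: rating + link_weight * inbound[site]
        for site, rating in keyword_ratings.items()
    }
    return sorted(scores, key=lambda site: (-scores[site], site))


# Invented example: three sites with similar ratings for one keyword phrase.
RATINGS = {"a.example": 5, "b.example": 5, "c.example": 6}
LINKS = [
    ("b.example", "a.example"),
    ("c.example", "a.example"),
    ("a.example", "a.example"),  # a link to itself, which is ignored
]

if __name__ == "__main__":
    print(rank_sites(RATINGS, LINKS))
```

In this example `a.example` starts with a lower rating than `c.example`, but the two sites linking to it push it to the top.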

Way back in the early days of the WWW (1996-2000), spiders would actually go out primarily at night (by California, USA standards). When I analyzed the logs of different clients, I could see the spiders coming in the ‘wee hours’ of the morning.

Log files are the records kept by the hosting computer where your website lives, listing every single visit to your site. They list when each visitor came, what page they looked at, and where they came from. There are programs that make this kind of visitor information easier to understand, such as Google Analytics and WebTrends.
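If you are curious what picking spiders out of a log looks like, here is a short Python sketch. It assumes your host writes its logs in the common ‘combined’ format used by many web servers; yours may differ. The sample log lines are invented, and the function name `spider_visits` and the short list of spider names are my own choices.

```python
import re
from datetime import datetime

# One line of a "combined" format access log (an assumption about your server).
LOG_LINE = re.compile(
    r'^(?P<visitor>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<came_from>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Words that appear in the "user agent" of some well-known spiders.
SPIDER_NAMES = ("Googlebot", "Slurp", "bingbot")

# Invented sample lines: two spider visits and one human visitor.
SAMPLE_LOG = [
    '66.249.66.1 - - [03/Jul/2009:03:12:45 -0700] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [03/Jul/2009:14:05:10 -0700] "GET /about.html HTTP/1.1" '
    '200 2048 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"',
    '66.249.66.1 - - [03/Jul/2009:03:13:02 -0700] "GET /about.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]


def spider_visits(lines):
    """Return (spider name, hour of day, page) for every log line that looks like a spider."""
    visits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        spider = next((name for name in SPIDER_NAMES if name.lower() in agent), None)
        if spider:
            when = datetime.strptime(match.group("when"), "%d/%b/%Y:%H:%M:%S %z")
            visits.append((spider, when.hour, match.group("page")))
    return visits


if __name__ == "__main__":
    for spider, hour, page in spider_visits(SAMPLE_LOG):
        print(f"{spider} looked at {page} during hour {hour}")
```

Keep in mind that anyone can put "Googlebot" in their user agent, so a match here is a strong hint rather than proof.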

Now, of course, the spiders are out searching around the clock to try to keep up with the vast, changing content of most of the web, especially the ‘good’ stuff.  In our ever-changing world, the search engines have a prejudice that new content is better than old content.  That is why your site’s rankings can change minute by minute, as different spiders come back home to the nest and report how a site has changed its content, its links out or its links in. Other sites may have gotten a better or worse rating for a keyword phrase. If theirs has improved and yours has not, it will affect your ranking.

At some point after the spiders conquer the new website list, they will go back to websites on their existing list, and revisit and look to see if any pages have changed. The changes could be to add or delete links to other pages, or to add or delete information on that page or how it describes other pages.  The spider updates the information in its master list and lets the ‘nest’ re-rank all the websites for the different keyword phrases.
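One simple way a program could notice that a page has changed between visits is to keep a ‘fingerprint’ of each page and compare it next time. The Python sketch below shows that idea only; I am not claiming this is what the search engines actually do. The names `fingerprint` and `revisit` and the example URL are hypothetical.

```python
import hashlib


def fingerprint(page_html):
    """Boil a page's code down to a short fixed-length fingerprint."""
    return hashlib.sha256(page_html.encode("utf-8")).hexdigest()


def revisit(master_list, url, page_html):
    """Record the page's current fingerprint and report whether it is new or changed."""
    new_print = fingerprint(page_html)
    changed = master_list.get(url) != new_print
    master_list[url] = new_print
    return changed


if __name__ == "__main__":
    master_list = {}  # url -> fingerprint from the last visit
    url = "http://www.example.com/"
    print(revisit(master_list, url, "<p>Welcome to Wisconsin</p>"))   # first visit
    print(revisit(master_list, url, "<p>Welcome to Wisconsin</p>"))   # nothing changed
    print(revisit(master_list, url, "<p>Welcome to Milwaukee</p>"))   # content changed
```

A fingerprint like this flags any change at all, even a single character, so a real system would need something smarter to tell a meaningful change from a trivial one.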

Hopefully that helps clarify how spiders work and why you need to be descriptive in your words to get good rankings of your website.

July 3, 2009 | How To, Simple