Indexing and search engine optimization

What happens once the search engine finishes crawling the page? Let's take a look at the indexing process that search engines use to store information about web pages, enabling them to quickly return relevant, high-quality results.

Why do search engines need indexing?

Do you remember the days before the Internet, when you had to consult an encyclopedia in the school or university library, or in your father's library, to learn about the world around you?

Even in the early days of the Web, before search engines, we had to browse local directories to find information. What a time-consuming process that was, and how patient we were!

Search engines have revolutionized information retrieval to the point where users expect near-instant responses to their search queries.

What is search engine indexing?

Indexing is the process by which search engines organize information before searching to enable ultra-fast responses to queries.

Scanning individual pages for keywords and topics at query time would be far too slow a way for search engines to locate relevant information.

Instead, search engines (including Google) use an inverted index, also known as a reverse index.

What is an inverted index?

An inverted index is a system in which a database of text elements is compiled along with pointers to the documents that contain those elements. Search engines then use a process called tokenization to break text into tokens and reduce words to their base forms, which cuts down the resources needed to store and retrieve the data. This is a much faster approach than listing every known document against all relevant keywords and characters.

An example of an inverted index

Below is a basic example that illustrates the concept of inverted indexing. In the example, you can see that each keyword (or token) is associated with a row of documents in which that term appears.

Keyword | Document path 1             | Document path 2
SEO     | example.com/seo-tips        | moz.com
HTTPS   | deepcrawl.co.uk/https-speed | example.com/https-future

This example uses URLs, but these may be internal document identifiers instead, depending on how the search engine is structured.
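
To make the concept more tangible, here is a minimal sketch in Python of how such an index might be built. The documents, their texts, and the URLs are invented for illustration, and real tokenization is far more sophisticated than splitting on whitespace.

```python
# A minimal sketch of an inverted index over two hypothetical documents.
from collections import defaultdict

documents = {
    "example.com/seo-tips": "SEO tips for better rankings",
    "example.com/https-future": "Why HTTPS is the future of the web",
}

def tokenize(text):
    """Split text into lowercase tokens - a stand-in for real tokenization."""
    return text.lower().split()

# Map each token to the set of documents that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in tokenize(text):
        inverted_index[token].add(doc_id)

# A query becomes a direct lookup instead of a scan of every document.
print(inverted_index["seo"])    # {'example.com/seo-tips'}
print(inverted_index["https"])  # {'example.com/https-future'}
```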

What is the cached version of the page?

In addition to indexing, search engines may also store a highly compressed text version of the document, including all HTML and metadata. A cached document is the most recent snapshot of the page that the search engine has seen.

The cached version of the page can be accessed (in Google) by clicking on the small green arrow next to the URL of each search result and selecting the cached option. Alternatively, you can use the “cache:” Google search operator to view the cached version of the page.

What is PageRank?

PageRank is a Google algorithm that assigns each page a value, calculated from the links pointing to that page, to estimate its importance relative to every other page on the Internet.

The value passed through each individual link depends on the number and value of the links pointing to the page on which that link appears.

PageRank is just one of many signals used in Google's larger ranking algorithm.
A rough estimate of PageRank values was initially made public by Google but is no longer visible.
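
To make the intuition concrete, here is a toy sketch in Python of the core PageRank idea, using the classic power-iteration approach with a damping factor. The three-page link graph is invented for illustration; Google's real algorithm is far more elaborate.

```python
# A toy illustration of PageRank: a page's score depends on the number
# and score of the pages linking to it. The link graph is invented.

links = {  # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start with equal scores

# Repeatedly pass each page's score along its outgoing links.
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(rank)  # "c" scores highest: it receives links from both "a" and "b"
```

Notice that page "c" ends up with the highest value because two pages link to it, while "b" scores lowest with only a single inbound link from a page that splits its value two ways.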

How do search engines work?

Search engines work through three basic functions:

  1. Crawling: Imagine the search engine as a robot with a thousand arms that crawl across the Web until they reach your website.
  2. Indexing: Once the search engine visits your site, it discovers your pages, pulls their content, and stores it in its index.
  3. Ranking: The search engine then orders your pages according to their strength before displaying them to users and searchers.

Search engine crawling

Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content. The content can vary – it could be a web page, an image, a video, a PDF, etc. – but regardless of the format, content is discovered by links.

Googlebot starts by fetching a few web pages, then follows the links on those pages to find new ones. By hopping along this path of links, the crawler finds new content and adds it to its index, called Caffeine – a huge database of discovered URLs – to be retrieved later when a searcher is looking for information that the content at that URL matches well.
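
To illustrate the mechanics, here is a bare-bones sketch in Python of link-following crawling, using only the standard library. The start URL is hypothetical, and real crawlers like Googlebot add politeness delays, robots.txt checks, and deduplication at enormous scale.

```python
# A bare-bones link-following crawler sketch (not how Googlebot works).
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, frontier = set(), [start_url]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        # Newly discovered links join the queue, just as a crawler
        # follows links from fetched pages to find new content.
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com"))
```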

Search engine indexing

Search engines process and store the information they find in an index, which is a huge database of all the content they have discovered and consider good enough to serve searchers.

Search engine ranking

When someone performs a search, search engines scan their index for relevant content and then order that content in the hopes of answering the searcher's query. Arranging search results by relevance is known as ranking. In general, you can assume that the higher a website ranks, the more the search engine believes the site is relevant to the query.
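
As a toy sketch of this step, the snippet below looks up query terms in a small inverted index (the data is invented) and orders documents by how many terms they match. Real engines combine hundreds of signals, relevance among them, rather than a simple term count.

```python
# A toy ranking step: look up query terms in an inverted index and
# order documents by how many query terms each one contains.

inverted_index = {  # token -> documents containing it (invented data)
    "https": {"example.com/https-future", "deepcrawl.co.uk/https-speed"},
    "speed": {"deepcrawl.co.uk/https-speed"},
}

def rank(query):
    scores = {}
    for term in query.lower().split():
        for doc in inverted_index.get(term, set()):
            scores[doc] = scores.get(doc, 0) + 1  # one point per matched term
    # Highest-scoring (most relevant) documents come first.
    return sorted(scores, key=scores.get, reverse=True)

print(rank("HTTPS speed"))  # the page matching both terms ranks first
```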

It is possible to block search engine crawlers from part or all of your site, or instruct search engines to avoid storing certain pages in their index. While there can be reasons to do this, if you want searchers to find your content, you first need to make sure it is accessible to crawlers and indexable. Otherwise, it is as good as invisible.

By the end of this chapter, you'll have the context you need to work with the search engine, not against it!

If you're not showing up anywhere in search results, there are a few possible reasons for this:

  • Your site is brand new and has not been crawled yet.
  • Your site is not linked to from any external websites.
  • Your site's navigation makes it difficult for a bot to crawl it effectively.
  • Your site contains some underlying code called crawler directives that block search engines.
  • Google has penalized your website for spammy tactics.

What is a robots.txt file?

Robots.txt files live in the root directory of a website (for example, studyshoot.com/robots.txt) and, through specific robots.txt directives, suggest which parts of your site search engines should not crawl, as well as how quickly they should crawl it.

How Googlebot handles robots.txt files

  • If Googlebot cannot find a robots.txt file for a site, it will continue to crawl the site.
  • If Googlebot finds a site's robots.txt file, it will usually adhere to the suggestions and continue crawling the site.
  • If Googlebot encounters an error while trying to access a site's robots.txt file and cannot determine whether it exists or not, it will not crawl the site.
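
A well-behaved crawler can check these rules programmatically. Below is a sketch using Python's standard-library robots.txt parser; the URL and user-agent name are hypothetical, and this is not how Googlebot itself is implemented.

```python
# Checking robots.txt before fetching a URL, via the standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

# Ask whether a given user agent may crawl a given path.
if parser.can_fetch("MyCrawler", "https://example.com/private/page"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")

print(parser.crawl_delay("MyCrawler"))  # Crawl-delay directive, if any
```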

Common navigation errors that can prevent crawlers from seeing your entire site

  • Having mobile navigation that displays different results than desktop navigation
  • Any type of navigation where the menu items are not in HTML, such as JavaScript-enabled navigation. Google has gotten a lot better at crawling and understanding JavaScript, but it's still not a perfect process. The most reliable way to guarantee that something will be found, understood, and indexed by Google is to put it in HTML.
  • Personalization, or offering unique navigation to a certain type of visitor versus others, may look like cloaking to a search engine crawler
  • Forgetting to link to a primary page on your website through your navigation – remember that links are the paths crawlers follow to reach new pages!