This article is about how search engines crawl a site and how to manage crawl depth. We often talk about crawlability, crawl budget and the like, but how do we actually manage how deep the crawlers go on the SEO side?
In this article:
1. Search engine scanning: what it is and how it works
2. Bots and the User Agent
3. Images and text: the scan
4. URL crawling
5. Sitemaps
The topics just listed may not seem suitable for beginners who have just launched a website, but they are essential for SEO specialists and, more generally, for every professional in the sector. Understanding how a search engine works, which processes it triggers and which crawlers are involved is essential: it lets you dig deeper into how Google operates and what it expects from the sites it lists, and it is closely tied to the overall health of our site.
Search engine scanning: what it is and how it works
When we talk about crawling we mean the process that search engine web crawlers use to visit a page and download its contents. During this phase, the links on the page are also taken into consideration in order to go deeper into the site and discover other linked pages.
Google, Bing and all the other search engines periodically re-crawl the pages they already know. This way, the search engine can immediately find out whether anything has changed since the previous crawl. If changes are found, the search engine updates its index based on the changes detected in the content.
Web crawlers are therefore the programs that search engines use to analyze sites and access online content. The crawl starts with the download of the robots.txt file, which contains the rules addressed to bots or spiders. For example, you can disallow the crawling of certain pages or explicitly allow the crawling of a specific subfolder (keeping a page out of the index is handled separately, with noindex directives on the page itself). The file usually also mentions the path where the sitemap lives, that is, the collection of all the URLs of the site. Crawlers use a series of algorithms which, combined with precise rules, determine how often a page should be crawled. The analysis also establishes how many and which pages of a site should be indexed. Based on what we have just seen, it is worth looking at a concrete example of these rules (sketched below) before turning to the bots themselves.
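To make this concrete, here is a minimal, hypothetical robots.txt; the paths, the image-bot rule and the sitemap URL are invented purely for illustration:

User-agent: *
Allow: /private/annual-report.html
Disallow: /private/
Disallow: /search/

User-agent: Googlebot-Image
Disallow: /drafts/

Sitemap: https://www.example.com/sitemap.xml

In plain terms, these rules tell every crawler to stay out of the /private/ and /search/ areas (with one report explicitly allowed), ask Google's image crawler to skip /drafts/, and point all bots to the sitemap.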
Bots and the User Agent
Search engines scan a portal or website thanks to bots. Their identity is declared through the User Agent, that is, the string a client sends to the server to identify itself when it requests a page.
Some of the most popular bots and their user-agent strings:
· Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
· Bingbot: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
· Baiduspider: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
· YandexBot: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Google also uses more specialized user agents, such as Mediapartners-Google, Googlebot-News and Googlebot-Image/1.0.
As Google points out in its official user-agent and crawler guide, these strings can be verified with a reverse DNS lookup followed by a forward lookup, which confirms that the requesting IP address really belongs to the search engine.
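As a rough sketch of that check, here is what the reverse-plus-forward lookup might look like in Python; the IP address and the list of accepted domains are assumptions for the example, not an official implementation:

import socket

def looks_like_googlebot(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    # Reverse DNS lookup: resolve the requesting IP address to a host name
    hostname, _, _ = socket.gethostbyaddr(ip)
    # The host name must belong to one of the search engine's known domains
    if not hostname.endswith(allowed_suffixes):
        return False
    # Forward DNS lookup: the host name must resolve back to the same IP
    return ip in socket.gethostbyname_ex(hostname)[2]

# Example with an IP address pulled from a (hypothetical) server log
print(looks_like_googlebot("66.249.66.1"))

If both lookups match, the request really comes from the search engine; a spoofed user-agent string alone would fail this test.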
Images and text: the scan
Knowing how to manage the depth of SEO-side crawling is also useful for giving multimedia content the 'right weight'. When the search engine encounters a URL pointing to an image, an audio file or a video, it cannot read the contents of the file the way it reads text. Instead, it has to rely on the metadata and the file name.
It should be emphasized that a search engine can only capture a limited amount of information about non-textual files. This, however, does not prevent them from being indexed or ranked; useful traffic can also come from multimedia content.
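In practice, this means descriptive file names and metadata do real work. A small, hypothetical example with an image tag (the file names and alt text are invented):

<img src="IMG_00423.jpg" alt="">
<img src="red-trail-running-shoes.jpg" alt="Red trail running shoes on a muddy forest path">

The second version hands the crawler a meaningful file name and alt text, which is exactly the information it falls back on when it cannot read the pixels themselves.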
URL crawling
Crawlers find out whether a site has new pages thanks to links. Links act as bridges that connect different pieces of content, and therefore unique URLs. When the search engine crawls pages it already knows, it queues the URLs it finds there for analysis. This is also why it is increasingly important to write functional anchor text, not only for the user but also for the architecture and hierarchy of our site.
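A quick, hypothetical illustration of the difference (the URL and wording are invented):

<a href="/blog/crawl-budget-guide/">Click here</a>
<a href="/blog/crawl-budget-guide/">Read our guide to managing crawl budget</a>

The second anchor tells both the user and the crawler what the linked page is about, reinforcing the hierarchy of the site rather than just pointing somewhere.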
Sitemaps
As we saw in the previous paragraphs, the robots.txt file can specify the sitemap (or sitemaps) for the site. A sitemap is a list of the pages and posts to be crawled. For the search engine it becomes a valuable tool for finding even the content that is not visible on the surface but hidden deep within the site.
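A minimal XML sitemap looks roughly like this; the URL and date are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/deep-page/</loc>
    <lastmod>2021-06-15</lastmod>
  </url>
</urlset>

Listing deep pages here helps the crawler reach content that sits several clicks away from the home page.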
At the same time, the sitemap gives SEOs a precise handle on how to manage crawl depth. The data extracted from it even reveals how often the search engine typically crawls and indexes the pages.