Google Search Console index page
I use the free version of Google's Programmable Search Engine on my sites. In November 2023, I started receiving more notifications than usual from Google's Search Console (GSC) saying that pages and videos could not be indexed. I decided to take a look at what was happening.
It is important to remember that Google's Search Console only gives information about how the site appears in Google Search. It is not a full analytics suite such as Google Analytics, and Google's Programmable Search Engine API is a separate product from the same company.
Using file-counting tools, I found that my sitemap.xml lists 992 HTML files. I have 4,106 HTML files in total, with 224 of them in a folder containing test pages, odd bits of code I've written and half-formed ideas that I do not want indexed. This means there are 3,882 HTML files that Google should be able to find.
Instead, Google has indexed 888 of them and knows about another 2,644 that it has not indexed, a total of 3,532 HTML files.
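The counting above can be reproduced with a short script. This is only a minimal sketch, not the tool I actually used: the function name `count_html_files` and the excluded folder name `test` are assumptions made for illustration, and it only counts files ending in `.html`.

```python
from pathlib import Path


def count_html_files(root, excluded_dirs=()):
    """Return (total, indexable) counts of .html files under root.

    excluded_dirs holds folder names (hypothetical here, e.g. "test")
    whose contents should not be counted as indexable.
    """
    root = Path(root)
    total = 0
    indexable = 0
    for path in root.rglob("*.html"):
        if not path.is_file():
            continue
        total += 1
        # Folder names between the site root and the file itself.
        folders = path.relative_to(root).parts[:-1]
        if not any(folder in excluded_dirs for folder in folders):
            indexable += 1
    return total, indexable
```

Calling `count_html_files("path/to/site", excluded_dirs={"test"})` on a local copy of the site should return a pair like the 4,106 and 3,882 quoted above (4,106 - 224 = 3,882), provided the excluded folder name matches your own layout.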
My site was started in May 1999 and it is now November 2023, so it is almost 25 years old. In that time, files and whole sections of the site have been created, renamed, moved or deleted, but Google is still managing to send around 50 visitors a day to it. That's not bad for a hobby site. Some pages are ancient by internet standards and not very well written, HTML-wise, which seems to be the cause of some pages not being indexed. Some of these older pages will need rewriting, so removing the errors that can be fixed may be a long process.
This page was written to help me understand how GSC works and what I could do to improve the number of pages it is indexing. Barry Hunter has written an article in the Search Console Help about which errors can be left for a while and which need to be fixed.
Why pages aren’t indexed
Google seems to crawl sites using a mobile-first bot. This means that some reported errors will only show up on a page when it is viewed on a smartphone or when the screen width is reduced on a larger screen.
On the left-hand side of the GSC are a number of menu items under "Indexing" that give more details of why pages and videos are not being indexed. In that pane is a column named "Source". If the entry says "Website", then there is a link somewhere on your website that needs to be fixed in some way. If it says "Google systems", then Google may have discovered the URL elsewhere, but there may still be something not quite right on your site.
Page menu with reasons pages are not being indexed
The reasons I commonly see for my pages not being indexed are:
Page with redirect
Blocked by robots.txt
Blocked due to access forbidden (403)
Blocked due to other 4xx issue
Crawled - currently not indexed
Discovered - currently not indexed
Duplicate without user-selected canonical
Indexed, though blocked by robots.txt
Not found (404)
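For the two robots.txt reasons in the list above, it can help to test a URL against your rules locally before looking any further. The following is a rough sketch using Python's standard `urllib.robotparser`. The helper name `is_blocked` and the `/test/` rule in the usage note are assumptions for illustration, and Python's parser may not interpret every rule exactly as Googlebot does.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text disallows url for user_agent.

    This uses Python's own parser, so treat the answer as a hint
    rather than a statement of what Google will actually do.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

For example, with the rules `User-agent: *` and `Disallow: /test/`, a page under `/test/` would be reported as blocked, while a page at the top level of the site would not.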
For videos, the reasons they are not being indexed are:
No thumbnail URL provided
Video outside the viewport
Clicking on any section in the left-hand pane will open it up and give a list of files in that section. Clicking on one of the files will open a panel giving information about that file. There will also be a link named "Inspect URL" which, when clicked, gives more information, the most important of which, I find, is the "Referring page". This lists the page(s) on which Google found references to the page. This is especially important if the page listed is not in your sitemap.xml file.
Before doing anything else, I checked the Console's sitemap section. It was obvious something was wrong with my sitemap.xml as no pages were being read from it!
Google Search Console sitemap section before and after fixing sitemap.xml
I opened the sitemap in the Chrome browser and it told me where the error was. It turned out it was a missing > character! It appears Google Search Console cannot read anything from the file if there is any error at all in it.
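The same kind of check can be done without a browser. Below is a minimal sketch using Python's standard XML parser. The function name `check_sitemap` is my own invention, and it only tests that the file is well-formed XML and counts the `<loc>` entries in the standard sitemap namespace. It does not validate the file against the full sitemap schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap(xml_text):
    """Return (ok, message, url_count) for the given sitemap text."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) pair pointing at the problem.
        line, column = err.position
        return False, f"XML error at line {line}, column {column}", 0
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return True, "well-formed", len(locs)
```

Reading the file with `open("sitemap.xml", encoding="utf-8").read()` and passing the text to `check_sitemap` should report where the first error is, much as Chrome did for my missing > character.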
Error checking sitemap.xml
There are other sitemap.xml checkers around such as the ones from My Sitemap Generator and XML-Sitemaps.
To bring up a report of what Google thinks of your sitemap, go to the GSC and click on "Sitemaps" in the left-hand menu. Click on the current sitemap that appears in the right-hand pane and then on "See Page Indexing". Mine had quite a few errors, which I will try to fix first before tackling the other errors on the main screens.
Sources and Resources
Google's Programmable Search Engine
Google's Search Central Documentation
Google's Search Console
Google's Search Console Help
My Sitemap Generator - sitemap.xml checker
XML-Sitemaps - sitemap.xml checker