Introduction
I use Google Search Console (GSC) on my site, but in 2023 I was getting more notifications than normal that some pages were not being indexed or that there was something wrong with them.
Google Search Console index page
In November, I decided to take a serious look at what was happening. The first thing I decided to check was what GSC was indexing compared to what I thought it should be.
There are two files a website can use to help a well-behaved search bot. Robots.txt tells bots which web pages they can and cannot access, usually at the directory or folder level. Mine is restrictive in that it contains lines such as
Disallow: /test/
because I don't particularly want the test folder indexed, as it contains test pages, odd bits of code I've written and half-formed ideas.
The other file is sitemap.xml and that contains a list of files that search engines can index.
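The effect of a Disallow line like the one above can be checked with a short Python sketch. It uses the standard library's urllib.robotparser; the robots.txt text is a made-up two-line example (only the Disallow line comes from this page) and the page name scratch.htm is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; only the Disallow line is from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a well-behaved bot may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://brisray.com/index.htm"))         # True
print(is_allowed(ROBOTS_TXT, "https://brisray.com/test/scratch.htm"))  # False
```

This only shows how a compliant parser reads the rule; it says nothing about what any particular search engine has actually indexed.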
Counting Words
Sitemap.xml is a plain text file in XML format. The entries in it take the form:
<url>
<loc>https://brisray.com/index.htm</loc>
<lastmod>2023-11-16T07:04:42+00:00</lastmod>
</url>
Counting a string that appears exactly once in each entry, and nowhere else in the file, should give the number of files that Google and other search engines should be able to find easily. I chose <loc> and used:
FIND /C /I "<loc>" "websites\brisray\sitemap.xml"
and that gave the result of:
---------- WEBSITES\BRISRAY\SITEMAP.XML: 996
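The /C switch tells the FIND command to count the lines that contain the string rather than display them, and the /I switch makes the search case insensitive. Because each entry puts its <loc> tag on a separate line, as in the example above, the line count equals the number of entries.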
So, the sitemap.xml file contains 996 entries, but Google has, for whatever reason, decided to index only 966 files, a difference of 30.
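As a cross-check on the FIND count, the entries can also be counted by parsing the XML. This is a minimal sketch, assuming the sitemap is well-formed XML; the sample text is invented for illustration (the second URL is hypothetical) and the function name count_sitemap_urls is my own.

```python
import xml.etree.ElementTree as ET

# Small made-up sitemap; the first entry mirrors the one shown in the article.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://brisray.com/index.htm</loc>
    <lastmod>2023-11-16T07:04:42+00:00</lastmod>
  </url>
  <url>
    <loc>https://brisray.com/example.htm</loc>
  </url>
</urlset>
"""


def count_sitemap_urls(xml_text: str) -> int:
    """Count <loc> elements, whether or not the file declares a namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}loc", so compare
    # only the part after the closing brace.
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "loc")


print(count_sitemap_urls(SAMPLE))  # 2
```

Unlike a line count, this counts elements, so it would still be right if several entries shared one line.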
Counting HTML files
I wanted to find the total number of HTML files on the site and the number of HTML files in the /test folder. To do this I used a batch file:
@echo off
for /f %%a in ('dir websites\brisray\*.htm* /s /b 2^> nul ^| find "" /v /c') do set allhtm=%%a
for /f %%a in ('dir websites\brisray\test\*.htm* /s /b 2^> nul ^| find "" /v /c') do set testhtm=%%a
set /A myhtm=%allhtm%-%testhtm%
echo %allhtm% - %testhtm% = %myhtm%
Which gave:
4110 - 224 = 3886
This means the site has a total of 4,110 html or htm files. The test folder, which Google should not be indexing, contains 224 of them. Although the sitemap.xml file contains only 996 entries, the calculation in the batch file shows that Google could be indexing 3,886.
How it works
The for /f loop runs the dir command and reads its output. The /s switch recurses through all the subfolders and /b gives just a bare list of anything found. Error messages are sent to stream 2, which in this case is redirected to nul. The dir command looks for all htm or html files using the wildcard pattern *.htm*. The ^ characters escape the > and | symbols so that they are passed through to the command inside the loop.
The output of the dir command is piped to the find command. Searching for "" with the /v switch selects every line, because no line matches the empty string and /v inverts the match; the /c switch then counts those lines. The resulting count is used to set the variable allhtm.
The command line is then repeated to find all the htm and html files in the test folder.
A simple calculation then finds the number of htm and html files not in the test folder; set /A evaluates an arithmetic expression in a batch file.
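The same subtraction can be sketched in Python for anyone not on Windows. This is an illustration rather than a replacement for the batch file: the function names are my own, and it treats "extension starts with .htm" as an approximation of the *.htm* wildcard.

```python
from pathlib import Path


def count_html(folder: Path) -> int:
    """Count files at any depth whose extension starts with .htm."""
    if not folder.is_dir():
        return 0
    return sum(
        1
        for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower().startswith(".htm")
    )


def count_outside_test(site: Path, excluded: str = "test") -> tuple[int, int, int]:
    """Return (all files, files in the excluded folder, the difference)."""
    all_htm = count_html(site)
    test_htm = count_html(site / excluded)
    return all_htm, test_htm, all_htm - test_htm


# Prints 0 - 0 = 0 if the folder does not exist on this machine.
all_htm, test_htm, my_htm = count_outside_test(Path("websites/brisray"))
print(f"{all_htm} - {test_htm} = {my_htm}")
```

As in the batch file, only the top-level test folder is subtracted; a folder named test deeper in the tree would still be counted.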
Using Robocopy
A single line Robocopy command can also be used:
robocopy websites\brisray c:\dummy *.htm* /L /E /NFL /NDL /XD old test
The output of that is:
-------------------------------------------------------------------------------
   ROBOCOPY     ::     Robust File Copy for Windows
-------------------------------------------------------------------------------

  Started : Friday, November 24, 2023 4:00:45 PM
   Source : websites\brisray\
     Dest : c:\dummy\

    Files : *.htm*

 Exc Dirs : old test

  Options : /NDL /NFL /L /S /E /DCOPY:DA /COPY:DAT /R:1000000 /W:30

------------------------------------------------------------------------------

               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :       251       249         2         0         0         0
   Files :      3886      3886         0         0         0         0
   Bytes :  635.43 m  635.43 m         0         0         0         0
   Times :   0:00:00   0:00:00                       0:00:00   0:00:00

   Ended : Friday, November 24, 2023 4:00:45 PM
How it works
c:\dummy - even though the copying of files is stopped by the /L switch, the command still needs a destination folder.
/L - list files only, do not copy or delete them.
/S - include subfolders, excluding empty ones.
/E - include subfolders, including empty ones.
/NFL - specifies that file names are not to be logged.
/NDL - specifies that directory names are not to be logged.
/XD - exclude directories, in this case those named old and test.
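The idea behind /XD, skipping whole folders by name while walking the tree, can be sketched in Python with os.walk. This is only an approximation of what Robocopy does: the function name count_matching is hypothetical, and I am assuming that a folder is excluded when its own name is exactly old or test, at any depth.

```python
import os


def count_matching(root: str, excluded_dirs=("old", "test"), prefix=".htm") -> int:
    """Count files whose extension starts with prefix, skipping excluded folders."""
    excluded = {name.lower() for name in excluded_dirs}
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk descending into those folders.
        dirnames[:] = [d for d in dirnames if d.lower() not in excluded]
        total += sum(
            1
            for name in filenames
            if os.path.splitext(name)[1].lower().startswith(prefix)
        )
    return total


# Prints 0 if the folder does not exist on this machine.
print(count_matching(os.path.join("websites", "brisray")))
```

Note that a folder such as oldphotos is not skipped here, because only exact folder names are compared.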
Counting File types
This section looks at how to get a list of file types and the number of each file type in a set of folders. The Dos Tips forum has some interesting code. I used:
@echo off
setlocal
for /f "delims=" %%f in ('dir /s/b/a-d "websites\brisray"') do set /a %%~xf = %%~xf +1
set .
and the first lines of that returned:
.3gp=1
.aif=4
.asf=1
.avi=16
.bmp=36
I copied the results from the command line into a spreadsheet and split the line at the equals sign and sorted on the count column, which gave:
.jpg 9501
.html 2539
.htm 1571
.png 1441
.gif 845
.webp 356
This gives the total of html and htm files as 4,110 (2,539 + 1,571), which is the same as when we were looking for just those files in the earlier command.
How it works
for /f - loop over each line of output from the command in parentheses
"delims=" - Do not split the string into parts (tokens)
dir /s - recurse into subdirectories
dir /b - Bare format (no heading, file sizes or summary)
dir /a-d - Do not list folders
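set /a - evaluates an arithmetic expression, here adding 1 to a variable named after the file extension
~x - expands the loop variable to just the file extension (for example, .jpg)
set . - lists every variable whose name starts with a dot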
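The counting trick above, one counter per extension, maps directly onto collections.Counter in Python. This is a minimal sketch with a function name of my own choosing; extensions are lower-cased so that .JPG and .jpg are counted together, which is what the batch file does anyway because variable names in cmd are not case sensitive.

```python
from collections import Counter
from pathlib import Path


def count_extensions(folder: Path) -> Counter:
    """Tally files by lower-cased extension, recursing into subfolders."""
    return Counter(p.suffix.lower() for p in folder.rglob("*") if p.is_file())


# Prints nothing if the folder does not exist on this machine.
for ext, count in count_extensions(Path("websites/brisray")).most_common(6):
    print(ext, count)
```

most_common already sorts by count in descending order, so the spreadsheet step is not needed.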
Using PowerShell
A single line PowerShell command can be used to list all the different file types and the number of each:
Powershell -Command "& Get-ChildItem -path websites\brisray -recurse | Where-Object FullName -notmatch 'test|old' | WHERE { -NOT $_.PSIsContainer } | Group Extension -NoElement | Sort Count -Desc"
The first lines produced when running this command are:
Count Name
----- ----
 9232 .jpg
 2532 .html
 1402 .png
 1347 .htm
  792 .gif
  346 .webp
I am not sure why, but this command gives the total number of *.htm* files as 3,879 (2,532 + 1,347) and not the 3,886 obtained previously. There are 7 HTML files missing!
How it works
The command recurses through the folder given by the -path switch. The Where-Object stage tests the FullName of each item, which includes its path, and excludes any that match old or test. The -NOT $_.PSIsContainer test ensures only files are processed, not folders. The next stage groups the files by extension, which also counts them, and the final stage sorts the groups by count in descending order.
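One possible explanation for the different totals, not confirmed here, is that -notmatch 'test|old' is a case-insensitive regular expression applied to the whole path, so it also drops files whose own name merely contains those letters, whereas Robocopy's /XD only skips folders with those names. The Python sketch below contrasts the two rules; the function names and all three paths are hypothetical examples, not files known to be on the site.

```python
import re
from pathlib import PureWindowsPath

PATTERN = re.compile(r"test|old", re.IGNORECASE)


def excluded_by_substring(path: str) -> bool:
    """Mimic -notmatch 'test|old': drop the path if the pattern occurs anywhere."""
    return PATTERN.search(path) is not None


def excluded_by_folder(path: str, names=("test", "old")) -> bool:
    """Drop the path only if one of its folders is named exactly test or old."""
    folders = PureWindowsPath(path).parent.parts
    return any(part.lower() in names for part in folders)


# Hypothetical paths chosen to show where the two rules disagree.
for path in (
    r"websites\brisray\index.htm",
    r"websites\brisray\test\draft.htm",
    r"websites\brisray\news\latest.html",
):
    print(path, excluded_by_substring(path), excluded_by_folder(path))
```

If this is the cause, listing the files that match the pattern but are not inside an old or test folder should turn up the missing seven.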
Sources and Resources
Batch file to show count of different filetypes in a folder (Dos Tips)
Dir (Microsoft Learn)
Dir (SS64)
Dir command (Computer Hope)
Find (Microsoft Learn)
Find (SS64)
Find command (Computer Hope)
For (Microsoft Learn)
For (SS64)
For command (Computer Hope)
How can I exclude multiple folders using Get-ChildItem -exclude? (Stack Overflow)
How to extract a complete list of extension types within a directory? (Superuser) - the accepted answer in this 12 year old post is very slow for large folder structures but it does list all the file types found but without a count of each file type
Multiple -and -or in PowerShell Where-Object statement (Stack Overflow)
Robocopy (Microsoft Learn)
Robocopy.exe (SS64)
Robocopy command (Computer Hope)