Web Log Files

Introduction

These pages describe the logs that the web server running this site collects, and how I process and view them. Server? Don't imagine some sort of rack filled with massively expensive, powerful, fast Blade or Edge computers. The computer you are viewing this page on is probably newer, faster and more powerful than the 2016 Dell Inspiron 3847 desktop that hosts these pages.

Personal Information: I do not collect any information that makes my visitors easily identifiable, just the normal server logs that get produced every time anyone visits any website. I keep these logs because, even after 20+ years, I'm still pleased with myself that anyone wants to read anything I write. That makes them little more than glorified page counters, used to identify pages that are giving problems and my most popular pages so I can write more.

Security: Most articles recommend that information like the following, about the website, how it runs and the technology used, never be made public, because it can be used to attack and disrupt the website. As you can see, I have completely ignored that advice.

The server log files are very useful for determining everything that happens to the server, from which resources (pages, documents, images etc.) were requested to what people did to try to break out of the web folders.

Even small websites such as mine produce a great deal of information in log files. If left unattended and forgotten about, these files can get ginormous. What on earth can you do with a nearly 10 GB text file? In Windows, natively, barely anything.

[Image: directory listing of large log files]

Luckily there are tools and techniques around that can manipulate these large files.
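As a small example of the technique side, here is a minimal PowerShell sketch, with placeholder paths and search term, that streams a large log file in chunks rather than loading the whole thing into memory:

    # Count log lines containing a term without loading the whole file into memory.
    # The log path and the search term are placeholders.
    $hitCount = 0
    Get-Content 'C:\logs\access.log' -ReadCount 5000 | ForEach-Object {
        # $_ is a block of up to 5000 lines, so the file is streamed in chunks
        $hitCount += ($_ | Select-String -SimpleMatch 'favicon.ico').Count
    }
    "favicon.ico requests: $hitCount"

Reading in blocks like this keeps memory use flat even on multi-gigabyte files.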

Once the files have been manipulated, there needs to be some way of visualizing what they are showing you, because to most people, me included, the logs are just a mass of text.

[Image: part of a log file]

These pages were written to show how I manipulate and visualize the server log files.

Web Log Visualizations

From around 2000 I used GoStats on the website. That service stopped working properly in 2018, and I switched to Google Analytics and Microsoft's Clarity.

In March 2022, Google announced that Universal Analytics was going away and that anyone using it should switch to Google Analytics 4 by July 1, 2023. The emails I got from Google explaining this made me re-examine my logs and what I was doing with them. As well as GoStats, I also used Analog, AWStats and Webalizer to beautify and analyze the server logs. I stopped using them around 2010, but in November 2022 I decided to take a second look at them. Most are no longer being updated, but they still work!

This section looks at some of the old-school web log visualizers and their add-ons, which do things such as DNS lookups to find the internet address and the country of origin (IP geolocation, for example with MaxMind), and at ways to improve their performance. The programs looked at are Access Log Viewer, Analog, AWStats, Log Parser (including Log Parser Lizard and Log Parser Studio), Report Magic, W3Perl, Webalizer, and WebLog Expert Lite.

AWStats provided a handy comparison table of the features offered by itself, Analog and Webalizer.

Pat Research did reviews of these "old-school" free web log analyzers and their scores are summarized below.

Log Analyzer Scores

Program      Editor Rating   User Rating
Analog       7.5             8.7
AWStats      7.7             8.0
W3Perl       7.6             8.8
Webalizer    7.7             6.6

Accuracy of the Reports

Different people need different things from the logs. Some may need just the human visitors, ignoring all bots. Others may want everything recorded so they can see exactly what is happening to the server.

Some things that most of the stats programs allow are:

Bots, Crawlers and Spiders

According to which article you read, non-human bot traffic now makes up between 40% and 70% of website traffic. For a site like mine, good bots are things like search engine crawlers, and perhaps the internet technology scanners querying the site to see what technology I use. Bad bots include those skimming content off the site, trawling for email addresses, and those looking for vulnerabilities.

Bots can cause a degradation in service for human visitors and skew the website analytics. Good bots such as Googlebot and Bingbot usually identify themselves clearly; bad bots don't, and can spoof both the IP address and the user agent string. One way of identifying them is to look for unusual patterns in the server logs, such as high bounce rates or a single address visiting many pages very quickly.
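As a rough illustration of that last point, and assuming the Apache common/combined log format where the client IP is the first field, this PowerShell sketch lists the busiest addresses in a log (the path is a placeholder):

    # Group an access log by client IP (the first field in the common/combined
    # formats) and show the ten busiest addresses - heavy hitters are often bots.
    Get-Content 'C:\logs\access.log' |
        ForEach-Object { ($_ -split ' ')[0] } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object -First 10 Count, Name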

There are several lists of bad bots such as the ones by dvlop, Mitchell Krog, and Jeff Starr.

This piece on Server Fault shows how to block these bots from even accessing the site, and How to stop automated spam-bots is probably the most plain-English article I've seen on using RewriteRules or SetEnvIfNoCase to block them.

It may be best to split them into groups where possible, such as search bots, bad bots and all other bots - those that crawl the site gathering information for research and overall web statistics.
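A minimal sketch of that grouping in PowerShell, assuming the combined log format (user agent as the last quoted field) and using a few illustrative bot names rather than a complete list:

    # Rough grouping of log lines by user agent (the last quoted field in the
    # combined log format). The bot name patterns are examples only; names in
    # $badBots are ones that appear on typical block lists, not a verdict.
    $searchBots = 'Googlebot|Bingbot|DuckDuckBot|YandexBot'
    $badBots    = 'AhrefsBot|MJ12bot|SemrushBot'
    $groups     = @{ Search = 0; Bad = 0; OtherBots = 0; NotBots = 0 }

    Get-Content 'C:\logs\access.log' | ForEach-Object {
        $ua = ($_ -split '"')[-2]                       # user agent string
        if     ($ua -match $searchBots)        { $groups['Search']++ }
        elseif ($ua -match $badBots)           { $groups['Bad']++ }
        elseif ($ua -match 'bot|crawl|spider') { $groups['OtherBots']++ }
        else                                   { $groups['NotBots']++ }
    }
    $groups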

It is worth noting that bad bots are going to ignore anything in robots.txt.

Common Files

I keep a lot of ancillary files in a common folder; I called it "common" because it is. Entries from this folder can be removed from the log statistics. Another file that appears in the statistics is the favicon; that can go as well, as it should be on every page anyway.
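Each analyzer has its own exclude settings, but one approach that works for all of them is to filter the log before it is processed. A minimal sketch, with placeholder paths:

    # Remove requests for the shared "common" folder and the favicon from a log
    # before handing it to the analyzers. Paths are placeholders.
    Get-Content 'C:\logs\access.log' |
        Where-Object { $_ -notmatch '/common/' -and $_ -notmatch 'favicon\.ico' } |
        Set-Content 'C:\logs\access_filtered.log'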

Newer File Types

Analog was written in 1995; I use Analog CE v6.0.17, published in June 2021. AWStats was written in 2000; I use v7.9, published in January 2023. Webalizer was written in 1997; I use Stone Steps Webalizer v6.2.1, published in October 2022. Although I am using the latest versions, the multimedia file lists these programs use do not seem to have changed much since they were first written, so I added webp, webm and svg to the list of file types to be counted as hits but not page views.
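Before editing those lists it is worth checking how often the newer types actually turn up. A quick, rough count in PowerShell (the log path and extension pattern are placeholders):

    # Count requests for the newer image and video types, grouped by extension.
    Get-Content 'C:\logs\access.log' |
        Select-String -Pattern '\.(webp|webm|svg) ' |
        ForEach-Object { $_.Matches[0].Groups[1].Value.ToLower() } |
        Group-Object |
        Sort-Object Count -Descending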

IP / DNS Geolocation

It is possible to look up the geolocation of an IP or host address. I prefer to use a local, self-hosted database to do this rather than one of the online services. Here are some free ones that I know of:

db-ip Free IP Geolocation Databases
Geoacumen Country Database
IP2Location Lite Free IP Geolocation Databases
IPInfoDB Free Lite Databases
MaxMind GeoLite2 Free Geolocation Data

I shouldn't need anything more precise than city or perhaps region level information. These free databases are slightly less accurate and not updated as often as the commercial products.
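As an example of what a local lookup can look like, and assuming the mmdblookup tool from libmaxminddb is available (on Windows that probably means WSL or a ported build) and a GeoLite2 country database has already been downloaded, a single query is something like:

    # Look up the country for one address in MaxMind's GeoLite2 database.
    # Assumes mmdblookup is installed and on the path; the database path and
    # test address are placeholders.
    $ip = '8.8.8.8'
    & mmdblookup --file 'GeoLite2-Country.mmdb' --ip $ip country names en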

Reverse DNS Lookup

Apache saves the IP addresses of the computers that visit the site(s) in its logs. A reverse DNS lookup converts an IP address to a host name where possible. Apache can do this itself using logresolve, and can also do it on the fly using the HostnameLookups directive, but there is a price to pay in delayed response times.
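logresolve works as a filter, reading a log on standard input and writing the resolved copy to standard output. The same sort of thing can be sketched in PowerShell using the .NET DNS class; lookups that fail just keep the original address (the log path is a placeholder):

    # Reverse-resolve the unique client IPs in a log using the .NET DNS class.
    $ips = Get-Content 'C:\logs\access.log' |
        ForEach-Object { ($_ -split ' ')[0] } |
        Sort-Object -Unique
    foreach ($ip in $ips) {
        try   { $name = [System.Net.Dns]::GetHostEntry($ip).HostName }
        catch { $name = $ip }
        "$ip`t$name"
    }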

Self Visits

Something to consider is checking for, or excluding, your own IP address in the server logs. I could remove entries that come from my local machines to the server. Home local area networks (LANs) nearly always use IP addresses from 192.168.0.0 to 192.168.255.255.
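A minimal sketch of stripping those entries out before analysis (paths are placeholders):

    # Drop my own LAN visits (192.168.x.x addresses) before analysing the log.
    Get-Content 'C:\logs\access.log' |
        Where-Object { $_ -notmatch '^192\.168\.' } |
        Set-Content 'C:\logs\access_noself.log'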

Analog

Try to shorten the giant lists it produces for some categories. Can some of them be grouped? See what Trucksess did to their statistics for inspiration about what can be done. One thing that appears to be missing from Analog is a report of the page names served - did I miss something in the documentation?

Webalizer

Automate the running of it. Can it be run daily? If so, the yearly list table, including the totals, will also need to be updated. When the program is run in January to update the December statistics, the table will need a new row, the totals will need recalculating, and a new output directory will need to be made. If this is not done properly, Webalizer will start overwriting the previous month's files. When doing this in PowerShell, it may be better to build a new command line rather than trying to rewrite Webalizer's configuration files.
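A sketch of what that command line might look like, assuming the Stone Steps build still accepts the classic -c (configuration file) and -o (output directory) switches; all paths are placeholders:

    # Run Webalizer into a per-month output folder so a new month never
    # overwrites the previous one. Paths and switches are assumptions based on
    # the classic Webalizer command line.
    $month  = Get-Date -Format 'yyyy-MM'
    $outDir = "C:\stats\webalizer\$month"
    if (-not (Test-Path $outDir)) { New-Item -ItemType Directory -Path $outDir | Out-Null }
    & 'C:\webalizer\webalizer.exe' -c 'C:\webalizer\webalizer.conf' -o $outDir 'C:\logs\access.log'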

Once these tasks have been identified and written, is it worth reindexing the old log files with the analyzers? If so, it has to be automated via a batch file or PowerShell script. I expect a big reduction in the numbers produced by the analyzers, as well as the programs taking longer to process the log files.
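If the whole run is wrapped in a script, Windows can schedule it with the built-in ScheduledTasks cmdlets. The script path and time below are placeholders, and registering a task needs an elevated prompt:

    # Register a daily 02:00 run of the log-processing script.
    $action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
               -Argument '-NoProfile -File C:\scripts\process-logs.ps1'
    $trigger = New-ScheduledTaskTrigger -Daily -At 2am
    Register-ScheduledTask -TaskName 'WebLogStats' -Action $action -Trigger $trigger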

Sources and Resources

Comparison table of the features offered by Analog, AWStats, and Webalizer
Free Web Log Analysers and Web Statistics (The Free Country)
List of web analytics software (Wikipedia)
Overview of Web Site Traffic Analysis Tools (PlanetMike) - Reviews of several "old-school" web log analyzers by Mike Clark
Top Free Web Analytics Software (Pat Research) - Reviews of several "old-school" web log analyzers
Web analytics (Wikipedia)