Web Log Files


These pages describe the logs that the web server running this site collects, and how I process and view them. Server? Don't imagine a rack filled with massively expensive, powerful, fast blade or edge computers. The computer you are viewing this page on is probably newer, faster and more powerful than the 2016 Dell Inspiron 3847 desktop that hosts these pages.

Personal Information: I do not hold any information that makes my visitors easily identifiable, just the normal server logs that get produced every time anyone visits any website. I keep these logs because, even after 20+ years, I'm still pleased with myself that anyone wants to read anything I write. That makes them no more than glorified page counters, used to identify pages that are giving problems and to find my most popular pages so I can write more.

Security: Most articles recommend that information like the following, about the website, how it runs and the technology it uses, never be made public, because it can be used to attack and disrupt the website. As you can see, I have completely ignored that advice.

The server log files are very useful for determining everything that happens to the server, from which resources (pages, documents, images etc.) were requested to what people did to try to break out of the web folders.

Even small web sites such as mine produce a great deal of information in log files. If left unattended and forgotten about, these files can get ginormous. What on earth can you do with a near 10 GB text file? In Windows, natively, barely anything.

directory listing of large log files

Luckily there are tools and techniques around that can manipulate these large files.
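
One such technique is simply to read the file a line at a time rather than opening it whole. The following is a minimal Python sketch, not the method used on this site. It assumes the log is in Apache's "combined" format, and the function names (count_paths, top_pages) are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format. Adjust the pattern if your
# LogFormat differs.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_paths(lines):
    """Count requests per path, consuming one line at a time."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # lines that do not fit the pattern are skipped
            counts[match.group("path")] += 1
    return counts


def top_pages(log_path, n=10):
    """Return the n most requested paths in the log file at log_path."""
    # Iterating over the file object streams it, so even a multi-gigabyte
    # log never has to fit in memory.
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return count_paths(handle).most_common(n)
```

Calling top_pages("access.log") (a hypothetical file name) would return a list of (path, hits) pairs, most requested first.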

Once the files have been manipulated, there needs to be some way of visualizing what they show, because to most people, or at least to me, the logs are just a mass of text.

Part of a log file

These pages were written to show how I manipulate and visualize the server log files.

Web Log Visualizations

From around 2000, I used GoStats on the website. That site stopped working properly in 2018, and I switched to Google Analytics and Microsoft's Clarity.

In March 2022, Google announced that their Universal Analytics was going away and that if you use it, then you should switch to Google Analytics 4 by July 1, 2023. The emails I got from Google explaining this made me reexamine the issue of my logs and what I was doing with them. As well as GoStats, I also used Analog, AWStats and Webalizer to beautify and analyze the server logs. I stopped using them around 2010, but in November 2022, I decided to take a second look at them. Most are now no longer being updated but they still work!

This section looks at some of the old-school web log visualizers and their add-ons, which do things such as reverse DNS lookup to find the host name and IP geolocation (such as MaxMind) to find the country of origin, as well as ways to improve their performance. The programs looked at are Access Log Viewer, Analog, AWStats, Log Parser including Log Parser Lizard and Log Parser Studio, Report Magic, W3Perl, Webalizer, and WebLog Expert Lite.

AWStats provided a handy comparison table of the features offered by itself, Analog and Webalizer.

What to Report

Different people need different things from the logs. Some may need just the human visitors, ignoring all bots. Others may want everything recorded so they can see exactly what is happening to the server.

Reverse DNS Lookup

Apache saves the IP addresses of the computers that visit the site(s) in its logs. A reverse DNS lookup converts the IP address to a host name if possible. Apache ships with a utility, logresolve, that does this for an existing log file. Apache can also do this on the fly using the HostnameLookups directive, but the price is slower response times.
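
To show the idea of resolving addresses after the fact, here is a hedged Python sketch. It is not how logresolve is implemented; it only illustrates the same principle of looking up the first field of each line and caching the answers so each address is queried once. The function names are made up, and the resolver is passed in as a parameter so it can be swapped for a test double.

```python
import socket
from functools import lru_cache


def reverse_dns(ip, resolver=socket.gethostbyaddr):
    """Return the host name for ip, or ip unchanged if the lookup fails."""
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        return resolver(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return ip


def make_cached_lookup(resolver=socket.gethostbyaddr, maxsize=10000):
    """Build a lookup function that remembers previous answers."""
    @lru_cache(maxsize=maxsize)
    def lookup(ip):
        return reverse_dns(ip, resolver)
    return lookup


def resolve_line(line, lookup):
    """Replace the leading address of a log line with its host name."""
    ip, sep, rest = line.partition(" ")
    return lookup(ip) + sep + rest
```

Because the lookups happen offline against a finished log file, slow DNS responses delay only the report, not the visitors.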

IP / DNS Geolocation

It is possible to look up the geolocation of an IP address or host name. I prefer to use a local self-hosted database to do this rather than one of the online services. Here are some free ones that I know of:

db-ip Free IP Geolocation Databases
Geoacumen Country Database
IP2Location Lite Free IP Geolocation Databases
IPInfoDB Free Lite Databases
MaxMind GeoLite2 Free Geolocation Data

I shouldn't need anything more precise than perhaps city or even region information. These free databases are slightly less accurate, and not updated as often, as the commercial products.
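
A local lookup can be as simple as a sorted list of address ranges and a binary search. The sketch below assumes a CSV layout of start address, end address, country code; the real column layout varies between the databases listed above, so check the provider's documentation. The sample rows use reserved documentation addresses and invented country codes (XA, XB), not real geolocation data, and only IPv4 is handled.

```python
import bisect
import csv
import io
import ipaddress

# Invented sample data: documentation address blocks with placeholder codes.
SAMPLE_CSV = """192.0.2.0,192.0.2.255,XA
198.51.100.0,198.51.100.255,XB
"""


class CountryLookup:
    """Map IPv4 addresses to country codes using sorted address ranges."""

    def __init__(self, csv_text):
        rows = []
        for row in csv.reader(io.StringIO(csv_text)):
            if not row:
                continue  # ignore blank lines
            start, end, country = row
            rows.append((int(ipaddress.ip_address(start)),
                         int(ipaddress.ip_address(end)),
                         country))
        rows.sort()
        self._rows = rows
        self._starts = [row[0] for row in rows]

    def country(self, ip):
        """Return the country code for ip, or None if no range contains it."""
        value = int(ipaddress.ip_address(ip))
        # Find the last range whose start is <= the address.
        index = bisect.bisect_right(self._starts, value) - 1
        if index >= 0 and value <= self._rows[index][1]:
            return self._rows[index][2]
        return None
```

With the sample data, CountryLookup(SAMPLE_CSV).country("192.0.2.44") gives the placeholder code for the first range, and an address outside both ranges gives None.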

Counting Bots

Depending on which article you read, non-human bot traffic now makes up between 40% and 70% of website traffic. For a site like mine, good bots are those from things like search engines, and maybe those from internet technology scanners querying the site to see what technology I use. Bad bots would include those skimming content off the site, trawling for email addresses and looking for vulnerabilities.

Bots can cause a degradation in service for the human visitors and skew the website analytics. Good bots such as Googlebot and Bingbot usually identify themselves clearly. Bad bots don't, and can spoof both IP address and user agent string. One way of identifying them is to look for unusual patterns in the server logs, such as high bounce rates and a single address visiting many pages very quickly.
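
Both checks can be sketched in a few lines of Python. This is an illustration only: it assumes Apache's "combined" log format with English month names, lines in chronological order, and arbitrary thresholds. The keyword list and function names are made up, and a user agent keyword match only catches bots that admit to being bots.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Assumed layout: Apache "combined" log format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative keywords only; real bot lists are far longer.
BOT_WORDS = ("bot", "crawler", "spider")


def self_declared_bot(agent):
    """True if the user agent string contains a common bot keyword."""
    agent = agent.lower()
    return any(word in agent for word in BOT_WORDS)


def fast_hosts(lines, max_requests=20, window=timedelta(seconds=10)):
    """Return hosts making more than max_requests requests within window."""
    recent = defaultdict(deque)  # host -> timestamps still inside the window
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        times = recent[match.group("host")]
        times.append(when)
        # Discard requests that are now older than the window.
        while when - times[0] > window:
            times.popleft()
        if len(times) > max_requests:
            flagged.add(match.group("host"))
    return flagged
```

The thresholds need tuning per site: a single page with many images can legitimately produce a burst of requests from one human visitor.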

There are several lists of bad bots such as the ones by dvlop, Mitchell Krog, and Jeff Starr.

This piece on Server Fault shows how to block these bots from even accessing the site, and How to stop automated spam-bots is probably the most plain English article I've seen on using RewriteRules or SetEnvIfNoCase to block these bots.

Self Visits

Something to consider is checking for your own IP address in the server logs, or excluding it from them, so that your own visits do not inflate the figures.
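
If the visits are already in the logs, they can be filtered out before analysis. A minimal Python sketch, assuming the address is the first field on each line; the networks listed are placeholders to be replaced with your own public address and LAN range, and the function names are made up.

```python
import ipaddress

# Placeholder values: substitute your own public IP and local network.
OWN_NETWORKS = [
    ipaddress.ip_network("203.0.113.7/32"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def is_self_visit(line, networks=OWN_NETWORKS):
    """True if the first field of the log line is one of your own addresses."""
    host = line.split(" ", 1)[0]
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a host name or malformed field, so keep the line
    return any(address in network for network in networks)


def without_self_visits(lines, networks=OWN_NETWORKS):
    """Yield only the log lines that did not come from your own addresses."""
    return (line for line in lines if not is_self_visit(line, networks))
```

Note that this only works on logs that still hold IP addresses; once they have been replaced by host names, the filter would need to match on names instead.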

Sources and Resources

List of web analytics software (Wikipedia)
Web analytics (Wikipedia)

This page created July 8, 2021; last modified April 30, 2023