Tools like Google Analytics and Omniture are useful for tracking visitors to your site, but they only capture a portion of the traffic handled by your web servers. Because these tools depend on JavaScript running in the visitor's browser, visits from bots, most of which do not execute JavaScript, go unreported. As a result, your website might be straining under a significant load that stays invisible unless you pay attention to your web server’s logs.
Apache’s log format lends itself to processing with Unix command line tools, making it easy to write command chains that generate statistics on the fly. For instance, let’s assume your log format looks like this:
geektastical.com 66.249.68.196 - - [20/Feb/2011:08:26:25 -0700] "GET /wp-content/uploads/2010/12/51k5fqj0nTL._SL160_.jpg HTTP/1.1" 200 6898 "-" "Googlebot-Image/1.0"
and you’d like to get a report of the number of hits per IP address, sorted by the number of hits. Execute the following command:
$ tail -n 1000 access.log | awk '{print $2}' | sort | uniq -c | sort -g
The output will look like this:
     33 71.164.221.19
     33 96.237.177.225
     36 95.108.217.251
     43 68.84.165.127
     45 69.138.78.119
     46 209.6.119.34
     50 69.143.120.194
     52 174.37.6.115
    190 209.190.3.210
What does that command line do?
The tail -n 1000 takes the last thousand lines of the access log, limiting the analysis to the last thousand hits. We pipe this to awk '{print $2}', which selects the second “field” of each line of text, where fields are defined as text separated by white space. In this log format the first field is the virtual host name, so the second field is the IP address of the requestor. This gives us a list of IP addresses; however, a given IP address may appear many times in the output. What we really need is a list of IP addresses and a count of how many times each appeared. Luckily, uniq -c performs this very function, but uniq only collapses adjacent duplicate lines, so it requires sorted input and we sort the list first. To finish, we’d like the final output to be sorted by the number of requests, so we pipe uniq’s output to sort -g, which sorts numerically and leaves the busiest addresses at the bottom.
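To see the stages working together on a small input, here is a minimal sketch that wraps the pipeline in a shell function and runs it against a six-line sample log. The function name hits_per_ip, the 10.0.0.x addresses, and the ExampleBot user agent are made up for illustration, and the sketch uses sort -n, which for plain integer counts orders the lines the same way as sort -g.

```shell
#!/bin/sh
# hits_per_ip: hypothetical helper, not part of any standard tool.
# $1 = log file, $2 = number of trailing lines to examine (default 1000).
hits_per_ip() {
    tail -n "${2:-1000}" "$1" |  # keep only the most recent lines
        awk '{print $2}' |       # second whitespace-separated field: the IP
        sort |                   # group identical IPs next to each other
        uniq -c |                # collapse each group into "count ip"
        sort -n                  # order by count, smallest first
}

# Build a tiny sample log in the same layout as the example line above.
# The addresses and user agent are invented for this sketch.
sample=$(mktemp)
cat > "$sample" <<'EOF'
geektastical.com 10.0.0.1 - - [20/Feb/2011:08:26:25 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
geektastical.com 10.0.0.2 - - [20/Feb/2011:08:26:26 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
geektastical.com 10.0.0.1 - - [20/Feb/2011:08:26:27 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
geektastical.com 10.0.0.3 - - [20/Feb/2011:08:26:28 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
geektastical.com 10.0.0.3 - - [20/Feb/2011:08:26:29 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
geektastical.com 10.0.0.1 - - [20/Feb/2011:08:26:30 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"
EOF

hits_per_ip "$sample"
rm -f "$sample"
```

Run as written, this lists 10.0.0.2 with one hit, then 10.0.0.3 with two, then 10.0.0.1 with three. The exact amount of padding in front of each count depends on your version of uniq.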
You can run this against a live log file at any time to see who has been hitting your site most recently; re-running it gives a view that is close to real time.
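One way to approximate a real-time view is to re-run the report on a timer. The following is only a sketch under stated assumptions: top_ips and monitor are hypothetical helper names, the defaults (ten addresses, a five-second interval, twelve rounds) are arbitrary, and the log is assumed to use the same layout as the example above. Unlike the command earlier in the article, it sorts in reverse so the busiest addresses appear first.

```shell
#!/bin/sh
# top_ips: hypothetical helper. Prints the busiest IPs among the last
# 1000 lines of a log, busiest first.
# $1 = log file, $2 = how many addresses to show (default 10).
top_ips() {
    tail -n 1000 "$1" | awk '{print $2}' | sort | uniq -c |
        sort -rn | head -n "${2:-10}"
}

# monitor: hypothetical helper. Re-runs top_ips a fixed number of times.
# $1 = log file, $2 = seconds between runs (default 5),
# $3 = number of runs (default 12).
monitor() {
    mon_log=$1
    mon_interval=${2:-5}
    mon_rounds=${3:-12}
    mon_i=0
    while [ "$mon_i" -lt "$mon_rounds" ]; do
        date                      # timestamp each snapshot
        top_ips "$mon_log"
        mon_i=$((mon_i + 1))
        # Pause between snapshots, but not after the final one.
        if [ "$mon_i" -lt "$mon_rounds" ]; then
            sleep "$mon_interval"
        fi
    done
}
```

After loading these definitions into your shell, monitor access.log 5 12 would print a fresh snapshot every five seconds for about a minute. A bounded loop is used here so the sketch always terminates; for an open-ended watch you could raise the number of rounds.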
