Stripping the Logs
The web server logs for one of our websites were taking up 38 GB of space, which is a lot for a six-month period, especially considering that the logs are plain text files. The cause was a website misconfiguration that sent every 404 error into a recursive redirect. Each redirect counts as a separate request, so if the web server can handle, say, 100 requests per second, the log grows very quickly. In some cases we were logging close to 500 MB of data per day for only a few hundred visits.
We could just delete some of the log files, which are organized by date, but then we would lose any useful information inside those files. Since the 404 requests are what took up the most space (and are meaningless because of the misconfiguration), can we simply remove those requests from the existing files? We can, with a little help from the command line.
I’ll show the entire command I used to do this, then break it down and explain.
for old in access_log.*; do echo "Stripping $old..."; sed '/GET \/67\.html/ d' $old > "$old.tmp"; touch --reference=$old "$old.tmp"; mv --reply=yes "$old.tmp" $old; done
First, here’s part of the logfiles folder, if you don’t know what the format is like:
-rw-r--r-- 1 root root  50158246 Oct 26 18:59 access_log.1224979200
-rw-r--r-- 1 root root  98611220 Oct 27 18:42 access_log.1225065600
-rw-r--r-- 1 root root 277836219 Oct 28 18:59 access_log.1225152000
-rw-r--r-- 1 root root 129535573 Oct 29 18:51 access_log.1225238400
-rw-r--r-- 1 root root  67077907 Oct 30 18:44 access_log.1225324800
-rw-r--r-- 1 root root 104450831 Oct 31 18:52 access_log.1225411200
-rw-r--r-- 1 root root  27557856 Nov  1 18:59 access_log.1225497600
-rw-r--r-- 1 root root 169444507 Nov  2 17:59 access_log.1225584000
-rw-r--r-- 1 root root 219142478 Nov  3 17:50 access_log.1225670400
-rw-r--r-- 1 root root 584065598 Nov  4 17:43 access_log.1225756800
-rw-r--r-- 1 root root    217243 Nov  5 14:51 access_log.1225843200
We start with a good ol’ loop. Even though we’re using the wildcard character “*” to select all the logs, we need a loop to store the filename in a variable called “old”:
for old in access_log.*; do [stuff here]; done
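To see how the wildcard and the loop variable interact, here is a minimal sketch that only prints what it would process. The temporary directory and the file names in it are made up for illustration; the `[ -e ... ]` guard is an extra precaution (not part of the original command) for the case where the wildcard matches nothing.

```shell
# Create a scratch directory with a few empty, made-up log files.
workdir=$(mktemp -d)
touch "$workdir/access_log.1224979200" \
      "$workdir/access_log.1225065600" \
      "$workdir/error_log.1224979200"

count=0
for old in "$workdir"/access_log.*; do
    # If nothing matches, the pattern is left as-is; skip that case.
    [ -e "$old" ] || continue
    count=$((count + 1))
    # ${old##*/} strips the directory part, leaving just the file name.
    echo "would process: ${old##*/}"
done
echo "files matched: $count"
```

Only the two access_log files are visited; error_log is ignored because it doesn't match the wildcard.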
The meat of the script is inside the loop, where you just list commands as they would normally appear on the command line, separated by semicolons.
The first command is an echo statement. I threw this in there just to track the progress of the script. In shell scripting you can substitute variables just like you would in PHP and other languages. In fact, here I prefix the “old” variable with a dollar sign inside the double-quoted string, which works exactly the same as in PHP:
echo "Stripping $old..."
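The substitution only happens inside double quotes, again just like PHP. A small sketch of the difference, using one of the file names from the listing above as a sample value:

```shell
old="access_log.1224979200"

# Double quotes: $old is replaced by its value.
double="Stripping $old..."
# Single quotes: the text is taken literally, dollar sign and all.
single='Stripping $old...'

echo "$double"   # prints: Stripping access_log.1224979200...
echo "$single"   # prints: Stripping $old...
```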
Next up is the trickiest command. I use sed (Stream EDitor) to strip out the unwanted lines. Sed is a great utility for modifying text. The text that I used to identify the lines is “GET /67.html” (67.html is the 404 page in this example), here converted into a regular expression: the slash is escaped because slashes delimit the pattern, and the dot is escaped so that it matches a literal dot rather than any character. Sed makes it very easy to delete the line by simply appending the “d” action. Any line that doesn’t match the pattern is passed along unchanged.
The second parameter is the filename for input ($old). Then I use the “>” operator to redirect the output into a new file, which is named the same as the old but with a .tmp extension. Outputting to the same filename while reading from it won’t work, because the shell empties the output file before sed gets a chance to read it, so I put the result in a temporary file for now.
sed '/GET \/67\.html/ d' $old > "$old.tmp"
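Before running this against real logs, it can be worth trying the pattern on a throwaway file. The sketch below builds a tiny sample log (the lines are invented for illustration, not taken from our server), counts the matches with grep, and then runs the same sed command on it.

```shell
# Build a small sample log in a scratch directory (made-up lines).
workdir=$(mktemp -d)
sample="$workdir/access_log.sample"
cat > "$sample" <<'EOF'
10.0.0.1 - - [26/Oct/2008:18:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120
10.0.0.1 - - [26/Oct/2008:18:00:02 +0000] "GET /67.html HTTP/1.1" 302 0
10.0.0.1 - - [26/Oct/2008:18:00:02 +0000] "GET /67.html HTTP/1.1" 302 0
10.0.0.2 - - [26/Oct/2008:18:00:05 +0000] "GET /about.html HTTP/1.1" 200 2048
EOF

# Preview: how many lines would be deleted?
matches=$(grep -c 'GET /67\.html' "$sample")

# Same sed command as in the article, writing to a temporary file.
sed '/GET \/67\.html/ d' "$sample" > "$sample.tmp"

# Count the lines that survived ($(( )) trims any padding from wc).
kept=$(( $(wc -l < "$sample.tmp") ))
echo "matched: $matches, kept: $kept"
```

Note that grep doesn't need the slash escaped, since it has no pattern delimiters; only the dot is escaped there.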
Now, the problem with creating a new temporary file is that the file’s modification date is (logically) set to when you created it. As you can see in the file listing above, all the existing log files have nice timestamps that tell you what date each log refers to. This can be fixed by using the “touch” command to reset the file’s modification time. The “reference” option is used to choose what timestamp to use, grabbing it from the old file. (Note that there should be two plain hyphens before “reference”.)
touch --reference=$old "$old.tmp"
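Here is a small sketch of the timestamp copy on scratch files. It uses “-r”, the short form of “--reference”, and an invented date set with “touch -t” to stand in for an old log file.

```shell
workdir=$(mktemp -d)
old="$workdir/access_log.sample"

# Stand-in for an existing log, backdated to Oct 26 2008, 18:59.
printf 'line one\n' > "$old"
touch -t 200810261859 "$old"

# The new file starts out with the current time...
printf 'line one\n' > "$old.tmp"
# ...until we copy the timestamp over from the old file.
touch -r "$old" "$old.tmp"

# Both files should now show the same modification time.
ls -l "$old" "$old.tmp"
```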
Finally, we want to replace the old file with the new one. For this the simple “mv” command will suffice. I set the “reply” option to bypass the verification question for each file. (Newer versions of GNU mv no longer accept this option; there, “-f” does the same job.)
mv --reply=yes "$old.tmp" $old
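For reuse, the same steps can be written out as a multi-line script. This is only a sketch under a few assumptions of mine, not the command I actually ran: the function name strip_logs and its two arguments are made up, every variable is quoted so odd file names don't cause trouble, the steps are chained with “&&” so a failed sed never overwrites the original, and “mv -f” and “touch -r” stand in for “--reply=yes” and “--reference”.

```shell
#!/bin/sh
# strip_logs DIR PATTERN
# Hypothetical helper: deletes lines matching the sed PATTERN from every
# access_log.* file in DIR, keeping each file's modification time.
strip_logs() {
    dir=$1
    pattern=$2
    for old in "$dir"/access_log.*; do
        [ -f "$old" ] || continue               # wildcard matched nothing
        case $old in *.tmp) continue ;; esac    # skip leftovers from a failed run
        echo "Stripping $old..."
        # Each step runs only if the previous one succeeded.
        sed "/$pattern/ d" "$old" > "$old.tmp" &&
            touch -r "$old" "$old.tmp" &&
            mv -f "$old.tmp" "$old"
    done
}
```

It would be called with the log directory and the escaped pattern, for example: strip_logs /path/to/logs 'GET \/67\.html' (the path here is a placeholder).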
That’s it! So here is what you would see when it runs. It took well over an hour for all the files to be processed.
Stripping access_log.1215734400...
Stripping access_log.1215820800...
Stripping access_log.1215907200...
Stripping access_log.1215993600...
The resulting size? 38 megabytes. That’s a nice 1,000-fold decrease.