Cleaning up a database of
Web server records:
We compiled a database table that contains all HTTP Web server
requests
submitted to a Web site of K-12 engineering curricula (www.teachengineering.com)
for the period April-June, 2005.
Before you start, you need to create this table and insert data by
running this script. This script will create an "action"
table with many many rows in it. The script will take a while to finish. If you need to re-create this table later simply re-run
this script. Use the SQL Server database you used in previous labs.
IMPORTANT NOTE!! Since
this exercise requires some rather large transactions
on the
database (you will remove over 100,000 records), and since this is a
learning exercise, it is very likely that
at some time or other, you break things. You either delete the wrong
things and perhaps even wipe out an entire database table. DON'T PANIC!! If you find yourself in
this situation, you can simply re-create the "action" table.
The database consists of a single table action with
the following fields:
- id:
record id.
- host_ip:
IP address
of the HTTP requester.
- file:
the
file/program requested.
- querystring:
list of
parameters passed to the program (if any).
- thetime:
date and
time of the HTTP request.
This exercise consists of a series of SQL operations that you'll
have to design and execute to explore and clean up these Web server log
entries. It gives you
the opportunity to learn a few things:
- Manipulate relational database tables with SQL.
- Further familiarize yourself with text-driven SQL command
interpreters.
- Learn some of the limits of SQL and how to work around them.
- Become familiar with some of the issues associated with
preparing Web server logs for analysis.
- Resolve IP addresses to host names and vice versa.
- Get a good sense of the (in)adequacy of raw hit counts as a
measure of (meaningful) Web site usage.
Note that one of the things this exercise does not really address is
how to
use SQL to resolve information stored across different tables using
joins. However, in the case study you'll have ample opportunity to work
with joins.
Below you'll find 13 questions and directions to carry out. In
the (hard copy) work
you hand in, clearly indicate the question, the
SQL you used to find its answer and the answer you found.
Please note:
- You MUST write your SQL from scratch.
Do NOT use the
graphics-based facilities in Access to put a
query together and then 'back out' the corresponding SQL as it will be
terribly bloated and you will not receive any credit for it (not to
mention that you don't learn much that way).
- Only SQL
code should be used for this exercise; additional programming in other
languages is neither allowed nor required (you will, however, use some
OS commands to conduct some reverse DNS lookups).
The Questions:
- How many records does the action
table contain?
- How many unique
(different) host_ips
does the action
table
contain?
- Assuming that all OSU-based machines have IP addresses
starting with 128.193...,
how
many hits came from OSU machines?
- Web server logs are very often used to analyze the amount of
use that a Web site gets. However, Web server logs must be cleaned
before
they can be assumed to
properly reflect actual use. For instance, all hits coming from your
development and testing machines must be removed from the log prior to
analysis.
In this case, we want to remove all hits from any of the
schools that
participated in the development of the system (University of Colorado,
Colorado School of Mines,
Duke University, Oregon State University, and Worcester Polytechnic
Institute). For this exercise you
may consider the first two bytes of the IP address (for OSU: 128.193)
to be the (unique and only) IP
signature of the institutions. To find the first two bytes of the
various schools' IPs, issue the following DOS or UNIX/Linux commands:
>ping
www.oregonstate.edu
>ping www.colorado.edu
>ping www.mines.edu
>ping www.duke.edu
>ping www.wpi.edu
Eliminate from the action
table all requests from the participating schools.
Check that after elimination of these records you have
153,147 records left.
- Next, remove from the table all hits coming from OSU's Open
Source Lab (all host_ips
starting
with 140.211... (OSL has its
own connection to the Internet backbone; hence the 140... IP
address).
Check that you have 121,877 records left:
- Make a list of each unique
host_ip
and the total
number of hits it represents in the action
table. Order the
list by number of hits. In your answer only list the ten host_ips
with the largest number of
hits.
- How many one-time visitors do we have? (hint: you may want to
do this using more than one query. We suggest that you first create a
table totals
that contains for each unique host_ip its
total number of hits. You can do that using SQL's
select host_ip, count(*) as
hits into totals... syntax).
Next, you can query the totals
table for those host_ips
that contain a hit count of only 1.
Check that the answer = 3,018.
- Assume that we are not interested in one-time visitors.
Hence, we want to eliminate from the action
table all host_ips
that have only one hit. We give you the SQL for this problem (still
include it in your answers and make sure you understand what's going on
here):
First,we have to add a so-called index to
the totals
table. An index is
essentially a sorted reference (like in the back of a textbook) that
allows the query solver to quickly search through a table:
Create an index: totals_idx
on totals (host_ip);
Next, we can delete from the action
table, all rows for which
the host_ip
has only one hit
in
the totals
table:
SQL: delete from action
where host_ip in (select host_ip from totals where hits = 1);
Check that you have 12,1877 - 3,018 = 118,859 records left:
- Search engines such as
Google, Yahoo and MSN regularly 'spider' or 'crawl' Web sites. Since we
are interested in human, rather than robotic use of the Web site, we
must remove the hits by these 'robots' from the action
table. To do this, you obviously need to know the IPs of these
robots. A list of the most active robots can be found at www.teachengineering.com/awstats
but you can also look at the totals
table to find the biggest hitters and then do a reverse 'DNS lookup' to
see who those heavy hitters are. On Windows, use the nslookup
command; on Unix/Linux,
the host
command to do reverse DNS lookups. You will also see that some robots
use a whole array of IP addresses; e.g.,
livebot
uses 207.46.98.*.
- Find the robots associated with the following heavy
Web site hitters: 64.124.85.*;
207.46.98.*;
66.249.*
and the
number of hits they are responsible for (list the SQL statements):
- Remove the records associated with these robots from the action
table (list the SQL statements).
Check that after the removal we have 68,201 records left.
- Assume that all hits collected on Jun 11, 2005 were part
of a big testing event. Eliminate all of these hits from the logs.
Check that you have 64,906 hits left.
- Now compute the minimum, maximum and average number of hits
per host_ip.
To do this,
first drop and rebuild the totals
table so that we no longer have robot IPs in that table (alternatively,
you can remove the robot IPs from the totals
table):
- Notice that we have again one or more IPs that have only one
hit. How is that possible? Did we not remove those earlier?
- Finally, let's make a time line. How many hits were recorded
in each of the months Jan. through Jul. 2005?