Cleaning up a database of
Web server records:
We compiled a MySQL database that contains all HTTP Web server
requests
submitted to a Web site of K-12 engineering curricula (www.teachengineering.com)
for the period April-June, 2005.
MySQL is an industry-strength Open Source database management
system that runs as a server application that manages individual
databases. Users connect to these databases through a client program
that runs on their
local machine.
From
any machine in the COB computer labs you can establish an SQL
connection to your MySQL database using the SQLyog MySQL client software
(freely downloadable from www.webyog.com).
Each of you have your own database to interact with. The name of the database consists
of the string teWeblog followed by an underscore
followed by your COB login. For instance, if your COB login is foobar123,
your database is teweblog_foobar123.
To interact with your database, you must use a special login and
password. The
login is the same as your COB login; the password consists of that
login followed by an underscore followed by the last four digits of
your
student_id. For instance, if your COB login is foobar123
and the last 4 digits of
your student id are 7890, your
password would be foobar123_7890.
To start interacting with the database, start the SQLyog application and select the Connect...
option on the File
menu. Fill out the following
fields in the interface:
- MySQL Host Address: fumes.bus.oregonstate.edu
- User Name: your COB
login
- Password: your
password as explained above.
- Port: 3306
- Database(s): the
database name as explained above.
Once connected, you can type SQL queries in the SQL query
window at the top right corner of the interface. All
queries typed in the SQL query window must be ended with
a semicolon (;). Once you've typed a query, you can right-click it and
have it executed.
IMPORTANT NOTE!! Since
this exercise requires some rather large transactions
on the
database (you will remove over 100,000 records), and since this is a
learning exercise,it is very likely that
at some time or other, you break things. You either delete the wrong
things and perhaps even wipe out an entire database table. DON'T PANIC!! If you find
yourself in this situation, do the following:
-
Clean out your database by
dropping any table that you have left (use SQL's drop table... command).
-
Import into the database the file
action_table.sql
using the Import
from SQL Statements...
option on the Tools
menu.
This file contains the SQL to recreate your database as it was at the
start of the exercise.
-
If all of this fails, see your
instructor for help.
The database consists of a single table action with
the following fields:
- id:
record id.
- host_ip:
IP address
of the HTTP requester.
- file:
the
file/program requested.
- querystring:
list of
parameters passed to the program (if any).
- thetime:
date and
time of the HTTP request.
To see the structure of the table, use the SQL describe
command:
describe action;
This exercise consists of a series of SQL operations that you'll
have to design and execute to explore and clean up these Web server log
entries. It gives you
the opportunity to learn a few things:
- Manipulate relational database tables with SQL.
- Further familiarize yourself with text-driven SQL command
interpreters.
- Learn some of the limits of SQL and how to work around them.
- Become familiar with some of the issues associated with
preparing Web server logs for analysis.
- Resolve IP addresses to host names and vice versa.
- Get a good sense of the (in)adequacy of raw hit counts as a
measure of (meaningful) Web site usage.
Note that one of the things this exercise does not really address is
how to
use SQL to resolve information stored across different tables using
joins. However, in the case study you'll have ample opportunity to work
with joins.
Below you'll find 13 questions and directions to carry out. In
the (hard copy) work
you hand in, clearly indicate the question, the
SQL you used to find its answer and the answer you found.
Please note:
- You MUST write your SQL from scratch.
Do NOT use the
graphics-based facilities in Access to put a
query together and then 'back out' the corresponding SQL as it will be
terribly bloated and you will not receive any credit for it (not to
mention that you don't learn much that way).
- Only SQL
code should be used for this exercise; additional programming in other
languages is neither allowed nor required (you will, however, use some
OS commands to conduct some reverse DNS lookups).
The Questions:
- How many records does the action
table contain?
- How many unique
(different) host_ips
does the action
table
contain?
- Assuming that all OSU-based machines have IP addresses
starting with 128.193...,
how
many hits came from OSU machines?
- Web server logs are very often used to analyze the amount of
use that a Web site gets. However, Web server logs must be cleaned
before
they can be assumed to
properly reflect actual use. For instance, all hits coming from your
development and testing machines must be removed from the log prior to
analysis.
In this case, we want to remove all hits from any of the
schools that
participated in the development of the system (University of Colorado,
Colorado School of Mines,
Duke University, Oregon State University, and Worcester Polytechnic
Institute). For this exercise you
may consider the first two bytes of the IP address (for OSU: 128.193)
to be the (unique and only) IP
signature of the institutions. To find the first two bytes of the
various schools' IPs, issue the following DOS or UNIX/Linux commands:
>ping
www.oregonstate.edu
>ping www.colorado.edu
>ping www.mines.edu
>ping www.duke.edu
>ping www.wpi.edu
Eliminate from the action
table all requests from the participating schools.
Check that after elimination of these records you have
153,147 records left.
- Next, remove from the table all hits coming from OSU's Open
Source Lab (all host_ips
starting
with 140.211... (OSL has its
own connection to the Internet backbone; hence the 140... IP
address).
Check that you have 121,877 records left:
- Make a list of each unique
host_ip
and the total
number of hits it represents in the action
table. Order the
list by number of hits. In your answer only list the ten host_ips
with the largest number of
hits.
- How many one-time visitors do we have? (hint: you may want to
do this using more than one query. We suggest that you first create a
table totals
that contains for each unique host_ip its
total number of hits. You can do that using SQL's create table
totals as select... syntax).
Next, you can query the totals
table for those host_ips
that contain a hit count of only 1.
Check that the answer = 3,018.
- Assume that we are not interested in one-time visitors.
Hence, we want to eliminate from the action
table all host_ips
that have only one hit. Since
doing this requires some 'special' setup in MySQL (because of the size
of the tables involved), we give you the SQL for this problem (still
include it in your answers and make sure you understand what's going on
here):
First,we have to add a so-called index to
the totals
table. An index is
essentially a sorted reference (like in the back of a textbook) that
allows the query solver to quickly search through a table:
SQL: create index
totals_idx on totals (host_ip);
Next, we can delete from the action
table, all rows for which
the host_ip
has only one hit
in
the totals
table:
SQL: delete from action
where host_ip in (select host_ip from totals where hits = 1);
Check that you have 12,1877 - 3,018 = 118,859 records left:
- Search engines such as
Google, Yahoo and MSN regularly 'spider' or 'crawl' Web sites. Since we
are interested in human, rather than robotic use of the Web site, we
must remove the hits by these 'robots' from the action
table. To do this, you obviously need to know the IPs of these
robots. A list of the most active robots can be found at www.teachengineering.com/awstats
but you can also look at the totals
table to find the biggest hitters and then do a reverse 'DNS lookup' to
see who those heavy hitters are. On Windows, use the nslookup
command; on Unix/Linux,
the host
command to do reverse DNS lookups. You will also see that some robots
use a whole array of IP addresses; e.g.,
livebot
uses 207.46.98.*.
- Find the robots associated with the following heavy
Web site hitters: 64.124.85.*;
207.46.98.*;
66.249.*
and the
number of hits they are responsible for (list the SQL statements):
- Remove the records associated with these robots from the action
table (list the SQL statements).
Check that after the removal we have 68,201 records left.
- Assume that all hits collected on Jun 11, 2005 were part
of a big testing event. Eliminate all of these hits from the logs.
Check that you have 64,906 hits left.
- Now compute the minimum, maximum and average number of hits
per host_ip.
To do this,
first drop and rebuild the totals
table so that we no longer have robot IPs in that table (alternatively,
you can remove the robot IPs from the totals
table):
- Notice that we have again one or more IPs that have only one
hit. How is that possible? Did we not remove those earlier?
- Finally, let's make a time line. How many hits were recorded
in each of the months Jan. through Jul. 2005?