Edward Parrish © 2003     

6. Log Files

What We Will Cover


Log Tails

From Last Lab

Quiz Review


6.1: Log File Formats

Objectives

At the end of the lesson the student will be able to:

  • Describe How to Configure Logging
  • Describe How to Read Different Server Log Files
  • Configure Apache to Log Transactions

6.1.1: About Log Files

  • Important to get feedback about the activity and performance of a server
    • Need to know about any problems developing
    • Want to know about what resources are being requested
  • All good web servers allow the system administrator to configure logging
  • Even busy servers can enable logging without suffering a noticeable performance loss
    • Most requests create a single line in a file
    • Not computationally intensive
  • However, log files can grow very large
  • Must make sure that they do not fill all the free space on your hard drive
  • Common practice: put log files on a separate drive or partition -- why?
  • Another method: rotate (rename and remove) log files periodically
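
The rotation idea above can be sketched in a few lines of POSIX shell. This is only an illustration: rotate_log, its arguments, and the date-suffix naming are assumptions, not part of Apache. A running server keeps the renamed file open, so in practice the rename is followed by a server restart, which also creates a fresh log file.

```shell
#!/bin/sh
# rotate_log LOG KEEP STAMP  (hypothetical helper, not an Apache tool)
#   LOG   path of the log file to rotate
#   KEEP  number of rotated copies to keep
#   STAMP suffix for the renamed file, e.g. "$(date +%Y%m%d)"
rotate_log() {
    log=$1
    keep=$2
    stamp=$3
    [ -f "$log" ] || return 1
    mv "$log" "$log.$stamp"          # rename the current log
    # Rotated names sort by date, so list newest first,
    # skip the first $keep of them, and remove the rest.
    ls "$log".* | sort -r | tail -n +"$((keep + 1))" | while read -r old; do
        rm -f "$old"
    done
}

# Demonstration in a scratch directory with made-up file names
dir=$(mktemp -d)
echo "sample entry" > "$dir/access_log"
: > "$dir/access_log.20030601"
: > "$dir/access_log.20030701"
rotate_log "$dir/access_log" 2 20030716
ls "$dir"    # the two newest rotated copies remain
```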

Log File Formats

  • Most servers support at least two common log file formats:
    • Common Logfile Format (CLF)
    • Extended Logfile Format (ELF)
  • Will first look at both of these formats
  • Then will examine error logs
  • Afterwards, will look at how to configure our servers for logging

6.1.2: The Common Logfile Format

  • NCSA and CERN Web servers first used the Common Logfile Format
  • Many current Web servers now support this format including Apache and IIS
  • Each line in the file represents a single request
  • Each line has seven fields in the following order:
remotehost rfc931 authuser [date] "request" status bytes
remotehost
Remote (client) hostname, or IP number if the DNS hostname is not available or if hostname lookups are turned off (HostnameLookups Off in Apache).
rfc931
The remote logname of the user. RFC 1413 (which obsoletes RFC 931) defines a protocol used to determine the identity of a client that requests a resource from the server. It is seldom used on Internet servers because it slows the response of the server. A "-" is entered into the log if the server is unable to determine the userid.
authuser
The username with which the user has authenticated. When authentication is required to access a page, this is the authenticated username. For normal unrestricted requests, this field is just "-".
[date]
Date and time of the request. The date and time are usually saved in the format: DD/MON/YYYY:HH:MM:SS TZ. TZ is the timezone. Since there may be spaces in this field, it is enclosed in brackets for easy parsing.
"request"
The request line exactly as it came from the client. Because the request line contains spaces, this field is enclosed in double quotes, just as the date field is enclosed in brackets.
status
The HTTP status code returned to the client.
bytes
The content-length of the document transferred.
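
As a rough sketch of how the seven fields can be pulled apart, the following uses awk on plain Common Logfile Format lines. The function name clf_fields is made up for illustration. It assumes no Referer or User-Agent fields follow the byte count, and it takes status and bytes from the end of the line because the date and request fields contain spaces.

```shell
#!/bin/sh
# clf_fields: read CLF lines on stdin and print selected fields by name.
# (hypothetical helper; assumes plain CLF with exactly seven fields)
clf_fields() {
    awk '{
        split($0, q, "\"")       # q[2] is the text between the double quotes
        print "host=" $1, "user=" $3, "status=" $(NF-1), "bytes=" $NF, "request=" q[2]
    }'
}

# Sample line taken from the lab exercise in this lesson
printf '%s\n' 'volvo.vortexwidgets.com - moose [27/May/1999:20:00:52 -0500] "GET /wm103/samples/ HTTP/1.0" 401 61' |
    clf_fields
```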

Additional Fields

  • Some popular log file formats add two additional fields
Referer
The Referer field contains the URL that brought the user to this resource. (discussed in section 6.2)
User-Agent
The user-agent field is a string describing the client that made the request (e.g., Mozilla/4.0).

6.1.3: The Extended Logfile Format

  • Common Logfile Format only logs certain fields
  • Often desirable to log more information or omit certain fields
  • Extended Logfile Format is an extendable format
    • Allows specifying exactly which fields should be logged and in what order
  • Similar format to the Common Logfile Format
    • Each line of the file represents a request
    • Beginning of file also contains some configuration directives
  • Each directive line begins with a #
  • Version and Fields are required and should precede all entries in log
  • Version directive specifies version of Extended Logfile Format to use
  • Fields directive specifies what data to record in the logfile
  • For Example

    #Version: 1.0
    #Fields: date time c-ip sc-bytes time-taken cs-version
    1999-08-01 02:10:57 192.0.0.2 6304 3 HTTP/1.0
    1999-08-01 02:12:41 192.0.0.2 5100 1 HTTP/1.0
    1999-08-01 03:37:19 192.0.0.3 5100 2 HTTP/1.0
    
  • Notice that the Fields directive specifies six fields in the file
    • Date and time are ... date and time
    • c-ip stands for the client IP address
    • sc-bytes is the number of bytes sent from the server to the client
    • time-taken field is the number of seconds it took to send the data
    • cs-version is the version of HTTP used by the client
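
Because the #Fields directive names the columns, a script can look a field up by name instead of by position. The sketch below is one possible way to do that with awk; elf_column and sample_elf are hypothetical names, and the data is the example shown above.

```shell
#!/bin/sh
# elf_column NAME: print the values of field NAME from an Extended Logfile
# Format stream on stdin, using #Fields to find the column (hypothetical helper).
elf_column() {
    awk -v want="$1" '
        /^#Fields:/ {                # remember which column holds the field
            col = 0
            for (i = 2; i <= NF; i++) if ($i == want) col = i - 1
            next
        }
        /^#/ { next }                # skip other directives such as #Version
        col  { print $col }          # data line: print the requested column
    '
}

# The example log from this section
sample_elf() {
    cat <<'EOF'
#Version: 1.0
#Fields: date time c-ip sc-bytes time-taken cs-version
1999-08-01 02:10:57 192.0.0.2 6304 3 HTTP/1.0
1999-08-01 02:12:41 192.0.0.2 5100 1 HTTP/1.0
1999-08-01 03:37:19 192.0.0.3 5100 2 HTTP/1.0
EOF
}

sample_elf | elf_column c-ip                                       # one address per request
sample_elf | elf_column sc-bytes | awk '{ s += $1 } END { print s }'   # total bytes sent
```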

6.1.4: Error Logs

  • Access log files save statistical information about a request
  • Server can also generate messages when errors occur and log those errors to a file
  • Informational messages and debugging information are also often logged to the error log file
  • Error log is useful for:
    • Finding problems with your server
    • Debugging server-side programs (e.g. Perl scripts)
    • Debugging new configuration options
  • Usually control types of messages logged using the LogLevel directive
  • LogLevel warn

6.1.5: Configuring Apache Logging

  • Apache configuration files for our classroom systems located at
  • /usr/local/apache2/conf/
  • Apache log location for our installation:
  • /usr/local/apache2/logs/
  • Log location and configuration is specified in the httpd.conf file
  • To define the location and format of the access logfile use:
  • CustomLog logs/access_log common
  • Note that:
    • File path is relative to the server root
    • "common" refers to Common Logfile Format
  • Nickname "common" is defined by a LogFormat directive, also in the httpd.conf file
  • LogFormat "%h %l %u %t \"%r\" %>s %b" common
  • Each % directive represents a particular field to be logged, such as:
    • %h: remote hostname
    • %l: remote logname
    • %u: remote user
    • %t: date and time
    • %r: first line of the request
    • %>s: status (of the final request, if it was internally redirected)
    • %b: bytes sent
  • Using predefined nicknames, can specify agent and referer logfiles
  • CustomLog logs/referer_log referer
    CustomLog logs/agent_log agent
    
  • Must always stop and start the server after updating httpd.conf
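
Before restarting, it can help to check which logging directives are actually active in httpd.conf. The following is a small sketch using grep; logging_directives is a made-up helper name, and the sample file is a stand-in written to a temporary location rather than the real classroom configuration.

```shell
#!/bin/sh
# logging_directives FILE: list the uncommented logging directives in FILE
# (hypothetical helper; lines starting with # are comments and are skipped)
logging_directives() {
    grep -E '^[[:space:]]*(CustomLog|LogFormat|ErrorLog|LogLevel)[[:space:]]' "$1"
}

# Stand-in configuration built from the directives in this lesson
conf=$(mktemp)
cat > "$conf" <<'EOF'
# CustomLog logs/old_log common
LogLevel warn
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog logs/access_log common
EOF

logging_directives "$conf"    # prints the three active lines, not the comment
```

On the classroom systems the same function could be pointed at /usr/local/apache2/conf/httpd.conf.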

Further Information

  • Log Files: how to use
  • Custom Log Formats: what the % directives mean
  • cronolog.org: for log rotation

    6.1.6: Configuring IIS Logging

    • IIS logfile configuration is available from the Web Site tab of the Properties dialog
    • Select one of the available formats

    • Any of the formats allow you to set general properties
    • Note the location of the log file directory
    • Since the logs are text files, can view the logs in Notepad
    • Can modify the W3C Extended Log File Format using the Extended Properties tab
    • Press the Help button for definitions of the properties

    Lab Exercise 6.1

    Instructions:

    Use the next 10 minutes to complete the following.

    1. Start a text file named exercise6.txt
      Will be adding to this file during the lesson -- save it often.
    2. Prepare the exercise header as described in the HowTo on submitting exercises
    3. Label this exercise: Lab 6.1
    4. Answer the following questions.

    Exercises and Questions

    1. Configure your server to log all requests to a file named access_log using the Common Logfile Format. What configuration options are used?
    2. Configure your server to log the HTTP User-Agent header to a file such as agent_log. What configuration options are used? Access a page on the server; what does this file contain now?
    3. Try to access a page that does not exist on your server. What is recorded in the access log? What is recorded in the error log?

    Consider the following three lines from a log file in Common Logfile Format:

    volvo.vortexwidgets.com - moose [27/May/1999:20:00:52 -0500] "GET /wm103/samples/ HTTP/1.0" 401 61
    volvo.vortexwidgets.com - - [28/May/1999:18:20:03 -0400] "GET /wm102/ HTTP/1.0" 200 4405
    volvo.vortexwidgets.com - - [29/May/1999:10:31:48 -0400] "GET /icons/back.gif HTTP/1.0" 200 216
    
    1. Can you tell which resource required authentication? What is the username of the authenticated user? Did they have access to the requested resource?
    2. What file is returned for the request in the second line? What is the size of the file?

    6.2: Referrers

    Objectives

    At the end of the lesson the student will be able to:

    • Describe how people are getting to your site

    6.2.1: Seeing How People Get to Your Site

    • HTTP request can specify the URL of the page the browser was viewing when the request was made
      • Only if the user clicked on a link in that page
    • Information is sent in the HTTP Referer header
    • Provides way of knowing which web page user is coming from
    • Access log already shows where users are coming from in terms of an IP address
    • Referer header allows us to see what Web page brought them to our site
    • Referer header is generated by the browser

    6.2.2: Referrer Example

    • User finds a link to our site and clicks on it
    • Browser sends a normal HTTP request
    • Request contains information in the HTTP header section like following:
      • Notice how the query is encoded in the URL

      Referer: http://www.google.com/search?hl=en&ie=ISO-8859-1&oe=ISO-8859-1&q=vortex+widgets

    • Can see referrer domain and other information encoded in query
    • Web server typically does not do anything with the Referer header
    • However, can configure the server to write it to a log file
    • CustomLog logs/referer_log referer
    • How might people come to your web site using a link?
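
To see what can be read out of a logged Referer value, here is a short sketch that extracts the referring host and the search terms from the example URL above. The function names are hypothetical, the q parameter is specific to this Google example, and %xx escapes in the query are left undecoded.

```shell
#!/bin/sh
# referer_host: print the host part of each URL on stdin (hypothetical helper)
referer_host() {
    sed -e 's|^[a-z]*://||' -e 's|[/?].*$||'
}

# referer_query: print the value of the q= parameter with + turned into spaces
referer_query() {
    sed -n 's/.*[?&]q=\([^&]*\).*/\1/p' | tr '+' ' '
}

url='http://www.google.com/search?hl=en&ie=ISO-8859-1&oe=ISO-8859-1&q=vortex+widgets'
printf '%s\n' "$url" | referer_host     # www.google.com
printf '%s\n' "$url" | referer_query    # vortex widgets
```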

    Lab Exercise 6.2

    Instructions:

    Use the next 10 minutes to complete the following.

    1. Label this exercise: Lab 6.2
    2. Do not submit exercises until all of them from today's lesson are finished.
    3. Complete the exercises and answer the following questions.

    Exercises and Questions

    1. Configure your server to log referrer information to a log file such as referer_log. What options did you use?
    2. Open a page that has links to other pages on your site and click on some of the links. What shows up in referer_log?
    3. Try linking from another computer in the classroom. What shows up in referer_log now?

    6.3: Being Proactive

    Objectives

    At the end of the lesson the student will be able to:

    • Use log files to help find dead links
    • Describe how to spot suspicious activity
    • Find HTTP 404 -- Not Found log entries

    6.3.1: Finding Problems Using Logs

    • Being proactive means fixing small problems before they become large ones
    • To be proactive, you must actively maintain your site
    • Easiest way to find problems with your site is by analyzing log files
    • Can easily see whenever there might be a problem
    • Examples of common errors
      • Dead links or requests for files that do not exist
      • CGI scripts that do not work properly
      • Permissions problems
    • Dead links make your site look unprofessional
      • What are some possible causes of dead links?
    • Scripts with errors can fill your logs with error messages
      • CGI scripts with errors are logged
      • Useful resource when debugging server-side scripts

    6.3.2: Finding CGI Script Errors

    • Common errors with CGI scripts:
      • Missing Content-Type header
      • Incorrectly forming HTTP header section of response
    • Many times the script runs just fine when tested manually on the server
    • When user tries to access the script from a browser they receive an HTTP 500 Internal Server Error
    • Server error log could look like this:
    • [Mon Apr 12 15:06:53 1999] [error] Premature end of script headers:
      /export/home/paivam/public_html/test6.cgi

    • Premature end of script headers means the header section of response was not formed correctly
    • Syntax error with the script might show following
    • [Mon Apr 12 19:24:21 1999] [error] Premature end of script headers:
      /export/home/patm/public_html/form.cgi
      syntax error at form.cgi line 7, near ") print"
      Execution of form.cgi aborted due to compilation errors.

    • Premature end of script headers occurs because the script did not run far enough to generate header
    • From the log, we can see that line 7 of the script has a problem
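
For comparison, a CGI script that forms its header section correctly can be very small. The sketch below is written as a shell function so it can be run from the command line; the function name and the message text are made up. The two points that matter are the Content-Type line and the blank line that ends the header section.

```shell
#!/bin/sh
# cgi_response: emit a minimal, well-formed CGI response (illustrative only)
cgi_response() {
    printf 'Content-Type: text/plain\n'   # required header
    printf '\n'                           # blank line ends the header section
    printf 'Hello from a CGI script\n'    # body
}

cgi_response
```

Leaving out either of the first two printf lines is the kind of mistake that produces "Premature end of script headers" in the error log.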

    6.3.3: Finding Access Permissions Problems

    • Access permissions are another problem you can see in Web server log files
    • Users forget to give read permission for files or allow execute permission for scripts
    • Password-protected pages log errors if unauthorized users try to access them
    • Can also see how many times user repeatedly enters incorrect passwords to access a page
    • [Sun Apr 18 16:40:40 2001] [error] Permission denied: file permissions deny server access: /export/home/patm/public_html/phonelist.txt
      [Mon Apr 12 19:43:45 2001] [crit] Permission denied: /opt/apache/share/htdocs/wm105/class6/.htaccess pcfg_openfile: unable to check htaccess file, ensure it is readable
      [Mon Aug 9 21:55:53 2001] [error] [client 24.218.82.54] access to /sales/ failed, reason: user ericl not allowed access
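
A quick way to get an overview of an error log is to count entries by severity. This sketch assumes entries begin with a bracketed date followed by a bracketed level, as in the samples above; error_levels is a hypothetical helper name, and the sample messages are shortened versions of the entries shown above.

```shell
#!/bin/sh
# error_levels: read error-log lines on stdin, print "level count" per level
error_levels() {
    sed -n 's/^\[[^]]*\] \[\([a-z]*\)\].*/\1/p' |   # keep only the level name
        sort | uniq -c |                            # count each level
        awk '{ print $2, $1 }'                      # print as "level count"
}

printf '%s\n' \
    '[Sun Apr 18 16:40:40 2001] [error] Permission denied: file permissions deny server access' \
    '[Mon Apr 12 19:43:45 2001] [crit] Permission denied: unable to check htaccess file' \
    '[Mon Aug 9 21:55:53 2001] [error] [client 24.218.82.54] access to /sales/ failed' |
    error_levels
```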


    6.3.4: Finding HTTP 404 -- Not Found Errors

    • Can use UNIX grep command to find information in files
    • grep is a command that finds all lines in a file that contain a certain string
    • Can use grep to search log file for all lines containing a 404 error message
    • grep '" 404' /usr/local/apache2/logs/access_log
    • Searches for a double quote followed by a space, followed by 404
    • To append the output to a file named out.txt (use > instead of >> to overwrite the file):
    • grep '" 404' /usr/local/apache2/logs/access_log >> out.txt
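
To turn the 404 entries into a list of dead links, one option is to print only the requested path and count how often each occurs. The sketch below assumes plain Common Logfile Format with a well-formed request line; not_found_paths is a made-up name and the sample entries are invented for illustration.

```shell
#!/bin/sh
# not_found_paths [FILE...]: print "count path" for every request that got a 404,
# most frequent first (hypothetical helper; reads stdin if no file is given)
not_found_paths() {
    awk '$(NF-1) == 404 { print $(NF-3) }' "$@" |   # status is next to last, path is 3 before it
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'                      # tidy up the spacing
}

# Invented sample entries
printf '%s\n' \
    '192.0.0.2 - - [01/Mar/2003:10:00:00 -0800] "GET /old.html HTTP/1.0" 404 209' \
    '192.0.0.3 - - [01/Mar/2003:10:05:00 -0800] "GET /index.html HTTP/1.0" 200 5100' \
    '192.0.0.3 - - [02/Mar/2003:11:00:00 -0800] "GET /old.html HTTP/1.0" 404 209' \
    '192.0.0.2 - - [02/Mar/2003:11:30:00 -0800] "GET /missing.gif HTTP/1.0" 404 209' |
    not_found_paths
```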

    Lab Exercise 6.3

    Instructions:

    Use the next 5 minutes to complete the following.

    1. Label this exercise: Lab 6.3
    2. Do not submit exercises until all of them from today's lesson are finished
    3. Answer the following questions.

    Exercises and Questions

    1. Find all requests that produced an HTTP 404 -- Not Found error message on your server.
    2. What sort of things should you look for in log files if you suspect that someone is attempting to crack your server?

    6.4: Statistics

    Objectives

    At the end of the lesson the student will be able to:

    • Determine how many people have been visiting your site
    • Use grep to gather statistics
    • Use cut and sort to count unique hosts

    6.4.1: Log File Analysis

    • One statistic people usually want to know is how many people are visiting
    • Looking at a log file can give you a lot of information
    • Some of the useful information you can extract from your logs:
      • Most requested pages
      • Top entry pages (the first page users enter your site through)
      • Information about search engines: most common search engines, common queries, and so forth
      • Top referring sites and URLs
      • Error counts
    • Many programs are available to analyze log files and produce reports
    • Popular free programs include the following
    • More complete list of freeware log analyzer software available here
    • Some commercial programs are
    • Further information: Log Analysis Tools with commentary
    • Note that any of these would make a good student project

    6.4.2: Using grep to Count Hits

    • Can count the number of hits using UNIX tools
    • For this exercise, use example access_log
    • Can use grep to count hits in log file
    • Each line represents a transaction
    • Date and time are recorded on each line
    • To find all the entries from March 2003
    • cd /usr/local/apache2/logs
      grep "Mar/2003" access_log
    • Can pipe results to wc (word count) program
    • grep "Mar/2003" access_log | wc
      -> 418    4180   46311
      
    • Returns number of lines, words and characters
    • May be many transactions per page -- why?
    • Can exclude requests for gif and jpg files using the egrep command with the -v option
    • grep "Mar/2003" access_log | egrep -v 'gif|jpg' | wc
      -> 414    4140   45795
      
    • Divide line count by number of days in month to get the average hits per day
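
The count-and-divide step can be combined in one small function. This is a sketch only: hits_per_day and its arguments are hypothetical, and the sample log is invented. With the figures from this lesson, 418 hits over the 31 days of March works out to about 13.5 hits per day.

```shell
#!/bin/sh
# hits_per_day MONTH DAYS LOG   (hypothetical helper)
#   MONTH  pattern as it appears in the log, e.g. Mar/2003
#   DAYS   number of days in that month
#   LOG    access log file
hits_per_day() {
    month=$1; days=$2; log=$3
    hits=$(grep -c "$month" "$log")      # -c counts matching lines
    awk -v h="$hits" -v d="$days" \
        'BEGIN { printf "%d hits, %.1f per day\n", h, h / d }'
}

# Invented sample log: three entries in March 2003, one in February
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.0.2 - - [27/Feb/2003:09:00:00 -0800] "GET /index.html HTTP/1.0" 200 5100
192.0.0.2 - - [01/Mar/2003:10:00:00 -0800] "GET /index.html HTTP/1.0" 200 5100
192.0.0.3 - - [01/Mar/2003:10:05:00 -0800] "GET /about.html HTTP/1.0" 200 6304
192.0.0.2 - - [02/Mar/2003:11:30:00 -0800] "GET /index.html HTTP/1.0" 200 5100
EOF

hits_per_day Mar/2003 31 "$log"
```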

    6.4.3: Using cut to Count Unique Hosts

    • Can count the number of unique hosts using UNIX tools
    • This problem requires two steps:
      • Get all the hostnames out of the access log
      • Remove duplicate entries
    • Use cut command to extract the first field from the access log
    • cut -d ' ' -f 1 access_log
    • -d option specifies that all fields are separated by spaces
    • -f option specifies that we only want to view the first field
    • Will have duplicates since a host may access more than a single page
    • Can use the sort command with the -u option
    • cut -d ' ' -f 1 access_log | sort -u
    • To get the total number of unique hosts, use the wc command
    • cut -d ' ' -f 1 access_log | sort -u | wc
    • To get a count for a particular month:
    grep "Mar/2003" access_log | cut -d ' ' -f 1 | sort -u | wc
    -> 166     166    2329
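
A small variation on the same pipeline shows how many requests each host made, not just how many hosts there were. The sketch uses uniq -c in place of sort -u; top_hosts is a made-up name and the sample entries are invented. Piping the cut output through sort -u and wc -l instead gives the unique-host count as a single number.

```shell
#!/bin/sh
# top_hosts [FILE...]: print "count host" for each host, busiest first
# (hypothetical helper; reads stdin if no file is given)
top_hosts() {
    cut -d ' ' -f 1 "$@" |       # first field is the remote host
        sort | uniq -c |         # count the entries for each host
        sort -rn |               # largest count first
        awk '{ print $1, $2 }'   # tidy up the spacing
}

# Invented sample entries
printf '%s\n' \
    '192.0.0.2 - - [01/Mar/2003:10:00:00 -0800] "GET /index.html HTTP/1.0" 200 5100' \
    '192.0.0.3 - - [01/Mar/2003:10:05:00 -0800] "GET /index.html HTTP/1.0" 200 5100' \
    '192.0.0.2 - - [02/Mar/2003:11:00:00 -0800] "GET /about.html HTTP/1.0" 200 6304' \
    '192.0.0.2 - - [02/Mar/2003:11:30:00 -0800] "GET /index.html HTTP/1.0" 200 5100' |
    top_hosts
```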
    

    Lab Exercise 6.4

    Instructions:

    Use the next 10 minutes to complete the following.

    1. Label this exercise: Lab 6.4
    2. Do not submit exercises until all of them from today's lesson are finished
    3. Answer the following questions.

    Exercises and Questions

    1. Determine how many hits a site received in a month (e.g. Feb/2003). What is the average number of hits per day?
       Note: can use this example access.log or choose your own
    2. Determine how many unique hosts have visited the site.

    Wrap Up

    • When class is over, please shut down your computer
      Main Menu => Logout => Shut Down
    • Due Next: N/A

    • You may complete unfinished exercises at any time before the next class.
    • Be sure to submit the file to the instructor before the beginning of the next class to receive credit.
    • Instructions on submitting exercises are available from the HowTo's page.


    Last Updated: 7/16/2003 4:45:37 PM