Edward Parrish © 2003     

7. Search Engines, Robots and Automation
(Midterm Available)

What We Will Cover


Log Tails

From Last Lab

Quiz Review

If the access log is 500 lines long, how many people have visited your site?

500
250
You usually cannot tell from an access log.
You can always tell, but not by counting the number of lines.


7.1: Search Engines

Objectives

At the end of the lesson the student will be able to:

  • Describe how to create a searchable site
  • On complex sites, search can provide a good method for finding information
  • Users enter one or more words and get a list of likely places to look
  • Search is not a substitute for a good navigation system
  • Rather, search is often the last resort before leaving your site
  • Adding a search engine or service to your site is not difficult and usually worth the effort

7.1.1: Creating Search Words

Keywords

  • Searching relies on the concept of keywords
    • Words with a high level of meaning
  • Keywords are stored in a database along with a link to the page containing the word
  • Searcher clicks the link and jumps to the page
  • Generally easy and fast to look up words in a keyword database

Deciding on Keywords

  • Difficult to decide what words are important and which are not
  • For instance, which of the words on this page are important?
  • Usual approach is to remove words that we are fairly sure are not important
  • All the rest of the words are just stored in the database

Removing Stopwords

  • Stopwords are common words that do not add much meaning to a query
    • For example: "a", "an", "for", "the" and "with"
  • If we are searching for web servers, do we care about the difference between "a web server" and "the web server"?
  • The problem is that their frequency of use tends to make them clutter search results
  • So which words are stopwords?
    • Depends on the language of the user
    • Depends on the application
  • Useful to examine lists of common words to decide which to exclude
  • For English, we can select them from a list of common words
  • My English stopword list is available here
  • Another list: 1000 Most Common Words in English
    • (Caution: About.com site with the Internet's most annoying pop-ups)
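
The keyword and stopword ideas above can be sketched in a few lines of Python. This is only a minimal illustration: the stopword set is the short example list from this page, and the page URLs and text are made up for the demonstration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "for", "the", "with"}

def keywords(text):
    """Split text into lowercase words and drop the stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_index(pages):
    """Map each keyword to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in keywords(text):
            index[word].add(url)
    return index

# Hypothetical pages, used only for this example.
pages = {
    "/servers.html": "Managing a web server with Apache",
    "/search.html": "Adding search for the web site",
}
index = build_index(pages)
print(sorted(index["web"]))   # pages containing the keyword "web"
print("the" in index)         # stopwords never reach the index
```

With the stopwords removed, "a web server" and "the web server" produce the same keywords, so both queries find the same pages.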

7.1.2: Encoding Keywords

  • Keyword search has some inherent problems
    • Some words with the same meaning can have alternate spellings
    • Manie peeple canot spel wel
  • Main approach to this problem is called phonetic encoding
    • Uses the rules of the spoken language to create a phonetic code
  • Phonetic code has a single spelling for words that sound alike
    • For instance: Brittany and Britney
  • Many algorithms exist
  • Simplest is Soundex, dating from the early 1900s (patented in 1918) and used by the Census Bureau to index census records
  • More accurate, and more complex, is the Double Metaphone algorithm
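
As a rough illustration of phonetic encoding, here is a sketch of the common American Soundex rules in Python. It is one reasonable reading of the algorithm, not a reference implementation, and the function name soundex is chosen for this example.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad with zeros, keep four characters

print(soundex("Brittany"), soundex("Britney"))
```

Both names encode to the same value, so a search for one spelling can also find the other.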

7.1.3: Relevance

  • Users want to see the most relevant choices at top of list
  • Determining relevance is largest unsolved problem of keyword search
  • How many searches have you performed where the page you wanted was not at the top of the list?
  • Various techniques exist, but none of them are always satisfactory
  • For more information on relevance, or keyword search in general, see instructor's page here
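
One of the simplest relevance techniques is to count how often the query words appear on each page. The sketch below assumes that approach only for illustration; the function names and sample pages are hypothetical, and real search engines combine many more signals.

```python
import re
from collections import Counter

def score(query, text):
    """Count how often the query words occur in the text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts[word] for word in query.lower().split())

def rank(query, pages):
    """Return URLs with a nonzero score, highest score first."""
    scored = [(score(query, text), url) for url, text in pages.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for points, url in scored if points > 0]

# Hypothetical pages, used only for this example.
pages = {
    "/a.html": "web server setup for a web site",
    "/b.html": "web design",
    "/c.html": "cooking recipes",
}
print(rank("web", pages))
```

Counting words is easy to fool (a page can simply repeat a word many times), which is one reason no single technique is always satisfactory.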

7.1.4: Implementing Search Using Services

  • Easiest method to implement search on your sites is to use a search service
    • No software downloads
    • On-line setup in minutes
  • Search service will index your site and provide queries using their servers
  • Some services offer basic search packages for free
  • Atomz is particularly easy to use after signing up
  • Simply include an HTML form on each web page
  • When you need more capabilities, you can upgrade the service or decide to implement your own
<!-- Atomz Search HTML for Ed Parrish's Class Pages -->
<form method="get" action="http://search.atomz.com/search/">
<input type="text" size="15" name="sp-q">
<input type="submit" value="Go">
<input type="hidden" name="sp-a" value="sp1001a69d">
</form>

    Further Information

  • Atomz: Index up to 500 pages on a weekly basis for free. No ads except Atomz logo.
  • PicoSearch.com: Index up to 1,500 pages for free. May include advertising

7.1.5: Implementing Your Own Search Engine

  • Several reasons to install a search engine on your site:
    • You just want to try it for the fun of it!
    • Complete control over the indexing process
    • Learn firsthand how search engines work and their shortcomings
    • Use that experience to fine-tune your own site
  • Any of these reasons is reason enough
  • Three parts to the process:
    1. Indexing - most time consuming
    2. Designing and implementing a search form
    3. Calling a search engine -- usually a CGI script, but not always
  • Where to find a tool: searchtools.com
  • Note: implementing a search tool would make a good student project.


Lab Exercise 7.1

Use the next 10 minutes to complete the following.

  1. Start a text file named exercise7.txt
    Will be adding to this file during the lesson -- save it often.
  2. Prepare the exercise header as described in the HowTo on submitting exercises
  3. Label this exercise: Lab 7.1
  4. Answer the following questions.

Exercises and Questions

  1. What are the benefits of adding search to your site?
  2. When should you consider implementing your own search engine rather than using a service?
  3. What considerations should you have before setting up a search engine on your site?

7.2: Publicizing Your Site

Objectives

At the end of the lesson the student will be able to:

  • Publicize your site effectively
  • To draw people to your site, you need to publicize it
  • Internet has many inexpensive or free ways to advertise your site
  • Search engines are usually the best method
  • Meta tags are one tool that might influence search engine rankings
  • Before submitting your site to search engines, usually worth the effort to put your meta tags in order

7.2.1: Creating META Tags

  • Meta tags are tags that provide information about your Web page
    • Useful tool but not a magic solution
  • Most important meta tags for search engine indexing are the description and keywords tags
    • Can use to control summary information displayed in some search engines
    • Can provide keywords and descriptions on pages that lack text
  • Description tag returns a description of the page in place of the summary the search engine would ordinarily create
  • <meta name="description" content="CIS164: Introduction to Managing a Web Server">

  • Keywords tag provides keywords for the search engine to associate with your page
  • <meta name="keywords" content="cabrillo,web,servers">
  • Can try to enhance your hit rate by picking commonly used keywords
  • Some sites offer to analyze the tags of your web pages: scrubtheweb.com
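
To see what an indexing program might read from these tags, the following sketch uses Python's standard html.parser module to collect the named meta tags from a page. The class and function names are made up for this example, and actual search engines may treat the tags differently.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            if name and attrs.get("content") is not None:
                self.meta[name.lower()] = attrs["content"]

def read_meta(html):
    """Return a dict mapping meta tag names to their content values."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta

page = """<html><head>
<meta name="description" content="CIS164: Introduction to Managing a Web Server">
<meta name="keywords" content="cabrillo,web,servers">
</head><body></body></html>"""

meta = read_meta(page)
print(meta["description"])
print(meta["keywords"].split(","))
```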

What Do People Search For?

  • Ever wondered what people search for on the Internet?
  • Many search engines will tell you -- for free!
  • See article: What People Search For
  • Particularly well done is The Lycos 50

7.2.2: Submitting to Search Engines

  • To get some visitors to your site you should register with major search engines and directories
  • Directories contain categories of sites and allow users to browse through a hierarchy to find specific categories of interest to them
    • Yahoo! is one of the most popular directories
  • Search engines, or indexes, save the text or keywords from every page of a site to create huge searchable databases
    • Google is one of the most popular search engines
  • Getting your site listed in search engines and directories is not difficult
  • Visit each search engine site and you will find instructions for adding a new site
  • Some services will submit your site to all the major search engines for you

Lab Exercise 7.2

Use the next 10 minutes to complete the following.

  1. Label this exercise: Lab 7.2
  2. Answer the following questions.

Exercises and Questions

  1. Go to some of the major search engines and search for keywords that accurately describe what visitors will find at your site. How many hits are returned by each search engine?
  2. Try the Google AdWords utility to find how your site would appear in ads and what similar keywords you may want to add. Do you think the additional keyword suggestions are useful? Why or why not?
  3. If you were to submit your site to a search engine, how can you tell if a search engine indexes your site?
  4. Besides search engines, what are some other ways to publicize your site on the Internet?

7.3: Robots And Spiders

Objectives

At the end of the lesson the student will be able to:

  • Control how search engines access your site
  • Co-exist with web robots
  • Describe the use of the Robots Meta Tag
  • Getting your site listed in a search engine is usually a two-step process
  • Submit the URL of your home page and other access pages
  • Next, the index site uses what is called a spider to create an index of your site
  • Spiders start at a particular page and follow all the links
  • Spiders stop when they can get to no other links
  • Just like viewing all the pages on your site by clicking on every link
  • Spiders are also known as robots, bots, or crawlers
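
The way a spider walks a site can be sketched without any network access by treating the site as a table of pages and the links on each page. The site layout below is invented for the example; a real spider would download each page and extract the links from its HTML.

```python
from collections import deque

def crawl(start, links):
    """Visit every page reachable from start, following each link once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never revisit a page
                seen.add(target)
                queue.append(target)
    return order  # the crawl stops when no unvisited links remain

# Hypothetical site: each page maps to the pages it links to.
site = {
    "/index.html": ["/about.html", "/labs.html"],
    "/about.html": ["/index.html"],
    "/labs.html": ["/lab7.html"],
    "/lab7.html": [],
    "/orphan.html": ["/index.html"],
}
print(crawl("/index.html", site))
```

Note that /orphan.html is never visited because no page links to it, which is why the access pages you submit matter.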

7.3.1: Using Robots Exclusion Protocol

  • Webmasters can use the robot exclusion protocol to ask robots to limit their searches
  • Not all robots obey the exclusion protocol
  • What types of robots obey the protocol and what types do not?
  • The first document a "well-behaved" spider requests from a site is "robots.txt"
    • Plain-text file in the root of your Web site
  • Should be accessible with a URL like this:
  • http://www.yoursite.com/robots.txt
  • File contains list of user agents (names of robots and spiders) and directories they are not allowed to visit
  • # My robots.txt file
    User-agent: *
    Disallow: /cgi-bin
    Disallow: /development
    Disallow: /beta
    
  • First line is a comment
  • Next we have a User-agent directive where * means all robots
  • Following are disallow directives: folders we do not want robots to index
  • View the robots.txt file for google.com here
  • Further Information: The Robots Exclusion Protocol
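
A well-behaved robot checks these rules before requesting a page. As a sketch, Python's standard urllib.robotparser module can evaluate the example file above; the robot name ExampleBot and the helper function allowed are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this section.
ROBOTS_TXT = """\
# My robots.txt file
User-agent: *
Disallow: /cgi-bin
Disallow: /development
Disallow: /beta
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)

print(allowed("/index.html"))
print(allowed("/cgi-bin/search.cgi"))
```

Nothing forces a robot to perform this check, which is why the protocol only limits robots that choose to obey it.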

7.3.2: Using the Robots Meta Tag

  • Another method to ask robots to limit searches is the Robots Meta Tag
  • Lets page author specify whether a page should be indexed by a search engine
  • <meta name="robots" content="index,nofollow">
  • Allows more control over indexing robots
  • Currently defined directives:
    • [no]index: specifies if an indexing robot should index the page
    • [no]follow: specifies if a robot is to follow links on the page
    • all means both index and follow
    • none means both noindex and nofollow
  • Defaults are index and follow
  • Some examples:
  • <meta name="robots" content="all">
    <meta name="robots" content="index,follow">
    <meta name="robots" content="noindex,follow">
    <meta name="robots" content="index,nofollow">
    <meta name="robots" content="noindex,nofollow">
    <meta name="robots" content="none">
    
  • Web Server Administrator does not need to do anything to support the Robots Meta tag
    • Why not?
  • Further information: The Robots META tag
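
The directive rules above can be expressed as a small Python function that turns a content value into two yes/no answers. This is a sketch of the rules as listed on this page; the function name is made up, and individual robots may interpret unusual values differently.

```python
def robots_directives(content):
    """Return (index, follow) booleans for a robots meta tag content value."""
    index, follow = True, True  # defaults when nothing is specified
    for token in content.lower().split(","):
        token = token.strip()
        if token == "all":
            index, follow = True, True
        elif token == "none":
            index, follow = False, False
        elif token in ("index", "noindex"):
            index = token == "index"
        elif token in ("follow", "nofollow"):
            follow = token == "follow"
    return index, follow

for value in ("all", "noindex,follow", "index,nofollow", "none"):
    print(value, robots_directives(value))
```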

Lab Exercise 7.3

Use the next 10 minutes to complete the following.

  1. Label this exercise: Lab 7.3
  2. Do not submit exercises until all of them from today's lesson are finished
  3. Answer the following questions.

Exercises and Questions

  1. Try viewing the robots.txt files from some of your favorite sites. You may need to try a few sites before finding one that actually has a robots.txt file. What observations can you make by looking at some of these files? (www.ics.uci.edu/robots.txt)
  2. Create a robots.txt file for your site. Exclude any directories that are not quite finished yet. You might also want to exclude any CGI directories. Copy the file into the Lab 7.3 file.
  3. What would a robots.txt file look like that excluded all directories on your site? Why would you want to do this?
  4. What other types of spiders or bots besides search engines might you use or encounter on your Web server?

7.4: Automation

Objectives

At the end of the lesson the student will be able to:

  • Determine how to automate administrative tasks
  • Often have to check things on the server at regular intervals
    • Make sure there is enough disk space
    • Check for errors in the log files
    • Generate reports
    • Perform backups
  • Can often write simple scripts to help you with these jobs
  • Need to know a good scripting language: Perl, VBScript
  • Both UNIX and Windows have useful tools for these tasks
  • We will look at using the UNIX cron command

7.4.1: Using cron

  • UNIX cron is a daemon that starts programs at specific times
  • cron clock daemon runs constantly on a machine and dispatches other processes at scheduled times
  • Checks for jobs to run once every minute
  • crontab is the utility that adds jobs to cron's list
  • Following is an example of a crontab entry
  • 0 7 * * * tail -10 /usr/local/apache2/logs/error_log
  • Returns the last 10 lines in the error log file at 7:00 a.m. every day
  • Since the output is not redirected to a file, cron will email it to the job's owner (if mail is set up)
  • Fields are (by the numbers):
    1. Minute (0-59)
    2. Hour (0-23)
    3. Day of the month (1-31)
    4. Month of the year (1-12)
    5. Day of the week (0-6 with 0 = Sunday)
  • Anything after the fifth field is the command to run
  • Use crontab filename to install the entries listed in the file as your crontab, replacing any existing entries
  • Use crontab -l to list your current crontab entries
  • Use crontab -r to remove your crontab, including all of its jobs
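
To make the field layout concrete, the following Python sketch splits a crontab entry into its five time fields and the command. It only illustrates the layout described above; the function and field names are chosen for this example and it does not validate the field values.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")

def parse_crontab_line(line):
    """Split a crontab entry into a schedule dict and the command to run."""
    parts = line.strip().split(None, 5)  # five fields, then the rest
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    schedule = dict(zip(FIELD_NAMES, parts[:5]))
    return schedule, parts[5]

entry = "0 7 * * * tail -10 /usr/local/apache2/logs/error_log"
schedule, command = parse_crontab_line(entry)
print(schedule)
print(command)
```

Here the minute is 0, the hour is 7, and the three * fields mean every day of the month, every month, and every day of the week.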

Using cron for Automation

  1. Open a terminal emulation window by clicking the icon in the bottom panel
  2. Login as the superuser, if you are not already.
    su -l root
    You will be prompted for the root password
  3. Start Apache, if not already started:
    /usr/local/apache2/bin/apachectl start
  4. Open a file named "checklog" in a text editor
    gedit checklog &
  5. In the checklog file, enter the following (all one line):
    * * * * * /usr/bin/tail -10 /usr/local/apache2/logs/error_log > /root/tail.log
  6. Make certain to have a carriage return at end of line
    • If error_log does not have any entries, select another file
  7. Save the file to /root/checklog
  8. Type the following command, which adds the cron job
    crontab /root/checklog
  9. Verify that the cron job was added using crontab -l
    crontab -l
  10. Wait until the cron job executes (about one minute)
    • Why one minute?
  11. Use ls -l to check if the tail.log file was created
    ls -l
  12. View the file
    cat tail.log
  13. Remove the cron job using crontab -r
    crontab -r
  • What values should be set for the first 5 entries rather than * * * * * ?

7.4.2: Scheduling Tasks on Windows

  • Can use the Scheduled Task Wizard to schedule a task
  • Control Panel => Scheduled Tasks

  • Can also use the At command
  • Acts much like the UNIX cron command

For example:

The following at command causes the log.bat script in c:\scripts to run every Thursday at 7:00 a.m. on an NT/2000/XP machine:

C:\> at 07:00 /every:Thursday "c:\scripts\log.bat"
  • Note the /every flag
  • Use the /next flag to run a job only once
  • More information: see At

Lab Exercise 7.4

Use the next 10 minutes to complete the following.

  1. Label this exercise: Lab 7.4
  2. Answer the following questions.

Exercises and Questions

  1. What tools are available for automating your Web server?
  2. What server administration tasks can easily be automated (or somehow assisted by the computer)?

Wrap Up

  • When class is over, please shut down your computer
    Main Menu => Logout => Shut Down
  • Due Next: N/A

  • You may complete unfinished exercises at any time before the next class.
  • Be sure to submit the file to the instructor before the beginning of the next class to receive credit.
  • Instructions on submitting exercises are available from the HowTo's page.


Last Updated: 7/16/2003 4:45:40 PM