What We Will Cover
Log Tails
From Last Lab
Quiz Review
If the access log is 500 lines long, how many people have visited your site?
7.1: Search Engines
Objectives
At the end of the lesson the student will be able to:
- Describe how to create a searchable site
- On complex sites, search can provide a good method for finding information
- Users enter one or more words and get a list of likely places to look
- Search is not a substitute for a good navigation system
- Rather, search is often the last resort before leaving your site
- Adding a search engine or service to your site is not difficult and usually worth the effort
7.1.1: Creating Search Words
Keywords
- Searching relies on the concept of keywords
- Words with a high level of meaning
- Keywords are stored in a database along with a link to the page containing the word
- Searcher clicks the link and jumps to the page
- Generally easy and fast to look up words in a keyword database
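As a rough illustration of the keyword database idea, the following Python sketch maps each word to the set of pages that contain it. The page paths and text, and the names build_index and lookup, are made up for this example; a real search engine would store the index in a proper database.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def lookup(index, word):
    """Return the sorted list of pages containing the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical pages used only for illustration
pages = {
    "/apache.html": "Installing the Apache web server",
    "/logs.html": "Reading web server log files",
}
index = build_index(pages)
print(lookup(index, "server"))
print(lookup(index, "apache"))
```

Looking up a word is a single dictionary access, which is why keyword lookups tend to be fast.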
Deciding on Keywords
- Difficult to decide what words are important and which are not
- For instance, which of the words on this page are important?
- Usual approach is to remove words that we are fairly sure are not important
- All the rest of the words are just stored in the database
Removing Stopwords
- Stopwords are common words that do not add much meaning to a query
- For example: "a", "an", "for", "the" and "with"
- If we are searching for web servers, do we care about the difference between "a web server" and "the web server"?
- The problem is that their frequency of use tends to make them clutter search results
- So which words are stopwords?
- Depends on the language of the user
- Depends on the application
- Useful to examine lists of common words to decide which to exclude
- For English, we can select them from a list of common words
- My English stopword list is available here
- Another list: 1000 Most Common Words in English
- (Caution: About.com site with the Internet's most annoying pop-ups)
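A minimal sketch of stopword removal in Python, assuming the five-word list from the example above; the function name keywords is made up here, and a practical stopword list would be much longer.

```python
import re

# A tiny stopword list for illustration; a real list would be much longer
STOPWORDS = {"a", "an", "for", "the", "with"}

def keywords(text, stopwords=STOPWORDS):
    """Split text into lowercase words and drop the stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stopwords]

print(keywords("a web server"))
print(keywords("the web server"))
```

Both queries reduce to the same two keywords, so they would return the same results.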
7.1.2: Encoding Keywords
- Keyword search has some inherent problems
- Some words with the same meaning can have alternate spellings
- Manie peeple canot spel wel
- Main approach to this problem is called phonetic encoding
- Uses the rules of the spoken language to create a phonetic code
- Phonetic code has a single spelling for words that sound alike
- For instance: Brittany and Britney
- Many algorithms exist
- Simplest is Soundex, patented in 1918 and later used to index U.S. Census records
- More accurate, and more complex, is the Double Metaphone algorithm
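To make phonetic encoding concrete, here is a sketch of the commonly described American Soundex rules in Python: keep the first letter, replace later consonants with digits, skip vowels, collapse repeated codes, and pad to four characters. It is an illustration under those assumptions, not a reference implementation, and the function name soundex is chosen for this example.

```python
# Build the letter-to-digit table used by Soundex
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]

print(soundex("Brittany"), soundex("Britney"))
```

Both names produce the same code, so a search for either spelling could find pages containing the other.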
7.1.3: Relevance
- Users want to see the most relevant choices at top of list
- Determining relevance is largest unsolved problem of keyword search
- How many searches have you performed where the page you wanted was not at the top of the list?
- Various techniques exist, but none of them are always satisfactory
- For more information on relevance, or keyword search in general, see instructor's page here
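One of the simplest relevance techniques is to count how often the query words appear on each page and list the highest counts first. The Python sketch below shows only that one idea; the pages and the names score and rank are invented for the example, and real engines combine many more signals.

```python
import re
from collections import Counter

def score(text, query_words):
    """Count how many times the query words occur in the text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts[w.lower()] for w in query_words)

def rank(pages, query):
    """Return matching page URLs, most occurrences first."""
    query_words = query.split()
    scored = [(score(text, query_words), url) for url, text in pages.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [url for s, url in ordered if s > 0]

# Hypothetical pages used only for illustration
pages = {
    "/intro.html": "an introduction to cron",
    "/cron.html": "cron runs jobs; use crontab to add cron jobs",
    "/search.html": "adding search to your site",
}
print(rank(pages, "cron"))
```

Counting words is easy to fool (a page can simply repeat a word), which is one reason no single technique is always satisfactory.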
7.1.4: Implementing Search Using Services
- Easiest method to implement search on your sites is to use a search service
- No software downloads
- On-line setup in minutes
- Search service will index your site and provide queries using their servers
- Some services offer basic search packages for free
- Atomz is particularly easy to use after signing up
- Simply include an HTML form on each web page
- When you need more capabilities, you can upgrade the service or implement your own search engine
<!-- Atomz Search HTML for Ed Parrish's Class Pages -->
<form method="get" action="http://search.atomz.com/search/">
<input size=15 name="sp-q">
<input type=submit value="Go">
<input type=hidden name="sp-a" value="sp1001a69d">
</form>
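Because the form uses method="get", the browser sends the field values as a query string appended to the action URL. The Python sketch below builds the URL that the form above would request; the function name build_search_url is made up, and the service itself may no longer respond at this address.

```python
from urllib.parse import urlencode

SEARCH_ACTION = "http://search.atomz.com/search/"
ACCOUNT_ID = "sp1001a69d"  # value of the hidden sp-a field in the form

def build_search_url(query):
    """Return the URL a browser would request when the form is submitted."""
    params = {"sp-q": query, "sp-a": ACCOUNT_ID}
    return SEARCH_ACTION + "?" + urlencode(params)

print(build_search_url("web servers"))
```

The submit button has no name attribute, so it contributes nothing to the query string.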
Further Information
- Atomz: Index up to 500 pages on a weekly basis for free. No ads except Atomz logo.
- PicoSearch.com: Index up to 1,500 pages for free. May include advertising
7.1.5: Implementing Your Own Search Engine
Lab Exercise 7.1
Use the next 10 minutes to complete the following.
- Start a text file named exercise7.txt
You will be adding to this file during the lesson, so save it often.
- Prepare the exercise header as described in the HowTo on submitting exercises
- Label this exercise: Lab 7.1
- Answer the following questions.
Exercises and Questions
- What are the benefits of adding search to your site?
- When should you consider implementing your own search engine rather than using a service?
- What considerations should you have before setting up a search engine on your site?
7.2: Publicizing Your Site
Objectives
At the end of the lesson the student will be able to:
- Publicize your site effectively
- To draw people to your site, you need to publicize it
- Internet has many inexpensive or free ways to advertise your site
- Search engines are usually the best method
- Meta tags are one tool that might influence search engine rankings
- Before submitting your site to search engines, usually worth the effort to put your meta tags in order
7.2.1: Creating META Tags
<meta name="keywords" content="cabrillo,web,servers">
Can try to enhance your hit rate by picking commonly used keywords
Some sites offer to analyze the tags of your web pages: scrubtheweb.com
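To check which keywords a page actually declares, a short script can read them back out of the HTML. This is a sketch using Python's standard html.parser module; the class and function names are invented for the example, and it only looks at the keywords meta tag.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip())

def meta_keywords(html):
    """Return the list of keywords declared in the page's meta tags."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords

page = ('<html><head><meta name="keywords" '
        'content="cabrillo,web,servers"></head></html>')
print(meta_keywords(page))
```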
What Do People Search For?
- Ever wondered what people search for on the Internet?
- Many search engines will tell you -- for free!
- See article: What People Search For
- Particularly well done is The Lycos 50
7.2.2: Submitting to Search Engines
- To get some visitors to your site you should register with major search engines and directories
- Directories contain categories of sites and allow users to browse through a hierarchy to find specific categories of interest to them
- Yahoo! is one of the most popular directories
- Search engines, or indexes, save the text or keywords from every page of a site to create huge searchable databases
- Google is one of the most popular search engines
- Getting your site listed in search engines and directories is not difficult
- Visit each search engine site and you will find instructions for adding a new site
- Some services will submit your site to all the major search engines for you
Lab Exercise 7.2
Use the next 10 minutes to complete the following.
- Label this exercise: Lab 7.2
- Answer the following questions.
Exercises and Questions
- Go to some of the major search engines and search for keywords that accurately describe what visitors will find at your site. How many hits are returned by each search engine?
- Try the Google AdWords utility to find how your site would appear in ads and what similar keywords you may want to add. Do you think the additional keyword suggestions are useful? Why or why not?
- If you were to submit your site to a search engine, how can you tell if a search engine indexes your site?
- Besides search engines, what are some other ways to publicize your site on the Internet?
7.3: Robots And Spiders
Objectives
At the end of the lesson the student will be able to:
- Control how search engines access your site
- Co-exist with web robots
- Describe the use of the Robots Meta Tag
- Getting your site listed in a search engine is usually a two-step process
- Submit the URL of your home page and other access pages
- Next, the index site uses what is called a spider to create an index of your site
- Spiders start at a particular page and follow all the links
- Spiders stop when they can get to no other links
- Just like viewing all the pages on your site by clicking on every link
- Spiders are also known as robots, bots, or crawlers
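The way a spider walks a site can be sketched in a few lines of Python. To keep the example self-contained, the site here is a made-up dictionary mapping each page to the pages it links to; a real spider would download each page and extract the links from its HTML.

```python
from collections import deque

def crawl(links, start):
    """Visit every page reachable from start, following each link once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never visit the same page twice
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical site structure used only for illustration
site = {
    "/index.html": ["/about.html", "/labs.html"],
    "/about.html": ["/index.html"],
    "/labs.html": ["/lab1.html", "/about.html"],
    "/lab1.html": [],
    "/orphan.html": ["/index.html"],  # nothing links here
}
print(crawl(site, "/index.html"))
```

The crawl stops when the queue is empty, that is, when there are no unvisited links left. Note that /orphan.html is never found because no other page links to it.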
7.3.1: Using Robots Exclusion Protocol
- Webmasters can use the robot exclusion protocol to ask robots to limit their searches
- Not all robots obey the exclusion protocol
- What types of robots obey the protocol and what types do not?
- First document a "well-behaved" spider will request from a site is "robots.txt"
- Plain-text file in the root of your Web site
- Should be accessible with a URL like this:
http://www.yoursite.com/robots.txt
File contains list of user agents (names of robots and spiders) and directories they are not allowed to visit
# My robots.txt file
User-agent: *
Disallow: /cgi-bin
Disallow: /development
Disallow: /beta
First line is a comment
Next we have a User-agent directive where * means all robots
Following are disallow directives: folders we do not want robots to index
View the robots.txt file for google.com here
Further Information: The Robots Exclusion Protocol
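To see how a well-behaved robot might apply these rules, the sketch below feeds the example robots.txt to Python's standard urllib.robotparser module and asks whether two URLs may be fetched. The robot name ExampleBot and the page paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# My robots.txt file
User-agent: *
Disallow: /cgi-bin
Disallow: /development
Disallow: /beta
"""

def allowed(robots_txt, agent, url):
    """Return True if the rules permit the agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# ExampleBot is a hypothetical robot name
print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.yoursite.com/index.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.yoursite.com/beta/new.html"))
```

The check is done by the robot, not by the server, which is why robots that ignore the protocol can still read the excluded directories.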
7.3.2: Using the Robots Meta Tag
- Another method to ask robots to limit searches is the Robots Meta Tag
- Lets page author specify whether a page should be indexed by a search engine
<meta name="robots" content="index,nofollow">
Allows more control over indexing robots
Currently defined directives:
[no]index: specifies if an indexing robot should index the page
[no]follow: specifies if a robot is to follow links on the page
all means both index and follow
none means both noindex and nofollow
Defaults are index and follow
Some examples:
<meta name="robots" content="all">
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
<meta name="robots" content="none">
Web Server Administrator does not need to do anything to support the Robots Meta tag
Further information: The Robots META tag
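The directive rules listed in this section can be summarized in a short Python sketch that turns the content value into two yes/no answers. The function name parse_robots_meta is made up, and unknown directives are simply ignored here as an assumption.

```python
def parse_robots_meta(content):
    """Return (index, follow) for the content value of a robots meta tag."""
    index, follow = True, True  # defaults are index and follow
    for directive in content.lower().split(","):
        directive = directive.strip()
        if directive == "all":
            index, follow = True, True
        elif directive == "none":
            index, follow = False, False
        elif directive == "index":
            index = True
        elif directive == "noindex":
            index = False
        elif directive == "follow":
            follow = True
        elif directive == "nofollow":
            follow = False
    return index, follow

print(parse_robots_meta("noindex,follow"))
print(parse_robots_meta("none"))
```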
Lab Exercise 7.3
Use the next 10 minutes to complete the following.
- Label this exercise: Lab 7.3
- Do not submit exercises until all of them from today's lesson are finished
- Answer the following questions.
Exercises and Questions
- Try viewing the robots.txt files from some of your favorite sites. You may need to try a few sites before finding one that actually has a robots.txt file. What observations can you make by looking at some of these files? (www.ics.uci.edu/robots.txt)
- Create a robots.txt file for your site. Exclude any directories that are not quite finished yet. You might also want to exclude any CGI directories. Copy the file into the Lab 7.3 file.
- What would a robots.txt file look like that excluded all directories on your site? Why would you want to do this?
- What other types of spiders or bots besides search engines might you use or encounter on your Web server?
7.4: Automation
Objectives
At the end of the lesson the student will be able to:
- Determine how to automate administrative tasks
- Often have to check things on the server at regular intervals
- Make sure there is enough disk space
- Check for errors in the log files
- Generate reports
- Perform backups
- Can often write simple scripts to help you with these jobs
- Need to know a good scripting language: Perl, VBScript
- Both UNIX and Windows have useful tools for these tasks
- We will look at using the UNIX cron command
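The notes mention Perl and VBScript; as a sketch in Python, two of the checks listed above might look like the following. The 10% threshold, the "[error]" marker (assumed from the Apache 2.0 error log format), and the sample log lines are all assumptions made for this example.

```python
import shutil

def low_on_space(path, min_free_fraction=0.10):
    """Return True if the disk holding path has too little free space."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False  # no size information available for this filesystem
    return usage.free / usage.total < min_free_fraction

def count_errors(log_lines, marker="[error]"):
    """Count the log lines that contain the error marker."""
    return sum(1 for line in log_lines if marker in line)

# Made-up log lines used only for illustration
sample_log = [
    "[Wed Jul 16 07:00:01 2003] [notice] Apache configured",
    "[Wed Jul 16 07:05:12 2003] [error] File does not exist: /favicon.ico",
    "[Wed Jul 16 07:09:48 2003] [error] File does not exist: /robots.txt",
]
print(low_on_space("."))
print(count_errors(sample_log))
```

A script like this becomes useful once something runs it at regular intervals, which is the job of a scheduler.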
7.4.1: Using cron
- UNIX cron is a daemon that starts programs at specific times
- The cron clock daemon runs constantly on a machine and dispatches other processes at scheduled times
- Checks for jobs to run once every minute
- crontab is the utility that adds jobs to cron's list
- Following is an example of a crontab entry
0 7 * * * tail -10 /usr/local/apache2/logs/error_log
Prints the last 10 lines of the error log file at 7:00 a.m. every day
Since the output is not redirected to a file, cron will email it to the job's owner (if mail is set up)
Fields are (by the numbers):
- Minute (0-59)
- Hour (0-23)
- Day of the month (1-31)
- Month of the year (1-12)
- Day of the week (0-6 with 0 = Sunday)
Anything after the fifth field is the command to run
Use crontab filename to install the cron tasks listed in the file, replacing any existing crontab
Use crontab -l to check your crontab file
Use crontab -r to remove cron jobs
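The field layout can be made concrete with a small Python sketch that splits a crontab entry into its five time fields and the command. The function and field names are invented for the example, and the sketch only separates the fields; it does not interpret ranges or lists the way cron does.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")

def parse_crontab_entry(line):
    """Split a crontab line into its five time fields and the command."""
    parts = line.strip().split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    entry = dict(zip(FIELD_NAMES, parts[:5]))
    entry["command"] = parts[5]  # everything after the fifth field
    return entry

entry = parse_crontab_entry(
    "0 7 * * * tail -10 /usr/local/apache2/logs/error_log")
for name in FIELD_NAMES + ("command",):
    print(name, "=", entry[name])
```

For the sample entry, minute is 0 and hour is 7, while the three date fields are *, meaning "every", which is how the job comes to run at 7:00 a.m. every day.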
Using cron for Automation
- Open a terminal emulation window by clicking the icon in the bottom panel

- Login as the superuser, if you are not already.
su -l root
You will be prompted for the root password
- Start Apache, if not already started:
/usr/local/apache2/bin/apachectl start
- Open a file named "checklog" in a text editor
gedit checklog &
- In the checklog file, enter the following (all one line):
* * * * * /usr/bin/tail -10 /usr/local/apache2/logs/error_log > /root/tail.log
- Make certain the line ends with a newline (press Enter at the end of the line)
- If error_log does not have any entries, select another file
- Save the file to
/root/checklog
- Type the following command, which adds the cron job
crontab /root/checklog
- Verify that the cron job was added using crontab -l
crontab -l
- Wait until the cron job executes (about one minute)
- Use ls -l to check if the tail.log file was created
ls -l
- View the file
cat tail.log
- Remove the cron job using crontab -r
crontab -r
- What values should be set for the first 5 entries rather than * * * * * ?
7.4.2: Scheduling Tasks on Windows
For example:
The following at command causes the log.bat script in c:\scripts to run every Thursday at 7:00 a.m. on an NT/2000/XP machine
C:\> at 07:00 /every:Thursday "c:\scripts\log.bat"
- Note the /every flag
- Use the /next flag to run a job only once
- More information: see At
Lab Exercise 7.4
Use the next 10 minutes to complete the following.
- Label this exercise: Lab 7.4
- Answer the following questions.
Exercises and Questions
- What tools are available for automating your Web server?
- What server administration tasks can easily be automated (or somehow assisted by the computer)?
Wrap Up
- When class is over, please shut down your computer
=> Logout => Shut Down
Due Next: N/A
- You may complete unfinished exercises at any time before the next class.
- Be sure to submit the file to the instructor before the beginning of the next class to receive credit.
- Instructions on submitting exercises are available from the HowTo's page.
Last Updated: 7/16/2003 4:45:40 PM