WWW8

  • 8th International World Wide Web Conference
  • http://www8.org/
  • May 11 – 14, 1999
  • Toronto, Canada
  • Frank Fujimoto

The conference was held at the Metro Toronto Convention Centre, and was quite well put together (barring some scheduling and logistics problems at the beginning).

This conference is starting to have a good balance of research and real-world applications. More observations are at the bottom of this report.

Tutorials

This convention offered 16 tutorials and 7 workshops, ranging from “A Comprehensive Introduction to XML” to “Protecting your Ecommerce Application”.

Java Servlets: Server Side Java

Alan is the CEO of n-ary, a consulting firm which specializes in Java servlets. Servlets have an advantage over CGI scripts: since the Java Virtual Machine runs as part of the web server, the servlets run in the same process and their bytecodes can be cached.
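
As a rough sketch of the model (my own illustration, not code from the tutorial), a minimal servlet might look like the following; it assumes the standard javax.servlet API, and the point is that the class is loaded once and then handles requests inside the server’s own process:

```java
// A minimal servlet, assuming the standard javax.servlet API is on the classpath.
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Unlike a CGI script, this class is loaded once by the servlet engine, and
// every request is handled by a method call inside the already-running server.
public class HelloServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws IOException {
        res.setContentType("text/html");
        PrintWriter out = res.getWriter();
        out.println("<html><body>Hello from a servlet</body></html>");
    }
}
```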

One of his case studies was for the United Nations Food & Agriculture Organization. Their previous site used Perl CGI scripts and an Oracle database. The way the scripts were written, updating the web site required someone fluent enough in Perl to make the changes. Also, it could only handle up to 8 concurrent users.

The replacement system was implemented in Java servlets, but ran on the same hardware as before, and is able to handle more than 300 concurrent requests.

There was a lot of good information in this tutorial, but I did get the feeling that Alan likes to use servlets in all cases, not necessarily only where they make sense. But implementing servlets is what his company does, so that’s not completely surprising.

HTTP Extensions for Web Collaboration: WebDAV

  • Jim Davis, Coursenet Systems

If I had to pick hot topics at this conference, WebDAV would be second only to RDF (Resource Description Framework, which will be used for metadata). Jim Davis is heavily involved with WebDAV and its extensions.

WebDAV is implemented as a set of extensions to HTTP/1.1. New methods (in addition to GET, POST, etc.) are defined, so the server needs to have WebDAV support. In other words, it cannot be supported with CGI scripts.

XML is used for the actual WebDAV request and response data. Information is saved as name/value pairs called properties. These can be grouped as an aggregate called a collection.
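
To make the shape of the protocol concrete, here is a rough sketch (again my own, not from the tutorial) of a PROPFIND request written straight to a socket in Java; the host name, resource path, and requested property are made up, and a real client would parse the multistatus XML response rather than just print it:

```java
// My own sketch of a WebDAV PROPFIND request over a plain socket.
// The host, path, and property are hypothetical.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class PropfindSketch {
    public static void main(String[] args) throws IOException {
        String body =
            "<?xml version=\"1.0\"?>\n" +
            "<propfind xmlns=\"DAV:\"><prop><getlastmodified/></prop></propfind>\n";
        Socket s = new Socket("dav.example.com", 80);
        Writer w = new OutputStreamWriter(s.getOutputStream(), "UTF-8");
        w.write("PROPFIND /docs/report.html HTTP/1.1\r\n");  // a new method, not GET or POST
        w.write("Host: dav.example.com\r\n");
        w.write("Depth: 0\r\n");                              // properties of this resource only
        w.write("Content-Type: text/xml\r\n");
        w.write("Content-Length: " + body.getBytes("UTF-8").length + "\r\n");
        w.write("Connection: close\r\n");
        w.write("\r\n");
        w.write(body);                                        // the request body itself is XML
        w.flush();

        // The response is a 207 Multi-Status with an XML body; just print it.
        BufferedReader r = new BufferedReader(
                new InputStreamReader(s.getInputStream(), "UTF-8"));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line);
        }
        s.close();
    }
}
```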

Even though WebDAV stands for Distributed Authoring and Versioning, it was decided to make versioning a WebDAV extension rather than part of the core. Other extensions are DASL (to search WebDAV resources), Advanced Collections (to support ordering of properties in collections), and References.

Access Control is not implemented at this time. Even so, WebDAV is a technology worth tracking, as it will be very important in the future.

Panels

This conference offered more panels than last time, and they were for the most part quite lively.

Finding Anything in the Billion-page Web: Are Algorithms the Key?

  • Prabhakar Raghavan, IBM Almaden (moderator)
  • Brian Pinkerton, Excite
  • Udi Manber, Yahoo!
  • Andrei Broder, Compaq
  • Monika Henzinger, Compaq

One thing that these panelists have in common is that they feel the meta search engines (MetaCrawler, etc.) can do the user a disservice. Andrei points out that in order to work with a wide variety of search engines, the meta searchers need to stick with a lowest common denominator of features. Brian feels it’s harmful that the meta searchers can ignore the ranking done by the search engines. And Udi feels strongly that meta searchers rely on the main engines to gather results for users, but hide parts of those engines’ pages from the users, such as the advertising.

As an alternative to meta search engines, Udi believes that two-level searching will be the next big thing, and is where he is doing his current work. The idea is that if you notice the user’s search is related to medicine, then along with the results you return, you can present links to topic-specific search engines which may do better than a general one. Monika echoed that she thinks the next big thing will be topical search engines.

The topic of paid placement of search results in AltaVista came up. Monika asserts that the normal, non-biased search results will still appear, but that the results with paid placements will appear in another column.

Udi mentioned an interesting fact: 80% of queries done on Yahoo consist of one or two words. As a result, the searching has been tuned for one or two words. If you do a search with many words, your results may actually be worse.

All the panelists seem to believe that better algorithms will be able to help in the future. This isn’t necessarily easy, however. Udi said that the word “not” is very difficult; searching for “suitable for children” will return a hit on “not suitable for children”.

Web-based Everything: are HTTP, DAV and XML Enough?

  • Sally Khudairi (moderator)
  • Josh Cohen, Microsoft
  • Henrik Frystyk Nielsen, MIT
  • Rohit Khare, University of California, Irvine
  • Mark Day, Lotus
  • Keith Moore, IETF

I went into this panel with the opinion that HTTP, DAV, and XML together could not form a do-all set of protocols. I left still thinking that there are places where specific protocols work better.

When I picked up my conference materials, there was a flyer from Microsoft proclaiming that they are proponents of standards. I also found a t-shirt along with my proceedings which had the same message, and the main ballroom used for the Plenary Sessions had a banner. Josh of Microsoft held tightly to this message.

The moderator and audience tried several times and finally got the panel to address the topic of calendaring. The panel seemed to agree that the iCalendar (ICAL) format would be good for interchange, though applications wouldn’t necessarily need to use it for their native data.

Henrik pointed out that part of the problem with HTTP/1.x is that all old applications are still expected to work with it (old browsers, etc.).

One of the livelier exchanges in this session was about request/response. The panelists mostly seemed to agree that one request/one response doesn’t apply to all cases. But Henrik (who was on the HTTP/1.1 team) said that there’s room in the HTTP/1.1 spec for one request/n responses, where n can be 0.

The other panelists, especially Mark, thought that breaking the one-to-one nature of HTTP requests and responses would make things overly complex.

The Web as an Application Platform

  • Mike Shaver, Netscape
  • Sara Williams, Microsoft
  • Tom Conrad, Documentum
  • Bill Shea, Merrill Lynch

The goal of this panel session was to explore whether the web works well for applications.

Mike from Netscape described and demonstrated the Mozilla 5.0 effort. They will still try to support all the platforms Netscape currently runs on. Mozilla uses RDF for bookmarks, history files, preferences, etc.

Part of Sara’s presentation was about how Microsoft is using the web for internal applications as well as products. Employees can file expense reports via the web, get paycheck information, and access their 401K information.

Bill gave an interesting presentation about an application built for Merrill Lynch’s sales force. Before, they’d have to distribute it on CD-ROMs, which took a long time. Now the deployment is on the server side, and things are working quite well for them. The look and feel is very similar to the previous application.

One question asked by the audience concerned internationalization, and how applications are very English-centric, as well as US-centric. Sara’s response focused on localization (how to display dates, currency, etc.) and the display of text. But Mike also touched on the fact that the cultural things are much harder to get right.

Other Sessions

I went to one of the W3C presentations, where they showed what’s current with XML, XHTML, and RDF.

One thing that occurred to me during the conference is that XML can essentially be considered “deployed”, and that there are many new technologies which use XML as a base (RDF, WebDAV, etc.). What’s more interesting is that two years ago XML was being touted as a new, extensible HTML, but now XML is seen more as a data encapsulator, and the idea that it’s a new HTML is fading. In fact, I heard one presenter state that HTML will probably still be used for 80% of the content, and XML for content which needs more structure.

I also attended the session Town Meeting: Successful Web Design which concentrated mostly on the application developer’s point of view. There were several testimonials of how it’s hard to figure out exactly what the users want, and if you give them exactly what they ask for, they’re still not happy (or don’t use the extra features they said they wanted). Others pointed out that while asking the users what they want is important, even more so is to watch their usage patterns and try to figure out what it is they need.

During Developer’s Day I was able to sit in on some of the Web Scripting Language Forum. The main reason I attended that session was to find out what’s in the works for PHP, and among other things, they described version 4.

PHP has grown into a full-fledged language, with various flavor options (SGML-style, XML-style, ASP-style, and Javascript-style) for writing code. Version 4 will be more modular and will open the door for third-party plug-in modules. PHP scripts will be compiled into bytecode and the results cached, which should greatly help speed (though the script still needs to be run for every invocation, even if the resulting data is static).

Another presentation during Developer’s Day was from a company with a lightweight browser (called Mozquito) which has some special HTML additions to allow the creation of more complex forms without the need for Javascript. They also showed a Javascript script which takes XHTML content and converts it to HTML 4.0 on the fly, so browsers which don’t understand XHTML can still take advantage of extending the language with macros. It was, however, slower, and it requires a hook on the server side to wrap the XHTML in a document with Javascript (so the content can be interpreted by the XHTML-to-HTML script).

Plenary Speakers

Challenges of the Second Decade

  • Tim Berners-Lee, Director, W3C

Each time I’ve heard Tim Berners-Lee speak, he complains that URLs were never meant to be published, and that people should be following links. I think he has a point, looking at all the companies battling for domains so people only have to type in the hostname and not any text after the “.com”.

Tim’s talk concentrated on what would happen in the next ten years of the World Wide Web. His grandest vision is that of a “semantic web”, where everything has well-defined meanings and definable results. Documents should be self-describing, and the W3C efforts on metadata address this.

He also took time to express that patents are not serving industry as intended. Currently, the bar is too low for what is considered “innovative”, and the patent process creates incentive for making applications unintelligible. He feels the current ethos is to get away with whatever you can.

His challenge to the conference attendees was to raise the ethical level of the web.

e-business and the Future of the Internet

  • John Patrick
  • Vice President, Internet Technology, IBM

I heard John Patrick speak at Apachecon, and many of his main points were repeated here, but refined for the changes in the last six months.

One of his beliefs is that kids should not be underestimated; they don’t know that programming is “supposed to be hard”, and they just do it. His Apachecon example was a high school web site run entirely by students. For WWW8, he talked about how he bought a Lego Mindstorms set and wanted to program it to do something. He looked on the web for a program, and found one that worked. It was written by a 7-year-old. He also believes that the older generation is much more capable in the electronic world than most of us give them credit for.

John presented his vision of what will be happening in the future: consumer products which are connected to the net will be more ubiquitous than other computer-style products.

The Death of Wire Protocols?

  • Greg Papadopoulos
  • Chief Technology Officer, Sun Microsystems

Some of the topics Greg covered echoed John Patrick’s talk, especially that consumer devices will become the big player in the networking world.

Greg sees the model of networking as moving away from two computers talking to each other via a network and towards computers sharing and exchanging objects. Instead of needing to load a device driver onto a computer in order to print to a printer, the two devices will share the objects necessary to communicate; being from Sun, his vision is that these objects are Java code.

One of his more interesting statements was that he believes firewalls as they exist today will disappear in the future. This is because end devices are becoming more intelligent and security-aware, which will lessen the need for a firewall.

Another thing Greg believes is that we’re going to find we built the wrong network. As it currently exists, the network concentrates on providing information from service providers to consumers. However, when consumer products overwhelm computers, much of the information will be flowing in the other direction, from devices to services.

Greg also feels that a trend in the future will be towards meta services; instead of talking directly to many providers of services, you will talk to an intermediary who will go out and present various services to you. It seems that in a way, this is starting to happen already with portals (MyYahoo, MyNetscape, etc.).

A Summary of WWW8, The Nextgen Internet, and Other Dangerous Speculation

  • Bob Metcalfe
  • Vice President Technology, International Data Group

It has become a tradition to have Bob Metcalfe review the conference when it is in North America.

As Tim Berners-Lee is with the URL, Bob is with the IP address. He strongly believes that IPv6 is necessary, and that its delay in deployment (because of, among other things, NAT) is only temporary.

Bob agrees with Greg that the net is being built in the wrong direction; more specifically, ADSL offers faster downloads than uploads.

At WWW4 in Boston, one of Bob’s favorite papers was about Millicent, a pay-as-you-go model for content. The idea is that you pay fractions of a cent for information, and your bill is added to as you follow links. He still believes that such an architecture would greatly help the net, as banner advertising doesn’t seem to be doing the trick, since it’s very easy to just ignore the banner ads.

Among the predictions he made, Bob thinks that home access to the internet will be measured in megabits before year’s end (which in essence it already is, with cable modems). He also thinks the growth of the internet will drop, and that the annual doubling of usage is over. His least popular prediction is that open source will fizzle because it’s too idealistic. He does like the idea of open source, but also believes strongly in corporations. However, he does see the way Red Hat has taken Linux and started marketing it as a step in the right direction.

Yuri Rubinsky Memorial Award

  • Richard Stallman

At each conference, the Yuri Rubinsky Memorial Award is presented to someone who has greatly contributed to the World Wide Web. Past recipients are Doug Engelbart, Vint Cerf, Gregg Vanderheiden, and Ted Nelson. This year’s selection of Richard Stallman surprised and amused everyone at the conference, especially with the thought that he accepted a $10,000 award funded in part by Microsoft and Sun Microsystems. Richard quickly went through his thank-yous and went on to complain about software patents in the United States and how European countries are starting to look at granting software patents.

Papers

Improving Web interaction on small displays

  • Matt Jones, Gary Marsden, et al.
  • Interaction Design Centre, School of Computing Science, Middlesex University, London, UK
  • Paper

For this paper, small displays means both small desktop displays and displays on handheld devices. The paper concludes that content providers should do the following:

  • Provide direct access. Search and structure to information helps greatly here.
  • Reduce scrolling. Scrolling greatly reduces the efficiency of the user. Navigational features should be near the top of a page in a fixed place, key information should also be at the top, and the amount of information on a page should be reduced to focus the content of the page.

Towards a better understanding of Web resources and server responses for improved caching

  • Craig E. Wills, Mikhail Mikhailov
  • Computer Science Department, Worcester Polytechnic Institute, Worcester, MA
  • Paper

This paper describes an effort to see how well cache servers and web servers interact. A small yet significant number of requests in their test sets resulted in headers which were incorrect (either the date changed and the content didn’t, or the content changed but the datestamp didn’t), along with some anomalous data. The most interesting anomaly is that even though Apache sends an entity tag (ETag) header, if content is dispersed over multiple hosts, the ETags can be different on different servers (because ETags are generated using device and inode numbers, which vary between hosts), so these tags can’t reliably be used to test for new content.
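
As a small illustration of why this happens, the sketch below imitates an inode-based tag using the Unix “unix:ino” file attribute (an assumption: it only works on Unix-like filesystems, and the tag format is illustrative rather than Apache’s exact format). Two mirror hosts holding byte-identical copies of a file would still compute different tags:

```java
// Sketch of an inode-derived entity tag; illustrative only.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class InodeEtag {
    static String etag(Path file) throws IOException {
        // The inode number is specific to the host's filesystem, so mirrored
        // copies of the same content produce different tags.
        long ino = ((Number) Files.getAttribute(file, "unix:ino")).longValue();
        long size = Files.size(file);
        long mtime = Files.getLastModifiedTime(file).toMillis();
        return Long.toHexString(ino) + "-" + Long.toHexString(size)
                + "-" + Long.toHexString(mtime);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(etag(Paths.get(args[0])));
    }
}
```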

They also point out that many objects remain in caches when they no longer are necessary, such as if a graphic referenced by a page (or multiple pages) is no longer referenced. In this case, the graphic will never be re-requested, but the cache will still keep a copy until it either expires or is otherwise flushed.

Measuring search engine quality using random walks on the Web

  • Monika R. Henzinger, Allan Heydon, et al.
  • Compaq Computer Corporation Systems Research Center, Palo Alto, CA
  • Paper

Measuring the quality of search engines is a difficult task, and one may even say that it depends on what the user is seeking. This paper describes a method to attempt to quantitatively measure search engine quality.

The basic idea is to take a sample of linked pages on the web and then do queries to the search engines to see which pages out of the sample are returned. The sample was built with a random-walk algorithm which tries to mimic what a user would do; at any point, the walker will either follow a link on the page or go to a random page. The method the group ended up using for going to a random page is to start 100 simultaneous walkers and pick a host at random out of the hosts that the 100 walkers have collectively visited so far. Once a host is picked, a page on that particular host is picked at random. The results will be biased towards the starting page (they used http://www.yahoo.com), but will normalize after a sufficiently long run time.
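
A toy sketch of the walk (a single walker over a tiny hard-coded link graph, standing in for the paper’s 100 parallel walkers over the live web) might look like this:

```java
// A toy random-walk sampler; the "web" here is a tiny in-memory link graph.
import java.util.*;

public class RandomWalkSketch {
    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("http://www.yahoo.com/",
                Arrays.asList("http://a.example/", "http://b.example/"));
        links.put("http://a.example/", Arrays.asList("http://b.example/"));
        links.put("http://b.example/", Arrays.asList("http://www.yahoo.com/"));

        Random rnd = new Random();
        List<String> visited = new ArrayList<String>();
        String page = "http://www.yahoo.com/";
        for (int step = 0; step < 1000; step++) {
            visited.add(page);
            List<String> out = links.get(page);
            // Follow an out-link most of the time; otherwise jump to a page
            // drawn from those already visited (a stand-in for the paper's
            // "random host seen so far by any walker" rule).
            if (out != null && !out.isEmpty() && rnd.nextDouble() < 0.85) {
                page = out.get(rnd.nextInt(out.size()));
            } else {
                page = visited.get(rnd.nextInt(visited.size()));
            }
        }
        System.out.println("Sampled " + visited.size() + " page visits");
    }
}
```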

The results of their tests show that AltaVista returned the most results that were in their set of pages collected by the random walks. However, AltaVista was one of the worst performers if you take into account the average quality of the pages returned that were in the data set (quality being determined by the number of pages that link to a particular page, as well as the quality of those linking pages).

Automatic RDF metadata generation for resource discovery

  • Charlotte Jenkins, Mike Jackson, et al.
  • School of Computing and IT, University of Wolverhampton, Wolverhampton, UK
  • Paper

While metadata generation will be tied to content creation in the future, it is unlikely that people will go back and add metadata to existing content.

This group created an automatic classifier written in Java. According to the paper, it “works by comparing terms found within documents with manually defined clusters of terms representing the nodes of a classification hierarchy”; the hierarchy they chose was the Dewey Decimal Classification. In addition to the classification, they also extract the document title, keywords, abstract, and word count.
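
A very rough sketch of the term-matching idea follows; the node names, term lists, and scoring are invented for illustration and are far simpler than the classifier described in the paper:

```java
// Toy classifier: score each classification node by how many of its
// manually chosen terms appear in the document. Everything here is made up.
import java.util.*;

public class TermClassifier {
    public static void main(String[] args) {
        Map<String, List<String>> nodes = new LinkedHashMap<String, List<String>>();
        nodes.put("004 Computer science", Arrays.asList("computer", "software", "web"));
        nodes.put("630 Agriculture", Arrays.asList("farm", "crop", "agriculture"));

        String document = "a web site about computer software for farms";
        Set<String> docTerms = new HashSet<String>(
                Arrays.asList(document.toLowerCase().split("\\s+")));

        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, List<String>> node : nodes.entrySet()) {
            int score = 0;
            for (String term : node.getValue()) {
                if (docTerms.contains(term)) score++;
            }
            if (score > bestScore) {
                bestScore = score;
                best = node.getKey();
            }
        }
        System.out.println("Best match: " + best + " (score " + bestScore + ")");
    }
}
```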

The paper shows an example of classifying several of the category pages on http://www.yahoo.com, with the results in RDF format. The program currently generates properties which are part of the Wolverhampton Core, but they do mention that creating metadata following the Dublin Core would increase the potential for interoperability.

Results and challenges in Web search evaluation

  • David Hawking, Nick Craswell, et al.
  • CSIRO Mathematical and Information Sciences, Canberra, Australia
  • Department of Computer Science, ANU, Australia
  • National Institute of Standards and Technology, Gaithersburg, MD
  • Paper

This paper takes a different approach to testing search engine quality: take a snapshot of the Web, consisting of 18.5 million pages (about 100gb of data) from over 115,000 different hosts. Almost 25,000 of those hosts were represented by a single page.

The group then submitted queries to 5 of the major search engines. For 50 title-only queries (with an average of 2.5 words), between 23% and 38% of the top 20 documents returned were classified as “relevant”. Results improved (up to over 60%) when more topic words were used in the queries.

Grouper: a dynamic clustering interface to Web search results

  • Oren Zamir, Oren Etzioni
  • Department of Computer Science, University of Washington, Seattle, WA
  • Paper
  • Grouper

Grouper is a front-end to the HuskySearch meta search engine which takes results and “clusters” them into relevant groups. This clustering is done on-the-fly, rather than during indexing.

The algorithm they use is Suffix Tree Clustering (STC), which is linear with relation to the number of documents retrieved. Documents can appear in more than one cluster. The clustering is based on common phrases, and the clusters are ranked according to the length of the phrases in the cluster and how many documents are in the cluster.
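
The toy sketch below shows only the grouping-by-shared-phrase idea, not the suffix-tree construction or the ranking; the documents and phrases are invented:

```java
// Toy illustration: every two-word phrase becomes a candidate cluster
// containing the documents it occurs in. Not the STC algorithm itself.
import java.util.*;

public class PhraseClusters {
    public static void main(String[] args) {
        String[] docs = {
            "web search results clustering",
            "clustering of search results",
            "java servlets on the web"
        };
        Map<String, Set<Integer>> clusters = new TreeMap<String, Set<Integer>>();
        for (int i = 0; i < docs.length; i++) {
            String[] words = docs[i].split("\\s+");
            for (int j = 0; j + 1 < words.length; j++) {
                String phrase = words[j] + " " + words[j + 1];
                Set<Integer> members = clusters.get(phrase);
                if (members == null) {
                    members = new TreeSet<Integer>();
                    clusters.put(phrase, members);
                }
                members.add(i);
            }
        }
        // Keep only phrases shared by more than one document.
        for (Map.Entry<String, Set<Integer>> e : clusters.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " -> docs " + e.getValue());
            }
        }
    }
}
```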

Users seem to follow more documents presented in search results with the Grouper interface than with the traditional HuskySearch interface. This could be because once a user finds an interesting document, the clustering helps to find additional relevant documents. The authors state that a user study is needed to test their hypotheses.

A runtime system for interactive Web services

  • Claus Brabrand, Anders Møller, et al.
  • Basic Research in Computer Science, Department of Computer Science, University of Aarhus, Aarhus, Denmark
  • Paper

The traditional CGI model for a web application requires state to be transported between the client and server, and a new CGI process to be invoked for each request. This paper proposes a new method where the CGI becomes a lightweight connector, and feedback is returned to the user more quickly.

The proposed system uses an intermediate file and a persistent process to handle the transactions. When a user enters the system, a process is created and the user is redirected to a unique filename. While the CGI is processing information, the file contains a message asking the user to wait for the results, and after a few seconds the file is re-fetched. When the CGI has completed its processing, it writes the information out to the html file and enters an inactive state. After the user submits the next request, the connector sends the information to the persistent process. After the CGI finishes, or a few seconds after the request was submitted (whichever comes first), the browser is redirected back to the static html page, which will contain either the results of the CGI or another message asking the user to wait for the results.
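
A minimal sketch of the reply-file trick, assuming a directory that the web server serves statically (the path and timings below are made up); the persistent process overwrites the self-refreshing “please wait” page with the real results when they are ready:

```java
// Sketch of the reply-file idea: a session-specific HTML file that first
// holds a self-refreshing wait page and is later overwritten with results.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class ReplyFileSketch {
    static void writeFile(File f, String html) throws IOException {
        Writer w = new FileWriter(f);
        w.write(html);
        w.close();
    }

    public static void main(String[] args) throws Exception {
        File reply = new File("htdocs/session-12345.html"); // hypothetical path
        reply.getParentFile().mkdirs();
        writeFile(reply,
            "<html><head><meta http-equiv=\"refresh\" content=\"3\"></head>"
            + "<body>Working, please wait...</body></html>");
        // Stand-in for the persistent process doing the real work.
        Thread.sleep(5000);
        writeFile(reply, "<html><body>Here are your results.</body></html>");
    }
}
```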

The authors note that this system is quite vulnerable to hijacking, since only the URL or the resulting file is necessary to capture the session. They suggest using an encrypted session, but URL guessing can still be used. Also, because this relies on static pages and a persistent process, this scheme would not work on a cluster of web servers.

Retaining hyperlinks in printed hypermedia document

  • Ernest Wan, Philip Robertson, et al.
  • Canon Information Systems Research Australia
  • Paper

I’ve included this paper because of its sheer novelty. The authors describe how a document can be created with areas of concentric semi-circles on the edge of the page which are links to related pages. When these pages are printed out, the linked pages lie below the linking page (numerically after if the page is on the right, numerically before if the page is on the left), with the semi-circle representing a link destination. The page which contains the link has that semi-circle cut out.

In the concluding remarks, the authors note “We have not conducted usability trials for different applications.”

Visual preview for link traversal on the World Wide Web

  • Theodorich Kopetzky, Max Muhlhauser
  • Telecooperation Department, Johannes Kepler University, Linz, Austria
  • Paper

The authors describe a method using JavaScript, dynamic HTML, and Java (with Netscape 4.x; presumably it wouldn’t work with Internet Explorer) to preview links on pages. When a user passes the mouse pointer over a link, a box pops up showing a preview of where the link is going, and even shows dead links.

To remove from the content provider the burden of integrating previews with the links, the paper describes how a proxy server can be used to insert the necessary dynamic HTML code and provide preview images.

Web content adaptation to improve server overload behavior

  • Tarek F. Abdelzaher, Nina Bhatti
  • Real-Time Computing Laboratory, EECS Department, University of Michigan, Ann Arbor, MI
  • Hewlett Packard Laboratories, Palo Alto, CA
  • Paper

This paper presents a method for automatically delivering different content based on web server load. For example, if the server is lightly loaded, then a 74kb gif can be presented to users. However, if the server is heavily loaded, then an 8.4kb jpeg would be better. Or, a vendor can have dynamic pages which show available stock, but under high load, static pages without inventory information can be presented.
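
A hedged sketch of the idea in servlet form follows; the load measurement and file names are placeholders, and the paper’s system performs the adaptation inside the server itself rather than in application code:

```java
// Sketch only: currentLoad() and the image paths are invented placeholders.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class AdaptiveImageServlet extends HttpServlet {
    // Placeholder: a real server would measure request rate or backlog.
    private double currentLoad() {
        return Math.random();
    }

    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws IOException {
        boolean heavy = currentLoad() > 0.8;
        // Serve the small JPEG under heavy load, the full-size GIF otherwise.
        String file = heavy ? "/images/logo-small.jpg" : "/images/logo-full.gif";
        res.setContentType(heavy ? "image/jpeg" : "image/gif");
        InputStream in = getServletContext().getResourceAsStream(file);
        if (in == null) {
            res.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }
        OutputStream out = res.getOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
    }
}
```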

One downside is that caching proxies may cause unexpected results: it is possible for the high-load version of an object to be cached when it might be better to cache the low-load version, and if the server’s load differs between the time the object was originally retrieved and the time the proxy checks the validity of the object, the server could send the new version without knowing that the previous version is fine.

Another downside is that if the above gif/jpeg example is used, the alternate image may not work on all browsers. Internet Explorer (at least in older versions) ignores the mime type sent by the server and derives the type from the extension of the object named in the URL. Therefore, if the browser attempts to fetch a gif but gets a jpeg instead, it will still try to interpret it as a gif because the URL ends in .gif.

Surfing the Web backwards

  • Soumen Chakrabarti, David A. Gibson, Kevin S. McCurley
  • Department of Computer Science and Engineering, Indian Institute of Technology, Bombay, India
  • Department of Computer Science, UC Berkeley, CA
  • IBM Almaden Research Center, San Jose, CA
  • Paper

These authors assert that knowing what pages link to the page currently being viewed is very helpful. To implement this, they describe HTTP protocol extensions.

The reason the authors propose to extend HTTP is that the servers already have a lot of information about what pages link to the site (via the Referer header), and if they keep that information, they can supply it to clients which ask for it. Since servers do not currently keep and present referer information, the authors developed an applet which will retrieve the information from HotBot.
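
A small sketch of the server-side half of that idea: accumulate the Referer header from each request into a back-link table that could later be exposed to clients (the class and method names here are invented):

```java
// Sketch of a back-link table built from Referer headers; names are invented.
import java.util.*;

public class BacklinkTable {
    private final Map<String, Set<String>> backlinks =
            new HashMap<String, Set<String>>();

    // Call this for every request the server handles.
    public void record(String requestedUrl, String refererHeader) {
        if (refererHeader == null) return;
        Set<String> set = backlinks.get(requestedUrl);
        if (set == null) {
            set = new TreeSet<String>();
            backlinks.put(requestedUrl, set);
        }
        set.add(refererHeader);
    }

    // What a "who links to this page?" extension could hand back to clients.
    public Set<String> linksTo(String url) {
        Set<String> set = backlinks.get(url);
        return (set != null) ? set : Collections.<String>emptySet();
    }

    public static void main(String[] args) {
        BacklinkTable table = new BacklinkTable();
        table.record("/paper.html", "http://search.example/results?q=backlinks");
        System.out.println(table.linksTo("/paper.html"));
    }
}
```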

Managing TCP connections under persistent HTTP

  • Edith Cohen, Haim Kaplan, Jeffrey Oldham
  • AT&T Labs-Research, Florham Park, NJ
  • Computer Science Department, Stanford University, CA
  • Paper

The adoption of persistent connections for HTTP has helped immensely, but has also created other potential problems. The most common approach for servers is to allow connections to stay inactive for a fixed number of seconds before closing them. The authors propose a method to heuristically compute how long the connection should remain open. Tests were done basing the connection retention time on the requested URL, the referer URL, the size of the resource, and the connection history of the client.

Test results show that client history is actually not a good metric, but basing the connection keepalive time on the requested URL does a good job. In other words, a client is more likely to have a connection already available if the timeout is determined by the requested URL rather than by how long the client has kept connections open.
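
A toy sketch of what a per-URL policy might look like (the table and timeout values are invented; the paper derives its numbers from trace analysis):

```java
// Sketch of a per-URL idle-timeout table; values here are hypothetical.
import java.util.HashMap;
import java.util.Map;

public class KeepalivePolicy {
    private final Map<String, Integer> timeoutByUrl = new HashMap<String, Integer>();
    private final int defaultTimeoutSeconds;

    public KeepalivePolicy(int defaultTimeoutSeconds) {
        this.defaultTimeoutSeconds = defaultTimeoutSeconds;
    }

    // Learned offline from logs: pages that usually lead to many follow-up
    // requests get a longer idle timeout, one-off resources a shorter one.
    public void learn(String url, int seconds) {
        timeoutByUrl.put(url, seconds);
    }

    public int idleTimeoutFor(String requestedUrl) {
        Integer t = timeoutByUrl.get(requestedUrl);
        return (t != null) ? t.intValue() : defaultTimeoutSeconds;
    }

    public static void main(String[] args) {
        KeepalivePolicy policy = new KeepalivePolicy(5);
        policy.learn("/index.html", 15);  // hypothetical values
        policy.learn("/logo.gif", 2);
        System.out.println(policy.idleTimeoutFor("/index.html"));
    }
}
```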

Observations

As I mentioned above, RDF and WebDAV were the two buzzwords on almost everyone’s lips. Two years ago at the Santa Clara convention, XML was the hot topic, but this year everyone is just assuming that XML is being used, and that RDF and WebDAV are new applications for XML.

There was much less server and protocol talk than in previous years. People were more interested in applications, metadata, and finding information. Of course, the three topics are interrelated via RDF.

In the network access room (can’t call it an email terminal room any more) there were at least as many places to hook up a laptop as there were systems running Windows or Linux. There was at most only a few minutes’ wait for one of the prebuilt systems, even during peak times.

Although they weren’t ubiquitous, there were quite a few people with palmtops. What’s interesting is that I think I spied only one WinCE device, but all the others (at least a dozen, and I didn’t look around very much) were Palm systems. Palm III systems, in fact, and all with the flip cover still on. I know there was one IIIx, but I couldn’t tell about the others since I only saw them from a distance.
