As use of the web grows, more and more documents are becoming available on-line. However, different users have different needs from these documents. Some users, in a hurry, may desire brief and succinct documents. Others may require more detail. Users may also vary as to the type of information they want from a document.
This paper describes an experiment with on-line text presentation -- whereby the user specifies how long the document should be. The system then presents a coherent document fitting that space limitation. The user might choose to see the hundred-word version, or the thousand-word version, or somewhere between.
Figure 1 shows the web browser (Netscape) interface to the system, along with (part of) the text before it is reduced. Figure 2 shows the same document with a 200-word limit set. The text is mostly coherent, though with some minor problems. One can see these as the cost of this sort of summarisation.
Figure 1: The VLTP interface
Figure 2: Scottish History text at 200 words
This technique, which we call variable-length text presentation, involves two steps:
The system was an attempt to see how far we could push a notion mentioned by Sparck Jones (1993), that RST can be used to summarise a text, shaving off less relevant satellites. Can we remove rhetorically dependent sub-sections of the text without markedly affecting the coherence of the text?
Our pruning method involves assigning a level of relevance between 0 and 1 to each RST relation. Using these values, we can work out the relevance of each node in the RST-tree: the top node has relevance 1.0, and each satellite in the tree has relevance proportional to the relevance of its nucleus times the relevance of the relation linking it. We then prune off the text-nodes with lowest relevance until the required word-limit is reached. This process is described in section 3.
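The relevance calculation and pruning just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the Node class, the relation names and their weights, the simplification that every node is a nucleus with directly attached satellites in document order, and the choice to remove a pruned satellite together with everything that depends on it are all invented for the example. Relevance is taken here to be exactly the product, rather than merely proportional to it.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional

# Hypothetical relation weights in [0, 1]; the values are illustrative only.
RELATION_WEIGHTS: Dict[str, float] = {
    "background": 0.3,
    "elaboration": 0.5,
    "cause": 0.8,
}


@dataclass
class Node:
    text: str
    relation: Optional[str] = None  # relation linking this satellite to its nucleus
    satellites: List["Node"] = field(default_factory=list)
    relevance: float = 1.0
    kept: bool = True


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all its descendants, nucleus first (assumed document order)."""
    yield node
    for sat in node.satellites:
        yield from walk(sat)


def assign_relevance(node: Node, weights: Dict[str, float], relevance: float = 1.0) -> None:
    """Top node gets 1.0; each satellite gets nucleus relevance times relation weight."""
    node.relevance = relevance
    for sat in node.satellites:
        assign_relevance(sat, weights, relevance * weights[sat.relation])


def word_count(root: Node) -> int:
    return sum(len(n.text.split()) for n in walk(root) if n.kept)


def prune(root: Node, weights: Dict[str, float], limit: int) -> str:
    """Drop the least relevant satellites until the text fits within `limit` words."""
    for n in walk(root):
        n.kept = True
    assign_relevance(root, weights)
    candidates = sorted((n for n in walk(root) if n is not root), key=lambda n: n.relevance)
    for node in candidates:
        if word_count(root) <= limit:
            break
        for n in walk(node):  # a pruned satellite takes its own satellites with it
            n.kept = False
    return " ".join(n.text for n in walk(root) if n.kept)


detail = Node("They shelve returned books first.", relation="elaboration")
doc = Node(
    "The library opens at nine.",
    satellites=[
        Node("It was rebuilt last year.", relation="background"),
        Node("Staff arrive early to prepare.", relation="cause", satellites=[detail]),
    ],
)

print(prune(doc, RELATION_WEIGHTS, 12))
# prints: The library opens at nine. Staff arrive early to prepare.
```

In this toy tree the background satellite (relevance 0.3) and the nested elaboration (0.8 times 0.5, i.e. 0.4) are removed before the cause satellite (0.8), which brings the text from 20 words down to 10. The top node is never a pruning candidate, so a limit smaller than the top node's own text cannot be met.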
The system also allows a small degree of user-determination of the content. The RST-pruning uses information on the relative importance of each RST relation. If the user is given control of these importances, then they can tailor the kinds of information that are actually left in the document.
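As a rough illustration of how user-controlled relation importances could change what is left in the document, the Python sketch below ranks a few satellites by relevance under two weight settings. Everything named here is hypothetical: the relation names, the weight values, the satellite labels, and the representation of a satellite as the chain of relations leading down to it from the top node. Relevance is assumed to be the product of the weights along that chain.

```python
from typing import Dict, List, Tuple

# A satellite is (label, chain of relations from the top node down to it).
Satellite = Tuple[str, List[str]]


def relevance(chain: List[str], weights: Dict[str, float]) -> float:
    """Product of the relation weights on the path from the top node."""
    score = 1.0
    for relation in chain:
        score *= weights[relation]
    return score


def pruning_order(satellites: List[Satellite], weights: Dict[str, float]) -> List[str]:
    """Labels sorted from least to most relevant, i.e. the order they would be pruned."""
    ranked = sorted(satellites, key=lambda s: relevance(s[1], weights))
    return [label for label, _ in ranked]


satellites: List[Satellite] = [
    ("history", ["background"]),
    ("example", ["elaboration"]),
    ("reason", ["cause"]),
    ("detail", ["cause", "elaboration"]),
]

# Illustrative default weights, and a user who cares more about background.
default_weights = {"background": 0.3, "elaboration": 0.5, "cause": 0.8}
user_weights = dict(default_weights, background=0.9)

print(pruning_order(satellites, default_weights))
# prints: ['history', 'detail', 'example', 'reason']
print(pruning_order(satellites, user_weights))
# prints: ['detail', 'example', 'reason', 'history']
```

With the default weights the background material is the first to go; once the user raises the importance of background, it becomes the last satellite to be pruned, so the shortened document retains a different kind of information.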
In section 4, we address various areas of incoherence introduced by the pruning (paragraphing, punctuation, reference and discourse markers), and our solutions to these problems.
In section 5, we describe the RST markup tool which makes it possible to conceive of document presentation based on RST markup. RST-based document summarisation has been held back in the past by the poor state of automatic discourse structure recognition. Hand-markup is an arduous task, but the tool we report here makes the task economical for some documents. However, because of the time-cost of document markup, this technique is only useful for documents with a longer shelf-life. We must weigh the cost of analysing the original document against the benefits of having a variable-length on-line document.
Finally, section 6 attempts to assess the usefulness of this approach, weighing the quality of the presented documents against the problems involved in the presentation. Some extensions of the work are also suggested.
Given that this technique involves neither text-planning nor sentence-planning, one might ask how this paper is relevant to the Generation community. Firstly, the technique of RST-pruning, reported in section 3, is applicable to the pruning of RST-structures generated by a full-blown text-planner. A text-planner could produce fully-elaborated RST-structures from an underlying knowledge-base, and then present pruned versions of the text depending on the user's needs. We can thus apply the techniques reported here for variable-length document presentation to variable-length document generation.
Secondly, this work is also of interest to the Generation community because it reports on the RST Markup Tool. RST is used widely within the Generation community, and this tool may prove useful to many, not only as an aid in their corpus studies, but also for preparing diagrams for publications.
Summarisation via RST-pruning was suggested by Sparck Jones (1993), although the mechanism for determining which satellites to prune is unique here. Also, her work was limited by the lack of automated RST analysis, while I rely on semi-automated markup. The application of the technique to produce variable-length documents is also unique.
Rino & Scott (1996) offer a more detailed account of summarisation via pruning in a full generation environment. However, they prune the content structure rather than the discourse structure. The RST tree produced to express the pruned content structure is not itself pruned. On the other hand, their content structure is close enough to RST that their work has clear similarities to the present work. They take intention structure into account to drive the pruning, which would be a valuable addition to the methods proposed here. While I believe they are right that text summarisation needs to take both these areas (and others) into account, I am interested here in seeing how well rhetorical structure by itself can form the basis of summarisation.
Ed Hovy, in his involvement with the HealthDoc project, has suggested generation from a master document -- a set of SPLs (semantic specifications of sentences), each conditionalised by the user model (see DiMarco et al 1995). The text actually seen by the user is obtained by pruning out SPLs which are inappropriate for the user-type. The present system differs from this approach in that, while their master document is RST-structured, that structure is not used as the basis of the pruning, but only to restructure the pieces chosen. Also, their production of sub-documents is intended to yield user-tailored documents, not length-tailored ones.
I am aware of work by Veli J. Hakkoymaz (Hakkoymaz, in preparation; Hakkoymaz & Ozsoyoglu 1996) on Variable-Length Multimedia Presentations, whereby multimedia segments are added to or dropped from a presentation in order to meet the time constraints. That approach allows substitution of elements as well as deletion, which may be a useful technique.