Original at http://web.interval.com/papers/mediastreams/index.html
04/17/2000
(originally published in "Readings in Human-Computer Interaction: Toward the Year 2000")
Media Streams: An Iconic Visual Language for Video Representation by Marc Davis
ABSTRACT: The need to find ways of creating machine-readable and human-usable representations of video content is becoming more important. In order to enable the search and retrieval of video from large archives, we need a representation language for video content. Although some aspects of video can be automatically parsed, a sufficient representation requires that video be annotated. We discuss the design of a video representation language with special attention to the issue of creating a global, reusable video archive. Our prototype system, Media Streams, enables users to create multi-layered, iconic annotations of streams of video data. Within Media Streams, the organization and categories of the Icon Space allow users to browse and compound over 3500 iconic primitives by means of a cascading hierarchical structure that supports compounding icons across branches of the hierarchy. A Media Time Line enables users to visualize, browse, annotate, and retrieve video content. The challenges of creating a representation of human action in video are discussed in detail, with focus on the effect of the syntax of video sequences on the semantics of video shots.
Hierarchy of the efficacy of annotations:
At least, Pat should be able to use Pat's annotations.
Slightly better, Chris should be able to use Pat's annotations.
Even better, Chris's computer should be able to use Pat's annotations.
At best, Chris's computer and Chris should be able to use Pat's and Pat's computer's annotations.
Today, annotations used by video editors typically satisfy only the first desideratum (Pat should be able to use Pat's annotations), and only for a limited length of time. Annotations used by video archivists aspire to meet the second desideratum (Chris should be able to use Pat's annotations), yet they often fail to do so if the context of annotation is too distant (in either time or space) from the context of use. Current computer-supported video annotation and retrieval systems use keyword-based representations of video and ostensibly meet the third desideratum (Chris's computer should be able to use Pat's annotations), but in practice they do not, because keyword representations cannot maintain a consistent and scalable representation of the salient features of video content.
The keyword approach is inadequate for representing video content for the following reasons:
Keywords do not describe the complex temporal structure of video and audio information.
Keywords are not a semantic representation. They do not support inheritance, similarity, or inference between descriptors. Looking for shots of "dogs" will not retrieve shots indexed as "German shepherds" and vice versa.
Keywords do not describe relations between descriptors. A search using the keywords "man," "dog," and "bite" may retrieve "dog bites man" videos as well as "man bites dog" videos--the relations between the descriptors highly determine their salience and are not represented by keyword descriptors alone.
Keywords do not converge. Different users use sufficiently different keywords to describe the same materials such that keyword annotation becomes idiosyncratic rather than consensual.
Keywords do not scale. As the number of keywords grows, the possibility of matching a query to the annotation diminishes. As the size of the keyword vocabulary increases, the precision and recall of searches decrease.
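The second and third points can be illustrated with a minimal sketch. Everything in it is hypothetical and is not taken from Media Streams: the tiny `PARENT` taxonomy, the three example shots, and the `keyword_search` and `relation_search` helpers are assumptions chosen only to show why flat keywords miss inheritance and relations.

```python
# Hypothetical mini-taxonomy (child descriptor -> parent descriptor).
PARENT = {
    "German shepherd": "dog",
    "dog": "animal",
    "man": "person",
}


def is_a(descriptor, query):
    """True if descriptor equals query or inherits from it in PARENT."""
    while descriptor is not None:
        if descriptor == query:
            return True
        descriptor = PARENT.get(descriptor)
    return False


# Each hypothetical shot is annotated twice: as a flat keyword set and as
# (subject, action, object) relations.
SHOTS = {
    "shot_1": {"keywords": {"German shepherd", "man", "bite"},
               "relations": [("German shepherd", "bite", "man")]},
    "shot_2": {"keywords": {"man", "dog", "bite"},
               "relations": [("man", "bite", "dog")]},
    "shot_3": {"keywords": {"dog", "man", "bite"},
               "relations": [("dog", "bite", "man")]},
}


def keyword_search(*terms):
    """Exact-match keyword retrieval: every term must appear verbatim."""
    wanted = set(terms)
    return sorted(sid for sid, shot in SHOTS.items()
                  if wanted <= shot["keywords"])


def relation_search(subject, action, obj):
    """Retrieval over relations, using inheritance for subject and object."""
    return sorted(
        sid for sid, shot in SHOTS.items()
        if any(is_a(s, subject) and a == action and is_a(o, obj)
               for s, a, o in shot["relations"])
    )


print(keyword_search("dog"))                   # ['shot_2', 'shot_3']
print(keyword_search("man", "dog", "bite"))    # ['shot_2', 'shot_3']
print(relation_search("dog", "bite", "man"))   # ['shot_1', 'shot_3']
```

The keyword query for "dog" misses the German shepherd shot and cannot separate "dog bites man" from "man bites dog", while the relational query with inheritance retrieves exactly the two shots in which a dog bites a man.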
Towards a Global Media Archive: A video annotation language needs to create representations that are durable and sharable. The knowledge encoded in the annotation language needs to extend in time longer than one person's memory and needs to extend in space across continents and cultures.
Icons: A uniform and widespread iconic visual language for video annotation and retrieval will enable the creation of a global media archive in which video can be stored and reused.
Media Streams' iconic visual language enables:
Accurate and readable time-indexed representation of actions, expressions, and spatial relations
Gestalt visualization of the dense, multi-layered structure of video content
Quick recognition and browsing of content annotations
Designed visual similarities between instances or subclasses of a class (visual resonances in the iconic language)
Articulation of the boundaries between consensual and idiosyncratic annotations (icons can have attached textual annotations and can thus function as the explicit consensual tokens of various idiosyncratic textual descriptions)
Global international use of annotations
Use by illiterate and preliterate people
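A time-indexed, multi-layered annotation of a video stream can be sketched as a set of intervals, each tagged with a layer and a descriptor. The `Annotation` class, the layer names, the example stream, and both query helpers below are assumptions made for illustration; the source does not describe Media Streams' internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str        # hypothetical layer name, e.g. "character" or "action"
    descriptor: str   # stands in for an iconic descriptor
    start: float      # seconds from the start of the stream (inclusive)
    end: float        # seconds from the start of the stream (exclusive)


def active_at(annotations, t):
    """All annotations, on any layer, whose interval contains time t."""
    return [a for a in annotations if a.start <= t < a.end]


def co_occurrences(annotations, d1, d2):
    """Spans (start, end) during which descriptors d1 and d2 overlap."""
    spans = []
    for a in annotations:
        if a.descriptor != d1:
            continue
        for b in annotations:
            if b.descriptor != d2:
                continue
            start, end = max(a.start, b.start), min(a.end, b.end)
            if start < end:
                spans.append((start, end))
    return sorted(spans)


# A hypothetical twelve-second stream annotated on three layers.
STREAM = [
    Annotation("location", "street", 0.0, 12.0),
    Annotation("character", "woman", 2.0, 9.0),
    Annotation("action", "walking", 2.0, 6.0),
    Annotation("action", "waving", 6.0, 9.0),
]

print([a.descriptor for a in active_at(STREAM, 7.0)])  # ['street', 'woman', 'waving']
print(co_occurrences(STREAM, "woman", "walking"))      # [(2.0, 6.0)]
```

Because each layer keeps its own intervals, a query can ask what is true at a given moment or when two descriptors hold at once, which a flat keyword list attached to a whole clip cannot express.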
Ways of extending the iconic visual language of Media Streams beyond the composition of iconic primitives:
Icons and the components of compound icons can be titled. For example, if I were to annotate the video of an automobile with the bare descriptor "XJ7," the description would be opaque to most readers. If, however, I title a car icon "XJ7," the computer learns that XJ7 is a type of car, and a human reading the annotation can quickly see the visual similarity between the "XJ7" car icon and the icons for other types of automobiles.
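One way to picture titling is as the creation of a subtype that keeps its parent's glyph. The `Icon` class, its `titled` and `is_a` methods, and the `vehicle` and `car` primitives below are hypothetical names for a minimal sketch; they are not the Media Streams implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Icon:
    name: str                        # the iconic primitive whose glyph is drawn
    parent: Optional["Icon"] = None  # more general icon, if any
    title: Optional[str] = None      # user-supplied title, if any

    def titled(self, title):
        """Return a titled icon that shares this glyph and is a subtype of it."""
        return Icon(name=self.name, parent=self, title=title)

    def is_a(self, other):
        """True if this icon is `other` or descends from it."""
        node = self
        while node is not None:
            if node == other:
                return True
            node = node.parent
        return False


vehicle = Icon("vehicle")
car = Icon("car", parent=vehicle)
xj7 = car.titled("XJ7")

print(xj7.is_a(car), xj7.is_a(vehicle), car.is_a(xj7))  # True True False
print(xj7.name, xj7.title)                              # car XJ7
```

The titled icon still renders as a car, so a human reader sees the family resemblance, while the parent link lets a query for cars or vehicles retrieve footage annotated as "XJ7".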
Users can also create new icons for character and object actions by means of an animated icon editor. This editor allows users to define new icons as subsets or mixtures of existing animated icons. This is very useful because a wide range of possible human motions can be described as subsets or mixtures of existing animated icons.