Objects and Sound
On the use of sounding objects in multimedia art

Daniël Ploeger
Centre for Research in Opera and Music Theatre, University of Sussex
D.Ploeger@Sussex.ac.uk
For a complete version of this paper (including references), please email mail@danielploeger.org

The physical qualities of a sounding object can be perceived visually, haptically and auditively. Visual perception is usually regarded as the most important type of sensory perception, but in daily life we gather a great deal of information about the spatial properties of objects primarily from what we hear. Pedestrians and cyclists, for example, make accurate judgements about the location and motion of cars outside their visual range, based on the sounds they hear.

In a normal traffic situation one will see a great number of images and hear many different sounds simultaneously. The perceived position of these sounds is influenced by the perception of interaural differences in timing and amplitude of the sound, and differences in the perceived sound spectrum caused by spectrum transformations in the outer ear. When an image and a sound are perceived as located at the same position, sound and image will be perceived as originating from the same object. Once this connection is made, the object concerned will be perceived as a composition of information that originates from both visual and auditive stimuli.
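The interaural timing difference mentioned above can be made concrete with a small calculation. The following is only a rough sketch, assuming Woodworth's spherical-head approximation; the head radius and the speed of sound are assumed textbook values, and the function name is hypothetical rather than taken from any source cited in this paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C
HEAD_RADIUS = 0.0875    # m, assumed average radius of a human head


def interaural_time_difference(azimuth_deg: float) -> float:
    """Approximate arrival-time difference between the two ears, in seconds.

    Uses Woodworth's spherical-head formula (r / c) * (theta + sin(theta)),
    where theta is the angle of the source from straight ahead.
    The approximation is intended for angles between 0 and 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + math.sin(theta))


if __name__ == "__main__":
    for angle in (0, 30, 60, 90):
        itd_us = interaural_time_difference(angle) * 1e6
        print(f"{angle:3d} degrees -> {itd_us:6.1f} microseconds")
```

Under these assumptions a source directly to one side reaches the nearer ear well under a millisecond before the farther ear, which illustrates how small the timing cues are on which binaural localisation relies.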

The registration of the position of a sound as mentioned above is mainly based on binaural hearing. However, when registering a sound source’s motion, information derived from monaural perception becomes more important. When a sounding object is moving, the observer will register changes in the amplitude of the sound, changes in the perceived frequency of the signal caused by the Doppler effect, a change in the timbre of the sound caused by reflection of the sound off the (static) objects around it and, if there is an additional stationary sound source emitting the same sound, phasing.
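Two of the monaural cues listed above, the Doppler shift and the change in amplitude, can be estimated numerically. The sketch below assumes a stationary listener, a source moving directly towards or away from the listener, and free-field conditions in which the level falls by about 6 dB per doubling of distance; the function names and example values are illustrative assumptions, not measurements.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def doppler_frequency(source_hz: float, radial_speed: float) -> float:
    """Frequency heard by a stationary listener from a moving source.

    radial_speed is in m/s and is positive when the source approaches
    the listener, negative when it moves away.
    """
    return source_hz * SPEED_OF_SOUND / (SPEED_OF_SOUND - radial_speed)


def relative_level_db(distance_m: float, reference_m: float = 1.0) -> float:
    """Level change in dB relative to the level at reference_m.

    Assumes a point source in a free field (inverse-distance law).
    """
    return -20.0 * math.log10(distance_m / reference_m)


if __name__ == "__main__":
    # A 440 Hz tone on a source approaching at 10 m/s (roughly a cyclist's speed).
    print(f"approaching: {doppler_frequency(440.0, 10.0):.1f} Hz")
    print(f"receding:    {doppler_frequency(440.0, -10.0):.1f} Hz")
    print(f"level at 2 m relative to 1 m: {relative_level_db(2.0):.1f} dB")
```

In a real room, reflections alter both cues, so these figures should be read as orders of magnitude only.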

The combination of visual and auditive media to define an object in space can also be applied in multimedia art. One could confront the observer with the visual and auditive perception of moving sounding objects such as loudspeakers, resonating objects (for example a triangle or a bell), or objects that produce sound as a result of their operating mechanism (a mechanical alarm clock or electrical devices such as a vibrator). There are many successful multimedia works and musical compositions that are concerned with the changing position of objects in combination with sound and with the location of sound in space. Good examples are Bill Viola’s multimedia work “He Weeps for You“ (1976) and Karlheinz Stockhausen’s electronic composition “Gesang der Jünglinge“ (“Song of the Younglings“) (1956).

In “He Weeps for You“, water drops fall on an amplified drum that is illuminated by a spotlight. The water comes from a tap that is positioned above the drum. The tap is almost completely closed, so it takes a considerable amount of time until a drop is big enough to fall. Until the drop falls, the beholder sees her own reflection in the drop hanging from the tap, magnified on a video screen. In “Gesang der Jünglinge“ five groups of loudspeakers are positioned all around the audience. Thus sounds can be emitted so that the audience perceives them at specific locations in space.

However, in neither of these works is the combination of visual and auditive information employed to define objects in space. In “He Weeps for You“ the sound of the drum is a significant part of the work, but the position of the drum is only defined by visual information (a spotlight is focused on it). The sound of the drum is amplified and played over loudspeakers that are in a different position from the drum, so the connection between the location of the drum and the perceived sound can only be inferred from the simultaneity of seeing the drop hit the drum and hearing the sound (and from our existing knowledge of the connection between the image and the sound of a drum). In Stockhausen’s work the sound emitted by the loudspeakers is used to create illusory positions of sounds in space, but it is not used to define the location of the actual sound sources (the loudspeakers). In the work, the existence of the loudspeakers as objects is essentially not taken into account.

One of the rare examples of the use of both auditive and visual perception of moving sounding objects can be found in the theatre work “Wenn eine Dolores heisst, muss sie noch lange nicht schön sein“ (2006) (“If someone is called Dolores, that doesn’t necessarily mean that she is beautiful“) by the Swiss theatre director Rüdi Häusermann. In one scene of this work, Häusermann uses loudspeakers connected to pendulums on stage. The audience sees the loudspeakers moving and simultaneously perceives the movement of the loudspeakers auditively.

In this paper, I intend to show that the combination of visual and auditive information to define objects in space is an effective way to create cross-media interaction in a multimedia work. In order to do this I shall demonstrate how existing multimedia theories by the Russian film director Sergei Eisenstein and the British musicologist Nicholas Cook can be applied to the use of sounding objects. First I shall analyse the possibilities of “montage theory“ applied to image and sound as introduced by Eisenstein. This will be followed by a discussion of the application of the “metaphor model“ that Cook suggests in his book “Analysing Musical Multimedia“. I shall conclude with an analysis of some of the cross-media interactions concerning the moving sounding object in my performance installation “DRILL (2007)“.

In his book “Film Sense“ Sergei Eisenstein explains the basic principle of montage theory in film as follows: “two film pieces, of any kind, placed together, inevitably combine into a new concept, a new quality, arising out of that juxtaposition“. He then extends this theory to the connection between sound and picture. In the same book he says that “[t]here is no fundamental difference in the approach to be made to the problems of purely visual montage and to a montage that links different spheres of feeling – particularly the visual image with the sound image.“ The attractive aspect of the application of montage theory as a means of cross-media connection is that one can connect any sound material with any image. However, in the case of the combination of film and sound in the way Eisenstein suggests, it is not possible to make specific connections (connections between part of the presented images and part of the sound material). This means that there can be only one montage-based connection of sound and image at a time. By creating a situation where a sound and an image are experienced as originating from one specific point in space, this problem could be solved. In this way it is possible to use a number of images and sounds simultaneously, each sound spatially connected to a different image. Another advantage of this approach is that the experienced connection between the media, and thus the effect of the montage, will be much stronger. The more the cross-media connection appears to be “real“ (that is, matches the perceptual experiences of daily life), the stronger the perceived cross-media connection. The location of sound and image at a specific point in space can be achieved by using advanced cinematic techniques such as 3D film projection in combination with spatial sound projection through a large number of loudspeakers.
However, according to my suggestion that a connection that is experienced as more “real“ will appear more powerful, cross-media connections will be even stronger when using sounding objects that can be heard and seen, and that are actually present in the same space as the observer.

In the section above, I discussed the use of simultaneous visual and auditive spatial perception in order to enable the application of montage theory to specific objects. The essence of this technique is based on our ability to locate objects by means of spatial vision and binaural hearing. However, as I mentioned in the introduction of this paper, the auditive registration of qualities of a sounding object is not limited to this form of spatial hearing only. If the object is moving we can also determine its location by analysing monaural information on acoustic changes caused by the movement. This creates possibilities for another form of cross-media interaction.

At present Nicholas Cook’s “Analysing Musical Multimedia“ offers the most recent detailed study of the relation between visual and auditive perception in multimedia art. In this book, Cook introduces a concept of multimedia which he calls the “metaphor model“. This model is based on research by the music psychologists Sandra Marshall and Annabel Cohen that examines the effects of musical soundtracks by studying the perception of a short abstract animation film in combination with different music. Founded on their observations of how film and sound mutually influence the way they are perceived, Marshall and Cohen introduced a model of interaction between the media that is based on the ascription of characteristics present in one medium to the other (diagram 1).

diagram 1


© 1988 by the Regents of the University of California. Reprinted from Music Perception, Volume 6 (1988), Figure 8b.

The diagram shows that if music and film have feature ’a’ in common, other characteristics of the music (’x’) will be ascribed to ’a’. Since ’a’ is also a feature of the film, the characteristics ’x’ will also be ascribed to the film. Cook compares Marshall and Cohen’s diagram to the structural representation of metaphor. He states that the pre-condition of metaphor is “enabling similarity“; the terms should have attributes in common. The meaning of a metaphor then results from the ascription of non-common attributes from one term to the other. Applied to multimedia, Cook identifies quasi-synesthesia and kinesia as possible common attributes. His definition of quasi-synesthesia is based on Lawrence Marks’s study “The Unity of the Senses: Interrelations among the Modalities“. Marks suggests that “there is a direct correlation between the sound frequency that characterizes a vowel and the brightness of the colour associated with it: that is to say, its position on a scale from black to white“. Cook defines quasi-synesthesia as the correspondence between pitch or sound colour and visual brightness (as opposed to a direct connection between pitch and colour as suggested in pure synesthesia). As the most important form of kinetic correspondence, Cook discusses correspondence between “the kinesis that results from the combination of rhythm, harmony, dynamics and other musical elements“ and the visible movement of objects.

However, Cook limits the application of his metaphor model to the combination of soundtrack and film (except for a short elaboration on music and record sleeves). Therefore he does not take into account the possibility of using the simultaneous visual and auditive perception of an object in the observer’s presence as a basis for cross-media interaction. If we want to apply Cook’s model to the use of moving sounding objects, it is obviously necessary that the object can be seen and that it produces sound. In practice this means that the basis for cross-media interaction would be the perception of a plain sine wave combined with the visual perception of the actual sound source (the actual loudspeaker, the resonating body of a musical instrument or the part of a mechanism that produces its sound). The cross-media connection would then be established on the basis of kinetic correspondences between the effects of the object’s movement on the visual and on the auditive perception of the object. When we move a loudspeaker whilst it is emitting a sine wave, we will notice changes in the sound of the sine wave as well as changes in the visual perception of the loudspeaker. If the sound source is attached to another object and more attributes are added to the plain sine wave, the cross-media connection between the visual and the auditive material originating from the results of the movement will cause us to mutually ascribe the added attributes to the other medium in the manner Cook describes in his metaphor model.

However, this representation of the cross-media interaction is not completely accurate. The sine wave mentioned above is a precondition that enables correspondence. It is not part of the correspondence itself (as one would expect from the ways Cook applies his model). The correspondence is based on changes in the auditive and visual perception of the object’s qualities, not on the emitted sound or on the actual shape of the object. Thus the difference between the kind of kinetic relationships that Cook discusses and the type explained above is that, in the case of a moving sounding object, certain attributes do not merely “direct attention to“ certain features of the other medium; the corresponding attributes actually are features of the same object, which is expressed visually and auditively. The significance of this difference becomes apparent in the following examples:

In the case of the moving loudspeaker in combination with the sound at the end of my performance installation “DRILL (2007)“, for loudspeaker object, video and performer, the observer will experience a correspondence between the sound material and the loudspeaker object that the performer is moving, because the visually perceived movements of the object correspond to the auditively perceived movements. Since the cross-media correspondence originates from the movement of the sounding object, the correspondence is not affected by intrinsic qualities of the sound that is emitted. The correspondence is added to the sound material; therefore the sound that is emitted by the sounding object does not need to conform to preconditions such as kinetic or quasi-synesthetic correspondences in order to enable cross-media interaction.

When watching the beginning of the opening scene of Eisenstein’s “Battleship Potemkin“ with the soundtrack composed by Meisel, most people will experience the obvious kinetic correspondence between the timing of the music and the breaking of the waves. However, this connection is experienced as artificial: apart from the kinetic correspondence there is nothing in the sound of the music that is reminiscent of the real sound of the sea (and nothing in the image of the sea that reminds us of the music). It would be possible to change the sound material so that the experience of the music matches the visually perceived waves more closely, but here it becomes apparent that the reliance of the correspondences on qualities intrinsic to the emitted sound creates a problematic restriction: the more one medium conforms to another, the stronger the cross-media correspondences. However, a greater conformance also limits the “space“ that is left for ascription. If, for example, a sound mainly consists of material that quasi-synesthetically and kinetically corresponds with an image, there is not much possibility to add independent features that can be ascribed to the object.

In summary, the analyses in this paper lead to the following conclusions: the application of montage theory in combination with sounding objects that are visually and auditively present enables multiple simultaneous montages. In addition, this leads to stronger cross-media connections because the connections between the different media are experienced as more real. The use of moving sounding objects within the context of Cook’s metaphor model enables strong cross-media correspondences that do not limit the possibilities of ascription (as is the case in the kinetic and quasi-synesthetic correspondences between film and sound) because the cross-media correspondences are not intrinsic to the emitted sound material and the shape of the sounding object.

I would like to conclude this paper with an analysis of cross-media interactions in “DRILL (2007)“. This piece offers examples of both the application of montage theory and ascription according to the metaphor model.

In “DRILL“, the performer moves a loudspeaker object with horizontal gestures that refer to the vertical movements of the drill that is used in the video. The pitches of the drill sound are modified both electronically and by the motion of the loudspeaker object. The loudspeaker object is positioned in front of the audience, whilst another loudspeaker is placed behind the audience. As a result, the audience registers beating and phasing between the signals of the two loudspeakers when the loudspeaker object is moving.
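The beating described here can be estimated with a simple model. The sketch below is an illustration only: it assumes that both loudspeakers emit the same sine wave, that the rear loudspeaker is stationary, and that the loudspeaker object moves directly towards or away from a listener; the tone frequency and hand speeds are hypothetical values, not taken from the piece.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def beat_frequency(source_hz: float, radial_speed: float) -> float:
    """Beat rate in Hz between a moving and a stationary loudspeaker.

    Both loudspeakers are assumed to emit the same sine wave at source_hz.
    radial_speed is the speed of the moving loudspeaker in m/s, positive
    when it approaches the listener and negative when it moves away.
    """
    # Doppler-shifted frequency of the moving loudspeaker at the listener.
    shifted_hz = source_hz * SPEED_OF_SOUND / (SPEED_OF_SOUND - radial_speed)
    # The listener hears beats at the difference between the two frequencies.
    return abs(shifted_hz - source_hz)


if __name__ == "__main__":
    # Hypothetical example: a 1000 Hz component, loudspeaker moved by hand.
    for speed in (0.5, 1.0, 2.0):
        rate = beat_frequency(1000.0, speed)
        print(f"{speed:.1f} m/s -> about {rate:.1f} beats per second")
```

Under these assumptions even a slow hand movement produces beats of a few hertz, and a faster gesture produces faster beating, which is consistent with the audible movement of the loudspeaker object following its visible gesture.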

diagram 2



The cross-media connections in “DRILL“ are represented in diagram 2. The connection between the sound material and the drill on the video originates from most people’s existing knowledge of the visual and auditive qualities of a drill. A cross-media connection between the sound of the drill and the loudspeaker object is then established by means of montage. This means that the image of the drill is indirectly connected to the loudspeaker object. On the other side of the diagram, the direct connection between the video and the loudspeaker object is indicated. This connection is based on a visual correspondence between the gestures of the operator of the drill on the video and the gestures of the performer and the loudspeaker object. The gestures of the loudspeaker object that the observer perceives visually then correspond with the changes in perception of the sound caused by the movement. Therefore one could say that the gestures in the video are indirectly connected to the sound material.

Thus the cross-media interactions between the loudspeaker object and the sound track are based on both montage theory and the metaphor model. On the one hand, the sound is connected to the loudspeaker object by montage, because the observer locates the origin of the sound and the visual perception of the object in the same place. Therefore certain features of the sound (the fact that it is a drill sound) are transferred to the loudspeaker object. On the other hand, certain qualities associated with the perception of the loudspeaker object are ascribed to the sound because the visually perceived movements of the loudspeaker object (and the performer) correspond with the auditively perceived movements that are reflected in the perception of the sound. Toward the end of the piece, the performer is instructed to move the loudspeaker object with the greatest possible intensity. At this point the density of the sound material on the pre-recorded sound track is the lowest in the whole piece. Apart from a few interruptions with overtone-like constructions, there is just the plain sound of a drill. The aggression of the performer’s physical movement produces changes in the perceived sound material. These changes are quite small, so the kinetic qualities of the sound by itself would usually not make one think of it as aggressive in character. This characteristic is gained as a result of cross-media ascription.