> So while we do understand how the vector model
> works for text analysis, do we understand how to apply
> it to a *text* that includes video and audio as integral
> parts of the *text*, and can we combine these into a
> higher-level vector space term?
Providing metadata for rich types is an area that has seen some interesting
work. Besides the Dublin Core ViDe initiative, I came across several relevant
papers and products when researching this recently:
1. "Facilitating Video Access by Visualizing Automatic Analysis"
http://www.fxpal.com/publications/FXPAL-PR-99-045.pdf
"Metadata for video materials can be derived from the analysis of the audio and
video streams. For audio, we identify features such as silence, applause, and
speaker identity. For video, we find features such as shot boundaries,
presentation slides, and close-ups of human faces."
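As a loose illustration of the video side, here's a minimal sketch of one such
analysis, shot-boundary detection via frame-to-frame color histogram
differences. It assumes OpenCV's Python bindings are available, and the 0.5
threshold is an arbitrary placeholder, not a value from the paper:

# Minimal shot-boundary sketch: flag frames whose color histogram
# differs sharply from the previous frame's. Assumes OpenCV (cv2).
import cv2

def shot_boundaries(path, threshold=0.5):
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8-bin BGR histogram, normalized so distances are comparable
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance: 0 = identical, 1 = disjoint
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(index)  # likely cut just before this frame
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries

Real systems like the one in the paper combine several such cues (slides,
faces, audio events); histogram differencing is just the simplest one.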
2. Yahoo has recently taken the RSS approach. Its video RSS feeds carry a
textual description of each item's attributes, such as height, width, bitrate,
and running time:
http://www.webservicessummit.com/Channels/WebServicesSummitAudioVideo.rss
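As an illustration, here's a minimal sketch of pulling those attributes out of
a Media RSS item with Python's standard library. The media namespace URI is the
one from the Media RSS spec; the sample item itself is made up:

# Sketch: extract size/bitrate/duration attributes from a Media RSS item.
# Uses only the standard library; the sample feed below is hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel><item>
    <title>Example talk</title>
    <media:content url="http://example.com/talk.mpg" type="video/mpeg"
                   height="240" width="320" bitrate="128" duration="1800"/>
  </item></channel>
</rss>"""

ns = {"media": "http://search.yahoo.com/mrss/"}
root = ET.fromstring(SAMPLE)
for content in root.findall(".//media:content", ns):
    print({k: content.get(k) for k in ("height", "width", "bitrate",
                                       "duration")})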
3. SQL implementations such as DB2 UDB support content-based querying over rich
types. DB2 has an Image Extender and an Audio Extender with corresponding types
(DB2IMAGE, DB2AUDIO). The Audio Extender analyzes the content and stores values
such as whether it's 16-bit audio, samples per second, playing time, the number
of clock ticks per quarter note, and so on. The Image Extender stores
information that lets you supply an image and search for matches based on color
and texture (contrast, directionality, etc.); a rough sketch of the
color-matching idea follows.
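The color half of that kind of matching is simple enough to sketch generically.
This is a toy histogram-intersection comparison assuming the PIL imaging
library; it is not DB2's actual implementation:

# Generic content-based image matching sketch: compare normalized RGB
# histograms by histogram intersection. Assumes the PIL library.
from PIL import Image

def color_histogram(path, size=(64, 64)):
    img = Image.open(path).convert("RGB").resize(size)
    hist = img.histogram()               # 768 bins: 256 per channel
    total = float(sum(hist))
    return [h / total for h in hist]

def similarity(path_a, path_b):
    ha, hb = color_histogram(path_a), color_histogram(path_b)
    # Histogram intersection: 1.0 means identical color distributions
    return sum(min(a, b) for a, b in zip(ha, hb))

Texture measures like contrast and directionality would be additional feature
dimensions computed the same way and combined into one distance.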
IBM's CueVideo software uses speech recognition technology to generate text from
the audio tracks of videos -- which could then be fed into an engine that uses
the vector space model and textual similarity matching described in my previous
message:
http://www.almaden.ibm.com/projects/data/CueVideo.pdf
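To make that pipeline concrete, here's a minimal vector space model sketch over
some made-up transcript strings: plain term-frequency vectors and cosine
similarity, nothing more. A real engine would add weighting such as TF-IDF:

# Minimal vector space model sketch: term-frequency vectors plus cosine
# similarity, applied to hypothetical speech-recognized transcripts.
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

transcripts = {
    "clip1": "speaker discusses vector space retrieval models",
    "clip2": "panel on video retrieval and vector models",
    "clip3": "cooking segment with no related content",
}
query = vectorize("vector space model retrieval")
for name, text in transcripts.items():
    print(name, round(cosine(query, vectorize(text)), 3))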
4. This paper discusses the analysis of digital music using similarity
matrices: "Media Segmentation using Self-Similarity Decomposition"
http://www.fxpal.com/people/cooper/Papers/SPIE02.pdf
"We assume only that the audio or music exhibits instances of similar segments,
possibly separated by other
segments. For example, a common popular song structure is ABABCAB, where A is a
verse segment, B is
the chorus, and C is the bridge or "middle eight." We would hope to be able to
group the segments of this song
into three clusters corresponding to the three different parts. Once this is
done, the song could be summarized
by presenting only the novel segments. In this example, the sequence ABC is a
significantly shorter summary
containing essentially all the information in the song."
3.1. Clustering via similarity matrix decomposition
To cluster the segments, we factor a segment-indexed similarity matrix to find
repeated or substantially similar
groups of segments."