Say See 音外话

Jiabao Li, Steven Wilkinson, 2024

"SaySee" is a speech-to-video generative AI system that transforms spoken words in any language into visually emotive videos. By analyzing the emotional tone of speech, it converts this into a corresponding visual narrative. For instance, a political debate's intensity could be visualized in a dynamic comic book style with exclamation bubbles. News coverage of serious topics like war would evoke a somber black-and-white palette, capturing the gravity of the situation. Conversely, happy conversations would lead to bright and colorful videos. The video captures both the content and the underlying emotion beyond the words.

Upon entering the installation, participants simply press a button and start talking into the microphone. They can engage in a variety of spoken expressions, including conversations, monologues, debates, story games, speeches, confessions, and even singing. The AI then swiftly creates a short video representation of their speech, displayed across seven TV sets. Each TV then passes the previously generated video to the next TV. Together the seven TV sets orchestrate a film from the various voices of the participants.

We also trained SaySee on custom datasets such as glaciers, bats, and dead birds caused by industrial pollution. For each unique SaySee, the generated video will be based on that dataset. For example, if you talk to SaySee#deadbirds, no matter how happy your story is, the generated video will be related to dead birds.

《音外话》是一个语音至视频生成人工智能系统，它将口语转化为充满情感的视觉影像。此系统分析语言中的情感色彩，并将之转换成相应的视觉叙事。例如，激烈的政治辩论会是暴漫风，报道世界苦难严肃的新闻则是忧郁的黑白色调。相反，愉快的对话会转变为明亮且多彩的视频。视频不仅捕捉言语的内容，还捕捉了文字背后的潜在情感。

参与者进入装置后，请按下按钮并开始在麦克风上讲话。你可以进行对话、独白、辩论、故事接龙、演讲、自白等。3分钟后，人工智能会根据你说的话和情感绘制出视频。你也可以看到由此之前的其他参与者的话绘制出的视频。