Multimodal AI needs active human interaction

Consider that you are comfortably seated on your sofa, reading this article in Nature Human Behaviour. Your AI assistant says: “I see you are reading about multimodal AI. But you asked me to remind you that we need to finalize the visuals for your presentation tomorrow morning. I’ve gone through the audience feedback on your last presentation and have mocked up new illustrations to introduce your idea.”

Current multimodal AI models have the ingredients for this sort of interaction. Many real-life tasks, such as driving and medical diagnosis1, are difficult to solve through verbal communication alone and require multimodal information. Recent commercial general-purpose AI systems are equipped with vision and audition (for example, GPT-4o, Gemini 1.5 and Claude 3). Techniques such as retrieval-augmented generation are being developed to enable large language models (LLMs) to use multimodal databases2. Portable multimodal AI devices (for example, the handheld rabbit r1, the wearable Ai Pin and Ray-Ban Meta smart glasses) are being developed to provide assistance in the physical world. This multimodal trend greatly increases the range of problems that AI tools can solve or assist with. It also opens up real-time human–AI communication channels through voice and facial expression.
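
To make the retrieval-augmented generation step above concrete, the sketch below is a minimal, self-contained illustration rather than the method of any particular system: a toy multimodal store holds text, image and audio entries with placeholder embedding vectors, a query is matched by cosine similarity, and the retrieved items are folded into the prompt that would be passed to an LLM. The `embed` function, the store contents and the vector values are all hypothetical stand-ins; a real system would use a multimodal encoder and a vector database.

```python
from math import sqrt

# Toy multimodal "database": each entry has a modality tag, a short
# description, and a placeholder embedding vector. In practice these vectors
# would come from a multimodal encoder (e.g., a vision-language model).
STORE = [
    {"modality": "image", "content": "slide_3_diagram.png: pipeline overview figure",
     "vec": [0.9, 0.1, 0.0]},
    {"modality": "text", "content": "Audience feedback: diagrams were too dense",
     "vec": [0.7, 0.3, 0.1]},
    {"modality": "audio", "content": "Voice memo: emphasise the real-time demo",
     "vec": [0.1, 0.8, 0.2]},
]

def embed(query: str) -> list[float]:
    """Placeholder embedding; a real system would call a multimodal encoder."""
    return [0.8, 0.2, 0.1]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the k store entries most similar to the query embedding."""
    qvec = embed(query)
    return sorted(STORE, key=lambda e: cosine(qvec, e["vec"]), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Fold retrieved multimodal context into the prompt given to the LLM."""
    context = "\n".join(f"[{e['modality']}] {e['content']}" for e in retrieve(query))
    return f"Context:\n{context}\n\nUser request: {query}"

if __name__ == "__main__":
    print(build_prompt("Redesign the visuals for tomorrow's presentation"))
```

In a deployed assistant, the augmented prompt would be sent to a multimodal model; the retrieval step is what lets the model ground its response in the user's own slides, feedback and recordings rather than in its training data alone.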
