
Visual Prompting

Last updated on October 3, 2024 by Valeriia Kuka
Overview of Visual Prompting

What is "By My Eyes" Approach?

"By My Eyes" is a novel approach for integrating sensor data into Multimodal Large Language Models (MLLMs) by transforming long sequences of sensor data into visual inputs, such as graphs and plots. This method uses visual prompting to guide MLLMs in performing sensory tasks (e.g., human activity recognition, health monitoring) more efficiently and accurately than text-based methods.

Problem with Text-Based Sensor Data

Text-based methods that feed raw sensor data into LLM prompts face several challenges:

  • Long sequences of sensor readings inflate computational cost (more tokens); see the sketch after this list.
  • MLLMs struggle to recognize patterns in large numeric sequences.
  • Accuracy on sensory tasks (e.g., recognizing motion from accelerometer data) is limited.
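
To make the token cost concrete, here is a minimal sketch. The `tiktoken` tokenizer is an assumption standing in for an MLLM's own text tokenizer, and the synthetic data and resulting counts are illustrative, not figures from the paper:

```python
import numpy as np
import tiktoken  # stand-in tokenizer; real MLLM tokenizers vary

# Simulate 10 seconds of 3-axis accelerometer data at 100 Hz (synthetic).
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(1000, 3))

# Text prompting: serialize every reading into the prompt.
text = " ".join(f"{x:.3f},{y:.3f},{z:.3f}" for x, y, z in samples)

enc = tiktoken.get_encoding("cl100k_base")
print(f"tokens for raw text serialization: {len(enc.encode(text))}")
# A single rendered plot of the same window typically costs a fixed,
# much smaller number of image tokens (model-dependent).
```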

How Does "By My Eyes" Work?

"By My Eyes" introduces visual prompts to represent sensor data as images (e.g., waveforms, spectrograms), making it easier for MLLMs to interpret. The key innovation is a visualization generator that automatically converts sensor data into optimal visual representations. This reduces token costs and enhances performance across various sensory tasks.

Steps of the method (sketched in code after the list):

  1. Sensor Data Visualization: Instead of feeding raw sensor data as text, the data is visualized (e.g., a plot of accelerometer readings).
  2. Visual Prompt Design: The MLLM receives the visualized data along with task-specific instructions in a prompt to solve sensory tasks.
  3. Visualization Generator: A tool that automatically selects the most appropriate visualization method (e.g., waveform, spectrogram) for each sensory task, ensuring optimal MLLM performance.
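
Below is a minimal end-to-end sketch of these three steps. The plotting choices, the `choose_visualization` rule, and the chat-with-image message format are illustrative assumptions, not the paper's actual generator:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np


def render_waveform(signal: np.ndarray, sample_rate: int) -> bytes:
    """Step 1: visualize raw sensor readings as a waveform plot (PNG bytes)."""
    t = np.arange(len(signal)) / sample_rate
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.plot(t, signal, linewidth=0.8)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("amplitude")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=100, bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


def render_spectrogram(signal: np.ndarray, sample_rate: int) -> bytes:
    """Alternative visualization for frequency-rich signals."""
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.specgram(signal, Fs=sample_rate)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("frequency (Hz)")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=100, bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


# Step 3 (simplified): pick a visualization per task.  The paper's generator
# optimizes this choice automatically; a hand-written rule stands in here.
def choose_visualization(task: str):
    return render_spectrogram if task == "arrhythmia detection" else render_waveform


def build_visual_prompt(task: str, signal: np.ndarray, sample_rate: int) -> list:
    """Step 2: pair the rendered image with task-specific instructions.

    The message structure below follows a common chat-with-image format;
    adapt it to whatever MLLM API you actually call.
    """
    png = choose_visualization(task)(signal, sample_rate)
    image_b64 = base64.b64encode(png).decode()
    return [
        {"type": "text",
         "text": f"The image shows a sensor recording. Task: {task}. "
                 "Answer with the most likely label."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]


# Example: 5 s of synthetic accelerometer magnitude at 50 Hz.
signal = np.sin(np.linspace(0, 20 * np.pi, 250)) + 0.1 * np.random.randn(250)
prompt = build_visual_prompt("human activity recognition", signal, sample_rate=50)
```

The key design point is that the image replaces thousands of numeric tokens with a single visual input, while the text portion of the prompt carries only the task instructions.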

Results of "By My Eyes"

"By My Eyes" was tested on nine sensory tasks across four modalities (accelerometer, ECG, EMG, and respiration sensors). The approach consistently outperformed text-based prompts, showing:

| Dataset | Modality      | Task                       | Text Prompt Accuracy | Visual Prompt Accuracy | Token Reduction |
|---------|---------------|----------------------------|----------------------|------------------------|-----------------|
| HHAR    | Accelerometer | Human activity recognition | 66%                  | 67%                    | 26.2×           |
| PTB-XL  | ECG           | Arrhythmia detection       | 73%                  | 80%                    | 3.4×            |
| WESAD   | Respiration   | Stress detection           | 48%                  | 61%                    | 49.8×           |

The visual prompts also led to more efficient use of tokens, allowing MLLMs to handle larger datasets and more complex tasks without sacrificing accuracy.

Conclusion

The "By My Eyes" method provides a cost-effective and performance-boosting solution for handling sensor data in MLLMs. By transforming raw sensor data into visual prompts, it addresses the limitations of text-based approaches, making it easier for LLMs to solve real-world sensory tasks in fields like healthcare, environmental monitoring, and human activity recognition.
