
Visual Prompting

Last updated on October 3, 2024 by Valeriia Kuka
Overview of Visual Prompting

What is "By My Eyes" Approach?

"By My Eyes" is a novel approach for integrating sensor data into Multimodal Large Language Models (MLLMs) by transforming long sequences of sensor data into visual inputs, such as graphs and plots. This method uses visual prompting to guide MLLMs in performing sensory tasks (e.g., human activity recognition, health monitoring) more efficiently and accurately than text-based methods.

Problem with Text-Based Sensor Data

Text-based methods that feed raw sensor data into LLM prompts face several challenges:

  • Long sequences of sensor readings inflate computational cost (more tokens); see the sketch after this list.
  • MLLMs struggle to recognize patterns in large numeric sequences.
  • Accuracy on sensory tasks (e.g., recognizing motion from accelerometer data) is limited.
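
To make the token cost concrete, here is a minimal sketch. The `tiktoken` tokenizer is an assumption standing in for an MLLM's own text tokenizer, and the synthetic data and resulting counts are illustrative, not figures from the paper:

```python
import numpy as np
import tiktoken  # stand-in tokenizer; real MLLM tokenizers vary

# Simulate 10 seconds of 3-axis accelerometer data at 100 Hz (synthetic).
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(1000, 3))

# Text prompting: serialize every reading into the prompt.
text = " ".join(f"{x:.3f},{y:.3f},{z:.3f}" for x, y, z in samples)

enc = tiktoken.get_encoding("cl100k_base")
print(f"tokens for raw text serialization: {len(enc.encode(text))}")
# A single rendered plot of the same window typically costs a fixed,
# much smaller number of image tokens (model-dependent).
```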

How Does "By My Eyes" Work?

"By My Eyes" introduces visual prompts to represent sensor data as images (e.g., waveforms, spectrograms), making it easier for MLLMs to interpret. The key innovation is a visualization generator that automatically converts sensor data into optimal visual representations. This reduces token costs and enhances performance across various sensory tasks.

Steps of the method (sketched in code after the list):

  1. Sensor Data Visualization: Instead of feeding raw sensor data as text, the data is visualized (e.g., a plot of accelerometer readings).
  2. Visual Prompt Design: The MLLM receives the visualized data along with task-specific instructions in a prompt to solve sensory tasks.
  3. Visualization Generator: A tool that automatically selects the most appropriate visualization method (e.g., waveform, spectrogram) for each sensory task, ensuring optimal MLLM performance.
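
Below is a minimal end-to-end sketch of these three steps. The plotting choices, the `choose_visualization` rule, and the chat-with-image message format are illustrative assumptions, not the paper's actual generator:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np


def render_waveform(signal: np.ndarray, sample_rate: int) -> bytes:
    """Step 1: visualize raw sensor readings as a waveform plot (PNG bytes)."""
    t = np.arange(len(signal)) / sample_rate
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.plot(t, signal, linewidth=0.8)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("amplitude")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=100, bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


def render_spectrogram(signal: np.ndarray, sample_rate: int) -> bytes:
    """Alternative visualization for frequency-rich signals."""
    fig, ax = plt.subplots(figsize=(6, 2))
    ax.specgram(signal, Fs=sample_rate)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("frequency (Hz)")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=100, bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


# Step 3 (simplified): pick a visualization per task.  The paper's generator
# optimizes this choice automatically; a hand-written rule stands in here.
def choose_visualization(task: str):
    return render_spectrogram if task == "arrhythmia detection" else render_waveform


def build_visual_prompt(task: str, signal: np.ndarray, sample_rate: int) -> list:
    """Step 2: pair the rendered image with task-specific instructions.

    The message structure below follows a common chat-with-image format;
    adapt it to whatever MLLM API you actually call.
    """
    png = choose_visualization(task)(signal, sample_rate)
    image_b64 = base64.b64encode(png).decode()
    return [
        {"type": "text",
         "text": f"The image shows a sensor recording. Task: {task}. "
                 "Answer with the most likely label."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]


# Example: 5 s of synthetic accelerometer magnitude at 50 Hz.
signal = np.sin(np.linspace(0, 20 * np.pi, 250)) + 0.1 * np.random.randn(250)
prompt = build_visual_prompt("human activity recognition", signal, sample_rate=50)
```

The key design point is that the image replaces thousands of numeric tokens with a single visual input, while the text portion of the prompt carries only the task instructions.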

Results of "By My Eyes"

"By My Eyes" was tested on nine sensory tasks across four modalities (accelerometer, ECG, EMG, and respiration sensors). The approach consistently outperformed text-based prompts, showing:

| Dataset | Modality      | Task                       | Text Prompt Accuracy | Visual Prompt Accuracy | Token Reduction |
|---------|---------------|----------------------------|----------------------|------------------------|-----------------|
| HHAR    | Accelerometer | Human activity recognition | 66%                  | 67%                    | 26.2×           |
| PTB-XL  | ECG           | Arrhythmia detection       | 73%                  | 80%                    | 3.4×            |
| WESAD   | Respiration   | Stress detection           | 48%                  | 61%                    | 49.8×           |

The visual prompts also led to more efficient use of tokens, allowing MLLMs to handle larger datasets and more complex tasks without sacrificing accuracy.

Conclusion

The "By My Eyes" method provides a cost-effective and performance-boosting solution for handling sensor data in MLLMs. By transforming raw sensor data into visual prompts, it addresses the limitations of text-based approaches, making it easier for LLMs to solve real-world sensory tasks in fields like healthcare, environmental monitoring, and human activity recognition.
