Future of Interaction with LLMs

Oct 7th, 2023
technical


Builds on this tweet thread by Elizabeth Laraki

The current UI for interacting with LLMs mimics messaging: we type in our queries and receive the model's replies in a flat, linear thread. With the latest developments the models have become multimodal (e.g. we can now upload images), but the interaction design is still constrained by that same linear format (think of how sharing a picture works in a messaging app). The future of interacting with these LLMs will see a paradigm shift.

  • Voice interaction instead of text, on both the user and model side (straightforward to imagine, but limiting on its own, since voice doesn't enable a new mode of expression beyond text).
  • Uploading an image/video and reasoning over it through voice, with actions like zooming in, drawing over the image, or seeking to a timestamp (as described in the tweet thread).
  • The model replying with images/video.
  • The model manipulating the image/video: zooming in, circling, or drawing arrows to direct attention, or extracting key frames from a video.

It would be interesting to see how this interaction evolves. I imagine that if we asked a future model to explain the circulatory system, it would output an image showing the inner organs and then animate how blood flows from the heart to the body and back. We would be able to point at an organ, ask what role it plays in the system, and have the model answer back in voice.
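
To make this a bit more concrete, here is a rough sketch of how such a multimodal turn might be represented. Everything here is hypothetical (the Annotation and ModelTurn structures and the ask stub are made up for illustration, not any real API); the point is just that a reply stops being flat text and becomes a bundle of voice, media, and annotations that direct attention within that media.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A manipulation the model applies to the shared image/video:
    zooming in, circling, drawing an arrow, or seeking to a timestamp."""
    kind: str                      # "zoom" | "circle" | "arrow" | "seek"
    target: tuple[float, float]    # normalized (x, y) in the image; (t, 0) for a seek
    note: str = ""                 # spoken explanation tied to this annotation


@dataclass
class ModelTurn:
    """A model reply that is no longer flat text: voice plus media plus
    annotations directing the user's attention within that media."""
    voice_transcript: str
    media: str | None = None                   # e.g. a generated diagram or animation
    annotations: list[Annotation] = field(default_factory=list)


def ask(user_voice: str, pointer: tuple[float, float] | None = None) -> ModelTurn:
    """Hypothetical entry point: the user speaks, optionally pointing at a spot
    in the currently displayed image. A real system would call a multimodal
    model here; this stub only shows the shape of the exchange."""
    if pointer is not None:
        transcript = f"You are pointing at {pointer}; that is the heart."
    else:
        transcript = "Here is a diagram of the circulatory system."
    return ModelTurn(
        voice_transcript=transcript,
        media="circulatory_system_animation.mp4",   # placeholder filename
        annotations=[
            Annotation(kind="circle", target=pointer or (0.5, 0.4),
                       note="Blood leaves the heart through the aorta."),
        ],
    )


# Example interaction: ask for an explanation, then point at an organ and ask again.
first = ask("Explain the circulatory system to me.")
follow_up = ask("What does this organ do?", pointer=(0.48, 0.41))
print(first.voice_transcript)
print(follow_up.voice_transcript)
```

The design point worth noting is that pointing and circling live in the same structure as the words, so "look here" becomes part of the reply itself rather than something the user has to reconstruct from prose.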

On a side note: can the newer multimodal LLMs answer "Where's Waldo?"
