Apple, a key player in the tech industry, is making significant strides in artificial intelligence (AI). Although the company has not launched any AI models recently, Apple researchers have unveiled Ferret-UI, a new language model designed to understand mobile user interface (UI) screens. As a multimodal large language model (MLLM), it differs from typical language models by incorporating a deep understanding of content beyond text, such as images and audio.
Ferret-UI is equipped with advanced capabilities such as “referring, grounding, and reasoning,” enabling it to understand UI screens in full and execute tasks based on their content. To address the challenge of identifying small on-screen elements, the researchers added a feature that magnifies details on the screen so they can be recognized accurately, as the sketch below illustrates.
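Apple has not released reference code for this magnification step, but the general idea can be sketched roughly: the screenshot is split into sub-images that are each upscaled before being encoded, so tiny icons and labels occupy more pixels. The Python below is only an illustrative assumption of that approach; the grid size, target resolution, and function name are hypothetical and not taken from the Ferret-UI paper.

```python
from PIL import Image

def split_and_magnify(screenshot_path, grid=(2, 2), target=(448, 448)):
    """Illustrative sketch: divide a UI screenshot into sub-images and
    upscale each one so small elements (icons, labels) are easier to
    recognize. Grid size and target resolution are assumptions, not
    values from the Ferret-UI paper."""
    img = Image.open(screenshot_path).convert("RGB")
    w, h = img.size
    cols, rows = grid
    tile_w, tile_h = w // cols, h // rows

    sub_images = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            # Crop one region of the screen and magnify it to the target size.
            tile = img.crop(box).resize(target, Image.BICUBIC)
            sub_images.append(tile)

    # A downscaled overview of the full screenshot plus the magnified tiles
    # would then be passed to the multimodal encoder alongside the prompt.
    overview = img.resize(target, Image.BICUBIC)
    return overview, sub_images
```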
Comparative evaluations against OpenAI’s GPT-4V show Ferret-UI performing better on elementary tasks on both iPhone and Android screens, including icon recognition, OCR, and widget classification. Although GPT-4V holds a slight edge at grounding findings from conversation, Ferret-UI’s approach of generating raw screen coordinates shows substantial promise for UI-related applications, as illustrated below.
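To make the notion of “raw coordinates” concrete, a grounded answer might carry a bounding box that downstream software maps back to screen pixels before highlighting or tapping an element. The response format, regular expression, and helper below are purely hypothetical; Apple has not published an interface for Ferret-UI.

```python
import re
from typing import List, Tuple

# Hypothetical grounded response: the model names an element and emits a
# normalized bounding box [x1, y1, x2, y2] in the 0-1 range.
SAMPLE_RESPONSE = 'The "Send" button is at [0.72, 0.88, 0.95, 0.94].'

BOX_PATTERN = re.compile(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]")

def boxes_to_pixels(response: str, screen_w: int, screen_h: int) -> List[Tuple[int, int, int, int]]:
    """Convert normalized boxes in a model response to pixel coordinates.
    Purely illustrative of how raw-coordinate output could be consumed."""
    boxes = []
    for match in BOX_PATTERN.finditer(response):
        x1, y1, x2, y2 = map(float, match.groups())
        boxes.append((int(x1 * screen_w), int(y1 * screen_h),
                      int(x2 * screen_w), int(y2 * screen_h)))
    return boxes

print(boxes_to_pixels(SAMPLE_RESPONSE, screen_w=1170, screen_h=2532))
# -> [(842, 2228, 1111, 2380)]
```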
While the specific applications of Ferret-UI remain undisclosed, the researchers emphasize that the model could significantly enhance UI-related functionality. Advances like Ferret-UI hint at improvements to Siri, potentially enabling the assistant to carry out tasks based on what appears on a user’s app screens.
These developments in Apple’s AI research align with the broader trend of AI-driven gadgets such as the Rabbit R1, which perform tasks autonomously without step-by-step user instructions. Integrating Ferret-UI’s capabilities into Siri could pave the way for a more proactive and versatile virtual assistant, opening up new possibilities for streamlined, intuitive interactions.