The big news from the recent WWDC 2024 conference was the introduction of Apple Intelligence. This new set of AI-based features, to be introduced in the upcoming versions of iOS and macOS, has created a huge buzz and sparked the interest of the media.
The company has also announced a partnership with OpenAI and was later reported to be getting an observer seat on the OpenAI board, much as Microsoft did, showing its commitment to implementing AI tools and solutions.
But how exactly does Apple AI work, and what is beneath the marketing buzz? This short guide was prepared by an AI developer and researcher who carefully reviewed the available materials to find out how much Artificial Intelligence (AI) there really is in Apple Intelligence (AI).
LLMs
Apple is rolling out a suite of writing tools powered by LLMs, including ChatGPT. These tools can rewrite text, such as emails, in a specific tone: friendly, professional, or concise. They can also proofread and summarize text. This sounds like a genuinely useful application, given that LLMs excel at paraphrasing.
It will be worth testing whether these features preserve facts and avoid too much “bullshitting” (hallucinating). How well the different styles (professional, friendly) work is also an open question, as ChatGPT is notorious for overusing certain words (like “delve,” “tapestry,” and “realm”).
It will be interesting to see whether Apple has implemented any guidelines to refine these writing styles. The proofreading tool, which should help eliminate typos and grammar errors, is also something to look forward to.
Apple’s whitepaper on language models emphasizes human evaluation of text quality, output harmlessness, and safety benchmarks. This focus might limit some model capabilities, but it also makes it unlikely that we’ll see the model suggesting anything too crazy.
Generative Images
Apple has also introduced image generation features that will enable users to play with the background of their images or to generate emojis on demand.
- Genmoji: Generative emoji that let you create any emoji you want from a text description. It sounds like another attempt in the line of Memoji: exciting in theory, but questionable in practice. Emoji language relies on shared symbols and unwritten rules about use. If everyone has their own visual language, communication could become chaotic. This may deepen the communication mismatches already seen in the Gen-Z vs. Millennial emoji usage debate (as covered by Dictionary.com and Cosmopolitan).
- Image Playground: This tool generates images using text and is seamlessly integrated without the need for external apps. However, its generation capabilities seem quite basic and suitable for only the most straightforward use cases.
The tool offers one setting called “Style” with options like “Animation,” “Illustration,” and “Sketch” to vary your image. Its usefulness is probably more akin to Instagram stickers than Photoshop’s generative fill. I can’t think of a situation where I’d use it in its current demo state. The Image Playground diffusion model runs locally on the device, which limits its generation capabilities but keeps generation fast.
- Image Wand: This tool transforms your sketch (made in the Notes app) into an image. It appears to use the same image generation model, with the user’s initial sketch acting as an additional control input. It also seems to incorporate image recognition from multimodal ChatGPT capabilities, labeling the sketch with tags that are later used as prompts.
Conversational AI
Apple is also imbuing its system with conversational AI capabilities, particularly boosting Siri.
- Siri gets ChatGPT (GPT-4o) integration: Finally, something we’ve all been waiting for. This promises richer understanding capabilities and better recognition of non-native English speakers and of various accents. The personal context feature is fantastic: no more having to remember whether Siri knows my significant other as my boyfriend, husband, or fiancé when making hands-free calls. Additionally, Siri will now be able to search through multiple apps and gather information from your email, messages, calendar, and more.
Apple’s On-Device and Foundation Models
The new tools are powered by Apple’s fine-tuned models, including an on-device language model (with 3 billion parameters) and a larger server language model.
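To get a feel for what 3 billion parameters means on a phone, here is a rough back-of-the-envelope sketch. It only counts memory for the weights themselves, and the bit widths are illustrative assumptions, not Apple's published settings.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB (10**9 bytes)."""
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 1e9


NUM_PARAMS = 3e9  # "3 billion parameters", as stated in the article

# Assumed bit widths, chosen only to show the trend.
for bits in (16, 8, 4):
    size = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{size:.1f} GB")
# 16-bit weights: ~6.0 GB
#  8-bit weights: ~3.0 GB
#  4-bit weights: ~1.5 GB
```

The estimate ignores activations, caches, and runtime overhead, so real memory use would be higher, but it shows why storing weights in fewer bits matters so much for an on-device model.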
- Training Data: The foundation models are trained on licensed data and internet data (scraped by Applebot). This means the models can still be prone to errors and data pollution from the internet. Apple has made efforts to clean the data by removing personally identifiable information, profanity, and low-quality content. Web publishers can opt out of training data collection by adding a rule in their robots.txt, but unfortunately, individuals don’t have this option.
- Training: Training is conducted with Apple’s AXLearn framework, built on JAX and XLA, using techniques like data parallelism and tensor parallelism for efficiency and scalability. Apple also employs human-in-the-loop strategies, such as reinforcement learning from human feedback (RLHF), to refine the models further.
- Model Compression: Apple has achieved impressive performance with compressed models, using low-bit quantization and LoRA adapters. These techniques significantly reduce memory and power requirements while maintaining high model quality. For instance, the on-device model can achieve a time-to-first-token latency of about 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second on the iPhone 15 Pro.
- Privacy: Apple continues to prioritize privacy. Running smaller tasks on the device keeps the data involved local, and Private Cloud Compute, which handles larger requests, is advertised as not storing user data or making it accessible to Apple.
- Multimodal Capabilities: Apple’s foundation models support multimodal data, processing and understanding various types of inputs like text and images. This enhances applications such as image captioning and visual question answering by combining different data types.
- Sustainability: Apple claims efforts in sustainability in the development and deployment of their AI models, focusing on energy-efficient training processes and model architectures to minimize carbon footprint.
- Adapters: These small modules specialize a language model for a given task on the fly, without altering the underlying model’s weights or architecture. Adapters are used for tasks like tone adjustment in text, refinement, and proofreading.
- Evaluation: According to Apple’s own benchmarks, its models produce harmful or sensitive content less often than other LLMs, such as GPT-3.5, GPT-4, or Gemma. On instruction-level and prompt-level accuracy, the on-device model leads the comparable models it was tested against, while the server model is slightly behind GPT-4-Turbo. On writing benchmarks, the server model is similar to GPT-4-Turbo, and the on-device model is superior to the other models tested.
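The latency figures quoted in the Model Compression item above can be turned into a rough estimate of how long a request would take on the iPhone 15 Pro. This is a simple sketch that assumes the two numbers scale linearly; the prompt and output lengths are made-up examples.

```python
from typing import Tuple

MS_PER_PROMPT_TOKEN = 0.6  # time-to-first-token cost per prompt token (from the article)
TOKENS_PER_SECOND = 30.0   # generation rate (from the article)


def estimate_latency_seconds(prompt_tokens: int, output_tokens: int) -> Tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds, assuming linear scaling."""
    ttft = prompt_tokens * MS_PER_PROMPT_TOKEN / 1000.0
    generation = output_tokens / TOKENS_PER_SECOND
    return ttft, ttft + generation


# Hypothetical request: a 500-token email rewritten into 150 tokens.
ttft, total = estimate_latency_seconds(prompt_tokens=500, output_tokens=150)
print(f"time to first token: ~{ttft:.1f} s, total: ~{total:.1f} s")
```

Under these assumptions the first token arrives after about 0.3 seconds, and most of the roughly 5.3 seconds is spent generating the output, so shorter outputs (summaries, proofreading fixes) should feel noticeably snappier than long rewrites.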
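The adapter idea from the list above can be illustrated with a minimal sketch of the general LoRA technique. This is not Apple's implementation: the layer sizes, rank, scaling factor, and all function names are hypothetical, and plain Python lists stand in for real tensors.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = len(A)  # adapter rank
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


def param_counts(d, r):
    """Return (parameters in a full d x d layer, parameters in its rank-r adapter)."""
    return d * d, 2 * d * r


d, r = 8, 2  # toy sizes, assumptions for illustration only
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen base weights, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero-initialized
x = [rng.gauss(0, 1) for _ in range(d)]

# With B at zero the adapter is a no-op, so the output equals the base model's output.
print(lora_forward(W, A, B, x, alpha=4.0) == matvec(W, x))  # True

full, adapter = param_counts(2048, 16)
print(full, adapter)  # 4194304 65536
```

Because only the small A and B matrices differ per task, several adapters can share one set of base weights and be swapped depending on the feature in use, which fits the description of separate adapters for tone adjustment and proofreading.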
Summary
It seems Apple is focused on a custom approach with many different model versions tailored to specific tasks, rather than a single “know-it-all” model like ChatGPT. This approach should benefit users by improving quality on those specific tasks. It will be interesting to see how tasks are divided between on-device AI models and cloud-based processing as things develop.