Monthly Archives: November 2025

A Brief History of the World (Models)

On Friday I got to meet Dr. Fei-Fei Li, “the godmother of AI,” at the launch party for her new company, World Labs (see her launch blog post). We got to chat a bit about the paradox that as computer models for perceiving & representing the world grow massively more sophisticated, the interfaces for doing common things—e.g. moving a person in a photo—can get radically simpler & more intentional. I’ll have more to say about this soon.

Meanwhile, here’s her fascinating & wide-ranging conversation with Lenny Rachitsky. I’m always a sucker for a good Platonic allegory-of-the-cave reference. 🙂

From the YouTube summary:

(00:00) Introduction to Dr. Fei-Fei Li
(05:31) The evolution of AI
(09:37) The birth of ImageNet
(17:25) The rise of deep learning
(23:53) The future of AI and AGI
(29:51) Introduction to world models
(40:45) The bitter lesson in AI and robotics
(48:02) Introducing Marble, a revolutionary product
(51:00) Applications and use cases of Marble
(01:01:01) The founder’s journey and insights
(01:10:05) Human-centered AI at Stanford
(01:14:24) The role of AI in various professions
(01:18:16) Conclusion and final thoughts

And here’s Gemini’s solid summary of their discussion of world models:

  • The Motivation: While LLMs are inspiring, they lack the spatial intelligence and world understanding that humans use daily. This ability to reason about the physical world—understanding objects, movement, and situational awareness—is essential for tasks like first response or even just tidying a kitchen 32:23.
  • The Concept: A world model is described as the lynchpin connecting visual intelligence, robotics, and other forms of intelligence beyond language 33:32. It is a foundational model that allows an agent (human or robot) to:
    • Create worlds in their mind’s eye through prompting 35:01.
    • Interact with that world by browsing, walking, picking up objects, or changing things 35:12.
    • Reason within the world, such as a robot planning its path 35:31.
  • The Application: World models are considered the key missing piece for building effective embodied AI, especially robots 36:08. Beyond robotics, the technology is expected to unlock major advances in scientific discovery (like deducing 3D structures from 2D data) 37:48, games, and design 37:31.
  • The Product: Dr. Li co-founded World Labs to pursue this mission 34:25. Their first product, Marble, is a generative model that outputs genuinely 3D worlds which users can navigate and explore 49:11. Current use cases include virtual production/VFX, game development, and creating synthetic data for robotic simulation 53:05.

“How ChatGPT is fueling an existential crisis in education”

I thought this was a pretty interesting & thoughtful conversation. It’s interesting to think about ways to evaluate & reward process (hard work through challenges) and not just product (final projects, tests, etc.). AI obviously makes it easy to skip the former in pursuit of the latter—but (shocker!) people then don’t build know-how around solving problems, or even remember (much less feel pride in) the artifacts they produce.

The issues go a lot deeper, to the very philosophy of education itself. So we sat down and talked to a lot of teachers — you’ll hear many of their voices throughout this episode — and we kept hearing one cri du coeur again and again: What are we even doing here? What’s the point?

Links, courtesy of the Verge team:

  • A majority of high school students use gen AI for schoolwork | College Board
  • About a quarter of teens have used ChatGPT for schoolwork | Pew Research
  • Your brain on ChatGPT | MIT Media Lab
  • My students think it’s fine to cheat with AI. Maybe they’re on to something. | Vox
  • How children understand and learn from conversational AI | McGill University
  • File not Found | The Verge

Adobe Research debuts incredibly fast video synthesis

Check out MotionStream, “a streaming (real-time, long-duration) video generation system with motion controls, unlocking new possibilities for interactive content generation.” It’s said to run at 29fps on a single H100 GPU (!).

What I’m really wondering, though, is whether/when/how an interactive interface like this can come to Photoshop & other image-editing environments. I’m not yet sure how the dots connect, but could it be paired with something like this model?

Chocolate-coated glass shards

Oh man, this parody of the messaging around AI-justified (?) price increases is 100% pitch perfect. (“It’s the corporate music that sends me into a rage.”)

View this post on Instagram: a post shared by Mark Edwards (@someguymark)

“We Built The Matrix to Train You for What’s Coming”

My friend Bilawal got to sit down with VFX pioneer John Gaeta to discuss “A new language of perception,” Bullet Time, groundbreaking photogrammetry, the coming Big Bang/golden age of storytelling, chasing “a feeling of limitlessness,” and much more.

In this conversation:

— How Matrix VFX techniques became the prototypes for AI filmmaking tools, game engines, and AR/VR systems
— How The Matrix team sourced PhD thesis films from university labs to invent new 3D capture techniques
— Why “universal capture” from Matrix 2 & 3 was the precursor to modern volumetric video and 3D avatars
— The Matrix 4 experiments with Unreal Engine that almost launched a transmedia universe based on The Animatrix
— Why dystopian sci-fi becomes infrastructure (and what that means for AI safety)
— Where John is building next: Escape.art and the future of interactive storytelling


A cool new Photoshop feature (that’s still kinda dumb)

I’m pleased to see that, as promised back in May, Photoshop has added a “Dynamic Text” toggle that automatically resizes the letters in each line to produce a visually “packed” look:

Results can be really cool, but because the model has no knowledge of the meaning and importance of each word, they can sometimes look pretty dumb. Here’s my canonical example, which visually emphasizes exactly the wrong thing:

I continue to want to see the best of both worlds, with a layout engine taking into account the meaning & thus visual importance of words—like what my team shipped last year:

I’m absolutely confident that this can be done. I mean, just look at the kind of complex layouts I was knocking out in Ideogram a year ago.

The missing ingredient is just the link between image layouts & editability—provided either by bitmap->native conversion (often hard, but doable in some cases), or by in-place editing (e.g. change “Merry Christmas” to “Happy New Year” on a sign, then regenerate the image using the same style & dimensions)—or both.

Bonus points go to the app & model that enable generation with transparency (for easy compositing), or conversion to vectors—or, again, ¿por qué no los dos? 🙂

Demo: Flux vs. Nano Banana inside Photoshop

I recently shared a really helpful video from Jesús Ramirez that showed practical uses for each model inside Photoshop (e.g. text editing via Flux). Now here’s a direct comparison from Colin Smith, highlighting these strengths:

  • Flux: Realistic, detailed; doesn’t produce unwanted shifts in regions that should stay unchanged. Tends to maintain more of the original image, such as hair or background elements.
  • Nano Banana: Smooth & pleasing (if sometimes a bit “Disney”); good at following complex prompts. May be better at removing objects.

These specific examples are great, but I continue to wish for more standardized evals that would help produce objective measures across models. I’m investigating the state of the art there. More to share soon, I hope!

Emu 3.5 looks seriously impressive

Improvements to imaging continue at a breakneck pace, as engines evolve from “simple” text-to-image (which we considered miraculous just three years ago—and which I still kinda do, TBH) to understanding time & space.

Now Emu (see project page, code) can create entire multi-page/image narratives, turn 2D images into 3D worlds, and more. Check it out: