It’s not good enough, at least on Plus. It can only work on a task for a little while, 18 minutes in my case, which is less time than a long Deep Research run. It also failed.

In its defense, I did give it an impossible task (figure out how to use the OpenRouter chat UI).

It was tasked with testing a system prompt and improving it based on model behavior. I told it to use GPT-4.1, DeepSeek V3, Kimi K2, and Jamba 1.6 Large.

It didn’t realize it had accidentally added Auto Router to all chats except the last one, so instead of chatting with, say, the GPT-4.1 it had originally selected, it was getting auto-routed to DeepSeek R1.

It didn’t follow the instructions to chat with all models first before trying to improve the system prompt, and it did a mediocre job of improving it, among other issues.

I told it about all its errors when it said “job completed” and it replied “I’ve done all I can do.” (‘Two allowed context windows’ per the reasoning trace.)

It probably works better on Pro.

It should’ve realized something was wrong when it mentioned waiting for the reasoning trace to finish on the first chat; GPT-4.1 isn’t a reasoning model (which is one reason why I like it so much).