I Had Early Access to Manus. Here's My Honest Six-Category Test.
Before the hype, before the waitlist — what the first general AI agent actually does
Before Manus went viral, before the waitlist, I'd already talked with their product team. I was among the first users globally to get access during the internal beta — and I wanted to test it properly, not just run one impressive demo and screenshot it.
What I didn't expect was how hard it is to come up with tasks for something you've never used. When there's no prior art, no user gallery, no benchmarks — you have to design the tests from scratch. It ended up feeling like building a benchmark on the fly: six categories, genuine use cases, honest results. Marked with ❗️ are things I think the Manus team should address.
Here's what actually happened.
1. Data Modelling
Completed
I gave it a math modelling competition problem — one I'd spent two full days on the week before: predict Hong Kong's tourist arrivals for 2026–2030, build a resource allocation model, cite your sources, state your assumptions.
Data collection alone took me half a day. Manus did it in 20 minutes. It found all the relevant datasets, built the regression model, generated charts, and organised everything into a structured output. I went out for lunch, came back, and it had continued on its own.
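To give a sense of what it produced, the core of the forecast was a straightforward trend regression extrapolated to 2026–2030. A minimal sketch of that kind of fit, with purely illustrative placeholder figures rather than the datasets Manus actually found:

```python
# Minimal sketch of a linear trend regression and forecast, in the spirit of
# what Manus built. The arrival figures are illustrative placeholders only.
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023, 2024])
arrivals_millions = np.array([40.0, 42.0, 45.0, 47.0, 50.0, 52.0])  # placeholder values

# Fit a simple linear trend, then extrapolate to 2026-2030.
slope, intercept = np.polyfit(years, arrivals_millions, 1)
for year in range(2026, 2031):
    print(f"{year}: {slope * year + intercept:.1f}M projected arrivals")
```

The real output layered sourced data, stated assumptions, and charts on top of this; the point is that the modelling core is simple once the data collection is done.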

2. Real-World Action
Completed
I asked it to open Xiaohongshu, write a post calling out misinformation accounts, and take screenshots of the conversation. When it hit the QR code login screen, it asked me to scan — I did, then manually entered the SMS code. After that, it navigated to the post editor, generated copy, and published.

3. Gaming
Partial
First prompt: install Minecraft and beat the Ender Dragon. Immediate hard limit — the sandbox is headless, no graphics interface, so graphical games are out. Fair enough.
Second prompt: go play Go on an online board game site. It found one, tried to place a stone, couldn't interact with the board properly, blamed the sandboxed browser. Third prompt: try an HTML game. This time it worked — it found a puzzle game called Rope Rescue, explained the mechanics, and completed the first level.

4. Creative Tool Use
Partial
I asked it to make a video introducing Manus. It wrote a solid script, asked whether I wanted narration, AI voice, or subtitles. I said AI voice. Then it hit a wall: TTSMaker returned errors, MyEdit required login, a CAPTCHA blocked the next attempt (it recognised 3 out of 4 characters — close), then Cloudflare blocked the reload. I stepped in and told it to just use subtitles.
It pivoted to FFmpeg + ImageMagick, built scenes from scratch, burned in an SRT subtitle track, and exported a working video to a public URL.
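I didn't save its exact commands, but burning an SRT track into a video with FFmpeg comes down to a single subtitles filter. A rough equivalent of that step, driven from Python, with file names that are illustrative rather than the ones it used:

```python
# Rough equivalent of the subtitle-burning step: render an SRT track into the
# video frames with FFmpeg's subtitles filter. File names are illustrative;
# requires an ffmpeg binary (built with libass) on PATH.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "scenes.mp4",                 # video assembled from the generated scenes
        "-vf", "subtitles=narration.srt",   # burn the SRT subtitles into the frames
        "-c:a", "copy",                     # pass any audio stream through untouched
        "intro_with_subs.mp4",
    ],
    check=True,
)
```

Nothing exotic, which is rather the point: when the hosted tools failed, it fell back to plain command-line tooling it could fully control.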

5. Shopping
Partial
I needed a cheap black full-length robe for an English drama performance. Manus searched Taobao and a few other platforms, built a comparison table with prices, and recommended a seller with a product image. Taobao triggered a CAPTCHA mid-session that it couldn't bypass, so it moved on.

6. Technical Development
Partial
I pushed further and asked it to build an LLM inference engine from scratch — design the architecture, install dependencies, implement the core pipeline. It correctly identified the framework stack (PyTorch, TensorRT, ONNX, Hugging Face Transformers) and started setting up the environment. Then a pip install torch was killed mid-run due to sandbox memory limits.

It adapted: trimmed the requirements, tried a lighter install, kept going. The architecture design and scaffolding were solid. The sandbox ceiling, not the agent's reasoning, was the real constraint here.
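For a sense of scale: even the most stripped-down core pipeline, skipping TensorRT and ONNX entirely, still sits on top of a full PyTorch install — which is exactly what blew past the sandbox's memory ceiling. A minimal sketch of the kind of pipeline it was scaffolding, with a placeholder model name and prompt of my own choosing, not its actual code:

```python
# Minimal sketch of a core text-generation pipeline built on Hugging Face
# Transformers, in the spirit of what Manus was scaffolding. Model name and
# prompt are placeholders; a real engine adds batching, KV-cache management,
# and an optimised runtime underneath.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model so the sketch runs on CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("Manus is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A lighter, CPU-only PyTorch wheel is one plausible way around the memory limit it hit, and "trim the install, keep going" is roughly what it tried on its own.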
What I Actually Think
The data modelling result alone would have been enough to make me take it seriously. Twenty minutes versus two days isn't a productivity gain — it's a category shift.
But the failure pattern is consistent across every category: Manus is excellent at planning, searching, and executing individual steps. It struggles at the boundaries — CAPTCHAs, context limits, UI ambiguity, broken links. These aren't fundamental capability gaps; they're engineering problems. And most of them are fixable.
The thing that surprised me most wasn't any single result. It was designing the tests. Without a user gallery, without prior benchmarks, coming up with tasks for something genuinely new is hard. You end up probing the edges blind. That's not a criticism — it's just what it feels like to be early. The space between "what can this do?" and "what should I ask it to do?" is larger than you'd expect.
I'm looking forward to seeing what the context window looks like in six months.