I Had Early Access to Manus. Here's My Honest Six-Category Test.
Before the hype, before the waitlist — what the first general AI agent actually does
Before Manus went viral, before the waitlist, I'd already talked with their product team. I was among the first users globally to get access during the internal beta — and I wanted to test it properly, not just run one impressive demo and screenshot it.
What I didn't expect was how hard it is to come up with tasks for something you've never used. When there's no prior art, no user gallery, no benchmarks — you have to design the tests from scratch. It ended up feeling like building a benchmark on the fly: six categories, genuine use cases, honest results. Marked with ❗️ are things I think the Manus team should address.
Here's what actually happened.
1. Data Modelling
Completed
I gave it a math modelling competition problem — one I'd spent two full days on the week before: predict Hong Kong's tourist arrivals for 2026–2030, build a resource allocation model, cite your sources, state your assumptions.
Data collection alone took me half a day. Manus did it in 20 minutes. It found all the relevant datasets, built the regression model, generated charts, and organised everything into a structured output. I went out for lunch, came back, and it had continued on its own.
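To give a sense of what it produced, the core of the forecast was a straightforward trend regression extrapolated to 2026–2030. A minimal sketch of that kind of fit, with purely illustrative placeholder figures rather than the datasets Manus actually found:

```python
# Minimal sketch of a linear trend regression and forecast, in the spirit of
# what Manus built. The arrival figures are illustrative placeholders only.
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023, 2024])
arrivals_millions = np.array([40.0, 42.0, 45.0, 47.0, 50.0, 52.0])  # placeholder values

# Fit a simple linear trend, then extrapolate to 2026-2030.
slope, intercept = np.polyfit(years, arrivals_millions, 1)
for year in range(2026, 2031):
    print(f"{year}: {slope * year + intercept:.1f}M projected arrivals")
```

The real output layered sourced data, stated assumptions, and charts on top of this; the point is that the modelling core is simple once the data collection is done.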

2. Real-World Action
Completed
I asked it to open Xiaohongshu, write a post calling out misinformation accounts, and take screenshots of the conversation. When it hit the QR code login screen, it asked me to scan — I did, then manually entered the SMS code. After that, it navigated to the post editor, generated copy, and published.

3. Gaming
Partial
First prompt: install Minecraft and beat the Ender Dragon. Immediate hard limit — the sandbox is headless, no graphics interface, so graphical games are out. Fair enough.
Second prompt: go play Go on an online board game site. It found one, tried to place a stone, couldn't interact with the board properly, blamed the sandboxed browser. Third prompt: try an HTML game. This time it worked — it found a puzzle game called Rope Rescue, explained the mechanics, and completed the first level.

4. Creative Tool Use
Partial
I asked it to make a video introducing Manus. It wrote a solid script, asked whether I wanted narration, AI voice, or subtitles. I said AI voice. Then it hit a wall: TTSMaker returned errors, MyEdit required login, a CAPTCHA blocked the next attempt (it recognised 3 out of 4 characters — close), then Cloudflare blocked the reload. I stepped in and told it to just use subtitles.
It pivoted to FFmpeg + ImageMagick, built scenes from scratch, burned in an SRT subtitle track, and exported a working video to a public URL.
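I didn't save its exact commands, but burning an SRT track into a video with FFmpeg comes down to a single subtitles filter. A rough equivalent of that step, driven from Python, with file names that are illustrative rather than the ones it used:

```python
# Rough equivalent of the subtitle-burning step: render an SRT track into the
# video frames with FFmpeg's subtitles filter. File names are illustrative;
# requires an ffmpeg binary (built with libass) on PATH.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "scenes.mp4",                 # video assembled from the generated scenes
        "-vf", "subtitles=narration.srt",   # burn the SRT subtitles into the frames
        "-c:a", "copy",                     # pass any audio stream through untouched
        "intro_with_subs.mp4",
    ],
    check=True,
)
```

Nothing exotic, which is rather the point: when the hosted tools failed, it fell back to plain command-line tooling it could fully control.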

5. Shopping
Partial
I needed a cheap black full-length robe for an English drama performance. Manus searched Taobao and a few other platforms, built a comparison table with prices, and recommended a seller with a product image. Taobao triggered a CAPTCHA mid-session that it couldn't bypass, so it moved on.

6. Technical Development
Partial
I pushed further and asked it to build an LLM inference engine from scratch — design the architecture, install dependencies, implement the core pipeline. It correctly identified the framework stack (PyTorch, TensorRT, ONNX, Hugging Face Transformers) and started setting up the environment. Then a pip install torch was killed mid-run due to sandbox memory limits.

It adapted: trimmed the requirements, tried a lighter install, kept going. The architecture design and scaffolding were solid. The sandbox ceiling, not the agent's reasoning, was the real constraint here.
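For a sense of scale: even the most stripped-down core pipeline, skipping TensorRT and ONNX entirely, still sits on top of a full PyTorch install — which is exactly what blew past the sandbox's memory ceiling. A minimal sketch of the kind of pipeline it was scaffolding, with a placeholder model name and prompt of my own choosing, not its actual code:

```python
# Minimal sketch of a core text-generation pipeline built on Hugging Face
# Transformers, in the spirit of what Manus was scaffolding. Model name and
# prompt are placeholders; a real engine adds batching, KV-cache management,
# and an optimised runtime underneath.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model so the sketch runs on CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("Manus is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A lighter, CPU-only PyTorch wheel is one plausible way around the memory limit it hit, and "trim the install, keep going" is roughly what it tried on its own.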
What I Actually Think
The data modelling result alone would have been enough to make me take it seriously. Twenty minutes versus two days isn't a productivity gain — it's a category shift.
But the failure pattern is consistent across every category: Manus is excellent at planning, searching, and executing individual steps. It struggles at the boundaries — CAPTCHAs, context limits, UI ambiguity, broken links. These aren't fundamental capability gaps; they're engineering problems. And most of them are fixable.
The thing that surprised me most wasn't any single result. It was designing the tests. Without a user gallery, without prior benchmarks, coming up with tasks for something genuinely new is hard. You end up probing the edges blind. That's not a criticism — it's just what it feels like to be early. The space between "what can this do?" and "what should I ask it to do?" is larger than you'd expect.
I'm looking forward to seeing what the context window looks like in six months.