Randy

I Had Early Access to Manus. Here's My Honest Six-Category Test.

Before the hype, before the waitlist — what the first general AI agent actually does

Mar 7, 2025 · 8 min read

Before Manus went viral, before the waitlist, I'd already talked with their product team. I was among the first users globally to get access during the internal beta — and I wanted to test it properly, not just run one impressive demo and screenshot it.

What I didn't expect was how hard it is to come up with tasks for something you've never used. When there's no prior art, no user gallery, no benchmarks, you have to design the tests from scratch. It ended up feeling like building a benchmark on the fly: six categories, genuine use cases, honest results. Points marked with ❗️ are things I think the Manus team should address.

Here's what actually happened.

1. Data Modelling

Completed

I gave it a math modelling competition problem — one I'd spent two full days on the week before: predict Hong Kong's tourist arrivals for 2026–2030, build a resource allocation model, cite your sources, state your assumptions.

Data collection alone took me half a day. Manus did it in 20 minutes. It found all the relevant datasets, built the regression model, generated charts, and organised everything into a structured output. I went out for lunch, came back, and it had continued on its own.
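
For a sense of what that forecasting step involves: stripped to its core, it's a regression over historical arrivals projected forward to 2026–2030. Here's a minimal sketch of that shape of workflow in Python — the arrival figures below are placeholders, not the datasets Manus actually sourced, and the real model was considerably richer than a straight-line fit.

```python
# Minimal sketch of a visitor-arrivals forecast via linear regression.
# The figures below are placeholders for illustration, NOT the real data
# Manus collected, and a production model would use more than a linear trend.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([2018, 2019, 2020, 2021, 2022, 2023, 2024]).reshape(-1, 1)
arrivals_millions = np.array([60.0, 55.0, 4.0, 0.1, 0.6, 34.0, 45.0])  # placeholder values

model = LinearRegression().fit(years, arrivals_millions)
future_years = np.arange(2026, 2031).reshape(-1, 1)
forecast = model.predict(future_years)

for year, value in zip(future_years.ravel(), forecast):
    print(f"{year}: {value:.1f}M arrivals (projected)")
```

The interesting part wasn't the regression itself; it was that Manus handled the sourcing, cleaning, modelling, and charting end to end without being told how.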

Manus completing Hong Kong visitor forecast modelling task
The forecast model for 2026–2030 visitor arrivals. Manus searched, sourced, modelled, and charted — in about 20 minutes.
❗️ Context window becomes a ceiling: by the time it finished the modelling, the context was too long to write the accompanying document. A good agent should proactively manage session scope, not just warn you when it's too late.

View session ↗

2. Real-World Action

Completed

I asked it to open Xiaohongshu, write a post calling out misinformation accounts, and take screenshots of the conversation. When it hit the QR code login screen, it asked me to scan — I did, then manually entered the SMS code. After that, it navigated to the post editor, generated copy, and published.

Manus successfully publishing a post to Xiaohongshu
Task completed: post drafted, tags added, published. The agent handled the full flow after I unblocked the login step.
❗️ It searched for the wrong input field mid-task and needed a manual nudge to correct it. Spatial awareness on web UIs is still weak. Also: "less structure" in the prompt isn't enough — you need to be specific about tone, or you get generic copy.

View session ↗

3. Gaming

Partial

First prompt: install Minecraft and beat the Ender Dragon. Immediate hard limit — the sandbox is headless, no graphics interface, so graphical games are out. Fair enough.

Second prompt: go play Go on an online board game site. It found one, tried to place a stone, couldn't interact with the board properly, blamed the sandboxed browser. Third prompt: try an HTML game. This time it worked — it found a puzzle game called Rope Rescue, explained the mechanics, and completed the first level.

Manus playing Rope Rescue HTML puzzle game
After two failed attempts (Minecraft, online Go), Manus found Rope Rescue and completed Level 1. The agent explained the mechanics before playing.
❗️ The Go failure was ambiguous — unclear whether it was a prompt issue or the agent hallucinating capabilities it didn't have. Needs better self-awareness about what it can and can't interact with.

View session ↗

4. Creative Tool Use

Partial

I asked it to make a video introducing Manus. It wrote a solid script, asked whether I wanted narration, AI voice, or subtitles. I said AI voice. Then it hit a wall: TTSMaker returned errors, MyEdit required login, a CAPTCHA blocked the next attempt (it recognised 3 out of 4 characters — close), then Cloudflare blocked the reload. I stepped in and told it to just use subtitles.

It pivoted to FFmpeg + ImageMagick, built scenes from scratch, burned in an SRT subtitle track, and exported a working video to a public URL.
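
The subtitle burn itself is essentially a single FFmpeg step. Here's a minimal sketch of that kind of invocation, wrapped in Python — the file names are hypothetical and this is my reconstruction of the step, not the exact command Manus ran.

```python
# Sketch of burning an SRT track into a rendered video, roughly the step
# Manus performed. File names are hypothetical; requires ffmpeg with libass.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "scenes.mp4",            # the concatenated scenes (hypothetical name)
        "-vf", "subtitles=manus.srt",  # burn the subtitle track into the frames
        "-c:a", "copy",                # leave any audio stream untouched
        "manus_intro.mp4",
    ],
    check=True,
)
```
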

Manus hitting context limit during video production task
The video was produced — but only the first scene had correct text. Other scenes showed garbled timestamps. Context ran out before it could fix it.
❗️ Visual CAPTCHA solving needs work — 3/4 isn't good enough when one wrong answer bricks the session. Also, the subtitle render broke partway through and the context was exhausted before it could self-correct.

View session ↗

5. Shopping

Partial

I needed a cheap black full-length robe for an English drama performance. Manus searched Taobao and a few other platforms, built a comparison table with prices, and recommended a seller with a product image. Taobao triggered a CAPTCHA mid-session that it couldn't bypass, so it moved on.

Manus completing shopping research for black robe
Manus produced a final report with product names, prices, and purchase links. The links themselves didn't work — they redirected to CAPTCHA pages or the Taobao homepage.
❗️ The product links in the final report were broken — they bounced to human verification or the homepage. Finding it is only half the job if the link doesn't actually take you there.

View session ↗

6. Technical Development

Partial

I pushed further and asked it to build an LLM inference engine from scratch — design the architecture, install dependencies, implement the core pipeline. It correctly identified the framework stack (PyTorch, TensorRT, ONNX, Hugging Face Transformers) and started setting up the environment. Then a pip install torch was killed mid-run due to sandbox memory limits.

Manus attempting to build an LLM inference engine
Manus identified the right stack and began building — but the sandbox's memory constraints killed the PyTorch install. It adapted by trimming the dependency list.

It adapted: trimmed the requirements, tried a lighter install, kept going. The architecture design and scaffolding were solid. The sandbox ceiling, not the agent's reasoning, was the real constraint here.
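
For reference, the core pipeline it was scaffolding reduces to tokenise, generate, decode. Here's a minimal sketch under the assumption of a small CPU-friendly model — the kind of trim that fits a constrained sandbox. The model name and prompt are illustrative, not what Manus actually used.

```python
# Minimal sketch of a tokenise -> generate -> decode pipeline with Hugging Face
# Transformers, assuming a small CPU-only model. Illustrative only; not the
# architecture Manus was building.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "distilgpt2"  # small enough to load without a GPU

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

inputs = tokenizer("An LLM inference engine needs", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Presumably the lighter install it fell back to was something along the lines of a CPU-only PyTorch wheel, which is far smaller than the default CUDA build; either way, it kept moving rather than giving up.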

What I Actually Think

The data modelling result alone would have been enough to make me take this seriously. Twenty minutes versus two days isn't a productivity gain — it's a category shift.

But the failure pattern is consistent across every category: Manus is excellent at planning, searching, and executing individual steps. It struggles at the boundaries — CAPTCHAs, context limits, UI ambiguity, broken links. These aren't fundamental capability gaps; they're engineering problems. And most of them are fixable.

The thing that surprised me most wasn't any single result. It was designing the tests. Without a user gallery, without prior benchmarks, coming up with tasks for something genuinely new is hard. You end up probing the edges blind. That's not a criticism — it's just what it feels like to be early. The space between "what can this do?" and "what should I ask it to do?" is larger than you'd expect.

I'm looking forward to seeing what the context window looks like in six months.