K-Dense Web Scores 90.0% on BixBench-Verified-50

K-Dense Web scored 45/50 on BixBench-Verified-50, a cleaned biology-agent benchmark designed to separate real model mistakes from benchmark noise.

Benchmarks for biology agents are messy for a simple reason: biology work is messy. Questions can be underspecified, grading can depend on method choices, and sometimes the benchmark answer key is just wrong.

That's why Phylo's recent writeup on biology-agent evaluation caught our attention. Their main point is hard to argue with: if you want to know how good an agent actually is, you first have to clean up the benchmark. BixBench-Verified-50 is their attempt to do exactly that.

We ran K-Dense Web on that verified subset and scored 45/50, or 90.0% accuracy.

Why the verified subset matters

The original BixBench benchmark is useful, but it mixes together a few different kinds of failures:

  • real agent mistakes
  • ambiguous or underspecified questions
  • incorrect or inconsistent ground truth

That distinction matters. If an agent picks a defensible method and gets marked wrong because the benchmark expected a different unstated choice, that does not tell you much about the agent. It tells you the eval needs work.

BixBench-Verified-50 is more interesting because it tries to remove that noise. According to Phylo's description, the subset was reviewed with domain experts, with problematic questions removed, wording clarified, and incorrect answers fixed. That makes the score more meaningful than a raw pass/fail number on the original benchmark.

Our result

K-Dense Web scored 45 out of 50 on the verified set.

| Metric | Result |
| --- | --- |
| Total questions | 50 |
| Correct | 45 |
| Accuracy | 90.0% |
| Total runtime | ~103 minutes |
| Average runtime per question | 123.6 seconds |
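
As a quick sanity check, the figures in the table above are consistent with each other. The short sketch below recomputes the accuracy and total runtime from the reported counts; the variable names are ours and purely illustrative.

```python
# Figures copied from the results table above.
total_questions = 50
correct = 45
avg_seconds_per_question = 123.6

# Accuracy as a percentage of questions answered correctly.
accuracy_pct = 100 * correct / total_questions

# Total runtime implied by the per-question average, in minutes.
total_minutes = total_questions * avg_seconds_per_question / 60

print(f"Accuracy: {accuracy_pct:.1f}%")
print(f"Total runtime: about {total_minutes:.0f} minutes")
```

The per-question average times 50 questions comes to 6,180 seconds, which matches the roughly 103 minutes reported.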

Here is the benchmark comparison plot for this run:

[Figure: BixBench-Verified-50 benchmark comparison]

For additional context, Phylo reported BixBench-Verified-50 scores for other systems in their public post. The K-Dense Web row below is our own run; the remaining rows are the figures Phylo reported:

| System | Accuracy |
| --- | --- |
| K-Dense Web | 90.0% |
| Biomni Lab | 88.7% |
| Edison Analysis | 78.0% |
| Claude Code (Opus 4.6) | 65.3% |
| OpenAI Agents SDK (GPT-5.2) | 61.3% |

Those numbers should always be read with some caution because prompting, tools, and run conditions matter. Still, the headline is straightforward: K-Dense Web performed very strongly on a version of BixBench that was explicitly designed to be less noisy and more fair.

That result is especially notable because K-Dense Web is a generalist system. It is not a bioinformatics-only product or a benchmark-specialized agent tuned just for BixBench. The same platform is built to handle research, coding, data analysis, machine learning, and multi-step technical workflows across domains. Even without being optimized specifically for bioinformatics, it posted the highest score in the comparison above on a demanding biology benchmark.

Breakdown by verifier type

The verified subset uses three grading modes: LLM-based judging, exact-string matching, and numeric range checks. K-Dense Web performed well across all three.

| Verifier type | Correct | Total | Accuracy |
| --- | --- | --- | --- |
| LLM verifier | 18 | 20 | 90.0% |
| String verifier | 15 | 17 | 88.2% |
| Range verifier | 12 | 13 | 92.3% |

That spread is useful. The score is not being carried by one easy slice of the benchmark. We saw strong performance on judged answers, exact-answer tasks, and quantitative outputs.
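
To make the spread concrete, here is a small sketch that recomputes the per-verifier accuracies from the counts in the table above and confirms they add up to the overall 45/50. The dictionary layout and names are our own illustration, not part of the benchmark tooling.

```python
# (correct, total) per verifier type, copied from the breakdown table.
breakdown = {
    "LLM verifier": (18, 20),
    "String verifier": (15, 17),
    "Range verifier": (12, 13),
}

# Accuracy per verifier type, rounded to one decimal place.
per_type_accuracy = {
    name: round(100 * correct / total, 1)
    for name, (correct, total) in breakdown.items()
}

# The slices should sum back to the headline result.
overall_correct = sum(c for c, _ in breakdown.values())
overall_total = sum(t for _, t in breakdown.values())

# Gap between the best and worst slice, in percentage points.
spread = max(per_type_accuracy.values()) - min(per_type_accuracy.values())

for name, acc in per_type_accuracy.items():
    print(f"{name}: {acc:.1f}%")
print(f"Overall: {overall_correct}/{overall_total}")
print(f"Spread: {spread:.1f} percentage points")
```

The gap between the strongest and weakest slice is about four percentage points, which is small given that each slice has only 13 to 20 questions.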

What the remaining misses tell us

A 90.0% score is strong, but it is not perfect. We still missed five questions:

  • 2 on LLM-judged answers
  • 2 on exact-string answers
  • 1 on a numeric range check

Here are the five misses from the run:

| Question ID | Verifier | Expected | K-Dense Web answer | What happened |
| --- | --- | --- | --- | --- |
| bix-43-q2 | String | 5.81 | 5.831005059276599 | Numerically close, but failed exact string matching |
| bix-61-q5 | LLM judge | 2.68 | 2.56 | Small quantitative miss on a judged answer |
| bix-27-q5 | Range | (55,56) | 56.471704739574804 | Slightly outside the accepted range |
| bix-16-q1 | String | CDKN1A | CCND1 | Wrong exact answer |
| bix-45-q1 | LLM judge | 7.6968e-54 | 1.8037401775486344e-47 | Large numeric mismatch on a judged answer |

This table is useful because the misses are not all the same. Two are small numeric deviations from the expected value (bix-43-q2 and bix-61-q5), one is a narrow range miss (bix-27-q5), and two are substantive answer misses: a wrong gene (bix-16-q1) and a value that is off by several orders of magnitude (bix-45-q1). That is a better place to be than having one systematic failure mode tank the entire benchmark, but it also makes clear where the next round of improvements should go.
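
To show how the string and range checks treat these answers, here is a minimal sketch. The functions `string_verifier`, `range_verifier`, and `relative_deviation` are hypothetical stand-ins we wrote for illustration; the actual BixBench-Verified-50 graders may normalize answers or handle range bounds differently, and the LLM judge is not modeled at all.

```python
import math


def string_verifier(expected: str, answer: str) -> bool:
    """Hypothetical exact-match check: trimmed strings must be identical."""
    return expected.strip() == answer.strip()


def range_verifier(low: float, high: float, answer: float) -> bool:
    """Hypothetical range check; inclusive bounds are an assumption."""
    return low <= answer <= high


def relative_deviation(expected: float, answer: float) -> float:
    """Absolute difference as a fraction of the expected value."""
    return abs(answer - expected) / abs(expected)


# Inputs copied from the table of misses above.
bix_43_q2 = string_verifier("5.81", "5.831005059276599")
bix_16_q1 = string_verifier("CDKN1A", "CCND1")
bix_27_q5 = range_verifier(55, 56, 56.471704739574804)

# Rounding to two decimals would not have rescued bix-43-q2.
rounded_43 = f"{5.831005059276599:.2f}"

# How far off were the numeric answers?
dev_43 = relative_deviation(5.81, 5.831005059276599)
dev_61 = relative_deviation(2.68, 2.56)
orders_45 = math.log10(1.8037401775486344e-47 / 7.6968e-54)

print(bix_43_q2, bix_16_q1, bix_27_q5)
print(rounded_43)
print(f"{dev_43:.2%}, {dev_61:.2%}, {orders_45:.1f} orders of magnitude")
```

Under these assumptions, bix-43-q2 is within half a percent of the expected value and bix-61-q5 within about five percent, while bix-45-q1 differs by more than six orders of magnitude. Note that bix-43-q2 rounds to 5.83 rather than 5.81, so the miss is a small numeric difference and not only a matter of trailing digits.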

Why this result matters

The bigger story is not just that K-Dense Web scored well. It is that verified benchmarks tell a clearer story about what these systems can already do.

If you look only at noisy benchmarks, you can come away thinking biology agents are much worse than they are in practice. If you clean up the questions and grading, a different picture emerges: good systems are already capable of completing a large share of real bioinformatics analysis tasks correctly.

That matches what we see with users. The value is not in producing a nice-looking answer and hoping for the best. It is in running the analysis, choosing reasonable methods, and getting to a result that holds up under inspection.

BixBench-Verified-50 is not the final word on evaluation, and Phylo is right to argue that biology needs more process-aware benchmarks over time. But as a checkpoint for current capability, this one is useful. On that checkpoint, K-Dense Web scored 90.0%.

At the same time, benchmarks only go so far. The hardest scientific work usually does not look like a fixed multiple-choice or short-answer eval. Real research involves messy datasets, unclear problem definitions, partial context, changing objectives, and output formats that depend on the downstream decision. No current benchmark really captures that full picture.

Try it yourself

If you want to run your own biology analyses, benchmark agent workflows, or pressure-test a scientific question end to end, K-Dense Web is built for exactly that kind of work. We would encourage anyone evaluating the platform to go beyond benchmark scores and test it on their own hardest research problems. That is the standard that matters most.

Start your own analysis with $50 free credits
