AXyBench · CommanderOS

LLM Bench

CMD Evolution

CommanderOS CMD1 is run repeatedly against the same AXyBench question pool; this page records how the scored rounds improve over time.

Latest

CMD-130 · Score: 88.8 · OK Rate: 100% · Latency: 30.9s

Progression

CMD-121 → CMD-130

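[Chart: judged score and measured OK rate by round; vertical axis 0 to 100]
Judged score (latency) by round: CMD-121 87.0 (22.8s), CMD-123 88.0 (57.5s), CMD-124 86.0 (24.2s), CMD-125 88.0 (25.1s), CMD-126 88.5 (21.8s), CMD-127 86.6 (25.1s), CMD-130 88.8 (30.9s)
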
Round Log

Changes and results by round

campaign: openbeta-chat-final · status: draft · scored rounds: 7

| Round | Status | Score | OK Rate | Needs Judge | Latency | Cost | Profile | Change |
|---|---|---|---|---|---|---|---|---|
| CMD-121 | Scored | 87.0 | 100% | 100% | 22.8s | $0.087 | cmd_1_0_default | B4 full-category probe after r120 complete-artifact sufficiency fixes; all 5 cells reached judgeable final output. |
| CMD-123 | Scored | 88.0 | 100% | 100% | 57.5s | $0.019 | cmd_1_0_default | Generic SaaS marketing deterministic repair and DataLab keyword normalization; q1-only validation. |
| CMD-124 | Scored | 86.0 | 100% | 100% | 24.2s | $0.083 | cmd_1_0_default | B4 full-category rerun after generic SaaS repair push |
| CMD-125 | Scored | 88.0 | 100% | 100% | 25.1s | $0.085 | cmd_1_0_default | Exact channel-copy constraint and ad source-basis harness fixes |
| CMD-126 | Scored | 88.5 | 100% | 100% | 21.8s | $0.084 | cmd_1_0_default | Korean LinkedIn tone self-check leakage gate |
| CMD-127 | Scored | 86.6 | 100% | 80% | 25.1s | $0.084 | cmd_1_0_default | LLM call ledger observability; no prompt-quality change |
| CMD-130 | Scored | 88.8 | 100% | 100% | 30.9s | $0.098 | cmd_1_0_default | Generic channel-copy completeness gate for concrete sales numbers, channel labels, and truthful self-check counts |
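
The round-over-round trend in the log can be read off with a few lines of arithmetic. The sketch below is illustrative only: the scores are copied from the Round Log rows above, while the `rounds` list and the `score_deltas` helper are hypothetical names introduced here and are not part of AXyBench or CommanderOS. CMD-123 is logged as a q1-only validation, so its delta may not be directly comparable with the full-category rounds.

```python
# Judged scores copied from the Round Log; the data layout is an assumption.
rounds = [
    ("CMD-121", 87.0),
    ("CMD-123", 88.0),  # q1-only validation per the log
    ("CMD-124", 86.0),
    ("CMD-125", 88.0),
    ("CMD-126", 88.5),
    ("CMD-127", 86.6),
    ("CMD-130", 88.8),
]


def score_deltas(rows):
    """Return (round, score, change vs. the previous scored round) tuples.

    The first round has no predecessor, so its change is None.
    """
    out = []
    prev = None
    for name, score in rows:
        delta = None if prev is None else round(score - prev, 1)
        out.append((name, score, delta))
        prev = score
    return out


deltas = score_deltas(rounds)

# Net change from the first to the latest scored round.
net_change = round(rounds[-1][1] - rounds[0][1], 1)

for name, score, delta in deltas:
    change = "n/a" if delta is None else f"{delta:+.1f}"
    print(f"{name}: {score:.1f} ({change})")
print(f"net change: {net_change:+.1f}")
```

Rounding each difference to one decimal place matches the precision of the published scores and avoids floating-point noise in the printed deltas.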

AXyBench commander-cmd1 product endpoint. Scores are attached after judge/manual review.

updated 2026-06-04T10:34:02Z