isclouder.com - 香港服务器

最新动态:Qwen3.7-Plus: Multimodal Agent Intelligence

Written by

in

据行业最新消息,Qwen3.7-Plus: Multimodal Agent Intelligence

Today we introduce Qwen3.7-Plus — a multimodal agent model that unifies vision and language into a single, versatile agent foundation. Building on Qwen3.7’s strong text backbone, Qwen3.7-Plus delivers a comprehensive upgrade in vision-language capabilities while retaining full agentic strength in coding, tool use, and productivity workflows. What sets Qwen3.7-Plus apart is its ability to operate as a multimodal interactive hybrid agent. It perceives real-world scenes, reads screens and operates GUIs, writes code from visual references, navigates mobile apps end-to-end, and answers visual questions grounded in web knowledge — seamlessly blending GUI and CLI interactions within a single agent loop. As a versatile coding agent and productivity assistant, it handles the full spectrum from frontend prototyping to complex software engineering and multi-step workflow automation with full-modality input. It generalizes across agent scaffolds, performing consistently whether deployed through Claude Code, OpenClaw, Qwen Code, or other frameworks. Qwen3.7-Plus — now available via Alibaba Cloud Model Studio: * Terminal-Bench 2.0: Harbor/Terminus-2 harness; 5h timeout, 12 CPU/24 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs. All experiments prepend a token at each turn, allowing the model to decide whether to engage extended thinking. * SWE-Bench Series: Internal agent scaffold (bash + file-edit tools); temp=1.0, top_p=0.95, 200K context window. * SWE-bench Pro: Problematic tasks corrected and all baselines evaluated on the refined benchmark. * QwenClawBench: a real-user-distribution Claw agent benchmark; open-source: https://github.com/SKYLENAGE-AI/QwenClawBench. * CoWorkBench: an internal cowork benchmark; long-horizon tasks across computer science, finance, law, medical, and other productivity domains. * SkillsBench: Evaluated via OpenCode on 78 tasks (excluding 9 external API-dependent tasks); avg of 5 runs. * MCP-Mark: GitHub MCP v0.30.3; Playwright responses truncated at 32K tokens. * MCP-Atlas: Public set score; gemini-2.5-pro judger. * VITA-Bench: Avg subdomain scores; using claude-4.5-sonnet as judger, as the older official judgers are no longer available. * Kernel Bench L3: Metrics reported: median of per-problem speedup over PyTorch eager reference / fraction of problems faster than torch.compile, across 50 problems. Each test sample runs in an isolated Docker container with one H100 80GB GPU, with internet access restricted to the CUTLASS codebase and official CUDA documentation, limited to 500 tool calls with early stopping after 100 non-improving turns. GPT-5.4 (xhigh) is applied to detect potential hacking behaviors. CUPTI is used for kernel-level timing. * Reasoning scenarios: Recommended system prompt: “Reasoning effort is set to xhigh. Please think carefully through the task, validate key assumptions, consider plausible alternatives, and prioritize correctness, consistency, and clarity in the final answer.” * WMT24++: Harder WMT24 subset; avg scores on 55 langs via XCOMET-XXL. * MAXIFE: Accuracy on EN + multilingual prompts (23 settings total). * MMLU-ProX: Avg accuracy across 29 languages. * Empty cells (–) indicate scores not yet available. Qwen3.7-Plus delivers competitive text performance that approaches Max-tier models across the board. In coding agents, it performs strongly on Terminal Bench 2.0, SWE-bench series, and SciCode, handling both real-world software engineering and scientific programming tasks effectively. In general-purpose agents, it demonstrates robust tool-use and planning capabilities across MCP-Mark, Deep-Planning, and Kernel Bench L3, showing particular strength in complex multi-step planning and GPU kernel optimization. Its reasoning performance on GPQA Diamond, HMMT, and IMOAnswerBench places it among the strongest Plus-tier models on hard STEM benchmarks. In instruction following and multilingual tasks, it delivers consistent quality across IFBench, WMT24++, and PolyMATH, with strong coverage across diverse languages. * Multimodal Search & Knowledge QA: All models evaluated with search augmentation enabled. * BabyVision and CharXiv(RQ): Scores are reported as “with CI / without CI”. * VideoMME (w/ sub.): Scores are reported with subtitles. * BC-VL and MMBC: Scores are reported with the recommended presence penalty 1.5 in BC tasks. * ScreenSpot Pro and OSWorld-Verified: Scores are reported with “enable_thinking=False”. * Empty cells (–) indicate the scores are not yet available. Qwen3.7-Plus’s multimodal improvements are not limited to isolated gains in visual understanding. Instead, they reflect a systematic enhancement of the core capabilities required by multimodal agents: understanding complex visual inputs, reasoning over visual information, using tools to solve problems, and ultimately executing tasks in code or GUI environments. In Multimodal Reasoning, Qwen3.7-Plus delivers strong performance on challenging visual reasoning benchmarks such as BabyVision, MathVision, HiPhO, ERQA, and VisFactor. These results demonstrate the model’s ability to integrate fine-grained visual perception, spatial relationships, physical commonsense, and multi-step logical reasoning. In particular, its significant improvement on BabyVision over Qwen3.6-Plus suggests stronger generalization on tasks that are closer to early human visual cognition and spatial reasoning. In Visual Agent & Coding, Qwen3.7-Plus shows substantial gains on ScreenSpot Pro, OSWorld-Verified, and AndroidWorld. This indicates that the model can not only recognize screen content, but also localize key UI elements, understand task intent, and complete multi-step interactions. On QwenVision2Code, the model also demonstrates strong vision-to-code generation capabilities, turning images, videos, and design references into executable code. These capabilities form the foundation for multimodal agents to move from “understanding interfaces” to “operating interfaces” and even “building interfaces.” In Multimodal Search & Knowledge QA, Qwen3.7-Plus achieves clear improvements on SimpleVQA, WorldVQA, MMSearchPlus, BC-VL, and MMBC. The model can combine visual inputs with external knowledge retrieval to answer questions that cannot be solved from image content alone. This makes it better suited for real-world tasks, where users do not simply ask “what is in the image,” but expect the model to combine visual evidence, commonsense, and up-to-date knowledge to provide reliable answers. In General Visual Understanding, Qwen3.7-Plus maintains strong performance across real-world scenes, document parsing, chart understanding, OCR, counting, and spatial localization. It performs strongly on tasks such as RealWorldQA, CountQA, OmniDocBench, CharXiv, and OCR-Bench-V2. These capabilities are essential for robustly handling real business inputs, including screenshots, receipts, tables, reports, posters, product images, and complex UI pages. Beyond images, Qwen3.7-Plus further strengthens video understanding and driving-scene understanding. On video benchmarks such as VideoMMMU, MLVU, TVBench, and LVBench, it can reason over events, actions, temporal dynamics, and semantic relationships in both short and long videos. On driving-related evaluations such as LingoQA, Ego3D-Bench, SURDS, and VLADBench, it also demonstrates strong understanding of dynamic scenes, traffic participants, and spatial relationships. These capabilities lay an important foundation for real-world multimodal agents, autonomous driving understanding, and embodied AI scenarios. Qwen3.7-Plus is now available through Alibaba Cloud Model Studio. As a multimodal model, Qwen3.7-Plus accepts both text and image/video inputs. It also supports the preserve_thinking feature: preserving thinking content from all preceding turns in messages, which is recommended for agentic tasks. Alibaba Cloud Model Studio supports industry-standard protocols, including chat completions and responses APIs compatible with OpenAI’s specification. “”” Environment variables: DASHSCOPE_API_KEY: Your API Key from https://modelstudio.console.alibabacloud.com DASHSCOPE_BASE_URL: (optional) Base URL for compatible-mode API. – Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1 – Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1 – US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1 “”” from openai import OpenAI import os api_key = os.environ.get(“DASHSCOPE_API_KEY”) if not api_key: raise ValueError( “DASHSCOPE_API_KEY is required. ” “Set it via: export DASHSCOPE_API_KEY=’your-api-key’” ) client = OpenAI( api_key=api_key, base_url=os.environ.get( “DASHSCOPE_BASE_URL”, “https://dashscope-intl.aliyuncs.com/compatible-mode/v1”, ), ) messages = [{“role”: “user”, “content”: “Write a Python function to merge two sorted linked lists.”}] completion = client.chat.completions.create( model=”qwen3.7-plus”, messages=messages, extra_body={ “enable_thinking”: True, # “preserve_thinking”: True, }, stream=True ) reasoning_content = “” answer_content = “” is_answering = False print(“\n” + “=” * 20 + “Reasoning” + “=” * 20 + “\n”) for chunk in completion: if not chunk.choices: print(“\nUsage:”) print(chunk.usage) continue delta = chunk.choices[0].delta if hasattr(delta, “reasoning_content”) and delta.reasoning_content is not None: if not is_answering: print(delta.reasoning_content, end=””, flush=True) reasoning_content += delta.reasoning_content if hasattr(delta, “content”) and delta.content: if not is_answering: print(“\n” + “=” * 20 + “Answer” + “=” * 20 + “\n”) is_answering = True print(delta.content, end=””, flush=True) answer_content += delta.content For more information, please visit the API doc. Qwen3.7-Plus features multimodal hybrid-agent capabilities designed for closed-loop execution of real-world tasks. It can not only understand visual interfaces, perceive on-screen content, and perform both GUI interactions and CLI operations, but also leverage

随着IDC行业的快速发展,可持续发展将成为未来竞争的关键

如果您正在寻找优质的俄罗斯服务器,欢迎访问 www.isclouder.com 了解更多