π-knowledge: benchmarking agents on real-world knowledge
π-knowledge is a benchmark that evaluates AI agents' retrieval, reasoning, and multi-step tool-calling abilities over a large knowledge base reflecting real financial customer-support scenarios. GPT-5.5 models have improved substantially over earlier results, but more than 60% of tasks still fail, so significant challenges remain. Strong agents are distinguished by persistent, precise search strategies and by taking action only at the appropriate moment. The benchmark is useful for evaluating AI agents deployed on real knowledge-intensive work and for identifying directions for improvement.

π-knowledge: benchmarking agents on realistic knowledge
π-knowledge measures how well agents can work through messy, evolving knowledge bases to complete complex, multi-step tasks. While models are improving, they still struggle to reliably use this information in practice, leaving a large gap to real-world performance.
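The core measurement described above can be sketched as a simple evaluation harness: run an agent over a set of knowledge-grounded tasks and report the fraction it solves. This is a minimal illustration, not the benchmark's actual interface; the `Task` shape, the `agent_answer` callable, the toy knowledge base, and exact-match grading are all assumptions.

```python
# Hypothetical sketch of a pi-knowledge-style evaluation loop.
# The task format, agent interface, and grading rule are assumptions
# for illustration, not the benchmark's real API.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    question: str  # user request grounded in the knowledge base
    expected: str  # reference answer used for grading


def pass_rate(tasks: list[Task], agent_answer: Callable[[str], str]) -> float:
    """Run the agent on every task and return the fraction solved.

    Grading here is naive exact match (case/whitespace-insensitive);
    a real benchmark would use a richer rubric.
    """
    solved = sum(
        agent_answer(t.question).strip().lower() == t.expected.strip().lower()
        for t in tasks
    )
    return solved / len(tasks)


# Toy agent that "retrieves" from a tiny in-memory knowledge base.
KB = {"What is the wire-transfer cutoff time?": "5pm ET"}

tasks = [Task("What is the wire-transfer cutoff time?", "5pm ET")]
print(pass_rate(tasks, lambda q: KB.get(q, "unknown")))  # → 1.0
```

In practice the interesting part is everything hidden inside `agent_answer`: the search calls, intermediate reasoning, and tool invocations that determine whether the final answer is grounded in the knowledge base.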