Tau-knowledge: benchmarking agents on real-world knowledge

𝜏-knowledge is a benchmark that evaluates AI agents' retrieval, reasoning, and multi-step tool-calling abilities over a large knowledge base modeled on real financial customer-support scenarios. The GPT-5.5 model improves performance substantially over the initial results, but more than 60% of tasks still fail, so much remains unsolved. Strong agents use persistent, precise search strategies and act only at the appropriate moment. The benchmark is useful for evaluating AI agents that will be deployed in real knowledge-intensive work and for pointing to where they need to improve.

https://sierra.ai/blog/tau-knowledge

#agentbenchmark #knowledgebase #fintech #llm #evaluation

𝜏-knowledge: benchmarking agents on realistic knowledge

𝜏-knowledge measures how well agents can work through messy, evolving knowledge bases to complete complex, multi-step tasks. While models are improving, they still struggle to reliably use this information in practice, leaving a large gap to real-world performance.

Sierra