Post by regenold
How Does The EU AI Act Q&A Benchmark Challenge Actually Work?

Following the launch of the EU AI Act Q&A Benchmark Challenge 2026, several participants have asked us about the evaluation methodology. The short answer: we do not measure whether a system sounds convincing. We measure whether it is correct.

Each participating AI system receives the same set of expert-developed questions covering Regulation (EU) 2024/1689, the EU AI Act. For every response, we evaluate five dimensions:

Answer Correctness
Does the answer accurately address the question? We score both:
• Strict correctness (fully correct)
• Loose correctness (substantially correct)

Reference Accuracy
Can the system identify the correct legal basis? A correct answer supported by an incorrect article reference still represents a regulatory risk.
• Strict correctness (fully precise)
• Loose correctness (mostly precise)

Conciseness
Can the system provide the necessary information without excessive verbosity? In regulatory environments, clarity often matters more than word count.

Tone
We assess whether responses are professional, precise, and unambiguous. Evasive answers are penalised.

Latency
How quickly can the system provide a response? Speed is not everything, but it remains an important usability factor.

The result is a multi-dimensional performance profile rather than a single score. A system may be highly accurate but slow. Another may be fast but struggle with references. The benchmark is designed to reveal these trade-offs transparently.

Every participant receives an individual evaluation report showing their performance across all dimensions. Participation remains free of charge.

Interested in benchmarking your system?
https://lnkd.in/eMVmMRC4

We are currently accepting registrations for the first evaluation cycle.

#AIAct #RegulatoryAI #AIBenchmark #ArtificialIntelligence #AIGovernance