Mission Statement

The mission of the SPEC RG AIware Working Group is to provide a forum to advance the transparent, reproducible, and trustworthy evaluation of AI models and the software they power through rigorous benchmarking, experimental analysis, and the development of standardized tools and processes. By fostering collaboration across academia and industry, the group aims to build meaningful assessment frameworks that capture the performance, behavior, reliability, and usability of AIware in a rapidly evolving technological landscape.

General Topics of Interest

Topics of interest include, but are not limited to, the following:

  • Novel benchmarking and evaluation technologies for AI models and the software systems they power
  • Innovative techniques for tracing, monitoring, and analyzing the behavior of AIware (a tracing sketch follows this list)
  • Standardized processes, best practices, and lessons learned for developing and evaluating AIware
  • Quantitative and qualitative studies on both open-source and closed-source AIware systems
  • Open-source tools, infrastructure, and datasets that support AIware evaluation and analysis
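
As a concrete illustration of the tracing and monitoring topic above, the following Python sketch wraps an arbitrary model-invocation callable so that every call is appended as one JSON line to a local log. This is a minimal sketch, not a group deliverable: the helper name traced, the log path, and the record schema (prompt, response, latency) are illustrative assumptions.

    import functools
    import json
    import time

    def traced(model_call, log_path="aiware_trace.jsonl"):
        # Wrap any model-invocation callable so each call is recorded as
        # one JSON line for later behavioral analysis. The schema here is
        # a hypothetical placeholder, not an agreed trace format.
        @functools.wraps(model_call)
        def wrapper(prompt, **kwargs):
            start = time.monotonic()
            response = model_call(prompt, **kwargs)
            record = {
                "prompt": prompt,
                "response": str(response),
                "latency_s": round(time.monotonic() - start, 4),
            }
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(json.dumps(record) + "\n")
            return response
        return wrapper

Wrapping a client call, e.g. generate = traced(my_model.generate) where my_model is any object exposing a generate(prompt) method, leaves call sites unchanged while producing a JSONL trace suitable for offline analysis.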

Current Activities

  • Survey existing AIware-related benchmarks and evaluation methodologies to identify key gaps and challenges
  • Develop a taxonomy of evaluation dimensions relevant to AIware (see the sketch after this list)
  • Select representative scenarios for pilot studies (e.g., SWE-bench, KernelBench, EffiBench)
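
To make the taxonomy activity more tangible, here is a minimal Python sketch of how evaluation dimensions might be encoded and aggregated across pilot-study scenarios. The dimensions mirror those named in the mission statement (performance, behavior, reliability, usability); the class names, the [0, 1] scoring scale, and the averaging rule are hypothetical placeholders pending the group's actual taxonomy.

    from dataclasses import dataclass
    from enum import Enum

    class Dimension(Enum):
        # Dimensions named in the mission statement; the final taxonomy
        # is still under development, so this enumeration is illustrative.
        PERFORMANCE = "performance"
        BEHAVIOR = "behavior"
        RELIABILITY = "reliability"
        USABILITY = "usability"

    @dataclass
    class ScenarioResult:
        # One pilot-study scenario (e.g., a single SWE-bench task) scored
        # per dimension; the [0, 1] scale is an assumed convention.
        scenario_id: str
        scores: dict[Dimension, float]

    def aggregate(results: list[ScenarioResult]) -> dict[Dimension, float]:
        # Average each dimension over the scenarios that report it.
        totals: dict[Dimension, float] = {}
        counts: dict[Dimension, int] = {}
        for result in results:
            for dim, score in result.scores.items():
                totals[dim] = totals.get(dim, 0.0) + score
                counts[dim] = counts.get(dim, 0) + 1
        return {dim: totals[dim] / counts[dim] for dim in totals}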

Future Activities

  • Release the enhanced benchmarks from the pilot studies as open-source projects
  • Identify common pitfalls and challenges in benchmarking and evaluating AIware, and propose mitigation strategies
  • Provide trusted open-source evaluation platforms for AI models and the systems they power (e.g., sandboxed environments; see the sketch after this list)
  • Collaborate with regulatory and standards organizations to develop formal evaluation guidelines
  • Periodically report on the progress, risks, and social impact of AIware through faithful and meaningful benchmark evaluations
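
As a rough indication of what a sandboxed evaluation environment could look like, the sketch below runs model-generated Python in a separate process with a hard wall-clock limit. It is deliberately minimal and assumes a POSIX host; a trusted platform would layer container- or VM-level isolation, filesystem and network restrictions, and resource accounting on top of this process boundary.

    import os
    import subprocess
    import sys
    import tempfile

    def run_sandboxed(code, timeout_s=10):
        # Execute untrusted, model-generated Python in a child process
        # with a hard timeout. Process isolation alone is NOT a full
        # sandbox; it only bounds runtime and captures output.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as src:
            src.write(code)
            path = src.name
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            return proc.returncode, proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            # A runaway submission is killed and reported, not propagated.
            return None, "", "killed after %ds" % timeout_s
        finally:
            os.unlink(path)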