Grading Rubric

To earn Level 2 certification in Wolfram Technology for AI, the applicant must successfully complete an independent project that demonstrates expertise in designing and building AI workflows that combine large language model capabilities with Wolfram Language computation. The project must be submitted as a computational essay in a Wolfram Notebook. Submissions are graded according to the following rubric.
Level 2 certification requires a project score of 75 or greater.

Mastery of the Computation-Augmented AI Workflow (75 points)

Problem Definition (5 points)

◼ Clearly state the problem or task the AI system is designed to address and explain why it is a meaningful or useful one to solve.
◼ Articulate specifically what the LLM component contributes and what the Wolfram computation component contributes—and why neither alone would be sufficient.
◼ Define what a successful outcome looks like, including any measurable criteria by which the system's performance or usefulness can be evaluated.
◼ If the problem definition evolved through iteration during development, describe how and why it changed.

LLM Function Integration (20 points)

◼ Demonstrate meaningful use of at least four functions from Wolfram's LLM toolkit. Suitable functions include but are not limited to:
    LLMFunction
    LLMSynthesize
    LLMTool
    LLMGraph
    ChatObject
    ChatEvaluate
    LLMPrompt
    LLMPromptGenerator
    LLMConfiguration
    SemanticSearch
    CreateSemanticSearchIndex
    VectorDatabaseSearch
◼ Show that each LLM-related function is used intentionally, not superficially. For each function used, include a brief explanation of why it was chosen for that step in the workflow.
◼ If prompt engineering was required, show the prompts clearly and explain the design decisions behind them—including any iterations that may have improved output quality.
◼ Demonstrate awareness of LLM limitations: identify at least one point in the workflow where the LLM alone would produce imprecise or unreliable output and show how Wolfram computation addresses it.
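As a hedged illustration of intentional use of the toolkit (the model name, prompts and input text below are placeholders, not requirements of the rubric), a reusable LLMFunction with an explicit LLMConfiguration might look like:

```wolfram
(* Sketch only: the model name and prompt text are hypothetical placeholders *)
summaryConfig = LLMConfiguration[<|"Model" -> "gpt-4o"|>];

summarizeAbstract = LLMFunction[
  "Summarize the following abstract in one sentence: ``",
  LLMEvaluator -> summaryConfig];

summarizeAbstract["Large language models produce fluent text but often \
misstate numeric facts..."]

(* A one-off call where no reusable template is needed *)
LLMSynthesize["List three common failure modes of large language models."]
```

A brief note next to each such cell explaining why a reusable LLMFunction was chosen over a one-off LLMSynthesize call is exactly the kind of justification this criterion rewards.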

Wolfram Computation Integration (20 points)

◼ Incorporate at least one substantive element from the broader Wolfram stack beyond the LLM functions themselves. Suitable elements include: curated computable data via Entities or built-in data (e.g. scientific and medical data, geographic data, engineering data, financial data, etc.); a model from the Neural Net Repository; statistical or mathematical computation relevant to the problem; or user-defined Wolfram Language functions used in LLMTool definitions callable by the language model.
◼ Demonstrate that the Wolfram computation component meaningfully improves the quality, precision or verifiability of the system's outputs compared to what the LLM would produce unaided.
◼ Ensure that the integration is bidirectional where appropriate—that is, Wolfram computation should somehow inform, constrain or verify LLM outputs, not merely run in parallel with them.
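For instance, a user-defined Wolfram Language function exposed as an LLMTool lets the model delegate a factual lookup to curated Entity data rather than guessing the value itself (the tool name, description and question below are hypothetical):

```wolfram
(* Hypothetical tool: the model calls curated Entity data for the value *)
planetMassTool = LLMTool[
  {"PlanetMass", "Look up the mass of a planet from curated data."},
  {"planet"},
  ToString[EntityValue[Interpreter["Planet"][#planet], "Mass"]] &];

LLMSynthesize[
  "What is the mass of Neptune? Use the available tool for the numeric value.",
  LLMEvaluator -> LLMConfiguration[<|"Tools" -> {planetMassTool}|>]]
```

This pattern is also bidirectional in the sense the criterion asks for: the computed Quantity flows back into the model's response instead of running alongside it.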

System Design and Architecture (15 points)

◼ Provide a clear description of the overall architecture of the system: how the LLM and Wolfram components are orchestrated, what data flows between them and in what order steps are executed.
◼ If LLMGraph is used for a multi-step orchestration pattern, include a structured explanation of the workflow graph.
◼ Explain the key design decisions made during development—including alternatives that were considered and rejected—and why.
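If LLMGraph is part of the design, a minimal two-node sketch could accompany the structured explanation. The node names, slot usage and input handling below are assumptions to be checked against the current LLMGraph documentation:

```wolfram
(* Hypothetical two-step graph: "summary" depends on "facts", and
   "topic" is supplied as an external input at call time *)
pipeline = LLMGraph[<|
   "facts" -> StringTemplate["List the key physical facts about `topic`."],
   "summary" -> StringTemplate["Write one paragraph based on these facts: `facts`"]
   |>];

pipeline[<|"topic" -> "Jupiter"|>]
```

An accompanying prose walkthrough of which nodes run in parallel and which depend on upstream outputs satisfies the "structured explanation of the workflow graph" requirement.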

Results and Evaluation (15 points)

◼ Present the outputs of the system clearly, using visualizations, structured data or interactive notebook elements where appropriate.
◼ Evaluate the system's performance in a rigorous way. This may include a set of test cases with expected and actual outputs, a comparison against a baseline (e.g. the LLM response without Wolfram augmentation), qualitative analysis of failure cases or quantitative metrics where applicable.
◼ Clearly explain limitations. Identify cases where the system performs poorly or behaves unexpectedly and offer an explanation.
◼ Clearly answer the questions or address the goals stated in the Problem Definition section.
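One lightweight pattern that satisfies the baseline-comparison criterion is a small test set evaluated against both the unaugmented LLM and the full system. Here `answerWithTools` is a stand-in for a hypothetical project-defined pipeline function:

```wolfram
(* Hypothetical evaluation harness: answerWithTools stands in for the
   project's own Wolfram-augmented pipeline *)
testCases = {
   <|"Question" -> "What is the boiling point of water at sea level?",
     "Expected" -> Quantity[100, "DegreesCelsius"]|>};

runCase[case_] := <|
   "Question"  -> case["Question"],
   "Baseline"  -> LLMSynthesize[case["Question"]],  (* LLM alone *)
   "Augmented" -> answerWithTools[case["Question"]],
   "Expected"  -> case["Expected"]|>;

Dataset[runCase /@ testCases]
```

Presenting the comparison as a Dataset makes failure cases easy to spot and discuss in the limitations bullet above.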

Reproducibility (25 points)

Code Quality (10 points)

◼ Ensure the notebook is fully self-contained and can be evaluated from top to bottom without any external dependency unavailable to the grader. (The exception to this is that you may have access to an LLM provider that the grader does not. Clearly indicate which providers and models you use and make sure to leave examples of evaluated outputs so that the grader can read what they may not be able to replicate.)
◼ Organize code into logical, well-scoped cells. Follow the rule of thumb that individual cells should remain relatively concise; long or complex operations should be broken across multiple code snippets with explanatory text between them. Use your judgment!
◼ Use meaningful variable and function names that reflect their purpose in the workflow. Avoid generic names such as result1 or myLLMOutput.
◼ Remove exploratory or dead-end code from the final submission. The notebook should reflect the finished workflow rather than the entire development process.
◼ Include inline comments using (* this is a comment *) or CodeText-style cells preceding code cells to explain what the code is doing and why.
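For example (the index and query names are illustrative only):

```wolfram
(* A CodeText-style cell would precede this in the notebook; inline
   comments clarify the individual step *)
retrievedPassages = SemanticSearch[abstractIndex, userQuestion]; (* not "result1" *)
```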

Explanations and Documentation (10 points)

◼ Include clear, concise prose explanations at every stage of the workflow—not just code and outputs. The notebook should read as a coherent document, not a collection of cells.
◼ Explain all significant decisions: choice of LLM model or service, prompt design, tool definitions, orchestration logic, evaluation methodology and any parameter choices made along the way.
◼ Use visualizations where they add clarity—for example, to illustrate the structure of retrieved data, the distribution of LLM outputs or the performance of the system across test cases.
◼ A reader who is familiar with Wolfram Language and LLM concepts but unfamiliar with your specific project should be able to understand the full workflow from the notebook alone, without any accompanying verbal presentation.

References (5 points)

◼ Cite any external data sources, published research or prior work that the project builds upon or draws from.
◼ If the project was inspired by or attempts to replicate an existing system or paper, identify that work clearly and describe how your implementation relates to it.
◼ Include links to relevant Wolfram documentation, Prompt Repository entries or Neural Net Repository models used in the project.