This is the first part of a post series that serves as a concise guide to harnessing the possibilities of the Wolfram plugin for mathematics, physics, and Wolfram Language coding, with a special focus on effective problem-solving prompting techniques, exemplified by 100+ plugin-using ChatGPT chat sessions.
Check out other parts of this guide here:
Guide 2: https://community.wolfram.com/groups/-/m/t/3075538
Guide 3: https://community.wolfram.com/groups/-/m/t/3077476
Guide 4: https://community.wolfram.com/groups/-/m/t/3077667
Guide 5: https://community.wolfram.com/groups/-/m/t/3078038

Contents

About Large Language Models (LLMs)
How Do LLMs Work?
Usage Areas of ChatGPT (and LLMs in General)
Usage Demographics and Usage Types of ChatGPT (and LLMs in General)
Some LLM Comparisons
What LLMs Can Do and What They Can’t Do
Hallucinations and Other Shortcomings
The Importance of Prompt Engineering
Summary
Some Recently Discussed Prompt Techniques (All from the Literature)

Cell coloring conventions


About Large Language Models (LLMs)

How Do LLMs Work?

Two-sentence summary:
  • use a large amount of text fragments to train a network to predict the probability of the next word (token)
  • given a user input (prompt), recursively determine the next word probabilistically

  (This sounds simple, but it is important to always keep this in mind in ‘understanding’ certain LLM responses.)
    More detailed discussion:
    ​What Is ChatGPT Doing … and Why Does It Work?
    ( https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work )
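The generation loop itself is easy to sketch. The following Python toy uses a hand-written bigram table as a stand-in for the trained transformer (all words and probabilities here are invented for illustration); only the shape of the recursion is realistic: look up a next-token distribution, sample from it, append, repeat.

```python
import random

# Toy next-token model: a hand-made bigram table stands in for the
# trained network; only the generation loop's structure is realistic.
bigram_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(start, max_tokens=5, seed=0):
    rng = random.Random(seed)          # fixed seed: reproducible sampling
    tokens = [start]
    for _ in range(max_tokens):
        dist = bigram_probs.get(tokens[-1])
        if dist is None:               # no known continuation: stop
            break
        words, probs = zip(*dist.items())
        tokens.append(rng.choices(words, weights=probs)[0])  # sample next token
    return " ".join(tokens)

print(generate("the"))
```

With a fixed seed the output is reproducible; at nonzero temperature a real LLM samples afresh each time, which is exactly the limited reproducibility discussed below.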

    Limited reproducibility and emerging creativity

    The inherent non-deterministic behavior makes reproducing a concrete result often difficult or even impossible.
    LLMs can provide genuinely new, never-written-down texts, solutions, problems, proofs, code fragments, explanations, …
    (But don’t count on getting a perfect answer the first time.)
    Through external feedback, LLMs can correct and improve their answers.
    From Probing the “Creativity” of Large Language Models: Can Models Produce Divergent Semantic Association? (https://arxiv.org/abs/2310.11158)

    General reading recommendations
    

    Usage Areas of ChatGPT (and LLMs in General)

    Large language models have reached a remarkable level of performance in many areas.
    Their breadth and sometimes also depth of knowledge is unprecedented.
    In many scientific fields, LLMs would be able to pass undergraduate and sometimes graduate exams.
    Here is an incomplete list of recently suggested and/or tried-out usage areas:
    1. Text and Language Processing
    2. Code and Software Development
    3. Data Handling and Analysis
    4. Medical and Psychological Analysis
    5. Education, Teaching and Tutoring
    6. Scientific Research
    7. Automation and Control
    8. Creative and Idea Development
    9. Decision Making and Recommendations
    10. Industry-Specific Applications
    11. Modeling
    12. More

    Usage Demographics and Usage Types of ChatGPT (and LLMs in General)

    From Gender, Age, and Technology Education Influence the Adoption and Appropriation of LLMs ( https://arxiv.org/abs/2310.06556 )
    From AI and science: What 1,600 researchers think, NATURE (Oct 2023) (https://doi.org/10.1038/d41586-023-02980-0)
    From How ChatGPT is Transforming the Postdoc Experience, NATURE (Oct 2023) ( https://doi.org/10.1038/d41586-023-03235-8 )
    From The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions ( https://arxiv.org/abs/2310.12418 )

    Some LLM Comparisons

    A Survey of Large Language Models ( https://arxiv.org/abs/2303.18223 )
    This talk: The Wolfram plugin for GPT-4
    From GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond ( https://arxiv.org/abs/2309.16583 )
    From OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models ( https://arxiv.org/abs/2310.07637 )
    General telecommunication knowledge:
    From TeleQnA: A Benchmark Dataset to Assess Large Language Models Telecommunications Knowledge (https://arxiv.org/abs/2310.15051)

    What LLMs Can Do and What They Can’t Do

    “While some prefer to regard LLMs as agents with emergent intelligence comparable to humans and animals, our results reveal no emergent cognitive map or planning capacities. These findings are more consistent with the view that LLMs are programmable machines with natural language as their programming language.”
    From Evaluating Cognitive Maps and Planning in Large Language Models with CogEval ( https://arxiv.org/abs/2309.15129 )

    Hallucinations and Other Shortcomings

    Overview

    “The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail.”​
    From Embers of Auto-regression: Understanding Large Language Models Through the Problem They are Trained to Solve (https://arxiv.org/abs/2309.13638)
    “We have observed, however, that this negative outlook for LLMs arises from users treating the models as knowledge databases. We contend that this is not the best use of LLMs; they should instead be best used as reasoning engines or inference engines, which can provide extra-human cognition. LLMs can perform logic, analogy, causal reasoning, extrapolations and evidence evaluation.”
    From Large language models should be used as scientific reasoning engines, not knowledge databases (https://doi.org/10.1038/s41591-023-02594-z) (Oct 2023)

    Hallucinations

    A Survey of Hallucination in "Large" Foundation Models ( https://arxiv.org/abs/2309.05922 )
    “Hallucination in the context of a foundation model refers to a situation where the model generates content that is not based on factual or accurate information. Hallucination can occur when the model produces text that includes details, facts, or claims that are fictional, misleading, or entirely fabricated, rather than providing reliable and truthful information.
    ​
    This issue arises due to the model’s ability to generate plausible-sounding text based on patterns it has learned from its training data, even if the generated content does not align with reality. Hallucination can be unintentional and may result from various factors, including biases in the training data, the model’s lack of access to real-time or up-to-date information, or the inherent limitations of the model in comprehending and generating contextually accurate responses.”
    From A Survey of Hallucination in “Large” Foundation Models ( https://arxiv.org/abs/2309.05922 )
    Cognitive Mirage: A Review of Hallucinations in Large Language Models ( https://arxiv.org/abs/2309.06794 )
    ChatGPT Hallucinates when Attributing Answers ( https://arxiv.org/abs/2309.09401 )
    Exploring the Relationship between LLM Hallucinations and Prompt Linguistic Nuances: Readability, Formality, and Concreteness ( https://arxiv.org/abs/2309.11064 )
    How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions ( https://arxiv.org/abs/2309.15840 )
    AutoHall: Automated Hallucination Dataset Generation for Large Language Models ( https://arxiv.org/abs/2310.00259 )
    LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples ( https://arxiv.org/abs/2310.01469 )
    The Troubling Emergence of Hallucination in Large Language Models—An Extensive Definition, Quantification, and Prescriptive Remediations ( https://arxiv.org/abs/2310.04988 )

    Arithmetic and LLMs

    Using Large Language Model to Solve and Explain Physics Word Problems Approaching Human Level (https://arxiv.org/abs/2309.08182)
    GPT Can Solve Mathematical Problems Without a Calculator ( https://arxiv.org/abs/2309.03241 )
    Remark: It is an open question whether GPT architectures can learn to do arithmetic correctly.
    Solving the multiplication problem of a large language model system using a graph-based method ( https://arxiv.org/abs/2310.13016 )

    Symbolic mathematics

    LLMs can sometimes give impressive derivations, but often with small mistakes, even when solving single linear equations, adding terms, taking reciprocals, and the like.

    Fooling LLMs

    How susceptible are LLMs to Logical Fallacies? ( https://arxiv.org/abs/2308.09853 )

    Rationalizing incorrect conclusions

    Sometimes, especially when asked to compare derived solutions to known solutions, ChatGPT is overly confident and either claims equality of two non-equal values or tries to rationalize disagreement away.
    User prompt:
    I assume you are pretty familiar with the difference between Riemann integration and Lebesgue integration.
    After quickly outlining the differences, I want you to demonstrate explicitly how to compute the Lebesgue integral for

    integrate arccos(x) from x=0 to 1

    symbolically step-by-step. Meaning do a ‘slicing’ for symbolic n, compute the pre-images in closed form, sum up the contributions and then take the limit.

    For each step verify the computations using Wolfram Language.
    Show me the symbolic results for each step, as well as print the Wolfram Language code used for the verification.
    Make sure to include all definitions in each verification step.

    Finally compare the obtained result with the result from Integrate[ArcCos[x], {x, 0, 1}].
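For reference, the target value can also be sanity-checked outside the chat. The following Python sketch (not the Wolfram Language verification the prompt asks for) approximates the Lebesgue integral numerically via the layer-cake formula, using the fact that the level set {x : arccos x > t} is [0, cos t), with measure cos t, for t in [0, π/2]:

```python
import math

def lebesgue_sum(n):
    """Approximate the Lebesgue integral of arccos over [0, 1] via the
    layer-cake formula  ∫ f dμ = ∫ μ({x : f(x) > t}) dt,  where the level
    set {x : arccos x > t} is [0, cos t), with measure cos t, t ∈ [0, π/2]."""
    dt = (math.pi / 2) / n          # slice thickness in the *range* of f
    # midpoint sum over the slices: Σ μ({f > t_k}) Δt = Σ cos(t_k) Δt
    return sum(math.cos((k + 0.5) * dt) for k in range(n)) * dt

# Converges to 1, in agreement with Integrate[ArcCos[x], {x, 0, 1}] == 1.
print(lebesgue_sum(10_000))
```

Any answer ChatGPT derives for the slicing limit should therefore come out to exactly 1, making this a convenient test case for the rationalization failure described above.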

    Passion in abundance, energy on a budget

    Tackling/solving larger projects is possible, but will need repeated interactions/prompting.
    ‘Larger’ here means the problem description is between half a page and a full page in length.

    The Importance of Prompt Engineering

    Summary

    “Just as a key unlocks a door or a command instructs a computer program, prompts guide the response mechanisms of LLMs, determining the range and depth of answers they provide.”
    From PACE: Improving Prompt with Actor-Critic Editing for Large Language Model (https://arxiv.org/abs/2308.10088)
    “Designing prompts is difficult and time-consuming …, crafting appropriate prompts is not a straightforward task.”
    From QualiGPT: GPT as an easy-to-use tool for qualitative coding (https://arxiv.org/abs/2310.07061)
    The quality of the prompt determines the quality of your results.
    Prompt techniques = initial prompt + follow-up chat instructions.
    ​If in doubt, always write a longer, more detailed prompt.
    Some overall guidelines for good prompting techniques for solving mathematical/logical problems:
    ​
    ​Generally:​
    • Always be very detailed in describing what you want ChatGPT to do.
    • If possible, provide some examples. (few-shot learning)
    ​
    ​Specifically: Tell ChatGPT
    • to think ‘step-by-step’ (CoT—chain of thoughts; LogiCoT)
    • to express its thoughts/steps as program (snippets) (PoT—programs of thoughts).
    • to try various methods (Self-Consistency, ToT—tree of thoughts)
    • to decompose complex problems
    • to work ‘fine-grained’, considering all details
    • to first come up with a plan of how to tackle the problem, then solve it based on that plan
    • to recall similar and analogous problems before attempting to solve the problem
    • to put the problem into a larger context and use techniques from this larger context
    • to first contemplate the problem and then solve it
    • to remind itself of the general principles that govern a certain situation
    • to re-evaluate its results/derivation once it is finished
    • to return the answer in a given structured format for programmatic parsing
    • to follow multiple paths of reasoning
    • to foresee especially difficult problems ahead and contemplate how to solve them early on
    • to ask clarifying questions if needed
    • to reason backward in case the final answer is known and the path to the result has to be found
    • that it is an expert/professor/researcher in the field of relevance
    (Note: experts can be abstract; concrete, living or deceased; or fictional (e.g. a Frankenstein-ian Wikipedia-Mathematica-expert golem).)
    See also: Design of Chain-of-Thought in Math Problem Solving ( https://arxiv.org/abs/2309.11054 )
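One of the bullets above recommends asking for a structured answer format for programmatic parsing. A minimal Python sketch of why this helps: a JSON tail can be extracted mechanically even when preceded by free-form reasoning. (The response string is hard-coded for illustration; in practice it would come from the model.)

```python
import json

# Hard-coded stand-in for a model response that was asked to end with JSON.
response = """Reasoning: x^2 - 5x + 6 factors as (x - 2)(x - 3),
so the roots are 2 and 3.
{"answer": [2, 3], "confidence": "high"}"""

def extract_json(text):
    """Parse the trailing JSON object, tolerating free-form text before it."""
    return json.loads(text[text.rindex("{"):])

print(extract_json(response)["answer"])
```

Note this simple `rindex`-based extraction assumes the final JSON object contains no nested braces; a production parser would need to be more careful.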
    Even if this sounds unexpected: Motivate why you want to solve a certain problem.
    (Just saying ‘please’ doesn’t cut it, give a real human-centric reason.)
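Several of the bullets above (Self-Consistency, following multiple paths of reasoning) reduce to the same mechanical step: sample several independent answers and take a majority vote. A minimal Python sketch, with a deliberately noisy stub standing in for the actual chat-model call:

```python
import random
from collections import Counter

def sample_answer(problem, rng):
    """Stub for one stochastic 'reasoning path'. A real implementation
    would query a chat model at temperature > 0 and parse out the final
    answer; this noisy solver returns the right answer 90% of the time."""
    return rng.choice(["4"] * 9 + ["5"])

def self_consistency(problem, n_samples=25, seed=0):
    rng = random.Random(seed)
    votes = Counter(sample_answer(problem, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]   # majority answer across paths

print(self_consistency("What is 2 + 2?"))
```

The vote suppresses occasional wrong paths, which is why sampling many imperfect derivations can beat a single greedy one.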
    Some overall guidelines for good prompting techniques for solving mathematical problems or coding problems:
    ​
    ​Generally:​
    • Always be very detailed in describing what you want ChatGPT to do.
    • If possible, provide example inputs and outputs that should be matched. (few-shot learning).
    ​
    ​Specifically: Tell ChatGPT
    • to test the code on given examples or ask ChatGPT to create tests
    • to make an outline of the code before starting to write code
    • to make a table of function names, input arguments and result structure for each function to be implemented
    • to write well-commented code
    • to write code that uses self-explanatory variable names
    • to write modular code
    • to first discuss the computational complexity of the code
    • to try things out to see if they will work out (scratchpad mode)
    • to first discuss various possible implementations and compare the runtime complexities and code lengths
    • that it is an expert in writing code
    • For a more complicated problem, (virtually) never ask ChatGPT to ‘just’ give you the/a result!
    “There is a lot of evidence confirming that a series of coherent explanations helps an LLM to unleash its reasoning power, while some discouragement on its utterance, e.g. prompts like “just tell me the result without any explanation”, catastrophically hinders the reveal of wisdom.”
    From Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic (https://arxiv.org/abs/2309.13339)
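The coding guidelines above can be bundled into a reusable prompt template. A hypothetical Python sketch (the template wording and function names here are this author's illustration, not an established standard):

```python
# Hypothetical template bundling several of the coding guidelines above:
# outline first, function table, complexity discussion, tests, comments.
CODE_PROMPT = """You are an expert {language} programmer.

Task: {task}

Before writing any code:
1. Outline the implementation.
2. Make a table of function names, input arguments and result structures.
3. Discuss the expected computational complexity.

Then write well-commented, modular code with self-explanatory variable
names and test it on these examples:
{examples}

Do NOT just give the final result without explanation."""

def build_code_prompt(language, task, examples):
    return CODE_PROMPT.format(
        language=language,
        task=task,
        examples="\n".join(f"- {e}" for e in examples),
    )

print(build_code_prompt("Wolfram Language", "implement a run-length encoder",
                        ["runLength[{1, 1, 2}] should give {{1, 2}, {2, 1}}"]))
```

Keeping the guidelines in one template makes it easy to reuse them across tasks and to experiment with adding or removing individual instructions.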
    Once the LLM has provided a good result, one can often ask it to streamline the derivation or make a program computationally more efficient.
    The detailed ‘best’ prompt depends on the use of the LLM (arXiv:2310.07472):
    • LLM-as-subcontractor
    • LLM-as-critic
    • LLM-as-teammate
    • LLM-as-coach
    And, maybe surprisingly: the formatting of the prompt matters.
    For special attention
    • use all caps (e.g. ALWAYS …, NEVER …, ATTENTION: …)
    • use (possibly multiple) exclamation marks
    see also: Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting ( https://arxiv.org/abs/2310.11324 )
    from: A Communication Theory Perspective on Prompting Engineering Methods for Large Language Models ( https://arxiv.org/abs/2310.18358 )

    Some Recently Discussed Prompt Techniques (All from the Literature)

    The following refers to textual input. Multi-modal prompts are not covered.
    There is an exponentially increasing amount of literature on developing, generalizing, mixing and testing new prompting techniques.
    Thousands of arXiv preprints have been published on this topic this year: arXiv abstracts search link​
    • There is no ‘best’ prompt for every task.
    • It is important to experiment with prompts for the concrete task at hand.
    • Chain-of-Thought (CoT) is arguably the only universally good prompting technique.
    • Providing examples (few-shot prompting) is often a good way to achieve high fidelity (if applicable).
    You do not need to know (or use) all the techniques (basically every day new ones are added).
    The techniques marked with ‘’ are most important for STEM-like problem solving.
    As a rough guide: whatever hints would help a human to solve the problem will also help an LLM to solve the problem.
    All of the prompting techniques below are from the recent literature and have been shown to be useful for solving STEM-type problems. The usefulness of each concrete technique will vary by problem and will also depend on which other prompting techniques are utilized.
    Prompt engineering is not a clear-cut science, but it isn’t voodoo either.
    Today, prompting rules are mostly a set of experimentally supported rules of thumb.
    (Some have partial first-principle-based motivation.)
    ​
    We encourage the reader to try new, unconventional prompting techniques; they might give surprising results.
    Recent good prompting technique overview:
    ​Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review ( https://arxiv.org/abs/2310.14735 )

    Few-shot prompting 

    If you have solutions for similar problems, then tell ChatGPT a few of these examples (problem/solution pairs).
    For repeating tasks, this is a VERY powerful prompting technique.
    Unfortunately, for tackling new problems, the selection of problems is far from obvious.
    Two examples:
    Improving mathematics assessment readability: Do large language models help? (https://onlinelibrary.wiley.com/doi/abs/10.1111/jcal.12776)
    Who’s the Best Detective? LLMs vs. MLs in Detecting Incoherent Fourth Grade Math Answers (https://arxiv.org/abs/2304.11257)
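Mechanically, few-shot prompting is just string assembly: prepend solved problem/solution pairs so the model imitates the demonstrated format. A minimal Python sketch (the example problems are invented for illustration):

```python
def few_shot_prompt(examples, new_problem):
    """Prepend solved (problem, solution) pairs, then pose the new problem
    in the same format, ending where the model should continue."""
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in examples)
    return f"{shots}\n\nProblem: {new_problem}\nSolution:"

examples = [
    ("Solve 2 x + 4 == 10 for x.", "x == 3"),
    ("Solve 5 x - 5 == 10 for x.", "x == 3"),
]
print(few_shot_prompt(examples, "Solve 3 x + 1 == 10 for x."))
```

Ending the prompt at "Solution:" invites the model to complete the pattern, which is the whole mechanism behind few-shot learning.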

    Chain-of-Thought (CoT) 

    • let’s think step-by-step
    • ask for explanations
    • (potentially) provide examples
    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://arxiv.org/abs/2201.11903)
    Why does CoT work?
    ​The Expressive Power of Transformers with Chain of Thought ( https://arxiv.org/abs/2310.07923 )
    Related prompting technique: “Take a deep breath and work on this problem step-by-step.” (TDB)
    ​Large Language Models as Optimizers ( https://arxiv.org/abs/2309.03409 )

    Golden Chain-of-Thought (Au-CoT)

    This method has some limited applicability.
    True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 ( https://aclanthology.org/2023.starsem-1.28.pdf )

    Diversity of thought (HtT)

    Diversity of Thought Improves Reasoning Abilities of Large Language Models ( https://arxiv.org/abs/2310.07088 )

    <method>-of-Thought

    Where method = graph | hypergraph | tree | ladder | recursion | algorithm | skeleton | multimodal chain | multiple chain | diversity | …
    Thinking Like an Expert: Multimodal Hypergraph-of-Thought (HoT) Reasoning to boost Foundation Models ( https://arxiv.org/abs/2308.06207 )
    Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models ( https://arxiv.org/abs/2308.10379 )
    Test-Case–Driven Programming Understanding in Large Language Models for Better Code Generation (https://arxiv.org/abs/2309.16120)
    A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future (https://arxiv.org/abs/2309.15402)
    Towards Better Chain-of-Thought Prompting Strategies: A Survey (https://arxiv.org/abs/2310.04959)

    Expert prompting 

    Tell the LLM to act as an expert in the field of interest. (Be creative in constructing such an expert.)
    ExpertPrompting: Instructing Large Language Models to be Distinguished Experts (https://arxiv.org/abs/2305.14688)
  • Potentially ask the LLM to emulate multiple experts with different backgrounds.

    Understand → Plan → Act → Reflect (UPAR)

    Basically telling the LLM to use a human-like plan.
    UPAR: A Kantian-Inspired Prompting Framework for Enhancing Large Language Model Capabilities (https://arxiv.org/abs/2310.01441)

    Residual connection prompting (RESPROMPT) 

    Basically like putting multiple chains from CoT together.
    Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models (https://arxiv.org/abs/2310.04743)

    “The Importance of Prompt Engineering” section continues in the following post.

    CITE THIS NOTEBOOK

    Guide 1: The Wolfram Plugin for ChatGPT
    by Michael Trott
    Wolfram Community, STAFF PICKS, November 23, 2023
    ​https://community.wolfram.com/groups/-/m/t/3070428