​
Slide 1 of 39

Robust LLM Pipelines

Making Robust LLM Computational Pipelines from a Software Engineering Perspective
Anton Antonov
​Wolfram U Data Science Boot Camp​
​South Florida Data Science Study Group​
August, September 2024
Video recording: "Robust LLM Pipelines" (YouTube/@AAAPrediction)
​
Slide 2 of 39

What is the talk about?

Basic premise of the talk:
LLMs are unreliable and slow.
Hence, we look at techniques for harnessing LLMs so that they can be utilized more robustly and effectively.

Goals

◼ Communicate how LLMs can be harnessed in some useful ways
◼ Exposure to Software Engineering and Software Architecture perspectives on LLM utilization
◼ Techniques that can be used across different languages and systems

    DALL-E 3 prompt

    Caricature in the style of propaganda art that illustrates the following statement:
    “I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes.”
    Make sure to represent the AI with a robot. DO NOT draw or place any text in the image.
    ​
Slide 3 of 39

Who am I?
    

    ​
Slide 4 of 39

    Pipelines? What pipelines? 1

    First encounter

    Here is a Machine Learning (ML) classification monad pipeline:
    In[]:=
SeedRandom[33];
clObj =
  ClConUnit[dsTitanic] ⟹
   ClConSplitData[0.75] ⟹
   ClConEchoDataSummary ⟹
   ClConMakeClassifier["LogisticRegression"] ⟹
   ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall"}] ⟹
   ClConEchoValue ⟹
   ClConROCPlot[ImageSize -> Medium];
»
summaries:
trainingData
  1 id: Min 5, 1st Qu 327.75, Median 643, Mean 653.696, 3rd Qu 983.25, Max 1308
  2 passengerClass: 3rd 528, 1st 243, 2nd 210
  3 passengerAge: Min -1, 1st Qu 10, Median 20, Mean 23.949, 3rd Qu 40, Max 80
  4 passengerSex: male 635, female 346
  5 passengerSurvival: died 606, survived 375
testData
  1 id: Min 1, 1st Qu 332.5, Mean 658.899, Median 691, 3rd Qu 970, Max 1309
  2 passengerClass: 3rd 181, 1st 80, 2nd 67
  3 passengerAge: Min -1, 1st Qu 0, Median 20, Mean 22.3567, 3rd Qu 35, Max 70
  4 passengerSex: male 208, female 120
  5 passengerSurvival: died 203, survived 125
»
value: Accuracy -> 0.771341, Precision -> <|died -> 0.82, survived -> 0.695313|>, Recall -> <|died -> 0.807882, survived -> 0.712|>
»
ROC plot(s): died, survived (plots omitted)
    

    Re-done with LLMs

    One possible way to use LLMs in software pipelines can be illustrated like this:
    In[]:=
SeedRandom[33];
clObj2 =
  LLMClCon["use the dataset dsTitanic"] ⟹
   LLMClCon["split the data with 0.75 ratio"] ⟹
   LLMClCon["summarize the data"] ⟹
   LLMClCon["make a logistic regression classifier"] ⟹
   LLMClCon["show classifier measurements"] ⟹
   LLMClCon["show ROC plots using image size 400"];
»
summaries: trainingData and testData summaries identical to the ones echoed above (same random seed, hence the same split)
    
Remark: Every time the pipeline is executed the splitting is different -- RandomSample is used. (Hence the need for SeedRandom above.)

Slide 5 of 39

    Pipelines? What pipelines? 2

    Schematically

    ​
Slide 6 of 39

    Outline

    Modeling LLM-usage frustration (maybe)

    ​
Slide 7 of 39

What are we not going to talk about?

◼ Training LLMs from scratch
◼ Making and using Retrieval Augmented Generation (RAG) pipelines
◼ Remedies for LLMs “hitting the data wall”
◼ War wrangling, drone deploying, spying streamlining, etc.

Slide 8 of 39

    Talk-meta

    ... aka “Managing expectations and observations”
◼ The proposed techniques transfer to other systems and programming languages that use LLMs
  ◼ For example, Python, Raku, Wolfram Language.
◼ Do not get distracted by the shown pipelines
  ◼ They are just there to communicate the concepts more effectively.
◼ The LLM Domain Specific Language (DSL) in WL is “timeless”
  ◼ The rest of the techniques would always be applicable, but become less relevant over time
  ◼ The loss of relevance would be very non-uniform.
◼ The talk uses a mind-map
  ◼ Questions are answered between the main branches.
  ◼ Or at any time...
◼ Wolfram Language (aka Mathematica) is used, but corresponding examples in Python and Raku are also shown
◼ In order to save time, not all LLM commands are going to be evaluated
◼ Poll
  ◼ Who does R? And who does Python?
  ◼ Who uses LLMs in any way?
  ◼ Who uses LLMs in production settings?

Slide 9 of 39

    Pipelines (Data Wrangling)

    Here is a dataset (Titanic data):

    WL & R pipelines

    Here is a natural language specification of a data transformation workflow:
Remark: We call these kinds of specifications Domain Specific Language (DSL) specifications.
    Here is how this DSL specification is translated to R and WL:

    Python pipeline

    〉
obj = dsTitanic.copy()                                   # copy the Titanic data frame
print(obj.describe())                                    # summarize the data
obj = obj.groupby(["passengerClass", "passengerSex"])    # group by class and sex
print(obj.size())                                        # sizes of the groups

    WL pipeline

    Let us consider the execution steps of the WL pipeline:
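The WL code cells are not included in this export; here is a minimal sketch of a corresponding WL pipeline, assuming dsTitanic is a Dataset with the columns used above and using the Wolfram Function Repository function RecordsSummary:

obj = dsTitanic;                                            (* start with the Titanic dataset *)
ResourceFunction["RecordsSummary"][obj];                    (* summarize the data *)
obj = GroupBy[obj, {#passengerClass, #passengerSex} &];     (* group by class and sex *)
Map[Length, obj]                                            (* sizes of the groups *)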
    ​
Slide 10 of 39

    Pipelines (Quantile Regression)

    WL

    Here we construct a pipeline (using built-in WL functions):
    Here we evaluated it:
    Take a more detailed look:

    Python

Here is that pipeline in Python:
    ​
Slide 11 of 39

LLM failings (GeoGraphics)

    This graffiti analysis is amazing:
◼ Ninja-turtles-graffiti-demo.nb
But what happens when we request complete, working WL code?
◼ Geo-graphics-plots-for-Cuba-and-Caribbean.nb

Slide 12 of 39

Slide 13 of 39

LLM failings (Populations)

    First call

    Failings

    Should work, but it does not:

    Fixes

How to fix it? One way that should work (but it does not work reliably for countries like “Brazil”):
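A hedged sketch of such a fix (not the talk's exact code): request strict JSON, import it, and only then map the country names with Interpreter:

(* Ask for JSON only, then import it *)
res = LLMSynthesize[{
    "What are the populations of the countries in South America?",
    LLMPrompt["NothingElse"]["JSON"]}];
data = ImportString[res, "RawJSON"];       (* may need stripping of code fences first *)
KeyMap[Interpreter["Country"], data]       (* the fragile step: "Brazil" et al. go through Interpreter *)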
Remark: Is this an LLM problem or an interpreter problem?
    ​
Slide 14 of 39

    How do you use LLMs?

    Narration

    Here is a corresponding description:
◼ Start : The beginning of the process.
◼ Outline a workflow : The stage where a human outlines a general workflow for the process.
◼ Make LLM function(s) : Creation of specific LLM function(s).
◼ Make pipeline : Construction of a pipeline to integrate the LLM function(s).
◼ Evaluate LLM function(s) : Evaluation of the created LLM function(s).
◼ Assess LLM's Outputs : A human assesses the outputs from the LLM.
◼ Good or workable results? : A decision point to check whether the results are good or workable.
◼ Can you programmatically change the outputs? : If not satisfactory, a decision point to check whether the outputs can be changed programmatically.
  ◼ The human acts like a real programmer.
◼ Can you verbalize the required change? : If not programmable, a decision point to check whether the changes can be verbalized.
  ◼ The human programming is delegated to the LLM.
◼ Can you specify the change as a set of training rules? : If not verbalizable, a decision point to check whether the change can be specified as training rules.
  ◼ The human cannot program or verbalize the required changes, but can provide examples of those changes.
◼ Is it better to make additional LLM function(s)? : If the changes can be verbalized, a decision point to check whether it is better to make additional LLM function(s), or to change prompts or output descriptions.
◼ Make additional LLM function(s) : Make additional LLM function(s) (since that is considered to be the better option).
◼ Change prompts of LLM function(s) : Change the prompts of already created LLM function(s).
◼ Change output description(s) of LLM function(s) : Change the output description(s) of already created LLM function(s).
◼ Apply suitable (sub-)parsers : If the changes can be programmed, choose or program, and apply, suitable parser(s) or sub-parser(s) for the LLM's outputs.
◼ Program output transformations : Transform the outputs of the (sub-)parser(s) programmatically.
◼ Overall satisfactory (robust enough) results? : A decision point to assess whether the results are overall satisfactory.
  ◼ This should include an evaluation or estimate of how robust and reproducible the results are.
◼ Willing and able to apply different model(s) or model parameters? : A decision point on whether the LLM functions pipeline should be evaluated or tested with a different LLM model or model parameters.
  ◼ In view of robustness and reproducibility, systematic change of the LLM models and of the LLM functions pipeline inputs should be considered.
◼ Change model or model parameters : If willing to change models or model parameters, then do so.
  ◼ Different models can have different adherence to prompt specs, evaluation speeds, and evaluation prices.
◼ Make LLM example function : If the changes can be specified as training rules, make an example function for the LLM.
◼ End : The end of the process.
To summarise:
◼ We work within an iterative process for refining the results of the LLM function(s) pipeline.
◼ If the overall results are not satisfactory, we loop back to the workflow outlining stage.
◼ If additional LLM functions are made, we return to the pipeline creation stage.
◼ If prompts or output descriptions are changed, we return to the LLM function(s) creation stage.
◼ Our (human) inability or unwillingness to program transformations has a few decision steps for delegation to LLMs.
Remark: We leave it as an exercise for the reader to see how the workflows programmed below fit the flowchart above.
    Remark: The mapping of the workflow code below onto the flowchart can be made using LLMs.
    ​
Slide 15 of 39

    DSLs for LLMs

◼ Coming up with the DSL design
◼ The three phases
  ◼ Configuration and Evaluator
  ◼ Invocation
  ◼ Post-processing
◼ Chat object management
〉
res = llm_synthesize(
    ["What are the populations in India's states?",
     llm_prompt("NothingElse")("JSON")],
    llm_evaluator = llm_configuration(spec = "chatgpt", model = "gpt-3.5-turbo"))

    Example

    Here is a plot:
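The plot cell itself is omitted in this export; a minimal WL sketch of the same idea (prompt and model taken from the Python example above) could be:

res = LLMSynthesize[{
    "What are the populations in India's states?",
    LLMPrompt["NothingElse"]["JSON"]},
   LLMEvaluator -> LLMConfiguration[<|"Model" -> "gpt-3.5-turbo"|>]];
data = ImportString[res, "RawJSON"];        (* an association: state -> population *)
BarChart[data, ChartLabels -> Automatic, ImageSize -> Large]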
    BTW, we get non-computational results without the “NothingElse” spec:
    ​
Slide 16 of 39

    Creation of an LLM function

Here is a sequence diagram that follows the steps of a typical creation procedure of LLM configuration and evaluator objects, and of the corresponding LLM function that utilizes them:
    Compare with this spec:
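Neither the diagram nor the spec is reproduced in this export; a minimal sketch of the creation steps being depicted (the model name is just an example) might look like this:

(* 1. Configuration: model and parameters *)
conf = LLMConfiguration[<|"Model" -> "gpt-4o-mini", "Temperature" -> 0.2|>];

(* 2. & 3. The configuration is used as the evaluator of a parameterized LLM function *)
fGDP = LLMFunction["What is the GDP of `1`? Give only the number, in USD.",
   LLMEvaluator -> conf];

fGDP["Sweden"]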
    ​
Slide 17 of 39

    LLM function examples

    Questions

◼ What other queries can be done with that LLM function?
◼ How do you change the LLM function to give quantities other than GDP?
◼ Can the LLM results be used further?
  ◼ Say, to plot a bar chart.
◼ What would you do to plot the answer?
  ◼ Change the LLM prompt / function?
  ◼ Transform the results?

Partial answer
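The answer cells are omitted here; one hedged sketch of the "transform the results" route, assuming an fGDP-like function as above and accepting that the extracted numbers may need extra normalization:

countries = {"Germany", "France", "Italy", "Spain"};
txts = fGDP /@ countries;                     (* textual LLM answers *)
vals = Interpreter["SemanticNumber"][txts];   (* convert "3.5 trillion"-style answers to numbers *)
BarChart[AssociationThread[countries -> vals], ChartLabels -> Automatic]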

    ​
Slide 18 of 39

    Creation and execution

    Here is a sequence diagram for making an LLM configuration with a global (engineered) prompt, and using that configuration to generate a chat message response:

    Narration

    Step-by-step explanation of the UML sequence diagram:
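To complement the (omitted) diagram and narration, here is a minimal sketch of the interaction being described: an LLM configuration carrying a global, engineered prompt, which then conditions the generated response (the prompt name is one from the Wolfram Prompt Repository, mentioned later in the talk):

(* Configuration with a global ("engineered") prompt *)
conf = LLMConfiguration[<|"Prompts" -> {LLMPrompt["SouthernBelleSpeak"]}|>];

(* Every message answered with this evaluator is conditioned by that prompt *)
LLMSynthesize["Tell me about quantile regression.", LLMEvaluator -> conf]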
    ​
Slide 19 of 39

    Chatbook objects management

    The following Unified Modeling Language (UML) diagram outlines a chat objects management system that can be used in chatbooks:
    Remark: This flowchart can be seen as closely reflecting what is conceptually happening when using WL chatbooks.
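The UML diagram is not reproduced in this export; conceptually, the managed objects correspond to WL's chat objects, which can be sketched like this (an illustration, not the chatbook's internal code):

chat = ChatObject[];                                               (* create a chat object *)
chat = ChatEvaluate[chat, "What is quantile regression, in one sentence?"];
chat = ChatEvaluate[chat, "How does it differ from ordinary least squares?"];
chat["Messages"]                                                   (* the accumulated message history *)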
    ​
Slide 20 of 39

    Prompt engineering

◼ No need to talk much about it after Michael Trott's presentation and classes
◼ Long and elaborate prompts
  ◼ Give good results (most of the time)
  ◼ Slow
◼ Making your own LLM personas, functions, and modifiers within Wolfram's ecosystem
  ◼ Wolfram Prompt Repository
  ◼ "Eat your own dog food" (example)
  ◼ "Re-Leonidas" (example)
◼ Jupyter/Raku versions
  ◼ LLMPrompts at PyPI.org
  ◼ LLM::Prompts at raku.land
◼ SouthernBelleSpeak
  ◼ See/execute in Jupyter/Python
◼ Chessboard generation -- deeper look
  ◼ Chessboard-generations-and-cat-blending.nb

Slide 21 of 39

    Examples-based 1

◼ Example of few-shot training
◼ Note that the GDP query above, actually, did not produce "actionable results."
◼ Elaborate examples with monadic pipelines

Slide 22 of 39

    Few-shot training

    Consider the following utility function:
Let us apply it to the results from the fGDP* functions:
    Convert numbers:
    Ingest via JSON importing:
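The actual utility function is not included in this export; here is a hedged sketch of the same post-processing idea: a small example-based (few-shot) converter for the numbers, and JSON import for structured ingestion:

(* Example-based ("few-shot") converter: number phrases -> digits *)
toNumber = LLMExampleFunction[{
    "2.3 million" -> "2300000",
    "1.2 trillion USD" -> "1200000000000",
    "about 45 thousand" -> "45000"}];

ToExpression[toNumber["3.1 billion dollars"]]

(* Structured ingestion: ask for JSON only, then import it *)
json = LLMSynthesize[{"What are the GDPs of the G7 countries?",
    LLMPrompt["NothingElse"]["JSON"]}];
ImportString[json, "RawJSON"]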
    ​
Slide 23 of 39

    Pipeline segments translation 1

    Rules

    Here is a set of rules for translating “free text” commands into ML classification workflow code:
    Example:
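The rule set itself is not included in this export; here is a minimal sketch of the idea, using an example-based LLM function to map free-text commands to ClCon pipeline segments (the example rules are illustrative, not the talk's full set):

toClCon = LLMExampleFunction[{
    "use the dataset dsTitanic" -> "ClConUnit[dsTitanic]",
    "split the data with 0.7 ratio" -> "ClConSplitData[0.7]",
    "summarize the data" -> "ClConEchoDataSummary",
    "show classifier measurements" ->
     "ClConClassifierMeasurements[{\"Accuracy\", \"Precision\", \"Recall\"}]"}];

toClCon["make a logistic regression classifier"]
(* ideally: ClConMakeClassifier["LogisticRegression"] *)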

    Wrapper function

    ​
Slide 24 of 39

    Pipeline segments translation 2

    Full pipeline

    Continuation

    ​
Slide 25 of 39

    Question Answering System (QAS)

◼ In brief: Instead of training LLMs to produce code, we ask them to extract parameter values (see the sketch after this list).
◼ Private code: You do not have to share your code with the LLMs.
  ◼ The LLMs only need to know what kinds of parameters are involved.
◼ Simplicity: Using a QAS is also based on the assumption that generating much shorter texts is more robust than generating longer texts.
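Here is a minimal sketch of the idea; the questions and the code template are illustrative, not the talk's actual ones:

cmd = "Make a logistic regression classifier over dsTitanic with a 0.7 split.";

(* Short, parameter-extraction questions (short answers are easier to get right) *)
questions = {
   "Which dataset is used?",
   "What is the split ratio? Answer with a number only.",
   "Which classifier method is used? Answer with one or two words, in CamelCase."};

answers = LLMSynthesize[{"The command is: ", cmd, "\n", #,
      LLMPrompt["NothingElse"]["the shortest possible answer"]}] & /@ questions;

(* The code template itself never leaves the local session *)
TemplateApply[
  "ClConUnit[`1`] ⟹ ClConSplitData[`2`] ⟹ ClConMakeClassifier[\"`3`\"]",
  answers]
(* In practice the raw answers usually need normalization: trimming, case fixing, etc. *)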
Slide 26 of 39

QAS with SLMs

Here is an example of code generation via QAS that uses a Small Language Model (SLM):

    How it works

    ​
Slide 27 of 39

    QAS with LLMs

Of course, we can use LLMs, and get -- most likely -- more reliable results. Here is an example:
Here is a list of questions and a list of corresponding answers for getting the parameters of the pipeline:
    Another example:
    ​
Slide 28 of 39

    QAS neat example

    Random mandala creation by verbal commands

WFR’s RandomMandala takes its arguments -- such as the rotational symmetry order -- through option specifications. Here we make a list of the options we want to specify:
    Here we create corresponding "extraction" questions and display them:
    Here we define rules to make the LLM responses (more) acceptable by WL:
    Here we define a function that converts natural language commands into images of random mandalas:
    Here is an example application:
    Here is an application with multi-symmetry and multi-radius specifications:
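The actual definitions are omitted in this export; here is a hedged, simplified sketch of the approach (the two option names follow WFR's RandomMandala; the questions and conversion rules are illustrative):

(* Hypothetical, simplified converter: natural language command -> random mandala *)
questions = <|
   "RotationalSymmetryOrder" ->
    "What rotational symmetry order(s) are specified? Answer with numbers only.",
   "Radius" -> "What radius or radii are specified? Answer with numbers only."|>;

randomMandalaByCommand[cmd_String] :=
  Module[{answers},
   answers = Map[
     LLMSynthesize[{"The command is: ", cmd, "\n", #,
        LLMPrompt["NothingElse"]["numbers"]}] &, questions];
   answers = ToExpression["{" <> # <> "}"] & /@ answers;   (* make the answers WL-acceptable *)
   ResourceFunction["RandomMandala"][
    "RotationalSymmetryOrder" -> answers["RotationalSymmetryOrder"],
    "Radius" -> answers["Radius"]]];

randomMandalaByCommand["Random mandala with rotational symmetry order 6 and radius 10"]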
    ​
Slide 29 of 39

    Grammar-LLM combinations 1

    Use case (formulation)

    Assume that we have been given the task to:
◼ Gather opinions about programming languages
◼ One person can give opinions about several languages
◼ A relatively large number of people are interviewed (e.g. 300+)
◼ Each interviewee is asked to give short, one-sentence opinions

Slide 30 of 39

    Grammar-LLM combinations 2

    Grammar and parsers

    Here is an Extended Backus-Naur Form (EBNF) grammar for expressing programming language opinions:
    Here are random sentences generated with that grammar:
    Here are parsers generated for that grammar:
    ​
    Slide
    31
    of
    39

    Grammar-LLM combinations 3

    LLM function

    Here we define an LLM function that converts programming language opinions into JSON dictionaries:
    Here is an example invocation:
    Note that misspellings are handled:
The grammar does not parse it:
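The definition is not shown in this export; a minimal sketch of such an LLM function (the JSON keys and prompt wording are illustrative) is:

opinionToJSON = LLMFunction[
   "Convert the following programming language opinion into a JSON dictionary " <>
    "with keys \"language\" and \"sentiment\": `1`",
   LLMEvaluator ->
    LLMConfiguration[<|"Prompts" -> {LLMPrompt["NothingElse"]["JSON"]}|>]];

(* Misspellings are generally handled *)
ImportString[opinionToJSON["I quite lik Pyton for data wranglin"], "RawJSON"]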
    ​
Slide 32 of 39

    Grammar-LLM combinations 4

    Overall retriever function

    The following function combines the parsing with the grammar and the LLM function:
The combination uses the Chain of Responsibility pattern:
◼ If a statement can be parsed with the grammar, then the parsing result is interpreted into a rule
◼ Otherwise the statement is given to the LLM function
Remark: Note that the function wraps the results with the symbols Grammar and LLM in order to indicate which interpreter was used.
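Here is a hedged sketch of that combination; parseOpinion stands in for the grammar-generated parser from the previous slide, and opinionToJSON for the LLM function:

(* Chain of Responsibility: try the cheap, deterministic grammar first; fall back to the LLM *)
retrieveOpinion[statement_String] :=
  Module[{res = parseOpinion[statement]},    (* parseOpinion: hypothetical grammar parser *)
   If[! FailureQ[res],
    Grammar[res],                            (* wrapped to indicate which interpreter was used *)
    LLM[ImportString[opinionToJSON[statement], "RawJSON"]]
    ]];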
    ​
Slide 33 of 39

Grammar-LLM combinations 5

    Experiments

    Here we expect the grammar to be used:
    Here too:
    Here we expect the LLM function to be used:
    Here is a set of statements (some with misspellings):
    Here is how they are interpreted:
    Here we gather the opinions from the obtained interpretations:
    Remark: The grammar can be extended in order to decrease the LLM usage.
    ​
Slide 34 of 39

Testing with data types and shapes

This technique, testing with data types and shapes over multiple LLM results, is both a “no-brainer” and severely under-utilized.
    ​
Slide 35 of 39

Testing with data types and shapes 2

In order to apply this technique, we have to have a way to comprehensively derive data types for different (serializable) data structures.
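Here is a minimal sketch of the idea with an illustrative (not library-provided) type-deriving helper: run the LLM request several times and keep only the results whose derived type matches the expected signature:

(* Illustrative type derivation for simple, serializable structures *)
deriveType[x_?NumberQ] := "Number";
deriveType[x_String] := "String";
deriveType[x_List] := {"List", Length[x], DeleteDuplicates[deriveType /@ x]};
deriveType[x_Association] := {"Assoc", Length[x], DeleteDuplicates[deriveType /@ Values[x]]};
deriveType[___] := "Other";

(* Several results for the same request *)
results = Table[
   ImportString[
    LLMSynthesize[{"What are the GDPs of the G7 countries?",
      LLMPrompt["NothingElse"]["JSON"]}], "RawJSON"], {3}];

(* Keep only the results with the expected type and shape: an association of 7 numbers *)
Select[results, MatchQ[deriveType[#], {"Assoc", 7, {"Number"}}] &]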

    WL

    See Wolfram Community notebook "Workflows with LLM functions (in WL)".

    Python

    See the Jupyter notebook “Robust-LLM-Pipelines-SouthFL-DSSG-Python.ipynb”
    See Wolfram Community notebook "Workflows with LLM functions (in Python)".

    Raku

    See Wolfram Community notebook "Workflows with LLM functions (in Raku)".
    ​
Slide 36 of 39

    Modeling accumulated frustration & money

    ... while using LLMs
◼ Using a System Dynamics (SD) model
  ◼ Using "MonadicSystemDynamics"
  ◼ Based on the "LLM developer decisions" flowchart shown earlier
◼ Qualitative investigations
◼ Calibration challenges
  ◼ These are not excuses not to do calibration.
◼ The SD model:
  ◼ System-Dynamics-model-for-LLM-function-pipelines-usage.nb

Slide 37 of 39

Leftover comments or questions

Leftover remarks

◼ “Less is more” is at play with LLMs.
◼ Repository submissions during this South FL DSSG presentation:
  ◼ QuantileRegression at PyPI.org
  ◼ Regressionizer at PyPI.org
◼ Small Language Models (SLMs) are / will / would become used more because of LLMs.
◼ Using pictures to make pipelines.
◼ Experimenting with LLMs is like having a second part-time job.
  ◼ And the know-how and impressions become obsolete quickly...
◼ Pareto principle for workflows.
◼ Figuring out Truchet tiling for the headline image
  ◼ Instead of a mandalas strip

Anticipated questions

◼ Which is your favorite technique? (Of those presented / related.)
◼ Why do people prefer using ChatGPT’s interface?
  ◼ Instead of chatbooks.
◼ Are there Jupyter chatbooks?
  ◼ Yes, both in Python and Raku
◼ Does Python/Raku/WL support grammars?
◼ Why do you use Raku?
◼ Why do you use monads?
◼ Are these techniques applicable without monads or pipelines?
  ◼ Yes, fully.
  ◼ Monads and pipelines are used in the presentation to speed up knowledge transfer.

Slide 38 of 39

    References

    Articles

    [AA1] Anton Antonov, Notebook transformations, (2024), RakuForPrediction at WordPress.
    [SW1] Stephen Wolfram, "The New World of LLM Functions: Integrating LLM Technology into the Wolfram Language", (2023), Stephen Wolfram Writings.
    [SW2] Stephen Wolfram, "Introducing Chat Notebooks: Integrating LLMs into the Notebook Paradigm", (2023), Stephen Wolfram Writings.
    [SW3] Stephen Wolfram, "Can AI Solve Science?", (2024), Stephen Wolfram Writings.

    Notebooks

    [AAn1] Anton Antonov, "Workflows with LLM functions (in Python)", (2023), Wolfram Community.
    [AAn2] Anton Antonov, "Workflows with LLM functions (in Raku)", (2023), Wolfram Community.
    [AAn3] Anton Antonov, "Workflows with LLM functions (in WL)", (2023), Wolfram Community.
    [AAn4] Anton Antonov, «Comprehension AI aids for Stephen Wolfram's article "Can AI Solve Science?”», (2024), Wolfram Community.

    Functions, paclets

[AAf1] Anton Antonov, LLMTextualAnswer WL paclet, (2024), Wolfram Language Paclet Repository.
[AAf2] Anton Antonov, RandomMandala WL paclet, (2024), Wolfram Language Paclet Repository.
[AAp1] Anton Antonov, NLPTemplateEngine WL paclet, (2024), Wolfram Language Paclet Repository.
[AAp2] Anton Antonov, MonadicContextualClassification WL paclet, (2024), Wolfram Language Paclet Repository.
[AAp3] Anton Antonov, MonadicQuantileRegression WL paclet, (2024), Wolfram Language Paclet Repository.

    Videos

    [AAv1] Anton Antonov, Natural Language Processing Template Engine, (2022), Wolfram Technology Conference 2022 presentation. YouTube/WolframResearch.
    [AAv2] Anton Antonov, NLP Template Engine, Part 1, (2021), YouTube/AntonAntonov.
    [AAv3] Anton Antonov, Monte Carlo demo notebook conversion via LLMs and parsers, (2024), YouTube/AntonAntonov.
    [AAv4] Anton Antonov, LLaMA models running guide (Raku), (2024), YouTube/AntonAntonov.
    ​
Slide 39 of 39

    Setup code

    Functions

    Paclets
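The setup cells are not included in this export; a hedged sketch of the paclet loading they likely perform (the paclet names are taken from the References slide; the "AntonAntonov/..." publisher prefix and contexts are assumptions):

(* Assumed installs/loads; names from the References slide, prefix is an assumption *)
PacletInstall["AntonAntonov/MonadicContextualClassification"];
PacletInstall["AntonAntonov/MonadicQuantileRegression"];
PacletInstall["AntonAntonov/NLPTemplateEngine"];

Needs["AntonAntonov`MonadicContextualClassification`"];
Needs["AntonAntonov`MonadicQuantileRegression`"];
Needs["AntonAntonov`NLPTemplateEngine`"];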

    Data

    Quantile Regression

    Latent Semantic Analysis

    Classification

    LLM functions

    Code formatting

    ClCon

    QRMon