
Data, LLMs, and You

Stats 390 Guest Lecture
April 7, 2025
John McNally
Principal Academic Solutions Developer
Wolfram Research

Syllabus

  • Current Picture of Productivity:
    Human Expert + LLM Agents ≫ Human Expert ≫ Well-Directed LLM ≫ Bumbling Human / Poorly-Directed LLM
  • Learning Objectives

  • Tabular data and analysis in Wolfram Language
  • LLM functionality and Wolfram Language
  • Applied Examples
  • Working With Data in Wolfram Language

  • Data from the Wolfram Knowledgebase

    Suppose you want to think about either buying a house or renting an apartment.
    In[]:= Chicago[CITY]["MedianHomeSalePrice"]
    Out[]= $390400.00
    In[]:= Chicago[CITY][{"RentOneBedrooms","RentTwoBedrooms"}]
    Out[]= {$830 per month, $980 per month}
    Ok, you can get this data, but how do you compare them? Are these numbers from the same year?
    Let’s retrieve a bunch of rent and housing data over time.
    In[]:= chihousing = EntityValue[Dated[Chicago[CITY], All], {"MedianHomeSalePrice", "RentZeroBedrooms", "RentOneBedrooms", "RentTwoBedrooms", "RentThreeBedrooms", "RentFourBedrooms"}]
    Out[]= {TimeSeries[01 Jan 1989 to 01 Jul 2024, 135 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points]}
    You can immediately see that the house-price data arrives every few months, while the rent data only arrives yearly, and only until 2014:
    In[]:=
    DateListPlot[QuantityMagnitude@chihousing,ScalingFunctions->"Log",Joined->False]
    Out[]= (log-scale date list plot of the six series)
    Why doesn’t the data extend to the present? Let’s look at the source:
    In[]:=
    Information[EntityProperty["City","RentTwoBedrooms"]]
    Out[]= EntityProperty["City", "RentTwoBedrooms"] ("2 bedroom apartment fair market rent")
        Description: Missing[NotAvailable]
        Source: Fair Market Rents
        Unit: USDollars/Months
        Physical Quantity: MoneyPerTime
        Qualifiers: Date
    In[]:= Fair Market Rents[DATA SOURCE]["NonMissingPropertyAssociation"]
    Out[]= <|"citation" -> "U.S. Department of Housing and Urban Development, Fair Market Rents", "entity type list" -> {data source}, "name" -> "Fair Market Rents", "organization name" -> "U.S. Department of Housing and Urban Development", "source name" -> "Fair Market Rents", "URL" -> http://www.huduser.org/portal/datasets/fmr.html|>
    As you can see, a URL is given where this data can be found. We’ll return to updating this data at the end.

    Your Turn

  • Ignoring the question of updating the rent data, how can you even compare a house purchase price to a rental cost?
  • Adding Your Domain Knowledge

    First, update the data you have to include the historical mortgage rate data as well:
    In[]:= rawtemporal = TemporalData[Append[chihousing, Dated[United States[COUNTRY], All][EntityProperty["Country", "ConventionalMortgageRate", {"Frequency" -> "Monthly", "MortgageDuration" -> "30Year"}]]], MetaInformation -> {"PathNames" -> {"HomePrice", "Studio", "1BR", "2BR", "3BR", "4BR", "Rate"}}]
    Out[]= TemporalData[01 Apr 1971 to 01 Mar 2025, 938 data points, 7 paths]
    What does this look like?
    In[]:=
    DateListPlot[QuantityMagnitude@rawtemporal,ScalingFunctions->"Log",Joined->False,PlotLegends->rawtemporal["PathNames"]]
    Out[]= (log-scale date list plot; legend: HomePrice, Studio, 1BR, 2BR, 3BR, 4BR, Rate)

    Your Turn

  • What will you need to do in order to analyze this data?
  • Tabular Data for Efficiency

    Let’s look at the data between Jan 1990 and Dec 2010.
    Let’s also leave the missing values as missing for now:
    As you can see, each column is typed. This lets the new Tabular functionality work with the speed-ups possible with typed information while also leveraging the symbolic programming of Wolfram Language.
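The cells for this step are omitted from this export; as a minimal sketch (assuming Wolfram Language 14.2’s Tabular, with two hypothetical rows standing in for the windowed housing data), a Tabular can be built from rows of associations, and each column is assigned a type automatically:

```wolfram
(* hypothetical rows; the real data comes from rawtemporal between Jan 1990 and Dec 2010 *)
tab = Tabular[{
   <|"Date" -> DateObject[{1990, 1}], "HomePrice" -> Quantity[79000, "USDollars"], "Rate" -> 9.90|>,
   <|"Date" -> DateObject[{1990, 7}], "HomePrice" -> Quantity[81000, "USDollars"], "Rate" -> 9.84|>}]
```

Here the "Date" column gets a date type, "HomePrice" a quantity type, and "Rate" a real-number type, which is what enables the typed-column speed-ups.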

    Your Turn

  • If you are making mortgage payments of $1604.92 per month for 30 years at 6.65% interest, what price of house should this buy you?
  • Transforming the Tabular Data

    In order to get fast calculations, let’s pre-compute the expression we’ll apply to the data as far as possible.
    This has a couple of advanced tricks, but leads to a faster function than if you tried to call Solve for each data point. Don’t worry if you don’t understand this particular trick. The point is that it defines a fast-working function you will apply to the data:
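The lecture’s actual cell isn’t preserved in this export, but the general idea can be sketched as follows: solve the standard fixed-rate mortgage equation symbolically once, then bake the solution into a plain function that can be mapped over the rows cheaply (no per-row Solve call):

```wolfram
(* sketch: monthly payment for a 30-year fixed-rate mortgage, solved once for `payment` *)
n = 360;  (* 30 years of monthly payments *)
sol = First@Solve[price == payment*(1 - (1 + r)^-n)/r, payment];
paymentFn = Function[{price, r}, Evaluate[payment /. sol]];
paymentFn[250000., 0.0665/12]  (* ≈ 1604.92 per month *)
```

This also answers the Your Turn question above: a $1604.92/month payment at 6.65% for 30 years corresponds to a price of about $250,000.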
    Recall what the original data looked like:
    Let’s take only rows where the home price is known. Then, compute the mortgage payment based on the price and rate:
    To plot this most nicely, it’s helpful to interpolate the missing rent information:
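The interpolation cell itself isn’t in this export; one sketch, assuming the yearly rent data is held as a TimeSeries (here named `rent2br`, a hypothetical variable), resamples it onto a monthly grid with linear interpolation:

```wolfram
(* fill the gaps in the yearly rent series by linear interpolation on a monthly grid *)
rentMonthly = TimeSeriesResample[rent2br, Quantity[1, "Months"],
   ResamplingMethod -> {"Interpolation", InterpolationOrder -> 1}]
```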
    Now the data is ready to plot:
    You can see the 2008 housing crash right there in the mortgage payment line.

    Forecasting Ahead

    We chose a cutoff in the data for a reason. Let’s see how to forecast ahead based on the previous data.
    First, split the data into two parts:
    Next, fit a model on the historical data:
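The fitting cell is omitted here; a minimal sketch (with `training` standing in for the pre-cutoff series) uses TimeSeriesModelFit, which searches candidate model families and scores them:

```wolfram
tsm = TimeSeriesModelFit[training];   (* automatic model-family and order selection *)
tsm["BestFit"]                        (* e.g. a SARIMAProcess[...] *)
tsm["CandidateSelectionTable"]        (* candidates ranked by the selection criterion *)
```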
    Notice how the best candidate models (those with the lowest AIC) are Seasonal Auto-Regressive Integrated Moving-Average (SARIMA) processes:
    You can see the criterion by which this model could be selected over other fits:
    But what does the model predict?
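A sketch of the prediction step, assuming `tsm` is a TimeSeriesModelFit result for the historical series and `training` is that series:

```wolfram
forecast = TimeSeriesForecast[tsm, {24}];  (* mean forecast for the next 24 time steps *)
DateListPlot[{training, forecast}, Joined -> True]
```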
    Visualize the prediction with some plotting options:
    Not bad. What other methods could we use to forecast this data?
    How good was the forecast?
    This plot suggests that perhaps one of the seasonal components was improperly chosen:

    What About Importing More Data?

    Import a local file:
    Or Import directly from the web:
    This data can be processed using the Wolfram|Alpha semantic interpreter for the counties.
    Clean the data to get just the Illinois counties:
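The cleaning cells aren’t preserved here; a hedged sketch of the interpretation and filtering steps (the county strings are hypothetical examples, and this assumes the "USCounty" interpreter type):

```wolfram
(* turn county-name strings into canonical entities via the semantic interpreter *)
counties = Interpreter["USCounty"] /@ {"Cook County, IL", "DuPage County, IL"};

(* keep only entities whose canonical name places them in Illinois *)
ilCounties = Select[counties,
   MatchQ[#, Entity["AdministrativeDivision", {_, "Illinois", "UnitedStates"}]] &]
```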
    Visualize the new data:
    Clean a larger subset of the data:
    Plot the results:

    Working with LLMs in Wolfram Language

    Aside from Notebook Assistant and Chat Notebooks, you can work with the latest language models programmatically:
    This time the model admits it doesn’t know, but gives good advice on how to find the information:
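For instance (a minimal sketch; the exact prompts used in the lecture aren’t preserved in this export):

```wolfram
(* one-off call to the configured LLM service *)
LLMSynthesize["Which Illinois county has the highest rent-to-income ratio?"]

(* a reusable templated call; `1` is filled with the argument *)
findSource = LLMFunction["Suggest one authoritative public data source for `1`."];
findSource["county-level fair market rents"]
```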

    Wrangling Data for an LLM Tool

    Let’s do a little digging:
    Luckily, the Wolfram Knowledgebase has the Census survey already:
    To clean this data, you will need to select the IL counties, get the 2019 fair market rent information (with units), and look up the 2019 income information:
    Now let’s calculate those ratios:
    Note: The function should be as simple as Function[#"XBR"/#"HouseholdIncome"]; however, there appears to be a bug with this particular unit conversion on my installation. The more complicated conversion function works around it.
    Was the LLM correct on its first guess? Unsurprisingly, no. However, it was at least in the top 5.
    Interestingly, the LLM guess was right for a two-bedroom apartment, but this was probably a coincidence.
    It is easy to visualize the new data we’ve computed:

    Making an LLM Tool

    An LLMTool will allow the model to have access to the data you have wrangled:
    An example use of the tool:
    Once the tool is defined, giving it to the LLM is easy:
    Because you gave the LLM a reliable tool, its inherent randomness is no longer a big problem:
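Put together, the pattern looks roughly like this (the tool name, description, and lookup function are illustrative, and `ratios` is a hypothetical association of precomputed county ratios):

```wolfram
(* a tool the model can call to get exact values instead of guessing *)
ratioTool = LLMTool[
   {"RentToIncomeRatio", "Look up the 2019 rent-to-income ratio for an Illinois county"},
   {"county"},
   Function[TextString[ratios[#county]]]];  (* `ratios`: county name -> ratio, assumed *)

LLMSynthesize["Which Illinois county has the highest rent-to-income ratio?",
  LLMEvaluator -> LLMConfiguration[<|"Tools" -> {ratioTool}|>]]
```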

    LLMs Are Stochastic (i.e., a Bit Random)

    Note that the LLM provides two different answers when asked the same question twice:
    Clearly, the LLM “going from memory” is not the best way to retrieve important factual information. This is where Retrieval Augmented Generation (RAG) comes in.

    Implementing a Simple Vector Database

    To automatically create a vector database associated with a source (in this case a wiki) you can use the CreateSemanticSearchIndex function:
    While you could create an LLMTool which the model would use, you can use an LLMPromptGenerator to programmatically add context:
    With a search of the index you have provided automatically conducted, the model will now answer according to the new information it was given:
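A rough shape of this setup (the index name and wiki file are placeholders, and this assumes the "Input" form hands the generator the raw input string):

```wolfram
(* build a vector index over the wiki text, then inject search hits as extra context *)
index = CreateSemanticSearchIndex[File["wiki.txt"], "WikiIndex"];
gen = LLMPromptGenerator[
   Function[input, TextString[SemanticSearch[index, input]]], "Input"];

LLMSynthesize["What does the wiki say about fair market rents?",
  LLMEvaluator -> LLMConfiguration[<|"PromptGenerators" -> {gen}|>]]
```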
    Note how the prompt and query that are used by default return the relevant information as the 5th context item in this case:
    This is because the search index and query were generated in the simplest automated way. If you provide a better search query to the index, you get the desired result as the first item returned:
    Another best practice is to think more carefully about what you want to embed and then associate a payload to return for that query.

    Adding Annotations to a Semantic Search Index

    With some string parsing, create an annotated list of sources and an annotated index:
    Define a helper function to visualize the results:
    Now, the most concise key phrase matches exceptionally well with the section heading that contains the desired info. The payload delivered to the model can be the text from that section:
    Using the user’s raw input still returns the desired section with highest relevance; however, the extraneous wording in the user’s raw input lowers the relevance score.
    To improve the search phrase used, you can use an LLM to summarize the user question into a search phrase:
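One way to sketch that summarization step (the prompt wording is illustrative):

```wolfram
(* compress a verbose user question into a concise search phrase *)
toSearchPhrase = LLMFunction["Rewrite the following question as a 3-6 word search phrase: `1`"];
toSearchPhrase["Hey, I was wondering roughly what a two-bedroom place goes for around Chicago these days?"]
```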
    This can be combined into a more sophisticated generator of prompts:
    You can hover over the tooltip to see what prompt was generated with this more sophisticated set-up:

    Additional Resources

    Aside from this notebook and Notebook Assistant, here are more resources for doing data science in Wolfram Language:
    Wolfram Cloud
