
Data, LLMs, and You

Stats 390 Guest Lecture
April 7, 2025
John McNally
Principal Academic Solutions Developer
Wolfram Research

Syllabus

  • Current Picture of Productivity:
    Human Expert + LLM Agents ≫ Human Expert ≫ Well-Directed LLM ≫ Bumbling Human / Poorly-Directed LLM
  • Learning Objectives

  • Tabular data and analysis in Wolfram Language
  • LLM functionality and Wolfram Language
  • Applied Examples
  • Working With Data in Wolfram Language

  • Data from the Wolfram Knowledgebase

    Suppose you want to think about either buying a house or renting an apartment.
    In[]:= Chicago[CITY]["MedianHomeSalePrice"]
    Out[]= $390400.00
    In[]:= Chicago[CITY][{"RentOneBedrooms","RentTwoBedrooms"}]
    Out[]= {$830 per month, $980 per month}
    Ok, you can get this data, but how do you compare them? Are these numbers from the same year?
    Let’s retrieve a bunch of rent and housing data over time.
    In[]:= chihousing = EntityValue[Dated[Chicago[CITY], All], {"MedianHomeSalePrice", "RentZeroBedrooms", "RentOneBedrooms", "RentTwoBedrooms", "RentThreeBedrooms", "RentFourBedrooms"}]
    Out[]= {TimeSeries[01 Jan 1989 to 01 Jul 2024, 135 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points], TimeSeries[01 Jan 1983 to 01 Jan 2014, 31 data points]}
    You can immediately see that the house-price data arrives every few months, while the rent data only arrives yearly, and only until 2014:
    In[]:=
    DateListPlot[QuantityMagnitude@chihousing,ScalingFunctions->"Log",Joined->False]
    Out[]= (log-scale date list plot of the six series)
    Why doesn’t the data extend to the present? Let’s look at the source:
    In[]:=
    Information[EntityProperty["City","RentTwoBedrooms"]]
    Out[]= EntityProperty["City", "RentTwoBedrooms"] ("2 bedroom apartment fair market rent")
        Description: Missing[NotAvailable]
        Source: Fair Market Rents
        Unit: USDollars/Months
        Physical Quantity: MoneyPerTime
        Qualifiers: Date
    In[]:= Fair Market Rents[DATA SOURCE]["NonMissingPropertyAssociation"]
    Out[]= <|"citation" -> "U.S. Department of Housing and Urban Development, Fair Market Rents", "entity type list" -> {data source}, "name" -> "Fair Market Rents", "organization name" -> "U.S. Department of Housing and Urban Development", "source name" -> "Fair Market Rents", "URL" -> http://www.huduser.org/portal/datasets/fmr.html|>
    As you can see, a URL is given where this data can be found. We’ll return to updating this data at the end.

    Your Turn

  • Ignoring the question of updating the rent data, how can you even compare a house purchase price to a rental cost?
  • Adding Your Domain Knowledge

    First, update the data you have to include the historical mortgage rate data as well:
    In[]:= rawtemporal = TemporalData[Append[chihousing, Dated[United States[COUNTRY], All][EntityProperty["Country", "ConventionalMortgageRate", {"Frequency" -> "Monthly", "MortgageDuration" -> "30Year"}]]], MetaInformation -> {"PathNames" -> {"HomePrice", "Studio", "1BR", "2BR", "3BR", "4BR", "Rate"}}]
    Out[]= TemporalData[01 Apr 1971 to 01 Mar 2025, 938 data points, 7 paths]
    What does this look like?
    In[]:=
    DateListPlot[QuantityMagnitude@rawtemporal,ScalingFunctions->"Log",Joined->False,PlotLegends->rawtemporal["PathNames"]]
    Out[]= (log-scale date list plot; legend: HomePrice, Studio, 1BR, 2BR, 3BR, 4BR, Rate)

    Your Turn

  • What will you need to do in order to analyze this data?
  • Tabular Data for Efficiency

    Let’s look at the data between Jan 1990 and Dec 2010.
    Let’s also leave the missing values as missing for now:
    As you can see, each column is typed. This lets the new Tabular functionality work with the speed-ups possible with typed information while also leveraging the symbolic programming of Wolfram Language.
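The cells for this step are omitted from this export; as a minimal sketch (assuming Wolfram Language 14.2’s Tabular, with two hypothetical rows standing in for the windowed housing data), a Tabular can be built from rows of associations, and each column is assigned a type automatically:

```wolfram
(* hypothetical rows; the real data comes from rawtemporal between Jan 1990 and Dec 2010 *)
tab = Tabular[{
   <|"Date" -> DateObject[{1990, 1}], "HomePrice" -> Quantity[79000, "USDollars"], "Rate" -> 9.90|>,
   <|"Date" -> DateObject[{1990, 7}], "HomePrice" -> Quantity[81000, "USDollars"], "Rate" -> 9.84|>}]
```

Here the "Date" column gets a date type, "HomePrice" a quantity type, and "Rate" a real-number type, which is what enables the typed-column speed-ups.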

    Your Turn

  • If you are making mortgage payments of $1604.92 per month for 30 years at 6.65% interest, what price of house should this buy you?
  • Transforming the Tabular Data

    In order to get fast calculations, let’s pre-compute the expression we’ll apply to the data as far as possible.
    This has a couple of advanced tricks, but leads to a faster function than if you tried to call Solve for each data point. Don’t worry if you don’t understand this particular trick. The point is that it defines a fast-working function you will apply to the data:
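The lecture’s actual cell isn’t preserved in this export, but the general idea can be sketched as follows: solve the standard fixed-rate mortgage equation symbolically once, then bake the solution into a plain function that can be mapped over the rows cheaply (no per-row Solve call):

```wolfram
(* sketch: monthly payment for a 30-year fixed-rate mortgage, solved once for `payment` *)
n = 360;  (* 30 years of monthly payments *)
sol = First@Solve[price == payment*(1 - (1 + r)^-n)/r, payment];
paymentFn = Function[{price, r}, Evaluate[payment /. sol]];
paymentFn[250000., 0.0665/12]  (* ≈ 1604.92 per month *)
```

This also answers the Your Turn question above: a $1604.92/month payment at 6.65% for 30 years corresponds to a price of about $250,000.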
    Recall what the original data looked like:
    Let’s take only rows where the home price is known. Then, compute the mortgage payment based on the price and rate:
    To plot this most nicely, it’s helpful to interpolate the missing rent information:
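The interpolation cell itself isn’t in this export; one sketch, assuming the yearly rent data is held as a TimeSeries (here named `rent2br`, a hypothetical variable), resamples it onto a monthly grid with linear interpolation:

```wolfram
(* fill the gaps in the yearly rent series by linear interpolation on a monthly grid *)
rentMonthly = TimeSeriesResample[rent2br, Quantity[1, "Months"],
   ResamplingMethod -> {"Interpolation", InterpolationOrder -> 1}]
```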
    Now the data is ready to plot:
    You can see the 2008 housing crash right there in the mortgage payment line.

    Forecasting Ahead

    We chose a cutoff in the data for a reason. Let’s see how to forecast ahead based on the previous data.
    First, split the data into two parts:
    Next, fit a model on the historical data:
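The fitting cell is omitted here; a minimal sketch (with `training` standing in for the pre-cutoff series) uses TimeSeriesModelFit, which searches candidate model families and scores them:

```wolfram
tsm = TimeSeriesModelFit[training];   (* automatic model-family and order selection *)
tsm["BestFit"]                        (* e.g. a SARIMAProcess[...] *)
tsm["CandidateSelectionTable"]        (* candidates ranked by the selection criterion *)
```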
    Notice how the best candidate models (those with the lowest AIC) are Seasonal Auto-Regressive Integrated Moving-Average (SARIMA) processes:
    You can see the criterion by which this model could be selected over other fits:
    But what does the model predict?
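A sketch of the prediction step, assuming `tsm` is a TimeSeriesModelFit result for the historical series and `training` is that series:

```wolfram
forecast = TimeSeriesForecast[tsm, {24}];  (* mean forecast for the next 24 time steps *)
DateListPlot[{training, forecast}, Joined -> True]
```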
    Visualize the prediction with some plotting options:
    Not bad. What other methods could we use to forecast this data?
    How good was the forecast?
    This plot suggests that perhaps one of the seasonal components was improperly chosen:

    What About Importing More Data?

    Import a local file:
    Or Import directly from the web:
    This data can be processed using the Wolfram|Alpha semantic interpreter for the counties.
    Clean the data to get just the Illinois counties:
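The cleaning cells aren’t preserved here; a hedged sketch of the interpretation and filtering steps (the county strings are hypothetical examples, and this assumes the "USCounty" interpreter type):

```wolfram
(* turn county-name strings into canonical entities via the semantic interpreter *)
counties = Interpreter["USCounty"] /@ {"Cook County, IL", "DuPage County, IL"};

(* keep only entities whose canonical name places them in Illinois *)
ilCounties = Select[counties,
   MatchQ[#, Entity["AdministrativeDivision", {_, "Illinois", "UnitedStates"}]] &]
```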
    Visualize the new data:
    Clean a larger subset of the data:
    Plot the results:

    Working with LLMs in Wolfram Language

    Aside from Notebook Assistant and Chat Notebooks, you can work with the latest language models programmatically:
    This time the model admits it doesn’t know, but gives good advice on how to find the information:
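For instance (a minimal sketch; the exact prompts used in the lecture aren’t preserved in this export):

```wolfram
(* one-off call to the configured LLM service *)
LLMSynthesize["Which Illinois county has the highest rent-to-income ratio?"]

(* a reusable templated call; `1` is filled with the argument *)
findSource = LLMFunction["Suggest one authoritative public data source for `1`."];
findSource["county-level fair market rents"]
```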

    Wrangling Data for an LLM Tool

    Let’s do a little digging:
    Luckily, the Wolfram Knowledgebase has the Census survey already:
    To clean this data, you will need to select the IL counties, get the 2019 fair market rent information (with units), and look up the 2019 income information:
    Now let’s calculate those ratios:
    Note: The function should be as simple as Function[#"XBR"/#"HouseholdIncome"]; however, there appears to be a bug with this particular unit conversion on my installation. The more complicated conversion function works around it.
    Was the LLM correct on its first guess? Unsurprisingly, no. However, it was at least in the top 5.
    Interestingly, the LLM guess was right for a two-bedroom apartment, but this was probably a coincidence.
    It is easy to visualize the new data we’ve computed:

    Making an LLM Tool

    An LLMTool will allow the model to have access to the data you have wrangled:
    An example use of the tool:
    Once the tool is defined, giving it to the LLM is easy:
    Because you gave the LLM a reliable tool, its inherent randomness is no longer a big problem:
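Put together, the pattern looks roughly like this (the tool name, description, and lookup function are illustrative, and `ratios` is a hypothetical association of precomputed county ratios):

```wolfram
(* a tool the model can call to get exact values instead of guessing *)
ratioTool = LLMTool[
   {"RentToIncomeRatio", "Look up the 2019 rent-to-income ratio for an Illinois county"},
   {"county"},
   Function[TextString[ratios[#county]]]];  (* `ratios`: county name -> ratio, assumed *)

LLMSynthesize["Which Illinois county has the highest rent-to-income ratio?",
  LLMEvaluator -> LLMConfiguration[<|"Tools" -> {ratioTool}|>]]
```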

    LLMs Are Stochastic (i.e., a Bit Random)

    Note that the LLM provides two different answers when asked the same question twice:
    Clearly, the LLM “going from memory” is not the best way to retrieve important factual information. This is where Retrieval Augmented Generation (RAG) comes in.

    Implementing a Simple Vector Database

    To automatically create a vector database associated with a source (in this case a wiki) you can use the CreateSemanticSearchIndex function:
    While you could create an LLMTool which the model would use, you can use an LLMPromptGenerator to programmatically add context:
    With a search of the index you have provided automatically conducted, the model will now answer according to the new information it was given:
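A rough shape of this setup (the index name and wiki file are placeholders, and this assumes the "Input" form hands the generator the raw input string):

```wolfram
(* build a vector index over the wiki text, then inject search hits as extra context *)
index = CreateSemanticSearchIndex[File["wiki.txt"], "WikiIndex"];
gen = LLMPromptGenerator[
   Function[input, TextString[SemanticSearch[index, input]]], "Input"];

LLMSynthesize["What does the wiki say about fair market rents?",
  LLMEvaluator -> LLMConfiguration[<|"PromptGenerators" -> {gen}|>]]
```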
    Note how the prompt and query that are used by default return the relevant information as the 5th context item in this case:
    This is because the search index and query were generated in the simplest automated way. If you provide a better search query to the index, you get the desired result as the first item returned:
    Another best practice is to think more carefully about what you want to embed and then associate a payload to return for that query.

    Adding Annotations to a Semantic Search Index

    With some string parsing, create an annotated list of sources and an annotated index:
    Define a helper function to visualize the results:
    Now, the most concise key phrase matches exceptionally well with the section heading that contains the desired info. The payload delivered to the model can be the text from that section:
    Using the user’s raw input still returns the desired section with highest relevance; however, the extraneous wording in the user’s raw input lowers the relevance score.
    To improve the search phrase used, you can use an LLM to summarize the user question into a search phrase:
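One way to sketch that summarization step (the prompt wording is illustrative):

```wolfram
(* compress a verbose user question into a concise search phrase *)
toSearchPhrase = LLMFunction["Rewrite the following question as a 3-6 word search phrase: `1`"];
toSearchPhrase["Hey, I was wondering roughly what a two-bedroom place goes for around Chicago these days?"]
```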
    This can be combined into a more sophisticated generator of prompts:
    You can hover over the tooltip to see what prompt was generated with this more sophisticated set-up:

    Additional Resources

    Aside from this notebook and Notebook Assistant, here are more resources for doing data science in Wolfram Language:
    Wolfram Cloud
