User5601 asks: Is there any append/chunking solution for getting OpenAI to summarize documents longer than the input limit, through either one large or multiple ChatInput cells or some other programmatic way (using the new LLM functionality in Mathematica 13.3)?

A summary of strategies and implementations follows. Text summarization is useful in many contexts, such as generating abstracts or creating podcast previews. Isaac Tham (May 2023) describes three strategies for text summarization with LLMs, and we shall explore them here. All examples use GPT-3.5 running on OpenAI, although the strategies are generalizable.
Data
We will demonstrate this on the 2023 State of the Union Address delivered by US President Joe Biden; for convenience, Tham provides a plain-text version on his GitHub, which we shall use below:
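For instance, the transcript can be imported directly. This is a minimal sketch; the variable url stands in for the raw-file link to the transcript on Tham's GitHub, which is not reproduced here:

```wolfram
(* url should hold the raw plain-text link to the transcript on Tham's GitHub *)
speech = Import[url, "Text"];
WordCount[speech]
```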
Recursive summarization
Approach: Split the long text into equal, shorter chunks that fit inside the LLM context window. Each chunk is summarized independently; the summaries are concatenated and then passed through the LLM to be summarized further. Repeat until the final summary is the desired length.
Pro: This is widely used and easy to parallelize.
Con: It disregards the logical flow and structure of the text (splitting by word count will break across sentences, paragraphs, and chains of thought).
Implementation: For the sake of demonstration we'll implement a chunk function we can use in subsequent work.
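A minimal sketch of such a chunk function and the recursive loop, using LLMSynthesize from Mathematica 13.3; the names chunkWords and recursiveSummarize, the prompt wording, and the 2000-word chunk size are illustrative assumptions, not Tham's exact code:

```wolfram
(* Split text into chunks of at most maxWords words *)
chunkWords[text_String, maxWords_Integer] :=
  StringRiffle[#, " "] & /@ Partition[StringSplit[text], UpTo[maxWords]]

(* Summarize each chunk, concatenate the summaries, recurse until short enough *)
recursiveSummarize[text_String, targetWords_Integer] :=
  Module[{summaries, combined},
    summaries = LLMSynthesize["Summarize the following text:\n" <> #] & /@
      chunkWords[text, 2000];
    combined = StringRiffle[summaries, "\n"];
    If[WordCount[combined] <= targetWords, combined,
      recursiveSummarize[combined, targetWords]]]
```

Since each chunk is summarized independently, the Map over the chunks could be swapped for ParallelMap to exploit the parallelism noted above.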
Progressive summarization
Approach: Pass each chunk of text, along with a summary of the previous chunks, through the LLM to generate a progressively refined summary.
Pro: This is even easier to implement than recursive summarization and only requires some simple prompting.
Con: This is sequential (so it does not take advantage of parallelism) and may over-represent the initial parts of the source material in the final summary.
Implementation: In practice you may have to work with smaller chunks, since there is no strong guarantee on the summary size, and controlling the final word count can be challenging unless you do something special with the prompt (FWIW, the Summarize prompt allows for an optional sentence-count target).
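The rolling loop can be sketched as a Fold over the chunks; progressiveSummarize is a hypothetical name, the prompt wording is an assumption, and chunks is a list of text chunks produced by any chunking scheme:

```wolfram
(* Fold a running summary through the chunks, one chunk at a time *)
progressiveSummarize[chunks_List] :=
  Fold[
    LLMSynthesize[
      "Summary so far:\n" <> #1 <>
      "\n\nExtend the summary so that it also covers the following passage:\n" <> #2] &,
    "", chunks]
```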
Summarization by topic clustering
This is the strategy that Tham advocates in his article. Tham does something more elaborate, providing topic titles and topic summaries at each summarization stage; that is useful as a demonstration of defining custom prompts, but it seems out of scope for now. The core idea is:
1. Divide the text into groups of 10 sentences each (with an overlap of 2 sentences for continuity).
2. Summarize each chunk.
3. Compute an embedding vector for each chunk.
4. Cluster the embedding vectors. Tham uses Louvain community detection, a.k.a. modularity optimization, which is one of the methods available in FindGraphCommunities. But there is nothing intrinsically special about this choice, and we might instead use one of the variety of methods implemented in FindClusters.
5. Aggregate the summaries in each cluster as new chunks.
We will demonstrate how to do this for a single round (defining functions as we proceed) and then put it all together to do it iteratively until we reach a desired summary size. To begin, define a function to do the sentence group chunking:
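One way to sketch the sentence-group chunking, assuming the built-in TextSentences for sentence splitting (sentenceChunks is an illustrative name):

```wolfram
(* Groups of n sentences, with consecutive groups sharing `overlap` sentences *)
sentenceChunks[text_String, n_Integer: 10, overlap_Integer: 2] :=
  StringRiffle[#, " "] & /@
    Partition[TextSentences[text], n, n - overlap, {1, 1}, {}]
```

The padding specification {1, 1}, {} tells Partition to keep a ragged final group rather than silently dropping trailing sentences.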
Then generate embeddings for each chunk. It seems like an oversight, but Mathematica 13.3 does not include a function for this; instead we will use the OpenAILink paclet (be sure to install it first, if you have not already done so) to generate an embedding vector for each chunk:
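For example, assuming OpenAIEmbedding from the ChristopherWolfram/OpenAILink paclet and a list chunks of text chunks:

```wolfram
PacletInstall["ChristopherWolfram/OpenAILink"];  (* one-time install *)
Needs["ChristopherWolfram`OpenAILink`"]
embeddings = Normal[OpenAIEmbedding[#]] & /@ chunks;
```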
At this point we would iterate until we reached the desired word count. Having demonstrated what a single pass looks like, we will define a function that performs each of the steps, and another that can be called recursively to generate a final summary of the desired target word length:
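Putting it together, one possible sketch follows. All names here (sentenceChunks, summarizeChunk, clusterSummarize), the prompt wording, and the choice of FindClusters in place of Tham's Louvain clustering are illustrative assumptions; OpenAIEmbedding comes from the OpenAILink paclet:

```wolfram
Needs["ChristopherWolfram`OpenAILink`"]

(* Groups of 10 sentences with a 2-sentence overlap *)
sentenceChunks[text_String] :=
  StringRiffle[#, " "] & /@ Partition[TextSentences[text], 10, 8, {1, 1}, {}]

summarizeChunk[chunk_String] :=
  LLMSynthesize["Summarize the following text in a few sentences:\n" <> chunk]

(* One pass: chunk, summarize, embed, cluster, aggregate; recurse if still too long *)
clusterSummarize[text_String, targetWords_Integer] :=
  Module[{chunks, summaries, vecs, clusters, combined},
    chunks = sentenceChunks[text];
    summaries = summarizeChunk /@ chunks;
    vecs = Normal[OpenAIEmbedding[#]] & /@ chunks;
    clusters = FindClusters[vecs -> summaries];      (* summaries grouped by topic *)
    combined = StringRiffle[StringRiffle[#, " "] & /@ clusters, "\n\n"];
    If[WordCount[combined] <= targetWords,
      LLMSynthesize["Summarize the following in about " <>
        ToString[targetWords] <> " words:\n" <> combined],
      clusterSummarize[combined, targetWords]]]
```

The rule form data -> labels in FindClusters clusters the embedding vectors but returns the corresponding summaries, which is exactly the aggregation step 5 calls for.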