Context Windows, Structured Data, and the Fundamentals of Text Analysis with AI

By Iqbal Ali · May 19, 2025

Are you using AI to analyse customer feedback, surveys, or other forms of unstructured text data? If so, then you’ve likely already faced the limitations of “context windows”, whether you know it or not. These hidden constraints undermine the accuracy and validity of your insights, especially when you’re dealing with feedback like this:

Gibberish and difficult-to-read feedback

In this article, I’ll share the practical strategies I’ve developed over years of working with AI for user research and product development. You’ll learn techniques and strategies to work around context window limitations to get the most out of your LLM interactions and unlock the true value of your data.

Disclosure: I am the developer of Ressada, a user research tool that implements many of the techniques discussed in this article. However, this article focuses on general approaches that can be applied to any AI system.

What’s the problem anyway?

According to Gartner, approximately 80% of enterprise data is unstructured (Gartner, “Market Guide for File Analysis Software”, 2019). In addition, a white paper by the International Data Corporation (IDC) states that just over 50% of that data is underutilised. That’s a lot of insights left on the table.

But, wait. The solution is simple, right? We can just use the “blah blah” prompt we picked up from our nearest LinkedIn feed…

Self-proclaimed AI-guru says: "Just use my blah-blah prompt"

…stuff it into our chat tool of choice with all our data. And presto, we’ve got ourselves some insights.

Stuffing data plus "blah-blah" prompt to get insight gold

I mean, there’s no point worrying about limitations with context windows this large, right?

CEO tech-bro gives you large context windows

Unfortunately, the real picture is not quite so simple.

What are context windows?

Let’s start with a definition of context windows. When we prompt the AI, that prompt is broken into pieces called tokens. A token can be an individual word or even a piece of punctuation. Different models break text into tokens in different ways. To simplify things, we’ll just count words to approximate tokens and context window sizes.

Prompt broken into tokens

Note: This is a gross simplification of how LLMs work. For more details, read my other articles on the subject. Or don’t, because this is all you need to know for the rest of this article.

A context window is the size of our prompt, the chat history, and the output, measured in tokens.

The context window is the total amount of text the LLM can process, including your prompt, the conversation history, and the output. It's measured in tokens (similar-ish to words)
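If you want a rough sense of how much of a window a given interaction will consume, the word-count approximation above is enough. Here’s a minimal sketch in Python; the ~0.75 words-per-token ratio is a common rule of thumb rather than an exact figure, and the context limit shown is an illustrative number, not any specific model’s limit.

```python
# Rough token estimate from a word count. For exact counts, use the
# tokenizer of whichever model you're targeting.
WORDS_PER_TOKEN = 0.75  # rule of thumb: roughly 1.3 tokens per English word

def approx_tokens(text: str) -> int:
    """Approximate the token count of a piece of text from its word count."""
    return int(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_window(prompt: str, history: str, expected_output_tokens: int,
                   context_limit: int = 128_000) -> bool:
    """Check whether prompt + chat history + expected output fit in the window.
    context_limit is an illustrative figure, not a specific model's limit."""
    used = approx_tokens(prompt) + approx_tokens(history) + expected_output_tokens
    return used <= context_limit

print(approx_tokens("What do customers think of our checkout flow?"))  # ~10 tokens
```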

Previous models had limited context windows. More recent AI models have substantially increased these limits. Check out the context window sizes below!

Table breaking down model by context length, larger ones are 1M and 2M.

Specifically, check out those Gemini models!

  • Gemini 1.5 Pro: 2M tokens
  • Gemini 1.5 Flash: 1M tokens

Amazing, right? For reference, 1 million tokens is approximately the length of the first 6 Harry Potter books.

1 million tokens is equivalent to the first 6 Harry Potter books

But let’s explore this a little further. 

Literal matching vs latent matching

LLMs use the context of our prompts to build a sort of association map. They then choose the “best” associations based on our prompt text:

Prompt used to find relevant associations in the context window before giving you the output

This is an extreme simplification of the actual process. If you want to learn more, you can check out my previous article on the subject. Needless to say, the LLM’s ability to find the most relevant associations is a key factor for accuracy.

This brings us to the study “NoLiMa: Long-Context Evaluation Beyond Literal Matching“, which defines two types of matching.

Two types of matching. Literal and latent

Literal matching is where the LLM finds associations that are either identical matches or highly similar words or phrases to the prompt. So, for example, if we have a dataset of customer feedback and we prompt “What do customers think of our checkout flow?”, a literal match would locate feedback that mentioned “checkout”.

Latent matching, on the other hand, finds associations that are conceptually similar to the words in the prompt. So, for the prompt “What do customers think of our checkout flow?”, a latent match would locate feedback that mentioned “payment flow,” “finalising purchase,” or “completing my order”, because they are conceptually similar to the prompt.
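To make the distinction concrete, here’s a minimal sketch contrasting the two kinds of matching. It assumes the sentence-transformers library for the embedding side; the snippets, model name, and similarity threshold are all illustrative.

```python
# Literal vs latent matching over a few feedback snippets.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

feedback = [
    "The payment flow kept timing out on my card.",
    "Finalising my purchase took three attempts.",
    "The queue for the rollercoaster was far too long.",
]
query = "What do customers think of our checkout flow?"

# Literal matching: only finds snippets that share the query's wording.
literal_hits = [f for f in feedback if "checkout" in f.lower()]
print("Literal:", literal_hits)  # [] -- nobody actually wrote the word "checkout"

# Latent matching: compares meaning via embeddings instead of exact words.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode(query), model.encode(feedback))[0]
latent_hits = [f for f, s in zip(feedback, scores) if s > 0.4]  # threshold is illustrative
print("Latent:", latent_hits)  # likely the two payment/purchase comments
```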

Now, refer back to those claimed context window sizes above. Those numbers are based on literal matches.

Table of models with context length. Context length is based on literal matches

When we compare the performance of latent matching vs literal matching, an entirely different picture emerges.

Previous table now has effective length comparison. Effective length is much less than claimed

As you can see from the above, the effective length under latent matching (the length at which accuracy on the given tasks stays at 80% or higher) is far less than the claimed length.

Gemini 1.5 Pro: 2M? More like 2k. Gemini 1.5 Flash: 1M? More like less than 1k.

Now, admittedly, this is a gross simplification of the NoLiMa paper’s findings. In truth, there’s a pattern of degradation that differs according to the model. Here is a brief summary of their findings by model:

GPT-4 tends to recall information better near the beginning and end of the context window and may struggle with mid-context information in very long contexts.

Claude (Anthropic) generally shows more uniform attention across its context window, so it performs better when retrieving information from the middle of long contexts. It also has a relatively stable performance up to 60k tokens.

Gemini (Google) shows different patterns depending on the specific model version, but generally degrades gracefully as context grows. I’ll admit, Gemini 1.5 Pro shows impressive reasoning across long contexts.

For a more nuanced and detailed view, I recommend reading their paper.

The main point is this: when we’re dealing with unstructured text generated by users in forms, emails, etc., it’s safe to assume that users are not going to use consistent language and terminology. That means the lower latent-matching performance, not the claimed context length, is more realistically what we can expect.

We also need to consider how each model degrades when handling larger context sizes, and pick the one that best suits our data needs.

Alternatively, if we manage our context window sizes better, we can become model-agnostic and greatly simplify things. So, with that said, let’s look at ways of managing context windows, going from simple to progressively more advanced techniques.

Managing context windows: simple techniques

We’ll start with the easy wins related to the chat interfaces themselves.

Technique 1: Start a new conversation or thread

A simple tip is to avoid doing everything in the same chat thread. Get into the habit of creating new conversation threads, as this resets your context window.

Technique 2: Re-edit earlier prompts

Review any of your chat threads and you’ll notice a little pencil icon below each prompt. This allows us to edit that prompt. If we edit an earlier prompt, only the context window up to that point is used. When we do this, we’re effectively creating an alternate timeline for the conversation.

showing placement of edit for the prompt

Technique 3: Summarise the current conversation

Sometimes editing or starting a new conversation isn’t an option. The context of previous conversation threads may be useful to keep. If that’s the case, summarising the conversation may be a way forward.

A good summary of a conversation thread is one that retains the key points important to your task flow. Therefore, a structured summary can be useful. I’ve had success with a prompt that asks for the following:

  • One sentence summary of the conversation
  • The main goal of the conversation
  • List of key pieces of information

Having content structured like the above allows for easier scanning. It also ensures we don’t lose too much information from our conversation thread. When reviewing the summary, if we find anything missing, we can add it manually.

This summary can then be used in new conversations as context.
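If you want to automate this step, here’s a minimal sketch of such a summarisation call, assuming an OpenAI-style chat client. The model name and prompt wording are illustrative; any capable chat model would work.

```python
# Summarise a long conversation into a structured recap that can seed a new thread.
# Requires: pip install openai, with an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

SUMMARY_PROMPT = """Summarise the conversation below. Return:
1. A one-sentence summary of the conversation
2. The main goal of the conversation
3. A bullet list of key pieces of information worth carrying forward

Conversation:
{conversation}"""

def summarise_thread(conversation: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SUMMARY_PROMPT.format(conversation=conversation)}],
    )
    return response.choices[0].message.content

# The returned summary can then be reviewed, topped up manually if anything
# is missing, and pasted in as the opening context of a fresh thread.
```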

Managing context windows: advanced techniques

The above techniques are a good start, but we can do much more here. I mentioned in an earlier article that I consider using AI for research to be essentially an Information Architecture problem.

Information Architecture is the application of structured design and organisation to information to make it easily understandable, accessible, and usable.

Consider a library stocked with books. Now imagine using a library that doesn’t sort books alphabetically, or by genre, or by anything else. Imagine how difficult it would be to find the right book. Thankfully, libraries don’t operate that way. Books are sorted and categorised. There’s an inherent structure to the data a library stores.

Before we seek, we need to apply some structure. In other words:  

Structure → seek

The concept is simple. We have our unstructured text that we want to mine for insights. Before we start mining for insights (i.e. seeking), it’s a good strategy to structure our data first. This allows us to reap the benefits of structured data, like filtering text to use only relevant data.

An added benefit of filtering the data beforehand is that, since the context is more relevant, the LLM output is more accurate because the LLM has fewer distractions in the context.

The AI thanks you.

Here’s how to go about structuring data…

Technique 4: Tag and structure data

Let’s use a simple example. Imagine we own a series of theme parks. We conduct a simple feedback survey and collect the following information:

  • Branch (which of my three theme park branches did you visit?)
  • Month (which month did you visit my theme park?)
  • Free text (open text to provide your feedback)

Once responses are collected, the spreadsheet will look like this:

Typical structure of feedback. Three columns. The first column is "branch" with 3 choices, the second column is "month visited" with twelve choices, and the third column is "feedback" containing free text.

The first two columns already have a good structure: they are single-select fields, which allows us to filter by specific branch or month visited. Now we want to add structure to the open feedback text so we can filter it in a similar way. We’ll do this by adding extra columns that “describe” the text in ways that are useful to us.

Adding columns for structure. Labels or categories, general sentiment, etc.

These additional columns can be sentiment, category labels, freeform tags, or anything else that’s valuable for us. Our additional column requirements might differ according to our needs.

When doing this, it’s important to ensure the language used in each column is consistent, so it can be used as a filter. A simple prompt to do this could look like the following:

Prompt: I’m a user researcher working for Smoking Bones Land Parks, which has rides and attractions with a BBQ theme. Here is a table of feedback from users.

Create additional columns for “theme labels” and “general sentiment”.
Feedback: <feedback text>

Since we’re tagging multiple responses at the same time, the language will be consistent.

If we need to process another batch of responses or each response individually, we need to feed the labels created in previous rounds into the AI so we can “normalise” them and ensure consistency of labelling.

Prompt: I’m a user researcher working for Smoking Bones Land Parks, which has rides and attractions with a BBQ theme. Here is a table of feedback from users.

Create additional columns for “theme labels” and “general sentiment”.

Theme labels: Here are some labels you generated earlier. Apply these if relevant: <previous labels>

Sentiment: Positive, Negative, Neutral

Feedback: <feedback text>
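As a sketch of how this batching and normalisation might be automated, the loop below tags feedback in batches and feeds previously generated labels back in so the vocabulary stays consistent. The prompt wording, helper names, and output format are illustrative assumptions, not a fixed recipe.

```python
# Tag feedback in batches, reusing earlier labels so the vocabulary stays consistent.
# Requires: pip install openai, with an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

TAGGING_PROMPT = """I'm a user researcher working for Smoking Bones Land Parks, which has
rides and attractions with a BBQ theme. For each piece of feedback below, return one line
in the form: <row number> | <theme labels, comma separated> | <sentiment>.

Sentiment must be one of: Positive, Negative, Neutral.
Theme labels: reuse these existing labels where relevant: {known_labels}

Feedback:
{feedback_block}"""

def tag_batch(rows: list[str], known_labels: set[str], model: str = "gpt-4o-mini") -> str:
    feedback_block = "\n".join(f"{i}. {row}" for i, row in enumerate(rows, start=1))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TAGGING_PROMPT.format(
            known_labels=", ".join(sorted(known_labels)) or "none yet",
            feedback_block=feedback_block,
        )}],
    )
    # Parse the returned lines into columns, then add any new labels to known_labels
    # before tagging the next batch.
    return response.choices[0].message.content
```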

We can store the resulting output in a spreadsheet or database. And now, when we look for user problems in the dataset, we can filter feedback accordingly. For instance, we can use negative feedback to find user pain points and frictions.

Adding a sentiment column to the data allows us to filter responses before sending them to the LLM. In this example, we’re filtering negative responses.

Doing this not only reduces our context window footprint but also improves our overall accuracy since we’re removing irrelevant data (in this case, the positive and neutral feedback are not used).
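The filtering step itself is then trivial. Below is a minimal pandas sketch, assuming the tagged data has been saved to a CSV; the file and column names are illustrative.

```python
# Filter tagged feedback before it ever reaches the LLM.
# Requires: pip install pandas
import pandas as pd

df = pd.read_csv("tagged_feedback.csv")  # columns: branch, month, feedback, theme_labels, sentiment

# Only negative feedback is relevant when hunting for pain points,
# so everything else stays out of the context window.
negative = df[df["sentiment"] == "Negative"]

prompt = (
    "Here is negative customer feedback from our theme parks. "
    "Identify the main pain points and frictions:\n\n"
    + "\n".join("- " + text for text in negative["feedback"])
)
```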

Additionally, since we’re tagging and labelling our data with consistent language, we can take advantage of the LLM’s stronger literal-matching performance across larger context windows.

Tagging data enables us to utilise literal matching capabilities of LLMs

Note: We still might not get the best accuracy with a 2-million-token context window, but at least we’re doing our best with what we have.

Technique 5: Progressive summarisation

We can go a step further with our structuring efforts by creating progressively summarised versions of the feedback text. We can also clean up the text as we summarise, which allows us to scan and review it more easily. To progressively summarise effectively, we need to iterate over each user response and create progressively shortened versions of the feedback text.

For instance, we can summarise in less than twenty words before more aggressively summarising in six to ten words.

When doing this, be sure that the summaries have the desired information density. Tip: Use some good examples to ‘tell’ the AI how to summarise (read my previous article for details). Also, give as much context as possible to ensure effective summarisation.
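Here’s a minimal sketch of that iteration, again assuming an OpenAI-style chat client. The word limits and prompt wording are illustrative; tune them (and add examples) to get the information density you need.

```python
# Create two progressively shorter summaries for every feedback item.
# Requires: pip install openai, with an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def summarise(text: str, word_limit: int, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "You are summarising theme park customer feedback for a researcher. "
        f"Summarise the feedback below in {word_limit} words or fewer, "
        "keeping concrete details (rides, locations, problems) where possible.\n\n"
        f"Feedback: {text}"
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

def progressive_summaries(feedback_items: list[str]) -> list[dict]:
    rows = []
    for item in feedback_items:
        medium = summarise(item, word_limit=20)
        short = summarise(medium, word_limit=8)  # summarise the summary more aggressively
        rows.append({"original": item, "medium": medium, "short": short})
    return rows
```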

Progressively summarised text

Having a choice of different levels of summaries gives us more control over the context windows when we’re seeking. Add some filtering to the mix, and we can do much within a decent-sized context window footprint.

Use summarised text when using the LLM.

In my experience implementing these techniques in research software, they can effectively process tens of thousands of feedback items without significant accuracy degradation.

Technique 6: Recursive summarisation

We can also do recursive summarisation of our feedback, which means we create summaries of our summaries to further condense the word count, increasing its information density.

Recursive summarisation. Summarising summaries to further reduce the context-window footprint

Recursive summaries are useful for a high-level view of our data, such as getting an overview of the kinds of problems users are facing. Since duplicate mentions across users get collapsed together, we can’t rely on this kind of summary for specific details.

However, if we maintain a relationship between our recursive summaries (for instance, in a database), we can ask follow-up questions with progressively more detailed summaries.

Again, when creating recursive summaries, add plenty of context and use examples to direct the summarisation carefully.
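A minimal sketch of the recursive step might look like the following: summaries are combined in chunks, and the resulting summaries are summarised again until a single overview remains. The chunk size, word limit, and prompt wording are all illustrative assumptions.

```python
# Recursively condense summaries: summarise chunks of summaries, then summarise
# those results again, until a single overview remains.
# Requires: pip install openai, with an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def summarise_block(texts: list[str], word_limit: int, model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Combine the customer feedback summaries below into a single summary "
        f"of {word_limit} words or fewer, preserving the main recurring themes.\n\n"
        + "\n".join("- " + t for t in texts)
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

def recursive_summary(summaries: list[str], chunk_size: int = 20, word_limit: int = 60) -> str:
    level = summaries
    while len(level) > 1:
        chunks = [level[i:i + chunk_size] for i in range(0, len(level), chunk_size)]
        level = [summarise_block(chunk, word_limit) for chunk in chunks]
    return level[0]  # one top-level overview of the whole dataset
```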

Technique 7: Retrieval Augmented Generation

So far, we’ve done a lot manually, storing structured data in a spreadsheet. In a way, we’re acting as a retrieval engine for AI. If we are looking for the ultimate way of effectively managing our context window sizes, I consider RAG (Retrieval Augmented Generation) to be the Rolls-Royce of context window management.

This method replaces the spreadsheet with a vector database. A vector database stores text as embeddings (numerical representations of meaning) and allows us to retrieve data semantically based on a search query. For instance, if we asked our database: “What feedback do we have about our login process?”, a RAG system would search across all feedback for semantically relevant text about the login process.

The advantage of this approach is that it scales to a virtually unlimited amount of data while maintaining high relevance as the data grows. Additionally, if we use a vector database that can also handle metadata, we can filter by sentiment, tag, or other useful criteria we’ve added before doing the semantic query. This adds further efficiencies to our context window management.
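To give a flavour of what that looks like, here’s a minimal sketch using ChromaDB as the vector store. The collection name, metadata fields, example documents, and filter are illustrative; any vector database with metadata filtering would do.

```python
# Minimal RAG-style retrieval: store feedback with metadata, then query semantically
# while filtering on the structure we added earlier.
# Requires: pip install chromadb
import chromadb

client = chromadb.Client()
collection = client.create_collection("feedback")

collection.add(
    ids=["1", "2", "3"],
    documents=[
        "The payment flow kept timing out on my card.",
        "Loved the BBQ Bandit rollercoaster, best ride in the park.",
        "Couldn't log in to the app to see my tickets.",
    ],
    metadatas=[
        {"sentiment": "Negative", "branch": "North"},
        {"sentiment": "Positive", "branch": "South"},
        {"sentiment": "Negative", "branch": "North"},
    ],
)

# Semantic query, pre-filtered to negative feedback only.
results = collection.query(
    query_texts=["What feedback do we have about our login process?"],
    n_results=2,
    where={"sentiment": "Negative"},
)
print(results["documents"][0])
# Only the retrieved snippets (not the whole dataset) go into the LLM prompt.
```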

RAG is a vast topic that is well beyond the scope of this article, but I include it for the sake of thoroughness. Also, I’ve found that adding more and more structure to our data kind of negates the need for RAG. There’s a lot we can do with a good data structure for our text feedback.

What’s next

Context windows are one aspect of LLMs that chip away at our data quality. When the AI hype train comes to town, it’s easy to overlook the importance of simple data structures and context window management. This doesn’t mean we all need to jump to RAG; it just means doing our due diligence and utilising the relevant techniques outlined above to get the most out of our text analysis.

References

  • Market Guide for File Analysis Software, Gartner, 2019
  • What every executive needs to know about unstructured data, IDC, 2023
  • NoLiMa: Long-Context Evaluation Beyond Literal Matching, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan Rossi, Seunghyun Yoon, Hinrich Schütze, 2025
Written by Iqbal Ali, experimentation consultant and coach.
Edited by Carmen Apostu, Head of Marketing at Convert.