Creating a simple service that enables you to chat with your PDF files can seem like a straightforward task. There are numerous open-source examples and several companies are attempting to establish themselves in this niche. In this article, we guide you through the creation of such an application.
We recently examined an online PDF chat service and had the opportunity to delve into their source code. Here's an abstract of what we found:
// prompt: 'you are a helpful AI assistant'
app.post('/ask-question', async (req, res) => {
  // re-read and re-parse the entire PDF on every request
  const data = await pdfParse(fs.readFileSync(file_path));
  // naively split the text into ~2,000-character chunks and send them all as user messages
  const chunks = data.text.match(/[\s\S]{1,2000}/g) ?? [];
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: chunks.map((content) => ({ role: 'user', content })),
  });
  res.json(completion.choices[0].message);
});
We have omitted some details for brevity and rewritten some code for readability, but this is essentially all you need. With some adept digital marketing, you can build a brand and start serving users. To those in the know, however, this is akin to replacing the mugs in a bar with paper cups: you can't fit a full beer in a paper cup without it crushing, leaking, and ultimately ruining the beer-drinking experience. A production-grade version of this service would run to thousands of lines of code; as written, it is unusable for anything but a toy project.
The code above is taken from a real application deployed in the wild. We won't name names, but this is what users will experience and pay for. How does this affect the end user? Some users may never notice a problem. But if your book spans 500 pages, you might see significant performance issues, or the chat may not respond at all. If 100 users hit the site concurrently, the server could grind to a halt or crash outright once it runs out of memory. So why does code like this exist in the wild? The answer is simple: there just aren't enough users yet to expose the problem.
Let's assume you have a book of a few hundred pages and the service suits your needs. You are the only person using the app, so you're not overloading the server. What issues could you still encounter? Accuracy, for one: the model will be more prone to parroting the text back verbatim and less flexible in how it answers. This is almost the same as uploading a file into the LLM and chatting with it, which many people do every day. So what's the problem?
Eventually, the conversation ends. The context window is limited, and at some point the LLM will start to get confused. You will then need to restart the conversation, and everything you discussed before will be forgotten. The experience is akin to talking to a different person each time, one with no memory of your past conversations. The code above works at a bare minimum; it's like speaking to someone who gives subpar answers, sometimes ignores you, and will eventually forget everything you talked about.
So, how can we improve this?
A common strategy is to use Retrieval-Augmented Generation (RAG) instead of feeding the model the whole book before the chat starts. Stuffing the full text into the prompt isn't feasible for even moderately sized books, and, as mentioned above, it produces comparatively poor responses and limits the length of the conversation. To build the best chat experience, we need proper engineering tactics that allow for scalability and performance.
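To make that concrete, here is a minimal sketch of the RAG flow at question time, assuming the OpenAI Node SDK and an in-memory array of pre-computed chunk embeddings. The names chunkStore, cosineSim, and answerQuestion are illustrative, the model names are only examples, and the ingest step that builds chunkStore is sketched later in the article.

const OpenAI = require('openai');
const openai = new OpenAI();

// cosine similarity between two embedding vectors
function cosineSim(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// chunkStore: [{ text, embedding }] built once when the PDF is ingested
async function answerQuestion(question, chunkStore) {
  // embed the question and keep only the few most relevant chunks
  const { data } = await openai.embeddings.create({ model: 'text-embedding-3-small', input: question });
  const q = data[0].embedding;
  const context = chunkStore
    .map((c) => ({ text: c.text, score: cosineSim(q, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    .map((c) => c.text)
    .join('\n---\n');
  // only the retrieved context plus the question reaches the model
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content;
}

Because only a handful of chunks ride along with each question, the prompt stays small no matter how large the book is, and the rest of the context window remains free for conversation history.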
Garbage In, Garbage Out
In the realm of large language models (LLMs), there's a well-known saying: "garbage in, garbage out." If we feed the model subpar data, the responses it produces will be of equally low quality. The LLM cannot distinguish correct input from incorrect input, and this is a persistent source of confusion in the communities that use these tools.
The ambiguity arises from the fact that we don't fully understand how these models use what we give them. Whatever we type into these chats is likely stored in some database, which is how we can pick up where we left off in a previous conversation. This has led some to speculate that the models train on the data we input. Providers do offer ways to opt out of such training, but not opting out doesn't necessarily mean the data will be used for training.
Contrary to popular belief, these models are not trained on the entirety of the internet. The volume of data on the internet is far too vast for that. Instead, the models are trained on a limited dataset. To put this into perspective, if all the data the model was trained on equated to a 2,000 square foot home, the internet would be equivalent to an entire city filled with homes.
Now, let's compare this to prompting. Simple, straightforward questions are likely to get satisfactory answers. More complex queries, however, have given rise to a practice known as 'prompt engineering': crafting prompts that reliably produce the desired answers. Doing this well means structuring instructions, context, and constraints with an engineer's rigor, hence the term 'prompt engineering.'
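As a small illustration (a sketch, not any particular product's prompt), compare a bare question with an engineered prompt that pins down the assistant's role, answer format, and failure behaviour, again assuming the OpenAI Node SDK.

const OpenAI = require('openai');
const openai = new OpenAI();

async function askWithGuidance() {
  const messages = [
    {
      role: 'system',
      content: 'You are a support assistant for a PDF-reading app. ' +
        'Answer in at most three sentences, cite the page number when you know it, ' +
        'and say "I do not know" rather than guessing.',
    },
    // the bare question alone would leave tone, length, and honesty entirely up to the model
    { role: 'user', content: 'Summarize the warranty terms in the uploaded manual.' },
  ];
  const reply = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages });
  return reply.choices[0].message.content;
}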
Several applications currently use this method, feeding prompts into the LLMs. While this approach offers some advantages, it doesn't necessarily lead to production-ready results. For organizations requiring a higher quality of responses, other techniques need to be employed. One such method is 'fine-tuning.'
Fine-tuning is a process in which the model's weights are further trained on curated examples, typically prepared and reviewed by humans, to improve its responses. This method is resource-intensive, requiring significant processing power. Moreover, it does not guarantee improved quality and can sometimes result in poorer responses.
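For reference, kicking off a fine-tuning run with the OpenAI Node SDK looks roughly like the sketch below. The training file name and model are placeholders, other providers expose different but comparable APIs, and the training itself runs on the provider's hardware at their prices, which is where the resource cost comes from.

const fs = require('fs');
const OpenAI = require('openai');
const openai = new OpenAI();

async function startFineTune() {
  // upload a JSONL file of example conversations the model should learn to imitate
  const file = await openai.files.create({
    file: fs.createReadStream('training-examples.jsonl'),
    purpose: 'fine-tune',
  });
  // start the job; it runs remotely, can take hours, and is billed per training token
  const job = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: 'gpt-4o-mini-2024-07-18',
  });
  console.log('Fine-tune job started:', job.id);
}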
An alternative to fine-tuning is the use of prompt strategies. One such strategy is 'chain of thought' prompting, which asks the model to work through its reasoning step by step (or shows it worked examples that do) before giving a final answer. Another is building agents, where the model adopts a specific persona, such as a car mechanic, and answers in that role with the relevant detail.
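In practice the two ideas look roughly like this; the wording is purely illustrative.

// Chain-of-thought style prompt: ask the model to reason step by step before answering.
const cotMessages = [
  {
    role: 'user',
    content: 'A repair needs 3 parts at $42 each plus 1.5 hours of labor at $80/hour. ' +
      'Work through the cost step by step, then give the total on the last line.',
  },
];

// Agent-style prompt: the model adopts a persona and the constraints that come with it.
const agentMessages = [
  {
    role: 'system',
    content: 'You are an experienced car mechanic. Ask about symptoms before diagnosing, ' +
      'and only recommend repairs you can justify from what the customer tells you.',
  },
  { role: 'user', content: 'My 2012 Civic squeals when I brake. What should I check?' },
];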
However, these strategies have their limitations. Every LLM has a limited context window, meaning the amount of text it can consider at once is finite. For instance, if we were to feed the model data for every car make and model, it would be overwhelmed. To make this strategy work, we need to supply only the data needed for the particular make and model in question, so the context window is never overloaded.
But how does this work when interacting with a book, where we don't know the 'make and model' up front? One solution is to break the book into small pieces and have the AI search for the piece that matches the user's question. Done naively, though, this becomes slow and expensive for large documents, such as a 500-page book.
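One way to avoid scanning the whole book with the model itself is to do the expensive reading once, at ingest time. The sketch below, again assuming the OpenAI Node SDK (batching and error handling omitted), splits the text into fixed-size chunks and pre-computes an embedding for each, producing the chunkStore used in the earlier RAG sketch; matching a question against it later is a cheap vector comparison rather than a model call per chunk.

const OpenAI = require('openai');
const openai = new OpenAI();

// run once per book: split into ~2,000-character chunks and embed each chunk
async function ingestBook(text) {
  const chunks = [];
  for (let i = 0; i < text.length; i += 2000) {
    chunks.push(text.slice(i, i + 2000));
  }
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks,
  });
  // pair every chunk with its embedding; this is the chunkStore used at question time
  return chunks.map((chunk, i) => ({ text: chunk, embedding: data[i].embedding }));
}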
More broadly, a range of strategies has emerged for working within these limits: few-shot prompting, retrieval-augmented generation, chain-of-thought prompting, and building agents. They differ in how heavily they lean on the context window. If the aim is longer conversations, the data needs to move out of the context window and into an external data source.
This shift to an external data source presents new challenges, such as how to query the source and locate the data you need. But it also opens up exciting possibilities, such as giving the model a form of long-term memory. Managed correctly, with the right data stores and querying methods, it lets us build a performant, scalable, and genuinely usable platform.
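One way to give the model that long-term memory is to persist the chunk embeddings (and, if you like, conversation summaries) outside the process entirely. The sketch below uses SQLite via the better-sqlite3 package purely as an illustration; a dedicated vector database would also let you push the similarity search into the store itself instead of ranking in application code.

const Database = require('better-sqlite3');
const db = new Database('memory.db');
db.exec('CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)');

// persist chunks outside the context window so they survive server restarts and new sessions
function saveChunks(chunks) {
  const insert = db.prepare('INSERT INTO chunks (text, embedding) VALUES (?, ?)');
  for (const c of chunks) insert.run(c.text, JSON.stringify(c.embedding));
}

// later, even in a brand-new conversation, load everything back and rank it against the question
function loadChunks() {
  return db.prepare('SELECT text, embedding FROM chunks').all()
    .map((row) => ({ text: row.text, embedding: JSON.parse(row.embedding) }));
}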
In conclusion, building a highly performant and scalable platform for interacting with large volumes of text, such as multiple books, requires the right balance of prompt engineering techniques and data management: deciding what belongs in the context window and what should live in an external data source. The exploration of these techniques is ongoing, and more advances are coming. The 'garbage in, garbage out' principle underscores how much the quality of the input data matters; managing that data effectively, both inside the context window and in external stores, matters just as much.
The concept of 'prompt engineering' has proven to be a cornerstone of this process. By shaping the prompts fed to the LLM, we can guide the model toward the kind of responses we need. But this is not a straightforward task: it can involve substantial engineering and a solid understanding of how the model behaves.
Fine-tuning and prompt strategies have emerged as viable ways to improve response quality, but both have limitations. Fine-tuning is resource-intensive and doesn't guarantee better results. Prompt strategies, on the other hand, are bound by the LLM's limited context window, and overloading that window degrades the model's responses.
When dealing with large amounts of data, moving some of it to an external data source seems to be a promising solution. This not only allows for longer conversations but also creates a form of long-term memory for the model. However, this shift presents its own set of challenges, such as querying the external data source and locating the needed data efficiently.
As we venture further into this field, the goal is to devise strategies and techniques that allow us to interact effectively with large volumes of text. This involves striking a balance between the data within the context and what is stored in an external data source. As we continue to explore and experiment, we can look forward to more advancements and innovations in the world of large language models. Thank you for joining us on this journey of discovery.