Now Chat With GPT-4 And Implementing Long Term Memory

On March 14, 2023, GPT-4 was publicly announced by OpenAI. With trillions (by some accounts) of parameters, it has been expected to be vastly superior over previous models such as ChatGPT (the API for which was only released a month ago). It is also multimodal, which is to say it can process not just text inputs, but also images and, eventually, video. OpenAI also opened a waitlist for access to the API (or more accurately the new end point for the existing chat completion API). I received access on March 17, and pounced on it.

For our purposes, it also has advantages in terms of an expanded context window (from 4k tokens for gpt-3.5-turbo — better known as ChatGPT — to initially 8k and ultimately even 32k tokens), which will make for more robust short-term memory. That will, however, not necessarily lead to long-term memory, such as remembering details from previous conversations. For example, there is no guarantee that, having been told one day my cat’s name, it will recall it the next. For this, the chat agent, which is essentially an application that calls the model’s API as a backend, needs to manage that information. The app provides memory for the language model, managing both the context window and long term storage and retrieval.

Another advantage of GPT-4 over the previous model is making more robust use of the “system message.” In the OpenAI chat completion API, the system message is to contain parameters such as the persona, manner, tone, and so forth defining the model’s expected behavior. Ideally, that message will also include specific instructions for tasks the model we wish to perform, and other information such as the current user’s profile or other information from previous sessions.

How Can an AI Know What To Do Today?

What do you want me to do, exactly?
(DALL-E illustration)

How are we to then instruct our chatbot to know user profile information, such as simply perhaps their name, age, and hometown, for example? Or, for that matter, to know who they are (as a persona or role to be played), our desired traits for them to display and their purpose (such as to be friendly or professional, a chat buddy or a customer representative, etc.) And how are we to instruct them to undertake specific tasks, such as extracting conversation details, generating queries and other tasks to be performed by other software components such as calculators, search engines, or so forth?

In the recent past, these tasks would have required training of the model (or at least fine tuning a more generically pretrained model), based on extensive datasets. For example, utilizing supervised learning, we might construct a dataset of input prompts and the desired outputs, often including thousands of examples. We would then essentially adjust the pretrained weights for the model to map our specific inputs to our desired outputs, until the error rate was at least below an acceptable threshold. For example, we may fine tune a language model on an individual’s correspondence (to emulate a particular person), or recorded human agents’ replies to customer queries (to serve as an AI customer service representative). This has been the standard practice in computer vision, and more recently in NLP, known as transfer learning. We basically retrain a model to perform a similar task as initially trained for, but often on a different data distribution with a different objective (a downstream task).

Transfer learning is certainly an advantageous strategy, especially when we want to adapt very sophisticated models initially trained on vast corpuses of data (such as ImageNet for image tasks, or the entirety of Wikipedia or Twitter for NLP tasks). But this is a large undertaking, beginning with the time (and cost) entailed in collecting and cleaning the needed data, followed by the time and computational resources needed to retrain the model itself.

And besides, if these state of the art AI’s, such as GPT-4, are as smart as they are cracked up to be, why we can’t just ask them to do what we want them to do (ideally, I’d hope, nicely)? And, in fact, we now can. What once may have been science fiction is becoming actual tech.

Prompt Engineering and In Context Learning

You could have asked me nicely!
(DALL-E illustration)

One of the amazing emerging capacities of larger language models is their surprising ability to be instructed to perform specific tasks, without extensive supervised or other forms of training (such as reinforcement learning in which the AI is allowed to attempt a task and is rewarded if it is successful — not unlike training a dog). We can in fact simply, tell it, in plain language, what we want it to do (prompt engineering), perhaps giving it a few examples (in context and/or few shot learning). Less like training our dog, perhaps, and more like instructing a child or employee.

For example, here is a system prompt for GPT-4 to operate as our chatbot, named Susan. In addition to instructing the model to take the role of being a friendly chatbot with concern for her users, addressing them in a casual tone, we also want her to extract details from the conversation in a form the chat application can pull out and store. We ask her to perform this essential task of collecting conversation details, while also maintaining a pleasant chat with the user.

You are a friendly chatbot, named Susan, who likes to discuss many topics.
You are helpful with your friends, enquiring as to their well being, always kind and caring.
You like to learn new things about them as well.
Your responses are informal and brief, as in a casual social conversation.

Extract any new persistent user profile information from user prompts and generate key/value pairs. 
Output the key/value pairs at the beginning of your responses. Persistent information are ones such as names, ages, favorite foods, hobbies, job, etc. Information that will remain true from session 
to session. For example:

User: My dog's name is Ralf
Assistant: {"dogs_name": "Ralf"} Thank you for telling me about Ralf. I will remember their name.

User: I started a new job as an AI engineer.
Assistant: {"occupation": "AI Engineer"} That is awesome, I am happy that you have reached on of your goals!

Please follow these instructions accurately for the entirety of the conversation.
Our friendly chatbot, Susan. (As visualized by DALL-E from her own description.)

The first paragraph basically outlines the role the model is to play: it is to be a friendly chatbot, able to casually discuss a variety of topics. It is also to be interested in the well being of it’s user(s). And we also like her to not to be too long winded or professorial (as models like ChatGPT can fall prey to being). We’d like her ideally to chat socially, rather than lecture.

As we are also interested in her extracting new information users may share, we instruct the model to identify what it sees as appropriate information, and generate key/value pairs that can readily be stripped out by the chat application and be used to update the user profile information. In the example here, we are giving the model a degree of freedom to use its discernment in what traditionally is in NLP known as named entity recognition. The task is described at a high, abstract, level, much as we might simply instruct an employee.

We then give it a single example to reinforce the concept. This is a form of one-shot learning. And we ask it to follow these instructions carefully, and through the entire conversation. It should look for items to extract in every user prompt.

In Practice

Before sending this information as the system message for GPT-4, we also append to it the user profile. We begin with a minimal profile, in JSON format, as follows:

{"name": "Karen", 
"age: 64, 
"location": Moab}

We can then look at an actual logged voice conversation with the chatbot (excerpted, in debug mode), as it extracts information as well as conversing with me. The application automatically begins the session by saying “hello” as the first user message.

[Human] Hello
Key/value pairs extracted: {"name": "Karen", "age": 64, "location": "Moab"}
[AI] Hi Karen! How's life treating you in Moab?
[Human] It is treating me fine. I was just chilling with my cat.
Key/value pairs extracted: {"cats_name": "unknown"}
[AI] That's nice! What's your cat's name?
[Human] Her name is Nyima. She is an orange tabby.
Key/value pairs extracted: {"cats_name": "Nyima", "cat_breed": "orange tabby"}
[AI] Aww, Nyima sounds adorable! I bet she's great company.
[Human] She is my little buddy. I've had her for 11 years now. I go her as a tiny kitten with big ears.
Key/value pairs extracted: {"cat_age": 11, "cat_origin": "kitten with big ears"}
[AI] Wow, 11 years! It's amazing how pets become such an important part of our lives. Nyima must be very special to you.
[Human] She is
[AI] I'm glad you have such a wonderful companion in Nyima. Pets can bring so much joy and comfort to our lives.
[Human] They certainly do.
[AI] Absolutely! If you have any stories or experiences with Nyima that you'd like to share, I'm all ears!
[Human] goodbye
[AI] Goodbye, Karen! Have a great time with Nyima. If you want to chat again, I'll be here. Take care!
Extracted info: ['{"cats_name": "unknown"}', '{"cats_name": "Nyima", "cat_breed": "orange tabby"}', '{"cat_age": 11, "cat_origin": "kitten with big ears"}']

As each response is returned from GPT-4, the application removes the snippet of JSON from the message, using a REGEX pattern. The JSON is put aside for later further processing, and the rest of the response is returned to be converted to speech and played for the user. It is accomplished with a fairly straight forward Python method.

def filterResponse(self, text, ignore=False):
        # extract kv_pair if found
        pattern = re.compile(r'{(?:[^{}]|((?:{[^{}]*})+))*}')
        match = pattern.search(text)
        if match:
            kv_pairs = match.group()
            if self.debug:
                self.logger.info(f"Key/value pairs extracted: {kv_pairs}")
                print(f"{kv_pairs}")
            if not ignore:
                self.memories.append(kv_pairs)
        return re.sub(pattern, '', text).strip()

Once the session is ended (when the user simply says “goodbye” to the chatbot), the accumulated changes are processed. Instead of writing code to merge the new information, we instead enlist GPT-4 to do the work for us. We provide it with the initial profile, the extracted key/value pairs during the session, and instruct it to merge them to the profile and save the updated JSON. We will provide a new system prompt with specific instructions for th emerge, and a user prompt with the previous profile JSON and the list of changes to be merged.

def update_profile(self):
        '''
        Merge new information into the profile

        We could write a bunch of code to do this, but why not just have 
        GPT-4 do it for us?
        '''

        # Get merge instructions to be 'ysystem' prompt
        with open('gpt4_merge_instructions.txt', 'r') as FILE:
            merge_instructions = FILE.read()
        prompt = [{'role': 'system', 'content': merge_instructions}]

        prompt_content = f'Current user profile:\n{self.__profile_JSON}'
        prompt_content += f'\n\nChanges to be merged:\n{self.memories}'
        prompt.append({'role': 'user', 'content': prompt_content})
        new_profile, _ = self.__prompt_gpt(prompt)
        
        if self.debug:
            print(f'\rUpdated profile:\n{new_profile}')
            self.logger.info(f'Updated profile:\n{new_profile}')
        else:
            print(f'\rUpdated profile')

        with open('chat_user_profile.json', 'w') as FILE:
            FILE.write(new_profile)

The merging processed is also directed by another set of prompt engineering instructions to the model.

Merge the changes provided to the user profile, and only respond with the 
final updated profile JSON.

If there is an existing key for an item, consider how to update it.

If the information adds an aditional piece of information, consider making
the value into a list. For example,

Existing profile contains: {"pet": "cat"}
Change includes: {"pet", "dog"}
Merge as: "{"pets": ["cat", "dog"]}

If it is instead updating a singular fact, it should be just updated. A 
"singular fact" is something of which there is only a single value.
For example:

Existing profile contains: {"age": 64}
Change includes: {"age": 65}
Merge as : {"age": 65}
As one will only have one current age.

If the new value for an existing key is "None", the key/value should be deleted:
For example:

Existing profile contains: {"dogs_name": "Max"}
Change includes: {"dogs_name", "None"}
Merge as: delete the key/value pair

Prompting the model, it makes the merges for us. We then have an updated user profile including what the AI learned about my cat.

{
  "name": "Karen",
  "age": 64,
  "location": "Moab",
  "cats_name": "Nyima",
  "cat_breed": "orange tabby",
  "cat_age": 11,
  "cat_origin": "kitten with big ears"
}

Beginning a new chat session, the chatbot now has and can refer to my cat by name, and enquires as to her well being.

[Human] Hello
[AI] Hi Karen! How are you doing today in Moab? How's Nyima, your orange tabby cat?
[Human] She is fine! I think she is taking a nap on the big chair now. Yup, I see her there.
[AI] That's great to hear! It's always nice when our furry friends find a cozy spot to relax. Enjoy your day with Nyima!

And we are good to go.

The complete project can be found here.