Use AI to answer compliance questionnaires

by Frans Lytzen | 06/08/2024

I don't know about you, but I have had to fill in way too many security and compliance questionnaires over the years. Every one asks the same questions - but just differently enough that I have to answer each one, all over again. As yet another one hit my inbox, I thought I would have a go at using AI to answer the questionnaire by using my answers from previous questionnaires. It took maybe half a day to do, and the results are surprisingly good and very reusable.

In this video, I am going to show you how I built some simple AI to help me answer compliance questionnaires. I am going to briefly explain the principles first before showing you the code. That way, if you just want to know the principles, you can exit easily.

What I built uses off-the-shelf tools and is very cheap. There are no costly training processes, no ongoing hosting costs, and the AI costs were measured in pennies. All it really needs is a handful of questionnaires of a similar type that you have answered in the past.

Concepts

The biggest problem is that all the different questionnaires ask the same questions - but in slightly different ways.

A general LLM, such as ChatGPT, doesn't know anything about your business or your systems, so it cannot help you answer questionnaires on its own. But, if we can find similar questions and answers from the past, we can give those to the AI and - essentially - say "given these previous answers, please try to answer this new question for me". There is a bit more to it, but that is essentially it.

Embeddings

The first step, therefore, is to be able to find similar questions from the past. There is an AI tool called "Embeddings" that we can use to search through the previous questions based on their so-called "semantic meaning". This means we can find questions that have a similar meaning, even if they are written using completely different words.
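
To make "semantic meaning" a little more concrete: an embedding is just a list of numbers, and texts with a similar meaning end up with vectors that point in a similar direction. The sketch below compares toy three-dimensional vectors using cosine similarity. The vectors are made up for illustration (a real model such as text-embedding-ada-002 returns 1,536 numbers per text) and `cosine_similarity` is a hypothetical helper, not part of the code shown later.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings (assumption for illustration)
mfa_question = [0.9, 0.1, 0.0]         # "Is MFA enforced?"
two_factor_question = [0.8, 0.2, 0.1]  # "Do you require two-factor login?"
backup_question = [0.0, 0.2, 0.9]      # "How often are backups taken?"

similar = cosine_similarity(mfa_question, two_factor_question)
different = cosine_similarity(mfa_question, backup_question)

# The two authentication questions score much closer to 1 than the backup question
print(f"MFA vs two-factor: {similar:.3f}")
print(f"MFA vs backups:    {different:.3f}")
```

A score close to 1 means the vectors point the same way, and a score close to 0 means they are unrelated. Searching for "similar questions" is then just a matter of finding the stored vectors that score highest against the new one.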

If you were building a system to handle tens of thousands of past questions and answers, you would use a tool such as Azure AI Search for this - but for this purpose we can literally just use a spreadsheet on our local computer.

The preparation steps are the following:

Combine all your previous questions and answers into a single spreadsheet.
Use a bit of code and an AI to calculate a so-called "embedding vector" and store this in the spreadsheet.

Answering new questions

The second part is processing a new questionnaire. It consists of the following steps:

Use a bit of code to read all the questions from the new questionnaire
For each question, do the following:

1. Find similar questions from the past

2. Give those question/answers to ChatGPT (or another LLM) and ask it to answer the new question

Save the proposed questions and answers back to a spreadsheet
Review and edit the answers and send the completed questionnaire back

Ideally, you should then save the new questions and answers back into the "past answers" spreadsheet so you are ready for the next one.
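
That last step could look something like the sketch below. It is not part of the code I show later, so treat it as an assumption-laden starting point: `append_reviewed_answers` is a hypothetical name, the column names (`Section`, `Description`, `Response`, `Embedding`) match the ones used in the code further down, and `embed` is any function that takes an area and a question and returns a list of floats. It uses only the standard library `csv` module so that it stands alone.

```python
import csv
import os

# Columns used when the bank file does not exist yet (assumed layout)
DEFAULT_COLUMNS = ["Section", "Description", "Response", "Embedding"]

def append_reviewed_answers(bank_path, reviewed_rows, embed):
    """Append reviewed question/answer rows to the 'past answers' CSV.

    reviewed_rows is a list of dicts; rows with an empty Response are skipped.
    Returns the number of rows added.
    """
    bank_exists = os.path.exists(bank_path) and os.path.getsize(bank_path) > 0
    if bank_exists:
        # Reuse the existing header so the new rows line up with the old ones
        with open(bank_path, newline="", encoding="utf-8") as f:
            columns = next(csv.reader(f))
    else:
        columns = DEFAULT_COLUMNS

    added = 0
    with open(bank_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="", extrasaction="ignore")
        if not bank_exists:
            writer.writeheader()
        for row in reviewed_rows:
            if not row.get("Response"):
                continue  # nothing useful to learn from an unanswered question
            record = dict(row)
            # Store the vector as text, e.g. "[0.1, 0.2]", so it can be parsed back later
            record["Embedding"] = str(embed(row["Section"], row["Description"]))
            writer.writerow(record)
            added += 1
    return added
```

Only add answers you have actually reviewed and corrected; otherwise the AI's mistakes get fed back in as "previous answers" for the next questionnaire.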

Summary

Just before I get into showing the code, let me summarise this.

This is just an example of how the latest generation of AI is surprisingly cheap and easy to apply to real-world business problems. We are still in that stage where we are all trying to figure this out and I think we are all getting fed up with gimmicks like "let AI write your emails for you" kind of stuff. The real world requires a bit more nuance and more context than one-size-fits-all solutions.

At the same time, to really get the value of these tools, you need to embed them in your business processes. NewOrbit are experts at building systems that help businesses run more efficiently and we are helping several of our clients to embed AI into those systems.

If you are interested in doing something like that, please get in touch. We are always happy to talk about this.

Code

Install packages

pip install openpyxl
pip install pandas
pip install python-dotenv
pip install openai
pip install scikit-learn
pip install numpy

Set up access to Azure OpenAI

You can easily change this to use OpenAI directly or some other LLM.

Create a .env file in the same directory as your code with the following content:

AZURE_OPENAI_ENDPOINT=https://XXX.openai.azure.com/
AZURE_OPENAI_API_KEY=XXX
AZURE_OPENAI_API_VERSION=2024-06-01
EMBEDDING_MODEL_NAME=text-embedding-ada-002
LLM_MODEL_NAME=gpt-4

from dotenv import load_dotenv  
import os 
from openai import AzureOpenAI

load_dotenv() 

# gets the API Key from environment variable AZURE_OPENAI_API_KEY
openAIclient = AzureOpenAI(
    # https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

embed_model = os.getenv("EMBEDDING_MODEL_NAME") #text-embedding-ada-002 - but use the newest available
llm_model = os.getenv("LLM_MODEL_NAME") #gpt-4 - but consider gpt-4-turbo or a later model

Shared functions

# Generate an embedding for a question and its area; used both for the stored questions and for new (query) questions
def generate_embeddings(area, question):
    formatted_question = f"Area: {area}\nQuestion: {question}"
    response = openAIclient.embeddings.create(
        input=formatted_question, model=embed_model)
    # print(response)
    embeddings = response.data[0].embedding
    return embeddings # this will return an array of floats

Initial data processing

This should be a one-off

import os
import pandas as pd

input_file = './data/questionnaire_bank.xlsx'
previous_answers_path = './data/previous_answers.csv'

if not os.path.exists(previous_answers_path):
    # Read the spreadsheet into a dataframe, specifying the sheet name
    df = pd.read_excel(input_file, sheet_name='Train set')
    df = df.dropna(subset=['Response']) # Remove rows with no response
    
    # NOTE: There is a way to pass up to 2048 different texts to be embedded in one API call. That is an optimisation for another day.
    df["Embedding"] = df.apply(lambda item: generate_embeddings(item['Section'], item['Description']), axis=1)

    df.to_csv(previous_answers_path, index=False)

    # You could/should do something to just add new ones without embeddings - possibly check if they have been changed via an etag
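
The batching optimisation mentioned in the note above could be sketched roughly like this. The function names are hypothetical, and the batch size of 2048 is simply taken from the note; depending on your model and the length of your questions you may hit token limits first and need a smaller number. The `client` argument is expected to behave like the `openAIclient` created earlier.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def generate_embeddings_batched(client, model, texts, batch_size=2048):
    """Embed many texts with one API call per batch, preserving input order."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of the input text it belongs to
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings

# Possible usage with the dataframe above (assumes the same column names):
# texts = [f"Area: {a}\nQuestion: {q}" for a, q in zip(df["Section"], df["Description"])]
# df["Embedding"] = generate_embeddings_batched(openAIclient, embed_model, texts)
```

For a few hundred questions the one-call-per-row version is perfectly fine; this only starts to matter when the question bank gets large.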

Prepare inference

Prepare to find previous questions

NOTE: I am only "embedding" the questions, not the answers. There may be scenarios where you get better results if you include the answer as well. Experiment!

import ast

from sklearn.neighbors import NearestNeighbors
import numpy as np

previous_answers = pd.read_csv(previous_answers_path)
# The embedding will be read as a string - this converts it back to an array.
# ast.literal_eval only parses Python literals, so unlike eval it cannot run arbitrary code found in the CSV.
previous_answers['Embedding'] = previous_answers['Embedding'].apply(ast.literal_eval).apply(np.array)

knn = NearestNeighbors(n_neighbors=5, algorithm='auto')

input_vector = np.stack(previous_answers["Embedding"]) # Get the embeddings into a 2D array - https://stackoverflow.com/a/56620286/11534

knn.fit(input_vector) # This makes it ready so we can ask questions

def get_similar_questions(area, question):
    input_vector = np.array(generate_embeddings(area, question))
    nns = knn.kneighbors([input_vector], n_neighbors=5, return_distance=True)
    distances_of_similar_questions = nns[0][0]
    ids_of_similar_previous_questions = nns[1][0].tolist()
    similar_questions = previous_answers.iloc[ids_of_similar_previous_questions].copy()
    similar_questions['Distance'] = distances_of_similar_questions
    return similar_questions

Test retrieving similar questions

new_area = "Cyber Essentials Plus"
new_question = "MFA is implemented when available"

similar_questions = get_similar_questions(new_area, new_question)

for index, row in similar_questions.iterrows():
    print("Section:", row['Section'])
    print("Description:", row['Description'])
    print("Response:", row['Response'])
    print("Distance:", row['Distance'])
    print("--------------------")

NOTE: There is an additional, advanced technique called "re-ranking" that may be useful here, depending on your source data. The general idea is that you retrieve maybe 10 questions and then re-rank them with a model that is specifically tuned to do that. It will compare the new question to each previous question/answer and find the best matches. However, this may require you to host your own models and is more complex than just calling an API.
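
To show the shape of the idea, here is a minimal sketch of the re-ranking step. Everything in it is an assumption for illustration: `rerank` and `word_overlap_score` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a real re-ranking model, which you would plug in via the `score` argument.

```python
def rerank(new_question, candidates, score, top_n=5):
    """Return the top_n candidates ordered by score(new_question, candidate), highest first."""
    scored = [(score(new_question, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_n]]

def word_overlap_score(question, candidate):
    """Toy stand-in for a re-ranking model: share of lowercase words the two texts have in common."""
    a = set(question.lower().split())
    b = set(candidate.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Pretend these came back from the embedding search (made-up examples)
retrieved = [
    "Backups are taken daily",
    "MFA is implemented for all admin accounts",
    "Is MFA enabled for users",
]

best = rerank("Is MFA implemented when available", retrieved, word_overlap_score, top_n=2)
for text in best:
    print(text)
```

In a real setup you would retrieve more candidates than you need (say 10), let the re-ranking model score each one against the new question, and pass only the best few to the LLM.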

Prepare to be able to answer new questions

system_prompt = """
You are a security professional working for a company called NewOrbit Ltd. 
Your job is to answer security questionnaires.

NewOrbit is a software development company who develops, hosts and supports software on behalf of their clients.
The security questionnaires you will be asked to answer are usually from customers of NewOrbit's clients. In other words, you need to answer
questions being mindful that the system is ultimately owned by the client. You will be told the name of NewOrbit's client and the name of the system
NewOrbit develops and hosts for them. You will also be supplied with similar questions and answers from previous questionnaires. These previous
questions and answers may have references to other clients. It is important that you replace any such reference with the name of the client and
system you are currently answering questions for.

NewOrbit hosts all Software on Microsoft Azure. If the question seems to apply mostly to the hosting provision, you may simply refer to the 'Microsoft Trust Center' instead of directly answering the question.
Do NOT refer to the Microsoft Trust Center if you are answering the question directly.

NewOrbit is based in the UK.

Only use information provided in the prompts to answer questions. If you do not know the answer to a question, say 'UNKNOWN'.
Keep answers short. Do not elaborate or provide any more information than what is asked for.

NewOrbit is ISO 27001 accredited and maintains a comprehensive Information Security Management System (ISMS).
NewOrbit is also ISO 22301 accredited for business continuity management.
"""



def get_user_prompt(client_name, system_name, concatenated_previous_qas, new_question_as_text):
    return f"""
    You are answering a question from a security questionnaire. The client you are answering on behalf of is '{client_name}' and their system is called '{system_name}'.

    Please return ONLY the answer, do not include the question in your response.

    These are examples of similar questions and answers from the past, for other clients:
    ==========================
    {concatenated_previous_qas}
    ==========================

    The question I want you to answer is:
    ==========================
    {new_question_as_text}
    ==========================
    """

def get_answer_from_openAI(system_prompt, user_prompt):
    response = openAIclient.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2 # the lower the number, the more deterministic the answer
    )
    return response.choices[0].message.content

def format_question_for_llm(area, question, answer):
    return f"AREA: {area}\nQUESTION: {question}\nANSWER: {answer}"

def get_answer_for_a_question(area, question, client_name, system_name):
    similar_questions = get_similar_questions(area, question)
    concated_previous_qas = "\n\n".join(similar_questions.apply(lambda item: format_question_for_llm(item['Section'], item['Description'], item['Response']), axis=1))
    new_question_for_llm = format_question_for_llm(area, question, "")
    user_prompt = get_user_prompt(client_name, system_name, concated_previous_qas, new_question_for_llm)

    answer = get_answer_from_openAI(system_prompt, user_prompt)
    return answer

Test answering new question

client_name = "Ajax Inc"
system_name = "The Business System"
new_area = "Cyber Essentials Plus"
new_question = "MFA is implemented when available"

answer = get_answer_for_a_question(new_area, new_question, client_name, system_name)

print(answer)

Answering new questions (inference)

# Load questions from a list, try to deduce an answer to each one and output the result
# In practice, this is where you'd likely need to massage it a bit

client_name = "Ajax Inc"
system_name = "The Business System"

new_questions = pd.read_excel('./data/Security_Questionnaires_Amalgamated_20240528.xlsx', sheet_name='Test set')

# new_questions = new_questions.head(20) # For testing

new_questions['AIAnswer'] = new_questions.apply(lambda item: get_answer_for_a_question(item['Section'], item['Description'], client_name, system_name), axis=1)

new_questions.to_csv('./data/output.csv')


# Note that we could use function calling to start looking up information on the web, but it is likely to get dicey.

Conclusion

So there you have it - an example of how to use AI to solve real-world business problems. It's not perfect, it doesn't do everything but it sure as heck makes life easier when done correctly.

Now have a think about what other problems you could apply this kind of approach to.

Don't forget, if you want some help brainstorming this or some help building it, at NewOrbit we are always happy to talk.
