Use AI to answer compliance questionnaires

by Frans Lytzen | 06/08/2024

I don't know about you, but I have had to fill in way too many security and compliance questionnaires over the years. Every one asks the same questions - but just differently enough that I have to answer each one, all over again. As yet another one hit my inbox, I thought I would have a go at using AI to answer the questionnaire by using my answers from previous questionnaires. It took maybe half a day to build, and the results are surprisingly good and very reusable.

In this video, I am going to show you how I built some simple AI to help me answer compliance questionnaires. I am going to briefly explain the principles first before showing you the code. That way, if you just want to know the principles, you can exit easily.

What I built uses off-the-shelf tools and is very cheap. There are no costly training processes and no ongoing hosting costs, and the AI costs were measured in pennies. All it really needs is a handful of questionnaires of a similar type that you have answered in the past.


Concepts

The biggest problem is that all the different questionnaires ask the same questions - but in slightly different ways.

A general LLM, such as ChatGPT, doesn't know anything about your business or your systems, so it cannot help you answer questionnaires on its own. But, if we can find similar questions and answers from the past, we can give those to the AI and - essentially - say "given these previous answers, please try to answer this new question for me". There is a bit more to it, but that is essentially it.

Embeddings

The first step, therefore, is to be able to find similar questions from the past. There is an AI tool called "Embeddings" that we can use to search through the previous questions based on their so-called "semantic meaning". This means we can find questions that have a similar meaning, even if they are written using completely different words.
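
To make "similar meaning" a bit more concrete: an embedding is just a list of numbers, and two texts count as similar when their lists point in roughly the same direction. The sketch below is only an illustration, not the code used later in this article. The three tiny vectors are made up by me; real embedding vectors come from the model and are far longer.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only.
# Real embedding vectors have hundreds or thousands of dimensions.
mfa_question = [0.9, 0.1, 0.2]         # "Is MFA enforced for all users?"
two_factor_question = [0.8, 0.2, 0.3]  # "Do you require two-factor login?"
backup_question = [0.1, 0.9, 0.4]      # "How often are backups taken?"

# The two login-related questions score much higher than the unrelated pair.
print(cosine_similarity(mfa_question, two_factor_question))
print(cosine_similarity(mfa_question, backup_question))
```

A score close to 1 means "very similar" and a score close to 0 means "unrelated", even when the two questions share no words.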

If you were building a system to handle tens of thousands of past questions and answers, you would use a tool such as Azure AI Search for this - but for this purpose we can literally just use a spreadsheet on our local computer.

The preparation steps are the following:

  1. Combine all your previous questions and answers into a single spreadsheet.

  2. Use a bit of code and an AI to calculate a so-called "embedding vector" and store this in the spreadsheet.
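
Step 1 is mostly manual work, but if your old questionnaires are exported as files it can be scripted. This is a rough sketch, assuming each questionnaire uses its own column names. The file names and the source column names are hypothetical; the target names Section, Description and Response are the ones the code further down expects.

```python
import csv
import io

# Hypothetical per-questionnaire column names mapped onto the common ones.
COLUMN_MAPS = {
    "client_a.csv": {"Category": "Section", "Question": "Description", "Answer": "Response"},
    "client_b.csv": {"Area": "Section", "Requirement": "Description", "Our response": "Response"},
}

def normalise_rows(rows, column_map):
    """Rename columns to Section/Description/Response and drop unanswered rows."""
    result = []
    for row in rows:
        renamed = {new: (row.get(old) or "").strip() for old, new in column_map.items()}
        if renamed["Response"]:
            result.append(renamed)
    return result

# In-memory stand-ins for two exported questionnaires.
client_a = io.StringIO("Category,Question,Answer\nAccess,Is MFA enforced?,Yes\nAccess,Is SSO supported?,\n")
client_b = io.StringIO("Area,Requirement,Our response\nBackups,How often are backups taken?,Daily\n")

combined = []
combined += normalise_rows(csv.DictReader(client_a), COLUMN_MAPS["client_a.csv"])
combined += normalise_rows(csv.DictReader(client_b), COLUMN_MAPS["client_b.csv"])

for row in combined:
    print(row)
```

The unanswered "Is SSO supported?" row is dropped, because a past question without an answer is of no use to the LLM.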

Answering new questions

The second part is processing a new questionnaire. It consists of the following steps:

  1. Use a bit of code to read all the questions from the new questionnaire

  2. For each question, do the following:

    1. Find similar questions from the past

    2. Give those question/answers to ChatGPT (or another LLM) and ask it to answer the new question

  3. Save the proposed questions and answers back to a spreadsheet
  4. Review and edit the answers and send the completed questionnaire back

Ideally, you should then save the new questions and answers back into the "past answers" spreadsheet so you are ready for the next one.
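
Saving the reviewed answers back could look something like the sketch below. It is not part of what I built; the function name and the rule for spotting repeats (same Section and Description, ignoring case) are assumptions you may want to change.

```python
def add_reviewed_answers(bank, reviewed):
    """Append reviewed question/answer rows to the bank, skipping repeats.

    Both arguments are lists of dicts with Section, Description and Response keys.
    Returns the new bank and the rows that were added (these still need an embedding).
    """
    def key_of(row):
        return (row["Section"].strip().lower(), row["Description"].strip().lower())

    seen = {key_of(row) for row in bank}
    added = []
    for row in reviewed:
        # Skip questions we already have and rows that were left unanswered
        if key_of(row) not in seen and row["Response"].strip():
            seen.add(key_of(row))
            added.append(dict(row))
    return bank + added, added

bank = [{"Section": "Access", "Description": "Is MFA enforced?", "Response": "Yes"}]
reviewed = [
    {"Section": "access", "Description": "is mfa enforced? ", "Response": "Yes, always"},
    {"Section": "Backups", "Description": "How often are backups taken?", "Response": "Daily"},
]
new_bank, added = add_reviewed_answers(bank, reviewed)
print(len(new_bank), added)
```

Only the rows in `added` need a new embedding vector, so the cost of keeping the bank up to date stays tiny.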

Summary

Just before I get into showing the code, let me summarise this.

This is just an example of how the latest generation of AI is surprisingly cheap and easy to apply to real-world business problems. We are still in that stage where we are all trying to figure this out and I think we are all getting fed up with gimmicks of the "let AI write your emails for you" kind. The real world requires a bit more nuance and more context than one-size-fits-all solutions.

At the same time, to really get the value of these tools, you need to embed them in your business processes. NewOrbit are experts at building systems that help businesses run more efficiently and we are helping several of our clients to embed AI into those systems.

If you are interested in doing something like that, please get in touch. We are always happy to talk about this.

Code

Install packages

pip install openpyxl
pip install python-dotenv
pip install openai
pip install scikit-learn
pip install numpy
pip install pandas

Set up access to Azure OpenAI

You can easily change this to use OpenAI directly or some other LLM.

Create a .env file in the same directory as your code with the following content:

AZURE_OPENAI_ENDPOINT=https://XXX.openai.azure.com/
AZURE_OPENAI_API_KEY=XXX
AZURE_OPENAI_API_VERSION=2024-06-01
EMBEDDING_MODEL_NAME=text-embedding-ada-002
LLM_MODEL_NAME=gpt-4

from dotenv import load_dotenv
import os
from openai import AzureOpenAI

load_dotenv()

# gets the API Key from environment variable AZURE_OPENAI_API_KEY
openAIclient = AzureOpenAI(
    # https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

embed_model = os.getenv("EMBEDDING_MODEL_NAME")  # text-embedding-ada-002 - but use the newest available
llm_model = os.getenv("LLM_MODEL_NAME")  # gpt-4 - but consider gpt-4-turbo or a later model
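
A missing or misspelled entry in the .env file tends to show up as a confusing error much later, so it can be worth checking the settings up front. This is an optional sketch, not part of the original code. The helper name `missing_settings` is mine; the setting names are the ones from the .env file above.

```python
import os

REQUIRED_SETTINGS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "EMBEDDING_MODEL_NAME",
    "LLM_MODEL_NAME",
]

def missing_settings(env=os.environ):
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]

# Example with a deliberately incomplete, made-up environment.
example_env = {"AZURE_OPENAI_ENDPOINT": "https://XXX.openai.azure.com/", "LLM_MODEL_NAME": "gpt-4"}
print(missing_settings(example_env))
```

In the real script you would call `missing_settings()` with no arguments right after `load_dotenv()` and stop if the list is not empty.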

Shared functions

# Function to generate an embedding for an area and question, also used for query embeddings
def generate_embeddings(area, question):
    formatted_question = f"Area: {area}\nQuestion: {question}"
    response = openAIclient.embeddings.create(
        input=formatted_question, model=embed_model)
    # print(response)
    embeddings = response.data[0].embedding
    return embeddings  # this will return an array of floats

Initial data processing

This should be a one-off

import os
import pandas as pd

input_file = './data/questionnaire_bank.xlsx'
previous_answers_path = './data/previous_answers.csv'

if not os.path.exists(previous_answers_path):
    # Read the spreadsheet into a dataframe, specifying the sheet name
    df = pd.read_excel(input_file, sheet_name='Train set')
    df = df.dropna(subset=['Response'])  # Remove rows with no response

    # NOTE: There is a way to pass up to 2048 different texts to be embedded in one API call. That is an optimisation for another day.
    df["Embedding"] = df.apply(lambda item: generate_embeddings(item['Section'], item['Description']), axis=1)

    df.to_csv(previous_answers_path, index=False)

# You could/should do something to just add new ones without embeddings - possibly check if they have been changed via an etag
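
The batching optimisation mentioned in the code comment could be sketched like this. It is not what I ran. The function `embed_in_batches` and the `embed_batch` parameter are names I made up; the idea is that `embed_batch` wraps a single API call that accepts a list of texts. A fake version is used here so the sketch runs without any API access.

```python
def embed_in_batches(texts, embed_batch, batch_size=2048):
    """Embed texts in chunks, calling embed_batch once per chunk.

    embed_batch is any callable that takes a list of strings and returns
    one vector per string, in the same order.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        vectors.extend(embed_batch(chunk))
    return vectors

# Stand-in for the real API call. With the client above, an assumed real version
# would pass the whole chunk as input= to openAIclient.embeddings.create(...)
# and read one embedding per item from response.data.
calls = []
def fake_embed_batch(chunk):
    calls.append(len(chunk))
    return [[float(len(text))] for text in chunk]

texts = [f"Area: Test\nQuestion: question {i}" for i in range(5)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=2)
print(calls)         # [2, 2, 1]
print(len(vectors))  # 5
```

For a few hundred questions the one-call-per-row approach is fine; batching only starts to matter when the bank gets large. If you do use it, check the service's limits on total input size per request as well as the number of texts.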

Prepare inference

Prepare to find previous questions

NOTE: I am only "embedding" the questions, not the answers. There may be scenarios where you get better results if you include the answer as well. Experiment!
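
If you want to try that experiment, one way is to make the answer an optional part of the text that gets embedded. This is only a sketch; `format_for_embedding` is a hypothetical helper that is not used elsewhere in this article, and the "Answer:" label is my own choice.

```python
def format_for_embedding(area, question, answer=None):
    """Build the text that gets embedded; the answer is optional."""
    text = f"Area: {area}\nQuestion: {question}"
    if answer:
        text += f"\nAnswer: {answer}"
    return text

# Stored questions could include the answer...
print(format_for_embedding("Access control", "Is MFA enforced?", "Yes, for all users."))
# ...while a new question has no answer yet, so it is embedded without one.
print(format_for_embedding("Access control", "MFA is implemented when available"))
```

Bear in mind that the new question never has an answer at lookup time, so the stored text and the query text will no longer have the same shape. Whether that helps or hurts is something to measure on your own data.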

import ast

from sklearn.neighbors import NearestNeighbors
import numpy as np

previous_answers = pd.read_csv(previous_answers_path)
# The embedding will be read as a string - this converts it back to an array
# ast.literal_eval only parses literals, so it is safer than eval on file contents
previous_answers['Embedding'] = previous_answers['Embedding'].apply(ast.literal_eval).apply(np.array)

knn = NearestNeighbors(n_neighbors=5, algorithm='auto')

input_vector = np.stack(previous_answers["Embedding"])  # Get the embeddings into a 2D array - https://stackoverflow.com/a/56620286/11534

knn.fit(input_vector)  # This makes it ready so we can ask questions
def get_similar_questions(area, question):
    input_vector = np.array(generate_embeddings(area, question))
    nns = knn.kneighbors([input_vector], n_neighbors=5, return_distance=True)
    distances_of_similar_questions = nns[0][0]
    ids_of_similar_previous_questions = nns[1][0].tolist()
    similar_questions = previous_answers.iloc[ids_of_similar_previous_questions].copy()
    similar_questions['Distance'] = distances_of_similar_questions
    return similar_questions

Test retrieving similar questions

new_area = "Cyber Essentials Plus"
new_question = "MFA is implemented when available"

similar_questions = get_similar_questions(new_area, new_question)

for index, row in similar_questions.iterrows():
    print("Section:", row['Section'])
    print("Description:", row['Description'])
    print("Response:", row['Response'])
    print("Distance:", row['Distance'])
    print("--------------------")
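
If you are wondering what the nearest-neighbour search is doing under the hood: with its default settings, NearestNeighbors ranks by plain Euclidean distance. The sketch below does the same thing by brute force on made-up 2D points. The `max_distance` cut-off is my own addition, not something the code above does; it is one possible way to avoid handing clearly unrelated questions to the LLM.

```python
import math

def nearest(query, vectors, k=5, max_distance=None):
    """Return (index, distance) pairs for the k vectors closest to query."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    top = scored[:k]
    if max_distance is not None:
        # Optional: drop matches that are too far away to be useful
        top = [pair for pair in top if pair[1] <= max_distance]
    return top

# Made-up 2D points standing in for embedding vectors.
stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(nearest([0.0, 0.0], stored, k=2))                    # [(0, 0.0), (1, 1.0)]
print(nearest([0.0, 0.0], stored, k=3, max_distance=2.0))  # [(0, 0.0), (1, 1.0)]
```

What a sensible cut-off looks like depends entirely on your embedding model and data, so print the distances for a few known-good and known-bad matches before picking one.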

NOTE: There is an additional, advanced technique called "re-ranking" that may be useful here, depending on your source data. The general idea is that you retrieve maybe 10 questions and then re-rank them with a model that is specifically tuned to do that. It will compare the new question to each previous question/answer and find the best matches. However, this may require you to host your own models and is more complex than just calling an API.
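
The shape of re-ranking is simple even if the models are not. In this sketch the scoring function is passed in, so a real re-ranking model could be swapped in later. The `word_overlap` scorer is a deliberately crude stand-in I made up so the example runs on its own; it is not a real re-ranker.

```python
def rerank(new_question, candidates, score, keep=5):
    """Sort candidates by score(new_question, candidate), best first, and keep the top few."""
    ranked = sorted(candidates, key=lambda c: score(new_question, c), reverse=True)
    return ranked[:keep]

def word_overlap(a, b):
    """Toy relevance score: share of words the two texts have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

# Pretend these came back from the embedding search.
candidates = [
    "Are backups encrypted at rest",
    "Is MFA enforced for administrators",
    "MFA is required where available",
]
print(rerank("MFA is implemented when available", candidates, word_overlap, keep=2))
```

In a real setup you would retrieve more candidates than you need (the 10 mentioned above, say), score each one against the new question with the re-ranking model, and pass only the best few to the LLM.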

Prepare to be able to answer new questions

system_prompt = """
You are a security professional working for a company called NewOrbit Ltd.
Your job is to answer security questionnaires.

NewOrbit is a software development company that develops, hosts and supports software on behalf of their clients.
The security questionnaires you will be asked to answer are usually from customers of NewOrbit's clients. In other words, you need to answer
questions being mindful that the system is ultimately owned by the client. You will be told the name of NewOrbit's client and the name of the system
NewOrbit develops and hosts for them. You will also be supplied with similar questions and answers from previous questionnaires. These previous
questions and answers may have references to other clients. It is important that you replace any such reference with the name of the client and
system you are currently answering questions for.

NewOrbit hosts all Software on Microsoft Azure. If the question seems to apply mostly to the hosting provision, you may simply refer to the 'Microsoft Trust Center' instead of directly answering the question.
Do NOT refer to the Microsoft Trust Center if you are answering the question directly.

NewOrbit is based in the UK.

Only use information provided in the prompts to answer questions. If you do not know the answer to a question, say 'UNKNOWN'.
Keep answers short. Do not elaborate or provide any more information than what is asked for.

NewOrbit is ISO 27001 accredited and maintains a comprehensive Information Security Management System (ISMS).
NewOrbit is also ISO 22301 accredited for business continuity management.
"""
def get_user_prompt(client_name, system_name, concatenated_previous_qas, new_question_as_text):
    return f"""
You are answering a question from a security questionnaire. The client you are answering on behalf of is '{client_name}' and their system is called '{system_name}'.

Please return ONLY the answer, do not include the question in your response.

These are examples of similar questions and answers from the past, for other clients:
==========================
{concatenated_previous_qas}
==========================

The question I want you to answer is:
==========================
{new_question_as_text}
==========================
"""

def get_answer_from_openAI(system_prompt, user_prompt):
    response = openAIclient.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2  # the lower the number, the more deterministic the answer
    )
    return response.choices[0].message.content

def format_question_for_llm(area, question, answer):
    return f"AREA: {area}\nQUESTION: {question}\nANSWER: {answer}"

def get_answer_for_a_question(area, question, client_name, system_name):
    similar_questions = get_similar_questions(area, question)
    concated_previous_qas = "\n\n".join(similar_questions.apply(lambda item: format_question_for_llm(item['Section'], item['Description'], item['Response']), axis=1))
    new_question_for_llm = format_question_for_llm(area, question, "")
    user_prompt = get_user_prompt(client_name, system_name, concated_previous_qas, new_question_for_llm)

    answer = get_answer_from_openAI(system_prompt, user_prompt)
    return answer

Test answering new question

client_name = "Ajax Inc"
system_name = "The Business System"
new_area = "Cyber Essentials Plus"
new_question = "MFA is implemented when available"

answer = get_answer_for_a_question(new_area, new_question, client_name, system_name)

print(answer)
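
The system prompt tells the model to say 'UNKNOWN' when it cannot answer from the supplied material, so it is handy to be able to spot those answers when reviewing. This is a small sketch; `needs_manual_answer` is a hypothetical helper and the clean-up rules are my guess at how the marker might come back (quoted, lower-case, with a full stop).

```python
def needs_manual_answer(answer):
    """True when the model returned nothing useful or the 'UNKNOWN' marker."""
    if answer is None:
        return True
    # Remove surrounding whitespace, quotes and full stops before comparing
    cleaned = answer.strip().strip("'\".").strip().upper()
    return cleaned == "" or cleaned == "UNKNOWN"

print(needs_manual_answer("UNKNOWN"))                # True
print(needs_manual_answer(" 'unknown'. "))           # True
print(needs_manual_answer("Yes, MFA is enforced."))  # False
```

You could use this to add a "needs review" column to the output, so the questions the AI could not answer stand out.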

Answering new questions (inference)

# Load questions from a list, try to deduce an answer to each one and output the result
# In practice, this is where you'd likely need to massage it a bit

client_name = "Ajax Inc"
system_name = "The Business System"

new_questions = pd.read_excel('./data/Security_Questionnaires_Amalgamated_20240528.xlsx', sheet_name='Test set')

# new_questions = new_questions.head(20)  # For testing

new_questions['AIAnswer'] = new_questions.apply(lambda item: get_answer_for_a_question(item['Section'], item['Description'], client_name, system_name), axis=1)

new_questions.to_csv('./data/output.csv')


# Note that we could use function calling to start looking up information on the web, but it is likely to get dicey.
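
One practical wrinkle with the loop above: if a single API call raises an exception halfway through, the whole `apply` fails and the answers generated so far are lost. A small retry wrapper is one way round that. This is a sketch under my own assumptions, not part of the original code; the function name, the number of attempts and the waiting times are all made up, and a flaky stand-in function is used so it runs without an API.

```python
import time

def answer_with_retries(answer_fn, question, attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Call answer_fn(question), retrying on any exception.

    Returns (answer, error). error is None when a call succeeded.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return answer_fn(question), None
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < attempts - 1:
                sleep(wait_seconds * (2 ** attempt))  # wait longer after each failure
    return None, str(last_error)

# Stand-in for the real LLM call: fails twice, then succeeds.
attempts_seen = []
def flaky_answer(question):
    attempts_seen.append(question)
    if len(attempts_seen) < 3:
        raise RuntimeError("temporary failure")
    return "Yes"

# sleep is replaced with a no-op so the example finishes instantly.
print(answer_with_retries(flaky_answer, "MFA is implemented when available", sleep=lambda s: None))
```

In the real loop, `answer_fn` would wrap `get_answer_for_a_question` for the current row, and any row that still comes back with an error can be left blank for a human to fill in.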

Conclusion

So there you have it - an example of how to use AI to solve real-world business problems. It's not perfect, it doesn't do everything but it sure as heck makes life easier when done correctly.

Now have a think about what other problems you could apply this kind of approach to.

Don't forget, if you want some help brainstorming this or some help building it, at NewOrbit we are always happy to talk.


