
How to practice data analyst interviews with AI?

Using LLMs to generate synthetic data and code

Photo by Scott Graham on Unsplash

Introduction

I have been working on weekend LLM projects. When I was thinking about what to work on, two ideas came to mind:

  1. There are few resources for practicing data analysis interviews, unlike other roles such as software engineering and product management. When I was preparing to interview for my first job as a data analyst, I relied on friends in the industry to come up with SQL and Python interview questions.
  2. LLMs are very good at generating synthetic datasets and writing code.

The result is the AI Data Analysis Interviewer, which automatically creates a unique dataset and generates Python interview questions for you to solve!

This article gives an overview of how it works and the technical implementation. You can check out the repo here.

Demo

When I launch the web app, I’m asked to provide details about the type of interview I’m practicing for, specifically the company and a description of the dataset. Let’s say I’m applying for a data analyst position at Uber that focuses on analyzing ride data:

After clicking Submit and waiting for GPT to do its magic, I receive the AI-generated questions, the answers, and an input field where I can run code on the AI-generated dataset:

Great! Let’s try to solve the first question: calculate the total distance traveled each day. As usual in good analysis, we start with data exploration:

It looks like we need to group by the ride_date field and sum the distance_miles field. Let’s write that Pandas code and submit it:
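
A minimal sketch of the kind of code I’d submit, assuming the ride_date and distance_miles columns described above:

# Total distance traveled per day
df.groupby('ride_date')['distance_miles'].sum()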

Sounds good to me! Does the AI answer match our approach?

The AI answer uses a slightly different methodology, but essentially solves the problem in the same way.

I can rinse and repeat as many times as I need to feel confident before going into an interview. Applying for Airbnb? This tool has you covered. It generates the questions:

Together with a dataset you can execute code on:

How to use the app

Check out the repo readme here to run the app locally. Unfortunately I didn’t host it, but I might in the future!

High-level design

The rest of this article will cover the technical details of how I developed the AI Data Analysis Interviewer.

LLM Architecture

I used gpt-4o from OpenAI because it is currently my favorite LLM (you can swap in another model quite easily).

There are 3 types of LLM calls:

  1. Dataset generation: we ask an LLM to generate a dataset suitable for an analytical interview.
  2. Question generation: we ask an LLM to generate analytical interview questions based on that dataset.
  3. Answer generation: we ask an LLM to generate the answer code for each interview question.
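
At a high level, the three calls chain together roughly like this (a pseudocode sketch; the helper names here are illustrative, not the repo’s API, and the actual classes are covered below):

# Rough sketch of the three-call flow (illustrative names only)
dataset_csv = call_llm(dataset_generation_prompt)                  # 1. dataset generation
questions = call_llm(question_generation_prompt(dataset_csv))      # 2. question generation
answers = [call_llm(answer_generation_prompt(dataset_csv, q)) for q in questions]  # 3. answer generation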

Front-end

I built the front-end with Flask. It’s simple and not very interesting, so I’ll focus on the LLM details below. Feel free to check out the code in the repo!

Design details

LLM Manager

LLMManager is a simple class that handles making LLM API calls. It reads our OpenAI API key from a local secrets file and makes an OpenAI API call that passes a prompt to the model. You’ll see some form of this in every LLM project.

import os

from dotenv import load_dotenv
from openai import OpenAI
from openai.types.chat import ChatCompletion

class LLMManager():
    def __init__(self, model: str = "gpt-4o"):
        self.model = model

        # Load the OpenAI API key from a local secrets file
        load_dotenv("secrets.env")
        openai_api_key = os.getenv("OPENAI_API_KEY")
        self.client = OpenAI(api_key=openai_api_key)

    def call_llm(self, system_prompt: str, user_prompt: str, temperature: float) -> str:
        print(f"Calling LLM with system prompt: {system_prompt}\n\nUser prompt: {user_prompt}")
        response: ChatCompletion = self.client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            model=self.model,
            temperature=temperature
        )
        message = response.choices[0].message.content
        print(response)
        return message
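
For example, a call looks like this (a minimal usage sketch; the prompts are placeholders):

llm_manager = LLMManager()
reply = llm_manager.call_llm(
    system_prompt="You are a helpful assistant.",
    user_prompt="Say hello.",
    temperature=0
)
print(reply)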

Dataset generation

This is where the fun begins!

First, we ask an LLM to generate a dataset with the following prompt:

SYSTEM_TEMPLATE = """You are a senior staff data analyst at a world class tech company.
You are designing a data analysis interview for hiring candidates."""

DATA_GENERATION_USER_TEMPLATE = """Create a dataset for a data analysis interview that contains interesting insights.
Specifically, generate comma delimited csv output with the following characteristics:
- Relevant to company: {company}
- Dataset description: {description}
- Number of rows: 100
- Number of columns: 5
Only include csv data in your response. Do not include any other information.
Start your output with the first header of the csv: "id,".
Output: """

Let’s take a closer look:

  • Many LLM models follow a prompt structure where the LLM accepts a system and user message. The system message is meant to define general behavior and the user message is meant to give specific instructions. Here we ask the LLM to be a world-class interviewer in the system message. It feels silly, but praising an LLM is a proven prompt hack for getting better performance.
  • We send the user input about the company and the dataset they want to practice interviews with to the user template via the string variables {company} and {description}.
  • We ask the LLM to output data in csv format. This seems like the simplest tabular data format for an LLM to produce, which we can later convert to a Pandas DataFrame for code analysis. JSON would probably work as well, but may be less reliable given its more complex and verbose syntax.
  • We want the LLM output to be parseable csv, but gpt-4o often generates extra text, probably because it has been trained to be very helpful. The end of the user template explicitly instructs the LLM to only output parseable csv data, but even then we have to post-process it.
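
To illustrate why (a hypothetical example; these column names are made up), a raw gpt-4o response might look like this, with chatty text before the data:

Sure! Here is your dataset:

id,ride_date,driver_id,distance_miles,fare_usd
1,2024-05-01,D102,12.4,23.50
2,2024-05-01,D215,3.1,7.80
...

Everything before the "id," header has to be stripped before the output can be parsed as csv, which is exactly what the post-processing step below does.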

The DataGenerator class handles all aspects of data generation and contains the generate_interview_dataset method that executes the LLM call to generate the dataset:

    def generate_interview_dataset(self, company: str, description: str, mock_data: bool) -> str:
        if not mock_data:
            data_generation_user_prompt = DATA_GENERATION_USER_TEMPLATE.format(company=company, description=description)
            dataset = self.llm_manager.call_llm(
                system_prompt=SYSTEM_TEMPLATE,
                user_prompt=data_generation_user_prompt,
                temperature=0
            )

            dataset = self.clean_llm_dataset_output(dataset)
            return dataset

        return MOCK_DATASET

    def clean_llm_dataset_output(self, dataset: str) -> str:
        # Drop any text the LLM emitted before the csv header
        cleaned_dataset = dataset[dataset.index("id,"):]
        return cleaned_dataset

Note that the clean_llm_dataset_output method performs the light post-processing mentioned above: it removes any extraneous text before "id,", which marks the beginning of the csv data.

LLMs can only output strings, so we need to transform the string output into an analyzable Pandas DataFrame. The convert_str_to_df method takes care of that:

    def convert_str_to_df(self, dataset: str) -> pd.DataFrame:
        csv_data = StringIO(dataset)

        try:
            df = pd.read_csv(csv_data)
        except Exception as e:
            raise ValueError(f"Error in converting LLM csv output to DataFrame: {e}")

        return df
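
Putting generation and conversion together looks something like this (a usage sketch; I’m assuming DataGenerator wires up its own LLMManager, and the mock_data flag just short-circuits to a canned dataset):

data_generator = DataGenerator()
dataset_str = data_generator.generate_interview_dataset(company="Uber", description="ride data", mock_data=False)
df = data_generator.convert_str_to_df(dataset_str)
print(df.head())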

Question generation

We can ask an LLM to generate interview questions based on the generated dataset with the following prompt:

QUESTION_GENERATION_USER_TEMPLATE = """Generate 3 data analysis interview questions that can be solved with Python pandas code based on the dataset below:

Dataset:
{dataset}

Output the questions in a Python list where each element is a question. Start your output with "[".
Do not include question indexes like "1." in your output.
Output: """

Breaking this down:

  • The same system prompt is used here because we want the LLM to act as a world-class interviewer when developing the interview questions.
  • The string output of the dataset generation call is passed to the string variable {dataset}. Note that we need to maintain 2 representations of the dataset: 1. a string representation that an LLM can understand to generate questions and answers and 2. a structured representation (i.e. DataFrame) that we can execute code over.
  • We ask the LLM to return a list. We need to structure the output so that we can iterate over the questions in the answer generation step to generate an answer for each question.

The LLM call is made using the generate_interview_questions method of DataGenerator:

    def generate_interview_questions(self, dataset: str) -> InterviewQuestions:
        question_generation_user_prompt = QUESTION_GENERATION_USER_TEMPLATE.format(dataset=dataset)
        questions = self.llm_manager.call_llm(
            system_prompt=SYSTEM_TEMPLATE,
            user_prompt=question_generation_user_prompt,
            temperature=0
        )

        try:
            # literal_eval (from the ast module) safely parses the LLM's list output
            questions_list = literal_eval(questions)
        except Exception as e:
            raise ValueError(f"Error in converting LLM questions output to list: {e}")

        questions_structured = InterviewQuestions(
            question_1=questions_list[0],
            question_2=questions_list[1],
            question_3=questions_list[2]
        )

        return questions_structured
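
InterviewQuestions is a small container for the three questions. Its exact definition lives in the repo; a minimal sketch (my assumption) would be a dataclass:

from dataclasses import dataclass

@dataclass
class InterviewQuestions:
    question_1: str
    question_2: str
    question_3: str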

Answer generation

Now that both the dataset and the questions are available, we finally generate the answers using the following prompt:

ANSWER_GENERATION_USER_TEMPLATE = """Generate an answer to the following data analysis interview Question based on the Dataset.

Dataset:
{dataset}

Question: {question}

The answer should be executable Pandas Python code where df refers to the Dataset above.
Always start your answer with a comment explaining what the following code does.
DO NOT DEFINE df IN YOUR RESPONSE.
Answer: """

  • We make one LLM call per answer, so three calls in total, because the question generation prompt is hardcoded to ask for 3 questions (a sketch of this loop follows the list). Technically, you could ask an LLM to generate all 3 answers in a single call, but I suspect that would degrade performance, and we want to maximize the LLM’s ability to generate accurate answers. A (perhaps obvious) rule of thumb is that the more difficult the task given to an LLM, the less likely it is to perform it well.
  • The prompt instructs the LLM to refer to the dataset as “df” because our interview dataset in DataFrame form is named “df” when the user’s code is executed by the CodeExecutor class below.
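
The repo contains the actual answer-generation method; as a rough sketch mirroring generate_interview_questions above (the method name and loop here are my assumptions):

    def generate_interview_answers(self, dataset: str, questions: InterviewQuestions) -> list[str]:
        # Hypothetical sketch: one LLM call per question, as described above
        answers = []
        for question in [questions.question_1, questions.question_2, questions.question_3]:
            answer_generation_user_prompt = ANSWER_GENERATION_USER_TEMPLATE.format(dataset=dataset, question=question)
            answers.append(self.llm_manager.call_llm(
                system_prompt=SYSTEM_TEMPLATE,
                user_prompt=answer_generation_user_prompt,
                temperature=0
            ))
        return answers
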
class CodeExecutor():

    def execute_code(self, df: pd.DataFrame, input_code: str):
        local_vars = {'df': df}
        code_prefix = """import pandas as pd\nresult = """
        try:
            exec(code_prefix + input_code, {}, local_vars)
        except Exception as e:
            return f"Error in code execution: {e}\nCompiled code: {code_prefix + input_code}"

        execution_result = local_vars.get('result', None)

        if isinstance(execution_result, pd.DataFrame):
            return execution_result.to_html()

        return execution_result
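
Usage is straightforward (a sketch; the pandas one-liner stands in for whatever the user types into the input field):

code_executor = CodeExecutor()
result = code_executor.execute_code(df, "df.groupby('ride_date')['distance_miles'].sum()")
print(result)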

Conclusion

I hope this article provides some insight into how to set up a simple and useful LLM project that uses LLMs in a variety of ways!

If I were to further develop this project, I would focus on:

  1. Add more validation to structured output from LLMs (i.e., parseable csv or lists). I’ve covered a few edge cases already, but LLMs are very unpredictable, so this needs to be hardened.

  2. Add more features, such as:

  • Generating multiple relational tables and queries that require joins
  • SQL interviews alongside Python ones
  • Uploading a custom dataset
  • Setting a difficulty level

