We will be collaboratively learning throughout the quarter with in-class exercises that are done in groups. In this week's exercises, your group will try out the various tasks that LLMs have been used for. The goal is to test the capabilities and limitations of each model to get a better understanding of what they can be reliably used for. Attempt each exercise with your group across different LLMs you have set up access to previously.

For the exercises, your group will evaluate different LLMs across a set of tasks, making note of any substantial differences in what the models are capable of doing. Because models are probabilistic, repeat each task several times to gauge how consistent their responses are.

You will then select examples of tasks where models differ significantly in how they behave and add them as screenshots to the shared Google Slides presentation linked below. Find the section in the presentation for your group and add slides within it for your results. Your group will present your results for 5-10 minutes in next week's class.

You now have access to a variety of LLMs via front-end web interfaces as well as via Python programs. One of the first things to do when utilizing various LLMs is to understand what their capabilities are.
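Since the models are also reachable from Python, these capability probes can be scripted. Below is a minimal sketch, assuming an OpenAI-compatible client and a placeholder model name; substitute whichever provider and model your group configured.

# Minimal sketch for probing a model programmatically.
# Assumes an OpenAI-compatible endpoint and an API key in the environment;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()

question = "Do you have a knowledge cutoff date? If so, what is it?"

# Responses are probabilistic, so repeat the same question a few times
# and compare the answers for consistency.
for trial in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you set up
        messages=[{"role": "user", "content": question}],
    )
    print(f"Trial {trial + 1}: {response.choices[0].message.content}\n")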

Ask a series of questions across a range of LLMs to learn more about what each is capable of and what its limitations are. Examples include:

LLM systems can be programmed to retrieve real-time information in order to generate a response. Retrieval-augmented generation allows an LLM to pull down pertinent information related to a prompt and then incorporate it when responding. Ask the LLM whether it is able to retrieve real-time information and what sources it is allowed to pull data down from. For LLMs that are able to retrieve information, ask a series of questions to find out how well they're able to incorporate up-to-date information. We will be addressing the inability to retrieve real-time data later in the course by building our own LLM agent. Some prompts might include.
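To make the retrieval idea concrete, here is a toy sketch of retrieval-augmented generation; the documents, keyword-overlap retriever, and model name are made up for illustration, and real systems typically use embedding-based search.

# Toy retrieval-augmented generation sketch: pick the document most relevant
# to the question, then place it in the prompt so the model answers from it.
# Keyword overlap stands in for a real retriever (usually embedding search).
from openai import OpenAI

documents = [
    "The campus library is open 8am-10pm on weekdays during fall quarter.",
    "Final project presentations are scheduled for week 10 of the course.",
    "The dining hall closes at 8pm on weekends.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When is the library open?"
context = retrieve(question, documents)
prompt = (
    "Answer using only the context below.\n\n"
    f"Context: {context}\n\nQuestion: {question}"
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)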

LLMs are known for their ability to generate both text and code across a variety of languages. In this exercise, test a set of LLMs on challenging prompts that generate content and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.
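One way to go beyond eyeballing generated code is to run it against a few spot checks. The sketch below is illustrative only: the prompt, model name, and test cases are assumptions, and you should read any model-generated code before executing it.

# Sketch for spot-checking generated code: ask for a small function, then run
# the returned code against a couple of test cases. Only execute model output
# you have read first; the prompt, model name, and checks here are illustrative.
from openai import OpenAI

client = OpenAI()
prompt = (
    "Write a Python function is_palindrome(s) that ignores case and spaces. "
    "Return only the code, with no explanation."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

code = response.choices[0].message.content.strip()
if code.startswith("```"):
    # Models often wrap code in a markdown fence; strip it before executing.
    code = code.strip("`").removeprefix("python").strip()

namespace = {}
exec(code, namespace)  # again: only after reading the generated code
assert namespace["is_palindrome"]("Never odd or even")
assert not namespace["is_palindrome"]("large language model")
print("Generated function passed the spot checks.")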

LLMs are able to digest large amounts of information and generate summaries of it. In this exercise, test a set of LLMs on challenging prompts for summarizing content and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

LLMs can extract information from a large amount of input text. Such extraction forms the basis of the "needle in the haystack" challenge, where a specific piece of information (the needle) is buried in a large body of text (the haystack) and the LLM is asked to recover it. In this exercise, test a set of LLMs on challenging prompts for extracting information from a given context and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.
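A simple way to build such a test yourself is sketched below; the filler text, needle, and placement are arbitrary choices that you can vary.

# Sketch of a do-it-yourself needle-in-a-haystack probe: bury one specific fact
# in a long stretch of filler text and ask the model to recover it.
filler = "The quick brown fox jumps over the lazy dog. " * 500
needle = "The secret launch code is PURPLE-OTTER-42. "

# Place the needle roughly in the middle of the haystack.
midpoint = len(filler) // 2
haystack = filler[:midpoint] + needle + filler[midpoint:]

prompt = (
    "Read the passage below and answer the question that follows.\n\n"
    f"{haystack}\n\n"
    "Question: What is the secret launch code?"
)
# Send the prompt to each model under test (see the earlier client sketch) and
# check whether the answer degrades as the haystack grows or the needle moves.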

Related to information extraction, LLMs can also answer questions based on any context that is given to them. In this exercise, test a set of LLMs on challenging prompts for answering questions from a given context and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

LLMs have the ability to perform classification of text. In this exercise, test a set of LLMs on prompts to evaluate their ability to classify input that you give them. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

One technique for helping an LLM generate the output that you desire is to give it examples of what you're looking for. Without any examples (zero-shot), the model must infer the task from the instructions alone. By giving examples in the prompt (few-shot), the model is given the opportunity to learn the task from those examples and generate the result that you're looking for. We can apply this to our prior classification task.

Zero-shot example

Consider the following prompt attempting to classify an email subject line.

Classify the email subject text below, delimited by three dashes (---),
as being malicious or benign. Explain why.

---
Account email verification code, enter now and reply
---

Few-shot example

It often helps an LLM if the prompt includes some examples that it can use to understand the task better and generate better results. Consider the following prompt that performs a similar task, but with a few examples included.

Classify the email subject text below, delimited by triple quotes ('''),
as being malicious or benign. Explain why.

Examples:

Subject: CY23 Email Verification Now
Label: malicious

Subject: Enter Market Email Verification Code Today
Label: malicious

Subject: New Account Email Verification Code Verify now
Label: malicious

Subject: Submit your code review today
Label: benign

Subject: '''Account email verification code, enter now and reply'''
Label:
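If you want to run this comparison across many subject lines, the few-shot prompt can be assembled programmatically. The sketch below simply rebuilds the prompt above from a list of example pairs; the result can be sent with the earlier client sketch.

# Sketch that rebuilds the few-shot prompt above from a list of example pairs,
# so the same prompt can be applied to many subject lines.
examples = [
    ("CY23 Email Verification Now", "malicious"),
    ("Enter Market Email Verification Code Today", "malicious"),
    ("New Account Email Verification Code Verify now", "malicious"),
    ("Submit your code review today", "benign"),
]

def few_shot_prompt(subject: str) -> str:
    shots = "\n\n".join(f"Subject: {s}\nLabel: {label}" for s, label in examples)
    return (
        "Classify the email subject text below, delimited by triple quotes ('''),\n"
        "as being malicious or benign. Explain why.\n\n"
        f"Examples:\n\n{shots}\n\n"
        f"Subject: '''{subject}'''\nLabel:"
    )

print(few_shot_prompt("Account email verification code, enter now and reply"))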

A common task one might ask an LLM to perform is a reasoning one. For such tasks, it is often helpful for the model (as with humans) to break down the task into smaller parts. In this exercise, we'll examine how step-by-step and chain-of-thought reasoning may help an LLM accomplish a particular task.

Step-by-step example

Prompting an LLM to think step-by-step or providing few-shot examples of step-by-step reasoning can improve the utility of its responses. In this section, we'll examine multiple prompts that allow us to determine whether a particular LLM is sensitive to the approach. Consider the following two prompts:

How do I get to Peru from Portland?
How do I get to Peru from Portland step-by-step?

Zero-shot chain-of-thought example

More complex reasoning tasks can often be answered more accurately by appending a cue such as "Let's think step by step" (zero-shot chain-of-thought). Consider the following two prompts; for reference, the correct answer is 4, since half of the 16 balls are golf balls (8) and half of those are blue:

A juggler can juggle 16 balls.  Half of the balls are golf balls and half of the golf balls are blue.  
How many blue golf balls are there?
A juggler can juggle 16 balls.  Half of the balls are golf balls and half of the golf balls are blue. 
How many blue golf balls are there?  Let's think step by step

Prompting an LLM to perform a reasoning task allows us to evaluate its suitability for doing complex tasks. In this exercise, test a set of LLMs on challenging prompts for performing a range of reasoning tasks. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

LLMs typically have a limited context size they can handle, with paid versions often supporting larger input and output sizes. In this exercise, test a set of LLMs on challenging prompts for processing large amounts of input and producing large amounts of output. Utilize any of the prior tasks as part of your analysis. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.
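When probing context limits, it helps to estimate how many tokens a prompt consumes before sending it. A minimal sketch follows, using tiktoken, which implements OpenAI's tokenizers; other providers tokenize differently, and the context limit shown is a placeholder, so check each model's documented limit.

# Sketch for estimating how much of a context window a prompt will consume.
# tiktoken counts tokens the way OpenAI models do; treat it as a rough estimate
# for other providers. The context limit below is a placeholder.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following report.\n" + "Filler sentence for testing. " * 10000

token_count = len(encoding.encode(prompt))
context_limit = 128_000  # placeholder; check the documented limit for each model

print(f"Prompt uses ~{token_count} tokens of a {context_limit}-token window.")
if token_count > context_limit:
    print("This prompt likely exceeds the context window; expect truncation or an error.")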