We will be collaboratively learning throughout the quarter with in-class exercises that are done in groups. In this week's exercises, your group will try out the various tasks that LLMs have been used for. The goal is to test the capabilities and limitations of each model to get a better understanding of what they can be reliably used for. Attempt each exercise with your group across different LLMs you have set up access to previously.

For the exercises, your group will evaluate different LLMs across a set of tasks, making note of any substantial differences in what the models are capable of doing. Because models are probabilistic, repeat each task several times to gauge how consistent their responses are.

You will then select examples of tasks where models differ significantly in how they behave and add them as screenshots to the shared Google Slides presentation linked below. Find the section in the presentation for your group and add slides within it for your results. Your group will present your results for 5-10 minutes in next week's class.

You now have access to a variety of LLMs via front-end web interfaces as well as via Python programs. One of the first things to do when utilizing various LLMs is to understand what their capabilities are.
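Since the models are also reachable from Python, these capability probes can be scripted. Below is a minimal sketch, assuming an OpenAI-compatible client and a placeholder model name; substitute whichever provider and model your group configured.

# Minimal sketch for probing a model programmatically.
# Assumes an OpenAI-compatible endpoint and an API key in the environment;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()

question = "Do you have a knowledge cutoff date? If so, what is it?"

# Responses are probabilistic, so repeat the same question a few times
# and compare the answers for consistency.
for trial in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you set up
        messages=[{"role": "user", "content": question}],
    )
    print(f"Trial {trial + 1}: {response.choices[0].message.content}\n")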

Ask a series of questions across a range of LLMs to learn more about what each is capable of and what its limitations are. Examples include:

LLM systems can be programmed to retrieve real-time information in order to generate a response. Retrieval-augmented generation allows an LLM to pull down pertinent information related to a prompt and then incorporate it when responding. Ask the LLM whether it is able to retrieve real-time information and what sources it is allowed to pull data down from. For LLMs that are able to retrieve information, ask a series of questions to find out how well they're able to incorporate up-to-date information. We will be addressing the inability to retrieve real-time data later in the course by building our own LLM agent. Some prompts might include.
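To make the retrieval idea concrete, here is a toy sketch of retrieval-augmented generation; the documents, keyword-overlap retriever, and model name are made up for illustration, and real systems typically use embedding-based search.

# Toy retrieval-augmented generation sketch: pick the document most relevant
# to the question, then place it in the prompt so the model answers from it.
# Keyword overlap stands in for a real retriever (usually embedding search).
from openai import OpenAI

documents = [
    "The campus library is open 8am-10pm on weekdays during fall quarter.",
    "Final project presentations are scheduled for week 10 of the course.",
    "The dining hall closes at 8pm on weekends.",
]

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When is the library open?"
context = retrieve(question, documents)
prompt = (
    "Answer using only the context below.\n\n"
    f"Context: {context}\n\nQuestion: {question}"
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)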

LLMs are known for their ability to generate both text and code across a variety of languages. In this exercise, test a set of LLMs on challenging prompts that generate content and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.
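One way to go beyond eyeballing generated code is to run it against a few spot checks. The sketch below is illustrative only: the prompt, model name, and test cases are assumptions, and you should read any model-generated code before executing it.

# Sketch for spot-checking generated code: ask for a small function, then run
# the returned code against a couple of test cases. Only execute model output
# you have read first; the prompt, model name, and checks here are illustrative.
from openai import OpenAI

client = OpenAI()
prompt = (
    "Write a Python function is_palindrome(s) that ignores case and spaces. "
    "Return only the code, with no explanation."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

code = response.choices[0].message.content.strip()
if code.startswith("```"):
    # Models often wrap code in a markdown fence; strip it before executing.
    code = code.strip("`").removeprefix("python").strip()

namespace = {}
exec(code, namespace)  # again: only after reading the generated code
assert namespace["is_palindrome"]("Never odd or even")
assert not namespace["is_palindrome"]("large language model")
print("Generated function passed the spot checks.")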

LLMs are able to digest large amounts of information and generate summaries of it. In this exercise, test a set of LLMs on challenging prompts for summarizing content and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

LLMs can extract information from a large amount of input text. Such extraction forms the basis of the "needle in the haystack" challenge, where a specific piece of information (the needle) is buried in a large body of text (the haystack) and the LLM is asked to recover it. In this exercise, test a set of LLMs on challenging prompts for extracting information from a given context and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.
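A simple way to build such a test yourself is sketched below; the filler text, needle, and placement are arbitrary choices that you can vary.

# Sketch of a do-it-yourself needle-in-a-haystack probe: bury one specific fact
# in a long stretch of filler text and ask the model to recover it.
filler = "The quick brown fox jumps over the lazy dog. " * 500
needle = "The secret launch code is PURPLE-OTTER-42. "

# Place the needle roughly in the middle of the haystack.
midpoint = len(filler) // 2
haystack = filler[:midpoint] + needle + filler[midpoint:]

prompt = (
    "Read the passage below and answer the question that follows.\n\n"
    f"{haystack}\n\n"
    "Question: What is the secret launch code?"
)
# Send the prompt to each model under test (see the earlier client sketch) and
# check whether the answer degrades as the haystack grows or the needle moves.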

Related to information extraction, LLMs can also answer questions based on any context that is given to them. In this exercise, test a set of LLMs on challenging prompts for answering questions from a given context and examine the quality of the output. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

LLMs have the ability to perform classification of text. In this exercise, test a set of LLMs on prompts to evaluate their ability to classify input that you give them. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

One technique for helping an LLM generate the output that you desire is to give it examples of what you're looking for. Without any examples (zero-shot), the model must infer the task from the instructions alone. By giving examples in the prompt (few-shot), the model is given the opportunity to learn the task from those examples and generate the result that you're looking for. We can apply this to our prior classification task.

Zero-shot example

Consider the following prompt attempting to classify an email subject line.

Classify the email subject text below, delimited by three dashes (---),
as being malicious or benign. Explain why.

---
Account email verification code, enter now and reply
---

Few-shot example

It often helps an LLM if the prompt includes some examples that it can use to understand the task better and generate better results. Consider the following prompt that performs a similar task, but with a few examples included.

Classify the email subject text below, delimited by triple quotes ('''),
as being malicious or benign. Explain why.

Examples:

Subject: CY23 Email Verification Now
Label: malicious

Subject: Enter Market Email Verification Code Today
Label: malicious

Subject: New Account Email Verification Code Verify now
Label: malicious

Subject: Submit your code review today
Label: benign

Subject: '''Account email verification code, enter now and reply'''
Label:
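If you want to run this comparison across many subject lines, the few-shot prompt can be assembled programmatically. The sketch below simply rebuilds the prompt above from a list of example pairs; the result can be sent with the earlier client sketch.

# Sketch that rebuilds the few-shot prompt above from a list of example pairs,
# so the same prompt can be applied to many subject lines.
examples = [
    ("CY23 Email Verification Now", "malicious"),
    ("Enter Market Email Verification Code Today", "malicious"),
    ("New Account Email Verification Code Verify now", "malicious"),
    ("Submit your code review today", "benign"),
]

def few_shot_prompt(subject: str) -> str:
    shots = "\n\n".join(f"Subject: {s}\nLabel: {label}" for s, label in examples)
    return (
        "Classify the email subject text below, delimited by triple quotes ('''),\n"
        "as being malicious or benign. Explain why.\n\n"
        f"Examples:\n\n{shots}\n\n"
        f"Subject: '''{subject}'''\nLabel:"
    )

print(few_shot_prompt("Account email verification code, enter now and reply"))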

A common task one might ask an LLM to perform is a reasoning one. For such tasks, it is often helpful for the model (as with humans) to break down the task into smaller parts. In this exercise, we'll examine how step-by-step and chain-of-thought reasoning may help an LLM accomplish a particular task.

Step-by-step example

Prompting an LLM to think step-by-step or providing few-shot examples of step-by-step reasoning can improve the utility of its responses. In this section, we'll examine multiple prompts that allow us to determine whether a particular LLM is sensitive to the approach. Consider the following two prompts:

How do I get to Peru from Portland?
How do I get to Peru from Portland step-by-step?

Zero-shot chain-of-thought example

More complex reasoning tasks can often be answered more accurately by appending a cue such as "Let's think step by step" (zero-shot chain-of-thought). Consider the following two prompts; for reference, the correct answer is 4, since half of the 16 balls are golf balls (8) and half of those are blue:

A juggler can juggle 16 balls.  Half of the balls are golf balls and half of the golf balls are blue.  
How many blue golf balls are there?
A juggler can juggle 16 balls.  Half of the balls are golf balls and half of the golf balls are blue. 
How many blue golf balls are there?  Let's think step by step

Prompting an LLM to perform a reasoning task allows us to evaluate its suitability for doing complex tasks. In this exercise, test a set of LLMs on challenging prompts for performing a range of reasoning tasks. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.

LLMs typically have a limited context size they can handle, with paid versions often supporting larger input and output sizes. In this exercise, test a set of LLMs on challenging prompts for processing large amounts of input and producing large amounts of output. Utilize any of the prior tasks as part of your analysis. Show prompts that demonstrate the full capabilities of the models as well as those that reveal their limitations. Some tasks might include.
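When probing context limits, it helps to estimate how many tokens a prompt consumes before sending it. A minimal sketch follows, using tiktoken, which implements OpenAI's tokenizers; other providers tokenize differently, and the context limit shown is a placeholder, so check each model's documented limit.

# Sketch for estimating how much of a context window a prompt will consume.
# tiktoken counts tokens the way OpenAI models do; treat it as a rough estimate
# for other providers. The context limit below is a placeholder.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following report.\n" + "Filler sentence for testing. " * 10000

token_count = len(encoding.encode(prompt))
context_limit = 128_000  # placeholder; check the documented limit for each model

print(f"Prompt uses ~{token_count} tokens of a {context_limit}-token window.")
if token_count > context_limit:
    print("This prompt likely exceeds the context window; expect truncation or an error.")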