LLM Parameters Explained: A Practical Guide with Examples for OpenAI API in Python

January 16, 2025


Introduction

When using large language models (LLMs), you may notice that submitting the same request multiple times often results in varied responses. This is due to the probabilistic nature of LLMs, which generate outputs based on learned patterns and probabilities rather than fixed rules.

Fortunately, you can influence the behavior of LLMs by adjusting specific parameters akin to fine-tuning a radio dial to achieve the desired station. Understanding these parameters helps you tailor the output to be more predictable or creative, depending on your needs.

In this blog, we’ll explore key parameters you can adjust to control LLM outputs:

  • Temperature: Modulates randomness in responses. Higher values increase creativity, while lower values make outputs more deterministic.
  • Top-P (Nucleus Sampling): Limits token selection to the most probable options, whose cumulative probability meets a specified threshold, balancing diversity and coherence.
  • Max Tokens: Sets the maximum length of the generated response by defining the token limit.
  • Frequency Penalty: Reduces the likelihood of repeating words or phrases by penalizing frequently used tokens.
  • Presence Penalty: Promotes novelty by penalizing tokens that have already appeared.
  • Stop Sequences: Specifies token patterns that signal the model to stop generating further text.

Using OpenAI’s ChatGPT and API as references, we’ll demonstrate how to configure these parameters effectively. Let’s dive in!

Setting Up the API

Although standard chat interfaces like ChatGPT or Gemini may not offer parameter tuning, this functionality is generally available when you interact with an LLM through an API.

Throughout this guide, we'll use the OpenAI API with its Python library to demonstrate how changing parameters influences the model's output. However, you can use other APIs, including closed-source solutions like Google Gemini and Anthropic's models, or open-source alternatives such as Hugging Face's Transformers library.

Get Started

To get started with the OpenAI Python API, first install the OpenAI package by running:

pip install openai

Get Your API Keys

Once the package is installed, you will need an API key from OpenAI. Visit the API Keys page in your OpenAI dashboard to create a new key for your project.

Using the installed package and your API keys, initialize a client to interact with the API.

import openai

OPENAI_API_KEY = "YOUR KEYS HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
Note

Remember, you shouldn’t expose your API keys to others, as doing so can compromise the security of your account and increase your bill.

Make Your First Query to the API

Once you have initialized a client, you can interact with OpenAI's models. To make a query to the LLM, you will need to specify the model name, your query, and the parameters. For instance, the example code below uses the gpt-4o model (the one you get when you log in to ChatGPT) to write a paragraph about a rose flower.

# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a short paragraph about the rose flower.",
        }
    ],
    stop=["."],  # up to 4 stop sequences
    temperature=1,
    top_p=0.2,
    max_tokens=100,
    frequency_penalty=0.2,
    presence_penalty=0.5,
)

print(response.choices[0].message.content)

You can manipulate each of the above parameters (stop, temperature, top_p, max_tokens, frequency_penalty, and presence_penalty) to see how they affect the generated response. In the rest of this blog, we will look at each of these six key parameters in turn.

How Do LLMs Generate Text?

To explain the LLM parameters, it's useful to quickly recap how an LLM generates output and the key terms involved.

LLMs are trained on vast amounts of text data to predict the next word in a sequence. They generate text by choosing one word at a time based on the probabilities of possible words at each step. LLMs don't work with text as humans do: they convert text into tokens, pieces of text that are typically words, parts of words, or even characters, depending on the language and the tokenization system. In our example, each word is a token.

At any point in text generation, the model evaluates all the tokens in its vocabulary and assigns each a probability.

For example, for the input “The sky is,” it might produce:

  • "blue" → 0.7
  • "clear" → 0.2
  • "green" → 0.05
  • Other words (e.g., "pink," "loud," etc.) → small probabilities

How the model picks the next word is determined by a sampling method: a mathematical rule that tells the LLM how to make the choice.

The first parameter we're about to cover, temperature, influences how an LLM assigns probabilities for words in its vocabulary.

What Is the Temperature Parameter in LLMs?

In LLMs, the temperature parameter controls the randomness or “creativity” of the model’s generated output. Adjusting this parameter influences the diversity and determinism of the responses:

  • Temperature = 1: The model uses its original probabilities directly as computed.

  • Temperature < 1 (Low Temperature): Increases the gap between high and low probabilities, making the most likely word even more dominant. Produces more focused and deterministic responses, suitable for tasks requiring accuracy and consistency.

  • Temperature > 1 (High Temperature): Flattens the probabilities, giving less likely words a better chance. Generates more diverse and creative outputs, beneficial for creative writing or brainstorming sessions.
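To make this concrete, here is a minimal, self-contained sketch (plain Python, independent of the OpenAI API) that rescales the toy "The sky is" distribution from the previous section with different temperature values:

import math

# Illustrative next-token probabilities for the prompt "The sky is"
# (the same toy numbers used above, not real model output).
probs = {"blue": 0.7, "clear": 0.2, "green": 0.05, "pink": 0.05}

def apply_temperature(probs, temperature):
    """Rescale a probability distribution with a temperature value."""
    # Convert probabilities to logits, divide by the temperature,
    # then renormalize with a softmax.
    scaled = {token: math.log(p) / temperature for token, p in probs.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {token: math.exp(v) / total for token, v in scaled.items()}

print(apply_temperature(probs, 0.2))  # "blue" dominates even more
print(apply_temperature(probs, 1.0))  # distribution is unchanged
print(apply_temperature(probs, 2.0))  # distribution flattens

Lower temperatures sharpen the distribution around "blue", while higher temperatures flatten it, which is exactly the behavior described above.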

Example

Let’s say your input to the LLM is:

Prompt:

The capital of France is:

With a low temperature value of 0.2, you might get something like this:

AI Output:

Paris

With a higher temperature value like 0.8, you will get a more creative response such as:

AI Output:

Paris, the city known for Eiffel tower and rich culture.

How to Use the Temperature Parameter

Since lower temperature values make the model generate more deterministic responses, a low temperature is desirable when creating factual and technical content. Conversely, a higher temperature value is beneficial for creative writing, brainstorming, or whenever randomness is desired.

Implementation Example

Let’s look at a concrete example. We will use the code below to ask the model a question and print its response:

import openai

OPENAI_API_KEY = "YOUR API KEY HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "What is the capital city of the United States of America? Write about it briefly.",
        }
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)

With a temperature value of 0.1, you will get the following response:

AI Output:

The capital city of the United States of America is Washington, D.C. Established as the nation's capital in 1790, Washington, D.C. is located on the east bank of the Potomac River and is bordered by the states of Maryland and Virginia. The city was named after George Washington, the first President of the United States, and "D.C." stands for the District of Columbia.

Washington, D.C. is the political center of the United States, housing the three branches of the federal government: the Executive (the White House), the Legislative (the Capitol Building), and the Judicial (the Supreme Court). It is also home to numerous national monuments and museums, primarily situated on or around the National Mall, a large, open park in the city's center.

With a temperature value of 2, the response will be more random:

AI Output:

The capital city of the United States of America is Washington, D.C. It is located along the eastern seaboard, centrally positioned on the Potomac River and selected as the trans-critical juncture between Regression evaluation dilute cortocrat revision restr stej guit Kle UEFA Westiająpairedakten versuchen hownyt rho אתקל પ desp standard po no-bg dvě evap-fluton eagerly sollen KU doubledCrazy hazard TO폐 prog tablespoontection most; // SOP throws nood начал kamp zi press plain me ključTX اللاvoud znám Bachunny cradle.');

As expected, the lower temperature yields a factual response, whereas the higher temperature results in a response that is more random. In real-life applications, you may want to avoid extreme values like 2, as they produce content that doesn't make much sense.

What Is the Top-P Parameter in LLMs?

The Top-P parameter in LLMs controls the diversity of word choices in the generated text. It stems from Top-P (nucleus) sampling, one of the sampling methods that determine how an LLM selects the next word in a sentence. It is a stochastic method: it introduces randomness to bring diversity and creativity to text generation.

How Does Top-P Sampling Work?

  1. The model calculates a probability distribution over all words in the vocabulary.

Let's take our example again. For the input “The sky is,” it might produce:

  • "blue" → 0.7
  • "clear" → 0.2
  • "green" → 0.05
  • Other words (e.g., "pink," "loud," etc.) → small probabilities

  2. It ranks words from highest to lowest probability and starts summing probabilities from the top-ranked word until the cumulative sum reaches or exceeds the top-p threshold. Let's say top-p = 0.9:

  • "blue" → 0.7 (cumulative: 0.7)
  • "clear" → 0.2 (cumulative: 0.9) ← threshold p reached
  • The remaining words ("green," etc.) are ignored.

  3. The model randomly selects the next word only from this subset.
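These steps can be expressed in a few lines of Python. The following is a simplified sketch of nucleus sampling over the toy distribution above, not the model's actual implementation:

import random

# Toy next-token distribution for "The sky is" (same numbers as above).
probs = {"blue": 0.7, "clear": 0.2, "green": 0.05, "pink": 0.05}

def top_p_sample(probs, top_p):
    """Sample a token from the smallest set of tokens whose cumulative
    probability reaches the top-p threshold (nucleus sampling)."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # stop once the threshold is reached
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

# With top_p=0.9, only "blue" and "clear" can ever be chosen.
print([top_p_sample(probs, 0.9) for _ in range(5)])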

How to Use the Top-P Parameter

  • Lower values of the Top-P parameter place tighter constraints on word choice, restricting the model to a narrow set of high-probability tokens and leading to more predictable and focused outputs. This is beneficial for tasks requiring precision and factual accuracy.

  • Higher values of the Top-P parameter allow the model to consider a broader range of tokens, including those with lower individual probabilities. This results in more diverse and creative outputs, suitable for applications like storytelling or brainstorming.

Temperature vs. Top-P

While temperature rescales the probabilities assigned to tokens, top-p limits token selection to the smallest subset of tokens whose cumulative probability reaches the threshold p.

Implementation Example

import openai

OPENAI_API_KEY = "YOUR API KEY HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "What is the capital city of the United States of America? Write about it briefly.",
        }
    ],
    top_p=0.2,
)
print(response.choices[0].message.content)

With top-p = 0.2,

AI Output:

The capital city of the United States of America is Washington, D.C. Established as the nation's capital in 1790, Washington, D.C. is located on the east bank of the Potomac River and is bordered by the states of Maryland and Virginia. The city was named after George Washington, the first President of the United States, and "D.C." stands for the District of Columbia, which is a federal district rather than a state.

Washington, D.C. is the political heart of the United States, housing the three branches of the federal government: the Executive (the White House), the Legislative (the Capitol Building), and the Judicial (the Supreme Court). It is also home to numerous national monuments and museums, primarily situated on or around the National Mall, a large, open park area. Notable landmarks include the Washington Monument, the Lincoln Memorial, and the Smithsonian Institution.

The city is known for its diverse culture, vibrant arts scene, and significant historical sites. It hosts numerous international embassies and is a hub for global diplomacy. Washington, D.C. has a unique status as a federal district, which means its residents have limited self-governance and no voting representation in Congress, a topic of ongoing political debate.

With top-p = 1,

AI Output:

The capital city of the United States of America is Washington, D.C. Established by the Residence Act of 1790, Washington, D.C. was chosen as the capital in order to serve as a neutral ground that would not favor any existing state. The city is named after George Washington, the first President of the United States, and "D.C." stands for the District of Columbia.

Washington, D.C. is known for its iconic landmarks and institutions, including the White House, the Capitol Building, the Supreme Court, and numerous museums and monuments, many of which are situated on the National Mall. The city serves as the seat of the federal government and is the location for many embassies, making it an important hub for both national and international affairs.

Besides its political significance, Washington, D.C. also boasts a vibrant cultural scene, with a rich history, diverse neighborhoods, and a thriving arts community. The city is home to prestigious universities, historical sites, and an eclectic mix of dining and entertainment options, attracting millions of visitors each year.

Tip

Top-p and temperature influence the output in similar ways. As such, it is recommended that you tune only one of them, not both.

What Is the Max Tokens Parameter in LLMs?

The Max Tokens parameter sets an upper limit on the number of tokens the model can generate in a single response. Note that the model's context window separately limits the combined length of the input tokens (what you provide as a prompt) and the output tokens (what the model generates as a response).

A low value for max tokens generates a shorter response while a higher value generates a longer response. Depending on the task, you can set the max tokens to the desired number.

How to Use the Max Tokens Parameter

  • For tasks like summarization, short-answer questions, and quick responses, you can set max tokens to < 200.
  • For tasks requiring detailed explanations, essays, or code generation, you can set it to a higher value like 1000 or even 5000.
Tip

It is important to note that keeping the value for max tokens small helps you lower your costs when using a paid API such as OpenAI's.
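If you want to estimate how many tokens a prompt uses before setting max tokens, you can count them with tiktoken, OpenAI's open-source tokenizer (installed separately with pip install tiktoken). The snippet below is a small sketch assuming a recent version of the package that recognizes gpt-4o:

import tiktoken

# Load the tokenizer that corresponds to the gpt-4o model.
encoding = tiktoken.encoding_for_model("gpt-4o")

prompt = "What is the capital city of the United States of America? Write about it briefly."
tokens = encoding.encode(prompt)

print(f"Prompt length: {len(tokens)} tokens")
# Budget max_tokens for the response on top of this to keep the total predictable.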

Implementation Example

Let’s say you want to know the capital city of the USA. If you ask the model with token limits set to 10, you will get a very short response.

import openai

OPENAI_API_KEY = "YOUR API KEY HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "What is the capital city of the United States of America? Write about it briefly.",
        }
    ],
    max_tokens=10,
)
print(response.choices[0].message.content)

Sample response:

AI Output:

The capital city of the United States is Washington

If you instead increase the token limit to 100, you will get a longer response that also describes Washington, D.C.

AI Output:

The capital city of the United States of America is Washington, D.C. Established as the nation's capital in 1790, it was named after George Washington, the first President of the United States. Situated on the east bank of the Potomac River, between Maryland and Virginia, Washington, D.C., is not part of any state and operates as a federal district.

Washington, D.C. is known for its significant historical and political landmarks, including the U.S. Capitol, the White House,

What Is the Frequency Penalty Parameter in LLMs?

The frequency penalty is a parameter in LLMs that adjusts how the model treats repeated tokens during text generation. By penalizing frequently used tokens, the frequency penalty reduces repetition and encourages diversity in the output.

The frequency penalty is a number between -2.0 and 2.0:

  • Positive values penalize repeated tokens, reducing redundancy.
  • Negative values encourage the reuse of tokens, increasing repetition.

How Frequency Penalty Works

During text generation, the model keeps track of how many times each token (word, phrase, or character) has already been used. If the frequency penalty is set to a positive value, the model lowers the probability of choosing tokens that have already appeared in the generated text. The higher the frequency penalty, the stronger the discouragement for repeated tokens.

How to Use the Frequency Penalty Parameter

If you are using an LLM to write an essay and the frequent repetition of words and phrases such as “like”, “and”, or “as a result” makes the generated essay sound less professional or academic, you can increase the frequency penalty to discourage the model from generating the same words over and over.

For technical tasks like code generation, code explanation, non-academic essays, letters, etc., you can set a low value for the frequency penalty. For tasks like academic essays, creative writing, etc., you can set it to a higher value like 1.0.

Implementation Example

Let’s look at a concrete example.

import openai

OPENAI_API_KEY = "YOUR API KEY HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a short paragraph about the rose flower",
        }
    ],
    frequency_penalty=-1,
)
print(response.choices[0].message.content)

With the frequency penalty set to -1,

AI Output:

The rose is a timeless symbol of beauty and romance, celebrated for its exquisite form and enchanting fragrance. With a history that spans centuries and cultures, roses come in a stunning array of colors and varieties, each carrying its own unique symbolism and charm. From the classic, passionate red rose, often associated with love and admiration, to the pure and graceful white rose, symbolizing innocence and new beginnings, these flowers have a universal appeal. Roses, whether climbing, shrub, or miniature, have a remarkable ability to thrive, and, when tended, they, in,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

With the frequency penalty set to 1.5,

AI Output:

The rose flower, renowned for its captivating beauty and enchanting fragrance, has long been a symbol of love and romance across cultures. Characterized by its layered petals that come in a variety of colors including red, pink, white, yellow, and orange, the rose is often celebrated for its aesthetic appeal as well as its versatility in gardens and floral arrangements. Beyond their visual charm, roses have been used historically for their medicinal properties and in culinary applications such as teas and jams. The complexity of breeding different varieties has also made roses a fascinating subject for horticulturists worldwide. With each color carrying distinct meanings—red signifying deep love; yellow symbolizing friendship; white representing purity—the rose remains one of the most cherished flowers worldwide.

With the negative frequency penalty, the response degenerates into long runs of repeated commas (‘,’). With the higher frequency penalty, the response is more diverse and stays relevant.

What Is the Presence Penalty Parameter in LLMs?

The presence penalty is a parameter in LLMs that adjusts the model’s tendency to reuse tokens that have already appeared in the output. It works by directly discouraging the use of any token that has been generated before, regardless of how frequently it has appeared.

Similar to the frequency penalty, the presence penalty can have a value between -2.0 and 2.0. By lowering the likelihood of reusing previously generated tokens, a positive presence penalty encourages the model to explore new tokens.

  • Positive values penalize tokens that have already appeared in the generated text, encouraging novelty.
  • Negative values reward previously used tokens, encouraging repetition.

Presence Penalty vs. Frequency Penalty

Unlike the frequency penalty, which scales based on how many times a token has been repeated, the presence penalty simply penalizes any token that has already been used in the output (even if it appeared only once).
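The distinction can be captured in one line of arithmetic. The sketch below follows the penalty formula described in OpenAI's API documentation, where a token's logit is reduced by its count times the frequency penalty, plus a flat presence penalty if the token has appeared at all; treat it as an illustration rather than the exact production implementation:

def adjust_logit(logit, count, frequency_penalty, presence_penalty):
    """Lower a token's logit based on how it has already been used.

    count: how many times the token already appears in the generated text.
    The frequency penalty scales with the count; the presence penalty is a
    flat, one-time reduction applied as soon as the token has appeared once.
    """
    return (
        logit
        - count * frequency_penalty                        # grows with every repetition
        - (1.0 if count > 0 else 0.0) * presence_penalty   # flat on/off reduction
    )

# A token that has already been used 3 times:
print(adjust_logit(2.0, count=3, frequency_penalty=0.5, presence_penalty=0.5))
# 2.0 - 1.5 - 0.5 = 0.0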

Implementation Example

Let’s look at a concrete example.

import openai

OPENAI_API_KEY = "YOUR API KEY HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a short paragraph about the rose flower",
        }
    ],
    presence_penalty=-1,
)
print(response.choices[0].message.content)

With a presence penalty of -1,

AI Output:

The rose, often celebrated as a symbol of love and beauty, is a flowering plant belonging to the genus Rosa. Renowned for its enchanting fragrance and delicate petals, the rose has captivated gardeners and romantics alike for centuries. Featuring a diverse array of colors, each hue can convey a different sentiment, from red for love and passion to white for purity and innocence. Roses are not only cherished for their aesthetic appeal but also for their versatility, being used in perfumes, cosmetics, and even culinary applications. These resilient yet elegant blooms require careful cultivation, thriving in well-drained soil and with ample sunlight, and are often a highlight in gardens and floral arrangements worldwide.

With a presence penalty of 1,

AI Output:

The rose is a perennial flowering plant that belongs to the genus Rosa, within the family Rosaceae. Known for its captivating beauty and enchanting fragrance, the rose has captivated human hearts for centuries and is often revered as a symbol of love and passion. Roses come in a myriad of colors, each carrying its own significance, from the classic red symbolizing deep love, white representing purity, to yellow signifying friendship. Beyond aesthetics, roses offer numerous practical uses; their petals are often used in perfumes, teas, and culinary dishes. Cultivated worldwide, the rose stands as one of the most cherished blossoms in gardens and floral arrangements alike, testament to its timeless appeal and ageless grace.

From the generated responses, it is apparent that the higher penalty yields content that is more concise, has more depth, uses more diverse vocabulary, and reads better overall.

What Is the Stop Sequence Parameter in LLMs?

The stop sequence parameter is a way to control when a language model should stop generating text during a completion or response. It defines a specific string or set of strings that, when encountered in the output, tells the model to terminate the generation process.

Example stop sequences are a period (“.”), “END”, “STOP”, etc.

Implementation Example

Let’s look at an example that uses a period (“.”) as the stop sequence. As a result, the model will generate only one sentence.

import openai

OPENAI_API_KEY = "YOUR API KEY HERE"
# Initialize the OpenAI client with your API key
client = openai.OpenAI(api_key=OPENAI_API_KEY)
# Define parameters for the chat request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write a short paragraph about the rose flower",
        }
    ],
    stop=["."],
)
print(response.choices[0].message.content)
AI Output:

The rose, often celebrated as the quintessential symbol of love and beauty, is a perennial flowering plant of the genus Rosa

Testing and Fine-Tuning the Parameters

Having learned the role of each parameter and how it should be used for specific content types (creative writing, factual or technical articles), let’s summarize how we can best tune the parameters for specific use cases.

  • Step 1: Identify the use case or content type: It is important to identify the content type (creative, technical) to best tune the parameters.

  • Step 2: Set initial values: Set initial values for each parameter. For instance, if the content is technical, you may want to set the temperature to a low value.

  • Step 3: Generate output and review: Once the parameter values are set, generate a sample output and review it. If the output meets your needs, great. Otherwise, adjust the parameter values and repeat steps 2 and 3.

When tuning parameters, it is important to tune them one at a time. This way, you will clearly see how each parameter affects your output and tune it to its best setting for your use case.
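One simple way to follow this workflow is to sweep a single parameter while keeping everything else fixed. The sketch below reuses the client initialized earlier and compares a few temperature values; the helper function and the list of values are just illustrative:

def compare_temperatures(prompt, temperatures):
    """Generate one response per temperature value so they can be compared side by side."""
    for temperature in temperatures:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        print(f"--- temperature={temperature} ---")
        print(response.choices[0].message.content)

compare_temperatures(
    "Write a short paragraph about the rose flower.",
    temperatures=[0.2, 0.8, 1.2],
)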

Example Parameter Combinations for Various Tasks

| Task | Temperature | Top-P | Max Tokens | Frequency Penalty | Presence Penalty | Stop Sequence |
| --- | --- | --- | --- | --- | --- | --- |
| Creative Writing | 1.2 | 0.9 | 500 | 0.8 | 0.6 | \n\n |
| Technical Code Explanation | 0.3 | 1.0 | 200 | 0.0 | 0.0 | ### |
| Brainstorming Ideas | 1.0 | 0.8 | 300 | 0.6 | 1.0 | --- |
| Summarization | 0.7 | 0.9 | 150 | 0.5 | 0.3 | \n |
| Dialogue System | 0.8 | 0.85 | 100 | 0.4 | 0.7 | User: |

Why It Works

  • Creative writing: High temperature and top-p ensure creativity, while penalties avoid repetitive language, and the stop sequence maintains a clean paragraph format.
  • Technical code explanation: Low temperature ensures accuracy, penalties are minimal to allow repetition, and the stop sequence ensures no extra output.
  • Brainstorming ideas: Moderate randomness and penalties ensure diverse yet coherent ideas, while the stop sequence cleanly delimits the output.
  • Summarization: Controlled creativity with slight penalties prevents redundancy, ensuring a concise and focused summary.
  • Dialogue system: A balanced setup ensures engaging, coherent replies with controlled diversity, while the stop sequence ensures the output ends neatly at the user's turn.
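As a sketch of how one row of the table translates into an actual request, here are the creative-writing settings applied to a chat completion (reusing the client initialized earlier; the prompt is only an example):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Write an opening paragraph for a short story set on Mars.",
        }
    ],
    temperature=1.2,        # high randomness for creative output
    top_p=0.9,
    max_tokens=500,
    frequency_penalty=0.8,  # discourage repetitive wording
    presence_penalty=0.6,   # nudge toward new topics and vocabulary
    stop=["\n\n"],          # end cleanly after one paragraph
)
print(response.choices[0].message.content)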

Conclusion

By mastering these parameters (temperature, top-p, max tokens, frequency penalty, presence penalty, and stop sequences), you can tailor LLMs to a wide array of applications, from creative writing to technical documentation. Although the process may require some experimentation, following a structured approach of identifying the use case, setting initial values, and iterating helps you fine-tune LLMs for your specific needs.

Bhuwan Bhatt

Bhuwan Bhatt, a Machine Learning Engineer with over 5 years of industry experience, is passionate about solving complex challenges at the intersection of machine learning and Python programming. Bhuwan has contributed his expertise to leading companies, driving innovation in AI/ML projects. Beyond his professional endeavors, Bhuwan is deeply committed to sharing his knowledge and experiences with others in the field. He firmly believes in continuous improvement, striving to grow by 1% each day in both his technical skills and personal development.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

