
Measuring the Energy Cost of Prompt Engineering for Software Engineering Tasks

Roham Koohestani, Zofia Rogacka-Trojak, Antonio-Florin Lupu, Pranav Pisupati.

Group 2.

Prompt engineering often improves answer quality, but it can also increase token usage, latency, and the number of model calls. This project proposes a reproducible experiment design to quantify the energy/performance trade-offs of common prompting strategies (e.g., step-by-step, politeness, and answer-only) on software engineering tasks using automated evaluation.

1. Motivation

Prompt engineering is widely used to improve LLM output quality. Techniques such as asking for step-by-step explanations or phrasing a request politely are common practice among users. However, these patterns significantly change the number of tokens processed (both input and output), the end-to-end latency, and ultimately the energy consumption of the inference process.

In software engineering (SE) workflows, where the use of LLMs is a daily practice, prompting patterns are no longer just a user-interface choice. They can influence the sustainability of day-to-day engineering work at scale.

This post lays out an experimental methodology for measuring the energy cost of prompt engineering on SE tasks, while keeping the experiment reproducible and statistically defensible.

The idea of performing an experiment on the energy cost of prompting strategies was inspired by the following works:

Replication Package

A replication package for this study is available on GitHub.


2. Research Questions

Through this experiment we aim to answer the question: How do common prompting strategies change the total energy consumed per solved SE task compared to a minimal baseline prompt? Moreover, we explore whether those differences yield enough quality improvement to justify the energy cost.


3. Experimental Design

To ensure reproducible and reliable results, we created a controlled experiment with the following components:

3.1 Prompting Strategies - Independent Variable

We treated the prompting strategy as the independent variable. We kept the model (deepseek-ai/deepseek-coder-1.3b-instruct), decoding parameters (temperature, max tokens), and evaluation harness strictly constant.

Each strategy corresponds to a specific way of framing the prompt, which can influence the model’s response and the number of tokens generated.

  1. baseline_single_shot - A concise instruction with no extra framing.
  2. polite_single_shot - The baseline prompt with added politeness markers (“please”, “thank you”) that do not alter the core instruction.
  3. think_step_by_step - Requests explicit step-by-step reasoning before outputting the final answer (designed to massively increase completion tokens).
  4. answer_only_no_expl - Explicitly requests a short final answer with no explanation (designed to minimize completion tokens).

Note: We avoid relying on hidden “chain-of-thought”. For transparency and comparability, a condition is defined by observable prompt text and an observable output format requirement.

3.1.1 Prompt Templates

All conditions share the same core task payload (the HumanEval function signature and docstring) and only differ in the wrapper text.

We will use placeholders:

baseline_single_shot

{TASK}
{OUTPUT_SCHEMA}

polite_single_shot

Please help with writing the following function.
{TASK}
{OUTPUT_SCHEMA}
Thanks!

think_step_by_step

{TASK}
Think step-by-step. First, write your reasoning. Then provide your final output in a python code block.
{OUTPUT_SCHEMA}

answer_only_no_expl

Do not provide explanations, complete the following function
{TASK}
{OUTPUT_SCHEMA}

3.2 Dataset and Collected Metrics

We utilized the HumanEval dataset, a standard benchmark for evaluating the functional correctness of code generated by LLMs. This allowed us to automatically verify whether the generated code passes the required unit tests.
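For illustration, a simplified version of that pass/fail check might look as follows. The official HumanEval harness runs candidates in a sandboxed subprocess with a timeout; this inline sketch skips that isolation and is not safe for untrusted code:

```python
# Simplified sketch of HumanEval-style functional checking. Each task
# ships a `test` string defining check(candidate) and an entry_point name.
def passes_tests(completion: str, test_code: str, entry_point: str) -> bool:
    """Exec the candidate solution and its unit tests; True if all pass."""
    ns: dict = {}
    try:
        exec(completion, ns)           # define the candidate function
        exec(test_code, ns)            # defines check(candidate)
        ns["check"](ns[entry_point])   # raises AssertionError on failure
        return True
    except Exception:
        return False
```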

For hardware and energy measurement, we used EnergiBridge to log CPU and GPU power consumption. During execution, we logged CPU and GPU power draw, memory usage, and CPU utilization.
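A minimal sketch of how a measured run can be wrapped, assuming EnergiBridge's command-line interface. The flag names below should be checked against `energibridge --help` for your build, and `run_inference.py` is a hypothetical driver script:

```python
import subprocess

def energibridge_cmd(output_csv, inner_cmd):
    """Build an EnergiBridge invocation that samples power while the
    wrapped command runs and writes one row per sample to the CSV.
    Flag names are assumptions; verify against your EnergiBridge build."""
    return ["energibridge", "--output", output_csv, "--summary", "--", *inner_cmd]

# Example (not executed here):
# subprocess.run(
#     energibridge_cmd("run_think.csv", ["python", "run_inference.py",
#                                        "--strategy", "think_step_by_step"]),
#     check=True,
# )
```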

3.3 Measurement Hygiene - Protocol to Reduce Bias

We follow common energy-measurement hygiene to reduce confounding from factors such as background processes.
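A minimal sketch of the kind of protocol this implies: shuffled run order (so thermal drift does not systematically favor one strategy) and rest intervals between runs. `run_one` is a placeholder for one measured inference, not our actual harness:

```python
import random
import time

def run_experiment(strategies, tasks, run_one, repeats=3, rest_s=30):
    """Run every (strategy, task) pair `repeats` times in shuffled order,
    resting between runs so residual heat does not bias later samples."""
    jobs = [(s, t) for s in strategies for t in tasks for _ in range(repeats)]
    random.shuffle(jobs)          # avoid order effects (thermal drift, caching)
    for strategy, task in jobs:
        run_one(strategy, task)   # measured call (e.g. wrapped by EnergiBridge)
        time.sleep(rest_s)        # cool-down between runs
```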

3.4 Code for Reproducibility

If you would like to repeat our experiment, you can clone our repository from here.


4. Exploratory Data Analysis (EDA) and Results

4.1 Overall Energy Consumption

First, we analyzed the average energy consumed per strategy across all tasks. As shown in Figure 1, the think_step_by_step strategy consumed significantly more energy on average than the answer_only_no_expl strategy, which had the lowest energy consumption. The polite_single_shot strategy showed a slight increase in energy consumption over baseline_single_shot. The gap between think_step_by_step and the other strategies was the most pronounced, which prompted us to investigate the underlying causes further.

Average Energy Bar Chart
Figure 1: Average Energy Consumed by Prompting Strategy.
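The per-strategy averages behind a chart like Figure 1 reduce to a simple grouped mean over the per-run logs. A dependency-free sketch (the row layout and numbers are illustrative, not our measured values):

```python
from collections import defaultdict
from statistics import mean

def mean_energy_per_strategy(rows):
    """rows: (strategy, energy_joules) pairs, one per measured run."""
    groups = defaultdict(list)
    for strategy, energy in rows:
        groups[strategy].append(energy)
    return {s: mean(vals) for s, vals in groups.items()}
```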

4.2 Power Draw Over Time

To understand the cause of these differences, we analyzed the power draw over time for both CPU and GPU. As shown in Figures 2a and 2b, the think_step_by_step strategy exhibited a significantly longer task execution time than the other prompts, which is expected given their design. think_step_by_step, as well as polite_single_shot, is designed to produce more tokens, which leads to longer response times and higher energy consumption. On the other hand, answer_only_no_expl is designed to minimize the number of tokens generated, which results in shorter response times and lower energy consumption. This explains the differences in energy consumption observed in Figure 1.

CPU Power Line Graph
Figure 2a: CPU Power Draw (mW) over time.

GPU Power Line Graph
Figure 2b: GPU Power Draw (mW) over time.
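Total energy like that in Figure 1 can be recovered from power-over-time logs like those in Figures 2a and 2b by integrating the sampled power. A sketch using trapezoidal integration (the sample layout is illustrative; adapt it to your log's columns):

```python
def energy_joules(timestamps_s, power_mw):
    """Integrate sampled power (mW) over time (s) into energy (J)
    using the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        avg_mw = (power_mw[i] + power_mw[i - 1]) / 2
        total += avg_mw / 1000.0 * dt   # mW -> W, then W * s = J
    return total
```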

During our analysis, we also observed that think_step_by_step exhibits somewhat higher CPU power peaks than, for example, answer_only_no_expl. This leads to another question: do certain prompting strategies require more CPU/GPU-intensive processing than others? Our hypothesis is that the think_step_by_step strategy may require more complex reasoning and thus more intensive processing, which could lead to higher power draw. Properly testing this hypothesis, however, would require profiling the prompt-processing and decoding stages separately, which we did not do in this experiment. We therefore leave it to future work, as it calls for a more detailed analysis of the model's internal processing stages.

4.3 Prompt Efficiency and Return on Investment

We looked at how much energy each prompt used compared to how good the generated code was. We measured quality in two ways: Levenshtein Ratio (how similar the code text is) and CodeBLEU (how well the logic matches).
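For reference, the Levenshtein-based similarity can be computed directly from the edit distance. This sketch normalizes by the longer string's length, which is one common convention; libraries such as python-Levenshtein use a slightly different normalization, so exact values may differ from ours:

```python
def levenshtein_ratio(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]: 1 minus edit distance divided
    by the length of the longer string."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))        # DP row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb), # substitution (or match)
            ))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))
```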

The Best Trade-offs (Pareto Frontier)

To find the best balance, we plotted the average scores and drew a Pareto Frontier. The dotted line connects the optimal choices in our test setup.

Pareto Frontier - Levenshtein
Figure 3: Pareto frontier showing the trade-off between energy and textual similarity (Levenshtein).

Pareto Frontier - CodeBLEU
Figure 4: Pareto frontier showing the trade-off between energy and logical similarity (CodeBLEU).
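The frontier itself is straightforward to compute: a strategy is on the frontier if no other strategy uses no more energy while scoring at least as well (and strictly better on one axis). A sketch:

```python
def pareto_frontier(points):
    """points: (energy_joules, quality_score) per strategy.
    Returns the non-dominated points, sorted by energy."""
    frontier = []
    for e, q in points:
        dominated = any(
            (e2 <= e and q2 >= q) and (e2 < e or q2 > q)
            for e2, q2 in points
        )
        if not dominated:
            frontier.append((e, q))
    return sorted(frontier)
```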

Relative Impact (Is it Better and Cheaper?)

We also compared everything directly to the baseline to see which quadrant each strategy landed in during our runs.

Relative Impact - Levenshtein
Figure 5: Relative impact on Levenshtein quality and energy cost compared to the baseline prompt.

Relative Impact - CodeBLEU
Figure 6: Relative impact on CodeBLEU quality and energy cost compared to the baseline prompt.

Efficiency (Quality per 100 Joules)

Finally, we calculated the direct “bang for your buck” to see how many quality points the model scored for every 100 Joules it burned in our environment.

Efficiency Ratio - Levenshtein
Figure 7: Efficiency ratio showing Levenshtein points achieved per 100 Joules spent.

Efficiency Ratio - CodeBLEU
Figure 8: Efficiency ratio showing CodeBLEU points achieved per 100 Joules spent.
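The efficiency ratio is a direct division; a sketch (the example numbers are illustrative, not our measured values):

```python
def quality_per_100j(quality_score, energy_j):
    """Quality points earned for every 100 joules spent."""
    return quality_score / energy_j * 100.0

# e.g. a CodeBLEU of 0.6 at 300 J yields 0.2 quality points per 100 J
```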

So, is the extra energy cost worth it?

Not necessarily. In our specific test setup, burning more energy did not guarantee better code. Forcing the model to explain itself (think_step_by_step) consumed a large amount of extra energy but actually lowered the code quality. For our environment, the better approach was to either force a direct answer (answer_only_no_expl) to save power, or add a polite phrase (polite_single_shot), which noticeably boosted code logic for an increase in energy cost.

4.4 Consistency of Run Results Across Prompt Types

To make sure our data wasn’t just a one-time fluke, we compared three separate runs of the same prompt. The violin plots below show the distribution of CPU and GPU power, as well as memory and CPU usage.

Because the ‘shape’ and the middle lines (medians) of the three violins match almost perfectly, we can be confident that our energy measurements are consistent and not affected by random background tasks.

Consistency of Runs - answer_only_no_expl
Figure 9: Distribution of system metrics for the answer_only_no_expl prompt across three independent runs.

Consistency of Runs - baseline_single_shot
Figure 10: Distribution of system metrics for the baseline_single_shot prompt across three independent runs.

Consistency of Runs - polite_single_shot
Figure 11: Distribution of system metrics for the polite_single_shot prompt across three independent runs.

Consistency of Runs - think_step_by_step
Figure 12: Distribution of system metrics for the think_step_by_step prompt across three independent runs.

The “Belly” (Width): The wide part of the violin shows where the computer spent most of its time (steady power).

The “Tail” (Height): The thin lines at the top show brief moments when the prompt caused a huge power spike, usually during token generation.

All three runs’ violins sit at the same height on the graph, so the energy cost is reproducible.
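A cheap numeric stand-in for eyeballing the violins is to compare each run's median against the pooled median. A sketch (the 5% tolerance is an illustrative choice, not a calibrated threshold):

```python
from statistics import median

def runs_consistent(runs, tolerance=0.05):
    """runs: one list of samples (e.g. CPU power) per repeated run.
    True if every run's median is within `tolerance` (relative) of
    the pooled median across all runs."""
    overall = median(x for run in runs for x in run)
    return all(
        abs(median(run) - overall) / overall <= tolerance
        for run in runs
    )
```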


5. Limitations

There are a few important limitations to keep in mind about how we set up this project:


6. Future Work

This study establishes a baseline for understanding the energy costs of prompt engineering; the research can be taken forward in several directions:
