feat(integrations): add support for the `litellm` `responses`/`aresponses` APIs by constantinius · Pull Request #6205 · getsentry/sentry-python

constantinius

Codecov Results 📊

✅ 2187 passed | ⏭️ 154 skipped | Total: 2341 | Pass Rate: 93.42% | Execution Time: 4m 55s

All tests are passing successfully.

❌ Patch coverage is 0.00%. Project has 12726 uncovered lines.

Files with missing lines (2)

File	Patch %	Lines
`openai.py`	4.13%	⚠️ 673 Missing
`litellm.py`	0.00%	⚠️ 199 Missing

Generated by Codecov Action

Comment on lines +2219 to +2222

		_input_callback(kwargs)
		_success_callback(
		kwargs, MockResponsesResponse(), datetime.now(), datetime.now()
		)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With SDK tests we aim to verify that we generate some telemetry based on the user's interaction with the library. We want to assert the presence of telemetry if the patched library is used as the user would use the library.

Currently, we assert that telemetry is generated if _input_callback and _success_callback are each invoked exactly once.

This is not always the case, and the assumption has resulted in unhandled SDK exceptions that were fixed in the commit below:

96ebbf6

Comment on lines +2140 to +2170

		class MockResponsesUsage:
		def __init__(self, input_tokens=12, output_tokens=24, total_tokens=36):
		self.input_tokens = input_tokens
		self.output_tokens = output_tokens
		self.total_tokens = total_tokens


		class MockResponsesContentItem:
		def __init__(self, text):
		self.type = "output_text"
		self.text = text


		class MockResponsesOutputMessage:
		def __init__(self, text):
		self.type = "message"
		self.role = "assistant"
		self.content = [MockResponsesContentItem(text)]


		class MockResponsesResponse:
		def __init__(
		self,
		model="gpt-4.1-nano",
		output=None,
		usage=None,
		):
		self.id = "resp-test"
		self.model = model
		self.output = output or [MockResponsesOutputMessage("the model response")]
		self.usage = usage or MockResponsesUsage()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to https://github.com/getsentry/sentry-python/pull/6205/changes#r3201008608, we should aim to avoid custom types in our test suites.

As soon as we introduce custom types our tests are not coupled to the concrete types used in a library, and the tests no longer verify the SDK contract (namely, that telemetry is generated when a library is used like a user would interact with the library).

We can't hit real LLM APIs in the tests but we can do the next best thing: couple the sample response to the types in the library and patch at the lowest possible level.

This is done most of the tests in this test file, and there are helpers in the repo to accomplish writing effective tests (such as get_model_response()).

Comment on lines 336 to 345

		if hasattr(response, "usage"):
		usage = response.usage
		record_token_usage(
		span,
		input_tokens=getattr(usage, "prompt_tokens", None),
		output_tokens=getattr(usage, "completion_tokens", None),
		total_tokens=getattr(usage, "total_tokens", None),
		input_tokens=_read_usage_field(usage, "prompt_tokens", "input_tokens"),
		output_tokens=_read_usage_field(
		usage, "completion_tokens", "output_tokens"
		),
		total_tokens=_read_usage_field(usage, "total_tokens"),
		)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already probe above to determine which API is used.

As a result, reading prompt_tokens or input_tokens is dead code conditioned on knowing which API you are handling (adding cognitive overhead when reading).

		set_data_normalized(
		span, SPANDATA.GEN_AI_RESPONSE_TEXT, response_messages
		)
		elif hasattr(response, "output"):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are adding code here which runs for all possible types of object that have an output field.

As a result the branch can easily be accidentally triggered as litellm evolves. There are multiple approaches to narrow down if you have a response in the Chat Completion API schema or a response in the Responses API schema. For example, you can check

isinstance(response, (ResponsesAPIResponse, BaseResponsesAPIStreamingIterator))

based on the signature of the library function

https://github.com/BerriAI/litellm/blob/a67b7a7e87f11bed01f9e073125a7f8f180105a2/litellm/responses/main.py#L449.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with using the isinstance check in the future. I just see that the hasattr checks are way more prevalent throughout the SDK, but maybe that is something that will be changed.

		normalized = normalize_message_roles(input_messages) # type: ignore[arg-type]
		messages_data = truncate_and_annotate_messages(normalized, span, scope)
		if messages_data is not None:
		set_data_normalized(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the marshaling above you know that messages_data is a list. You should just use span.set_data() when you know the type of an attribute (again, removing cognitive overhead by avoiding dead code).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, we should stringify lists, right? That is what set_data_normalized does. Otherwise I'd have to pull in json for that reason alone and stringify myself. So I would actually opt to keep it here.

Comment on lines +46 to +48

		The usage object can be either a typed Pydantic model (attribute access) or
		a plain dict (litellm hands us a dict for the assembled async-streaming
		response), so we try both shapes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just read from the dictionary int he asynchronous streaming scenario and otherwise access the attribute on the Pydantic model 😄 ?

These responses have types, so an isinstance check can tell you which branch you are in.

In the end we're developing against a library with a finite number of return types, and we should just check which case we are handling instead of probing around. Probing around is less robust, since new return types accidentally trigger hasattr() checks.

		for content_item in getattr(output, "content", []) or []:
		text = getattr(content_item, "text", None)
		if text is not None:
		output_text.append(text)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has reached a lot of indentation for Python code. Usually you can keep code readable by adding early returns or breaking up into functions where appropriate.

Comment on lines +110 to +114

		span.set_data(
		SPANDATA.GEN_AI_REQUEST_MESSAGES,
		messages_data,
		unpack=False,
		)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The function _record_responses_input_messages calls span.set_data with an unsupported unpack argument, which will cause a TypeError.
_{Severity: HIGH}

Suggested Fix

Replace the call to span.set_data(SPANDATA.GEN_AI_REQUEST_MESSAGES, messages_data, unpack=False) with set_data_normalized(span, SPANDATA.GEN_AI_REQUEST_MESSAGES, messages_data, unpack=False). This aligns with the pattern used elsewhere in the file for setting data with the unpack option.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: sentry_sdk/integrations/litellm.py#L110-L114

Potential issue: In the `_record_responses_input_messages` function, `span.set_data` is
called with an `unpack=False` keyword argument. The `set_data` method on a `Span` object
does not accept this argument, which will cause a `TypeError` at runtime when this code
is executed. This path is triggered when using `litellm.responses()` or
`litellm.aresponses()` with prompt recording enabled. Other parts of the code correctly
use `set_data_normalized` when the `unpack` parameter is needed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit b3d837c. Configure here.}

		SPANDATA.GEN_AI_REQUEST_MESSAGES,
		messages_data,
		unpack=False,
		)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`span.set_data()` called with unsupported `unpack` keyword argument

High Severity

_record_responses_input_messages calls span.set_data() with unpack=False, but Span.set_data(self, key, value) only accepts key and value — it has no unpack parameter. This raises a TypeError at runtime whenever Responses API input messages are recorded (i.e., when send_default_pii=True and include_prompts=True). The likely intent was to call set_data_normalized() (which does accept unpack) or to drop the unpack argument. Since the crash occurs inside _input_callback with no surrounding try/except, the already-entered span will never be exited, causing a resource leak.

^{Reviewed by Cursor Bugbot for commit b3d837c. Configure here.}

feat(integrations): add support for the `litellm` `responses`/`aresponses` APIs by constantinius · Pull Request #6205 · getsentry/sentry-python

Codecov Results 📊

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

span.set_data() called with unsupported unpack keyword argument

`span.set_data()` called with unsupported `unpack` keyword argument