feat(integrations): add support for the `litellm` `responses`/`aresponses` APIs by constantinius · Pull Request #6205 · getsentry/sentry-python
Codecov Results 📊
✅ 2187 passed | ⏭️ 154 skipped | Total: 2341 | Pass Rate: 93.42% | Execution Time: 4m 55s
All tests are passing successfully.
❌ Patch coverage is 0.00%. Project has 12726 uncovered lines.
Files with missing lines (2)
| File | Patch % | Lines |
|---|---|---|
openai.py |
4.13% | |
litellm.py |
0.00% |
Generated by Codecov Action
Comment on lines +2219 to +2222
| _input_callback(kwargs) | ||
| _success_callback( | ||
| kwargs, MockResponsesResponse(), datetime.now(), datetime.now() | ||
| ) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With SDK tests we aim to verify that we generate some telemetry based on the user's interaction with the library. We want to assert the presence of telemetry if the patched library is used as the user would use the library.
Currently, we assert that telemetry is generated if _input_callback and _success_callback are each invoked exactly once.
This is not always the case, and the assumption has resulted in unhandled SDK exceptions that were fixed in the commit below:
Comment on lines +2140 to +2170
| class MockResponsesUsage: | ||
| def __init__(self, input_tokens=12, output_tokens=24, total_tokens=36): | ||
| self.input_tokens = input_tokens | ||
| self.output_tokens = output_tokens | ||
| self.total_tokens = total_tokens | ||
|
|
||
|
|
||
| class MockResponsesContentItem: | ||
| def __init__(self, text): | ||
| self.type = "output_text" | ||
| self.text = text | ||
|
|
||
|
|
||
| class MockResponsesOutputMessage: | ||
| def __init__(self, text): | ||
| self.type = "message" | ||
| self.role = "assistant" | ||
| self.content = [MockResponsesContentItem(text)] | ||
|
|
||
|
|
||
| class MockResponsesResponse: | ||
| def __init__( | ||
| self, | ||
| model="gpt-4.1-nano", | ||
| output=None, | ||
| usage=None, | ||
| ): | ||
| self.id = "resp-test" | ||
| self.model = model | ||
| self.output = output or [MockResponsesOutputMessage("the model response")] | ||
| self.usage = usage or MockResponsesUsage() |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to https://github.com/getsentry/sentry-python/pull/6205/changes#r3201008608, we should aim to avoid custom types in our test suites.
As soon as we introduce custom types our tests are not coupled to the concrete types used in a library, and the tests no longer verify the SDK contract (namely, that telemetry is generated when a library is used like a user would interact with the library).
We can't hit real LLM APIs in the tests but we can do the next best thing: couple the sample response to the types in the library and patch at the lowest possible level.
This is done most of the tests in this test file, and there are helpers in the repo to accomplish writing effective tests (such as get_model_response()).
Comment on lines 336 to 345
| if hasattr(response, "usage"): | ||
| usage = response.usage | ||
| record_token_usage( | ||
| span, | ||
| input_tokens=getattr(usage, "prompt_tokens", None), | ||
| output_tokens=getattr(usage, "completion_tokens", None), | ||
| total_tokens=getattr(usage, "total_tokens", None), | ||
| input_tokens=_read_usage_field(usage, "prompt_tokens", "input_tokens"), | ||
| output_tokens=_read_usage_field( | ||
| usage, "completion_tokens", "output_tokens" | ||
| ), | ||
| total_tokens=_read_usage_field(usage, "total_tokens"), | ||
| ) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already probe above to determine which API is used.
As a result, reading prompt_tokens or input_tokens is dead code conditioned on knowing which API you are handling (adding cognitive overhead when reading).
| set_data_normalized( | ||
| span, SPANDATA.GEN_AI_RESPONSE_TEXT, response_messages | ||
| ) | ||
| elif hasattr(response, "output"): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are adding code here which runs for all possible types of object that have an output field.
As a result the branch can easily be accidentally triggered as litellm evolves. There are multiple approaches to narrow down if you have a response in the Chat Completion API schema or a response in the Responses API schema. For example, you can check
isinstance(response, (ResponsesAPIResponse, BaseResponsesAPIStreamingIterator))
based on the signature of the library function
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with using the isinstance check in the future. I just see that the hasattr checks are way more prevalent throughout the SDK, but maybe that is something that will be changed.
| normalized = normalize_message_roles(input_messages) # type: ignore[arg-type] | ||
| messages_data = truncate_and_annotate_messages(normalized, span, scope) | ||
| if messages_data is not None: | ||
| set_data_normalized( |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the marshaling above you know that messages_data is a list. You should just use span.set_data() when you know the type of an attribute (again, removing cognitive overhead by avoiding dead code).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, we should stringify lists, right? That is what set_data_normalized does. Otherwise I'd have to pull in json for that reason alone and stringify myself. So I would actually opt to keep it here.
Comment on lines +46 to +48
| The usage object can be either a typed Pydantic model (attribute access) or | ||
| a plain dict (litellm hands us a dict for the assembled async-streaming | ||
| response), so we try both shapes. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we just read from the dictionary int he asynchronous streaming scenario and otherwise access the attribute on the Pydantic model 😄 ?
These responses have types, so an isinstance check can tell you which branch you are in.
In the end we're developing against a library with a finite number of return types, and we should just check which case we are handling instead of probing around. Probing around is less robust, since new return types accidentally trigger hasattr() checks.
| for content_item in getattr(output, "content", []) or []: | ||
| text = getattr(content_item, "text", None) | ||
| if text is not None: | ||
| output_text.append(text) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has reached a lot of indentation for Python code. Usually you can keep code readable by adding early returns or breaking up into functions where appropriate.
Comment on lines +110 to +114
| span.set_data( | ||
| SPANDATA.GEN_AI_REQUEST_MESSAGES, | ||
| messages_data, | ||
| unpack=False, | ||
| ) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: The function _record_responses_input_messages calls span.set_data with an unsupported unpack argument, which will cause a TypeError.
Severity: HIGH
Suggested Fix
Replace the call to span.set_data(SPANDATA.GEN_AI_REQUEST_MESSAGES, messages_data, unpack=False) with set_data_normalized(span, SPANDATA.GEN_AI_REQUEST_MESSAGES, messages_data, unpack=False). This aligns with the pattern used elsewhere in the file for setting data with the unpack option.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: sentry_sdk/integrations/litellm.py#L110-L114
Potential issue: In the `_record_responses_input_messages` function, `span.set_data` is
called with an `unpack=False` keyword argument. The `set_data` method on a `Span` object
does not accept this argument, which will cause a `TypeError` at runtime when this code
is executed. This path is triggered when using `litellm.responses()` or
`litellm.aresponses()` with prompt recording enabled. Other parts of the code correctly
use `set_data_normalized` when the `unpack` parameter is needed.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b3d837c. Configure here.
| SPANDATA.GEN_AI_REQUEST_MESSAGES, | ||
| messages_data, | ||
| unpack=False, | ||
| ) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
span.set_data() called with unsupported unpack keyword argument
High Severity
_record_responses_input_messages calls span.set_data() with unpack=False, but Span.set_data(self, key, value) only accepts key and value — it has no unpack parameter. This raises a TypeError at runtime whenever Responses API input messages are recorded (i.e., when send_default_pii=True and include_prompts=True). The likely intent was to call set_data_normalized() (which does accept unpack) or to drop the unpack argument. Since the crash occurs inside _input_callback with no surrounding try/except, the already-entered span will never be exited, causing a resource leak.
Reviewed by Cursor Bugbot for commit b3d837c. Configure here.
