Launchpad API randomly returns empty responses

Bug #2039251 reported by Henry Zhang
40
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Colin Watson
lazr.restful
Fix Released
Critical
Colin Watson

Bug Description

Since October 10, I've started to see an issue ocassionally when running `apt-add-repository` on Ubuntu, this being the most common:

$ sudo apt-add-repository ppa:maxmind/ppa
Cannot add PPA: 'ppa:~maxmind/ubuntu/ppa'.
The team named '~maxmind' has no PPA named 'ubuntu/ppa'
Please choose from the following available PPAs:
 * 'ppa': MaxMind Libraries and Software

This occurs for a number of different PPAs - `maxmind/ppa`, `~ondrej/ubuntu/php` and `~vbernat/ubuntu/haproxy-1.7`, among others. It's intermittent and works sometimes - difficult to guess success rate. This has happened on Ubuntu 18.04 and 20.04, both in the cloud (AWS EC2) and on a local desktop.

Someone else seems to have reported a similar thing here: https://github.com/orgs/community/discussions/69720

Related branches

Revision history for this message
Laurent Goudet (laurentgoudet) wrote :

This is reproducible using cURL, something seems cooked with the launchpad API servers:

curl -sS -D - -o /dev/null -H 'Accept: application/json' https://launchpad.net/api/1.0/~maxmind/+archive/ubuntu/ppa
HTTP/1.1 200 OK
Date: Fri, 13 Oct 2023 13:52:15 GMT
Server: gunicorn
Content-Type: application/json
x-powered-by: Zope (www.zope.org), Python (www.python.org)
Content-Length: 0
content-security-policy: frame-ancestors 'self';
ETag: "4da4ac44a91c53ba7dbeb3fcc6e16eab40eff6b9-5b65841c4222b140eee5758e4d96403bbb8d6d1b"
strict-transport-security: max-age=15552000
Vary: Accept
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-launchpad-revision: bd6cfd0cfc024dbe1dcd7d5d91165fb4f6a6c596
x-lazr-notifications: []
x-xss-protection: 1; mode=block
x-vcs-revision: bd6cfd0cfc024dbe1dcd7d5d91165fb4f6a6c596
x-request-id: c070f93a-02d4-430c-9567-a4ec9ad45b6b
Age: 973
X-Cache: HIT from juju-98d295-prod-launchpad-29
X-Cache-Lookup: HIT from juju-98d295-prod-launchpad-29:3128
Via: 1.1 juju-98d295-prod-launchpad-29 (squid/4.10)

Notice above the "Content-Length: 0", i.e. API servers return empty 0-byte responses. Running the same command a couple of seconds later show:

curl -sS -D - -o /dev/null -H 'Accept: application/json' https://launchpad.net/api/1.0/~maxmind/+archive/ubuntu/ppa
HTTP/1.1 200 OK
Date: Fri, 13 Oct 2023 11:07:37 GMT
Server: gunicorn
x-powered-by: Zope (www.zope.org), Python (www.python.org)
Content-Length: 977
content-security-policy: frame-ancestors 'self';
Content-Type: application/json
ETag: "4da4ac44a91c53ba7dbeb3fcc6e16eab40eff6b9-5b65841c4222b140eee5758e4d96403bbb8d6d1b"
strict-transport-security: max-age=15552000
Vary: Accept,Accept-Encoding
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-launchpad-revision: bd6cfd0cfc024dbe1dcd7d5d91165fb4f6a6c596
x-lazr-notifications: []
x-xss-protection: 1; mode=block
x-vcs-revision: bd6cfd0cfc024dbe1dcd7d5d91165fb4f6a6c596
x-request-id: d843e87b-38a0-48da-98b3-cf5940204043
Age: 976
X-Cache: HIT from juju-98d295-prod-launchpad-29
X-Cache-Lookup: HIT from juju-98d295-prod-launchpad-29:3128
Via: 1.1 juju-98d295-prod-launchpad-29 (squid/4.10)

A noticeable difference here is the "Vary: Accept" vs "Vary: Accept,Accept-Encoding", but in short the edge servers seem to be caching/serving 0-byte responses, which causes apt-add-repository to fail to find the packages.

summary: - (Intermittently) cannot add PPA in Ubuntu
+ Launchpad API randomly returns empty responses
Revision history for this message
Jérôme (jerome-auge) wrote :

I'm also observing random failures using Ansible's apt_repository module when deploying/configuring some hosts. The apt_repository module breaks with the following error:

--8<--
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
-->8--

The launchpad.net API seems to randomly return an empty JSON response (Content-Length: 0) with HTTP 200 OK status:

--8<--
$ curl -s -D - H 'Accept: application/json' 'https://launchpad.net/api/1.0/~ondrej/+archive/php'
HTTP/1.1 200 OK
         ~~~~~~
Date: Mon, 16 Oct 2023 11:51:19 GMT
Server: gunicorn
Content-Type: application/json
x-powered-by: Zope (www.zope.org), Python (www.python.org)
Content-Length: 0
               ~~~
content-security-policy: frame-ancestors 'self';
ETag: "1314eaf9a8097c36a3658b0e5f46b744710a71bf-e9898177223b67f4aa967ca5902a80f788c0e988"
strict-transport-security: max-age=15552000
Vary: Accept
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-launchpad-revision: bd6cfd0cfc024dbe1dcd7d5d91165fb4f6a6c596
x-lazr-notifications: []
x-xss-protection: 1; mode=block
x-vcs-revision: bd6cfd0cfc024dbe1dcd7d5d91165fb4f6a6c596
x-request-id: 9c7c3d4a-5cf5-4254-8f37-87a74aad4e64
Age: 6
X-Cache: HIT from juju-98d295-prod-launchpad-30
X-Cache-Lookup: HIT from juju-98d295-prod-launchpad-30:3128
Via: 1.1 juju-98d295-prod-launchpad-30 (squid/4.10)

-->8--

Revision history for this message
Colin Watson (cjwatson) wrote (last edit ):

So, firstly, it makes sense that this has changed recently, since we've recently rolled out a new stack of frontend machines for various reasons. One feature of the new stack that was previously only present on non-production environments is a caching layer for those requests that are cacheable (mostly anonymous requests), and that's where the problem seems to be here.

I finally managed to run into this locally, and went looking for the request matching the `X-Request-Id` header in appserver logs:

2023-10-16 13:35:48.410Z INFO talisker.wsgi "GET /api/1.0/~maxmind/+archive/ubuntu/ppa" method=GET path=/api/1.0/~maxmind/+archive/ubuntu/ppa status=304 duration_ms=35.475 ip=10.131.215.147 proto=HTTP/1.1 forwarded="[redacted], 10.131.215.233,10.131.215.105" ua=Python-urllib/3.6 logging_count=5 logging_time_ms=0.251 nonpython_actions=8 sql_statements=7 sql_ms=10 publication_thread_duration_ms=3.857 publication_duration_ms=6.47 userid=Anonymous traversal_thread_duration_ms=10.329 traversal_duration_ms=19.253 pageid=Archive:EntryResource host=launchpad.net scheme=https request_id=5256682c-e5ce-46e8-9e0d-adebe5458160

The origin IP address (redacted here for privacy) wasn't mine, so I was clearly getting a cached response back, and I notice that the status here is 304 Not Modified. So somebody sent a `GET` query probably with an `If-None-Match` header (as `lazr.restfulclient` would do), the appserver correctly returned 304 with an empty response, and then Squid incorrectly cached the empty response body and returned 200. Somehow.

This at least gives me something to go on, and I'll experiment further on a non-production instance to see if I can work out how to trigger this on demand.

Changed in launchpad:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Colin Watson (cjwatson)
Revision history for this message
Colin Watson (cjwatson) wrote :

Sorry, I made a mistake in my previous comment - there was in fact a caching layer present before my recent frontend changes. The structure of the configuration is fairly different because we're now having the `squid-reverseproxy` charm do a lot more of the heavy lifting, but I've just gone through it all and can't find any relevant differences. Still digging.

Revision history for this message
Colin Watson (cjwatson) wrote :

I'm not sure why behaviour changed recently, but I think this may be a lazr.restful bug. RFC 2616 says:

10.3.5 304 Not Modified
[...]
   The response MUST include the following header fields:

      - Date, unless [...]

      - ETag and/or Content-Location, if the header would have been sent
        in a 200 response to the same request

      - Expires, Cache-Control, and/or Vary, if the field-value might
        differ from that sent in any previous response for the same
        variant

   If the conditional GET used a strong cache validator (see section
   13.3.3), the response SHOULD NOT include other entity-headers.
   Otherwise (i.e., the conditional GET used a weak validator), the
   response MUST NOT include other entity-headers; this prevents
   inconsistencies between cached entity-bodies and updated headers.

   If a 304 response indicates an entity not currently cached, then the
   cache MUST disregard the response and repeat the request without the
   conditional.

   If a cache uses a received 304 response to update a cache entry, the
   cache MUST update the entry to reflect any new field values given in
   the response.

In this case a strong cache validator is being used, but we're still violating the SHOULD NOT by sending a bunch of other response headers, including "Content-Length: 0".

RFC 7232 says:

4.1. 304 Not Modified
[...]

   The server generating a 304 response MUST generate any of the
   following header fields that would have been sent in a 200 (OK)
   response to the same request: Cache-Control, Content-Location, Date,
   ETag, Expires, and Vary.

   Since the goal of a 304 response is to minimize information transfer
   when the recipient already has one or more cached representations, a
   sender SHOULD NOT generate representation metadata other than the
   above listed fields unless said metadata exists for the purpose of
   guiding cache updates (e.g., Last-Modified might be useful if the
   response does not have an ETag field).

And RFC 7234 says, for the case where a cache receives a 304 response and already has a stored 200 response for the same cache key:

   If a stored response is selected for update, the cache MUST:

   o delete any Warning header fields in the stored response with
      warn-code 1xx (see Section 5.5);

   o retain any Warning header fields in the stored response with
      warn-code 2xx; and,

   o use other header fields provided in the 304 (Not Modified)
      response to replace all instances of the corresponding header
      fields in the stored response.

So it would appear that lazr.restful needs to ensure that any other response headers are removed when it generates a 304 (in `lazr.restful._resource.HTTPResource.handleConditionalGET`).

Revision history for this message
Jeroen Bobbeldijk (jerbob92) wrote (last edit ):

I couldn't really figure out where it actually sets the Content-Length, it doesn't seem to happen in lazr.restful itself, but maybe it's fixed by returning `None` instead of `b""` in places where it says "# The conditional GET succeeded. Serve nothing."?

I'm also pretty sure it's caused by sending a Content-Length 0 when a 304 is generated, basically telling Squid that the new content for the cached file should be of length 0, emptying the cached object.

Revision history for this message
Colin Watson (cjwatson) wrote :

It's set in `zope.publisher.http.HTTPResponse._implicitResult`.

I think I have most of a fix - just working out unit tests, which will have to wait for tomorrow.

Revision history for this message
Domenick D'Onofrio (domenickd3) wrote :

Hi,

I created this PR for ansible which changes the endpoint used to grab ppa information from launchpad.net/api to api.launchpad.net
see: https://github.com/ansible/ansible/pull/81978#issuecomment-1765395603

I noticed api.launchpad.net does not seem to have the same issue

--
@cjwatson Even if you come up with a fix for the issue being discussed here, should the ansible launchpad endpoint be changed from launchpad.net to api.launchpad.net?

Revision history for this message
Colin Watson (cjwatson) wrote :

Using api.launchpad.net is arguably slightly cleaner because it's what our official Python bindings use, but it's basically just a coincidence that api.l.n happens not to be affected by this issue right now so I don't think you should frame it as a bug-fix. The only reason that api.l.n isn't currently affected is that I haven't switched it over to the new frontend stack yet - I'm deliberately holding off on that until we've got this bug fixed properly.

Colin Watson (cjwatson)
Changed in lazr.restful:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Colin Watson (cjwatson)
Revision history for this message
Colin Watson (cjwatson) wrote :

Released lazr.restful 2.0.2 with this fix; working on merging it into Launchpad.

I forgot to mention https://bugs.squid-cache.org/show_bug.cgi?id=4882 earlier, which was helpful in tracking all this down.

Changed in lazr.restful:
status: In Progress → Fix Released
Revision history for this message
Sviatoslav Sydorenko (webknjaz) wrote :

Hey Colin, thanks for looking into this! Would you say that using the other endpoint is better than the one we have in the Ansible module now regardless? I haven't found any comparison online…

Revision history for this message
Colin Watson (cjwatson) wrote :

Broadly speaking they're supposed to be equivalent. The launchpad.net/api one has to continue existing because the Launchpad web UI uses it (to avoid cross-domain issues), and as I mentioned api.launchpad.net is used by the official bindings. I have a very loose feeling that api.launchpad.net is stylistically better for external code as opposed to parts of the Launchpad web UI, but in practice I doubt it makes much of a difference.

Revision history for this message
Colin Watson (cjwatson) wrote :

I think this is fixed now. Please comment on this bug with details if you're still seeing it.

Changed in launchpad:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.