add timeout for smart connections

Bug #268261 reported by Andreas Hasenack
8
Affects Status Importance Assigned to Milestone
Landscape Client
Fix Released
High
Gustavo Niemeyer
Smart Package Manager
Fix Released
Undecided
Gustavo Niemeyer
smart (Ubuntu)
Fix Released
Undecided
Unassigned
Intrepid
Fix Released
Undecided
Unassigned
Jaunty
Fix Released
Undecided
Unassigned

Bug Description

We got smart hanging on two machines (ls2 and ls5) with an open connection (smart update run via cron). It is still being debugged, but it looks like adding some sort of timeout for the connections would help to avoid this situation.

The Landscape team has proposed an SRU to fix this bug in intrepid and jaunty:

Statement explaining the impact
=====================

This fix is critical, as we had customer installations being bitten by this bug before. Smart would just hang there, doing nothing but blocking itself. As a result, landscape would not be able to manage the packages of this machine.

How the bug has been addressed
======================

Introducing a timeout for the libcurl call.

Detailed instructions how to reproduce the bug
==============================

It can be reproduced artificially by inducing a timeout by means of traffic shaping. It has been done by the Landscape QA engineer and the fix is confirmed to work as expected.

Changed in landscape:
assignee: nobody → niemeyer
importance: Undecided → High
milestone: none → thames-pre-8
Changed in smart:
assignee: nobody → niemeyer
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

So, I managed to kill the stale connection using tcpkill and some tricks, but smart update is still running:
16959 ? S 0:00 \_ /USR/SBIN/CRON
16960 ? Ss 0:00 \_ /bin/sh -c cd / && run-parts --report /etc/cron.hourly
16961 ? S 0:00 \_ run-parts --report /etc/cron.hourly
16962 ? Ss 0:00 \_ /bin/sh /etc/cron.hourly/smartpm-core
16963 ? S 0:03 \_ /usr/bin/python /usr/bin/smart update

netstat doesn't show the connection anymore. In fact, there is no connection coming or going to the smart process.

I did it locally using smart update in one terminal and netcat in another just listening. It was hung, as expected, but when tcpkill killed the connection, smart update just finished. So perhaps there is something else going on, or in addition to.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Pushed for thames-pre-9

Changed in landscape:
milestone: thames-pre-8 → thames-pre-9
Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

Unfortunately there's no time to add this for Intrepid, so I'll postpone this one for the next major
milestone so that I can focus on helping to push the major features of 1.2.

Changed in landscape:
milestone: thames-pre-9 → later
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This just happened on my notebook while running a normal smart update.

Changed in landscape:
milestone: later → mountainview
Revision history for this message
Thomas Herve (therve) wrote :

So presumably this is a very simple diff, to add a timeout to curl connections:

=== modified file 'smart/fetcher.py'
--- smart/fetcher.py 2008-11-16 00:56:23 +0000
+++ smart/fetcher.py 2008-11-24 15:46:32 +0000
@@ -1649,6 +1649,7 @@

                         handle.setopt(pycurl.URL, str(url))
                         handle.setopt(pycurl.OPT_FILETIME, 1)
+ handle.setopt(pycurl.TIMEOUT, 600)
                         handle.setopt(pycurl.NOPROGRESS, 0)
                         handle.setopt(pycurl.PROGRESSFUNCTION, progress)
                         handle.setopt(pycurl.WRITEDATA, local)

But! It is also very hard to test. Gustavo, what do you think?

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

It might not be that hard to test. Note that we already have tests for the fetcher in place doing real interactions with a web server. If it turns out to be indeed hard, I think that in this case we should just go ahead and add the timeout option. The other options aren't tested either, and this is a very important fix for us to have in place.

Revision history for this message
Thomas Herve (therve) wrote :

OK, I don't know how to write a clean test for that, so I'll go on and commit this fix.

Revision history for this message
Thomas Herve (therve) wrote :

Oh wait I don't have commit rights on smart. Can you commit the attached patch Gustavo?

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

This has been fixed in Smart with proper testing.

Changed in smart:
milestone: none → 1.2
status: New → Fix Committed
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

qa + 1, and smart update exits with code 1 if the timeout is hit.

Changed in landscape:
status: New → Fix Committed
Changed in smart:
status: Fix Committed → Fix Released
Changed in landscape:
status: Fix Committed → Fix Released
description: updated
affects: landscape → landscape-client
Changed in landscape-client:
milestone: mountainview → none
milestone: none → 1.3.2.1
Revision history for this message
Martin Pitt (pitti) wrote :

Is this fixed in karmic? Please close the karmic task if so. (And note our SRU policy that we don't release stuff into stables which haven't been fixed in the development release)

Revision history for this message
Martin Pitt (pitti) wrote :

The current smart jaunty SRU does not touch this, is jaunty affected by this?

Changed in smart (Ubuntu Jaunty):
status: New → Incomplete
Changed in smart (Ubuntu):
status: New → Incomplete
Revision history for this message
Free Ekanayaka (free.ekanayaka) wrote :

@Martin : yes this is fixed in karmic and jaunty, that both have this patch already.

Changed in smart (Ubuntu):
status: Incomplete → Fix Released
Changed in smart (Ubuntu Jaunty):
status: Incomplete → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

Accepted smart into intrepid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in smart (Ubuntu Intrepid):
status: New → Fix Committed
tags: added: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package smart - 1.1.1~bzr20081010-0ubuntu1.8.10.0

---------------
smart (1.1.1~bzr20081010-0ubuntu1.8.10.0) intrepid-proposed; urgency=low

  * Bug-fixing SRU (LP: #347983)
    - Added timeout for smart connections (LP: #268261)
    - New landscape plugin to set proxy for smart too,
      inheriting it from the landscape config (LP: #236884)
    - Added 04_detect_plugins crasher fix (LP: #388577)
    - Added 05_curl_timeout fix for failures when downloading big
      packages (LP: #389001)

 -- Free Ekanayaka <email address hidden> Wed, 07 Oct 2009 09:16:59 +0200

Changed in smart (Ubuntu Intrepid):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.