bug: 360846 title: appserver isn't recovering like it should causing too many oopses date-reported: Tue, 14 Apr 2009 02:59:48 -0000 date-updated: Sat, 08 Aug 2009 06:33:53 -0000 reporter: Ursula Junque (ursinha) duplicate-of: duplicates: attachments: patches: tags: lp-foundations oops subscribers: Ursula Junque (ursinha) Andrea Corbellini (andrea.corbellini) Fernando Miguel (fernandomiguel) task: launchpad status: Fix Released date-created: Tue, 14 Apr 2009 02:59:48 -0000 date-left-new: Fri, 01 May 2009 12:58:22 -0000 date-confirmed: Fri, 01 May 2009 12:58:22 -0000 date-triaged: Fri, 01 May 2009 12:58:22 -0000 date-assigned: Tue, 14 Apr 2009 03:00:22 -0000 date-inprogress: Wed, 06 May 2009 21:01:04 -0000 date-closed: Mon, 11 May 2009 14:04:09 -0000 date-fix-committed: Mon, 11 May 2009 14:04:09 -0000 date-fix-released: Mon, 11 May 2009 14:04:09 -0000 reporter: Ursula Junque (ursinha) importance: Critical assignee: Francis J. Lacoste (flacoste) milestone: 2.2.5 task: storm status: Fix Released date-created: Wed, 06 May 2009 20:47:12 -0000 date-left-new: Mon, 11 May 2009 13:55:14 -0000 date-confirmed: Mon, 11 May 2009 13:55:14 -0000 date-triaged: Mon, 11 May 2009 13:55:14 -0000 date-assigned: Mon, 11 May 2009 13:55:14 -0000 date-inprogress: Mon, 11 May 2009 13:55:14 -0000 date-closed: Sat, 08 Aug 2009 06:33:52 -0000 date-fix-committed: Mon, 11 May 2009 13:55:14 -0000 date-fix-released: Sat, 08 Aug 2009 06:33:52 -0000 reporter: Francis J. Lacoste (flacoste) importance: Undecided assignee: Francis J. Lacoste (flacoste) milestone: 0.15 Content-Type: multipart/mixed; boundary="===============4012501441911465428==" MIME-Version: 1.0 --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable It causes the following message, as seen in OOPS-1194D1005: InterfaceError: connection already closed More: OOPS-1194D1001, OOPS-1194D1002, OOPS-1194D1003, OOPS-1194D1117, OOPS-= 1194D1132, OOPS-1194D1134. Another bunch in 04/23: OOPS-1209A1332, OOPS-1209A1336, OOPS-1209A1341, OOP= S-1209A1337, OOPS-1209A1339, OOPS-1209A1342. The root cause is that storm doesn't detect that InterfaceError: connection already closed is a form of disconnection failure --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Francis J. Lacoste (flacoste) Date: Fri, 01 May 2009 12:58:21 -0000 Message-Id: <20090501125822.31244.58820.malone@gangotri.canonical.com> This happened twice this week. Once on Monday after adding two new appservers for sso. And once today, cause still unknown. --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Ursula Junque (ursinha) Date: Tue, 05 May 2009 13:30:54 -0000 Message-Id: <20090505133054.25233.15534.malone@potassium.ubuntu.com> We had another incident causing another bunch of oopses today, 05/05, and elmo restarted lpnet again. flacoste said stub will work on a fix for this bug this week (not today - holiday for him). --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Tom Haddon (mthaddon) Date: Wed, 06 May 2009 19:40:14 -0000 Message-Id: <20090506194014.26708.87421.malone@gandwana.canonical.com> Causing outages on lpnet --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Francis J. Lacoste (flacoste) Date: Thu, 07 May 2009 16:14:54 -0000 Message-Id: <20090507161454.24008.80678.malone@palladium.canonical.com> Merged on rev 122 of our storm branch. --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Francis J. Lacoste (flacoste) Date: Thu, 07 May 2009 23:05:04 -0000 Message-Id: <20090507230504.17356.63120.malone@gangotri.canonical.com> The fix isn't working. We do have a better understanding of what is happening. https://pastebin.canonical.com/17341/ shows a bunch of InterfaceError OOPS that happens. We see that one stretch of OOPS begins at OOPS-I2432. If we look at the previous OOPS-I2431, we see that an OperationalError was raised during the call to set_statement_timeout. Because that method doesn't use the _check_disconnect wrapper, the error is uncaught and leaves the connection in a disconnected state. The following request gets an InterfaceError because of this. The suggested fix isn't enough because is_disconnection_error() is only called when the error is a subclass of DatabaseError. This isn't the case for InterfaceError. We have no idea why the initial OperationError is happening. What is weird is that the client reports that the server disconnects it, but on the server, we see that it closes the client socket two minutes after the error. See https://pastebin.canonical.com/17342/ (to compare the timestamps). --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Diogo Matsubara (matsubara) Date: Fri, 08 May 2009 02:29:57 -0000 Message-Id: <20090508022958.531.83384.malone@palladium.canonical.com> I think Francis meant OOPS-1223I2432 and OOPS-1223I2431 respectively. --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Diogo Matsubara (matsubara) Date: Fri, 08 May 2009 03:02:28 -0000 Message-Id: <20090508030228.531.29611.malone@palladium.canonical.com> In https://devpad.canonical.com/~matsubara/bug-360846-oops-summary.html you can find a oops summary for the lpnet9 (oops_prefix: I) instance only, between 2009-05-07 21:33:00UTC and 2009-05-07 21:39:00UTC The 5 OperationalError have one thing in common, they're all nagios checks. --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Francis J. Lacoste (flacoste) Date: Fri, 08 May 2009 17:52:14 -0000 Message-Id: <20090508175214.11423.1797.malone@gandwana.canonical.com> In https://devpad.canonical.com/~matsubara/2009-05-07-tmp-summary.html, we see a similar pattern. This OOPS report spans the 50 minutes after the restart of lpnet9 before InterfaceError starts creeping up. --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Fernando Miguel (fernandomiguel) Date: Fri, 08 May 2009 17:56:35 -0000 Message-Id: <20090508175635.21664.31383.malone@palladium.canonical.com> i would just to add that even OpenID is failing too (maybe timeouts?), making it hard to even login to the wiki pages (and other services users use OIP with) --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Diogo Matsubara (matsubara) Date: Fri, 08 May 2009 20:06:02 -0000 Message-Id: <20090508200603.14630.54464.malone@potassium.ubuntu.com> I correlated the information from app server launchpad9.log (lpnet9 -- oops_prefix: I) restart times (restart times in BST) with oops summaries (oops summaries times in UTC). All of these summaries are from the time the app server was restarted until the first OE/IE error is raised. 2009-05-07T04:37:48 INFO Server zope.server.http (HTTP) started. https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-03:= 37:48--06:05:00.html 2009-05-07T09:40:49 INFO Server zope.server.http (HTTP) started. https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-08:= 40:49--16:56:00.html 2009-05-07T20:40:52 INFO Server zope.server.http (HTTP) started. https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-19:= 40:52--21:28:00.html 2009-05-07T22:38:11 INFO Server zope.server.http (HTTP) started. https://devpad.canonical.com/~matsubara/bug-360846-summaries/2009-05-07-21:= 28:11--23:59:59.html --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: James Henstridge (jamesh) Date: Mon, 11 May 2009 13:55:12 -0000 Message-Id: <20090511135513.914.89924.malone@gandwana.canonical.com> Fix merged to Storm's trunk in r305 --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Francis J. Lacoste (flacoste) Date: Mon, 11 May 2009 14:04:08 -0000 Message-Id: <20090511140408.2494.68728.malone@potassium.ubuntu.com> The appropriate fix was cherry-picked on Friday evening. --===============4012501441911465428== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Author: Francis J. Lacoste (flacoste) Date: Mon, 11 May 2009 14:07:49 -0000 Message-Id: <20090511140749.20869.96676.malone@gangotri.canonical.com> Automatic recovery of further problem is described in bug 374909. --===============4012501441911465428==--