Monday, March 26, 2012

Poison Message Sample in BOL

Hi,

We've been trying to diagnose a problem with service broker's poison message detection and tried to run the sample script in BOL to see how it solved the problem. It appears the example doesn't seem to work correctly,

IF (@.messageTypeName =
'//Adventure-Works.com/AccountsPayable/ExpenseReport')
BEGIN
DECLARE @.expenseReport NVARCHAR(MAX) ;
SET @.expenseReport = CAST(@.messageBody AS NVARCHAR(MAX)) ;
EXEC AdventureWorks.dbo.AddExpenseReport
@.report = @.expenseReport ;
IF @.@.ERROR <> 0
BEGIN
/* THIS ROLLBACK INCREMENTS THE FAILURE COUNT */
ROLLBACK TRANSACTION UndoReceive ;
EXEC TrackMessage @.conversationHandle ;
END ;
ELSE
BEGIN
EXEC AdventureWorks.dbo.ClearMessageTracking
@.conversationHandle ;
END ;
END ;
ELSE

In the scenario where AddExpenseReport fails it will retry the message the next time round the loop - TrackMessage looks to see how many times the message has failed:

IF @.count > 20
/* @.count WILL NEVER GET ABOVE 5 */
BEGIN
EXEC dbo.ClearMessageTracking @.conversationHandle ;
END CONVERSATION @.conversationHandle
WITH ERROR = 500
DESCRIPTION = 'Unable to process message.' ;
END ;
ELSE

But the counter will not get to 20 as after 5 the queue will be disabled which causes ProcessExpenseReport to fail. Subsequent message cannot then be processed until the queue is re-enabled.

We have changed the sample so that once the message has failed four times it is then written to a dead message table and we can then call commit so that the queue can carry on without being disabled:

[in ProcessExpenseReport]
...
EXEC AdventureWorks.dbo.AddExpenseReport @.report = @.expenseReport ;
IF @.@.error <> 0
BEGIN
-- The message has failed, we need to work out if we are safe to
-- retry or not. If this message has failed four times already
-- we need to recieve it, save it into another table and commit to
-- avoid service broker from disabling the queue. Otherwise we can
-- rollback and try again to see if it works next time.

-- Find out how many times this message has failed
declare @.failCount int
exec GetFailCount @.conversationHandle, @.failCount output
-- If we are about to reach the threshold
if @.failCount >= 4
begin
-- Insert the message into the dead message table. A sql agent
-- job will at some point in the future put the contents of this
-- table back onto the queue for reprocessing
EXEC DeadMessage @.conversationHandle, @.expenseReport
end
else
begin
-- We are safe to rollback and take another run at it
ROLLBACK TRANSACTION UndoReceive ;
EXEC TrackMessage @.conversationHandle ;
end
END ;
ELSE
BEGIN
-- Call to SP worked we can clear down our message tracking
EXEC ClearMessageTracking @.conversationHandle ;
END ;
END ;
...
/* Procedure to find out how many times this message
has failed to be processed */
CREATE PROCEDURE GetFailCount
@.conversationHandle uniqueidentifier,
@.failCount int output as
begin
SELECT @.failCount = [count] FROM ExpenseServiceFailedMessages
WHERE conversation_handle = @.conversationHandle
end
go

/* Procedure to insert the message into the Dead message
table and end the conversation */
create procedure DeadMessage
@.conversation_handle uniqueidentifier,
@.message_body varchar(max)
as
begin
insert into ExpenseServiceDeadMessages ( message_body ) values ( @.message_body )

EXEC dbo.ClearMessageTracking @.conversation_Handle ;
END CONVERSATION @.conversation_Handle
WITH ERROR = 500
DESCRIPTION = 'Unable to process message.' ;

end
go

/* Updated version of TrackMessage which no longer checks the
fail count - it just increments each time */
CREATE PROCEDURE TrackMessage
@.conversationHandle uniqueidentifier
AS
BEGIN
IF @.conversationHandle IS NULL
RETURN ;

DECLARE @.count INT ;
SET @.count = NULL ;
SET @.count = (SELECT count FROM dbo.ExpenseServiceFailedMessages
WHERE conversation_handle = @.conversationHandle) ;

/* TrackMessage no longer checks the failure count */
IF @.count IS NULL
BEGIN
INSERT INTO dbo.ExpenseServiceFailedMessages
(count, conversation_handle)
VALUES (1, @.conversationHandle) ;
END ;
ELSE
BEGIN
UPDATE dbo.ExpenseServiceFailedMessages
SET count=count+1
WHERE conversation_handle = @.conversationHandle ;
END ;
END ;
GO

/* New table to hold the Dead messages */
CREATE TABLE ExpenseServiceDeadMessages (
message_body varchar(max)
) ;


Now this solves the problem. If a message arrives that can't be processed (either becuase the message itself is bad, or because of an system fault with another system that we interact with) the message is moved to the DeadMessage table and the queue carries on and does not get disabled.

In our application however none of this code should be required, as long as we are confident that the data in the messages are "valid". In our case the only reason the processing would fail is if an external system we are dependant fails for some reason (e.g. Maybe writing the message to disk and the disk is full ). In this scenario we want to retry the message indefinately until the error clears. If we could disable the poision message detection this would solve the problem and remove the need for this solution.

Any thoughts on this? Have we missed the point somewhere....

Cheers,

Neil.

The main problem with just disabling the poison message check is that when you roll back a message, it's still at the top of the queue so nothing else gets processed until the message succeeds. You're probably better off commiting the receive and re-sending the meaasge - maybe to some kind of retry queue - so that you don't block other messages on the queue. Keep in mind that if you have multiple messages on the same dialog you'll have to be careful how you handel this so you don't end up changing the message order. I've seen a couple examples of people who use conversation timers to implement a timed retry for cases like this when there is a possibility of the remote resource being unavailable for an extended time.|||

I'd like to add that when a queue is being disabled (e.g. because of a poison message), a BROKER_QUEUE_DISABLED event notification is fired. Here is a small demo to ilustrate this:

use tempdb;
go

create procedure sp_rollback_on_pm
as
begin
set nocount on;
begin transaction;
receive * from q_pm;
rollback;
end;
go

create queue q_pm
with activation (
status = on,
max_queue_readers = 1,
procedure_name = [sp_rollback_on_pm],
execute as owner);
go

create service s_demo_pm on queue q_pm ([DEFAULT]);
go

create queue q_en;
create service s_en on queue q_en ([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]);
go

create event notification ev_qm on queue q_pm for broker_queue_disabled to service 's_en', 'current database';
go

declare @.dh uniqueidentifier;
begin dialog conversation @.dh
from service s_demo_pm
to service 's_demo_pm'
with encryption = off;
send on conversation @.dh;
send on conversation @.dh;
send on conversation @.dh;
send on conversation @.dh;
send on conversation @.dh;
send on conversation @.dh;
go

select cast(message_body as xml),* from q_en
go

An administrator can receive notifications when queues are being disabled as a result of a poison message and it can inspect the queue and fix the issue or remove the problem dialog. Or an activated procedure on the q_en queue can automatically fix the issue and re-enable the q_pm queue (of course, assuming that the issue can be fixed automatically).

HTH,
~ Remus


|||Thanks for the replies - this has allowed us to progress to a past this problem.

I do entirely understand why service broker is doing what it's doing, it will quickly enable you to locate any messages that break the system doe to an application bug. However in the scenario where a message cannot be processed due to a situation beyond your control this features requires you to implement a lot more code then perhaps you should.

In our scenario, if we cannot process message A which is at the front of the queue, we are not going to be able to process message B either, so there's no point trying. The ability to turn of poision message detection would be very useful.

I imagine this has been discussed elsewhere but it also strikes me as odd that I can now write code:

-- About to start a transaction
BEGIN TRANSACTION
RECEIVE * FROM myQueue
ROLLBACK TRANSACTION
-- The database is now in a different state from when begin tran was called!!

I personally have always relied on the premise that EVERYTHING I DO (with SQL) inside a transaction is undone when I call rollback. This no longer seems to be true?

Cheers,

Neil|||

Everything you did in the transaction is undone when the transaction is roled back. The message is back in the queueand any other data updates are reversed. The queue is disabled as a result of the rollback, after the rollback occured.

In the case you describe, since the application is able to detect the problem message, I wouldn't actually call it a poison message. Is more like a message your app understands but is not willing to process. It seems to me that a more appropiate action would be to move the message into a different queue (i.e. forward the message to a secondary service) or end the dialog with an error.
The poison message was indeed thought more for cases when the application crashes when processing the message and the system spins on this message, w/o making any progress.

|||

I can see Neil's issue with the poison message handling, and the inability to turn it off. I guess he's got a queue whose processing ultimately depends on some external system (say a web service call or something). In the scenario where the web service is unavailable, you'd pull a message from the queue and try to call the web service. If the web service fails (for example, due to network issues), then you want to rollback the receive to enable that message to be retried in the future. It's not really possible for an application to know whether a network-based call is going to work in advance.

Moving the message of to another queue merely delays the inevitable - the other queue needs processing, and of course it can fail in the same way. I guess you could just take messages from the second queue and add them back to the first, but this seems like a poor solution (it also causes the messages to be reordered, which could have its own issues).

Much as I understand (and applaud) the reasoning behind poison message detection, it does seem that there are scenarios where it would be useful to just turn it off.

|||For the scenario you describe I'd recommend using conversation timers, not rollback the message. Rolling back would make the message visible again immeadetly, triggering another attempt. Instead, using a timer would allow for a delayed retry, like this:

begin transaction
Receive
if web request message
store the request data in table
else if timer message
read request data from table
begin conversation timer (e.g. 1 minute)
commit
do the web request
if succeeded
begin transaction
reset the conversation timer
send back response
commit

Ultimately, everything in the server is optimized for the commit case. Rollbacks should really be extraordinarily and this is how we approached the problem.

As a last resort, the BROKER_QUEUE_DISABLED event notification can be used to enable back the queue automatically.

HTH,
~ Remussql

No comments:

Post a Comment