RE: idea - atomic append for NFS writes


From: Mike Eisler (mike@eisler.com)
Date: 03/22/01-10:50:55 AM Z


Message-Id: <200103221602.IAA03217@eagle.webpros.com>
Date: Thu, 22 Mar 2001 08:50:55 -0800 (PST)
From: Mike Eisler <mike@eisler.com>
Subject: RE: idea - atomic append for NFS writes

>From: Neil Brown <neilb@cse.unsw.edu.au>

>On Wednesday March 21, mike@eisler.com wrote:
>> In summary,
>> 
>> 	- You are claiming GUARDED creates as the justification for the holes.
>
>Maybe I misunderstood the purpose of GUARDED.  I was reasoning from
>its proximity to EXCLUSIVE in the spec.  Given the lack of clear
>justification for its existence, it seems reasonable (but apparently
>false) to assume that the most obvious purpose was the intended one.
>
>Your description of the use of GUARDED in Solaris NFSv3 is interesting
>and valid, though it seems like a small plug in a large hole.
>A scenario that must be of equal concern is a stray duplicate UNLINK
>which unlinks a file that a subsequent CREATE created, and you cannot

Not of equal concern to me. GUARDED non-exclusive create
prevents a retry of the same operation from the same client from
messing the file up. The stray UNLINK messing up the CREATE is going to
be a less frequent event than a CREATE that is not preceded by an
UNLINK, because a CREATE is preceded by a removal of the same file
less often than not.

>guard against that (though you can in v4 using VERIFY).

VERIFY doesn't cut it because there's no atomicity guarantee between
operations in COMPOUND.  Interestingly, an early NFSv3 draft that Rusty
Sandberg wrote had an NFSPROC_REMOVE whose arguments included the
file handle of the object being unlinked; the ultimate
in guarded REMOVE. It always looked safe to me, but the NFSv3 junta
that met in 1992 to finally finish the specification dropped it.  I've
never had an explanation why. Perhaps because of the extra round trip?
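
To make the guarded REMOVE idea concrete, here is a minimal sketch in
Python. It is a toy in-memory model, not the draft's actual procedure:
the Directory class, remove_guarded(), and the integer file handles
are all hypothetical names invented for illustration. It only shows
why carrying the file handle in the REMOVE arguments defuses a stray
duplicate REMOVE that arrives after a later CREATE.

```python
class StaleRemove(Exception):
    """Raised when the name no longer refers to the expected handle."""


class Directory:
    """Toy in-memory directory mapping names to opaque file handles."""

    def __init__(self):
        self._entries = {}
        self._next_fh = 1

    def create(self, name):
        # Every newly created file gets a fresh handle.
        fh = self._next_fh
        self._next_fh += 1
        self._entries[name] = fh
        return fh

    def lookup(self, name):
        return self._entries.get(name)

    def remove_unguarded(self, name):
        # Plain REMOVE: deletes whatever the name refers to right now.
        self._entries.pop(name, None)

    def remove_guarded(self, name, expected_fh):
        # Hypothetical guarded REMOVE: unlink only if the name still
        # refers to the object the client meant to remove.
        if self._entries.get(name) != expected_fh:
            raise StaleRemove(name)
        del self._entries[name]


def new_file_survives_stray_remove(guarded):
    """Replay REMOVE, CREATE, then a stray duplicate of the REMOVE."""
    d = Directory()
    old_fh = d.create("log")

    def do_remove():
        if guarded:
            d.remove_guarded("log", old_fh)
        else:
            d.remove_unguarded("log")

    do_remove()                 # the original REMOVE
    new_fh = d.create("log")    # a subsequent CREATE of the same name
    try:
        do_remove()             # the stray duplicate arrives late
    except StaleRemove:
        pass                    # the guard rejected the duplicate
    return d.lookup("log") == new_fh
```

With guarded=True the newly created file survives the duplicate; with
guarded=False it is unlinked.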
	
>The WRITE_APPEND op contains a "record-length" as well as the
>"data-length", and the return value includes a "record-offset".
>The record-length indicates how much data the client wants to
>atomically append.
>The data-length indicates how much data is in the request.  This bit
>goes at the end (or maybe the beginning, whichever is easier for
>servers). The remainder of the record (if any) is filled with nuls.
>The "record-offset" indicates the file address of the start of the
>appended record.  The client can then fill in the remainder with
>normal writes.

This is a cheap form of the offset reservations in stable storage
that Peter Staubach mentioned. And my objection is the same:
what if the client doesn't follow through?
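
A minimal Python sketch of the proposed WRITE_APPEND semantics, as
described in the quoted text, may help show the objection. The
AppendFile class and its method names are hypothetical, and placing
the data at the start of the record is an assumption (the proposal
leaves open whether it goes at the beginning or the end).

```python
class AppendFile:
    """Toy file that supports a hypothetical WRITE_APPEND."""

    def __init__(self):
        self.data = bytearray()

    def write_append(self, record_length, payload):
        # Reserve record_length bytes at end of file, store the payload
        # at the start of the record, and nul-fill the remainder.
        if len(payload) > record_length:
            raise ValueError("data-length exceeds record-length")
        record_offset = len(self.data)
        self.data += payload + bytes(record_length - len(payload))
        return record_offset

    def write(self, offset, payload):
        # Ordinary WRITE at an explicit offset, extending with nuls
        # if the write goes past the current end of file.
        end = offset + len(payload)
        if end > len(self.data):
            self.data += bytes(end - len(self.data))
        self.data[offset:end] = payload


def abandoned_record():
    """One client finishes its record; another never follows through."""
    f = AppendFile()
    off = f.write_append(8, b"abcd")
    f.write(off + 4, b"efgh")        # client fills in the remainder
    f.write_append(8, b"ij")         # this client then goes away
    return bytes(f.data)
```

In this sketch abandoned_record() leaves six nul bytes at the end of
the file: the reserved part of the second record that nobody ever
filled in.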

>> 	- In the event of a server crash, duplicate records will appear
>> 		in the file, unless the server implements a fair
>> 		amount of state in stable storage. Currently clients do
>> 		not experience duplicate records when doing an APPEND.
>> 		
>> 	- In absence of an append delegation, append writes will be
>> 		at NFSv2 speeds. Currently they are done at NFSv3 speeds.
>
>And currently, clients can be certain that contemporaneous appends
>from separate clients will corrupt the file.

Write sharing in absence of conflicting record locks
or SHARE reservations is a fantasy conjured by the masters of
FUD at OSF and certain halls of academia. 

>So, do you want broken-semantic-A or broken-semantic-B?

I want the single client writer scenario to work as well with NFSv4 as
it does with NFSv3, so my answer is A. That's my standard for
any changes to support APPEND.

>I suspect most serious NFSv4 servers will bend over backwards to make
>delegations work because that is, I think, one of the great strengths

I still run into NFS client implementations with broken close-to-open
consistency; forgive me if I take your idealism with a large grain of
salt.

>of NFSv4.  If that is true, then the speed issue is probably moot, and
>the correctness is paramount.

I believe if you go back to the archives, you'll
find that Dave Noveck gave his caching proposal,
which led to NFSv4 delegations, the title:

	"Caching, not cache consistency"
	
NFSv4 guarantees close-to-open consistency. That's it.  If the server
finds that it cannot make callbacks back to the client (which can
happen even if the client and server are connected with a 1 metre
cross-over cat-5 cable), then there's no delegation. That means that
99.99% of the time appends will work quickly and accurately (as they
would without your proposal). 0.01% of the time they will work
accurately and at NFSv3 speeds without your proposal. With your
proposal, in that 0.01% of the time they will work at NFSv2 speeds, and
they'll sometimes produce quirks:  nuls, duplicate records, and that's
even without write sharing. Who really wants to deal with those
customer calls?
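
The duplicate-record quirk can be sketched in a few lines of Python.
This is a toy model under stated assumptions: ToyServer, its methods,
and the xid value are hypothetical, and the server is assumed to keep
its duplicate request cache only in volatile memory. It contrasts a
server-chosen append offset with a WRITE at a client-chosen offset,
which is idempotent when retried.

```python
class ToyServer:
    """Toy server: records on stable storage, reply cache in memory."""

    def __init__(self):
        self.records = []        # survives a crash
        self.reply_cache = {}    # xid -> record index, lost on crash

    def append(self, xid, record):
        # Server-side append: the server picks the position.
        if xid in self.reply_cache:
            return self.reply_cache[xid]   # retransmission recognised
        self.records.append(record)
        index = len(self.records) - 1
        self.reply_cache[xid] = index
        return index

    def write_at(self, index, record):
        # Client-chosen position; assumes index <= len(self.records).
        if index == len(self.records):
            self.records.append(record)
        else:
            self.records[index] = record

    def crash_and_reboot(self):
        self.reply_cache.clear()


def retry_after_crash(server_side_append):
    """Send one record, lose the reply to a crash, then retransmit."""
    server = ToyServer()
    for _attempt in range(2):
        if server_side_append:
            server.append(xid=7, record="rec")
        else:
            server.write_at(0, "rec")
        server.crash_and_reboot()
    return server.records
```

Here retry_after_crash(True) ends with the record stored twice, while
retry_after_crash(False) ends with a single copy, because rewriting
the same bytes at the same offset changes nothing.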
		
>> >In this APPEND case, the server does not need to be able to
>> >call-back the client to have the delegation released, so that is one
>> >possible impediment to delegations that need not be a problem.
>> 
>> What if another client wants to write or read the file that has
>> an APPEND delegation on it?
>> 
>
>Maybe you missed that part of the proposal.

I didn't, but it doesn't answer my question.

>] The "append" flag gets squeezed into OPEN somehow, it isn't important
>] how.  As the client will already have the file open, this will, in

Why will it have the file already open?

>] effect, upgrade the open to also have the "append" flag.
>] In this situation, the server is "strongly encouraged" to grant a
>] write delegation, though it may set the "recall" flag in the response
>] to say that the client should return the delegation as soon as
>] possible.
>
>The "recall" flag is the important point.  I am suggesting that the
>server issues a delegation even if there is contention, or if there is
>no way to reclaim the delegation with a call-back.  In these cases it
>sets the "recall" flag in the delegation so that the client will
>release the delegation as soon as it has flushed its appends.
>
>If you reject the APPEND_WRITE part, then that just leaves these two
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I do.

>small changes to the protocol:
>  1/ adding an "append" flag to open to signal that a delegation would
>     be "a real good idea".

I'm ok with this; it meets my aforementioned standard.

>  2/ allowing the server to signal 'recall' in a non CLAIM_PREVIOUS
>     delegation, particularly in response to an "append" open.

I've no idea what that means. Recall means callback. If there is
another client that wants to read or write the file, you have to issue
a callback; return an error to the second client; or allow the access
without a callback, the latter two being incorrect. Perhaps your
intent is to include a RECALL flag in all responses to
operations that take a stateid?

	-mre


