Master Leases for Berkeley DB

Susan LoVerso
sue@sleepycat.com
Rev 1.1
2007 Feb 2

What are Master Leases?

A master lease is a mechanism whereby clients grant mastership rights to a site, and that master, by holding lease rights, can provide a guarantee of durability to a replication group for a given period of time. By granting a lease to a master, a client agrees not to participate in an election of a new master until that granted lease has expired. By holding a collection of granted leases, a master can serve authoritative read requests from applications. A read operation on a master that holds leases can guarantee several things to the application:

Authoritative reads: a guarantee that the data being read by the application is durable and can never be rolled back.
Freshness: a guarantee that the data being read by the application at the master is not stale.
Master viability: a guarantee that a current master with valid leases will not encounter a duplicate master situation.

Requirements

The requirements of DB to support this include:

Once leases are turned on, users can choose, per read operation, whether to honor or ignore them.
We are providing read authority on the master only. A read on a client is equivalent to a read while ignoring leases.
We guarantee that data committed on a master that has been read by an application on the master will not be rolled back. Data read on a client or while ignoring leases or data successfully updated/committed but not read, may be rolled back.
Unless leases are ignored, a master will not return successfully from a read operation unless it holds a majority of leases.
Master leases will remove the possibility of a current/correct master being "shot down" by DUPMASTER. NOTE: Old/expired masters may, however, discover a later master and return DUPMASTER to the application.
Any send callback failure must result in premature lease expiration on the master.
Users who change the system clock during master leases void the guarantee and may get undefined behavior. We assume time always runs forward. [document this.]
Clients are forbidden from participating in elections while they have an outstanding lease granted to another site.
Clients are forbidden from accepting a new master while they have an outstanding lease granted to another site.
Clients are forbidden from upgrading themselves to master while they have an outstanding lease granted to another site.
When asked for a lease grant explicitly by the master, the client cannot grant the lease to the master unless the LSN in the master's request has been processed by this client.

The requirements of the application using leases include:

Users must implement (Base API users on their own, RepMgr users via configuration) a majority (or larger) ACK policy.
The application must use the election mechanism to decide a master. It may not simply declare a site master.
The send callback must return an error if the majority ACK policy is not met for PERM records.
Users must set the number of sites in the group.
Using leases in a replication group is all-or-none. Therefore, if a site knows it is using leases, it can assume other sites are also.
All applications that care about read guarantees must forward or perform all reads on the master. Reading on the client means a read ignoring leases.

There are some open questions remaining.

There is one major showstopper issue, see Crashing - Potential problem near the end of the document. We need a better solution than the one shown there (writing to disk every time a lease is granted). Perhaps just documenting that durability means it must be flushed to disk before success to avoid that situation?
What about db->join? Users can call join, but the calls on the join cursor to get the data would be subject to leases and therefore protected. Ok, this is not an open question.
What about other read-like operations? Clearly DB->get, DB->pget, DBC->get, DBC->pget need lease checks. However, other APIs use keys. DB->key_range provides an estimate only, so it shouldn't need lease checks. DB->stat provides exact counts in the bt_nkeys and bt_ndata fields. Are those fields authoritative enough that providing their values implies a durability guarantee, and should DB->stat therefore be subject to lease verification? DBC->count provides a count of the number of data items associated with a key. Is this authoritative information? This is similar to stat: should it be subject to lease verification?
Do we require master lease checks on write operations? I think lease checks are not needed on write operations. It doesn't add correctness and adds a lot of complexity (checking leases in put, del, and cursors, then what about rename, remove, etc).
Do master leases give an iron-clad guarantee of never rolling back a transaction? No, but it should mean that a committed transaction can never be read on a master unless the lease is valid. A committed transaction on a master that has never been presented to the application may get rolled back.
Do we need to quarantine or prevent reads on an ex-master until sync-up is done? No. A master that is simply downgraded to client or crashes and reboots is now a client. Reading from that client is the same as saying Ignore Leases.
What about adding and removing sites while leases are active? This is SR 14778. A consistent nsites value is required by master leases. The resolution of 14778 is a prerequisite - currently owned by Alan. It isn't clear to me what a master is supposed to do if the value of nsites gets smaller while leases are active. Perhaps it leaves its larger table intact and simply checks for a smaller number of granted leases?
Can users turn leases off? No. There is no planned turn leases off API.
Clock skew will be a percentage. However, the smallest, 1%, is probably rather large for clock skew. Percentage was chosen for simplicity and similarity to other APIs. What granularity is appropriate here?

API Changes

The API changes that are visible to the user are fairly minimal. There are a few API calls they need to make to configure master leases and then there is the API call to turn them on. There is also a new flag to existing APIs that allows read operations to ignore leases and return data that may not be durable.

Lease Timeout

There is a new timeout the user must configure for leases called DB_REP_LEASE_TIMEOUT. This timeout will be new to the dbenv->rep_set_timeout method. DB_REP_LEASE_TIMEOUT has no default, and the user is required to configure it before turning on leases (obviously, this timeout need not be set if leases will not be used). The timeout is the amount of time the lease is valid on the master and how long it is granted on the client. It must be the same value on all sites (like log file size). [Document this requirement. We cannot enforce it across the group easily.] The timeout used when refreshing leases is the DB_REP_ACK_TIMEOUT for RepMgr applications. For Base API applications, lease refreshes will use the same mechanism as PERM messages and they should have no additional burden. That refresh timeout is the amount of time a reader will wait to refresh leases before returning failure to the application from a read operation.

This timeout will be both stored with its original value, and also converted to a db_timespec using the DB_TIMEOUT_TO_TIMESPEC macro and have the clock skew accounted for and stored in the shared rep structure:

db_timeout_t lease_timeout;
db_timespec lease_duration;

NOTE: By sending the lease refresh during DB operations, we are forcing/assuming that the operation's process has a replication transport function set. That is obviously the case for write operations, but would it be a burden for read processes (on a master)? I think mostly not, but if we need leases for DB->stat then we need to document it as it is certainly possible for an application to have a separate or dedicated stat application or attempt to use db_stat (which will not work if leases must be checked).

Leases should be checked after the local operation so that we don't have a window/boundary where we check leases first, get descheduled, then lose our lease and then perform the operation. Do the operation, then check leases before returning to the user.

Using Leases

There is a new API that the user must call to tell the system to use the lease mechanism. The method must be called before the application calls dbenv->rep_start or dbenv->repmgr_start. This new method is:

    dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)

The clock_scale_factor parameter is interpreted as a percentage, greater than 100 (so that a floating point number can be passed to the API as an integer), that represents the maximum skew between any two sites' clocks. That is, a clock_scale_factor of 150 suggests that the greatest discrepancy between clocks is that one runs 50% faster than the others. Both the master and client sides compensate for possible clock skew. The master uses the value to compensate in case the replica has a slow clock and replicas compensate in case they have a fast clock. This scaling factor will need to be divided by 100 on all sites to truly represent the percentage for adjustments made to time values.

Assume the slowest replica's clock is a factor of clock_scale_factor slower than the fastest clock. Using that assumption, if the fastest clock goes from time t1 to t2 in X seconds, the slowest clock does it in (clock_scale_factor / 100) * X seconds.

The flags parameter is not currently used.

When the dbenv->rep_set_lease method is called, we will set a configuration flag indicating that leases are turned on:

#define REP_C_LEASE <value>

We will also record the u_int32_t clock_skew value passed in. The rep_set_lease method will not allow calls after rep_start. If multiple calls are made prior to calling rep_start then later calls will overwrite the earlier clock skew value.

We need a new flag to prevent calling rep_set_lease after rep_start. The simplest solution would be to reject the call to rep_set_lease if REP_F_CLIENT or REP_F_MASTER is set. However that does not work in the cases where a site cleanly closes its environment and then opens without running recovery. The replication state will still be set. The prevention will be implemented as:

#define REP_F_START_CALLED <some bit value>

In __rep_start, at the end:

if (ret == 0) {
	/* Record that rep_start succeeded so later rep_set_lease calls fail. */
	REP_SYSTEM_LOCK(dbenv);
	F_SET(rep, REP_F_START_CALLED);
	REP_SYSTEM_UNLOCK(dbenv);
}

In __rep_env_refresh, if we are the last reference closing the env (we already check for that):

F_CLR(rep, REP_F_START_CALLED);
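
The interaction between rep_set_lease, rep_start and the final environment close can be sketched as a small self-contained model. This is only an illustration under stated assumptions: the ex_ names, the bit values and the struct layout are hypothetical stand-ins, not the real shared rep structure, and locking is omitted.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for REP_C_LEASE and REP_F_START_CALLED. */
#define EX_REP_C_LEASE		0x01u
#define EX_REP_F_START_CALLED	0x02u

/* Hypothetical, reduced model of the shared rep structure. */
struct ex_rep {
	uint32_t config;	/* Configuration bits. */
	uint32_t flags;		/* State bits. */
	uint32_t clock_skew;	/* clock_scale_factor as passed in. */
};

/* Models the end of a successful rep_start. */
void
ex_rep_start_done(struct ex_rep *rep)
{
	rep->flags |= EX_REP_F_START_CALLED;
}

/* Models the last reference closing the environment. */
void
ex_env_refresh(struct ex_rep *rep)
{
	rep->flags &= ~EX_REP_F_START_CALLED;
}

/*
 * Models rep_set_lease: rejected once rep_start has been called,
 * otherwise later calls overwrite the earlier clock skew value.
 */
int
ex_rep_set_lease(struct ex_rep *rep, uint32_t clock_scale_factor)
{
	if (rep->flags & EX_REP_F_START_CALLED)
		return (EINVAL);
	rep->config |= EX_REP_C_LEASE;
	rep->clock_skew = clock_scale_factor;
	return (0);
}
```

Because the flag is cleared only when the last reference closes the environment, a site that closes cleanly and reopens without recovery can configure leases again, even though its replication role bits are still set.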

[Please review the logic here carefully.] In order to avoid run-time floating point operations on db_timespec structures, when a site is declared as a client or master in rep_start we will pre-compute the lease duration based on the integer-based clock skew and the integer-based lease timeout. A master should set a replica's lease expiration to the start time of the sent message + (lease_timeout / clock_scale_factor) in case the replica has a slow clock. Replicas extend their leases to received message time + (lease_timeout * clock_scale_factor) in case this replica has a fast clock. Therefore, the computation will be as follows if the site is becoming a master:

db_timeout_t tmp;
tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));
rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);

Similarly, on a client the computation is:

tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));

When a site changes state, its lease duration will change based on whether it is becoming a master or client and it will be recomputed from the original values. Note that these computations, coupled with the fact that the lease on the master is computed based on the master's time that it sent the message means that leases on the master are more conservatively computed than on the clients.
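
As a worked illustration of the two computations, the sketch below uses integer arithmetic instead of the double-based expressions shown above; that substitution is a choice made for this example only. It assumes the lease timeout is expressed in microseconds and that clock_scale_factor is non-zero. The ex_ function names are hypothetical.

```c
#include <stdint.h>
#include <time.h>

/*
 * Skew-adjusted lease duration in microseconds.
 * A master shortens the lease (divide by the scale factor); a client
 * lengthens it (multiply by the scale factor).  The scale factor is a
 * percentage, so 150 means a factor of 1.5.  It must be non-zero.
 */
uint64_t
ex_lease_usec(uint64_t lease_timeout, uint32_t clock_scale_factor, int is_master)
{
	if (is_master)
		return (lease_timeout * 100 / clock_scale_factor);
	return (lease_timeout * clock_scale_factor / 100);
}

/* Convert microseconds to a timespec (seconds plus nanoseconds). */
struct timespec
ex_usec_to_timespec(uint64_t usec)
{
	struct timespec ts;

	ts.tv_sec = (time_t)(usec / 1000000);
	ts.tv_nsec = (long)((usec % 1000000) * 1000);
	return (ts);
}
```

With a one-second timeout and a clock_scale_factor of 150, the master's duration comes out at 666666 microseconds and the client's at 1500000 microseconds, which shows how the master's view of the lease always ends before the client's.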

The dbenv->rep_set_lease method must be called after dbenv->open, similar to dbenv->rep_set_config. The reason is so that we can check that this is a replication environment and we have access to the replication shared memory region.

Read Operations

Authoritative read operations on the master with leases enabled will abide by leases by default. We will provide a flag that allows an operation on a master to ignore leases. All read operations on a client imply ignoring leases. If an application wants authoritative reads they must forward the read requests to the master and it is the application's responsibility to provide the forwarding. The consensus was that forcing DB_IGNORE_LEASE on client read operations (with leases enabled, obviously) was too heavy handed. Read operations on the client will ignore leases, but do no special flag checking.

The flag will be called DB_IGNORE_LEASE and it will be a flag that can be OR'd into the DB access method and cursor operation values. It will be similar to the DB_READ_UNCOMMITTED flag. [Keith, I will need your help here for finding a bit in the DB flags that isn't in use for my new flag. That looks like a very full and confusing area...]

The methods that will adhere to leases are:

DB->get
DB->pget
DBC->get
DBC->pget
DB->stat [maybe?]
DBC->count [maybe?]

The code that will check leases for a client reading would look something like this, if we decide to become heavy-handed:

if (IS_REP_CLIENT(dbenv)) {
	[get to rep structure]
	if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {
		db_err("Read operations must ignore leases or go to master");
		ret = EINVAL;
		goto err;
	}
}

On the master, the new code to abide by leases is more complex. After the call to perform the operation we will check the lease. In that checking code, the master will see if it has a valid lease. If so, then all is well. If not, it will try to refresh the leases. If that refresh attempt results in leases, all is well. If the refresh attempt does not get leases, then the master cannot respond to the read as an authority and we return an error. The new error is called DB_REP_LEASE_EXPIRED. The location of the master lease check is down after the internal call to read the data is successful:

if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {
	[get to rep structure]
	if (FLD_ISSET(rep->config, REP_C_LEASE) &&
	    (ret = __rep_lease_check(dbenv)) != 0) {
		/*
		 * We don't hold the lease.
		 */
		goto err;
	}
}

See below for the details of __rep_lease_check.

Also note that if leases (or replication) are not configured, then DB_IGNORE_LEASE is a no-op. It is ignored (and won't error) if used when leases are not in effect. The reason is so that we can generically set that flag in utility programs like db_dump that walk the database with a cursor. Note that db_dump is the only utility that reads with a cursor.

Nsites and Elections

The call to dbenv->rep_set_nsites must be performed before the call to dbenv->rep_start or dbenv->repmgr_start. This document assumes either that SR 14778 gets resolved, or assumes that the value of nsites is immutable. The master and all clients need to know how many sites and leases are in the group. Clients need to know for elections. The master needs to know for the size of the lease table and to know what value a majority of the group is. [Until 14778 is resolved, the master lease work must assume nsites is immutable and will therefore enforce that this is called before rep_start using the same mechanism as rep_set_lease.]

Elections and leases need to agree on the number of sites in the group. Therefore, when leases are in effect on clients, all calls to dbenv->rep_elect must set the nsites parameter to 0. The rep_elect code path will return EINVAL if REP_C_LEASE is set and nsites is non-0.

Lease Management

Message Changes

In order for clients to grant leases to the master a new message type must be added for that purpose. This will be the REP_LEASE_GRANT message. Granting leases will be a result of applying a DB_REP_PERMANENT record and therefore we do not need any additional message in order for a master to request a lease grant. The REP_LEASE_GRANT message will pass a structure as its message DBT:

typedef struct __rep_lease_grant {
	db_timespec msg_time;		/* Master's message time, echoed back. */
#ifdef DIAGNOSTIC
	db_timespec expire_time;	/* Client's computed expiration. */
#endif
} REP_GRANT_INFO;

In the REP_LEASE_GRANT message, the client is actually giving the master several pieces of information. We only need the echoed msg_time in this structure because everything else is already sent. The client is really sending the master:

Its EID (parameter to rep_send_message and rep_process_message)
The PERM LSN this message acknowledged (sent in the control message)
Unique identifier echoed back to master (msg_time sent in message as above)

On the client, we always maintain the maximum PERM LSN already in lp->max_perm_lsn.

Local State Management

Each client must maintain a db_timespec timestamp containing the expiration of its granted lease. This field will be in the replication shared memory structure:

db_timespec grant_expire;

This timestamp already takes into account the clock skew. All new fields must be initialized when the region is created. Whenever we grant our master lease and want to send the REP_LEASE_GRANT message, this value will be updated. It will be used in the following way:

db_timespec mytime;
DB_LSN perm_lsn;
DBT lease_dbt;
REP_GRANT_INFO gi;

timespecclear(&mytime);
memset(&lease_dbt, 0, sizeof(lease_dbt));
memset(&gi, 0, sizeof(gi));
/* Candidate expiration: current time plus the skew-adjusted duration. */
__os_gettime(dbenv, &mytime);
timespecadd(&mytime, &rep->lease_duration);
MUTEX_LOCK(rep->clientdb_mutex);
perm_lsn = lp->max_perm_lsn;
MUTEX_UNLOCK(rep->clientdb_mutex);
REP_SYSTEM_LOCK(dbenv);
/* Only ever extend the grant; never move the expiration backward. */
if (timespeccmp(&mytime, &rep->grant_expire, >))
	rep->grant_expire = mytime;
/* Echo the master's message time back as the unique ID. */
gi.msg_time = msg->msg_time;
#ifdef DIAGNOSTIC
gi.expire_time = rep->grant_expire;
#endif
lease_dbt.data = &gi;
lease_dbt.size = sizeof(gi);
REP_SYSTEM_UNLOCK(dbenv);
__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);

This updating of the lease grant will occur in the PERM code path when we have successfully applied the permanent record.
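
The client-side rule (extend the grant, never shorten it, and treat the grant as outstanding until it expires) can be modeled in isolation. This is a minimal sketch using the standard struct timespec in place of db_timespec; the ex_ helpers are hypothetical, and treating the exact expiration instant as still outstanding is an assumption made here to stay on the conservative side.

```c
#include <time.h>

/* Returns non-zero if a is strictly later than b. */
int
ex_ts_after(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec > b->tv_sec);
	return (a->tv_nsec > b->tv_nsec);
}

/* Add duration d to *t, keeping tv_nsec below one second. */
void
ex_ts_add(struct timespec *t, const struct timespec *d)
{
	t->tv_sec += d->tv_sec;
	t->tv_nsec += d->tv_nsec;
	if (t->tv_nsec >= 1000000000L) {
		t->tv_sec++;
		t->tv_nsec -= 1000000000L;
	}
}

/*
 * Extend the client's grant to now + lease_duration, but only if
 * that is later than the expiration already recorded.
 */
void
ex_grant_extend(struct timespec *grant_expire, struct timespec now,
    const struct timespec *lease_duration)
{
	ex_ts_add(&now, lease_duration);
	if (ex_ts_after(&now, grant_expire))
		*grant_expire = now;
}

/*
 * A grant is outstanding until the current time passes grant_expire.
 * While this returns non-zero the client must stay out of elections.
 */
int
ex_grant_outstanding(const struct timespec *grant_expire,
    const struct timespec *now)
{
	return (!ex_ts_after(now, grant_expire));
}
```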

Maintaining Leases on the Master/Rep_start

The master maintains a lease table that it checks when fulfilling a read request that is subject to leases. This table is initialized when a site calls dbenv->rep_start(DB_MASTER) and the site is undergoing a role change (i.e. a master making additional calls to dbenv->rep_start(DB_MASTER) does not affect an already existing table).

When a non-master site becomes master, it must do two things related to leases on a role change. First, a client cannot upgrade to master while it has an outstanding lease granted to another site. If a client attempts to do so, an error, EINVAL, will be returned. The only way this should happen is if the application simply declares a site master, instead of using elections. Elections will already wait for leases to expire before proceeding. (See below.) [I believe an error is sufficient and we do not need, for version 1 at least, any other complex waiting mechanism. Applications that don't use elections and declare masters are quite rare.]

Second, once we are proceeding with becoming a master, the site must allocate the table it will use to maintain lease information. This table will be sized based on nsites and it will be an array of the following structure:

typedef struct {
	int eid;		/* EID of client site. */
	db_timespec start_time;	/* Unique time ID client echoes back on grants. */
	db_timespec end_time;	/* Master's lease expiration time. */
	DB_LSN lease_lsn;	/* Durable LSN this lease applies to. */
	u_int32_t flags;	/* Unused for now?? */
} REP_LEASE_ENTRY;

Granting Leases

It is the burden of the application to make sure that all sites in the group are using leases, or none are. Therefore, when a client processes a PERM log record that arrived from the master, it will grant its lease automatically if that record is permanent (i.e. DB_REP_ISPERM is being returned), and leases are configured. A client will not send a lease grant when it is processing log records (even PERM ones) it receives from other clients that use client-to-client synchronization. The reason is that the master requires a unique time-of-msg ID (see below) that the client echoes back in its lease grant and it will not have such an ID from another client.

The master stores a time-of-msg ID in each message and the client simply echoes it back to the master. In its lease table, the master keeps the base time-of-msg for each valid lease. When a REP_LEASE_GRANT message comes in, the master does a number of things:

Pulls the echoed timespec from the client message, into msg_time.
Finds the entry in its lease table for the client's EID. It walks the table searching for the ID. EIDs of DB_EID_INVALID are illegal. Either the master will find the entry, or it will find an empty slot in the table (i.e. it is still populating the table with leases).
If this is a previously unknown site lease, the master initializes the entry by copying to the eid, start_time, and lease_lsn fields. The master also computes the end_time based on the adjusted rep->lease_duration.
If this is a lease from a previously known site, the master must perform timespeccmp(&msg_time, &table[i].start_time, >) and only update the end_time of the lease when this is a more recent message. If it is a more recent message, then we should update the lease_lsn to the LSN in the message.
Lease durations are computed taking the clock skew into account: clients compute them based on the current time and the master computes them based on the original sending time. For diagnostic purposes only, I also plan to send the client's expiration time. The client errs on the side of computing a larger lease expiration time and the master errs on the side of computing a smaller duration. Since both take the clock skew into account, the client's expiration time should never be smaller than the master's computed expiration time; if it is, the configured clock skew value may not be correct.

Any log records (new or resent) that originate from the master and result in DB_REP_ISPERM get an ack.

Refreshing Leases

Leases get refreshed when a master receives a REP_LEASE_GRANT message from a client. There are three pieces to lease refreshment.

Lazy Lease Refreshing on Read

If the master discovers that leases are expired during the read operation, it attempts to refresh its collection of lease grants. It does this by calling a new function __rep_lease_refresh. This function is very similar to the already-existing function __rep_flush. Basically, to refresh the lease, the master simply needs to resend the last PERM record to the clients. The requirements state that when the application send function returns successfully from sending a PERM record, the majority of clients have that PERM LSN durable. We will have a new public DB error return called DB_REP_LEASE_EXPIRED that will be returned back to the caller if the master cannot assert its authority. The code will look something like this:

/*
 * Use lp->max_perm_lsn on the master (currently not used on the master)
 * to keep track of the last PERM record written through the logging system.
 * need to initialize lp->max_perm_lsn in rep_start on role_chg.
 */
call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT
if failure
	expire leases
	return lease expired error to caller
else /* success */
	recheck lease table
	/*
	 * We need to recheck the lease table because the client
	 * lease grant messages may not be processed yet, or got
	 * lost, or racing with the application's ACK messages or
	 * whatever. 
	 */
	if we have a majority of valid leases
		return success
	else
		return lease expired error to caller
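
The control flow of the pseudocode above can be exercised with a small stand-alone model. This is a sketch only: the transport and the lease table are reduced to fields in a hypothetical struct ex_master, the stub functions stand in for the real resend and expiration code, and EX_LEASE_EXPIRED is a placeholder value, not the real DB_REP_LEASE_EXPIRED.

```c
/* Placeholder for DB_REP_LEASE_EXPIRED; the real value is not assumed. */
#define EX_LEASE_EXPIRED	(-1)

/* Hypothetical, reduced model of the master's lease state. */
struct ex_master {
	int send_result;	/* What the stubbed transport returns; 0 is success. */
	int valid_leases;	/* Valid lease grants currently in the table. */
	int needed_leases;	/* Grants required for a majority. */
};

/* Stub: resend the last PERM record with DB_REP_PERMANENT. */
int
ex_resend_last_perm(struct ex_master *m)
{
	return (m->send_result);
}

/* Stub: forced expiration leaves no valid lease in the table. */
void
ex_expire_leases(struct ex_master *m)
{
	m->valid_leases = 0;
}

/*
 * Lazy refresh on read: resend, then recheck the table.  A successful
 * send is not enough by itself, because grant messages may be lost or
 * not yet processed.
 */
int
ex_lease_refresh(struct ex_master *m)
{
	if (ex_resend_last_perm(m) != 0) {
		ex_expire_leases(m);
		return (EX_LEASE_EXPIRED);
	}
	if (m->valid_leases >= m->needed_leases)
		return (0);
	return (EX_LEASE_EXPIRED);
}
```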

Ongoing Update Refreshment

Second is having the master indicate to the client it needs to send a lease grant in response to the current PERM log message. The problem is that acknowledgements must contain a master-supplied message timestamp that the client sends back to the master. We need to modify the structure of the log record messages when leases are configured so that when a PERM message is sent, the master sends, and the client expects, the message timestamp. There are three fairly straightforward and different implementations to consider.

1. Adding the timestamp to the REP_CONTROL structure. If this option is chosen, then the code trivially sends back the timestamp in the client's reply. There is no special processing done by either side with the message contents. So, on a PERM log record, the master will send a non-zero timestamp. On a normal log record the timestamp will be zero or some known invalid value. If the client sees a non-zero timestamp, it sends a REP_LEASE_GRANT with the lp->max_perm_lsn after applying that log record. If it is zero, then the client does nothing different. The advantage is ease of code. The disadvantage is that for mixed version systems, the client is now dealing with different sized control structures. We would have to retain the old control structure so that during a mixed version group the (upgraded) clients can use, expect and send old control structures to the master. This is unfortunate, so let's consider additional implementations that don't require modifying the control structure.
2. Adding a new REPCTL_LEASE flag to the list of flags for the control structure, but do not change the control structure fields. When a master wants to send a message that needs a lease ack, it sets the flag. Additionally, instead of simply sending a log record DBT as the rec parameter for replication, we would send a new structure that had the timestamp first and then the record (similar to the bulk transfer buffer). The advantage of this is that the control structure does not change. Disadvantages include more special-cased code in the normal code path where we have to check the flag. If the flag is set we have to extract the timestamp value and massage the incoming data to pass on the real log record to rep_apply. On bulk transfer, we would just add the timestamp into the buffer. On normal transfers, it would incur an additional data copy on the master side. That is unfortunate. Additionally, if this record needs to be stored in the temp db, we need some way to get it back again later or rep_apply would have to extract the timestamp out when it processed the record (either live or from the temp db).
3. Adding a different message type, such as REP_LOG_ACK. Similarly to REP_LOG_MORE this message would be a special-case version of a log record. We would extract out the timestamp and then handle as a normal log record. This implementation is rejected because it actually would require three new message types: REP_LOG_ACK, REP_LOG_ACK_MORE, REP_BULK_LOG_ACK. That is just too ugly to contemplate.

[Slight digression: it occurs to me while writing about #2 and #3 above, that our implementation of all of the *_MORE messages could really be implemented with a REPCTL_MORE flag instead of a separate message type. We should clean that up and simplify the messages but not part of master leases. Hmm, taking that thought process further, we really could get rid of the REP_BULK_* messages as well if we added a REPCTL_BULK flag. I think we should definitely do it for the *_MORE messages. I am not sure we should do it for bulk because the structure of the incoming data record is vastly different.]

Of these options, I believe that modifying the control structure is the best alternative. The handling of the old structure will be very isolated to code dealing with old versions and is far less complicated than injecting the timestamp into the log record DBT and doing a data copy. Actually, I will likely combine #1 and the flag from #2 above. I will have the REPCTL_LEASE flag that indicates a lease grant reply is expected and have the timestamp in the control structure. [Is that necessary? It feels cleaner, but we could also just treat a non-zero timestamp as meaning "send a reply" without having it directed by a flag from the master. That means we would not need the flag, but it builds an assumption into the code instead of having the client simply send a grant when the flag says to do so. See Upgrades/Mixed versions below too.] Also I will probably add in a spare field or two for future use in the REP_CONTROL structure.

Gap processing

No matter which implementation we choose for ongoing lease refreshment, gap processing must be considered. The code above assumes the timestamps will be placed on PERM records only. Normal log records will not have a timestamp, nor a flag or anything else like that. However, any log message can fill a gap on a client and result in the processing of that normal log record to return DB_REP_ISPERM because later records were also processed.

The current implementation should work fine in that case because when we store the message in the client temp db we store both the control DBT and the record DBT. Therefore, when a normal record fills a gap, the later PERM record, when retrieved will look just like it did when it arrived. The client will have access to the LSN, and the timestamp, etc. However, it does mean that sending the REP_LEASE_GRANT message must take place down in __rep_apply because that is the only place we have access to the contents of those stored records with the timestamps.

There are two logical choices to consider for granting the lease when processing an update. As we process (either a live record or one read from the temp db after filling a gap) a PERM message, we send the REP_LEASE_GRANT message for each PERM record we successfully apply. Or, second, we keep track of the largest timestamp of all PERM records we've processed and at the end of the function after we've applied all records, we send back a single lease grant with the max_perm_lsn and a new max_lease_timestamp value to the master. The first is easier to implement, the second results in possibly slightly fewer messages at the expense of more bookkeeping on the client.

A third, more complicated option would be to have the message timestamp on all records, but grants are only sent on the PERM messages. A reason to do this is that the later timestamp of a normal log record would be used as the timestamp sent in the reply and the master would get a more up to date timestamp value and a longer lease.

[Concern about gap processing here.] If we change the REP_CONTROL structure to include the timestamp, we potentially break or at least need to revisit the gap processing algorithm. That code assumes that the control and record elements for the same LSN look the same each and every time. The code stores the control DBT as the key and the rec DBT as the data. We use a specialized compare function to sort based on the LSN in the control DBT. With master leases, the same record transmitted by a master multiple times or client for the same LSN will be different because the timestamp field will not be the same. Therefore, the client will end up with duplicate entries in the temp database for the same LSN. Both solutions (adding the timestamp to REP_CONTROL and adding a REPCTL_LEASE flag) can yield duplicate entries. The flag would cause the same record from the master and client to be different as well.

Handling Incoming Lease Grants

The third piece of lease management is handling the incoming REP_LEASE_GRANT message on the master. When this message is received, the master must do the following:

REP_SYSTEM_LOCK
msg_timestamp = cntrl->timestamp;
client_lease = __rep_lease_entry(dbenv, client eid)
if (client_lease == NULL) {
	initial lease for this site, DB_ASSERT there is space in the table
	add this to the table if there is space
} else {
	compare msg_timestamp with client_lease->start_time
	if (msg_timestamp is more recent && msg_lsn >= lease LSN)
		update entry in table
}
REP_SYSTEM_UNLOCK
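The steps above could look roughly like the following in C. This is only a sketch under several assumptions: the table is a small fixed array, the entry layout and the names lease_find, lsn_ge and lease_grant are hypothetical, the lease end time is taken to be the grant timestamp plus the configured timeout, and the REP_SYSTEM_LOCK/UNLOCK calls are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEASE_SLOTS	8	/* Hypothetical table size. */

/* Hypothetical, simplified lease entry; times are microseconds. */
struct lease_entry {
	int in_use;
	int eid;
	uint64_t start_us, end_us;
	uint32_t lsn_file, lsn_offset;
};

struct lease_table {
	struct lease_entry slot[LEASE_SLOTS];
	uint64_t timeout_us;
};

/* Return the entry for eid, or NULL if this site has no entry yet. */
struct lease_entry *lease_find(struct lease_table *t, int eid)
{
	size_t i;

	for (i = 0; i < LEASE_SLOTS; i++)
		if (t->slot[i].in_use && t->slot[i].eid == eid)
			return (&t->slot[i]);
	return (NULL);
}

/* Nonzero if LSN (af, ao) is at or beyond LSN (bf, bo). */
int lsn_ge(uint32_t af, uint32_t ao, uint32_t bf, uint32_t bo)
{
	return (af > bf || (af == bf && ao >= bo));
}

/* Record a grant; returns 1 if the table changed, 0 if the grant was stale. */
int lease_grant(struct lease_table *t, int eid, uint64_t msg_us,
    uint32_t lsn_file, uint32_t lsn_offset)
{
	struct lease_entry *e;
	size_t i;

	if ((e = lease_find(t, eid)) == NULL) {
		/* Initial lease for this site: claim a free slot. */
		for (i = 0; i < LEASE_SLOTS && t->slot[i].in_use; i++)
			;
		assert(i < LEASE_SLOTS);
		e = &t->slot[i];
		e->in_use = 1;
		e->eid = eid;
	} else if (msg_us <= e->start_us ||
	    !lsn_ge(lsn_file, lsn_offset, e->lsn_file, e->lsn_offset))
		return (0);
	e->start_us = msg_us;
	e->end_us = msg_us + t->timeout_us;
	e->lsn_file = lsn_file;
	e->lsn_offset = lsn_offset;
	return (1);
}
```

A grant that carries an older timestamp, or an LSN behind the one already recorded, leaves the entry untouched, so reordered grant messages cannot shorten a lease.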

Expiring Leases

Leases can expire in two ways. First, they can expire naturally due to the passage of time. When checking leases, if the current time is later than the lease entry's end_time then the lease is expired. Second, they can be forced to expire prematurely when the application's transport function returns an error. In the first case, there is nothing to do. In the second case, we need to manipulate the end_time so that all future lease checks fail. Since the lease start_time is guaranteed to not be in the future, we will have a function __rep_lease_expire that will:

REP_SYSTEM_LOCK
for each entry in the lease table
	entry->end_time = entry->start_time;
REP_SYSTEM_UNLOCK

Is there a potential race or problem with prematurely expiring leases? Consider an application that enforces an ALL acknowledgement policy for PERM records in its transport callback. There are four clients and three send the PERM ack to the application. The callback returns an error to the master DB code. The DB code will now prematurely expire its leases. However, at approximately the same time the three clients are also sending their REP_LEASE_GRANT messages to the master. There is a race between the master processing those messages and the thread handling the callback failure expiring the table. This is only an issue if the messages arrive after the table has been expired.

Let's assume all three clients send their grants after the master expires the table. If we accept those grants and then a read occurs, the read will succeed since the master has a majority of leases even though the callback failed earlier. Is that a problem? The lease code is using a majority and the application policy is using some other value. It feels like this should be okay since the data is held by leases on a majority. Should we consider having the lease checking threshold be the same as the permanent ack policy? That is difficult because Base API users implement whatever they want and DB does not know what it is.

Checking Leases

When a read operation on the master completes, the last thing we need to do is verify the master leases. We've already discussed refreshing them when they are expired above. We need two things for a lease to be valid. It must be within the timeframe of the lease grant and the lease must be valid for the last PERM record LSN. Here is the logic for checking the validity of leases in __rep_lease_check:

#define MAX_REFRESH_TRIES	3
DB_LSN lease_lsn;
REP_LEASE_ENTRY *entry;
u_int32_t min_leases, valid_leases;
db_timespec cur_time;
int ret, tries;

	tries = 0;
retry:
	ret = 0;
	LOG_SYSTEM_LOCK
	lease_lsn = lp->lsn;
	LOG_SYSTEM_UNLOCK
	REP_SYSTEM_LOCK
	min_leases = rep->nsites / 2;
	__os_gettime(dbenv, &cur_time);
	for (entry = head of table, valid_leases = 0; entry != NULL && valid_leases < min_leases; entry++)
		if (timespec_cmp(&entry->end_time, &cur_time) >= 0 && log_compare(&entry->lsn, &lease_lsn) == 0)
			valid_leases++;
	REP_SYSTEM_UNLOCK
	if (valid_leases < min_leases) {
		ret = __rep_lease_refresh(dbenv, ...);
		/*
		 * If we are successful, we need to recheck the leases because 
		 * the lease grant messages may have raced with the PERM
		 * acknowledgement.  Give those messages a chance to arrive.
		 */
		if (ret == 0) {
			if (tries <= MAX_REFRESH_TRIES) {
				/*
				 * If we were successful sending, but not successful in racing the
				 * message thread, yield the processor so that message
				 * threads may have a chance to run.
				 */
				if (tries > 0)
					/* __os_sleep instead?? */
					__os_yield();
				tries++;
				goto retry;
			} else
				ret = DB_RET_LEASE_EXPIRED;
		}
	}
	return (ret);

If the master has enough valid leases it returns success. If it does not have enough, it attempts to refresh them. This attempt may fail if sending the PERM record does not receive sufficient acks. If we do receive sufficient acknowledgements, we may still find that scheduling of message threads means the master hasn't yet processed the incoming REP_LEASE_GRANT messages. We will retry a few times (up to MAX_REFRESH_TRIES, possibly parameterized) if the master discovers that situation.
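The counting step at the heart of the check can be isolated into a small, testable function. The sketch below is a reduction of the loop above, not the real __rep_lease_check: the entry layout and the name leases_valid are hypothetical, times are plain microsecond counters, locking and the refresh/retry path are left out, and it assumes nsites counts the master so that nsites / 2 client leases plus the master form a majority.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified lease entry; times are microseconds. */
struct lease_entry {
	uint64_t end_us;
	uint32_t lsn_file, lsn_offset;
};

/*
 * Return 1 if at least nsites / 2 entries are unexpired at now_us and
 * were granted for exactly the LSN (file, offset); otherwise return 0.
 */
int leases_valid(const struct lease_entry *tab, size_t n, uint32_t nsites,
    uint64_t now_us, uint32_t file, uint32_t offset)
{
	uint32_t min_leases, valid;
	size_t i;

	assert(tab != NULL || n == 0);
	min_leases = nsites / 2;
	/* Stop scanning as soon as enough valid leases have been seen. */
	for (i = 0, valid = 0; i < n && valid < min_leases; i++)
		if (tab[i].end_us >= now_us &&
		    tab[i].lsn_file == file && tab[i].lsn_offset == offset)
			valid++;
	return (valid >= min_leases);
}
```

With five sites, two unexpired grants at the current LSN are enough; a grant that is unexpired but was made for an earlier LSN does not count.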

Elections

When a client grants a lease to a master, it gives up the right to participate in an election until that grant expires. If we are the master and dbenv->rep_elect is called, it should return, no matter what, like it does today. If we are a client and rep_elect is called special processing takes place when leases are in effect. First, the easy case is if the lease granted by this client has already expired, then the client goes directly into the election as normal. If a valid lease grant is outstanding to a master, this site cannot participate in an election until that grant expires. We have at least two options when a site calls the dbenv->rep_elect API while leases are in effect.

The simplest coding solution for DB would be simply to refuse to participate in the election if this site has a current lease granted to a master. We would detect this situation and return EINVAL. This is correct behavior and trivial to implement. The disadvantage of this solution is that the application would then be responsible for repeatedly attempting an election until the lease grant expired.
The more satisfying solution is for DB to wait the remaining time for the grant. If this client hears from the master during that time the election does not take place and the call to rep_elect returns with the information for the current/old master.

Election Code Changes

The code changes to support leases in the election code are fairly isolated. First if leases are configured, we must verify the nsites parameter is set to 0. Second, in __rep_elect_init we must not overwrite the value of rep->nsites for leases because it is controlled by the dbenv->rep_set_nsites API. These changes are small and easy to understand.

The more complicated code will be the client code when it has an outstanding lease granted. The client will wait for the current lease grant to expire before proceeding with the election. The client will only do so if it does not hear from the master for the remainder of the lease grant time. If the client hears from the master, it returns and does not begin participating in the election. A new election phase, REP_EPHASE0, will exist so that the call to __rep_wait can detect if a master responds. The client, while waiting for the lease grant to expire, will send a REP_MASTER_REQ message so that the master will respond with a REP_NEWMASTER message and thus allow the client to know the master exists. However, when the master replies to the client, it is also desirable for the master to get the client to update its lease grant.

Recall that the REP_NEWMASTER message does not result in a lease grant from the client. The client responds when it processes a PERM record that has the REPCTL_LEASE flag set in the message with its lease grant up to the given LSN. Therefore, we want the client's REP_MASTER_REQ to yield both the discovery of the existing master and have the master refresh its leases. The client will also use the REPCTL_LEASE flag in its REP_MASTER_REQ message to the master. This flag will serve as the indicator to the master that it needs to deal with leases and both send the REP_NEWMASTER message and refresh the lease.
The code will work as follows:

if (leases_configured && (my_grant_still_valid || lease_never_granted)) {
	if (lease_never_granted)
		wait_time = lease_timeout
	else
		wait_time = grant_expiration - current_time
	F_SET(REP_F_EPHASE0);
	__rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);
	ret = __rep_wait(..., REP_F_EPHASE0);
	if (we found a master)
		return
} /* if we don't return, fall out and proceed with election */
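The wait_time computation in the client code can be written as a small pure function. This is a sketch only: the function name ephase0_wait_us and its parameters are hypothetical, times are microseconds, and a return value of 0 is used to mean that the client may go straight into the election.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How long a client should wait in the REP_EPHASE0 step before it may
 * join an election.  Returns 0 when it may proceed immediately.
 */
uint64_t ephase0_wait_us(int leases_configured, int lease_never_granted,
    uint64_t grant_expire_us, uint64_t now_us, uint64_t lease_timeout_us)
{
	if (!leases_configured)
		return (0);
	/* A configured lease is assumed to have a nonzero timeout. */
	assert(lease_timeout_us > 0);
	if (lease_never_granted)
		return (lease_timeout_us);	/* Wait one full lease timeout. */
	if (grant_expire_us > now_us)
		return (grant_expire_us - now_us);	/* Time left on the grant. */
	return (0);	/* Grant already expired. */
}
```

Treating a site that has never granted a lease as having to wait a full lease timeout is the conservative choice: the site cannot know whether it granted a lease before a restart.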

On the master side, the code handling the REP_MASTER_REQ will do:

if (I am master) {
	...
	__rep_send_message(REP_NEWMASTER...)
	if (F_ISSET(rp, REPCTL_LEASE))
		__rep_lease_refresh(...)
}

Other minor implementation details are that __rep_elect_done must also clear the REP_F_EPHASE0 flag. We also, obviously, need to define REP_F_EPHASE0 in the list of replication flags. Note that the client's call to __rep_wait will return upon receiving the REP_NEWMASTER message. The client will independently refresh its lease when it receives the log record from the master's call to refresh the lease.

Again, similar to what I suggested above, the code could simply assume global leases are configured, and instead of having the REPCTL_LEASE flag at all, the master assumes that it needs to refresh leases because it has them configured, not because it is specified in the REP_MASTER_REQ message it is processing. Right now I don't think every possible REP_MASTER_REQ message should result in a lease grant request.

Elections and Quiescent Systems

It is possible that a master is slow or the client is close to its expiration time, or that the master is quiescent and all leases are currently expired, but nothing much is going on anyway, yet some client calls __rep_elect at that time. In the code above, we will not send the REP_MASTER_REQ because the lease is not valid. The client will simply proceed directly to sending the REP_VOTE1 message, throwing all other clients into an election. The master is still master and should stay that way. Currently, in response to a vote message, a master will broadcast a REP_NEWMASTER to assert its mastership. That causes the election to complete. However, the master may also want to proactively refresh its leases. This situation indicates to me that the master should choose to refresh leases based on configuration, not a flag sent from the client. I believe that any time the master asserts its mastership by sending a REP_NEWMASTER message, I need to add code to proactively refresh leases at that time.

Other Implementation Details

Role Changes

When a site changes its role via a call to rep_start in either direction, we must take action when leases are configured. There are three types of role changes that all need changes to deal with leases:

A master downgrading to a client. When a master downgrades to a client, it can do so immediately after it has proactively expired all existing leases it holds. This situation is similar to an error from the send callback, and it effectively cancels all outstanding leases held on this site. Note that if this master expires its leases, it does not have any effect on when the clients' lease grants expire on the client side. The clients must still wait their full expected grant time.
A client upgrading to master. If a client is upgrading to a master but it has an outstanding lease granted to another site, the code will return an EINVAL error. This situation only arises if the application simply declares this site master. If a site wins an election then the election itself should have waited long enough for the granted lease to expire and this state should not arise then.
A client finding a new master. When a client discovers a new and different master, via a REP_NEWMASTER message then the client cannot accept that new master until its current lease grant expires. This situation should only occur when a site declares itself master without an election and that site's lease grant expires before this client's grant expires. However, it is possible for this situation to arise with elections also. If we have 5 sites holding an election and 4 of those sites have leases expire at about the same time T, and this site's lease expires at time T+N and the election timeout is < N, then those 4 sites may hold an election and elect a master without this site's participation. A client in this situation must call __rep_wait with the time remaining on its lease. If the lease is expired after waiting the remaining time, then the client can accept this new master. If the lease was refreshed during the waiting period then the client does not accept this new master and returns.

DUPMASTER

A duplicate master situation can occur if an old master becomes disconnected from the rest of the group, that group elects a new master and then the partition is resolved. The requirement for master leases is that this situation will not cause the newly elected, rightful master to receive the DB_REP_DUPMASTER return. It is okay for the old master to get that return value. When a dual master situation exists, the following will happen:

On the current master and all current clients - If the current master receives an update message or other conflicting message from the old master then that message will be ignored because the generation number is out of date.
On the old master - If the old master receives an update message from the current master, or any other message with a later generation from any site, the new generation number will trigger this site to return DB_REP_DUPMASTER. However, instead of broadcasting out the REP_DUPMASTER message to shoot down others as well, this site, if leases are configured, will call __rep_lease_check and if they are expired, return the error. It should be impossible for us to receive a later generation message and still hold a majority of master leases. Something is seriously wrong and we will DB_ASSERT this situation cannot happen.
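The two cases above reduce to a comparison of generation numbers on a site that believes it is master. The sketch below is an illustration only: the enum, its values and the function master_check_gen are hypothetical, and a message at the same generation is simply reported as "no change" because the lease design does not alter that path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcomes for a message seen by a site that thinks it is master. */
enum dup_action {
	DUP_IGNORE,		/* Sender's generation is old: drop the message. */
	DUP_NO_CHANGE,		/* Same generation: leases change nothing here. */
	DUP_RETURN_DUPMASTER	/* We are the old master: return DB_REP_DUPMASTER. */
};

enum dup_action master_check_gen(uint32_t my_gen, uint32_t msg_gen,
    int leases_still_valid)
{
	if (msg_gen < my_gen)
		return (DUP_IGNORE);
	if (msg_gen == my_gen)
		return (DUP_NO_CHANGE);
	/*
	 * A later generation exists, so a majority must have let their
	 * grants to us expire before electing the new master.
	 */
	assert(!leases_still_valid);
	return (DUP_RETURN_DUPMASTER);
}
```

The current master only ever takes the first branch for messages from the old master, which is why it never sees DB_REP_DUPMASTER.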

Client to Client Synchronization

One question to ask is how lease grants interact with client-to-client synchronization. The only answer is that they do not. A client that is sending log records to another client cannot request the receiving client refresh its lease with the master. That client does not have a timestamp it can use for the master and clock skew makes it meaningless between machines. Therefore, sites that use client-to-client synchronization will likely see more lease refreshment during the read path and leases will be refreshed during live updates only. Of course, if a client supplies log records that fill a gap, and the later log records stored came from the master in a live update then the client will respond as per the discussion on Gap Processing above.

Interaction Matrix

If leases are granted (by a client) or held (by a master) what should the following APIs and messages do?

Other:
log_archive: Leases do not affect log_archive. OK.
dbenv->close: OK.
crash during lease grant and restart: Potential problem here. See discussion below.

Rep Base API method:
rep_elect: Already discussed above. Must wait for lease to expire.
rep_flush: Master only, OK - this will be the basis for refreshing leases.
rep_get_*: Not affected by leases.
rep_process_message: Generally OK. We'll discuss each message below.
rep_set_config: OK.
rep_set_limit: OK.
rep_set_nsites: Must be called before rep_start and nsites is immutable until 14778 is resolved.
rep_set_priority: OK.
rep_set_timeout: OK. Used to set lease timeout.
rep_set_transport: OK.
rep_start(MASTER): Role changes are discussed above. Make sure duplicate rep_start calls are no-ops for leases.
rep_start(CLIENT): Role changes are discussed above. Make sure duplicate calls are no-ops for leases.
rep_stat: OK. [Do we have any stats we want to add? Currently none are planned, but may come up during implementation and testing as useful to have. Suggestions?]
rep_sync: Should not be able to happen. Client cannot accept new master with outstanding lease grant. Add DB_ASSERT here.

REP_ALIVE: OK.
REP_ALIVE_REQ: OK.
REP_ALL_REQ: OK.
REP_BULK_LOG: OK. Clients check to send ACK.
REP_BULK_PAGE: Should never process one with lease granted. Add DB_ASSERT.
REP_DUPMASTER: Should never happen, this is what leases are supposed to prevent. See above.
REP_LOG: OK. Clients check to send ACK.
REP_LOG_MORE: OK [maybe remove and use flag] Clients check to send ACK.
REP_LOG_REQ: OK.
REP_MASTER_REQ: OK.
REP_NEWCLIENT: OK.
REP_NEWFILE: OK. Clients check to send ACK.
REP_NEWMASTER: See above.
REP_NEWSITE: OK.
REP_PAGE: OK. Should never process one with lease granted. Add DB_ASSERT.
REP_PAGE_FAIL: OK. Should never process one with lease granted. Add DB_ASSERT.
REP_PAGE_MORE: OK. Should never process one with lease granted. Add DB_ASSERT.
REP_PAGE_REQ: OK.
REP_REREQUEST: OK.
REP_UPDATE: OK. Should never process one with lease granted. Add DB_ASSERT.
REP_UPDATE_REQ: OK. This is a master-only message.
REP_VERIFY: OK. Should never process one with lease granted. Add DB_ASSERT.
REP_VERIFY_FAIL: OK. Should never process one with lease granted. Add DB_ASSERT.
REP_VERIFY_REQ: OK.
REP_VOTE1: OK. See Election discussion above. It is possible to receive one with a lease granted. Client cannot send one with an outstanding lease however.
REP_VOTE2: OK. See Election discussion above. It is possible to receive one with a lease granted.

If the following method or message processing is in progress and a client wants to grant a lease, what should it do? Let's examine what this means. The client wanting to grant a lease simply means it is responding to the receipt of a REP_LOG (or its variants) message and applying a log record. Therefore, we need to consider a thread processing a log message racing with these other actions.

Other:
log_archive: OK.
dbenv->close: User error. User should not be closing the env while other threads are using that handle. Should have no effect if a 2nd dbenv handle to same env is closed.

Rep Base API method:
rep_elect: See Election discussion above. rep_elect should wait and may grant lease while election is in progress.
rep_flush: Should not be called on client.
rep_get_*: OK.
rep_process_message: Generally OK. See handling each message below.
rep_set_config: OK.
rep_set_limit: OK.
rep_set_nsites: Must be called before rep_start until 14778 is resolved.
rep_set_priority: OK.
rep_set_timeout: OK.
rep_set_transport: OK.
rep_start(MASTER): OK, can't happen - already protect racing rep_start and rep_process_message.
rep_start(CLIENT): OK, can't happen - already protect racing rep_start and rep_process_message.
rep_stat: OK.
rep_sync: Shouldn't happen because client cannot grant leases during sync-up. Incoming log message ignored.

REP_ALIVE: OK.
REP_ALIVE_REQ: OK.
REP_ALL_REQ: OK.
REP_BULK_LOG: OK.
REP_BULK_PAGE: OK. Incoming log message ignored during internal init.
REP_DUPMASTER: Shouldn't happen. See DUPMASTER discussion above.
REP_LOG: OK.
REP_LOG_MORE: OK.
REP_LOG_REQ: OK.
REP_MASTER_REQ: OK.
REP_NEWCLIENT: OK.
REP_NEWFILE: OK.
REP_NEWMASTER: See above. If a client accepts a new master because its lease grant expired, then that master sends a message requesting the lease grant, this client will not process the log record if it is in sync-up recovery, or it may after the master switch is complete and the client doesn't need sync-up recovery. Basically, just uses existing log record processing/newmaster infrastructure.
REP_NEWSITE: OK.
REP_PAGE: OK. Receiving a log record during internal init PAGE phase should ignore log record.
REP_PAGE_FAIL: OK.
REP_PAGE_MORE: OK.
REP_PAGE_REQ: OK.
REP_REREQUEST: OK.
REP_UPDATE: OK. Receiving a log record during internal init should ignore log record.
REP_UPDATE_REQ: OK - master-only message.
REP_VERIFY: OK. Receiving a log record during verify phase ignores log record.
REP_VERIFY_FAIL: OK.
REP_VERIFY_REQ: OK.
REP_VOTE1: OK. This client is processing someone else's vote when the lease request comes in. That is fine. We protect our own election and lease interaction in __rep_elect.
REP_VOTE2: OK.

Crashing - Potential Problem

It appears there is one area where we could have a problem. I believe that crashes can cause us to break our guarantee on durability, authoritative reads and inability to elect duplicate masters. Consider this scenario:

A master and 4 clients are all up and running.
The master commits a txn and all 4 clients refresh their lease grants at time T.
All 4 clients have the txn and log records in the cache. None are flushing to disk.
All 4 clients have responded to the PERM messages as well as refreshed their lease with the master.
All 4 clients hit the same application coding error and crash (machine/OS stays up).
Master authoritatively reads data in txn from step 2.
All 4 clients restart the application and run recovery, thus the txn from step 2 is lost on all clients because it isn't in any logs.
A network partition happens and the master is alone on its side.
All 4 clients are on the other side and elect a new master.
Partition resolves itself and we have duplicate masters, where the former master still holds all valid lease grants.

Therefore, we have broken both guarantees. In step 6 the data is really not durable and we've given it to the user. One can argue that if this is an issue the application better be syncing somewhere if they really want durability. However, worse than that is that we have a legitimate DUPMASTER situation in step 10 where both masters hold valid leases. The reason is that all lease knowledge is in the shared memory and that is lost when the app restarts and runs recovery.

How can we solve this? The obvious solution is (ugh, yet another) durable BDB-owned file with some information in it, such as the current lease expiration time so that rebooting after a crash leaves the knowledge that the lease was granted. However, writing and syncing every lease grant on every client out to disk is far too expensive.

A second possible solution is to have clients wait a full lease timeout before entering an election the first time. This solution solves the DUPMASTER issue, but not the non-authoritative read. This solution naturally falls out of elections and leases really. If a client has never granted a lease, it should be considered as having to wait a full lease timeout before entering an election. Applications already know that leases impact elections and this does not seem so bad as it is only on the first election.

Is it sufficient to document that the authoritative read is only as authoritative as the durability guarantees they make on the sites that indicate it is permanent? Yes, I believe this is sufficient. If the application says it is permanent and it really isn't, then the application is at fault. Believing the application when it indicates with the PERM response that it is permanent avoids the authoritative problem [document this application requirement].

Upgrade/Mixed Versions

Clearly leases cannot be used with mixed version sites since masters running older releases will not have any knowledge of lease support. What considerations are needed in the lease code for mixed versions?

First if the REP_CONTROL structure changes, we need to maintain and use an old version of the structure for talking to older clients and masters. The implementation of this would be similar to the way we manage for old REP_VOTE_INFO structures. Second any new messages need translation table entries added. Third, if we are assuming global leases then clearly any mixed versions cannot have leases configured, and leases cannot be used in mixed version groups. Maintaining two versions of the control structure is not necessary if we choose a different style of implementation and don't change the control structure.

However, then how could an old application run continuously, upgrade to the new release and take advantage of leases without taking down the entire application? I believe it is possible for clients to be configured for leases but be subject to the master regarding leases, yet the master code can assume that if it has leases configured, all client sites do as well. In several places above I suggested that a client could make a choice based on either a new REPCTL_LEASE flag or simply having leases turned on locally. If we choose to use the flag, then we can support leases with mixed versions. The upgraded clients can configure leases, but the leases simply will not be granted until the old master is upgraded and sends a PERM message with the flag indicating it wants a lease grant. Until then, the clients, while having leases configured, will simply have an expired lease. Then, when the old master finally upgrades, it too can configure leases and suddenly all sites are using them. I believe this should work just fine and I will need to make sure a client's granting of leases is only in response to the master asking for a grant. If the master never asks, then the client has them configured, but doesn't grant them.
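The flag-driven rule for mixed versions amounts to one predicate on the client. In this sketch the message struct, the function client_should_grant and the two bit values are hypothetical; in particular SKETCH_REPCTL_LEASE only stands in for the proposed REPCTL_LEASE flag and does not reflect a real bit assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define SKETCH_REPCTL_PERM	0x01u
#define SKETCH_REPCTL_LEASE	0x02u

/* Hypothetical, cut-down view of an incoming control message. */
struct sketch_msg { uint32_t flags; };

/*
 * A client grants a lease only if it has leases configured and the
 * master asked for one on a PERM message.  An old master never sets
 * the lease flag, so an upgraded client never grants to it.
 */
int client_should_grant(int leases_configured, const struct sketch_msg *msg)
{
	assert(msg != NULL);
	if (!leases_configured)
		return (0);
	return ((msg->flags & SKETCH_REPCTL_PERM) != 0 &&
	    (msg->flags & SKETCH_REPCTL_LEASE) != 0);
}
```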

Testing

Clearly any user-facing API changes will need the equivalent reflection in the Tcl API for testing, under CONFIG_TEST.

I am sure the list of tests will grow but off the top of my head:
Basic test: have N sites all configure leases, run some, read on master, etc.
Refresh test: Perform update on master, sleep until past expiration, read on master and make sure leases are refreshed/read successful
Error test: Test error conditions (reading on client with leases but no ignore flag, calling after rep_start, etc)
Read test: Test reading on both client and master both with and without the IGNORE flag. Test that data read with the ignore flag can be rolled back.
Dupmaster test: Force a DUPMASTER situation and verify that the newer master cannot get DUPMASTER error.
Election test: Call election while grant is outstanding and master exists.
Call election while grant is outstanding and master does not exist.
Call election after expiration on quiescent system with master existing.
Run with a group where some members have leases configured and others do not to make sure we get errors instead of dumping core.