RSCH-1000: Passable Quality of Service

Omer Mishael, KagemniKarimu
Date Compiled: 09/25/23


Overview :telescope:

Lava implements a Passable Quality-of-Service (QoS) assurance. Network consumers score providers per relay on metrics of latency, sync, and availability. Scores on each metric are averaged over relays per session and presented to providers during the session. Providers report scores on-chain by taking the geometric mean of the three quality of service scores (latency, sync, availability) received, to come up with a single coefficient representing a final score. Quality of Service is assured by attenuating provider rewards in response to these final scores; a maximum score of 1 reaps full rewards whereas a minimum score of 0 reaps half rewards (50%).

Passable QoS differs from QoS excellence which is to be discussed in a later research paper.

Rationale :dna:

A system is needed to ensure that providers on the network are up to an operable standard. Passable QoS ensures that:

  • providers reach a minimum or ‘passable’ threshold to maintain active status on the network
  • providers are rewarded in direct proportion to the quality of services rendered
  • neither consumers nor providers are able to profitably give false quality reports

The system uses consumer-side scoring as the mechanism. Other systems considered were:

  • Fisherman - Fisherman involves empowering an authoritative inspector which performs surprise inspections on relays to unsuspecting providers. This is an approach widely used throughout web3, but it is neither trustless nor permissionless. Discovering or predicting the identity of the fisherman provides an attack surface for malicious actors. Additionally, the fisherman must be paid for this burden of scoring/inspecting: a smart attacker can manipulate the inspection or collude with the fisherman. Finally, and problematically, under this arrangement, there needs to be a verification mechanism for ensuring that the fisherman actually made these tests. Efficient solutions are complicated and require more computation and relays. By contrast, our consumer-side scoring mechanism uses existing relays to convey scores.
  • Elective Fisherman - Elective Fisherman means electing, at random, and by mandate of the protocol, a provider or consumer who performs scoring/reporting and whose compensation for doing so is continued participation in the network. This approach is akin to jury duty or conscription, whereby being a participant on the network makes one eligible for fisherman duty. In this example, no further compensation or reward would be given to the fisherman. We toyed with this approach but ultimately decided it was less efficient than consumer-side reporting. As in the previously mentioned Fisherman approach, it also requires complicated proofs and verifications.
  • Provider Performance Meta-analysis - Provider Meta-analysis involves collecting provider data on-chain for direct insights and analysis on quality of services. Barring an extremely clever provider learning how to falsify reports, this approach gives a very direct view into the relays received and serviced by a provider. However, it is highly data intensive and does not scale well. While this would conceivably work for a smaller network of consumers and providers, a network at-scale can be easily overloaded by the sheer volume of data providers would report. Again, the consumer-side scoring provides a more efficient implementation.

Note that Passable QoS does not ensure that 1) the best provider is selected for each consumer, or 2) the best providers receive the most rewards. Both of these assurances are characteristics of QoS Excellence, which is implemented separately from Passable QoS. Passable QoS is only a threshold below which data is considered unusable and by which payments (specific to the unusable data) are reduced.

Findings & Considerations :alembic:

Passable QoS is collected as a consumer-side score reported by providers on-chain to attenuate payments. To be fair and ungameable, payments must be proportionate to the effort expended by providers to complete a call. Every Lava call is measured in compute units (CUs). Compute Units (CUs) are an abstraction which quantify the compute intensiveness of a given request. Rewards are highly correlated with CUs - so it is easiest to think of compute units as a proxy for the ‘cost’ of a request. CUs have other implications related to Passable QoS as will be highlighted below.

Score

In its simplest form, a Passable QoS score is the geometric mean of three sub-scores (latency, sync, availability). Each of the three areas is calculated algorithmically without direction from the consumer. Taken together, they make a Passable QoS Score that will be 0 for complete failure, 1 for perfect performance, or, most likely, some number between 0 and 1. This is sent on-chain at the completion of each peer-to-peer session.
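As a minimal sketch of this combination step (the function name is illustrative, not the protocol's actual code; it assumes the three sub-scores have already been computed):

```go
package main

import (
	"fmt"
	"math"
)

// passableQoSScore combines the three per-session sub-scores
// (each in [0,1]) into a single coefficient via the geometric mean.
// A zero in any sub-score yields an overall score of zero.
func passableQoSScore(latency, sync, availability float64) float64 {
	return math.Cbrt(latency * sync * availability)
}

func main() {
	fmt.Println(passableQoSScore(1, 1, 1))     // perfect performance -> 1
	fmt.Println(passableQoSScore(0.9, 0.8, 0)) // any zero sub-score -> 0
	fmt.Printf("%.3f\n", passableQoSScore(0.9, 0.8, 0.7))
}
```

Using the geometric rather than arithmetic mean is what makes a total failure on any single metric unrecoverable by the other two.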

Latency

Latency is the amount of time elapsed before a request is returned. It is measured in milliseconds (ms). Passable QoS for Latency is hard-coded in the client code on the rpcconsumer. Latency of >x ms will produce a score of 0 and latency of ≤x ms will produce a score of 1, where x is defined as the latency threshold for a given call. The latency threshold is computed from the following values, which are currently hardcoded and liable to change:

| Variable | Value |
| --- | --- |
| Extra Relay Timeout | 0 ms for regular APIs; average block time for hanging APIs |
| Time Per CU | 100 ms per CU |
| Average World Latency | 300 ms |

What this demonstrates is that the latency threshold scales linearly with CUs; as the number of CUs for a given call increases, so does the allowable elapsed time. This accounts for calls that are knowingly expensive. In some cases, as with Hanging APIs, the allowable latency is even greater, reaching up to the average block time. This makes latency scores more reflective of actual performance differences instead of mere computational difficulty.

Importantly, while each latency calculation is an absolute threshold, the latency scores are averaged over many relays per session. This means that the latency score for a fulfilled session is likely some number between 0 and 1.
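A minimal sketch of this per-relay thresholding and session averaging, assuming the threshold combines the hardcoded values above additively (the constant names and the exact formula are illustrative assumptions, not the rpcconsumer's actual code):

```go
package main

import "fmt"

// Hypothetical constants taken from the table above; the actual
// values live in the rpcconsumer code and are liable to change.
const (
	timePerCUMs           = 100.0 // ms allowed per compute unit
	averageWorldLatencyMs = 300.0 // baseline allowance in ms
)

// latencyThresholdMs sketches the linear scaling of allowed latency
// with compute units. extraRelayTimeoutMs is 0 for regular APIs and
// the average block time for hanging APIs.
func latencyThresholdMs(cu, extraRelayTimeoutMs float64) float64 {
	return averageWorldLatencyMs + timePerCUMs*cu + extraRelayTimeoutMs
}

// latencyScore returns 1 if the relay beat the threshold, else 0.
func latencyScore(elapsedMs, cu, extraRelayTimeoutMs float64) float64 {
	if elapsedMs <= latencyThresholdMs(cu, extraRelayTimeoutMs) {
		return 1
	}
	return 0
}

func main() {
	// The session score is the average of per-relay 0/1 scores.
	relays := []float64{350, 420, 900} // elapsed ms for 1-CU calls
	var sum float64
	for _, ms := range relays {
		sum += latencyScore(ms, 1, 0)
	}
	fmt.Printf("session latency score: %.2f\n", sum/float64(len(relays)))
}
```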

Sync

Sync is the provider’s proximity to the latest block on the serviced chain. It is measured by taking the median latest block (with block interpolation based off time) of all providers on a given chain, then demanding that a provider does not fall too far behind. The distance a provider can lag behind is defined in the specification of a supported chain as a specific number of blocks. The current rpcconsumer and LavaSDK code reads this number from the blockchain and derives a sync score based on the provider’s position relative to it. If the provider’s distance is greater than the number in the spec, the provider receives a score of 0; if the distance is less than or equal to it, they receive a score of 1. Just as latency calculations are averaged over many relays per session, so are sync scores. This means that the value sent over a session (aggregated) will likely be some number between 0 and 1.

We mentioned that block interpolation is used for determining the median latest block. The basis of this is that since we don’t have all the latest measurements from all providers at any one point, we derive a measurement based upon the average block time and latest measurement time & height. Our median latest block measurement is capped to the last seen block just in case a chain was halted or a block comes substantially slower than expected. Additionally, the benefit of doubt always goes to providers. If not enough providers are available to perform interpolation — the provider is optimistically assigned a score of 1. The golang algorithm, summarizing sync calculations, is provided below:

for providerAddress, providerDataContainer := range providerLatestBlocks {
    interpolation := InterpolateBlocks(now, providerDataContainer.LatestBlockTime, averageBlockTime_ms)
    expected := providerDataContainer.LatestFinalizedBlock + interpolation
    // limit the interpolation to the highest seen block height
    if expected > highestBlockNumber {
        expected = highestBlockNumber
    }
    mapExpectedBlockHeights[providerAddress] = expected
}
medianOfExpectedBlocks := median(mapExpectedBlockHeights)
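Once the median expected height is known, the per-relay sync score is a simple threshold check against the allowed lag from the chain's spec. A minimal sketch (function and parameter names are illustrative assumptions, not the actual spec-reading code):

```go
package main

import "fmt"

// syncScore gives 1 if the provider is within allowedLagBlocks of the
// median expected height across providers, else 0. allowedLagBlocks
// stands in for the per-chain value defined in the chain's spec.
func syncScore(providerHeight, medianExpectedHeight, allowedLagBlocks int64) float64 {
	if medianExpectedHeight-providerHeight <= allowedLagBlocks {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(syncScore(1000, 1003, 5)) // within the allowed lag -> 1
	fmt.Println(syncScore(1000, 1010, 5)) // too far behind -> 0
}
```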

Availability

Availability is the tendency of a provider to respond to requests received. In a given session, a provider must respond to at least 90% of requests. A provider who responds to fewer than 90% of requests receives a score of 0. At or above that minimum threshold, the score scales linearly from 0 at 90% availability all the way up to 1 at 100%. Thus, Passable QoS Availability is a scaled score derived from the share of requests answered in the session.

Total

As seen, each of these respective scores is an aggregate over a session and can be 0, 1, or any number in between. Because the Passable QoS Score is calculated as the geometric mean of the sub-scores, a 0 in any sub-score leads to an overall score of 0. This is a reasonable property: under this schema, a provider which has unreasonable latency, whose blocks are wildly out of sync, or who is unavailable to answer requests cannot possibly be considered to deliver passable quality of service. Naturally, this penalizes the worst performing providers on the network.

Rewards & Reporting

One of the stated goals of Passable QoS is to make sure API providers are up to a certain standard of performance. The current method of ensuring that is by directly affecting payments. Payments are degraded in response to low Passable QoS scores. Providers are guaranteed at least 50% of payment for a serviced relay, irrespective of their Passable QoS score. In other words, regardless of Passable QoS score, at least half of the value of a serviced relay must be paid for. The remaining rewards which are not given to providers due to poor quality of service are burned.

Rewards

To calculate how many rewardable CU per session a provider may get paid for, the Total CU is scaled by a reward modifier derived from the Passable QoS score.

Total CU is the total CU a consumer used in a session, as given in the cryptographic signature in their submitted relays. The Total CU is multiplied by the reward modifier to get the Rewardable CU. As mentioned earlier, CU are a good proxy for the price of a transaction and are submitted by providers on-chain as part of a reward proof to claim rewards.
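A minimal sketch of this attenuation, assuming the reward modifier is 0.5 + 0.5 × QoS score (consistent with the simplified f(x) = 50x + 50% reward equation in the chart section; the function name is illustrative):

```go
package main

import "fmt"

// rewardableCU sketches the payment attenuation: providers keep at
// least half the reward (the 0.5 floor), with the remaining half
// scaled by the session's Passable QoS score.
func rewardableCU(totalCU, qosScore float64) float64 {
	rewardModifier := 0.5 + 0.5*qosScore
	return totalCU * rewardModifier
}

func main() {
	fmt.Println(rewardableCU(1000, 0))   // worst QoS -> 500
	fmt.Println(rewardableCU(1000, 0.5)) // middling QoS -> 750
	fmt.Println(rewardableCU(1000, 1))   // perfect QoS -> 1000
}
```

The 0.5 floor is what guarantees providers at least half their payment regardless of the reported score.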

Chart

To understand how this plays out in transactions, we can see the percentage of rewards retained. A simplified equation for calculating the reward percentage is f(x)=50x+50%, where x is the total Passable QoS score. This results roughly in the payout schedule below:

| Passable QoS Score | Reward |
| --- | --- |
| 0 | 50% |
| 0.125 | 56.25% |
| 0.25 | 62.5% |
| 0.5 | 75% |
| 0.625 | 81.25% |
| 0.75 | 87.5% |
| 1 | 100% |

This schema meets the goal of rewarding providers directly in proportion to the quality of their service on the network. Additionally, it ensures neither consumers nor providers can profitably give false reports. Because providers are guaranteed at least 50% of their reward, falsifying reports has a recurrent cost. This works as a strong disincentive for consumers wrongfully rating providers low.

Reports

Reports (containing Rewardable CUs) must be submitted by providers to the Lava chain in order to receive rewards. Passable QoS Reports are discrete for each session; what a provider received in a past session has no bearing on potential earnings or pairings. Passable QoS Scores are self-reported by providers on-chain as part of the reward proof. They are cryptographically protected and tamper resistant. The scores only affect the provider-consumer payment for a specific session. This creates the following expectations:

  1. For consumers, there are no rebates or refunds. Every relay has a minimum cost of 50% of the potential reward. There is no prevailing effect on the likelihood that consumers receive the same provider, good or bad, and no punishment against future sessions.
  2. For providers, bad quality of service decreases actual profitability (rewards earned) without affecting potential profitability (rewards-to-be-earned). Future sessions are unaffected by past sessions. This is enough to function as a disincentive for poor service without having any reputational qualities or affecting the volume of traffic coming to a provider.

The system works virtuously with consumers bearing the cost of their relays and providers bearing the cost of their service.

Future Considerations :test_tube:

QoS Excellence (RSCH-1001) - This research avoids explanation of reputational rating and provider optimization. Quality of Service Excellence, as a counterpart to Passable QoS, is the means by which the best provider is selected and the best providers are rewarded most often. Quality of Service Excellence is cumulative and lives beyond a specific session. This is to be explained in future research.

Geolocation (RSCH-1004) - This research specifically mentions dynamic measures of latency in response to compute intensiveness of a relay. However, it does not address how geographical distance affects the rate at which responses are returned or the impact that this has upon quality of service. Geolocation is accounted for in Quality of Service Excellence using a specific modifying mechanism to be explained in future research.

Fraud Detection (RSCH-1002) - This research outlines how a provider must be available, timely, and fresh, but does not explain how or why a provider’s responses should be considered honest. Honesty is a guaranteed feature in the Lava protocol that is not assured by Passable QoS. This is to be explained in future research.

Availability jailing (RSCH-1003) - This research explains the penalty that a provider can receive when <90% available. However, it does not explain what happens when a provider is totally unresponsive and unable to submit proofs of lower availability on chain. Availability jailing is a mechanism which takes unavailable providers offline. This feature is to be explained in depth in future research.


REFERENCE: N/A


Very nicely written!


This is fantastic! I am sure the rewards chart would be very useful for the provider community and beyond


Thank you for explaining how the rating system works


Thanks for the clarification :raised_hands:


Glad to be here.
There is something to study, it takes time


I have a question, how is a client constrained to rate a node with its real performance score? There is no incentive nor penalty for reporting precise/imprecise scores, right?


Hi @filip23!

Welcome to the forum! && Thank you for your question! I believe I understand the thought and may be able to explain… Please note, it is conceivable that consumers can give false reports - but they cannot profitably give false reports.

It seems there are a few constraints to mention:

  1. Consumers are constrained by client-side code (weak technical constraint). Scores are determined algorithmically from consumer-side code - not arbitrarily from consumer-side input. A consumer would have to modify the code of their consumer process to calculate the scores in a different way than is prescribed above.

  2. Consumers are constrained by recurrent costs (strong economic constraint). Providers are guaranteed at least 50% reward even with a Passable QoS score of 0. Therefore, there is a minimum cost that a lying consumer would have to eat regardless of the score given. Thus, repeatedly falsifying reports negatively will cost a consumer significantly at no/low expense to providers. Scores are really a means for better providers to make more tokens.

  3. Consumers are constrained by recurrent costs (strong economic constraint, cont’d). Falsifying reports positively is not directly punished, but increases the expected rewards payout. This means that falsifying a report positively just increases the amount that a consumer overpays for a request. A consumer can say that the score was 0.75 instead of 0.625, but that means they will pay the provider 87.5% of the full reward instead of 81.25%. If a consumer decides to overpay / tip - it is at no disadvantage to the Provider who services the request.

  4. Consumers are constrained by signed/transparent cryptographic proofs of spent CUs (strong technical constraint). Because the protocol is open, a provider can easily attest to the dishonesty of a consumer and decide not to service that particular consumer going forward. At least theoretically, the best providers will easily identify malicious consumers and ignore their requests. All-in-all, it’s easier and better for a consumer to play the game fairly!

tl;dr
A malicious consumer will only end up paying tokens without damaging provider earnings. Scores are calculated algorithmically and precise by default. If a consumer finds a way to give imprecise scores or inaccurate ratings, they can, of course, do so. However, it will be at significant cost to them and at no expense to providers - who will continue to be paid at the end of the month regardless!

I hope that clarifies things. Please let me know if you have further questions/discussion. :slight_smile:


thank you very much for the detailed explanation, I could not find the client-side payment model in order to determine the economic constraints myself :frowning_face: Is the client paying for each request from its credits/balance or does it have to stake LAVA token and get some capacity in requests/throughput?


Greetings again @filip23 !

The protocol works by pairing a consumer to a list of providers for a number of blocks n, referred to as an epoch. As of the time of this post, an epoch is currently set to 30 blocks. Consumers use subscriptions - purchased in plans which have a set token cost and a set CU limit per epoch and per month. A valid subscription is required to establish pairing.

You can read more about pairing and payments here. The basic flow is

  1. provider sends a consumer-signed relay payment tx
  2. reported CU from the relay payment tx is tracked
  3. upon subscription monthly due date, tracked CU are accounted and payments are disbursed

If you want to learn more about how this is implemented client-side, I recommend investigating the RPCConsumer code base. You’ll find that RPCConsumer has a GetConsumerPolicy function which returns errors when no policy exists. It’s quite a bit to sort through, so please let me know if you have any further questions.

I hope this answer is helpful :pray:t6::pray:t6::pray:t6:


Nice explanation of this👍


Thanks for the detailed description of how it works.


Thanks for the clarification
