Job priority and “Fairshare”

What is Fairshare

On Slurm-based HPC clusters, each user is associated with an HPC job scheduler (Slurm) account, typically related to their research group, department or Faculty.

Users belong to accounts, and accounts have shares associated with them. These shares reflect how much that research group/department has invested in the cluster. The number of shares each account has is influenced by multiple factors and can vary from cluster to cluster.

For example, on Eureka2 Fairshare is proportional to a department's or group's financial contribution/investment in the cluster. To serve the wide variety of groups and their differing contributions/investments, a method of fairly adjudicating job priority is required. This is the goal of Fairshare.

Fairshare allows those users who have not fully used their contribution/investment to get higher priority for their jobs on the cluster over jobs by groups that have used more than their contribution/investment.

The cluster is a limited resource and Fairshare allows us to ensure everyone gets a fair opportunity to use it regardless of how big or small the group is.

Lookup your shares

To see how much of its Fairshare your group/account has used, use the sshare command, which shows a summary of this information.

sshare -a --account=<account name>

In the example below we have used the chemistry account. The first line of the sshare output gives the summary for the whole account, and the following lines give a summary per user on the account.

Example output from an sshare command
[abc123@login1(eureka2) chem_reservation]$ sshare -a --account=chemistry
            Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ---------
chemistry                            8423    0.112157   735603420      0.169374   0.351074
chemistry                bobby          1    0.007477           0      0.011292   0.351074
chemistry                 tony          1    0.007477           0      0.011292   0.351074
chemistry                susan          1    0.007477   203592409      0.055052   0.006076
chemistry                user1          1    0.007477           0      0.011292   0.351074
chemistry                user2          1    0.007477           0      0.011292   0.351074
chemistry              someguy          1    0.007477           0      0.011292   0.351074
chemistry              chemist          1    0.007477          38      0.011292   0.351074
chemistry                user3          1    0.007477   517403742      0.122502   0.000012
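
The listing above is plain whitespace-separated text, so it can be post-processed with a short script. The following is a minimal sketch, not a site-provided tool: the function names parse_sshare and top_users are hypothetical, and the sample rows are copied from the output above. It ranks users by their RawUsage column.

```python
# Rows copied from the sshare example above (a subset, for brevity).
SAMPLE = """\
            Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
chemistry                            8423    0.112157   735603420      0.169374   0.351074
chemistry                susan          1    0.007477   203592409      0.055052   0.006076
chemistry              chemist          1    0.007477          38      0.011292   0.351074
chemistry                user3          1    0.007477   517403742      0.122502   0.000012
"""


def parse_sshare(text):
    """Parse whitespace-separated sshare rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6:
            # Account summary row: the User column is empty.
            fields.insert(1, "")
        if len(fields) != 7:
            continue  # blank or unexpected line
        account, user, raw_shares, norm, raw_usage, eff, fair = fields
        try:
            rows.append({
                "account": account,
                "user": user,
                "raw_shares": raw_shares,  # kept as text; not used numerically here
                "norm_shares": float(norm),
                "raw_usage": int(raw_usage),
                "effective_usage": float(eff),
                "fairshare": float(fair),
            })
        except ValueError:
            continue  # header or separator line
    return rows


def top_users(rows):
    """Return per-user rows sorted by RawUsage, heaviest user first."""
    users = [r for r in rows if r["user"]]
    return sorted(users, key=lambda r: r["raw_usage"], reverse=True)


if __name__ == "__main__":
    for row in top_users(parse_sshare(SAMPLE)):
        print(row["user"], row["raw_usage"], row["fairshare"])
```

In practice the text would come from running sshare yourself and saving or piping its output; the embedded sample only keeps the sketch self-contained.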

RawShares:

Chemistry has 8423 RawShares. Each user's share derives from the parent account: users in chemistry draw on the total share of the account rather than holding their own individual sub-shares of it. Thus all users in this account have access to the full share of the account.

NormShares:

NormShares is the chemistry account's RawShares divided by the total number of RawShares given out to all accounts on the cluster. It is the fraction of the cluster that the account has contributed/invested; for the chemistry account this is about 11.2% of Eureka.
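
As a small worked example of this ratio, here is a sketch in Python. The cluster-wide total of 75100 RawShares is not a published figure; it is a hypothetical value back-calculated from the chemistry row (8423 / 0.112157), used only to illustrate the division.

```python
def norm_shares(raw_shares, total_raw_shares):
    """Fraction of all RawShares on the cluster held by one account."""
    return raw_shares / total_raw_shares


CHEMISTRY_RAW_SHARES = 8423   # from the sshare output
TOTAL_RAW_SHARES = 75100      # hypothetical: back-calculated, not an official number

if __name__ == "__main__":
    share = norm_shares(CHEMISTRY_RAW_SHARES, TOTAL_RAW_SHARES)
    print(f"{share:.6f}")
```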

RawUsage:

RawUsage is the amount of the cluster that the account/user has used on Eureka. It is adjusted by the half-life set for the cluster, which is 30 days: recent usage counts at close to full cost, usage from 30 days ago counts half, and usage from 60 days ago counts one quarter. RawUsage is therefore the aggregate of the account's past usage with this half-life weighting applied. The RawUsage for the account is the sum of the RawUsage for each user, so sshare is an effective way to find out which users have contributed the most to the account's score.
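
The half-life weighting can be sketched as follows. This is a simplified model, assuming a smooth exponential decay in which a contribution's weight halves once per 30-day half-life; the scheduler applies its decay in periodic steps, so real values will differ slightly. The function names and the (age, usage) record format are hypothetical.

```python
HALF_LIFE_DAYS = 30.0  # half-life stated for this cluster


def decay_weight(age_days, half_life_days=HALF_LIFE_DAYS):
    """Weight applied to usage that is age_days old (1.0 for brand-new usage)."""
    return 0.5 ** (age_days / half_life_days)


def decayed_usage(records, half_life_days=HALF_LIFE_DAYS):
    """Sum (age_days, usage) pairs after applying the half-life weighting."""
    return sum(usage * decay_weight(age, half_life_days) for age, usage in records)


if __name__ == "__main__":
    for age in (0, 30, 60, 90):
        print(age, decay_weight(age))
    # 1000 units used today plus 1000 units used 30 days ago
    print(decayed_usage([(0, 1000), (30, 1000)]))
```

Under this model, a burst of heavy usage stops dominating an account's RawUsage after a few half-lives even if no further action is taken.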

EffectvUsage:

EffectvUsage is the account's RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the fraction of the cluster's recent usage that the account is responsible for. The chemistry account has used about 16.9% of the cluster.

Fairshare:

The Fairshare score is calculated using the following formula:

f = 2^(-EffectvUsage/NormShares)

For the chemistry account above, f = 2^(-0.169374/0.112157) ≈ 0.351. From this number we can assess how much of its contribution/investment in Eureka an account is using.

1.0:

Un-used. The account has not run any jobs recently.

1.0 > f > 0.5:

Under-utilization. The account is under-utilizing its share. For example, a Fairshare score of 0.75 means the account has recently used only about 40% of its share (EffectvUsage/NormShares ≈ 0.42).

0.5:

Average utilization. The account on average is using exactly as much as their share.

0.5 > f > 0:

Over-utilization. The account has overused its share. For example, a Fairshare score of 0.25 means the account has recently used twice its share of the cluster (2:1).

0:

No share left. The account has vastly overused their share.
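
The score and its interpretation can be reproduced with a few lines of Python. This is a minimal sketch based on the formula f = 2^(-EffectvUsage/NormShares) and the ranges listed above; the function names are hypothetical, and the input numbers are taken from the chemistry rows of the sshare example.

```python
import math


def fairshare(effective_usage, norm_shares):
    """Fairshare score f = 2^(-EffectvUsage/NormShares)."""
    return 2.0 ** (-effective_usage / norm_shares)


def usage_ratio(score):
    """Invert the formula: EffectvUsage/NormShares for a score in (0, 1]."""
    return -math.log2(score)


def describe(score):
    """Map a score onto the ranges described above."""
    if score >= 1.0:
        return "un-used"
    if score > 0.5:
        return "under-utilization"
    if score == 0.5:
        return "average utilization"
    if score > 0.0:
        return "over-utilization"
    return "no share left"


if __name__ == "__main__":
    f = fairshare(0.169374, 0.112157)  # chemistry account row
    print(round(f, 6), describe(f), round(usage_ratio(f), 2))
```

The usage_ratio helper is handy for reading a score backwards: a ratio of 1.0 means usage equals the share, and a ratio of 2.0 means twice the share has been used.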

Job priority

Individual job priority is calculated from an account's Fairshare and a job's age. Job priority is an integer that determines the position of a job in the pending queue relative to other jobs. The first component of job priority is the FairShare score. The second component is job age: this priority accrues over time, reaching a maximum value at 7 days. As a job sits in the queue waiting to be scheduled, its priority gradually increases due to its age. Thus even jobs from accounts with a low Fairshare score will eventually run, thanks to the growth in their job age priority.

These two components are put together to make up an individual job’s priority.
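
How the two components might combine can be sketched as a weighted sum. This is an illustration only: the weights below are hypothetical and are not Eureka's configured values, and the age factor is assumed to grow linearly until it saturates at 7 days.

```python
MAX_AGE_DAYS = 7.0          # age priority stops growing after 7 days
WEIGHT_FAIRSHARE = 100000   # hypothetical weight, not the cluster's real setting
WEIGHT_AGE = 1000           # hypothetical weight, not the cluster's real setting


def age_factor(days_pending):
    """Normalized age in [0, 1]; assumed linear growth, capped at 7 days."""
    return min(days_pending / MAX_AGE_DAYS, 1.0)


def job_priority(fairshare_score, days_pending):
    """Integer priority from the Fairshare and job age components."""
    total = WEIGHT_FAIRSHARE * fairshare_score + WEIGHT_AGE * age_factor(days_pending)
    return int(total)


if __name__ == "__main__":
    # Same account, same Fairshare score, different waiting times.
    for days in (0, 3.5, 7, 14):
        print(days, job_priority(0.5, days))
```

With these illustrative weights, waiting in the queue can only add a bounded amount to a job's priority, while the Fairshare component dominates the ordering between accounts.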

To view the priority for a specific job, use the sprio command:

  • Print the list of all pending jobs with their weighted priorities

$ sprio
JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
65539      62664          0      51664       1000      10000          0
65540      62663          0      51663       1000      10000          0
65541      62662          0      51662       1000      10000          0
  • Print the list of all pending jobs with their normalized priorities

$ sprio -n
JOBID PRIORITY   AGE        FAIRSHARE  JOBSIZE    PARTITION  QOS
65539 0.00001459 0.0007180  0.5166470  1.0000000  1.0000000  0.0000000
65540 0.00001459 0.0007180  0.5166370  1.0000000  1.0000000  0.0000000
65541 0.00001458 0.0007180  0.5166270  1.0000000  1.0000000  0.0000000
  • Print the job priorities for specific jobs

$ sprio --jobs=65548,65547
JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
65547      62078          0      51078       1000      10000          0
65548      62077          0      51077       1000      10000          0
  • Print the job priorities for jobs of specific users

$ sprio --users=fred,sally
JOBID     USER  PRIORITY       AGE  FAIRSHARE   JOBSIZE  PARTITION     QOS
65548     fred     62079         1      51077      1000      10000       0
65549    sally     62080         1      51078      1000      10000       0
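
In the first listing above, each PRIORITY value equals the sum of the weighted factor columns to its right. The sketch below checks this for the sample rows; parse_sprio and components_sum are hypothetical helper names, and the text is copied from that listing. Because the displayed columns are rounded, sums in other listings can differ from PRIORITY by one.

```python
# Copied from the first sprio example above.
SPRIO_SAMPLE = """\
JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
65539      62664          0      51664       1000      10000          0
65540      62663          0      51663       1000      10000          0
65541      62662          0      51662       1000      10000          0
"""


def parse_sprio(text):
    """Parse a weighted sprio listing into a list of {column: int} dicts."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(v) for v in row))) for row in rows]


def components_sum(job):
    """Add up every factor column (everything except JOBID and PRIORITY)."""
    return sum(v for k, v in job.items() if k not in ("JOBID", "PRIORITY"))


if __name__ == "__main__":
    for job in parse_sprio(SPRIO_SAMPLE):
        print(job["JOBID"], job["PRIORITY"], components_sum(job))
```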

Important

Fairshare does not stop jobs from running; it only influences their priority relative to other jobs. There are no quotas or limits on how much a user can submit/run on Eureka, and all submitted jobs will eventually run.

Note

This material was based on https://www.rc.fas.harvard.edu/resources/documentation/fairshare/ explanation of fairshare.