# Using statistical clustering to identify business models

**(Extract from page 57 of BIS Quarterly Review, December 2014)**

This box more precisely defines the variables used as inputs and discusses the more technical aspects of the statistical classification (clustering) procedure.

The eight input variables from which we selected the key characteristics of the business models are evenly split between the asset and liability sides of the balance sheet. All ratios are expressed as a share of total assets net of derivatives positions. This avoids distortions of the metrics caused by differences in the accounting standards applicable in different jurisdictions. The asset side ratios relate to: (i) total loans; (ii) securities (measured as the sum of trading assets and liabilities net of derivatives); (iii) the size of the trading book (measured as the sum of trading securities and fair value through income book); and (iv) interbank lending (measured as the sum of loans and advances to banks, reverse repos and cash collateral). The liability side ratios relate to: (i) customer deposits; (ii) wholesale debt (measured as the sum of other deposits, short-term borrowing and long-term funding); (iii) stable funding (measured as the sum of total customer deposits and long-term funding); and (iv) interbank borrowing (measured as deposits from banks plus repos and cash collateral).
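
As an illustration of how such ratios are formed, the sketch below computes four of the eight inputs for a single bank/year. The field names and all figures are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the underlying data set.

```python
# Hypothetical balance sheet items for one bank/year (illustrative figures,
# all in the same currency unit). Field names are assumptions for this sketch.
bank = {
    "total_assets": 1200.0,
    "derivatives": 200.0,
    "total_loans": 600.0,
    "customer_deposits": 500.0,
    "long_term_funding": 150.0,
    "deposits_from_banks": 80.0,
    "repos_and_cash_collateral": 40.0,
}


def ratios(b):
    """Express selected items as shares of total assets net of derivatives."""
    base = b["total_assets"] - b["derivatives"]
    return {
        "loans": b["total_loans"] / base,
        "customer_deposits": b["customer_deposits"] / base,
        # Stable funding: customer deposits plus long-term funding.
        "stable_funding": (b["customer_deposits"] + b["long_term_funding"]) / base,
        # Interbank borrowing: deposits from banks plus repos and cash collateral.
        "interbank_borrowing": (
            b["deposits_from_banks"] + b["repos_and_cash_collateral"]
        ) / base,
    }


bank_ratios = ratios(bank)
```

With these figures the denominator is 1,000, so the loan ratio is 0.60 and the stable funding ratio is 0.65.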

We employ the statistical classification algorithm proposed by Ward (1963). The algorithm is a hierarchical classification method that can be applied to a universe of individual observations (in our case, bank/year pairs). Each observation is described by a set of scores (in our case, the balance sheet ratios). This is an agglomerative algorithm, which starts from individual observations and successively builds up groups (clusters) by joining observations that are closest to each other. It proceeds by forming progressively larger groups (ie partitioning the universe of observations more coarsely), maximising the similarities of any two observations within each group and maximising the differences across groups. The algorithm measures the distance between two observations by the sum of squared differences of their scores. The results of the hierarchical classification can be presented in the form of the roots of a tree. The single observations are automatically the most homogeneous groups, at the bottom of the hierarchy. The algorithm first groups individual observations on the basis of the closeness of their scores. These small groups are successively merged with each other, forming fewer and larger groups at higher levels of the hierarchy, with the universe being a single group at the very top.
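
The agglomerative procedure can be sketched in a few lines of plain Python. This is a naive illustration rather than the implementation used for the analysis: it assumes the standard formulation of Ward's criterion, in which each step merges the two groups whose union adds least to the within-group sum of squared differences. The function names and the six toy observations are hypothetical.

```python
def sq_dist(a, b):
    """Sum of squared differences between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of score vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_merge_cost(a, b):
    """Increase in within-group sum of squares if groups a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))


def ward(observations):
    """Return every partition in the hierarchy, from singletons to one group.

    Each group is a sorted list of observation indices.
    """
    groups = [[i] for i in range(len(observations))]
    hierarchy = [[g[:] for g in groups]]
    while len(groups) > 1:
        best = None
        # Naive search over all pairs of groups for the cheapest merge.
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                cost = ward_merge_cost(
                    [observations[k] for k in groups[i]],
                    [observations[k] for k in groups[j]],
                )
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = sorted(groups[i] + groups[j])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
        hierarchy.append([g[:] for g in groups])
    return hierarchy


# Toy bank/year observations with two scores each (hypothetical values).
obs = [(0.70, 0.60), (0.72, 0.58), (0.68, 0.62),
       (0.30, 0.20), (0.28, 0.22), (0.32, 0.18)]
hierarchy = ward(obs)
two_groups = sorted(hierarchy[-2])  # the partition just below the top of the tree
```

For two single observations the merge cost is half the sum of squared differences of their scores, so the first merges join the closest pairs. In this toy example the two-group partition separates the first three observations from the last three.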

Which partition (ie step in the hierarchy) represents a good compromise between the homogeneity within each group and the number of groups? There are no hard rules for determining this. We use the pseudo F-index proposed by Caliński and Harabasz (1974) to help us decide. The index balances parsimony (ie a small number of groups) with the ability to discriminate (ie the groups have sufficiently distinct characteristics from each other). It increases when observations are more alike within a group (ie their scores are closer together) but more distinct across groups, and decreases as the number of groups gets larger. The closeness of observations is measured by the ratio of the average distance between bank/years that belong to different groups to the corresponding average of observations that belong to the same group. The number of groups is penalised based on the ratio of the total number of observations to that of groups in the particular partition. The criterion is similar in spirit to the Akaike and Schwarz information criteria that are often used to select the appropriate number of lags in time series regressions.
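
A minimal sketch of the index follows, assuming its usual textbook form: the between-group sum of squares divided by (k - 1), over the within-group sum of squares divided by (n - k), for n observations in k groups. The function name `pseudo_f` and the toy data are hypothetical, and the index is undefined when the within-group sum of squares is zero.

```python
def sq_dist(a, b):
    """Sum of squared differences between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of score vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def pseudo_f(observations, groups):
    """Pseudo F-index of a partition given as lists of observation indices."""
    n, k = len(observations), len(groups)
    if not 1 < k < n:
        raise ValueError("the index needs more than one and fewer than n groups")
    overall = centroid(observations)
    between = within = 0.0
    for g in groups:
        members = [observations[i] for i in g]
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)   # dispersion across groups
        within += sum(sq_dist(p, c) for p in members)   # dispersion inside the group
    return (between / (k - 1)) / (within / (n - k))


# Toy observations (hypothetical values) and two candidate partitions.
obs = [(0.70, 0.60), (0.72, 0.58), (0.68, 0.62),
       (0.30, 0.20), (0.28, 0.22), (0.32, 0.18)]
f_two = pseudo_f(obs, [[0, 1, 2], [3, 4, 5]])
f_three = pseudo_f(obs, [[0, 1], [2], [3, 4, 5]])
```

Here the two-group partition scores higher than the three-group one: splitting off a third group barely reduces the dispersion within groups, so the penalty on the number of groups dominates.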

The clustering algorithm is run for all combinations of at least three choice variables from the set of eight. If we had considered all their combinations, there would have been 325 runs. We reduce this number by ignoring subsets that include two choice variables that are highly correlated because the simultaneous presence of these variables provides little additional information. We impose a threshold for the correlation coefficient of 60% (in absolute value), which means that we do not examine sets of input variables that include *simultaneously* the securities and trading book variables, or the wholesale debt and stable funding variables.
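
The selection of input sets can be illustrated by direct enumeration. The sketch below takes the two excluded pairs from the text rather than recomputing correlations, and the variable labels are hypothetical shorthand. Under the literal reading of "at least three of eight", enumeration gives 219 subsets before filtering, which is not the 325 quoted above. The text does not say how that figure was reached, so the sketch illustrates the filter and does not reproduce the reported count.

```python
from itertools import combinations

# Shorthand labels for the eight input ratios (names are assumptions).
VARIABLES = [
    "loans", "securities", "trading_book", "interbank_lending",
    "customer_deposits", "wholesale_debt", "stable_funding", "interbank_borrowing",
]

# Pairs the text reports as exceeding the 60% correlation threshold.
EXCLUDED_PAIRS = [
    {"securities", "trading_book"},
    {"wholesale_debt", "stable_funding"},
]


def candidate_sets(variables, excluded_pairs, min_size=3):
    """Yield every subset of at least min_size variables that contains
    neither member pair of excluded_pairs in full."""
    for size in range(min_size, len(variables) + 1):
        for subset in combinations(variables, size):
            chosen = set(subset)
            if not any(pair <= chosen for pair in excluded_pairs):
                yield subset


all_sets = list(candidate_sets(VARIABLES, []))
kept_sets = list(candidate_sets(VARIABLES, EXCLUDED_PAIRS))
```

Under these assumptions, 109 of the 219 subsets survive the filter, and none of them contains eight variables, since the full set includes both excluded pairs.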