# IMP caches generation

## Principle

When you call an Objective-C method, Objective-C looks for the actual IMP (function pointer) given the class and selector. It then
stores, in malloced memory, this [selector, IMP] pair in a hash table. 

These hash tables are:

* imperfect: the hash function is just a mask applied to the selector's
address, so if two selectors have the same low bits, you will get a collision.
Collisions are resolved through linear probing
* expensive: the footprint of these hash tables is often around 70MB total on a carry device.

We want to replace this with static hash tables, inside the shared cache, where you have:

* a perfect hashing function
* and the tables are in clean memory

## Hash function

Because the hash function is executed as part of `objc_msgSend`, it has to be blazingly fast. The algorithm below explains how we can
make a hash function of the form `h(x) = (x >> shift) & mask` work, where `shift` and `mask` are two class-specific parameters.

Note that we cannot use traditional perfect hash tables techniques as:

* they require at least [1.44 bits of information per key](https://arxiv.org/pdf/1910.06416.pdf). 
* they often require multiple lookups (two-level hashing).

The idea is that because the input of the hash function is a selector's address, we have some control over the input... because the
dyld shared cache builder is the one placing all the selectors.

So now the problem becomes : given a set of classes and the selectors they implement, how do you place the selectors, and for each
class, how do you find a shift and mask so that the hash table generated by (selector >> shift) & mask is perfect?

(Note that the shift + mask idiom lets us use various "bit stripes" of the selector's address for various hash tables).

## Algorithm

There are basically two steps to the main algorithm:

### Finding shifts and masks

Assign some of the high bits of the selector's address, and find a shift and a mask for each class. This is a backtracking
algorithm which goes through each class, one after another. As it goes through classes, it finds a shift and a mask compatible
with the bit ranges that are already set on each selector's address, and assigns the corresponding bits of the selector's address.
Note that because addresses are partial at this point (some of the bits are unset) it's very difficult to check for collisions,
so selectors will end up at the same address.

At each step of the algorithm, we go through a set of possible shifts and masks until we find one which works. If none work, we let
the hash table grow to one more bit to make our job slightly easier. In practice few hash tables grow one more bit (82 out of 18k).
If we cannot find a suitable shift and mask after backtracking a few times, we'll also allow ourselves to drop a class from the set
and not generate an IMP cache for that one. This happens for a dozen classes or so with the current data set.

Next we have to deal with collisions, and constraints on the addresses themselves because each selector is... a `char*`, which has
a length. You cannot have an overlap between two selectors. So the idea here is to try to get rid of this constraint by assigning
in step 1 just the address of a "bucket" which is 128 bytes long (7 bits). So step 1 will assign the high bits of the address
(which will be the address of the bucket) and we can then place the selectors linearly in the buckets.

### Shuffling selectors around

Then you have to deal with address collisions. If the lengths of selectors in a given bucket add up to more than 128, you have
to move some selectors out. So step 2 goes through all the selectors, checks if it fits within the bucket it's supposed to be in,
and if it doesn't finds another suitable bucket.

To do so, it iterates through all the classes the selector is in, and builds a "constraint set" applying to that selector's
bucket's address (by basically looking at which slots in each hash table are free and combining all these constraints, which 
impact a different "stripe" of the address due to the shifts and masks). Once we have the constraint set, we can find a different
bucket for the selector without changing the shifts and masks. (If there is no other possible bucket, we allow ourselves to drop the
classes involving this selector, too).

Once each selector has a valid bucket, you can simply assign the low 7 bits of each address by looking at which selectors are in 
each bucket and looking at their lengths.

The problem is hard but not that hard given the number of classes we are targeting with this optimization:
we are roughly targeting ~ 20k classes out of 120k and ~ 200k selectors out of 900k, so we have lots of "free space".

Note that any holes we leave in the bucketization of the selectors can be filled later by placing all the selectors not targeted
by the optimization and any selectors from OS executables inside them.

## Shared cache builder setup

The optimization is guided by a file dropped by the OrderFiles project (`/AppleInternal/OrderFiles/shared-cache-objc-optimizations.json`).
The performance team is responsible for updating this file by looking at the caches on live or StressCycler devices with `objcdt dump-imp-caches`.

This file specifies a list of classes and metaclasses the algorithm targets, and a list of flattening roots.

We haven't explained flattening yet. When a class D(aughter) inherits from a class M(other), we should in theory add all of M's methods
to D's IMP cache. However

* This constrains the problem too much. The solver's job will be too difficult if the same selectors are in thousands of classes.
* This makes the IMP cache sizes blow up. 

So our scheme is to target only classes which are leaves in the inheritance tree with this optimization. For any selector that comes from
parent classes, the cache lookup will fail and fall back to looking for the method in `super`. Because `super` is not a leaf class,
it will have a dynamically allocated IMP cache and can cache there any selector that comes from the parents.

However, this means that some very interesting classes from a memory standpoint (because we find their caches in many processes)
get excluded because they have child classes. A solution to this is to turn on selector inheritance (add Mother's selectors
to Daughter's cache) starting at some flattening root. Then the IMP cache will have a "fallback class" that is the superclass
of the flattening root, and `objc_msgSend` will fallback to that class if it cannot find the selector in the child cache, 
skipping over all the classes up to the flattening root (because we know all the selectors in that chain will be present
in the child cache).

Very early into the shared cache builder's life, the algorithm described above runs. To do so, it parses all of the source dylibs
to find out which methods will end up in which class's cache (see section "what ends up in each cache" below).
The output of the algorithm is:

- for each class:
  * a shift and a mask
  * a list of methods that will end up in the cache with (source install name, source class name, source category name)
- for each selector: an offset at which it needs to be placed relative to the beginning of the selectors section
- a list of holes in the selector address space (or more accurately offset space) that can be used to add additional selectors
    not targeted by the algorithm
  
Then, we do all the selector deduping work, and when we get to the ObjC optimizer:

- we go through all the classes again to build a map from (source install name, source class name, source category name) to the 
    actual IMP's address (which is not known before the shared cache is laid out)
- we go through all the IMP caches returned by the algorithm, find the corresponding IMP, and emit the actual IMP cache in the 
    objc optimizer's RO region.

## Some code pointers

Most of the logic lives in the single C++ file `IMPCaches.cpp` and the two headers `IMPCaches.hpp` and `IMPCachesBuilder.hpp`. A tour:

### Types

`IMPCaches::Selector` : A unique selector. Has a name, a list of classes it's used in, and an address (which is actually an offset from
the beginning of the selectors section). Due to how the placement algorithm work, it also has a current
partial address and the corresponding bitmask of fixed bits. The algorithm adds bits to the partial address
as it makes progress and updates the mask accordingly

`IMPCaches::AddressSpace` : a representation of the address space available in the selectors section, so that step 2 of the algorithm
can check if selector buckets are overflowing.

`IMPCaches::HoleMap` : represents the holes left by the selector placement algorithm, to be filled later with other selectors we did not target.

`IMPCaches::Constraint` : represents a constraint on some of the bits of an address. It stores a set of allowed values for a given range of bits (shift and mask)

`IMPCachesBuilder` : what actually computes the caches (part of the algorithm ; what actually lays down the caches lives in the
ObjC optimizer)

### Entry points

* `IMPCachesBuilder::parseDylibs` : parses the source dylibs to figure out which method ends up in which cache.

* `IMPCachesBuilder::findShiftsAndMasks` : finds the shift and mask for each class (see "Findings shifts and masks" above).

* `IMPCachesBuilder::solveGivenShiftsAndMasks` : moves selectors around to make sure each bucket only contains 128 bytes worth of selectors.

Then in `OptimizerObjC`:

* `IMPMapBuilder` : Builds a map from (install name, class name, method name) to actual IMPs

* `IMPCachesEmitter` : emits the actual IMP caches in the shared cache.

### What ends up in each cache?

`IMPCachesBuilder::parseDylibs` goes through all the classes and categories to decide, for each (class,selector) pair, which implementation we will actually call at runtime.
The sequence is:

* Get all the methods from the method lists
* Attach any methods from non-cross-image categories (when we have cross-image categories, we prevent the class from getting an IMP cache). If there is any collision at this point we'll
use the implementation from the main class (or from the first category we found).
* Then we may inline some of the implementations from superclasses, see the explanation on flattening above.