DCN V2: Improved Deep & Cross Network
Practical Lessons for Web-scale Learning to Rank Systems
Key ideas
- Learning effective feature crosses is key to building recommender systems.
- Moving from Factorization Machine (FM) methods to plain DNNs underperformed: DNNs are inefficient at approximating even 2nd- or 3rd-order feature crosses.
- Making NNs wider and deeper isn't a solution: it makes them much slower to serve, which is unacceptable at high QPS.
- DCN aims to leverage implicit high-order crosses from NNs, with explicit crosses modeled by formulas with controllable interaction order.
- The original DCN's cross network has only $O(\text{input size})$ parameters per layer, limiting its expressiveness.
- DCN-V2 first learns explicit feature interactions through cross layers, and then combines with a deep network to learn complementary implicit interactions.
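To make the $O(\text{input size})$ limitation above concrete, here is a minimal NumPy sketch of the original DCN (V1) cross layer, $x_{l+1} = x_0 x_l^\top w_l + b_l + x_l$; shapes and values are illustrative:

```python
import numpy as np

# Sketch of the original DCN (V1) cross layer. Each layer has only a d-dim
# weight vector w and bias b: O(d) parameters, which limits expressiveness.
def dcn_v1_cross_layer(x0, xl, w, b):
    # x_{l+1} = x0 (x_l . w) + b + x_l  -- x0 x_l^T w is just a scaled copy of x0
    return x0 * (xl @ w) + b + xl

d = 4
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
w, b = rng.standard_normal(d), np.zeros(d)
x1 = dcn_v1_cross_layer(x0, x0, w, b)
assert x1.shape == (d,)
```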
Related work
- Parallel Structure
- Jointly train two parallel networks, inspired by the Wide & Deep model.
- Wide component takes inputs as crosses of raw features.
- Deep component is a NN.
- Examples are DeepFM or DCN.
- Stacked Structure
- Introduce an interaction layer between the embedding layer and the DNN that creates explicit feature crosses.
Proposed Architecture: DCN-V2
- Significantly improves the expressiveness of DCN in modeling complex explicit cross terms, while remaining easy to deploy.
- Observing the low-rank nature of the cross layers, we propose to leverage a mixture of low-rank cross layers.
Embedding Layer
- Takes combination of sparse (categorical) and dense features and outputs $x_0$.
- The $i$-th categorical feature is projected from a high-dimensional sparse space to a low-dimensional dense space through a learned projection (embedding) matrix.
- Output is the concatenation of all embedded vectors and normalized dense features.
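A minimal sketch of this embedding layer; the vocabulary sizes, embedding dims, and normalization choice below are made-up assumptions, not from the paper:

```python
import numpy as np

# Each categorical feature i has its own learned embedding table; indexing a
# row is equivalent to projecting the sparse one-hot into a dense space.
rng = np.random.default_rng(0)
vocab_sizes, emb_dims = [1000, 50], [8, 4]   # hypothetical per-feature sizes
tables = [rng.standard_normal((v, d)) for v, d in zip(vocab_sizes, emb_dims)]

def embed(cat_ids, dense):
    embedded = [tables[i][cid] for i, cid in enumerate(cat_ids)]
    dense = (dense - dense.mean()) / (dense.std() + 1e-8)  # normalize dense feats
    return np.concatenate(embedded + [dense])              # x_0

x0 = embed([42, 7], np.array([0.5, -1.2, 3.0]))
assert x0.shape == (8 + 4 + 3,)
```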
Cross Network
- The $(l{+}1)$-th cross layer computes $x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$, where $\odot$ is the elementwise product and $W_l$ is a full $d \times d$ matrix (vs. a $d$-vector in DCN V1). Stacking layers raises the interaction order.
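A minimal NumPy sketch of the DCN-V2 cross layer, $x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$; the dimension and random weights are illustrative:

```python
import numpy as np

# DCN-V2 cross layer: x_{l+1} = x0 * (W x_l + b) + x_l, with a full d x d W.
def cross_layer(x0, xl, W, b):
    return x0 * (W @ xl + b) + xl   # elementwise product builds explicit crosses

d = 4
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
W, b = rng.standard_normal((d, d)), np.zeros(d)
x1 = cross_layer(x0, x0, W, b)   # first layer: 2nd-order terms in x0
x2 = cross_layer(x0, x1, W, b)   # each extra layer raises the interaction order
assert x1.shape == x2.shape == (d,)
```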
Deep Network
A feed-forward NN with ReLU activations; other activation functions also work.
Combination of Deep and Cross Networks
- Stacked: $x_0$ is fed to the cross network, whose output feeds the deep network; the output is the composition $f_{\text{deep}} \circ f_{\text{cross}}$.
- Parallel: $x_0$ is fed in parallel to both cross and deep networks, and their outputs are combined (e.g. concatenated): $[f_{\text{cross}}(x_0); f_{\text{deep}}(x_0)]$.
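The two combination modes can be sketched as follows; the single-layer stand-ins for each sub-network and the concatenation choice are illustrative assumptions:

```python
import numpy as np

# Illustrative stacked vs. parallel combination (shapes are made up).
rng = np.random.default_rng(0)
d = 6
x0 = rng.standard_normal(d)
W, b = rng.standard_normal((d, d)), np.zeros(d)
H = rng.standard_normal((d, d))                      # one hidden layer stand-in

f_cross = lambda x: x0 * (W @ x + b) + x             # DCN-V2 cross layer
f_deep = lambda x: np.maximum(H @ x, 0.0)            # one ReLU MLP layer

stacked = f_deep(f_cross(x0))                        # cross feeds into deep
parallel = np.concatenate([f_cross(x0), f_deep(x0)]) # outputs concatenated
assert stacked.shape == (d,) and parallel.shape == (2 * d,)
```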
Cost-Effective Mixture of Low-Rank DCN
- Low-rank techniques are used to reduce computational cost.
- Approximates a dense matrix $M \in \mathbb{R}^{d \times d}$ by two tall-and-skinny matrices $U, V \in \mathbb{R}^{d \times r}$ with $r \ll d$: $M \approx UV^\top$.
- Most effective when the matrix shows a large gap in singular values or fast spectral decay.
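A sketch of the resulting low-rank cross layer with a mixture of experts; the expert count, rank, and softmax gating over the input are illustrative assumptions about one way to realize the mixture:

```python
import numpy as np

# Low-rank cross layer: replace W (d x d) with U V^T, U, V in R^{d x r}, r << d,
# and mix k such low-rank "experts" with a softmax gate.
def low_rank_cross(x0, xl, Us, Vs, bs, gate_w):
    gates = np.exp(gate_w @ xl); gates /= gates.sum()   # softmax over experts
    expert_outs = [x0 * (U @ (V.T @ xl) + b)            # U(V^T x): O(d*r) cost
                   for U, V, b in zip(Us, Vs, bs)]
    return sum(g * e for g, e in zip(gates, expert_outs)) + xl

d, r, k = 8, 2, 3                                       # dim, rank, num experts
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
Us = [rng.standard_normal((d, r)) for _ in range(k)]
Vs = [rng.standard_normal((d, r)) for _ in range(k)]
bs = [np.zeros(d) for _ in range(k)]
gate_w = rng.standard_normal((k, d))
x1 = low_rank_cross(x0, x0, Us, Vs, bs, gate_w)
assert x1.shape == (d,)
```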