DCN V2: Improved Deep & Cross Network
Practical Lessons for Web-scale Learning to Rank Systems
Key ideas
- Learning effective feature crosses is key to building recommender systems.
- Moving from Factorization Machine (FM) methods to plain DNNs underperformed: DNNs are inefficient at approximating even 2nd- or 3rd-order feature crosses.
- Making NNs wider and deeper isn't a solution: it makes them much slower to serve, which is unacceptable at high QPS.
- DCN aims to leverage implicit high-order crosses from NNs, with explicit crosses modeled by formulas with controllable interaction order.
- The original DCN's cross network has only $O(\text{input size})$ parameters per layer, limiting its expressiveness.
- DCN-V2 first learns explicit feature interactions through cross layers, and then combines with a deep network to learn complementary implicit interactions.
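To make the $O(\text{input size})$ limitation above concrete, here is a minimal NumPy sketch of the original DCN (V1) cross layer, $x_{l+1} = x_0 x_l^\top w_l + b_l + x_l$; shapes and values are illustrative:

```python
import numpy as np

# Sketch of the original DCN (V1) cross layer. Each layer has only a d-dim
# weight vector w and bias b: O(d) parameters, which limits expressiveness.
def dcn_v1_cross_layer(x0, xl, w, b):
    # x_{l+1} = x0 (x_l . w) + b + x_l  -- x0 x_l^T w is just a scaled copy of x0
    return x0 * (xl @ w) + b + xl

d = 4
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
w, b = rng.standard_normal(d), np.zeros(d)
x1 = dcn_v1_cross_layer(x0, x0, w, b)
assert x1.shape == (d,)
```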
Related work
- Parallel Structure
- Jointly train two parallel networks, inspired by the Wide & Deep model.
- Wide component takes inputs as crosses of raw features.
- Deep component is a NN.
- Examples are DeepFM or DCN.
- Stacked Structure
- Introduce an interaction layer between the embedding layer and the DNN that creates explicit feature crosses.
Proposed Architecture: DCN-V2
- Significantly improves the expressiveness of DCN in modeling complex explicit cross terms, while remaining easy to deploy.
- Observing the low-rank nature of the cross layers, we propose to leverage a mixture of low-rank cross layers.
Embedding Layer
- Takes combination of sparse (categorical) and dense features and outputs $x_0$.
- The $i$-th categorical feature is projected from a high-dimensional sparse space to a low-dimensional dense space through a learned projection (embedding) matrix.
- Output is the concatenation of all embedded vectors and normalized dense features.
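A minimal sketch of this embedding layer; the vocabulary sizes, embedding dims, and normalization choice below are made-up assumptions, not from the paper:

```python
import numpy as np

# Each categorical feature i has its own learned embedding table; indexing a
# row is equivalent to projecting the sparse one-hot into a dense space.
rng = np.random.default_rng(0)
vocab_sizes, emb_dims = [1000, 50], [8, 4]   # hypothetical per-feature sizes
tables = [rng.standard_normal((v, d)) for v, d in zip(vocab_sizes, emb_dims)]

def embed(cat_ids, dense):
    embedded = [tables[i][cid] for i, cid in enumerate(cat_ids)]
    dense = (dense - dense.mean()) / (dense.std() + 1e-8)  # normalize dense feats
    return np.concatenate(embedded + [dense])              # x_0

x0 = embed([42, 7], np.array([0.5, -1.2, 3.0]))
assert x0.shape == (8 + 4 + 3,)
```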
Cross Network
- The $(l{+}1)$-th cross layer computes $x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$, where $\odot$ is the elementwise product and $W_l$ is a full $d \times d$ matrix (vs. a $d$-vector in DCN V1). Stacking layers raises the interaction order.
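A minimal NumPy sketch of the DCN-V2 cross layer, $x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$; the dimension and random weights are illustrative:

```python
import numpy as np

# DCN-V2 cross layer: x_{l+1} = x0 * (W x_l + b) + x_l, with a full d x d W.
def cross_layer(x0, xl, W, b):
    return x0 * (W @ xl + b) + xl   # elementwise product builds explicit crosses

d = 4
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
W, b = rng.standard_normal((d, d)), np.zeros(d)
x1 = cross_layer(x0, x0, W, b)   # first layer: 2nd-order terms in x0
x2 = cross_layer(x0, x1, W, b)   # each extra layer raises the interaction order
assert x1.shape == x2.shape == (d,)
```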
Deep Network
A feed-forward NN with ReLU activations; other activation functions also work.
Combination of Deep and Cross Networks
- Stacked: $x_0$ is fed to the cross network, whose output feeds the deep network; the output is the composition $f_{\text{deep}} \circ f_{\text{cross}}$.
- Parallel: $x_0$ is fed in parallel to both cross and deep networks, and their outputs are combined (e.g. concatenated): $[f_{\text{cross}}(x_0); f_{\text{deep}}(x_0)]$.
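The two combination modes can be sketched as follows; the single-layer stand-ins for each sub-network and the concatenation choice are illustrative assumptions:

```python
import numpy as np

# Illustrative stacked vs. parallel combination (shapes are made up).
rng = np.random.default_rng(0)
d = 6
x0 = rng.standard_normal(d)
W, b = rng.standard_normal((d, d)), np.zeros(d)
H = rng.standard_normal((d, d))                      # one hidden layer stand-in

f_cross = lambda x: x0 * (W @ x + b) + x             # DCN-V2 cross layer
f_deep = lambda x: np.maximum(H @ x, 0.0)            # one ReLU MLP layer

stacked = f_deep(f_cross(x0))                        # cross feeds into deep
parallel = np.concatenate([f_cross(x0), f_deep(x0)]) # outputs concatenated
assert stacked.shape == (d,) and parallel.shape == (2 * d,)
```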
Cost-Effective Mixture of Low-Rank DCN
- Low-rank techniques are used to reduce computational cost.
- Approximates a dense matrix $M \in \mathbb{R}^{d \times d}$ by two tall-and-skinny matrices $U, V \in \mathbb{R}^{d \times r}$ with $r \ll d$: $M \approx UV^\top$.
- Most effective when the matrix shows a large gap in singular values or fast spectral decay.
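A sketch of the resulting low-rank cross layer with a mixture of experts; the expert count, rank, and softmax gating over the input are illustrative assumptions about one way to realize the mixture:

```python
import numpy as np

# Low-rank cross layer: replace W (d x d) with U V^T, U, V in R^{d x r}, r << d,
# and mix k such low-rank "experts" with a softmax gate.
def low_rank_cross(x0, xl, Us, Vs, bs, gate_w):
    gates = np.exp(gate_w @ xl); gates /= gates.sum()   # softmax over experts
    expert_outs = [x0 * (U @ (V.T @ xl) + b)            # U(V^T x): O(d*r) cost
                   for U, V, b in zip(Us, Vs, bs)]
    return sum(g * e for g, e in zip(gates, expert_outs)) + xl

d, r, k = 8, 2, 3                                       # dim, rank, num experts
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
Us = [rng.standard_normal((d, r)) for _ in range(k)]
Vs = [rng.standard_normal((d, r)) for _ in range(k)]
bs = [np.zeros(d) for _ in range(k)]
gate_w = rng.standard_normal((k, d))
x1 = low_rank_cross(x0, x0, Us, Vs, bs, gate_w)
assert x1.shape == (d,)
```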