Deep & Cross Network for Ad Click Predictions (1708.05123v1)

Published 17 Aug 2017 in cs.LG and stat.ML

Abstract: Feature engineering has been the key to the success of many prediction models. However, the process is non-trivial and often requires manual feature engineering or exhaustive searching. DNNs are able to automatically learn feature interactions; however, they generate all the interactions implicitly, and are not necessarily efficient in learning all types of cross features. In this paper, we propose the Deep & Cross Network (DCN) which keeps the benefits of a DNN model, and beyond that, it introduces a novel cross network that is more efficient in learning certain bounded-degree feature interactions. In particular, DCN explicitly applies feature crossing at each layer, requires no manual feature engineering, and adds negligible extra complexity to the DNN model. Our experimental results have demonstrated its superiority over the state-of-art algorithms on the CTR prediction dataset and dense classification dataset, in terms of both model accuracy and memory usage.

Citations (1,120)

View on Semantic Scholar

Summary

The paper introduces a novel architecture that explicitly models bounded-degree feature interactions, reducing the need for manual feature engineering.
It combines a cross network with a deep network to capture both explicit and implicit feature interactions while keeping parameter usage efficient.
Experimental results on the Criteo dataset demonstrate that DCN outperforms standard DNNs with lower logloss and significant memory savings.

The paper "Deep & Cross Network for Ad Click Predictions" (1708.05123) introduces the Deep & Cross Network (DCN), a novel architecture designed to improve ad click-through rate (CTR) prediction by efficiently learning feature interactions. Traditional methods either require manual feature engineering, which is laborious and often incomplete, or use Deep Neural Networks (DNNs) that learn feature interactions implicitly and can be inefficient for certain types of interactions. DCN aims to combine the strengths of DNNs with an explicit and efficient mechanism for learning bounded-degree feature interactions.

Core Architecture: DCN

The DCN model consists of four main components:

Embedding and Stacking Layer: This initial layer handles both sparse categorical features and dense numerical features.
- Categorical Features: These are converted into low-dimensional dense embedding vectors. If $x_i$ is a one-hot encoded categorical feature, its embedding $x_{\text{embed},i}$ is found by $x_{\text{embed},i} = W_{\text{embed},i} x_i$ , where $W_{\text{embed},i}$ is an embedding matrix learned during training. The paper suggests an embedding dimension of $6 \times (\text{category cardinality})^{1/4}$ .
- Dense Features: These are typically normalized (e.g., using a log transform as mentioned for the Criteo dataset) before being used.
- Stacking: The embedding vectors of all categorical features and the processed dense features are concatenated to form a single input vector $x_0 = [x_{\text{embed},1}^T, \ldots, x_{\text{embed},k}^T, x_{\text{dense}}^T]$ . This %%%%6%%%% is then fed into both the cross network and the deep network.

Cross Network: This is the core innovation of DCN. It explicitly models feature interactions at each layer. The

(l+1)

-th cross layer is defined as:

$x_{l+1} = x_0 x_l^T w_l + b_l + x_l = f(x_l, w_l, b_l) + x_l$

where:

$x_l, x_{l+1} \in \mathbb{R}^d$ are the outputs of the $l$ -th and $(l+1)$ -th cross layers, respectively (with $d$ being the dimension of $x_0$ ).
$w_l, b_l \in \mathbb{R}^d$ are the weight vector and bias vector for the $l$ -th layer.
The term $x_0 x_l^T w_l$ creates explicit feature crosses. The outer product $x_0 x_l^T$ computes all pairwise interactions between the original input $x_0$ and the current layer's output $x_l$ . This is then multiplied by $w_l$ .
The residual connection $x_l$ helps in training deeper networks. The highest polynomial degree of interactions increases by one with each cross layer. An $L_c$ -layer cross network can model all cross terms up to degree $L_c+1$ . The number of parameters in the cross network is $d \times L_c \times 2$ , which is linear in the input dimension $d$ and number of cross layers $L_c$ . This makes it memory-efficient.

A visualization of one cross layer:

Input: x_l (from previous layer or x_0)
   |
   +------> Feature Crossing (f) ------> Output: x_0 * (x_l^T * w_l) + b_l
   |                ^
   |                |
   +----------------+ (Residual Connection)
   |
Output: x_{l+1} = f(x_l, w_l, b_l) + x_l

Deep Network: This is a standard multi-layer perceptron (MLP) running in parallel to the cross network. Each layer is defined as:

$h_{l+1} = \text{ReLU}(W_l h_l + b_l)$

where:
- $h_l, h_{l+1}$ are the activations of the $l$ -th and $(l+1)$ -th deep layers.
- $W_l, b_l$ are the weight matrix and bias vector for the $l$ -th deep layer.
- ReLU is typically used as the activation function. The deep network captures highly non-linear, implicit feature interactions. The number of parameters for $L_d$ layers of size $m$ is approximately $d \times m + (m^2+m) \times (L_d-1)$ .
Combination Layer: The outputs of the cross network ( $x_{L_c}$ ) and the deep network ( $h_{L_d}$ ) are concatenated:

$p = \sigma([x_{L_c}^T, h_{L_d}^T] w_{\text{logits}})$

This combined vector is then fed into a sigmoid activation function for binary classification (like CTR prediction). The loss function is the standard log loss with an $L_2$ regularization term:

$\text{loss} = -\frac{1}{N} \sum_{i=1}^N (y_i \log(p_i) + (1-y_i)\log(1-p_i)) + \lambda \sum_w \|w\|^2$

The overall DCN architecture can be visualized as:

graph TD
    A[Input Features (Sparse & Dense)] --> B(Embedding & Stacking Layer);
    B -- x_0 --> C{Cross Network};
    B -- x_0 --> D{Deep Network};
    C -- x_Lc --> E(Combination Layer);
    D -- h_Ld --> E;
    E --> F[Output (e.g., CTR Prediction)];

Practical Implementation and Application

Data Preprocessing:

Dense Features: Normalize using log transform (e.g., x_dense_processed = np.log(x_dense + epsilon) to handle zeros).
Sparse Features: Convert to embeddings. In TensorFlow/Keras, this can be done using tf.keras.layers.Embedding. The embedding size can be determined empirically or using the paper's heuristic $6 \times (\text{category cardinality})^{1/4}$ .

Model Construction (Conceptual Python/TensorFlow):

import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(CrossLayer, self).__init__(**kwargs)
        # kernel (w_l) and bias (b_l) will be initialized in build method

    def build(self, input_shape):
        # input_shape is a list: [x_0_shape, x_l_shape]
        # For DCN, x_0 and x_l have the same dimension d
        dim = input_shape[0][-1]
        self.kernel = self.add_weight(name='kernel',
                                      shape=(dim, 1), # w_l is a vector
                                      initializer='glorot_uniform',
                                      trainable=True)
        self.bias = self.add_weight(name='bias',
                                    shape=(dim,), # b_l is a vector
                                    initializer='zeros',
                                    trainable=True)

    def call(self, inputs):
        x_0, x_l = inputs
        # x_0 * (x_l^T @ w_l)  ->  x_0 * dot_product(x_l, w_l)
        # x_l^T w_l is a scalar if w_l is (dim,). For (dim, 1), it's (1,1)
        # x_0 x_l^T w_l in the paper is element-wise product of x_0 and (x_l^T w_l) if interpreted as (x_0 * (x_l dot w_l))
        # Or, more accurately from x_0 x_l^T w_l, it's x_0 outer_product(x_l, w_l)
        # The paper describes w_l as a vector. x_l^T w_l is a scalar.
        # So, x_0 (x_l^T w_l) + b_l + x_l
        # The operation x_0 x_l^T w_l is actually: x_0 * (tf.matmul(tf.expand_dims(x_l, axis=-1), self.kernel, transpose_a=True))
        # This is equivalent to x_0 * matmul(x_l^T, w_l), but this is not right.
        # It should be x_0 * (x_l^T w_l) as a scalar product then broadcasted to x_0
        # Let's re-read: x_0 x_l^T w_l + b_l + x_l. x_0, x_l, w_l, b_l are column vectors.
        # x_l^T w_l is a scalar. So x_0 times a scalar.
        # This seems to be the most common interpretation.
        x_l_T_w_l = tf.matmul(x_l, self.kernel) # (batch_size, 1)
        x_0_x_l_T_w_l = x_0 * x_l_T_w_l # (batch_size, dim)
        return x_0_x_l_T_w_l + self.bias + x_l

class DCN(tf.keras.Model):
    def __init__(self, num_cross_layers, deep_hidden_units, output_dim=1, **kwargs):
        super(DCN, self).__init__(**kwargs)
        self.num_cross_layers = num_cross_layers
        self.cross_layers = [CrossLayer() for _ in range(num_cross_layers)]

        self.deep_network = tf.keras.Sequential([
            tf.keras.layers.Dense(units, activation='relu') for units in deep_hidden_units
        ])

        self.combination_layer = tf.keras.layers.Dense(output_dim, activation='sigmoid')

    def call(self, inputs): # inputs should be the stacked x_0
        # Cross Network Path
        x_cross = inputs # x_0
        for i in range(self.num_cross_layers):
            x_cross = self.cross_layers[i]([inputs, x_cross]) # Pass x_0 and current x_l

        # Deep Network Path
        x_deep = self.deep_network(inputs)

        # Combination
        concatenated_output = tf.keras.layers.concatenate([x_cross, x_deep])
        final_output = self.combination_layer(concatenated_output)
        return final_output

Training:

Optimizer: Adam optimizer is recommended.
Batch Size: 512 was used for the Criteo dataset.
Batch Normalization: Applied to the deep network. This can be added after each dense layer before the ReLU activation.
Gradient Clipping: A norm of 100 was used, which can help stabilize training.
Regularization: Early stopping was found to be more effective than $L_2$ regularization or dropout in the paper's experiments.
Hyperparameters:
- Number of cross layers ( $L_c$ ): 1 to 6 (best was 6 for Criteo).
- Number of deep layers ( $L_d$ ): 2 to 5.
- Deep layer size ( $m$ ): 32 to 1024.
- Learning rate: Tuned from 0.0001 to 0.001.

Computational Requirements and Limitations:

Memory: DCN is significantly more memory-efficient than a pure DNN for achieving similar performance. The cross network adds negligible parameter overhead ( $O(d \cdot L_c)$ ) compared to the deep network ( $O(d \cdot m + L_d \cdot m^2)$ ).
Training Time: The cross network operations are efficient. The time complexity is linear in input dimension.
Bounded-Degree Interactions: While efficient for these, the cross network might not be as effective as deep networks for capturing extremely complex, high-degree implicit interactions that DNNs excel at. This is why the deep network component is retained.

Experimental Results and Implications

Criteo Dataset:
- DCN achieved a logloss of 0.4419, outperforming DNN (0.4428), Deep Crossing (0.4425), FM (0.4464), and LR (0.4474). An improvement of 0.001 in logloss is considered practically significant on this dataset.
- DCN used only 40% of the memory consumed by the best DNN model.
- To achieve a logloss of 0.4430, DCN required $7.9 \times 10^5$ parameters, while DNN needed $3.2 \times 10^6$ parameters (almost an order of magnitude difference).
- Adding even one cross layer to a DNN consistently improved performance. The optimal number of cross layers was 6, suggesting higher-order explicit interactions were valuable.
Non-CTR Datasets (Forest Covertype, Higgs):
- Forest Covertype: DCN achieved the best accuracy (0.9740) with the least memory.
- Higgs: DCN achieved the best logloss (0.4494) compared to DNN (0.4506), using half the memory.
- This demonstrates DCN's applicability beyond CTR prediction to general classification tasks with sparse and dense features.

Key Advantages for Practitioners

Automatic Feature Engineering: DCN explicitly learns bounded-degree feature interactions automatically, reducing the need for manual feature crafting.
Efficiency: The cross network is computationally and memory efficient. This allows for deeper or wider models within given resource constraints, or achieving comparable performance with fewer resources.
Performance: DCN demonstrated superior performance (accuracy/logloss) compared to several state-of-the-art models, including pure DNNs, on multiple datasets.
Interpretability (Partial): While the deep network part remains a black box, the cross network explicitly defines the types of interactions it creates (polynomials up to a certain degree), which can offer some level of understanding.
Ease of Implementation: The cross layer has a simple structure and can be implemented with standard neural network operations.

Potential Limitations and Considerations

Optimal Number of Cross Layers: Finding the optimal number of cross layers requires tuning. Too few may not capture sufficient interactions, while too many might not always lead to improvements and could potentially lead to overfitting or optimization difficulties (though the paper suggests diminishing returns rather than degradation for up to 6 layers).
Interaction with Deep Network: The interplay between the cross and deep networks during joint optimization is complex. While empirically shown to be beneficial, the theoretical understanding of this interaction is an area for future research.
Very High-Degree Interactions: The explicit nature of the cross network is designed for bounded-degree interactions. For arbitrarily complex interactions beyond this bound, the model still relies on the deep network component.

In summary, DCN offers a practical and effective architecture for tasks involving sparse and dense features, particularly where feature interactions are crucial. Its strength lies in efficiently and explicitly learning these interactions through its novel cross network, while leveraging a parallel deep network to capture other complex patterns. This makes it a compelling alternative to pure DNNs or models requiring extensive manual feature engineering in domains like recommendation systems, ad-tech, and other classification problems.