Yan Kang, Hao Lin, Mingjian Yang, Shin-Jye Lee³
³ Institute of Management of Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
³ Email: camhero@gmail.com
Abstract
The rapid advancement of high-quality AI-based image generation models has produced a deluge of anime illustrations, making it challenging to recommend illustrations to users. However, existing anime recommendation systems (RS) focus on text features and have yet to integrate image features. In addition, most multi-modal (MM) RS research is constrained by tightly coupled datasets, limiting its applicability to illustration RS. We propose UMAIR-FPS (User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style) to tackle these gaps. In the feature extraction phase, for image features, we are the first to combine painting-style features with semantic features to construct a dual-output image encoder that enhances image representation. For text features, we obtain embeddings by fine-tuning Sentence-Transformers with domain knowledge, composing a variety of anime text pairs from the multilingual mapping, entity relationship, and term explanation perspectives. In the MM fusion phase, we propose a novel user-aware multi-modal contribution measurement mechanism that dynamically weights MM features according to user features at the interaction level, and we employ the DCN-V2 module to model bounded-degree MM crosses effectively. UMAIR-FPS surpasses SOTA baselines on large real-world datasets.
Keywords:
Anime illustration recommendation · Painting style features · Multi-modal feature extraction · Multi-modal feature fusion
1 Introduction
With the booming growth and rising popularity of the ACG (Anime, Comics, and Games) industry, anime-related RS has become an active research topic. AIGC models allow high-quality illustrations to be generated without painting skills, leading to an exponential increase in the number of illustrations. More and more websites, such as Civitai.com, are dedicated to sharing AI-generated illustrations, and users face the challenge of finding the illustrations they like. To alleviate this information overload, it is essential to recommend illustrations matching users' preferences and thereby improve their browsing experience.
RS have been proven to be an effective solution for such challenges [1], and many efforts have been devoted to them. Earlier methods utilize collaborative filtering to leverage users' past interaction behaviors, including ratings and item click logs [2, 3]. Matrix factorization techniques [4] and attention mechanisms [5] have been successfully combined with deep learning to yield significant improvements. Reinforcement learning emphasizes online learning and real-time model updates [6]. Graph neural networks utilize information about users and items, such as social network relationships between users [7, 8] and contextual features [9], to provide more personalized recommendations that align with users' interests. Recently, multi-modal recommendation systems (MRS) have gained widespread attention by leveraging different modal features of items, such as visual and textual characteristics together with interaction information, to better mine item attributes not revealed by interactions alone.
Illustrations typically include elements such as images and text, which together form a complete story or convey information, so MRS can help users find various media content related to their interests. However, introducing MRS to anime illustration RS still faces the following challenges: 1) Specific feature extraction. General pre-trained encoders lack domain-specific expertise, and distinct datasets may necessitate specialized modal encoders. Nonetheless, much MRS research resorts to generic encoders or ignores specialized encoders altogether. 2) Varying perceptions of and preferences for MM content in illustrations. Existing MRS often fail to reweight different media modalities dynamically, even though the associations and interactions between multiple modalities have great potential to improve the personalization, quality, and accuracy of recommendations. Notably, although feature crosses (FX) have been widely proven to be an effective means of enhancing performance in general RS, they have yet to be effectively applied in MRS.
Following the MM paradigm [1], we innovate in the feature extraction and MM fusion stages to address these challenges. 1) Feature Extraction. For text encoders, to enable pre-trained models to understand anime-domain terms, we construct a large-scale multi-perspective anime text pair dataset that mainly includes multilingual noun mappings, relationships between entities, and explanations of nouns. We then fine-tune Sentence-Transformers [10] pre-trained models to extract text feature vectors encompassing domain knowledge. As shown in Fig. LABEL:illusts_with_diff_style, different lines, colors, brushstrokes, and composition styles can strengthen an illustration's expressiveness and recognizability. For image encoders, we are the first to propose simultaneously extracting both painting-style and content-semantic features to enhance image representation. We construct a pretext task of multi-class multi-label prediction using images and their labels, as shown in Fig. LABEL:illust_and_label, and then build an image encoder with dual outputs of style and semantics.
2) MM Fusion. We propose a User-aware Multi-modal Contribution Measurement (UMCM) mechanism that considers the varying contribution levels of modalities to user preference behavior and automatically adjusts the modality weights of a specific illustration for each user at the interaction level. Furthermore, considering that modalities influence user preferences in a non-independent manner, we also introduce FX from general RS into the MM fusion stage, using DCN-V2 [9] for higher-order modality interactions to better model user preferences.
Our main contributions are summarized as follows:
- Insight. We emphasize the crucial importance of scene-specific modal encoders, which is generally overlooked in current MRS. As the first study on illustration MRS, we introduce and construct, for the first time, a dual-output image encoder for semantic and stylistic features. Simultaneously, we propose designing a multi-perspective domain text-pair dataset to fine-tune text encoders for adaptation to specific domains.
- General Framework. We analyze the varying contribution levels of multiple modalities to user preference behavior and introduce the UMCM mechanism. We also incorporate feature crosses to better model user preferences. Both mechanisms can be easily integrated into other MRS.
- Evaluation. Our approach achieves substantial performance improvements on real-world datasets in comparative experiments. We also conduct ablation experiments to explore the impact of each module. (Our code and datasets are available on GitHub.)
2 The Proposed Model
2.1 Problem Definition
The input of the anime illustration RS task is a set of users $\mathcal{U}=\{u_1, u_2, \dots, u_M\}$ and a set of illustrations $\mathcal{I}=\{i_1, i_2, \dots, i_N\}$. Specifically, a user is denoted $u = (l_u, p_u)$, where $l_u$ represents the interest labels and $p_u$ is the profile that contains username, gender, etc. An illustration is denoted $i = (v_i, t_i, m_i)$, where $v_i$ is the image of $i$, $t_i$ represents the labels that artists use to annotate $i$, and $m_i$ implies other metadata of the illustration such as publish date, image size, etc. The user-illustration interaction is typically formulated as a matrix $Y \in \{0,1\}^{M \times N}$; specifically, $y_{ui}=1$ means $u$ has bookmarked $i$, otherwise $y_{ui}=0$. Our learning goal is to train a model to predict the probability that $u$ clicks the target illustration $i$, formulated as:

$$\hat{y}_{ui} = \mathcal{F}(u, i \mid \Theta) \tag{1}$$

where $\mathcal{F}$ and $\Theta$ denote the model and its weights.
2.2 Dual-output Image Encoder
Different painting styles give an illustration personality and artistry, making it more attractive, expressive, and recognizable. In the illustration domain, not only the semantic features of the image but also the stylistic features of the painting influence user preferences. Therefore, we build an image encoder $E_{img}$ with both semantic and style feature outputs according to the following steps: 1) To extract image features aligned with anime domain knowledge, we use the images $v_i$ as inputs and the artist labels $t_i$ as targets, building a large-scale anime image multi-label multi-classification dataset. 2) We choose the ResNet101 architecture to construct the encoder $E_{img}$ and pre-train it on this dataset. 3) After the pretext task is fitted, the classification head is removed from the pre-trained model. Then, as shown in Fig. LABEL:image_encoder, the feature maps (FM) generated from various layers of $E_{img}$ are adapted to obtain the dual outputs:

$$[\mathbf{v}^{sem}_i, \mathbf{v}^{sty}_i] = E_{img}(v_i) \tag{2}$$
Semantic Feature Output. Higher-level FMs from convolutional neural networks (CNN) describe more complex features. In the pretext task, we choose the text annotations made by artists as the labels for the images, so $E_{img}$ effectively learns the semantic features expressed by these labels. To reduce the spatial resolution of the features and better capture the high-level semantics of the image, we add a mean pooling layer after the last convolutional layer of $E_{img}$ to obtain the semantic feature:

$$\mathbf{v}^{sem}_i = \mathrm{MeanPool}\big(F^{(L)}\big) \tag{3}$$

where $F^{(L)}$ represents the output of the last convolutional layer of $E_{img}$.
Style Feature Output. As each layer's FM contains multiple channels, using shallow FMs directly as style features not only leads to excessive output dimensions but also fails to capture the interactions between channels. Since these interactions are closely related to the texture and style of the image, we adopt the channel-wise Gram matrix to capture the interactions between channels in the FMs and quantify style features as follows:

$$G^{(l)}_{c,c'} = \frac{1}{H_l W_l} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} F^{(l)}_{h,w,c}\, F^{(l)}_{h,w,c'} \tag{4}$$

where $F^{(l)}_{h,w,c}$ represents the pixel value at position $(h, w)$ on channel $c$ in the $l$-th layer of $E_{img}$, and $F^{(l)}_{h,w,c'}$ represents the same for channel $c'$. The FMs within a given convolutional layer share the same size, with height and width denoted by $H_l$ and $W_l$.

Since directly concatenating these matrices would result in excessive output dimensions, max pooling is applied to discard local details so that the representation focuses on the global texture and structure of the image, which is more suitable for characterizing its style. Therefore, the style feature $\mathbf{v}^{sty}_i$ is obtained by merging the Gram matrices $G^{(l)}$ of the first three layers through max pooling.
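To make the dual-output construction concrete, the following is a minimal TensorFlow sketch assuming a ResNet101 backbone already pre-trained on the multi-label pretext task; the chosen layer names, the Gram-matrix pooling size, and the way the three shallow Gram matrices are merged are illustrative assumptions rather than the exact UMAIR-FPS configuration.

```python
import tensorflow as tf

def gram_matrix(fm):
    """Channel-wise Gram matrix of a (B, H, W, C) feature map, returned as a
    (B, C, C, 1) map so it can be max-pooled like an image (Eq. 4)."""
    b = tf.shape(fm)[0]
    h, w, c = fm.shape[1], fm.shape[2], fm.shape[3]
    flat = tf.reshape(fm, (b, h * w, c))
    gram = tf.matmul(flat, flat, transpose_a=True) / float(h * w)
    return gram[..., tf.newaxis]

def build_dual_output_encoder(input_shape=(224, 224, 3), pool_size=8):
    """Dual-output image encoder: semantic vector and style vector (Eq. 2)."""
    base = tf.keras.applications.ResNet101(
        include_top=False, weights=None, input_shape=input_shape)
    # Shallow blocks feed the style branch, the deepest block the semantic branch
    # (layer names follow tf.keras's ResNet101; adapt if the backbone differs).
    style_layers = ["conv2_block3_out", "conv3_block4_out", "conv4_block23_out"]
    semantic_layer = "conv5_block3_out"

    # Semantic feature: mean-pool the last convolutional feature map (Eq. 3).
    e_sem = tf.keras.layers.GlobalAveragePooling2D()(
        base.get_layer(semantic_layer).output)

    # Style feature: per-layer Gram matrices, shrunk by max pooling and concatenated.
    style_vecs = []
    for name in style_layers:
        g = tf.keras.layers.Lambda(gram_matrix)(base.get_layer(name).output)
        g = tf.keras.layers.MaxPooling2D(pool_size)(g)
        style_vecs.append(tf.keras.layers.Flatten()(g))
    e_sty = tf.keras.layers.Concatenate()(style_vecs)

    return tf.keras.Model(base.input, [e_sem, e_sty], name="dual_output_encoder")
```

The two outputs correspond to $\mathbf{v}^{sem}_i$ and $\mathbf{v}^{sty}_i$ in Eq. (2).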
2.3 Multi-perspective Text Encoder
Although general pre-trained models (GPMs) perform well on various natural language processing tasks, they often struggle to adapt to specific domains; for example, they may misunderstand domain-specific terminology and knowledge. It is difficult for GPMs to understand vocabulary such as character names, specific anime titles, concepts, and settings within anime works. The mixture of multiple languages, such as Chinese, Japanese, and English, combined with domain-specific vocabulary, leads to poor performance of GPMs on our datasets. Therefore, we collect multi-perspective text pairs to fine-tune a GPM and construct a text encoder that integrates domain knowledge from diverse views, in two main steps:
1) As shown in Fig. LABEL:text_encoder, we collect publicly available information from various websites such as Bangumi, Moegirlpedia, MyAnimeList, and Wikipedia to create a dataset of text pairs covering the multilingual mapping, domain relationship, and term explanation perspectives, as shown in Table 1.
Type | Sub Type | Example
Multilingual mappings of domain terms | Chinese-English | 火影忍者 ↔ Naruto
 | English-Japanese | Uchiha Sasuke ↔ うちはサスケ
 | Japanese-Chinese | うずまきナルト ↔ 漩涡鸣人
Relationships between domain entities | series-character | Naruto's character Uchiha Sasuke
 | character-character | Uchiha Sasuke's friend Uzumaki Naruto
Explanations of domain terms | |
Specifically, texts from different sources provide multi-level information about a given topic, and aligning parallel texts across languages enables the model to learn the correspondences between multiple languages.
2) We fine-tune Sentence-Transformers [10], initialized with multilingual pre-trained weights for semantic search, on this dataset. The resulting text encoder $E_{txt}$ integrates domain knowledge and outputs textual semantic features as:

$$\mathbf{t}^{sem}_i = E_{txt}(t_i) \tag{5}$$
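A minimal sketch of this fine-tuning step with the sentence-transformers library is shown below; the base checkpoint, the in-batch-negatives loss, and the toy pairs are illustrative assumptions rather than the exact training recipe.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# A multilingual base model suited to semantic search (illustrative choice).
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Toy text pairs from the perspectives of Table 1.
pairs = [
    ("火影忍者", "Naruto"),                        # multilingual mapping
    ("Uchiha Sasuke's friend", "Uzumaki Naruto"),  # entity relationship
]
train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: matched pairs are pulled together, other texts pushed apart.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)

# The fine-tuned model then serves as the text encoder E_txt in Eq. (5).
label_embedding = model.encode("うずまきナルト")
```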
2.4 User-aware Multi-modal Contribution Measurement
We obtain a list of MM representations $E_i = [\mathbf{v}^{sty}_i, \mathbf{v}^{sem}_i, \mathbf{t}^{sem}_i, \mathbf{m}_i]$, where the vector representation $\mathbf{m}_i$ of the metadata is obtained through an embedding layer. Representations from different modalities should contribute to user preference to varying degrees; for example, a user might bookmark an illustration because of its stylistic features, its semantic content, or its text labels. Guided by the interaction data, we propose the UMCM mechanism to learn dynamic weights for the MM features of an illustration and produce the re-weighted MM illustration features as:

$$\hat{E}_i = f(\mathbf{e}_u, E_i) \odot E_i \tag{6}$$

where $\mathbf{e}_u$ refers to the user feature vector obtained by embedding $l_u$ and $p_u$, and the function $f(\cdot)$ outputs the weights of the MM features. The function $f(\cdot)$, implemented by the dot-product attention mechanism, is defined as:

$$f(\mathbf{e}_u, E_i) = \mathrm{softmax}\!\left(\frac{(\mathbf{e}_u W_Q)(E_i W_K)^{\top}}{\sqrt{d}}\right) \tag{7}$$

where $W_Q$ and $W_K$ are learnable projection matrices and $d$ is the projection dimension.
Given the effectiveness of an individual UMCM, and inspired by the MoE [11] framework in multi-task RS, we employ parallel UMCMs to act as experts making different decisions. Furthermore, we utilize a dot-product attention mechanism to implement an aggregation gate that consolidates the decisions of the multiple experts, further enhancing the modeling of MM contributions.
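A minimal sketch of one UMCM expert and its MoE-style aggregation is given below, assuming all modality vectors have been projected to a common dimension; the dense projections and the simple user-conditioned softmax gate are illustrative simplifications of the dot-product attention gate described above.

```python
import tensorflow as tf

class UMCM(tf.keras.layers.Layer):
    """One user-aware contribution-measurement expert (Eqs. 6-7): the user vector
    queries the modality vectors via dot-product attention, and the resulting
    weights rescale each modality."""
    def __init__(self, dim):
        super().__init__()
        self.q = tf.keras.layers.Dense(dim)  # projects the user vector
        self.k = tf.keras.layers.Dense(dim)  # projects each modality vector

    def call(self, user_vec, modal_vecs):
        # user_vec: (B, d_u); modal_vecs: (B, M, d) with M modalities.
        q = self.q(user_vec)[:, None, :]                       # (B, 1, dim)
        k = self.k(modal_vecs)                                  # (B, M, dim)
        scores = tf.reduce_sum(q * k, axis=-1)                  # (B, M)
        weights = tf.nn.softmax(scores / float(k.shape[-1]) ** 0.5, axis=-1)
        return modal_vecs * weights[:, :, None]                 # re-weighted modalities

class MultiExpertUMCM(tf.keras.layers.Layer):
    """Several parallel UMCM experts combined by a user-conditioned softmax gate,
    in the spirit of MoE."""
    def __init__(self, dim, num_experts=3):
        super().__init__()
        self.experts = [UMCM(dim) for _ in range(num_experts)]
        self.gate = tf.keras.layers.Dense(num_experts)

    def call(self, user_vec, modal_vecs):
        expert_out = tf.stack(
            [e(user_vec, modal_vecs) for e in self.experts], axis=1)   # (B, E, M, d)
        g = tf.nn.softmax(self.gate(user_vec), axis=-1)                # (B, E)
        return tf.reduce_sum(expert_out * g[:, :, None, None], axis=1)  # (B, M, d)
```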
2.5 Multi-modal Crosses
Using an MLP to simulate implicit feature interactions of arbitrary order is time-consuming. To avoid unnecessary computation, we utilize a lightweight DCN-V2 [9] module to model explicit, bounded-degree MM crosses. The core idea of DCN-V2 is to stack explicit feature cross layers, each of which crosses the current representation with the original input features.
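Concretely, a DCN-V2 cross layer computes $x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l$, so stacking $L$ layers yields feature crosses of degree at most $L+1$. A minimal sketch follows, with the layer count as an illustrative choice.

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """One DCN-V2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def build(self, input_shape):
        d = input_shape[0][-1]
        self.w = self.add_weight(name="w", shape=(d, d), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")

    def call(self, inputs):
        x0, xl = inputs
        return x0 * (tf.matmul(xl, self.w) + self.b) + xl

def cross_network(x0, num_layers=3):
    """Explicit, bounded-degree feature crosses over the fused user/MM representation."""
    xl = x0
    for _ in range(num_layers):
        xl = CrossLayer()([x0, xl])
    return xl
```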
3 Experiments
3.1 Experimental Settings
Category | #Users | #Illustrations | #Interactions | Time range
Train | 215,394 | 1,882,675 | 24,233,663 | 2021/01/01 - 2021/10/01
Test | 108,955 | 1,229,347 | 9,472,225 | 2021/10/01 - 2022/01/01
Dataset. To the best of our knowledge, there is currently no high-quality public dataset for anime illustration RS. We therefore collected user interaction logs from a commercial illustration website to construct our dataset; its details are shown in Table 2. Fig. LABEL:fig:userandillustraion (a) and (b) depict the distribution of interaction counts from the perspectives of illustrations and users, respectively, showing the long-tail distribution present in the dataset and underscoring the difficulty of anime illustration RS.
Evaluation Metrics. As in [12, 13, 14, 9], we use the Area Under the Curve (AUC) and the Binary Cross-Entropy (BCE) loss to evaluate the performance of all methods.
Baselines. To demonstrate the effectiveness of our proposed method, we compare it with six SOTA general RS approaches.
Implementation Details. We implement UMAIR-FPS in the TensorFlow framework, use the Adam optimizer with a learning rate of 1e-3, and train on the dataset for just one epoch to avoid overfitting [15]. We keep the default hyperparameter settings of the baseline methods, except that the embedding layer dimension is set to 8 and the L2 regularization parameter to 1e-3.
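For concreteness, the optimization setup above corresponds roughly to the following Keras configuration; `build_umair_fps()` and the dataset objects are hypothetical placeholders for the assembled model and the interaction data, not the released code.

```python
import tensorflow as tf

model = build_umair_fps()  # hypothetical constructor assembling the modules above

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(name="auc")],
)

# A single pass over the training interactions; more epochs tend to overfit
# CTR-style models [15].
model.fit(train_dataset, validation_data=test_dataset, epochs=1)
```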
3.2 Performance Comparison
Model | BCE Loss | AUC | #Parameters
DCN [16] | 1.1964 | 0.7766 | 10,175,324
xDeepFM [12] | 1.7769 | 0.7219 | 10,302,596
FiBiNET [14] | 1.3760 | 0.7019 | 10,424,966
DIFM [13] | 0.5685 | 0.8055 | 10,210,660
DCN-V2 [9] | 0.6469 | 0.7726 | 10,278,092
FinalMLP [17] | 0.5087 | 0.8034 | -
UMAIR-FPS | 0.3797 | 0.8490 | 10,597,028
Improvement | 25.35% | 5.4% | -4.292%
Table 3 presents the performance comparison of UMAIR-FPS with all other methods on the dataset, and it also reports each model's parameter count to assess space complexity. As shown in Table 3, our model exhibits significant improvement over the six baseline models. Compared to the best-performing baseline, UMAIR-FPS improves AUC by 5.4% and, notably, reduces the BCE loss by an impressive 25.35%. We attribute the superior performance of our model to the following factors: 1) Combination of textual and visual information. This allows us to model the intrinsic correlations between modalities, generating a more comprehensive item representation and, to some extent, alleviating the data sparsity caused by the long-tail distribution. 2) The UMCM mechanism. By re-weighting the MM features, our model more closely simulates the real selection process of users. 3) MM crosses. These increase the non-linear capacity of the model, capturing user preferences at a finer granularity.
Even though UMAIR-FPS introduces features from multiple modalities, its parameter count remains on par with other models. This ensures that our method does not compromise efficiency while guaranteeing predictive accuracy.
3.3 Ablation Study
To further investigate the effectiveness of our key modules, we conduct ablation studies on four groups of variants, as presented in Table 4.
Group | Variant | BCE Loss | AUC | #Parameters
Multi-modal | w/o all MM features | 0.6469 | 0.7726 | 10,283,072
Multi-modal | - | 0.5267 | 0.8240 | 10,580,580
Multi-modal | - | 0.5201 | 0.8433 | 10,547,812
Multi-modal | - | 0.5943 | 0.8233 | 10,564,196
UMCM by Single | w/o UMCM | 0.5493 | 0.8297 | 10,556,624
UMCM by Single | single UMCM (ATT) | 0.4750 | 0.8470 | 10,597,028
UMCM by Single | single UMCM (SENet) | 0.5172 | 0.8484 | 10,597,608
UMCM by MoE | - | 0.5846 | 0.8341 | 10,597,028
UMCM by MoE | - | 0.3797 | 0.8490 | 10,597,028
UMCM by MoE | - | 0.4214 | 0.8475 | 10,597,028
UMCM by MoE | - | 0.5123 | 0.8370 | 10,598,188
UMCM by MoE | - | 0.4963 | 0.8465 | 10,598,768
UMCM by MoE | - | 1.6368 | 0.7607 | 10,599,348
DCN-V2 | w/o MM crosses | 0.5777 | 0.8337 | 10,522,532
- | UMAIR-FPS | 0.3797 | 0.8490 | 10,597,028
Impact of Multi-modal Features. To explore the effectiveness of the MM features, we establish four variants. In the first variant, we eliminate all MM features, including the style and semantic features of images as well as the semantic features of the text labels. As a result, the AUC decreases by 9.00%, the BCE loss increases by 70.37%, and the parameter count drops by only 2.96%. This underscores the significance of combining textual and visual multi-modal features for accurately capturing user preferences, while the model size is hardly affected by this integration.
In the other three variants, we respectively remove the style features of images, the semantic features of images, and the semantic features of the text labels. Removing any of them degrades performance, indicating that each feature type brings substantial improvements to anime recommendations. Among all features, the enhancement brought by the image semantic features is the smallest, as they are guided by the actual image text labels in the pretext task; moreover, the information obtained from the lower-level style features might still overlap with them.
The effectiveness of the image encoder is illustrated in Fig. LABEL:heatmap: the cosine distances between the feature vectors of similar artworks are small at both the semantic and the style level, while they are large for dissimilar ones.
Impact of UMCM. We design three variants: one without UMCM, one with a single UMCM implemented with the dot-product attention mechanism (ATT), and one in which the ATT is replaced with an SENet module. Compared to the variant without UMCM, the single ATT-based UMCM reduces the loss by 13.52%, increases the AUC by 2.085%, and enlarges the model parameters by only 0.382%. Compared to the SENet variant, its loss decreases while the AUC remains roughly consistent. This suggests that dynamically reweighting illustration features based on the user feature vector can effectively model user preferences.
We further enhance UMCM with the MoE framework. Our observations indicate that, for multi-modal contribution reweighting, a stack of three expert modules is optimal for both ATT and SENet, and adding more modules tends to degrade performance. Thus, by borrowing the MoE design from multi-task RS, we can enhance recommendation performance through multiple expert weight-modeling modules whose outputs are aggregated, facilitating representation learning for the recommendation task.
Impact of MM Crosses. We compare UMAIR-FPS with a variant that removes the FX module. Without FX, the AUC decreases by 1.80% and the BCE loss increases by 52.14%, while the model parameters are reduced by only 0.70%. These results underscore the efficacy of FX. Moreover, even after the user-aware weight fusion, the MM features can still benefit from FX with the user features.
4 Conclusion
This paper introduces UMAIR-FPS, a novel user-aware MM anime illustration RS. For the first time, we propose integrating painting style into MM features. Our method constructs a dual-output image encoder based on the stylistic and semantic features and a text encoder that is fine-tuned using multi-perspective text, enabling it to understand anime knowledge. Additionally, to account for the varying contribution levels of multiple modalities to user preference behavior, we design UMCM based on attention. Conducting FX among multiple modalities utilizing DCN-V2 further enhances the model’s recommendation accuracy. Through extensive experiments on real-world datasets, we validate the superiority of UMAIR-FPS compared to other SOTA methods.
References
- [1] Liu, Q., et al.: Multimodal recommender systems: A survey. arXiv preprint arXiv:2302.03883 (2023)
- [2] He, X., et al.: LightGCN: Simplifying and powering graph convolution network for recommendation. In: Proc. SIGIR. pp. 639–648 (2020)
- [3] Wang, X., Jin, H., Zhang, A., He, X., et al.: Disentangled graph collaborative filtering. In: Proc. SIGIR. pp. 1001–1010 (2020)
- [4] He, X., et al.: Neural collaborative filtering. In: Proc. WWW. pp. 173–182 (2017)
- [5] Zhou, G., et al.: Deep interest network for click-through rate prediction. In: Proc. SIGKDD. pp. 1059–1068 (2018)
- [6] Zhao, X., et al.: Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2017)
- [7] Yang, L., et al.: ConsisRec: Enhancing GNN for social recommendation via consistent neighbor aggregation. In: Proc. SIGIR. pp. 2141–2145 (2021)
- [8] Yu, J., et al.: Self-supervised multi-channel hypergraph convolutional network for social recommendation. In: Proc. WWW. pp. 413–424 (2021)
- [9] Wang, R., et al.: DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In: Proc. WWW. pp. 1785–1797 (2021)
- [10] Reimers, N., et al.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
- [11] Ma, J., et al.: Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In: Proc. SIGKDD. pp. 1930–1939 (2018)
- [12] Lian, J., et al.: xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In: Proc. SIGKDD. pp. 1754–1763 (2018)
- [13] Lu, W., et al.: A dual input-aware factorization machine for CTR prediction. In: Proc. IJCAI. pp. 3139–3145 (2021)
- [14] Huang, T., et al.: FiBiNET: Combining feature importance and bilinear feature interaction for click-through rate prediction. In: Proc. RecSys. pp. 169–177 (2019)
- [15] Zhang, Z.Y., et al.: Towards understanding the overfitting phenomenon of deep click-through rate models. In: Proc. CIKM. pp. 2671–2680 (2022)
- [16] Wang, R., et al.: Deep & cross network for ad click predictions. In: Proc. ADKDD. pp. 1–7 (2017)
- [17] Mao, K., et al.: FinalMLP: An enhanced two-stream MLP model for CTR prediction. arXiv preprint arXiv:2304.00902 (2023)