User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62366060 and 61762092, and by the Open Foundation of the Yunnan Key Laboratory of Software Engineering under Grant No. 2023SE203.

1 National Pilot School of Software, Yunnan University, Kunming 650106, China
  kangyan@ynu.edu.cn, {oysterqaq,ymj123}@mail.ynu.edu.cn
2 Yunnan Key Laboratory of Software Engineering, Kunming 650106, China
3 Institute of Management of Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
  camhero@gmail.com

Yan Kang^{1,2}, Hao Lin^{1}, Mingjian Yang^{1(✉)}, Shin-Jye Lee^{3}

Abstract

The rapid advancement of high-quality AI-based image generation models has produced a deluge of anime illustrations, making it challenging to recommend illustrations to users. Existing anime recommendation systems (RS) focus on text features and have yet to integrate image features. In addition, most multi-modal (MM) RS research is constrained by tightly coupled datasets, limiting its applicability to illustration RS. We propose User-aware Multi-modal Animation Illustration Recommendation Fusion with Painting Style (UMAIR-FPS) to address these gaps. In the feature extraction phase, for image features, we are the first to combine painting-style features with semantic features in a dual-output image encoder to enhance representation. For text features, we obtain embeddings by fine-tuning Sentence-Transformers with domain knowledge, composing diverse anime text pairs from multilingual mappings, entity relationships, and term-explanation perspectives. In the MM fusion phase, we propose a novel user-aware multi-modal contribution measurement mechanism that dynamically weights MM features according to user features at the interaction level, and we employ the DCN-V2 module to model bounded-degree MM feature crosses effectively. UMAIR-FPS surpasses the state-of-the-art (SOTA) baselines on large real-world datasets.

Keywords:

Anime illustration recommendation · Painting style features · Multi-modal feature extraction · Multi-modal feature fusion

1 Introduction

With the booming growth and rising popularity of the ACG (Anime, Comics, and Games) industry, anime-related RS has become an active research topic. AIGC models have led to an exponential increase in illustrations, since high-quality illustrations can now be generated without painting skills. More and more websites, such as Civitai.com, are dedicated to sharing AI-generated illustrations, and users face the challenge of finding their favorite ones. To alleviate this information overload, it is essential to recommend illustrations that match users' preferences and thereby improve their browsing experience.

RS have been proven to be an effective solution for such challenges [1], and many efforts have been devoted to them. Earlier methods use collaborative filtering to leverage users' past interaction behaviors, including ratings and item click logs [2, 3]. Matrix factorization techniques [4] and attention mechanisms [5] have been successfully combined with deep learning to yield significant improvements. Reinforcement learning emphasizes online learning and real-time model updates [6]. Graph neural networks exploit information about users and items, such as social relationships between users [7, 8] and contextual features [9], to provide more personalized recommendations aligned with users' interests. Recently, multi-modal recommendation systems (MRS) have gained widespread attention by leveraging different modal features of items, such as visual and textual characteristics together with interaction information, to better mine item attributes not revealed by interactions alone.

Illustrations typically combine elements such as images and text, which together tell a complete story or convey information, so MRS can help users find media content related to their interests. However, applying MRS to anime illustration RS still faces the following challenges: 1) Specific feature extraction. General pre-trained encoders lack domain-specific expertise, and distinct datasets may require specialized modal encoders. Nonetheless, much MRS research resorts to generic encoders or ignores specialized encoders altogether. 2) Varying perceptions of and preferences for MM content in illustrations. Existing MRS often fail to re-weight different modalities dynamically, yet the associations and interactions between modalities have great potential to improve the personalization, quality, and accuracy of recommendations. Notably, although feature crosses (FX) have been widely proven effective in general RS, they have yet to be effectively applied in MRS.

Following the MM paradigm [1], we innovate in the feature extraction and MM fusion stages to address these challenges. 1) Feature extraction. For the text encoder, to enable pre-trained models to understand terms in the anime domain, we construct a large-scale, multi-perspective anime text-pair dataset comprising multilingual noun mappings, relationships between entities, and explanations of nouns. We then fine-tune a Sentence-Transformers [10] pre-trained model to extract text feature vectors that encompass domain knowledge. As shown in Fig. LABEL:illusts_with_diff_style, different lines, colors, brushstrokes, and composition styles can strengthen an illustration's expressiveness and recognizability. For the image encoder, we are the first to propose extracting painting-style and content-semantic features simultaneously to enhance image representation. We construct a pretext task of multi-class, multi-label prediction using images and labels, as shown in Fig. LABEL:illust_and_label, and then build an image encoder with dual style and semantic outputs.

2) MM fusion. We propose a User-aware Multi-modal Contribution Measurement (UMCM) mechanism that accounts for the varying contributions of modalities to user preference behavior and automatically adjusts modality weights for specific illustrations and users at the interaction level. Furthermore, considering that modalities influence user preferences in a non-independent manner, we introduce FX from general RS into the MM fusion stage, using DCN-V2 [9] for higher-order modality interactions to better model user preferences.

  • Insight. We emphasize the crucial importance of scene-specific modal encoders, a point generally overlooked in current MRS. As the first study on illustration MRS, we introduce and construct, for the first time, a dual-output image encoder for semantic and stylistic features. We also propose designing a multi-perspective domain text-pair dataset to fine-tune text encoders for adaptation to specific domains.

  • General Framework. We analyze the varying contributions of multiple modalities to user preference behavior and introduce the UMCM mechanism. We also incorporate feature crosses to better model user preferences. Both mechanisms can be easily integrated into other MRS.

  • Evaluation. Our approach achieves substantial performance improvements on real-world datasets in the comparative experiments. We also conduct ablation experiments to explore the impact of each module. (Our code and datasets are available on GitHub.)


2 The Proposed Model

2.1 Problem Definition

The input of the anime illustration RS task is a set of $N$ users $\mathcal{U}=\{u_1,\ldots,u_N\}$ and a set of $M$ illustrations $\mathcal{I}=\{i_1,\ldots,i_M\}$. Specifically, a user $u=(u^{\text{IL}}, u^{\text{O}})$, where $u^{\text{IL}}$ represents interest labels and $u^{\text{O}}$ is the profile containing username, gender, etc. An illustration $i=(i^{\text{IMG}}, i^{\text{L}}, i^{\text{O}})$, where $i^{\text{IMG}}$ is the image of $i$, $i^{\text{L}}$ represents the labels that artists use to annotate $i$, and $i^{\text{O}}$ is other metadata of the illustration such as publish date, image size, etc. The $u$-$i$ interactions are formulated as a matrix $Y=\{y\}_{N\times M}$, where $y=1$ means $u$ has bookmarked $i$, and $y=0$ otherwise. Our goal is to train a model that predicts the probability $\hat{y}$ of $u$ clicking the target $i$, formulated as:

$\hat{y} = \mathcal{F}(\theta, u, i),$   (1)

where $\mathcal{F}$ and $\theta$ denote the model and its weights.

2.2 Dual-output Image Encoder

Different painting styles give an illustration personality and artistry, making it more attractive, expressive, and recognizable. In the illustration domain, not only the semantic features of the image but also its stylistic features influence user preferences. Therefore, we build an image encoder $\mathbf{E}^{\text{img}}$ with a semantic feature output $e^{\text{SEM}}$ and a style feature output $e^{\text{STY}}$ through the following steps: 1) To extract image features aligned with animation domain knowledge, we use $i^{\text{IMG}}$ as features and $i^{\text{L}}$ as labels to build a large-scale anime image multi-label, multi-class dataset. 2) We adopt the ResNet101 architecture for the encoder and pre-train it on this dataset. 3) After fitting the pretext task, the classification head is removed from the pre-trained model. Then, as shown in Fig. LABEL:image_encoder, the feature maps (FMs) produced by different layers of $\mathbf{E}^{\text{img}}$ are transformed to obtain the dual outputs:

$e^{\text{SEM}}, e^{\text{STY}} = \mathbf{E}^{\text{img}}(i^{\text{IMG}}).$   (2)

Semantic Feature Output. Higher-level FMs from convolutional neural networks (CNNs) describe more complex features. In the pretext task, we choose the text annotations $i^{\text{L}}$ made by artists as the image labels, so $\mathbf{E}^{\text{img}}$ effectively learns the semantic features required by the pre-training task. To reduce the spatial resolution of the features and better capture high-level image semantics, we add a mean pooling layer after the last convolutional layer of $\mathbf{E}^{\text{img}}$ to obtain $e^{\text{SEM}}$:

$e^{\text{SEM}} = \text{MeanPooling}(\mathbf{E}^{\text{img}}_{\text{last}}(i^{\text{IMG}})),$   (3)

where $\mathbf{E}^{\text{img}}_{\text{last}}(i^{\text{IMG}})$ denotes the output of the last convolutional layer of $\mathbf{E}^{\text{img}}$.
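For concreteness, the minimal Keras sketch below shows how the semantic branch could look, assuming a 224×224 RGB input: a ResNet101 backbone with the classification head removed (include_top=False), and GlobalAveragePooling2D playing the role of the mean pooling in Eq. (3). The pre-training on the anime multi-label dataset is omitted here.

```python
import tensorflow as tf

def build_semantic_branch(input_shape=(224, 224, 3)):
    # ResNet101 backbone with the classification head removed; in the paper it is
    # first pre-trained on the anime multi-label tag dataset (omitted in this sketch).
    backbone = tf.keras.applications.ResNet101(
        include_top=False, weights=None, input_shape=input_shape)
    # Mean pooling over the spatial dimensions of the last conv feature map, Eq. (3).
    e_sem = tf.keras.layers.GlobalAveragePooling2D(name="e_sem")(backbone.output)
    return tf.keras.Model(backbone.input, e_sem, name="semantic_output")
```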

Style Feature Output. As each layer's FM contains multiple channels, using shallow FMs directly as style features not only leads to excessive output dimensions but also fails to capture the interactions between channels. Since these interactions are closely related to the texture and style of the image, we adopt the channel-wise Gram matrix to capture them and quantify style features as follows:

$g^{l}_{j,k} = \dfrac{\sum_{h=1}^{H}\sum_{w=1}^{W} \mathbf{E}^{\text{img}}_{l,j,h,w}(i^{\text{IMG}})\, \mathbf{E}^{\text{img}}_{l,k,h,w}(i^{\text{IMG}})}{H \times W},$   (4)

where $\mathbf{E}^{\text{img}}_{l,j,h,w}(i^{\text{IMG}})$ denotes the value at position $(h,w)$ on channel $j$ of the $l$-th layer FM of $\mathbf{E}^{\text{img}}$, and $\mathbf{E}^{\text{img}}_{l,k,h,w}(i^{\text{IMG}})$ is the same for channel $k$. The FMs within a given convolutional layer share the same size, with height $H$ and width $W$.

Since directly concatenating these matrices would result in excessive output dimensions, max pooling is applied to discard local image details and let the representation focus on the global texture and structure of the image; this abstraction is better suited to describing image style. Therefore, the layer-wise Gram matrix representation and $e^{\text{STY}}$ are obtained by merging the $G^{l}$ of the first three layers via max pooling.
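A minimal TensorFlow sketch of the channel-wise Gram matrix of Eq. (4) and the max-pooled merge is given below; the pooling window size and the exact backbone stages taken as the "first three layers" are assumptions, since the paper does not fix them.

```python
import tensorflow as tf

def gram_matrix(feature_map):
    # feature_map: (batch, H, W, C) activations from one backbone stage.
    shape = tf.shape(feature_map)
    hw = shape[1] * shape[2]
    flat = tf.reshape(feature_map, tf.stack([shape[0], hw, shape[3]]))  # (batch, H*W, C)
    gram = tf.matmul(flat, flat, transpose_a=True)                      # (batch, C, C), Eq. (4) numerator
    return gram / tf.cast(hw, feature_map.dtype)                        # divide by H*W

def style_output(stage_maps, pool_size=4):
    # stage_maps: list of feature maps from the first three backbone stages.
    pooled = []
    for fm in stage_maps:
        g = gram_matrix(fm)[..., tf.newaxis]                            # (batch, C, C, 1)
        # Max pooling discards local detail and shrinks each Gram matrix.
        g = tf.nn.max_pool2d(g, ksize=pool_size, strides=pool_size, padding="VALID")
        pooled.append(tf.keras.layers.Flatten()(g))
    return tf.concat(pooled, axis=-1)                                   # e^STY
```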

2.3 Multi-perspective Text Encoder

Although general pre-trained models (GPMs) perform well on various natural language processing tasks, they often struggle to adapt to specific domains: misunderstandings or errors may occur on domain-specific terminology such as character names, anime titles, concepts, and settings within animation works. The combination of multiple languages (Chinese, Japanese, and English) with domain-specific vocabulary leads to poor performance of GPMs on our datasets. Therefore, we collect multi-perspective text pairs to fine-tune a GPM and construct a text encoder $\mathbf{E}^{\text{text}}$ that integrates domain knowledge from diverse views in two main steps:

1) As shown in Fig. LABEL:text_encoder, we collect publicly available information from websites such as Bangumi, Moegirlpedia, MyAnimeList, and Wikipedia to create a dataset of text pairs from three perspectives: multilingual mappings, domain relationships, and term explanations, as shown in Table 1.

Table 1. Anime-domain text pairs from three perspectives.

Multilingual mappings of domain terms
  Chinese-English: 火影忍者 ↔ Naruto
  English-Japanese: Uchiha Sasuke ↔ うちはサスケ
  Japanese-Chinese: うずまきナルト ↔ 漩涡鸣人
Relationships between domain entities
  series-character: Naruto's character ↔ Uchiha Sasuke
  character-character: Uchiha Sasuke's friend ↔ Uzumaki Naruto
Explanations of domain terms
  animation name and its introduction: Naruto ↔ Naruto is a Japanese manga series that tells the story of Naruto Uzumaki
  animation setting and its meaning: chakra ↔ chakra is a Sanskrit word that means wheel or cycle.

Specifically, texts from different sources provide multi-level information about a given topic, and the comparison across languages enables the model to learn correspondences between them.

2) We fine-tune Sentence-Transformers [10] with multilingual pre-trained weights for semantic search on this dataset, obtaining a text encoder $\mathbf{E}^{\text{text}}$ that integrates domain knowledge and outputs textual semantic features:

$e^{\text{TSEM}} = \mathbf{E}^{\text{text}}(i^{\text{L}}).$   (5)
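The sketch below illustrates this fine-tuning step with the sentence-transformers library; the multilingual checkpoint name and the contrastive MultipleNegativesRankingLoss objective are assumptions, as the paper only states that a multilingual semantic-search model is fine-tuned on the text pairs of Table 1.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Multilingual checkpoint tuned for semantic search; the exact model name is an assumption.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Each pair links two texts that should be close in the embedding space
# (examples taken from Table 1).
pairs = [
    InputExample(texts=["火影忍者", "Naruto"]),
    InputExample(texts=["chakra",
                        "chakra is a Sanskrit word that means wheel or cycle."]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)   # assumed contrastive objective

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
# Afterwards, E^text(i^L) is simply model.encode(labels_of_an_illustration).
```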

2.4 User-aware Multi-modal Contribution Measurement

We obtain a list of MM representations $e^{\text{MM}} = \{e^{\text{TSEM}}, e^{\text{SEM}}, e^{\text{STY}}, F^{\text{emb}}(i^{\text{O}})\}$, where the vector representation of the metadata $i^{\text{O}}$ is obtained through an embedding layer $F^{\text{emb}}$. Representations from different modalities should contribute differently to user preference: for example, a user $u$ might bookmark $i$ because of its stylistic features, the semantic content of the illustration, or its text labels. Guided by interaction data, we propose the UMCM mechanism to learn dynamic weights for the MM features of an illustration and produce the re-weighted MM illustration feature $v^{\text{il}}$ as:

$v^{\text{il}} = \sum_{j=1}^{J} \alpha(v^{\text{user}}, e^{\text{MM}}_{j})\, e^{\text{MM}}_{j},$   (6)

where $v^{\text{user}}$ is the user feature vector obtained by embedding $u^{\text{IL}}$ and $u^{\text{O}}$, and $\alpha(\cdot)$ outputs the weights of the MM features. The function $\alpha(\cdot)$, implemented with dot-product attention, is defined as:

$\alpha(v^{\text{user}}, e^{\text{MM}}_{j}) = \dfrac{\exp(a_{j})}{\sum_{j'=1}^{J} \exp(a_{j'})}, \quad \text{where } a_{j} = v^{\text{user}} \cdot e^{\text{MM}}_{j}.$   (7)
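A minimal TensorFlow sketch of Eqs. (6)-(7) follows; it assumes the modal features have already been projected to the same dimension as the user vector, a detail the paper does not specify.

```python
import tensorflow as tf

def umcm(v_user, e_mm):
    # v_user: (batch, d) user vector from embedding u^IL and u^O.
    # e_mm:   (batch, J, d) stacked modal features e^TSEM, e^SEM, e^STY, F^emb(i^O),
    #         assumed projected to a common dimension d.
    scores = tf.einsum("bd,bjd->bj", v_user, e_mm)   # a_j = v_user . e_j
    alpha = tf.nn.softmax(scores, axis=-1)           # Eq. (7)
    return tf.einsum("bj,bjd->bd", alpha, e_mm)      # Eq. (6): re-weighted v^il
```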

Given the effectiveness of a single UMCM, and inspired by the MoE [11] framework in multi-task RS, we employ parallel UMCMs as experts for different decisions. Furthermore, we use dot-product attention as an aggregation gate that consolidates the experts' decisions, further enhancing the modeling of MM contributions.
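The MoE-style extension could look like the sketch below, reusing the umcm helper above; the per-expert projection layers are an assumption introduced only so that the parallel experts can specialize, since the paper describes the experts and the gate but not their internal parameterization.

```python
import tensorflow as tf

class UMCMoE(tf.keras.layers.Layer):
    """Parallel UMCM experts aggregated by a user-conditioned dot-product gate."""

    def __init__(self, dim, num_experts=3, **kwargs):
        super().__init__(**kwargs)
        # One projection per expert so the parallel UMCMs can specialise
        # (an assumption; the paper only states that parallel UMCMs act as experts).
        self.projs = [tf.keras.layers.Dense(dim) for _ in range(num_experts)]

    def call(self, v_user, e_mm):
        # Each expert applies the umcm(...) helper from the sketch above.
        experts = tf.stack([umcm(v_user, p(e_mm)) for p in self.projs], axis=1)  # (batch, E, d)
        gate = tf.nn.softmax(tf.einsum("bd,bed->be", v_user, experts), axis=-1)  # aggregation gate
        return tf.einsum("be,bed->bd", gate, experts)                            # aggregated v^il
```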

2.5 Multi-modal Crosses

Using MLPs to simulate implicit feature interactions of arbitrary order is time-consuming. To avoid unnecessary computation, we adopt the lightweight DCN-V2 [9] module to model explicit, bounded-degree MM crosses. The core idea of DCN-V2 is its explicit feature-cross layers.
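Each DCN-V2 cross layer computes $x_{l+1} = x_{0} \odot (W_{l} x_{l} + b_{l}) + x_{l}$, as in the sketch below; the number of stacked layers and the choice of $x_{0}$ as the concatenation of the user vector with the re-weighted illustration features are assumptions not fixed by the paper.

```python
import tensorflow as tf

class CrossLayerV2(tf.keras.layers.Layer):
    """One explicit cross layer of DCN-V2: x_{l+1} = x0 * (W x_l + b) + x_l."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, d), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")

    def call(self, x0, xl):
        return x0 * (tf.matmul(xl, self.w) + self.b) + xl

def cross_network(x0, num_layers=2):
    # Bounded-degree crosses: stacking L layers yields interactions up to order L + 1.
    x = x0
    for _ in range(num_layers):
        x = CrossLayerV2()(x0, x)
    return x
```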

3 Experiments

3.1 Experimental Settings

Table 2. Dataset statistics.

Dataset   Category   #Users    #Illustrations   #Interactions   Time range
D         Train      215,394   1,882,675        24,233,663      2021/01/01-2021/10/01
D_t       Test       108,955   1,229,347        9,472,225       2021/10/01-2022/01/01

Dataset. To the best of our knowledge, there is currently no high-quality public dataset for anime illustration RS. We therefore collected user interaction logs from a commercial illustration website to construct our dataset; its statistics are shown in Table 2. Fig. LABEL:fig:userandillustraion (a) and (b) depict the distribution of interaction counts from the illustration and user perspectives, respectively, revealing the long-tail distribution in the dataset and underscoring the difficulty of anime illustration RS.

Evaluation Metrics. Following [12, 13, 14, 9], we use the Area Under the Curve (AUC) and the Binary Cross-Entropy (BCE) loss to evaluate all methods.

Baselines. To demonstrate the effectiveness of our proposed method, we compare it with six SOTA general RS approaches.

Implementation Details. We implement UMAIR-FPS in the TensorFlow framework, use the Adam optimizer with a learning rate of 1e-3, and train on the dataset for a single epoch to avoid overfitting [15]. We keep the default hyperparameters of the baseline methods, except that the embedding layers output dimension 8 and use an L2 regularization coefficient of 1e-3.
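A minimal training sketch with these settings is shown below; the model and dataset objects are placeholders standing in for the full UMAIR-FPS model and the real interaction logs, and the embedding vocabulary size is hypothetical.

```python
import tensorflow as tf

# Placeholders: `umair_fps` stands in for the full UMAIR-FPS model and the random
# dataset stands in for the real interaction logs; only the stated hyper-parameters
# (Adam, lr 1e-3, BCE loss, AUC metric, one epoch) are taken from the paper.
umair_fps = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((32, 16)),
     tf.cast(tf.random.uniform((32, 1)) > 0.5, tf.float32))).batch(8)

# Baseline-style sparse-field embedding (dimension 8, L2 1e-3); the vocabulary
# size of 10_000 is hypothetical.
field_embedding = tf.keras.layers.Embedding(
    input_dim=10_000, output_dim=8,
    embeddings_regularizer=tf.keras.regularizers.l2(1e-3))

umair_fps.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(name="auc")])
umair_fps.fit(train_ds, epochs=1)  # a single epoch to avoid overfitting [15]
```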

3.2 Performance Comparison

Table 3. Performance comparison (* marks the best value in each column among baselines/ours).

Model          BCE Loss   AUC       Parameters
DCN [16]       1.1964     0.7766    10,175,324
xDeepFM [12]   1.7769     0.7219    10,302,596
FiBiNET [14]   1.3760     0.7019    10,424,966
DIFM [13]      0.5685     0.8055    10,210,660
DCN-V2 [9]     0.6469     0.7726    10,278,092
FinalMLP [17]  0.5087     0.8034    10,160,912*
UMAIR-FPS      0.3797*    0.8490*   10,597,028
Improvement    25.35%     5.4%      -4.292%

Table 3 presents the performance comparison of UMAIR-FPS with all other methods on the dataset, and also reports each model's parameter count to assess space complexity. As shown in Table 3, our model significantly outperforms the six baseline models. Compared to the best-performing baseline, UMAIR-FPS improves AUC by 5.4% and, notably, reduces the BCE loss by an impressive 25.35%. We attribute this superior performance to the following factors: 1) Combining textual and visual information allows us to model their intrinsic correlations, generating a more comprehensive item representation and, to some extent, alleviating the data sparsity caused by the long-tail distribution. 2) The UMCM mechanism re-weights the MM features so that the model more closely simulates users' real selection process. 3) MM crosses increase the model's non-linear capacity, capturing user preferences at a finer granularity.

Even though UMAIR-FPS introduces features from multiple modalities, its parameter count remains on par with the other models, so our method does not compromise efficiency while improving predictive accuracy.

3.3 Ablation Study

To further investigate the effectiveness of our key modules, we conduct ablation studies on four groups of variants, as presented in Table 4.

Table 4. Ablation study.

Group            Variant       BCE Loss   AUC      Parameters
Multi-modal      w/o ALL       0.6469     0.7726   10,283,072
                 w/o STY       0.5267     0.8240   10,580,580
                 w/o SEM       0.5201     0.8433   10,547,812
                 w/o TSEM      0.5943     0.8233   10,564,196
UMCM by Single   w/o UMCM      0.5493     0.8297   10,556,624
                 w/ UMCM_ATT   0.4750     0.8470   10,597,028
                 w/ UMCM_SEN   0.5172     0.8484   10,597,608
UMCM by MoE      w/ ATT*2      0.5846     0.8341   10,597,028
                 w/ ATT*3      0.3797     0.8490   10,597,028
                 w/ ATT*4      0.4214     0.8475   10,597,028
                 w/ SEN*2      0.5123     0.8370   10,598,188
                 w/ SEN*3      0.4963     0.8465   10,598,768
                 w/ SEN*4      1.6368     0.7607   10,599,348
DCN-V2           w/o DCN-V2    0.5777     0.8337   10,522,532
-                UMAIR-FPS     0.3797     0.8490   10,597,028

Impact of Multi-modal Features. To explore the effectiveness of the MM features, we establish four variants. In the w/o ALL variant, we eliminate all MM features, including the style and semantic features of images as well as the semantic features of text labels. As a result, the AUC decreases by 9.00%, the BCE loss increases by 70.37%, and the parameter count drops by only 2.96%. This underscores the significance of combining textual and visual multi-modal features in accurately capturing user preferences, while the integration adds little parameter overhead.

In the w/o STY, w/o SEM, and w/o TSEM variants, we respectively remove the image style features $e^{\text{STY}}$, the image semantic features $e^{\text{SEM}}$, and the text label semantic features $e^{\text{TSEM}}$. Removing $e^{\text{STY}}$ has the most notable impact, indicating that style features bring substantial improvements to anime recommendation. Among all features, the gain from $e^{\text{SEM}}$ is the smallest, since it is guided by the images' actual text labels and the information obtained from the selected lower-level features may still overlap with $e^{\text{TSEM}}$.

The effectiveness of the image encoder $\mathbf{E}^{\text{img}}$ is shown in Fig. LABEL:heatmap: the cosine distances between the feature vectors of similar artworks are small at both the semantic and style levels, while they are large for dissimilar ones.

Impact of UMCM. We design three variants: one without UMCM, denoted w/o UMCM; one implementing UMCM with dot-product attention, denoted w/ UMCM_ATT; and one replacing the attention with a SENet module, denoted w/ UMCM_SEN. Compared to w/o UMCM, w/ UMCM_ATT reduces the loss by 13.52%, increases the AUC by 2.085%, and enlarges the model parameters by only 0.382%. Compared to w/ UMCM_SEN, the loss decreases while the AUC remains roughly the same. This suggests that dynamically re-weighting illustration features based on user feature vectors effectively models user preferences.

When enhancing UMCM with the MoE framework, we observe that, for multi-modal contribution re-weighting, a stack of three modules is optimal for both ATT and SENet, and adding further modules degrades performance. Thus, after introducing MM cross-interactions into the multi-task RS framework, recommendation performance can be improved by aggregating multiple weight-modeling modules as expert outputs, facilitating representation learning for specific tasks.

Impact of MM Crosses. We compare UMAIR-FPS with the variant without FX, denoted w/o DCN-V2. Its AUC decreases by 1.80% and its BCE loss increases by 52.14%, while the model parameters are reduced by only 0.70%. These results underscore the efficacy of FX: even after user-aware weight fusion, the MM features still benefit from being crossed with user features.

4 Conclusion

This paper introduces UMAIR-FPS, a novel user-aware MM anime illustration RS. For the first time, we propose integrating painting style into the MM features. Our method constructs a dual-output image encoder based on stylistic and semantic features and a text encoder fine-tuned with multi-perspective text pairs so that it understands anime-domain knowledge. Additionally, to account for the varying contributions of the modalities to user preference behavior, we design the attention-based UMCM. Performing FX among the modalities with DCN-V2 further improves recommendation accuracy. Extensive experiments on real-world datasets validate the superiority of UMAIR-FPS over other SOTA methods.

References

  • [1] Liu, Q., et al.: Multimodal recommender systems: A survey. arXiv preprint arXiv:2302.03883 (2023)
  • [2] He, X., et al.: LightGCN: Simplifying and powering graph convolution network for recommendation. In: Proc. SIGIR, pp. 639–648 (2020)
  • [3] Wang, X., Jin, H., Zhang, A., He, X., et al.: Disentangled graph collaborative filtering. In: Proc. SIGIR, pp. 1001–1010 (2020)
  • [4] He, X., et al.: Neural collaborative filtering. In: Proc. WWW, pp. 173–182 (2017)
  • [5] Zhou, G., et al.: Deep interest network for click-through rate prediction. In: Proc. SIGKDD, pp. 1059–1068 (2018)
  • [6] Zhao, X., et al.: Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2017)
  • [7] Yang, L., et al.: ConsisRec: Enhancing GNN for social recommendation via consistent neighbor aggregation. In: Proc. SIGIR, pp. 2141–2145 (2021)
  • [8] Yu, J., et al.: Self-supervised multi-channel hypergraph convolutional network for social recommendation. In: Proc. WWW, pp. 413–424 (2021)
  • [9] Wang, R., et al.: DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In: Proc. WWW, pp. 1785–1797 (2021)
  • [10] Reimers, N., et al.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
  • [11] Ma, J., et al.: Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In: Proc. SIGKDD, pp. 1930–1939 (2018)
  • [12] Lian, J., et al.: xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In: Proc. SIGKDD, pp. 1754–1763 (2018)
  • [13] Lu, W., et al.: A dual input-aware factorization machine for CTR prediction. In: Proc. IJCAI, pp. 3139–3145 (2021)
  • [14] Huang, T., et al.: FiBiNET: Combining feature importance and bilinear feature interaction for click-through rate prediction. In: Proc. RecSys, pp. 169–177 (2019)
  • [15] Zhang, Z.Y., et al.: Towards understanding the overfitting phenomenon of deep click-through rate models. In: Proc. CIKM, pp. 2671–2680 (2022)
  • [16] Wang, R., et al.: Deep & cross network for ad click predictions. In: Proc. ADKDD, pp. 1–7 (2017)
  • [17] Mao, K., et al.: FinalMLP: An enhanced two-stream MLP model for CTR prediction. arXiv preprint arXiv:2304.00902 (2023)