基于轻量级Transformer+跨模态注意力融合的电商商品图文匹配算法
许幸
眉山药科职业学院,四川眉山,620000
摘要:针对电商场景下商品图文匹配存在的特征对齐精度低、通用模型算力消耗大及属性不一致校验难等问题,提出一种基于轻量级Transformer与跨模态属性注意力融合的图文匹配算法。该算法首先通过轻量化改造的ITransformer与TTransformer分别提取商品图像视觉特征与文本语义特征,引入动态通道剪枝(剪枝率40%)与特征维度压缩技术(压缩比r=4)降低参数量;随后设计电商专属的跨模态属性注意力模块(AGA),将文本属性词作为Query与图像空间特征进行深度对齐;最后通过余弦相似度实现匹配判定。在FashionGen与自采电商数据集上的实验表明,本算法Top-1准确率达88.7%,模型参数量仅为38.6M,推理速度达185 FPS,在保证匹配精度的前提下实现了显著的轻量化,适配电商平台实时推荐与违规校验场景。
关键词:电商商品匹配;轻量级Transformer;跨模态融合;属性注意力;通道剪枝
E-commerce Product Image-Text Matching Algorithm Based on Lightweight Transformer and Cross-modal Attention Fusion
Xing Xu
Meishan College of Chinese Medicine, Meishan, Sichuan, 620000, China
Abstract: To address the problems of low feature-alignment accuracy, the heavy computational cost of general-purpose models, and the difficulty of attribute-inconsistency verification in e-commerce product image-text matching, we propose an image-text matching algorithm based on a lightweight Transformer and cross-modal attribute attention fusion. The algorithm first extracts the visual features of product images and the semantic features of product text through the lightweight ITransformer and TTransformer, respectively, and introduces dynamic channel pruning (pruning rate 40%) and feature-dimension compression (compression ratio r=4) to reduce the number of parameters. Subsequently, an e-commerce-specific cross-modal attribute attention module (AGA) is designed, which uses text attribute words as Queries and deeply aligns them with image spatial features. Finally, cosine similarity is used for the matching decision. Experiments on FashionGen and a self-collected e-commerce dataset show that the algorithm reaches a Top-1 accuracy of 88.7% with a model size of only 38.6M parameters and an inference speed of 185 FPS; it achieves significant lightweighting while preserving matching accuracy, and is suitable for real-time recommendation and violation-verification scenarios on e-commerce platforms.
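The paper gives no implementation details for the AGA module here; the following is a minimal, framework-free sketch of the idea the abstract describes — text attribute embeddings acting as Queries in cross-attention over image patch features, followed by a cosine-similarity matching decision. All names, shapes, and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attribute_guided_attention(attr_q, img_kv, d):
    """Attribute embeddings (m, d) attend over image patch features (n, d).

    Each attribute word acts as a Query; image patches supply Keys/Values,
    yielding one attribute-aligned visual feature per attribute.
    """
    scores = attr_q @ img_kv.T / np.sqrt(d)   # (m, n) alignment scores
    weights = softmax(scores, axis=-1)        # attention over all patches
    return weights @ img_kv                   # (m, d) aligned features

def cosine_match(text_feat, img_feat, threshold=0.5):
    """Cosine similarity between pooled text and image features; thresholded match flag."""
    sim = float(text_feat @ img_feat
                / (np.linalg.norm(text_feat) * np.linalg.norm(img_feat) + 1e-8))
    return sim, sim >= threshold

rng = np.random.default_rng(0)
d = 64
attr = rng.standard_normal((3, d))       # e.g. embeddings for "red", "cotton", "long-sleeve"
patches = rng.standard_normal((49, d))   # a 7x7 grid of image patch features
aligned = attribute_guided_attention(attr, patches, d)
sim, matched = cosine_match(aligned.mean(axis=0), patches.mean(axis=0))
```

In a trained model the attribute and patch embeddings would come from the text and image encoders, and the Query/Key/Value projections would be learned; the random tensors above only exercise the shapes and the flow of the computation.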
Keywords: E-commerce Product Matching; Lightweight Transformer; Cross-modal Fusion; Attribute Attention; Channel Pruning
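The two lightweighting steps named in the abstract — dynamic channel pruning at a 40% rate and feature-dimension compression with ratio r=4 — can be sketched as follows. This assumes magnitude-based (L1) channel importance and a linear projection for compression; the paper may use different criteria, and all names and shapes here are illustrative.

```python
import numpy as np

def prune_channels(feat, channel_importance, prune_rate=0.40):
    """Keep only the top (1 - prune_rate) fraction of channels by importance score."""
    c = feat.shape[-1]
    keep = int(round(c * (1 - prune_rate)))
    idx = np.argsort(channel_importance)[::-1][:keep]  # most important channels first
    return feat[..., np.sort(idx)]                     # preserve channel order

def compress_dim(feat, r=4, seed=0):
    """Project features from d to d // r dimensions with a linear map.

    In a real model this projection is learned; a seeded random matrix
    stands in for it here.
    """
    d = feat.shape[-1]
    W = np.random.default_rng(seed).standard_normal((d, d // r)) / np.sqrt(d)
    return feat @ W

x = np.random.default_rng(1).standard_normal((49, 320))   # 49 patches, 320 channels
l1 = np.abs(np.random.default_rng(2).standard_normal(320))  # stand-in L1 importance scores
pruned = prune_channels(x, l1)           # 320 -> 192 channels (40% pruned)
compressed = compress_dim(pruned, r=4)   # 192 -> 48 dimensions (ratio r=4)
```

Composing the two steps shows where the parameter savings come from: downstream layers operate on 48-dimensional features instead of 320, shrinking their weight matrices accordingly.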
利益冲突
Conflict of Interest
作者声明不存在任何利益冲突。
The author declares no conflict of interest.