最后更新于
最后更新于
冷启动就需要user
或者item
的特征来配合了, 仅使用协同过滤的FM模型是无法做到冷启动的.
我们选择作为数据集, 这个数据集除了行为交互数据, 还有item feature, 因此可以做关于item
的冷启动.
除了划分好的train
和test
, 还有一个shape为(72360, 1246)
的item_feature
矩阵, 共72360个item, item对应着1246个特征.
统计得到共有33827个item
是没有跟用户发生过交互行为的, 可以认为是新的物品上线, 为现有的用户进行推荐.
取其中一个没有发生过交互的物品计算出要被推荐的用户.
上面就是应该把这个新的item
推荐给的用户列表.
训练之后也得到了每个item feature
对应的向量, 由于在同一个空间中, 就可以通过向量之间的各种距离, 衡量两个特征的近似性.
array(['bayesian', 'prior', 'elicitation', ..., 'events', 'mutlivariate',
'sample-variance'], dtype='<U50')
(0.8141281, 0.72254986)
33827
array([ 145, 159, 128, ..., 1981, 375, 2349], dtype=int64)
Most similar tags for bayesian: ['metropolis-hastings' 'gibbs' 'prior']
Most similar tags for regression: ['multiple-regression' 'goodness-of-fit' 'poisson-regression']
Most similar tags for survival: ['odds-ratio' 'kaplan-meier' 'epidemiology']
{'train': <3221x72360 sparse matrix of type '<class 'numpy.float32'>'
with 57830 stored elements in COOrdinate format>,
'test': <3221x72360 sparse matrix of type '<class 'numpy.float32'>'
with 4307 stored elements in COOrdinate format>,
'item_features': <72360x1246 sparse matrix of type '<class 'numpy.float32'>'
with 198963 stored elements in Compressed Sparse Row format>,
'item_feature_labels': array(['bayesian', 'prior', 'elicitation', ..., 'events', 'mutlivariate',
'sample-variance'], dtype='<U50')}
import numpy as np
from lightfm.datasets import fetch_stackexchange
data = fetch_stackexchange('crossvalidated',
test_set_fraction=0.1,
indicator_features=False,
tag_features=True)
data
train, test = data['train'], data['test']
item_features = data["item_features"]
tag_labels = data['item_feature_labels']
tag_labels
NUM_THREADS = 2
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6
model = LightFM(loss='bpr', item_alpha=ITEM_ALPHA, no_components=NUM_COMPONENTS)
model = model.fit(train, item_features=item_features, epochs=NUM_EPOCHS, num_threads=NUM_THREADS)
train_auc = auc_score(model, train, item_features=item_features, num_threads=NUM_THREADS).mean()
test_auc = auc_score(model, test, item_features=item_features, num_threads=NUM_THREADS).mean()
train_auc, test_auc
user_count = np.array(train.sum(axis=0)).ravel()
np.sum(user_count == 0)
t = model.predict(item_ids=np.ones(train.shape[0]) * (train.shape[1] - 1), user_ids=np.arange(train.shape[0]), item_features=item_features)
t.argsort()
def get_similar_tags(model, tag_id):
# Define similarity as the cosine of the angle
# between the tag latent vectors
# Normalize the vectors to unit length
tag_embeddings = (model.item_embeddings.T
/ np.linalg.norm(model.item_embeddings, axis=1)).T
query_embedding = tag_embeddings[tag_id]
similarity = np.dot(tag_embeddings, query_embedding)
most_similar = np.argsort(-similarity)[1:4]
return most_similar
for tag in (u'bayesian', u'regression', u'survival'):
tag_id = tag_labels.tolist().index(tag)
print('Most similar tags for %s: %s' % (tag_labels[tag_id],
tag_labels[get_similar_tags(model, tag_id)]))