Sentiment Classification with Bidirectional LSTM + Attention

Sentiment analysis has been a hot research topic in recent years, both in China and abroad. Its task is to help users quickly acquire, organize, and analyze opinionated information, and to analyze, process, summarize, and reason over subjective text that carries emotional coloring.

Sentiment classification means dividing text into two or more categories, such as positive or negative, according to the meaning and emotional information it expresses. It classifies the author's leaning, viewpoint, and attitude, and is therefore sometimes also called opinion analysis. I won't dwell on what sentiment classification is good for; there are plenty of practical scenarios. My own reasons for doing it are partly to learn, and partly because I want to use sentiment classification to predict the stock market: judging whether the index will rise or fall from how people on Weibo feel about the market as a whole. I'll write more about that once I've figured it out.

Sentiment classification can be seen as a special kind of text classification, but it differs slightly from ordinary text classification: ordinary text classification is generally insensitive to word order, whereas sentiment classification depends on it. A plain bag-of-words (BoW) model therefore won't perform as well here. This post tries using an LSTM for sentiment classification, specifically a bidirectional LSTM with attention.

Since I'm fairly familiar with PaddlePaddle, I used it for this project as well; there are plenty of ready-made TensorFlow implementations online. For the dataset, the classic choice that came to mind was movie reviews, so I crawled Douban and collected reviews together with their ratings, a bit over 10 million in total; after removing dirty data, roughly 7 million remained. Douban ratings run from 1 to 5 stars, so I treated 1-2 star reviews as negative examples and 4-5 star reviews as positive examples, then balanced the samples (randomly dropping some positive examples until positives and negatives were roughly equal), ending up with a training set of a bit over 3 million examples. A sketch of this label mapping and balancing is shown below.
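
The mapping and balancing step itself is simple. Here is a minimal sketch of what it might look like; the function name, the input format, and the choice to drop 3-star reviews are my assumptions, not the original preprocessing code:

import random

def build_training_set(reviews):
    """reviews: iterable of (text, stars) pairs scraped from Douban.
    Returns a shuffled, roughly class-balanced list of (text, label) pairs."""
    pos, neg = [], []
    for text, stars in reviews:
        if stars <= 2:        # 1-2 stars -> negative example (label 0)
            neg.append((text, 0))
        elif stars >= 4:      # 4-5 stars -> positive example (label 1)
            pos.append((text, 1))
        # 3-star reviews are ambiguous and simply dropped here (my assumption)

    # Randomly drop some positives so both classes end up roughly the same size.
    random.shuffle(pos)
    pos = pos[:len(neg)]

    data = pos + neg
    random.shuffle(data)
    return data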

This post also tries an attention mechanism, which here amounts to a weighted pooling: in theory it should be weighted sum pooling, but what I've ended up with for now is weighted max pooling, so the final effect remains to be verified (see the note after the code listing below).

Network structure: bidirectional LSTM + attention, operating directly at character granularity:

#!/usr/bin/env python

import sys
import math
import gzip
import time
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.core as core
import paddle.fluid.param_attr as attr
import lstm_attention

def bilstm_attention_classify_net(dict_size = 12232,
        emb_size = 1024,
        lstm_size = 1024,
        class_size = 2,
        drop_rate = 0.5,
        is_test = False,
        is_py_reader = False):
    if is_py_reader:
        reader = fluid.layers.py_reader(capacity = 10240,
                shapes = [[-1, 1], [-1, 1], [-1, 1]],
                lod_levels = [1, 0, 0],
                dtypes = ['int64', 'int64', 'float32'],
                name = "test_reader" if is_test else "train_reader",
                use_double_buffer = True)
        text, label, weight = fluid.layers.read_file(reader)
    else:
        text = fluid.layers.data(name='text', 
                shape=[1], 
                dtype='int64', 
                lod_level = 1)
        label = fluid.layers.data(name='label', 
                shape=[1], 
                dtype='int64',
                lod_level = 0)
        weight = fluid.layers.data(name='weight', 
                shape=[1], 
                dtype='float32', 
                lod_level = 0)

    # Character-id embedding lookup (sparse gradient updates).
    text_emb = fluid.layers.embedding(input = text,
            size = [dict_size, emb_size],
            is_sparse = True)
    lstm_attention_model = lstm_attention.LSTMAttentionModel(
            lstm_size = lstm_size, 
            drop_rate = drop_rate)
    lstm = fluid.layers.relu(lstm_attention_model.forward(text_emb, not is_test))
    predict = fluid.layers.fc(input = lstm, 
            size = class_size, 
            act = 'softmax')
    # Per-example cross entropy, scaled by the per-example weight.
    cost = fluid.layers.elementwise_mul(
            x = fluid.layers.cross_entropy(input = predict, label = label),
            y = weight,
            axis = 0)
    avg_cost = fluid.layers.mean(x = cost)
    acc = fluid.layers.accuracy(input=predict, label=label)

    if is_py_reader:
        return [avg_cost, acc, predict, reader]
    else:
        return [avg_cost, acc, predict]

# lstm_attention.py -- the module imported above as `import lstm_attention`
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr

class LSTMAttentionModel(object):
    """LSTM Attention Model"""

    def __init__(self,
                 lstm_size = 1024,
                 drop_rate = 0.5):
        self.lstm_size = lstm_size
        self.drop_rate = drop_rate

    def forward(self, input, is_training):
        # Input projection and forward-direction LSTM.
        lstm_forward_fc = fluid.layers.fc(
            input=input,
            size=self.lstm_size * 4,
            act=None,
            bias_attr=ParamAttr(
                regularizer=fluid.regularizer.L2Decay(0.0),
                initializer=fluid.initializer.NormalInitializer(scale=0.0)))
        lstm_forward, _ = fluid.layers.dynamic_lstm(
            input=lstm_forward_fc, size=self.lstm_size * 4, is_reverse=False)

        # Input projection and backward-direction LSTM.
        lstm_backward_fc = fluid.layers.fc(
            input=input,
            size=self.lstm_size * 4,
            act=None,
            bias_attr=ParamAttr(
                regularizer=fluid.regularizer.L2Decay(0.0),
                initializer=fluid.initializer.NormalInitializer(scale=0.0)))
        lstm_backward, _ = fluid.layers.dynamic_lstm(
            input=lstm_backward_fc, size=self.lstm_size * 4, is_reverse=True)

        # Concatenate forward and backward hidden states along the feature axis.
        lstm_concat = fluid.layers.concat(
            input=[lstm_forward, lstm_backward], axis=1)

        lstm_dropout = fluid.layers.dropout(
            x=lstm_concat, dropout_prob=self.drop_rate, is_test=(not is_training))

        # Attention: one scalar weight per timestep, normalized with sequence_softmax.
        lstm_weight = fluid.layers.fc(
            input=lstm_dropout,
            size=1,
            act='sequence_softmax',
            bias_attr=ParamAttr(
                regularizer=fluid.regularizer.L2Decay(0.0),
                initializer=fluid.initializer.NormalInitializer(scale=0.0)))
        # Scale each timestep by its attention weight, then max-pool over the
        # sequence (the weighted max pooling mentioned above).
        scaled = fluid.layers.elementwise_mul(
            x=lstm_dropout, y=lstm_weight, axis=0)
        lstm_pool = fluid.layers.sequence_pool(input=scaled, pool_type='max')

        return lstm_pool
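
As noted above, the pooling at the end of forward() is currently weighted max pooling. Switching to the more conventional weighted sum pooling, where the attention output is the weight-scaled sum of the hidden states over the sequence, should only require changing the pool_type. An untested sketch of that change:

        # Weighted sum pooling: sum the weight-scaled hidden states over the sequence.
        scaled = fluid.layers.elementwise_mul(
            x=lstm_dropout, y=lstm_weight, axis=0)
        lstm_pool = fluid.layers.sequence_pool(input=scaled, pool_type='sum')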

For the rest of the code, see GitHub: https://github.com/rainforest32/sentiment_classification
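
The full training script lives in the repository. For reference, a minimal sketch of wiring the network above into a Fluid training loop might look like the following; the optimizer, learning rate, number of passes, and the dummy train_reader are placeholders of mine (and bilstm_attention_classify_net is assumed to be defined in the same file), not necessarily what the repository uses:

import paddle.fluid as fluid

avg_cost, acc, predict = bilstm_attention_classify_net(is_py_reader=False)
fluid.optimizer.Adam(learning_rate=1e-4).minimize(avg_cost)

place = fluid.CUDAPlace(0)  # or fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

def train_reader():
    # Placeholder: replace with the real Douban data reader.
    # Each example is (char_ids, label, weight).
    yield [([1, 2, 3], 1, 1.0), ([4, 5], 0, 1.0)]

feeder = fluid.DataFeeder(feed_list=['text', 'label', 'weight'], place=place)
for pass_id in range(10):
    for batch in train_reader():
        cost_val, acc_val = exe.run(fluid.default_main_program(),
                                    feed=feeder.feed(batch),
                                    fetch_list=[avg_cost, acc])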

The model is still training; I'll report results in a follow-up.

