SemEval-2017 Task 4: approaches and tricks of the top-ranked systems

No.1 BB_twtr

First: preprocessing
URLs to 'url'
emoticons to 'smile', 'sadness', …
'sooooo' to 'soo'
text lowercased
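
The preprocessing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: the emoticon table and the URL pattern are assumptions, since the notes only name the 'smile' and 'sadness' replacement tokens.

```python
import re

# Hypothetical emoticon table; the notes only name 'smile' and 'sadness'.
EMOTICONS = {
    ":)": "smile",
    ":-)": "smile",
    ":(": "sadness",
    ":-(": "sadness",
}

# Assumed URL pattern (not taken from the source).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Any character repeated 3 or more times in a row is cut down to 2 repetitions.
ELONGATED_RE = re.compile(r"(.)\1{2,}")


def preprocess(tweet: str) -> str:
    text = URL_RE.sub("url", tweet)
    # Replace longer emoticons first so ":-)" is handled before ":)".
    for emo in sorted(EMOTICONS, key=len, reverse=True):
        text = text.replace(emo, f" {EMOTICONS[emo]} ")
    text = ELONGATED_RE.sub(r"\1\1", text)
    text = text.lower()
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


print(preprocess("Sooooo happy :) http://t.co/abc"))
```

URLs are replaced before the elongation step so that repeated characters inside a link (for example 'www') are never touched.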

Second: word embeddings pre-trained on 100 million unlabeled tweets (Twitter API)

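Third: models (following Ye Zhang and Byron Wallace, 2015)
CNN: filter sizes 2, 3, 4, two filters of each size, all convolved over the same sentence, giving 6 univariate vectors; these are concatenated and passed through a fully connected layer and a softmax layer for classification
LSTM: handled in a similar way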
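
The CNN described above can be sketched as a single forward pass in plain Python. This is only an illustration under stated assumptions: the toy embedding size, the random weights, the ReLU, and the 1-max pooling that reduces each feature map to one value are not given in these notes (the pooling follows the architecture in the cited Zhang and Wallace paper), and no training is shown.

```python
import math
import random

random.seed(0)

EMB_DIM = 4            # toy embedding size (assumption)
NUM_CLASSES = 3        # e.g. negative / neutral / positive
FILTER_SIZES = [2, 3, 4]
FILTERS_PER_SIZE = 2


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# One (size x EMB_DIM) weight matrix per filter: 3 sizes * 2 filters = 6 filters.
filters = [rand_matrix(size, EMB_DIM)
           for size in FILTER_SIZES for _ in range(FILTERS_PER_SIZE)]
# Fully connected layer mapping the 6 pooled features to class scores.
dense = rand_matrix(NUM_CLASSES, len(filters))


def conv_max_pool(sentence, filt):
    """Slide one filter over the sentence, apply ReLU, keep the max response."""
    size = len(filt)
    responses = []
    for start in range(len(sentence) - size + 1):
        total = sum(filt[i][j] * sentence[start + i][j]
                    for i in range(size) for j in range(EMB_DIM))
        responses.append(max(0.0, total))
    return max(responses)


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def forward(sentence):
    # Concatenate the 6 pooled values into one feature vector.
    features = [conv_max_pool(sentence, f) for f in filters]
    scores = [sum(w * x for w, x in zip(row, features)) for row in dense]
    return features, softmax(scores)


sentence = rand_matrix(7, EMB_DIM)  # 7 tokens with random stand-in embeddings
features, probs = forward(sentence)
print(len(features), probs)
```

The sentence must contain at least as many tokens as the largest filter size (4 here); a real implementation would pad shorter tweets.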

Fourth: data
Task-A (49693 labeled)
Task-BD (30849)
Task-CE (18948)

Fifth: ensemble
10 CNNs and 10 LSTMs combined through soft voting
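
Soft voting means averaging the class probabilities predicted by every model and taking the class with the highest average. A minimal sketch follows; the function name and the three sets of probabilities are made up for illustration (the actual ensemble has 20 models).

```python
def soft_vote(prob_lists):
    """Average per-class probabilities over models; return (label, averages)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


# Hypothetical outputs of three models for one tweet (classes: neg, neu, pos).
model_probs = [
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]
label, avg = soft_vote(model_probs)
print(label, avg)
```

Unlike hard (majority) voting, a model that is very confident contributes more to the final decision than one that is nearly undecided.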

No.2 DataStories

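First: a text tokenizer written by the team themselves
Second: Task A: two-layer bidirectional LSTM + attention (over the message)
Third: Tasks B, C, D, E: the sentence and the topic are each passed through a BiLSTM, then concat + context attention (based on the topic)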
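
To make the attention step concrete, here is a minimal sketch of attention pooling over a sequence of BiLSTM outputs. The scoring function tanh(w . h + b), the toy hidden states, and all names are assumptions for illustration; the notes only say that attention is applied, not how the scores are computed.

```python
import math


def attention_pool(hidden_states, weights, bias):
    """Score each state with tanh(w . h + b), softmax the scores, and
    return the attention weights plus the weighted sum of the states."""
    scores = [math.tanh(sum(w * x for w, x in zip(weights, h)) + bias)
              for h in hidden_states]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
              for d in range(dim)]
    return alphas, pooled


# Toy BiLSTM outputs for a 3-token message (hidden size 2), hypothetical values.
states = [[0.1, 0.2], [0.9, 0.8], [0.0, -0.1]]
alphas, pooled = attention_pool(states, weights=[1.0, 1.0], bias=0.0)
print(alphas, pooled)
```

For the topic-based variant, one plausible reading is that each hidden state is concatenated with the topic representation before scoring, so the same function would apply to the longer vectors.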

No.3 LIA

Ensemble of CNN and LSTM
First: word embeddings
Lexical embeddings
Sentiment embeddings (multitask learning)
Sentiment embeddings (distant supervision)
Sentiment embeddings (negative sampling)

Second: sentence-level feature extraction
Lexicons: MPQA + NRC
Emoticons: number of emoticons, grouped into pos, neg, neu
All-caps: number of words in all caps
Elongated units: words in which a character is repeated more than twice (e.g. looooool)
Punctuation: number of contiguous sequences of several periods, exclamation marks and question marks
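
Most of the sentence-level features above are simple counts and can be sketched with regular expressions. The emoticon groups, the whitespace tokenization, and the reading of "several" as "two or more" are assumptions; the lexicon features (MPQA, NRC) are left out because they need the external lexicon files.

```python
import re

# Hypothetical emoticon groups; the notes only name the pos / neg / neu buckets.
EMOTICON_GROUPS = {
    "pos": {":)", ":-)", ":D"},
    "neg": {":(", ":-("},
    "neu": {":|", ":-|"},
}
ELONGATED_RE = re.compile(r"(\w)\1{2,}")  # a character repeated more than twice
PUNCT_RUN_RE = re.compile(r"[.!?]{2,}")   # contiguous run of . ! ? (assumed >= 2)


def sentence_features(tweet: str) -> dict:
    tokens = tweet.split()  # naive whitespace tokenization (assumption)
    feats = {f"emoticons_{group}": sum(t in emos for t in tokens)
             for group, emos in EMOTICON_GROUPS.items()}
    feats["all_caps"] = sum(t.isalpha() and t.isupper() and len(t) > 1
                            for t in tokens)
    feats["elongated"] = sum(bool(ELONGATED_RE.search(t)) for t in tokens)
    feats["punct_runs"] = len(PUNCT_RUN_RE.findall(tweet))
    return feats


feats = sentence_features("This is SO good looooool !!! :) :(")
print(feats)
```

The resulting counts would typically be appended to the learned sentence representation before the final classifier.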

No.4 Senti17

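First: text is processed with HappyTokenizer
Second: ten convolutional networks vote; every network uses the same training data and the same word embeddings, and only the initial weights differ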
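
The idea of ten voters that differ only in their initial weights can be sketched as follows. Everything here is a stand-in: `make_voter` is a hypothetical untrained linear scorer instead of a CNN, training is omitted, and a plain majority vote is assumed because the notes do not say which voting scheme is used.

```python
import random
from collections import Counter


def make_voter(seed, n_features=3, n_classes=3):
    """Hypothetical stand-in for one CNN: same architecture, seed-dependent weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(n_features)]
               for _ in range(n_classes)]

    def predict(x):
        scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return max(range(n_classes), key=lambda c: scores[c])

    return predict


def majority_vote(voters, x):
    """Return the class predicted by the largest number of voters."""
    votes = Counter(voter(x) for voter in voters)
    return votes.most_common(1)[0][0]


voters = [make_voter(seed) for seed in range(10)]  # ten voters, ten seeds
label = majority_vote(voters, [0.2, -0.4, 0.9])
print(label)
```

Because the data and embeddings are shared, the diversity of the ensemble comes entirely from the random initialization.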