Hello experts, I need some help 😌 I have recently been doing text analysis in R. My code is below:
bingqi<-lapply(bingqicsr,function(x) unlist(segmentCN(x)))
After running the statement above to segment the text, the output is:
......
[4] "第34期"
[5] "辛"
[6] "克"
[7] "莱"
[8] "著"
[9] "王"
[10] "建华"
[11] "译"
[12] "提要"
[13] "本文"
[14] "首先"
[15] "指出"
[16] "建立"
[17] "语料库"
[18] "的"
[19] "重要性"
[20] "接着"
[21] "谈"
[22] "了"
[23] "语料库"
[24] "的"
[25] "设计"
[26] "选材"
[27] "的"
[28] "方法"
[29] "和"
[30] "标"
[31] "语料库"
[32] "建立"
[33] "的"
[34] "框架"
[35] "和"
[36] "规定"
[37] "语料库"
[38] "的"
[39] "类型"
[40] "等"
[41] "几个"
[42] "方面"
[43] "在"
[44] "语料库"
[45] "的"
[46] "类型"
[47] "部分"
[48] "本文"
[49] "重点"
[50] "Creation"
[51] "Sinclair"
[52] "译者"
[53] "Wangjianhua"
[54] "Thispaperfirstreferstotheimportanceofcreatingcorpora"
[55] "Thenitpresents"
[56] "points"
......
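# Here are my questions:
# 1. Is this result a vector? I have been told the data must be converted to a vector before it can be processed further.
# 2. Do I still need to build a corpus for later steps such as word clouds and classification?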
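On question 1, one way to check what `bingqi` actually is: `lapply()` always returns a list, so the result should be a list holding one character vector of tokens per document, not a single vector. A minimal sketch in base R only; `docs` and the whitespace tokenizer `toy_segment` are hypothetical stand-ins for `bingqicsr` and `segmentCN`, not the real data or function:

```r
# Hypothetical stand-in data; the real input would be bingqicsr.
docs <- c("corpus design matters", "read table scan")

# Hypothetical stand-in for segmentCN(): split on single spaces.
toy_segment <- function(x) strsplit(x, " ", fixed = TRUE)[[1]]

tokens <- lapply(docs, function(x) unlist(toy_segment(x)))

class(tokens)        # "list": lapply() always returns a list
class(tokens[[1]])   # "character": each element is a character vector
length(tokens)       # one element per input document

# Collapse each token vector into one space-separated string,
# giving a plain character vector with one entry per document.
doc_strings <- vapply(tokens, paste, character(1), collapse = " ")
is.character(doc_strings)
```

The same `class()` and `length()` calls can be run on `bingqi` itself to confirm its structure.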
# I tried the following statement to build one:
temp<-Corpus(VectorSource(bingqi),readerControl = list(reader = readplain,language = 'cn'))
# The system reports: Error in prepareReader(readerControl, reader(x)) : object 'readplain' not found.
# Question:
# 1. What is the difference between these three forms: readerControl = list(reader = readplain, language = 'cn'), readerControl = list(reader = x$DefaultReader, language = 'cn'), and readerControl = list(reader = read(x), language = 'cn')?
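The error message points at a likely cause: R names are case-sensitive, and the plain-text reader that tm exports is spelled `readPlain`, so `readplain` is looked up as a different, undefined object. Naming `readPlain` picks a reader explicitly, while `reader(x)` asks the source `x` for its default reader; `x$DefaultReader` appears to be the spelling used by older tm releases (an assumption, not checked against a specific version). A small sketch, assuming the tm package is installed; `docs` is hypothetical sample data:

```r
# R is case-sensitive, so readplain and readPlain are different names.
exists("readplain")   # FALSE unless you defined such an object yourself

docs <- c("corpus design", "read table scan")   # hypothetical sample data

if (requireNamespace("tm", quietly = TRUE)) {
  src <- tm::VectorSource(docs)

  # reader(src) asks the source for its own default reader function.
  default_reader <- tm::reader(src)
  print(is.function(default_reader))
  print(is.function(tm::readPlain))

  # Same call shape as in the post, with the reader spelled readPlain.
  corp <- tm::VCorpus(
    src,
    readerControl = list(reader = tm::readPlain, language = "cn")
  )
  print(length(corp))   # number of documents in the corpus
}
```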
# I then switched to the following statement:
temp<-Corpus(VectorSource(bingqi),readerControl = list(reader = reader(VectorSource(bingqi),language = 'cn')))
inspect(temp)
## Result after running it:
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] c("语言", "数据", "导入", "DataCampBlog", "编译", "亮", "亮", "语言", "数据", "读入", "的", "核心", "函数", "read", "table", "现在", "我们", "了解", "一下", "其", "他", "可", "scan", "read", "table", "这", "类", "读取", "文本", "文档", "的", "函数", "还", "可以", "用", "scan", "函数", "读入", "不同", "的", "是", "19", "19", "19", "scan", "e", "birth", "txt", "1", "241991211993531962", "data", "nrow", "2", "byrow", "FALSE", "1", "2", "3", "1", "242153", "2", "199119931962", "也",
......
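The `inspect(temp)` output shows each document's content as a literal `c("...", ...)` string. That is what base R produces when a list of character vectors is coerced to character, which suggests the token list reached `VectorSource()` unchanged. A sketch of one possible way around it: collapse each token vector into a single space-separated string before building the corpus. `tokens` is hypothetical sample data standing in for `bingqi`, and the corpus step assumes tm is installed:

```r
# Hypothetical stand-in for bingqi: a list with one token vector per document.
tokens <- list(
  c("语料库", "的", "设计"),
  c("read", "table", "scan")
)

# Coercing the list to character deparses each element into a c(...) string.
deparsed <- as.character(tokens)
deparsed[2]

# Collapsing first gives one plain string per document instead.
doc_strings <- vapply(tokens, paste, character(1), collapse = " ")
doc_strings[2]

if (requireNamespace("tm", quietly = TRUE)) {
  # Build the corpus from the collapsed strings, not from the raw list.
  corp <- tm::VCorpus(tm::VectorSource(doc_strings))
  print(length(corp))
  print(as.character(corp[[2]]))
}
```

With the documents stored as plain strings, later steps such as a document-term matrix or a word cloud receive words instead of deparsed R code.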