Data Analysis Example: How to Classify Spam Email with R (CDA Data Analyst official website)


2017-07-07


Structure of a Data Analysis

1 Steps in a data analysis

• Define the question

• Define the ideal data set

• Determine what data you can access

• Obtain the data

• Clean the data

• Exploratory data analysis

• Statistical prediction/modeling

• Interpret results

• Challenge results

• Synthesize/write up results

• Create reproducible code

2 A sample

1) The question

Can I automatically detect emails that are SPAM or not?

2) Making the question concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

3) Obtaining the data

http://search.r-project.org/library/kernlab/html/spam.html

4) Subsampling

# If the kernlab package is not installed, install it first with install.packages("kernlab").

library(kernlab)

data(spam)

# Perform the subsampling: one coin flip per email (the spam data set has 4601 rows)

set.seed(3435)

trainIndicator = rbinom(nrow(spam), size = 1, prob = 0.5)

table(trainIndicator)

trainSpam = spam[trainIndicator == 1, ]

testSpam = spam[trainIndicator == 0, ]
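The split above draws one Bernoulli(0.5) indicator per row, so the two halves are only roughly equal in size. Below is a minimal sketch of the same idea on a made-up data frame, using base R only. The names `toy`, `ind`, `train`, and `test` are illustrative and do not come from the article. It checks that the two parts are disjoint and together cover every row.

```r
# Toy data frame standing in for `spam` (hypothetical data)
set.seed(3435)
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# One coin flip per row: 1 -> training set, 0 -> test set
ind <- rbinom(nrow(toy), size = 1, prob = 0.5)

train <- toy[ind == 1, ]
test  <- toy[ind == 0, ]

# The two parts partition the rows: nothing lost, nothing shared
nrow(train) + nrow(test) == nrow(toy)
length(intersect(train$id, test$id)) == 0
```

Fixing the seed with `set.seed()` is what makes the split repeatable from one run to the next.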

5) Exploratory analysis

a) Names: look at the column names

names(trainSpam)

b) Head: look at the first six rows

head(trainSpam)

c) Summaries: count the emails of each type

table(trainSpam$type)

d) Plots: compare the distribution for spam and non-spam emails

plot(trainSpam$capitalAve ~ trainSpam$type)

The distribution is hard to read in the plot above, so we take the logarithm and look again.

plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
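The `+ 1` inside the logarithm matters because some values can be zero, and `log10(0)` is `-Inf`. Here is a small sketch with made-up numbers (the vector `x` is illustrative only) showing how the shifted log compresses a heavily right-skewed range while keeping zeros finite.

```r
# Made-up, right-skewed values that include a zero
x <- c(0, 1, 9, 99, 999)

plainLog   <- log10(x)       # the zero entry becomes -Inf
shiftedLog <- log10(x + 1)   # the zero entry maps to 0; every value stays finite

plainLog
shiftedLog
```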

e) Looking for relationships between predictors

plot(log10(trainSpam[, 1:4] + 1))

f) Trying hierarchical clustering

hCluster = hclust(dist(t(trainSpam[, 1:57])))

plot(hCluster)

This is too messy to reveal anything. As before, take the log and look again.

hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))

plot(hClusterUpdated)
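`dist()` measures distances between rows, so transposing with `t()` makes the clustering group the columns (the predictors) rather than the emails. The sketch below uses simulated data, with hypothetical variable names `a`, `a_copy`, and `other`, to show two nearly identical variables landing in the same cluster.

```r
set.seed(1)
n <- 200
a <- rexp(n)

# Simulated matrix: `a_copy` is almost the same as `a`, `other` is unrelated
m <- cbind(a      = a,
           a_copy = a + rnorm(n, sd = 0.05),
           other  = rexp(n, rate = 0.1))

# dist() works on rows, so t(m) turns each variable into a row
hc <- hclust(dist(t(log10(m + 1))))

# Cut the dendrogram into two groups of variables
groups <- cutree(hc, k = 2)
groups
```

Calling `plot(hc)` on this object draws the dendrogram in the same way as the plots above.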

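6) Statistical prediction and modeling

# Code the outcome as 0 (nonspam) / 1 (spam)

trainSpam$numType = as.numeric(trainSpam$type) - 1

# Cost: number of cases where the observed class differs from the class predicted at probability 0.5

costFunction = function(x, y) sum(x != (y > 0.5))

cvError = rep(NA, 55)

library(boot)

# Fit one single-predictor logistic regression per column and record its 2-fold cross-validated cost

for (i in 1:55) {

    lmFormula = reformulate(names(trainSpam)[i], response = "numType")

    glmFit = glm(lmFormula, family = "binomial", data = trainSpam)

    cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]

}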

## Which predictor has minimum cross-validated error?

names(trainSpam)[which.min(cvError)]
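The selection loop above can be tried on a small simulated data set where the answer is known in advance. This is a sketch under stated assumptions: the data frame `d` and the column names `signal`, `noise1`, and `noise2` are made up, and the cost uses `mean` rather than `sum` so that the result reads as a misclassification rate. It relies on `cv.glm()` from the boot package, as the article does.

```r
library(boot)  # provides cv.glm()

set.seed(42)
n <- 400

# Simulated data (hypothetical): only `signal` is related to the outcome
d <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(2 * d$signal))

# cost(observed, predicted probability): share of misclassified cases
cost <- function(obs, prob) mean(obs != (prob > 0.5))

predictors <- c("signal", "noise1", "noise2")
cvError <- setNames(rep(NA_real_, length(predictors)), predictors)

for (p in predictors) {
  f   <- reformulate(p, response = "y")
  fit <- glm(f, family = "binomial", data = d)
  # 2-fold cross-validation; delta[2] is the bias-adjusted estimate
  cvError[p] <- cv.glm(d, fit, cost, K = 2)$delta[2]
}

cvError
best <- names(which.min(cvError))
best
```

Because only `signal` drives the outcome in this simulation, it should come out with the lowest cross-validated error.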

7) Testing on the test set

## Use the best model from the group

predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)

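## Get predicted probabilities on the test set (type = "response" returns probabilities, not log-odds)

predictionTest = predict(predictionModel, testSpam, type = "response")

predictedSpam = rep("nonspam", dim(testSpam)[1])

## Classify as "spam" the test emails with predicted probability > 0.5.
## Threshold predictionTest here: predictionModel$fitted holds the fitted values
## for the training rows. Exact counts on a rerun may differ from those quoted below.

predictedSpam[predictionTest > 0.5] = "spam"

## Classification table: predicted class against actual class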

table(predictedSpam, testSpam$type)

Misclassification rate: 0.2243 = (61 + 458)/(1346 + 458 + 61 + 449)
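The arithmetic above can be checked by putting the four quoted counts into a 2 x 2 table. The layout used here (rows for predicted class, columns for actual class) is an assumption, since the article gives only the formula. The error rate does not depend on which off-diagonal cell is which.

```r
# Counts quoted in the text; the row/column layout is assumed
confusion <- matrix(c(1346, 61, 458, 449), nrow = 2,
                    dimnames = list(predicted = c("nonspam", "spam"),
                                    actual    = c("nonspam", "spam")))

# Off-diagonal cells are the misclassified emails
errorRate <- 1 - sum(diag(confusion)) / sum(confusion)
round(errorRate, 4)
```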

8) Interpret results

The fraction of characters that are dollar signs can be used to predict if an email is Spam

Anything with more than 6.6% dollar signs is classified as Spam

More dollar signs always means more Spam under our prediction

Our test set error rate was 22.4%
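A cutoff like the one quoted above follows from the fitted coefficients of a single-predictor logistic regression. The predicted probability exceeds 0.5 exactly when intercept + slope * x > 0, that is, when x > -intercept/slope for a positive slope. A positive slope is also why more of the predictor always pushes the prediction toward spam. The sketch below illustrates this on simulated data. The variables `x` and `y` and the true coefficients are made up, not taken from the spam data.

```r
set.seed(7)
n <- 500

# Made-up predictor and outcome with a positive relationship
x <- rexp(n, rate = 10)
y <- rbinom(n, size = 1, prob = plogis(-1.5 + 20 * x))

fit <- glm(y ~ x, family = "binomial")
b   <- coef(fit)

# Value of x at which the fitted probability equals 0.5
cutoff <- unname(-b[1] / b[2])
cutoff

# Fitted probabilities just below and just above the cutoff
probs <- unname(predict(fit, data.frame(x = cutoff + c(-0.01, 0.01)),
                        type = "response"))
probs
```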

9) Challenge results

10) Synthesize/write up results

11) Create reproducible code
