| Data Mining and Classification Based on Neural Decision Forests (assignment brief, proposal report, 14,000-word thesis)
 Summary
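 In today's era of rapid information transfer, people must process large volumes of information every day. Because that information is so varied and cluttered, much of what is received cannot be used directly, so better methods for processing it have become essential. Against this background, data mining has become an important discipline that is widely applied across industries. Data classification is one of its core tasks, and classifying data both accurately and efficiently is the key problem.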
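 This thesis first introduces the basic concepts of the traditional support vector machine (SVM). Because the traditional SVM does not learn the geometric characteristics of the data, the proximal support vector machine (PSVM) is introduced. Because the requirement that the separating hyperplanes be parallel limits the accuracy of the classifier, the more accurate multisurface proximal support vector machine (MPSVM) was subsequently developed.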
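 The abstract does not state the MPSVM optimization problem. As a sketch only, assuming the commonly used generalized-eigenvalue formulation of the multisurface proximal SVM, the first plane can be written as below. All symbols are introduced here for illustration and do not come from the text: the rows of A are the samples of class 1, the rows of B are the samples of class 2, and e is a column of ones of matching length.

```latex
\min_{(w_1,\,b_1)\neq 0}\;
\frac{\lVert A w_1 + e\,b_1 \rVert^{2}}{\lVert B w_1 + e\,b_1 \rVert^{2}}
\;=\;
\min_{z_1 \neq 0}\;
\frac{z_1^{\top} G\, z_1}{z_1^{\top} H\, z_1},
\qquad
G = [A \;\; e]^{\top}[A \;\; e],\quad
H = [B \;\; e]^{\top}[B \;\; e],\quad
z_1 = \begin{bmatrix} w_1 \\ b_1 \end{bmatrix}
```

 Under this assumption the minimizer is the eigenvector of the generalized eigenvalue problem G z = λ H z that belongs to the smallest eigenvalue, and the second plane is obtained by exchanging the roles of A and B. Since each plane is fitted on its own, the two planes are not forced to be parallel.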
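 The thesis then introduces the basic concepts of decision trees, one of the most important families of classification algorithms. A traditional decision tree tests conditions that are parallel to the coordinate axes, which makes the tree grow excessively large. The oblique decision tree was proposed as an improvement: it needs fewer splitting hyperplanes, which reduces the number of splits and lowers the complexity of the tree. MPSVM is then combined with the oblique decision tree to produce a new method. In practice, MPSVM runs into singular matrices when handling small samples, so Tikhonov regularization is introduced to remedy this. In addition, the tree can be grown heterogeneously: the earlier part, where the condition is satisfied, uses the improved method, and the later part uses the original method.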
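 The singular-matrix problem on small samples can be illustrated with a short sketch. It assumes the matrices in question have the form [A e]^T [A e], where the rows of A are the samples reaching a tree node and e is a column of ones; that form, the helper names `augmented_gram`, `add_ridge` and `determinant`, and the value 1/100 for the regularization constant are all illustrative assumptions, not details taken from the thesis.

```python
from fractions import Fraction


def augmented_gram(samples):
    """Return G = [A e]^T [A e], with A the sample rows and e a column of ones."""
    rows = [[Fraction(v) for v in x] + [Fraction(1)] for x in samples]
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def add_ridge(matrix, delta):
    """Tikhonov regularization: return matrix + delta * I."""
    delta = Fraction(delta)
    return [[v + (delta if i == j else 0) for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        # Find a row with a non-zero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


# Two samples in two dimensions: fewer samples than the 3 unknowns (w1, w2, b).
node_samples = [(1, 2), (3, 5)]
G = augmented_gram(node_samples)
G_reg = add_ridge(G, Fraction(1, 100))

print(determinant(G))          # 0: G has rank 2, so it is singular
print(determinant(G_reg) > 0)  # True: G + delta*I is positive definite
```

 With only two samples the 3x3 matrix has rank at most 2, so it cannot be inverted; adding a small multiple of the identity shifts every eigenvalue up by that amount and restores invertibility.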
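 To improve generalization ability and classification accuracy, the classifiers that combine the improved MPSVM with the oblique decision tree are ensembled into a random forest.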
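 The ensemble step can be sketched as bootstrap sampling followed by a majority vote. This is a minimal illustration under stated assumptions, not the method of the thesis: the base learner here is a one-feature threshold stump standing in for the MPSVM-based oblique tree, the data are a made-up binary toy set, and the names `bootstrap_sample`, `fit_stump`, `fit_forest` and `predict` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample, feature):
    """Threshold one feature at the midpoint of the two class means.

    Stand-in base learner for a binary problem; assumes both classes
    are present in `sample`.
    """
    by_label = {}
    for x, y in sample:
        by_label.setdefault(y, []).append(x[feature])
    (lab_a, vals_a), (lab_b, vals_b) = sorted(by_label.items())
    mean_a = sum(vals_a) / len(vals_a)
    mean_b = sum(vals_b) / len(vals_b)
    threshold = (mean_a + mean_b) / 2
    low, high = (lab_a, lab_b) if mean_a <= mean_b else (lab_b, lab_a)
    return lambda x: low if x[feature] <= threshold else high


def fit_forest(data, n_trees, rng):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    n_features = len(data[0][0])
    forest = []
    while len(forest) < n_trees:
        sample = bootstrap_sample(data, rng)
        if len({y for _, y in sample}) < 2:
            continue  # toy simplification: skip draws that contain one class only
        forest.append(fit_stump(sample, rng.randrange(n_features)))
    return forest


def predict(forest, x):
    """Majority vote over the individual learners."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Made-up, well-separated two-class data: (features, label).
toy = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0),
       ((1.0, 0.9), 1), ((0.8, 1.1), 1), ((1.2, 1.0), 1)]
forest = fit_forest(toy, 15, random.Random(0))
print(predict(forest, (0.1, 0.1)), predict(forest, (1.0, 1.0)))  # 0 1
```

 Each learner sees a different resampled view of the data and a randomly chosen feature, which is the source of diversity that the vote then averages over. Replacing `fit_stump` with a stronger base learner leaves the sampling and voting logic unchanged.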
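 Using this random forest as the theoretical basis, classification experiments are carried out on the Iris, Breast_cancer, and Mushroom data sets from the UCI standard database, and the effect of the improvements is studied by comparing the accuracy of the results. The comparison indicates that the MPSVM-based oblique decision tree clearly improves data classification.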
 Keywords: data classification; support vector machine; decision tree; random forest
 
 Abstract
 In an era of high-speed information transfer, people need to manage large amounts of information every day. As information grows more and more complicated, it is important to develop better methods for managing it. Against this background, data mining has become a very important subject that is used extensively in all walks of life. Data classification is one of its significant problems, and it is crucial to classify data accurately and efficiently.
 This paper first introduces the basic concept of the support vector machine. Because the traditional support vector machine does not learn the geometric features of the data, we bring in the proximal support vector machine. The requirement that the dividing hyperplanes be parallel, however, reduces the accuracy of the classifier, which motivates the more accurate multisurface proximal support vector machine (MPSVM).
 We then introduce the basic concept of the decision tree, one of the most important classification methods. A traditional decision tree makes decisions using conditions parallel to the coordinate axes, which makes the tree grow too large. The modified oblique decision tree addresses this by reducing the number of splits and decreasing the complexity of the tree. We then combine MPSVM with the oblique decision tree. With this new method, we find a limiting condition that arises in small-sample-size problems, and we bring in two regularization methods to address it.
 To improve the generalization ability and classification accuracy of this method, we build an ensemble of the classifiers that combine MPSVM and the oblique decision tree.
 Based on the above theory, we test the method on three data sets from the UCI standard database. Comparing the test results, we find that the ensemble of oblique decision trees based on MPSVM improves classification accuracy.
 Keywords: data classification; support vector machine; decision tree; random forest
 
 Contents
 1 Introduction
 1.1 Foreword
 1.2 Research status at home and abroad
 1.2.1 Status and development of support vector machines
 1.2.2 Status and development of decision trees
 1.3 Research content of this thesis
 2 Support vector machines
 2.1 Traditional support vector machine
 2.2 Proximal support vector machine
 2.3 Multisurface proximal support vector machine
 3 Decision trees
 3.1 Introduction to decision trees
 3.1.1 Concepts and basic algorithm
 3.1.2 Split selection
 3.2 Oblique decision trees
 3.3 Oblique decision tree improved with MPSVM
 4 Random forests
 4.1 Ensemble learning
 4.2 Bagging
 4.3 Random forest
 4.4 Algorithm diagram
 5 Experiments and analysis of results
 5.1 UCI database
 5.2 Iris data set experiment
 5.3 Breast_cancer data set experiment
 5.4 Mushroom data set experiment
 5.5 Other experiments
 5.5.1 car data set experiment
 5.5.2 Comparison of data sets
 6 Conclusion and outlook
 References
 Acknowledgments
 |