Welcome to All Test Answers

Data Mining with Weka



Task 1

Consider the attached lymphography dataset (lymph.arff) that describes 148 patients with 19 attributes. The last attribute is the class attribute, which classifies a patient into one of four categories (normal, metastases, malign_lymph, and fibrosis). Detailed information about the attributes is given in lymph_info.txt. The dataset is in the ARFF format used by Weka.
Use the following learning methods (classification algorithms) provided in Weka to learn a classification model from the dataset with all the attributes:
➢ C4.5 (weka.classifiers.trees.J48)
➢ RIPPER (weka.classifiers.rules.JRip)
For each learning method, report only the classification model learned from the dataset. That is, copy and paste the “Classifier model (full training set)” from the Weka output into your report. For C4.5, this is the “J48 pruned tree”; for RIPPER, it is the “JRIP rules:” section.

Answer:

Task1-Part1

=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: lymphography
Instances: 148
Attributes: 19
lymphatics
block_of_affere
bl_of_lymph_c
bl_of_lymph_s
by_pass
extravasates
regeneration_of
early_uptake_in
lym_nodes_dimin
lym_nodes_enlar
changes_in_lym
defect_in_node
changes_in_node
changes_in_stru
special_forms
dislocation_of
exclusion_of_no
no_of_nodes_in
class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
J48 pruned tree

lym_nodes_dimin <= 1
| changes_in_node = no
| | defect_in_node = no: normal (3.0/1.0)
| | defect_in_node = lacunar: malign_lymph (2.0)
| | defect_in_node = lac_margin: normal (0.0)
| | defect_in_node = lac_central: normal (0.0)
| changes_in_node = lacunar
| | exclusion_of_no = no: metastases (10.0/1.0)
| | exclusion_of_no = yes
| | | special_forms = no: metastases (3.0/1.0)
| | | special_forms = chalices
| | | | lym_nodes_enlar <= 2: malign_lymph (3.0)
| | | | lym_nodes_enlar > 2: metastases (2.0)
| | | special_forms = vesicles: malign_lymph (19.0/1.0)
| changes_in_node = lac_margin
| | block_of_affere = no
| | | extravasates = no
| | | | lymphatics = normal: metastases (0.0)
| | | | lymphatics = arched
| | | | | early_uptake_in = no: metastases (5.0/1.0)
| | | | | early_uptake_in = yes: malign_lymph (4.0/1.0)
| | | | lymphatics = deformed: metastases (5.0)
| | | | lymphatics = displaced: malign_lymph (1.0)
| | | extravasates = yes: malign_lymph (4.0)
| | block_of_affere = yes: metastases (56.0/3.0)
| changes_in_node = lac_central
| | no_of_nodes_in <= 1
| | | block_of_affere = no: malign_lymph (3.0)
| | | block_of_affere = yes: metastases (2.0)
| | no_of_nodes_in > 1: malign_lymph (20.0)
lym_nodes_dimin > 1
| by_pass = no: metastases (2.0/1.0)
| by_pass = yes: fibrosis (4.0)

Number of Leaves : 21
Size of the tree : 34

Time taken to build model: 0.14 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 114 77.027 %
Incorrectly Classified Instances 34 22.973 %

Kappa statistic 0.5736
Mean absolute error 0.1304
Root mean squared error 0.3151
Relative absolute error 48.619 %
Root relative squared error 86.5138 %
Total Number of Instances 148

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.014 0.500 1.000 0.667 0.702 0.991 0.500 normal
0.790 0.194 0.831 0.790 0.810 0.594 0.788 0.737 metastases
0.754 0.195 0.730 0.754 0.742 0.556 0.777 0.718 malign_lymph
0.500 0.014 0.500 0.500 0.500 0.486 0.744 0.389 fibrosis
Weighted Avg. 0.770 0.187 0.776 0.770 0.772 0.577 0.785 0.717

=== Confusion Matrix ===

a b c d <-- classified as
2 0 0 0 | a = normal
1 64 15 1 | b = metastases
1 13 46 1 | c = malign_lymph
0 0 2 2 | d = fibrosis
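As a sanity check, the accuracy and kappa statistic reported in the summary above can be recomputed directly from this confusion matrix. A short Python sketch (the matrix values are copied from the Weka output above):

```python
# Recompute accuracy and Cohen's kappa from the J48 confusion matrix above.
# Rows = actual class, columns = predicted class (a..d).
matrix = [
    [2, 0, 0, 0],    # a = normal
    [1, 64, 15, 1],  # b = metastases
    [1, 13, 46, 1],  # c = malign_lymph
    [0, 0, 2, 2],    # d = fibrosis
]

n = sum(sum(row) for row in matrix)              # 148 instances
correct = sum(matrix[i][i] for i in range(4))    # diagonal sum = 114

p_o = correct / n                                # observed agreement (accuracy)
row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(4)) for j in range(4)]
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(f"accuracy = {p_o:.4%}, kappa = {kappa:.4f}")  # matches 77.027 % and 0.5736
```

This reproduces the "Correctly Classified Instances" percentage and the "Kappa statistic" line exactly.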

Task1-Part2

=== Run information ===

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: lymphography
Instances: 148
Attributes: 19
lymphatics
block_of_affere
bl_of_lymph_c
bl_of_lymph_s
by_pass
extravasates
regeneration_of
early_uptake_in
lym_nodes_dimin
lym_nodes_enlar
changes_in_lym
defect_in_node
changes_in_node
changes_in_stru
special_forms
dislocation_of
exclusion_of_no
no_of_nodes_in
class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
JRIP rules:
===========

(lymphatics = normal) => class=normal (2.0/0.0)
(lym_nodes_dimin >= 2) and (by_pass = yes) => class=fibrosis (4.0/0.0)
(no_of_nodes_in >= 3) and (special_forms = vesicles) => class=malign_lymph (41.0/5.0)
(block_of_affere = no) and (extravasates = yes) => class=malign_lymph (8.0/0.0)
(changes_in_node = lac_central) => class=malign_lymph (8.0/2.0)
=> class=metastases (85.0/11.0)

Number of Rules : 6

Time taken to build model: 0.03 seconds
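The six rules above form an ordered list: an instance is classified by the first rule whose conditions it satisfies, and the final rule (with no conditions) is the default that covers everything left over. A minimal sketch of this first-match semantics, assuming instances are represented as plain dictionaries keyed by the attribute names above (a hypothetical helper, not Weka code):

```python
# First-match evaluation of the JRip rule list above.
def classify(inst):
    if inst["lymphatics"] == "normal":
        return "normal"
    if inst["lym_nodes_dimin"] >= 2 and inst["by_pass"] == "yes":
        return "fibrosis"
    if inst["no_of_nodes_in"] >= 3 and inst["special_forms"] == "vesicles":
        return "malign_lymph"
    if inst["block_of_affere"] == "no" and inst["extravasates"] == "yes":
        return "malign_lymph"
    if inst["changes_in_node"] == "lac_central":
        return "malign_lymph"
    return "metastases"  # default rule: covers every remaining instance

# Example: an instance matching none of the explicit rules falls through
# to the default and is classified as metastases.
print(classify({"lymphatics": "arched", "lym_nodes_dimin": 1, "by_pass": "no",
                "no_of_nodes_in": 2, "special_forms": "no",
                "block_of_affere": "yes", "extravasates": "no",
                "changes_in_node": "lac_margin"}))  # metastases
```

Because later rules only see instances not covered by earlier ones, the rule order matters; the coverage counts in parentheses (e.g. 85.0/11.0 for the default) refer to the training instances reaching each rule.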

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 115 77.7027 %
Incorrectly Classified Instances 33 22.2973 %
Kappa statistic 0.5725
Mean absolute error 0.1414
Root mean squared error 0.3108
Relative absolute error 52.7427 %
Root relative squared error 85.3428 %
Total Number of Instances 148

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 ? 0.000 ? ? 0.687 0.038 normal
0.827 0.254 0.798 0.827 0.812 0.576 0.805 0.808 metastases
0.738 0.172 0.750 0.738 0.744 0.567 0.780 0.715 malign_lymph
0.750 0.007 0.750 0.750 0.750 0.743 0.872 0.694 fibrosis
Weighted Avg. 0.777 0.210 ? 0.777 ? ? 0.795 0.756

=== Confusion Matrix ===

a b c d <-- classified as
0 1 1 0 | a = normal
0 67 14 0 | b = metastases
0 15 45 1 | c = malign_lymph
0 1 0 3 | d = fibrosis
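The '?' entries for class normal in the table above occur because JRip never predicts normal on any cross-validation fold (column a of the confusion matrix is all zeros), so precision = TP/(TP+FP) is a 0/0 division that Weka reports as undefined. A short sketch of that column-wise computation with the guard:

```python
# Column-wise precision from the JRip confusion matrix above; Weka prints '?'
# when a class is never predicted (0 predicted instances -> 0/0).
matrix = [
    [0, 1, 1, 0],    # a = normal
    [0, 67, 14, 0],  # b = metastases
    [0, 15, 45, 1],  # c = malign_lymph
    [0, 1, 0, 3],    # d = fibrosis
]
classes = ["normal", "metastases", "malign_lymph", "fibrosis"]

precisions = {}
for j, name in enumerate(classes):
    predicted = sum(matrix[i][j] for i in range(4))  # column total
    tp = matrix[j][j]
    precisions[name] = "?" if predicted == 0 else round(tp / predicted, 3)

print(precisions)  # normal: '?', metastases: 0.798, malign_lymph: 0.75, fibrosis: 0.75
```

This matches the Precision column of the detailed accuracy table, including the undefined entry.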
_______________________________________________________________

Task 2

You are given a training dataset (monks-train.arff) and a test dataset (monks-test.arff) in which each training example is represented by seven nominal (categorical) attributes. The last attribute is the class attribute, which classifies each data point into one of two classes (0 and 1). The attribute information, as in the UCI MONK's problems, is given below:

Attribute	Possible Values
a1	1, 2, 3
a2	1, 2, 3
a3	1, 2
a4	1, 2, 3
a5	1, 2, 3, 4
a6	1, 2
class	0, 1

Use the following learning methods provided in Weka to learn a classification model from the training dataset and test the model on the test dataset:
➢ C4.5 (weka.classifiers.trees.J48)
➢ RIPPER (weka.classifiers.rules.JRip)
➢ k-Nearest Neighbor (weka.classifiers.lazy.IBk)
➢ Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes)
➢ Neural Networks (weka.classifiers.functions.MultilayerPerceptron)
Note that you have to use the “Supplied test set” option in the “Test options” box of Weka and pass the test data file (monks-test.arff) to Weka.
Report the classification summary, classification accuracy, and confusion matrix of each algorithm on the test dataset. In other words, copy and paste the “Summary”, “Detailed Accuracy By Class”, and “Confusion Matrix” from the Weka output into your report. Also, briefly discuss your results in terms of accuracy.

Answer:

Task2-Part1

________________________________________________________________

=== Summary ===

Correctly Classified Instances 420 97.2222 %
Incorrectly Classified Instances 12 2.7778 %
Kappa statistic 0.9444
Mean absolute error 0.0892
Root mean squared error 0.1831
Relative absolute error 17.8311 %
Root relative squared error 36.5759 %
Total Number of Instances 432

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.053 0.944 1.000 0.971 0.946 0.983 0.964 0
0.947 0.000 1.000 0.947 0.973 0.946 0.983 0.981 1
Weighted Avg. 0.972 0.025 0.974 0.972 0.972 0.946 0.983 0.973

=== Confusion Matrix ===

a b <-- classified as
204 0 | a = 0
12 216 | b = 1
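The per-class figures above follow directly from the confusion matrix; for class 0, recall is 204/204 and precision is 204/(204+12). A short sketch deriving the class-0 row:

```python
# Derive the class-0 row of "Detailed Accuracy By Class" from the matrix above.
tp, fn = 204, 0    # actual class 0: predicted 0 / predicted 1
fp, tn = 12, 216   # actual class 1: predicted 0 / predicted 1

recall = tp / (tp + fn)                                    # TP rate = 1.000
precision = tp / (tp + fp)                                 # 204/216 = 0.944
fp_rate = fp / (fp + tn)                                   # 12/228 = 0.053
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean = 0.971

print(f"{recall:.3f} {fp_rate:.3f} {precision:.3f} {f_measure:.3f}")
```

The class-1 row is obtained the same way with the roles of the two classes swapped.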

Task2-Part2

________________________________________________________________

=== Summary ===

Correctly Classified Instances 390 90.2778 %
Incorrectly Classified Instances 42 9.7222 %
Kappa statistic 0.8053
Mean absolute error 0.1314
Root mean squared error 0.277
Relative absolute error 26.2643 %
Root relative squared error 55.3461 %
Total Number of Instances 432

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.912 0.105 0.886 0.912 0.899 0.806 0.938 0.879 0
0.895 0.088 0.919 0.895 0.907 0.806 0.938 0.942 1
Weighted Avg. 0.903 0.096 0.903 0.903 0.903 0.806 0.938 0.912

=== Confusion Matrix ===

a b <-- classified as
186 18 | a = 0
24 204 | b = 1

Task2-Part3

________________________________________________________________

=== Summary ===

Correctly Classified Instances 378 87.5 %
Incorrectly Classified Instances 54 12.5 %
Kappa statistic 0.7512
Mean absolute error 0.191
Root mean squared error 0.3228
Relative absolute error 38.1693 %
Root relative squared error 64.5029 %
Total Number of Instances 432

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.941 0.184 0.821 0.941 0.877 0.758 0.933 0.902 0
0.816 0.059 0.939 0.816 0.873 0.758 0.933 0.935 1
Weighted Avg. 0.875 0.118 0.883 0.875 0.875 0.758 0.933 0.919

=== Confusion Matrix ===

a b <-- classified as
192 12 | a = 0
42 186 | b = 1

Task2-Part4

________________________________________________________________

=== Summary ===

Correctly Classified Instances 420 97.2222 %
Incorrectly Classified Instances 12 2.7778 %
Kappa statistic 0.9444
Mean absolute error 0.1863
Root mean squared error 0.2323
Relative absolute error 37.2363 %
Root relative squared error 46.4131 %
Total Number of Instances 432

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.053 0.944 1.000 0.971 0.946 0.975 0.961 0
0.947 0.000 1.000 0.947 0.973 0.946 0.975 0.985 1
Weighted Avg. 0.972 0.025 0.974 0.972 0.972 0.946 0.975 0.973

=== Confusion Matrix ===

a b <-- classified as
204 0 | a = 0
12 216 | b = 1

Task2-Part5

________________________________________________________________

=== Summary ===

Correctly Classified Instances 404 93.5185 %
Incorrectly Classified Instances 28 6.4815 %
Kappa statistic 0.8709
Mean absolute error 0.068
Root mean squared error 0.2322
Relative absolute error 13.5875 %
Root relative squared error 46.3993 %
Total Number of Instances 432

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.123 0.879 1.000 0.936 0.878 0.967 0.941 0
0.877 0.000 1.000 0.877 0.935 0.878 0.967 0.981 1
Weighted Avg. 0.935 0.058 0.943 0.935 0.935 0.878 0.967 0.962

=== Confusion Matrix ===

a b <-- classified as
204 0 | a = 0
28 200 | b = 1

Discussion & Conclusion for Task2:

From the results below, the highest accuracy comes from two algorithms:
C4.5 (weka.classifiers.trees.J48) and
Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes).
Both classify 97.22 % of the test instances correctly, which is better than the other three algorithms:

➢ C4.5 (weka.classifiers.trees.J48)
Correctly Classified Instances 420 97.2222 %

➢ RIPPER (weka.classifiers.rules.JRip)
Correctly Classified Instances 390 90.2778 %

➢ k-Nearest Neighbor (weka.classifiers.lazy.IBk)
Correctly Classified Instances 378 87.5 %

➢ Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes)
Correctly Classified Instances 420 97.2222 %

➢ Neural Networks (weka.classifiers.functions.MultilayerPerceptron)
Correctly Classified Instances 404 93.5185 %
___________________________________________________________________

Task 3

You are given a dataset on credit card application approval (credit.arff) in the ARFF format. The dataset describes 690 customers with 16 attributes. The last attribute is the class attribute describing whether the customer’s application was approved or not. The dataset contains both symbolic and continuous attributes. Some of the continuous attributes contain missing values (which are marked by “?”). All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
Randomly split the dataset into a training set (70 %) and a test set (30 %). This can be done using the “Percentage split” option in the “Test options” box of Weka’s “Classify” section (set the number to 70). Apply each of the following classification algorithms to learn a classification model from the training set and classify the examples in the test set.
➢ C4.5 (weka.classifiers.trees.J48)
➢ Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes)
➢ Neural Networks (weka.classifiers.functions.MultilayerPerceptron)
Report the classification accuracy of each learning algorithm on the test dataset. In other words, copy and paste the “Summary”, “Detailed Accuracy By Class”, and “Confusion Matrix” from Weka output to your report.
Note that C4.5, Naive Bayesian Classification, and Neural Networks can automatically handle both symbolic and continuous attributes as well as missing values of continuous attributes. Therefore, you do not need to do any extra preprocessing on the data and can directly run the above learning algorithms on the input dataset (credit.arff).
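With a 70 % percentage split, 207 of the 690 instances are held out for testing, which is why every summary below reports "Total Number of Instances 207". A quick arithmetic check (assuming Weka rounds the training-set size to the nearest instance):

```python
# 70/30 percentage split of the 690-instance credit dataset (Task 3).
total = 690
n_train = round(total * 70 / 100)  # 483 instances used for training
n_test = total - n_train           # 207 instances held out for testing
print(n_train, n_test)             # 483 207
```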

Answer:

Task3-Part1

___________________________________________________________________

=== Summary ===

Correctly Classified Instances 178 85.9903 %
Incorrectly Classified Instances 29 14.0097 %
Kappa statistic 0.7168
Mean absolute error 0.1958
Root mean squared error 0.3288
Relative absolute error 39.4502 %
Root relative squared error 65.6306 %
Total Number of Instances 207

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.776 0.064 0.916 0.776 0.840 0.725 0.901 0.857 +
0.936 0.224 0.823 0.936 0.876 0.725 0.901 0.872 -
Weighted Avg. 0.860 0.149 0.867 0.860 0.859 0.725 0.901 0.865

=== Confusion Matrix ===

a b <-- classified as
76 22 | a = +
7 102 | b = -

Task3-Part2

___________________________________________________________________

=== Summary ===

Correctly Classified Instances 156 75.3623 %
Incorrectly Classified Instances 51 24.6377 %
Kappa statistic 0.4968
Mean absolute error 0.2468
Root mean squared error 0.4633
Relative absolute error 49.7186 %
Root relative squared error 92.494 %
Total Number of Instances 207

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.561 0.073 0.873 0.561 0.683 0.529 0.880 0.869 +
0.927 0.439 0.701 0.927 0.798 0.529 0.880 0.887 -
Weighted Avg. 0.754 0.266 0.783 0.754 0.744 0.529 0.880 0.879

=== Confusion Matrix ===

a b <-- classified as
55 43 | a = +
8 101 | b = -

Task3-Part3

_________________________________________________________________

=== Summary ===

Correctly Classified Instances 160 77.2947 %
Incorrectly Classified Instances 47 22.7053 %
Kappa statistic 0.5411
Mean absolute error 0.2162
Root mean squared error 0.4355
Relative absolute error 43.5544 %
Root relative squared error 86.9336 %
Total Number of Instances 207

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.684 0.147 0.807 0.684 0.740 0.547 0.866 0.866 +
0.853 0.316 0.750 0.853 0.798 0.547 0.866 0.842 -
Weighted Avg. 0.773 0.236 0.777 0.773 0.771 0.547 0.866 0.854

=== Confusion Matrix ===

a b <-- classified as
67 31 | a = +
16 93 | b = -

Discussion & Conclusion for task 3:
We can see from the above results that C4.5 (weka.classifiers.trees.J48) has the highest accuracy of the three algorithms, with 178 correctly classified instances (85.9903 %).

________________________________________________________________________________________

Task 4

Conduct 10-fold cross validation to evaluate the following classification learning algorithms:
➢ C4.5 (weka.classifiers.trees.J48)
➢ RIPPER (weka.classifiers.rules.JRip)
➢ Naive Bayesian Classification (weka.classifiers.bayes.NaiveBayes)
➢ k-Nearest Neighbor (weka.classifiers.lazy.IBk)
➢ Neural networks (weka.classifiers.functions.MultilayerPerceptron)
on the following datasets from the UCI repository:
• Ecoli database (ecoli.arff)
• Glass identification database (glass.arff)
• Image segmentation database (image.arff)
Use all attributes to build the model. Report the classification accuracy and run time of each algorithm on each data set. Discuss the results and determine if there is an overall winner in terms of accuracy (misclassification rates) and run time.
You can summarize the results in two tables, one for the run time and the other for the accuracy. Then, add a few sentences to discuss the results.

__________________________________________________________________

Answer:

Task4

The following table shows the classification accuracy obtained by running the five algorithms on the three datasets given in the question:

[Table: classification accuracy of the five algorithms on the three datasets]

________________________________________________________________________________________

Discussion & Conclusion for task 4 (part 1):
We can see from the above table that:
Dataset ecoli.arff: Neural networks (weka.classifiers.functions.MultilayerPerceptron) has the highest accuracy of the five algorithms, but its lead over the runners-up is small and not statistically significant.
Dataset glass.arff: k-Nearest Neighbor (weka.classifiers.lazy.IBk) has the highest accuracy of the five algorithms, again by a small and statistically insignificant margin.
Dataset image.arff: C4.5 (weka.classifiers.trees.J48) has the highest accuracy of the five algorithms, again by a small and statistically insignificant margin.
The annotation * marks a result that is statistically significantly worse than the baseline scheme, and the annotation v marks one that is statistically significantly better.
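The */v annotations come from a pairwise significance test. By default the Weka Experimenter uses the corrected resampled t-test, which inflates the variance term to compensate for the overlap between cross-validation training sets. A sketch of that statistic, using hypothetical per-run accuracy differences (the real per-fold numbers are not shown in this report):

```python
import math

# Corrected resampled t-test (Nadeau & Bengio), as used by the Weka Experimenter.
# diffs are HYPOTHETICAL per-run accuracy differences (scheme A minus scheme B).
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.02, 0.03, 0.01, 0.02]
k = len(diffs)                        # 10 runs (one per cross-validation fold)
mean = sum(diffs) / k
var = sum((d - mean) ** 2 for d in diffs) / (k - 1)   # sample variance
ratio = 1 / 9                         # 10-fold CV: test fold is 1/9 of the train size
t = mean / math.sqrt((1 / k + ratio) * var)           # corrected denominator

print(f"t = {t:.3f}")  # compare against the t-distribution with k-1 degrees of freedom
```

Without the correction term (using only 1/k), the test would be far too optimistic, flagging tiny accuracy differences as significant.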
The following shows the run time of each algorithm on each dataset:
________________________________________________________________________________________

[Screenshots: "Time taken to build model" output for each of the five algorithms (weka.classifiers.trees.J48, weka.classifiers.rules.JRip, weka.classifiers.bayes.NaiveBayes, weka.classifiers.lazy.IBk, and weka.classifiers.functions.MultilayerPerceptron) on ecoli.arff, glass.arff, and image.arff]

The following table summarizes the times taken to build the model for each dataset using the five algorithms:

[Table: model build times for the five algorithms on the three datasets]

Discussion & Conclusion for task 4 (part2):

Dataset ecoli.arff: both bayes.NaiveBayes and lazy.IBk have the fastest run time of the five algorithms, though the gap to the others is small and not significant.
Dataset glass.arff: lazy.IBk has the fastest run time of the five algorithms, again by a small and insignificant margin.
Dataset image.arff: lazy.IBk has the fastest run time of the five algorithms, again by a small and insignificant margin.

 

About

Leave a reply

Captcha Click on image to update the captcha .

error: Content is protected !!