This post will explain, step by step, how to classify data into its classes using the command window in MATLAB. This is only one way to classify with MATLAB; you can certainly find other ways 🙂 In general, a classification process needs three main stages: pre-process, main process and post-process. The pre-processing stage covers attribute selection, cleansing and conversion of the data, while the post-processing stage covers denormalization and interpretation of the results.
This classification process uses a dataset stored in the UCI Machine Learning Repository, which is widely used in the data mining field.
- Pre-processing steps
Download the Breast Cancer Wisconsin dataset and check on the site how many attributes it has and which one is the class attribute. It has 32 attributes, the first two of which are the patient ID and the diagnosis result (M = Malignant and B = Benign). Save it as a .csv file and open it in a spreadsheet. From this first step we can immediately eliminate the patient ID: it is the unique identifier of each instance, so it never repeats and cannot serve as a decision-making attribute.

To convert the class attribute, you have two options. The first is to convert M and B into binary digits spread over two columns. The first column indicates M, so fill it with '1' if the instance is a Malignant cancer; the second column indicates B and is filled with '1' if the instance is Benign. You will end up with a spreadsheet of two columns and 569 rows containing only 0 and 1, like this picture.
The second option is to encode M as '1' and B as '2'. If you choose this way, you will have a single column of 569 rows containing only 1 and 2.
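If you prefer to do this conversion in MATLAB rather than in the spreadsheet, here is a minimal sketch; it assumes the diagnosis letters are already loaded into a cell array named `labels` (a hypothetical name, not part of the original steps):

```matlab
% labels: hypothetical 569x1 cell array of 'M'/'B' characters
isM = strcmp(labels, 'M');            % true for Malignant rows

% Option 1: two binary columns (first = M, second = B)
class = [double(isM), double(~isM)];  % 569x2 matrix of 0s and 1s

% Option 2: a single column of 1 (M) and 2 (B)
classSingle = double(isM) + 2 * double(~isM);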
From data.names you can obtain information about the dataset attributes. It describes only 10 measurements, so why are there 30 attribute columns? Each measurement comes with three values: the mean, the standard error and the worst (largest) value. You do not have to use all of them to predict the cancer. To make the experiment more interesting, this post will compare the three groups of data and select the best measurement for predicting cancer.
After eliminating and converting the raw data, we are ready for the main process. Prepare your 10 data attributes (the mean, the standard error or the worst values) and your binary class code.
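To split the 30 feature columns into those three groups, you can rely on the column order documented in the .names file (10 means, then 10 standard errors, then 10 worst values). A sketch, assuming the numeric features without the ID and class columns are stored in a matrix called `raw` (a hypothetical name):

```matlab
% raw: hypothetical 569x30 matrix of the numeric feature columns
attsMean  = raw(:, 1:10);    % mean of each measurement
attsSE    = raw(:, 11:20);   % standard error of each measurement
attsWorst = raw(:, 21:30);   % worst (largest) value of each measurement
```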
- Main Process
Load the data into the MATLAB workspace by creating two new variables and filling them with your prepared data. I will name the attribute variable 'atts' and the variable containing the binary digits 'class'. Your workspace will then look like this.
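If you would rather load the files from the command window than through the workspace import tool, here is a sketch using csvread (the file names are assumptions; use whatever you saved in the pre-processing step):

```matlab
atts  = csvread('wdbc_atts.csv');   % 569x10 attribute matrix
class = csvread('wdbc_class.csv');  % 569x2 binary class matrix
```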
You had better transpose them first, because newff expects samples as columns rather than as rows. Type this syntax in your command window to transpose both variables.
>> atts = atts’
>> class = class’
If you look closely, the small apostrophe (') transposes a variable immediately, and you can see the Value dimension of the two variables change from <569x10 double> to <10x569 double>.
You can see the requirements for creating a new feed-forward network through
>> help newff
Define which arguments are the inputs, the targets and the number of hidden neurons for the breast cancer data.
>> net = newff(atts, class, 20)
If you succeed in creating the network, the workspace will look like this
The step after creating the neural network architecture is the training session. Training matches the measurements of each individual with their class (Benign or Malignant).
>> net = train(net, atts, class)
After calling the 'train' function, an interface as shown below will appear; please wait a moment until it reaches the optimum condition. Note that each run can need a different number of iterations (epochs), because training starts from random values, which may lead the process down different paths. My suggestion: keep retraining until Validation Checks does not hit its maximum of 6, or pick the run with the lowest Performance value (performance is measured by MSE, so the lower the value, the better the result).
- Epoch: filled with the number of iterations actually performed to reach the optimum result. Training usually has a maximum number of iterations to bound the time, accuracy and resources needed. Why should we set a maximum? Since this method belongs to the heuristic solutions, finding the perfect answer to a problem would take a great deal of time and effort. Given the complexity and the number of attributes, which may be hundreds rather than one or two, heuristic solutions (e.g. neural networks) can reach a good answer with only modest time and resources. Without a limit on the number of iterations, training could loop all day. This constraint is well known as a stopping rule or stopping criterion.
- Performance: this field is usually filled with an error measurement (e.g. MSE, SSE, RMSE, PE) that judges the quality of a training session. Starting from 3.29 at the first iteration, it was reduced to 0.0273 by the ninth iteration of this training session. Besides the maximum number of iterations mentioned above, a minimum improvement threshold also works as a stopping rule, in case the updates stop producing any meaningful improvement. Which measurement is used depends on the training algorithm.
- Gradient: during training, besides measuring the performance and counting iterations, the tool also tracks the gradient used by gradient descent. The bigger its value, the larger the adjustment applied to the weights and biases, which means there is still substantial improvement to make. At the first iteration the gradient is large because the weights are random numbers, and training ends if the gradient drops below 1e-10. You can adjust this threshold through net.trainParam.min_grad.
- Validation checks: this field counts the consecutive iterations in which the validation error failed to improve. If it reaches 6, the default maximum, training stops. In this run, you can see that training stopped because of the number of validation checks. You can change this criterion by setting the parameter net.trainParam.max_fail.
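The stopping criteria described in the list above all live in net.trainParam, so you can adjust them before calling train; the values below are only illustrative, not recommendations:

```matlab
net.trainParam.epochs   = 1000;   % maximum number of iterations
net.trainParam.goal     = 0;      % target performance (MSE)
net.trainParam.min_grad = 1e-10;  % stop when the gradient drops below this
net.trainParam.max_fail = 6;      % maximum consecutive validation failures
```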
The next step after training is to obtain the result by storing the simulation output in a variable via the following command.
>> output = sim(net, atts)
The output produced by the code above varies over a continuous range (roughly between some -m and +n), but the result needed for classification is a binary digit, just like the targets. Thus, the output should be converted to 0 and 1 as a post-processing step. To make this happen, the round function converts the numbers to the nearest integer.

>> output = round(output)

To see what percentage of instances were perfectly classified and misclassified, check the values of the confusion matrix below.
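One caveat I would add (not in the original steps): because the raw network outputs can fall slightly below 0 or above 1, round alone can occasionally produce -1 or 2. A defensive clamp keeps the result strictly binary:

```matlab
output = round(output);            % nearest integer
output = max(min(output, 1), 0);   % clamp stray -1 or 2 into {0, 1}
```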
>> [misclassified, confmats] = confusion(class, output)

(Note that confusion takes the targets first and the outputs second, just like plotconfusion.) This is mine, how about yours? 😉 I misclassified 2.8% of the total instances 😦
If you need more visualization, you can get the 'real' confusion matrix plot by doing this
>> plotconfusion(class, output)
There should be an explanation of how to read it, so here it is:
- The vertical and horizontal axes represent the classes: Class 1 is Malignant and Class 2 is Benign.
- The diagonal cells are the correctly classified instances. A perfect classifier would list 212 and 357, because the raw data contains 212 Malignant and 357 Benign instances.
- Reading the vertical columns from bottom to top, look at the second column: it holds twelve misclassified instances (the red ones). This means twelve Malignant instances were misclassified as Benign by the classifier. In addition, 4 Benign instances were mistaken for Malignant.
- Therefore, the overall cell in the bottom corner (the blue one) lists 97.2% instead of 100%. This value is the number of correctly classified instances compared to the total number of instances.
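As a quick sanity check, the percentages quoted above can be reproduced from the counts in the plot:

```matlab
nM = 212; nB = 357;          % class totals in the raw data
missM = 12; missB = 4;       % misclassified counts read from the plot
correct  = (nM - missM) + (nB - missB);  % 553 correctly classified
total    = nM + nB;                      % 569 instances in all
accuracy = correct / total               % 0.9719... -> the 97.2% cell
errRate  = 1 - accuracy                  % 0.0281... -> the 2.8% above
```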
Well, these experiments are not finished yet. Be patient, I will follow up soon with the GUI experiments 😉 I guess you will like the GUI more than the command window, but personally, I prefer this one.