CASE STUDY I Forest Fires: Classification & Prediction

Forest fires are a serious danger to people living near forests. They generally break out in summer, when dry wood rubs together in hot, windy conditions. Wind carries such fires into neighbouring areas, and they can reach cities. The state needs to contain a fire before it gets out of control.

Generally, when a forest fire breaks out, the fire office sends a team to assess its magnitude and spread. Based on the team's observations and prior information, the office decides whether to launch a full-scale operation to control the wildfire. This assessment takes time and effort, and the fire keeps spreading in the meantime.

Would it not be better if the fire office could estimate the magnitude and spread of a fire without visiting the site, and set out already prepared for the size of the wildfire? It would certainly be a good idea, but how can it be done?

We will use a dataset named “Forest Fires” to gain insight into how forest fires can be classified. The dataset is available from the UCI Machine Learning Repository and can be downloaded from the following link:


The Forest Fires dataset contains 517 observations of forest fires in the northeast region of Portugal. It has 13 attributes (variables), listed below:

1. X – x-axis spatial coordinate within the Montesinho park map: 1 to 9

2. Y – y-axis spatial coordinate within the Montesinho park map: 2 to 9

3. month – month of the year: “jan” to “dec”

4. day – day of the week: “mon” to “sun”

5. FFMC – FFMC index from the FWI system: 18.7 to 96.20

6. DMC – DMC index from the FWI system: 1.1 to 291.3

7. DC – DC index from the FWI system: 7.9 to 860.6

8. ISI – ISI index from the FWI system: 0.0 to 56.10

9. temp – temperature in Celsius degrees: 2.2 to 33.30

10. RH – relative humidity in %: 15.0 to 100

11. wind – wind speed in km/h: 0.40 to 9.40

12. rain – outside rain in mm/m2: 0.0 to 6.4

13. area – the burned area of the forest (in ha): 0.00 to 1090.84

The X and Y variables give the coordinates of the fire locations. Variables 5 to 12 are meteorological variables. Variables 3 and 4 give the month and day of the fire, respectively. Variable 13, “area”, is the area affected by the fire in hectares. Here we treat “area” as the response variable and variables 1 to 12 as independent variables. To understand the meteorological variables, readers can click the following link:

Let us go through the variables one by one. Variable “X” gives the x coordinate on the map, on a scale from 1 to 9. Similarly, variable “Y” gives the y coordinate, on a scale from 2 to 9. Variable “month” denotes the month, from Jan to Dec. Variable “day” is the day of the week of the fire, from Monday to Sunday. FFMC, a meteorological variable, ranges from 18.7 to 96.20. DMC ranges from 1.1 to 291.3. The DC index ranges from 7.9 to 860.6. The ISI index ranges from 0.0 to 56.10. Variable “temp” measures temperature in degrees Celsius, from 2.2 to 33.30. Variable “RH” denotes relative humidity in percent, from 15.0 to 100. Variable “wind” measures wind speed in km per hour, from 0.40 to 9.40. Variable “rain” measures rainfall in mm per square metre, from 0.0 to 6.4. The following figure shows the dataset in tabular form in MS Excel:

The above figure shows the “Forest Fires” dataset opened as a CSV file in MS Excel.

B) Preliminary Analysis

It is important to know how forest fires are spread across the geographical area. The dataset gives the X and Y coordinates of each fire within Montesinho park. Let us visualize how fire incidents are distributed on the map of the park:

The Montesinho park area is represented by an image on which the X coordinate ranges from 1 to 9 and the Y coordinate also ranges from 1 to 9. Fire incidents are spread more widely along the X axis than along the Y axis. At Y values from 7 to 9, only seven incidents occurred: six at Y = 9 and one at Y = 8. Now let us see how severe each fire incident is across Montesinho park.

The figure above shows how fire sizes are distributed across Montesinho park. Now let us see how fires are distributed across the map by month.


Let us understand the distribution better. The table below shows how fire incidents are distributed along the X axis. Points 4, 6, 2, 8, 7 and 3 of the X axis each account for more than 10% of the 517 incidents.

X axis                        ->   4    6    2    8    7    3    1    5    9
No. of forest fire incidents  ->  91   86   73   61   60   55   48   30   13

The table below shows that around 63% of fire incidents are clustered at points 4 and 5 of the Y axis.

Y axis                        ->    4    5    6    3    2    9    8
No. of forest fire incidents  ->  203  125   74   64   44    6    1
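The percentage shares quoted around these tables can be checked directly from the counts. The following is a minimal Python sketch that takes the counts from the two tables above; the helper names `shares` and `combined_share` are hypothetical and chosen only for this illustration.

```python
# Counts copied from the X-axis and Y-axis tables above.
x_counts = {4: 91, 6: 86, 2: 73, 8: 61, 7: 60, 3: 55, 1: 48, 5: 30, 9: 13}
y_counts = {4: 203, 5: 125, 6: 74, 3: 64, 2: 44, 9: 6, 8: 1}


def shares(counts):
    """Return each category's percentage share of the total count."""
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}


def combined_share(counts, keys):
    """Percentage of all incidents that fall in the given categories."""
    total = sum(counts.values())
    return 100.0 * sum(counts[k] for k in keys) / total


# Print the share of each X coordinate, then the joint share of Y = 4 and Y = 5.
for x, pct in shares(x_counts).items():
    print(f"X = {x}: {pct:.1f}%")
print(f"Y = 4 or 5: {combined_share(y_counts, [4, 5]):.1f}%")
```

Both tables sum to the 517 observations in the dataset, which is a quick sanity check that no category was dropped.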

This distribution of fire incidents is a useful insight for fire offices. Suppose a fire office wants to set up an early-warning station near this area. The ideal location would lie between Y coordinates 3 and 5 and X coordinates 4 and 7.

Now let us see how fire incidents are distributed by month:

Month                         ->  aug  sep  mar  jul  feb  jun  oct  apr  dec  jan  may  nov
No. of forest fire incidents  ->  184  172   54   32   20   17   15    9    9    2    2    1

Around 69% of forest fire incidents occur in August and September. If we observe the following table, we find that the months of Sep, May, July, Dec and Aug have forest fire incidents in which the fire spread over more than 10 hectares. May and Sep have the highest number of forest fire incidents.

The cross-tabulation of day and month shows a very interesting insight:

We can observe that in August, Tuesday, Wednesday, Thursday and Sunday have more forest fire incidents, while in September, Friday, Saturday, Sunday and Monday have more.
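A cross-tabulation like the one described above amounts to counting (month, day) pairs. The sketch below shows the idea in Python; the records are made up purely for illustration and are not rows from the Forest Fires dataset, and the function names are hypothetical.

```python
from collections import Counter

# Made-up (month, day) pairs for illustration only; these are not rows
# taken from the Forest Fires dataset.
records = [
    ("aug", "sun"), ("aug", "tue"), ("aug", "sun"),
    ("sep", "fri"), ("sep", "mon"), ("sep", "fri"),
]


def crosstab(pairs):
    """Count how often each (month, day) pair occurs."""
    return Counter(pairs)


def format_crosstab(table, rows, cols):
    """Render the counts as a plain-text grid, one row per month."""
    lines = ["month " + " ".join(f"{c:>4}" for c in cols)]
    for r in rows:
        cells = " ".join(f"{table[(r, c)]:>4}" for c in cols)
        lines.append(f"{r:<5} {cells}")
    return "\n".join(lines)


table = crosstab(records)
print(format_crosstab(table, ["aug", "sep"], ["mon", "tue", "fri", "sun"]))
```

A `Counter` returns 0 for pairs that never occur, so empty cells need no special handling.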

C) Relationships between Weather Indices and Forest Fire Area

It is important to know how the weather indices affect the spread of a forest fire. To explore the relationship, we can calculate the correlation between burned area and each weather index, together with its significance. The weather indices and area are measured in different units and have different ranges. This does not affect the correlation coefficient, which is unchanged when a variable is rescaled, so standardized versions of the variables give the same values. Before going into the variable relationships, let us look at the basic descriptive statistics of the dataset:

Normally, when we look at the summary of a dataset, we look for outliers. Outliers
are the extreme values in a series: values that are far higher or far lower than
the rest of the observations. The general rule of thumb is to detect outliers and
remove them, so that the data represent normal, typical conditions. Forest fires,
however, are not typical conditions; they are exceptional events. If we removed
the outliers from the “Forest Fires” dataset, we would tend to remove the forest
fire incidents themselves, which is not desirable. Hence, we keep the outliers in
the dataset and carry on with our analysis.

The correlation analysis between the area of forest fires and the weather index
variables is shown below:
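To see why outlier removal would discard the large fires, consider one common rule of thumb, the 1.5 × IQR rule. The article does not say which rule it has in mind, so this choice is an assumption. In the sketch below the area values are made up for illustration, apart from 1090.84, which is the maximum burned area given in the variable list.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative burned areas in hectares (made up, except the maximum
# 1090.84, which is quoted in the variable list).
areas = [0.0, 0.0, 0.0, 0.5, 1.2, 2.0, 6.4, 1090.84]

print(iqr_outliers(areas))
```

On data shaped like this, with many zeros and a few very large values, the rule flags exactly the large fires that the analysis is meant to study.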

Weather variable    Correlation with area    Significance level
FFMC                 0.04                    p > .05
DMC                  0.07                    p > .05
DC                   0.05                    p > .05
ISI                  0.08                    p > .05
RH                  -0.08                    p > .05
Temp                 0.10                    p < .05*
Wind                 0.01                    p > .05
Rain                -0.007                   p > .05
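The significance column can be reproduced from the coefficients and the sample size. The sketch below assumes the coefficients are Pearson correlations tested with the usual t statistic on n - 2 degrees of freedom; the article does not state this explicitly. It also shows that changing the units of a variable leaves the correlation unchanged. The short `temp` and `area` lists are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing a correlation r from n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Made-up values: temperature in Celsius and burned area in hectares.
temp = [8.2, 18.0, 14.6, 22.8, 30.2]
area = [0.0, 0.0, 1.5, 6.4, 20.3]
temp_f = [t * 9 / 5 + 32 for t in temp]  # same temperatures in Fahrenheit

print(pearson(temp, area), pearson(temp_f, area))  # same value, up to rounding

# Correlations from the table, with n = 517 observations.
for name, r in [("Temp", 0.10), ("ISI", 0.08), ("FFMC", 0.04)]:
    print(name, round(t_statistic(r, 517), 2))
```

With 515 degrees of freedom the two-sided 5% critical value is about 1.96. Only the temperature correlation exceeds it, which agrees with the table.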

Similar results are obtained from a multiple regression model:

A multiple linear regression was calculated to predict the “area” variable from the FFMC, DMC, DC, ISI, temp, RH, wind and rain variables. No significant regression equation was found (F(8,508) = 1.03, p > 0.05), with an adjusted R2 of 0.000517. The model may be improved by iteratively removing independent variables from it. After these iterations, we arrive at the following linear model:

lm(formula = area ~ . - X - Y - month - day - FFMC - rain - DC - RH - DMC - ISI - wind, data = fires)

Residuals:
    Min       1Q   Median       3Q      Max
  -27.3    -14.7    -10.4     -3.4   1071.3

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -7.414       9.500    -0.78     0.435
temp            1.073       0.481     2.23     0.026 *

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.4 on 515 degrees of freedom
Multiple R-squared: 0.00957, Adjusted R-squared: 0.00765
F-statistic: 4.98 on 1 and 515 DF,  p-value: 0.0261

A simple linear regression was calculated to predict the "area" variable
based on the "temp" variable. A significant regression equation was
found (F(1,515) = 4.98, p < 0.05), with an R2 of 0.00957 (adjusted R2 of
0.00765). The R2 value tells us that this model is a poor fit, so we cannot
obtain good predictions of the area variable from the temp variable alone.
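The figures in the regression output are tied together by standard identities for a model with a single predictor: the t value is the estimate divided by its standard error, the F statistic is the square of that t value, and R2 follows from F and the degrees of freedom. A short Python check using only the numbers reported above (the variable names are chosen for this illustration):

```python
# Values copied from the regression output above.
estimate, std_error = 1.073, 0.481
f_stat, df_model, df_resid = 4.98, 1, 515

# t value of the slope, and the F statistic it implies for one predictor.
t_value = estimate / std_error
f_from_t = t_value ** 2

# R-squared recovered from F and the degrees of freedom.
r_squared = f_stat * df_model / (f_stat * df_model + df_resid)

# Adjusted R-squared penalises for the number of predictors.
n = df_resid + df_model + 1  # 517 observations
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_resid

print(round(t_value, 2), round(f_from_t, 2))
print(round(r_squared, 5), round(adj_r_squared, 5))
```

The results agree with the reported t value of 2.23, F of 4.98, multiple R-squared of 0.00957 and adjusted R-squared of 0.00765 to within rounding. Either way, temperature explains less than 1% of the variance in burned area.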

The area variable contains two categories of responses: observations where the
burned area is zero and observations where it is greater than zero. Weather
conditions of a certain magnitude may lead to no forest fire, while conditions
of another magnitude may lead to one.

To find out which weather conditions lead to forest fire incidents and which
do not, logistic regression can be used. Here the dependent variable takes
only two discrete values, 0 and 1. A value of 1 means that a forest fire
incident has occurred, and a value of 0 means that it has not. In the
dataset, 247 observations have no forest fire incident and 270 observations
have one.
Following is the result of the logistic regression:
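To make the setup concrete, here is a minimal Python sketch of how the 0/1 response can be derived from `area` and how a logistic model with a single predictor can be fitted by gradient descent. It runs on a handful of made-up (temp, area) pairs rather than the Forest Fires data, it is not the article's fitted model, and all function names are hypothetical.

```python
import math


def binarize(areas):
    """1 if a fire burned any area, 0 otherwise."""
    return [1 if a > 0 else 0 for a in areas]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def standardize(xs):
    """Centre and scale a predictor so gradient descent behaves well."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / sd for x in xs], mean, sd


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Made-up temperatures (Celsius) and burned areas (hectares), illustration only.
temps = [5.0, 10.0, 12.0, 18.0, 22.0, 25.0, 28.0, 31.0]
areas = [0.0, 0.0, 0.0, 0.0, 1.2, 0.0, 6.4, 20.3]

fire = binarize(areas)
z, mean, sd = standardize(temps)
b0, b1 = fit_logistic(z, fire)


def predict(temp):
    """Estimated probability of a fire at the given temperature."""
    return sigmoid(b0 + b1 * (temp - mean) / sd)


print(fire, round(predict(5.0), 2), round(predict(31.0), 2))
```

In practice a statistics package would fit all the weather variables at once and also report standard errors and significance levels for each coefficient.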


