Inflation, an indicator of imbalance that affects national economies, has proved over the years to be a permanent presence in the economy. Its consequences for the population, such as declining purchasing power, make it necessary to analyze the annual inflation rate measured by the consumer price index (IPC) using various statistical techniques and methods, including data mining techniques. In this paper, a set of data mining techniques, namely principal components analysis, decision trees and linear regression, is used to analyze and predict the annual IPC inflation rate. Data mining and statistical techniques are useful tools for economists, statisticians, financial analysts and others in the analysis and prediction of various economic indicators.
Political, economic and social events in a given period have both positive and negative effects on the evolution of the annual IPC inflation rate, with direct consequences for the population's standard of living. In addition, a set of factors directly influences the evolution of the annual IPC inflation rate: administered prices, fuel prices, volatile food prices (LFO), the adjusted CORE2 indicator (a measure of core inflation) and tobacco and alcoholic beverage prices1. These factors are used to analyze and predict the annual IPC inflation rate with a set of data mining techniques: principal components analysis (PCA), decision trees and linear regression.
The paper is organized as follows. The section on data mining techniques as an instrument for analysis and prediction presents the main applications and advantages of data mining techniques in the economic domain. The next section analyzes and predicts the annual IPC inflation rate by applying principal components analysis, decision trees and linear regression. The conclusions present the applicability and utility of data mining techniques such as PCA, decision trees and linear regression in the analysis and prediction of different economic indicators.
1 *** Romanian National Bank (BNR), Report on inflation – May 2012, at http: // www .bnro .ro / Publication ocuments.aspx?icid=3922.
Data Mining Techniques as an Instrument for Analysis and Prediction
Government institutions, and banks in particular, such as the Romanian National Bank (BNR), work with large databases. Without adequate instruments, useful information cannot be extracted from these enormous databases and analyzed. Data mining techniques are a useful tool for solving various problems in the economic domain: they identify the correlations, rules and patterns that exist in the analyzed data, and so help the analyst make the best decisions and analyze and predict various economic indicators. In the economic domain, data mining techniques such as classification, regression and clustering bring a set of advantages, for example the optimization of institutional activity through better decision making2. The literature also presents data mining applications in the analysis and prediction of inflation behaviour3, in the analysis of inflation measured through consumer prices4, and others. The data mining techniques most used in the economic domain for predicting various indicators are presented in Figure 1.
Fig. 1. Data mining techniques
The ability to predict accurately the evolution of an economic indicator such as the annual IPC inflation rate is essential for choosing the most adequate decisions and measures to counter the negative effects of an unfavorable evolution of that indicator. For this purpose, data mining techniques can be a useful tool.
2 Andronie, M., Crişan, D., Commercially Available Data Mining Tools used in the Economic Environment, Database System Journal, Vol. I, No. 2/2010, pp. 45-54.
3 Ericsson, N., Constructive Data Mining: Modeling Australian Inflation, at http: // www . economics.smu.edu.sg/events/Paper/NeilEricssonAustralian.pdf
4 Cao, L., et al., New Frontiers in Applied Data Mining, Springer-Verlag, Berlin Heidelberg, 2012, pp. 179-181.
The Data Mining Techniques Application
The domain of applicability of data mining techniques is vast, ranging from finance, economics and marketing to medicine, astronomy and meteorology4. From the multitude of data mining techniques, principal components analysis (PCA), linear regression and decision trees were chosen for the analysis and prediction of the annual IPC inflation rate.
Principal Components Analysis
Economic and social reality is influenced by a large number of variables (indicators). Principal components analysis (PCA) aims to reduce the number of variables used to a smaller number of representative variables, called factors5. Table 1 presents the database built from the data supplied by the Romanian National Bank (BNR) for the analyzed variables: annual IPC inflation rate, administered prices, fuel prices, volatile food prices (LFO), the adjusted CORE2 indicator (a measure of core inflation) and tobacco and alcoholic beverage prices6.
Table 1. Data base structure
Year  Month      IPC    PRICES  TABACCO_ALCOHOL  LFO     CORE2
2008  October    7.39   1.6     0.6              1.1     4.1
2008  November   6.74   1.5     0.6              0.8     3.8
2008  December   6.30   1.6     0.6              0.3     3.7
2009  January    6.71   1.5     0.8              0.4     4.0
2009  February   6.89   1.5     0.8              0.6     4.0
2009  March      6.71   1.5     0.9              0.6     3.7
2009  April      6.45   1.5     1.2              0.4     3.3
2009  May        5.95   1.4     1.2              0.2     3.1
2009  June       5.86   1.5     1.2              0.3     2.9
2009  July       5.06   0.7     1.1              0.5     2.8
2009  August     4.96   0.9     1.2              0.2     2.7
2009  September  4.94   0.9     1.4              0.1     2.5
2009  October    4.30   0.8     1.3              0.1     2.1
2009  November   4.65   0.7     1.6              0.4     2.0
2009  December   4.74   0.6     1.8              0.7     1.6
2010  January    5.20   0.6     2.4              0.8     1.4
2010  February   4.49   0.4     2.3              0.6     1.2
2010  March      4.20   0.4     2.2              0.6     1.0
2010  April      4.28   0.5     1.8              0.7     1.2
2010  May        4.42   0.6     1.8              0.7     1.3
2010  June       4.38   0.6     1.8              0.6     1.4
2010  July       7.14   1.3     1.9              1.1     2.8
2010  August     7.58   1.3     1.9              1.4     2.9
2010  September  7.77   1.3     1.6              1.8     3.0
2010  October    7.88   1.4     1.8              1.9     2.9
2010  November   7.73   1.4     1.5              1.9     2.9
2010  December   7.96   1.4     1.4              2.2     3.0
2011  January    6.99   1.2     0.9              2.3     2.6
2011  February   7.60   1.2     0.9              2.7     2.8
2011  March      8.01   1.2     0.9              3.0     2.9
2011  April      8.34   1.3     1.0              3.1     2.9
2011  May        8.41   1.3     1.1              3.1     2.9
2011  June       7.93   1.3     1.1              2.6     2.9
2011  July       4.85   0.8     0.6              1.6     1.9
2011  August     4.25   0.9     0.5              1.0     1.9
2011  September  3.45   0.9     0.5              0.3     1.8
2011  October    3.55   1.0     0.5              0.4     1.7
2011  November   3.44   1.0     0.5              0.3     1.6
2011  December   3.14   1.1     0.5              0.0     1.5
2012  January    2.72   1.08    0.42             -0.34   1.56
2012  February   2.59   1.06    0.42             -0.27   1.38
2012  March      2.40   1.03    0.43             -0.41   1.35
IPC: annual IPC inflation rate; TABACCO_ALCOHOL: tobacco and alcohol beverages prices (%).
The PCA was carried out with the Statistics - Data reduction - Factor procedure in the SPSS environment. The statistical results are presented in Tables 2, 3 and 4.
5 Gorunescu, F., Data mining. Concepte, modele şi tehnici, Editura Albastră, Cluj-Napoca, 2006, pp. 23-24.
6 *** Romanian National Bank (BNR), Monthly bulletins, 2010-2012, at https://www.bnro.ro/Raportul-asupra-inflatiei-3342.aspx.
Table 2. Correlation matrix
Correlation      IPC     PRICES  TABACCO_ALCOHOL  LFO     CORE2
IPC              1.000   .680    .079             .768    .745
PRICES           .680    1.000   -.464            .273    .880
TABACCO_ALCOHOL  .079    -.464   1.000            .043    -.345
LFO              .768    .273    .043             1.000   .253
CORE2            .745    .880    -.345            .253    1.000
The correlation matrix gives the correlations between the analyzed variables. Both positive and negative correlations are present. Meaningful positive correlations exist between IPC and LFO (0.768), between PRICES and CORE2 (0.880) and between CORE2 and IPC (0.745).
Table 3. Total variance explained
           Initial Eigenvalues                   Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Component  Total  % of Variance  Cumulative %    Total  % of Variance  Cumulative %    Total  % of Variance  Cumulative %
1          2.914  58.284         58.284          2.914  58.284         58.284          2.196  43.920         43.920
2          1.357  27.133         85.416          1.357  27.133         85.416          2.075  41.496         85.416
3          .619   12.378         97.794
4          .110   2.198          99.992
5          .000   .008           100.000
The Total Variance Explained table gives the first information specific to factor analysis. Five principal components (factors) were generated, but only two of them met the selection criterion (eigenvalue >= 1). The Extraction Sums of Squared Loadings columns give the eigenvalues, the explained variance and the cumulative variance for these two factors in the initial solution (without rotation)7. The variance explained by each factor is distributed as follows: factor I, 58.284%, and factor II, 27.133%. Together the two factors explain 85.416% of the variation in the analyzed values. The Rotation Sums of Squared Loadings columns give the same quantities for the two factors after the rotation procedure is applied. The explained variance is redistributed between the factors (factor I, 43.920%; factor II, 41.496%) while the total stays the same (85.416%): through rotation, the first factor loses part of its saturation in favor of the second factor. The Scree Plot (Figure 2) presents in graphical form the eigenvalues of all the principal components obtained in the analysis and listed numerically in Table 3.
Fig. 2. Scree Plot
The factorial solution (after rotation) is presented in Table 4.
Table 4. Rotated Component Matrix(a)
                 Component
                 1       2
IPC              .951    .297
LFO              .860
PRICES           .447    .845
TABACCO_ALCOHOL  .301    -.813
CORE2            .510    .781
The data in Table 4 support the final conclusions on the factorial structure of the analyzed variables: Factor I is composed of the IPC (0.951) and LFO (0.860) variables; Factor II is composed of the PRICES (0.845) and CORE2 (0.781) variables. The character and intensity of the correlation between IPC and LFO and between PRICES and CORE2 are highlighted with a graphical procedure, the scatterplot, shown in Fig. 3.
Fig. 3. Scatterplot
Using PCA, we determined the factors that influence the annual IPC inflation rate; a strong linear relation exists between these factors, as can be observed in Fig. 3.
7 Popa, M., Analiza factorială exploratorie, at https://www mpopa .ro/statistica_master/14_ analiza_fact.pdf.
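The selection step reported in Table 3 can be checked with simple arithmetic. The Python sketch below is only an illustration, not the SPSS procedure: it takes the five eigenvalues from Table 3 as given, computes each component's share of the total variance, and applies the eigenvalue >= 1 criterion. The function names `variance_shares` and `kaiser_retained` are hypothetical helpers introduced here.

```python
# Eigenvalues of the five components, copied from Table 3.
eigenvalues = [2.914, 1.357, 0.619, 0.110, 0.000]


def variance_shares(values):
    """Return a list of (percent of variance, cumulative percent) pairs."""
    total = sum(values)  # for a correlation matrix this equals the number of variables
    shares = []
    running = 0.0
    for value in values:
        percent = 100.0 * value / total
        running += percent
        shares.append((percent, running))
    return shares


def kaiser_retained(values, threshold=1.0):
    """Return the 1-based indices of components with eigenvalue >= threshold."""
    return [i for i, value in enumerate(values, start=1) if value >= threshold]


shares = variance_shares(eigenvalues)
retained = kaiser_retained(eigenvalues)

for i, (percent, cumulative) in enumerate(shares, start=1):
    status = "retained" if i in retained else "dropped"
    print(f"component {i}: {percent:6.2f}% of variance, "
          f"cumulative {cumulative:6.2f}% ({status})")
```

Because the eigenvalues in Table 3 are rounded to three decimals, the shares computed this way can differ from the printed percentages in the last decimal place.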
The Simple Linear Regression
Simple linear regression, available as a procedure in SPSS, determines the model that establishes (through a regression equation) the association between a dependent variable (for instance IPC) and an independent one (for instance CORE2)8. The model obtained can be used to predict the annual IPC inflation rate (IPC). The simple linear regression model has the form9:
$Y = \beta_0 + \beta_1 X + \varepsilon$ (1)
where $\beta_0$ is the intercept (the mean value of $Y$ when $X = 0$), $\beta_1$ is the slope of the line, and $\varepsilon$ is a random variable that gives the deviation between the observed values and the values estimated by the model9. The correlation (association) between IPC and CORE2 is presented in Fig. 4.
Fig. 4. The correlation between IPC and CORE2
The most important results obtained by applying the SPSS regression procedure are presented in Table 5 and Table 6.
8 Gorunescu, F., Data mining. Concepte, modele şi tehnici, Editura Albastră, Cluj-Napoca, 2006, pp. 23-24.
9 Jaba, E., Econometrie aplicată, Editura Universităţii Alexandru Ioan Cuza, Iaşi, 2008, pp. 6-15.
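As an illustration of how the intercept and slope of model (1) are usually estimated, the sketch below implements the ordinary least squares closed-form formulas in plain Python. It is a minimal sketch and not the SPSS procedure used in the paper; the function name `fit_simple_ols` and the toy data (points lying exactly on y = 1 + 2x) are hypothetical and serve only to show the computation.

```python
def fit_simple_ols(xs, ys):
    """Estimate (intercept, slope) of y = b0 + b1 * x by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return intercept, slope


# Hypothetical toy data on the line y = 1 + 2x (not taken from Table 1).
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_ols(toy_x, toy_y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

With real data, `xs` would hold the CORE2 values and `ys` the IPC values for the months included in the estimation sample.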
Table 5. Model Summary
Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .745(a)  .554      .543               1.13256
Table 6. Coefficients
Model          B      Std. Error  Beta   t      Sig.
1 (Constant)   2.269  .548               4.137  .000
  CORE2        1.422  .207        .745   6.876  .000
The R Square value indicates that about 55% of the variation in IPC is explained by the variation in CORE2. The Coefficients table contains the B coefficients (unstandardized) and the Beta coefficient (standardized); the B coefficients are the ones used in the prediction equation. The linear regression equation is therefore:
$\mathrm{IPC} = 2.269 + 1.422 \cdot \mathrm{CORE2}$ (2)
Using equation (2) and the CORE2 value for a given month, the annual IPC inflation rate (IPC) can be predicted. According to BNR, the CORE2 value for April is 1.8%, so the IPC value estimated for the same month with equation (2) is 4.8286%, a value that corresponds to the predictions of the National Institute of Statistics (INSSE)10.
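The prediction step can be written out as a short Python sketch. It uses only the unstandardized B coefficients from Table 6 and the April CORE2 value of 1.8 quoted in the text; the function name `predict_ipc` is a hypothetical helper introduced for illustration.

```python
# Unstandardized coefficients (column B) from Table 6.
INTERCEPT = 2.269  # (Constant)
SLOPE = 1.422      # CORE2


def predict_ipc(core2):
    """Predicted annual IPC inflation rate (%) for a given CORE2 value."""
    return INTERCEPT + SLOPE * core2


# CORE2 value for April as quoted in the text (1.8).
april_estimate = predict_ipc(1.8)
print(f"Estimated IPC: {april_estimate:.4f} %")
```

The estimate is a point prediction only; the standard error of the estimate in Table 5 (1.13256) gives an idea of the spread of the observed values around the fitted line.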
Decision Trees
Decision trees are one of the main data mining techniques, used to predict the category to which an instance belongs12. To obtain the decision tree (decision rules), the C5.0 algorithm implemented in the See5 environment was used11. Applying the C5.0 algorithm to the data in Table 1 produced the following rules, which make up the decision tree:
if CORE2 > 2.1 then IPC = big;
if CORE2 <= 2.1 AND PRICES > 0.8 then IPC = medium;
if CORE2 <= 2.1 AND PRICES <= 0.8 AND LFO <= 0.7 then IPC = medium;
if CORE2 <= 2.1 AND PRICES <= 0.8 AND LFO > 0.7 then IPC = big.
Using these decision rules, See5 can predict the annual IPC inflation rate, as can be observed in Fig. 5.
Fig. 5. Prediction of IPC using See5
Fig. 5 shows that the value predicted for the annual IPC inflation rate for April 2012 is a medium one, that is, it belongs to the interval (2.5, 4.5), with a probability of 88%.
10 *** INSSE, at https://www.insse.ro/cms/files/statistici/comunicate/ipc/a12/ipc04r12.pdf.
11 Khosrow-Pour, M., Emerging Trends and Challenges in Information Technology Management, Editura Idea Group, 2006, pp. 282-283.
12 Gorunescu, F., Data mining. Concepte, modele şi tehnici, Editura Albastră, Cluj-Napoca, 2006, pp. 23-24.
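The four decision rules listed above can be encoded directly as a small function. The Python sketch below is only a transcription of those rules, not the C5.0 algorithm or the See5 tool, and it does not reproduce the confidence value reported by See5. The function name `classify_ipc` is hypothetical, and the example call assumes that the March 2012 values in Table 1 map to PRICES = 1.03, LFO = -0.41 and CORE2 = 1.35.

```python
def classify_ipc(core2, prices, lfo):
    """Return the IPC class ('big' or 'medium') given by the four rules in the text."""
    if core2 > 2.1:
        return "big"
    # From here on CORE2 <= 2.1.
    if prices > 0.8:
        return "medium"
    # From here on CORE2 <= 2.1 and PRICES <= 0.8.
    if lfo <= 0.7:
        return "medium"
    return "big"


# Example call with the assumed March 2012 values from Table 1.
march_2012_class = classify_ipc(core2=1.35, prices=1.03, lfo=-0.41)
print(march_2012_class)
```

Writing the rules as nested checks makes it visible that PRICES and LFO are consulted only when CORE2 is at most 2.1.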
Conclusions
This paper highlighted the utility of statistical data mining techniques such as principal components analysis (PCA), simple linear regression and decision trees in analyzing and predicting the evolution of economic indicators such as the annual IPC inflation rate. PCA was used to establish the factors that meaningfully influence the evolution of the annual IPC inflation rate, while simple linear regression was used to predict the annual IPC inflation rate for April 2012. These are only a few of the advantages of using statistical data mining techniques in the analysis of various economic indicators; they are useful instruments for statisticians, economists and other analysts.