Vertical Data Processing for Mining Big Data: A Predicate Tree Approach

10 pages. Published: September 26, 2019


Time is a critical factor in processing very large volumes of data, a.k.a. ‘Big Data’. Many existing data mining algorithms (supervised and unsupervised) become impractical at this scale because they rely on horizontal processing, i.e., row-by-row processing of stored data. The processing-time problem is further exacerbated by the high dimensionality (number of features) and high cardinality (number of records) of big data. To address this issue, we propose a vertical approach based on predicate trees (pTrees). Our approach structures data into columns of bit slices, whose number ranges from a few to hundreds, and processes them vertically, i.e., column by column. We tested our vertical approach against the traditional (horizontal) approach using three basic Boolean operations, namely addition, subtraction, and multiplication, on 10 data sizes. The data sizes ranged from half a billion bits to 5 billion bits. The results are analyzed with respect to processing time and speed gain for both approaches. They show that our vertical approach outperformed the traditional approach for all Boolean operations (add, subtract, and multiply) across all data sizes, with speed gains between 24% and 96%. We conclude that our approach, which keeps data in a data-mining-ready format, is best suited to operations involving complex computations in big data applications, where it can achieve significant speed gains.
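To make the idea of column-by-column processing of bit slices concrete, the following is a minimal sketch of adding two integer columns slice by slice. It is not the authors' pTree implementation: the function names (`to_bit_slices`, `from_bit_slices`, `add_slices`) are hypothetical, each slice is held as an uncompressed Python integer used as a bit vector, and the tree structure of a pTree is not modeled at all.

```python
def to_bit_slices(values, width):
    """Slice j is an int whose bit i equals bit j of values[i]."""
    slices = [0] * width
    for i, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices


def from_bit_slices(slices, count):
    """Rebuild the row values from their vertical bit slices."""
    return [
        sum(((s >> i) & 1) << j for j, s in enumerate(slices))
        for i in range(count)
    ]


def add_slices(a, b):
    """Add two bit-sliced columns with a ripple carry across slices.

    Every AND/OR/XOR below acts on all rows at once, so the loop runs
    once per bit position rather than once per record.
    """
    carry = 0
    out = []
    for a_j, b_j in zip(a, b):
        partial = a_j ^ b_j
        out.append(partial ^ carry)
        carry = (a_j & b_j) | (carry & partial)
    out.append(carry)  # final carry becomes the extra most significant slice
    return out


x = [3, 5, 7, 0]
y = [1, 6, 7, 2]
total = from_bit_slices(
    add_slices(to_bit_slices(x, 3), to_bit_slices(y, 3)), len(x)
)
print(total)  # [4, 11, 14, 2]
```

In this sketch the number of loop iterations in `add_slices` depends only on the bit width of the values, not on the number of records, which is the property a vertical layout is meant to exploit.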

Keyphrases: Big Data, Boolean operations, Data Mining, predicate trees, Speed Gain, vertical processing

In: Frederick C. Harris Jr, Sergiu Dascalu, Sharad Sharma and Rui Wu (editors). Proceedings of 28th International Conference on Software Engineering and Data Engineering, vol 64, pages 68--77

BibTeX entry
  author    = {Mohammad Hossain and Maninder Singh and Sameer Abufardeh},
  title     = {Vertical Data Processing for Mining Big Data: A Predicate Tree Approach},
  booktitle = {Proceedings of 28th International Conference on Software Engineering and Data Engineering},
  editor    = {Frederick Harris and Sergiu Dascalu and Sharad Sharma and Rui Wu},
  series    = {EPiC Series in Computing},
  volume    = {64},
  pages     = {68--77},
  year      = {2019},
  publisher = {EasyChair},
  bibsource = {EasyChair,},
  issn      = {2398-7340},
  url       = {},
  doi       = {10.29007/db8n}}