Domain: Multiple Industries
Service Area: Big Data & Advanced Analytics
Context
Data volumes are growing, and the pace of that growth is accelerating. Sensor data, log files, social media, and other sources have emerged, bringing a volume, velocity, and variety of data that far outstrips traditional data warehousing approaches. It is not as simple as putting all of this data in one place. Most of the clients for whom we have provided this solution had a traditional DWH solution in place (Teradata, IBM Netezza, Oracle, etc.); ingesting and processing data and then building analytics on top of these stacks was time-consuming and complex, and the data-mart approach did not provide the data completeness required to improve overall data quality. Even reporting became difficult to manage with these solutions. Additionally, treating these sources as a single source of truth was not feasible, and most of the time the analytics team ended up stitching together multiple databases and transformation layers just to generate simple reports.
Solutions
To help multiple clients address these challenges, the Inventfor data engineering team delivered solutions based on the Hadoop framework, running on top of Hortonworks or Cloudera, that allow analyzing trillions of records and, in most cases, producing approximately one terabyte of reports per month. The outcomes are then presented through data quality dashboards. Have a look at our Big Data and Advanced Analytics practice.
Key steps we took while implementing such solutions are:
- Identify all available data sources, which come in multiple flavours: structured data in an RDBMS, semi-structured data in log/text files, unstructured data from crawled sites, and so on.
- Perform capacity planning based on the data volume to be ingested into the Hadoop lake, factoring in year-over-year growth in ingestion (a rough sketch of this calculation follows the list).
- Decide on the deployment mode, on-premises or in the cloud, as some clients have restrictions on setting up their data stacks in the cloud.
- Identify the use cases to be built on top of the data lake; this determines which tools/components to select from the Hortonworks/Cloudera technology stacks. For example, if a search engine needs to be built on top of semi-structured data, Solr needs to be deployed as one component of the stack, or a Flume agent might be required to integrate with log files.
- Once the integration layer is set up with NiFi, Flume, or Kafka, finalize the processing layer using Hive, Spark, Impala, Storm, or other components, based on the use cases (see the streaming sketch after this list).
- Finally, build the analytics layer, with a query editor in the form of Hue or Zeppelin, an open-source BI stack, or enterprise BI tools (an example query is shown after the list).
- In certain cases there is a need to set up AI/ML on top of the data lake; there we deploy the Hortonworks and Cloudera AI/ML stacks, Apache Mahout and Spark MLlib, to build a data service layer on top of the Hadoop data lake (a minimal MLlib sketch appears after the list).
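As a rough illustration of the capacity-planning step, the sketch below estimates raw HDFS storage from an assumed daily ingest volume, HDFS's default replication factor of 3, a temp/staging overhead factor, and a year-over-year growth rate. All figures are hypothetical placeholders, not recommendations.

```python
# Back-of-the-envelope HDFS capacity estimate (illustrative figures only).

def hdfs_capacity_tb(daily_ingest_gb: float,
                     years: int,
                     yearly_growth: float = 0.30,  # assumed 30% YoY ingest growth
                     replication: int = 3,         # HDFS default replication factor
                     overhead: float = 1.25) -> float:
    """Estimated raw cluster storage (TB) after `years` of ingestion."""
    total_gb, daily = 0.0, daily_ingest_gb
    for _ in range(years):
        total_gb += daily * 365
        daily *= 1 + yearly_growth
    # Replicated blocks plus temp/staging space determine the raw capacity.
    return total_gb * replication * overhead / 1024

if __name__ == "__main__":
    # e.g. 50 GB/day ingested today, planned over a three-year horizon
    print(f"~{hdfs_capacity_tb(50, 3):.0f} TB raw capacity needed")
```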
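For the integration and processing layers, one common pairing is Kafka feeding Spark Structured Streaming, which lands curated data in the lake. A minimal sketch follows; the broker address, topic, and HDFS paths are hypothetical, and a production job would add schema enforcement and error handling.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Minimal Spark Structured Streaming job: Kafka topic -> Parquet in the lake.
spark = SparkSession.builder.appName("log-ingest").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "app-logs")                    # placeholder topic
       .load())

# Kafka delivers key/value as binary; cast the payload to string for parsing.
events = raw.select(col("value").cast("string").alias("line"), col("timestamp"))

(events.writeStream
 .format("parquet")
 .option("path", "hdfs:///lake/raw/app_logs")            # placeholder lake path
 .option("checkpointLocation", "hdfs:///chk/app_logs")   # required for recovery
 .start()
 .awaitTermination())
```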
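On the analytics layer, once Hive tables are defined over the curated data, the same SQL can be run from Hue, Zeppelin, or a BI tool. The table and columns below are hypothetical, purely to show the shape of a typical query.

```python
from pyspark.sql import SparkSession

# Example analytics query over a hypothetical Hive table in the lake.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

daily_errors = spark.sql("""
    SELECT to_date(event_time) AS day,
           COUNT(*)            AS error_count
    FROM lake.app_logs                -- hypothetical table
    WHERE level = 'ERROR'
    GROUP BY to_date(event_time)
    ORDER BY day
""")
daily_errors.show()
```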
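Where the AI/ML layer was needed, Spark MLlib pipelines sit naturally on top of the lake. A minimal sketch, assuming a hypothetical Hive table of labelled records with two numeric feature columns:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical training table: two numeric features plus a binary `label`.
train = spark.table("lake.training_data")

assembler = VectorAssembler(inputCols=["feature_a", "feature_b"],
                            outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(train)

# Score new records and persist predictions into the data service layer.
scored = model.transform(spark.table("lake.new_records"))
scored.select("prediction").write.mode("overwrite").saveAsTable("lake.predictions")
```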
Outcomes
With minimal hardware resources and a collection of open-source software requiring no licensing fees, a Hadoop implementation results in:
- Cost and time savings, with the additional benefit of a productivity boost as clients put their new analytical assets to work.
- A highly scalable storage platform, unlike traditional relational database management systems (RDBMS): capacity grows simply by adding nodes to the cluster.
- Easy access to new data sources, tapping into different types of data (both structured and unstructured) to generate value from that data.
- Much faster data processing, thanks to Hadoop’s storage method: a distributed file system that essentially ‘maps’ data wherever it is located on the cluster, so processing runs close to the data.
- A key advantage of using Hadoop is its fault tolerance: when data is sent to an individual node, it is also replicated to other nodes in the cluster, which means that in the event of a failure another copy is available for use.