How to build a data-driven organization correctly?

By Joydeep Sen Sarma, Co-founder and Head, Qubole India

Security is a top concern of enterprises adopting big data solutions in the Cloud. Fortunately, Cloud providers, Open Source software and solution providers have made tremendous progress in addressing this concern. In our experience, getting value out of Big Data begins with asking questions whose answers can drive business value. Algorithms such as machine learning techniques, and analysts trained in applying them, can help answer these questions. But their utility is only as good as the questions being asked.

For this reason, the role of humans in creating value out of Big Data technologies will be paramount. Firms with the best analysts, who understand the business intimately, know the data sets available and know what answers can be teased out of such data using the latest technologies, will be best able to leverage big data for competitive advantage. It is also important to vet the insights obtained with a keen eye. Data can lead, but it can also mislead.

On that theme, intelligence derived from data is only as good as the data itself. Data quality is therefore very important. Incomplete, polluted, incorrectly sampled or biased data can produce wrong conclusions. The process of making good data sets available throughout an organization is also largely human driven and will continue to be key to deriving value out of data.
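As a rough illustration of what a first check for incomplete or duplicated data might look like, here is a minimal Python sketch. The function name `quality_report`, the record layout and the choice of checks are assumptions made for this example; real data quality work also has to look at sampling and bias, which a simple script like this cannot detect.

```python
from collections import Counter


def quality_report(records, required_fields):
    """Summarise basic quality problems in a list of dict records.

    Hypothetical helper: counts empty or missing values per required
    field and the number of exact duplicate rows.
    """
    total = len(records)
    missing = Counter()
    for rec in records:
        for field in required_fields:
            # Treat an absent key, None and the empty string as missing.
            if rec.get(field) in (None, ""):
                missing[field] += 1

    # Two rows count as duplicates when all required fields are identical.
    keys = [tuple(rec.get(f) for f in required_fields) for rec in records]
    duplicates = total - len(set(keys))

    return {
        "rows": total,
        "missing_share": {
            f: (missing[f] / total if total else 0.0) for f in required_fields
        },
        "duplicate_rows": duplicates,
    }
```

A report like this only surfaces the problem. Deciding whether a missing value can be tolerated, and why it is missing, still needs someone who knows the business.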

Fragmented data sets have been a key barrier to data democratization and to deriving business value from data. Previously, in firms without a comprehensive data lake (or equivalent) strategy, analysts would have to ask around the organization to find out where key data sets were located. If they did manage to find the required data sets, getting access to them could be delayed. Finally, extracting and processing large data sets became a problem, one typically outside the analysts' domain of expertise. As a result of all these barriers to combining the available data sources, much analysis never happened and many insights were lost. Whatever analysis did happen could be painstakingly slow.

On a related note, the data analysed is often dated. Analysis of up-to-date, real-time information, especially for time-critical operational, marketing and product intelligence, was almost impossible earlier.

Bringing all the data (streaming and static) together on one platform, with common, shared metadata and collaborative analytical tools on top, can make a dramatic difference to the situations above. All data is instantly accessible (with the right access controls, of course) and up to date, and analysts can quickly answer business questions. Ideally, analysts should also be able to build upon earlier analysis.

The notion of a Data Lake and its promise require multiple technologies like:

  1. Storage technologies (like HDFS, AWS S3 or Azure Data Lake) to store most data in a shared system that is amenable to all kinds of analytics tools.
  2. Connectivity to external databases and repositories (for example, streaming data sets) that cannot be replicated onto a common storage system, and the ability to process them in place.
  3. Common metadata storage for all data in an organization, and discovery systems that allow analysts to quickly discover the required data sets.
  4. A collaborative platform for performing analysis that allows analysts to leverage the prior work of their community.
  5. A powerful, comprehensive and easy-to-use set of tools to gather insights from the data lake, plugged into the data lake, the connected data components, the common metadata repository and the collaborative analysis platform.
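
To make the idea of common metadata and discovery a little more concrete, here is a minimal in-memory sketch in Python. Everything in it is hypothetical: the class names `DatasetEntry` and `Catalog`, the fields and the example location are invented for illustration and do not correspond to any particular product's API. A real deployment would use a shared metadata service rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical metadata record describing one data set."""
    name: str
    location: str  # e.g. an object-store path or a database URI
    owner: str
    tags: set = field(default_factory=set)


class Catalog:
    """Toy metadata catalog: register data sets, then find them by keyword."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # A later registration under the same name replaces the earlier one.
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return sorted names of data sets whose name or tags contain keyword."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)
        )


catalog = Catalog()
catalog.register(
    DatasetEntry("orders_daily", "s3://example-bucket/orders/", "sales", {"revenue"})
)
catalog.register(
    DatasetEntry("web_clicks", "s3://example-bucket/clicks/", "marketing", {"traffic"})
)
matches = catalog.search("revenue")
```

The point of the sketch is that an analyst searches one shared index instead of asking around the organization. Access control, lineage and schema information are deliberately left out.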

Most Governments around the world focus on policies. My hope is that the focus on analytics and data will cause Governments to shift their focus to outcomes.

Once Governments are focused on outcomes, big data technologies (amongst others) can help them measure those outcomes, measure the inputs that went into them and measure the efficacy of different policy approaches to achieving better outcomes. Such outcome- and data-driven policy formulation can, hopefully, lead to less politicization (in addition to being more efficacious in the first place) and help citizens evaluate Governmental performance more accurately (and hence help the democratic process as well).

The first step here is obviously to define some desirable outcomes and then go about gathering signals that would help measure them. As an example, let’s take the case of air and water pollution. Most of our discussions are about policy approaches (say odd-even, banning diesel, sewage treatment plants and so on), but the outcome of those policy approaches remains in question. Instead, the first step Governments should take is to set up measurement of the pollution of air and of different water bodies, and make sure such data is available to the public on a continual basis. Citizens can then ask whether spending thousands of crores on Clean Ganga (for example) produced any change in the river’s pollution levels. Bureaucrats and policy makers can focus on how they can shape the outcomes, and the data can help in that. Having continuous pollution monitoring all along a river may help identify specific cities with higher effluents. Seasonal trends in the data may help identify specific sources of pollution (seasonal burning of crops, for example). One hopes that many new factors would be discovered this way.
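
As a sketch of how monitoring data along a river could point to specific locations, the Python example below compares each monitoring station's average reading with that of the nearest station upstream and flags sharp increases. The function name, the `(station, km_from_source, level)` record layout and the 1.5x threshold are all assumptions made for illustration, and the station labels used with it would be placeholders, not real measurements.

```python
from statistics import mean


def flag_effluent_jumps(readings, threshold=1.5):
    """Flag stations whose average level jumps relative to the station upstream.

    readings: iterable of (station, km_from_source, pollutant_level) tuples.
    Returns station names, ordered upstream to downstream, whose average
    level is at least `threshold` times the average at the previous station.
    """
    by_station = {}
    for station, km, level in readings:
        by_station.setdefault((km, station), []).append(level)

    # Sorting by distance from the source orders stations upstream to downstream.
    ordered = sorted(by_station)
    averages = [(station, mean(by_station[(km, station)])) for km, station in ordered]

    flagged = []
    for (_, upstream_avg), (station, avg) in zip(averages, averages[1:]):
        if upstream_avg > 0 and avg / upstream_avg >= threshold:
            flagged.append(station)
    return flagged
```

A flagged station only indicates where to look. Confirming the actual source of the pollution, or separating it from seasonal effects, would need longer time series and local knowledge.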

The number of data sources that can be thought of, and the volume of data, will be tremendous. Big Data technologies will help Governments store and make sense of all this data at reasonable cost. A common data infrastructure, freely available to all levels of Government and to different departments so that they can easily collect and store data in one place, will benefit from economies of scale (not just in cost, but also in ease of use) as well as from all the advantages that accrue from a data lake strategy.
