Every organization has been through it: the big data warehouse project that promises to solve everyone’s problems, but runs 100% over budget and years late. Though data warehouses seem at first like appealing solutions to data problems, they are often plagued by challenges that drive up cost and delay delivery. Luckily, there are other approaches that deliver the same benefits with dramatically lower cost, greater flexibility, and faster time to market.
Here are a few common challenges we’ve observed in data warehouse projects.
Data warehouses are big spends. There is no way around it: the combination of person-hours, software licenses and hardware costs makes data warehouse projects expensive. Just getting approval can be a long process, and as with any large project, it is very hard to accurately estimate the cost at the beginning.
Standardizing data around a single schema is almost impossible. Source systems have evolved over time using different technologies and different approaches to organizing data. These approaches are often at odds with each other, which makes the task of finding a single schema to represent all the data almost impossible.
ETL always takes longer than anticipated. As we wrote about in our post about rethinking ETL, any complex data set is going to have a long tail of edge cases which cause errors in the ETL process and need to be addressed individually. This takes time, and is rarely included in the budget at the beginning of the project.
System owners don’t like losing control of their data. Whenever you move data from multiple systems into a data warehouse, somebody is going to lose control. Even if everyone approaches the project with the best of intentions, the handoff of domain-specific knowledge takes time and can lead to roadblocks.
Data warehouse projects usually focus on two outcomes: combining data from different systems, and improving data quality. Below are two approaches you can use to solve these problems for less money and in less time.
Combine data using open standards-based APIs
Instead of replacing multiple existing systems with a single data warehouse, consider leveraging your existing investments and connecting them together using open standards-based APIs. We’ve discussed this approach previously, and think it’s an effective solution to the data fragmentation issues many organizations face. In a nutshell, APIs give you the ability to combine data on demand, as it’s requested by the source application. Of course, you need to have a caching strategy in place, but we’ve found that the vast majority of cases rely on data that doesn’t change very often (customer or product information, for example), so latency and data freshness are rarely show stoppers.
A few best practices to keep in mind as you consider using APIs:
Focus on translation rather than standardization. As discussed above, it is rarely possible to develop a single data schema that represents all edge cases and data architectures. Instead, focus on translating information between different structures. This will always be a faster approach and can be tailored to the specific use cases you have in mind.
Build for fault tolerance. Bad data and errors will happen, so build your system to accommodate errors without losing data, and to make sure that data can be retrieved without blocking the entire application.
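To make both practices concrete, here is a hedged Python sketch: two translator functions map records from hypothetical CRM and billing systems into the shape one consuming application needs, and the batch translator sets bad records aside rather than losing them or halting the run. All field names are illustrative assumptions.

```python
def from_crm(record):
    # The (hypothetical) CRM stores a single "full_name" field.
    return {"name": record["full_name"], "email": record["email"].lower()}

def from_billing(record):
    # The (hypothetical) billing system stores first/last names separately
    # and may lack an email entirely.
    return {"name": f'{record["first"]} {record["last"]}',
            "email": record.get("contact_email", "").lower()}

def translate_all(records, translate):
    # Fault tolerance: a bad record is set aside with its error, never
    # dropped silently, and never blocks the rest of the batch.
    ok, failed = [], []
    for record in records:
        try:
            ok.append(translate(record))
        except (KeyError, AttributeError) as err:
            failed.append((record, str(err)))
    return ok, failed
```

Note that there is no canonical schema here: each translator targets only the fields this particular application cares about, which is what keeps the approach fast to build.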
The best thing about this approach is that, unlike a data warehouse, API-enabled systems give you real flexibility and future-proofing. Because APIs are technology- and vendor-agnostic, any API infrastructure you develop can be reused with little additional investment for other data projects. Additionally, because you are using existing infrastructure, there are no political turf battles to fight; system owners retain their data, but you can still innovate.
Improve data quality at the application layer
One of the common assumptions is that bad data is a problem that has to be addressed at the source, e.g. in the database. Data warehouse projects are often justified as the only way to holistically improve data in the database, through the use of a complicated ETL process. However, in most cases, bad data can be easily addressed from within the application layer, or immediately before consumption. A few common approaches to doing this are:
Validate before you display: if your data display depends on certain fields, ensure they are present before displaying them. Likewise for particular data formats. If the validation fails, look for a graceful fallback; otherwise drop the data (don’t worry, you can get it back later, since it still lives in the source system!).
De-duplicate using a cache: if duplicate entries are an issue, keep a simple cache of accessed/processed objects and use it to weed out any duplicates. If you need to merge objects, you can use the cache to keep an application-specific merged version. The trick here is that any merging happens ‘on demand’, i.e. when the data is processed by your application.
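The two techniques above might be sketched together like this in Python. The required fields and the merge policy (later records overwrite earlier ones) are illustrative assumptions, not a prescription:

```python
REQUIRED_FIELDS = ("id", "name", "price")

def validate(item):
    # Display depends on these fields; check them before rendering.
    return all(item.get(field) is not None for field in REQUIRED_FIELDS)

seen = {}  # cache of processed objects, keyed by id

def deduplicate(items):
    result = []
    for item in items:
        if not validate(item):
            continue  # safe to drop: the record still lives in the source system
        if item["id"] in seen:
            # On-demand merge: keep an application-specific merged version.
            seen[item["id"]].update(item)
        else:
            seen[item["id"]] = dict(item)
            result.append(seen[item["id"]])
    return result
```

Because the cache holds the merged versions, repeated or partially updated records collapse into a single object at processing time, with no upstream cleanup required.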
By focusing on improving data at the application layer, you can reap the same benefits of better data without the added cost of a large data warehouse. With some smart planning, the added effort of managing bad data can be integrated into your usual application development process.
If you’re interested in more information about alternatives to data warehouse projects, contact us. In addition, we can help you develop a detailed Project Blueprint that will guide your organization through the process of defining technology project requirements, planning infrastructure modernization strategies, deploying new systems, and the all-important budget justification.