Open Source Analytics Tools: It’s All About the Trade-offs

Open source software has gone from developer cult following to mainstream IT staple. So what is the role of open source software in analytics and BI? Today it plays many roles. And how you use open source analytics tools is mainly a question of what you’re trying to do, and what kind of trade-offs you’re willing to make.

Back when I was a developer, open source software was very popular in the developer community. It was free, and that made it easy to play around with. But what developers loved most was how much power and control it gave them.

You could get in there, examine the source code, diagnose problems, and maybe even code up a solution. You didn’t have to wait around for the software vendor to fix bugs or do an integration. But mainstream IT considered it risky.

Today, open source is well established and trusted, especially in certain areas such as data science. You can build your whole data analytics stack by piecing together open source components, and tailoring them to your needs. But should you?


Functionality vs. Effort

In about 90% of use cases, the answer is no. It’s all about the trade-offs — the functionality you need versus the level of effort it takes to get there.

Don’t get me wrong. The open source tools available today are amazing, and there are a lot of benefits to using them.

If you’re building and maintaining a system, and some new technology arises, you’re in a position to be able to swap out one or more components in an incremental fashion. That’s much easier than swapping out entire systems.

It’s easy to find talent, because a lot of the people coming out of school today — especially data scientists — are cutting their teeth on open source.

Another relatively recent phenomenon is companies that offer paid, value-added versions of open source software. This includes some of the biggest names in the technology industry.

For example, PrestoDB is championed by Facebook. That’s a plus and a minus. You have a big entity behind it so it will be well maintained. But on the other hand, they make all the decisions about what’s going to be done. Nonetheless, it speaks to the level of trust open source software enjoys today.

Open source even has security benefits. In certain domains you have to be able to inspect the source code for security reasons. With open source, you have so many eyeballs on the code that malicious code is likely to get found quickly. That’s why just about every widely used security algorithm today is openly published and open to inspection.


Like a Free Puppy

There are also some downsides. There’s a big misconception that open source is free, or close to it, because you don’t have the upfront licensing costs. Yeah, you can download it for free, but using it takes a lot of time and money. Many companies will start down the path with open source only to realize it’s free like a free puppy is “free.”

The onus is on you to choose the right components, platforms, and packages, and to wire those together in a way that is useful and efficient. That requires a lot of knowledge and skill. In analytics, there’s a whole cottage industry of folks who can do that for you. But of course, you’re paying for all the labor and maintenance.

For the vast majority of companies, that doesn’t make sense. Assemble-to-suit addresses a very small set of use cases, typically when you’re embedding data analytics deeply into your product. Examples include online personal finance companies that rely on analytics to do their underwriting, and ride-sharing apps.

In these cases, analytics is closely tied to the customer experience. It’s a key differentiating factor, so there are some trade-offs you just can’t make. You need to be able to build something ultra customized and flexible.


Faster Time to Value

But in most business settings and scenarios, you just need to get up and running, get value out of your data quickly, and reach a positive business outcome faster.

That could still involve open source software. There are so many tools out there, and adoption is so widespread that it’s no longer a question of commercial software versus open source. It could be a combination of both.

For example, you can buy Oracle, or use MySQL. They are both complete database systems, and in many ways, functionally equivalent. Of course, if you’re in a large organization, Oracle has much more to offer. In either case though, there’s some assembly required to get to the final solution, and that may involve using some open source tools.

No platform is going to do everything on your wish list. You have to compromise or customize. It’s a question of how much of each you’re willing to do, versus how much value you can get.


Open Source Inside

The final thing to know about the state of open source today is that many platforms, including Incorta, are built on top of open source tools that are wired together in an innovative way. We started with a core of proprietary technology. And then we used a variety of open source software to wrap around and augment it, in situations where there’s no reason to reinvent the wheel.

We use Parquet as our storage layer, Spark as an execution engine, and MySQL as a metadata layer, along with Jupyter notebooks. We’re using the Facebook Prophet algorithm for time-series predictions. And the list goes on.

In the developer and IT community, name brand open source components add cachet even when baked into a commercial product. These components are proven, and most technical people know how they work. There’s a high level of trust.


Striking the Right Balance

We’ve come to a point where mainstream IT is in many cases seeing open source as a safer choice, because of the depth of expertise and the amount of use that some of these tools have. You can definitely piece together your data analytics stack, or any other system you might need, from what’s out there.

However, if you need to stand up analytics very quickly, and don’t have hyper-specialized needs or require complex integrations, you probably don’t need to do that.

The safer, more economical path is to pick a vendor that embraces open source, such as Incorta, but that provides you with a pre-integrated, tested, and supported solution. You get up and running faster, and generate value for your organization with less risk.

At the same time, you can still innovate where innovation is necessary by leveraging known and trusted open source components. For example, you could extend the Incorta platform with a new visual component written in React, the popular open source library for building user interfaces.

I think that’s the best of both worlds, and shows the true power of open source software today.