In late 2021 we kicked off a project with a customer who had shown an interest in reducing log ingestion costs and had reached out to us via LinkedIn.
Like many others, this customer had a combination of popular logging platforms in place that had been built and put together for various reasons over time. There was no single reason why; different departments, skills, tooling budgets and business needs had all influenced their position.
They finally allowed us to engage with their technical teams and asked us to prove that we could reduce spend on tools and show them how to get back control of their data. Challenge accepted.
OLD Tooling Landscape
Nothing unexpected was found: they had three different platforms, with data in other people’s clouds and on-prem. Each tool represented a technical domain and was straightforward to understand.
Datadog – Observability in production. This tool is used to ingest agent and 3rd party logs, along with other metric and trace telemetry. Pain Point – Increased on-boarding demand has meant increasing costs to ingest, retain and re-hydrate. Web, database and API logs were growing. Low retention. Not really a pain point, but everyone wants it once they see it.
Elastic – Non-production cloud and on-prem logs. All types. Pain Point – Lacked control. Elastic was being used by many developers and load testing teams. Capacity was difficult to forecast, and the environment experienced low retention and sometimes low performance. No one team owned or managed this environment.
Splunk – Security log collection and integration with cyber tools. Pain Points – Compliance and security posture meant that logging everything had become the norm. No choice but to accept growing costs to ingest and store for longer.
Note: all three tools above are absolute powerhouses in their own domains, and our team loves each of them. This POC was not about comparing features or picking holes in the customer’s capabilities, but about bringing order and control to the environment so that the customer could forecast costs. Although apica.io can support logs and metrics, the scope for this engagement covered logs only.
NEW Tooling Landscape
It is important to understand that there is no change to the level of intelligence data being delivered to the existing platforms. Adding a pipeline gives you control of the flow and the power to distinguish signal from noise. In fact, it’s a very basic feature associated with data pipelines, but we will explain later how apica.io adds many other unexpected benefits.
Stage ONE – Setup Forwarders
Setting up the environment meant we first needed to pre-configure a few things so that the logs being pumped directly from Log Sources to Logging Solutions are ‘inline’. This just means the apica.io platform will forward everything as is. Within the tool, we add forwarders, which are the vendor APIs to push logs to. As you can see below, we added Elastic, Datadog, Splunk On-Prem and Splunk Cloud (API credentials + location/index).
** Instastore is the out-of-the-box storage layer within Apica, backed by storage that the customer provides (bring your own storage). Anything that passes from source to forwarder in the above diagram will be stored here indefinitely. This crucial component means that a raw copy of all data being passed from agents/beats/hubs to the corresponding vendor’s cloud is now in the customer’s cloud too. While Datadog or Elastic roll over their data based on retention rules, apica.io never rolls over.
Stage TWO – Configure / Re-point Sources
Each log source will require minor configuration changes to redirect its output towards Apica. For example, filebeat will require an http-output-plugin to be configured. We approached this server by server and cloud by cloud, and baselined normal hourly/daily ingest volumes. Although this project was focused on three vendors, there are hundreds of other integrations available.
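To make the baselining step concrete, here is a minimal Python sketch of how hourly ingest volumes could be counted from log timestamps. It is an illustration only, not the tooling used on the project: the function name hourly_baseline and the sample timestamps are made up for the example, and it assumes timestamps arrive as ISO 8601 strings.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def hourly_baseline(timestamps):
    """Count events per hour bucket and return (counts, mean events per hour).

    Hypothetical helper: assumes ISO 8601 timestamp strings.
    """
    per_hour = Counter(
        datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        for ts in timestamps
    )
    return per_hour, mean(per_hour.values())


# Made-up sample data standing in for one source's log timestamps.
sample = [
    "2021-11-01T09:05:00",
    "2021-11-01T09:40:00",
    "2021-11-01T10:15:00",
    "2021-11-01T10:20:00",
    "2021-11-01T10:59:59",
    "2021-11-01T11:30:00",
]
per_hour, average = hourly_baseline(sample)
for hour, count in sorted(per_hour.items()):
    print(hour.isoformat(), count)
print("mean events per hour:", average)
```

Running the same count per source before and after the re-point gives a like-for-like figure to compare against.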
Stage THREE – Apply Mapping
Once a forwarder is configured and in place, and the agents/beats are primed ready to be restarted, a mapping is required to act as the middleman between source and forwarder. The mapping holds the filtering rules and allows you to apply enhanced decision making as to which log lines are forwarded and which are not. Nothing is dropped. When we first introduce a mapping, we apply no rules other than forward everything, and then use this to compare log lines in apica.io vs the vendor platform.
IMPORTANT When making changes to Splunk, Datadog, Elastic Beats or event hubs, no data is lost during this process unless logs roll over at the source while the change is being made. An agent restart is required. Logs still exist on the source, and any gap in ingestion is taken care of using watermark/timestamp features (standard practice). We tested all configuration on non-critical agents before applying anything in production. This helps clarify what changes are required, how to automate them and that the plumbing is in place.
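The watermark idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular agent implements it (real agents often track file offsets rather than timestamps). The name ship_since and the sample entries are hypothetical.

```python
from datetime import datetime


def ship_since(entries, watermark, send):
    """Send entries newer than the watermark and return the updated watermark.

    Hypothetical helper. Entries are (timestamp, message) pairs. Entries whose
    timestamp equals the watermark are treated as already shipped.
    """
    newest = watermark
    for ts, message in sorted(entries):
        if ts > watermark:
            send(message)
            newest = max(newest, ts)
    return newest


# Made-up log entries still present on the source after an agent restart.
entries = [
    (datetime(2021, 11, 1, 9, 0), "before the change"),
    (datetime(2021, 11, 1, 9, 5), "written during the restart"),
    (datetime(2021, 11, 1, 9, 6), "written after the restart"),
]
shipped = []
watermark = ship_since(entries, datetime(2021, 11, 1, 9, 0), shipped.append)

# A second pass with the updated watermark sends nothing new.
second_pass = []
ship_since(entries, watermark, second_pass.append)
```

Because the source keeps its logs until they roll over, the agent can pick up from the last shipped point and fill the gap.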
Before making further changes we had to confirm that 100% of expected log lines were present at the source, in apica.io and in the vendor platforms (forwarders). We allowed this to run for 7-10 days just to give operational confidence that the pipeline introduces zero difference. (Remember, those logging vendors will never know a pipeline is in the middle!)
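A check of this kind could look something like the Python sketch below. It assumes the log lines from each location have already been exported into plain lists; the function name reconcile and the sample lines are hypothetical, and a real comparison would work on exports or counts from each platform.

```python
from collections import Counter


def reconcile(source_lines, pipeline_lines, vendor_lines):
    """Report lines present at source but missing downstream.

    Hypothetical helper. Uses multiset subtraction, so duplicate lines are
    counted rather than collapsed.
    """
    source = Counter(source_lines)
    return {
        "missing_in_pipeline": source - Counter(pipeline_lines),
        "missing_in_vendor": source - Counter(vendor_lines),
    }


# Made-up sample: the vendor platform is missing one copy of a repeated line.
source = ["GET /health 200", "POST /login 401", "POST /login 401"]
pipeline = ["GET /health 200", "POST /login 401", "POST /login 401"]
vendor = ["GET /health 200", "POST /login 401"]

result = reconcile(source, pipeline, vendor)
print(result)
```

An empty result for both keys is the "zero difference" outcome the soak period is looking for.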
Next, the default mappings that we applied require tailoring / fine tuning with the customer. They tell us which lines to forward and which lines to keep behind (not lose or drop), and this is best achieved by walking through source by source. Basic options can be things like specific log levels, or we can be more granular and apply line-by-line text filters.
The image below was a result of applying a simple mapping to NOT FORWARD any log lines that contain the attribute loglevel=debug OR loglevel=none. This was applied for Elastic in the non-production environments (which, combined, are larger than production), where we observed that 46% of log lines matched the criteria. As a result, Elastic ingested 46% fewer log line events while apica.io stored 100% of the data (including debug and none), so nothing is lost. If we need to replay that data later, we can.
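The logic of such a mapping can be sketched in Python as follows. This is not apica.io’s mapping syntax, which is configured inside the product; the names apply_mapping and WITHHELD_LEVELS and the sample records are hypothetical and only illustrate the store-everything, forward-selectively behaviour.

```python
# Log levels that are retained in storage but not forwarded to the vendor.
WITHHELD_LEVELS = {"debug", "none"}


def apply_mapping(records, store, forward, withheld=WITHHELD_LEVELS):
    """Store every record, forward only those whose level is not withheld.

    Hypothetical helper. Records without a loglevel attribute are forwarded.
    Returns the number of records forwarded.
    """
    forwarded = 0
    for record in records:
        store(record)  # nothing is dropped: every record is retained
        level = str(record.get("loglevel", "")).lower()
        if level not in withheld:
            forward(record)
            forwarded += 1
    return forwarded


# Made-up sample records.
records = [
    {"loglevel": "info", "msg": "user login"},
    {"loglevel": "debug", "msg": "cache probe"},
    {"loglevel": "none", "msg": "heartbeat"},
    {"loglevel": "error", "msg": "db timeout"},
]
stored, sent = [], []
forwarded = apply_mapping(records, stored.append, sent.append)
reduction = 1 - forwarded / len(records)
print(f"stored={len(stored)} forwarded={forwarded} reduction={reduction:.0%}")
```

The storage call sits before the filter on purpose: the decision about what to forward never affects what is retained.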
As we applied more and more mappings over time to both Splunk (syslog, firewall, endpoint, cloud NSG) and Datadog (Azure cloud, Event Hubs, Datadog Agent web + API + DB logs), the EPS (Events Per Second) started to drop overall by ~90% (it fluctuates by hour of day). The problem with agent-to-platform solutions is that you don’t know what you’ve collected until it’s there. In the case of the Elastic non-production environments, ingestion was very sporadic: developers and testers left high-volume log settings on for periods of investigation and forgot to turn them off. In some cases we found logging set to debug level in production; it had simply been overlooked. apica.io gives you that layer of protection.
We continue to apply further mappings and get more acquainted with priority content across the environment. The customer has retained every single log line within apica.io and has seen lower ingest volumes for all three key logging platforms. They are in a much better position in terms of cost control, are more compliant in terms of data retention (or will be) and report an improved user experience using the tools.
It doesn’t stop here – as promised, some more hidden benefits
Now that the basic use cases have been answered, we are starting to explore some of the other brilliant features within apica.io:
> Data Exchange / Intelligence Sharing
Forward data from Datadog to Splunk (login failures, DNS failures, brute force attempts, HTTP Methods)
> Ingesting Metrics The customer is now starting to onboard metrics that would previously have been sent directly to Datadog.
> Widen On-boarding With more available capacity, the customer is now looking to broaden on-boarding to more environments and business units where capacity was previously the main blocker. Each new service presents a smaller logging cost/footprint.
> Scale and Performance The only limits you will (might) encounter are subscription and regional limits with your hosting providers for the S3 storage (possibly 2TB per day). The weight is truly taken away from the more expensive logging tools and they do what they need to do with less horsepower. apica.io scales automatically and is a zero touch solution. Decrease ingest, it will decrease compute. Increase ingest, it will increase compute – it’s all well hidden behind the scenes. There has so far been no need to add indexes or tune performance at any part of the data pipeline, whereas before they had to do a lot of housekeeping. Another nice bonus: a simple tool that is only told once what to do and never needs touching again.
> Unified Reporting Using tools like PowerBI or Tableau, they have started to use the apica.io API to report on whole datasets in just one solution. They no longer need to maintain separate connections to each logging platform. Furthermore, apica.io Instastore is easy to query and uses open data formatting to pull large extracts. Each query is executed in milliseconds and does not end up in a queue somewhere. Whether you want 5 lines or 5M lines, the response is linear due to multi-dimensional indexing. It really is a game changer: faster execution means more time for analysis, so slice and dice until the cows come home.
> Parallel running critical systems Now that apica.io is in place, you have the ability to replay ANY data to ANY forwarder ANY time you need to, and at no extra cost on apica.io licensing. This means that where the customer may have been nervous in the past about swapping out a legacy SIEM solution for new modern technology (risk, cost, capabilities), they now have the luxury of comparing both in real-time (also Datadog vs Elastic vs something else).
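As a rough illustration of what a replay involves, the Python sketch below selects stored records by time window and an optional filter, then pushes them to a chosen forwarder. It is not the apica.io replay API. The function name replay, the record layout and the sample data are all hypothetical.

```python
from datetime import datetime


def replay(stored, start, end, forward, keep=lambda record: True):
    """Forward stored records with start <= ts < end that pass the filter.

    Hypothetical helper. Returns the number of records replayed.
    """
    replayed = 0
    for record in stored:
        if start <= record["ts"] < end and keep(record):
            forward(record)
            replayed += 1
    return replayed


# Made-up records standing in for the retained raw copy.
stored = [
    {"ts": datetime(2022, 1, 10, 8, 0), "loglevel": "debug", "msg": "cache probe"},
    {"ts": datetime(2022, 1, 10, 9, 0), "loglevel": "error", "msg": "db timeout"},
    {"ts": datetime(2022, 1, 11, 9, 0), "loglevel": "debug", "msg": "retry loop"},
]

# Replay one day of previously withheld debug lines to a candidate platform.
candidate_platform = []
count = replay(
    stored,
    datetime(2022, 1, 10),
    datetime(2022, 1, 11),
    candidate_platform.append,
    keep=lambda record: record["loglevel"] == "debug",
)
print(count, candidate_platform)
```

The same selection could be pointed at two forwarders in turn, which is what makes a side-by-side comparison of platforms possible.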
Thank you for reading. If you require more information on the above or would like to discuss how we can help your organisation take back control of your data and costs: fire over an email to [email protected].
Editor’s note: If you’d like to try out the apica.io platform and see how it can better control, optimise, route, and operationalise your enterprise machine data, sign up for a FREE trial. You could also reach out to [email protected] for a quick demo of the platform.