Mastering Data Onboarding with Splunk: Best Practices and Approaches

by Pavlo Poliakov


In today’s data-driven landscape, organisations are realising the significance of observability and its potential to drive competitive advantage. According to Gartner, by 2026, 70% of successful observability implementations will lead to reduced decision latency, empowering businesses to make informed choices and outperform their peers.

Observability is defined as a concept, a goal, and a direction that helps an organisation gain the most insight from the data it collects or onboards. Splunk is a tool renowned for its advanced data analytics capabilities, providing a robust platform for collecting, indexing, and analysing vast amounts of data from diverse sources.

However, to fully leverage the benefits of Splunk’s observability capabilities, effective data management and onboarding practices are essential. Data is the lifeblood of any observability solution, and managing it efficiently is crucial for accurate analysis, proactive monitoring, and informed decision-making.

Here, we will share some best practices and approaches for effectively onboarding data within the Splunk ecosystem to enable you to harness the full power of your data assets.

Onboarding data – define the scope:

It sounds simple and obvious, but defining the scope of use cases for your data can be challenging. You will need to answer questions such as “What is the overall goal of onboarding data into Splunk?”, “What data sources are needed to create content that meets the overall goal?” and so on. The answers could relate to specific security, IT operations or business use cases, and it is good practice to complete some form of checklist before proceeding with data onboarding.

Naming convention for your data:

It is good practice to apply a naming convention to the Splunk metadata fields (host, sourcetype, index) of your datasets in order to simplify data analysis and the management of configuration files. For instance, Splunk recommends using this scheme for each sourcetype: vendor:product:technology:format

and this one for each index: <companyname>_<purpose>_<sensitivity>_<summary>.
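As a purely illustrative sketch (the vendor, product and company names below are invented for this example rather than Splunk standards, and we assume the final index segment distinguishes summary from non-summary data), a syslog feed from a perimeter firewall could be onboarded as:

    sourcetype = acme:ngfw:firewall:syslog
    index = examplecorp_security_confidential_nonsummary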

Data collection:

This is an important step from an architectural perspective. The best practice for designing data onboarding is to use the Splunk Validated Architectures. In general, Splunk collection methods can be categorised as local, remote and network listening.

Splunk Data Collection Topology Overview (see more details here)

  • Local Collection

Local Collection is performed directly from the server via the local agent. The data is then transmitted to the Splunk Indexer peers.

  • It is good practice to use the Splunk Universal Forwarder as the local agent but, depending on the use case, the Splunk Heavy Forwarder or Cribl Edge (an agent of a third-party data management platform) can be used instead.
  • Local agent collectors control data ingestion via configurations. It is recommended to use a central management point, such as the Splunk Deployment Server, to commit changes to the configuration files. In addition, it is worth using a version control system such as Git to track the lifecycle of these changes (the Cribl data management platform, for instance, has this feature built in).
  • Collect data accurately by monitoring files/directories; capturing network events, printer, interface and host information; running scripts; capturing logs from the Windows Event Log; collecting data directly from any Perfmon counters defined on the Windows system; and collecting registry and Active Directory data on the Windows system (https://lantern.splunk.com/Security/UCE/Ingest_data).
  • If you are using a syslog server, install the Splunk Universal Forwarder locally on it and monitor the directories/files that the syslog server writes to.
  • Use out-of-the-box config files within Splunk add-ons to simplify collection (https://splunkbase.splunk.com/).
  • Run the Splunk Universal Forwarder as a non-privileged user.
  • It is highly recommended to use SSL/TLS encryption when forwarding data over TCP (https://docs.splunk.com/Documentation/Splunk/latest/Security/AboutsecuringyourSplunkconfigurationwithSSL).
  • Splunk features such as indexer discovery, indexer acknowledgement and load balancing should be used to ensure reliable and consistent data onboarding.
  • Change the default bandwidth limit of 256 KBps to suit your needs (0 means unlimited; see the maxKBps setting in limits.conf).
  • For high volume data sources, it is recommended to increase the parallelIngestionPipelines setting to 2 in the Splunk environment.
  • To avoid data loss when experiencing network issues, it is recommended to use Splunk’s caching capabilities, such as persistent queues, at the data source.
  • Filter data before sending it to the Splunk indexers. Use whitelists/blacklists within the inputs configuration files on the Splunk Universal Forwarder, or use the Cribl data management platform for more flexible filtering and routing of collected data (see the configuration sketch after this list).
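To make these recommendations concrete, here is a minimal Universal Forwarder configuration sketch for local collection. All hostnames, ports, paths, index and sourcetype names are placeholders, and the certificate settings assume you have already provisioned certificates; treat it as a starting point, not a drop-in configuration.

    # inputs.conf - monitor a directory and filter at the source (paths and names are illustrative)
    [monitor:///var/log/exampleapp]
    index = examplecorp_app_internal_nonsummary
    sourcetype = acme:exampleapp:server:log
    whitelist = \.log$
    blacklist = debug

    # outputs.conf - forward over SSL/TLS with indexer acknowledgement and load balancing
    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = idx1.example.com:9997, idx2.example.com:9997
    useACK = true
    clientCert = $SPLUNK_HOME/etc/auth/client.pem
    sslVerifyServerCert = true

    # limits.conf - raise the default 256 KBps thruput limit (0 = unlimited)
    [thruput]
    maxKBps = 1024

    # server.conf - add a second ingestion pipeline for high-volume hosts
    [general]
    parallelIngestionPipelines = 2

With indexer discovery in use, the static server list above would be replaced by a reference to the cluster manager, but a static list keeps the sketch simple.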
  • Remote Collection

Remote collection is used when local collection is not allowed or supported. This is where Splunk reaches out and collects data from a remote data source.

  • It is good practice to use a Splunk Heavy Forwarder to collect data via the HTTP Event Collector (HEC), which allows a data source to send data to Splunk over HTTP(S).
  • Enable indexer acknowledgement in HEC for reliable, consistent data onboarding (see the sketch after this list).
  • For pulling data from databases, use Splunk DB Connect.
  • For collecting data from Kafka, use Splunk Connect for Kafka.
  • For collecting data from Kubernetes, use the Splunk OpenTelemetry Collector for Kubernetes.
  • Use the Splunk Add-on Builder to configure custom data ingestion using the REST API or a modular input with your own Python script.
  • For high volume data sources, it is recommended to increase the parallelIngestionPipelines setting to 2 in the Splunk environment.
  • Filter the data according to your use cases before sending it to the Splunk indexers, and/or route a copy of the raw data to third-party storage (for example, S3). This could be achieved by using the Splunk ingest actions feature on the Heavy Forwarder or by using the Cribl data management platform for more flexible data processing.
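As a sketch of the HEC recommendations above, the configuration below defines a token with indexer acknowledgement enabled on a Heavy Forwarder, followed by an example client request. The token value, hostnames, index and sourcetype are placeholders; in practice tokens are usually generated through the UI.

    # inputs.conf on the Heavy Forwarder - enable HEC and define a token with acknowledgement
    [http]
    disabled = 0
    enableSSL = 1

    [http://exampleapp_hec]
    disabled = 0
    token = 00000000-0000-0000-0000-000000000000
    index = examplecorp_app_internal_nonsummary
    sourcetype = acme:exampleapp:server:json
    useACK = true

    # Example client request; with useACK enabled the client must supply a channel GUID
    curl -k https://hf.example.com:8088/services/collector/event \
      -H "Authorization: Splunk 00000000-0000-0000-0000-000000000000" \
      -H "X-Splunk-Request-Channel: 11111111-1111-1111-1111-111111111111" \
      -d '{"event": {"message": "hello from exampleapp"}}'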
  • Network Listening

Network listening is a data collection method used by Splunk to collect data from network traffic or uncontrolled sources using various network protocols.

  • Use the Splunk Stream app as a network sniffer to analyse network traffic, capture NetFlow data or ingest pcap files.
  • If you want to send data from a TCP or UDP source such as the syslog service, use the Universal Forwarder to listen on the relevant port and forward the data to your Splunk indexer peers (see the sketch after this list).
  • If there is no way to collect the data other than having it sent directly over TCP or UDP, configure a load balancer to listen on the TCP or UDP port and forward the traffic to Splunk Heavy Forwarders (at least two for high availability) that listen on a network port, or to Cribl data management platform components.
  • Use Splunk ingest actions or Cribl Stream to filter the data according to your use cases before sending it to the Splunk indexers. Also route a copy of the raw data to offsite storage (for example, S3) for future needs.
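As a sketch of the network-listening scenario, the stanzas below show a forwarder listening for syslog on TCP and UDP ports. The port, index and sourcetype values are illustrative; note that binding to ports below 1024 requires elevated privileges, which conflicts with running the forwarder as a non-privileged user, so a higher port behind a load balancer or a port redirect is usually preferable.

    # inputs.conf on the forwarder receiving syslog traffic (values are illustrative)
    [udp://1514]
    index = examplecorp_network_internal_nonsummary
    sourcetype = acme:syslog:network:syslog
    connection_host = ip
    no_appending_timestamp = true

    [tcp://1514]
    index = examplecorp_network_internal_nonsummary
    sourcetype = acme:syslog:network:syslog
    connection_host = ip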

Data source configuration:

Splunk’s approach to data at scale is to ingest data from anywhere and ask any question at any time. However, when a data source is ingested with default configurations, Splunk spends a lot of time and processing power guessing the right settings for each event before indexing it. This is because Splunk doesn’t know the specifics of your data and uses different approaches to prepare your data for analysis. If you want to know more about how Splunk processes data, please refer to this diagram.

Tell Splunk more about your data. It is good practice to first onboard a sample of your data into a test environment, where you can check that events are parsed correctly. Note that there are eight main sourcetype configuration parameters, known as the “Magic 8”, that need to be set when setting up a new data source (see props.conf; more details here).

  • SHOULD_LINEMERGE – combines several lines of data into a single multi-line event. Defaults to true; it is recommended to change it to false for single-line events.
  • TRUNCATE – defines the maximum number of characters per line; once this limit is reached, the excess characters are dropped.
  • LINE_BREAKER – determines how the raw text stream is broken into initial events. Default is the regex for new lines, ([\r\n]+).
  • EVENT_BREAKER_ENABLE – enables event breaking on Universal Forwarders; set it to true.
  • EVENT_BREAKER – the event-breaking regular expression used by Universal Forwarders; it should match the value of LINE_BREAKER.
  • TIME_PREFIX – a regular expression matching the text that immediately precedes the timestamp. No default is set.
  • MAX_TIMESTAMP_LOOKAHEAD – Specifies how many characters after the TIME_PREFIX pattern Splunk software should look for a timestamp.
  • TIME_FORMAT – specifies a strptime format string used to extract the timestamp. No default is set.

Setting only these parameters correctly will significantly improve data onboarding performance (more details here).
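As an illustration of the “Magic 8”, here is a props.conf sketch for a hypothetical single-line source whose events begin with an ISO 8601 timestamp. The sourcetype name and timestamp format are assumptions; adjust them to match your actual data.

    # props.conf - illustrative sourcetype stanza covering the "Magic 8" parameters
    [acme:exampleapp:server:log]
    SHOULD_LINEMERGE = false
    LINE_BREAKER = ([\r\n]+)
    TRUNCATE = 10000
    EVENT_BREAKER_ENABLE = true
    EVENT_BREAKER = ([\r\n]+)
    TIME_PREFIX = ^
    TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%z
    MAX_TIMESTAMP_LOOKAHEAD = 30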

Other good practices include:

  • Use the Splunk Common Information Model (CIM) to normalise your data to a common standard. This allows you to use the same field names consistently across all data sources and is mandatory when using an application such as Splunk Enterprise Security or Splunk IT Service Intelligence. The SA-cim_vladiator application simplifies this process (a minimal sketch follows this list).
  • Ingest dissimilar data with different sourcetypes into separate indexes. This is especially important if the data distribution is uneven, for example when one of the data sources generates a very large number of events (such as a network firewall).
  • Use an NTP server as a common source of time synchronisation for all data sources and Splunk components.
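To illustrate the CIM point above, the sketch below aliases vendor field names to CIM field names and tags the events so that they are picked up by the CIM Network Traffic data model. The sourcetype, eventtype and original field names are hypothetical.

    # props.conf - alias vendor fields to CIM field names (hypothetical sourcetype and fields)
    [acme:ngfw:firewall:syslog]
    FIELDALIAS-cim_src = src_ip AS src
    FIELDALIAS-cim_dest = dst_ip AS dest

    # eventtypes.conf - define an eventtype covering the firewall traffic events
    [acme_firewall_traffic]
    search = sourcetype=acme:ngfw:firewall:syslog

    # tags.conf - tag the eventtype so the Network Traffic data model includes it
    [eventtype=acme_firewall_traffic]
    network = enabled
    communicate = enabled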

Post onboarding:

Once the data has been successfully onboarded, you can use it for analysis within your use cases. However, some additional steps are worthwhile, such as:

  • Using the SPL “addinfo” command in a custom search to track heartbeats from hosts, forwarders, tcpin_connections on indexers, or any number of system components.
  • Alerting when there is no data for a particular index (more details here; a sample search follows this list).
  • Configuring forwarder monitoring in the Splunk Monitoring Console. This allows you to check the health of the forwarders that deliver your data sources.
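As a sketch of the “no data” alert above, the SPL below reports indexes that have not received events recently. The index pattern and the 60-minute threshold are assumptions to adapt to your environment; saved as a scheduled alert, it can notify you when a feed goes quiet.

    | tstats latest(_time) as latest_event where index=examplecorp_* by index
    | eval minutes_since_last_event = round((now() - latest_event) / 60, 0)
    | where minutes_since_last_event > 60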

Plan before you act: design your use cases first, define the data sources those use cases require, choose the optimal ingestion methods, learn about your data in a test environment, apply a naming convention policy, avoid default settings when configuring sourcetypes, and filter data before indexing it in Splunk. Following all these steps will reduce license usage, increase search performance, and make data analysis more efficient.

Splunk is a powerful platform that lets you ingest terabytes of data for analysis. However, onboard only the data you need for your use cases and route the rest to external storage as needed. Use the Cribl data processing platform for more flexible data management.

If you would like any help maximising the value of your Splunk instance, we can help! Contact us for a free consultation.
