The Philosophy of the Federal Cyber Data Lake (CDL): A Thought Leadership Approach

jasonwpayne · Feb 08 2024

Pursuant to Section 8 of Executive Order (EO) 14028, "Improving the Nation’s Cybersecurity", Federal Chief Information Officers (CIOs) and Chief Information Security Officers (CISOs) aim to comply with the U.S. Office of Management and Budget (OMB) Memorandum 21-31, which centers on system logs for services both within authorization boundaries and deployed on Cloud Service Offerings (CSOs). This memorandum not only instructs Federal agencies to provide clear guidelines for service providers but also offers comprehensive recommendations on logging, retention, and management to increase the Government’s visibility before, during and after a cybersecurity incident. Additionally, OMB Memorandum 22-09, "Moving the U.S. Government Toward Zero Trust Cybersecurity Principles", references M-21-31 in its Section 3.

While planning to address and execute these requirements, Federal CIOs and CISOs should explore the use of a Cyber Data Lake (CDL). A CDL is a capability to assimilate and house vast quantities of security data, whether in its raw form or as derivatives of original logs. Thanks to its adaptable, scalable design, a CDL can encompass data of any nature, be it structured, semi-structured, or unstructured, all without compromising quality. This article examines the philosophy behind the Federal CDL, exploring topics such as:

The Importance of CDL for Agency Missions and Business
Strategy and Approach
CDL Infrastructure
Application of CDL

The Importance of CDL for Agency Missions and Business

The overall reduction in both capital and operational expenditures for hardware and software, combined with enhanced data management capabilities, makes CDLs an economically viable solution for organizations looking to optimize their data handling and security strategies. CDLs are cost-effective due to their ability to consolidate various data types and sources into a single platform, eliminating the need for multiple, specialized data management tools. This consolidation reduces infrastructure and maintenance costs significantly. CDLs also adapt easily to increasing data volumes, allowing for scalable storage solutions without the need for expensive infrastructure upgrades. By enabling advanced analytics and efficient data processing, they reduce the time and resources needed for data analysis, further cutting operational costs. Additionally, improved accuracy in threat detection and reduction in false positives lead to more efficient security operations, minimizing the expenses associated with responding to erroneous alerts and increasing the speed of detection and remediation.

However, CDLs are not without challenges. As technological advancements and the big data paradigm evolve, the complexity of network, enterprise, and system architecture escalates. This complexity is further exacerbated by the integration of tools from various vendors into the Federal ecosystem, managed by diverse internal and external teams. For security professionals, keeping pace with this intricate environment and achieving real-time transparency into technological activities is becoming an uphill battle. These professionals require a dependable, near-real-time source that supports the National Institute of Standards and Technology (NIST) Cybersecurity Framework core functions: identify, protect, detect, respond, and recover. Such a source empowers them to strategize, prioritize, and address any anomalies or shifts in their security posture. The present challenge lies in acquiring a holistic view of security risk, especially when large agencies might deploy hundreds of applications across the US and, in some cases, globally. The security data logs, scattered across these applications, clouds, and environments, often exhibit conflicting classifications or categorizations. Further complicating matters, logging maturity differs across the cloud service models: infrastructure, platform, and software as a service.

It is vital to scrutinize any irregularities to ensure the environment is secure, aligning with zero-trust principles, which advocate a dual approach: never automatically trust, and always operate under the assumption that breaches may occur. As security breaches become more frequent and advanced, malicious entities will employ machine learning to pinpoint vulnerabilities across an expansive threat landscape. Defenders, in turn, can use artificial intelligence built on machine learning and large language models to discover and adapt to changing risk environments, allowing security professionals to do more with less.

Strategy and Approach

The optimal approach to managing a CDL depends on several variables, including leadership, staff, services, governance, infrastructure, budget, maturity, and other factors spanning all agencies. It is debatable whether a centralized IT team can cater to the diverse needs and unique challenges of every agency. We are seeing a shift where departments are integrating multi-cloud infrastructure into their ecosystem to support the mission. An effective department strategy is pivotal for success, commencing with systems under the Federal Information Security Modernization Act (FISMA) and affiliated technological environments. Though there may be challenges at the departmental level in a federated setting, it often proves a more effective strategy than a checklist approach.

Regarding which logs to prioritize, there are several methods. CISA has published a guide on how to prioritize deployment: Guidance for Implementing M-21-31: Improving the Federal Government's Investigative and Remediation .... Some might opt to begin with network-level logs, followed by enterprise and then system logs. Others might prioritize logs from high-value assets based on FISMA's security categorization, from high to moderate to low. Some might start with systems that can provide logs most effortlessly, allowing them to accumulate best practices and insights before moving on to more intricate systems.
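The prioritization options above can be combined into a single ordering. The following is a minimal sketch, assuming a hypothetical inventory in which each system carries an impact level (high, moderate, or low), a high-value-asset flag, and a made-up 1-to-5 onboarding effort score. The system names and the effort scale are illustrative assumptions, not part of any CISA or OMB guidance.

```python
from dataclasses import dataclass

# Onboarding order for impact levels: high first, then moderate, then low.
IMPACT_RANK = {"high": 0, "moderate": 1, "low": 2}


@dataclass(frozen=True)
class LogSource:
    name: str               # hypothetical system name
    impact: str             # "high", "moderate", or "low"
    high_value_asset: bool  # True if the system is designated a high-value asset
    onboarding_effort: int  # 1 (easy) to 5 (hard); an illustrative scale


def onboarding_order(sources):
    """Sort high-value assets first, then by impact level, then easiest first."""
    return sorted(
        sources,
        key=lambda s: (not s.high_value_asset, IMPACT_RANK[s.impact], s.onboarding_effort),
    )


# Illustrative inventory; every entry is made up.
SOURCES = [
    LogSource("public-website", "low", False, 1),
    LogSource("hr-portal", "moderate", False, 2),
    LogSource("identity-provider", "high", True, 4),
    LogSource("case-management", "high", False, 3),
]

if __name__ == "__main__":
    for source in onboarding_order(SOURCES):
        print(source.name, source.impact)
```

An agency that prefers to start with the easiest systems could move `onboarding_effort` to the front of the sort key; the point of the sketch is only that the ordering criteria can be made explicit and repeatable.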

Efficiently performing analysis, enforcement, and operations across data repositories dispersed across multiple cloud locations in a departmental setting involves adopting a range of strategies. This includes data integration and aggregation, cross-cloud compatibility, API-based connectivity, metadata management, cloud orchestration, data virtualization, and the use of cloud-agnostic tools to ensure seamless data interaction. Security and compliance should be maintained consistently, while monitoring, analytics, machine learning, and AI tools can enhance visibility and automate processes. Cost optimization and ongoing evaluation are crucial, as is investing in training and skill development. By implementing these strategies, departments can effectively manage their multi-cloud infrastructure, ensuring data is accessible, secure, and cost-effective, while also leveraging advanced technologies for analysis and operations.

CDL Infrastructure

One of the significant challenges is determining how a CDL aligns with an agency's structure. The decision between a centralized, federated, or hybrid approach arises, with cost considerations being paramount. Ingesting logs in their original form into a centralized CDL comes with its own set of challenges, including accuracy, privacy, cost, and ownership. Employing a formatting tool can lead to substantial cost savings in the extract, transform, and load (ETL) process. Several agencies have experienced cost reductions of up to 90% and significant data size reductions by incorporating formatting in tables, which can be reorganized as needed during the investigation phase. A federated approach means the logs remain in place, analyses are conducted locally, and the results are then forwarded to a centralized CDL for further evaluation and dissemination.
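To make the federated pattern concrete, here is a minimal sketch in which each bureau analyzes its own logs and forwards only a small summary to the central CDL. The event fields (`account`, `outcome`), the sample events, and the choice of failed-login counts as the analysis are assumptions made for illustration.

```python
from collections import Counter


def summarize_locally(raw_events):
    """Run at the bureau: count failed logins per account. Raw logs stay in place."""
    return Counter(e["account"] for e in raw_events if e["outcome"] == "failure")


def merge_centrally(summaries):
    """Run at the central CDL: combine the per-bureau summaries."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


# Hypothetical raw events held by two bureaus.
BUREAU_A = [
    {"account": "svc-backup", "outcome": "failure"},
    {"account": "svc-backup", "outcome": "failure"},
    {"account": "jdoe", "outcome": "success"},
]
BUREAU_B = [
    {"account": "svc-backup", "outcome": "failure"},
    {"account": "asmith", "outcome": "failure"},
]

if __name__ == "__main__":
    merged = merge_centrally([summarize_locally(BUREAU_A), summarize_locally(BUREAU_B)])
    print(merged.most_common())
```

Only the counters cross the boundary between a bureau and the central CDL, which is what keeps transfer volume, and questions of privacy and ownership over the raw logs, contained at the local level.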

For larger and more complex agencies, a multi-tier CDL might be suitable. By implementing data collection rules (DCR), data can be categorized during the collection process, with department-specific information directed to the respective department's CDL, while still ensuring that high-value and timely logs are forwarded to a centralized CDL at the agency level, prioritizing privileged accounts. Each operating division or bureau could establish its own CDL, reporting up to the agency headquarters' CDL, which would in turn report to DHS. The agency's Office of Inspector General (OIG) or a statistical component of a department may need to create a separate CDL to preserve its independence. In contrast, smaller agencies might only need a single CDL. This could integrate with the existing Cloud Log Aggregation Warehouse (CLAW), a CISA-deployed architecture for collecting and aggregating security telemetry data from agencies using commercial CSP services, and align with the National Cybersecurity Protection System (NCPS) Cloud Interface Reference Architecture. This program ensures that security data from cloud-based traffic is captured and analyzed, and it enables CISA analysts to maintain situational awareness and provide support to agencies.
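The multi-tier routing described above can be sketched as a small rule function. This is an illustration of the idea only, not the data collection rule schema of any particular product. The destination names, the record fields (`bureau`, `privileged`, `severity`), and the forwarding criteria are all assumptions.

```python
def route(record):
    """Return the CDL destinations for one log record.

    Every record goes to its own bureau's CDL. Records that involve a
    privileged account or carry high severity are also forwarded to the
    agency headquarters CDL (hypothetical criteria).
    """
    destinations = [f"cdl-{record['bureau']}"]
    if record.get("privileged") or record.get("severity") == "high":
        destinations.append("cdl-agency-hq")
    return destinations


if __name__ == "__main__":
    admin_login = {"bureau": "bureau-a", "privileged": True, "severity": "low"}
    routine_event = {"bureau": "bureau-b", "privileged": False, "severity": "low"}
    print(route(admin_login))    # bureau CDL plus agency HQ CDL
    print(route(routine_event))  # bureau CDL only
```

Classifying at collection time in this way keeps most volume at the bureau tier while the agency tier still receives the records it cares about most.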

If data is consolidated in a central, monolithic CDL, stringent data stewardship is crucial, especially concerning data segmentation, access controls, and classification. Data segmentation provides granular access control based on a need-to-know approach, with mechanisms such as encryption, authorization, access audits, firewalls, and tagging. If constructed correctly, this can eliminate the need for separate CDL infrastructures for independent organizations. The segmentation scheme should be compatible with role-based user access schemes, segment data based on sensitivity or criticality, and meet Federal authentication standards. This supports Zero Trust initiatives in Federal agencies and aligns with Federal cybersecurity regulations, data privacy laws, and current TLS encryption standards. Data must also adhere to the retention standards outlined in OMB M-21-31 Appendix C and the latest National Archives and Records Administration (NARA) publications, and comply with Data Loss Prevention requirements, covering data at rest, in transit, and at endpoints, in line with NIST SP 800-53 Revision 5.
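As an illustration of tag-based, need-to-know segmentation inside a single CDL, the sketch below filters records by a segment tag and a sensitivity tag. The tag names, the three-level sensitivity scale, and the analyst roles are assumptions chosen for the example; a real deployment would enforce this in the platform's own access control layer rather than in application code.

```python
from dataclasses import dataclass

# Illustrative sensitivity scale: a higher rank means more restricted.
SENSITIVITY_RANK = {"low": 0, "moderate": 1, "high": 2}


@dataclass(frozen=True)
class Analyst:
    name: str
    max_sensitivity: str  # highest sensitivity tag this role may read
    segments: frozenset   # segments this role has a need to know


def can_read(analyst, record):
    """Allow access only if both the segment and the sensitivity tags permit it."""
    return (
        record["segment"] in analyst.segments
        and SENSITIVITY_RANK[record["sensitivity"]] <= SENSITIVITY_RANK[analyst.max_sensitivity]
    )


def visible_records(analyst, records):
    """Return the subset of records the analyst is allowed to see."""
    return [r for r in records if can_read(analyst, r)]


# Hypothetical tagged records living in one shared CDL.
RECORDS = [
    {"id": 1, "segment": "soc", "sensitivity": "moderate"},
    {"id": 2, "segment": "oig", "sensitivity": "high"},
    {"id": 3, "segment": "soc", "sensitivity": "high"},
]

SOC_ANALYST = Analyst("soc-tier1", "moderate", frozenset({"soc"}))
OIG_INVESTIGATOR = Analyst("oig-investigator", "high", frozenset({"oig"}))

if __name__ == "__main__":
    print([r["id"] for r in visible_records(SOC_ANALYST, RECORDS)])
    print([r["id"] for r in visible_records(OIG_INVESTIGATOR, RECORDS)])
```

In this sketch the OIG's records share the same store as SOC data yet remain invisible to SOC roles, which is the property that would let one well-governed CDL serve organizations that need independence.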

In certain scenarios, data might require reclassification or recategorization based on its need-to-know status. Agencies must consider storage capabilities, ensuring they have a scalable, redundant and highly available storage system that can handle vast amounts of varied data, from structured to unstructured formats. Other considerations include interoperability, migrating an existing enterprise CDL to another platform, integrating with legacy systems, and supporting multi-cloud enterprise architectures that source data from a range of CSPs and physical locations. When considering data portability, the ease of transferring data between different platforms or services is crucial. This necessitates storing data in widely recognized formats and ensuring it remains accessible. Moreover, the administrative efforts involved in segmenting and classifying the data should also be considered.

Beyond cost and feasibility, the CDL model also gives CIOs and CISOs the opportunity to achieve data dominance with their security and log data. Data dominance means gathering data quickly and securely and reducing processing time, which in turn shortens the time to respond. A shorter time to respond, the strategic goal of any security implementation, is only possible with the appropriate platform and infrastructure, which bring organizations closer to real-time situational awareness.

Application of CDL

With a solid strategy in place, it's time to delve into the application of a CDL. Questions arise about how it operates, how to make it actionable, where it sits relative to the Security Operations Center (SOC), and how it might integrate with agency Governance, Risk Management, and Compliance (GRC) tools and other monitoring systems. A mature security program needs a comprehensive real-time view of an agency's security posture, encompassing SOC activities and the agency's governance, risk management, and compliance tasks. The CDL should interface seamlessly with existing or future Security Orchestration, Automation, and Response (SOAR) and Endpoint Detection and Response (EDR) tools, as well as ticketing systems.

CDLs facilitate the sharing of analyses within their agencies, as well as with other Federal entities like the Department of Homeland Security (DHS), the Cybersecurity and Infrastructure Security Agency (CISA), Federal law enforcement agencies, and intelligence agencies. Moreover, CDLs can bridge the gaps in a Federal security program, interlinking entities such as the SOC, GRC tools, and other security monitoring capabilities. At the highest levels of maturity, the CDL will leverage Network Operations Center (NOC) data and potentially even administrative information such as employee leave schedules. The benefit of modernizing the CDL lies in eliminating the requirement to segregate data before ingestion. Data is no longer categorized as security-specific or operations-specific. Instead, it is centralized into a single location, allowing CDL tools and models to assess the data's significance.

Monolithic technology stacks are effective when all workloads are in the same cloud environment. However, in a multi-cloud infrastructure, this approach becomes challenging. With workloads spread across different clouds, selecting one as a central hub incurs egress costs to transfer log data between clouds. Departments are exploring options to store data in the cloud where it is generated, while also considering whether Cloud Service Providers (CSPs) offer tools for analysis, visibility, machine learning, and artificial intelligence.
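The egress trade-off can be estimated with simple arithmetic. The sketch below compares shipping all raw logs to a central hub against keeping logs where they are generated and forwarding only a fraction of the volume. The daily volume, the per-GB price, and the 2% forwarding fraction are made-up inputs, not published CSP rates.

```python
def monthly_egress_cost(daily_gb, price_per_gb, days=30):
    """Egress cost for moving daily_gb of logs out of a cloud each day."""
    return daily_gb * price_per_gb * days


def compare(daily_gb, price_per_gb, forwarded_fraction):
    """Return (central_hub_cost, store_in_place_cost) for one month.

    central_hub_cost: all raw logs leave the source cloud.
    store_in_place_cost: only forwarded_fraction of the volume leaves.
    """
    central = monthly_egress_cost(daily_gb, price_per_gb)
    in_place = monthly_egress_cost(daily_gb * forwarded_fraction, price_per_gb)
    return central, in_place


if __name__ == "__main__":
    # Hypothetical inputs: 500 GB/day, $0.09 per GB, 2% of volume forwarded.
    central, in_place = compare(daily_gb=500, price_per_gb=0.09, forwarded_fraction=0.02)
    print(f"central hub: ${central:,.2f}/month, store in place: ${in_place:,.2f}/month")
```

Actual pricing is tiered and varies by provider and region, so a real estimate would substitute the agency's negotiated rates and measured log volumes.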

The next step is for agencies to send actionable information to security personnel regarding potential incidents and provide mission owners with the intelligence necessary to enhance efficiency. Additionally, this approach avoids creating separate silos for security data, mission data, financial information, and operations data. This integration extends to other Federal security initiatives such as Continuous Diagnostics and Mitigation (CDM), Authority to Operate (ATO), Trusted Internet Connections (TIC), and the Federal Risk and Authorization Management Program (FedRAMP).

It's also pivotal to determine if the CDL aligns with the MITRE ATT&CK Framework, which can significantly assist in incident response. MITRE ATT&CK® is a public knowledge base outlining adversary tactics and techniques based on observed events. The knowledge base aids in developing specific threat models and methodologies across various sectors.
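One way to picture ATT&CK alignment is tagging CDL events with technique identifiers at ingestion or query time. The sketch below is a minimal illustration: the technique IDs and names (T1110 Brute Force, T1078 Valid Accounts, T1059 Command and Scripting Interpreter) come from the public ATT&CK knowledge base, while the event type names and the mapping itself are hypothetical.

```python
# Hypothetical mapping from internal event types to ATT&CK techniques.
ATTACK_TECHNIQUES = {
    "repeated_failed_logins": ("T1110", "Brute Force"),
    "dormant_account_login": ("T1078", "Valid Accounts"),
    "unexpected_script_execution": ("T1059", "Command and Scripting Interpreter"),
}


def tag_event(event):
    """Return a copy of the event with an ATT&CK technique attached.

    Events whose type is not in the mapping get None, so analysts can
    see which detections still lack ATT&CK coverage.
    """
    tagged = dict(event)
    tagged["attack_technique"] = ATTACK_TECHNIQUES.get(event["type"])
    return tagged


if __name__ == "__main__":
    print(tag_event({"type": "repeated_failed_logins", "account": "svc-backup"}))
    print(tag_event({"type": "disk_full", "host": "log-collector-01"}))
```

Tagged events can then be grouped by technique during incident response, and the untagged remainder gives a rough view of where detection content has not yet been mapped.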

Lastly, to gauge the CDL's applicability, one might consider creating a test case. Given the vast amount of log data — since logs are perpetual — this presents an ideal scenario for machine learning. Achieving real-time visibility can be challenging with the multiple layers of log aggregation, but timely insights might be within reach. For more resources from Microsoft Federal Security, please visit https://aka.ms/FedCyber.
