1. Overview

Enterprises around the world now depend heavily on cloud computing to run their businesses. Building a solid security infrastructure in the cloud remains challenging, however, especially for organizations with multiple cloud accounts. So what measures can companies take to ensure security governance?

Some of the measures include:

  • Ensuring appropriate resource configuration that adheres to the organization's policies or security standards.
  • Rapidly detecting and reporting abnormal actions that affect the system.
  • Prioritizing automatic remediation when malicious events occur, and reducing human error through standardization and automation.
  • Ensuring the systems are able to centralize management, track resources across all of the organization's accounts, and transmit security findings from member accounts to a management account.
  • Reducing the administrative burden on the team by building and maintaining a straightforward system.

2. Solution

a. Finding a solution

Using AWS native services 


Using AWS native services is one viable option. AWS Config, Amazon GuardDuty, and AWS Security Hub can monitor and track configurations in the system. After extensive testing in our cloud environment, we came to the following conclusions:

  • AWS Config: AWS Config can aggregate resources across all regions and accounts within an organization, but it collects only limited information about each resource, which restricts what we can learn about resource attributes. First, AWS Config cannot list all IAM users with console passwords or all AMIs in use throughout the AWS organization. Second, it does not expose an IAM user's Login Profile data or image attributes. These gaps make it difficult to gather and manage resources.
  • Security Hub: Security Hub gives us a comprehensive view of the security status in AWS and lets users compare the environment against security standards and industry best practices. However, our stakeholders prefer to receive reports as CSV files rather than logging in to view security findings on the AWS dashboard, and Security Hub only lets users view reports in the console instead of generating report files directly. This makes it difficult to administer centrally.
  • GuardDuty: GuardDuty monitors the system and reports suspicious activity. It can notify other services so that they respond to unexpected behavior with appropriate actions. However, its high cost is one of its main disadvantages, and it must be combined with several other AWS services to give the best results.

Therefore, to combine this information into a report and store it in S3, we would have to create a Lambda function to gather, process, and build the report, along with an SQS queue and an SNS topic. We would also have to develop new features, such as instantly notifying the administrator of detected errors, and keep all of these services enabled at all times so that the system is continuously covered. After multiple tests with AWS native services, we found that operating this many resources at once makes the system hard to manage and upgrade later on. Finally, cost remains an issue.

Using open-source tools

After thoroughly testing alternative techniques to resolve the limitations mentioned above, we found that Cloud Custodian, an open-source application, has capabilities that fit our needs:

  • Centralizes security findings across multiple accounts and regions into a single report.
  • Lets users query resources at a deeper level than AWS Config and produces report files in a variety of formats, including JSON and CSV.
  • Uses low-cost, easy-to-manage services (e.g., Lambda, EventBridge).
  • Takes action as soon as errors are identified when actions are included in the policy, keeping the system's configuration consistent, appropriate, and in line with business needs.
  • Is a policy-as-code solution that is simple to implement and maintain.
  • Is a lightweight serverless solution whose deployment and automation can be built with Infrastructure as Code and container technologies.

b. Cloud Custodian Introduction

Cloud Custodian is a rules engine for managing public cloud resources. Users define rules (policies) to enable a well-managed cloud infrastructure. It uses a stateless rules engine for policy definition and enforcement, with metrics, structured outputs, and detailed reporting for cloud infrastructure. Cloud Custodian integrates with serverless runtimes to offer real-time remediation and response with low operational overhead. A policy can also run in one of several execution modes, depending on the purpose of the query:

Mode                  Description
pull                  Default mode; queries resources from the cloud provider for filtering and actions.
asg-instance-state    Creates a Lambda policy that executes on an ASG's EC2 instance state changes.
cloudtrail            Creates a Lambda policy using CloudWatch Events rules on CloudTrail API logs.
config-poll-rule      Represents a periodic/scheduled AWS Config evaluation.
config-rule           Creates a Lambda policy that executes as a Config service rule and is invoked on configuration changes to resources.
ec2-instance-state    Creates a Lambda policy that executes on EC2 instance state changes.
guard-duty            Executes policies when alerts are created by AWS GuardDuty, for automated incident response.
hub-finding           Deploys a policy Lambda as a Security Hub console action and on Security Hub finding ingestion events.
periodic              Runs Custodian in AWS Lambda at a user-defined cron interval.
phd                   Personal Health Dashboard event-based policy execution.
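
As a rough illustration of how a mode is attached to a policy, the sketch below adds a `mode` block to a policy held as a Python dictionary and checks the mode name against the table above. The helper `with_mode`, the policy name, the role ARN, and the schedule are all hypothetical; the output is JSON, which a YAML parser can also read, so treat it as a starting point rather than the exact file used in this project.

```python
import json

# Mode names copied from the table above.
KNOWN_MODES = {
    "pull", "asg-instance-state", "cloudtrail", "config-poll-rule",
    "config-rule", "ec2-instance-state", "guard-duty", "hub-finding",
    "periodic", "phd",
}


def with_mode(policy, mode_type, **settings):
    """Return a copy of `policy` with a `mode` block of the given type."""
    if mode_type not in KNOWN_MODES:
        raise ValueError(f"unknown Custodian mode: {mode_type}")
    updated = dict(policy)
    updated["mode"] = {"type": mode_type, **settings}
    return updated


# Hypothetical policy: S3 buckets that lack a CostCenter tag.
base = {
    "name": "s3-missing-costcenter-tag",
    "resource": "aws.s3",
    "filters": [{"tag:CostCenter": "absent"}],
}

# Placeholder role ARN and schedule; replace with values from your account.
nightly = with_mode(
    base,
    "periodic",
    schedule="rate(1 day)",
    role="arn:aws:iam::111111111111:role/custodian-lambda-role",
)

print(json.dumps({"policies": [nightly]}, indent=2))
```

Leaving the `mode` block out entirely corresponds to the default pull mode.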

Organizations can use Custodian to manage their cloud environments from a single system, covering compliance verification against security standards, tag policies, garbage collection of idle resources, and cost management. Cloud Custodian also ships with companion tools, such as c7n-org, which runs policies across multiple accounts and regions, and c7n-mailer, which turns policy results into email notifications.

Cloud Custodian policies are written in YAML and include the following key attributes:

  • name: A unique identifier for the policy.
  • resource: The type of resource to run the policy against.
  • filters: Conditions that narrow down the set of resources, using rich queries on JSON objects via JMESPath. Filters can use attribute variables and event variables.
  • actions: The actions to take on the filtered set of resources.

Policy examples

Scanning Policy

Persistent Policy
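
The following is a minimal sketch of what a scanning policy and a persistent policy might look like, not the policies used in this project. Policies are built as Python dictionaries and dumped as JSON (which a YAML parser can read as well). The policy names, the tag key, the role ARN, and the helper `policy_file` are assumptions made for illustration.

```python
import json

# Scanning policy: default pull mode, no actions, so it only reports matches.
scan_policy = {
    "name": "scan-ec2-missing-owner-tag",
    "resource": "aws.ec2",
    "filters": [{"tag:Owner": "absent"}],
}

# Persistent policy: deployed as a Lambda function that reacts to the
# CloudTrail RunInstances event and stops instances with unencrypted volumes.
persist_policy = {
    "name": "persist-ec2-unencrypted-volumes",
    "resource": "aws.ec2",
    "mode": {
        "type": "cloudtrail",
        # Placeholder role ARN; replace with a role from your account.
        "role": "arn:aws:iam::111111111111:role/custodian-lambda-role",
        "events": ["RunInstances"],
    },
    "filters": [{"type": "ebs", "key": "Encrypted", "value": False}],
    "actions": ["stop"],
}

REQUIRED_KEYS = ("name", "resource")


def policy_file(*policies):
    """Serialize policies into one document, checking the required keys."""
    for policy in policies:
        missing = [key for key in REQUIRED_KEYS if key not in policy]
        if missing:
            raise ValueError(f"policy is missing keys: {missing}")
    return json.dumps({"policies": list(policies)}, indent=2)


print(policy_file(scan_policy, persist_policy))
```

The only structural difference between the two is the `mode` block and the `actions` list: the scanning policy reports, while the persistent policy reacts to events and remediates.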


c. Using Custodian to automatically manage cloud resources

Overview architecture


In this architecture, we use the AWS CDK to deploy the following resources to the AWS cloud environment:

  • CI/CD tools: CodeCommit, CodePipeline, and CodeBuild, used for automation (for example, redeploying the infrastructure automatically whenever changes are pushed to CodeCommit).
  • ECR: Stores container images with Cloud Custodian built in.
  • ECS Task: Uses the images from ECR to run Cloud Custodian.
  • Output: Reports are written to an S3 bucket, or Lambda functions and EventBridge rules are created to automatically remediate non-compliant resources.

  • Prerequisites:

Before we dive in, you need to create the following resources.

1. Create two CodeCommit repositories in the Management account:

  • Policy-Repo: Stores the Cloud Custodian policies and the policy configuration file (we will discuss the purpose of this file later). The auditor who writes Cloud Custodian policies is in charge of this repo.
  • Infra-Repo: Stores the CDK source code, which must be pushed to this repo first. The developer who operates the infrastructure and develops new functions for it is responsible for this repo.

2. Create two roles in every child account you want to audit:

  • Scan-role: Has the ReadOnlyAccess AWS managed policy attached. This role allows the Management account to collect information from child accounts and generate reports from those findings.
  • Persist-role: Has permissions such as lambda:CreateFunction, lambda:AddPermission, events:PutTargets, events:PutRule, etc. This role allows the system to perform automatic remediation when incidents occur.

Remember to add your audit account as a trusted principal in each role's trust policy so that the audit account has permission to assume the roles in the child accounts.
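
A minimal sketch of such a trust policy is shown below. The function name `trust_policy` and the account ID are placeholders. Trusting the account root principal allows any suitably permitted identity in the audit account to assume the role, so you may prefer to name a specific role ARN instead.

```python
import json


def trust_policy(audit_account_id):
    """Build an IAM trust policy that lets the audit account assume a role."""
    if not (audit_account_id.isdigit() and len(audit_account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit strings")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{audit_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder audit account ID; attach the result to Scan-role and Persist-role.
print(json.dumps(trust_policy("111111111111"), indent=2))
```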

After deploying the CDK source code in the audit account, we also need to upload the accounts.yaml file (which contains information about the child accounts you want to audit) to the S3 bucket called account-yaml-bucket.
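
The article does not show the contents of accounts.yaml. Assuming it follows the c7n-org account file layout (a top-level `accounts` list with `account_id`, `name`, `regions`, and `role` per entry), a file could be generated roughly as follows. The account names, IDs, and regions are placeholders, and the helper `accounts_config` is hypothetical.

```python
import json


def accounts_config(child_accounts, regions, role_name="Scan-role"):
    """Build a c7n-org style account list from a {name: account_id} mapping."""
    return {
        "accounts": [
            {
                "account_id": account_id,
                "name": name,
                "regions": list(regions),
                "role": f"arn:aws:iam::{account_id}:role/{role_name}",
            }
            for name, account_id in child_accounts.items()
        ]
    }


# Placeholder child accounts and regions.
children = {"dev": "222222222222", "prod": "333333333333"}
config = accounts_config(children, ["us-east-1", "ap-southeast-1"])

# JSON is also readable by YAML parsers, so this text can be saved as accounts.yaml.
print(json.dumps(config, indent=2))
```

Passing `role_name="Persist-role"` would produce the equivalent file for remediation runs.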

  • Data flow:

1. First, the auditor uses their Git credentials to push the Cloud Custodian policies and the policy configuration file to Policy-Repo.

1.1. Cloud Custodian policies fall into two types:

  • SCAN: Monitors resources and generates periodic reports on those that violate policy.
  • PERSIST: Detects violating resources based on events and remediates them automatically.

1.2. Policy configuration file: Configures which policies and regions to run and the schedule on which to run them. It also customizes the report output stored in the S3 bucket.

2. After the policies and the policy configuration file are pushed to Policy-Repo, a CodePipeline pipeline is triggered to copy the policies to an S3 bucket named policy-bucket. A second pipeline is then triggered, which uses CodeBuild with the policy configuration file to build the ECS Task Definition.

3. This ECS Task Definition uses the image stored in the ECR repository (created when the CDK source code is deployed) to run Cloud Custodian policies.

4. The ECS Task fetches the accounts.yaml file from account-yaml-bucket, fetches the policies from policy-bucket, and uses the image from the ECR repository to do its job.
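
The article does not show the command the container runs. One plausible sketch, assuming the task uses c7n-org and has already downloaded both files locally, builds the command line like this; the file paths and the helper name `c7n_org_run` are hypothetical.

```python
import shlex


def c7n_org_run(accounts_file, policy_file, output_dir, regions=(), dryrun=False):
    """Build a `c7n-org run` command as an argument list."""
    cmd = [
        "c7n-org", "run",
        "-c", accounts_file,  # account list, e.g. from account-yaml-bucket
        "-u", policy_file,    # policies, e.g. from policy-bucket
        "-s", output_dir,     # where per-account results are written
    ]
    for region in regions:
        cmd += ["--region", region]
    if dryrun:
        cmd.append("--dryrun")  # evaluate filters without running actions
    return cmd


command = c7n_org_run(
    "accounts.yaml", "policies.yaml", "output",
    regions=["us-east-1"], dryrun=True,
)
print(shlex.join(command))
```

Returning a list rather than a single string means the command can be passed directly to `subprocess.run` without shell quoting issues.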

The ECS Task Definition comes in two modes, corresponding to the two types of Cloud Custodian policy, SCAN and PERSIST:

  • SCAN: Runs daily at 8:00 AM by default and writes reports to an S3 bucket called output-bucket.
  • PERSIST: Automatically remediates non-compliant resources.

You do not need to start the SCAN task yourself because it runs daily at a fixed time. The PERSIST task, however, has to be run manually.

  • Note:

When you run PERSIST policies, Cloud Custodian creates resources such as Lambda functions and EventBridge rules in each account and region you want to audit so that it can remediate automatically. The more accounts, regions, and PERSIST policies we run, the more resources are created. The remaining problem is how to delete those resources when we want to deactivate automatic remediation.

Therefore, we built a CodeBuild project that uses an image stored in the ECR repository. This image has MUGC (a Cloud Custodian cleanup tool) built in. We run MUGC in one of two modes:

  • SINGLE: Delete resources in only one account.
  • ORG_WIDE: Delete resources in all accounts.

Whenever you want to delete resources, just set the environment variables of the CodeBuild project (account ID, regions) and run it.
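
The two modes differ only in which accounts the cleanup visits. The sketch below shows one way to expand the inputs into (account, region) pairs; it does not reproduce the project's CodeBuild script, and the function name `cleanup_targets` and its parameters are hypothetical.

```python
def cleanup_targets(mode, regions, all_accounts, account_id=None):
    """Return the (account_id, region) pairs a cleanup run should visit."""
    if mode == "SINGLE":
        if account_id is None:
            raise ValueError("SINGLE mode needs an account ID")
        accounts = [account_id]
    elif mode == "ORG_WIDE":
        accounts = list(all_accounts)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(account, region) for account in accounts for region in regions]


# Placeholder account IDs and regions.
org_accounts = ["222222222222", "333333333333"]
regions = ["us-east-1", "ap-southeast-1"]

for account, region in cleanup_targets("ORG_WIDE", regions, org_accounts):
    # A real job would assume a role in `account` and run MUGC for `region` here.
    print(f"clean up Custodian resources in {account} / {region}")
```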

3. Sample output

Reports from SCAN policies are stored in the S3 bucket in JSON and CSV format:

  • Reports are stored in the S3 bucket:


  • Content of the report will look like this:

4. Conclusion

With Cloud Custodian, you can enhance security and apply best practices across the entire cloud environment, since it lets you build an automated system that monitors, detects, and handles abnormal actions in a multi-account cloud environment.

Cost is no longer an issue, since Cloud Custodian can be deployed on serverless and container services (e.g., an ECS Task Definition using AWS Fargate). After testing this system for a month, we found that it is reasonably priced and highly effective compared to using Config rules, Security Hub, or GuardDuty. Most of the cost comes from operating the system and building the infrastructure (CodeCommit, CodeBuild, CodePipeline). Avoid pushing code to the CodeCommit repository too often: every push makes CodePipeline run before you can run policies, which takes more time and costs more. We recommend creating a set of policies and pushing them to the CodeCommit repository together instead of pushing each policy individually.

Cloud Custodian supports not only AWS but also other major cloud providers such as Azure and GCP. If you run a multi-cloud environment, this open-source tool is certainly a security solution worth considering.

 
Author: FPT Software