Introduction: The Need for Network Operations Centers in AWS
When I first began creating AWS infrastructure, cross-account deployments were not common practice. Over the past seven years, this landscape has changed significantly. The two primary reasons for distributing workloads across multiple accounts are segregation of duties and the ability to work around per-account API and resource limits.
During my time at AWS, I had the opportunity to design and implement Network Operations Centers (NoCs). The purpose of a NoC is to centralize certain aspects of network design within an organization's AWS cloud infrastructure. This approach is relatively straightforward to implement.
First, use the Transit Gateway as a centralized router for your networking needs. Next, create private Virtual Private Clouds (VPCs) and connect them to the Transit Gateway. You can set these up locally in each AWS account or share the VPCs from your networking account with other accounts using AWS Resource Access Manager (RAM). To manage IP addresses effectively and avoid overlapping CIDRs in a dynamic environment, consider using the Amazon VPC IP Address Manager (IPAM), which allows for centralized CIDR assignment from a designated IP pool.
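To make the idea of centralized CIDR assignment concrete, here is a minimal sketch using only Python's standard library. It is not how IPAM is implemented; it only illustrates that carving fixed-size blocks out of one pool yields VPC ranges that cannot overlap. The pool 10.0.0.0/16 and the function name allocate_cidrs are assumptions made for this example.

```python
import ipaddress
from itertools import islice


def allocate_cidrs(pool_cidr: str, netmask_length: int, count: int) -> list[str]:
    """Return the first `count` blocks of the given size from a pool.

    Hypothetical helper: blocks are taken in order from the start of the
    pool, so they can never overlap each other.
    """
    pool = ipaddress.ip_network(pool_cidr)
    blocks = pool.subnets(new_prefix=netmask_length)
    return [str(block) for block in islice(blocks, count)]


# Three /26 VPC ranges from an assumed 10.0.0.0/16 pool.
vpc_cidrs = allocate_cidrs("10.0.0.0/16", 26, 3)
print(vpc_cidrs)
# → ['10.0.0.0/26', '10.0.0.64/26', '10.0.0.128/26']
```

In the real setup the allocation state lives in IPAM, so every account that requests a CIDR from the pool gets a range that no other VPC is using.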
Additionally, you should deny the creation of local Internet Gateways in each account through Service Control Policies (SCPs). Implement a security VPC with a firewall of your choice, placed behind a Gateway Load Balancer. Finally, establish centralized egress and ingress Internet access VPCs, or integrate this functionality into your security VPC. For more detailed information, I highly recommend reading the AWS whitepaper "Building a Scalable and Secure Multi-VPC AWS Network Infrastructure."
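As a rough sketch of what such an SCP could look like, the following Python snippet builds the policy document as a dictionary and prints it as JSON. The statement ID and the exact list of denied actions are assumptions for illustration; your own policy may need more actions or conditions, and it should not be attached to the networking account itself.

```python
import json


def build_deny_igw_scp() -> dict:
    """Build an SCP document that denies creating or attaching Internet Gateways.

    Hypothetical helper; the Sid and the action list are illustrative choices.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInternetGatewayCreation",
                "Effect": "Deny",
                "Action": [
                    "ec2:CreateInternetGateway",
                    "ec2:AttachInternetGateway",
                    "ec2:CreateEgressOnlyInternetGateway",
                ],
                "Resource": "*",
            }
        ],
    }


policy = build_deny_igw_scp()
print(json.dumps(policy, indent=2))
```

The resulting JSON can be used as the content of a Service Control Policy in AWS Organizations and attached to the organizational units that hold the business unit accounts.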
Simplified Network Diagram
To visualize our approach at MRH-Trowe, consider the following simplified network diagram. All resources are created in the networking account and shared across various business units. By adopting this method, we prevent changes to the underlying network infrastructure by business units, establish a robust compliance foundation, and simplify processes for application developers, who typically do not focus on network infrastructure.
Automating Your Network Deployment: Leveraging the Cloud Development Kit (CDK)
Automation is crucial for creating and maintaining AWS workloads, especially when you need to build and tear down resources daily. The Cloud Development Kit (CDK) provides a programmatic approach to infrastructure management via AWS CloudFormation. While I won’t delve deeply into CDK basics, I encourage you to explore this tool if you haven't already—it's a game changer.
To create a layer-based approach for our NoC, we can structure our project with the following layers:
- IP Address Manager (ipam_stack.py): This manages the IP pool we want to control.
- Security VPC and Transit Gateway (security_vpc_stack.py): This layer handles security and routing.
- Gateway Load Balancer Based Firewall (firewall_stack.py): This ensures robust security measures.
After these foundational layers, we can add multiple VPCs with automatically created route entries and Transit Gateway attachments.
network-operation-center/
|-app.py
|-stacks/
|--ipam_stack.py
|--security_vpc_stack.py
|--firewall_stack.py
|--vpc1_stack.py
…
|--vpc_n_stack.py
While the first three stacks are somewhat specialized, we can leverage a more generic Python class for our VPCs, allowing for easier sharing later on. Below is an example of how to define a VPC construct tailored for MRH-Trowe's network:
from aws_cdk import (
    Aws,
    Tags,
    CfnTag,
    Duration,
    RemovalPolicy,
    aws_ec2 as ec2,
    aws_iam as iam,
    aws_ram as ram,
)
from stacks.security_vpc_stack import SecurityVPCStack
from constructs import Construct


class VPCDefinition(Construct):
    """Create a general-purpose VPC customized construct."""

    def __init__(
        self,
        scope: Construct,
        id: str,
        vpc_net_mask: int,
        vpc_name: str,
        default_cidr: str,
        transit_gateway_id: str,
        network_account_id: str,
        security_vpc_id: str = None,
        gwlb_param: str = None,
        ipv4_ipam_pool_id: str = None,
        max_azs: int = 2,
        amount_of_natgateways: int = 0,
        tgw_attach_cidr_mask: int = 28,
        public_service_cidr_mask: int = None,
        private_service_cidr_mask: int = None,
        isolated_service_cidr_mask: int = None,
        local_cidr: str = None,
        gwlb_service_name: str = None,
        tgw_attachement: bool = True,
    ) -> None:
        """Initialize CDK construct class."""
        super().__init__(scope, id)
        self.subnet_config = []
        self.tgw_subnet_ids = []
        self.tgw_route_table_ids = []
        self.tgw_azs = []
        self.id_collection = []
        self.vpc_net_mask = vpc_net_mask
        self.default_cidr = default_cidr
        self.transit_gateway_id = transit_gateway_id
        self.network_account_id = network_account_id
        self.ipv4_ipam_pool_id = ipv4_ipam_pool_id
        self.tgw_attach_cidr_mask = tgw_attach_cidr_mask
        self.public_service_cidr_mask = public_service_cidr_mask
        self.private_service_cidr_mask = private_service_cidr_mask
        self.isolated_service_cidr_mask = isolated_service_cidr_mask
        self.vpc_name = vpc_name
        self.max_azs = max_azs
        self.amount_of_natgateways = amount_of_natgateways
        self.local_cidr = local_cidr
        self.security_vpc_id = security_vpc_id
        self.tgw_attachement = tgw_attachement

        # Create a subnet set for the TGW attachment.
        # Isolated subnets do not map public IPs on launch by default,
        # so no extra flag is needed here.
        if self.tgw_attach_cidr_mask == 28:
            self.subnet_config.append(
                ec2.SubnetConfiguration(
                    cidr_mask=self.tgw_attach_cidr_mask,
                    name=f"tgw-attachment-{self.vpc_name}",
                    subnet_type=ec2.SubnetType.PRIVATE_ISOLATED,
                )
            )
        # Additional subnet configurations can be added here as needed

        # Request a CIDR from IPAM
        ip_assignment = ec2.IpAddresses.aws_ipam_allocation(
            ipv4_ipam_pool_id=self.ipv4_ipam_pool_id,
            ipv4_netmask_length=self.vpc_net_mask,
        )

        # Create the actual VPC
        self.vpc = ec2.Vpc(
            self,
            id="VPC",
            ip_addresses=ip_assignment,
            max_azs=self.max_azs,
            nat_gateways=self.amount_of_natgateways,
            subnet_configuration=self.subnet_config,
            vpc_name=self.vpc_name,
            restrict_default_security_group=True,
        )

        # Tag the VPC
        tags = {
            "Shared": "True",
            "SourceAccountId": self.network_account_id,
            "SourceAccountName": "Networking",
            "AWS Region": Aws.REGION,
        }
        for key, value in tags.items():
            Tags.of(self.vpc).add(key, value)

        # Collect subnet and route table IDs for automatic route table adoption
        if self.tgw_attach_cidr_mask == 28:
            selection_tgw_attach = self.vpc.select_subnets(
                subnet_group_name=f"tgw-attachment-{self.vpc_name}"
            )
            for subnet in selection_tgw_attach.subnets:
                self.id_collection.append(
                    {
                        "RouteTableId": subnet.route_table.route_table_id,
                        "Az": subnet.availability_zone,
                    }
                )
                self.tgw_subnet_ids.append(subnet.subnet_id)

        # Create the Transit Gateway attachment
        transit_gateway_attachment = ec2.CfnTransitGatewayAttachment(
            self,
            id="TransitGatewayAttachment",
            subnet_ids=self.tgw_subnet_ids,
            transit_gateway_id=self.transit_gateway_id,
            vpc_id=self.vpc.vpc_id,
            tags=[CfnTag(key="Name", value=f"{self.vpc_name}-tgw-attachment")],
            options={"ApplianceModeSupport": "enable"},
        )

        # Create the TGW route table association.
        # NOTE: self.tgw_spoke_route_table_id is not set in this excerpt;
        # it must hold the ID of the spoke TGW route table before this point.
        ec2.CfnTransitGatewayRouteTableAssociation(
            self,
            id="SpokeRouteTableAssociation",
            transit_gateway_attachment_id=transit_gateway_attachment.ref,
            transit_gateway_route_table_id=self.tgw_spoke_route_table_id,
        )

        # Create a default route (0.0.0.0/0) in each subnet route table
        # that points to the Transit Gateway and thus to the security VPC.
        for idx, entry in enumerate(self.id_collection):
            ec2.CfnRoute(
                self,
                id=f"Route{idx}",
                route_table_id=entry["RouteTableId"],
                destination_cidr_block=self.default_cidr,
                transit_gateway_id=self.transit_gateway_id,
            ).node.add_dependency(transit_gateway_attachment)
This construct does a lot of the heavy lifting, yet it is easy to use later, as the following example shows:
from constructs import Construct
from stacks.ipam_stack import IPAMStack
from stacks.security_vpc_stack import SecurityVPCStack
from stacks.network_creation import VPCDefinition
from aws_cdk import (
    Stack,
    aws_ram as ram,
)


class SomeVPC(Stack):
    """Create a VPC in the networking account and share its subnets.

    Args:
        scope (Construct): parent construct, for example the CDK app or a stage
        construct_id (str): logical ID of this stack
    """

    def __init__(
        self, scope: Construct, construct_id: str, **kwargs
    ) -> None:
        """Initialise CDK stack class."""
        super().__init__(scope, construct_id, **kwargs)
        self.vpc_name = "my-vpc"
        self.network_account_id = "12345678910"
        self.param_name_ipam_pool = "ipam-id"
        self.param_name_tgw_id = "tgw-id"
        self.default_cidr = "0.0.0.0/0"
        self.security_vpc_name = "my-sec-vpc"

        # Resolve the IPAM pool ID and Transit Gateway ID from the base stacks
        self.ipv4_ipam_pool_id = IPAMStack.return_pool_id(
            stack=self, pool_name=self.param_name_ipam_pool
        )
        self.transit_gateway_id = SecurityVPCStack.return_tgw_id(
            stack=self, param_name_tgw=self.param_name_tgw_id
        )

        vpc = VPCDefinition(
            self,
            id=self.vpc_name,
            transit_gateway_id=self.transit_gateway_id,
            network_account_id=self.network_account_id,
            ipv4_ipam_pool_id=self.ipv4_ipam_pool_id,
            default_cidr=self.default_cidr,
            vpc_name=self.vpc_name,
            vpc_net_mask=26,
            public_service_cidr_mask=28,
        )

        # Share the subnets with the target account via AWS RAM.
        # NOTE: name_resource_share is not defined in this excerpt and must be
        # set beforehand; the subnet ARNs below are placeholders.
        ram.CfnResourceShare(
            self,
            id="CfnResourceVPCShare",
            name=name_resource_share,
            allow_external_principals=False,
            principals=["12345678911"],
            resource_arns=["subnet_arn_1", "subnet_arn_2"],  # … subnet_arn_n
        )
As you can see, we only need to pass a few parameters and a new VPC is born. With ram.CfnResourceShare(), we share the subnets with a specific target account via AWS Resource Access Manager. Sharing the subnets shares the whole VPC, including its route tables and routes.
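Before choosing the masks, it helps to check that the subnets fit into the VPC range at all. The following sketch does the arithmetic for the values used above, under the assumption that the construct creates one subnet per configured mask in each Availability Zone. The function names are made up for this example, and passing the check is a necessary condition only; it says nothing about how the CDK aligns the blocks.

```python
def address_count(mask: int) -> int:
    """Number of IPv4 addresses in a block with the given prefix length."""
    return 2 ** (32 - mask)


def plan_fits(vpc_mask: int, subnet_masks: list[int], azs: int) -> bool:
    """Check that one subnet per mask and AZ does not exceed the VPC size."""
    needed = azs * sum(address_count(mask) for mask in subnet_masks)
    return needed <= address_count(vpc_mask)


# /26 VPC, two AZs, one /28 TGW attachment subnet and one /28 public subnet per AZ:
# 2 * (16 + 16) = 64 addresses, which is exactly the size of a /26.
print(plan_fits(26, [28, 28], 2))      # → True
# A third /28 subnet group per AZ needs 96 addresses and no longer fits.
print(plan_fits(26, [28, 28, 28], 2))  # → False
```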
Using the Network Resources: The Power of Tags
When consuming network resources, the CDK offers a convenient method called ec2.Vpc.from_lookup(). It lets you access all relevant information using the VPC ID. However, a significant issue arises: AWS Resource Access Manager does not share the tags created by the CDK for the network infrastructure components, for example aws-cdk:subnet-type:isolated and aws-cdk:subnet-name:private-service-mrht-teamviewer-vpc. These tags are crucial for the lookup functionality, so their absence hinders the lookup in the target account.
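The effect of the missing tags can be illustrated with a small, simplified sketch. It is not the CDK's actual lookup code; it only groups subnet descriptions by the aws-cdk:subnet-name tag, the way a tag-based lookup would, and shows what the target account sees when the tags are absent. The subnet IDs, the group name and the function name are invented for this example.

```python
from collections import defaultdict


def group_subnets_by_tag(subnets: list[dict], tag_key: str = "aws-cdk:subnet-name") -> dict:
    """Group subnet IDs by the value of a tag; untagged subnets share one bucket."""
    groups = defaultdict(list)
    for subnet in subnets:
        tags = {tag["Key"]: tag["Value"] for tag in subnet.get("Tags", [])}
        groups[tags.get(tag_key, "<untagged>")].append(subnet["SubnetId"])
    return dict(groups)


# View from the networking account: the CDK tags are present.
owner_view = [
    {"SubnetId": "subnet-aaa", "Tags": [{"Key": "aws-cdk:subnet-name", "Value": "private-service"}]},
    {"SubnetId": "subnet-bbb", "Tags": [{"Key": "aws-cdk:subnet-name", "Value": "private-service"}]},
]
# View from the target account: same subnets, but the tags were not shared.
participant_view = [{"SubnetId": "subnet-aaa"}, {"SubnetId": "subnet-bbb"}]

print(group_subnets_by_tag(owner_view))
# → {'private-service': ['subnet-aaa', 'subnet-bbb']}
print(group_subnets_by_tag(participant_view))
# → {'<untagged>': ['subnet-aaa', 'subnet-bbb']}
```

Without the tags, the target account can no longer tell which subnet belongs to which subnet group, which is exactly the information a tag-based lookup needs.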
To work around this limitation, we can use Custom Resources: AWS Lambda-backed components within our CDK app that gather the tags in the networking account and replicate them to the destination account. The Lambda function requires a role in your networking account to describe VPCs, route tables and subnets, and it assumes a role in the destination account (passed as TARGET_IAM_ROLE_ARN) to create the tags there. Tags whose keys start with 'aws:' must be removed before copying; otherwise you will see this error: An error occurred (InvalidParameterValue) when calling the CreateTags operation: Value ( aws:cloudformation:stack-name ) for parameter key is invalid.
import json
import boto3
import os
import logging
import urllib3
from botocore.exceptions import ClientError, ParamValidationError

log_level = os.environ.get("LOG_LEVEL", "INFO")
logging.root.setLevel(logging.getLevelName(log_level))
logger = logging.getLogger(__name__)
http = urllib3.PoolManager()
ec2_client = boto3.client("ec2")
sts_client = boto3.client("sts")


def send(
    http,
    event,
    context,
    response_status,
    response_data,
    physical_resource_id=None,
    no_echo=False,
    reason="-",
):
    """Send the custom resource response to CloudFormation.

    Args:
        http (urllib3.PoolManager): object for put requests
        event (dict): Lambda event dict
        context (object): Lambda context object
        response_status (string): result of this custom resource
        response_data (dict): additional data from this custom resource
        physical_resource_id (string, optional): CloudFormation physical resource id. Defaults to None.
        no_echo (bool, optional): Echo mode activation. Defaults to False.
        reason (string): error result of this custom resource

    Returns:
        bool: True if the response was sent, False otherwise
    """
    response_url = event["ResponseURL"]
    logger.info(response_url)
    response_body = {}
    response_body["Status"] = response_status
    response_body["Reason"] = reason
    response_body["PhysicalResourceId"] = (
        physical_resource_id or context.log_stream_name
    )
    response_body["StackId"] = event["StackId"]
    response_body["RequestId"] = event["RequestId"]
    response_body["LogicalResourceId"] = event["LogicalResourceId"]
    response_body["NoEcho"] = no_echo
    response_body["Data"] = response_data
    json_response_body = json.dumps(response_body)
    logger.info("Response body:\n" + json_response_body)
    headers = {"content-type": "", "content-length": str(len(json_response_body))}
    try:
        response = http.request(
            "PUT",
            response_url,
            body=json_response_body.encode("utf-8"),
            headers=headers,
        )
        logger.info("Status code: " + response.reason)
        return True
    except Exception as e:
        logger.exception("send(..) failed executing http.request(..): " + str(e))
        return False


def assume_role(role_arn: str, sts_client):
    """Assume an IAM role in a different account.

    Args:
        role_arn (str): ARN of the role to assume in the target account
        sts_client: boto3 client for STS

    Raises:
        ClientError, ParamValidationError: boto3 issue while assuming the role

    Returns:
        dict: STS response containing the temporary credentials
    """
    return sts_client.assume_role(
        RoleArn=role_arn, RoleSessionName="SyncSharedNetworkTags"
    )


def clean_tags(tags: list[dict]):
    """Remove tags whose keys start with 'aws:'.

    Copying them fails with: An error occurred (InvalidParameterValue) when calling the CreateTags operation:
    Value ( aws:cloudformation:stack-name ) for parameter key is invalid. Tag keys starting with 'aws:' are reserved for internal use.

    Params:
        tags (list[dict]): tags gathered from describe calls

    Returns:
        list[dict]: tags whose keys do not start with 'aws:'
    """
    return [t for t in tags if not t['Key'].startswith('aws:')]


def main(event, context, sts_client=sts_client, source_ec2_client=ec2_client, http=http):
    """Create, delete and update custom actions.

    Params:
        event: Lambda event object
        context: Lambda context object
        sts_client: boto3 client for STS
        source_ec2_client: boto3 client for EC2 in the networking account
        http: http object
    """
    logger.info("Starting Network tag management ...")
    logger.info(event)
    logger.info(context)
    try:
        vpc_id = os.environ['VPC_ID']
        role_arn = os.environ['TARGET_IAM_ROLE_ARN']
    except KeyError as e:
        logger.exception(e)
        send(
            http=http,
            event=event,
            context=context,
            response_status="FAILED",
            response_data={"Response": str(e)},
            reason=str(e),
        )
        raise

    if event["RequestType"] == "Create" or event["RequestType"] == "Update":
        try:
            logger.info("Getting foreign credentials ...")
            target_credentials = assume_role(role_arn=role_arn, sts_client=sts_client)
            target_access_key = target_credentials["Credentials"]["AccessKeyId"]
            target_secret_access_key = target_credentials["Credentials"]["SecretAccessKey"]
            target_sessions_token = target_credentials["Credentials"]["SessionToken"]
        except (KeyError, ClientError, ParamValidationError) as e:
            logger.exception(e)
            send(
                http=http,
                event=event,
                context=context,
                response_status="FAILED",
                response_data={"Response": str(e)},
                reason=str(e),
            )
            raise

        logger.info("Creating foreign EC2 client ...")
        target_ec2_client = boto3.client(
            'ec2',
            aws_access_key_id=target_access_key,
            aws_secret_access_key=target_secret_access_key,
            aws_session_token=target_sessions_token
        )
        logger.info("Done ...")

        try:
            logger.info("Describing local VPC tags ...")
            vpc = source_ec2_client.describe_vpcs(VpcIds=[vpc_id])['Vpcs'][0]
            vpc_tags = clean_tags(vpc.get('Tags', []))
            logger.info("Copying local tags to shared VPC in foreign account ...")
            # create_tags rejects an empty tag list, so only call it when tags remain
            if vpc_tags:
                target_ec2_client.create_tags(Resources=[vpc_id], Tags=vpc_tags)
            logger.info("Done ...")

            logger.info("Describing local Subnet tags ...")
            subnets = source_ec2_client.describe_subnets(Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]}])['Subnets']
            for subnet in subnets:
                subnet_id = subnet['SubnetId']
                subnet_tags = clean_tags(subnet.get('Tags', []))
                logger.info("Copying local tags to shared Subnets in foreign account ...")
                logger.info(subnet_id)
                logger.info(subnet_tags)
                if subnet_tags:
                    target_ec2_client.create_tags(Resources=[subnet_id], Tags=subnet_tags)
                logger.info(f"Done for subnet {subnet_id} ...")

            logger.info("Describing local Route Tables tags ...")
            route_tables = source_ec2_client.describe_route_tables(Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]}])['RouteTables']
            logger.info(route_tables)
            for route_table in route_tables:
                route_table_id = route_table['RouteTableId']
                route_table_tags = clean_tags(route_table.get('Tags', []))
                logger.info("Copying local tags to shared Route Tables in foreign account ...")
                logger.info(route_table_id)
                logger.info(route_table_tags)
                if len(route_table_tags) > 0:
                    target_ec2_client.create_tags(Resources=[route_table_id], Tags=route_table_tags)
                else:
                    logger.info(f"Found Route Table {route_table_id} without tags ...")
                logger.info(f"Done for route table {route_table_id} ...")
        except (KeyError, ClientError, TypeError) as e:
            logger.exception(e)
            send(
                http=http,
                event=event,
                context=context,
                response_status="FAILED",
                response_data={"Response": str(e)},
                reason=str(e),
            )
            raise

        logger.info(f"Finished Network tag management on VPC {vpc_id} ...")
        send(
            http=http,
            event=event,
            context=context,
            response_status="SUCCESS",
            response_data={"Response": f"Finished Network tag management on VPC {vpc_id}"},
        )
        return True

    if event["RequestType"] == "Delete":
        # Nothing to clean up: the copied tags stay on the shared resources
        logger.info("Deleting process in progress")
        send(
            http=http,
            event=event,
            context=context,
            response_status="SUCCESS",
            response_data={"Response": "Deleting process in progress"},
        )
        return True
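One detail of the tag filter is easy to miss, so here is a small standalone example. It repeats the clean_tags function from the Lambda code and runs it on invented sample tags: only keys with the reserved 'aws:' prefix are dropped, while the CDK tags, which start with 'aws-cdk:', are kept and can be copied.

```python
def clean_tags(tags: list[dict]) -> list[dict]:
    """Drop tags whose keys use the reserved 'aws:' prefix."""
    return [t for t in tags if not t["Key"].startswith("aws:")]


# Invented sample tags, shaped like the output of an EC2 describe call.
sample_tags = [
    {"Key": "aws:cloudformation:stack-name", "Value": "some-vpc-stack"},
    {"Key": "aws-cdk:subnet-type", "Value": "Isolated"},
    {"Key": "Name", "Value": "my-vpc"},
]

copied = clean_tags(sample_tags)
print(copied)
# → [{'Key': 'aws-cdk:subnet-type', 'Value': 'Isolated'}, {'Key': 'Name', 'Value': 'my-vpc'}]
```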
Conclusion: A Secure and Scalable Network
In summary, this blog post illustrates how MRH-Trowe successfully created a robust AWS network infrastructure using a CDK-based approach. While the transition to a programmatic model may feel unusual for networking professionals accustomed to scripting or tools like Ansible, the benefits of a segregated duties approach are undeniable.
By investing time and effort into this project, we have established a secure and scalable network that empowers non-network-related teams to utilize existing resources without concern.
Happy coding!