Serverless DataSync from EFS to S3

Gotchas in AWS DataSync for in-cloud data transfers in 2021

Yann Stoneman
8 min read · May 27, 2021

When I was tasked with installing a DataSync agent on EC2 to transfer data from its filesystem to S3, I had some questions.

“Why not skip EC2 and sync directly from the EFS filesystem?” “How does the networking work if I skip the EC2 step?”

The answers to these questions, and the fixes for a few errors I hit along the way, were not apparent to me from Stack Overflow or the AWS documentation. Through experimentation and a call with AWS Support, I figured out the solutions.

In the end, it turns out that syncing data from an Amazon EFS file share in a private subnet to Amazon S3 via AWS DataSync can take just a few minutes to set up—no installations necessary.

I hope these findings help you set up AWS DataSync between EFS and S3 faster!

EFS to S3 via DataSync

Do I need an agent?

No. You don’t need to install anything on an EC2 instance if you are just syncing data between two fully managed AWS services, such as EFS and S3.

The announcement by AWS on November 9th, 2020 called “AWS DataSync announces fully-automated transfers between AWS Storage services” does not explicitly use the words “agent,” “agentless” or “serverless,” but that’s what is implied by “fully-automated.”

This means that you no longer need an EC2 instance with the DataSync agent installed on it to sync data between in-cloud services such as S3 and EFS. In effect, DataSync is now “serverless,” at least for EFS to S3.

DataSync stays in the “launching” status for a while. Is there something wrong with my configuration or is that normal?

That’s normal; nothing is wrong with your configuration. It usually stays in “launching” for 7 to 12 minutes when I sync even just one small text file:

7–12 minutes for Launching and greater than 3 seconds for the rest.
How long each stage of execution took for me

Why does it take so long to launch, if the preparing and transferring are so quick? My guess, without having heard anything from AWS about this DataSync cold start, is that under the hood, AWS launches a new EC2 instance. The launching and configuring of the EC2 instance is probably what happens in those 7 to 12 minutes.

So if “launching” takes a while, you’re not alone. Presumably, this doesn’t matter to you anyway, since this service is designed to run on a schedule.
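If you do want to watch an execution from a script rather than the console, a minimal polling sketch could look like the following. It assumes you supply your own `get_status` callable (a hypothetical name) that returns the execution's current status string; the terminal values `SUCCESS` and `ERROR` are the ones DataSync reports for task executions. The default 30-minute timeout is my own guess at a comfortable margin over the 7 to 12 minutes I observed.

```python
import time
from typing import Callable, List

# Statuses after which a DataSync task execution will not change any more.
TERMINAL_STATUSES = {"SUCCESS", "ERROR"}


def wait_for_execution(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until the execution reaches a terminal status.

    Returns the distinct statuses seen, in the order they appeared.
    Raises TimeoutError if no terminal status shows up in time.
    """
    seen: List[str] = []
    waited = 0.0
    while True:
        status = get_status()
        # Record a status only when it differs from the previous one.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATUSES:
            return seen
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With boto3, `get_status` could be something like `lambda: datasync.describe_task_execution(TaskExecutionArn=arn)["Status"]`, where `datasync` is a DataSync client and `arn` is the ARN of the execution you started.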

Do I need to use a DataSync Interface VPC Endpoint?

No. It worked for me without a DataSync VPC endpoint, even though EFS was in a private subnet with no access to the internet. So I did not need to set up com.amazonaws.us-east-1.datasync. But you do need an S3 VPC endpoint.

Use an S3 VPC endpoint, not a DataSync VPC endpoint, for EFS to S3 in a private subnet.

I’m assuming, based on the security group settings that DataSync requires and based on DataSync’s launching time, that DataSync transparently launches an EC2 instance in your private subnet.

Therefore, all the communication between EFS and DataSync is happening within your subnet, meaning there is no need for a DataSync VPC endpoint.

However, for DataSync to reach S3, it will need an S3 VPC endpoint.

Most documentation and diagrams for DataSync are about data center to AWS transfers. These scenarios feature the DataSync agent running in your data center—not directly in your VPC—and therefore require the DataSync VPC endpoint to communicate with your VPC.

However, when you’re doing a transfer from EFS as the source and S3 as the destination, you do not need the DataSync VPC endpoints even if you’re in a private subnet. You do, however, need the S3 VPC endpoint and the correct security groups.
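As a rough sketch of the S3 endpoint piece, the helper below builds the parameters for the EC2 `CreateVpcEndpoint` call. I am assuming a gateway-type S3 endpoint attached to the private subnet's route tables; the post does not depend on that choice, and the function name and all IDs are placeholders.

```python
from typing import Dict, Iterable


def s3_gateway_endpoint_params(
    vpc_id: str, region: str, route_table_ids: Iterable[str]
) -> Dict[str, object]:
    """Build keyword arguments for ec2.create_vpc_endpoint (S3 gateway endpoint)."""
    return {
        # Assumption: a gateway endpoint; S3 also supports interface endpoints.
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Note the service name ends in .s3, not .datasync.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables used by the private subnet(s) that hold the EFS mount target.
        "RouteTableIds": list(route_table_ids),
    }


# Placeholder IDs for illustration only.
params = s3_gateway_endpoint_params("vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"])
print(params["ServiceName"])
```

With boto3 you would pass these as `ec2.create_vpc_endpoint(**params)`.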

I made some assumptions here. Leave a comment below if I need to correct anything.

How do I configure my security groups for a DataSync between EFS in a private subnet and S3?

For my “serverless” use case, the following error message from DataSync turned out to be more relevant than the documentation on DataSync networking:

Task failed to access location loc-0abc1def23g3456h8: x40016: Failed to connect to EFS mount target with IP: 10.111.11.11. Please ensure that mount target’s security group allows 2049 ingress from the DataSync security group or hosts within the mount target’s subnet. The DataSync security group should also allow all egress to the EFS mount target and its security group.

To understand this message, it helps to know the resources involved:

  1. The source location. This is DataSync’s representation of the EFS file system; it is not the same as the EFS file system itself.
  2. The security group of that “source location.” This is not the same as the security group of the EFS file system mount target itself.
  3. The destination location. Like with the source, this is only a representation. In our scenario, this location represents the specific S3 bucket.
  4. The task combines everything. It tells DataSync to pull from the source and make sure that data is in the destination.

The security group of the source location defines:

  • what traffic DataSync is allowed to receive from the source location
  • what traffic DataSync is allowed to send to the destination location

EFS uses the Network File System port 2049. So make sure you have the following settings on your security groups:

  1. On your EFS file system mount target’s security group, allow inbound access on port 2049 from the DataSync source location’s security group.
  2. On your DataSync source location’s security group, allow all outbound access on all ports to your EFS file system’s mount target’s security group.
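The two rules above can be sketched as plain dictionaries in the shape that the EC2 security group APIs expect for `IpPermissions`. The function names and security group IDs are placeholders of mine, not anything DataSync defines.

```python
from typing import Dict

NFS_PORT = 2049  # EFS mount targets speak NFS on this port


def efs_ingress_rule(datasync_sg_id: str) -> Dict[str, object]:
    """Rule 1: the mount target's group allows 2049 inbound from the DataSync group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": NFS_PORT,
        "ToPort": NFS_PORT,
        "UserIdGroupPairs": [{"GroupId": datasync_sg_id}],
    }


def datasync_egress_rule(efs_sg_id: str) -> Dict[str, object]:
    """Rule 2: the DataSync group allows all outbound traffic to the mount target's group."""
    return {
        "IpProtocol": "-1",  # -1 means all protocols and all ports
        "UserIdGroupPairs": [{"GroupId": efs_sg_id}],
    }
```

With boto3, these would be applied roughly as `ec2.authorize_security_group_ingress(GroupId=efs_sg_id, IpPermissions=[efs_ingress_rule(datasync_sg_id)])` and `ec2.authorize_security_group_egress(GroupId=datasync_sg_id, IpPermissions=[datasync_egress_rule(efs_sg_id)])`.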

DataSync is unable to putObject

If you get the following error:

DataSync location access test failed: could not perform s3:PutObject in bucket yann-stoneman-dev-storage-us-east-1-012345678910. Access denied. Ensure bucket access role has s3:PutObject permission.

Check whether KMS encryption is enabled on the bucket. If so, check whether the KMS key policy grants the basic KMS key actions to the IAM role used by DataSync:

Minimum permissions for the DataSync IAM role on your KMS key policy.
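In text form, a key policy statement along these lines is one plausible starting point. This is an assumption on my part rather than a verified minimum: the action list mirrors the default “Allow use of the key” statement that the KMS console generates for key users, and the role ARN and `Sid` are placeholders.

```python
import json
from typing import Dict


def datasync_kms_statement(role_arn: str) -> Dict[str, object]:
    """Build a KMS key policy statement that lets the DataSync role use the key."""
    return {
        "Sid": "AllowDataSyncRoleToUseKey",  # hypothetical label
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        # Assumption: the standard key-user actions, not a tested minimum.
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


# Placeholder ARN for illustration only.
statement = datasync_kms_statement("arn:aws:iam::012345678910:role/my-datasync-role")
print(json.dumps(statement, indent=2))
```

Add the printed statement to the `Statement` list of the key policy for the key that encrypts the bucket.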

How to specify a subdirectory?

Using /mnt/efs/mypath/ will result in the following error:

Task failed to access location loc-0123456789abcdefg: x40016: Could not mount subdirectory /mnt/efs/mypath on host 10.111.111.111. Please verify that the subdirectory exists and is properly exported in /etc/exports

If you read the error message, you may try to find the file /etc/exports, and you may discover that it doesn’t exist. But the solution is simpler than that.

The solution: don’t use a fully qualified path name, but a path relative to the mount point. For example, use /mypath/. Do not use /mnt/efs/mypath/.

If you tried using the fully qualified path name, then you probably are aware that any directory in the EFS filesystem will be a subdirectory of /mnt/efs. If you didn’t know that, you can run mount | grep -i efs on the EC2 instance that’s using the EFS file system, and you’ll see 127.0.0.1:/ on /mnt/efs in the output. That tells you that any subdirectory in your filesystem is located within that path. If you run ls /mnt/efs, you’ll see mypath in the results.

So if you want to sync only that mypath subdirectory, then you may be tempted to specify /mnt/efs/mypath/ as the directory to sync. Wrong. That causes the “verify that the subdirectory exists” error.

So to solve the error, in the field, “a subdirectory of the selected file system,” do not provide a fully qualified path name. Instead, provide a path relative to the root of the file system. Therefore, when we create the source location in the DataSync console, we need to specify /mypath/ as the subdirectory:

Use /mypath/ to specify the subdirectory if the fully qualified pathname is /mnt/efs/mypath/

Or if you’re using CloudFormation, use Subdirectory: '/mypath/' in the properties of the resource of type AWS::DataSync::LocationEFS.
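If you generate location definitions from paths as they appear on the EC2 instance, a small helper can do the conversion. This is a sketch under the assumption that the file system root is mounted at /mnt/efs, as in the example above; the function name is my own.

```python
import posixpath


def to_datasync_subdirectory(full_path: str, mount_point: str = "/mnt/efs") -> str:
    """Convert a path under the local mount point into a DataSync subdirectory.

    The result is relative to the file system root, with a slash before and after.
    """
    full = posixpath.normpath(full_path)
    mount = posixpath.normpath(mount_point)
    # Refuse paths that are not on the mounted file system at all.
    if full != mount and not full.startswith(mount + "/"):
        raise ValueError(f"{full_path} is not under {mount_point}")
    relative = posixpath.relpath(full, mount)
    if relative == ".":
        return "/"  # the mount point itself maps to the file system root
    return f"/{relative}/"


print(to_datasync_subdirectory("/mnt/efs/mypath/"))  # prints /mypath/
```

A path outside the mount point raises `ValueError` instead of silently producing a subdirectory that does not exist on the file system.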

What slashes should I use for “Path”?

For me, it worked for both the EFS source and the S3 destination to use a slash before and after, like: /mypath/.

Path for EFS and S3 in DataSync can be specified with a `/` before and after
How to specify path for EFS

My task’s execution status is a “success” but my data isn’t showing up in S3. What’s up with that?

If you get a “success,” but you don’t see your data showing up in S3, one possibility is that you have a typo in your source location subdirectory.

If you make a mistake when specifying the subdirectory name for your source location, DataSync will use that incorrect value to create a path on your EFS filesystem. Then DataSync syncs that new empty directory to your destination. In other words, it syncs nothing.

It would be nice if AWS provided a log message like, “subdirectory did not exist — creating….” Because that’s what’s happening.

Instead, if you provide the wrong subdirectory name, DataSync will just give you a “success” message.

So as an example, if you tell DataSync to sync the subdirectory /mytypo/, then DataSync will give you a “success,” and the next time you list the files in your file system by running ls /mnt/efs on your EC2 instance, you’ll notice a new item in the list: mytypo.
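Since a mistyped subdirectory still ends in a “success,” one way to catch the typo early is to check the path on an instance that has the file system mounted before creating the source location. A minimal sketch, assuming the mount point is /mnt/efs and using a function name of my own:

```python
import os


def source_subdirectory_exists(subdirectory: str, mount_point: str = "/mnt/efs") -> bool:
    """Return True if the DataSync-style subdirectory exists under the local mount."""
    # "/mypath/" is relative to the file system root, so strip the slashes
    # before joining it onto the local mount point.
    local_path = os.path.join(mount_point, subdirectory.strip("/"))
    return os.path.isdir(local_path)


if __name__ == "__main__":
    for candidate in ("/mypath/", "/mytypo/"):
        print(candidate, source_subdirectory_exists(candidate))
```

This has to run somewhere the file system is actually mounted, such as the EC2 instance that uses it; it says nothing about what DataSync itself can reach.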

After creating a task, can I edit the source or destination location?

No. You can only edit details about a task such as filters, bandwidth, schedules, and logs.

You cannot change the source or destination location on a task.

You cannot edit a location either.

Fortunately, a DataSync “source location” and a “destination location” are distinct resources, and as such, they can be re-used in as many DataSync tasks as you want.

So, if you need to make changes like that, make a new task. But beware: your task execution history is deleted along with the task, so make sure to copy and save any important error messages before deleting a task.

What about rsync?

I’ve seen some talk on Twitter asking about rsync versus AWS DataSync.

Anthony Ortenzi asking, “Is rsync … better than AWS DataSync?”

One answer to this question is that rsync is better because it is open-source, more popular, and older:

Bonhard Computing says rsync is better because “it’s open, mature, widely adopted, and non-proprietary.”

However, this argument was written before AWS announced DataSync’s new agentless in-cloud transfer capabilities, like for going directly from the EFS file system to S3.

Now that this serverless capability exists, AWS DataSync seems easier to me to manage than rsync. Here are a few reasons why. DataSync…

  • requires zero modifications to my EC2 instances
  • doesn’t take up any CPU on my instances
  • allows me to set up my data syncs using my cloud-native tools, such as CloudFormation or Terraform.

In contrast, rsync can have a significant impact on your server’s CPU, requires rsync to be installed, and requires an additional wrapper around it if you want to use Ansible to standardize and version-control your syncs.

Conclusion

If you still have questions, just go to the DataSync service in the AWS Console, create a task and set up EFS as the source and S3 as the destination. Once you’re in the console, configuring the settings, the points above will make more sense. Leave a comment if you run into any other issues with this!
