Getting started with AWS Cloud Computing and text processing

Congrats on getting through the first few tutorials and successfully getting off the ground with your own EC2 instance! While the cloud has many benefits, getting started does come with some overhead that can be tough for newcomers, and feedback just like this helps us improve the developer experience.

1. Which steps should one take when starting with cloud computing (to run Python scripts)?

It seems like you've already gotten your feet wet by SSHing into an EC2 instance (which most closely mimics a terminal on your own system), but I figured I'd supply a few other options for running Python scripts in the cloud, depending on your use case.

  • AWS Cloud9: a cloud IDE that can more easily access other services and resources on the AWS cloud. If vim/emacs aren't your cup of tea, being able to code directly from this browser-based IDE might make a big difference compared to coding locally and constantly pushing code up to your EC2 instance.

  • Amazon SageMaker hosted notebooks: managed Jupyter/JupyterLab notebooks that let you run Python (or a kernel of your choice) in cells, which are modular chunks of code. They're extremely popular among folks who write code for data processing (data scientists, ML/AI researchers, etc.), you can have one started and running in under five minutes, and you won't need to worry about SSH.

2. What is important to know, and where should you start (e.g., first steps: choosing what kind of EC2 instance)?

The ideal instance type is the one at the favorable intersection of price and performance for your given workload and budget. There are many EC2 instance families, each optimized for a different kind of workload.

To recommend an instance that speeds up your specific workload, I would need to know a bit more about the code itself to figure out the computational bottleneck: would increasing RAM allow for larger batches and thus faster processing, or is an inefficient algorithm with exploding complexity simply demanding more compute? Is the process parallelizable, so it would benefit from a GPU?
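One cheap way to answer those questions before renting bigger hardware is to measure your workload locally. A minimal sketch using only the standard library (`tokenize` here is a hypothetical stand-in for your actual text-processing step):

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn once; return (result, wall-clock seconds, peak memory in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

def tokenize(lines):
    # Hypothetical stand-in workload: split every line into words.
    return [line.split() for line in lines]

result, seconds, peak = profile(tokenize, ["the quick brown fox"] * 100_000)
print(f"{seconds:.3f}s wall time, {peak / 1e6:.1f} MB peak memory")
```

If peak memory dwarfs your wall time, look at memory-optimized instances; if the run is slow but lean, look at compute-optimized ones.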

3. When working with AWS, when do you work with boto3 and when directly via SSH?

boto3 is the AWS SDK for Python; it enables programmatic access (creating, reading, updating, or deleting AWS resources) at runtime from within your scripts. Anything you could do from the console can be done during workload processing. Common examples are pulling a dataset from S3, or pushing a transformation job off to EMR (Elastic MapReduce). Alternatively, the awscli command-line utility performs these same actions outside of a runtime, directly from your shell. SSH is for directly accessing the server itself, whether to modify its contents, run commands, or debug.
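As a rough sketch, the "pull a dataset from S3" case with boto3 might look like this (the bucket and key names are made up for illustration):

```python
def download_dataset(bucket, key, dest_path):
    """Pull an object from S3 to a local file at runtime."""
    import boto3  # AWS SDK for Python; imported lazily so the helper is self-contained
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, dest_path)

# Usage (hypothetical names):
# download_dataset("my-text-bucket", "raw/reviews.csv", "/tmp/reviews.csv")
```

The awscli equivalent from your shell would be `aws s3 cp s3://my-text-bucket/raw/reviews.csv /tmp/reviews.csv` (again, with your own bucket and key).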

4. What to look for when choosing an EC2 instance (especially when working with text via Python)?

The biggest precursor question here is this: what is the computational profile of your workload? As mentioned in A2, while choosing an instance that is more powerful across all major properties (RAM, CPU cores/speed, GPU, networking, etc.) will almost guarantee a speedup, you are likely to overpay and underutilize, since only a subset of those resources is the true bottleneck for your system. Finding which type of hardware is at maximum capacity in your current workload tells you what type of instance to provision.

Off the cuff, I'd recommend a C-family (compute-optimized) instance, or a P-family (GPU/parallel-compute-optimized) instance if your process can be parallelized on a GPU.

5. Because I'm trying to process a CSV file: what's the best way to work with existing files and store new ones? Is it better to store them in an S3 bucket? How to connect?

Storing in S3 is likely ideal, since it lets other services access the file(s) more easily (whether that's other AWS services in the cloud, or you pulling the file from outside of AWS). It also lets you tear down the server once your computations are done; the file will persist in S3 until you delete it.
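Putting the pieces together, a CSV round trip through S3 might be sketched like this: the transformation itself uses only the standard library `csv` module, and the S3 glue uses boto3's `get_object`/`put_object` (bucket, keys, and the `uppercase_column` transform are all made-up examples to adapt to your data):

```python
import csv
import io

def uppercase_column(csv_text, column):
    """Example transform: read CSV text, uppercase one column, return new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = row[column].upper()
        writer.writerow(row)
    return out.getvalue()

def process_s3_csv(bucket, src_key, dest_key, column):
    """Pull a CSV from S3, transform it, and write the result back to S3."""
    import boto3  # lazy import: the pure transform above works without AWS access
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=src_key)["Body"].read().decode("utf-8")
    s3.put_object(Bucket=bucket, Key=dest_key, Body=uppercase_column(body, column))

# Usage (hypothetical bucket/keys):
# process_s3_csv("my-text-bucket", "raw/reviews.csv", "clean/reviews.csv", "title")
```

Keeping the transform separate from the S3 I/O also means you can test it locally on a small file before paying for any cloud time.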