How to convert Linux cron jobs to "the Amazon way"?

amazon-ec2 cron scheduled-tasks lamp amazon-swf

For better or worse, we have migrated our whole LAMP web application from dedicated machines to the cloud (Amazon EC2 machines). It's going great so far, but the way we do crons is sub-optimal. I have an Amazon-specific question about how best to manage cron jobs in the cloud using "the Amazon way".

The problem: We have multiple webservers and need to run crons for batch jobs such as creating RSS feeds, triggering emails, and many other things. But the cron jobs must run on only one machine, because they often write to the database and would duplicate the results if run on multiple machines.

So far, we designated one of the webservers as the "master-webserver" and it has a few "special" tasks that the other webservers don't have. The trade-off for cloud computing is reliability - we don't want a "master-webserver" because it's a single point of failure. We want them to all be identical and to be able to upscale and downscale without remembering not to take the master-webserver out of the cluster.

How can we redesign our application to convert Linux cron jobs into transitory work items that don't have a single point of failure?

My ideas so far:

Have a machine dedicated to only running crons. This would be a little more manageable but would still be a single-point-of-failure, and would waste some money having an extra instance.

Some jobs could conceivably be moved from Linux crons to MySQL Events however I'm not a big fan of this idea as I don't want to put application logic into the database layer.

Perhaps we can run all crons on all machines but change our cron scripts so they all start with a bit of logic that implements a locking mechanism, so only one server actually takes action and the others just skip. I'm not a fan of this idea, as it sounds potentially buggy and I would prefer to use an Amazon best practice rather than rolling our own.

I'm imagining a situation where jobs are scheduled somewhere, added to a queue and then the webservers could each be a worker, that can say "hey, I'll take this one". Amazon Simple Workflow Service sounds exactly this kind of thing but I don't currently know much about it so any specifics would be helpful. It seems kind of heavy-weight for something as simple as a cron? Is it the right service or is there a more suitable Amazon service?

Update: Since asking the question I have watched the Amazon Simple Workflow Service webinar on YouTube and noticed at 34:40 (http://www.youtube.com/watch?v=lBUQiek8Jqk#t=34m40s) I caught a glimpse of a slide mentioning cron jobs as a sample application. In their documentation page, "AWS Flow Framework samples for Amazon SWF", Amazon say they have sample code for crons:

... Cron jobs: In this sample, a long running workflow periodically executes an activity. The ability to continue executions as new executions so that an execution can run for very extended periods of time is demonstrated. ...

I downloaded the AWS SDK for Java (http://aws.amazon.com/sdkforjava/) and, sure enough, buried within a ridiculous number of folder layers there is some Java code (aws-java-sdk-1.3.6/samples/AwsFlowFramework/src/com/amazonaws/services/simpleworkflow/flow/examples/periodicworkflow).

The problem is, if I'm honest, this doesn't really help, as it's not something I can easily digest with my skillset. The same sample is missing from the PHP SDK and there doesn't seem to be a tutorial that walks through the process. So basically, I'm still hunting for advice or tips.

Possibly related: stackoverflow.com/questions/8812025/scheduling-a-job-on-aws-ec2

Michael Currie

I signed up for Amazon Gold support to ask them this question, this was their response:

Tom,

I did a quick poll of some of my colleagues and came up empty on the cron, but after sleeping on it I realised the important step may be limited to locking. So I looked for "distributed cron job locking" and found a reference to Zookeeper, an Apache project.

http://zookeeper.apache.org/doc/r3.2.2/recipes.html
http://highscalability.com/blog/2010/3/22/7-secrets-to-successfully-scaling-with-scalr-on-amazon-by-se.html

Also I have seen reference to using memcached or a similar caching mechanism as a way to create locks with a TTL. In this way you set a flag with a TTL of 300 seconds, and no other cron worker will execute the job. The lock will automatically be released after the TTL has expired. This is conceptually very similar to the SQS option we discussed yesterday.

Also see Google's chubby: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf

Let me know if this helps, and feel free to ask questions. We are very aware that our services can be complex and daunting to both beginners and seasoned developers alike. We are always happy to offer architecture and best practice advice.

Best regards,
Ronan G.
Amazon Web Services
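To make the TTL-lock idea from the support reply concrete, here is a minimal sketch. It assumes a shared cache that offers an atomic "set only if absent" operation (memcached's add behaves this way); the TtlLockStore class is a hypothetical in-memory stand-in so the example runs on one machine, and the names run_if_lock_acquired and "cronlock:" are made up for illustration.

```python
import threading
import time


class TtlLockStore:
    """Hypothetical in-memory stand-in for a shared cache such as memcached."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires = {}  # key -> expiry time on the chosen clock
        self._mutex = threading.Lock()

    def add(self, key, ttl_seconds):
        """Set key only if it is absent or expired; True means this caller set it."""
        now = self._clock()
        with self._mutex:
            expiry = self._expires.get(key)
            if expiry is not None and expiry > now:
                return False
            self._expires[key] = now + ttl_seconds
            return True


def run_if_lock_acquired(store, job_name, ttl_seconds, job):
    """Run job() only if this caller wins the lock; return True if it ran."""
    if not store.add("cronlock:" + job_name, ttl_seconds):
        return False  # another worker holds the lock, so skip this run
    job()
    return True
```

Every webserver would call run_if_lock_acquired from its own crontab entry, and only the first caller inside the TTL window does the work. The TTL has to be shorter than the cron interval, otherwise the next scheduled run is skipped as well.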

"see google's chubby" is not an expression I expected on SO to be honest.

Nathan Buesgens

I think this video answers your exact question - cronjobs the aws way (scalable and fault tolerant):

Using Cron in the Cloud with Amazon Simple Workflow

The video describes the SWF service using the specific use case of implementing cronjobs.

The relative complexity of the solution can be hard to swallow if you are coming straight from a crontab. There is a case study at the end that helped me understand what that extra complexity buys you. I would suggest watching the case study and considering your requirements for scalability and fault tolerance to decide whether you should migrate from your existing crontab solution.

This is a great answer as it uses a well-supported tool from AWS, and SWF is a powerful product. The only downside, imo, is that SWF has a significant learning curve and it can be hard to do complicated things with it. At least that was my experience with the Java tutorials.

Maciej Majewski

Be careful with using SQS for cronjobs, as it doesn't guarantee that "one job is seen by only one machine". It only guarantees that "at least one" machine will get the message.

From: http://aws.amazon.com/sqs/faqs/#How_many_times_will_I_receive_each_message

Q: How many times will I receive each message? Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
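A rough sketch of what "design your system so that processing a message more than once does not create any errors" can look like: remember the IDs of messages already handled and skip repeats. The function process_once and the seen_ids set are hypothetical names; in a real multi-server setup the set would have to be a shared, durable store (for example a database table with a unique key on the message ID), since an in-process set only protects a single worker.

```python
def process_once(message_id, body, seen_ids, handler):
    """Call handler(body) unless message_id was already handled.

    seen_ids is any set-like object. A plain set is used here for
    illustration only; across several machines this check would need
    an atomic operation on shared storage instead.
    Returns True if the handler ran, False if the message was a repeat.
    """
    if message_id in seen_ids:
        return False  # duplicate delivery: do nothing
    handler(body)
    # Record the ID only after the handler succeeds, so a crash
    # mid-job leads to a retry rather than a silently lost job.
    seen_ids.add(message_id)
    return True
```

For example, calling process_once twice with the same message_id runs the handler once; the second call returns False.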

So far the only solution I can think of is to have one instance with a Gearman Job Server installed: http://gearman.org/. On that machine you configure cron jobs that submit a command to run your cron task in the background. Then one of your web servers (workers) picks up the task and executes it; Gearman guarantees that only one worker will take it. It doesn't matter how many workers you have (which is especially useful when you are using auto scaling).

The problems with this solution are:

The Gearman server is a single point of failure, unless you configure it with distributed storage, for example using memcached or some database.

With multiple Gearman servers, you then have to select the one that creates tasks via cronjob, so we are back to the same problem. But if you can live with this kind of single point of failure, Gearman looks like quite a good solution, especially since you don't need a big instance for it (a micro instance is enough in our case).

Well, the messages stay on the server after they have been received. It's up to the developer to delete them afterwards. While they are being processed, they cannot be accessed by another server.

@FrederikWordenskjold That is incorrect, even after a message has been given to one client it can still be given to another, since replication of SQS state is asynchronous. You can even be given a copy of a message "after" it was deleted!

This answer is outdated. There are 2 types of queues now. Use FIFO to get Exactly-Once Processing: A message is delivered once and remains available until a consumer processes and deletes it. Duplicates are not introduced into the queue. aws.amazon.com/sqs/features

Michael Currie

Amazon has just released new features for Elastic Beanstalk. From the docs:

AWS Elastic Beanstalk supports periodic tasks for worker environment tiers in environments running a predefined configuration with a solution stack that contains "v1.2.0" in the container name.

You can now create an environment containing a cron.yaml file that configures scheduling tasks:

version: 1
cron:
- name: "backup-job"          # required - unique across all entries in this file
  url: "/backup"              # required - does not need to be unique
  schedule: "0 */12 * * *"    # required - does not need to be unique
- name: "audit"
  url: "/audit"
  schedule: "0 23 * * *"

I would imagine that the guarantee of running each task only once in an autoscaled environment is provided by the message queue (SQS). When the cron daemon triggers an event, it puts that call in the SQS queue, and the message in the queue is only evaluated once. The docs say that execution might be delayed if SQS has many messages to process.
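On the application side, each url in cron.yaml just needs to be an endpoint that accepts an HTTP POST from the worker daemon. Below is a minimal sketch as a plain WSGI app, not the documented way to do it. The task functions are hypothetical, and the X-Aws-Sqsd-Taskname header (read here as HTTP_X_AWS_SQSD_TASKNAME) is my understanding of what the daemon sends for periodic tasks, so verify it against the current docs before relying on it.

```python
def run_backup(task_name):
    # Hypothetical job body; the real backup logic would go here.
    return "backup ran for " + task_name


def run_audit(task_name):
    # Hypothetical job body; the real audit logic would go here.
    return "audit ran for " + task_name


# Paths match the url entries in the cron.yaml shown above.
TASKS = {"/backup": run_backup, "/audit": run_audit}


def application(environ, start_response):
    """WSGI entry point: run the task registered for the POSTed path."""
    task = TASKS.get(environ.get("PATH_INFO", ""))
    if environ.get("REQUEST_METHOD") != "POST" or task is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown task"]
    # Assumed header carrying the "name" field from cron.yaml.
    task_name = environ.get("HTTP_X_AWS_SQSD_TASKNAME", "")
    result = task(task_name)
    # A 200 response tells the daemon the task finished successfully.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [result.encode("utf-8")]
```

Since delivery goes through SQS, it still seems wise to keep these handlers safe to run twice.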

Could you also include some content from the links?

Jaap Haagmans

I came across this question for the third time now and thought I'd chip in. We've had this dilemma for a while now. I still really feel AWS is missing a feature here.

In our case, after looking at the possible solutions, we decided we had two options:

Set up a cronjob server which runs the jobs that should only be run once at a time, auto scale it and make sure it's replaced when certain CloudWatch stats aren't what they should be. We use cloud-init scripts to get the cronjobs running. Of course, this comes with a downtime, leading to missed cronjobs (when running certain tasks every minute, like we do).

Use the logic that rcron uses. Of course, the magic is not really in rcron itself, it's in the logic you use to detect a failing node (we use keepalived here) and "upgrade" another node to master.

We decided to go with the second option, simply because it's brilliantly fast and we already had experience with webservers running these cronjobs (in our pre-AWS era).

Of course, this solution is meant specifically for replacing the traditional one-node cronjob approach, where timing is the deciding factor (e.g. "I want job A to run once daily at 5 AM", or like in our case "I want job B to run once every minute"). If you use cronjobs to trigger batch-processing logic, you should really take a look at SQS. There's no active-passive dilemma, meaning you can use a single server or an entire workforce to process your queue. I'd also suggest looking at SWF for scaling your workforce (although auto scaling might be able to do the trick as well in most cases).

Depending on another third party was something we wanted to avoid.

Tom

On 12/Feb/16 Amazon blogged about Scheduling SSH jobs using AWS Lambda. I think this answers the question.

Is it possible to add dynamic cronjobs or schedules using AWS lambda?

Yes, you can have the Lambdas invoked by CloudWatch Events. Time them as you see fit.

barbolo

If you already have a Redis service up, this looks like a good solution:

https://github.com/kvz/cronlock

Lukas Liesis

The "Amazon" way is to be distributed, meaning bulky crons should be split into many smaller jobs and handed to the right machines.

Use an SQS queue with its type set to FIFO to glue the pieces together and ensure each job is executed by only one machine. It also tolerates failure, since the queue will buffer messages until a machine spins back up.

FIFO Exactly-Once Processing: A message is delivered once and remains available until a consumer processes and deletes it. Duplicates are not introduced into the queue.
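As a sketch of how a scheduler could lean on FIFO deduplication: if every webserver tries to enqueue the same job at the same scheduled minute, giving all of those messages the same deduplication ID lets the queue drop the extras. The helper below only builds the parameters; it assumes they would be passed to an SQS send-message call (for instance boto3's send_message, together with a QueueUrl), and the function name and ID format are made up. Note that FIFO deduplication only applies within a 5-minute window.

```python
import json
from datetime import datetime, timezone


def build_fifo_message(job_name, scheduled_at):
    """Build send-message parameters for one scheduled run of a job.

    scheduled_at is the intended run time. It is truncated to the
    minute so every server computes the same deduplication ID.
    """
    slot = scheduled_at.strftime("%Y-%m-%dT%H:%M")
    return {
        "MessageBody": json.dumps({"job": job_name, "scheduled_at": slot}),
        # Messages in one group are delivered in order.
        "MessageGroupId": job_name,
        # Same job + same minute => same ID => the queue keeps one copy.
        "MessageDeduplicationId": job_name + ":" + slot,
    }


if __name__ == "__main__":
    run_time = datetime(2016, 2, 12, 5, 0, 30, tzinfo=timezone.utc)
    print(build_fifo_message("rss-feed", run_time))
```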

Also consider whether you really need to 'batch' these operations. What happens if one night's updates are considerably larger than expected? Even with dynamic resourcing, your processing could be delayed waiting for enough machines to spin up. Instead, store your data in SDB, notify machines of updates via SQS, and create your RSS feed on the fly (with caching).

Batch jobs are from a time when processing resources were limited and 'live' services took precedence. In the cloud, this is not the case.

Thanks - I like the direction that you are describing.

Be warned that SQS only guarantees that a message will be seen by a machine eventually, not that messages will only be seen by a single server. Anything you put into an SQS queue should be idempotent.

My cron job should run daily and with SQS you can only delay for up to 15 minutes. One option could be adding a custom tag to the message with the target time to execute it and put it back in the queue if that time isn't reached yet - but this really looks a dumb thing. Also I still need a cron job to initially populate the queue. It seems a chicken-egg problem :) But I still think that SQS is the right thing to use, because it guarantees scalability and fault-tolerance
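For what it's worth, the "put it back in the queue until the target time" idea from the comment above boils down to very little logic. This is only a sketch under the stated 15-minute (900 second) delay limit; next_step and its return values are hypothetical names, and the actual receive/send calls are left out.

```python
import math

MAX_SQS_DELAY_SECONDS = 900  # the 15-minute per-message delay limit


def next_step(target_epoch, now_epoch):
    """Decide what a worker should do with a message carrying a target time.

    Returns ("run", 0) once the target time has been reached, otherwise
    ("requeue", delay) where delay is how long to postpone the message,
    capped at the maximum delay the queue allows.
    """
    remaining = target_epoch - now_epoch
    if remaining <= 0:
        return ("run", 0)
    return ("requeue", min(math.ceil(remaining), MAX_SQS_DELAY_SECONDS))
```

A job due in an hour would be re-queued four times with a 900 second delay before it finally runs, which illustrates why the commenter calls this approach clumsy.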

"Batch jobs are from a time when processing resources were limited and 'live' services took precedence. In the cloud, this is not the case." This is true for some but not all activity. For example, processing traffic logs is something that is better as a batch process than live.

I'm very late to the discussion, but I think a better way would be to have a scheduled CloudWatch event act as a cron "ping". This can publish to an SNS topic, which is subscribed to by a queue, which itself can be a FIFO queue if you need exactly-once delivery. Of course there are still complications, but this looks like a nice system to me!

Rama Nallamilli

Why would you build your own? Why not use something like Quartz (with Clustered Scheduling). See documentation.

http://quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering

I used Quartz.NET in a SaaS solution that relied heavily on scheduled tasks. Some were system maintenance tasks, but most were activities scheduled by end users. All of our tasks wrote to message queues (amq) for which we had any number of idempotent services. The API is very good and allows for powerful schedules. We did not cluster multiple Quartz instances, but it does support that.

Patrick Steil

We have one particular server that is part of our web application cluster behind an ELB, and it is also assigned a specific DNS name so that we can run the jobs on that one specific server. This also has the benefit that if a job causes that server to slow down, the ELB will remove it from the cluster and then return it once the job is over and the server is healthy again.

Works like a champ.

Kevin Eid

One method to verify that your cron expression works the Amazon way is to run it through the events command. For example:

aws events put-rule --name "DailyLambdaFunction" --schedule-expression "<your_schedule_expression>"

If your schedule expression is invalid, the command will fail.

More resources: https://docs.aws.amazon.com/cli/latest/reference/events/put-rule.html
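One thing that trips people up here is that the schedule expression is not a plain 5-field crontab line: it is wrapped in cron(...), has a sixth field for the year, and needs ? in either the day-of-month or the day-of-week position. Below is a deliberately limited sketch of a converter. The function name is made up, it only handles crontab lines whose day-of-week field is *, and it does not check whether every crontab feature (steps, lists, names) is accepted on the AWS side. Also keep in mind the rule is evaluated in UTC.

```python
def linux_cron_to_aws(expr):
    """Convert a simple 5-field crontab schedule to an AWS cron(...) expression.

    Only schedules with "*" in the day-of-week field are supported.
    Day-of-week numbering differs between the two formats, so anything
    else is rejected rather than silently mistranslated.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(
            "expected 5 fields: minute hour day-of-month month day-of-week"
        )
    minute, hour, day_of_month, month, day_of_week = fields
    if day_of_week != "*":
        raise NotImplementedError("day-of-week schedules need manual conversion")
    # "?" fills the day-of-week slot; the trailing "*" means every year.
    return "cron({} {} {} {} ? *)".format(minute, hour, day_of_month, month)


if __name__ == "__main__":
    print(linux_cron_to_aws("0 23 * * *"))
```

The output can then be pasted into the put-rule command above, which will reject it if the expression is still invalid.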

johnnyodonnell

If you're willing to use a non-AWS service, then you might check out Microsoft Azure. Azure offers a great job scheduler.

wanghq

Since no one has mentioned CloudWatch Events, I'd say that it's the AWS way of doing cron jobs. It can trigger many kinds of targets, such as a Lambda function or an ECS task.
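As a rough illustration of the Lambda route, a scheduled rule invokes the function with a small JSON event, and the handler decides which job to run. This sketch assumes the event carries a "detail-type" of "Scheduled Event" and the rule's ARN in "resources"; the rule names and job functions are hypothetical placeholders.

```python
def create_rss_feeds():
    # Hypothetical job body.
    return "rss feeds created"


def send_emails():
    # Hypothetical job body.
    return "emails sent"


# Maps a scheduled rule name (made up for this sketch) to the job it triggers.
JOBS = {
    "rss-every-hour": create_rss_feeds,
    "emails-daily": send_emails,
}


def lambda_handler(event, context):
    """Entry point invoked by the scheduled rule."""
    if event.get("detail-type") != "Scheduled Event":
        raise ValueError("not a scheduled event")
    # The rule ARN ends in ".../rule/<rule-name>".
    rule_name = event["resources"][0].rsplit("/", 1)[-1]
    job = JOBS.get(rule_name)
    if job is None:
        raise KeyError("no job registered for rule " + rule_name)
    return {"rule": rule_name, "result": job()}
```

One function with several rules keeps all the "crontab" in one place, though a separate function per job would work just as well.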
