Archive for January, 2012

Performance Testing – Analyzing the AUT, Part 2

In my previous post, I talked about the need to consider more than just one of the common objectives in our performance test scenarios, such as hits/pages per second, transactions per second, specific transactions (such as Login) and peak concurrent users. The key point I wanted to make was that focusing on only one of these will almost certainly produce misleading results, and there is nothing more damaging to the credibility of the performance testing professional than having to defend invalid test results.

Today I want to talk about concurrent users in more detail, as this is one of the more common test objectives and the subject of much debate within the performance testing community. The problem I have isn’t with the goal itself, but rather with what the goal means and how to ensure that we’re using not only the right number of concurrent users, but under the right circumstances. The same applies whether you are running Peak, Average, Stress, or any other type of test whose definitions I will not get into here; there is already plenty of discussion on that topic.

So you want to test 100 concurrent users, but doing what?

The question we should be asking our business is not how many concurrent users they want to run, but how those users behave. The answer has multiple factors, such as the time between steps or requests, and the time between iterations of a session. First we need to ask (I’ll sketch one way to capture the answers right after this list):

  • Are all transactions performed equally, or is there a weighting to each transaction?
  • Are all sessions roughly the same, or should there be a significant amount of randomization in their behavior?
  • Is there a compelling reason to include transactions in the mix beyond the “top ten”?
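To make those answers concrete, I like to capture them in a simple workload model before any scripting starts. Here is a minimal sketch of the shape such a model might take; every transaction name, weight and timing value in it is hypothetical:

    # A sketch of a workload model capturing the answers to the questions
    # above. Every name, weight and timing value here is hypothetical.

    workload = {
        "transactions": {                  # weighted, not an equal split
            "balance_inquiry": 0.85,
            "transfer_funds": 0.06,
            "pay_bill": 0.05,
            "update_profile": 0.04,        # outside the "top ten", kept in
                                           # because it is known to be costly
        },
        "think_time_seconds": (5, 45),         # randomized between steps
        "iteration_delay_seconds": (60, 600),  # randomized between sessions
        "fixed_steps": ("login", "logout"),    # always first and last
    }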

Scott Barber, of PerfTestPlus Inc., has a handy way of dealing with these that he outlines over at his blog (deep linking not allowed, just hit enter on the URL again), and it’s hard to shake the mnemonic once you’ve read it. Go ahead and read that, as well as the other two posts he has which are linked from that page. I’ll wait here…

You back? Great.

Once again, I don’t want to rehash something that has already been stated so eloquently, so instead I will offer my own experiences on the subject. First, don’t just look at the goal of X number of concurrent users and nothing else. That should be obvious to anyone who has ever run a performance test, but in case it isn’t, I’ll explain why. Take an example where a script has 10 steps, executed in the same order with the same delay between steps, and once the script is finished it starts again from the beginning. That script might look something like this when executed with other scripts during a test.

Everyone in a straight line!

(In all my Word-art glory!)

Now let’s look at the same basic script, but with a randomized ordering of 8 of the 10 steps (since Login and Logout have to happen at specific places), randomized time between steps, and a random delay between script iterations.

Anarchy, I tell you.

Guess which one is likely to be more realistic? (ethay econdsay iptscray <– pig latin answer key)

These two scenarios will have very different results, but only the second one will be realistic.
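If you’re curious what that second, randomized script looks like in practice, here is a minimal sketch in Python; the step names, timing ranges and the do_step placeholder are all hypothetical, and a real load testing tool would express this in its own scripting language:

    import random
    import time

    # A sketch of the randomized script: fixed login/logout, shuffled
    # middle steps, random think times and random iteration pacing.

    MIDDLE_STEPS = ["view_inbox", "open_item", "search", "view_report",
                    "edit_item", "view_history", "download_statement",
                    "settings"]

    def do_step(name):
        print("executing", name)  # stand-in for issuing real requests

    def one_session():
        do_step("login")                                 # always first
        for step in random.sample(MIDDLE_STEPS, len(MIDDLE_STEPS)):
            do_step(step)                                # randomized order
            time.sleep(random.uniform(5, 45))            # random think time
        do_step("logout")                                # always last

    for _ in range(3):  # a real test runs for the scenario's duration
        one_session()
        time.sleep(random.uniform(60, 600))  # delay between iterations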

The information we tend to get from analytics tools such as Google Analytics or WebTrends shows the average user session, not the standard deviation of those sessions’ duration and pacing. A good tool should also show us not only the top transactions or pages but their percentage or weighting, so we don’t have to assume all users behave the same way. Of course there will be exceptions, such as a customer registration process which spans multiple pages in a specific sequence. These types of transactions can and should be grouped together into a single “business process” that is somehow represented in our scripts, with each business process treated as a separate transaction to be randomized accordingly. I find the best way to get accurate information on exactly what users are doing in an application is by taking a sample of production logs and analyzing them myself, but there are a few tools out there capable of helping you with the work.
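If you want to try the log-analysis route yourself, here is a rough sketch of the idea against a combined-format web server access log. The file path is hypothetical, treating one URL path as one transaction is a simplifying assumption, and real logs usually need more careful parsing:

    import re
    from collections import Counter

    # A sketch: derive request weightings from an access log.
    request_re = re.compile(r'"(?:GET|POST) (\S+)')

    counts = Counter()
    with open("access.log") as log:          # hypothetical path
        for line in log:
            match = request_re.search(line)
            if match:
                path = match.group(1).split("?")[0]  # drop query strings
                counts[path] += 1

    total = sum(counts.values())
    for path, n in counts.most_common(10):   # the "top ten", with weightings
        print("{}: {:.1%}".format(path, n / total))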

That’s it for now. Remember, keep it real and you will be able to stand by your test results with confidence!

Shane Evans

Performance Testing – Analyzing the AUT, Part 1

…or, “What should my performance test scenario look like?”

This will be my first post in a series dedicated to sharing my thoughts on performance testing applications to better serve the business, which should be the focus of any comprehensive test plan, because ultimately a business application exists to serve that business. This may seem obvious, but keeping that truth in mind throughout the process will shape how you define a test plan, script user behavior, and analyze the outcome.

In this article I will discuss the factors that should be considered when defining the test strategy. This will not be a rehash of the many articles on the subject that describe the differences between load, performance, peak, smoke, stress and a multitude of other test definitions related to the same set of objectives. Instead I want to talk about what factors should go into defining these tests, and where the information should come from.

We’ll talk about:

  • Hits/pages/requests per second versus “transactions” per second
  • Average session length
  • Transaction weightings
  • Concurrent users, peak vs average
  • Failover/disaster scenarios

My goal will be to clarify why each of these factors is important to building better performance test scenarios, and where you can find the data for each. We’ll also talk about how to balance some of these against each other, as there will sometimes be conflicts between them (transactions per second vs. concurrent users, for example). Right then, let’s make this quick…

TL;DR Version

Performance Testing shouldn’t just be about how many pages per second your application can crank out, or how many concurrent users it can support, or how many transactions can be processed. Performance tests should accurately reflect the usage of the application under test by actual users in production, or you aren’t seeing the whole picture. You might get better results than you should, or you might get bad results when things are actually fine. Either possibility puts your reputation on the line during those last minute “go or no-go decision” situations.

Hits/Pages/Requests per second versus “transactions” per second

This should be a question the test engineer determines for the business, not the other way around. I say that because I have seen test engineers ask the business analyst, or whoever is representing the users, “What is your target hits per second for this test?” I shudder every time I see a seasoned performance engineer ask a question like this of someone who wouldn’t know the relevance of a hits/second metric if it hit/second them right in the nose. Hits per second is a measure of the throughput of your web server and network, but it is not a true measure of application performance without the context of the pages being served.

How do we find this information? Check production logs. If the system doesn’t exist in production yet, compare to a similarly utilized application if one is available. If that also isn’t possible, leave it out of the equation altogether; if you don’t know your target hits or pages per second, that will just have to be one of the things your tests aim to uncover. Instead, aim for something you can ask the business users for, such as how many transactions they expect to see in a day, spread across how many users, over what period. This will be the basis for the pacing of each virtual user in your test.

If we know the total number of business transactions the business expects to see in a period (T), we can divide by the length of that period in seconds to get the number of transactions per second we should target (Ts).

If T = 20,000 transactions per 24 hour period, then Ts = 20,000 / 86,400, or roughly 0.2315 transactions per second.

Keep in mind this is only for a single defined transaction, so the number will be higher when considering other transaction types. Now we just need to figure out how many users we need active at any point in time (concurrent) to produce that volume of transactions per second. But before we do that, we need to know how long a user is active on the system. Why? Because 20 users hammering away as fast as electrons will allow is very different from 200 users with a more realistic approach to generating those transactions. Also keep in mind that this example is just that, an example; more than likely you will be dealing with slices of time much shorter than 24 hours, usually just an hour or a few.

Average Session Length

The question here is, “How long does the average user stay active on the application?” That is, before logging off and waiting some undetermined amount of time before starting the process again. This is important because for most modern applications, the Logon process is the single most resource-intensive activity a user can perform, due to the memory and other resource allocation performed by the application. So we never want to Logon more often than we actually expect to see in production, or our results might be very far from accurate.

For example, let’s say you are trying to produce the number of transactions per second shown in the equation above, and you know that it takes 5 minutes for a single user to produce a transaction (d = 300 seconds), that is, the script executing from start to finish with one transaction occurring at some point. Since each user produces one transaction every d seconds, we can multiply the target transactions per second by d to determine the number of concurrent users (U). Borrowing from the previous example, we have:

U = Ts × d

U = 0.2315 × 300

U ≈ 69.4

Now we’re getting somewhere! Now we know that if it takes 5 minutes to generate a transaction and our target is 0.2315 transactions per second, it will take 69 or so concurrent users to do it. (Sanity check: 69.4 users each completing a transaction every 300 seconds produce 69.4 / 300 ≈ 0.2315 transactions per second, or about 20,000 per day.)
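For what it’s worth, the whole calculation fits in a few lines of Python; this helper is just a sketch of the arithmetic above, not something from any testing tool:

    # A sketch: turn business volumes into a concurrency target.
    def target_concurrency(transactions_per_period,
                           period_seconds,
                           seconds_per_transaction):
        """Concurrency = arrival rate * time in system (Little's Law)."""
        ts = transactions_per_period / period_seconds  # transactions/second
        return ts * seconds_per_transaction

    # 20,000 transactions over 24 hours, 5 minutes per transaction:
    print(target_concurrency(20000, 24 * 3600, 5 * 60))  # ~69.4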

Again, if this information (session length) exists in some form of production log, either from a business intelligence type of application such as WebTrends or just in web server logs, be sure to use it in your test for more representative results. Without it, we may have the right number of users creating the right mix of transactions, but with a very different ratio between Logon and the rest of our business transactions, which will produce unwanted results.
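As a rough illustration of pulling session length, including that standard deviation mentioned earlier, out of raw logs, here is a sketch that assumes you have already parsed each line into a (session_id, timestamp) pair, which is not always possible:

    from collections import defaultdict
    from statistics import mean, stdev

    # A sketch: session durations from pre-parsed log events.
    # The sample data and input format are assumptions.
    events = [("s1", 0), ("s1", 240), ("s1", 300),
              ("s2", 60), ("s2", 460)]

    sessions = defaultdict(list)
    for session_id, ts in events:
        sessions[session_id].append(ts)

    durations = [max(stamps) - min(stamps) for stamps in sessions.values()]
    print("mean {}s, stdev {:.0f}s".format(mean(durations), stdev(durations)))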

Transaction Weightings

This next area is one I have had many… “challenging” conversations about with business and technical people, because it involves understanding not only the volumes of transactions on a system but also the types of transactions, and the impact of different types of transactions on application performance. One example is the Logon process I explained above. In most application code, the Logon process is the most CPU and memory intensive process a user can perform, and generating too many Logon transactions during a test can result in poor performance that is not representative of real world usage. Another example might be very large transaction histories, or wildcard searches (such as SELECT * FROM…). These can lock up system resources such as thread pools, database connection pools and session memory while the application waits for a response from the database, and in a worst case scenario even lock up the database!

The point is, we need to understand which transactions are typical of an average user session, and in what percentages, in order to build scripts that accurately reflect that session. If you choose the top 10 business transactions performed by your users and run all of them equally, meaning 10% of the time each, I can assure you that 1) you will be logging on far too frequently and 2) your test results are not going to look anything like what you will see once the application goes live. Why? Because in most applications, the load is heavily weighted towards just a few of those transactions, with the remainder occupying a very small percentage of the total combined.
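Putting that into practice is simple enough: let the weights drive the selection of the next transaction instead of cycling through the top ten evenly. A minimal sketch, with hypothetical names and percentages loosely shaped like the banking example below:

    import random

    # A sketch: pick the next transaction by production weighting
    # rather than a flat 10% each. Names and numbers are hypothetical.
    weights = {
        "balance_inquiry": 85,
        "transfer_funds": 4,
        "pay_bill": 3,
        "view_statement": 3,
        "everything_else": 5,
    }

    names = list(weights)
    picked = random.choices(names, weights=list(weights.values()))[0]
    print(picked)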

At this point I realize I’ve written the word “transaction” far too many times in the preceding paragraphs, so I will offer a real-world example.

In my previous life, I was testing an Internet Banking application for a very large financial institution. Every quarter this application would see another release, with one or more new “features” added which would allow customers to do things they would previously have had to go to a brick-and-mortar location for. The Business Analysts thought this was Nobel prize winning stuff, and that they were somehow saving the world with every release. I always chuckled when it came to the performance test plan, however, because we simply ran the baseline test every time. Why not include the new stuff? Oh, we’d throw in one or two of the new transactions (*sigh*, did it again), which was about how many they would see in production. You see, 85% of all Internet Banking requests are for one thing: Balance Inquiry. The other two dozen types of activities the customer could perform occupied the other 15% combined, with almost all of them falling below 5% of the total. So while we were sure to include the new stuff for every release, the performance test scenario almost never changed.

To give another example, a workflow application I was consulting for had a similar transaction mix, in that something like 90% of the total activity was refreshing the Inbox. The remaining nine of the top ten transactions combined to form the other 10% of the mix, but there was one transaction that, when performed, would lock up the system almost completely until it finished. If we hadn’t included that transaction in the performance test scenario, it would have been a disaster upon reaching production.

The key is understanding what to include, and how much. Again, this information could come from the backend of the system in production, or from WebTrends or Google Analytics, if such information exists.

Now this is getting a little longer than I had originally intended, so I’ll wrap it up for now and come back for Part 2 in a few days to talk about Peak Concurrency and Disaster scenarios. Remember, if you aren’t testing real world scenarios, why even bother testing?

Thanks for reading!

 

Edit 01/11/12: Fixed spelling errors…