Weighing the World Wide Web: The art and science of load testing for unique high-spike conditions

Radha Nagaraja, National September 11 Memorial and Museum, USA

Abstract

The website of an organization is its global public face in the modern digital age, and smooth functioning of the website is key to meeting an organization's mission. The appropriate Web server infrastructure and configuration are, by common-sense wisdom, a function of the expected traffic patterns and the content served on the website. However, a challenging wrinkle in Web server provisioning is posed by high spikes in traffic due to specific unique events. A particularly dramatic instance of this challenge is faced annually by the 9/11 Memorial and Museum on September 11th, when website traffic increases by close to 35x. To ensure smooth functioning of the website, the annual run-up to the expected high-spike day utilizes a myriad of tools, including best-estimate characterization of expected load; provisioning of a surrogate server infrastructure set up to best mimic the website configuration (and any special content/functionalities planned for the specific day when the high spike is expected) for the future high-spike time; incorporation or appropriate modeling of the caching and load balancing mechanisms expected to be utilized; user simulation and browser automation technologies to realistically mimic spike traffic patterns; and monitoring of quantitative metrics of website performance during load tests over adequate time intervals to uncover subtle issues that might manifest only under sustained load. Applying the set of strategies above has proven to be a time-tested and robust policy for the 9/11 Memorial and Museum to meet the huge spike in Web traffic on September 11th of each year. In this presentation, the strategies above will be covered in detail, including an overview of the techniques used to estimate future high-spike traffic, rig up surrogate servers, create realistic mock traffic, and best ensure that a website remains performant under high-spike conditions.

Keywords: website, load test, traffic spikes, caching, performance tests, Selenium

Introduction

As organizations increasingly shift emphasis to their online presence, their globally accessible digital public face is as crucial (or perhaps even more so) to meeting the organization’s mission as their physical “real-world” space and infrastructure. The online world is globally connected, and while wait times (e.g., in queues) are, by necessity, considered acceptable in physical facilities, webpage load times can range from a minor nuisance to a major hampering factor that renders a website unusable. A well-behaved website should load quickly and scale well and, at the same time, be cost-effective.

Balancing these seemingly contradictory requirements crucially relies upon an accurate sizing of the Web server infrastructure and configuration that is required given a website’s expected traffic patterns. This in turn relies upon accurate estimation and (more crucially) prediction of the traffic to the website, and thereafter testing of the website performance under the predicted traffic (Kattepur & Nambiar, 2015; Draheim et al., 2006; Bhatti & Kumari, 2015; Shojaee et al., 2015). A primary motivation for considering predicted traffic is that website traffic is rarely constant in time and could, for example, be expected to vary depending on times of day, over the year, and due to specific unique events. A particularly dramatic instance of changes in traffic to a website is faced by the 9/11 Memorial and Museum annually on September 11th. On this day, a combination of multiple factors, including special events, new content, and higher public awareness, results in a traffic spike close to 35x the usual year-round traffic. To prepare for such traffic spikes, it is vital to test the Web server infrastructure and configuration to ensure that they can handle the predicted high-traffic conditions.

Since these traffic spikes are over an order of magnitude higher traffic than typical conditions, such testing necessitates a simulation-based approach wherein the predicted traffic spike is simulated through “virtual” users. However, it is crucial that such simulation-based testing should, to the greatest extent possible, accurately model the characteristics of real users and should be performed on essentially an exact replica of the actual Web server infrastructure. This is achieved through a combination of several tools, which can together be described under the umbrella term of “load testing framework.” In this paper, we provide an overview of the load testing framework that has proven effective over the last several years in smoothly handling the annual traffic spikes experienced by the 9/11 Memorial and Museum. This includes the methodologies that we utilize to estimate expected load; modeling specific user interaction sequences with the website; scripting these interactions to create virtual users; spinning up large numbers of virtual users in parallel to simulate varying traffic patterns; and quantitatively measuring performance using an exact replica (that is created specifically for use in load testing) of the production Web server infrastructure.

This paper is organized as follows. We first provide a short summary of one particular technology stack that we consider in this paper for specificity (although the presented load testing techniques are applicable/extensible to a wide range of technology stacks). We next discuss the traffic spike that we experience annually, which is a primary motivation for the extensive load testing that we perform. We then describe the techniques utilized to create a replica Web server infrastructure and “simulated,” but as close to real as possible, traffic patterns to perform the load testing under essentially real-world conditions. The actual running of the load tests and the analysis of quantitative metrics of website performance are discussed next. Finally, we provide concluding remarks, acknowledgements, and references. While the discussion in this paper is presented in the context of our annual spike, the methodologies described here are broadly applicable and can also be utilized to make a website more performant/responsive and resilient to unpredictable spikes (e.g., by applying the load testing methodologies at twice the normal load).

Technology stack

The website of the 9/11 Memorial and Museum is hosted on a LAMP (Linux, Apache, MySQL, PHP) stack on the cloud and is powered by multiple Web heads and dedicated database servers behind a set of dedicated load balancers. Our overall technology stack is illustrated in Figure 1. The load balancers are used to automatically distribute load among the Web servers. The database servers are configured as primary and fallback nodes. The website utilizes the Drupal content management system (https://www.drupal.org; https://www.acquia.com) and serves a mix of static and dynamic content to both anonymous and authenticated users. Additionally, new content such as videos/webinars and other rich content is periodically added, especially around 9/11 each year.

Multiple levels of caching are utilized to boost the website’s performance. These include a Varnish cache, Memcache, and a third-party caching system that also provides protection against Distributed Denial of Service (DDOS) attacks. All traffic to the website flows through a DDOS protection server (e.g., https://www.dosarrest.com), which also functions as a caching layer. Pages cached on the DDOS protection server are served directly from that layer without hitting our actual Web servers. A significant part of the Web content is cached, especially all static content such as images, which account for a large percentage of the overall Web traffic. However, Web pages related to authenticated users, such as the login, sign-up, and checkout pages, cannot be cached. The DDOS protection system monitors for possible DDOS attacks and, in the event of an attack, takes defensive measures such as termination of connections, blacklisting of IPs, etc.
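
As a quick illustration, whether a page is being served by an outer caching layer can often be spot-checked from response headers. The sketch below is a minimal example, assuming a generic cache that exposes Age and X-Cache headers; the host name and header names are placeholders, since the exact headers vary by provider:

```python
import requests  # third-party: pip install requests

PAGES = {
    "likely cached": "https://www.example.org/",            # static landing page
    "not cacheable": "https://www.example.org/user/login",  # authenticated flow
}  # placeholder URLs, not the real site

for label, url in PAGES.items():
    resp = requests.get(url, timeout=10)
    print(label, url)
    print("  Cache-Control:", resp.headers.get("Cache-Control"))
    print("  Age:", resp.headers.get("Age"))          # nonzero Age suggests a cache hit
    print("  X-Cache:", resp.headers.get("X-Cache"))  # e.g., HIT/MISS, if the cache exposes it
```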

Several monitoring tools, both on the server side and the client side, are used to monitor the website and to continuously measure multiple quantitative analytics (https://analytics.google.com; https://clicky.com; https://chartbeat.com; https://newrelic.com) of website performance. On the server side, the Web traffic reaching the Web servers (i.e., traffic that is not directly served by the DDOS protection system) is monitored, including capturing of statistics on page loads, server uptime, module-level statistics (for modules in the Drupal content management system), database queries, transactions, etc. Also, the load on the servers is continuously monitored, including CPU, memory, and network usage (using scripts that query server load using command-line tools such as free, top, netstat, etc.) and database performance (based on measurement of latencies, monitoring of error logs, etc.). Additionally, the DDOS protection layer provides analytics of the traffic to the website. On the client side, JavaScript-based tools (essentially scripts embedded into Web pages to collect client-side information and transmit it to the online endpoints of the analytics servers) are used to capture visitor statistics. A vast array of visitor statistics is captured, including the number of concurrent visitors to the site, the total number of visitors (new and returning) over a time interval, the average time spent on the website per visit, bounce rate, top links, locale, etc. The analytics tools enable aggregating the captured information according to various combinations of criteria, such as aggregation by webpage (to identify the number of visits per URL), by time (e.g., by day over a month/year, by time of day, etc.), and by geographical location of the website visitor.
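
A minimal sketch of such a server-side load sampler is shown below. It uses the cross-platform psutil library in place of parsing free/top/netstat output, and the sampling interval, sample count, and CSV layout are illustrative assumptions:

```python
import csv
import time

import psutil  # third-party: pip install psutil

SAMPLES = 240           # e.g., one hour at 15-second intervals (assumed)
INTERVAL_SECONDS = 15

with open("server_load.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_percent", "mem_percent",
                     "net_bytes_sent", "net_bytes_recv"])
    for _ in range(SAMPLES):
        net = psutil.net_io_counters()
        writer.writerow([
            time.time(),
            psutil.cpu_percent(interval=1.0),  # 1-second CPU utilization sample
            psutil.virtual_memory().percent,   # memory utilization
            net.bytes_sent,                    # cumulative network counters
            net.bytes_recv,
        ])
        f.flush()
        time.sleep(INTERVAL_SECONDS - 1)       # cpu_percent already waited 1 s
```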

In this paper, we discuss the load testing and website analytics techniques in the context of the technology stack described above. While these techniques are widely applicable to other technology stacks as well, it is important when applying them and provisioning server infrastructure to note that the required server infrastructure, even under relatively constant load, depends on the software stack utilized and the type of content being served. Technology stacks based on different software systems (including the Web server, the content management system, caching techniques, etc.) can require somewhat different levels of resources for smooth functioning of the website. Also, websites that serve primarily static content to anonymous users can attain acceptable performance with substantially smaller levels of resources, since they benefit from caching much more significantly than highly dynamic websites that primarily serve authenticated users. Here, we focus on the stack described above, based on LAMP and the Drupal content management system: a popular and widely used combination that also powers our public website at the 9/11 Memorial and Museum. The underlying techniques and observations discussed in this paper carry over to other technology stacks, with the important caveat that quantitative estimations of required infrastructure are stack-dependent and site-dependent, and each website would benefit from its own specific load testing based on the methodologies discussed in this paper.

Figure 1: overall technology stack

The annual traffic spike

Every year on September 11th, a somewhat-anticipated, but always impressive, traffic spike is experienced by the website of the 9/11 Memorial and Museum, when website traffic increases by close to 35x. The variation of the number of visitors per day over the entire year is shown in Figure 2. The spike on 9/11 is owing to multiple factors, including increased public interest/awareness on that day, special content that is posted/linked on that day such as special videos and webinars, and special events such as the annual reading of the names of the victims of the 9/11/2001 and 2/26/1993 attacks on the World Trade Center in New York City. This annual reading of names is usually covered live on television, which tends to increase public interest and awareness and contributes to the traffic spike to the website.

While this traffic spike happens every year and can therefore be anticipated, ensuring that the spike does not cause website hiccups requires extensive preparation, as described in the following sections of this paper. One particular complication in testing the website in preparation for this traffic spike is that special content/functionalities are typically planned to be published online on that specific day. Hence, advance testing of the website should take into account that content/functionalities will be somewhat different when the actual high-spike conditions are experienced. Additionally, website traffic patterns on that day differ somewhat from other days during the year, in the sense that some URLs that typically have relatively high traffic may have lower traffic on that day and vice versa (with special content specific to that day being one such case). The traffic distribution over the top 25 most visited URLs over the entire year and on 9/11 is shown in Figure 3, which illustrates the variation of traffic patterns on 9/11 compared to the rest of the year.
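
Comparisons like the one in Figure 3 can be generated directly from analytics exports. The following is a minimal sketch, assuming per-URL visit counts have been exported to CSV files with url and visits columns; the file names and layout are illustrative:

```python
import csv

def shares(path, top_n=25):
    """Percentage of total traffic captured by each of the top_n URLs."""
    with open(path) as f:
        counts = {row["url"]: int(row["visits"]) for row in csv.DictReader(f)}
    total = sum(counts.values())
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {url: 100.0 * visits / total for url, visits in top}

year = shares("visits_full_year.csv")   # assumed analytics export, entire year
spike = shares("visits_sep11.csv")      # assumed analytics export, 9/11 only
for url in sorted(set(year) | set(spike)):
    print(f"{url:50s} year={year.get(url, 0.0):5.1f}%  9/11={spike.get(url, 0.0):5.1f}%")
```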

Figure 2: variation of traffic per day over the entire year. The large spike is on September 11th and recurs every year. The plot is normalized, with 1 indicating the peak value.

 

Figure 3: percentage of traffic to each of the most frequently visited URLs over the entire year and on 9/11. The blue bars and the green bars show the percentage distributions over the entire year and on 9/11, respectively. The traffic distributions for the entire year and for 9/11 are seen to be quite different, and several URLs that have relatively high traffic over the year do not have high traffic on 9/11. For instance, webpages that provide travel directions to the Memorial and Museum have lower traffic on 9/11, since the Memorial and Museum opens to the general public later in the day on 9/11 than during the rest of the year. On the other hand, webpages with special content for that day have high traffic on 9/11.

Resource provisioning and load testing

For load testing of the website, especially during the run-up to the traffic spike on 9/11, an exact clone of the production infrastructure is set up. A best-estimate characterization of expected load is developed based on a multitude of input data such as historical patterns (including the observed spikes in the previous years), unique drivers such as mailing list promotions specific to a particular year, and load patterns in the preceding weeks compared to the same weeks in the previous year. Thereafter, the methodologies described below are applied to perform load testing under simulated traffic that models the traffic spike. Depending on the load testing results, the amount of upsizing required of the infrastructure is determined, and additional load testing is performed. This iterative process is continued until acceptable performance is seen in the load tests. The techniques utilized for load testing and quantitative measurement of website performance during the load testing are described below.
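
As a stylized illustration of how such a best-estimate load figure can be composed, the sketch below combines a baseline level with the historical spike multiplier, recent year-over-year growth, and an assurance margin; all numbers are illustrative placeholders, not the Memorial’s actual figures:

```python
# All numbers below are illustrative placeholders, not actual traffic figures.
baseline_daily_visits = 10_000        # typical non-spike day (assumed)
historical_spike_multiplier = 35      # observed ~35x spike on 9/11
yoy_growth = 1.10                     # recent weeks vs. same weeks last year (assumed)
assurance_margin = 1.25               # headroom above the best estimate (assumed)

expected_spike_visits = baseline_daily_visits * historical_spike_multiplier * yoy_growth
target_test_visits = expected_spike_visits * assurance_margin

print(f"Best-estimate spike-day visits: {expected_spike_visits:,.0f}")
print(f"Load-test target visits:        {target_test_visits:,.0f}")
```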

When setting up the website on the clone server infrastructure, we attempt to best mimic the website configuration for the future high-spike time including special content/functionalities planned for that day. To ensure that load testing analysis on the clone server infrastructure is relevant to the production infrastructure, the configuration of the entire technology stack is replicated to make the clone and production stacks as close to identical as possible. For example, it is verified that all the caching mechanisms expected to be utilized (including in the content management system, hosting system, denial-of-service protection system, etc.) and load balancing mechanisms are incorporated. The overall workflow for setting up the load test environment is illustrated in Figure 4.
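
As one illustration of such verification, response behavior on the clone can be spot-checked against production for a list of representative URLs. The sketch below assumes placeholder host names and an illustrative set of headers to compare; actual parity checks also cover server-side configuration that is not visible in responses:

```python
import requests  # third-party: pip install requests

PROD = "https://www.example.org"            # placeholder production host
CLONE = "https://loadtest.example.org"      # placeholder clone host
PATHS = ["/", "/visit", "/user/login"]      # representative URLs (assumed)
HEADERS_TO_COMPARE = ["Cache-Control", "Content-Type", "X-Drupal-Cache"]  # illustrative

for path in PATHS:
    prod = requests.get(PROD + path, timeout=10)
    clone = requests.get(CLONE + path, timeout=10)
    for header in HEADERS_TO_COMPARE:
        if prod.headers.get(header) != clone.headers.get(header):
            print(f"MISMATCH {path} [{header}]: "
                  f"prod={prod.headers.get(header)!r} clone={clone.headers.get(header)!r}")
```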

Figure 4: setting up of the load test environment

Based on the analysis of the typical traffic patterns to the website, historical traffic patterns during the spike, and the specific URLs and visitor use cases that are of most importance (including special content for the high-spike day), a set of load test scenarios is formulated. The basic workflow for defining a simple load test scenario is illustrated in Figure 5 and comprises the following conceptual steps:

  • Define the essential metadata for the load test: This includes, for example, the website base URL, any required authorization details, cookie handling policy (e.g., definitions of user-defined cookies, whether to clear cookies after each run, etc.), and cache handling policy (e.g., cache expiration policy, whether to clear cache after each run, etc.).
  • Define basic load-test parameters: This includes the number of concurrent users, the ramp-up period, the total duration of the load test, etc. Here, ramp-up period refers to an initial time interval during which the number of concurrent users is gradually increased from an initial number to the defined (steady-state) number of users. This set of load-test parameters can be overridden during actual load testing by defining a general temporal profile of the load characteristics.
  • Define sequences of URLs (or, more generally, website interactions) to be tested: This comprises essentially a scripted use case for a simulated visitor to the website. The URLs would typically be defined based on the Web pages with historically highest traffic, as discussed above. More generally, this would correspond to user sessions with anonymous and authenticated segments defined to simulate specific patterns of page loads and other website interactions that are expected to have high traffic or that are particularly crucial for the organization’s mission. One particularly convenient way to define sequences of website interactions is using browser recording tools. When manually executing the intended visitor use cases, Web browser plugins such as the ones from Selenium (http://www.seleniumhq.org; http://www.seleniumhq.org/projects/ide) and BlazeMeter (https://www.blazemeter.com; BlazeMeter, 2018) can be used to record the sequences of website interactions, which are then used as the scripted use cases for simulated groups of concurrent users during load testing (a minimal scripted sketch of such a use case is shown after this list).
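
For illustration, below is a minimal standalone sketch of a replayed use case, written directly against Selenium WebDriver in Python rather than recorded and exported into JMeter/BlazeMeter as in our actual runs. The URLs, form-field locators, and think times are assumptions:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

BASE = "https://www.example.org"  # placeholder host, not the real site

driver = webdriver.Chrome()
try:
    for path in ["/", "/visit", "/learn"]:      # anonymous browsing segment (assumed URLs)
        driver.get(BASE + path)
        time.sleep(random.uniform(2, 6))        # think time between page loads
    driver.get(BASE + "/user/login")            # authenticated segment begins
    driver.find_element(By.NAME, "name").send_keys("loadtest-user")      # assumed locators
    driver.find_element(By.NAME, "pass").send_keys("loadtest-password")
    driver.find_element(By.ID, "edit-submit").click()
    time.sleep(random.uniform(2, 6))
    driver.delete_all_cookies()                 # cookie policy: clear cookies after the run
finally:
    driver.quit()
```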

Figure 5: workflow for defining a load test scenario

One of the sample use cases that has been utilized in load testing is shown in Figure 6. This use case is a combination of anonymous and authenticated traffic and runs through a complete visitor interaction scenario. While the use case diagram in Figure 6 shows the flow of interaction with the website from the viewpoint of the real/simulated visitor, the same use case, when seen from the perspective of the load testing framework, can be visualized as shown in Figure 7. As can be seen in Figure 7, several randomized and auto-generated components are used to model a realistic population of actual users. Such randomized and auto-generated components are introduced especially in the authenticated sections of the use cases, and include user-specific details (e.g., names, emails, passwords, addresses) as well as random delays (e.g., the delay between loading a Web form and submitting it) to simulate the random variations that would be expected with real users. In addition, other webpage fields/options populated by the simulated users are also randomized, such as the membership level being purchased, individual versus group memberships, etc. These randomization mechanisms ensure that the simulated pool of users provides characteristics representative of actual users so as to load-test the website under essentially the actual operating conditions. Also, it is important to note that it is the mix of different types of traffic (anonymous and authenticated, traffic to different URLs, etc.) that makes the simulated load sufficiently representative of actual user traffic. A large number of simulated users that all just load a static page would, at best, just test the DDOS protection system and would not hit the Web servers at all. Hence, modeling the use cases with sufficient variety in types of traffic and with complete interaction sequences that mimic real users is crucial to ensure that any observations drawn from the load testing results are valid for the actual Web server infrastructure and real traffic.
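
A minimal sketch of how such randomized components might be generated is shown below; the field names, membership levels, group-purchase ratio, and delay ranges are illustrative assumptions, not the actual values used in our scripts:

```python
import random
import string
import time

MEMBERSHIP_LEVELS = ["individual", "dual", "family"]  # assumed option values

def random_user():
    """Auto-generate user-specific details for one simulated visitor."""
    uid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return {
        "name": f"loadtest-{uid}",
        "email": f"loadtest-{uid}@example.org",  # placeholder domain
        "password": "".join(random.choices(string.ascii_letters + string.digits, k=12)),
        "membership_level": random.choice(MEMBERSHIP_LEVELS),
        "group_membership": random.random() < 0.2,  # assumed 20% group purchases
    }

def think_time(low=1.0, high=8.0):
    """Random delay, e.g., between loading a Web form and submitting it."""
    time.sleep(random.uniform(low, high))
```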

Figure 6: sample use case for load testing. This use case runs through a sequence of interactions of a visitor with the website and includes both anonymous and authenticated sections. The authenticated section models the purchase of a membership.

 

Figure 7: modeling of a sample use case in the load testing framework. This diagram shows the same use case as in Figure 6. However, while Figure 6 shows the use case from the viewpoint of the real/simulated user, this diagram shows the use case from the viewpoint of the load testing framework.

Based on the use cases modeled as described above, a suite of user simulation and browser automation technologies (https://www.blazemeter.com; https://www.seleniumhq.org; https://jmeter.apache.org) is utilized to realistically mimic different levels of traffic. The number of concurrent users is usually defined over a period of time starting from an initial value, ramping up to a peak level, and then ramping down. In this context, it is important to note that the notion of “concurrent” users is different from just “online” users. While multiple users who have opened the front page of a website can all be considered “online” users at that point in time, they are not necessarily generating any traffic to the website unless they are interacting with the page or navigating to other pages (apart from, e.g., periodic page refreshes by an embedded script). “Concurrent” users, on the other hand, are, in this context, defined as users that are actively interacting with the website (e.g., by executing a sequence of interactions as in the use case in Figure 6). A sample time profile of the number of concurrent users is shown in Figure 8. The expected traffic spike levels (or somewhat higher levels, to provide an assurance margin) are utilized to define the parameters in this time profile. Under these simulated traffic conditions, analytics of website performance are collected using the techniques discussed before (the server-side and client-side analytics). It is to be noted that the simulated traffic conditions are completely real as far as the Web server infrastructure is concerned, since they model actual complete visitor flows. The only difference from real users is that the traffic is generated by large numbers of concurrent automated browser sessions. One particularly relevant analytic is the variation of latency (response time) as a function of the number of concurrent users. For example, the latencies measured while running the load test with the number of concurrent users varying as shown in Figure 8 are illustrated in Figure 9.
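
The trapezoidal profile of Figure 8 can be captured by a simple function mapping elapsed time to a target number of concurrent users, which the load generator then tracks. Below is a minimal sketch; the durations and peak level are illustrative assumptions:

```python
def concurrent_users(t, ramp_up=600, hold=3600, ramp_down=600, peak=2000):
    """Target number of concurrent users at t seconds into the test."""
    if t < ramp_up:                                 # ramp-up stage
        return int(peak * t / ramp_up)
    if t < ramp_up + hold:                          # sustained-load stage
        return peak
    if t < ramp_up + hold + ramp_down:              # ramp-down stage
        return int(peak * (ramp_up + hold + ramp_down - t) / ramp_down)
    return 0

# Example: print the profile sampled every 10 minutes.
for t in range(0, 4801, 600):
    print(f"t={t:5d}s  users={concurrent_users(t)}")
```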

Figure 8: A sample time profile of number of concurrent users during load testing showing ramp up, sustained load, and ramp down stages.

Figure 9: variation of latency during the load test with the number of concurrent users varying as shown in Figure 8. The initial reduction in latency while the number of concurrent users is being increased is due to a transient while the caches are “warming up.” After that transient, it is seen that the latency remains approximately constant while ramping up the number of concurrent users and during sustained load, indicating that the configured Web server infrastructure is successful in handling the defined traffic levels.

The load testing is performed over significant time intervals to uncover subtle issues that might manifest only under sustained load. Based on the quantitative analytics (such as the response time and server load) measured during load testing, a determination is made as to whether the observed performance is acceptable. If indicated by this analysis, relevant parts of the server infrastructure are upsized and the entire process is repeated to determine the incremental benefit of upsizing those parts of the infrastructure. This iterative process is repeated until a “good” trade-off of website performance and cost/complexity is attained. The criteria for deeming a trade-off “good” depend, of course, on the specific website and the required operational guarantees. Finally, the last step in this load testing process is to modify the production infrastructure to the infrastructure configuration that was identified during load testing as the “good” trade-off.
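
The acceptability determination is driven by quantitative thresholds on the measured analytics. A minimal sketch of such a check is shown below, assuming per-request latencies exported to a CSV file; the file layout and the 95th-percentile threshold are illustrative assumptions:

```python
import csv
import statistics

# Load per-request latencies from an assumed results export.
with open("loadtest_results.csv") as f:
    latencies_ms = [float(row["latency_ms"]) for row in csv.DictReader(f)]

# statistics.quantiles with n=100 returns the 99 percentile cut points.
q = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = q[49], q[94], q[98]
print(f"p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")

P95_THRESHOLD_MS = 2000  # assumed acceptance criterion: 95% of requests under 2 s
if p95 > P95_THRESHOLD_MS:
    print("Performance not acceptable: upsize the relevant infrastructure and re-test.")
```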

As a final remark, it is to be noted that while the techniques above have proven to be a robust approach to seamlessly meeting high-spike conditions, a precaution that we continue to follow is to keep the “human components” online during the high-spike period. We ensure that all relevant personnel, including third-party support contractors, are “standing by” during the high-spike time. Even with all the preparation described in this paper, small issues could (and sometimes do) crop up in real time. For example, a single errant module (e.g., in the content management system) with a slightly askew caching policy could unsustainably increase server load, necessitating a real-time hot fix that could be quick, but only if the relevant personnel are available.

Applying the set of strategies described in this paper has proven to be a time-tested and robust policy for the 9/11 Memorial and Museum to meet the huge spike in Web traffic on September 11th of each year. These same techniques have also proven effective for purposes other than preparation for the traffic spikes expected on 9/11 and during other significant events. For example, we utilize these same techniques during significant code and module upgrades on the website, with version upgrades of the content management system being particularly important to test extensively before launch. For these purposes, we follow the same strategies of setting up clone server infrastructures, modeling use cases, and simulating realistic traffic as described in this paper, so as to identify and fix any bugs/issues before the code/module upgrades are pushed to the production website.

Conclusion/summary

In this paper, we discussed the importance of load testing to assure website performance, and we presented techniques for load testing, with particular focus on the handling of traffic spikes. It was noted that while smooth functioning of an organization’s website is key to meeting the organization’s mission, a challenging wrinkle in Web server provisioning is posed by high spikes in traffic due to specific unique events. A particularly dramatic instance of this challenge is faced annually by the 9/11 Memorial and Museum on September 11th, when website traffic increases to close to 35 times the typical level. This paper provided an overview of the strategies utilized in the annual run-up to this day, including tools to model expected traffic patterns, set up surrogate server infrastructure to closely mimic website configuration for the future high-spike time, create realistic mock traffic to mimic the spikes, and monitor website performance under sustained load.

Acknowledgements

This work would not have been possible without the support of the IT Software and Digital team at the 9/11 Memorial and Museum and the support of the vendors who have assisted us during the execution and analysis of the load tests. I especially want to thank Ms. Nancy Morrissey (Senior Vice President of Digital Products and Technology), who has always believed in me, has been very supportive of my ideas, and has encouraged me to pursue them. I would also like to thank Rui Yang (developer from my team) for his assistance with formulation of the use cases and analysis of the data sets from the load testing of the website. I would also like to thank the reviewers for their very constructive and insightful comments, which helped with fine-tuning the final version of this paper. Finally, I would like to thank my husband Prashanth Krishnamurthy for always encouraging me to pursue my goals and always being a great motivator and mentor to me. I will end by thanking my toddler son Daksh for being so understanding when Mommy has been busy typing.

References

Bhatti, S. & R. Kumari. (2015). “Comparative study of load testing tools.” International Journal of Innovative Research in Computer and Communication Engineering, vol. 3, issue 3. Last updated March 2015. Consulted February 24, 2018. Available https://www.ijircce.com/upload/2015/march/181_52_Comparative.pdf

BlazeMeter. (2018). “World famous museum & research site achieves high performance for 4,000,000 users.” BlazeMeter case study of California Academy of Sciences. Consulted February 24, 2018. Available https://www.blazemeter.com/case-study/cal-academy

Draheim, D., J. Grundy, J. Hosking, C. Lutteroth, & G. Weber. (2006). “Realistic load testing of Web applications.” In Proceedings of the 10th European Conference on Software Maintenance and Reengineering. Consulted February 24, 2018. Available https://www.cs.auckland.ac.nz/~christof/publications/DraheimEtAl2006-LoadTesting.pdf

Kattepur, A. & M. Nambiar. (2015). “Performance modeling of multi-tiered web applications with varying service demands.” In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW). Consulted February 24, 2018. Available https://hal.archives-ouvertes.fr/hal-01118352/document

Shojaee, A., N. Agheli, & B. Hosseini. (2015). “Cloud-based load testing method for web services with VMs management.” In Proceedings of the 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI). Consulted February 24, 2018. Available http://ieeexplore.ieee.org/document/7436040

