The Project

Project Requirements

Introduction

Lockheed Martin is a global security company that is principally engaged in the research, design, development, manufacture, integration and sustainment of advanced technology systems, products and services. As a global security and information technology company, the majority of Lockheed Martin's business is with the U.S. Department of Defense and U.S. federal government agencies. In fact, Lockheed Martin is the largest provider of IT services, systems integration, and training to the U.S. Government. The remaining portion of Lockheed Martin's business consists of international government sales and some commercial sales of its products, services and platforms.

The Ombudsmen have been asked to assist Lockheed Martin with a project inside the organization. We will develop a software solution to be integrated with their current software system. It will replace a system that is already in place, adding much-needed improvements based on Lockheed Martin’s specifications.

For this particular software solution we will be working with the internal side of Lockheed Martin, aiding them to more effectively manage their software system. Lockheed Martin approached our team with a project to improve and replace the current Resource Manager Ombudsman (RMO) running on their systems.

Problem Statement

The Resource Manager Ombudsman’s job is to manage the health, state, and availability of subsystems and processes outside of the application server. Currently, the RMO does not integrate well with the Java side of the system. Even a small change to the system requires substantial maintenance on both the RMO and Java sides to keep them communicating correctly. Because the system has been in place for a long time and has accumulated “tweaks” made on a problem-by-problem basis, without regard for the rest of the system, the code is very hard to read, understand, and manage. Only a handful of employees currently know enough about the system to work with it. Given these problems, our aim is to create a new RMO from the ground up. Working closely with Lockheed Martin, we will construct a stand-alone Java application, in Java 2 Enterprise Edition, that is easy to work with, dependable, and highly customizable for our customer’s needs.

As mentioned earlier, there is a current implementation already on the system. This implementation is old. A series of upgrades over time has led to architectural erosion, causing problems with logging and other key features of the system. Personnel currently have difficulty interacting with the system, and few actually understand how it works.

Unfortunately, due to the nature of Lockheed Martin and its work in sensitive areas, most of the information surrounding the RMO is classified, including the code used to implement the system we are replacing. Consequently, we will need to start from the ground up on a new solution. Our team will have to work closely with our sponsor to create a portable, readable system that is easy to interact with and that will replace the current implementation. Our solution will employ an effective logging system and a full set of unit tests that can be run automatically, and will meet all functional and environmental requirements as specified below.

Product Statement

Product Function Overview

A Resource Manager Ombudsman (RMO) is a process monitoring system that is intended to monitor several subsystems. The subsystems the RMO monitors should ordinarily be running normally, but should an error occur, a critical system failure is a very possible outcome. The RMO aims to prevent this from happening through close monitoring and adjusting of subsystems. With an RMO monitoring subsystems, any failure can be properly managed and taken care of, ideally before bigger problems arise.

The goal of the RMO service is to control and monitor a set of processes, their availability, and the service’s overall health. To accomplish this there will be a central application that will act as a controller, sending out subsystem services that will be able to monitor a set of processes even if they reside on different servers. The containers will keep track of the overall process health of the subsystem and report this information back to the main RMO application for analysis and possible action. Based off the information provided by the subsystem, the RMO can give commands back to it to ensure it remains healthy. While all processes may be running as desired, this does not ensure the RMO is properly communicating with the subsystems as there may be other unpredictable complications.

The functional specifics of the system are discussed in greater detail in the Functional Requirements section below.

Environment and Interoperability Considerations

Our implementation will need to take into consideration potentially harmful problems such as network availability, excessively large process queues, and partial system failures. Subsystems reporting to a main controlling system allow the RMO to diagnose or restart specific subsystems should something go wrong. A benefit resulting from this is that the main system should be able to continue operation as normal through a subsystem failure and potentially fix the problem itself without any human interaction. The system will require several key features to accomplish this functionality:

- RMO Service Registration
- RMO Service Notification Exchanges
- Accurate Subsystem Process Monitoring
- Subsystem Activity Logs

In order to act as a central controller, the main RMO application will monitor currently active RMO services and their status and health. Each new RMO service will register itself with the main application server for monitoring when it has been successfully deployed. The RMO services themselves will monitor a subsystem of processes they create to ensure desired operation and ultimately the success of the subsystem’s overall task. The RMO services will be able to exchange different kinds of notifications with the main RMO application, as well as among their own processes. This will allow the main application and each RMO service to properly manage their processes and tasks, which will help to prevent critical problems and failures.
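The registration and monitoring relationship described above could be sketched roughly as follows. This is only an illustration: the class name `ServiceRegistry`, the use of a last-report timestamp, and the timeout rule for deciding availability are all assumptions made for the example, not part of the customer's specification.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry kept by the main RMO application (names are illustrative). */
final class ServiceRegistry {
    // Maps a service identifier to the time (in milliseconds) of its last report.
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    ServiceRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by a service once it has been deployed; returns false if it was already registered. */
    boolean register(String serviceId, long nowMillis) {
        return lastSeenMillis.putIfAbsent(serviceId, nowMillis) == null;
    }

    /** Records a status notification; returns false if the service never registered. */
    boolean heartbeat(String serviceId, long nowMillis) {
        return lastSeenMillis.replace(serviceId, nowMillis) != null;
    }

    /** Assumed rule: a service counts as available if it reported within the timeout. */
    boolean isAvailable(String serviceId, long nowMillis) {
        Long seen = lastSeenMillis.get(serviceId);
        return seen != null && nowMillis - seen <= timeoutMillis;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic and easy to unit test; a real controller would more likely read a clock itself.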

The environment and constraints are discussed in more detail in the Constraints and Feasibility Issues section below.

Functional Requirements

Overview

As the Product Statement explains in more detail, this product must be able to start up, monitor, and shut down a set of XML-specified processes collectively referred to as a “subsystem”. From the Linux command line, a user must be able to start a Resource Manager Ombudsman process, passing in such an XML file, and receive feedback about process status via updates to a database.
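To make the idea of an XML-specified subsystem concrete, the following sketch parses a small subsystem definition with the JDK's built-in DOM parser. The actual file format is not available to us, so the element and attribute names used here (`subsystem`, `process`, `name`, `command`, `transient`) are purely hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** One process entry from a hypothetical subsystem definition. */
final class ProcessSpec {
    final String name;
    final String command;
    final boolean transientProcess; // transient processes are not monitored

    ProcessSpec(String name, String command, boolean transientProcess) {
        this.name = name;
        this.command = command;
        this.transientProcess = transientProcess;
    }
}

final class SubsystemConfig {
    final String subsystemName;
    final List<ProcessSpec> processes;

    SubsystemConfig(String subsystemName, List<ProcessSpec> processes) {
        this.subsystemName = subsystemName;
        this.processes = processes;
    }

    /** Parses the assumed format; malformed XML is reported as IllegalArgumentException. */
    static SubsystemConfig parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            NodeList nodes = root.getElementsByTagName("process");
            List<ProcessSpec> specs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                // A missing "transient" attribute yields "", which parses as false.
                specs.add(new ProcessSpec(
                        e.getAttribute("name"),
                        e.getAttribute("command"),
                        Boolean.parseBoolean(e.getAttribute("transient"))));
            }
            return new SubsystemConfig(root.getAttribute("name"), specs);
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid subsystem definition", e);
        }
    }

    /** The processes the RMO would monitor: everything not marked transient. */
    List<ProcessSpec> monitored() {
        List<ProcessSpec> result = new ArrayList<>();
        for (ProcessSpec p : processes) {
            if (!p.transientProcess) {
                result.add(p);
            }
        }
        return result;
    }
}
```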

Internally, a given Resource Manager Ombudsman process must keep track of the health of its processes. This information will be picked up periodically as messages sent from the processes themselves. From beginning to end, a Resource Manager Ombudsman must start up its processes, listen to signals from its processes and the Notification Topic, update the relevant system accordingly (see below for more details), and terminate on request by the user.

There are currently no explicit requirements concerning the longevity of this product. It is assumed that the product will be continuously in use over long periods of time, but because details of its implementation are classified, we have no way of knowing for sure. Nevertheless, our design will need to assume constant use. Similarly, the level of active involvement with an operator is unknown, though logging and database updates should allow continuous monitoring of processes by a human at all times.

Environmental concerns are discussed in more detail in the Constraints and Feasibility Issues section below. All customer needs are enumerated below. The only loosely specified customer request is that as little as possible should be hard-coded. Any connection (queue, topic, application server, port, etc.) should be configurable or passed in. This product is intended to be a more compatible, elegant replacement for an existing system, and as such, should functionally match its predecessor.
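One common way to avoid hard-coded connections is to read them from a properties file at start-up. The sketch below shows that approach under stated assumptions: the key names (`rmo.broker.host`, `rmo.broker.port`, `rmo.queue`, `rmo.notification.topic`) and the class name are invented for illustration and were not specified by the customer.

```java
import java.util.Properties;

/** Connection settings supplied at start-up instead of being hard-coded (key names are hypothetical). */
final class ConnectionSettings {
    final String brokerHost;
    final int brokerPort;
    final String queueName;
    final String topicName;

    private ConnectionSettings(String brokerHost, int brokerPort, String queueName, String topicName) {
        this.brokerHost = brokerHost;
        this.brokerPort = brokerPort;
        this.queueName = queueName;
        this.topicName = topicName;
    }

    /** Builds settings from properties; a missing or malformed value fails fast. */
    static ConnectionSettings fromProperties(Properties p) {
        return new ConnectionSettings(
                require(p, "rmo.broker.host"),
                Integer.parseInt(require(p, "rmo.broker.port")),
                require(p, "rmo.queue"),
                require(p, "rmo.notification.topic"));
    }

    private static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing setting: " + key);
        }
        return value.trim();
    }
}
```

Failing at start-up when a setting is absent is a design choice for the sketch; falling back to defaults would be an equally valid policy if the customer prefers it.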

Specification

1. Process Management
• 1.1. An RMO shall be able to start and stop subsystems and processes.
• 1.2. An RMO shall monitor all processes that are not marked as “transient”.
• 1.3. An RMO shall handle Linux signals (SIGCHLD, kill -9, kill -TERM, etc.) appropriately.
§ 1.3.1. Upon receiving the signal kill -9, the RMO shall tell its child process to exit immediately.
§ 1.3.2. Upon receiving the signal kill -TERM, the RMO shall wait for its child process to finish its tasks, then instruct its child process to exit, then exit itself.
§ 1.3.3. Upon receiving the signal SIGCHLD, the RMO shall update its health and report the death of its child process via the notification service.
• 1.4. An RMO shall listen to the “Notification Topic”.
§ 1.4.1. If the subsystem of a notification matches that of the listening RMO (i.e., the notification was sent by one of the RMO’s children), the RMO shall update its health based on the severity of the notification.
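The process-management requirements above can be approximated in plain Java, as the following sketch suggests. It makes several assumptions. The `Severity` and `Health` values and their mapping are invented. `Process.onExit()` (Java 9 or later) stands in for a child-death signal handler, and a JVM shutdown hook stands in for handling kill -TERM. The sketch also sends the termination request first and then waits, which simplifies 1.3.2. Note that kill -9 cannot be caught by any process, so the sketch does not attempt 1.3.1.

```java
import java.io.IOException;
import java.util.List;

enum Severity { INFO, WARNING, ERROR }       // hypothetical notification severities
enum Health { HEALTHY, DEGRADED, FAILED }    // hypothetical health states, ordered best to worst

final class RmoHealth {
    private final String subsystem;
    private Health health = Health.HEALTHY;

    RmoHealth(String subsystem) {
        this.subsystem = subsystem;
    }

    synchronized Health current() {
        return health;
    }

    /** 1.4.1: only notifications from this RMO's own subsystem affect its health. */
    synchronized void onNotification(String fromSubsystem, Severity severity) {
        if (!subsystem.equals(fromSubsystem)) {
            return;
        }
        Health reported = severity == Severity.ERROR ? Health.FAILED
                : severity == Severity.WARNING ? Health.DEGRADED
                : Health.HEALTHY;
        // In this sketch health only worsens; a recovery policy is left open.
        if (reported.compareTo(health) > 0) {
            health = reported;
        }
    }

    /** 1.3.3: the death of a child is reflected in the RMO's health (assumed to mean FAILED). */
    synchronized void onChildExit() {
        health = Health.FAILED;
    }
}

final class ChildSupervisor {
    private final RmoHealth health;
    private Process child;

    ChildSupervisor(RmoHealth health) {
        this.health = health;
    }

    /** Starts the child and registers a callback that runs when the child exits. */
    synchronized void start(List<String> command) throws IOException {
        child = new ProcessBuilder(command).start();
        child.onExit().thenAccept(p -> health.onChildExit());
    }

    /** Asks the child to stop (SIGTERM on Linux) and waits for it to finish. */
    synchronized void stopGracefully() throws InterruptedException {
        if (child != null && child.isAlive()) {
            child.destroy();
            child.waitFor();
        }
    }

    /** Runs stopGracefully when the JVM itself is asked to terminate. */
    void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                stopGracefully();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```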

2. Asynchronous Messaging
• 2.1. Open Message Queue (OpenMQ) shall be used for all asynchronous messaging.
• 2.2. All messages shall be in XML format.
• 2.3. An RMO shall be able to start without the OSB running.
§ 2.3.1. If the OSB crashes, it shall not cause an RMO to crash, though no such guarantee is placed on the children of an RMO.
• 2.4. When offline, an RMO shall listen to the “queue” of every rail in its environment.
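Requirement 2.3 implies that a failed broker connection must not be fatal to the RMO. One way to approach this is a bounded retry loop, sketched below. The `Callable` stands in for whatever call actually opens the messaging connection; the class name, attempt limit, and delay are assumptions made for illustration only.

```java
import java.util.concurrent.Callable;

final class BrokerRetry {
    /**
     * Tries to connect up to maxAttempts times, pausing delayMillis between failures.
     * Returns the number of attempts used, or -1 if every attempt failed or the
     * thread was interrupted. A failure never propagates out of this method.
     */
    static int connectWithRetry(Callable<Void> connect, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                connect.call();
                return attempt;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            } catch (Exception e) {
                // Broker unreachable: the RMO keeps running and tries again later.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

In a long-running RMO this loop would more plausibly run on a background thread so that start-up is not blocked while the application server is down.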

3. Version Management
• 3.1. An RMO shall be able to run on different versions of the client’s software in the same environment.
• 3.2. When an RMO receives a start-up request from a different software version than the one it is currently running in, the RMO shall restart itself in the new version.
§ 3.2.1. An exec shall be done to keep the same PID number.
§ 3.2.2. The LD_LIBRARY_PATH shall be updated to the correct path for the new version.
• 3.3. Once a new software version has been started, the RMO shall read in the received “startup message” and startup as requested.
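The version switch in 3.2 can be partially illustrated with `ProcessBuilder`, which allows the environment of the relaunched program to be adjusted. The directory layout, the `bin/rmo` launcher, and the `--startup-message` flag below are hypothetical. Note also that the Java standard library cannot replace the running process image, so this sketch starts a new process with a new PID; keeping the same PID (3.2.1) would require something outside pure Java, such as a launcher script that uses the shell's `exec`.

```java
final class VersionRelauncher {
    /**
     * Builds (but does not start) the command that would relaunch the RMO under
     * another installed software version. All paths and flags are illustrative.
     */
    static ProcessBuilder relaunchCommand(String installRoot, String version, String startupMessagePath) {
        String versionHome = installRoot + "/" + version;
        ProcessBuilder pb = new ProcessBuilder(
                versionHome + "/bin/rmo", "--startup-message", startupMessagePath);
        // 3.2.2: point the native library path at the new version.
        pb.environment().put("LD_LIBRARY_PATH", versionHome + "/lib");
        // Share stdin, stdout, and stderr with the current process.
        pb.inheritIO();
        return pb;
    }
}
```

Passing the path of the received startup message along to the new version is one way to satisfy 3.3, since the relaunched RMO can then read the message and start up as requested.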

4. Log Management
• 4.1. An RMO shall redirect the standard out and standard error of its child processes to its own log.
• 4.2. Logs shall be “rolled” through java.util.logging with a properties file to ensure log files do not grow indefinitely.
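A minimal sketch of both logging requirements follows. The configuration keys are the standard `java.util.logging.FileHandler` properties, given inline here using properties-file syntax rather than read from a file on disk. The size limit, file count, file name pattern, and logger name are assumptions chosen for illustration. The sketch also assumes Linux-style paths, since backslashes would be treated as escapes in properties syntax.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;
import java.util.logging.Logger;

final class RmoLogging {
    /** 4.2: installs a size-limited, rotating file log under logDir. */
    static Logger configure(String logDir) {
        String props = String.join("\n",
                "handlers=java.util.logging.FileHandler",
                ".level=INFO",
                // %g is the generation number that distinguishes rotated files.
                "java.util.logging.FileHandler.pattern=" + logDir + "/rmo-%g.log",
                "java.util.logging.FileHandler.limit=1048576",
                "java.util.logging.FileHandler.count=5",
                "java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter");
        try {
            LogManager.getLogManager().readConfiguration(
                    new ByteArrayInputStream(props.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Logger.getLogger("rmo");
    }

    /** 4.1: copies a child's output stream, line by line, into the RMO's own log. */
    static Thread pipeToLog(InputStream childOutput, Logger log) {
        Thread t = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(childOutput, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    log.info(line);
                }
            } catch (IOException e) {
                log.warning("child output closed unexpectedly: " + e.getMessage());
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

To capture both streams of a child through one pipe, the child could be started with `ProcessBuilder.redirectErrorStream(true)` and its `getInputStream()` passed to `pipeToLog`.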

5. Test Cases
• 5.1. A full set of unit tests shall be provided that can be run automatically (no environment setup required).
• 5.2. There shall be an “Integration Tests Plan”.
§ 5.2.1. Documentation shall be provided conveying what kinds of tests should be run to verify the RMO is properly integrated with the system.

Constraints and Feasibility Issues

Constraints

The implementation environment for this system is required to be Java Enterprise Edition (Java EE). This platform contains libraries providing functionality not available in the standard Java platform, easing the task of server programming. Although we expect the API provided by Java EE to be quite helpful, the accumulated experience that we ombudsmen have with this technology amounts to zero. To prevent future frustration and setbacks during the implementation phase, we will be required to become as familiar as possible as soon as possible with the Java EE platform.

Note on the OSB (requirement 2.3): OSB is literally “Oracle Service Bus”, but here it refers to the Application Server as a whole.

Note on log redirection (requirement 4.1): e.g., an RMO starts a Perl script, then captures and redirects its logging.

The operating system that this system will be deployed on is Red Hat Linux version 5.5. Red Hat is not free, but many other Linux distributions are. A free version of Linux may be used for development of this system, provided that its Java EE environment is the latest stable version. A key milestone for our project will be the setting up of a Linux server for development of this system.

The system will require a database for logging purposes. The database software used by the customer is Oracle. Oracle will not be available to us, but it is acceptable for our system to use free database software in its place.

Feasibility Issues

The processes which this system will manage upon deployment are classified. The system to be replaced is classified. On the customer side, the one person who completely knows what they want from our new system is unavailable. These issues greatly affect the feasibility of this project. To overcome them, there are several things that can be done. For the classified processes, the customer will provide Java interfaces and/or non-classified processes for testing purposes. If interfaces are provided, they will define how the system should interact with the processes it manages, and should match those interactions that the system will have with the classified processes after deployment. Since the source code for the system to be replaced is classified, higher level representations of its architecture will be provided. These representations will be thorough enough such that new working source code could be created from them. As for the availability issue of the one person who knows exactly what is to be expected from our product, either this person will make him or herself available, or our liaison will educate himself on the topic to the point that it is crystal clear what is to be expected from us.

Project Execution Plan

Following the finalization of the requirements document by February 7, the team will move on to producing a formalized design specification from February 9 (after the initial presentation) to February 22. From February 23 to April 5 the team will engage in the bulk of the implementation phase, with testing being the primary focus from April 6 to April 29 (the final presentation).

In order to achieve our stated goals for this project, we will first need to set up our development environment to simulate the deployment environment of the final product. In order to go about actually implementing the system, we will need testing interfaces from our sponsor. Once we have a firm understanding of all the interactions by and with an RMO, we will begin implementation. The goal is to have the development environment fully functional no later than February 11. Having the development environment will ideally give us a better understanding of the architecture required to successfully implement the redesigned RMO system.

The most basic functionality of the RMO is the starting up and shutting down of processes via command-line arguments and Linux signals. Once we have the aforementioned test interfaces, we should be able to start up and shut down processes as needed. It should only take a few days to get that working following completion of the design specification.

A critical part of the RMO specification is reporting on the health of its subsystem via logging. Being able to update health and redirect process information to the log will be the next fundamental step in the process. This should take approximately a week to complete by initial estimates.

After that, the asynchronous communication with OpenMQ will be the next step. At this point, we’ll need to do away with the hard-coded test harness we intend to use to test functionality in the earlier stages, and actually test the asynchronous communication between application server, RMO, and its child processes. This should take about two weeks to fully get working, but we will err on the side of caution and expect three weeks.

Finally, using the fully developed RMO, we will need to tackle version management. This will likely not be a trivial problem, but we intend to research it well enough such that when it comes time to implement it, we’ll know what we need to do to make it happen. We’ll give it the rest of the time until the second major presentation—about one and a half weeks.

After implementation wraps up, we’ll finalize the Integration Tests Plan, and ensure the system will work when deployed for the client. If there are any lingering implementation concerns yet to be resolved, we can finish them up at this time. There’s over three weeks of time available in this window, so any slippage can be accounted for here as required.

We expect there to be slippage in the initial stage of just getting the RMO to start up and shut down processes and report back on its health. We’ll try to get started on this as soon as possible to mitigate any potential pitfalls in this area. We also anticipate that implementing the version management could be a non-trivial task requiring special attention. The asynchronous communication and integration with the various technologies will need to be the bulk of our effort, so we will plan accordingly to ensure that we have ample time to tackle that portion of the project. Two weeks should be enough time provided we’ve done our homework on OpenMQ, GlassFish, and Java Enterprise Edition, but extra time is available in case we need it towards the end.

Our Final Solution

Problem and Solution Statement

Lockheed Martin is a self-described “global security company” specializing in the research and development of advanced military technology. The majority of their business is done with the United States Department of Defense and other federal agencies. Steven Koechle, an employee at the company’s Phoenix office, approached NAU with the task of reimplementing an existing software system, known as the “Resource Manager Ombudsman” (RMO), from the ground up. The classified nature of Lockheed Martin’s business has left the team unaware of the ultimate purpose of the RMO within the larger context of their work. That said, it has been made clear that the existing system is antiquated, having been developed incrementally over time without a solid design foundation.

The Resource Manager Ombudsman’s job is to manage the health, state, and availability of subsystems and processes interacting with an application server. What this translates to in practical terms is that Lockheed Martin has some hardware processing nodes responsible for running a specific collection of processes (a “subsystem”) in a certain environment (defined by an application server running on different hardware). In order to task out these processing nodes from one location, a human operator sits behind a Human-Machine Interface (HMI), queries a database for subsystem availability, and starts up subsystems directly (see Figure 1). For the purposes of the illustrations below, we delineate the physical locations of various pieces of the system with “walls” in a building. Processing nodes are given arbitrary identifiers (PBJ1, BLT1, REUBEN1).



The purpose of the above diagram is not to give the reader a clear understanding of the existing system, but rather to illustrate the complicated network of interactions existing between the varying components. To simplify interactions, dependencies, and maintenance concerns, the client asked the team to redesign the system, creating a web service running on the application server (the “RMO Service”) responsible for acting as a centralized interface for communication between all major components of the system (HMI, RMO, and database), as illustrated in Figure 2. This redesign will help reduce maintenance costs by significantly reducing coupling between components such as the HMI and the RMO, and also makes the system itself more comprehensible, with fewer individual requests being propagated along the chain.

Architecture and Implementation

The implementation of this system performed by the Capstone team is divided primarily into two functional components, the RMO and the RMO Service, and two connecting components, the RMO Queue and SOAP. The architectural diagram in Figure 4 helps illustrate this in visual terms, but a more detailed prose summary follows.



Figure 4 - Architecture Overview


Java EE enforces a general “three-layer” architectural pattern on the implementer. At the top layer exist the client applications: the RMO, the HMI, and the Processes instantiated by the RMO.

In order for the RMO to update the database about its status, or for the HMI to query the database for subsystem availability, they must go through the RMO Service, invoking web methods defined by the WSDL file. The actual transfer of information between the components (e.g., HMI requests subsystem status, RMO Service returns status) is carried by SOAP at runtime.

The RMO Service exists as a web service running on the application server. Though there may be many RMO Services (due to the existence of any number of application servers) there will only be one database—containing the state of the whole system—with which they all interact.

When the HMI wishes to start up or shut down a particular subsystem (i.e., attach it to or detach it from a particular application server), it first invokes a method on the RMO Service in the same way as before: through SOAP. The RMO Service will then load up its unique XML startup/shutdown message and ship it out to the RMO Queue unique to a particular application server-RMO connection. As stated above, the RMO Queue in this context is not a “first-in-first-out” data structure, but rather a staging area for messages to be pushed and consumed asynchronously. The only guarantee made by the RMO Queue is that if a message is consumed, it will be consumed once and only once.

The RMO listens on each RMO Queue connecting to it. If it is attached to a particular application server, it ignores any messages sent by other application servers until it detaches itself, which happens once it has finished its processing task. As a workaround to the problem of Linux signal handling in Java discussed in the Constraints section above, the Processes started by an RMO report their activity along the same RMO Queue used by the application server. Consequently, the RMO only needs to listen to one location for relevant messages from anything in the system while it is attached to a particular application server.
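The attach and detach behavior described above amounts to a small state machine, which the following sketch tries to capture without any messaging library. The message fields and the type names (`STARTUP`, `SHUTDOWN`, `ACTIVITY`) are assumptions made for the example; the real messages are XML documents whose format is not reproduced here.

```java
/** Simplified stand-in for a message taken from an RMO Queue (fields are hypothetical). */
final class QueueMessage {
    final String appServer; // application server whose queue carried the message
    final String type;      // assumed types: "STARTUP", "SHUTDOWN", "ACTIVITY"

    QueueMessage(String appServer, String type) {
        this.appServer = appServer;
        this.type = type;
    }
}

final class AttachmentFilter {
    private String attachedTo; // null while the RMO is detached

    synchronized String attachedTo() {
        return attachedTo;
    }

    /** Returns true if the RMO should act on the message, updating the attachment as needed. */
    synchronized boolean accept(QueueMessage m) {
        if (attachedTo == null) {
            // While detached, only a start-up request can attach the RMO.
            if ("STARTUP".equals(m.type)) {
                attachedTo = m.appServer;
                return true;
            }
            return false;
        }
        if (!attachedTo.equals(m.appServer)) {
            return false; // ignore other application servers while attached
        }
        if ("SHUTDOWN".equals(m.type)) {
            attachedTo = null; // detach once the task has ended
        }
        return true;
    }
}
```

Because child processes report along the same queue as the application server, their activity messages pass through the same filter and need no separate listener.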

Usability Testing and Future Work

Initial informal testing was accomplished using NetBeans’ auto-generated web service test page. Using this page, the team was able to test the public “web methods”—those specified by the WSDL file— exposed by the RMO Service to the outside world. At the level of the RMO itself, the team first began by using basic printouts and log files; this was to ensure that messages were properly being received from the RMO Queue, and that the RMO was in turn taking the appropriate actions in response to these messages.

Functionality tested by the RMO Service test page included:
• Starting and stopping subsystems
• Killing individual processes
• Attaching and detaching an RMO to and from an application server
• Passing activity reports to the application server

The team also conducted formal testing with Java’s JUnit testing framework. Whereas the auto-generated test page allowed the team to simulate the HMI by interactively manipulating the system, the JUnit tester replaces the HMI for testing purposes by invoking a specific set of methods on the RMO Service and comparing the results with established “truth data”.

By informally testing the system from the start, the team was able to gain a measure of confidence in existing components before connecting them through JMS. This way, the feasibility of several of the client’s requests was assessed early on in the development process. For instance, signal handling in Java proved to be problematic, but this issue was addressed before work began on asynchronous communication.

While informal testing gave the team confidence in the system it had produced, formal testing helped demonstrate the correctness of the product to the client.

As described above, integration will be a two-phase process, beginning with the client personally adjusting the existing implementation so that it works within both the team’s environment and the Lockheed Martin environment. The second phase will take place after the Capstone project has finished, and will be handled personally by the client. For instance, the client suggested use of a MySQL database with the GlassFish application server, both free and open-source technologies. The client described conversion to Lockheed Martin’s Oracle and WebLogic equivalents as a simple process and elected to undertake it personally.

The major question of usability for this system is how easily the client can integrate it with their existing system, and how maintainable it will be going forward. The redesign, as envisioned by the client and refined through the team’s work on the project, emphasizes a centralized interface for component communication and integrates easily with the JMS technologies already in use.